Trends in Mathematical, Information and Data Sciences: A Tribute to Leandro Pardo [1 ed.] 9783031041365, 9783031041372, 3031041364

This book involves ideas/results from the topics of mathematical, information, and data sciences, in connection with the research areas to which Leandro Pardo has devoted his research endeavors.


English Pages 474 [450] Year 2022


Table of contents :
Preface
Contents
Trends in Mathematical Sciences
Using Taxes to Manage Energy Resources Related to Stock Pollutants: Resource Cartel versus Importers
1 Introduction
2 The Model
3 Analyzed Cases
3.1 Absence of Taxes (WT)
3.2 Taxes (T)
3.3 Numerical Example
4 Conclusions
References
A New Shapley Value-Based Rule for Distributing Delay Costs in Stochastic Projects
1 Introduction
2 The Problem
3 A New Shapley Value-Based Rule
4 Some Examples
5 Properties
References
Variogram Model Selection
1 Introduction
2 Robust Estimators of the Variogram
3 Acceptance of a Model and Variogram Model Comparison
4 Example
5 Conclusions and Future Works
References
On First Passage Times in Discrete Skeletons and Uniformized Versions of a Continuous-Time Markov Chain
1 Introduction
2 The Continuous-Time Process X Under a Taboo
2.1 Expected Sojourn Times for Uniformized Markov Chains
2.2 A Simple Stochastic Ordering Property
3 Discussion and Concluding Remarks
References
A Numerical Approximation of a Two-Dimensional Atherosclerosis Model
1 Introduction
2 2D Mathematical Model
3 Numerical Approximation
3.1 2D WENO Reconstruction
3.2 Numerical Results of the 2D Model
4 Conclusions
References
Limit Results for Lp Functionals of Weighted CUSUM Processes
1 Lp Functionals of Cumulative Sum Processes
2 Proofs
References
Generalized Models for Binary and Ordinal Responses
1 Introduction
2 Binary Response Models Based on φ-Divergence
3 Connection to Generalized Association Models
4 Generalized Regression Models for Ordinal Responses
5 Simple Effect Size Measures
6 Example
References
Approximations of δ-Record Probabilities in i.i.d. and Trend Models
1 Introduction and Notation
2 First Order Approximations for the δ-Record Probability
3 Correction Terms for the LDM
4 Conclusions
References
The Relative Strength Index (RSI) to Monitor GDP Variations. Comparing Regions or Countries from a New Perspective
1 Introduction
2 The Relative Strength Index (RSI): Key Concepts and the Computation of the RSI of GDP Variations for a Country or Region
2.1 Framework and Key Concepts
2.2 Detailed Example About Computing RS and RSI Values in the Case of GDP Variations for US in the Period 2001–2005
2.3 Uses of the RSI: Oversold and Overbought Zones
3 Use of the RSI to Compare the Evolution of Several Economies
3.1 Canada Versus Spain in the Period 2000Q1 to 2020Q2 Using the RSI for Comparing GDP Variations
3.2 Some Advantages of Using the RSI for Comparing GDP Variations
3.3 Developing Applications for Analyzing Macroeconomic Variations of Some Countries Using the RSI Approach
3.4 Incorporating the RSI Approach in Economic Sentiment Indicators Analysis
4 Concluding Remarks
References
Escape Probabilities from an Interval for Compound Poisson Processes with Drift
1 Introduction
1.1 General Properties of Escape Probabilities
2 Integral Equations for the Escape Probability
3 Severities with Rational Characteristic Function
References
A Note on the Notion of Informative Composite Density
1 Preliminaries
2 Comparison of Composite Densities
3 Examples
4 Conclusions
References
Trends in Information Sciences
Equivalence Tests for Multinomial Data Based on φ-Divergences
1 Introduction
2 Equivalence Tests
3 Simulation Results
4 Conclusion
References
Minimum Rényi Pseudodistance Estimators for Logistic Regression Models
1 Introduction
2 Minimum Rényi Pseudodistance Estimators
3 Asymptotic Distribution of the Minimum Rényi Pseudodistance Estimators
4 Confidence Intervals
5 Simulation Study
6 Conclusions and Future Work
References
Infinite–Dimensional Divergence Information Analysis
1 Introduction
2 Preliminary Definitions
3 Parametric Estimation Based on Kullback–Leibler Divergence Functional
4 Asymptotic Analysis
5 Final Comments
References
A Model Selection Criterion for Count Models Based on a Divergence Between Probability Generating Functions
1 Introduction
2 Model Selection Criterion
3 Numerical Experiments
3.1 Standard Hermite Versus Discrete Lindley
3.2 Poisson Versus Geometric
4 Conclusions and Topics for Further Research
References
On the Choice of the Optimal Tuning Parameter in Robust One-Shot Device Testing Analysis
1 Robust One-Shot Device Testing
2 Methods to Choose the ``Optimal'' Tuning Parameter
2.1 Iterative Warwick and Jones Algorithm (IWJ)
2.2 Other Methods
2.3 Choice of the ``Optimal Method''
3 Numerical Results
References
Optimal Spatial Prediction for Non-negative Spatial Processes Using a Phi-divergence Loss Function
1 Introduction
2 Decision-Theoretic Approach to Prediction
3 Decision-Theoretic Approach to Spatial Prediction of a Non-negative Spatial Process
4 Extensions to Spatial Processes Bounded from Below
5 Spatial Prediction of Zinc Pollution on a Floodplain of the Meuse River
6 Discussion and Conclusions
References
On Entropy Based Diversity Measures: Statistical Efficiency and Robustness Considerations
1 Introduction
2 Entropy Based Diversity Measures: A General Formulation and Examples
3 Statistical Estimation and Asymptotic Distribution
4 Numerical Illustrations: Comparative Performances
4.1 Asymptotic Efficiency
4.2 Finite-Sample Robustness
5 Application: Diversity of Covid-19 Deaths in USA
6 Conclusions
References
Statistical Distances in Goodness-of-fit
1 Introduction
2 Goodness of Fit Based on Distances
3 Simulation Results
4 Discussion and Conclusions
References
Phi-divergence Test Statistics Applied to Latent Class Models for Binary Data
1 Introduction and Basic Concepts
2 Goodness-of-Fit Tests
3 Nested Latent Class Models
4 An Example with Real Data
5 Conclusions
References
Cross-sectional Stochastic Frontier Parameter Estimator Using Kullback-Leibler Divergence
1 Introduction
1.1 Deterministic Frontier
1.2 Stochastic Frontier
2 Methods
2.1 Maximum Likelihood
2.2 Common Area
2.3 Kullback-Leibler Divergence
2.4 Skewness-Fixing Methods
3 Results
4 Conclusions
References
Clustering and Representation of Time Series. Application to Dissimilarities Based on Divergences
1 Introduction
2 Divergence-Based Dissimilarity Measures for Time Series
3 Simultaneous K-means Clustering and MDS Representation
4 Illustrative Application
5 Discussion
References
Trends in Data Sciences
Proportional Odds COM-Poisson Cure Rate Model with Gamma Frailty and Associated Inference and Application
1 Introduction
2 COM-Poisson Cure Rate Model
3 Data and the Likelihood
4 Estimation of Parameters
4.1 E-step
4.2 M-step
4.3 COM-Poisson Cure Rate Model with Gamma Frailty
4.4 Results for Some Special Cases
5 Observed Information Matrix
6 Empirical Study
7 Illustrative Analysis of Cutaneous Melanoma
8 Concluding Remarks
References
On Residual Analysis in the GMANOVA-MANOVA Model
1 Introduction
2 GMANOVA-MANOVA Model
3 Residuals
4 Properties of R1 and R2
4.1 Interpretation
4.2 Properties
5 Concluding Remarks
References
Computational Efficiency of Bagging Bootstrap Bandwidth Selection for Density Estimation with Big Data
1 The Bootstrap Method
2 Bagging and Subagging
3 Kernel Density Estimation and Bootstrap Bandwidth
4 The Subagged Bootstrap Bandwidth
5 Simulation Studies and Application to Real Data
References
Optimal Experimental Design for Physicochemical Models: A Partial Review
1 Optimal Experimental Designs
2 Michaelis-Menten Model
3 Arrhenius Equation
4 Adsorption Isotherms
5 Tait Equation
6 Discussion
References
Small Area Estimation of Proportion-Based Indicators
1 Introduction
2 Predictors
2.1 Predictors of Probability-Dependent Indicators
2.2 Predictors of Variable-Dependent Indicators
3 MSE of Predictors
4 Discussion and Future Research
References
Non-parametric Testing of Non-inferiority with Censored Data
1 Introduction
2 Preliminaries
3 Non-parametric Tests
4 Simulation Study
5 Application to Real Data
6 Concluding Remarks
References
A Review of Goodness-of-Fit Tests for Models Involving Functional Data
1 Introduction
2 GoF for Distribution Models for Functional Data
3 GoF for Regression Models with Functional Data
3.1 Scalar Response
3.2 Functional Response
References
An Area-Level Gamma Mixed Model for Small Area Estimation
1 Introduction
2 The Laplace Approximation Algorithm
3 Empirical Best Predictors
3.1 Bootstrap Estimation of the MSE
4 Application to Real Data
References
On the Consistence of the Modified Median Estimator for the Logistic Regression Model
1 Introduction
2 Consistence of the Estimator
References
Analyzing the Influence of the Rating Scale for Items in a Questionnaire on Cronbach Coefficient Alpha
1 Usual Imprecise-Valued Rating Scales Involved in the Items of a Questionnaire
2 Comparing Rating Scales Through Cronbach Alpha
2.1 Simulation of FRS-Based Responses and Suggested Links with Responses to Other Rating Scales
2.2 Comparison of Rating Scales Through Percentages of Greater Values of α
2.3 Comparison of Rating Scales Through Values of α
References
Robust LASSO and Its Applications in Healthcare Data
1 Introduction
2 Background
3 Robust Penalized Regression
3.1 Asymptotic Distribution of the MDPDE
4 Robust Cp Statistic and Degrees of Freedom
5 Simulation Study
6 Real Data Analysis
7 Conclusion
References
Machine Learning Procedures for Daily Interpolation of Rainfall in Navarre (Spain)
1 Introduction
2 Data Description
3 Machine Learning Algorithms
3.1 K nearest Neighbor
3.2 Random Forest
3.3 Neural Networks
3.4 Scaling Data
4 Data Analysis
5 Conclusions
References
Testing Homogeneity of Response Propensities in Surveys
1 Introduction
2 Test for Homogeneity of Propensity
3 Zero-Inflated (Truncated) Geometric Distribution
3.1 Distribution of Xn:n
3.2 Distribution of Rn
3.3 MLE
4 Implementation and Simulations
5 Conclusions
References
Use of Free Software to Estimate Sensitive Behaviours from Complex Surveys
1 Introduction
2 The R Package RRTCS
3 Implementation Details
4 Computational Efficiency
5 Examples
5.1 Example 1: A Randomized Response Survey to Investigate the Alcohol Abuse
5.2 Example 2: A Randomized Response Survey to Investigate the Agricultural Subsidies
5.3 Example 3: A Randomized Response Survey to Investigate the Infidelity
6 Summary
References
Inference with Median Distances: An Alternative to Reduce the Influence of Outlier Populations
1 Introduction
2 Approximation by Medians: Inference and Overlap
3 Simulation Study
References
Empirical Analysis of the Maxbias of the Fuzzy MDD-based Hampel M-estimator of Location
1 Introduction
2 Some General Comments on Fuzzy Numbers
3 The Fuzzy Hampel M-estimator of Location and its Scale Equivariant Version
4 Empirical Comparison in Terms of the Maximum Asymptotic Bias
5 Concluding Remarks
References


Studies in Systems, Decision and Control 445

Narayanaswamy Balakrishnan · María Ángeles Gil · Nirian Martín · Domingo Morales · María del Carmen Pardo   Editors

Trends in Mathematical, Information and Data Sciences A Tribute to Leandro Pardo

Studies in Systems, Decision and Control Volume 445

Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Systems, Decision and Control” (SSDC) covers both new developments and advances, as well as the state of the art, in the various areas of broadly perceived systems, decision making and control–quickly, up to date and with a high quality. The intent is to cover the theory, applications, and perspectives on the state of the art and future developments relevant to systems, decision making, control, complex processes and related areas, as embedded in the fields of engineering, computer science, physics, economics, social and life sciences, as well as the paradigms and methodologies behind them. The series contains monographs, textbooks, lecture notes and edited volumes in systems, decision making and control spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution and exposure which enable both a wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at https://link.springer.com/bookseries/13304

Narayanaswamy Balakrishnan · María Ángeles Gil · Nirian Martín · Domingo Morales · María del Carmen Pardo Editors

Trends in Mathematical, Information and Data Sciences A Tribute to Leandro Pardo

Editors Narayanaswamy Balakrishnan Department of Mathematics and Statistics McMaster University Hamilton, ON, Canada

María Ángeles Gil Department of Statistics and OR and TM University of Oviedo Oviedo, Asturias, Spain

Nirian Martín Department of Financial and Actuarial Economics & Statistics Complutense University of Madrid Madrid, Spain

Domingo Morales Department of Statistics, Mathematics and Informatics Miguel Hernández University of Elche Elche, Spain

María del Carmen Pardo Department of Statistics and Operational Research Complutense University of Madrid Madrid, Spain

ISSN 2198-4182 ISSN 2198-4190 (electronic) Studies in Systems, Decision and Control ISBN 978-3-031-04136-5 ISBN 978-3-031-04137-2 (eBook) https://doi.org/10.1007/978-3-031-04137-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Leandro Pardo in the advertisement of the homage paid to him in his alma mater, the Universidad Complutense de Madrid, in December 2019

Leandro Pardo entering the room for his UCM homage (source: https://www.ucm.es/tribunacomplutense/261/art3944.php#.YcoUO2jMJnI)

To Marisa and Julio, the closest collaborators of Leandro in research and life

Preface

This scientific edited book has been prepared as a tribute to our beloved friend and colleague, Leandro Pardo. To a certain extent, the book is related to the Symposium on Information Theory with Applications to Statistical Inference that was held at the Universidad Complutense de Madrid on December 2, 2019. The book was planned to be published by late 2020, but the circumstances associated with the COVID-19 pandemic caused unavoidable delays. The book is now ready, and we deeply thank all the contributors for their helpful support and understanding. The contributions to this book involve ideas/results from the topics of mathematical, information and data sciences, to which Leandro has devoted immense research endeavors. Although the book has been structured so that papers are classified as related to either Mathematical, Information or Data Science, this should not be considered a classical partition, since the classes definitely overlap and many contributions could properly be included in more than one of them. In the scientific world, there are colleagues who stand out for combining the originality and impact of their research with the generosity and human quality of their character. One of these colleagues is undoubtedly Prof. Leandro Pardo. Since obtaining his doctorate in Mathematics in 1980 from the Department of Statistics and Operations Research of the Complutense University of Madrid, Leandro has been an enthusiastic and tireless researcher. He has made remarkable contributions in fields as diverse as fuzzy sets, Bayesian nonparametric statistics, survival and reliability analysis, information theory, estimation and hypothesis testing procedures based on divergences and entropies, categorical data analysis, statistical models and their diagnosis, and robust statistical inference, among others.
This book contains a variety of articles written by authors who want to pay tribute to Leandro for his professional career and his humanity. It is difficult to summarize in a brief space the professional life of Leandro Pardo, so we will limit ourselves to highlighting some important aspects. As of the end of 2021, databases show that Leandro has published more than 250 research papers in the field of Statistics, with more than 1700 citations, written with around 45 different co-authors. He is the author of two reference books on Statistical Information Theory, which establish the methodological bases for statistical inference based on divergence measures. It is remarkable that he has coordinated more than 12 research and development projects funded by the Spanish Government and participated in international research projects funded by NATO. Leandro is closely linked to the Spanish Society of Statistics and Operations Research (SEIO), in which he has held relevant management positions. For instance, he was President of SEIO and Editor-in-Chief of TEST, the statistical journal sponsored by SEIO, and in 2020 he was awarded the SEIO Medal. He was also Associate Editor of Communications in Statistics - Theory and Methods, Communications in Statistics - Simulation and Computation, and the Journal of Statistical Planning and Inference. Currently, he is Associate Editor of the Journal of Multivariate Analysis, TEST and Revista Matemática Complutense. His time as the 2004 Distinguished Eugene Lukacs Professor at Bowling Green State University (Bowling Green, Ohio) deserves special mention; this worldwide recognition, bestowed from 1990 to 2007, honored outstanding scientists on the basis of their distinguished records of research in the application or theory of probability or statistics. This book, categorized into three parts, Trends in Mathematical Sciences (First Part), Trends in Information Sciences (Second Part) and Trends in Data Sciences (Third Part), brings together 38 contributions, authored by colleagues, students, descendants and friends of Leandro. Throughout these parts the reader will find that many of the research works cite his prominent book "Statistical Inference Based on Divergence Measures", published by Chapman & Hall/CRC (2006).
As mentioned earlier, the three parts are not mutually exclusive; they have in common several branches of Statistics, namely:

• Big Data/High-dimensional Statistics:
– José Miguel Angulo and María Dolores Ruiz-Medina (Second Part)
– Daniel Barreiro-Ures, Ricardo Cao and Mario Francisco-Fernández (Third Part)
– Wenceslao González-Manteiga, Rosa M. Crujeiras and Eduardo García-Portugués (Third Part)
– Abhijit Mandal and Samiran Ghosh (Third Part)
• Categorical Data Analysis:
– Maria Kateri (First Part)
– María Virtudes Alba-Fernández and María Dolores Jiménez-Gamero (Second Part)
– Juana M. Alonso, Aída Calviño and Susana Muñoz (Second Part)
– Apostolos Batsidis and Polychronis Economou (Second Part)


– Pedro Miranda, Ángel Felipe and Nirian Martín (Second Part)
– María Jaenada (Third Part)
• Fuzzy Data Analysis:
– María Asunción Lubiano, Manuel Montenegro, Sonia Pérez-Fernández, and María Ángeles Gil (Third Part)
– Beatriz Sinova (Third Part)
• Mathematical Economics:
– Emilio Cerdá and Xiral López-Otero (First Part)
– Ahmed Shatla, Carlos Carleos, Norberto Corral, Antonia Salas and María Teresa López (Second Part)
• Multivariate Data Analysis:
– Béatrice Byukusenge, Dietrich von Rosen and Martin Singull (Third Part)
– Miquel Salicrú, Ferran Reverter, Mireia Besalú and Moises Burset (Third Part)
• Spatial Statistics:
– Alfonso García-Pérez (First Part)
– Noel Cressie, Alan R. Pearse and David Gunawan (Second Part)
– Ana F. Militino, María Dolores Ugarte and Unai Pérez-Goya (Third Part)
• Stochastic Processes:
– Antonio Gómez-Corral, María Jesús López-Herrero and María Teresa Rodríguez-Bernal (First Part)
– Miguel Lafuente, David Ejea, Raúl Gouet, F. Javier López and Gerardo Sanz (First Part)
– Javier Villarroel and Juan A. Vega (First Part)
• Survey Data Analysis:
– María Dolores Esteban, Tomáš Hobza, Domingo Morales and Agustín Pérez (Third Part)
– Tomáš Hobza and Domingo Morales (Third Part)
– Juan Luis Moreno-Rebollo, Joaquín Muñoz-García and Rafael Pino-Mejías (Third Part)


– María del Mar Rueda, Beatriz Cobo and Antonio Arcos (Third Part)
• Survival/Reliability Analysis:
– Elena Castilla and Pedro J. Chocano (Second Part)
– Narayanaswamy Balakrishnan, Tian Feng and Hon-Yiu So (Third Part)
– Alba M. Franco-Pereira, María Carmen Pardo and Teresa Pérez (Third Part)
• Theoretical Statistical Inference with Divergence Measures:
– Kostas Zografos (First Part)
– Abhik Ghosh and Ayanendranath Basu (Second Part)
– Marianthi Markatou and Anran Liu (Second Part)
• Time Series:
– Lajos Horváth and Gregory Rice (First Part)
– Carlos Maté (First Part)
– J. Fernando Vera (Second Part).

Either as the main interest or as a complementary tool, "Robust Statistics" has been considered in most of the contributions to the previous branches, as it constitutes the principal research area in which almost all of Leandro's research work of the last decade has been published. Linked specifically to the first part, the work of Arturo Hidalgo and Lourdes Tello falls into Partial Differential Equations, while the work of Julián Costa, Ignacio García-Jurado and Juan Carlos Gonçalves-Dosantos falls into Project Management. Design of Experiments is the main area of the contribution of Carlos de la Calle Arroyo, Jesús López-Fidalgo and Licesio J. Rodríguez-Aragón, in the third part. These challenging and appealing topics have offered either new developments from a theoretical and/or computational point of view, or reviews of recent literature on outstanding developments. They have been applied through examples in Climatology, Chemistry, Economics, Engineering, Geology, Health Sciences, Physics, Pandemics and socioeconomic indicators. As editors of this multi-author book, we deeply thank all those who contributed to it; we know well how much affection for Leandro is involved in all its papers. We must also express our special gratitude to Asun Lubiano and Antonia Salas for their meticulous proofreading of the whole book.
As is evident from the contents of this volume, Leandro has had varied research interests in the field of Statistics and has made many incisive contributions to different areas of Statistics over the years. Likewise, he has had numerous successful collaborations and relationships with many people in the statistical community at large, both within Spain and outside. So, the diverse editorial team of this volume should come as no surprise! Some of us are among his colleagues, some are among his doctoral students, and some others are his collaborators, but one thing we all share in common is that we are all among his close friends and well-wishers. It is this that brought us together to work closely as a team to bring out this volume to honor our friend, Prof. Leandro Pardo, and it is our sincere hope that it will put a smile on his face and bring him many fond memories and a lot of happiness!

Hamilton, Canada
Oviedo, Spain
Madrid, Spain
Elche, Spain
Madrid, Spain
February 2022

Narayanaswamy Balakrishnan
María Ángeles Gil
Nirian Martín
Domingo Morales
María del Carmen Pardo

Contents

Trends in Mathematical Sciences

Using Taxes to Manage Energy Resources Related to Stock Pollutants: Resource Cartel versus Importers
Emilio Cerdá and Xiral López-Otero

A New Shapley Value-Based Rule for Distributing Delay Costs in Stochastic Projects
Julián Costa, Ignacio García-Jurado, and Juan Carlos Gonçalves-Dosantos

Variogram Model Selection
Alfonso García-Pérez

On First Passage Times in Discrete Skeletons and Uniformized Versions of a Continuous-Time Markov Chain
Antonio Gómez-Corral, María Jesús López-Herrero, and María Teresa Rodríguez-Bernal

A Numerical Approximation of a Two-Dimensional Atherosclerosis Model
Arturo Hidalgo and Lourdes Tello

Limit Results for Lp Functionals of Weighted CUSUM Processes
Lajos Horváth and Gregory Rice

Generalized Models for Binary and Ordinal Responses
Maria Kateri

Approximations of δ-Record Probabilities in i.i.d. and Trend Models
Miguel Lafuente, David Ejea, Raúl Gouet, F. Javier López, and Gerardo Sanz

The Relative Strength Index (RSI) to Monitor GDP Variations. Comparing Regions or Countries from a New Perspective
Carlos Maté

Escape Probabilities from an Interval for Compound Poisson Processes with Drift
Javier Villarroel and Juan A. Vega

A Note on the Notion of Informative Composite Density
Konstantinos Zografos

Trends in Information Sciences

Equivalence Tests for Multinomial Data Based on φ-Divergences
María Virtudes Alba-Fernández and María Dolores Jiménez-Gamero

Minimum Rényi Pseudodistance Estimators for Logistic Regression Models
Juana M. Alonso, Aída Calviño, and Susana Muñoz

Infinite-Dimensional Divergence Information Analysis
José Miguel Angulo and María Dolores Ruiz-Medina

A Model Selection Criterion for Count Models Based on a Divergence Between Probability Generating Functions
Apostolos Batsidis and Polychronis Economou

On the Choice of the Optimal Tuning Parameter in Robust One-Shot Device Testing Analysis
Elena Castilla and Pedro J. Chocano

Optimal Spatial Prediction for Non-negative Spatial Processes Using a Phi-divergence Loss Function
Noel Cressie, Alan R. Pearse, and David Gunawan

On Entropy Based Diversity Measures: Statistical Efficiency and Robustness Considerations
Abhik Ghosh and Ayanendranath Basu

Statistical Distances in Goodness-of-fit
Marianthi Markatou and Anran Liu

Phi-divergence Test Statistics Applied to Latent Class Models for Binary Data
Pedro Miranda, Ángel Felipe, and Nirian Martín

Cross-sectional Stochastic Frontier Parameter Estimator Using Kullback-Leibler Divergence
Ahmed Shatla, Carlos Carleos, Norberto Corral, Antonia Salas, and María Teresa López

Clustering and Representation of Time Series. Application to Dissimilarities Based on Divergences
J. Fernando Vera

Trends in Data Sciences

Proportional Odds COM-Poisson Cure Rate Model with Gamma Frailty and Associated Inference and Application
Narayanaswamy Balakrishnan, Tian Feng, and Hon-Yiu So

On Residual Analysis in the GMANOVA-MANOVA Model
Béatrice Byukusenge, Dietrich von Rosen, and Martin Singull

Computational Efficiency of Bagging Bootstrap Bandwidth Selection for Density Estimation with Big Data
Daniel Barreiro-Ures, Ricardo Cao, and Mario Francisco-Fernández

Optimal Experimental Design for Physicochemical Models: A Partial Review
Carlos de la Calle Arroyo, Jesús López-Fidalgo, and Licesio J. Rodríguez-Aragón

Small Area Estimation of Proportion-Based Indicators
María Dolores Esteban, Tomáš Hobza, Domingo Morales, and Agustín Pérez

Non-parametric Testing of Non-inferiority with Censored Data
Alba M. Franco-Pereira, María Carmen Pardo, and Teresa Pérez

A Review of Goodness-of-Fit Tests for Models Involving Functional Data
Wenceslao González-Manteiga, Rosa M. Crujeiras, and Eduardo García-Portugués

An Area-Level Gamma Mixed Model for Small Area Estimation
Tomáš Hobza and Domingo Morales

On the Consistence of the Modified Median Estimator for the Logistic Regression Model
María Jaenada

Analyzing the Influence of the Rating Scale for Items in a Questionnaire on Cronbach Coefficient Alpha
María Asunción Lubiano, Manuel Montenegro, Sonia Pérez-Fernández, and María Ángeles Gil

Robust LASSO and Its Applications in Healthcare Data
Abhijit Mandal and Samiran Ghosh

Machine Learning Procedures for Daily Interpolation of Rainfall in Navarre (Spain)
Ana F. Militino, María Dolores Ugarte, and Unai Pérez-Goya

Testing Homogeneity of Response Propensities in Surveys
Juan Luis Moreno-Rebollo, Joaquín Muñoz-García, and Rafael Pino-Mejías

Use of Free Software to Estimate Sensitive Behaviours from Complex Surveys
María del Mar Rueda, Beatriz Cobo, and Antonio Arcos

Inference with Median Distances: An Alternative to Reduce the Influence of Outlier Populations
Miquel Salicrú, Ferran Reverter, Mireia Besalú, and Moises Burset

Empirical Analysis of the Maxbias of the Fuzzy MDD-based Hampel M-estimator of Location
Beatriz Sinova

Trends in Mathematical Sciences

Using Taxes to Manage Energy Resources Related to Stock Pollutants: Resource Cartel versus Importers Emilio Cerdá and Xiral López-Otero

Abstract This chapter analyzes, through a two-period model, the interaction between a producing cartel and a country (or coalition of countries) that imports an energy-related natural resource whose consumption generates a stock pollutant. The chapter has been inspired by, and aims to contribute to, the design of climate change strategies, as greenhouse gas emissions accumulate in the atmosphere and are brought about by the use of fossil fuels that in many cases are unevenly distributed across the world. Particular attention is paid to the use of taxes on natural resources in a context of simultaneous decisions by countries and of differing concern regarding the environmental problem.

1 Introduction

In the last few years there has been increasing attention to the management of stock pollutants, as the most significant environmental problem, climate change, is due to the accumulation of greenhouse gas (GHG) emissions in the atmosphere. At the same time, GHG emissions are mainly related to the use of natural resources (fossil fuels) that are unevenly distributed and on some occasions highly concentrated across the planet. This raises important issues and potential trade-offs as, on the one hand, climate change has the characteristic of a global public good (i.e. affecting both fossil-fuel producers and consumers) but, on the other hand, producers of certain fossil fuels behave strategically to maximize their rents from such natural resources.

E. Cerdá (B) Department of Economic Analysis and ICEI, Universidad Complutense de Madrid, Campus de Somosoguas, 28223 Madrid, Spain e-mail: [email protected] X. López-Otero Department of Economic Theory and Mathematical Economics, UNED, Senda del Rey, 11, 28040 Madrid, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_1


This chapter is interested in exploring this setting, with particular attention to the use of taxes on the natural resource. Indeed, fossil fuel taxes are quite common in reality and respond to the growing concerns about climate change (carbon taxes) or other environmental problems, and also to strategic reactions by consuming countries to price manipulation by fossil fuel producers. In particular, the chapter analyzes, from a theoretical point of view, the relationship between a producing cartel and a country (or coalition of countries) that imports an energy-related natural resource. It does so by considering different attitudes towards the stock pollutant (a global public bad) and by analyzing simultaneous actions by countries. The main objective of the chapter is to study how taxes set by importers affect the strategies of producers and, eventually, the stock of pollution.

Previous literature on these matters includes Bergstrom [1], who studied the capacity of importing countries to obtain rents from the natural resource. Maskin and Newbery [7] and Karp and Newbery [4, 5] analyzed the world oil market in this respect. More closely related to this paper, Wirl [13, 14] studied the relationship between a producing cartel of a polluting resource and an importing country that uses a resource tax to maximize the welfare of its consumers. Wirl and Dockner [17] explore this relationship again, but assuming that the government also values the tax revenues. Rubio [8] extends the analysis to the case of many importing countries that behave non-cooperatively. Other papers, such as Tahvonen [11], Rubio and Escriche [9] or Strand [10], study the relationship between resource-producing and importing countries while incorporating the possibility of sequential choices. Finally, Wirl [15] introduces uncertainty in the model, Daubanes and Grimaud [2] propose a growth model with a rich importing region and a poor exporting region, Dullieux et al. [3] incorporate an upper limit on the carbon concentration in the atmosphere, Wirl [16] considers the case in which agents can choose to set prices (taxes) or quantities, while Wei et al. [12] distinguish between domestic market production and export production, allowing price discrimination between them.

Our contribution to the literature rests on the use of an analytical model in discrete time that, instead of focusing on the evolution path of the different variables in continuous time (as the above-mentioned papers do), makes it possible to obtain and compare the optimal solutions, in order to ascertain which situations are better for each side and for the management of the stock of pollution. The chapter also provides some practical and relevant insights. In general, it is found that, when the countries that import an (energy-related) natural resource coordinate to set an (environmental) tax on imported energy goods, their welfare increases and the associated pollution decreases. Yet this does not hold in all situations, so importers should carefully evaluate each particular setting. Moreover, given that the decisions taken by the involved parties in each period influence the future, agents cannot credibly commit to pre-determined actions.

The chapter is composed of four sections, including this introduction. The following section presents the basic model, which is employed to study particular cases in section three. Finally, the chapter closes with a summary of the main results and conclusions.


2 The Model

There exists a producer cartel of a polluting natural resource (in our case, energy-related, for example oil) and a country or a coalition of countries that imports and consumes such a resource from the cartel.¹ The model considers two periods: in each period the cartel decides the price of the resource, whilst the importing country establishes a tax on the resource. The consumption of the resource generates pollution that accumulates in the atmosphere, in such a way that in each period t the stock of pollution S_t is given by

  S_t = S_{t−1} + q_t,   (1)

q_t being the consumption of the resource in the importing country in period t. We assume that the stock of pollution does not decline. This is a sensible assumption if the stock pollutant represents accumulated GHG emissions, as their decline is very slow (about 200 years) and non-linear. In order to consider that all the quantity of the resource consumed enters into the stock of pollution as in (1), a unit of measure of the energy which gives rise to the emission of one unit of pollutant to the atmosphere can be used (see Wirl and Dockner [17]). The stock of pollution generates a negative externality in each period, which is modeled, following the literature (see e.g. Wirl [14] or Liski and Tahvonen [6]), through a quadratic damage function of the form (1/2)cS_t², with c > 0.

The cartel tries to maximize its profits, which are given by the difference between the income from the resource and the cost of its extraction. We assume that the extraction costs are constant and, since it does not affect the essence of the results, we will consider null costs (see Wirl [13]). For its part, the importing country tries to maximize the welfare of its citizens, which is given by the sum of the consumer's surplus and the tax revenues,² minus the environmental damage provoked by the consumption of the resource.

We assume that the demand function of the resource in each period in the importing country is linear, of the form

  q_t = a − b(p_t + τ_t),   (2)

p_t and τ_t being, respectively, the price of the resource, fixed by the cartel, and the tax established by the government of the importing country in period t, with a > 0, b > 0, a > c. Although the resource is nonrenewable, following Wirl [13] we assume that the resource constraint is not binding, because the resource will not be consumed totally due to the associated externality. In other words, we limit the externality and do not simply delay its negative environmental effects.

¹ It is assumed that the resource is not consumed in the producer country, whilst the importing country does not produce that resource.
² It is assumed that the tax revenues in the importing country are given back to the citizens through lump-sum transfers.


In the following section we present the different cases analyzed using this model, obtaining the analytical expressions of the main variables of the model and studying the relationships between these variables in the different cases, in order to see which situations are best for each party and for the environment (the stock of pollution). As it is a dynamic model with two periods, time-consistent results are presented: the agents, though they could initially commit to certain prices or taxes in the future, may be interested in deviating from that commitment with the passage of time.
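Before turning to the analyzed cases, the model's two building blocks — the linear demand function (2) and the non-decaying pollution stock (1) — can be sketched in a few lines of code. The snippet below is our own illustration (the function names are ours, and the parameter values anticipate the numerical example of Sect. 3.3); it is not part of the chapter.

```python
# Minimal sketch of the model's bookkeeping: the linear demand (2) and the
# non-decaying pollution stock (1). Illustrative values only.

def demand(a, b, p, tau):
    """Quantity consumed in one period: q_t = a - b * (p_t + tau_t)."""
    return a - b * (p + tau)

def pollution_path(S0, quantities):
    """Stock recursion S_t = S_{t-1} + q_t: the stock never declines."""
    path, S = [], S0
    for q in quantities:
        S += q
        path.append(S)
    return path

a, b, S0 = 5.0, 1.0, 10.0
# Without taxes the cartel charges p_t = a/(2b) in both periods (Sect. 3.1).
q = [demand(a, b, a / (2 * b), 0.0) for _ in range(2)]
print(q, pollution_path(S0, q))  # [2.5, 2.5] [12.5, 15.0]
```

With two periods, the whole state of the game is carried by this single stock variable, which is what makes backward induction tractable in Sect. 3.2.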

3 Analyzed Cases

In this section, we first consider the case in which the government of the importing country does not establish any tax on the consumption of the energy resource. In the second subsection we assume that the government of the importing country decides to introduce a tax in each period on the consumption of the resource. Analytical expressions for prices, quantities, taxes, stock of pollution, profits of the cartel and welfare of the consumers in the importing country in terms of the parameters a, b, c and S0 are obtained. Finally, a numerical example is given.

3.1 Absence of Taxes (WT)

In this case, there is a static optimization problem in which the producer sets the price that maximizes its profits, and the importing country does not have any capacity to influence that price. So, the problem for the cartel will be

  max_{p1,p2} Π = p1 q1 + p2 q2 = p1 (a − b p1) + p2 (a − b p2).   (3)

From the necessary and sufficient conditions of optimality we obtain that, in the absence of taxes, the main variables of the model take the following values, which are time-consistent:

  p1^WT = a/(2b),  q1^WT = a/2,  S1^WT = S0 + a/2,
  p2^WT = a/(2b),  q2^WT = a/2,  S2^WT = S0 + a.   (4)

Then, the profits of the cartel will be

  Π^WT = p1^WT q1^WT + p2^WT q2^WT = a²/(2b).   (5)

On the other hand, the welfare of the consumers in the importing country will be given by

  W^WT = u1 + u2 − (1/2)c S1² − (1/2)c S2²
       = (1/2)(a/b − p1^WT) q1^WT + (1/2)(a/b − p2^WT) q2^WT − (1/2)c S1² − (1/2)c S2²
       = a²/(4b) − c (S0² + (5/8)a² + (3/2) S0 a),   (6)

u t being the consumer’s surplus in the importing country deriving from the consumption of the resource in period t.
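The no-tax solution (4)–(6) can be verified with exact rational arithmetic. The following is our own sketch (Python's `fractions` module; the parameter values are the ones used in the numerical example of Sect. 3.3), not part of the chapter:

```python
from fractions import Fraction as F

# Exact check of the no-tax solution: the cartel's first-order condition
# gives p_t = a/(2b) in both periods, and welfare matches eq. (6).
a, b, c, S0 = F(5), F(1), F(1, 50), F(10)   # c = 0.02, as in Sect. 3.3

p = a / (2 * b)               # price in both periods, eq. (4)
q = a - b * p                 # consumed quantity, a/2
S1, S2 = S0 + q, S0 + 2 * q   # pollution stock, eq. (1)

profit = 2 * p * q
assert profit == a**2 / (2 * b)   # eq. (5)

u = (a / b - p) * q / 2           # consumer's surplus per period
W = 2 * u - c * S1**2 / 2 - c * S2**2 / 2
assert W == a**2 / (4 * b) - c * (S0**2 + F(5, 8) * a**2 + F(3, 2) * a * S0)  # eq. (6)
print(float(profit), float(W))    # 12.5 2.4375
```

The printed values are the "without taxes" profit and welfare that reappear in the table of Sect. 3.3.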

3.2 Taxes (T)

Let us now assume that the government of the importing country decides to introduce a tax τ_t in each period on the consumption of the resource. In this way, the problem of the cartel producer is now

  max_{p1,p2} Π = p1 q1 + p2 q2 = p1 (a − b p1 − b τ1) + p2 (a − b p2 − b τ2).   (7)

The first order conditions are

  ∂Π/∂p_i = a − 2b p_i − b τ_i = 0  ⟹  p_i = (a − b τ_i)/(2b),  i = 1, 2.   (8)

Sufficient conditions for the maximization problem are also satisfied. On the other hand, the government of the importing country sets the taxes in such a way that the welfare of its citizens is maximized, that is,

  max_{τ1,τ2} W = u1 + τ1 q1 − (1/2)c S1² + u2 + τ2 q2 − (1/2)c S2²
   = (1/2)(a/b − p1 − τ1)(a − b p1 − b τ1) + τ1 (a − b p1 − b τ1) − (1/2)c (S0 + a − b p1 − b τ1)²
   + (1/2)(a/b − p2 − τ2)(a − b p2 − b τ2) + τ2 (a − b p2 − b τ2) − (1/2)c (S0 + a − b p1 − b τ1 + a − b p2 − b τ2)².   (9)

Solving this maximization problem, we obtain the best-response functions of the importer. The open-loop Nash equilibrium is calculated with the best-response functions of both agents. However, these results are not time-consistent: given that the pollution generated in the first period accumulates in the atmosphere and thus influences the agents when making their decisions in the second period, the result in that period will depend on what happens in the first period, and therefore the importer is forced to reduce the tax on the resource. Although the price increase brings about a reduction in the consumed quantity, and hence a reduction in the stock of pollution, it also reduces the importer's utility and tax revenues, causing a welfare loss that is greater than the gain obtained by reducing pollution. Therefore, the importer partly offsets the fall in consumption due to the price increase. On the other hand, since decisions in the first period influence the second period through the accumulated stock of pollution, and since the price in the second period depends negatively on the stock of pollution accumulated in the first period, the producer's price will also increase in the second. In this case, however, the reduction in the importer's tax is of such magnitude that it provokes a reduction in the price paid by consumers. This, in turn, increases the quantity consumed during this period, in an attempt to reduce the welfare loss. As a result, exporters increase their profits, whereas welfare in the importing country is reduced. Thus, the time-consistent results (the subgame perfect Nash equilibrium), obtained by backward induction, are given below:


pollution, as a result of deviation in the first period, and since the price in the second period depends negatively on the stock of accumulated stock of pollution in the first period, the producer’s price will also increase in the second. In this case, however, the reduction in the importer’s tax is of such magnitude that it provokes a reduction in the price paid by consumers. This, in turn, increases the quantity consumed during this period, in an attempt to reduce welfare loss. As a result exporters increase their profits, whereas welfare in the importing country is reduced. Thus, the time consistent results (subgame perfect Nash equilibrium), obtained by backward induction are given below: 4a + 5abc + 2ab2 c2 − 8bcS0 − 9b2 c2 S0 − b3 c3 S0 8b + 16b2 c + 7b3 c2 + b4 c3 4a + 4abc + ab2 c2 − 4bcS0 − 2b2 c2 S0 p2T = 8b + 16b2 c + 7b3 c2 + b4 c3 2 2 3 + ab c + 16cS0 + 14bc2 S0 + 2b2 c3 S0 10ac + 5abc τ1T = 8 + 16bc + 7b2 c2 + b3 c3 8ac + 5abc2 + ab2 c3 + 8cS0 + 4bc2 S0 τ2T = 8 + 16bc + 7b2 c2 + b3 c3 p1T =

(10) (11) (12) (13)

From these expressions, we can see, as in Wirl [13] that the taxes set by the importing country will be purely environmental. Thus, if the importing country had not taken into account the environmental damage that the consumption of the resource provokes (c = 0), such taxes would be zero and the government could not use them to influence the price set by the producers. Substituting expressions (10) and (12) in the demand function (2) for t = 1 and expressions (11) and (13) in that demand function for t = 2, the following results for the quantities are obtained: 4a + abc − 8bcS0 − 5b2 c2 S0 − b3 c3 S0 8 + 16bc + 7b2 c2 + b3 c3 4a + 4abc + ab2 c2 − 4bcS0 − 2b2 c2 S0 q2T = 8 + 16bc + 7b2 c2 + b3 c3 q1T =

(14) (15)

In an analogous way, substituting the expressions (10) to (15) in the objective function (7), the mathematical expression of the maximum profit of the cartel producer can be obtained, and after substitution in the objective function (9), the expression of the maximum welfare of the consumers in the importing countries can also be obtained. The corresponding values of the stock of pollution in periods 1 and 2 are the following:

Using Taxes to Manage Energy Resources Related to Stock …

9

4a + abc + 8S0 + 8bcS0 + 2b2 c2 S0 8 + 16bc + 7b2 c2 + b3 c3 8a + 5abc + ab2 c2 + 8S0 + 4bcS0 S2T = 8 + 16bc + 7b2 c2 + b3 c3

S1T =

(16) (17)
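The closed forms (10)–(17) are internally consistent: the quantities satisfy the demand function (2), the stocks satisfy the recursion (1), and q2^T = b p2^T. The sketch below (ours, not the chapter's; it uses only Python's `fractions` module) checks these identities exactly at arbitrary positive rational parameter values:

```python
from fractions import Fraction as F

# Exact consistency check of the closed forms (10)-(17).
def solution(a, b, c, S0):
    Db = 8*b + 16*b**2*c + 7*b**3*c**2 + b**4*c**3   # denominator of (10)-(11)
    D  = 8 + 16*b*c + 7*b**2*c**2 + b**3*c**3        # denominator of (12)-(17)
    p1 = (4*a + 5*a*b*c + 2*a*b**2*c**2 - 8*b*c*S0 - 9*b**2*c**2*S0 - b**3*c**3*S0) / Db
    p2 = (4*a + 4*a*b*c + a*b**2*c**2 - 4*b*c*S0 - 2*b**2*c**2*S0) / Db
    t1 = (10*a*c + 5*a*b*c**2 + a*b**2*c**3 + 16*c*S0 + 14*b*c**2*S0 + 2*b**2*c**3*S0) / D
    t2 = (8*a*c + 5*a*b*c**2 + a*b**2*c**3 + 8*c*S0 + 4*b*c**2*S0) / D
    q1 = (4*a + a*b*c - 8*b*c*S0 - 5*b**2*c**2*S0 - b**3*c**3*S0) / D
    q2 = (4*a + 4*a*b*c + a*b**2*c**2 - 4*b*c*S0 - 2*b**2*c**2*S0) / D
    S1 = (4*a + a*b*c + 8*S0 + 8*b*c*S0 + 2*b**2*c**2*S0) / D
    S2 = (8*a + 5*a*b*c + a*b**2*c**2 + 8*S0 + 4*b*c*S0) / D
    return p1, p2, t1, t2, q1, q2, S1, S2

for a, b, c, S0 in [(F(5), F(1), F(1, 50), F(10)), (F(7), F(2), F(1, 10), F(3))]:
    p1, p2, t1, t2, q1, q2, S1, S2 = solution(a, b, c, S0)
    assert q1 == a - b*(p1 + t1) and q2 == a - b*(p2 + t2)   # demand (2)
    assert S1 == S0 + q1 and S2 == S1 + q2                   # recursion (1)
    assert q2 == b * p2                                      # used in the proof below
print("closed forms consistent")
```

Rational arithmetic is used deliberately: the identities hold exactly, so any transcription slip in a coefficient would make an assertion fail rather than hide in floating-point noise.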

The main results are presented in the following proposition.

Proposition 3.1 The introduction of taxes on the consumption of the resource by the importing country reduces the prices of the producer (p_i^T ≤ p_i^WT, with p_i^T < p_i^WT if c ≠ 0, i = 1, 2), the quantities (q_i^T ≤ q_i^WT, with q_i^T < q_i^WT if c ≠ 0, i = 1, 2) and the stock of pollution (S_i^T ≤ S_i^WT, with S_i^T < S_i^WT if c ≠ 0, i = 1, 2) in both periods, although the final price paid by the consumers is higher (p_i^T + τ_i^T ≥ p_i^WT, with p_i^T + τ_i^T > p_i^WT if c ≠ 0, i = 1, 2). Moreover, the welfare in the importing country is higher (W^T ≥ W^WT, with W^T > W^WT if c ≠ 0) and the profits of the cartel are lower (Π^T ≤ Π^WT, with Π^T < Π^WT if c ≠ 0).

Proof We have that

  p1^T = p1^WT + (−6ac − 3abc² − ab²c³ − 16cS0 − 18bc²S0 − 2b²c³S0) / (16 + 32bc + 14b²c² + 2b³c³)   (18)

  p2^T = p2^WT + (−8ac − 5abc² − ab²c³ − 8cS0 − 4bc²S0) / (16 + 32bc + 14b²c² + 2b³c³)   (19)

  q1^T = q1^WT + (−14abc − 7ab²c² − ab³c³ − 16bcS0 − 10b²c²S0 − 2b³c³S0) / (16 + 32bc + 14b²c² + 2b³c³)   (20)

  q2^T = b p2^T,  q2^WT = b p2^WT,  p2^T < p2^WT ⟹ q2^T < q2^WT   (21)

  p1^T + τ1^T = p1^WT + (14ac + 7abc² + ab²c³ + 16cS0 + 10bc²S0 + 2b²c³S0) / (16 + 32bc + 14b²c² + 2b³c³)   (22)

  p2^T + τ2^T = p2^WT + (8ac + 5abc² + ab²c³ + 8cS0 + 4bc²S0) / (16 + 32bc + 14b²c² + 2b³c³)   (23)

Moreover,

  W^T = W^WT + (A + B) / (8 (8 + 16bc + 7b²c² + b³c³)²) + (C + D + E) / (8 (4 + 6bc + b²c²)²),   (24)

where

  A = 224a²c + 948a²bc² + 1552a²b²c³ + 1082a²b³c⁴
  B = 381a²b⁴c⁵ + 68a²b⁵c⁶ + 5a²b⁶c⁷ + 384acS0
  C = 1824abc²S0 + 325ab²c³S0 + 2400ab³c⁴S0 + 876ab⁴c⁵S0
  D = 160ab⁵c⁶S0 + 12ab⁶c⁷S0 + 320bc²S0² + 1088b²c³S0²
  E = 1092b³c⁴S0² + 480b⁴c⁵S0² + 100b⁵c⁶S0² + 8b⁶c⁷S0².   (25)


As a, b, c and S0 are positive (c = 0 if the importing country does not take into account the environmental damage that the consumption of the resource provokes), with the introduction of the tax, producer prices and quantities are reduced. At the same time, the price paid by consumers and the welfare in the importing country increase. Moreover, as q1 and q2 are smaller than in the case without taxes, S1 and S2 are also smaller. Finally, prices and quantities are smaller in both periods with respect to the case without taxes, and therefore the producer's profits are also smaller. □

By introducing the consumption tax on the resource, prices increase for consumers, who thus reduce the amount of resource consumed. This forces the exporter to reduce its prices, to ensure that the fall in consumption is not as sharp and thus to minimize the reduction in its profits. In any case, the exporter cannot avoid the decrease of its profits. The importing country manages to increase its welfare because the stock of pollution is reduced and tax revenues are collected, despite the loss of utility caused by the reduction in the amount of resource consumed.
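Because the coefficients in the first-period differences (18), (20) and (22) are easy to mistranscribe, they can be double-checked mechanically against the closed forms (10), (12) and (14). The sketch below is ours (stdlib only) and verifies the identities exactly at arbitrary positive rational parameter values:

```python
from fractions import Fraction as F

def check_proof_identities(a, b, c, S0):
    """Verify exactly that (18), (20) and (22) follow from (10), (12), (14)."""
    Db = 8*b + 16*b**2*c + 7*b**3*c**2 + b**4*c**3
    D  = 8 + 16*b*c + 7*b**2*c**2 + b**3*c**3
    D2 = 16 + 32*b*c + 14*b**2*c**2 + 2*b**3*c**3
    p1T = (4*a + 5*a*b*c + 2*a*b**2*c**2 - 8*b*c*S0 - 9*b**2*c**2*S0 - b**3*c**3*S0) / Db
    t1T = (10*a*c + 5*a*b*c**2 + a*b**2*c**3 + 16*c*S0 + 14*b*c**2*S0 + 2*b**2*c**3*S0) / D
    q1T = (4*a + a*b*c - 8*b*c*S0 - 5*b**2*c**2*S0 - b**3*c**3*S0) / D
    p1WT, q1WT = a / (2*b), a / 2
    assert p1T - p1WT == (-6*a*c - 3*a*b*c**2 - a*b**2*c**3
                          - 16*c*S0 - 18*b*c**2*S0 - 2*b**2*c**3*S0) / D2      # (18)
    assert q1T - q1WT == (-14*a*b*c - 7*a*b**2*c**2 - a*b**3*c**3
                          - 16*b*c*S0 - 10*b**2*c**2*S0 - 2*b**3*c**3*S0) / D2  # (20)
    assert p1T + t1T - p1WT == (14*a*c + 7*a*b*c**2 + a*b**2*c**3
                                + 16*c*S0 + 10*b*c**2*S0 + 2*b**2*c**3*S0) / D2  # (22)

for params in [(F(5), F(1), F(1, 50), F(10)), (F(2), F(3), F(1, 7), F(1))]:
    check_proof_identities(*params)
print("proof identities verified")
```

Each difference is a rational function of (a, b, c, S0), so exact agreement at a handful of generic rational points is strong evidence of the polynomial identity (and the same check can be repeated symbolically if a computer algebra system is available).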

3.3 Numerical Example

For the following values of the parameters, a = 5, b = 1, c = 0.02, S0 = 10, the following results are obtained:

            Without taxes   With taxes
  p1             2.5           2.27
  p2             2.5           2.35
  τ1             0             0.51
  τ2             0             0.29
  p1 + τ1        2.5           2.78
  p2 + τ2        2.5           2.65
  Π             12.5          10.58        (26)

            Without taxes   With taxes
  q1             2.5           2.22
  q2             2.5           2.35
  S1            12.5          12.22
  S2            15            14.57
  W              2.44          3.61        (27)
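The "with taxes" column can be reproduced by plugging the parameter values into the closed forms (10)–(17). The following Python sketch is ours, not part of the chapter:

```python
from fractions import Fraction as F

# Numerical example: a = 5, b = 1, c = 0.02, S0 = 10, evaluated on the
# closed forms (10)-(15) of the time-consistent solution.
a, b, c, S0 = F(5), F(1), F(1, 50), F(10)
Db = 8*b + 16*b**2*c + 7*b**3*c**2 + b**4*c**3
D  = 8 + 16*b*c + 7*b**2*c**2 + b**3*c**3
p1 = (4*a + 5*a*b*c + 2*a*b**2*c**2 - 8*b*c*S0 - 9*b**2*c**2*S0 - b**3*c**3*S0) / Db
p2 = (4*a + 4*a*b*c + a*b**2*c**2 - 4*b*c*S0 - 2*b**2*c**2*S0) / Db
t1 = (10*a*c + 5*a*b*c**2 + a*b**2*c**3 + 16*c*S0 + 14*b*c**2*S0 + 2*b**2*c**3*S0) / D
t2 = (8*a*c + 5*a*b*c**2 + a*b**2*c**3 + 8*c*S0 + 4*b*c**2*S0) / D
q1, q2 = a - b*(p1 + t1), a - b*(p2 + t2)
profit = p1*q1 + p2*q2
print(round(float(p1), 2), round(float(t1), 2), round(float(q1), 2))  # 2.27 0.51 2.22
print(round(float(p2), 2), round(float(t2), 2), round(float(q2), 2))  # 2.35 0.29 2.35
print(round(float(S0 + q1), 2), round(float(profit), 2))              # 12.22 10.58
```

The printed values match the table to two decimal places.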


4 Conclusions

This chapter has analyzed, from a theoretical point of view, the relationship between a producing cartel and a country (or coalition of countries) that imports an energy-related natural resource whose consumption generates a stock of pollution. Although the chapter can be applicable to any environmental problem with these characteristics, we have been inspired by a major contemporary environmental challenge: climate change. Climate change is caused by the accumulation of GHG emissions in the atmosphere and has a public good nature, so there is room for strategic interaction among countries. Moreover, GHG emissions are related to the consumption of fossil fuels that, in many cases, are unevenly distributed and concentrated across the planet, also giving rise to strategic interaction among countries to capture resource rents. An adequate treatment of such complex and multiple relationships between resource producers and consumers is thus necessary for a proper understanding and management of the stock pollutant.

We have paid particular attention to the use of taxes on the natural resource whose use provokes the stock of pollution. Indeed, fossil fuel taxes are quite common in reality and respond to the growing concerns about climate change (carbon taxes) or other environmental problems, and also to strategic reactions by consuming countries to price manipulation by fossil fuel producers. Tax revenues may also play an important role in welfare-enhancing strategies by countries, either through public expenditure programs or through shifts in the tax system. The chapter has actually studied the influence of resource taxes, and of environmental concerns by countries, on the price and the quantity consumed of such a resource, thus providing information on the stock of pollution and on the level of welfare of exporters and importers.

Unlike most of the literature in this area, which has focused on the evolution path of the different variables in continuous time, our contribution rests on the use of an analytical model in discrete time that allows us to obtain and compare the optimal solutions, so as to ascertain which situations are better for each side and for the management of the stock of pollution. Our research has shown that, in the case of simultaneous decisions, the introduction of a tax on the resource to correct the environmental externality brings about a decrease in the producer price and in the quantity of the resource consumed, thus reducing the level of pollution, and an increase in the price paid by the consumers. Moreover, there is an increase in the welfare of the importing country, and the profits of the cartel are reduced.

An interesting extension of this piece of research consists of considering sequential choices instead of simultaneous decisions. The two possible cases are the following:

(a) The exporting country is the first to decide its prices (the leader is the exporter).
(b) The importing country decides first on the taxes levied on the consumption of the resource (the leader is the importing country).


References

1. Bergstrom, T.: On capturing oil rents with national excise tax. Amer. Econ. Rev. 72, 194–201 (1982)
2. Daubanes, J., Grimaud, A.: Taxation of a polluting non-renewable resource in the heterogeneous world. Environ. Resour. Econ. 47, 567–588 (2010)
3. Dullieux, R., Ragot, L., Schubert, K.: Carbon tax and OPEC's rents under a ceiling constraint. Scand. J. Econ. 113, 798–824 (2011)
4. Karp, L., Newbery, D.: OPEC and the U.S. oil import tariff. Econ. J. 101, 303–313 (1991)
5. Karp, L., Newbery, D.: Dynamically consistent oil import tariffs. Can. J. Econ. XXV, 1–21 (1992)
6. Liski, M., Tahvonen, O.: Can carbon tax eat OPEC's rents? J. Environ. Econ. Manag. 47, 1–12 (2004)
7. Maskin, E., Newbery, D.: Disadvantageous oil tariffs and dynamic consistency. Amer. Econ. Rev. 80, 143–156 (1990)
8. Rubio, S.: On capturing rent from a non-renewable resource international monopoly: prices versus quantities. Dyn. Games Appl. 1, 558–580 (2011)
9. Rubio, S., Escriche, L.: Strategic pigouvian taxation, stock externalities and polluting nonrenewable resources. J. Pub. Econ. 79, 297–313 (2001)
10. Strand, J.: Importer and producer petroleum taxation: a geo-political model. International Monetary Fund Working Paper WP/08/35 (2008)
11. Tahvonen, O.: Trade with polluting nonrenewable resources. J. Environ. Econ. Manag. 30, 1–17 (1996)
12. Wei, J., Hennlock, M., Johansson, D., Sterner, T.: The fossil endgame: strategic oil price discrimination and carbon taxation. J. Environ. Econ. Policy 1, 48–69 (2012)
13. Wirl, F.: Pigouvian taxation of energy for flow and stock externalities and strategic, non-competitive energy pricing. J. Environ. Econ. Manag. 26, 1–18 (1994)
14. Wirl, F.: The exploitation of fossil fuels under the threat of global warming and carbon taxes: a dynamic game approach. Environ. Resour. Econ. 5, 333–352 (1995)
15. Wirl, F.: Energy prices and carbon taxes under uncertainty about global warming. Environ. Resour. Econ. 36, 313–340 (2007)
16. Wirl, F.: Global warming: prices versus quantities from a strategic point of view. J. Environ. Econ. Manag. 64, 217–229 (2012)
17. Wirl, F., Dockner, E.: Leviathan governments and carbon taxes: costs and potential benefits. Eur. Econ. Rev. 39, 1215–1236 (1995)

A New Shapley Value-Based Rule for Distributing Delay Costs in Stochastic Projects Julián Costa, Ignacio García-Jurado, and Juan Carlos Gonçalves-Dosantos

Abstract In this paper we propose a new allocation rule for stochastic projects with delays based on the Shapley value and compare it with another Shapley value-based rule introduced in a recent paper. First we justify the interest of considering a new rule of this kind, then we compare it with the old one in some examples and finally we study some theoretical properties that distinguish them.

1 Introduction Project management is an important and widely used tool in engineering for the successful implementation of complex projects. One of the first contributions to this body of knowledge was the PERT/CPM methodology, developed in the late 1950s (see [8]). Since then, numerous techniques, algorithms and protocols have given rise to modern project management. An important issue in this field is the planning and time control of a project and, within this issue, a relevant question is how the costs generated by project delays should be distributed. The first articles that deal with the allocation of delay costs in a project from the perspective of cooperative game theory are [1, 3]. Since then, this topic has been treated by various authors, always in a deterministic setting, until [4] addresses the problem in stochastic projects and proposes a sharing rule based on the Shapley value, inspired by the rule for deterministic projects in [2]. J. Costa (B) Grupo MODES, Universidade da Coruña, Facultad de Informática, 15071 A Coruña, Spain e-mail: [email protected] I. García-Jurado · J. C. Gonçalves-Dosantos Grupo MODES and CITMAga, Universidade da Coruña, Facultad de Informática, 15071 A Coruña, Spain e-mail: [email protected] J. C. Gonçalves-Dosantos e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_2


In this paper we propose a new sharing rule for stochastic projects with delays based on the Shapley value. In Sect. 3 we justify the interest of considering a new rule, in Sect. 4 we compare both rules in some examples and, finally, in Sect. 5 we study some theoretical properties that distinguish them.

2 The Problem

In this section we describe the problem we are dealing with, which was first introduced and discussed in [4]. We start from a stochastic project, given by a list of activities, a description of the precedence relations between them, and a specification of the probability distribution of the duration of each of the activities. We also have a cost function that depends on the durations of the activities. Our objective is to propose and analyse a rule for distributing the cost associated with each implementation of the project.

Next we give the formal definition of stochastic projects with delays; notice that it generalizes the notion of deterministic projects with delays, dealt with in [1, 3].

Definition 2.1 A stochastic project with delays S P is a tuple (N, ≺, X⁰, x, C) where:

• N is the finite non-empty set of activities.
• ≺ is a binary relation over N satisfying asymmetry and transitivity; it describes the precedence relations between the activities.
• X⁰ = (X⁰_i)_{i∈N} is a vector of independent non-negative random variables. Each X⁰_i describes the duration of activity i.
• x ∈ R^N is the vector of observed non-negative durations.
• C : R^N → R is the delay cost function. We assume that C(0) = 0 and that C is non-decreasing.

We denote by SP^N the set of stochastic projects with delays with activities set N, and by SP the set of all stochastic projects with delays. The problem is to identify a sound rule for distributing the costs associated with the duration of the project and its activities. The following is a formal definition of a distribution rule in this context.

Definition 2.2 A rule for stochastic projects with delays is a map ψ on SP that assigns to each S P = (N, ≺, X⁰, x, C) ∈ SP^N a vector ψ(S P) ∈ R^N satisfying Σ_{i∈N} ψ_i(S P) = C(x).

3 A New Shapley Value-Based Rule

In this section we provide a theoretical rationale for the introduction of a new rule for stochastic projects with delays.


To begin with, we remember the Shapley rule for stochastic projects with delays introduced in [4]. Informally, such a rule is based on associating a cooperative game with each stochastic project with delays and calculating its Shapley value. For an introduction to cooperative games and the Shapley value, the reader can consult [6].

Definition 3.1 The Shapley rule for stochastic projects with delays Sh is defined by Sh(S P) = Φ(v^{S P}), where for all S P ∈ SP^N:

• v^{S P} is the cooperative game with set of players N given by

    v^{S P}(S) = E(C(x_S, X⁰_{N\S})) for all non-empty S ⊆ N, and v^{S P}(∅) = 0,

  where x_S and X⁰_{N\S} denote the restrictions of x and X⁰ to S and N \ S, respectively.

• Φ(v^{S P}) denotes the proposal of the Shapley value for v^{S P}, i.e.

    Φ_i(v^{S P}) = Σ_{S ⊆ N\{i}} [ |S|! (|N| − |S| − 1)! / |N|! ] (v^{S P}(S ∪ {i}) − v^{S P}(S))

  for all i ∈ N.

The Shapley rule Sh satisfies a number of reasonable properties (see [4]) and, moreover, it is a natural extension of the Shapley rule for deterministic projects with delays introduced in [2]. However, Sh has the disadvantage that it is based on a game that shows a certain discontinuity, in the sense that its definition for a coalition S is different depending on whether or not S is empty. In fact, it is easy to check that there exist S P ∈ SP^N with E(C(X⁰_N)) ≠ 0 = v^{S P}(∅). To correct this disadvantage, in this paper we study a variant of the Shapley rule that is based on a game whose definition has a single expression for all coalitions. This rule was introduced in [5], but in that paper only its definition is given, without any additional comments except that it has been incorporated into the R package ProjectManagement. The main purpose of this paper is to motivate its interest, to study some of its properties and to compare it with Sh.

Definition 3.2 The modified Shapley rule for stochastic projects with delays S̃h is defined by S̃h(S P) = Φ(ṽ^{S P}), where for all S P ∈ SP^N:

• ṽ^{S P} is the cooperative game with set of players N given by

    ṽ^{S P}(S) = E(C(x_S, X⁰_{N\S})) − E(C(X⁰_N)) + E(C(X⁰_S, 0_{N\S})) for all S ⊆ N.

• Φ(ṽ^{S P}) denotes the proposal of the Shapley value for ṽ^{S P}.


Notice that S̃h is a well defined rule for stochastic projects with delays because it satisfies:

1. ṽ^{S P} is a well defined cooperative game, since ṽ^{S P}(∅) = E(C(X⁰_N)) − E(C(X⁰_N)) + E(C(0)) = 0.
2. It distributes the delay costs among the activities, since

    Σ_{i∈N} Φ_i(ṽ^{S P}) = ṽ^{S P}(N) = C(x_N).

Observe that S̃h is also a natural extension of the Shapley rule for deterministic projects introduced in [2], with two correction summands so that a single expression produces a well defined rule for stochastic projects.
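To make Definitions 3.1 and 3.2 concrete, the following is a small Monte Carlo sketch of both rules for a two-activity project run in parallel, using the data of Example 4.1 below. It is our own illustration, not the authors' implementation (they provide one in the R package ProjectManagement); all function names are ours, and the simulated estimates only approximate the exact values reported in Sect. 4.

```python
import itertools, math, random

def expected_cost(fixed, samplers, cost, n=100_000, seed=0):
    """Monte Carlo estimate of E(C(x_S, X0_{N\\S})): activities in `fixed`
    keep their given durations, the remaining ones are sampled."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        y = {i: (fixed[i] if i in fixed else samplers[i](rng)) for i in samplers}
        total += cost(y)
    return total / n

def shapley(v, players):
    """Shapley value of a cooperative game given as a dict over frozensets."""
    n, phi = len(players), {i: 0.0 for i in players}
    for i in players:
        others = [j for j in players if j != i]
        for r in range(n):
            for comb in itertools.combinations(others, r):
                S = frozenset(comb)
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (v[S | {i}] - v[S])
    return phi

# Example 4.1: parallel activities, X1 ~ U(0,10), X2 ~ U(0,8),
# observed durations x = (7, 7), delay cost C(y) = max(max(y1, y2) - 6, 0).
players = [1, 2]
x = {1: 7.0, 2: 7.0}
samplers = {1: lambda r: r.uniform(0, 10), 2: lambda r: r.uniform(0, 8)}
cost = lambda y: max(max(y.values()) - 6.0, 0.0)

coalitions = [frozenset(s) for r in range(3) for s in itertools.combinations(players, r)]
E  = {S: expected_cost({i: x[i] for i in S}, samplers, cost) for S in coalitions}
E0 = {S: expected_cost({i: 0.0 for i in players if i not in S}, samplers, cost)
      for S in coalitions}
EN = E[frozenset()]                                     # E(C(X0_N))

v     = {S: (E[S] if S else 0.0) for S in coalitions}   # game of Definition 3.1
v_mod = {S: E[S] - EN + E0[S] for S in coalitions}      # game of Definition 3.2

print(shapley(v, players))      # approx. {1: 0.31, 2: 0.69}
print(shapley(v_mod, players))  # approx. {1: 0.58, 2: 0.42}
```

Both allocations sum to C(x) = 1 by construction; the two rules differ only through the game they feed to the Shapley value, which is exactly the point of Definition 3.2.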

4 Some Examples

In this section we look at some examples to compare the two rules considered, so that we can get to know them better.

Example 4.1 Consider the stochastic project with delays S P¹ = (N¹, ≺¹, X^{0,1}, x¹, C¹) given by N¹ = {1, 2}, ≺¹ = ∅ (this means that the two activities of the project can be carried out simultaneously), X₁^{0,1} and X₂^{0,1} are random variables with uniform distributions U(0, 10) and U(0, 8), respectively, x¹ = (7, 7) and, for every y ∈ R^N,

  C¹(y) = 0 if d(N¹, ≺¹, y) ≤ 6,
  C¹(y) = d(N¹, ≺¹, y) − 6 otherwise,

where d(N¹, ≺¹, y) denotes the duration of the deterministic project with set of activities N¹, precedence relations given by ≺¹, and vector of activity durations y. The duration of a deterministic project can be obtained using the widely known PERT/CPM methodology; more details on project management and PERT/CPM can be found in [7]. In this stochastic project with delays C¹(x¹) = 1, and it is easy to check that¹

  Sh(S P¹) = (0.3049, 0.6951)    S̃h(S P¹) = (0.5823, 0.4177).

¹ In all the examples in this section the exact calculation of the Sh and S̃h rules can easily be done manually. In any case, for the approximate calculation of such rules the R package ProjectManagement can be used (see [5]).

A New Shapley Value-Based Rule for Distributing …


This example is interesting because it shows a very different behaviour of the two rules towards activities with a probability distribution of their duration that is potentially adverse to the project, as is the case for activity 1. Any activity with the potential to delay the project can be seen as detrimental to the project and somehow undesirable. Implicitly, Sh assumes that such harm has already been borne by the project before the costs of the delay are distributed. Therefore activity 1 is favoured by Sh: although it could potentially have taken longer than activity 2, in the end both activities have the same observed duration. Instead, S̃h, in making the allocation, takes into account not only the observed duration of the activities, but also the risk to the project of incorporating activities with the capacity to cause high delays. It therefore penalises activity 1, even though it did not last longer than activity 2.

Example 4.2 In this example we illustrate how the discontinuity shown by Sh leads to unsatisfactory results, and how the new rule S̃h corrects this effect. Take the stochastic project with delays SP² = (N², ≺², X^{0,2}, x², C²) given by N² = {1, 2}, ≺² = ∅, X_1^{0,2} and X_2^{0,2} are random variables with uniform distributions U(10, 20) and U(0, 10) (respectively), x² = (17, 7) and, for every y ∈ R^{N²},

C²(y) = 0 if d(N², ≺², y) ≤ 15, and C²(y) = d(N², ≺², y) − 15 otherwise.

It is easy to see that the two rules distribute the cost C²(x²) = 2 as follows:

Sh(SP²) = (1.3791, 0.6209)    S̃h(SP²) = (2, 0).

Activity 2 cannot delay the project, but Sh assigns a positive cost to it. On the contrary, S̃h assigns all the cost to activity 1, which is responsible for the whole delay.

Example 4.3 This example shows, jointly, the behaviours we have discussed in the previous examples. Take SP³ = (N³, ≺³, X^{0,3}, x³, C³) given by N³ = {1, 2}, ≺³ = ∅, X_1^{0,3} and X_2^{0,3} are random variables with uniform distributions U(20, 30) and U(0, 10) (respectively), x³ = (20, 7) and, for every y ∈ R^{N³},

C³(y) = 0 if d(N³, ≺³, y) ≤ 15, and C³(y) = d(N³, ≺³, y) − 15 otherwise.

The cost to be distributed is C³(x³) = 5, and the solutions obtained are as follows:

Sh(SP³) = (0, 5)    S̃h(SP³) = (5, 0).



In this example there are two extreme and opposite cost distribution proposals. Activity 1 could have caused a much longer delay in the project, but it has not, so it is rewarded by Sh. On the contrary, S̃h takes into account that activity 2 cannot in any case delay the project and assigns the cost of the delay to activity 1, which is the one that causes it.
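The extreme allocations of Example 4.3 can be checked numerically. In the Monte Carlo sketch below (an illustration, not the authors' code; the game v_SP underlying Sh is taken to be v_SP(S) = E(C(x_S, X⁰_{N\S})) for S ≠ ∅ with v_SP(∅) = 0, the single-expression formula that breaks at the empty set, as described in Sect. 3):

```python
import random

random.seed(7)
N_MC = 100_000

def cost(y1, y2):
    # C3(y) = max(d(y) - 15, 0) with d(y) = max(y1, y2): no precedences.
    return max(max(y1, y2) - 15.0, 0.0)

def mean(f):
    return sum(f() for _ in range(N_MC)) / N_MC

x = (20.0, 7.0)
X1 = lambda: random.uniform(20, 30)  # planned duration of activity 1
X2 = lambda: random.uniform(0, 10)   # planned duration of activity 2

def shapley2(v):
    # Shapley value of a 2-player game v given as a dict on coalitions.
    f1 = 0.5 * (v[(1,)] - v[()]) + 0.5 * (v[(1, 2)] - v[(2,)])
    return (f1, v[(1, 2)] - f1)

# Game behind Sh: the formula for nonempty S, patched to 0 at the empty set.
v = {(): 0.0,
     (1,): mean(lambda: cost(x[0], X2())),
     (2,): mean(lambda: cost(X1(), x[1])),
     (1, 2): cost(*x)}

# Game behind the modified rule: one expression for every coalition.
ECX0 = mean(lambda: cost(X1(), X2()))
vt = {(): 0.0,
      (1,): mean(lambda: cost(x[0], X2())) - ECX0 + mean(lambda: cost(X1(), 0.0)),
      (2,): mean(lambda: cost(X1(), x[1])) - ECX0 + mean(lambda: cost(0.0, X2())),
      (1, 2): cost(*x)}

Sh = shapley2(v)
Sh_mod = shapley2(vt)
print(Sh, Sh_mod)  # close to (0, 5) and (5, 0), respectively
```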

5 Properties

In this section we present some properties of a generic rule ψ for stochastic projects with delays and study which of these properties are fulfilled by Sh and which by S̃h. We start with three adaptations to this context of standard properties in the cooperative game theory literature: one that ensures that a change in the time units does not lead to differences in the distributions of costs, one that prevents activities that behave identically from being treated differently, and one additivity property. It is easy to check that both Sh and S̃h satisfy these three properties.

Scale Invariance (SI). ψ satisfies SI if for all SP = (N, ≺, X⁰, x, C) and λ ∈ (0, ∞),

ψ(N, ≺, X⁰, x, C) = ψ(N, ≺, λX⁰, λx, C^λ),

where C^λ : R^N → R is such that C^λ(λy) = C(y) for all y ∈ R^N, λX⁰ is the vector of random variables (λX_i⁰)_{i∈N}, and λy = (λy_i)_{i∈N}.

Anonymity (AN). ψ satisfies AN if for all SP = (N, ≺, X⁰, x, C), π ∈ Π_N, where Π_N is the set of all permutations over the finite set N, and i ∈ N, then

ψ_i(SP) = ψ_{π(i)}(SP^π),

where SP^π = (N, ≺^π, π(X⁰), π(x), C^π) such that i ≺^π j if and only if π(i) ≺ π(j) for all i, j ∈ N; besides, C^π(y) = C(π^{−1}(y)) where, if y ∈ R^N, π(y) ∈ R^N and π(y)_i = y_{π^{−1}(i)}.

Cost additivity (CA). ψ satisfies CA if for all SP = (N, ≺, X⁰, x, C) and all SP′ = (N, ≺, X⁰, x, C′), then

ψ_i(SP + SP′) = ψ_i(SP) + ψ_i(SP′)

for all i ∈ N, where SP + SP′ = (N, ≺, X⁰, x, C + C′) and (C + C′)(y) = C(y) + C′(y) for all y ∈ R^N.

The following property guarantees that if an activity lasts the same or longer in one realisation of a stochastic project than in another realisation and all the other activities last the same, the cost that this activity must bear in the first realisation (according to the rule) cannot be smaller than in the second. Again, it is easy to check that both Sh and S̃h satisfy this property.



Monotonicity (MON). ψ satisfies MON if for all SP = (N, ≺, X⁰, x, C) and SP′ = (N, ≺, X⁰, x′, C) such that x_i ≥ x′_i and x_j = x′_j for some i ∈ N and for all j ∈ N \ i, then

ψ_i(SP) ≥ ψ_i(SP′).

The next property is used in [4] to characterise Sh. In fact, it is proved there that Sh is the unique rule for stochastic projects with delays that satisfies the property BAL below. Since S̃h is different from Sh, it is obvious that S̃h does not satisfy BAL.

Balancedness (BAL). ψ satisfies BAL if for all SP ∈ SP^N, all finite N, and all i, j ∈ N with i ≠ j, it holds that

ψ_i(SP) − ψ_i(SP_{−j}) = ψ_j(SP) − ψ_j(SP_{−i}),

where SP_{−i} = (N \ i, ≺_{−i}, X⁰_{−i}, x_{−i}, C_{−i}) and:

• ≺_{−i} is the restriction of ≺ to N \ i,
• X⁰_{−i} is the vector equal to X⁰ after deleting its ith component,
• x_{−i} is the vector equal to x after deleting its ith component, and
• C_{−i} : R^{N\i} → R is given by C_{−i}(y) = E(C(y, X_i⁰)), for all y ∈ R^{N\i}.

The following property gives a sufficient condition for the non-negativity of a rule. Gonçalves-Dosantos et al. [4] show that Sh satisfies this property. In an analogous way, it is easy to prove that S̃h also satisfies it.

Non-Negativity (NN). ψ satisfies NN if for all SP = (N, ≺, X⁰, x, C) with C(x_i, y_{N\i}) ≥ E(C(X_i⁰, y_{N\i})) for all i ∈ N and for all y_{N\i} ∈ R^{N\i}, then ψ_i(N, ≺, X⁰, x, C) ≥ 0.

Let us now look at a property that has to do with the so-called irrelevant activities. Take a stochastic project with delays SP = (N, ≺, X⁰, x, C). We say that i ∈ N is an irrelevant activity in SP if, for all S ⊆ N \ i, it holds that:

• E(C(x_S, x_i, X⁰_{N\(S∪i)})) = E(C(x_S, X_i⁰, X⁰_{N\(S∪i)}))
• E(C(X⁰_S, X_i⁰, 0_{N\(S∪i)})) = E(C(X⁰_S, 0, 0_{N\(S∪i)}))

The property we consider below states that if an activity is irrelevant in a stochastic project with delays, then the rule should not assign delay cost to it. Formally:

Irrelevant Activities Property (IAP). ψ satisfies IAP if for all SP = (N, ≺, X⁰, x, C) and all i ∈ N, irrelevant activity in SP, it holds that ψ_i(SP) = 0.

Example SP³ in Sect. 4 shows that Sh does not satisfy IAP. Indeed, it is easy to check that activity 2 is irrelevant in SP³; however, Sh₂(SP³) = 5 > 0.² On the contrary, the following proposition shows that S̃h satisfies IAP.

² Note that this example also shows that Sh does not satisfy the property of independence of irrelevant delays described in [4]; that article wrongly states the opposite.



Table 1 Some properties satisfied by Sh and/or S̃h

        SI   AN   CA   MON   BAL   NN   IAP
Sh      ✓    ✓    ✓    ✓     ✓     ✓    −
S̃h      ✓    ✓    ✓    ✓     −     ✓    ✓

Proposition 5.1 S̃h satisfies IAP.

Proof Take a stochastic project with delays SP = (N, ≺, X⁰, x, C) with an irrelevant activity i ∈ N. Then, for all S ⊆ N \ i:

ṽ_SP(S ∪ i) = E(C(x_{S∪i}, X⁰_{N\(S∪i)})) − E(C(X⁰_N)) + E(C(X⁰_{S∪i}, 0_{N\(S∪i)}))
            = E(C(x_S, X⁰_{N\S})) − E(C(X⁰_N)) + E(C(X⁰_S, 0_{N\S})) = ṽ_SP(S).

Since the Shapley value for cooperative games satisfies the null player property (see, for instance, [6]), then S̃h_i(SP) = 0. □

Table 1 shows in a summarised form the properties fulfilled by Sh and by S̃h. An open question of significant interest is to characterise the rule S̃h axiomatically.

Acknowledgements This work has been supported by the ERDF, the MINECO/AEI grant MTM2017-87197-C3-1-P and the Xunta de Galicia (Grupos de Referencia Competitiva ED431C2020-14, and Centro de Investigación del Sistema universitario de Galicia ED431G 2019/01). Juan Carlos Gonçalves-Dosantos has a Margarita Salas postdoctoral grant and he is doing a research stay at the Center of Operations Research (CIO), Miguel Hernandez University of Elche (UMH).

References

1. Bergantiños, G., Sánchez, E.: How to distribute costs associated with a delayed project. Ann. Oper. Res. 109, 159–174 (2002)
2. Bergantiños, G., Valencia-Toledo, A., Vidal-Puga, J.: Hart and Mas-Colell consistency in PERT problems. Discrete Appl. Math. 243, 11–20 (2018)
3. Brânzei, R., Ferrari, G., Fragnelli, V., Tijs, S.: Two approaches to the problem of sharing delay costs in joint projects. Ann. Oper. Res. 109, 359–374 (2002)
4. Gonçalves-Dosantos, J.C., García-Jurado, I., Costa, J.: Sharing delay costs in stochastic scheduling problems with delays. 4OR 18, 457–476 (2020)
5. Gonçalves-Dosantos, J.C., García-Jurado, I., Costa, J.: ProjectManagement: an R package for managing projects. R J. 12, 419–436 (2020)
6. González-Díaz, J., García-Jurado, I., Fiestras-Janeiro, M.G.: An Introductory Course on Mathematical Game Theory. Graduate Studies in Mathematics, vol. 115. American Mathematical Society, Providence (2010)
7. Hillier, F.S., Lieberman, G.J.: Introduction to Operations Research. McGraw-Hill, New York (2002)
8. Kelley, J.E.: Critical path planning and scheduling: mathematical basis. Oper. Res. 9, 296–320 (1961)

Variogram Model Selection

Alfonso García-Pérez

Abstract A common problem in geostatistics is variogram estimation, in order to choose an acceptable model for kriging. Nevertheless, there is no standard method, first, to test if a particular model can be accepted as valid and, second, to choose among several competing variogram models. The problem is even more complex if, in addition, there are outliers in the data. In this paper we propose to use the distribution of some classical and robust variogram estimators to test, first, the validity of a particular model, accepting it if the p-value of the test, with this particular model as null hypothesis, is large enough and, second, to compare several competing models, choosing the model with the largest p-value among several acceptable models.

1 Introduction

A common problem in geostatistics is variogram model selection among several competing models, after the variogram has been estimated, usually by weighted least squares. Among all the models that apparently fit well, one might choose the one with the smallest residual sum of squares, or the smallest mean square, from the usual Matheron's estimator [8]. Sometimes, the chosen model is the one with the smallest Akaike's information criterion (AIC) [1],

AIC = −2 log(maximized likelihood) + 2p,

where p is the number of parameters of the model.

A. García-Pérez (B) Departamento de Estadística, I.O. y C.N., Universidad Nacional de Educación a Distancia (UNED), Paseo Senda del Rey 9, 28040 Madrid, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_3




AIC is usually estimated by

ÂIC = {n log(2π/n) + n + 2} + n log R + 2p,

where n is the number of points on the variogram and R is the mean of the squared residuals between the observed values and the fitted model (Webster and Oliver [13], p. 105). Because the first term is constant, the model with the smallest n log R + 2p is chosen. Combining both criteria, the model with the smallest mean squared residual and the smallest n log R + 2p is, usually, the selected model.

But the chosen model might not be significant enough because there is no probability distribution to compare with. In this sense, Webster and McBratney [11] propose an F test for nested models, and suggest other possible criteria. In this context, equations for estimating the estimation variances of variograms (with a bounded sill) are given in Matheron [9] and in Muñoz-Pardo [10], who solve them by numerical integration. Also, Webster and Oliver [12] obtain confidence limits by Monte Carlo methods. These results are valid considering only classical estimators and observations with a normal model distribution. The paper by Gorsich and Genton [6] has these purposes from a nonparametric point of view. In García-Pérez [3], approximate distributions, under a contaminated normal model, for classical and robust variogram estimators are obtained.

The aim of this paper is to use these approximations to the distributions of Matheron's estimator and some robust ones, first, to validate a particular variogram model and, then, to compare among several competing variogram models.
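The model comparison by n log R + 2p takes only a few lines; in the sketch below the residual values are invented for illustration only (the parameter counts assume three fitted parameters per model: sill, range and nugget):

```python
import math

# Illustrative comparison of candidate variogram models by the non-constant
# part of the estimated AIC, n*log(R) + 2p, as described above.
def aic_score(n, R, p):
    # n: points on the sample variogram, R: mean squared residual, p: parameters.
    return n * math.log(R) + 2 * p

n = 12  # hypothetical number of points on the sample variogram
models = {
    "spherical":     {"R": 0.004, "p": 3},
    "cardinal_sine": {"R": 0.005, "p": 3},
}
scores = {name: aic_score(n, m["R"], m["p"]) for name, m in models.items()}
best = min(scores, key=scores.get)
print(best, scores)
```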

2 Robust Estimators of the Variogram

Let us suppose that a univariate random variable Z is observed at some known fixed locations s_i ∈ D, where D is a fixed subset of R^d, d ≥ 1, and let us assume that the variable Z satisfies the intrinsic stationarity property, i.e., the differences have zero mean,

E[Z(s + h) − Z(s)] = 0, ∀s, s + h ∈ D,

and the variance depends only on the lag h,

V(Z(s + h) − Z(s)) = 2γ(h), ∀s, s + h ∈ D,

the function

2γ(h) = V(Z(s + h) − Z(s)) = E[(Z(s) − Z(s + h))²]

being the variogram. This is estimated with the classical Matheron's estimator,

2γ̂_M(h) = (1/N_h) Σ_{i=1}^{N_h} (Z_{i+h} − Z_i)²,

where the sample size n = N_h is the cardinality of N(h) = {(s_i, s_j) : s_i − s_j = h}.

In García-Pérez [3] some robust estimators of the variogram were introduced. If we transform the original observations Z_i by Y_i = (Z_{i+h} − Z_i)², robust M-estimators T_n of the variogram can be obtained as solutions of the equation

Σ_{i=1}^{n} ψ(Y_i, T_n) = 0.
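For a regular one-dimensional transect, Matheron's estimator reduces to an average of squared increments. A minimal sketch with illustrative toy data (the lag h is taken as an integer number of grid steps):

```python
# Illustrative sketch: classical Matheron's variogram estimator on a regular
# 1-D transect.
def matheron(z, h):
    # 2*gamma_hat(h) = mean of (Z_{i+h} - Z_i)^2 over the N_h available pairs.
    pairs = [(z[i + h] - z[i]) ** 2 for i in range(len(z) - h)]
    return sum(pairs) / len(pairs)

z = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # toy data
print(matheron(z, 1))  # every lag-1 increment is +-1, so the estimate is 1.0
print(matheron(z, 2))  # every lag-2 increment is 0, so the estimate is 0.0
```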

If a linearized variogram can be accepted, the transformed variables Y_i can be considered as independent. If we assume a scale contaminated normal model,

F = (1 − ε) N(μ, σ) + ε N(μ, gσ),

with ε ∈ (0, 1) (usually small) and g > 1, for the marginal distributions of the original observations Z_i, which means a distribution

F = (1 − ε) 2γ(h) χ²₁ + ε g² 2γ(h) χ²₁

for the transformed observations Y_i, in García-Pérez [3] it is proved that a saddlepoint approximation (VOM+SAD) for the distribution of T_n is

P_F{T_n > t} ≃ P_G{T_n > t} + ε (φ(s)/(√n r₁)) [ (∫ e^{z₀ψ(x,t)} dH(x)) / (∫ e^{z₀ψ(y,t)} dG(y)) − 1 ],   (1)

where G = 2γ(h)χ²₁, H = g² 2γ(h)χ²₁, φ is the density function of the standard normal distribution, s and r₁ are the functionals

s = √(−2n K(z₀, t)),   r₁ = z₀ √(K″(z₀, t)),

K(λ, t) is the function

K(λ, t) = log ∫_{−∞}^{∞} e^{λψ(y,t)} dG(y),

K″(λ, t) (K′(λ, t)) is the second (the first) partial derivative of K(λ, t) with respect to the first variable, and z₀ is the saddlepoint, i.e., the solution of the saddlepoint equation

K′(z₀, t) = ∫_{−∞}^{∞} e^{z₀ψ(y,t)} ψ(y, t) dG(y) = 0.
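For intuition on the saddlepoint equation, the toy sketch below solves K′(z₀, t) = 0 by bisection for the simple choice ψ(y, t) = y − t with G = Exp(1); this choice is an assumption made only for the sketch (the paper works with scaled χ²₁ distributions), but it admits the closed-form check z₀ = 1 − 1/t:

```python
# For psi(y, t) = y - t and G = Exp(1), K(lambda, t) = -lambda*t - log(1 - lambda)
# for lambda < 1, so K'(lambda, t) = -t + 1/(1 - lambda) and the saddlepoint is
# exactly z0 = 1 - 1/t.
def K_prime(lam, t):
    return -t + 1.0 / (1.0 - lam)

def saddlepoint(t, lo=-50.0, hi=0.999999):
    # Bisection on K'(., t), which is increasing in lambda on (-inf, 1).
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if K_prime(mid, t) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = 2.5
z0 = saddlepoint(t)
print(z0)  # close to 1 - 1/t = 0.6
```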



Approximation (1) is easy to compute with R for Matheron's estimator and for robust M-estimators, as explained in García-Pérez [3]. In that paper, an α-trimmed variogram estimator is also introduced and its VOM+SAD distribution obtained.

3 Acceptance of a Model and Variogram Model Comparison

The VOM+SAD approximations obtained in García-Pérez [3] for classical and robust variogram estimators can be used to test whether a particular variogram model 2γ(h) can be accepted to explain a variogram estimator T_n = 2γ̂(h) and also to compare several variogram models.

Let us assume the model 2γ(h) as null hypothesis and 2γ̂(h) as a variogram estimator. We consider the test statistic

S_n = sup_h |2γ̂(h) − 2γ(h)| = max_{1≤||h||≤k} |2γ̂(h) − 2γ(h)|,

taking values s_n, assuming there are k lags. If the p-value of this test, P{S_n > s_n}, is large enough, the model will be accepted; otherwise the model will be rejected. If several competing models are accepted, the model for which this p-value is the largest will be the selected one.

In García-Pérez [3] it is obtained that the cumulative distribution function of S_n, F_{S_n}(s_n) = 1 − P{S_n > s_n}, is

F_{S_n}(s_n) = Π_{||h||=1}^{k} [ P_{2γ(h)}{2γ̂(h) > −s_n + 2γ(h)} − P_{2γ(h)}{2γ̂(h) > s_n + 2γ(h)} ],

these tail probabilities being computed with the VOM+SAD approximations.
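The acceptance rule can be sketched as follows. The per-lag tail probabilities should come from the VOM+SAD approximation (1); the normal tail used here is only a stand-in assumption of this sketch, as are the numerical values:

```python
import math

def normal_tail(x, mean, sd):
    # P(estimator > x) under N(mean, sd^2) -- a crude stand-in for VOM+SAD tails.
    return 0.5 * math.erfc((x - mean) / (sd * math.sqrt(2.0)))

def p_value(gamma_hat, gamma_model, sd=0.02):
    # s_n = max over lags of |2*gamma_hat(h) - 2*gamma(h)|.
    sn = max(abs(e - m) for e, m in zip(gamma_hat, gamma_model))
    # F_{S_n}(s_n): product over lags of P{est > -s_n + model} - P{est > s_n + model}.
    F = 1.0
    for m in gamma_model:
        F *= normal_tail(-sn + m, m, sd) - normal_tail(sn + m, m, sd)
    return sn, 1.0 - F  # p-value = P{S_n > s_n}

# Hypothetical estimates and model values of 2*gamma at k = 4 lags.
est = [0.050, 0.090, 0.120, 0.135]
mod = [0.048, 0.095, 0.118, 0.137]
sn, p = p_value(est, mod)
print(sn, p)
```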

4 Example

Let us consider log Calcium data (mg/l), one of the eight variables observed in the groundwater data analysis around the city of Madrakah, a town located in the Wadi Usfan region in western Saudi Arabia (Marko et al. [7]). In Cabrero-Ortega and García-Pérez ([2], pp. 303–310), a classical methodology is applied to these data, concluding that a Spherical model with partial sill = 0.1564478, nugget = 0 and range = 0.007289068 is suitable; see Fig. 1. In this figure we also observe Matheron's estimates for several lags, together with some outliers, and these estimates seem to be affected by them.

Fig. 1 Matheron's estimator and a Spherical model

Fig. 2 Matheron's estimator and a Cardinal Sine model

In García-Pérez [3], we define robust estimates for these data and we also prove that the linearized versions of the variogram models (classical, 0.05-trimmed and Huber) can be accepted. Hence, we can consider the transformed observations Y_i as independent. We also obtain the VOM+SAD approximations for their distributions.

But let us observe that, in Cabrero-Ortega and García-Pérez [2], we also mention that a Cardinal Sine model with partial sill = 0.11533833, nugget = 0.03038008 and range = 0.005372508 can also be accepted for these data, as we see in Fig. 2. We check now whether both models have p-values large enough to be accepted and which one is the largest. The p-values for the Spherical model, computed with the VOM+SAD approximations, are included in the middle column of Table 1; in the right-hand column we show the p-values for the Cardinal Sine model. The computations are in the Supplementary Material available on the website https://www2.uned.es/pea-metodos-estadisticos-aplicados/VariogramSelection.htm

Table 1 P-values for the Spherical model (middle) and the Cardinal Sine model (right), considering the classical Matheron's estimator and two robust ones

Estimator\Model   Spherical model   Cardinal Sine model
Classical         0.2270516         0.0001244
0.05-trimmed      0.1333519         0.0862922
Huber             0.0157108         0.1036955

Although both models are accepted using the standard criteria, from this table we see that the Cardinal Sine model cannot be accepted considering the distribution of Matheron's estimator, whereas the Spherical model can be accepted. Nevertheless, if we use robust methods, the conclusion is the opposite one because of the outliers in the data: first, with the 0.05-trimmed variogram estimator both models are acceptable but, because of the asymmetry, it is better to use Huber's estimator, with which we conclude that the Cardinal Sine model should be the selected one.

5 Conclusions and Future Works

The selection of a valid variogram model is a key question in geostatistics. In this paper we propose a test to do this, in which the null hypothesis is the suggested variogram model, which is accepted if the p-value is large enough. If several models are valid, we propose to choose the model with the largest p-value. This proposal is especially useful when there are outliers in the data set, because robust variogram estimators can be used in the proposal. The test is performed with the VOM+SAD approximations to the distributions of the classical and robust variogram estimators obtained in García-Pérez [3].

These ideas can be extended to the multivariate situation through the cross-variogram, following the results obtained in García-Pérez [4], and even to the spatio-temporal framework, with the results developed in García-Pérez [5].

Acknowledgements This work is partially supported by Grant PGC2018-095194-B-I00 from Ministerio de Ciencia, Innovación y Universidades (Spain).



References

1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csáki, F. (eds.) 2nd International Symposium on Information Theory, pp. 267–281. Akadémiai Kiadó, Budapest (1973)
2. Cabrero-Ortega, M.Y., García-Pérez, A.: Análisis estadístico de datos espaciales con QGIS y R. Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain (2020)
3. García-Pérez, A.: Saddlepoint approximations for the distribution of some robust estimators of the variogram. Metrika 83(1), 69–91 (2020)
4. García-Pérez, A.: New robust cross-variogram estimators and approximations of their distributions based on saddlepoint techniques (2020). Submitted
5. García-Pérez, A.: Robust spatio-temporal variogram estimators and a saddlepoint approximation for their distributions (2020). Submitted
6. Gorsich, D.J., Genton, M.G.: Variogram model selection via nonparametric derivative estimation. Math. Geol. 32(3), 249–270 (2000)
7. Marko, K., Al-Amri, N.S., Elfeki, A.M.M.: Geostatistical analysis using GIS for mapping groundwater quality: case study in the recharge area of Wadi Usfan, western Saudi Arabia. Arab. J. Geosci. 7, 5239–5252 (2014)
8. Matheron, G.: Traité de géostatistique appliquée, Tome I. Mémoires du Bureau de Recherches Géologiques et Minières, no. 14. Editions Technip, Paris (1962)
9. Matheron, G.: Les variables régionalisées et leur estimation. Masson, Paris (1965)
10. Muñoz-Pardo, J.F.: Approche géostatistique de la variabilité spatiale des milieux géophysiques. MA Thesis, Université de Grenoble et l'Institut National Polytechnique de Grenoble (1987)
11. Webster, R., McBratney, A.B.: On the Akaike Information Criterion for choosing models for variograms of soil properties. Eur. J. Soil Sci. 40, 493–496 (1989)
12. Webster, R., Oliver, M.A.: Sample adequately to estimate variograms of soil properties. Eur. J. Soil Sci. 43, 177–192 (1992)
13. Webster, R., Oliver, M.A.: Geostatistics for Environmental Scientists, 2nd edn. Wiley, Chichester (2007)

On First Passage Times in Discrete Skeletons and Uniformized Versions of a Continuous-Time Markov Chain

Antonio Gómez-Corral, María Jesús Lopez-Herrero, and María Teresa Rodríguez-Bernal

Abstract In this paper, the aim is to study similarities and differences between a continuous-time Markov chain and its uniformized Markov chains and discrete skeletons in terms of first passage times when the taboo subset of states is assumed to be accessible from a class of communicating states. Under the assumption of a finite communicating class, we characterize the first-passage times in terms of either continuous or discrete phase-type random variables. For illustrative purposes, we show how first passage times in uniformized Markov chains and discrete skeletons can be used to approximate the random duration of an outbreak in the SIS epidemic model.

A. Gómez-Corral (B) · M. T. Rodríguez-Bernal
Department of Statistics and Operations Research, Complutense University of Madrid, Madrid, Spain
e-mail: [email protected]
M. T. Rodríguez-Bernal e-mail: [email protected]
M. J. Lopez-Herrero
Department of Statistics and Data Science, Complutense University of Madrid, Madrid, Spain
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_4

1 Introduction

The use of uniformized Markov chains and discrete skeletons has been shown to be a keystone in the derivation of theoretical results for the underlying continuous-time Markov chain (CTMC) by applying well-known theorems for discrete-time Markov chains, as well as in the performance analysis of systems modelled by a CTMC; see e.g. Anderson [1, Chap. 5] and van Dijk et al. [19]. Specifically, the uniformization method was first described by Jensen [12] in 1953 for time-homogeneous CTMCs with uniformly bounded transition rates.

Uniformization allows one to interpret a CTMC in terms of a discrete-time Markov chain by replacing the constant unit of time by random jump times, which are selected from




a suitably defined Poisson process. By a simple conditioning argument on the number of Poisson events up to time t, it is easy to compute the matrix exponential for transient probabilities in the CTMC by using the Fox-Glynn method [5] via iterative computation of the n-step transition probability matrix in the uniformized Markov chain, with global error control at any time t. For a related work, see Sect. 2.8 in the monograph of Latouche and Ramaswami [14], where uniformization is used to evaluate numerically the density and distribution functions of a continuous phase-type random variable without recourse to the Kolmogorov forward equations. An interesting analytical method to determine expected sojourn time averages and the expected number of events is provided by Gross and Miller [9, Sect. 6] in the special case of CTMCs with finitely many states. For CTMCs with a countable state space, Melamed and Yadin [15] present upper and lower bounds on cumulative-time distributions by using a computational methodology that utilizes and generalizes the uniformization technique of Jensen [12]. In the queueing-theoretic context, a cumulative time amounts to the time spent in a specified set of states up until hitting another set of states, whence the Melamed–Yadin method [15, Sect. 2] is seen to be a powerful tool for computing sojourn times and waiting times in queueing systems with exponential servers and Poisson arrivals, such as tandem Jackson networks. Uniformization is also seen to be an appealing technique applied to time-inhomogeneous systems [17] and unbounded transition rates [18]. The paper by van Dijk et al. [19] is an excellent mathematical and intuitive review of the uniformization technique and some of its exact and approximate extensions, including steady state detection, adaptative uniformation and unbounded Markov decision processes. 
An elementary but striking property shown by Jensen [12] states that an irreducible CTMC and any of its uniformized Markov chains behave asymptotically the same in the limit of large time index. As argued by Kingman [13, Sect. 3], this property also holds for h-skeletons (or discrete skeletons at scale h), which are obtained by recording the state of the CTMC at a sequence of inspection times tn = nh, for a fixed length h > 0. We refer the reader to Chapter 5 of Anderson [1] for further details about communicating classes and classification of states—which are equivalent for the CTMC and any of its h-skeletons—and ergodic theorems. In this paper, we complement the classical work of Jensen [12] and Kingman [13] (see also Anderson [1, Chap. 5], van Dijk et al. [19], and references therein) by focusing on first passage times for a time-homogenous CTMC and their discrete counterparts in the resulting uniformized Markov chains and h-skeletons. More concretely, we first establish that, in the original setting of Jensen [12], the matrix of expected sojourn times in a proper non-closed communicating subset D of states for the CTMC is the same as the scaled matrix of expected sojourn times for any of its uniformized Markov chains. In a more general framework, we then study the dynamics of the CTMC before leaving states in D, and demonstrate that the first passage times to states in the outside of D for the h-skeleton are stochastically greater than the analogous first-passage times for the CTMC, for any time step h > 0, in the usual stochastic order.



2 The Continuous-Time Process X Under a Taboo

We consider a conservative time-homogeneous CTMC X = {X(t) : t ≥ 0} with values on a countable state space S, generator matrix Q = (q_{i,j} : i, j ∈ S) and standard transition function P_{i,j}(t) = P(X(t) = j | X(0) = i), for i, j ∈ S and t ≥ 0. Let D be a proper communicating subset of S satisfying that at least one state in its complement D^c = S \ D is accessible from D; i.e., D is a non-closed communicating class or non-essential class. For our purposes, we assume that X(0) ∈ D and define

T = inf{t > 0 : X(t) ∉ D}

as the first passage time to D^c, or the sojourn time in D.

In Sect. 2.1, the interest is in a scaled version of the matrix S_D of expected sojourn times in D before the first visit of process X to any state in D^c. This matrix has the form

S_D = ∫₀^∞ P_{D,D}(t; D^c) dt,

where P_{D,D}(t; D^c) is the taboo transition function with elements P(X(t) = j, T ≥ t | X(0) = i), for states i, j ∈ D and t > 0. It is well known that P_{D,D}(t; D^c) may be written in terms of the sub-matrix Q_D = (q_{i,j} : i, j ∈ D) as P_{D,D}(t; D^c) = exp{Q_D t}, from which it follows (see e.g. [4, Chap. 10]) that the matrix S_D of expected sojourn times is the minimal nonnegative solution of the following systems of equations:

S_D(−Q_D) = I,   −Q_D S_D = I,   (1)

where I denotes the identity matrix.

Remark 2.1 For any finite subset D, the first visit to the taboo subset D^c occurs almost surely in a finite expected time from any initial state i ∈ D, and S_D = −Q_D^{−1}. Indeed, T behaves as a continuous phase-type random variable of order d with representation (α, Q_D), where d denotes the cardinality of D and α is a row vector with entries P(X(0) = i), for i ∈ D; see e.g. Latouche and Ramaswami [14, Theorem 2.4.3]. As a result, E[T^k] = k! α(−Q_D^{−1})^k 1, for k ∈ N, where 1 is a column vector of 1's.



2.1 Expected Sojourn Times for Uniformized Markov Chains

Under the assumption that |q_{i,i}| ≤ h^{−1} < ∞, for states i ∈ S, it can be readily seen that the random variable X(t) is identically distributed to Y_{N(t)} at any time t, where N = {N(t) : t ≥ 0} is the counting process of a Poisson process with rate h^{−1}, Y = {Y_n : n ∈ N₀} is an aperiodic discrete-time Markov chain, termed the uniformized Markov chain, with one-step transition probability matrix P = I + hQ, and N and Y are assumed to be independent; see e.g. Çinlar [4, Chap. 8].

Remark 2.2 For the process X, one has that

P(X(t + h) = j | X(t) = i) = q_{i,j} h + o(h), if j ≠ i,
P(X(t + h) = i | X(t) = i) = 1 − Σ_{j∈S\{i}} q_{i,j} h + o(h),

for states i, j ∈ S and t ≥ 0, with h^{−1} o(h) → 0 as h → 0. Thus, the one-step transition probabilities P_{i,j} of the uniformized Markov chain Y can be seen as approximations of P(X(t + h) = j | X(t) = i) at time steps t = nh, for n ∈ N₀, provided that h is sufficiently small. Note that the condition |q_{i,i}| ≤ h^{−1} < ∞, for i ∈ S, is equivalent to the inequality 0 < h ≤ inf{|q_{i,i}|^{−1} : i ∈ S}.

For the uniformized Markov chain Y, Lemma 5.1.2 in Latouche and Ramaswami [14] tells us how to compute the matrix S̃_D of expected sojourn times in the subset D of states, before the first passage to D^c. Here, we recall that the entries of S̃_D are given by

Σ_{n=0}^{∞} P(Y_n = j, T′ ≥ n | Y_0 = i), i, j ∈ D,

where T′ = inf{n ∈ N : Y_n ∉ D}, so that

S̃_D = Σ_{n=0}^{∞} P^n_{D,D}(D^c),

where P^n_{D,D}(D^c) is the n-step transition probability matrix of Y under the taboo of D^c. This means that S̃_D is the minimal nonnegative solution of the systems

S̃_D(I − P_D) = I,   (I − P_D)S̃_D = I.   (2)

Observe that the matrices S_D and S̃_D are uniquely characterized from (1) and (2), respectively. Since I − P_D = −hQ_D, it is then seen that

S_D = h S̃_D.   (3)

On First Passage Times in Discrete Skeletons and Uniformized Versions …

33

Moreover, since the transition rates from states in D^c are not used in the underlying arguments, the equality (3) holds under the less restrictive assumption of uniformly bounded transition rates on D; i.e., |q_{i,i}| ≤ h^{−1} < ∞, for states i ∈ D. It is worth bearing in mind that, since T is a continuous random variable and hT′ is a discrete one, T and hT′ are not identically distributed, but their expectations are identical by (3), irrespective of the value h satisfying |q_{i,i}| ≤ h^{−1} < ∞, for i ∈ D. This first-order property does not necessarily extend to moments of higher order, as was noticed by Gómez-Corral et al. [8, Sect. 2.2] for the random duration T of an outbreak and its discrete counterpart hT′ in the SIS epidemic model.

Remark 2.3 For a finite subset D, Eq. (3) becomes S̃_D = (−hQ_D)^{−1}. It is also verified that the first passage time T′ to D^c is a discrete phase-type random variable of order d with representation (α, I + hQ_D); see e.g. Latouche and Ramaswami [14, Sect. 2.5].
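Equality (3) is easy to verify numerically. A sketch for a hypothetical two-state class D (illustration only):

```python
# Check of equality (3): S_D = h * S~_D, where S~_D = (I - P_D)^{-1} and
# P_D = I + h*Q_D, for a hypothetical 2-state class D.
QD = [[-2.0, 1.0], [0.5, -3.0]]
h = 0.25   # satisfies |q_ii| <= 1/h for the states of D

def inv2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

SD = inv2([[-QD[i][j] for j in range(2)] for i in range(2)])       # -Q_D^{-1}
PD = [[(1.0 if i == j else 0.0) + h * QD[i][j] for j in range(2)] for i in range(2)]
SDt = inv2([[(1.0 if i == j else 0.0) - PD[i][j] for j in range(2)] for i in range(2)])

# Entrywise, S_D equals h * S~_D, so in particular E[T] = h E[T'] for any
# initial state in D (the first-order property discussed in the text).
for i in range(2):
    for j in range(2):
        print(SD[i][j], h * SDt[i][j])
```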

2.2 A Simple Stochastic Ordering Property

Given a fixed value h > 0, the h-skeleton of a conservative time-homogeneous CTMC X is defined by Kingman [13] as the discrete-time Markov chain Z = {Z_n : n ∈ N_0} with Z_n = X(nh), which takes values in S with one-step transition probabilities P_{i,j}(h), for i, j ∈ S. It is worth noting that the division of the state space S into communicating classes for the h-skeleton is exactly the same as the one that results from the continuous-time process X. This implies that, under our assumptions on D, the submatrix P_D(h) = (P_{i,j}(h) : i, j ∈ D) consists of strictly positive entries and, consequently, Z is aperiodic. In addition, the submatrix P_{D,D^c}(h) = (P_{i,j}(h) : i ∈ D, j ∈ D^c) of one-step transition probabilities is a non-null matrix. It results from the definition of P_{D,D}(h; D^c) that

    P_{D,D}(h; D^c) ≤ P_D(h),    (4)

since the matrix P_{D,D}(h; D^c) is related to the dynamics of the process X up to time h under the taboo of D^c and the proper subset D is assumed to be non-closed. Similarly, since h[h^{-1}t] ≤ t, it is seen that

    P_{D,D}(t; D^c) ≤ P_{D,D}(h[h^{-1}t]; D^c),    (5)

for any time t > 0, where [·] denotes the integer part.

By defining T'' = inf{n ∈ N : Z_n ∉ D} as the sojourn time of the h-skeleton on D up until hitting the subset D^c, the inequalities (4) and (5) yield

    P_{D,D}(t; D^c) ≤ (P_D(h))^{[h^{-1}t]},    (6)

A. Gómez-Corral et al.

where the entries of the matrix (P_D(h))^{[h^{-1}t]} are equivalent to the taboo probabilities P(Z_n = j, T'' ≥ n | Z_0 = i) with n = [h^{-1}t], for states i, j ∈ D. Then, by post-multiplying both sides of (6) by the vector 1, it is found that

    P(T > t | X(0) = i) ≤ P(hT'' > t | Z_0 = i),    t > 0,

for any initial state i ∈ D, so that

    T ≤_st hT'',    (7)

in the usual stochastic order. For any nondecreasing function f, this implies that E[f(T) | X(0) = i] ≤ E[f(hT'') | Z_0 = i], for i ∈ D; more particularly, E[T | X(0) = i] ≤ h E[T'' | Z_0 = i].

Remark 2.4 In a similar manner to the first passage time T' in Remark 2.3, the first passage time T'' to D^c for the h-skeleton is a discrete phase-type random variable if D is assumed to be finite, and its representation of order d is given by (α, P_D(h)).
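The stochastic ordering (7) can also be checked numerically. The sketch below (plain Python, a hypothetical three-state chain with D = {1, 2} and illustrative rates) compares the survival function of T, computed from the sub-generator exponential e^{tQ_D}, with that of hT'', computed from powers of the sub-block P_D(h) of the transition matrix of the h-skeleton; the matrix exponential is approximated by a truncated Taylor series, which is adequate at this small scale.

```python
# Sketch: numerical check of T <=_st h T'' (Eq. (7)) on a hypothetical
# 3-state chain with D = {1, 2}; rates below are illustrative only.
from math import floor

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def expm(A, terms=40):
    """Truncated Taylor series for exp(A); adequate for small matrices."""
    n = len(A)
    E = [[float(i == j) for j in range(n)] for i in range(n)]
    P = [row[:] for row in E]
    for k in range(1, terms):
        P = matmul(P, A)
        P = [[x / k for x in row] for row in P]   # now P = A^k / k!
        E = [[E[i][j] + P[i][j] for j in range(n)] for i in range(n)]
    return E

mu1, lam1, mu2 = 0.5, 0.8, 0.9
Q = [[-0.3, 0.3, 0.0],                  # state 0 = D^c, non-absorbing
     [mu1, -(mu1 + lam1), lam1],
     [0.0, mu2, -mu2]]
QD = [[Q[1][1], Q[1][2]], [Q[2][1], Q[2][2]]]   # sub-generator on D

h = 0.25                                # skeleton step; no bound on h needed
Ph = expm([[h * Q[i][j] for j in range(3)] for i in range(3)])
PDh = [[Ph[1][1], Ph[1][2]], [Ph[2][1], Ph[2][2]]]  # P_D(h)

alpha = [1.0, 0.0]                      # start in state 1
for t in (0.3, 0.7, 1.1, 1.9):
    # P(T > t | X(0) = 1) = alpha * exp(t Q_D) * 1
    Et = expm([[t * QD[i][j] for j in range(2)] for i in range(2)])
    surv_T = sum(alpha[i] * Et[i][j] for i in range(2) for j in range(2))
    # P(h T'' > t | Z_0 = 1) = alpha * P_D(h)^[t/h] * 1
    n = floor(t / h)
    M = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        M = matmul(M, PDh)
    surv_hT2 = sum(alpha[i] * M[i][j] for i in range(2) for j in range(2))
    assert surv_T <= surv_hT2 + 1e-12   # Eq. (7)
```

The inequality holds for every t > 0, since e^{hQ_D} ≤ P_D(h) entrywise (the skeleton ignores excursions out of D) and survival probabilities decrease in t.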

3 Discussion and Concluding Remarks

For practical use, the scaled versions hT' and hT'' (the former for any process X with a bounded sub-matrix Q_D, the latter regardless of this sub-matrix) could be used to approximate the probability law of the first passage time T to the complement of a proper non-closed communicating subset D by selecting the time step h > 0 in an appropriate manner. In Figs. 1, 2 and Table 1, we briefly illustrate this issue for the random duration T of an outbreak in the SIS model with contact and recovery rates β > 0 and γ > 0, respectively. We assume that a single initially infective individual is hosted within a closed and homogeneously well-mixed population with constant size N. The process X, describing the number of infective individuals at any time t, is a finite birth-death process and T is a continuous phase-type random variable of order d = N with representation (α, Q_D), where D = {1, ..., N}, α = (1, 0, ..., 0) and

    Q_D = ⎛ −(λ_1 + μ_1)     λ_1                                          ⎞
          ⎜   μ_2        −(λ_2 + μ_2)     λ_2                             ⎟
          ⎜        ·              ·              ·                        ⎟
          ⎜           μ_{N−1}   −(λ_{N−1} + μ_{N−1})   λ_{N−1}            ⎟
          ⎝                           μ_N                −μ_N             ⎠

with λ_i = βN^{−1} i(N − i), for i ∈ {1, ..., N − 1}, and μ_i = γ i, for i ∈ {1, ..., N}.

In Figs. 1 and 2, the probability distribution functions F_T(t), F_{hT'}(t) and F_{hT''}(t) of T, hT' and hT'', respectively, are plotted as a function of t for values h ∈ {h_0, 2^{−1}h_0}, where h_0 = min{(λ_1 + μ_1)^{−1}, ..., (λ_{N−1} + μ_{N−1})^{−1}, μ_N^{−1}}. The population consists of N = 20 individuals, γ = 1.0 and β ∈ {0.5, 2.0}, whence the basic reproduction number R_0 = γ^{−1}β ∈ {0.5, 2.0}. The specific interval on the x-axis in both figures (i.e., t ∈ [0, 1]) is selected to make the probability distribution functions graphically distinguishable.

Fig. 1 The probability distribution function F_T(t) versus its discrete counterparts F_{hT'}(t) and F_{hT''}(t) for h = h_0 (left) and h = 2^{−1}h_0 (right), with h_0 = min{(λ_1 + μ_1)^{−1}, ..., (λ_{N−1} + μ_{N−1})^{−1}, μ_N^{−1}}, in the SIS model with R_0 = 0.5, N = 20 and X(0) = 1

Fig. 2 The probability distribution function F_T(t) versus its discrete counterparts F_{hT'}(t) and F_{hT''}(t) for h = h_0 (left) and h = 2^{−1}h_0 (right), with h_0 = min{(λ_1 + μ_1)^{−1}, ..., (λ_{N−1} + μ_{N−1})^{−1}, μ_N^{−1}}, in the SIS model with R_0 = 2.0, N = 20 and X(0) = 1

Table 1 Expected values E[T] versus hE[T'] and hE[T''], for h ∈ {h_0, 2^{−1}h_0} with h_0 = min{(λ_1 + μ_1)^{−1}, ..., (λ_{N−1} + μ_{N−1})^{−1}, μ_N^{−1}}, in the SIS model with R_0 ∈ {0.5, 2.0}, N = 20 and X(0) = 1

    R_0    E[T] = hE[T']    h            hE[T'']
    0.5    1.34194          h_0          1.36715
                            2^{−1}h_0    1.35449
    2.0    31.21071         h_0          31.23309
                            2^{−1}h_0    31.22187

In this sense, it should be pointed out that, from numerical experiments additional to those reported here, the probability distribution functions F_{hT'}(t) and F_{hT''}(t) are seen to approximate F_T(t) in a very accurate manner for time instants t ≤ K_{0.99}, where K_{0.99} denotes the 99th percentile of F_T(t). Without going into details, we may remark that the estimation error of F_T(t) by F_{hT'}(t) (respectively, F_{hT''}(t)) can be routinely measured in terms of the supremum of the differences |F_T(t) − F_{hT'}(t)| (respectively, |F_T(t) − F_{hT''}(t)|) over the subintervals C_k = {t ∈ [0, K_{0.99}] : [h^{−1}t] = k}, for integers k ∈ N_0. It is observed that, as intuition tells us, the smaller the value of h, the better the approximation of F_T(t) obtained by F_{hT'}(t) and F_{hT''}(t), regardless of the expected duration of the outbreak shown in Table 1. More particularly, it is observed that F_{hT'}(t) results in a better approximation of F_T(t) than F_{hT''}(t), as long as the specific value of h yields a well-defined uniformized Markov chain Y. Furthermore, Figs. 1 and 2 show how the scaled length hT' of the outbreak in the corresponding uniformized Markov chain Y is neither stochastically greater than, nor less than, nor equal to the random duration T of the outbreak in the SIS model.

To conclude, we remark that the theoretical and methodological aspects in Sects. 2.1 and 2.2 extend results by Gómez-Corral et al. [8, Sects. 2 and 3], linked to a specific finite birth-death process, to the more general setting of CTMCs with a state space S containing a countable communicating subset D of states. Therefore, our results on the first passage time T and its discrete analogues, hT' and hT'', in Sects.
2.1 and 2.2 can be readily applied to absorption times for non-finite birth-death processes (Artalejo et al. [2]) and competition processes (Iglehart [11]; Reuter [16]), including the two-species competition process and the host-parasitoid process, where first passage times amount to extinction times; see also Billard [3], Gómez-Corral and López García [6, 7], and Hitchcock [10], among others. More work is needed to investigate how the scaled first passage times hT' and hT'' could be used to improve the phase-type approximation of T in [6, Sect. 3], which is based on truncation of the state space and extreme values. In a general framework, there is clearly future work to be done on the comparison between our results on the discrete versions of T in Sects. 2.1 and 2.2 and the upper and lower bounds of Melamed and Yadin [15] on cumulative-time distributions. As a last remark, we note that an interesting open problem is the derivation of exact or approximate results on T, hT' and hT'' for time-inhomogeneous CTMCs, as well as their application to the analysis of seasonal fluctuations in epidemic models.

Acknowledgements The authors express their warm appreciation to Prof. Leandro Pardo for valuable discussions on the use of the Hellinger distance during the elaboration of the article [8]. Indeed, Eqs. (3) and (7) in the present work are motivated by the fact that the Hellinger distance between two probability measures requires these measures to be absolutely continuous with respect to a third probability measure. This work is supported by the Ministry of Science and Innovation (Government of Spain), Project PGC2018-097704-B-I00.


References

1. Anderson, W.J.: Continuous-Time Markov Chains. An Applications-Oriented Approach. Springer, New York (1991)
2. Artalejo, J.R., Gómez-Corral, A., López-García, M., Molina-Paris, C.: Stochastic descriptors to study the fate and potential of naive T cell clonotypes in the periphery. J. Math. Biol. 74, 673–708 (2017)
3. Billard, L.: Competition between two species. Stoch. Proc. Their Appl. 2, 391–398 (1974)
4. Çinlar, E.: Introduction to Stochastic Processes. Prentice-Hall, Englewood Cliffs (1975)
5. Fox, B.L., Glynn, P.W.: Computing Poisson probabilities. Commun. ACM 31, 440–445 (1988)
6. Gómez-Corral, A., López García, M.: Extinction times and size of the surviving species in a two-species competition process. J. Math. Biol. 64, 255–289 (2012)
7. Gómez-Corral, A., López García, M.: Maximum population sizes in host-parasitoid models. Int. J. Biomath. 6, 1350002 (2013)
8. Gómez-Corral, A., López-García, M., Rodríguez-Bernal, M.T.: On time-discretized versions of the stochastic SIS epidemic model: a comparative analysis. J. Math. Biol. 82, 46 (2021)
9. Gross, D., Miller, D.R.: The randomization technique as a modelling tool and solution procedure for transient Markov processes. Oper. Res. 32, 343–361 (1984)
10. Hitchcock, S.E.: Extinction probabilities in predator-prey models. J. Appl. Probab. 23, 1–13 (1986)
11. Iglehart, D.L.: Multivariate competition processes. Ann. Math. Stat. 35, 350–361 (1964)
12. Jensen, A.: Markoff chains as an aid in the study of Markoff processes. Scand. Actuar. J. 1953(sup 1), 87–91 (1953)
13. Kingman, J.F.C.: Ergodic properties of continuous-time Markov processes and their discrete skeletons. Proc. Lond. Math. Soc. 13, 593–604 (1963)
14. Latouche, G., Ramaswami, V.: Introduction to Matrix Analytic Methods in Stochastic Modelling. ASA-SIAM Series on Statistics and Applied Probability, Philadelphia (1999)
15. Melamed, B., Yadin, M.: Randomization procedures in the computation of cumulative-time distributions over discrete state Markov processes. Oper. Res. 32, 926–944 (1984)
16. Reuter, G.E.H.: Competition processes. In: Neyman, J. (ed.) Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 2: Contributions to Probability Theory, pp. 421–430. University of California Press, Berkeley (1961)
17. Van Dijk, N.M.: Uniformization for nonhomogeneous Markov chains. Oper. Res. Lett. 12, 283–291 (1992)
18. Van Dijk, N.M.: Approximate uniformization for continuous-time Markov chains with an application to performability analysis. Stoch. Proc. Their Appl. 40, 339–357 (1992)
19. Van Dijk, N.M., van Brummelen, S.P.J., Boucherie, R.J.: Uniformization: basics, extensions and applications. Perform. Eval. 118, 8–32 (2018)

A Numerical Approximation of a Two-Dimensional Atherosclerosis Model

Arturo Hidalgo and Lourdes Tello

Abstract This work concerns a mathematical model given by a system of two-dimensional nonlinear reaction-diffusion equations, with a nonlinear source term in one of the equations, which represents the first stages of atherosclerosis development as an inflammatory disease. In addition, this model incorporates a nonlinear non-homogeneous Neumann boundary condition which represents the recruitment of immune cells through the upper boundary as a response to the production of cytokines. The model is solved using a finite volume scheme with dimension-by-dimension Weighted Essentially Non-Oscillatory (WENO) reconstruction in space, using entire polynomials, unlike the pointwise WENO reconstruction commonly used, and a third-order Runge–Kutta Total Variation Diminishing (TVD) scheme for time integration. Two sets of parameters have been considered in the numerical simulations. The evolution of the inflammation is studied according to the results of the numerical simulation, depending on the values of the bio-physical parameters and the size of the initial inflammation.

1 Introduction

We consider a 2D mathematical model representing the first stages of atherosclerosis disease, which is based on nonlinear reaction-diffusion equations.

This work is dedicated to our friend and colleague Prof. Leandro Pardo. A. Hidalgo (B) Departamento de Ingeniería Geológica y Minera. ETS de Ingenieros de Minas y Energía. Center for Computational Simulation, Universidad Politécnica de Madrid, Calle Ríos Rosas, 21, 28003 Madrid, Spain e-mail: [email protected] L. Tello Departamento de Matemática Aplicada. ETS de Arquitectura. Center for Computational Simulation, Universidad Politécnica de Madrid, Av. Juan de Herrera, 4, 28040 Madrid, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_5


Cholesterol is a fatty substance essential to many metabolic processes. However, when there is an excess of cholesterol, it is transported in the blood flow by low density lipoproteins (LDL) and can accumulate in the arteries, giving rise to an inflammatory process called atherosclerosis. This usually happens as a consequence of some injury in the artery wall which facilitates the entrance of LDL. A detailed description of the biological process involved can be found, for instance, in [11, 13]. The mathematical model on which this work is based was proposed in [4, 5]. Motivated by these works, a numerical scheme based on an ADER-FV-WENO method was presented in [9] and applied to solve the 1D reaction-diffusion model; the numerical simulations conducted there made it possible to verify certain properties that had been theoretically stated and proved. Recently, in [8], a variant of this mathematical model was introduced which incorporates porous-medium type nonlinear diffusion, in the 1D case, due to the fact that the artery wall is a porous medium, as pointed out in several references dealing with this idea. Another alternative was reported in [15], where the authors propose a hyperbolization method, based upon Cattaneo's approach, applied to reaction-diffusion problems, including an application to atherosclerosis. In [7] there is an interesting mathematical model which considers intraplaque neovascularization as well as chemotaxis and haptotaxis phenomena. Results on the study of steady state solutions and bifurcation have been obtained in [1], where the authors study fold bifurcation when the flux of LDL from the blood is sufficiently high. As aforementioned, the model considered in this work is a system of reaction-diffusion equations in two space dimensions which incorporates a nonlinear non-homogeneous boundary condition.
The numerical scheme developed in this research is based on the finite volume method (FVM), where a third-order Runge–Kutta TVD scheme, such as that proposed in [6], has been used for time integration and WENO5 for space reconstruction. Concerning the WENO approach, we recall that it was first introduced in [10, 12], where pointwise values of the reconstruction polynomials are used. The two-dimensional WENO reconstruction is performed following the so-called dimension-by-dimension approach (see also [2, 3, 14]), since it is less computationally expensive than the fully 2D WENO reconstruction, obtaining entire polynomials instead of reconstructing pointwise values as in the classical WENO approach. This way of proceeding has the advantage of offering the values and gradients of the polynomials where they are needed, for instance at cell interfaces or Gaussian quadrature points. The drawback of this type of interpolation is that the order of accuracy of the scheme may be reduced, as pointed out in the aforementioned references. The rest of the document is organized as follows. In the next section the mathematical model describing the 2D atherosclerosis problem is set up; then the numerical method used, based on the FV RK3TVD-WENO approach, is described; next, some numerical results are shown; finally, some conclusions are given.


2 2D Mathematical Model

The 2D mathematical model considered in this work (see [4, 5, 9]), which describes the first stages of the atherosclerosis process, reads

    ∂M/∂t = d1 ΔM − λ1 M,    (x, y) ∈ (0, L) × (0, h), t > 0,
    ∂A/∂t = d2 ΔA + f2(A)M − λ2 A,    (x, y) ∈ (0, L) × (0, h), t > 0,
    ∂M/∂y (x, 0, t) = ∂A/∂y (x, 0, t) = 0,    x ∈ (0, L), t > 0,
    ∂M/∂x (0, y, t) = ∂A/∂x (0, y, t) = ∂M/∂x (L, y, t) = ∂A/∂x (L, y, t) = 0,    y ∈ (0, h), t > 0,
    ∂M/∂y (x, h, t) = (h/d1) f1(A),    ∂A/∂y (x, h, t) = 0,    x ∈ (0, L), t > 0,
    M(x, y, 0) = M0(x, y),    A(x, y, 0) = A0(x, y),    (x, y) ∈ (0, L) × (0, h),    (1)

where M is the concentration of immune cells (including monocytes, macrophages and foam cells), A is the concentration of cytokines (both pro- and anti-inflammatory), d1 and d2 are diffusion coefficients, λ1 and λ2 represent the degradation rates of immune cells and cytokines respectively, h is the thickness of the artery, t represents the time and x, y are the spatial coordinates along the artery and across the thickness, respectively. There is also a source term in the second equation, namely f2(A)M, and another term on the upper boundary, denoted f1(A), both of them depending on the concentration of cytokines, whose expressions are

    f1(A) = (α1 + β1 A)/(1 + A/τ1) > 0,    f2(A) = α2 A/(1 + A/τ2),    (2)

where α1, α2, β1, τ1 and τ2 are known constants. Since cytokines are secreted by the monocytes and give rise to the recruitment of monocytes, the factor β1 stands for the auto-amplification factor of the recruitment of monocytes. The term 1 + A/τ1 represents the effect produced by the fibrous cap, where τ1 is the characteristic time of the formation of this fibrous cap, which entails the saturation of the recruitment of immune cells. The term 1 + A/τ2 represents the inhibition of the pro-inflammatory cytokines due to the effect of anti-inflammatory cytokines, where τ2 is the characteristic time for this inhibition to take place. All the parameters of the model, d1, d2, α1, α2, β1, τ1, τ2, λ1 and λ2, are assumed positive. It must also be verified that τ1 > α1/β1. We remark that, unlike in the 1D model, in the first equation of the system there is no source term, only a degradation one, since the recruitment of immune cells takes place through the upper boundary of the domain, where a nonlinear non-homogeneous boundary condition is given. In Fig. 1 a schematic representation of the 2D domain with the boundary conditions is depicted. The domain is the rectangle [0, L] × [0, h]


Fig. 1 Schematic representation of the 2D domain with the boundary conditions. Recruitment of immune cells takes place through the upper boundary

where h is the width of the artery, which has a very small dimension compared to the length L. The expected result here is to recover the 1D behaviour when h → 0. The 2D model (1) is a well-posed problem in the sense of existence and uniqueness of solutions; for example, the method of upper and lower solutions guarantees it, thanks to the properties of the functions f1 and f2.
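The kinetic functions (2) are easy to inspect numerically. The following sketch (plain Python; the parameter values are illustrative, chosen as in Set 1 of Table 1) evaluates f1 and f2 and checks the condition τ1 > α1/β1, under which f1 is strictly increasing with saturation level β1 τ1.

```python
# Sketch: the kinetic terms (2) of the model; the parameter values are
# illustrative (as in Set 1 of Table 1).
alpha1, beta1, tau1 = 2.0, 8.0, 1.0
alpha2, tau2 = 7.0, 6.5

def f1(A):
    """Recruitment term on the upper boundary: (a1 + b1*A) / (1 + A/tau1)."""
    return (alpha1 + beta1 * A) / (1.0 + A / tau1)

def f2(A):
    """Cytokine production term: a2*A / (1 + A/tau2)."""
    return alpha2 * A / (1.0 + A / tau2)

# The condition tau1 > alpha1/beta1 makes f1 strictly increasing,
# with saturation level beta1*tau1 as A grows large.
assert tau1 > alpha1 / beta1
A_grid = [0.1 * k for k in range(100)]
vals = [f1(A) for A in A_grid]
assert all(v2 > v1 for v1, v2 in zip(vals, vals[1:]))
assert f1(0.0) == alpha1 and f2(0.0) == 0.0
```

In the healthy state A = 0 there is no cytokine production (f2(0) = 0), while a baseline recruitment f1(0) = α1 remains, in line with the interpretation of the boundary condition in (1).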

3 Numerical Approximation

In this work we apply a numerical scheme to solve the system (1), built in the finite volume framework, with a third-order Runge–Kutta TVD approach for time integration and dimension-by-dimension WENO reconstruction in space to compute intercell values and gradients. As mentioned above, the WENO reconstruction applied here makes use of entire polynomials, as introduced, for instance, in [2, 3], instead of the more classical pointwise WENO reconstruction [10, 12]. This is especially relevant when solving reaction-diffusion problems, where gradients of the solution are involved. Previous numerical results for a 1D atherosclerosis model, based on the ADER approach, have been reported in [8, 9].

In order to proceed, it is useful to write the system (1) in the following vector form

    ∂u/∂t + ∂F/∂x + ∂G/∂y = R,    (3)

where u = (M, A)^T,

    F = −(d1 ∂M/∂x, d2 ∂A/∂x)^T,    G = −(d1 ∂M/∂y, d2 ∂A/∂y)^T    (4)

are the fluxes in the x and y directions, respectively, and

    R = (−λ1 M, f2(A)M − λ2 A)^T    (5)

is the source-reaction term. We consider the control volume I_ij = [x_{i−1/2}, x_{i+1/2}] × [y_{j−1/2}, y_{j+1/2}] and integrate the system (3), dividing by Δx_i Δy_j, where Δx_i = x_{i+1/2} − x_{i−1/2} and Δy_j = y_{j+1/2} − y_{j−1/2}, to get

    du_{i,j}/dt = −(1/(Δx_i Δy_j)) ∫_{y_{j−1/2}}^{y_{j+1/2}} [F(x_{i+1/2}, y, t) − F(x_{i−1/2}, y, t)] dy
                  − (1/(Δx_i Δy_j)) ∫_{x_{i−1/2}}^{x_{i+1/2}} [G(x, y_{j+1/2}, t) − G(x, y_{j−1/2}, t)] dx + R̄_{i,j},    (6)

where we have introduced the cell average of the source term, given by

    R̄_{i,j} = (1/(Δx_i Δy_j)) ∫_{x_{i−1/2}}^{x_{i+1/2}} ∫_{y_{j−1/2}}^{y_{j+1/2}} R(x, y, t) dy dx.    (7)
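The edge and cell integrals in (6)–(7) are approximated later by Gaussian quadrature (cf. (17)–(18)). As a minimal illustration, a two-point Gauss–Legendre rule mapped to the reference interval [0, 1] (nodes (1 ∓ √3/3)/2, weights 1/2 in this normalization) integrates polynomials of degree up to 3 exactly, and 2D cell integrals follow by tensor product:

```python
# Sketch: two-point Gauss-Legendre quadrature on [0, 1]; exact for
# polynomials of degree <= 3.
from math import sqrt

nodes = [(1.0 - 1.0 / sqrt(3.0)) / 2.0, (1.0 + 1.0 / sqrt(3.0)) / 2.0]
weights = [0.5, 0.5]

def gauss2(f):
    """Approximate the integral of f over [0, 1]."""
    return sum(w * f(x) for w, x in zip(weights, nodes))

# Exact values: integral of x^3 over [0,1] is 1/4; of x^2, 1/3.
assert abs(gauss2(lambda x: x ** 3) - 0.25) < 1e-12
assert abs(gauss2(lambda x: x ** 2) - 1.0 / 3.0) < 1e-12

def gauss2_2d(f):
    """Tensor-product rule for a cell integral over [0,1] x [0,1]."""
    return sum(wi * wj * f(xi, xj)
               for wi, xi in zip(weights, nodes)
               for wj, xj in zip(weights, nodes))

assert abs(gauss2_2d(lambda x, y: x * y) - 0.25) < 1e-12
```

The same tensor-product construction underlies the edge and volume quadratures used for the fluxes and the source term below.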

3.1 2D WENO Reconstruction

In order to obtain the fluxes appearing in the expression (6), we apply a dimension-by-dimension WENO reconstruction procedure (see, for instance, [2, 3, 14]). This technique is briefly described in the following. The reconstruction is achieved for each particular cell I_ij using a set of 1D stencils which, for each Cartesian direction, are given by

    S^{s,x}_{ij} = ∪_{e=i−L}^{i+R} I_{ej},    S^{s,y}_{ij} = ∪_{e=j−L}^{j+R} I_{ie},    (8)

where L and R are the spatial extensions of the stencil to the left and to the right, respectively. If M denotes the degree of the polynomial in each stencil, then for odd order schemes we consider three stencils and for even order schemes we consider four stencils. Therefore, in the case of odd order schemes we have a central stencil (s = 1, L = R = M/2), a fully left-biased stencil (s = 2, L = M, R = 0) and a fully right-biased stencil (s = 3, L = 0, R = M), whereas for even order schemes four stencils are always adopted: two central stencils (s = 0, L = floor(M/2) + 1, R = floor(M/2), and s = 1, L = floor(M/2), R = floor(M/2) + 1), one left-sided stencil (s = 2, L = M, R = 0) and one right-sided stencil (s = 3, L = 0, R = M), as illustrated in Fig. 2. It is useful to introduce a set of reference coordinates, for instance x = x_{i−1/2} + ξ Δx_i, y = y_{j−1/2} + η Δy_j.

(1) WENO reconstruction in the x direction. We consider the polynomial of each candidate stencil for the control volume I_ij, which is expressed as

    w_h^{s,x}(x, t^n) = Σ_{p=0}^{M} ψ_p(ξ) ŵ^{n,s}_{ij,p},    ∀ S^{s,x}_{ij},    (9)

where the ψ_p are convenient basis functions (usually Lagrange or Legendre ones) and the ŵ^{n,s}_{ij,p} are the coefficients of the interpolation. Now we impose integral conservation over each stencil to yield

    (1/Δx_e) ∫_{x_{e−1/2}}^{x_{e+1/2}} Σ_{p=0}^{M} ψ_p(ξ(x)) ŵ^{n,s}_{ij,p} dx = ū^n_{ej},    ∀ I_{ej} ∈ S^{s,x}_{ij},    (10)

and perform the solution-dependent nonlinear combination

    w_h^{x}(x, t^n) = Σ_{p=0}^{M} ψ_p(ξ) ŵ^{n}_{ij,p},  with  ŵ^{n}_{ij,p} = Σ_{s=1}^{N_s} ω_s ŵ^{n,s}_{ij,p},    (11)

where we use the nonlinear weights

    ω_s = ω̃_s / Σ_k ω̃_k,    ω̃_s = λ_s / (σ_s + ε)^r,

with, for instance, ε = 10^{−20} and r = 3, although other values can be used. The oscillation indicators are given by σ_s = Σ_{p=1}^{M} Σ_{m=1}^{M} Σ_{pm} ŵ^{n,s}_{ij,p} ŵ^{n,s}_{ij,m}, which require the computation of the oscillation matrix

    Σ_{pm} = Σ_{α=1}^{M} ∫_0^1 (∂^α ψ_p(ξ)/∂ξ^α) (∂^α ψ_m(ξ)/∂ξ^α) dξ.    (12)
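The nonlinear weighting in (11) can be sketched in a few lines. In the snippet below (plain Python), the oscillation indicators σ_s are taken as given inputs and the linear weights λ_s follow the usual convention of favouring the central stencil; all numerical values are illustrative.

```python
# Sketch: WENO nonlinear weights omega_s from oscillation indicators
# sigma_s (Eq. (11)); the lambda_s and sigma_s values are illustrative.
eps, r = 1e-20, 3                      # as in the text

def weno_weights(sigma, lam):
    raw = [l / (s + eps) ** r for l, s in zip(lam, sigma)]
    total = sum(raw)
    return [w / total for w in raw]

lam = [100.0, 1.0, 1.0]                # central stencil strongly preferred

# Smooth data: comparable indicators -> weights close to the linear ones.
w_smooth = weno_weights([1.0e-3, 1.1e-3, 0.9e-3], lam)
assert abs(sum(w_smooth) - 1.0) < 1e-12
assert w_smooth[0] > w_smooth[1]

# A discontinuity seen by stencil s = 1 -> its weight collapses.
w_shock = weno_weights([1.0e-3, 10.0, 0.9e-3], lam)
assert w_shock[1] < 1e-6
```

This is the essential nonlinearity of the scheme: near smooth data the combination reproduces the high-order central reconstruction, while a stencil crossing a steep gradient is effectively switched off.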

(2) WENO reconstruction in the y direction, using as input the M + 1 degrees of freedom ŵ^n_{ij,p}. We repeat the process followed in the x direction for each coefficient ŵ^n_{ij,p}.

Fig. 2 Stencils considered in the 2D dimension-by-dimension WENO reconstruction

For each candidate stencil we now have

    w_h^{s,y}(x, y, t^n) = Σ_{q=0}^{M} Σ_{p=0}^{M} ψ_p(ξ) ψ_q(η) ŵ^{n,s}_{ij,pq}.    (13)

We now apply integral conservation in the y direction,

    (1/Δy_e) ∫_{y_{e−1/2}}^{y_{e+1/2}} Σ_{q=0}^{M} ψ_q(η(y)) ŵ^{n,s}_{ij,pq} dy = ŵ^n_{ie,p},    ∀ I_{ie} ∈ S^{s,y}_{ij},    (14)

and perform a nonlinear combination

    w_h^{y}(x, y, t^n) = Σ_{q=0}^{M} Σ_{p=0}^{M} ψ_p(ξ) ψ_q(η) ŵ^{n}_{ij,pq},  with  ŵ^{n}_{ij,pq} = Σ_{s=1}^{N_s} ω_s ŵ^{n,s}_{ij,pq}.    (15)

The final expression of the reconstruction polynomial then reads

    w_{ij}(ξ, η, t^n) = Σ_{k=1}^{M+1} Σ_{l=1}^{M+1} ŵ^{k,l}_{ij}(t^n) ψ_k(ξ) ψ_l(η).    (16)
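The integral-conservation condition (10) amounts to a small linear system per stencil. The 1D sketch below (plain Python, monomial basis ψ_p(ξ) = ξ^p, uniform cells of unit length, central three-cell stencil) recovers a quadratic exactly from its cell averages; all names are local to this example.

```python
# Sketch: 1D reconstruction from cell averages (cf. Eq. (10)) on the
# central three-cell stencil, monomial basis psi_p(xi) = xi**p, dx = 1.
def solve3(M, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    A = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (A[r][3] - sum(A[r][c] * x[c] for c in range(r + 1, 3))) / A[r][r]
    return x

def avg_row(a):
    """Integrals of 1, xi, xi^2 over the cell [a, a+1] (local coordinates)."""
    return [1.0,
            ((a + 1) ** 2 - a ** 2) / 2.0,
            ((a + 1) ** 3 - a ** 3) / 3.0]

M = [avg_row(a) for a in (-1, 0, 1)]     # cells i-1, i, i+1
u = lambda xi: 3 * xi ** 2 - 2 * xi + 1  # test function (a quadratic)
ubar = [3.0, 1.0, 5.0]                   # its exact cell averages

c = solve3(M, ubar)                      # reconstruction coefficients
w = lambda xi: c[0] + c[1] * xi + c[2] * xi ** 2
for xi in (0.0, 0.3, 0.5, 1.0):
    assert abs(w(xi) - u(xi)) < 1e-10    # quadratic recovered exactly
```

Because the reconstruction is kept as an entire polynomial, both its values and its derivatives are immediately available at any point of the cell, which is exactly what the diffusive fluxes below require.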

We approximate the integrals appearing in (6) using appropriate Gaussian quadrature formulas:

    F_{i+1/2,j} = (1/Δy_j) ∫_{y_{j−1/2}}^{y_{j+1/2}} F(x_{i+1/2}, y, t^n) dy = ∫_0^1 F̂(1, η, t^n) dη ≈ Σ_{k=1}^{NG} γ^y_k F̂(1, β_k, t^n),
    F_{i−1/2,j} = (1/Δy_j) ∫_{y_{j−1/2}}^{y_{j+1/2}} F(x_{i−1/2}, y, t^n) dy = ∫_0^1 F̂(0, η, t^n) dη ≈ Σ_{k=1}^{NG} γ^y_k F̂(0, β_k, t^n),
    G_{i,j+1/2} = (1/Δx_i) ∫_{x_{i−1/2}}^{x_{i+1/2}} G(x, y_{j+1/2}, t^n) dx = ∫_0^1 Ĝ(ξ, 1, t^n) dξ ≈ Σ_{k=1}^{NG} γ^x_k Ĝ(α_k, 1, t^n),
    G_{i,j−1/2} = (1/Δx_i) ∫_{x_{i−1/2}}^{x_{i+1/2}} G(x, y_{j−1/2}, t^n) dx = ∫_0^1 Ĝ(ξ, 0, t^n) dξ ≈ Σ_{k=1}^{NG} γ^x_k Ĝ(α_k, 0, t^n),    (17)

where α_k and β_k are the Gaussian quadrature points in the ξ and η directions, respectively, γ^x_k and γ^y_k are the Gaussian weights and NG is the number of Gaussian points in each Cartesian direction. The numerical integration of the source term reads

    R̄_{i,j}(t^n) = (1/(Δx_i Δy_j)) ∫_{y_{j−1/2}}^{y_{j+1/2}} ∫_{x_{i−1/2}}^{x_{i+1/2}} R(x, y, t^n) dx dy = ∫_0^1 ∫_0^1 R̂(ξ, η, t^n) dξ dη ≈ Σ_{k=1}^{NG} Σ_{l=1}^{NG} γ^x_k γ^y_l R̂(α_k, β_l, t^n).    (18)

The solution at each Gaussian point may be expressed as

    Û(1, β_k, t^n) = w_{ij}(1, β_k, t^n),    Û(0, β_k, t^n) = w_{ij}(0, β_k, t^n),
    Û(α_k, 1, t^n) = w_{ij}(α_k, 1, t^n),    Û(α_k, 0, t^n) = w_{ij}(α_k, 0, t^n).    (19)

Regarding the gradients, for each particular Gaussian point we have the following approximations:

    ∇Û(1, β_k, t^n) = ∇w_{ij}(1, β_k, t^n),    ∇Û(0, β_k, t^n) = ∇w_{ij}(0, β_k, t^n),
    ∇Û(α_k, 1, t^n) = ∇w_{ij}(α_k, 1, t^n),    ∇Û(α_k, 0, t^n) = ∇w_{ij}(α_k, 0, t^n).    (20)

In the particular case of NG = 2, the Gaussian weights are γ^{x,y}_0 = γ^{x,y}_1 = 1 and the Gaussian points are α_0 = β_0 = −√3/3 and α_1 = β_1 = √3/3. After applying the finite volume scheme with WENO reconstruction, we obtain a system of ordinary differential equations

    du/dt = L(u(t), ∇u(t)),    (21)

where

    L(u(t), ∇u(t)) = −(1/Δx_i)(F_{i+1/2,j} − F_{i−1/2,j}) − (1/Δy_j)(G_{i,j+1/2} − G_{i,j−1/2}) + R̄_{i,j}.    (22)

Concerning time integration, we use a third-order Runge–Kutta TVD scheme (see [6]), which reads

    u^{k,1} = u^n + Δt L(u^n),
    u^{k,2} = (3/4) u^n + (1/4) u^{k,1} + (1/4) Δt L(u^{k,1}),
    u^{n+1} = (1/3) u^n + (2/3) u^{k,2} + (2/3) Δt L(u^{k,2}).
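The three-stage scheme above is straightforward to code. The sketch below (plain Python) applies one hundred steps of it to the scalar test problem u' = −u, for which the exact solution is known; the operator L here is a simple stand-in for the full semi-discrete right-hand side (21)–(22).

```python
# Sketch: third-order TVD Runge-Kutta of Gottlieb & Shu applied to the
# scalar test problem u' = -u (a stand-in for the operator L in (21)).
from math import exp

def rk3_tvd_step(u, dt, L):
    u1 = u + dt * L(u)
    u2 = 0.75 * u + 0.25 * u1 + 0.25 * dt * L(u1)
    return u / 3.0 + 2.0 * u2 / 3.0 + 2.0 * dt * L(u2) / 3.0

L = lambda u: -u
u, dt, n = 1.0, 0.01, 100           # integrate up to t = 1
for _ in range(n):
    u = rk3_tvd_step(u, dt, L)

# Third-order accuracy: the global error behaves as O(dt^3).
assert abs(u - exp(-1.0)) < 1e-6
```

Each stage is a convex combination of forward Euler steps, which is what gives the scheme its strong-stability (TVD) property under the Euler time-step restriction.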

3.2 Numerical Results of the 2D Model

In the following numerical examples we use the values of the parameters given in Table 1. We start by verifying numerically the accuracy of the numerical scheme. In order to proceed, we apply the well-known Method of Manufactured Solutions (MMS) to the problem (1). Thus, we define a manufactured solution which, in this

Table 1 Parameters used in the numerical simulations (L = 1)

           α1    α2    β1    τ1    τ2       λ1    λ2    d1        d2
    Set 1  2     7     8     1     6.5      1     26    10^{−3}   10^{−5}
    Set 2  2     1     8     1     42/43    1     1     10^{−3}   10^{−5}

Fig. 3 Validation of the numerical scheme: comparison of the numerical solution (symbols) with the exact (manufactured) solution (23) (solid line) of the system (1), for the values labeled as Set 1 in Table 1, at output time t = 1. Left frame: density of immune cells (M) along x = 0.5. Right frame: density of cytokines (A) along y = 0.5

work, will read

    M(x, y, t) = 80 t x²(x − 1)² y² exp(−3x² − 3y²),
    A(x, y, t) = 100 (1 + t) x²(x − 1)² y² (y − 1)² exp(−x² − y²).    (23)

The output time is t = 1 and the number of control volumes is 60 × 60 in [0, 1] × [0, 1]. The L² norms of the errors obtained are ||ε_imc||_{L²} = 6.19 × 10⁻⁴ and ||ε_cyt||_{L²} = 4.81 × 10⁻⁴ for immune cells and cytokines, respectively. In Fig. 3 the exact and numerical solutions are compared for both immune cells (left frame) and cytokines (right frame), showing good agreement.

Now we solve the problem (1) with the parameters from Set 1, which corresponds to the monostable situation, and introduce two different initial conditions, namely a small perturbation (24) and a large perturbation (25) of the healthy state, which is considered to be (M, A)^T = (2, 0)^T:

    M(x, y, 0) = 2.5 if x ∈ [0.4, 0.6],   2 if x ∉ [0.4, 0.6],
    A(x, y, 0) = 2.5 if x ∈ [0.4, 0.6],   0 if x ∉ [0.4, 0.6],    ∀(x, y) ∈ [0, 1] × [0, 10⁻³];    (24)

    M(x, y, 0) = 12.5 if x ∈ [0.4, 0.6],  2 if x ∉ [0.4, 0.6],
    A(x, y, 0) = 8.5 if x ∈ [0.4, 0.6],   0 if x ∉ [0.4, 0.6],    ∀(x, y) ∈ [0, 1] × [0, 10⁻³].    (25)

The results obtained for immune cells and cytokines are shown in Fig. 4. The mesh taken is formed by 30 × 5 control volumes. In all the following plots, the y-axis has been deliberately stretched for a better visualization. In the context of

Fig. 4 Solution of the system (1) for the values labeled as Set 1 in Table 1, with output time t = 10. Top row: large perturbation (25), immune cells (left), cytokines (right). Bottom row: small perturbation (24), immune cells (left), cytokines (right)

Fig. 5 Solution of the system (1) for the values labeled as Set 2 in Table 1, with initial condition (25) and output time t = 10. Large perturbation of the healthy state: immune cells (left), cytokines (right)

atherosclerosis disease, these results show a situation in which the inflammation persists in time, regardless of the size of the initial inflammation. In the last situation, corresponding to Set 2 with the initial condition (25), the inflammation tends to diminish with time, due to the bistable behavior; the results obtained for immune cells and cytokines in this case are shown in Fig. 5. The results attained in these examples behave in a similar way to those obtained for the 1D case as time evolves, since the width of the domain is small. In these examples we have considered h = 0.001. Nevertheless, it is important to note that in the 2D case the solution evolves towards non-constant states, whereas in the 1D case constant steady states are obtained [4, 5, 8, 9].

4 Conclusions

In this work we have considered a 2D reaction-diffusion model representing the first stages of atherosclerosis disease. This model is based on the one proposed in [5] and later developed in [4, 9]. The model also incorporates a nonlinear non-homogeneous boundary condition which represents the recruitment of monocytes through the upper boundary, which is the contact with the blood flow. With the aim of obtaining a numerical solution of the proposed problem, a finite volume scheme with dimension-by-dimension WENO reconstruction and a third-order Runge–Kutta TVD scheme for time integration is used. The scheme is applied to two different sets of data, taken from the bibliography, obtaining different behaviours as time advances depending on the values of the parameters and on the initial condition considered.

Acknowledgements This work is partially supported by the research project PID2020-112517GB-I00 of Ministerio de Ciencia e Innovación (Spain).

References

1. Chalmers, A.D., Cohen, A., Bursill, C.A., Myerscough, M.R.: Bifurcation and dynamics in a mathematical model of early atherosclerosis. J. Math. Biol. 71(6), 1451–1480 (2015)
2. Dumbser, M., Hidalgo, A., Zanotti, O.: High order spacetime adaptive ADER-WENO finite volume schemes for non-conservative hyperbolic systems. Comput. Meth. Appl. Mech. Eng. 268, 359–387 (2014)
3. Dumbser, M., Zanotti, O., Hidalgo, A., Balsara, D.S.: ADER-WENO finite volume schemes with spacetime adaptive mesh refinement. J. Comput. Phys. 248, 257–286 (2013)
4. El Khatib, N., Genieys, S., Kazmierczak, B., Volpert, V.: Reaction-diffusion model of atherosclerosis development. J. Math. Biol. 65(2), 349–374 (2012)
5. El Khatib, N., Genieys, S., Volpert, V.: Atherosclerosis initiation modeled as an inflammatory process. Math. Model. Nat. Phenom. 2(2), 126–141 (2007)
6. Gottlieb, S., Shu, C.-W.: Total variation diminishing Runge-Kutta schemes. Math. Comput. 67(221), 73–85 (1998)
7. Guo, M., Cai, Y., Yao, X., Li, Z.: Mathematical modeling of atherosclerotic plaque destabilization: role of neovascularization and intraplaque hemorrhage. J. Theor. Biol. 450, 53–65 (2018)
8. Hidalgo, A., Tello, L.: Numerical simulation of a porous medium-type atherosclerosis initiation model. Comput. Fluids 169, 380–387 (2018)
9. Hidalgo, A., Tello, L., Toro, E.F.: Numerical and analytical study of an atherosclerosis inflammatory disease model. J. Math. Biol. 68(7), 1785–1814 (2014)
10. Jiang, G.-S., Shu, C.-W.: Efficient implementation of weighted ENO schemes. J. Comput. Phys. 126(1), 202–228 (1996)
11. Libby, P.: Inflammation in atherosclerosis. Nature 420, 19–26 (2002)


A. Hidalgo and L. Tello

12. Liu, X.-D., Osher, S., Chan, T.: Weighted essentially non-oscillatory schemes. J. Comput. Phys. 115(1), 200–212 (1994)
13. Ross, R.: Atherosclerosis: an inflammatory disease. New Engl. J. Med. 340(2), 115–126 (1999)
14. Titarev, V.A., Toro, E.F.: Finite-volume WENO schemes for three-dimensional conservation laws. J. Comput. Phys. 201(1), 238–260 (2004)
15. Toro, E.F., Montecinos, G.: Advection-Diffusion-Reaction equations: hyperbolization and high-order ADER discretizations. SIAM J. Sci. Comput. 36(5), A2423–A2457 (2014)

Limit Results for L_p Functionals of Weighted CUSUM Processes

Lajos Horváth and Gregory Rice

Abstract The cumulative sum (CUSUM) process is often used in change point analysis to detect changes in the mean of sequentially observed data. We provide a full description of the asymptotic distribution of L_p, 1 ≤ p < ∞, functionals of the weighted CUSUM process for time series under general conditions.

1 L_p Functionals of Cumulative Sum Processes

Let X_1, X_2, \ldots, X_N be a sequence of scalar observations following a simple at-most-one change point in the mean model

$$X_i = \begin{cases} \mu_0 + \varepsilon_i, & \text{if } 1 \le i \le k^*,\\ \mu_A + \varepsilon_i, & \text{if } k^* + 1 \le i \le N, \end{cases} \qquad (1)$$

where k^* is the unknown change point, and μ_0 and μ_A denote the means before and after the change point. To identify the mean parameters, we assume that Eε_i = 0, 1 ≤ i ≤ N. The following developments are motivated by methods that arise in testing H_0 : k^* > N

L. Horváth (B) Department of Mathematics, University of Utah, Salt Lake City, UT 84112-0090, USA e-mail: [email protected]
G. Rice Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_6


against the alternative of a change point in the mean H_A : k^* < N and μ_0 ≠ μ_A. Change point detection has been an important and growing area of research in statistics and econometrics for the past several decades. For reviews we refer to Csörgő and Horváth [7], Aue and Horváth [2], and Horváth and Rice [13]. Most statistics employed in this testing problem are based on or connected to the cumulative sum (CUSUM) process

$$Z(x) = \sum_{i=1}^{\lfloor x \rfloor} X_i - \frac{\lfloor x \rfloor}{N} \sum_{i=1}^{N} X_i, \qquad 0 \le x \le N,$$

where ⌊x⌋ denotes the integer part of x. Let Z_N(t) = N^{-1/2} Z((N+1)t/N), 0 ≤ t ≤ 1. The asymptotic properties and Gaussian approximations of Z_N(t) are investigated in Csörgő and Horváth [7], mainly in the case of independent and identically distributed ε_i's. The behaviour of Z_N(t) is similar to that of empirical processes. Csörgő and Horváth [6] reviews results on the asymptotics of the uniform empirical and quantile processes, providing necessary and sufficient conditions for the convergence in distribution of their supremum as well as of L_p functionals. In change point analysis supremum functionals of Z_N(t), perhaps with suitable weights applied, are often considered, since if the no change in the mean hypothesis H_0 is rejected, the location at which the supremum is attained can be used to estimate the time of change. However, it is well known in empirical process theory (cf. Shorack and Wellner [16]) that the rate of convergence is faster for L_p functionals when compared to supremum functionals. The Cramér–von Mises statistic, which is the L_2 functional of the standard empirical process, has received special attention in the literature. In the present note we provide limit results for L_p functionals of Z_N(t) under general conditions. Throughout we assume that H_0 holds. It follows from Chap. 3 of Billingsley [4] that if

$$N^{-1/2} \sum_{i=1}^{\lfloor Nt \rfloor} \varepsilon_i \xrightarrow{D[0,1]} \sigma W(t), \qquad (2)$$

then

$$Z_N(t) \xrightarrow{D[0,1]} \sigma B(t), \qquad (3)$$

where σ > 0, {W (t), 0 ≤ t ≤ 1} is a Wiener process and {B(t), 0 ≤ t ≤ 1} is a Brownian bridge. To obtain convergence of weighted functionals, we require a rate of approximation in (2):


Assumption 1.1 For each N there are two independent Wiener processes {W_{N,1}(t), 0 ≤ t ≤ N/2}, {W_{N,2}(t), 0 ≤ t ≤ N/2}, σ > 0 and ζ < 1/2 such that

$$\sup_{1 \le k \le N/2} k^{-\zeta} \Bigl| \sum_{i=1}^{k} \varepsilon_i - \sigma W_{N,1}(k) \Bigr| = O_P(1)$$

and

$$\sup_{N/2 < k < N} (N-k)^{-\zeta} \Bigl| \sum_{i=k+1}^{N} \varepsilon_i - \sigma W_{N,2}(N-k) \Bigr| = O_P(1).$$

[…]

$$\Delta = \sum_{j>k} \pi_{j|i=1}\, \pi_{k|i=2} - \sum_{k>j} \pi_{j|i=1}\, \pi_{k|i=2} \quad \text{and} \quad \gamma = \sum_{j>k} \pi_{j|i=1}\, \pi_{k|i=2} + \sum_{j} \pi_{j|i=1}\, \pi_{j|i=2}/2,$$

ranging in [−1, 1] and [0, 1], respectively, where π_{j|i} is the conditional ith row probability for the jth response category.

6 Example

The data in Table 1 show the severity of nausea on a 6-point scale, measured in groups of patients who received chemotherapy with and without cisplatin. This 2 × 6 table has been analyzed through (10), i.e. the PO cumulative logit model (G² = 6.592, df = 4, p-value = 0.159), and a partial PO model (G² = 1.48, df = 3, p-value = 0.687) in Cox [4], considering equidistant scores for the response on severity of nausea. Focusing on (10), which is more parsimonious and simpler to interpret, we consider its generalization (12) for the power divergence. We observe that for the Pearsonian divergence (λ = 1), i.e. for a linear probability model, the fit is improved (G² = 5.953; see Table 2). Furthermore, notice that this data set is better modeled by an adjacent categories logit model. Considering λ as an unknown parameter and estimating it from the data, we find the optimal λ (λ̂) for both classes of models (see Fig. 2). However, the improvement in fit does not justify adopting these estimated values over the simpler to interpret linear probability models (see Table 2).


M. Kateri

Fig. 2 Deviance (G²) for PO adjacent categories generalized odds (blue solid line) and cumulative generalized odds (red dashed line) power divergence models (F(x) = φ′(x) = x^λ/λ) as a function of λ for the data of Table 1

Table 1 Observed frequencies for the severity of nausea (from none to severe) in two groups of chemotherapy patients (Source: Cox [4]). In parentheses are given the estimated expected frequencies under the adjacent categories generalized odds model for λ = 1

Group         | 0          | 1          | 2          | 3          | 4          | 5          | Total
Cisplatin     | 7 (6.51)   | 7 (8.61)   | 3 (3.90)   | 12 (10.23) | 15 (10.74) | 14 (17.84) | 58
No cisplatin  | 43 (43.49) | 39 (37.39) | 13 (12.10) | 22 (23.77) | 15 (19.26) | 29 (25.16) | 161

Table 2 Fit of PO generalized ordinal response models applied to Table 1 for λ → 0, λ = 1 and the optimal value λ̂ of the power parameter λ

      | Adj. categories logit         | Cumulative logit
λ     | λ̂     G²     df  p-value     | λ̂     G²     df  p-value
0     |        5.596  4   0.231       |        6.592  4   0.159
1     |        5.125  4   0.275       |        5.953  4   0.203
λ̂    | 1.70   5.033  3   0.169       | 1.05   5.950  3   0.114

For the power divergence adjacent categories models fitted on Table 1, Δ_λ is estimated as Δ̂_0 = 0.302, Δ̂_1 = 0.308 and Δ̂_1.7 = 0.303, respectively, while γ̂_0 = 0.651, γ̂_1 = 0.654, and γ̂_λ̂ = 0.651. Thus under all three models, there is a chance of about 65% for occurrence of more severe nausea in the 'cisplatin' (i = 1) than in the 'no cisplatin' (i = 2) group.
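As a quick illustration, sample analogues of Δ and γ can be computed directly from the observed conditional proportions of Table 1 (this is our own check, distinct from the model-based estimates quoted above):

```python
# observed frequencies from Table 1 (severity categories 0-5)
row1 = [7, 7, 3, 12, 15, 14]     # cisplatin (i = 1), total 58
row2 = [43, 39, 13, 22, 15, 29]  # no cisplatin (i = 2), total 161
p1 = [f / sum(row1) for f in row1]
p2 = [f / sum(row2) for f in row2]

# P(Y1 > Y2), P(Y1 < Y2), P(Y1 = Y2) for independent Y1 ~ p1, Y2 ~ p2
p_gt = sum(p1[j] * p2[k] for j in range(6) for k in range(6) if j > k)
p_lt = sum(p1[j] * p2[k] for j in range(6) for k in range(6) if j < k)
p_eq = sum(p1[j] * p2[j] for j in range(6))

delta = p_gt - p_lt          # sample analogue of Delta, in [-1, 1]
gamma = p_gt + p_eq / 2.0    # sample analogue of gamma, in [0, 1]
```

The resulting sample values (Δ ≈ 0.29, γ ≈ 0.65) agree well with the model-based Δ̂ and γ̂ reported above.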

References

1. Agresti, A., Kateri, M.: Ordinal probability effect measures for group comparisons in multinomial cumulative link models. Biometrics 73, 214–219 (2017)
2. Agresti, A., Tarantola, C., Varriale, R.: Simple ways to interpret effects in modeling binary data. In: Kateri, M., Moustaki, I. (eds.) Trends and Challenges in Categorical Data Analysis. Springer, Heidelberg (to appear) (2022)

Generalized Models for Binary and Ordinal Responses


3. Cochran, W.G.: Some methods of strengthening the common χ² tests. Biometrics 10, 417–451 (1954)
4. Cox, C.: Location-scale cumulative odds models for ordinal data: a generalized non-linear model approach. Stat. Med. 14, 1191–1203 (1995)
5. Forcina, A., Kateri, M.: A new general class of RC association models: estimation and main properties. J. Multivar. Anal. 184, 104741 (1–16) (2021)
6. Kateri, M.: φ-Divergence in contingency table analysis. Entropy 20, 324 (1–12) (2018)
7. Kateri, M., Agresti, A.: A class of ordinal quasi-symmetry models for square contingency tables. Stat. Probab. Lett. 77, 598–603 (2007)
8. Kateri, M., Agresti, A.: A generalized regression model for a binary response. Stat. Probab. Lett. 80, 89–95 (2010)
9. Kateri, M., Papaioannou, T.: f-Divergence association models. Int. J. Math. Stat. Sci. 3, 179–203 (1994)
10. Kateri, M., Papaioannou, T.: Asymmetry models for contingency tables. J. Amer. Stat. Assoc. 92, 1124–1131 (1997)
11. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)

Approximations of δ-Record Probabilities in i.i.d. and Trend Models

Miguel Lafuente, David Ejea, Raúl Gouet, F. Javier López, and Gerardo Sanz

Abstract We study the probability of occurrence of δ-records in a model with linear trend. While this probability has been studied in the i.i.d. case, the existence of a trend makes its analysis much more involved. Asymptotic properties of this probability have been studied in the literature when the number of observations is large. However, no approximations are known when the number of observations is small. We propose a first order approximation as a function of both the value of δ and the trend. We assess our results via Monte Carlo simulations, finding that the approximations are accurate for a small to moderate number of observations.

This work is dedicated to Leandro Pardo, an excellent mathematician, a better person and a great friend. Meeting Leandro is one of the best things that can happen to you in life. M. Lafuente (B) · D. Ejea · F. J. López · G. Sanz Departamento de Métodos Estadísticos, Facultad de Ciencias, Universidad de Zaragoza, Zaragoza, Spain e-mail: [email protected] D. Ejea e-mail: [email protected] F. J. López e-mail: [email protected] G. Sanz e-mail: [email protected] R. Gouet Departamento Ingeniería Matemática y Centro de Modelamiento Matemático (IRL 2807, CNRS), Universidad de Chile, Santiago, Chile e-mail: [email protected] F. J. López · G. Sanz Instituto de Biocomputación y Física de Sistemas Complejos (BIFI), Universidad de Zaragoza, 50018 Zaragoza, Spain © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_8


1 Introduction and Notation Even in today’s world, in the midst of the information age and dominated by terms like Big Data, the concept of record is still ubiquitous in the news, the media and in people’s everyday conversations. The main reason for this is the scarcity of this type of observations. A seminal paper by Chandler in 1952 [7] started what today is a rich literature about mathematical properties of record observations and related concepts. An important part of these results can be consulted in dedicated monographs [1, 2]. Motivated by the study of extreme related observations, the notion of δ-record was introduced in 2007 by Gouet et al. [10], according to which the nth entry in a discrete time series is a δ-record if it is larger than all preceding observations plus a fixed real quantity δ, i.e., if the observation X n satisfies Xn >

n−1 

X i + δ, δ ∈ R,

i=1

where ∨ stands for the maximum operator. The parameter δ can take either positive or negative values, making the latter condition respectively harder or easier to fulfill than the usual record setting, which corresponds to the case δ = 0. Also, denoting the maximum value at time n by M_n := ∨_{i=1}^{n} X_i, we define the δ-record indicators as 1_{n,δ} := 1{X_n > M_{n−1} + δ}, n > 1, where the first observation is always considered to be a δ-record, 1_{1,δ} = 1, by convention. This extension of the record concept adds a nontrivial extra difficulty when studying its properties, even in the easiest scenario, called the classical record model (CRM), where the observations are assumed to be drawn from a sequence {X_n}_{n∈N} of independent and identically distributed (i.i.d.) continuous random variables (r.v.). This is due to the fact that in the CRM, record occurrence is a universal feature where

$$P(1_{n,0} = 1) = 1/n, \quad \forall n \in \mathbb{N}, \qquad (1)$$

regardless of the distribution of the underlying r.v. On the contrary, if δ ≠ 0 the distribution affects the probability of observing a δ-record, which will be denoted by p_{n,δ} := P(1_{n,δ} = 1). Properties of δ-records for the CRM have been studied in [11, 14–16]. The study of record probabilities in the Linear Drift Model (LDM) was introduced in 1985 [4]. In this scenario the observations will be denoted by {Y_n}_{n∈N}, where

$$Y_n = X_n + cn, \qquad (2)$$

with c > 0 the drift parameter and X_n as in the CRM. In the LDM record probabilities are no longer universal and have been studied mainly from the asymptotic point of view [4–6, 8]. Applications of the LDM comprise sports [4, 5] and climate [13, 18, 19], among others. Also, asymptotic theory about δ-records in the LDM and other trend models can be seen in [12, 13]. The quantity p_{n,0} in the non-asymptotic setting


and usual record probabilities in the LDM were studied in [9] by means of first order Taylor approximations. Also, it was suggested in [17] to apply this methodology to δ-records in the CRM. Here, we will develop this idea for the CRM and the LDM. Throughout this work we will write F and f for the cumulative distribution function (cdf) and the probability density function (pdf) of the r.v. X_n. Moreover, we only consider continuous pdfs. Also, we denote a_n ∝ b_n if lim_{n→∞} a_n/b_n = l > 0, and O(δ) is the standard big-O Landau notation. Finally, we mention the relationship between δ-records and the so-called near-records as defined by Balakrishnan et al. [3], and how our results can also be applied to this kind of observation. Let X_n be a sequence of r.v. and a > 0 a fixed parameter. We say that the nth observation is a near-record if it is not a record but it is within a units of being one, that is, if

$$X_n \in (M_{n-1} - a,\, M_n]. \qquad (3)$$

2 First Order Approximations for the δ-Record Probability

First, we study the behaviour of the δ-record probability p_{n,δ} when the r.v. {X_n} are drawn from the CRM:

$$p_{n,\delta} = P\Bigl(X_n > \bigvee_{i=1}^{n-1} X_i + \delta\Bigr) = \int_{-\infty}^{\infty} \prod_{i=1}^{n-1} P(X_i < x - \delta)\, F(dx) = \int_{-\infty}^{\infty} F(x-\delta)^{n-1}\, F(dx). \qquad (4)$$

This last integral is usually not analytically solvable, so in order to compute it for a wider range of distributions, we will explore the possibility of a first order expansion of F(x − δ) in δ. Taking |δ| ≪ 1, we have F(x − δ) = F(x) − δ f(x) + O(δ²), and thus F(x − δ)^{n−1} = F(x)^{n−1} − (n − 1) F(x)^{n−2} δ f(x) + O(δ²). This expression, together with (4), yields

$$p_{n,\delta} = \frac{1}{n} - \delta (n-1) I_n + O(\delta^2), \qquad (5)$$

76

M. Lafuente et al.

where I_n := ∫_{−∞}^{∞} F(x)^{n−2} f(x)² dx. Summarizing, δ(n − 1)I_n plays the role of a first order correction term in the δ-record probability. Moreover, according to the definition of near-record in (3), and since the first term 1/n is the usual record probability as pointed out in (1), we get the following approximation for the near-record probability for a ≪ 1, taking a = −δ:

$$p_{n,a}^{near} = a (n-1) I_n + O(a^2).$$
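As a sanity check of (5) (ours, not from the paper): for U(0,1) observations, F(x) = x on [0, 1], so I_n = ∫_0^1 x^{n−2} dx = 1/(n − 1), and (5) reduces to p_{n,δ} ≈ 1/n − δ; moreover, the integral (4) is solvable in closed form in this case, p_{n,δ} = (1 − δ)^n/n for 0 ≤ δ < 1. A small Monte Carlo sketch:

```python
import random

def delta_record_prob_mc(n, delta, reps=200_000, seed=1):
    # fraction of i.i.d. U(0,1) samples whose n-th value exceeds the
    # maximum of the first n-1 values by more than delta
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.random() for _ in range(n)]
        if xs[-1] > max(xs[:-1]) + delta:
            hits += 1
    return hits / reps

n, delta = 5, 0.05
I_n = 1.0 / (n - 1)                        # exact I_n for U(0,1)
approx = 1.0 / n - delta * (n - 1) * I_n   # first order formula (5)
exact = (1.0 - delta) ** n / n             # closed form of (4) for U(0,1)
p_mc = delta_record_prob_mc(n, delta)
```

Here approx = 0.15 and exact ≈ 0.1548, so the first order formula is already within half a percentage point of the true probability.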

Let us now consider the case where the r.v. of interest {Y_n} follow the LDM as defined in (2). Reasoning as in Eq. (4), we get

$$p_{n,\delta}(c) = \int_{-\infty}^{\infty} \prod_{j=1}^{n-1} F(x + cj - \delta)\, F(dx), \qquad (6)$$

where the c in p_{n,δ}(c) is added to the notation to make the dependence explicit. In this setting, in addition to |δ| ≪ 1, we take c ≈ |δ| for simplicity, so that we can compute a first order approximation around the points (cj − δ). Note that this condition also implies that the approximation will only be valid for small n. For the terms in the product in (6) we have F(x + cj − δ) = F(x) + f(x)(cj − δ) + O(δ²), and so, for the whole product in (6) and after some algebra, we get

$$\prod_{j=1}^{n-1} F(x + cj - \delta) = F(x)^{n-1} + F(x)^{n-2} f(x) \Bigl( c\,\frac{n(n-1)}{2} - \delta (n-1) \Bigr) + O(\delta^2).$$

Finally, substituting this last expression in (6), we obtain, for fixed n,

$$p_{n,\delta}(c) = \frac{1}{n} + (n-1) \Bigl( \frac{cn}{2} - \delta \Bigr) I_n + O(\delta^2), \qquad (7)$$

where the term I_n is the same as in the case of the CRM. We define the excess probability E_n(c, δ) as p_{n,δ}(c) − 1/n, that is, the difference between the probability of a δ-record in the LDM and the probability of a usual record in the CRM. Note that, unlike in the CRM case, we do not have that (n − 1)(cn/2 − δ)I_n is an approximation of the near-record probability, since part of it corresponds to the contribution of the trend c to the usual record probability.

Some Notes About the Approximations

The accuracy of these δ-record probability approximations has been studied via Monte Carlo simulation. Nevertheless, the analytical expressions of the approximations illustrate some of the features of δ-record probabilities.

Approximations of δ-Record Probabilities in i.i.d. and Trend Models

77

For instance, the approximation in both cases is consistent with the influence of the parameters δ and c. Indeed, for increasing (decreasing) δ, the occurrence of a δ-record is more difficult (easier), and the approximate probability will be lower (higher). For the LDM, the influence of δ is the same as in the CRM, while for c the behaviour is also consistent since, for a higher (lower) trend parameter c, the occurrence of a δ-record will be easier (more difficult), and then the approximate probability will be higher (lower). Also, for the LDM, taking c ≈ |δ|, there is no interaction term between c and δ, the influence of δ being of the same magnitude as in the CRM. Nevertheless, the influence of c is of a higher order of magnitude, revealing that in order to facilitate the appearance of δ-records it is better to increase the underlying trend than to widen the δ-record condition by varying the value of δ. It is important to note that, although the computation of I_n cannot be guaranteed to be analytically solvable, it avoids the problem that arises in the integrals (4) and (6), where there is a delay between the points appearing in the argument of the cdfs F with respect to the pdf. Some I_n terms have been computed previously in the literature, showing a strong relationship between these terms and the domains of attraction of extreme values. In particular, the authors in [9] find the following patterns:

• In the Fréchet class of heavy-tailed distributions they consider r.v. of the Pareto family with pdf f(x) = μ x^{−μ−1} 1{x>1}. Computations yield I_n ∝ n^{−2−1/μ} and thus I_n ∝ n^{−α}, for a parameter α > 2.
• I_n ∝ a_n n^{−2}, with a_n growing at a slow logarithmic rate, in the Gumbel class of exponential-like tailed distributions. In this paper, we have computed the term I_n for another distribution in the Gumbel class, namely the Gumbel distribution itself, that is, the case where F(x) = exp(−exp(−x)). For this distribution we find that the term I_n is exactly n^{−2}, consistent with the results in [9] for other distributions in this family.
• In the Weibull class of distributions with a right endpoint they consider a Beta(1, b) r.v. with b > 1/2, finding I_n ∝ b Γ(2 − 1/b) n^{−2+1/b} and, in particular, I_n ∝ n^{−α}, α < 2. In this family we can add the case where X_n has a negative exponential distribution, with pdf f(x) = exp(x) 1{x<0}.

In view of (7), the correction term C_n(c, δ) = (n − 1)(cn/2 − δ) I_n then behaves as follows, depending on the decay of I_n:

1. I_n ∝ n^{−α}, α > 2 (Fréchet family). The correction term is

$$C_n(c, \delta) \propto c\,\frac{1}{2 n^{\alpha-2}} - \delta\,\frac{1}{n^{\alpha-1}},$$

and thus the influence of c and δ vanishes quickly as n increases.

2. I_n ∝ n^{−2} (Gumbel family). The correction term is

$$C_n(c, \delta) \propto c\,\frac{1}{2} - \delta\,\frac{1}{n}.$$

This means that the dependence on δ is weak, while for two values c_1, c_2, the difference between the correction terms should be proportional to (c_2 − c_1)/2. We can observe this phenomenon in Fig. 1 (left), where the exact value of C_n(c, δ) and the estimated (by simulation) value of E_n(c, δ) for the Gumbel distribution are displayed. For small parameter values, approximations are good, and the expected difference induced by (c_2 − c_1)/2 is predicted fairly well, since the proportionality constant is 1 in this case because I_n = n^{−2}.

3. I_n ∝ n^{−α}, 1 < α < 2 (Weibull family). The correction term is

$$C_n(c, \delta) \propto c\,\frac{n^{2-\alpha}}{2} - \delta\,\frac{1}{n^{\alpha-1}},$$

expecting an increasing influence of c and a decreasing influence of δ as n increases.

4. I_n ∝ n^{−1} (Weibull family). The correction term is

$$C_n(c, \delta) \propto c\,\frac{n}{2} - \delta,$$


Fig. 1 Points: estimations of the excess probability E_n(c, δ) via simulation with 10^8 iterations. Lines: C_n(c, δ). Left: results for the Gumbel distribution in the LDM. Right: results for the negative exponential distribution in the LDM
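The left panel of Fig. 1 can be reproduced approximately along the following lines (a sketch of ours with far fewer iterations than the 10^8 used in the figure; the function names are not from the paper):

```python
import math
import random

def excess_prob_mc(n, c, delta, reps=100_000, seed=7):
    # Monte Carlo estimate of E_n(c, delta) = p_{n,delta}(c) - 1/n for
    # Y_j = X_j + c*j with X_j standard Gumbel, F(x) = exp(-exp(-x))
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        y = [-math.log(-math.log(rng.random())) + c * (j + 1)
             for j in range(n)]
        if y[-1] > max(y[:-1]) + delta:
            hits += 1
    return hits / reps - 1.0 / n

n, c, delta = 10, 0.02, -0.05
excess = excess_prob_mc(n, c, delta)
# first order correction term from (7), with I_n = n^{-2} for the Gumbel cdf
correction = (n - 1) * (c * n / 2.0 - delta) / n ** 2
```

For these parameter values the correction term is 0.0135, and the simulated excess probability stays close to it, as in the figure.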

revealing that, especially in the short term, the influence of c is strong, while the effect of δ is a translation proportional to δ units. This is exactly the phenomenon shown in Fig. 1 (right) for the negative exponential distribution, for which I_n = n^{−1}. The dependence on c is observed for the first observations. The influence of δ induces a constant difference for two different values of δ. This phenomenon seems to be valid even in the limit.

5. I_n ∝ n^{−α}, 0 < α < 1 (Weibull family). The correction term is

$$C_n(c, \delta) \propto c\,\frac{n^{2-\alpha}}{2} - \delta\, n^{1-\alpha},$$

which shows an increasing influence of both parameters as n grows. However, simulations show that (7) is not accurate, at least for the distributions that we have considered. The reason is that those distributions have a finite right endpoint and an unbounded pdf.

4 Conclusions

In the CRM, our first order approximations seem to capture well the influence of δ on p_{n,δ} for small δ, even for not too small values of n. Moreover, we have found an approximation for the near-record probability. Due to lack of space, we have chosen not to show any plot for the CRM, since the estimations are very close to the approximations.


In the LDM, for moderate c and δ, the estimations seem to have more variability due to the existence of two sources of error. Also, while c has a greater influence than δ, we can still find the effect of δ on the δ-record probability. This is confirmed both via approximations and simulations. We find that the qualitative behaviour predicted by the first order approximations fits the simulation results reasonably well in the small-n regime. Moreover, this behaviour is still critically related to the different domains of attraction of extreme values, and among the considered distributions, we do not find two distributions from two different families with a similar behaviour. In particular, we find that heavy-tailed distributions (Fréchet class) are the least influenced by the parameters c and δ, which is not surprising since these distributions tend to take large values more often. The Gumbel class is an intermediate case between the two other families, with the Weibull class being the most dependent on the value of δ, except in the case where there is an asymptote at the right endpoint, which makes the approximation useless. Acknowledgements This research was supported by ACE210010, FB210005 basal funds from ANID-Chile and Grant MCIN/AEI/10.13039/501100011033. ML, RG, FJL and GS are members of the research group Modelos Estocásticos of DGA.

References

1. Ahsanullah, M., Nevzorov, V.B.: Records via Probability Theory. Atlantis Press, Amsterdam (2015)
2. Arnold, B.C., Balakrishnan, N., Nagaraja, H.N.: Records. Wiley Series in Probability and Statistics (1998)
3. Balakrishnan, N., Pakes, A.G., Stepanov, A.: On the number and sum of near-record observations. Adv. Appl. Probab. 37, 765–780 (2005)
4. Ballerini, R., Resnick, S.: Records from improving populations. J. Appl. Probab. 22, 487–502 (1985)
5. Ballerini, R., Resnick, S.: Records in the presence of a linear trend. Adv. Appl. Probab. 19, 883–909 (1987)
6. Borovkov, K.: On records and related processes for sequences with trends. J. Appl. Probab. 36, 668–681 (1999)
7. Chandler, K.N.: The distribution and frequency of record values. J. R. Stat. Soc. B 14, 220–228 (1952)
8. De Haan, L., Verkade, E.: On extreme value theory in the presence of a trend. J. Appl. Probab. 24, 62–76 (1987)
9. Franke, J., Wergen, G., Krug, J.: Records and sequences of records from random variables with a linear trend. J. Stat. Mech. P10013 (2010)
10. Gouet, R., López, F.J., Sanz, G.: Asymptotic normality for the counting process of weak records and δ-records in discrete models. Bernoulli 13, 754–781 (2007)
11. Gouet, R., López, F.J., Sanz, G.: On δ-record observations: asymptotic rates for the counting process and elements of maximum likelihood estimation. Test 21, 188–214 (2012)
12. Gouet, R., Lafuente, M., López, F.J., Sanz, G.: δ-Records observations in models with random trend. In: Gil, E., Gil, E., Gil, J., Gil, M.A. (eds.) The Mathematics of the Uncertain: A Tribute to Pedro Gil, pp. 209–217. Springer, Cham (2018)
13. Gouet, R., Lafuente, M., López, F.J., Sanz, G.: Exact and asymptotic properties of δ-records in the linear drift model. J. Stat. Mech. 103201 (2020)


14. Gouet, R., López, F.J., Sanz, G.: On the point process of near-record values. Test 24, 302–321 (2015)
15. López-Blázquez, F., Salamanca-Miño, B.: Distribution theory of δ-record values: case δ ≤ 0 […]

[…]

$$Gain(i) = Var(i), \quad \text{if } Var(i) > 0, \qquad (2)$$

$$Loss(i) = -Var(i), \quad \text{if } Var(i) < 0. \qquad (3)$$

We obtain the cumulative value of gains, FCG (first cumulative gain), and losses, FCL (first cumulative loss), during the first k periods:

$$FCG_{y_t}(k+1) = \sum_{i=2}^{k+1} Gain(i), \qquad (4)$$

$$FCL_{y_t}(k+1) = \sum_{i=2}^{k+1} Loss(i). \qquad (5)$$

We define the Relative Strength of a time series y_t and a length k of a concrete period (days or quarters), at time k + 1, by

$$RS_{y_t}(k+1) = \frac{FCG_{y_t}(k+1)}{FCL_{y_t}(k+1)}. \qquad (6)$$

We extend the above computations to the next periods. For j = k + 2, …, n, we define the cumulative gains (CG) and losses (CL) in a recurrent way:

$$CG(j) = CG(j-1) + Gain(j), \qquad (7)$$

$$CL(j) = CL(j-1) + Loss(j). \qquad (8)$$

The average gain (AG) and average loss (AL) at time k + 1 are defined by

$$AG_{y_t}(k+1) = \frac{CG_{y_t}(k+1)}{k}, \qquad (9)$$

Fig. 1 The evolution of GDP variations and the corresponding RSI(8) values in the US during 2001–2005

$$AL_{y_t}(k+1) = \frac{CL_{y_t}(k+1)}{k}. \qquad (10)$$

For the rest of the time periods, j = k + 2, …, n, the AG and AL are given by

$$AG_{y_t}(j) = \frac{AG_{y_t}(j-1)\,(k-1) + Gain(j)}{k}, \qquad (11)$$

$$AL_{y_t}(j) = \frac{AL_{y_t}(j-1)\,(k-1) + Loss(j)}{k}. \qquad (12)$$

Therefore, the relative strength at j = k + 1, k + 2, …, n is

$$RS_{y_t}(j) = \frac{AG_{y_t}(j)}{AL_{y_t}(j)}. \qquad (13)$$

RS varies from 0 (all variations are losses) to ∞ (only gains). For this reason, the following RSI was proposed, at j = k + 1, k + 2, …, n:

$$RSI_{y_t}(j) = 100 - \frac{100}{1 + RS_{y_t}(j)}. \qquad (14)$$

RSI oscillates between 0 (RS = 0) and 100 (RS = ∞) and is sometimes plotted against time below the representation (time series, candlestick, …) of the asset or, in this paper, of the quarterly GDP variation time series.

The Relative Strength Index (RSI) to Monitor GDP Variations. Comparing …


Table 1 Example of RS and RSI computation, with 8 quarters, with GDP variations of US during the first years of the XXI century

Quarter   | GDP Var | Gain | Loss | AG     | AL     | RS(8)  | RSI(8)
2001, Q1  |  −0.3   |      |      |        |        |        |
2001, Q2  |   0.6   | 0.9  |      |        |        |        |
2001, Q3  |  −0.4   |      | −1   |        |        |        |
2001, Q4  |   0.3   | 0.7  |      |        |        |        |
2002, Q1  |   0.9   | 0.6  |      |        |        |        |
2002, Q2  |   0.6   |      | −0.3 |        |        |        |
2002, Q3  |   0.4   |      | −0.2 |        |        |        |
2002, Q4  |   0.2   |      | −0.2 |        |        |        |
2003, Q1  |   0.6   | 0.4  |      | 0.3199 | 0.2149 | 1.4885 | 59.8158
2003, Q2  |   0.9   | 0.3  |      | 0.3181 | 0.1880 | 1.6916 | 62.8475
2003, Q3  |   1.7   | 0.8  |      | 0.3831 | 0.1645 | 2.3282 | 69.9537
2003, Q4  |   1.1   |      | −0.6 | 0.3352 | 0.2127 | 1.5757 | 61.1757
2004, Q1  |   0.5   |      | −0.6 | 0.2933 | 0.2629 | 1.1155 | 52.7289
2004, Q2  |   0.8   | 0.3  |      | 0.2852 | 0.2301 | 1.2395 | 55.3474
2004, Q3  |   0.9   | 0.1  |      | 0.2724 | 0.2013 | 1.3533 | 57.5064
2004, Q4  |   1     | 0.1  |      | 0.2454 | 0.1761 | 1.3933 | 58.2163
2005, Q1  |   1.1   | 0.1  |      | 0.2279 | 0.1541 | 1.4787 | 59.6558
2005, Q2  |   0.5   |      | −0.6 | 0.1994 | 0.2155 | 0.9252 | 48.0576
2005, Q3  |   0.9   | 0.4  |      | 0.2282 | 0.1886 | 1.2101 | 54.7523
2005, Q4  |   0.6   |      | −0.3 | 0.1997 | 0.1975 | 1.0110 | 50.2745

2.2 Detailed Example About Computing RS and RSI Values in the Case of GDP Variations for US in the Period 2001–2005

Figure 1 shows the evolution of GDP variations and RSI values for 8 quarters in the case of the US during 2001–2005, and Table 1 provides computation details. Both plots in Fig. 1 are quite similar. Some important differences between them are the scale, the interval [0, 100], the reference line at 50, and the two lines traced at 40 and 60 on the y-axis of the RSI. The central line, the 50 line, separates the plot into two regions: above (below) 50 is the part where on average every quarter is better (worse) than the previous one. The 40 (60) line defines a region given by the interval [0, 40] ([60, 100]) where every quarter on average is much worse (better) than the previous one.


C. Maté

2.3 Uses of the RSI: Oversold and Overbought Zones

The original and common use of the RSI during the last 40 years is as a technical indicator (TI) to help in trading, it being quite important to establish limits for the RSI. One obvious limit is the 50 line, since below (above) this value the asset on average is getting losses (gains). Regular behavior of the RSI is established by a range, where the lower (upper) limit offers a reasonable limit for losses (gains). The most used range is taking 30 (70) as a lower (upper) limit. The range 0 < RSI […]

[…] > 0; N_t = #{t_n ∈ T : 0 < t_n ≤ t} counts the number of "events" t_n, n = 1, …, ∞ "observed" in the time window (0, t], and T = {t_0, t_1, …, t_n, …} ⊂ R_+. We define t_0 = 0 and x = X_{t_0}. Thus

$$X_t = x + ct + \sum_{n=0}^{N_t} J_n, \qquad (1)$$

J. Villarroel (B) · J. A. Vega Instit. Univ. Física y Matemáticas, Universidad de Salamanca, Salamanca, Spain e-mail: [email protected] J. A. Vega e-mail: [email protected] J. Villarroel Dept. Estadística, Universidad de Salamanca, Salamanca, Spain © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_10


When c = 0 the resulting "renewal reward process" X_t = Σ_{n=0}^{N_t} J_n has a prominent role in reliability and system maintenance, see [22]. It also describes earthquake shocks [12] or stock markets where sudden price changes are allowed [19]. By contrast, the prototype model of risk theory results when a drift c > 0 is incorporated. Here (X_t)_{t≥t_0} describes the cash flow at an insurance company: c accounts for the constant premium rate, while claims occur with Poisson arrivals t_n, t_n < t_{n+1}, and sizes (or "severities") J_n < 0. The model was first introduced by Cramér and Lundberg under Poissonian arrivals N_t ∼ P(λt) [5] and generalized to general renewals by Sparre Andersen; see [1] for general background. Even such a simplified situation is far from trivial, and during the last two decades substantial research has been devoted to the topic of ruin probabilities, i.e. escape probabilities (EP) from (0, ∞), defined as P(X_t ≤ 0 for some t) = 1 − P(X_t > 0, ∀t). Under Erlang Γ(2, λ) arrivals they were studied in [7]. The distribution of the time to ruin under Erlang times is considered in [8, 11, 18]. Ruin probabilities under more general settings, like Lévy and stable processes, appear in the interesting paper [15]. In a remarkable paper, Bertoin [3] considers exit probabilities for subordinators (i.e. stable Lévy processes with only negative jumps) and gives a representation of the EP via scale functions.

More general models, where the one-sided jump restriction J_1 < 0 is not required, occur naturally in other contexts. A natural application of (1) in reliability is the workload of a single-server queue under random workload removal, cf. [22]. Energy dissipation in defective nonlinear optical fibers is also described by (1), where c > 0 and the jumps J_n > 0, n = 1, . . . , ∞, account for energy losses due to damping and inhomogeneities, respectively (see [26]). Further motivation is given by the description of temporally aggregated rainfall in meteorology and hydrology, see [17].
In a different context, (1) also models the dynamics of snow depth on mountain hillsides, see [21]. Exit probabilities for such models with two-sided jumps have recently attracted attention. In [22] the joint distribution of τ_b ≡ inf{t ≥ 0 : X(t) ∉ (0, b)}, b > 0, and X(τ_b) is studied. In actuarial science such "overshoots" represent the deficit X(τ_b) in which the company incurs at the ruin time. One-barrier ruin probabilities and ruin times with two-sided jumps are derived for Erlang or Coxian distributions. This study is continued in [27]. The time to ruin and the deficit at ruin in an actuarial context are studied in [4, 10, 16, 28, 29].

Here we consider the double-barrier problem with Poisson, two-sided jumps. Concretely, let a < x_0 < b be two fixed levels and call τ^a = inf{t > 0 : X_t ≤ a}, τ^b = inf{t > 0 : X_t ≥ b}. We study the two-barrier escape probability P_x(τ^b < τ^a), the probability that, starting from x ∈ (a, b), the process (1) exits (a, b) via the upper barrier, when both positive and negative jumps occur. (We use P_x(A) ≡ P(A | X_0 = x), A ∈ F_∞.) To the best of our knowledge, the double-barrier problem with only negative jumps arriving with Poisson law was first considered in [6]. It is also considered in [2] via a Wiener–Hopf decomposition; see also [13, 20, 23, 24]. However, even for the simplest one-sided cases, the derivation of explicit expressions requires much ingenuity. Indeed, to our knowledge, complete solutions to the double-barrier exit problem are known only for Lévy processes with only negative jumps.

Escape Probabilities from an Interval for Compound Poisson Processes with Drift


Under Poisson arrivals, (1) is a Lévy–Markov process. In Sect. 2 we consider, under these assumptions, general properties of escape probabilities and their relation to ruin probabilities. In Sect. 3 we use the Dynkin martingale associated with the infinitesimal generator G of the Markov process. It provides a way to encode escape probabilities in terms of the solution of a Fredholm integral equation (IE). Unfortunately, in a general situation a closed-form solution of this IE is not possible. We show that for the class of densities h : R → R having a rational characteristic function, the EP satisfies a simple ordinary differential equation with boundary conditions at the end point x = b. Several particular cases are studied.

1.1 General Properties of Escape Probabilities

Let (Ω, G, P) denote the complete probability space where the basic random variables are defined. Let (t_n)_{n∈N} and (J_n)_{n∈N} be two sequences of random variables defined on this space. We make the following natural assumptions on the process (X_t)_{t≥0} of (1).

Assumption A1. Events occur according to a Poisson process with intensity λ > 0. Equivalently, the interarrival times τ_n := t_n − t_{n−1} > 0, n = 1, . . . , ∞, satisfy τ_n ∼ Exp(λ), i.i.d.

Assumption A2. (J_n)_{n=1,...,∞} ∼ H is an i.i.d. sequence with a common cumulative distribution function (c.d.f.) H.

Assumption A3. The sequence (J_n)_{n=1,...,∞} is independent of the underlying Poisson process (N_t)_{t≥t_0}.

Assumption A4. The drift c > 0 is strictly positive.
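Assumptions A1–A4 translate directly into a simulation recipe for (1). A minimal sketch, not from the chapter; the function name and parameter values are illustrative:

```python
import random

def sample_path(x, c, lam, jump, T, rng):
    """Simulate (1) at its jump times on (0, T]: X_t = x + c*t + sum_{n<=N_t} J_n.

    A1: Exp(lam) interarrival times; A2/A3: i.i.d. jumps drawn by `jump(rng)`,
    independently of the arrival process; A4 corresponds to c > 0.
    """
    t, S = 0.0, 0.0                 # current time and accumulated jump sum
    times, vals = [0.0], [x]        # path sampled at t_0 = 0 and at jump times
    while True:
        t += rng.expovariate(lam)   # next arrival (A1)
        if t > T:
            break
        S += jump(rng)              # severity J_n (A2, A3)
        times.append(t)
        vals.append(x + c * t + S)  # value of X just after the jump
    return times, vals

# example: drift 1.5, rate-1 arrivals, Exp(1) downward jumps (the risk model)
rng = random.Random(0)
ts, xs = sample_path(x=2.0, c=1.5, lam=1.0,
                     jump=lambda r: -r.expovariate(1.0), T=10.0, rng=rng)
```

Between arrivals the path is the deterministic line x + ct + S, so sampling at jump times captures the whole trajectory.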

1.1.1 Symmetry Properties of the Escape Probability

Let a < x < b, τ^a ≡ τ_−^a = inf{t > 0 : X_t ≤ a} and τ^b ≡ τ_+^b = inf{t > 0 : X_t ≥ b}, the exit times from (a, ∞) and, respectively, (−∞, b). Here we study, under the above assumptions, general properties of P_x(τ^b < τ^a), the escape probability from (a, b) starting from x.

Proposition 1.1 Under Assumptions (A1) to (A4) the function π_b(x) ≡ π(x) defined by

    π : (0, b) × R_+ → [0, 1],  (x, b) → π_b(x) = P(τ^b < τ^0 | X_0 = x)    (2)

is monotone in both variables. If t_b = (b − x)/c, it satisfies:

1. As x → b−,

    lim_{x→b−} π(x) ≡ π(b−) = 1    (3)

2. The function x → π(x) is increasing.


3. The function b → π_b(x) is decreasing. Besides, as b → ∞, π_b(x) tends to the survival probability:

    π_b(x) ≥ π_{b′}(x) ≥ π_{b→∞}(x) = P(X_t > 0, ∀t) := ψ(x),  x < b ≤ b′    (4)

4. For any x, x → π(x) satisfies the bounds

    0 < 1 − F(t_b) H_−(b − x) ≤ π_b(x) ≤ 1 − F(t_b) H(−b)    (5)

5. For x ∈ (a, b) and any d ∈ R the double-barrier escape probability satisfies

    P_x(τ^b < τ^a) = P_{x+d}(τ^{b+d} < τ^{a+d}) = P_{x−a}(τ^{b−a} < τ^0) ≡ π_{b−a}(x − a)    (6)

Proof To stress the dependence on x we denote the event {τ^b < τ^0}, when the process starts from x at t = 0, as {τ^b < τ^0} ≡ U_b^x. As x grows so does (X_t), see (1), and hence {U_b^x}_{x∈(0,b)} is an increasing sequence of events on Ω. Besides, as b grows, τ^b ↑ ∞ w.p. 1, while {U_b^x}_{b∈(x,∞)} decreases. Hence

    lim_{b→∞} {τ^b < τ^0} = {τ^0 = ∞} = {X_t > 0, ∀t}.

Sequential continuity of probabilities gives (4):

    lim_{b→∞} π_b(x) = lim_{b→∞} P_x(U_b^x) = P_x(lim_{b→∞} U_b^x) = P_x(X_t > 0, ∀t).

Besides, if U_b^x has occurred then it is not possible that the first event is such that t_1 ≤ t_b and J_1 ≤ −b. By contrast, when the first event is such that t_1 ≤ t_b and J_1 ≥ b − x, or when t_1 > t_b, then U_b^x has occurred. Thus

    {t_1 > t_b} ∪ {t_1 ≤ t_b, J_1 ≥ b − x} ⊂ U_b^x ⊂ {t_1 ≤ t_b, J_1 ≤ −b}^c    (7)

which implies (5). Letting x → b− then, by sequential continuity and assumption A5, F(t_b) H_−(b − x) → 0 and (3) follows. We note that τ^∞ := lim_{b→∞} τ^b = ∞ a.s. Finally, since τ^0 ∧ τ^b < ∞ w.p. 1, then, up to a null set, {τ^0 < τ^b} = (U_b^x)^c. For the last point, note that (X_t) is a spatially homogeneous process, namely E_y(g(X_t)) = E_0(g(y + X_t)) for all y and any Borel measurable g : R → R; hence P_{x+d}(τ^{b+d} < τ^{a+d}) must be independent of d ∈ R, and exit probabilities can only depend on the distances to the barriers, b − x and x − a. Choosing d = −a, (6) is obtained. ∎
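As a quick numerical illustration (a sketch, not from the chapter; parameter values are illustrative), the elementary bounds (5) can be checked against the closed-form escape probability (33) obtained for the classical model in Example 3.1 below (J_1 < 0, −J_1 ∼ Exp(γ); then H_−(b − x) = 1, H(−b) = e^{−γb} and F(t_b) = 1 − e^{−λ t_b}):

```python
import math

lam, c, gamma, b = 1.0, 1.5, 1.0, 3.0    # illustrative parameters
lt = lam / c                              # lambda-tilde = lambda / c

def pi_exact(x):
    """Closed form (33) for the classical one-sided model."""
    u = lt - gamma
    return (1 - (lt / gamma) * math.exp(u * x)) / (1 - (lt / gamma) * math.exp(u * b))

for x in (0.5, 1.0, 2.0, 2.9):
    tb = (b - x) / c
    F = 1 - math.exp(-lam * tb)           # F(t_b) = P(t_1 <= t_b)
    lower = 1 - F * 1.0                   # 1 - F(t_b) H_-(b - x)
    upper = 1 - F * math.exp(-gamma * b)  # 1 - F(t_b) H(-b)
    assert lower <= pi_exact(x) <= upper
```

The bounds are crude far from b but become tight as x → b−, which is exactly how (3) is deduced from (5).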


Remark 1.1 The invariance of (X_t)_{t≥0} under the group of all space translations and reflections permits reducing the double-barrier problem to the case a = 0 with no loss of generality.

Proposition 1.2 Suppose Assumptions (A1) to (A4) hold, that c = 0, and that the "severities" J_n have a c.d.f. which is symmetric with respect to the origin: H(y) = 1 − H(−y) for any y ∈ Supp J_1. Then

    P_x(τ^b < τ^0) = 1 − P_{b−x}(τ^b < τ^0)    (8)

Proof For any c ∈ R consider the translation y → T_c(y) = y + c. Let also R_x be the reflection from the point x: y → R_x(y) = 2x − y. For any Borel set B the reflected and translated sets are

    R_x ∘ B := {y ∈ R : ∃z ∈ B and y = R_x(z)} ≡ 2x − B  and  T_c ∘ B := {y ∈ R : ∃z ∈ B and y = z + c} ≡ B + c    (9)

If J_1 has a symmetric distribution with respect to the origin and if c = 0, then (1) implies that the law of X does not change when reflected over the point x: for any t and Borel set B,

    P_x(X_t ∈ B) = P_x(X_t ∈ R_x ∘ B)

This and translational invariance, P_x(X_t ∈ B) = P_{x+c}(X_t ∈ B + c), give

    P_x(X_t ∈ B) = P_{T_c(x)}(X_t ∈ T_c(B)) = P_{T_c(x)}(X_t ∈ R_{x+c} ∘ T_c(B)),  ∀c ∈ R

Letting c := b − 2x we have θ(y) := R_{b−x} ∘ T_{b−2x}(y) = b − y and

    P_x(X_t ∈ B) = P_{θ(x)}(X_t ∈ R_{b−x} ∘ T_{b−2x}(B)) ≡ P_{θ(x)}(X_t ∈ θ(B)) = P_{θ(x)}(θ^{−1} X_t ∈ B)

For any sample path t → X_t(ω) ≡ (ω_t) consider the shadow path t → X̃_t(ω) ≡ (ω̃_t), where X̃_t(ω) := b − X_t(ω) ≡ θ^{−1} X_t(ω) and X̃_0 = b − x ≡ θ(x). Then both evolutions (ω_t) and (ω̃_t) are equally likely. Besides, they satisfy

    τ^b(ω_t) < τ^0(ω_t) ⇔ τ^0(ω̃_t) < τ^b(ω̃_t)    (10)

namely, the image of any sample path which satisfies τ^b < τ^0 is a sample path that satisfies τ^0 < τ^b. Hence P_x(τ^b < τ^0) = P_{b−x}(τ^0 < τ^b) = 1 − P_{b−x}(τ^b < τ^0). ∎
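The symmetry (8) can be corroborated by simulation. A minimal sketch, not from the chapter, with c = 0 and symmetric Laplace(0, γ) jumps, a law covered by Proposition 1.2; for b = 3, γ = 1 the two estimates below should sum to 1 up to Monte Carlo error (and, by (38) below, the exact values are 2/5 and 3/5):

```python
import random

def exit_up_prob(x, b, gamma, n_paths=30000, seed=0):
    """Monte Carlo estimate of P_x(tau^b < tau^0) when c = 0, Laplace jumps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_paths):
        X = x
        while 0.0 < X < b:
            j = rng.expovariate(gamma)            # |J| ~ Exp(gamma)
            X += j if rng.random() < 0.5 else -j  # symmetric sign
        hits += X >= b
    return hits / n_paths

p_x  = exit_up_prob(1.0, 3.0, 1.0, seed=0)       # started from x
p_bx = exit_up_prob(2.0, 3.0, 1.0, seed=1)       # started from b - x
```

With c = 0 the path only moves at jump times, so simulating the jump chain alone suffices.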


2 Integral Equations for the Escape Probability

Under Poisson arrivals, (1) is a Lévy–Markov process. The infinitesimal generator is the map A : D → D, where D := L^∞ ∩ C^1(R) is the domain and A acts on any Φ ∈ D via

    Φ → AΦ = lim_{t→0} E_x[Φ(X_t) − Φ(x)]/t = c ∂_x Φ(x) + λ ∫_R [Φ(x + y) − Φ(x)] dH(y)    (11)

In the last equality we used spatial homogeneity to conclude that A = A_1 + A_2 is the sum of the generators A_1, A_2 corresponding to the processes X_{t,1} := ct and X_{t,2} := Σ_{n=0}^{N_t} J_n, together with the well-known properties of compound Poisson processes. We now derive an integral equation that the escape probability satisfies.

Theorem 2.1 Under Assumptions A1 to A6, π(x) (see Proposition 1.1) satisfies (3) and solves for 0 < x < b the integral equation

    (λ − c ∂_x) π(x) = λ H̄(b − x) + λ ∫_{−x}^{b−x} π(x + y) dH(y),  0 < x < b    (12)

Proof We denote by π̃ : R → R the extension of π from (0, b) to R, i.e. the EP when (X) may start at an arbitrary x ∈ R. If x ∈ R − (0, b) then τ_0 ∧ τ_b = 0, as escape occurs instantly. This insight yields

    P_x(τ^b < τ^0) = π(x) 1_{(0,b)}(x) + 1_{[b,∞)}(x) + 0 · 1_{(−∞,0)}(x) ≡ π̃(x)    (13)

Additionally, for any Φ : R_+ × D → R in the domain of A, Dynkin's formula [25] yields that the process

    M_t := Φ(t, X_t) − Φ(0, x) − ∫_0^t (∂/∂s + A) Φ(s, X_s) ds    (14)

is a martingale. Hence E M_τ = E M_0 = 0, since τ ≡ τ_0 ∧ τ_b is a stopping time. Suppose that, additionally, Φ(t, y) solves the Dirichlet boundary problem

    ∂Φ/∂t + AΦ(t, y) = 0,  y ∈ (0, b),  and  Φ(t, y) = 1_{[b,∞)}(y),  y ∉ (0, b)    (15)

Thus we have Φ(τ, X_τ) = 1_{X_τ ∈ [b,∞)} and

    0 = E M_0 = E M_τ := E Φ(τ, X_τ) − Φ(0, x) − E ∫_0^τ (∂/∂s + A) Φ(s, X_s) ds    (16)
      = E_x[1_{X_τ ∈ [b,∞)}] − Φ(0, x) = π̃(x) − Φ(0, x)    (17)


since {X_τ ∈ [b, ∞)} = {τ_b < τ_0}. Thus π̃(x) = Φ(0, x) ≡ Φ(t, x) must solve

    A π̃(x) = c ∂_x π̃(x) + λ ∫_{−∞}^{∞} [π̃(x + y) − π̃(x)] dH(y) = 0    (18)

By restriction to (0, b) and use of (13), this yields (12). ∎

Remark 2.1 Note that when c > 0 we also need to add the boundary condition (3). By contrast, when c = 0, (3) is no longer present and π(x) is independent of λ; in fact, it then does not depend on the distribution of the arrival times at all.
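The structure of (12) with the extension (13) also suggests a direct numerical scheme. When c = 0, the IE becomes the fixed-point problem π(x) = H̄(b − x) + ∫_0^b h(z − x) π(z) dz with a strictly sub-stochastic kernel, so plain iteration converges. A sketch, not from the chapter, using the symmetric Laplace jumps of Example 3.2 below, whose closed form is π(x) = (1 + γx)/(2 + γb):

```python
import math

def solve_ie(b=3.0, gamma=1.0, n=81, iters=150):
    """Fixed-point iteration for (12) with c = 0 and Laplace(0, gamma) jumps."""
    hstep = b / (n - 1)
    xs = [i * hstep for i in range(n)]
    dens = lambda y: 0.5 * gamma * math.exp(-gamma * abs(y))  # jump density
    Hbar = lambda y: 0.5 * math.exp(-gamma * y)               # P(J >= y), y >= 0
    pi = [0.5] * n                                            # initial guess
    for _ in range(iters):
        new = []
        for x in xs:
            s = 0.0
            for j, z in enumerate(xs):                        # trapezoid rule
                w = 0.5 if j in (0, n - 1) else 1.0
                s += w * dens(z - x) * pi[j]
            new.append(Hbar(b - x) + hstep * s)
        pi = new
    return xs, pi

xs, pi = solve_ie()
# compare with the closed form pi(x) = (1 + gamma*x)/(2 + gamma*b)
```

The kernel mass ∫_0^b h(z − x) dz is bounded away from 1, so the iteration is a contraction; a few hundred sweeps suffice on this grid.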

3 Severities with Rational Characteristic Function

We now study a general case where (12) can be solved in explicit form. We denote by H the class of densities h : R → R having a rational characteristic function (CF), namely

    h ∈ H ⇔ h̃(s) := ∫_{−∞}^{∞} h(x) e^{ixs} dx = R(is)/Q(is),  s ∈ R    (19)

where Q and R are coprime polynomials with m ≡ deg(R) < deg(Q) = n:

    Q(s) ≡ Σ_{j=0}^{n} a_j s^j,  R(s) ≡ Σ_{j=0}^{m} b_j s^j,  s ∈ R    (20)

Besides, (a_j), (b_j) ∈ R, a_0 = b_0 ≠ 0, m ≤ n − 1 and a_n ≠ 0. It turns out that if J_1 has a density h ∈ H, (12) can be reduced further to an ordinary differential equation (ODE) with boundary conditions at x = b. Interesting examples of this class include the convex combination

    h(x) = p γ_+ e^{−γ_+ x} θ(x) + p̄ γ_− e^{γ_− x} θ(−x)  and  h̃(s) = p γ_+/(γ_+ − is) + p̄ γ_−/(γ_− + is)    (21)

Such double-exponential jump models are obtained when (X) can jump rightward with probability p, 0 < p < 1, or leftward with probability p̄ := 1 − p; besides, J_1 is sampled from an exponential distribution with parameter γ_+ or γ_−, respectively. They correspond to the polynomials

    R(s) = γ_− γ_+ + (p γ_+ − p̄ γ_−) s,  Q(s) = (γ_+ − s)(γ_− + s).    (22)

We first note the following Lemma.


Lemma 3.1 Assume h ∈ H with h̃ given by (19). Then:

1. The poles of h̃(s) are purely imaginary points on the boundary of the strip of convergence.

2. h is of class C^{n−m−2} and has derivatives with jumps at the origin satisfying

    I_j ≡ ∂_x^j h(0+) − ∂_x^j h(0−) ≠ 0,  for j > n − m − 2    (23)

3. Away from the origin, h(x) solves the ordinary differential equation

    Q(−∂_x) h(x) ≡ Σ_{j=0}^{n} a_j (−1)^j ∂_x^j h(x) = 0,  x ∈ R − {0}    (24)

and, at x = 0, the boundary conditions I_j = 0, j ≤ n − m − 2, and

    Σ_{j=k+1}^{n} (−1)^{j−k} a_j I_{j−k−1} = b_k,  k = 0, . . . , m    (25)

Reciprocally, if h is a density and solves (24), (25), then it has CF (19).

Proposition 3.1 Assume h ∈ H with h̃ given by (19), where Q has only simple roots s_{j±}, namely that

    h̃(s) := Σ_j p_{j+}/(s_{j+} − is) + Σ_j p_{j−}/(s_{j−} − is),  where s_{j+} > 0, s_{j−} < 0    (26)

Let κ = Σ_j p_{j+}/s_{j+}. Then the positive (negative) jumps define a sub-sequence of arrival times (t_n^+) (respectively, (t_n^−)) satisfying

    t_n^+ − t_{n−1}^+ ∼ Exp(κλ) i.i.d.,  t_n^− − t_{n−1}^− ∼ Exp((1 − κ)λ) i.i.d.    (27)

That is, positive jumps arrive in a Poisson way with rate κλ.

That is, positive jumps arrive in a Poisson way with rate κ. Proof The nature of the n-th arrival defines naturally a sequence (In ) of Bernoulli trials with results In = +, − where (+) := {Jn > 0} and (−) := {Jn < 0} The definition of characteristic function (19) along with Cauchy’s residue theorem, yield  P(In = +) = P(Jn > 0) =



−∞

˜ h(x)d x = h(0)κ

Note that it follows from the above Lemma that s j ∈ R.

Escape Probabilities from an Interval for Compound Poisson Processes with Drift

101

Besides the sequence (In ) is independent of the Poisson process (Nt )t≥0 since the (Jn ) are independent too. Hence it permits a “thinning”1 of the Poisson process (Nt )t≥0 as i.i.d.

Nt := Nt+ + Nt− where, sayNt+ ∼ Poi (κt)

(28)

Recalling the well-known correspondence {N_t ≥ n} = {t_n ≤ t}, which relates (N_t)_{t≥0} and (t_n), the result follows. ∎

We now show that π(x) can be found by solving a simple ODE.

Theorem 3.1 Suppose Assumptions A1–A4 hold with h ∈ H satisfying (19). Let L be the differential operator L ≡ [Q − R − cλ^{−1} ∂_x ∘ Q](∂_x). Then π(x) solves the (n + 1)-order differential equation

    Lπ ≡ (1/λ) Σ_{j=0}^{n} [λ(a_j − b_j) ∂^j/∂x^j − a_j c ∂^{j+1}/∂x^{j+1}] π(x) = 0    (29)

Besides, it satisfies the n + 1 boundary conditions at x = b−:

    π_0 = 1,  π_0 − ρ^{−1} π_1 = H̄(0+) + ∫_0^b h_−(z − b) π(z) dz  (ρ := λ/c),    (30)

where H̄(x) := 1 − H(x), and, for j = 1, . . . , n − 1,

    π_j − ρ^{−1} π_{j+1} + Σ_{k=0}^{j−1} (−1)^k I_k π_{j−k−1} = (−1)^{j−1} h_+^{(j−1)}(0) − ∫_0^b ∂_z^{(j)} h_−(z − b) π(z) dz    (31)

where we call π_j ≡ π^{(j)}(b−), j = 0, . . . , n − 1.

Proof We write the jump distribution as h(y) = h_+(y) θ(y) + h_−(y) θ(−y), where θ(x) = 1_{x∈(0,∞)} is the Heaviside function. To keep the algebra tidy we introduce M(x) := π(x) − ρ^{−1} ∂_x π(x), and (12) reads

    M(x) = H̄(b − x) + ∫_x^b h_+(z − x) π(z) dz + ∫_0^x h_−(z − x) π(z) dz.

By repeated differentiation we find, for j ≥ 1,

    ∂_x M(x) = h_+(b − x) − π(x) I_0 + ∫_x^b ∂_x h_+(z − x) π(z) dz + ∫_0^x ∂_x h_−(z − x) π(z) dz

¹ More generally, we "thin" the process by independently including selected arrivals with probability p, and throwing the others away. In this application, the thinned process (N_t^+) counts the number of positive arrivals.


    ∂_x^2 M(x) = −h_+^{(1)}(b − x) − ∂_x π(x) I_0 + π(x) I_1 + ∫_x^b ∂_x^2 h_+(z − x) π(z) dz + ∫_0^x ∂_x^2 h_−(z − x) π(z) dz.

By iteration we find

    ∂_x^{(j)} M(x) = (−1)^{j−1} h_+^{(j−1)}(b − x) + Σ_{k=0}^{j−1} (−1)^{k+1} I_k π^{(j−k−1)}(x) + [∫_x^b + ∫_0^x] (−1)^j ∂_z^{(j)} h(z − x) π(z) dz.

This yields (31), sending x → b± and evaluating lim_{x→b+} ∂_x^{(j)} M(x) − lim_{x→b−} ∂_x^{(j)} M(x). Next, operating with Q(∂_x) on the LHS of (12) yields that M satisfies

    Σ_{j=0}^{n} a_j ∂_x^{(j)} M(x) = a_0 H̄(b − x) + a_1 h_+(b − x) − a_2 h_+^{(1)}(b − x) + · · · + (−1)^{n−1} a_n h_+^{(n−1)}(b − x)
        + Σ_{j=0}^{n} a_j Σ_{k=0}^{j−1} (−1)^{k+1} I_k π^{(j−k−1)}(x) + [∫_x^b + ∫_0^x] π(z) Σ_{j=0}^{n} a_j (−1)^j ∂_z^{(j)} h(z − x) dz.

Recalling (24), we see that several terms cancel, as Q(−∂_z) h(z − x) = 0. The above simplifies to

    Σ_{j=0}^{n} a_j ∂_x^{(j)} M(x) = Σ_{j=0}^{n} a_j Σ_{k=0}^{j−1} (−1)^{k+1} I_k π^{(j−k−1)}(x) = Σ_{m=0}^{n−1} π^{(m)}(x) Σ_{j=m+1}^{n} (−1)^{j−m} I_{j−m−1} a_j = Σ_{m=0}^{n−1} π^{(m)}(x) b_m = R(∂_x) π,

where we used (25). Equation (29) follows since Q(∂_x) M ≡ Q(∂_x) π − λ̃^{−1} ∂_x Q(∂_x) π = R(∂_x) π. ∎


We next evaluate the EP in several cases of interest.

Example 3.1 The classical Poisson risk model is recovered when J_1 < 0 and −J_1 ∼ Exp(γ), γ > 0. It corresponds to (see (22)) R(s) = γ, Q(s) = γ + s, so that (see (20)) m = 0, n = 1. From (29), π(x) is found by solving

    [∂_{xx} + (γ − λ̃) ∂_x] π(x) = 0,  0 ≤ x ≤ b.    (32)

This implies that π(x) = A + B e^{(λ̃−γ)x}, where λ̃ := λ/c. Besides, H̄(0+) = 0, so one has the boundary conditions

    π(b) = 1,  1 − ∫_{−b}^{0} γ e^{γy} π(b + y) dy = λ̃^{−1} π′(b).

By insertion one finds first that B = −(λ̃/γ) A. It follows that the only solution of (32) is

    π(x) := P(τ^b < τ^0) = [1 − (λ̃/γ) e^{(λ̃−γ)x}] / [1 − (λ̃/γ) e^{(λ̃−γ)b}]    (33)

π(x) is a concave function when c E t_1 + E J_1 > 0 (recall J_1 < 0), i.e. when γ − λ̃ > 0. In this case the drift term dominates over the jumps; as a consequence, the survival probability is non-trivial: (4) reads

    ψ(x) := P(X_t > 0, ∀t) = π_{b→∞}(x) = 1 − (λ̃/γ) e^{(λ̃−γ)x}.    (34)

Hence we recover the result of Asmussen and Albrecher ([1], p. 47) for the one-sided Poisson risk model. When γ < λ̃ the situation inverts: π(x) is convex and ψ(x) = 0. Finally, if γ − λ̃ = 0 the drift is exactly balanced by the jumps. Here π grows linearly with the distance to the origin:

    π(x) = (1 + γx)/(1 + γb) ≡ π(0)(1 + γx).    (35)

Example 3.2 Symmetric exponential distribution. We generalize the above to a truly two-sided case. Suppose that J_1 ∼ Laplace(0, γ), i.e. J_1 has density and characteristic function

    h(x) = (γ/2) e^{−γ|x|}  and  h̃(s) = γ²/(γ² + s²),  γ > 0.    (36)

This identifies (see Eq. (22)) R(s) = γ², Q(s) = γ² − s², m = 0 and n = 2. We start by considering the case c = 0. Equation (29) then reduces to ∂_{xx} π(x) = 0, hence π(x) = A + Bx. Here π(b) = 1 is not required. Since H̄(0+) = 1/2, the boundary conditions read

    A + Bb = 1/2 + ∫_0^b (γ/2) e^{γ(z−b)} (A + Bz) dz,
    B = γ/2 − ∫_0^b (γ²/2) e^{γ(z−b)} (A + Bz) dz.    (37)

The second equation simplifies to A + B(b + 1/γ) = 1. Thus one has

    π(x) = (1 + γx)/(2 + γb).    (38)

It is interesting to compare (35) and (38), as both models have the same safety loading parameter η := cμ/m − 1 = 0, so E X_t = 0 in both cases. The comparison shows that, to increase the probability of exiting via the upper barrier, a constant drift is a more powerful mechanism than having to wait for an exponential jump with the same mean per unit of time.

Acknowledgements The authors acknowledge support from the Spanish Agencia Estatal de Investigación and the European Fondo Europeo de Desarrollo Regional (AEI/FEDER, UE) under Contract No. FIS2016-78904-C3-2-P.

References

1. Asmussen, S., Albrecher, H.: Ruin Probabilities. World Scientific, Singapore (2010)
2. Avram, F., Pistorius, M.R., Usabel, M.: The two barriers ruin problem via a Wiener Hopf decomposition. Ann. Univ. Cracovia, Math. Comp. Sci. Ser. 30, 38–44 (2003)
3. Bertoin, J.: Subordinators: examples and applications. In: Bernard, P. (ed.) Lectures on Probability Theory and Statistics. Lecture Notes in Mathematics, vol. 1717, pp. 1–91. Springer, Berlin (2004)
4. Cai, N.: On first passage times of a hyper-exponential jump diffusion process. Oper. Res. Lett. 37(2), 127–134 (2009)
5. Cramér, H.: On the Mathematical Theory of Risk. HC Collected Works, 1, pp. 601–678. Springer, Berlin (1994)
6. Dickson, D.C., Gray, J.R.: Exact solutions for ruin probability in the presence of an upper absorbing barrier. Scand. Act. J. 3, 174–186 (1984)
7. Dickson, D.C., Hipp, C.: Ruin probabilities for Erlang(2) risk process. Insur. Math. Econ. 22, 251–262 (1998)
8. Dickson, D.C., Hipp, C.: Time to ruin for Erlang(2) risk process. Insur. Math. Econ. 29(3), 333–344 (2001)
9. Feller, W.: Diffusion processes in one dimension. Trans. Amer. Math. Soc. 77, 1–31 (1954)
10. Gao, J., Wu, L., Liu, H.: Probability of ruin in a continuous risk model with two types of delayed claims. Comm. Statist. Theor. Meth. 45(13), 3734–3750 (2016)
11. Gerber, H.U., Shiu, E.S.W.: The time value of ruin in a Sparre Andersen model. N. Amer. Act. J. 9(2), 49–69 (2005)
12. Helmstetter, A., Sornette, D.: Diffusion of epicenters of earthquake aftershocks, Omori law and generalized continuous-time random walk models. Phys. Rev. E: Statist. Phys. 66, 061104 (2003)


13. Jacobsen, M.: The time to ruin for a class of Markov additive risk process with two-sided jumps. Adv. Appl. Probab. 37(4), 936–992 (2005)
14. Karlin, S., Taylor, H.: A First Course in Stochastic Processes. Academic Press, New York (1981)
15. Klüppelberg, C., Kyprianou, A.E., Maller, R.A.: Ruin probabilities and overshoots for general Lévy insurance risk processes. Ann. Appl. Prob. 14(4), 1766–1801 (2004)
16. Kou, S.G., Wang, H.: First passage times of a jump diffusion process. Adv. Appl. Prob. 35(2), 504–531 (2003)
17. Lavergnat, L., Gole, P.: Stochastic raindrop time distribution model. J. Appl. Meteor. 37, 805–818 (1998)
18. Li, S., Garrido, J.: On ruin for the Erlang(n) risk process. Insur. Math. Econ. 34, 391–408 (2004)
19. Merton, R.C.: Option pricing when stock returns are discontinuous. J. Fin. Econ. 3, 125–144 (1976)
20. Montero, M., Villarroel, J.: Mean exit times in non-Markovian drifting random-walk processes. Phys. Rev. E: Statist. Phys. 82, 021102 (2010)
21. Perona, P., Daly, E., Crouzy, B., Porporato, A.: Stochastic dynamics of snow avalanche by superposition of Poisson processes. Proc. R. Soc. A 468, 4193–4208 (2012)
22. Perry, D., Stadje, W., Zacks, S.: First-exit time for the compound Poisson processes for some types of positive and negative jumps. Stoch. Models 18, 139–157 (2002)
23. Ramsden, L., Papaioannou, A.D.: Ruin probabilities under capital constraints. Insur. Math. Econ. 88, 273–282 (2019)
24. Rogers, L.C.G.: The two-sided exit problem for spectrally positive Lévy processes. Adv. Appl. Prob. 22, 486–487 (1990)
25. Rolski, T., Schmidli, H., Schmidt, V., Teugels, J.: Stochastic Processes for Insurance and Finance. Wiley, New York (2006)
26. Villarroel, J., Montero, M.: On the integrability of the Poisson driven stochastic nonlinear Schrödinger equations. Stud. Appl. Math. 127(4), 372–393 (2011)
27. Wen, Y.Z., Yin, C.C.: Exit problems for jump processes having double-sided jumps with rational Laplace transforms. Abstr. Appl. Anal., Art. 747262 (2014)
28. Zhang, Z., Yang, H., Li, S.: The perturbed compound Poisson risk model with two-sided jumps. J. Comput. Appl. Math. 33(8), 1773–1784 (2010)
29. Zhou, X.: When does surplus reach a level before ruin? Insur. Math. Econ. 35, 553–561 (2004)

A Note on the Notion of Informative Composite Density

Konstantinos Zografos

Abstract This note concentrates on the notion of an informative composite density, that is, a composite density function which stands closer, in a sense, to the true but unknown model which describes the data. It aims to provide a preliminary discussion of how the composite density is affected by the components of the random vector that constitute the basis for the definition of this special type of density. It is expected that the composite maximum likelihood estimator is similarly affected by the components of the composite density.

1 Preliminaries

The subject of composite likelihood methods in estimation has a long history: it is over forty years old and has been extensively discussed in the existing literature. It was signaled (cf. Varin [16]) in the pioneering work of Besag [1], Cox [7] and Lindsay [9], and a long queue of papers, special issues of journals and thematic conferences has followed over the years. For a bibliography on the subject we indicatively mention the papers by Varin et al. [17], Reid et al. [15], Joe et al. [8], Reid [14], Martín et al. [10] and the references therein, among many others. We adopt here the notation of Joe et al. [8] regarding the composite likelihood function and the respective composite maximum likelihood estimator (CMLE). In this regard, and closely following the exposition in Castilla et al. [3–5] and Martín et al. [10], let Y_1, . . . , Y_n be independent and identically distributed replications of a random m-vector Y = (Y_1, . . . , Y_m)^T, which are described by the true but

This work is dedicated to the 65th birthday of Prof. Leandro Pardo, honoring his outstanding work and contributions to the fields of statistics and statistical information theory. On this occasion, I cordially thank Leandro for the long-standing collaboration and friendship, and I express my deep appreciation to him, one of my most valuable collaborators and friends.

K. Zografos (B) Department of Mathematics, University of Ioannina, 451 10 Ioannina, Greece e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_11


unknown distribution g, with distribution function denoted by G. Taking into account that the true model g is unknown, suppose that it is included in {f(·; θ), θ ∈ Θ ⊆ R^p, p ≥ 1}, a parametric identifiable family of candidate distributions to describe the observations y_1, . . . , y_n. Let also A = {A_k}_{k=1}^K denote a family of sets of random variables associated either with marginal or with conditional distributions involving some y_j, j ∈ {1, . . . , m}. In this setting, the composite density which is based on K different marginal or conditional distributions has the form

    CL_A(θ, y) = Π_{k=1}^{K} f_{A_k}(y_j, j ∈ A_k; θ)^{w_k},

where w_k, k = 1, . . . , K, are non-negative and known weights. If the weights are all equal, then they can be ignored; in this case, all the statistical procedures produce equivalent results.

where wk , k = 1, . . . , K are non-negative and known weights. If the weights are all equal, then they can be ignored. In this case, all the statistical procedures produce equivalent results. The corresponding composite log-density has the form cA (θ , y ) =

K 

wk  Ak (θ , y),

k=1

with ℓ_{A_k}(θ, y) = log f_{A_k}(y_j, j ∈ A_k; θ). In order to use the notion of composite likelihood to develop statistical inference on the unknown parameters of the model, we reproduce here the exposition of Castilla et al. [3], for the sake of completeness. Based on this paper, the CMLE, θ̂_c, is defined by

    θ̂_c = arg max_{θ∈Θ} Σ_{i=1}^{n} c_A(θ, y_i) = arg max_{θ∈Θ} Σ_{i=1}^{n} Σ_{k=1}^{K} w_k ℓ_{A_k}(θ, y_i),    (1)

and it can also be obtained by solving the equations u(θ, y_1, . . . , y_n) = 0_p, where

    u(θ, y_1, . . . , y_n) = ∂c_A(θ, y_1, . . . , y_n)/∂θ = Σ_{i=1}^{n} Σ_{k=1}^{K} w_k ∂ℓ_{A_k}(θ, y_i)/∂θ.

However, the CMLE θ̂_c can also be obtained on the basis of the Kullback–Leibler divergence measure, as follows. The Kullback–Leibler divergence between the true model g and the composite density function CL_A(θ, y), associated with the parametric model f(·; θ), is defined by

    d_KL(g(·), CL_A(θ, ·)) = ∫_{R^m} g(y) log [g(y)/CL_A(θ, y)] dy.


Following Castilla et al. [3], the estimator

    θ̂_KL = arg min_θ d_KL(g(·), CL_A(θ, ·))

coincides with the CMLE θ̂_c. Hence, the CMLE is also obtained by minimizing the distance, in the Kullback–Leibler sense, between the composite density CL_A and the true model g. The CMLE θ̂_c obeys asymptotic normality (see Joe et al. [8]); in particular, √n (θ̂_c − θ) →_L N(0, (G*(θ))^{−1}) as n → ∞, where G*(θ) denotes the Godambe information matrix

    G*(θ) = H(θ) (J(θ))^{−1} H(θ),

with H(θ) being the sensitivity or Hessian matrix and J(θ) the variability matrix, defined, respectively, by

    H(θ) = E_θ[−(∂/∂θ) u(θ, Y)^T],  J(θ) = Var_θ[u(θ, Y)] = E_θ[u(θ, Y) u(θ, Y)^T],

where the superscript T denotes the transpose of a vector or a matrix. Useful comments on and properties of the above matrices are given in the papers by Lindsay ([9], Lemma 4) and Joe et al. [8], among others.
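A toy illustration of the CMLE of (1) (a sketch, not from the note; the bivariate-normal setup and all names are assumptions): for a common mean θ of a bivariate normal with unit variances and the independence family A = {{1}, {2}}, the composite score u(θ) = 0 is solved by the grand mean of all 2n coordinates, whatever the true correlation ρ:

```python
import math
import random
import statistics

rng = random.Random(42)
theta_true, rho, n = 2.0, 0.8, 4000      # illustrative values

sample = []
for _ in range(n):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y1 = theta_true + z1
    y2 = theta_true + rho * z1 + math.sqrt(1 - rho**2) * z2  # corr(Y1, Y2) = rho
    sample.append((y1, y2))

# c_A(theta) = sum_i [log phi(y_i1 - theta) + log phi(y_i2 - theta)];
# setting the composite score u(theta) to zero gives the grand mean:
theta_c = statistics.fmean(y for pair in sample for y in pair)
```

The estimator ignores the dependence between the coordinates, which is exactly why its efficiency is governed by the Godambe rather than the Fisher information.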

2 Comparison of Composite Densities

In the composite likelihood estimation procedure the classic likelihood, which is defined on the basis of the parametric model f(·; θ), is substituted by the composite likelihood, which is based on the composite density CL_A(θ, y). This last density depends on the family of indices A = {A_k}_{k=1}^K, where each A_k is a subset of {1, . . . , m}. Hence, a natural question is raised at this point: how is the estimation procedure affected by the family A = {A_k}_{k=1}^K which is used to define the composite density CL_A(θ, y) = Π_{k=1}^K f_{A_k}(y_j, j ∈ A_k; θ)? A possible answer to this question can be based on investigating how close the composite density CL_A stands to the true model g, that is, on the distance between the true model g and the composite density function CL_A(θ, y) associated with the parametric model f(·; θ) and the family A = {A_k}_{k=1}^K. If this distance is of small magnitude, then the family A = {A_k}_{k=1}^K which leads to the definition of CL_A(θ, y) causes a small difference between the true model and the model which is used to estimate the unknown parameters. Then CL_A stands closer to the true model g and, therefore, CL_A is arguably more informative, in the sense that the loss of information due to the substitution of g by CL_A is reduced. This particular, more informative, composite density may lead to more promising estimators. However, the true model g is unknown. In this case we can approximate the distance between g and CL_A(θ, y) by the distance between f(·; θ) and CL_A(θ, y). Let us use a Kullback–Leibler-type distance between f(·; θ) and CL_A(θ, y). It is defined by

    d_KL(f, CL_A) = d_KL(f(·, θ), CL_A(θ, ·)) = ∫_{R^m} f(y; θ) log [f(y; θ)/CL_A(θ, y)] dy.    (2)
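A numerical sketch of (2), not from the note (the bivariate Gaussian example is an assumption): when CL_A is the product of marginals, (2) becomes a mutual information, and for a standard bivariate normal with correlation ρ it equals −(1/2) log(1 − ρ²), which direct quadrature of (2) reproduces:

```python
import math

def mi_exact(rho):
    """Mutual information of a standard bivariate normal: -0.5*log(1 - rho^2)."""
    return -0.5 * math.log(1.0 - rho * rho)

def mi_numeric(rho, L=8.0, n=320):
    """Midpoint quadrature of (2) with f bivariate normal and CL_A the product of marginals."""
    d = 2.0 * L / n
    r2 = 1.0 - rho * rho
    cst = 1.0 / (2.0 * math.pi * math.sqrt(r2))
    tot = 0.0
    for i in range(n):
        x = -L + (i + 0.5) * d
        for j in range(n):
            y = -L + (j + 0.5) * d
            f = cst * math.exp(-(x * x - 2.0 * rho * x * y + y * y) / (2.0 * r2))
            f12 = math.exp(-(x * x + y * y) / 2.0) / (2.0 * math.pi)  # product of marginals
            tot += f * math.log(f / f12) * d * d
    return tot
```

The divergence grows with |ρ|, matching the idea that (2) quantifies how much dependence the composite density discards.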

Hence, it is of interest to investigate the behavior of d_KL(f, CL_A) with respect to the family A.

Remark 2.1 Suppose now that the weights are all equal; then they can be ignored.
(a) Based on the previous discussion, if K = 1, A = A_1 = {1, 2, …, m} and w_1 = 1, then CL_A(θ, y) = f(y; θ), that is, the composite density coincides with the joint density that drives the random m-vector Y, and this is equivalent to d_KL(f, CL_A) = 0. In this case it is quite clear from (1) that the CMLE θ̂_c coincides with the well-known maximum likelihood estimator of θ.
(b) Moreover, for A = {A_k}_{k=1}^K, if K = m and A_k = {k}, k = 1, …, m, then CL_A(θ, y) coincides with the product of marginals
\[
\mathrm{CL}_{\mathcal{A}}(\theta, \boldsymbol{y}) = f_0(\boldsymbol{y};\theta) = \prod_{k=1}^{m} f_{A_k}(y_k;\theta) = \prod_{k=1}^{m} f_{Y_k}(y_k;\theta),
\]
where f_{Y_k} is the marginal density of Y_k. In this case,

d_KL(f(·, θ), CL_A(θ, ·)) of (2) coincides with the mutual information (cf., for example, Micheas and Zografos [12], Blumentritt and Schmid [2] and the references therein), a measure of dependence between the components of the random vector Y = (Y_1, …, Y_m)^T, which is defined by
\[
d_{KL}(f, \mathrm{CL}_{\mathcal{A}}) = I(\boldsymbol{Y};\theta) = \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \ln \frac{f(\boldsymbol{y};\theta)}{f_0(\boldsymbol{y};\theta)}\, d\boldsymbol{y}
= \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \ln \frac{f(\boldsymbol{y};\theta)}{\prod_{k=1}^{m} f_{Y_k}(y_k;\theta)}\, d\boldsymbol{y}
= \int_{[0,1]^m} c_{\boldsymbol{Y}}(\boldsymbol{u};\theta)\, \ln c_{\boldsymbol{Y}}(\boldsymbol{u};\theta)\, d\boldsymbol{u}, \tag{3}
\]

where c_Y denotes the copula density associated with Y = (Y_1, …, Y_m)^T.
(c) Suppose, moreover, without loss of generality, that A = {A_k}_{k=1}^K defines a partition of the set of indices {1, 2, …, m} of Y = (Y_1, …, Y_m)^T, that is, A_k ⊆ {1, 2, …, m}, A_k ∩ A_ℓ = ∅ for k, ℓ = 1, …, K, k ≠ ℓ, and ∪_k A_k = {1, …, m}. In this case, A = {A_k}_{k=1}^K defines a corresponding partition of the components of the random m-vector Y = (Y_1, …, Y_m)^T into K subvectors Y_{A_k} = (Y_j, j ∈ A_k)^T, k = 1, …, K. In this setting, d_KL(f, CL_A) of (2) is, in essence, the mutual information of the random subvectors Y_{A_1}, …, Y_{A_K} of the initial random vector Y, defined by

A Note on the Notion of Informative Composite Density


\[
d_{KL}(f, \mathrm{CL}_{\mathcal{A}}) = I(\boldsymbol{Y}_{A_1}, \ldots, \boldsymbol{Y}_{A_K};\theta)
= \int_{\mathbb{R}^m} f(\boldsymbol{y}_{A_1}, \ldots, \boldsymbol{y}_{A_K};\theta)\, \ln \frac{f(\boldsymbol{y}_{A_1}, \ldots, \boldsymbol{y}_{A_K};\theta)}{\prod_{k=1}^{K} f_{\boldsymbol{Y}_{A_k}}(\boldsymbol{y}_{A_k};\theta)}\, d\boldsymbol{y} \tag{4}
= \int_{[0,1]^m} c_{\mathcal{A}}(\boldsymbol{u}_1, \ldots, \boldsymbol{u}_K;\theta)\, \ln c_{\mathcal{A}}(\boldsymbol{u}_1, \ldots, \boldsymbol{u}_K;\theta)\, d\boldsymbol{u}_1 \cdots d\boldsymbol{u}_K,
\]

where c_A denotes the copula density associated with (Y_{A_1}, …, Y_{A_K}). It is clear that the last term is the negative of the Shannon entropy of the copula density c_A.

It is obvious from the previous remark that d_KL(f, CL_A) is the key quantity which leads to the CMLE and that, moreover, it formulates the degree of dependence between the components of CL_A. It is therefore logical to consider that the resulting CMLE, θ̂_c = θ̂_KL, is affected by the choice of CL_A and, consequently, by the degree of dependence of the component variables which define the composite density. Based on this observation we will concentrate, in the sequel, on the study of d_KL(f, CL_A). In this context, taking into account the decisive role of d_KL(f, CL_A) of (2), we will try to simplify its expression. In this direction, we first observe that d_KL(f, CL_A) can be decomposed into two parts, namely
\[
d_{KL}(f, \mathrm{CL}_{\mathcal{A}}) = \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \log f(\boldsymbol{y};\theta)\, d\boldsymbol{y} - \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \log \mathrm{CL}_{\mathcal{A}}(\theta,\boldsymbol{y})\, d\boldsymbol{y}, \tag{5}
\]



or
\[
d_{KL}(f, \mathrm{CL}_{\mathcal{A}}) = -H(f, \theta) - \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \log \mathrm{CL}_{\mathcal{A}}(\theta,\boldsymbol{y})\, d\boldsymbol{y},
\]
where
\[
H(f, \theta) = - \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \log f(\boldsymbol{y};\theta)\, d\boldsymbol{y} \tag{6}
\]

denotes the Shannon entropy of f. So, we have to simplify the second term on the right-hand side of (5). Based on
\[
\mathrm{CL}_{\mathcal{A}}(\theta, \boldsymbol{y}) = \prod_{k=1}^{K} f_{A_k}(y_j,\ j \in A_k; \theta)^{w_k},
\]
we obtain

\[
\begin{aligned}
\int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \log \mathrm{CL}_{\mathcal{A}}(\theta,\boldsymbol{y})\, d\boldsymbol{y}
&= \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \log \prod_{k=1}^{K} f_{A_k}(y_j,\ j \in A_k; \theta)^{w_k}\, d\boldsymbol{y} \\
&= \sum_{k=1}^{K} w_k \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \log f_{A_k}(y_j,\ j \in A_k; \theta)\, d\boldsymbol{y} \\
&= \sum_{k=1}^{K} w_k \int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \ell_{A_k}(\theta, \boldsymbol{y})\, d\boldsymbol{y},
\end{aligned} \tag{7}
\]

where ℓ_{A_k}(θ, y) = log f_{A_k}(y_j, j ∈ A_k; θ). Now, a specific A_k, k = 1, …, K, includes some of the indices j ∈ {1, …, m}, and therefore f_{A_k} is a marginal density. Hence, for a specific A_k, it is obvious that

\[
\begin{aligned}
\int_{\mathbb{R}^m} f(\boldsymbol{y};\theta)\, \ell_{A_k}(\theta, \boldsymbol{y})\, d\boldsymbol{y}
&= \int_{\mathbb{R}_{A_k}} \int_{\mathbb{R}_{A_k^c}} f(\boldsymbol{y};\theta)\, \ell_{A_k}(\theta, \boldsymbol{y})\, d\boldsymbol{y}_{A_k^c}\, d\boldsymbol{y}_{A_k} \\
&= \int_{\mathbb{R}_{A_k}} \Bigl\{ \int_{\mathbb{R}_{A_k^c}} f(\boldsymbol{y};\theta)\, d\boldsymbol{y}_{A_k^c} \Bigr\}\, \ell_{A_k}(\theta, \boldsymbol{y})\, d\boldsymbol{y}_{A_k} \\
&= \int_{\mathbb{R}_{A_k}} f_{A_k}(y_j,\ j \in A_k; \theta)\, \ell_{A_k}(\theta, \boldsymbol{y})\, d\boldsymbol{y}_{A_k} \\
&= -H(f_{A_k}, \theta),
\end{aligned} \tag{8}
\]

where H(f_{A_k}, θ) is the Shannon entropy of the marginal density f_{A_k}(y_j, j ∈ A_k; θ) = ∫_{ℝ_{A_k^c}} f(y; θ) dy_{A_k^c}, k = 1, …, K, with ℝ_{A_k} = ×_{j∈A_k} ℝ, ℝ_{A_k^c} = ×_{j∈A_k^c} ℝ, A_k^c = {1, 2, …, m} − A_k, and dy_{A_k} = ×_{j∈A_k} dy_j, dy_{A_k^c} = ×_{j∈A_k^c} dy_j. Then, based on (5)–(8),

\[
d_{KL}\bigl(f(\cdot,\theta),\, \mathrm{CL}_{\mathcal{A}}(\theta,\cdot)\bigr) = -H(f, \theta) + \sum_{k=1}^{K} w_k\, H(f_{A_k}, \theta). \tag{9}
\]

The above derivations lead to the formulation of the next result.

Proposition 2.1 Consider an m-dimensional random vector Y = (Y_1, …, Y_m)^T with density f(y; θ), depending on a p-dimensional parameter θ ∈ Θ ⊆ ℝ^p, p ≥ 1, and let CL_A(θ, y) denote the composite density defined by
\[
\mathrm{CL}_{\mathcal{A}}(\theta, \boldsymbol{y}) = \prod_{k=1}^{K} f_{A_k}(y_j,\ j \in A_k; \theta)^{w_k}.
\]
Then the Kullback–Leibler divergence (2) between f(y; θ) and CL_A(θ, y) is given by means of the Shannon entropies H(f, θ) and H(f_{A_k}, θ), defined in (6) and (8) respectively, as follows:


\[
d_{KL}\bigl(f(\cdot,\theta),\, \mathrm{CL}_{\mathcal{A}}(\theta,\cdot)\bigr) = -H(f, \theta) + \sum_{k=1}^{K} w_k\, H(f_{A_k}, \theta).
\]
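For Gaussian models the entropies in Proposition 2.1 have the closed form H = ½ log{(2πe)^d |Σ|}, so the decomposition (9) is easy to evaluate and check numerically. The following sketch is ours, not the chapter's (the function names `gaussian_entropy` and `dkl_composite` are our choices); it computes d_KL(f, CL_A) for a zero-mean normal model and an arbitrary family of disjoint blocks, using the covariance matrix (10) of the example in Sect. 3:

```python
import numpy as np

def gaussian_entropy(cov):
    """Shannon entropy of a d-variate normal: (1/2) * log((2*pi*e)^d * |cov|)."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def dkl_composite(cov, blocks, weights=None):
    """d_KL(f, CL_A) of Eq. (9): -H(f) + sum_k w_k * H(f_{A_k}),
    for a normal f with covariance `cov` and blocks A_k given as index lists."""
    if weights is None:
        weights = [1.0] * len(blocks)
    marginal_part = sum(w * gaussian_entropy(cov[np.ix_(b, b)])
                        for w, b in zip(weights, blocks))
    return -gaussian_entropy(cov) + marginal_part

# Covariance (10) of the four-dimensional example in Sect. 3, for rho = 0.2
rho = 0.2
Sigma = np.array([[1, rho, 2*rho, 2*rho],
                  [rho, 1, 2*rho, 2*rho],
                  [2*rho, 2*rho, 1, rho],
                  [2*rho, 2*rho, rho, 1]])

d_full = dkl_composite(Sigma, [[0, 1, 2, 3]])              # A = {1,2,3,4}
d_pairs = dkl_composite(Sigma, [[0, 1], [2, 3]])           # A = {{1,2},{3,4}}
d_singletons = dkl_composite(Sigma, [[0], [1], [2], [3]])  # independence model
```

As Proposition 2.2 below predicts, 0 = d_full ≤ d_pairs ≤ d_singletons = I(Y; θ); the singleton value reduces to −½ log|Σ|, as in (11).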

The previous proposition expresses d_KL(f(·, θ), CL_A(θ, ·)) in terms of the associated Shannon entropies. The next proposition investigates the range of values of d_KL in the case of a partition A = {A_k}_{k=1}^K of the set of indices {1, 2, …, m}. In the sequel, we suppose that all the w_k, k = 1, …, K, are equal, so they can be ignored.

Proposition 2.2 Let CL_A(θ, y) denote the composite density defined by
\[
\mathrm{CL}_{\mathcal{A}}(\theta, \boldsymbol{y}) = \prod_{k=1}^{K} f_{A_k}(y_j,\ j \in A_k; \theta),
\]
where A = {A_k}_{k=1}^K defines a partition of the set of indices {1, 2, …, m}. Then

\[
0 \le d_{KL}\bigl(f(\cdot,\theta),\, \mathrm{CL}_{\mathcal{A}}(\theta,\cdot)\bigr) \le d_{KL}\Bigl(f,\ \prod_{k=1}^{m} f_{Y_k}\Bigr) = I(\boldsymbol{Y};\theta),
\]

with I(Y; θ) the mutual information defined by (3). The left-hand side of the above inequality is achieved if and only if f(·, θ) and CL_A(θ, ·) coincide, which is valid if K = 1, A = A_1 = {1, 2, …, m}. The right-hand side is achieved if and only if the random subvectors Y_{A_1}, …, Y_{A_K} of the initial random vector Y are independent.

Proof The proof of the lower bound is trivial: it follows from a similar inequality obeyed by the Kullback–Leibler divergence; see, for example, Cover and Thomas ([6], p. 252). The upper bound of d_KL is obtained by observing that
\[
H(f_{A_k}, \theta) \le \sum_{j \in A_k} H(f_{Y_j}, \theta),
\]
in view of Cover and Thomas ([6], p. 253), with equality if and only if Y_1, …, Y_m are independent. The same conclusion follows from the previous remark, and more precisely from (4), which says that d_KL(f(·, θ), CL_A(θ, ·)) = I(Y_{A_1}, …, Y_{A_K}; θ). Now, based on Micheas and Zografos ([12], Theorem 2.1) or Blumentritt and Schmid ([2], Proposition 3.1(e)), I(Y_{A_1}, …, Y_{A_K}; θ) ≤ I(Y_1, …, Y_m; θ), which completes the proof. □




Although the meaning of the previous results is illustrated in the next examples and discussed further in the conclusions, we add a few words here about the decisive role of the family A, which is the cornerstone of the composite density CL_A and of the respective composite likelihood estimation methodology. Based on (4), CL_A coincides with f if and only if the random subvectors Y_{A_1}, …, Y_{A_K} are independent, that is, if and only if the family of indices A = {A_k}_{k=1}^K used to define CL_A refers to independent subvectors of Y = (Y_1, …, Y_m)^T. Therefore, more dependent subvectors Y_{A_1}, …, Y_{A_K} move the composite density CL_A away from the model f, and this may affect the respective CMLE.

3 Examples

Following Wang and Wu [18] and Varin et al. [17], there are two general types of composite likelihood: marginal and conditional. In view of these papers, the simplest composite likelihood density is the one constructed under the independence assumption, that is,
\[
\mathrm{CL}_{\mathcal{A}}^{ind}(\theta, \boldsymbol{y}) = \prod_{j=1}^{m} f_j(y_j;\theta),
\]
and, if the inferential interest also lies in parameters prescribing a dependence structure, a pairwise composite likelihood density is defined as follows:
\[
\mathrm{CL}_{\mathcal{A}}^{pair}(\theta, \boldsymbol{y}) = \prod_{r=1}^{m-1} \prod_{s=r+1}^{m} f_{r,s}(y_r, y_s;\theta),
\]

where f_j and f_{r,s} denote the marginal densities, for j, r, s = 1, …, m. In order to illustrate the procedures presented above, let us start with a standard example, introduced in Xu and Reid [19], which is ideal for illustrative purposes. Consider the random vector Y = (Y_1, Y_2, Y_3, Y_4)^T which follows a four-dimensional normal distribution with density f, mean vector μ = (μ_1, μ_2, μ_3, μ_4)^T and variance–covariance matrix
\[
\Sigma = \begin{pmatrix} 1 & \rho & 2\rho & 2\rho \\ \rho & 1 & 2\rho & 2\rho \\ 2\rho & 2\rho & 1 & \rho \\ 2\rho & 2\rho & \rho & 1 \end{pmatrix}, \tag{10}
\]
i.e., we suppose that the pairs of random variables Y_1, Y_2 and Y_3, Y_4 are equicorrelated. Taking into account that Σ should be positive semi-definite, the following condition is imposed: −1/5 ≤ ρ ≤ 1/3.


As a first case in the above setting, we consider the composite density under the independence assumption. Then, for m = K = 4, let A_1 = {A_k}_{k=1}^K with A_k = {k}, k = 1, 2, 3, 4; H(f_{A_k}, θ) of (8) is the Shannon entropy of the univariate normal distribution with mean μ_k, k = 1, 2, 3, 4, and variance equal to one. Then, based on Pardo ([13], p. 32), H(f_{A_k}, θ) of (8) is given by
\[
H(f_{A_k}, \theta) = \tfrac{1}{2}\log(2\pi e).
\]
On the other hand, the Shannon entropy (6) of the four-dimensional normal distribution with density f, mean vector μ = (μ_1, μ_2, μ_3, μ_4)^T and variance–covariance matrix (10) is given (cf. Pardo [13], p. 32) by
\[
H(f, \theta) = \tfrac{1}{2}\log\bigl\{(2\pi e)^4 |\Sigma|\bigr\} = 2\log(2\pi e) + \tfrac{1}{2}\log|\Sigma|.
\]
Taking into account that |Σ| = −15ρ⁴ + 32ρ³ − 18ρ² + 1, it is immediate to see, in view of (9), that for equal w_k's,
\[
d_{KL}\bigl(f(\cdot,\theta),\, \mathrm{CL}_{\mathcal{A}_1}(\theta,\cdot)\bigr) = -H(f,\theta) + \sum_{k=1}^{K} w_k H(f_{A_k},\theta) = -\tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{2}\log\bigl(-15\rho^4 + 32\rho^3 - 18\rho^2 + 1\bigr), \tag{11}
\]

where A_1 = {A_k = {k} : k = 1, 2, 3, 4}. As a second case in the above setting, consider the composite density CL_{A_2}(θ, y) = f_{A_1}(y; θ) f_{A_2}(y; θ), with f_{A_1}(y; θ) = f_{12}(y_1, y_2; μ_1, μ_2, ρ) and f_{A_2}(y; θ) = f_{34}(y_3, y_4; μ_3, μ_4, ρ), where f_{12} and f_{34} are the densities of the marginals of Y, i.e., bivariate normal distributions with mean vectors (μ_1, μ_2)^T and (μ_3, μ_4)^T, respectively, and common variance–covariance matrix
\[
\Sigma_0 = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.
\]
Then A_1 = {1, 2}, A_2 = {3, 4}, A_2 = {A_1, A_2}, and H(f_{A_k}, θ) of (8) is the Shannon entropy of the bivariate normal distribution with mean vector (μ_1, μ_2)^T or (μ_3, μ_4)^T, respectively, and common variance–covariance matrix Σ_0. H(f_{A_k}, θ) is given (cf. Pardo [13], p. 32) by


\[
H(f_{A_k}, \theta) = \tfrac{1}{2}\log\bigl\{(2\pi e)^2 |\Sigma_0|\bigr\} = \log(2\pi e) + \tfrac{1}{2}\log|\Sigma_0|, \quad k = 1, 2.
\]
On the other hand, the Shannon entropy (6) of the four-dimensional normal distribution with density f, mean vector μ = (μ_1, μ_2, μ_3, μ_4)^T and variance–covariance matrix (10) is
\[
H(f, \theta) = \tfrac{1}{2}\log\bigl\{(2\pi e)^4 |\Sigma|\bigr\} = 2\log(2\pi e) + \tfrac{1}{2}\log|\Sigma|.
\]
Taking into account that |Σ| = −15ρ⁴ + 32ρ³ − 18ρ² + 1 and |Σ_0| = 1 − ρ², it is immediate to see, in view of (9), that for equal w_k's,
\[
\begin{aligned}
d_{KL}\bigl(f(\cdot,\theta),\, \mathrm{CL}_{\mathcal{A}_2}(\theta,\cdot)\bigr) &= -H(f,\theta) + \sum_{k=1}^{K} w_k H(f_{A_k},\theta) \\
&= -2\log(2\pi e) - \tfrac{1}{2}\log|\Sigma| + 2\log(2\pi e) + \log|\Sigma_0| \\
&= -\tfrac{1}{2}\log\bigl(-15\rho^4 + 32\rho^3 - 18\rho^2 + 1\bigr) + \log\bigl(1-\rho^2\bigr),
\end{aligned} \tag{12}
\]
where A_2 = {A_1 = {1, 2}, A_2 = {3, 4}}. Equations (11) and (12) lead to the conclusion that
\[
d_{KL}(f, \mathrm{CL}_{\mathcal{A}_1}) - d_{KL}(f, \mathrm{CL}_{\mathcal{A}_2}) = -\log(1-\rho^2) \ge 0,
\]
with strict inequality for ρ ≠ 0, which is in full agreement with the joint plot of d_KL(f, CL_{A_1}) and d_KL(f, CL_{A_2}) in the next figure. The red solid line corresponds to the values of d_KL(f, CL_{A_1}), that is, the degree of dependence between the components of Y = (Y_1, Y_2, Y_3, Y_4)^T, while the blue dashed line corresponds to d_KL(f, CL_{A_2}), where CL_{A_2}(θ, y) = f_{A_1}(y; θ) f_{A_2}(y; θ) = f_{12}(y_1, y_2; μ_1, μ_2, ρ) f_{34}(y_3, y_4; μ_3, μ_4, ρ), and d_KL(f, CL_{A_2}) formulates the degree of dependence between (Y_1, Y_2) and (Y_3, Y_4). In Fig. 1, d_KL(f, CL_{A_1}) appears to be always greater than or equal to d_KL(f, CL_{A_2}), which means that the composite density CL_{A_2}(θ, y) = f_{A_1}(y; θ) f_{A_2}(y; θ) is closer (in the Kullback–Leibler sense) to the true model than the composite density CL_{A_1}. This was expected in view of Proposition 2.2, because d_KL(f, CL_{A_1}) is the mutual information I(Y; θ) defined in (3). It is also intuitively expected, as CL_{A_2}(θ, y) includes less of the dependence between the Y_i, i = 1, 2, 3, 4, than that which is included in


Fig. 1 Plots of d_KL(f, CL_{A_1}) (solid) and d_KL(f, CL_{A_2}) (dashed)

CL_{A_1}(θ, y), while f(y; θ) includes all of the dependence structure of the components of Y = (Y_1, Y_2, Y_3, Y_4)^T.
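The closed forms (11) and (12), and the determinant expansion of (10), can be checked numerically. The following sketch is our own illustration, not part of the chapter: it scans the admissible range −1/5 ≤ ρ ≤ 1/3 and confirms the ordering d_KL(f, CL_{A_1}) ≥ d_KL(f, CL_{A_2}) ≥ 0 displayed in Fig. 1.

```python
import numpy as np

# Check |Sigma| = -15*rho^4 + 32*rho^3 - 18*rho^2 + 1 and the ordering of
# (11) and (12) over (an interior grid of) the admissible range of rho.
rhos = np.linspace(-0.19, 0.33, 25)
d1 = np.empty_like(rhos)       # (11): independence composite density CL_A1
d2 = np.empty_like(rhos)       # (12): pairwise-block composite density CL_A2
det_err = np.empty_like(rhos)  # |numerical det - polynomial expansion|
for i, rho in enumerate(rhos):
    Sigma = np.array([[1, rho, 2*rho, 2*rho],
                      [rho, 1, 2*rho, 2*rho],
                      [2*rho, 2*rho, 1, rho],
                      [2*rho, 2*rho, rho, 1]])
    det_poly = -15*rho**4 + 32*rho**3 - 18*rho**2 + 1
    det_err[i] = abs(np.linalg.det(Sigma) - det_poly)
    d1[i] = -0.5 * np.log(det_poly)
    d2[i] = d1[i] + np.log(1 - rho**2)
```

Both curves are nonnegative and meet only at ρ = 0, where the components of Y are independent.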

4 Conclusions

This note aims to start an investigation of how the family A = {A_k}_{k=1}^K of subsets of indices, associated with marginal distributions involving some of the y_j, j ∈ {1, …, m}, affects the CMLE θ̂_c (θ̂_KL). It was established that the degree of informativeness of the composite density CL_A, that is, the degree of closeness of CL_A to the family f(·; θ) which includes the true model g, coincides with the degree of dependence (in the mutual information sense) of the subvectors of Y = (Y_1, …, Y_m)^T defined by the family A. Less dependent subvectors of Y lead to composite densities CL_A which are closer to the true model. Hence, it is conjectured that the CMLE is similarly affected by A, and this signals further investigation of the relationship between θ̂_c (θ̂_KL) and the family A, which is the basis of the composite density CL_A(θ, y).

Moreover, when the dimension of the random m-vector Y = (Y_1, …, Y_m)^T is large, that is, in the high-dimensional case, the application of the composite likelihood method may be cumbersome. In such a case, thinking in parallel with the principal components methodology, one may ignore some of the component variables of the composite density CL_A so as to reduce the computational burden. The key step in this direction is the choice of the component variables to be ignored. This problem has been discussed in the existing literature, and the recent work by Mazo et al. [11] focuses on some computational challenges of pairwise likelihood methods. Based on the investigations in this note, the concept of the informative composite density CL_A has been introduced: the composite density which stands close to the true model and, in this sense, causes a small loss of information when the true model is substituted by CL_A. This concept may help in the direction of reducing the computational burden. More precisely, the more dependent subvectors of Y could perhaps be ignored from the composite density, as their presence moves the composite density CL_A away from the true model, as discussed above. However, this remains a subject for further investigation.

References

1. Besag, J.: Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B 36, 192–236 (1974)
2. Blumentritt, T., Schmid, F.: Mutual information as a measure of multivariate association: analytical properties and statistical estimation. J. Stat. Comput. Simul. 82, 1257–1274 (2012)
3. Castilla, E., Martín, N., Pardo, L., Zografos, K.: Composite likelihood methods based on minimum density power divergence estimator. Entropy 20, e20010018 (2018)
4. Castilla, E., Martín, N., Pardo, L., Zografos, K.: Model selection in a composite likelihood framework based on density power divergence. Entropy 22(3), e22030270 (2020)
5. Castilla, E., Martín, N., Pardo, L., Zografos, K.: Composite likelihood methods: Rao-type tests based on composite minimum density power divergence estimator. Statist. Pap. 62, 1003–1041 (2021)
6. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley, New York (2006)
7. Cox, D.R.: Partial likelihood. Biometrika 62, 269–276 (1975)
8. Joe, H., Reid, N., Song, P.-X., Firth, D., Varin, C.: Composite Likelihood Methods. Report on the Workshop on Composite Likelihood (2012). http://www.birs.ca/events/2012/5-dayworkshops/12w5046
9. Lindsay, B.G.: Composite likelihood methods. Contemp. Math. 80, 221–239 (1988)
10. Martín, N., Pardo, L., Zografos, K.: On divergence tests for composite hypotheses under composite likelihood. Statist. Pap. 60, 1883–1919 (2019)
11. Mazo, G., Karlis, D., Rau, A.: A randomized pairwise likelihood method for complex statistical inferences (2021). hal-03126621. https://hal.archives-ouvertes.fr/hal-03126621/document
12. Micheas, A., Zografos, K.: Measuring stochastic dependence using φ-divergence. J. Multivar. Anal. 97, 765–784 (2006)
13. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
14. Reid, N.: Aspects of likelihood inference. Bernoulli 19, 1404–1418 (2013)
15. Reid, N., Lindsay, B., Liang, K.-Y.: Introduction to special issue. Statist. Sinica 21, 1–3 (2011)
16. Varin, C.: On composite marginal likelihoods. Adv. Stat. Anal. 92, 1–28 (2008)
17. Varin, C., Reid, N., Firth, D.: An overview of composite likelihood methods. Statist. Sinica 21, 5–42 (2011)
18. Wang, X., Wu, Y.: Theoretical properties of composite likelihoods. Open J. Statist. 4, 188–197 (2014)
19. Xu, X., Reid, N.: On the robustness of maximum composite estimate. J. Statist. Plann. Infer. 141, 3047–3054 (2011)

Trends in Information Sciences

Equivalence Tests for Multinomial Data Based on φ-Divergences María Virtudes Alba-Fernández and María Dolores Jiménez-Gamero

Abstract Equivalence tests have received increasing attention in recent years, especially in experimental applied fields such as Biology, Medicine or Pharmacology. In the statistical applications in these fields, the multinomial distribution is perhaps one of the most widely used discrete distributions. The family of φ-divergence measures has supported a great variety of inference problems involving multinomial data. For this reason, an equivalence test based on those measures is proposed for this kind of data. The spirit is to incorporate small or irrelevant deviations between the observed data and the target population into the definition of the hypothesis to be tested. Such deviations can be measured by means of a φ-divergence measure, which in turn can be consistently estimated. An equivalence test based on that estimator is presented. The asymptotic behavior of the test statistic is studied, and a critical region based on the asymptotic null distribution is considered. To study the finite sample performance of the proposal and to compare it with existing tests, several simulation experiments were carried out. In simulations, the new tests compete very satisfactorily in terms of power.

1 Introduction When dealing with multinomial data, a common practice is to base inferences on the well-known chi-square Pearson statistic, which compares observed and expected frequencies under the assumed model. The comparison can be done by using other functions of the observed and expected frequencies. Specifically, the use of φ-divergence measures, which contain the chi-square statistic as a special case, has been widely studied by some authors in many inferential problems such as point estimation (see M. V. Alba-Fernández (B) Department of Statistics and O.R. University of Jaén, Jaén, Spain e-mail: [email protected] M. D. Jiménez-Gamero Department of Statistics and O.R. University of Sevilla, Sevilla, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_12


e.g. Morales et al. [22], Mandal et al. [20]), goodness-of-fit testing (see e.g. Basu and Sarkar [8], Pardo [28], Mandal and Basu [19], Alba-Fernández et al. [3]), testing the equality of populations (see e.g. Zografos [31], Pardo et al. [27], García-Pérez and Núñez-Antón [13], Alba-Fernández and Jiménez-Gamero [1]), the model selection testing problem (see e.g. Jiménez-Gamero et al. [16, 17], Alba-Fernández et al. [4]), and association in contingency tables (see e.g. Martín and Pardo [21], Alba-Fernández and Jiménez-Gamero [2], Alba-Fernández et al. [5], Jiménez-Gamero et al. [14, 15], Kateri [18]), among many others. The above list of references and inferential problems is far from exhaustive. An excellent book dealing with the use of φ-divergence measures in statistical inference is that of Pardo [26].

Here we will address the following testing problem. Let X = (X_1, …, X_k)^⊤ follow a k-cell multinomial distribution with parameters n and P = (p_1, …, p_k)^⊤ ∈ Δ_k = {(p_1, …, p_k)^⊤ : p_i > 0, 1 ≤ i ≤ k, Σ_{i=1}^k p_i = 1}, X ∼ M_k(n; P) in short, where the superscript ⊤ denotes transpose. Let P_0 = (p_01, …, p_0k)^⊤ be a fixed point of Δ_k. A classical problem consists in testing
\[
H_{0c}: P = P_0 \quad \text{versus} \quad H_{1c}: P \neq P_0.
\]
A test of H_{0c} versus H_{1c} is typically applied when it is hoped to reject H_{0c}; otherwise it cannot be concluded that H_{0c} is true, that is, non-rejection is not a proof of its validity. By contrast, a model equivalence test is applied when it is hoped to accept H_{0c}, in the following sense. Let d : Δ_k × Δ_k → [0, ∞) be a function measuring the dissimilarity between any two points in Δ_k, so that d(P, P) = 0 for all P ∈ Δ_k. Let P, Q ∈ Δ_k; then P and Q are considered equivalent with respect to d if d(P, Q) < ε, for some sufficiently small positive ε. Dissimilarities smaller than ε are of little practical importance. For fixed d and ε, a model equivalence test is a test of the hypotheses
\[
H_{0e}: d(P, P_0) \ge \varepsilon \quad \text{versus} \quad H_{1e}: d(P, P_0) < \varepsilon.
\]
If a model equivalence test rejects, then it can be concluded that P is very close to P_0. Equivalence testing problems differ from conventional testing problems in the formulation of the null and alternative hypotheses. In the equivalence testing setting, the alternative hypothesis specifies an "indifference zone" around a certain point in the parameter space, which indicates sufficient coincidence of the distributions to be compared; that is, "equivalence" should be understood as "equality except for practically irrelevant deviations". Such coincidence is determined by the measure of dissimilarity used in each application. This idea is in demand in many applied fields such as Medicine, Psychology or Biostatistics, where it is also known under the name "bioequivalence testing" (see e.g. [6, 7, 11, 23, 29, 30]).

Equivalence tests of H_{0e} versus H_{1e} have been proposed by using the Euclidean distance in Wellek [30] and Frey [12], and the smooth total variation distance in


Ostrovski [24] (see also [25]). Our objective is to propose and study a test based on φ-divergences. The rest of the paper is organized as follows. Section 2 introduces equivalence tests for multinomial data based on φ-divergences. Because the critical region is built using asymptotic results, several simulation experiments were carried out in order to evaluate the finite sample performance of the tests and to compare them with existing ones; Sect. 3 summarizes the main findings of those experiments. Finally, Sect. 4 concludes.

2 Equivalence Tests

Let P = (p_1, …, p_k)^⊤ ∈ Δ_k and X = (X_1, …, X_k)^⊤ ∼ M_k(n; P). Let φ : [0, ∞) → ℝ ∪ {∞} be a continuously differentiable, strictly convex function satisfying φ(1) = 0, and let P̂ = (p̂_1, p̂_2, …, p̂_k)^⊤ be the vector of relative frequencies,
\[
\hat{p}_i = \frac{X_i}{n}, \quad 1 \le i \le k.
\]
For arbitrary Q = (q_1, …, q_k)^⊤ ∈ Δ_k, the φ-divergence between P and Q is defined by Csiszár [10] as
\[
D_\varphi(P, Q) = \sum_{i=1}^{k} q_i\, \varphi(p_i/q_i). \tag{1}
\]
The strict convexity of φ and φ(1) = 0 imply that D_φ(P, Q) ≥ 0, with D_φ(P, Q) = 0 if and only if P = Q. For fixed small ε > 0, P and Q will be considered equivalent according to (1) if D_φ(P, Q) < ε, implying that both distributions are equal except for irrelevant deviations. Let P_0 = (p_01, …, p_0k)^⊤ be a fixed point of Δ_k and take Q = P_0. Since D_φ(P̂, P_0) is a consistent estimator of D_φ(P, P_0), for testing H_{0c} versus H_{1c}, Zografos et al. [32] proposed to reject H_{0c} for "large values" of D_φ(P̂, P_0). For the same reason, D_φ(P̂, P_0) can also be used to test H_{0e} versus H_{1e}, with d = D_φ. Zografos et al. [32] (see also the proof of Theorem 3.2 in Pardo [26]) showed that for each P ∈ H_{0e},
\[
\sqrt{n}\,\bigl(D_\varphi(\hat{P}, P_0) - D_\varphi(P, P_0)\bigr) \xrightarrow{\;L\;} N\bigl(0, \sigma^2(P)\bigr),
\]
as n → ∞, where


\[
\sigma^2(P) = \sum_{i=1}^{k} p_i\, \varphi'\!\Bigl(\frac{p_i}{p_{0i}}\Bigr)^{2} - \Biggl(\sum_{i=1}^{k} p_i\, \varphi'\!\Bigl(\frac{p_i}{p_{0i}}\Bigr)\Biggr)^{2},
\]

and →^L denotes convergence in distribution. Since σ̂² = σ²(P̂) consistently estimates σ²(P), it follows that
\[
\sqrt{n}\, \frac{D_\varphi(\hat{P}, P_0) - D_\varphi(P, P_0)}{\hat{\sigma}} \xrightarrow{\;L\;} N(0, 1),
\]
as n → ∞. Let α ∈ (0, 1). Therefore, the test that rejects H_{0e} if
\[
T_n = \sqrt{n}\, \frac{D_\varphi(\hat{P}, P_0) - \varepsilon}{\hat{\sigma}} \le Z_\alpha, \tag{2}
\]
where Z_α stands for the α-quantile of the standard normal distribution, has asymptotic level α (which is asymptotically achieved for those P on the boundary of H_{0e}, i.e., satisfying D_φ(P, P_0) = ε), and is consistent against any fixed alternative, that is, it rejects with probability tending to 1 (as n → ∞) for any P in the alternative hypothesis H_{1e}.
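As a rough sketch of how (2) might be implemented (this is our illustration, not the authors' code; the function name `phi_divergence_test` and the counts below are hypothetical), the statistic T_n follows directly from the closed forms of D_φ and σ²(P):

```python
import numpy as np

def phi_divergence_test(counts, p0, eps, phi, dphi, alpha=0.05):
    """Equivalence test (2): reject H0e (declare equivalence) when
    T_n = sqrt(n) * (D_phi(p_hat, p0) - eps) / sigma_hat <= Z_alpha."""
    counts = np.asarray(counts, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    n = counts.sum()
    p_hat = counts / n
    r = p_hat / p0
    d = np.sum(p0 * phi(r))                                    # D_phi, Eq. (1)
    var = np.sum(p_hat * dphi(r)**2) - np.sum(p_hat * dphi(r))**2
    t_n = np.sqrt(n) * (d - eps) / np.sqrt(var)                # studentized
    z_alpha = -1.6449                                          # alpha = 0.05
    return t_n, bool(t_n <= z_alpha)

# Cressie-Read member with lambda = 2/3; note phi'(x) = (x^lam - 1)/lam
lam = 2/3
phi_cr = lambda x: (x**(lam + 1) - x - lam*(x - 1)) / (lam*(lam + 1))
dphi_cr = lambda x: (x**lam - 1) / lam

# hypothetical counts against a uniform target
t_n, reject = phi_divergence_test([30, 20, 28, 22], [0.25]*4, eps=0.15,
                                  phi=phi_cr, dphi=dphi_cr)
```

For these counts D_φ(P̂, P_0) is small relative to ε = 0.15, so T_n falls well below Z_{0.05} and the test declares equivalence.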

3 Simulation Results

The equivalence tests in the previous section are valid asymptotically. To study their performance for small or moderate samples and to compare them with existing tests, several simulation experiments were carried out. The setting is the following:

(a) We have considered the testing problem H_{0e} versus H_{1e} with P_0 = (1/k, …, 1/k)^⊤, k = 4, 6.

(b) We took as d = D_φ two members φ of the power-divergence family studied by Cressie and Read [9], defined as follows:
\[
\varphi_\lambda(x) = \frac{1}{\lambda(\lambda+1)}\bigl(x^{\lambda+1} - x - \lambda(x-1)\bigr), \quad \lambda \neq 0, -1,
\]
φ_0(x) = x log(x) − x + 1 for λ = 0, and φ_{−1}(x) = −log(x) + x − 1 for λ = −1. Specifically, we have taken the Cressie and Read (CR) test and the Kullback–Leibler (KL) test, which correspond to λ = 2/3 and λ = 0, respectively, because of their recognized good behavior in classical goodness-of-fit testing.

(c) We have included in the simulation two tests previously proposed: the test in Wellek [30], which will be denoted by T_w and measures discrepancies by means of the Euclidean distance, and the test in Ostrovski [24], which will be denoted by T_o and measures discrepancies by means of the smooth total variation distance, defined as
\[
\|P - Q\|_b = 0.5 \sum_{i=1}^{k} \sqrt{(p_i - q_i)^2 + b^2},
\]
for some fixed b > 0. In our simulations we took b = 0.001 and b = 0.0006 for k = 4 and k = 6, respectively. The test in Frey [12] has not been included because it crucially depends on the labeling of the categories. In all cases, the critical region is based on asymptotic results.

Table 1 Probability distributions included in the simulation study for k = 4

         p1      p2      p3      p4
Case 1   0.1687  0.1687  0.1687  0.4937
Case 2   0.1444  0.1444  0.1444  0.5667
Case 3   0.1750  0.3250  0.1750  0.3250
Case 4   0.2538  0.2200  0.1600  0.2661
Case 5   0.2823  0.1505  0.2500  0.3171
Case 6   0.2832  0.1785  0.2214  0.3167

Table 2 Probability distributions included in the simulation study for k = 6

         p1      p2      p3      p4      p5      p6
Case 1   0.1240  0.1240  0.1240  0.1240  0.1240  0.3800
Case 2   0.1112  0.1112  0.1112  0.1112  0.1112  0.4442
Case 3   0.1021  0.1500  0.1500  0.1500  0.1500  0.2978
Case 4   0.1729  0.1000  0.1000  0.1500  0.2000  0.2770
Case 5   0.1333  0.1999  0.1333  0.1999  0.1333  0.1999
Case 6   0.1175  0.1539  0.1539  0.1539  0.1539  0.2665

(d) Simulations for the level: we generated data from multinomial distributions with n = 50, 100, 250, 500 and P as shown in Table 1 for k = 4 and in Table 2 for k = 6. We took ε = d(P, P_0), which depends on d, so that each P lies on the boundary of H_{0e}. The values of ε are included in Tables 3 and 4, which display the rejection probabilities simulated under the null hypothesis for k = 4 and k = 6, respectively, based on 100,000 random samples generated for each scenario at the nominal level α = 0.05. Looking at these tables, we observe that as the value of ε becomes smaller, the asymptotic approximation deteriorates for all tests, in the sense that very large sample sizes are required for the simulation results to reach the nominal level. In most cases, the equivalence test based on the CR statistic provides results closer to the nominal value than the one based on the KL statistic.
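The level simulations described in (d) can be sketched in a few vectorized lines. The following Monte Carlo is our own illustration (far fewer replicates than the paper's 100,000, CR statistic only), taking P from Case 3 of Table 1 so that setting ε = D_φ(P, P_0) places P on the boundary of H_{0e}:

```python
import numpy as np

rng = np.random.default_rng(12345)
lam = 2/3  # Cressie-Read member
phi = lambda x: (x**(lam + 1) - x - lam*(x - 1)) / (lam*(lam + 1))
dphi = lambda x: (x**lam - 1) / lam  # phi'

p0 = np.full(4, 0.25)                              # uniform target, k = 4
p = np.array([0.1750, 0.3250, 0.1750, 0.3250])     # Case 3 of Table 1
eps = np.sum(p0 * phi(p / p0))                     # boundary: eps = D_phi(P, P0)

n, reps = 500, 2000
p_hat = rng.multinomial(n, p, size=reps) / n       # reps simulated p-hat vectors
r = p_hat / p0
d = np.sum(p0 * phi(r), axis=1)                    # D_phi(p_hat, p0)
var = np.sum(p_hat * dphi(r)**2, axis=1) - np.sum(p_hat * dphi(r), axis=1)**2
t = np.sqrt(n) * (d - eps) / np.sqrt(var)
level = np.mean(t <= -1.6449)                      # empirical level, target 0.05
```

With n = 500 the empirical level should land near the nominal 0.05, in line with the value reported for Case 3 (CR) in Table 3; ε itself reproduces the tabulated 0.045.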


Table 3 Rejection probabilities of the test proposed in (2) at the nominal level α = 0.05 under the selected null configurations for k = 4

Case 1   CR      KL      Tw      To          Case 2   CR      KL      Tw      To
ε        0.15    0.1370  0.0792  0.0243      ε        0.25    0.2255  0.1333  0.3162
n = 50   0.0692  0.0585  0.0847  0.0601      n = 50   0.0717  0.0591  0.0778  0.0437
n = 100  0.0645  0.0564  0.0695  0.0544      n = 100  0.0660  0.0541  0.0708  0.0524
n = 250  0.0608  0.0527  0.0632  0.0518      n = 250  0.0589  0.0525  0.0611  0.0494
n = 500  0.0575  0.0528  0.0590  0.0509      n = 500  0.0545  0.0518  0.0545  0.0465

Case 3   CR      KL      Tw      To          Case 4   CR      KL      Tw      To
ε        0.045   0.0457  0.15    0.15        ε        0.0445  0.0440  0.15    0.1201
n = 50   0.0352  0.0364  0.0352  0.0073      n = 50   0.0396  0.0395  0.0396  0
n = 100  0.0447  0.0449  0.0439  0.0341      n = 100  0.0489  0.0468  0.0513  0.0084
n = 250  0.0490  0.0500  0.0487  0.0442      n = 250  0.0546  0.0531  0.0566  0.0129
n = 500  0.0504  0.0514  0.0499  0.0514      n = 500  0.0553  0.0527  0.0572  0.0180

Case 5   CR      KL      Tw      To          Case 6   CR      KL      Tw      To
ε        0.0316  0.0334  0.0154  0.10        ε        0.0377  0.0353  0.0130  0.10
n = 50   0.0181  0.0176  0.0181  0           n = 50   0.0121  0.0112  0.0115  0
n = 100  0.0320  0.0363  0.0303  0.0044      n = 100  0.0335  0.0351  0.0341  0.0129
n = 250  0.0407  0.0441  0.0386  0.0077      n = 250  0.0470  0.0471  0.0466  0.0296
n = 500  0.0442  0.0481  0.0427  0.0114      n = 500  0.0514  0.0516  0.0510  0.0428

(e) Simulations for the power: we generated random samples from multinomial data with P0 = (1/k, . . . , 1/k), for k = 4, 6, and n = 50, 100, 250. Table 5 shows the simulated power of the considered tests at the nominal level α = 0.05 for ε = 0.15 based on 100,000 random samples. In the light of results, the CR test and the KL test are more powerful than the ones proposed in [24, 30]. Finally, to illustrate the use of the equivalence testing procedure described previously, it is interesting to revisit the following example taken from Wellek [30] (specifically, Example 9.1 on p. 267 of that book): let us consider the results of a sequence of n = 100 casts of a play dice, which are shown in Table 6. The p-value of the well-known χ 2 -test applied to these data for testing if the dice can be considered fair, is p = 0.16997. This means that the observed frequencies do not differ significantly from those expected for an ideal dice. To find out if the dice is really fair or approximately fair, we applied the so far considered equivalence tests, with nominal significance level α = 0.05 and ε = 0.15. Table 7 displays the observed values of the studentized test statistic Tn for the CR and KL divergence measures and of the test statistics Tw and To , as well as the results of the decision rule in each case applied for the nominal level 5% (Z 0.95 = −1.645). The tests of [24, 30] do not reject H0e , but the equivalence tests based on φ-divergences do. In view of the power

Equivalence Tests for Multinomial Data Based on φ-Divergences

127

Table 4 Rejection probabilities of the test proposed in (2) at the nominal level α = 0.05 under the selected null configurations for k = 6 (Cases 1–6; tests CR, KL, Tw and To; n = 50, 100, 250, 500)

Table 5 Simulated powers at nominal level α = 0.05 for ε = 0.15

        k = 4                              k = 6
n       CR       KL       Tw       To      CR       KL       Tw       To
50      0.9232   0.9125   0.2351   0.0609  0.7713   0.7504   0.1625   0.0052
100     0.9992   0.9992   0.6358   0.5682  0.9953   0.9955   0.6693   0.2823
250     1        1        0.9902   0.9941  1        1        0.9981   0.9801

Table 6 Absolute frequencies observed in a sequence of n = 100 casts of a given play dice

j      1    2    3    4    5    6
x_j    17   16   25   9    16   17

results in Table 5, we can conclude that the dice can be considered ideal, except for irrelevant deviations.
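The reported χ²-test result for the data in Table 6 can be reproduced in a few lines. The following sketch (plain Python, illustrative only and not part of the original analysis) uses the closed-form survival function of the χ² distribution with 5 degrees of freedom:

```python
import math

# Observed frequencies for the six faces (Table 6), n = 100 casts.
observed = [17, 16, 25, 9, 16, 17]
expected = sum(observed) / 6  # fair-die hypothesis p_j = 1/6

# Pearson chi-squared statistic with k - 1 = 5 degrees of freedom.
chi2 = sum((o - expected) ** 2 / expected for o in observed)

def chi2_sf_5df(x):
    # Closed-form survival function of the chi-squared law with 5 df
    # (the closed form is available for odd degrees of freedom).
    return (math.erfc(math.sqrt(x / 2))
            + math.exp(-x / 2) * math.sqrt(2 * x / math.pi) * (1 + x / 3))

print(chi2, chi2_sf_5df(chi2))  # chi2 = 7.76, p ≈ 0.16997
```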


M. V. Alba-Fernández and M. D. Jiménez-Gamero

Table 7 Application of the proposed equivalence tests

                 CR        KL        Tw          To
Test statistic   −4.0953   −4.0489   −1.0398     −1.2257
Decision         Reject    Reject    Not reject  Not reject
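The decision rule applied in Table 7 reduces to a one-sided comparison with −z0.95 = −1.645; a minimal illustrative snippet (Python, with the observed statistics hard-coded from Table 7):

```python
# Reject the null of non-equivalence at the 5% nominal level when the
# studentized statistic falls below -z_{0.95} = -1.645.
z_crit = -1.645
stats = {"CR": -4.0953, "KL": -4.0489, "Tw": -1.0398, "To": -1.2257}
for name, t in stats.items():
    print(name, "Reject" if t < z_crit else "Not reject")
# CR Reject, KL Reject, Tw Not reject, To Not reject
```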

4 Conclusion

The simulation results suggest that the proposed equivalence tests based on φ-divergences compete very favourably with the few equivalence tests for multinomial data previously published. As a consequence, additional research in statistical hypothesis testing from this perspective should be carried out.

Acknowledgements M.V. Alba-Fernández acknowledges financial support from Grant PID2019-106195RB-100 (Spanish Ministerio de Ciencia, Innovación y Universidades). M.D. Jiménez-Gamero acknowledges financial support from Grants MTM2017-89422-P (Spanish Ministerio de Economía, Industria y Competitividad, Agencia Estatal de Investigación and European Regional Development Fund) and P18-FR-2369 (Junta de Andalucía).

References

1. Alba-Fernández, V., Jiménez-Gamero, M.D.: Bootstrapping divergence statistics for testing homogeneity in multinomial populations. Math. Comput. Simulat. 79, 3375–3384 (2009)
2. Alba-Fernández, V., Jiménez-Gamero, M.D.: Estimating Rao's statistic distribution for testing uniform association in cross-classifications. Math. Comput. Simulat. 81, 1978–1990 (2011)
3. Alba-Fernández, M.V., Jiménez-Gamero, M.D., Ariza-López, F.J.: Minimum penalized φ-divergence estimation under model misspecification. Entropy 20(329), 1–15 (2018)
4. Alba-Fernández, V., Jiménez-Gamero, M.D., Jiménez-Jiménez, F.: Model selection based on penalized φ-divergences for multinomial data. J. Comput. Appl. Math. 404, 113181 (2020)
5. Alba-Fernández, V., Jiménez-Gamero, M.D., Lagos Álvarez, B.: Divergence statistics for testing uniform association in cross-classifications. Inf. Sci. 180, 4557–4571 (2010)
6. Baringhaus, L., Ebner, B., Henze, N.: The limit distribution of weighted L2-goodness-of-fit statistics under fixed alternatives, with applications. Ann. Inst. Stat. Math. 69, 969–995 (2017)
7. Baringhaus, L., Henze, N.: Cramér-von Mises distance: probabilistic interpretation, confidence intervals, and neighbourhood-of-model validation. J. Nonparam. Statist. 29, 167–188 (2017)
8. Basu, A., Sarkar, S.: On disparity based goodness-of-fit tests for multinomial models. Statist. Probab. Lett. 19, 307–312 (1994)
9. Cressie, N., Read, T.R.C.: Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46, 440–464 (1984)
10. Csiszár, I.: Information type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 2, 299–318 (1967)
11. Freitag, G., Czado, C., Munk, A.: A nonparametric test for similarity of marginals - with applications to the assessment of bioequivalence. J. Statist. Plan. Infer. 137, 697–711 (2007)
12. Frey, J.: An exact multinomial test for equivalence. Canad. J. Statist. 37(1), 47–59 (2009)
13. García-Pérez, M.A., Núñez-Antón, V.: Accuracy of power-divergence statistics for testing independence and homogeneity in two-way contingency tables. Commun. Statist. Simul. Comput. 38(3), 503–512 (2009)


14. Jiménez-Gamero, M.D., Alba-Fernández, M.V., Barranco-Chamorro, I., Muñoz-García, J.: Two classes of divergence statistics for testing uniform association. Statistics 48, 367–387 (2014)
15. Jiménez-Gamero, M.D., Alba-Fernández, M.V., Estudillo Martínez, M.D.: Burbea-Rao divergence based statistics for testing uniform association. Math. Comput. Simulat. 99, 1–18 (2014)
16. Jiménez-Gamero, M.D., Pino-Mejías, R., Alba-Fernández, M.V., Moreno-Rebollo, J.L.: Minimum φ-divergence estimation in misspecified multinomial models. Comput. Statist. Data Anal. 55, 3365–3378 (2011)
17. Jiménez-Gamero, M.D., Pino-Mejías, R., Rufián-Lizana, A.: Minimum Kφ-divergence estimators for multinomial models and applications. Comput. Statist. 29, 363–401 (2014)
18. Kateri, M.: φ-divergence in contingency table analysis. Entropy 20(5), 324 (2018)
19. Mandal, A., Basu, A.: Minimum disparity inference and the empty cell penalty: asymptotic results. Electron. J. Statist. 5, 1846–1875 (2011)
20. Mandal, A., Basu, A., Pardo, L.: Minimum disparity inference and the empty cell penalty: asymptotic results. Sankhya Ser. A 72, 376–406 (2010)
21. Martín, N., Pardo, L.: New families of estimators and test statistics in log-linear models. J. Multivar. Anal. 99, 1590–1609 (2008)
22. Morales, D., Pardo, L., Vajda, I.: Asymptotic divergence of estimates of discrete distributions. J. Statist. Plann. Infer. 48, 347–369 (1995)
23. Ocaña, J., Sánchez, M.P., Sánchez, A., Carrasco, J.L.: On equivalence and bioequivalence testing. SORT 32(2), 151–176 (2008)
24. Ostrovski, V.: Testing equivalence of multinomial distribution. Statist. Probab. Lett. 124, 77–82 (2017)
25. Ostrovski, V.: Testing equivalence to families of multinomial distributions with application to the independence model. Statist. Probab. Lett. 139, 61–66 (2018)
26. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
27. Pardo, L., Pardo, M.C., Zografos, K.: Homogeneity for multinomial populations based on φ-divergences. J. Japan Statist. Soc. 29(2), 213–228 (1999)
28. Pardo, M.C.: On Burbea-Rao divergence based goodness-of-fit tests for multinomial models. J. Multivar. Anal. 69, 65–87 (1999)
29. Tempelman, R.J.: Experimental design and statistical methods for classical and bioequivalence hypothesis testing with an application to dairy nutrition studies. J. Anim. Sci. 82, 162–172 (2004)
30. Wellek, S.: Testing Statistical Hypotheses of Equivalence and Noninferiority. Chapman & Hall/CRC, Boca Raton (2010)
31. Zografos, K.: f-dissimilarity of several distributions in testing statistical hypotheses. Ann. Inst. Statist. Math. 50(2), 295–310 (1998)
32. Zografos, K., Ferentinos, A., Papaioannou, T.: Divergence statistics: sampling properties and multinomial goodness of fit and divergence tests. Commun. Statist. Theor. Meth. 19, 1785–1802 (1990)

Minimum Rényi Pseudodistance Estimators for Logistic Regression Models Juana M. Alonso, Aida Calviño, and Susana Muñoz

Abstract In this work we propose a new family of estimators, called minimum Rényi pseudodistance estimators (MRPE), as a robust generalization of the maximum likelihood estimator (MLE) for the logistic regression model, based on the Rényi pseudodistance introduced by Jones et al. [14], and we derive their asymptotic distribution. Based on this distribution, we further develop three types of confidence intervals (approximate intervals, and parametric and non-parametric bootstrap ones). Finally, a simulation study is conducted considering different levels of outliers, in which the MRPE are shown to behave better than the MLE.

1 Introduction

Nowadays, large data sets are frequently available, in which the presence of outliers is not negligible. In those cases, a common approach consists of detecting and eliminating all outliers. However, as there is no consensus on the numerical definition of an outlier, it is not clear that this is the appropriate option in order to reflect the reality of the population. In this context, it is desirable to make use of estimation methods that are not distorted by the presence of a certain percentage of outliers, which are referred to as robust methods.

The problem of robust estimation in logistic regression is not new in the literature. For example, [10] discussed the breakdown behavior of the maximum likelihood estimator (MLE) in the logistic regression model and showed that it

J. M. Alonso · A. Calviño (B) Department of Statistics and Data Science, Complutense University of Madrid, Madrid, Spain e-mail: [email protected]
J. M. Alonso e-mail: [email protected]
S. Muñoz Department of Statistics and Operations Research, Complutense University of Madrid, Madrid, Spain e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_13



breaks down when several outliers are added to a data set. Recently, several authors have attempted to derive robust estimates of the parameters in the logistic regression model; see [5, 6, 9, 12, 13, 18], among others. Furthermore, estimators based on divergence measures have been widely used as an alternative to the MLE to improve robustness (see [17] for a variety of methods based on divergences). Among the alternatives proposed we can find the minimum density power divergence estimators [3], proposed in [4] and applied to the loglinear model in [8], and the minimum Rényi pseudodistance estimators (MRPE), based on a new family of pseudodistances between probability measures proposed in [7] and applied to the computation of M-estimators in [19].

In this work, taking into account the good properties of the newly proposed MRPE regarding efficiency and robustness, we extend them to the logistic regression model. In particular, we assume that Y1, Y2, ..., Yn are independent variables following a Bernoulli distribution such that

Pr(Yi = 1) = πi and Pr(Yi = 0) = 1 − πi, i = 1, ..., n.

We further assume that there exists a set of explanatory variables xi0, ..., xik (with xi0 = 1, xij ∈ R, i = 1, ..., n, j = 1, ..., k, k < n) associated with each Yi through the Bernoulli parameter πi, in such a way that

πi = π(x_i^T β) = e^{x_i^T β} / (1 + e^{x_i^T β}), i = 1, ..., n,   (1)

where x_i^T = (xi0, ..., xik) and β = (β0, ..., βk)^T is a (k + 1)-dimensional vector of unknown parameters with β ∈ R^{k+1}.

The work is organized as follows. Section 2 is devoted to the computation of the MRPE for the parameters in the logistic regression model, whereas the asymptotic distribution of those estimators is given in Sect. 3. Moreover, in Sect. 4 we propose approximate and bootstrap confidence intervals for the same parameters. In Sect. 5 we show the results of the simulation study carried out to evaluate the efficiency and robustness of the proposed MRPE. Finally, some conclusions and future work are included in Sect. 6.

2 Minimum Rényi Pseudodistance Estimators

If we denote by y1, ..., yn the observed values of the random variables Y1, ..., Yn, the likelihood function for the logistic regression model is given by

L(β) = ∏_{i=1}^n π(x_i^T β)^{y_i} [1 − π(x_i^T β)]^{1−y_i}.


Therefore, the classical MLE of β is defined as

β̂_MLE = arg max_{β ∈ R^{k+1}} log L(β).   (2)

However, if we consider the probability vectors

p̂ = ( y1/n, (1 − y1)/n, y2/n, (1 − y2)/n, ..., yn/n, (1 − yn)/n )^T

and

p(β) = ( (1/n) π(x_1^T β), (1/n)(1 − π(x_1^T β)), ..., (1/n) π(x_n^T β), (1/n)(1 − π(x_n^T β)) )^T,

the MLE of β in (2) can also be obtained by

β̂_MLE = arg min_{β ∈ R^{k+1}} d_KL(p̂, p(β)),   (3)

where d_KL(p̂, p(β)) is the Kullback-Leibler divergence measure between the probability vectors p̂ and p(β), given by

d_KL(p̂, p(β)) = Σ_{i=1}^n Σ_{j=1}^2 (y_ij/n) log[ y_ij / π_j(x_i^T β) ],   (4)

where π_1(x_i^T β) = π(x_i^T β), π_2(x_i^T β) = 1 − π(x_i^T β), y_i1 = y_i and y_i2 = 1 − y_i. For more details see [2].

Based on (3), one can think of using other types of divergence measures d(p̂, p(β)) in order to define a minimum divergence estimator for β. In this work we shall use the Rényi pseudodistance (RP) proposed in [7] because of the robustness of the MRPE (see, for instance, [15, 19]). As noted in [15], it is important to highlight that the RP is not a classical distance, because symmetry and the triangle inequality do not hold. The use of other types of divergences for logistic regression models is not new in the literature (see, for example, [2], where minimum density power divergence estimators are derived) but, as far as we know, this is the first time that the RP is applied.

Let X_1, ..., X_n be a random sample from a population having true density g which is modeled by a parametric family of densities f_θ. In this case, according to [7], the RP between the densities f_θ and g is given by

R_λ(f_θ, g) = (1/(λ+1)) log[ ∫ f_θ(x)^{λ+1} dx ] + (1/(λ(λ+1))) log[ ∫ g(x)^{λ+1} dx ] − (1/λ) log[ ∫ f_θ(x)^λ g(x) dx ]   (5)


for λ > 0, whereas for λ = 0 it is given by

R_0(f_θ, g) = lim_{λ↓0} R_λ(f_θ, g) = ∫ g(x) log[ g(x) / f_θ(x) ] dx,

i.e., the Kullback-Leibler divergence between g and f_θ (see [17]). In [7] it was established that R_λ(f_θ, g) ≥ 0, with R_λ(f_θ, g) = 0 if and only if f_θ = g.

In the case of the logistic regression model, a discrete version of (5) between the probability vectors p̂ and p(β) is required, and it can be expressed as follows:

R_λ(p(β), p̂) = (1/(λ+1)) log[ Σ_{i=1}^n Σ_{j=1}^2 ( π_j(x_i^T β) (1/n) )^{λ+1} ]
  + (1/(λ(λ+1))) log[ Σ_{i=1}^n Σ_{j=1}^2 ( y_ij/n )^{λ+1} ]
  − (1/λ) log[ Σ_{i=1}^n Σ_{j=1}^2 ( π_j(x_i^T β) (1/n) )^λ ( y_ij/n ) ],   (6)

for λ > 0, whereas for λ = 0 it is given by d_KL(p̂, p(β)) in (4). Based on (3) and (6), we shall define the MRPE as follows.

Definition 2.1 The MRPE for the parameter β, β̂_λ, in the logistic regression model is given by

β̂_λ = arg min_{β ∈ R^{k+1}} R_λ(p(β), p̂),

where R_λ(p(β), p̂) is as defined in (6).

It is not difficult to verify that, for λ > 0, Eq. (6) can be rewritten as

R_λ(p(β), p̂) = −(1/λ) { log[ ( Σ_{i=1}^n Σ_{j=1}^2 ( π_j(x_i^T β) (1/n) )^λ ( y_ij/n ) ) / ( Σ_{i=1}^n Σ_{j=1}^2 ( π_j(x_i^T β) (1/n) )^{λ+1} )^{λ/(λ+1)} ] − (1/(λ+1)) log[ Σ_{i=1}^n Σ_{j=1}^2 ( y_ij/n )^{λ+1} ] }.   (7)
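The pseudodistance in (6) is straightforward to evaluate numerically. The following sketch (plain Python, illustrative only, with hypothetical probability vectors) also checks that it approaches the Kullback-Leibler divergence as λ ↓ 0:

```python
import math

def renyi_pd(model, data, lam):
    # Discrete Rényi pseudodistance R_lam(model, data) between probability
    # vectors, mirroring Eq. (6); as lam -> 0 it tends to d_KL(data, model).
    a = math.log(sum(m ** (lam + 1) for m in model)) / (lam + 1)
    b = math.log(sum(d ** (lam + 1) for d in data)) / (lam * (lam + 1))
    c = math.log(sum(m ** lam * d for m, d in zip(model, data))) / lam
    return a + b - c

def kl(data, model):
    # Kullback-Leibler divergence d_KL(data, model), as in Eq. (4).
    return sum(d * math.log(d / m) for d, m in zip(data, model) if d > 0)

# Hypothetical probability vectors (illustrative values only).
p_data = [0.3, 0.2, 0.1, 0.4]
p_model = [0.25, 0.25, 0.25, 0.25]
print(abs(renyi_pd(p_model, p_data, 1e-6) - kl(p_data, p_model)) < 1e-4)  # True
```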

Note that the second term in (7) does not depend on β and, thus, can be neglected when minimizing the equation with respect to it. Taking that into account, and after some algebra, we can state the following equivalence:


min_{β} R_λ(p(β), p̂) ⟺ max_{β} [ Σ_{i=1}^n Σ_{j=1}^2 π_j^λ(x_i^T β) (y_ij/n) ] / L_{n,λ}(β),

where

L_{n,λ}(β) = { Σ_{ℓ=1}^n Σ_{j=1}^2 π_j^{λ+1}(x_ℓ^T β) }^{λ/(λ+1)}.

Based on the previous results, the MRPE for β, β̂_λ, for λ > 0, can alternatively be obtained as

β̂_λ = arg max_{β} (1/n) Σ_{i=1}^n φ_λ(x_i, y_i, β),   (8)

with φ_λ(x_i, y_i, β) = (1/L_{n,λ}(β)) [ π^λ(x_i^T β) y_i + (1 − π(x_i^T β))^λ (1 − y_i) ]. We note that Eq. (8) points out that the MRPE is an M-estimator.

In order to obtain the estimating equations, we need the derivative of φ_λ(x_i, y_i, β) with respect to β:

∂φ_λ(x_i, y_i, β)/∂β = (1/L_{n,λ}(β)^2) { ( ∂/∂β [ π^λ(x_i^T β) y_i + (1 − π(x_i^T β))^λ (1 − y_i) ] ) L_{n,λ}(β)
  − [ π^λ(x_i^T β) y_i + (1 − π(x_i^T β))^λ (1 − y_i) ] ∂L_{n,λ}(β)/∂β },

where

∂L_{n,λ}(β)/∂β = λ [ Σ_{ℓ=1}^n ( π^λ(x_ℓ^T β) − (1 − π(x_ℓ^T β))^λ ) ∂π(x_ℓ^T β)/∂β ] / [ Σ_{ℓ=1}^n ( π^{λ+1}(x_ℓ^T β) + (1 − π(x_ℓ^T β))^{λ+1} ) ]^{1/(λ+1)}
  = λ L_{n,λ}(β) [ Σ_{ℓ=1}^n ( π^{λ+1}(x_ℓ^T β)(1 − π(x_ℓ^T β)) − π(x_ℓ^T β)(1 − π(x_ℓ^T β))^{λ+1} ) x_ℓ ] / [ Σ_{ℓ=1}^n ( π^{λ+1}(x_ℓ^T β) + (1 − π(x_ℓ^T β))^{λ+1} ) ].

The result follows from

∂/∂β [ π^λ(x_i^T β) y_i + (1 − π(x_i^T β))^λ (1 − y_i) ] = λ π^{λ−1}(x_i^T β) y_i ∂π(x_i^T β)/∂β − λ (1 − π(x_i^T β))^{λ−1} (1 − y_i) ∂π(x_i^T β)/∂β

and

∂π(x_i^T β)/∂β = π(x_i^T β) (1 − π(x_i^T β)) x_i.   (9)

Finally, the estimating equations for λ > 0 are given by

Σ_{i=1}^n Ψ_λ(x_i, y_i, β) = 0_{k+1},   (10)

with

Ψ_λ(x_i, y_i, β) = [ π_i^λ(1 − π_i) y_i − π_i(1 − π_i)^λ (1 − y_i) ] x_i
  − [ π_i^λ y_i + (1 − π_i)^λ (1 − y_i) ] ( Σ_{ℓ=1}^n [ π_ℓ^{λ+1}(1 − π_ℓ) − π_ℓ(1 − π_ℓ)^{λ+1} ] x_ℓ ) / ( Σ_{ℓ=1}^n [ π_ℓ^{λ+1} + (1 − π_ℓ)^{λ+1} ] ),   (11)

where, for the sake of simplicity, we have replaced π(x_i^T β) by π_i. Based on the previous results we have established the following theorem.

Theorem 2.1 The MRPE for β, β̂_λ, can be obtained as the solution of the system of equations given in (10).

Note that, if we consider λ = 0 in Eqs. (10) and (11), we get the estimating equations for the MLE:

Σ_{i=1}^n ( π(x_i^T β) − y_i ) x_i = 0.
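In practice, the criterion in (8) can also be maximized directly with a general-purpose optimizer. The sketch below is only an illustration (it assumes NumPy and SciPy are available and uses hypothetical simulated data; it is not the authors' implementation), contrasting the MRPE for λ = 0.5 with the MLE (λ = 0):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def pi_fun(X, beta):
    # Logistic probabilities of Eq. (1).
    return 1.0 / (1.0 + np.exp(-X @ beta))

def neg_objective(beta, X, y, lam):
    # Negative of the criterion maximized in Eq. (8); for lam = 0 we use the
    # (equivalent) negative mean log-likelihood of the MLE instead.
    p = pi_fun(X, beta)
    if lam == 0:
        return -np.mean(y * np.log(p) + (1 - y) * np.log1p(-p))
    w = p ** lam * y + (1 - p) ** lam * (1 - y)
    L = np.sum(p ** (lam + 1) + (1 - p) ** (lam + 1)) ** (lam / (lam + 1))
    return -np.mean(w) / L

# Hypothetical data mimicking the design of Sect. 5: intercept plus two
# standard normal covariates, true beta_0 = (0, 1, 1).
n = 200
beta0 = np.array([0.0, 1.0, 1.0])
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = rng.binomial(1, pi_fun(X, beta0))

for lam in (0.0, 0.5):
    fit = minimize(neg_objective, np.zeros(3), args=(X, y, lam), method="BFGS")
    print(lam, np.round(fit.x, 2))
```

On clean data such as this, both fits land near β_0; the robustness advantage of λ > 0 shows up when outliers are injected as in Sect. 5.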

3 Asymptotic Distribution of the Minimum Rényi Pseudodistance Estimators

In order to get the asymptotic distribution of the MRPE of β, β̂_λ, we assume that the explanatory variables are not only random but also identically distributed, so that (X_1, Y_1), ..., (X_n, Y_n) are independent and identically distributed. We shall assume that X_1, ..., X_n is a random sample from a random variable X with marginal distribution function H(x). In order to apply the asymptotic theory of M-estimators in a convenient way, we consider an approximation, valid for n big enough, of the MRPE given by the estimating equations in (10) and (11):

Ψ_λ(x, Y, β) ≈ [ π^λ(1 − π) Y − π(1 − π)^λ (1 − Y) ] x − [ π^λ Y + (1 − π)^λ (1 − Y) ] C(β)/D(β),   (12)

where we have replaced π(x^T β) by π, and

C(β) = ∫_X [ π^{λ+1}(1 − π) − π(1 − π)^{λ+1} ] x dH(x),
D(β) = ∫_X [ π^{λ+1} + (1 − π)^{λ+1} ] dH(x),

with H(x) the distribution function of X. Notice that estimators of C(β) and D(β) will be, respectively,

Ĉ(β) = (1/n) Σ_{i=1}^n [ π_i^{λ+1}(1 − π_i) − π_i(1 − π_i)^{λ+1} ] x_i,
D̂(β) = (1/n) Σ_{i=1}^n [ π_i^{λ+1} + (1 − π_i)^{λ+1} ].

For the sake of simplicity, from now on we shall replace C(β), D(β), Ĉ(β) and D̂(β) by C, D, Ĉ and D̂, respectively.

By following the method given in [16], the asymptotic variance-covariance matrix of √n β̂_λ is J_λ^{-1}(β_0) K_λ(β_0) J_λ^{-1}(β_0), where

J_λ(β) = −E[ ∂Ψ_λ(X, Y, β)/∂β^T ],
K_λ(β) = E[ Ψ_λ(X, Y, β) Ψ_λ^T(X, Y, β) ].   (13)

In relation to the matrix K_λ(β), by (13),

K_λ(β) = E[ Ψ_λ(X, Y, β) Ψ_λ^T(X, Y, β) ] = ∫_X E[ Ψ_λ(x, Y, β) Ψ_λ^T(x, Y, β) ] dH(x),

with X the support of X. Knowing that E[Y] = π(x^T β) = π, we have

E[ Ψ_λ(X, Y, β) Ψ_λ^T(X, Y, β) ] = [ π^{2λ+1}(1 − π)^2 + π^2(1 − π)^{2λ+1} ] x x^T
  − [ π^{2λ+1}(1 − π) − π(1 − π)^{2λ+1} ] D^{-1} ( x C^T + C x^T )
  + [ π^{2λ+1} + (1 − π)^{2λ+1} ] D^{-2} C C^T.

Therefore, an estimator of K_λ(β) will be

K̂_λ(β) = ∫_X E[ Ψ_λ(x, Y, β) Ψ_λ^T(x, Y, β) ] dH_n(x),

where H_n(x) is the empirical distribution function associated with the sample x_1, ..., x_n. Then

K̂_λ(β) = (1/n) Σ_{i=1}^n { [ π_i^{2λ+1}(1 − π_i)^2 + π_i^2(1 − π_i)^{2λ+1} ] x_i x_i^T
  − [ π_i^{2λ+1}(1 − π_i) − π_i(1 − π_i)^{2λ+1} ] D̂^{-1} ( x_i Ĉ^T + Ĉ x_i^T )
  + [ π_i^{2λ+1} + (1 − π_i)^{2λ+1} ] D̂^{-2} Ĉ Ĉ^T }.   (14)

To compute the matrix J_λ(β), we first need

∂Ψ_λ(x, y, β)/∂β^T = L_1(x, y, β) − L_2(x, y, β) − L_3(y, β) + L_4(y, β),

where, considering (9),

L_1(x, y, β) = { [ λ π^λ(1 − π)^2 − π^{λ+1}(1 − π) ] y − [ π(1 − π)^{λ+1} − λ π^2(1 − π)^λ ] (1 − y) } x x^T,
L_2(x, y, β) = λ [ π^λ(1 − π) y − π(1 − π)^λ (1 − y) ] D^{-1} C x^T,
L_3(y, β) = [ π^λ y + (1 − π)^λ (1 − y) ] D^{-1} ∫_X [ (λ + 1)( π^λ(1 − π) + π(1 − π)^λ ) − π^{λ+1} − (1 − π)^{λ+1} ] π(1 − π) x x^T dH(x),
L_4(y, β) = [ π^λ y + (1 − π)^λ (1 − y) ] D^{-2} C ∫_X (λ + 1) π(1 − π)( π^λ − (1 − π)^λ ) x^T dH(x).

So, knowing that E[Y] = π(x^T β) = π,

E[ ∂Ψ_λ(X, Y, β)/∂β^T ] = [ λ( π^λ(1 − π) + π(1 − π)^λ ) − ( π^{λ+1} + (1 − π)^{λ+1} ) ] π(1 − π) x x^T
  − λ π(1 − π) [ π^λ − (1 − π)^λ ] D^{-1} C x^T
  − ( π^{λ+1} + (1 − π)^{λ+1} ) D^{-1} ∫_X [ (λ + 1)( π^λ(1 − π) + π(1 − π)^λ ) − π^{λ+1} − (1 − π)^{λ+1} ] π(1 − π) x x^T dH(x)
  + ( π^{λ+1} + (1 − π)^{λ+1} ) D^{-2} C ∫_X (λ + 1) π(1 − π)( π^λ − (1 − π)^λ ) x^T dH(x).

Finally,

J_λ(β) = −∫_X E[ ∂Ψ_λ(x, Y, β)/∂β^T ] dH(x),

and an estimator of J_λ(β) is given by

Ĵ_λ(β) = −(1/n) Σ_{i=1}^n { [ λ( π_i^λ(1 − π_i) + π_i(1 − π_i)^λ ) − ( π_i^{λ+1} + (1 − π_i)^{λ+1} ) ] π_i(1 − π_i) x_i x_i^T
  − λ π_i(1 − π_i) [ π_i^λ − (1 − π_i)^λ ] D̂^{-1} Ĉ x_i^T
  − ( π_i^{λ+1} + (1 − π_i)^{λ+1} ) D̂^{-1} (1/n) Σ_{ℓ=1}^n [ (λ + 1)( π_ℓ^{λ+1}(1 − π_ℓ)^2 + π_ℓ^2(1 − π_ℓ)^{λ+1} ) − π_ℓ^{λ+2}(1 − π_ℓ) − π_ℓ(1 − π_ℓ)^{λ+2} ] x_ℓ x_ℓ^T
  + ( π_i^{λ+1} + (1 − π_i)^{λ+1} ) D̂^{-2} Ĉ (1/n) Σ_{ℓ=1}^n (λ + 1) π_ℓ(1 − π_ℓ)( π_ℓ^λ − (1 − π_ℓ)^λ ) x_ℓ^T }.   (15)

From the above results, the next theorem follows.

Theorem 3.1 The asymptotic distribution of the MRPE, β̂_λ, for the logistic model given in (1) is

√n ( β̂_λ − β_0 ) →_{n→∞}^{L} N( 0, J_λ^{-1}(β_0) K_λ(β_0) J_λ^{-1}(β_0) ),   (16)

where estimators of the matrices J_λ(β_0) and K_λ(β_0) have been given in (15) and (14), respectively.

Remark 3.1 It is interesting to observe that for λ = 0 we get

K̂_0(β) = (1/n) Σ_{i=1}^n [ π_i(1 − π_i)^2 + π_i^2(1 − π_i) ] x_i x_i^T = (1/n) X^T diag( π_i(1 − π_i) )_{i=1,...,n} X = I_F(β),

and Ĵ_0(β) = I_F(β), with I_F(β) being the Fisher information matrix associated with the logistic regression model. Then, the asymptotic variance-covariance matrix of √n β̂_0 is

J_0^{-1}(β_0) K_0(β_0) J_0^{-1}(β_0) = I_F^{-1}(β_0).
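The reduction stated in Remark 3.1 can be checked numerically by implementing (14) directly; the following sketch (assuming NumPy, with a hypothetical design matrix) is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
beta = np.array([0.0, 1.0, 1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta))

def K_hat(lam):
    # Empirical K matrix of Eq. (14), with C_hat and D_hat as defined above.
    C = np.mean((p**(lam+1) * (1 - p) - p * (1 - p)**(lam+1))[:, None] * X, axis=0)
    D = np.mean(p**(lam+1) + (1 - p)**(lam+1))
    a = p**(2*lam+1) * (1 - p)**2 + p**2 * (1 - p)**(2*lam+1)
    b = p**(2*lam+1) * (1 - p) - p * (1 - p)**(2*lam+1)
    c = p**(2*lam+1) + (1 - p)**(2*lam+1)
    terms = (a[:, None, None] * (X[:, :, None] * X[:, None, :])
             - (b[:, None, None] / D) * (X[:, :, None] * C[None, None, :]
                                         + C[None, :, None] * X[:, None, :])
             + (c[:, None, None] / D**2) * np.outer(C, C))
    return terms.mean(axis=0)

# Remark 3.1: at lam = 0, K_hat equals the empirical Fisher information.
I_F = (X * (p * (1 - p))[:, None]).T @ X / n
print(np.allclose(K_hat(0.0), I_F))  # True
```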

4 Confidence Intervals

In this section, we propose different methods of constructing confidence intervals (CI) for the parameter vector β. In particular, approximate and bootstrap CI are presented.

Definition 4.1 (Approximate confidence intervals) The two-sided 100(1 − α)% approximate CI for the components of the unknown parameter vector β, based on the asymptotic distribution of the MRPE in (16) (see Theorem 3.1), are given by

β̂_{λ,j} ∓ z_{α/2} √( U_{jj} / n ), j = 0, ..., k,

where the U_{jj} are the diagonal elements of the variance-covariance matrix Ĵ_λ^{-1}(β̂_λ) K̂_λ(β̂_λ) Ĵ_λ^{-1}(β̂_λ), with β̂_λ = ( β̂_{λ,0}, ..., β̂_{λ,k} )^T.

The bootstrap method, which is one of a broader class of resampling methods, uses Monte Carlo sampling to generate an empirical sampling distribution of the estimate (see [11] for more details on bootstrap methods). As usual, only one sample is available; bootstrap methods therefore require an algorithm to simulate B bootstrap samples, which lead to B different estimates β̂* of the parameter vector. In this work we evaluate both the parametric and the non-parametric procedures in [1]. In the parametric scenario, we randomly generate the Y values for the observations in the data set based on the explanatory variables and the probabilities in (1), considering the estimated parameter vector β̂, whereas in the non-parametric scenario we take a sample of n observations with replacement from the original data set. In this work, we consider percentile bootstrap CI, which use the quantiles of the bootstrap distribution to obtain the limits of the confidence interval.

Definition 4.2 (Bootstrap confidence intervals) The two-sided 100(1 − α)% percentile bootstrap CI for the components of the unknown parameter vector β are given by

( β̂*_{λ,j}^{[B α/2]}, β̂*_{λ,j}^{[B(1 − α/2)]} ), j = 0, ..., k,

where [·] denotes the floor function and β̂*_{λ,j}^{[b]} is the b-th element of the sorted sequence of bootstrap estimates of the parameter β_j.
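A generic percentile bootstrap in the sense of Definition 4.2 can be sketched as follows (plain Python, with a hypothetical estimator and toy data; this is not the simulation code used in Sect. 5):

```python
import random
import statistics

def percentile_bootstrap_ci(data, estimator, B=2000, alpha=0.05, seed=7):
    # Non-parametric percentile bootstrap (Definition 4.2): resample n
    # observations with replacement, re-estimate, and take the empirical
    # quantiles of the sorted bootstrap estimates.
    rng = random.Random(seed)
    n = len(data)
    boot = sorted(estimator([rng.choice(data) for _ in range(n)])
                  for _ in range(B))
    return boot[int(B * alpha / 2)], boot[int(B * (1 - alpha / 2))]

# Toy usage with the sample mean as the (hypothetical) estimator.
sample = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3, 1.0, 0.9]
lo, hi = percentile_bootstrap_ci(sample, statistics.mean)
print(lo <= statistics.mean(sample) <= hi)  # True
```

The same wrapper applies to any estimator function, including a minimizer of the Rényi pseudodistance, at the cost of B refits per interval.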

5 Simulation Study

In this section we empirically demonstrate some of the strong robustness properties of the MRPE and the CI for the logistic regression model. We consider two explanatory variables x1 and x2 in this study, so k = 2. These two variables are distributed according to a standard normal distribution N(0, I_{2×2}). The response variables Yi are generated randomly from a Bernoulli distribution where the πi are given by the logistic model in (1). The true value of the parameter vector is taken as β_0 = (0, 1, 1)^T.

To evaluate the robustness of the proposed point and interval estimates, we add different percentages of outliers to the data (0%, 2%, 5% and 10%). For the outlying observations we first introduce leverage points where x1 and x2 are generated from N(μ_c, σ_c I_{2×2}) with μ_c = (5, 5)^T and σ_c = 0.01. Then the values of the response variable corresponding to those leverage points are altered to produce vertical outliers (yi = 1 is converted to yi = 0 and vice versa). Furthermore, in order to assess the effect of sample size, we consider two sample sizes, n = 100, 500, and the values λ = 0, 0.1, 0.3, 0.5, 0.7, 0.9. Note that, as previously mentioned, λ = 0 corresponds to the classical MLE.

For each setting, the root mean square error (RMSE) of the parameter estimates, based on 500 simulated samples, is computed; the results are shown in Fig. 1. As expected, smaller RMSE are obtained when n = 500, and larger ones are obtained when the percentage of outliers increases. As can be seen, our proposal outperforms the classical MLE (except for the case of n = 100 and λ = 0.9), showing larger differences for medium levels of outliers (2–5%). Moreover, in the absence of outliers, our proposal leads to more precise results than the MLE, indicating that it is a robust and efficient alternative.
According to the results shown in Fig. 1, a value of λ close to 0.5 seems a good choice, as it leads to small values of the RMSE independently of the data set size.
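The contamination scheme described above can be sketched as follows (plain Python, illustrative only; the function and its parameters are hypothetical, not the authors' code):

```python
import math
import random

def simulate_contaminated(n, beta, out_frac, seed=1):
    # Sketch of the Sect. 5 design: standard normal covariates and logistic
    # responses; a fraction out_frac of leverage points with covariates near
    # mu_c = (5, 5) (sd 0.01) whose responses are flipped (vertical outliers).
    rng = random.Random(seed)
    rows = []
    n_out = int(out_frac * n)
    for i in range(n):
        if i < n_out:
            x1, x2 = rng.gauss(5, 0.01), rng.gauss(5, 0.01)
        else:
            x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        eta = beta[0] + beta[1] * x1 + beta[2] * x2
        p = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < p else 0
        if i < n_out:
            y = 1 - y  # flip the response at the leverage points
        rows.append((1.0, x1, x2, y))
    return rows

data = simulate_contaminated(100, (0.0, 1.0, 1.0), 0.05)
print(sum(1 for r in data if r[1] > 4) >= 5)  # True: the 5 leverage points
```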

Fig. 1 Average RMSE for the three parameters in the simulation study for different data set sizes, percentage of outliers and values of the λ parameter

100

100

Asymp

Param

NoParam

1.0

0.8

0.6

Coverage Probability

0.4

λ 0

0.2

0.1 500

500

500

0.3

Asymp

Param

NoParam

0.5

1.0

0.7 0.9

0.8

0.6

0.4

0.2 0.0

2.5

5.0

7.5

10.0 0.0

2.5

5.0

7.5

10.0 0.0

2.5

5.0

7.5

10.0

Percentage of outliers

Fig. 2 Average coverage probabilities of the different CI proposed for the data in the simulation study for different data set sizes, percentage of outliers and values of the λ parameter

Moreover, for the same settings as before, the corresponding coverage probabilities for the 95% approximate (based on the asymptotic distribution of the estimator) and parametric and non-parametric bootstrap confidence intervals (based on 500 bootstrap replications), together with the average lengths, are obtained and plotted in Figs. 2 and 3, respectively (note that we have added a black horizontal line in Fig. 2 at the nominal confidence level for visual reference).

As can be seen in Fig. 2, worse coverage probabilities are generally obtained for the parametric bootstrap method in comparison with the other two methods, especially for large values of λ. Moreover, for λ ≥ 0.5 and a percentage of outliers smaller than

Fig. 3 Average length of the different CI proposed, obtained in the simulation study for different data set sizes, percentages of outliers and values of the λ parameter

10%, non-parametric bootstrap CI lead to coverage probabilities very close to the nominal level, independently of the sample size. Regarding asymptotic CI, for very large values of λ, coverage probabilities not far from the nominal level are achieved even for a percentage of outliers of 10%. It is interesting to highlight that a value of λ of 0.1 leads to CI behaving similarly to the MLE ones. Furthermore, coverage probabilities for those cases are far from the nominal level even for a very small percentage of outliers, especially when the sample size is large.

Regarding the CI lengths shown in Fig. 3, it can be seen that parametric bootstrap CI are the shortest ones, which is to be expected as their coverage probabilities are the smallest ones as well. Moreover, with respect to the other two CI types, it can be seen that the CI have similar lengths except for λ close to 1. Finally, as is to be expected, CI are shorter when the sample size is larger.

Taking into account the results derived from the simulations, we suggest using a value of λ of 0.5, as it leads to good coverage probabilities while keeping CI lengths small. Furthermore, except for the case of large percentages of outliers, where the asymptotic CI might seem a good choice, non-parametric bootstrap CI lead to the best results.

144

J. M. Alonso et al.

6 Conclusions and Future Work

In this work we have introduced a new family of estimators, the MRPE, for the parameters of the logistic regression model, which can be seen as a generalization of the MLE. Moreover, we have studied the asymptotic distribution of these estimators and obtained approximate and bootstrap confidence intervals for the parameters. In order to evaluate the behavior of our proposal, a simulation study has been conducted showing that, in most cases, the MRPE perform better than the MLE, both in terms of efficiency and robustness. The confidence intervals have been found to maintain a coverage probability very close to the nominal level in the presence of medium or high levels of outliers, while keeping their length close to that of the MLE-based intervals and, thus, retaining efficiency.

As part of future work, we plan to build a family of robust Wald-type tests for the logistic regression model, where the MRPE is used instead of the MLE. We shall obtain the corresponding asymptotic distribution and study the robustness properties of these Wald-type test statistics.

Acknowledgements The authors of this chapter belong to different generations but have done research with Leandro, during which he has transmitted his way of working, his clarity of ideas and his constant willingness to solve any doubt or problem. In addition, for some of us he has been our professor of Statistics in the Mathematics degree, awakening interest in this area thanks to his way of teaching and his enthusiasm.

References

1. Adjei, I.A., Karim, R.: An application of bootstrapping in logistic regression model. Open Access Libr. J. 3, e3049 (2016)
2. Basu, A., Ghosh, A., Mandal, A., Martín, N., Pardo, L.: A Wald-type test statistic for testing linear hypothesis in logistic regression models based on minimum density power divergence estimator. Elect. J. Statist. 11(2), 2741–2772 (2017)
3. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimizing a density power divergence. Biometrika 85(3), 549–559 (1998)
4. Basu, A., Shioya, H., Park, C.: The Minimum Distance Approach. Monographs on Statistics and Applied Probability. CRC Press, Boca Raton (2011)
5. Bianco, A.M., Yohai, V.J.: Robust estimation in the logistic regression model. In: Rieder, H. (ed.) Robust Statistics, Data Analysis, and Computer Intensive Methods. Lecture Notes in Statistics, vol. 109, pp. 17–34. Springer, New York (1996)
6. Bondell, H.D.: Minimum distance estimation for the logistic regression model. Biometrika 92(3), 724–731 (2005)
7. Broniatowski, M., Toma, A., Vajda, I.: Decomposable pseudodistances and applications in statistical estimation. J. Statist. Plan. Infer. 142(9), 2574–2585 (2012)
8. Calviño, A., Martín, N., Pardo, L.: Robustness of minimum density power divergence estimators and Wald-type test statistics in loglinear models with multinomial sampling. J. Comput. Appl. Math. 386, 113214 (2021)
9. Carroll, R.J., Pederson, S.: On robustness in the logistic regression model. J. Royal Statist. Soc. Ser. B 55(3), 693–706 (1993)
10. Croux, C., Haesbroeck, G.: Implementing the Bianco and Yohai estimator for logistic regression. Comput. Statist. Data Anal. 44(1–2), 273–295 (2003)
11. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman & Hall, New York (1993)
12. Hobza, T., Martin, N., Pardo, L.: A Wald-type test statistic based on robust modified median estimator in logistic regression models. J. Statist. Comput. Simul. 87, 2309–2733 (2017)
13. Hobza, T., Pardo, L., Vajda, I.: Robust median estimator in logistic regression. J. Statist. Plan. Infer. 138(12), 3822–3840 (2008)
14. Jones, M.C., Hjort, N.L., Harris, I.R., Basu, A.: A comparison of related density based minimum divergence estimators. Biometrika 88, 865–873 (2001)
15. Kus, V., Kucera, J., Morales, D.: Numerical enhancements for robust Rényi decomposable minimum distance estimators. J. Phys. Confer. Ser. 1141, 012037 (2018)
16. Maronna, R.A., Martin, R.D., Yohai, V.J.: Robust Statistics. Theory and Methods. Wiley Series in Probability and Statistics. Wiley, Hoboken (2006)
17. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
18. Pregibon, D.: Resistant fits for some commonly used logistic models with medical applications. Biometrics 38(2), 485–498 (1982)
19. Toma, A., Leoni-Aubin, S.: Optimal robust M-estimators using Rényi pseudodistances. J. Multivar. Anal. 115, 359–373 (2013)

Infinite–Dimensional Divergence Information Analysis

José Miguel Angulo and María Dolores Ruiz-Medina

Abstract Kullback–Leibler divergence is formulated in an infinite–dimensional random variable framework. Specifically, the abstract notion of a divergence functional $D$ is established for comparing infinite–dimensional probability models generated from a suitable operator family in the space $L^1_+(H)$ of positive semi–definite trace operators on a separable Hilbert space $H$. In particular, in a parametric setting, $D$ compares the true probability density model, underlying the curve data, with the parameterized candidates, generated from subsets of $L^1_+(H)$. The definition of the $f$–divergence functional allows the introduction of the Kullback–Leibler divergence functional $D_{KL}$. Divergence-based parametric infinite–dimensional probability density estimation is then formulated in this framework. The corresponding asymptotic analysis is addressed under an infinite–dimensional Gaussian scenario.

J. M. Angulo · M. D. Ruiz-Medina (B)
Department of Statistics and Operations Research, Faculty of Sciences, University of Granada, Campus Fuente Nueva s/n, 18071 Granada, Spain
e-mail: [email protected]
J. M. Angulo
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_14

1 Introduction

Divergence measures, first introduced in the Statistical Information Theory context as an instrument to quantify the departure of a certain probability distribution from a given reference distribution in a system (with discrete or continuous states), have played an important role in fundamental, methodological and applied research. Since the formulation of the Kullback-Leibler divergence [13], diverse generalizations and alternative forms have been proposed and thoroughly investigated in the literature, such as those included in the classes of $f$–divergences and Bregman divergences [1, 4, 6]. Among other areas of knowledge, divergence measures have been adopted as a basis for comparative assessment and related optimality criteria in Statistical Inference (see, for reference, [16]) and Large Deviation Theory


(see [9, 23], etc.), with an important application to divergence-based risk measures under dual representations in modern Risk Measure Theory (see, e.g., [3, 11, 14]; see also [24] for a discussion on related extensions). Divergence measures constitute the basis for the construction of other useful information measures, including divergence-based forms of mutual information (e.g., [19, 22]) and, more recently, generalized product-type relative complexity measures and generalized relative dimensions in a multifractal domain (a connective discussion, particularly focussed on Rényi-divergence-based measures, is given in [2]). In the last few decades, a growing interest has been observed in parametric [5] and nonparametric [10] data analysis in a statistical functional framework. In particular, the field of Functional Data Analysis (FDA) has been nurtured by various disciplines, including probability in abstract spaces (see, e.g., [15]). Inference from stochastic differential equation modeling has attracted much attention, since it generates interesting applications where Gaussian measures in Hilbert spaces, and associated infinite–dimensional quadratic forms, play a crucial role (see [7]). Several statistical approaches have recently been developed to address inference from infinite–dimensional multivariate data, with a particular focus on functional process modeling (see, e.g., [12, 20, 21]). In [20], under a functional spectral approach, a Kullback-Leibler divergence–like loss operator is formulated for parameter estimation in an infinite–dimensional framework. This approach is useful for the intrinsic case, where point–wise singularities can arise that are compensated by a weight operator. However, a suitable formulation of functionals as divergence measures for infinite–dimensional probability distribution estimation, contemplating FDA techniques, constitutes an open research area.
This work is a first attempt to introduce some preliminary formulations and results that can be the basis for a more extensive analysis. In summary, from the notion of a divergence functional acting on a suitable subspace of the unit ball in $L^1_+(H)$, generating the probability models to be compared, $f$–divergence functionals are introduced in Sect. 2. Kullback–Leibler divergence is then defined as a special case in an infinite–dimensional framework. Section 3 is focused on the application of these concepts in parametric estimation of the probability distribution of functional random variables, under a spectral approach. This methodology is especially suitable under a Gaussian scenario in separable Hilbert spaces. In Sect. 4, the asymptotic probability distribution of the empirical version $\widehat{D}_{KL}$ of the Kullback–Leibler divergence functional $D_{KL}$ is identified under such a scenario. The conditions ensuring consistency of $\widehat{D}_{KL}$ also follow from the application of FDA methodology. Section 5 ends with some final remarks.

2 Preliminary Definitions

This section introduces some preliminary concepts and properties of divergence functionals. In particular, $f$–divergence functionals are introduced in terms of probability measures in a separable Hilbert space $H$, generated by self–adjoint positive


semi–definite trace operators. A spectral approach is adopted in the larger ambient space of bounded linear operators.

The following concept of divergence between infinite–dimensional probability distributions is adopted:

Definition 2.1 Let $B_{L^1_+(H)}(1)$ be the unit ball of the space $L^1_+(H)$ of positive semi–definite trace operators on a separable Hilbert space $H$. A divergence functional $D$ acting on a subspace $S$ of $B_{L^1_+(H)}(1)$ is a function $D(\cdot\|\cdot): S \times S \longrightarrow \mathbb{R}$ satisfying the following properties:

(i) $D(A\|B) \geq 0$ for all $A, B \in S$;
(ii) $D(A\|B) = 0 \Longleftrightarrow \|A - B\|_{L^1(H)} = 0$.

Similarly to the finite–dimensional framework, the dual divergence functional $\widetilde{D}$ satisfies the following identity: $\widetilde{D}(A\|B) = D(B\|A)$, $\forall A, B \in S$.

Let us now refer to the so–called $f$–divergence functional family, denoted as $D_f$, introduced in terms of a continuous convex function $f$ on $S$.

Definition 2.2 Let $f$ be a continuous convex function on $S$ satisfying $f(I_H) = 0$, where $I_H$ denotes the identity operator on $H$. Then, an $f$–divergence functional is defined by the following expression:

$$D_f(A\|B) = \left\| f\!\left(BA^{-1}\right) A \right\|_{L^1(H)}, \quad \forall A, B \in S \subset B_{L^1_+(H)}(1). \tag{1}$$

The identity (1) is computed in the trace norm of $L^1(H)$, applying the spectral theory of self–adjoint operators on a separable Hilbert space. In particular, we restrict our attention to bounded linear operators on $H$, whose space is denoted as $\mathcal{L}(H)$. Thus, the following identities, in the weak sense and in $\mathcal{L}(H)$, are considered:

$$f\!\left(BA^{-1}\right)A(g)(h) = \int f\!\left(\frac{B(\lambda)}{A(\lambda)}\right) A(\lambda)\, d\langle E_\lambda(g), h\rangle_H, \quad \forall g, h \in H,$$

$$f\!\left(BA^{-1}\right)A = \int f\!\left(\frac{B(\lambda)}{A(\lambda)}\right) A(\lambda)\, dE_\lambda \quad \text{in } \mathcal{L}(H), \tag{2}$$

with respect to a suitable common spectral operator measure $dE_\lambda$ (see Sect. 2.1 in [20]). Here the operator integral (2) is understood as an improper operator Stieltjes integral which strongly converges (see, e.g., Sect. 8.2.1 in [18]).

In the next section, the special case given by the Kullback–Leibler divergence functional, $D_{KL}$, is considered. Specifically, the following definition of such a functional is adopted:

Definition 2.3 For $A, B \in S \subset B_{L^1_+(H)}(1)$,

$$D_{KL}(A\|B) = \left\| \ln\!\left(AB^{-1}\right) A \right\|_{L^1(H)}. \tag{3}$$
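In finite dimensions, and when $A$ and $B$ commute (shared eigenvectors, eigenvalues $a_k$ and $b_k$), the trace norm in (3) collapses to a sum over the shared spectrum, $\sum_k \ln(a_k/b_k)\,a_k$, which is exactly the spectral form exploited later in the parametric setting (cf. Eq. (9)). A minimal numerical sketch; the unit-trace spectra below are illustrative assumptions, not taken from the chapter:

```python
import numpy as np

def d_kl(a, b):
    """Finite-dimensional analogue of the functional (3) for commuting
    positive definite operators with shared eigenvectors: ln(A B^{-1}) A
    is diagonal in the common basis, with entries ln(a_k / b_k) * a_k."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sum(np.log(a / b) * a))

# Two summable ("trace-class") spectra, normalised here to unit trace.
a = [0.5, 0.3, 0.2]
b = [0.4, 0.4, 0.2]

print(d_kl(a, a))  # property (ii) of Definition 2.1: the divergence of A from itself is 0
print(d_kl(a, b))  # strictly positive, as in property (i), for unit-trace spectra
```

For unit-trace spectra this is the ordinary discrete Kullback–Leibler divergence, which is why properties (i)–(ii) hold in the sketch.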


3 Parametric Estimation Based on Kullback–Leibler Divergence Functional

This section addresses probability density estimation in an infinite–dimensional framework, based on $D_{KL}$. Specifically, let $B_\Theta = \{B_\theta,\ \theta \in \Theta\} \subset S \subset B_{L^1_+(H)}(1)$ be a parametric family of strictly positive definite self–adjoint operators in the unit ball of $L^1(H)$, generating a parametric family of probability measures in $(H, \mathcal{B}(H))$, denoted as $\{B_\theta(dh),\ \theta \in \Theta\}$. Here, $\mathcal{B}(H)$ denotes the σ–algebra in $H$, and the parameter space $\Theta$ is a compact subset of $\mathbb{R}^p$, $p \geq 1$. Assume that the true parameter value $\theta_0$ lies in the interior of $\Theta$. Let us now consider

$$D_{KL}\left(A_{\theta_0}\|B_\theta\right) = \left\| \ln\!\left(A_{\theta_0} B_\theta^{-1}\right) A_{\theta_0} \right\|_{L^1(H)}, \quad A_{\theta_0}, B_\theta \in B_\Theta, \tag{4}$$

where $A_{\theta_0}$ and $B_\theta$ respectively generate the probability measures $A_{\theta_0}(dh)$ and $B_\theta(dh)$ in $(H, \mathcal{B}(H))$. Let $\{\phi_{k,\theta_0},\ k \geq 1\}$ be the orthonormal basis in $H$ of eigenvectors of the operator $A_{\theta_0}$, satisfying

$$A_{\theta_0}(\phi_{k,\theta_0}) = \lambda_k(A_{\theta_0})\,\phi_{k,\theta_0}, \quad k \geq 1, \tag{5}$$

where the self–adjoint operator $A_{\theta_0}$ has parametric pure point spectrum $\{\lambda_k(A_{\theta_0}),\ k \geq 1\}$, represented by the symbol $A(\lambda, \theta_0)$, with respect to a point spectral operator–valued measure $dE_\lambda$. As before, $A_{\theta_0}$ defines the infinite–dimensional probability measure $A_{\theta_0}(dh)$ from the identity

$$A_{\theta_0}(dh) = \int A(\lambda, \theta_0)\, d\langle E_\lambda(h), h\rangle_H = A_{\theta_0}(h)(h), \quad \forall h \in H. \tag{6}$$

In the formulation of the next assumption, we restrict our attention to the case where $\phi_{k,\theta} = \phi_k$ is known for every $k \geq 1$:

A1. For every $\theta \in \Theta$,

$$B_\theta(\phi_k) = \lambda_k(B_\theta)\,\phi_k, \quad k \geq 1,\ B_\theta \in B_\Theta, \tag{7}$$

and $A_{\theta_0} B_\theta^{-1} \in \mathcal{L}(H)$, i.e.,

$$\sup_{k \geq 1} \frac{\lambda_k(A_{\theta_0})}{\lambda_k(B_\theta)} < \infty. \tag{8}$$

Assumption A1 then establishes a common spectral kernel $\Phi$ for representing the parametric operator family $B_\Theta$. Note that, in the case of unknown parametric eigenvectors, the extension of the estimation methodology subsequently derived is straightforward.


Remark 3.1 Under Assumption A1, the common spectral kernel $\Phi$ is constructed from

$$\Phi_l = \sum_{k=1}^{l} \phi_k \otimes \phi_k, \quad l \geq 1.$$

Furthermore, the family of point spectral measures $\left\{\delta_{\lambda_k(B_\theta)},\ k \geq 1\right\}$, $\theta \in \Theta$, is involved in the spectral diagonalization of the elements of $B_\Theta$, in terms of the spectral kernel $\Phi$ of $A_{\theta_0}$ (see, e.g., Sect. 8.2.1 in [18]).

Under A1, from Eq. (7), Eq. (4) can be rewritten as

$$D_{KL}\left(A_{\theta_0}\|B_\theta\right) = \sum_{k \geq 1} \ln\!\left(\frac{\lambda_k(A_{\theta_0})}{\lambda_k(B_\theta)}\right) \lambda_k(A_{\theta_0}), \quad B_\theta \in B_\Theta. \tag{9}$$

Note that, under A1,

$$\sum_{k \geq 1} \left| \ln\!\left(\frac{\lambda_k(A_{\theta_0})}{\lambda_k(B_\theta)}\right) \right| \lambda_k(A_{\theta_0}) \leq \left\| \ln\!\left(A_{\theta_0} B_\theta^{-1}\right) \right\|_{\mathcal{L}(H)} \left\| A_{\theta_0} \right\|_{L^1(H)} < \infty.$$

Thus, $D_{KL}\left(A_{\theta_0}\|B_\theta\right)$ is well defined for every $B_\theta \in B_\Theta$.

In practice, $D_{KL}\left(A_{\theta_0}\|B_\theta\right)$ is replaced by its empirical version $\widehat{D}_{KL}\left(A_{\theta_0}\|B_\theta\right)$, computed from a functional sample $\{X_1, \ldots, X_N\}$ of random elements of $H$ (i.e., $P[X_i \in H] = 1$, $i = 1, \ldots, N$). In the subsequent development we consider the Gaussian scenario, where the following result holds (see, e.g., Theorem 1.2.1 in [7]):

Lemma 3.1 For every $Q \in L^1_+(H)$, there exists a unique probability measure $A_Q$ on $(H, \mathcal{B}(H))$ such that

$$\int_H \exp\left(i\langle h, x\rangle_H\right) A_Q(dx) = \exp\left(-\frac{1}{2}\langle Q(h), h\rangle_H\right), \quad \forall h \in H. \tag{10}$$

Moreover, $A_Q$ is the restriction to $H$ (identified with $\ell^2$) of the product measure

$$\bigotimes_{k=1}^{\infty} \mu_k = \bigotimes_{k=1}^{\infty} \mathcal{N}(0, \lambda_k)$$

defined on $(\mathbb{R}^\infty, \mathcal{B}(\mathbb{R}^\infty))$, with $\mathcal{N}(0, \lambda_k)$ denoting the zero–mean Gaussian probability measure with variance $\lambda_k$, for every $k \geq 1$. Here, $\mathbb{R}^\infty$ is considered as a metric space, whose distance is given by $d(x, y) := \sum_{k=1}^{\infty} 2^{-k}\, \frac{|x_k - y_k|}{1 + |x_k - y_k|}$, for every $x, y \in \mathbb{R}^\infty$. Also, $Q(\phi_k) = \lambda_k \phi_k$, for every $k \geq 1$.

From Lemma 3.1 (see Eq. (10)), the parametric family $B_\Theta$ of probability density operators can be formally defined as


$$B_\theta(x) = \int_H \exp\left(-i\langle h, x\rangle_H\right) \exp\left(-\frac{1}{2}\langle Q_\theta(h), h\rangle_H\right) \widetilde{B}_\theta(dh), \quad \forall x \in H, \tag{11}$$

where $Q_\theta = \sum_{k \geq 1} \lambda_k(\theta)\,\phi_k \otimes \phi_k$, for every $\theta \in \Theta$. Thus, the operator $B_\theta$ generating the probability measure $B_\theta(dh)$ in $(H, \mathcal{B}(H))$ satisfies

$$-\left[\ln(B_\theta)\right](h)(h) = \frac{1}{2}\sum_{k=1}^{\infty} \frac{1}{\lambda_k(\theta)}\, \langle h, \phi_k\rangle_H^2 = \sum_{k=1}^{\infty} \frac{1}{2\lambda_k(\theta)}\, \phi_k \otimes \phi_k(h)(h),$$

for every $h \in H$ and $\theta \in \Theta$. Applying theorems from Spectral Functional Calculus (see, e.g., [8], pp. 112–140), we obtain

$$-\ln\left(\lambda_k(B_\theta)\right) = \frac{1}{2\lambda_k(\theta)}, \quad k \geq 1,\ \theta \in \Theta. \tag{12}$$

From Eqs. (9) and (12), the empirical Kullback–Leibler divergence functional $\widehat{D}_{KL}$ can be computed, from a functional sample $X_1, \ldots, X_N$, as follows:

$$\widehat{D}_{KL}\left(A_{\theta_0}\|B_\theta\right) = \sum_{k \geq 1} \ln\!\left(\frac{\widehat{\lambda}_{k,N}(A_{\theta_0})}{\lambda_k(B_\theta)}\right) \widehat{\lambda}_{k,N}(A_{\theta_0}), \tag{13}$$

where, according to Eq. (12), $\widehat{\lambda}_{k,N}(A_{\theta_0}) = \exp\left(-\frac{1}{2\widehat{\lambda}_{k,N}}\right)$, $k = 1, \ldots, N$, with

$$\widehat{R}_N(\phi_{k,N}) = \frac{1}{N}\sum_{i=1}^{N} \left[X_i \otimes X_i\right](\phi_{k,N}) = \widehat{\lambda}_{k,N}\,\phi_{k,N}, \quad k = 1, \ldots, N. \tag{14}$$

Here, $\{\widehat{\lambda}_{k,N},\ k = 1, \ldots, N\}$ and $\{\phi_{k,N},\ k \geq 1\}$ respectively denote the systems of empirical eigenvalues and eigenvectors associated with the empirical autocovariance operator $\widehat{R}_N$, computed from the functional sample $X_i$, $i = 1, \ldots, N$. Note that $X_i \sim \mathcal{N}(0, Q_{\theta_0})$, $i = 1, \ldots, N$, are independent and identically distributed $H$–valued zero–mean Gaussian random variables with autocovariance operator $Q_{\theta_0}$.

The following parametric estimator is then considered:

$$\widehat{\theta}_0 = \arg\min_{\theta \in \Theta} \widehat{D}_{KL}\left(A_{\theta_0}\|B_\theta\right) = \arg\min_{B_\theta \in B_\Theta} \widehat{D}_{KL}\left(A_{\theta_0}\|B_\theta\right), \tag{15}$$

defining the estimator $\{\widehat{\lambda}_k(A_{\theta_0}) = \lambda_{k,A}(\widehat{\theta}_0),\ k \geq 1\}$ of the pure point spectrum $\{\lambda_k(A_{\theta_0}) = \lambda_{k,A}(\theta_0),\ k \geq 1\}$. The corresponding plug–in estimator of the probability density operator $A_{\theta_0}$, generating the underlying probability measure $A_{\theta_0}(dh)$, is then given by

$$\widehat{A}_{\theta_0} = \sum_{k \geq 1} \lambda_{k,A}(\widehat{\theta}_0)\,\phi_{k,N} \otimes \phi_{k,N}. \tag{16}$$
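The pipeline (13)–(15) can be sketched numerically. The sketch below simulates the Karhunen–Loève scores of $N$ Gaussian curves under a hypothetical one-parameter eigenvalue family $\lambda_k(\theta) = \theta/k^2$ (the chapter leaves $B_\Theta$ abstract), truncates the spectrum at $K$ terms as one must in practice, and — as an additional illustrative assumption — normalises the spectra of the density operators to unit trace, so that the divergence (13) behaves as a discrete Kullback–Leibler divergence over the candidate grid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical eigenvalue family lambda_k(theta) = theta / k^2; K truncates
# the spectrum, as in practice. Neither choice comes from the chapter.
K, N, theta0 = 10, 5000, 1.0
k = np.arange(1, K + 1)
lam = lambda th: th / k**2

# Karhunen-Loeve scores of N iid curves X_i ~ N(0, Q_{theta0}): in the common
# eigenbasis {phi_k}, the scores <X_i, phi_k> are independent N(0, lambda_k).
scores = rng.normal(scale=np.sqrt(lam(theta0)), size=(N, K))

# Empirical eigenvalues of the autocovariance operator R_N, Eq. (14).
lam_hat = (scores ** 2).mean(axis=0)

def spectrum(lams):
    """Eq. (12): eigenvalues exp(-1/(2 lambda_k)) of the density operator,
    normalised here to unit trace (an assumption of this illustration)."""
    s = np.exp(-1.0 / (2.0 * lams))
    return s / s.sum()

a_hat = spectrum(lam_hat)                    # empirical spectrum of A_{theta0}

def d_kl_hat(theta):
    """Empirical divergence, Eq. (13), over the truncated spectrum."""
    b = spectrum(lam(theta))
    return float(np.sum(np.log(a_hat / b) * a_hat))

# Eq. (15): grid search for the minimum-divergence parameter.
grid = np.linspace(0.5, 1.5, 201)
theta_hat = grid[np.argmin([d_kl_hat(th) for th in grid])]
print(round(float(theta_hat), 2))            # close to theta0 = 1.0 for large N
```

In practice the optimisation in (15) would use a numerical optimiser over $\Theta$ rather than a grid, and the eigenpairs of $\widehat{R}_N$ would come from a functional PCA of the observed curves rather than from simulated scores.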


Note that, from the above identities,

$$\widehat{R}_N = \frac{1}{N}\sum_{i=1}^{N} X_i \otimes X_i = \sum_{k=1}^{N} \widehat{\lambda}_{k,N}\, \phi_{k,N} \otimes \phi_{k,N}.$$

In the next section we study the asymptotic behavior of the functional statistic

$$S^{(N)} = \sum_{k=1}^{N}\sum_{l=1}^{N} \sqrt{\widehat{\lambda}_{k,N}\,\widehat{\lambda}_{l,N}}\; I_1(\varphi_k)\, I_1(\varphi_l)\; \phi_{k,N} \otimes \phi_{l,N}, \tag{17}$$

where $I_1$ denotes the simple Wiener–Itô stochastic integral, with respect to the Wiener measure (see Eq. (9.7.32), p. 169, and Definition 9.2.1 in [17]), and $\{\varphi_k,\ k \geq 1\}$ denotes an orthonormal basis of $H$.
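Since the Wiener–Itô integrals $I_1(\varphi_k)$ of an orthonormal system are independent standard Gaussian random variables, the double sum (17) is, in coordinates, the outer product of a single Gaussian vector with itself. A small finite-dimensional sketch (the eigenvalue sequence is a hypothetical choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of S^(N) in Eq. (17): the I_1(phi_k) of an orthonormal system are
# iid N(0, 1), so in the basis {phi_{k,N}} the double sum is the outer
# product u (x) u of the vector u_k = sqrt(lam_k) * I_1(phi_k).
lam_hat = 1.0 / np.arange(1, 6) ** 2        # hypothetical empirical eigenvalues
K = lam_hat.size

z = rng.normal(size=K)                      # one draw of (I_1(phi_1), ..., I_1(phi_K))
S = np.outer(np.sqrt(lam_hat) * z, np.sqrt(lam_hat) * z)

# Diagonal entries lam_k * I_1(phi_k)^2 are the "chi^2-like" random entries
# noted after Theorem 4.1; the operator is rank one, like X (x) X in (19).
assert np.linalg.matrix_rank(S) == 1

# Monte Carlo check that E[S^(N)] recovers diag(lam_k), in line with
# E[X (x) X] = Q_{theta_0} for the limit S^(infinity).
Z = rng.normal(size=(100_000, K))
U = np.sqrt(lam_hat) * Z
mean_S = (U[:, :, None] * U[:, None, :]).mean(axis=0)
print(np.round(np.diagonal(mean_S), 3))     # approximately lam_hat
```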

4 Asymptotic Analysis

With regard to the asymptotic probability distribution of the parametric estimator $\widehat{A}_{\theta_0}$, we establish Theorem 4.1 below, derived in the norm of the space $\mathcal{L}^2_{S(H)}(\Omega, \mathcal{A}, P)$ of zero–mean second–order $S(H)$–valued random variables, with $S(H)$ being the space of Hilbert–Schmidt operators on $H$. The induced metric in $\mathcal{L}^2_{S(H)}(\Omega, \mathcal{A}, P)$ is given by

$$\|X - Y\|_{\mathcal{L}^2_{S(H)}(\Omega,\mathcal{A},P)} = \sqrt{E\left\|X - Y\right\|^2_{S(H)}}, \quad \forall X, Y \in \mathcal{L}^2_{S(H)}(\Omega, \mathcal{A}, P). \tag{18}$$

In the next result, we restrict our attention to $H = L^2([0,1], \mathbb{R})$, obtaining, in particular, the limiting functional random variable

$$S^{(\infty)} = X \otimes X = \sum_{k=1}^{\infty}\sum_{l=1}^{\infty} \sqrt{\lambda_k(\theta_0)\,\lambda_l(\theta_0)}\; I_1(\varphi_k)\, I_1(\varphi_l)\; \phi_k \otimes \phi_l \tag{19}$$

in the space $\mathcal{L}^2_{S(H)}(\Omega, \mathcal{A}, P)$.

Theorem 4.1 Let $X_1, \ldots, X_N$ be a random sample of independent and identically distributed functional random variables, with $X_i \sim \mathcal{N}(0, Q_{\theta_0})$, $i = 1, \ldots, N$. Then, the following limit holds:

$$E\left\|S^{(N)} - S^{(\infty)}\right\|^2_{S(H)} \to 0, \quad N \to \infty, \tag{20}$$


and, in particular,

$$S^{(N)} \to_{D} S^{(\infty)}, \quad N \to \infty,$$

with $S^{(N)}$ and $S^{(\infty)}$ being defined in Eqs. (17) and (19), respectively.

Proof Denoting by $C^{\star}$ the adjoint operator of $C$, we have

$$E\left\|S^{(N)} - S^{(\infty)}\right\|^2_{S(H)} = \left\| E\left[\left(S^{(N)} - S^{(\infty)}\right)\left(S^{(N)} - S^{(\infty)}\right)^{\star}\right] \right\|_{L^1(H)} = \left\| E\left[S^{(N)}(S^{(N)})^{\star}\right] - E\left[S^{(N)}(S^{(\infty)})^{\star}\right] - E\left[S^{(\infty)}(S^{(N)})^{\star}\right] + E\left[S^{(\infty)}(S^{(\infty)})^{\star}\right] \right\|_{L^1(H)}. \tag{21}$$

From Eqs. (17) and (19), and applying the isometry property of the Wiener–Itô stochastic integral, the following identities hold:

$$E\left[S^{(N)}(S^{(N)})^{\star}\right] = \sum_{h=1}^{\infty}\sum_{l=1}^{\infty}\sum_{p=1}^{\infty} \sqrt{\widehat{\lambda}_{h,N}}\;\widehat{\lambda}_{l,N}\,\sqrt{\widehat{\lambda}_{p,N}}\; E\left[I_1(\varphi_h)\, I_1(\varphi_p)\, |I_1(\varphi_l)|^2\right] \phi_{h,N} \otimes \phi_{p,N} = \sum_{h=1}^{\infty}\sum_{l=1}^{\infty}\sum_{p=1}^{\infty} \sqrt{\widehat{\lambda}_{h,N}}\;\widehat{\lambda}_{l,N}\,\sqrt{\widehat{\lambda}_{p,N}} \left(\delta_{h,p} + 2\,\delta_{h,l}\,\delta_{p,l}\right) \phi_{h,N} \otimes \phi_{h,N} = \sum_{h=1}^{\infty} \left( \widehat{\lambda}_{h,N} \left\|\widehat{R}_N\right\|_{L^1(H)} + 2\,\widehat{\lambda}_{h,N}^2 \right) \phi_{h,N} \otimes \phi_{h,N}, \tag{22}$$

$$E\left[S^{(\infty)}(S^{(\infty)})^{\star}\right] = \sum_{h=1}^{\infty}\sum_{l=1}^{\infty}\sum_{p=1}^{\infty} \sqrt{\lambda_h(\theta_0)}\;\lambda_l(\theta_0)\,\sqrt{\lambda_p(\theta_0)}\; E\left[I_1(\varphi_h)\, I_1(\varphi_p)\, |I_1(\varphi_l)|^2\right] \phi_h \otimes \phi_p = \sum_{h=1}^{\infty} \left( \lambda_h(\theta_0) \left\|Q_{\theta_0}\right\|_{L^1(H)} + 2\,[\lambda_h(\theta_0)]^2 \right) \phi_h \otimes \phi_h, \tag{23}$$

$$E\left[S^{(N)}(S^{(\infty)})^{\star}\right] = \sum_{h_1=1}^{\infty}\sum_{p_1=1}^{\infty}\sum_{h_2=1}^{\infty}\sum_{p_2=1}^{\infty} \sqrt{\widehat{\lambda}_{h_1,N}\,\widehat{\lambda}_{p_1,N}\,\lambda_{h_2}(\theta_0)\,\lambda_{p_2}(\theta_0)} \left(\delta_{h_1 h_2}\delta_{p_1 p_2} + \delta_{h_1 p_1}\delta_{p_2 h_2} + \delta_{h_1 p_2}\delta_{h_2 p_1}\right) \phi_{h_1,N} \otimes \phi_{h_2}\, \langle \phi_{p_1,N}, \phi_{p_2}\rangle_H = E\left[S^{(\infty)}(S^{(N)})^{\star}\right] \tag{24}$$

(as before, with $H = L^2([0,1], \mathbb{C})$). From Eqs. (21)–(24), we obtain

$$E\left\|S^{(N)} - S^{(\infty)}\right\|^2_{S(H)} = \sum_{k=1}^{\infty}\sum_{l=1}^{\infty} \left( 2\,\delta_{k,l}\,|\lambda_k(\theta_0)|^2 + |\lambda_k(\theta_0)\,\lambda_l(\theta_0)| \right) \left[ S_1(N) + 1 - 2 S_2(N) - 2 S_3(N) \right], \tag{25}$$

where

$$S_1(N) = \frac{2\,\delta_{k,l}\,\widehat{\lambda}_{k,N}^2 + \widehat{\lambda}_{k,N}\,\widehat{\lambda}_{l,N}}{2\,\delta_{k,l}\,|\lambda_k(\theta_0)|^2 + |\lambda_k(\theta_0)\,\lambda_l(\theta_0)|}, \qquad S_2(N) = \frac{\widehat{\lambda}_{k,N}\,\lambda_l(\theta_0)\, \left[\phi_{k,N}(\phi_l)\right]^2}{2\,\delta_{k,l}\,|\lambda_k(\theta_0)|^2 + |\lambda_k(\theta_0)\,\lambda_l(\theta_0)|},$$

$$S_3(N) = \frac{\left(\widehat{\lambda}_{k,N}\,\widehat{\lambda}_{l,N}\,\lambda_k(\theta_0)\,\lambda_l(\theta_0)\right)^{1/2} \left[ \phi_k(\phi_{k,N})\,\phi_l(\phi_{l,N}) + \phi_k(\phi_{l,N})\,\phi_l(\phi_{k,N}) \right]}{2\,\delta_{k,l}\,|\lambda_k(\theta_0)|^2 + |\lambda_k(\theta_0)\,\lambda_l(\theta_0)|}.$$

Note that

$$\sum_{k=1}^{\infty}\sum_{l=1}^{\infty} \left( 2\,\delta_{k,l}\,|\lambda_k(\theta_0)|^2 + |\lambda_k(\theta_0)\,\lambda_l(\theta_0)| \right) < \infty,$$

and, applying in (25) the strong consistency of the empirical eigenvectors and eigenvalues of $\widehat{R}_N$ (see [5]), we obtain, as $N \to \infty$, $S_1(N) \to 1$, $2 S_2(N) \to 2/3$, and $2 S_3(N) \to 4/3$, uniformly in $k$ and $l$. Thus, the Dominated Convergence Theorem leads to (20). $\square$

From Theorem 4.1, we obtain that $S^{(N)}$ is asymptotically distributed as an infinite–dimensional random quadratic form, applied to an infinite–dimensional vector of weighted functions defined from the theoretical eigenvalues and eigenvectors of the underlying autocovariance operator. The associated infinite–dimensional matrix has $\chi^2$–like random entries. As applied in the proof of this theorem, the strong consistency of the empirical weights involved in the definition of $S^{(N)}$ ensures the strong consistency of the empirical Kullback–Leibler divergence functional $\widehat{D}_{KL}$ in Eq. (13) (see, e.g., [5] on the required conditions).

5 Final Comments

This paper introduces divergence functionals in an abstract framework, for statistical distance assessment in an infinite–dimensional probability distribution context. The theoretical and empirical versions of the Kullback–Leibler divergence functional are considered. Under a Gaussian scenario, the asymptotic analysis of the empirical version $\widehat{D}_{KL}$ of $D_{KL}$ can be achieved by applying usual FDA techniques. Particularly, from Theorem 4.1, applying results in [5] on the normal asymptotic probability distribution of the empirical eigenvalues, the asymptotic probability distribution of $\widehat{D}_{KL}$ can be identified with the infinite sum of a sequence of products of log-Gaussian


random variables multiplied by scaled, inverted, and shifted Gaussian random variables. Infinite-dimensional formulations of other families of $f$–divergence measures, including squared-Hellinger-distance-based divergence, Jeffrey's divergence, Chernoff's α–divergence, exponential divergence, Kagan's divergence, and (α, β)–product divergence functionals, will be considered in subsequent research.

Acknowledgements The authors are grateful to the editors for providing the opportunity to be part of this compendium in honour of Prof. Leandro Pardo, whose extraordinary scientific contributions have been, and will continue to be, a source of inspiration for many of us and for coming generations. This work has been supported in part by grants MCIU/AEI/ERDF, UE PGC2018-098860-B-I00 and PGC2018-099549-B-I00, grant A-FQM-345-UGR18 cofinanced by ERDF Operational Programme 2014–2020 and the Economy and Knowledge Council of the Regional Government of Andalusia, Spain, and grant CEX2020-001105-M MCIN/AEI/10.13039/501100011033.

References

1. Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B 28, 131–142 (1966)
2. Angulo, J.M., Esquivel, F.J., Madrid, A.E., Alonso, F.J.: Information and complexity analysis of spatial data. Spat. Statist. 42, 100462 (2021)
3. Ben-Tal, A., Teboulle, M.: Penalty functions and duality in stochastic programming via φ-divergence functionals. Math. Oper. Res. 12, 224–240 (1987)
4. Bregman, L.M.: The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7, 200–217 (1967)
5. Bosq, D.: Linear Processes in Function Spaces. Springer, New York (2000)
6. Csiszár, I.: Information-type measures of difference of probability distributions and indirect observation. Studia Scient. Mathemat. Hungar. 2, 229–318 (1967)
7. Da Prato, G., Zabczyk, J.: Second Order Partial Differential Equations in Hilbert Spaces. Cambridge University Press, Cambridge (2002)
8. Dautray, R., Lions, J.L.: Mathematical Analysis and Numerical Methods for Science and Technology, 3: Spectral Theory and Applications. Springer, New York (1985)
9. Ellis, R.S.: Entropy, Large Deviations, and Statistical Mechanics. Springer, New York (1985)
10. Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis: Theory and Practice. Springer, New York (2006)
11. Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time, 3rd edn. De Gruyter, Berlin (2011)
12. Frías, M.P., Torres-Signes, A., Ruiz-Medina, M.D.: Spatial Cox processes in an infinite-dimensional framework. Test 31, 175–203 (2022)
13. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Statist. 22, 79–86 (1951)
14. Laeven, R.J.A., Stadje, M.: Entropy coherent and entropy convex measures of risk. Math. Oper. Res. 38, 265–293 (2013)
15. Ledoux, M., Talagrand, M.: Probability in Banach Spaces. Springer, Heidelberg (1991)
16. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
17. Peccati, G., Taqqu, M.S.: Wiener Chaos: Moments, Cumulants and Diagrams. Springer, Milan (2011)
18. Ramm, A.G.: Random Fields Estimation. Longman Scientific & Technical, London (2005)
19. Rényi, A.: On measures of entropy and information. In: Neyman, J. (ed.) Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561. University of California Press, Berkeley (1961)
20. Ruiz-Medina, M.D.: Spectral analysis of long range dependence in functional time series (2021). arXiv:1912.07086 [math.ST]
21. Torres-Signes, A., Frías, M.P., Ruiz-Medina, M.D.: COVID-19 mortality analysis from soft-data multivariate curve regression and machine learning. Stoch. Environ. Res. Risk Assess. 35, 2659–2678 (2021)
22. Tsallis, C.: Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Phys. 52, 479–487 (1988)
23. Varadhan, S.R.S.: Asymptotic probability and differential equations. Commun. Pure Appl. Math. 19, 261–286 (1966)
24. Xu, M., Angulo, J.M.: Divergence-based risk measures: a discussion on sensitivities and extensions. Entropy 21, 634 (2019)

A Model Selection Criterion for Count Models Based on a Divergence Between Probability Generating Functions

Apostolos Batsidis and Polychronis Economou

Abstract Model selection criteria are often used to choose the best fitting distribution among a set of candidate models for describing the data. A plethora of model selection criteria have been developed over the years by constructing estimators of discrepancy measures that assess the divergence between the true model and a fitted approximating model in terms of their probability mass functions or probability density functions. This contribution focuses on a model selection criterion for count models, which assesses the divergence between the true model and a fitted approximating model in terms of probability generating functions. The proposed model selection criterion is appealing in cases where the likelihood of a model is not regular enough. An example is presented where the proposed model selection criterion can be applied, while ordinary ones, such as the Akaike Information Criterion, cannot. Finally, the performance of the proposed model selection criterion is evaluated and compared with that of the Akaike Information Criterion in a Monte Carlo study for a case where both approaches can be applied.

I would like to thank the Editors for their kind invitation to present a contribution in this book in tribute to Professor Leandro Pardo, an exceptional scientist and an outstanding human personality. I met him for the first time in Ioannina almost twenty years ago, as he has a long-standing friendship and research collaboration with my Ph.D. supervisor Prof. Kostas Zografos. It was a great honor to join their team and collaborate with them. Our scientific collaboration with Leandro seems rather small in comparison with our personal relationship.

A. Batsidis (B)
Department of Mathematics, University of Ioannina, Ioannina, Greece
e-mail: [email protected]

P. Economou
Department of Civil Engineering, University of Patras, University Campus, 26504 Rio Achaia, Greece
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_15


1 Introduction

Divergence measures are indices of similarity or dissimilarity between populations and are used for the development of statistical methods in order to formulate and solve a great variety of statistical problems (see the monographs by [3, 16]). In applications, choosing the best fitting distribution among a list of candidates for describing a population from a given sample of observations is an important subject. The construction of model selection criteria based on a measure of similarity or divergence between two models has also received considerable attention. The well-known Kullback-Leibler measure of divergence [11] was used by [1] in order to develop the Akaike information criterion (AIC). In its general formula, AIC consists of two terms: the first term (the value of the log-likelihood function evaluated at the maximum likelihood estimator of the unknown parameters of each candidate model, multiplied by minus two) measures the lack of fit of the model to the data, while the second term (two times the number of estimated parameters) is a penalty for adding additional estimable parameters. Thus AIC acts as a penalized log-likelihood criterion. The pioneering work by [1] has generated a vast literature, and various model selection criteria have been presented either by introducing other ways of penalizing (see for instance [4, 8, 9, 18]) or by using different divergence measures (see among others [2, 13, 20, 21]). The previous model selection criteria were developed in the context of full likelihood, while [5] recently presented a model selection criterion in a composite likelihood framework based on density power divergence measures.
Although several model selection criteria have been presented in the literature, to the best of our knowledge they are based on divergence measures in terms of probability density functions (pdf) or probability mass functions (pmf), as well as on maximum likelihood estimation. However, the maximum likelihood estimates are rather sensitive to outlying observations and/or computationally difficult whenever the corresponding pdf or pmf is complicated. Moreover, there are also cases in which the likelihood equations do not always have a solution. A typical example is the case of the standard Hermite count distribution (see p. 689 in [17]). For these reasons, the main goal of this contribution is to propose an alternative model selection criterion, which can be used when dealing with count models in cases where the traditional ones cannot be applied. As argued by [14], when dealing with count data it is more convenient to use methods based on the probability generating function (pgf) instead of the corresponding pmf, since the pgf is usually much simpler than the corresponding pmf and possesses convenient features. Because of this, [10, 22] dealt with the hypothesis testing problem that two competing count models are equally close to the true population, against the hypothesis that one model is closer than the other, by means of a divergence measure based on the pgf. In the sequel, motivated by [10], a general model selection criterion between count models based on this divergence measure is presented.


The rest of the paper is organized as follows. In Sect. 2, the pgf model selection criterion is presented and some of its properties are given. Section 3 gives one example of choosing between two candidate count models where the traditional model selection criteria based on maximum likelihood estimates do not always apply. In contrast, the pgf model selection criterion can be applied, and its performance is numerically investigated in a simulation study. Section 3 also presents an example where both approaches, i.e. the proposed and the traditional one, apply. In this context, the performance of the pgf model selection criterion is evaluated and compared with that of the popular Akaike information criterion based on a Monte Carlo study. Section 4 provides the conclusions to this study and some open problems for further research.

2 Model Selection Criterion

Let $X_1, X_2, \ldots, X_n$ be independent, identically distributed (IID) random vectors from the population $X$, which takes values on $\mathbb{N}_0^d$, where $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$, with unknown pgf $c(t)$. Let $\mathcal{F}_l = \{c_{\mathcal{F}_l}(t; \theta_l);\ \theta_l \in \Theta_l\}$, $l = 1, \ldots, M$, be $M$ parametric families, which constitute the set of candidate models, so that each member of those families has pgf $c_{\mathcal{F}_l}(t; \theta_l)$, respectively, for some finite-dimensional parameter $\theta_l \in \Theta_l$, where $\Theta_l \subseteq \mathbb{R}^{k_l}$, for some $k_l \in \mathbb{N}$. It is assumed that the $\mathcal{F}_l$ are separate and that the elements of $\mathcal{F}_l$ are identifiable. In the sequel, the argument $t$ will be skipped when unnecessary; in such cases $c(t)$, $c_{\mathcal{F}_l}(t; \theta_l)$ will simply be denoted by $c$ and $c_{\mathcal{F}_l}(\theta_l)$, respectively.

Given this sample, the problem of choosing the best fitting model among a list of two or more parametric count models consists of choosing the model which is nearest to the true population. For reasons explained in the Introduction, when dealing with count models [14] argued in favor of using methods based on the pgf and the empirical pgf (epgf). Therefore, it is more convenient to measure the closeness between each competing model and the true population model by means of a divergence based on the pgf's. In this context, given two probability generating functions $c_1(t)$ and $c_2(t)$, associated with two $d$–dimensional random vectors taking values on $\mathbb{N}_0^d$, we consider the following divergence measure (see [10, 15, 19] and references therein for further details):

$$D^2_{\alpha,w}(c_1, c_2) = \int \left\{c_1^\alpha(t) - c_2^\alpha(t)\right\}^2 w(t)\, dt, \tag{1}$$

where the integral is over $[0,1]^d$, $w(t) > 0$ is such that $\int w(t)\, dt < \infty$, and $\alpha \in \mathbb{R} \setminus \{0\}$. Note that $D^2_{\alpha,w}(c_1, c_2) = 0$ if and only if $c_1(t) = c_2(t)$, $\forall t \in [0,1]^d$, and therefore the associated populations coincide. It is obvious that the best model is the one which minimizes the divergence measure defined in (1) between the true population pgf and its own pgf.
Following [10], the closeness between the unknown $c(t)$ and the parametric family $\mathcal{F}_l$ is measured


as the closeness between the unknown $c(t)$ and the element of $\mathcal{F}_l$ closest to $c$. In this frame, let $\theta_{l,*} := \theta_{l,*}^{\alpha,w} = \arg\min_{\theta_l \in \Theta_l} D^2_{\alpha,w}\left(c, c_{\mathcal{F}_l}(\theta_l)\right)$, which implies that $c_{\mathcal{F}_l}(\theta_{l,*})$ is the element of $\mathcal{F}_l$ closest to $c$, usually called the projection of $c$ on $\mathcal{F}_l$. As [10] pointed out, if the true distribution belongs to the $j$-th parametric family, then $\theta_{j,*}$ exists and is unique, while when the true distribution does not belong to the $j$-th parametric family, $\theta_{j,*}$ may not exist or, if it exists, it may not be unique. It will be assumed in the sequel that $D^2_{\alpha,w}(c, c_{\theta_l})$ has a unique minimum at $\theta_{l,*} \in \Theta_l$, $l = 1, \ldots, M$. If this assumption holds, then

$$D^2_{\alpha,w}(c, \mathcal{F}_l) = \inf_{\theta_l \in \Theta_l} D^2_{\alpha,w}\left(c, c_{\mathcal{F}_l}(\theta_l)\right) = D^2_{\alpha,w}\left(c, c_{\mathcal{F}_l}(\theta_{l,*})\right).$$

Remark 2.1 In all examples considered in the next section the previous assumption holds, since the function $D^2_{\alpha,w}\left(c, c_{\mathcal{F}_l}(\theta_l)\right)$ is strictly convex (see [10]).

Based on the previous discussion, $\mathcal{F}_{\theta_j}$ is the best among the $M$ candidate models if $D^2_{\alpha,w}\left(c, c_{\mathcal{F}_j}(\theta_{j,*})\right)$ is the minimum among the $D^2_{\alpha,w}\left(c, c_{\mathcal{F}_l}(\theta_{l,*})\right)$, $l = 1, \ldots, M$. However, the quantities $D^2_{\alpha,w}\left(c, c_{\mathcal{F}_l}(\theta_{l,*})\right)$ are unknown and should be consistently estimated. For this purpose, the population pgf is estimated (see, for example, [12]) by the epgf, which is given by the following relation:

$$c_n(t) = \frac{1}{n}\sum_{j=1}^{n} t^{X_j}, \quad t \in [0,1]^d,$$

while to estimate $\theta_{l,*}^{\alpha,w}$ we consider $\hat{\theta}_{l,n} := \hat{\theta}_{l,n}(X_1, X_2, \ldots, X_n)$, where

Remark 2.1 In all examples considered  in the next section the previous assumption 2 holds since the function Dα,w c, cFl (θl ) is strictly convex (see [10]). Based on the previous discussion, Fθ j is the best among the M candidate models 2 2 (c, cF j (θ j,∗ )) is the minimum among the Dα,w (c, cFl (θl,∗ )), l = 1, . . . , M. if Dα,w 2 However, the quantities Dα,w (c, cFl (θl,∗ )) are unknown and should be consistently estimated. For this purpose, the population pgf is estimated (see for example [12]) by the epgf, which is given by the following relation: n 1  Xj t , t ∈ [0, 1]d , cn (t) = n j=1 α,w while to estimate θl,∗ we consider θˆl,n := θˆl,n (X 1 , X 2 , . . . , X n ), where

θ̂_{l,n} = arg min_{θ_l ∈ Θ_l} D^2_{α,w}(c_n, c_{F_l}(θ_l)).

If θ̂_{l,n} exists and is unique, it is called the minimum probability generating function distance (pgfd) estimator of θ_l. Reference [10] studied the strong consistency and asymptotic normality of θ̂_{l,n}, even if the model is misspecified, as a prerequisite to goodness-of-fit and model selection hypothesis testing problems. Based on Theorem 3(a) given in [10], the quantity D^2_{α,w}(c, F_l) is consistently estimated by D^2_{α,w}(c_n, c_{F_l}(θ̂_{l,n})).

Proposition 2.1 (see [10]) Suppose that D^2_{α,w}(c, c_{F_l}(θ_l)) has a unique minimum at θ_{l,*} ∈ int Θ_l, where int Θ_l denotes the interior of Θ_l. Moreover, suppose that P(X = 0) > 0 whenever α < 1, where P(X = r) denotes the probability function of X. Finally, assume that c_{F_l}(t; θ_l) is continuous as a function of θ_l for all t ∈ [0,1]^d. Then D^2_{α,w}(c_n, c_{F_l}(θ̂_{l,n})) converges to D^2_{α,w}(c, c_{F_l}(θ_{l,*})) = D^2_{α,w}(c, F_l) almost surely.

Remark 2.2 Following [10], if instead of θ̂_{l,n} we use any other estimator θ̃_{l,n} = θ̃_n(X_1, ..., X_n) such that θ̃_{l,n} converges almost surely to θ_{l,1} ∈ int Θ_l, then the assertion of Proposition 2.1 remains true with the following minor change:

A Model Selection Criterion for Count Models Based …

163

D^2_{α,w}(c_n, c_{F_l}(θ̃_{l,n})) converges to D^2_{α,w}(c, c_{F_l}(θ_{l,1})) almost surely.

However, in this case, in general D^2_{α,w}(c, c_{F_l}(θ_{l,1})) ≠ D^2_{α,w}(c, F_l). Since the maximum likelihood estimates satisfy the assumptions described previously, they can be used as an alternative to the pgfd estimators in cases where both can be applied.

Therefore, based on the previous discussion, the best model among the M candidates is the one which minimizes the divergence measure defined in (1) between the epgf and an estimator of the pgf of the candidate model, obtained by replacing the unknown parameters by the minimum pgfd estimators or by the maximum likelihood estimator θ̃_{l,n}. Thus the best among the M candidate models is the one which minimizes D^2_{α,w}(c_n, c_{F_l}(θ̂_{l,n})) or D^2_{α,w}(c_n, c_{F_l}(θ̃_{l,n})).
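The whole estimation step can be sketched for a univariate Poisson candidate with α = 1 and uniform weight. A grid search stands in for a proper optimiser, and the function names and toy data are ours, not the paper's.

```python
import numpy as np

def epgf(t, x):
    """Empirical pgf c_n(t) = (1/n) * sum_j t^{X_j}, on a grid of t values."""
    t = np.asarray(t, dtype=float)
    x = np.asarray(x)
    return np.power.outer(t, x).mean(axis=1)

def pgfd_estimate(x, model_pgf, theta_grid, alpha=1.0, n_grid=1001):
    """Minimum pgfd estimator: grid-search argmin of D^2_alpha(c_n, c_F(theta))."""
    t = np.linspace(0.0, 1.0, n_grid)
    cn = epgf(t, x)
    crit = [np.mean((cn ** alpha - model_pgf(t, th) ** alpha) ** 2)
            for th in theta_grid]
    return float(theta_grid[int(np.argmin(crit))])

x = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]              # toy count sample (mean 1.1)
grid = np.linspace(0.05, 5.0, 100)
theta_hat = pgfd_estimate(x, lambda t, th: np.exp(th * (t - 1.0)), grid)
```

On this toy sample the pgfd estimate lands close to the sample mean, as one would expect for a Poisson candidate.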

3 Numerical Experiments

In this section two examples are presented. The first one consists of a case in which the traditional model selection criteria, like AIC and its variants, cannot be applied, while in the second the proposed pgf model selection criterion is compared with the traditional ones in a case where all the criteria can be applied. Due to limited space, we restrict our interest to comparing the performance of the pgf model selection criterion with that of the most popular information criterion, namely AIC. Additionally, without loss of generality, and taking into account that the examples are for univariate data, it will be assumed throughout the rest of the paper that the weight function is the probability density function of the uniform distribution on (0, 1); for this reason w will be omitted from the notation. Regarding the parameter α, the values α = 0.5 (Hellinger-type divergence), α = 1 (L2-type divergence) and α = 2 are considered.

In each example, we generate a random sample of size n from a count model with pgf c_{F_l}(θ_l), for fixed θ_l, l = 1, 2, and sample sizes n = 20, 40, 60, 100, 150, 200. The values of the model selection criteria for all the candidate models are computed and, based on them, it is checked whether the model selection criterion attains its minimum value for the model from which the data were generated. We replicate the process 20,000 times and obtain the relative frequencies of choosing the correct model. Before proceeding we introduce some notation: MLE-j and PGFD-j, j = 0.5, 1, 2, denote the pgf criterion computed with the MLE or the pgfd estimators, respectively, with j the value of the parameter α.

3.1 Standard Hermite Versus Discrete Lindley

In this first example, we consider the problem of choosing between the family of Discrete Lindley distributions, introduced by [7], with pgf given by

c_F(t; θ1) = [ (θ1 − 1)(tθ1 − 1) − (1 − 2θ1 + tθ1^2) log(θ1) ] / [ (1 − tθ1)^2 (1 − log θ1) ],

where 0 < θ1 < 1, and the Standard Hermite distribution H(θ2), θ2 = (θ2,1, θ2,2), with pgf

c_F(t; θ2) = exp{θ2,1(t − 1) + θ2,2(t^2 − 1)},

with θ2,1 ≥ 0 and θ2,2 ≥ 0. The Standard Hermite distribution is a typical example of a count model whose pmf is complicated, while the corresponding pgf is much simpler, with convenient features. However, this is not the only reason why traditional model selection criteria, like AIC, cannot be applied. As [17] pointed out, the likelihood equations for the Standard Hermite distribution do not always have a solution. Taking into account that traditional model selection criteria, like AIC, require that the log-likelihood has been maximized, it is obvious that they cannot always be applied whenever the Standard Hermite distribution is a candidate model. In the sequel, the performance of the pgf model selection criterion is numerically investigated via a small simulation study. The simulation results for different sample sizes and for different parameter values θ1, θ2 = (θ2,1, θ2,2) are given in Table 1. Looking at Table 1 it can be concluded that: (i) when sampling from DL(θ1) the performance of the criterion depends on the parameter θ1. On the other hand, when

Table 1 Relative frequencies of selecting the correct model for the model selection problem between standard Hermite and Discrete Lindley. The left block reports sampling from DL(θ1); the right block, sampling from H(θ2)

| n   | θ1   | PGFD-0.5 | PGFD-1  | PGFD-2  | θ2           | PGFD-0.5 | PGFD-1  | PGFD-2  |
|-----|------|----------|---------|---------|--------------|----------|---------|---------|
| 20  | 0.5  | 0.0875   | 0.0942  | 0.1096  | (0.25, 0.1)  | 0.9966   | 0.9966  | 0.99115 |
| 40  | 0.5  | 0.1272   | 0.13765 | 0.15865 | (0.25, 0.1)  | 0.9867   | 0.9857  | 0.98545 |
| 60  | 0.5  | 0.1558   | 0.16725 | 0.19355 | (0.25, 0.1)  | 0.9924   | 0.99235 | 0.99185 |
| 100 | 0.5  | 0.2064   | 0.2215  | 0.2547  | (0.25, 0.1)  | 0.9948   | 0.9946  | 0.9946  |
| 150 | 0.5  | 0.25955  | 0.27775 | 0.3172  | (0.25, 0.1)  | 0.9972   | 0.9969  | 0.9972  |
| 200 | 0.5  | 0.3015   | 0.32385 | 0.37125 | (0.25, 0.1)  | 0.99805  | 0.99805 | 0.9982  |
| 20  | 0.75 | 0.25175  | 0.327   | 0.51175 | (0.25, 0.25) | 0.99455  | 0.99455 | 0.9947  |
| 40  | 0.75 | 0.3805   | 0.4774  | 0.67775 | (0.25, 0.25) | 0.99595  | 0.99595 | 0.996   |
| 60  | 0.75 | 0.47835  | 0.5837  | 0.77065 | (0.25, 0.25) | 0.99905  | 0.99895 | 0.99885 |
| 100 | 0.75 | 0.6096   | 0.71705 | 0.869   | (0.25, 0.25) | 0.99965  | 0.99985 | 0.99975 |
| 150 | 0.75 | 0.71735  | 0.8144  | 0.9254  | (0.25, 0.25) | 0.99955  | 0.99995 | 1       |
| 200 | 0.75 | 0.78935  | 0.87675 | 0.9563  | (0.25, 0.25) | 0.9995   | 1       | 0.9998  |
| 20  | 0.9  | 0.3849   | 0.6871  | 0.84365 | (0.25, 0.5)  | 0.9949   | 0.99485 | 0.99435 |
| 40  | 0.9  | 0.5939   | 0.8512  | 0.9457  | (0.25, 0.5)  | 0.9985   | 0.9984  | 0.99785 |
| 60  | 0.9  | 0.702    | 0.91545 | 0.9775  | (0.25, 0.5)  | 0.9997   | 0.9996  | 0.9994  |
| 100 | 0.9  | 0.832    | 0.97185 | 0.99655 | (0.25, 0.5)  | 0.99985  | 0.9999  | 0.9997  |
| 150 | 0.9  | 0.90745  | 0.9904  | 0.99965 | (0.25, 0.5)  | 0.99995  | 1       | 0.9996  |
| 200 | 0.9  | 0.9454   | 0.9968  | 0.9999  | (0.25, 0.5)  | 0.99985  | 1       | 0.99965 |


Table 2 Relative frequencies of selecting the correct model for the model selection problem between Poisson and Geometric. The upper block reports sampling from Poisson(θ1); the lower block, sampling from Geo(θ2)

Sampling from Poisson(θ1):

| n   | θ1   | AIC     | MLE-0.5 | MLE-1   | MLE-2   | PGFD-0.5 | PGFD-1  | PGFD-2  |
|-----|------|---------|---------|---------|---------|----------|---------|---------|
| 20  | 0.25 | 0.7851  | 0.7691  | 0.7691  | 0.7694  | 0.7693   | 0.7693  | 0.7693  |
| 40  | 0.25 | 0.77915 | 0.75455 | 0.75455 | 0.75455 | 0.7522   | 0.7538  | 0.7536  |
| 60  | 0.25 | 0.7596  | 0.76525 | 0.76525 | 0.76525 | 0.75995  | 0.75755 | 0.75755 |
| 100 | 0.25 | 0.7838  | 0.7962  | 0.8022  | 0.803   | 0.7951   | 0.7951  | 0.79475 |
| 150 | 0.25 | 0.8276  | 0.8328  | 0.8368  | 0.84145 | 0.82945  | 0.82955 | 0.8293  |
| 200 | 0.25 | 0.8582  | 0.8683  | 0.86945 | 0.86995 | 0.8596   | 0.8637  | 0.8636  |
| 20  | 0.5  | 0.8137  | 0.7716  | 0.77185 | 0.7826  | 0.77135  | 0.77135 | 0.7723  |
| 40  | 0.5  | 0.84915 | 0.8397  | 0.8398  | 0.844   | 0.81845  | 0.82925 | 0.82925 |
| 60  | 0.5  | 0.87385 | 0.8673  | 0.8683  | 0.87645 | 0.8603   | 0.86055 | 0.86265 |
| 100 | 0.5  | 0.9128  | 0.9139  | 0.9181  | 0.92155 | 0.90835  | 0.9089  | 0.9126  |
| 150 | 0.5  | 0.9449  | 0.95165 | 0.95365 | 0.95575 | 0.94665  | 0.9476  | 0.9492  |
| 200 | 0.5  | 0.9676  | 0.9722  | 0.9735  | 0.97545 | 0.96755  | 0.9686  | 0.96985 |
| 20  | 0.75 | 0.85245 | 0.82435 | 0.8284  | 0.8457  | 0.8139   | 0.81545 | 0.82615 |
| 40  | 0.75 | 0.89585 | 0.89215 | 0.89575 | 0.9061  | 0.87835  | 0.8839  | 0.889   |
| 60  | 0.75 | 0.93045 | 0.93065 | 0.9332  | 0.9408  | 0.91985  | 0.9221  | 0.92535 |
| 100 | 0.75 | 0.96525 | 0.9681  | 0.9708  | 0.9749  | 0.96155  | 0.96325 | 0.96685 |
| 150 | 0.75 | 0.985   | 0.98695 | 0.9882  | 0.9901  | 0.98425  | 0.985   | 0.986   |
| 20  | 1    | 0.88795 | 0.8753  | 0.8822  | 0.894   | 0.85125  | 0.85905 | 0.8747  |
| 40  | 1    | 0.93615 | 0.93185 | 0.9381  | 0.94835 | 0.9171   | 0.9207  | 0.9305  |
| 60  | 1    | 0.96455 | 0.96265 | 0.9669  | 0.974   | 0.9532   | 0.95495 | 0.96135 |
| 100 | 1    | 0.9862  | 0.98605 | 0.98905 | 0.992   | 0.98155  | 0.98335 | 0.98625 |
| 150 | 1    | 0.99655 | 0.99645 | 0.99755 | 0.9986  | 0.99505  | 0.9956  | 0.9967  |
| 200 | 1    | 0.99935 | 0.99905 | 0.9997  | 0.99985 | 0.9985   | 0.99865 | 0.99915 |

Sampling from Geo(θ2):

| n   | θ2   | AIC     | MLE-0.5 | MLE-1   | MLE-2   | PGFD-0.5 | PGFD-1  | PGFD-2  |
|-----|------|---------|---------|---------|---------|----------|---------|---------|
| 20  | 1    | 0.73655 | 0.73805 | 0.71325 | 0.731   | 0.70235  | 0.7033  | 0.7037  |
| 40  | 1    | 0.871   | 0.85675 | 0.836   | 0.85015 | 0.81005  | 0.81285 | 0.81555 |
| 60  | 1    | 0.9293  | 0.91535 | 0.89665 | 0.90915 | 0.8642   | 0.8672  | 0.873   |
| 100 | 1    | 0.9781  | 0.96525 | 0.95415 | 0.96205 | 0.92665  | 0.9297  | 0.93465 |
| 150 | 1    | 0.99435 | 0.9888  | 0.98245 | 0.9657  | 0.96725  | 0.987   | 0.97005 |
| 200 | 1    | 0.9984  | 0.9966  | 0.99425 | 0.9838  | 0.98455  | 0.99575 | 0.9865  |
| 20  | 3/7  | 0.52105 | 0.5636  | 0.55635 | 0.55555 | 0.55555  | 0.56335 | 0.5564  |
| 40  | 3/7  | 0.644   | 0.65965 | 0.6583  | 0.65435 | 0.64665  | 0.6596  | 0.6491  |
| 60  | 3/7  | 0.72675 | 0.7248  | 0.71965 | 0.7112  | 0.71175  | 0.72415 | 0.71365 |
| 100 | 3/7  | 0.81805 | 0.80635 | 0.7964  | 0.77895 | 0.7792   | 0.8002  | 0.78195 |
| 150 | 3/7  | 0.879   | 0.86205 | 0.8551  | 0.83565 | 0.83585  | 0.85975 | 0.8383  |
| 200 | 3/7  | 0.92035 | 0.9015  | 0.89625 | 0.8732  | 0.8734   | 0.8998  | 0.875   |
| 20  | 0.25 | 0.39485 | 0.41115 | 0.41065 | 0.4101  | 0.4101   | 0.41115 | 0.4102  |
| 40  | 0.25 | 0.51615 | 0.5404  | 0.5407  | 0.53525 | 0.5338   | 0.5404  | 0.5348  |
| 60  | 0.25 | 0.6152  | 0.60745 | 0.60745 | 0.59625 | 0.60155  | 0.60745 | 0.60175 |
| 100 | 0.25 | 0.6972  | 0.67715 | 0.67185 | 0.6644  | 0.6643   | 0.6726  | 0.6666  |
| 150 | 0.25 | 0.74225 | 0.73065 | 0.7218  | 0.71625 | 0.71645  | 0.7266  | 0.71795 |
| 200 | 0.25 | 0.7858  | 0.76945 | 0.7674  | 0.76045 | 0.7576   | 0.76865 | 0.75785 |
| 20  | 1/9  | 0.16825 | 0.1689  | 0.1689  | 0.1689  | 0.1689   | 0.1689  | 0.1689  |
| 40  | 1/9  | 0.29415 | 0.29535 | 0.29535 | 0.29465 | 0.29465  | 0.29535 | 0.29465 |
| 60  | 1/9  | 0.4031  | 0.40295 | 0.40295 | 0.40165 | 0.40255  | 0.40295 | 0.40255 |
| 100 | 1/9  | 0.5321  | 0.4896  | 0.4896  | 0.52965 | 0.52965  | 0.4896  | 0.52965 |
| 150 | 1/9  | 0.55475 | 0.55225 | 0.55185 | 0.563   | 0.563    | 0.55225 | 0.56315 |
| 200 | 1/9  | 0.6036  | 0.5803  | 0.5817  | 0.5938  | 0.5941   | 0.5803  | 0.5941  |

sampling from H(θ2), the performance of the criterion is excellent in all cases presented here. For instance, under DL sampling with θ1 = 0.5 the discriminating ability of the criterion is poor. This can be explained by the fact that the DL belongs to a one-parameter family of distributions, while the Hermite belongs to a two-parameter family. Thus the Hermite is a more flexible distribution and, taking into account that the introduced criterion does not penalize the additional parameters, this may explain the poor behavior in some cases; (ii) the value of α = 2 gives the best results, and as the sample size grows the relative frequencies of selecting the correct model increase too.
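The two candidate pgfs of this example can be written down directly and sanity-checked; both must satisfy c(1) = 1. The helper names are ours, and the code is a sketch rather than the authors' implementation.

```python
import numpy as np

def pgf_discrete_lindley(t, theta1):
    """pgf of the Discrete Lindley distribution, 0 < theta1 < 1."""
    num = (theta1 - 1.0) * (t * theta1 - 1.0) \
          - (1.0 - 2.0 * theta1 + t * theta1 ** 2) * np.log(theta1)
    den = (1.0 - t * theta1) ** 2 * (1.0 - np.log(theta1))
    return num / den

def pgf_standard_hermite(t, theta21, theta22):
    """pgf of the Standard Hermite distribution, theta21, theta22 >= 0."""
    return np.exp(theta21 * (t - 1.0) + theta22 * (t ** 2 - 1.0))
```

Evaluating at t = 0 also recovers P(X = 0) for each model, which is the only pmf value needed explicitly by the pgf-based criterion.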

3.2 Poisson Versus Geometric

In this second example, mainly motivated by [6], we consider the problem of choosing between the family of Poisson distributions, with pgf c_F(t; θ1) = exp(θ1(t − 1)), and the family of Geometric distributions, with pgf c_F(t; θ2) = (1 + θ2 − θ2 t)^{−1}. Note that


the Poisson and Geometric distributions are of roughly similar shape if the mean is less than one (see [6]). The simulation results for different sample sizes and for different parameter values θ1, θ2 are given in Table 2. Looking at Table 2 it can be concluded that: (i) in most cases the differences in relative frequencies for the considered values of α are rather small; (ii) in most of the cases the method of estimation does not affect the behavior of the pgf model selection criterion; however, in some cases the higher relative frequency is obtained for the MLE (see for example (n, θ2) = (20, 0.1), (100, 3/7)), while in others it is obtained for the PGFD (see for example (n, θ2) = (100, 1/9)); (iii) the pgf criterion behaves comparably to AIC. In the majority of the cases the differences in relative frequencies are rather small. However, in some cases AIC outperforms it (see for instance (n, θ1) = (20, 0.5), (20, 0.75) and (n, θ2) = (100, 1/9)), while the pgf criterion outperforms AIC in some other situations (see (n, θ1) = (100, 0.25) and (n, θ2) = (20, 3/7)).
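The Poisson-versus-Geometric selection can be run end to end on a toy, strongly overdispersed sample (variance well above the mean), for which the Geometric candidate should win. The data and helper names are ours, and α = 1 with uniform weight is assumed as in the rest of Sect. 3.

```python
import numpy as np

def epgf(t, x):
    """Empirical pgf on a grid of t values."""
    return np.power.outer(np.asarray(t, float), np.asarray(x)).mean(axis=1)

def fitted_distance(x, model_pgf, theta_grid, alpha=1.0, n_grid=1001):
    """min over theta of the approximate D^2_alpha(c_n, c_F(theta))."""
    t = np.linspace(0.0, 1.0, n_grid)
    cn = epgf(t, x)
    return min(float(np.mean((cn ** alpha - model_pgf(t, th) ** alpha) ** 2))
               for th in theta_grid)

x = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 3, 4, 6, 8]   # variance >> mean
grid = np.linspace(0.05, 5.0, 200)
crit = {"poisson":   fitted_distance(x, lambda t, th: np.exp(th * (t - 1.0)), grid),
        "geometric": fitted_distance(x, lambda t, th: 1.0 / (1.0 + th - th * t), grid)}
best = min(crit, key=crit.get)
```

Selecting the model with the smaller fitted divergence is exactly the rule derived at the end of Sect. 2.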

4 Conclusions and Topics for Further Research

The problem of choosing the best fitting distribution among a list of candidate count models for a given set of data was considered, based on a selection criterion which assesses the divergence between the true model and a fitted approximating model in terms of their probability generating functions. Based on a simulation study, it is observed that the new criterion works comparably to the well-known AIC criterion when both approaches can be applied. In cases where the likelihood equations do not always have a solution, for instance when one of the candidate models is the standard Hermite distribution, we recommend the use of the pgf model selection criterion. In any other case, there is no clear-cut answer to the question of which criterion should be used.

In our derivation, D^2_{α,w}(c_n, c_{F_l}(θ̂_{l,n})) or D^2_{α,w}(c_n, c_{F_l}(θ̃_{l,n})) assesses the goodness of the fit to the data. However, a model selection criterion needs to take both the goodness of the fit and the complexity of the competing models into account. For this reason, a second term which penalizes the complexity of the model could be added to the aforementioned criterion. The choice of a penalty term which improves the performance of the criterion, and the comparison of its performance with that of the other model selection criteria mentioned in the introduction, are open problems. Finally, it remains an open problem to perform a comprehensive simulation study in order to examine the role of α, w and the estimator used in the pgf model selection criterion.


References

1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, F. (eds.) Proceedings of the Second International Symposium on Information Theory. Akadémiai Kiadó, Budapest (1973)
2. Avlogiaris, G., Micheas, A., Zografos, K.: A criterion for local model selection. Sankhya 81, 406–444 (2019)
3. Basu, A., Shioya, H., Park, C.: Statistical Inference: The Minimum Distance Approach. Chapman & Hall/CRC, London (2011)
4. Bozdogan, H.: Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52, 345–370 (1987)
5. Castilla, E., Martín, N., Pardo, L., Zografos, K.: Model selection in a composite likelihood framework based on density power divergence. Entropy 22(270), e22030270 (2020)
6. Cox, D.R.: Further results on tests of separate families of hypotheses. J. Roy. Statist. Soc. Ser. B 24, 406–424 (1962)
7. Gómez-Déniz, E., Calderín-Ojeda, E.: The discrete Lindley distribution: properties and applications. J. Statist. Comput. Simul. 81, 1405–1416 (2011)
8. Hannan, E.J., Quinn, B.G.: The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41, 190–195 (1979)
9. Hurvich, C.M., Tsai, C.: Regression and time series model selection in small samples. Biometrika 76, 297–307 (1989)
10. Jiménez-Gamero, M.D., Batsidis, A.: Minimum distance estimators for count data based on the probability generating function with applications. Metrika 80, 503–545 (2017)
11. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Statist. 22, 79–86 (1951)
12. Marques, M., Pérez-Abreu, V.: Law of large numbers and central limit theorem for the empirical probability generating function of stationary random sequences and processes. Aport. Mat. Notas Invest. 4, 100–109 (1989)
13. Mattheou, K., Lee, S., Karagrigoriou, A.: A model selection criterion based on the BHHJ measure of divergence. J. Statist. Plan. Infer. 139, 228–235 (2009)
14. Nakamura, M., Pérez-Abreu, V.: Empirical probability generating function: an overview. Insur. Math. Econom. 12, 287–295 (1993)
15. Ng, C.M., Ong, S.H., Srivastava, H.M.: Parameter estimation by Hellinger type distance for multivariate distributions based upon probability generating function. Appl. Math. Model. 37, 7374–7385 (2013)
16. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
17. Puig, P.: Characterizing additively closed discrete models by a property of their maximum likelihood estimators, with an application to generalized Hermite distributions. J. Amer. Statist. Assoc. 98(463), 687–692 (2003)
18. Schwarz, G.: Estimating the dimension of a model. Ann. Statist. 6(2), 461–464 (1978)
19. Sim, S.Z., Ong, S.H.: Parameter estimation for discrete distributions by generalized Hellinger-type divergence based on probability generating function. Commun. Statist. Simul. Comput. 39(2), 305–314 (2010)
20. Takeuchi, K.: Distribution of informational statistics and a criterion of model fitting. Suri-Kagaku (Math. Sci.) 153, 12–18 (1976)
21. Toma, A.: Model selection criteria using divergences. Entropy 16(5), 2686–2698 (2014)
22. Vuong, Q.H.: Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica 57, 257–306 (1989)

On the Choice of the Optimal Tuning Parameter in Robust One-Shot Device Testing Analysis

Elena Castilla and Pedro J. Chocano

Abstract During the last decade, considerable work has been carried out in one-shot device analysis and, in particular, in robust methods based on divergences, which improve the classical inference based on the maximum likelihood estimator (MLE) or the likelihood ratio test. The estimators and tests developed by this approach depend on a tuning parameter β. The choice of β is, however, one of the main drawbacks of this perspective. In this paper, given a data set, we study different methods for the choice of the "optimal" tuning parameter, including the iterative Warwick and Jones (IWJ) algorithm (Basak et al. [8]) and the minimization of some loss functions of the observed data. While the IWJ algorithm seems to be a good approach for low and moderate contamination, some simulations suggest that minimizing the mean absolute error of the observed probabilities is at least as efficient as the IWJ algorithm for high contamination, avoiding heavier computations.

1 Robust One-Shot Device Testing

One-shot devices are devices that are destroyed after use. These data, also known as current status data in survival analysis, are an extreme case of interval censoring, as we only know whether a device has failed before a specific inspection time, but not its exact failure time. As lifetimes of devices may be very long, we usually test them under accelerated life tests (ALTs), which shorten lifetimes by increasing some stress factors such as temperature, humidity, pressure, etc. After fitting a suitable model, results can be extrapolated to normal operating conditions and higher inspection times.

E. Castilla (B) · P. J. Chocano Departamento de Matemática Aplicada, Ciencia e Ingenieria de los Materiales y Tecnologia Electronica, Universidad Rey Juan Carlos, 28933 Madrid, Spain e-mail: [email protected] P. J. Chocano e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_16

169

170

E. Castilla and P. J. Chocano

Let us suppose that the devices are stratified into I testing conditions. In the i-th testing condition, K_i units are tested, with J types of stress factors being maintained at certain levels. Then, the conditions of those units are observed at pre-specified inspection times τ_i. In the i-th test group, the number of failures n_i is collected. In this setting, we consider that the density function is given by f(τ; x_i, θ), while the distribution function is given by F(τ; x_i, θ), where θ is the model parameter vector and x_i = (1, x_{i1}, ..., x_{iJ})^T is the vector of stresses associated to the test condition i (i = 1, ..., I). The reliability function is denoted by R(τ; x_i, θ) = 1 − F(τ; x_i, θ). Assuming independent observations, the likelihood function based on the observed data is given by

L(n_1, ..., n_I; θ) ∝ ∏_{i=1}^I F^{n_i}(τ_i; x_i, θ) (1 − F(τ_i; x_i, θ))^{K_i − n_i},   (1)

and the corresponding MLE of θ, θ̂, will be obtained by maximizing Eq. (1) or, equivalently, its logarithm, with respect to θ, that is to say, θ̂ = arg max_{θ ∈ Θ} log L(n_1, ..., n_I; θ).

During the last decade, substantial work has been carried out in one-shot device analysis, primarily motivated by the work of Fan et al. [14]. In terms of MLE, main efforts have been focused both on the development of EM algorithms and Bayesian approaches for estimating the parameters of the model assuming different lifetime distributions. See for instance [5, 6] or [7]. However, it is well known that the MLE is asymptotically efficient but non-robust. For this reason, robust estimators based on density power divergence (DPD) measures may be developed as an alternative to the classical MLE.

Let us consider the problem of one-shot device testing presented above. The empirical and theoretical probability vectors are given by

p̂_i = (p̂_{i1}, p̂_{i2})^T, i = 1, ..., I,   (2)

π_i(θ) = (π_{i1}(θ), π_{i2}(θ))^T, i = 1, ..., I,   (3)

respectively, with p̂_{i1} = n_i/K_i, p̂_{i2} = 1 − n_i/K_i, π_{i1}(θ) = F(τ_i; x_i, θ) and π_{i2}(θ) = R(τ_i; x_i, θ). The Kullback-Leibler divergence measure between p̂_i and π_i(θ) is given by

d_KL(p̂_i, π_i(θ)) = p̂_{i1} log( p̂_{i1} / π_{i1}(θ) ) + p̂_{i2} log( p̂_{i2} / π_{i2}(θ) )
 = (n_i/K_i) log( (n_i/K_i) / F(τ_i; x_i, θ) ) + (1 − n_i/K_i) log( (1 − n_i/K_i) / R(τ_i; x_i, θ) )




and the weighted Kullback-Leibler divergence measure is given by

(1/K) Σ_{i=1}^I K_i d_KL(p̂_i, π_i(θ)) = (1/K) Σ_{i=1}^I [ n_i log( (n_i/K_i) / F(τ_i; x_i, θ) ) + (K_i − n_i) log( (1 − n_i/K_i) / R(τ_i; x_i, θ) ) ],

where K = K_1 + ... + K_I is the total number of devices under test. See Pardo [18] for a complete introduction to the Kullback-Leibler divergence measure. It is straightforward to check that the MLE can be obtained as its minimization, i.e.,

θ̂ = arg min_{θ ∈ Θ} Σ_{i=1}^I (K_i/K) d_KL(p̂_i, π_i(θ)).   (4)
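Eq. (4) can be checked numerically on toy data: the minimiser of the weighted Kullback-Leibler divergence is the MLE. The sketch below assumes exponential lifetimes with a single log-rate parameter θ (so F(τ) = 1 − exp(−e^θ τ)); the counts and function names are illustrative, not from the paper.

```python
import numpy as np

def weighted_kl(theta, tau, K, n):
    """(1/K) * sum_i K_i * d_KL(p_hat_i, pi_i(theta)) for the toy model."""
    F = 1.0 - np.exp(-np.exp(theta) * tau)
    p = n / K
    # 0*log(0) terms do not arise here because 0 < p_i < 1 for this data set.
    return float(np.sum(K * (p * np.log(p / F)
                             + (1.0 - p) * np.log((1.0 - p) / (1.0 - F)))) / K.sum())

tau = np.array([2.0, 5.0, 8.0])      # inspection times
K = np.array([50.0, 50.0, 50.0])     # devices per group
n = np.array([9.0, 20.0, 28.0])      # observed failures
grid = np.linspace(-4.0, 0.0, 801)
theta_mle = float(grid[np.argmin([weighted_kl(th, tau, K, n) for th in grid])])
```

For these counts each group individually implies a failure rate near 0.10, so the minimiser lands near θ = log(0.10) ≈ −2.3.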

Based on this idea, we can now define the weighted minimum DPD estimators for the one-shot device model with multiple stresses (Basu et al. [10]).

Definition 1.1 Given the probability vectors p̂_i and π_i(θ) defined in (2) and (3), respectively, the DPD between the two probability vectors is given, as a function of a single tuning parameter β ≥ 0, by

d_β(p̂_i, π_i(θ)) = π_{i1}^{β+1}(θ) + π_{i2}^{β+1}(θ) − ((β+1)/β) [ p̂_{i1} π_{i1}^β(θ) + p̂_{i2} π_{i2}^β(θ) ] + (1/β) [ p̂_{i1}^{β+1} + p̂_{i2}^{β+1} ], if β > 0,   (5)

and d_β(p̂_i, π_i(θ)) = lim_{β→0+} d_β(p̂_i, π_i(θ)) = d_KL(p̂_i, π_i(θ)), if β = 0.

Definition 1.2 In the one-shot device testing problem, we can define the weighted minimum DPD estimator for θ as

θ̂_β = arg min_{θ ∈ Θ} Σ_{i=1}^I (K_i/K) d_β(p̂_i, π_i(θ)), if β > 0,

where d_β(p̂_i, π_i(θ)) is given in (5), and p̂_i and π_i(θ) are given in (2) and (3), respectively. For β = 0, we have the MLE θ̂ defined in (4).

Following these ideas, robust estimators and Wald-type tests have been developed for one-shot devices under different distributions, such as exponential [1, 3], gamma [2] or Weibull [4]. In all these papers, the estimating equations are computed and the asymptotic distribution of the estimated parameters is proved to follow a normal distribution with variance-covariance matrix depending on the choice of the


tuning parameter β. It is also proved that the influence function of these estimators is bounded for β > 0, but unbounded for β = 0, i.e., for the MLE. Monte Carlo simulations are provided so as to corroborate this point, illustrating a greater robustness (but also less efficiency) of those estimators with β > 0. Nevertheless, a common problem arises in all these studies: the choice of the tuning parameter.

In this paper, given a data set, we discuss different approaches for the choice of the tuning parameter in the context of one-shot device testing. Particularly, we consider exponential lifetimes, that is to say, we assume that the true lifetime follows an exponential distribution with unknown failure rate λ_i(θ) related to the stress factor x_i in loglinear form as λ_i(θ) = exp(x_i^T θ), where x_i = (x_{i0}, x_{i1}, ..., x_{iJ})^T and θ = (θ_0, θ_1, ..., θ_J)^T. The corresponding density function and distribution function are, respectively,

f(τ; x_i, θ) = λ_i(θ) exp{−λ_i(θ)τ} = exp(x_i^T θ) exp{−exp(x_i^T θ)τ}   (6)

and

F(τ; x_i, θ) = 1 − exp{−λ_i(θ)τ} = 1 − exp{−τ exp(x_i^T θ)}.   (7)
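Definition 1.1 and the estimator of Definition 1.2 can be sketched for this exponential model. The toy data, the single log-rate parameterisation and the helper names are ours; the β → 0 limit of the DPD recovers the Kullback-Leibler divergence, and β = 0 recovers the MLE.

```python
import numpy as np

def dpd(p, pi, beta):
    """d_beta of Eq. (5) between two 2-cell probability vectors; beta = 0 -> d_KL."""
    p, pi = np.asarray(p, float), np.asarray(pi, float)
    if beta == 0.0:
        return float(np.sum(p * np.log(p / pi)))
    return float(np.sum(pi ** (beta + 1.0)
                        - (beta + 1.0) / beta * p * pi ** beta
                        + p ** (beta + 1.0) / beta))

def min_dpd_estimate(tau, K, n, beta, grid):
    """Weighted minimum DPD estimator theta_hat_beta by grid search."""
    def objective(theta):
        F = 1.0 - np.exp(-np.exp(theta) * tau)
        return sum(Ki / K.sum() * dpd([ni / Ki, 1.0 - ni / Ki], [Fi, 1.0 - Fi], beta)
                   for Fi, Ki, ni in zip(F, K, n))
    return float(grid[np.argmin([objective(th) for th in grid])])

tau = np.array([2.0, 5.0, 8.0])
K = np.array([50.0, 50.0, 50.0])
n = np.array([9.0, 20.0, 28.0])
grid = np.linspace(-4.0, 0.0, 801)
theta_beta0 = min_dpd_estimate(tau, K, n, 0.0, grid)   # beta = 0: the MLE
theta_beta05 = min_dpd_estimate(tau, K, n, 0.5, grid)  # a robust alternative
```

On clean data the robust estimate stays close to the MLE, illustrating the efficiency/robustness trade-off discussed above.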

2 Methods to Choose the "Optimal" Tuning Parameter

2.1 Iterative Warwick and Jones Algorithm (IWJ)

A useful procedure for the data-based selection of β for the weighted minimum DPD estimator was proposed by Warwick and Jones [19] and applied in different contexts (see [11–13]). It consists of minimizing the estimated mean squared error (MSE) and requires a pilot estimator of the model parameters. We can adopt a similar approach in order to obtain a suitable data-driven β in our model. In this approach, we minimize an estimate of the asymptotic MSE of the weighted minimum DPD estimator θ̂_β, given by

MSE(β) = (θ̂_β − θ̂_pilot)^T (θ̂_β − θ̂_pilot) + trace{ V̂_β(θ̂_β) },

where θ̂_pilot is a pilot estimator, whose choice must be empirically discussed since the overall procedure depends on this choice. If we take θ̂_pilot = θ̂_β, the approach coincides with the one introduced by Hong and Kim [17], but it does not take the model misspecification into account. The asymptotic variance-covariance matrices of the minimum DPD estimators under different lifetime distributions were obtained in the


cited papers [1–4]. However, as pointed out by Basu et al. [9], when dealing with the robustness issue, the estimation of the variance component should not assume the model to be true, and this variance should be modified. Therefore, following the general formulation of Ghosh and Basu [15, 16], we have the following result.

Proposition 2.1 Let us consider the one-shot device model with multiple stress factors. The model-robust estimate of the asymptotic variance-covariance matrix of the model parameters, based on the DPD with tuning parameter β, is given by

V̂_β(θ) = (1/K) Ĵ_β(θ)^{-1} K̂_β(θ) Ĵ_β(θ)^{-1},   (8)

where

Ĵ_β(θ) = (β+1) Σ_{i=1}^I Σ_{j=1}^2 (K_i/K) u_{ij}(θ) u_{ij}^T(θ) π_{ij}^{β+1}(θ)
 + Σ_{i=1}^I Σ_{j=1}^2 (K_i/K) ( π_{ij}(θ) − n_{ij}/K_i ) [ β u_{ij}(θ) u_{ij}^T(θ) − ∂u_{ij}(θ)/∂θ ] π_{ij}^β(θ),

K̂_β(θ) = Σ_{i=1}^I Σ_{j=1}^2 (K_i/K) u_{ij}(θ) u_{ij}^T(θ) (n_{ij}/K_i) π_{ij}^{2β}(θ) − Σ_{i=1}^I (K_i/K) ξ*_{i,β}(θ) ξ*_{i,β}^T(θ),

with

ξ*_{i,β}(θ) = Σ_{j=1}^2 (n_{ij}/K_i) u_{ij}(θ) π_{ij}^β(θ),

u_{ij}(θ) = ∂ log π_{ij}(θ)/∂θ = (1/π_{ij}(θ)) ∂π_{ij}(θ)/∂θ,

n_{i1} = n_i and n_{i2} = K_i − n_i.

Remark 2.1 Let us consider that the lifetimes of our model follow an exponential distribution. Then,

u_{ij}(θ) = (−1)^{j+1} (1/π_{ij}(θ)) f(τ_i; θ, x_i) τ_i x_i,

∂u_{ij}(θ)/∂θ = (−1)^{j+1} [ −(1/π_{ij}^2(θ)) f(τ_i; θ, x_i) τ_i + (1/π_{ij}(θ)) (1 − τ_i λ_i(θ)) ] f(τ_i; θ, x_i) τ_i x_i x_i^T,

from which (8) can be computed.


Unfortunately, the need for a pilot estimator becomes the major drawback of this procedure, as the final result will depend excessively on this choice, see [4, 12]. This problem was also highlighted recently in Basak et al. [8], where an iterative Warwick and Jones algorithm (IWJ algorithm) is proposed. The idea is the following: after an iteration of the algorithm, the estimator obtained is used as the new pilot estimator and the algorithm is applied again. The process continues until there is no further change in the pilot estimator (see Algorithm 1). As proved in the cited paper of Basak et al. [8], the final converged estimate is independent of the initial choice of the pilot and is denoted by β_opt.

Algorithm 1 Iterative Warwick and Jones algorithm (IWJ)
Step 1: Start with a robust pilot estimator, for example β_pilot = 0.5.
Step 2: In a grid of [0, 1], compute
  β_aux = arg min_{β_grid ∈ [0,1]} (θ̂_{β_grid} − θ̂_{β_pilot})^T (θ̂_{β_grid} − θ̂_{β_pilot}) + trace{ V̂_{β_grid}(θ̂_{β_grid}) }.
If β_aux = β_pilot, then β_opt = β_aux; else set β_pilot = β_aux and return to Step 2.
Return: β_opt
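The fixed-point logic of Algorithm 1 can be sketched as follows. Here `estimate` and `trace_var` are toy stand-ins for the fitted weighted minimum DPD estimator and trace{V̂_β}, not the paper's quantities; they are chosen so that, with no contamination and a variance that grows with β, the loop walks down to β = 0 (the MLE).

```python
import numpy as np

def iwj(estimate, trace_var, grid, beta_pilot=0.5, max_iter=100):
    """Iterate the Warwick-Jones MSE minimisation until the chosen beta
    stops changing (the fixed point of Algorithm 1)."""
    for _ in range(max_iter):
        pilot = estimate(beta_pilot)
        mse = [float(np.sum((estimate(b) - pilot) ** 2)) + trace_var(b)
               for b in grid]
        beta_aux = grid[int(np.argmin(mse))]
        if beta_aux == beta_pilot:          # fixed point reached
            return beta_aux
        beta_pilot = beta_aux
    return beta_pilot

grid = [round(0.05 * k, 2) for k in range(21)]      # 0.00, 0.05, ..., 1.00
estimate = lambda b: np.array([1.0 + 0.3 * b])      # toy: estimate drifts with beta
trace_var = lambda b: 0.05 * (1.0 + b)              # toy: variance grows with beta
beta_opt = iwj(estimate, trace_var, grid)
```

With these stand-ins the pilot moves 0.5 → 0.2 → 0.0 and stabilises, matching the stopping rule "no further change in the pilot".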

2.2 Other Methods

The IWJ method tries to minimize the estimated MSE of the model parameters. Other criteria may instead focus on minimizing an estimated error of the fitted probabilities with respect to the observed ones. In a grid of tuning parameters β, minimizing a loss function which relates both probabilities may lead to an alternative criterion. Here, we propose three different methods, whose performance will be discussed in the simulation study of Sect. 3.

1. Minimize the maximum of the absolute errors (see [3]),
   minAMAX(β) = min_β max_{i=1,...,I} | π_{i1}(θ_β) − n_i/K_i |.   (9)

2. Minimize the mean absolute error,
   minMAE(β) = min_β (1/I) Σ_{i=1}^I | π_{i1}(θ_β) − n_i/K_i |.   (10)

3. Minimize the median of the absolute errors,
   minAMED(β) = min_β median_{i=1,...,I} | π_{i1}(θ_β) − n_i/K_i |.   (11)
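The three rules (9)-(11) can be applied over a grid of already-fitted models. In the sketch below, `fits` maps each β to fitted probabilities π_{i1}(θ_β); the numbers are hypothetical, chosen only to show that the rules can disagree (the median rule tolerates one large group error that the max and mean rules penalise).

```python
import numpy as np

def choose_beta(fits, p_obs, rule):
    """Pick beta minimising one of the loss rules (9)-(11)."""
    def loss(beta):
        err = np.abs(fits[beta] - p_obs)
        return {"AMAX": float(err.max()),       # rule (9)
                "MAE":  float(err.mean()),      # rule (10)
                "AMED": float(np.median(err))}[rule]  # rule (11)
    return min(fits, key=loss)

p_obs = np.array([0.20, 0.42, 0.55])            # observed n_i / K_i
fits = {0.0: np.array([0.20, 0.42, 0.85]),      # hypothetical fitted pi_{i1}
        0.5: np.array([0.18, 0.45, 0.60]),
        1.0: np.array([0.30, 0.30, 0.30])}
```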

2.3 Choice of the "Optimal Method"

As commented previously, one-shot devices are usually tested under ALTs, which shorten lifetimes by increasing some stress factors. In practice, we are interested in the behavior of one-shot devices under normal operating conditions and larger times. This "good behavior" can be interpreted as a good estimation of the expected lifetime and the probability of success, that is to say, we are interested in:

1. Minimizing the error of the estimated reliability under normal operating conditions x_0 and inspection time τ_0. If we consider exponential lifetimes, the reliability is given by
   R(τ_0; x_0, θ) = 1 − F(τ_0; x_0, θ) = exp{ −τ_0 exp(x_0^T θ) }.

2. Minimizing the error of the estimated expected lifetime under normal operating conditions x_0. If we consider exponential lifetimes, the expected lifetime is given by
   E[T | x_0] = 1/λ_0 = exp(−x_0^T θ).
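Both target quantities are closed-form for exponential lifetimes. The sketch below uses the parameter values of Sect. 3 and assumes, as for the test conditions x_i, that x_0 carries a leading intercept; the function names are ours.

```python
import numpy as np

def reliability(tau0, x0, theta):
    """R(tau0; x0, theta) = exp(-tau0 * exp(x0^T theta))."""
    return float(np.exp(-tau0 * np.exp(x0 @ theta)))

def expected_lifetime(x0, theta):
    """E[T | x0] = 1 / lambda_0 = exp(-x0^T theta)."""
    return float(np.exp(-(x0 @ theta)))

theta = np.array([-6.5, 0.03, 0.03])     # parameter values used in Sect. 3
x0 = np.array([1.0, 25.0, 35.0])         # x_0 = (25, 35) with an assumed intercept
```

With these values λ_0 = exp(−4.7), so the expected lifetime is about 110 time units and the 30-unit reliability about 0.76.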

3 Numerical Results

In this section, we illustrate empirically the proposed methods for the choice of the optimal tuning parameter. Following the Monte Carlo simulation study carried out in [3], we consider devices having exponential lifetimes subjected to two types of stress factors, at two different conditions each: the first stress factor at levels 55 and 70, and the second one at levels 85 and 100. Devices are tested at three different inspection times IT = {2, 5, 8}, with a total of I = 12 testing conditions in a balanced scenario. The model is examined under (θ_0, θ_1, θ_2) = (−6.5, 0.03, 0.03) and at different degrees of contamination of the parameter θ_1 in the I-th testing condition, for different sample sizes K_i ∈ {25, 50, 100}. The contaminated parameter is given by

θ̃_1 = θ_1 (1 − α/2),

176

E. Castilla and P. J. Chocano

Fig. 1 Chosen tuning parameters with the IWJ algorithm, for high contamination (histogram of the frequencies of the chosen tuning parameters, K = 100)

where α ∈ [0, 1] is the degree of contamination represented on the x-axis of Figs. 2 and 3. Optimal tuning parameters under each condition are obtained with the different discussed methods in a grid of 100 elements, in S = 100 simulated samples. Means of the relative absolute errors of the reliability and the expected lifetime under normal operating conditions x_0 = (25, 35) and t_0 = 30 are computed and presented in Fig. 2. In Fig. 3, the number of times each method chooses the best tuning parameter is shown, both in terms of reliability and expected lifetime under normal operating conditions, represented by η(R_0) and η(E_0), respectively. Note that two different algorithms can lead to the choice of the same tuning parameter, which explains why the sum of these last values over the different algorithms is greater than S in some cases.

For low or moderate contamination, the IWJ algorithm presents the better estimations of both reliability and expected lifetime. In particular, it is the one which chooses the best tuning parameter for estimating the reliability in most of the cases. The tendency of the minAMAX criterion is similar to that of IWJ, but with worse results in general. When a high contamination is considered, minAMED and, above all, minMAE present the lowest errors, choosing the best tuning parameter most of the times, whereas the IWJ approach becomes less efficient. This is probably due to the fact that, in this set-up, IWJ tends not to choose very high tuning parameters, even when the contamination is clear (Fig. 1). This behaviour was also observed for the simple WJ algorithm applied to Weibull lifetimes (see Fig. 8 of [4]). Similar results have been obtained when θ_2 is contaminated, as well as for different sample sizes and scenarios, but are omitted here so as to avoid redundancy.

On the Choice of the Optimal Tuning Parameter …


Fig. 2 Relative errors of expected lifetimes and estimated reliabilities under normal operating conditions for different sample sizes and degrees of contamination


E. Castilla and P. J. Chocano

Fig. 3 Best method for estimating expected lifetimes and reliabilities under normal operating conditions for different sample sizes and degrees of contamination


Acknowledgements The authors would like to express their gratitude for the opportunity to contribute to this tribute to Leandro Pardo. Leandro is not only an excellent researcher, but also a magnificent man, for whom both authors have great affection and admiration. This research has been supported by Spanish Grants FPU-16/03104 (E. Castilla) and BES-2016-076669 (P. J. Chocano).

References

1. Balakrishnan, N., Castilla, E., Martín, N., Pardo, L.: Robust estimators and test statistics for one-shot device testing under the exponential distribution. IEEE Trans. Inform. Theor. 65(5), 3080–3096 (2019)
2. Balakrishnan, N., Castilla, E., Martín, N., Pardo, L.: Robust estimators for one-shot device testing data under gamma lifetime model with an application to a tumor toxicological data. Metrika 82(8), 991–1019 (2019)
3. Balakrishnan, N., Castilla, E., Martín, N., Pardo, L.: Robust inference for one-shot device testing data under exponential lifetime model with multiple stresses. Qual. Reliab. Engin. Intern. 36(6), 1916–1930 (2020)
4. Balakrishnan, N., Castilla, E., Martín, N., Pardo, L.: Robust inference for one-shot device testing data under Weibull lifetime model. IEEE Trans. Reliab. 69(3), 937–953 (2020)
5. Balakrishnan, N., Ling, M.-H.: Multiple-stress model for one-shot device testing data under exponential distribution. IEEE Trans. Reliab. 61(3), 809–821 (2012)
6. Balakrishnan, N., Ling, M.-H.: Gamma lifetimes and one-shot device testing analysis. Reliab. Engin. Syst. Saf. 126, 54–64 (2014)
7. Balakrishnan, N., Ling, M.-H., So, H.Y.: Accelerated Life Testing of One-shot Devices: Data Collection and Analysis, 1st edn. Wiley, New York (2021)
8. Basak, S., Basu, A., Jones, M.C.: On the ‘optimal’ density power divergence tuning parameter. J. Appl. Statist. 48(3), 536–556 (2020)
9. Basu, A., Ghosh, A., Mandal, A., Martín, N., Pardo, L.: A Wald-type test statistic for testing linear hypothesis in logistic regression models based on minimum density power divergence estimator. Elect. J. Statist. 11(2), 2741–2772 (2017)
10. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.: Robust and efficient estimation by minimising a density power divergence. Biometrika 85(3), 549–559 (1998)
11. Castilla, E., Ghosh, A., Martín, N., Pardo, L.: New robust statistical procedures for the polytomous logistic regression models. Biometrics 74(4), 1282–1291 (2018)
12. Castilla, E., Martín, N., Pardo, L.: Pseudo minimum phi-divergence estimator for the multinomial logistic regression model with complex sample design. AStA Adv. Statist. Anal. 102(3), 381–411 (2018)
13. Castilla, E., Martín, N., Pardo, L., Zografos, K.: Composite likelihood methods: Rao-type tests based on composite minimum density power divergence estimator. Statist. Pap. 62, 1003–1041 (2021)
14. Fan, T.H., Balakrishnan, N., Chang, C.C.: The Bayesian approach for highly reliable electro-explosive devices using one-shot device testing. J. Statist. Comput. Simul. 79(9), 1143–1154 (2009)
15. Ghosh, A., Basu, A.: Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression. Elect. J. Statist. 7, 2420–2456 (2013)
16. Ghosh, A., Basu, A.: Robust estimation for non-homogeneous data and the selection of the optimal tuning parameter: the density power divergence approach. J. Appl. Statist. 42(9), 2056–2072 (2015)


17. Hong, C., Kim, Y.: Automatic selection of the tuning parameter in the minimum density power divergence estimation. J. Korean Statist. Soc. 30(3), 453–465 (2001)
18. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall/CRC, Boca Raton (2006)
19. Warwick, J., Jones, M.: Choosing a robustness tuning parameter. J. Statist. Comput. Simul. 75(7), 581–588 (2005)

Optimal Spatial Prediction for Non-negative Spatial Processes Using a Phi-divergence Loss Function Noel Cressie, Alan R. Pearse, and David Gunawan

Abstract A major component of inference in spatial statistics is that of spatial prediction of an unknown value from an underlying spatial process, based on noisy measurements of the process taken at various locations in a spatial domain. The most commonly used predictor is the conditional expectation of the unknown value given the data, and its calculation is obtained from assumptions about the probability distribution of the process and the measurements of that process. The conditional expectation is unbiased and minimises the mean-squared prediction error, which can be interpreted as the unconditional risk based on the squared-error loss function. Cressie [4, p. 108] generalised this approach to other loss functions, to obtain spatial predictors that are optimal (i.e., that minimise the unconditional risk) but not necessarily unbiased. This chapter is concerned with spatial prediction of processes that take non-negative values, for which there is a class of loss functions obtained by adapting the phi-divergence goodness-of-fit statistics [6]. The important sub-class of power-divergence loss functions is featured, from which a new class of spatial predictors can be defined by choosing the predictor that minimises the corresponding unconditional risk. An application is given to spatial prediction of zinc concentrations in soil on a floodplain of the Meuse River in the Netherlands.

N. Cressie (B) · A. R. Pearse · D. Gunawan
NIASRA, School of Mathematics and Applied Statistics, University of Wollongong, Northfields Avenue, Wollongong, NSW 2522, Australia
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_17

1 Introduction

Spatial statistical analysis in a subset D of d-dimensional Euclidean space R^d is concerned, inter alia, with inference on values of a hidden random process,

$$ Y(\cdot) \equiv \{Y(s) : s \in D \subset \mathbb{R}^d\}, $$

based on spatial data Z ≡ (Z(s_1), …, Z(s_n))′ where, independently for i = 1, …, n,

$$ [Z(s_i) \mid Y(\cdot)] = [Z(s_i) \mid Y(s_i)]. $$

That is, measurements of the underlying process Y(·) are taken at known locations and independently from one another. From Bayes' Rule,

$$ [Y(s_0) \mid Z] \propto [Y(s_0), Z], \tag{1} $$

where the "constant" of proportionality depends on the data Z and ensures that the left-hand side of Eq. (1) integrates or sums to unity over all possible realisations of Y(s_0). The "square bracket" notation above is defined as follows: [A] is the probability distribution of A, [A, B] is the joint probability distribution of A and B, and [B | A] is the conditional probability distribution of B given A. These distributions are related through [B | A] = [A, B]/[A], which was used to obtain Eq. (1). The left-hand side of Eq. (1) is called the predictive distribution, and it is key to optimal spatial prediction. In what follows, we build a spatial statistical hierarchical model through the data model,

$$ [Z \mid Y(\cdot)] = \prod_{i=1}^{n} [Z(s_i) \mid Y(s_i)], \tag{2} $$

and the process model,

$$ [Y(\cdot)] \equiv [\{Y(s) : s \in D\}]. \tag{3} $$

This latter probability model is often assumed to be a Gaussian process, although for most of the chapter we simply assume that [Y (·)] is well defined. Indeed, our focus here is on non-negative spatial processes, which clearly are non-Gaussian. From Eqs. (2) and (3), the joint probability distribution is [Y (·), Z] = [Z | Y (·)][Y (·)], which generally depends on unknown parameters. Hence, an initial inference problem in spatial statistics is the estimation of those parameters. What is arguably the more important inference problem is the spatial prediction of unknown values of Y (·) at given locations or regions in D. Here we concentrate on the prediction of one value, Y (s0 ), at a given location s0 in D. (Spatial prediction of many values jointly and of averages over regions within D were considered by Cressie [4, Chap. 3].) Optimal spatial prediction of Y (s0 ) based on spatial data Z is a classical problem, for which there are a number of solutions that depend, inter alia, on the class of spatial predictors over which the optimal one is chosen. Kriging [10] is probably the most famous spatial predictor, where the class of predictors consists of linear functions of Z. In this chapter, we shall use the most general class, namely predictors that are measurable functions of Z. Since this includes linear functions, the optimal spatial predictors we give will be better than, or perform as well as, the kriging predictors.


Now we explain what the term "optimal" means in the preceding paragraph. Define the function,

$$ L_{SEL}(Y(s_0), \delta(Z)) \equiv (\delta(Z) - Y(s_0))^2, \tag{4} $$

which is the squared prediction error, but the notation deliberately represents it as the loss function, L_SEL, where "SEL" denotes squared-error loss. It is a function of two variables, namely the predictand Y(s_0) and the predictor δ(Z), which is a function of the data. Using terminology from statistical decision theory, we define the unconditional risk for this loss function as,

$$ E(L_{SEL}(Y(s_0), \delta(Z))) = E(\delta(Z) - Y(s_0))^2, \tag{5} $$

where the expectation is taken over the joint distribution [Y(s_0), Z]. (Correspondingly, the conditional risk is where the expectation is taken over the conditional distribution [Y(s_0) | Z].) Now, the right-hand side of Eq. (5) is commonly referred to as the mean-squared prediction error (MSPE), and an optimal spatial predictor, δ*(Z), is chosen here to minimise the MSPE. That is,

$$ \delta^* = \arg\inf_{\delta \in \mathcal{F}} E\left(L_{SEL}(Y(s_0), \delta(Z))\right) = \arg\inf_{\delta \in \mathcal{F}} E\left(L_{SEL}(Y(s_0), \delta(Z)) \mid Z\right), \tag{6} $$

since [Z] ≥ 0, and in Eq. (6) F is a well defined class of measurable functions of the data. When F is the class of all measurable functions, the solution to Eq. (6) is straightforward, as follows. First,

$$ E\left(L_{SEL}(Y(s_0), \delta(Z))\right) = E\left\{E\left((Y(s_0) - \delta(Z))^2 \mid Z\right)\right\}. $$

Using differential calculus, we solve for δ(Z) in the inner expectation:

$$ \frac{\partial}{\partial \delta(Z)} E\left((Y(s_0) - \delta(Z))^2 \mid Z\right) = -2\,E(Y(s_0) - \delta(Z) \mid Z) = 0. $$

Consequently, the optimal predictor is

$$ \delta^*(Z) = E(Y(s_0) \mid Z), \tag{7} $$

which is a predictor that is ubiquitous in Statistics. Furthermore, the minimised MSPE is the unconditional risk of using the optimal predictor given by Eq. (7):

$$ R_{SEL}(\delta^*) = E\left(Y(s_0) - \delta^*(Z)\right)^2 = E(\mathrm{var}(Y(s_0) \mid Z)), $$


after some derivation and using the fact that E(δ*(Z)) = E(E(Y(s_0) | Z)) = E(Y(s_0)). That is, δ*(Z) given by Eq. (7) is unbiased, and its MSPE is the expected predictive variance. In the case of kriging, F is the class of all linear functions of the data, so that δ*(Z) = a_0 + Σ_{i=1}^{n} a_i Z(s_i). Hence, the optimal kriging predictor is found by minimising the MSPE over {a_i : i = 0, …, n}, and the minimised MSPE is called the kriging variance (e.g., Cressie [4, Chap. 3]). There are several points to make about the optimal predictor in Eq. (7) that are not brought out in the notation. First, there are no restrictions on the class F of possible predictors, except that they have to be measurable functions of the data Z. Second, the optimal predictor δ*(Z) depends on the chosen spatial location s_0. Finally, and most importantly for this chapter, δ*(Z) is optimal for the choice of loss function, L_SEL. In what follows, the first and second points remain, but we look for optimal spatial predictors using other loss functions, in particular for predicting non-negative spatial processes. In Sect. 2, we take a decision-theoretic approach to optimal spatial prediction, we give a general definition, and then we discuss uncertainty quantification through the use of prediction intervals. In Sect. 3, a new class of loss functions is proposed for spatial processes that are non-negative, which we call the phi-divergence loss functions; in particular, we focus on a class we call the power-divergence loss functions. Section 4 shows how the same methodology can be adapted for spatial processes that could also take negative values but are bounded from below. An application of optimal spatial prediction based on the power-divergence loss functions is given in Sect. 5, where spatial samples of zinc concentrations from a floodplain of the Meuse River in the Netherlands are analysed. Section 6 contains a discussion and conclusions.

2 Decision-Theoretic Approach to Prediction

While the previous section sets up the notation and model behind optimal spatial prediction, this section takes a generic approach to predicting a random variable Y based on data Z, which could be a scalar or a vector. The key assumption here is that [Y | Z] depends on Z; that is, there is information in the data Z that can be used to predict the unknown Y. In the next section, we return to spatial prediction and, in the results given there, Y is replaced with Y(s_0), the hidden value of the process Y(·) at a given location s_0 ∈ D; and Z is replaced with spatial data Z that are noisy measurements of Y(·) at known spatial locations {s_1, …, s_n}. Cressie and Pardo [6] define a class of goodness-of-fit statistics via a class of divergence measures between two discrete probability distributions, which they called phi-divergences. Let {a_j : j = 1, …, k} and {b_j : j = 1, …, k} denote two discrete probability distributions, and suppose that φ(·) is a convex function on the real line. Then the phi-divergence measure between the two discrete probability distributions is defined as,

$$ D_\phi(\{a_j\}, \{b_j\}) \equiv \sum_{j=1}^{k} b_j\, \phi\!\left(\frac{a_j}{b_j}\right), \tag{8} $$

provided the function φ(·) satisfies φ(1) = 0, φ′(1) = 0, φ″(1) > 0, 0 · φ(0/0) = 0, and 0 · φ(p/0) = p lim_{u→∞} φ(u)/u. Note that divergence measures are not necessarily distance measures, so they are not necessarily symmetric in their arguments. A leading example given by Cressie and Pardo [6, 7] is the class of power-divergence measures, introduced by Read and Cressie [12, p. 93]. Let

$$ \phi_\lambda(x) = \frac{1}{\lambda(\lambda+1)} \left[ x^{\lambda+1} - x + \lambda(1 - x) \right], \tag{9} $$

where it is straightforward to show that {φ_λ(·) : −∞ < λ < ∞} satisfies the five conditions just below Eq. (8). The family given by Eq. (9) is defined for all λ ∈ (−∞, ∞), by taking the limits as λ → −1 and λ → 0. Then for λ ∈ (−∞, ∞), the power-divergence measure is

$$ D_{\phi_\lambda}(\{a_j\}, \{b_j\}) = \frac{1}{\lambda(\lambda+1)} \sum_{j=1}^{k} a_j \left[ \left(\frac{a_j}{b_j}\right)^{\lambda} - 1 \right], \tag{10} $$

since Σ_{j=1}^{k} a_j = Σ_{j=1}^{k} b_j = 1. In the context of goodness-of-fit testing in a contingency table with cells j = 1, …, k, let x_j be the number of observations in cell j, let n = Σ_{j=1}^{k} x_j be the total number of observations in all cells of the table, and let π_j be the hypothesised probability that an observation falls in cell j. Then the power-divergence goodness-of-fit statistics defined by Cressie and Read [8] are given by 2n D_{φ_λ}({x_j/n}, {π_j}), for λ ∈ (−∞, ∞). Decision theory for the prediction of Y with a predictor δ(Z) starts with a loss function, L(Y, δ(Z)), as specified in the introductory section. The unconditional risk is E(L(Y, δ(Z))), where the expectation is taken over the joint distribution [Y, Z]. A loss function has to satisfy L(·, ·) ≥ 0, L(Y, Y) = 0, and in any particular application the first moment, E(L(Y, δ(Z))), must exist [1, p. 3]. The predictor δ(·) is specified to belong to a class F which, in this chapter, we take to be the largest class possible; specifically, F is the set of all measurable functions of Z. Motivated by Eq. (6), in this more general setting the optimal predictor based on the loss function L(·, ·) is:

$$ \delta^*(Z) \equiv \arg\inf_{\delta \in \mathcal{F}} E(L(Y, \delta(Z)) \mid Z). \tag{11} $$

The idea in this section is to marry the studies of goodness-of-fit statistics and those of optimal prediction, by choosing a phi-function from Eq. (8), and then using it to define the phi-divergence loss function,

$$ L(Y, \delta; \phi) \equiv \delta \cdot \phi\!\left(\frac{Y}{\delta}\right); \quad Y \ge 0, \; \delta \ge 0, \tag{12} $$
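To make Eqs. (8)–(10) concrete, here is a minimal numerical sketch (our illustration, not from the chapter) of φ_λ and the resulting power-divergence measure, with the λ → 0 and λ → −1 limits coded explicitly:

```python
import math

def phi_lambda(x, lam):
    """phi_lambda(x) of Eq. (9), for x > 0; the lam -> 0 and lam -> -1
    limits are taken explicitly so the family covers all real lam."""
    if lam == 0:
        return x * math.log(x) - x + 1.0          # limit as lam -> 0
    if lam == -1:
        return -math.log(x) + x - 1.0             # limit as lam -> -1
    return (x ** (lam + 1) - x + lam * (1.0 - x)) / (lam * (lam + 1))

def power_divergence(a, b, lam):
    """D_{phi_lambda}({a_j}, {b_j}) = sum_j b_j * phi_lambda(a_j/b_j), Eq. (8),
    for two strictly positive discrete probability distributions."""
    return sum(bj * phi_lambda(aj / bj, lam) for aj, bj in zip(a, b))
```

One can check numerically that φ_λ(1) = 0 for every λ, that the two limit branches join continuously, and that the divergence vanishes when the two distributions coincide.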


and it is easy to see that Eq. (12) satisfies the conditions of a loss function [1, p. 3]. Since the units of Y and δ are the same, the loss function defined by Eq. (12) and its unconditional risk have the same units as Y . This is in contrast to L S E L given by Eq. (4), and the MSPE, whose units are the same as the units of Y 2 . The special case of substituting φλ given by Eq. (9) into Eq. (12) defines a new class of loss functions whose members we call the power-divergence loss functions, L λ (Y, δ) ≡ L(Y, δ; φλ ), for − ∞ < λ < ∞. Hence, from Eqs. (9) and (12),

$$ L_\lambda(Y, \delta) = \frac{Y}{\lambda(\lambda+1)} \left[ \left(\frac{Y}{\delta}\right)^{\lambda} - 1 \right] + \frac{\delta - Y}{\lambda + 1}; \quad \lambda \neq 0, -1, $$
$$ L_0(Y, \delta) = Y \log\!\left(\frac{Y}{\delta}\right) - (Y - \delta), $$
$$ L_{-1}(Y, \delta) = -\delta \log\!\left(\frac{Y}{\delta}\right) + (Y - \delta), \tag{13} $$

where recall that Y ≥ 0 and δ ≥ 0. For all real λ, these are convex differentiable functions of δ, L_λ(Y, δ) ≥ 0, L_λ(Y, Y) = 0, and the first derivative of L_λ with respect to δ, evaluated at δ = Y, is equal to 0. Notice that L_SEL(Y, δ) ≡ (δ − Y)² also has these properties, but it has different units and it is defined for all real Y and all real δ. Our interest in this chapter is in non-negative predictors, δ(Z), of non-negative predictands, Y. For the power-divergence loss function L_λ, we can derive the optimal predictor δ*_λ(Z) of Y from Eq. (11). Straightforwardly,

$$ \delta_\lambda^*(Z) = \left\{ E\left( Y^{\lambda+1} \mid Z \right) \right\}^{1/(\lambda+1)}; \quad \lambda \neq 0, -1, $$
$$ \delta_0^*(Z) = E(Y \mid Z), $$
$$ \delta_{-1}^*(Z) = \exp(E(\log(Y) \mid Z)), \tag{14} $$

which we call the optimal power-divergence (OPD) predictors. Read and Cressie [12, Sect. 8.4] obtain this result for inference on a parameter θ in a Bayesian (but non-hierarchical) model. Notice that over all λ ∈ (−∞, ∞), the only predictor given by Eq. (14) that is unbiased is δ*_0(Z) (i.e., where λ = 0). From Jensen's inequality, δ*_λ(Z) over-predicts for λ > 0, and it under-predicts for λ < 0. Unbiasedness is only one of a number of properties of a predictor that might be considered. More importantly, a statistical property called validity addresses behaviour beyond that of the predictor's first moment. A prediction interval, (a^α(Z), b^α(Z)) for predicting Y, is a valid (1 − α) × 100% unconditional prediction interval if

$$ \Pr\left( a^\alpha(Z) < Y < b^\alpha(Z) \right) = 1 - \alpha, \tag{15} $$


where the probability Pr(·) is with respect to the joint distribution [Y, Z]. Following the approach in Cressie [4, pp. 107–108], prediction intervals can be constructed using loss functions. For a (1 − α) × 100% unconditional prediction interval for Y, consider the set {(Y, Z) : L_λ(Y, δ*_λ(Z)) < K^α_λ}, where K^α_λ is a cut-off chosen so that

$$ \Pr\left( L_\lambda(Y, \delta_\lambda^*(Z)) < K_\lambda^\alpha \right) = 1 - \alpha. \tag{16} $$

Using the general expression above, it is not difficult to show that the (1 − α) × 100% unconditional prediction interval for Y under power-divergence loss is the set,

$$ \left\{ Y : Y^{\lambda+1} - (\lambda+1)\, \delta_\lambda^*(Z)^{\lambda}\, Y - \lambda\, \delta_\lambda^*(Z)^{\lambda} \left( (\lambda+1) K_\lambda^\alpha - \delta_\lambda^*(Z) \right) < 0 \right\}. \tag{17} $$

Now, we have already seen that L_λ(Y, δ) in Eq. (13) is a convex function of δ; it is easy to see that it is also a convex function of Y with a minimum value L(δ, δ) = 0 at Y = δ. Hence, the interval (a^α(Z), b^α(Z)) in Eq. (15) is defined by the roots of the function of Y in Eq. (17) and contains the optimal predictor, δ*_λ(Z); see Eq. (14) et seq. Since Pr(Y < 0) = 0, the lower limit a^α_λ(Z) may, on occasions, take the value 0. To obtain K^α_λ via classical distribution theory, we need to know the distribution of L_λ(Y, δ*_λ(Z)), where both Y and Z vary randomly. A computational solution is available by simulating from [Y, Z] and using the empirical distribution of L_λ(Y, δ*_λ(Z)) to obtain K^α_λ, up to Monte Carlo error. The next section gives the details for spatial prediction, where the results for the prediction of Y given by Eqs. (14) and (16) will be adapted to the spatial prediction of Y(s_0), with spatial predictor δ(Z) that is a function of the spatial data Z.
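The computational solution just described can be sketched in a few lines (our illustration, not the chapter's code): approximate the OPD predictor of Eq. (14) from draws of [Y | Z], evaluate the power-divergence loss of Eq. (13), and take the empirical (1 − α) quantile over joint draws of (Y, Z) as K^α_λ:

```python
import numpy as np

def opd_predictor(y_samples, lam):
    """Optimal power-divergence predictor of Eq. (14), approximated from
    positive Monte Carlo draws of [Y | Z]."""
    y = np.asarray(y_samples, dtype=float)
    if lam == 0:
        return float(y.mean())                     # the unbiased case
    if lam == -1:
        return float(np.exp(np.mean(np.log(y))))   # geometric mean of the draws
    return float(np.mean(y ** (lam + 1)) ** (1.0 / (lam + 1)))

def pd_loss(y, d, lam):
    """Power-divergence loss L_lambda(Y, delta) of Eq. (13), vectorised;
    y and d must be positive."""
    y, d = np.asarray(y, dtype=float), np.asarray(d, dtype=float)
    if lam == 0:
        return y * np.log(y / d) - (y - d)
    if lam == -1:
        return -d * np.log(y / d) + (y - d)
    return y * ((y / d) ** lam - 1.0) / (lam * (lam + 1)) + (d - y) / (lam + 1)

def mc_cutoff(y_draws, delta_draws, lam, alpha=0.05):
    """Empirical (1 - alpha) quantile of L_lambda(Y, delta*(Z)) over joint
    draws of (Y, Z): an approximation to K_lambda^alpha, up to Monte Carlo
    error."""
    return float(np.quantile(pd_loss(y_draws, delta_draws, lam), 1.0 - alpha))
```

Consistent with the Jensen's-inequality remark above, `opd_predictor` increases with λ on any non-degenerate set of draws.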

3 Decision-Theoretic Approach to Spatial Prediction of a Non-negative Spatial Process

The discussion of generic prediction given in the previous section applies with almost no change to spatial prediction. Replace Y with the hidden process value Y(s_0) and Z with the spatial data Z. Generalising Eq. (6), the generic loss function, L(Y(s_0), δ(Z)), is used to define an optimal spatial predictor as follows:

$$ \delta^*(Z) \equiv \arg\inf_{\delta \in \mathcal{F}} E(L(Y(s_0), \delta(Z)) \mid Z). \tag{18} $$

Examples of loss functions include squared-error loss (i.e., L_SEL(Y(s_0), δ(Z)) = (δ(Z) − Y(s_0))²), absolute-error loss (i.e., L_AEL(Y(s_0), δ(Z)) = |δ(Z) − Y(s_0)|), weighted squared-error loss (i.e., L_WSL(Y(s_0), δ(Z)) = w(Y(s_0))(δ(Z) − Y(s_0))² for weight function w(Y(s_0)) > 0), and the linear exponential (linex) loss function that is deliberately asymmetric [14]:

$$ L_{LNX}(Y(s_0), \delta(Z)) = v \cdot \left\{ \exp(a(\delta(Z) - Y(s_0))) - a(\delta(Z) - Y(s_0)) - 1 \right\}, \tag{19} $$

where −∞ < a < ∞ and v > 0. All these loss functions are defined for Y(s_0) and δ(Z) potentially taking negative values. Our focus in this chapter is on spatial prediction for a non-negative spatial process. For Y(s_0) ≥ 0 and predictor δ(Z) ≥ 0, we use Eq. (12) to define the phi-divergence loss function as,

$$ L(Y(s_0), \delta(Z); \phi) \equiv \delta(Z) \cdot \phi\!\left( \frac{Y(s_0)}{\delta(Z)} \right). $$

We have seen that the power-divergence loss function given by Eq. (13), with φ = φ_λ, is an important special case. From Eq. (14), we obtain the following optimal power-divergence (OPD) spatial predictors for Y(s_0) ≥ 0:

$$ \delta_\lambda^*(Z) = \left\{ E\left( Y(s_0)^{\lambda+1} \mid Z \right) \right\}^{1/(\lambda+1)}; \quad \lambda \neq 0, -1, $$
$$ \delta_0^*(Z) = E(Y(s_0) \mid Z), $$
$$ \delta_{-1}^*(Z) = \exp(E(\log(Y(s_0)) \mid Z)). \tag{20} $$

These were derived by minimising the unconditional risk under the power-divergence loss function, which we denote by

$$ R_\lambda(\delta_\lambda^*) \equiv E\left( L_\lambda(Y(s_0), \delta_\lambda^*(Z)) \right), \tag{21} $$

where the expectation is taken over [Y(s_0), Z]. The same result is obtained by minimising the conditional risk, E(L_λ(Y(s_0), δ*_λ(Z)) | Z), as in Eqs. (6) and (11). As discussed in Sect. 2, the OPD spatial predictors in Eq. (20) are biased, except when λ = 0. The bias of the OPD spatial predictor is defined as,

$$ \mathrm{Bias}\left( \delta_\lambda^* \right) \equiv E\left( \delta_\lambda^*(Z) \right) - E(Y(s_0)), \tag{22} $$

which, from Eq. (20) and Jensen's inequality, is negative for λ < 0 and positive for λ > 0. Furthermore, Bias(δ*_λ) is a monotonically increasing function of λ, passing through the point (λ, Bias(δ*_λ)) = (0, 0). Valid (1 − α) × 100% unconditional prediction intervals for Y(s_0) follow from Eqs. (15) and (17). Specifically, for spatial predictand Y(s_0), OPD spatial predictor δ*_λ(Z), and cut-off K^α_λ, the (1 − α) × 100% unconditional prediction interval for Y(s_0) is the set,

$$ \left\{ Y(s_0) : Y(s_0)^{\lambda+1} - (\lambda+1)\, \delta_\lambda^*(Z)^{\lambda}\, Y(s_0) - \lambda\, \delta_\lambda^*(Z)^{\lambda} \left( (\lambda+1) K_\lambda^\alpha - \delta_\lambda^*(Z) \right) < 0 \right\}. \tag{23} $$


The cut-off K^α_λ and the resulting (1 − α) × 100% unconditional prediction interval for Y(s_0) are given by equations analogous to Eqs. (15) and (16) and can be obtained directly from simulations of [Y(s_0), Z]. Let Y ≡ (Y(s_1), …, Y(s_n))′ denote the process vector hidden behind the observations Z. Then the joint distribution is [Y(s_0), Y, Z] = [Z | Y][Y(s_0), Y], where the second term reduces to [Y] if s_0 ∈ {s_1, …, s_n}. In the first step, simulate values of Y(·) at {s_0, s_1, …, s_n}; in the second step, simulate Z from the data model, [Z | Y] = Π_{i=1}^{n} [Z(s_i) | Y(s_i)]. Finally, keep just the simulations of Y(s_0) and Z, which can be used to obtain simulations of L_λ(Y(s_0), δ*_λ(Z)) that allow K^α_λ to be determined through numerical root-finding methods.
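These two simulation steps can be sketched with a toy one-dimensional example (our construction: a zero-mean Gaussian process with exponential covariance stands in for the process model; in the chapter's setting, Y(·) would then be transformed so that it is non-negative):

```python
import numpy as np

def simulate_joint(s, s0, n_sims, sigma2=1.0, range_=1.0, sigma_eps2=0.1, seed=0):
    """Step 1: simulate Y(.) jointly at {s0, s1, ..., sn} from a toy process
    model (zero-mean GP, exponential covariance). Step 2: simulate Z from the
    data model Z(si) = Y(si) + noise. Returns draws of (Y(s0), Z)."""
    rng = np.random.default_rng(seed)
    locs = np.concatenate(([s0], np.asarray(s, dtype=float)))
    dists = np.abs(locs[:, None] - locs[None, :])
    cov = sigma2 * np.exp(-dists / range_)                    # process model [Y(.)]
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(locs)))
    y = chol @ rng.standard_normal((len(locs), n_sims))       # step 1
    z = y[1:] + np.sqrt(sigma_eps2) * rng.standard_normal(y[1:].shape)  # step 2
    return y[0], z.T   # keep just the simulations of Y(s0) and Z, as in the text
```

Each column of the first-step simulation is one realisation of the process at all locations jointly, so the retained pairs (Y(s0), Z) carry the spatial dependence needed for the empirical distribution of the loss.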

4 Extensions to Spatial Processes Bounded from Below

So far, the spatial process Y(·) is bounded below by 0, but in this section we show that the results are generalisable to the case where Y(·) is bounded below by −κ, where κ ≥ 0. In that case, Y(·) can take negative values, but it is always bounded from below over the spatial domain D. It is also possible to generalise everything in this section to κ < 0, which may be of interest when Y(·) is only defined above a positive threshold. The spatial-prediction problem is still to predict Y(s_0) with δ(Z), but now we use the loss function,

$$ L_{\lambda,\kappa}(Y(s_0), \delta(Z)) \equiv L_\lambda(Y(s_0) + \kappa, \delta(Z) + \kappa), \tag{24} $$

where L_λ is defined by Eq. (13). Then the same calculations that led to Eqs. (14) and (20) yield the optimal spatial predictor, δ*_{λ,κ}(Z), given by

$$ \delta_{\lambda,\kappa}^*(Z) = \left\{ E\left( (Y(s_0) + \kappa)^{\lambda+1} \mid Z \right) \right\}^{1/(\lambda+1)} - \kappa; \quad \lambda \neq 0, -1, $$
$$ \delta_{0,\kappa}^*(Z) = E(Y(s_0) \mid Z), $$
$$ \delta_{-1,\kappa}^*(Z) = \exp\{ E(\log(Y(s_0) + \kappa) \mid Z) \} - \kappa. \tag{25} $$

Substituting κ = 0 into Eq. (25) yields Eq. (20), the OPD spatial predictor δ*_λ(Z) of Y(s_0) ≥ 0, for a given location s_0 ∈ D. As we saw in the previous section, valid (1 − α) × 100% unconditional prediction intervals for Y(s_0) are easily obtained by adapting Eq. (16) to the spatial setting. Specifically, for a given λ and κ, we choose K^α_{λ,κ} so that

$$ \Pr\left( L_{\lambda,\kappa}(Y(s_0), \delta_{\lambda,\kappa}^*(Z)) < K_{\lambda,\kappa}^\alpha \right) = 1 - \alpha. $$

The cut-off K^α_{λ,κ} and, hence, the (1 − α) × 100% unconditional prediction interval, (a^α_{λ,κ}(Z), b^α_{λ,κ}(Z)), for predicting Y(s_0), can then be obtained by simulation, as described in the previous section.
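A hedged Monte Carlo sketch of Eq. (25) (our illustration, not the chapter's code): shift the predictive draws by κ, apply the OPD recipe of Eq. (14), and shift back:

```python
import numpy as np

def shifted_opd_predictor(y_samples, lam, kappa):
    """Optimal predictor under L_{lambda,kappa}, Eq. (25), approximated from
    draws of [Y(s0) | Z] for a process bounded below by -kappa."""
    y = np.asarray(y_samples, dtype=float) + kappa   # shift so values are positive
    if lam == 0:
        pred = y.mean()
    elif lam == -1:
        pred = np.exp(np.mean(np.log(y)))
    else:
        pred = np.mean(y ** (lam + 1)) ** (1.0 / (lam + 1))
    return float(pred - kappa)                       # shift back
```

Setting κ = 0 recovers the OPD predictors of Eq. (20), and the λ = 0 case is shift-invariant, matching δ*_{0,κ}(Z) = E(Y(s_0) | Z).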


Before closing this section, we show that the extended loss function, L_{λ,κ}(Y(s_0), δ(Z)), contains as a limiting case the linex loss function, L_{LNX}(Y(s_0), δ(Z)), given by Eq. (19). The optimal linex spatial predictor is obtained by minimising the unconditional risk and is easily seen to be [4, p. 108]:

$$ \delta_{LNX}^*(Z) = -\frac{1}{a} \log\left( E\left( \exp\{-a Y(s_0)\} \mid Z \right) \right), \tag{26} $$

where −∞ < a < ∞, and the units of a are the same as the units of Y(s_0)^{−1}. Notice that δ*_{LNX}(Z) in Eq. (26) is different from δ*_{−1}(Z) in Eq. (20), where the roles of exp(·) and log(·) are reversed. For ease of presentation here, we revert to the generic prediction problem in the previous section, where the predictand is Y, the data are represented by Z, and we abbreviate δ(Z) to simply δ. We first note that loss functions are non-negative. In the case of Eq. (24), namely,

$$ L_{\lambda,\kappa}(Y, \delta) = \frac{1}{\lambda(\lambda+1)} \left[ \frac{(Y + \kappa)^{\lambda+1}}{(\delta + \kappa)^{\lambda}} - (Y + \kappa) + \lambda(\delta - Y) \right], \tag{27} $$

this is the case and, as it should be, the loss is 0 when δ = Y; that is, L_{λ,κ}(Y, Y) = 0. In what follows, we put λ = −aκ in Eq. (27), where a ≠ 0 is fixed, κ has the same units as those of Y, and κ → ∞. Hence, λ → ∞ if a < 0, and λ → −∞ if a > 0, which is a well defined requirement since λ can take values anywhere on the real line. Now consider the case where a < 0, so that λ → ∞ as κ → ∞. Further, note that for v > 0, the optimal predictor obtained from L_{λ,κ}(Y, δ) is exactly the same as the optimal predictor obtained from v · L_{λ,κ}(Y, δ). Hence, write

$$ (\lambda + 1) L_{\lambda,\kappa}(Y, \delta) = \frac{(Y + \kappa)}{-a\kappa} \left[ \frac{(1 + Y/\kappa)^{-a\kappa}}{(1 + \delta/\kappa)^{-a\kappa}} - 1 - \frac{a\,(\delta - Y)}{1 + Y/\kappa} \right]. $$

Let κ → ∞ on the right-hand side and, since a < 0, we see that λ → ∞. Then

$$ \lim_{\kappa \to \infty} (\lambda + 1) L_{\lambda,\kappa}(Y, \delta) = -\frac{1}{a} \left[ \frac{\exp(-aY)}{\exp(-a\delta)} - a(\delta - Y) - 1 \right] \propto \exp(a(\delta - Y)) - a(\delta - Y) - 1 \propto L_{LNX}(Y, \delta), $$

where the constants of proportionality are non-negative. Consequently, when λ and κ are both large and non-negative, such that λ/κ = −a > 0, the optimal shifted-power-divergence spatial predictor given by Eq. (25) approximates the optimal linex spatial predictor given by Eq. (26). The case where a > 0, and hence λ → −∞ as κ → ∞, yields the same result. Recall that the spatial process Y(·) is bounded below by −κ. Here, κ → ∞ so, in the limit, Y(·) becomes unbounded. The generalisation in this section defines


loss functions where both Y(s_0) and δ(Z) can take large negative values or large positive values, which is consistent with L_{LNX} being a limiting case in the class {L_{λ,κ} : −∞ < λ < ∞, κ ≥ 0}.
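The limiting argument above is easy to verify numerically (a sketch under this section's assumptions; the power ratio in Eq. (27) is evaluated in log space to avoid overflow for large λ):

```python
import numpy as np

def pd_shifted_loss(y, d, lam, kappa):
    """L_{lambda,kappa}(Y, delta) of Eq. (27), for lam not in {0, -1};
    the ratio (Y+kappa)^{lam+1}/(delta+kappa)^lam is computed via logs."""
    ratio = (y + kappa) * np.exp(lam * (np.log(y + kappa) - np.log(d + kappa)))
    return (ratio - (y + kappa) + lam * (d - y)) / (lam * (lam + 1.0))

def linex_shape(y, d, a):
    """exp(a(delta - Y)) - a(delta - Y) - 1, the shape of L_LNX with v = 1."""
    u = a * (d - y)
    return np.exp(u) - u - 1.0
```

For example, with a = −0.5, Y = 2 and δ = 3, taking κ = 10⁶ and λ = −aκ makes (λ + 1)L_{λ,κ}(Y, δ) agree with (−1/a) · linex_shape(Y, δ, a) to well within a tenth of a percent.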

5 Spatial Prediction of Zinc Pollution on a Floodplain of the Meuse River

In this section, the OPD spatial predictors and associated prediction intervals are applied to the problem of predicting zinc concentrations in a floodplain of the Meuse River in the Netherlands. Zinc is a heavy metal that is harmful to plants, soil invertebrates, and other species when they are exposed to it at high levels [9]. The dataset, henceforth referred to as the Meuse River data, is a set of observations of soil concentrations of cadmium, copper, lead, and zinc, along with covariates (including distance to the river, Easting, Northing, soil type, and historical flooding frequency) at 155 survey sites to the west of the town of Stein, Netherlands [11, 13]. Figure 1 shows the boundaries of the spatial region D given by the area enclosed by the river and a canal running around the town of Stein, Netherlands, and the sampling locations therein. The Meuse River data also include a regular grid of 3,103 prediction locations with geographic covariates available that are the same as those for the sampling locations. Of these prediction locations, 15 were chosen (some deliberately and some at random) as sites where OPD-spatial-prediction properties were analysed. Of these 15 locations, five were deliberately chosen so that four were placed at the extremes of the spatial domain and one was placed in the middle. The remaining 10 locations were chosen at random in D. The data locations and prediction locations are shown in Fig. 1.

The methodological development in this chapter assumes a data model and a process model, from which the predictive distribution, [Y(s_0) | Z], is obtained from Eq. (1). Recall that Y(s_0) ≥ 0, which we achieve by modelling Y(·) as the fourth power of a Gaussian process (a choice based on exploratory data analysis; see below). Define the Gaussian spatial process,

$$ W(s) \equiv x(s)' \beta + \xi(s); \quad s \in D, \tag{28} $$

where x(·) is an eight-dimensional vector of covariates obtained after model selection to include geographic terms (distance to river and Easting), soil-type terms (calcareous meadow soils, non-calcareous meadow soils, or red brick soil), and flooding-frequency terms (once every two years, once every ten years, or once every 50 years); ξ(·) is a zero-mean Gaussian process with a stationary, isotropic covariance function; and the parameters of the process model [Y(·)], where Y(·) ≡ W(·)⁴, are β (mean function) and θ (covariance function). Exploratory data analysis based on transforming the Meuse River data with different powers resulted in taking a fourth-root transformation of the zinc concentrations.


Fig. 1 Map of the 155 survey sites from Rikken and Van Rijn [13] (white triangles) and 15 prediction locations, comprising five that were deliberately chosen (orange circles) and 10 that were chosen at random (smaller yellow circles). The spatial domain D considered here is the area enclosed by the serpentine Meuse River and a canal running around the town of Stein, Netherlands. Aerial image products: Kadaster/Beeldmateriaal.nl, CC BY 4.0

Fig. 2 A Gaussian Q-Q plot of the standardised residuals from a linear model in which the fourth root of the zinc concentrations has been modelled as a function of geographic, soil-type, and flooding-frequency terms

Hence, the data model is defined in terms of Z̃ ≡ (Z(s_1)^{1/4}, …, Z(s_n)^{1/4})′. Now, [Z̃ | Y(·)] is obtained from

$$ \tilde{Z}(s_i) = W(s_i) + \varepsilon_i; \quad i = 1, \ldots, n, \tag{29} $$

where {ε_1, …, ε_n} are independent and identically distributed Gau(0, σ_ε²) random variables that are independent of W(·). That is, [Z̃ | W(·)] is Gaussian with mean vector (W(s_1), …, W(s_n))′ and covariance matrix σ_ε² I_n, where I_n is the n × n identity matrix and σ_ε² is the measurement-error variance. Equation (28) substituted into Eq. (29) yields a trans-Gaussian model for Y(·) [4, pp. 135–138]. After fitting x(·)′β̂ to Z̃, we obtained residuals whose standardised quantiles are plotted against Gau(0, 1) quantiles in the Q-Q plot shown in Fig. 2.

Optimal Spatial Prediction for Non-negative Spatial Processes …


According to Bayes' rule, the data model given by Eq. (29), combined with the process model given by Eq. (28), yields the predictive distribution $[W(s_0) \mid \tilde{Z}]$, which is Gaussian with mean and variance,

$$E(W(s_0) \mid \tilde{Z}) = x(s_0)'\beta + c_W(s_0)'\, \Sigma_{\tilde{Z}}^{-1} (\tilde{Z} - X\beta),$$
$$\mathrm{var}(W(s_0) \mid \tilde{Z}) = \sigma_W^2 - c_W(s_0)'\, \Sigma_{\tilde{Z}}^{-1}\, c_W(s_0).$$

Here, $C_W(\|h\|; \theta) \equiv \mathrm{cov}(W(s+h), W(s))$ is the stationary isotropic covariance function for $W(\cdot)$; $c_W(s_0) \equiv (C_W(\|s_1 - s_0\|; \theta), \ldots, C_W(\|s_n - s_0\|; \theta))'$; $\sigma_W^2 \equiv C_W(0; \theta)$; $\Sigma_W$ is an $n \times n$ covariance matrix with $(i,j)$-th entry equal to $C_W(\|s_i - s_j\|; \theta)$; and $\Sigma_{\tilde{Z}} \equiv \Sigma_W + \sigma_\varepsilon^2 I_n$. Note that, in practice, estimates for the parameters β, θ, and $\sigma_\varepsilon^2$ are substituted into the formulae for the conditional mean and the conditional variance given just above. From $M$ conditional simulations $\{W(s_0)_1, \ldots, W(s_0)_M\}$ simulated from $[W(s_0) \mid \tilde{Z}]$, conditional simulations from $[Y(s_0) \mid \tilde{Z}]$ are obtained simply as follows:

$$\{Y(s_0)_1, \ldots, Y(s_0)_M\} \equiv \{W(s_0)_1^4, \ldots, W(s_0)_M^4\}, \qquad (30)$$

and these are used to obtain OPD spatial predictors given by Eq. (20). Here, we chose $M = 10{,}000$.

Since the primary goal of this section is to illustrate OPD spatial prediction, we only give a brief description of the estimation of the parameters β and θ. After transforming the data to $\tilde{Z}$ via the fourth-root transformation, the ordinary-least-squares estimate, $\hat{\beta}_{OLS}$, of β in Eq. (29) resulted in the residuals,

$$r(s_i) \equiv \tilde{Z}(s_i) - x(s_i)'\hat{\beta}_{OLS}; \quad i = 1, \ldots, n.$$

From these detrended data $\{r(s_i) : i = 1, \ldots, n\}$, an empirical semivariogram was computed, and a spherical semivariogram model, $\gamma_W(\cdot; \theta) \equiv C_W(0; \theta) - C_W(\cdot; \theta)$, was fitted to it to yield an estimate $\hat{\theta}$ of θ (e.g., see Cressie [4, Sect. 3.4.3]). Figure 3 shows the empirical semivariogram (estimated robustly according to the method of Cressie and Hawkins [5]) and the fitted spherical semivariogram, $\gamma_W(h; \hat{\theta})$ (fitted by weighted least squares, using the weights proposed by Cressie [3]). Now, returning to the original measure of spatial dependence, the fitted covariance function is

$$C_W(\cdot; \hat{\theta}) = \gamma_W(\infty; \hat{\theta}) - \gamma_W(\cdot; \hat{\theta}). \qquad (31)$$

This allows a generalised-least-squares estimator, $\hat{\beta}_{GLS}$, of β to be obtained:

$$\hat{\beta}_{GLS} \equiv \left(X' \hat{\Sigma}_{\tilde{Z}}^{-1} X\right)^{-1} X' \hat{\Sigma}_{\tilde{Z}}^{-1} \tilde{Z},$$


Fig. 3 A plot of the empirical semivariogram for zinc-concentration residuals in the soils of the Meuse River floodplain (obtained from the robust Cressie-Hawkins estimator [5]) and a fitted spherical semivariogram model (fitted by weighted least squares using ‘Cressie weights’ [3])

where $\hat{\Sigma}_{\tilde{Z}} \equiv \hat{\Sigma}_W + \hat{\sigma}_\varepsilon^2 I_n$, $\hat{\Sigma}_W$ has $(i,j)$-th entry equal to $C_W(\|s_i - s_j\|; \hat{\theta})$, and $\hat{\sigma}_\varepsilon^2$ is assumed to be equal to 0.08364, the so-called nugget effect shown in Fig. 3 as $\lim_{h \to 0} \gamma_W(h; \hat{\theta})$. That is, the micro-scale variation of $W(\cdot)$ is assumed to be zero.

At this point, the spatial-process model is assumed to have parameters equal to $\hat{\beta}_{GLS}$, $\hat{\theta}$, and $\hat{\sigma}_\varepsilon^2$. In all the prediction equations that follow in this section, these estimates of β, θ, and $\sigma_\varepsilon^2$ will be substituted in without accounting for the effect their estimation has on the spatial-prediction uncertainties. Generally, estimation variances are $O(n^{-1})$ [4, Chap. 1], whereas prediction variances are $O(1)$, which goes some way towards justifying our not accounting for the estimation uncertainties.

Now we turn our attention to OPD spatial prediction of $Y(s_0)$, using the results given in Sect. 3. The OPD spatial predictor, the unconditional risk, the bias, and the 95% unconditional prediction interval are involved in the analysis presented below; the definitions of these quantities are given in Eqs. (20), (21), (22), and (23), respectively. The cut-off $K_\lambda^{0.05}$ for the 95% unconditional prediction interval was obtained via the simulation described at the end of Sect. 3. In what follows, we focus on $s_0$ at the fifteen locations, among the 3,103 possible in D (see Fig. 1), that were described at the beginning of this section. At all fifteen locations, we computed a 95% unconditional prediction interval for $\lambda \in \{\pm 3, \pm 2.5, \pm 2, \pm 1.5, \pm 1.0, \pm 0.5, \pm 0.25, \pm 0.15, \pm 0.10, \pm 0.05, 0\}$; the narrower the prediction interval, the more precise the prediction. The widths of the 95% prediction intervals were then computed for each of the fifteen locations; Fig. 4a shows how the widths at the five deliberately chosen locations vary with λ. Let $\lambda^*(s_0)$ be the value of λ that minimises the width of the 95% prediction interval at $s_0$, for each of the 15 prediction locations. Figure 4b shows a histogram of the 15 values of $\lambda^*(s_0)$ obtained numerically at the 15 locations of $s_0$. The median (shown as a red vertical line) is 0.25, leading us to choose $\lambda = 0.25$ and the OPD spatial predictor, $Y^*(s_0) = \delta^*_{0.25}(Z)$, for all $s_0 \in D$.
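The conditional-simulation pipeline described above (the Gaussian conditional mean and variance of $W(s_0)$ given the data, $M$ conditional draws, and the fourth-power back-transform of Eq. (30)) can be sketched in a few lines. This is a minimal one-dimensional illustration, not the authors' code: the site locations, detrended data, trend value `mu0`, and covariance parameters are all hypothetical, with the spherical covariance and a nugget of 0.08364 chosen to mirror the chapter's set-up.

```python
import numpy as np

rng = np.random.default_rng(0)

def spherical_cov(h, sill, a):
    """Spherical covariance: C_W(h) = sill*(1 - 1.5*h/a + 0.5*(h/a)**3) for h <= a, else 0."""
    h = np.asarray(h, dtype=float)
    return np.where(h < a, sill * (1.0 - 1.5 * h / a + 0.5 * (h / a) ** 3), 0.0)

# Hypothetical set-up: n data sites on a transect and one prediction location s0.
s = np.linspace(0.0, 10.0, 8)                  # data locations s_1, ..., s_n
s0 = 4.3                                       # prediction location
sill, a_range, sig2_eps = 1.0, 6.0, 0.08364    # sigma_W^2, range, nugget (= sigma_eps^2)

Sigma_W = spherical_cov(np.abs(s[:, None] - s[None, :]), sill, a_range)
Sigma_Z = Sigma_W + sig2_eps * np.eye(len(s))  # Sigma_Z = Sigma_W + sigma_eps^2 * I_n
c_W = spherical_cov(np.abs(s - s0), sill, a_range)

z_resid = rng.normal(size=len(s))              # hypothetical detrended data, Z - X*beta
mu0 = 2.5                                      # x(s0)' beta, assumed known here

# Conditional (kriging) mean and variance of W(s0) given the data.
cond_mean = mu0 + c_W @ np.linalg.solve(Sigma_Z, z_resid)
cond_var = sill - c_W @ np.linalg.solve(Sigma_Z, c_W)

# M conditional simulations of W(s0), then Y(s0) = W(s0)^4 as in Eq. (30).
M = 10_000
W_draws = rng.normal(cond_mean, np.sqrt(cond_var), size=M)
Y_draws = W_draws ** 4
```

The OPD predictors and prediction intervals of Eqs. (20)–(23) would then be computed from summaries of `Y_draws`.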


Fig. 4 a Plots of the widths of the 95% unconditional prediction intervals for Y (s0 ) as functions of λ, where s0 is at each of the five deliberately chosen prediction locations. b A histogram of the 15 values of λ∗ (s0 ) that minimised the width of the 95% unconditional prediction interval, respectively at the 15 prediction locations; the median value is 0.25, which is indicated on the histogram by a vertical red line

Fig. 5 Maps of a the OPD spatial predictor, δλ∗ (Z), for λ = 0.25; b the bias, Bias(δλ∗ ), for λ = 0.25; and c the unconditional risk, Rλ (δλ∗ ), for λ = 0.25. In map a, the midpoint of the colour scale, corresponding to a white colour, is set to 320 ppm, which is two times the U.S. Environmental Protection Agency’s Soil Screening Level for zinc for terrestrial plants [9, p. 4]

Finally, in Fig. 5, we present plots of the prediction surface $\delta^*_{0.25}(Z)$ at all 3,103 prediction locations (Fig. 5a), the bias (Fig. 5b), and the unconditional risk (Fig. 5c) of the OPD spatial predictors. It is not surprising that the bias at each prediction site is positive, because the chosen $\lambda = 0.25 > 0$, but there is clearly more bias in regions where the data are sparse. The unconditional risk is a quantification of the uncertainty of the spatial predictor, analogous to the minimised MSPE under squared-error loss.
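The variogram-fitting step used in this example (a spherical semivariogram fitted by weighted least squares with the Cressie [3] weights, minimising $\sum_j N(h_j)\{\hat{\gamma}(h_j)/\gamma(h_j;\theta) - 1\}^2$) can be sketched as follows. The empirical lags, semivariogram values, and pair counts below are hypothetical, and a crude grid search stands in for a proper optimiser.

```python
import numpy as np

def spherical_semivar(h, nugget, psill, a):
    """Spherical semivariogram: gamma(h) = nugget + psill*(1.5*h/a - 0.5*(h/a)**3) for 0 < h <= a."""
    h = np.asarray(h, dtype=float)
    g = np.where(h <= a, nugget + psill * (1.5 * h / a - 0.5 * (h / a) ** 3), nugget + psill)
    return np.where(h == 0.0, 0.0, g)

# Hypothetical empirical semivariogram: lags h_j, estimates gamma_hat(h_j), pair counts N(h_j).
lags   = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
gam    = np.array([0.15, 0.28, 0.40, 0.50, 0.57, 0.61, 0.63, 0.64])
counts = np.array([200, 350, 420, 460, 480, 470, 430, 390])

def wls_crit(theta):
    nugget, psill, a = theta
    g = spherical_semivar(lags, nugget, psill, a)
    # Cressie (1985) weights: N(h_j) / gamma(h_j; theta)^2, written as a relative criterion.
    return float(np.sum(counts * (gam / g - 1.0) ** 2))

# Crude grid search over (nugget, partial sill, range).
grid = [(n0, s0, a0)
        for n0 in np.linspace(0.01, 0.3, 15)
        for s0 in np.linspace(0.2, 0.8, 15)
        for a0 in np.linspace(1.0, 6.0, 15)]
theta_hat = min(grid, key=wls_crit)
```

The fitted nugget in `theta_hat` would play the role of $\hat{\sigma}_\varepsilon^2$ in the trans-Gaussian model above.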


6 Discussion and Conclusions

This research is based on the idea that distance measures and divergence measures can be used as loss functions in a decision-theoretic approach to inferring unknown parameters or random variables from data and a statistical model. Squared-error loss is ubiquitous, but that loss is symmetric around the true value. Our setting is spatial prediction of a hidden-process value at a given spatial location, where the process, the data, and the spatial predictor are all non-negative. We featured the power-divergences [8], a family of divergences within the class of phi-divergences [6] that have non-negative arguments and are not symmetric around the true hidden-process value. The optimal spatial predictor minimises the expected loss, where the expectation is taken over the randomness in the process and in the data. The minimised expected loss is then the smallest unconditional risk among all possible spatial predictors. In general, the spatial predictor is biased with respect to the mean of the hidden-process value. However, there are other statistical criteria by which an inference can be judged, such as prediction intervals. For a given level α, the shorter the (1 − α) × 100% prediction interval, the more precise the inference. In Sect. 3, we give results for the OPD spatial predictor, in terms of a computational algorithm to compute it, its bias, its unconditional risk, and a (1 − α) × 100% unconditional prediction interval for the hidden-process value being predicted.

In this article, we have emphasised statistical criteria that take expectations over all sources of randomness in the statistical model, namely the joint distribution of the hidden-process value and the spatial data. We have extended our results to criteria that take expectations over the conditional distribution given the data, namely the predictive distribution; that research will appear elsewhere.
The well-known spatial predictor, kriging, uses the squared-error loss function and minimises the mean-squared prediction error, which is the unconditional risk under squared-error loss. The kriging predictor is unbiased, and hence the minimised mean-squared prediction error is the expectation of the conditional variance (conditional on the data). In the context of squared-error loss, this conditional variance is the minimised conditional risk. Inference with respect to the predictive distribution can be quite sensitive to the spatial data observed and is generally more precise than inference with respect to the joint distribution. For example, we have found that conditional prediction intervals can be much narrower than their unconditional counterparts. In conclusion, new spatial predictors have been obtained by changing from minimising the mean-squared prediction error to minimising the unconditional risk based on a power-divergence loss function. The methodology developed in Sects. 2–4 is directly applicable to prediction for non-negative spatial processes and data. Our methodological results are applied to spatial prediction of zinc concentrations in soil on a floodplain of the Meuse River in the Netherlands.


Acknowledgements This research was supported by the Australian Research Council Discovery Project DP190100180. We are grateful for perceptive comments and suggestions from Andrew Zammit-Mangion. Leandro Pardo’s work with Noel on phi-divergence goodness-of-fit statistics [6, 7] was an inspiration for this chapter on optimal spatial prediction of non-negative processes. Happy birthday, Leandro!

References

1. Berger, J.O.: Statistical Decision Theory and Bayesian Analysis, 2nd edn. Springer, New York (1985)
2. Box, G.E.P., Cox, D.R.: An analysis of transformations. J. Roy. Statist. Soc. Ser. B 26(2), 211–252 (1964)
3. Cressie, N.: Fitting variogram models by weighted least squares. Math. Geol. 17, 563–586 (1985)
4. Cressie, N.: Statistics for Spatial Data, rev. edn. Wiley, New York (1993)
5. Cressie, N., Hawkins, D.M.: Robust estimation of the variogram: I. Math. Geol. 12, 115–125 (1980)
6. Cressie, N., Pardo, L.: Minimum φ-divergence estimator and hierarchical testing in loglinear models. Statist. Sinica 10, 867–884 (2000)
7. Cressie, N., Pardo, L.: Model checking in loglinear models using φ-divergences and MLEs. J. Statist. Plan. Infer. 103(1–2), 437–453 (2002)
8. Cressie, N., Read, T.: Multinomial goodness-of-fit tests. J. Roy. Statist. Soc. Ser. B 46(3), 440–464 (1984)
9. Environmental Protection Agency: Ecological Soil Screening Levels for Zinc. Environmental Protection Agency, Washington, DC, USA (2007)
10. Matheron, G.: Traité de Géostatistique Appliquée, Tome I. Mémoires du Bureau de Recherches Géologiques et Minières 14. Editions Technip, Paris (1962)
11. Pebesma, E.J.: Multivariable geostatistics in S: the gstat package. Comput. Geosci. 30, 683–691 (2004)
12. Read, T., Cressie, N.: Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, New York (1988)
13. Rikken, M.G.J., Van Rijn, R.P.G.: Soil pollution with heavy metals – an inquiry into spatial variation, cost of mapping and the risk evaluation of copper, cadmium, lead and zinc in the floodplains of the Meuse west of Stein, the Netherlands. MA Thesis, Utrecht University, Utrecht (1993)
14. Zellner, A.: Bayesian estimation and prediction using asymmetric loss functions. J. Amer. Statist. Assoc. 81(394), 446–451 (1986)

On Entropy Based Diversity Measures: Statistical Efficiency and Robustness Considerations

Abhik Ghosh and Ayanendranath Basu

Abstract We consider the problem of estimating diversity measures for a stratified population and discuss a general formulation for entropy-based diversity measures which includes the previously used entropies as well as a newly proposed family of logarithmic norm entropy (LNE) measures. Our main focus in this work is the consideration of the statistical properties (asymptotic efficiency and finite-sample robustness) of the sample estimates of such entropy-based diversity measures for their validation and appropriate recommendations. Our proposed LNE based diversity is indeed seen to provide the best trade-offs at an appropriately chosen tuning parameter. Along the way, we also show that the second-best candidates are the hypoentropy based diversities, justifying their consideration by Leandro Pardo and his colleagues in 1993 over the other entropy families existing at that time. We finally apply the proposed LNE based measure to examine the demographic (age- and gender-based) diversities among Covid-19 deaths in the USA.

A. Ghosh (B) · A. Basu, Interdisciplinary Statistical Research Unit, Indian Statistical Institute, Kolkata, India; e-mail: [email protected] (A. Ghosh), [email protected] (A. Basu)

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_18

1 Introduction

Measurement of diversity within a population is an important problem in all applied sciences, including ecology, biology, economics, sociology, physics and management sciences; see, e.g., [2–4, 9, 11–13, 21]. Intuitively, diversity is a measure of the average variability of the individuals within a population in terms of a qualitative or quantitative characteristic of the individuals. We consider the most common case of a stratified population, with the different strata being the sole characteristics of the individuals within it for the purpose of measuring diversity. In such cases, diversity can be measured by considering the relative frequencies of the strata within the population and measuring their probabilistic uncertainty. Entropy, being the most popular


measure of uncertainty in different associated fields of the information sciences, is one immediate candidate for measuring diversity. The popular classical entropy-based diversity measures are the Simpson index [22] and Shannon's entropy [20]. Subsequently, there have been several proposals for generalized measures of diversity and attempts to formally define diversity measures; see [15–17], among others. The link between diversity measures and disparities [10], a particular class of statistical divergences, has been explored in [19].

In this work, we consider entropy-based diversity measures and their general formulation following [17]. Besides discussing such existing diversity measures, we propose a new class of generalized diversity measures based on a newly developed scale-invariant family (LNE) of entropies from [6, 7] and examine its usefulness for this purpose. We mainly focus on the statistical performance of the sample estimates of such entropy-based diversity measures; such statistical considerations are rarely available in the literature for existing diversity measures (except for the hypoentropy measure [14]). We derive the asymptotic distribution of the general entropy-based diversity estimated from a simple random sample and use it to develop confidence intervals for the diversity of the population under study. In particular, we simplify the asymptotic variance of the estimated entropy measures (as diversities) for several families of entropy measures, including Shannon, Renyi, Havrda and Charvat, as well as our proposed LNE. Our results for general entropy-based diversity measures extend those of [14], who derived them only for the hypoentropy measures (see Eq. (5)). Further, we empirically compare the finite-sample performances of different entropy-based diversity measures through their asymptotic variances (efficiency considerations) as well as their robustness against misclassification errors via appropriate simulation studies.
These empirical investigations reveal better statistical performance of the hypoentropy-based diversity over the entropies and diversities existing at that time (including Rao's quadratic diversity), justifying their consideration by Leandro Pardo and his colleagues in 1993 [14]. However, our newly proposed LNE outperforms even these hypoentropy-based measures for appropriate choices of the tuning parameters, both in terms of efficiency and robustness, making it the best currently available candidate for a statistically viable entropy-based diversity measure. Accordingly, we apply it to study the diversity of Covid-19 deaths in the USA with respect to their age- and gender-based stratification.

2 Entropy Based Diversity Measures: A General Formulation and Examples

Let us consider a finite population of N individuals that are characterized by a set of measurements X (say). As mentioned previously, in this paper we assume that these measurements lead to a classification of each individual into a finite set of M (< N) classes denoted by $\mathcal{X} = \{x_1, \ldots, x_M\}$; for example, these classes are often


species in ecology, whereas they are different phenotypes/alleles in genetics. Let us denote the (discrete) probability distribution of any individual belonging to these M classes as $p = (p_1, \ldots, p_M)$, with $p_i$ being the probability of belonging to class i for each $i = 1, \ldots, M$, and denote the set of all such discrete probability distributions as $\mathcal{P}$. Rao [17] defined a general measure of diversity within such a population as an appropriate function, say $D(\cdot)$, from the class $\mathcal{P}$ to the set of (positive) reals that "reflects difference between individuals (X's) within a population". Rao [17] went on to further characterize the nature of the function D to be useful as a diversity measure from different angles; in particular, considering a functional approach, he suggested that an appropriate functional D must satisfy the following intuitively justified conditions:

(C1) The diversity should be maximum for the uniformly distributed population, i.e., the maximum of $D(p)$ over $p \in \mathcal{P}$ should be attained at the uniform probability vector $p_U = (1/M, \ldots, 1/M)$.

(C2) The diversity measure should not change when the class labels are permuted, i.e., $D(p)$ is a symmetric function of $p_1, \ldots, p_M$ for every $p = (p_1, \ldots, p_M) \in \mathcal{P}$.

(C3) The diversity measure is smooth in the sense that, for any $p \in \mathcal{P}$, the functional $D(p)$ has all first- and second-order partial derivatives with respect to $p_1, \ldots, p_{M-1}$. Additionally, $D''(p)$, the $(M-1) \times (M-1)$ matrix of second-order partial derivatives, is continuous and non-null at $p = p_U$, the uniform probability vector.

Rao considered another functional relationship, along with Conditions (C1)–(C3), to derive his quadratic diversity measure, given by

$$D^{(Q)}(p) = a\Big(1 - \sum_i p_i^2\Big) + b, \quad a > 0,\; b \in \mathbb{R}. \qquad (1)$$

This popular diversity measure (with a = 1, b = 0) is also known as the Gini–Simpson index, previously derived from different applied considerations. Rao's quadratic entropy $D^{(Q)}$ further satisfies another intuitive characterization of a diversity measure, given below as (C0).

(C0) The diversity should be non-negative and equal to zero if all individuals in the population are identical, i.e., $D(p) \geq 0$ for all $p \in \mathcal{P}$ and $D(p) = 0$ if and only if p is degenerate at a particular $x \in \mathcal{X}$.

Now, it is important to note that Conditions (C0)–(C3) are also satisfied by different entropy measures commonly used in information theory and statistical physics. As a result, any such entropy measure can be used as a new diversity measure. Rao [17] himself discussed three different possible candidate families of entropies as generalized diversity measures, which are given by


$$D_\beta^{(S)}(p) = -\frac{\sum_i p_i^\beta \log p_i}{\sum_i p_i^\beta}, \qquad (2)$$

$$D_{(\alpha,\beta)}^{(R)}(p) = \frac{1}{1-\alpha} \log \frac{\sum_i p_i^{\alpha+\beta-1}}{\sum_i p_i^\beta}, \qquad (3)$$

$$D_{(\alpha,\beta)}^{(HC)}(p) = \frac{1}{2^{1-\alpha}-1} \left( \frac{\sum_i p_i^{\alpha+\beta-1}}{\sum_i p_i^\beta} - 1 \right), \qquad (4)$$

where α and β are two positive constants (tuning parameters) leading to different diversity measures. The superscripts in the names of the above diversity measures are motivated by the fact that, at β = 1, they coincide with the famous Shannon [20], Renyi [18], and Havrda and Charvat [8] entropies, respectively. Interpretations of these functionals as diversity measures in the context of ecological studies had been discussed in [15]. Further, $D_{(2,1)}^{(HC)}(p)$ is a constant multiple of Rao's quadratic measure of diversity $D^{(Q)}(p)$.

The hypoentropy measure [5] is also explored as a possible generalized diversity measure in [14], where the corresponding diversity measure is defined as

$$D_\lambda^{(Hyp)}(p) = \left(1 + \frac{1}{\lambda}\right) \log(1+\lambda) - \frac{1}{\lambda} \sum_i (1 + \lambda p_i) \log(1 + \lambda p_i), \qquad (5)$$

for some λ > 0.

Note that all these entropy measures are concave functions. In [16], this concavity was indeed assumed to be a requirement in the definition of diversity measures, in place of Condition (C3); the same definition was also subsequently used by several authors, including [14]. But concavity is a stronger assumption and implies (C3). So, here, we work with Conditions (C0)–(C3) without assuming the stricter requirement of concavity. In general, we consider any (entropy) functional $D : \mathcal{P} \to \mathbb{R}^+$ satisfying Conditions (C0)–(C3) as a possible candidate for a diversity measure and discuss the issue of statistical inference in relation to this general measure. Further illustrations and simplifications will be provided for the special cases (3)–(5).

Additionally, we also examine a new scale-invariant generalization of the Renyi entropy, namely the logarithmic norm-entropy (LNE) of [7], as a generalized measure of diversity; this measure is defined in terms of two parameters α, β > 0 as

$$D_{(\alpha,\beta)}^{(LN)}(p) = \frac{\alpha\beta}{\beta-\alpha} \log \left[ \frac{\big(\sum_i p_i^\alpha\big)^{1/\alpha}}{\big(\sum_i p_i^\beta\big)^{1/\beta}} \right]. \qquad (6)$$

Note that this LNE functional is not necessarily concave but satisfies Conditions (C0)–(C3). It is a slightly modified version of the entropy family in (3), which was only scale-equivariant. At β = 1 or α = 1, the LNE reduces to the Renyi entropy family, and in general it is symmetric in the choice of (α, β).
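A direct implementation of Eq. (6) makes these properties easy to check numerically; the probability vector below is hypothetical.

```python
import numpy as np

def lne(p, alpha, beta):
    """Logarithmic norm-entropy, Eq. (6): (alpha*beta/(beta-alpha)) * log(||p||_alpha / ||p||_beta)."""
    p = np.asarray(p, dtype=float)
    norm_a = np.sum(p ** alpha) ** (1.0 / alpha)   # ||p||_alpha
    norm_b = np.sum(p ** beta) ** (1.0 / beta)     # ||p||_beta
    return alpha * beta / (beta - alpha) * np.log(norm_a / norm_b)

p = np.array([0.1, 0.2, 0.3, 0.4])                 # hypothetical distribution

# Symmetry in (alpha, beta).
sym_gap = lne(p, 0.5, 2.0) - lne(p, 2.0, 0.5)

# At beta = 1 the LNE is the Renyi entropy; as alpha -> 1 it approaches the Shannon entropy.
shannon = -np.sum(p * np.log(p))
near_shannon = lne(p, 1.0 + 1e-4, 1.0)

# Maximum at the uniform distribution equals log(M) for any (alpha, beta).
max_at_uniform = lne(np.full(4, 0.25), 0.5, 2.0)
```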


Before we move on to the statistical estimation of these diversity measures, it is important to note the limiting inter-relations between these entropy families:

$$\lim_{\alpha\to 1} D_{(\alpha,1)}^{(LN)}(p) = \lim_{\alpha\to 1} D_{(\alpha,1)}^{(R)}(p) = \lim_{\alpha\to 1} D_{(\alpha,1)}^{(HC)}(p) = \lim_{\lambda\to 0} D_\lambda^{(Hyp)}(p) = D_1^{(S)}(p) = -\sum_i p_i \log p_i. \qquad (7)$$

Also, it is important to note that the maximum value of the diversity measure, which is attained at the uniform distribution $p_U$, equals $\log(M)$ for all members of the families $D_\beta^{(S)}(p)$, $D_{(\alpha,\beta)}^{(R)}(p)$ and $D_{(\alpha,\beta)}^{(LN)}(p)$, irrespective of the values of their tuning parameters. On the other hand, the maximum of the quadratic diversity in (1) is $a(M-1)/M + b$, and those of the diversities $D_{(\alpha,\beta)}^{(HC)}(p)$ and $D_\lambda^{(Hyp)}(p)$ depend on their respective tuning parameters α and λ. This makes the first group of measures more useful, in practice, for comparative purposes, as they all have a common range $[0, \log M]$.
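These range claims are easy to verify numerically: at the uniform distribution $p_U$ one has $\sum_i (1/M)^c = M^{1-c}$, so the S- and R-families (Eqs. (2)–(3)) return $\log M$ for any tuning parameters, while the HC maximum (Eq. (4)) equals $(M^{1-\alpha}-1)/(2^{1-\alpha}-1)$ and the hypoentropy maximum (Eq. (5)) depends on λ. A quick check with arbitrary parameter values:

```python
import numpy as np

def W(p, c):
    return float(np.sum(np.asarray(p, dtype=float) ** c))

def D_S(p, beta):                      # Eq. (2)
    p = np.asarray(p, dtype=float)
    return -float(np.sum(p ** beta * np.log(p))) / W(p, beta)

def D_R(p, alpha, beta):               # Eq. (3)
    return np.log(W(p, alpha + beta - 1) / W(p, beta)) / (1 - alpha)

def D_HC(p, alpha, beta):              # Eq. (4)
    return (W(p, alpha + beta - 1) / W(p, beta) - 1) / (2 ** (1 - alpha) - 1)

def D_Hyp(p, lam):                     # Eq. (5)
    p = np.asarray(p, dtype=float)
    return (1 + 1 / lam) * np.log(1 + lam) - float(np.sum((1 + lam * p) * np.log(1 + lam * p))) / lam

M = 5
pU = np.full(M, 1.0 / M)               # uniform distribution

maxS = D_S(pU, beta=2.0)               # log M, independent of beta
maxR = D_R(pU, alpha=2.0, beta=0.5)    # log M, independent of (alpha, beta)
maxHC = D_HC(pU, alpha=2.0, beta=0.5)  # depends on alpha
maxHyp = D_Hyp(pU, lam=1.5)            # depends on lambda, below log M
```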

3 Statistical Estimation and Asymptotic Distribution

We now consider the problem of estimating the diversity measure $D(p)$ of a population, as defined in the previous section, based on a random sample drawn from that population. Suppose that, in such a random sample of size n, there are $n_i$ observations in the i-th class for each $i = 1, \ldots, M$, with $n_1 + \cdots + n_M = n$. Then, the vector of observed class frequencies, $n = (n_1, \ldots, n_M)'$, has a multinomial distribution with parameters $(n; p)$. Hence, the maximum likelihood estimate of $p_i$ is $\hat{p}_i = n_i/n$ for each $i = 1, \ldots, M$; these estimates are consistent and asymptotically normal. This leads to a direct plug-in estimate of the entropy-based diversity measure $D(p)$ as

$$\hat{D}_n = D(\hat{p}), \quad \text{with } \hat{p} = (\hat{p}_1, \ldots, \hat{p}_M) = \left( \frac{n_1}{n}, \ldots, \frac{n_M}{n} \right). \qquad (8)$$

The asymptotic properties of the estimated diversity measure $\hat{D}_n = D(\hat{p})$, presented in the following theorem, easily follow from the asymptotic theory of statistics (see, e.g., [1]) and Condition (C3).

Theorem 3.1 Suppose that $D(p)$ is an (entropy-based) diversity measure satisfying (C0)–(C3). Define

$$\sigma_D^2(p) = \sum_{i=1}^M p_i \left[\nabla_i D(p)\right]^2 - \left( \sum_{i=1}^M p_i \nabla_i D(p) \right)^2, \qquad (9)$$

where $\nabla_i$ denotes the partial derivative with respect to $p_i$, $i = 1, \ldots, M$. Then, $\hat{D}_n = D(\hat{p})$ is a consistent estimator of $D(p)$ and satisfies

204

A. Ghosh and A. Basu

$$\sqrt{n}\,\big(\hat{D}_n - D(p)\big) \xrightarrow{L} N\big(0, \sigma_D^2(p)\big), \text{ as } n \to \infty,$$

provided $0 < \sigma_D^2(p) < \infty$.

Note that Theorem 3.1 is completely general and applies to all the special entropy-based diversity measures (1)–(6). We have simplified the formula for the asymptotic variance for all these diversity measures as listed below, where we will use the notation $W_c(p) = \sum_{i=1}^M p_i^c$ for any $c > 0$.

• For the $D^{(Q)}$ measure in (1), we have

$$\sigma_Q^2(p) = \sigma^2_{D^{(Q)}}(p) = 4a^2 \big( W_3(p) - W_2(p)^2 \big),$$

which, along with Theorem 3.1, establishes the asymptotic properties of the well-known Rao's quadratic diversity index.

• For the family of diversity measures $D_\beta^{(S)}$ in (2), we have

$$\sigma^2_{S(\beta)}(p) = \frac{1}{W_\beta^4} \Big[ \beta^2 W_\beta^2\, \bar{W}_{2\beta-1} + 2\beta W_\beta \big( W_\beta - \beta \tilde{W}_\beta \big) \tilde{W}_{2\beta-1} + \big( W_\beta - \beta \tilde{W}_\beta \big)^2 W_{2\beta-1} - W_\beta^4 \Big],$$

where we denote $W_c = W_c(p)$, $\tilde{W}_c = \sum_{i=1}^M p_i^c \log p_i$ and $\bar{W}_c = \sum_{i=1}^M p_i^c (\log p_i)^2$ for any $c \in \mathbb{R}$. Noting that $D_\beta^{(S)}(p)$ simplifies to the Shannon diversity measure at β = 1, Theorem 3.1 also describes its asymptotic properties, with the corresponding asymptotic variance

$$\sigma^2_{S(1)}(p) = \sum_{i=1}^M p_i (\log p_i)^2 - \left( \sum_{i=1}^M p_i \log p_i \right)^2.$$

• For the family of diversity measures $D_{(\alpha,\beta)}^{(R)}$ in (3), we have

$$\sigma^2_{R(\alpha,\beta)}(p) = \frac{1}{(1-\alpha)^2} \left[ \frac{(\alpha+\beta-1)^2 W_{2\alpha+2\beta-3}}{W_{\alpha+\beta-1}^2} + \frac{\beta^2 W_{2\beta-1}}{W_\beta^2} - \frac{2\beta(\alpha+\beta-1)\, W_{\alpha+2\beta-2}}{W_\beta W_{\alpha+\beta-1}} \right] - 1.$$

As a special case, for β = 1, we then have the asymptotic normality of the ($\sqrt{n}$ times) estimated Renyi entropy (with tuning parameter α) for a finitely stratified population, with the asymptotic variance being

$$\sigma^2_{R(\alpha,1)}(p) = \frac{\alpha^2 \big( W_{2\alpha-1} - W_\alpha^2 \big)}{(1-\alpha)^2 W_\alpha^2}.$$


• For the family of diversity measures $D_{(\alpha,\beta)}^{(HC)}$ in (4), we have

$$\sigma^2_{HC(\alpha,\beta)}(p) = \frac{1}{W_\beta^4 (2^{1-\alpha}-1)^2} \Big[ (\alpha+\beta-1)^2 W_\beta^2 W_{2\alpha+2\beta-3} + \beta^2 W_{2\beta-1} W_{\alpha+\beta-1}^2 - 2\beta(\alpha+\beta-1) W_\beta W_{\alpha+2\beta-2} W_{\alpha+\beta-1} - (1-\alpha)^2 W_\beta^2 W_{\alpha+\beta-1}^2 \Big].$$

In particular, again at β = 1, we get the asymptotic variance of the ($\sqrt{n}$ times) estimated HC entropy (with tuning parameter α) for a finitely stratified population as given by

$$\sigma^2_{HC(\alpha,1)}(p) = \frac{\alpha^2}{(2^{1-\alpha}-1)^2} \big( W_{2\alpha-1} - W_\alpha^2 \big).$$

• For the family of diversity measures $D_\lambda^{(Hyp)}$ in (5), our general result in Theorem 3.1 simplifies to Theorem 1 of [14], with

$$\sigma^2_{Hyp(\lambda)}(p) = \sigma^2_{D_\lambda^{(Hyp)}}(p) = \sum_{i=1}^M p_i \log^2(1+\lambda p_i) - \left( \sum_{i=1}^M p_i \log(1+\lambda p_i) \right)^2.$$

• For our proposed new family of diversity measures $D_{(\alpha,\beta)}^{(LN)}$ in (6), the asymptotic variance has a nicer form, given by

$$\sigma^2_{LN(\alpha,\beta)}(p) = \frac{\alpha^2 \beta^2}{(\beta-\alpha)^2} \left[ \frac{W_{2\alpha-1}}{W_\alpha^2} + \frac{W_{2\beta-1}}{W_\beta^2} - \frac{2 W_{\alpha+\beta-1}}{W_\alpha W_\beta} \right].$$

Note that $\sigma^2_{LN(\alpha,\beta)}(p)$ is symmetric in the choice of (α, β), as intuitively expected from the similar behavior of the LN entropy itself.

Note that $\sigma_D^2(p)$ is again a continuous function of p, and hence it can be consistently estimated by $\sigma_D^2(\hat{p})$. This helps us to report the standard error of any estimated diversity measure and its confidence interval in order to portray the complete picture. In particular, the $100(1-\tau)\%$ asymptotic confidence interval for a diversity measure based on D is given by

$$\left[ \hat{D}_n - n^{-1/2}\, \sigma_D(\hat{p})\, z_{\tau/2},\; \hat{D}_n + n^{-1/2}\, \sigma_D(\hat{p})\, z_{\tau/2} \right],$$

where $z_a$ is the $(1-a)$-th quantile of the standard normal distribution. The estimated asymptotic variance can also be used to develop appropriate (asymptotic) tests for any statistical hypothesis concerning diversity measures, extending the results from [14].
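The plug-in estimate (8), the variance formula (9), and this confidence interval fit in a few lines of code. Here D is the Shannon diversity, the class counts are a hypothetical sample, and the partial derivatives in (9) are taken numerically so that the same function works for any smooth D; for Shannon, (9) reduces to $\sum_i p_i (\log p_i)^2 - (\sum_i p_i \log p_i)^2$, which the numerical version reproduces.

```python
import numpy as np

def plug_in_diversity(counts, D):
    """Plug-in estimate D(p_hat) from multinomial class counts, Eq. (8)."""
    counts = np.asarray(counts, dtype=float)
    return D(counts / counts.sum())

def asymp_var(p, D, eps=1e-6):
    """sigma_D^2(p) of Eq. (9), using central-difference partial derivatives of D."""
    p = np.asarray(p, dtype=float)
    grad = np.empty_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = eps
        grad[i] = (D(p + e) - D(p - e)) / (2 * eps)
    return float(np.sum(p * grad ** 2) - np.sum(p * grad) ** 2)

def shannon(p):
    p = np.asarray(p, dtype=float)
    return -float(np.sum(p * np.log(p)))

counts = np.array([30, 70])                  # hypothetical sample of size n = 100
n = counts.sum()
p_hat = counts / n

D_hat = plug_in_diversity(counts, shannon)
se = np.sqrt(asymp_var(p_hat, shannon) / n)

z = 1.959963984540054                        # z_{0.025}, for a 95% interval (tau = 0.05)
ci = (D_hat - z * se, D_hat + z * se)
```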


4 Numerical Illustrations: Comparative Performances

4.1 Asymptotic Efficiency

In order to compare different entropy-based diversity measures in terms of their efficiency (i.e., in terms of their estimated standard errors or, equivalently, the lengths of the corresponding confidence intervals), we compare the values of their asymptotic variances following the theory of Sect. 3. For simplicity, we have computed the values of these asymptotic variances of different diversity estimates for the case of M = 2 and $p = (p, 1-p)$; noting the symmetry of all the diversity measures in p, we report their values only for $p \in [0, 1/2]$ and a few interesting choices of tuning parameters in Fig. 1. The asymptotic variance of Rao's quadratic diversity measure is plotted in all the sub-figures as the comparative benchmark.

Note that, except for the generalized Shannon entropy family $D_\beta^{(S)}$, the diversity measures have zero variability when the population is perfectly homogeneous (p = 0.5), leading to their respective maximum diversity values. The family of $D_{(\alpha,\beta)}^{(HC)}$ measures is clearly less efficient (having higher variance) than the classical quadratic diversity $D^{(Q)}$, and its inefficiency further increases as the values of the tuning parameters increase. It has been observed that the asymptotic variances of the diversity measures $D_{(\alpha,\beta)}^{(HC)}$ and $D_{(\alpha,\beta)}^{(R)}$ both increase linearly for all p as α increases when β is held fixed; so results for only one α value are reported here. For the generalized Renyi family $D_{(\alpha,\beta)}^{(R)}$, we can get more efficient estimates than the $D^{(Q)}$ measure if p is small (i.e., the population is least diverse) and a larger value of β is used; however, those particular measures lead to extremely large variances if p is near 0.5, i.e., when the population is more homogeneous.
A diversity measure having uniformly greater efficiency than Rao's measure is indeed the hypoentropy family $D_\lambda^{(Hyp)}$ with smaller values of λ; their efficiency also decreases as λ increases. This indeed justifies the consideration of the hypoentropy measure as a diversity in [14] over the other (then) existing entropy families. However, if p is not very small (i.e., the population is not extremely heterogeneous), we can have an even more efficient diversity measure by considering the newly proposed LNE family, $D_{(\alpha,\beta)}^{(LN)}$, with either of its tuning parameters small enough; these measures provide significantly smaller asymptotic variance compared to both $D^{(Q)}$ and $D_\lambda^{(Hyp)}$ for moderate and larger values of p, which are more frequent in real-life scenarios.

4.2 Finite-Sample Robustness

We now study the finite-sample robustness of different diversity measures against possible misclassification of sample observations through an appropriate simulation study. It may be noted that robustness against distant outliers is not meaningful here, since the set-up is of finite support, and the associated robustness measures

Fig. 1 Asymptotic variances of different diversity estimates over p when M = 2 and p = (p, 1−p); 'Quad' denotes Rao's quadratic diversity $D^{(Q)}$. Panels: (a) $D_\beta^{(S)}$; (b) $D_{(\alpha=2,\beta)}^{(R)}$; (c) $D_{(\alpha=2,\beta)}^{(HC)}$; (d) $D_\lambda^{(Hyp)}$; (e) $D_{(\alpha=2,\beta)}^{(LN)}$; (f) $D_{(\alpha=0.3,\beta)}^{(LN)}$

(e.g., the influence function) are not directly applicable. We simulate a random sample of n = 100 observations from a moderately homogeneous population with M = 2 and p = (0.3, 0.7), and contaminate 100ε% of the sample observations by moving them to a particular group (the first one). This contamination tends to make the population more homogeneous, increasing the diversity values significantly for non-robust diversity measures. In order to examine the changes in the different diversity measures, we repeat the above simulation exercise 1000 times for several ε values and study the ratio (change) of the median estimated diversity at a given ε > 0

A. Ghosh and A. Basu

[Figure 2 appears here]
Fig. 2 Ratios of the median estimated diversity under contamination to that under no contamination, over ε, for panels (a) D_β^(S), (b) D_(α=2,β)^(R), (c) D_(α=2,β)^(HC), (d) D_λ^(Hyp), (e) D_(α=2,β)^(LN), and (f) D_(α=0.3,β)^(LN)

(contaminated case) over that obtained for ε = 0 (no contamination). The resulting ratios are plotted in Fig. 2 for contamination proportions as high as 30% (ε = 0.3) for some particular members of the entropy-based diversity families (mostly the same as those in Fig. 1). Clearly, the generalized Shannon entropy leads to the most stable estimates when the tuning parameter is chosen small enough. In general, the robustness of the measures D_β^(S), D_(α=2,β)^(R) and D_(α=2,β)^(HC) decreases as β increases, and they become less robust than Rao's quadratic diversity D^(Q) soon after β crosses a cut-off. On the contrary, all members of the hypoentropy family D_λ^(Hyp) lead to estimates as robust as D^(Q), with robustness increasing slightly with increase in λ, further justifying its use in [14]. However, the new LNE family, with a small enough value of any one of its tuning parameters, provides a diversity measure that is significantly more robust than D^(Q) as well as all members of the hypoentropy family. Along with the uniform range and greater efficiency, the above discussion clearly establishes the LNE-based diversity measures as arguably the best candidates among all entropy-based measures considered here when the population is at least moderately homogeneous.
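The contamination exercise just described can be sketched in a few lines. The sketch below uses plug-in Shannon entropy as a stand-in diversity measure, since the LNE and hypoentropy formulas are defined earlier in the chapter and not reproduced here; any plug-in estimator of the form D(p̂) slots in the same way, and all names are illustrative.

```python
import numpy as np

def shannon_diversity(counts):
    """Plug-in Shannon-entropy diversity from category counts (stand-in measure)."""
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def contamination_ratio(eps, n=100, p=(0.3, 0.7), reps=1000, seed=1):
    """Median estimated diversity under eps-contamination over the median at eps = 0."""
    rng = np.random.default_rng(seed)
    med = {}
    for e in (0.0, eps):
        vals = []
        for _ in range(reps):
            x = rng.choice(len(p), size=n, p=p)
            k = int(round(e * n))
            x[:k] = 0  # move 100*eps% of the observations to the first group
            vals.append(shannon_diversity(np.bincount(x, minlength=len(p))))
        med[e] = np.median(vals)
    return med[eps] / med[0.0]

print(contamination_ratio(0.3))  # ratio > 1: contamination inflates the estimate
```

A measure is stable in the sense of this section when this ratio stays near 1 over a range of ε.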

5 Application: Diversity of Covid-19 Deaths in USA

We now analyze the demographic diversities among the persons who died due to the coronavirus disease 2019 (COVID-19) within the USA. We collected the data from the website¹ of the Centers for Disease Control and Prevention (CDC), USA, as on March 29, 2021. Based on these data, we estimate the diversity among Covid-19 deaths with respect to gender and age-group in the USA using our entropy-based diversity measures. The finest age-groups available in the data are as follows: under 1 year, 1–4, 5–14, 15–24, 25–34, 35–44, 45–54, 55–64, 65–74, 75–84 years, and above 85 years. Following the discussions from the previous section, we report the estimated diversity values and the (√n-times) SEs only for the members of the new LNE family in Table 1. We can clearly see that the estimated entropy values are pretty close for different parameter values, and we choose the one leading to the smallest estimated SE. In particular, the least SE for the gender-based analysis is obtained for the LNE with (α, β) = (0.3, 0.1), where the estimated diversity is very close to the maximum value of log(M). Thus, there is not much gender-based discrimination among Covid-19 deaths in the USA (indeed they are pretty homogeneous). However, the estimated diversity values with least SE (at α = 2, β = 1) are significantly away from the corresponding value of maximum diversity, indicating heterogeneity of the Covid-19 deaths across different age-groups. This diversity is not significantly different when measured separately among the male and female populations, although females tend to have a slightly larger diversity value.

¹ https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Sex-Age-and-S/9bhg-hcku/data.


Table 1 Estimated LNE-based diversity and (√n-times) standard errors (in parentheses) for the Covid-19 deaths in USA with respect to age-group and gender. In each column, the value with the lowest standard error is marked with an asterisk

LNE (α, β)   | Gender (All ages) | Age-group (All sex) | Age-group (Male) | Age-group (Female)
M            | 2                 | 11                  | 11               | 11
log(M)       | 0.693             | 2.398               | 2.398            | 2.398
(0.3, 0.1)   | 0.693 (0.003)*    | 2.286 (1.076)       | 2.289 (1.026)    | 2.281 (1.153)
(0.3, 0.5)   | 0.693 (0.015)     | 2.024 (1.720)       | 2.038 (1.653)    | 2.001 (1.829)
(0.3, 1)     | 0.692 (0.029)     | 1.906 (1.358)       | 1.928 (1.309)    | 1.867 (1.440)
(0.3, 1.5)   | 0.691 (0.043)     | 1.853 (1.219)       | 1.879 (1.175)    | 1.805 (1.294)
(2, 0.1)     | 0.692 (0.019)     | 2.147 (1.082)       | 2.158 (1.037)    | 2.128 (1.155)
(2, 0.5)     | 0.689 (0.095)     | 1.644 (0.779)       | 1.687 (0.747)    | 1.560 (0.848)*
(2, 1)       | 0.684 (0.189)     | 1.435 (0.689)*      | 1.498 (0.642)*   | 1.301 (0.851)
(2, 1.5)     | 0.679 (0.282)     | 1.338 (0.785)       | 1.413 (0.720)    | 1.169 (1.056)
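The plug-in estimate and a √n-scaled standard error of the kind reported in Table 1 can be sketched with a parametric bootstrap. As before, Shannon entropy is only a stand-in (the LNE formula itself is not reproduced in this excerpt), and the counts are illustrative, not the CDC data.

```python
import numpy as np

def shannon_diversity(p):
    """Plug-in Shannon-entropy diversity from a probability vector (stand-in)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def diversity_with_se(counts, boot=2000, seed=0):
    """Plug-in diversity and sqrt(n)-scaled bootstrap SE from category counts."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    n = counts.sum()
    est = shannon_diversity(counts / n)
    reps = rng.multinomial(n, counts / n, size=boot)   # parametric bootstrap
    se = np.std([shannon_diversity(r / n) for r in reps], ddof=1)
    return est, np.sqrt(n) * se

# two highly unequal groups: diversity well below the log(M) = log(2) maximum
est, se = diversity_with_se([900, 100])
print(est, se)
```

Comparing the estimate against log(M), and estimates against each other via their SEs, mirrors the comparisons made in the discussion of Table 1.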

6 Conclusions

Describing the diversity of a population is an important practical problem. At the present time the statistical community recognizes that, together with the asymptotic efficiency of a diversity measure, the robustness aspect also has to be given due importance. In the literature there is a tradition of generating diversity measures based on entropies, and here we have provided a general formulation of such types of diversity measures. In particular, we have established that the diversity measure based on the recently developed logarithmic norm entropy, under a suitable choice of tuning parameters, can provide the best trade-off between the conflicting concepts of efficiency and robustness. As the research of Professor Leandro Pardo and his colleagues on hypoentropy and related diversity measures represents one of the most prominent works in the existing literature of this area, we feel that our chosen topic is a most appropriate one for paying our tribute to Professor Pardo and his long and distinguished research career.

Acknowledgements The authors wish to thank the Editors for inviting us to contribute to this edited volume. The research of AG is partially supported by Grant CRG/2019/001461 from the Science & Engineering Research Board (SERB), Government of India. The research of AB is supported by the Technology Innovation Hub at Indian Statistical Institute, Kolkata, under Grant NMICPS/006/MD/2020-21 of the Department of Science and Technology, Government of India, dated 16.10.2020.


References

1. Bickel, P.J., Doksum, K.A.: Mathematical Statistics: Basic Ideas and Selected Topics. Chapman & Hall/CRC Press, Boca Raton (2015)
2. Bossert, W., Pattanaik, P.K., Xu, Y.: The measurement of diversity. Centre de Recherche et Développement en Économique, Université de Montréal (2001)
3. Burkard, A.W., Boticki, M.A., Madson, M.B.: Workplace discrimination, prejudice, and diversity measurement: a review of instrumentation. J. Career Assess. 10(3), 343–361 (2002)
4. Daly, A.J., Baetens, J.M., De Baets, B.: Ecological diversity: measuring the unmeasurable. Mathematics 6(7), 119 (2018)
5. Ferreri, C.: Hypoentropy and related heterogeneity, divergence and information measures. Statistica 40, 155–167 (1980)
6. Ghosh, A., Basu, A.: A generalized relative (α, β)-entropy: geometric properties and applications to robust statistical inference. Entropy 20(5), 347 (2018)
7. Ghosh, A., Basu, A.: A scale-invariant generalization of the Rényi entropy, associated divergences and their optimizations under Tsallis' nonextensive framework. IEEE Trans. Inform. Theor. 67(4), 2141–2161 (2021)
8. Havrda, J., Charvát, F.: Quantification method of classification processes. Concept of structural α-entropy. Kybernetika 3(1), 30–35 (1967)
9. Li, S., Deng, Y., Du, X., Feng, K., Wu, Y., He, Q., Wang, Z., Liu, Y., Wang, D., Peng, X., Zhang, Z., Escalas, A., Qu, Y.: Sampling cores and sequencing depths affected the measurement of microbial diversity in soil quadrats. Sci. Total Environ. 144966 (2021)
10. Lindsay, B.G.: Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Statist. 22(2), 1081–1114 (1994)
11. Magurran, A.E.: Measuring Biological Diversity. Wiley, New York (2013)
12. Magurran, A.E., McGill, B.J. (eds.): Biological Diversity: Frontiers in Measurement and Assessment. Oxford University Press, Oxford (2011)
13. McGlinn, D.J., Xiao, X., May, F., Gotelli, N.J., Engel, T., Blowes, S.A., Knight, T.M., Purschke, O., Chase, J.M., McGill, B.J.: Measurement of Biodiversity (MoB): a method to separate the scale-dependent effects of species abundance distribution, density, and aggregation on diversity change. Meth. Ecol. Evol. 10(2), 258–269 (2019)
14. Morales, D., Taneja, I.J., Pardo, L.: Hypoentropy as an index of diversity. Theor. Probab. Appl. 37(1), 155–158 (1993)
15. Patil, G.P., Taillie, C.: An overview of diversity. In: Grassle, F., Patil, G.P., Smith, W., Taillie, C. (eds.) Ecological Diversity in Theory and Practice, pp. 3–27. International Co-Operative Publishing House, Fairland (1979)
16. Patil, G.P., Taillie, C.: Diversity as a concept and its measurement. J. Amer. Statist. Assoc. 77(379), 548–561 (1982)
17. Rao, C.R.: Diversity and dissimilarity coefficients: a unified approach. Theor. Popul. Biol. 21, 24–43 (1982)
18. Rényi, A.: On measures of entropy and information. In: Neyman, J. (ed.) Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561. University of California Press, Berkeley (1961)
19. Sarkar, S., Basu, A.: Linking diversity and disparity measures. Pakistan J. Statist. Oper. Res. 8(3), 491–506 (2012)
20. Shannon, C.E.: A mathematical theory of communication. Bell System Tech. J. 27(3), 379–423 (1948)
21. Sibhatu, K.T., Qaim, M.: Farm production diversity and dietary quality: linkages and measurement issues. Food Secur. 10(1), 47–59 (2018)
22. Simpson, E.H.: Measurement of diversity. Nature 163, 688 (1949)

Statistical Distances in Goodness-of-fit

Marianthi Markatou and Anran Liu

Abstract Statistical distances or divergences have a long history in the scientific literature, where they are used for a variety of purposes, including that of testing for goodness of fit. In the present work, we discuss the role of distances or divergences in the context of model selection via testing. Specifically, we construct a goodness of fit test for testing simple null hypotheses and study the asymptotic distribution of the test statistic under the null. We obtain a locally quadratic representation of the test statistic and exemplify the derived results in the case of testing for normality. To do this, we identify the kernel that enters the local quadratic representation of the test statistic, obtain the asymptotic distribution of the test statistic, and illustrate its performance via simulation.

1 Introduction

One of the conventional approaches to the problem of model selection is to view it as a hypothesis testing problem. Hypothesis testing as a means of selecting a model has had a long exposure in science (D'Agostino and Stephens [6]; Rayner et al. [18]). Many seem to feel more comfortable with the hypothesis testing paradigm for model selection, and some even consider the results of a test as the standard by which other approaches can be judged. When the hypothesis testing framework is adopted, one usually thinks about likely alternatives to the model, or alternatives that seem to be most dangerous to the inference, such as "heavy tails". As an example, we note that one of the fundamental robustness questions is the performance of statistical procedures in the presence of heavy-tailed distributions and a variety of data errors.

M. Markatou (B) · A. Liu
Department of Biostatistics, SPHHP, University at Buffalo, Buffalo, NY, USA
e-mail: [email protected]
A. Liu e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_19


Goodness of fit problems were considered early in the statistical literature. Mathematically, the classical goodness of fit problem can be stated as follows: given a random sample x₁, x₂, …, xₙ, test the hypothesis that the sample comes from a population with distribution function F. Examples of goodness of fit statistics include the Kolmogorov-Smirnov, Anderson-Darling and chi-squared goodness of fit tests, and many other statistics. The literature for univariate samples is voluminous, and a number of goodness of fit methods have appeared for testing multivariate data. Chen and Markatou [4] offer a review of multivariate goodness of fit tests and introduce tests based on kernel statistics. We note here that goodness of fit testing is also studied in the parallel machine learning literature, where Gretton et al. [7] utilize the concept of embedding of probability distributions to construct a test statistic, called the maximum mean discrepancy (MMD), and propose kernel tests for the two-sample problem. Tests that are based on divergences are also present in the literature (see Salicrú et al. [19]; Morales et al. [15]). Furthermore, Noughabi and Balakrishnan [16] introduce a general goodness of fit test based on a φ-divergence and establish its consistency. Pardo and Zografos [17] suggest a φ-divergence goodness of fit test when the data are realizations from a multinomial model and are subject to misclassification. Additional works include Chen et al. [3] and various tests of normality; see Arizono and Ohta [1], Lequesne and Regnault [8] and Lindsay et al. [9].

We develop a goodness of fit test that is locally quadratic. Specifically, the paper is organized as follows. Section 2 discusses the role of statistical distances in goodness of fit and presents a test statistic for testing simple null hypotheses. Further, it develops the asymptotic distribution of the test statistic and applies the obtained results to the case of testing normality. Section 3 of the paper presents a small simulation study of the test performance. Finally, Section 4 offers a brief discussion and conclusions.

2 Goodness of Fit Based on Distances

Let τ, m be two probability density functions, where the word "density" is used to indicate probability mass functions as well. We will say ρ(τ, m) is a statistical distance between two probability distributions with densities τ, m if ρ(τ, m) ≥ 0, with equality if and only if τ and m are the same for all statistical purposes (Markatou et al. [13]). Note that we do not require ρ to be symmetric or to satisfy the triangle inequality. Therefore, ρ(τ, m) is not a distance in a formal, mathematical sense. However, this is not a drawback, as popular distances, such as the Kullback-Leibler distance, are neither symmetric nor do they satisfy the triangle inequality, yet they are immensely useful. Examples of statistical distances include φ-divergences (or f-divergences), originally introduced by Csiszár [5]. We define the distances we work with as
$$\rho(\hat{\tau}, m_\beta) = \sum_x G(\delta(x))\, m_\beta(x), \qquad (1)$$

or
$$\rho(f^*, m^*_\beta) = \int G(\delta(x))\, m^*_\beta(x)\, dx, \qquad (2)$$
in the discrete and continuous probability model case respectively. Furthermore, in the discrete probability case $\delta(x) = \frac{d(x)}{m_\beta(x)} - 1$, where d(x) is the proportion of observations in the sample with values equal to x and it is an estimator (see τ̂) of τ. In the continuous case $\delta(x) = \frac{f^*(x)}{m^*_\beta(x)} - 1$, with $f^*(x) = \int k(x, t; h)\, d\hat{F}(t)$ a density estimator of the true probability distribution, F̂ the empirical cumulative distribution function, and $m^*_\beta(x) = \int k(x, t; h)\, dM_\beta(t)$ the hypothesized model smoothed with the same kernel we use to obtain the density estimator. The distances defined in (1) have been studied by Lindsay [9], while those defined by (2) by Basu and Lindsay [2]; the function G is a real-valued, thrice differentiable function on [−1, +∞), with G(0) = 0. The class of power divergence measures is defined by $G(\delta) = \{(1+\delta)^{\lambda+1} - 1\}/\{\lambda(\lambda+1)\}$, where λ = −2 gives Neyman's chi-squared distance (divided by 2), while λ = −1 returns the Kullback-Leibler distance and λ = 0 returns the likelihood disparity.

Assume a random sample from a continuous distribution F is available and we are interested in testing H₀: F = M_{β₀}, where M_{β₀} is a completely specified null hypothesis. Our test statistic is then defined as
$$T_n = 2n[G''(0)]^{-1} \rho(f^*, m^*_{\beta_0}), \qquad (3)$$
where ρ(f*, m*_{β₀}) is given by Eq. (2). The statistic Tₙ is a function of the Pearson residual δ, defined above. Under differentiability of G, we expand ρ(f*, m*_{β₀}) with respect to the Pearson residual δ in the neighborhood of δ = 0, to obtain:
$$\rho(f^*, m^*_{\beta_0}) \simeq \frac{1}{2} G''(0) \int \delta^2(x)\, m^*_{\beta_0}(x)\, dx,$$
or, equivalently,
$$\rho(f^*, m^*_{\beta_0}) \simeq \frac{1}{2} G''(0) \int \frac{[f^*(x) - m^*_{\beta_0}(x)]^2}{m^*_{\beta_0}(x)}\, dx. \qquad (4)$$
Therefore,
$$T_n \simeq \sum_{i=1}^n \frac{[f^*(x_i) - m^*_{\beta_0}(x_i)]^2}{m^*_{\beta_0}(x_i)},$$
and hence, the class of φ-divergences is locally equivalent to a Pearson chi-squared distance. This is under the conditions G(0) = 0, G'(0) = 0 and G''(0) = 1 (see Lindsay [9] for a discussion on these conditions). Therefore, the class of φ-divergences is locally quadratic.
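A minimal sketch of the statistic follows, assuming a Gaussian kernel k with bandwidth h and using the Pearson-type sum approximation above, so this is the local approximation rather than an exact evaluation of (3); all names are illustrative.

```python
import numpy as np

def npdf(x, mu=0.0, s=1.0):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def T_n(x, sigma=1.0, h=0.1):
    """Pearson-type sum approximation (as in (4)) to T_n for H0: N(0, sigma^2)."""
    x = np.asarray(x)
    # kernel density estimate f*(x_i) with Gaussian kernel of bandwidth h
    f_star = npdf(x[:, None], x[None, :], h).mean(axis=1)
    # null model smoothed with the same kernel: N(0, sigma^2 + h^2)
    m_star = npdf(x, 0.0, np.sqrt(sigma**2 + h**2))
    return float(np.sum((f_star - m_star) ** 2 / m_star))

rng = np.random.default_rng(0)
t_null = T_n(rng.normal(size=500))
t_alt = T_n(rng.normal(2.0, 1.0, size=500))
print(t_null, t_alt)  # the statistic is far larger under the shifted alternative
```

The key design point mirrors the text: the null model is smoothed with the same kernel as the data, so under H₀ the expectation of f* matches m* exactly.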

Lindsay et al. [11] introduced and discussed the class of quadratic distances, defined as
$$d(F, G) = \int\!\!\int K(x, y)\, d(F - G)(x)\, d(F - G)(y), \qquad (5)$$
where K(x, y) is a nonnegative definite kernel. The above discussion justifies the central role of quadratic distances in statistical inference, as most statistical distances are locally quadratic. For an overview of this line of work, see Markatou et al. [14]. In addition, non-quadratic distances and their connection to quadratic distances are studied in Markatou and Chen [12].

The asymptotic null distribution of (3) can be obtained using standard empirical process theory. That is, once the process that arises in the formulation of ρ(f*, m*_{β₀}) is identified, the statistic Tₙ can be written as an integrated version of a function of that process. However, an alternative and fruitful approach to deriving the asymptotic distribution of the test statistic is to exploit its local quadratic representation. Relation (4) establishes the local quadratic character of the test statistic. This means that Tₙ can be put in the form given by (5). We assume that the kernel K(x, y) has a Karhunen-Loève decomposition of the form $K(x, y) = \sum_j \lambda_j f_j(x) f_j(y)$, where λⱼ are the eigenvalues and fⱼ(·) are the eigenfunctions of K(x, y) with respect to the uniform measure on [0, 1]. Then
$$\int\!\!\int K(x, y)\, d(\hat{F} - M_{\beta_0})(x)\, d(\hat{F} - M_{\beta_0})(y) = \sum_j \lambda_j \left( \int f_j(u)\, d(\hat{F} - M_{\beta_0})(u) \right)^2,$$
where
$$\int f_j(u)\, d(\hat{F} - M_{\beta_0})(u) = \int f_j(u)\, d\hat{F}(u) - \int f_j(u)\, dM_{\beta_0}(u) = \frac{1}{n} \sum_{i=1}^n \left[ f_j(u_i) - E_{M_{\beta_0}}(f_j(U)) \right],$$
with the expectation of fⱼ(U), under the uniform distribution, being equal to 0 (and the variance being 1). Hence, asymptotically, the distribution of Tₙ is an infinite linear combination of chi-squared random variables, with weights equal to λⱼ.

Example 2.1 (Testing for Normality) Suppose we would like to test the simple null hypothesis H₀: F = N(0, σ²), with σ² known. Using our statistic Tₙ we need to find the kernel K(s, t) entering the local representation of the statistic. From relationship (4) we see that the main ingredient in the identification of K(s, t) is the quantity
$$E\left[ \frac{f^*(s) - m^*_{\beta_0}(s)}{m^*_{\beta_0}(s)} \cdot \frac{f^*(t) - m^*_{\beta_0}(t)}{m^*_{\beta_0}(t)} \right].$$

Proposition 2.1 Under the assumption that we can exchange integration with summation, the kernel is
$$K(s, t) = \frac{\tilde{K}(s, t)}{m^*_S(s)\, m^*_T(t)} - 1,$$
where K̃(s, t) is the joint density of the random variables S, T defined as S = X + E₁, T = X + E₂, where X ∼ N(0, σ²) is independent of E₁, E₂, and E₁, E₂ are independent N(0, h²). Furthermore, m*_S(s), m*_T(t) indicate the densities of S, T.

Proof Write
$$E\left[ \frac{f^*(s) - m^*_{\beta_0}(s)}{m^*_{\beta_0}(s)} \cdot \frac{f^*(t) - m^*_{\beta_0}(t)}{m^*_{\beta_0}(t)} \right] = E\left[ \frac{\left( \frac{1}{n}\sum_i k(s; X_i, h) - m^*_{\beta_0}(s) \right)\left( \frac{1}{n}\sum_j k(t; X_j, h) - m^*_{\beta_0}(t) \right)}{m^*_{\beta_0}(s)\, m^*_{\beta_0}(t)} \right] = E\left[ \frac{n^{-2} \sum_{i,j} [k(s; X_i, h) - m^*_{\beta_0}(s)][k(t; X_j, h) - m^*_{\beta_0}(t)]}{m^*_{\beta_0}(s)\, m^*_{\beta_0}(t)} \right].$$
When i ≠ j, the two terms k(s; Xᵢ, h) − m*_{β₀}(s) and k(t; Xⱼ, h) − m*_{β₀}(t) are independent with expectation 0. Thus, the aforementioned relationship becomes
$$\frac{1}{n}\, E\left[ \frac{k(s; X_i, h) - m^*_{\beta_0}(s)}{m^*_{\beta_0}(s)} \cdot \frac{k(t; X_i, h) - m^*_{\beta_0}(t)}{m^*_{\beta_0}(t)} \right].$$
Because X₁, …, Xₙ are independent, identically distributed random variables, the above relationship is given (up to the factor 1/n, which does not affect the form of the kernel) as
$$\frac{1}{m^*_{\beta_0}(s)\, m^*_{\beta_0}(t)}\, E[k(s; X, h)\, k(t; X, h)] - 1.$$
But $E[k(s; X, h)\, k(t; X, h)] = \int k(s; x, h)\, k(t; x, h)\, dM_{\beta_0}(x)$, and since X ∼ N(0, σ²) and the kernel k is a N(0, h²) density, if we define S = X + E₁, T = X + E₂, with E₁, E₂ independent N(0, h²), the conditional density of S|X = x is N(x, h²), which is exactly the same as k(s; x, h). Write k(s; x, h) = k(s|x); similarly, k(t; x, h) = k(t|x). Therefore,
$$E[k(s; X, h)\, k(t; X, h)] = \int k(s|x)\, k(t|x)\, m_{\beta_0}(x)\, dx = \int f(x, s, t)\, dx = \tilde{K}_{S,T}(s, t),$$
with f(x, s, t) being the joint density of X, S, T. Thus, the kernel is given as
$$K(s, t) = \frac{\tilde{K}_{S,T}(s, t)}{m^*_S(s)\, m^*_T(t)} - 1,$$
with m*_S(s), m*_T(t) being N(0, σ² + h²) densities. □
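Proposition 2.1 (and, anticipating the next proposition, the claimed spectrum) can be checked numerically: discretizing the operator f ↦ ∫K(s, t)f(t)m*_T(t)dt on a grid, the leading eigenvalues should be close to β, β², …, with β = σ²/(σ² + h²). This is a numerical illustration under the stated normal model, not part of the proof.

```python
import numpy as np

sigma, h = 1.0, 0.5
v = sigma**2 + h**2                    # Var(S) = Var(T) = sigma^2 + h^2
beta = sigma**2 / v                    # claimed leading eigenvalue

t = np.linspace(-6 * np.sqrt(v), 6 * np.sqrt(v), 801)
dt = t[1] - t[0]
m = np.exp(-t**2 / (2 * v)) / np.sqrt(2 * np.pi * v)   # m*_S = m*_T = N(0, v)

S, T = np.meshgrid(t, t, indexing="ij")
# joint density of (S, T): bivariate normal with Cov(S, T) = sigma^2
det = v**2 - sigma**4
Q = (v * S**2 - 2 * sigma**2 * S * T + v * T**2) / det
Kjoint = np.exp(-0.5 * Q) / (2 * np.pi * np.sqrt(det))
K = Kjoint / np.outer(m, m) - 1        # the kernel of Proposition 2.1

# discretized operator f -> integral of K(s, t) f(t) m*_T(t) dt
A = K * m[None, :] * dt
eig = np.sort(np.linalg.eigvals(A).real)[::-1]
print(eig[:3], beta)                   # leading eigenvalues ~ beta, beta^2, beta^3
```

The "−1" in the kernel annihilates the constant eigenfunction, so the largest surviving eigenvalue is β rather than 1.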


To complete the distribution of the statistic we need the spectrum of K(s, t). To this end, we have the following proposition.

Proposition 2.2 The eigenvalues of the kernel K(s, t) are given as
$$\lambda_j = \left( \frac{\sigma^2}{\sigma^2 + h^2} \right)^j, \quad j = 1, 2, 3, \ldots,$$
and are powers of the regression coefficient determined by the joint normality of S, T. The eigenfunctions are fⱼ(t) = (βt)ʲ + (lower order polynomial in t). Furthermore, fⱼ(t), j = 1, 2, …, are linearly independent functions and form a basis in L₂.

Proof Observe that
$$\int \tilde{K}(s, t) \cdot 1 \cdot dt = m^*_S(s),$$
therefore
$$\int \frac{\tilde{K}(s, t)}{m^*_S(s)\, m^*_T(t)} \cdot 1 \cdot m^*_T(t)\, dt = 1.$$
Hence f₀(t) = 1 is the unit eigenfunction. Now
$$\int K(s, t) \cdot t \cdot m^*_T(t)\, dt = \int \frac{\tilde{K}(s, t)}{m^*_S(s)\, m^*_T(t)}\, t\, m^*_T(t)\, dt = \int \frac{\tilde{K}(s, t)}{m^*_S(s)}\, t\, dt = \int f_{T|S}(t|S = s)\, t\, dt = E[T|S = s] = \beta s,$$
where f_{T|S}(t|S = s) indicates the conditional density of T given S = s, and β is the regression coefficient determined by the joint normality of S, T, given as
$$\beta = \frac{\mathrm{Cov}(S, T)}{\sigma_S \sigma_T} = \frac{\mathrm{Cov}(X + E_1, X + E_2)}{\sigma^2 + h^2} = \frac{\sigma^2}{\sigma^2 + h^2}.$$
Thus f₁(t) = ct is an eigenfunction with eigenvalue β, and we determine c from the equation

$$\int f_1^2(t)\, m^*_T(t)\, dt = 1,$$
or equivalently,
$$c^2 \int t^2\, m^*_T(t)\, dt = 1 \;\Rightarrow\; \int t^2\, m^*_T(t)\, dt = \frac{1}{c^2} \;\Rightarrow\; c^2 = \frac{1}{\int t^2\, m^*_T(t)\, dt} = \frac{1}{\sigma^2 + h^2}.$$
Note also that
$$\int K(s, t)\, t^2\, m^*_T(t)\, dt = \int \frac{\tilde{K}(s, t)}{m^*_S(s)\, m^*_T(t)}\, t^2\, m^*_T(t)\, dt = \int f_{T|S}(t|s)\, t^2\, dt = E(T^2 | S = s) = (\beta s)^2 + \sigma^2_{T|S}.$$
Therefore, the eigenfunctions are of the form fⱼ(t) = (βt)ʲ + (lower order polynomial in t). □

The following proposition establishes that we have found all eigenvalues.

Proposition 2.3 The eigenvalues βʲ, j = 1, 2, …, with β = σ²/(σ² + h²), are all the eigenvalues of the kernel K(s, t).

Proof Suppose they are not. Then, there exists $f(t) = \sum_j a_j f_j(t) \neq 0$ such that
$$\int K(s, t)\, f(t)\, dt = \lambda f(s),$$
or, equivalently,

$$\sum_j a_j \int K(s, t)\, f_j(t)\, dt = \lambda f(s).$$
But this gives
$$\sum_j a_j \beta^j f_j(s) = \sum_j a_j \lambda f_j(s),$$
or
$$\sum_j a_j (\beta^j - \lambda)\, f_j(s) = 0.$$
Since not all aⱼ are 0, this last relationship cannot hold unless β = 0 or β = 1. □

Remark 2.1 Note that a recursion is always present, i.e. $Y \stackrel{dist}{=} \chi^2_1 + \beta Y$; equivalently, the density of Y satisfies the corresponding convolution identity $f_Y(y) = \int \chi^2_1(y - v)\, f_{\beta Y}(v)\, dv$, with χ²₁ indicating the density of a chi-squared random variable with 1 degree of freedom.


We thus have that the asymptotic distribution of the test statistic, under the null hypothesis of normality with μ = 0 and known variance σ², is the same as the distribution of the random variable $\sum_j \beta^j Z_j^2$, where the Zⱼ are independent standard normal random variables.
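The limiting law can be sampled directly by truncating the series. Note that for small h the weight β = σ²/(σ² + h²) is close to 1, so the series converges slowly and many terms contribute; the sketch below uses h = 0.5 (so β = 0.8) purely for illustration, with hypothetical parameter names.

```python
import numpy as np

def limit_quantile(sigma=1.0, h=0.5, q=0.95, J=100, reps=50_000, seed=0):
    """Monte Carlo quantile of the truncated series sum_{j=1}^J beta^j Z_j^2."""
    rng = np.random.default_rng(seed)
    beta = sigma**2 / (sigma**2 + h**2)
    w = beta ** np.arange(1, J + 1)            # weights beta, beta^2, ..., beta^J
    samples = rng.chisquare(1, size=(reps, J)) @ w
    return float(np.quantile(samples, q))

# with h = 0.5, beta = 0.8: the series has mean beta / (1 - beta) = 4
print(limit_quantile(q=0.5), limit_quantile(q=0.95))
```

Such simulated quantiles give asymptotic critical values; the next section calibrates the cutoff by direct finite-sample simulation instead.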

3 Simulation Results

In this section, we present simulation results to exemplify the performance of our test statistic. Assume that we are interested in testing whether samples come from a N(0, 1) distribution. We can estimate the nominal 0.05 level as follows. To identify the empirical 95th quantile of the distribution of the test statistic under the null hypothesis, we generated r = 1000 replications from a N(0, 1) of sample size 5,000. For each sample we compute Tₙ and order the values of Tₙ from smallest to largest. The empirical cutoff value is the 95th quantile of the above list. When we use h = 0.1, the 95th percentile of our test statistic is 4.4699, and the null hypothesis of normality is rejected when the value of Tₙ is greater than 4.4699.

Figures 1 and 2 present the power of the test statistic Tₙ as a function of the means under various alternative hypotheses in both uni-dimensional and higher-dimensional cases. Figure 1 shows the power for three different sample sizes, indicating an increase in power as the sample size increases, and for two different values of the smoothing parameter. Overall, the test statistic is powerful and able to detect alternatives close to the null hypothesis.
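The calibration recipe of this section (simulate the statistic under H₀, take the empirical 95th quantile, reject when the observed value exceeds it) is generic. The sketch below uses a simple squared-mean statistic as a placeholder rather than the paper's Tₙ; under H₀ that placeholder is asymptotically χ²₁, so the cutoff should land near 3.84.

```python
import numpy as np

def empirical_cutoff(stat, null_sampler, reps=1000, level=0.05, seed=0):
    """Empirical (1 - level) quantile of `stat` under draws from `null_sampler`."""
    rng = np.random.default_rng(seed)
    vals = sorted(stat(null_sampler(rng)) for _ in range(reps))
    return vals[int(np.ceil((1 - level) * reps)) - 1]

stat = lambda x: len(x) * np.mean(x)**2        # placeholder; T_n would be used in practice
sampler = lambda rng: rng.normal(size=5000)    # H0: N(0, 1), sample size 5000
cut = empirical_cutoff(stat, sampler)
print(cut)                                     # close to the chi2(1) 95th percentile, 3.84

rng = np.random.default_rng(1)
print(stat(rng.normal(0.05, 1, 5000)) > cut)   # small mean shift: typically rejected
```

Replacing the placeholder with Tₙ reproduces the procedure that yields the reported cutoff 4.4699.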

Fig. 1 Power of the test statistic Tn , for three sample sizes, as a function of the mean under the alternative hypothesis. Panel A presents the power with h = 0.1, while panel B uses h as given by the default smoothing parameter of the density estimation function in R


Fig. 2 Power of the test statistic Tn as a function of the four-dimensional mean µ = c1, where c = 1, 2, 3, 4, 5 and 1 = (1, 1, 1, 1)T . The sample size is n = 500 and data are generated from M V N (µ, I ). The smoothing matrix H = diag(0.01), I is the identity matrix

4 Discussion and Conclusions

In this paper, we discuss the role of statistical distances in goodness of fit testing. Our proposed test statistic for testing a simple null hypothesis is based on measures of statistical distance. The asymptotic distribution of the statistic is obtained and a test of normality is presented as an example of the derived distributional results. Of notice is the fact that tests based on statistical distances are locally quadratic. Lindsay et al. [10] constructed tests based on quadratic distances and studied their performance in terms of level and power. These authors propose as a test statistic an unbiased estimator of the quadratic distance and use it to study normality in dimension d > 2. In the current paper we use a general form of a statistical distance to base our test statistic on, identify the kernel that enters the local quadratic representation of the statistic in the case of testing for normality, and obtain its spectrum. The eigenvalues of the kernel, given as (σ²/(σ² + h²))ʲ, indicate the contribution of each of the components in the representation of the asymptotic distribution of the test. As h² → 0, (σ²/(σ² + h²))ʲ → 1 for each fixed j, and for j = 1 the asymptotic distribution is a simple χ²₁. We also present evidence of the performance of the test statistic in terms of its power.

Acknowledgements M. Markatou thanks the Troup Fund, KALEIDA Health Foundation (award number 82114), for providing financial support that funded the work of A. Liu. She also thanks Huipei Wang for providing technical assistance.

References

1. Arizono, I., Ohta, H.: A test for normality based on Kullback-Leibler information. Am. Stat. 43(1), 20–22 (1989)
2. Basu, A., Lindsay, B.G.: Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Ann. Inst. Stat. Math. 46, 683–705 (1994)
3. Chen, H.S., Lai, K., Ying, Z.: Goodness-of-fit tests and minimum power divergence estimators for survival data. Stat. Sin. 14(1), 231–248 (2004)
4. Chen, Y., Markatou, M.: Kernel tests for one, two and k-sample goodness-of-fit: state of the art and implementation considerations. In: Zhao, Y., Chen, D.G. (eds.) Statistical Modeling in Biomedical Research. Emerging Topics in Statistics and Biostatistics, pp. 309–337. Springer, Cham (2020)
5. Csiszár, I.: Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hung. 2, 299–318 (1967)
6. D'Agostino, R.B., Stephens, M.A.: Goodness-of-Fit Techniques. Marcel Dekker, New York (1986)
7. Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B., Smola, A.: A kernel two-sample test. J. Mach. Learn. Res. 13(1), 723–773 (2012)
8. Lequesne, J., Regnault, P.: Vsgoftest: an R package for goodness-of-fit testing based on Kullback-Leibler divergence. J. Stat. Softw. 96(1), 1–26 (2020)
9. Lindsay, B.G.: Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Stat. 22, 1081–1114 (1994)
10. Lindsay, B.G., Markatou, M., Ray, S.: Kernels, degrees of freedom, and power properties of quadratic distance goodness-of-fit tests. J. Am. Stat. Assoc. 109(505), 395–410 (2014)
11. Lindsay, B.G., Markatou, M., Ray, S., Yang, K., Chen, S.-C.: Quadratic distances on probabilities: a unified foundation. Ann. Stat. 36, 983–1006 (2008)
12. Markatou, M., Chen, Y.: Non-quadratic distances in model assessment. Entropy 20(6), 464 (2018)
13. Markatou, M., Chen, Y., Afendras, G., Lindsay, B.G.: Statistical distances and their role in robustness. In: Chen, D.G., Jin, Z., Li, G., Li, Y., Liu, A., Zhao, Y. (eds.) New Advances in Statistics and Data Science. ICSA Book Series in Statistics, pp. 3–26. Springer, Cham (2017)
14. Markatou, M., Karlis, D., Ding, Y.: Distance-based inference. Ann. Rev. Stat. Appl. 8, 301–327 (2021)
15. Morales, D., Pardo, L., Vajda, I.: Some new statistics for testing hypotheses in parametric models. J. Multivar. Anal. 62(1), 137–168 (1997)
16. Noughabi, H.A., Balakrishnan, N.: Tests of goodness of fit based on phi-divergence. J. Appl. Stat. 43(3), 412–429 (2016)
17. Pardo, L., Zografos, K.: Goodness of fit tests with misclassified data based on phi-divergences. Biom. J. 42(2), 223–237 (2000)
18. Rayner, J.C.W., Thas, O., Best, D.J.: Smooth Tests of Goodness of Fit: Using R, 2nd edn. Wiley, Singapore (2009)
19. Salicrú, M., Morales, D., Menéndez, M.L., Pardo, L.: On the applications of divergence type measures in testing statistical hypotheses. J. Multivar. Anal. 51(2), 372–391 (1994)

Phi-divergence Test Statistics Applied to Latent Class Models for Binary Data

Pedro Miranda, Ángel Felipe, and Nirian Martín

Abstract In this paper we present two new families of test statistics for studying the problem of goodness-of-fit of some data to a latent class model for dichotomous questions based on phi-divergence measures. We also treat the problem of selecting the best model out of a sequence of nested latent class models. In both problems, we study the asymptotic distribution of the corresponding test statistics, showing that they share the same behavior as the corresponding maximum likelihood test statistic.

1 Introduction and Basic Concepts

Latent class modelling is based on the distinction between manifest and latent variables. While manifest variables can be directly observed, like socioeconomic variables, item responses in a questionnaire or some codification of observed behavior, latent variables cannot be observed or measured by means of a yardstick. In this paper dichotomous observed variables are considered.

Consider a set S of N people: S := {S₁, …, S_N}. Each person S_v is asked to answer k dichotomous items I₁, …, I_k; let us denote by y_vi the answer (right = 1, wrong = 0) of person S_v to item I_i and y_v := (y_{v1}, …, y_{vk}) a generic pattern given by person S_v. A categorical latent variable (a categorical unobservable variable) is postulated to exist, whose different levels partition the set S into m mutually exclusive and exhaustive latent classes C₁, …, C_m whose corresponding weights are w₁, …, w_m. Let us denote
$$p_{ji} = \Pr(y_{vi} = 1 \mid S_v \in C_j), \quad j = 1, \ldots, m, \; i = 1, \ldots, k.$$

P. Miranda (B) · Á. Felipe · N. Martín
Complutense University of Madrid, Madrid, Spain
e-mail: [email protected]
Á. Felipe e-mail: [email protected]
N. Martín e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_20


Let y_ν be a possible answer vector. We shall assume that in each class the answers to the different questions are stochastically independent. We will denote by N_ν, ν = 1, …, 2ᵏ, the number of times that the sequence y_ν appears in an N-sample and p̂ := (N₁/N, …, N_{2ᵏ}/N) the corresponding proportions. The likelihood function L is given by
$$L_{\mathbf{y}_1, \ldots, \mathbf{y}_{2^k}}(w_1, \ldots, w_m, p_{11}, \ldots, p_{mk}) = \frac{N!}{\prod_{\nu=1}^{2^k} n_\nu!} \prod_{\nu=1}^{2^k} \Pr(\mathbf{y}_\nu)^{n_\nu}, \qquad (1)$$

where n_ν is the sample result for N_ν. In this model the unknown parameters are w_j, j = 1, ..., m, and p_{ji}, j = 1, ..., m, i = 1, ..., k. In order to avoid the problem of obtaining uninterpretable estimates of the item latent probabilities lying outside the interval [0, 1], some authors [4–8, 10] proposed a linear-logistic parametrization given by

p_{ji} = \frac{\exp(x_{ji})}{1 + \exp(x_{ji})}, \quad j = 1, \ldots, m, \; i = 1, \ldots, k,

and

w_j = \frac{\exp(z_j)}{\sum_{h=1}^{m} \exp(z_h)}, \quad j = 1, \ldots, m.

Next, restrictions are introduced relating the parameters x_{ji}, z_j to some explanatory parameters λ_r, r = 1, ..., t, and η_s, s = 1, ..., u, so the final model is given by

p_{ji} = \frac{\exp\left(\sum_{r=1}^{t} q_{jir} \lambda_r + c_{ji}\right)}{1 + \exp\left(\sum_{r=1}^{t} q_{jir} \lambda_r + c_{ji}\right)}, \quad j = 1, \ldots, m, \; i = 1, \ldots, k,    (2)

and

w_j = \frac{\exp\left(\sum_{r=1}^{u} v_{jr} \eta_r + d_j\right)}{\sum_{h=1}^{m} \exp\left(\sum_{r=1}^{u} v_{hr} \eta_r + d_h\right)}, \quad j = 1, \ldots, m,    (3)

Phi-divergence Test Statistics Applied to Latent Class Models for Binary Data


where Q_r = (q_{jir}), C = (c_{ji}) and V = (v_{jr}) are fixed and known. Consequently, in this case the vector of unknown parameters is θ := (λ, η). It is not difficult to establish [3] that

\log L(w_1, \ldots, w_m, p_{11}, \ldots, p_{mk}) = -N D_{Kullback}(\hat{p}, p(\lambda, \eta)) + \text{constant}.

Based on this, varying the divergence measure considered, we obtain a family of estimators that includes the MLE. Consider two probability distributions p = (p_1, ..., p_M) and q = (q_1, ..., q_M) and a function φ that is convex for x > 0 and satisfies φ(1) = 0, 0 φ(0/0) = 0. The φ-divergence measure between the probability distributions p and q is defined by

D_\varphi(p, q) := \sum_{i=1}^{M} p_i \, \varphi\left(\frac{q_i}{p_i}\right).
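As an illustration of this definition, a minimal Python sketch (the function and variable names are ours, not the chapter's):

```python
import math

def phi_divergence(p, q, phi):
    # D_phi(p, q) = sum_i p_i * phi(q_i / p_i), for strictly positive p_i.
    return sum(pi * phi(qi / pi) for pi, qi in zip(p, q))

# The choice phi(x) = x*log(x) - x + 1 satisfies phi(1) = 0, is convex for
# x > 0, and recovers a Kullback-Leibler-type divergence.
kl_phi = lambda x: x * math.log(x) - x + 1.0
```

With this convention, phi_divergence(p, p, kl_phi) is 0 for any distribution p and positive otherwise, as required of a divergence.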

2 Goodness-of-Fit Tests

The fit of a LCM for binary data is assessed by comparing the observed classification frequencies to the expected frequencies predicted by the model. When dealing with the MLE, the difference is formally assessed with a likelihood ratio test statistic or with a chi-square test statistic:

G^2 = 2N \sum_{\nu=1}^{2^k} \hat{p}_\nu \log \frac{\hat{p}_\nu}{p(y_\nu, \hat{\lambda}, \hat{\eta})}, \qquad X^2 = \sum_{\nu=1}^{2^k} \frac{\left(n_\nu - N p(y_\nu, \hat{\lambda}, \hat{\eta})\right)^2}{N p(y_\nu, \hat{\lambda}, \hat{\eta})}.    (4)

It is known that the asymptotic distribution of the test statistics G^2 and X^2 is a chi-square distribution with 2^k − (u + t) − 1 degrees of freedom [8]. These statistics can be extended in two ways: first, differences between observed and expected values can be measured in terms of a divergence measure; next, the estimation of the parameters used to obtain the expected values can also be carried out in terms of a divergence measure.

Definition 2.1 We define the φ-divergence family of test statistics for testing goodness-of-fit for latent class models for binary data as

T_{\varphi_2}^{\varphi_1} := \frac{2N}{\varphi_1''(1)} D_{\varphi_1}\left(\hat{p}, p(\hat{\theta}_{\varphi_2})\right),    (5)

where we use φ_2 for estimation and φ_1 for comparing with the observed data. For this family, the following holds.


Theorem 2.1 Under the hypothesis that the LCM for binary data with parameters λ = (λ_1, ..., λ_t) and η = (η_1, ..., η_u) holds, the asymptotic distribution of the family of test statistics T_{\varphi_2}^{\varphi_1} given in (5) is a chi-square distribution with 2^k − (u + t) − 1 degrees of freedom. It is noteworthy that the asymptotic distribution depends on neither φ_1 nor φ_2, i.e. it is the same for any functions φ_1 and φ_2 considered.

3 Nested Latent Class Models

Suppose that our model fits the data, i.e. we conclude that the data can be explained through a LCM with m classes. Then it could be the case that several sets of parameters fit the data. If two LCM fit the data but one of them has a reduced number of parameters, then this model should be considered the more appropriate. In this section we deal with the problem of selecting the best model from a nested sequence of LCM.

In general, we shall assume that we have s LCM {M_l}_{l=1,...,s} in such a way that the parameter space associated with M_l, l = 1, ..., s, is Θ_{M_l} and

\Theta_{M_s} \subset \Theta_{M_{s-1}} \subset \cdots \subset \Theta_{M_1} \subset \mathbb{R}^r

holds. Let us denote dim(Θ_{M_l}) = h_l, l = 1, ..., s, with h_s < h_{s-1} < ... < h_1 ≤ r, i.e., the parameters of one LCM are a subset of the parameters of the previous LCM. Our strategy is to test successively

H_{l+1}: \theta \in \Theta_{M_{l+1}} \quad \text{against} \quad H_l: \theta \in \Theta_{M_l}, \quad l = 1, \ldots, s - 1,    (6)

and we continue to test as long as the null hypothesis is accepted, choosing the LCM M_l with parameter space Θ_{M_l} according to the first l such that H_{l+1} is rejected (as null hypothesis) in favor of H_l (as alternative hypothesis). The classical statistics for solving (6) are

G^2_{A-B} = 2 \sum_{\nu=1}^{2^k} n_\nu \log \frac{p(y_\nu, \hat{\theta}^A)}{p(y_\nu, \hat{\theta}^B)}, \qquad X^2_{A-B} = N \sum_{\nu=1}^{2^k} \frac{\left(p(y_\nu, \hat{\theta}^A) - p(y_\nu, \hat{\theta}^B)\right)^2}{p(y_\nu, \hat{\theta}^B)}.    (7)

The asymptotic distribution of the test statistics G^2_{A-B} and X^2_{A-B} is a chi-square distribution with h_l − h_{l+1} degrees of freedom.


Hence, proceeding as in the previous section, we can define two new families of test statistics. A generalization of G^2_{A-B} is given by

S_{A-B}^{\varphi_1, \varphi_2} = \frac{2N}{\varphi_1''(1)} \left[ D_{\varphi_1}\left(\hat{p}, p(\hat{\theta}^B_{\varphi_2})\right) - D_{\varphi_1}\left(\hat{p}, p(\hat{\theta}^A_{\varphi_2})\right) \right],    (8)

and a generalization of X^2_{A-B} is

T_{A-B}^{\varphi_1, \varphi_2} = \frac{2N}{\varphi_1''(1)} D_{\varphi_1}\left(p(\hat{\theta}^A_{\varphi_2}), p(\hat{\theta}^B_{\varphi_2})\right).    (9)

Now, the following can be proved.

Theorem 3.1 Given the LCM for binary data A, B with parameters θ^A = (θ_{A,1}, θ_{A,2}, θ_{A,3}, θ_{A,4}) and θ^B = (θ_{A,1}, 0, θ_{A,3}, 0), respectively, and under the null hypothesis given in (6), it follows that

S_{A-B}^{\varphi_1, \varphi_2}, \; T_{A-B}^{\varphi_1, \varphi_2} \xrightarrow[N \to \infty]{\;L\;} \chi^2_{h_l - h_{l+1}}.

4 An Example with Real Data

In order to shed light on the behavior of the families established in the previous sections, let us deal with a problem with real data. We consider the interview data collected in [1] and analyzed in [9]. The experiment consists in studying the answers of 3398 schoolboys to two questions about their membership in the "leading crowd" on two occasions, t_1 and t_2 (October 1957 and May 1958). Thus, in this model we have four questions and four manifest variables; the answers can only be "low" (value 0) or "high" (value 1). The sample data are given in Table 1. Next, 4 latent classes are considered, namely

C_1 ≡ low agreement in question 1 and low agreement in question 2.
C_2 ≡ low agreement in question 1 and high agreement in question 2.
C_3 ≡ high agreement in question 1 and low agreement in question 2.
C_4 ≡ high agreement in question 1 and high agreement in question 2.

Table 1 The set of data collected by Coleman

Oct. 1957 \ May 1958    00     01     10     11
00                     554    338     97     85
01                     281    531     75    184
10                      87     56    182    171
11                      49    110    140    458


Table 2 The model design according to Formann

Class  Item   λ1  λ2  λ3  λ4  λ5  λ6  λ7  λ8
1      1       1   0   0   0   0   0   0   0
1      2       0   0   1   0   0   0   0   0
1      3       0   0   0   0   1   0   0   0
1      4       0   0   0   0   0   0   1   0
2      1       1   0   0   0   0   0   0   0
2      2       0   0   0   1   0   0   0   0
2      3       0   0   0   0   1   0   0   0
2      4       0   0   0   0   0   0   0   1
3      1       0   1   0   0   0   0   0   0
3      2       0   0   1   0   0   0   0   0
3      3       0   0   0   0   0   1   0   0
3      4       0   0   0   0   0   0   1   0
4      1       0   1   0   0   0   0   0   0
4      2       0   0   0   1   0   0   0   0
4      3       0   0   0   0   0   1   0   0
4      4       0   0   0   0   0   0   0   1
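The linear-logistic parametrization (2)-(3) that such a design feeds can be sketched in Python as follows; the array layouts and function names are ours:

```python
import math

def item_probs(Q, C, lam):
    # Eq. (2): p_ji is the logistic transform of the linear predictor
    # x_ji = sum_r q_jir * lam_r + c_ji, so p_ji always lies in (0, 1).
    m, k = len(C), len(C[0])
    p = [[0.0] * k for _ in range(m)]
    for j in range(m):
        for i in range(k):
            x = sum(q * l for q, l in zip(Q[j][i], lam)) + C[j][i]
            p[j][i] = math.exp(x) / (1.0 + math.exp(x))
    return p

def class_weights(V, d, eta):
    # Eq. (3): w_j is a softmax of z_j = sum_r v_jr * eta_r + d_j,
    # so the weights are positive and sum to one.
    z = [sum(v * e for v, e in zip(V[j], eta)) + d[j] for j in range(len(V))]
    total = sum(math.exp(zj) for zj in z)
    return [math.exp(zj) / total for zj in z]
```

Feeding in a design such as Table 2 yields item probabilities that, by construction, never leave [0, 1], which is exactly the motivation for the parametrization.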

Consequently, there are 16 probability values p_{ji} to be estimated. Let us start with the problem of goodness-of-fit; for this, we consider the hypothesis "The attitudinal changes between times t_1 and t_2 are dependent on the positions (low, high) of the respective classes on the underlying attitudinal scales at t_1". This implies that the probabilities depend only on the definition of the class. Thus, a model with 8 parameters λ_i is considered; λ_1 means low agreement in the first question at time t_1, λ_2 means high agreement in the first question at time t_1, and so on. We write the values for the matrices Q_i as they appear in [8] in Table 2.

In order to study whether the data come from a LCM for binary data under the conditions explained before, we shall consider the particular family of power divergence measures introduced in [2] and defined as

\varphi(x) \equiv \varphi_a(x) = \begin{cases} \frac{1}{a(a+1)}\left(x^{a+1} - x - a(x-1)\right), & a \neq 0, \; a \neq -1, \\ x \log x - x + 1, & a = 0, \\ -\log x + x - 1, & a = -1. \end{cases}    (10)

In [3] it was established that a competitive alternative to the MLE is the MφE obtained from Eq. (10) with a = 2/3. Therefore, we are going to consider this estimator in our study. In this case the estimates are very similar to those obtained in [7] using the MLE. In Table 3 we present the values obtained.

We want to study the goodness-of-fit of our data. Following Sect. 2, we shall consider the family of test statistics T_{\varphi_{2/3}}^{\varphi_a}, obtained from φ_a(x) for several values of a. The results are presented in Table 4.
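A sketch of how the statistics of Table 4 can be computed, following the conventions above (the probabilities p_model would come from the fitted LCM; the function names are ours):

```python
import math

def phi_a(x, a):
    # Cressie-Read power-divergence function of Eq. (10).
    if a == 0:
        return x * math.log(x) - x + 1.0
    if a == -1:
        return -math.log(x) + x - 1.0
    return (x ** (a + 1) - x - a * (x - 1.0)) / (a * (a + 1.0))

def T_stat(n_obs, p_model, a):
    # T = (2N / phi_a''(1)) * D_{phi_a}(p_hat, p_model); for this family
    # phi_a''(x) = x^(a-1), so phi_a''(1) = 1 and the factor is just 2N.
    N = sum(n_obs)
    p_hat = [n / N for n in n_obs]
    div = sum(ph * phi_a(pm / ph, a)
              for ph, pm in zip(p_hat, p_model) if ph > 0)
    return 2.0 * N * div
```

T_stat is 0 when the model probabilities equal the observed proportions, and for a = 1 it reduces to a chi-square-type quadratic statistic.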

Table 3 Estimations of the parameters when a = 2/3

Parameter   Estimate        Parameter   Estimate
λ̂1         −2.34292610     p̂1,1       0.08762969
λ̂2          1.72393168     p̂1,2       0.30144933
λ̂3         −0.84040580     p̂1,3       0.11256540
λ̂4          1.56524945     p̂1,4       0.28671773
λ̂5         −2.06480043     p̂2,1       0.08762969
λ̂6          2.29928080     p̂2,2       0.82710532
λ̂7         −0.91137901     p̂2,3       0.11256540
λ̂8          2.01252338     p̂2,4       0.88210569
η̂1          0.50480183     p̂3,1       0.84863457
η̂2          0.16964329     p̂3,2       0.30144933
η̂3         −0.87356633     p̂3,3       0.90881746
η̂4         −0.00424661     p̂3,4       0.28671773
ŵ1          0.38936544     p̂4,1       0.84863457
ŵ2          0.27848377     p̂4,2       0.82710532
ŵ3          0.09811597     p̂4,3       0.90881746
ŵ4          0.23403482     p̂4,4       0.88210569

Table 4 Statistics for different divergence measures

a                   −1      −1/2    0       2/3     1       3/2     2       5/2     3
T_{φ2/3}^{φa}       1.279   1.278   1.277   1.277   1.277   1.277   1.278   1.279   1.281

Now, the distribution of these statistics under the null hypothesis that the model fits the data is a χ²_4 (with 4 = 16 − 8 − 3 − 1 degrees of freedom); as χ²_{4;0.05} = 9.49, we conclude that we have no evidence to reject our model. Notice that the values for all test statistics are very similar; this was expected, as the sample size under consideration is large enough (N = 3398) to apply the asymptotic results of Theorem 2.1.

As a conclusion, we could say that the LCM proposed (which we will call M_1) fits our data; now, a question arises: is it possible to find a latent model with a reduced number of parameters that also fits the data? In [8], the following models are studied:

M_2: Attitudinal changes between the two moments depend on the latent classes but are independent of the items.
M_3: Attitudinal changes between the two moments are independent both of the items and of the latent classes.
M_4: There are no attitudinal changes.

We can observe that Θ_{M_1} ⊃ Θ_{M_2} ⊃ Θ_{M_3} ⊃ Θ_{M_4}.


Table 5 Results for the Example in Sect. 4 for statistics S (left) and T (right)

             S:                                T:
a            M1−M2   M2−M3   M3−M4            M1−M2   M2−M3   M3−M4
−1           3.761   4.610   31.465           3.431   4.613   31.005
−1/2         3.757   4.593   30.977           3.417   4.604   30.845
0            3.755   4.584   30.769           3.403   4.595   30.722
2/3          3.754   4.578   30.626           3.386   4.585   30.616
1            3.754   4.580   30.659           3.378   4.580   30.587
3/2          3.756   4.586   30.820           3.366   4.574   30.574
2            3.759   4.599   30.991           3.355   4.570   30.597
5/2          3.763   4.617   31.347           3.344   4.566   30.655
3            3.769   4.641   31.765           3.334   4.563   30.749
χ²_{;0.05}   5.99    3.84    3.84             5.99    3.84    3.84

For testing, we consider the families S_{A-B}^{\varphi_a, \varphi_{2/3}} and T_{A-B}^{\varphi_a, \varphi_{2/3}} given in (8) and (9) for several values of a. In Table 5 we present the results obtained. As a conclusion, we can adopt the LCM M_2 as the best model. The values obtained are very similar, due again to the asymptotic results of Sect. 3.

5 Conclusions

In this paper we have introduced φ-divergence test statistics in the context of LCM for binary data. Classically, these problems have been solved on the basis of the likelihood ratio test and the chi-square test statistic. In this paper, we have derived two families of test statistics based on φ-divergence measures that generalize the likelihood ratio and chi-square test statistics; we have obtained their asymptotic distribution under the null hypothesis that the LCM fits the data, showing that it coincides with that of the likelihood ratio and chi-square test statistics; thus, they show the same behavior as the classical statistics for large sample sizes. To see the applicability of this theory, we have considered a real data situation studied by Goodman [9] and Formann [8]. We conclude the paper by remarking that a simulation study has been carried out showing that several members of these families behave better than the classical test statistics for small and moderate sample sizes.

Acknowledgements This paper is dedicated to Prof. Leandro Pardo, an outstanding researcher but, more important, an exceptional person.


References

1. Coleman, J.: Introduction to Mathematical Sociology. Free Press, New York (1964)
2. Cressie, N., Read, T.: Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B 46, 440–464 (1984)
3. Felipe, A., Miranda, P., Pardo, L.: Minimum φ-divergence estimation in constrained latent class models for binary data. Psychometrika 80(4), 1020–1042 (2015)
4. Formann, A.: Schätzung der Parameter in Lazarsfelds Latent-Class-Analyse. Technical Report 18, Institut für Psychologie der Universität Wien (1976)
5. Formann, A.: Log-lineare Latent-Class-Analyse. Technical Report 20, Institut für Psychologie der Universität Wien (1977)
6. Formann, A.: A note on parametric estimation for Lazarsfeld's latent class analysis. Psychometrika 48, 123–126 (1978)
7. Formann, A.: Linear logistic latent class analysis. Biom. J. 24, 171–190 (1982)
8. Formann, A.: Constrained latent class models: theory and applications. Br. J. Math. Stat. Psychol. 38, 87–111 (1985)
9. Goodman, L.: Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61, 215–231 (1974)
10. Lazarsfeld, P., Henry, N.: Latent Structure Analysis. Houghton-Mifflin, Boston (1968)

Cross-sectional Stochastic Frontier Parameter Estimator Using Kullback-Leibler Divergence

Ahmed Shatla, Carlos Carleos, Norberto Corral, Antonia Salas, and María Teresa López

Abstract A frontier analysis builds a linear econometric model including an individual (producer, firm) non-negative random effect called inefficiency. A stochastic frontier analysis includes in the model, besides the inefficiency, a random residual deviation. Cross-sectional data refer to a single moment in time, that is, they are not longitudinal (panel) data. In this article a Kullback-Leibler divergence-based estimator for cross-sectional stochastic frontier model parameters is introduced, along with an outlier elimination procedure that is enabled when the sample shows extreme skewness. In comparison with the classic maximum-likelihood method, the new method presents larger bias but smaller mean squared error.

1 Introduction

Roughly speaking, the available data for a study can be gathered at some concrete timestamp (e.g. some year, some day), in which case one speaks of cross-sectional data; or else they correspond to several time periods, and such data are called longitudinal or panel data. In this work only cross-sectional data are considered.

Econometric models often establish a multiplicative relationship between the explanatory variables in order to approximate the response variable. Thus, a logarithmic transformation is usually applied to get a linear model:

Y_i = B_0 \cdot \prod_{j=1}^{p} X_{ji}^{\beta_j} \;\xrightarrow{\log}\; y_i = \beta_0 + \sum_{j=1}^{p} \beta_j \cdot x_{ji}    (1)

A. Shatla (B) Department of Mathematics and Statistics, Ain Shams University, Cairo, Egypt e-mail: [email protected] C. Carleos · N. Corral · A. Salas · M. T. López Departamento de Estadística e Investigación Operativa y Didáctica de la Matemática, Universidad de Oviedo, Oviedo, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_21



where • i denotes the individual (usually a producer or firm) • Yi is the production of i, and yi = log Yi • X 1i , . . . , X pi are the values of p explanatory variables for individual i, and x ji = log X ji • B0 , β0 , . . . , β p are coefficients, with β0 = log B0

1.1 Deterministic Frontier

When the model (1) stands for an upper bound or frontier of the production, a term is added representing the producer's (in)efficiency in reaching that bound. So a frontier analysis model looks like:

Y_i = B_0 \cdot \prod_{j=1}^{p} X_{ji}^{\beta_j} \cdot U_i \;\xrightarrow{\log}\; y_i = \beta_0 + \sum_{j=1}^{p} \beta_j \cdot x_{ji} - u_i

where U_i ∈ [0, 1] is called the technical efficiency and −log U_i = u_i ∈ [0, ∞] is known as the inefficiency.

1.2 Stochastic Frontier

The previous model is hard to fit to real data because of its rigidness. Most often, random deviations are allowed for, so a residual term is added to the model:

Y_i = B_0 \cdot \prod_{j=1}^{p} X_{ji}^{\beta_j} \cdot V_i \cdot U_i \;\xrightarrow{\log}\; y_i = \beta_0 + \sum_{j=1}^{p} \beta_j \cdot x_{ji} + v_i - u_i

where V_i ∈ [0, ∞] and log V_i = v_i ∈ [−∞, ∞] is the symmetric residual.

In these models, the terms B_0 \cdot \prod_{j=1}^{p} X_{ji}^{\beta_j} \cdot V_i and \beta_0 + \sum_{j=1}^{p} \beta_j \cdot x_{ji} + v_i constitute the stochastic frontier.

2 Methods

The following probability distributions are often used for stochastic frontier models:

• v_i ~ N(0, σ_v): normal residual
• u_i ~ |N(0, σ_u)|: half-normal inefficiency.


The model parameters are:

• β = (β_0, ..., β_p): linear coefficients
• σ_v²: residual variance
• σ_u²: inefficiency-related variance

but the following parameterization is also usual:

• Dispersion: σ² = σ_u² + σ_v²
• Inefficiency-related standard deviation ratio: λ = σ_u / σ_v.

2.1 Maximum Likelihood

The most popular method to estimate stochastic frontier parameters is maximum likelihood, for which free software implementations are available [1]. In order to express the likelihood in a convenient manner, the compound residual is usually defined as ε_i = v_i − u_i, whose density is

f(\varepsilon_i) = \frac{2}{\sigma} \cdot \phi\left(\frac{\varepsilon_i}{\sigma}\right) \cdot \Phi\left(\frac{-\lambda \cdot \varepsilon_i}{\sigma}\right).    (2)

In these conditions,

E(\varepsilon_i) = -\sigma_u \sqrt{\frac{2}{\pi}}, \qquad \mathrm{Var}(\varepsilon_i) = \sigma_\varepsilon^2 = \sigma_v^2 + \frac{\pi - 2}{\pi} \sigma_u^2.

Besides, it can be proven that the theoretical asymmetry is always negative:

\mathrm{skewness}(\varepsilon) = \frac{\sqrt{2}\,(\pi - 4)\,\lambda^3}{\left(\pi + (\pi - 2)\lambda^2\right)^{3/2}} < 0.

Notwithstanding, by simulation one gets 48% of positively skewed samples when n = 100 and λ = 0.5, for instance. Even with λ = 1 one gets 33% of such samples.

It is easy to theoretically derive estimators for all parameters, either by maximum likelihood or by the method of moments [4]. In practice, however, the just mentioned highly frequent positive skewness leads to inconsistencies for the method of moments


Fig. 1 Distribution of maximum likelihood estimates of λ = 0.5 when n = 100

and zero-valued estimates for λ in maximum likelihood maximization. This happens mainly with moderate-sized samples, for instance n = 100, and small values of λ (0 < λ < 1.5). Figure 1 shows an example of the distribution of such estimates. In any case, good estimates of the following parameters can be obtained:

• β_1, ..., β_p (but not of the intercept β_0)
• σ_ε² = Var(ε_i).

Thus, other methods can make use of such estimates to obtain more convenient estimates of λ.
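The density (2) and the theoretical skewness above can be checked numerically; a small sketch (function names are ours):

```python
import math

def std_norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def density_eps(e, sigma, lam):
    # Density (2) of the compound residual eps = v - u.
    return (2.0 / sigma) * std_norm_pdf(e / sigma) * std_norm_cdf(-lam * e / sigma)

def skewness_eps(lam):
    # Theoretical skewness of eps: negative for every lam > 0.
    return (math.sqrt(2.0) * (math.pi - 4.0) * lam ** 3
            / (math.pi + (math.pi - 2.0) * lam ** 2) ** 1.5)
```

The density integrates to one for any σ and λ, and skewness_eps stays negative and bounded below by approximately −0.995.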

2.2 Common Area An attempt to circumvent the problems of estimating λ when the sample skewness is positive was proposed by the authors [4] and consists of the following two stages:

Fig. 2 Two common areas obtained from two different values of a theoretical density parameter

1. Estimating σ_ε² and obtaining the residuals e.
2. Finding the λ maximizing the common area between
   • the parametric density of the residuals, f_λ̂(e), and
   • the non-parametric estimated density of the residuals, f̂(e).

This method will be denoted in the results section as ac0. The common area between two densities f and g is defined as

\int_{\mathbb{R}} \min\{f(x), g(x)\}\, dx,

so, given two common areas as in Fig. 2, the parameter value producing the largest (blue) area is to be preferred.

The procedure to obtain the non-parametric density, mentioned in the second stage of the method, uses symmetric kernel functions. Notice, on the other hand, that the theoretical density to be estimated is skewed because of the inefficiency component, so this can produce a biased density estimate when the inefficiency effect is large. Therefore, it has been proposed [4] to submit the residuals to a double transformation before the density is estimated, that is:

1. Symmetrizing: first, the skewness is to be eliminated. If F is the cumulative distribution function of X, then F(X) has a uniform, hence symmetric, distribution. Given a value of λ, numerical integration is applied to the density (2) to obtain an estimate F̂ of the distribution function, which is in turn applied to the residuals e to get the transformed residuals F̂(e).
2. Smoothing: the uniform distribution, which the transformed residuals should follow after the previous stage, has a density that, because of its abrupt discontinuities, is hard to estimate via kernel functions. To bypass this problem, the residuals are transformed again through the quantile function of a standard


normal distribution. In this way, the doubly transformed residuals follow a Gaussian distribution, suitable for density estimation by means of Gaussian kernel functions.

The previous procedure ac0 is now transformed to:

1. Estimating σ_ε² and obtaining the residuals e.
2. Finding the λ maximizing the common area between
   • the standard normal density φ, and
   • the non-parametric density of the transformed residuals, f̂(Φ^{-1}[F̂(e)]).

This method will be denoted in the results section as ac.
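A numerical sketch of the common-area criterion itself (the grid and the helper names are ours):

```python
import math

def common_area(f, g, lo=-6.0, hi=6.0, n=2001):
    # Midpoint-rule approximation of the integral of min{f(x), g(x)};
    # it equals 1 when f = g are densities concentrated inside [lo, hi].
    h = (hi - lo) / n
    return h * sum(min(f(lo + (i + 0.5) * h), g(lo + (i + 0.5) * h))
                   for i in range(n))

def normal_pdf(m):
    return lambda x: math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)
```

For two unit-variance normal densities one unit apart, the common area is 2Φ(−1/2) ≈ 0.617; the λ chosen by ac is the one making this overlap as large as possible.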

2.3 Kullback-Leibler Divergence

The KL-divergence between two densities f and g is defined [3] as

\int_{\mathbb{R}} f(x) \log \frac{f(x)}{g(x)}\, dx.

In this paper, an estimation method for the parameters of the cross-sectional stochastic frontier model is first proposed, very similar to that described in Sect. 2.2 but replacing the maximization of the common area by the minimization of the divergence. This is solved by a numerical version of a classic development of the Principle of Minimum Divergence [2]. Two new methods are defined (kl0 and kl), which share their first stage with the previous ones:

1. Estimating σ_ε² and obtaining the residuals e.
2. Finding the λ minimizing the divergence between
   kl0 (using the original residuals):
   • the parametric density of the residuals, f_λ̂(e), and
   • the non-parametric density of the residuals, f̂(e);
   kl (using the transformed residuals):
   • the standard normal density φ, and
   • the non-parametric density of the transformed residuals, f̂(Φ^{-1}[F̂(e)]).
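The divergence can be approximated on the same kind of grid as the common area (a sketch; names are ours):

```python
import math

def kl_divergence(f, g, lo=-6.0, hi=6.0, n=2001):
    # Midpoint-rule approximation of the integral of f(x) log(f(x)/g(x))
    # over [lo, hi]; zero points of either density are skipped.
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        fx, gx = f(x), g(x)
        if fx > 0.0 and gx > 0.0:
            total += fx * math.log(fx / gx)
    return total * h

def normal_pdf(m, s=1.0):
    return lambda x: (math.exp(-0.5 * ((x - m) / s) ** 2)
                      / (s * math.sqrt(2.0 * math.pi)))
```

kl_divergence(normal_pdf(0), normal_pdf(1)) is approximately 0.5, the known value d²/2 for two unit normals d apart; the kl and kl0 methods pick the λ minimizing this quantity instead of maximizing the common area.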

2.4 Skewness-Fixing Methods

As aforementioned, theoretically the skewness of the residuals is always negative (Sect. 2.1), but a high percentage of the samples show positive skewness. Not so importantly,


something similar happens at the other extreme: theoretically the skewness is lower-bounded by

\lim_{\lambda \to \infty} \mathrm{skewness}(\varepsilon) = \frac{\sqrt{2}\,(\pi - 4)}{(\pi - 2)^{3/2}} \approx -0.995,

but it is usual to obtain samples with lower skewness. These cases of extreme skewness produce extreme estimates of λ. For samples with positive skewness, the most likely estimate of λ is zero, which corresponds to a symmetric distribution. When λ is small, the obtained frequency of zeros is very high (see Fig. 1).

2.4.1 Bootstrapping

It can be thought that, if the sample represents the population, then bootstrap samples from it would represent the population as well. Because the population's skewness is bounded between −0.995 and 0, bootstrap samples whose skewness lies in that interval could well represent the population. So the following method, klboot, is proposed too:

1. Estimating σ_ε² and obtaining the residuals e.
2. Setting a copy r ← e.
3. If it is not fulfilled that −0.995 < skewness(r) < 0, then
   a. r ← bootstrap(e);
   b. go back to 3 to check the skewness.
4. Finding the λ minimizing the divergence between
   • the standard normal density φ, and
   • the non-parametric density of the transformed residuals, f̂(Φ^{-1}[F̂(r)]).
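The resampling loop can be sketched as follows (the cap on iterations is our addition to guarantee termination):

```python
import random

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

def bootstrap_fix(e, max_iter=1000):
    # Resample with replacement until the sample skewness falls inside
    # (-0.995, 0), the range attainable by the theoretical residuals.
    r = list(e)
    for _ in range(max_iter):
        if -0.995 < skewness(r) < 0.0:
            return r
        r = [random.choice(e) for _ in range(len(e))]
    return r  # give up after max_iter resamples
```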

2.4.2 Outlier Removal

After simulating a few samples, it is seen that, most often, extreme skewness is due to outliers, that is, individuals with an extremely high or low residual. So, instead of a random resampling as in the previous section, a removal of those outliers can be performed in order to get a sample with adequate skewness. Previously [4] a threshold was proposed for considering an extreme point as an outlier; in this paper an iterative procedure is used, similar to the above-mentioned bootstrapping: the removal is iterated until the skewness lies between −0.995 and 0, or until a considerable fraction (by default, 10%) of the sample has already been removed. So the method kloutl is presented as follows:

1. Estimating σ_ε² and obtaining the residuals e.
2. If it is not fulfilled that −0.995 < skewness(e) < 0, then


   a. if skewness(e) < −0.995, then e ← e \ {min e};
   b. else (skewness(e) > 0), so e ← e \ {max e};
   c. go back to 2 to check the skewness again.
3. Finding the λ minimizing the divergence between
   • the standard normal density φ, and
   • the non-parametric estimated density of the transformed residuals, f̂(Φ^{-1}[F̂(e)]).
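The trimming stage can be sketched as follows (names are ours; the skewness helper is the usual moment-based estimator):

```python
def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / n / s2 ** 1.5

def trim_outliers(e, max_frac=0.10):
    # Iteratively drop the most extreme residual until the sample skewness
    # lies in (-0.995, 0) or max_frac of the sample has been removed.
    e = sorted(e)
    limit = int(max_frac * len(e))
    removed = 0
    while removed < limit and not (-0.995 < skewness(e) < 0.0):
        if skewness(e) < -0.995:
            e.pop(0)   # extreme skewness to the left: drop the minimum
        else:
            e.pop()    # positive skewness: drop the maximum
        removed += 1
    return e
```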

3 Results

In order to check the effectiveness of the methods presented in the previous section, a series of simulations was performed. The following model is considered:

• y_i = 10 + 5 x_i + v_i − u_i
• σ_v² = 1
• sample size n = 100 or n = 300.

Several values of λ are examined, determining σ_u² = λ² σ_v². The simulations consist of 10 000 replicates of each parameter combination. The table columns are:

λ       Real value of the inefficiency parameter.
skew    Sample skewness.
σ̂_ε²    Sample residual variance.
ml      Maximum likelihood method.
ac      Common area method.
ac0     Common area method applied to untransformed residuals.
kl      Kullback-Leibler divergence-based method.
kl0     KL-method applied to untransformed residuals.
klboot  KL-method with bootstrapping stage.
kloutl  KL-method with outlier removal stage.

Medians of the estimates are shown in Table 1, and median squared errors in Table 2.

Table 1 Medians of estimates of λ

n    λ     skew    σ̂ε²   ml    ac    ac0   kl    kl0   klboot  kloutl
100  0.50  −0.02   1.06   0.39  0.38  0.38  0.31  1.10  0.95    0.41
100  1.00  −0.12   1.33   0.97  0.96  0.96  0.86  1.19  1.11    0.87
100  2.00  −0.41   2.39   2.05  1.93  1.88  1.75  1.79  1.77    1.75
300  0.50  −0.02   1.08   0.45  0.45  0.45  0.39  0.54  0.47    0.43
300  1.00  −0.13   1.35   1.00  0.99  0.98  0.90  0.91  0.91    0.90
300  2.00  −0.44   2.44   2.01  1.94  1.90  1.79  1.79  1.79    1.79


Table 2 Median squared errors of estimates of λ

n    λ     ml    ac    ac0   kl    kl0   klboot  kloutl
100  0.50  0.24  0.21  0.21  0.25  0.36  0.25    0.25
100  1.00  0.59  0.54  0.51  0.40  0.40  0.40    0.36
100  2.00  0.38  0.29  0.23  0.30  0.30  0.30    0.30
300  0.50  0.24  0.20  0.20  0.25  0.25  0.25    0.25
300  1.00  0.13  0.12  0.12  0.10  0.10  0.10    0.10
300  2.00  0.10  0.10  0.08  0.11  0.11  0.11    0.11
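One replicate of the simulation design can be generated as below; the covariate distribution is not stated in the chapter, so the uniform x_i is our assumption:

```python
import random

def simulate_sample(n, lam, sigma_v=1.0, beta0=10.0, beta1=5.0):
    # y_i = 10 + 5 x_i + v_i - u_i with normal noise v_i and
    # half-normal inefficiency u_i, where sigma_u = lam * sigma_v.
    sigma_u = lam * sigma_v
    x = [random.uniform(0.0, 1.0) for _ in range(n)]
    v = [random.gauss(0.0, sigma_v) for _ in range(n)]
    u = [abs(random.gauss(0.0, sigma_u)) for _ in range(n)]
    y = [beta0 + beta1 * xi + vi - ui for xi, vi, ui in zip(x, v, u)]
    return x, y
```

Each estimator would then be run on 10 000 such replicates to produce the medians and median squared errors of Tables 1 and 2.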

4 Conclusions

The effect of the sample size is crucial. With n = 300 all methods give estimates reasonably centered on the true value of λ and, for λ ≥ 1, the squared errors are acceptable. A sample size of n = 100 seems not enough to adequately separate the effects of the inefficiency and the residual randomness in a cross-sectional scenario.

A value of λ = 0.5 produces too large squared errors with the tested sample sizes, so we can assume that prohibitively large samples would be required for cross-sectional data when the inefficiency-related variance is moderate in comparison with the noise. Only high-quality data would be useful for such effects.

As previously published [4], the ml, ac and ac0 methods are mostly unbiased, and the common-area-based ones give slightly smaller squared errors than maximum likelihood.

With respect to the new methods presented here, based on the Kullback-Leibler divergence, it can be said that they are globally comparable to the common-area methods. Concerning bias, kl0 and klboot performed poorly in the most difficult scenario, n = 100 and λ = 0.5. About the squared error, it can be highlighted that kloutl beats all the other methods in the intermediate scenario (λ = 1) when n = 100, where every KL-based method performed quite well.

In summary, the new KL-methods rival the previously proposed ones, and are better in moderate conditions, especially the method implementing skewness-fixing by outlier removal (kloutl).

Acknowledgements This work has been partially funded by project Grants PID2019-104486GBI00 of the Spanish Ministry of Science and Innovation and AYUD/2021/50897 of the Principality of Asturias Counseling of Science, Innovation and University.


References

1. Coelli, T., Henningsen, A.: frontier: Stochastic Frontier Analysis. R package version 1.1 (2013). http://CRAN.R-Project.org/package=frontier
2. Gil Álvarez, P., Pardo Llorente, L., Gil Álvarez, M.A.: Incertidumbre, Información e Inferencia Estadística, Sect. 8.6. Facultad de Ciencias, Universidad de Oviedo, Spain (1995)
3. Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22, 79–86 (1951)
4. Shatla, A.M.H.: New approaches to stochastic frontier analysis. Ph.D. Thesis, Universidad de Oviedo, Spain (2017)

Clustering and Representation of Time Series. Application to Dissimilarities Based on Divergences J. Fernando Vera

Abstract Time series classification has shown to be a very successful strategy for determining structures in temporal data. In addition, the combined use of cluster analysis and multidimensional scaling has proven to be an advisable strategy which leads to a better understanding of the data than can be gained from each separately. One of the most popular partitioning algorithms is K-means clustering, whose main drawback for its application to time series is generally the lack of precision of the cluster centroid in capturing the shape of the series, partly derived from the dependence structure of the data. In this paper we propose the combined use of cluster-MDS in time series, which aims at partitioning the time series into clusters while the cluster centres are simultaneously represented in a space of low dimension. The combined procedure, which does not require the estimation of time series centroids, is illustrated by analyzing dissimilarity measures based on divergence for synthetic data sets.

1 Introduction

In most applications of cluster analysis, the basic data set is a standard N × p matrix X, which contains the values of p variables describing a set of N objects to be clustered. The K-means algorithm for clustering (MacQueen [10], Hartigan and Wong [3]) is one of the most popular optimization clustering techniques. It produces a partition of the rows of X into a specified number K of non-overlapping groups, on the basis of their proximities, and each observation is classified in the cluster with the nearest mean value, typically accounted for in terms of squared Euclidean distances. Since K-means clustering is an NP-complete problem, the "optimal" partition for a fixed value of K will be the one estimated in relation to the particular optimisation algorithm employed.

A Euclidean embedding of the data points, which is the situation assumed in the procedure described above, is not always directly available, for example when some variables in X are not measured on a continuous scale, some entries are missing, or when the particular role of the columns of X induces a different framework, as, for example, in time series. For time series, the clustering problem arises in a wide variety of fields and there have recently been a large number of contributions in this regard (see for example Liao [8]). In addition, the spatial representation of the clusters may facilitate their interpretation. In K-means clustering, groups of objects can be represented as cluster centres in a subspace of dimension K − 1, which is low-dimensional only if either p or K is small. Visualization techniques such as multidimensional scaling in combination with clustering have proven to be an advisable tool for better understanding in data analysis. Several of these combined procedures have been proposed in a deterministic framework for two-way one-mode data (see Heiser and Groenen [5]; Vera et al. [14]), as well as for two-way two-mode data (Vera et al. [15]; Vera et al. [18]). In the context of latent class models for two-way one-mode data, Vera et al. [16] and Vera et al. [15] have proposed a cluster-MDS model for dissimilarity data. For two-way, two-mode data, Vera et al. [17] have developed a Simulated Annealing (SA) based cluster-unfolding procedure for interval scaled preference data.

Different dissimilarity measures have been considered for time series (see for example Montero and Vilar [9]). Here, we are interested in divergence-related measures to compare a model-free approach, for example dissimilarity measures based on nonparametric spectral estimators (Kakizawa et al. [7]), with a complexity-based approach, such as measures of dissimilarity based on α-divergence (Brandmaier [1]). This last measure generalizes the Kullback-Leibler divergence, and the parameter α can be chosen to obtain a symmetric divergence.

J. F. Vera (B)
Department of Statistics and O.R., Faculty of Sciences, University of Granada, Granada, Spain
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_22
One of the main drawbacks of applying K-means clustering to time series is the lack of precision of the cluster centroid in capturing the shape of the series. In this paper we propose a combined cluster-MDS procedure in which the estimation of the cluster centres is not required. The procedure aims at partitioning the time series into K clusters while simultaneously representing the cluster centres in a space of low dimension M. The methodology is illustrated by analysing clustered synthetic data in conjunction with divergence-based dissimilarity measures.

2 Divergence-Based Dissimilarity Measures for Time Series

To compare results for model-free and model-based approaches, two divergence-based dissimilarity measures implemented in the R package TSclust (Montero and Vilar [9]) are considered. Model-free approaches refer to dissimilarity measures based on the closeness of the series values at the same points in time. In particular, here we consider dissimilarities based on nonparametric spectral estimators (Kakizawa et al. [7]), given by

d_W(s_i, s_j) = (1/(4π)) ∫_{−π}^{π} W( f_{s_i}(λ) / f_{s_j}(λ) ) dλ,   (1)


where f_{s_i}(λ) and f_{s_j}(λ) denote the spectral densities of s_i and s_j, respectively, and W is a divergence function ensuring that d_W is a dissimilarity. In particular, an approximation to this estimator is considered in which the spectral density is replaced by the exponential transformation of local linear smoothers of the log-periodograms, fitted by least squares (d_{W(LS)}).

On the other hand, complexity-based dissimilarity measures are related to the divergence between permutation distributions of order patterns in the m-embedding of the original series, S̃_m = {s̃_m = (s_t, ..., s_{t+m}), t = 1, ..., T − m}. The permutation obtained by sorting the elements of s̃_m in ascending order is called a codeword of s̃_m, and the distribution P(s̃) of these codewords over all windows is used to characterise the complexity of the series. The dissimilarity between two series (d_{PDC}) is then measured in terms of the dissimilarity between their corresponding codeword distributions (Montero and Vilar [9]).
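To make the codeword construction concrete, the following sketch (plain Python; the function names and the choice of a symmetrized, smoothed Kullback-Leibler divergence are ours, not the TSclust implementation) estimates the permutation distribution over the m-embedding of a series and compares two series:

```python
from collections import Counter
import math

def codeword_distribution(s, m):
    """Relative frequency of each codeword (the permutation that sorts a
    window ascending) over all windows (s_t, ..., s_{t+m})."""
    counts = Counter()
    for t in range(len(s) - m):
        window = s[t:t + m + 1]
        codeword = tuple(sorted(range(m + 1), key=lambda i: window[i]))
        counts[codeword] += 1
    total = sum(counts.values())
    return {cw: c / total for cw, c in counts.items()}

def pdc_dissimilarity(s1, s2, m=2, eps=1e-12):
    """Symmetrized Kullback-Leibler divergence between the two codeword
    distributions (eps smooths codewords unseen in one of the series)."""
    p, q = codeword_distribution(s1, m), codeword_distribution(s2, m)
    support = set(p) | set(q)
    def kl(a, b):
        return sum(a.get(cw, eps) * math.log(a.get(cw, eps) / b.get(cw, eps))
                   for cw in support)
    return 0.5 * (kl(p, q) + kl(q, p))
```

A monotone series yields a single codeword, so its distribution is degenerate; any series with a richer pattern distribution is then at positive distance from it.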

3 Simultaneous K-means Clustering and MDS Representation

Let S = {s_1, ..., s_N} be a set of N time series of length T without missing data, and Δ = (δ_ij), i, j = 1, ..., N, a symmetric matrix of dissimilarities between them. Based on Δ, each series is assumed to be allocated to one and only one of K clusters. Denote by E an indicator matrix of order N × K whose elements e_ik are equal to one if series s_i belongs to cluster k, and zero otherwise. Thus, denoting J_k = {s_i | e_ik = 1} for k = 1, ..., K, the hypothesis that the clusters form a partition is expressed as J_k ∩ J_l = ∅ for k ≠ l, and ∪_k J_k = S.

The cluster differences scaling problem can be stated as finding a configuration X of K points x_k, k = 1, ..., K, in a Euclidean metric space of low dimension M ≤ N − 1, which is optimal in the sense that the associated vector of Euclidean distances d in R^{K(K−1)/2} approaches as closely as possible the corresponding between-cluster dissimilarities. The model assumes that when the allocation gives series i ∈ J_k and series j ∈ J_l, their dissimilarity is represented by the Euclidean distance d_kl between the cluster representative points x_k and x_l, which is thus constant for all pairs of series in which the first is chosen from cluster k and the second from cluster l. For a partition, the dissimilarities are assumed to vary randomly within a cluster while the corresponding distance is constant, whereas between clusters, differences in distance reflect the tendency of the corresponding dissimilarities to vary systematically. Therefore, the objective is to minimize the loss function (named STRESS), which allows a weight w_ij for each pair of series, and which is defined as

Stress = Σ_{k≤l} Σ_{i∈J_k} Σ_{j∈J_l} w_ij (δ_ij − d_kl)².   (2)
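The loss (2) is easy to evaluate for a candidate solution. A minimal sketch (plain Python; names are ours, unit weights are assumed when w is omitted, and within-cluster pairs contribute (δ_ij − 0)², since the distance from a cluster centre to itself is zero):

```python
import math

def cds_stress(delta, labels, centers, w=None):
    """STRESS of Eq. (2): sum over pairs i < j of w_ij * (delta_ij - d_kl)^2,
    where d_kl is the Euclidean distance between the centres of the clusters
    containing series i and j."""
    n = len(delta)
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    stress = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            wij = 1.0 if w is None else w[i][j]
            dkl = dist(centers[labels[i]], centers[labels[j]])
            stress += wij * (delta[i][j] - dkl) ** 2
    return stress
```

For instance, a configuration whose between-centre distances exactly reproduce the between-cluster dissimilarities, with zero within-cluster dissimilarities, attains zero STRESS.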


Parameter estimation is performed by an iterative least squares optimization procedure. First, an initial partition is given and an initial configuration is obtained from the Sokal-Michener dissimilarities [11] between the initial clusters using classical MDS. Then, both the classification and the configuration are estimated by an alternating k-means/MDS procedure that can be summarized in two phases; the details can be consulted in Heiser [4] and Heiser and Groenen [5]. In the multidimensional scaling phase, the configuration of cluster centres is obtained from the Sokal-Michener dissimilarities between clusters using SMACOF (de Leeuw and Heiser [2]); in the allocation phase, the time series are classified from the MDS distances between the cluster centres, minimizing the equivalent classical k-means criterion

κ²(E) = Σ_i Σ_k e_ik ‖a_i − b_k^(i)‖²,   (3)

where ‖a_i − b_k^(i)‖² denotes the squared Euclidean distance between the i-th row of the N × q matrix A = {a_ir} and the k-th row of the K × q matrix B^(i) = {b_kr^(i)}, with q = N − 1, and with the elements of A and B^(i), i = 1, ..., N, specified as a_ir = δ*_ir and b_kr^(i) = d*_kr, where δ*_ir = δ_iu and d*_kr = Σ_l e_ul d_kl(X), for r = 1, ..., N − 1, with u = r if r < i, and u = r + 1 if r ≥ i. The alternating optimization procedure minimizes (2) in each step, ending when two consecutive values of the Stress function differ by less than a previously specified value, generally 10^{−6}, yielding an optimal partition E of the time series and an optimal configuration X of the cluster centres.

To determine the number of clusters K when it has not been previously set, Vera and Macías [12, 13] have proposed a methodology based on the dissimilarities which has been shown to provide better results than the classical variance-based criteria in an extensive Monte Carlo experiment. Given any partition of the series P(S), a block-shaped partition matrix P(Δ_kl) is constructed, where δ_ij ∈ Δ_kl if s_i belongs to cluster k and s_j belongs to cluster l, for all i, j = 1, 2, ..., N. Then, the total dispersion of the dissimilarities can be expressed as

Σ_{k≤l} Σ_{i=1}^{N} Σ_{j=1}^{N} e_ik e_jl w_ij (δ_ij − δ̄)² = Σ_{k≤l} Σ_{i=1}^{N} Σ_{j=1}^{N} e_ik e_jl w_ij (δ_ij − δ̄_kl)² + Σ_{k≤l} ẘ_kl (δ̄_kl − δ̄)²,   (4)

where w_ij represents the weight that can be assigned to a pair of series, e.g., to deal with missing dissimilarities, ẘ_kl and δ̄_kl represent the number and the mean of the dissimilarities, respectively, in block Δ_kl, k ≤ l, and δ̄ is the overall mean of the dissimilarities. Denoting by

W*(K) = Σ_{k≤l} Σ_{i=1}^{N} Σ_{j=1}^{N} e_ik e_jl w_ij (δ_ij − δ̄_kl)²   and   B*(K) = Σ_{k≤l} ẘ_kl (δ̄_kl − δ̄)²,   (5)


the within-block and between-block dispersions, respectively, the Calinski-Harabasz index and Hartigan's rule are reformulated using (5), respectively, as

CH*(K) = [B*(K)/((K(K+1)/2) − 1)] / [W*(K)/(([N(N−1) − K(K+1)]/2))]   (6)

and

H*(K) = (W*(K)/W*(K+1) − 1)(([N(N−1) − K(K+1)]/2) − 1).   (7)

The suggested number of clusters K is associated with the largest value of CH*, and with the smallest value K ≥ 1 such that H*(K) ≤ 5N, respectively. In general, the results presented in Vera and Macías [12] show that this reformulation of Hartigan's rule performs considerably better than the original one for the simulated data sets. Furthermore, H* is even more efficient than the CH* index in some situations, and more so than the Silhouette criterion (Kaufman and Rousseeuw [6]) in the scenarios considered.
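Under this block formulation, W*(K), B*(K) and CH*(K) can be computed directly from the dissimilarity matrix and the cluster labels. A sketch (plain Python, unit weights w_ij = 1, so that ẘ_kl is the block size; function names are ours):

```python
def block_dispersions(delta, labels, K):
    """Within-block W*(K) and between-block B*(K) dispersions of Eq. (5),
    with unit weights; blocks Delta_kl collect delta_ij for cluster pair k <= l."""
    n = len(delta)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    blocks = {}
    for i, j in pairs:
        k, l = sorted((labels[i], labels[j]))
        blocks.setdefault((k, l), []).append(delta[i][j])
    overall = sum(delta[i][j] for i, j in pairs) / len(pairs)
    W = sum(sum((d - sum(b) / len(b)) ** 2 for d in b) for b in blocks.values())
    B = sum(len(b) * (sum(b) / len(b) - overall) ** 2 for b in blocks.values())
    return W, B

def ch_star(delta, labels, K):
    """Calinski-Harabasz analogue CH*(K) of Eq. (6)."""
    n = len(delta)
    W, B = block_dispersions(delta, labels, K)
    return (B / (K * (K + 1) / 2 - 1)) / (W / ((n * (n - 1) - K * (K + 1)) / 2))
```

In practice one would evaluate CH*(K) over a range of K and keep the maximizer, per the rule above.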

4 Illustrative Application

To illustrate the proposed procedure, a synthetic data set previously analyzed by Montero and Vilar [9] is used. Eighteen synthetic time series were simulated, consisting of three partial realisations of length T = 200 of each of six first-order autoregressive models. Model A is an AR(1) process with moderate autocorrelation. Model B is a bilinear process with an approximately quadratic conditional mean, and is thus strongly non-linear. Model E is an exponential autoregressive model very close to linearity. Model S is a self-exciting threshold autoregressive model with a relatively strong non-linearity. Finally, Model N is a general non-linear autoregressive model, and Model T a smooth transition autoregressive model with a weak non-linear structure. Details can be found in Montero and Vilar [9].

Since all series in this synthetic data set have the same length T = 200, the divergence-based dissimilarity measures can be applied directly using the R package TSclust. Thus, two symmetric matrices of dimension 18 × 18 were obtained: the d_{W(LS)} dissimilarities based on the spectral density estimated by the local linear method, and the d_{PDC} dissimilarities based on the permutation distribution distance between the time series. The cluster differences scaling procedure was run for K = 6 in two and in three dimensions, for both the d_{W(LS)} and the d_{PDC} dissimilarity matrices. The decomposition of the normalised stress values is shown in Table 1. As can be seen, lower normalized stress values were generally found for the d_{W(LS)} method, both in two and in three dimensions. Regarding the solution in three dimensions, the lowest stress value for the classification was found for the d_{W(LS)} procedure,


Table 1 Decomposition of the total stress value in terms of the classification and the representation of the cluster centers in each dimension, for both dissimilarity measures

        Dimension 2                 Dimension 3
        Total    Cluster   MDS      Total    Cluster   MDS
dwLS    0.0886   0.0266    0.0619   0.0844   0.0243    0.0600
dpdc    0.1214   0.0924    0.0289   0.095    0.0871    0.0074

while in terms of the representation the results were the opposite. Figure 1 shows the classification given by the cluster-MDS procedure for both dissimilarity measures. The quality of the classification was evaluated using the similarity criterion proposed by Montero and Vilar [9],

Sim(C1, C2) = (1/K) Σ_{k=1}^{K} max_{1≤j≤K} [ 2|C1_k ∩ C2_j| / (|C1_k| + |C2_j|) ],   (8)

where |C| is the cardinality of a group C. The similarity between the original classification and that given by the cluster-MDS procedure was 0.8178571 and 0.8555556 for the d_{W(LS)} method, and 0.7305556 and 0.6730159 for the d_{PDC} method, in two and three dimensions, respectively. The similarity obtained when partitioning around medoids (pam) was performed was 0.8555556 for the d_{W(LS)} dissimilarity matrix and 0.7150794 for the d_{PDC} dissimilarities. Hence, the cluster-MDS procedure obtained in general the same or better results in terms of classification than the pam procedure, while simultaneously also providing a representation of the clusters.

Figure 1 shows the best classification obtained for the synthetic series with the CDS method for both dissimilarity matrices. For the d_{W(LS)} procedure, the original classification was recovered almost completely, except for group four, which contains only one isolated series, and group five, in which the series corresponding to the non-linear models B and N were grouped together. As can be seen in the classification shown for the d_{PDC} procedure, except for the B model series classified in cluster five, the remaining series are generally dispersed throughout the other groups. Figure 2 shows the representation of the cluster centres in three dimensions, with a larger cluster separation when the d_{W(LS)} measure is employed, in consonance with the classification results. The position of the groups can be interpreted in terms of the linearity of the model used to generate the synthetic data. Groups 3 (S), 5 (B, N) and 6 (T) are close to each other, as they correspond to non-linear models, while groups 1 (E) and 2 (A), corresponding to a smooth non-linear model and a moderate autocorrelation model, respectively, occupy opposite positions in the graph.
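The similarity criterion (8) can be sketched as follows (plain Python; partitions are represented as lists of sets of series identifiers, a representation we choose for convenience):

```python
def sim(part1, part2):
    """Similarity index of Eq. (8): for each group of the first partition,
    take the best normalized overlap with any group of the second, and
    average over the K groups."""
    K = len(part1)
    total = 0.0
    for c1 in part1:
        total += max(2 * len(c1 & c2) / (len(c1) + len(c2)) for c2 in part2)
    return total / K
```

Identical partitions score 1; the score decreases as groups are split across clusters.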


[Fig. 1: two panels, "dwLS cluster-MDS classification (K=6, M=3)" and "dpdc cluster-MDS classification (K=6, M=2)", plotting the series labels (A1-A3, B1-B3, E1-E3, N1-N3, S1-S3, T1-T3) against cluster numbers 1-6.]

Fig. 1 Results of the classification for the synthetic data based on the d_{W(LS)} and d_{PDC} dissimilarities. The groups are displayed according to the best results of the cluster-MDS procedure for the d_{W(LS)} (in three dimensions) and d_{PDC} (in two dimensions) methods

[Fig. 2: three-dimensional scatter of the six cluster centres (axes Dim1, Dim2, Dim3), panel title "dwLS (K=6, M=3)".]

Fig. 2 Representation of the six cluster centers obtained with cluster-MDS in three dimensions for the d_{W(LS)} dissimilarities


5 Discussion

This paper illustrates a cluster-MDS procedure for the analysis of time series. The procedure aims at partitioning the time series into K clusters while simultaneously representing the cluster centres in a space of low dimension. In terms of clustering, one of the main advantages of the proposed procedure is that the estimation of the cluster centres is not required. In addition, the representation of the groups in an MDS framework facilitates the interpretation of the cluster structure of the data. The procedure is illustrated with divergence-based dissimilarity measures, offering insight into their use within model-free and model-based approaches. However, it can be used with any dissimilarity measure, which also allows series of different lengths or the presence of missing data. Different model-based cluster-MDS procedures have also been developed for dissimilarity analysis, and their use with different dissimilarity measures for time series is currently being investigated by the authors.

Acknowledgements This work has been partially supported by Grant RTI2018-099723-B-I00 (ERDF/Ministry of Science and Innovation—State Research Agency).

References

1. Brandmaier, A.M.: SEM trees: recursive partitioning with structural equation models. Ph.D. thesis, Universität des Saarlandes (2011). http://www.brandmaier.de/semtree
2. de Leeuw, J., Heiser, W.J.: Multidimensional scaling with restrictions on the configuration. In: Krishnaiah, P.R. (ed.) Multivariate Analysis V, pp. 501–522. North-Holland, Amsterdam (1980)
3. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a K-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)
4. Heiser, W.J.: Clustering in low-dimensional space. In: Lausen, B., Klar, R., Opitz, O. (eds.) Information and Classification: Concepts, Methods and Applications, pp. 162–173. Springer, Heidelberg (1993)
5. Heiser, W.J., Groenen, P.J.F.: Cluster differences scaling with a within-clusters loss component and a fuzzy successive approximation strategy to avoid local minima. Psychometrika 62(1), 63–83 (1997)
6. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
7. Kakizawa, Y., Shumway, R.H., Taniguchi, M.: Discrimination and clustering for multivariate time series. J. Am. Stat. Assoc. 93(441), 328–340 (1998)
8. Liao, T.W.: Clustering of time series data: a survey. Pattern Recognit. 38(11), 1857–1874 (2005)
9. Montero, P., Vilar, J.A.: TSclust: an R package for time series clustering. J. Stat. Softw. 62(1), 1–43 (2014)
10. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Le Cam, L.M., Neyman, J. (eds.) Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 2, pp. 281–297. University of California Press, Berkeley (1967)
11. Sokal, R.R., Michener, C.D.: A statistical method for evaluating systematic relationships. Univ. Kans. Sci. Bull. 38, 1409–1438 (1958)
12. Vera, J.F., Macías, R.: Variance-based cluster selection criteria in a K-means framework for one-mode dissimilarity data. Psychometrika 82(2), 275–294 (2017)


13. Vera, J.F., Macías, R.: On the behaviour of k-means clustering of a dissimilarity matrix by means of full multidimensional scaling. Psychometrika 86(2), 489–513 (2021)
14. Vera, J.F., Macías, R., Angulo, J.M.: Non-stationary spatial covariance structure estimation in oversampled domains by cluster differences scaling with spatial constraints. Stoch. Environ. Res. Risk Assess. 22, 95–106 (2008)
15. Vera, J.F., Macías, R., Angulo, J.M.: A latent class MDS model with spatial constraints for non-stationary spatial covariance estimation. Stoch. Environ. Res. Risk Assess. 23(6), 769–779 (2009)
16. Vera, J.F., Macías, R., Heiser, W.J.: A latent class multidimensional scaling model for two-way one-mode continuous rating dissimilarity data. Psychometrika 74(2), 297–315 (2009)
17. Vera, J.F., Macías, R., Heiser, W.J.: A dual latent class unfolding model for two-way two-mode preference rating data. Comput. Stat. Data Anal. 53(8), 3231–3244 (2009)
18. Vera, J.F., Macías, R., Heiser, W.J.: Cluster differences unfolding for two-way two-mode preference rating data. J. Classif. 30, 370–396 (2013)

Trends in Data Sciences

Proportional Odds COM-Poisson Cure Rate Model with Gamma Frailty and Associated Inference and Application

Narayanaswamy Balakrishnan, Tian Feng, and Hon-Yiu So

Abstract  We introduce in this work a gamma frailty cure rate model for lifetime data, assuming that the number of competing causes for the event of interest follows the Conway-Maxwell-Poisson (COM-Poisson) distribution and that the lifetimes of the non-cured individuals follow a proportional odds model. The baseline distribution is taken to be either Weibull or log-logistic. Statistical inference is then developed under non-informative right censoring. We derive the maximum likelihood estimators (MLEs) of all model parameters with the use of the Expectation-Maximization (EM) method. Model discrimination among some well-known special cases, including the geometric, Poisson, and Bernoulli models, is discussed under both likelihood- and information-based criteria. An extensive Monte Carlo simulation study is carried out to examine the performance of the proposed model as well as of all the inferential methods developed here. Finally, a cutaneous melanoma dataset is analyzed for illustrative purposes.

1 Introduction

A cure rate model is a long-term survival model that accommodates a surviving fraction of long-term survivors. It is commonly used in biomedical studies, as well as in other fields such as industrial reliability, finance, manufacturing, demography, and criminology. The cure rate model was first discussed in [7, 8] and has subsequently been studied by many authors.

N. Balakrishnan (B) · T. Feng, Department of Mathematics and Statistics, McMaster University, Hamilton, ON L8S 4K1, Canada. e-mail: [email protected]; T. Feng e-mail: [email protected]. H.-Y. So, Department of Mathematics and Statistics, Oakland University, Rochester, MI 48309, USA. e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_23

In general, a cure rate model can be seen as a two-component mixture model


S_p(t) = p_0 + (1 − p_0) S_s(t),   (1)

where p_0 is the probability of cure, and S_s(t) and S_p(t) are the survival functions of the susceptible individuals and of the population, respectively. More generally, a cure model can be approached through a competing risks setup as follows. Suppose M is an unobservable random variable denoting the number of competing causes related to the occurrence of the event of interest. Let W_j, j = 1, ..., m, be the random variables denoting the time-to-event for the j-th competing cause. Given M = m, W_1, ..., W_m are assumed to be independent and identically distributed (i.i.d.) with a common cumulative distribution function (c.d.f.) F(w) = 1 − S(w) and survival function S(w). Then, the population time-to-event or lifetime is given by

Y = min{W_0, W_1, ..., W_m},   (2)

where W_0 corresponds to individuals who are not susceptible to the event occurrence (i.e., with infinite lifetime). This leads to a proportion of cured individuals, known as the cure rate. The survival function for the entire population is then obtained from (2) as

S_p(y) = Σ_{m=0}^{∞} P(M = m) [S(y)]^m = A_M(S(y)),   (3)

where A_M(·) is the probability generating function (p.g.f.) of M.

Frailty models provide a convenient way to accommodate unobserved covariates and/or heterogeneity in survival data in the form of a frailty term. In this work, we assume a proportional odds model with a frailty term for the distribution of W_j (j = 1, ..., m), with a parametric assumption on the baseline odds function. To be more specific, the odds function of W_j is taken as

O(w, x | r) = r θ O_0(w),   (4)

where O(w) = S(w)/F(w) is the odds of survival up to time w, the proportionality term θ is linked to the covariates as e^{α'x_c}, with x_c = (x_1, ..., x_p)' a vector of p covariates and α = (α_1, ..., α_p)' the proportional odds regression coefficients, O_0(w) is the baseline odds function, and r is the frailty term, assumed to follow a gamma distribution with shape k > 0 and rate ξ > 0. The mean and variance of r are k/ξ and k/ξ², respectively. We set the mean equal to 1 to avoid non-identifiability in the model, so that k = ξ. Therefore, the probability density function of r is given by

f(r) = r^{ξ−1} ξ^{ξ} e^{−rξ} / Γ(ξ),   r ≥ 0, ξ > 0.   (5)

We can further obtain the survival function of W_j, by unconditioning on the frailty, as

S(t_i) = ∫_0^∞ [ r_i S_0(t_i) e^{α'x} / (r_i S_0(t_i) e^{α'x} + F_0(t_i)) ] f_r(r_i) dr_i,   t_i > 0,   (6)

with the corresponding probability density function (p.d.f.)

f(t_i) = ∫_0^∞ [ r_i f_0(t_i) e^{α'x} / (r_i S_0(t_i) e^{α'x} + F_0(t_i))² ] f_r(r_i) dr_i,   t_i > 0.   (7)
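For fixed parameters, the unconditional survival (6) can be approximated by numerically integrating over the gamma frailty. A rough sketch (plain Python with a simple trapezoidal rule on a truncated range; the function names, grid sizes, and example baseline are our choices, not the authors' implementation):

```python
import math

def gamma_frailty_pdf(r, xi):
    # gamma density of Eq. (5) with shape = rate = xi, so that E[r] = 1
    return r ** (xi - 1.0) * xi ** xi * math.exp(-r * xi) / math.gamma(xi)

def po_frailty_survival(t, xi, lin_pred, S0, n_grid=4000, r_max=30.0):
    """Approximate S(t) of Eq. (6): average the conditional proportional-odds
    survival over the gamma frailty by the trapezoidal rule on (0, r_max]."""
    s0 = S0(t)
    F0 = 1.0 - s0
    theta = math.exp(lin_pred)          # e^{alpha' x_c}
    h = r_max / n_grid
    total = 0.0
    for i in range(1, n_grid + 1):      # skip r = 0, where the integrand vanishes
        r = i * h
        cond = r * s0 * theta / (r * s0 * theta + F0)
        weight = 0.5 if i == n_grid else 1.0
        total += weight * cond * gamma_frailty_pdf(r, xi)
    return h * total
```

As a sanity check, for large ξ the frailty concentrates at 1 and the result approaches the frailty-free proportional odds survival S_0(t)θ/(S_0(t)θ + F_0(t)).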

The rest of this paper proceeds as follows. In Sect. 2, we briefly describe the COM-Poisson cure rate model and three of its well-known special cases. Section 3 describes the form of the data and the likelihood, while the estimation of the model parameters and associated inferential issues are discussed in Sect. 4. In Sect. 5, an extensive Monte Carlo simulation study is carried out. In Sect. 6, we discuss model discrimination using information- and likelihood-based methods. A dataset on cutaneous melanoma is then analyzed in Sect. 7 for illustrative purposes. Some concluding remarks are finally made in Sect. 8.

2 COM-Poisson Cure Rate Model

Suppose the number of competing causes M follows a COM-Poisson distribution with probability mass function (p.m.f.)

P(M = m; η, φ) = (1/Z(η, φ)) · η^m/(m!)^φ,   m = 0, 1, 2, ...,   (8)

where the normalizing constant is

Z(η, φ) = Σ_{j=0}^{∞} η^j/(j!)^φ,   (9)

with φ ≥ 0 and η > 0. Then, the corresponding cure rate is the probability

p_0 = P(M = 0; η, φ) = (Z(η, φ))^{−1}.   (10)

As a weighted Poisson random variable (r.v.), M reduces to a Poisson r.v. with mean η when φ = 1, and exhibits under- and over-dispersion for φ > 1 and φ < 1, respectively (see [9, 11, 16]). For example, M approaches a Bernoulli r.v. with parameter η/(1 + η) when φ → ∞, with Z(η, φ) → 1 + η, and M reduces to a geometric r.v. with parameter 1 − η if φ = 0 and η < 1, with Z(η, φ) = 1/(1 − η). Note that M is undefined when η ≥ 1 and φ = 0. The population survival and density functions of the time-to-event Y are then given by

S_p(y) = Z(η S(y), φ) / Z(η, φ),   y > 0,   (11)

f_p(y) = [f(y)/S(y)] · [1/Z(η, φ)] Σ_{j=1}^{∞} j (η S(y))^j/(j!)^φ,   y > 0.   (12)

Note that as y → ∞, S_p(y) → p_0 > 0, so S_p(y) is not a proper survival function. Suppose we have an indicator variable I such that I = 0 if the subject is immune (belongs to set I_0), with probability p_0, and I = 1 if the subject is susceptible (belongs to set I_1), with probability 1 − p_0. The cumulative distribution and survival functions of the overall population can then be viewed as those of a mixture of two populations, of the form

F_p(y) = P[Y ≤ y | I = 0] P(I = 0) + P[Y ≤ y | I = 1] P(I = 1) = F_s(y)(1 − p_0),   (13)
S_p(y) = P[Y > y | I = 0] P(I = 0) + P[Y > y | I = 1] P(I = 1) = p_0 + (1 − p_0) S_s(y).   (14)

For a detailed discussion of this model, interested readers may refer to [1–4, 6, 15].

3 Data and the Likelihood

Censoring is a common occurrence in survival analysis. In this paper, we assume the data are subject to non-informative right censoring. Hence, the observation time T_i is the minimum of the censoring time C_i and the actual lifetime Y_i for the i-th subject, i.e.,

T_i = min{Y_i, C_i},   i = 1, ..., n.   (15)

We define an indicator variable δ_i = I(Y_i ≤ C_i) for the i-th subject, such that δ_i = 1 if the lifetime is observed and δ_i = 0 if it is right censored; Δ_0 and Δ_1 are the sets of indices i with δ_i equal to 0 and 1, respectively, and the set Δ* contains all the i's. It is to be noted that the cure rate p_0 = Z(η, φ)^{−1} is purely a function of η for a fixed value of φ. The range of 1/p_0 is from 1 to ∞ and it is monotone in η. Therefore, it is natural to use a logistic link function of the form H_φ(η) = (1 + e^{x_i'β})^{−1} to link the covariates x to the cured proportion p_0i, i.e.,

p_0i = p_0(x_i, β) = Z(η, φ)^{−1} = H_φ(η) = (1 + e^{x_i'β})^{−1},   (16)

where p_0i is the cured proportion for the i-th individual, x_i = (1, x_ic')' = (1, x_i1, ..., x_ip)' is a vector of p + 1 covariates, and β is the vector of regression coefficients. Under this link function, η can be obtained from H_φ^{−1}(·) analytically for the geometric, Poisson and Bernoulli distributions, and numerically for the general COM-Poisson distribution.

For n pairs of observations (t, δ) = {(t_1, δ_1), ..., (t_n, δ_n)} corresponding to n individuals, the full observed-data likelihood function under non-informative censoring is given by

L(θ; t, δ) ∝ Π_{i=1}^{n} {f_p(t_i; θ)}^{δ_i} {S_p(t_i; θ)}^{1−δ_i} Π_{i∈Δ*} f_r(r_i),   (17)

where θ is the set of parameters (φ, β', α', γ'), which is equivalent to

L(θ; t, δ) ∝ Π_{i∈Δ1} f_p(t_i; θ) Π_{i∈Δ0} {p_0 + (1 − p_0) S_s(t_i; θ)} Π_{i∈Δ*} f_r(r_i).   (18)

Here, we consider two baseline distributions for the proportional odds survival model of the time-to-event random variable, namely, the Weibull and log-logistic distributions. It should be noted that the log-logistic distribution in fact possesses the proportional odds property, while the Weibull distribution does not. The survival function and p.d.f. of W under a Weibull baseline, for example, are

S(w, γ_0, γ_1) = [1 + e^{−x_c'α} (e^{(γ_1 w)^{1/γ_0}} − 1)]^{−1},   w > 0,   (19)

f(w, γ_0, γ_1) = (γ_1 w)^{1/γ_0} e^{x_c'α} e^{−(γ_1 w)^{1/γ_0}} / { γ_0 w [e^{−(γ_1 w)^{1/γ_0}} (e^{x_c'α} − 1) + 1]² },   w > 0,   (20)
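The algebra in (19)-(20) can be checked numerically: the density should equal the negative derivative of the survival function. A sketch (plain Python; the parameter values below are arbitrary, and the frailty is absorbed into the linear predictor for this check):

```python
import math

def po_weibull_surv(w, g0, g1, lin_pred):
    """Conditional survival of Eq. (19) with Weibull baseline."""
    return 1.0 / (1.0 + math.exp(-lin_pred) * (math.exp((g1 * w) ** (1.0 / g0)) - 1.0))

def po_weibull_pdf(w, g0, g1, lin_pred):
    """Density of Eq. (20)."""
    a = (g1 * w) ** (1.0 / g0)
    num = a * math.exp(lin_pred) * math.exp(-a)
    den = g0 * w * (math.exp(-a) * (math.exp(lin_pred) - 1.0) + 1.0) ** 2
    return num / den
```

A central finite difference of the survival function at an arbitrary point agrees with the density to numerical precision, confirming that (20) is −dS/dw.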

where γ_0 > 0 and γ_1 > 0 are the shape and scale parameters of the baseline Weibull distribution, respectively. If, instead, we assume the baseline distribution to be log-logistic, with γ_0 > 0 and γ_1 > 0 as the scale and shape parameters, respectively, then the corresponding odds function of W_j is given by

O(w; x_c, α | r) = (γ_0/w)^{γ_1} r e^{x_c'α} = O_0(w; γ_0, γ_1) r e^{x_c'α}.   (21)

We observe that, conditionally on r, W_j still follows a two-parameter log-logistic distribution (γ_0, γ_1 > 0), with shape parameter γ_1 and scale parameter γ_0 (r e^{x_c'α})^{1/γ_1}, and with corresponding survival function

S(w; x_c, α | r) = r γ_0^{γ_1} e^{x_c'α} / (r γ_0^{γ_1} e^{x_c'α} + w^{γ_1}),   w > 0.   (22)

Note that the mean does not exist if γ_1 < 1 and the variance does not exist if γ_1 < 2.


4 Estimation of Parameters

In this section, we present an Expectation-Maximization (EM) algorithm for determining the MLE of θ, and a profile likelihood approach for the estimation of the dispersion parameter φ. It is well known that EM is an effective technique for finding the MLEs of the unknown parameters of a model involving latent variables; see, for example, [13]. In our model, the random variables I_i are observed for i in the set Δ_1, but unobserved for i in the set Δ_0, where I_i = 1 if the individual is susceptible and I_i = 0 if the individual is cured. Let us denote the complete data by (t, δ, x, I) = {(t_1, δ_1, x_1, I_1), ..., (t_n, δ_n, x_n, I_n)}. The complete-data likelihood is then given by

L_c(t, δ, x, I, r) = Π_{i∈Δ1} {(1 − p_0i) f(t_i | r_i)} Π_{i∈Δ0} {p_0i^{1−I_i} [(1 − p_0i) S(t_i | r_i)]^{I_i}} Π_{i∈Δ*} f_r(r_i),   (23)

where I = (I_1, ..., I_n)', x_ic = (x_i1, ..., x_ip)' and x_i = (1, x_ic')'. The corresponding complete log-likelihood function is given by

l_c(θ; t, x, δ, I) = constant + Σ_{i∈Δ1} log f_p(t_i, x_i, θ) + Σ_{i∈Δ0} (1 − I_i) log p_0(β, x_i) + Σ_{i∈Δ0} I_i log[1 − p_0(β, x_i)] + Σ_{i∈Δ0} I_i log S_s(t_i, x_ic; θ) + Σ_{i∈Δ*} log f_r(r_i).   (24)

4.1 E-step

The expectation step calculates the expected value of the complete-data log-likelihood function with respect to the conditional distribution of the unobserved I_i's (i ∈ Δ_0), given the observed data O = {(t_i, δ_i, x_i), i ∈ Δ*} and the current estimates of the parameters θ^(k) = (β', γ')', for a fixed value of φ. Let us denote this function by

Q(θ*, π^(k)) = E(l_c(θ; t, x, δ, I) | O, θ^(k))   (25)

at the k-th iteration step. In our model, the I_i's are Bernoulli random variables, and we can easily find the conditional expectations for the i-th individual to be

E(K_i(t_i | r_i) | O, θ^(k)) = E(K_i(t_i | r_i) f_p(t_i | Y)^{δ_i} S_p(t_i | Y)^{1−δ_i} | θ^(k)) / E(f_p(t_i | Y)^{δ_i} S_p(t_i | Y)^{1−δ_i} | θ^(k)),   (26)

E(I_i K_i(t_i | r_i) | O, θ^(k)) = (1 − p_0i(x_i, β^(k))) E(K_i(t_i | r_i) S_s(t_i | Y) | θ^(k)) / [(1 − p_0i(x_i, β^(k))) E(S_s(t_i | Y) | θ^(k)) + p_0i(x_i, β^(k))],   (27)

where K(·) is a function of t_i, conditional on r_i. Now, for a fixed value of φ, the Q function is given by

Q(θ*, π^(k)) = Q_1(β, γ_0, γ_1, α) + Q_2(ξ),   (28)

with

Q_1(β, γ_0, γ_1, α) = Σ_{i∈Δ1} E_{3,i} − Σ_{i∈Δ1} E_{4,i} + Σ_{i∈Δ1} E_{5,i} + Σ_{i∈Δ0} E_{6,i} − Σ_{i∈Δ*} log(1 + e^{β'x_i}),   (29)

Q_2(ξ) = nξ log(ξ) − n log Γ(ξ) + Σ_{i∈Δ*} ((ξ − 1) E_{2i} − ξ E_{1i}),   (30)

where

E_{1i} = E(r_i | O, θ^(k)),  E_{2i} = E(log r_i | O, θ^(k)),  E_{3i} = E(log f_i | O, θ^(k)),
E_{4i} = E(log S_i | O, θ^(k)),  E_{5i} = E(log z_{2,i} | O, θ^(k)),  E_{6i} = E(I_i log z_{1,i} | O, θ^(k)),

and

z_{1,i} = z_1(θ; x_i, t_i) = Σ_{j=1}^{∞} {η_i S(t_i | r_i)}^j/(j!)^φ,   z_{2,i} = z_2(θ; x_i, t_i) = Σ_{j=1}^{∞} j {η_i S(t_i | r_i)}^j/(j!)^φ.
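The key E-step quantity has a simple closed form in the special case K ≡ 1: the posterior probability that a right-censored subject is susceptible. A sketch (plain Python; this is the structure of Eq. (27) with the expectation over the frailty already carried out, and the names are ours):

```python
def e_step_susceptible_prob(p0, Ss_t, censored):
    """Conditional expectation of the susceptibility indicator I_i given the
    observed data: 1 for observed failures, and for right-censored subjects
    the posterior probability (1 - p0) * Ss(t) / ((1 - p0) * Ss(t) + p0)."""
    if not censored:
        return 1.0
    return (1.0 - p0) * Ss_t / ((1.0 - p0) * Ss_t + p0)
```

The posterior weight equals 1 − p_0 at t = 0 and decays to 0 as the susceptible survival S_s(t) vanishes, as one would expect: a subject censored very late is almost surely cured.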

4.2 M-step

The M-step maximizes the function Q(θ, π^(k)) in (28) to obtain the improved estimate of θ, i.e.,

θ^(k+1) = arg max_θ Q(θ, π^(k)).   (31)

The MLEs of β and γ do not have explicit expressions, and so numerical maximization by the Newton-Raphson method is used here. For a fixed value of φ, the E-step and M-step are alternated until the parameter estimates converge to a desired level of accuracy. The parameter φ is then determined by the profile likelihood technique. Specifically, we consider a grid of φ values with a small increment; for each value of φ, the MLEs of the other parameters are found, and the value of φ with the largest likelihood is chosen as the final estimate. The following subsection presents explicit forms of the first- and second-order derivatives of the Q function, as well as the update functions, for the case of the COM-Poisson distribution, which are necessary for the numerical computation process.
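The profile-likelihood step for φ is a simple grid search: for each candidate φ, run the inner EM to convergence and record the maximized log-likelihood. A sketch of the outer loop (plain Python; the inner maximization is abstracted as a callable, and the toy objective in the test is ours):

```python
def profile_phi(log_lik_given_phi, phi_grid):
    """Profile-likelihood choice of the dispersion parameter: for each phi on
    a grid, maximize the remaining parameters (delegated to the supplied
    callable, e.g. an EM run for fixed phi) and keep the phi with the
    largest maximized log-likelihood."""
    best_phi, best_ll = None, float("-inf")
    for phi in phi_grid:
        ll = log_lik_given_phi(phi)     # inner maximization for fixed phi
        if ll > best_ll:
            best_phi, best_ll = phi, ll
    return best_phi, best_ll
```

The grid spacing trades off run time against the resolution of the φ estimate; a coarse scan followed by a finer local grid is a common refinement.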

4.3 COM-Poisson Cure Rate Model with Gamma Frailty

The required first- and second-order derivatives of the Q(θ*, π^{(k)}) function with respect to β and γ, for fixed values of φ, are as follows:
\[ \frac{\partial Q_1}{\partial \alpha_l} = \sum_{i\in\Delta_1} E_{3,i(\alpha_l)} - \sum_{i\in\Delta_1} E_{4,i(\alpha_l)} + \sum_{i\in\Delta_1} E_{5,i(\alpha_l)} + \sum_{i\in\Delta_0} E_{6,i(\alpha_l)}, \]
\[ \frac{\partial Q_1}{\partial \gamma_k} = \sum_{i\in\Delta_1} E_{3,i(\gamma_k)} - \sum_{i\in\Delta_1} E_{4,i(\gamma_k)} + \sum_{i\in\Delta_1} E_{5,i(\gamma_k)} + \sum_{i\in\Delta_0} E_{6,i(\gamma_k)}, \]
\[ \frac{\partial Q_1}{\partial \beta_h} = \sum_{i\in\Delta_1} E_{5,i(\beta_h)} + \sum_{i\in\Delta_0} E_{6,i(\beta_h)} - \sum_{i\in\Delta^*} \frac{x_{ih} e^{\beta' x_i}}{1+e^{\beta' x_i}}, \]
\[ \frac{\partial Q}{\partial \xi} = n\log(\xi) + n - n\psi_0(\xi) + \sum_{i\in\Delta^*} (E_{2i} - E_{1i}), \]
\[ \frac{\partial^2 Q_1}{\partial \alpha_l \partial \alpha_{l'}} = \sum_{i\in\Delta_1} E_{3,i(\alpha_l\alpha_{l'})} - \sum_{i\in\Delta_1} E_{4,i(\alpha_l\alpha_{l'})} + \sum_{i\in\Delta_1} E_{5,i(\alpha_l\alpha_{l'})} + \sum_{i\in\Delta_0} E_{6,i(\alpha_l\alpha_{l'})}, \]
\[ \frac{\partial^2 Q_1}{\partial \alpha_l \partial \gamma_k} = \sum_{i\in\Delta_1} E_{3,i(\alpha_l\gamma_k)} - \sum_{i\in\Delta_1} E_{4,i(\alpha_l\gamma_k)} + \sum_{i\in\Delta_1} E_{5,i(\alpha_l\gamma_k)} + \sum_{i\in\Delta_0} E_{6,i(\alpha_l\gamma_k)}, \]
\[ \frac{\partial^2 Q_1}{\partial \gamma_k \partial \gamma_{k'}} = \sum_{i\in\Delta_1} E_{3,i(\gamma_k\gamma_{k'})} - \sum_{i\in\Delta_1} E_{4,i(\gamma_k\gamma_{k'})} + \sum_{i\in\Delta_1} E_{5,i(\gamma_k\gamma_{k'})} + \sum_{i\in\Delta_0} E_{6,i(\gamma_k\gamma_{k'})}, \]
\[ \frac{\partial^2 Q_1}{\partial \alpha_l \partial \beta_h} = \sum_{i\in\Delta_1} E_{5,i(\alpha_l\beta_h)} + \sum_{i\in\Delta_0} E_{6,i(\alpha_l\beta_h)}, \qquad \frac{\partial^2 Q_1}{\partial \gamma_k \partial \beta_h} = \sum_{i\in\Delta_1} E_{5,i(\gamma_k\beta_h)} + \sum_{i\in\Delta_0} E_{6,i(\gamma_k\beta_h)}, \]
\[ \frac{\partial^2 Q_1}{\partial \beta_h \partial \beta_{h'}} = \sum_{i\in\Delta_1} E_{5,i(\beta_h\beta_{h'})} + \sum_{i\in\Delta_0} E_{6,i(\beta_h\beta_{h'})} - \sum_{i\in\Delta^*} \frac{x_{ih} x_{ih'} e^{\beta' x_i}}{(1+e^{\beta' x_i})^2}, \qquad \frac{\partial^2 Q}{\partial \xi^2} = \frac{n}{\xi} - n\psi_1(\xi). \]

In the above, we have
\[ E_{3,i(\bullet)} = M_{1,1}((\log f_i)\bullet), \qquad E_{3,i(\bullet\bullet)} = M_{1,2}((\log f_i)\bullet\bullet), \]
\[ E_{4,i(\bullet)} = M_{1,1}((\log S_i)\bullet), \qquad E_{4,i(\bullet\bullet)} = M_{1,2}((\log S_i)\bullet\bullet), \]
\[ E_{5,i(\bullet)} = M_{1,1}((\log z_{2i})\bullet), \qquad E_{5,i(\bullet\bullet)} = M_{1,2}((\log z_{2i})\bullet\bullet), \]
\[ E_{6,i(\bullet)} = M_{2,1}((\log z_{2i})\bullet), \qquad E_{6,i(\bullet\bullet)} = M_{2,2}((\log z_{2i})\bullet\bullet), \]
where
\[ M_{1,1}(K(t_i|r_i)\bullet) = \frac{E\big(\tfrac{\partial K(t_i|r_i)}{\partial\bullet}\, f_p(t_i|Y)^{\delta_i} S_p(t_i|Y)^{1-\delta_i} \mid \theta^{(k)}\big)}{E\big(f_p(t_i|Y)^{\delta_i} S_p(t_i|Y)^{1-\delta_i} \mid \theta^{(k)}\big)}, \]  (32)
\[ M_{1,2}(K(t_i|r_i)\bullet\bullet) = \frac{E\big(\tfrac{\partial^2 K(t_i|r_i)}{\partial\bullet\,\partial\bullet}\, f_p(t_i|Y)^{\delta_i} S_p(t_i|Y)^{1-\delta_i} \mid \theta^{(k)}\big)}{E\big(f_p(t_i|Y)^{\delta_i} S_p(t_i|Y)^{1-\delta_i} \mid \theta^{(k)}\big)}, \]  (33)
\[ M_{2,1}(K(t_i|r_i)\bullet) = \frac{(1-p_{0i}(x_i,\beta^{(k)}))\, E\big(\tfrac{\partial \log z_{2i}}{\partial\bullet}\, S_s(t_i|Y) \mid \theta^{(k)}\big)}{(1-p_{0i}(x_i,\beta^{(k)}))\, E(S_s(t_i|Y)\mid\theta^{(k)}) + p_{0i}(x_i,\beta^{(k)})}, \]  (34)
\[ M_{2,2}(K(t_i|r_i)\bullet\bullet) = \frac{(1-p_{0i}(x_i,\beta^{(k)}))\, E\big(\tfrac{\partial^2 \log z_{2i}}{\partial\bullet\,\partial\bullet}\, S_s(t_i|Y) \mid \theta^{(k)}\big)}{(1-p_{0i}(x_i,\beta^{(k)}))\, E(S_s(t_i|Y)\mid\theta^{(k)}) + p_{0i}(x_i,\beta^{(k)})}, \]  (35)

where • can be γ_k, α_l or β_h. The first- and second-order derivatives of the functions log f_i and log S_i, with respect to γ_k, α_l and β_h, under the proportional odds model with log-logistic as well as Weibull baselines, are presented in the Appendix. The first- and second-order derivatives of log z_{1,i} with respect to γ_k, α_l, β_h are as follows:
\[ \frac{\partial \log z_{1,i}}{\partial \alpha_l} = \frac{z_{2,i}}{z_{1,i}} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_l}, \qquad \frac{\partial \log z_{1,i}}{\partial \beta_h} = \frac{z_{2,i}}{z_{01,i}\, z_{1,i}}\, x_{ih}\, e^{\beta' x_i}, \]
\[ \frac{\partial^2 \log z_{1,i}}{\partial \alpha_l \partial \beta_h} = \frac{x_{ih} e^{\beta' x_i}}{z_{1,i}^2\, z_{01,i}} \{z_{21,i} z_{1,i} - z_{2,i}^2\} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_l}, \]
\[ \frac{\partial^2 \log z_{1,i}}{\partial \alpha_l \partial \alpha_{l'}} = \frac{z_{21,i} z_{1,i} - z_{2,i}^2}{z_{1,i}^2} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_l} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_{l'}} + \frac{z_{2,i}}{z_{1,i}} \frac{\partial^2 \log S(t_i|r_i)}{\partial \alpha_l \partial \alpha_{l'}}, \]
\[ \frac{\partial^2 \log z_{1,i}}{\partial \beta_h \partial \beta_{h'}} = \frac{x_{ih} x_{ih'} e^{\beta' x_i}}{z_{01,i}\, z_{1,i}} \left[ z_{2,i} + \left\{ z_{21,i} - z_{2,i}\left(\frac{z_{02,i}}{z_{01,i}} + \frac{z_{2,i}}{z_{1,i}}\right) \right\} \frac{e^{\beta' x_i}}{z_{01,i}} \right]. \]
Similarly, the first- and second-order derivatives of log z_{2,i} with respect to γ_k, α_l, β_h are as follows:
\[ \frac{\partial \log z_{2,i}}{\partial \alpha_l} = \frac{z_{21,i}}{z_{2,i}} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_l}, \qquad \frac{\partial \log z_{2,i}}{\partial \gamma_k} = \frac{z_{21,i}}{z_{2,i}} \frac{\partial \log S(t_i|r_i)}{\partial \gamma_k}, \qquad \frac{\partial \log z_{2,i}}{\partial \beta_h} = \frac{z_{21,i}\, x_{ih}\, e^{\beta' x_i}}{z_{01,i}\, z_{2,i}}, \]
\[ \frac{\partial^2 \log z_{2,i}}{\partial \alpha_l \partial \beta_h} = \frac{(z_{31,i} z_{2,i} - z_{21,i}^2)\, x_{ih}\, e^{\beta' x_i}}{z_{2,i}^2\, z_{01,i}} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_l}, \]
\[ \frac{\partial^2 \log z_{2,i}}{\partial \beta_h \partial \beta_{h'}} = \frac{x_{ih} x_{ih'} e^{\beta' x_i}}{z_{01,i}\, z_{2,i}} \left[ z_{21,i} + \left\{ z_{31,i} - z_{21,i}\left(\frac{z_{02,i}}{z_{01,i}} + \frac{z_{21,i}}{z_{2,i}}\right) \right\} \frac{e^{\beta' x_i}}{z_{01,i}} \right], \]
\[ \frac{\partial^2 \log z_{2,i}}{\partial \alpha_l \partial \alpha_{l'}} = \frac{z_{21,i}}{z_{2,i}} \frac{\partial^2 \log S(t_i|r_i)}{\partial \alpha_l \partial \alpha_{l'}} + \left\{\frac{z_{31,i}}{z_{2,i}} - \Big(\frac{z_{21,i}}{z_{2,i}}\Big)^2\right\} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_l} \frac{\partial \log S(t_i|r_i)}{\partial \alpha_{l'}}, \]
with expressions of exactly the same form for ∂²log z_{2,i}/∂γ_k∂γ_{k'} and ∂²log z_{2,i}/∂α_l∂γ_k, where
\[ z_{1,i} = \sum_{j=1}^{\infty} \frac{\{\eta_i S(t_i|r_i)\}^j}{(j!)^{\phi}}, \quad z_{2,i} = \sum_{j=1}^{\infty} \frac{j\{\eta_i S(t_i|r_i)\}^j}{(j!)^{\phi}}, \quad z_{21,i} = \sum_{j=1}^{\infty} \frac{j^2\{\eta_i S(t_i|r_i)\}^j}{(j!)^{\phi}}, \quad z_{31,i} = \sum_{j=1}^{\infty} \frac{j^3\{\eta_i S(t_i|r_i)\}^j}{(j!)^{\phi}}, \]
\[ z_{01,i} = \sum_{j=1}^{\infty} \frac{j\,\eta_i^j}{(j!)^{\phi}}, \qquad z_{02,i} = \sum_{j=1}^{\infty} \frac{j^2\,\eta_i^j}{(j!)^{\phi}}. \]

4.4 Results for Some Special Cases

In this section, we present the Q-function and its first-order derivatives for some special cases of the COM-Poisson cure rate model with gamma frailty. The second-order derivatives are not presented for the sake of conciseness; they are available from the authors.

4.4.1 Geometric Cure Rate Model with Gamma Frailty

Let the competing cause random variable M follow a geometric distribution. Then, the survival function and density function of the susceptibles, conditional on r_i, are
\[ S_s(t_i|r_i) = \frac{S(t_i|r_i)}{1 + e^{\beta' x_i} F(t_i|r_i)}, \]  (36)
\[ f_p(t_i|r_i) = \frac{e^{\beta' x_i} f(t_i|r_i)}{(1 + e^{\beta' x_i} F(t_i|r_i))^2}, \]  (37)
respectively. The cure rate is p_0 = 1 − η, in which the parameter η is linked to the covariates through the logistic link function η = e^{β'x_i}/(1 + e^{β'x_i}). The Q-function can then be expressed as

\[ Q = Q_1(\beta,\gamma_0,\gamma_1,\alpha) + Q_2(\xi), \]
\[ Q_1 = \sum_{i\in\Delta_1}\{E_{4i} - 2E_{5i} + \beta' x_i\} + \sum_{i\in\Delta_0}\{\beta' x_i E_{1i} + E_{6i} - E_{7i}\} - \sum_{i\in\Delta^*}\log(1+e^{\beta' x_i}), \]
\[ Q_2 = n\xi\log(\xi) - n\log\Gamma(\xi) + (\xi-1)\sum_{i\in\Delta^*}E_{3i} - \xi\sum_{i\in\Delta^*}E_{2i}, \]
where
\[ E_{1i} = E(I_i\mid O,\theta^{(k)}),\quad E_{2i} = E(r_i\mid O,\theta^{(k)}),\quad E_{3i} = E(\log r_i\mid O,\theta^{(k)}),\quad E_{4i} = E(\log f_i\mid O,\theta^{(k)}), \]
\[ E_{5i} = E(\log R_i\mid O,\theta^{(k)}),\quad E_{6i} = E(I_i\log S_i\mid O,\theta^{(k)}),\quad E_{7i} = E(I_i\log R_i\mid O,\theta^{(k)}),\quad R_i = 1 + e^{\beta' x_i}F(t_i|r_i). \]
The required first-order derivatives of Q(θ, π^{(k)}) with respect to β and γ, for a fixed value of φ, are as follows:
\[ \frac{\partial Q_1}{\partial\alpha_l} = \sum_{i\in\Delta_1}E_{4i(\alpha_l)} - 2\sum_{i\in\Delta_1}E_{5i(\alpha_l)} + \sum_{i\in\Delta_0}E_{6i(\alpha_l)} - \sum_{i\in\Delta_0}E_{7i(\alpha_l)}, \]
\[ \frac{\partial Q_1}{\partial\gamma_k} = \sum_{i\in\Delta_1}E_{4i(\gamma_k)} - 2\sum_{i\in\Delta_1}E_{5i(\gamma_k)} + \sum_{i\in\Delta_0}E_{6i(\gamma_k)} - \sum_{i\in\Delta_0}E_{7i(\gamma_k)}, \]
\[ \frac{\partial Q_1}{\partial\beta_h} = \sum_{i\in\Delta_1}x_{ih} - 2\sum_{i\in\Delta_1}E_{5i(\beta_h)} + \sum_{i\in\Delta_0}x_{ih}E_{1i} - \sum_{i\in\Delta_0}E_{7i(\beta_h)} - \sum_{i\in\Delta^*}\frac{x_{ih}e^{\beta' x_i}}{1+e^{\beta' x_i}}, \]
where the first-order derivatives of the E-functions with respect to the different parameters are as follows:
\[ E_{4,i(\bullet)} = M_{1,1}((\log f_i)\bullet),\quad E_{5,i(\bullet)} = M_{1,1}((\log R_i)\bullet),\quad E_{6,i(\bullet)} = M_{2,1}((\log S_i)\bullet),\quad E_{7,i(\bullet)} = M_{2,1}((\log R_i)\bullet), \]
with • being any of γ_k, α_l or β_h, the functions M_{1,1}, M_{1,2}, M_{2,1}, M_{2,2} being as in (32)–(35), respectively, and
\[ \frac{\partial\log R_i}{\partial\alpha_l} = -\frac{e^{\beta' x_i}}{R_i}\frac{\partial S(t_i|r_i)}{\partial\alpha_l},\qquad \frac{\partial\log R_i}{\partial\gamma_k} = -\frac{e^{\beta' x_i}}{R_i}\frac{\partial S(t_i|r_i)}{\partial\gamma_k},\qquad \frac{\partial\log R_i}{\partial\beta_h} = \frac{x_{ih}e^{\beta' x_i}(1 - S(t_i|r_i))}{R_i}. \]
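As a numerical sanity check of (36)–(37), the implied population survival for the geometric case is S_pop(t) = 1/(1 + e^{β'x}F(t)), whose negative time-derivative equals f_p in (37) and whose limit as t → ∞ is the cure rate p_0 = 1/(1 + e^{β'x}). A sketch, assuming purely for illustration an exponential conditional survival S(t) = e^{−λt}:

```python
import math

def geom_pop_surv(t, bx, lam=1.0):
    # Population survival 1 / (1 + e^{b'x} F(t)) for the geometric cure model,
    # with an assumed exponential conditional survival S(t) = exp(-lam t).
    F = 1.0 - math.exp(-lam * t)
    return 1.0 / (1.0 + math.exp(bx) * F)

def geom_fp(t, bx, lam=1.0):
    # Density (37): e^{b'x} f(t) / (1 + e^{b'x} F(t))^2 with f = lam e^{-lam t}.
    S = math.exp(-lam * t)
    F = 1.0 - S
    f = lam * S
    return math.exp(bx) * f / (1.0 + math.exp(bx) * F) ** 2

bx, t, h = 0.4, 0.7, 1e-6
num_deriv = -(geom_pop_surv(t + h, bx) - geom_pop_surv(t - h, bx)) / (2 * h)
# num_deriv agrees with geom_fp(t, bx); S_pop(t) -> p0 = 1/(1 + e^{b'x}) as t grows
```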

4.4.2 Poisson Cure Rate Model with Gamma Frailty

Let the competing cause random variable M follow a Poisson distribution. Then, the survival function and density function of the susceptibles, conditional on r_i, are
\[ S_s(t_i|r_i) = \big[(1+e^{\beta' x_i})^{S(t_i|r_i)} - 1\big]\, e^{-\beta' x_i}, \]  (38)
\[ f_p(t_i|r_i) = (1+e^{\beta' x_i})^{-F(t_i|r_i)}\, \log(1+e^{\beta' x_i})\, f(t_i|r_i), \]  (39)

respectively. The cure rate is p_0 = e^{−η}, in which the parameter η is linked to the covariates as η = log(1 + e^{β'x_i}). The Q-function can then be expressed as
\[ Q = Q_1(\beta,\gamma_0,\gamma_1,\alpha) + Q_2(\xi), \]
\[ Q_1 = \sum_{i\in\Delta_1}\big\{E_{4i}\log(1+e^{\beta' x_i}) + \log\log(1+e^{\beta' x_i}) + E_{3i}\big\} + \sum_{i\in\Delta_0}E_{5i} - \sum_{i\in\Delta^*}\log(1+e^{\beta' x_i}), \]  (40)
\[ Q_2 = n\xi\log(\xi) - n\log\Gamma(\xi) + (\xi-1)\sum_{i\in\Delta^*}E_{2i} - \xi\sum_{i\in\Delta^*}E_{1i}, \]
where
\[ E_{1i} = E(r_i\mid O,\theta^{(k)}),\quad E_{2i} = E(\log r_i\mid O,\theta^{(k)}),\quad E_{3i} = E(\log f_i\mid O,\theta^{(k)}), \]
\[ E_{4i} = E(S_i\mid O,\theta^{(k)}),\quad E_{5i} = E(I_i\log B_i\mid O,\theta^{(k)}),\quad B_i = (1+e^{\beta' x_i})^{S(t_i|r_i)} - 1. \]
The required first-order derivatives of Q(θ, π^{(k)}) with respect to β and γ, for a fixed value of φ, are as follows:
\[ \frac{\partial Q_1}{\partial\alpha_l} = \sum_{i\in\Delta_1}\big\{E_{4i(\alpha_l)}\log(1+e^{\beta' x_i}) + E_{3i(\alpha_l)}\big\} + \sum_{i\in\Delta_0}E_{5i(\alpha_l)}, \]
\[ \frac{\partial Q_1}{\partial\gamma_k} = \sum_{i\in\Delta_1}\big\{E_{4i(\gamma_k)}\log(1+e^{\beta' x_i}) + E_{3i(\gamma_k)}\big\} + \sum_{i\in\Delta_0}E_{5i(\gamma_k)}, \]
\[ \frac{\partial Q_1}{\partial\beta_h} = \sum_{i\in\Delta_1}\left(E_{4i} + \frac{1}{\log(1+e^{\beta' x_i})}\right)\frac{x_{ih}e^{\beta' x_i}}{1+e^{\beta' x_i}} + \sum_{i\in\Delta_0}E_{5i(\beta_h)} - \sum_{i\in\Delta^*}\frac{x_{ih}e^{\beta' x_i}}{1+e^{\beta' x_i}}, \]
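As for the geometric case, (38)–(39) can be sanity-checked numerically: the population survival here is S_pop(t) = (1 + e^{β'x})^{−F(t)} = e^{−ηF(t)}, whose negative time-derivative equals f_p in (39) and whose limit as t → ∞ is p_0 = e^{−η} = 1/(1 + e^{β'x}). A sketch, again assuming an exponential conditional survival for illustration:

```python
import math

def pois_pop_surv(t, bx, lam=1.0):
    # Population survival e^{-eta F(t)} with eta = log(1 + e^{b'x}),
    # assuming conditional survival S(t) = exp(-lam t) for illustration.
    F = 1.0 - math.exp(-lam * t)
    return (1.0 + math.exp(bx)) ** (-F)

def pois_fp(t, bx, lam=1.0):
    # Density (39): (1 + e^{b'x})^{-F(t)} log(1 + e^{b'x}) f(t).
    S = math.exp(-lam * t)
    F = 1.0 - S
    return (1.0 + math.exp(bx)) ** (-F) * math.log(1.0 + math.exp(bx)) * lam * S

bx, t, h = 0.4, 0.7, 1e-6
num_deriv = -(pois_pop_surv(t + h, bx) - pois_pop_surv(t - h, bx)) / (2 * h)
# num_deriv agrees with pois_fp(t, bx); S_pop(t) -> p0 = 1/(1 + e^{b'x})
```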

where the first-order derivatives of the E-functions with respect to the different parameters are as follows:
\[ E_{3i(\bullet)} = M_{1,1}((\log f_i)\bullet),\quad E_{4i(\bullet)} = M_{1,1}((S_i)\bullet),\quad E_{5i(\bullet)} = M_{2,1}((\log B_i)\bullet), \]
with • being any of γ_k, α_l or β_h, the functions M_{1,1}, M_{1,2}, M_{2,1}, M_{2,2} being as in (32)–(35), respectively, and
\[ \frac{\partial\log B_i}{\partial\alpha_l} = B_{2i}\frac{\partial S(t_i|r_i)}{\partial\alpha_l},\quad \frac{\partial\log B_i}{\partial\gamma_k} = B_{2i}\frac{\partial S(t_i|r_i)}{\partial\gamma_k},\quad \frac{\partial\log B_i}{\partial\beta_h} = S(t_i|r_i)\,B_{1i}\,p_{0i}\,e^{\beta' x_i}\,x_{ih}, \]
in which B_{0i} = log(1 + e^{β'x_i}), B_{1i} = (1 + e^{β'x_i})^{S(t_i|r_i)}/B_i, and B_{2i} = B_{0i}B_{1i}.

4.4.3 Bernoulli Cure Rate Model with Gamma Frailty

Let the competing cause random variable M follow a Bernoulli distribution with probability of success η/(1 + η). Then, the Q-function can be written as a sum of three parts as follows:
\[ Q = Q_1(\beta) + Q_2(\alpha,\gamma_0,\gamma_1) + Q_3(\xi), \]  (41)

where
\[ Q_1(\beta) = \sum_{i\in\Delta_1}\beta' x_i - \sum_{i\in\Delta^*}\log(1+e^{\beta' x_i}) + \sum_{i\in\Delta_0}E_{1i}\,\beta' x_i, \]
\[ Q_2(\alpha,\gamma_0,\gamma_1) = \sum_{i\in\Delta_1}\big(\alpha' x_i + \log f_0(t_i) - 2E_{4i}\big) + \sum_{i\in\Delta_0}\big(E_{1i}\,\alpha' x_i + E_{1i}\log S_0(t_i) - E_{5i}\big), \]  (42)
\[ Q_3(\xi) = n\xi\log(\xi) - n\log\Gamma(\xi) + (\xi-1)\sum_{i\in\Delta^*}E_{3i} - \xi\sum_{i\in\Delta^*}E_{2i}, \]
with
\[ E_{1i} = E(I_i\mid O,\theta^{(k)}),\quad E_{2i} = E(r_i\mid O,\theta^{(k)}),\quad E_{3i} = E(\log r_i\mid O,\theta^{(k)}), \]
\[ E_{4i} = E(\log A_i\mid O,\theta^{(k)}),\quad E_{5i} = E(I_i\log A_i\mid O,\theta^{(k)}),\quad A_i = r_i\,S_0(t_i)\,e^{\alpha' x_i} + F_0(t_i). \]
The required first-order derivatives of Q(θ, π^{(k)}) with respect to β and γ, for a fixed value of φ, are as follows:

\[ \frac{\partial Q}{\partial\xi} = n\log(\xi) + n - n\psi_0(\xi) + \sum_{i\in\Delta^*}(E_{3i} - E_{2i}), \]
\[ \frac{\partial Q}{\partial\beta_h} = \sum_{i\in\Delta_1}x_{ih} + \sum_{i\in\Delta_0}E_{1i}x_{ih} - \sum_{i\in\Delta^*}\frac{x_{ih}e^{\beta' x_i}}{1+e^{\beta' x_i}}, \]
\[ \frac{\partial Q}{\partial\alpha_l} = \sum_{i\in\Delta_1}x_{il} + \sum_{i\in\Delta_0}E_{1i}x_{il} - 2\sum_{i\in\Delta_1}E_{4i(\alpha_l)} - \sum_{i\in\Delta_0}E_{5i(\alpha_l)}, \]
\[ \frac{\partial Q}{\partial\gamma_k} = \sum_{i\in\Delta_1}\frac{\partial\log f_0(t_i)}{\partial\gamma_k} - 2\sum_{i\in\Delta_1}E_{4i(\gamma_k)} + \sum_{i\in\Delta_0}E_{1i}\frac{\partial\log S_0(t_i)}{\partial\gamma_k} - \sum_{i\in\Delta_0}E_{5i(\gamma_k)}, \]
where ψ_0(·) is the digamma function, given by
\[ \psi_0(x) = \frac{\Gamma'(x)}{\Gamma(x)} = -\gamma - \sum_{k=0}^{\infty}\left(\frac{1}{x+k} - \frac{1}{k+1}\right), \qquad \gamma = \lim_{n\to\infty}\left[\sum_{k=1}^{n}\frac{1}{k} - \log n\right], \]
and ψ_1 is the trigamma function, ψ_1(x) = −∫_0^1 t^{x−1}(1−t)^{−1} log t dt. The first-order derivatives of the E-functions with respect to the different parameters are as follows:
\[ E_{4i(\bullet)} = M_{1,1}((\log A_i)\bullet),\qquad E_{5i(\bullet)} = M_{2,1}((\log A_i)\bullet), \]
with • being any of γ_k, α_l or β_h, the functions M_{1,1}, M_{1,2}, M_{2,1}, M_{2,2} being as in (32)–(35), and
\[ \frac{\partial\log A_i}{\partial\alpha_l} = \frac{r_i\,S_0(t_i)\,e^{\alpha' x_i}\,x_{il}}{A_i},\qquad \frac{\partial\log A_i}{\partial\gamma_k} = \frac{r_i\,e^{\alpha' x_i} - 1}{A_i}\,\frac{\partial S_0(t_i)}{\partial\gamma_k}. \]
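The digamma function defined above can be evaluated directly from its series representation and checked against a finite difference of log Γ. A small sketch using only the standard library (the fixed truncation point is an assumption for this illustration):

```python
import math

def digamma(x, terms=200000):
    # psi_0(x) = -gamma - sum_{k>=0} (1/(x+k) - 1/(k+1)), truncated.
    euler_gamma = 0.5772156649015329
    s = 0.0
    for k in range(terms):
        s += 1.0 / (x + k) - 1.0 / (k + 1.0)
    return -euler_gamma - s

# Check against a central difference of log Gamma (d/dx log Gamma = psi_0):
x, h = 3.7, 1e-5
fd = (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)
```

The truncation error of the series is of order (x − 1)/terms, so the agreement here is to a few decimal places; library implementations use asymptotic expansions instead.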

5 Observed Information Matrix

In this section, the score functions and the observed information matrix are presented for the COM-Poisson cure rate model with gamma frailty. The corresponding expressions for the special cases are not presented here for conciseness; they are available from the authors. The score functions, for a fixed value of φ, are as follows:
\[ \frac{\partial l}{\partial\alpha_l} = \sum_{i\in\Delta_1}\frac{1}{A_i}\int \frac{1}{S(t_i|r_i)}\left\{\frac{\partial f(t_i|r_i)}{\partial\alpha_l}z_{2,i} + f(t_i|r_i)\left(\frac{\partial z_{2,i}}{\partial\alpha_l} - \frac{\partial z_{1,i}}{\partial\alpha_l}\right)\right\} f_r(r_i)\,dr_i + \sum_{i\in\Delta_0}\frac{1}{B_i}\int \frac{\partial z_{1,i}}{\partial\alpha_l}\,f_r(r_i)\,dr_i, \]
with an expression of exactly the same form for ∂l/∂γ_k,
\[ \frac{\partial l}{\partial\beta_h} = \sum_{i\in\Delta_1}\frac{1}{A_i}\int \frac{f(t_i|r_i)}{S(t_i|r_i)}\frac{\partial z_{2,i}}{\partial\beta_h}\,f_r(r_i)\,dr_i + \sum_{i\in\Delta_0}\frac{1}{B_i}\int \frac{\partial z_{1,i}}{\partial\beta_h}\,f_r(r_i)\,dr_i - \sum_{i\in\Delta^*}\frac{x_{ih}e^{\beta' x_i}}{1+e^{\beta' x_i}}, \]
\[ \frac{\partial l}{\partial\xi} = \sum_{i\in\Delta_1}\frac{1}{A_i}\int \frac{f(t_i|r_i)}{S(t_i|r_i)}\,z_{2,i}\,\frac{\partial f_r(r_i)}{\partial\xi}\,dr_i + \sum_{i\in\Delta_0}\frac{1}{B_i}\int (1+z_{1,i})\,\frac{\partial f_r(r_i)}{\partial\xi}\,dr_i. \]
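Analytic score expressions such as these are commonly validated against finite differences of the log-likelihood before being used in optimization or standard-error computation. A generic checker, with a hypothetical toy objective standing in for l:

```python
import numpy as np

def num_grad(f, theta, h=1e-6):
    # Central finite-difference gradient, used to validate analytic scores.
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

# Hypothetical toy "log-likelihood" with known gradient:
loglik = lambda th: -0.5 * np.sum(th ** 2) + th[0]
analytic = lambda th: np.array([1.0 - th[0], -th[1]])
theta = np.array([0.3, -1.2])
```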

Hence, the components of the observed information matrix, for a fixed value of φ, are as follows:
\[ -\frac{\partial^2 l}{\partial\gamma_k\,\partial\beta_h} = -\sum_{i\in\Delta_1}\frac{1}{A_i}\int \frac{1}{S(t_i|r_i)}\left\{\frac{\partial f(t_i|r_i)}{\partial\gamma_k}\frac{\partial z_{2,i}}{\partial\beta_h} + f(t_i|r_i)\left(\frac{\partial^2 z_{2,i}}{\partial\gamma_k\partial\beta_h} - \frac{\partial^2 z_{1,i}}{\partial\gamma_k\partial\beta_h}\right)\right\} f_r(r_i)\,dr_i \]
\[ \quad + \sum_{i\in\Delta_1}\frac{1}{A_i^2}\left[\int \frac{1}{S(t_i|r_i)}\left\{\frac{\partial f(t_i|r_i)}{\partial\gamma_k}z_{2,i} + f(t_i|r_i)\left(\frac{\partial z_{2,i}}{\partial\gamma_k} - \frac{\partial z_{1,i}}{\partial\gamma_k}\right)\right\} f_r(r_i)\,dr_i\right]\left[\int \frac{f(t_i|r_i)}{S(t_i|r_i)}\frac{\partial z_{2,i}}{\partial\beta_h}\,f_r(r_i)\,dr_i\right] \]
\[ \quad - \sum_{i\in\Delta_0}\frac{1}{B_i}\int \frac{\partial^2 z_{1,i}}{\partial\gamma_k\partial\beta_h}\,f_r(r_i)\,dr_i + \sum_{i\in\Delta_0}\frac{1}{B_i^2}\left[\int \frac{\partial z_{1,i}}{\partial\gamma_k}\,f_r(r_i)\,dr_i\right]\left[\int \frac{\partial z_{1,i}}{\partial\beta_h}\,f_r(r_i)\,dr_i\right], \]
with −∂²l/∂α_l∂β_h obtained on replacing γ_k by α_l throughout;
\[ -\frac{\partial^2 l}{\partial\beta_h\,\partial\beta_{h'}} = -\sum_{i\in\Delta_1}\frac{1}{A_i}\int \frac{f(t_i|r_i)}{S(t_i|r_i)}\frac{\partial^2 z_{2,i}}{\partial\beta_h\partial\beta_{h'}}\,f_r(r_i)\,dr_i + \sum_{i\in\Delta_1}\frac{1}{A_i^2}\left[\int \frac{f(t_i|r_i)}{S(t_i|r_i)}\frac{\partial z_{2,i}}{\partial\beta_h}\,f_r(r_i)\,dr_i\right]\left[\int \frac{f(t_i|r_i)}{S(t_i|r_i)}\frac{\partial z_{2,i}}{\partial\beta_{h'}}\,f_r(r_i)\,dr_i\right] \]
\[ \quad - \sum_{i\in\Delta_0}\frac{1}{B_i}\int \frac{\partial^2 z_{1,i}}{\partial\beta_h\partial\beta_{h'}}\,f_r(r_i)\,dr_i + \sum_{i\in\Delta_0}\frac{1}{B_i^2}\left[\int \frac{\partial z_{1,i}}{\partial\beta_h}\,f_r(r_i)\,dr_i\right]\left[\int \frac{\partial z_{1,i}}{\partial\beta_{h'}}\,f_r(r_i)\,dr_i\right] + \sum_{i\in\Delta^*}\frac{x_{ih}x_{ih'}e^{\beta' x_i}}{(1+e^{\beta' x_i})^2}; \]
\[ -\frac{\partial^2 l}{\partial\gamma_k\,\partial\gamma_{k'}} = -\sum_{i\in\Delta_1}\frac{1}{A_i}\int \frac{1}{S(t_i|r_i)}\Bigg\{\frac{\partial^2 f(t_i|r_i)}{\partial\gamma_k\partial\gamma_{k'}}z_{2,i} + \frac{\partial f(t_i|r_i)}{\partial\gamma_k}\frac{\partial z_{2,i}}{\partial\gamma_{k'}} + \frac{\partial f(t_i|r_i)}{\partial\gamma_{k'}}\left(\frac{\partial z_{2,i}}{\partial\gamma_k} - \frac{\partial z_{1,i}}{\partial\gamma_k}\right) \]
\[ \qquad + f(t_i|r_i)\left(\frac{\partial^2 z_{2,i}}{\partial\gamma_k\partial\gamma_{k'}} - \frac{\partial^2 z_{1,i}}{\partial\gamma_k\partial\gamma_{k'}}\right) - \frac{\partial\log S(t_i|r_i)}{\partial\gamma_{k'}}\left[\frac{\partial f(t_i|r_i)}{\partial\gamma_k}z_{2,i} + f(t_i|r_i)\left(\frac{\partial z_{2,i}}{\partial\gamma_k} - \frac{\partial z_{1,i}}{\partial\gamma_k}\right)\right]\Bigg\} f_r(r_i)\,dr_i \]
\[ \quad + \sum_{i\in\Delta_1}\frac{1}{A_i^2}\left[\int \frac{1}{S}\left\{\frac{\partial f}{\partial\gamma_k}z_{2,i} + f\left(\frac{\partial z_{2,i}}{\partial\gamma_k} - \frac{\partial z_{1,i}}{\partial\gamma_k}\right)\right\} f_r\,dr_i\right]\left[\int \frac{1}{S}\left\{\frac{\partial f}{\partial\gamma_{k'}}z_{2,i} + f\left(\frac{\partial z_{2,i}}{\partial\gamma_{k'}} - \frac{\partial z_{1,i}}{\partial\gamma_{k'}}\right)\right\} f_r\,dr_i\right] \]
\[ \quad - \sum_{i\in\Delta_0}\frac{1}{B_i}\int \frac{\partial^2 z_{1,i}}{\partial\gamma_k\partial\gamma_{k'}}\,f_r(r_i)\,dr_i + \sum_{i\in\Delta_0}\frac{1}{B_i^2}\left[\int \frac{\partial z_{1,i}}{\partial\gamma_k}\,f_r(r_i)\,dr_i\right]\left[\int \frac{\partial z_{1,i}}{\partial\gamma_{k'}}\,f_r(r_i)\,dr_i\right], \]
with −∂²l/∂α_l∂α_{l'} of exactly the same form in α_l, α_{l'}. The mixed β–γ and γ–γ′ components can also be expressed directly in terms of the z-functions as
\[ -\frac{\partial^2 l}{\partial\beta_h\,\partial\gamma_k} = -\sum_{i\in\Delta_1}\frac{z_{31,i}z_{2,i} - z_{21,i}^2}{z_{2,i}^2}\,\frac{x_{ih}e^{\beta' x_i}}{z_{01,i}}\,\frac{\partial\log S(t_i|r_i)}{\partial\gamma_k} - \sum_{i\in\Delta_0}\frac{x_{ih}e^{\beta' x_i}}{z_{01,i}(1+z_{1,i})}\left\{z_{21,i} - \frac{z_{2,i}^2}{1+z_{1,i}}\right\}\frac{\partial\log S(t_i|r_i)}{\partial\gamma_k}, \]
\[ -\frac{\partial^2 l}{\partial\gamma_k\,\partial\gamma_{k'}} = -\sum_{i\in\Delta_1}\left\{\frac{\partial^2\log f(t_i,\gamma)}{\partial\gamma_k\partial\gamma_{k'}} + \left(\frac{z_{21,i}}{z_{2,i}} - 1\right)\frac{\partial^2\log S(t_i,\gamma)}{\partial\gamma_k\partial\gamma_{k'}} + \frac{z_{31,i}z_{2,i} - z_{21,i}^2}{z_{2,i}^2}\,\frac{\partial\log S(t_i,\gamma)}{\partial\gamma_k}\frac{\partial\log S(t_i,\gamma)}{\partial\gamma_{k'}}\right\} \]
\[ \quad - \sum_{i\in\Delta_0}\left\{\frac{z_{2,i}}{1+z_{1,i}}\frac{\partial^2\log S(t_i,\gamma)}{\partial\gamma_k\partial\gamma_{k'}} + \frac{z_{21,i}(1+z_{1,i}) - z_{2,i}^2}{(1+z_{1,i})^2}\,\frac{\partial\log S(t_i,\gamma)}{\partial\gamma_k}\frac{\partial\log S(t_i,\gamma)}{\partial\gamma_{k'}}\right\}, \]
\[ \frac{\partial^2 l}{\partial\xi^2} = \frac{n}{\xi} - n\psi_1(\xi), \qquad \psi_1(x) = -\int_0^1 \frac{t^{x-1}}{1-t}\,\log t\,dt. \]
In the above,
\[ A_i = \int \frac{f(t_i|r_i)}{S(t_i|r_i)}\,z_{2,i}\,f_r(r_i)\,dr_i, \qquad B_i = \int (1+z_{1,i})\,f_r(r_i)\,dr_i, \]
for l, l′ = 0, 1, …, p (with x_{i0} ≡ 1), h, h′ = 0, 1, and i = 1, …, n.

6 Empirical Study

In this section, an extensive Monte Carlo simulation study is carried out for the special cases to illustrate the performance of the proposed model and the method of inference. We vary the sample size, the censoring proportion and the underlying baseline distribution to consider different scenarios. We mimic the cutaneous melanoma data analysed in the next section, and consider 4 possible categories for the individuals, namely, x = 0, 1, 2, 3. Two different sample sizes are considered in the study: n = 800 (200, 168, 212, 220) and n = 2000 (500, 420, 530, 550), reflecting medium and large sample sizes. Moreover, since β = (β_0, β_1) has two parameters, fixing the cure rates for the first and fourth categories is enough to cover all cases, as the cure rates for the second and third categories can then be readily obtained from β. We chose (p_{00}, p_{03}) = (0.4, 0.2) for the cure rates of the first and fourth categories; the cure rates are then in decreasing order, and
\[ \beta_0 = \ln(1/p_{00} - 1), \qquad \beta_1 = \{\ln(1/p_{03} - 1) - \beta_0\}/3. \]  (43)
We thus obtain the true value of β as (0.405, 0.327). In addition, we consider light and heavy censoring cases, with light and heavy censoring rates (0.52, 0.45, 0.37, 0.3) and (0.65, 0.49, 0.4, 0.35) for the low cure rates, and (0.7, 0.57, 0.45, 0.34) and (0.8, 0.64, 0.5, 0.38) for the high cure rates, respectively. Suppose the probabilities of being censored and of being cured for group x are c_x and p_{0x}, respectively. We can then take the proportion of censored individuals in the susceptible group to be the difference between the probability of being censored and that of being cured, i.e.,
\[ P(Y \ge C_x \cap M \ge 1 \mid X = x) = c_x - p_{0x}, \]  (44)
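Relation (43), together with the logistic cure-rate link p_0(x) = 1/(1 + e^{β_0 + β_1 x}) used by the special cases above, can be verified directly:

```python
import math

def beta_from_cure_rates(p00, p03):
    # Equation (43): solve the logistic cure-rate link for beta0, beta1
    # given the cure rates of the first (x = 0) and fourth (x = 3) categories.
    b0 = math.log(1.0 / p00 - 1.0)
    b1 = (math.log(1.0 / p03 - 1.0) - b0) / 3.0
    return b0, b1

def cure_rate(x, b0, b1):
    # Logistic cure-rate link p0(x) = 1 / (1 + exp(b0 + b1 x)).
    return 1.0 / (1.0 + math.exp(b0 + b1 * x))

b0, b1 = beta_from_cure_rates(0.4, 0.2)  # roughly (0.405, 0.327)
```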

where the censoring time C_x is assumed to follow an exponential distribution with rate λ_x for x = 0, 1, 2, 3. The baseline parameters (γ_0, γ_1) of the proportional odds survival model are chosen as (0.571, 0.307) and (1.25, 3.75) for the Weibull and log-logistic distributions, respectively. The odds regression parameter is taken as α = −0.75 to ensure a decreasing lifetime over the four nodule categories. We use the inverse transform method to simulate the actual survival lifetime w_i for each individual under the different competing risks:
\[ w_i = \left\{\frac{1}{\gamma_0}\log\left(1 + \frac{u}{1-u}\, r_i\, e^{x_i'\alpha}\right)\right\}^{1/\gamma_1}, \]  (45)
\[ w_i = \frac{1}{\gamma_0}\left(\frac{u}{1-u}\, r_i\, e^{x_i'\alpha}\right)^{1/\gamma_1}, \]  (46)
for i = 1, …, n, under the proportional odds model with Weibull and log-logistic baseline distributions, respectively, where u follows a uniform distribution over (0, 1). Under the above setting, the procedure to generate data from the three special cases of the cure rate model proceeds as follows.

Geometric cure rate model: For each individual, we simulate the number of competing risks M_i from a geometric distribution with P(M_i = 0) = p_{0x}, and we simulate the censoring time C_i from an exponential distribution with rate λ_x. If M_i is not zero, we simulate M_i actual lifetimes {Y_{i1}, …, Y_{iM_i}} from the proportional odds survival model; the actual lifetime is then Y_i = min{Y_{i1}, …, Y_{iM_i}}, and the observed lifetime T_i is the minimum of all the actual lifetimes and the censoring time, i.e., T_i = min{Y_i, C_i}. If Y_i > C_i, we take the censoring indicator δ_i = 0; otherwise δ_i = 1. On the other hand, if M_i = 0, the individual is cured, in which case we assign C_i as the lifetime and take the censoring indicator δ_i = 0.

Poisson cure rate model: The procedure is the same as for the geometric cure rate model, except that M_i is simulated from a Poisson distribution with parameter −log(p_{0x}).

Bernoulli cure rate model: There are two ways to generate the data in this case. One is the same as for the geometric cure rate model, except that M_i is simulated from a Bernoulli distribution with probability of success 1 − p_{0x}. The other is simpler, since M_i can only take the values 0 or 1: for each individual, we simulate the censoring time C_i from an exponential distribution with rate λ_x, and a uniform random variable U_i; if U_i ≤ p_{0x}, the observed lifetime T_i is set to C_i, and otherwise we generate the observed lifetime T_i from the proportional odds survival model. This simpler simulation procedure is the one we used.
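The inverse-transform formulas (45)–(46), as reconstructed above, can be checked by confirming that the sampled lifetime satisfies F(w|r) = u under the proportional odds survival model F(t|r) = F_0(t)/(r S_0(t) e^{x'α} + F_0(t)); a sketch:

```python
import math

def po_sample_weibull(u, r, xa, g0, g1):
    # Reconstructed (45): PO model, Weibull baseline S0(t) = exp(-g0 * t**g1).
    return (math.log(1.0 + (u / (1.0 - u)) * r * math.exp(xa)) / g0) ** (1.0 / g1)

def po_sample_loglogistic(u, r, xa, g0, g1):
    # Reconstructed (46): PO model, log-logistic baseline S0(t) = 1/(1+(g0*t)**g1).
    return ((u / (1.0 - u)) * r * math.exp(xa)) ** (1.0 / g1) / g0

def po_cdf(t, r, xa, S0):
    # F(t|r) = F0 / (r * S0 * e^{x'a} + F0) under the proportional odds model.
    F0 = 1.0 - S0
    return F0 / (r * S0 * math.exp(xa) + F0)

u, r, xa, g0, g1 = 0.37, 1.4, -0.75, 1.25, 3.75
w_ll = po_sample_loglogistic(u, r, xa, g0, g1)
w_wb = po_sample_weibull(u, r, xa, g0, g1)
# po_cdf recovers u for both baselines (inverse-transform consistency)
```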
In our simulation study, 1000 Monte Carlo runs were considered in each scenario. The estimates were calculated through the EM method. The iterations were terminated when the difference in the log-likelihood values between two consecutive iterations was less than 10^{-7}. We calculated the empirical Bias, standard error (SE), root mean square error (RMSE), and 95% coverage probabilities (CPs) for all the parameters. Here, the initial values of the parameters (β, γ) were taken from a grid search over the parameter values, and the values having the maximum likelihood were then chosen as the initial values for starting the iterative process.
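The stopping rule based on successive log-likelihood values can be sketched generically as follows; the update map and objective here are hypothetical placeholders for the E- and M-steps:

```python
def iterate_until_converged(update, loglik, theta0, tol=1e-7, max_iter=5000):
    # Generic EM-style loop: stop when successive log-likelihood values
    # differ by less than tol, as in the simulation study.
    theta, ll = theta0, loglik(theta0)
    for _ in range(max_iter):
        theta = update(theta)
        ll_new = loglik(theta)
        if abs(ll_new - ll) < tol:
            break
        ll = ll_new
    return theta

# Toy fixed-point illustration: the update halves the distance to 4.
theta_hat = iterate_until_converged(lambda th: (th + 4.0) / 2.0,
                                    lambda th: -(th - 4.0) ** 2, 0.0)
```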

Table 1 True values of the parameters, with Bias, SE, RMSE and CP for the different cure rate models with gamma frailty under proportional odds with log-logistic baseline, under light censoring (LC) and heavy censoring (HC)

| Model | n | Param | True | Bias (LC) | SE (LC) | RMSE (LC) | CP95 (LC) | Bias (HC) | SE (HC) | RMSE (HC) | CP95 (HC) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Geometric | 800 | α | –0.75 | 0.006 | 0.108 | 0.104 | 95.689 | 0.009 | 0.122 | 0.12 | 95.39 |
| | | γ0 | 1.25 | –0.003 | 0.065 | 0.065 | 93.653 | –0.005 | 0.079 | 0.08 | 93.388 |
| | | γ1 | 3.75 | –0.022 | 0.189 | 0.148 | 96.527 | –0.023 | 0.202 | 0.158 | 97.045 |
| | | β1 | 0.405 | 0.003 | 0.146 | 0.145 | 95.329 | 0.005 | 0.175 | 0.17 | 95.868 |
| | | β2 | 0.327 | 0.004 | 0.08 | 0.081 | 94.371 | 0.002 | 0.092 | 0.088 | 96.222 |
| | | 1/ξ | 0.152 | –0.061 | 0.085 | 0.082 | 54.371 | –0.06 | 0.086 | 0.082 | 55.083 |
| Poisson | 800 | α | –0.75 | 0.006 | 0.085 | 0.086 | 94.97 | 0.01 | 0.094 | 0.099 | 94.097 |
| | | γ0 | 1.25 | –0.005 | 0.054 | 0.054 | 94.611 | –0.006 | 0.064 | 0.065 | 93.388 |
| | | γ1 | 3.75 | –0.025 | 0.151 | 0.156 | 92.575 | –0.021 | 0.161 | 0.164 | 93.388 |
| | | β1 | 0.405 | 0.005 | 0.145 | 0.143 | 95.329 | –0.001 | 0.174 | 0.167 | 95.632 |
| | | β2 | 0.327 | 0.001 | 0.079 | 0.079 | 94.85 | 0.004 | 0.091 | 0.09 | 95.396 |
| | | 1/ξ | 0.152 | –0.06 | 0.053 | 0.08 | 53.653 | –0.052 | 0.056 | 0.075 | 61.039 |
| Bernoulli | 800 | α | –0.75 | 0.004 | 0.079 | 0.08 | 94.85 | 0.007 | 0.085 | 0.088 | 94.097 |
| | | γ0 | 1.25 | –0.006 | 0.052 | 0.052 | 94.97 | –0.01 | 0.058 | 0.06 | 94.215 |
| | | γ1 | 3.75 | –0.028 | 0.154 | 0.147 | 95.569 | –0.015 | 0.165 | 0.163 | 95.277 |
| | | β1 | 0.405 | –0.005 | 0.146 | 0.146 | 94.731 | 0.009 | 0.175 | 0.184 | 94.097 |
| | | β2 | 0.327 | 0.004 | 0.08 | 0.08 | 95.808 | –0.001 | 0.092 | 0.097 | 93.743 |
| | | 1/ξ | 0.152 | –0.054 | 0.063 | 0.075 | 58.922 | –0.051 | 0.065 | 0.074 | 60.803 |
| Geometric | 2000 | α | –0.75 | 0.004 | 0.069 | 0.068 | 94.897 | 0.003 | 0.079 | 0.075 | 97.054 |
| | | γ0 | 1.25 | –0.002 | 0.041 | 0.043 | 93.317 | –0.003 | 0.05 | 0.049 | 94.993 |
| | | γ1 | 3.75 | –0.019 | 0.124 | 0.1 | 96.719 | –0.002 | 0.143 | 0.103 | 98.38 |
| | | β1 | 0.405 | 0.002 | 0.092 | 0.092 | 95.018 | –0.002 | 0.11 | 0.111 | 94.845 |
| | | β2 | 0.327 | 0.001 | 0.05 | 0.051 | 95.261 | 0.002 | 0.058 | 0.057 | 95.582 |
| | | 1/ξ | 0.152 | –0.04 | 0.058 | 0.061 | 66.1 | –0.019 | 0.069 | 0.04 | 83.652 |
| Poisson | 2000 | α | –0.75 | 0.005 | 0.054 | 0.056 | 93.816 | 0.002 | 0.06 | 0.056 | 96.159 |
| | | γ0 | 1.25 | –0.004 | 0.034 | 0.036 | 94.079 | 0 | 0.041 | 0.039 | 96.302 |
| | | γ1 | 3.75 | –0.02 | 0.096 | 0.103 | 92.763 | –0.015 | 0.102 | 0.104 | 94.168 |
| | | β1 | 0.405 | 0 | 0.092 | 0.095 | 94.605 | 0.005 | 0.109 | 0.104 | 96.302 |
| | | β2 | 0.327 | 0.003 | 0.05 | 0.049 | 95.789 | –0.001 | 0.057 | 0.055 | 95.875 |
| | | 1/ξ | 0.152 | –0.035 | 0.039 | 0.055 | 69.868 | –0.022 | 0.042 | 0.041 | 82.077 |
| Bernoulli | 2000 | α | –0.75 | 0.005 | 0.05 | 0.05 | 94.4 | 0.005 | 0.054 | 0.055 | 95.395 |
| | | γ0 | 1.25 | –0.006 | 0.033 | 0.034 | 92.7 | –0.006 | 0.037 | 0.038 | 94.194 |
| | | γ1 | 3.75 | –0.029 | 0.097 | 0.101 | 93.4 | –0.02 | 0.105 | 0.102 | 94.394 |
| | | β1 | 0.405 | 0 | 0.092 | 0.093 | 95 | –0.004 | 0.11 | 0.107 | 95.596 |
| | | β2 | 0.327 | 0.003 | 0.05 | 0.05 | 95.4 | 0.001 | 0.058 | 0.057 | 94.795 |
| | | 1/ξ | 0.152 | –0.044 | 0.041 | 0.063 | 60.8 | –0.034 | 0.045 | 0.056 | 70.47 |

Tables 1 and 2 present the Bias, SE, RMSE and CPs for all three special cases. We see that the estimates are quite accurate under the different cure rate models with gamma frailty. The bias and SE, along with the RMSE, are smaller under light censoring. The coverage probabilities of the confidence intervals based on the asymptotic normality of the MLEs are quite close to the nominal level in most cases, except for 1/ξ, for which they improve as the sample size n increases.


Table 2 True values of the parameters, with Bias, SE, RMSE and CP for the different cure rate models with gamma frailty under proportional odds with Weibull baseline, under light censoring (LC) and heavy censoring (HC)

| Model | n | Param | True | Bias (LC) | SE (LC) | RMSE (LC) | CP95 (LC) | Bias (HC) | SE (HC) | RMSE (HC) | CP95 (HC) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Geometric | 800 | α | –0.75 | 0.006 | 0.119 | 0.109 | 96.134 | 0.008 | 0.142 | 0.13 | 96.482 |
| | | γ0 | 0.571 | 0.003 | 0.035 | 0.025 | 97.597 | 0.003 | 0.039 | 0.026 | 98.593 |
| | | γ1 | 0.307 | 0.003 | 0.027 | 0.027 | 96.029 | 0.005 | 0.04 | 0.04 | 94.684 |
| | | β1 | 0.405 | –0.003 | 0.149 | 0.143 | 95.716 | 0.002 | 0.2 | 0.2 | 95.888 |
| | | β2 | 0.327 | 0.005 | 0.083 | 0.081 | 95.82 | 0.004 | 0.102 | 0.1 | 95.884 |
| | | 1/ξ | 0.152 | –0.039 | 0.107 | 0.062 | 71.578 | –0.035 | 0.114 | 0.058 | 75.879 |
| Poisson | 800 | α | –0.75 | 0.004 | 0.085 | 0.082 | 95.929 | 0 | 0.098 | 0.097 | 95.391 |
| | | γ0 | 0.571 | 0.005 | 0.024 | 0.024 | 94.781 | 0.003 | 0.026 | 0.027 | 94.589 |
| | | γ1 | 0.307 | 0.003 | 0.021 | 0.021 | 94.572 | 0.003 | 0.028 | 0.029 | 94.289 |
| | | β1 | 0.405 | –0.001 | 0.149 | 0.147 | 95.929 | 0.002 | 0.19 | 0.195 | 94.689 |
| | | β2 | 0.327 | 0 | 0.082 | 0.083 | 94.05 | 0.003 | 0.098 | 0.098 | 95.09 |
| | | 1/ξ | 0.152 | –0.042 | 0.058 | 0.062 | 63.779 | –0.036 | 0.062 | 0.056 | 71.643 |
| Bernoulli | 800 | α | –0.75 | 0.005 | 0.076 | 0.075 | 95.687 | 0 | 0.084 | 0.084 | 94.4 |
| | | γ0 | 0.571 | 0.005 | 0.026 | 0.025 | 95.587 | 0.003 | 0.028 | 0.027 | 95.2 |
| | | γ1 | 0.307 | 0.003 | 0.017 | 0.017 | 95.186 | 0.003 | 0.021 | 0.022 | 95 |
| | | β1 | 0.405 | 0.006 | 0.148 | 0.155 | 94.283 | 0.005 | 0.185 | 0.183 | 95.6 |
| | | β2 | 0.327 | 0.002 | 0.082 | 0.083 | 93.882 | 0.001 | 0.097 | 0.099 | 94.4 |
| | | 1/ξ | 0.152 | –0.042 | 0.069 | 0.064 | 69.007 | –0.036 | 0.072 | 0.059 | 74.6 |
| Geometric | 2000 | α | –0.75 | 0.003 | 0.115 | 0.092 | 97.355 | 0.001 | 0.135 | 0.1 | 92.026 |
| | | γ0 | 0.571 | 0.001 | 0.027 | 0.02 | 98.105 | 0.002 | 0.031 | 0.02 | 96.229 |
| | | γ1 | 0.307 | 0.003 | 0.028 | 0.022 | 98.083 | 0.002 | 0.042 | 0.03 | 93.827 |
| | | β1 | 0.405 | –0.004 | 0.15 | 0.122 | 97.679 | 0.01 | 0.198 | 0.151 | 91.918 |
| | | β2 | 0.327 | 0.004 | 0.083 | 0.07 | 97.376 | –0.002 | 0.099 | 0.077 | 91.866 |
| | | 1/ξ | 0.152 | –0.026 | 0.049 | 0.05 | 78.312 | –0.02 | 0.055 | 0.042 | 83.162 |
| Poisson | 2000 | α | –0.75 | 0.001 | 0.054 | 0.054 | 94.751 | –0.001 | 0.062 | 0.062 | 94.954 |
| | | γ0 | 0.571 | 0.001 | 0.015 | 0.015 | 95.304 | 0 | 0.016 | 0.017 | 93.493 |
| | | γ1 | 0.307 | 0.001 | 0.013 | 0.013 | 94.475 | 0 | 0.017 | 0.017 | 95.485 |
| | | β1 | 0.405 | –0.001 | 0.094 | 0.094 | 95.028 | 0.003 | 0.119 | 0.117 | 94.29 |
| | | β2 | 0.327 | 0.001 | 0.052 | 0.052 | 94.061 | –0.001 | 0.061 | 0.062 | 93.094 |
| | | 1/ξ | 0.152 | –0.008 | 0.034 | 0.022 | 92.403 | –0.007 | 0.034 | 0.022 | 92.961 |
| Bernoulli | 2000 | α | –0.75 | 0.001 | 0.048 | 0.048 | 95.286 | 0 | 0.053 | 0.052 | 95 |
| | | γ0 | 0.571 | 0.003 | 0.016 | 0.016 | 95.286 | 0.001 | 0.017 | 0.016 | 96.2 |
| | | γ1 | 0.307 | 0.001 | 0.011 | 0.011 | 93.882 | 0.002 | 0.013 | 0.013 | 94.3 |
| | | β1 | 0.405 | 0.002 | 0.093 | 0.092 | 94.885 | 0.007 | 0.116 | 0.115 | 94.8 |
| | | β2 | 0.327 | 0.001 | 0.052 | 0.05 | 95.085 | –0.002 | 0.061 | 0.061 | 94.8 |
| | | 1/ξ | 0.152 | –0.024 | 0.036 | 0.044 | 77.332 | –0.02 | 0.037 | 0.039 | 81.8 |

7 Illustrative Analysis of Cutaneous Melanoma

Let us now consider the cutaneous melanoma data described earlier, in which the subjects were divided into four nodule categories (x = 0, 1, 2, 3), with corresponding sample sizes n_1 = 111, n_2 = 137, n_3 = 87, n_4 = 82. The percentages of censored observations for the four groups are 67.57, 61.31, 52.87 and 32.93%. These data, originally reported in [10, 12], have been analyzed earlier in [5, 15], for example, by assuming some other cure rate models. See Fig. 1 for a plot of the lifetimes of patients with and without ulceration status.

Fig. 1 Lifetimes of patients in the cutaneous melanoma data with and without ulceration (ULC)

For these data, we fitted the Geometric (φ = 0), COM-Poisson (φ = 0.5), Poisson (φ = 1), COM-Poisson (φ = 2) and Bernoulli (φ → ∞) cure rate models under proportional odds models with gamma frailty, with Weibull and log-logistic baseline distributions. Tables 3 and 4 present the MLEs and SEs of the parameters under the different cure rate models with gamma frailty. While implementing the EM algorithm, 40,000 sample points were simulated for the Monte Carlo integration required in the expectation step. In order to check the accuracy of the estimates obtained by terminating the iterative process when the absolute difference between two consecutive maximized likelihood values is less than 10^{-7}, and also the use of 40,000 samples for the Monte Carlo integration, we generated 100,000 sample points for the Monte Carlo integration for the Bernoulli cure rate model with gamma frailty under the proportional odds model with Weibull baseline, and set the convergence criterion to be an absolute difference between two consecutive likelihood values of less than 10^{-8}. We then obtained the estimates as –0.5221, 0.4663, 0.2847, –0.9557, 0.3975, 0.1166, with corresponding standard errors 0.1224, 0.0475, 0.0383, 0.2921, 0.1063, 2.2382, for the parameters α, γ_0, γ_1, β_0, β_1, 1/ξ, respectively. We observe that these are quite close to the values reported in Tables 3 and 4.

Table 3 MLEs of the model parameters for different PO cure models with gamma frailty (W = Weibull baseline, LL = log-logistic baseline)

| Par | W, φ=0 | W, φ=0.5 | W, φ=1 | W, φ=2 | W, φ=∞ | LL, φ=0 | LL, φ=0.5 | LL, φ=1 | LL, φ=2 | LL, φ=∞ |
|---|---|---|---|---|---|---|---|---|---|---|
| α | –0.3271 | –0.3897 | –0.4224 | –0.4622 | –0.5250 | –0.2877 | –0.3591 | –0.3903 | –0.4268 | –0.4882 |
| γ0 | 0.4542 | 0.4616 | 0.4631 | 0.4637 | 0.4636 | 3.5800 | 3.4719 | 3.3720 | 3.2505 | 3.1431 |
| γ1 | 0.2591 | 0.2660 | 0.2713 | 0.2778 | 0.2835 | 2.3258 | 2.3033 | 2.3058 | 2.3155 | 2.3284 |
| β0 | –0.8749 | –0.8990 | –0.9111 | –0.9273 | –0.9543 | –0.7693 | –0.7682 | –0.7707 | –0.7759 | –0.7851 |
| β1 | 0.3717 | 0.3793 | 0.3829 | 0.3879 | 0.3972 | 0.3713 | 0.3699 | 0.3709 | 0.3732 | 0.3767 |
| 1/ξ | 0.1483 | 0.1393 | 0.1390 | 0.1400 | 0.1405 | 0.1509 | 0.1487 | 0.1484 | 0.1479 | 0.1479 |

Table 4 SEs of the MLEs of the model parameters for different PO cure models with gamma frailty

| Baseline | φ | α | γ0 | γ1 | β0 | β1 | 1/ξ |
|---|---|---|---|---|---|---|---|
| Weibull | 0 | 0.1433 | 0.0481 | 0.0430 | 0.2996 | 0.1056 | 0.2960 |
| Weibull | 0.5 | 0.1326 | 0.0482 | 0.0412 | 0.2978 | 0.1061 | 0.2771 |
| Weibull | 1 | 0.1293 | 0.0484 | 0.0404 | 0.2964 | 0.1061 | 0.2757 |
| Weibull | 2 | 0.1267 | 0.0487 | 0.0396 | 0.2948 | 0.1062 | 0.2769 |
| Weibull | ∞ | 0.1237 | 0.0486 | 0.0386 | 0.2925 | 0.1064 | 0.2772 |
| log-logistic | 0 | 0.1642 | 0.7984 | 0.2426 | 0.3212 | 0.1149 | 0.3187 |
| log-logistic | 0.5 | 0.1482 | 0.7252 | 0.2412 | 0.3204 | 0.1141 | 0.3115 |
| log-logistic | 1 | 0.1345 | 0.5891 | 0.2404 | 0.3227 | 0.1154 | 0.3139 |
| log-logistic | 2 | 0.1399 | 0.6371 | 0.2407 | 0.3217 | 0.1148 | 0.3111 |
| log-logistic | ∞ | 0.1642 | 0.7984 | 0.2426 | 0.3212 | 0.1149 | 0.3064 |

Table 5 presents the AIC, BIC and l̂ values for the proportional odds cure rate models under Weibull and log-logistic baseline distributions, with and without the frailty term. The proportional odds model under the Weibull baseline with frailty term has the largest l̂, but the increase in log-likelihood is not large enough for the AIC or BIC to select the model with frailty over the model without the frailty term. The proportional odds model under the log-logistic baseline with gamma frailty term has a lower l̂. The MLEs and SEs of the cure rate model with gamma frailty under the proportional odds model with log-logistic baseline are presented in Table 6. The frailty parameter ξ becomes very large, so that the proportional odds frailty cure rate model tends to the proportional odds model with the ordinary cure rate model, and the log-likelihood values for the models with and without the frailty term turn out to be the same; see Table 7 for the corresponding l̂, AIC and BIC values. One more important quantity of interest is the probability that an individual is cured, conditional on that individual having survived up to a specific time t, i.e., P(I = 0 | T > t). The estimate of this probability is given by


Table 5 AIC, BIC and l̂ for different models for the cutaneous melanoma data

| Model | | φ=0 (Geometric) | φ=0.5 | φ=1 (Poisson) | φ=2 | φ=∞ (Bernoulli) |
|---|---|---|---|---|---|---|
| PO frailty CRM Weibull | AIC | 1026.8062 | 1026.8642 | 1027.0028 | 1027.1876 | 1027.3077 |
| | BIC | 1051.1469 | 1051.2049 | 1051.3435 | 1051.5283 | 1051.6484 |
| | l̂ | –507.4031 | –507.4321 | –507.5014 | –507.5938 | –507.6538 |
| PO frailty CRM log-logistic | AIC | 1025.0739 | 1025.0517 | 1025.0276 | 1024.9765 | 1024.8932 |
| | BIC | 1049.4146 | 1049.3924 | 1049.3683 | 1049.3172 | 1049.2339 |
| | l̂ | –506.5369 | –506.5259 | –506.5138 | –506.4882 | –506.4466 |
| PO CRM Weibull | AIC | 1025.644 | 1026.014 | 1026.374 | 1026.862 | 1027.414 |
| | BIC | 1045.809 | 1046.179 | 1046.539 | 1047.027 | 1047.579 |
| | l̂ | –507.822 | –508.007 | –508.187 | –508.431 | –508.707 |
| PO CRM log-logistic | AIC | 1022.863 | 1022.845 | 1022.821 | 1022.768 | 1022.683 |
| | BIC | 1043.029 | 1043.011 | 1042.986 | 1042.933 | 1042.849 |
| | l̂ | –506.432 | –506.423 | –506.41 | –506.384 | –506.342 |
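The AIC and BIC entries follow from l̂ via the usual definitions AIC = −2l̂ + 2k and BIC = −2l̂ + k log n. Assuming k = 5 free parameters for the PO CRM without frailty and n = 417 patients (111 + 137 + 87 + 82), this reproduces the log-logistic φ = 0 entries to rounding:

```python
import math

def aic_bic(loglik, k, n):
    # AIC = -2 l + 2k;  BIC = -2 l + k log n.
    return -2.0 * loglik + 2.0 * k, -2.0 * loglik + k * math.log(n)

# PO CRM log-logistic at phi = 0: l-hat = -506.432, assumed k = 5, n = 417.
aic, bic = aic_bic(-506.432, k=5, n=417)  # roughly (1022.864, 1043.030)
```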

Table 6 MLEs and SEs of the parameters in the CRM with gamma frailty under PO with log-logistic baseline (convergence criterion: absolute difference between two consecutive likelihood values less than 10^{-10})

| Par | MLE, φ=0 | φ=0.5 | φ=1 | φ=2 | φ=∞ | SE, φ=0 | φ=0.5 | φ=1 | φ=2 | φ=∞ |
|---|---|---|---|---|---|---|---|---|---|---|
| α | –0.271 | –0.343 | –0.374 | –0.411 | –0.473 | 0.059478 | 0.138998 | 0.134526 | 0.130778 | 0.135997 |
| γ0 | 3.460 | 3.355 | 3.256 | 3.137 | 3.033 | 0.121427 | 0.651613 | 0.609339 | 0.562915 | 0.301182 |
| γ1 | 2.262 | 2.240 | 2.242 | 2.253 | 2.266 | 0.026793 | 0.187205 | 0.187253 | 0.187609 | 0.186913 |
| β0 | –0.775 | –0.774 | –0.776 | –0.782 | –0.791 | 0.223368 | 0.31806 | 0.318435 | 0.319483 | 0.236902 |
| β1 | 0.373 | 0.372 | 0.373 | 0.375 | 0.379 | 0.091576 | 0.113683 | 0.113849 | 0.114416 | 0.090989 |
| 1/ξ | 0.0003 | 0.0003 | 0.0003 | 0.0003 | 0.0003 | 5.87E-16 | 8.73E-22 | 7.85E-22 | 6.96E-22 | 5.06E-22 |

Table 7  AIC, BIC and l̂ for CRM with gamma frailty under PO with log-logistic baseline (convergence criterion: difference between two consecutive likelihood values less than 10⁻¹⁰; φ = 0: geometric; φ = 1: Poisson; φ = ∞: Bernoulli)

            PO frailty CRM log-logistic        PO CRM log-logistic
φ           AIC        BIC        l̂            AIC       BIC       l̂
0           1024.8640  1049.2047  −506.432     1022.863  1043.029  −506.432
0.5         1024.8456  1049.1863  −506.423     1022.845  1043.011  −506.423
1           1024.8212  1049.1619  −506.411     1022.821  1042.986  −506.41
2           1024.7684  1049.1091  −506.384     1022.768  1042.933  −506.384
∞           1024.6836  1049.0243  −506.342     1022.683  1042.849  −506.342

Proportional Odds COM-Poisson Cure Rate Model with Gamma Frailty


Fig. 2 Cure rate, given that an individual has survived up to a specific time t, over the four covariate groups (from left to right, proportional odds cure rate models with log-logistic baseline and φ = 0, 0.5, 1, 2, ∞)

P̂(I = 0 | T > t) = p0 / ((1 − p0) ∫₀^∞ S(t|r) f(r) dr + p0),   (47)

where S(t|r) is the survival function under the proportional odds model, given by

S(t|r) = r S0(t) e^{α′x} / (r S0(t) e^{α′x} + F0(t)).   (48)

These conditional probabilities of cure, for individuals having survived up to time t, for the four nodule categories of the cutaneous melanoma data are presented in Figs. 2 and 3 for Weibull and log-logistic baselines, respectively.
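To make Eq. (47) concrete, the sketch below evaluates the conditional cure probability numerically, combining the proportional odds survival of Eq. (48) (here with a Weibull-type baseline) with a mean-one gamma frailty density; the cure probability p0, the linear predictor value alpha_x and all parameter values are hypothetical, not estimates from the paper.

```python
import math

def cure_prob_given_survival(t, p0, alpha_x, gamma0, gamma1, xi, grid=4000, rmax=20.0):
    """Eq. (47): P(I=0 | T > t) = p0 / ((1 - p0) * Su(t) + p0), where
    Su(t) = int_0^inf S(t|r) f(r) dr marginalizes the proportional odds
    survival S(t|r) of Eq. (48) over a mean-one gamma frailty density f
    with shape xi.  All parameter values are hypothetical."""
    s0 = math.exp(-((gamma1 * t) ** (1.0 / gamma0)))  # Weibull baseline S0(t)
    big_f0 = 1.0 - s0                                 # F0(t) = 1 - S0(t)
    u = math.exp(alpha_x)                             # e^{alpha'x}
    su, h = 0.0, rmax / grid                          # trapezoidal rule over r
    for i in range(grid + 1):
        r = i * h
        dens = (xi ** xi) * r ** (xi - 1.0) * math.exp(-xi * r) / math.gamma(xi) if r > 0 else 0.0
        s_cond = 1.0 if big_f0 == 0.0 else r * s0 * u / (r * s0 * u + big_f0)  # Eq. (48)
        su += (0.5 if i in (0, grid) else 1.0) * s_cond * dens * h
    return p0 / ((1.0 - p0) * su + p0)
```

At t = 0 every susceptible is still alive, so the conditional cure probability reduces to p0; as t grows it increases towards 1, which is the qualitative behaviour shown in Figs. 2 and 3.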

8 Concluding Remarks

In this work, we have developed a flexible proportional odds COM-Poisson cure rate model with gamma frailty, in which the lifetime distribution of the susceptibles has either a Weibull or a log-logistic baseline. An EM algorithm has been developed for the maximum likelihood estimation of the parameters of the proposed cure rate model. An extensive Monte Carlo simulation study has been performed, varying sample sizes, censoring proportions, cure rates, and the parameter values of the different distributions, to evaluate the performance of the proposed methodology. Overall, the method provides accurate estimates of the model parameters as


Fig. 3 Cure rate, given that an individual has survived up to a specific time t, over the four covariate groups (from left to right, proportional odds cure rate models with Weibull baseline and φ = 0, 0.5, 1, 2, ∞)

well as of the cure rates. Moreover, real data on cutaneous melanoma have been analyzed and model discrimination has also been performed. There are many possible directions for future work. A natural extension of this work would be to consider different clusters with respect to the frailty term. One may also consider the use of a non-parametric specification of the baseline distribution in the proportional odds model for the lifetimes of susceptibles, instead of using parametric forms such as the Weibull and log-logistic, as done here. In addition, a destructive cure rate model, including a damage or destruction term for the initial risk factors, can also be considered in the context of the proportional odds model with frailty. We are currently working on some of these problems and hope to present the findings in a future paper.

Appendix

Expressions for Weibull Baseline

In this case, we have the following expressions:

S0(t) = e^{−(γ1 t)^{1/γ0}},   f0(t) = ((γ1 t)^{1/γ0}/(γ0 t)) e^{−(γ1 t)^{1/γ0}};

the derivatives of S0 are given by

S0;γ0 = ((γ1 ti)^{1/γ0} log(γ1 ti)/γ0²) S0,   S0;γ1 = −((γ1 ti)^{1/γ0}/(γ0 γ1)) S0,

S0;γ0γ0 = S0;γ0 ([(γ1 ti)^{1/γ0} − 1] log(γ1 ti) − 2γ0)/γ0²,

S0;γ1γ0 = S0;γ1 ([(γ1 ti)^{1/γ0} − 1] log(γ1 ti) − γ0)/γ0²,

S0;γ1γ1 = S0;γ1 (1 − γ0 − (γ1 ti)^{1/γ0})/(γ0 γ1);
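The reconstructed first derivatives above can be sanity-checked numerically; the sketch below compares S0;γ0, S0;γ1 and f0 = −dS0/dt against central finite differences at hypothetical parameter values (a verification aid, not part of the original derivation).

```python
import math

def S0(t, g0, g1):
    # Weibull-type baseline survival S0(t) = exp(-(g1*t)**(1/g0))
    return math.exp(-((g1 * t) ** (1.0 / g0)))

def f0(t, g0, g1):
    # baseline density f0(t) = (g1*t)**(1/g0) / (g0*t) * S0(t)
    return ((g1 * t) ** (1.0 / g0)) / (g0 * t) * S0(t, g0, g1)

def dS0_dg0(t, g0, g1):
    # analytic S0;gamma0 from the appendix
    a = (g1 * t) ** (1.0 / g0)
    return a * math.log(g1 * t) / g0 ** 2 * S0(t, g0, g1)

def dS0_dg1(t, g0, g1):
    # analytic S0;gamma1 from the appendix
    a = (g1 * t) ** (1.0 / g0)
    return -a / (g0 * g1) * S0(t, g0, g1)

# central finite differences at hypothetical values t=1.5, g0=0.7, g1=1.3
t, g0, g1, h = 1.5, 0.7, 1.3, 1e-6
num_g0 = (S0(t, g0 + h, g1) - S0(t, g0 - h, g1)) / (2 * h)
num_g1 = (S0(t, g0, g1 + h) - S0(t, g0, g1 - h)) / (2 * h)
num_f0 = -(S0(t + h, g0, g1) - S0(t - h, g0, g1)) / (2 * h)  # f0 = -dS0/dt
```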

the derivatives of log S0 are given by

∂logS0/∂γ0 = (γ1 ti)^{1/γ0} log(γ1 ti)/γ0²,   ∂logS0/∂γ1 = −(γ1 ti)^{1/γ0}/(γ0 γ1),
∂²logS0/∂γ0² = −(∂logS0/∂γ0)(log(γ1 ti)/γ0² + 2/γ0),
∂²logS0/∂γ0∂γ1 = −(∂logS0/∂γ1)(log(γ1 ti)/γ0² + 1/γ0),   ∂²logS0/∂γ1² = (∂logS0/∂γ1)(1/(γ0 γ1) − 1/γ1);

the derivatives of log f0 are given by

∂log f0/∂γ0 = ∂logS0/∂γ0 − (1/γ0)(1 + log(γ1 ti)/γ0),   ∂log f0/∂γ1 = 1/(γ0 γ1) + ∂logS0/∂γ1,
∂²log f0/∂γ0² = ∂²logS0/∂γ0² + 1/γ0² + 2 log(γ1 ti)/γ0³,   ∂²log f0/∂γ0∂γ1 = ∂²logS0/∂γ0∂γ1 − 1/(γ0² γ1),
∂²log f0/∂γ1² = ∂²logS0/∂γ1² − 1/(γ0 γ1²);

the derivatives of S are given by

∂S/∂γ0 = (∂S0/∂γ0) G0,   ∂S/∂γ1 = (∂S0/∂γ1) G0,   ∂S/∂αl = xil F0 S/G,
∂²S/∂γ0² = (∂²S0/∂γ0²) G0 − 2 (∂S/∂γ0)(∂S0/∂γ0) G1,
∂²S/∂γ0∂γ1 = (∂²S0/∂γ0∂γ1) G0 − 2 (∂S/∂γ0)(∂S0/∂γ1) G1,
∂²S/∂γ1² = (∂²S0/∂γ1²) G0 − 2 (∂S/∂γ1)(∂S0/∂γ1) G1,
∂²S/∂γ0∂αl = xil G2 ∂S/∂γ0,   ∂²S/∂γ1∂αl = xil G2 ∂S/∂γ1,   ∂²S/∂αl∂αl′ = xil′ G2 ∂S/∂αl;

the derivatives of log S are given by


∂logS/∂γ0 = (∂logS0/∂γ0)(1/G),   ∂logS/∂γ1 = (∂logS0/∂γ1)(1/G),
∂²logS/∂γ0² = (1/G)(∂²logS0/∂γ0² − (∂logS0/∂γ0)(∂S0/∂γ0) G1),
∂²logS/∂γ0∂γ1 = (1/G)(∂²logS0/∂γ0∂γ1 − (∂logS0/∂γ0)(∂S0/∂γ1) G1),
∂²logS/∂γ1² = (1/G)(∂²logS0/∂γ1² − (∂logS0/∂γ1)(∂S0/∂γ1) G1),
∂²logS/∂γ0∂αl = −xil ∂S/∂γ0,   ∂²logS/∂γ1∂αl = −xil ∂S/∂γ1,
∂logS/∂αl = xil F0/G,   ∂²logS/∂αl∂αl′ = −xil′ ∂S/∂αl,

where G = 1 + S0(yi e^{α′xi} − 1), G0 = f/f0, G1 = (yi e^{α′xi} − 1)/G, G2 = 2 F0/G − 1 and G3 = (γ1 ti)^{1/γ0} S0 G1 + 1; and finally, the derivatives of log f are given by

∂log f/∂γ0 = ∂log f0/∂γ0 − 2 (∂S0/∂γ0) G1,   ∂log f/∂γ1 = ∂log f0/∂γ1 − 2 (∂S0/∂γ1) G1,
∂²log f/∂γ0² = ∂²log f0/∂γ0² − 2 (∂²S0/∂γ0²) G1 + 2 ((∂S0/∂γ0) G1)²,
∂²log f/∂γ0∂γ1 = ∂²log f0/∂γ0∂γ1 − 2 (∂²S0/∂γ0∂γ1) G1 + 2 (∂S0/∂γ0)(∂S0/∂γ1)(G1)²,
∂²log f/∂γ1² = ∂²log f0/∂γ1² − 2 (∂²S0/∂γ1²) G1 + 2 ((∂S0/∂γ1) G1)²,
∂²log f/∂γ0∂αl = −2 (∂S0/∂γ0) xil yi e^{α′xi}/G²,   ∂²log f/∂γ1∂αl = −2 (∂S0/∂γ1) xil yi e^{α′xi}/G²,
∂log f/∂αl = xil (2(1 − S0)/G − 1),   ∂²log f/∂αl∂αl′ = −2 xil xil′ S0 F0 yi e^{α′xi}/G².


Expressions for Log-Logistic Baseline

In this case, we have the following expressions:

S0 = γ0^{γ1}/(ti^{γ1} + γ0^{γ1}),   f0 = γ0^{γ1} γ1 ti^{γ1−1}/(ti^{γ1} + γ0^{γ1})²,

S = γ0^{γ1} yi e^{α′x}/(yi γ0^{γ1} e^{α′x} + ti^{γ1}),   f = γ0^{γ1} γ1 ti^{γ1−1} yi e^{α′xi}/(ti^{γ1} + γ0^{γ1} yi e^{α′xi})².
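The four expressions above satisfy two simple identities — S0 + F0 = 1 and S = u S0/(u S0 + F0) with odds multiplier u = yi e^{α′x} — which the following sketch (with hypothetical parameter values) confirms numerically.

```python
def loglogistic_po(t, g0, g1, u):
    """Log-logistic baseline S0, F0 and the proportional-odds survival
    S = u*S0 / (u*S0 + F0) with odds multiplier u = y*exp(alpha'x)
    (hypothetical values; a sketch of the Appendix expressions)."""
    s0 = g0 ** g1 / (t ** g1 + g0 ** g1)
    f0_cdf = t ** g1 / (t ** g1 + g0 ** g1)
    s = u * s0 / (u * s0 + f0_cdf)
    s_closed = g0 ** g1 * u / (u * g0 ** g1 + t ** g1)  # closed form from the text
    return s0, f0_cdf, s, s_closed

s0, F0, s, s_closed = loglogistic_po(t=2.0, g0=1.5, g1=2.2, u=1.7)
```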

The derivatives of S0(ti; γ) are given by

∂S0(ti; γ)/∂γ0 = F0 S0 γ1/γ0,   ∂S0(ti; γ)/∂γ1 = F0 S0 log(γ0/ti),
∂²S0(ti; γ)/∂γ0² = −(∂S0(ti; γ)/∂γ0)(1 + γ1 S0(ti; γ))/γ0 + (γ1²/γ0²) F0² S0,
∂²S0(ti; γ)/∂γ0∂γ1 = (∂S0(ti; γ)/∂γ0)(1/γ1 + S0(ti; γ) log(ti/γ0)) + (γ1/γ0) F0 · F0 S0 log(γ0/ti),
∂²S0(ti; γ)/∂γ1² = (∂S0(ti; γ)/∂γ1) S0(ti; γ) log(ti/γ0) + F0 log(γ0/ti) · F0 S0 log(γ0/ti);   (49)

the derivatives of log S0(ti, γ) are given by

∂logS0(ti; γ)/∂γ0 = ti^{γ1}(γ1/γ0)/(γ0^{γ1} + ti^{γ1}),   ∂logS0(ti; γ)/∂γ1 = ti^{γ1} log(γ0/ti)/(γ0^{γ1} + ti^{γ1}),
∂²logS0(ti; γ)/∂γ0² = −(∂logS0(ti; γ)/∂γ0)(1 + γ1 S0(ti; γ))/γ0,
∂²logS0(ti; γ)/∂γ0∂γ1 = (∂logS0(ti; γ)/∂γ0)(1/γ1 + S0(ti; γ) log(ti/γ0)),
∂²logS0(ti; γ)/∂γ1² = (∂logS0(ti; γ)/∂γ1) S0(ti; γ) log(ti/γ0);

the derivatives of log f0(ti, γ) are given by


∂log f0(ti, γ)/∂γ0 = γ1(ti^{γ1} − γ0^{γ1})/(γ0(γ0^{γ1} + ti^{γ1})),
∂log f0(ti, γ)/∂γ1 = 1/γ1 + ((γ0^{γ1} − ti^{γ1})/(γ0^{γ1} + ti^{γ1})) log(ti/γ0),
∂²log f0(ti, γ)/∂γ0² = −(∂log f0(ti, γ)/∂γ0)(1/γ0) − 2 γ0^{γ1} ti^{γ1} γ1²/(γ0²(γ0^{γ1} + ti^{γ1})²),
∂²log f0(ti, γ)/∂γ0∂γ1 = (∂log f0(ti, γ)/∂γ0)(1/γ1) + (2 γ0^{γ1} ti^{γ1} γ1/(γ0(γ0^{γ1} + ti^{γ1})²)) log(ti/γ0),
∂²log f0(ti, γ)/∂γ1² = −(2 γ0^{γ1} ti^{γ1}/(γ0^{γ1} + ti^{γ1})²) log²(ti/γ0) − 1/γ1²;

the derivatives of S(ti|yi) are given by

∂S(ti)/∂γ0 = F(ti) S(ti) γ1/γ0,   ∂S(ti)/∂γ1 = −S(ti) F(ti) log(ti/γ0),   ∂S(ti)/∂αl = xil F(ti) S(ti),
∂²S(ti)/∂γ0² = (∂S(ti)/∂γ0)(γ1(F(ti) − S(ti)) − 1)/γ0,
∂²S(ti)/∂γ0∂γ1 = (∂S(ti)/∂γ0)(1/γ1 − (F(ti) − S(ti)) log(ti/γ0)),
∂²S(ti)/∂γ0∂αh = (∂S(ti)/∂γ0)(F(ti) − S(ti)) xih,
∂²S(ti)/∂γ1² = (∂S(ti)/∂γ1)(S(ti) − F(ti)) log(ti/γ0),
∂²S(ti)/∂γ1∂αh = (∂S(ti)/∂γ1)(F(ti) − S(ti)) xih,
∂²S(ti)/∂αl∂αl′ = (∂S(ti)/∂αl)(F(ti) − S(ti)) xil′;

the derivatives of log S(ti|yi) are given by

∂logS(ti)/∂γ0 = F(ti) γ1/γ0,   ∂logS(ti)/∂γ1 = F(ti) log(γ0/ti),   ∂logS(ti)/∂αl = xil F(ti),
∂²logS(ti)/∂γ0² = −(∂logS(ti)/∂γ0)(1 + γ1 S(ti))/γ0,
∂²logS(ti)/∂γ0∂γ1 = (∂logS(ti)/∂γ0)(1/γ1 + S(ti) log(ti/γ0)),
∂²logS(ti)/∂γ0∂αh = −(∂logS(ti)/∂γ0) S(ti) xil,
∂²logS(ti)/∂γ1² = (∂logS(ti)/∂γ1) S(ti) log(ti/γ0),
∂²logS(ti)/∂γ1∂αl = −(∂logS(ti)/∂γ1) S(ti) xil,
∂²logS(ti)/∂αl∂αl′ = −xil′ ∂S(ti)/∂αl;

and finally, the derivatives of log f (ti |yi ) are given by


∂log f(ti)/∂γ0 = Vi γ1/γ0,   ∂log f(ti)/∂γ1 = 1/γ1 − Vi log(ti/γ0),   ∂log f(ti)/∂αl = xil Vi,
∂²log f(ti)/∂γ0² = −(∂log f(ti)/∂γ0)(1/γ0) − Wi γ1²/γ0²,
∂²log f(ti)/∂γ0∂γ1 = (∂log f(ti)/∂γ0)(1/γ1) + Wi (γ1/γ0) log(ti/γ0),
∂²log f(ti)/∂γ0∂αl = −Wi xil γ1/γ0,
∂²log f(ti)/∂γ1² = −Wi log²(ti/γ0) − 1/γ1²,
∂²log f(ti)/∂γ1∂αl = Wi log(ti/γ0) xil,
∂²log f(ti)/∂αl∂αl′ = −xil xil′ Wi,

where Vi = F(ti) − S(ti) and Wi = 2 γ0^{γ1} yi e^{α′x} ti^{γ1}/((γ0^{γ1} yi e^{α′x} + ti^{γ1})²).
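As a quick numerical check of the first expression above, the sketch below compares ∂log f(ti)/∂γ0 = Vi γ1/γ0 with a central finite difference of the log-density, using hypothetical parameter values and writing u = yi e^{α′x}.

```python
import math

def log_f(t, g0, g1, u):
    # log density of the proportional-odds log-logistic model with odds
    # multiplier u = y*exp(alpha'x); all parameter values are hypothetical
    return (g1 * math.log(g0) + math.log(g1) + (g1 - 1.0) * math.log(t)
            + math.log(u) - 2.0 * math.log(u * g0 ** g1 + t ** g1))

def dlogf_dg0(t, g0, g1, u):
    # analytic derivative (g1/g0) * V with V = F - S = 1 - 2S, from the appendix
    s = u * g0 ** g1 / (u * g0 ** g1 + t ** g1)
    v = 1.0 - 2.0 * s
    return g1 / g0 * v

t, g0, g1, u, h = 2.0, 1.5, 2.2, 1.7, 1e-6
num = (log_f(t, g0 + h, g1, u) - log_f(t, g0 - h, g1, u)) / (2 * h)
```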


On Residual Analysis in the GMANOVA-MANOVA Model Béatrice Byukusenge, Dietrich von Rosen, and Martin Singull

Abstract In this article, the GMANOVA-MANOVA model is considered. Two different matrix residuals are established. The interpretation of the residuals is discussed and several properties are verified. A data set illustrates how the residuals can be used.

1 Introduction The generalized multivariate analysis of variance model was introduced by [7], although similar models existed earlier, see [10] for more details and references. This model is also known in the literature as the Growth Curve model or GMANOVA model and is suitable for the analysis of balanced repeated measurements. Over the years it has been extensively studied. A generalization of this model called Extended Growth Curve model has been considered in the 80s by different authors and its definition and history can be found in for example [10]. In this paper a special case of an Extended Growth Curve model is studied, where the first term models the profile (growth curve) and the second term is a MANOVA model which takes care of the covariables (see, e.g., [1, 8]). This special case of the Extended Growth Curve model is often referred to as the GMANOVA-MANOVA model. Note that the covariates also can be modelled via the GMANOVA part and

B. Byukusenge (B) · D. von Rosen · M. Singull Department of Mathematics, Linköping University, Linköping, Sweden e-mail: [email protected] D. von Rosen e-mail: [email protected] M. Singull e-mail: [email protected] B. Byukusenge Department of Mathematics, University of Rwanda, Kigali, Rwanda D. von Rosen Energy and Technology, Swedish University of Agricultural Sciences, Uppsala, Sweden © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_24


instead the MANOVA part includes, say, the treatment effects which are of primary interest. The main goal of this article is to introduce a new pair of residuals for the GMANOVA-MANOVA model. Residuals are useful for quality assessment of the underlying distributional assumptions. Previously, residuals in an Extended Growth Curve model have been studied by [4]. In Sect. 2 the GMANOVA-MANOVA model is defined and maximum likelihood estimators are presented in Sect. 3. Via these estimators two residuals are proposed, and in Sect. 4 interpretations and properties of the residuals are established. Data is used throughout the article to illustrate the ideas.

2 GMANOVA-MANOVA Model

Some notation and definitions that will be useful in this paper are now presented. Let C(A) denote the column space of the matrix A. For any positive definite matrix S and any matrix A of proper size, P_{A,S} = A(A′S⁻¹A)⁻A′S⁻¹ denotes the projector on the space C_S(A), where the subscript S in C_S(A) indicates that the inner product is defined via the positive definite matrix S⁻¹. In these expressions "⁻" denotes an arbitrary g-inverse. Note that the projector, P_{A,S}, is independent of the choice of g-inverse. If S = I, where I is the identity matrix, instead of C_I(A) we write C(A), and instead of P_{A,I} we use P_A.

Definition 2.1 Let X be a p × n matrix of observations, where n represents the number of subjects, each measured at p occasions. The GMANOVA-MANOVA model is defined by

X = A B1 C1 + B2 C2 + E,

(1)

where A : p × m, C1 : r1 × n and C2 : r2 × n are known design matrices, B1 : m × r1 and B2 : p × r2 are unknown parameter matrices, and the random error matrix E is such that its columns are assumed to be independently distributed following a p-variate normal distribution with mean zero and an unknown positive definite dispersion matrix Σ, i.e., E ∼ N_{p,n}(0, Σ, I).

Example 2.1 Similarly to [2, 5], two treatments for patients suffering from multiple sclerosis are considered. Assume 69 patients suffering from the disease were recruited into a study. Of these patients, 35 were randomized to receive one medicine (M1) alone, say Group 1. The other 34 patients, say Group 2, received the first medicine (M1) together with a second medicine (M2). For each patient in the study, a measure of autoimmunity, AFCR, was sampled at clinic visits: at baseline (time 0, initiation of the treatment) and at 6, 12, and 18 months. Multiple sclerosis affects the immune system: low values of AFCR (approaching 0) give evidence that immunity is improving, which is hopefully associated with a better prognosis for sufferers of


the disease. Also recorded for each patient were age at entry into the study and an indicator of whether or not the patient had previous treatment with either of the two study agents (0 = no, 1 = yes). The complete set of observations X : 4 × 69 together with the covariates is shown in the Appendix. In order to handle this example the design matrices of the GMANOVA-MANOVA model are given by

    A = ⎛1  0⎞     C1 = ⎛1′34  0′35⎞     C2 = ⎛48 … 57    0 … 0⎞
        ⎜1  6⎟          ⎝0′34  1′35⎠          ⎝ 0 … 0   46 … 55⎠      (2)
        ⎜1 12⎟
        ⎝1 18⎠

with B 1 : 2 × 2 and B 2 : 4 × 2. Thus a linear growth model is assumed and age is used as a covariate. Here, it is assumed that the age effect is different for the two treatment groups. In (2) 1a stands for the vector of a ones and 0a stands for the vector of a zeroes. However, one can also assume that the age effect is the same for the groups and then

C2 = (48 … 57  46 … 55),

B 2 : 4 × 1.

Another

model would be to include the previous treatment as a covariate; then C2 = (0′13 1′22 0′10 1′24) and B2 : 4 × 1.
The likelihood function for the model in (1) can be written

L(B1, B2, Σ) = (2π)^{−pn/2} |Σ|^{−n/2} exp{−(1/2) tr{Σ⁻¹(X − A B1 C1 − B2 C2)()}},

where (Q)() means (Q)(Q)′ for any matrix expression Q, | · | is the determinant and tr(·) stands for the trace function. From differentiation of the likelihood function it follows that the likelihood equations equal

A′Σ⁻¹(X − A B1 C1 − B2 C2)C1′ = 0,
Σ⁻¹(X − A B1 C1 − B2 C2)C2′ = 0,            (3)
nΣ = (X − A B1 C1 − B2 C2)().

Furthermore, let

Q_{C2′} = I − P_{C2′},   S = X Q_{C2′}(I − P_{Q_{C2′}C1′}) Q_{C2′} X′.   (4)

Then, assuming that n is so large that S−1 exists, after some calculations the solution of the system (3) is given by


B̂1 = (A′S⁻¹A)⁻A′S⁻¹X Q_{C2′} C1′(C1 Q_{C2′} C1′)⁻ + (A′)° Z1 + A′ Z2 (C1 Q_{C2′})°′,   (5)

B̂2 = (X − A B̂1 C1) C2′(C2 C2′)⁻ + Z3 C2°′,   (6)

nΣ̂ = (X − A B̂1 C1 − B̂2 C2)() = S + (I − P_{A,S}) X P_{Q_{C2′}C1′} X′(I − P_{A,S})′,   (7)

where Z1–Z3 are arbitrary matrices of proper size and, for any matrix H, H° is a matrix spanning the orthogonal complement to the column space generated by H. Thus B̂1 and B̂2 are not unique, whereas Σ̂ satisfies this important property. However, using the estimators (5) and (6), the estimated mean, i.e., the predicted value, equals

X̂ = A B̂1 C1 + B̂2 C2 = X P_{C2′} + P_{A,S} X Q_{C2′} C1′(C1 Q_{C2′} C1′)⁻ C1 Q_{C2′},

which is unique with respect to the choice of generalized inverse.

Theorem 2.1 Let B̂1 and B̂2 be the maximum likelihood estimators of B1 and B2 given in (5) and (6), respectively. Then

X̂ = A B̂1 C1 + B̂2 C2 = X P_{C2′} + P_{A,S} X P_{Q_{C2′}C1′},   (8)

where Q_{C2′} and S are given in (4), respectively.

Example 2.2 For the study described in Example 2.1 and the GMANOVA-MANOVA model with the matrices defined in (2), it follows that both C1 and C2 are of full rank and C(C1′) ∩ C(C2′) = {0}. The within-individuals design matrix A is also of full rank. Then Z1–Z3 equal 0 and the parameter estimates of the model are given by

B̂1 = ⎛ 6.68   9.19⎞     B̂2 = ⎛0.13  0.08⎞     Σ̂ = ⎛4.45  1.44  1.27  0.95⎞
     ⎝−0.31  −0.68⎠          ⎜0.15  0.13⎟         ⎜1.44  4.35  2.17  0.76⎟
                             ⎜0.16  0.19⎟         ⎜1.27  2.17  4.32  0.87⎟
                             ⎝0.18  0.25⎠         ⎝0.95  0.76  0.87  3.22⎠

According to B̂1 there may be differences between the treatment groups. For the covariate estimator B̂2 the age effect seems to be almost constant for Group 1, whereas for Group 2 the effect increases. The estimated variance decreases somewhat over time. In Fig. 1 the repeated measurements have been plotted group-wise together with the estimated mean profiles, given the sample mean age per group. It seems like both treatments (M1 or M1 + M2) lower AFCR over the 18-month period, i.e., there is a positive treatment effect for both groups. However, according to the statistical paradigm, assumptions and results should be validated, in particular the model fit together with the estimates. This is often carried out by studying residuals. If one looks closer at the data one can find some observations that maybe do not follow the model and thereby can have an impact on the results and conclusions.
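The projector P_{A,S} = A(A′S⁻¹A)⁻A′S⁻¹ that drives these estimators can be illustrated with a small pure-Python sketch; for simplicity it takes S = I and the full-column-rank within-individuals design matrix A from Example 2.1, so an ordinary inverse replaces the g-inverse (a simplification, not the general case).

```python
def matmul(A, B):
    # plain list-of-lists matrix product
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv2(M):
    # inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def projector(A):
    """P_A = A (A'A)^{-1} A', i.e. P_{A,S} with S = I and A of full column
    rank 2 (a sketch; the paper allows general S and a g-inverse)."""
    At = transpose(A)
    return matmul(matmul(A, inv2(matmul(At, A))), At)

# within-individuals design matrix A from Example 2.1
A = [[1, 0], [1, 6], [1, 12], [1, 18]]
P = projector(A)
P2 = matmul(P, P)   # idempotency: P*P = P
PA = matmul(P, A)   # P projects onto C(A): P*A = A
```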


Fig. 1 All observations mentioned in Example 2.1 are presented (observations within an individual are joined with straight lines) and for each group the estimated mean profiles are shown

3 Residuals

The next lemma will be used several times in the sequel.

Lemma 3.1 Suppose that a partitioned matrix C consists of two parts, i.e., C = (A : B). Then the projection matrix P_C satisfies

P_C = P_A + P_{(I−P_A)B} = P_A + (I − P_A)B(B′(I − P_A)B)⁻B′(I − P_A).

The most common way to construct residuals is by subtracting the fitted model from the observations. This means that residuals for the GMANOVA-MANOVA model are given by

R = X − X̂ = X − (X P_{C2′} + P_{A,S} X P_{Q_{C2′}C1′}) = X Q_{C2′} − P_{A,S} X P_{Q_{C2′}C1′}.   (9)

Equivalently, using Lemma 3.1, (9) can be written


Fig. 2 Based on Theorem 2.1 and Definition 2.1 decomposition of the whole space according to the within and between individuals design matrices illustrating the mean and residual spaces

R = X(Q_{C2′} − P_{Q_{C2′}C1′}) + (I − P_{A,S}) X P_{Q_{C2′}C1′}
  = X(I − P_{C2′:C1′}) + (I − P_{A,S}) X P_{Q_{C2′}C1′}
  = P_{A,S} X(I − P_{C2′:C1′}) + (I − P_{A,S}) X(I − P_{C2′:C1′}) + (I − P_{A,S}) X P_{Q_{C2′}C1′}.   (10)

The relation in (10) leads to the following definitions of residuals.

Definition 3.1 Let Q_{C2′} and S be given in (4). Then residuals for the GMANOVA-MANOVA model in (1) can be defined as

R1 = X(I − P_{C2′:C1′}),   (11)
R2 = (I − P_{A,S}) X P_{Q_{C2′}C1′}.   (12)

Moreover, R1 = R11 + R12, where

R11 = P_{A,S} X(I − P_{C2′:C1′}),   R12 = (I − P_{A,S}) X(I − P_{C2′:C1′}).   (13)


The residuals in Definition 3.1 and the model are illustrated in Fig. 2, where a tensor space decomposition is presented. Furthermore, for a matrix A let A° be any matrix of full rank spanning the orthogonal complement to C(A), with respect to the standard inner product, i.e., C(A°) = C(A)⊥. Using the well-known relation

I − P_{A,S} = I − A(A′S⁻¹A)⁻A′S⁻¹ = S A°(A°′S A°)⁻A°′ = P_{A°,S⁻¹},

the residual R2 equals

R2 = (I − P_{A,S}) X P_{Q_{C2′}C1′} = P_{A°,S⁻¹} X P_{Q_{C2′}C1′}.

4 Properties of R1 and R2

4.1 Interpretation

The residuals in Definition 3.1 have a clear meaning, which will be discussed in this section. Consider first the residual R1, given in (11):

R1 = X(I − P_{C2′:C1′}) = X − X P_{C2′} − X P_{(I−P_{C2′})C1′},   (14)

where Lemma 3.1 has been used. Thus R1 is the difference between the observations X and the "group mean" X P_{C2′:C1′}. Moreover, X − X P_{C2′} means that X has been adjusted with the effect from the covariate, and X P_{(I−P_{C2′})C1′} is an adjusted "mean" effect. Specifically, R1 provides information about the between-individuals assumption in a given group. Therefore, it can be used to detect observations which deviate from the others without taking into account any model assumption. In this regard, R11, given in (13), is the difference between the observations X and the mean X P_{C2′:C1′} relative to the within-individuals model. It can therefore be used for detecting if observations do not follow the "within-individuals" model assumptions. Similarly, R12, given in (13), is the difference between the observations X and the mean X P_{C2′:C1′} relative to the case where the within-individuals model assumptions do not hold. For R2 in (12), the residual can be written as

R2 = (I − P_{A,S}) X P_{Q_{C2′}C1′}
   = (I − P_{A,S}) X P_{Q_{C2′}C1′} + X P_{C2′} + P_{A,S} X P_{Q_{C2′}C1′} − (A B̂1 C1 + B̂2 C2)
   = X(P_{Q_{C2′}C1′} + P_{C2′}) − (A B̂1 C1 + B̂2 C2).

Since P_{C2′:C1′} = P_{Q_{C2′}C1′} + P_{C2′},

R2 = X P_{C2′:C1′} − (A B̂1 C1 + B̂2 C2),   (15)


Fig. 3 For the data in Tables 1 and 2 in the Appendix the residuals R1 and R2 , given in (11) and (12), are presented. The figures show the residuals per individual at time points 0–18

which is the observed "mean" X P_{C2′:C1′} minus the estimated mean structure (the model), i.e., A B̂1 C1 + B̂2 C2 = X P_{C2′} + P_{A,S} X P_{Q_{C2′}C1′}, and therefore R2 tells us how well the estimated mean structure fits the observed mean. More specifically, it includes information about the within-individuals assumptions, i.e., the model. Therefore, R2 provides information about the appropriateness of the model assumptions with respect to the mean structure (the profile).

Example 4.1 Consider the residuals R1 and R2, defined in (11) and (12), using the estimates from Example 2.2. These residuals are plotted in Fig. 3. For R1 one can see some observations that seem to deviate from the group means. Also in R2 one can see some repeated measurements (mostly at time 12 months) that could deviate from the linear growth assumption. To understand whether these deviations are crucial one can perturb/change some repeated measurements. We will now carry this out in three different ways: (i) perturb one patient at one time point; (ii) perturb one patient at all four time points; (iii) change the linear growth assumption to be quadratic for one group.

(i): Patient 1 at 6 months is changed, by reducing its response by 50%, see Fig. 4. The effect on the residual R1 is seen in Fig. 5, whereas it also can be seen that this change does not affect R2. Hence, one can conclude that for Patient 1, at 6 months,


Fig. 4 For the data in Tables 1 and 2 in the Appendix when the response from Patient 1 is reduced with 50% at Month 6

the AFCR value deviates from the group mean, but the within-individuals linear growth assumption is not violated.
(ii): The observations at all time points for Patient 1 are lowered by 50%, see Fig. 6. Again, one can recognize the effect in the residual R1, see Fig. 7, i.e., in this case all time points for Patient 1 differ from the group mean, as they should. Moreover, also with this perturbation R2 is not affected.
(iii): Finally, Group 1 is contaminated so that it seems to follow a quadratic mean instead of the assumed linear mean, see Fig. 8. In this case the resulting residual R1 is not affected, whereas R2 is. This can be noticed in Fig. 9, where the residuals for Group 1 are larger than for the original data with a linear growth, i.e., the assumed linear model does not seem appropriate for the contaminated Group 1 data, which follow a quadratic growth.

4.2 Properties

In regression analysis it is well known that the ordinary residuals in the univariate linear model are symmetrically distributed around zero and are uncorrelated with the estimated model. In this section similar results are derived for the GMANOVA-MANOVA model E(X) = AB1C1 + B2C2. It has been shown, see [4], that residuals defined for the Growth Curve model similarly to those in Definition 3.1 are symmetrically distributed around zero. Moreover, [9] obtained some moment relations for the residuals. Reference [4] obtained the same type of results for the Extended Growth Curve model with the nested subspace condition C(C2′) ⊆ C(C1′). To have a nested subspace condition on the between-individuals design matrices is


Fig. 5 For the data in Tables 1 and 2 in the Appendix residuals R1 and R2 are calculated when the response from Patient 1 is reduced with 50% at Month 6

Fig. 6 For the data in Tables 1 and 2 in the Appendix when the response from Patient 1 is reduced with 50% at all time points


Fig. 7 For the data in Tables 1 and 2 in the Appendix residuals R1 and R2 are calculated when the response from Patient 1 is reduced with 50% at all time points

Fig. 8 For the data in Tables 1 and 2 in the Appendix when all individuals in Group 1 are contaminated so that patients follow a quadratic growth


Fig. 9 For the data in Tables 1 and 2 in the Appendix residuals R1 and R2 are calculated when all individuals in Group 1 are contaminated so that patients follow a quadratic growth instead of a linear growth

equivalent (from an estimation point of view) to having a nested subspace condition on the within-individuals design matrices. In our model, no nested subspace condition C(C2′) ⊆ C(C1′) is assumed to hold, but it is always true that C(A) ⊆ C(I). For more information on these conditions in the Extended Growth Curve model see [3, 10], where both versions of nestedness are treated. With the assumption of normality, the following theorem shows that the residuals are symmetrically distributed around zero. Later it will also be proven that the residuals R1 and R2 are uncorrelated. The next lemma presents some technical results which will be used when establishing the theorem.

Lemma 4.1 Let C1, C2 be as in model (1), let Q_{C2′} be defined in (4) and let Q_{C2′}C1′ be the projection of the columns of C1′ onto the orthogonal complement of C(C2′). Then,

C2 P_{Q_{C2′}C1′} = 0,
Q_{C2′} − P_{Q_{C2′}C1′} = I − P_{C2′:C1′},
C(Q_{C2′}C1′) = C(C2′)⊥ ∩ {C(C2′) + C(C1′)}.

Proof Since C2 Q_{C2′} = 0, C2 P_{Q_{C2′}C1′} = 0. Furthermore, Lemma 3.1 implies that P_{C2′:C1′} = P_{C2′} + P_{Q_{C2′}C1′} and therefore


Q_{C2′} − P_{Q_{C2′}C1′} = I − P_{C2′:C1′}.

Utilizing the fact that Q_{C2′} is an orthogonal projection matrix, Theorem 1.2.16 in [6] implies

C(Q_{C2′}C1′) = C(Q_{C2′}) ∩ {C(C1′) + C(Q_{C2′})⊥} = C(C2′)⊥ ∩ {C(C1′) + C(C2′)}.  □

Now the expectations, E(·), of the residuals R1 and R2 will be considered.

Theorem 4.1 Let R1 and R2 be the residuals defined in (11) and (12). Then, for i ∈ {1, 2}, E(Ri) = 0.

Proof Since I − P_{C1′:C2′} is the projection matrix on C(C1′ : C2′)⊥,

E(R1) = E(X(I − P_{C1′:C2′})) = (AB1C1 + B2C2)(I − P_{C1′:C2′}) = 0.

For the residual R2, due to independence between S and X P_{Q_{C2′}C1′},

E(R2) = E(P_{A°,S⁻¹} X P_{Q_{C2′}C1′}) = E(P_{A°,S⁻¹} E(X P_{Q_{C2′}C1′})).

Hence, using Lemma 4.1 and that P_{A°,S⁻¹} A = 0,

E(R2) = E(P_{A°,S⁻¹} (AB1C1 + B2C2) P_{Q_{C2′}C1′}) = 0.
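Lemma 4.1's second identity, Q_{C2′} − P_{Q_{C2′}C1′} = I − P_{C2′:C1′}, which the proof above relies on, can also be verified numerically; the sketch below uses small toy design vectors (hypothetical, loosely patterned on Example 2.1) and computes the right-hand side independently via the normal equations.

```python
def mm(A, B):
    # matrix product for plain list-of-lists matrices
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def proj_vec(v):
    # rank-one projector v v' / (v'v) for a column vector v (n x 1)
    s = sum(x[0] * x[0] for x in v)
    return [[vi[0] * vj[0] / s for vj in v] for vi in v]

def proj2(M):
    # P_M = M (M'M)^{-1} M' for an n x 2 matrix of full column rank
    Mt = [list(r) for r in zip(*M)]
    (a, b), (c, d) = mm(Mt, M)
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return mm(mm(M, inv), Mt)

# toy designs (hypothetical): C2' = intercept column, C1' = time column
c2 = [[1], [1], [1], [1]]
c1 = [[0], [6], [12], [18]]
I4 = [[float(i == j) for j in range(4)] for i in range(4)]
Pc2 = proj_vec(c2)
Q = [[I4[i][j] - Pc2[i][j] for j in range(4)] for i in range(4)]   # Q_{C2'}
Pq = proj_vec(mm(Q, c1))                                           # P_{Q_{C2'}C1'}
lhs = [[Q[i][j] - Pq[i][j] for j in range(4)] for i in range(4)]
Pm = proj2([[c2[i][0], c1[i][0]] for i in range(4)])               # P_{C2':C1'}
rhs = [[I4[i][j] - Pm[i][j] for j in range(4)] for i in range(4)]
```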

 In the next theorem the dispersion matrices, D(·), for the residuals R1 and R2 are presented. Let ρ( Q) denote the rank of a matrix Q, ⊗ the Kronecker product and vec(·) the usual vectorization operator. Theorem 4.2 Let R1 and R2 be the residuals respectively defined in (11) and (12). Then D(R1 ) = (I − P C 1 :C 2 ) ⊗ Σ, D(R2 ) = P Q C  C 1 2   n − ρ(C 1 : C 2 ) − 2( p − ρ( A)) − 1  −1 −  A( A . ⊗ Σ− Σ A) A n − ρ(C 1 : C 2 ) − ( p − ρ( A)) − 1 Proof Consider D(R1 ), and because I − P C 1 :C 2 is idempotent D(R1 ) = D(X(I − P C 1 :C 2 )) = (I − P C 1 :C 2 ) ⊗ Σ.


B. Byukusenge et al.

For D(R_2) it follows, since E(R_2) = 0,

D(R_2) = E(vec(P_{A°,S^{-1}} X P_{Q_{C_2}C_1'}) vec'(P_{A°,S^{-1}} X P_{Q_{C_2}C_1'}))
       = P_{Q_{C_2}C_1'} ⊗ E(P_{A°,S^{-1}} Σ P'_{A°,S^{-1}})
       = P_{Q_{C_2}C_1'} ⊗ ( Σ − [(n − ρ(C_1:C_2) − 2(p − ρ(A)) − 1) / (n − ρ(C_1:C_2) − (p − ρ(A)) − 1)] A(A'Σ^{-1}A)^- A' ),

where the last equality follows from the same calculations as when deriving the expectation of the maximum likelihood estimator of the dispersion in the Growth Curve model (see [10, p. 113]).  □

Example 4.2 Consider again the residuals R_1 and R_2 in Example 4.1 and Fig. 3. Since R_1 = X(I − P_{C_1:C_2}) is a linear transformation of the matrix normally distributed observation matrix X, we know that R_1 must also be matrix normal, i.e., R_1 ∼ N_{p,n}(0, Σ, I − P_{C_1:C_2}). The distribution of R_2 is not straightforward to obtain since it includes the projection P_{A,S}, which is a function of S.

Finally, the pairs among R_1, R_2, X̂ and B̂_j, j ∈ {1, 2}, which are uncorrelated, are presented. Note, however, that a covariance, Cov(·,·), equal to 0 does not imply independence.

Theorem 4.3 Let R_1 and R_2 be the residuals defined in (11) and (12), respectively. Then Cov(R_1, R_2) = 0, Cov(R_1, B̂_j) = 0, j ∈ {1, 2}, and Cov(R_1, X̂) = 0, where it is assumed that B̂_j, j ∈ {1, 2}, is uniquely estimated.

Proof First, Cov(R_1, R_2) = 0 is proven. Since E(R_1) = 0,

Cov(R_1, R_2) = E(vec(X(I − P_{C_1:C_2})) vec'(P_{A°,S^{-1}} X P_{Q_{C_2}C_1'}))

and uncorrelatedness follows because X(I − P_{C_1:C_2}) as well as S are independently distributed of X P_{Q_{C_2}C_1'}, i.e.,

Cov(R_1, R_2) = E(vec(X(I − P_{C_1:C_2})) E(vec'(X P_{Q_{C_2}C_1'})){I ⊗ P'_{A°,S^{-1}}}),

and E(vec'(X P_{Q_{C_2}C_1'})) = vec'(A B_1 C_1 P_{Q_{C_2}C_1'}) implies

E(vec'(X P_{Q_{C_2}C_1'}){I ⊗ P'_{A°,S^{-1}}}) = 0,

which establishes the first statement. For showing Cov(R_1, B̂_j) = 0, j ∈ {1, 2}, it is assumed that B̂_j, j ∈ {1, 2}, have been uniquely estimated. Then


Fig. 10 Absolute value of the standardized individual residuals in R1

Cov(R_1, B̂_1) = Cov(X(I − P_{C_1:C_2}), (A'S^{-1}A)^{-1} A'S^{-1} X Q_{C_2} C_1' (C_1 Q_{C_2} C_1')^{-1}) = 0,

because E(R_1) = 0, and X Q_{C_2} is independent of X(I − P_{C_1:C_2}) and S. Moreover,

Cov(R_1, B̂_2) = Cov(R_1, X C_2'(C_2 C_2')^{-1}) − Cov(R_1, A B̂_1 C_1 C_2'(C_2 C_2')^{-1}) = 0,

since Cov(R_1, B̂_1) = 0 and R_1 is independently distributed of X C_2'. Finally, it is noted that Cov(R_1, X̂) = 0 because Cov(R_1, B̂_1) = 0 and Cov(R_1, B̂_2) = 0.  □

Example 4.3 The individual residuals (components) in R_1 and R_2 are correlated, as shown in Theorem 4.2. Moreover, one can study standardized residuals, i.e., the individual residuals divided by their estimated standard deviations. In Figs. 10 and 11 the residuals of Fig. 3 are presented, where the absolute values of the residuals are divided by the estimated standard deviations (i.e., the diagonals of the covariance matrices in Theorem 4.2 with Σ replaced by Σ̂ given in (7)) to give a clear presentation. Since R_1 is the difference between the observations and the "group mean" (without taking into account any model assumption), there should be no structure in these residuals, as in Fig. 10. However, in Fig. 11 we see the group structure for the absolute values of the individual standardized R_2 residuals. The residual R_2 concerns how well the estimated mean structure fits the observed mean, i.e., on group level. Hence, when studying the individual standardized R_2 it is enough to consider one individual per group.
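The projector identities from Lemmas 3.1 and 4.1, on which the expectation and covariance results above rest, can be checked numerically. The following sketch is an illustration only (not part of the paper): it builds orthogonal projectors P_M = M(M'M)^+M' onto column spaces, with random matrices standing in for the (transposed) design matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

def proj(M):
    # Orthogonal projector onto the column space of M
    return M @ np.linalg.pinv(M.T @ M) @ M.T

n, k1, k2 = 12, 3, 2
C1 = rng.standard_normal((n, k1))   # stands in for C_1' (columns span C(C_1'))
C2 = rng.standard_normal((n, k2))   # stands in for C_2'

Q2 = np.eye(n) - proj(C2)           # Q_{C_2}: projector onto C(C_2')^perp

# Lemma 3.1: P_{C_2:C_1} = P_{C_2'} + P_{Q_{C_2}C_1'}
P_joint = proj(np.hstack([C2, C1]))
ok1 = np.allclose(P_joint, proj(C2) + proj(Q2 @ C1))

# Lemma 4.1: Q_{C_2} - P_{Q_{C_2}C_1'} = I - P_{C_2:C_1}
ok2 = np.allclose(Q2 - proj(Q2 @ C1), np.eye(n) - P_joint)

print(ok1 and ok2)   # True
```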


Fig. 11 Absolute value of the standardized individual residuals in R2

5 Concluding Remarks

In this paper we have derived and interpreted the two main residuals for the GMANOVA-MANOVA model. Both residuals have a clear meaning. The first residual is the difference between the observations and the group mean corrected for the covariate. Hence, it gives information about the between-individuals assumption in a given group and can be used to detect observations which deviate from the others, without taking into account any model assumption. The first residual can be divided into two parts which are relative to the within-individuals model assumption. The second residual is the observed group mean, corrected for the covariate, minus the fitted mean model. Hence, it describes how well the estimated mean structure fits the observed group mean, i.e., it relates to the within-individuals structure and gives information about the mean model assumptions. However, to understand and test whether some observations deviate from the model assumptions, we need to consider the properties and distributions of the residuals. In this paper we have given properties such as expectation, dispersion and relevant covariances. In a numerical example we have also seen how the residuals are affected when observations are perturbed and do not follow the model assumptions. In future research it is worth studying how to find critical values to test whether some residuals are large enough to conclude a violation of the model assumptions.

Acknowledgements The research of Béatrice Byukusenge has been supported by the sub-program of Applied Mathematics and Statistics under the Sida-funded bilateral program, The University of Rwanda-Sweden Programme for Research, Higher Education and Institutional Advancement. Dietrich von Rosen is supported by the Swedish Research Council (2017-03003).


Appendix

See Tables 1 and 2.

Table 1 Repeated measurements for individuals in Group 1, i.e., treated with azathioprine alone, together with the covariates 'Previously treated' and 'Age'. Columns 0, 6, 12 and 18 are the time points

Id   0      6      12     18     Group  Previously treated  Age
1    10.50  11.20  12.60  10.90  1      0                   48
2    11.90  9.80   12.70  9.50   1      0                   53
3    9.50   11.40  5.40   10.50  1      0                   46
4    13.90  14.70  15.00  10.80  1      0                   56
5    15.10  11.20  9.80   10.10  1      0                   52
6    12.50  12.40  7.60   6.50   1      0                   44
7    11.80  12.40  13.40  8.20   1      0                   44
8    11.80  10.20  9.00   11.80  1      0                   35
9    14.00  13.40  11.30  9.30   1      0                   50
10   13.40  11.70  11.10  7.90   1      0                   57
11   13.00  11.20  9.00   9.70   1      0                   52
12   14.10  10.60  9.00   9.50   1      0                   46
13   14.70  11.30  8.20   9.20   1      0                   49
14   17.00  16.90  15.00  12.30  1      1                   61
15   15.20  13.90  13.00  14.40  1      1                   47
16   11.20  15.20  11.20  12.50  1      1                   59
17   15.20  13.70  14.90  13.20  1      1                   58
18   12.20  11.70  9.50   7.70   1      1                   45
19   12.60  13.70  13.00  13.40  1      1                   57
20   15.10  11.70  10.10  9.70   1      1                   50
21   14.80  13.80  15.10  10.70  1      1                   58
22   9.90   11.80  10.90  13.10  1      1                   53
23   12.20  12.50  8.80   11.20  1      1                   53
24   14.40  8.60   10.10  10.00  1      1                   41
25   14.70  14.50  15.60  16.10  1      1                   63
26   14.90  13.30  10.70  7.90   1      1                   58
27   11.40  10.20  11.10  8.90   1      1                   55
28   18.70  13.10  12.10  11.70  1      1                   52
29   13.00  14.20  10.30  9.00   1      1                   53
30   9.90   13.50  8.90   8.10   1      1                   47
31   10.80  8.00   7.70   8.10   1      1                   44
32   10.90  10.30  8.20   9.80   1      1                   49
33   10.20  14.70  13.00  11.00  1      1                   53
34   15.70  13.60  10.80  9.50   1      1                   45
35   15.50  12.20  17.30  10.20  1      1                   57


Table 2 Repeated measurements for individuals in Group 2, i.e., treated with azathioprine and methylprednisolone, together with the covariates 'Previously treated' and 'Age'. Columns 0, 6, 12 and 18 are the time points

Id   0      6      12     18     Group  Previously treated  Age
36   16.10  11.80  9.00   9.40   2      0                   46
37   15.70  10.10  10.70  10.50  2      0                   45
38   10.20  9.40   10.50  4.70   2      0                   44
39   13.30  12.30  12.80  12.90  2      0                   55
40   13.70  13.70  12.40  11.10  2      0                   58
41   15.10  12.50  10.30  9.50   2      0                   51
42   12.10  10.80  13.20  11.40  2      0                   55
43   8.00   10.20  7.60   2.00   2      0                   32
44   13.30  8.90   9.30   8.00   2      0                   45
45   8.80   5.70   6.40   6.50   2      0                   47
46   11.20  10.00  8.50   5.80   2      1                   37
47   15.70  11.80  10.90  12.40  2      1                   52
48   9.70   13.60  12.40  7.30   2      1                   46
49   11.80  10.00  10.40  10.90  2      1                   50
50   15.40  17.00  12.60  10.60  2      1                   53
51   11.60  5.50   5.40   6.40   2      1                   41
52   12.90  14.00  13.50  11.70  2      1                   61
53   11.60  12.20  9.90   9.80   2      1                   42
54   15.40  14.60  13.00  11.50  2      1                   45
55   16.20  13.20  10.30  8.10   2      1                   49
56   14.00  11.90  9.30   8.50   2      1                   49
57   14.40  14.00  10.70  10.40  2      1                   48
58   16.10  10.80  8.30   8.80   2      1                   54
59   15.00  7.20   9.00   10.40  2      1                   52
60   12.20  12.70  10.10  7.90   2      1                   55
61   14.60  16.40  10.80  11.40  2      1                   54
62   16.20  15.40  11.60  12.20  2      1                   54
63   14.00  11.70  11.40  11.30  2      1                   54
64   14.50  14.20  14.50  7.30   2      1                   52
65   10.90  10.60  9.00   13.10  2      1                   56
66   14.60  14.50  12.50  6.80   2      1                   38
67   11.80  11.30  11.20  10.10  2      1                   57
68   14.70  11.40  13.30  8.90   2      1                   46
69   10.60  9.50   8.40   10.50  2      1                   55


References

1. Chinchilli, V.M., Elswick, R.K.: A mixture of the MANOVA and GMANOVA models. Commun. Stat.-Theory Methods 14, 3075–3089 (1985)
2. Ellison, G.W., Myers, L.W., Mickey, M.R., Graves, M.C., Tourtellotte, W.W., Syndulko, K., Holevoet-Howson, M.I., Lerner, C.D., Frane, M.V., Pettier-Jennings, P.: A placebo-controlled, randomized, double-masked, variable dosage, clinical trial of azathioprine with and without methylprednisolone in multiple sclerosis. Neurology 39, 1018–1026 (1989)
3. Filipiak, K., von Rosen, D.: On MLEs in an extended multivariate linear growth curve model. Metrika 75, 1069–1092 (2012)
4. Hamid, J.S., von Rosen, D.: Residuals in the extended growth curve model. Scand. J. Stat. 33, 121–138 (2006)
5. Heitjan, D.F.: Nonlinear modeling of serial immunologic data: a case study. J. Am. Stat. Assoc. 86, 891–898 (1991)
6. Kollo, T., von Rosen, D.: Advanced Multivariate Statistics with Matrices. Mathematics and its Applications, vol. 579. Springer, Dordrecht (2005)
7. Potthoff, R.F., Roy, S.N.: A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 51, 313–326 (1964)
8. von Rosen, D.: Maximum likelihood estimators in multivariate linear normal models. J. Multivar. Anal. 31, 187–200 (1989)
9. von Rosen, D.: Residuals in the growth curve model. Ann. Inst. Stat. Math. 47, 129–136 (1995)
10. von Rosen, D.: Bilinear Regression Analysis: An Introduction. Lecture Notes in Statistics, vol. 220. Springer, New York (2018)

Computational Efficiency of Bagging Bootstrap Bandwidth Selection for Density Estimation with Big Data

Daniel Barreiro-Ures, Ricardo Cao, and Mario Francisco-Fernández

Abstract Bandwidth selection is a crucial problem in kernel density estimation. However, the computational cost of some bandwidth selection methods grows very rapidly as the sample size increases. To address the problem of selecting the bandwidth of the Parzen-Rosenblatt kernel density estimator for samples of very large size, a subagging version of the bootstrap bandwidth selector is proposed and empirically studied. A heuristic rule is also proposed for the selection of the size and number of subsamples used when applying subagging.

D. Barreiro-Ures (B) · R. Cao · M. Francisco-Fernández
Department of Mathematics, CITIC, University of A Coruña, A Coruña, Spain
e-mail: [email protected]
R. Cao, e-mail: [email protected]
M. Francisco-Fernández, e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_25

1 The Bootstrap Method

Bootstrapping [6] is a resampling method that allows us to estimate the sampling distribution of a certain statistic. This is done by assuming that the sample at hand is representative of the population from which it was drawn and sampling with replacement from the sample itself. In this way, the original sampling process is imitated, but with the advantage of working with a new, fully known (bootstrap) population. A simple version of the bootstrap method, generally known as the naive bootstrap, consists of drawing the resamples from the empirical distribution function, but this approach is known to fail in several situations. A more elaborate way to carry out the resampling process is the so-called smoothed bootstrap, in which the bootstrap population is not characterized by the empirical distribution function, but instead by a smooth estimate of the unknown density function. In particular, the resamples would belong to a population whose density function is given by a kernel density estimator, f̂_g, where g is often referred to as the pilot bandwidth. Of course, this way of proceeding depends largely on the choice of g and, therefore, it is necessary to establish some optimality criterion for this parameter.

In more detail, let X denote a simple random sample of size n drawn from a population whose distribution function is given by F, and suppose that we are interested in making inference about some population parameter θ = θ(F). To do so, it is necessary to know the sampling distribution of a certain statistic R(X, F), which in many cases can take the form

R(X, F) = θ(F_n) − θ(F),

where F_n denotes the empirical distribution function of X. The sampling distribution of R(X, F) is generally unknown. The bootstrap approach starts by replacing F with an estimate, F̂. Then, from F̂ and conditionally on the sample X, we can draw resamples of size n, X*, which are usually called bootstrap samples. The idea of the bootstrap method then lies in approximating the sampling distribution of R(X, F) by the resampling distribution or bootstrap distribution of

R(X*, F̂) = θ(F_n*) − θ(F̂),

where F_n* denotes the empirical distribution function of the bootstrap sample, X*. Again, the bootstrap distribution of R(X*, F̂) is usually not computable in practice and must be approximated by Monte Carlo.
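As a minimal illustration (not from the paper), the naive bootstrap can be used to approximate the distribution of R(X, F) = θ(F_n) − θ(F) by Monte Carlo; here θ is taken to be the median and the population is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population; theta(F) is taken to be the population median.
x = rng.exponential(scale=2.0, size=200)          # observed sample X

# Naive bootstrap: resample from the empirical distribution F_n and
# approximate the distribution of R(X, F) = theta(F_n) - theta(F)
# by that of R(X*, F_hat) = theta(F_n*) - theta(F_n).
B = 2000
theta_hat = np.median(x)
r_boot = np.array([np.median(rng.choice(x, size=x.size, replace=True)) - theta_hat
                   for _ in range(B)])

# Basic percentile-type 95% confidence interval for the median
lo = theta_hat - np.quantile(r_boot, 0.975)
hi = theta_hat - np.quantile(r_boot, 0.025)
print(lo <= hi)   # True
```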

2 Bagging and Subagging Ensemble methods [10] are a family of techniques that combine the estimates or predictions of several base estimators or base models with the aim of producing a new estimator or predictor with better statistical properties. One of the most popular and widely used ensemble methods is bootstrap aggregating [2], also known as bagging, which is a resampling technique whose main purpose is to reduce the variability of a given base estimator and is best suited for high-variance low-bias estimators. In the case of estimators which are nonlinear in the observations, such as decision trees, neural networks or, more importantly for us, bandwidth selectors, it has been shown [7] that bagging can lead to substantial reductions in the variability of these estimators. More precisely, let X denote a sample of size n drawn from the distribution F, ˆ )] θˆn = θˆn (X ) a base estimator of the target parameter θ = θ (F) and θˆA = E F [θ(X the aggregated estimator (following the notation in [2]). Then we have that

E_F[(θ − θ̂_n)²] = θ² − 2θ θ̂_A + E_F[θ̂_n²] ≥ θ² − 2θ θ̂_A + θ̂_A² = (θ − θ̂_A)²,

where we have used the fact that E[Z²] ≥ E[Z]² for any random variable Z. In other words, the mean squared error of the aggregated estimator is lower than that of the base estimator, and the difference between the two depends on how unequal the two sides of

E_F[θ̂_n(X)²] ≥ E_F[θ̂_n(X)]²

are. That is, the more unstable [3] the base estimator, the greater the decrease in mean squared error achieved by the aggregated estimator. However, while the aggregated estimator depends on the distribution F, from which X was drawn, the bagging estimator, θ̂_{n,bag}, actually depends on the distribution F_X which assigns mass 1/n to each observation belonging to X. In other words,

θ̂_{n,bag} = E_{F_X}[θ̂(X*)]

and there is a point between maximum instability and maximum stability at which θ̂_{n,bag} stops improving on θ̂_n in terms of error, and in fact starts to underperform the base estimator. The bagging procedure can be summarized as follows:

Step 1. Generate a bootstrap sample of size n, X*, by sampling with replacement from X.
Step 2. Compute the bootstrap estimate θ̂_n*(X*).
Step 3. Define the bagged estimate as

θ̂_{n,bag} = E*[θ̂_n*(X*)].   (1)

It follows immediately from (1) that

θ̂_{n,bag} − θ = (θ̂_n − θ) + E*[θ̂_n* − θ̂_n],

and so we would expect the bagged estimator to be more biased than θ̂_n. The rationale for bagging is that this increase in bias will be offset by an even greater reduction in variance. Bagging has become a widely used technique, especially in the field of machine learning, and multiple variants of the method (see, for example, [4]) have been proposed over time, such as bootstrap robust aggregating (bragging), subsample aggregating (subagging) and BagBoosting. One such variant of particular interest to us is subagging, which uses subsampling to achieve a reduction not only in the variability of the estimator, but also in computational time. Subagging proceeds as follows:

Step 1. Randomly draw a subsample of size r < n, X*, by sampling without replacement from X.
Step 2. Compute the subsample estimate θ̂_r*(X*).
Step 3. Define the subagging estimate as

θ̂_{n,SB(r)} = (n choose r)^{-1} Σ_{(i_1,...,i_r)∈I} θ̂_r(X_{(i_1,...,i_r)}),   (2)

where I is the set of r-tuples whose elements in {1, . . . , n} are all distinct and X_{(i_1,...,i_r)} denotes the subsample of size r made up of the elements in X whose indices are i_1, . . . , i_r.
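As an illustration (not from the paper), a Monte Carlo version of the subagging estimate (2) — averaging over N random subsamples rather than all (n choose r) of them — might look as follows, with the interquartile range as a hypothetical nonlinear base estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)

def theta(sample):
    # Hypothetical nonlinear base estimator: the interquartile range
    q1, q3 = np.quantile(sample, [0.25, 0.75])
    return q3 - q1

def subag(sample, stat, r, N, rng):
    # Monte Carlo version of (2): average stat over N subsamples of
    # size r < n drawn without replacement (enumerating all C(n, r)
    # subsamples, as in (2), is infeasible).
    return np.mean([stat(rng.choice(sample, size=r, replace=False))
                    for _ in range(N)])

est = subag(x, theta, r=int(len(x) ** 0.7), N=50, rng=rng)
print(round(est, 2))   # population IQR of N(0, 1) is about 1.35
```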

3 Kernel Density Estimation and Bootstrap Bandwidth

Let X_1, . . . , X_n be a sample of size n whose observations are independent and identically distributed as the continuous random variable X, with density function f. Instead of assuming that f belongs to a certain parametric family of functions, as parametric techniques do, nonparametric density estimation methods do not impose such a restriction on f, but rather aim to capture its main features from the data itself. This allows one to state that, in general, nonparametric methods are more flexible than their parametric competitors. Among the available nonparametric approaches, kernel methods are perhaps the most popular. They seek to estimate f as a locally weighted average, using a kernel function as a weighting function. Aside from the kernel function, these methods are highly dependent on the choice of a tuning parameter called the bandwidth or smoothing parameter, which regulates the amount of smoothing performed by the estimator and, in turn, determines the trade-off between the bias and the variance of the estimator. The problem of bandwidth selection is therefore crucial and intrinsic to kernel methods. Different ways of addressing it have been proposed and studied over time, including cross-validation [8], bootstrapping [5] and plug-in methods [13].

The kernel density estimator or Parzen-Rosenblatt estimator [11, 12] has the following expression:

f̂_h(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h),   (3)

where K is usually assumed to be a symmetric kernel function, that is, a non-negative function such that K(x) = K(−x) and ∫_{−∞}^{∞} K(x) dx = 1. It is easy to see the important role that the bandwidth plays in (3) and how making a good choice of the bandwidth is crucial to obtaining a good density estimate. In this regard, an oft-used criterion of optimality for the bandwidth is based on the mean integrated squared error (MISE), defined as

M_n(h) = E[ ∫_{−∞}^{∞} (f̂_h(x) − f(x))² dx ].   (4)
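A direct implementation of the Parzen-Rosenblatt estimator (3) with a Gaussian kernel can be sketched as follows (a naive evaluation over a grid; the paper itself uses a binned implementation for large n):

```python
import numpy as np

def parzen_rosenblatt(x_grid, data, h):
    # Kernel density estimate (3) with the Gaussian kernel K = phi
    u = (np.asarray(x_grid)[:, None] - np.asarray(data)[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(8)
data = rng.normal(size=1000)
grid = np.linspace(-4, 4, 81)
f_hat = parzen_rosenblatt(grid, data, h=0.3)

# Trapezoidal check that the estimate integrates to (approximately) one
integral = ((f_hat[:-1] + f_hat[1:]) / 2).sum() * (grid[1] - grid[0])
print(round(integral, 2))   # 1.0
```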

Moreover, it can be shown (see, for example, [14]) that the bandwidth, h_{n0}, that minimizes (4) is asymptotic to

h_n = ( R(K) / (μ_2(K)² R(f'') n) )^{1/5},

as n → ∞, with R(g) = ∫ g(x)² dx and μ_j(g) = ∫ x^j g(x) dx (j = 0, 1, . . . ), provided that these integrals are finite. In practice, neither h_{n0} nor h_n can be computed, since both depend on unknown population quantities. The idea behind the bootstrap bandwidth selector is to approximate M_n(h) by its bootstrap counterpart, M_n*(h). In order to do this, [5] proposes the following resampling plan:

Step 1. Select a pilot bandwidth, g, and consider the Parzen-Rosenblatt estimator, f̂_g, of f.
Step 2. Draw independent bootstrap samples of size n, {X_1*, . . . , X_n*}, where each of these observations is drawn from a population whose density function is given by f̂_g. This can be done as follows:
  1. Generate a sample of size n, U_1, . . . , U_n, where U_i is drawn from a discrete uniform distribution defined on {1, . . . , n} for every i = 1, . . . , n.
  2. Generate a sample of size n, Z_1, . . . , Z_n, where Z_i is drawn from the density function K for every i = 1, . . . , n.
  3. For every i = 1, . . . , n, define X_i* = X_{U_i} + g Z_i.
Step 3. For any h > 0 denote by f̂_h* the bootstrap version of the Parzen-Rosenblatt estimator, that is,

f̂_h*(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i*)/h).

Step 4. Define the bootstrap version of M_n(h) as

M_n*(h; g) = E*[ ∫ (f̂_h*(x) − f̂_g(x))² dx ].   (5)
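Step 2 of the resampling plan amounts to adding kernel noise to a uniformly resampled observation; a minimal sketch with a Gaussian kernel and an assumed pilot bandwidth g:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500)                    # observed sample X_1, ..., X_n
g = 0.4                                     # pilot bandwidth (assumed given)

# Smoothed bootstrap resample from f_hat_g with a Gaussian kernel K = phi:
# X_i* = X_{U_i} + g * Z_i  (Step 2 of the resampling plan).
u = rng.integers(0, x.size, size=x.size)    # U_i ~ uniform on {1, ..., n}
z = rng.standard_normal(x.size)             # Z_i ~ K
x_star = x[u] + g * z

print(x_star.shape)                         # (500,)
```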

It should be noted that the bootstrap version of (4) depends on the original sample but not on the resamples; since all the terms that appear in (5) are known, it is not really necessary to draw any resamples and approximate (5) by Monte Carlo. A closed expression for (5) is found in [5], namely,

M_n*(h; g) = V_n*(h; g) + B_n*(h; g),   (6)

where

V_n*(h; g) = n^{-1} h^{-1} R(K) + n^{-3} Σ_{i=1}^n Σ_{j=1}^n ((K_h * K_g) * (K_h * K_g))(X_i − X_j),

B_n*(h; g) = n^{-2} Σ_{i=1}^n Σ_{j=1}^n ((K_h * K_g − K_g) * (K_h * K_g − K_g))(X_i − X_j),

with K_h(u) = K(u/h)/h and * denoting the convolution operation. Since the problem of choosing a pilot bandwidth is closely linked to that of estimating the curvature of f, a sensible optimality criterion for the pilot bandwidth would be to choose g such that it minimizes the mean squared error of ∫ f̂_g''(x)² dx as an estimator of ∫ f''(x)² dx. In this regard, [5] provides an expression for the dominant term of the optimal pilot bandwidth, namely

g_0 = ( R(K'') / (n μ_2(K) R(f''')) )^{1/7} + o(n^{-1/7}).

Thus, we can directly calculate the bootstrap MISE bandwidth, h*_{n0}, by minimizing (6), that is,

h*_{n0} = arg min_{h>0} { V_n*(h; g_0) + B_n*(h; g_0) }.   (7)
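Since (5) has the closed form (6), the bootstrap MISE can be evaluated exactly, without drawing resamples. The sketch below is an illustration only: it assumes a Gaussian kernel (so every convolution in V_n* and B_n* is again a normal density, via phi_s * phi_t = phi_sqrt(s²+t²)) and a fixed pilot bandwidth, and minimizes (6) over a grid of bandwidths as in (7):

```python
import numpy as np

def npdf(u, s):
    # N(0, s^2) density
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2 * np.pi))

def mise_boot(h, x, g):
    # Closed form (6) for the Gaussian kernel: every convolution in
    # V_n*(h; g) and B_n*(h; g) reduces to a normal density evaluated
    # at the pairwise differences X_i - X_j; R(phi) = 1 / (2 sqrt(pi)).
    n = x.size
    d = x[:, None] - x[None, :]
    v = 1 / (2 * np.sqrt(np.pi) * n * h) \
        + npdf(d, np.sqrt(2 * (h**2 + g**2))).sum() / n**3
    b = (npdf(d, np.sqrt(2 * h**2 + 2 * g**2))
         - 2 * npdf(d, np.sqrt(h**2 + 2 * g**2))
         + npdf(d, np.sqrt(2 * g**2))).sum() / n**2
    return v + b

rng = np.random.default_rng(4)
x = rng.normal(size=300)
g0 = 0.5                                   # assumed pilot bandwidth
hs = np.linspace(0.05, 1.0, 60)
h_boot = hs[np.argmin([mise_boot(h, x, g0) for h in hs])]   # grid version of (7)
print(h_boot > 0)   # True
```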

4 The Subagged Bootstrap Bandwidth

Due to its quadratic complexity, computing the bootstrap bandwidth defined in (7) can become too computationally expensive very quickly as the sample size increases. A possible solution to this problem is to consider a subagged version of the bootstrap bandwidth and take advantage of the computational benefits of working with subsamples of size r < n rather than with the entire sample of size n. To compute the subagged bootstrap bandwidth, we propose the following procedure:

Step 1. Independently generate N subsamples of size r < n by sampling without replacement from X_1, . . . , X_n.
Step 2. For every i = 1, . . . , N, estimate the optimal pilot bandwidth, g_0, for instance, by fitting a mixture of normals to the corresponding subsample. Denote these estimates by ĝ_{0,1}, . . . , ĝ_{0,N}.
Step 3. Compute the bootstrap bandwidths

h*_{r0,i} = arg min_{h>0} { V*_{r,i}(h; ĝ_{0,i}) + B*_{r,i}(h; ĝ_{0,i}) },  i = 1, . . . , N.

Step 4. Compute the subagged bootstrap bandwidth as the mean of the rescaled bootstrap bandwidths, that is,

ĥ*(r, N) = (1/N) Σ_{i=1}^N (r/n)^{1/5} h*_{r0,i}.   (8)
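In Monte Carlo form, the procedure above reduces to computing a bandwidth on each subsample, rescaling by (r/n)^{1/5} and averaging as in (8). In the sketch below, a simple normal-reference rule stands in for the bootstrap selector (7) on each subsample, purely to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=10_000)

def boot_bw(sample):
    # Stand-in for the bootstrap selector (7) on a subsample; here a
    # normal-reference rule is used just to keep the sketch self-contained.
    return 1.06 * sample.std() * sample.size ** (-1 / 5)

n, N = x.size, 25
r = int(n ** 0.7)                         # subsample size, r = n^0.7

h_sub = [boot_bw(rng.choice(x, size=r, replace=False)) for _ in range(N)]

# Step 4: rescale each r-sample bandwidth by (r/n)^(1/5) and average, as in (8)
h_bag = np.mean([(r / n) ** (1 / 5) * h for h in h_sub])
print(h_bag > 0)   # True
```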

It should be noted that, in the case of the bootstrap bandwidth, bagging has less room for improvement in terms of variance reduction when compared to other bandwidth selectors such as the cross-validation bandwidth, ĥ_{CV,n}. Specifically, while ĥ_{CV,n} − h_{n0} converges to a normal distribution with zero mean and constant variance at the rate n^{−3/10}, in the case of the bootstrap bandwidth it is shown in [5] that h*_{n0} − h_{n0} converges to a normal distribution with zero mean and constant variance at a faster rate, namely n^{−39/70}. In this sense, it is clear that the cross-validation bandwidth selector is a much better candidate for the application of bagging [1, 9] than the bootstrap bandwidth because of the higher variability of the former. This implies that in the case of the subagged bootstrap bandwidth, little can be expected from the use of subagging in terms of variance reduction, and its benefits are expected to be purely computational. Hence, the number of subsamples, N, may be kept at moderate to low values, and the size of the subsamples, r, should be chosen according to the cost, as a loss in statistical precision, that the user is willing to pay.

5 Simulation Studies and Application to Real Data

In this section, some numerical analyses showing the empirical behavior of the subagged bandwidth are presented. The Gaussian kernel, denoted by φ, was used in these experiments. The different programs used in this study were run in parallel on an Intel Core i5-8600K 3.6 GHz CPU using 2 cores. To illustrate the effect that r has on the computing time, Fig. 1 shows the observed CPU elapsed time for both the ordinary bootstrap bandwidth and its subagged version as a function of the sample size, n, which took the values n = 10^4, 10^5, 10^6. In accordance with the comment above, a low and fixed value of N was considered, namely, N = 25. The size of the subsamples, r, was chosen as r = n^p, with p = 0.5, 0.6, 0.7, 0.8, 0.9. From these CPU elapsed times, one could predict the expected time required to compute a bootstrap bandwidth for, say, a sample of size n = 10^8. In this case, while computing a subagged bandwidth would take no more than 53 seconds for any p between 0.5 and 0.9, it would take approximately 8 hours to compute an ordinary bootstrap bandwidth.

Fig. 1 CPU elapsed time (seconds) as a function of the sample size, n = 10^4, 10^5, 10^6. Variables are shown in logarithmic scale. For the subagged bootstrap bandwidth, the value of N was set to N = 25 and the subsample size, r, was chosen as r = n^p, with p = 0.5 (triangle point up), 0.6 (plus), 0.7 (cross), 0.8 (diamond), 0.9 (triangle point down). A binned implementation of the bandwidth selectors was considered, using 0.1n bins for the ordinary bootstrap bandwidth (circle) and 0.1r bins for the subagged bootstrap bandwidth

Now, to assess the loss in statistical precision due to the use of subagging, we simulated 100 samples of size n = 10^5 from different density functions. We denote by μ = (μ_1, . . . , μ_k), σ = (σ_1, . . . , σ_k) and w = (w_1, . . . , w_k) the mean, standard deviation and weight vectors, respectively, for the density mixture f(x) = Σ_{i=1}^k w_i φ_{μ_i,σ_i}(x), with φ_{μ_i,σ_i} a N(μ_i, σ_i) density, i = 1, . . . , k. We considered the density mixtures D1, with parameters μ = 0, σ = 1 and w = 1, D2, with parameters μ = (0, 1.5), σ = (1, 1/3) and w = (0.75, 0.25), and D3, with parameters μ = (0, −1, −0.5, 0, 0.5, 1), σ = (1, 0.1, 0.1, 0.1, 0.1, 0.1) and w = (0.5, 0.1, 0.1, 0.1, 0.1, 0.1). Densities D1, D2 and D3, which are shown in Fig. 2, can be seen as representing low, medium and high "complexity" densities, respectively.

Fig. 2 Densities D1 (solid line), D2 (dashed line) and D3 (dotted line)

Figure 3 shows the sampling distribution of ĥ_n/h_{n0} and M_n(ĥ_n)/M_n(h_{n0}), with ĥ_n denoting both the ordinary bootstrap bandwidth and the subagged bootstrap bandwidth, for densities D1, D2 and D3. The number of subsamples was set to N = 1 and the size of the subsamples was chosen as r = n^p, with p = 0.5, 0.6, 0.7, 0.8, 0.9. From Figs. 1 and 3, and as a rule of thumb, one may conclude that a sensible choice of r, in the sense of offering a certain balance between statistical precision and computational agility, would be r = n^{0.7}. As for the number of subsamples, N, as argued above, it should be kept at low values given the already low variability of the bootstrap bandwidth selector.

We shall now proceed to illustrate the performance of the bagged bootstrap bandwidth defined in (8) by applying it to a real dataset. This dataset contains two samples of size n = 105,235 with the age and hospitalization time of people infected with COVID-19 in Spain from January 1, 2020 to December 20, 2020. In what
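For reference, samples from the mixture densities used in the simulation study can be generated by first drawing a component label with probabilities w and then drawing from the corresponding normal; a sketch (not from the paper) for density D2:

```python
import numpy as np

rng = np.random.default_rng(6)

def rmix(n, mu, sigma, w, rng):
    # Draw n observations from the normal mixture sum_i w_i N(mu_i, sigma_i)
    comp = rng.choice(len(w), size=n, p=w)
    return rng.normal(np.asarray(mu)[comp], np.asarray(sigma)[comp])

# Density D2 from the simulation study: mu=(0, 1.5), sigma=(1, 1/3), w=(0.75, 0.25)
x = rmix(10_000, [0.0, 1.5], [1.0, 1 / 3], [0.75, 0.25], rng)
print(x.shape)   # (10000,)
```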


Fig. 3 Sampling distribution of hˆ n / h n0 (left panels) and Mn (hˆ n )/Mn (h n0 ) (right panels), with n = 105 and hˆ n denoting both the ordinary bootstrap bandwidth (first boxplots) and the subagged bootstrap bandwidth (second to last boxplots), for densities D1 (top), D2 (center) and D3 (bottom). The number of subsamples was set to N = 1 and the size of the subsamples was chosen as r = n p , with p = 0.5, 0.6, 0.7, 0.8, 0.9. The case p = 1 corresponds to the ordinary bootstrap bandwidth. For density D3, the case p = 0.5 was omitted because the bandwidths obtained were too large and altered the scale of the plots

follows, we will consider the subagged bootstrap bandwidth and omit the results corresponding to the ordinary bootstrap bandwidth because, generally, both selectors yielded similar results. In order to avoid boundary effects and alleviate the effect of outliers, both samples were first transformed by means of the Box-Cox family: T_age(x) = x^{1.4}/1.4 and T_time(y) = y^{0.1}/0.1 (let us denote these transformed samples by X_age = (X_1, . . . , X_n) and Y_time = (Y_1, . . . , Y_n)). The bandwidths, h_age and h_time, were then computed for these transformed samples and finally the results were detransformed and returned to their original scale by means of the kernel density estimators [15]

f̂_age(x) = (1/(n h_age)) Σ_{i=1}^n x^{0.4} φ((x^{1.4}/1.4 − X_i)/h_age)   (9)

and

f̂_time(y) = (1/(n h_time)) Σ_{i=1}^n y^{−0.9} φ((y^{0.1}/0.1 − Y_i)/h_time).   (10)


Fig. 4 Histograms and kernel density estimates (solid lines) for the age (left panel) and hospitalization time (right panel) of people infected with COVID-19 in Spain from January 1, 2020 to December 20, 2020. For the kernel density estimates, the subagged bootstrap bandwidth was considered

The subagged bootstrap bandwidths obtained for the transformed samples relative to the age of people hospitalized after being infected with COVID-19 and the hospitalization time were, respectively, h_age = 7.47 and h_time = 0.098. In both cases, the number of subsamples was set to N = 100, the size of the subsamples was chosen as r = n^{0.7} = 3,277 and the number of bins used to compute the bandwidths was 0.1r = 327. The kernel density estimators (9) and (10) are shown in Fig. 4. Although in this case the sample size at hand, n = 105,235, is rather moderate and, therefore, the computational gain from the use of subagging cannot be expected to be particularly substantial, it is worth noting that the ordinary bootstrap bandwidth was computed in 35 seconds while the subagging bandwidth required less than 2 seconds.

Acknowledgements This research has been supported by MINECO (Grant MTM2017-82724-R), MICINN (Grant PID2020-113578RB-I00), and by the Xunta de Galicia (Grupos de Referencia Competitiva ED431C-2020-14 and Centro de Investigación del Sistema Universitario de Galicia ED431G 2019/01), all of them through the ERDF. The authors would like to thank the Spanish Center for Coordinating Sanitary Alerts and Emergencies for kindly providing the COVID-19 hospitalization dataset.

References

1. Barreiro-Ures, D., Cao, R., Francisco-Fernández, M., Hart, J.D.: Bagging cross-validated bandwidths with application to big data. Biometrika 108(4), 981–988 (2021)
2. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
3. Breiman, L.: Heuristics of instability and stabilization in model selection. Ann. Stat. 24, 2350–2383 (1996)
4. Bühlmann, P., Yu, B.: Analyzing bagging. Ann. Stat. 30, 927–961 (2002)
5. Cao, R.: Bootstrapping the mean integrated squared error. J. Multivar. Anal. 45, 137–160 (1993)
6. Efron, B.: Bootstrap methods: another look at the Jackknife. Ann. Stat. 7, 1–26 (1979)
7. Friedman, J.H., Hall, P.: On bagging and nonlinear estimation. J. Stat. Plan. Inference 137, 669–683 (2007)

Computational Efficiency of Bagging Bootstrap Bandwidth …

317

8. Hall, P., Marron, J.S.: Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation. Probab. Theory Relat. Fields 74, 567–581 (1987) 9. Hall, P., Robinson, A.P.: Reducing variability of crossvalidation for smoothing-parameter choice. Biometrika 96(1), 175–186 (2009) 10. Opitz, D., Maclin, R.: Popular ensemble methods: an empirical study. J. Artif. Intell. Res. 11, 169–198 (1999) 11. Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33, 1065–1076 (1962) 12. Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. Ann. Math. Stat. 27, 832–837 (1956) 13. Sheather, S., Jones, M.: A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. Ser. B 53(3), 683–690 (1991) 14. Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability, Chapman & Hall/CRC, London (1986) 15. Wand, M.P., Marron, J.S., Ruppert, D.: Transformations in density estimation. J. Am. Stat. Assoc. 86, 343–353 (1991)

Optimal Experimental Design for Physicochemical Models: A Partial Review

Carlos de la Calle Arroyo, Jesús López-Fidalgo, and Licesio J. Rodríguez-Aragón

Abstract This paper presents a partial review of optimal experimental design applied to physicochemical models. The goal is to serve as an introduction to the discipline for all those who carry out laboratory experiments observing phenomena related to physical chemistry. The optimal design of experiments does not make sense unless the proposed designs can be implemented in practice, and therefore the involvement of the experimenters is essential. This chapter provides a motivated introduction to optimal experimental design, as well as some of the results obtained by applying these techniques to widely used physicochemical models: the Michaelis-Menten model, used in the kinetics of enzyme systems; the Arrhenius model, used to describe the relationship between the rate of a chemical reaction and the temperature; adsorption isotherms, which describe adsorption equilibrium; and the Tait equation, which characterizes relations between the pressure, volume and temperature of gases, liquids and mixtures.

C. de la Calle Arroyo (B) · L. J. Rodríguez-Aragón
Universidad de Castilla-La Mancha, Escuela de Ingeniería Industrial y Aeroespacial de Toledo, Instituto de Matemática Aplicada a la Ciencia y la Ingeniería, Avda. Carlos III s/n, 45071 Toledo, Spain
e-mail: [email protected]
L. J. Rodríguez-Aragón e-mail: [email protected]
J. López-Fidalgo
Universidad de Navarra, Institute of Data Science and Artificial Intelligence, Campus Universitario, 31009 Pamplona, Navarra, Spain
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_26

1 Optimal Experimental Designs

Optimal Experimental Design theory (OED) deals with the choice of the observations to be carried out in an experiment in order to obtain the best information about an object. Modeling and statistical inference are processes in which information is


obtained by experiment. These experiments are based on precise situations and are, on most occasions, subject to restrictions of different kinds. Mathematical models used in physical chemistry study macroscopic phenomena in chemical systems and are applied in pharmacological and industrial processes, among others. Precise estimates of the unknown parameters of these models are of great interest, not only for the possibility of reducing experimental cost, but also for quality improvement in manufacturing processes. These models include unknown parameters, and accurate estimates of them are required in order to describe how the results are expected to vary. Error in these models is usually assumed to follow a Gaussian distribution with mean zero and constant variance, although heteroscedastic assumptions can also be made. Independence of the observations can also be assumed, as well as correlation structures between observations.

An experimental design (exact design) consists of a plan of n points or observations in a given design space X. Some of these points may be equal, meaning that various experiments are replicated. The number of points, n, is predetermined by the user and is usually a result of physical budget constraints as well as the required statistical precision. A design (approximate design), ξ, can also be seen as a set of points in X, each associated with the proportion of observations to be taken. This introduces the idea of a design as a measure on X, where ξ(x) is the (approximate) proportion of observations to be taken at x ∈ X. Kiefer [15] was the first to propose this approach, and Kiefer and Wolfowitz [16] presented the General Equivalence Theorem (GET) for D- and G-optimality, which allows the optimality of a design to be checked and is the keystone of many algorithms and numerical procedures for finding optimal designs.

The target when searching for optimal experimental designs, say ξ∗, is not unique.
It can be tuned to find designs that provide the best estimates of the parameters, or of linear functions of them, that give an optimal estimation at an unobserved point or points, or that help to discriminate between competing models, etc. [12, p. 52]. Design criteria are non-increasing functions φ, to be minimized, usually applied to the Fisher Information Matrix (FIM). This function is frequently assumed to be convex and sometimes differentiable. Designs with small criterion values are desirable. For linear models the inverse of this matrix is proportional to the covariance matrix of the parameter estimates; for nonlinear models, as the model is usually linearized via a first-order Taylor expansion, the inverse of the FIM is asymptotically proportional to the covariance matrix. The FIM then depends on the unknown parameters, and the usual approaches to this issue are locally optimal designs assuming nominal values of the parameters [5], adaptive sequential designs where each new observation takes previous observations into account [6], or Bayesian optimal designs assuming prior distributions of the parameters [4].

Optimal designs are usually supported at a few points, and are therefore seen by practitioners as theoretical statistical objects, rarely to be used directly. Different experimental designs can be compared with the optimum, and with this information the design most suited to performing the experiment can be chosen from among them. The goodness of a design can be measured by its efficiency, which is linked


to a certain criterion. The efficiency of a design ξ for a criterion φ will be defined as eff_φ(ξ, ξ∗) = φ(ξ∗)/φ(ξ). The efficiencies of two designs are compared by assuming the same number of observations, n, for both designs.

Optimal experimental designs first appear in 1918, in the work of the statistician Kirstine Smith [34], and were first applied to physical chemistry, in a dilution problem, by Fisher in 1922 [13]. Since then, optimal experimental designs have been applied to a wide range of nonlinear physicochemical models [14, 17]. There is a large number of monographs (we would highlight Fedorov and Leonov [12]), as well as many other studies that can be found throughout the volumes of the mODa conference proceedings (Model-Oriented Data Analysis and Design). This study presents a partial review for some physicochemical models, and seeks to facilitate the application of optimal experimental design by practitioners.
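As a concrete illustration of the efficiency definition, the following sketch computes eff_φ(ξ, ξ∗) under D-optimality for the textbook straight-line model y = β0 + β1·x on [−1, 1]. The model and both designs are illustrative assumptions, not taken from the chapter.

```python
# Efficiency eff_phi(xi, xi_opt) = phi(xi_opt)/phi(xi) for D-optimality in
# the toy straight-line model y = b0 + b1*x on X = [-1, 1].  The criterion
# phi(xi) = det(M(xi))**(-1/m), with m = 2 parameters, is to be minimized.

def info_matrix(design):
    """M(xi) = sum_i w_i f(x_i) f(x_i)' with f(x) = (1, x)."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for x, w in design:
        f = (1.0, x)
        for a in range(2):
            for b in range(2):
                m[a][b] += w * f[a] * f[b]
    return m

def phi_D(design):
    m = info_matrix(design)
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return det ** (-0.5)

xi_opt = [(-1.0, 0.5), (1.0, 0.5)]               # D-optimal: equal weights at the endpoints
xi_unif = [(-1.0, 1/3), (0.0, 1/3), (1.0, 1/3)]  # equispaced three-point design

eff = phi_D(xi_opt) / phi_D(xi_unif)
print(round(eff, 4))  # prints 0.8165, i.e. sqrt(2/3)
```

The equispaced design loses roughly 18% efficiency relative to the optimum, which is the kind of benchmarking use of optimal designs discussed later in the chapter.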

2 Michaelis-Menten Model

The Michaelis-Menten model is the simplest model for describing the kinetics of enzyme action, and is also used in compartmental models to model the rate of change from one compartment to another. The model predicts the initial velocity rate, v, of product formation as a function of substrate concentration, s. It can be described by

$$v = \frac{V s}{K + s}, \quad s \in (0, S_{\max}], \qquad (1)$$

where V is the parameter indicating maximum velocity (saturation) and K is the Michaelis-Menten constant, which characterizes the action of the enzyme. Its value corresponds to the concentration at which half of the maximum saturation is reached.

D-optimality is a common criterion for estimating the unknown parameters of the model. Its goal is to minimize the volume of the confidence ellipsoid of the parameters, which is equivalent to minimizing the generalized variance of the estimated parameters. Closed-form designs for D-optimality are shown in [22], as well as designs for estimating linear combinations of the parameters (c-optimality). D-optimal designs are equally supported at two points, one of which is the upper endpoint of the design space, S_max. Designs for estimating the parameters V and K individually are also supported at two experimental points, the upper endpoint S_max is always in the design, and the weights are not equal. To estimate K these weights are independent of the parameters. López-Fidalgo and Wong [22] observed that in practice the upper endpoint, S_max, was assumed to be a multiple of K, e.g. five times K. For this reason they used S_max = bK to obtain friendly expressions of the design points as multiples of K.

The GET is not only a checking condition for the optimality of a design but also a very useful tool for obtaining it. Other approaches are, for example, Elfving's method, which allows the closed-form expressions of the c-optimal designs to be


obtained graphically [22]. A generalization of this graphical procedure can also be applied to extensions of the Michaelis-Menten model, for example by adding a linear component [19]. For D-optimality, Dette and Kunert obtain numerically exact optimal designs when observations from different subjects are assumed to be independent but observations from the same subject are correlated [8]. Robust and efficient designs, which maximize the minimum of the D-efficiencies over a certain interval of the Michaelis-Menten nonlinear parameter, are obtained in [7].

Apart from the Michaelis-Menten model, the kinetics of enzyme action can also be described by adding the aforementioned linear term, which produces what is known as the Modified Michaelis-Menten model (MMM). Another extension is the so-called EMAX model,

$$v = \frac{V s^{H}}{K + s^{H}}, \quad s \in (0, S_{\max}], \qquad (2)$$

where H is the curvature parameter. With the help of a Chebyshev system, Dette et al. [9] obtained closed-form expressions of the D-optimal design for the EMAX model. This design is equally weighted and supported at three points, two of them interior points and the third the upper endpoint of the design space, S_max. They also test the validity of the Michaelis-Menten model against the EMAX model by maximizing a minimum of the D-efficiency over a range of values for the nonlinear parameters.

T-optimal designs to discriminate between the Michaelis-Menten model and some of its extensions, such as the MMM or the EMAX models, are described in [21]. A T-optimal design attains the highest test power when discriminating between a "true" and a rival model, although T-optimal designs are not necessarily good for parameter estimation. This study uses a compound criterion allowing a pair of "true" models to be considered when the models are not nested. Another interesting extension of T-optimality includes the use of the Kullback-Leibler distance to discriminate between the Michaelis-Menten model and the MMM with log-normal and gamma errors [20].

All of the design strategies mentioned up to this point require initial estimates of the model parameters (locally optimal designs). It is not always reasonable to think that precise estimates will be available for these parameters before conducting experiments. A prior distribution for the nonlinear parameter K can be established and the optimal Bayesian design can be found by maximizing the expectation of the criterion over this distribution. For example, Bayesian D-optimal designs can be found in the work of Matthews and Allcock [27].

Other variations of the Michaelis-Menten model are very common and have also received wide attention from researchers. We mention for guidance the works of Mariñas-Collado et al. [24] and Schorning et al. [33], where optimal designs for high-substrate and non-competitive inhibition can be found.
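The two-point D-optimal structure for the Michaelis-Menten model can be checked numerically. In the sketch below the nominal values of V and K are hypothetical (as the model is nonlinear, the design depends on them); a brute-force search over the interior support point recovers K·S_max/(2K + S_max), i.e. bK/(b + 2) when S_max = bK, which follows from maximizing the determinant of the FIM of an equally weighted two-point design.

```python
# Brute-force check of the D-optimal design for v = V*s/(K + s) on
# (0, S_max]: equal weights at an interior point and at S_max.
# Nominal values below are hypothetical, chosen only for the sketch.
V, K, S_max = 1.0, 2.0, 10.0

def sens(s):
    # Parameter sensitivities (dv/dV, dv/dK) of the Michaelis-Menten model
    return (s / (K + s), -V * s / (K + s) ** 2)

def det_fim(s1, s2):
    # det of M(xi) = 0.5*f(s1)f(s1)' + 0.5*f(s2)f(s2)' for the 2x2 FIM
    (a1, b1), (a2, b2) = sens(s1), sens(s2)
    return (0.5 * (a1 * b2 - a2 * b1)) ** 2

grid = [S_max * i / 10000 for i in range(1, 10001)]
s1_num = max(grid, key=lambda s: det_fim(s, S_max))   # keep s2 = S_max
s1_closed = K * S_max / (2 * K + S_max)               # = b*K/(b+2) with S_max = b*K

print(round(s1_num, 3), round(s1_closed, 3))  # the two values agree to grid resolution
```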


3 Arrhenius Equation

The Arrhenius equation is applied to the study of temperature influence on the rates of chemical processes, as well as other physical processes such as diffusion, thermal and electrical conductivity and viscosity, among others. The model expresses the rate constant of a process k in terms of temperature T,

$$k = A e^{-B/T}, \quad T \in [T_{\min}, T_{\max}], \qquad (3)$$

where A and B are the temperature-independent parameters: the frequency factor and the activation energy (the difference in energy between the activated and initial states), respectively.

The application of the Arrhenius equation to chemical reactions of atmospheric interest leads to closed-form expressions under the Gaussian error hypothesis [29]. D-optimal designs have been calculated with the help of the GET. These designs are supported at two different temperatures, the upper endpoint of the design interval is always in the optimal design, and half of the observations are taken at each support point. Elfving's graphical method is used to obtain closed-form expressions of c-optimal designs to estimate linear combinations of the parameters. In particular, the designs that estimate each of the two parameters individually are always supported at the upper endpoint of the design space, and these designs, once obtained, are also combined in a compound design. This allows a design that can deliver greater efficiency in estimating one of the parameters, while ensuring a minimum efficiency for the other.

For the Modified Arrhenius equation (MA), which includes temperature dependence in the frequency factor, A/T^m, Rodríguez-Díaz and Santos-Martín [31] obtain D- and c-optimal designs. In their study, m is assumed to be known, and optimal designs supported at two points are used to compute efficiencies of real experiments that cover the whole range of temperatures, including the endpoints of the design space. They include the study of the heteroscedastic case, assuming independent observations and variance proportional to the mean, without finding relevant differences. The heteroscedastic case of the MA, for exponential covariance structures, is studied by Rodríguez-Díaz et al. [32]. D-optimal exact designs are difficult to obtain in these cases, and can in fact only be obtained by numerical computation.
For practical reasons this study keeps to designs that cover the design space with a specific number of points. Numerical techniques are also the only way to find designs that discriminate between the Arrhenius and the Modified Arrhenius models. T-optimal designs that provide the most powerful F-test for lack of fit are found in Martín-Martín et al. [25].

In practice, chemical kinetics observations are obtained in two steps: first, the reaction rate constants, k, are estimated, and then the Arrhenius or MA equations are fitted to the rate constants for different temperatures. Amo-Salas et al. [1] obtain exact D-optimal designs for both steps simultaneously. They consider zeroth- and first-order reaction rates in the first step, with correlated observations over time. The


correlation structure becomes stronger as the time readings become closer. The second step uses an Arrhenius model. In a different approach, Baran et al. [2] obtain D-optimal designs for zeroth-order reaction rates and the MA equation for a similar covariance structure.
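The two-temperature structure of the D-optimal design described above can also be checked numerically. In the sketch below the nominal values A, B and the interval [T_min, T_max] are hypothetical; the grid search recovers the upper endpoint T_max as a support point, and the interior temperature agrees with B·T_max/(B + T_max), which follows from maximizing the FIM determinant under the homoscedastic Gaussian error hypothesis.

```python
import math

# Grid search for the D-optimal two-point, equally weighted design for the
# Arrhenius model k = A*exp(-B/T) on [T_min, T_max].  Nominal values are
# hypothetical, chosen only to make the sketch runnable.
A, B = 1.0e5, 3000.0
T_min, T_max = 300.0, 400.0

def sens(T):
    # Parameter sensitivities (dk/dA, dk/dB) of the Arrhenius model
    e = math.exp(-B / T)
    return (e, -A * e / T)

def det_fim(T1, T2):
    (a1, b1), (a2, b2) = sens(T1), sens(T2)
    return (0.5 * (a1 * b2 - a2 * b1)) ** 2

grid = [T_min + (T_max - T_min) * i / 500 for i in range(501)]
best = max((det_fim(t1, t2), t1, t2) for t1 in grid for t2 in grid if t1 < t2)
_, T1_opt, T2_opt = best

print(round(T1_opt, 1), round(T2_opt, 1))  # the upper support point is T_max
```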

4 Adsorption Isotherms

Adsorption phenomena are important in many physicochemical processes, such as the retention of chemicals in soils, adsorption of water by food, purification and separation processes, and heterogeneous catalysis. At a specific temperature, adsorption equilibrium is described by the relation between the concentration of adsorbed species and the equilibrium adsorption concentration. Most models used to describe this isotherm relation are nonlinear with respect to the parameters: BET, GAB, Freundlich, Langmuir, Jovanovich, Sips, and Redlich-Peterson, among others.

As mentioned, there are many different models to explain the phenomenon of adsorption. There are closed-form expressions for D-optimal designs for the 2-parameter BET model [30] and numerical procedures for the 2-parameter Freundlich and Langmuir and the 3-parameter Langmuir and GAB models [23, 30]. Closed-form optimal designs for the precise estimation of linear combinations of the unknown parameters (c-optimality) were also obtained with the help of Elfving's graphical method for the BET and GAB isotherms [30]. For these two models, correct estimation of the parameter k of the GAB model provides high efficiency in discrimination.

For this phenomenon, discrimination between isotherms is of great interest, and is considered in [28, 30] to distinguish between the 3-parameter GAB and the 2-parameter BET models. The efficiencies of these T-optimal designs in estimating the unknown parameters can then be obtained, and the examples show that these designs could be used for prior estimation of the parameters using a sequential strategy [30]. Robustness to misestimation of the best guesses of the parameters, and a lack-of-fit analysis illustrating the importance of experimental variability in allowing model discrimination, can be found in [28]. Kober et al.
[18] distinguish between considering the liquid solution equilibrium concentration as the independent variable, as commonly happens in optimal design works, and considering the initial liquid solution concentration, liquid solution volume and adsorbent mass as the independent variables. This second approach is more faithful to the experimental procedure that is actually used in the laboratory. In these studies, the upper endpoint of the design space is always in the optimal design for all the 2- and 3-parameter isotherms considered.


5 Tait Equation

The Tait equation is a three-parameter model that, under isothermal conditions, relates the density of liquids, ρ, to pressure, p,

$$\rho = \frac{\rho_0(T)}{1 - C(T)\log\dfrac{B(T)+p}{B(T)+p_0}}, \quad (p, T) \in \mathcal{X} = P \times T, \qquad (4)$$

where ρ_0, B and C are temperature-dependent parameters. The characterization of pressure and temperature effects on the volume of gases and liquids is of great interest in thermodynamics and engineering. The Tait equation is therefore modified by including the dependence of ρ_0 and B on temperature, either through linear polynomials or, for ρ_0, through a nonlinear function known as the Rackett equation.

If the Tait equation is considered under isothermal conditions, optimal designs search for optimal pressures p in X to obtain the best estimates of the three unknown parameters. If the temperature dependence, T, of the parameters is included in the model, then (p, T) must be chosen from the Cartesian set X = P × T. Martín-Martín et al. [26] present a method for obtaining D-optimal designs for multifactor models like the Tait equation with either linear dependence of the parameters on temperature or nonlinear dependence through the Rackett equation. This study notes that D-optimal designs obtained for models with 8 unknown parameters produce designs supported at more than 8 points, with different weights for each point. The study not only analyzes the robustness of locally optimal designs to misestimation of the initial best guesses of the parameters, but also computes efficiencies of uniformly distributed and weighted designs supported at different numbers of points, varying from 16 to 100, in P × T.
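For reference, equation (4) at a fixed temperature can be evaluated directly. In the sketch below the parameter values and the reference pressure p0 are hypothetical, and the natural logarithm is assumed (conventions for "log" in Tait-type equations vary between formulations).

```python
import math

# Evaluating the isothermal Tait equation (4) at fixed temperature.
# All numeric values are hypothetical, chosen only for illustration.
rho0, B, C = 998.0, 300.0, 0.2   # rho_0(T), B(T), C(T) at the working temperature
p0 = 0.1                         # reference pressure

def tait_density(p):
    # rho(p) = rho0 / (1 - C*log((B + p)/(B + p0))), natural log assumed
    return rho0 / (1.0 - C * math.log((B + p) / (B + p0)))

# At the reference pressure the equation returns rho0; density grows with p
print(round(tait_density(p0), 1), round(tait_density(50.0), 1))
```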

6 Discussion

Optimal experimental design is not a perfect solution and suffers from some weaknesses. The most common criticisms are related to the following topics.

Optimal experimental design is claimed to be strongly model dependent: the model has to be chosen even before the observations are considered. This is indeed an important drawback, but there are also many physicochemical models which are strongly supported by theoretical work. The models mentioned in this review are widely used, and the prior choice of the model is not a strong disadvantage. Moreover, T-optimal designs address discrimination between models. The practitioner's point of view must also be kept in mind: the best possible fit will never justify obtaining parameters with values unexplained by the physics behind the model.

In the case of nonlinear models, the choice of prior estimates of the unknown parameters also turns out to be a problem. Engineering and industrial development are strongly dependent, in their calculations and processes, on the information obtained in laboratories. It is therefore of the greatest interest to have accurate estimates of the model parameters to be used as nominal values for computing optimal designs. These estimates are usually shared in repositories or databases such as those maintained by the Jet Propulsion Laboratory [3], the Dortmund Data Bank [11] or the American Institute of Chemical Engineers [10], among others. These repositories not only allow initial estimates of the parameters to be obtained, but they also serve to draw attention to models and parameters with high uncertainty in their estimates. Bayesian designs and robustness analysis are also important techniques to be taken into account.

In the selection of optimization criteria, the choice is made without prior knowledge of the observations. Therefore, a design can be optimal for a certain criterion but not from other points of view. Compound designs are an option for combining several criteria while ensuring minimum efficiencies for certain criteria. Also, cross efficiencies can be evaluated, allowing designs to be compared with respect to other criteria. The wide variety of optimization criteria can also be seen as an advantage for the researcher, allowing the most suitable option to be chosen for each circumstance.

Optimal experimental designs have evolved along two parallel lines: exact versus approximate designs. In the field of physical chemistry the number of observations in each experiment is high enough to apply the idea of approximate designs. Furthermore, the possibility of applying the GET provides a very important tool for obtaining and checking the optimality of approximate designs. There are also settings, such as experiments with correlated observations, where exact designs need to be taken into account.

Optimal designs frequently require extreme observations and have few different support points.
Optimal designs are often disliked by applied researchers. Extreme observations are more difficult to perform, and designs covering the whole experimental range are desired in laboratories. Therefore, optimal designs are rarely used directly. However, they are very useful as a benchmarking tool for other designs. Different experimental designs can be compared with the optimum, and with this information the most suitable designs for the experiment can be chosen.

Finally, one of the most interesting aspects of working in optimal experimental design is the interdisciplinary work with other sciences. The theoretical mathematical developments that emerge from real applied problems are without any doubt highly challenging and enriching.

Acknowledgements This work was sponsored by Ministerio de Economía y Competitividad MTM2016-80539-C2-1-R and by Consejería de Educación, Cultura y Deportes of Junta de Comunidades de Castilla-La Mancha and Fondo Europeo de Desarrollo Regional SBPLY/17/180501/000380. López-Fidalgo wants to thank Leandro for his support, trust and friendship in some of his professional milestones. Rodríguez-Aragón wants to thank Professor L. J. Rodríguez (Professor of Physical Chemistry at the University of Salamanca), who has always generously answered questions and shared the applied point of view of physicochemical models and techniques.


References

1. Amo-Salas, M., Martín-Martín, R., Rodríguez-Aragón, L.J.: Design of experiments for zeroth and first-order reaction rates. Biom. J. 56, 792–807 (2014)
2. Baran, S., Sikolya, K., Stehlík, M.: Optimal designs for the methane flux in troposphere. Chemom. Intell. Lab. Syst. 146, 407–417 (2015)
3. Burkholder, J.B., Sander, S.P., Abbatt, J., Barker, J.R., Cappa, C., Crounse, J.D., Dibble, T.S., Huie, R.E., Kolb, C.E., Kurylo, M.J., Orkin, V.L., Percival, C.J., Wilmouth, D.M., Wine, P.H.: Chemical kinetics and photochemical data for use in atmospheric studies, Evaluation No. 19. JPL Publication (2019). http://jpldataeval.jpl.nasa.gov
4. Chaloner, K., Verdinelli, I.: Bayesian experimental design: a review. Stat. Sci. 10, 273–304 (1995)
5. Chernoff, H.: Locally optimal designs for estimating parameters. Ann. Math. Stat. 24, 586–602 (1953)
6. Chernoff, H.: Approaches in Sequential Design of Experiments in a Survey of Statistical Design and Linear Models. North-Holland, New York (1975)
7. Dette, H., Biedermann, S.: Robust and efficient designs for the Michaelis-Menten model. J. Am. Stat. Assoc. 98, 679–686 (2003)
8. Dette, H., Kunert, J.: Optimal designs for the Michaelis-Menten model with correlated observations. Statistics 48, 1254–1267 (2014)
9. Dette, H., Melas, V.B., Wong, W.K.: Optimal design for goodness-of-fit of the Michaelis-Menten enzyme kinetic function. J. Am. Stat. Assoc. 100, 1370–1381 (2005)
10. DIPPR Project 801 (2021). http://www.aiche.org/dippr
11. Dortmund Data Bank (2021). http://www.ddbst.com
12. Fedorov, V.V., Leonov, S.L.: Optimal Design for Nonlinear Response Models. CRC Press, Boca Raton (2014)
13. Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. Lond. A 222, 309–368 (1922)
14. Ford, I., Titterington, D.M., Kitsos, C.P.: Recent advances in nonlinear experimental design. Technometrics 31, 49–60 (1989)
15. Kiefer, J.: Optimum experimental designs. J. Roy. Stat. Soc. Ser. B 21, 272–319 (1959)
16. Kiefer, J., Wolfowitz, J.: The equivalence of two extremum problems. Can. J. Math. 12, 363–366 (1960)
17. Kitsos, C.P., Kolovos, K.G.: A compilation of the D-optimal designs in chemical kinetics. Chem. Eng. Commun. 200, 185–204 (2013)
18. Kober, R., Schwaab, M., Steffani, E., Barbosa-Coutinho, E., Pinto, J.C., Alberton, A.L.: D-optimal experimental designs for precise parameter estimation of adsorption equilibrium models. Chemom. Intell. Lab. Syst. 192, 103823 (2021)
19. López-Fidalgo, J., Rodríguez-Díaz, J.M.: Elfving's method for m-dimensional models. Metrika 59, 235–244 (2004)
20. López-Fidalgo, J., Tommasi, C., Trandafir, P.C.: An optimal experimental design criterion for discriminating between non-normal models. J. Roy. Stat. Soc. Ser. B 69, 231–242 (2007)
21. López-Fidalgo, J., Tommasi, C., Trandafir, P.C.: Optimal designs for discriminating between some extensions of the Michaelis-Menten model. J. Stat. Plan. Infer. 138, 3797–3804 (2008)
22. López-Fidalgo, J., Wong, W.K.: Design issues for the Michaelis-Menten model. J. Theor. Biol. 215, 1–11 (2002)
23. Mannarswamy, A., Munson-McGee, S.H., Steiner, R., Andersen, P.K.: D-optimal experimental designs for Freundlich and Langmuir adsorption isotherms. Chemom. Intell. Lab. Syst. 97, 146–151 (2009)
24. Mariñas-Collado, I., Rivas-López, M.J., Rodríguez-Díaz, J.M., Santos-Martín, M.T.: Optimal designs in enzymatic reactions with high-substrate inhibition. Chemom. Intell. Lab. Syst. 189, 102–109 (2019)


25. Martín-Martín, R., Dorta-Guerra, R., Torsney, B.: Multiplicative algorithm for discriminating between Arrhenius and non-Arrhenius behaviour. Chemom. Intell. Lab. Syst. 139, 146–155 (2014)
26. Martín-Martín, R., Rodríguez-Aragón, L.J., Torsney, B.: Multiplicative algorithm for computing D-optimum designs for pVT measurements. Chemom. Intell. Lab. Syst. 111, 20–27 (2012)
27. Matthews, J.N.S., Allcock, G.C.: Optimal designs for Michaelis-Menten kinetic studies. Stat. Med. 23, 477–491 (2004)
28. Munson-McGee, S.H., Mannarswamy, A., Andersen, P.K.: Designing experiments to differentiate between adsorption isotherms using T-optimal designs. J. Food Eng. 101, 386–393 (2010)
29. Rodríguez-Aragón, L.J., López-Fidalgo, J.: Optimal designs for the Arrhenius equation. Chemom. Intell. Lab. Syst. 77, 131–138 (2005)
30. Rodríguez-Aragón, L.J., López-Fidalgo, J.: T-, D- and c-optimum designs for BET and GAB adsorption isotherms. Chemom. Intell. Lab. Syst. 89, 36–44 (2007)
31. Rodríguez-Díaz, J.M., Santos-Martín, M.T.: Study of the best designs for modifications of the Arrhenius equation. Chemom. Intell. Lab. Syst. 95, 199–208 (2009)
32. Rodríguez-Díaz, J.M., Santos-Martín, M.T., Waldl, H., Stehlík, M.: Filling and D-optimal designs for the correlated generalized exponential models. Chemom. Intell. Lab. Syst. 114, 10–18 (2012)
33. Schorning, K., Dette, H., Kettelhake, K., Möller, T.: Optimal designs for enzyme inhibition kinetic models. Statistics 52, 1359–1378 (2018)
34. Smith, K.: On the standard deviations of adjusted and interpolated values of an observed polynomial function and its constants and the guidance they give towards a proper choice of the distribution of observations. Biometrika 12, 1–85 (1918)

Small Area Estimation of Proportion-Based Indicators

María Dolores Esteban, Tomáš Hobza, Domingo Morales, and Agustín Pérez

Abstract This paper considers a unit-level multinomial model with logit link for small area estimation of proportion-based socioeconomic indicators. For a labour force survey, some of these indicators are totals of unemployed people, proportions of inactive people, unemployment rates or entropy indexes. A Broyden quasi-Newton algorithm is proposed to calculate the method-of-moment and maximum likelihood estimators of the model parameters. Model-based predictors of small area indicators are derived and their mean squared errors are estimated by parametric bootstrap.

M. D. Esteban (B) · D. Morales · A. Pérez
Miguel Hernández University of Elche, Elche, Spain
e-mail: [email protected]
T. Hobza
Czech Technical University in Prague, Prague, Czech Republic
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_27

1 Introduction

Public statistics must provide data on socioeconomic indicators for territories or population groups in which the sample size is insufficient to calculate precise direct estimators. Small Area Estimation (SAE) introduces procedures based on statistical models that allow the introduction of auxiliary information to provide new estimators. See e.g. [5] or [4] for an introduction to SAE. If the indicators of interest depend on the proportions of the categories of a classifying variable, such as employment status, then a multinomial regression model allows the behavior of the target vector to be adequately described and auxiliary variables to be incorporated into the inferential process. References [1–3] introduced area-level multinomial models for aggregated data and derived predictors of small area indicators. This paper deals with unit-level data and considers a unit-level multinomial model with logit link for small area estimation of proportion-based socioeconomic indicators.

Let $y_{djk}$ be a discrete random variable that takes values in $\mathbb{N} \cup \{0\}$ and that is measured on the sample unit $j$ of domain $d$ and category (group) $k$, $d = 1, \ldots, D$,


$j = 1, \ldots, n_d$, $k = 1, \ldots, q$. Let $\nu_{dj} \in \mathbb{N}$ be a known size parameter such that $y_{dj1} + \ldots + y_{djq} = \nu_{dj}$, $d = 1, \ldots, D$, $j = 1, \ldots, n_d$. Let $n = \sum_{d=1}^{D} n_d$ be the global sample size. For $k = 1, \ldots, q-1$, let $x_{djk} = (x_{djk1}, \ldots, x_{djkp_k})$ be a row vector containing $p_k$ explanatory variables and let $\beta_k = (\beta_{k1}, \ldots, \beta_{kp_k})'$ be a column vector of size $p_k$ containing regression parameters. Let $p = p_1 + \ldots + p_{q-1}$. The unit-level multinomial logit ($q$-logit) model assumes that the distribution of the target vector $y_{dj} = (y_{dj1}, \ldots, y_{djq-1})$ is multinomial. More concretely, it assumes that
\[
y_{dj} \sim M(\nu_{dj}; p_{dj1}, \ldots, p_{djq-1}), \quad d = 1, \ldots, D,\ j = 1, \ldots, n_d, \tag{1}
\]
with the logit link for the natural parameter, i.e.
\[
\eta_{djk} = \log\frac{p_{djk}}{p_{djq}} = x_{djk}\beta_k, \quad d = 1, \ldots, D,\ j = 1, \ldots, n_d,\ k = 1, \ldots, q-1, \tag{2}
\]
where $p_{dj1} + \ldots + p_{djq} = 1$, $p_{djk} > 0$, $d = 1, \ldots, D$, $j = 1, \ldots, n_d$, $k = 1, \ldots, q$. Finally, the $q$-logit model assumes that the vectors $y_{dj}$ are independent. For $d = 1, \ldots, D$, $j = 1, \ldots, n_d$, the probability function of $y_{dj}$ is

\[
P(y_{dj}; \beta) = c(\nu_{dj}, y_{dj})\, p_{dj1}^{y_{dj1}} \cdots p_{djq}^{y_{djq}}, \qquad c(\nu_{dj}, y_{dj}) = \frac{\nu_{dj}!}{y_{dj1}! \cdots y_{djq}!},
\]
where $y_{djq} = \nu_{dj} - y_{dj1} - \ldots - y_{djq-1}$, $p_{djq} = 1 - p_{dj1} - \ldots - p_{djq-1}$ and
\[
p_{djq} = \frac{1}{1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}}, \qquad
p_{djk} = \frac{\exp\{\eta_{djk}\}}{1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}}, \quad k = 1, \ldots, q-1.
\]

The vectors of dimensions $n_d(q-1) \times 1$ and $n(q-1) \times 1$ that contain the values of the target variables are
\[
y_d = \operatorname*{col}_{1 \le j \le n_d}(y_{dj}), \qquad y = \operatorname*{col}_{1 \le d \le D}(y_d).
\]

The vector of model parameters is $\beta = (\beta_1', \ldots, \beta_{q-1}')'$. The likelihood function is
\[
\ell(\beta; y) = \prod_{d=1}^{D}\prod_{j=1}^{n_d} P(y_{dj}; \beta)
= \prod_{d=1}^{D}\prod_{j=1}^{n_d} \frac{c(\nu_{dj}, y_{dj}) \exp\bigl\{\sum_{k=1}^{q-1} y_{djk}(x_{djk}\beta_k)\bigr\}}{\bigl(1 + \sum_{\ell=1}^{q-1} \exp\{x_{dj\ell}\beta_\ell\}\bigr)^{\nu_{dj}}}
\]
\[
= \prod_{d=1}^{D}\prod_{j=1}^{n_d} c(\nu_{dj}, y_{dj}) \exp\Bigl\{\sum_{d=1}^{D}\sum_{j=1}^{n_d}\sum_{k=1}^{q-1}\sum_{i=1}^{p_k} y_{djk} x_{djki}\beta_{ki} - \sum_{d=1}^{D}\sum_{j=1}^{n_d} \nu_{dj} \log\Bigl(1 + \sum_{\ell=1}^{q-1} \exp\{x_{dj\ell}\beta_\ell\}\Bigr)\Bigr\}.
\]


A set of sufficient statistics for $\beta$ is
\[
S_{ki} = \sum_{d=1}^{D}\sum_{j=1}^{n_d} y_{djk} x_{djki}, \quad k = 1, \ldots, q-1,\ i = 1, \ldots, p_k.
\]
Based on the sufficient statistics $S_{ki}$, $k = 1, \ldots, q-1$, $i = 1, \ldots, p_k$, we first consider the method of moments (MM) for estimating the vector $\beta$. The $p$ equations are
\[
0 = f_{ki}(\beta) = M_{k,i}(\beta) - \hat M_{k,i} = \frac{1}{n}\sum_{d=1}^{D}\sum_{j=1}^{n_d} E_\beta[y_{djk}]\, x_{djki} - \frac{1}{n}\sum_{d=1}^{D}\sum_{j=1}^{n_d} y_{djk}\, x_{djki}, \tag{3}
\]
where the expectation of $y_{djk}$ is
\[
E_\beta[y_{djk}] = \nu_{dj}\, p_{djk} = \frac{\nu_{dj} \exp\{x_{djk}\beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{x_{dj\ell}\beta_\ell\}}.
\]
Let us define the $p \times 1$ vector
\[
f(\beta) = \operatorname*{col}_{1 \le k \le q-1}\Bigl(\operatorname*{col}_{1 \le i \le p_k}\bigl(f_{ki}(\beta)\bigr)\Bigr)_{p \times 1}.
\]
The MM estimator of $\beta$ is a solution of the system of $p$ nonlinear equations
\[
f(\beta) = 0. \tag{4}
\]

Let us note that
\[
1 - p_{djk} = \frac{1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\} - \exp\{\eta_{djk}\}}{1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}} = \frac{1 + \sum_{\ell=1,\,\ell \ne k}^{q-1} \exp\{\eta_{dj\ell}\}}{1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}}.
\]
For $d = 1, \ldots, D$, $j = 1, \ldots, n_d$, $k = 1, \ldots, q-1$, the derivatives of $p_{djk}$ are
\[
\frac{\partial p_{djk}}{\partial \beta_{ki}}
= \frac{\exp\{\eta_{djk}\}\bigl(1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}\bigr) - \exp\{2\eta_{djk}\}}{\bigl(1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}\bigr)^2}\, x_{djki}
= \frac{\exp\{\eta_{djk}\}\bigl(1 + \sum_{\ell=1,\,\ell \ne k}^{q-1} \exp\{\eta_{dj\ell}\}\bigr)}{\bigl(1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}\bigr)^2}\, x_{djki}
= p_{djk}(1 - p_{djk})\, x_{djki}.
\]
For $d = 1, \ldots, D$, $j = 1, \ldots, n_d$, $k_1, k_2 = 1, \ldots, q-1$, $k_1 \ne k_2$, the derivatives of $p_{djk_1}$ are
\[
\frac{\partial p_{djk_1}}{\partial \beta_{k_2 i_2}}
= \frac{\partial}{\partial \beta_{k_2 i_2}}\Bigl(\frac{\exp\{\eta_{djk_1}\}}{1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}}\Bigr)
= -\exp\{\eta_{djk_1}\}\,\frac{\exp\{\eta_{djk_2}\}\, x_{djk_2 i_2}}{\bigl(1 + \sum_{\ell=1}^{q-1} \exp\{\eta_{dj\ell}\}\bigr)^2}
= -p_{djk_1} p_{djk_2}\, x_{djk_2 i_2}.
\]

By applying these derivatives, it is easy to check that
\[
f_{ki}(\beta) = 0 \iff \frac{\partial \ell(\beta; y)}{\partial \beta_{ki}} = 0, \quad k = 1, \ldots, q-1,\ i = 1, \ldots, p_k.
\]
Therefore, the MM estimator is also the maximum likelihood (ML) estimator of $\beta$. For solving the system (4), we apply the Broyden quasi-Newton algorithm. The updating equations are
\[
\beta^{(i+1)} = \beta^{(i)} - \bigl(\hat B^{(i)}\bigr)^{-1} f(\beta^{(i)}), \qquad
\hat B^{(i+1)} = \hat B^{(i)} + \frac{f(\beta^{(i+1)})\,(\beta^{(i+1)} - \beta^{(i)})'}{(\beta^{(i+1)} - \beta^{(i)})'(\beta^{(i+1)} - \beta^{(i)})},
\]
where $\hat B^{(0)}$ is usually the identity matrix. This method is attractive because it does not require calculating the Jacobian matrix. A bootstrap algorithm to estimate $\operatorname{var}(\hat\beta)$ is:

1. Fit the $q$-logit model to the sample and calculate $\hat\beta$.
2. Generate bootstrap samples $\{y_{dj}^{(b)}: d = 1, \ldots, D,\ j = 1, \ldots, n_d\}$, $b = 1, \ldots, B$, from the fitted $q$-logit model.
3. Fit the $q$-logit model to the bootstrap samples. Calculate $\hat\beta^{(b)}$, $b = 1, \ldots, B$, and $\bar\beta = \frac{1}{B}\sum_{b=1}^{B} \hat\beta^{(b)}$.
4. Output: $\widehat{\operatorname{var}}_B(\hat\beta) = \frac{1}{B}\sum_{b=1}^{B} (\hat\beta^{(b)} - \bar\beta)(\hat\beta^{(b)} - \bar\beta)'$.
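The Broyden iteration above can be sketched as a generic root-finder in a few lines (illustrative code, not the authors' implementation; the toy system below is only for demonstration):

```python
import numpy as np

def broyden_solve(f, beta0, tol=1e-10, max_iter=200):
    """Solve f(beta) = 0 with Broyden's rank-one update, starting from B = identity."""
    beta = np.asarray(beta0, dtype=float)
    B = np.eye(beta.size)                               # B_hat^(0) = I
    fx = f(beta)
    for _ in range(max_iter):
        step = -np.linalg.solve(B, fx)                  # beta_new = beta - B^{-1} f(beta)
        beta = beta + step
        fx = f(beta)
        if np.linalg.norm(fx) < tol:
            break
        B = B + np.outer(fx, step) / (step @ step)      # update the Jacobian approximation
    return beta

# toy system: x + y = 3 and x * y = 2 (roots (1, 2) and (2, 1))
root = broyden_solve(lambda v: np.array([v[0] + v[1] - 3.0, v[0] * v[1] - 2.0]),
                     [1.0, 1.0])
```

Note that the update uses only function values, which matches the remark that no Jacobian matrix has to be computed.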

2 Predictors

Let $U$ be a population of size $N$ partitioned into domains $U_d$ of size $N_d$; i.e. $U = \cup_{d=1}^{D} U_d$ and $U_{d_1} \cap U_{d_2} = \emptyset$ if $d_1 \ne d_2$. Assume that the unit-level multinomial logit model (1)-(2) holds for $d = 1, \ldots, D$, $j = 1, \ldots, N_d$. We further assume that $U$ is partitioned into two subsets, the sample $s$ and the non-sample $r$. Similarly, $U_d$ is partitioned into $s_d$ and $r_d$. This section gives plug-in predictors of functions of probability cells under the unit-level $q$-logit Bernoulli model, i.e. we assume that $\nu_{dj} = 1$, $d = 1, \ldots, D$, $j = 1, \ldots, N_d$.


2.1 Predictors of Probability-Dependent Indicators

The plug-in predictor of $p_{djk} = p_{djk}(\beta)$ is
\[
\hat p_{djk}^{\,in} = p_{djk}(\hat\beta) = \frac{\exp\{x_{djk}\hat\beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{x_{dj\ell}\hat\beta_\ell\}}, \quad k = 1, \ldots, q-1.
\]
The plug-in predictors of the model total $\mu_{dk} = \sum_{j=1}^{N_d} p_{djk}$, the model proportion $\bar\mu_{dk} = \mu_{dk}/N_d$ and the Shannon entropy $H_d = -\sum_{k=1}^{q} \bar\mu_{dk} \log \bar\mu_{dk}$ are
\[
\hat\mu_{dk}^{\,in} = \mu_{dk}(\hat\beta) = \sum_{j=1}^{N_d} p_{djk}(\hat\beta), \qquad
\hat{\bar\mu}_{dk}^{\,in} = \hat\mu_{dk}^{\,in}/N_d, \qquad
\hat H_d^{\,in} = -\sum_{k=1}^{q} \hat{\bar\mu}_{dk}^{\,in} \log \hat{\bar\mu}_{dk}^{\,in}.
\]
The plug-in predictors $\hat\mu_{dk}^{\,in}$ are functions of $x_{djk}$, $d = 1, \ldots, D$, $j = 1, \ldots, N_d$, $k = 1, \ldots, q-1$. For calculating the predictors, we need a data file containing the values of the explanatory variables in all the population units. This kind of data file (census file) is not always available. Remark 2.1 presents a categorical setup where the calculation of $\hat\mu_{dk}^{\,in}$ requires less auxiliary data.

Remark 2.1 Suppose that the covariates are categorical, such that $x_{djk} \in \{z_1, \ldots, z_T\}$. Suppose also that all the components of the target vector $y_{dj}$ have the same set of auxiliary variables, so that $x_{djk} = x_{dj}$, $d = 1, \ldots, D$, $j = 1, \ldots, N_d$, $k = 1, \ldots, q-1$. Then
\[
\mu_{dk} = \mu_{dk}(\beta) = \sum_{j=1}^{N_d} p_{djk} = \sum_{t=1}^{T} N_{dt}\, q_{dk,t}, \qquad
q_{dk,t} = \frac{\exp\{z_t \beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{z_t \beta_\ell\}},
\]
where $N_{dt} = \#\{j \in U_d : x_{dj} = z_t\}$ is the size of the covariate class $z_t$ at domain $d$. Under this categorical setup, the plug-in predictor of $\mu_{dk}$ is
\[
\hat\mu_{dk}^{\,in} = \mu_{dk}(\hat\beta) = \sum_{t=1}^{T} N_{dt}\, q_{dk,t}(\hat\beta) = \sum_{t=1}^{T} N_{dt}\, \hat q_{dk,t}^{\,in}, \qquad
\hat q_{dk,t}^{\,in} = \frac{\exp\{z_t \hat\beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{z_t \hat\beta_\ell\}}. \tag{5}
\]
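Under that categorical setup, a predictor of the form (5) needs only the class counts $N_{dt}$ and the $T$ distinct covariate values; a small sketch (hypothetical arrays, assuming a common coefficient matrix for all categories):

```python
import numpy as np

def plugin_total(N_dt, z, beta_hat):
    """Predictor in the style of (5): mu_hat_dk = sum_t N_dt * q_hat_dk,t."""
    eta = z @ beta_hat.T                                          # (T, q-1) linear predictors z_t beta_k
    q = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))
    return N_dt @ q                                               # one predicted total per category k

N_dt = np.array([100.0, 50.0])                  # covariate-class sizes N_dt in the domain
z = np.array([[1.0, 0.0], [1.0, 1.0]])          # class-level covariate vectors z_t
beta_hat = np.array([[0.1, 0.2], [-0.4, 0.3]])  # (q-1) x p fitted coefficients (illustrative)
mu_hat = plugin_total(N_dt, z, beta_hat)
```

Each predicted total is bounded by the domain size, since every $q_{dk,t}$ row sums to less than one.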

2.2 Predictors of Variable-Dependent Indicators

The plug-in predictor of $y_{djk}$ is $\hat y_{djk}^{\,in} = y_{djk}$ if $j \in s_d$ and $\hat y_{djk}^{\,in} = \hat p_{djk}^{\,in}$ if $j \in r_d$, where $\hat p_{djk}^{\,in}$ is the plug-in predictor of $p_{djk}$. The plug-in predictor of $\bar Y_{dk} = \frac{1}{N_d}\sum_{j=1}^{N_d} y_{djk}$ is
\[
\hat{\bar Y}_{dk}^{\,in} = \frac{1}{N_d}\Bigl(\sum_{j \in s_d} y_{djk} + \sum_{j \in r_d} \hat p_{djk}^{\,in}\Bigr). \tag{6}
\]
A design-based approximation of $\hat{\bar Y}_{dk}^{\,in}$ is
\[
\hat{\bar Y}_{dk}^{\,in} \approx \frac{1}{N_d}\Bigl(\sum_{j \in s_d} \bigl(y_{djk} - \hat p_{djk}^{\,in}\bigr) + \sum_{j \in s_d} \omega_{dj}\, \hat p_{djk}^{\,in}\Bigr),
\]
where the $\omega_{dj}$'s are the calibrated sample weights. If $n_d = 0$, then the plug-in predictor of $\bar Y_{dk}$ is
\[
\hat{\bar Y}_{dk}^{\,in} = \frac{1}{N_d}\sum_{j=1}^{N_d} \hat p_{djk}^{\,in}.
\]
Under the categorical setup of Remark 2.1, the plug-in predictor of $\bar Y_{dk} = \frac{1}{N_d}\sum_{j=1}^{N_d} y_{djk}$ is
\[
\hat{\bar Y}_{dk}^{\,in} = \frac{1}{N_d}\Bigl(\sum_{j \in s_d} y_{djk} + \sum_{t=1}^{T} N_{dt,r}\, \hat q_{dk,t}^{\,in}\Bigr),
\]
where $\hat q_{dk,t}^{\,in}$ was defined above and $N_{dt,r} = \#\{j \in r_d : x_{dj} = z_t\}$ is the size of the covariate class $z_t$ at $r_d$. An example of a variable-dependent domain indicator is the unemployment rate. We define the unemployment status categories $k = 1$ ($\le 15$ years), $k = 2$ (unemployed), $k = 3$ (employed) and $k = 4$ (inactive). The domain unemployment rate and its plug-in predictor are
\[
R_d = \frac{\bar Y_{d2}}{\bar Y_{d2} + \bar Y_{d3}}, \qquad
\hat R_d^{\,in} = \frac{\hat{\bar Y}_{d2}^{\,in}}{\hat{\bar Y}_{d2}^{\,in} + \hat{\bar Y}_{d3}^{\,in}}.
\]
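A predictor of the form (6), together with the unemployment-rate predictor, can be sketched as follows (the inputs are illustrative placeholders, not survey data):

```python
import numpy as np

def mean_predictor(y_sampled, p_hat_nonsampled, N_d):
    """Eq. (6)-style predictor: observed y for sampled units, plug-in probabilities for the rest."""
    return (np.sum(y_sampled) + np.sum(p_hat_nonsampled)) / N_d

# categories follow the text: k = 2 (unemployed), k = 3 (employed); R_d = Y2 / (Y2 + Y3)
y2_hat = mean_predictor([1, 0, 1], [0.3, 0.25], N_d=5)
y3_hat = mean_predictor([0, 1, 0], [0.5, 0.55], N_d=5)
rate_hat = y2_hat / (y2_hat + y3_hat)
```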

3 MSE of Predictors

This section introduces bootstrap-based estimators of the MSEs of the predictors. We assume that the categorical setup of Remark 2.1 holds. The following procedure calculates a parametric bootstrap estimator of $MSE(\hat\mu_{dk}^{\,in})$.

1. Fit the model to the sample and calculate $\hat\beta$.
2. Repeat $B$ times ($b = 1, \ldots, B$):
   a. Bootstrap sample: the bootstrap sample has the same units as the real data sample, i.e. $s_d^{*(b)} = s_d$, $b = 1, \ldots, B$. For $d = 1, \ldots, D$, $j \in s_d$, generate the elements of the bootstrap sample
\[
y_{dj}^{*(b)} \sim M\bigl(1; p_{dj1}^{*(b)}, \ldots, p_{djq-1}^{*(b)}\bigr), \qquad
p_{djk}^{*(b)} = \frac{\exp\{x_{dj}\hat\beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{x_{dj}\hat\beta_\ell\}}.
\]
   For $d = 1, \ldots, D$, $k = 1, \ldots, q-1$, calculate the bootstrap population quantities
\[
\mu_{dk}^{*(b)} = \sum_{j \in s_d} p_{djk}^{*(b)} + \sum_{t=1}^{T} N_{dt,r}\, q_{dk,t}^{*(b)}, \qquad
q_{dk,t}^{*(b)} = \frac{\exp\{z_t \hat\beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{z_t \hat\beta_\ell\}}.
\]
   b. Bootstrap model: fit a unit-level multinomial-logit model to the bootstrap sample $(y_{dj}^{*(b)}, x_{dj})$, $d = 1, \ldots, D$, $j = 1, \ldots, n_d$. Calculate the estimator $\hat\beta^{*(b)}$ and the predictor $\hat\mu_{dk}^{\,in*(b)}$ of the bootstrap population quantity $\mu_{dk}^{*(b)}$, i.e.
\[
\hat\mu_{dk}^{\,in*(b)} = \sum_{t=1}^{T} N_{dt}\, \hat q_{dk,t}^{\,in*(b)}, \qquad
\hat q_{dk,t}^{\,in*(b)} = \frac{\exp\{z_t \hat\beta_k^{*(b)}\}}{1 + \sum_{\ell=1}^{q-1} \exp\{z_t \hat\beta_\ell^{*(b)}\}}.
\]
3. Output: $mse^*(\hat\mu_{dk}^{\,in}) = \frac{1}{B}\sum_{b=1}^{B} \bigl(\hat\mu_{dk}^{\,in*(b)} - \mu_{dk}^{*(b)}\bigr)^2$.

The following procedure calculates a parametric bootstrap estimator of $MSE(\hat{\bar Y}_{dk}^{\,in})$.

1. Fit the model to the sample and calculate $\hat\beta$.
2. Repeat $B$ times ($b = 1, \ldots, B$):
   a. Bootstrap sample: the bootstrap sample has the same units as the real data sample, i.e. $s_d^{*(b)} = s_d$, $b = 1, \ldots, B$. For $d = 1, \ldots, D$, $j \in s_d$, generate the elements of the bootstrap sample
\[
y_{dj}^{*(b)} \sim M\bigl(1; p_{dj1}^{*(b)}, \ldots, p_{djq-1}^{*(b)}\bigr), \qquad
p_{djk}^{*(b)} = \frac{\exp\{x_{dj}\hat\beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{x_{dj}\hat\beta_\ell\}}.
\]
   For $d = 1, \ldots, D$, $k = 1, \ldots, q-1$, calculate the bootstrap population quantities
\[
\bar Y_{dk}^{*(b)} = \frac{1}{N_d}\Bigl(\sum_{j \in s_d} y_{djk}^{*(b)} + \sum_{t=1}^{T} z_{dk,t}^{*(b)}\Bigr),
\]
where
\[
z_{d,t}^{*(b)} \sim M\bigl(N_{dt,r}; q_{d1,t}^{*(b)}, \ldots, q_{dq-1,t}^{*(b)}\bigr), \qquad
q_{dk,t}^{*(b)} = \frac{\exp\{z_t \hat\beta_k\}}{1 + \sum_{\ell=1}^{q-1} \exp\{z_t \hat\beta_\ell\}},
\]
and $z_{dk,t}^{*(b)}$, $k = 1, \ldots, q-1$, are the elements of the vector $z_{d,t}^{*(b)}$.
   b. Bootstrap model: fit a unit-level multinomial-logit model to the bootstrap sample $(y_{dj}^{*(b)}, x_{dj})$, $d = 1, \ldots, D$, $j = 1, \ldots, n_d$. Calculate the estimator $\hat\beta^{*(b)}$ and the predictor $\hat{\bar Y}_{dk}^{\,in*(b)}$ of $\bar Y_{dk}^{*(b)}$, i.e.
\[
\hat{\bar Y}_{dk}^{\,in*(b)} = \frac{1}{N_d}\Bigl(\sum_{j \in s_d} y_{djk}^{*(b)} + \sum_{t=1}^{T} N_{dt,r}\, \hat q_{dk,t}^{\,in*(b)}\Bigr), \qquad
\hat q_{dk,t}^{\,in*(b)} = \frac{\exp\{z_t \hat\beta_k^{*(b)}\}}{1 + \sum_{\ell=1}^{q-1} \exp\{z_t \hat\beta_\ell^{*(b)}\}}.
\]
3. Output: $mse^*(\hat{\bar Y}_{dk}^{\,in}) = \frac{1}{B}\sum_{b=1}^{B} \bigl(\hat{\bar Y}_{dk}^{\,in*(b)} - \bar Y_{dk}^{*(b)}\bigr)^2$.
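Both bootstrap procedures share one skeleton: simulate from the fitted model, refit, predict, and average the squared differences. A generic sketch of that skeleton (the simulate/fit/predict callables below are placeholders, not the paper's estimators):

```python
import numpy as np

def bootstrap_mse(simulate, fit, predict, B=200, seed=1):
    """Parametric bootstrap MSE: average of (predictor^(b) - bootstrap target^(b))^2 over B replicates."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(B):
        sample, target = simulate(rng)          # bootstrap sample and its population quantity
        theta = fit(sample)                     # refit the model on the bootstrap sample
        errs.append((predict(theta) - target) ** 2)
    return float(np.mean(errs))

# toy check: the target is a normal mean (0) and the predictor is the sample mean
mse = bootstrap_mse(simulate=lambda rng: (rng.normal(0.0, 1.0, size=50), 0.0),
                    fit=lambda s: s.mean(),
                    predict=lambda th: th)
```

In the toy check the true MSE of the sample mean is $1/50 = 0.02$, so the bootstrap estimate should land near that value.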

4 Discussion and Future Research

Due to the use of auxiliary information, model-based predictors of domain indicators outperform direct estimators when the model fits the data well. This depends on the availability of explanatory variables and on the capacity of the model to incorporate data correlation structures. A drawback of the multinomial fixed-effects model is that it assumes that the observations are independent and, therefore, it cannot model hierarchical, spatial, or temporal correlations. A natural extension of the proposed statistical methodology is to consider predictors based on multinomial models with random effects. This increases the mathematical complexity of the study, but provides greater flexibility for data modeling. This might be an interesting future line of research.

Acknowledgements The authors thank the editors of the book "Trends in Mathematical, Information and Data Sciences: A Tribute to Leandro Pardo" for their invitation to submit a contribution. This work was supported by the Spanish grant PGC2018-096840-B-I00, by the Valencian grant PROMETEO/2021/063 and by the European Regional Development Fund project "Center of Advanced Applied Sciences" (No. CZ.02.1.01/0.0/0.0/16 019/0000778).

References

1. López-Vizcaíno, E., Lombardía, M.J., Morales, D.: Multinomial-based small area estimation of labour force indicators. Stat. Model. 13(2), 153–178 (2013)
2. López-Vizcaíno, E., Lombardía, M.J., Morales, D.: Small area estimation of labour force indicators under a multinomial model with correlated time and area effects. J. Roy. Stat. Soc. Ser. A 178(3), 535–565 (2015)
3. Molina, I., Saei, A., Lombardía, M.J.: Small area estimates of labour force participation under a multinomial logit mixed model. J. Roy. Stat. Soc. Ser. A 170(4), 975–1000 (2007)


4. Morales, D., Esteban, M.D., Pérez, A., Hobza, T.: A Course on Small Area Estimation and Mixed Models. Springer, Berlin (2021)
5. Rao, J.N.K., Molina, I.: Small Area Estimation, 2nd edn. Wiley, Hoboken (2015)

Non-parametric Testing of Non-inferiority with Censored Data Alba M. Franco-Pereira, María Carmen Pardo, and Teresa Pérez

Abstract We propose non-parametric tests for showing non-inferiority of a new treatment compared to reference therapies when data are censored. Two new families of non-parametric approaches for solving this testing problem are investigated, which include two known tests. The performance of the test procedures is investigated in a simulation study under several scenarios. Finally, the proposed methods are applied to a major depression disorder clinical trial.

1 Introduction

Non-inferiority tests have become very popular nowadays. The term was originally coined in clinical trials, where the main objective is often to demonstrate that some new, experimental therapy is not inferior to a well-established reference treatment, Wellek [17]. In fact, a noninferiority trial aims to prove that the experimental treatment is not worse than other comparators by more than a pre-specified small amount, EMEA [3]. Clinical trials involving generic drugs are good examples of this type of study. In this case, a slight loss of efficacy is compensated by the economic benefit. For time-to-event outcomes with censored data and two arms, the noninferiority test, Freitag et al. [6], of the treatment $T_2$ to the treatment $T_1$ over the follow-up period $[t_1, t_2]$ can be formulated as

A. M. Franco-Pereira (B) · M. C. Pardo Department of Statistics and OR, Complutense University of Madrid, 28040 Madrid, Spain e-mail: [email protected] M. C. Pardo e-mail: [email protected] T. Pérez Department of Statistics and Data Science, Complutense University of Madrid, 28040 Madrid, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_28


\[
H_0: S_2(t) - S_1(t) \ge M \ \text{ for some } t \in [t_1, t_2]
\qquad \text{versus} \qquad
H_1: S_2(t) - S_1(t) < M \ \text{ for all } t \in [t_1, t_2],
\]
where $S_i$, $i = 1, 2$, are the survival functions, $M > 0$ when shorter survival is desirable and $M < 0$ otherwise. In these settings, several parametric and semi-parametric approaches have been suggested and were described by Freitag et al. [6]. There are some situations in which parametric or semi-parametric modelling of the treatment effects is difficult and a non-parametric approach becomes necessary. Various non-parametric methods for two arms have already been developed: Com-Nougue et al. [2], Su and Wei [15], Efron [5], Kalbfleisch and Prentice [9], Wellek [17], Martínez et al. [11]. The drawback of noninferiority tests in trials with only two arms (reference vs experimental) appears when the reference treatment had no effect at all in the trial; then, accepting the alternative hypothesis may provide no evidence that the experimental therapy is effective. For that reason, the U.S. Food and Drug Administration [16] recommends a 3-arm noninferiority trial that includes a placebo, if there are no ethical concerns, the so-called gold standard design. Mielke et al. [12], Kombrink et al. [10] and Hida and Tango [8] proposed different statistical methods for non-inferiority hypotheses with censored time-to-event outcomes; however, all of them assume the proportionality of the hazard functions. Chang and McKeague [1] developed two families of statistics based on nonparametric likelihood ratio tests. The noninferiority condition was expressed as an ordered alternative hypothesis and was extended to multiple survival functions. In our study, the same alternative hypothesis is considered but, unlike their approach, the proposed methodology relies on nonparametric chi-squared tests. These new families of tests were introduced by Pardo et al.
[13], in the context of simple stochastic and umbrella ordering for uncensored data, and they showed that the performance of the proposed methods, in terms of power and Type I error, was better than likelihood ratio type tests. The aim of this study is to present an extension of these procedures to accommodate right-censored data. Adopting notation from Chang and McKeague [1], let $S_1, S_2, \ldots, S_k$ be unknown survival functions corresponding to $k \ge 2$ treatments. The hypotheses of noninferiority can be rewritten as
\[
H_1: S_1^{M_1} \succeq S_2^{M_2} \succeq \ldots \succeq S_k^{M_k},
\]
where $M_1, \ldots, M_k > 0$ are the prespecified margins, and we define $S_i^{M_i} \succeq S_j^{M_j}$ to mean $S_i^{M_i}(t) \ge S_j^{M_j}(t)$ for all $t$, with a strict inequality for some $t$; the time domain is restricted to a given follow-up period $[t_1, t_2]$. When shorter survival is desirable, $M_i < M_j$ indicates noninferiority of the treatment $T_j$ to the treatment $T_i$, and $M_i \ge M_j$ indicates superiority of $T_j$ to $T_i$.


Chang and McKeague [1] proposed a two-step procedure for testing $H_1$ which consists of partitioning the parameter space for $(S_1, S_2, \ldots, S_k)$ into $H_{01} \cup H_{01}^c$, where $H_{01} = H_0 \cup H_1$ and
\[
H_0: S_1^{M_1} = S_2^{M_2} = \ldots = S_k^{M_k}, \tag{1}
\]
so one first tests the null $H_{01}^c$ versus $H_{01}$, and then $H_0$ versus $H_1$. Rejection of both of these null hypotheses gives support for $H_1$ versus the overall null $H_1^c = H_{01}^c \cup H_0$. The first test is more standard and can be seen in the previously cited paper. It is used to exclude the possibility of crossing alternatives or alternative orderings that constitute $H_{01}$. In this paper, we focus only on the second test. The paper is organized as follows. In Sect. 2 the notation and some preliminaries are presented. In Sect. 3 two different families of test statistics are considered, but we focus on four test statistics, two known and two new ones; their asymptotic null distributions are established. The performance of the proposed tests is examined in a simulation study in Sect. 4, and their application to a real data set is presented in Sect. 5. Concluding remarks are given in Sect. 6.

2 Preliminaries

Under the assumptions given in Sect. 2.1 of Chang and McKeague [1], the local nonparametric likelihood ratio (NPLR) at a given time point $t$ simplifies to
\[
\mathcal{R}(t) = \prod_{j=1}^{k} \prod_{i \le N_j(t)} \frac{\bar h_{ij}^{\,d_{ij}} \bigl(1 - \bar h_{ij}\bigr)^{r_{ij} - d_{ij}}}{\tilde h_{ij}^{\,d_{ij}} \bigl(1 - \tilde h_{ij}\bigr)^{r_{ij} - d_{ij}}}, \tag{2}
\]
where $r_{ij}$ is the number at risk just before $T_{ij}$, $d_{ij}$ is the number of deaths at $T_{ij}$, $N_j(t)$ is the number of observed uncensored times that are less than or equal to $t$, and $\bar h_{ij}$ is the estimate of the hazard probability under $H_0$, which is given by
\[
\bar h_{ij} = \frac{d_{ij}}{r_{ij} + (\lambda_j - \lambda_{j-1})}
\]
with the multipliers $\lambda_1, \ldots, \lambda_k$ satisfying the equality constraints given in (2.4) of Chang and McKeague [1], and $\tilde h_{ij}$ is the estimate of the hazard probability under $H_1$, which is given by
\[
\tilde h_{ij} = \frac{d_{ij}}{r_{ij} + (\tilde\lambda_j - \tilde\lambda_{j-1})}
\]
with the multipliers $\tilde\lambda_1, \ldots, \tilde\lambda_k$ satisfying the equality constraints (2.6) given in Chang and McKeague [1].


To obtain these multipliers, Chang and McKeague [1] proposed a pool adjacent violators algorithm (PAVA), which makes it possible to obtain asymptotic results for the local NPLR.
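The ingredients $r_{ij}$, $d_{ij}$ and the unconstrained hazard estimates $d_{ij}/r_{ij}$ can be read off the censored data directly; a sketch for one arm $j$ (hypothetical helper name and toy data):

```python
import numpy as np

def hazard_ingredients(times, status):
    """Observed uncensored times with at-risk counts r_ij, death counts d_ij and d_ij / r_ij."""
    times = np.asarray(times, float)
    status = np.asarray(status, int)
    t_event = np.unique(times[status == 1])                     # distinct uncensored times
    r = np.array([(times >= t).sum() for t in t_event])         # number at risk just before t
    d = np.array([((times == t) & (status == 1)).sum() for t in t_event])
    return t_event, r, d, d / r

t_event, r, d, h_check = hazard_ingredients([2, 3, 3, 5, 7], [1, 1, 0, 0, 1])
```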

3 Non-parametric Tests

For each fixed $t$, let $Z(t)$ be a statistic for testing $H_0$ such that large values of $Z(t)$ reject $H_0$. We consider two types of statistics, defined by
\[
Z = \int_{t_1}^{t_2} Z(t)\, du(t) \tag{3}
\]
and
\[
Z_{\max} = \sup_{t \in [t_1, t_2]} \bigl[Z(t)\, u(t)\bigr], \tag{4}
\]
where $u(t)$ is some weight function and large values of $Z$ or $Z_{\max}$ reject $H_0$. The power of these two types of statistics depends on $Z(t)$ and $u(t)$. These two families were studied by Zhang and Wu [18] in the context of the general $k$-sample test and by Pardo et al. [13] in the context of simple stochastic and umbrella ordering for uncensored data. Our candidates for $Z(t)$ are based on the Pearson chi-squared statistic and the likelihood ratio test statistic, with different weight functions $u(t)$. Chang and McKeague [1] proposed the member of the family of test statistics given in (3) for $Z(t) = -2\log \mathcal{R}(t)$ and $u(t) = \hat F_0(t)$. That is to say,
\[
I_n = -2\int_{t_1}^{t_2} \log \mathcal{R}(t)\, d\hat F_0(t), \tag{5}
\]
where $\hat F_0(t) = 1 - \hat S_0(t)$ and $\hat S_0(t)$ is a consistent estimate of the survival function $S_0(t) = \sum_{j=1}^{k} v_j(t)\, S_j^{M_j}(t)$, with $v_j(t)$ inversely proportional to the asymptotic standard deviation of $\hat S_j^{M_j}(t)$ and normalized so that $\sum_{j=1}^{k} v_j(t) = 1$. Furthermore, the time domain is restricted to a given follow-up period $[t_1, t_2]$. They also proposed the member of the family of test statistics given in (4) for $Z(t) = -2\log \mathcal{R}(t)$ and $u(t) = 1$. That is to say,
\[
K_n = \sup_{t \in [t_1, t_2]} \bigl(-2\log \mathcal{R}(t)\bigr). \tag{6}
\]

Our two proposals are the members of the families given in (4) for $u(t) = 1$ and in (3) for $u(t) = \hat F_0(t)$, based on $Z(t) = \Lambda_n(t)$, the Pearson chi-squared statistic
\[
\Lambda_n(t) = \sum_{j=1}^{k} \sum_{i \le N_j(t)} \frac{r_{ij}\bigl(\check h_{ij} - \bar h_{ij}\bigr)^2}{\bar h_{ij}\bigl(1 - \bar h_{ij}\bigr)} - \sum_{j=1}^{k} \sum_{i \le N_j(t)} \frac{r_{ij}\bigl(\check h_{ij} - \tilde h_{ij}\bigr)^2}{\tilde h_{ij}\bigl(1 - \tilde h_{ij}\bigr)},
\]
with $\check h_{ij} = d_{ij}/r_{ij}$. The former test statistic was considered by Davidov and Herman [4] in the context of simple stochastic ordering for uncensored data and the latter was introduced by Pardo et al. [13]. Therefore, the new test statistics are
\[
T_n^K = \sup_{t \in [t_1, t_2]} \Lambda_n(t) \tag{7}
\]
and
\[
T_n^I = \int_{t_1}^{t_2} \Lambda_n(t)\, d\hat F_0(t). \tag{8}
\]
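Given the constrained hazard estimates (e.g. produced by the PAVA step of [1]), $\Lambda_n(t)$ is a difference of two Pearson-type sums; a sketch with illustrative arrays (the constrained estimates below are made-up numbers, not PAVA output):

```python
import numpy as np

def lambda_n(r, d, h_bar, h_tilde):
    """Difference of Pearson-type sums: H0-constrained fit minus H1-constrained fit."""
    r, d = np.asarray(r, float), np.asarray(d, float)
    h_check = d / r                                   # unconstrained hazard estimates d_ij / r_ij
    def chi(h):
        h = np.asarray(h, float)
        return np.sum(r * (h_check - h) ** 2 / (h * (1.0 - h)))
    return chi(h_bar) - chi(h_tilde)

# the flattened arrays stand for all (i, j) pairs with i <= N_j(t)
r = np.array([50, 40, 30])
d = np.array([5, 4, 6])
stat = lambda_n(r, d, h_bar=[0.12, 0.11, 0.15], h_tilde=d / r)  # h_tilde = h_check makes term two vanish
```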

The asymptotic distributions of $I_n$ and $K_n$ are given in Theorem 1 of Chang and McKeague [1]. The following theorem gives the asymptotic null distributions of $T_n^I$ and $T_n^K$.

Theorem 3.1 Under $H_0$, if $1 > S_0(t_1) > S_0(t_2) > 0$, we have
\[
T_n^I \xrightarrow{L} \int_{t_1}^{t_2} \sum_{j=1}^{k} w_j(t)\bigl(E_w(B_w(t) \mid \mathcal{I})_j - \bar B(t)\bigr)^2\, dF_0(t),
\qquad
T_n^K \xrightarrow{L} \sup_{t \in [t_1, t_2]} \sum_{j=1}^{k} w_j(t)\bigl(E_w(B_w(t) \mid \mathcal{I})_j - \bar B(t)\bigr)^2, \tag{9}
\]
where $B_w(t) = \bigl(B_1(t)/\sqrt{w_1(t)}, \ldots, B_k(t)/\sqrt{w_k(t)}\bigr)^T$, the processes $B_1, \ldots, B_k$ are independent Gaussian processes, $w_j(t)$ are time-varying weights, $\bar B(t) = \sum_{j=1}^{k} w_j(t) B_j(t)$ and $E_w(B_w(t) \mid \mathcal{I})$ is the weighted least squares projection of $B_w(t)$ onto $\mathcal{I} = \{z \in \mathbb{R}^k : z_1 \le \ldots \le z_k\}$ with weights $w_1(t), \ldots, w_k(t)$.

Proof It is immediate from the proof of Theorem 1 of Chang and McKeague [1] and the fact that $-2\log \mathcal{R}(t) = \Lambda_n(t) + o_p(1)$.

4 Simulation Study

A simulation study evaluating the performance of the four test statistics $I_n$, $K_n$, $T_n^I$ and $T_n^K$ for testing the hypotheses given in (1) was conducted. The simulation setup is a three-arm noninferiority trial ($k = 3$) in which $S_1$ represents a placebo, $S_2$ a standard therapy and $S_3$ an experimental therapy. We consider three Gamma

Fig. 1 Simulation scenarios. Above {Below}: A (left) representing $H_0$, B {B'} (middle) and C {C'} (right) representing $H_1$. Each $S_j^{M_j}$ is specified as Gamma: placebo (green solid), standard therapy (blue longdashed), experimental therapy (red dashed). Note in A all three lines overlap

scenarios: A (representing $H_0$), B, B' and C, C' (representing $H_1$), and in each we define the $S_j$ by specifying $S_j^{M_j}$ (see Fig. 1) and $M_j$. The $S_j^{M_j}$ have proportional hazards in Scenarios B and B', but crossing hazards in Scenarios C and C'. Moreover, we consider four sets of margins: $(M_1, M_2, M_3) = (1, 1, 10/8)$, $(M_1, M_2, M_3) = (1, 1, 10/7)$, $(M_1, M_2, M_3) = (1.1, 1, 10/8)$ and $(M_1, M_2, M_3) = (1, 0.97, 1)$. All of these margins represent superiority of the standard therapy over the placebo (i.e., $M_1 \ge M_2$), and noninferiority of the experimental to the standard therapy (i.e., $M_2 < M_3$) with a margin of 0.8, 0.7 or 0.97. We specify the censoring distributions (the same in each arm) to be uniform with administrative censoring at $t = 10$, and a censoring rate of either 10% or 25% in the placebo group. Also, we consider a per-group sample size of 120. Figure 2 displays the hazard rate functions under the different scenarios. The number of replicates is 1000 and the significance level for rejecting $H_0$ is $\alpha = 0.05$. As the asymptotic null distributions of the four statistics are not distribution-free and are complex, to obtain critical values we use a multiplier bootstrap approach commonly used in survival analysis (Parzen et al. [14]). The empirical significance levels for the four tests considered in this paper are presented in Table 1. We note that the simulated sizes for the four test statistics are reasonably close to the nominal size for the four sets of margins $(M_1, M_2, M_3)$. For a censoring rate of 25%, $K_n$ and $T_n^K$ are too conservative.
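The data-generating step of the study (Gamma survival times with uniform censoring and an administrative cutoff at $t = 10$) can be sketched as follows; the Gamma and censoring parameters are illustrative placeholders, not the scenario values of the paper:

```python
import numpy as np

def simulate_arm(n, shape, scale, cens_upper, rng):
    """One arm: Gamma event times, Uniform(0, cens_upper) censoring, administrative cutoff at t = 10."""
    t = rng.gamma(shape, scale, size=n)                          # latent survival times
    c = np.minimum(rng.uniform(0.0, cens_upper, size=n), 10.0)   # censoring time, capped at 10
    return np.minimum(t, c), (t <= c).astype(int)                # observed time, event indicator

rng = np.random.default_rng(7)
obs, status = simulate_arm(120, shape=2.0, scale=1.5, cens_upper=30.0, rng=rng)
```

Tuning `cens_upper` is one way to reach a target censoring rate (such as the 10% or 25% used in the study).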

Fig. 2 Hazard rates on the simulation scenarios. Above {Below}: A (left), B {B'} (middle) and C {C'} (right). Note in A all three lines overlap

Table 1 Empirical significance levels, Scenario A

                      10% censoring                    25% censoring
(M1, M2, M3)      In      Kn      TnI     TnK      In      Kn      TnI     TnK
(1, 1, 10/8)      0.056   0.048   0.058   0.050    0.054   0.030   0.058   0.040
(1, 1, 10/7)      0.052   0.044   0.044   0.048    0.048   0.030   0.048   0.030
(1.1, 1, 10/8)    0.046   0.052   0.050   0.054    0.048   0.042   0.054   0.044
(1, 0.97, 1)      0.050   0.058   0.060   0.062    0.060   0.044   0.062   0.036

Table 2 shows the empirical powers. $T_n^I$ outperforms the other tests in almost all the scenarios and sets of margins considered. Its counterpart proposed by Pardo et al. [13] in the context of simple stochastic and umbrella ordering for uncensored data also emerged as a more powerful test than $I_n$ and $K_n$.

5 Application to Real Data

We analyze data from a randomized, double-blind, active comparator-controlled study, Mielke et al. [12]. Patients meeting DSM-IV criteria for major depression disorder were randomized to receive placebo, standard or experimental treatment. The aim of this non-inferiority study was to test whether the experimental antidepressant showed at least as much efficacy as the standard, i.e. whether the experimental treatment would be at least as fast as the standard treatment in achieving remission.


Table 2 Empirical powers

                                 10% censoring                    25% censoring
Scenario  (M1, M2, M3)       In      Kn      TnI     TnK      In      Kn      TnI     TnK
B         (1, 1, 10/8)       0.122   0.114   0.168   0.136    0.134   0.114   0.180   0.126
          (1, 1, 10/7)       0.114   0.092   0.150   0.104    0.132   0.104   0.170   0.118
          (1.1, 1, 10/8)     0.118   0.138   0.156   0.146    0.130   0.124   0.174   0.134
          (1, 0.97, 1)       0.152   0.170   0.160   0.162    0.156   0.158   0.158   0.168
B'        (1, 1, 10/8)       0.338   0.328   0.358   0.344    0.330   0.326   0.354   0.344
          (1, 1, 10/7)       0.302   0.280   0.310   0.300    0.306   0.298   0.324   0.310
          (1.1, 1, 10/8)     0.322   0.332   0.342   0.348    0.336   0.334   0.352   0.338
          (1, 0.97, 1)       0.410   0.440   0.414   0.450    0.394   0.432   0.406   0.430
C         (1, 1, 10/8)       0.692   0.692   0.728   0.704    0.730   0.726   0.750   0.730
          (1, 1, 10/7)       0.662   0.630   0.698   0.646    0.688   0.670   0.730   0.684
          (1.1, 1, 10/8)     0.668   0.684   0.702   0.694    0.700   0.716   0.734   0.720
          (1, 0.97, 1)       0.764   0.814   0.786   0.818    0.798   0.840   0.820   0.846
C'        (1, 1, 10/8)       0.336   0.352   0.388   0.370    0.362   0.372   0.402   0.386
          (1, 1, 10/7)       0.312   0.294   0.350   0.330    0.320   0.324   0.372   0.348
          (1.1, 1, 10/8)     0.324   0.340   0.362   0.358    0.344   0.372   0.398   0.384
          (1, 0.97, 1)       0.420   0.434   0.462   0.448    0.438   0.458   0.466   0.474

The outcome considered is time, in days, to first remission, where remission is defined as maintaining the 17-item Hamilton Rating Scale for Depression (HAMD$_{17}$) total score $\le 7$. As was previously done by Chang and McKeague [1], data were obtained by digitizing the published Kaplan-Meier curves presented in Mielke et al. [12] using the algorithm developed by Guyot et al. [7]. At the 10-week follow-up, remission was observed in 56 (41%) of 135 patients allocated to placebo, 123 (46%) of 267 allocated to the standard, and 134 (51%) of 262 allocated to the experimental treatment. In randomized clinical trials, it is quite common to report the p-value of the log-rank test ($p > 0.05$ for both comparisons with placebo) and the hazard ratios, even when the proportional hazards assumption is not clear, as in this case (Fig. 3A). Therefore, the objective was evaluated by testing the alternative hypothesis
\[
H_1: S_1^{M_1} \succeq S_2^{M_2} \succeq S_3^{M_3},
\]
where subscripts $i = 1, 2$ and $3$ represent the placebo, the standard, and the experimental treatment, respectively. Values for $(M_1, M_2, M_3)$ were chosen so that they fulfilled $M_2 < M_3$, the noninferiority of the experimental to the standard, and $M_1 \ge M_2$, the superiority of the standard over placebo. Figure 3B shows the Kaplan-Meier curves $\hat S_i$, whereas Fig. 3C displays the corresponding values of $\hat S_i^{M_i}$ when considering $(M_1, M_2, M_3) = (1, 0.97, 1)$. The ratio $M_2/M_3 = 0.97$ represents that the largest tolerable decrease in the chance of remission of the experimental treatment relative to the standard is 3%.


Fig. 3 Graphical evaluation of the proportional hazards assumption, log-log survival curves (A). Survival curves $\hat S_i^{M_i}$ for patients with major depression disorder, $(M_1, M_2, M_3) = (1, 1, 1)$ (B) and $(M_1, M_2, M_3) = (1, 0.97, 1)$ (C), for placebo (green solid), standard treatment (blue longdashed) and experimental treatment (red dashed)

Table 3 Statistics values (and critical values); the value marked with * leads to rejecting $H_0$

(M1, M2, M3)    In              Kn              TnI              TnK
(1, 0.97, 1)    1.667 (1.689)   6.731 (7.748)   1.694* (1.689)   6.777 (7.748)

The statistics values as well as the critical values are given in Table 3. Non-inferiority of the experimental treatment compared with the standard was confirmed with the statistic $T_n^I$. In addition, the efficacy of the experimental and standard treatments was shown, since time to first remission was significantly shorter for patients treated with either the experimental or the standard treatment than for patients receiving placebo. However, the rest of the statistics fail to provide such evidence.
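The margin-adjusted curves $\hat S_i^{M_i}$ of Fig. 3C are simply Kaplan-Meier estimates raised to the power $M_i$; a sketch with toy data (not the trial data):

```python
import numpy as np

def kaplan_meier(times, status, grid):
    """Kaplan-Meier estimate of S(t) evaluated on a grid of time points."""
    times = np.asarray(times, float)
    status = np.asarray(status, int)
    t_event = np.unique(times[status == 1])
    S = np.empty(len(grid), dtype=float)
    for i, t in enumerate(grid):
        surv = 1.0
        for te in t_event[t_event <= t]:
            r = (times >= te).sum()                      # at risk just before te
            d = ((times == te) & (status == 1)).sum()    # deaths at te
            surv *= 1.0 - d / r
        S[i] = surv
    return S

grid = np.array([0.0, 2.0, 4.0, 6.0])
S = kaplan_meier([1, 2, 2, 3, 5, 6], [1, 1, 0, 1, 0, 1], grid)
S_margin = S ** 0.97   # margin-adjusted curve, e.g. M = 0.97 as in (M1, M2, M3) = (1, 0.97, 1)
```

Since $0 \le S \le 1$ and $M = 0.97 < 1$, the adjusted curve lies on or above the unadjusted one.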

6 Concluding Remarks

In this paper, we have studied nonparametric chi-squared-based tests for the ordering of $k$ survival functions based on right-censored data, as competitors of the test developed by Chang and McKeague [1]. All of them are members of two wide families of test statistics. We have established the asymptotic distributions of these test statistics but, as they are complex, we use the bootstrap to obtain critical values. We have compared our chi-squared-based tests with the known empirical likelihood-based tests. One of the new test statistics has a stable Type I error and is more powerful than the other three tests in the simulation studies. Furthermore, the conclusion from our test is supported by the analyzed data, whereas the other statistics fail to provide such evidence.

Acknowledgements This work was partially supported by grants PID2019-104681RB-I00 and MTM2017-89422-P.


A. M. Franco-Pereira et al.

References
1. Chang, H., McKeague, I.W.: Nonparametric testing for multiple survival functions with non-inferiority margins. Ann. Stat. 47(1), 205–232 (2019)
2. Com-Nougue, C., Rodary, C., Patte, C.: How to establish equivalence when data are censored: a randomized trial of treatments for B non-Hodgkin lymphoma. Stat. Med. 12(14), 1353–1364 (1993)
3. Committee for Medicinal Products for Human Use: Guideline on the choice of the non-inferiority margin. EMEA/CPMP/EWP/2158/99 (2005)
4. Davidov, O., Herman, A.: Testing for order among K populations: theory and examples. Can. J. Stat. 38(1), 97–115 (2010)
5. Efron, B.: The two sample problem with censored data. In: Le Cam, L.M., Neyman, J. (eds.) Proceedings 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 4, pp. 831–853. University of California Press, Berkeley (1967)
6. Freitag, G., Lange, S., Munk, A.: Non-parametric assessment of non-inferiority with censored data. Stat. Med. 25(7), 1201–1217 (2006)
7. Guyot, P., Ades, A.E., Ouwens, M.J.N.M., Welton, N.J.: Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan–Meier survival curves. BMC Med. Res. Methodol. 12(9) (2012). https://doi.org/10.1186/1471-2288-12-9
8. Hida, E., Tango, T.: Design and analysis of a 3-arm noninferiority trial with a prespecified margin for the hazard ratio. Pharm. Stat. 17(5), 489–503 (2018)
9. Kalbfleisch, J.D., Prentice, R.L.: Estimation of the average hazard ratio. Biometrika 68(1), 105–112 (1981)
10. Kombrink, K., Munk, A., Friede, T.: Design and semiparametric analysis of non-inferiority trials with active and placebo control for censored time-to-event data. Stat. Med. 32(18), 3055–3066 (2013)
11. Martinez, E.E., Sinha, D., Wang, W., Lipsitz, S.R., Chappell, R.J.: Tests for equivalence of two survival functions: alternative to the tests under proportional hazards. Stat. Methods Med. Res. 26(1), 75–87 (2017)
12. Mielke, M., Munk, A., Schacht, A.: The assessment of non-inferiority in a gold standard design with censored, exponentially distributed endpoints. Stat. Med. 27(25), 5093–5110 (2008)
13. Pardo, M.C., Lu, Y., Franco-Pereira, A.M.: Extensions of empirical likelihood and chi-squared-based tests for ordered alternatives. J. Appl. Stat. 49(1), 24–43 (2022)
14. Parzen, M.I., Wei, L.J., Ying, Z.: Simultaneous confidence intervals for the difference of two survival functions. Scand. J. Stat. 24, 309–314 (1997)
15. Su, J.Q., Wei, L.J.: Nonparametric estimation for the difference or ratio of median failure times. Biometrics 49(2), 603–607 (1993)
16. U.S. Department of Health and Human Services, FDA: Non-inferiority clinical trials to establish effectiveness (2016). https://www.fda.gov/downloads/Drugs/Guidances/UCM202140.pdf. Accessed 20/11/2020
17. Wellek, S.: Testing Statistical Hypotheses of Equivalence and Noninferiority, 2nd edn. CRC Press, Boca Raton (2010)
18. Zhang, J., Wu, Y.: k-sample tests based on the likelihood ratio. Comput. Stat. Data Anal. 51(9), 4682–4691 (2007)

A Review of Goodness-of-Fit Tests for Models Involving Functional Data

Wenceslao González-Manteiga, Rosa M. Crujeiras, and Eduardo García-Portugués

Abstract A sizable number of goodness-of-fit tests involving functional data have appeared in the last decade. We provide a relatively compact review of most of these contributions, within the independent and identically distributed framework, covering goodness-of-fit tests for distribution and regression models with a functional predictor and either scalar or functional response.

1 Introduction

Since the earliest Goodness-of-Fit (GoF) tests were introduced by Pearson more than a century ago, there has been a prolific statistical literature on this topic. If we were to highlight a milestone in this period, it would be 1973, with the publication of [9] and [1], which introduced a novel design of GoF tests based on distances between distribution and density estimates, respectively. To set the context for the reader, assume that {X_1, ..., X_n} is an independent and identically distributed (iid) sample of a random variable X with (unknown) distribution F (or density f, if that is the case). If the target function is the distribution F, then the GoF testing problem can be formulated as testing H0 : F ∈ F_Θ = {F_θ : θ ∈ Θ ⊂ R^q} vs. H1 : F ∉ F_Θ, where F_Θ stands for a parametric family of distributions indexed in some finite-dimensional set Θ. A general test statistic for this problem can be written as T_n = T(F_n, F_θ̂), with the functional T denoting, here and henceforth, some kind of distance between a nonparametric estimate, given in this case by
W. González-Manteiga (B) · R. M. Crujeiras
Department of Statistics, Mathematical Analysis and Optimization, University of Santiago de Compostela, A Coruña, Spain
e-mail: [email protected]
R. M. Crujeiras
e-mail: [email protected]
E. García-Portugués
Department of Statistics, Carlos III University of Madrid, Madrid, Spain
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_29


the empirical cumulative distribution function F_n(x) = n⁻¹ Σ_{i=1}^n I(X_i ≤ x), and an estimate obtained under the null hypothesis H0, F_θ̂ in this case. Similarly, for testing the GoF of a certain parametric density model, the testing problem is formulated as H0 : f ∈ f_Θ = {f_θ : θ ∈ Θ ⊂ R^q} vs. H1 : f ∉ f_Θ and can be approached with the general test statistic T_n = T(f_nh, f_θ̂). In this setting, f_θ̂ is the density estimate under H0 and f_nh denotes the kernel density estimator f_nh(x) = n⁻¹ Σ_{i=1}^n K_h(x − X_i) by Parzen [27] and Rosenblatt [32], where K_h(·) = K(·/h)/h, K is the kernel function, and h is the bandwidth.

The previous ideas were naturally generalized to the context of regression models in the 1990s. Consider a nonparametric, random design, regression model such that Y = m(X) + ε with (X, Y) ∈ R^p × R, m(x) = E[Y | X = x], and E[ε | X = x] = 0. Denote by {(X_i, Y_i)}_{i=1}^n an iid sample of (X, Y) satisfying such a model. In this context, the GoF goal is to test H0 : m ∈ M_Θ = {m_θ : θ ∈ Θ ⊂ R^q} vs. H1 : m ∉ M_Θ, where M_Θ represents a parametric family of regression functions indexed in Θ. Continuing along the testing philosophies advocated by [1, 9], the seminal works of [16, 34] respectively introduced two types of GoF tests for regression models:

(a) Tests based on empirical regression processes, considering distances between estimates of the integrated regression function I(x) = ∫_{−∞}^x m(t) dF(t) (F being the marginal distribution of X) under H0 and H1. Specifically, the test statistics are constructed as T_n = T(I_n, I_θ̂), with I_n(x) = n⁻¹ Σ_{i=1}^n I(X_i ≤ x) Y_i and I_θ̂(x) = n⁻¹ Σ_{i=1}^n I(X_i ≤ x) m_θ̂(X_i).

(b) Smoothing-based tests, using distances between estimated regression functions, T_n = T(m_nh, m_θ̂), with m_nh a smooth regression estimator. As a particular case, m_nh(x) = Σ_{i=1}^n W_{nh,i}(x) Y_i, with W_{nh,i}(x) some weights depending on a smoothing parameter h.
Such an estimator can be obtained with Nadaraya–Watson or local linear weights (see, e.g., [37]). A complete review of GoF for regression models was presented by [14], who described the aforementioned two testing paradigms and focused on the smoothing-based alternative for discussing their properties (asymptotic behavior and calibration of the distribution in practice). The authors thoroughly reviewed references in the statistical literature spanning more than two decades, and they also identified some areas where GoF tests were still to be developed. One of these areas is functional data analysis. The goal of this work is to round off that previous review by covering the more recent contributions in GoF for distribution and regression models with functional data. Consequently, the rest of the chapter is organized as follows: Sect. 2 is devoted to GoF for distribution models of functional random variables, and Sect. 3 focuses on regression models with scalar (Sect. 3.1) and functional response (Sect. 3.2).
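Before moving to the functional setting, the generic construction T_n = T(F_n, F_θ̂) with resampling calibration can be made concrete in the simplest scalar case. The following sketch (our own illustration, not code from the reviewed literature; all function names are hypothetical) computes a Cramér–von Mises distance between the empirical distribution function and a Gaussian family with estimated parameters, calibrated by parametric bootstrap:

```python
import math
import random

def norm_cdf(x, mu, sigma):
    # Phi((x - mu) / sigma) through the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cvm_statistic(sample):
    # T_n = T(F_n, F_theta_hat): Cramer-von Mises distance between the ecdf
    # and the Gaussian cdf with theta_hat = (mean, sd) estimated from the data
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in sample) / n)
    t = 1.0 / (12.0 * n)
    for i, x in enumerate(sorted(sample), start=1):
        t += (norm_cdf(x, mu, sigma) - (2 * i - 1) / (2.0 * n)) ** 2
    return t, mu, sigma

def cvm_test(sample, B=500, rng=random.Random(42)):
    # parametric bootstrap: resample from F_theta_hat to calibrate T_n
    t_obs, mu, sigma = cvm_statistic(sample)
    n = len(sample)
    boot = [cvm_statistic([rng.gauss(mu, sigma) for _ in range(n)])[0]
            for _ in range(B)]
    p_value = sum(t_b >= t_obs for t_b in boot) / B
    return t_obs, p_value

rng = random.Random(1)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(200)]
exponential = [rng.expovariate(1.0) for _ in range(200)]
t_g, p_gauss = cvm_test(gaussian)
t_e, p_exp = cvm_test(exponential)
print(p_gauss, p_exp)
```

The exponential sample, clearly incompatible with the Gaussian family, should produce a rejection; the same template applies with F_n replaced by a kernel density estimate or a smooth regression estimator.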


2 GoF for Distribution Models for Functional Data

Owing to the requirement of appropriate tools for analyzing high-frequency data, and the boost provided by the books by [10, 31] (or, more recently, by [17]), functional data analysis is nowadays one of the most active research areas within statistics. Actually, the pressing need to develop new statistical tools for data in general spaces has reclaimed separable Hilbert spaces as a very natural and common framework. However, given the generality of this kind of space, there is a scarcity of parametric distribution models for a Hilbert-valued random variable X aside from the popular framework of Gaussian processes.

Let H denote a Hilbert space over R, the norm of which is given by its scalar product as ‖x‖ = √⟨x, x⟩. Consider {X_1, ..., X_n} iid copies of the random variable X : (Ω, A) → (H, B(H)), with (Ω, A, P) the probability space where the random sample is defined and B(H) the Borel σ-field on H. The general GoF problem for the distribution of X consists of testing H0 : P_X ∈ P_Θ = {P_θ : θ ∈ Θ} vs. H1 : P_X ∉ P_Θ, where P_Θ is a class of probability measures on H indexed in a parameter set Θ, now possibly infinite-dimensional, and P_X is the (unknown) probability distribution of X induced over H.

When the goal is to test the simple null hypothesis H0 : P_X ∈ {P_0}, a feasible approach that enables the construction of test statistics is based on projections π : H → R, in such a way that the test statistics are defined from the projected sample {π(X_1), ..., π(X_n)}. Such an approach can be taken on the distribution function: T_{n,π} = T(F_{n,π}, F_{0,π}) with F_{n,π}(x) = n⁻¹ Σ_{i=1}^n I(π(X_i) ≤ x) and F_{0,π}(x) = P_{H0}(π(X) ≤ x). Some specific examples are given by the adaptation to this context of the Kolmogorov–Smirnov, Cramér–von Mises, or Anderson–Darling type tests. As an alternative, and mimicking the smoothing-based tests presented in Sect. 1, a test statistic can also be built as T_{n,π} = T(f_{nh,π}, E_{H0}[f_{nh,π}]) with f_{nh,π}(x) = n⁻¹ Σ_{i=1}^n K_h(x − π(X_i)). It should also be noted that, when embracing the projection approach, the test statistic may take into account 'all' the projections within a certain space, e.g. by considering T_n = ∫ T_{n,π} dW(π) for W a probability measure on the space of projections, or take just T_n = T_{n,π̂} with π̂ being a randomly sampled projection from a certain non-degenerate probability measure W.

Now, when the goal is to test the composite null hypothesis H0 : P_X ∈ P_Θ, the previous generic approaches are still valid if P_{0,π}(x) is replaced with P_{θ̂,π}(x) = P_{P_θ̂}(π(X) ≤ x). Within this setting, [4, 5] provide a characterization of the composite null hypothesis by means of random projections, jointly with a bootstrap procedure for calibration, as does [2]. As an alternative, [8] follows a finite-dimensional approximation. Note that, in the space of real square-integrable functions H = L²[0, 1], as a particular case one may take π_h(x) = ⟨x, h⟩, with h ∈ H. The previous references provide some approaches for the calibration under the null hypothesis of the rejection region {T_n > c_α}, where P(T_n > c_α) ≤ α.

A relevant alternative to the procedures based on projections is the use of the so-called 'energy statistics' [35]. Working with H a general separable Hilbert space


(as can be seen in [24]), if X ∼ P_X and Y ∼ P_Y = P_0 (P_0 being the distribution under the null), then

E = E(X, Y) = 2E[‖X − Y‖] − E[‖X − X′‖] − E[‖Y − Y′‖] ≥ 0,   (1)

with {X, X′} and {Y, Y′} iid copies of the variables with distributions P_X and P_Y, respectively. Importantly, (1) equals 0 if and only if P_X = P_Y, a characterization that serves as basis for a GoF test. The energy statistic in (1) can be empirically estimated from a sample {X_1, ..., X_n} as

Ê* = (2/n²) Σ_{i=1}^n Σ_{j=1}^n ‖X_i − Y*_j‖ − (1/n²) Σ_{i=1}^n Σ_{j=1}^n ‖X_i − X_j‖ − (1/n²) Σ_{i=1}^n Σ_{j=1}^n ‖Y*_i − Y*_j‖,

with {Y*_1, ..., Y*_n} simulated from P_Y. This estimated energy Ê* can be compared with appropriate Monte Carlo simulations {Y*b_1, ..., Y*b_n}, b = 1, ..., B, designed to build an α-level critical point using {Ê*b}_{b=1}^B. Note that, under the null hypothesis, Y* is simulated from P_0. In the case of testing a composite hypothesis, the generation is done under P_θ̂ with θ̂ estimated using {X_1, ..., X_n}.

Due to the scarcity of distribution models for random functions, the Gaussian case is one of the most widely studied, as can be seen, e.g., in [20, 21] and in the recent review by [15] on tests for Gaussianity of functional data. Finally, it is worth mentioning the two-sample problem, a common offspring of the simple-hypothesis one-sample GoF problem. Two-sample tests have also received a significant deal of attention in the last decades; see, e.g., the recent contributions by [19, 30] and references therein.
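The energy-statistic route above can be sketched as follows (our own illustration with hypothetical function names; the simple null P_0 is taken to be standard Brownian motion discretised on a grid, and the data are generated under a drifted alternative):

```python
import math
import random

def l2_dist(f, g, dt):
    # discretised L2[0,1] norm of f - g on a regular grid with step dt
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) * dt)

def energy_statistic(X, Y, dt):
    # E_hat = 2/n^2 sum ||Xi - Yj|| - 1/n^2 sum ||Xi - Xj|| - 1/n^2 sum ||Yi - Yj||
    n = len(X)
    s_xy = sum(l2_dist(xi, yj, dt) for xi in X for yj in Y)
    s_xx = sum(l2_dist(xi, xj, dt) for xi in X for xj in X)
    s_yy = sum(l2_dist(yi, yj, dt) for yi in Y for yj in Y)
    return (2.0 * s_xy - s_xx - s_yy) / n ** 2

def brownian_path(m, rng):
    # standard Brownian motion evaluated on m + 1 equispaced points of [0, 1]
    dt, w, path = 1.0 / m, 0.0, [0.0]
    for _ in range(m):
        w += rng.gauss(0.0, math.sqrt(dt))
        path.append(w)
    return path

def energy_gof_test(X, simulate_null, dt, B=60, rng=random.Random(0)):
    # compare E_hat* with B Monte Carlo replicates generated under P_0
    n = len(X)
    e_obs = energy_statistic(X, [simulate_null(rng) for _ in range(n)], dt)
    e_boot = [energy_statistic([simulate_null(rng) for _ in range(n)],
                               [simulate_null(rng) for _ in range(n)], dt)
              for _ in range(B)]
    p_value = sum(e_b >= e_obs for e_b in e_boot) / B
    return e_obs, p_value

m, dt, rng = 20, 1.0 / 20, random.Random(3)
# data generated under an alternative: Brownian motion plus the trend 3t
X = [[w + 3.0 * j * dt for j, w in enumerate(brownian_path(m, rng))]
     for _ in range(20)]
e_obs, p_value = energy_gof_test(X, lambda r: brownian_path(m, r), dt)
print(e_obs, p_value)
```

The observed Ê* sits well above the Monte Carlo replicates generated under the null, so the drifted alternative is rejected; for a composite null one would first estimate θ̂ and simulate from P_θ̂ instead.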

3 GoF for Regression Models with Functional Data

We assume henceforth, without loss of generality and for the sake of an easier presentation, that both the predictor X and the response Y are centered, so that the intercepts of the linear functional regression models are null.

3.1 Scalar Response

A particular case of a regression model with functional predictor and scalar response is the so-called functional linear model. For H_X = L²[0, 1], this parametric model is given by

Y = m_β(X) + ε,  m_β(x) = ⟨x, β⟩ = ∫_0^1 x(t) β(t) dt,   (2)

for some unknown β ∈ H_X indexing the functional form of the model. This popular model can be seen as the natural extension of the classical linear (Euclidean) regression model. In general, there have been two mainstream approaches for performing inference on (2): (i) testing the significance of the trend within the linear model, i.e., testing H0 : m ∈ {m_β0} vs. H1 : m ∈ {m_β : β ∈ H_X, β ≠ β_0}, usually with β_0 = 0; (ii) testing the linearity of m, i.e., testing H0 : m ∈ L = {m_β : β ∈ H_X} vs. H1 : m ∉ L.

For the GoF testing problem presented in (ii), given an iid sample {(X_i, Y_i)}_{i=1}^n, one may consider the adaptation to this setting of the smoothing-based tests, with a basic test statistic structure given by T_n = T(m_nh, m_β̂), where β̂ is a suitable estimator for β and

m_nh(x) = Σ_{i=1}^n W_{ni}(x) Y_i = Σ_{i=1}^n [K_h(x − X_i) / Σ_{j=1}^n K_h(x − X_j)] Y_i   (3)

is the Nadaraya–Watson estimator with a functional predictor. A particular smoothing-based test statistic is given by that of [7],

T_n = ∫ (m_nh(x) − m_{nh,β̂}(x))² ω(x) dP_X(x),

which employs a weighted L² distance between (3) and m_{nh,β̂}, the latter being a smoothed version of the parametric estimator that follows by replacing Y_i with m_β̂(X_i) in (3). Note that a crucial problem for implementing this test is the computation of the critical region {T_n > c_α}, which depends on the selection of h when a class of estimators for β is used under the null. This class of smoothing-based tests was deeply studied in the Euclidean setting (see [14]). Nevertheless, this is not the case in the functional context, except for the recent contributions by [25] and [28].

As also presented by [14] in their review, it is possible to avoid the bandwidth selection problem using tests based on empirical regression processes. For this purpose, a key element is the empirical counterpart of the integrated regression function, I_n(x) = n⁻¹ Σ_{i=1}^n I(X_i ≤ x) Y_i, where X_i ≤ x means that X_i(t) ≤ x(t) for all t ∈ [0, 1]. In this scenario, the test statistic can be formulated as T_n = T(I_n, I_β̂), where I_β̂(x) = n⁻¹ Σ_{i=1}^n I(X_i ≤ x) Ŷ_i, with Ŷ_i = ⟨X_i, β̂⟩. Deriving the theoretical behavior of an empirical regression process indexed by x ∈ H_X, namely R_n(x) = √n (I_n(x) − I_β̂(x)), is a challenging task. Yet, as previously presented, the projection approach over H_X can be considered. The null hypothesis H0 : m ∈ L can be formulated as H0 : E[(Y − ⟨X, β⟩) I(⟨X, γ⟩ ≤ u)] = 0, for a β ∈ H_X and for all γ ∈ H_X,


which in turn is equivalent to replacing 'for all γ ∈ H_X' with 'for all γ ∈ S_{H_X}' or 'for all γ ∈ S^{p−1}_{H_X,{ψ_j}_{j=1}^∞}, for all p ≥ 1', where

S_{H_X} = {ρ ∈ H_X : ‖ρ‖ = 1},  S^{p−1}_{H_X,{ψ_j}_{j=1}^∞} = {ρ = Σ_{j=1}^p r_j ψ_j : ‖ρ‖ = 1}

are infinite- and finite-dimensional spheres on H_X, {ψ_j}_{j=1}^∞ is an orthonormal basis of H_X, and {r_j}_{j=1}^p ⊂ R. As follows from [13], a general test statistic can be built by aggregating all the projections within a certain subspace: T_n = ∫ T_{n,π} dW(π) with T_{n,π} = T(I_{n,π}, I_{β̂,π}) based on

I_{n,π}(u) = n⁻¹ Σ_{i=1}^n I(π(X_i) ≤ u) Y_i  and  I_{β̂,π}(u) = n⁻¹ Σ_{i=1}^n I(π(X_i) ≤ u) Ŷ_i,   (4)

for π(x) = ⟨x, γ⟩. In this case, W is a probability measure defined in S_{H_X} or S^{p−1}_{H_X,{ψ_j}_{j=1}^∞}, for a certain p ≥ 1. Alternatively, the test statistic can be based on only one random projection: T_n = T_{n,π̂}. More generally, T_n may consider the aggregation of a finite number of random projections, as advocated in the test statistic of [6]. Both types of tests, all-projections and finite-random-projections, may feature several distances for T, such as Kolmogorov–Smirnov or Cramér–von Mises types, although the latter type of distance yields more tractable all-projections statistics.

Model (2) can be generalized to include a more flexible trend component, for instance, with an additive formulation. The functional generalized additive model (see [26]) is formulated as

Y = m_F(X) + ε,  m_F(x) = η + ∫_0^1 F(x(t), t) dt   (5)

and it can be seen that (2) is a particular case of (5) with F(x, t) = xβ(t) and η = 0. The functional F can be approximated as

F(x, t) = Σ_{j=1}^{k_X} Σ_{k=1}^{k_T} θ_{jk} B_j^X(x) B_k^T(t),

where the θ_{jk} are unknown tensor-product B-spline coefficients. For both the x and t components, cubic B-spline bases, namely {B_j^X(x)}_{j=1}^{k_X} and {B_k^T(t)}_{k=1}^{k_T}, are considered. Model (5) can be written in an approximate way as a linear model with random effects (see [38]) using the evaluations of X_i(t_{in}) over a grid {t_{in}} ⊂ [0, 1]. Under the assumption of ε being a Gaussian process, the so-called restricted likelihood ratio test (RLRT) can be used, where testing the GoF of the functional linear model (2)


against model specifications within (5) is equivalent to testing that the variance of the random effect is null.

Another generalization of the functional linear model is given by the functional quadratic regression model introduced by [18]:

Y = ∫_0^1 β(t) X(t) dt + ∫_0^1 ∫_0^1 γ(s, t) X(t) X(s) dt ds + ε.   (6)

Clearly, when γ = 0, (2) follows as a particular case of (6). Using a principal component analysis methodology to approximate the covariance function Cov(t, s) = E[(X_i(t) − E[X(t)])(X_i(s) − E[X(s)])], with β(t) = Σ_{j=1}^p b_j v_j(t) and γ(s, t) = Σ_{j=1}^p Σ_{k=1}^p a_{jk} v_k(s) v_j(t), where the v_j are the eigenfunctions of Cov(t, s), model (6) can be written as a kind of linear model, where the null hypothesis γ = 0 is tested.

A recent contribution by [22] is devoted to testing a modified null hypothesis, H̃0 : 'X is independent of ε and m ∈ L', using results related to the distance covariance (see [24, 33, 36]). Consider (X̃, ρ_X̃) and (Ỹ, ρ_Ỹ) two semimetric spaces of negative type, where ρ_X̃ and ρ_Ỹ are the corresponding semimetrics. Denote by (X̃, Ỹ) a random element with joint distribution P_X̃Ỹ and marginals P_X̃ and P_Ỹ, respectively, and take (X̃′, Ỹ′) an iid copy of (X̃, Ỹ). The generalized distance covariance of (X̃, Ỹ) is given by

θ(X̃, Ỹ) = E[ρ_X̃(X̃, X̃′) ρ_Ỹ(Ỹ, Ỹ′)] + E[ρ_X̃(X̃, X̃′)] E[ρ_Ỹ(Ỹ, Ỹ′)] − 2E[E_X̃′[ρ_X̃(X̃, X̃′)] E_Ỹ′[ρ_Ỹ(Ỹ, Ỹ′)]].

As noted by [22], the generalized distance covariance can be alternatively written as

θ(X̃, Ỹ) = ∫ ρ_X̃(x̃, x̃′) ρ_Ỹ(ỹ, ỹ′) d[(P_X̃Ỹ − P_X̃ P_Ỹ) × (P_X̃Ỹ − P_X̃ P_Ỹ)].

Note that θ(X̃, Ỹ) = 0 if and only if X̃ and Ỹ are independent. Given an iid sample {(X̃_i, Ỹ_i)}_{i=1}^n of (X̃, Ỹ), an empirical estimator of θ is given by

θ_n(X̃, Ỹ) = (1/n²) Σ_{i,j} k_{ij} ℓ_{ij} + (1/n⁴) Σ_{i,j,q,τ} k_{ij} ℓ_{qτ} − (2/n³) Σ_{i,j,q} k_{ij} ℓ_{iq},

with k_{ij} = ρ_X̃(X̃_i, X̃_j) and ℓ_{ij} = ρ_Ỹ(Ỹ_i, Ỹ_j). Taking X̃ = X and Ỹ = ε = Y − ⟨X, β⟩, ρ_Ỹ is the absolute value and ρ_X̃ is the distance associated to H_X. The test statistic is T_n = θ_n(ε̂, X) and is based on {(X_i, Y_i − ⟨X_i, β̂⟩)}_{i=1}^n.

All the tests described in this section have challenging limit distributions and need to be calibrated with resampling techniques.


3.2 Functional Response

When both the predictor and the response, X and Y, are functional random variables valued in H_X = L²[a, b] and H_Y = L²[c, d], the regression model Y = m(X) + ε involves an operator m : H_X → H_Y. Perhaps the most popular operator specification is a (linear) Hilbert–Schmidt integral operator, expressible as

m_β(x)(t) = ⟨x, β(·, t)⟩ = ∫_a^b β(s, t) x(s) ds,  t ∈ [c, d],   (7)

for β ∈ H_X ⊗ H_Y, which is simply referred to as the functional linear model with functional response. The kernel β can be represented as β = Σ_{j=1}^∞ Σ_{k=1}^∞ b_{jk} (ψ_j ⊗ φ_k), with {ψ_j}_{j=1}^∞ and {φ_k}_{k=1}^∞ being orthonormal bases of H_X and H_Y, respectively.

Similarly to the case with scalar response, performing inference on (7) has attracted the analogous two mainstream approaches: (i) testing H0 : m ∈ {m_β0} vs. H1 : m ∈ {m_β : β ∈ H_X ⊗ H_Y, β ≠ β_0}, usually with β_0 = 0; (ii) testing H0 : m ∈ L = {m_β : β ∈ H_X ⊗ H_Y} vs. H1 : m ∉ L. The GoF problem given in (ii) can be approached by considering a double-projection mechanism based on π_X : H_X → R and π_Y : H_Y → R. Given an iid sample {(X_i, Y_i)}_{i=1}^n, a general test statistic follows (see [11]) as T_n = ∫ T_{n,π_X,π_Y} dW(π_X × π_Y) with T_{n,π_X,π_Y} = T(I_{n,π_X,π_Y}, I_{β̂,π_X,π_Y}), where I_{n,π_X,π_Y} and I_{β̂,π_X,π_Y} follow from (4) by replacing π with π_X, and Y_i and Ŷ_i with π_Y(Y_i) and π_Y(Ŷ_i), respectively. In this case, W is a probability measure defined in S_{H_X} × S_{H_Y} or S^{p−1}_{H_X,{ψ_j}_{j=1}^∞} × S^{q−1}_{H_Y,{φ_k}_{k=1}^∞}, for certain p, q ≥ 1. The projection approach is immediately adaptable to the GoF of (7) with H_X = R, and allows graphical tools that can help detect deviations from the null; see [12]. An alternative route, considering projections just for X, is presented by [3].

The above generalization to the case of functional response is certainly more difficult for the class of tests based on likelihood ratios. Regarding the smoothing-based tests, [29] introduced a kernel-based significance test consistent against nonlinear alternatives. More recently, [23] proposed a significance test based on distance-correlation ideas.

Acknowledgements The authors acknowledge the support of projects MTM2016-76969-P, PGC2018-097284-B-100, and IJCI-2017-32005 from Spain's Ministry of Economy and Competitiveness. All three grants were partially co-funded by the European Regional Development Fund (ERDF).
The support by Competitive Reference Groups 2017–2020 (ED431C 2017/38) from the Xunta de Galicia through the ERDF is also acknowledged.


References
1. Bickel, P.J., Rosenblatt, M.: On some global measures of the deviations of density function estimates. Ann. Stat. 1(6), 1071–1095 (1973)
2. Bugni, F.A., Hall, P., Horowitz, J.L., Neumann, G.R.: Goodness-of-fit tests for functional data. Economet. J. 12(S1), S1–S18 (2009)
3. Chen, F., Jiang, Q., Feng, Z., Zhu, L.: Model checks for functional linear regression models based on projected empirical processes. Comput. Stat. Data Anal. 144, 106897 (2020)
4. Cuesta-Albertos, J.A., del Barrio, E., Fraiman, R., Matrán, C.: The random projection method in goodness of fit for functional data. Comput. Stat. Data Anal. 51(10), 4814–4831 (2007)
5. Cuesta-Albertos, J.A., Fraiman, R., Ransford, T.: Random projections and goodness-of-fit tests in infinite-dimensional spaces. Bull. Braz. Math. Soc. 37(4), 477–501 (2006)
6. Cuesta-Albertos, J.A., García-Portugués, E., Febrero-Bande, M., González-Manteiga, W.: Goodness-of-fit tests for the functional linear model based on randomly projected empirical processes. Ann. Stat. 47(1), 439–467 (2019)
7. Delsol, L., Ferraty, F., Vieu, P.: Structural test in regression on functional variables. J. Multivar. Anal. 102(3), 422–447 (2011)
8. Ditzhaus, M., Gaigall, D.: A consistent goodness-of-fit test for huge dimensional and functional data. J. Nonparametr. Stat. 30(4), 834–859 (2018)
9. Durbin, J.: Weak convergence of the sample distribution function when parameters are estimated. Ann. Stat. 1(2), 279–290 (1973)
10. Ferraty, F., Vieu, P.: Nonparametric Functional Data Analysis: Theory and Practice. Springer Series in Statistics. Springer, New York (2006)
11. García-Portugués, E., Álvarez-Liébana, J., Álvarez-Pérez, G., González-Manteiga, W.: A goodness-of-fit test for the functional linear model with functional response. Scand. J. Stat. 48(2), 502–528 (2021)
12. García-Portugués, E., Álvarez-Liébana, J., Álvarez-Pérez, G., González-Manteiga, W.: Goodness-of-fit tests for functional linear models based on integrated projections. In: Aneiros, G., Horová, I., Hušková, M., Vieu, P. (eds.) Functional and High-Dimensional Statistics and Related Fields, Contributions to Statistics, pp. 107–114. Springer, Cham (2020)
13. García-Portugués, E., González-Manteiga, W., Febrero-Bande, M.: A goodness-of-fit test for the functional linear model with scalar response. J. Comput. Graph. Stat. 23(3), 761–778 (2014)
14. González-Manteiga, W., Crujeiras, R.M.: An updated review of goodness-of-fit tests for regression models. TEST 22(3), 361–411 (2013)
15. Górecki, T., Horváth, L., Kokoszka, P.: Tests of normality of functional data. Int. Stat. Rev. 88(3), 677–697 (2020)
16. Härdle, W., Mammen, E.: Comparing nonparametric versus parametric regression fits. Ann. Stat. 21(4), 1926–1947 (1993)
17. Horváth, L., Kokoszka, P.: Inference for Functional Data with Applications. Springer Series in Statistics. Springer, New York (2012)
18. Horváth, L., Reeder, R.: A test of significance in functional quadratic regression. Bernoulli 19(5A), 2130–2151 (2013)
19. Jiang, Q., Hušková, M., Meintanis, S.G., Zhu, L.: Asymptotics, finite-sample comparisons and applications for two-sample tests with functional data. J. Multivar. Anal. 170, 202–220 (2019)
20. Kellner, J., Celisse, A.: A one-sample test for normality with kernel methods. Bernoulli 25(3), 1816–1837 (2019)
21. Kolkiewicz, A., Rice, G., Xie, Y.: Projection pursuit based tests of normality with functional data. J. Stat. Plan. Infer. 211, 326–339 (2021)
22. Lai, T., Zhang, Z., Wang, Y.: Testing independence and goodness-of-fit jointly for functional linear models. J. Korean Stat. Soc. 50, 380–402 (2021)
23. Lee, C.E., Zhang, X., Shao, X.: Testing conditional mean independence for functional data. Biometrika 107(2), 331–346 (2020)
24. Lyons, R.: Distance covariance in metric spaces. Ann. Probab. 41(5), 3284–3305 (2013)


25. Maistre, S., Patilea, V.: Testing for the significance of functional covariates. J. Multivar. Anal. 179 (2020)
26. McLean, M.W., Hooker, G., Ruppert, D.: Restricted likelihood ratio tests for linearity in scalar-on-function regression. Stat. Comput. 25(5), 997–1008 (2015)
27. Parzen, E.: On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), 1065–1076 (1962)
28. Patilea, V., Sánchez-Sellero, C.: Testing for lack-of-fit in functional regression models against general alternatives. J. Stat. Plan. Infer. 209, 229–251 (2020)
29. Patilea, V., Sánchez-Sellero, C., Saumard, M.: Testing the predictor effect on a functional response. J. Am. Stat. Assoc. 111(516), 1684–1695 (2016)
30. Qiu, Z., Chen, J., Zhang, J.T.: Two-sample tests for multivariate functional data with applications. Comput. Stat. Data Anal. 157 (2021)
31. Ramsay, J.O., Silverman, B.W.: Functional Data Analysis. Springer Series in Statistics. Springer, New York (2005)
32. Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. Ann. Math. Stat. 27(3), 832–837 (1956)
33. Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K.: Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 41(5), 2263–2291 (2013)
34. Stute, W.: Nonparametric model checks for regression. Ann. Stat. 25(2), 613–641 (1997)
35. Székely, G.J., Rizzo, M.L.: The energy of data. Ann. Rev. Stat. Appl. 4(1), 447–479 (2017)
36. Székely, G.J., Rizzo, M.L., Bakirov, N.K.: Measuring and testing dependence by correlation of distances. Ann. Stat. 35(6), 2769–2794 (2007)
37. Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman & Hall, London (1995)
38. Yasemin-Tekbudak, M., Alfaro-Córdoba, M., Maity, A., Staicu, A.M.: A comparison of testing methods in scalar-on-function regression. AStA Adv. Stat. Anal. 103(3), 411–436 (2019)

An Area-Level Gamma Mixed Model for Small Area Estimation

Tomáš Hobza and Domingo Morales

Abstract This paper introduces an area-level gamma mixed model with log link for small area estimation. A Laplace approximation algorithm is implemented to estimate the model parameters. Empirical best predictors of domain means are derived and their mean squared errors are estimated by parametric bootstrap. An application to data from the Spanish living condition survey of 2008 is given, where the target is the estimation of average annual net incomes by province and sex.

1 Introduction

Small area estimation (SAE) provides statistical methodology to estimate subpopulation indicators when the sample size is insufficient to obtain adequate precision. Introductory books on SAE are [7, 8]. Procedures based on statistical models for aggregated data allow the introduction of relevant information for the construction of new estimators. A general area-level model-based formulation of SAE based on generalized linear mixed models was introduced in [3]. Following their proposals, this section introduces an area-level gamma mixed model with log link for small area estimation.

Let {v_d : d = 1, ..., D} be a set of i.i.d. N(0, 1) random effects. Assume that the distribution of the target variable y_d, conditional on the random effect v_d, is

y_d | v_d ∼ Gamma(m_d, a_d = m_d / μ_d),  d = 1, ..., D,

T. Hobza (B) Czech Technical University in Prague, Prague, Czech Republic e-mail: [email protected] D. Morales Miguel Hernández University of Elche, Elche, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_30


where μ_d > 0, m_d > 0 and m_d is assumed to be known (similarly to σ_d² in the Fay–Herriot model). The corresponding density has, for y_d > 0, the form

f(y_d | v_d) = (a_d^{m_d} / Γ(m_d)) y_d^{m_d − 1} exp{−a_d y_d} = ((m_d/μ_d)^{m_d} / Γ(m_d)) y_d^{m_d − 1} exp{−(m_d/μ_d) y_d},

and the expectation and variance of y_d, given v_d, are

E[y_d | v_d] = m_d / a_d = m_d / (m_d/μ_d) = μ_d,  var[y_d | v_d] = m_d / a_d² = m_d / (m_d/μ_d)² = μ_d² / m_d.

For the mean parameter, we assume the logarithmic link, i.e.

log μ_d = x_d′ β + φ v_d,  μ_d = exp{x_d′ β + φ v_d},  d = 1, ..., D,   (1)

where β = col_{1≤k≤p}(β_k) is a vector of unknown regression parameters and x_d = col_{1≤k≤p}(x_{dk}) is a vector of known explanatory variables. The unknown parameters of model (1) are denoted as θ = (β′, φ)′. Further, we assume that the y_d's are independent conditional on v. The p.d.f. of y = col_{1≤d≤D}(y_d), given v, is f(y | v) = ∏_{d=1}^D f(y_d | v_d), and

f(y) = ∫_{R^D} f(y | v) f_v(v) dv = ∏_{d=1}^D ∫_R f(y_d | v_d) f(v_d) dv_d = ∏_{d=1}^D f(y_d).

The corresponding log-likelihood is ℓ = ℓ(θ) = Σ_{d=1}^D log f(y_d) = Σ_{d=1}^D ℓ_d. Since the involved integrals cannot be evaluated explicitly, some approximation is needed in order to apply the method of maximum likelihood (ML).

Remark 1 In practice, y_d is a direct estimator of a domain total or mean with estimated design-based variance σ_d² = var_π(y_d). By equating var[y_d | v_d] to σ_d² and substituting μ_d by y_d, one gets σ_d² = y_d²/m_d. Therefore, we take m_d = y_d²/σ_d², d = 1, …, D, in the application to real data.

2 The Laplace Approximation Algorithm

This section derives the Laplace approximation of the log-likelihood ℓ(θ). First, the Laplace approximation of the integral ∫_R e^{h(x)} dx is described. Let h : R → R be a twice continuously differentiable function with a global maximum at x₀; that is, assume ḣ(x₀) = 0 and ḧ(x₀) < 0. A Taylor series expansion of h(x) around x₀ yields

An Area-Level Gamma Mixed Model for Small Area Estimation




h(x) = h(x₀) + ḣ(x₀)(x − x₀) + (1/2) ḧ(x₀)(x − x₀)² + o(|x − x₀|²) ≈ h(x₀) + (1/2) ḧ(x₀)(x − x₀)².

The univariate Laplace approximation is

∫_{−∞}^{∞} e^{h(x)} dx ≈ e^{h(x₀)} ∫_{−∞}^{∞} exp{−(1/2)(−ḧ(x₀))(x − x₀)²} dx = (2π)^{1/2} (−ḧ(x₀))^{−1/2} e^{h(x₀)}.   (2)
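As a numerical sanity check of (2), the sketch below applies the univariate Laplace approximation to the exponent h(v_d) of the gamma mixed model (defined in (3) below) and compares it with direct quadrature; the values of m_d, y_d, η = x_d′β and φ are purely illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

# Toy check of the Laplace approximation (2) on the exponent h(v_d) of (3).
# m_d, y_d, eta = x_d'beta and phi below are illustrative values only.
m_d, y_d, eta, phi = 8.0, 1.3, 0.2, 0.1

h  = lambda v: -v**2 / 2 - m_d * (eta + phi * v) - m_d * y_d * np.exp(-(eta + phi * v))
h1 = lambda v: -v - phi * m_d + phi * m_d * y_d * np.exp(-(eta + phi * v))   # h'
h2 = lambda v: -1 - phi**2 * m_d * y_d * np.exp(-(eta + phi * v))            # h''

v0 = brentq(h1, -10, 10)                    # mode: h'(v0) = 0, h''(v0) < 0
exact = quad(lambda v: np.exp(h(v)), -np.inf, np.inf)[0]
laplace = np.sqrt(2 * np.pi) * (-h2(v0)) ** (-0.5) * np.exp(h(v0))
print(exact, laplace)                       # the two values nearly coincide
```

Because h is a Gaussian exponent plus a mild perturbation here, the approximation is accurate to well under one percent; the quality degrades as φ²m_d y_d grows.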

Let us now approximate the log-likelihood of the area-level gamma mixed model. For the marginal p.d.f. of y_d it holds

f(y_d) = ∫_{−∞}^{∞} f(y_d | v_d) f(v_d) dv_d
 = (m_d^{m_d} y_d^{m_d−1} / ((2π)^{1/2} Γ(m_d))) ∫_{−∞}^{∞} exp{−v_d²/2 − m_d(x_d′β + φv_d) − m_d y_d exp{−(x_d′β + φv_d)}} dv_d
 = (m_d^{m_d} y_d^{m_d−1} / ((2π)^{1/2} Γ(m_d))) ∫_{−∞}^{∞} exp{h(v_d)} dv_d,

where (using the notation μ_d = μ_d(v_d) = exp{x_d′β + φv_d})

h(v_d) = −v_d²/2 − m_d(x_d′β + φv_d) − m_d y_d exp{−(x_d′β + φv_d)},   (3)
ḣ(v_d) = −v_d − m_d φ + m_d y_d φ exp{−(x_d′β + φv_d)} = −v_d − φm_d + φm_d y_d μ_d^{−1}(v_d),
ḧ(v_d) = −1 − m_d y_d φ² exp{−(x_d′β + φv_d)} = −(1 + φ² m_d y_d μ_d^{−1}(v_d)).

Let v_{0d} denote the point of maximum of the function h(v_d). By applying (2) at v_d = v_{0d}, we get

f(y_d) ≈ (m_d^{m_d} y_d^{m_d−1} / Γ(m_d)) · (1 + φ² m_d y_d μ_d^{−1}(v_{0d}))^{−1/2}
 · exp{−v_{0d}²/2 − m_d(x_d′β + φv_{0d}) − m_d y_d exp{−(x_d′β + φv_{0d})}}.

The log-likelihood is ℓ = Σ_{d=1}^D ℓ_d and ℓ_d can be approximated as

ℓ_d = log f(y_d) ≈ ℓ_{0d} = log(m_d^{m_d} y_d^{m_d−1} / Γ(m_d)) − (1/2) log ξ_{0d} − v_{0d}²/2 − m_d(x_d′β + φv_{0d}) − m_d y_d exp{−(x_d′β + φv_{0d})},   (4)


where ξ_{0d} = 1 + φ² m_d y_d μ_{0d}^{−1} and μ_{0d} = μ_d(v_{0d}). For r, s = 1, …, p the components of the score vector and the Hessian matrix are

U_{0r} = Σ_{d=1}^D ∂ℓ_{0d}/∂β_r,  U_{0,p+1} = Σ_{d=1}^D ∂ℓ_{0d}/∂φ,  H_{0rs} = H_{0sr} = Σ_{d=1}^D ∂²ℓ_{0d}/∂β_s∂β_r,
H_{0r,p+1} = H_{0,p+1,r} = Σ_{d=1}^D ∂²ℓ_{0d}/∂φ∂β_r,  H_{0,p+1,p+1} = Σ_{d=1}^D ∂²ℓ_{0d}/∂φ².

In matrix form, we have U_0 = U_0(θ) = col_{1≤r≤p+1}(U_{0r}) and H_0 = H_0(θ) = (H_{0rs})_{r,s=1,…,p+1}, where θ = (β′, φ)′. The fitting algorithm works iteratively in two steps. A first Newton–Raphson algorithm maximizes ℓ_0(θ) = Σ_{d=1}^D ℓ_{0d}, with fixed v_d = v_{0d}, d = 1, …, D. The updating equation is

θ^{(k+1)} = θ^{(k)} − H_0^{−1}(θ^{(k)}) U_0(θ^{(k)}).   (5)

A second Newton–Raphson algorithm maximizes h(v_d) = h(v_d, θ), defined in (3), with θ = (β′, φ)′ = θ_0 fixed. The updating equation is

v_d^{(k+1)} = v_d^{(k)} − ḣ(v_d^{(k)}, θ_0) / ḧ(v_d^{(k)}, θ_0).   (6)

Let us note that, at convergence, this algorithm gives not only estimates of the model parameters but also the mode predictors v̂_d of the random area effects.
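The alternating structure of the two steps above can be sketched as follows. For brevity, the closed-form Newton–Raphson updates (5)–(6) are replaced here by generic numerical optimizers, and the data are simulated from model (1) with illustrative parameter values; this is a sketch of the scheme, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

rng = np.random.default_rng(0)

# Simulate from model (1): illustrative sizes and parameters only.
D, p = 60, 2
X = np.column_stack([np.ones(D), rng.normal(size=D)])
beta_true, phi_true = np.array([0.5, 0.3]), 0.2
m_d = np.full(D, 10.0)
v_true = rng.normal(size=D)
mu_true = np.exp(X @ beta_true + phi_true * v_true)
y = rng.gamma(shape=m_d, scale=mu_true / m_d)     # y_d ~ Gamma(m_d, a_d = m_d/mu_d)

def h(vd, d, beta, phi):                          # exponent (3) for area d
    eta = X[d] @ beta + phi * vd
    return -vd**2 / 2 - m_d[d] * eta - m_d[d] * y[d] * np.exp(-eta)

def ell0(theta, v0):                              # Laplace loglik (4), constants dropped
    beta, phi = theta[:p], theta[p]
    eta = X @ beta + phi * v0
    xi = 1 + phi**2 * m_d * y * np.exp(-eta)
    return np.sum(-0.5 * np.log(xi) - v0**2 / 2 - m_d * eta - m_d * y * np.exp(-eta))

theta, v0 = np.zeros(p + 1), np.zeros(D)
for _ in range(10):                               # alternate steps (6) and (5)
    for d in range(D):                            # step 2: mode of h(v_d), theta fixed
        v0[d] = minimize_scalar(lambda vd: -h(vd, d, theta[:p], theta[p]),
                                bounds=(-6, 6), method="bounded").x
    # step 1: maximize ell0 in theta, v0 fixed
    theta = minimize(lambda t: -ell0(t, v0), theta, method="BFGS").x

print(np.round(theta, 2))                         # estimates of (beta_0, beta_1, phi)
```

Since h(v_d) is strictly concave (ḧ < 0 everywhere), the inner bounded search is robust; a production fit would use the analytic updates (5)–(6) instead.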

3 Empirical Best Predictors

In this section we obtain empirical best predictors (EBPs) for the area-level gamma mixed model. The best predictor of the parameter μ_d = μ_d(θ, v_d) is μ̂_d(θ) = E_θ[μ_d | y]. It is an unbiased predictor minimizing the MSE in the class of unbiased predictors. In the assumed model, we have E_θ[μ_d | y] = E_θ[μ_d | y_d] and

E_θ[μ_d | y_d] = ( ∫_R exp{x_d′β + φv_d} f(y_d | v_d) f(v_d) dv_d ) / ( ∫_R f(y_d | v_d) f(v_d) dv_d ) = N_d(y_d, θ) / D_d(y_d, θ),

where N_d = N_d(y_d, θ) and D_d = D_d(y_d, θ) are




N_d = ∫_R exp{x_d′β + φv_d} exp{−m_d(x_d′β + φv_d) − m_d y_d exp{−(x_d′β + φv_d)}} f(v_d) dv_d,
D_d = ∫_R exp{−m_d(x_d′β + φv_d) − m_d y_d exp{−(x_d′β + φv_d)}} f(v_d) dv_d.

The EBP of μ_d is obtained by replacing θ by a suitable estimate θ̂. That is, the EBP of μ_d is μ̂_d^{ebp} = μ̂_d(θ̂), and it can be approximated by the following Monte Carlo algorithm.

1. Obtain the estimate θ̂ = (β̂′, φ̂)′.
2. For s = 1, …, S, generate v_d^{(s)} i.i.d. N(0, 1) and set v_d^{(S+s)} = −v_d^{(s)}.
3. Approximate the theoretical integrals N_d, D_d as

N̂_d = (1/2S) Σ_{s=1}^{2S} exp{x_d′β̂ + φ̂v_d^{(s)}} exp{−m_d(x_d′β̂ + φ̂v_d^{(s)}) − m_d y_d exp{−(x_d′β̂ + φ̂v_d^{(s)})}},
D̂_d = (1/2S) Σ_{s=1}^{2S} exp{−m_d(x_d′β̂ + φ̂v_d^{(s)}) − m_d y_d exp{−(x_d′β̂ + φ̂v_d^{(s)})}}.

4. Calculate μ̂_d(θ̂) = N̂_d / D̂_d.
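Steps 1–4 can be written compactly for a single area as below; θ̂ and the data values are illustrative, and the common normalizing constant of N_d and D_d is omitted since it cancels in the ratio.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo approximation of the EBP mu_hat_d = N_d/D_d for one area,
# following steps 1-4; theta_hat and the data below are illustrative.
x_d = np.array([1.0, 0.4])
beta_hat, phi_hat = np.array([0.5, 0.3]), 0.2
m_d, y_d, S = 10.0, 1.6, 10_000

v = rng.normal(size=S)
v = np.concatenate([v, -v])                        # antithetic pairs v^(S+s) = -v^(s)
eta = x_d @ beta_hat + phi_hat * v
w = np.exp(-m_d * eta - m_d * y_d * np.exp(-eta))  # factor shared by N_d and D_d
mu_ebp = np.mean(np.exp(eta) * w) / np.mean(w)     # N_hat_d / D_hat_d
print(round(mu_ebp, 3))
```

The antithetic pairing in step 2 reduces the Monte Carlo variance at no extra sampling cost.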

3.1 Bootstrap Estimation of the MSE

The mean squared error of the EBP, MSE(μ̂_d^{ebp}) = E(μ̂_d^{ebp} − μ_d)², d = 1, …, D, is a measure of accuracy of the EBP. The following procedure (inspired by [4]) calculates a parametric bootstrap estimator of MSE(μ̂_d^{ebp}).

1. Fit the model to the sample and calculate the estimator θ̂ = (β̂′, φ̂)′.
2. Repeat B times (b = 1, …, B):
   a. Generate v_d^{(b)} ~ N(0, 1) and y_d^{(b)} ~ Gamma(m_d, m_d/μ_d^{(b)}), where μ_d^{(b)} = exp{x_d′β̂ + φ̂v_d^{(b)}}, d = 1, …, D.
   b. For each bootstrap sample, calculate the estimator θ̂^{(b)} and the EBP μ̂_d^{(b)} = μ̂_d(θ̂^{(b)}).
3. Output: mse(μ̂_d^{ebp}) = (1/B) Σ_{b=1}^B (μ̂_d^{(b)} − μ_d^{(b)})², d = 1, …, D.
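A minimal sketch of steps 1–3 follows. To keep it short and fast, the refit in step 2b is a deliberately crude log-scale least-squares estimate of β with φ held fixed; a real application would rerun the full Laplace/ML fit on each bootstrap replicate. All sizes and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

D, B, S = 40, 200, 500
X = np.column_stack([np.ones(D), rng.normal(size=D)])
m_d = np.full(D, 10.0)
beta_hat, phi_hat = np.array([0.5, 0.3]), 0.2      # step 1 (assumed already fitted)

def ebp(y, beta, phi):                             # Monte Carlo EBP, all areas at once
    v = rng.normal(size=S)
    v = np.concatenate([v, -v])
    eta = (X @ beta)[:, None] + phi * v[None, :]
    w = np.exp(-m_d[:, None] * eta - (m_d * y)[:, None] * np.exp(-eta))
    return (np.exp(eta) * w).mean(axis=1) / w.mean(axis=1)

sq_err = np.zeros(D)
for b in range(B):                                 # step 2
    v_b = rng.normal(size=D)
    mu_b = np.exp(X @ beta_hat + phi_hat * v_b)    # bootstrap "true" mu_d^(b)
    y_b = rng.gamma(shape=m_d, scale=mu_b / m_d)   # y_d^(b) ~ Gamma(m_d, m_d/mu_d^(b))
    beta_b = np.linalg.lstsq(X, np.log(y_b), rcond=None)[0]   # crude stand-in refit
    sq_err += (ebp(y_b, beta_b, phi_hat) - mu_b) ** 2

mse = sq_err / B                                   # step 3
print(np.round(mse[:5], 4))
```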


4 Application to Real Data

This section presents an application of model (1) to the Spanish Living Conditions Survey (SLCS) of 2008. The data are described and fitted to variants of Fay–Herriot models in [2] and [5, 6]. This section deals with the estimation of average annual net incomes by the 52 Spanish provinces and sex. Direct estimators of average incomes and their corresponding variances are

y_d = (1/N̂_d) Σ_{j=1}^{n_d} w_dj y_dj,  N̂_d = Σ_{j=1}^{n_d} w_dj,   (7)

V(y_d) = (1/N̂_d²) Σ_{j=1}^{n_d} w_dj (w_dj − 1)(y_dj − y_d)²,  d = 1, …, D,   (8)

where w_dj and y_dj are the calibrated sampling weight and the annual net personal income of the jth individual in area d, respectively, and n_d denotes the sample size in area d. Formula (8) is obtained from [9] under the assumptions that the sampling weights are the inverses of the first order inclusion probabilities, w_dj = 1/π_dj, and that the equalities π_d,ii = π_di and π_d,ij = π_di π_dj, if i ≠ j, hold for the second order inclusion probabilities. We compute direct estimates of the average incomes, given in units of 10000 Euros, by sex and province, i.e. we get estimates for 2 · 52 = 104 domains. Auxiliary data are taken from the Spanish Labour Force Survey (SLFS) of 2008 and represent proportions of various socio-demographic characteristics, such as sex, education and working status, by domain.

First, we fit the Gamma model (1) to the data. The shape parameters m_d, d = 1, …, D, are calculated as described in Remark 1. Further, we also consider the Fay–Herriot model

y_d = μ_d + e_d = x_d′β + φv_d + e_d,  d = 1, …, D,

(9)

where v_d ~ N(0, 1) i.i.d. and e_d ~ N(0, σ_d²) are independent for d = 1, …, D. Therefore, the direct estimates are treated as normally distributed under this model. The variances σ_d², d = 1, …, D, are taken as the design-based variances V(y_d) of the direct estimates y_d (cf. (8)). Under the Fay–Herriot model, the EBLUPs of μ_d and v_d, d = 1, …, D, are

μ̂_d^{eblup}(θ̂) = x_d′β̂ + φ̂v̂_d,  v̂_d = (φ̂/(φ̂² + σ_d²))(y_d − x_d′β̂),  d = 1, …, D.

(10)
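The EBLUP formula (10) is easy to compute directly; the sketch below uses simulated data and illustrative values for θ̂ (in the application, `eblupFH` from the R package `sae` is used instead).

```python
import numpy as np

rng = np.random.default_rng(3)

# EBLUP (10) under the Fay-Herriot model (9); theta_hat is treated as
# already estimated, and all data/parameter values are illustrative.
D = 50
X = np.column_stack([np.ones(D), rng.normal(size=D)])
sigma2_d = rng.uniform(0.005, 0.05, size=D)       # design-based variances
beta_hat, phi_hat = np.array([2.4, -0.5]), 0.119

v_true = rng.normal(size=D)
e = rng.normal(scale=np.sqrt(sigma2_d))
y = X @ beta_hat + phi_hat * v_true + e           # direct estimates

v_hat = phi_hat / (phi_hat**2 + sigma2_d) * (y - X @ beta_hat)
mu_eblup = X @ beta_hat + phi_hat * v_hat
print(np.round(mu_eblup[:5], 3))
```

Note that φ̂²/(φ̂² + σ_d²) < 1, so the EBLUP shrinks each direct estimate toward the synthetic regression prediction, more strongly where σ_d² is large.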

Next, we compare the fits of both models. We consider the following area-level auxiliary variables: the proportion of unemployed individuals (β1 ), of individuals


aged from 16 to 24 (β2), of individuals aged 65 and older (β3) and of individuals with university education (β4) per area. These variables are denoted as x_d1, …, x_d4, respectively. Table 1 presents a summary of both models. Let us note that the estimates of the parameter φ are φ̂ = 0.082 in the Gamma model and φ̂ = 0.119 in the Fay–Herriot model. Table 1 shows that all employed regressors are significant at the 5% level for both models. Further, growing proportions of unemployed individuals, of individuals aged from 16 to 24 and of individuals aged 65 and older have a negative effect on the average personal income in small areas. On the contrary, university education has a positive effect. The residuals for both models are computed as r_d = y_d − μ̂_d, d = 1, …, D,

(11)

where μ̂_d = exp(x_d′β̂ + φ̂v̂_d) is the plug-in predictor for the Gamma model and μ̂_d = x_d′β̂ + φ̂v̂_d is the EBLUP for the Fay–Herriot model. Concerning the Fay–Herriot model, we employed the function eblupFH from the package sae implemented in the R software for statistical computing. The residual sum of squares is 0.331 for the Gamma model and 0.262 for the Fay–Herriot model. Figure 1 plots dispersion graphs of residuals for both models. Figure 2 gives the Q-Q plots of the predicted random effects, v̂_1, …, v̂_D, for both models. They do not present a visible deviation from normality. Table 2 presents the corresponding p-values of three normality tests. All of them are greater than 0.05.

Table 1 Coefficients of the Gamma model (1) (left) and the Fay–Herriot model (9) (right)

        Gamma: Estimate  p-value     FH: Estimate  p-value
β̂0       0.99    1.31e-07     2.44    9.18e-23
β̂1      −2.27    6.60e-04    −3.71    2.16e-05
β̂2      −3.79    5.31e-03    −6.12    6.11e-04
β̂3      −2.09    1.87e-14    −2.78    1.27e-14
β̂4       1.70    1.15e-05     2.20    1.47e-05

Fig. 1 Graphs of residuals (raw residuals vs. domain) for the Gamma model and the Fay–Herriot model

Fig. 2 Q-Q plots of the mode predictions of the random effects, v̂_1, …, v̂_D, for the Gamma model (left) and of the EBLUPs, v̂_1, …, v̂_D, for the Fay–Herriot model (right)

Table 2 p-values of normality tests for predicted random effects under the Gamma model (GM) and the Fay–Herriot model (FH)

Test                 GM     FH
Shapiro-Wilk         0.34   0.39
Anderson-Darling     0.43   0.32
Kolmogorov-Smirnov   0.57   0.55

For estimating μ_d, d = 1, …, D, we compute the EBPs based on the Gamma model. Using the function eblupFH, we also calculate the EBLUPs based on the Fay–Herriot model. Figure 3 shows very similar results. Figure 4 depicts the precision of both predictors, where estimates of MSE(μ̂_d), d = 1, …, D, were calculated by the described bootstrap algorithm with B = 1000 for the Gamma model and by the function mseFH from the sae package for the Fay–Herriot model. The function mseFH does not use the parametric bootstrap for calculating the MSE estimates under the Fay–Herriot model, but employs an approximation of the MSE, see [1]. This figure also plots the variance of the direct estimates y_d, d = 1, …, D. Figure 4 shows that a significant reduction in the MSE is achieved by the introduction of both models, compared to the variance of the direct estimates. If we compute means of the MSE estimates depicted in Fig. 4 over all domains, we get 3.11e−03 and 3.20e−03 for the Gamma model and Fay–Herriot model, respectively. This is slight evidence in favor of the Gamma model.


Fig. 3 GM-EBP and FH-EBLUP predictions of average annual net incomes. The domains are sorted with respect to decreasing variance of the direct estimates


Fig. 4 MSE estimates of the GM-EBPs and FH-EBLUPs compared to the variance of the direct estimates (dir). The domains are sorted with respect to decreasing variance of the direct estimates

Acknowledgements The authors thank the editors of the book “Trends In Mathematical, Information And Data Sciences: A Tribute To Leandro Pardo” for their invitation to submit a contribution. This work was supported by the Spanish grant PGC2018-096840-B-I00, by the Valencian grant PROMETEO/2021/063 and by the European Regional Development Fund-Project “Center of Advanced Applied Sciences” (No. CZ.02.1.01/0.0/0.0/16 019/0000778).


References

1. Datta, G.S., Lahiri, P.: A unified measure of uncertainty of estimated best linear unbiased predictors in small area estimation problems. Stat. Sin. 10, 613–627 (2000)
2. Esteban, M.D., Morales, D., Pérez, A., Santamaría, L.: Small area estimation of poverty proportions under area-level time models. Comput. Stat. Data Anal. 56, 2840–2855 (2012)
3. Faltys, O., Hobza, T., Morales, D.: Small area estimation under area-level generalized linear mixed models. Commun. Stat. Simul. Comput. (2021) (in press). https://doi.org/10.1080/03610918.2020.1836216
4. González-Manteiga, W., Lombardía, M.J., Molina, I., Morales, D., Santamaría, L.: Estimation of the mean squared error of predictors of small area linear parameters under a logistic mixed model. Comput. Stat. Data Anal. 51, 2720–2733 (2007)
5. Marhuenda, Y., Molina, I., Morales, D.: Small area estimation with spatio-temporal Fay-Herriot models. Comput. Stat. Data Anal. 58, 308–325 (2013)
6. Marhuenda, Y., Morales, D., Pardo, M.C.: Information criteria for Fay-Herriot model selection. Comput. Stat. Data Anal. 70, 268–280 (2014)
7. Morales, D., Esteban, M.D., Pérez, A., Hobza, T.: A Course on Small Area Estimation and Mixed Models. Statistics for Social and Behavioral Sciences. Springer, Heidelberg (2021)
8. Rao, J.N.K., Molina, I.: Small Area Estimation. Wiley, New York (2015)
9. Särndal, C.E., Swensson, B., Wretman, J.: Model Assisted Survey Sampling. Springer, New York (1992)

On the Consistence of the Modified Median Estimator for the Logistic Regression Model María Jaenada

Abstract The logistic regression model is very commonly used in real life problems as a classifier for binary data. However, the maximum likelihood estimator for this model is known to lack robustness, and therefore robust estimators must be developed in order to deal with contaminated data. In this paper, we consider the modified median estimator for the logistic regression model and we prove its consistency under general conditions.

1 Introduction

Real life problems often present variables taking only two possible states or values, which requires the use of binary classification procedures. Logistic regression can be used as a classifier with a probabilistic interpretation. Let Y_1, …, Y_n be independent binary response variables drawn from a Bernoulli model, Pr(Y_i = 1) = π_i and Pr(Y_i = 0) = 1 − π_i, i = 1, …, n. The logistic regression model assumes that the parameters of the Bernoulli distributions depending on the observation, π_i, are associated to d + 1 regressors, x_i = (1, x_i1, …, x_id), through a linear predictor x_i^T β, with β = (β_0, …, β_d) ∈ Θ := R^{d+1}, and the logistic function,

π_i = π(x_i^T β) = e^{x_i^T β} / (1 + e^{x_i^T β}),  i = 1, …, n.

M. Jaenada (B) Department of Statistics and OR, Complutense University of Madrid, Plaza Ciencias, 3, 28040 Madrid, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_31


Conversely, the inverse relation linearly links the expectation of each Y_i with the explanatory variables x_i. In the logistic regression model, the M-estimators of the parameter β are defined by

β̂_M = arg min_{β∈R^{d+1}} Σ_{i=1}^n Φ(y_i, π(x_i^T β))

for a suitable loss function Φ. If we consider the loss function

Φ(y_i, π(x_i^T β)) = −y_i log π(x_i^T β) − (1 − y_i) log(1 − π(x_i^T β)),

we obtain the MLE. Nonetheless, the MLE lacks robustness, and several robust losses have been proposed in order to produce robust estimators. Different authors have attempted to derive robust estimates of the parameters in the logistic regression model; see for instance Pregibon [11], Morgenthaler [10], Carroll and Pederson [4], Christmann [5], Bianco and Yohai [1], Croux and Haesbroeck [6] and Bondell [2, 3]. In this line, Hobza et al. [8] introduced the median estimator (Me) of the logistic regression model, obtained as the classical ℓ₁-estimator but smoothing the response vector,

β̂_n^{Med} = arg min_{β∈R^{d+1}} Σ_{i=1}^n |Z_i − m(π(x_i^T β))|,

where the variables Z_i are obtained by adding to the responses independent random variables U_i, uniformly distributed on the interval (0, 1),

Z_i = Y_i + U_i,  1 ≤ i ≤ n,

and m(·) denotes the median function of the distribution of Z = Y + U ~ Be(p) + U(0, 1),

m(p) = F_Z^{−1}(1/2) = inf{z ∈ R : F_Z(z) ≥ 1/2},

with p ∈ [0, 1]. This method was modified in Hobza et al. [8] so as to improve its efficiency. They proposed to replace the set of statistically smoothed data, Z_i = Y_i + U_i, with an expanded set by considering, for k > 1, the matrix of data Z_ij = Y_i + U_ij, 1 ≤ i ≤ n, 1 ≤ j ≤ k, where the variables U_ij are mutually independent, independent of the Y_i, 1 ≤ i ≤ n, and follow a U(0, 1) distribution. Therefore, the k-enhanced median estimator (k-Me) is obtained by


β̂_n^{kMed} = arg min_{β∈R^{d+1}} (1/k) Σ_{i=1}^n Σ_{j=1}^k |Z_ij − m(π(x_i^T β))|.

This estimator was considered for generalized linear models with binary responses in Hobza et al. [9]. Continuing in the same vein, Hobza et al. [7] considered the limit situation, k → ∞, and defined the modified median estimator (MMe),

β̂_n^{MMe} = arg min_{β∈R^{d+1}} Σ_{i=1}^n ∫_0^1 |Y_i + u − m(π(x_i^T β))| du.

Note that this proposal is a deterministic estimator, as it does not depend on any additionally generated random sample. Moreover, the MMe is a M-estimator with loss function  1      |y + u − m(π x T β )|du. Φ y, π x T β = 0

Hobza et al. [7] established the asymptotic distribution of the MMe and considered robust Wald-type testing procedures based on it. In this paper we are studying the consistence properties of the MMe.

2 Consistence of the Estimator We consider the loss function Hn (β) =

n 

  Φ yi , π(x iT β)

(2)

i=1

and the MMe for the logistic regression model defined in (1). The MMe is a minimum of Hn (β) in Rd+1 , and therefore it annuls its first derivative with respect to β, n  ∂ Hn (β)   = Ψ yi , x iT β = 0d+1 (3) ∂β i=1 where   ∂Φ y, x t β (4) Ψ y, x β = ∂β ⎧

 π(x T β) π(x T β) ⎪ ⎨ − 21 1−π(x y + (y − 1) 1−π(x x, for 0 ≤ π(x T β) ≤ 21 T β) T β)

 = . T T ⎪ ⎩ − 1 1−π(xT β) (y − 1) + y 1−π(xT β) x, for 1 ≤ π(x T β) ≤ 1 2 π(x β) 2 π(x β) 

T



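The piecewise expression (4) for Ψ can be verified numerically: below, Ψ is compared against a central finite difference of Φ for a scalar covariate. The closed form m(p) = 1/(2(1−p)) for p ≤ 1/2 and m(p) = 2 − 1/(2p) otherwise is an assumption derived from m(p) = F_Z^{−1}(1/2), not stated explicitly in the chapter.

```python
import numpy as np
from scipy.integrate import quad

def m(p):
    # median of Be(p) + U(0,1), derived by inverting F_Z at 1/2 (assumption)
    return 1 / (2 * (1 - p)) if p <= 0.5 else 2 - 1 / (2 * p)

def Phi(y, beta, x):
    md = m(1 / (1 + np.exp(-x * beta)))
    kink = md - y                                  # |y + u - m| is kinked here
    pts = [kink] if 0 < kink < 1 else None
    return quad(lambda u: abs(y + u - md), 0, 1, points=pts)[0]

def Psi(y, beta, x):                               # expression (4), scalar x
    pi = 1 / (1 + np.exp(-x * beta))
    if pi <= 0.5:
        r = pi / (1 - pi)
        return -0.5 * (r * y + (y - 1) * r**2) * x
    s = (1 - pi) / pi
    return -0.5 * (s * (y - 1) + y * s**2) * x

for y in (0, 1):
    for beta in (-0.8, 0.3, 1.1):
        num = (Phi(y, beta + 1e-4, 1.0) - Phi(y, beta - 1e-4, 1.0)) / 2e-4
        print(y, beta, round(num, 5), round(Psi(y, beta, 1.0), 5))
```

The numeric derivative and the closed form agree to several decimals across both branches of (4).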

For more details, see Theorem 2.3 in Hobza et al. [7]. We study the existence and consistency of a sequence of solutions of the estimating equations (3). To establish the consistency of the estimator we need one additional assumption, besides Assumptions (A)–(C) stated in Hobza et al. [7].

C1 The third order partial derivatives of Φ(y_i, π(x_i^T β)) are bounded for all i = 1, …, n.

Theorem 2.1 Under Assumption C1 and Assumptions (A)–(C) in Hobza et al. [7], the modified median estimator for the logistic regression model, β̂_n^{MMe}, is a consistent estimator of the true parameter, β_0, i.e. for all a > 0,

lim_{n→∞} P(||β̂_n^{MMe} − β_0|| < a) = 1,

where || · || is a norm on Θ.

Proof Let Q_a = Q(β_0, a) be the sphere with centre β_0 and radius a. We prove that, for a sufficiently small a, the objective function H_n(β) has a local minimum inside Q_a or, equivalently, that the system of equations (3) has a solution within Q_a with probability tending to 1. Since any continuous function attains its minimum and maximum on the closed compact ball bounded by Q_a, it suffices to prove the inequality

H_n(β) ≥ H_n(β_0)   (5)

for all β on the surface of Q_a to ensure the existence of a local minimum of H_n(β) in the interior. From the Taylor–Lagrange formula one can write, for all β ∈ Θ,

H_n(β_0) − H_n(β) = − Σ_{j=1}^{d+1} (∂H_n(β)/∂β_j)|_{β_0} (β_j − β_{0,j})
 − (1/2) Σ_{j=1}^{d+1} Σ_{k=1}^{d+1} (∂²H_n(β)/∂β_j ∂β_k)|_{β_0} (β_j − β_{0,j})(β_k − β_{0,k})   (6)
 − (1/6) Σ_{j=1}^{d+1} Σ_{k=1}^{d+1} Σ_{l=1}^{d+1} (∂³H_n(β)/∂β_j ∂β_k ∂β_l)|_{β*} (β_j − β_{0,j})(β_k − β_{0,k})(β_l − β_{0,l}),

(7)

On the Consistence of the Modified Median Estimator for the Logistic …

L1 = L2 =

373

 d+1  1  ∂ Hn (β)   β j − β0, j , −  n j=1 ∂β j β 0

 d+1 d+1   1 1   ∂ 2 Hn (β)   β j − β0, j βk − β0,k and −  2 n j=1 k=1 ∂β j ∂βk β 0

(8)

 d+1 d+1 d+1    1 1    ∂ 3 Hn (β)   β j − β0, j βk − β0,k βl − β0,l . − L3 = 6 n j=1 k=1 l=1 ∂β j ∂βk ∂βl β ∗ Let us consider separately the right-hand of Eq. (7). We first establish the convergence in probability of the first and second order partial derivatives of Hn (β). From (3), it is not difficult to obtain that    T    ∂Φ Y, π(X β) (9) E Ψ Y, π(X T β) = E = 0d+1 . ∂β Now, the weak law of large numbers gives A(n) j

      n ∂Φ Y, π(X T β)  1  ∂Φ yi , π(x iT β)  P  = −−→ E = 0.  −  n→∞ n i=1 ∂β j ∂β j β0 β0

(10)

On the other hand, we define the positive definite matrix 

   ∂Ψ Y, X T β   . Q(β 0 ) = E  ∂β T

(11)

β0

Using Expression (3), Q(β 0 ) is given by ⎧

 T ⎨ E X 1 π 2 (X Tβ 0 ) X X T , for 0 ≤ π(X T β 0 ) ≤ 1 2

2 1−π(X Tβ 0 )  Q(β 0 ) = ⎩ E X 1 (1−π(XT β 0 ))2 X X T , for 1 ≤ π(X T β 0 ) ≤ 1 2 2 π(X β )

(12)

0

and again by the weak law of large numbers, we obtain the convergence       n ∂ 2 Φ Y, π(X T β)  1  ∂ 2 Φ yi , π(x iT β)  P  = −−→ E = Q(β 0 ) j,k .  −  n→∞ n i=1 ∂β j ∂βk ∂β ∂β j k β0 β0 (13) Finally, using Assumption 2 the third order derivative of Φ is bounded for all i = 1, ..., n,   3   ∂ Φ yi , π(x iT β)   ≤ M (i) (yi , x i )  j,k,l   ∂β j ∂βk ∂βl B (n) j,k


for all j, k, l = 1, …, d+1. Denoting m_{j,k,l}^{(i)} = E[M_{j,k,l}^{(i)}(Y_i, X_i)] and choosing m ≥ m_{j,k,l}^{(i)} for all i = 1, …, n and all j, k, l = 1, …, d+1, we have

E[(1/n) Σ_{i=1}^n |∂³Φ(Y_i, π(X_i^T β))/∂β_j ∂β_k ∂β_l|] ≤ m   (14)

for all j, k, l = 1, …, d+1. Now, let ε > 0. By (10), (13) and (14), there exists n_0 ∈ N such that for all n > n_0 we have

P(|A_j^{(n)}| ≥ a²) < ε / ((d+1) + (d+1)² + (d+1)³),   (15)
P(|B_{j,k}^{(n)} − Q(β_0)_{j,k}| ≥ a) < ε / ((d+1) + (d+1)² + (d+1)³),   (16)
P((1/n) Σ_{i=1}^n |∂³Φ(y_i, π(x_i^T β))/∂β_j ∂β_k ∂β_l| ≥ 2m) < ε / ((d+1) + (d+1)² + (d+1)³).   (17)

We denote by S the event involving the following (d+1) + (d+1)² + (d+1)³ inequalities:

|A_j^{(n)}| < a²,  |B_{j,k}^{(n)} − Q(β_0)_{j,k}| < a,  (1/n) Σ_{i=1}^n |∂³Φ(y_i, π(x_i^T β))/∂β_j ∂β_k ∂β_l| < 2m,  j, k, l = 1, …, d+1.

From the above majorations, we get that P(S^c) < ε, and therefore P(S) > 1 − ε. We study the sign of (1/n)H_n(β_0) − (1/n)H_n(β) under the event S by bounding L_1, L_2 and L_3 in (8). Since β ∈ Q_a, we have

|L_1| = |(1/n) Σ_{j=1}^{d+1} (∂H_n(β)/∂β_j)|_{β_0} (β_j − β_{0,j})| ≤ (d+1) · a² · a

and  d+1     1  d+1   1 (n)  (−B j,k ) − (− Q(β 0 )) j,k β j − β0, j βk − β0,k  ≤ (d + 1)2 · a · a 2 . 2 2 j=1 k=1

(19) We consider the negative quadratic form    1  Q(β 0 ) j,k β j − β0, j βk − β0,k . 2 j=1 k=1 d+1 d+1

A=−

On the Consistence of the Modified Median Estimator for the Logistic …

375

Since the matrix of the quadratic form A is symmetric, an orthogonal transformation can reduce this matrix to its diagonal form, and thus A can be expressed as A=

d+1 

λ j ξ 2j ,

j=1

 d+1 2 2 2 where d+1 j=1 ξ j = j=1 (β j − β0, j ) = a . Therefore, choosing λ = max j (λ j ) < 0, we get that d+1  λ j ξ 2j ≤ λa 2 < 0. j=1

Additionally, a study of the sign of the function 21 (d + 1)2 a 3 + λa 2 proves that we can find a0 and c > 0 such that for all a smaller than a0 ,    1  (−B (n) β j − β0, j βk − β0,k j,k ) − (− Q(β 0 )) j,k 2 j=1 k=1 d+1 d+1

L2 =

   1  + −( Q(β 0 )) j,k β j − β0, j βk − β0,k ≤ −ca 2 . 2 j=1 k=1 d+1 d+1

(20)

Finally,

|L_3| = |(1/6)(1/n) Σ_{j=1}^{d+1} Σ_{k=1}^{d+1} Σ_{l=1}^{d+1} (∂³H_n(β)/∂β_j ∂β_k ∂β_l)|_{β*} (β_j − β_{0,j})(β_k − β_{0,k})(β_l − β_{0,l})| ≤ (2m/6) a³ = b a³,   (21)

where b = 2m/6. Putting the previous inequalities together,

(1/n) H_n(β_0) − (1/n) H_n(β) < (d+1)a³ − ca² + ba³,   (22)

and (d+1)a³ − ca² + ba³ < 0 if and only if a < c/(b + d + 1). Therefore, reducing a if necessary, we get that, under the event S,

(1/n) H_n(β_0) − (1/n) H_n(β) < 0 for all β ∈ Q_a.

(23)

Equation (23) implies that (1/n)H_n(β) has a local minimum in the interior of the sphere Q_a, say β̂_n(a); equivalently, β̂_n(a) is a solution of the system of equations (3). Because


β̂_n(a) satisfies the estimating equations (3), it is the MMe. Finally, if we consider a new event C involving all samples satisfying (23), we obtain that P(C) ≥ P(S) ≥ 1 − ε. That is,

lim_{n→∞} P(||β̂_n^{MMe} − β_0|| ≤ a) = 1.  □

Acknowledgements The author would like to express her gratitude to Leandro Pardo for his support and teachings. It is truly an honour to have the opportunity to contribute to this tribute to him. Leandro is an excellent professor, researcher and person, who will always have my entire admiration. This research is partially supported by Grant FPU 19/01824 from Ministerio de Ciencia, Innovación y Universidades (Spain).

References

1. Bianco, A.M., Yohai, V.J.: Robust estimation in the logistic regression model. In: Rieder, H. (ed.) Robust Statistics, Data Analysis and Computer Intensive Methods. Lecture Notes in Statistics, vol. 109, pp. 17–34. Springer, New York (1996)
2. Bondell, H.D.: Minimum distance estimation for the logistic regression model. Biometrika 92, 724–731 (2005)
3. Bondell, H.D.: A characteristic function approach to the biased sampling model, with application to robust logistic regression. J. Stat. Plan. Infer. 138, 742–755 (2005)
4. Carroll, R.J., Pederson, S.: On robustness in the logistic regression model. J. Roy. Stat. Soc. Ser. B 55, 669–706 (1993)
5. Christmann, A.: Least median of weighted squares in logistic regression with large strata. Biometrika 81, 413–417 (1994)
6. Croux, C., Haesbroeck, G.: Implementing the Bianco and Yohai estimator for logistic regression. Comput. Stat. Data Anal. 44, 273–295 (2003)
7. Hobza, T., Martín, N., Pardo, L.: A Wald-type test statistic based on robust modified median estimator in logistic regression models. J. Stat. Comput. Simul. 87, 2309–2333 (2017)
8. Hobza, T., Pardo, L., Vajda, I.: Robust median estimator in logistic regression. J. Stat. Plan. Infer. 138, 3822–3840 (2008)
9. Hobza, T., Pardo, L., Vajda, I.: Robust median estimator for generalized linear models with binary responses. Kybernetika 49, 768–794 (2012)
10. Morgenthaler, S.: Least-absolute-deviations fits for generalized linear models. Biometrika 79, 747–754 (1992)
11. Pregibon, D.: Logistic regression diagnostics. Ann. Stat. 9, 705–724 (1981)

Analyzing the Influence of the Rating Scale for Items in a Questionnaire on Cronbach Coefficient Alpha María Asunción Lubiano, Manuel Montenegro, Sonia Pérez-Fernández, and María Ángeles Gil

Abstract Questionnaires are widely used in many different fields, especially in connection with human rating. Different rating scales are considered in questionnaires to base the response to their items on. The most popular scales of measurement are Likert-type ones. Other well-known rating scales to be involved in the items in a questionnaire are visual analogue, interval-valued, fuzzy linguistic and fuzzy rating scales. This paper aims to compare these five scales by means of a simulation study. The statistical tool for the comparison (actually, for the ranking) of the scales is the Cronbach index of internal consistency or reliability of a construct from a questionnaire. Percentages of advantages of the fuzzy rating scale versus the other ones, as well as values of the Cronbach index for some samples, are obtained and discussed.

1 Usual Imprecise-Valued Rating Scales Involved in the Items of a Questionnaire Questionnaires are often considered to conduct research about attitudes and human behaviour. Measurement of attitudinal and behavioural variables, especially latent variables, is facing a great challenge nowadays. Current computing developments permit advances in the measurement of the richness of human characteristics, so rating scales and tools are improving day by day bringing new and refreshing ideas to the field which attempt to improve the accuracy for making better decisions in the applied context. In this paper, through a vast simulation study, we compare and analyze different scales of measurement by means of one of the most well-known indicators for internal consistency or reliability of the constructs: Cronbach’s alpha [8]. This paper has been written as a tribute to our beloved and admired colleague Professor Leandro Pardo. Thank you, Leandro, for your permanent friendship and support! M. A. Lubiano (B) · M. Montenegro · S. Pérez-Fernández · M. Á. Gil Departamento de Estadística e I.O. y D.M., Universidad de Oviedo, 33007 Oviedo, Spain e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_32


In the context of survey research, a construct is understood to be “the abstract idea, underlying theme, or subject matter that one wishes to measure using survey questions” (see [15]). Some constructs are very simple and can be measured using only one or a few questions for which responses are numerical or dichotomous. Other constructs are more complex and may require a rather large set of questions for which responses cannot be precisely or perfectly measured and expressed. In designing a questionnaire with such complex constructs, most items can be formalized in terms of imprecise-valued random magnitudes, and the rating scales for the responses to such items are usually Likert, visual analogue, interval-valued, fuzzy linguistic-valued or fuzzy-valued. The Likert Scale (LS) [22] format consists of some scores indicating the strength of the agreement with several assertions, the Likert-type items. Sometimes, these numbers are combined with or replaced by semantic expressions in terms of quantity, for instance, adverbs of frequency. Although Likert scales have been adopted by the vast majority of the social science research communities, they present some controversy and debate among social science researchers concerning several issues, among them the nature of the response categories and the uses of the scores. Questionnaires whose items comprise Likert-type items can be easily conducted. Furthermore, the options to respond to each of the questions involve some imprecision, which seems quite coherent in the context of imprecise-valued magnitudes. However, since the choice is made within a list of a few possible, anchored Likert options, individual differences are almost systematically overlooked.
Consequently, the number of applicable techniques to statistically analyze Likert data is quite limited, and they are mostly based either on the frequencies of different ‘values’ or on their position in accordance with either a certain ranking or a posterior numerical encoding, so that relevant statistical information along with the inherent imprecision can be usually lost in the analysis. On the other hand, Visual Analogue Scales (VASs) were mostly considered to overcome the limitations with ordinal discrete Likert-type scales (see [29]). VAS are not so easy-to-use, and questionnaires involving them are usually conducted by filling out either a paper-and-pencil or a computerized form, after a small training explanation showing how to proceed (since problems with subject’s ability to conceptually understand the rating method itself have been reported in the literature, see [31]). VAS has a long tradition in psychological measurement. Respondents to a VAS item/scale, mark their level of agreement to a statement by indicating a position along a continuous line between two end-points, permitting an infinite number of gradations. This analogue/continuous/graphic rating aspect of the scale differentiates from others similar measures as previous mentioned Likert scale (semantic and/or numerically pointed out). VAS properly captures individual differences because the choice is made within a continuum of possible options (actually, a bounded interval). However, the choice of the single point that best represents rater’s score in visual analogue scales is usually neither easy nor natural. To require a full accuracy seems rather unrealistic in connection with intrinsically imprecise variables. This statement is in line with the quote from Popper [25], in accordance with which “... Both, precision and certainty, are false ideals. They are impossible to attain, ..., it is always

Analyzing the Influence of the Rating Scale for Items in a Questionnaire …


undesirable to make an effort to increase precision for its own sake -especially linguistic precision- since this usually leads to lack of clarity, ...: one should never try to be more precise than the problem situation demands”, as usually happens in measuring attitudes. Single-point rating scales, like LSs and VASs, supply valuable information regarding respondents’ opinions/scores on a given question. However, they are limited in capturing the imprecision and uncertainty of respondents’ answers. In the case of LSs, they are also limited in capturing individual differences. As pointed out by Wagner et al. [33], “the capturing of respondents’ uncertainty requires the development of more suitable scales...”. Aiming to capture imprecision/uncertainty in responding to questions related to intrinsically imprecise magnitudes, Themistocleous et al. [30] highlighted that Interval-Valued Scales (IVSs) allow the respondent the “choice of an interval when providing a response by positioning an ellipse on a straight line with polar adjectives on its two ends” (see also Wagner et al. [33]). Consequently, items with interval-valued responses offer richer and more complex information compared with single-point rating scales, and provide researchers with more insights regarding respondents’ perceptions as well as the imprecision/uncertainty of their responses, which is expected to increase the reliability of constructs in the questionnaire. Furthermore, as IVSs do not prefix the intervals to be chosen, respondents are completely free in expressing their answers, so these scales can capture individual differences. In addition to reflecting the inherent imprecision associated with most latent variables and constructs in questionnaires, IVSs are fully suitable either to model magnitudes related to ranges or, more generally, interval-valued symbolic data.
On the other hand, fuzzy scales were introduced to establish a bridge between strongly defined measurements, such as VASs or numerically encoded LSs, and weakly defined measurements used in the behavioral sciences, such as the Likert or the semantic differential. Fuzzy Linguistic Scales (FLSs), associated with the so-called fuzzy linguistic variables, were proposed by Zadeh [35] as a flexible alternative to the numerical encoding of Likert and semantic differential scales. In fact, the numerical encoding does not take into account the essential imprecision accompanying the ‘values’ of most of the variables in attitudinal studies, whereas the fuzzy linguistic encoding (partially) does, although individual differences cannot be grasped either by LSs or by FLSs. Aiming to overcome this last drawback, Hesketh et al. [20, 21] introduced Fuzzy Rating Scales (FRSs) as an extension of the other mentioned scales. To motivate, justify and support introducing FRSs, they argued that “A perennial issue in psychological assessment has been the extent to which differences in psychological test scores are a function of genuine individual differences rather than differences imposed, as it happens with visual analogue scales, or obscured, as it happens with Likert or semantic differential scales, by the constraints of the measurement procedures.” FRSs adapt the semantic differential so that a preferred point on a given interval (with anchored end points), along with latitudes of acceptance on either side, is to be indicated by the respondent. Reference [19] allowed the preferred ‘point’ to be a subinterval within the original given interval. The preferred point/subinterval


M. A. Lubiano et al.

determines the ‘core’ (1-level) of the fuzzy assessment (i.e., the value or interval of values that are fully compatible with the respondent’s rating). When the core is enlarged with the latitudes of acceptance, one gets the ‘support’ (topological closure of the 0-level) of the fuzzy assessment (i.e., the interval of values that are compatible to some extent with the respondent’s rating). The choice of core and support is completely free, so no list of possible responses is prefixed.

2 Comparing Rating Scales Through Cronbach Alpha

The use of scales like IVSs, FLSs and FRSs in connection with questionnaires is relatively new in contrast to that of either discrete or continuous single-point scales, like LSs and VASs, respectively. As noticed by Ellerby et al. [14] and Lubiano et al. [23], the incorporation of these scales calls for applying, and mainly for developing, methodologies for the statistical analysis of interval- and fuzzy-valued responses, as well as for comparing the new scales with the single-point ones. Regarding approaches to the statistical analysis of interval- and fuzzy-valued data, several studies can be found in the literature of the last two decades. Although at present not all the problems and methods to statistically analyze real-valued data have been extended to deal either with interval-valued or with fuzzy-valued data, some interesting ones have been (see, for instance, [2–6, 11–13, 16–18, 23, 24, 26, 28]).

This paper focuses on the comparison between the above-mentioned rating scales. The comparison is based on examining the behavior of the well-known Cronbach coefficient of internal consistency/reliability of constructs in a questionnaire [8] by means of simulations. Along this simulation study, the LS will be identified with the numerical encoding of the Likert-type scale, and the Cronbach α will be given as follows:

Definition 2.1 Given a construct involving K items, and the response of the i-th respondent (i = 1, . . . , n) to the j-th item (j = 1, . . . , K) being denoted by
– x_{ij} if a single-point-valued scale is considered,
– x_{ij} (a compact interval) if an interval-valued scale is considered,
– \tilde{x}_{ij} (a fuzzy number) if a fuzzy-valued scale is considered,
the Cronbach α is given by

\alpha = \frac{K}{K-1}\left(1 - \frac{\sum_{j=1}^{K} s_j^2}{s_{\mathrm{total}}^2}\right),

where s_j^2 is the sample variance of the responses to item j, and s_{\mathrm{total}}^2 is the variance of all the responses to the items involved in the construct, the variances being defined as the Fréchet ones w.r.t. the Euclidean distance in ℝ for the single-point scales, the


Vitale δ₂-metric [32] for the interval-valued scale, and the Diamond–Kloeden ρ₂-metric [10] for the fuzzy-valued scales, that is,

s_j^2 = \begin{cases}
\dfrac{1}{n}\sum_{i=1}^{n} \left(x_{ij} - \overline{x}_j\right)^2 & \text{for LS/VAS},\\[4pt]
\dfrac{1}{2n}\sum_{i=1}^{n} \left[\left(\inf x_{ij} - \inf \overline{x}_j\right)^2 + \left(\sup x_{ij} - \sup \overline{x}_j\right)^2\right] & \text{for IVS},\\[4pt]
\displaystyle\int_{[0,1]} \dfrac{1}{2n}\sum_{i=1}^{n} \left[\left(\inf (\tilde{x}_{ij})_\upsilon - \inf (\overline{\tilde{x}}_j)_\upsilon\right)^2 + \left(\sup (\tilde{x}_{ij})_\upsilon - \sup (\overline{\tilde{x}}_j)_\upsilon\right)^2\right] d\upsilon & \text{for FLS/FRS},
\end{cases}

s_{\mathrm{total}}^2 = \begin{cases}
\dfrac{1}{nK}\sum_{j=1}^{K}\sum_{i=1}^{n} \left(x_{ij} - \overline{x}\right)^2 & \text{for LS/VAS},\\[4pt]
\dfrac{1}{2nK}\sum_{j=1}^{K}\sum_{i=1}^{n} \left[\left(\inf x_{ij} - \inf \overline{x}\right)^2 + \left(\sup x_{ij} - \sup \overline{x}\right)^2\right] & \text{for IVS},\\[4pt]
\displaystyle\int_{[0,1]} \dfrac{1}{2nK}\sum_{j=1}^{K}\sum_{i=1}^{n} \left[\left(\inf (\tilde{x}_{ij})_\upsilon - \inf (\overline{\tilde{x}})_\upsilon\right)^2 + \left(\sup (\tilde{x}_{ij})_\upsilon - \sup (\overline{\tilde{x}})_\upsilon\right)^2\right] d\upsilon & \text{for FLS/FRS},
\end{cases}

where a bar denotes the corresponding sample mean (over item j, or over all responses), and \tilde{x}_\upsilon = \{t \in \mathbb{R} : \tilde{x}(t) \ge \upsilon\} is the υ-level of \tilde{x}.

It should be remarked, in connection with the values of α for fuzzy-valued data, that they are scarcely influenced by the shape chosen for such data (see Lubiano et al. [23]), so assuming this shape to be trapezoidal does not impose a real constraint. In comparing different rating scales through α, general conclusions cannot be drawn, but one can identify majority trends by means of simulations.
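As an illustration of Definition 2.1, the following Python sketch computes the Cronbach α for the interval-valued (IVS) case, with the Fréchet variances taken w.r.t. the Vitale δ₂-metric as above. The function name and the array layout are our own choices for illustration.

```python
import numpy as np

def cronbach_alpha_ivs(inf_x, sup_x):
    """Cronbach alpha (Definition 2.1) for interval-valued responses.

    inf_x, sup_x: (n, K) arrays holding the endpoints of the interval given
    by respondent i to item j.  The variances are the Frechet ones w.r.t.
    the Vitale delta_2 metric: half the sum of the (population) variances
    of the two endpoints, as in the displayed formulas."""
    K = inf_x.shape[1]

    def frechet_var(lo, hi):
        # delta_2-based sample variance of a collection of intervals
        return 0.5 * (np.mean((lo - lo.mean()) ** 2) + np.mean((hi - hi.mean()) ** 2))

    # per-item variances s_j^2 and the variance of all responses s_total^2
    s2_items = sum(frechet_var(inf_x[:, j], sup_x[:, j]) for j in range(K))
    s2_total = frechet_var(inf_x.ravel(), sup_x.ravel())
    return (K / (K - 1)) * (1 - s2_items / s2_total)
```

Degenerate intervals (inf = sup) recover the single-point LS/VAS case; the FLS/FRS case would additionally average such variances over a grid of υ-levels in [0, 1].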

2.1 Simulation of FRS-Based Responses and Suggested Links with Responses to Other Rating Scales

Since there are not yet suitable realistic models for the distribution of the random mechanisms generating fuzzy responses/data, the simulation process is not an immediate and standard one. However, simulation procedures have been introduced in previous papers dealing with the statistical analysis of fuzzy data (see, for instance, [9, 27]). For the purpose of analyzing the reliability of constructs, some relationships between responses to items should be added (see Lubiano et al. [23]). By combining the previous procedures, we will alternatively denote Tra(a, b, c, d) = Tra⟨x1, x2, x3, x4⟩,


Fig. 1 A 4-tuple (x1 , x2 , x3 , x4 ) generated from the simulation process, and the associated trapezoidal fuzzy datum

where x1 = (b + c)/2, x2 = (c − b)/2, x3 = b − a, x4 = d − c (see Fig. 1 for an illustration of the double notation). The simulation process will generate the 4-tuple (x1, x2, x3, x4) in accordance with some guidelines to be explained now. To each generated 4-tuple (x1, x2, x3, x4) we associate the trapezoidal fuzzy datum

\mathrm{Tra}\langle x_1, x_2, x_3, x_4\rangle = \mathrm{Tra}(x_1 - x_2 - x_3,\; x_1 - x_2,\; x_1 + x_2,\; x_1 + x_2 + x_4).

Taking inspiration from most of the already known real-life examples, fuzzy data will be generated as follows:

– 5% (or, more generally, 100·ω1 %) of the data are obtained by first simulating a simple random sample of size 4 from a beta β(p, q) distribution, ordering the 4-tuple, and finally computing the values of the x_i. The values of p and q have been taken to be p = q = 1. The values from the beta distribution should be re-scaled and translated to the reference interval [l0, u0] of the considered FRS.
– 35% (or, more generally, 100·ω2 %) of the data are obtained by simulating four random variables X_i = (u0 − l0) · Y_i + l0 with

  Y1 ∼ β(p, q),
  Y2 ∼ Uniform[0, min{1/10, Y1, 1 − Y1}],
  Y3 ∼ Uniform[0, min{1/5, Y1 − Y2}],
  Y4 ∼ Uniform[0, min{1/5, 1 − Y1 − Y2}].

– 60% (or, more generally, 100·ω3 %) of the data are obtained by simulating four random variables X_i = (u0 − l0) · Y_i + l0 with

  Y1 ∼ β(p, q),
  Y2 ∼ Exp(200) if Y1 ∈ [0.25, 0.75]; Exp(100 + 4 Y1) if Y1 < 0.25; Exp(500 − 4 Y1) otherwise,
  Y3 ∼ γ(4, 100) if Y1 − Y2 ≥ 0.25; γ(4, 100 + 4 Y1) otherwise,
  Y4 ∼ γ(4, 100) if Y1 + Y2 ≥ 0.25; γ(4, 500 − 4 Y1) otherwise.
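A minimal sketch of the three-component generator just described, assuming the reference interval [l0, u0] = [0, 1] (so the rescaling X_i = (u0 − l0)·Y_i + l0 is trivial) and a rate parameterisation of the Exp and γ distributions; all names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def tra_from_4tuple(x1, x2, x3, x4):
    """Tra<x1, x2, x3, x4> -> Tra(a, b, c, d) = (x1-x2-x3, x1-x2, x1+x2, x1+x2+x4)."""
    return (x1 - x2 - x3, x1 - x2, x1 + x2, x1 + x2 + x4)

def simulate_frs_datum(p=1.0, q=1.0, weights=(0.05, 0.35, 0.60)):
    """One simulated 4-tuple (x1, x2, x3, x4) from the three-component mixture."""
    u = rng.random()
    if u < weights[0]:
        # ordered simple random sample of size 4 from beta(p, q) -> (a, b, c, d)
        a, b, c, d = np.sort(rng.beta(p, q, size=4))
        return ((b + c) / 2, (c - b) / 2, b - a, d - c)
    y1 = rng.beta(p, q)
    if u < weights[0] + weights[1]:
        y2 = rng.uniform(0.0, min(1 / 10, y1, 1 - y1))
        y3 = rng.uniform(0.0, min(1 / 5, y1 - y2))
        y4 = rng.uniform(0.0, min(1 / 5, 1 - y1 - y2))
    else:
        # rate parameterisation assumed for the Exp and gamma distributions
        rate2 = 200 if 0.25 <= y1 <= 0.75 else (100 + 4 * y1 if y1 < 0.25 else 500 - 4 * y1)
        y2 = rng.exponential(1 / rate2)
        y3 = rng.gamma(4, 1 / (100 if y1 - y2 >= 0.25 else 100 + 4 * y1))
        y4 = rng.gamma(4, 1 / (100 if y1 + y2 >= 0.25 else 500 - 4 * y1))
    return (y1, y2, y3, y4)
```

Since x2, x3, x4 are non-negative, every generated datum satisfies a ≤ b ≤ c ≤ d, as a trapezoidal fuzzy number requires.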

To add the possible relationship between responses to items in analyzing reliability, a large sample of n = 500 FRS-type data for each of a large number of items, K = 100, is simulated in accordance with the generation procedure described above. This process provides an ‘auxiliary sample’ from which we will later select data for other choices of n and K and transform them to mimic a certain (linear) dependence. To generate the 500 × 100 data we proceed as follows:

S1. A sample of 500 FRS-type data (\tilde{x}_1^*, \ldots, \tilde{x}_{500}^*), the reference interval of the FRS being [l0, u0], is first simulated as the ‘auxiliary sample’.
S2. To mimic the desirable high correlation between the responses of a respondent to the 100 different items, for each item j (j = 1, . . . , 100):

– a pair (γ_j, δ_j) is considered, where γ_j is generated at random from a uniform distribution on [0, 1] and δ_j from a standard normal distribution;
– the response of the i-th respondent (i = 1, . . . , 500) to the j-th item is assumed to be given by \tilde{x}_{ij} = γ_j · \tilde{x}_i^* + δ_j + ε_{ij}, with ε_{ij} generated at random from a standard normal distribution;
– in case any \tilde{x}_{ij} is not fully included within the interval [l0, u0], the response is appropriately truncated.

Once we get the simulated largest data set, including 500 × 100 fuzzy data, we choose at random and stepwise n = 450 from the former 500 respondents, n = 400 from the previously selected 450 respondents, and so on. Analogously, we choose at random and stepwise K = 50 from the former 100 items, K = 40 from the previously selected 50 items, and so on. To be realistic, in the studies in this paper we constrain K to take values up to 50. Concerning the links between the FRS-based responses and the ones based on the other rating scales, we will consider some reasonable and realistic ones, as follows:

• The numerically encoded r-point Likert scale usually considered is 1, 2, . . . , r, but to compare it with the FRS the values 1 to r should be rescaled in accordance with the reference interval [l0, u0], so that L_i = l0 + (u0 − l0) · (i − 1)/(r − 1), for i ∈ {1, . . . , r}. The link between the FRS and the numerically encoded LS will be the one associated with the minimum ρ₂-distance criterion, i.e., if \tilde{x} = Tra(a, b, c, d) is the available FRS-valued response,

\tilde{x} \leftrightarrow L(\tilde{x}) = \arg\min_{L_i,\, i \in \{1, \ldots, r\}} \left[\, 2 L_i^2 - (a+b+c+d)\, L_i \,\right],

corresponding to the L_i being closest to the ‘central point’ (a + b + c + d)/4.
• In mimicking the connection between FRS and VAS responses for the same respondent, a reasonable link is the one associated with a suitable ‘defuzzification’ process like the one introduced in [34], which for a trapezoidal response \tilde{x} = Tra(a, b, c, d) is such that

\tilde{x} \leftrightarrow \mathrm{VA}(\tilde{x}) = \frac{a+b+c+d}{4}.

[Fig. 2 Frequently used fuzzy linguistic encoding of 5-point Likert scales (membership functions over the reference interval [0, 100])]

• In mimicking the connection between FRS and IVS responses for the same respondent, a possible link is the one associated with the 0.5-level of the fuzzy response, which for a trapezoidal response \tilde{x} = Tra(a, b, c, d) is such that

\tilde{x} \leftrightarrow \mathrm{IV}(\tilde{x}) = (\tilde{x})_{0.5} = \left[\frac{a+b}{2}, \frac{c+d}{2}\right].

• Finally, for the connection between FRS and FLS responses for the same respondent, a possible link is to consider the numerical ‘Likertization’ in the first stated connection and later to consider, for instance, the linguistic hierarchy of r labels (see, for instance, Cordón et al. [7]). Figure 2 graphically displays the one associated with a 5-point Likert scale when [l0, u0] = [0, 100]. At this point, it should be emphasized that the choice of the linguistic fuzzification scarcely affects the value of the Cronbach α.
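The three explicit links above (Likertization, defuzzification and 0.5-level) can be sketched as follows for a trapezoidal response Tra(a, b, c, d); the function names are our own:

```python
import numpy as np

def likert_link(a, b, c, d, r=5, l0=0.0, u0=1.0):
    """Minimum rho_2-distance 'Likertization' of Tra(a, b, c, d): the anchor
    L_i minimising 2*L_i**2 - (a+b+c+d)*L_i, i.e. the anchor closest to the
    central point (a+b+c+d)/4."""
    anchors = l0 + (u0 - l0) * np.arange(r) / (r - 1)
    return anchors[np.argmin(2 * anchors**2 - (a + b + c + d) * anchors)]

def vas_link(a, b, c, d):
    """Defuzzification in the sense of [34]: the central point of the trapezoid."""
    return (a + b + c + d) / 4

def ivs_link(a, b, c, d):
    """0.5-level of the trapezoid: the interval [(a+b)/2, (c+d)/2]."""
    return ((a + b) / 2, (c + d) / 2)
```

Note that the discrete objective 2L² − (a+b+c+d)L equals 2(L − s/4)² − s²/8 with s = a+b+c+d, so minimising it over the anchors is indeed picking the anchor nearest the central point.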

2.2 Comparison of Rating Scales Through Percentages of Greater Values of α

The comparative studies in this paper consider the reference interval to be [l0, u0] = [0, 1] and the involved Likert scale to be the 5-point one in Fig. 2. For different choices of n (number of respondents) and K (number of items), 1000 samples of n × K FRS-based data have been generated, and later linked, by means of the process described in Sect. 2.1. Then, the percentages of samples for which Cronbach’s α of the FRS-based data is greater than that of each of the other rating scales are computed and collected. Table 1 shows a few choices of K up to 50; outputs for larger values are rather similar.


Table 1 Percentages of simulated samples for which Cronbach’s α of the FRS is greater than that of the other rating scales, for different choices of n (number of respondents) and K (number of items)

  n        K    αFRS > αIVS   αFRS > αVAS   αFRS > αLS   αFRS > αFLS
  n = 300  50   100           100           100          100
           30   100           100           100          100
           20   100           100           100          100
           10   100           100           99.9         100
            5   99.4          99.4          95.7         97.9
  n = 100  50   98.2          100           100          100
           30   99.6          100           100          100
           20   100           100           100          100
           10   99.8          99.8          97.8         99.6
            5   98            97.9          87.3         93.4
  n = 50   50   94.6          100           100          100
           30   98.2          100           100          100
           20   98.9          99.8          99.5         100
           10   98.8          99            92.5         97.6
            5   95.2          95.26         80.6         87.8
  n = 30   50   90.1          99.9          99.7         100
           30   94.9          99.7          99.6         100
           20   96.5          99.1          97.8         99.8
           10   96.5          98.2          89.5         95.4
            5   91.2          91.9          75.4         83.1

Consequently, as regards achieving a larger internal consistency/reliability of a construct, the majority trends support the almost general superiority of the FRS over the other rating scales.

2.3 Comparison of Rating Scales Through Values of α

In addition to the advantage of the FRS over the other scales in terms of the percentages of greater reliability, it would be interesting to examine whether such an advantage is also clear in terms of the values of the Cronbach coefficient. Table 2 gathers values of Cronbach’s α of the five rating scales for a sample of n × K FRS-based data generated, and later linked, by means of the process described in Sect. 2.1. In a similar way, a graphical comparison is displayed in Fig. 3 for a sample in which n = 100, corroborating that the ranking with respect to the reliability of constructs is FRS–IVS–VAS–LS–FLS, the difference between FRS and IVS being really a minor one.


Table 2 Values of Cronbach’s α for the rating scales and a sample of size n × K, for different choices of n and K

  n        K    αFRS     αIVS     αVAS     αLS      αFLS
  n = 300  50   0.9187   0.9185   0.9161   0.9083   0.9047
           30   0.8624   0.8620   0.8584   0.8447   0.8387
           20   0.8405   0.8401   0.8361   0.8215   0.8159
           10   0.7533   0.7527   0.7476   0.7305   0.7228
            5   0.6779   0.6773   0.6714   0.6511   0.6385
  n = 100  50   0.9134   0.9129   0.9093   0.9010   0.8975
           30   0.8489   0.8482   0.8424   0.8269   0.8196
           20   0.8204   0.8196   0.8129   0.7951   0.7878
           10   0.7252   0.7239   0.7154   0.7048   0.6962
            5   0.6774   0.6765   0.6697   0.6565   0.642
  n = 50   50   0.9142   0.9135   0.9101   0.9026   0.8999
           30   0.8303   0.8289   0.8214   0.8082   0.8012
           20   0.8037   0.8022   0.7941   0.7805   0.7742
           10   0.6919   0.6895   0.6788   0.6834   0.6753
            5   0.674    0.6725   0.6673   0.6772   0.6596
  n = 30   50   0.9207   0.9205   0.9188   0.9107   0.9080
           30   0.8493   0.8489   0.8455   0.8321   0.8235
           20   0.8131   0.8127   0.8085   0.7959   0.7861
           10   0.7103   0.7096   0.7039   0.7103   0.6969
            5   0.7329   0.7325   0.7301   0.7332   0.7149

[Fig. 3 Evolution of values of Cronbach’s α as K ranges from 5 to 50 for the five rating scales (FRS, IVS, Likert, LingFuzzy, VAS), for a sample with n = 100]


The research in this paper can be complemented with comparisons based on alternative tools or indicators, like those for the validation of questionnaires that are closely connected with divergences, one of the research topics of greatest interest for the colleague honored in this volume (see, for instance, [1]), as well as with other possible links between scales. This complementary analysis will also help in making decisions on the number of items needed to achieve a given reliability/indicator value, on the most convenient scale to choose, and so on.

Acknowledgements This research has been partially supported by the Spanish Ministry of Science and Innovation Grant PID2019-104486GB-I00 and Grant AYUD/2021/50897 from the Principality of Asturias Counseling of Science, Innovation and University. Both of them are gratefully acknowledged.

References

1. Balakrishnan, N.: Methods and Applications of Statistics in the Life and Health Sciences. Wiley, Hoboken, NJ (2009)
2. Billard, L., Diday, E.: Descriptive statistics for interval-valued observations in the presence of rules. Comput. Stat. 21(2), 187–210 (2006)
3. Blanco-Fernández, Á., Casals, M.R., Colubi, A., Corral, N., García-Bárzana, M., Gil, M.Á., González-Rodríguez, G., López, M.T., Lubiano, M.A., Montenegro, M., Ramos-Guajardo, A.B., De la Rosa de Sáa, S., Sinova, B.: A distance-based statistical analysis of fuzzy number-valued data. Int. J. Approx. Reas. 55(7), 1487–1501; Rejoinder. Int. J. Approx. Reas. 55(7), 1601–1605 (2014)
4. Carvalho, F.D.A.T., Lima Neto, E.D.A., da Silva, K.C.F.: A clusterwise nonlinear regression algorithm for interval-valued data. Inf. Sci. 555, 357–385 (2021)
5. Colubi, A., González-Rodríguez, G., Gil, M.Á., Trutschnig, W.: Nonparametric criteria for supervised classification of fuzzy data. Int. J. Approx. Reas. 52, 1272–1282 (2011)
6. Coppi, R., D’Urso, P., Giordani, P.: Fuzzy and possibilistic clustering for fuzzy data. Comput. Stat. Data Anal. 56(4), 915–927 (2012)
7. Cordón, O., Herrera, F., Zwir, I.: Linguistic modeling by hierarchical systems of linguistic rules. IEEE Trans. Fuzzy Syst. 10(1), 2–20 (2002)
8. Cronbach, L.J.: Coefficient alpha and the internal structure of tests. Psychometrika 16(3), 297–334 (1951)
9. De la Rosa de Sáa, S., Gil, M.Á., González-Rodríguez, G., López, M.T., Lubiano, M.A.: Fuzzy rating scale-based questionnaires and their statistical analysis. IEEE Trans. Fuzzy Syst. 23(1), 111–126 (2015)
10. Diamond, P., Kloeden, P.: Metric spaces of fuzzy sets. Fuzzy Sets Syst. 35, 241–249 (1990)
11. D’Urso, P., De Giovanni, L., Massari, R.: Trimmed fuzzy clustering for interval-valued data. Adv. Data Anal. Class. 8(1), 21–40 (2015)
12. D’Urso, P., Gil, M.Á.: Fuzzy data analysis and classification. Special issue in memoriam of Professor Lotfi A. Zadeh, father of fuzzy logic. Adv. Data Anal. Class. 11(4), 645–657 (2017)
13. D’Urso, P., Giordani, P.: A least squares approach to principal component analysis for interval valued data. Chem. Intel. Lab. Syst. 70(2), 179–192 (2004)
14. Ellerby, Z., Wagner, C., Broomell, S.: Capturing richer information–On establishing the validity of an interval-valued survey response mode. Behav. Res. Meth. (in press) (2021). https://doi.org/10.3758/s13428-021-01635-0
15. Lavrakas, P.J. (ed.): Encyclopedia of Survey Research Methods. SAGE Pub. Inc., Thousand Oaks, CA (2008)
16. García-Bárzana, M., Ramos-Guajardo, A.B., Colubi, A., Kontoghiorghes, E.J.: Multiple linear regression models for random intervals: a set arithmetic approach. Comput. Stat. 35(2), 755–773 (2020)
17. Gil, M.Á., González-Rodríguez, G., Colubi, A., Montenegro, M.: Testing linear independence in linear models with interval-valued data. Comput. Stat. Data Anal. 51(6), 3002–3015 (2007)
18. González-Rodríguez, G., Colubi, A., Gil, M.Á.: Fuzzy data treated as functional data. A one-way ANOVA test approach. Comput. Stat. Data Anal. 56(4), 943–955 (2012)
19. Hesketh, T., Hesketh, B.: Computerized fuzzy ratings: the concept of a fuzzy class. Behav. Res. Meth. Inst. Comput. 26(3), 272–277 (1994)
20. Hesketh, B., Pryor, R., Gleitzman, M., Hesketh, T.: Practical applications and psychometric evaluation of a computerised fuzzy graphic rating scale. In: Zétényi, T. (ed.) Fuzzy Sets in Psychology. Adv. Psychol. Ser. 56(C), 425–454. North-Holland/Elsevier, Amsterdam (1988)
21. Hesketh, T., Pryor, R., Hesketh, B.: An application of a computerized fuzzy graphic rating scale to the psychological measurement of individual differences. Int. J. Man-Mach. Stud. 29(1), 21–35 (1988)
22. Likert, R.: A technique for the measurement of attitudes. Arch. Psychol. 22, 140–155 (1932)
23. Lubiano, M.A., García-Izquierdo, A.L., Gil, M.Á.: Fuzzy rating scales: does internal consistency of a measurement scale benefit from coping with imprecision and individual differences in psychological rating? Inf. Sci. 550, 91–108 (2021)
24. Lubiano, M.A., Montenegro, M., Sinova, B., De la Rosa de Sáa, S., Gil, M.Á.: Hypothesis testing for means in connection with fuzzy rating scale-based data: algorithms and applications. Eur. J. Op. Res. 251(3), 918–929 (2016)
25. Popper, K.: Autobiography by Karl Popper. In: The Philosophy of Karl Popper. The Library of Living Philosophers, La Salle, IL (1974)
26. Sinova, B.: M-estimators of location for interval-valued random elements. Chem. Intel. Lab. Syst. 156, 115–127 (2016)
27. Sinova, B., Gil, M.Á., Colubi, A., Van Aelst, S.: The median of a random fuzzy number. The 1-norm distance approach. Fuzzy Sets Syst. 200, 99–115 (2012)
28. Sinova, B., Van Aelst, S., Terán, P.: M-estimators and trimmed means: from Hilbert-valued to fuzzy set-valued data. Adv. Data Anal. Class. 15(2), 267–288 (2021)
29. Sung, Y.T., Wu, J.S.: The Visual Analogue Scale for rating, ranking and paired-comparison (VAS-RRP): a new technique for psychological measurement. Beh. Res. Meth. 50, 1694–1715 (2018)
30. Themistocleous, C., Pagiaslis, A., Smith, A., Wagner, C.: A comparison of scale attributes between interval-valued and semantic differential scales. Int. J. Market Res. 61(1), 394–407 (2019)
31. Visual analog scales. In: Frey, B.B. (ed.) The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation, vol. 4. SAGE Pub. Inc., Thousand Oaks, CA (2018)
32. Vitale, R.A.: Metrics for compact, convex sets. J. Approx. Theor. 45, 280–287 (1985)
33. Wagner, C., Miller, S., Garibaldi, J.M., Anderson, D.T., Havens, T.C.: From interval-valued data to general type-2 fuzzy sets. IEEE Trans. Fuzzy Syst. 23, 248–269 (2015)
34. Yager, R.R.: A procedure for ordering fuzzy subsets of the unit interval. Inf. Sci. 24, 143–161 (1981)
35. Zadeh, L.A.: The concept of a linguistic variable and its application to approximate reasoning. Part 1: Inf. Sci. 8, 199–249; Part 2: Inf. Sci. 8, 301–353; Part 3: Inf. Sci. 9, 43–80 (1975)

Robust LASSO and Its Applications in Healthcare Data

Abhijit Mandal and Samiran Ghosh

Abstract We address the development of a robust variable selection procedure using the density power divergence with the least absolute shrinkage and selection operator (LASSO). It produces robust estimates of the regression parameters and simultaneously selects the important explanatory variables. The asymptotic distribution of the regression coefficients is derived. The widely used model selection procedures based on the classical information criteria often show very poor performance in the presence of heavy-tailed errors or outliers. To address this, we introduce a robust version of Mallows’s C_p statistic based on our proposed method. The simulation studies show that the robust variable selection technique outperforms the classical likelihood-based techniques in the presence of outliers. The performance of the proposed method is explored through a healthcare data analysis.

1 Introduction

We introduce a robust method for modeling and analyzing high-dimensional data in the presence of outliers. Due to advanced technology and a wide range of data collection sources, high-dimensional data are available in several fields, including healthcare, bioinformatics, economics, finance, and sociology. In the initial stage of modeling, a large number of predictors are included to get maximum information from the data. For example, in a genome-wide association study (GWAS), where the goal is to identify genetic risk factors for complex diseases, a million or more single nucleotide polymorphisms (SNPs) are potential predictors for complex diseases and traits. In practice, however, only a few predictors may contain relevant information about the response variables, and unnecessarily including more variables increases

A. Mandal (B), University of Texas at El Paso, El Paso, USA, e-mail: [email protected]
S. Ghosh, Wayne State University, Detroit, USA, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_33


the prediction variability. Thus, variable selection is an important topic in regression analysis when there are too many predictors. It also enhances the predictive power of the model and reduces the chance of over-fitting. The standard statistical methods for variable selection include step-wise variable selection, best subset selection, bridge regression [5], the least absolute shrinkage and selection operator (LASSO) [16], etc. In general, high-dimensional data contain a large number of outliers. For example, in bioinformatics, data are often aggregated from multiple sources, such as different sub-populations, which introduces outliers. Outliers can be very influential and may lead to biased estimation and misleading conclusions. In high-dimensional data, detecting outliers by conducting model diagnostics can be extremely difficult using traditional techniques. Such challenges demand the development of robust methods that can down-weight the effect of large outlying observations. Several robust regression methods in the statistical literature deal with outliers, for example, regression using M-estimators [8], least absolute deviation (LAD) [2] and quantile regression [10]. These classical robust techniques are mainly constructed for a low-dimensional setup, so their performance in variable selection has not been extensively studied. On the other hand, the standard variable selection methods break down in the presence of outliers. In this paper, we introduce a robust variable selection method, combining these two approaches, that can withstand the effect of outliers in modeling high-dimensional noisy data. The rest of the paper is organized as follows. Section 2 gives the background of classical penalized regression analysis. Our proposed method for robust penalized regression is introduced in Sect. 3. The asymptotic distribution is also derived there. In Sect.
4, an information criterion for model selection is proposed as a robust version of Mallows’s C_p statistic. A simulation study and a real data analysis are presented to explore the effectiveness of the proposed method in Sects. 5 and 6, respectively. Some concluding remarks are given in Sect. 7.

2 Background

Suppose the pair (y_i, x_i) denotes the observation from the i-th subject, where y_i ∈ ℝ is the response variable and x_i ∈ ℝ^p is the set of linearly independent predictors, with the first element of x_i being one for the intercept parameter. Consider the following linear regression model:

y_i = x_i^T \beta + \varepsilon_i, \quad i = 1, 2, \ldots, n, \qquad (1)

where β = (β0 , β1 , . . . , β p−1 )T is the regression coefficient, and εi is the random error. We assume that the error term εi ∼ N (0, σ 2 ). So, we have yi ∼ N (x iT β, σ 2 ), i = 1, 2, . . . , n. We define the response vector as y = (y1 , y2 , . . . , yn )T and the design matrix as X = (x 1 , x 2 , . . . , x n )T . Under the classical setup when n > p, the ordinary least squares (OLS) estimate of β is obtained by minimizing the


square error loss function ||y − Xβ||², where ||·|| is the L₂ norm. The solution is \hat{\beta} = (X^T X)^{-1} X^T y, which is also the maximum likelihood estimator (MLE) of β. Suppose that the true model is sparse, with p₀ non-zero coefficients, p₀ ≤ p. In small or moderate sample sizes, when p₀ < p, the OLS estimate often has a low bias but a large variance. On the other hand, shrinking or setting some regression coefficients to zero may improve the prediction accuracy. In this case, we may incorporate a small bias, but obtain a greater reduction in the variance of the predicted values. Thus, it often improves the overall mean square error (MSE). We assume that the design matrix X is standardized so that \sum_i x_{ij}/n = 0 and \sum_i x_{ij}^2/n = 1 for all j = 1, 2, \ldots, p-1, where x_{ij} is the (i, j)-th element of X. Parameter shrinkage in the LASSO is imposed by considering the penalized loss function

L(\beta \mid \lambda) = \frac{1}{2n} \| y - X\beta \|^2 + \lambda \sum_{j=1}^{p-1} |\beta_j|, \qquad (2)

where λ > 0. The regularization parameter λ controls the model complexity and balances the bias and variance of the estimators. More predictors are included as λ → 0⁺, producing smaller bias but higher variance. For λ = 0, we get the OLS estimate. On the other hand, fewer predictors stay in the model as λ increases, and finally only the intercept parameter remains when λ is larger than a threshold, say λ > λ₀. Therefore, with a properly tuned λ, the optimum prediction accuracy is achieved. Other than the LASSO, some widely used penalty functions are the elastic net [20], the adaptive LASSO [19], the minimax concave penalty (MCP) [18], the smoothly clipped absolute deviation (SCAD) [4], etc.
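Although the chapter does not prescribe an algorithm for minimizing Eq. (2), a standard choice is cyclic coordinate descent with soft-thresholding; a minimal sketch (with an unpenalized intercept in the first column, as in the text) could look like this:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - X beta||^2 + lam * sum_j |beta_j|.
    Column 0 of X is the (unpenalized) intercept; the remaining columns are
    assumed standardized as in the text (mean 0, second moment 1)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding the contribution of coordinate j
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j / n
            denom = X[:, j] @ X[:, j] / n
            beta[j] = z / denom if j == 0 else soft_threshold(z, lam) / denom
    return beta
```

Each coordinate update solves the one-dimensional penalized problem exactly, so for λ above the threshold λ₀ mentioned in the text all penalized coefficients are set exactly to zero.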

3 Robust Penalized Regression

The density power divergence (DPD) measure between the model density f_θ, with θ ∈ Θ, and the true density g is defined as

d_\alpha(f_\theta, g) =
\begin{cases}
\displaystyle\int_y \left[ f_\theta^{1+\alpha}(y) - \left(1 + \frac{1}{\alpha}\right) f_\theta^{\alpha}(y)\, g(y) + \frac{1}{\alpha}\, g^{1+\alpha}(y) \right] dy, & \text{for } \alpha > 0,\\[6pt]
\displaystyle\int_y g(y) \log \frac{g(y)}{f_\theta(y)}\, dy, & \text{for } \alpha = 0,
\end{cases} \qquad (3)

where α is a tuning parameter [3]. The case α = 0 is obtained as the limit α → 0⁺, and the resulting measure is the Kullback–Leibler divergence. Given a parametric model, we estimate θ by minimizing the DPD measure with respect to θ over its parameter space Θ. We call the estimator the minimum density power divergence estimator (MDPDE). For α = 0, minimizing the DPD is equivalent to maximizing the log-likelihood function; thus, the MLE is a special case of the MDPDE. The tuning parameter α controls the trade-off between efficiency and robustness of the MDPDE: robustness increases as α increases, but at the same time efficiency decreases.


A. Mandal and S. Ghosh

Let us consider the regression model defined in Eq. (1). The probability density function (pdf) of $y_i$, denoted by $f(y_i|x_i, \beta, \sigma)$ or in short $f_i$, is given by

$$f_i \equiv f(y_i|x_i, \beta, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} (y_i - x_i^T \beta)^2 \right\}, \quad i = 1, 2, \ldots, n. \qquad (4)$$

It is obvious that the classical penalized regression methods do not produce a robust estimator, due to the fact that Eq. (2) uses the square error loss function. We replace it with a modified penalized loss function using the DPD measure as

$$L_\alpha(\beta, \sigma|X, y, \lambda) = \frac{1}{n} \sum_{i=1}^{n} d_\alpha(f_i, g_i) + \lambda \sum_{j=1}^{p-1} |\beta_j|, \qquad (5)$$

where $g_i$ is the true density for a given $x_i$, $i = 1, 2, \ldots, n$. For $\alpha > 0$, the loss function in Eq. (5) is empirically written as

$$L_\alpha(\beta, \sigma|X, y, \lambda) = \frac{1}{n} \sum_{i=1}^{n} V_i^\alpha(\beta, \sigma|x_i, y_i) + \lambda \sum_{j=1}^{p-1} |\beta_j| + c(\alpha), \qquad (6)$$

where $c(\alpha) = \frac{1}{n\alpha} \sum_{i=1}^{n} \int_y g_i^{1+\alpha}(y)\, dy$, the third term of Eq. (3), is free of $\beta$ and $\sigma$, and

$$V_i^\alpha(\beta, \sigma|x_i, y_i) = \frac{1}{(2\pi)^{\alpha/2}\, \sigma^\alpha \sqrt{1+\alpha}} - \frac{1+\alpha}{\alpha}\, f_i^\alpha. \qquad (7)$$

The MDPDEs of $\beta$ and $\sigma$ are obtained by minimizing $L_\alpha(\beta, \sigma|X, y, \lambda)$ over $\beta \in \mathbb{R}^p$ and $\sigma > 0$. If the $i$-th observation is an outlier, the value of $f_i$ is very small compared to the other samples. In that case, the second term of Eq. (7) is negligible, so the corresponding MDPDE for $\alpha > 0$ becomes robust against outliers. On the other hand, when $\alpha = 0$, we have $V_i^\alpha(\beta, \sigma|x_i, y_i) = -\log(f_i)$, and for an outlier this diverges as $f_i \to 0$. Therefore, the MLE breaks down in the presence of outliers, as they dominate the loss function. Relatively few works have investigated robust variable selection methods for high-dimensional data: Wang et al. [17] used least absolute deviations in the LAD LASSO; [9] proposed an estimator based on the $\gamma$-divergence; [7] used the classical cross-validation technique with the density power divergence. In this paper, we use a robust variable selection method obtained by modifying Mallows's $C_p$ statistic [12].
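As a concrete illustration (our own sketch, not the authors' implementation), the loss of Eqs. (6) and (7) for the normal linear model can be evaluated directly; any generic optimizer can then minimize it over $(\beta, \sigma)$:

```python
import numpy as np

def v_alpha(beta, sigma, X, y, alpha):
    """V_i^alpha of Eq. (7) for the normal linear model, vectorized over i."""
    resid = y - X @ beta
    f = np.exp(-0.5 * (resid / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    const = (2 * np.pi) ** (-alpha / 2) * sigma ** (-alpha) / np.sqrt(1 + alpha)
    return const - (1 + alpha) / alpha * f ** alpha

def penalized_dpd_loss(beta, sigma, X, y, alpha, lam):
    """Empirical penalized DPD loss of Eq. (6), up to the constant c(alpha)."""
    return np.mean(v_alpha(beta, sigma, X, y, alpha)) + lam * np.sum(np.abs(beta))
```

A well-fitting sparse $\beta$ yields large densities $f_i$ and hence a smaller loss than a badly perturbed coefficient vector, which is the quantity the penalized MDPDE minimizes.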

3.1 Asymptotic Distribution of the MDPDE

Suppose $\hat{\theta} = (\hat{\beta}^T, \hat{\sigma}^2)^T$ is the penalized MDPDE of $\theta = (\beta^T, \sigma^2)^T$ obtained by minimizing the loss function defined in Eq. (6) for a fixed $\alpha$. Let $\theta_g = (\beta_g^T, \sigma_g^2)^T$ be the true value of $\theta$ that minimizes $d_\alpha(f_\theta, g)$ in Eq. (3) for given $x_i$s. So $\theta_g$ is the true value of the parameter if the model is correctly specified; otherwise it is the parameter of the model that is closest to the data-generating distribution with respect to the DPD measure. We assume that $\beta_g$ is sparse, and the set corresponding to the non-zero elements is given by $A = \{j : 0 \le j \le p,\ \beta_{gj} \ne 0\}$, where $|A| = p_1 \le p$. Let us define $\beta_A$ as the vector obtained from $\beta_g$ by selecting the elements corresponding to the set $A$. The remaining part of $\beta_g$ is denoted by $\beta_{\bar{A}}$; so $\beta_{\bar{A}} = 0$, the $(p - p_1)$-dimensional zero vector. We also partition $\hat{\beta}$ as $\hat{\beta}_A$ and $\hat{\beta}_{\bar{A}}$ based on the set $A$, where $\hat{\beta}_A$ is a $p_1$-dimensional vector. Similarly, $X$ is partitioned as $X_A$ and $X_{\bar{A}}$, where $X_A$ is a matrix of dimension $n \times p_1$. Let us define $\Sigma_A = \lim_{n \to \infty} \frac{1}{n} X_A^T X_A$ and

$$\xi_\alpha = (2\pi)^{-\alpha/2}\, \sigma^{-(\alpha+2)}\, (1+\alpha)^{-3/2} \quad \text{and} \quad \eta_\alpha = \frac{2 + \alpha^2}{4(1+\alpha)^{5/2}}\, (2\pi)^{-\alpha/2}\, \sigma^{-(\alpha+4)}. \qquad (8)$$

To derive the asymptotic distribution of the MDPDE, we require assumptions (A1)–(A7) of [6] and two selected assumptions from [11], as follows:

(C1) The regularization parameter is a function of $n$ such that $\lambda = O(n^{-1/2})$.
(C2) Let $d_n^2 = \max_i x_i^T S_n^{-1} x_i$, where $S_n = X^T X$. For large $n$, there exists a constant $s > 0$ such that $d_n \le s n^{-1/2}$.

Theorem 3.1 The asymptotic distributions of the MDPDEs $\hat{\beta} = (\hat{\beta}_A, \hat{\beta}_{\bar{A}})^T$ and $\hat{\sigma}^2$ have the following properties:

1. Sparsity: $\hat{\beta}_{\bar{A}} = 0$ with probability tending to 1.
2. Asymptotic normality of $\hat{\beta}_A$: $\sqrt{n}(\hat{\beta}_A - \beta_A) \stackrel{a}{\sim} N\!\left(0,\ \dfrac{\xi_{2\alpha}}{\xi_\alpha^2}\, \Sigma_A^{-1}\right)$.
3. Asymptotic normality of $\hat{\sigma}^2$: $\sqrt{n}(\hat{\sigma}^2 - \sigma_g^2) \stackrel{a}{\sim} N(0, \sigma_\alpha^2)$, where $\sigma_\alpha^2 = \dfrac{\eta_{2\alpha} - \frac{\alpha^2}{4}\, \xi_\alpha^2}{\eta_\alpha^2}$.
4. Independence: $\hat{\beta}_A$ and $\hat{\sigma}^2$ are asymptotically independent.

The theorem ensures that, for large sample sizes, our DPD method correctly drops the variables that do not have any significant contribution to the true model; that is, the method selects variables consistently. Moreover, the estimators of the non-zero coefficients ($\hat{\beta}_A$) have the same asymptotic distribution as they would if the zero coefficients ($\beta_{\bar{A}}$) were known in advance.

4 Robust $C_p$ Statistic and Degrees of Freedom

The model selection criterion plays a key role in choosing the best model for high-dimensional data analysis. In a regression setting, it is well known that omitting an important explanatory variable may produce severe bias in parameter estimates and prediction results. On the other hand, including unnecessary predictors may degrade the efficiency of the resulting estimation and yield less accurate predictions. Hence, selecting the best model based on a finite sample is always a problem of interest for both theory and application in this field.

The cross-validation technique is widely used for variable selection. However, if there are outliers in the data, both the estimation and testing processes may be severely affected, so the classical cross-validation technique may not work properly in the presence of outliers. There are several important and widely used selection criteria, e.g. Mallows's $C_p$ statistic, the Akaike information criterion (AIC) [1], the Bayesian information criterion (BIC) [15], etc. These selection criteria are based on the classical MLEs, so they perform very poorly in the presence of heavy-tailed errors and outliers. To overcome this deficiency, we propose a robust alternative to the $C_p$ statistic that selects the best sub-model by choosing the optimum value of the regularization parameter. References [13, 14] modified a few selection criteria using Huber's M-estimator.

Let us consider the notation and setting of the previous section. We define $J_A = (\hat{\beta}_A - \beta_A)^T X_A^T X_A (\hat{\beta}_A - \beta_A)$. Following [12], we consider $\frac{1}{\sigma^2} E[J_A]$ as a measure of prediction adequacy. We denote its robust estimate by $RC_p$, the robust $C_p$ statistic. Let $df = \frac{1}{\sigma^2} E[(y - X_A \beta_A)^T X_A (\hat{\beta}_A - \beta_A)]$ be the degrees of freedom, or the "effective number of parameters," of the regression model. Then, from the asymptotic distribution of $\hat{\beta}_A$, it can be shown that $df = |A|$, the number of non-zero regression coefficients. Let $\mathrm{RSS}_A$ be the residual sum of squares for the sub-model. Then, if the sub-model is true,

$$E(\mathrm{RSS}_A) = E\left[ (y - X_A \hat{\beta}_A)^T (y - X_A \hat{\beta}_A) \right] = n\sigma^2 - 2\sigma^2\, df + E(J_A).$$

Therefore, using estimates of both sides of the above equation, we have

$$RC_p = \frac{n \hat{\sigma}^2}{\hat{\sigma}_u^2} - n + 2|A|,$$

where $\hat{\sigma}$ is the penalized MDPDE of $\sigma$ under the sub-model and $\hat{\sigma}_u$ is a consistent and robust estimator of $\sigma$. $RC_p$ is equivalent to the classical $C_p$ statistic, but the estimators are replaced by suitable penalized MDPDEs.
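Given penalized MDPDE fits across a path of $\lambda$ values, model selection by $RC_p$ reduces to a few lines. The sketch below uses hypothetical helper names of our own; $\hat{\sigma}_u^2$ would come from a separate robust, consistent scale estimate:

```python
import numpy as np

def rc_p(sigma2_sub, sigma2_robust, n, n_active):
    """Robust C_p: RC_p = n * sigma2_sub / sigma2_robust - n + 2|A|."""
    return n * sigma2_sub / sigma2_robust - n + 2 * n_active

def select_by_rcp(candidates, sigma2_robust, n):
    """Pick, from a list of (sigma2_sub, n_active, label) tuples
    (one per lambda on the regularization path), the sub-model
    minimizing RC_p."""
    scores = [rc_p(s2, sigma2_robust, n, k) for s2, k, _ in candidates]
    best = int(np.argmin(scores))
    return candidates[best][2], scores
```

For example, with $n = 100$ and $\hat{\sigma}_u^2 = 1$, a sub-model with $\hat{\sigma}^2 = 1.05$ and $|A| = 3$ scores $RC_p = 11$, beating a denser model with $|A| = 10$ that fits only slightly better.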

5 Simulation Study

We present a small simulation study to demonstrate the advantage of our DPD-based penalized method. A dataset is generated from the regression model (1) with 25 predictors, where 60% of the regression coefficients have no effect on $y$ (i.e. $\beta_j = 0$). We have arbitrarily taken the non-zero regression coefficients from a uniform distribution, and they are then kept fixed throughout the entire simulation. The regressor variables in $X$ are generated from a multivariate normal distribution.

Robust LASSO and Its Applications in Healthcare Data


The value of the standard deviation $\sigma$ of the error term is chosen in such a way that the signal-to-noise ratio (SNR) is 10. We generated samples of sizes $n = 50$ to $n = 200$ and replicated the process 1,000 times for each $n$. We considered the penalized MDPDE with $\alpha = 0.2$; the optimum regularization parameter $\lambda$ is calculated based on the robust $C_p$, and the resulting estimator is denoted by RCp(0.2). Our proposed method is compared with the classical LASSO and LAD LASSO estimators, where the optimum $\lambda$ is selected using cross-validation. Other than the OLS, the performance is also compared with a robust (non-penalized) regression based on Huber's M-estimator [8]. We simulated another $n_t = 1{,}000$ test observations and calculated the mean prediction error $\frac{1}{n_t} \sum_i (\hat{y}_i - x_i^T \beta)^2$ for all estimators. The mean relative prediction errors (RPE) compared to the OLS estimator are plotted in the first panel of Fig. 1. In the second panel, 5% outliers are added to the response variable by shifting the mean of the error distribution to $5\sigma$. Based on the simulations, the following outcomes are worth mentioning:

1. Non-penalized robust estimators like Huber's M-estimator lose a significant amount of efficiency when there are no outliers in the data, but the penalized MDPDE is very efficient even in pure data.
2. There are several robust sparse estimators, like the LAD LASSO, but in general penalized MDPDEs perform better in both pure and contaminated data. The LAD LASSO also has larger variability.
3. The tuning parameter $\alpha$ in the DPD controls the trade-off between efficiency and robustness. A user can fix it at the desired level. Moreover, one can also choose a data-dependent optimum $\alpha$ that minimizes the empirical MSE of the regression coefficients. The LAD LASSO does not have that flexibility; for this reason, its performance is not always good, particularly when the SNR is small.


Fig. 1 The mean RPE compared to the OLS for different estimators on 1,000 test data over 1,000 replications in uncontaminated data (left), and data with 5% outliers at μc = 5σ (right)


4. In pure data, classical penalized estimators like the LASSO give the best results, but they eventually break down in the presence of outliers. As it is difficult to detect outliers in high-dimensional data, one may prefer a penalized MDPDE to be on the safe side.

6 Real Data Analysis

The dataset was obtained from a pilot grant funded by the Blue Cross Blue Shield of Michigan Foundation. The primary goal was to develop a pilot computer system (Intelligent Sepsis Alert) aimed towards increased automated sepsis detection capacity. We analyzed the part of the data that is available to us. It contains 51 variables from 8,975 cases admitted to the Detroit Medical Center (DMC) during 2014–2015. Both demographic and clinical data are available for the first six hours of patients' emergency department stay. The outcome variable ($y$) is the length of hospital stay. A few variables are deleted because they have more than 75% missing values. Some variables containing text, mostly notes from doctors or nurses, are also excluded. We randomly partitioned the dataset into four equal parts, where one sub-sample is used as the test set and the remaining three sub-samples form the training set. We calculated the five estimators used in the simulation study in Sect. 5. For the penalized MDPDE, we considered several values of $\alpha$; however, only the result using $\alpha = 0.5$ is presented, as the optimum $\alpha$ obtained by minimizing the empirical MSE is very large in the full data. The performance of the estimators is compared by the mean absolute prediction error (MAPE) in the test data, where $\mathrm{MAPE} = \frac{1}{n_t} \sum_{i=1}^{n_t} |(y_i - x_i^T \hat{\beta})/y_i|$, $n_t$ being the number of observations in the test set. The process was replicated 100 times using a different random partition of the dataset each time, and the mean MAPE relative to the OLS is reported in Table 1. The results show that the penalized MDPDE reduces the MAPE by around 39% compared to the OLS and the classical LASSO. At the same time, unlike the OLS or Huber estimators, our proposed method considerably reduces the dimension of predictor variables.
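The MAPE criterion is straightforward to compute; a minimal sketch (our own helper, assuming strictly positive lengths of stay so the ratio is well defined):

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute prediction error relative to the observed response:
    (1/n_t) * sum_i |(y_i - y_hat_i) / y_i|."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean(np.abs((y - y_hat) / y))
```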
The increased efficiency of the robust methods clearly indicates that the dataset contains a significant number of outliers, which are not easy to detect in high-dimensional data. For this reason, the multiple $R^2$ from the OLS method is just 11.42%, but the robust version of the multiple $R^2$ increases dramatically to 82.50% using Huber's M-estimator. On the other hand, the LASSO removes 25.71%

Table 1 The mean of MAPE relative to the OLS and the median percentage of dimension reduction over 100 random resamples for different estimators for the sepsis data

Estimators      OLS    Huber   LASSO   LAD LASSO   RCp(0.5)
Relative MAPE   1.00   0.75    1.01    0.65        0.61


of variables in the regression model, revealing that several regressor variables do not contain any significant information for predicting the outcome variable $y$. Our proposed method successfully combines the two desired criteria: the robustness property and a sparse representation of the model. In our future research, we would like to extend the robust penalized regression to the generalized linear model (GLM), so that it can also handle binary outcomes. These methods can also be modified to obtain a suitable imputation method for dealing with missing values.

7 Conclusion

We have developed a robust penalized regression method that can perform regression shrinkage and selection like the LASSO or SCAD, while being resistant to outliers like the LAD or quantile regression. A robust information criterion is introduced by modifying Mallows's $C_p$ to make the variable selection procedure stable against outliers. The simulation studies, as well as the real data example, show improved performance of the proposed method over the classical procedures. Thus, the new procedure is expected to improve prediction power significantly for high-dimensional data, where the presence of outliers is very common.

References

1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Proceedings 2nd International Symposium on Information Theory, pp. 267–281. Akadémiai Kiadó, Budapest (1973)
2. Bassett, G., Jr., Koenker, R.: Asymptotic theory of least absolute error regression. J. Am. Statist. Assoc. 73(363), 618–622 (1978)
3. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimising a density power divergence. Biometrika 85(3), 549–559 (1998)
4. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc. 96(456), 1348–1360 (2001)
5. Frank, L.E., Friedman, J.H.: A statistical view of some chemometrics regression tools. Technometrics 35(2), 109–135 (1993)
6. Ghosh, A., Basu, A.: Robust estimation for independent non-homogeneous observations using density power divergence with applications to linear regression. Electron. J. Statist. 7, 2420–2456 (2013)
7. Ghosh, A., Majumdar, S.: Ultrahigh-dimensional robust and efficient sparse regression using non-concave penalized density power divergence. IEEE Trans. Inform. Theor. 66(12), 7812–7827 (2020)
8. Huber, P.J.: Robust Statistics. Wiley, New York (1981)
9. Kawashima, T., Fujisawa, H.: Robust and sparse regression via γ-divergence. Entropy 19(11), 608 (2017)
10. Koenker, R., Hallock, K.F.: Quantile regression. J. Econ. Perspect. 15(4), 143–156 (2001)
11. Li, G., Peng, H., Zhu, L.: Nonconcave penalized M-estimation with a diverging number of parameters. Statist. Sinica 21(1), 391–419 (2011)
12. Mallows, C.L.: Some comments on C_p. Technometrics 15(4), 661–675 (1973)
13. Ronchetti, E.: Robust model selection in regression. Statist. Probab. Lett. 3(1), 21–23 (1985)
14. Ronchetti, E., Staudte, R.G.: A robust version of Mallows' C_p. J. Am. Statist. Assoc. 89(426), 550–559 (1994)
15. Schwarz, G.: Estimating the dimension of a model. Ann. Statist. 6(2), 461–464 (1978)
16. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58(1), 267–288 (1996)
17. Wang, H., Li, G., Jiang, G.: Robust regression shrinkage and consistent variable selection through the LAD-Lasso. J. Bus. Econ. Stat. 25(3), 347–355 (2007)
18. Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38(2), 894–942 (2010)
19. Zou, H.: The adaptive lasso and its oracle properties. J. Am. Statist. Assoc. 101(476), 1418–1429 (2006)
20. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67(2), 301–320 (2005)

Machine Learning Procedures for Daily Interpolation of Rainfall in Navarre (Spain) Ana F. Militino , María Dolores Ugarte , and Unai Pérez-Goya

Abstract Kriging is by far the most well known and widely used statistical method for interpolating data in spatial random fields. The main reason is that it provides the best linear unbiased predictor, and it is an exact interpolator when normality is assumed. The robustness of this method allows small departures from normality; however, many meteorological, pollutant and environmental variables have extremely asymmetrical distributions, for which Kriging cannot be used. Machine learning techniques such as neural networks, random forest, and k-nearest neighbor can be used instead, because they do not require specific distributional assumptions. The drawback is that they do not take account of the spatial dependence, and for an optimal performance in spatial random fields more complex machine learning techniques could be considered. These techniques also require a relatively large amount of training data and are computationally challenging to implement. For a reduced number of observations, we illustrate the performance of the aforementioned procedures using daily rainfall data from manual meteorological gauge stations in Navarre, where the only auxiliary variables available are the spatial coordinates and the altitude. The quality of the predictions is carefully checked through three versions of the relative root mean squared error (RRMSE). The conclusion is that, when Kriging cannot be used, random forest and neural networks outperform the k-nearest neighbor technique, and provide reliable predictions of daily rainfall data with scarce auxiliary information.

A. F. Militino (B) · M. D. Ugarte · U. Pérez-Goya Department of Statistics, Computer Science and Mathematics and Institute for Advanced Materials (Inamta2 ), Public University of Navarre, Campus de Arrosadia, 31006 Pamplona, Spain e-mail: [email protected] M. D. Ugarte e-mail: [email protected] U. Pérez-Goya e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_34



A. F. Militino et al.

1 Introduction

Spatial interpolation of daily precipitation is a necessary task in hydrology, ecology, climatology and precision agriculture, where it is important to know the accumulated rainfall at any location of a particular region of interest on a given day [18]. Regardless of the specificity and climatological properties of the region of interest, rainfall can follow very dissimilar patterns and different distributions, where, at least locally, only neighbor similarity can be assumed [13]. Historically, spatial interpolation of daily precipitation has been accomplished by weighting and averaging close rain gauge observations with historical information and additional auxiliary variables [27]. The most widespread spatial interpolation procedure is Kriging and its derived family; see for example [17] for a regional modelling of daily precipitation where Kriging outperforms the vector generalized additive model alternative. Very frequently, a log-transformation of rainfall is also necessary for approximating normality when using the Kriging family [30], yet with an additional cost of bias when back-transforming. The use of historical information, the precipitation-elevation relationship and the topographic effects contributes to improving daily predictions, for example when using the angular distance weighting method [41]. However, the strong spatial variation of precipitation, and the sparsity and uneven distribution of the rain gauge stations, can sometimes give disappointing results [4]. Complex relationships with auxiliary variables also make parametric procedures difficult to use, and other alternatives should be checked. Since the advances in hardware and big data, machine learning techniques have grown exponentially.

They are used in a wide variety of fields, so that many scientific journals specialised in applications of environment, remote sensing, meteorology, agriculture, hydrology, or health sciences have important contributions based on machine learning techniques. See for example [22] for an overview of deep learning methods for hyperspectral image classification. Neural networks, k-nearest neighbor and random forest algorithms are very popular among machine learning techniques. The main reason is that they are powerful for making predictions, they are able to manage a high diversity of data, and they do not require the parametric restrictions usually needed in many statistical procedures. The drawback is that they do not incorporate the spatial dependence, and for an optimal performance in spatial random fields more complex machine learning techniques need to be developed. These procedures also require a relatively large amount of training data and are computationally challenging to implement [39]. Recently, an excellent study of deep neural networks and deep hierarchical models applied to the spatio-temporal context has been published [37], but comparisons between classical spatial regressions and random forest are still in demand [10]. Artificial neural networks (ANN) have been used in weather forecasting for more than 50 years [15], and since then many proposals have been made: for example, random forest for predicting daily climate variables in the USA [14], random forest and support vector machines for reproducing monthly rainfall and temperature in Australia [36], and wavelet neural network algorithms for predicting daily meteorological data in


Turkey [29], or random forest for interpolating daily Meteosat Second Generation (MSG) Spinning Enhanced Visible and Infrared Imager (SEVIRI) data in Algeria [28]. Lately, a comparison of machine learning techniques has been made for daily rainfall prediction based on satellite optical rainfall retrievals [26]. A combination of random forest, support vector machines and neural networks with MSG-SEVIRI data has also been used for Algerian data [21]. Finally, random forest is the best proposal for extreme rainfall downscaling in Taiwan [31] and for downscaling satellite-based precipitation in the Lancang-Mekong River basin [40]. It is also the proposed method for subdaily estimation of rain in Spain [7], hail prediction in Poland [6] and mapping daily precipitation in Catalonia [32]. In this paper, we aim to spatially interpolate daily rainfall at 10361 locations on a grid of 1 km² enclosing Navarre (Spain), using observed rainfall data from manual rain gauge stations for which we know only the spatial coordinates and the altitude. For this goal, we are going to compare different machine learning techniques based only on this limited information. Therefore, we propose a simple procedure that does not use historical, ecological or topographical data other than altitude. This type of application is very useful in climatology and precision agriculture, because it is desirable to know the precipitation at any moment at a specific location where there are no gauge stations and perhaps the only available information comes from a Global Positioning System (GPS). The paper is organised as follows. Section 2 presents a short description of the rainfall data used for illustration. Section 3 gives a brief summary of the k-nearest neighbor, random forest and neural network methods used for data prediction. In Sect. 4 we present the results. Further comments can be found in the conclusions.

2 Data Description

Data used in this study come from the records of manual rainfall stations in Navarre, provided by the Agriculture and Environmental Department of the Navarre Government [25]. Automated rain gauge stations are also available, but we decided to use the manual stations because they are more abundant than the automatic ones. Unfortunately, we cannot use the automated stations for checking the results, because the daily recording times of the two datasets are different. We have chosen at random several days with rainfall data. Navarre is a region located in north-central Spain with roughly 10,000 km², and an altitude between 200 and 2,500 m. Valleys and mountains are frequent in the north, but southern Navarre is mainly flat, with one desert called Bardenas Reales, where the climate is continental with hot and dry summers and cold winters. The northwest of Navarre is humid, with an average annual temperature between 11 °C and 14.5 °C, and average rainfall between 1,400 and 2,500 mm. In the northeast the average annual temperature ranges from 7 °C to 12 °C, and rainfall between 900 and 2,200 mm. The central area of Navarre has a temperate climate, with an average rainfall from 450 to 750 mm and average temperatures between 12.5 °C and 14 °C. In the south of the


Table 1 Dates and deciles of the empirical distributions of rainfall on the 10th of September, the 22nd of October and the 6th of November 2019 in Navarre

Date         0%     10%    20%     30%     40%     50%     60%     70%     80%     90%     100%
10-09-2019   0.00   0.00   0.00    1.52    3.92    6.20    7.62    9.00    12.04   14.66   24.00
06-11-2019   2.00   3.20   5.36    7.32    9.00    11.30   12.00   14.98   18.36   24.50   36.50
22-10-2019   5.40   8.16   10.16   12.00   14.48   16.90   18.48   20.50   22.58   24.48   31.50

province, the yearly average of total rainfall is around 500 mm, and the average temperature is 11.5 °C to 13.5 °C. Overall, the climate varies across the province from more humid in the north to drier in the south. We have analysed a set of 50 days, but we selected the 10th of September, the 22nd of October, and the 6th of November 2019 for illustrative purposes, because rain is present on these days and there are no missing observations. Table 1 shows the empirical deciles of the three datasets, and Fig. 1 shows the histograms, boxplots and normal QQ-plots. We can observe, and test, that the data on the three chosen days are not normal, yet the empirical distribution of the rain on the 22nd of October is less asymmetric.

3 Machine Learning Algorithms

In this work, we analyse the performance of some well-known machine learning algorithms: k-nearest neighbor, random forest and neural networks. K-nearest neighbor and random forest are nonparametric machine learning algorithms; random forest is based on decision trees that break the input space into non-overlapping subsets with different parameters. Neural networks appeared in the 1980s as parametrized models, inspired by the architecture of the human brain. Procedures based on neural networks belong to the subset of deep learning techniques that enable computer systems to learn from examples and rules [12]. The training algorithms that fit the weights into an output are different versions of the stochastic gradient descent algorithm. Deep learning algorithms can be supervised or unsupervised. Unsupervised algorithms learn while observing examples of a vector y, estimating its probability distribution p(y). Supervised learning algorithms use examples of data to estimate the conditional probability p(y|x).


Fig. 1 Histograms, boxplots and normal QQ-plots of the 65 manual rainfall gauge stations on the 10th of September, the 22nd of October and the 6th of November 2019 in Navarre, respectively

3.1 K-Nearest Neighbor

The k-nearest neighbor algorithm is a supervised machine learning algorithm for classification and regression, originally developed as a nonparametric discrimination technique. First, an unclassified point was assigned to the class most heavily represented among its k nearest neighbors [9]; later this rule was simplified to assigning the point to the category of its single nearest neighbor [5], whose probability of error is at most twice the Bayes probability of error, the minimum achievable by any decision rule. The algorithm does not make any assumption about the distribution of the data [16]; the main assumption is that new observations behave like their closest neighbors. In regression, it is common to use Euclidean distances, predicting new data from the averages of the nearest values. Frequently, the number of neighbors k is initially chosen as the square root of the number of data points. For daily precipitation studies this algorithm has been used by [1, 24, 35, 38]. In this application we use the method knn of the R library caret [19, 20].
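A bare-bones version of this averaging rule can be written in a few lines (our own Python sketch; the caret implementation additionally handles tuning of k, scaling, and ties):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """k-nearest neighbor regression: predict each new point as the
    average response of its k closest training points (Euclidean distance)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for x in np.asarray(X_new, dtype=float):
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
        nearest = np.argsort(d)[:k]                    # indices of the k closest
        preds.append(y_train[nearest].mean())
    return np.array(preds)
```

In the rainfall setting the feature vectors would be the coordinates and altitude of each gauge station, and the response the observed daily rainfall.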

3.2 Random Forest

Random forest is a classification and regression machine learning method used for supervised learning [3]: it learns from data and makes predictions based on learned patterns. The methodology of random forest is based on predictions made with many decision trees, which are easy to interpret but can be unstable. In a decision tree, decisions are split along branches, each split being made on a feature of the input. After growing the trees, the prediction is generated by combining the individual model predictions. There are two ways of doing this: bagging, or bootstrap aggregation [2], and boosting [33]. In boosting, extra weight is given to the points incorrectly predicted, and the prediction is made with a weighted vote. In bagging, there is no dependence on earlier trees, because each tree is constructed from an independent bootstrap sample; the prediction is based on a simple majority vote [23]. In this application we call randomf the function randomForest from the R library randomForest [23]. Compared to a standard decision tree model, random forest trains each individual tree on a bootstrap resample of the total dataset. Hyperparameters of random forest that require tuning include the number of features to consider when looking for the best split (mtry) and the number of trees (ntree).
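The bagging idea above can be sketched in a few lines of Python (our own illustration with one-split regression stumps as the base learner, not the randomForest implementation, which also samples mtry features at each split):

```python
import numpy as np

def fit_stump(x, y):
    """Fit a one-split regression tree (a stump) on a single feature,
    choosing the split that minimizes the summed squared error of the
    two leaf means."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, lm, rm = best
    return lambda x_new: np.where(x_new <= s, lm, rm)

def bagged_stumps(x, y, x_new, B=50, seed=0):
    """Bootstrap aggregation: average the predictions of B stumps, each
    fit on an independent bootstrap resample of the data."""
    rng = np.random.default_rng(seed)
    preds = np.zeros(len(x_new))
    for _ in range(B):
        idx = rng.integers(0, len(x), len(x))  # sample with replacement
        preds += fit_stump(x[idx], y[idx])(x_new)
    return preds / B
```

Averaging over bootstrap resamples stabilizes the otherwise unstable single-tree predictions, which is the design rationale behind random forest.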

3.3 Neural Networks

A neural network is a collection of artificial neurons, or nodes, that tries to mimic a biological neural network [11]. Basically, it is made of predictors or inputs x_j in an input layer, a set of weights, and a bias which is added to the weighted inputs to provide net inputs. This step is done in the hidden layers a_l, where activation functions smooth the weighted inputs to provide the output layers. There are well-known activation functions; for example, the sigmoid (logistic) function or the hyperbolic tangent function smooth the outputs. Different neurons define the layers. In supervised learning the neurons automatically learn new features from the data. Without the nonlinearity in the hidden layer, the neural network is a generalized linear model, typically estimated by maximum likelihood. Since the 2010s, neural networks have been considered deep learning techniques that enable computer systems to learn from examples and rules [12], thanks to improvements in computing. Training sets and loss functions involving tuning parameters are essential in neural networks. Among the most popular penalties are weight decay, also used in ridge regression, the lasso penalty, and mixtures of them; see [8] for more details. In this application, neural networks are computed with the function nnet of the nnet library [34]. The nnet library fits single-hidden-layer neural networks, possibly with skip-layer connections, using a feed-forward architecture with traditional backpropagation.
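For concreteness, the forward pass of such a single-hidden-layer regression network can be sketched as follows (our own minimal illustration, not the nnet internals; training would then adjust the weights by gradient descent on a squared-error loss):

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Single-hidden-layer network: hidden activations a = sigmoid(W1 x + b1),
    followed by a linear output unit, as in regression networks."""
    a = sigmoid(W1 @ x + b1)  # hidden layer
    return w2 @ a + b2        # linear output
```

Here W1 and b1 are the input-to-hidden weights and biases, and w2 and b2 the hidden-to-output ones; all the names are illustrative.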

3.4 Scaling Data

One of the key points of deep learning techniques is the normalization process that needs to be accomplished before running any algorithm. There are two common ways of scaling data: min-max normalization and standardization. Min-max normalization is used in k nearest neighbors and is defined as

X_nor = (X − min(X)) / (max(X) − min(X)).   (1)

Standardization is used by neural networks and random forest. It is given by

X_std = (X − μ) / σ.   (2)

Standardization can produce positive or negative values, while min-max normalization gives only values between 0 and 1.
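Both scalings in Eqs. (1) and (2) are one-line transformations; a Python sketch with invented sample values:

```python
def min_max(xs):
    """Min-max normalisation, Eq. (1): maps the sample into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Standardisation, Eq. (2): subtract the mean, divide by the standard deviation."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return [(x - mu) / sigma for x in xs]

rain = [0.0, 2.5, 5.0, 10.0]   # illustrative daily rainfall values (mm)
print(min_max(rain))           # all values in [0, 1]
print(standardize(rain))       # mean 0; negative and positive values
```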

4 Data Analysis

The study aims to interpolate rainfall on a grid of 10361 pixels of 1 km² across Navarre with the nnet, knn and randomf techniques. To choose the best one, we provide a simulation study where we define different proportions (0.5, 0.6, 0.7, 0.8, 0.9 and 1) of the training sets drawn from the 65 manual rainfall gauge stations on the 10th of September, the 22nd of October, and the 6th of November 2019. For each proportion of the training set we run every method 50 times, with different random seeds, and calculate the mean squared error (MSE), the root mean squared error (RMSE)

406

A. F. Militino et al.

and the relative root mean squared error (RRMSE) over three target datasets. In each simulation l and each proportion p of the training set we define the mean squared error (MSE) given by

MSE_{lpjk} = \sum_{i=1}^{n_{pk}} (y_{ilpk} − ŷ_{ilpjk})² / n_{pk},   (3)

j = nnet, knn, randomf;  k = k_1, k_2, k_3;  p = 0.5, 0.6, 0.7, 0.8, 0.9, 1;  l = 1, ..., 50,

where j indicates the used procedure and k represents the three different target datasets: k_1 indicates the test set, k_2 the training and the test set jointly, and k_3 the set of 11 deciles from 0 to 100%. Moreover, y_{ilpk} is the ith observed rainfall in the lth simulation with proportion p of the kth target dataset, ŷ_{ilpjk} is the corresponding prediction made with the jth procedure (nnet, knn or randomf), and n_{pk} is the sample size of the proportion p of the kth target dataset. When k = k_3, y_{ilpk} represents the ith decile of the empirical distribution of the lth simulation with proportion p of the training set, and ŷ_{ilpjk} is the ith decile of the estimated distribution in the lth simulation with proportion p on a grid of 10361 locations in Navarre. The root mean squared error (RMSE) is obtained averaging over the 50 simulations,

RMSE_{pjk} = [ \sum_{l=1}^{50} MSE_{lpjk} / 50 ]^{1/2}.

Finally, we compute the relative root mean squared error (RRMSE) dividing by ȳ_p, the observed rainfall mean of the p proportion of the training set, to achieve a fair comparison among different days. It is given by

RRMSE_{pjk} = RMSE_{pjk} / ȳ_p,   (4)

j = nnet, knn, randomf;  k = k_1, k_2, k_3;  p = 0.5, 0.6, 0.7, 0.8, 0.9, 1.
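The three error measures in Eqs. (3) and (4) chain together as sketched below in Python, with invented observations and predictions rather than the chapter's rainfall data.

```python
def mse(obs, pred):
    """Eq. (3): mean squared error of one simulation over one target dataset."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)

def rmse(mse_values):
    """RMSE: average the MSE over the simulations, then take the square root."""
    return (sum(mse_values) / len(mse_values)) ** 0.5

def rrmse(rmse_value, mean_obs):
    """Eq. (4): RMSE divided by the observed rainfall mean, for comparability across days."""
    return rmse_value / mean_obs

obs = [1.0, 2.0, 4.0, 3.0]     # observed rainfall at the target locations
pred = [1.5, 2.0, 3.0, 3.5]    # predictions from one procedure
m = mse(obs, pred)             # a single "simulation" for illustration
r = rmse([m])
print(m, r, rrmse(r, sum(obs) / len(obs)))
```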

When the proportion is equal to 1, all the observed data become the training set, and the test set is the same as the training set. The covariates used in all the methods are the Universal Transverse Mercator (UTM) coordinates and the altitude. Tables 2, 3 and 4 show the average of the RRMSE calculated with 50 simulations for the proportions 0.5, 0.6, 0.7, 0.8, 0.9 and 1 of the training set on the 10th of September, the 22nd of October and the 6th of November 2019, respectively; similar results can be expected for other days. The method randomf outperforms nnet and knn in all the scenarios and days, except for the grid set scenario on the 10th of September and the test set scenario on the 6th of November 2019, where nnet is slightly better.


Table 2 Means of the relative root mean squared error (RRMSE) estimated with 50 simulations on the 10th of September 2019 with the machine learning procedures according to the percentage (p) of the training set. The RRMSE is calculated for the test set, the training and the test set jointly, and the deciles of the empirical distribution on a grid of 10361 locations in Navarre

RRMSE of the test set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.70              0.85                 0.65
  0.60   0.64              0.85                 0.62
  0.70   0.63              0.82                 0.61
  0.80   0.62              0.83                 0.60
  0.90   0.64              0.86                 0.59
  Mean   0.65              0.84                 0.61

RRMSE of the training set and the test set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.26              0.80                 0.28
  0.60   0.28              0.79                 0.26
  0.70   0.30              0.80                 0.26
  0.80   0.30              0.80                 0.26
  0.90   0.30              0.79                 0.24
  1.00   0.30              0.90                 0.24
  Mean   0.29              0.81                 0.26

RRMSE of the grid set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.24              0.94                 0.42
  0.60   0.24              0.95                 0.41
  0.70   0.24              0.94                 0.40
  0.80   0.22              0.94                 0.37
  0.90   0.24              0.94                 0.36
  1.00   0.22              0.94                 0.36
  Mean   0.23              0.94                 0.39

These tables also show an RRMSE decrease when the size of the training set increases, reflecting the learning capacity of these procedures; randomf is better than nnet or knn, and is more stable. The lowest RRMSE values are reached when predicting the training set and the test set jointly with the three techniques, but nnet and randomf perform similarly in the grid target dataset. Overall, randomf and nnet are competitive procedures for spatial prediction in small data sets when we


Table 3 Means of the relative root mean squared error (RRMSE) estimated with 50 simulations on the 22nd of October 2019 with the machine learning procedures according to the percentage (p) of the training set. The RRMSE is calculated for the test set, the training and the test set jointly, and the deciles of the empirical distribution on a grid of 10361 locations in Navarre

RRMSE of the test set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.22              0.26                 0.22
  0.60   0.22              0.24                 0.20
  0.70   0.22              0.24                 0.20
  0.80   0.22              0.26                 0.20
  0.90   0.22              0.26                 0.20
  Mean   0.22              0.25                 0.20

RRMSE of the training set and the test set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.10              0.17                 0.10
  0.60   0.10              0.20                 0.10
  0.70   0.10              0.17                 0.10
  0.80   0.10              0.17                 0.10
  0.90   0.10              0.17                 0.10
  1.00   0.10              0.30                 0.10
  Mean   0.10              0.20                 0.10

RRMSE of the grid set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.20              0.22                 0.10
  0.60   0.22              0.20                 0.10
  0.70   0.22              0.26                 0.10
  0.80   0.22              0.28                 0.10
  0.90   0.22              0.26                 0.10
  1.00   0.22              0.24                 0.10
  Mean   0.22              0.24                 0.10

cannot assume normality. Fig. 2 shows the interpolation made with nnet, knn and randomf for the three selected days. Black dots in the first panel of each row show the locations of the manual rainfall stations. The first, second, and third rows show the maps of the interpolations made on a 1 km² grid of Navarre defined with 10361 locations. The prediction pattern varies among days but is quite similar within days, especially for nnet and randomf. Maxima of the predictions reach the


Table 4 Means of the relative root mean squared error (RRMSE) estimated with 50 simulations on the 6th of November 2019 with the machine learning procedures according to the percentage (p) of the training set. The RRMSE is calculated for the test set, the training and the test set jointly, and the deciles of the empirical distribution on a grid of 10361 locations in Navarre

RRMSE of the test set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.37              0.59                 0.37
  0.60   0.36              0.58                 0.36
  0.70   0.35              0.58                 0.36
  0.80   0.36              0.57                 0.36
  0.90   0.33              0.59                 0.37
  Mean   0.35              0.58                 0.36

RRMSE of the training set and the test set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.20              0.60                 0.17
  0.60   0.22              0.59                 0.17
  0.70   0.22              0.59                 0.17
  0.80   0.22              0.59                 0.17
  0.90   0.22              0.58                 0.17
  1.00   0.22              0.57                 0.17
  Mean   0.22              0.59                 0.17

RRMSE of the grid set
  p      Neural networks   k nearest neighbor   Random forest
  0.50   0.20              0.68                 0.22
  0.60   0.22              0.66                 0.22
  0.70   0.26              0.66                 0.20
  0.80   0.24              0.66                 0.20
  0.90   0.22              0.66                 0.17
  1.00   0.22              0.66                 0.17
  Mean   0.23              0.66                 0.20

maxima of the observed datasets in all days, and overall the predicted distributions also follow the pattern of the observed dataset. The cost of this procedure is that randomf divides the prediction region into rectangular sets, while nnet better follows the climatological geographical pattern.


Fig. 2 Rainfall interpolations obtained with the nnet, knn and random forest methods on the 10th of September, the 22nd of October and the 6th of November 2019 in Navarre. In the first panel, black points correspond to locations of manual rain gauge stations

5 Conclusions

Machine learning techniques offer an opportunity to predict non-normal and asymmetrical distributions in spatial random fields, where kriging is in general not appropriate. Here, we illustrate the k-nearest neighbor, neural network, and random forest techniques to check whether they can provide reliable rainfall predictions in a particular region using small data sets, when auxiliary information is scarce. Assuming


that the observed data are the ground truth, three relative root mean squared errors are defined: one for the test data, a second for the whole data set, and a third for a regular 1 km² grid of 10361 locations in Navarre (Spain). Overall, random forest is the best procedure because it minimizes the RRMSE over the majority of scenarios, but it is closely followed by neural networks, which can occasionally be better. These results agree with some well-known properties in the literature, where random forest is a very competitive procedure when no a priori assumptions about the relationship between the response variable and the auxiliary variables are made, and many auxiliary variables are available. Here, only spatial coordinates and altitude are used as auxiliary variables and the datasets are small, but we still obtain reliable predicted rainfall at non-sampled locations. Improvements of these methods to account for spatial dependence seem complicated in small data sets.

Acknowledgements This research was supported by the Spanish Research Agency (PID2020-113125RB-I00/MCIN/AEI/10.13039/501100011033 project). It has also received funding from la Caixa Foundation (ID 1000010434), Caja Navarra Foundation, and UNED Pamplona, under agreement LCF/PR/PR15/51100007.

References

1. Agilan, A., Umamahesh, N.V.: Rainfall generator for nonstationary extreme rainfall condition. J. Hydrol. Eng. 24(9), 04019027 (2019)
2. Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996)
3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
4. Chen, F., Gao, Y., Wang, Y., Li, X.: A downscaling-merging method for high-resolution daily precipitation estimation. J. Hydrol. 581, 124414 (2020)
5. Cover, T., Hart, P.: Nearest neighbor pattern classification. IEEE Trans. Inf. Theor. 13(1), 21–27 (1967)
6. Czernecki, B., Taszarek, M., Marosz, M., Półrolniczak, M., Kolendowicz, L., Wyszogrodzki, A., Szturc, J.: Application of machine learning to large hail prediction - The importance of radar reflectivity, lightning occurrence and convective parameters derived from ERA5. Atmos. Res. 227, 249–262 (2019)
7. Díez-Sierra, J., del Jesús, M.: Subdaily rainfall estimation through daily rainfall downscaling using random forests in Spain. Water 11(1), 125:w11010125
8. Efron, B., Hastie, T.: Computer Age Statistical Inference. Cambridge University Press, Cambridge (2016)
9. Fix, E., Hodges, J.L.: Discriminatory analysis. Nonparametric discrimination; consistency properties. Tech. Rep. 4, USAF School of Aviation Medicine, Randolph Field, TX (1951)
10. Fox, E.W., Ver Hoef, J.M., Olsen, A.R.: Comparing spatial regression to random forests for large environmental data sets. PLoS ONE 15(3), e0229509 (2020)
11. Friedman, J.H., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York (2001)
12. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
13. Grimes, D.I.F., Pardo-Igúzquiza, E.: Geostatistical analysis of rainfall. Geogr. Anal. 42(2), 136–160 (2010)
14. Hashimoto, H., Wang, W., Melton, F.S., Moreno, A.L., Ganguly, S., Michaelis, A.R., Nemani, R.R.: High-resolution mapping of daily climate variables by aggregating multiple spatial data sets with the random forest algorithm over the conterminous United States. Int. J. Climatol. 39, 2964–2983 (2019)
15. Hu, M.J.-C.: Application of the adaline system to weather forecasting. Doctoral Thesis, Department of Electrical Engineering, Stanford University (1964)
16. Kataria, A., Singh, M.D.: A review of data classification using k-nearest neighbour algorithm. Int. J. Emerg. Tech. Adv. Eng. 3(6), 354–360 (2013)
17. Khedhaouiria, D., Mailhot, A., Favre, A.C.: Regional modeling of daily precipitation fields across the Great Lakes region (Canada) using the CFSR reanalysis. Stoch. Environ. Res. Risk Assess. 34, 1385–1405 (2019)
18. Kilibarda, M., Hengl, T., Heuvelink, G.B.M., Gräler, B., Pebesma, E., Perčec Tadić, M., Bajat, B.: Spatio-temporal interpolation of daily temperatures for global land areas at 1 km resolution. J. Geophys. Res. Atmos. 119, 2294–2313 (2014)
19. Kuhn, M.: The caret package (2019). https://topepo.github.io/caret/index.html
20. Kuhn, M.: Contributions from Wing, J., Weston, S., Williams, A., Keefer, C., Engelhardt, A., Cooper, T., Mayer, Z., Kenkel, B., the R Core Team, Benesty, M., Lescarbeau, R., Ziem, A., Scrucca, L., Tang, Y., Candan, C., Hunt, T.: caret: Classification and Regression Training. R package version 6.0-81 (2018). https://CRAN.R-project.org/package=caret
21. Lazri, M., Ameur, S.: Combination of support vector machine, artificial neural network and random forest for improving the classification of convective and stratiform rain using spectral features of SEVIRI data. Atmos. Res. 203, 118–129 (2018)
22. Li, S., Song, W., Fang, L., Chen, Y., Ghamisi, P., Benediktsson, J.A.: Deep learning for hyperspectral image classification: an overview. IEEE Trans. Geosci. Remote Sens. 57(9), 6690–6709 (2019)
23. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2, 18–19 (2002)
24. Lu, Y., Qin, X.S.: A coupled K nearest neighbor and Bayesian neural network model for daily rainfall downscaling. Int. J. Climatol. 34, 3221–3236 (2014)
25. Meteorology of Navarre Government. http://meteo.navarra.es
26. Meyer, H., Kühnlein, M., Appelhans, T., Nauss, T.: Comparison of four machine learning algorithms for their applicability in satellite-based optical rainfall retrievals. Atmos. Res. 169, Part B, 424–433 (2016)
27. Militino, A.F., Ugarte, M.D., Goicoa, T., Genton, M.: Interpolation of daily rainfall using spatiotemporal models and clustering. Int. J. Climatol. 35(7), 1453–1464 (2015)
28. Ouallouche, F., Lazri, M., Ameur, S.: Improvement of rainfall estimation from MSG data using random forests classification and regression. Atmos. Res. 211, 62–72 (2018)
29. Partal, T., Cigizoglu, H.K., Kahya, E.: Daily precipitation predictions using three different wavelet neural network algorithms by meteorological data. Stoch. Environ. Res. Risk Assess. 29, 1317–1329 (2015)
30. Pellicone, G., Caloiero, T., Modica, G., Guagliardi, I.: Application of several spatial interpolation techniques to monthly rainfall data in the Calabria region (southern Italy). Int. J. Climatol. 38, 3651–3666 (2018)
31. Pham, Q.B., Yang, T.-C., Kuo, C.-M., Tseng, H.-W., Yu, P.-S.: Combing random forest and least square support vector regression for improving extreme rainfall downscaling. Water 11(451), w11030451 (2019)
32. Sekulić, A., Kilibarda, M., Heuvelink, G.B.M., Nikolić, M., Bajat, B.: Random forest spatial interpolation. Remote Sens. 12, 1687:rs12101687 (2020)
33. Schapire, R., Freund, Y., Bartlett, P., Lee, W.: Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Stat. 26(5), 1651–1686 (1998)
34. Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York (2002)
35. Vu, T.M., Mishra, A.K.: Performance of multisite stochastic precipitation models for a tropical monsoon region. Stoch. Environ. Res. Risk Assess. 34, 2159–2177 (2020)
36. Wang, B., Zheng, L., Liu, D.L., Ji, F., Clark, A., Yu, Q.: Using multimodel ensembles of CMIP5 global climate models to reproduce observed monthly rainfall and temperature with machine learning methods in Australia. Int. J. Climatol. 38, 4891–4902 (2018)
37. Wikle, C.K.: Comparison of deep neural networks and deep hierarchical models for spatio-temporal data. J. Agric. Biol. Environ. Stat. 24, 175–203 (2019)
38. Wu, J.: A novel artificial neural network ensemble model based on K-nearest neighbor nonparametric estimation of regression function and its application for rainfall forecasting. In: International Joint Conference on Computational Sciences and Optimization, pp. 44–48 (2009)
39. Zammit-Mangion, A., Wikle, C.K.: Deep integro-difference equation models for spatio-temporal forecasting. Spat. Stat. 37, 100408 (2020)
40. Zhang, J., Fan, H., He, D., Chen, J.: Integrating precipitation zoning with random forest regression for the spatial downscaling of satellite-based precipitation: a case study of the Lancang-Mekong River basin. Int. J. Climatol. 39, 3947–3961 (2019)
41. Zhang, G., Su, X., Ayantobo, O., Feng, K., Guo, J.: Spatial interpolation of daily precipitation based on modified ADW method for gauge-scarce mountainous regions: a case study in the Shiyang River Basin. Atmos. Res. 247, 105167 (2021)

Testing Homogeneity of Response Propensities in Surveys Juan Luis Moreno-Rebollo, Joaquín Muñoz-García, and Rafael Pino-Mejías

Abstract Nonresponse is a serious problem in surveys. Response propensity (the probability of responding) is a key concept that relates nonresponse to bias in the estimates. Strata with homogeneous propensities protect the estimates. Assuming that nonresponse units are replaced and the sample units are selected sequentially until completing the sample, two tests to validate the homogeneity of propensities are proposed. A detailed study is done when the number of population units between two consecutive responses follows a Zero-Inflated Geometric distribution. A limited simulation is carried out to assess the performance of the proposed tests.

1 Introduction

In sampling theory, non-sampling errors are due to factors other than sampling: coverage errors, data entry errors, biased survey questions, nonresponse, false information provided by respondents, etc. Nonresponse is a serious and increasing problem. Groves and Peytcheva [4] highlight that total nonresponse is one of the main problems in surveys. Nonresponse bias is a threat to the core utility of official statistics, since unbiasedness in large samples is an essential quality component of official statistics. Nonresponse also increases the variance of the estimates, since the observed sample size is lower than originally planned. However, variability can be measured reasonably well, while bias is very difficult to measure. For this reason, nonresponse bias is a much greater concern for the quality of official statistics than precision. Recently, Hedlin [6] explored the relationship between nonresponse rate and bias, assuming non-ignorable nonresponse.

J. L. Moreno-Rebollo (B) · J. Muñoz-García · R. Pino-Mejías
University of Seville, Sevilla, Spain
e-mail: [email protected]
J. Muñoz-García, e-mail: [email protected]
R. Pino-Mejías, e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_35

415


Different approaches have been proposed in the literature for dealing with missing data: imputation, reweighting, nonresponse models, etc. Several methods have been developed depending on the auxiliary data available. Because the approaches suggested for dealing with nonresponse will not guarantee unbiasedness, efforts have to be made in the planning and implementation of the survey to minimize nonresponse. In this paper unit nonresponse is considered. It is assumed that nonresponse is a stochastic outcome and each element has its own individual response probability when it is sampled (propensity). Obviously, response propensities are unknown. Different approaches, based on auxiliary information, have been proposed in the literature to estimate propensities; see Peress [9]. Propensity to respond is a key concept that relates nonresponse to bias in the estimates, Brick and Jones [2]. Särndal and Lundström [11] demonstrate that the greater the correlation between propensity and the objective variable, the greater the bias. They also demonstrate that the bias is null if the propensity is homogeneous in the population. This fact justifies that strata with homogeneous propensities would be desirable. Haziza and Beaumont [5] suggest that homogeneous response groups should be formed to protect somewhat against model insufficiency. Schouten et al. [12] define a response subset as representative with respect to the sample if the propensities are the same for all units in the population. An indicator for representativeness is defined based on this concept. Assuming that strata have been delimited aiming at homogeneous propensities and that the responses of different units are independent, in this paper two tests to validate the hypothesis of homogeneous propensities are proposed. A detailed study is carried out considering that responses follow a model favourable to answering, and that the number of population units between two consecutive responses follows a Zero-Inflated Geometric distribution. Alternative inflated geometric distributions, Joshi [7], could be considered. An empirical and limited study is carried out to assess the performance of the proposed tests.

2 Test for Homogeneity of Propensity

Let U = {1, ..., N} be a population (or stratum), delimited using the available auxiliary information, aiming at a homogeneous propensity for the population units. Let s be a sample from U selected according to simple random sampling, the generally applied sampling method for the final selection of the sampling units. Here we suppose simple random sampling with replacement (SRSWR) to simplify the results, but the procedure can be extended to sampling without replacement. Suppose that, for dealing with nonresponse, a random substitution procedure is performed. Although there has been criticism of the method, this procedure is usually followed in practice in many academic surveys and statistical offices, Vehovar [13]. According to Chapman [3] the substitutes should be selected simultaneously with the sample. See Muñoz-Conde et al. [8] for a review of the substitution methods. Given i ∈ U, let R_i be a binary random variable with R_i = 1 if i responds and R_i = 0 if not. We will assume that the response probability depends on i but not on the


sample s of which i is a member. Let p_i = Pr(R_i = 1), i = 1, ..., N. The hypothesis of homogeneity of response propensities is formulated as H0 : p_1 = · · · = p_N = p. Assuming that the sampling units (and the substitutes) are selected sequentially until n responses are obtained, let X_1 be the number of population units selected until obtaining the first response, and X_i the number of population units between the (i − 1)th and the ith responses, i = 2, ..., n. From the assumptions, X_1, ..., X_n are independent identically distributed random variables, X_i ∼ D(θ(p_1, ..., p_N)). To validate the hypothesis H0 : p_1 = · · · = p_N = p, two tests based on order statistics are proposed.

1. The first one is based on the maximum X_{n:n} = max{X_1, ..., X_n}, with critical region C : X_{n:n} > x_{n:n,α} for an α-level test, where x_{n:n,α} is the 1 − α quantile of the distribution of X_{n:n} under H0, F_n(x, θ(p)). The implementation of the test requires estimating θ(p) and replacing x_{n:n,α} by x̂_{n:n,α}, the 1 − α quantile of F_n(·, θ̂(p)).
2. The second one is based on the range R_n = X_{n:n} − X_{1:n}, with critical region C : R_n > r_{n,α} for an α-level test, where r_{n,α} is the 1 − α quantile of the distribution of R_n under H0, F_{R_n}(·, θ(p)). The implementation of the test requires estimating θ(p) and replacing r_{n,α} by r̂_{n,α}, the 1 − α quantile of F_{R_n}(·, θ̂(p)).

3 Zero-Inflated (Truncated) Geometric Distribution

The tests proposed in Sect. 2 depend on the distribution D(θ(p)) assumed for X_i. If the population units respond independently to the survey, the Geometric distribution could be a reasonable model for X_i. However, the abundance of zeros usually observed in practice makes the Geometric distribution inappropriate for modelling X_i. To model the abundance of zero counts in the data, we will suppose that X_i follows a Zero-Inflated (Truncated) Geometric (ZITG) distribution, a mixture distribution that increases the probability of response. Note that other models, not considered in this paper, could be used by inflating/deflating the Geometric distribution, see Joshi [7]. This section presents some results related to the ZITG distribution that will be used later. The probability mass function (pmf) of a random variable X ∼ ZITG(π, p), 0 < π, p < 1, is given by

Pr(X = x) = (π + (1 − π)p) / T(π, p, N),   x = 0,
Pr(X = x) = (1 − π) p q^x / T(π, p, N),    x = 1, ..., N,

with T(π, p, N) = 1 − (1 − π) q^{N+1} and q = 1 − p. From now on, T(π, p, N) will be denoted by T to simplify the notation. The cumulative distribution function (cdf), on the support, is given by

F(x) = F(x, π, p) = (1 − (1 − π) q^{x+1}) / T,   x = 0, 1, ..., N,   (1)

and

Pr(X ≥ x) = 1,                             x = 0,
Pr(X ≥ x) = (1 − π)(q^x − q^{N+1}) / T,    x = 1, ..., N.   (2)

Next, the pmfs of X_{n:n} and R_n related to a random sample of size n from a ZITG(π, p) distribution are given.
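The pmf and cdf above are straightforward to code; the following Python sketch (parameter values chosen arbitrarily for illustration, not taken from the chapter) checks numerically that the pmf sums to one and that the cdf matches the cumulated pmf.

```python
def zitg_pmf(x, pi, p, N):
    """pmf of ZITG(pi, p) on {0, 1, ..., N}."""
    q = 1.0 - p
    T = 1.0 - (1.0 - pi) * q ** (N + 1)   # normalising constant T(pi, p, N)
    if x == 0:
        return (pi + (1.0 - pi) * p) / T
    return (1.0 - pi) * p * q ** x / T

def zitg_cdf(x, pi, p, N):
    """cdf, Eq. (1)."""
    q = 1.0 - p
    T = 1.0 - (1.0 - pi) * q ** (N + 1)
    return (1.0 - (1.0 - pi) * q ** (x + 1)) / T

pi, p, N = 0.3, 0.1, 50
total = sum(zitg_pmf(x, pi, p, N) for x in range(N + 1))
print(total)                   # 1 up to floating-point error
print(zitg_cdf(3, pi, p, N))   # equals pmf(0) + ... + pmf(3)
```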

3.1 Distribution of X_{n:n}

The cdf of X_{n:n} is

F_n(x) = F_n(x, π, p) = F(x)^n,

with F(x) given by (1). Its pmf is given by

Pr(X_{n:n} = k) = F_n(k) − F_n(k − 1),   k = 0, 1, ..., N.
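Since F_n(x) = F(x)^n, both the pmf of the maximum and the critical value of the test based on X_{n:n} are immediate to compute; a Python sketch with arbitrary parameter values:

```python
def zitg_cdf(x, pi, p, N):
    """cdf of ZITG(pi, p), Eq. (1)."""
    q = 1.0 - p
    T = 1.0 - (1.0 - pi) * q ** (N + 1)
    return (1.0 - (1.0 - pi) * q ** (x + 1)) / T

def max_pmf(k, n, pi, p, N):
    """Pr(X_{n:n} = k) = F_n(k) - F_n(k - 1), with F_n(-1) = 0."""
    Fk = zitg_cdf(k, pi, p, N) ** n
    Fk1 = zitg_cdf(k - 1, pi, p, N) ** n if k > 0 else 0.0
    return Fk - Fk1

def max_quantile(alpha, n, pi, p, N):
    """Smallest k with F(k)^n >= 1 - alpha: the critical value of the maximum test."""
    for k in range(N + 1):
        if zitg_cdf(k, pi, p, N) ** n >= 1.0 - alpha:
            return k
    return N

n, pi, p, N = 200, 0.3, 0.05, 10000
total = sum(max_pmf(k, n, pi, p, N) for k in range(N + 1))
crit = max_quantile(0.05, n, pi, p, N)
print(total, crit)   # total is 1; H0 is rejected when the observed maximum exceeds crit
```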

3.2 Distribution of R_n

Following Ahsanullah et al. [1, p. 70], the joint pmf of (X_{1:n}, X_{n:n}), Pr(X_{1:n} = i; X_{n:n} = j) = p_{ij}, can be obtained from

p_{ij} = (Pr(X ≥ i) − Pr(X ≥ j + 1))^n − (Pr(X ≥ i) − Pr(X ≥ j))^n − (Pr(X ≥ i + 1) − Pr(X ≥ j + 1))^n + (Pr(X ≥ i + 1) − Pr(X ≥ j))^n.

Taking into account (2) and that

Pr(R_n = r) = \sum_{i=0}^{N−r} Pr(X_{1:n} = i; X_{n:n} = i + r),

it is obtained that

Pr(R_n = 0) = ((1 − (1 − π)q)/T)^n + \sum_{i=1}^{N} ((1 − π) p q^i / T)^n
            = (1 − (1 − π)q)^n / T^n + (1 − π)^n p^n (q^n − q^{n(N+1)}) / (T^n (1 − q^n))

and, for r = 1, ..., N,

Pr(R_n = r) = ((1 − (1 − π) q^{r+1})/T)^n − ((1 − (1 − π) q^r)/T)^n
            − ((1 − π) q / T)^n ((1 − q^r)^n − (1 − q^{r−1})^n) (q^n − q^{n(N−r+1)}) / (1 − q^n)
            + ((1 − π)/T)^n [ ((1 − q^{r+1})^n − (1 − q^r)^n) (q^n − q^{n(N−r+1)}) / (1 − q^n) − (q − q^{r+1})^n + (q − q^r)^n ].

3.3 MLE

Suppose that X_i ∼ ZITG(π, p_i). Under the hypothesis H0 : p_1 = · · · = p_N = p, X_1, ..., X_n is a random sample from a ZITG(π, p) distribution. In this framework the likelihood function related to a sample x_1, ..., x_n is given by

L(x_1, ..., x_n) = (1/T^n) (π + (1 − π)p)^{k_0} ((1 − π)p)^{n−k_0} q^{\sum_{i=1}^{n} x_i},

with k_0 = #{x_i : x_i = 0} and k = n − k_0. The maximum likelihood estimates (mle) of π and p are obtained as solutions of the following equations:

k_0 (1 − π) / (1 − (1 − π)q) + k/p − n(N + 1)(1 − π) q^N / T − (\sum_{i=1}^{n} x_i) / q = 0,

k_0 q / (1 − (1 − π)q) − k / (1 − π) − n q^{N+1} / T = 0.

We do not have closed-form expressions for these estimators, so the previous equations have to be solved by numerical approximation. The optimize R function [10] has been used in our simulations. To alleviate the dependence on the initial solution, the algorithm can be independently repeated from several initial solutions. The final solution is selected according to the maximum log L value.
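As an illustration of this numerical step (the chapter uses R's optimize; here a crude grid search over (π, p) in Python plays the same role), with a simulated ZITG sample whose size, seed and true parameter values are all arbitrary:

```python
import math
import random

def loglik(pi, p, k0, k, sx, n, N):
    """log-likelihood of a ZITG(pi, p) sample: k0 zeros, k positives, sum of x's sx."""
    q = 1.0 - p
    T = 1.0 - (1.0 - pi) * q ** (N + 1)
    return (-n * math.log(T) + k0 * math.log(pi + (1.0 - pi) * p)
            + k * math.log((1.0 - pi) * p) + sx * math.log(q))

def mle_grid(xs, N, steps=100):
    """Crude stand-in for numerical optimisation: maximise log L over a (pi, p) grid."""
    n = len(xs)
    k0 = sum(1 for x in xs if x == 0)
    k, sx = n - k0, sum(xs)
    best = None
    for i in range(1, steps):
        for j in range(1, steps):
            pi, p = i / steps, j / steps
            ll = loglik(pi, p, k0, k, sx, n, N)
            if best is None or ll > best[0]:
                best = (ll, pi, p)
    return best

def rzitg(n, pi, p, N, rng):
    """Sample ZITG(pi, p): an extra zero with probability pi, else geometric (kept if <= N)."""
    out = []
    while len(out) < n:
        if rng.random() < pi:
            out.append(0)
        else:
            x = int(math.log(1.0 - rng.random()) / math.log(1.0 - p))
            if x <= N:
                out.append(x)
    return out

rng = random.Random(7)
xs = rzitg(2000, 0.4, 0.1, 10000, rng)
ll, pi_hat, p_hat = mle_grid(xs, 10000)
print(pi_hat, p_hat)   # should land near the true values (0.4, 0.1)
```

In practice one would refine the grid locally or use a proper optimiser, as the chapter does.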


4 Implementation and Simulations

We have developed functions in R [10] to obtain the pmfs of X_{n:n} and R_n, to generate samples from a ZITG(π, p) distribution, to obtain the mle estimators of π and p, and to compute the p-values associated to the tests based on X_{n:n} and R_n. All R codes are available from the corresponding author upon request. An empirical and limited study has been performed to assess the power of the tests based on X_{n:n} and R_n, with population size N = 10000. Various settings have been specified for the parameters π and p. Four values, 0.2, 0.4, 0.6 and 0.8, were selected for π, and nine values p′ = 0.05 + Δ, Δ = 0(0.05)0.4. For each configuration (π, p′), 2000 samples of size n = 200 of the corresponding mixture were generated: n/2 cases from ZITG(π, p), with p = 0.05, and n/2 cases from ZITG(π, p′). For each sample both tests presented in Sect. 2 were applied. Figure 1 visualizes the proportion of samples rejecting the hypothesis of homogeneity with the tests based on X_{n:n} (solid lines) and R_n (dashed lines). Similar figures were obtained with other configurations. R_n tends to offer less power than X_{n:n} when Δ is greater than 0.2.









Fig. 1 Empirical power for tests based on X n:n (solid lines) and Rn (dashed lines)


5 Conclusions

Strata with homogeneous propensities protect estimates from nonresponse. Assuming that the sample units are selected sequentially until n responses are obtained, two tests based on order statistics (maximum and range) of the number of population units between consecutive responses are proposed to validate the homogeneity of propensities. A detailed study of the tests is done when the number of population units between consecutive responses is modelled by a Zero-Inflated (Truncated) Geometric distribution. Alternative inflated geometric distributions, out of the scope of this paper, could also be considered. The limited simulation carried out suggests that the proposed tests could be useful tools to test the homogeneity of propensities.

References

1. Ahsanullah, M., Nevzorov, V.B., Shakil, M.: An Introduction to Order Statistics. Atlantis Studies in Probability and Statistics, vol. 3. Atlantis Press, Dordrecht (2013)
2. Brick, J.M., Jones, M.E.: Propensity to respond and nonresponse bias. Metron 66, 51–73 (2008)
3. Chapman, D.: The impact of substitution on survey estimates. In: Madow, W., Olkin, I., Rubin, D. (eds.) Incomplete Data in Sample Surveys, vol. II. Theory and Bibliographies. Academic, New York (1983)
4. Groves, R.M., Peytcheva, E.: The impact of nonresponse rates on nonresponse bias. A meta-analysis. Pub. Opin. Quart. 72, 167–189 (2008)
5. Haziza, D., Beaumont, J.F.: On the construction of imputation classes in surveys. Int. Stat. Rev. 75, 25–43 (2007)
6. Hedlin, D.: Is there a 'safe area' where the nonresponse rate has only a modest effect on bias despite non-ignorable nonresponse? Int. Stat. Rev. 88, 642–657 (2020)
7. Joshi, R.D.: A Generalized Inflated Geometric Distribution. Ph.D. Theses, Dissertations and Capstones, Marshall University (2015)
8. Muñoz-Conde, M., Muñoz-García, J., Pascual-Acosta, A., Pino-Mejías, R.: Field substitution and sequential sampling method. In: Gil, E., Gil, E., Gil, J., Gil, M.A. (eds.) The Mathematics of the Uncertain. A Tribute to Pedro Gil. Springer, Heidelberg (2018)
9. Peress, M.: Correcting for survey nonresponse using variable response propensity. J. Am. Stat. Assoc. 105, 1–13 (2010)
10. R Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2020). https://www.R-project.org
11. Särndal, C.E., Lundström, S.: Estimation in Surveys with Nonresponse. Wiley, New York (2005)
12. Schouten, B., Cobben, F., Bethlehem, J.: Indicators for the representativeness of survey response. Surv. Method. 35, 101–113 (2009)
13. Vehovar, V.: Field substitution and unit nonresponse. J. Offic. Stat. 15, 335–350 (1999)

Use of Free Software to Estimate Sensitive Behaviours from Complex Surveys María del Mar Rueda, Beatriz Cobo, and Antonio Arcos

Abstract In social surveys, respondents often do not answer honestly when asked sensitive questions about behavior that is illegal or not accepted by society. Warner [71] developed a data collection procedure, the Randomized Response (RR) technique, which enables researchers to obtain sensitive information while ensuring the privacy of respondents. The methodology of RR has advanced considerably in recent years. Nevertheless, most research in this area concerns only simple random sampling. Data from complex survey designs require special consideration with regard to parameter estimation and the corresponding variance estimation. There is a body of research literature on alternative techniques for eliciting suitable RR schemes in order to estimate parameters for sensitive characteristics, but no existing software covers the estimation of these procedures from complex surveys. This gap is filled by RRTCS. In this paper we highlight the main features of the package. The package includes estimators for means and totals under several RR techniques and also provides confidence interval estimation for the entire population and for domains. We illustrate the use of this software with three real surveys referring to alcohol abuse, collection of subsidies and gender violence.

M. Rueda (B) · B. Cobo · A. Arcos, University of Granada, Granada, Spain. e-mail: [email protected]; B. Cobo e-mail: [email protected]; A. Arcos e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N. Balakrishnan et al. (eds.), Trends in Mathematical, Information and Data Sciences, Studies in Systems, Decision and Control 445, https://doi.org/10.1007/978-3-031-04137-2_36

1 Introduction

In socioeconomic and health research, it is common to collect information on highly sensitive topics. In these situations, when the direct interview method (asking questions directly of respondents) is used, respondents may refuse to answer or give a false answer due to social stigma or fear of possible retaliation. These systematic errors in
the responses lead to a social desirability bias in the prevalence estimates of the sensitive behaviors of interest, producing an underestimation of the prevalence of socially undesirable activities such as abuse, drug use, and ablation. To try to avoid or minimize these biases, methods such as the randomized response technique (RRT) can be used, which allow the collection of more reliable data, protect the confidentiality of the respondent and reduce the rate of non-response. In the RRT, respondents use a randomization mechanism to generate a probabilistic relationship between their responses and the actual values of the sensitive characteristic. The RRT is currently applied in surveys that cover a variety of sensitive topics such as drug use [22, 27, 59, 69], abortion or crime [38, 46, 58], sexual addiction [59, 61], sexual victimization [44], racism [45], AIDS [6], academic cheating [5, 33], or fraud in the field of disability benefits [51]. Warner [71] was the pioneer in the use of these techniques. Warner developed a procedure for data collection, the Randomized Response (RR) technique, which allows information to be obtained from interviewees anonymously, guaranteeing the privacy of the respondents. These RR methods have been shown to encourage greater cooperation from respondents while reducing their motivation to misreport their attitudes. Many works have demonstrated that RR achieves more accurate estimates of the prevalence of socially undesirable behavior than when sensitive questions are asked directly [23]. However, the use of RR also entails additional costs, and therefore its advantage (the greater precision of the population estimates obtained) will only exceed these additional costs if the estimates are substantially better than those derived from designs with simple questions and answers [50].
From the Warner model, many models arose with the objective of estimating a population proportion (see [4, 34, 48, 49, 52, 57, 65], among others). Standard RR methods were initially proposed to treat binary responses to a sensitive question and seek to estimate the proportion of people who exhibit a sensitive characteristic or behavior. Subsequently, models have appeared that allow quantitative sensitive variables to be treated. Greenberg et al. [36] designed a method in which the respondent is asked to select, by means of a randomization device, one of two questions: the sensitive question or an unrelated one, the answers to which should be of roughly the same order of magnitude. Other important works in this respect include Eichhorn and Hayre [28], Bar-Lev et al. [7], Saha [63], Huang [40], Diana and Perri [25, 26], and Arcos et al. [2]. A good review of RR techniques can be found in Fox and Tracy [31], Bouza et al. [13] or Chaudhuri et al. [19]. Most RR methods have been developed for simple random sampling. In practice, however, most surveys of people are conducted using complex samples that involve stratification, clustering, and unequal probabilities of sample selection. The data obtained with complex sample designs require a different formulation regarding the estimation of finite population parameters and the corresponding variance estimation procedures, which must take the sampling procedure into account. Under such complex survey designs, unbiased variance estimation is not easy because it depends on the second-order inclusion probabilities, which are


generally complex. Today, several software packages facilitate the analysis of complex survey data and implement some of these estimators, such as SAS, SPSS, Systat, Stata, SUDAAN or PC CARP. CRAN contains several R packages that include the design-based methods commonly used in survey methodology to treat samples drawn from a sampling frame, for example survey, sampling, laeken or TeachingSampling, among others; see [70]. Some RR methods have been extended to more complex sampling designs, such as stratified sampling [1, 21, 41, 43, 66], two-sample designs [42, 62], or unequal probability sampling [15, 16]. If data are obtained from randomized response techniques, we cannot use standard software packages for complex surveys directly: with certain modifications of the variables these packages can yield correct point estimates of population parameters, but they still yield incorrect estimated standard errors. Some authors have developed R packages for estimation with randomized response surveys. The RRreg package [37] conducts multivariate regression analyses for some RR models. The rr package [10] is similar to RRreg, although there are some differences (e.g., estimation in rr is based on the EM algorithm, whereas RRreg uses a standard optimization routine). The GLMMRR package [32] fits generalized linear mixed models with binary randomized response data. The methods implemented in these packages assume simple random sampling. In addition to randomized response techniques, among the indirect questioning techniques we find other interesting and widely used approaches, such as the item count technique [23], which is implemented in the R package list [9]. To the best of our knowledge, there is no free software incorporating estimation procedures for handling randomized response data obtained from complex surveys.
RRTCS (Randomized Response Techniques in Complex Surveys) provides functions for point and interval estimation, for the entire population and for domains, from randomized response surveys. We have structured the paper as follows. In the next section, we present the RRTCS package, discussing the guidelines followed in its construction and presenting its principal functions and functionalities. Subsequently, we present some implementation details of the package and provide some examples to illustrate how the package works. Finally, we present a summary with some conclusions and possible future developments.

2 The R Package RRTCS

RRTCS is an R package designed for the estimation of linear parameters, in the entire population or in domains, with data obtained from randomized response surveys. The package is designed to work with a wide range of sampling designs: unequal probability sampling, stratified sampling, cluster sampling, and combinations of them. It consists of twenty-one main functions, each implementing one of the following RR procedures for complex surveys:


• Randomized response techniques for a qualitative stigmatizing characteristic: Christofides model [20], Devore model [24], Forced Response model [11], Horvitz model [35, 39], Horvitz model with unknown B [17, p. 42], Kuk model [47], Mangat model [55], Mangat model with unknown B [17, p. 53], Mangat and Singh model [54], Mangat et al. model [55], Mangat et al. model with unknown B [17, p. 54], Singh and Joarder model [64], Soberanis-Cruz model [68] and Warner model [71].
• Randomized response techniques for a quantitative stigmatizing characteristic: Bar-Lev model [7], Chaudhuri and Christofides model [18, p. 97], Diana and Perri 1 model [25, p. 1877], Diana and Perri 2 model [25, p. 1879], Eichhorn and Hayre model [28], Eriksson model [29] and Saha model [63].

The package also includes the function ResamplingVariance for variance estimation of the RR estimators using resampling methods. Twenty datasets containing observations from surveys conducted using different randomized response techniques are also included.

3 Implementation Details

The package follows the unified approach to estimation with randomized response given by Arnab [3]. Consider a finite population $U$ consisting of $N$ individuals. Let $y_i$ be the value of the sensitive variable under study for the $i$th population element. The finite population total of the variable of interest $y$ is denoted by $Y = \sum_{i=1}^{N} y_i$, and the population mean by $\bar{Y} = \frac{1}{N}\sum_{i=1}^{N} y_i$. The proportion of the population presenting a stigmatized behaviour $A$ is treated as a mean by setting

$$y_i = \begin{cases} 1 & \text{if } y_i \in G_A \\ 0 & \text{otherwise} \end{cases}$$

where $G_A$ is the group with the stigmatized behaviour. A sample $s$ of $n$ elements of $U$ is chosen according to a general design $p$. We denote by $\pi_i = \sum_{s \ni i} p(s)$, $i \in U$, the first-order inclusion probabilities. Since the values $y_i$ are not available in the sample, they are estimated using the randomized response obtained from the $i$th respondent. Suppose that the $i$th respondent conducts a RR trial independently, and let $z_i$ be the resulting scrambled (or randomized) response. For each $i \in s$ the scrambled response induces a revised randomized response $r_i$ that is an unbiased estimator of $y_i$, with $\phi_i = V(r_i)$. Applying the Horvitz–Thompson estimator to this variable, we obtain an unbiased estimator of the population total of the sensitive variable $y$:

$$\hat{Y}_{ht}(r) = \sum_{i \in s} \frac{r_i}{\pi_i}.$$
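The point estimator above reduces to a one-line computation. The following is a minimal Python sketch for illustration only (the function name `ht_total` is ours; RRTCS itself is written in R and its interface is not reproduced here):

```python
def ht_total(r, pi):
    """Horvitz-Thompson estimate of the population total of the
    sensitive variable, computed from the revised randomized
    responses r_i and the first-order inclusion probabilities pi_i."""
    return sum(ri / pii for ri, pii in zip(r, pi))

# toy sample of four revised responses and inclusion probabilities
r = [1.0, 0.0, 1.0, 1.0]
pi = [0.5, 0.25, 0.5, 0.25]
print(ht_total(r, pi))  # 1/0.5 + 0/0.25 + 1/0.5 + 1/0.25 = 8.0
```

Each response is simply weighted by the inverse of its inclusion probability, exactly as with ordinary survey data, except that the revised responses $r_i$ stand in for the unobservable $y_i$.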


The variance of $\hat{Y}_{ht}(r)$ is given by

$$V_{ht}(r) = \frac{1}{2}\sum_{i \neq j \in U} (\pi_i\pi_j - \pi_{ij})\left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2 + \sum_{i \in U}\frac{\phi_i}{\pi_i} = V_{ht} + \sum_{i \in U}\frac{\phi_i}{\pi_i},$$

where $\pi_{ij}$ are the second-order inclusion probabilities of the design $p$, which we assume to be non-null. We can obtain an unbiased estimator of $V_{ht}(r)$ by

$$\hat{\hat{V}}_{ht}(r) = \frac{1}{2}\sum_{i \neq j \in s}\frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}}\left(\frac{r_i}{\pi_i} - \frac{r_j}{\pi_j}\right)^2 + \sum_{i \in s}\frac{\hat{\phi}_i}{\pi_i} = \hat{V}_{ht} + \sum_{i \in s}\frac{\hat{\phi}_i}{\pi_i},$$

where $\hat{\phi}_i$ denotes an unbiased estimator of the randomization variance $\phi_i$. Similarly, the estimator of the population mean $\bar{Y}$ for the RR survey is given by

$$\hat{\bar{Y}}_{ht}(r) = \frac{1}{N}\sum_{i \in s}\frac{r_i}{\pi_i}$$

and the estimator of its variance is $\frac{1}{N^2}\hat{\hat{V}}_{ht}(r)$. If the population size $N$ is unknown, we consider the ratio estimator

$$\hat{\bar{Y}}_{ha}(r) = \frac{\sum_{i \in s} r_i/\pi_i}{\sum_{i \in s} 1/\pi_i}$$

in order to obtain consistent estimators of the mean. The estimator of its variance is calculated using a Taylor-series linearization of the ratio [72]. For domain estimation, we use an indicator vector $I$ identifying the domain of interest. With the help of this vector we create new data structures, eliminating the indices that do not belong to the domain of interest, so that we work with a new population and sample size, namely those of the indicated domain. With these new data structures we perform all the analyses indicated above. The unbiased estimator of the population total of $y$ and its variance are then the same as in the previous case, restricted to the domain of interest:

$$\hat{Y}_{htD}(r) = \sum_{i \in s_d}\frac{r_i}{\pi_i} \quad \text{and} \quad \hat{\hat{V}}_{htD}(r) = \hat{V}_{htD} + \sum_{i \in s_d}\frac{\hat{\phi}_i}{\pi_i},$$


where $s_d$ is the sample restricted to the domain of interest. The population mean $\bar{Y}$ and its variance are estimated by

$$\hat{\bar{Y}}_{htD}(r) = \frac{1}{N_d}\sum_{i \in s_d}\frac{r_i}{\pi_i} \quad \text{and} \quad \hat{\hat{V}}_{htD}(r) = \frac{1}{N_d^2}\left[\hat{V}_{htD} + \sum_{i \in s_d}\frac{\hat{\phi}_i}{\pi_i}\right],$$

where $N_d$ is the domain population size. Usually the domain population size $N_d$ is unknown; in this case the ratio estimator

$$\hat{\bar{Y}}_{haD}(r) = \frac{\sum_{i \in s_d} r_i/\pi_i}{\sum_{i \in s_d} 1/\pi_i}$$

is used. The revised randomized response $r_i$ varies with the RR technique. For example, in the Warner RR technique [71], each respondent has to draw a random card from a deck. The deck has two types of cards, with known proportions, which are identical in appearance. The first type of card, with proportion $p$ ($p \neq 1/2$), contains the question "Do you belong to group $A$?", while the second type of card, with proportion $1-p$, contains the question "Do you belong to group $\bar{A}$?", where $\bar{A}$ is the complement of the group $A$. The respondent answers "Yes" or "No" to the question on the selected card with sincerity. Since the experiment is conducted in the absence of the interviewer, the interviewer does not know which of the two questions the interviewee answered, and the respondent's privacy is maintained. Let $z_i$ be the randomized response obtained from the $i$th unit, with $z_i = 1$ if the response is "Yes" and $z_i = 0$ if the response is "No". Then

$$r_i = \frac{z_i - (1-p)}{2p - 1}$$

yields an unbiased estimator of $y_i$, and the variance of $r_i$ is

$$\phi_i = V_R(r_i) = \frac{p(1-p)}{(2p-1)^2}.$$

In qualitative models, the values $r_i$ and $V_R(r_i)$ are obtained for each model. In some quantitative models, the values $r_i$ and $V_R(r_i)$ are calculated in a general form [2] as follows. We define the randomized response variable as

$$Z_i = \begin{cases} y_i & \text{with probability } p_1 \\ y_i S_1 + S_2 & \text{with probability } p_2 \\ S_3 & \text{with probability } p_3 \end{cases}$$

with $p_1 + p_2 + p_3 = 1$, where $S_1$, $S_2$ and $S_3$ are variables whose distributions are known. The mean and standard deviation of the variable $S_i$ ($i = 1, 2, 3$) are denoted by $\mu_i$ and $\sigma_i$ respectively. The transformed variable is

$$r_i = \frac{z_i - p_2\mu_2 - p_3\mu_3}{p_1 + p_2\mu_1},$$

and its variance is

$$V_R(r_i) = \frac{1}{(p_1 + p_2\mu_1)^2}\,(y_i^2 A + y_i B + C),$$

where

$$A = p_1(1-p_1) + \sigma_1^2 p_2 + \mu_1^2 p_2 - \mu_1^2 p_2^2 - 2 p_1 p_2 \mu_1,$$
$$B = 2 p_2 \mu_1 \mu_2 - 2 \mu_1 \mu_2 p_2^2 - 2 p_1 p_2 \mu_2 - 2 \mu_3 p_1 p_3 - 2 \mu_1 \mu_3 p_2 p_3,$$
$$C = (\sigma_2^2 + \mu_2^2) p_2 + (\sigma_3^2 + \mu_3^2) p_3 - (\mu_2 p_2 + \mu_3 p_3)^2,$$

and the estimated variance is

$$\hat{V}_R(r_i) = \frac{1}{(p_1 + p_2\mu_1)^2}\,(r_i^2 A + r_i B + C).$$

For example, if $p_1 = p_3 = 0$, $p_2 = 1$ and $S_2 = 0$, the proposed model becomes the Eichhorn and Hayre model [28]; if $p_1 = p$, $p_2 = 1-p$, $p_3 = 0$ and $S_2 = 0$, it becomes the Bar-Lev model [7]; and if $p_1 = p$, $p_2 = 0$ and $S_3$ is a discrete uniform variable with probabilities $q_1, q_2, \ldots, q_j$ verifying $q_1 + q_2 + \cdots + q_j = 1-p$, it becomes the Eriksson method [29]. To estimate the variance by resampling methods [72] we implement the function ResamplingVariance, which lets us choose among several methods:

• The jackknife method [60]
• The Escobar–Berger method [30]
• The Campbell–Berger–Skinner method [8, 14].

The jackknife method is implemented with new code; the other methods are implemented using the samplingVarEst package (see [53]). Note: if the first-order inclusion probabilities are not available, but the corresponding sampling weights are, the inverses of the weights must be entered in the first-order inclusion probability parameter of the functions.
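The Warner transformation $r_i = (z_i - (1-p))/(2p-1)$ can also be checked numerically. The sketch below (plain Python for illustration; the function names are ours, not part of RRTCS) verifies that $r_i$ is conditionally unbiased for $y_i$ and evaluates the randomization variance $\phi_i$:

```python
def warner_r(z, p):
    """Revised response r_i = (z_i - (1 - p)) / (2p - 1)."""
    return (z - (1 - p)) / (2 * p - 1)

def expected_r(y, p):
    """E[r_i | y_i]: a respondent with status y answers 'Yes' (z = 1)
    with probability p if y = 1, and with probability 1 - p if y = 0."""
    p_yes = p if y == 1 else 1 - p
    return p_yes * warner_r(1, p) + (1 - p_yes) * warner_r(0, p)

p = 0.7
print(expected_r(1, p))                # close to 1.0: unbiased for y_i = 1
print(expected_r(0, p))                # close to 0.0: unbiased for y_i = 0
print(p * (1 - p) / (2 * p - 1) ** 2)  # phi_i, about 1.3125 for p = 0.7
```

The same kind of check applies to any of the qualitative models listed above, each with its own $r_i$ and $V_R(r_i)$.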

4 Computational Efficiency

Much attention has also been paid to computational efficiency. Often the populations in a survey are extremely large, or it is necessary to keep the sampling error below a certain value. As a consequence, it is necessary to consider large sample sizes, often on the order of thousands of sampling units. In these situations, the computational efficiency of the functions is essential; otherwise, users can face long runtimes and heavy computational loads. For this reason, the functions of RRTCS are developed according to strict efficiency criteria, using the vectorized operations of R to avoid loops and increase computational efficiency.

Fig. 1 Elapsed and user times (in seconds) for qualitative RR methods (Warner, Horvitz, SoberanisCruz, Devore, MangatSingh, ForcedResponse, Kuk, Christofides, SinghJoarder, Mangat, MangatSinghSingh, HorvitzUB, MangatUB, MangatSinghSinghUB) as a function of the sample size n

Fig. 2 Elapsed and user times (in seconds) for quantitative RR methods (EichhornHayre, BarLev, Eriksson, ChaudhuriChristofides, Saha, DianaPerri1, DianaPerri2) as a function of the sample size n

Figures 1 and 2 show the user time required to calculate the estimators on an Intel(R) Core(TM) i7-3770 at 3.40 GHz for different sample sizes. Elapsed time is also included to illustrate the actual time the user needs to obtain estimates. Figures 1 and 2 show that the sample size is not an issue here: it took less than 0.0015 seconds to obtain all estimates from a sample of size 10000. The latter sample size can be realistic for an official survey, while in real social surveys it is common to work with much smaller samples.
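The vectorization principle behind these timings can be illustrated outside R as well. The following Python/NumPy sketch (with invented data) computes the same Horvitz-Thompson-type total with an explicit loop and with a vectorized expression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
r = rng.random(n)              # synthetic revised responses
pi = rng.uniform(0.1, 0.9, n)  # synthetic inclusion probabilities

# explicit loop over sampling units (slow in interpreted languages)
total_loop = 0.0
for ri, pii in zip(r, pi):
    total_loop += ri / pii

# vectorized equivalent, analogous to what R's vectorized
# arithmetic gives the package's functions
total_vec = float(np.sum(r / pi))

print(abs(total_loop - total_vec) < 1e-6 * total_vec)  # True: same estimate
```

The two forms agree to floating-point accuracy; only the vectorized form scales comfortably to samples of thousands of units.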

5 Examples

Note: all information regarding the examples can be found in the RRTCS package documentation.


5.1 Example 1: A Randomized Response Survey to Investigate Alcohol Abuse

We illustrate the use of the Warner technique [71] to estimate the total of alcohol abuse, taking p = 0.7 as the model parameter and the gender of the respondent as domain. The sensitive question in this study is: during the last month, did you ever have more than five drinks (beer/wine) in succession? The sample is drawn by simple random sampling without replacement and the domain of interest is female. The dataset is bundled with the RRTCS package. The R code and the output for this example are as follows:

data(WarnerData)
dat
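Since the R listing above is truncated, the logic of this example can be mimicked with a small, purely illustrative Python simulation; the population size, sample size and prevalence below are invented and are not those of the WarnerData dataset:

```python
import random

random.seed(42)
N, n, p = 10_000, 1_000, 0.7  # invented sizes; p = 0.7 as in the survey
# invented true sensitive status with 15% prevalence
y = [1 if random.random() < 0.15 else 0 for _ in range(N)]

sample = random.sample(range(N), n)  # simple random sampling without replacement

def randomized_answer(yi):
    # with probability p the drawn card asks about group A,
    # otherwise about its complement; the answer is truthful
    return yi if random.random() < p else 1 - yi

z = [randomized_answer(y[i]) for i in sample]
r = [(zi - (1 - p)) / (2 * p - 1) for zi in z]  # Warner revised responses

pi = n / N               # first-order inclusion probability under SRSWOR
total_hat = sum(r) / pi  # Horvitz-Thompson estimate of the total
print(total_hat, sum(y)) # estimate vs. true total (they should be close)
```

With p = 0.7 the randomization variance per respondent is p(1-p)/(2p-1)^2, about 1.31, so the estimate fluctuates around the true total with a standard error of a few hundred units at this sample size.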