Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications : Selected Contributions from SimStat 2019 and Invited Papers [1 ed.] 9783031400544, 9783031400551

This volume presents a selection of articles on statistical modeling and simulation, with a focus on different aspects o


English · x + 265 pages · 2023


Table of contents :
Preface
Contents
Part I Invited Papers
1 Likelihood Ratios in Forensics: What They Are and What They Are Not
1.1 Introduction
1.2 Lindley's Likelihood Ratio (LLR)
1.2.1 Notations
1.2.2 A Frequentist Framework for Lindley's Likelihood Ratio (LLR)
1.3 Score-Based Likelihood Ratio (SLR)
1.3.1 The Expression of the SLR
1.3.2 The Glass Example
1.4 Discussion
References
2 MANOVA for Large Number of Treatments
2.1 Introduction
2.2 Notations and Model Setup
2.3 Simulations
2.3.1 MANOVA Tests for Large g
2.3.2 Special Case: ANOVA for Large g
2.4 Discussion and Outlook
References
3 Pollutant Dispersion Simulation by Means of a Stochastic Particle Model and a Dynamic Gaussian Plume Model
3.1 Introduction
3.2 Meteorological Monitoring Network
3.3 Wind Field Modeling
3.3.1 Mass Correction of the Wind Field
3.3.2 Plume Rise
3.4 Stochastic Particle Model
3.4.1 Deposition
3.4.2 Implementation
3.5 Dynamic Gaussian Plume Model
3.6 Implementation on the Server
3.7 A Real-World Example with Application to an Alpine Valley
3.8 Conclusions and Outlook
References
4 On an Alternative Trigonometric Strategy for Statistical Modeling
4.1 Introduction
4.2 The Alternative Sine Distribution
4.2.1 Presentation
4.2.2 Moment Properties
4.2.3 Parametric Extensions
4.3 AS Generated Family
4.3.1 Definition
4.3.2 Series Expansions
4.3.3 Example: The ASE Exponential Distribution
4.3.4 Moment Properties
4.4 Application to a Famous Cancer Data
4.5 Conclusion
References
Part II Design of Experiments
5 Incremental Construction of Nested Designs Based on Two-Level Fractional Factorial Designs
5.1 Introduction
5.2 Greedy Coffee-House Design
5.3 Two-Level Fractional Factorial Designs
5.3.1 Half Fractions: m=1
5.3.2 Several Generators
5.3.2.1 Defining Relations
5.3.2.2 Resolution
5.3.2.3 Word Length Pattern
5.3.3 Minimum Size
5.4 Two-Level Factorial Designs and Error-Correcting Codes
5.4.1 Definitions and Properties
5.4.2 Examples
5.5 Maximin Distance Properties of Two-Level Factorial Designs
5.5.1 Neighbouring Pattern and Distant Site Pattern
5.5.2 Optimal Selection of Generators by Simulated Annealing
5.5.2.1 SA Algorithm for the Maximisation of ρH
5.6 Covering Properties of Two-Level Factorial Designs
5.6.1 Bounds on CRH(Xn)
5.6.2 Calculation of CRH(Xn)
5.6.2.1 Algorithmic Construction of a Lower Bound on CRH(Xn)
5.7 Greedy Constructions Based on Fractional Factorial Designs
5.7.1 Base Designs
5.7.2 Rescaled Designs
5.7.3 Projection Properties
5.8 Summary and Future Work
Appendix
References
6 A Study of L-Optimal Designs for the Two-Dimensional Exponential Model
6.1 Introduction
6.2 Equivalence Theorem for L-Optimal Designs
6.3 General Case
6.4 Excess and Saturated Designs
References
7 Testing for Randomized Block Single-Case Designs by Combined Permutation Tests with Multivariate Mixed Data
7.1 Introduction
7.2 Randomized Block Single-Case Designs and NPC
7.3 Simulation Study
7.4 A Real Case Study
7.5 Conclusions
References
8 Adaptive Design Criteria Motivated by a Plug-In Percentile Estimator
8.1 Introduction
8.2 Problem Formulation and Background
8.2.1 Problem Formulation
8.2.2 Background
8.3 The Plug-In Estimator
8.4 Adaptive ``Plug-In'' Criteria
8.4.1 Monte Carlo Approximation
8.4.2 Monte Carlo Approximation Assuming Independency
8.4.3 Assuming Independency and Neglecting Uncertainty
8.4.4 Using SUR Design Criterion for Exceedance Probability
8.5 Numerical Implementation
8.6 Numerical Study
8.6.1 Comparison Study
8.6.2 Methodology
8.6.2.1 Case Studies
8.6.2.2 Performance Indicators
8.6.3 Numerical Results
8.6.3.1 Estimators Performance
8.6.3.2 Implementation
8.6.3.3 Criteria
8.7 Conclusions
Appendix 1
Posterior Mean and Variance of f Under the Gaussian Process Assumption
SUR Design Criteria for Exceedance Probability Estimation
Appendix 2
References
Part III Queueing and Inventory Analysis
9 On a Parametric Estimation for a Convolution of Exponential Densities
9.1 Introduction
9.2 Convolution of the Exponential Densities
9.3 ML Estimation of the Parameters
9.4 Parameter's Estimation by the Moments' Method
9.5 Approximation of the Density
9.6 Experimental Study
9.7 Application to a Single Queueing System M/G/1/k
9.8 Conclusions
References
10 Statistical Estimation with a Known Quantile and Its Application in a Modified ABC-XYZ Analysis
10.1 Introduction
10.2 Methods
10.2.1 Statistical Estimation with a Known Quantile
10.2.2 ABC-XYZ Analysis
10.3 ABC-XYZ Analysis Modified with a Known Quantile
10.4 Conclusions
References
Part IV Machine Learning and Applications
11 A Study of Design of Experiments and Machine Learning Methods to Improve Fault Detection Algorithms
11.1 Introduction
11.2 Design of Experiments and Machine Learning Modelling
11.3 Application to Fault Detection
11.3.1 Design of Experiments Step
11.3.2 Machine Learning Modelling Step
11.3.2.1 Refrigerant Undercharge: Fault Detection
11.3.2.2 Condenser Fouling: Fault Detection
11.4 Conclusions
References
12 Microstructure Image Segmentation Using Patch-Based Clustering Approach
12.1 Introduction
12.2 Input Data
12.3 Previous Work
12.4 Grain Segmentation
12.4.1 Seeded Region Growing (SRG)
12.4.2 Image Denoising and Patch Determination
12.4.3 Feature Extraction
12.4.4 Patch Clustering
12.4.5 Implementation
12.5 Results
12.6 Conclusion and Outlook
References
13 Clustering and Symptom Analysis in Binary Data with Application
13.1 Introduction
13.2 The Symptom Analysis
13.2.1 The Symptom and Syndrome Definition
13.2.2 Impulse Vector and Super-symptoms
13.2.3 Prefigurations of Super-symptom
13.2.4 The Super-symptom Recovery by Vector β
13.2.5 Clustering in Dichotomous Space and Symptom Analysis
13.3 The Medical Application of the Clustering and Symptom Analysis in Binary Data
13.3.1 Dataset
13.3.2 Result and Discussion
13.4 Conclusion
References
14 Big Data for Credit Risk Analysis: Efficient Machine Learning Models Using PySpark
14.1 Introduction
14.2 Data Processing
14.2.1 Data Treatment
14.2.2 Data Storage and Distribution
14.2.3 Munge Data
14.2.4 Creating New Measures
14.2.5 Missing Values Imputation and Outliers Treatment
14.2.6 One-Hot Code and Dummy Variables
14.2.7 Final Dataset
14.3 Method and Models
14.3.1 Method
14.3.2 Model Building
14.4 Results and Credit Scorecard Conversion
14.5 Conclusion
Appendix 1
Appendix 2
References


Contributions to Statistics

Jürgen Pilz • Viatcheslav B. Melas • Arne Bathke (Editors)

Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications Selected Contributions from SimStat 2019 and Invited Papers

Contributions to Statistics

The series Contributions to Statistics features edited and conference volumes in theoretical and applied statistics. Composed of refereed selected contributions, the volumes present the latest developments in all the exciting areas of contemporary statistical research.

Jürgen Pilz • Viatcheslav B. Melas • Arne Bathke Editors

Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications Selected Contributions from SimStat 2019 and Invited Papers

Editors Jürgen Pilz Department of Statistics University of Klagenfurt Klagenfurt, Austria

Viatcheslav B. Melas Department of Stochastic Simulation St. Petersburg State University St. Petersburg, Russia

Arne Bathke Department of Artificial Intelligence and Human Interfaces Paris Lodron University Salzburg Salzburg, Austria

ISSN 1431-1968 (Contributions to Statistics)
ISBN 978-3-031-40054-4    ISBN 978-3-031-40055-1 (eBook)
https://doi.org/10.1007/978-3-031-40055-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Preface

The present volume contains selected papers given at the 10th International Workshop on Simulation held at the Paris Lodron University of Salzburg, Austria, Sept. 2–6, 2019, together with papers contributed afterward that are closely related to the theme of the conference. The conference was organized by the Department of Data Science, Statistics, Stochastics of the Paris Lodron University of Salzburg, in collaboration with the Department of Statistics of the Alpen-Adria University of Klagenfurt, the Department of Statistical Modelling of St. Petersburg State University, and the INFORMS Simulation Society (USA). This international conference was devoted to statistical techniques in stochastic simulation, data collection, design and analysis of scientific experiments, and studies representing broad areas of applications. Plenary lectures were given by Edgar Brunner, Holger Dette, Regina Liu, Christian Robert, and Gerd Antes. Session topics included experimental design, data science and statistical learning, methods for multivariate and high-dimensional data, functional data, survival analysis, nonparametric statistics, algebraic methods in computational biology, statistics in forensics, neurology, and evidence-based medicine. The 1st–6th Workshops took place in St. Petersburg (Russia) in 1994, 1996, 1998, 2001, 2005, and 2009, respectively. The 7th International Workshop on Simulation took place in Rimini, May 21–24, 2013. The 8th Workshop was held in Vienna, Sept. 21–25, 2015, in memory of Luidmila Kopylova-Melas, the wife of Viatcheslav Melas, who initiated this series of conferences. The 9th International Workshop was held in Barcelona, June 25–29, 2018. The Scientific Program Committee of the 10th jubilee conference in this series of international workshops was chaired by Viatcheslav Melas (St. Petersburg, Russia), Dieter Rasch (Rostock, Germany), Jürgen Pilz (Klagenfurt, Austria), and Arne Bathke (Salzburg, Austria). We are indebted to the following members of the Scientific Program Committee for their fruitful help in organizing the sessions and making the 10th Jubilee Workshop in Salzburg a tremendous success: Aleksander Andronov (Latvia), Narayanaswamy Balakrishnan (Canada), Edgar Brunner (Germany), Ekaterina Bulinskaya (Russia), Holger Dette (Germany), Sat Gupta (Canada), Werner Müller (Austria), Luigi Salmaso (Italy), Christian Robert (France), Sergey Tarima (USA), and James V. Zidek (Canada).

The Local Organizing Committee was led by Arne Bathke (Salzburg, Austria). We are thankful to the following members of this committee for their extremely helpful and efficient organizational work during the conference: Beate Simma (Klagenfurt), Georg Zimmermann (Salzburg), Wolfgang Trutschnig (Salzburg), Gunter Spöck (Klagenfurt), and Albrecht Gebhardt (Klagenfurt). The present proceedings volume consists of four parts; the first part contains four invited papers, and the remaining three parts deal with various applications of statistics and modern data science. The first of the invited papers, presented by D. Gluck, E. Tabassi, N. Balakrishnan, and L. L. Tang, gives an overview of the use of likelihood ratios as a means to weigh forensic evidence; whether and how to use likelihood ratios has generated great interest in the forensic community. The authors discuss the details of Lindley's (1977) likelihood ratio based on the original features, thereby providing a frequentist interpretation of it, and the score-based likelihood ratio. The second of the invited papers evaluates the performance of Wilks' likelihood ratio test statistic when the number of populations in a MANOVA model is allowed to increase, where the sample size and dimension are kept fixed: S.E. Ahmed and M.R. Ahmad find the statistic to be accurate under normality assumptions, for both size and power; a serious size distortion and discernibly low power are witnessed, however, for t-distributed errors for an increasing number of populations, even for large sample size. In the third invited paper, M. Arbeiter, A. Gebhardt, and G. Spöck discuss physical and statistical pollutant dispersion models. They propose a large-scale physical particle dispersion model and a dynamic version of the well-known Gaussian plume model, based on statistical filters. The dispersion models are used to predict pollutant concentrations resulting from the emissions of a cement plant in Carinthia, Austria. To test and validate these models, they developed the R-package PDC using the CUDA framework for GPU implementation. In the last of the invited papers, Ch. Chesneau discusses the properties of a new trigonometric family of univariate distributions for statistical modelling. He derives a new flexible one-parameter trigonometric lifetime distribution, which can be viewed as a trigonometric extension of the exponential distribution. Through the Akaike and Bayesian information criteria, this new extension demonstrates a better fit for certain cancer patient data than several well-known one-parameter lifetime distributions accessible in the statistical literature. The ten contributed chapters have been arranged in three parts dealing with different aspects of statistical estimation and testing problems, design of experiments, reliability and queueing theory models, and applications on the interface between modern statistical and machine learning modelling approaches. The chapters in Part II (Design of Experiments) start with a contribution by R. Cabral-Farias, L. Pronzato, and M.-J. Rendas studying in detail the incremental construction of nested designs having good spreading properties over the d-dimensional hypercube. They propose an algorithm for the construction of fractional-factorial designs with maximum packing radius. V.B. Melas, A.A. Pleshkova, and P.V. Shpilev consider the construction of L-optimal designs for the two-dimensional exponential model that is nonlinear in its parameters.
They also provide an analytical solution of the

problem of finding the dependence between the number of support points and the values of the model's parameters. R. A. Giancristofaro, R. Ceccato, L. Pegoraro, and L. Salmaso propose an extension of permutation tests to analyze randomized block single-case designs where multivariate mixed data are observed. It takes advantage of the Nonparametric Combination (NPC) procedure, using an adequately defined combining function and test statistic, and makes it possible to tackle both two-sided and directional alternative hypotheses. In their second chapter, Cabral-Farias et al. investigate whether the efficient solutions available for the easier problem of estimation of an excursion set can help in finding solutions to the closely related percentile estimation problem. They show that estimates of the percentile obtained on designs built incrementally to estimate the probability of exceedance of the current percentile estimate converge to the correct value even when started with a poor initial design. The chapters by Cabral-Farias et al. have been contributed independently after the 10th International Workshop in Salzburg. Simulation models for lifetime, risk, and inventory analysis played an important role at the 10th IWS in Salzburg. Contributions in this direction are collected in Part III of the present volume. A. Andronov, N. Spiridovska, and D. Santalova study the problem of approximating non-exponential distributions in continuous-time Markov chain models on the basis of parametric estimation for a convolution of exponential densities. The efficiency of the considered approach is illustrated by an application to a problem in queuing theory. Zh. Zenkova, S. Tarima, W. Musoni, Yu. Dmitriev, and N.P. Alexeyeva suggest an estimator of a functional of the cumulative distribution function (cdf) modified with a known quantile. The new estimator is unbiased and asymptotically normally distributed with a smaller asymptotic variance than the estimator obtained by plugging in the empirical cdf. The new estimator is then applied to modify the ABC-XYZ analysis of a trade company's assortment, resulting in more stable inventory management. In Part IV, we have collected four chapters dealing with the interplay between statistical inference and machine learning methods, thereby demonstrating some of their applications. R. A. Giancristofaro, R. Ceccato, L. Pegoraro, and L. Salmaso present an industrial application of DOE and machine learning methods for the development of algorithms applied to fault detection problems in the Heating, Ventilation, Air Conditioning and Refrigeration (HVAC-R) industry. The preliminary DOE study aids in providing a rationale for feature selection and high-quality data for model training. D. Alagić and J. Pilz introduce an image processing algorithm to automatically extract quantitative information about microstructure characteristics, such as the size of damage patterns and the grain size distribution, from images of polycrystalline materials used in semiconductor manufacturing. A patch-based clustering approach based on features that measure a region's homogeneity is proposed; the final, pixelwise segmentation is achieved with the Seeded Region Growing (SRG) algorithm using the identified grain areas as seed points. N.P. Alexeyeva, F.S. Al-juboori, and E.P. Skurat propose a new method to generalize statistical multi-factor analysis based on reducing the dimensionality in categorical data by means of projective subspaces. The method uses algebraic normal forms, as applied to random binary vectors called super-symptoms; the authors use it to

define factors affecting breast cancer and to identify a risk group for the presence of distant metastases and for tumor spread to the lymph nodes. Finally, A. Ashofteh presents PySpark code for computationally efficient use of statistical learning and machine learning algorithms for the application scenario of personal credit evaluation, with a performance comparison of models including logistic regression, decision trees, random forests, neural networks, and support vector machines. The chapter also highlights the steps, perils, and benefits of using Big Data and machine learning algorithms in credit scoring. The chapter by Ashofteh has been contributed independently after the 10th International Workshop in Salzburg. It is our great pleasure to thank all authors of invited and contributed chapters for carefully preparing their manuscripts and submitting them for editorial processing of the present volume. We are indebted to our reviewers, chosen partly from the Scientific Program Committee. Finally, we are indebted to Mrs. Veronika Rosteck from Springer Publishing, Berlin-Heidelberg, for her relentless support.

Klagenfurt, Austria        Jürgen Pilz
St. Petersburg, Russia     Viatcheslav B. Melas
Salzburg, Austria          Arne Bathke
May 2023

Contents

Part I Invited Papers

1  Likelihood Ratios in Forensics: What They Are and What They Are Not
   Daniel Gluck, Elham Tabassi, N. Balakrishnan, and Liansheng Larry Tang .......... 3

2  MANOVA for Large Number of Treatments
   S. Ejaz Ahmed and M. Rauf Ahmad .......... 17

3  Pollutant Dispersion Simulation by Means of a Stochastic Particle Model and a Dynamic Gaussian Plume Model
   Maximilian Arbeiter, Albrecht Gebhardt, and Gunter Spöck .......... 31

4  On an Alternative Trigonometric Strategy for Statistical Modeling
   Christophe Chesneau .......... 51

Part II Design of Experiments

5  Incremental Construction of Nested Designs Based on Two-Level Fractional Factorial Designs
   Rodrigo Cabral-Farias, Luc Pronzato, and Maria-João Rendas .......... 77

6  A Study of L-Optimal Designs for the Two-Dimensional Exponential Model
   Viatcheslav B. Melas, Alina A. Pleshkova, and Petr V. Shpilev .......... 111

7  Testing for Randomized Block Single-Case Designs by Combined Permutation Tests with Multivariate Mixed Data
   Rosa Arboretti Giancristofaro, Riccardo Ceccato, Luca Pegoraro, and Luigi Salmaso .......... 127

8  Adaptive Design Criteria Motivated by a Plug-In Percentile Estimator
   Rodrigo Cabral Farias, Luc Pronzato, and Maria-João Rendas .......... 141

Part III Queueing and Inventory Analysis

9  On a Parametric Estimation for a Convolution of Exponential Densities
   Alexander Andronov, Nadezda Spiridovska, and Diana Santalova .......... 181

10 Statistical Estimation with a Known Quantile and Its Application in a Modified ABC-XYZ Analysis
   Zhanna Zenkova, Sergey Tarima, Wilson Musoni, and Yuriy Dmitriev .......... 197

Part IV Machine Learning and Applications

11 A Study of Design of Experiments and Machine Learning Methods to Improve Fault Detection Algorithms
   Rosa Arboretti Giancristofaro, Riccardo Ceccato, Luca Pegoraro, and Luigi Salmaso .......... 211

12 Microstructure Image Segmentation Using Patch-Based Clustering Approach
   Dženana Alagić and Jürgen Pilz .......... 223

13 Clustering and Symptom Analysis in Binary Data with Application
   N. P. Alexeyeva, F. S. Al-juboori, and E. P. Skurat .......... 237

14 Big Data for Credit Risk Analysis: Efficient Machine Learning Models Using PySpark
   Afshin Ashofteh .......... 245

Part I

Invited Papers

Chapter 1

Likelihood Ratios in Forensics: What They Are and What They Are Not Daniel Gluck, Elham Tabassi, N. Balakrishnan, and Liansheng Larry Tang

Abstract The 2009 National Research Council report "Strengthening Forensic Science in the United States: A Path Forward" calls for the quantification of forensic evidence. The likelihood ratio provides one quantitative approach to weigh forensic evidence and arrive at the posterior odds of determining the same source versus different sources. The two most commonly used likelihood ratios (LRs) are Lindley's [9] LR and the score-based LR. Whether and how to use these LRs has generated great interest in the forensic community. We discuss the details of Lindley's [9] likelihood ratio based on the original features and the score-based likelihood ratio. Lindley's likelihood ratio was originally studied in the Bayesian setting. We provide a frequentist interpretation of Lindley's likelihood ratio. The likelihood ratio is a function of the evidence measurements and several parameters estimated from the reference database. These parameters are the within-subject variance, the between-subject variance, and the population mean for the univariate evidence measurements. Using the glass fragment example in [9] and [4], the relationship between Lindley's likelihood ratio and these parameters is illustrated with graphical representations. Our figures also offer some explanation of Lindley's paradox [10]. The difference between Lindley's likelihood ratio and the commonly used likelihood ratio statistic in hypothesis testing is also given. In addition, the relationship between the score-based likelihood ratio and the population parameters is derived and illustrated using the glass example.

D. Gluck · L. L. Tang () National Center for Forensic Science, Department of Statistics and Data Science, University of Central Florida, Orlando, FL, USA e-mail: [email protected] E. Tabassi Image Group, Information Access Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD, USA N. Balakrishnan Department of Mathematics and Statistics, McMaster University, Hamilton, ON, Canada © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Pilz et al. (eds.), Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications, Contributions to Statistics, https://doi.org/10.1007/978-3-031-40055-1_1


1.1 Introduction

The 2009 National Research Council report, published under the auspices of the National Academy of Sciences, highlights the need for developing quantifiable measures of uncertainty in forensic analyses [11]. The likelihood ratio provides one quantitative approach to weigh the forensic evidence and arrive at the posterior odds of determining the same source versus different sources. In the context of glass fragments, it is the ratio between the likelihood function of fragments from the same source and that of fragments from different sources. The two most commonly used likelihood ratios (LRs) are the feature-based LR and the score-based LR. Whether and how to use these LRs has generated great interest in the forensic community. This is evidenced by a recent lively debate at a National Institute of Standards and Technology workshop, accessible at http://www.nist.gov/forensics/how-to-quantify-the-weightof-forensic-evidence.cfm. Parker [12] and Evett [4] are among the early papers discussing the statistical methods available to quantify the weight of forensic evidence. Their methods are based on testing the hypothesis of having two sets of glass fragment evidence from the same source versus the hypothesis of having them from different sources. A classic paper by Lindley [9] gives a Bayesian viewpoint on the ratio between the same-source probability and the different-source probability. Lindley's likelihood ratio (LR) is based on the original evidence measurements and is sometimes referred to as the "feature-based" LR. It is the ratio of the posterior odds to the prior odds, i.e., a Bayes factor [7]. It is written as LR = Pr(X, Y | Hp, I) / Pr(X, Y | Hd, I), where I is the background information related to the evidence. The Bayes factor updates the prior odds as follows:

Pr(Hp | X, Y) / Pr(Hd | X, Y) = LR × Pr(Hp) / Pr(Hd),

with the last term being the prior odds of favoring .Hp relative to .Hd without the knowledge of any evidence. Kaye [8] provides an earlier review of the relevant statistical methods including the LR. According to [7], the Bayes factor is the same as the likelihood ratio for simple hypotheses. For unknown parameters in the hypotheses, the Bayes factor has the form of the likelihood ratio by integrating the densities over the parameter space under hypotheses. Instead of integration, the regular likelihood ratio is obtained by maximizing the densities over the parameter space. The LR provides a qualitative approach to weigh the evidence and to arrive at the posterior odds of determining the same source versus different sources. If the prior odds can be assumed to be 1, the LR is the same as the posterior odds. It also uses the probabilities of evidence under the viewpoints from both the prosecution side and the defense side. The LR has gained popularity since Lindley’s paper. As a valuable tool, the likelihood ratio has been used on matching glass fragments in [1] by accounting for multivariate elemental ratios in the fragments. It has been
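To make the update concrete, the following short Python sketch (our illustration, not code from the chapter) turns a likelihood ratio and prior odds into posterior odds and a posterior probability; the numbers are purely illustrative and reuse the LLR value reported later for the glass example.

```python
def posterior_from_lr(lr, prior_odds):
    """Update prior odds with a likelihood ratio (Bayes factor) and
    return the posterior odds and the posterior probability of Hp."""
    post_odds = lr * prior_odds
    return post_odds, post_odds / (1.0 + post_odds)

# Purely illustrative: with prior odds of 1, the posterior odds equal the LR
# itself (149.19 is the LLR value from the glass example discussed below).
odds, prob = posterior_from_lr(149.19, prior_odds=1.0)
print(odds, round(prob, 4))   # 149.19 0.9933
```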



extended to address the source-matching and classification in other forensics areas such as hair, fiber, DNA profiling, handwriting, and fingerprinting. The rest of the chapter is structured as follows. Section 1.2 describes Lindley’s likelihood ratio in details and introduces a frequentist framework to derive the same expression for the likelihood ratio. Section 1.3 formally defines the score-based likelihood ratio in terms of the evidence measurements and reference population datasets. The relationship between the components in the score-based likelihood ratio and the population parameters is presented. The difference between Lindley’s likelihood ratio and the score-based likelihood ratio is discussed. Section 1.4 gives some discussion on the simplified version of Lindley’s likelihood ratio and conclusion.

1.2 Lindley’s Likelihood Ratio (LLR) 1.2.1 Notations The overall evidence available is .(E, M) = (Ec , Es , Mc , Ms ), where the subscript c denotes the crime scene and subscript s denotes the suspect, E denotes evidence measurements, and M denotes evidence materials. .Mc is sometimes referred to as the recovered evidence materials from unknown origin, and .Ms is sometimes referred to as the control evidence materials from known origin. .Mc is also referred to as source evidence materials, and .Ms is also referred to as materials on the receptor object [2]. .Ec and .Es are then referred to as recovered evidence and control evidence measurements. The entire set of evidence measurements may not be used for the weight of evidence. Suppose the categorical or continuous evidence measurements are taken to estimate the probabilities for the quantification of the evidence. In this chapter, the upper-case letters such as X and Y are used to denote random variables with probability distributions. The realization of the random variables are denoted with the lower-case letters such as x and y. The probabilities of observing these realized measurements are quantified with the probability distributions of the corresponding random variables. Let these evidence measurements be .X = (X1 , X2 , ..., Xm ) and .Y = (Y1 , Y2 , ..., Yn ), where X is either a subset or a full set of .Ec from the crime scene (recovered evidence from source objects) and Y is either a subset or a full set of .Es from the suspect. .Xi and .Yj are either univariate measurements or vectors of multivariate measurements, for .i = 1, ..., m, and .j = 1, ..., n. For example, X and Y are univariate measurements in Lindley’s paper, and they are the refractive index of the glass fragments. A univariate normal distribution is assumed for modeling X and Y separately, and a bivariate normal distribution is assumed for the joint distribution of X and Y . An example of the multivariate measurements is provided in [1], in which the natural logarithms of three elemental ratios from the glass fragments are considered. Multivariate normal distributions are assumed for the vector of these ratios for X or



Table 1.1 Evidence and estimators

               Evidence                                      Reference database
               Ec                     Es                     U
Measurements   X1, X2, ..., Xm        Y1, Y2, ..., Yn        U1, ..., UK
Sample size    m                      n                      K × R
Estimator      X̄m, Ȳn, Y′m,n, Y*m,n                          σ̂, τ̂, μ̂





Y . Besides the evidence measurements X and Y , another important dataset, ideally, is .K × R measurements .U = (U1 , ..., UK ) with .Uk = (Uk1 , Uk2 , ..., UkR )T in the reference population database. Here, the vector .Uk denotes the measurements on the kth subject, and .Ukr is the rth repeated measurement for the subject. The reference population database provides the estimates of marginal density distributions and joint distributions of X and Y . The repeated measurements on same subjects are used to estimate the within-subject variance, and measurement vectors from different subjects are used to estimate the between-subject variance. For the univariate case, Table 1.1 illustrates the observations associated with the evidence and database, as well as the estimators from these datasets.
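As a concrete illustration of how such a reference database could be turned into the estimates σ̂, τ̂, and μ̂, here is a minimal Python sketch. The one-way random-effects moment estimators used here are a standard choice and are our assumption, not the chapter's prescribed method; the sizes K = 939 and R = 8 are taken from the glass reference database used later in Table 1.3, and the simulated values are only a toy stand-in for real data.

```python
import numpy as np

def reference_estimates(U):
    """Moment estimates of (sigma, tau, mu) from a K x R reference array U:
    sigma^2 = within-subject variance (pooled over subjects),
    tau^2   = between-subject variance from a one-way random-effects decomposition,
    mu      = overall mean."""
    U = np.asarray(U, dtype=float)
    K, R = U.shape
    sigma2 = U.var(axis=1, ddof=1).mean()              # within-subject variance
    msb = R * U.mean(axis=1).var(ddof=1)               # between-subject mean square
    tau2 = max((msb - sigma2) / R, 0.0)                # truncate at zero if negative
    return np.sqrt(sigma2), np.sqrt(tau2), U.mean()

# Toy database with K = 939 subjects and R = 8 repeats each (the sizes in Table 1.3).
rng = np.random.default_rng(0)
theta = rng.normal(1.5182, 0.004, size=939)                      # between-subject spread
U = theta[:, None] + rng.normal(0.0, 0.00004, size=(939, 8))     # within-subject spread
print(reference_estimates(U))   # approximately (0.00004, 0.004, 1.5182)
```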

1.2.2 A Frequentist Framework for Lindley’s Likelihood Ratio (LLR) In forensics, LLR is used for matching the recovered evidence to a known origin. It has a different meaning from the traditional understanding of the LR used for classification or hypothesis testing. The LLR for matching is the main focus for quantifying the weight of evidence from the recovered materials and the known origin. Here, the parameter estimates are obtained from the reference database U . For the example of the refractive index of the glass fragments, the reference data U gives the mean and variance estimates for the within-window and between-window measurements. The null and alternative hypotheses are .Hp : X and Y are from the same source, vs. .Hd : X and Y are from different sources. The null hypothesis, .Hp , is that the evidence from the crime scene comes from the suspect. In other words, the evidence from the crime scene and from the suspect is from the same source. X and Y are correlated measurements. The variation of X and Y is from the within sources (say, e.g., glass fragments in the same window). This represents the prosecutor’s view. The alternative hypothesis, .Hd , is that the evidence from the crime scene does not come from the suspect. In other words, the evidence from the crime scene and from the suspect are from different sources. Under this hypothesis, X and Y are independent measurements, and the variation of X and Y is from the between sources (say, e.g., glass fragments from different windows). This represents the defendant’s view.



Suppose that the joint density functions .f (X1 , ..., Xm , Y1 , ..., Yn ) of .Xi ’s and Yj ’s take on different expressions under .Hp and .Hd . With I denoting the background information, the likelihood ratio is given by the ratio between the joint density functions,

.


.

LLR = f(X1, ..., Xm, Y1, ..., Yn | Hp, I) / f(X1, ..., Xm, Y1, ..., Yn | Hd, I).

Although we omit I in this manuscript hereafter for simplicity, the importance of the background information in constructing LLR is worth further investigation. A parametric assumption for the evidence measurements is given in [9]: .Xi ∼ N (θ1 , σ12 ) and .Yj ∼ N(θ2 , σ22 ). Selecting the normal distributions has its own advantage for illustrating how the Bayes factor is derived. Most importantly, the numerator and denominators of the Bayes factor need integration of the densities for unknown parameters in the hypotheses. By assuming normal distributions for prior and the pdf of random observations, the integration results in normal densities for both numerator and denominator. When X is taken from the same source, .σ1 gives the variability within the recovered source. Similarly, .σ2 gives the variability within the control source with a known origin. Although a strong assumption, what simplifies the derivation is to further assume that the variability within the same source stays the same, or .σ = σ1 = σ2 . When X is taken from the same source, .σ gives the variability within the recovered source as well as the variability within the control source with a known origin. The parameter representations of .Hp and .Hd are then .Hp : θ1 = θ2 and .Hd : θ1 /= θ2 . The sample means .X¯ m and .Y¯n are sufficient statistics for the true means. Lindley uses the joint distribution of sample means .X¯ m and .Y¯n to calculate the numerator and the denominator for LLR. The joint distribution can be obtained using the independence between .X + Y and .X − Y with a Jacobin transformation. The prior distribution for the means are that .θℓ ∼ N (μ, τ 2 ) for .ℓ = 1, 2, where .τ 2 is the variance between sources. .(X¯ m , Y¯n ) follow a bivariate normal distribution 2 2 = σ 2 + τ 2 and the .N(θ = (θ1 , θ2 ), v ), where the marginal variances are .v correlation is .ρ. The correlation is 1 for .X¯ m and .Y¯n from same source and 0 for different sources. .θℓ is a random variable under Lindley’s setting. It is easier to consider the results from a random effect model, in which Xi = θ1 + ɛx,i ,

.

Yj = θ2 + ɛy,i ,

(1.1)

with the random effect term, .θℓ ∼ N(μ, τ 2 ), and a random error term, .ɛ. Here, 2 .ɛx,i and .ɛy,j are i.i.d. normal random variables with mean zero and variance .σ . The random effect terms .θ1 and .θ2 are i.i.d. random variables. They introduce the correlation within the same source. .θℓ are independent of .ɛx,i and .ɛy,j . Conditional on .θ1 , it follows that .E(Xi |θ1 ) = θ1 and .var(Xi |θ1 ) = var(ɛx,i ) = σ 2 . Similarly, .E(Yj |θ2 ) = θ2 , and .var(Yj |θ2 ) = var(ɛy,j ) = σ 2 . The unconditional



mean and variance for .Xi are .E(Xi ) = E(E(Xi |θ1 )) = μ and .var(Xi ) = E(var(Xi |θ1 ))+var(E(Xi |θ1 )) = σ 2 +τ 2 . Conditional and marginal distributions for .Xi are .Xi |θ1 ∼ N(θ1 , σ 2 ) and .Xi ∼ N(μ, σ 2 + τ 2 ). Those for .X¯ m are ¯ m ∼ N(θ1 , σ 2 /m) and .X¯ m ∼ N(μ, σ 2 /m + τ 2 ). Similar results apply to the .X conditional and marginal distributions of .Yj and .Y¯n . Conditional and marginal distributions for Y are .Yj |θ2 ∼ N(θ2 , σ 2 ) and .Yj ∼ N(μ, σ 2 + τ 2 ). Those for ¯n are .Y¯n |θ2 ∼ N(θ2 , σ 2 /m) and .Y¯n ∼ N(μ, σ 2 /m + τ 2 ). .Y It follows that .θ = θ1 = θ2 under .Hp so that .X¯ m and .Y¯n are correlated through the same parameter .θ . Under .Hp , X and Y are assumed to come from the same subjects (same window). This leads to difference in .θ1 and .θ2 so that .X¯ m and .Y¯n are independent through the independence between .θ1 and .θ2 . With the random effect model, the numerator and denominator of LLR have the same expression as those in [9]:   ¯ ¯ . f (Xm , Yn |θ )f (θ )dθ = f (X¯ m |θ )f (Y¯n |θ )f (θ )dθ and  .

f (X¯ m |θ1 )f (θ1 )dθ1



f (Y¯n |θ2 )f (θ2 )dθ2 = f (X¯ m )f (Y¯n ).

The explicit forms for the numerator and denominator of LLR are given in [9] and further explored in other papers by Grove [6] and Seheult [15]. The expressions follow from the aforementioned random effect framework by deriving the unconditional means and variances of the evidence measurements under the hypotheses. The numerator is derived under .Hp . The numerator becomes the product of these two pdfs, ¯

¯

(Y ∗ )2

m −Yn ) exp(− 2σ(2X(1/m+1/n) − 2(τ 2 +σm,n 2 /(m+n)) ) ∗ ¯ m − Y¯n )f (Ym,n ) = √ . .f (X √ 2π σ (1/m + 1/n) τ 2 + σ 2 /(m + n) 2

The denominator is derived under .Hd . The denominator is the product of the unconditional normal pdfs of .X¯ m and .Y¯n , given by .

(Y¯n − μ)2 1 (X¯ m − μ)2 √ √ − ). exp(− 2(σ 2 /m + τ 2 ) 2(σ 2 /n + τ 2 ) 2π σ 2 /m + τ 2 σ 2 /n + τ 2



′ ) as the denominator, and the expression Lindley [9] uses the .f (X¯ m − Y¯n )f (Ym,n of the denominator is some algebraic transformation of the original product of unconditional pdfs:

.

(X¯ m − Y¯n )2 √ exp(− 2(σ 2 /m + σ 2 /n + 2τ 2 ) 2π σ 2 /m + τ 2 σ 2 /n + τ 2 1





′ (Ym,n − μ)2

2(σ 2 /m + τ 2 )(σ 2 /n + τ 2 )

).

The LLR from the numerator and denominator is essentially a function of sample means, sample sizes, within-source and between-source variances, and population mean. Taking the logarithm of LLR, it follows that .


log[Pr(Hp | X, Y) / Pr(Hd | X, Y)] = log LLR + log[Pr(Hp) / Pr(Hd)].

This indicates that .log LLR adds weight from the evidence to the prior odds. The difference between using LLR and conducting a hypothesis is more obvious by discussing a simplification under the assumption that .τ is much larger than .σ and the logarithm of LLR is .

log C + log[(1/√(2π)) exp(−Z²m,n/2)] + log[(1/√(2π)) exp(−W²m,n/2)] − log[(1/√(2π)) exp(−V²m,n/2)],

with C = √(2π) τ / (σ √(1/m + 1/n)), Zm,n = (X̄m − Ȳn) / (σ √(1/m + 1/n)), Wm,n = (Y*m,n − μ)/τ, and Vm,n = √2 (Y′m,n − μ)/τ for Y′m,n = (X̄m + Ȳn)/2. It is common to use Zm,n in the two-sample Z-test to compare two population means when the means are constant values. Conditional on the θ's, Zm,n is also the likelihood ratio test statistic when the two populations are assumed to have the same variance. Thus, Zm,n is a standard normal random variable conditional on known θ1 and θ2. Zm,n is derived from the likelihood ratio

max{θ = θ1 = θ2} Pr(X1, ..., Xm | θ) / max{θ1 ≠ θ2} Pr(X1, ..., Xm | θ1, θ2).

From the Neyman-Pearson lemma, the test statistic .Zm,n is the most powerful among all size .α tests. As discussed earlier, .Wm,n and .Vm,n are standard normal under .Hp and .Hd , respectively. The logarithm of LLR in terms of the normal pdf is .

log C + log φ(Zm,n ) + log φ(Wm,n ) − log φ(Vm,n ).

(1.2)

The estimate for the logarithm is obtained by using realized values from the evidence measurements. Since Y′m,n and Y*m,n are close, the value of Vm,n can be considered to be inflated by a factor of √2 relative to Wm,n. Since the departure from zero leads to



a smaller normal density value, the larger value for .|Vm,n | compared with .|Wm,n | results in the positive value for .log(φ(Wm,n )) − log(φ(Vm,n )). This means that having the W and V components in LLR always adds a positive weight to LLR. The representation in terms of the standard normal density shows clear how the change of parameters may affect the realized LLR values. Here, the density curve stays the same with a spread of approximately three standard deviations and a center of 0.
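As a numerical check on Eq. (1.2), the following Python sketch plugs in the glass-example quantities reported later in Table 1.3 and reproduces log LLR ≈ 5.005 and LLR ≈ 149.2. This is our own illustration of the formula, not code from the chapter; in particular, Y*m,n is taken here as the weighted mean (m X̄m + n Ȳn)/(m + n), which is consistent with the value 1.518463 reported in Table 1.3.

```python
from math import log, pi, sqrt, exp

def log_llr(xbar, ybar, m, n, sigma, tau, mu):
    """Simplified log of Lindley's LR, Eq. (1.2):
    log C + log phi(Z) + log phi(W) - log phi(V), phi = standard normal pdf."""
    log_phi = lambda z: -0.5 * log(2 * pi) - 0.5 * z * z
    c = sqrt(2 * pi) * tau / (sigma * sqrt(1 / m + 1 / n))
    z = (xbar - ybar) / (sigma * sqrt(1 / m + 1 / n))
    y_star = (m * xbar + n * ybar) / (m + n)     # weighted mean (assumption, see lead-in)
    y_prime = (xbar + ybar) / 2                  # simple average
    w = (y_star - mu) / tau
    v = sqrt(2) * (y_prime - mu) / tau
    return log(c) + log_phi(z) + log_phi(w) - log_phi(v)

# Glass-fragment example (Table 1.3)
val = log_llr(xbar=1.518458, ybar=1.518472, m=10, n=5,
              sigma=0.00004, tau=0.004, mu=1.5182)
print(round(val, 3), round(exp(val), 1))   # 5.005 149.2
```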

1.3 Score-Based Likelihood Ratio (SLR) The likelihood ratio based on similarity scores recently brings attention to the forensic scientists. In fingerprint, the similarity score between a questioned fingerprint and a fingerprint is generated by comparing their characteristics in the automated fingerprint identification system (AFIS) [3]. The publicly funded forensic crime laboratories in the USA received 274,000 new requests for fingerprint analysis for identifying offenders [14]. The fingerprints are processed through the AFIS system [16] to preselect the potentially matched fingerprints for further examination by fingerprint examiners. The SLR is considerably simpler to compute since the matching scores are one-dimensional. Moreover, it borrows strength from quite advanced commercialized matching algorithms which have low error rates as indicated in the NIST report [17]. In addition, as the algorithms for matching fingerprints based on minutiae configurations of the fingerprints are largely proprietary, it is easier to obtain the similarity scores than the original minutia configurations used in the algorithms. In this section, the detailed expression of the SLR is explored and compared with Lindley’s LR.

1.3.1 The Expression of the SLR The score-based likelihood ratio is SLR =

.

Pr(S(X, Y) | Hp) / Pr(S(X, Y) | Hd),

where .SX,Y = S(X1 , ..., Xm , Y1 , ..., Yn ) is the similarity score, a function of (X1 , ..., Xm ) and .(Y1 , ..., Yn ). The score-based LLR (SLR) can also be referred to as the Bayes factor since it gives a weight to prior odds

.

.

Pr(Hp | SX,Y) / Pr(Hd | SX,Y) = SLR × Pr(Hp) / Pr(Hd).



Lindley’s LLR is from the original measurements, X and Y , while the score-based LLR is based on the score, which is a distance function of the pair .(X, Y ). Under .Hp , we have the pair X and Y from the same source. The score from the ith pair of subjects, .Tp,i , i = 1, . . . , m, has a cumulative distribution function (CDF), .Fp , and the probability density function, .fp . Under .Hd , we have the pair X and Y from different sources. The score from the j th pair of subjects, .Td,j , j = 1, . . . , n, has a CDF, .Fd , and the pdf, .fd . With the CDF representation, the score-based likelihood ratio is written as SLR =

.

fp(SX,Y) / fd(SX,Y).

The observed evidence measurements are realized values of X and Y . Due to usually small sample sizes of X and Y , the reference population database is valuable for the estimation of the CDFs and pdfs. Unlike Lindley’s LLR which uses the estimates from the original measurements in the database, the SLR uses the scores calculated from the database for the estimation of parameters and the distributions. The scores T .Tp,i are calculated from within-subject measurements .Uk = (Uk1 , Uk2 , ..., UkR ) . The scores .Td,j are calculated using measurements from different subjects in the reference population database. Assuming normal distributions for .Tp,i and .Td,j , mean and variance parameters are estimated from these scores. Suppose that .Tp,i ∼ N (μp , σp2 ) and .Td,j ∼ N(μd , σd2 ). The relationship between similarity scores, evidence, and reference database is presented in Table 1.2. The SLR takes the following form: fp (SX,Y ) .SLR = = fd (SX,Y )

[(1/(√(2π) σp)) exp(−(SX,Y − μp)²/(2σ²p))] / [(1/(√(2π) σd)) exp(−(SX,Y − μd)²/(2σ²d))].

The score function depends on the matching algorithm. Assume the score function takes on a simple form: .S(X1 , ..., Xm , Y1 , ..., Yn ) = X¯ m − Y¯n . Because of no mean difference from the same source, .μp = 0. After some calculation, the SLR becomes SLR =

.

fp(SX,Y) / fd(SX,Y) = [σd exp(−(X̄m − Ȳn)²/(2σ²p))] / [σp exp(−(X̄m − Ȳn − μd)²/(2σ²d))].

Table 1.2 Evidence and estimators in SLR

               Evidence                                           Reference database
               Ec                        Es                       U
Measurement    X1, X2, ..., Xm           Y1, Y2, ..., Yn          U1, ..., UK
Score          S(X1, ..., Xm; Y1, ..., Yn) = X̄m − Ȳn             Tp,i, Td,j
Estimator                                                         σ̂p, σ̂d, μ̂p, μ̂d



Subsequently, the logarithm of SLR is given by .

log(SLR) = log C′ + log φ(Z′m,n) − log φ(W′m,n),

where C′ = σd/σp, Z′m,n = (X̄m − Ȳn)/σp, and W′m,n = (X̄m − Ȳn − μd)/σd. Given the simple expression of the specified score function, the variances σ²d and σ²p have explicit forms as functions of the within-subject variance σ² and the between-subject variance τ²: σ²p = var(X̄m − Ȳn) = σ²(1/m + 1/n) and σ²d = σ²(1/m + 1/n) + 2τ². The mean of X̄m − Ȳn is μ̂p = 0 when Hp is true, since the X's and Y's are repeated observations from the same subject, and μ̂d = μX − μY when Hd is true. Thus, as functions of σ and τ, we can write the components in the SLR as C′ = √(σ²(1/m + 1/n) + 2τ²) / (σ √(1/m + 1/n)), Z′m,n = (X̄m − Ȳn)/(σ √(1/m + 1/n)), and W′m,n = (X̄m − Ȳn − μd) / √(σ²(1/m + 1/n) + 2τ²). Comparing with Lindley's LLR, it is clear that the information from Wm,n and Vm,n is not included in the score-based LLR. This means that the SLR accounts for the similarity between the recovered evidence and the evidence from the known origin, but does not account for the rarity of these evidence measurements. Here, the rarity measure in terms of our notation is estimated from the difference between the weighted sample mean of all evidence means and the overall population mean.
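A companion Python sketch of the score-based ratio under the simple score S = X̄m − Ȳn is given below, using the variance expressions just stated. The reference means μX = 1.5160 and μY = 1.5162 are the values quoted in Sect. 1.3.2 from [5], and μd is taken as μX − μY, as in the text; the printed value is only what this sketch produces under these assumptions, not a number reported in the chapter.

```python
from math import log, pi, sqrt

def log_slr(xbar, ybar, m, n, sigma, tau, mu_x, mu_y):
    """log SLR = log C' + log phi(Z') - log phi(W') for the score S = Xbar - Ybar."""
    log_phi = lambda z: -0.5 * log(2 * pi) - 0.5 * z * z
    s = xbar - ybar
    sigma_p = sigma * sqrt(1 / m + 1 / n)                        # sd of the score under Hp
    sigma_d = sqrt(sigma ** 2 * (1 / m + 1 / n) + 2 * tau ** 2)  # sd of the score under Hd
    mu_d = mu_x - mu_y                                           # mean of the score under Hd
    z = s / sigma_p
    w = (s - mu_d) / sigma_d
    return log(sigma_d / sigma_p) + log_phi(z) - log_phi(w)

# Glass example inputs (Table 1.3) with the reference means 1.5160 and 1.5162 from [5]
print(round(log_slr(1.518458, 1.518472, 10, 5, 0.00004, 0.004, 1.5160, 1.5162), 2))
# about 5.35 under these assumptions
```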

1.3.2 The Glass Example We use a well-known glass fragment example from [13] to illustrate the relationship between the population parameters and values of the score-based likelihood ratio. This example has been studied extensively by Evett [4] and Lindley [9]. It has the evidence measurements from glass fragments and the necessary parameter estimates. The context of the example is related to broken glasses at a crime scene. A univariate refractive index was measured from m glass fragments from the crime scene. Also, the n measurements were taken from glass fragments on a known origin. Given that there might be other types of evidence materials, the entire collection of evidence materials, .Mc , are from the crime scene, and .Ms are from the known origin. From the entire collection of evidence, the measurements on the glass fragments are .X = (X1 , X2 , ..., Xm ) from the crime scene and .Y = (Y1 , Y2 , ..., Yn ) from the known origin. From [5], we expect .μX = 1.5160 and .μY = 1.5162 when they are from different sources. As can be seen from the table, since the withinsource variance is much larger than between-source variance, it leads to a much larger variance for scores under .Hd than the one for scores under .Hp (Table 1.3). First, the relationship is shown between the logarithm of SLR and the withinsubject standard deviation, .σ . .log SLR is calculated when .σ is varied from .0.00004/10 to .0.00004 × 10, while observed evidence measurements and the other parameters are fixed at their given values. Here, all the terms in .log SLR are functions of .σ . The upper panel gives the curve of .log SLR versus .σ . When .σ is near



Table 1.3 Glass evidence data

               Evidence                                               Reference database
               Ec                        Es                           U
Measurement    x1, x2, ..., x10          y1, y2, ..., y5              U1, ..., U939
Sample size    m = 10                    n = 5                        939 × 8
Estimate       x̄m,n = 1.518458, ȳm,n = 1.518472,                      σ̂ = 0.00004, τ̂ = 0.004,
               y′m,n = 1.518465, y*m,n = 1.518463                     μ̂ = 1.5182
LLR            C = 457.6456, Zm,n = −0.639, Wm,n = 0.0657, Vm,n = 0.0937,
               LLR = 149.1903, log LLR = 5.005

zero, .log SLR takes on small values. .log SLR reaches its peak and then decreases as .σ increases. The major difference between these two likelihood ratios is that for small .σ , .log SLR takes on negative values, indicating much higher likelihood of observing the evidence from different sources rather than the same source. On the other hand, .log SLR always indicates higher likelihood of observing the evidence from the same sources due to its positive values. The middle panel shows the ′ , and .W ′ ′ curves of .C ′ , .Zm,n m,n versus .σ . The lower panel shows the curves of .log C , ′ ′ ′ ′ .log φ(Zm,n ), .log φ(Wm,n ), and .logφ(Zm,n ) − log φ(Wm,n ) versus .σ . A larger .σ leads to a smaller .C ′ and a smaller .log C ′ . .log C ′ contributes to a smaller weight in ′ . A larger .σ value SLR for larger .σ values. We have .σ on the denominator of .Zm,n ′ ′ reduces the absolute value for .Zm,n . Same as in Lindley’s likelihood ratio, .Zm,n is ′ negative and approaches 0 for larger .σ values. .log φ(Zm,n ) is a concave curve, and ′ also decreases, it stays constant as .σ gets larger enough. The absolute value of .Wmn but does not approach zero due to the fact that the unchanging between-source variation .τ dominates the denominator. The values of .log SLR have implications in the interpretation of the evidence. The evidence strongly supports .Hd in the range of very small .σ . As .σ gets larger, .log SLR gets above 0, and the evidence strength becomes “very strong” against .Hd when .σ = 0.000025. As .σ gets even larger, the evidence strength decreases to “strong” (Fig. 1.1). Second, the relationship is shown between the logarithm of SLR and the between-subject standard deviation, .τ . .log SLR is calculated when .τ is varied from .0.004/10 to .0.004 × 10. The upper panel gives the convex curve of .log SLR versus .τ showing that .log SLR is monotonically increasing as .τ increases. This is similar to the relationship between .log LLR and .τ . The middle panels show the curves ′ of each component of .log SLR versus .τ . .C ′ is positively related to .τ , and .Wm,n ′ ′ is negatively related to .τ . .Zm,n is not a function of .τ . The closeness of .Wm,n to zero leads to a higher density value on the normal curve. The lower panel also ′ )) − log(φ(W ′ )) versus .τ . It shows a sharp decline in the range plots .log(φ(Zm,n m,n of small .τ and an almost constant trend for larger .τ values. Similar to Lindley’s likelihood ratio, again, the dominant component in the SLR is the constant term .C ′ which gives stronger evidence against .Hp as between-subject variance gets larger. The weight of evidence is always “strong” against .Hd . As .τ gets larger, the evidence strength against .Hd gets “very strong” when .τ > 0.003 (Fig. 1.2).

[Fig. 1.1 This figure shows the terms in log SLR versus σ. Upper panel: logarithm of the score-based likelihood ratio vs. σ. Middle panel: C′, Z′m,n, and W′m,n vs. σ. Lower panel: log(C′), log(φ(Z′m,n)), log(φ(W′m,n)), and log(φ(Z′m,n)) − log(φ(W′m,n)) vs. σ]

[Fig. 1.2 This figure shows the terms in log SLR versus τ. Upper panel: logarithm of the score-based likelihood ratio vs. τ. Middle panel: C′, Z′m,n, and W′m,n vs. τ. Lower panel: log(C′), log(φ(Z′m,n)), log(φ(W′m,n)), and log(φ(Z′m,n)) − log(φ(W′m,n)) vs. τ]




1.4 Discussion

If only one observation is available, it is impossible to estimate the variance. Even with a few observations available, the sample variances for the two groups' observations may not be good estimators of the true variances. Hypothesis testing based on the sample means and sample variances may then still not be applicable. This makes hypothesis testing infeasible, and the route via Bayesian analysis, borrowing information on population parameters from reference population datasets, may be inevitable for the purpose of quantifying the weight of evidence. The multiple available observations in the glass example allow two-sample hypothesis testing. However, comparing sample means may not account for the rarity of the evidence from the known and unknown origins, as indicated by the figures. The additional information captured by Lindley's likelihood ratio should be considered as well.

References 1. Aitken, C.G., Lucy, D.: Evaluation of trace evidence in the form of multivariate data. J. R. Stat. Soc. Ser. C (Appl. Stat.) 53(1), 109–122 (2004) 2. Aitken, C.G., Taroni, F.: Statistics and the Evaluation of Evidence for Forensic Scientists, vol. 16. Wiley Online Library, Hoboken (2004) 3. Egli, N.M., Champod, C., Margot, P.: Evidence evaluation in fingerprint comparison and automated fingerprint identification systems? Modelling within finger variability. Forensic Sci. Int. 167(2), 189–195 (2007) 4. Evett, I.: The interpretation of refractive index measurements. Forensic Sci. 9, 209–217 (1977) 5. Evett, I., Lambert, J.: The interpretation of refractive index measurements. III. Forensic Sci. Int. 20(3), 237–245 (1982) 6. Grove, D.: The interpretation of forensic evidence using a likelihood ratio. Biometrika 67(1), 243–246 (1980) 7. Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90(430), 773–795 (1995) 8. Kaye, D.: Statistical evidence of discrimination. J. Am. Stat. Assoc. 77(380), 773–783 (1982) 9. Lindley, D.: A problem in forensic science. Biometrika 64(2), 207–213 (1977) 10. Lindley, D.V.: A statistical paradox. Biometrika 44(1/2), 187–192 (1957) 11. National Research Council. Strengthening forensic science in the united states: A path forward (2009) 12. Parker, J.: A statistical treatment of identification problems. J. Forensic Sci. Soc. 6(1), 33–39 (1966) 13. Pearson, E., Dabbs, M.: Some physical properties of a large number of window glass specimens. J. Forensic Sci. 17(1), 70–78 (1972) 14. Peterson, J.L., Hickman, M.J.: Census of publicly funded forensic crime laboratories. US Department of Justice, Office of Justice Programs, Bureau of Justice Statistics Washington, DC (2002) 15. Seheult, A.: On a problem in forensic science. Biometrika 65(3), 646–648 (1978) 16. Ulery, B.T., Hicklin, R.A., Buscaglia, J., Roberts, M.A.: Accuracy and reliability of forensic latent fingerprint decisions. Proc. Natl. Acad. Sci. 108(19), 7733–7738 (2011) 17. Wilson, C., Hicklin, R.A., Korves, H., Ulery, B., Zoepfl, M., Bone, M., Grother, P., Micheals, R., Otto, S., Watson, C., et al. Fingerprint vendor technology evaluation 2003: summary of results and analysis report. NIST Res. Report 7123, 9–11 (2004)

Chapter 2

MANOVA for Large Number of Treatments S. Ejaz Ahmed and M. Rauf Ahmad

Abstract The performance of Wilks’ likelihood ratio test statistic .Λ is evaluated when the number of populations in a MANOVA model is allowed to increase, where the sample size and dimension are kept fixed. For simplicity, only one-way MANOVA model under homoscedasticity is considered, although for both balanced and unbalanced cases. Apart from the usual normality assumption, the error vectors in the model are also allowed to follow a t distribution, in order to assess the statistic for its robustness to normality. Whereas the statistic is found to be accurate under normality assumption, for both size and power, a serious size distortion and discernably low power are witnessed for t distribution for an increasing number of populations, even for large sample size. Finally, as a special case, the univariate .F statistic for ANOVA model is also evaluated and compared with two of its recent modifications typically introduced for a large number of populations.

2.1 Introduction Multivariate analysis of variance (MANOVA) is one of the most fundamental techniques in multivariate analysis. It extends univariate ANOVA model for multiple response variables and is one of the two basic partitions of multivariate model, the other being multivariate regression. Multiple test statistics are available for inference, with Wilks’ likelihood ratio statistic .Λ being the most popular. Under certain conditions, mainly independence and multivariate normality of error vectors, exact distributions of .Λ (or its functions) are available [2, 12]. Under the violations of assumptions, e.g., under non-normality, robust, asymptotic, or re-sampling-based methods are available to extend the exact cases. The asymptotic extension here

S. Ejaz Ahmed Department of Mathematics and Statistics, Brock University, St. Catharines, ON, Canada M. Rauf Ahmad () Department of Statistics, Uppsala University, Uppsala, Sweden e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Pilz et al. (eds.), Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications, Contributions to Statistics, https://doi.org/10.1007/978-3-031-40055-1_2

17

18

S. Ejaz Ahmed and M. Rauf Ahmad

particularly refers to a large number of replications per sample or possibly the overall sample size in the experiment. With the recent advent of high-dimensional era, where the dimension of data vector is quite large, often exceeding the sample size, there have been several attempts to modify the classical theory which simply collapses for high-dimensional case. The limiting distributions of the modified statistics are mostly evaluated by allowing both sample size and dimension to grow simultaneously, with certain joint rate of convergence. In fact, MANOVA models, including Hotelling’s .T2 statistic as special case for two populations, were the first to be modified under high-dimensional data. Although much work has been done in this case, the highdimensional modification is still a topic of much interest in modern statistical literature and will presumably remain so for a long time to come. For some such extensions, see, e.g., [1, 5, 6, 11, 15]. For further references and general treatment of MANOVA model from both large sample and high dimension perspective, see [7]. However, for rank-deficient linear models such as MANOVA, there is still another aspect of interest to be considered. Although related to the presence of big data, but in a different perspective, there are situations, e.g., in network data modeling, graphical analysis, or agriculture (typically uniformity trials), when a large number of independent groups with multivariate data structure are available. Although the within-group structure of data is not very high-dimensional, the number of groups or populations, henceforth denoted as g, grows rapidly. Since the classical MANOVA model assumes fixed g, the validity of its analysis for increasing g needs to be evaluated. This is the subject of this study. In fact, the same situation, that of a large number of treatments, can also happen in univariate ANOVA models. This puts the analysis of such models in a unique perspective, since for less than full rank linear models like ANOVA or MANOVA, the case of large number of populations, e.g., .g → ∞ for the present scenario, is tantamount to large number of parameters in the model. This, therefore, can be considered as another form of high dimensionality although, unlike full rank linear models like univariate or multivariate regression, the populations, hence parameters, are independent in such models. We are interested to evaluate the classical MANOVA model for an increasing number of populations. As a preliminary investigation, and in order to focus primarily on the objective, we do a simulation-based evaluation, restricting it from certain perspectives, e.g., we do not take heteroscedasticity into account, although we do allow unbalanced model. Further, we restrict the evaluation to one-way model only, with the intention to extend it to other cases later. A detailed study of ANOVA .F statistic in this direction, for one- and multi-way models, is given in [3], with a more recent investigation in [14]. For some other, particularly nonparametric modifications, we refer to [16, 17], and the references cited in [3]. One- and multi-way MANOVA models, including nonparametric variants, are considered in [4, 8, 9]. After introducing the notational setup and model formulation in the next section, a detailed simulation study on the evaluation of the test statistic is given in Sect. 2.3.

2 MANOVA for Large Number of Treatments

19

It includes the special case of univariate ANOVA model, where a comparison of classical .F statistic with two of its recent competitors is also given. Section 2.4 provides a brief summary and future outlook on the subject.

2.2 Notations and Model Setup MANOVA models make a special class of the multivariate general linear model (MGLM), Y = XB + U,

.

(2.1)

where .Y = (Y′1 , . . . , Y′g )′ ∈ Rn×p is the response matrix, .U = (U′1 , . . . , U′g )′ ∈ Rn×p is the corresponding matrix of unobservable errors, .B ∈ Rq×p is the matrix of unknown parameters, and .X ∈ Rn×q is the design (model) matrix. The g matrices composing .Y and .U correspond to the g independent populations, where a random Ʃg sample of .ni units is drawn from the ith population, with . i=1 ni = n. Then .Yi = (y′i1 , . . . , y′ini )′ ∈ Rni ×p and .Ui = (u′i1 , . . . , u′ini )′ ∈ Rni ×p , where .yij ∈ Rp is a response vector measured on j th individual in ith population, .j = 1, . . . , ni , p .i = 1, . . . , g, and .uij ∈ R is the corresponding vector of unobservable errors. Assuming the same design matrix .X across all p responses in each i, this setup is a straightforward p-variate extension of univariate general linear model. Our main interest in MGLM, as in the univariate case, is to do inference for .B. The estimation of .B follows by the multivariate least-squares criterion, with its optimality, under minimal assumptions, by the multivariate extension of the Gauss-Markov theorem. Adding multivariate normality, we can proceed for testing, where the MGLM allows a variety of testable hypothesis, .H0 : LBM = D, where mostly .D = 0, with .r(Ll×q ) = l ≤ q, .r(Mp×m ) = m ≤ p, and .r(·) denotes the rank of a matrix. Under homoscedasticity assumption, .Cov(uij ) = Ʃ .∀ i.j , leading, under normality, to .U ∼ Nn,p (O, Ω) .⇒ .Y ∼ Nn,p (XB, Ω). This setup holds for .r(X) = r ≤ q, where .r(X) = r < q for MANOVA models which are the object of interest here. The analysis of MGLM can be generally based on two matrix quadratic forms, denoted .H and .E and pertaining, respectively, to the hypothesis under consideration and error. In particular, .E = Y′ (In −P)Y, with .P = X(X′ X)− X as unique projection matrix, where .(·)− can be replaced with regular inverse if .r(X) = q. In either case,  = E/(n − r) as the unique REML estimator of .Ʃ. the uniqueness of .P ensures .Ʃ This leaves us with the component of main interest, .B, whose estimator for MANOVA models is not unique, but, parallel to the univariate case, under certain (mostly sum-to-zero) constraints, certain linear functions of elements of .B are estimable and testable. As we shall only restrict our attention to one-way MANOVA model, we leave the details aside; see, e.g., [13] or [12]. A one-way MANOVA model corresponds to MGLM (2.1) when a vector of p responses are observed on each of .ni independent units in the ith sample,

20

S. Ejaz Ahmed and M. Rauf Ahmad

i = 1, . . . , g, with samples taken independently from g populations. The model is formulated as

.

yij = μ + τ i + uij ,

(2.2)

.

where .yij = (Yij 1 , . . . Yijp )′ is the p-dimensional vector of responses measured on j th unit in ith sample, .μ ∈ Rp is the overall average, .τ i ∈ Rp denotes the effect of ith population, and .uij ∈ Rp is the corresponding error vector, .j = 1, . . . , ni , .i = 1, . . . g. The estimability of Model (2.2) is ensured under certain constraints, Ʃg Ʃg mostly setting . i=1 ni τ i = 0 or, for balanced case, . i=1 τ i = 0. Whereas Model (2.2) is flexible from conceptual and interpretational perspective, the mathematical treatment is more convenient by considering its non-singular version, yij = μi + uij ,

(2.3)

.

with .μi = μ + τ i .∀ i. In this form, the model corresponds to the MGLM in (2.1) with X = ⊕i=1 1ni , B = (μ′1 , μ′2 , . . . , μ′g )′

.

g

Ʃg where .1ni is a vector of 1s, . i=1 ni = n, and .⊕ denotes the Kronecker sum. A similar correspondence for Model (2.2) with MGLM can also be formulated. Given this setup, a test of equivalent hypotheses .H0 : τ i = 0 .∀ i .⇔ .H0 : μi = μ .∀ i, against .H1 : Not .H0 , can be carried out by a multiple of test statistics [see, e.g., 10, 12], the most frequent among them being the likelihood ratio statistic, Wilks’ .Λ, defined as  t   1 |E| −1 −1 , = |I + E H| = .Λ = 1 + λs |E||I + E−1 H|

(2.4)

s=1

where .λs are the eigenvalues of .E−1 H, with .t = min(p, g − 1) = .r(H). The .H and .E matrices are as discussed for MGLM above and reduce for Model (2.3) to H=

g 

.

ni (Yi − Y)(Yi − Y)′ ,

E=

(Yij − Yi )(Yij − Yi )′ ,

i=1 j =1

i=1

with .Yi =

g  ni 

Ʃn i

j =1 Yij /ni

and .Y =

Ʃg i=1

Ʃn i

j =1 Yij /n

as unbiased estimators of .μi

and .μ, respectively. Since .H = O .⇒ .Λ = 1 if .Yi = Y .∀ i, it clues to the fact that small values of .Λ, implying deviations from .H0 , must lead to the rejection of null hypothesis. As exact distributions of .Λ exist for only a few restricted values of p and g [see 10, p. 303], some approximations are often used for general parameter values. A common chi-squared approximation, including Bartlett’s correction factor, is

2 MANOVA for Large Number of Treatments

.

 − n−1−

p+g ln Λ 2

21 2 ≈ χp(g−1) .

An .F-approximation, also frequently used, is given as .

F=

1 − Λ1/r f2 , Λ1/r f1

(2.5)

with .f1 = pfH , .f2 = wr − (pfH − 2)/2, .w = fE + fH − (p + fH + 1)/2, and / r=

.

p2 fH2 − 4 p2 + fH2 − 5

,

where .fE and .fH are error and hypothesis degrees of freedom, respectively. Note that, for one-way MANOVA, .fE = n − g and .fH = g − 1, for which w reduces to .n − 1 − (p + g)/2, which is the same Bartlett correction factor used for the chi-squared approximation above. Further, it can be easily seen that, for .g → ∞, .r → p. We, however, adjourn the detailed theoretical analysis of the statistic aside for another project. As mentioned above, the .Λ statistic, or its approximations, are based on three assumptions: (i) normality of g distributions, (ii) independence of the same distributions, and (iii) homoscedasticity, i.e., .Cov(uij ) = Ʃ .∀ i, j . Our evaluation of the statistic takes care of all three assumptions, except that we also evaluate the statistic under t distribution in order to assess its robustness to normality; see the next section for details.

2.3 Simulations 2.3.1 MANOVA Tests for Large g We evaluate Wilks’ .Λ statistic, or its .F-approximation, by running a MANOVA model with .g = 3c independent populations, where .c = 1, . . . , 5, so that .g ∈ {3, 9, 27, 81, 243, 729}. For balanced case, we generate a random sample of .ni ∈ {10, 20, 50} p-dimensional vectors for each population, .i = 1, . . . , g, where .p ∈ {10, 20}. So, for example, for .g = 9 and .ni = 20, a total of .n = 180 random vectors each of size 10 or 20 are generated. For error covariance matrix, .Ʃ, we assume either identity or compound symmetric structure, i.e., .Ʃ = I and .Ʃ = (1 − ρ)I + ρJ, with .J a matrix of 1s. We use .ρ = 0.5. For unbalanced case, we make three subsets of g and assign 1/10th of total sample size, n, to the first third subset, 3/10th to the second subset, and 6/10th to the third subset, keeping n the same as for the corresponding balanced case. So, for example, for .g = 9 with a total sample size .n = 180, as above, we divide .ni = 18 equally (6 replications per treatment) for .i = 1, 2, 3, likewise .ni = 54 = 18 × 3 for

22

S. Ejaz Ahmed and M. Rauf Ahmad

i = 4, 5, 6, and .ni = 108 = 36×3 for .i = 7, 8, 9. For both balanced and unbalanced cases, the condition .n − g > p is ensured for all sample sizes, as required for the validity of the test statistic. Finally, we report results only for .α = 0.05, for both size and power, because of similarity of results for other nominal levels. Likewise, we also omit results for other parameters, e.g., AR(1) covariance structure, due to the same reason. The reported results are estimated size and power, averaged over 1000 simulation runs. Tables 2.1, 2.2, 2.3, and 2.4 report results for normal distribution, where Tables 2.1 and 2.2 are, respectively, on estimated size and power for balanced case and Tables 2.3 and 2.4 similarly for unbalanced case. Further, the upper panel of each table provides results for identity covariance structure and the lower panel for compound symmetric structure. Tables 2.5, 2.6, 2.7, and 2.8, with the same order and description, are for t distribution. For size, we observe an almost coherent pattern under normality, for all parameter settings including both covariance structures and three sample sizes. However, a serious size distortion is observed for t distribution, for increasing g, and this behavior is discernably more pronounced for unbalanced case, where the highly

.

Table 2.1 Estimated size for normal distribution, balanced model, with Ʃ as identity (upper panel) and compound symmetric (lower panel)

Table 2.2 Estimated power for normal distribution, balanced model, with Ʃ as identity (upper panel) and compound symmetric (lower panel)

g

p:

3 9 27 81 243 3 9 27 81 243

g 3 9 27 81 243 3 9 27 81 243

p:

n1 = 10 10 20 0.053 0.052 0.044 0.053 0.060 0.048 0.054 0.051 0.055 0.050 0.053 0.043 0.048 0.077 0.051 0.043 0.058 0.061 0.058 0.048

n2 = 20 10 20 0.051 0.038 0.045 0.056 0.048 0.048 0.053 0.043 0.048 0.047 0.050 0.047 0.058 0.045 0.056 0.046 0.040 0.051 0.051 0.046

n3 = 50 10 20 0.043 0.056 0.050 0.049 0.046 0.065 0.055 0.053 0.057 0.053 0.047 0.047 0.061 0.047 0.053 0.057 0.054 0.051 0.052 0.046

n1 = 10 10 20 0.125 0.095 0.220 0.275 0.464 0.619 0.853 0.959 0.905 0.981 0.074 0.082 0.148 0.139 0.263 0.296 0.509 0.618 0.793 0.808

n2 = 20 10 20 0.319 0.367 0.552 0.733 0.901 0.987 1.000 1.000 1.000 1.000 0.181 0.185 0.318 0.383 0.599 0.720 0.952 0.994 0.999 1.000

n3 = 50 10 20 0.840 0.948 0.989 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.501 0.647 0.816 0.926 0.997 1.000 1.000 1.000 1.000 1.000

2 MANOVA for Large Number of Treatments Table 2.3 Estimated size for normal distribution, unbalanced model, with .Ʃ as identity (upper panel) and compound symmetric (lower panel)

Table 2.4 Estimated power for normal distribution, unbalanced model, with .Ʃ as identity (upper panel) & compound symmetric (lower panel)

Table 2.5 Estimated size for T distribution, balanced model, with .Ʃ as identity (upper panel) and compound symmetric (lower panel)

g

p:

3 9 27 81 243 3 9 27 81 243

g

p:

3 9 27 81 243 3 9 27 81 243

23 .n1

= 10 10 20 0.049 0.048 0.052 0.051 0.057 0.039 0.060 0.060 0.054 0.043 0.044 0.063 0.049 0.050 0.041 0.048 0.060 0.050 0.045 0.058

.n2

= 20 10 20 0.044 0.053 0.053 0.043 0.048 0.065 0.048 0.051 0.054 0.045 0.056 0.058 0.043 0.048 0.047 0.048 0.056 0.051 0.040 0.044

.n3

.n1

= 10 10 20 0.214 0.173 0.992 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.078 0.068 0.281 0.172 0.997 0.967 1.000 1.000 1.000 1.000

.n2

= 20 10 20 0.560 0.660 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.104 0.068 0.611 0.456 1.000 1.000 1.000 1.000 1.000 1.000

.n3

= 10 10 20 0.051 0.045 0.048 0.031 0.054 0.037 0.091 0.094 0.194 0.245 0.039 0.049 0.035 0.035 0.042 0.037 0.061 0.089 0.174 0.256

.n2

= 20 10 20 0.052 0.048 0.055 0.033 0.032 0.046 0.071 0.076 0.160 0.179 0.049 0.041 0.056 0.049 0.026 0.038 0.064 0.063 0.170 0.179

.n3

.n1

g 3 9 27 81 243 3 9 27 81 243

p:

= 50 10 20 0.049 0.057 0.046 0.059 0.043 0.069 0.061 0.045 0.049 0.051 0.069 0.044 0.055 0.057 0.046 0.057 0.067 0.043 0.050 0.057

= 50 10 20 0.989 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.282 0.190 0.993 0.944 1.000 1.000 1.000 1.000 1.000 1.000

= 50 10 20 0.044 0.045 0.040 0.059 0.049 0.057 0.058 0.043 0.140 0.133 0.051 0.049 0.039 0.048 0.043 0.045 0.046 0.064 0.113 0.145

liberal estimation of the nominal level begins already for .g = 9, in some cases even for .g = 3. Further, this pattern is similar for t distribution under both covariance structures. We notice that the overestimation goes as high as almost six times the

24 Table 2.6 Estimated power for T distribution, balanced model, with .Ʃ as identity (upper panel) and compound symmetric (lower panel)

Table 2.7 Estimated size for T distribution, unbalanced model, with .Ʃ as identity (upper panel) and compound symmetric (lower panel)

Table 2.8 Estimated power for T distribution, unbalanced model, with .Ʃ as identity (upper panel) and compound symmetric (lower panel)

S. Ejaz Ahmed and M. Rauf Ahmad

g

p:

3 9 27 81 243 3 9 27 81 243

g

p:

3 9 27 81 243 3 9 27 81 243

.n1

= 10 10 20 0.124 0.100 0.244 0.340 0.627 0.790 0.906 0.939 0.991 1.000 0.104 0.091 0.132 0.174 0.347 0.454 0.791 0.865 0.897 0.998

.n2

= 20 10 20 0.342 0.394 0.630 0.784 0.915 0.977 0.971 0.988 1.000 1.000 0.195 0.199 0.349 0.453 0.672 0.816 0.943 0.963 1.000 1.000

.n3

.n1

= 10 10 20 0.057 0.061 0.080 0.079 0.110 0.178 0.207 0.247 0.275 0.275 0.073 0.059 0.082 0.080 0.115 0.161 0.238 0.253 0.257 0.304

.n2

= 20 10 20 0.075 0.060 0.083 0.102 0.134 0.166 0.230 0.255 0.264 0.307 0.057 0.067 0.079 0.089 0.135 0.175 0.228 0.251 0.280 0.286

.n3

= 10 10 20 0.233 0.206 0.985 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.090 0.086 0.296 0.272 0.968 0.844 0.996 0.990 1.000 1.000

.n2

= 20 10 20 0.570 0.735 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.128 0.114 0.592 0.452 1.000 0.996 1.000 0.997 1.000 1.000

.n3

.n1

g 3 9 27 81 243 3 9 27 81 243

p:

= 50 10 20 0.843 0.948 0.993 1.000 0.999 1.000 0.994 0.999 1.000 1.000 0.509 0.696 0.844 0.945 0.988 0.997 0.994 0.993 1.000 1.000

= 50 10 20 0.053 0.053 0.063 0.076 0.103 0.141 0.199 0.244 0.264 0.279 0.059 0.068 0.057 0.069 0.102 0.140 0.190 0.226 0.240 0.273

= 50 10 20 0.985 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.292 0.215 0.991 0.941 1.000 1.000 0.999 1.000 1.000 1.000

nominal level by .g = 243. In fact, the trend was witnessed to worsen further for g = 729, which is not reported here. The size distortion of t is obviously also reflected in power estimation, in the sense that reasonably higher power values, and quickly approaching 1, as

.

2 MANOVA for Large Number of Treatments

25

compared to the case under normality, can be attributed to this size issue. For normal distribution, the power trend shows an improvement with increasing g. The main difference is between the covariance patterns, where the results are reasonably better under identity structure. Interestingly, the power under normal distribution is much better for unbalanced model than for balanced model. It may perhaps pertain to some treatments getting much larger portion (6/10) of the total sample size. In short, the MANOVA statistic seems to perform well for both size and power when the errors are assumed to follow a normal distribution, and this performance is good for even moderate sample size and different covariance structure, although much better for identity structure, i.e., when the errors are essentially independent. For errors following a t distribution with moderate degrees of freedom, however, the statistic does not seem to perform well, and its performance particularly worsens for an increasing number of populations.

2.3.2 Special Case: ANOVA for Large g For .p = 1, the MGLM in (2.1) reduces to the univariate GLM, .y = Xβ + u, so that the one-way ANOVA model, parallel to (2.2), follows as yij = μi + uij

.

(2.6)

where .yij is the response variable measured on j th unit under ith population, .μi = μ + τi is the mean of ith population Ʃgwith .τi as the effect of ith population, .j = , . . . , ni , .i = 1, . . . , g, and .n = i=1 ni . To test the equality of means (or no treatment effect), i.e., H0 : μ1 = . . . μg vs. H1 : Not H0 ,

.

the well-known .F-statistic (here denoted .T1 for comparison purposes with other statistics), as the ratio of mean squares for the effect (treatment) and error, .

T1 =

MST , MSE

(2.7)

Ʃg is used, with .MST = SST/.ft , .MSE = SSE/.fe , SST = .n i=1 (y i. − y .. )2 , SSE = Ʃg Ʃn 2 . j =1 (yij − y i. ) , where .ft = g − 1, .fe = g(n − 1) are the corresponding i=1 degrees of freedom. The two sums of squares are reduced versions of .H and .E of MANOVA model for .p = 1. As quadratic forms, they can also be written as SSE = .y′ (Cg ⊗ Jn /n)y and SSE = .y′ (Ig ⊗ Cn )y, where .y = (y11 , . . . , ygng )′ ; .Cg = Ig −Jg /n is the centering matrix, with .Cn defined similarly; and .J is the matrix of 1s of appropriate order. Under normality and independence assumptions, .T1 is a ratio 2 and .MST ∼ of two independent chi-squared random variables, .MST ∼ σ 2 χg−1

26

S. Ejaz Ahmed and M. Rauf Ahmad

2 σ 2 χg(n−1) , so that .T1 ∼ Fg−1,g(n−1) . Note also that the degrees of freedom for the corresponding univariate and multivariate quadratic forms are the same. Our aim is to evaluate .T1 and compare it with two competing statistics, for .g → ∞. The competing statistics are those given in [3, 14], to be denoted, respectively, as .T2 and .T3 . Whereas .T2 assumes balanced model and homoscedasticity, i.e., .ni = n .∀ i and .Var(uij ) = σ 2 .∀ i, j , but not normality, .T3 is derived under just the opposite conditions, i.e., it allows unbalanced model and unequal variances, but it needs normality assumption, .uij ∼ N(0, σi2 ). We therefore restrict the comparison to balanced and homoscedastic case. Thus, .T1 given in (2.7) pertains specifically to this case, for .ni = n and .σi2 = σ 2 . However, for the sake of robustness again, we use both normal and t distributions for .uij . Given this setup, we now briefly introduce the two competing statistics. Bathke [3] studies .T1 in detail and shows that, for one-way model,

.

T2 =

g(n − 1) D → N(0, 1), (T1 −1) − 2n

(2.8)

as .g → ∞, with n fixed. Park and Park [14] introduce two modifications of .T1 for .g → ∞, with further adjustments based on Edgeworth expansions for finite g case. For brevity, we only use their first statistic, reduced for equal sample sizes and variances. The original statistic is defined as

.

T=

g 



i=1

  1 si2 (y i. − y .. ) − 1 − , g ni 2

Ʃ with .y i. and .y .. as defined above and .si2 = nj=1 (yij −y i. )2 /(ni −1) as the unbiased Ʃg estimator of .σi2 from ith sample, .i = 1, . . . , g. With .E(T) = i=1 (μi − μ)2 = 0 Ʃg under .H0 , where .μ = i=1 μi /g, [14] show that .

T3 = /

T  Var(T)

D

− → N(0, 1),

Ʃ Ʃg 4  = g ci  where .Var(T) = i=1 ci σi4 , estimated as .Var(T) σi4 = (ni − i=1 σi , with . 4 4 1)si /(ni + 1) as an unbiased estimator of .σi , and 

1 .ci = 2 1 − g



  1 1 1 + 1− . g n2i (ni − 1) n2i

For comparison purposes, using the same notation as introduced for .T2 , we can re-write .T in .T3 for .ni = n, after a brief simplification, as

2 MANOVA for Large Number of Treatments

27

Table 2.9 Comparison of estimated sizes of three ANOVA tests for normal (upper panel) and T (lower panel) distributions .n1

= 10

.n2

= 20

.n3

= 50

g

.T1

.T2

.T3

.T1

.T2

.T3

.T1

.T2

.T3

3 9 27 81 243 3 9 27 81 243

0.049 0.055 0.049 0.046 0.052 0.058 0.047 0.046 0.060 0.052

0.103 0.092 0.070 0.059 0.059 0.127 0.081 0.070 0.069 0.057

0.073 0.075 0.062 0.053 0.056 0.083 0.061 0.055 0.061 0.049

0.049 0.048 0.051 0.048 0.046 0.046 0.050 0.052 0.048 0.057

0.109 0.076 0.069 0.058 0.053 0.103 0.078 0.071 0.059 0.066

0.077 0.067 0.063 0.055 0.050 0.069 0.063 0.061 0.054 0.059

0.042 0.048 0.052 0.054 0.049 0.052 0.048 0.050 0.041 0.050

0.091 0.071 0.071 0.064 0.056 0.100 0.081 0.068 0.051 0.053

0.063 0.063 0.066 0.061 0.056 0.073 0.066 0.061 0.047 0.052

.

T=

g−1 MSE(T1 −1). n

Likewise, the variance reduces to  = Var(T)

.

(gn − 1)(g − 1)  4 2 si . 2 n (n + 1) g2 g

i=1

Although .T3 seems to be defined intricately in terms of .ni and g, it is interesting to see that .T3 in fact coincides with .T2 in the limit, for .g → ∞. For this, we first note that .Var(T) = 2σ 4 (kn − 1)(k − 1)/kn2 (n − 1) for .ni = n and .σi = σ .∀ i. Now, write √ √ Var(T) T [(g − 1)/n] MSE(T1 −1) Var(T) / ·/ . T3 = √ = / 2σ 4 (gn−1)(g−1) Var(T)   Var(T) Var(T) 2 g n (n−1) √ √ √ g(T1 −1) Var(T) MSE (g − 1)(T1 −1) Var(T) MSE / = . = /  / 2 2 σ 2 (gn−1)(g−1) 2 n−1   σ Var(T) Var(T) n + n−1 g n−1 g−1  is shown in [14, Lemma 2] in the sense that The consistency of .Var(T) P  Var(T) − .Var(T)/ → 1 as .g → ∞. Further, .MSE → σ 2 as .g → ∞ [see 3, p.121]. Applying these two limits, the rest of the expression simplifies exactly to .T2 limit in (2.8). We thus expect the two statistics to behave similarly, particularly for .g → ∞. Table 2.9 reports a comparison of estimated test sizes for .T1 , .T2 , and .T3 for the same g, n, and .α values as used for MANOVA, although restricting the ANOVA

28

S. Ejaz Ahmed and M. Rauf Ahmad

Table 2.10 Comparison of estimated power of three ANOVA tests for normal (upper panel) and T (lower panel) distributions .n1

= 10

.n2

= 20

.n3

= 50

g

.T1

.T2

.T3

.T1

.T2

.T3

.T1

.T2

.T3

3 9 27 81 243 3 9 27 81 243

0.144 0.200 0.380 0.709 0.983 0.146 0.185 0.350 0.666 0.969

0.258 0.281 0.445 0.742 0.985 0.243 0.266 0.414 0.704 0.972

0.227 0.271 0.437 0.741 0.985 0.210 0.244 0.392 0.681 0.969

0.262 0.429 0.752 0.990 1.000 0.239 0.403 0.716 0.983 1.000

0.384 0.528 0.794 0.991 1.000 0.357 0.493 0.760 0.986 1.000

0.339 0.508 0.790 0.991 1.000 0.308 0.460 0.747 0.984 1.000

0.593 0.881 0.999 1.000 1.000 0.564 0.860 0.999 1.000 1.000

0.714 0.920 1.000 1.000 1.000 0.677 0.900 0.999 1.000 1.000

0.665 0.911 1.000 1.000 1.000 0.627 0.888 0.999 1.000 1.000

comparison to only the balanced model. The upper panel of Table 2.9 is for normal distribution and the lower panel for .t10 distribution. Table 2.10 reports power of the same three tests for .α = 0.05, with upper and lower panels again representing normal and t distributions. All results are an average over 3000 simulation runs. We notice that all three tests perform accurately for .g → ∞, where the differences are only for small values of g. For example, .T1 is the most accurate among all three for small as well as for large values of g, whereas .T2 is liberal for small g, although its accuracy improves drastically by even moderate g, and .T3 seems a compromise between the two, as it is reasonably less liberal than .T2 for small g and improves likewise for increasing g. The power results mimic the size performance of the three tests. As g increases, the power approaches its maximum for all tests, but their performance also improves for increasing n. The slight opposite difference between .T2 and .T3 can be ascribed to their size difference due to the liberal behavior of .T2 , particularly for small g. Both tests, however, show similar accurate performance even for moderate g. Another interesting aspect is the very close similarity of .T3 to .T1 for large g values. Further, the same performance properties of the three tests remain intact for t distribution as well.

2.4 Discussion and Outlook The likelihood ratio test for one-way MANOVA model is evaluated for its performance when the number of treatments is allowed to grow, where the sample size and dimension of response vectors are kept fixed and small. The statistic seems to perform accurately, for both size control and power, when the error vector is assumed to follow a normal distribution. However, the test does not seem to be robust to normality, even for moderate g, given its serious size distortion

2 MANOVA for Large Number of Treatments

29

and corresponding effect on power, when the errors are allowed to follow a t distribution. The relative properties under the two distributions remain intact for different covariance matrices of the error vector. As a special case, the univariate .F statistic for one-way ANOVA model is also investigated and compared with two of its modified versions specifically introduced for an increasing number of treatments. All three statistics seem to perform accurately for both normal and t distributions, again under different covariance structures, even for small or moderate sample sizes. The present study has been kept restricted from several dimensions. For example, only one-way model is investigated, homoscedasticity is assumed, and only one non-normal distribution is used to check for robustness. In the subsequent extended evaluations, these restrictions are planned to be relaxed. Further, a detailed theoretical investigation is planned to accompany the simulation-based assessment. Acknowledgments The authors would like to thank Professor Jürgen Pilz and co-editors for the kind invitation to contribute a paper and processing it. The research of Professor S. Ejaz Ahmed was funded by the Natural Sciences and Engineering Research Council of Canada (NSERC).

References 1. Ahmad, M.R.: A unified approach to testing mean vectors with large dimensions. AStA Adv. Stat. Anal. 103, 593–618 (2019) 2. Anderson, T.W.: Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, Hoboken (2003) 3. Bathke, A.: ANOVA for large number of treatments. Math. Meth. Stat. 11, 118–132 (2002) 4. Bathke, A., Harrar, S.: Nonparametric methods in multivariate factorial designs for large number of factor levels. J. Stat. Plann. Inf. 138, 588–610 (2008) 5. Cai, T., Xia, Y.: High-dimensional sparse MANOVA. J. Multivar. Anal. 131, 174–196 (2014) 6. Fujikoshi, Y.: Multivariate analysis for the case when the dimension is large compared to the sampel size. J. Korean Stat. Soc. 33, 1–24 (2004) 7. Fujikoshi, Y., Ulyanov, V.V., Shimizu, R.: Multivariate Statistics: High-Dimensional and Large-Sample Approximations. Wiley, Hoboken (2010) 8. Gupta, A.K., Harrar, S.W., Fujikoshi, Y.: MANOVA for large hypothesis degrees of freedom under non-normality. Test 17, 120–137 (2008) 9. Harrar, S., Bathke, A.: Nonparametric methods for unbalanced multivariate data and many factor levels. J. Multivar. Anal. 99, 1635–1664 (2008) 10. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis, 6th edn. Prentice Hall, Hoboken (2007) 11. Katayama, S., Kano, Y.: A new test on high-dimensional mean vectors without any assumption on population covariance matrix. Commun. Stat. Theory Methods 43, 5290–5304 (2014) 12. Mardia, K.V., Kent, J.T., Bibby, J.M.: Multivariate Analysis (reprint 2003). Academic Press, London (1979) 13. Muirhead, R.J.: Aspects of Multivariate Statistical Theory. Wiley, Hoboken (2005) 14. Park, J., Park, D.: Testing the equality of a large number of normal population means. Comp. Stat. Data Anal. 56, 1131–1149 (2012) 15. Schott, J.R.: Some high-dimensional tests for a one-way MANOVA. J Multivar. Anal. 98, 1825–1839 (2007)

30

S. Ejaz Ahmed and M. Rauf Ahmad

16. Wang, H., Akritas, M.G.: Rank tests for ANOVA with large number of factor levels. Nonparam. Stat. 16, 563–589 (2004) 17. Wang, L., Akritas, M.G.: Two-way heteroscedastic ANOVA when the number of levels is large. Nonparam. Stat. 16, 563–589 (2006)

Chapter 3

Pollutant Dispersion Simulation by Means of a Stochastic Particle Model and a Dynamic Gaussian Plume Model Maximilian Arbeiter, Albrecht Gebhardt, and Gunter Spöck

Abstract The pollutant dispersion models of this work fall into two classes: physical and statistical. We propose a large-scale physical particle dispersion model and a dynamic version of the well-known Gaussian plume model, based on statistical filters. Both models are based on wind measurements, wind interpolations, and mass corrections of certain wind stations installed in an alpine valley in Carinthia/Austria. Every 10 minutes the wind field is updated, and the dispersion of the pollutant is calculated. Vegetations like forest and grassland are fully considered. The dispersion models are used to predict pollutant concentrations resulting from the emissions of a cement plant. Both models are compared to each other and give almost equivalent results. The great advantage of the statistical model is that it does not scale like the particle model with the number of emitters, but its computational burden is constant, no matter how many emitters are included in the model. To test and validate these models, we developed the R-package PDC using the CUDA framework for GPU implementation.

3.1 Introduction Beside models that interpolate actual pollutant concentrations, pollution dispersion models play a prominent role in environmental science. Most often, dispersion models are based on certain types of wind speed and wind direction statistics and calculate the dispersion of a pollutant over a longer period of time. One may distinguish between dynamic and non-dynamic pollutant dispersion models. Nondynamic models like the Gaussian plume model calculate the dispersion based on some summary statistics of the investigated wind field. On the other hand, dynamic models like the ones discussed in this chapter take the wind field at every instance of time into account.

M. Arbeiter () · A. Gebhardt · G. Spöck Universitaet Klagenfurt, Klagenfurt am Wörthersee, Austria e-mail: [email protected]; [email protected]; [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Pilz et al. (eds.), Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications, Contributions to Statistics, https://doi.org/10.1007/978-3-031-40055-1_3

31

32

M. Arbeiter et al.

We propose in this work some applications of the particle model [3], and a generalization of the Gaussian plume model [31], to a dynamic setting. Both models take topography, vegetation, dynamic wind field, plume rise, and atmospheric stability classes into consideration and are dynamic in nature. The following list gives some other well-known pollutant dispersion models with their specificities and shows how these models are related to our own two in terms of what can be modeled. • AERMET, AERMOD: [36] steady-state Gaussian plume model, complex terrain, wet deposition, plume rise, multiple sources • CALMET, CALPUFF: [37] dynamic Gaussian puff model, moderately complex terrain, wet deposition, plume rise, multiple sources (point, line, area, volume), kinematic effects, slope effects, blocking effects, mass conservation • LAPMOD: [16] dynamic particle model with different lognormally distributed particle sizes, multiple sources (volume, area, line, circular), plume rise, vegetation, complex terrain [20] length instead of [27]-Gifford classes, wet deposition • SCIPUFF: [35] dynamic Gaussian puff model, complex terrain effects and massconsistent wind field adjustment, wet deposition, moving and stack sources, plume rise, variance estimates of concentrations • ADMS-5: [8] similar to AERMOD, but with more modules especially for visualization The complexities of most of the above models forced us to develop our own software for pollutant dispersion. Our own software allows us to be very flexible with the interfaces to the wind model server and to buy cheaper weather stations for monitoring wind direction and wind speed. The overall costs for our hardware including 13 weather stations with Internet access servers, 1 all-sky camera, and 1 main computing server amount to only 9000 euros. Overall 1 man-year has been invested in software development and all other tasks necessary to start and maintain the monitoring network. The main reason for all those developments was to get a proof of concept for the environmental dynamic Gaussian plume model proposed in this chapter. In order to make this model comparable to standard models from pollutant dispersion modeling, it was necessary also to implement the particle model from the bottom-up because parameters from the dynamic Gaussian plume model should translate one to one to the ones from the particle model. Since with standard software as mentioned in the above list one cannot be sure that both, standard model and dynamic Gaussian plume model, are similarly parametrized, we chose to program both models from the bottom-up. Our models compare to the models in the above list in that all basic functionalities found above are also implemented in our models. We take account of the topography and of the vegetation by means of assuming that pollutants are absorbed in different vegetations differently. Pasquill-Gifford stability classes and plume rise are also considered. Up to date, there is only dry deposition implemented, but for the future, it is planned to extend also to wet deposition. Currently, also all pollutants have the same deterministic particle size. This will be relaxed in the future by means of assuming a lognormal distribution for particle size and weight.

3 Pollutant Dispersion Simulation

33

The proposed environmental dynamic Gaussian plume model has the great advantage that its computational burden is constant and does not scale linearly when adding additional pollutant emitters. This fact is the great head start of this model in opposition to all other models mentioned in the above list.

3.2 Meteorological Monitoring Network In the whole Görtschitz valley, a valley in Carinthia/Southern Austria, certain wind and weather stations have been installed. The stations range from Semlach in the north to Eppersdorf in the south. Together they cover an area of about .27 × 9 km2 . The height above sea level of the stations ranges from 492 to 928 meters. Most stations are WH1080 stations measuring temperature, humidity, hourly rain, wind speed, wind direction, and air pressure. The two stations in Hochfeistritz and Eppersdorf are WH3080 and measure additionally luminosity and UV index. The station in Eberstein is very special, because it has additionally an all-sky camera installed that measures cloud coverage during night- and daytime. Both luminosity and cloud coverage are used to define later the Pasquill-Gifford stability class see Table 3.1. Temperature and air pressure are needed in order to calculate the plume rise at the plume emitter. Rain will be included at a later stage to calculate wet deposition of the pollutants. The weather stations WH1080 and WH3080 both are split into an outdoor and indoor unit connected via radio. The indoor units are connected to Raspberry Pi mini computers [28], via USB and send data to a central frewe server ([26], also a Raspberry running Raspbian) which are stored in a SQL database [23], and can be visualized through a web interface. The address of the homepage is http://guspgets-it.net/frewe. The stations provide data on the environmental variables every 5 minutes (Fig. 3.1).

Table 3.1 Criteria for the Pasquill-Gifford stability classes [13]. Note: Always use class D for overcast conditions

.u(ms .6

−1 )

Strong A A–B B C C

Daytime Solar radiation Moderate A–B B B–C C–D D

Weak B C C D D

Cloudy E E D D D

Nighttime Cloud coverage

Clear F F E D D

34

M. Arbeiter et al.

Fig. 3.1 Left: Locations of the 13 weather stations in the Görtschitz valley. The area is .27×9 km2 . In the north is Semlach; in the south is Eppersdorf. Source: GRASS GIS and OpenStreetMap [24, 25]. Right: A WH1080 weather station; the WH3080 looks similar except for two additional sensors for luminosity and UV index at the solar module

3.3 Wind Field Modeling To determine the vertical wind profile for each wind station, the so-called power law is used: ⎧  m ⎨u(zm ) 200 if z > 200 meters  zmm .u(z) = (3.1) ⎩u(zm ) z if z ≤ 200 meters zm zm denotes the height of the wind station above the ground (in meters), and .u(z) is the wind speed at height z. Here, all weather stations are fixed at about 2 meters above the ground. m is a constant which depends on the stability class and on the landscape; see also Table 3.2 [38]. The interpolation between the measurement stations is done by means of inverse-distance weighting. In our case, the wind can be interpreted as a vector with three components .u = (u, v, w)T . If measurements .u1 , ..., un are given at positions .x1 , ..., xn , then the value at an arbitrary position .x0 can be calculated as follows: ⎧ |x0 −xi |−2 ⎪ if xi = x0 n ⎨ n  −2 .u(x0 ) = λi ui with λi = k=1 |x0 −xk | (3.2) ⎪ ⎩ i=1 1 else .

3 Pollutant Dispersion Simulation

35

Table 3.2 All stability classes and all parameters depending on them. In our case, we p mix 1/3 use the following parameterizations: .w∗ = ( 9.81qh , ρ = T ·287.058 , cp = T0 ρcp )  0.285 p0 dθ −4 , q = 0.45 · 1.12−1 1006, |f | = 10 ,  −  = | dz |, θ = T p

(990 sin φ − 30)(1 − 0.75cloud 3.4 )0.8 + 60cloud − 5.67 · 10−8 T 4 + 5.31 · 10−13 T 6 , .n = (1 − 16z/L0 )0.25 , and .n0 = (1−16z0 /L0 )0.25 , where T is the temperature in kelvin, p is the air pressure in pascal, .T0 and .p0 are these values at the ground, .L0 and .z0 are the Monin-Obukhov length and the height at the ground, .φ is the angle of the sun above the horizon, cloud is the cloud coverage, .θ is the potential temperature, .w∗ is the convective velocity, .ρ is the density of air, f is the Coriolis parameter, .cp is the specific heat, q is the sensible heat flux, .ts is the time of sun rise, and . −  is the difference between the dry adiabatic and the temperature lapse rate [13, 14, 38] Stability class

.u∗

A (instable) .

.hmix

.σv

0.4u 2 (n0 +1)(n0 +1)2 −1 n−tan−1 n ) log( zz )+log 0 2 2 +2(tan 0

 2 + σ2 σv1 v2

.

(n +1)(n+1)

  t  2.8 q(t)dt  t

B

.

 .σv1

s

=

0.6w∗ if z ≤ hmix mix 0.6w∗ exp (−2 z−h hmix )

pcp (−)

C

=   z 1.9u∗ exp −0.75 hmix   z .σv = 1.9u∗ exp −0.75 hmix   z .σv = 1.9u∗ exp −0.75 hmix .σv2

D (neutral)

.

.

0.4u log( zz )+5(z−z0)/L0

.2400u∗

0

E, F (stable)

.

0.3u∗ |f |

0.4u log( zz ) 0

3/2

Stability class

.σw

m (forest/ grassland)

A (instable)

.

.0.20/0.12

 2 + σ2 σw1 w2 ⎧ 0.6 z 0.4 ⎪ ⎪ 0.20.4 ∗ ( hmix ) if z ≤ 0.2hmix ⎪ ⎪ ⎨ 0.6w if z ∈ (0.2h , 0.8h ] ∗ mix mix B .σw1 = z−hmix ⎪ 0.6w exp (−0.6) exp (−3 ) if z ∈ (0.8hmix , hmix ] ⎪ ∗ h ⎪ mix ⎪ ⎩ mix 0.6w∗ exp (−2 z−h ) else hmix   z C .σw2 = 1.3u∗ exp −0.75 hmix   z D (neutral) .σw = 1.3u∗ exp −0.75 hmix   z E .σw = 1.3u∗ exp −0.75 hmix F (stable)

.0.31/0.14

.0.31/0.18 .0.31/0.26 .0.48/0.32 .0.52/0.37

(continued)

36

M. Arbeiter et al.

Table 3.2 (continued) Stability class .Ti,L,y

.Ti,L,z

Instable

h .0.15 mix σv

Neutral

.0.5

Stable

⎧ 0 )/ hmix ) ⎪ 0.15hmix 1−exp (−5∗(z−z if (z − z0 ) ≥ 0.1hmix ⎪ σw ⎪ ⎪ z−z0 ⎨ 0.1 else if σw (0.55+0.38(z−z0 −0.25)/L0 ) . ⎪ −(z − z0 − 0.25) 0 and .k(1/2) ≈ −0.04044011 < 0. This proves that there is no functional superiority between .2x/(1 + x) and .sin[(π/2)x] for any .x ∈ [0, 1]. This achieves the proof of Proposition 1.

 

4 On an Alternative Trigonometric Strategy for Statistical Modeling

55

0.2

0.4

0.6

0.8

1.0

First order stochastic dominance

0.0

F(x) D(x)

0.0

0.2

0.4

0.6

0.8

1.0

Fig. 4.1 Plot of .F (x) and .D(x) for .x ∈ (0, 1)

The inequalities in Proposition 1 have a probabilistic interpretation; they are related to the concept of first-order stochastic (FOS) dominance (see [19]). Indeed, it follows from the first point of Proposition 1 that .F (x) ≥ D(x) for any .x ∈ R. This implies that the S distribution FOS dominates the AS distribution. In this sense, the AS distribution provides a modeling alternative to the S distribution. This order property is illustrated by the curves of the involved functions in Fig. 4.1. The second point in Proposition 1 can be interpreted as follows: Let us denote by .K(x) the cdf of the Marshall-Olkin unit uniform (MOUU) distribution with parameter equal to .1/2 and by .L(x) the cdf of the unit uniform (UU) distribution. That is,

K(x) =

.

⎧ 1, ⎪ ⎪ ⎨ 2x ⎪ ⎪ ⎩1 + x 0,

x ≥ 1, ,

x ∈ (0, 1), x ≤ 0.

⎧ ⎪ ⎪ ⎨1, L(x) = x, ⎪ ⎪ ⎩0,

x ≥ 1, x ∈ (0, 1), x ≤ 0.

Then, it follows from Proposition 1 that .F (x) ≥ K(x) ≥ L(x) for any .x ∈ R. From a probabilistic viewpoint, this means that the AS distribution is FOS dominated by the MOUU and UU distributions. In this sense, the AS distribution provides a

56

C. Chesneau

0.2

0.4

0.6

0.8

1.0

First order stochastic dominance

0.0

F(x) K(x) L(x)

0.0

0.2

0.4

0.6

0.8

1.0

Fig. 4.2 Plot of .F (x), .K(x) and .L(x) for .x ∈ (0, 1)

modeling alternative to the MOUU and UU distributions. The curves of the related functions in Fig. 4.2 demonstrate this order feature. The complete presentation of the AS distribution requires the expression of other important functions. As a central function, the pdf is given as

f (x) =

.

⎧ ⎨F (x) = − ⎩

0,

 π π , cos 1+x (1 + x)2

x ∈ (0, 1), x ∈ (0, 1).

Hence, by denoting P the probability measure, for any set A of .R, a random variable X with the AS distribution satisfies .P (X ∈ A) = A f (x)dx. Moreover, by denoting E the expectation operator associated with P , for any function .u(x), +∞ we have .E(u(X)) = −∞ u(x)f (x)dx, upon existence in the integral convergence sense. This integral formula will allow us to determine various moment measures of X, among other things. Some remarks on the analytical behavior of .f (x) are formulated below. We can express .f (x) in function of .F (x) as .f (x) = [π/(1+x)2 ] 1 − F (x)2 . When x tends

4 On an Alternative Trigonometric Strategy for Statistical Modeling

57

0.0

0.5

1.0

1.5

2.0

2.5

3.0

f(x)

0.0

0.2

0.4

0.6

0.8

1.0

Fig. 4.3 Plot of .f (x) for .x ∈ (0, 1)

to 0, we have .f (x) ∼ π , and, when x tends to 1, we have .f (x) ∼ (π 2 /16)(1−x) → 0. The derivative of .f (x) is obtained as f (x) =

.

  π π π 2(1 + x) cos − π sin . 1+x 1+x (1 + x)4

Since .cos (π/(1 + x)) ≤ 0 and .sin (π/(1 + x)) ≥ 0, we have .f (x) ≤ 0, implying that .f (x) is decreasing. For illustrative purposes, .f (x) is plotted in Fig. 4.3. With the above fact, it is clear that the AS distribution can serve to model lifetime-type phenomena with values on .(0, 1). As one of the most important reliability functions, the hrf of the AS distribution is obtained as ⎧ ⎪ 1, x ≥ 1, ⎪ ⎪ ⎨ π cos (π/(1 + x)) f (x) , x ∈ (0, 1), = − .h(x) = (1 + x)2 1 − sin (π/(1 + x)) 1 − F (x) ⎪ ⎪ ⎪ ⎩0, x ≤ 0.

58

C. Chesneau

0

50

100

150

200

h(x)

0.0

0.2

0.4

0.6

0.8

1.0

Fig. 4.4 Plot of .h(x) for .x ∈ (0, 1)

The following are some observations about the analytical behavior of .h(x). When x tends to 0, we have .h(x) ∼ π , and, when x tends to 1, we have .h(x) ∼ 2/(1 − x) → +∞. With the study of the derivative of .h(x) and tedious calculations, we prove that .h(x) is increasing. All these aspects are illustrated in Fig. 4.4. Thus, the AS distribution satisfies the increasing failure rate property. For a last useful function, we mention the quantile function (qf). By using the arcsine function defined over .[−π/2, π/2], it is obtained as ⎧

−1 ⎪ ⎨ 1 − arcsin(x) − 1, −1 .Q(x) = F (x) = π ⎪ ⎩0,

x ∈ [0, 1], x ∈ [0, 1].

From this qf, since .arcsin(1/2) = π/6, the median of the AS distribution follows as M=Q

.



π/6 −1 1 1 = 1− −1= . 2 π 5

4 On an Alternative Trigonometric Strategy for Statistical Modeling

59

Other quantile values can be expressed in a similar manner. Also, basically, for any random variable U with the UU distribution, .V = Q(U ) follows the AS distribution. Thus, values from the AS distribution can be generated using values generated from the UU distribution. This could be the starting point for a simulation study. The analytical expression of the qf allows for a more extensive quantile investigation of the AS distribution in general, including expressions for the quantile density function and hazard qfs, as well as a variety of asymmetry and plateness quantile metrics. See [7] and [17], among others.

4.2.2 Moment Properties The following results determine a mathematical formula for the raw moments related to the AE distribution. Proposition 2 Let X be a random variable with the AS distribution and r be a positive integer. Then, the .r th raw moment of X is given by .m(r) = E(Xr ). Two different approaches are given to express it. • Finite sum and integral approach: We have m(r) =

r   r

.

j =0

  π  (−1)r−j π j Ci , j − Ci(π, j ) , j 2

  where . jr denotes the binomial coefficient and .Ci(a, b) denotes the generalized +∞ cosine integral function: .Ci(a, b) = − a x −b cos(x)dx for .a > 0 and .b ≥ 1. • Infinite series approach: We have m(r) =

.

 +∞  1 (−1)k+1 π 2k+1 −2(k + 1) . (2k)!  r ++1

k,=0

Proof Let us prove the two points in turn. • By the binomial formula, we have r   r (−1)r−j E((1 + X)j ). .m(r) = E((−1 + 1 + X) ) = j r

(4.4)

j =0

Let us investigate the measure .E((1 + X)j ). Based on the integral definition, we get  E((1 + X) ) =

.

j

+∞ −∞



1

(1 + x) f (x)dx = −π j

0

(1 + x)

j −2



π cos 1+x

dx.

60

C. Chesneau

With the change of variable .u = π/(1 + x), we get  E((1 + X) ) = π j

.

π/2

j π

  π  , j − Ci(π, j ) . u−j cos(u)du = π j Ci 2 (4.5)

We obtain the desired result by combining Eqs. (4.4) and (4.5). • By using the cosine and generalized binomial series expansions, with validated interchange of the integral and sum signs, we get  m(r) =

+∞

.

−∞

1

= −π 0

k=0

(2k)!

+∞  k,=0

1

0

(2k)!

1

 xr π dx cos 1+x (1 + x)2

1 (−1)k π 2k dx (2k)! (1 + x)2k

 +∞  (−1)k+1 π 2k+1 k=0

=

xr (1 + x)2

+∞ 

 +∞  (−1)k+1 π 2k+1 k=0

=

x f (x)dx = −π 0



=

 r

xr dx (1 + x)2(k+1)

1

x

0

r

+∞   −2(k + 1) =0



x  dx

 1 (−1)k+1 π 2k+1 −2(k + 1) . (2k)!  r ++1

The stated result is obtained.  

This ends the proof of Proposition 2.

Further information on the generalized cosine integral function can be found in [18]. We can determine the mean and variance of X by .m(1) and .σ 2 = m(2) − m(1)2 , respectively. The coefficient of variation of X, can be calculated as .CV = σ/m(1). On the other hand, by applying the binomial formula, the .r th central moment of X is given by r   r (−1)k m(1)k m(r − k). .m (r) = E([X − m(1)] ) = k c

r

k=0

Then, the general coefficient of X is defined by .C(r) = mc (r)/σ r . As the main related measures the moment skewness of X is given by .C(3), and the moment kurtosis of X is given by .C(4). For numerical purposes, the first seven raw moments, coefficient of variation, moment skewness, and moment kurtosis of X are given in Table 4.1.

4 On an Alternative Trigonometric Strategy for Statistical Modeling

61

Table 4.1 Values of some moment measures of the AS distribution 2

.m(1)

.m(2)

.m(3)

.m(4)

.m(5)

.m(6)

.m(7)



0.251

0.105

0.056

0.034

0.0226

0.016

0.012

0.041

CV 0.809

.C(3)

.C(4)

0.977

3.352

From Table 4.1, we can say that the raw moments of X are relatively small and decrease as r decreases, as expected for a unit distribution. The variance is also small. Since .C(3) > 0 and .C(4) > 3, this confirms that the AS distribution is right skewed and leptokurtic. The result in Proposition 2 can be extended to the incomplete moments. This is formulated in the next proposition. Proposition 3 Let X be a random variable with the AS distribution, r be a positive integer, and .t ∈ [0, 1]. Then the .r th incomplete moment of X taken at t is given by r .m(r, t) = E(X 1X≤t ). Two different approaches are given to express it. • Finite sum and integral approach: We have m(r, t) =



 π , j − Ci(π, j ) . (−1)r−j π j Ci 1+t j

r   r

.

j =0

• Infinite series approach: We have m(r, t) =

.

 +∞  (−1)k+1 π 2k+1 −2(k + 1) t r++1 . (2k)!  r ++1

k,=0

The proof of Proposition 3 is similar to the one of Proposition 2, and it is thus omitted. Taking .t = 1 in Proposition 3 allows us to re-obtain the .r th raw moments of X. Also, .m(1, t) gives the first incomplete moment of X. It is particularly important since it naturally arises in useful moment-type measures like the mean deviation of X around .m(1), mean deviation of X around M, mean inactivity time, Bonferroni curve, Lorenz curve, and many other measures of income inequality. In reversed residual lifetime analysis, as well as a variety of related measures involving other conditional moments, incomplete moments of superior order are frequently used. See [21] and [6] for more information. The AS distribution can be used for various modeling purposes. However, in the current form, it suffers from a lack of flexibility in the functional sense. A solution to this problem may be to introduce one or more tuning parameters into its original definition. This point is discussed in the next subsection.

62

C. Chesneau

4.2.3 Parametric Extensions We now list some possible parametric extensions of the AS distribution, with respect to the support .(0, 1). • We can consider the power AS (PAS) distribution by the following cdf: ⎧ ⎪ 1, x ≥ 1, ⎪ ⎪ ⎨  π sin , x ∈ (0, 1), .F∗ (x; α) = ⎪ 1 + xα ⎪ ⎪ ⎩ 0, x ≤ 0, where .α > 0. It corresponds to the distribution of the random variable .Y = X1/α , where X is with the AS distribution. • We can consider the exponentiated AS (EAS) distribution by the following cdf:

F◦ (x; β) =

.

⎧ ⎪ 1, ⎪ ⎪

⎨ ⎪ ⎪ ⎪ ⎩

sin 0,



π 1+x

β

x ≥ 1, , x ∈ (0, 1), x ≤ 0,

where .β > 0. The exponentiated scheme is connected with the distribution of order statistics. We may refer to [8]. • We can consider the Topp-Leone AS (ToAS) distribution by the following cdf: ⎧ ⎪ 1, ⎪ ⎪

 β  β ⎨ π π .F∇ (x; β) = 2 − sin sin , ⎪ 1+x 1+x ⎪ ⎪ ⎩ 0,

x ≥ 1, x ∈ (0, 1), x ≤ 0,

where .β > 0. The Topp-Leone scheme is an alternative to the exponentiated scheme, with the same mathematical basis. Further information can be found in [15] and [1]. • We can consider the transmuted AS (TAS) distribution by the following cdf: ⎧ ⎪ 1, ⎪ ⎪  x ≥ 1, ⎨  π

π sin 1 + λ − λ sin , x ∈ (0, 1), .F† (x; λ) = ⎪ 1+x 1+x ⎪ ⎪ ⎩ 0, x ≤ 0, where .λ ∈ [−1, 1]. Further detail on the transmuted scheme can be found in [22].

4 On an Alternative Trigonometric Strategy for Statistical Modeling

63

The PAS, EAS, ToAS, and TAS distributions are selected parametric versions of the AS distribution. Other examples can be presented in a similar manner. These new trigonometric distributions of the unit interval can be useful in the fitting of proportional data, the elaboration of various regression models, and the construction of diverse classification tools. These perspectives are of interest, but in the rest of the study, we propose to use the AS distribution to generate other distributions that have not been considered before.

4.3 AS Generated Family Based on a distribution with support .(0, 1), we can generate a plethora of distributions via the composition scheme (see [6]). In this section, we apply this scheme to the AS distribution and extract a special member of the family for further investigation and application.

4.3.1 Definition To begin, let us consider a cdf of an arbitrary continuous distribution denoted by G(x). This baseline distribution may depend on some parameters, be with finite, semi-infinite, or infinite support, etc. Then, the composition scheme applied to the AS distribution suggests the following cdf: .FG (x) = F [G(x)], .x ∈ R. That is,

.

 FG (x) = sin

.

π , 1 + G(x)

x ∈ R.

We define the AS generalized (AS-G) family by this cdf. The main interest of the AS-G family is that it is new and simple and it provides a real alternative to the S-G family without additional parameters. In the FOS sense, based on Proposition 1, the S-G family dominates the AS-G family. Of course, any classical choice for .G(x) leads to a new distribution defined by .FG (x), opening some novel perspectives of modeling. In addition, by choosing .G(x) as the cdf of the UU distribution, the AS distribution is obtained. As another example, a one-parameter lifetime member of the AS-G family will be developed later. By introducing the pdf .g(x) related to .G(x), the pdf of the AS-G family is given as  π π , x ∈ R. cos .fG (x) = g(x)f [G(x)] = −g(x) 1 + G(x) [1 + G(x)]2


Some analytical facts about this function are described below.

We can express f_G(x) as a function of F_G(x) as f_G(x) = g(x)\{\pi/[1+G(x)]^2\}\sqrt{1-F_G(x)^2}. When G(x) tends to 0, we have f_G(x) ∼ π g(x), and, when G(x) tends to 1, we have f_G(x) ∼ (π²/16) g(x)[1 − G(x)]. One can also notice that, if g(x) is a decreasing function, since f(x) is a decreasing function and G(x) is an increasing function, f_G(x) is a decreasing function. We complete this presentation with the hrf and qf, which are expressed as

h_G(x) = \frac{f_G(x)}{1-F_G(x)} = -g(x)\,\frac{\pi}{[1+G(x)]^2}\,\frac{\cos\!\big(\pi/(1+G(x))\big)}{1-\sin\!\big(\pi/(1+G(x))\big)}, \quad x \in \mathbb{R},

and

Q_G(x) = G^{-1}[Q(x)] = \begin{cases} G^{-1}\!\left(\left[1-\dfrac{\arcsin(x)}{\pi}\right]^{-1}-1\right), & x \in [0, 1], \\ 0, & x \notin [0, 1], \end{cases}

respectively. These functions have the same interpretation and importance as those of the AS distribution.
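The following minimal R sketch (an addition, not part of the chapter) implements the AS-G cdf, pdf, and quantile function for a user-supplied baseline cdf G, pdf g, and quantile function Ginv; the exponential baseline with rate 2 used at the end is only an illustrative assumption.

## AS-G family: cdf, pdf and quantile function built from a baseline (G, g, Ginv)
pASG <- function(x, G)    sin(pi / (1 + G(x)))
dASG <- function(x, G, g) -g(x) * pi / (1 + G(x))^2 * cos(pi / (1 + G(x)))
qASG <- function(u, Ginv) Ginv(1 / (1 - asin(u) / pi) - 1)

## Illustration with an exponential baseline of rate 2 (assumed for the demo)
G    <- function(x) pexp(x, rate = 2)
g    <- function(x) dexp(x, rate = 2)
Ginv <- function(u) qexp(u, rate = 2)
pASG(qASG(0.3, Ginv), G)   # should return (approximately) 0.3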

4.3.2 Series Expansions

Since the pdf of the AS-G family may be complicated from the mathematical point of view, tractable series expansions of it are of interest for computational purposes. This is the subject of the following proposition.

Proposition 4 The pdf of the AS-G family can be expressed as follows:

• In terms of exponentiated G(x) and g(x), we have

f_G(x) = \sum_{k,\ell=0}^{+\infty} a_{k,\ell}\,\{g(x) G(x)^{\ell}\},

where a_{k,\ell} = \dfrac{(-1)^{k+1}\pi^{2k+1}}{(2k)!}\dbinom{-2(k+1)}{\ell}.

• In terms of exponentiated \bar G(x) = 1 - G(x) and g(x), we have

f_G(x) = \sum_{k,\ell=0}^{+\infty} b_{k,\ell}\,\{g(x) \bar G(x)^{\ell}\},

where b_{k,\ell} = \dfrac{(-1)^{\ell+k+1}\pi^{2k+1}}{(2k)!\,2^{\ell+2(k+1)}}\dbinom{-2(k+1)}{\ell}.

Proof Let us prove the two items in turn.


• Owing to the cosine and generalized binomial series expansions, we get

f_G(x) = -g(x)\,\frac{\pi}{[1+G(x)]^2}\cos\!\left(\frac{\pi}{1+G(x)}\right)
 = -g(x)\,\frac{\pi}{[1+G(x)]^2}\sum_{k=0}^{+\infty}\frac{(-1)^{k}}{(2k)!}\,\frac{\pi^{2k}}{[1+G(x)]^{2k}}
 = g(x)\sum_{k=0}^{+\infty}\frac{(-1)^{k+1}\pi^{2k+1}}{(2k)!}\,\frac{1}{[1+G(x)]^{2(k+1)}}
 = g(x)\sum_{k=0}^{+\infty}\frac{(-1)^{k+1}\pi^{2k+1}}{(2k)!}\sum_{\ell=0}^{+\infty}\binom{-2(k+1)}{\ell} G(x)^{\ell}
 = \sum_{k,\ell=0}^{+\infty} a_{k,\ell}\,\{g(x)G(x)^{\ell}\}.

• By using the equality 1 + G(x) = 2 - \bar G(x) and a development similar to that of the first point, we obtain

f_G(x) = -g(x)\,\frac{\pi}{[2-\bar G(x)]^2}\cos\!\left(\frac{\pi}{2-\bar G(x)}\right)
 = g(x)\sum_{k=0}^{+\infty}\frac{(-1)^{k+1}\pi^{2k+1}}{(2k)!}\,\frac{1}{[2-\bar G(x)]^{2(k+1)}}
 = g(x)\sum_{k=0}^{+\infty}\frac{(-1)^{k+1}\pi^{2k+1}}{(2k)!\,2^{2(k+1)}}\,\frac{1}{[1-\bar G(x)/2]^{2(k+1)}}
 = g(x)\sum_{k=0}^{+\infty}\frac{(-1)^{k+1}\pi^{2k+1}}{(2k)!\,2^{2(k+1)}}\sum_{\ell=0}^{+\infty}\binom{-2(k+1)}{\ell}\frac{(-1)^{\ell}}{2^{\ell}}\,\bar G(x)^{\ell}
 = \sum_{k,\ell=0}^{+\infty} b_{k,\ell}\,\{g(x)\bar G(x)^{\ell}\}.

This concludes the proof of Proposition 4.

One interest of Proposition 4 is the following finite sum approximations:

f_G(x) \approx \sum_{k,\ell=0}^{M} a_{k,\ell}\,\{g(x)G(x)^{\ell}\}, \qquad f_G(x) \approx \sum_{k,\ell=0}^{M} b_{k,\ell}\,\{g(x)\bar G(x)^{\ell}\}, \qquad (4.6)

where M denotes a truncation integer that can be chosen by the practitioner. Then, these tractable approximations can be used instead of f_G(x) in various probabilistic measures, making their evaluation easier to handle.


For instance, the rth raw moment of a random variable X with a distribution belonging to the AS-G family is given by

m_G(r) = E(X^r) = \int_{-\infty}^{+\infty} x^r f_G(x)\,dx.

This integral can be computed via a numerical integration routine in mathematical software. Alternatively, thanks to the expansions in Eq. (4.6), we have the following finite sum approximations:

m_G(r) \approx \sum_{k,\ell=0}^{M} a_{k,\ell}\, u_{\ell}(r), \qquad m_G(r) \approx \sum_{k,\ell=0}^{M} b_{k,\ell}\, v_{\ell}(r), \qquad (4.7)

where u_{\ell}(r) = \int_{-\infty}^{+\infty} x^r \{g(x)G(x)^{\ell}\}\,dx and v_{\ell}(r) = \int_{-\infty}^{+\infty} x^r \{g(x)\bar G(x)^{\ell}\}\,dx. If G(x) is of modest analytical complexity, which is the case for most classical distributions, these integrals are often simple to determine. This makes the above moment approximations manageable. The incomplete moments of X can be treated in a similar manner.
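To make (4.7) concrete, here is a minimal R sketch (an addition, not the author's code) that evaluates m_G(r) both by direct numerical integration of the AS-G pdf and by the truncated double sum with the coefficients b_{k,ℓ}; the exponential baseline with rate alpha = 1 and the truncation level M = 25 are illustrative assumptions.

## Raw moment m_G(r) of an AS-G distribution: direct integration vs approximation (4.7)
fG <- function(x, G, g) -g(x) * pi / (1 + G(x))^2 * cos(pi / (1 + G(x)))

mG_direct <- function(r, G, g, lower = 0, upper = Inf)
  integrate(function(x) x^r * fG(x, G, g), lower, upper)$value

mG_series <- function(r, G, g, M = 25, lower = 0, upper = Inf) {
  ## v_l(r) = int x^r g(x) (1 - G(x))^l dx, computed numerically here
  v <- sapply(0:M, function(l)
    integrate(function(x) x^r * g(x) * (1 - G(x))^l, lower, upper)$value)
  total <- 0
  for (k in 0:M) for (l in 0:M) {
    b <- (-1)^(l + k + 1) * pi^(2 * k + 1) /
         (factorial(2 * k) * 2^(l + 2 * (k + 1))) * choose(-2 * (k + 1), l)
    total <- total + b * v[l + 1]
  }
  total
}

## Example with an exponential baseline of rate 1 (an assumption for the demo)
G <- function(x) 1 - exp(-x); g <- function(x) exp(-x)
c(direct = mG_direct(1, G, g), series = mG_series(1, G, g, M = 25))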

4.3.3 Example: The ASE Exponential Distribution

Here, we concretize the AS-G family by considering a representative member called the AS exponential (ASE) distribution. As suggested by its name, this member is based on the exponential (E) distribution as the baseline. If we define the cdf and pdf of the E distribution by

G(x; \alpha) = \begin{cases} 1 - e^{-\alpha x}, & x > 0, \\ 0, & x \le 0, \end{cases} \qquad g(x; \alpha) = \begin{cases} \alpha e^{-\alpha x}, & x > 0, \\ 0, & x \le 0, \end{cases}

where \alpha > 0 denotes a rate parameter, the ASE distribution is defined by the following cdf:

F(x; \alpha) = \begin{cases} \sin\!\left(\dfrac{\pi}{2 - e^{-\alpha x}}\right), & x > 0, \\ 0, & x \le 0. \end{cases}

Thus defined, the ASE distribution constitutes a new one-parameter lifetime distribution. It follows from the first point of Proposition 1 that the ASE distribution is dominated in the FOS sense by the SE distribution of Kumar et al. [12], which is itself dominated in the FOS sense by the E distribution. Consequently, the


Fig. 4.5 Plot of f(x; α) for x ∈ (0, 3) and various values of α (α = 0.05, 0.1, 0.5, 1.5, 3, 8, 10)

ASE distribution is a real alternative to the mentioned distributions, with a new perspective on lifetime modeling. The corresponding pdf is given as

f(x; \alpha) = \begin{cases} -\alpha e^{-\alpha x}\,\dfrac{\pi}{(2 - e^{-\alpha x})^2}\cos\!\left(\dfrac{\pi}{2 - e^{-\alpha x}}\right), & x > 0, \\ 0, & x \le 0. \end{cases}

Some analytical facts about .f (x; α) are described below. When x tends to 0, we have .f (x; α) ∼ π α, and, when x tends to .+∞, we have .f (x; α) ∼ (π 2 /16)αe−2αx → 0. As previously sketched in the general case, since .g(x; α) is a decreasing function, .f (x; α) is a decreasing function too. Figure 4.5 illustrates the possible decreasing shapes of .f (x; α). The hrf is classically obtained as

h(x; \alpha) = \begin{cases} -\alpha e^{-\alpha x}\,\dfrac{\pi}{(2 - e^{-\alpha x})^2}\,\dfrac{\cos\!\big(\pi/(2 - e^{-\alpha x})\big)}{1 - \sin\!\big(\pi/(2 - e^{-\alpha x})\big)}, & x > 0, \\ 0, & x \le 0. \end{cases}
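The following minimal R sketch (an addition, not part of the chapter) implements the ASE cdf, pdf, and hrf exactly as displayed above; the names pase, dase, and hase are illustrative choices.

## ASE distribution: cdf, pdf and hazard rate function, for x > 0 and alpha > 0
pase <- function(x, alpha) ifelse(x > 0, sin(pi / (2 - exp(-alpha * x))), 0)

dase <- function(x, alpha) ifelse(x > 0,
  -alpha * exp(-alpha * x) * pi / (2 - exp(-alpha * x))^2 *
    cos(pi / (2 - exp(-alpha * x))), 0)

hase <- function(x, alpha) dase(x, alpha) / (1 - pase(x, alpha))

## Quick sanity checks: the pdf should integrate to 1, and f(x) ~ pi * alpha near 0
integrate(dase, 0, Inf, alpha = 0.5)$value
dase(1e-8, 0.5) / (pi * 0.5)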


Fig. 4.6 Plot of h(x; α) for x ∈ (0, 3) and various values of α (α = 0.05, 0.5, 1, 1.5, 3, 5)

When x tends to 0, we have h(x; α) ∼ πα, and, when x tends to +∞, we have h(x; α) ∼ 2α. By studying the derivative of h(x; α), after tedious calculations, we can prove that h(x; α) is decreasing. These facts are illustrated in Fig. 4.6. Thus, the ASE distribution satisfies the decreasing failure rate property. The qf is obtained as

Q(x; \alpha) = \begin{cases} -\dfrac{1}{\alpha}\log\!\left(2 - \left[1 - \dfrac{\arcsin(x)}{\pi}\right]^{-1}\right), & x \in [0, 1], \\ 0, & x \notin [0, 1]. \end{cases}

The median of the ASE distribution is

M(\alpha) = Q\!\left(\frac{1}{2};\,\alpha\right) = \frac{1}{\alpha}\log\!\left(\frac{5}{4}\right) \approx \frac{0.2231436}{\alpha}.

A quantile analysis can be performed in a similar way to that suggested for the AS distribution. In particular, the qf is a key tool for generating values from the


ASE distribution, which allows us to investigate the accuracy of various estimation methods for .α, among other things.
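As an illustration of this use of the qf, here is a minimal R sketch (an addition, not from the chapter) implementing the quantile function and inverse-transform sampling; qase and rase are illustrative names.

## Quantile function of the ASE distribution (x in [0, 1])
qase <- function(x, alpha) -log(2 - 1 / (1 - asin(x) / pi)) / alpha

## Inverse-transform generation of n pseudo-random values
rase <- function(n, alpha) qase(runif(n), alpha)

## The empirical median should be close to log(5/4) / alpha
set.seed(1)
median(rase(1e5, alpha = 2))   # compare with log(5/4) / 2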

4.3.4 Moment Properties

Some moment properties of the ASE distribution are now investigated. Let X be a random variable with the ASE distribution. Then, the rth raw moment of X is given by m(r; α) = E(X^r). The following integral representation holds:

m(r; \alpha) = \int_{-\infty}^{+\infty} x^r f(x; \alpha)\,dx = -\alpha\pi \int_{0}^{+\infty} x^r e^{-\alpha x}\,\frac{1}{(2 - e^{-\alpha x})^2}\cos\!\left(\frac{\pi}{2 - e^{-\alpha x}}\right) dx.

This integral can be computed by using mathematical software. In an alternative manner, we can use the second approximation established in Eq. (4.7). That is,

m(r; \alpha) \approx \sum_{k,\ell=0}^{M} b_{k,\ell}\, v_{\ell}(r; \alpha),

where

v_{\ell}(r; \alpha) = \int_{-\infty}^{+\infty} x^r \{g(x; \alpha)\bar G(x; \alpha)^{\ell}\}\,dx = \int_{0}^{+\infty} x^r \alpha e^{-(\ell+1)\alpha x}\,dx = \frac{r!}{\alpha^{r}(\ell+1)^{r+1}}.

The sum is thus easily computable. From the formulas above, we can determine the mean, variance, and coefficient of variation of X by m(1; α), σ²(α) = m(2; α) − m(1; α)², and CV(α) = σ(α)/m(1; α), respectively. On the other hand, by applying the binomial formula, the rth central moment of X is given by

m^{c}(r; \alpha) = E\big([X - m(1; \alpha)]^{r}\big) = \sum_{k=0}^{r}\binom{r}{k}(-1)^{k}\, m(1; \alpha)^{k}\, m(r-k; \alpha).

Then, the general coefficient of X is defined by C(r; α) = m^c(r; α)/σ^r(α). The moment skewness is given by C(3; α), and the moment kurtosis is given by C(4; α). For numerical purposes, the first two raw moments, coefficient of variation, moment skewness, and moment kurtosis of X are given in Table 4.2.


Table 4.2 Values of some moment measures of the ASE distribution

α      m(1; α)      m(2; α)       σ²(α)         CV(α)      C(3; α)    C(4; α)
0.05   6.896509     105.1669      57.60509      1.100528   2.520193   13.17591
0.5    0.6896508    1.051669      0.5760509     1.100528   2.520193   13.17591
1      0.3448254    0.2629173     0.1440127     1.100528   2.520193   13.17591
3      0.1149418    0.02921303    0.01600141    1.100528   2.520193   13.17591
5      0.06896508   0.01051669    0.005760509   1.100528   2.520193   13.17591
10     0.03448254   0.002629173   0.001440118   1.100525   2.520216   13.17604

From Table 4.2, we see that the mean and variance of X vary considerably; they both decrease with α for the considered values. The coefficient of variation, moment skewness, and moment kurtosis of X are very stable, with CV(α) ≈ 1.1, C(3; α) ≈ 2.52, and C(4; α) ≈ 13.17, which is quite high (these quantities do not depend on α, since α acts as a rate parameter; the small deviations observed for α = 10 are numerical). In comparison to the E distribution, which has a moment skewness of 2 and a moment kurtosis of 6, we see that the ASE distribution is better adapted to the modeling of phenomena having a high kurtosis.
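A minimal R sketch (an addition, not from the chapter) that recomputes these moment measures by numerical integration, using the dase pdf from the earlier sketch; alpha = 1 is an illustrative choice.

## Raw moments of the ASE distribution by numerical integration
m_ase <- function(r, alpha) integrate(function(x) x^r * dase(x, alpha), 0, Inf)$value

moment_measures <- function(alpha) {
  m1 <- m_ase(1, alpha); m2 <- m_ase(2, alpha)
  s2 <- m2 - m1^2
  ## central moments via the binomial formula
  mc <- function(r) sum(sapply(0:r, function(k)
    choose(r, k) * (-1)^k * m1^k * m_ase(r - k, alpha)))
  c(mean = m1, var = s2, CV = sqrt(s2) / m1,
    skewness = mc(3) / s2^(3/2), kurtosis = mc(4) / s2^2)
}

moment_measures(1)   # CV, skewness and kurtosis do not depend on alpha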

4.4 Application to a Famous Cancer Data In this section, we consider a real data set of importance to show that the ASE model can be a good lifetime model when compared to many other well-known models in the statistical literature. We use real data on the remission times (in months) of 128 bladder cancer patients for this study. The following data are taken from [13]: 0.08, 2.09, 3.48, 4.87, 6.94, 8.66, 13.11, 23.63, 0.20, 2.23, 3.52, 4.98, 6.97, 9.02, 13.29, 0.40, 2.26, 3.57, 5.06, 7.09, 9.22, 13.80, 25.74, 0.50, 2.46, 3.64, 5.09, 7.26, 9.47, 14.24, 25.82, 0.51, 2.54, 3.70, 5.17, 7.28, 9.74, 14.76, 26.31, 0.81, 2.62, 3.82, 5.32, 7.32, 10.06, 14.77, 32.15, 2.64, 3.88, 5.32, 7.39, 10.34, 14.83, 34.26, 0.90, 2.69, 4.18, 5.34, 7.59, 10.66, 15.96, 36.66, 1.05, 2.69, 4.23, 5.41, 7.62, 10.75, 16.62, 43.01, 1.19, 2.75, 4.26, 5.41, 7.63, 17.12, 46.12, 1.26, 2.83, 4.33, 5.49, 7.66, 11.25, 17.14, 79.05, 1.35, 2.87, 5.62, 7.87, 11.64, 17.36, 1.40, 3.02, 4.34, 5.71, 7.93, 1.46, 18.10, 11.79, 4.40, 5.85, 8.26, 11.98, 19.13, 1.76, 3.25, 4.50, 6.25, 8.37, 12.02, 2.02, 13.31, 4.51, 6.54, 8.53, 12.03, 20.28, 2.02, 3.36, 12.07, 6.76, 21.73, 2.07, 3.36, 6.93, 8.65, 12.63, and 22.69. For these data, suitable models include the SE, transmuted inverse Weibull (TIW), transmuted inverse Rayleigh (TIR), transmuted inverted exponential (TIE), and inverse Weibull (IW) models. See [11] and [12] and the references therein. The suitability of these models is attested by the maximum likelihood procedure combined with the Akaike information criterion (AIC) and Bayesian information criterion (BIC) values as statistical benchmarks.


In the setting of the ASE model, the parameter α is supposed to be unknown, and the maximum likelihood procedure consists in estimating it by

\hat\alpha = \arg\max_{\alpha > 0} \ell(\alpha),

where ℓ(α) denotes the log-likelihood function defined by

\ell(\alpha) = \sum_{i=1}^{n} \log[f(x_i; \alpha)]
 = n\log(\alpha) + n\log(\pi) - \alpha\sum_{i=1}^{n} x_i - 2\sum_{i=1}^{n}\log(2 - e^{-\alpha x_i}) + \sum_{i=1}^{n}\log\!\left[-\cos\!\left(\frac{\pi}{2 - e^{-\alpha x_i}}\right)\right]

and x_1, ..., x_n represent the data. There is no closed form for \hat\alpha, but a numerical evaluation of it from the data is always possible. The standard error of \hat\alpha is given by SE(\hat\alpha) = \{-\partial^2 \ell(\alpha)/\partial\alpha^2\}^{-1/2}\big|_{\alpha=\hat\alpha}. We may refer to [2] for the details about the maximum likelihood procedure. The AIC and BIC are defined by

AIC = -2\hat\ell + 2k, \qquad BIC = -2\hat\ell + k\log(n),

respectively, where \hat\ell = \ell(\hat\alpha), k denotes the number of parameters, and n is the number of data points. The AIC and BIC of several models can be calculated, the best model being the one with the smallest AIC and BIC. After processing the maximum likelihood procedure with the R software (see [20]), the parameter α of the ASE model is estimated by \hat\alpha = 0.03612539 with a quite small standard error of SE(\hat\alpha) = 0.003441175. Table 4.3 presents the values of -\hat\ell, AIC, and BIC for the ASE, SE, TIWD, TIED, IWD, and TIRD models. From Table 4.3, we see that the ASE model has the lowest AIC and BIC; it even beats the SE model, which is reputed to be one of the best one-parameter models for fitting these data. See [12]. The estimated cdf defined by \hat F(x) = F(x; \hat\alpha) is compared to the empirical cdf of the data in Fig. 4.7.

Table 4.3 Values of -\hat\ell, AIC, and BIC of the considered models

Model   -\hat\ell   AIC      BIC
ASE     414.9       831.8    834.7
SE      415.3       832.6    835.5
TIWD    438.5       879.4    879.7
TIED    442.8       889.6    889.8
IWD     444.0       892.0    892.2
TIRD    710.2       1424.4   1424.6
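The following R sketch (an addition, not the author's code) shows one way to reproduce such a fit: it maximises the ASE log-likelihood with optim, obtains the standard error from the numerically differentiated Hessian, and computes AIC and BIC. The variable x is assumed to contain the remission-time data listed above, and the search interval (1e-6, 10) is an illustrative assumption.

## Negative log-likelihood of the ASE model (x: data vector, alpha > 0)
nll_ase <- function(alpha, x) {
  -(length(x) * log(alpha) + length(x) * log(pi) - alpha * sum(x) -
      2 * sum(log(2 - exp(-alpha * x))) +
      sum(log(-cos(pi / (2 - exp(-alpha * x))))))
}

fit_ase <- function(x) {
  fit <- optim(par = 0.1, fn = nll_ase, x = x, method = "Brent",
               lower = 1e-6, upper = 10, hessian = TRUE)
  ll <- -fit$value; k <- 1
  c(alpha = fit$par, se = sqrt(1 / fit$hessian[1, 1]),
    AIC = -2 * ll + 2 * k, BIC = -2 * ll + k * log(length(x)))
}

## fit_ase(x)   # with the 128 remission times, alpha should be close to 0.036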


Fig. 4.7 Plot of the estimated cdf of the ASE model over the empirical cdf of the data

From Fig. 4.7, we see that the curve of the estimated cdf is very close to the curve of the empirical cdf. This illustrates the accuracy of the fit of the ASE model for the considered data. We complete the previous graphical analysis by plotting in Fig. 4.8 the estimated pdf defined by \hat f(x) = f(x; \hat\alpha) over the normalized histogram of the data. From Fig. 4.8, we observe that the shape of the normalized histogram of the data is well fitted by the estimated pdf. This is consistent with the previous numerical and graphical results. All the above facts are in favor of the use of the ASE model for similar lifetime data in other applied disciplines.

Fig. 4.8 Plot of the estimated pdf of the ASE model over the normalized histogram of the data

4.5 Conclusion

The contribution of this chapter is twofold. First, we have presented and studied a new trigonometric distribution with support (0, 1) that offers an alternative to the classical sine distribution. Then, we have used it to elaborate a general trigonometric family of distributions. A focus was placed on the particular member defined by the exponential distribution as the baseline. It constitutes a new simple one-parameter lifetime distribution, called the alternative sine exponential distribution. It is demonstrated to fit a well-known bladder cancer patient data set better than many well-known one-parameter lifetime distributions found in the statistical literature. The perspectives of this work are numerous, including two-parameter extensions of the alternative sine exponential distribution, the development of various regression models, diverse bivariate extensions, and discrete versions as well.

References 1. Al-Shomrani, A., Arif, O., Shawky, K., Hanif, S., Shahbaz, M.Q.: Topp-Leone family of distributions: some properties and application. Pak. J. Stat. Oper. Res. 12(3), 443–451 (2016) 2. Casella, G., Berger, R.L.: Statistical Inference. Brooks/Cole Publishing Company, Bel Air (1990) 3. Chesneau, C., Artault, A.: On a comparative study on some trigonometric classes of distributions by the analysis of practical data sets. J. Nonlin. Model. Anal. 3(2), 225–262 (2021) 4. Chesneau, C., Jamal, F.: The sine Kumaraswamy-G family of distributions. J. Math. Exten. 15(2), 1–33 (2021)


5. Chesneau, C., Bakouch, H.S., Hussain, T.: A new class of probability distributions via cosine and sine functions with applications. Commun. Stat. Simul. Comput. 48(8), 2287–2300 (2019) 6. Cordeiro, G.M., Silva, R.B., Nascimento, A.D.C.: Recent advances in lifetime and reliability models. Bentham books (2020). https://doi.org/10.2174/97816810834521200101 7. Gilchrist, W.: Statistical Modelling with Quantile Functions. CRC Press, Abingdon (2000) 8. Gupta R.C., Gupta P.I., Gupta R.D.: Modeling failure time data by Lehmann alternatives. Commun. Stat. Theory Methods 27, 887–904 (1998) 9. Jamal, F., Chesneau, C.: A new family of polyno-expo-trigonometric distributions with applications. Infin. Dimension. Anal. Quantum Probab. Relat. Top. 22(04), 1950027, 1–15 (2019) 10. Jamal, F., Chesneau, C., Bouali, D.L., Ul Hassan, M.: Beyond the Sin-G family: the transformed Sin-G family, PLoS ONE 16(5), 1–22 (2021) 11. Khan, M.S., King, R., Hudson, I.L.: Characterisations of the transmuted inverse Weibull distribution. ANZIAM J. 55(EMAC2013), C197–C217 (2014) 12. Kumar, D., Singh, U., Singh, S.K.: A new distribution using sine function – its application to bladder cancer patients data. J. Stat. Appl. Probab. 4(3), 417–427 (2015) 13. Lee, E.T., Wang, J.W.: Statistical Methods for Survival Data Analysis. Wiley, New York (2003) 14. Mahmood, Z., Chesneau, C., Tahir, M.H.: A new sine-G family of distributions: properties and applications. Bull. Comput. Appl. Math. 7(1), 53–81 (2019) 15. Nadarajah, S., Kotz, S.: Moments of some J-shaped distributions. J. Appl. Stat. 30, 311–317 (2003) 16. Nagarjuna, V.B.V., Vardhan, R.V., Chesneau, C.: On the accuracy of the sine power Lomax model for data fitting. Modelling 2021(2), 78–104 (2021) 17. Nair, N.U., Sankaran, P.G.: Quantile based reliability analysis. Commun. Stat. Theory Methods 38, 222–232 (2009) 18. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes: The Art of Scientific Computing, 3rd edn. Cambridge University Press, New York (2007) 19. Quirk, J.P., Saposnik, R.: Admissibility and measurable utility functions. Rev. Eco. Stud. 29(2), 140–146 (1962) 20. R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2014). http://www.R-project.org/ 21. Ruiz, J.M., Navarro, J.: Characterizations based on conditional expectations of the double truncated distribution. Ann. Inst. Stat. Math. 48, 563–572 (1996) 22. Shaw, W.T., Buckley, I.R.C.: The alchemy of probability distributions: beyond Gram-Charlier expansions, and a skew-kurtotic-normal distribution from a rank transmutation map. UCL discovery repository (2007). http://discovery.ucl.ac.uk/id/eprint/643923 23. Souza, L.: New Trigonometric Classes of Probabilistic Distributions, Thesis. Universidade Federal Rural de Pernambuco (2015) 24. Souza, L., Junior, W.R.O., de Brito, C.C.R., Chesneau, C., Ferreira, T.A.E., Soares, L.: On the Sin-G class of distributions: theory, model and application. J. Math. Model. 7(3), 357–379 (2019) 25. Souza, L., Junior, W.R.O., de Brito, C.C.R., Chesneau, C., Ferreira, T.A.E., Soares, L.: General properties for the Cos-G class of distributions with applications. Eurasian Bull. Math. 2(2), 63– 79 (2019) 26. Souza, L., Júnior, W.R.O, de Brito, C.C.R., Chesneau, C., Fernandes, R.L., Ferreira, T.A.E.: Tan-G class of trigonometric distributions and its applications. Cubo 23(1), 1–20 (2021)

Part II

Design of Experiments

Chapter 5

Incremental Construction of Nested Designs Based on Two-Level Fractional Factorial Designs

Rodrigo Cabral-Farias, Luc Pronzato, and Maria-João Rendas

Abstract The incremental construction of nested designs having good spreading properties over the d-dimensional hypercube is considered, for values of d such that the .2d vertices of the hypercube are too numerous to be all inspected. A greedy algorithm is used, with guaranteed efficiency bounds in terms of packing and covering radii, using a .2d−m fractional factorial design as candidate set for the sequential selection of design points. The packing and covering properties of fractional factorial designs are investigated, and a review of the related literature is provided. An algorithm for the construction of fractional factorial designs with maximum packing radius is proposed. The spreading properties of the obtained incremental designs, and of their lower-dimensional projections, are investigated. An example with .d = 50 is used to illustrate that their projection in a space of dimension close to d has a much higher packing radius than projections of more classical designs based on Latin hypercubes or low discrepancy sequences.

5.1 Introduction We consider the incremental construction of designs with large packing radius in the d-dimensional hypercube, using the coffee-house rule of [20] and [21, Chap. 4]: each new point introduced maximises the distance to its nearest neighbour in the current design. This simple algorithm is known to guarantee an efficiency of 50% in terms of packing and covering radii, for each design size along the construction. Intuitively, when d is large, the first points selected are vertices of the hypercube, and we shall provide arguments that validate this intuition. However, when d is very large, it is impossible to inspect all vertices and select one at every iteration. We show that restriction of the search to fractional factorial designs having a large enough covering radius does not entail any loss of performance



up to some design size: an example shows that designs of size up to .215 + 1 = 32 769, with 50% packing and covering efficiencies, can be constructed in this way when .d = 50. The packing and covering properties of these designs when projected on smaller dimension subspaces are investigated. Transformation rules based on rescaling are proposed to generate designs that populate the interior of the hypercube. Numerical computations indicate that the designs obtained have slightly larger covering radii than more classical space-filling designs based on (nonincremental) Latin hypercubes or (incremental) Sobol’ low discrepancy sequence, but have significantly larger packing radii. The chapter is organised as follows. Section 5.2 sets notation and recalls the definitions of packing and covering radii and the incremental construction of designs based on the coffee-house rule. The main properties of two-level fractional factorial designs are recalled in Sects. 5.3 and 5.4 to make the chapter self-contained. Their spreading properties are investigated in Sects. 5.5 (packing radius) and 5.6 (covering radius). An algorithm is given in Sect. 5.5 for the construction of fractional factorial designs with large covering radii. Section 5.7 studies the restriction of the coffee-house rule to two-level fractional factorial designs and shows that the 50% packing and covering efficiencies are preserved when the fractional factorial design has minimum Hamming distance at least .d/4. A rescaling rule is proposed to generate incremental designs not concentrated on the vertices of the hypercube, and properties of projections on smaller dimensional subspaces are investigated. An example in dimension .d = 50 illustrates the presentation. Section 5.8 briefly concludes.

5.2 Greedy Coffee-House Design

Let 𝒳 denote a compact subset of ℝ^d with nonempty interior; throughout the chapter, we consider the case where 𝒳 is the d-dimensional hypercube C_d = [−1, 1]^d. Denote by X_k = {x_1, ..., x_k} a k-point design when the ordering of the x_i is not important and by X_k = [x_1, ..., x_k] the ordered sequence; for 1 ≤ k_1 ≤ k_2, X_{k_1:k_2} denotes the design formed by [x_{k_1}, x_{k_1+1}, ..., x_{k_2}], with X_{1:k} = X_k. The jth coordinate of a design point x_i is denoted by {x_i}_j, j = 1, ..., d; \|x\| = \big(\sum_{i=1}^{d}\{x\}_i^2\big)^{1/2} denotes the ℓ_2 norm of the vector x ∈ ℝ^d, and \|x\|_1 = \sum_{i=1}^{d}|\{x\}_i| (respectively, \|x\|_\infty = \max_{i=1,...,d}|\{x\}_i|) is its ℓ_1 (respectively, ℓ_∞) norm. For any x ∈ ℝ^d and any k-point design X_k in 𝒳, we denote d(x, X_k) = \min_{i=1,...,k}\|x − x_i\|. For x and x' two vectors of the same size, z = x ∘ x' denotes their Hadamard product, with components {z}_i = {x}_i {x'}_i. B(x, r) denotes the closed ball with centre x and radius r. For 𝒜 a finite set, |𝒜| is the number of elements in 𝒜.


Space-filling design aims at constructing a set X_k of points in 𝒳, with given cardinality k, that “fill” 𝒳 in a suitable way; see, e.g. [25, 26]. Two measures of performance are standard. The covering radius of X_k is defined by

CR(X_k) = \max_{x \in \mathcal{X}} d(x, X_k). \qquad (5.1)

It corresponds to the smallest r such that the k closed balls of radius r centred at the x_i cover 𝒳. CR(X_k) is also called the dispersion of X_k [22, Chap. 6] and corresponds to the minimax distance criterion [12] used in space-filling design; small values are preferred. Another widely used geometrical criterion of spreadness is the packing radius

PR(X_k) = \frac{1}{2}\min_{x_i, x_j \in X_k,\, x_i \neq x_j}\|x_i - x_j\|. \qquad (5.2)

PR(X_k) is also called the separating radius, and it corresponds to the largest r such that the k open balls of radius r centred at the x_i do not intersect; 2 PR(·) corresponds to the maximin distance criterion [12] often used in computer experiments; large values are preferred. We may also consider the combined measure given by the mesh ratio

\tau(X_k) = \frac{CR(X_k)}{PR(X_k)},

with τ(X_k) ≥ 1 for any design X_k when 𝒳 is convex, since the k balls B(x_i, PR(X_k)) cannot cover 𝒳. When the objective is to construct a sequence X_k = [x_1, ..., x_k] such that PR(X_k) is reasonably large, and/or CR(X_k) is reasonably small, for all k ∈ {2, ..., n}, the following greedy algorithm, called coffee-house design ([20], [21, Chap. 4]), may be used. See also [14] for an early suggestion.

Algorithm 1 (Coffee-House)
(0) Select x_1 ∈ 𝒳, and set S_1 = {x_1} and k = 1.
(1) for k = 1, 2, ... do: find x_* ∈ Arg max_{x ∈ 𝒳} d(x, S_k), and set S_{k+1} = S_k ∪ {x_*}.

The point x_* can be obtained by a Voronoï tessellation of 𝒳 (when d is small enough) or an MCMC method; see [25]. Note that the choice of x_* is not necessarily unique. The construction is much easier when a finite candidate set X_n with n points is substituted for 𝒳 at Step 1 (a small illustration in R is given after Theorem 1 below). In the chapter, we show that when 𝒳 = C_d, a well-chosen X_n yields a drastic simplification of calculations for very large d but does not entail any loss of performance for the greedy algorithm. For a given order of magnitude of the anticipated number of design points to be used, we informally define the notions of small, large and very large dimension d as follows: small d are such that the construction of designs with 2^d points may be considered; large d


correspond to situations where the greedy construction above with a candidate set Xn containing all .2d vertices of .Cd is conceivable; and very large d cover cases where exploration of all .2d vertices of .Cd is unfeasible. For instance, .C50 has more than .1015 vertices, a situation considered in Sect. 5.7. Let .CR∗n = minXn CR(Xn ) denote the minimum covering radius for an n-point design in .X , .n ≥ 1, and .PR∗n = maxXn PR(Xn ) denote the maximum packing radius, .n ≥ 2. The following property is a consequence of [8]. Note that the efficiencies .CR∗k /CR(Sk ) and .PR(Sk )/PR∗k belong to .[0, 1] by definition; large values are preferred for both.

.

Theorem 1 The sequence of designs .Sk constructed with Algorithm 1 satisfies .

1 PR(Sk ) CR∗k 1 ≥ (k ≥ 1) and (k ≥ 2) . ∗ ≥ CR(Sk ) 2 PRk 2

(5.3)

Moreover, τ(S_k) ≤ 2 for all k ≥ 2.

Proof By construction, for all k ≥ 1, PR(S_{k+1}) = d(x_{k+1}, S_k)/2 = CR(S_k)/2. Therefore, for all k ≥ 2, τ(S_k) = CR(S_k)/PR(S_k) = 2 PR(S_{k+1})/PR(S_k) ≤ 2. Also, from the pigeonhole principle, for any pair of k- and (k+1)-point designs X'_k and X'_{k+1}, one of the balls B(x_i, CR(X'_k)) with x_i in X'_k contains two points x'_i and x'_j of X'_{k+1}, which implies CR(X'_k) ≥ PR(X'_{k+1}). Therefore, for the greedy construction, we have in particular CR^*_k ≥ PR(S_{k+1}) = CR(S_k)/2 and PR^*_{k+1} ≤ CR(S_k) = 2 PR(S_{k+1}).
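The following minimal R sketch (an addition, not the authors' implementation) runs the coffee-house rule of Algorithm 1 over a finite candidate set and reports the covering radius CR (computed with respect to the candidate set, which is the setting used when X_n replaces the hypercube at Step 1) and the packing radius PR of the resulting design. The choice of candidate set at the end is an illustrative assumption.

## Greedy coffee-house construction over a finite candidate set X (n x d matrix)
coffee_house <- function(X, n_points, start = 1) {
  sel  <- start
  dmin <- sqrt(colSums((t(X) - X[start, ])^2))   # distances of all candidates to S_1
  for (k in 2:n_points) {
    new  <- which.max(dmin)                       # farthest candidate from the current design
    sel  <- c(sel, new)
    dmin <- pmin(dmin, sqrt(colSums((t(X) - X[new, ])^2)))
  }
  design <- X[sel, , drop = FALSE]
  list(design = design,
       CR = max(dmin),                            # covering radius over the candidate set
       PR = min(dist(design)) / 2)                # packing radius of the design
}

## Illustration: candidates = the vertices of C_5 plus the centre (an assumed example)
d <- 5
X <- rbind(0, as.matrix(expand.grid(rep(list(c(-1, 1)), d))))
coffee_house(X, n_points = 10)[c("CR", "PR")]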

In the rest of the chapter, we take 𝒳 = C_d. Take x_1 = 0_d, the null vector of dimension d, which corresponds to the centre of C_d. The design S_1 = [x_1] thus has minimum covering radius, with CR(S_1) = √d. When applying Algorithm 1, for all k such that CR(S_k) = √d, the point x_* chosen at Step 1 is then necessarily a vertex of C_d; that is, x_k ∈ {−1, 1}^d for k = 2, ..., k_*(d), where k_*(d) is the first k such that CR(S_k) < √d. (Note that this implies that CR^*_n ≥ √d/2 for all n ≤ k_*(d) − 1.) This simple property has the important consequence that the greedy construction of a design S_n via Algorithm 1 initialised at x_1 = 0_d can restrict its attention to the set of vertices of C_d, provided that n ≤ k_*(d). For d ≤ 4, since any pair of distinct vertices of the hypercube are at distance at least 2 ≥ √d, Algorithm 1 sequentially (and indifferently) selects the vertices of C_d until they are exhausted, and k_*(d) = 2^d + 1. For larger d, the behaviour depends on the order in which vertices are selected in the first iterations, that is, on the particular choices of x_* made at Step 1. The largest values of k_*(d) obtained for d up to 12 are indicated in Table 5.1. We shall see in Sect. 5.7.1 that k_*(d) is large enough for practical applications when d gets very large; see (5.12). Occasionally, Algorithm 1 may still take x_* among vertices of C_d when k ≥ k_*(d), that is, when CR(S_k) < √d; this again depends on the first choices made for x_*. Denote by k_{NV}(d) the first k > 1 such that x_* chosen at Step 1 is not a vertex of C_d (with necessarily k_{NV}(d) ≥ k_*(d)); the largest values of k_{NV}(d) that we have obtained are also indicated in Table 5.1.

Table 5.1 First k such that CR(S_k) < √d (k_*(d)) and first k > 1 such that x_k is not a vertex of C_d (k_{NV}(d))

d           2   3   4    5    6    7     8     9    10   11    12
k_*(d)      5   9   17   17   33   65    129   33   65   129   257
k_{NV}(d)   5   9   17   33   65   129   257   129  65   129   257

5.3 Two-Level Fractional Factorial Designs This section only gives a brief summary of the topic; one may refer to [2, 3] for a thorough and illuminating exposition.

5.3.1 Half Fractions: m = 1 A .2d factorial (or full factorial) design is formed by the .2d vertices of .Cd ; each design point .xi is such that .{xi }j ∈ {−1, 1}, .i = 1, . . . , 2d , .j = 1, . . . , d. The notation used for a .2d factorial design is illustrated in Table 5.2a for the case .d = 3. The coordinates of design points correspond to factors and are denoted by lowercase letters.1 A (regular) .2d−m fractional factorial design is obtained by setting .d − m coordinates (sometimes called basic factors) of the .2d−m design points at values given by a .2d−m factorial design, the other m coordinates being defined by generating equations, or generators, that explain how they are obtained (calculated) from the basic factors. Without any loss of generality, we can suppose that the basic factors correspond to the first .d − m coordinates. Table 5.2b shows the .24−1 fractional obtained from the generating equation .{x}4 = d = abc. factorial design .X(a) 24−1 By the product of two factors, we mean the entrywise (Hadamard) product of the

1 The

design in Table 5.2a is listed in what is called standard order.

82

R. Cabral-Farias et al.

Table 5.2 (a) A .23 factorial design and (b) and (c) two .24−1 fractional factorial designs, with = a, .{x}2 = b, .{x}3 = c and .{x}4 = d

.{x}1

(a) A .23 factorial design .X8 .{x}1 = a .{x}2 =b .x1 −1 −1 .x2 1 −1 .x3 −1 1 .x4 −1 −1 .x5 1 1 .x6 1 −1 .x7 −1 1 .x8 1 1

.{x}3 =c

−1 −1 −1 1 −1 1 1 1

(b) A .24−1 fractional factorial design .X8 a b c .d = abc .x1 −1 −1 −1 −1 .x2 1 −1 −1 1 .x3 −1 1 −1 1 .x4 −1 −1 1 1 .x5 1 1 −1 −1 .x6 1 −1 1 −1 .x7 −1 1 1 −1 .x8 1 1 1 1

(c) Another .24−1 fractional factorial design .X8 a b c .d = ab .x1 −1 −1 −1 1 .x2 1 −1 −1 −1 .x3 −1 1 −1 −1 .x4 −1 −1 1 1 .x5 1 1 −1 1 .x6 1 −1 1 −1 .x7 −1 1 1 −1 .x8 1 1 1 1

corresponding columns in the design viewed as a .n × d matrix. Since all .{x}i belong to .{−1, 1} for .x in .X2d−m , this implies in particular that the product of a factor by itself gives a vector with all components equal to 1, which we denote by .1. The equation .d = abc is thus equivalent to .1 = abcd, called defining relation. Changing the generating equation to .d = ab gives another .24−1 fractional factorial (b) design .X24−1 , presented in Table 5.2c. Both designs are called a half fraction of the full factorial design with .d = 4. Since .d = ab in Table 5.2c, .{x}4 = {x}1 {x}2 for all (b) .x in .X 4−1 , and this design does not allow us to estimate separately the main effect 2 of .{x}4 and the interaction .{x}1 {x}2 ; these effects are said confounded or aliased. The equation .d = ab also implies .a = bd and .b = ad, showing that the effects of .{x}1 and .{x}2 {x}4 are confounded, as well as those of .{x}2 and .{x}1 {x}4 . We say that this design has resolution .R = I I I (notation with a Roman numeral is traditional): no p factor effect is confounded with any other effect containing less than .R − p factors, .p = 0, . . . , R. For the design in Table 5.2b, we get .a = bcd, .b = acd, .c = abd and of course .d = abc which is the generating equation. Here, none of the main and two factor interaction effects are confounded, and the design has resolution .R = I V . In general, designs of high resolution are preferable. When .m = 1, the highest possible resolution .R = d is obtained for the half fraction with  defining relation .{x}d = d−1 i=1 {x}i (unique up to a sign change and a permutation of variables).

5.3.2 Several Generators 5.3.2.1

Defining Relations

A .2d−m fractional factorial design with .m > 1 requires more than one generating equation, and the construction of suitable designs with high resolution has motivated intensive research since the pioneering papers [2, 3]. To ensure that the resolution

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

83

is larger than I I , the generating equations are chosen independent, which means that a generator cannot be obtained by multiplying together two other generators. It implies that there are no repetitions within the columns of the design table. If this were not the case, two main effects would be confounded since we would have .{x}i = {x}j for some .i, j ∈ {1, . . . , d} and all .x in the design. The generators are called principal when there are only positive signs in the defining relations. When multiplying the m generating equations by the d independent factors and by themselves in all possible ways, we obtain the complete set of defining relations. Principal defining relations are obtained from principal generators. The complete set of principal defining relations defines a unique fraction, that is, a unique design up to .2m sign changes in the variables defined by the generators. The set contains m defining relations (including the trivial one .1 = 1), each one having the form .1 .2 equals the product of a subset of factors, called word. For example, the .26−2 design with independent factors .a, b, c and d and generating equations .e = abcd and .f = acd has defining relations .1 = abcde, .1 = acdf and .1 = bef , the latter being obtained by multiplying the first two since .(abcde) × (acdf ) = a 2 bc2 d 2 ef = bef .

5.3.2.2

Resolution

The resolution of the design is given by the shortest word length within the complete set of defining relations; here, .R = I I I (since .1 = bef ). Another choice of generating equations may yield a different set of defining relations and a design with different resolution. For instance, choosing .e = abc and .f = acd in a .26−2 design yields the complete set of defining relations .1 = abce = acdf = bdef , and is the resolution is now I V . To identify the resolution of a design, the notation .2d−m R used. For example, the design with six variables and generating equations .e = abcd d−m d−m points in d variables) and .f = acd is denoted .26−2 I I I . Designs .2I I I (with .n = 2 can be constructed for d up to .n − 1 and n a power of 2 (designs with .d = n − 1 are called saturated; see Sect. 5.3.3), and designs .2d−m I V can be constructed for d up to 7−4 15−11 31−26 7−3 8−4 16−11 .n/2 and n a power of 2; generators for .2 , .2 III I I I , .2I I I , .2I V , .2I V and .2I V can be found in [2]. The nonregular designs of [23] have resolution I I I and allow the exploration of .n − 1 variables for n a multiple of 4. Generators for designs .28−2 V and .2V11−4 are given in [3]. The MATLAB function fracfactgen.m implements the algorithm of [6] for the construction of a .2d−m fractional factorial design of prescribed resolution (when it exists). A design with resolution R contains a full factorial design in any subset of .R − 1 variables. Omitting p variables from a .2d−m design with resolution R produces a R design of resolution R in .d − p variables but with .n = 2d−m points. All words containing characters associated with the dropped variables must be removed from the set of defining relations. The resulting design may duplicate some design points, and a more economical design with similar word pattern in the defining relations may exist in general. Bounds on the maximum resolution attainable for a .2d−m design are given in [7].

84

5.3.2.3

R. Cabral-Farias et al.

Word Length Pattern

The word length pattern .A (Xn ) of a .2d−m design with resolution R is defined by the distribution of word lengths in the complete set of defining relations, A (Xn ) = [1, 0, . . . , 0, AR (Xn ), AR+1 (Xn ), . . . , Ad (Xn )] ,

.

with .Ak (Xn ) denoting the number of words of length k (.A0 (Xn ) = 1 since the  (a) word .1 is always present and . dk=0 Ak (Xn ) = 2m ). Among two designs .Xn and .X(b) n having the same (maximum) resolution R, the paper [7] recommends to select the one with minimum aberration: let .i∗ be the smallest .i ≥ 1 such that (a) (b) (a) (b) (a) (b) .Ai (Xn ) = Ai (Xn ); then .Xn is preferred to .Xn if .Ai∗ (Xn ) < Ai∗ (Xn ), (b) (a) and .Xn is preferred to .Xn otherwise.2 The construction of a minimum aberration design can thus be viewed as the sequential minimisation of the .Ai (Xn ) for .i ≥ 1. A minimum aberration .2d−m design has necessarily generators that contain all d variables [7]; lists of generators are tabulated in [33].

5.3.3 Minimum Size Proposition 1 The size of a 2d−m design necessarily satisfies n = 2d−m ≥ d + 1. The designs for which equality holds have resolution I I I . Proof Since the m generators must be independent and each one must involve at least 2 of the d − m basic factors, we get m≤

d−m 

.

k=2

 d −m = 2d−m − (d + 1 − m) , k

that is, n = 2d−m ≥ d + 1. Designs for which equality holds are those that use all possible independent generators (without any loss of generality, we only consider principal generators). They cannot have resolution R larger than I I I since there are generators defined as a product of two basic factors and thus defining relations involving words of length 3. We prove by contradiction that they cannot have resolution I I . If the design has resolution I I , it means that one of the defining relations has been obtained by multiplying two relations 1 = w and 1 = z, with words w and z that only differ by two letters, say a, b. There are two possibilities: either w = ta and z = tb or w = tab and z = t. In both cases, the multiplication w × z gives the defining relation 1 = ab, which cannot exist since the generators are independent.



2 It may happen, though rarely, that two designs with different defining relations have exactly the same word length pattern; the minimum aberration criterion then does not provide any preference.

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs Table 5.3 Saturated designs

d m n

3 1 4

7 4 8

15 11 16

31 26 32

63 57 64

127 120 128

85 255 247 256

Proposition 1 gives a lower bound on the number of points for a given dimension d; it also gives an upper bound on the number of generators that can be used for a given d, m ≤ m∗ (d) = d − log2 (d + 1) .

.

(5.4)

That is, for 2k ≤ d < 2k+1 , we can construct 2d−m fractional factorial designs with m ∈ {1, 2, . . . , m∗ (d) = d − k − 1}. Values of d, m and n for minimum-size 2d−m designs with d + 1 = 2d−m , called saturated designs, for d up to d = 255 are given in Table 5.3.

5.4 Two-Level Factorial Designs and Error-Correcting Codes 5.4.1 Definitions and Properties The construction of a two-level factorial design .Xn possesses strong similarities with the construction of an error-correcting code .Cn with binary alphabet .{0, 1}: design points correspond to codewords in .Cn , and d is the length of the code, with d−m for a fractional factorial design). Associating levels 1 and .n = |Cn | (and .n = 2 .−1 to symbols 0 and 1, respectively, we obtain that the product rule used in Sect. 5.3 corresponds now to addition modulo 2 and the codes corresponding to fractional factorial designs, which are obtained through generating equations, are linear. Since .{xi }j ∈ {−1, 1} for each design point and any .j ∈ {1, . . . , d}, the Hamming distance .dH (xi , xj ), which counts the number of components that differ between two design points .xi and .xj , satisfies dH (xi , xj ) =

.

1 1 xi − xj 2 = xi − xj 1 . 4 2

The minimum distance of .Cn , .ρH (Cn ), is defined as the minimum Hamming distance between two codewords in .Cn , and we shall write .ρH (Xn ) = ρH (Cn ) with .Cn the code associated with .Xn . More generally, .ρH (Xn ) = minxi ,xj ∈Xn , xi =xj dH (xi , xj ) for any design .Xn supported on the vertices of .Cd . Therefore, ρH (Xn ) = PR2 (Xn ) .

.

86

R. Cabral-Farias et al.

Similarly, the (Hamming) covering radius .CRH (Xn ) of a two-level fractional factorial design .Xn corresponds to the covering radius .CRH (Cn ) of the associated code, and we define more generally CRH (Xn ) =

.

max

min dH (x, xi ) .

x∈{−1,1}d xi ∈Xn

(5.5)

Several results from coding have their counterpart in design theory. Suppose that

ρH (Xn ) ≥ 2k + 1 for some .k ∈ N. For each of the n design points .xi , there are . d points in .{−1, 1} that are at distance . from .xi . Since the n Hamming balls centred we obtain the sphere-packing bound (see, at the .xi with radii k do

not intersect, e.g. [32, Th. 20.1]): .n k=0 d ≤ 2d . For a fractional factorial design .Xn with d−m , it gives .n = 2 .

2m ≥

k    d

.

=0



.

(5.6)

Note that .CRH (Xn ) ≥ ρH (Xn )/2. When .ρH (Xn ) = 2k + 1 and equality is reached in (5.6), all points in .{−1, 1}d are at Hamming distance at most k to exactly one design point in .X2d−m , which corresponds to the notion of perfect code. Delete now the .p − 1 last coordinates of each .xi ∈ Xn , with .p = ρH (Xn ). The n points that are obtained belong to .{−1, 1}d−(p−1) and are all distinct. Therefore, their number n is less than .2d−p+1 , which gives the Singleton bound ([29], [32, Th. 20.2]): .n ≤ 2d−p+1 . For a fractional factorial design .X2d−m , we obtain ρH (X2d−m ) ≤ m + 1 .

.

(5.7)

Another result from coding theory gives an upper bound on the size n of a design Xn supported on .{−1, 1}d when .ρH (Xn ) is large: Plotkin bound [16, 24, Th. 5.5.2] states that ρH (Xn ) (5.8) .n ≤ ρH (Xn ) − d/2

.

when .ρH (Xn ) > d/2. Besides the value of the packing radius .PR(Xn ), the distribution of the distances .xi − xj , or .dH (xi , xj ), between pairs of design points is also of interest. This is particularly true in the present context where there exist many pairs of points at the same distance since all design points are vertices of the hypercube. In [12], a design ∗ .Xn is called maximin distance optimal when it maximises .PR(Xn ) and minimises the number of pairs of points at distance .2 PR(X∗n ). That definition is extended as follows in [19]. For a given design .Xn , consider the list .[d1 , d2 , . . . , dq ] of intersite distances sorted in decreasing order, with .d1 = 2 PR(Xn ) and .1 ≤ q ≤ n(n − 1)/2. Denote by .J (Xn ) = [J1 , . . . , Jq ] the associated counting list defined by .Jk =

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

87



(i, j ) : xi − xj  = dk , xi , xj ∈ Xn } , .k = 1, . . . , q. In [19], a design is called maximin distance optimal if it maximises .d1 , and among all such designs minimises .J1 , maximises .d2 , and among all such designs minimises .J2 . . . and so on. Following [35], we call (Hamming) distance distribution of a design .Xn supported on .{−1, 1}d the list .B(Xn ) = [B0 (Xn ), B1 (Xn ), . . . , Bd (Xn )] where Bk (Xn ) =

.

1

(i, j ) : dH (xi , xj ) = k, xi , xj ∈ Xn } , k = 0, . . . , d n

(5.9)

 (so that . dk=0 Bk (Xn ) = n and .B0 (Xn ) = 1 when all points are distinct). Let .Xn be a .2d−m fractional factorial design; .Xn is balanced, i.e. each value .+1 and .−1 appears equally often  dfor each factor, and for any .xi ∈ Xn , . k=1 dH ({xi }k , {xj }k ) = nd/2. Therefore, xj ∈Xn , j =i dH (xi , xj ) = xj ∈Xn , j =i d . k=1 k Bk (Xn ) = nd/2, and interpreting .Bk (Xn )/(n − 1) as a weight on k, we get ρH (Xn ) = min{k ∈ {1, . . . , d} : Bk (xn ) > 0} ≤

.

nd . 2(n − 1)

(5.10)

Let p denote the number of generators written as the product of an odd number of basic factors (.p ≥ 0). For any .xi ∈ Xn , the design point .xj obtained by changing the signs of the .d − m basic factors is at Hamming distance .dH (xi , xj ) = d − m + p from .xi ; that is, Bd−m+p (Xn ) ≥ 1 .

.

In particular, it implies that .ρH (Xn ) ≤ d − m + p. Also, since each point has at most one point at Hamming distance d, when .p = m, we have .Bd (Xn ) ≤ 1 and thus (a1) (a2) (a) (b) .Bd (Xn ) = 1; see, for example, the designs .X 16 , .X16 , .X32 and .X32 of Table 5.4. Due to the equivalence between Hamming and Euclidean distances for a .2d−m design, design selection based on maximin distance optimality in the sense of [19] sequentially minimises the .Bk (Xn ) for .k ≥ 1; it is similar to selection by the minimum aberration criterion of [7] applied to the distance distribution instead of the word length pattern. In [15], minimum aberration designs are called maximin word length. As noticed in [35], MacWilliams’ theorem (see, e.g. [32, Th. 20.3]) implies that the distance distribution .Bk (Xn ) and the word length pattern .A (Xn ) of a given .2d−m design .Xn are related by Aj (Xn ) =

.

d 1  Bk (Xn )Pj (k; d, 2) , j = 0, . . . , d , n k=0

Bj (Xn ) = n 2−d

d  k=0

Ak (Xn )Pj (k; d, 2) , j = 0, . . . , d ,

(5.11)

88

R. Cabral-Farias et al.

Table 5.4 .2d−m designs, generators, word length patterns, distance distributions and covering radii Generators

.A

6−2 .2I V

(a1) .X16

abc, acd

.[1

0 0 0 3 0 0]

.[1

0 3 8 3 0 1]

1

6−2 .2I I I

(b1) .X16

abcd, acd

.[1

0 0 1 1 1 0]

.[1

0 4 6 3 2 0]

2

7−3 .2I V

(a2) .X16

bcd, abd, acd

.[1

0 0 0 7 0 0 0]

.[1

0 0 7 7 0 0 1]

1

7−3 .2I I I

(b2) .X16

bcd, abd, abcd

.[1

0 0 2 3 2 0 0]

.[1

0 1 6 5 2 1 0]

2

.X32

(a)

abc, bcd

.[1

0 0 0 3 0 0 0]

.[1

1 3 11 11 3 1 1]

1

(b) .X32

abc, ade

.[1

0 0 0 2 0 1 0]

.[1

0 6 9 9 6 0 1]

1

(c) .X32

abcd, abce

.[1

0 0 0 1 2 0 0]

.[1

0 5 12 7 4 3 0]

1

abcde, abcdf ,

.[1

0 0 0 6 12 8 0 1 4 0 0]

.[1

0 1 0 14 24 6 8 9 0 1 0]

2

.[1

0 0 4 11 18 15 8 4 2 1 0] 2

.[1

0 0 2 14 22 8 6 9 2 0 0]

Design

7−2

.2I V

(a) 11−5 .2I V .X64

.B

.CRH

abcef, abdef, cdef (b) .X64

abcd, abce, acdf , .[1 0 0 0 7 9 6 6 2 1 0 0] .cdef , .abcdef

(c)

.X64

cde, bde, abcdf ,

.[1

0 0 0 4 14 8 0 3 2 0 0]

2

abce, .adef (d)

.X64

.cdef , .adef , .abef , .[1 .abcf ,

0 0 0 5 10 10 5 0 0 0 1] .[1 0 0 0 25 0 27 0 10 0 1 0] 3

bcdf

where the .Pj (x; d, s) are the Krawtchouk polynomials defined by Pj (x; d, s) =

.

j  (−1)i (s − 1)j −i i=0

×

Γ (x + 1) Γ (x + 1 − i)Γ (i + 1)

Γ (d + 1 − x) , Γ (d + i + 1 − x − j )Γ (j + 1 − i)



j so that .Pj (k; d, 2) = i=0 (−1)i ki d−k j −i . Several extensions of the results above, in various directions, are present in the literature. Let us mention a few. Fractional factorial designs with s levels, with s any prime number, are considered in [35], together with designs where different factors may have different numbers of levels, and the notion of generalised minimum aberration is introduced; see also [4]. Space-filling properties of fractional factorial designs with more than two levels are studied in [36], where it is shown that the generalised minimum aberration designs of [35] have good performance in terms of maximin distance for the .1 norm when allowing permutations of factor levels. Starting from an initial s-level balanced design .Xn , where each level appears exactly  .n/s times for each one of the d factor, [34] shows how to construct a design .Xn with  d factors at qs levels, for n divisible by qs (.Xn is a Latin hypercube design when

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

89

q = n/s). When .s > 2, the space-filling properties of .Xn (measured by the maximin distance for the .1 norm) can be improved by level permutation, using the approach in [36]. Following the approach of [18], properties of .2d−m designs for prediction with a Gaussian process model defined on the vertices .{−1, 1}d of the hypercube d .[−1, 1] are investigated in [15]; a practical conclusion is that maximin word length (minimum aberration) designs often coincide with maximin distance designs, but not always. The paper [1] shows how to decompose a minimum aberration .2d−m design into layers containing two points each, in such a way that the resulting design has suitable space-filling properties. The construction of two-level factorial designs having small covering radius (5.5) is considered in [11] (note, however, that .CRH (Xn ) is not necessarily an adequate measure of the space-filling properties of .Xn over the full hypercube .[−1, 1]d ); a few general properties are given, and the construction of minimum-size covering designs having .CRH (Xn ) = 1 and minimum-size designs with .CRH (Xn ) = 2 is detailed for .d ≤ 7 (with rather intensive computer search for .d = 7). The centred .L2 discrepancy .CL2 (Xn ) of [9] is a popular measure of uniformity of a design .Xn . For a .2d−m fractional factorial design, .CL2 (Xn ) is a function of the .Ai (Xn ) in the word length pattern .A (Xn ) [5]; see also [31] for related results. A relation between .CL2 (Xn ) and the distance distribution .B(Xn ) is established in [30] for more general balanced designs (with n runs and d factors, each one taking s levels and, for each factor, each level appearing equally often). .

5.4.2 Examples The .24−1 I V design .X8 of Table 5.2b, with generator .d = abc, has word length pattern .A = [1 0 0 0 1]; its distance distribution is .B = [1 0 6 0 1]; it reaches the bound (5.7) since .ρH (X8 ) = 2 = m + 1. The design in Table 5.2c with .d = ab has resolution I I I , .A = [1 0 0 1 0] and .B = [1 1 3 3 0]; it is thus worse than the previous one in terms of both aberration and maximin distance. (a1) Other examples with more factors are presented in Table 5.4. .X16 (respectively, (a2) (b1) (b2) .X 16 ) is better than .X16 (respectively, .X16 ) in terms of both resolution and (a2) maximin distance. .X16 reaches the bound (5.6), and it corresponds to a perfect code of length 7, distance 3 and covering radius 1; see, e.g. [32, p. 215]. The three (a) (b) (c) 7−2 .2 I V designs .X32 , .X32 and .X32 are those in Table 1 of [7]; they all have resolution I V , and the hierarchy .(a) ≺ (b) ≺ (c) is respected in terms of both aberration and maximin distance, where .(a) ≺ (b) means that .(b) is preferable to .(a) for the (a) criterion considered. The word length pattern of .X64 is (slightly) better than that of (b) (b) (a) 11−5 .X 64 , but .X64 does better than .X64 in terms of maximin distance. The two .2I V (c) (d) (c) designs .X64 and .X64 are given in Table 3 of [15]; .X64 has minimum aberration but (d) is worse than .X64 in terms of maximin distance.

90

R. Cabral-Farias et al.

5.5 Maximin Distance Properties of Two-Level Factorial Designs 5.5.1 Neighbouring Pattern and Distant Site Pattern For any design .Xn supported on the .2d vertices of .Cd and any .xi ∈ Xn , we call neighbouring pattern of .xi the counting list .L (xi ; Xn ) = [1, I1 (xi ; Xn ), . . . , .Id (xi ; Xn )] with .Ik (xi ; Xn ) = {j : dH (xi , xj ) = k, xj ∈ Xn } . Similarly, we call distant site pattern the list .L (xi ; Xn ) = [0, 1 (xi ; Xn ), . . . , I d (xi ; Xn )] with

Id−m .I k (xi ; Xn ) = {j : dH (xi , xj ) = k, xj ∈ Xn } . .2 fractional factorial designs satisfy the following property. Proposition 2 All design points .xi of a .2d−m fractional factorial design have the same neighbouring pattern and the same distant site pattern. Proof Take any .x ∈ Xn ; without any loss of generality, we suppose that basic factors correspond to the first .d − m coordinates, and we denote by .x the corresponding part of .x. The remaining m components are constructed from the generators that define the design; we can write .{x}d−m+k = gk (x), with .gk (x) equal to the product of some components of .x, .k = 1, . . . , m. We collect those m components in a vector .g(x) and write .x = (x, g(x)). Suppose that there exist .xj ∈ Xn such that .dH (x, xj ) = k. We first show that for any .x ∈ Xn , there also exists a .xj ∈ Xn such that .dH (x , xj ) = k. Using the same notation as above, we can write .x = (x , g(x )), and, since .x ∈ Xn , .x = z ◦ x with .z a .(d − m)-dimensional vector with components in .{−1, 1}. Therefore, x = (z ◦ x, g(z ◦ x)) = (z ◦ x, g(z) ◦ g(x)) = (z, g(z)) ◦ x.

.

The vector .xj = (z ◦ xj , g(z ◦ xj )) = (z, g(z)) ◦ xj also belongs to .Xn (since the first d − m coordinates of design points in .Xn form a .2d−m factorial design) and satisfies   .dH (x , x ) = dH (x, xj ). j To conclude the proof that all design points have the same neighbouring pattern, we only need to show that if .xi and .xj are two distinct points in .Xn , say with   .dH (x, xi ) = dH (x, xj ) = k, then .x = (z ◦ xi , g(z ◦ xi )) and .x = (z ◦ xj , g(z ◦ xj )) i j     are distinct points in .Xn satisfying .dH (x , xi ) = dH (x , xj ) = k. The equality between distances has already been proved; the points are distinct since .xi = (z, g(z)) ◦ xi = (z, g(z)) ◦ xj = xj .

Denote .Ik (xi ) = {j : dH (xi , xj ) = k, xj ∈ {−1, 1}d } . Since .I k (xi ; Xn ) = Ik (xi ) − Ik (xi ; Xn ) and .Ik (xi ) = Ik (xj ) for any .xi and .xj in .Xn , all design points have also the same distant site pattern.

.

This property explains why division by n in the definition .(5.9) of distance distribution yields integer values for the .Bk (Xn ): we have .Lk (xi ; Xn ) = B(Xn ) for any .2d−m fractional factorial design .Xn and any .xi ∈ Xn . A straightforward

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

91

consequence is we do not need to consider all pairs of points in .Xn to construct the distance distribution, but only the distances between one point and the .n − 1 others. In particular, this point can be taken as .1d , the d-dimensional vector with all components equal to 1 (provided that the design is constructed with principal generators with non-negative signs, which we assume throughout the chapter). As an illustration, below we consider the distance distribution of fractional factorial designs with .n = d + 1 (see Sect. 5.3.3), which is very peculiar. Proposition 3 Saturated .2d−m fractional factorial designs (.n = d +1) are maximin distance optimal; their distance distribution satisfies .B0 (Xn ) = 1, .B(d+1)/2 (Xn ) = n − 1 and .Bi (Xn ) = 0 for .i > 0, .i = (d + 1)/2. Proof From Proposition 2, we only need to consider the distance between one particular point, which we denote .x = (x, g(x)), and other points .x ∈ Xn ,     d−m−1 = n/2 = (d + 1)/2 when .x = (x , g(x )). We show that .dH (x, x ) = 2  .dH (x, x ) = 1, 2, . . . , m. Suppose that .dH (x, x ) = 1; let a be the basic factor that changes between .x and  .x . The number of generators that contain a is na =

d−m−1  

.

k=1

 d −m−1 = 2d−m−1 − 1, k

since there remains .d − m − 1 factors available and each defining relation contains at least two factors. It gives .dH (g(x), g(x )) = 2d−m−1 − 1, and thus .dH (x, x ) = 2d−m−1 . Suppose now that .dH (x, x ) = 2, with a and b the modified factors. The number of generators containing a and not containing b is nab =

d−m−2  

.

k=1

 d −m−2 = 2d−m−2 − 1, k

since now only .d − m − 2 factors remain available. We also need to count generators that contain b and not a, which gives .dH (x, x ) = 2 + 2(2d−m−2 − 1) = 2d−m−1 . The same calculation can be repeated when .dH (x, x ) = p, with factors .a1 , . . . , ap being modified, for any .p ≤ d − m. Suppose first that p is odd. There are .2d−m−p − 1 generators with .a1 alone (without .a2 , . . . , ap ), .2d−m−p with .a1 a2 a3 alone (without .a4 , . . . , ap ), etc., and .2d−m−p with all the .ai , .i = 1, . . . , p. It gives     p d−m−p p d−m−p 2 + 2 + · · · + 2d−m−p 3 5         p p p p = + + + ··· + 2d−m−p = 2p−1 2d−m−p = 2d−m−1 . 1 3 5 p

dH (x, x ) = p + p(2d−m−p − 1) +

.

92

R. Cabral-Farias et al.

Suppose now that p is even. Similar calculation gives dH (x, x ) =

.

        p p p p + + + ··· + 2d−m−p 1 3 5 p−1

= 2p−1 2d−m−p = 2d−m−1 . Therefore, .dH (x, x ) = 2d−m−1 = n/2 = (d + 1)/2 for any .x ∈ Xn , .x = x (note that it gives equality in the upper bound (5.10)). Plotkin bound (5.8) indicates that the size n of a design .Xn supported on .{−1, 1}d and such that .ρH (Xn ) > d/2, with d odd, is at least .d + 1, showing that saturated .2d−m designs are maximin distance optimal among all designs supported on .{−1, 1}d . The n design points of a saturated design are vertices of a regular simplex in .Cd with (Euclidean) edge length √ . 2(d + 1) and form a maximin-optimal design in .Cd .

The application of Algorithm 1 to the candidate set .Xn defined by a .2d−m design .Xn with .n = d + 1, initialised at any .xi ∈ Xn , ensures that .ρH (Sk ) = (d + 1)/2 = 2d−m−1 for all .k = 2, . . . , n (in fact, the property is true for any sequential selection of points within .Xn ). As the example below illustrates, the performance achieved in terms of .ρH may be superior to those obtained when the candidate set is .{−1, 1}d , the set of vertices of .Cd (i.e. the full factorial .2d design). Note that we have .ρH (Sk ) = CRH (Sk−1 ), .2 ≤ k ≤ |Xn |, when applying Algorithm 1 to a candidate set .Xn ⊆ {−1, 1}d ; see the proof of Theorem 1. Example 1 The left panel of Fig. 5.1 presents the evolution of .ρH (Sk ) for .d = 15 and .n = 16 = 24 when the candidate set in Algorithm 1 is the full .2d factorial design (red solid line) and a .2d−m fractional factorial design with .m = 11 (black dashed line). In the latter case, .ρH (Sk ) = (d + 1)/2 for all .k = 2, . . . , 16, which for

Fig. 5.1 Evolution of .ρH (Sk ) when Algorithm 1 is applied to the candidate set .Xn given by the .2d full factorial design (red solid line) and when .Xn = Xn is a .2d−m fractional factorial design (black dashed line); .d = 15, the algorithm is initialised at a .xi ∈ Xn . Left: .m = 11, .n = d + 1 = 16, .k = 2, . . . , 16. Right: .m = 7, .n = 256, .k = 2, . . . , 250

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

93

k ≥ 3 is larger than the value obtained in the first case, where all the .215 = 32 768 vertices are used as candidates. The fact that restricting the set of candidate points to a subset of .{−1, 1}d may be beneficial is further illustrated on the right panel of Fig. 5.1. There, the red solid line corresponds again to the candidate set given by the full .2d factorial design, whereas a .2d−m design with .m = 7 and .ρH (Xn ) = 4 is used for the black dashed-line curve (.n = 28 = 256).

.

In the Appendix, we give conditions on the choice of generators that provide guarantees on the minimum Hamming distance k of a fractional factorial design .Xn , i.e. .ρH (Xn ) ≥ k, for .k = 2, 3 and 4. However, the derivation of such conditions gets cumbersome when .k ≥ 5, and in the next section, we present an algorithm for the optimal selection of m generators among all .2d−m − (d − m) − 1 possible generators having length at least 2.

5.5.2 Optimal Selection of Generators by Simulated Annealing 5.5.2.1

SA Algorithm for the Maximisation of ρH

(0) Construct the set G all 2d−m −(d −m)−1 generators (of length ≥ 2), choose an initial set G of m distinct generators in G , construct the corresponding design Xn and its distance distribution B(Xn ), and set X∗n = Xn and k = 1. (1) Select a random generator g within G and a random generator g  within G \ G. Construct G = G \ {g} ∪ {g  } and its associated design Xn and distance distribution B(Xn ). (2) Compute i ∗ = min{i : Bi (Xn ) = Bi (X∗n )} and δ ∗ = Bi ∗ (Xn ) − Bi ∗ (X∗n ); if δ ∗ ≤ 0, set X∗n = Xn . (3) Compute i+ = min{i : Bi (Xn ) = Bi (Xn )}, and P = min exp −

G

Bi + (Xn )−Bi + (Xn ) Tk

, 1 ; accept the move Xn = Xn and G =

with probability P . (4) if k = K, stop; otherwise k ← k + 1, return to 1. A logarithmic decrease of the “temperature” parameter Tk used in Step (3), such as Tk = 2d−m / log(k), ensures global asymptotic convergence to a maximin distance optimal fractional factorial design. In practice, we wish to stop the algorithm after a number of iterations K which is not excessively large. Numerical experimentation indicates that a faster decrease of Tk , yielding a behaviour close to that of a simple descent method, is often suitable. For instance, we take Tk = 2d−m /k 4/5 in Example 2 (Sect. 5.7). Remark 1 In the applications we have in mind, n = 2d−m is relatively small even when d is large (it must nevertheless satisfy the bound n ≥ d + 1 of Proposition 1), and the set G remains of reasonable size. A similar simulated annealing algorithm can be used for the construction of minimum aberration designs. However, the

94

R. Cabral-Farias et al.

construction of the word length pattern for a 2d−m design requires the calculation of 2m − 1 defining relations, which becomes prohibitively computationally demanding when m is large (a consequence of n being reasonably small). For instance, for d = 50 and m = 35 (the situation in Example 2), we have 2d−m − (d − m) − 1 = 37 752 whereas 2m − 1 > 3.43 × 1010 . Computation of the word length pattern from the distance distribution using (5.11) may then be advantageous.

5.6 Covering Properties of Two-Level Factorial Designs In this section, we investigate the covering properties of fractional factorial designs measured by the Hamming covering radius .CRH defined by (5.5). One should notice that the best design in terms of maximin distance is not always the best one (d) in terms of covering radius; compare, for instance, .X(a) 64 and .X64 in Table 5.4.

5.6.1 Bounds on CRH (Xn ) We already know that .CRH (Xn ) ≥ ρH (Xn )/2 for any design on .{−1, 1}d ; see Sect. 5.4.1. .CRH (Xn ) also satisfies the following property. Proposition 4 The Hamming covering radius .CRH (Xn ) of a .2d−m fractional factorial design .Xn satisfies CRH (Xn [d − m + 1 : d]) ≤ CRH (Xn ) ≤ m ,

.

with .Xn [d − m + 1 : d] denoting the projection of .Xn on the m-dimensional space defined by the non-basic factors (i.e. those constructed through generators). Proof We use the same notation as in Sect. 5.5.1, and, for .xi ∈ Xn , we denote xi = (xi , g(xi )); also, we split any .x ∈ Rd into .x = (x, x), with .x ∈ Rd−m and m d .x ∈ R . For any .x ∈ {−1, 1} , there exists .j = j (x) such that .xj = x, so that .

.

min dH (x, xi ) ≤ dH (x, xj ) = dH (x, xj ) + dH (x, g(xj ))

xi ∈Xn

= dH (x, g(xj )) = dH (x, g(x)) . We also have .dH (x, xi ) = dH (x, xi ) + dH (x, g(xi )) ≥ dH (x, g(xi )). Therefore, .

max

min dH (x, g(xi )) ≤ CRH (Xn ) =

x∈{−1,1}d xi ∈Xn



max

min dH (x, xi )

max

dH (x, g(x)) .

x∈{−1,1}d xi ∈Xn x∈{−1,1}d

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

95

Finally, .maxx∈{−1,1}d minxi ∈Xn dH (x, g(xi )) = CRH (Xn [d − m + 1 : d]) and maxx∈{−1,1}d dH (x, g(x)) = m conclude the proof.



.

5.6.2 Calculation of CRH (Xn ) Direct calculation of .CRH (Xn ) through .maxx∈{−1,1}d minxi ∈Xn dH (x, xi ) is unfeasible for large d. Exploiting Proposition 2, a possible alternative consists in restricting

the set of candidates to the . dδ points at a given Hamming distance .δ from an arbitrary point .x1 in .Xn , for an increasing sequence .δi , initialised at .δ1 = ρH (Xn )/2. We thus compute .C(Xn , x1 , δ) = maxx:dH (x,x1 )=δ minxi ∈Xn dH (x, xi ) for .δ = δ1 , δ1 + 1 . . . and stop at the first .δi when .C(Xn , x1 , δ) starts decreasing; the value .δi−1 equals .CRH (Xn ). Although it requires less computations than direct calculation, this approach is still too costly for large d unless .CRH (Xn ) is very small (meaning that n is very large) or very large (meaning that n is very small). In Example 2 considered below, with .d = 50 and .n = 32, 768, we have .CRH (Xn ) = 13, and the construction is unapplicable. Therefore, hereafter, we present a simple local ascent algorithm for searching a distant point from .Xn , which we initialise at a design point. The construction relies on the search of a point in .{−1, 1}d at maximum Hamming distance from .Xn . From Proposition 2, we only need to consider moves from an arbitrary point of .Xn . The order of inspection of the d factors in the for loop of Step 1 may be randomised.

5.6.2.1

Algorithmic Construction of a Lower Bound on CRH (Xn )

(0) Set x = x1 ∈ Xn , Δ = 0 and continue = 1. (1) while continue = 1 Try successively all points at Hamming distance 1 from x: for i = 1, . . . , d set {x }j = {x}j for j = i and {x }i = −{x}i , and compute Δ = minxi ∈Xn dH (x , xi ). if Δ > Δ, set x = x , Δ = Δ and break the for loop. otherwise, if i = d, set continue = 0 (all possible moves have been unsuccessfully exhausted). (2) Return Δ, which forms a lower bound on CRH (Xn ). Remark 2 Convergence to a point at maximum distance from Xn is not guaranteed. The algorithm can be modified to incorporate a simulated annealing scheme that accepts moves such that Δ < Δ with some probability: at Step 1, in the for loop, we then set x = x and Δ = Δ with probability min{exp[(Δ − Δ)/Tk ], 1} for some decreasing temperature profile Tk , do not break the loop and never set

96

R. Cabral-Farias et al.

continue = 0; the algorithm is stopped when the number of iterations reaches a predefined bound.

5.7 Greedy Constructions Based on Fractional Factorial Designs 5.7.1 Base Designs We first consider a specialisation of the n first iterations of Algorithm 1 to the case where .X = Cd and the candidate set at Step 1 is a .2d−m fractional factorial design .Xn . Algorithm 2 (0) Construct a .2d−m fractional factorial design .Xn ; set .S1 = {0} and .k = 1. (1) for .k = 1, . . . , n do find .x∗ = arg maxx∈Xn d(x, Sk ), set .Sk+1 = Sk ∪ {x∗ }. Note that the distances .d(x, Sk ), .x ∈ Xn , can be computed recursively as d(x, Sk ) = min{d(x, Sk−1 ), x − xk } and that the generation of .Sk for .k ≤ n + 1 has complexity .O(knd). Algorithm 2 also satisfies the following property.

.

Proposition 5 If .ρH (Xn ) ≥ d/4, then, for .k ≤ n + 1, the design .Sk constructed by Algorithm 2 could also have been generated by Algorithm 1 initialised at .S1 = {0}, and it satisfies the bounds of Theorem 1. Proof Since .ρH (Xn ) ≥ d/4, for any .k ≤ n √ and any .xi ∈ Xn \ S2:k , we √ have dH (xi , S2:k ) ≥ d/4; that is, .d(xi , S2:k ) ≥ d. Therefore, .d(xi , Sk ) = d = maxx∈Cd d(x, Sk ). The restriction to the set .Xn at Step 1 thus entails no loss of performance, and Theorem 1 applies.



.

Proposition 3 shows that minimum-size .2d−m designs .Xn with .n = d + 1 satisfy .ρH (Xn ) = (d + 1)/2. We shall not provide an explicit construction ensuring the existence of fractional factorial designs satisfying .ρH (Xn ) ≥ d/4 for all values of d (see, however, Remark 3). Instead, for each .d = 4, . . . , 35, using the algorithm of Sect. 5.5.2, we have searched the smallest .m = m (d) (i.e. the largest possible design m(d) ) for which we can find a design .X with minimum Hamming distance size .2d− n at least .d/4. We denote by .ρ H (d) ≥ d/4 the distance we have obtained. The left panel of Fig. 5.2 shows .m (d) (red solid line), together with the upper bound .m∗ (d) given by (5.4) (blue dotted line) and the lower bound .m∗ (d) = d/4 − 1 implied by the Singleton bound (5.7) (black dashed line); the right panel presents .ρ H (d). For instance, for .d = 35, we can construct a design with .n = 235−22 = 8 192 points and minimum Hamming distance 9. A construction with .d = 50 and .m = 35 will be considered in Example 2. Note that the value .k∗ (d) of Sect. 5.2 satisfies m(d) k∗ (d) ≥ 2d− .

.

(5.12)

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

97

Fig. 5.2 Left: .m (d) (red solid line), .m∗ (d) = d − log2 (d + 1) (blue dotted line), .m∗ (d) = H (d) (red solid line); the black dashed line corresponds to d/4 − 1 (black dashed line). Right: .ρ .d/4

Remark 3 Let .d∗ be a dimension satisfying .d∗ + 1 = 2d∗ −m ; see Table 5.3. The corresponding minimum-size .2d∗ −m fractional factorial design .Xn satisfies .ρH (Xn ) = (d∗ + 1)/2. By removing any .d∗ − d factors from .Xn , with .d ≥ 2(d∗ − 1)/3, we obtain a design .Xn in .[−1, 1]d such that .ρH (Xn ) ≥ ρH (Xn ) − (d∗ − d) ≥ d/4. However, these designs have too few points to be of practical interest for computer experiments. For all .k ≤ n − 1, the choice of .x∗ at Step 1 of Algorithm 2 is arbitrary; in particular, if this choice is randomised, an unlucky selection may thus yield .ρH (S2:k ) = ρH (Xn ) for all .k = 3, . . . , n + 1. This weakness can be overcome through a slight modification of Step 1, yielding the following algorithm. Algorithm 3 (0) Construct a .2d−m fractional factorial design .Xn with .ρH (Xn ) ≥ d/4; set .S2 = {0, x2 } and .k = 2, with .x2 an arbitrary point in .Xn . (1) for .k = 2, . . . , n do find .x∗ = arg maxx∈Xn d(x, S2:k ), and set .Sk+1 = Sk ∪ {x∗ }. √ Note that all .xi in .Xn satisfy .d(xi , S1 ) = d = arg maxx∈Xn d(x, S1 ) and have the same neighbouring pattern; see Sect. 5.2 and Proposition 2.

Example 2 For .d = 50 and .m = 35, the algorithm of Sect. 5.5.2 yields a design Xn of .n = 215 = 32,768 points, with resolution IV (.A4 (Xn ) = 2), .ρH (Xn ) = 13 > d/4 (.B13 (Xn ) = 2), and the algorithm of Sect. 5.6.2 gives .CRH (Xn ) ≥ 13. Algorithm 3 generates a sequence of nested designs .Sk that satisfy the efficiency bounds (5.3) for all .k ≤ n + 1 = 32 769. The construction is very fast since there are only .n = 32 768 points in .Xn = Xn to be considered at Step 1 of Algorithm 3 (to be compared with the .2d > 1.1258 × 1015 vertices of .Cd ). Figure 5.3 presents the evolution of the packing radius .PR(S2:k ) as a function of k for Algorithm 3 (red solid line), for .k = 3, . . . , 500 (the value 500 is rather arbitrarily, chosen

.

98

R. Cabral-Farias et al.

Fig. 5.3 Evolution of .PR(S2:k ) in Algorithm 3 with .Xn given by a .250−35 design (red solid line, top), of .PR(Sk ) in Algorithm 1 with .Xn given by the first .n = 219 points of Sobol’ sequence .(XSi )i in .Cd (black dashed line, middle) and of .PR(XSi ), .i = 2, . . . , 500 (blue dotted line, bottom). The √ horizontal line indicates the value . d/2 (.d = 50)

in agreement √ with the “.10 d” rule of [17]). When including .x1 = 0, it satisfies PR(Sk ) = d/2  3.5355 for .k ≤ 32 769. The curve in black dashed line (middle) is obtained when Algorithm 1 is applied to the candidate set .Xn given by the first 19 points of Sobol’ sequence; the blue dotted line (bottom) corresponds to designs .2 given by the first k points of this Sobol’ sequence.

.

The complexity of Algorithm 3 is only linear in k and grows like .O(knd). If necessary, it can be further reduced for large n by first constructing nested halfdesigns from .Xn , following ideas similar to those in [1]. Theorem 2 in their paper shows that only .2d−m − 1 different half-designs need to be considered when starting from an arbitrary .2d−m fractional factorial design (note that here those half-designs must be compared in terms of their .ρH values, whereas aberration is used in [1]).

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

99

5.7.2 Rescaled Designs A fractional factorial design .Xn has all its points on the vertices of .Cd , which is advantageous in terms of packing radius in the full dimensional space. However, performance in terms of prediction/interpolation of an unknown function by a nonparametric model (in particular with kriging) is more related to .CR(Xn ) [12], and it may then be beneficial to have design points inside .Cd . In [1], the iterative decomposition of .Xn into half-designs is used to construct multi-layer designs having two points on each layer. Here, since the design points in .Sk constructed with Algorithm 3 are ordered, for a fixed .K ≤ n, we can directly apply a scaling procedure to .SK to obtain a design  .SK with points inside .Cd . First, following [1], and to avoid having pairs of points too close together, we d impose that  .SK has no point in a hypercube .[−a, a] , .0 < a < 1, except the centre .x1 = 0. We choose a by setting the ratio r of the volume of the neglected empty hypercube to the volume of .Cd ; that is, .a = ar = r 1/d . Next, we need to choose how we rescale the points .xi of .SK , for .i = 2, . . . , K+1, in order to obtain a suitable distribution of distances to the centre for the .∞ norm. We shall denote by  .xi = βi,r,K,γ xi the rescaled design points, .i = 1, . . . , K + 1, with .β1,r,K,γ = 0 and βi,r,K,γ

.

 γ 1/γ (i − 2)(1 − ar ) = 1− , i = 2, . . . , K + 1 . K −1

(5.13)

Here, .γ is a scalar in .[1, d]: linear scaling with .γ = 1 yields a design with points more densely distributed close to the centre .0 than near the boundary of .Cd ; when .γ = d, the empirical distribution of the . xi ∞ converges to the uniform distribution on .[0, 1], obtained for points .x uniformly distributed in .Cd , as K tend to infinity. Values of .γ between 1 and d provide behaviours between these two extreme cases. When N points are needed, with .K + 1 < N ≤ n + 1, the rescaling procedure can be applied periodically, using  γ 1/γ [(i − 2) mod K](1 − ar ) . βi,r,K,γ = 1 − K −1

.

(5.14)

Example 2 (Continued). The left panel of Fig. 5.4 shows the empirical cumulative distribution function (cdf) of the . xi ∞ for the design obtained with Algorithm 3 for .γ = 1, .r = 10−6 and .K = 500 (red solid line). When .γ = d, the empirical cdf is visually confounded with the dashed-line diagonal, which corresponds to points uniformly distributed in .Cd . The black dashed line (middle) on the right panel of S2:k ) after linear rescaling of the designs .Sk Fig. 5.4 presents the evolution of .PR( obtained by Algorithm 3; the top curve (red solid line) is identical to that on Fig. 5.3 and corresponds to .S2:k . Periodic rescaling of .Sk with .γ = 1, .r = 10−6 and .K = 50 in (5.14) yields the two curves in blue solid line, for the cdf (left) and for the S2:k ) (right). With a horizon .N = 500 and .K = 50, there are evolution of .PR(

100

R. Cabral-Farias et al.

Fig. 5.4 Linear rescaling with .γ = 1 and .r = 10−6 in (5.13) when the .xi are generated with xi ∞ for .K = 500 (top solid line in red) Algorithm 3 in Example 2. Left: empirical cdf of the . and .K = 50 (blue stair-case solid line); when .γ = d and .K = 500, the cdf is confounded with the S2:k ) dashed-line diagonal. Right: same as on Fig. 5.3 for the red solid line (top), evolution of .PR( after linear rescaling of designs .Sk given by Algorithm 3 for .K = 500 (black dashed line) and .K = 50 (blue solid line, bottom)

10-uples of points with the same .∞ norm, which explains the stair-case shape of the cdf observed on the left panel. The faster decrease of the scaling factor yields a faster decrease of the packing radius on the right panel.

5.7.3 Projection Properties Space-filling designs in the full d-dimensional space do not necessarily have good properties when projected on an axis-aligned subspace with dimension .d  < d. In this section, we compare the projection properties of designs .Sk generated with Algorithm 3 with those of Sobol’ sequences and Latin hypercube designs, for a fixed k. Sobol’ sequence is a particular low discrepancy .(t, s) sequence (see, e.g. [22, Chap. 4]), which permits the fast generation of designs .XSk having good space-filling properties when k is a power of 2, also for large d. A Latin hypercube (Lh) design Lh with k points in .[−1, 1]d has the k levels .2i/(k − 1) − 1, .i = 0, . . . , k − 1, .X k for each of the d factors, but this does not ensure good space-filling properties in the full d-dimensional space. In [19], maximin distance optimal Lh designs are constructed by simulated annealing. A different space-filling criterion is considered in [13], whose optimisation yields so-called maximum projection designs. In the continuation of Example 2 (with .d = 50) presented below, we use the ESE algorithm of [10] to construct a maximin distance optimal Lh design.

For any .d  ∈ {1, . . . , d} and any .r ∈ {1, . . . , dd }, let .Pd  ,r denote one of the d

 .  distinct projections on an axis-aligned .d -dimensional subspace. For any k-point d design .Xk = {x1 , . . . , xk }, we then denote by .Pd  ,r (Xk ) the corresponding design for

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

101

the .d  factors associated with .Pd  ,r , that is, .Pd  ,r (Xk ) = {Pd  ,r (x1 ), . . . , Pd  ,r (xk )}, and consider the following criteria: CRd  (Xk ) =

max

.

PRd  (Xk ) =

max

r=1,...,(dd ) x∈[−1,1]d

1 2

min



(5.15)

d(x, Pd  ,r (Xk )) , .

min

r=1,...,(dd ) xi ,xj ∈Xk , xi =xj

Pd  ,r (xi ) − Pd  ,r (xj ) .

(5.16)

When applied to designs generated by Algorithm 3, we obtain the following properties for .CRd  (Sk ) and .PRd  (Sk ). Proposition 6 Let .Sk be a design generated by Algorithm 3, .2 ≤ k ≤ n. We have .

 1/2 PRd  (Sk ) = min{max{d  − d + ρH (S2:k ), 0}, d  /4} 

d 2

for any d  ∈ {1, . . . , d} , . (5.17) √  2] − [d mod ≤ CRd  (Sk ) ≤ d  for any d  ∈ {1, . . . , d} 4 (5.18) √ and CRd  (Sk ) = d  for d  ≥ 43 [d − ρH (S2:k )] .

Proof We first prove (5.17). For any .d  ≤ d0 = d − ρH (S2:k ), there exist at least one subset of .d  factors, .i1 , . . . , id  say, and two points .xj = x in .S2:k such that .dH ({xj }i1 ,...,id  , {x }i1 ,...,id  ) = 0. Therefore, .Pd  ,r (xj ) − Pd  ,r (x ) = 0 for some projection .Pd  ,r , and .PRd  (S2:k ) = 0 = PRd  (Sk ). Take now .d  = d0 + 1. On the one hand, there exist two points .xj = x in .S2:k that satisfy .dH (Pd  ,r (xj ), Pd  ,r (x )) ≤ 1, which implies that .ρH (Pd  ,r (S2:k )) ≤ 1. On the other hand, the existence of a projection .Pd  ,r such that .ρH (Pd  ,r (S2:k )) = 0 would imply   .ρH (S2:k ) ≤ d − d = ρH (S2:k ) − 1. Therefore, .min r=1,...,(dd ) ρH (Pd ,r (S2:k )) = 1. Proceeding in the same way, by induction, we get .minr=1,...,( d ) ρH (Pd  ,r (S2:k )) = d

d  − d0 for any .d  ∈ {d0 , d0 + 1, . . . , d}. We have thus obtained .PRd  (S2:k ) = 1/2 . Since .S = S [max{d  − d + ρ√ H (S2:k ), 0}] k 2:k ∪ {0}, we obtain .PRd  (Sk ) =  min{PRd  (S2:k ), d /2}, which gives (5.17). Now we prove (5.18). .Sk contains the origin, and furthest points √ from the origin are vertices of the projected hypercube; therefore, .CRd  (Sk ) ≤ d  . Since .k ≤ n, there exists .xi ∈ Xn \ S2:k such that .dH (xi , S2:k ) ≥ ρH (S2:k ) ≥ d/4. Take any .d  such that .(4/3) [d − ρH (S2:k )] ≤ d  ≤ d. For any projection .Pd  ,r , we have d d ≥ dH (xi , S2:k ) + d  − d − 4 4 3  ≥ d + ρH (S2:k ) − d ≥ 0 . 4 √ √ Therefore, .CR[Pd  ,r (S2:k )] ≥ d  and .CR[Pd  ,r (Sk )] = CRd  (Sk ) = d  . dH [Pd  ,r (xi ), Pd  ,r (S2:k )] −

.

102

R. Cabral-Farias et al.

Consider finally the favourable case where, for every projection .Pd  ,r , .Pd  ,r (Sk )  contains the .2d full factorial design. For .d  even, with .d  = 2p, the point  .x with p coordinates at 0 and the other p at 1 satisfies .d(Pd  ,r (x), .Pd  ,r (Sk )) = d  /2 ≤ CRd  (Sk ). When .d  is odd, with .d  = 2p+1, consider the point .x with p coordinates at 0, p at 1  and the last one equal to .1/2; we have .d(Pd  ,r (x), Pd  ,r (Sk )) = √

p + 1/4 = d  /2 − 1/4 ≤ CRd  (Sk ). Remark 4 The lower bound on .CRd  (Sk ) in (5.18) is very optimistic in general. However, when .S2:k contains a .2d−m fractional factorial design with resolution .R ≥ d  + 1, then each projected design .Pd  ,r (S2:k ) contains a full factorial design (see Sect. 5.3.2), and the bound becomes accurate.  For any projection .Pd  ,r , there are .2d distinct points .Pd  ,r (xi ) at most when .xi varies in .S2:k (which has .k−1 elements), so that .PR[Pd  ,r (Sk )] = PR[Pd  ,r (S2:k )] =  0 when .2d < k−1. One may note that this case is already covered by (5.17). Indeed, Singleton bound (see Sect. 5.4.1) implies that the .k − 1 points of any .Pd  ,r (S2:k ) with .d  = d − [ρH (S2:k ) − 1] are all distinct; therefore, .k − 1 ≤ 2d−ρH (S2:k )+1 , and d  < k − 1 implies .d  ≤ d − ρ (S ). .2 H 2:k When d is large, we cannot compute the values of .CRd  (Xk ) and .PRd  (Xk ) in (5.15) and (5.16) exactly, and we shall consider the following approximations that use q projections at most, instead of . dd : d  (Xk ) = CR

.

max  max d(x, Pd  ,r (Xk )) , r=1,...,min q,(dd ) x∈Xd  ,Q

 d  (Xk ) = 1 PR 2

min min  r=1,...,min q,(dd ) xi ,xj ∈Xk , xi =xj 

Pd  ,r (xi ) − Pd  ,r (xj ) ,

where .Xd  ,Q is a finite set of Q points in .[−1, 1]d . The q projections are chosen d  (Xk ) gives an optimistic (under)estimation randomly without repetition. .CR of .CRd  (Xk ) due to the substitution of a finite set .Xd  ,Q for .Cd  and to the use of q random projections√only. When .d  ≥ (4/3) [d − ρH (S2:k )], d  for all projections .Pd  ,r ; see the proof of .maxx∈C d(Pd  ,r (x), Pd  ,r (Sk )) = d √  d  (Sk ) smaller than . d  are only Proposition 6. Therefore, for such .d , values of .CR  d  (Xk ) overestimates .PRd  (Xk ) due to due to the substitution of .Xd  ,Q for .Cd . .PR  d  (Sk ) = 0 when .2d  ≤ k − 2; see Remark 4. the restriction to q projections, but .PR Equation (5.17) indicates that .PRd  (Sk ) = 0 for the designs obtained with Algorithm 3 when .d  ≤ d0 = d − ρH (S2:k ). Although the rescaling procedure of Sect. 5.7.2 prevents the exact coincidence of projected design points, .PRd  ( Sk ) remains very close to zero when .d  ≤ d0 for rescaled designs. As the example below will illustrate, the performances in terms of .PRd  are thus much worse than those of more classical designs based on Lh and Sobol’ sequences for small .d  . They are much better, however, for .d  close to d. The example also illustrates that

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

103

rescaling decreases .PRd  for those large d’, but has the benefit of slightly improving (decreasing) .CRd  .   = q for Example 2 (Continued). We take .q = 100, so that .min q, dd d  = 2 already, and .Xd  ,Q consisting of the first .214 points of a Sobol’ sequence   in .[−1, 1]d , complemented by a .2d full factorial design when .d  ≤ 10. We consider designs of size .k = 500. Equation (5.17) shows the importance of having .ρH (S2:k ) as large as possible to obtain good projection properties in terms of packing radius for dimensions .d  as small as possible. The .250−35 fractional factorial design .Xn of Example 2 has .ρH (Xn ) = 13; .S2:k√ generated with Algorithm 3 17  4.1231; see Fig. 5.3. satisfies .ρH (S2:500 ) = 17, with .PR(S2:500 ) = 6, and, due to the Therefore, .PRd  (S500 ) = 0 for .d  ≤ 33 from Proposition

 d  (S500 ) is equal to zero random choice of 100 projections only among . 50 , .PR  d  d  (S500 ) = 0 for with positive probability when .d  ≤√33. From Remark 4, .PR   d  (S500 ) = PRd  (S500 ) = d  /2 for .d  ≥ 44, in agreement with (5.17). .d ≤ 8. .PR Figure 5.5 presents the lower and upper bounds (5.18) on .CRd  (S500 ) (black dotted d  (S500 ) based on q random projections (stars), lines) and the approximation .CR .

Fig. 5.5 Lower and upper bounds (5.18) on .CRd  (S500 ) and .PRd  (S500 ) given by (5.17) (black d  (S500 ) and .PR  d  (S500 ) (red stars); .S500 is generated by Algorithm 3 dotted lines); .CR

104

R. Cabral-Farias et al.

d  (  d  ( Fig. 5.6 .CR S500 ) and .PR S500 ) after linear rescaling of .S500 using (5.13) (red stars; .r = d  and .PR  d  for .XS given by the first 500 points of Sobol’ 10−6 , .K = 500 and .γ = 1); .CR 500 sequence (black circles) and a 500 point Lh design .XLh 500 optimised for the .PR criterion with the ESE algorithm of [10] (blue diamonds)

together with .PRd  (S500 ) given by (5.17) (black dotted line) and its approximation  d  (S500 ) (stars). PR  d  ( Linear rescaling of .Sk by (5.13) with .γ = 1 affects the values of .PR S500 ); xj ) − Pd  ,r ( x ) = 0 for all projections see Fig. 5.6. Although we have now .Pd  ,r (  .Pd  ,r and all  .xj =  x in  .S500 , when .d is small, this value remains very close to zero  d  ( for some pairs of points and projections; therefore, .PR X500 ) is still very small: the difference is hardly visible on the figure for small .d  ; compare the red stars on  d  in Figs. 5.5 and 5.6 for .d   16. For larger .d  , rescaling decreases the plots of .PR d  which is slightly decreased for .d    .PRd  , but has a small positive effect on .CR S  30. The values of .CRd  for a design .X500 given by the first 500 points of Sobol’ sequence (black circles), or for a (non-incremental) Lh design .XLh 500 optimised for d  ( the .PR criterion (blue diamonds), are marginally better than .CR S500 ), but  .S500 is  d  for large .d  . The construction of .XLh uses the significantly better in terms of .PR 500 ESE algorithm of [10] with the default tuning parameters suggested in that paper; 100 cycles are performed, requiring .500 000 evaluations of .PR. .

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

105

Fig. 5.7 Same as Fig. 5.6, but with nonlinear rescaling ((5.13) with .γ = d) of .S500 generated by Algorithm 3 (red stars)

Rescaling with .γ = d in (5.13) yields results intermediate between no rescaling (Fig. 5.5) and linear rescaling (Fig. 5.6); see Fig. 5.7. Performances with linear periodic rescaling using (5.14) with .K = 50 (Fig. 5.8) are close to those on d  . For large enough .d  < d, Fig. 5.6, with some small improvement in terms of .CR  d  are significantly better than those obtained for a performances in terms of .PR design .SS500 generated by Algorithm 1 with the first .n = 219 points of Sobol’ sequence as candidate set (note that Algorithm 3 only uses .32 768 candidate points), S in terms of .CR d  are fairly close. The whereas the performances of  .S500 and .S 500 S value .PRd (S500 ) corresponds to the last point on the black dashed line in Fig. 5.3; .PRd ( S500 ) is smaller than .PRd ( S2:500 ) on Fig. 5.4-Right due to the addition of the central point .0. Above, we have used a .2d−m fractional factorial design with .m = 35 as candidate set in Algorithm 3, which allows the construction of incremental designs of size up to 32,769. If we are only interested in shorter sequences, we may increase m and get a design with larger .ρH value. For instance, when taking .m = 41 instead of .m = 35, the design .Xn obtained with the algorithm of Sect. 5.5.2 has 512 points, resolution III and .ρH (Xn ) = 20 (.CRH (Xn ) ≥ 17). Algorithm 3 then yields a .Sk such that .ρH (S2:500 ) = 20 (and .PRd  (S500 ) = 0 for .d  ≤ 30, instead of .d  ≤ 33 when .m = 35; see (5.17)).

106

R. Cabral-Farias et al.

d  (  d  ( Fig. 5.8 .CR S500 ) and .PR S500 ) after linear periodic rescaling of .S500 (red stars; .K = 50, = 10−6 , .γ = 1); the black circles correspond to the design obtained with Algorithm 1 using the first .n = 219 points of Sobol’ sequence as candidate set (the same design is used for the black dashed line in Fig. 5.3); the blue diamonds correspond to a 500 point Lh design .XLh 500 optimised for the .PR criterion with the ESE algorithm of [10]

.r

5.8 Summary and Future Work In situations where the number d of factors is too large to inspect all vertices of the hypercube Cd = [−1, 1]d to construct a design, we suggest to use a fractional factorial design Xn to thin the search space. When Xn has minimum Hamming distance at least d/4, the coffee-house rule permits to construct a sequence of nested designs, with flexible size up to n + 1, each design along the sequence having at least 50% packing (maximin) and covering (minimax) efficiency.

The packing and covering properties of designs projected in lower-dimensional subspaces have been investigated. The covering performances are slightly worse than those obtained for more classical space-filling designs, but their packing

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

107

performance is significantly better when projecting on a subspace with large enough dimension. An intrinsic drawback of the construction is that all design points (except the first one, taken at the centre) are vertices of the hypercube. A rescaling rule has been proposed to populate the interior of .Cd , but, like for the multi-layer designs of [1], all rescaled design points lie along the diagonals of .Cd . Other rules could be considered that deserve further investigations. For instance, the compromise between placing points on vertices, which is favourable for packing in the full dimensional space, and in the interior of .Cd , which is favourable to the performance of projected designs, may rely on interlacing the sequence proposed in the chapter with a low discrepancy sequence. Combination with other space-filling sequences could be considered as well; see, e.g. [27, 28]. We leave such developments for further work. Acknowledgments This work benefited from the support of the project INDEX (INcremental Design of EXperiments) ANR-18-CE91-0007 of the French National Research Agency (ANR).

Appendix As shown in [7], a design for which a basic factor is not used in generators cannot have maximum resolution; see Sect. 5.3. It also has poor performance in terms of Hamming distance. Indeed, suppose without any loss of generality that the first factor is not used, and consider .xi = (1, x\1 ) ∈ Xn , where .x\1 is the vector obtained omitting the first coordinate of .x. The point .x = (−1, x\1 ) also belongs to .Xn , and  .ρH (Xn ) ≤ dH (xi , x ) = 1, implying that .B1 (Xn ) ≥ 1. As shown below, the reverse property also holds. Proposition 7 If each basic factor is used in the construction of generators for .Xn , then .ρH (Xn ) ≥ 2 and .B1 (Xn ) = 0. Proof We use the notation of Sect. 5.5.1. Consider any pair of points .xi = (xi , g(xi )) and .xj = (xj , g(xj )) of .Xn . If .dH (xi , xj ) ≥ 2, then .dH (xi , xj ) ≥ 2. Otherwise, .xi and .xj differ by one coordinate only, say the kth. Since the kth basic factor is used within generators, .dH (g(xi ), g(xj )) ≥ 1, implying .dH (xi , xj ) ≥ 2.

In the rest of the Appendix, we show how a similar reasoning can be used to construct .2d−m designs with a larger minimum Hamming distance .ρH . Proposition 8 .ρH (Xn ) ≥ 3 if and only if in the construction of generators (i) each basic factor is used at least twice and (ii) for each pair of basic factors, one of the factors appears at least once separately. Proof Consider the point .x = (1d−m , g(1d−m )) ∈ Xn . Suppose that (i) is not satisfied, with the first basic factor appearing only once among generators. Then .x1 = (−1, 1d−m−1 , g(−1, 1d−m−1 )) belongs to .Xn and

108

R. Cabral-Farias et al.

dH (g(1d−m ), g(−1, 1d−m−1 )) = 1, implying that .dH (x, x1 ) = 2. Also, when the first two factors only appear as a pair, then .x2 = (−1, −1, 1d−m−2 , .g(−1, −1, 1d−m−2 )) ∈ Xn and .dH (g(1d−m ), g(−1, −1, 1d−m−2 )) = 0, implying that .dH (x, x2 ) = 2. This shows that (i) and (ii) are necessary to have .ρH (Xn ) ≥ 3. We show that the condition is sufficient. From Proposition 2, we only need to consider the nearest neighbour to the point .x = (1d−m , g(1d−m )), which, up to a reordering of basic factors, is given by .x1 or .x2 . Now, (i) implies that .dH (g(1d−m ), g(−1, 1d−m−1 )) ≥ 2 and (ii) implies that .dH (g(1d−m ), g(−1, −1, 1d−m−2 )) ≥ 1, showing that .dH (x, x1 ) ≥ 3 and  .dH (x, x ) ≥ 3.

2 .

Example 3 Consider the design given by the half fraction .2d−1 with the product d−1 of all basic factors as generators: .g1 = j =1 xj (.m = 1). The condition of Proposition 7 is satisfied, but none of the conditions (i) and (ii) of Proposition 8 is; therefore, .ρH (Xn ) = 2. Direct calculation gives .A0 (Xn ) = 1 and .A2q (Xn ) = d−1 d−1

2q−1 + 2q for .1 ≤ q < d/2, with .Ad (Xn ) = 1 when d is even, all .Ai with i odd being equal to zero. Proposition 9 .ρH (Xn ) ≥ 4 if and only if in the construction of generators (i) each basic factor is used at least three times and (ii-a) for each pair of basic factors, one of the factors appears at least twice separately or (ii-b) for each pair of basic factors, each one of the factors in the pair appears at least once separately and (iii-a) each triple of basic factors appears at least once or (iii-b) within each triple of basic factors, each factor appears at least once without the other two. The proof uses arguments similar to that used for Proposition 8. Example 4 Consider the case .d = 9 and .m = 5, with basic factors a, b, c and d and generators abc, abd, acd, bcd and abcd. Conditions (i), (ii-b) and (iiia) are satisfied, and we get .A (Xn ) = [1 0 0 4 14 8 0 4 1 0] and .B(Xn ) = [1 0 0 0 6 8 0 0 1 0]. When the generators are ab, abd, acd, bc and cd, then (iii-b) is satisfied instead of (iii-a), and we get .A (Xn ) = [1 0 0 6 9 9 6 0 0 1] and .B(Xn ) = [1 0 0 0 9 0 6 0 0 0]. Note that the first design is preferable in terms of both aberration and maximin distance. Example 5 Consider the case where .d = 2m, .m ≥ 4, and where each of the m generators isthe product of all basic factors but one; that is, with obvious notation, .gi = m j =1, j =i xj , for .i = 1, . . . , m. Conditions (i), (ii-b) and (iii-a) of

5 Incremental Design Construction Based on Two-Level Fractional Factorial Designs

109

Proposition 9 are satisfied, and direct calculation shows that the word length pattern of .Xn (with .n = 2d−m = 2m ) satisfies A0 (Xn ) = 1 ,

.

Am (Xn ) =

m/2   p=0

   m m and A4p (Xn ) = for p = 1, . . . , m/2 2p + 1 2p

if mmod 4 = 0 ,   m/2     m  m m and A4p (Xn ) = Am (Xn ) = + for p = m/4 2p + 1 m/2 2p p=0

if mmod 4 = 0 , all other .Ai being equal to zero (and the design has resolution I V ). Direct calculation also indicates that .B(Xn ) = A (Xn ), with therefore .ρH (xn ) = 4. One can check that the sphere packing bound (5.6) implies that a .2d−m design with .d = 2m cannot reach .ρH (xn ) ≥ 5 for .m < 7. However, designs with better maximin properties can be obtained for larger m. For instance, when .m = 8 (.d = 16, .n = 256), the construction above yields a design .Xn with .B(Xn ) = A (Xn ) = [1 0 0 0 28 0 0 0 198 0 0 0 28 0 0 0 1] and .ρH (Xn ) = 4, whereas the design with generators abcdefgh, defgh, bcfgh, acegh, bdgh, cef h, adf h and abeh (and basic factors .1, b, c, d, e, f, g, h) has distance distribution .B(Xn ) = [1 0 0 0 0 24 44 40 45 40 28 24 10 0 0 0 0], with .ρH (xn ) = 5 (again, .A (Xn ) = B(Xn ), with .Xn thus having resolution V ). .CRH (Xn ) = 4 for both designs.

References 1. Ba, S., Jospeh, V.R.: Multi-layer designs for computer experiments. J. Am. Stat. Assoc. 106(495), 1139–1149 (2011) 2. Box, G.E.P., Hunter, J.S.: The 2k−p fractional factorial designs. Part I. Technometrics 3(3), 311–351 (1961) 3. Box, G.E.P., Hunter, J.S.: The 2k−p fractional factorial designs. Part II. Technometrics 3(4), 449–458 (1961) 4. Cheng, C.-S., Tang, B.: A general theory of minimum aberration and its applications. Ann. Stat. 33(2), 944–958 (2005) 5. Fang, K.-T., Mukerjee, R.: Uniform design: theory and application. Biometrika 87(1), 193–198 (2000) 6. Franklin, M.F., Bailey, B.A.: Selection of defining contrasts and confounded effects in twolevel experiments. Appl. Stat. 26(3), 321–326 (1977) 7. Fries, A., Hunter, W.G.: Minimum aberration 2k−p designs. Technometrics 22(4), 601–608 (1980) 8. Gonzalez, T.F.: Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 38, 293–306 (1985)

110

R. Cabral-Farias et al.

9. Hickernell, F.J.: A generalized discrepancy and quadrature error bound. Math. Comput. 67(221), 299–322 (1998) 10. Jin, R., Chen, W., Sudjianto, A.: An efficient algorithm for constructing optimal design of compter experiments. J. Stat. Plan. Inf. 134, 268–287 (2005) 11. John, P.W.M., Johnson, M.E., Moore, L.M., Ylvisaker, D.: Minimax distance designs in twolevel factorial experiments. J. Stat. Plan. Inf. 44, 249–263 (1995) 12. Johnson, M.E., Moore, L.M., Ylvisaker, D.: Minimax and maximin distance designs. J. Stat. Plan. Inf. 26, 131–148 (1990) 13. Joseph, V.R., Gul, E., Ba, S.: Maximum projection designs for computer experiments. Biometrika 102(2), 371–380 (2015) 14. Kennard, R.W., Stone, L.A.: Computer aided design of experiments. Technometrics 11(1), 137–148 (1969) 15. Kerr, M.K.: Bayesian optimal fractional factorials. Stat. Sin. 11, 605–630 (2001) 16. Lin, S., Xing, C.: Coding Theory. A First Course. Cambridge University Press, Cambridge (2004) 17. Loeppky, J.L., Sacks, J., Welch, W.J.: Choosing the sample size of a computer experiment: a practical guide. J. Am. Stat. Assoc. 51(4), 366–376 (2009) 18. Mitchell, T.J., Morris, M.D., Ylvisaker, D.: Two-level fractional factorials and Bayesian prediction. Stat. Sin. 5, 559–573 (1995) 19. Morris, M.D., Mitchell, T.J.: Exploratory designs for computational experiments. J. Stat. Plan. Inf. 43, 381–402 (1995) 20. Müller, W.G.: Coffee-house designs. In: Atkinson, A., Bogacka, B., Zhigljavsky, A. (ed.) Optimum Design 2000, chapter 21, pp. 241–248. Kluwer, Dordrecht (2001) 21. Müller, W.G.: Collecting Spatial Data. Springer, Berlin, 3rd edn. (2007) 22. Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. SIAM, Philadelphia (1992) 23. Plackett, R.L., Burman, J.P.: The design of optimum multifactorial experiments. Biometrika 33(4), 305–325 (1946) 24. Plotkin, M.: Binary codes with specified minimum distance. IRE Trans. Inf. Theory 6(4), 445– 450 (1960) 25. Pronzato, L.: Minimax and maximin space-filling designs: some properties and methods for construction. J. de la Société Française de Statistique 158(1), 7–36 (2017) 26. Pronzato, L., Müller, W.G.: Design of computer experiments: space filling and beyond. Stat. Comput. 22, 681–701 (2012) 27. Pronzato, L., Zhigljavsky, A.A.: Bayesian quadrature and energy minimization for space-filling design (2018). hal-01864076, arXiv:1808.10722v1 28. Pronzato, L., Zhigljavsky, A.A.: Measures minimizing regularized dispersion. J. Sci. Comput. 78(3), 1550–1570 (2019) 29. Singleton, R.C.: Maximum distance q-nary codes. IEEE Trans. Inf. Theory 10(2), 116–118 (1964) 30. Sun, F., Wang, Y., Xu, H.: Uniform projection designs. Ann. Stat. 47(1), 641–661 (2019) 31. Tang, Y., Xu, H., Lin, D.K.J.: Uniform fractional factorial designs. Ann. Stat. 40(2), 891–907 (2012) 32. van Lint, J.H., Wilson, R.M.: A Course in Combinatorics. Cambridge University Press, Cambridge (1992) 33. Wu, C.F.J., Hamada, M.S.: Experiments: Planning Analysis, and Optimization, 2nd edn. Wiley, New York (2009) 34. Xiao, Q., Xu, H.: Construction of maximin distance designs via level permutation and expansion. Stat. Sin. 28, 1395–1414 (2018) 35. Xu, H., Wu, C.F.J.: Generalized minimum aberration for asymmetrical fractional factorial designs. Ann. Stat. 29(4), 1066–1077 (2001) 36. Zhou, Y.-D., Xu, H.: Space-filling fractional factorial designs. J. Am. Stat. Assoc. 109(507), 1134–1144 (2014)

Chapter 6

A Study of L-Optimal Designs for the Two-Dimensional Exponential Model Viatcheslav B. Melas, Alina A. Pleshkova, and Petr V. Shpilev

Abstract This chapter devotes to the problem of constructing L-optimal designs for two-dimensional nonlinear in parameters exponential model. We show that, for some homothetic transformation of the design space .X , the locally L-optimal designs can change the type from a saturated design to an excess design and vice versa. We find the optimal saturated designs in the explicit form in some cases. We also provide an analytical solution of the problem of finding the dependence between the number of support points and the model’s parameter values.

6.1 Introduction It is a widely recognized fact that selecting a suitable experimental design can lead to a reduction in experimental costs. Specifically, one can employ a saturated design, which is a nonsingular design with the minimal number of support points, offering optimality in a certain sense. Many researchers have addressed the construction of such designs, as evidenced by numerous works such as those by Fedorov [5], Pukelsheim [13], Atkinson et al. [1] which discuss various optimality criteria. The seminal paper by de la Garza [3] demonstrated that D-optimal designs are always saturated for polynomial regression models. This phenomenon is referred to as the de la Garza phenomenon in Khuri’s work [11]. However, when dealing with nonlinear in parameter models, it is not uncommon to encounter cases where the number of support points in an optimal design exceeds the number of unknown parameters. In our recent publication [7], inspired by Khuri’s research, we introduced the term excess phenomenon to describe such cases and labeled the corresponding designs as excess designs. The papers by Yang and Stufken [19, 20], Yang [18], and Dette and Melas [4] explore the possibility of extending de la Garza’s result to nonlinear models. For

V. B. Melas · A. A. Pleshkova · P. V. Shpilev () Department of Mathematics, St. Petersburg State University, St. Petersburg, Russia e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Pilz et al. (eds.), Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications, Contributions to Statistics, https://doi.org/10.1007/978-3-031-40055-1_6

111

112

V. B. Melas et al.

instance, [19] and [18] present a method to reduce the upper bound on the number of support points in D-optimal designs. In certain scenarios, the Yang-Stufken method directly indicates that the locally optimal design is saturated. While most authors have focused on models with a single explanatory variable, many practical regression models involve multiple dimensions, making them more challenging to investigate. The methods used to obtain upper bounds on the number of support points in optimal designs, such as the Yang-Stufken method and its variations [20], which work well for one-dimensional models, cannot be readily applied to multidimensional cases. This difficulty largely stems from the absence of Chebyshev systems of functions in such situations [4]. Interestingly, the excess phenomenon frequently occurs in locally optimal designs for multidimensional models. The excess phenomenon is also observed in locally optimal designs for multidimensional models. Finding an analytical solution to the problem of determining the relationship between the number of support points in a locally optimal design and the lengths of the design intervals is a valuable tool for investigators aiming to select a suitable design space and reduce experimental costs. In our recent papers [7–9], we examined D-optimal designs for the Ayen-Peters and Laible models used in analytical chemistry [2] and the Cobb-Douglas model used in microeconomics [6]. However, to the best of our knowledge, no other analytical solutions to this problem for multidimensional models have been reported in the literature. In this study, we demonstrate that for certain homotheties .T : X → X  of the design space .X , the locally L-optimal designs for a two-dimensional Cobb-Douglas model can transition between being saturated designs and excess designs and vice versa. It should be noted that the problem of constructing a locally optimal design based on the L-optimality criterion is considerably more challenging than that based on the D-optimality criterion, as the support points of the optimal design depend on three model parameters instead of two for the D-optimality criterion. We obtain explicit analytical expressions for optimal saturated designs in some cases and provide an analytical solution for determining the relationship between the number of support points and the boundaries of the design space. The key idea of the approach proposed in our works [7–9] is to use the theorem of equivalence (see [12, 17]) not only to construct the support points and the weights of optimal design but also to find the regions which determine the design structures. The approach is based on studying the properties of special function determining in the theorem of equivalence. The present chapter is organized as follows. In Sect. 6.2, we give the main theoretical information: the equivalence theorem for the locally L-optimal designs and the definition of the homothetic transformation. In Sect. 6.3, we study the locally L-optimal designs for the Cobb-Douglas model. In Sect. 6.3, we show how a homothetic transformation of the design space .X affects the structure of support points of these designs. In particular, we show that saturated locally Loptimal designs can become excess and vice versa. In conclusion, we consider a few examples.

6 A Study of L-Optimal Designs for the Two-Dimensional Exponential Model

113

6.2 Equivalence Theorem for L-Optimal Designs In this section, we repeat the main theoretical information discussed in our previous works [7–9]. Consider the classical regression model y = η(x, θ ) + ,

.

(6.2.1)

where the explanatory variable x varies in the compact design space .X ⊂ R k and observations at different locations, say x and .x  , are assumed to be uncorrelated with the same variance. The vector .θ ∈ Θ ⊂ R p is the vector of unknown parameters, and .η : R k → R 1 is the given regression function; see [16]. An approximate design .ξ for model (6.2.1) is defined as a probability measure on the design space .X with finite support .supp(ξ ) = (x1 , . . . , xn ); see [12]. The support points of the design give the locations, where observations are taken, while the weights .ω(ξ ) = (ω1 , . . . , ωn ) give the corresponding proportions of the total number of observations to be taken at these points. If N uncorrelated observations can be taken, then the quantities .ωi N are rounded to integers .ki such  that . ni=1 ki = N (see, e.g., [14] for a rounding procedure), and an experimenter takes .ki observations at the point .xi , .i = 1, . . . , n; see [5, 13]. An approximate design .ξ is called saturated if .n = p and excess if .n > p. The information matrix of a design .ξ = (supp(ξ ), ω(ξ )) is defined by the following expression:  M(ξ, θ ) =

.

X

f (x, θ )f T (x, θ )dξ(x),

(6.2.2)

∂ where .f (x, θ ) = ∂θ η(x, θ ) ∈ R p . We say that a design .ξ ∗ is L-optimal if

ξ ∗ = arg min trLM −1 (ξ ),

.

ξ ∈ΞL

where L is a fixed nonnegative definite matrix (see, e.g., [15]). In [13], an analogue of the Kiefer-Wolfowitz theorem for L-optimal criteria was proposed. We shall use this theorem in the following form: Theorem 1 Assume that a set of information matrix is a compact; .L ∈ R p×p is a fixed, nonnegatively definite matrix; and .ξ ∗ is a nonsingular L-optimal design. Then the following statements are equivalent: (a)

.

ξ ∗ = arg min trLM −1 (ξ ).

(b)

.

max ϕ(t, ξ ∗ ) = trLM −1 (ξ ∗ ),

ξ ∈ΞL

t∈χ

where .ϕ(t, ξ ) = f T (t)M −1 (ξ )LM −1 (ξ )f (t).

114

V. B. Melas et al.

Here, at points .ti ∈ supp(ξ ∗ ), the following equality holds: ϕ(ti , ξ ∗ ) = trLM −1 (ξ ∗ ).

.

This theorem is a powerful instrument for checking a given design for optimality. Let .y ∈ Y ⊂ R v , .R v be an affine space. An affine transform .T : Y → Y  , which, in coordinates, has the form y  − a = γ (y − a),

.

is called a homothety with the coefficient .γ and with the center at the point a. Two sets .Y and .Y  in an affine space transferring one to another by a homothety are called homothetic. In this chapter, we shall study the influence of a homothety of the design space .X on the form of an optimal design. In particular, in the example of the two-dimensional exponential model, we shall state the conditions under which a homothety leads to existence of an optimal excess design in some cases.

6.3 General Case Let us consider a model η(x, θ ) = θ0 e−(θ1 x1 +θ2 x2 ) ,

.

X = [0, b1 ] × [0, b2 ]

(6.3.1)

Note that if we replace .xi by .− ln(ti ) in (6.3.1), we obtain .η(t, θ ) = θ0 t1θ1 t2θ2 that is known as the Cobb-Douglas production function in microeconomics; see [10]. In this chapter, we consider an important subclass of L-optimal designs obtained by the choice .L = I , where I is the identity matrix. Such designs are usually called A-optimal in the literature. Thus, throughout this chapter, we suppose that ⎛

⎞ 100 .L = ⎝ 0 1 0 ⎠ . 001 The analysis of the function .ϕ(t, ξ ) from Theorem 1 shows that L-optimal saturated design has the following form (see also an appendix in [8] for more details): ξ∗ =

.

 (0; 0) (a1 ; 0) (0; a2 ) , where ai ∈ (0, bi ], i = 1, 2, ω2 ω3 ω1

(6.3.2)

6 A Study of L-Optimal Designs for the Two-Dimensional Exponential Model

115

and there exist three types of such designs: −1 (ξ )

(a) .ai ∈ (0, bi ], ∂trM∂ai

−1 (ξ ) (0, b1 ], . ∂trM ∂a1

= 0, i = 1, 2 (the design of the first type). −1

(ξ ) (b) .a1 ∈ = 0, .a2 = b2 , . ∂trM = 0 (the design of the second ∂a2 type). −1 (c) .ai = bi , . ∂trM∂ai (ξ ) = 0, i = 1, 2 (the design of the third type).

The behavior of .ϕ(t, ξ ) for different values of parameters is depicted in Fig. 6.1. Our first result is related to weights of an optimal design. The following theorem holds:

Fig. 6.1 The behavior of the functions: .ϕ(t, ξ ∗ ) for .θ0 = 0.5, θ1 = 1.5, θ2 = 1.5 (top); .ϕ(t, ξ ∗ ) for .θ0 = 0.5, θ1 = 1, θ2 = 1.5 (bottom left); and .ϕ(t, ξ ∗ ) for .θ0 = 0.5, θ1 = 1, θ2 = 1 (bottom right)

116

V. B. Melas et al.

Theorem 2 For model (6.3.1), suppose .ξ ∗ is the L-optimal saturated design with structure defined in (6.3.2). Then the weights of .ξ ∗ are defined as follows: ⎧ ⎪ ⎪ ⎪ ω1 = ⎪ ⎪ ⎪ ⎨ . ω2 = ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ω3 =

 a12 a22 θ02 +a12 +a22  , a1 eθ2 a2 +a2 eθ1 a1 + a12 a22 θ02 +a12 +a22 a2 eθ1a1 a1 eθ2 a2 +a2 eθ1 a1 + a12 a22 θ02 +a12 +a22 a1 eθ2a2 a1 eθ2 a2 +a2 eθ1 a1 + a12 a22 θ02 +a12 +a22

(6.3.3)

, .

Proof of Theorem 2 Direct calculations show that trM −1 (ξ ∗ ) =

.

1 ω1 e2θ1 a1 + ω2 ω1 e2θ2 a2 + 1 − ω1 − ω2 + + ω1 θ02 a12 ω1 ω2 θ02 a22 ω1 (1 − ω1 − ω2 )

The weights are obtained as a solution of the corresponding system: ⎧ e2θ2 a2 ω12 a12 −(a12 a22 θ02 +a12 +a22 )(−1+ω1 +ω2 )2 ∂trM −1 (ξ ∗ ) ⎪ = =0 ⎪ ⎪ ω12 θ02 a12 a22 (−1+ω1 +ω2 )2 ⎨ ∂ω1 2 a 2 e2θ1 a1 +e2θ2 a2 a 2 ω2 −1 ∗ −(−1+ω +ω ) ∂trM (ξ ) 1 2 2 1 2 . = =0 ∂ω2 ⎪ ω22 θ02 a12 a22 (−1+ω1 +ω2 )2 ⎪ ⎪ ⎩ ω1 + ω2 + ω3 = 1



⎧ 2θ2 a2 ω2 a 2 − (a 2 a 2 θ 2 + a 2 + a 2 )(−1 + ω + ω )2 = 0 ⎪ ⎪ 1 2 ⎨e 1 1 1 2 0 1 2 2 2 2 2 2θ a 2θ a 1 1 2 2 . −(−1 + ω1 + ω2 ) a2 e +e a1 ω2 = 0 ⎪ ⎪ ⎩ω + ω + ω = 1



1

2

3

⎧ ⎪ ⎪ ⎪ ω1 = ⎪ ⎪ ⎪ ⎨ ω2 = . ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ω3 =

 a12 a22 θ02 +a12 +a22  a1 eθ2 a2 +a2 eθ1 a1 + a12 a22 θ02 +a12 +a22 a2 eθ1a1

a1 eθ2 a2 +a2 eθ1 a1 + a12 a22 θ02 +a12 +a22 a1 eθ2a2

.

a1 eθ2 a2 +a2 eθ1 a1 + a12 a22 θ02 +a12 +a22

The following result provides the condition of existence of the optimal saturated design of the first type.

6 A Study of L-Optimal Designs for the Two-Dimensional Exponential Model

117

Theorem 3 For model (6.3.1), consider the design (6.3.2) where .ωi , i = 1, 2, 3— are solution of the system (6.3.3) and .ai , i = 1, 2—are solution of the system ⎧ ⎪ ⎪ ⎪ ⎨θ = 1 . ⎪ ⎪ ⎪ ⎩ θ2 =

 W0

e

 W0

e

√ √

a2 a12 a22 θ02 +a12 +a22 a1 a1 a12 a22 θ02 +a12 +a22 a2

 +1

,



(6.3.4)

+1

,

where .W0 is the main branch of the Lambert W-function. Then if .a1 ∈ (0, b1 ], a2 ∈ (0, b2 ], the design (6.3.2) is the L-optimal saturated design of the first type. Proof of Theorem 3 The statement of this theorem directly follows from the definition and the fact that for the optimal saturated design of the first type, the following equalities hold:  .

∂trM −1 (ξ ) ∂a1 ∂trM −1 (ξ ) ∂a2

 .

=0 =0



⎧ 1θ1 a1 ⎨ θ1 e2 −

e2θ1 a1 − 21 = θ0 ω2 θ02 ω2 a1 θ0 a1 ω1 ⎩ 2 θ2 e2θ2 a2 − 2 e2θ2 a2 θ0 (1−ω1 −ω2 ) θ0 a2 (1−ω1 −ω2 )

e−2θ1 a1 = a1 ωω12 (θ1 − e−2θ2 a2 = a2 ωω13 (θ2 −

1 a1 ) 1 a2 )

⎧ ⎪ ⎪ ⎪ ⎨θ = 1 ⇔ ⎪ ⎪ ⎪ ⎩ θ2 =

 W0

e

 W0

e

√ √

0 −

1 θ02 a2 ω1

a2 a12 a22 θ02 +a12 +a22 a1 a1 a12 a22 θ02 +a12 +a22 a2



=0

 +1

 +1

.

Example 1 Let’s demonstrate how the values of parameters .θ1 , θ2 affect the type of optimal design. Suppose that .θ0 = 1, b1 = 1, b2 = 2. It follows from Theorem 3 that coordinates .a1 and .a2 of support points of the first type’s optimal design depend on .θ1 , θ2 under fixed .θ0 . The behaviors of the corresponding dependencies are depicted in Figs. 6.2 and 6.3. As one can see, .a1 and .a2 increase, while .θ1 , θ2 decrease and finally  the boundary  values .b1 and .b2 .   exceed Let .Θˆ 1 = (θ1 , θ2 ) a1  b1 , .Θˆ 2 = (θ1 , θ2 ) a2  b2 , and .Θˆ =     (θ1 , θ2 ) a1  b1 , a2  b2 = Θˆ 1 Θˆ 2 . These sets are depicted in Figs. 6.4, 6.5, ˆ there exists the optimal design of the first and 6.6. By Theorem 3, if .(θ1 , θ2 ) ∈ Θ, type. And as it easy to see for any fixed positive point .(θ1 , θ2 ) we can chose such ˆ This remains true for any fixed .θ0 . boundaries .b1 , b2 that .(θ1 , θ2 ) belongs to .Θ. Example 2 Let us consider .θ0 = 1, b1 = 1, b2 = 2, and .(θ1 , θ2 ) = (1.5, 0.75) ∈ ˆ Find .a1 , a2 as direct solutions of corresponding system (6.3.4): .a1 ≈ 0.8088 and Θ.


Fig. 6.2 The behavior of the dependence .a1 on .θ1 and .θ2 under .θ0 = 1 and the surface .a1 = b1 = 1

a2 ≈ 1.5006. By Theorem 3, the design 6.3.2 with .a1 ≈ 0.8088 and .a2 ≈ 1.5006 is L-optimal:

$$
\xi^*=\begin{pmatrix}(0;\,0) & (0.8088;\,0) & (0;\,1.5006)\\[1mm] 0.217 & 0.524 & 0.259\end{pmatrix}.
$$

The behavior of the extremal function .ϕ(t, ξ ∗ ) from Theorem 1 is depicted in Fig. 6.7.
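As a quick numerical illustration of Theorems 2 and 3, the sketch below solves the system (6.3.4) by a simple fixed-point iteration using the principal branch of the Lambert W-function and then evaluates the weights (6.3.3). It is only a minimal check written for this example; the fixed-point scheme and its starting values are our own choice and are not part of the original derivation.

```python
# Minimal numerical check of (6.3.3)-(6.3.4) for Example 2 (theta0=1, theta1=1.5, theta2=0.75).
# Assumptions: scipy is available and a plain fixed-point iteration converges for these parameters.
import numpy as np
from scipy.special import lambertw

def solve_support_points(theta0, theta1, theta2, n_iter=200):
    """Iterate a1 <- (W0(a2/(e*sqrt(S))) + 1)/theta1 and a2 <- (W0(a1/(e*sqrt(S))) + 1)/theta2."""
    a1, a2 = 1.0, 1.0                      # arbitrary positive starting values
    for _ in range(n_iter):
        s = np.sqrt(a1**2 * a2**2 * theta0**2 + a1**2 + a2**2)
        a1 = (lambertw(a2 / (np.e * s)).real + 1.0) / theta1
        a2 = (lambertw(a1 / (np.e * s)).real + 1.0) / theta2
    return a1, a2

def weights(a1, a2, theta0, theta1, theta2):
    """Weights of the L-optimal saturated design, Eq. (6.3.3)."""
    s = np.sqrt(a1**2 * a2**2 * theta0**2 + a1**2 + a2**2)
    d = a1 * np.exp(theta2 * a2) + a2 * np.exp(theta1 * a1) + s
    return s / d, a2 * np.exp(theta1 * a1) / d, a1 * np.exp(theta2 * a2) / d

a1, a2 = solve_support_points(1.0, 1.5, 0.75)
print(a1, a2)                              # approx. 0.8088 and 1.5006
print(weights(a1, a2, 1.0, 1.5, 0.75))     # approx. (0.217, 0.524, 0.259)
```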

6.4 Excess and Saturated Designs

Throughout this section, we suppose that $b_1=b_2=b$ for regression model (6.3.1). The following theorem demonstrates how the values of the parameters $\theta_0,\theta_1,\theta_2$ affect the structure of the optimal design.

Theorem 4 Consider the model (6.3.1) and suppose that the design space $X=[0,b]\times[0,b]$; then the following statements hold:


Fig. 6.3 The behavior of the dependence .a2 on .θ1 and .θ2 under .θ0 = 1 and the surface .a2 = b2 = 2

Fig. 6.4 The set $\hat\Theta_1=\{(\theta_1,\theta_2)\mid a_1\le b_1\}$


Fig. 6.5 The set $\hat\Theta_2=\{(\theta_1,\theta_2)\mid a_2\le b_2\}$

Fig. 6.6 The set $\hat\Theta=\{(\theta_1,\theta_2)\mid a_1\le b_1,\ a_2\le b_2\}$


Fig. 6.7 The behavior of the extremal function .ϕ(t, ξ ∗ ) from Theorem 1 for design .ξ ∗ from Example 2 and .θ0 = 1, θ1 = 1.5, θ2 = 0.75

(a) Let $a_1, a_2$ be solutions of the system (6.3.4); $\omega_1^*, \omega_2^*, \omega_3^*$ be solutions of the system (6.3.3); and points $(\theta_0, \theta_1)$ and $(\theta_0, \theta_2)$ belong to the set ⎛ ⎞ 1 √ θ1 − 2(b W + 1 0 e−2b θ1 − 1)2 1 e 2 ⎠ θ1  0 and

If $n/m > 0$ and is not an integer, then $J = \operatorname{mod}(n, m)$ and $I = \frac{1}{m}\bigl(n - \operatorname{mod}(n, m)\bigr)$.

(9.10)

These relations give two functions .ϕ and .ψ on n that define I and J : ϕ(n) = I and ψ(n) = J.

Process $N(t)$ is a continuous-time ergodic finite Markov chain. Its $(n^*+1)\times(n^*+1)$ matrix of the transition intensities $\Lambda=(\Lambda_{n,n'})$ is as follows:
$$\Lambda_{n,n+m}=\alpha,\quad n=0,\dots,n^*-m,\qquad \Lambda_{n,n-1}=\lambda_{m-1+\psi(n)},\quad n=1,\dots,n^*,$$

and the remaining components of the matrix equal zero. Now we are able to calculate the conditional probability $P_{n,n'}(t)=P\{N(t)=n'\mid N(0)=n\}$ that the total number of the phases $N(t)$ at time moment $t$ equals $n'$ if initially it equals $n$. Let $P(t)=(P_{n,n'}(t))$ be the corresponding $(n^*+1)\times(n^*+1)$ matrix and $\Lambda_D$ be a diagonal $(n^*+1)\times(n^*+1)$ matrix with the diagonal $\Lambda 1$, where $1$ is a column vector of dimension $n^*+1$ consisting of units. If all eigenvalues of the matrix (generator) $G=\Lambda-\Lambda_D$ are different, then the probabilities $P(t)=(P_{n,n'}(t))$ can be represented simply. Let $\nu_\eta$ and $Z_\eta$, $\eta=0,\dots,n^*$, be the eigenvalue and


the corresponding eigenvector of $G$, $Z=(Z_0,\dots,Z_{n^*})$ be the matrix of the eigenvectors, and $\tilde Z=Z^{-1}=(\tilde Z_0^T,\dots,\tilde Z_{n^*}^T)^T$ be the corresponding inverse matrix (here $\tilde Z_\eta$ is the $\eta$-th row of $\tilde Z$). Then, see [1],
$$P(t)=(P_{n,n'}(t))=\sum_{\eta=0}^{n^*}\exp(\nu_\eta t)\,Z_\eta\tilde Z_\eta. \tag{9.11}$$

Now the distribution of the number of the customers in the system $X(t)$ at time moment $t$ can be calculated. Let us suppose that initially, at time moment $t=0$, $\eta$ customers are in the system. If $\eta>0$, we suppose additionally that the service of the customer begins only now. If $\Omega(\eta)=\{n\in\Omega:\varphi(n)=\eta\}$, then
$$P(X(t)=\eta'\mid X(0)=\eta)=\sum_{n'\in\Omega(\eta')}P_{\eta m,\,n'}(t),\qquad 0\le\eta,\ \eta'\le k. \tag{9.12}$$

Below, a numerical example is considered. Our aim is to study how the expectation $E(X(t))$ of the number of customers in the system depends on time $t$. The famous Pollaczek–Khinchin formula gives an answer for the stationary case. If $\alpha$ is the intensity of the Poisson flow, $\mu$ and $\sigma$ are the average and the standard deviation of the service time, and the load coefficient $\rho=\alpha\mu$ is less than one, then
$$E(X(\infty))=\rho+\frac{\rho^2+(\alpha\sigma)^2}{2(1-\rho)}. \tag{9.13}$$

We consider the following initial data: $\alpha=0.1$, $m=3$, $\lambda=(\lambda_1,\lambda_2,\lambda_3)^T=(1,1.6,2.1)^T$, $k=4$, and $n^*=12$. The first and second moments of the service time, calculated by formula (9.3), are as follows: $\mu=\mu_1(\lambda)=2.101$, $\mu_2(\lambda)=6.032$, so $\sigma=1.272$. Figure 9.7 contains the generator $G$. Expression (9.14) and Fig. 9.8 contain the vector $\nu=(\nu_0,\dots,\nu_{12})$ of the eigenvalues and the matrix $Z$ of the eigenvectors:
$$\nu=(-2.99,\,-2.69,\,0,\,-1.65+1.13i,\,-1.65-1.13i,\,-0.34,\,-1.61+0.81i,\,-1.61-0.81i,\,-2.21,\,-0.66,\,-1.59+0.26i,\,-1.59-0.26i,\,-1.20). \tag{9.14}$$

The matrix of the transition probabilities for $t=2$ is presented in Fig. 9.9. Figure 9.10 contains the final graphs of $E(X(t))$ and $E(X(\infty))$, named $MeanN(t, 0.1, \lambda1)$ and $AvrNumber(t, 0.1, \lambda1)$, respectively. We see how the nonstationary expectation $E(X(t))$ tends to the stationary limit.
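The sketch below illustrates, on an assumed toy generator, how the spectral representation (9.11) reproduces the transition probabilities obtained from the matrix exponential, and it evaluates the Pollaczek–Khinchin limit (9.13) for the parameters of this example. The 3×3 generator is invented purely for illustration and is not the generator G of Fig. 9.7.

```python
# Spectral form (9.11) versus the matrix exponential, plus the stationary mean (9.13).
# Assumption: the small generator below is a made-up example, not the G of Fig. 9.7.
import numpy as np
from scipy.linalg import expm

G = np.array([[-0.3, 0.3, 0.0],
              [0.5, -1.0, 0.5],
              [0.0, 0.8, -0.8]])          # rows sum to zero, as for any CTMC generator

t = 2.0
nu, Z = np.linalg.eig(G)                  # eigenvalues nu_eta and eigenvector matrix Z
Z_inv = np.linalg.inv(Z)
P_spectral = (Z @ np.diag(np.exp(nu * t)) @ Z_inv).real   # sum_eta exp(nu_eta t) Z_eta Z~_eta
P_expm = expm(G * t)
print(np.allclose(P_spectral, P_expm))    # True: both give P(t)

def pk_mean(alpha, mu, sigma):
    """Pollaczek-Khinchin mean number of customers in an M/G/1 system, Eq. (9.13)."""
    rho = alpha * mu
    return rho + (rho**2 + (alpha * sigma)**2) / (2.0 * (1.0 - rho))

print(pk_mean(0.1, 2.101, 1.272))         # approx. 0.25 for the data of this example
```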


Fig. 9.7 The generator G

Fig. 9.8 The matrix of the eigenvectors

9.8 Conclusions

Two aspects of the problem were considered: first, the parametric estimation of the convolution on the basis of given statistical data; second, the approximation of a fixed non-negative density. Different approaches to such approximation and estimation were considered: the maximum likelihood method, the method of moments, and the fitting of densities. An empirical analysis of the different approaches has been performed using simulation. The efficiency of the considered approach was illustrated by a task from queueing theory.


Fig. 9.9 Matrix of the transition probabilities for .t = 2

Fig. 9.10 The graph of the dependence $MeanN(t, 0.1, \lambda1) = E(X(t))$ on $t$ and the stationary value $AvrNumber(t, 0.1, \lambda1) = E(X(\infty))$


Acknowledgments This work was financially supported by the specific support objective activity 1.1.1.2 “Post-doctoral Research Aid” (Project id. N. 1.1.1.2/16/I/001) of the Republic of Latvia, funded by the European Regional Development Fund. Nadezda Spiridovska was supported by research project No. 1.1.1.2/VIAA/1/16/075 “Non-traditional regression models in transport modelling.”

References 1. Bellman, R.: Introduction to Matrix Analysis. McGraw Hill, New York (1969) 2. Buslenko, N.P.: Complex System Modelling. Nauka, Moscow (1968). In Russian 3. Gnedenko, B.W., Kovalenko, I.N.: Introduction to queueing theory. Nauka, Moscow (1987). In Russian 4. Kijima, M.: Markov Processes for Stochastic Modeling. Chapman & Hall, London (1997) 5. Neuts, M.F.: Matrix-geometric Solutions in Stochastic Models. The Johns Hopkins University Press, Baltimore (1981) 6. Pacheco, A., Tang, L.C., Prabhu, N.U.: Markov-Modulated Processes & Semiregenerative Phenomena. World Scientific, Hoboken, New York (2009) 7. Sleeper, A.D.: Six sigma Distribution Modeling. McGraw Hill, New York (2006) 8. Turkington, D.A: Matrix calculus & Zero-One Matrices. Statistical and Econometric Applications. Cambridge University Press, Cambridge (2002)

Chapter 10

Statistical Estimation with a Known Quantile and Its Application in a Modified ABC-XYZ Analysis Zhanna Zenkova, Sergey Tarima, Wilson Musoni, and Yuriy Dmitriev

Abstract The manuscript suggests an estimator of a functional of a cumulative distribution function (c.d.f.) modified with a known quantile. The modified estimator is unbiased, asymptotically normally distributed with a smaller asymptotic variance than the estimator obtained by plugging in an empirical c.d.f. instead of an unknown c.d.f. This new estimator is applied to modify ABC-XYZ analysis of a trade company’s assortment. As a result, a new merchandise grouping is suggested with a more stable inventory management.

10.1 Introduction There are many situations in statistical practice when researchers have additional information about a random variable of interest. This information can be used to improve accuracy of statistical estimation. One of the first attempts of using additional information in statistical procedures was made in [6] where authors suggested symmetrization of a cumulative distribution function (c.d.f.) in statistical estimation. In [20], a method of generalized S-symmetrization of an empirical c.d.f. (e.c.d.f.) was suggested to improve Kolmogorov’s goodness-of-fit test. Other transformations of e.c.d.f. based on additional information were considered in [1] and [8]. In [2], authors suggested a variation of Owen’s empirical likelihood method [10]. They used a known population mean, quantile, or auxiliary variable as additional information which improved accuracy of estimation of an unknown c.d.f. Population mean estimation with additional information was considered in [7, 9, 11, 12], and [17]. In [5], a nonparametric estimation procedure of functionals

Z. Zenkova · W. Musoni · Y. Dmitriev, Institute of Applied Mathematics and Computer Science, Tomsk State University, Tomsk, Russia; e-mail: [email protected]
S. Tarima, Institute for Health and Equity, Medical College of Wisconsin, Milwaukee, WI, USA; e-mail: [email protected]


of probability distributions using unbiased prior conditions was suggested. In [4], a functional considering an additional assumption about a value of another functional was estimated. The use of additional information known with a degree of uncertainty was considered in [13–16]. In [18], c.d.f. estimators were mapped onto a class of c.d.f.s with a known quantile, which was applied to Kaplan-Meier estimator in an illustrative example. In summary, additional information helps to improve accuracy of statistical estimation despite possible bias and uncertainty. In this manuscript, we suggest unbiased estimators of a c.d.f. and its moments using additional information given by a known q-level quantile. These new estimators reduce point-wise asymptotic variance of an e.c.d.f.

10.2 Methods

10.2.1 Statistical Estimation with a Known Quantile

Let $\xi$ be a random variable with an unknown c.d.f. $F(x)=P(\xi\le x)$ and additional information be available in the form of a known c.d.f. value at a quantile $x_q$:
$$F(x_q)=q. \tag{10.1}$$
Let $\{X_1,\dots,X_N\}$ be an independent sample from $F(x)$ and the objective be to estimate a functional
$$J(F)=\int_{-\infty}^{+\infty}\psi(x)\,dF(x), \tag{10.2}$$
where $\psi(x)$ is a known real function. If $\psi(x)=H(y-x)$, Eq. (10.2) is the c.d.f. at $y$; if $\psi(x)=x$, Eq. (10.2) is $E\xi$; at $\psi(x)=x^2$, Eq. (10.2) becomes the second moment $E\xi^2$. Here $H(y)$ is an indicator of $y>0$. Traditionally, the unknown $J(F)$ is estimated by

$$J_N=\frac{1}{N}\sum_{i=1}^{N}\psi(X_i), \tag{10.3}$$
which is an unbiased estimator ($EJ_N=J$) if $\psi$ is defined for all values of $x$ and is asymptotically normally distributed with a variance
$$\operatorname{Var}\{J_N\}=\frac{1}{N}\cdot\operatorname{Var}\{\psi(\xi)\}. \tag{10.4}$$
We suggest to estimate $J(F)$ with a known $q$ in a class
$$J_N^{q}=J_N-\lambda\cdot(\hat q-q), \tag{10.5}$$

where $\hat q=\frac{1}{N}\sum_{i=1}^{N}H(x_q-X_i)$ is an empirical estimate of $q$ and $\lambda$ is an unknown coefficient, estimated by minimizing the mean square error (MSE)
$$\mathrm{MSE}\{J_N^{q}\}=E\bigl(J_N^{q}-J(F)\bigr)^2=\operatorname{Var}\{J_N\}-2\lambda\operatorname{cov}\{J_N,\hat q\}+\lambda^2\operatorname{Var}\{\hat q\}\to\min_{\lambda}.$$
Then, the estimator defined by
$$\lambda=\frac{\operatorname{cov}\{J_N,\hat q\}}{q(1-q)}$$
has the smallest MSE in class (10.5). However, the value of the covariance is unknown and $\lambda$ is estimated as
$$\hat\lambda=\frac{\hat E\{(J_N-J)(\hat q-q)\}}{q(1-q)}.$$
The use of $\hat\lambda$ instead of $\lambda$ leads to
$$J_N^{(q)}=\frac{1}{N(N-1)}\sum_{i\neq j}\psi(X_i)\Bigl(1-\frac{(H(x_q-X_i)-q)(H(x_q-X_j)-q)}{q(1-q)}\Bigr). \tag{10.6}$$
The estimator (10.6) is a U-statistic. Thus, it is unbiased with
$$\operatorname{Var}\{J_N^{(q)}\}=\frac{1}{N}\Bigl(\operatorname{Var}\{\psi(\xi)\}-\frac{E^2\{\psi(\xi)(H(x_q-\xi)-q)\}}{q(1-q)}\Bigr)+\frac{1}{N(N-1)}\Bigl(\frac{\operatorname{Var}\{\psi(\xi)(H(x_q-\xi)-q)\}}{q(1-q)}+\frac{E^2\{\psi(\xi)(H(x_q-\xi)-q)^2\}}{q^2(1-q)^2}\Bigr) \tag{10.7}$$
or
$$\operatorname{Var}\{J_N^{(q)}\}=\frac{1}{N}\cdot\Bigl(\operatorname{Var}\{\psi(\xi)\}-\frac{E^2\{\psi(\xi)(H(x_q-\xi)-q)\}}{q(1-q)}\Bigr)+O\Bigl(\frac{1}{N^2}\Bigr).$$
As it follows from (10.7), at large $N$, $\operatorname{Var}\{J_N^{(q)}\}\le\operatorname{Var}\{J_N\}$, where $\operatorname{Var}\{J_N\}$ is determined by Eq. (10.4).


Fig. 10.1 Normalized variances of the modified e.c.d.f. $F_N^{(q)}(x)$ for the standard uniform distribution, $q=0.1$, $0.5$ and $0.7$

Estimator (10.6) is asymptotically normally distributed under certain regularity conditions of a central limit theorem. The variance (10.7) of Estimator (10.6) is different from the variance of estimators of $J(F)$ proposed by [5]. Variances of $F_N^{(q)}(x)$ (an e.c.d.f. with a known $q$-level quantile) and $\bar X^{(q)}$ (a sample mean with a known $q$-level quantile) for the standard uniform random variable $\xi$ are shown in Figs. 10.1 and 10.2. In Fig. 10.1, the smallest maximum variance of $F_N^{(q)}(x)$ is achieved at a known 0.5-level quantile, i.e., when the median is known.
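As a small illustration of (10.6), the sketch below implements the quantile-modified estimator for an arbitrary function ψ and compares its sampling variance with the plain estimator (10.3) by simulation for the standard uniform distribution with a known median (q = 0.5, x_q = 0.5). The function and variable names are ours, and the Monte Carlo comparison is only indicative.

```python
# Quantile-modified estimator (10.6) versus the plain estimator (10.3), here with psi(x) = x.
import numpy as np

def j_plain(x, psi):
    return np.mean(psi(x))                                  # Eq. (10.3)

def j_known_quantile(x, psi, x_q, q):
    """U-statistic (10.6) using the known quantile F(x_q) = q."""
    h = (x < x_q).astype(float) - q                         # H(x_q - X_i) - q
    n = len(x)
    # sum over i != j of psi(X_i)*h_i*h_j = (sum_i psi_i h_i)*(sum_j h_j) - sum_i psi_i h_i^2
    cross = np.sum(psi(x) * h) * np.sum(h) - np.sum(psi(x) * h * h)
    return np.sum(psi(x)) / n - cross / (n * (n - 1) * q * (1.0 - q))

rng = np.random.default_rng(0)
psi = lambda x: x
plain, modified = [], []
for _ in range(5000):                                       # Monte Carlo replications
    sample = rng.uniform(size=50)
    plain.append(j_plain(sample, psi))
    modified.append(j_known_quantile(sample, psi, x_q=0.5, q=0.5))
print(np.var(plain), np.var(modified))                      # the modified estimator shows the smaller variance
```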

10.2.2 ABC-XYZ Analysis The ABC-XYZ analysis is a popular technique often used in business practice, particularly in marketing to optimize stocking and assortment of products. Logistics and supply chain management are other application areas [19]. The ABC analysis was suggested in 1951 by H. Ford Dickie, a manager of General Electric [3]. He developed the ABC method to minimize cost of inventory management using the Pareto principle also known as the 80/20 rule. This method helps to identify the most important group of products (Group A), which must be in stock all the time as the deficit of these products directly leads to loss of profit.


Fig. 10.2 Normalized variance of $\bar X^{(q)}$ for the standard uniform distribution

Group B, in his notation, is not as important and does not require as much attention. Group C often includes very cheap, obsolete, and low-demand products; some of them could be excluded from the assortment without any serious impact on profit, and their exclusion often leads to profit growth. The procedure can be described in four stages:
1. Find the product-specific revenue or net profit within a specific time period for each of the M products. Then, sort the products in decreasing order by the chosen optimality criterion. For example, if the M products are ordered by product-specific revenue, the sorting leads to the ordering $R_1\ge\dots\ge R_M$.
2. Calculate the total revenue $TR=\sum_{i=1}^{M}R_i$ and the share of product-specific revenue in the total result, $d_i=R_i(TR)^{-1}$.
3. Then, find cumulative percent values $S_i=S_{i-1}+d_i$ with $S_0=0$.
4. The final stage of the process helps to make a decision:
   • If $S_i\le 0.8$, then the i-th product belongs to Group A.
   • If $0.8<S_i\le 0.95$, then it belongs to Group B.
   • Otherwise, the product is from Group C.
Table 10.1 reports 24-month sales results of 15 brands of low-alcohol drinks. For illustrative purposes, we use an artificial dataset, which mimics a typical application of an ABC-XYZ analysis. This artificial dataset is sufficient to describe the problem and a solution we faced working with the real motivating example further developed


Table 10.1 ABC-XYZ analysis of low-alcohol drinks sales

Name      | Price | X̄_i       | R_i, rub.   | d_i, % | S_i, % | ABC | S      | CV, % | XYZ
One       | 97    | 73,153.17 | 170,300,572 | 56.42  | 56.42  | A   | 31,497 | 43.06 | Z
Two       | 83    | 18,064.54 | 35,984,567  | 11.92  | 68.34  | A   | 6,774  | 37.50 | Z
Three     | 120   | 12,402.38 | 35,718,840  | 11.83  | 80.17  | B   | 5,160  | 41.61 | Z
Four      | 43    | 14,895.42 | 15,372,070  | 5.09   | 85.26  | B   | 10,765 | 72.27 | Z
Five      | 60    | 8,901.79  | 12,818,580  | 4.25   | 89.51  | B   | 2,891  | 32.48 | Z
Six       | 110   | 3,716.04  | 9,810,350   | 3.25   | 92.76  | B   | 1,504  | 40.47 | Z
Seven     | 104   | 2,782.25  | 6,944,496   | 2.30   | 95.06  | C   | 1,741  | 62.58 | Z
Eight     | 47    | 4,700.25  | 5,301,882   | 1.76   | 96.82  | C   | 3,830  | 81.48 | Z
Nine      | 58    | 3,095.08  | 4,308,356   | 1.43   | 98.25  | C   | 604    | 19.50 | Y
Ten       | 57    | 851.17    | 1,164,396   | 0.39   | 98.63  | C   | 493    | 57.89 | Z
Eleven    | 63    | 753.67    | 1,139,544   | 0.38   | 99.01  | C   | 421    | 55.83 | Z
Twelve    | 52    | 727.33    | 907,712     | 0.30   | 99.31  | C   | 509    | 69.98 | Z
Thirteen  | 78    | 440.54    | 824,694     | 0.27   | 99.58  | C   | 156    | 35.49 | Z
Fourteen  | 58    | 566.96    | 789,206     | 0.26   | 99.84  | C   | 182    | 32.12 | Z
Fifteen   | 52    | 377.42    | 471,016     | 0.16   | 100.00 | C   | 118    | 31.14 | Z
Total: TR = 301,856,281

in Sect. 10.3. This artificial dataset is a shifted and re-scaled version of an extract of a real dataset of low-alcohol beverage sales. The ABC analysis found that two products fall into Group A (the group covering the first 80% of total revenue), four into Group B, and nine into Group C. The XYZ analysis classified products by demand stability, using the coefficient of variation $CV=\frac{S}{\bar X}\cdot 100\%$, where $\bar X$ is a sample mean and $S$ is a sample standard deviation. If $CV\le 10\%$, then the product belongs to Group X with a steady demand; if $10\%<CV\le 30\%$, then it belongs to Group Y, in which sales are less stable; otherwise the product is from Group Z, a group with unstable demand (e.g., seasonal demand). ABC-XYZ grouping results are reported in Table 10.1. Note that almost all products belong to Group Z and only one to Group Y. Finally, the classical ABC-XYZ analysis led to Group AZ with 2 brands of beverages (One and Two), BZ with 4 (Three, Four, Five and Six), CY with 1 (Nine) and CZ with 8 (the rest of the beverages).
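A compact sketch of stages 1–4 and the XYZ thresholds described above is given below. The column names and the pandas-based implementation are our own choices; the grouping rules (80%/95% cumulative revenue for ABC, 10%/30% coefficient of variation for XYZ) follow the text.

```python
# Classical ABC-XYZ grouping, assuming a pandas DataFrame with columns
# "name", "revenue" and one column of monthly sales per period.
import pandas as pd
import numpy as np

def abc_xyz(df, sales_columns):
    out = df.sort_values("revenue", ascending=False).copy()    # stage 1: sort by revenue
    out["d"] = out["revenue"] / out["revenue"].sum()            # stage 2: revenue share d_i
    out["S"] = out["d"].cumsum()                                 # stage 3: cumulative share S_i
    out["ABC"] = np.where(out["S"] <= 0.80, "A",
                  np.where(out["S"] <= 0.95, "B", "C"))          # stage 4: ABC decision
    monthly = out[sales_columns].to_numpy(dtype=float)
    cv = monthly.std(axis=1, ddof=1) / monthly.mean(axis=1) * 100.0
    out["CV"] = cv
    out["XYZ"] = np.where(cv <= 10.0, "X", np.where(cv <= 30.0, "Y", "Z"))
    return out

# usage sketch (invented numbers):
df = pd.DataFrame({"name": ["One", "Two", "Three"],
                   "revenue": [170.3e6, 36.0e6, 35.7e6],
                   "m1": [70000, 18000, 12000], "m2": [75000, 17500, 12500]})
print(abc_xyz(df, ["m1", "m2"])[["name", "ABC", "CV", "XYZ"]])
```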

10.3 ABC-XYZ Analysis Modified with a Known Quantile When a classical ABC-XYZ analysis was applied to the data in Table 10.1, the grouping appeared to be suboptimal. There were several occasions of suboptimal groupings of Product Three, which was initially placed in Group A; then at .t = 20


Fig. 10.3 Sales of Product Three during a 24-month period

it moved to Group B. Later, the order size of the product did not allow to meet consumer demand. The sales dynamics of Product Three is shown in Fig. 10.3. A similar retail store from the same retail chain reported that 75% ($q$) of all sales of Product Three were less than 16,000 units per month ($x_q$). This additional information can help to improve the situation. We apply the modified Estimator (10.6) to recalculate the expected 2-year revenue of Product Three, $R_{Three}^{(q)}=24\cdot P_{Three}\cdot\bar X_{Three}^{(q)}$, where $P_{Three}=120$ rubles/item is the product's price (see Table 10.1) and
$$\bar X_{Three}^{(q)}=\frac{1}{N(N-1)}\sum_{i\neq j}X_i\Bigl(1-\frac{(H(x_q-X_i)-q)(H(x_q-X_j)-q)}{q(1-q)}\Bigr)$$
is its average monthly sale re-estimated using the known quantile; $X_i$, $i=1,\dots,N$, are the sales of Product Three in the $i$-th month, $N=24$. Then, $\bar X_{Three}^{(q)}=13{,}132.06$ items per month and $R_{Three}^{(q)}=37{,}820{,}337$ rubles in 2 years. The total 2-year revenue is re-estimated as $TR^{(q)}=303{,}957{,}778$ rubles. Then, the re-estimated coefficient of variation becomes
$$CV_{Three}^{(q)}=\frac{\sqrt{S_{Three}^{(q)2}}}{\bar X_{Three}^{(q)}}\cdot 100\%,$$


where $S_{Three}^{(q)2}=m_N^{(q)2}-\bigl(\bar X_{Three}^{(q)}\bigr)^2$ and
$$m_N^{(q)2}=\frac{1}{N(N-1)}\sum_{i\neq j}X_i^2\Bigl(1-\frac{(H(x_q-X_i)-q)(H(x_q-X_j)-q)}{q(1-q)}\Bigr).$$
Then, for data in Table 10.1, $m_{Three}^{(q)2}=203{,}405{,}380.53$, $S_{Three}^{(q)2}=30{,}954{,}338.81$, $\sqrt{S_{Three}^{(q)2}}=5563.66$, and, finally, $CV_{Three}^{(q)}=42.37\%$, which means that Product Three remains in Group Z. Thus, we performed a modified ABC-XYZ analysis, which changed the expected Product Three revenue and its coefficient of variation (see Table 10.2). Consequently, Product Three was moved from Group BZ to Group AZ, replacing Product Two. This new grouping changed the supplying process and reduced the chances of product shortage and related losses. We have also re-calculated the total revenue for Product Three using this quantile information.

Table 10.2 The result of the modified ABC-XYZ analysis of low-alcohol drinks sales

Name      | X̄_i^(q)   | R_i^(q), rub. | d_i^(q), % | S_i^(q), % | ABC^(q) | XYZ^(q)
One       | 73,153.17 | 170,300,572   | 56.42      | 56.42      | A       | Z
Three     | 13,132.06 | 37,820,337    | 12.44      | 68.47      | A       | Z
Two       | 18,064.54 | 35,984,567    | 11.84      | 80.31      | B       | Z
Four      | 14,895.42 | 15,372,070    | 5.06       | 84.53      | B       | Z
Five      | 8,901.79  | 12,818,580    | 5.06       | 89.58      | B       | Z
Six       | 3,716.04  | 9,810,350     | 3.23       | 92.81      | B       | Z
Seven     | 2,782.25  | 6,944,496     | 2.28       | 95.10      | C       | Z
Eight     | 4,700.25  | 5,301,882     | 1.74       | 96.84      | C       | Z
Nine      | 3,095.08  | 4,308,356     | 1.42       | 98.26      | C       | Y
Ten       | 851.17    | 1,164,396     | 0.38       | 98.64      | C       | Z
Eleven    | 753.67    | 1,139,544     | 0.37       | 99.02      | C       | Z
Twelve    | 727.33    | 907,712       | 0.30       | 99.31      | C       | Z
Thirteen  | 440.54    | 824,694       | 0.27       | 99.59      | C       | Z
Fourteen  | 566.96    | 789,206       | 0.26       | 99.85      | C       | Z
Fifteen   | 377.42    | 471,016       | 0.15       | 100.00     | C       | Z
Total: TR^(q) = 303,957,778

Further, we bootstrapped the density and e.c.d.f. of the classical and modified mean estimators $10^6$ times; see Figs. 10.4 and 10.5. Note that visually $\bar X^{(q)}$ is not normally distributed, whereas the sample mean $\bar X$ is closer to a Gaussian shape. Nevertheless, the variance of $\bar X^{(q)}$ was less than that of $\bar X$.


Fig. 10.4 Nonparametric smoothing for bootstrap density .g(x) of classical and .g (q) (x) of modified means

Fig. 10.5 Bootstrap e.c.d.f. .G(x) of classical and .G(q) (x) of modified means


10.4 Conclusions

This paper suggested an ABC-XYZ analysis modified with quantile information. This new ABC-XYZ technique was applied to the grouping of biennial sales of 15 products. The new grouping of merchandise reduced the risk of consequent inventory changes, minimized the chances of product shortage and related losses, and thus stabilized the company's inventory management. We anticipate that this new approach to ABC-XYZ grouping may be used by small and mid-size businesses to more accurately manage their stocking.

References 1. Botero, B., Francés, F.: Estimation of high return period flood quantiles using additional nonsystematic information with upper bounded statistical models. Hydrol. Earth Syst. Sci. 14(12), 2617–2628 (2010) 2. Chen, J., Sitter, R.:A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys. Stat. Sin. 9(2), 385–406 (1999) 3. Dickie, H.F.: ABC inventory analysis shoots for dollars not pennies. Factory Manag. Maint. 109(7), 92–94 (1951) 4. Dmitriev, Y., Tarassenko, P., Ustinov, Y.: On estimation of linear functional by utilizing a prior guess. In: International Conference on Information Technologies and Mathematical Modelling, pp. 82–90. Springer, Berlin (2014) 5. Dmitriev, Y.G., Koshkin, G.M.: Nonparametric estimators of probability characteristics using unbiased prior conditions. Stat. Pap. 59(4), 1559–1575 (2018) 6. Hinkley, D.: On estimating a symmetric distribution. Biometrika 63(3), 680–681 (1976) 7. Jhajj, H., Sharma, M., Grover, L.: A family of estimators of population mean using information on auxiliary attribute. Pakistanian J. Stat. (All Series) 22(1), 43 (2006) 8. Kabanova, S., Zenkova, Z.N., Danchenko, M.A.: Regional risks of artificial forestation in the steppe zone of Kazakhstan (case study of the green belt of astana). In: IOP Conference Series: Earth and Environmental Science, vol. 211, pp. 12–55. IOP Publishing (2018) 9. Naik, V., Gupta, P.: A note on estimation of mean with known population proportion of an auxiliary character. J. Ind. Soc. Agric. Stat 48(2), 151–158 (1996) 10. Owen, A.B.: Empirical Likelihood. Chapman and Hall/CRC (2001) 11. Shabbir, J., Gupta, S.: Estimation of the finite population mean in two phase sampling when auxiliary variables are attributes. Hacettepe J. Math. Stat. 39(1), 121–129 (2010) 12. Singh, R., Chauhan, P., Sawan, N., Smarandache, F.: Ratio estimators in simple random sampling using information on auxiliary attribute. Auxiliary Inform. A priori Values Constr. Improved Estimators 1, 7 (2007) 13. Tarima, S.: Statistical estimation in the presence of possibly incorrect model assumptions. J. Stat. Theory Pract. 11(3), 449–467 (2017) 14. Tarima, S., Dmitriev, Y.: Statistical estimation with possibly incorrect model assumptions. Bull. Tomsk State Univ. Control Comput. Inform. 8, 78–99 (2009) 15. Tarima, S., Pavlov, D.: Using auxiliary information in statistical function estimation. ESAIM: Probab. Stat. 10, 11–23 (2006) 16. Tarima, S.S., Vexler, A., Singh, S.: Robust mean estimation under a possibly incorrect lognormality assumption. Commun. Stat.-Simul. Comput. 42(2), 316–326 (2013) 17. Zaman, T.: Modified ratio estimators using coefficient of skewness of auxiliary attribute. Int. J. Mod. Math. Sci. 16(2), 87–95 (2018)


18. Zenkova, Z.: Censored data treatment using additional information in intelligent medical systems. In: AIP Conference Proceedings, vol. 1688. AIP Publishing (2015) 19. Zenkova, Z., Kabanova, T.: The ABC-XYZ analysis modified for data with outliers. In: 2018 4th International Conference on Logistics Operations Management (GOL), pp. 1–6. IEEE (2018) 20. Zenkova, Z., Lanshakova, L.: Kolmogorov goodness-of-fit test for s -symmetric distributions in climate and weather modeling. IOP Conf. Series: Earth Environ. Sci. 48, 012006 (2016)

Part IV

Machine Learning and Applications

Chapter 11

A Study of Design of Experiments and Machine Learning Methods to Improve Fault Detection Algorithms Rosa Arboretti Giancristofaro, Riccardo Ceccato, Luca Pegoraro, and Luigi Salmaso

Abstract This work presents an industrial application of design of experiments (DOE) and machine learning methods for the development of algorithms applied to fault detection problems in the heating, ventilation, air conditioning and refrigeration (HVAC-R) industry. The framework adopted is an attempt to mitigate the problems which affect the context of machine learning and consists in a sequential approach of DOE and machine learning modelling. The DOE study is performed to ensure the quality of the data that are then used to infer a series of supervised algorithms, both regression and classification. The steps needed for the implementation of the algorithms are described, and the final performance of the models is discussed in terms of pros and cons and results on a test data set.

11.1 Introduction Machine learning methodologies have recently found application in many diverse fields, such as finance, engineering, industry and manufacturing. Nevertheless, at present, the predictions obtained with machine learning algorithms are far from flawless. In this context problems are mainly due to imperfections intrinsic to the nature of the methodology and therefore difficult to resolve. One characteristic of machine learning algorithms is that they require extensive amounts of data in order to learn. This often leads to the erroneous conclusion that more data implies better data, but this is often not the case. In the modern context of big data, huge amounts

R. A. Giancristofaro, Department of Civil, Environmental and Architectural Engineering, University of Padova, Padua, Italy; e-mail: [email protected]
R. Ceccato · L. Pegoraro · L. Salmaso, Department of Management and Engineering, University of Padova, Vicenza, Italy; e-mail: [email protected]; [email protected]; [email protected]


of (potential) information are available for relatively little cost, but such data is often not appropriate in terms of quality. One reason for this is that data often arises from observational sources rather than from strict rules of statistical sampling or designed experiments [1]. An example from the field studied in this work is the broad range of data provided by sensors and IoT systems deployed in industrial environments. Despite the importance of data quality often not being stressed enough in relation to big data, it still plays a crucial role for the success of an analysis—the quantity of data does not erase the need for an appropriate study design and statistical analysis [2]. The problem of data quality is even more significant if two additional characteristics of machine learning models are considered: they focus mainly on correlations and are difficult to interpret. The fact that machine learning algorithms are often nonparametric, therefore no explicit assumption is made about the function or distribution underlying the data, implies that the results of a given algorithm rely mainly or solely on the correlative relationships found in the training data. Such relationships, even when strong, cannot be trusted to provide definitive conclusions about the behaviour of data. Causative models should always be preferred since they can provide a theoretical justification of results. Another point is that machine learning algorithms tend to function as black boxes. Modern techniques, such as deep learning applications, are notoriously difficult to interpret, and it is virtually impossible to thoroughly understand the rationale which drives such algorithms. As a result, whenever the nature of the problem allows it, the analyst should strive to increase the quality of the data used for model learning as it is often the main way to ensure the robustness of the final results. A scenario of this type is experimentation in industrial settings. In such situations a phenomenon under study has one or more input variables that determine one or more response variables, and some of the input variables can be controlled by the experimenter. This is the case of the industrial application described in this paper. Here the problem of data quality is not addressed by introducing a structured framework or a technique for a quantitative assessment as done by other authors [3, 4], but rather to show an easy-to-implement good practice that can be deployed in industrial settings with the aim of increasing the qualitative content of data sets used for machine learning modelling. The framework employed and the application considered are a development of the work of [5].

11.2 Design of Experiments and Machine Learning Modelling The proposed approach consists in supporting the machine learning modelling phase with a design of experiments (DOE) study. DOE is a statistical method that consists in the execution of a series of experiments to understand how one or more input variables impact on one or more output variables. A crucial step in the implementation of a DOE study is the generation of the design, which includes the combination of the factors and factor levels that determine each treatment that will then be tested via experimentation. In this case the quality of data is guaranteed by the adoption of the rigorous statistical approach that, if properly followed, confers an


overall control over the whole analysis. Considering the scope of the present work, some of the most relevant advantages of the DOE method are as follows: data are not only observed, but properly sampled; the variables which are more influential on the responses can be identified; several combinations of factors are investigated; therefore interaction effects, if present, can be detected and analysed. The machine learning modelling step is carried out after execution of the DOE. In this phase the aim is to develop ad hoc algorithms to address specific problems. The task of developing supervised machine learning models from a set of training data is rather articulated, and specific applications may require different and specialized efforts. Nevertheless, a simplified framework can be summarized in the two macrophases of data management and data analytics. In data management the main assignments are acquisition and filtering of data. As mentioned above, acquisition often consists in a mere recording of data, resulting in a purely observational study. The exercise of data filtering, on the other hand, is more challenging as it aims to increase the overall quality of the training set by cleaning data and extracting only significant features. In data analytics the focus is put on model training and evaluation. One of the methods that can be deployed to evaluate the performances of machine learning algorithms is to split the initial data set into two subsets, one for model training and the other for model validation. Several metrics, depending on the nature of the problem, can be used to perform an evaluation of model performances using the holdout data of the validation set. The final step is to apply the previously developed algorithms to a real-world scenario, i.e. to use the algorithms to make predictions on a test set. In this case we refer to the “test set” as a set of data that is not only holdout from the initial data set but also gathered during an additional series of tests of DOE. This approach is used with the aim of reducing the impact of any unknown biases that may be present in the initial data set and therefore more accurately simulating the performances of the algorithms in a real setting. The approach proposed in this paper essentially consists in linking DOE and machine learning modelling to improve the reliability of the final results. This link is achieved by using the data obtained in the DOE’s experimentation phase as an initial data set for machine learning modelling and using the outcomes of DOE analysis as a rationale for variable selection in the data management phase of the machine learning process. The framework is summarized in Fig. 11.1.

(Fig. 11.1 block diagram: the DOE provides the initial data set and the variable selection; the initial data set is split into a training set and a validation set for model training and model evaluation; a separate test set is used to make predictions.)

Fig. 11.1 The proposed framework. The DOE study is executed prior to the machine learning modelling phase, and it provides the initial data set and rationale for feature selection that will then be employed in the modelling phase
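A minimal sketch of the modelling side of this framework is shown below, assuming the DOE results are already available as a tabular data set and using scikit-learn utilities; the file names, column names, splitting ratio, model and metric are placeholders, not the exact configuration used later in the chapter.

```python
# Sketch of the DOE + machine-learning workflow of Fig. 11.1 (assumed file names and columns).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

doe_data = pd.read_csv("doe_runs.csv")          # initial data set produced by the DOE study
features = ["Op_ev", "Op_cond", "Tdisch", "SBC", "Air_T_DIFF", "Cmp_rps", "Fan_rpm", "VLV"]
target = "Refr_charge"                          # variable selection guided by the DOE analysis

X_train, X_val, y_train, y_val = train_test_split(
    doe_data[features], doe_data[target], test_size=0.3, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)          # model training
print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))    # model evaluation

test_data = pd.read_csv("doe_additional_runs.csv")   # separate test set from additional DOE runs
predictions = model.predict(test_data[features])     # make predictions
```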


11.3 Application to Fault Detection An application of the proposed approach of combining DOE and machine learning modelling is presented here. This application concerns detection of faults in the heating, ventilation, air conditioning and refrigeration (HVAC-R) industry. The purpose of this study is to investigate the occurrence of two faults, namely, “refrigerant undercharge” and “condenser fouling”, for refrigeration equipment at steady-state operations under different operating conditions. The objective is to find a model for each of the two circumstances that can detect the presence of the faults and assess their severity by reading the measurements of sensors installed on the apparatus. If such models could be developed, it would be possible to reliably identify the faulty condition remotely and send maintenance personnel only when actually needed. This, combined with the increase in performance and energy efficiency of a fault-free apparatus, would result in considerable savings in terms of maintenance and operating costs. In order to explore the impact of the two faults, a test bed has been selected which consists in a water chiller powered by a 4 kW BLDC rotary compressor with an inverter controlling velocity. At the condenser side, an electronically commuted fan is employed, and the metering device is an electronic expansion valve which is regulated on superheat. Several sensors are installed on the equipment. In Table 11.1 the significant variables of the study are listed, together with a brief description and their role in both the DOE (step 1) and machine learning modelling (step 2) phases of the study.

11.3.1 Design of Experiments Step In this phase focus is on detecting the impact of faults on the sensors’ measures. To this end, an optimal factorial design was chosen with second-order terms in which the two faults together with the operating conditions of the system, i.e. the evaporating and condensing temperatures, are selected as factors. Both faults are split into three levels of severity: 100% (nominal value), 85% and 70% for the refrigerant charge and 0% (clean coil), 25% and 50% for the clogging of the condenser. For the sake of brevity, only the main results of the study are reported; for a more thorough discussion, please refer to [5]. As already pointed out, the DOE is used here to gather data and identify the significant variables for the successive step of modelling. To this end the statistical analysis of data identifies some interesting trends: • The superheat response “SH” is almost constant among all trials. • The suction temperature “Tsuct” and evaporating pressure “Pevap” are highly correlated, and both are highly correlated with the operating conditions at the evaporator “Op_ev”. • The discharge temperature “Tdisch” and discharge pressure “Pdisch” are highly correlated.
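The candidate treatment combinations behind such a factorial study can be enumerated as below; this is only an illustrative full-factorial candidate set over the stated fault levels and two assumed operating-condition levels, not the optimal second-order design actually generated for the experiment.

```python
# Full-factorial candidate set over the fault levels reported in the text.
# The operating-condition levels (-1/+1 coded) are assumed for illustration only.
from itertools import product
import pandas as pd

levels = {
    "Refr_charge": [1.00, 0.85, 0.70],   # refrigerant charge: nominal, 85%, 70%
    "Cond_foul":   [0.00, 0.25, 0.50],   # condenser fouling: clean, 25%, 50%
    "Op_ev":       [-1, 1],              # coded evaporating-temperature level (assumed)
    "Op_cond":     [-1, 1],              # coded condensing-temperature level (assumed)
}
candidates = pd.DataFrame(list(product(*levels.values())), columns=list(levels.keys()))
print(len(candidates))   # 3 * 3 * 2 * 2 = 36 candidate treatments
```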


Table 11.1 Name, role in “DOE” (step 1) and “machine learning modelling” (step 2) phases of the study and description of the variables

Variable    | Role in step 1 | Role in step 2 | Description
Refr_charge | Factor         | Response       | Is the fault “refrigerant undercharge”. The refrigerant is less than the nominal amount
Cond_foul   | Factor         | Response       | Is the fault “condenser fouling”. A layer of dirt is deposited on the condenser coil
Op_ev       | Factor         | Predictor      | Is the evaporating temperature of refrigerant
Op_cond     | Factor         | Predictor      | Is the condensing temperature of refrigerant
Tdisch      | Response       | Predictor      | Is the discharge temperature of the refrigerant
Tsuct       | Response       | –              | Is the suction temperature of the refrigerant
Pdisch      | Response       | –              | Is the discharge pressure of the refrigerant
Pevap       | Response       | –              | Is the suction pressure of the refrigerant
SH          | Response       | –              | Is the superheating of the refrigerant
SBC         | Response       | Predictor      | Is the subcooling of the refrigerant
Air T DIFF  | Response       | Predictor      | Is the difference between air temperatures at the outlet and inlet of the condenser
Cmp rps     | Response       | Predictor      | Is the speed of the compressor, measured in revolutions per second
Fan rpm     | Response       | Predictor      | Is the speed of the fan, measured in revolutions per minute
VLV%        | Response       | Predictor      | Is the degree of openness of the electronic expansion valve

In accordance with these insights, the variables “SH”, “Tsuct”, “Pevap” and “Pdisch” are filtered and will not be considered when developing the algorithms for fault detection. The DOE study has also been used as a means to assess the impact of the two faults on the performance of the system. This is an important piece of information because such knowledge can be used to quantify the cost of equipment which is not functioning under nominal conditions. The coefficient of performance (COP) [6] is the metric used to estimate the impact of the two faults: when both faults are at the highest level of severity, there is a drop of approximately 30% in performance, which is a reasonable area to investigate in the design space (Fig. 11.2).



Fig. 11.2 Contour plot of COP when the severity of “refrigerant undercharge” and “condenser fouling” is varying. The COP ranges from a value of approximately 4.00–5.65. The dots on the contour indicate the direction of increment of the COP

11.3.2 Machine Learning Modelling Step This section looks at machine learning modelling, and several concurring algorithms are presented with the objective of detecting the occurrence of the two faults. To this end the sensors’ measures are used here as predictors of system failure. The DOE data set is used for modelling and only significant features are selected as per the results presented in the previous section. Furthermore, two different approaches are adopted to detect the two faults: regression methods are employed for the detection of “refrigerant undercharge”, and classification is chosen to detect “condenser fouling”. A descriptive investigation of data reveals the different characteristics of the two faults. Such considerations are clear when the scatter plot matrix in Fig. 11.3 is inspected: the different levels of refrigerant charge already appear to be distinguishable when considering bivariate association of predictors, while no grouping is clear for the fault “condenser fouling” if only such associations are studied. This is the rationale adopted when screening algorithms to deploy for the detection of the two faults: relatively inflexible models are chosen to detect “refrigerant

(Fig. 11.3 panels: pairwise scatter plots of Tdisch, SBC, Air.T.DIFF, Fan.rpm, VLV% and Cmp.rps; the upper panels are coloured by Cond_foul = 0, 0.25, 0.5 and the lower panels by Refr_charge = 1, 0.85, 0.7.)

Fig. 11.3 A scatter plot matrix of the data set. The operating conditions at the evaporator “Op_ev” and condenser “Op_cond” are not displayed, even though they are included in the list of predictors. The lower panel of the matrix focuses on the “refrigerant undercharge” fault and some groups are already distinguishable. The upper panel of the matrix focuses on the “condenser fouling” fault and no clear pattern appears

undercharge” since the focus is on interpretability of results, while fairly flexible machine learning algorithms are employed for classification of “condenser fouling” since only methods of this kind are believed to achieve acceptable classification accuracy.

11.3.2.1

Refrigerant Undercharge: Fault Detection

Identification of the refrigerant undercharge condition is achieved using regression models. The performance of several regression algorithms is assessed, and the best performer is selected by estimation of the error on the test set. The first algorithm examined is multiple linear regression because it is relatively inflexible and easily interpretable. Polynomial regression up to the fourth degree is then applied aimed at testing the performance of more flexible approaches which can potentially better fit data for fault detection. In both cases the coefficients of the model are obtained by ordinary least squares (OLS) estimation, hence by minimization of the sum of squared residuals. Lasso [7] is a technique used in the regression setting that



Fig. 11.4 Coefficient estimates of the model for detection of “refrigerant undercharge” at variation of the tuning parameter λ. The optimal value of λ is indicated by a dashed vertical line and the corresponding coefficient estimates are found at the intersection of the line with the curves

produces highly interpretable models by shrinking some coefficients and setting others to 0; it is used here with the aim of reducing the number of predictors, thus finding the most parsimonious model. In this work a reduction in the number of predictors results in the removal of some sensors, leading to a saving in terms of resources when performing fault detection. In Lasso, similar to OLS, the coefficient estimates are obtained by minimization of the sum of squared residuals plus a shrinkage penalty which depends on a tuning parameter .λ. The appropriate value of .λ is found through cross validation as the value that minimizes the mean squared error (MSE). The variables compressor speed “Cmp rps” and operating conditions at the evaporator “Op_ev” are removed from the model, and the optimal value of .λ together with the corresponding coefficient estimates is found and reported in Fig. 11.4. The performance of the algorithms is evaluated by application on a test set. The test set is gathered with an additional series of test in which the same fault levels and operative conditions as in the preceding experiments are tested but for different combinations. As such the experimental configuration of trials in the test set is unprecedented. The results are displayed in Fig. 11.5 that presents box plots of predictions against actual data. The figure shows that linear regression seems capable of properly estimating the response for low charge levels but underestimates when the charge increases. The same problem is even more evident for Lasso, as indicated by the increase in MSE. The second-degree polynomial model



Fig. 11.5 Box plots of predicted values against actual values and MSE for the “refrigerant undercharge” fault (1st-degree polynomial: MSE = 0.00368; Lasso: MSE = 0.00967; 2nd-degree: MSE = 0.00057; 3rd-degree: MSE = 0.00042; 4th-degree: MSE = 0.00029). Values that lie on the diagonal are perfect detections of the level of the fault

improves the results as it reduces the problem of underestimation, thus considerably decreasing the MSE. Augmenting the degree of the polynomial continues to slightly improve the performance of the algorithm. Nonetheless, it should be pointed out that as the degree of the polynomial increases, the interpretability of the models drops as a trade-off exists between model interpretability and model accuracy.
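The regression candidates discussed in this subsection can be reproduced along the following lines with scikit-learn; the data loading, column names and cross-validation settings are assumptions for illustration, not the exact configuration used in the study.

```python
# Polynomial regression and cross-validated Lasso for the "refrigerant undercharge" response.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.metrics import mean_squared_error

train = pd.read_csv("doe_runs.csv")            # assumed training file from the DOE
test = pd.read_csv("doe_additional_runs.csv")  # assumed test file from the additional runs
features = ["Op_ev", "Op_cond", "Tdisch", "SBC", "Air_T_DIFF", "Cmp_rps", "Fan_rpm", "VLV"]
X_tr, y_tr = train[features], train["Refr_charge"]
X_te, y_te = test[features], test["Refr_charge"]

models = {f"poly{d}": make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
          for d in (1, 2, 3, 4)}
models["lasso"] = make_pipeline(StandardScaler(), LassoCV(cv=5))   # lambda chosen by cross-validation

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, mean_squared_error(y_te, model.predict(X_te)))     # test-set MSE per candidate
```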

11.3.2.2

Condenser Fouling: Fault Detection

Tree-based methods are chosen for classification of the condenser fouling condition in the apparatus. This family of algorithms is used because they can produce a rather vast range of shapes for estimation of the behaviour of the response while still being, at least for simple decision trees, reasonably interpretable. A decision tree for classification is the first algorithm developed on the training data ending with a model with 15 terminal nodes. Bootstrap aggregation (bagging) [8] is an improvement over decision trees as it averages the results of an ensemble of trees trained on bootstrap samples of the initial data set, hence reducing variance. Random forests [9] are a further improvement over bagging as they increase the variance reduction capability of bagging. This is attained by performing a random selection of the input variables as candidates for splitting, therefore obtaining a collection of de-correlated trees [10]. The number of trees and the number of variables considered at each split (m) are parameters that should be set by the analyst.



Fig. 11.6 Variable importance plots (mean decrease in accuracy) of bagging and random forest models for classification of the “condenser fouling” fault; Fan.rpm ranks highest in both plots


Fig. 11.7 “Condenser fouling” fault: the plots are a visualization of confusion matrices of actual values against predicted values for the decision tree (accuracy = 0.94), bagging (accuracy = 0.88) and random forest (accuracy = 1). Correct classifications lie on the diagonal

A desirable feature of bagging and random forest algorithms is the possibility to obtain a measure of variable importance. Such a measure can be performed by comparison of error rates before and after permutation of a predictor variable in the model: in this way the importance of that variable in the model can be estimated [11] (Fig. 11.6). After some fine-tuning both bagging and random forest models have been trained to deal with 500 trees, and the optimal value for m in random forest was found to be the square root of the number of predictors. The validation of the algorithms has been performed via application to the test set. The result of classification is reported in Fig. 11.7. The performance of the simple regression tree is already satisfactory, with not many instances being misclassified. The situation gets worse for bagging


as there is a drop in accuracy. The random forest algorithm proves to be the best performer with a perfect classification for all the instances in the test set.
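The tree-based classifiers of this subsection can be sketched as follows; the settings shown (500 trees, square-root number of candidate variables per split) are those reported in the text, while the data loading and column names are assumptions. Note that scikit-learn's built-in importance measure is impurity-based rather than the permutation-based mean decrease in accuracy used above.

```python
# Decision tree, bagging and random forest for the "condenser fouling" classification task.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

train = pd.read_csv("doe_runs.csv")             # assumed training file
test = pd.read_csv("doe_additional_runs.csv")   # assumed test file
features = ["Op_ev", "Op_cond", "Tdisch", "SBC", "Air_T_DIFF", "Cmp_rps", "Fan_rpm", "VLV"]
X_tr, y_tr = train[features], train["Cond_foul"].astype(str)
X_te, y_te = test[features], test["Cond_foul"].astype(str)

models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "bagging": BaggingClassifier(n_estimators=500, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))        # test-set accuracy per model

importances = models["random_forest"].feature_importances_        # impurity-based variable importance
print(sorted(zip(features, importances), key=lambda t: -t[1]))
```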

11.4 Conclusions This paper presented an application of DOE and machine learning to the field of fault detection in an industrial setting. The main contribution of this work is the adoption of a DOE study as a preliminary step for machine learning modelling. The DOE study contributes by providing a rationale for feature selection and highquality data for model training. Moreover, a general understanding of the system under study is granted by the time spent performing the experiments of DOE. Two different approaches of regression and classification are employed for detection of the two faults “refrigerant undercharge” and “condenser fouling”, respectively, and in both cases the results of the machine learning algorithms selected appear adequate. The satisfactory performance of the algorithms is an indication of the good quality of data used for modelling; thus it is an element supporting the validity of the method of DOE and machine learning proposed.

References 1. McFarland, D.A., McFarland, H.R.: Big Data and the danger of being precisely inaccurate. BD&S 2, 1–4 (2015) 2. Cox, D.R., Kartsonaki, C., Keogh, R.H. : Big data: Some statistical issues. Stat. Probab. Lett. 136, 111–115 (2018) 3. Cai, L., Zhu, Y.: The challenges of data quality and data quality assessment in the Big Data era. Data Sci. J. 14, 1–10 (2015) 4. Meng, X.-L.: Statistical paradises and paradoxes in big data (I): law of large populations, big data paradox, and the 2016 US presidential election. Ann. Appl. Stat. 2, 685–726 (2018) 5. Salmaso, L., Pegoraro, L., Arboretti Giancristofaro, R., Ceccato, R., Bianchi, A., Restello, S., Scarabottolo, D.: Design of experiments and machine learning to improve robustness of predictive maintenance with application to a real case study. Commun. Stat-Simul. C. (2019). https://doi.org/10.1080/03610918.2019.1656740 6. Borgnakke, C., Sonntag, R.E.: Fundamentals of Thermodynamics. Wiley, New York (2013) 7. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc. B 58, 267– 288 (1996) 8. Breiman, L.: Bagging predictors. Mach. Learn. 24, 123–140 (1996) 9. Breiman, L.: Random forests. Mach. Learn. 45, 5–32 (2001) 10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2001) 11. Liaw, A, Wiener, M.: Classification and regression by random forest. R News 2(3), 18–22 (2002)

Chapter 12

Microstructure Image Segmentation Using Patch-Based Clustering Approach Dženana Alagi´c and Jürgen Pilz

Abstract Semiconductor devices are used in different areas of our daily life, what increases the demand on their functionality and robustness. To obtain information about the robustness of these devices in a reasonable amount of time, accelerated lifetime tests are performed. During such a test, a high thermomechanical load triggers changes in the microstructure of metal films. The microstructure characteristics like the size and morphology of grains in the polycrystalline material influence its physical and mechanical properties. To properly understand and model the degradation process, it is important to take this information into account. To visualize the changes in the microstructure, different imaging methods, such as scanning electron microscopy (SEM) and focused ion beam (FIB), are used. Manually identifying microstructural changes in these images is very slow, tedious, and prone to errors. Therefore, this work introduces an image processing algorithm to automatically extract quantitative information about the microstructure characteristics like the size of damage patterns, grain size distribution, etc. out of the mentioned images. Since labeled data is not provided, a patch-based clustering approach based on features that measure a region’s homogeneity is proposed. The algorithm distinguishes between two classes, grain and grain boundary area, making it effective on a variety of microstructure images. For patch clustering, Gaussian Mixture Model (GMM) is used. The final, pixelwise segmentation is achieved with the Seeded Region Growing (SRG) algorithm using the identified grain areas as seed points.

D. Alagić, KAI - Kompetenzzentrum für Automobil- und Industrieelektronik GmbH, Villach, Austria; e-mail: [email protected]
J. Pilz, Alpen-Adria-Universität, Klagenfurt, Austria; e-mail: [email protected]


12.1 Introduction The usage of power semiconductor devices, as well as the demand on their functionality and robustness, constantly increases. To ensure that the devices operate reliably, accelerated lifetime tests are performed in practice. During such a test, a high thermomechanical load is applied, which triggers microstructural changes and degradation within the metal layers. To properly understand and model the degradation process, these changes need to be investigated and taken into account. The ability to characterize the microstructure is important not only for degradation modeling but also for new advances in materials science. It is well known that the characteristics like the size and morphology of grains define the physical and mechanical properties of polycrystalline metal. Thus, this knowledge is one step toward the prediction of the material performance in a given application. Electron backscatter diffraction (EBSD) is a laboratory-based software tool used in materials science to obtain microstructure characteristics such as distributions of grain size, grain boundary length, grain shape, and grain (mis)orientation out of polycrystalline materials. This technique is powerful but also very expensive with respect to time and money. Scanning electron microscopy (SEM) and focused ion beam (FIB) are alternative methods for microstructure visualization commonly used in practice. However, the relevant information needs to be extracted manually. This process is time-consuming, subjective, and prone to human errors. Therefore, the aim of this work is to develop an image processing algorithm to automatically extract relevant information like damage patterns and grain characteristics out of these images.

12.2 Input Data As already mentioned in Sect.12.1, quantitative information about the microstructure is of high interest for quantitative (statistics) and qualitative (materials science) degradation analysis. Therefore, the developed algorithm needs to be applicable to a variety of microstructure images and not only to those showing degraded metal films. In this work, FIB and SEM images that visualize the grain structure of the material are investigated. An example of a SEM cross section image with damage patterns, preparation artifacts, and grain boundaries is illustrated in Fig. 12.1. The yellow lines highlight the damage patterns induced by the high thermomechanical load. The green ellipse surrounds one part of a grain boundary. Taking a deeper look, it can be seen that the gray levels of the grain boundaries differ heavily. Darker boundaries are similar to damage patterns, but also boundaries that are nearly white can be observed. This means that the damage patterns, background area, and some of the grain boundaries in the image can show similar properties in terms of intensity and texture. Further, a high variation in the quality of the investigated images is


Fig. 12.1 Example of a degraded metal film due to thermomechanical fatigue in semiconductor industry; damage patterns (yellow), preparation artifacts (blue), and grain boundary (green)

Fig. 12.2 Examples of investigated microstructure images with and without damage

present, which is due to the specimen preparation and the setup parameters during image acquisition. Some examples of artifacts caused by the specimen preparation process are indicated with blue rectangles in Fig. 12.1. The images used in this work can be divided into two groups, with and without damage, as illustrated in Fig. 12.2. To reach the final goal, the work is partitioned into two parts: damage detection and grain segmentation. Hence, in case of damage, damage detection is applied as a pre-processing step for grain segmentation. In the next section, the damage detection algorithm, which is developed in the first stage of our research, is explained.


12.3 Previous Work

For some research questions, the amount of damage in a degraded metal film is the quantity of interest; therefore, an algorithm for automatic damage detection and quantification was developed first [2]. In this algorithm, an unsupervised approach is followed, since ground truth labelings for the images are not provided and manual annotation at pixel level is tedious, inconsistent, and prone to errors. Such an approach is promising for evaluating new, previously unseen images, since all parameters and settings depend only on the image to be analyzed. The algorithm consists of two stages. In the first stage, the metal layer of interest is extracted using the k-means method. In the second stage, the nonlocal means (NL-means) denoising method is applied. As an input parameter, NL-means requires the noise standard deviation. To avoid possible inaccuracy due to the high variation in the amount of noise between images, this parameter is computed automatically for each image. As a result, the non-damaged part of the image is heavily smoothed, while the sharp edges of the damage patterns are preserved. To detect the damage patterns, k-means is used. The algorithm provides visual output which enables optical inspection of the results (see Fig. 12.3), as well as quantitative measures to enable further statistical analysis. According to discussions with experts, the algorithm results are plausible and can be used for further investigations. As shown in Fig. 12.3, the visibility of the grain morphology does not influence the performance of the algorithm. Thus, it can be used independently, or as a preprocessing step for grain segmentation.

12.4 Grain Segmentation

Image segmentation is the process of dividing an image into its constituent regions or objects. More precisely, it is the process of assigning a class label to each pixel in the image such that pixels with the same label are more similar to each other than pixels with different labels. This is one of the most challenging tasks in image processing. Numerous approaches, e.g., threshold techniques, boundary-based methods, region-based methods, or hybrid techniques that combine boundary and region criteria, are described in the literature. In recent years, machine learning techniques, and especially deep neural networks, have gained a lot of popularity in this area of research. However, the diversity of the target images and the lack of a large labeled dataset limit the usage of these methods in our application. Further, the number of regions (grains) in our images is unknown. Taking a closer look at Fig. 12.2, it can be seen that the majority of grains in one image have similar intensity, contrast, and texture. The grain boundaries are the only common property that enables distinguishing between the regions in all images. A region-based algorithm for image segmentation that shows promising results with this type of images is Seeded Region Growing (SRG).


Fig. 12.3 The output of the damage detection algorithm. The yellow lines denote the boundaries of the identified damage patterns

12.4.1 Seeded Region Growing (SRG) SRG is an image segmentation algorithm for intensity images, where the individual objects or regions are characterized by connected pixels of similar gray value [1]. It is robust, fast, and free of tuning parameters. This algorithm performs image segmentation with respect to the starting set of points called seeds. Seeds are used to compute the initial mean gray value for each region. That means that each region in the image must have a starting seed point in order to be detected. Thus, the number of seeds equals the number of regions in the image. The condition of growth is the difference of a gray value of a candidate pixel and mean gray level intensity of a neighboring region. At each step of the algorithm, a candidate with the smallest


difference to a neighboring region is added to that region. All of its neighboring points that are not yet assigned to any region are added to a candidate list. For the SRG algorithm, seeds must be given in advance. Manual seed selection is not an option because of the large number of grains in an image. Further, each seed must provide a good estimate of the statistics of the corresponding region. A seed can be one pixel or a group of pixels. The choice of an outlier pixel as a seed can lead to poor segmentation results; therefore, it is recommended to use larger areas as seeds. To automatically determine seeds in microstructure images, a patch-based approach is proposed. The image is divided into patches (nonoverlapping rectangular selections) of the same size. The goal is to use features to distinguish between patches containing two grains and a grain boundary and patches that lie completely inside one grain. In the case of images with damage patterns, the damage detection algorithm described in Sect. 12.3 is applied as a pre-processing step, and patches that contain damage patterns are excluded from consideration. The features extracted from the patches are used in a Gaussian Mixture Model (GMM) to classify patches into two classes: grain and grain boundary area. The whole workflow for microstructure image segmentation is illustrated in Fig. 12.4 and described in more detail below.

Fig. 12.4 The proposed workflow for microstructure image segmentation
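To make the growth rule above concrete, the following is a minimal Python sketch of seeded region growing on a grayscale image. It is only an illustration, not the authors' ImageJ implementation; the 4-connectivity and the priority-queue bookkeeping (in which a candidate's difference is computed when it is pushed) are simplifying assumptions of this sketch.

```python
import heapq
import numpy as np

def seeded_region_growing(image, seeds):
    """Minimal SRG: `image` is a 2-D float array, `seeds` an int array of the
    same shape with 0 for unlabeled pixels and k > 0 for seed region k."""
    labels = seeds.copy()
    regions = {}                      # region id -> [sum of gray values, pixel count]
    for k in np.unique(seeds[seeds > 0]):
        mask = seeds == k
        regions[k] = [float(image[mask].sum()), int(mask.sum())]

    def region_mean(k):
        s, n = regions[k]
        return s / n

    h, w = image.shape
    heap = []                         # candidate list: (difference, row, col, region id)

    def push_neighbours(r, c, k):
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-connectivity
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w and labels[rr, cc] == 0:
                diff = abs(float(image[rr, cc]) - region_mean(k))
                heapq.heappush(heap, (diff, rr, cc, k))

    for r, c in zip(*np.nonzero(seeds)):
        push_neighbours(r, c, labels[r, c])

    while heap:
        _, r, c, k = heapq.heappop(heap)      # candidate with the smallest difference
        if labels[r, c] != 0:                 # already assigned in the meantime
            continue
        labels[r, c] = k                      # add the candidate to its region
        regions[k][0] += float(image[r, c])
        regions[k][1] += 1
        push_neighbours(r, c, k)              # its unassigned neighbours become candidates
    return labels
```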


12.4.2 Image Denoising and Patch Determination

Image denoising is one of the most important steps in the algorithm. It influences not only the patch features but also the performance of the SRG algorithm afterward. A suitable denoising filter should smooth the grain area but preserve the grain boundaries, as they enable distinguishing between grains. For this task, the nonlocal means (NL-means) filter is used [4]. NL-means reduces noise in the image without destroying fine structures like edges and texture. Each pixel value is replaced by a weighted average of other pixels in the image. More precisely, a small patch centered on the target pixel is compared with patches centered on other selected pixels, and the weights are computed based on the patches' similarity. As a result, image structures are not blurred. In this work, we use the NL-means implementation of T. Wagner and P. Behnel [12], which includes the changes to the original method proposed by J. Darbon et al. [6]. The main contribution of [6] is a method for parallel processing which enables efficient computation of the weights. After denoising, the image is divided into patches (nonoverlapping rectangular selections) of a fixed size defined by the user. In order to choose the optimal patch size, the following trade-offs have to be considered:

1. The patch size should not be too small. Small patches could be completely contained in a grain boundary, leading to a false assumption of homogeneous grain areas.
2. The patch size should not be too large. Large patches could be positioned such that none of them lies completely inside small grains.

Thus, the patch size (width and height) depends on the microstructure of the material under investigation and the resolution of the image. At this point in the algorithm, a distinction is made between images with and without damage. If the image contains damage, the damage detection algorithm is applied, damage patches are excluded, and feature extraction is performed for the remaining patches. Otherwise, feature extraction is performed for all patches.
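A rough Python equivalent of this denoising and patch-determination step is sketched below (the chapter itself uses the ImageJ NL-means plug-in [12]). The scikit-image functions estimate_sigma and denoise_nl_means play the role of the automatic noise estimation and the filter; the test image, filter parameters, and 32×32 patch size are illustrative assumptions only.

```python
import numpy as np
from skimage import data, img_as_float
from skimage.restoration import denoise_nl_means, estimate_sigma

# Stand-in image; in practice this would be the SEM/FIB cross-section image.
image = img_as_float(data.camera())

# Estimate the noise standard deviation per image, as in the damage detection step.
sigma = float(estimate_sigma(image))

# Non-local means: smooth grain interiors while preserving grain-boundary edges.
denoised = denoise_nl_means(image, h=1.15 * sigma, sigma=sigma,
                            patch_size=7, patch_distance=11, fast_mode=True)

# Divide the denoised image into non-overlapping rectangular patches of a user-defined size.
def make_patches(img, patch_h, patch_w):
    rows, cols = img.shape[0] // patch_h, img.shape[1] // patch_w
    return [img[r * patch_h:(r + 1) * patch_h, c * patch_w:(c + 1) * patch_w]
            for r in range(rows) for c in range(cols)]

patches = make_patches(denoised, 32, 32)   # 32x32 is only an example size
```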

12.4.3 Feature Extraction The grain patches are homogeneous with small variation in gray intensities. The grain boundary patches show dispersion in gray values caused by a grain boundary which is either brighter or darker than the neighboring regions. It can happen that a grain patch contains some preparation artifacts or outlier pixels. This patch would have similar properties as a patch which intersects just a small part of a grain boundary. To avoid a high number of misclassifications, the proper feature selection is crucial for this algorithm. The selected features contain information about the region homogeneity. They are based on the properties of the original denoised patch,


its gradient computed with the Sobel operator, and texture information [7]. They can be divided into three groups:

1. Denoised image patch
   • Standard deviation
   • Absolute difference between the mean and the mode
2. Sobel operator
   • Standard deviation
   • Mean gradient value
3. Textural features

Textural features are extracted from each patch based on the corresponding gray-level co-occurrence matrix (GLCM). For an 8-bit image, the GLCM, denoted by $P(i,j)$, is a $256 \times 256$ matrix whose $(i,j)$th entry is the frequency with which two neighboring pixels separated by distance $d$ occur in the image, one with gray value $i$ and the other with gray value $j$. In our application, we compute the horizontal GLCM, i.e., we use distance $d = 1$ and consider the right neighbor of each pixel. From this matrix, the following features are extracted:

• Angular Second Moment

$$f_1 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j)^2, \qquad (12.1)$$

where $p(i,j)$ is the $(i,j)$th entry of the normalized GLCM ($P(i,j)/R$, with $R$ a normalizing constant) and $N_g$ is the number of distinct gray values.

• Inverse Difference Moment

$$f_2 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{1}{1+(i-j)^2}\, p(i,j) \qquad (12.2)$$

• Contrast

$$f_3 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-j)^2\, p(i,j) \qquad (12.3)$$

• Entropy

$$f_4 = -\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i,j) \log\bigl(p(i,j)\bigr) \qquad (12.4)$$

• Homogeneity

$$f_5 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} \frac{p(i,j)}{1+|i-j|} \qquad (12.5)$$

• Variance

$$f_6 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i-\mu)^2\, p(i,j), \qquad (12.6)$$

$$\mu = \frac{\mu_x + \mu_y}{2}, \qquad (12.7)$$

where $\mu_x$ and $\mu_y$ are the means of $p_x$ and $p_y$, respectively:

$$p_x(i) = \sum_{j=1}^{N_g} p(i,j), \qquad (12.8)$$

$$p_y(j) = \sum_{i=1}^{N_g} p(i,j). \qquad (12.9)$$

In total, ten features are extracted for each patch.
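As a sketch of how these ten features could be computed (the chapter's own implementation is an ImageJ plug-in), the following Python function builds the horizontal GLCM with scikit-image and evaluates Eqs. (12.1)–(12.9) directly; the feature names and the assumption that the patch is a float image in [0, 1] are choices of this sketch.

```python
import numpy as np
from skimage import img_as_ubyte
from skimage.feature import graycomatrix   # scikit-image >= 0.19
from skimage.filters import sobel

def patch_features(patch):
    """Ten features for one patch: 2 intensity, 2 gradient, and 6 GLCM features."""
    patch8 = img_as_ubyte(patch)            # 8-bit version for the GLCM
    grad = sobel(patch)                     # Sobel gradient magnitude

    # Horizontal GLCM: distance d = 1, right neighbour (angle 0), normalized.
    glcm = graycomatrix(patch8, distances=[1], angles=[0],
                        levels=256, symmetric=False, normed=True)
    p = glcm[:, :, 0, 0]
    i, j = np.indices(p.shape)
    mu = 0.5 * ((i * p).sum() + (j * p).sum())   # (mu_x + mu_y) / 2, Eq. (12.7)
    eps = 1e-12                                  # avoids log(0) in the entropy

    mode = np.bincount(patch8.ravel(), minlength=256).argmax()
    return {
        "std": patch.std(),
        "abs_mean_mode": abs(patch8.mean() - mode),
        "grad_std": grad.std(),
        "grad_mean": grad.mean(),
        "asm": (p ** 2).sum(),                              # f1, Eq. (12.1)
        "idm": (p / (1.0 + (i - j) ** 2)).sum(),            # f2, Eq. (12.2)
        "contrast": (((i - j) ** 2) * p).sum(),             # f3, Eq. (12.3)
        "entropy": -(p * np.log(p + eps)).sum(),            # f4, Eq. (12.4)
        "homogeneity": (p / (1.0 + np.abs(i - j))).sum(),   # f5, Eq. (12.5)
        "variance": (((i - mu) ** 2) * p).sum(),            # f6, Eq. (12.6)
    }
```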

12.4.4 Patch Clustering Based on the extracted features, the patches are classified into two classes: Class 1: grain area Class 2: grain boundary area For this task, we compared different clustering algorithms: k-means, k-medoids, fuzzy c-means, hierarchical clustering, and Gaussian Mixture Model (GMM) [3]. The outputs are compared visually and using two cluster validation measures: 1. Internal measures: the connectivity, the silhouette coefficient, and the Dunn index. These measures use intrinsic information in the data to assess the quality of the clustering. 2. Stability measures: the average proportion of non-overlap (APN), the average distance (AD), the average distance between means (ADM), and the figure of merit (FOM). These measures evaluate the consistency of a clustering result by comparing it with the clusters obtained after each feature is removed, one at a time [8].


The best overall performance is achieved with a GMM. Due to space limitations, the comparison results, which were obtained for a number of images individually, are not shown here. The GMM assumes that each cluster k is modeled by a Gaussian distribution characterized by the following parameters:

• $\mu_k$: mean vector
• $\Sigma_k$: covariance matrix
• an associated probability distribution for the mixture of the models

Each observation has a probability of belonging to each cluster [8]. To estimate the model parameters, the Expectation-Maximization (EM) algorithm initialized by hierarchical model-based clustering is used. Each cluster k is centered at the mean $\mu_k$, with increased density near the mean. The geometric features (shape, volume, orientation) of each cluster are determined by the covariance matrix $\Sigma_k$. After clustering, the connected grain areas are used as seed points for the SRG algorithm. Running this algorithm, the final pixelwise segmentation in the case of non-damaged images is achieved. In the case of damaged images, the regions obtained after applying SRG will overflow into the damaged areas. The final segmentation is then achieved by combining the result with the damage pattern boundaries from the output of the damage detection algorithm.
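The chapter fits the mixture with the R package mclust (see Sect. 12.4.5). As a rough Python analogue only, the sketch below selects among scikit-learn's covariance parameterizations by BIC and assigns each patch a cluster-membership probability; the random feature matrix is a stand-in for the ten extracted features per patch.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Stand-in for the real data: one row of the ten patch features per patch.
feature_matrix = np.random.rand(500, 10)
X = StandardScaler().fit_transform(feature_matrix)

best_model, best_bic = None, np.inf
for cov in ("full", "tied", "diag", "spherical"):   # scikit-learn's covariance parameterizations
    gmm = GaussianMixture(n_components=2, covariance_type=cov,
                          n_init=5, random_state=0).fit(X)
    bic = gmm.bic(X)                                # lower BIC = preferred parameterization
    if bic < best_bic:
        best_model, best_bic = gmm, bic

labels = best_model.predict(X)                 # two classes: grain area vs. grain-boundary area
probabilities = best_model.predict_proba(X)    # membership probability of each patch in each cluster
```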

12.4.5 Implementation

The algorithm is implemented in the image processing software ImageJ [5, 9] combined with the statistical software R [10]. All steps in the algorithm, except the GMM, are implemented in ImageJ as a Java plug-in. To fit the model, the R package mclust is used [11]. In this package, different parameterizations of the covariance matrix are available. The available model options are represented by identifiers including EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, and VVV. The first identifier refers to volume, the second to shape, and the third to orientation. E stands for "equal", V for "variable", and I for "coordinate axes". The best model is selected according to the Bayesian Information Criterion (BIC) [8].

12.5 Results The described algorithm for microstructure segmentation uses FIB and SEM cross section images. As the images differ in quality, noise level, and microstructure representation, each step is carefully defined and tested on the majority of available images.


Figure 12.5 illustrates the performance of the algorithm on five images of the material of interest: FIB cut (first two rows) and SEM cross section without damage patterns (third row) and with damage patterns (last two rows). The first column represents the original images. In the second column, the output of the GMM is shown. For better visual inspection of the clustering result, only patches classified as grain area are colored black. Each connected group of black patches is used as a seed for the SRG algorithm. The segmentation output after running the SRG algorithm is shown in the third column. Each region is represented with different color. The obtained results show that the developed method is powerful, as it can be applied on different image types. As expected, high-quality images produce better

Fig. 12.5 The output of the proposed algorithm. Black squares in the second column represent patches classified as grain area using GMM. The third column illustrates the SRG segmentation output. In the last column, the segmentation output after user interaction (user adds and/or deletes some seed points) is shown


segmentation results. However, there is still room for improvement. The outputs in Fig. 12.5 indicate two types of errors:

• Oversegmentation—one grain is segmented into several grains. This happens due to unconnected groups of grain patches inside one grain producing more than one seed for that grain. The reason could be patch misclassification but also an irregular grain shape. For example, a grain can be too thin in one part, so that the overlaying patches always intersect a grain boundary.
• Undersegmentation—one segmented region contains several grains. This happens if a grain has no initial seed point. Root causes are poor delineation of grain boundaries and an unfavorable combination of small grains and large patch sizes.

This corroborates our previous claim about the importance of the extracted features as well as the determination of an appropriate patch size. The results can be improved with little user interaction. The algorithm leaves the user the option to add or delete selected seeds before the SRG is applied. The final segmentation after user interaction is illustrated in the fourth column of Fig. 12.5. As expected, comparing the automatic and semiautomatic segmentation results, more precise segmentation is achieved with user interaction.

12.6 Conclusion and Outlook In this work, an algorithm for (semi)automatic microstructure image segmentation is presented. A number of FIB and SEM cross section images of a material of interest are studied. Since no labeled data is provided, an unsupervised approach is followed. The used features have proven to be good estimates of a region’s homogeneity. The model output depends only on the image to be analyzed leading to robust and reproducible results. Our algorithm can be applied to a variety of microstructure images. The output can be used to extract quantitative information about the microstructure like the characteristics of damage patterns, grain size distribution, grain boundary length, grain shape, etc. This information is of high importance for materials scientists and degradation modeling because it enables the description of changes in microstructure. With small user interaction, the results can be further improved. The described algorithm classifies patches based only on the extracted features, independently of each other. In other words, no spatial information is taken into account. To reduce misclassifications due to the lack of information from neighboring patches, in future we will include spatial dependencies. For this purpose, the theory of conditional random fields (CRF) will be followed. A CRF model uses local information of a patch and introduces the spatial information by favoring the same label for neighboring patches with similar features. It is a powerful approach because of its flexibility and strong theoretical background. Besides defining an adequate graphical structure of the model and the feature space, an appropriate parameter estimation technique that can cope with computational challenges and


small amount of data needs to be developed and applied. To quantitatively measure the algorithm accuracy, EBSD and SEM images of the same material area will be produced and compared. Acknowledgments The authors would like to thank Olivia Pfeiler, Michael Nelhiebel, and Barbara Pedretscher for valuable discussions on the topic. This work was funded by the Austrian Research Promotion Agency (FFG, Project No. 863947).

References 1. Adams, R., Bischof, L.: Seeded region growing. IEEE Trans. Pattern Anal. Mach. Intell. 16(6), 641–647 (1994). https://doi.org/10.1109/34.295913 2. Alagi´c, D., Pilz, J.: Unsupervised algorithm to detect damage patterns in microstructure images of metal films. In IEEE International Conference on Image Processing, Applications and Systems (IPAS), Sophia Antipolis, France (2018). https://doi.org/10.1109/IPAS.2018.8708852 3. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006) 4. Buades, A., Coll, B., Morel, J.-M.: Non-local means denoising. Image Process. Line 1, 208– 212 (2011). https://doi.org/10.5201/ipol.2011.bcm_nlm 5. Burger, W., Burge, M. J.: Digital Image Processing: An Algorithmic Introduction Using Java, 2nd edn. Springer, Berlin (2016) 6. Darbon, J, Cunha, A., Chan, T.F., Osher, S., Jensen, G.J.: Fast nonlocal filtering applied to electron cryomicroscopy. In: IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Proceedings, ISBI, pp. 1331–1334 (2008) 7. Haralick, R.M., Shanmugam, K., Dinstein, Ih.: Textural features for image classification. IEEE Trans. Syst. Man Cybern. SMC-3(6), 610–621 (1973) 8. Kassambara, A.: Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning. STHDA (2017) 9. Rasband, W.S.: ImageJ, U. S. National Institutes of Health, Bethesda, Maryland, USA, (1997– 2018). https://imagej.nih.gov/ij/ 10. R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0 (2008). http:// www.R-project.org 11. Scrucca, L., Fop, M., Murphy, T.B., Raftery, A.E.: mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8/1, 205–233 (2016) 12. Wagner, T., Behnel, P.: IJ-NL-means: Non local means 1.4.6 (2016). https://doi.org/10.5281/ zenodo.47468

Chapter 13

Clustering and Symptom Analysis in Binary Data with Application N. P. Alexeyeva, F. S. Al-juboori, and E. P. Skurat

Abstract A new method to generalize statistical multifactor analysis, based on reducing the dimensionality of categorical data by means of projective subspaces, is proposed. The method uses algebraic normal forms of random binary vectors, called super-symptoms, adapted to the statistical analysis of biomedical applications. Previously, the task was to consider all super-symptoms and to find the super-symptom with the best extremal properties by means of various measures. As an optimality criterion, we use various tests, for example, Fisher's exact test and the log-rank test, or uncertainty coefficients. The disadvantage of this method is that the number of super-symptoms is too large. Therefore, it is proposed to apply clustering of the observations first and then describe the clusters using super-symptoms. In practice, the most important combinations of factors affecting breast cancer have been identified, and a risk group for the presence of distant metastases and tumor spread to the lymph nodes has been defined. These factors were found to be progesterone receptor positive without the combination of estrogen receptor positive with mastectomy or lymphatic dissection.

13.1 Introduction

Cluster analysis is designed to aggregate similar observations and has a wide range of applications. Clustering methods are well known, but often there is a problem of how to name or identify a particular cluster. Of course, it is possible to describe in detail how many observations are included in one or another cluster, but a way to quickly parameterize clusters is still needed.

N. P. Alexeyeva · E. P. Skurat Department of Mathematics, St. Petersburg State University, St. Petersburg, Russia F. S. Al-juboori () Department of Mathematics, University of Information Technology and Communications, St Al-Nidal, Baghdad, Iraq © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Pilz et al. (eds.), Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications, Contributions to Statistics, https://doi.org/10.1007/978-3-031-40055-1_13



In [1–5, 8] the authors propose a symptom-syndrome method for the analysis of multidimensional binary data. For example, we need to study the dependence of a variable $Y$ on the binary random variables $X_1, \ldots, X_m$ taken together. This method allows extending the set of variables to linear combinations over the field $F_2$, which are called symptoms, and highlighting the symptom most significant for $Y$. Nontrivial symptoms can be considered as latent variables, linking the trivial symptoms $X_1, \ldots, X_m$ into a single syndrome. The next step was to consider as symptoms algebraic normal forms (ANF) instead of linear combinations over the field $F_2$. ANF are known as Zhegalkin polynomials, but in statistical analysis it is more convenient to call them super-symptoms. There is a one-to-one correspondence between ANF and Boolean functions, but ANF are preferable for the algorithm that selects the most informative variable. Thus, the search space was expanded by means of ANF, and it became possible to obtain a result that is easier to interpret. However, the problem of exhaustive search became even more relevant.

Let us look at this problem from the other side. Each super-symptom divides the multidimensional space $\Omega_m$, consisting of the possible values of the binary vector $X = (X_1, \ldots, X_m)$, which can be encoded with elements of the field $F_{2^m}$, into two subsets $A$ and $\bar{A}$ with $A \cup \bar{A} = \Omega_m$ and $A \cap \bar{A} = \emptyset$. The reverse is also true, i.e., any subset $A \subseteq \Omega_m$, together with its complement $\bar{A}$, can be described by means of some super-symptom. Therefore, to find the most significant super-symptom, we first find the most distinct cluster and then indicate the corresponding super-symptom. This avoids the complexity of an exhaustive search and quickly finds the most significant super-symptom.

13.2 The Symptom Analysis

13.2.1 The Symptom and Syndrome Definition

Suppose there is a random binary vector $X_m = (X_1, \ldots, X_m)^T$ with the set of all realizations $\Omega_m$. The set $\Omega_m$ consists of vectors $x = (x_1, \ldots, x_m)$, $x_j \in \{0,1\}$, which are conveniently denoted by

$$j = j(x) = \sum_{k=1}^{m} x_k 2^{k-1}, \qquad j = 0, 1, \ldots, 2^m - 1. \qquad (13.1)$$

Definition 1 Let $A = (a_1, \ldots, a_m)^T$ be a vector with elements $a_i \in \{0,1\}$, and let the set $\tau = \{t : a_t = 1\}$ consist of $k$ elements. The linear combination

$$X_\tau = \sum_{i=1}^{m} a_i X_i \ (\mathrm{mod}\ 2) = A^T X_m \ (\mathrm{mod}\ 2)$$

is called a symptom of rank $k$. The elements of the field $F_2$ of order 2 may be represented by the integers $0, 1$, and the expression $(\mathrm{mod}\ 2)$ indicates the remainder of the division by 2 of the result of the corresponding integer operation. The variables $X_1, \ldots, X_m$ are called trivial symptoms. Each symptom is a binary random variable with a new meaning; for example, $X_{12} = X_1 + X_2 \ (\mathrm{mod}\ 2)$ means the presence of one without the other.


A set of all possible symptoms which can be constructed from $X_m$ will be called a syndrome. There is a simple recurrent way to form a syndrome as follows:

$$S(X_1) = X_1, \quad S(X_1, X_2) = \bigl(S(X_1),\, X_2,\, S(X_1) + X_2 \ (\mathrm{mod}\ 2)\bigr), \ \ldots,$$
$$S(X_m) = \bigl(S(X_{m-1}),\, X_m,\, S(X_{m-1}) + X_m \ (\mathrm{mod}\ 2)\bigr), \quad \text{where } X_m \notin S(X_{m-1}),\ m > 2. \qquad (13.2)$$

Thus, symptoms in the syndrome .S(Xm ) have impulse ordering. Symptoms, whose rank in the syndrome .S(Xm ) is equal to .2k , .k = 0, 1, . . . , m − 1, are called basic.

13.2.2 Impulse Vector and Super-symptoms

Consider a random vector that is formed analogously to (13.2) but by using multiplication:

$$V(X_1) = X_1, \quad V(X_1, X_2) = \bigl(V(X_1),\, X_2,\, V(X_1) X_2 \ (\mathrm{mod}\ 2)\bigr)^T, \ \ldots,$$
$$V(X_m) = \bigl(V(X_{m-1}),\, X_m,\, V(X_{m-1}) X_m \ (\mathrm{mod}\ 2)\bigr)^T, \quad X_m \notin V(X_{m-1}). \qquad (13.3)$$

At $m = 2$ we have the impulse vector $V(X_1, X_2) = (X_1, X_2, X_1 X_2)^T$, and at $m = 3$

$$V(X_1, X_2, X_3) = (X_1, X_2, X_1 X_2, X_3, X_1 X_3, X_2 X_3, X_1 X_2 X_3)^T. \qquad (13.4)$$

It is easy to see that the impulse vector $V(X_m)$ consists of $M = 2^m - 1$ variables. If these variables are taken as basic symptoms in a syndrome, then we get a super-syndrome $\mathbf{S} = S(V(X_m))$, consisting of $2^M - 1$ elements which are called super-symptoms. For example, the super-syndrome $S(V(X_1, X_2))$ looks like a random vector with elements over $F_2$:

$$(X_1,\ X_2,\ X_1 + X_2,\ X_1 X_2,\ X_1 + X_1 X_2,\ X_2 + X_1 X_2,\ X_1 + X_2 + X_1 X_2). \qquad (13.5)$$

Each super-symptom $S_i$ from $\mathbf{S} = S(V(X_m)) = (S_1, \ldots, S_n)$, $n = 2^M - 1$, can be expressed by means of a binary vector $\alpha = (\alpha_1, \ldots, \alpha_M)^T$, $\alpha \neq (0, 0, \ldots, 0)$, as

$$S_i = \alpha^T V(X_m), \quad \text{where } i = i(\alpha) = \sum_{k=1}^{M} \alpha_k 2^{k-1}. \qquad (13.6)$$

For example, in the case of $m = 2$ and $S_5 = X_1 + X_1 X_2 \ (\mathrm{mod}\ 2)$, we have $\alpha = (1, 0, 1)$ and $i(\alpha) = 1 \cdot 2^0 + 0 \cdot 2^1 + 1 \cdot 2^2 = 5$. In the case of $m = 3$ and $M = 2^3 - 1 = 7$, multiplication of the vector $V(X_1, X_2, X_3)$ from (13.4) by the vector $\alpha = (1, 0, 0, 1, 0, 0, 1)$ corresponds to the super-symptom $X_1 + X_3 + X_1 X_2 X_3 \ (\mathrm{mod}\ 2)$, with $i(\alpha) = 1 + 8 + 64 = 73$.

As already mentioned, the set $\Omega_m$ of all possible values $x = (x_1, \ldots, x_m)$, $x_i \in \{0,1\}$, consists of $2^m$ elements which are encoded by $j = j(x)$ according to (13.1) and can be represented by the integers $0, 1, \ldots, M$, where $M = 2^m - 1$. By means of the super-symptom $S_i$, $i = i(\alpha)$, we obtain the sequence $(0, \beta_1, \ldots, \beta_M)$, where $\beta_j \in \{0,1\}$ are the values of the super-symptom at $j(x) = 0, 1, \ldots, M$. If at $m = 2$ we have $S_5 = X_1 + X_1 X_2 \ (\mathrm{mod}\ 2)$, then $\beta_1 = 1 + 1 \cdot 0 \ (\mathrm{mod}\ 2) = 1$ at $(X_1, X_2) = (1, 0)$, $\beta_2 = 0 + 0 \cdot 1 \ (\mathrm{mod}\ 2) = 0$ at $(X_1, X_2) = (0, 1)$, and $\beta_3 = 1 + 1 \cdot 1 \ (\mathrm{mod}\ 2) = 0$ at $(X_1, X_2) = (1, 1)$, so that $\beta = (1, 0, 0)$. Thus, using (13.1), for $x \in \Omega_m$, $x \neq (0, 0, \ldots, 0)$, the elements of the vector $\beta$ are

$$\beta_j = V(x)^T \alpha, \qquad j = j(x) = 1, 2, \ldots, M. \qquad (13.7)$$

13.2.3 Prefigurations of Super-symptom

Our task is to find a connection between ANF and clusters in a dichotomous space. Since the super-symptom $S = \alpha^T V(X_m)$ is a function of $X_m$ with values in $\{0, 1\}$, there are, respectively, the prefigurations (preimages)

$$B_0 = \{x \in \Omega_m \mid S = 0\}, \qquad B_1 = \{x \in \Omega_m \mid S = 1\}, \qquad (13.8)$$

which can be considered as clusters. In the case of symptoms, the sets $B_0$ and $B_1$ contain the same number of elements, but in the case of super-symptoms the cardinalities of these sets can differ. The prefiguration $B_0$ or $B_1$ can be obtained using the recurrent $M \times M$ matrix $D(m)$, where $M = 2^m - 1$. Let $D(1) = 1$, let $0_M$ and $I_M$ denote the zero and all-ones (unit) column vectors of length $M$, respectively, let $0^M$ be the zero row of length $M$, and let $O_M$ be the zero $M \times M$ matrix. Then

$$D(m+1) = \begin{bmatrix} D(m) & 0_M & O_M \\ 0^M & 1 & 0^M \\ D(m) & I_M & D(m) \end{bmatrix}. \qquad (13.9)$$

We get this matrix when we consider the $M$ rows of all possible values $V(x)$ from (13.3) for those $x$ with $j(x) = 1, 2, \ldots, M$. For example, for $m = 2$, $M = 3$, and $V(X_1, X_2) = (X_1, X_2, X_1 X_2)$:

  j(x1, x2)   x1   x2   X1   X2   X1·X2
  1           1    0    1    0    0
  2           0    1    0    1    0
  3           1    1    1    1    1

$$D(2) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$


The column of the form $[0, 1, I]^T$ (zero block, 1, all-ones block) indicates the possible values of the next variable in the space $\Omega_m$, and the columns $[O, 0, D(m-1)]^T$ are obtained by multiplying the columns $[D(m-1), 0, D(m-1)]^T$ by $[0, 1, I]^T$. To get the vector $\beta$, we can simply multiply the matrix $D$ by the vector $\alpha$ according to (13.7):

$$D(m)\,\alpha = \beta \quad \text{over } F_2. \qquad (13.10)$$

For example, at $\alpha = (1, 0, 1)^T$ we have

$$\beta = D(2)\,\alpha = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \ \text{over } F_2,$$

which corresponds to $(B_0, B_1) = (\{0, 2, 3\}, \{1\})$. Conversely, if we have $(B_0, B_1)$ with $0 \in B_0$, $B_0 \cup B_1 = \Omega_m$, and $B_0 \cap B_1 = \emptyset$, then the components of the vector $\beta = (\beta_1, \ldots, \beta_M)$ are

$$\beta_j = \begin{cases} 1, & \text{if } j \in B_1, \\ 0, & \text{otherwise}. \end{cases} \qquad (13.11)$$

The next task is to find the corresponding vector .α.

13.2.4 The Super-symptom Recovery by the Vector β

Denote by $D(m)_i$ the sum of the elements in the $i$-th row of the matrix $D(m)$. It is easy to prove the following statement using mathematical induction.

Theorem 1 Let $\alpha$, $\beta$, and $D = D(m)$ be defined as in (13.6), (13.7), and (13.9), respectively. Then the following expressions are valid over $F_2$: (1) $D(m)_i = 1$; (2) $D^{-1} = D$; (3) $D\beta = \alpha$.

Proof 1. It is easy to notice that the first statement is valid in the cases $m = 1$ and $m = 2$, because $D(1) = 1$ and $D(2)_i = 1$ for all $i = 1, 2, 3$. If $D(m)_i = 1$, then the sum of the elements in each row of the matrix $[D(m), 0_M, O_M]$ is equal to 1 too. The sum of the elements in the row $[0^M, 1, 0^M]$ is obviously equal to 1. And the sum of the elements in each row of the matrix $[D(m), I_M, D(m)]$ is equal to 1, because the sum of two identical elements is equal to 0 over the field $F_2$.


2. It is clear that $D(2) \cdot D(2) = I_3$ at $m = 2$. Assume that $D(m) \cdot D(m) = I_M$, where $M = 2^m - 1$. We show that $D(m+1) \cdot D(m+1) = I_{M_1}$, where $M_1 = 2^{m+1} - 1$:

$$D(m+1) \cdot D(m+1) = \begin{bmatrix} D(m) & 0_M & O_M \\ 0^M & 1 & 0^M \\ D(m) & I_M & D(m) \end{bmatrix} \cdot \begin{bmatrix} D(m) & 0_M & O_M \\ 0^M & 1 & 0^M \\ D(m) & I_M & D(m) \end{bmatrix} = \begin{bmatrix} D(m) \cdot D(m) & 0_M & O_M \\ 0^M & 1 & 0^M \\ D(m) \cdot D(m) + D(m) \cdot D(m) & B & D(m) \cdot D(m) \end{bmatrix}.$$

The vector $B$ consists of elements of the form $1 + D(m)_i$ and hence equals $0_M$ by the first statement. By the induction hypothesis $D(m) \cdot D(m) = I_M$, and over the field $F_2$

$$D(m) \cdot D(m) + D(m) \cdot D(m) = O_M$$

as the sum of two identical matrices. Thus we obtain the unit matrix of dimension $M + M + 1 = 2(2^m - 1) + 1 = 2^{m+1} - 1 = M_1$.

3. The expression $D\beta = \alpha$ is obtained from $D\alpha = \beta$ and $D^{-1} = D$. $\square$
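The recurrence (13.9) and Theorem 1 translate directly into a few lines of code. The following sketch (an illustration, not the authors' implementation) builds $D(m)$ over $F_2$, checks $D^{-1} = D$, maps $\alpha$ to $\beta$, and recovers $\alpha$ from $\beta$; the example $\alpha$ is the one used above for $X_1 + X_3 + X_1 X_2 X_3$.

```python
import numpy as np

def build_D(m):
    """Recurrence (13.9): D(1) = [1]; all arithmetic over the field F2."""
    D = np.array([[1]], dtype=np.uint8)
    for _ in range(1, m):
        M = D.shape[0]
        top = np.hstack([D, np.zeros((M, 1), np.uint8), np.zeros((M, M), np.uint8)])
        mid = np.hstack([np.zeros((1, M), np.uint8), np.ones((1, 1), np.uint8),
                         np.zeros((1, M), np.uint8)])
        bot = np.hstack([D, np.ones((M, 1), np.uint8), D])
        D = np.vstack([top, mid, bot]) % 2
    return D

m = 3
D = build_D(m)                                    # (2^m - 1) x (2^m - 1) matrix
assert np.array_equal(D.dot(D) % 2, np.eye(D.shape[0], dtype=np.uint8))  # D^{-1} = D

alpha = np.array([1, 0, 0, 1, 0, 0, 1], dtype=np.uint8)   # X1 + X3 + X1*X2*X3
beta = D.dot(alpha) % 2            # values of the super-symptom at j = 1, ..., M, Eq. (13.10)
alpha_back = D.dot(beta) % 2       # recovery of alpha from beta, Theorem 1(3)
assert np.array_equal(alpha, alpha_back)
```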

13.2.5

Clustering in Dichotomous Space and Symptom Analysis

Clustering is a method that allows us to identify homogeneous groups of objects based on the values of their variables (dimensions). Let $X_m = (X_1, \ldots, X_m)$ be a random vector with dichotomous components and let $\Omega_m$ be the set of its $2^m$ possible values $(x_1, \ldots, x_m)$, $x_i \in \{0,1\}$, in the natural encoding $j(x) \in \{0, 1, \ldots, 2^m - 1\}$ according to (13.1). If $Y$ is a class variable with values 0 and 1, then we can estimate the probabilities $\hat{p}_j$ of Class 1, $j = 0, \ldots, 2^m - 1$, and obtain a clustering of the form $(B_0, B_1)$, where

$$B_0 = \{j \mid \hat{p}_j \le L\} \quad \text{and} \quad B_1 = \{j \mid \hat{p}_j > L\}, \qquad L \in [0, 1]. \qquad (13.12)$$

We choose $L$ so that the $p$-value of Fisher's exact test is minimal. The pair $(B_0, B_1)$ gives the vectors $\beta$ and $\alpha$ and the corresponding logical expression describing the risk group. It makes sense to consider not all of $X_1, \ldots, X_m$ but only the most informative subset, which can be obtained using the uncertainty coefficient [9].
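A minimal sketch of this threshold search in Python is given below; it loops over candidate boundaries $L$, forms the 2×2 table of the class against cluster membership, and keeps the $L$ with the smallest Fisher p-value. The example numbers are taken from the cross tabulation in Table 13.1; the function and variable names are illustrative only.

```python
import numpy as np
from scipy.stats import fisher_exact

def best_threshold(p_hat, counts_class0, counts_class1):
    """p_hat[j]: estimated probability of Class 1 in cell j of the dichotomous space;
    counts_class*[j]: observed counts of each class in cell j."""
    best = (None, 1.0)
    for L in np.unique(p_hat):
        in_B1 = p_hat > L
        table = [[counts_class1[in_B1].sum(), counts_class1[~in_B1].sum()],
                 [counts_class0[in_B1].sum(), counts_class0[~in_B1].sum()]]
        _, p_value = fisher_exact(table)
        if p_value < best[1]:
            best = (L, p_value)
    return best

# Data of Table 13.1, indexed by j = 0, ..., 7.
p_hat  = np.array([0.66, 0.00, 1.00, 1.00, 0.66, 0.80, 1.00, 0.55])
class1 = np.array([4, 0, 5, 29, 8, 4, 3, 19])
class0 = np.array([2, 1, 0, 0, 4, 1, 0, 15])
L, p = best_threshold(p_hat, class0, class1)   # B1 = {j : p_hat[j] > L}
```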


13.3 The Medical Application of the Clustering and Symptom Analysis in Binary Data

13.3.1 Dataset

This research includes data on breast cancer collected from the Cancer Oncology Hospital in Medicine City in Baghdad for a set of patients who were scheduled for biopsy, with mammograms interpreted by radiologists; the number of patients was about 101, collected in 2017. We consider the factor Class, which indicates the presence of distant metastases and tumor spread to the lymph nodes; the factor $X_1$, which indicates estrogen receptor positive; the factor $X_2$, which indicates progesterone receptor positive; and the factor $X_3$, which indicates mastectomy or lumpectomy [7].

13.3.2 Result and Discussion

The cross tabulation of Class and $(X_1, X_2, X_3)$ is presented in Table 13.1. If the cluster boundary $L \in (0.8, 1)$, then according to (13.12) the first cluster $B_1$ consists of the subgroups 2, 3, 6, and the second cluster $B_0$ includes the remaining subgroups. Thus $B_0 = \{0, 1, 4, 5, 7\}$, $B_1 = \{2, 3, 6\}$, and $\beta = (0, 1, 1, 0, 0, 1, 0)$ from (13.11). According to Theorem 1 we get $\alpha = (0, 1, 0, 0, 0, 0, 1)$, which corresponds to the super-symptom $Z = X_2 + X_1 X_2 X_3 \ (\mathrm{mod}\ 2)$. If $Z = 1$, then $\mathrm{Class} = 1$ with probability equal to 1, while if $Z = 0$, then $\mathrm{Class} = 1$ with probability equal to 0.6. In this case we obtain the smallest $p$-value of Fisher's exact test, $p = 0.000003$ [6]. In practical terms, the super-symptom $Z$ means that the progesterone receptor was positive and the excisional biopsy was performed, except in three patients. Thus we obtain the most distinct risk group, described by means of a combination of two factors. A combination of a positive progesterone receptor with an excisional biopsy indicates a worse prognosis compared to the other groups.

Table 13.1 Cross tabulation for Class and $(X_1, X_2, X_3)$

  x1   x2   x3   j = j(x)   Class 0   Class 1   p̂_j
  1    0    0    1          1         0         0.00
  1    1    1    7          15        19        0.55
  0    0    0    0          2         4         0.66
  0    0    1    4          4         8         0.66
  1    0    1    5          1         4         0.80
  0    1    0    2          0         5         1
  1    1    0    3          0         29        1
  0    1    1    6          0         3         1


13.4 Conclusion

A super-symptom method for the analysis of multidimensional categorical data is proposed, based on the correspondence between clusters and algebraic normal forms. As a result, a risk group was identified that is described by a logical combination of factors. In the future, it is planned to consider the problem of matching a super-syndrome to clusters whose number is greater than two.

References 1. Alexeyeva, N.: Analysis of biomedical systems. Reciprocity. Ergodicity. Synonymy. Publishing of the Saint-Petersburg State University, Saint-Petersburg (2013) (in Russian) 2. Alexeyeva, N., Alexeyev, A.: About a Role of the Finite Geometries in the Dichotomic Variables Correlation Analysis. Mathematical Models. Theory and Applications. Issue 4. Eds. prof. Chirkov M.K. Research Institution of Mathematic and Mechanic, Saint-Petersburg University, pp. 102–117 (2004) (in Russian) 3. Alexeyeva, N., Gracheva, P., Martynov, B., Smirnov, I.: The finitely geometric symptom analysis in the glioma survival. In: The 2nd International Conference on BioMedical Engineering and Informatics (BMEI09), China. (2009). https://doi.org/10.1109/BMEI.2009.5305560 4. Alexeyeva, N., Gracheva, P., Podkhalyuzina, E., Usevich, K.: Symptom and syndrome analysis of categorial series, logical principles and forms of logic. In: Proceedings, 3rd International Conference on BioMedical Engineering and Informatics BMEI, pp. 2603–2606. China (2010) 5. Alexeyeva N.P., Al-Juboori, F.S., Skurat, E.P.: Symptom analysis of multidimensional categorical data with applications. Period. Eng. Nat. Sci. 8(3), 1517–1524 (2020) 6. Mehta, C.R., Patel, N.R.: IBM SPSS Exact tests. IBM Corporation (2011) 7. NCBI: American Journal of Cancer Research. Steroid hormone receptors as prognostic markers in breast cancer (2017). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5574935/. Accessed 9 Nov 2018 8. Al-Juboori, F.S., Alexeyeva, N.P.: Application and comparison of different classification methods based on symptom analysis with traditional classification technique for breast cancer diagnosis. Period. Eng. Nat. Sci. 8(4), 2146–2159 ( 2020) 9. Shannon, C.E., Weaver, W: The Mathematical Theory of Communication. University of Illinois Press (1963)

Chapter 14

Big Data for Credit Risk Analysis: Efficient Machine Learning Models Using PySpark Afshin Ashofteh

Abstract Recently, Big Data has become an increasingly important source to support traditional credit scoring. Personal credit evaluation based on machine learning approaches focuses on the application data of clients in open banking and new banking platforms, with challenges concerning Big Data quality and model risk. This paper presents PySpark code for the computationally efficient use of statistical learning and machine learning algorithms in the application scenario of personal credit evaluation, with a performance comparison of models including logistic regression, decision tree, random forest, neural network, and support vector machine. The findings of this study reveal that the logistic regression methodology represents a more reasonable coefficient of determination and a lower false negative rate than the other models. Additionally, it is computationally less expensive and more comprehensible. Finally, the paper highlights the steps, perils, and benefits of using Big Data and machine learning algorithms in credit scoring.

14.1 Introduction Risk management with the ability to incorporate new and Big Data sources and benefit from emerging technologies such as cloud and parallel computing platforms is critically important for financial service providers, supervisory authorities, and regulators if they are to remain competitive and relevant [1]. Financial institutions’ growing interest in nontraditional data may be seen as a hypothetical occurrence, a reaction to the most recent financial crisis. However, the financial crisis not only prompted several statutory and supervisory initiatives that require significant disclosure of data but also provided a positive atmosphere to get the advantages of new data sources such as nontraditional datasets [2, 3].

A. Ashofteh () NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, Lisboa, Portugal e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 J. Pilz et al. (eds.), Statistical Modeling and Simulation for Experimental Design and Machine Learning Applications, Contributions to Statistics, https://doi.org/10.1007/978-3-031-40055-1_14



There are sources of supply and demand for this increased acceptance of nontraditional data. On the supply side, technology advancements like mobile phones [4] that have expanded storage space and computing power while cutting costs have fueled rises in the new data sources. In addition, mobile data and social data have recently been used to monitor different risks [5]. On the demand side, loan providers are becoming more interested in learning how data analysis may improve credit scoring and lower the risk of default [6]. Some of the largest and most established financial institutions such as banks, insurance companies, payday lenders, peer-to-peer lending platforms, microfinance providers, leasing companies, and payment by installment companies are now taking a fresh look at their customers’ transactional data to enhance the early detection of fraud. They use innovative machine learning models that exploit novel data sources like Big Data, social data, and mobile data. Credit risk management may benefit in the long run if these advancements result in better credit choices. However, there are shorter-term hazards if early users of nontraditional data credit scoring mostly disregard the model risk and technical aspects of new methods that might affect credit scoring [7]. For instance, one crucial issue in credit evaluation is the class imbalance resulting from distress situations for loan providers [8]. These distress situations are relatively infrequent events that make the imbalance data very common in credit scoring. In addition, the limited information for distinguishing dynamic fraud from a genuine customer in a highly sparse and imbalanced environment makes default forecasting more challenging. Even though banks and loan providers should follow different regulations to reduce or eliminate credit risk, regulatory changes can potentially change the microfinance environment that generates the distribution of nontraditional data. It could be the source of changes in the probability distribution function of credit scores over time. In this case, the reliability of the models based on historical data will decrease dramatically. This time dependency of the training process needs new approaches adopted to deal with these situations and to avoid interruptions of ML approaches for Big Data over time [9]. These issues in credit evaluation show the importance of comparing the machine learning techniques for evaluating the model risk for credit scoring. Figure 14.1 summarizes the application of Big Data and small data in credit risk analytics. This paper presents greater insight into how nontraditional data in credit scoring challenges the model risk and addresses the need to develop new credit scoring models. The rest of the paper is structured as follows: Sect. 14.2 describes the PySpark code for data processing as the first step of a personal credit evaluation. Section 14.3 represents the model building and model evaluation methods. The results are shown in Sect. 14.4 for a complete personal credit evaluation. Finally, Sect. 14.5 contains some concluding remarks.


Fig. 14.1 Graphical summary of Big Data analytics on credit risk evaluation

14.2 Data Processing

In this paper, a public dataset provided by Lending Club, a peer-to-peer lending company based in the USA, is used to compare the performance of the proposed algorithms. Lending Club acts as a bridge between investors and borrowers. The author adopted a consumer loan dataset with 2,260,668 observations by combining two versions of the Lending Club loan dataset: one contains loans issued from 2007 to 2015 and the other from 2012 to 2018. The funded loans of these two datasets were combined, and the duplicates were removed to obtain a dataset from 2007 to 2018 with 1,048,575 customers and 145 attributes. The new


Table 14.1 Loan status in the combined dataset without duplicates, including the personal credit history of customers in 2007–2018

  Loan status          Default loans (1/TRUE)   Good loans (0/FALSE)
  Current              –                        603,273
  Fully paid           –                        331,528
  In grace period      –                        5,151
  Charged off          94,285                   –
  Late (31–120 days)   12,154                   –
  Late (16–30 days)    2,162                    –
  Default              22                       –
  Total                108,623                  939,952

dataset includes 108,623 default loans (true or 1) and 939,952 good loans (false or 0). Redundant columns were removed by applying the ExtraTreesClassifier approach for variable importance levels, the correlation of data features was optimized, and the 145-dimensional data was reduced to 35-dimensional data. For instance, the correlation coefficient between two attributes funded_amnt as the total amount committed to that loan at that point in time and funded_amnt_inv as the total amount committed by investors for that loan is one, which shows the complete similarity. Based on the dimensionality reduction idea of preventing overfitting, funded_amnt_inv can be eliminated. Almost the same situation is for two variables, loan_amnt and installment, with a Pearson correlation over 0.94. The dataset includes a row number column and current loan status feature with seven types of debit and credit states (current, fully paid, charged off, late (31–120 days), in grace period, late (16–30 days), default) as target variable (see Tables 14.1 and 14.2). Each row includes information provided by the applicant, loan status (current, fully paid, charged off, late (31–120 days), in grace period, late (16–30 days), and default), and information on the payments to the Lending Club Company. The author will use Python and Spark to predict the probability of default and identify the credit risk of each customer: zero/FALSE for non-default and one/TRUE for default (see Table 14.2). Spark is a tool for parallel computation with large datasets and integrates well with Python.

14.2.1 Data Treatment We must connect to a cluster to use PySpark,1 R, SQL, or Scala for Big Data. Our cluster was hosted on a machine in Databricks Community Edition, connected to all

1 See PySpark program and dataset here: https://github.com/AfshinAshofteh/creditscore_pyspark. git.


Table 14.2 Analytical base table for the Lending Club loan dataset (row number, attribute, scale, and description)

Row 1, Id (Index): A unique LC assigned ID for the loan listing.
Row 3, annual_inc (Continuous): The self-reported annual income provided by the borrower during registration.
Row 4, delinq_2yrs (Continuous): The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years.
Row 5, dti (Continuous): A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income.
Row 7, emp_length (Continuous): Employment length in years. Possible values are between 0 and 10, where 0 means less than 1 year and 10 means 10 or more years.
Row 9, funded_amnt (Continuous): The total amount committed to that loan at that point in time.
Row 11, grade (Categorical): LC assigned loan grade.
Row 12, home_ownership (Categorical): The homeownership status provided by the borrower during registration. Values: RENT, OWN, MORTGAGE, and OTHERS.
Row 31, il_util (Continuous): The ratio of total user balance to high credit/credit limit.
Row 13, inq_last_6mths (Continuous): The number of inquiries in the past 6 months (excluding auto and mortgage inquiries).
Row 14, installment (Continuous): The monthly payment owed by the borrower if the loan originates (monthly arrears).
Row 15, int_rate (Continuous): Interest rate on the loan.
Row 16, issue_d (Continuous): The month in which the loan was funded.
Row 17, loan_amnt (Continuous): The listed amount of the loan applied for by the borrower. If the credit department reduces the loan amount, it will be reflected in this value.
Row 18, loan_status (Categorical): Current status of the loan.
Row 32, mo_sin_old_il_acct (Continuous): Number of months since the earliest bank account was opened.
Row 33, mo_sin_old_rev_tl_op (Continuous): Number of months since the earliest revolving account was opened.
Row 19, mths_since_rcnt_il (Continuous): Number of months after the installment account was opened.
Row 30, mths_since_rcnt_il (Continuous): Number of months after the installment account was opened.
Row 34, mths_since_recent_bc (Continuous): Number of months since last online payment.
Row 35, mths_since_recent_inq (Continuous): Number of months of last loan inquiry.
Row 36, num_rev_tl_bal_gt_0 (Continuous): Number of revolving accounts.
Row 20, open_acc (Continuous): The number of open credit lines in the borrower's credit file.
Row 21, out_prncp (Continuous): The remaining outstanding principal for the total amount funded.
Row 38, percent_bc_gt_75 (Continuous): Recent income-expenditure ratio.
Row 22, pub_rec (Continuous): Number of derogatory public records.
Row 24, revol_bal (Continuous): Total credit revolving balance.
Row 25, revol_util (Continuous): Revolving line utilization rate, or the amount of credit the borrower uses relative to all available revolving credit.
Row 26, term (Categorical): The number of payments (installments) on the loan. Values are in months and can be either 36 or 60.
Row 39, tot_hi_cred_lim (Continuous): Credit card credit limit.
Row 27, total_acc (Continuous): The total number of credit lines currently in the borrower's credit file.
Row 40, total_bc_limit (Continuous): Amount limit of online banking.
Row 41, total_il_high_credit_limit (Continuous): Limit of overdue repayment amount.
Row 28, total_pymnt (Continuous): Payments received to date for the total amount funded.
Row 29, verification_status (Categorical): Indicates if income was verified by LC, if not verified, or if the income source was verified.

other nodes. Typically, we have one computer as the master that manages splitting up the data and the computations. The master connects to the rest of the computers in the cluster, called workers; it sends data and calculations to the workers to run, and the workers send their results back to the master. When we are just getting started with Spark on a local computer, we might run a cluster locally. If we use only a single local node for Big Data computations in a simulated cluster, we should monitor this node (i.e., computer) to preserve it when dealing with Big Data, for instance, checking whether the fan is working very hard or whether one process is taking a very long time. However, in Databricks and cloud platforms, we can safely use online resources. For this use case in credit scoring, this study uploaded the dataset into the data repository of Databricks. When we create a notebook in our cloud platform, we build a SparkContext as our connection to the cluster. This SparkContext will recognize the specified clusters after running the first command in the notebook. This paper uses Spark MLlib, the library of algorithms and transformers included with distributed Spark, for the ETL work and to build our models.


14.2.2 Data Storage and Distribution The dataset is in CSV format, and Spark can convert it into a Universal Disk Format (UDF). Additionally, it is exportable to MLeap, a standard serialization format and execution engine for machine learning pipelines. It supports Spark, scikit-learn, and TensorFlow for training pipelines and exporting them to an MLeap Bundle. As a result, if we continue leveraging Spark contents, it would be easy to import a batch in the batch stream, a collection of data points grouped within a specific time interval. Another term often used for this process is a data window (see Appendix 1—lines 1–7). This study started with the CSV file format. However, it is not an optimal format for Spark. The Parquet format is a columnar data store, allowing Spark to only process the data necessary to complete the operations versus reading the entire dataset. Parquet gives Spark more flexibility in accessing the data and improves performance on large datasets (see Appendix 1—line 9). Before starting to work with Spark, it could be recommended first to check the Spark Context and the Spark version to check the version and compatibility of our packages and program (see Appendix 1—lines 10–11). Second, check the dataset available on our cluster with the catalog to check the metadata or the data describing the structure of the data (see Appendix 1—lines 12–13).
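The CSV-to-Parquet and metadata checks described above can be sketched as follows. This is only an illustration; the file paths and application name are placeholders rather than the paper's actual configuration (the complete program is in the repository cited in the footnote).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("credit-scoring").getOrCreate()

# Read the raw CSV export of the Lending Club data (path is an example).
loan_df = spark.read.csv("/FileStore/tables/lending_club_loans.csv",
                         header=True, inferSchema=False)

# Store it as Parquet: a columnar format, so later stages read only the needed columns.
loan_df.write.mode("overwrite").parquet("/FileStore/tables/lending_club_loans.parquet")
loan_df = spark.read.parquet("/FileStore/tables/lending_club_loans.parquet")

print(spark.version)                  # check the Spark version and package compatibility
print(spark.catalog.listTables())     # inspect the metadata available on the cluster
```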

14.2.3 Munge Data According to the data schema, the types of attributes are String. Before continuing with Spark, it is essential to convert the data types to numeric types because Spark only handles numeric data. That means all the data frame columns must be either integers or decimals (called “Doubles” in Spark). Therefore, the .cast() method is used to convert all the numeric columns from our loan data frame (for instance, see Appendix 1—lines 14–17). The other attributes are treated similarly according to Table 14.3. Additionally, the emp_length column was converted into numeric type (see Appendix 1—lines 18–20), and map multiple levels of the verification_status attribute into a one-factor level (see Appendix 1—line 21). Finally, the target vector default_loan was created from the loan_status feature by classifying the data into two values: users with poor credit (default) including default, charged off, late (31– 120 days), and late (16–30 days) and users with good credit (not default) including fully paid, current, and in grace period (see Table 14.4 and Appendix 1—line 22).
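A hedged sketch of this casting and target-construction step is shown below (the paper's complete code is in Appendix 1); the column subset, the regular expressions, and the exact status labels are illustrative assumptions of this sketch, continuing with the loan_df data frame from above.

```python
from pyspark.sql import functions as F

# Cast string columns to numeric types ("double" plays the role of decimals in Spark).
numeric_cols = ["loan_amnt", "funded_amnt", "installment", "annual_inc", "dti"]
for c in numeric_cols:
    loan_df = loan_df.withColumn(c, F.col(c).cast("double"))

# emp_length: strip non-numeric characters ("10+ years", "< 1 year", "n/a") and cast (simplified).
loan_df = loan_df.withColumn(
    "emp_length", F.regexp_replace(F.col("emp_length"), r"[^0-9]", "").cast("double"))

# int_rate may come as a percentage string such as "13.56%".
loan_df = loan_df.withColumn(
    "int_rate", F.regexp_replace("int_rate", "%", "").cast("double"))

# Binary target: poor-credit statuses map to 1/TRUE, the rest to 0/FALSE (Table 14.4);
# the label spellings below are assumed to match the raw file.
bad_status = ["Default", "Charged Off", "Late (31-120 days)", "Late (16-30 days)"]
loan_df = loan_df.withColumn(
    "default_loan", F.when(F.col("loan_status").isin(bad_status), 1).otherwise(0))
```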


Table 14.3 Attribute data format modification (original format → current format) and treatment description

annual_inc: String → Integer
credit_length_in_years: String → Integer; (issue_year - earliest_year)
delinq_2yrs: String → Integer
dti: String → Integer
earliest_year: String → Double; substring(loan_df.earliest_cr_line, 5, 4)
emp_length: String → Float; delete the characters " () [] * + [a-z A-Z] . , * | (n/a), delete values lower than 1, and replace 10+ with 10
funded_amnt: String → Float
il_util: String → Float
inq_last_6mths: String → Integer
instalment: String → Integer; [loan_amnt + sum(remain(t) × int_rate(t); t from 1 to term)]/term
int_rate: String → Float; in percentage (%)
issue_year: String → Double; substring(loan_df.issue_d, 5, 4)
loan_amnt: String → Integer
mo_sin_old_il_acct: String → Float
mo_sin_old_rev_tl_op: String → Float
mths_since_rcnt_il: String → Float
mths_since_recent_bc: String → Float
mths_since_recent_inq: String → Float
num_rev_tl_bal_gt_0: String → Float
open_acc: String → Integer
out_prncp: String → Integer
percent_bc_gt_75: String → Float
pub_rec: String → Integer
remain: String → Integer; (loan_amnt - total_pymnt)
revol_bal: String → Integer
revol_util: String → Integer; in percentage (%)
tot_hi_cred_lim: String → Float
total_acc: String → Integer
total_bc_limit: String → Float
total_il_high_credit_limit: String → Float
total_pymnt: String → Float

14.2.4 Creating New Measures Three new measures were created to increase the model’s accuracy and decrease the data dimension by removing less critical attributes according to the Extra-


Table 14.4 Lending status as a binary target label vector

                              Loan_status                                                     Default_loan
  Poor credit (default)       Default, charged off, late (31–120 days), late (16–30 days)    true/1
  Good credit (not default)   Fully paid, current, in grace period                            false/0

TreesClassifier approach. For this purpose, we must know that the Spark data frame is immutable. It means it cannot be changed, and columns cannot be updated in place. If we want to add a new column in a data frame, we must make a new one. To overwrite the original data frame, we must reassign the returned data frame using the .withColumn command. The first measure refers to the length of credit in years to know how much each person returned to the bank from the total loan amount. Therefore, we need to make a new column by subtracting the “loan payment” from the “total loan amount” (see Appendix 1—line 23). The second measure is the total amount of money earned or lost per loan to show how much of the total amount of the loan should be repaid to the bank by each person (see Appendix 1—line 24). Finally, the third measure is the total loan. Customers in this database could have multiple loans, and it is necessary to aggregate the loan amounts based on the member IDs of the customers. Then according to the Basel Accords and routine of the banking risk management, the maximum and minimum amounts could be reported and reviewed by risk managers to be checked for concentration risk in the risk appetite statement of the financial institutions (see Appendix 1—lines 25–29).
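The three measures can be sketched in PySpark as follows; issue_year, earliest_year, and member_id are assumed to exist after the treatment of Table 14.3, and the grouping key and column names are illustrative assumptions of this sketch.

```python
from pyspark.sql import functions as F

# Credit length in years, from the issue year and the earliest credit line year.
loan_df = loan_df.withColumn("credit_length_in_years",
                             F.col("issue_year") - F.col("earliest_year"))

# Amount still to be repaid on the loan (total loan amount minus payments received).
loan_df = loan_df.withColumn("remain", F.col("loan_amnt") - F.col("total_pymnt"))

# Total exposure per customer: aggregate loan amounts over the member identifier.
exposure = (loan_df.groupBy("member_id")
                   .agg(F.sum("loan_amnt").alias("total_loan"),
                        F.count("*").alias("n_loans")))
exposure.agg(F.max("total_loan"), F.min("total_loan")).show()   # concentration-risk check
```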

14.2.5 Missing Values Imputation and Outliers Treatment The primary purpose of this section is to make a decision for the imputation of the missing values and deal with outliers. For this large-scale dataset, it is reasonable to have NULL values, and handling the missing values in Spark is possible with three options: keep, replace, and remove. With missing data in our dataset, we would not be able to use the data for modeling in Spark if we have empty or N/A in the dataset. It would cause training errors. Therefore, we must impute or remove the missing data, and we could not keep them for the modeling step. This PySpark code uses the fillna() command to replace the missing values with an average for continuous variables, the median for discrete ordinal ones, and mode (the highest number of occurrences) for nominal features. Additionally, the variables with more than half of the data in a sample as null were discarded (see examples in Appendix 1—lines 30–34).
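A sketch of these imputation rules in PySpark is shown below; the chosen columns are only examples, and the paper's actual assignments are in Appendix 1.

```python
from pyspark.sql import functions as F

# Drop columns where more than half of the values are null.
n = loan_df.count()
null_counts = loan_df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in loan_df.columns]).collect()[0]
keep = [c for c in loan_df.columns if null_counts[c] <= 0.5 * n]
loan_df = loan_df.select(keep)

# Mean for a continuous attribute, median for a discrete ordinal one (examples only).
mean_dti = loan_df.agg(F.avg("dti")).first()[0]
median_inq = loan_df.approxQuantile("inq_last_6mths", [0.5], 0.0)[0]
loan_df = loan_df.fillna({"dti": mean_dti, "inq_last_6mths": median_inq})

# Mode (most frequent category) for a nominal attribute.
mode_home = (loan_df.groupBy("home_ownership").count()
                    .orderBy(F.desc("count")).first()["home_ownership"])
loan_df = loan_df.fillna({"home_ownership": mode_home})
```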


The processing of outliers in this paper follows these principles:

1. We need to consider the reasonable data range of each attribute and delete the sample data with outliers. This paper uses simple subsetting to index the rows with outliers, removes the outliers with an index equal to TRUE, and checks again whether the outliers are removed according to the criteria (see examples in Appendix 1—lines 35–36).
2. Then, this paper uses cross tables to find possible errors. Cross tables for paired attributes with min and max aggregate functions (aggfunc = "min" and aggfunc = "max") can show possible errors which exceed the minimum or maximum of the attributes (see example in Appendix 1—line 37).
3. Finally, box plots with the interquartile rule are used to measure the spread and variability in our dataset. According to this rule, data points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are viewed as being too far from the central values.
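A minimal PySpark sketch of the interquartile rule, continuing with the loan_df data frame and using annual_inc as an illustrative attribute, is given below; the simple counts-based cross table is a simplification of the min/max aggregations described above.

```python
from pyspark.sql import functions as F

# Interquartile rule for one continuous attribute (annual_inc as an example).
q1, q3 = loan_df.approxQuantile("annual_inc", [0.25, 0.75], 0.0)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag the rows outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], inspect them, then drop them.
outliers = loan_df.filter((F.col("annual_inc") < low) | (F.col("annual_inc") > high))
print("flagged rows:", outliers.count())
loan_df = loan_df.filter(F.col("annual_inc").between(low, high))

# Cross table (counts) of two categorical attributes as a quick consistency check.
loan_df.crosstab("grade", "home_ownership").show()
```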

14.2.6 One-Hot Code and Dummy Variables

This paper discretizes the continuous variables with the chi-square binning method and standardizes the discrete variables by transforming them into dummy variables. In machine learning, the standard encoding method is one-hot encoding. For encoding class variables, we have to import the one-hot encoder class from the machine learning library of Spark and create an instance of the one-hot encoder to apply to the discrete features of the dataset (see examples in Appendix 1—lines 38–42). Additionally, this paper considers the predictive power of discrete variables by looking at the information value (IV) to understand the possible transformations in categorical variables and to create multiple categories with similar IVs. The IV contribution of each class of a categorical variable is the weight of evidence (WoE) times the difference between the proportion of all good loans in the class and the proportion of all bad loans in the class. The WoE itself is the logarithm of the ratio between the class's share of all good loans and its share of all bad loans. The calculation formulas of WoE and IV are shown in Eqs. (14.1), (14.2), and (14.3):

WoE_i = ln(g_i / g_T) − ln(b_i / b_T)                                  (14.1)

IV_i = WoE_i × [(g_i / g_T) − (b_i / b_T)]                             (14.2)

where g_i / b_i represents the number of good/bad loans in the grouping and g_T / b_T denotes the total number of good/bad loans in all data. Equation (14.3) gives the IV of the whole variable, which is the sum of the corresponding IVs of each group:

IV = Σ_i IV_i                                                          (14.3)
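As an illustration of Eqs. (14.1)–(14.3), a minimal sketch of a WoE/IV computation with Spark aggregations is given below. The helper name information_value is hypothetical, the label values follow Table 14.4 (default_loan equal to "true" for bad loans), and classes with zero good or bad loans are simply dropped from the sum rather than smoothed.

from pyspark.sql import functions as F

def information_value(df, feature, label="default_loan", bad="true"):
    # Good/bad counts per class of the categorical feature.
    grouped = (df.groupBy(feature)
                 .agg(F.sum(F.when(F.col(label) == bad, 1).otherwise(0)).alias("bad"),
                      F.sum(F.when(F.col(label) != bad, 1).otherwise(0)).alias("good")))
    totals = grouped.agg(F.sum("good").alias("gT"), F.sum("bad").alias("bT")).first()
    # Eq. (14.1) and Eq. (14.2); classes with zero good or bad loans give null WoE
    # and are ignored by the final sum (a production version would smooth the counts).
    woe = (grouped
           .withColumn("woe", F.log(F.col("good") / totals["gT"])
                              - F.log(F.col("bad") / totals["bT"]))
           .withColumn("iv", F.col("woe") * (F.col("good") / totals["gT"]
                                             - F.col("bad") / totals["bT"])))
    return woe, woe.agg(F.sum("iv")).first()[0]          # Eq. (14.3)

woe_table, iv_grade = information_value(loan_df, "grade")
woe_table.show()
print("IV(grade) =", iv_grade)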


Table 14.5 Criteria for IVs

Prediction power    IV
Useless             <0.02
Weak                0.02–0.1
Medium              0.1–0.3
Strong              0.3–0.5
Suspicious          >0.5

Table 14.5 shows the criteria for excluding some variables or for grouping certain categories. For example, the variable "issue_d" shows similar IVs for its first four categories (Aug 18, Dec 18, Oct 18, and Nov 18) and for the following five categories (Sep 18, Dec 15, Jun 18, Jul 18, and May 18); therefore, they could be grouped into two new categories. Furthermore, the IV results show strong prediction power for most variables (e.g., term, grade, home_ownership, verification_status), so none of them were transformed.

14.2.7 Final Dataset

When the data treatment is completed, there are 1,043,423 customers in rows and 35 features in the dataset, comprising 4 categorical and 31 numeric attributes in addition to 1 binary target variable with the two values default and not default. After finalizing the data treatment, the Spark cache is used to optimize the final dataset for iterative and interactive Spark applications and to improve job performance (see Appendix 1—line 43). The dataset contains 331,528 fully paid loans (see Table 14.1 and Appendix 1—lines 44–45). Because default and non-default loans are imbalanced in the dataset, with good customers much fewer than bad customers, predictions may be biased in some modeling approaches such as logistic regression. For these models, the paper applies a paired-sample technique to the training base by randomly selecting a number of bad customers equal to the total number of good clients. This undersampling, or an equivalent approach such as the synthetic minority oversampling technique (SMOTE), is essential to increase the efficiency of models that suffer from imbalanced datasets [10].

Table 14.6 shows that the dataset was randomly divided into two groups, 65% for model training (678,548 observations) and the remaining 35% for the test set (364,875 observations), to apply the different algorithms. For dividing the dataset into training and test sets, the function randomSplit() was used in PySpark, equivalent to train_test_split() in scikit-learn (see Appendix 1—line 46). The training dataset for developing the model would normally cover 12 months, so that a full annual cycle captures the seasonality of each month. Additionally, some recent months could be reserved just for testing the optimal model on an unseen dataset.

Table 14.6 Loan status in training and test datasets

Loan status    Training set    Test set
Default        607,981         326,820
Not default    70,567          38,055
Total          678,548         364,875
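To illustrate the undersampling described in this subsection, the sketch below balances the training base produced by randomSplit() (Appendix 1—line 46) by sampling the majority class down to roughly the size of the minority class; the seed is arbitrary and the snippet is not the paper's exact code.

# Class counts in the training base, then downsample the majority class.
counts = {row["default_loan"]: row["count"]
          for row in train.groupBy("default_loan").count().collect()}
minority = min(counts, key=counts.get)
majority = max(counts, key=counts.get)

balanced_train = train.sampleBy("default_loan",
                                fractions={minority: 1.0,
                                           majority: counts[minority] / counts[majority]},
                                seed=42)
balanced_train.groupBy("default_loan").count().show()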

14.3 Method and Models

Financial institutions must predict customers' credit risk over time with minimum model risk. Recently, machine learning models have been applied to Big Data to determine whether a person is eligible to receive a loan. However, pre-processing for data quality and finding the best hyperparameters during model development are necessary to overcome overfitting and the instability of model accuracy over time.

14.3.1 Method

According to the dataset, we have the credit history of the customers, and this study tries to predict their loan status by applying statistical learning and machine learning algorithms. This helps loan providers estimate the probability of default in order to decide whether or not a loan should be granted. For this purpose, this paper first makes a preliminary statistical analysis of the credit dataset. Then, different models were developed to predict the probability of default: logistic regression, decision tree, random forest, neural network, and support vector machine. Finally, the results (the predictive power of the models) were evaluated by metrics such as the area under the ROC curve (AUC) and the mean F1 score.

Receiver operating characteristic (ROC) curves show the statistical performance of the models. In the ROC chart, the horizontal axis represents 1 − specificity (the false positive rate), and the vertical axis shows the sensitivity. The greater the area between the curve and the baseline, the better the performance in default prediction. After investigating the characteristics of the new credit score model, the research employs the area under the ROC curve to compare the classification accuracy and to evaluate how well this credit scoring model performs.

The F1 score, which is commonly used in information retrieval, measures the model's accuracy using precision (p) and recall (r). Precision is the ratio of true positives (tp) to all predicted positives (tp + fp), and recall is the ratio of true positives to all actual positives (tp + fn). The F1 score is given by

F1 = 2 · (p · r) / (p + r),   where p = tp / (tp + fp) and r = tp / (tp + fn).

The F1 metric weights recall and precision equally, and a good retrieval algorithm will simultaneously maximize precision and recall. Thus, moderately good performance on both will be favored over excellent performance on one and poor performance on the other. For creating a ROC plot in PySpark, we need a library that is not installed by default in the Databricks Runtime for Machine Learning. First, we have to install plotnine, a Python implementation of the ggplot2 grammar of graphics, together with its dependencies (see Appendix 1—lines 47–48). Second, the mlflow package from PyPI can be installed on the cluster to track model development and to package code into reproducible runs.
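A minimal sketch of such a ROC plot is shown below; it assumes the roc_pd data frame with FPR and TPR columns produced in Appendix 1 (lines 89–93), and the final mlflow call is optional.

from plotnine import ggplot, aes, geom_line, geom_abline, labs
import mlflow

p = (ggplot(roc_pd, aes(x="FPR", y="TPR"))
     + geom_line(color="blue")
     + geom_abline(intercept=0, slope=1, linetype="dashed")
     + labs(title="ROC curve of the logistic regression",
            x="False positive rate (1 - specificity)",
            y="True positive rate (sensitivity)"))

fig = p.draw()                            # returns a matplotlib figure
mlflow.log_figure(fig, "roc_curve.png")   # attach the figure to the MLflow run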

14.3.2 Model Building

This section builds and evaluates supervised models in PySpark for personal credit rating evaluation. This paper applies the obtained dataset to logistic regression, decision tree, random forest, neural network, and support vector machine classifiers.

The model-building phase started with three penalized regression approaches: LASSO, ridge, and elastic net. They eliminate variables that contribute to overfitting without compromising out-of-sample accuracy. They apply L1 and L2 penalties during training and have hyperparameters (maxIter, elasticNetParam, and regParam) that control how much weight is given to each of the L1 and L2 penalties and hence which model is fitted. The elasticNetParam is 0 for ridge regression, 0.99 for LASSO, and 0.5 for elastic net regression (see Appendix 1—lines 64–66). The results from this notebook in Databricks were tracked for storing the results and comparing the accuracy of the different models (see Appendix 1—lines 67–80).

Then the logistic regression model was built (see Appendix 1—line 81). A pipeline was defined, which includes standardizing the data, imputing missing values, and encoding the categorical columns (see Appendix 1—lines 82–83). Setting up MLflow for model tracking and for reproducing the input parameters is useful to log the model and review it later (see Appendix 1—lines 84–88). Finally, the accuracy measures were calculated by logging the ROC curve (see Appendix 1—lines 89–93), setting the maximum-F1 threshold for predicting loan default with a balance between true positives and false positives (see Appendix 1—lines 94–99), scoring the customers (see Appendix 1—lines 100–114), and logging the results (see Appendix 1—lines 115–116).

A ten-fold cross-validation examines the between-sample variation of default prediction. This paper divides the available data into ten disjoint subsets, trains the models on nine subsets, and evaluates the model selection criterion on the tenth subset. This procedure is then repeated for all combinations of subsets using the Python API of Apache Spark (see Appendix 1—lines 117–118).
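The sketch below illustrates how such a hyperparameter search and cross-validation could be wired together with Spark's ParamGridBuilder and CrossValidator; it assumes the pipeline and lr objects from Appendix 1 (lines 81–82), and the grid values and the number of folds are illustrative rather than the paper's exact settings (the appendix itself uses numFolds = 5).

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Grid over the penalty mixing and strength discussed above
# (0 = ridge, 0.99 = LASSO-like, 0.5 = elastic net).
params = (ParamGridBuilder()
          .addGrid(lr.elasticNetParam, [0.0, 0.5, 0.99])
          .addGrid(lr.regParam, [0.1, 0.3])
          .build())

cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=params,
                    evaluator=BinaryClassificationEvaluator(metricName="areaUnderROC"),
                    numFolds=10)
cv_model = cv.fit(train)
print("Best average AUC over the folds:", max(cv_model.avgMetrics))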


model’s score with the test data (see Appendix 1—lines 133–134). This final model could predict the amount of money earned or lost per loan (remain.=loan payments - total loan amount) and the outstanding loan balance (see Appendix 1—line 135). As a result, the ridge method represents a better performance than LASSO. The logistic regression in this paper is based on the ridge penalty with elastic net regularization zero and regParam 0.3 as the best hyperparameters. In addition to the logistic regression classifier as an industry standard for building credit scoring models, this paper uses other binary classifiers such as random forests and linear support vector machines for the empirical analysis. Although they are more complex and powerful than logistic regression in the application, the outputs explainability of these models could not be guaranteed. The codes are almost the same for the other models, such as random forest and linear support vector machine, with the possibility to use Scala for more complicated models and to use some features that are not available in PySpark.

14.4 Results and Credit Scorecard Conversion

The results show that A-grade loans have the lowest interest rate because of the minimum evaluated risk for these customers. A significant share of loans is allocated to grade A and B customers, with the minimum interest rate and minimum risk of default. The share decreases for grades D, E, F, and G because banks typically apply criteria to reject high-risk applications. The optimal cutoff for logistic regression is considered to be 0.167. This study finds a high false negative rate in every approach; this rate represents an unexpected loss for the bank. A high false positive rate, in turn, also appears as a loss on the bank's balance sheet, since it prevents new business from growing. These two rates are summarized in the F1 score, which reflects the trade-off between false positives and false negatives. As a trade-off between model sensitivity and specificity, the AUC in Table 14.7 shows almost the same performance for logistic regression, decision tree, and random forest. However, the logistic regression model obtained a higher F1 score (0.815) than the decision tree and random forest, with F1 scores of 0.766 and 0.577, respectively (see Appendix 2). Overall, the logistic regression performs the best, and the support vector machine performs the worst, with about three times the training time of the other algorithms.

Table 14.7 Evaluation results of the algorithms on personal credit data

Algorithm                 Confusion matrix [[11, 10] [01, 00]]   AUC value   F1 score
Logistic regression       [[9%, 2%] [2%, 87%]]                   0.909       0.815
Decision tree             [[7%, 0.5%] [3.5%, 89%]]               0.901       0.766
Random forest             [[10%, 13%] [2%, 75%]]                 0.884       0.577
Neural network            [[2%, 2.5%] [9%, 86.5%]]               0.580       0.258
Support vector machine    [[1%, 0%] [10%, 89%]]                  0.530       0.113

14.5 Conclusion

This study described machine learning approaches to assess credit applicants' profiles and to continue credit scoring based on the nontraditional dataset of the Lending Club Company. Regarding classification accuracy, the results showed that logistic regression is more accurate, informative, and conservative for personal credit evaluation. Furthermore, the model predictions could be used to score new and old clients in an accurate scorecard complementary to traditional credit evaluation methods.

For further study, the following items could be investigated. For data treatment, one might consider the following new attributes for a possible increase in the model's accuracy:
1. Effort rate, obtained by dividing the installment by the annual income.
2. The ratio between the number of open accounts and the total number of accounts.
3. The percentage of the loan that is still left to be paid. It would be similar to the "remain" variable, but divided by the total amount.
4. A continuous variable representing the duration of the client's credit line. It could be built from "earliest_cr_line" by subtracting the earliest_cr_line values from the current time.
5. Decomposing issue_d into months and years to give better insight into any seasonal events.

For model development, one might consider other machine learning approaches; the author did a preliminary study on gradient-boosted trees, and the AUC increased to 0.966 with a longer run time compared with the other models in this paper. For the hyperparameter tuning stage and finding the hyperparameters that enable a higher F1 score, one could use GridSearchCV. Instead of the undersampling approach used in this paper for the imbalanced dataset, one might use StratifiedKFold in the cross-validation stage.

Acknowledgments The author of this paper would like to thank José L. Cervera-Ferri (CEO of DevStat) for his invitation to CARMA 2018 (International Conference on Advanced Research Methods and Analytics) at the Polytechnic University of Valencia, which motivated this research.

Data and Code Availability Data and code used to support the findings of this paper are available from the author upon request, or on his GitHub or Kaggle page.


Appendix 1

1. file_location = "/FileStore/tables/loan.CSV"
2. file_location = "/FileStore/tables/loan_complete.CSV"
3. file_type = "CSV"
4. infer_schema = "false"
5. first_row_is_header = "true"
6. delimiter = ","
7. loan_df = spark.read.format(file_type).option("inferSchema", infer_schema).option("header", first_row_is_header).option("sep", delimiter).load(file_location)
8. print(" >>>>>>> " + str(loan_df.count()) + " loans opened in this data_set!")
9. loan_df.write.parquet("AA_DFW_ALL.parquet", mode="overwrite")
10. print(sc)
11. print(sc.version)
12. spark.catalog.listTables()
13. display(loan_df)
14. loan_df = loan_df.withColumn("loan_amnt", loan_df.loan_amnt.cast("integer"))\
15.     .withColumn("int_rate", regexp_replace("int_rate", "%", "").cast("float"))\
16.     .withColumn("revol_util", regexp_replace("revol_util", "%", "").cast("float"))\
17.     .withColumn("issue_year", substring(loan_df.issue_d, 5, 4).cast("double"))
18. loan_df = loan_df.withColumn("emp_length", trim(regexp_replace(loan_df.emp_length, "([ ]*+[a-zA-Z].*)|(n/a)", "")))
19. loan_df = loan_df.withColumn("emp_length", trim(regexp_replace(loan_df.emp_length, "< 1", "0")))
20. loan_df = loan_df.withColumn("emp_length", trim(regexp_replace(loan_df.emp_length, "10\\+", "10")).cast("float"))
21. loan_df = loan_df.withColumn("verification_status", trim(regexp_replace(loan_df.verification_status, "Source Verified", "Verified")))
22. loan_df = loan_df.filter(loan_df.loan_status.isin(["Default", "Charged Off", "Late (31-120 days)", "Late (16-30 days)", "Fully Paid", "Current"])).withColumn("default_loan", (~(loan_df.loan_status.isin(["Fully Paid", "In Grace Period", "Current"]))).cast("string"))
23. loan_df = loan_df.withColumn("credit_length_in_years", (loan_df.issue_year - loan_df.earliest_year))
24. loan_df = loan_df.withColumn("remain", round(loan_df.loan_amnt - loan_df.total_pymnt, 2))
25. customer_df = loan_df.groupBy("member_id").agg(f.sum("loan_amnt").alias("sumLoan"))
26. loan_max_df = customer_df.agg({"sumLoan": "max"}).collect()[0]
27. customer_max_loan = loan_max_df["max(sumLoan)"]


28. print(customer_df.agg({"sumLoan": "max"}).collect()[0], customer_df.agg({"sumLoan": "min"}).collect()[0])
29. print(customer_df.filter("sumLoan = " + str(customer_max_loan)).collect())
30. pandas_df = loan_intrate_income.toPandas()
31. null_columns = pandas_df.columns[pandas_df.isnull().any()]
32. pandas_df[null_columns].isnull().sum()
33. pandas_df.int_rate.fillna(pandas_df.int_rate.median())
34. loan_intrate_income = loan_intrate_income.dropna()
35. indices = pandas_df[pandas_df["income"] >= 1500000].index
36. pandas_df.drop(indices, inplace=True)
37. pd.crosstab(pandas_df.grade, pandas_df.default_loan, values=pandas_df.annual_inc, aggfunc="min").round(2)
38. from pyspark.ml.feature import OneHotEncoderEstimator
39. onehot = OneHotEncoderEstimator(inputCols=["grade"], outputCols=["grade_dummy"])
40. model_df3 = model_df.select("int_rate", "annual_inc", "loan_amnt", "label", "grade")
41. onehot = onehot.fit(model_df3)
42. str_to_dummy_df_onehot = onehot.transform(model_df3)
43. loan_df.cache()
44. loan_df.filter(col("loan_status") == "Default").count()
45. loan_df.filter(col("loan_status") != "Default").count()
46. train, valid = dataLogReg.randomSplit([.65, .35])
47. %sh
48. /databricks/python/bin/pip install plotnine matplotlib==2.2.2
49. import sklearn.metrics as metrics
50. import pandas as pd
51. from plotnine import *
52. from plotnine.data import meat
53. from mizani.breaks import date_breaks
54. from mizani.formatters import date_format
55. from pyspark.ml import Pipeline
56. from pyspark.ml.feature import StandardScaler, StringIndexer, OneHotEncoder, Imputer, VectorAssembler
57. from pyspark.ml.classification import LogisticRegression
58. from pyspark.ml.evaluation import BinaryClassificationEvaluator
59. from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
60. import mlflow
61. import mlflow.spark
62. from pyspark.mllib.evaluation import BinaryClassificationMetrics
63. from pyspark.ml.linalg import Vectors
64. maxIter = 10
65. elasticNetParam = 0
66. regParam = 0.3
67. with mlflow.start_run():


68.     labelCol = "default_loan"
69.     indexers = list(map(lambda c: StringIndexer(inputCol=c, outputCol=c+"_idx", handleInvalid="keep"), categoricals))
70.     ohes = list(map(lambda c: OneHotEncoder(inputCol=c+"_idx", outputCol=c+"_class"), categoricals))
71.     imputers = Imputer(inputCols=numerics, outputCols=numerics)
72.     featureCols = list(map(lambda c: c+"_class", categoricals)) + numerics
73.     model_matrix_stages = indexers + ohes + \
74.         [imputers] + \
75.         [VectorAssembler(inputCols=featureCols, outputCol="features"), \
76.          StringIndexer(inputCol=labelCol, outputCol="label")]
77.     scaler = StandardScaler(inputCol="features",
78.                             outputCol="scaledFeatures",
79.                             withStd=True,
80.                             withMean=True)
81.     lr = LogisticRegression(maxIter=maxIter, elasticNetParam=elasticNetParam, regParam=regParam, featuresCol="scaledFeatures")
82.     pipeline = Pipeline(stages=model_matrix_stages + [scaler] + [lr])
83.     glm_model = pipeline.fit(train)
84.     mlflow.log_param("algorithm", "SparkML_GLM_regression")  # put a name for the algorithm
85.     mlflow.log_param("regParam", regParam)
86.     mlflow.log_param("maxIter", maxIter)
87.     mlflow.log_param("elasticNetParam", elasticNetParam)
88.     mlflow.spark.log_model(glm_model, "glm_model")  # log the model
89.     lr_summary = glm_model.stages[len(glm_model.stages)-1].summary
90.     roc_pd = lr_summary.roc.toPandas()
91.     fpr = roc_pd["FPR"]
92.     tpr = roc_pd["TPR"]
93.     roc_auc = metrics.auc(roc_pd["FPR"], roc_pd["TPR"])
94.     fMeasure = lr_summary.fMeasureByThreshold
95.     maxFMeasure = fMeasure.groupBy().max("F-Measure").select("max(F-Measure)").head()
96.     maxFMeasure = maxFMeasure["max(F-Measure)"]
97.     fMeasure = fMeasure.toPandas()
98.     bestThreshold = float(fMeasure[fMeasure["F-Measure"] == maxFMeasure]["threshold"])
99.     lr.setThreshold(bestThreshold)
100.    def extract(row):
101.        return (row.remain,) + tuple(row.probability.toArray().tolist()) + (row.label,) + (row.prediction,)
102.    def score(model, data):
103.        pred = model.transform(data).select("remain", "probability", "label", "prediction")


104.        pred = pred.rdd.map(extract).toDF(["remain", "p0", "p1", "label", "prediction"])
105.        return pred
106.    def auc(pred):
107.        metric = BinaryClassificationMetrics(pred.select("p1", "label").rdd)
108.        return metric.areaUnderROC
109.    glm_train = score(glm_model, train)
110.    glm_valid = score(glm_model, valid)
111.    glm_train.registerTempTable("glm_train")
112.    glm_valid.registerTempTable("glm_valid")
113.    print("GLM Training AUC: " + str(auc(glm_train)))
114.    print("GLM Validation AUC: " + str(auc(glm_valid)))
115.    mlflow.log_metric("train_auc", auc(glm_train))
116.    mlflow.log_metric("valid_auc", auc(glm_valid))
117. cv = CrossValidator(estimator=pipeline_rf, estimatorParamMaps=params, evaluator=BinaryClassificationEvaluator(), numFolds=5)
118. rf_model = cv.fit(train)
119. import mlflow
120. import mlflow.spark
121. from mlflow.tracking import MlflowClient
122. from pyspark.sql.functions import *
123. from pyspark.ml import PipelineModel
124. client = MlflowClient()
125. runs = client.search_runs(experiment_ids=["#number#"], filter_string="metrics.valid_auc >= .65")
126. run_id1 = runs[0].info.run_uuid
127. client.get_run(run_id1).data.metrics
128. runs = client.search_runs(experiment_ids=["#number#"], order_by=["metrics.valid_auc DESC"], max_results=1)
129. run_id = runs[0].info.run_uuid
130. client.get_run(run_id).data.metrics
131. runs = mlflow.search_runs(experiment_ids=["1636634778227294"], order_by=["metrics.valid_auc DESC"], max_results=1)
132. runs.loc[0]
133. score_df = spark.table("final_scoring_table")
134. predictions = model1.transform(score_df)
135. display(predictions.groupBy("default_loan", "prediction").agg((sum(col("remain"))).alias("sum_net")))

Appendix 2

The first branch of the decision tree shows that if the value of out_prncp is greater than 0.01, the predicted probability of default is 0.03%. When an applicant does not meet this condition, the proposal is checked against the total payment, with a limit of 5000. Finally, the branches show how the application goes through the confirmation process (Fig. 14.2).

Fig. 14.2 A decision tree sample with selected attributes

References

1. Onay, C., Öztürk, E.: A review of credit scoring research in the age of Big Data. J. Financ. Regul. Compliance 26(3), 382–405 (2018)
2. Ashofteh, A.: Mining Big Data in statistical systems of the monetary financial institutions (MFIs). In: International Conference on Advanced Research Methods and Analytics (CARMA) (2018). https://doi.org/10.4995/carma2018.2018.8570
3. Óskarsdóttir, M., Bravo, C., Sarraute, C., Vanthienen, J., Baesens, B.: The value of big data for credit scoring: enhancing financial inclusion using mobile phone data and social network analytics. Appl. Soft Comput. J. 74, 26–39 (2019)
4. Pedro, J.S., Proserpio, D., Oliver, N.: MobiScore: towards universal credit scoring from mobile phone data. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9146, pp. 195–207 (2015)
5. Björkegren, D., Grissen, D.: Behavior revealed in mobile phone usage predicts credit repayment. arXiv preprint (2017)
6. Henrique, B.M., Sobreiro, V.A., Kimura, H.: Literature review: machine learning techniques applied to financial market prediction. Expert Syst. Appl. 124, 226–251 (2019)
7. dos Reis, G., Pfeuffer, M., Smith, G.: Capturing model risk and rating momentum in the estimation of probabilities of default and credit rating migrations. Quant. Financ. 20(7), 1069–1083 (2018)


8. Zhang, H., Liu, Q.: Online learning method for drift and imbalance problem in client credit assessment. Symmetry (Basel) 11(7), 890 (2019)
9. Ashofteh, A., Bravo, J.M.: A non-parametric-based computationally efficient approach for credit scoring. In: Atas da Conferência da Associação Portuguesa de Sistemas de Informação (2019)
10. Gicić, A., Subasi, A.: Credit scoring for a microcredit data set using the synthetic minority oversampling technique and ensemble classifiers. Expert Syst. 36(2), e12363 (2019)