Data Analysis and Applications 4

Big Data, Artificial Intelligence and Data Analysis Set coordinated by Jacques Janssen

Volume 6

Data Analysis and Applications 4 Financial Data Analysis and Methods

Edited by

Andreas Makrides Alex Karagrigoriou Christos H. Skiadas

First published 2020 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address: ISTE Ltd 27-37 St George’s Road London SW19 4EU UK

John Wiley & Sons, Inc. 111 River Street Hoboken, NJ 07030 USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd 2020

The rights of Andreas Makrides, Alex Karagrigoriou and Christos H. Skiadas to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2020930629

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-624-1

Contents

Preface

Part 1. Financial Data Analysis and Methods

Chapter 1. Forecasting Methods in Extreme Scenarios and Advanced Data Analytics for Improved Risk Estimation
George-Jason Siouris, Despoina Skilogianni and Alex Karagrigoriou
  1.1. Introduction
  1.2. The low price effect and correction
    1.2.1. Percentage value at risk and low price correction
    1.2.2. Expected Percentage Shortfall (EPS) and Low Price Correction
    1.2.3. Adjusted Evaluation Measures
    1.2.4. Backtesting and Method's Advantages
  1.3. Application
    1.3.1. The Alpha warrant
    1.3.2. The ARTX stock
  1.4. Conclusion
  1.5. Acknowledgements
  1.6. References

Chapter 2. Credit Portfolio Risk Evaluation with Non-Gaussian One-factor Merton Models and its Application to CDO Pricing
Takuya Fujii and Takayuki Shiohama
  2.1. Introduction
  2.2. Model and assumptions
  2.3. Asymptotic evaluation of credit risk measures
  2.4. Data analysis
  2.5. Conclusion
  2.6. Acknowledgements
  2.7. References

Chapter 3. Towards an Improved Credit Scoring System with Alternative Data: the Greek Case
Panagiota Giannouli and Christos E. Kountzakis
  3.1. Introduction
  3.2. Literature review: stages of credit scoring
  3.3. Performance definition
  3.4. Data description
    3.4.1. Alternative data in credit scoring
    3.4.2. Credit scoring data set
    3.4.3. Data pre-processing
  3.5. Models' comparison
  3.6. Out-of-time and out-of-sample validation
  3.7. Conclusion
  3.8. References

Chapter 4. EM Algorithm for Estimating the Parameters of the Multivariate Stable Distribution
Leonidas Sakalauskas and Ingrida Vaiciulyte
  4.1. Introduction
  4.2. Estimators of maximum likelihood approach
  4.3. Quadrature formulas
  4.4. Computer modeling
  4.5. Conclusion
  4.6. References

Part 2. Statistics and Stochastic Data Analysis and Methods

Chapter 5. Methods for Assessing Critical States of Complex Systems
Valery Antonov
  5.1. Introduction
  5.2. Heart rate variability
  5.3. Time-series processing methods
  5.4. Conclusion
  5.5. References

Chapter 6. Resampling Procedures for a More Reliable Extremal Index Estimation
Dora Prata Gomes and M. Manuela Neves
  6.1. Introduction and motivation
  6.2. Properties and difficulties of classical estimators
  6.3. Resampling procedures in extremal index estimation
    6.3.1. A simulation study of mean values and mean square error patterns of the estimators
    6.3.2. A choice of δ and k: a heuristic sample path stability criterion
  6.4. Some overall comments
  6.5. Acknowledgements
  6.6. References

Chapter 7. Generalizations of Poisson Process in the Modeling of Random Processes Related to Road Accidents
Franciszek Grabski
  7.1. Introduction
  7.2. Non-homogeneous Poisson process
  7.3. Model of the road accident number in Poland
    7.3.1. Estimation of model parameters
    7.3.2. Anticipation of the accident number
  7.4. Non-homogeneous compound Poisson process
  7.5. Data analysis
  7.6. Anticipation of the accident consequences
  7.7. Conclusion
  7.8. References

Chapter 8. Dependability and Performance Analysis for a Two Unit Multi-state System with Imperfect Switch
Vasilis P. Koutras, Sonia Malefaki and Agapios N. Platis
  8.1. Introduction
  8.2. Description of the system under maintenance and imperfect switch
  8.3. Dependability and performance measures
    8.3.1. Transient phase
    8.3.2. Asymptotic analysis
  8.4. Optimal maintenance policy
    8.4.1. Optimal maintenance policy for maximizing system availability
    8.4.2. Optimal maintenance policy for minimizing total expected operational cost
    8.4.3. Optimal maintenance policy for multi-objective optimization problems
  8.5. Numerical results
    8.5.1. Transient and asymptotic dependability and performance
    8.5.2. Optimal asymptotic maintenance policies implemented in the transient phase
  8.6. Conclusion and future work
  8.7. Appendix
  8.8. References

Chapter 9. Models for Time Series Whose Trend Has Local Maximum and Minimum Values
Norio Watanabe
  9.1. Introduction
  9.2. Models
    9.2.1. Model 1
    9.2.2. Model 2
  9.3. Simulation
  9.4. Estimation of the piecewise linear trend
  9.5. Conclusion
  9.6. References

Chapter 10. How to Model the Covariance Structure in a Spatial Framework: Variogram or Correlation Function?
Giovanni Pistone and Grazia Vicario
  10.1. Introduction
  10.2. Universal Krige setup
  10.3. The variogram matrix
  10.4. Inverse variogram matrix Γ⁻¹
  10.5. Projecting on span(1)⊥
  10.6. Elliptope
  10.7. Conclusion
  10.8. Acknowledgements
  10.9. References

Chapter 11. Comparison of Stochastic Processes
Jesús Enrique García, Ramin Gholizadeh and Verónica Andrea González-López
  11.1. Introduction
  11.2. Preliminaries
  11.3. Application to linguistic data
  11.4. Conclusion
  11.5. References

Part 3. Demographic Methods and Data Analysis

Chapter 12. Conjoint Analysis of Gross Annual Salary Re-evaluation: Evidence from Lombardy ELECTUS Data
Paolo Mariani, Andrea Marletta and Mariangela Zenga
  12.1. Introduction
  12.2. Methodology
    12.2.1. Coefficient of economic valuation
  12.3. Application and results
  12.4. Conclusion
  12.5. References

Chapter 13. Methodology for an Optimum Health Expenditure Allocation
George Matalliotakis
  13.1. Introduction
  13.2. The Greek case
  13.3. The basic table for calculations
  13.4. The health expenditure in hospitals
  13.5. Conclusion
  13.6. References

Chapter 14. Probabilistic Models for Clinical Pathways: The Case of Chronic Patients
Stergiani Spyrou, Anatoli Kazektsidou and Panagiotis Bamidis
  14.1. Introduction
  14.2. Models and clinical practice
  14.3. The Markov models in medical diagnoses
    14.3.1. The case of chronic patients
    14.3.2. Results
  14.4. Conclusion
  14.5. References

Chapter 15. On Clustering Techniques for Multivariate Demographic Health Data
Achilleas Anastasiou, George Mavridoglou, Petros Hatzopoulos and Alex Karagrigoriou
  15.1. Introduction
  15.2. Literature review
  15.3. Classification characteristics
    15.3.1. Distance measures
    15.3.2. Clustering methods
  15.4. Data analysis
    15.4.1. Data
    15.4.2. The analysis
  15.5. Conclusion
  15.6. References

Chapter 16. Tobacco-related Mortality in Greece: The Effect of Malignant Neoplasms, Circulatory and Respiratory Diseases, 1994–2016
Konstantinos N. Zafeiris
  16.1. Introduction
    16.1.1. Smoking-related diseases
  16.2. Data and methods
  16.3. Results
    16.3.1. Life expectancy at birth
    16.3.2. Effects of the diseases of the circulatory system on longevity
    16.3.3. Effects of smoking-related neoplasms on longevity
    16.3.4. Effects of respiratory diseases on longevity
  16.4. Discussion and conclusion
  16.5. References

List of Authors

Index

Preface

Thanks to the important work of the authors and contributors, we have developed this collective volume on "Data Analysis and Applications: Computational, Classification, Financial, Statistical and Stochastic Methods". Data analysis as an area of importance has grown exponentially, especially during the past couple of decades. This can be attributed to a rapidly growing computer industry and the wide applicability of computational techniques, in conjunction with new advances in analytic tools. This being the case, the need for literature that addresses this is self-evident. New publications appear as printed or e-books covering the need for information from all fields of science and engineering, thanks to the wide applicability of data analysis and statistics packages.

The book is a collective work by a number of leading scientists, analysts, engineers, mathematicians and statisticians who have been working at the forefront of data analysis. The chapters included in this collective volume represent a cross-section of current concerns and research interests in the above-mentioned scientific areas. This volume is divided into three parts, with a total of 16 chapters, in a form that provides the reader with both theoretical and applied information on data analysis methods, models and techniques, along with appropriate applications.

Part 1 focuses on Financial Data Analysis and Methods and contains four chapters on "Forecasting Methods in Extreme Scenarios and Advanced Data Analytics for Improved Risk Estimation" by George-Jason Siouris,


Despoina Skilogianni and Alex Karagrigoriou, “Credit Portfolio Risk Evaluation with non-Gaussian One-factor Merton Models and its Application to CDO Pricing” by Takuya Fujii and Takayuki Shiohama, “Towards an Improved Credit Scoring System with Alternative Data: the Greek Case” by Panagiota Giannouli and Christos E. Kountzakis and “EM Algorithm for Estimating the Parameters of the Multivariate Stable Distribution” by Leonidas Sakalauskas and Ingrida Vaiciulyte. In Part 2, the interest lies in Statistics and Stochastic Data Analysis and Methods which includes seven papers on “Methods for Assessing Critical States of Complex Systems” by Valery Antonov, “Resampling Procedures for a More Reliable Extremal Index Estimation” by Dora Prata Gomes and M. Manuela Neves, “Generalizations of Poisson Process in the Modeling of Random Processes Related to Road Accidents” by Franciszek Grabski, “Dependability and Performance Analysis for a Two Unit Multi-State System with Imperfect Switch” by Vasilis P. Koutras, Sonia Malefaki and Agapios N. Platis, “Models for Time Series Whose Trend has Local Maximum and Minimum Values” by Norio Watanabe, “How to Model the Covariance Structure in a Spatial Framework: Variogram or Correlation Function?” by Giovanni Pistone and Grazia Vicario and “Comparison of Stochastic Processes” by Jesús E. García, Ramin Gholizadeh and Verónica Andrea González-López. Finally, in Part 3, the interest is directed towards Demographic Methods and Data Analysis which includes five chapters on “Conjoint Analysis of Gross Annual Salary Re-evaluation: Evidence from Lombardy Electus Data” by Paolo Mariani, Andrea Marletta and Mariangela Zenga, “Methodology for an Optimum Health Expenditure Allocation” by George Matalliotakis, “Probabilistic Models for Clinical Pathways: The Case of Chronic Patients” by Stergiani Spyrou, Anatoli Kazektsidou and Panagiotis Bamidis, “On Clustering Techniques for Multivariate Demographic Health Data” by Achilleas Anastasiou, George Mavridoglou, Petros Hatzopoulos and Alex Karagrigoriou and “Tobacco Related Mortality in Greece: The Effect of Malignant Neoplasms, Circulatory and Respiratory Diseases, 1994–2016” by Konstantinos N. Zafeiris.


We wish to thank all the authors for their insights and excellent contributions to this book. We would like to acknowledge the assistance of all those involved in the reviewing process of the book, without whose support this could not have been successfully completed. Finally, we wish to express our thanks to the secretariat and, of course, the publishers. It was a great pleasure to work with them in bringing this collective volume to life.

Andreas MAKRIDES
Rouen, France

Alex KARAGRIGORIOU
Samos, Greece

Christos H. SKIADAS
Athens, Greece

January 2020

PART 1

Financial Data Analysis and Methods


1 Forecasting Methods in Extreme Scenarios and Advanced Data Analytics for Improved Risk Estimation

Extensive investigation of the statistical properties of financial returns has revealed a discrete nature whenever the low price effect is present. This is rather logical, since every market operates at a specific accuracy. In order for our models to take this discrete nature of returns into consideration, a discretization of the tail density function is applied. As a result of this discretization process, it is now possible to improve the percentage value at risk (PVaR) and expected percentage shortfall (EPS) estimations on which we focus in this work. Finally, in order to evaluate the improvement provided by our proposed methodology, adjusted evaluation measures are presented, capable of evaluating percentile estimations like the PVaR. These adjusted evaluation measures are not limited to evaluating percentiles; they apply in any scenario where the data do not bear the same amount of information and, consequently, do not all carry the same degree of importance, as in the case of risk analysis.

Chapter written by George-Jason SIOURIS, Despoina SKILOGIANNI and Alex KARAGRIGORIOU.

1.1. Introduction

The quantification of risk is an important issue in finance that becomes even more important during periods of financial crises. The estimation of volatility is the main financial characteristic associated with risk analysis and management since, in financial modeling, the view has been consolidated that asset returns are preferred over prices due to their stationarity [MEU 09, SHI 99]. Moreover, the volatility of returns is the one that can be successfully forecasted, and that is essential for risk management [POO 03, AND 01].

The increase in complexity of both the financial system and the nature of financial risk over the last decades results in models with limited reliability. Hence, in extreme economic events, such models become less reliable and, in some instances, fully fail to measure the underlying risk. Consequently, they produce inaccurate risk measurements, which has a great impact on many financial applications, such as asset allocation and portfolio management in general, derivatives pricing, risk management, economic capital and financial stability (based on the Basel III accords).

To make things even worse, stock exchange markets (and, in general, every financial market) operate with a certain accuracy. In most European markets the accuracy is 0.1 cents (0.001 euros), while US markets operate with 1 cent (USD 0.01) accuracy, except for securities priced less than USD 1.00, for which the market accuracy or minimum price variation (MPV) is USD 0.0001. Despite the readjustment for extremely low price assets, the associated fluctuation (variation) is considerable and the corresponding volatility is automatically increased. The phenomenon is magnified considerably in periods of extreme economic events (economic collapses, bankruptcies, depressions, etc.) and, as a result, typical market accuracy fails to handle smoothly assets of extremely low price.

However, any attempt to increase the reliability of the model identification procedure for volatility estimation unavoidably results in even more complex models, which will still be unable to identify and fully capture the governing set of rules of the global economy. The more complicated the model we use for volatility estimation, the larger the number of parameters that needs to be estimated. Hence, in order to have larger data sets, we go further back in the past to collect and use older observations for the analysis, which though may be less representative of reality. Lastly, risk models use only the returns of an asset while ignoring the prices.

In order to answer all the above, we discuss, in this chapter, the concept of the low price effect (lpe) [FRI 36] and recommend a low price correction (lpc) for improved forecasts. The low price effect is the increase in variation for stocks


with low price due to the existence of a minimum possible return, produced when the asset price changes by the MPV. The lpe is frequently overlooked, and the main reason for that is the lack of theoretical background related to the reasons resulting in this phenomenon, which surfaces primarily in periods of economic instability or extreme economic events. The pioneering feature of the proposed correction is that it does not require any additional parameters and takes the asset price into account. The proposed correction is associated with the rationalization of the estimated asset returns, since the estimate is rounded to the next integer multiple of the minimum possible return. Besides the proposed correction, in this work we also provide a mathematical reasoning for the increase in volatility.

Inspired by the above, we came to the same conclusion as many before, that in risk analysis the returns of an asset do not all bear the same amount of information and do not all carry the same degree of significance [SOK 09, ASA 17, ALH 08, ALA 10]. In the absence of a formal mathematical notion, the term asymmetry in the importance describes the above phenomenon relatively satisfactorily; it is also apparent in other scientific areas, related to medical, epidemiological, climatological, geophysical or meteorological phenomena. For a mathematical interpretation, we may consider a proper cost function that takes different values, depending on different regions of the dataset or the entire period of observation. For instance, in biosurveillance, the importance (and the cost) associated with an illness incidence rate is much higher in the so-called epidemic periods associated with an extreme rate increase. In the case of risk analysis, risk measures like the value at risk (VaR) and expected shortfall (ES) concentrate only on the left tail of the distribution of returns, with the attention being paid to the days of violation rather than to those of no violation. Consequently, failures in fitting a model on the right tail are not considered to be important. Thus, the asymmetry in the importance of information is crucial both in choosing the most appropriate model, by assigning a proper weight to each region of the dataset, and in evaluating its forecasting ability.

In order to judge the forecasting quality of the proposed methodology applied to typical risk measurements like VaR and ES, we have appropriately adjusted a number of popular evaluation measures that take into account the asymmetry in the importance of information borne in the data. Since the proposed low price correction is applied in a subset of the dataset with a specific property (in this case a very low price), it is logical for the adjusted evaluation measures to be computed on the same subset. Thus, the risk estimations can be evaluated with and without the implementation of the low price correction and then compared. The decrease in the values of the adjusted evaluation measures will show the improved forecasting quality of the proposed methodology. In addition, for the evaluation of the proposed correction, backtesting methods such as the violation ratio (VR) for value at risk and normalized shortfall (NS) for expected shortfall can also be used (see, for example, [BRO 11]). For real-life examples and applications, see [SIO 17, SIO 19a, SIO 19b].

1.2. The low price effect and correction

The rules that govern the accuracy of financial markets around the world fail to smoothly handle assets of extremely low price, resulting in considerable fluctuation (variation) and increased volatility. As expected, the phenomenon is magnified considerably in periods of extreme economic events (economic collapses, bankruptcies, depressions, etc.). Indeed, since all possible (logarithmic) returns on a specific day are integer multiples of the minimum possible return, the stock movement will fluctuate more nervously the lower the price. As a result, violations will occur more frequently and forecasts will turn out to be irrational, in the sense that such returns cannot be materialized. The resulting volatility increase is quite often overlooked, primarily due to the fact that we take into account only the returns of an asset, neglecting the prices entirely. In order to accommodate different accuracies, we introduce below a broad definition of the minimum possible return.

DEFINITION 1.1.– Let p_t be the asset value at time t and c(p_t) be the minimum price variation (market accuracy) associated with the value of the asset at time t. Then the minimum possible return (mpr) of an asset at time t, denoted by mpr_t, is the logarithmic return that the asset will produce if its value changes by c(p_t), and is given by:

$$mpr_t = \log\left(\frac{p_t + c(p_t)}{p_t}\right).$$


Note that mpr_t is the same for both upward and downward movements, due to the symmetry of logarithmic returns. For the special case where a market has a constant accuracy, say c, irrespective of the stock price, Definition 1.1 simplifies to mpr_t = log((p_t + c)/p_t) (see [SIO 17]).

EXAMPLE 1.1.– Let us assume that the value of an asset is equal to €0.19 and the market operates under €0.001 accuracy. Then the minimum possible return for the asset is 0.5%. Also, all possible (logarithmic) returns for the asset on this specific day are the integer multiples of the minimum possible return. As a result, the stock movement becomes even more nervous and the model's failures increase. Consequently, PVaR violations will occur more frequently, and our model will almost always derive forecasts that are irrational, since the stock cannot produce such returns.

Now that we have defined the mpr, we can provide a strict definition of the lpe, as well as the mathematical reasoning behind it.

DEFINITION 1.2.– Low price effect (lpe) is the inevitable increase of variance in stocks with low prices due to the existence of a minimum possible return.

Considering that the probability mass will be concentrated on the next integer multiple of the minimum possible return, the volatility, provided that it exists, is increased as shown below:

$$\begin{aligned}
\mathrm{Var}(R_t) &= \int_{-\infty}^{+\infty} (r_t - \mu)^2 f(r_t)\, dr_t \\
&= \sum_{k=-\infty}^{+\infty} \int_{k \cdot mpr_t}^{(k+1)\cdot mpr_t} (r_t - \mu)^2 f(r_t)\, dr_t \\
&\le \sum_{k=-\infty}^{+\infty} \ \sup_{r_t \in [k \cdot mpr_t,\,(k+1)\cdot mpr_t]} (r_t - \mu)^2 \int_{k \cdot mpr_t}^{(k+1)\cdot mpr_t} f(r_t)\, dr_t
\end{aligned}$$


where R_t is the random variable of returns at time t and f(·), μ are the density and the mean of returns, respectively.

Here it must be noted that although Definition 1.2 is of the same philosophy, it is slightly different from the original concept introduced by [FRI 36], who observed that low-priced stocks were characterized by higher returns and higher volatility. Such high volatility, whenever observed, was attributed to increased returns, although the discrete nature is the one to be blamed for both the increased returns and the volatility. According to [CLE 51], the low price effect (or low price anomaly) is attributed to the low quality of stocks as perceived by investors. Most results on the lpe ([CHR 82, DES 97, HWA 08], etc.) deal with the US market, although in the literature we can find some examples on other markets. Such markets include the Warsaw Stock Exchange, WSE [ZAR 14], the Johannesburg Stock Exchange, JSE [GIL 82] and [WAE 97], and the Athens Stock Exchange, ASE [SIO 17]. A general conclusion that can be derived from these results is that the decision on the cut-off point for a low-price share may be a subjective one, but the researcher is free to consider and explore various cut-off points in search of an idealistic threshold, if it exists. For both practical and theoretical purposes, any value can be adopted, provided that above a predefined threshold the market is assumed or believed to operate not as efficiently as it should. For the purposes of the current work, the value Θ = 0.001 or 0.1% is arbitrarily chosen to play the role of the pre-assigned threshold or cut-off point.

REMARK 1.1.– It should be noted that in the JSE, the lpe was present for shares priced below 30 cents, but the performance was not equally good for "super-low priced shares" (0–19 cents). For the WSE, the analysis was based on the 30% of stocks with the lowest prices.

DEFINITION 1.3.– Low price effect area is the range of prices for which the mpr is greater than a pre-specified threshold Θ.

It must be noted that the minimum possible return is a direct result of the combination of a low price with the minimum possible variation for the stock.

EXAMPLE 1.2.– We present below two examples for the low price effect area, which is easily defined by applying Definition 1.1 for specific values of the accuracy and the threshold.


a) For a stock exchange market which operates under accuracy c = c(p_t) = 0.001 and a predefined threshold Θ = 0.001, we get p_t ≤ 0.999001. Thus, the low price effect area is given by p_t ≤ 0.999, and an appropriate technique to resolve the low price effect should be implemented.

b) Equivalently, for stock exchange markets like the NYSE and NASDAQ, we would have, for the same threshold Θ = 0.001, two different cases:

– for p_t ≥ 1, implying that p_t ≤ 9.995002, and
– for p_t < 1, implying that p_t ≤ 0.09995002.

Thus, for 1 ≤ p_t ≤ 10 and for p_t ≤ 0.100 the low price effect is present, and proper actions should be taken to minimize it. While in the range of 1–10 dollars numerous real-life examples can be found, the same is not true for prices under 0.1 dollars, in which case the stock has already been halted from the stock market; hence, the lpe makes only theoretical sense and becomes a theoretical concept.

For the resulting low price effect area, the models considered should be appropriately adapted. This adaptation is done through the so-called low price correction of both the estimation of the percentage value at risk, denoted by PVaR, and the expected percentage shortfall, denoted by EPS, introduced respectively in the following two sections. A numerical sketch of the mpr and the low price effect area check is given below.
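To make Definition 1.1 and Examples 1.1–1.2 concrete, here is a minimal Python sketch (our illustration, not the chapter's code; the function and variable names are ours, and the tick sizes are the MPV values quoted in the introduction):

```python
import math

def mpr(price, tick):
    """Minimum possible return (Definition 1.1): log((p + c(p)) / p)."""
    return math.log((price + tick) / price)

def in_lpe_area(price, tick, theta=0.001):
    """Low price effect area (Definition 1.3): mpr greater than threshold Theta."""
    return mpr(price, tick) > theta

# Example 1.1: European market with 0.001-euro accuracy, asset priced at 0.19 euros
print(round(mpr(0.19, 0.001), 5))   # ~0.00525, i.e. roughly 0.5%
print(in_lpe_area(0.19, 0.001))     # True: well inside the lpe area

# Example 1.2b: US market with 0.01 USD accuracy for prices of 1 USD or more
print(in_lpe_area(5.00, 0.01))      # True: 5 USD lies in the 1-10 USD lpe range
print(in_lpe_area(50.00, 0.01))     # False: mpr ~ 0.0002 < Theta
```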

1.2.1. Percentage value at risk and low price correction

The percentage value at risk, which is a risk measure for estimating the possible percentage losses from trading assets within a set time period, is defined as follows:

DEFINITION 1.4.– a) Percentage value at risk p (PVaR(p)) is the 100p-th percentile of the distribution of returns.

b) Percentage value at risk p at time t (PVaR_t(p)) is the above-mentioned risk measure at time t.

The probability of a PVaR_t(p) violation, namely of the p-th percentile of the distribution, is given by

$$p = \Pr[R_t \le -PVaR_t(p)] = \int_{-\infty}^{-PVaR_t(p)} f(x)\, dx,$$

where R_t is the random variable of returns at time t, and f(·) is the probability density function of returns. Since we defined PVaR_t(p) based on the return distribution, no additional specification of the equation is needed on whether simple or logarithmic returns are available. Usually, the computation is done over standardized returns. Thus,

$$p = \Pr\left[\frac{R_t}{\sigma} \le -\frac{PVaR_t(p)}{\sigma}\right],$$

where the distribution of the standardized returns R_t/σ, with standard deviation σ, is denoted by F(·). Hence, the PVaR of an asset takes the form:

$$PVaR_t(p) = -\sigma F^{-1}(p),$$

where F^{-1}(p) is the 100p-th percentile of the assumed distribution. Note that for the evaluation of PVaR_t(p), we may consider any econometric model and then apply an estimation technique for the model parameters. Consider, for instance, the general Asymmetric Power ARCH (APARCH) model [DIN 93]: R_t = σ_t ε_t with

$$\sigma_t^{\delta} = a_0 + \sum_{i=1}^{p} a_i \left(|\varepsilon_{t-i}| - \gamma_i \varepsilon_{t-i}\right)^{\delta} + \sum_{j=1}^{q} b_j\, \sigma_{t-j}^{\delta},$$

where R_t is the (logarithmic) return at day t, ε_t a series of iid random variables, σ_t² the conditional variance of the model, a_0 > 0, a_i, b_j, δ ≥ 0, i = 1, ..., p, j = 1, ..., q, and γ_i ∈ [−1, 1], i = 1, ..., p.
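A direct transcription of the APARCH recursion into Python may help fix ideas (illustrative only; the parameter values in the usage line are arbitrary and the recursion is seeded with a flat initial volatility):

```python
import numpy as np

def aparch_volatility(eps, a0, a, b, gamma, delta, sigma0=0.01):
    """Filter conditional volatility through the APARCH recursion
    sigma_t^delta = a0 + sum_i a_i (|eps_{t-i}| - gamma_i eps_{t-i})^delta
                       + sum_j b_j sigma_{t-j}^delta."""
    p, q = len(a), len(b)
    n = len(eps)
    sig_d = np.full(n, sigma0 ** delta)     # sigma^delta, initialized flat
    for t in range(max(p, q), n):
        arch = sum(a[i] * (abs(eps[t - 1 - i]) - gamma[i] * eps[t - 1 - i]) ** delta
                   for i in range(p))
        garch = sum(b[j] * sig_d[t - 1 - j] for j in range(q))
        sig_d[t] = a0 + arch + garch
    return sig_d ** (1.0 / delta)

rng = np.random.default_rng(2)
eps = rng.standard_normal(500)
sigma = aparch_volatility(eps, a0=1e-6, a=[0.08], b=[0.90], gamma=[0.3], delta=2.0)
```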


We observe that for γ_i = 0, ∀i, and δ = 2, the model reduces to the GARCH model ([BOL 86]), which, in turn, reduces further to the EWMA model for p = q = 1 and for special values of the parameters involved. For the distribution F of the series {ε_t}, we consider in this work (see section 1.3) the normal, the Student-t and the skewed Student-t distribution ([LAM 01]). Based on the available data, the estimator σ̂_t of the conditional standard deviation σ_t is obtained by numerical maximization of the log-likelihood according to the distribution chosen. If, without loss of generality, we assume that the mean of the conditional distribution of the (logarithmic) returns is zero (otherwise the mean-corrected logarithmic returns could be used), then the estimator of the PVaR at day t is obtained as follows (see, for instance, [BRA 16]):

$$\widehat{PVaR}_t = q_p(F)\, \hat{\sigma}_t,$$

where q_p(F) is the 100p-th percentile of the assumed conditional distribution F of ε_t. In the case where the standardized returns are generated from a Student-t distribution with ν degrees of freedom, the variance is equal to ν/(ν − 2); hence, it is never equal to 1. If that sample variance is used in the calculation of the PVaR, the PVaR would be overestimated. Volatility effectively shows up twice, both in F^{-1}(p) and in the estimation σ̂ of σ, which is obtained by numerical maximization of the log-likelihood. Hence, we need to scale the volatility as follows: σ² ≡ (ν/(ν − 2)) σ̃², where σ̃² is the variance in excess of that implied by the standard Student-t.
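As an illustration of the PVaR formula and the Student-t rescaling just described, the following minimal Python sketch (ours, not the authors' code; it assumes the one-day conditional volatility σ̂_t has already been estimated, for example from a GARCH/APARCH fit) computes the PVaR under normal and rescaled Student-t innovations:

```python
from scipy.stats import norm, t

def pvar_normal(sigma_t, p=0.01):
    """PVaR_t(p) = -sigma * F^{-1}(p) for standard normal innovations."""
    return -sigma_t * norm.ppf(p)

def pvar_student_t(sigma_t, nu, p=0.01):
    """PVaR under Student-t innovations, rescaled to unit variance
    (the raw t distribution has variance nu / (nu - 2))."""
    scaled_quantile = t.ppf(p, df=nu) * ((nu - 2.0) / nu) ** 0.5
    return -sigma_t * scaled_quantile

sigma_hat = 0.02                        # assumed conditional volatility estimate
print(pvar_normal(sigma_hat))           # ~0.0465, a 4.65% one-day PVaR(1%)
print(pvar_student_t(sigma_hat, nu=5))  # heavier tail: larger PVaR despite rescaling
```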


We conclude this section with the definition of the low price correction for the percentage value at risk. A PVaR_t(p) estimate can take any real value, not necessarily equal to an integer multiple of mpr_t. Under the low price effect, asset movements inevitably become more nervous, and any continuous model used would produce, more often than it should, forecasts that are irrational in the sense that assets cannot produce such returns. To resolve this "inconsistency", we propose the low price correction, which rounds the PVaR_t(p) estimate to the closest legitimate value, namely the next integer multiple of mpr_t.

DEFINITION 1.5.– Let PVaR_t(p) be the estimation of the PVaR on day t for a specific asset. The low price correction of the estimation, denoted by $\widetilde{PVaR}_t(p)$, is given by:

$$\widetilde{PVaR}_t(p) = \begin{cases} \left(\left\lfloor \dfrac{PVaR_t(p)}{mpr_t} \right\rfloor + 1\right)\cdot mpr_t, & \text{if } mpr_t \ge \Theta \\[2mm] PVaR_t(p), & \text{if } mpr_t < \Theta \end{cases} \qquad [1.1]$$

where ⌊w⌋ is the floor function (integer part) of w. Note that we prefer to deal with the percentage value at risk for comparative reasons between assets of the same portfolio with different allocations. We observe that under the low price correction, the market's accuracy is passed on to the evaluation of the percentage value at risk, resulting in a more reasonable number of violations. A computational sketch of this rounding is given at the end of this section.

1.2.2. Expected Percentage Shortfall (EPS) and Low Price Correction

After we have obtained the PVaR_t(p) in the previous section, we now calculate the conditional expectation under PVaR_t(p), which is given by:

DEFINITION 1.6.– The expected percentage loss conditional on PVaR_t(p) being violated is defined by:

$$EPS_t = -E[R_t \mid R_t \le -PVaR_t(p)].$$

We observe that the area under f(·) in the interval (−∞, −PVaR_t(p)] is less than 1, implying that f(·) is no longer a proper density function. This can be resolved by defining the tail (right-truncated) density function f_PVaR(·), obtained by truncation on the right, so that the area below this density becomes exactly equal to 1. Thus:

$$1 = \frac{1}{p}\int_{-\infty}^{-PVaR_t(p)} f(x)\, dx = \int_{-\infty}^{-PVaR_t(p)} f_{PVaR}(x)\, dx.$$

The EPS is then given by:

$$EPS_t = -\int_{-\infty}^{-PVaR_t(p)} x f_{PVaR}(x)\, dx = -\frac{1}{p}\int_{-\infty}^{-PVaR_t(p)} x f(x)\, dx.$$
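The rounding in Definition 1.5 takes only a few lines of Python (an illustrative sketch, not the authors' implementation; `theta` is the cut-off Θ = 0.001 chosen earlier):

```python
import math

def low_price_correction(pvar, price, tick, theta=0.001):
    """Round a PVaR estimate up to the next integer multiple of the
    minimum possible return (Definition 1.5, equation [1.1])."""
    mpr = math.log((price + tick) / price)
    if mpr >= theta:
        return (math.floor(pvar / mpr) + 1) * mpr
    return pvar

# For a 0.19-euro asset (mpr ~ 0.00525), an uncorrected PVaR of 4.65%
# is lifted to the next feasible multiple of the mpr (~4.72%).
print(low_price_correction(0.0465, price=0.19, tick=0.001))
```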


In order to provide the discrete expression of the EPS, we present below the discretization f_DPVaR(·) of f_PVaR(·). f_DPVaR(x) is the probability function of a discrete random variable, having as values all the possible returns under $\widetilde{PVaR}_t(p)$, with probabilities given as follows:

$$f_{DPVaR}(x) = \frac{1}{p}\int_{x}^{x + mpr_t} f(u)\, du = \frac{1}{p}\left[F(x + mpr_t) - F(x)\right],$$

$$x = -\widetilde{PVaR}_t(p) - mpr_t,\ -\widetilde{PVaR}_t(p) - 2\, mpr_t,\ \ldots$$

Then, the definition of the discretization of the EPS follows naturally:

DEFINITION 1.7.– Let EPS_t be the estimation of the EPS on day t for a specific asset. The discrete approximation of the estimation EPS_t, denoted by DEPS_t, is given by:

$$DEPS_t = -\sum_{x \le -\widetilde{PVaR}_t(p)} x f_{DPVaR}(x),$$

where f_DPVaR(·) is the discretization of f_PVaR(·).

REMARK 1.2.– Even though, in the previous definition, we call DEPS_t the discrete approximation of EPS_t, the truth is that the nature of f, and by extension of f_PVaR, is discrete, since there always exists a minimum possible return. We may treat returns as a continuous random variable when the mpr is extremely small, but the discrete nature of returns still exists.

DEFINITION 1.8.– Let EPS_t be the estimation of the EPS on day t for a specific asset. The low price correction of the estimation EPS_t, denoted by $\widetilde{EPS}_t$, is given by:

$$\widetilde{EPS}_t = \begin{cases} DEPS_t, & \text{if } mpr_t \ge \Theta \\ EPS_t, & \text{if } mpr_t < \Theta \end{cases}$$

REMARK 1.3.– For historical simulation, we have that

$$DEPS_t = -\sum_{x \le -\widetilde{PVaR}_t(p)} x f_{EPVaR}(x),$$

where f_EPVaR(·) is the discrete empirical probability function of f_PVaR(·), given by

$$f_{EPVaR}(x) = \frac{\#\text{ of historical observations such that } x \le R_i < x + mpr_t,\ \text{for } i \le t}{\#\text{ of historical observations such that } R_i < -\widetilde{PVaR}_t(p)},$$

$$x = -\widetilde{PVaR}_t(p) - mpr_t,\ -\widetilde{PVaR}_t(p) - 2\, mpr_t,\ \ldots,\ -\widetilde{PVaR}_t(p) - \left(\left\lfloor \frac{\min_{i \le t} R_i}{mpr_t} \right\rfloor + 1\right) mpr_t,$$

where R_i are the realizations of returns.
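Remark 1.3 lends itself to a direct empirical implementation. The sketch below (our illustration; the function and variable names are assumptions) computes the discrete EPS from a vector of historical returns by binning the tail into multiples of the mpr:

```python
import numpy as np

def discrete_eps_historical(returns, pvar_corrected, mpr):
    """Discrete EPS via historical simulation (Remark 1.3): the empirical
    tail probability mass is placed on a grid of mpr-multiples below -PVaR."""
    returns = np.asarray(returns, dtype=float)
    tail = returns[returns < -pvar_corrected]   # returns on violation days
    if tail.size == 0:
        return float("nan")
    # grid of feasible returns below -PVaR, spaced by the minimum possible return
    k_max = int(np.floor((-tail.min() - pvar_corrected) / mpr)) + 1
    deps = 0.0
    for k in range(1, k_max + 1):
        x = -pvar_corrected - k * mpr
        f_x = np.sum((x <= tail) & (tail < x + mpr)) / tail.size
        deps -= x * f_x
    return deps

rng = np.random.default_rng(0)
r = rng.standard_t(df=5, size=2000) * 0.01      # synthetic daily returns
print(discrete_eps_historical(r, pvar_corrected=0.025, mpr=0.005))
```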

1.2.3. Adjusted Evaluation Measures

For the evaluation of the performance of the competing models, we will use statistical methods as well as backtesting. Popular evaluation measures used in the literature include the mean square error (MSE), the mean absolute error (MAE) and the mean absolute percent error (MAPE). Since the main interest lies in returns, which mostly (except for a very few extreme cases) take values in (−0.5, 0.5), we prefer MAE and MAPE, because the square in the MSE would shrink the errors further. These measures should be appropriately adapted in order to capture the needs of the problem at hand.

Let $\{R_t\}_{t=1}^{T}$ be a sample of a time series corresponding to daily logarithmic losses on a trading portfolio, T the length of the time series and PVaR_t the estimation of the PVaR on day t. If on a particular day the logarithmic loss exceeds the PVaR forecast, then the PVaR limit is said to have been violated. For a given PVaR_t, we define the indicator η_t as follows:

$$\eta_t = \begin{cases} 1 & \text{if } R_t > PVaR_t \\ 0 & \text{if } R_t \le PVaR_t \end{cases}$$

Under the above setting, it is easily seen that the MSE over violation days is defined as follows:

$$MSE = \frac{1}{\sum_{t \le T} \eta_t} \sum_{t \le T,\ \eta_t = 1} \left(R_t - PVaR_t(p)\right)^2.$$

We can easily observe that the above is a special weighted mean squared error expression. Indeed, we note that it can be written as

$$MSE = \sum_{t \le T} \frac{\eta_t \cdot \left(R_t - PVaR_t(p)\right)^2}{v(T)}, \qquad \text{where } v(T) = \sum_{t \le T} \eta_t.$$
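Since the adjusted measures weight only the violation days, they are straightforward to compute. A minimal sketch (ours, not the chapter's code) for the violation-day MSE and MAE:

```python
import numpy as np

def adjusted_errors(losses, pvar):
    """MSE and MAE restricted to violation days (eta_t = 1)."""
    losses, pvar = np.asarray(losses), np.asarray(pvar)
    eta = losses > pvar                  # violation indicator
    v = eta.sum()                        # v(T), number of violations
    if v == 0:
        return float("nan"), float("nan")
    err = losses[eta] - pvar[eta]
    return np.mean(err ** 2), np.mean(np.abs(err))

losses = np.array([0.010, 0.035, -0.002, 0.050, 0.012])
pvar   = np.array([0.030, 0.030, 0.030, 0.030, 0.030])
print(adjusted_errors(losses, pvar))     # uses only the two violation days
```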

REMARK 1.4.– A more in-depth analysis of the adjusted evaluation measures and their theoretical background can be found in [SIO 19a, SIO 19b], which due to space limitations cannot be provided here.

1.2.4. Backtesting and Method's Advantages

The comparison between the observed frequency and the expected number of violations provides the primary tool for backtesting, which is known as the violation ratio [CAM 06]. The VR can be used to evaluate the forecasting ability of the model and its PVaR estimations, while the normalized shortfall (NS) is used for backtesting the EPS estimations. Note that if the corrected version of the PVaR is used, then PVaR_t should be replaced by $\widetilde{PVaR}_t$ in section 1.2.3. A PVaR violation is said to have occurred whenever the indicator η_t is equal to 1.

DEFINITION 1.9.– Let v(T) be the observed number of violations, p the probability of a violation and W_T the testing window. Then, the violation ratio VR is defined by

$$VR = \frac{\text{Observed number of violations}}{\text{Expected number of violations}} = \frac{v(T)}{p \times W_T}.$$

Intuitively, if the violation ratio is greater than 1, the risk is underforecast, while if it is smaller than 1, the risk is overforecast. Acceptable values for the VR according to the Basel III accords lie in the interval (0.8, 1.2), while values over 1.5 or below 0.5 indicate imperfect modeling (see [DAN 11]).

It is harder to backtest the expected percentage shortfall (EPS) than the PVaR, because we are testing an expectation rather than a single quantile. Fortunately, there exists a simple methodology for backtesting the EPS that is analogous to the use of violation ratios for the PVaR. For each day t when the PVaR is violated, the normalized shortfall NS is calculated as follows:

$$NS_t = \frac{R_t}{EPS_t},$$


where EPS_t is the observed EPS on day t. From the definition of the EPS, the expected return, given that the PVaR is violated, satisfies:

$$\frac{E[R_t \mid R_t \le -PVaR_t(p)]}{EPS_t} = 1.$$

Therefore, the average NS, denoted by $\overline{NS}$ and given by

$$\overline{NS} = \frac{1}{n}\sum_{t=1}^{n} NS_t = \frac{1}{n}\sum_{t=1}^{n} \frac{R_t}{EPS_t},$$

should be equal to 1, which, in turn, formulates the null hypothesis:

$$H_0 : \overline{NS} = 1.$$

With the EPS, we are testing whether the mean of returns on days when the PVaR is violated is the same as the expected EPS on these days. Clearly, it is much harder to create a formal test in order to ascertain whether the normalized EPS equals 1 or not. Such a test would have to simultaneously test the accuracy of the PVaR and the expectation beyond the PVaR. This means that the reliability of any EPS backtest procedure is likely to be much lower than that of PVaR backtest procedures.

REMARK 1.5.– For a better understanding of the proposed methodology, the reader may refer to examples for the Athens Exchange (ATHEX) and the American Stock Exchange (NYSE MKT) analyzed and discussed in [SIO 19a, SIO 19b], which due to space limitations cannot be provided here.

REMARK 1.6.– The importance of the proposed technique lies in the fact that, under the low price correction, fewer violations are expected to occur. Note that a VaR estimate can take any real value, not necessarily equal to an integer multiple of the mpr, since it is not controlled by the market's accuracy. This observation simply implies that any model almost always derives forecasts that are irrational in the sense that stocks cannot produce such returns. Thus, due to this "inconsistency" between PVaR forecasts and stock movements, PVaR violations occur more frequently than they should. Indeed, assume mpr_t is greater than the threshold Θ. Then, the PVaR estimate is plausible as the asset's (logarithmic) return as long as it coincides with an integer multiple of the mpr. Otherwise, if k · mpr_t < PVaR_t
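The VR and normalized-shortfall checks of this section are simple to compute. A minimal Python sketch (our illustration, not the chapter's code, using the loss convention of section 1.2.3):

```python
import numpy as np

def violation_ratio(losses, pvar, p):
    """VR (Definition 1.9): observed over expected number of violations."""
    eta = np.asarray(losses) > np.asarray(pvar)   # violation indicator
    return eta.sum() / (p * len(eta))             # v(T) / (p * W_T)

def average_normalized_shortfall(losses, pvar, eps):
    """Mean of NS_t = R_t / EPS_t over violation days; ~1 under H0."""
    losses, pvar, eps = map(np.asarray, (losses, pvar, eps))
    eta = losses > pvar
    return np.mean(losses[eta] / eps[eta]) if eta.any() else float("nan")

# A VR outside (0.8, 1.2) flags under- or overforecast risk per Basel III.
```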
Therefore, the conditions of the central limit theorem are not satisfied, and L is not asymptotically normal. Then, the approximate probability density function of the default rate distribution is obtained as

$$h_{DR}(DR) = \frac{\sqrt{1-\rho}}{\sqrt{\rho}}\; g_{Y_0}\!\left(\frac{\sqrt{1-\rho}\, G_{Y_1}^{-1}(DR) - G_{W_1}^{-1}(PD)}{\sqrt{\rho}}\right) \cdot \frac{1}{g_{Y_1}\!\left(G_{Y_1}^{-1}(DR)\right)}.$$

Here, the Edgeworth density functions gY0 (x) and gY1 (x), and their corresponding distribution function GY1 (x), are given in Theorem 2.1. Under


the Gaussian assumptions on the asset returns, the above density function reduces to that of a Gaussian–Merton model. That is,

$$f(DR) = \sqrt{\frac{1-\rho}{\rho}}\, \exp\left\{ \frac{1}{2}\left(N^{-1}(DR)\right)^2 - \frac{1}{2}\left(\frac{\sqrt{1-\rho}\, N^{-1}(DR) - N^{-1}(PD)}{\sqrt{\rho}}\right)^2 \right\}.$$

2.4. Data analysis

We use default rate data from the U.S. default history record analyzed in Hull (2012). We formulate a parameter estimation based on the maximum likelihood method below. The model parameter vector θ is denoted as

$$\theta = \left(PD,\ \rho,\ \tilde{C}^{(n)}_{Y_0,3},\ \tilde{C}^{(n)}_{Y_0,4},\ \tilde{C}^{(n)}_{Y_1,3},\ \tilde{C}^{(n)}_{Y_1,4}\right)^{T}.$$

The observed default records are denoted as DR_t (t = 1, ..., M). Then, the maximum likelihood estimator (MLE) of θ is defined as

$$\hat{\theta}^{(ML)} = \underset{\theta}{\arg\max} \sum_{t=1}^{M} \log h_{DR}(DR_t; \theta).$$

In order to avoid negative values in the density functions in the Edgeworth expansions, we impose the following restrictions on the parameters in θ; that is,

$$C^{(n)}_{Y_i,3} \le \frac{24\sqrt{n}\, H_3\!\left(F^{-1}_{n,\min}\!\big(C^{(n)}_{Y_i,4}\big)\right)}{3 H_4\!\left(F^{-1}_{n,\min}\!\big(C^{(n)}_{Y_i,4}\big)\right) H_2\!\left(F^{-1}_{n,\min}\!\big(C^{(n)}_{Y_i,4}\big)\right) - 4 H_3\!\left(F^{-1}_{n,\min}\!\big(C^{(n)}_{Y_i,4}\big)\right)^2} \quad \text{for } C^{(n)}_{Y_i,3} \ge 0,$$

with the analogous bound, with $F^{-1}_{n,\max}$ in place of $F^{-1}_{n,\min}$, for $C^{(n)}_{Y_i,3} < 0$, together with 0 ≤ ρ ≤ 1 and 0 ≤ PD ≤ 1. Here, the function F_n is defined as

$$F_n(x) = \frac{-72 n H_2(x)}{3 H_4(x)\, H_2(x) - 4 H_3(x)^2},$$

and the values $F^{-1}_{n,\min}(x)$ and $F^{-1}_{n,\max}(x)$ are the minimum and maximum roots of $F_n(y) = x$, respectively.
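For the Gaussian special case, the maximum likelihood step reduces to fitting the two-parameter Vasicek density above. A minimal sketch (our illustration, not the authors' code; the default rate data below are synthetic):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def vasicek_logpdf(dr, pd, rho):
    """Log of the Gaussian-Merton (Vasicek) default rate density."""
    z = norm.ppf(dr)
    return (0.5 * np.log((1 - rho) / rho) + 0.5 * z**2
            - ((np.sqrt(1 - rho) * z - norm.ppf(pd)) ** 2) / (2 * rho))

def fit_vasicek(dr):
    """MLE of (PD, rho) by minimizing the negative log-likelihood."""
    nll = lambda th: -np.sum(vasicek_logpdf(dr, th[0], th[1]))
    res = minimize(nll, x0=[0.02, 0.1], bounds=[(1e-4, 0.5), (1e-3, 0.99)])
    return res.x

rng = np.random.default_rng(1)
# synthetic default rates from a Vasicek model with PD = 0.013, rho = 0.11
z = rng.standard_normal(40)
dr = norm.cdf((norm.ppf(0.013) - np.sqrt(0.11) * z) / np.sqrt(1 - 0.11))
print(fit_vasicek(dr))   # estimates should land near (0.013, 0.11)
```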

With the above-mentioned constrained optimization, the maximum likelihood estimates are given in Table 2.1. We plotted the default rate distributions with the estimated parameters and indicated credit risk measures, including expected loss, unexpected loss and economic capital, in Figure 2.1 and Table 2.2.

Model            | PD    | ρ     | C_{Y0,3} | C_{Y0,4} | C_{Y1,3} | C_{Y1,4} | MLL
Gaussian         | 0.013 | 0.110 | –        | –        | –        | –        | 137.731
Y0: non-Gaussian | 0.014 | 0.111 | 0.138    | 0.080    | –        | –        | 137.423
Y1: non-Gaussian | 0.013 | 0.103 | –        | –        | 0.072    | 0.033    | 138.022

Table 2.1. Maximum likelihood estimates of the model parameters

Figure 2.1. Histogram of U.S. default rates with estimated default rate distributions together with their credit risk measures. For a color version of this figure, see www.iste.co.uk/makrides/data4.zip


Model            | Expected loss | Unexpected loss | VaR (99%) | Economic capital
Gaussian         | 1.345%        | 1.305%          | 6.340%    | 4.995%
Y0: non-Gaussian | 1.353%        | 1.264%          | 6.050%    | 4.697%
Y1: non-Gaussian | 1.316%        | 1.273%          | 6.200%    | 4.884%

Table 2.2. Estimated credit risk measures among models
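For intuition, measures of the kind shown in Table 2.2 can be approximated for the Gaussian model by Monte Carlo on the systematic factor (a sketch under our own simplifying conventions; the numbers will not match the table exactly, which comes from the chapter's fitted densities, but they land in the same range):

```python
import numpy as np
from scipy.stats import norm

def credit_risk_measures(pd_, rho, alpha=0.99, n=500_000, seed=0):
    """EL, UL, VaR and EC of the default rate under the Gaussian
    one-factor model, by Monte Carlo on the common factor Z."""
    z = np.random.default_rng(seed).standard_normal(n)
    dr = norm.cdf((norm.ppf(pd_) - np.sqrt(rho) * z) / np.sqrt(1 - rho))
    el, ul = dr.mean(), dr.std()
    var = np.quantile(dr, alpha)
    return el, ul, var, var - el      # economic capital = VaR - EL

print(credit_risk_measures(0.013, 0.110))
```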

We assume that a CDO portfolio is composed of m contracts of credit default swaps (CDS), each providing protection against obligor i's default. In pricing CDOs, we must consider the loss distribution in addition to quantifying the credit risk. The number of names, m, is typically equal to 125. The notional N of the CDO is the total exposure of the portfolio. The contract is written on each tranche for the amount of principal from the attachment point (AP) to the detachment point (DP). The {AP, DP} of the tranches are often defined as {0%, 3%}, {3%, 6%}, {6%, 9%}, {9%, 12%} and {12%, 22%}. The cumulative loss up to time t of the tranche {AP, DP} is

$$L_t^{\{AP,DP\}} := (L_t - AP \cdot N)^{+} - (L_t - DP \cdot N)^{+}.$$

The default and premium payments can conveniently be expressed in terms of the cumulative loss process. At a time τ ≤ T of default for a name in the portfolio, a default payment of size

$$\Delta L_{\tau}^{\{AP,DP\}} := L_{\tau}^{\{AP,DP\}} - L_{\tau-}^{\{AP,DP\}}$$

is made. Assuming that the short-term interest rate is (r(t))_{t≥0}, the initial value of all default payments up to time T is given by

$$V^{def}_{\{AP,DP\}} = E\left[\int_0^T \exp\left(-\int_0^t r(s)\, ds\right) dL_t^{\{AP,DP\}}\right].$$

To keep our analysis simple, the value of the default payments can be expressed through the expected value of the loss process by the partial integration below:

$$V^{def}_{\{AP,DP\}} = \exp\left(-\int_0^T r(s)\, ds\right) E\left[L_T^{\{AP,DP\}}\right] + \int_0^T r(t) \exp\left(-\int_0^t r(s)\, ds\right) E\left[L_t^{\{AP,DP\}}\right] dt.$$

The premium payment leg consists of regular payments at fixed future dates: t_0 < t_1 < ··· < t_{T_n} = T. Given a spread x and t_0 = 0, the value of the regular premium payments equals

$$V^{prem}_{\{AP,DP\}}(x) = x \sum_{n=1}^{T_n} (t_n - t_{n-1}) \exp\left(-\int_0^{t_n} r(s)\, ds\right) \left[ (DP - AP)N - E\left[L_{t_n}^{\{AP,DP\}}\right] \right].$$

The fair tranche spread, $x^{\{AP,DP\}}$, is then determined by equating the values of the default and premium payments:

$$V^{prem}_{\{AP,DP\}}\left(x^{\{AP,DP\}}\right) = V^{def}_{\{AP,DP\}}.$$

In addition, we assume that default can only occur on the dates t_1 < ... < t_N and that L_{i,t} = PD_i × LGD_i × EAD_i. All we have to do to evaluate the expected value of L_t is to focus on the default rate distribution. Then, both sides of the equation can be expressed as functions of

$$E\left[L_t^{\{AP,DP\}}\right] = E\left((L_t - AP \cdot N)^{+} - (L_t - DP \cdot N)^{+}\right) = \frac{\int_{AP}^{DP} L_t \cdot h(L_t)\, dL_t}{P\left((AP < L_t) \cap (L_t < DP)\right)}.$$
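As a hedged illustration of how these pieces combine (ours, not the chapter's implementation; it assumes the Gaussian default rate distribution, r = 0, annual payment dates, and a running spread with no upfront payment):

```python
import numpy as np
from scipy.stats import norm

def vasicek_pdf(dr, pd, rho):
    """Gaussian-Merton (Vasicek) density of the default rate."""
    z = norm.ppf(dr)
    return (np.sqrt((1 - rho) / rho)
            * np.exp(0.5 * z**2
                     - (np.sqrt(1 - rho) * z - norm.ppf(pd)) ** 2 / (2 * rho)))

def expected_tranche_loss(pd, rho, ap, dp, notional=1.0, lgd=0.6):
    """E[(L - AP*N)^+ - (L - DP*N)^+] with L = DR * LGD * N, by quadrature."""
    dr = np.linspace(1e-6, 1 - 1e-6, 20000)
    dx = dr[1] - dr[0]
    tranche = np.clip(dr * lgd * notional - ap * notional, 0, (dp - ap) * notional)
    return np.sum(tranche * vasicek_pdf(dr, pd, rho)) * dx

def fair_spread(lam, rho, ap, dp, maturity=5, lgd=0.6):
    """Fair tranche spread with r = 0: default leg over premium leg per unit spread."""
    times = np.arange(1, maturity + 1)
    el = np.array([expected_tranche_loss(1 - np.exp(-lam * s), rho, ap, dp, lgd=lgd)
                   for s in times])
    v_def = el[-1]                        # discounting is trivial at r = 0
    v_prem_unit = np.sum((dp - ap) - el)  # premium leg per unit spread
    return v_def / v_prem_unit

print(fair_spread(lam=0.01, rho=0.219, ap=0.00, dp=0.03))
```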

For our numerical experiments, we choose the following parameters: the identical probability of default is PD(t) = 1 − e^{−λt}, EAD = 1, LGD = 0.6, maturity is T = 5, the interest rate is r = 0, C_{Y0,3} = 0.138, C_{Y0,4} = 0.08, C_{Y1,3} = 0.072, C_{Y1,4} = 0.033, and ρ represents the implied tranche correlation. The parameter ρ in each tranche is chosen from Hull and White (2004). We present the numerical results of CDO pricing in Table 2.3. Table 2.3 shows that the CDO spread under the Gaussian model is higher than under the Y1 non-Gaussian model for all tranches. In addition, we find that in the Y0 non-Gaussian case, the CDO spreads for the [6,9] and [12,22] tranches are larger than those of the other models.


Tranche          | [0,3]      | [3,6]      | [6,9]      | [9,12]     | [12,22]
                 | ρ = 0.219  | ρ = 0.042  | ρ = 0.148  | ρ = 0.223  | ρ = 0.305
Gaussian         | 55.22474%  | 11.61022%  | 3.768472%  | 2.180098%  | 0.4942221%
Y0: non-Gaussian | 52.47056%  | 11.42896%  | 4.111445%  | 2.102742%  | 0.5278298%
Y1: non-Gaussian | 54.88656%  | 11.55764%  | 3.558663%  | 2.07836%   | 0.4317376%

Table 2.3. Calibrated CDO spreads for each tranche

2.5. Conclusion

We have introduced non-Gaussian one-factor Merton models for credit risk modeling and shown their application to CDO pricing. Through parameter estimation, we found that the non-Gaussian models are more efficient than the Gaussian ones in terms of maximum log-likelihood. In addition, we show that the non-Gaussian models make credit risk measurements, for example, EC, smaller than those of Gaussian models. Regarding CDO pricing, overall, the non-Gaussian models have a smaller CDO spread than the Gaussian ones.

2.6. Acknowledgements

The authors would like to express their gratitude to the anonymous referee, whose invaluable comments improved this chapter. This research was supported in part by JSPS KAKENHI (grant no. 18K01706).

2.7. References

[BRI 10] BRIGO D., PALLAVICINI A., TORRESETTI R., Credit Models and the Crisis: A Journey into CDOs, Copulas, Correlations and Dynamic Models, John Wiley & Sons, New York, 2010.

[BUT 07] BUTLER R.W., Saddlepoint Approximations with Application, Cambridge University Press, New York, 2007.

[DEM 09] DEMYANYK Y., VAN HEMERT O., "Understanding the subprime mortgage crisis", The Review of Financial Studies, vol. 24, no. 6, pp. 1848–1880, 2009.

[GIG 16] GIGLIO S., KELLY B., PRUITT S., "Systemic risk and the macroeconomy: An empirical evaluation", Journal of Financial Economics, vol. 119, no. 3, pp. 457–471, 2016.

[GOR 02] GORDY M., "Saddlepoint approximation of credit risk", Journal of Banking and Finance, vol. 26, pp. 1335–1353, 2002.

[HOF 11] HOFERT M., SCHERER M., "CDO pricing with nested Archimedean copulas", Quantitative Finance, vol. 11, no. 5, pp. 775–787, 2011.


[HUA 07] HUANG X., OOSTERLEE C.W., VAN DER WEIDE J.A.M., "Higher-order saddlepoint approximations in the Vasicek portfolio credit loss model", Journal of Computational Finance, vol. 11, pp. 93–113, 2007.

[HUA 11] HUANG X., OOSTERLEE C.W., "Saddlepoint approximations for expectations and application to CDO pricing", SIAM Journal on Financial Mathematics, vol. 2, no. 1, pp. 692–714, 2011.

[HUL 04a] HULL J., NELKEN I., WHITE A., "Merton's model, credit risk, and volatility skews", Journal of Credit Risk, vol. 1, no. 1, pp. 1–27, 2004.

[HUL 04b] HULL J., WHITE A., "Valuation of a CDO and an nth to default CDS without Monte Carlo simulation", Journal of Derivatives, vol. 12, no. 2, pp. 8–23, 2004.

[HUL 09] HULL J.C., "The credit crunch of 2007: What went wrong? Why? What lessons can be learnt?", Journal of Credit Risk, vol. 5, no. 2, pp. 3–18, 2009.

[HUL 12] HULL J.C., Risk Management and Financial Institutions, 3rd edition, Wiley, New York, 2012.

[JAR 95] JARROW R., TURNBULL S., "Pricing derivatives on financial securities subject to credit risk", Journal of Finance, vol. 50, pp. 53–85, 1995.

[KAW 16] KAWADA A., SHIOHAMA T., "Structural credit risks with non-Gaussian and serially correlated innovations", American Journal of Mathematical and Management Sciences, vol. 35, no. 2, pp. 143–158, 2016.

[LI 00] LI D.X., "On default correlation: A copula function approach", The Journal of Fixed Income, vol. 9, no. 4, pp. 43–54, 2000.

[TAN 00] TANIGUCHI M., KAKIZAWA Y., Asymptotic Theory of Statistical Inference for Time Series, Springer, New York, 2000.

[VAS 02] VASICEK O., "The distribution of loan portfolio value", Risk, vol. 15, no. 12, pp. 160–162, 2002.

3 Towards an Improved Credit Scoring System with Alternative Data: the Greek Case

During the development of credit risk assessment models, it is very important to identify variables that allow us to evaluate a company’s credit risk accurately, as the classification results depend on selecting the appropriate characteristics for a given data set. Many studies have focused on the characteristics that should be used in credit scoring applications. The data used in most of these studies is either financial data or credit behavior data. However, there are other sources which can also provide useful information and which have not been explored to the same extent. Our main objective is to explore these alternative sources of information. To that end, we introduce alternative data into a predictive model which uses only traditional credit behavior data, in order to see whether the former contribute to the model’s performance. In this chapter, a new credit risk model, tested on real data, which evaluates the credit risk of Greek hotels, is introduced. This model uses a combination of credit behavior data and alternative data. The credit risk model introduced in this chapter has some important additional advantages: a) it contains a relatively small number of variables, b) its stability is tested on samples after the time period of data selection and for different populations and c) the characterization of “good” and “bad” credit behavior is strictly defined.

Chapter written by Panagiota GIANNOULI and Christos E. KOUNTZAKIS.

3.1. Introduction

Credit risk is one of the major threats that financial institutions face. Credit scoring is concerned with assessing credit risk and providing for informed decision-making in the money-lending business.

Hand and Jacka [HAN 98] stated that a financial institution’s process of modeling creditworthiness is referred to as credit scoring. Given the importance of credit scoring, much research has been done in this area. Many studies have focused on the characteristics that should be used in credit scoring applications (e.g. Pendharkar [PEN 05], Fletcher and Goss [FLE 93], Jo et al. [JO 97], Desai et al. [DES 96], Tam and Kiang [TAM 92], Salchenberger et al. [SAL 92], Leshno and Spector [LES 96]). However, not many studies have focused on the utility of alternative data. The objective of this chapter is to introduce alternative data into a model which uses only traditional credit behavior data, such as maximum percent credit utilization and worst payment status. More specifically, we are interested in creating variables that come from alternative sources concerning Greek hotels, introducing them into an already existing model for Greek hotels that uses only credit behavior variables, and seeing whether the alternative variables contribute to the model’s performance. Finally, we perform an out-of-time and out-of-sample validation of the logistic regression model which uses the combination of alternative and credit behavior variables, in order to see whether its performance remains stable over time and for different populations.

3.2. Literature review: stages of credit scoring

Applications of credit scoring have been widely used in various fields, including statistical techniques used for prediction purposes and classification problems. In particular, in corporate credit scoring models, a number of stages must be included, ranging from gathering and preparing relevant data to estimating a credit score using a formula induction algorithm and developing, monitoring and recalibrating the scorecard. All stages have been explored in the literature; for example, data gathering and preparation have been studied with respect to the handling of missing values (e.g. Florez-Lopez [FLO 10]) and the selection of a predictive set of explanatory variables (e.g. Falangis and Glen [FAL 10], Liu and Schumann [LIU 05]). Once a data set is ready, a variety of prediction methods can be used to estimate different aspects of credit risk. In particular, the Basel 2 Capital Accord requires financial institutions that adopt an internal rating approach to develop three forms of risk models, to estimate the probability of default (PD), the exposure at default (EAD) and the loss given default (LGD). EAD and LGD prediction models have been explored in recent research (e.g. Bellotti and Crook [BEL 12], Loterman et al. [LOT 12], Somers and Whittaker [SOM 07]).


Nevertheless, most studies concentrate on PD modeling using either classification or survival analysis. Survival analysis models are useful for estimating when a customer will default (e.g. Bellotti and Crook [BEL 09], Stepanova and Thomas [STE 02], Tong et al. [TON 12]). On the other hand, classification analysis benefits from an unmatched variety of modeling methods and represents the prevailing modeling approach in the literature.

3.3. Performance definition

There are two periods that are studied during the model’s creation: the observation period and the performance period. In this chapter, a period of 12 months (01/01/2016 to 12/31/2016) is used as the performance period and 24 months (01/01/2014 to 12/31/2015) as the observation period, as often occurs when creating similar models (e.g. Siddiqi [SID 06]). These models are intended to discriminate “bad” from “good” behavior in the performance period. First of all, we have to specify what we mean by the “bad” and “good” credit behavior of a company:

1) Companies with “good” behavior are companies with no delinquency, or companies with maximum delinquency in the last 12 months of 0 to 29 days past due, or, concerning SME overdrafts, credit limit utilization over 102% for 0 to 29 days.

2) Companies with “bad” behavior are companies showing “severe delinquency”, which means:

– SME (small and medium-sized enterprises) contracts, not overdrafts, with maximum delinquency in the last 12 months of greater than or equal to 90 days past due, or

– SME overdrafts with maximum delinquency in the last 12 months of greater than or equal to 90 days past due, or credit limit utilization over 102% for a time period greater than or equal to 90 days with the overlimit amount greater than 100 euros.

In the case where there are guarantors for the company, “bad” is the company which has the following credit behavior:

– SME contracts, not overdrafts, with maximum delinquency in the last 12 months of greater than or equal to 150 days past due, or


– SME overdrafts with maximum delinquency in the last 12 months of greater than or equal to 150 days past due, or credit limit utilization over 102% for a time period greater than or equal to 90 days.

A company is also included among the ones with “bad” credit behavior when there is a new DFO (Default Financial Obligation, i.e. loan denunciation) within the performance period. The term “utilization” refers to the ratio of the company’s current balance to its credit limit.
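As an illustration, the definition above can be written as a labeling rule. The following Python sketch is purely hypothetical: all record field names (product_type, max_dpd_12m, overlimit_days, overlimit_amount_eur, new_dfo, utilization_pct) are assumed for illustration and are not fields of the actual data set used here.

    def is_bad(rec, has_guarantor=False):
        """Severe delinquency per the definition above (hypothetical fields)."""
        dpd_limit = 150 if has_guarantor else 90
        if rec.get("new_dfo"):                       # new DFO (loan denunciation)
            return True
        if rec["product_type"] == "sme_contract":
            return rec["max_dpd_12m"] >= dpd_limit
        if rec["product_type"] == "sme_overdraft":
            overlimit = (rec["utilization_pct"] > 102
                         and rec["overlimit_days"] >= 90
                         and (has_guarantor or rec["overlimit_amount_eur"] > 100))
            return rec["max_dpd_12m"] >= dpd_limit or overlimit
        return False

    def is_good(rec):
        """No delinquency, or at most 0-29 days past due / over the limit."""
        return rec["max_dpd_12m"] <= 29 and rec.get("overlimit_days", 0) <= 29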

3.4. Data description

3.4.1. Alternative data in credit scoring

The landscape of data is ever-changing, meaning analysts need to evolve both their thinking and data collection methods in order to stay ahead of the curve. In many cases, data that might have been considered unique, uncommon or unattainably expensive just a few years ago is now widely used. The analysts who take advantage of these untapped data sources can gain a competitive advantage before the rest of their industry catches on. This type of data is often referred to as alternative data, and with the ever-increasing levels of data available in the modern world comes the opportunity to gain unique insights and a competitive industry advantage. Alternative data can also be described as data that has been delivered from non-traditional sources; data that can be used to complement traditional data sources to produce improved analytical insights that would otherwise not have been achievable with traditional data alone. Put simply, it is data that is not commonly used within a specific industry, but which can potentially be used to gain a competitive advantage over those who do not have access to it.

3.4.2. Credit scoring data set

At this point, it is important to note that for this research a real-world credit scoring data set was taken from the private database of Tiresias S.A. (a company founded by all the banks in Greece); it contains data from businesses concerning loans and credit cards, information about credit statuses, data concerning the credit behavior of individuals and companies and, finally, data from mortgages.

The set of hotels having credit transactions with banks (all 678 of them) was used throughout the analysis. The data set includes several independent variables used to create a credit scorecard. These variables are associated with information from the application form (e.g. the loan amount), the status of the credit (e.g. the current balance) and the credit behavior of the company (e.g. bankruptcy). To this data set, we added the “alternative” variables that we created using information from social media and customer reviews, in order to analyze them together with the already existing variables. The alternative variables that took part in the analysis are: a hotel’s registration on Facebook, Twitter, Instagram, LinkedIn or YouTube; the number of hotel awards; the hotel’s rating on TripAdvisor; the number of votes on TripAdvisor; the hotel’s rating on Booking.com and the number of votes on Booking.com. Using the above alternative variables, we created two two-dimensional variables: the first is the combination of a hotel’s registration on Twitter and Instagram, and the second is the average rating of TripAdvisor and Booking.com combined with the sum of votes on TripAdvisor and Booking.com. The data set also includes a binary response variable that indicates whether or not a default event was observed in a given period of time.

3.4.3. Data pre-processing

In this section, we employ some standard pre-processing operations to prepare the data for the subsequent analysis. In particular, we grouped all the independent variables by creating dummy variables using weight-of-evidence (WOE) coding. Missing values (records that do not contain all their elements/information) were grouped separately. This process offers the following advantages:

1) it offers an easier way to deal with outliers, interval variables and rare classes;

2) grouping makes it easier to understand relationships and therefore gain more knowledge of the portfolio. A chart displaying the relationship between the attributes of a characteristic and performance is a much more powerful tool than a simple variable strength statistic. It allows users to explain the nature of this relationship, as well as its strength;

3) nonlinear dependence can be modeled with linear models.
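As a minimal sketch of this grouping step, the WOE of a binned characteristic and its information value (IV), which is used in section 3.5, can be computed as follows. The column names and bin edges in the usage comment are illustrative only, not those of the actual data set.

    import numpy as np
    import pandas as pd

    def woe_table(bins, y):
        """WOE and IV contribution per bin; y = 1 for 'good', 0 for 'bad'.
        Missing values should be assigned to their own bin beforehand."""
        df = pd.DataFrame({"bin": bins, "y": y})
        g = df.groupby("bin")["y"].agg(goods="sum", total="count")
        g["bads"] = g["total"] - g["goods"]
        pct_good = g["goods"] / g["goods"].sum()
        pct_bad = g["bads"] / g["bads"].sum()
        g["woe"] = np.log(pct_good / pct_bad)            # WOE per bin
        g["iv"] = (pct_good - pct_bad) * g["woe"]        # IV contribution
        return g

    # e.g. WOE of a binned rating, with missing values in their own bin:
    # bins = pd.cut(ratings.fillna(-1), [-2, 0, 3, 4, 5])
    # print(woe_table(bins, good_flag))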


3.5. Models’ comparison

In this section, we list the independent credit behavior variables that are already used in the predictive model for Greek hotels, as well as the K-S (Kolmogorov–Smirnov) statistic, the Gini index (= 2·AUC − 1, where AUC is the area under the ROC curve) and the accuracy values of this model, in order to compare them with the corresponding values of the “alternative” model. The already existing model for Greek hotels contains the independent variables: sum occurrence delinquency one plus at the last 24 months, utilization PJ (prim joint holders) update at the last 12 months non-revolving, utilization PJ update at the last 12 months revolving, and worst payment status PJ last month versus 24 months. The value of the K-S statistic for this model is 74.8%, the Gini index value is 0.88 and its accuracy is 91.4%. Subsequently, we introduce the two two-dimensional alternative variables (mentioned in section 3.4.2) into this model, as they were stronger than the one-dimensional variables (described in section 3.4.2 as the variables that took part in the analysis) according to WOE and information value (IV). This resulted in the following “alternative” model:

Ln(odds) = 1.55820 + 0.00610x1 + 0.00587x2 + 0.00750x3 + 0.00494x4 + 0.01191x5 + 0.00932x6,

where:
– x1 = sum occurrence delinquency one plus at the last 24 months;
– x2 = utilization PJ (prim joint holders) update at the last 12 months non-revolving;
– x3 = utilization PJ (prim joint holders) update at the last 12 months revolving;
– x4 = worst payment status PJ last month versus 24 months;
– x5 = hotel’s registration on Twitter and Instagram;
– x6 = average rating of TripAdvisor and Booking.com combined with the sum of votes on TripAdvisor and Booking.com.
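For completeness, a minimal sketch of how the evaluation measures used here can be computed from model scores; it assumes a score where higher values indicate “good” and y = 1 marks “good” hotels.

    import numpy as np

    def ks_and_gini(scores, y):
        """K-S distance between the score distributions of bads and goods,
        and Gini = 2*AUC - 1."""
        order = np.argsort(scores)
        y = np.asarray(y)[order]
        cum_good = np.cumsum(y) / y.sum()              # F_good at each cutoff
        cum_bad = np.cumsum(1 - y) / (1 - y).sum()     # F_bad at each cutoff
        ks = np.max(np.abs(cum_bad - cum_good))
        auc = np.trapz(cum_bad, cum_good)              # P(score_bad <= score_good)
        return ks, 2 * auc - 1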


Ln(odds) is the log-odds of a hotel being “good”; the corresponding predicted probability takes values between 0 and 1, and the closer it is to 1, the more likely the hotel is to show “good” behavior.

Observed \ Predicted    Bad   Good   Percentage correct
Bad                     104     28   78.8
Good                     20    526   96.3
Overall percentage                   92.9

Table 3.1. Classification table

Table 3.1 is the classification table, which shows that the addition of independent variables increases the proportion of cases (from the 50-50 case) of the dependent variable that are correctly predicted by the model. In this case, the model correctly predicts 92.9% (accuracy) of the observations. This percentage is higher than the previous model’s accuracy, which was 91.4% (see in section 3.5). Table 3.2 contains K-S and Gini index values, which are 77.0% and 0.90, respectively, and they are used in order to verify if the model is capable of distinguishing two populations. We note that both K-S and Gini index values are higher than they were in the previous model (74.8% and 0.88, respectively; see section 3.5). Predicted probability Bad Good Bad rate

K-S

Gini index

≤ .19102

64

3

95.5% 47.9%



.19103–.57739

42

26

61.8% 75.0%

0.02

.57740–.85966

16

55

22.5% 77.0%

0.03

.85967–.98958

9

193

4.5%

48.5%

0.05

.98959+

1

269

0.4%

0.0%

0.01

19.5% 77.0%

0.90

Total

132 546

Table 3.2. K-S and Gini index

At this point, it is important to note that this slight increase in accuracy and in the K-S and Gini index values is significant, as we are dealing with real-world data sets. Finally, based on the above results, we conclude that alternative data contributes to the Greek hotel predictive model’s performance.


3.6. Out-of-time and out-of-sample validation

The following procedure verifies the alternative logistic regression model by running it on a different time period (04/2016 to 04/2017) in order to see whether it is still efficient and stable, as it will only be useful if it can be used over time. In Table 3.3, it appears that the K-S (79.0%) is better than before (see Table 3.2) and the Gini index remains the same (0.90), so we can conclude that the model is still efficient and stable. The model’s stability is also confirmed once again in Table 3.4, as its stability value is 0.00.

Predicted probability   Bad   Good   Bad rate   K-S     Gini index
≤ .19102                 54      6     90.0%    48.4%      –
.19103–.57739            37     24     60.7%    77.8%    0.02
.57740–.85966            11     46     19.3%    79.0%    0.02
.85967–.98958             5    193      2.5%    46.7%    0.03
.98959+                   2    254      0.8%     0.0%    0.03
Total                   109    523     17.2%    79.0%    0.90

Table 3.3. Out-of-time validation K-S and Gini index

Score range      Development#   Validation#   Development%   Validation%   Stability
≤ .19102                   67            60           9.9%          9.5%        0.00
.19103–.57739              68            61          10.0%          9.7%        0.00
.57740–.85966              71            57          10.5%          9.0%        0.00
.85967–.98958             202           198          29.8%         31.3%        0.00
.98959+                   270           256          39.8%         40.5%        0.00
Total                     678           632         100.0%        100.0%        0.00

Table 3.4. Stability
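The stability column of Table 3.4 is consistent with the usual population stability index (PSI); the chapter does not spell the formula out, so the sketch below should be read as an assumption. Fed with the shares from Table 3.4, it reproduces the near-zero per-band values.

    import numpy as np

    def psi(dev_pct, val_pct):
        """Per-band contributions (dev - val) * ln(dev / val) and their total."""
        dev, val = np.asarray(dev_pct, float), np.asarray(val_pct, float)
        contrib = (dev - val) * np.log(dev / val)
        return contrib, contrib.sum()

    dev = [0.099, 0.100, 0.105, 0.298, 0.398]   # development shares, Table 3.4
    val = [0.095, 0.097, 0.090, 0.313, 0.405]   # validation shares, Table 3.4
    per_band, total = psi(dev, val)             # each contribution rounds to 0.00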

Finally, we perform an out-of-sample validation by running this model for 122 new hotels. In Table 3.5, it is observed that K-S (77.6%) and Gini index (0.90) remain high, proving that the model is also suitable for other populations.

Predicted probability   Bad   Good   Bad rate   K-S     Gini index
≤ .19102                 22      1     95.7%    61.7%      –
.19103–.57739             5      5     50.0%    70.2%    0.01
.57740–.85966             5      6     45.5%    77.6%    0.03
.85967–.98958             3     32      8.6%    49.4%    0.06
.98959+                   0     43      0.0%     0.0%    0.00
Total                    35     87     28.7%    77.6%    0.90

Table 3.5. Out-of-sample validation K-S and Gini index

3.7. Conclusion

We set out to explore the effectiveness of alternative data in credit scoring models. To that end, we created variables coming from alternative sources and introduced them into an already existing predictive model for Greek hotels that uses only traditional credit behavior variables. For this purpose, we used a real-world credit scoring data set of 678 Greek hotels. Comparing the “alternative” model with the existing one in terms of K-S, Gini index and accuracy, we concluded that alternative data contributes to the model’s performance. More specifically, this contribution can be seen by observing the differences between the values of the performance indicators for the two models (K-S: 77.0% > 74.8%, accuracy: 92.9% > 91.4%, Gini index: 0.90 > 0.88). Having noted this contribution to the performance of the model for Greek hotels, we can say that it would be prudent to explore the utility of alternative data in other industries as well. Finally, we performed an out-of-time and out-of-sample validation of the logistic regression model with the alternative variables, which confirmed its efficiency over time and for different populations.

3.8. References

[BEL 09] BELLOTTI T., CROOK J., “Credit scoring with macroeconomic variables using survival analysis”, Journal of the Operational Research Society, vol. 60, pp. 1699–1707, 2009.

[BEL 12] BELLOTTI T., CROOK J., “Loss given default models incorporating macroeconomic variables for credit cards”, International Journal of Forecasting, vol. 28, no. 1, pp. 171–182, 2012.


[DES 96] DESAI V.S., CROOK J.N., OVERSTREET G.A., “A comparison of neural networks and linear scoring models in the credit union environment”, European Journal of Operational Research, vol. 95, no. 1, pp. 24–37, 1996.

[FAL 10] FALANGIS K., GLEN J.J., “Heuristics for feature selection in mathematical programming discriminant analysis models”, Journal of the Operational Research Society, vol. 61, no. 5, pp. 804–812, 2010.

[FLE 93] FLETCHER D., GOSS E., “Forecasting with neural networks: An application using bankruptcy data”, Information and Management, vol. 24, no. 3, pp. 159–167, 1993.

[FLO 10] FLOREZ-LOPEZ R., “Effects of missing data in credit risk scoring. A comparative analysis of methods to achieve robustness in the absence of sufficient data”, Journal of the Operational Research Society, vol. 61, no. 3, pp. 486–501, 2010.

[HAN 98] HAND D.J., JACKA S.D., Statistics in Finance, Edward Arnold Publishers Ltd., London, 1998.

[JO 97] JO H., HAN I., LEE H., “Bankruptcy prediction using case-based reasoning, neural networks and discriminant analysis”, Expert Systems with Applications, vol. 13, no. 2, pp. 97–108, 1997.

[LES 96] LESHNO M., SPECTOR Y., “Neural networks prediction analysis: The bankruptcy case”, Neurocomputing, vol. 10, no. 2, pp. 125–147, 1996.

[LIU 05] LIU Y., SCHUMANN M., “Data mining feature selection for credit scoring models”, Journal of the Operational Research Society, vol. 56, no. 9, pp. 1099–1108, 2005.

[LOT 12] LOTERMAN G., BROWN I., MARTENS D. et al., “Benchmarking regression algorithms for loss given default modelling”, International Journal of Forecasting, vol. 28, no. 1, pp. 161–170, 2012.

[PEN 05] PENDHARKAR P.C., “A threshold-varying artificial neural networks approach for classification and its application to bankruptcy prediction problems”, Computers and Operations Research, vol. 32, no. 10, pp. 2561–2582, 2005.

[SAL 92] SALCHENBERGER L.M., CINAR E.M., LASH N.A., “Neural networks: A new tool for predicting thrift failures”, Decision Sciences, vol. 23, no. 4, pp. 899–916, 1992.

[SID 06] SIDDIQI N., Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring, John Wiley & Sons, New York, 2006.

[SOM 07] SOMERS M., WHITTAKER J., “Quantile regression for modelling distribution of profit and loss”, European Journal of Operational Research, vol. 183, no. 3, pp. 1477–1487, 2007.

[STE 02] STEPANOVA M., THOMAS L., “Survival analysis methods for personal loan data”, Operations Research, vol. 50, no. 2, pp. 277–289, 2002.

[TAM 92] TAM K.Y., KIANG M.Y., “Managerial applications of neural networks: The case of bank failure predictions”, Management Science, vol. 38, no. 7, pp. 926–947, 1992.

[TON 12] TONG E.N.C., MUES C., THOMAS L.C., “Mixture cure models in credit scoring: If and when borrowers default”, European Journal of Operational Research, vol. 218, no. 1, pp. 132–139, 2012.

4 EM Algorithm for Estimating the Parameters of the Multivariate Stable Distribution

Research on α-stable distributions is especially important nowadays, because they often occur in the analysis of financial data and of information flows along computer networks. It has been found that financial data are often leptokurtic with heavy-tailed distributions. Many authors have argued that the normal distribution, the most commonly used one, is not the most suitable for analyzing economic indicators and have suggested replacing it with more general ones, for example, stable distributions. However, the problem of estimating multivariate stable data then arises, whereas only one-dimensional α-stable distributions are currently estimated efficiently by the maximum likelihood method. Therefore, a maximum likelihood method for the estimation of multivariate α-stable distributions by means of an expectation–maximization algorithm is presented in this work. The integrals included in the expressions of the estimates have been calculated using the Gaussian and Gauss–Laguerre quadrature formulas. The constructed model can be used in stock market data analysis.

Chapter written by Leonidas SAKALAUSKAS and Ingrida VAICIULYTE.

4.1. Introduction

Stochastic processes can be modeled, estimated and predicted by probabilistic statistical methods, using the data obtained by observing the process. A number of empirical studies confirm that real commercial data are often characterized by skewness, kurtosis and heavy tails (Janicki and

Weron [JAN 93]; Rachev and Mittnik [RAC 00]; Samorodnitsky and Taqqu [SAM 94]; etc.). Therefore, the well-known normal distribution does not always fit; for example, stock returns or risk factors are badly fitted by the normal distribution (Kabasinskas et al. [KAB 09]; Belovas, Kabasinskas and Sakalauskas [BEL 06]). In this case, normal distributions are replaced with more general ones, for example, stable distributions, which make it possible to model both leptokurtic and asymmetric data (Fielitz and Smith [FIE 72]; Rachev and Mittnik [RAC 00]; Kabasinskas et al. [KAB 12]; Sakalauskas et al. [SAK 13]). For this reason, stable distributions are very often used in business and economics data analysis. According to some experts, the α-stable distribution offers a reasonable improvement, if not the best choice, among the alternative distributions that have been proposed in the literature over the past four decades (e.g. Bertocchi et al. [BER 05]; Hoechstoetter, Rachev and Fabozzi [HOE 05]). However, the practical application of stable distributions is limited by the fact that their distribution and density functions are not expressed through elementary functions, except for a few special cases (Janicki and Weron [JAN 93]; Rachev and Mittnik [RAC 00]; Belovas, Kabasinskas and Sakalauskas [BEL 06]). Moreover, stable distributions have infinite variance (except in the normal case). In this work, the expression of stable multivariate variables through a normal multivariate vector with random variance, changing according to a particular stable law, is used for the simulation.

Although the estimation of parameters of multivariate stable distributions was first discussed a long time ago, the problem has not yet been fully solved (Press [PRE 72]; Rachev and Xin [RAC 93]; Nolan [NOL 98]; Davydov and Paulauskas [DAV 99]; Kring et al. [KRI 09]; Ogata [OGA 13]). The maximum likelihood (ML) approach for the estimation of the multivariate α-stable distribution by means of an expectation–maximization (EM) algorithm is presented in this work.

In the one-dimensional case, a random stable value is described by four parameters: stability α ∈ (0; 2], skewness β ∈ [−1; 1], scale σ > 0 and location μ ∈ R. The stability parameter α is the most important one for characterizing financial data, and the scale parameter σ can also be used to measure risk. Random variables, which are stable for a fixed number of random elements with respect to composition, are called α-stable.

In the one-dimensional case, it is known that s = s1^{1/α2} · s2, where:

– s1 is a random stable variable with skewness parameter β = 1 and shape parameter α1 < 1;

– s2 is another random stable variable, independent of s1, with skewness parameter β = 0 and shape parameter α2;

– s is a random stable variable with skewness parameter β = 0 and shape parameter α = α1 · α2 (Rachev and Mittnik [RAC 93]; Samorodnitsky and Taqqu [SAM 94]; Ravishanker and Qiou [RAV 99]).

While applying this method, it is usually chosen that s2 be normally distributed, i.e. α1 = α/2 and α2 = 2. When α1 < 1 and β = 1, the random stable variable is called a stable subordinator and takes only positive values (Rachev and Mittnik [RAC 93]; Ravishanker and Qiou [RAV 99]). This approach produces a multidimensional random vector with dependent components, with which heavy-tailed data can be modeled (Nolan [NOL 07]; Sakalauskas and Vaiciulyte [SAK 14]). In this way, a multivariate symmetric stable vector can be expressed through a normally distributed random vector and α-stable variables (Ravishanker and Qiou [RAV 99]; Rachev and Mittnik [RAC 93]):

X = μ + √s1 · s2,   [4.1]

where μ is the mean vector, s1 is the subordinator with parameter α and s2 is a random vector distributed by the d-variate normal law N(0, Ω) with zero mean and covariance matrix Ω.
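As an illustration, representation [4.1] can be simulated directly; the sketch below uses SciPy's one-dimensional stable generator for the subordinator. The subordinator scale cos(πα/4)^{2/α} follows the usual sub-Gaussian convention (see Samorodnitsky and Taqqu [SAM 94]) and should be checked against the parametrization in use.

    import numpy as np
    from scipy.stats import levy_stable

    def rvs_sub_gaussian(alpha, mu, omega, size, seed=None):
        """Simulate X = mu + sqrt(s1) * s2 as in [4.1]."""
        rng = np.random.default_rng(seed)
        a1 = alpha / 2.0                                 # subordinator index
        scale = np.cos(np.pi * a1 / 2.0) ** (1.0 / a1)   # = cos(pi*alpha/4)^(2/alpha)
        s1 = levy_stable.rvs(a1, 1.0, loc=0.0, scale=scale, size=size)
        s2 = rng.multivariate_normal(np.zeros(len(mu)), omega, size=size)
        return np.asarray(mu) + np.sqrt(s1)[:, None] * s2

    # X = rvs_sub_gaussian(1.5, mu=np.zeros(4), omega=np.eye(4), size=1000)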

4.2. Estimators of maximum likelihood approach

The ML approach allows us to obtain the values of the model parameters that maximize the likelihood function for a fixed sample of independent identically distributed data (Sakalauskas [SAK 10]; Kabasinskas et al. [KAB 09]; Ravishanker and Qiou [RAV 99]). Let us consider the probability density of a random vector created according to [4.1]. The density of the multivariate vector N(μ, s·Ω) is as follows:

f(x | μ, s, Ω) = s^{−d/2} / ((2π)^{d/2} · |Ω|^{1/2}) · exp{ −(x − μ)^T Ω^{−1} (x − μ) / (2s) }.   [4.2]


Let us write down the probability density of the α-stable subordinator (Rachev and Mittnik [RAC 00]; Bogdan et al. [BOG 09]):

f(s | α) = (α · s^{2/(α−2)}) / (2 · |2 − α|) ∫_{−1}^{1} U_α^{α/(2−α)}(y) · exp{ −(U_α(y)/s)^{α/(2−α)} } dy,   [4.3]

where s ≥ 0 and

U_α(y) = sin((π/4)·α·(y + 1)) · cos((π/4)·(α − (2 − α)·y))^{(2−α)/α} / ( cos(π·y/2)^{2/α} · cos(π·α/4)^{2/α} ).   [4.4]

Thus, the probability density of the random vector under given parameters μ, Ω, α is expressed as the bivariate integral

f(x | μ, Ω, α) = (α/(2 − α)) / (2 · (2π)^{d/2} · |Ω|^{1/2}) ∫_0^∞ ∫_{−1}^{1} exp{ −(1/2)(x − μ)^T Ω^{−1} (x − μ)/s − (U_α(y)/s)^{α/(2−α)} } · U_α^{α/(2−α)}(y) / s^{d/2 + 2/(2−α)} dy ds.   [4.5]

Let us consider the sample X = (X^1, X^2, ..., X^K) that consists of independent d-variate stable vectors. The likelihood function, by virtue of [4.5], is

L̃(X, μ, Ω, α) = ∏_{i=1}^{K} f(X^i | μ, Ω, α) = (α/(2 − α))^K / (2^K · (2π)^{K·d/2} · |Ω|^{K/2}) ∏_{i=1}^{K} ∫_0^∞ ∫_{−1}^{1} exp{ −(1/2)(X^i − μ)^T Ω^{−1} (X^i − μ)/s_i − s_i^{α/(α−2)} U_α(y_i) } · U_α(y_i) / s_i^{d/2 + 2/(2−α)} dy_i ds_i.   [4.6]


Denote

z_i = s_i^{α/(α−2)} · U_α(y_i).   [4.7]

The log-likelihood function is now as follows:

L(X, μ, Ω, α) = −∑_{i=1}^{K} ln f(X^i | μ, Ω, α) = −∑_{i=1}^{K} ln ∫_0^∞ exp{−z_i} ∫_{−1}^{1} B(X^i, y_i, z_i, μ, Ω, α) dy_i dz_i,   [4.8]

where

B(X^i, y_i, z_i, μ, Ω, α) = 1 / (2 · (2π)^{d/2} · |Ω|^{1/2} · U_α(y_i)) · z_i^{d(2−α)/(2α)} · exp{ −(X^i − μ)^T Ω^{−1} (X^i − μ) · z_i^{(2−α)/α} / (2 · U_α(y_i)) }.   [4.9]

ML estimators of the multivariate α-stable distribution parameters μ, Ω, for a given and fixed α, are calculated by equating the derivatives of the likelihood function with respect to the optimized parameters to zero and solving the resulting system of equations:

∂L(X, μ, Ω, α)/∂μ = −∑_{i=1}^{K} [1/f(X^i | μ, Ω, α)] · ∂f(X^i | μ, Ω, α)/∂μ = 0,

∂L(X, μ, Ω, α)/∂Ω = −∑_{i=1}^{K} [1/f(X^i | μ, Ω, α)] · ∂f(X^i | μ, Ω, α)/∂Ω = 0.   [4.10]

Let us denote the derivatives

∂B(X^i, y_i, z_i, μ, Ω, α)/∂μ = [1/(2 · (2π)^{d/2} · |Ω|^{1/2} · U_α(y_i))] · z_i^{(d+2)(2−α)/(2α)} · Ω^{−1}(X^i − μ) · exp{ −(X^i − μ)^T Ω^{−1}(X^i − μ) · z_i^{(2−α)/α} / (2 · U_α(y_i)) }
= [Ω^{−1}(X^i − μ) · z_i^{(2−α)/α} / U_α(y_i)] · B(X^i, y_i, z_i, μ, Ω, α),   [4.11]

∂B(X^i, y_i, z_i, μ, Ω, α)/∂Ω = [1/(2 · (2π)^{d/2} · |Ω|^{1/2} · U_α(y_i))] · z_i^{(d+2)(2−α)/(2α)} · [−Ω^{−1} + Ω^{−1}(X^i − μ)(X^i − μ)^T Ω^{−1}] · exp{ −(X^i − μ)^T Ω^{−1}(X^i − μ) · z_i^{(2−α)/α} / (2 · U_α(y_i)) }
= [−Ω^{−1} + Ω^{−1}(X^i − μ)(X^i − μ)^T Ω^{−1} · z_i^{(2−α)/α} / U_α(y_i)] · B(X^i, y_i, z_i, μ, Ω, α).   [4.12]

Differentiating the integrals with respect to the parameters, the following quantities are obtained:

h(X, μ, Ω, α) = ( ∑_{i=1}^{K} X^i · g_i/f_i ) / ( ∑_{i=1}^{K} g_i/f_i ),   [4.13]

w(X, μ, Ω, α) = ∑_{i=1}^{K} (X^i − μ̂)(X^i − μ̂)^T · g_i/f_i.   [4.14]

We can write the derivatives of the log-likelihood function in this way:

∂L/∂μ = ∑_{i=1}^{K} (h(X, μ, Ω, α) − μ) · g_i/f_i,   [4.15]

∂L/∂Ω = −K · Ω^{−1} + Ω^{−1} · w(X, μ, Ω, α) · Ω^{−1},   [4.16]


where

g_i = g(X^i, μ, Ω, α) = ∫_0^∞ ( ∫_{−1}^{1} B(X^i, y, z, μ, Ω, α)/U_α(y) dy ) · z^{(2−α)/α} e^{−z} dz,   [4.17]

f_i = f(X^i, μ, Ω, α) = ∫_0^∞ ( ∫_{−1}^{1} B(X^i, y, z, μ, Ω, α) dy ) · e^{−z} dz.   [4.18]

The estimators of the parameters satisfy the equations of the fixed-point method:

μ̂ = ( ∑_{i=1}^{K} X^i · g_i/f_i ) / ( ∑_{i=1}^{K} g_i/f_i ),   [4.19]

Ω̂ = (1/K) ∑_{i=1}^{K} (X^i − μ̂)(X^i − μ̂)^T · g_i/f_i.   [4.20]

The shape parameter estimate is obtained by solving the one-dimensional optimization problem α̂ = arg max_{0≤α≤1} L(X, μ̂, Ω̂, α). The golden section search method can be applied to this optimization.

4.3. Quadrature formulas

The integrals included in the expressions of the estimates can be calculated by integration subroutines in mathematical systems such as MathCad and Maple, or by using the Gaussian and Gauss–Laguerre quadrature formulas (Ehrich [EHR 02]; Stoer and Bulirsch [STO 02]; Kovvali [KOV 12]; Casio Computer Co. [CAS 11]). The Gauss–Laguerre quadrature formula is given by:

∫_0^∞ x^α e^{−x} f(x) dx ≈ ∑_{i=1}^{n} ω_i f(x_i),   [4.21]

where f(x_i) is the integrated function evaluated at the integration nodes x_i, n is the number of nodes and ω_i are the fixed weights.


The Gaussian quadrature formula is given by:

∫_{−1}^{1} f(χ) dχ ≈ ∑_{i=1}^{m} ϑ_i f(χ_i),   [4.22]

where f(χ_i) is the integrated function evaluated at the integration nodes χ_i, m is the number of nodes and ϑ_i are the fixed weights.
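Both rules are available as precomputed nodes and weights in numerical libraries. A minimal sketch of applying [4.21] and [4.22] to a double integral of the type appearing in [4.17]–[4.18]; the quadrature orders are arbitrary, and SciPy's generalized Gauss–Laguerre nodes handle the weight z^a e^{−z}.

    import numpy as np
    from numpy.polynomial.legendre import leggauss
    from scipy.special import roots_genlaguerre

    def double_quad(inner, a, n=32, m=32):
        """Approximate the integral of z^a e^{-z} inner(y, z)
        over z in [0, inf) and y in [-1, 1]."""
        z, wz = roots_genlaguerre(n, a)   # weight z^a e^{-z} on [0, inf)
        y, wy = leggauss(m)               # weight 1 on [-1, 1]
        vals = inner(y[None, :], z[:, None])   # shape (n, m)
        return float(wz @ vals @ wy)

    # Check on a separable integrand: Gamma(2) * integral of y^2 = 2/3.
    print(double_quad(lambda y, z: y**2 + 0.0 * z, a=1.0))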

4.4. Computer modeling

ML parameter estimation by the EM algorithm is an iterative process: one needs to choose initial values and perform iterations until the values in adjacent steps differ insignificantly. In order to test the behavior of the created algorithm, experiments were made with the financial statements – total current assets, total assets, total current liabilities, total liabilities – of 124 companies in the USA. The data are taken from the “Audit Integrity” analysis [EON 10]. “Audit Integrity” is a leading independent research firm that rates more than 12,000 public companies in North America and Europe based on their corporate integrity, in addition to its flagship Accounting and Governance Risk ratings (Price, Sharp and Wood [PRI 11]). Owing to the much shorter computing time (the error in the likelihood function appears only in the sixth digit), the integrals were calculated using the Gaussian [4.22] and Gauss–Laguerre [4.21] quadrature formulas. In this experiment, the data consisted of 124 four-dimensional vectors with the following sampling mean and sampling covariance matrix:

α = 1.5,   μ = (1.044, 2.046, 0.37, 0.873)^T,

Ω =
| 1.175  2.024  0.486  0.911 |
| 2.024  5.225  1.038  2.572 |
| 0.486  1.038  0.418  0.667 |
| 0.911  2.572  0.667  1.953 | .   [4.23]

We have developed an algorithm in which α is optimized in each iteration. Figure 4.1 shows that the likelihood function is unimodal; therefore, the golden section search method can be applied to this optimization.
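A sketch of this one-dimensional step using SciPy's golden section search; neg_log_lik stands for the profile negative log-likelihood α ↦ L(X, μ̂, Ω̂, α) from the current EM iteration and is assumed to be available, and the bracket values are illustrative.

    from scipy.optimize import minimize_scalar

    def update_alpha(neg_log_lik, bracket=(0.5, 0.75, 0.99)):
        """Golden section minimization; unimodality (Figure 4.1) makes a
        bracket with f(mid) below both endpoints easy to find."""
        res = minimize_scalar(neg_log_lik, bracket=bracket, method="golden")
        return res.x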


Figure 4.1. Likelihood function dependence on α

Overall, 100 iterations were performed by the proposed EM algorithm. Figure 4.2 shows the obtained parameters of the α-stable law in dependence on the number of iterations. We see that the value of the likelihood function and the parameter estimates converge within a few iterations to the values calculated with the MathCad minimization subprogram. Furthermore, K = 100 four-dimensional random α-stable values with the obtained parameter estimates were generated and a likelihood ratio test was performed:

1) the parameters of the model were estimated by the ML method using the practical data;

2) then, a new sample was generated from the stable model whose parameters correspond to the obtained estimates;

3) furthermore, the empirical likelihood function and the likelihood function values of this sample (derived from the practical data, i.e. the empirical probability) were calculated. If this probability is in the interval (α/2, 1 − α/2), there is no reason to reject the hypothesis that the data match the analyzed probability model, in the given case the α-stable law, with reliability α;

4) the empirical probability of the test with the financial balance data was 21.47% (see Figure 4.3).


Figure 4.2. Parameters dependence on the number of iterations. For a color version of this figure, see www.iste.co.uk/makrides/data4


Figure 4.3. Likelihood function test. For a color version of this figure, see www.iste.co.uk/makrides/data4

4.5. Conclusion

1) An ML method for the multivariate α-stable distribution was created in this work, which makes it possible to estimate the parameters of this distribution using the EM algorithm.

2) The α-stable distribution parameter estimators obtained by the numerical simulation method are statistically adequate, because after a certain number of iterations, the values of the likelihood function and the parameters converge to the ML values.

3) It was shown that this method realizes a golden section search of the log-likelihood function, implementing it within the EM algorithm.

4) The algorithm was applied to create a model of the balance data of US companies. It can be used to create financial models in stock market data analysis. In addition, it can be used to test systems of stochastic type and to solve other statistical tasks.

4.6. References

[BEL 06] BELOVAS I., KABASINSKAS A., SAKALAUSKAS L., “A study of stable models of stock markets”, Information Technology and Control, vol. 35, no. 1, pp. 34–56, 2006.

[BER 05] BERTOCCHI M., GIACOMETTI R., ORTOBELLI S. et al., “The impact of different distributional hypothesis on returns in asset allocation”, Finance Letters, vol. 3, no. 1, pp. 17–27, 2005.


[BOG 09] BOGDAN K., BYCZKOWSKI T., KULCZYCKI T. et al., Potential Analysis of Stable Processes and its Extensions, Springer Science and Business Media, New York, 2009.

[CAS 11] CASIO COMPUTER CO., “Nodes and weights of Gauss-Laguerre Calculator”. Available at: http://keisan.casio.com/exec/system/1281279441, 2011.

[DAV 99] DAVYDOV Y., PAULAUSKAS V., “On the estimation of the parameters of multivariate stable distributions”, Acta Applicandae Mathematica, vol. 58, no. 1, pp. 107–124, 1999.

[EHR 02] EHRICH S., “On stratified extensions of Gauss-Laguerre and Gauss-Hermite quadrature formulas”, Journal of Computational and Applied Mathematics, vol. 140, nos 1–2, pp. 291–299, 2002.

[EON 10] EON: ENHANCED ONLINE NEWS, “Audit Integrity’s “AGR” rating outperforms leading academic accounting risk measures”, independent study finds, 2010. Available at: http://www.businesswire.com/news/home/20100315006057/en/.

[FIE 72] FIELITZ B.D., SMITH E.W., “Asymmetric stable distributions of stock price changes”, Journal of American Statistical Association, vol. 67, no. 340, pp. 331–338, 1972.

[HOE 05] HOECHSTOETTER M., RACHEV S., FABOZZI F.J., “Distributional analysis of the stocks comprising the DAX 30”, Probability and Mathematical Statistics, vol. 25, no. 1, pp. 363–383, 2005.

[JAN 93] JANICKI A., WERON A., Simulation and Chaotic Behavior of α-Stable Stochastic Processes, Marcel Dekker, New York, 1993.

[KAB 09] KABASINSKAS A., RACHEV S., SAKALAUSKAS L. et al., “Stable paradigm in financial markets”, Journal of Computational Analysis and Applications, vol. 11, no. 3, pp. 642–688, 2009.

[KAB 12] KABASINSKAS A., SAKALAUSKAS L., SUN E.W. et al., “Mixed-stable models for analyzing high-frequency financial data”, Journal of Computational Analysis and Applications, vol. 14, no. 7, pp. 1210–1226, 2012.

[KOV 12] KOVVALI N., Theory and Applications of Gaussian Quadrature Methods, Morgan and Claypool Publishers, 2012.

[KRI 09] KRING S., RACHEV S.T., HOCHSTOTTER M. et al., “Estimation of α-stable sub-Gaussian distributions for asset returns”, Risk Assessment: Decisions in Banking and Finance, pp. 111–152, 2009.

[NOL 98] NOLAN J.P., “Multivariate stable distributions: Approximation, estimation, simulation and identification”, A Practical Guide to Heavy Tails, pp. 509–525, 1998.

[NOL 07] NOLAN J.P., Stable Distributions – Models for Heavy Tailed Data, Birkhauser, Boston, 2007.

[OGA 13] OGATA H., “Estimation for multivariate stable distributions with generalized empirical likelihood”, Journal of Econometrics, vol. 172, no. 2, pp. 248–254, 2013.

[PRI 11] PRICE R.A., SHARP N.Y., WOOD D.A., “Detecting and predicting accounting irregularities: A comparison of commercial and academic risk measures”, Accounting Horizons, vol. 25, no. 4, pp. 755–780, 2011.


[PRE 72] PRESS S.J., “Estimation in univariate and multivariate stable distributions”, Journal of the American Statistical Association, vol. 67, no. 340, pp. 842–846, 1972.

[RAC 93] RACHEV S.T., MITTNIK S., “Modeling asset returns with alternative stable distributions”, Econometric Reviews, vol. 12, no. 3, pp. 261–330, 1993.

[RAC 00] RACHEV S.T., MITTNIK S., Stable Paretian Models in Finance, Wiley, New York, 2000.

[RAC 93] RACHEV S.T., XIN H., “Test for association of random variables in the domain of attraction of multivariate stable law”, Probability and Mathematical Statistics, vol. 14, no. 1, pp. 125–141, 1993.

[RAV 99] RAVISHANKER N., QIOU Z., “Monte Carlo EM estimation for multivariate stable distributions”, Statistics and Probability Letters, vol. 45, no. 4, pp. 335–340, 1999.

[SAK 10] SAKALAUSKAS L., “On the empirical Bayesian approach for the Poisson-Gaussian model”, Methodology and Computing in Applied Probability, vol. 12, no. 2, pp. 247–259, 2010.

[SAK 13] SAKALAUSKAS L., KALSYTE Z., VAICIULYTE I. et al., “The application of stable and skew t-distributions in predicting the change in accounting and governance risk ratings”, Proceedings of the 8th International Conference Electrical and Control Technologies, pp. 53–58, 2013.

[SAK 14] SAKALAUSKAS L., VAICIULYTE I., “Sub-gausinio vektoriaus skirstinio parametrų vertinimas Monte-Karlo Markovo grandinės metodu” [Estimation of the parameters of a sub-Gaussian vector distribution by the Markov chain Monte Carlo method], Jaunųjų mokslininkų darbai, vol. 41, no. 1, pp. 104–107, 2014.

[SAM 94] SAMORODNITSKY G., TAQQU M.S., Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance, Chapman and Hall, New York, 1994.

[STO 02] STOER J., BULIRSCH R., Introduction to Numerical Analysis, Springer Science and Business Media, New York, 2002.

PART 2

Statistics and Stochastic Data Analysis and Methods


5 Methods for Assessing Critical States of Complex Systems

In the study of complex systems (for example a human body), a large number of characteristics is required to assess the current state and forecast its development. It is usually impossible to measure all of them because of the lack of time and equipment. It thus becomes necessary to assess the state and the short-term dynamics of its change using signals that can be obtained in real time. It is assumed that the behavior of the system can be described by a characteristic RT, which is nearly periodic. In this way, investigating a complex system is reduced to analyzing the time series. This chapter provides an overview of the existing methods of such analysis. Particular attention is paid to the methods of nonlinear dynamics and the chaotic behavior of systems.

Chapter written by Valery ANTONOV.

5.1. Introduction

We consider a human body as an example of a complex system. In some cases, it is necessary to determine whether the system is in a stable state or undergoing substantial changes. It is even more important to recognize the transition of the system to a critical condition. For instance, physicians conducting an operation under general anesthesia face such a problem. It should be noted that in emergency cases, the decision time must not exceed five minutes; otherwise, irreversible changes may occur in a patient’s body in a state of clinical death. That is why one needs to evaluate the state and short-term dynamics in the body using characteristic signals obtained in real time.


One usually transforms the signal mathematically to obtain additional information. Most of the signals encountered in practice are in the time domain, i.e. the signal is a function of time. Thus, one obtains an amplitude–time representation of the signal. However, in many cases the most significant information is hidden in the frequency domain of the signal, i.e. in the frequency spectrum, which shows the (complex) amplitudes of the frequencies present. A moving window can be applied to the signal to preprocess it in real time. This makes it possible to consider the original signal as a sequence of intervals r_i. The problem that arises in our case is related to the search for indicators describing the relationship between the systems of heart rhythm regulation. To solve it, the methods of deterministic chaos theory are used. In our work, the object of study is heart rate variability (HRV). We process the measured signal using the mentioned methods implemented in a specially developed software package (Antonov and Zagainov 2015). It can also be applied to the analysis of other objects, in particular, various technological systems. In this chapter, we give an overview of the existing research procedures.

5.2. Heart rate variability

Recommendations for the physiological interpretation of successive intervals between the electrocardiogram (ECG) QRS complexes, given in 1996 by the European Society of Cardiology and the North American Society of Pacing and Electrophysiology (Malik et al. 1996), make such time series the most promising for modern non-invasive diagnostics. An important problem is the selection of the most significant criteria that determine the regulation of the heart rhythm. The modern approach to solving this problem is based on the development of methods of deterministic chaos as applied to the analysis of HRV. Heart rate variability is a physiological phenomenon of changing time intervals between heartbeats. It is measured by the variation in the “R–R interval” (R is the point corresponding to the peak of the QRS complex of the ECG wave, and R–R is the interval between successive Rs). The QRS complex is a combination of three of the graphical deflections seen on a typical ECG. The time series resulting from the ECG treatment need to be studied.


The numerical approach proposes the transition from the time series (signal) itself to a certain object formed in a phase space of finite dimensionality, which is called a reconstructed attractor. Various measures can be used for its characterization, of which the most fundamental appears to be its fractal dimension (Gudkov 2008). It should be noted that in order to determine the transition of an organism to a critical state, one must learn to distinguish real threats from variations associated with the daily changes in cardiac rhythm, taking into account possible cardiac pathologies. All this requires careful verification of the accuracy of the mathematical models being developed. Figures 5.1 and 5.2 give examples illustrating the concept of HRV.

Figure 5.1. Heart rate

Figure 5.2. Heart rate variability


5.3. Time-series processing methods

Figure 5.3 shows a diagram of methods for investigating time series.

Figure 5.3. Dynamic series analysis

Statistical methods are used to directly quantify the signal in the time interval under study (dispersion, coefficient of variation, etc.).

Time analysis consists in studying the law of distribution of the intervals, considered as random variables. We construct the corresponding histogram and determine its main characteristics, namely:

– Mo, the most frequent value of the interval in this dynamic series;

– AMo, the number of intervals corresponding to the value of the mode, in % of the sample size;

– TINN, the variation range. It is calculated from the difference between the maximum (Mx) and the minimum (Mn) values of the intervals and is sometimes denoted as MxDMn.

Autocorrelation analysis provides a way to recognize the latent periodicity in RT. See Figure 5.4, where:

– C1 is the value of the correlation coefficient after the first shift;

– C0 is the number of shifts which result in a negative value of the correlation coefficient.
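As a sketch, the autocorrelation function and the two indicators above can be computed as follows; the rr series here is random stand-in data, not a real measurement.

    import numpy as np

    def acf(x, max_lag):
        """Sample autocorrelation function of an interval series."""
        x = np.asarray(x, dtype=float) - np.mean(x)
        denom = x @ x
        return np.array([(x[:len(x) - k] @ x[k:]) / denom
                         for k in range(max_lag + 1)])

    rr = np.random.default_rng(0).normal(0.8, 0.05, size=500)  # stand-in R-R data
    r = acf(rr, max_lag=50)
    C1 = r[1]                     # correlation coefficient after the first shift
    C0 = int(np.argmax(r < 0))    # first shift giving a negative correlation
                                  # (0 if no lag up to max_lag goes negative)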


Figure 5.4. Autocorrelation function

Frequency analysis. Analysis of the spectral power density of the oscillations provides information on the power distribution as a function of frequency. The use of spectral analysis enables us to quantify the effect of various regulatory systems on the work of the heart. High-frequency (HF), low-frequency (LF) and very-low-frequency (VLF) components are selected and used for short-term ECG recordings. The HF component is associated with respiratory movements and reflects the effect of the vagus nerve on the work of the heart. The LF component characterizes the influence of both the sympathetic and parasympathetic nervous systems on the heart rhythm. The VLF and ULF (ultra-low-frequency) components reflect the effect of various factors, such as vascular tone, the thermoregulation system, etc. (Figure 5.5).

Figure 5.5. High-frequency and low-frequency parts of the spectrum
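A sketch of an FFT-based estimate of these band powers, together with the centralization index used later in this section. The band limits are the standard short-term HRV bands (Malik et al. 1996); rr_s is an assumed series of R-R intervals in seconds, resampled to an even grid before the transform.

    import numpy as np
    from scipy.interpolate import interp1d
    from scipy.signal import welch

    def band_powers(rr_s, fs=4.0):
        t = np.cumsum(rr_s)                       # beat times
        grid = np.arange(t[0], t[-1], 1.0 / fs)   # even resampling grid
        x = interp1d(t, rr_s)(grid)
        f, pxx = welch(x, fs=fs, nperseg=min(256, len(x)))
        def band(lo, hi):
            sel = (f >= lo) & (f < hi)
            return np.trapz(pxx[sel], f[sel])
        vlf, lf, hf = band(0.003, 0.04), band(0.04, 0.15), band(0.15, 0.4)
        return vlf, lf, hf, (hf + lf) / vlf       # last value: IC index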


There are parametric and nonparametric methods of spectral analysis. The first is related to autoregressive analysis, the second to the fast Fourier transform (FFT) and periodogram analysis. Both approaches give comparable results. In the spectral analysis, we calculate the main spectral quantities that correspond to fluctuations in the heart rhythm of different periodicity:

– absolute total and mean spectrum power;

– maximum harmonic value;

– centralization index IC = (HF + LF)/VLF.

The spectrum is calculated by means of the Fourier transform:

X(f) = ∫_{−∞}^{∞} x(t) e^{−2πift} dt.

Its disadvantage is that the time at which the frequency components occur is unknown. There are many widely used signal transformations other than the Fourier transform. To name a few, there are the Hilbert transform, the windowed Fourier transform, the Wigner distribution, the Walsh transform, the wavelet transform and so on. For each transformation, one can specify the most suitable area of application, its merits and its disadvantages.

Correlation rhythm graph. The method consists of graphically displaying successive pairs of intervals (the previous and the following) in the two-dimensional coordinate plane, thus obtaining what we call a scattergram (Figure 5.6). When constructing a scattergram, a set of points is formed whose center is located on the bisector of the first quadrant of the coordinate plane. The distance from the center to the origin corresponds to the most expected duration of the interval (Mo). The deviation of a point from the bisector to the left shows how much an interval is shorter than the previous one; to the right, how much it is longer.


Figure 5.6. Scattergram

The above methods have a serious drawback: with their help, it is difficult to determine at which moment in time the system undergoes significant changes. This fact makes them inadequate for operational diagnostics. At present, methods of nonlinear dynamics and fractal analysis, as well as wavelet transforms, which allow the moment of serious changes in the behavior of the system to be determined, are increasingly used.

Fractal analysis assesses the fractal characteristics of data (Alligood, Sauer, and Yorke 1997). From the nonlinear point of view, the processes under investigation contain deterministic chaos, while the linear viewpoint sees these processes as stochastic (Mandelbrot 1977). For an exhaustive description of the system state, many variables are needed, combined into a vector from the phase space of states Q(t) = (q1(t), q2(t), …, qn(t)). Phase portraits of systems with chaotic behavior, regardless of the initial conditions, come to a certain area of the phase space, i.e. the attractor of the system. It should be noted that the phase portrait of a nonlinear system usually is a multifractal. A multifractal system is a generalization of a fractal system, needed when a single exponent or fractal dimension is insufficient to describe the system dynamics; instead, a continuous spectrum of exponents is needed.


A numeric characteristic that determines the chaotic state of a system can be its entropy. To quickly diagnose transient states in the behavior of a system, the Renyi entropy is usually chosen:

H_q = (1/(1 − q)) ln ∑_i p_i^q.

It tends to the Shannon entropy as q → 1. At q = 0, this is the Kolmogorov–Hausdorff dimension. The idea here is to weight the probability of the most often visited cubes according to the order of the dimension.
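A direct sketch of the Renyi entropy of a box-probability vector, with the Shannon limit handled explicitly:

    import numpy as np

    def renyi_entropy(p, q):
        """Renyi entropy of a probability vector p; -> Shannon entropy as q -> 1."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        if np.isclose(q, 1.0):
            return float(-np.sum(p * np.log(p)))   # Shannon limit
        return float(np.log(np.sum(p ** q)) / (1.0 - q))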

An embedding process has to precede any estimation of fractals from a data series. The main practical issue lies in the choice of the embedding variable and the embedding delay. If the measured variable at time t is x(t), then an (n+1)-dimensional embedding is defined as [x(t), x(t+τ), …, x(t+nτ)]. The right embedding delay can be estimated by the first zero crossing of the autocorrelation function or, better, by the first local minimum of the mutual information. However, the probability of finding a point on the attractor is also necessary. Usually, the information dimension and the related informational entropy are used for this purpose; the correlation dimension and the correlation entropy are also used. Theoretically, according to the Takens theorem, any state variable can be used to calculate the invariants of the dynamics (Takens 1980). But in practice only an estimation, not an accurate calculation, is done. This raises the problem of the convergence of the computational process, which is related to the amount of information that needs to be processed. A detailed exposition of the method allowing operational diagnostics to be carried out on the basis of calculating the fractal dimension of the system attractor can be found in the work (Antonov, Zagainov and Kovalenko 2016). In the same work, one can also find a detailed description of our software used for the rapid diagnosis of a body state.
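A sketch of the embedding step and of the correlation sum on which the correlation dimension estimate is based (the slope of log C(r) against log r); this is a brute-force O(n²) version for illustration only, not the optimized procedure of the software package.

    import numpy as np

    def embed(x, dim, tau):
        """Delay embedding [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
        n = len(x) - (dim - 1) * tau
        return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

    def correlation_sum(pts, r):
        """Fraction of point pairs closer than r (Grassberger-Procaccia)."""
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        iu = np.triu_indices(len(pts), k=1)
        return float(np.mean(d[iu] < r))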


Some results for the attractor’s correlation dimension are shown in Figures 5.7 and 5.8.

Figure 5.7. Trend of correlation dimension. Pneumonia

Figure 5.8. Severe pneumonia

In severe pneumonia, the trend of the correlation dimension reaches approximately two during the day. Finally, in the near-death state the trend does not change much for a period of time. However, later on, sharp fluctuations start, followed by a rapid drop down to zero (the case of death). This drop is observed over several minutes. Nevertheless, owing to the sharp leap after a long period of a calm system condition, the dropping of the trend can be predicted.

Wavelet analysis provides important information about the mathematical morphology of a signal (Daubechies 1992). In our software package (Antonov and Zagainov 2015), we use the wavelet transform modulus maxima (WTMM) method (Mallat and Hwang 1992). The WTMM formalism is suitable for analyzing multidimensional patterns, but the complexity increases fast when dimensions are added.


In our work we consider multifractals constructed using time series. As a result of the processing, a “scaling indicator” can be determined, which characterizes the properties of the multifractal as a whole. To implement this approach, we have developed a multifunctional computer program.

Wavelet transform modulus maxima (WTMM) is a method for detecting the fractal dimension of a signal. It is based on plotting the local maxima lines of a wavelet transform and makes it possible to partition the time–scale domain of a signal into regions of different fractal dimension. The continuous wavelet transform of f(x) is defined by:

W_ψ(a, b) = (1/√a) ∫_{−∞}^{∞} f(x) ψ((x − b)/a) dx,

where a is the scale parameter and b is the coordinate or time. The initial signal f(x) is decomposed using the function ψ(x), generated from a soliton-like one with special features by its scale measurements and shifts (Mallat and Hwang 1992). In the simplest version (Holder’s exponent h), the scaling of one of the lines (e.g. the maximum one) is studied:

|W_ψ(t_i, s)| ~ s^{h(t_i)}.

The Holder exponent, a measure of the degree to which a signal is differentiable, is used to detect the presence of damage and the time when it occurred (Pavlov and Anischenko 2007). A more complicated approach to obtaining the scaling is based on the analysis of all lines by introducing the partition function with the weighted degrees of all wavelet transform maxima:

P_q(s) = ∑_{l ∈ L(s)} |W_ψ(t_l, s)|^q,

where L(s) is the set of all lines (l) on which the modulus of the wavelet coefficients reaches a maximum at the scale s. The graph of the function k(q), defined by P_q(s) ~ s^{k(q)}, is given in Figure 5.9.


Figure 5.9. Scaling function
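A rough sketch of estimating k(q) from the partition function: take a continuous wavelet transform, keep the modulus maxima at each scale and regress log P_q(s) on log s. The Ricker (Mexican-hat) CWT used here lives in SciPy's signal module (deprecated in recent SciPy releases, where PyWavelets is the suggested replacement); this is an illustration, not the software package described above.

    import numpy as np
    from scipy.signal import cwt, ricker

    def wtmm_scaling(x, scales, qs):
        """Return k(q) estimates, one per q, via log-log regression."""
        w = np.abs(cwt(x, ricker, scales))        # |W(a, b)|, one row per scale
        ks = []
        for q in qs:
            logP = []
            for row in w:
                # modulus maxima: samples larger than both neighbours
                m = (row[1:-1] > row[:-2]) & (row[1:-1] > row[2:])
                logP.append(np.log(np.sum(row[1:-1][m] ** q) + 1e-300))
            slope, _ = np.polyfit(np.log(scales), logP, 1)
            ks.append(slope)                      # slope = k(q)
        return np.array(ks)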

5.4. Conclusion

We have given a survey of methods for analyzing the state of complex systems. These methods are divided into two large groups: linear and nonlinear. We have shown that the traditional approaches to signal analysis based on the methods of mathematical statistics and Fourier transforms give unsatisfactory results when the studied process is not stationary. We thus find that for non-stationary processes, multifractal analysis and the wavelet transform are the most appropriate. These methods enable us to monitor the status of the system in real time. To implement the methods of multifractal analysis, a special software package was developed, which includes:

– a mathematical model for evaluating the state of a human body, based on the fractal analysis of heart rate variability in real time;

– a program package able to carry out a study of time series in static and dynamic modes;

– a check of the adequacy of the developed software on classic examples of attractors;

– the analysis of real processes in healthy and sick people.

With some confidence, we can assert that the analysis of the data allows the estimated time of transition to a critical condition to be determined.

88

Data Analysis and Applications 4

5.5. References Alligood K.T., Sauer T., Yorke J.A. (1997). Chaos: an introduction to dynamical systems. Springer-Verlag. Antonov V., Zagainov A. and Kovalenko A. (2016). “Stochastic Models in Society. Fractal Analysis of Biological Signals in a Real Time Mode. Global and Stochastic Analysis”, GSA. 3(2), 75–84. Antonov V., Zagaynov A. (2015). “Software Package for Calculating the Fractal and Cross Spectral Parameters of Cerebral Hemodynamic. In a Real Time Mode. New Trends in Stochastic Modeling and Data Analysis. Demography and Related Applications”, ISAST, 440, 339–345. Daubechies I. (1992). Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics, Philadelphia, Pa. Gudkov G.V. (2008). “The role of deterministic chaos in the structure of the fetal heart rate variability”, Modern Problems of Science and Education, Moscow., Krasnodar, 1, 413–423. Malik M., Bigger J.T., Camm A.J., Kleiger R.E., Malliani A., Moss A.J. and Schwartz P.J. (1996). “Heart rate variability. Standards of measurement, physiological interpretation and clinical use”, European Heart Journal. 17. 354–381. Mallat S. and Hwang W.L. (1992). “Singularity Detection and Processing with Wavelets”, IEEE Transactions on Information Theory, 38(2), 617–643. Mandelbrot B. (1977). Fractals: Form, Chance, Dimension. Freeman, San-Francisco. Parlov A. N. and Anis-chenko B.C. (2007). “Multifractal analysis of complex signals”, Successes of Physical Science, 177(8), 859–876. Takens F. (1980). “Detecting strange attractors in turbulence”. In: Dynamical Systems and Turbulence. Lecture Notes in Mathematics, (eds) D.A.R. and L.S. Young. Heidelberg: Springer-Verlag. 366–381.

6 Resampling Procedures for a More Reliable Extremal Index Estimation

Extreme value theory (EVT) deals essentially with the estimation of parameters of extreme or rare events. Extreme events are usually described as observations that exceed a high threshold. In many environmental and financial applications, clusters of exceedances of that high threshold are of practical concern. One important parameter in EVT, which measures the amount of clustering in the extremes of a stationary sequence, is the extremal index, θ. It needs to be adequately estimated, not only by itself but also due to its influence on other parameters such as a high quantile, return period or expected shortfall. Some classical estimators of θ and their asymptotic properties are revisited. The challenges that appear for finite samples are illustrated. A resampling procedure that has shown to give good results in extreme value theory, the generalized jackknife methodology, is discussed and applied to improve the extremal index estimation. An extensive simulation study was performed, and some results are shown. Finally, a heuristic procedure, based on a stability criterion, is applied to some simulated samples to estimate θ.

6.1. Introduction and motivation Extreme value theory (EVT) is an area of increasingly vast applications in environmental problems. Many authors have presented research in several areas where disastrous extreme events can occur, such as sea levels (Smith [SMI 86] and Tawn [TAW 88]), river flows (Gumbel [GUM 58], Gomes [GOM 93] and Reiss and Thomas [REI 97]), pollution levels (Buishand

Chapter written by Dora P RATA G OMES and M. Manuela N EVES. Data Analysis and Applications 4: Financial Data Analysis and Methods, First Edition. Edited by Andreas Makrides, Alex Karagrigoriou and Christos H. Skiadas. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

90

Data Analysis and Applications 4

[BUI 89] and Carter and Challenor [CAR 81]), wind speeds (Walshaw and Anderson [WAL 00]), air temperatures (Smith et al. [SMI 97] and Coles et al. [COL 94]), precipitation levels (Coles and Tawn [COL 96]), burned areas (Díaz-Delgado et al. [DÍA 04] and Schoenberg et al. [SCH 03]) and earthquake thermodynamics (Lavenda and Cipollone [LAV 00]). In many practical applications, extreme conditions often persist over several consecutive observations. Inference regarding clusters of exceedances over a high threshold needs to be properly performed to control the risk for hazardous events. Under adequate general local and asymptotic dependence conditions, the limiting point process of exceedances of a high level un after a suitable normalization is a homogeneous compound Poisson process with intensity θτ and limiting cluster size distribution π (Hsing et al. [HSI 88]). That constant θ is the extremal index and plays an important role in extreme value theory for weakly dependent processes, reflecting the effect of clustering of extremes observation on the limiting distribution of the maximum. Suppose that {Xn }n≥1 is a strictly stationary sequence of random variables with marginal distribution function F . This sequence is said to have an extremal index θ ∈ (0, 1] if, for each τ > 0, there exists a sequence of levels (un (τ ))n∈N such that n[1 − F (un (τ ))] −→ τ and n→∞

  P M1,n ≤ un (τ ) −→ exp(−θτ ), n→∞

[6.1]

where M1,n = max{X1 , . . . , Xn } (Leadbetter et al. [LEA 83]). When θ = 1, the exceedances of high thresholds tend to occur isolated, as in the independent context. If θ < 1, we have groups of exceedances in the limit. The extremal index is the quantity that measures the amount of clustering of the extremes in a stationary sequence. Different probabilistic characterizations of θ led to the definition of different estimators for θ. Let us present a brief review of some of them. – Maxima characterization (Leadbetter et al. [LEA 83]) Let us consider that the strictly stationary sequence {Xn }n≥1 satisfies the D(un ) condition of Leadbetter et al. [LEA 83] and has a marginal distribution

Resampling Procedures for a More Reliable Extremal Index Estimation

91

function F . That D(un ) condition limits the long-range dependence in the sequence. For large n and un , the asymptotic equivalence is valid: P {M1,n ≤ un } ≈ F nθ (un ).

[6.2]

If there exist normalizing constants an (> 0) and bn such that F n (an x + bn ) −→ G(x), then G(x) is the distribution function of a GEV distribution, n→∞ and P {M1,n ≤ un } −→ H(x) = Gθ (x).

[6.3]

n→∞

θ is the key parameter for extending extreme value theory from independent and identically distributed random variables to stationary processes. – Down-crossing characterization (O’Brien [OBR 87]) An alternative characterization of θ, in terms of down-crossings, is given by O’Brien [OBR 87], for sequences that satisfy a weak mixing condition that locally restricts the occurrence of clusters P {M2,rn ≤ un |X1 > un } −→ θ,

[6.4]

n→∞

where M2,rn = max{X2 , . . . , Xrn } and rn determines a partition of the sample of length n such that rn → ∞ and rn = o(n). – Mean cluster size characterization (Hsing et al. [HSI 88]) Under a mixing condition which is slightly stronger than D(un ), the authors showed that the point process of exceedances converge weakly to a compound Poisson process, provided that n[1 − F (un (τ ))] −→ τ . The distribution πn (j; un , rn ) of the cluster sizes is given by:  πn (j; un , rn ) = P

rn 

I(Xi > un ) = j|

i=1

rn 

n→∞

 I(Xi > un ) > 0 , [6.5]

i=1

for j = 1, . . . , rn , rn → ∞ and rn = o(n), and I(·) denoting the indicator function. Under additional summability conditions on πn ,  j≥1

jπn (j; un , rn ) −→ θ−1 , n→∞

[6.6]

92

Data Analysis and Applications 4

i.e. the limiting mean number of exceedances of un in an interval of length rn corresponds to the arithmetic inverse of the extremal index. Therefore, we can write: θ−1 =



jπ(j).

[6.7]

j≥1

6.2. Properties and difficulties of classical estimators The process of identifying clusters of exceedances above a high threshold gave rise to different estimators. Identifying clusters by the occurrence of up-crossings (down-crossings) led to the classical up-crossing, U C  U C , of Nandagopalan [NAN 90] and Gomes (down-crossing), estimator, Θ [GOM 90, GOM 93], defined as:  U C (un ) :=  UC ≡ Θ Θ

n−1 I (X ≤ un < Xi+1 ) i=1 n i , i=1 I(Xi > un )

[6.8]

for a suitable threshold un , where I(·) is defined above. Consistency of this estimator is obtained provided that the high level un is a normalized level, i.e. with τ ≡ τn fixed, the underlying d.f. F verifies F (un ) = 1 − τ /n + o(1/n),

n → ∞ and τ /n → 0.

Two classical methods to define clusters are the blocks method and the runs method, [HSI 91, HSI 93]. The blocks estimator is derived by dividing the data into approximately kn blocks of length rn , where n ≈ kn ×rn , i.e. considering kn = [n/rn ]. Each block is treated as one cluster, and the number of blocks in which there is at least one exceedance of the threshold un is counted. The  B (un ), is then defined as: blocks estimator, Θ n B Θ n (un ) :=

kn

i=1 I



max X(i−1)rn +1 , · · · , Xirn > un n . i=1 I (Xi > un )

[6.9]

If we assume that a cluster consists of a run of observations between two exceedances, then the runs estimator is defined as:



n ≤ u I X > u , max X , · · · , X i n i+1 i+r −1 n n i=1 R  (un ) := n Θ . [6.10] n i=1 I (Xi > un )

Resampling Procedures for a More Reliable Extremal Index Estimation

93

 B = limn→∞ Θ  R = θ. Other properties Under mild conditions, limn→∞ Θ n n of these estimators have been well studied by Smith and Weissman [SMI 94] and Weissman and Novak [WEI 98]. Although showing very agreeable asymptotic properties, those estimators present several difficulties for finite samples. Indeed, they present the usual drawback common to most semi-parametric estimators: they are strongly dependent on the threshold, with the usual bias-variance trade-off. To modify those estimators, in order to obtain more stable and reliable path estimates, became a topic of intense research. Here, some procedures based on resampling methods are reviewed and improved by including a heuristic procedure for the choice of a tuning parameter and the threshold, based on a stability criterion. 6.3. Resampling procedures in extremal index estimation In this chapter, our attention will be focused on the U C estimator, in equation [6.8]. Given the sample Xn := (X1 , . . . , Xn ) and the associated ascending-order statistics, X1:n ≤ · · · ≤ Xn:n , we will consider the level un as a deterministic level u ∈ [Xn−k:n , Xn−k+1:n ). The U C estimator can now be written as a function of k, the number of top order statistics above the chosen threshold, n−1  UC ≡ Θ  U C (k) := 1 Θ I (Xi ≤ Xn−k:n < Xi+1 ). i=1 k

[6.11]

 U C (k) has two dominant For many dependent structures, the bias of Θ components of orders k/n and 1/k; see Gomes et al. [GOM 08],  UC

Bias[Θ

k 1 k 1 (k)] = ϕ1 (θ) + ϕ2 (θ) +o +o , n k n k

[6.12]

whenever n → ∞ and k ≡ k(n) → ∞, k = o(n). The generalized jackknife methodology, introduced by Gray and Schucany [GRA 72], has the properties of estimating the bias and the variance of any estimator, leading to the development of estimators with bias and mean squared error often smaller than those of an initial set of estimators.

94

Data Analysis and Applications 4

The generalized jackknife methodology states that if the bias has two main terms that we would like to reduce, we need to have access to three estimators, with the same type of bias. D EFINITION 6.1.– (Gray and Schucany [GRA 72]) (1)

(2)

(3)

Given three biased estimators of θ, Tn , Tn and Tn such that (i)

(i)

E[Tn(i) − θ] = b1 (θ)ϕ1 (n) + b2 (θ)ϕ2 (n)

i = 1, 2, 3,

the generalized jackknife statistic (of order 2) is given by:

(1) (2) (3)

T Tn Tn

n

(1) (3)

ϕ1 (n) ϕ(2) (n) ϕ (n) 1 1

(1) (2) (3)

ϕ (n) ϕ (n) ϕ (n) 2 2 2 GJ

. Tn :=

1 2 3



(1) (2) (3)

ϕ1 (n) ϕ1 (n) ϕ1 (n)

(1) (3)

ϕ2 (n) ϕ(2) 2 (n) ϕ2 (n) Using the information obtained from equation [6.12] and based on the  U C computed at the three levels, k, [δk] + 1 and [δ 2 k] + 1, where estimator Θ [x] denotes, as usual, the integer part of x, Gomes et al. [GOM 08] proposed a class of generalized jackknife estimators, depending on a tuning parameter δ,  GJ(δ) ≡ Θ  GJ(δ) (k), defined as: 0 < δ < 1, Θ  GJ(δ) Θ

UC 2

 U C ([δk] + 1) − δ Θ  ([δ k] + 1) + Θ  U C (k) (δ 2 + 1)Θ := . [6.13] (1 − δ)2

This is an asymptotically unbiased estimator of θ, in the sense that it can remove the two dominant components of bias referred to in equation [6.12].  GJ are consistent and  U C and Θ Under certain conditions, estimators Θ asymptotically normal if θ < 1; see Nandagopalan [NAN 90] and Gomes et al. [GOM 08]. 6.3.1. A simulation study of mean values and mean square error patterns of the estimators  U C and Θ  GJ(δ) estimators, let us For illustrating the properties of the Θ consider the following max-autoregressive process:

Resampling Procedures for a More Reliable Extremal Index Estimation

95

– Let {Zn }n≥1 be a sequence of independent, unit-Fréchet distributed random variables and Y0 a random variable with d.f. H0 (y) = exp −

y −1 (β −1 − 1) . For 0 < β < 1, let   Yj = β max Yj−1 , Zj , j = 1, 2, . . .

[6.14]

The extremal index of this process is θ = 1 − β (Alpuim [ALP 89]). A Monte Carlo simulation for the mean value and the mean square error  U C , and the generalized jackknife (MSE) of the up-crossing estimator, Θ GJ(δ)  estimators, Θ , for some values of δ (δ = 0.05 (0.05) 0.95) and several values of θ was performed. Some of the results obtained are shown in Figure 6.1. For this model and also for other models studied, the best values for mean values and mean square error depend on the value of δ. Among those values of δ producing the trajectories plotted in Figure 6.1, δ = 0.1 and δ = 0.25 seem to be the favorites.  



 

 

 

 









 U C and Θ  GJ(δ) with Figure 6.1. Mean values and MSE of Θ δ = 0.05, 0.1, 0.25, 0.5 and 0.6 for max-autoregressive processes with θ = 0.1 for samples of size n = 1, 000 and 1,000 replicates

It is then necessary to have procedures for choosing δ. Gomes et al. [GOM 08] present a complete study considering δ = 0.25. Although agreeable results were obtained with that choice, we think that they can be improved with an adequate choice of δ, depending on the sample.

96

Data Analysis and Applications 4

6.3.2. A choice of δ and k: a heuristic sample path stability criterion A path stability algorithm (see Gomes et al. [GOM 13] and Neves et al. [NEV 15]) has revealed quite agreeable results for extreme value parameters estimation and can now be adapted to the choice of δ, followed by the choice of kopt (in a given sense) for estimating θ.  GJ(δ) estimators: Let us see the description of the algorithm, for Θ 1) Given an observed sample (x1 , . . . , xn ), compute, for k = 1, . . . , n − 1,  GJ(δ) for a range of values of δ, 0 < δ < 1. the observed values of Θ 2) Obtain the rounded values, to 0 decimal places, of the estimates in the  GJ(δ)  GJ(δ) , 0), k = 1, 2, . . . , n − 1, the previous step. Define aΘ (0) = round(Θ k GJ(δ)  rounded values of Θ (k) to 0 decimal places. 3) For each value of δ, consider the sets of k values associated with equal  GJ(δ)  GJ(δ)  GJ(δ) Θ Θ consecutive values of aΘ (0), obtained in step 2. Set kmin and kmax k the minimum and maximum values, respectively, of the set with the largest  GJ(δ)  GJ(δ) Θ Θ range. The largest run size is then lΘ − kmin .  GJ(δ) := kmax 4) Choose the δ value, δ0 , which correspond to the largest value of lΘ  GJ(δ) . 5) Obtain j0 , the minimum value of j, a non-negative integer, such that the rounded values, to j decimal places, of the estimates θGJ(δ0 ) (k) are  GJ(δ0 ) ) (Θ distinct. Define ak (j) = round(θGJ(δ0 ) (k), j), k = 1, 2, . . . , n − 1, the GJ(δ ) 0 (k) to j decimal places. rounded values of θ 6) Consider the sets of k values associated with equal consecutive values (θGJ(δ0 ) )

(θGJ(δ0 ) )

(θGJ(δ0 ) )

(j0 ), obtained in step 5. Set kmin and kmax the minimum of ak and maximum values, respectively, of the set with the largest range. The largest θGJ(δ0 ) − k θGJ(δ0 ) . run size is then lΘ  GJ(δ0 ) := kmax min 0 ) 0 ) (θ (θ 7) Consider all those estimates, θGJ(δ0 ) , kmin ≤ k ≤ kmax , now GJ(δ0 ) ) ( θ GJ(δ ) 0 (k) = a with two extra decimal places, i.e. compute θ (j0 + 2). k GJ(δ )  0 (k) and denote K  GJ(δ ) the set of k-values Obtain the mode of θ GJ(δ )

associated with this mode.

Θ

GJ(δ )

0

8) Take kˆΘ  GJ(δ0 ) as the maximum value of KΘ  GJ(δ0 ) and consider the GJ(δ ) ˆ GJ(δ ) ). 0 (k adaptive estimate θ 0 θ

Resampling Procedures for a More Reliable Extremal Index Estimation

97

9) The best estimate is the value of θGJ(δ0 ) that corresponds to the maximum run size lΘ  GJ(δ0 ) computed in step 6. Table 6.1 presents the result of an application of the algorithm to three samples generated from the max-autoregressive process with θ = 0.1, θ = 0.5 and θ = 0.9, with the choice of δ, k and the associated estimates. θ

ˆ  GJ(δ ) θGJ(δ0 ) δ 0 lΘ  GJ(δ0 ) kΘ 0

0.1 0.05 0.5 0.25 0.9 0.05

599 715 619

479 399 499

0.033 0.497 0.895

ˆ  GJ(δ ) and the Table 6.1. Value of δ0 , the largest run size lΘ  GJ(δ0 ) , kΘ 0 best estimate of θ for three samples of size n = 1, 000 generated from the max-autoregressive process with θ = 0.1, θ = 0.5 and θ = 0.9

Figure 6.2 illustrates the application.  



 

 

 

  









  

 

   







 

 

   

  





 





 









 U C and Θ  GJ(δ) with Figure 6.2. Mean values and MSE of Θ δ = 0.05, 0.1, 0.25, 0.5 and 0.6 for max-autoregressive processes with θ = 0.5 (top) and θ = 0.9 (bottom), for samples of size n = 1, 000 and 1,000 replicates



98

Data Analysis and Applications 4









 

 

 

 









 











 

















         





Figure 6.3. The “optimal” choice of δ and k for the sample paths of the  GJ(δ) , with δ = 0.05, 0.1, 0.25, 0.5, for three estimates obtained from Θ samples of size n = 1, 000 generated from the max-autoregressive process with θ = 0.1 (top left), θ = 0.5 (top right) and θ = 0.9 (bottom)

6.4. Some overall comments – EVT is now a statistical domain that has revealed a great interest and a strong development, motivated for the challenges put by relevant and recent applications, many of them in finance, climatology and environment. – In the semi-parametric approach, topics such as the threshold selection and the bias reduction, as well as the search for stable and reliable sample paths, continue motivating intense research. – Resampling methodologies that need to be adapted when dealing with extremes are revealing promising results.

Resampling Procedures for a More Reliable Extremal Index Estimation

99

– To illustrate the application of the aforementioned resampling procedures, an extensive simulation study, considering several models, was performed. – For a given sample, the choice of δ was performed on the basis of the stability criterion described above. Another approach could be to proceed with a block bootstrap, as described in Prata Gomes and Neves [PRA 18], on the generalized jackknife estimates, to get a smoother path. 6.5. Acknowledgements This research was partially supported by National Funds through FCT – Fundação para a Ciência e a Tecnologia, projects UID/MAT/00297/2013 (CMA) and PEst-OE/MAT/UI0006/2013, 2019 (CEA/UL). 6.6. References [ALP 89] A LPUIM T., “An extremal markovian sequence”, Journal of Applied Probability, vol. 26, pp. 219–232, 1989. [BUI 89] B UISHAND A.,“Statistics of extremes in climatology”, Statistica Neerlandica, vol. 43, pp. 1–30, 1989. [CAR 81] C ARTER T., C HALLENOR G., “Estimating return values of environmental parameters”, Quarterly Journal of the Royal Meteorological Society, vol. 107, pp. 259–266, 1981. [COL 94] C OLES S., TAWN J., S MITH R., “A sazonal Markov model for extremely low temperatures”, Environmetrics, vol. 5, pp. 221–339, 1994. [COL 96] C OLES S., TAWN J., “A Bayesian analysis of extreme rainfall data”, Applied Statistics, vol. 45, pp. 463–478, 1996. [DAV 11] DAVIDSON A., “Statistics of extremes” Courses 2011–2012, École Polytechnique Fédérale de Lausanne (EPFL), 2011. [DÍA 04] D ÍAZ -D ELGADO R., L LORET F., P ONS X., “Spatial patterns of fire occurrence in Catalonia”, Landscape Ecology, vol. 19, no. 7, pp. 731–745, Spain, 2004. [GOM 90] G OMES M.I., “Statistical inference in an extremal markovian model”, CompStat, pp. 257–262, 1990. [GOM 93] G OMES M.I., “On the estimation of parameters of rare events in environmental time series”, Statistics for the Environment, pp. 226–241, 1993. [GOM 08] G OMES M.I., H ALL A., M IRANDA C., “Subsampling techniques and the Jackknife methodology in the estimation of the extremal index”, Computational Statistics and Data Analysis, vol. 52, no. 4, pp. 2022–2041, 2008.

100

Data Analysis and Applications 4

[GOM 13] G OMES M.I., H ENRIQUES -RODRIGUES L., F RAGA A LVES M.I. et al., “Adaptive PORT-MVRB estimation: An empirical comparison of two heuristic algorithms”, Journal of Statistical Computation and Simulation, vol. 83, no. 6, pp. 1129–1144, 2013. [GRA 72] G RAY H., S CHUCANY W., The Generalized Jackknife Statistic, Marcel Dekker, New York, 1972. [GUM 58] G UMBEL E., Statistics of Extremes, Columbia University Press, New York, 1958. [HSI 91] H SING T., “Estimating the parameters of rare events”, Stochastic Processes and their Applications, vol. 37, pp. 117–39, 1991. [HSI 93] H SING T., “Extremal index estimation for a weekly dependent stationary sequence”, Annals of Statistics, vol. 21, pp. 2043–2071, 1993. [HSI 88] H SING T., H USLER J., L EADBETTER M., “On exceedance point process for a stationary sequence”, Probability Theory and Related Fields, vol. 78, pp. 97–112, 1988. [LAV 00] L AVANDA B., C IPOLLONE E., “Extreme value statistics and thermodynamics of earthquakes: Aftershock sequences”, Annals of Geophysics, vol. 43, no. 5, pp. 967–982, 2000. [LEA 83] L EADBETTER M., “Extremes and local dependence in stationary sequences”, Z. Wahrsch. Verw. Gebiete, vol. 65, no. 2, pp. 291–306, 1983. [LEA 83] L EADBETTER M., L INDGREN G., ROOTZÉN H., Extremes and Related Properties of Random Sequences and Processes, Springer-Verlag, New York, 1983. [LEA 89] L EADBETTER M., NANDAGOPALAN L., “On exceedance point process for stationary sequences under mild oscillation restrictions”, in Extreme Value Theory: Proceedings, Oberwolfach 1987, H ÜSLER J., R EISS R.D. (eds), Lecture Notes in Statistics, vol. 51, pp. 69–80, Springer-Verlag, Berlin, 1989. [NAN 90] NANDAGOPALAN S., Multivariate Extremes and Estimation of the Extremal Index, PhD Thesis, University of North Carolina, Chapel Hill, 1990. [NEV 15] N EVES M., G OMES M. I., F IGUEIREDO F. et al., “Modeling Extreme Events: Sample Fraction Adaptive Choice in Parameter Estimation”, Journal of Statistical Theory and Practice, vol. 9, no. 1, pp. 184–199, 2015. [OBR 87] O’B RIEN G., “Extreme values for stationary and Markov sequences”, Annals of Probability, vol. 15, pp. 281–291, 1987. [PRA 15] P RATA G OMES D., N EVES M., “Bootstrap and other resampling methodologies in statistics of extremes”, Communications in Statistics – Simulation and Computation, vol. 44, no. 10, pp. 2592–2607, 2015. [PRA 18] P RATA G OMES D., N EVES M., “Revisiting resampling methods in the extremal index estimation: Improving risk assessment”, in Recent Studies on Risk Analysis and Statistical Modeling, O LIVEIRA T.A., K ITSOS C., O LIVEIRA A. et al. (eds), Springer International Publishing, 2018. [REI 97] R EISS D., T HOMAS M., Statistical Analysis of Extreme Values: With Applications to Insurance, Finance, Hydrology and Other Fields, Birkhaüser Verlag, 1997.

Resampling Procedures for a More Reliable Extremal Index Estimation

101

[SCH 03] S CHOENBERG F., P ENG R., H UANG Z. et al., “Detection of nonlinearities in the dependence of burn area on fuel age and climatic variables”, International Journal of Wildland, vol. 12, no. 1, pp. 1–10, 2003. [SMI 86] S MITH R., “Extreme value theory based on the r largest annual events”, Journal of Hydrology, vol. 86, pp. 27–43, 1986. [SMI 94] S MITH R., W EISSMAN I., “Estimating the extremal index”, J. R. Statist. Soc. B, vol. 56, pp. 515–528, 1994. [SMI 97] S MITH M., M ALATHY D EVI V., B ENNER D. et al., “Temperature dependence of air-broadening and shift coefficients of O3 lines in the ν1 band”, Journal of Molecular Spectroscopy, vol. 182, pp. 239–259, 1997. [TAW 88] TAWN J., “An extreme value theory model for dependent observations”, Journal of Hydrology, vol. 101, pp. 227–250, 1988. [WAL 00] WALSHAW D., A NDERSON C.W., “A model for extreme wind gusts”, Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 49, pp. 499–508, 2000. [WEI 98] W EISSMAN I., N OVAK S., “On blocks and runs estimators of the extremal index”, Journal of Statistical Planning and Inference, vol. 66, pp. 281–288, 1998.

7 Generalizations of Poisson Process in the Modeling of Random Processes Related to Road Accidents

The stochastic process theory provides concepts and theorems that enable us to build probabilistic models concerning accidents. A crucial role in the construction of the models plays a Poisson process and its generalizations. The non-homogeneous Poisson process and corresponding non-homogeneous compound Poisson process can be applied for modeling the number of road, sea and railway accidents in the given time intervals. Those stochastic processes are used for modeling the road accident number, and number of injured people and fatalities. To estimate model parameters, data coming from the annual reports of the Polish police are used.

7.1. Introduction A Poisson distribution is given by the rule: ( )=

( ) !



, ∈

= {0,1,2, … }, Λ > 0.

In 1837, Siméon Denis Poisson derived this distribution to approximate the Binomial Distribution when a parameter , determining the probability of success in a single experiment, is small. The application of this distribution was not found until von Bortkiewitsch (1898) calculated from the data of the Prussian army, the number of soldiers who died during the 20 consecutive

Chapter written by Franciszek GRABSKI. Data Analysis and Applications 4: Financial Data Analysis and Methods, First Edition. Edited by Andreas Makrides, Alex Karagrigoriou and Christos H. Skiadas. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

104

Data Analysis and Applications 4

years because of the kick by a horse. A random variable, say , denoting the number of solders killed accidentally by the horse kick per year, turned out to have a Poisson distribution: ( )= ( = )= with parameter Λ = 0.61 [

( ) !



, ∈ .

].

Since then, the Poisson distribution and its associated Poisson random process have found applications in various fields of science and technology. A Poisson process and its extensions are used in safety and reliability problems. They enable us to construct the number of road, sea and railway accidents in given time intervals. A non-homogeneous Poisson process (NPP) in modeling accident number in Baltic Sea and Seaports was presented in the coference Summer Safety and Reliability Seminars 2017 and is published in the Journal of Polish Safety and Reliability Association (Grabski F., 2017). The non-homogeneous compound Poisson process (NCPP) enables us to anticipate the number of injured people and fatalities. In the elaboration of the theoretical part of the article, books from Fisz M. (1969), Grabski F. (2015), Limnios N. and Oprisan N.G. (2001), Shiryayev A.N. (1984) and papers from Di Crescenzo A., Martinucci B. and Zacks S. (2015), Fisz M. (1969), Grabski F. (2017), Grabski F. (2018), Zinchenko N. (2017) were used. The statistical data come from the annual police reports (Symon E. 2018, Symon E. 2019). 7.2. Non-homogeneous Poisson process Let { ( ): ≥ 0} be a stochastic process taking values on = {0,1,2, … }, value of which represents the number of events in a time interval [0, ]. A counting process { ( ): ≥ 0} is said to be non-homogeneous Poisson process (NPP) defined by an intensity function ( ) ≥ 0, ≥ 0, if 1) ( (0) = 0) = 1;

[7.1]

2) The process { ( ): ≥ 0} is the stochastic process with independent increments, the right continuous and piecewise constant trajectories; 3) ( ( + ℎ) − ( ) = ) =

( ) !

( )

.

[7.2]

Generalizations of Poisson Process in the Modeling of Random Processes

105

From the definition, it follows that the one-dimensional distribution of NPP is defined by the rule ( )

( ( ) = ) =

( )

!

, = 0,1,2, …

[7.3]

The expectation and variance of NPP are the functions given by the rules Λ( ) = [ ( )] =

( )

,

[7.4]

V(t) = [ ( )] =

( )

, ≥ 0.

[7.5]

The corresponding standard deviation is determined by the formula D(t) =

( )

[ ( )] =

, ≥ 0.

[7.6]

The expected value of an increment ( + ℎ) − ( ) is Δ( ; ℎ) = ( ( + ℎ) − ( )) =

( )

.

[7.7]

The corresponding standard deviation is given by the formula D( ; ℎ) =

( ( + ℎ) − ( )) =

( )

[7.8]

An NPP with ( ) = , ≥ 0 for each t ≥ 0, is a regular Poisson process. Its probability distribution takes the form ( ( ) = ) =

( ) !



, = 0,1,2; ≥ 0.

The increments of the NPP are independent, but not necessarily stationary. As a process with independent increments, an NPP is a Markov random process.

106

Data Analysis and Applications 4

7.3. Model of the road accident number in Poland Table 7.1 shows the number of road accidents and their consequences in Poland in 2007–2018. The data come from the annual police reports Symon E. (2018), Symon E. (2019). Year

Killed Injured Number Interval Center Number Number number/accident number/accident of of of of [year] interval accidents fatalities injured number number

1

2

3

4

5

6

7

8

2007

[0, 1)

0,5

49,536

5,583

63,224

0.1127

1,2763

2008

[1, 2)

1,5

49,054

5,432

62,097

0.1108

1,2658

2009

[2, 3 )

2,5

44,196

4,572

56,046

0.1034

1,2681

2010

[3, 4)

3,5

38,832

3,907

48,952

0.1006

1,2606

2011

[4, 5)

4,5

40,065

4,189

49,501

0.1045

1,2355

2012

[5, 6)

5,5

37,046

3,571

45,792

0.0963

1,236

2013

[6, 7)

6,5

35,847

3,357

44,059

0.0957

1,2571

2014

[7, 8)

7,5

34,970

3,209

42,545

0.0915

1,2166

2015

[8, 9)

8,5

32,967

2,938

39,778

0.0891

1,2066

2016 [9, 10)

9,5

33,664

3,026

40,766

0.0898

1,2109

2017 [10,11)

10,5

32,760

2,831

39,466

0.0864

1,2047

2018 [11,12)

11,5

31,674

2,862

37,359

0.0903

1,1794

Table 7.1. Number of road accidents and their consequences in Poland in 2007–2018

Let { ( ); ≥ 0} be a stochastic process taking values on = {0,1,2, … }. A value of the process represents the number of road accidents in Poland in a time interval [0, ]. Due to the nature of these events, preassumption that it is an NPP with some parameter ( ) > 0 seems to be justified. The expected value of this process increment is given by [7.7], while its one dimensional distribution is determined by [7.3]. We can practically use these rules if the function ( ) > 0 is known.

Generalizations of Poisson Process in the Modeling of Random Processes

107

7.3.1. Estimation of model parameters We approximate the empirical intensity by a linear regression function =

+

[7.9]

Recall that parameters problem: [

( , ) = ∑

and are the solution of the optimization −(

+ )] →

[7.10]

The parameters are given by the rules: = ̅=

, = = = ∑ = ∑





,

[7.11]

, = , ,

=

=



,

− =

, −

.

60,000 50,000 40,000 30,000 20,000 10,000 0 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5

Figure 7.1. The empirical intensity of the road accidents in Poland

To define the function ( ), ≥ 0, we use information presented in Table 7.1. The statistical analysis of the data shows that the empirical hazard rate can be approximated by the linear function ( ) = + .

108

Data Analysis and Applications 4

Using EXCEL to find a linear regression function, we obtain the values of parameters and : = –1617.7;

=48090.42.

Therefore, the linear intensity of accidents is ( ) = −1617.7 + 48090.42; ≥ 0.

[7.12]

From [7.4], we have Λ( ) =

(−1617.7 + 48090.42)

.

Hence, we obtain Λ( ) = −808.848

+ 48090.42 ; where 0 ≤ ≤ 12 [years].

[7.13]

From [7.3] and [7.4], we obtain the one-dimensional distribution of the NPP that denotes the number of road accidents: ( ( ) = ) =

( )

( )

!

,

= 0,1,2, ….

[7.14]

7.3.2. Anticipation of the accident number Let us recall that ( ( + ℎ) − ( ) = ) =

[ (

)

( )]

[ (

)

!

( )]

.

[7.15]

It means that we can anticipate the number of accidents at any time interval with a length of h. The expected value of an increment ( + ℎ) − ( ) is defined by [7.7]. For the function ( ) = + , ≥ 0, we obtain the expeted value of accidents in the time interval [ , + ℎ] ( + ℎ) − ( ) = Δ( ; ℎ) = ℎ



+ + .

[7.16]

Generalizations of Poisson Process in the Modeling of Random Processes

109

The corresponding standard deviation is D( ; ℎ) =

ℎ(



+ + ) .

[7.17]

EXAMPLE 7.1.– We want to anticipate the number of road accidents in Poland from June 1, 2019 to September 30, 2019. We also want to calculate the probability of a given number of that kind of accidents. Using the formula [7.15], we obtain the expected value of road accidents from June 1, 2019 to September 30, 2019. The intensity function ( ) = −1617,7 + 48090,42; ≥ 0, in 2019, takes arguments from the interval [11, 12). From January 1, 2019 to June 1, 2019, 151 days have passed. Hence, = 11 + 151/365 = 11,413699 years. From June 1 to September 30, ℎ = 122/365 = 0,334247 years have passed. For these parameters, using [7.16] and [7.17], we obtain Δ( ; ℎ) = 9812,201695,

D( ; ℎ) = 99,05655806.

[7.18]

It means that the average predicted number of the road accidents between June 1, 2019 and September 30, 2019 will be about 9,812 accidents with a standard deviation of about 99. For example, the probability that the number of accidents in this time interval will be not greater than d = 10010 and not less than c = 9611 is = (9611 ≤ ,

≅∑

!

,

( + ℎ) − ( ) ≤ 10010) ≅

.

Applying approximation by the standard normal distribution, we get







. .





. .

≅ 0.95.

7.4. Non-homogeneous compound Poisson process We assume that { ( ): ≥ 0} is a non-homogeneous Poisson process , … is a (NPP) determined by a function ( ) ≥ 0 for ≥ 0, and ,

110

Data Analysis and Applications 4

sequence of the independent and identically distributed (i.i.d.) random variables independent of { ( ): ≥ 0}. A stochastic process ( )=

+

+ ⋯+

( ) ,

≥0

[7.19]

is said to be a non-homogeneous compound Poisson process (NCPP). PROPOSITION 7.1.– Let { ( ): ≥ 0} be an NCPP. If

(

) < ∞, then

1) [ ( )] = Λ( ) ( 2) [ ( )] = Λ( ) (

)

[7.20] ),

[7.21]

where Λ( ) = [ ( )] =

( )

.

Proof: Applying the property of conditional expectation [ ( )] = [ ( ( )| ( ))] we have [ ( ( )| ( ))] =

(

+

+ ⋯+

( ))

∑∞ (( + + ⋯ + ) ( ( ) = ) = ∑∞ ( ) ( ( ) = ) = (

)

( ) = Λ( ) (

)

Using the formula [ ( )] = [ ( ( )| ( ))] + [ ( ( )| ( ))]

( ) =

Generalizations of Poisson Process in the Modeling of Random Processes

111

we get [ ( ( )| ( ))] =

+

(

( ))

( )=

( ( )= )=

= ∑∞



+

+ ⋯+

( )

= ∑∞

(

+

+⋯+

) ( ( ) = ) =

= ∑∞

(

) ( ( ) = ) = (

( ( )| ( ) ] = = ( (

+

(

) ( )) = ( (

)

+⋯+

( ) = ( ( ))

( ) =( (

))

( ) =

+ ⋯+

)Λ(

),

( ) =

))

( ).

Therefore, [ ( )] = ( )Λ( ) + ( ( ( ( )) + ( ( )) ] = = Λ( ) (

))

( ) = ( )[ (

)−

).

COROLLARY 7.1.– Let { ( + ℎ) − ( ): ≥ 0} be an increment of the NCPP. If (

) < ∞, then

[ ( + ℎ) − ( )] = Δ( ; ℎ) ( [ ( + ℎ) − ( )] = Δ( ; ℎ) (

), ),

[7.22] [7.23]

where Δ( ; ℎ) =

( )

.

[7.24]

112

Data Analysis and Applications 4

PROPOSITION 7.2.– If { ( ): ≥ 0}is an NPP with an intensity function ( ), ≥ 0 such that ( ) ≥ 0 for ≥ 0, then the cumulative distribution function (CDF) of the NCPP is given by the rule ( , )=

[ , )(

( )

)

+∑

( ; )

( )

( ),

[7.25]

where ( )

( ) denotes the k-fold convolution of the CDF of the random variables , i = 1,2,… and ( ; )=

( ( ))

( )

!

Λ( ) = [ ( )] =

,

≥ 0,

( )

= 0,1, …,

[7.26]

.

[7.27]

Proof: Using the total probability low, we obtain the CDF of the NCPP: ( , ) = ( ( ) ≤ )= =∑ =∑

+⋯+ ( ; )

( )

( )

( )=

+

+⋯+

≤ | ( )= [ , )(

)

( )

( )



=

( ( )= )= + ∑

( ; )

( )

( ).

COROLLARY 7.2.– If the random variables , i = 1,2,… have a discrete probability function ( ) = ( = ), ∈ , then the discrete distribution function of the NCPP is given by the rule ( , )=∑

( ; )

( )

( ) , > 0

[7.28]

( )

where ( ) denotes k-fold convolution of the discrete probability distribution ( ), = 0,1,2, … of the random variable . It should be noted that the results presented above are known for homogeneous Poisson processes – equations [7.25] and [7.28] are presented in paper Di Crescenzo A., Martinucci B. and Zacks S. (2015).

Generalizations of Poisson Process in the Modeling of Random Processes

113

7.5. Data analysis Columns 7 and 8 in Table 7.1 contain the frequency of fatalities and injured people with respect to the road accident number in Poland in 2007–2018. Figures 7.2 and 7.3 show that the frequencies of fatalities and injured people with respect to the road accident have a decreasing trend over time. The variance of the frequency of injured people is much greater than in the case of fatalities. 0.12 0.1 0.08 0.06 0.04 0.02 0 0.5

1.5

2.5

3.5

4.5

5.5

6.5

7.5

8.5

9.5 10.5 11.5

Figure 7.2. Frequency of fatalities with respect to the road accident number

1.3 1.28 1.26 1.24 1.22 1.2 1.18 1.16 1.14 1.12 0.5

1.5

2.5

3.5

4.5

5.5

6.5

7.5

8.5

9.5 10.5 11.5

Figure 7.3. Frequency of injured people with respect to the road accident number

7.6. Anticipation of the accident consequences We suppose that the random variables , = 1,2, … have the Poisson distribution with parameters ( ) = ( ) = , = 1,2, … , ( ). From

114

Data Analysis and Applications 4

the data analysis, it follows that the parameter depends on time: = ( ). This function we approximate by linear function using data from Table 7.1. Using EXCEL for computing the linear regression, we obtain the value of parameters: = – 0,002301748,

= 0,111402156

Hence, ( ) = − 0,002301748 + 0,10021139886, ≥ 0.

[7.29]

To apply results from 4, we should assume that in the interval of prediction [ , + ℎ] for given and small , the random variables are i.i.d. In our cases, the function [7.29] in time interval [ , + ℎ] is almost constant if parameter h is not large. Therefore, to simplify the model, we suppose ( )= ( )=

=

( )

(

)

[7.30]

From Corollary 7.1, we have [ ( + ℎ) − ( )] = Δ( ; ℎ) [ ( + ℎ) − ( )] = Δ( ; ℎ) ( + [ ( + ℎ) − ( )] =

Δ( ; ℎ) ( +

[7.31] )

[7.32] )

[7.33]

where Δ( ; ℎ) = ℎ



+ + .

[7.34]

EXAMPLE 7.2.– Example 7.2 is a continuation of Example 7.1. Recall that ( ) = −1617,7 + 48090,42; ≥ 0. For these parameters, using [7.15] and [7.16], we obtain Δ( ; ℎ) = 9812,201695,

D( ; ℎ) = 99,05655806.

Generalizations of Poisson Process in the Modeling of Random Processes

115

The expected number of fatalities at the time interval [ , + ℎ] is described by an increment ( + ℎ) − ( ). Now, we want to predict the number of fatalities in the road accidents in Poland from June 1, 2019 to September 30, 2019. Using results from Example 7.1, we have = 11,413699 years. For these parameters, we get = −1617,7, = 48090,42, Δ( ; ℎ) = 9812,202, D( ; ℎ) = 99,06. h = 0,334247,

( ) = − 0,002301748 + 0,10021139886, ≥ 0.

From [7.29] and [7.30], we obtain = 0,073555264 Applying the rule = ( ) + [ ( )] = for

= ,

+

, we get = 0,078965641

[ ( + ℎ) − ( )] =

Δ( ; ℎ) ( +

) = 27.84

Using [7.31], we obtain the predicted number of fatalities in the road accidents in Poland in the considered period of time. Finally, we obtain the expected value of fatalities (EFN): EFN = 721.74. Using [7.33], we obtain the predicted standard deviation of fatality number (DFN) in the road accidents in Poland in this time interval: DFN = 27.84. In the same way, we calculate the parameters of the model that enables us to anticipate the number of injured people in the road accidents in Poland in this period.

116

Data Analysis and Applications 4

To obtain the expectation (EIN) and standard deviation (DIN) of the injured people number, we use the rules [7.31] and [7.33]. For ( ) = − 0,008118182 + 1,283509091, h = 0,334247, we have

= 11,413699,

=1,18949387 and Δ( ; ℎ) = 11442,96. Figure 7.4 shows the Poisson distribution with parameter = 1,189 denoting the probability distribution of the injured number in a single accident. 0.4 0.3 0.2 0.1 0 0

1

2

3

4

5

6

7

Figure 7.4. Probability distribution of the injured number in a single accident

Finally, we get EIN = 11671,55

and

DIN = 159,86.

7.7. Conclusion The random process theory delivers concepts and theorems that enable us to construct stochastic models releted to the road accidents. The processes with independent increaments are the most appropriate for modeling the number of accident in a specified time interval. A crucial role in the model construction plays a non-homogeneous Poisson process. The presented models enable us to anticipate the number of accidents in a certain period of time and their consequences. The identification of almost real parameters was possible thanks to the statistical data derived from the police report (Symon E. 2018, Symon E. 2019). The non-homogeneous compound Poisson processes as the models of the road accident consequences enable us

Generalizations of Poisson Process in the Modeling of Random Processes

117

to anticipate the number of injured people and fatalities. To explain the constructed models, two examples have been presented. 7.8. References Di Crescenzo A., Martinucci B. and Zacks S. (2015). “Compound Poisson process with a Poisson subordinator”. Journal of Applied Probability, Applied Probability Trust, 52, (2), 360–374. Fisz M. (1969). Probability and Mathematical Statistics, Warsaw PWN, Warsaw (in Polish). Grabski F. (2015). Semi-Markov Processes: Application in System Reliability and Maintenance. Elsevier, Amsterdam. Grabski F. (2017). “Nonhomogeneous Poisson process in modelling accidents number in Baltic Sea waters and ports”. Journal of Polish Safety and Reliability Association; Summer Safety and Reliability Seminars, 8, (1), 39–46. Grabski F. (2018). “Nonhomogeneous stochastic processes connected to Poisson process”. Scientific Journal of Polish Naval Academy, 2 (213), 5–15, Gdynia (LIX). Limnios N. and Oprisan N.G. (2001). Semi-Markov Processes and Reliability. Birkhauser, Boston. Shiryayev A.N. (1984). Probability. Springer-Verlag, New York. Symon E. (2018). “Traffic accidents in Poland in 2017”, Opinion and Analytical Department of the Traffic Office of the Police Headquarters. Warsaw, (in Polish). Symon E. (2019). “Traffic accidents in Poland in 2018”, Opinion and Analytical Department of the Traffic Office of the Police Headquarters. Warsaw, (in Polish). Zinchenko N. (2017). “Limit theorems for compound renewal processes: Theory and application proceedings”, ASMDA, 1125–1136.

8 Dependability and Performance Analysis for a Two Unit Multi-state System with Imperfect Switch

A two-unit multi-state deteriorating system under preventive condition-based maintenance and imperfect switch among units is considered. The system consists of one operating unit and one unit in cold standby mode. System control is switched to the standby unit when the operational unit experiences a failure or enters a maintenance state. The automated switch mechanism can experience failures either due to frequent use that incurs aging and degradation effects or even due to extended periods of being idle. In this case, a manual switch is initiated. The operational unit is periodically inspected in order to distinguish if any maintenance action needs to be triggered. Moreover, maintenance can be imperfect, restoring the unit to a worse degraded state, or even to a total failure state, mainly due to external factors. The main aim of this work consists of studying the transient behavior of the aforementioned two-unit system under a Markov framework and in examining how unit inspection intervals, as well as switching mechanism success probability, affect the entire system dependability and performance. Towards this, system transient and asymptotic availability as well as total expected operational cost are derived. The main aim of this chapter is to determine an optimal inspection and thus maintenance policy that improves system dependability and performance measures.

Chapter written by Vasilis P. KOUTRAS, Sonia M ALEFAKI and Agapios N. P LATIS.

Data Analysis and Applications 4: Financial Data Analysis and Methods, First Edition. Edited by Andreas Makrides, Alex Karagrigoriou and Christos H. Skiadas. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

120

Data Analysis and Applications 4

8.1. Introduction Over the last decades, the design of large-scale and complex technological systems has become important mainly due to the rapid development of technology and the increasing demand of various critical applications. The reliability and the availability of these systems are of great importance since their deterioration and/or failure may lead to serious economic and social losses [AGH 16]. Thus, in order to improve the operation of such a system and increase its availability and reliability, redundancy can be introduced. One of the most commonly used types of redundancy is the standby redundancy. In a standby redundant system, apart from the operational units, there is a number of standby units as backups, in order to replace any component after a failure. The switching process from the failed unit to the standby unit is usually assigned to an automated mechanism. It is assumed that this mechanism can fail with a positive probability [WAN 06, GOE 85, PLA 09]. The automated restoration mechanism usually fails due to frequent use, which results in aging and degradation, or due to the long period within which it is in standby mode. In the latter case, usually called imperfect switch, the switching process is operated manually [DUG 89, TRI 01]. The manual switch restores the system to an operational state again. However, more time is needed for manual than for automatic switch. After an automated mechanism failure, either maintenance or replacement actions are initiated. For further improvement of the operational time and condition of a multi-state system, maintenance actions can be adopted [LIS 07, NAT 11]. All necessary actions for keeping and/or restoring a system in an acceptable operating condition or even extending its lifetime are considered as system maintenance. These actions can be divided into corrective and preventive maintenance actions. Corrective maintenance occurs when the system fails, in contrast to preventive maintenance that occurs during its operational period. Under normal circumstances, preventive maintenance is more effective than corrective maintenance, since its main aim is to keep the system available and avoid undesirable failures that incur considerably high cost [CHE 03, AMA 06]. There are two main types of preventive maintenance – condition-based and time-based maintenance [LAP 06]. Time-based maintenance is carried out at specific time intervals independently of a system’s state. Condition-based maintenance depends on a system’s state; the

Dependability and Performance Analysis

121

system is regularly inspected, and depending on its state, it is either left without maintenance, if it works in an acceptable level, or minimal/major maintenance takes place, restoring the system in a previous deterioration level or in an as good as new state, respectively, when the system operates in an inefficient deterioration level. Although preventive maintenance has been adopted for improving the performance of a system, it incurs downtime and consequently implies a cost. Thus, preventive maintenance needs to be properly scheduled. An appropriate preventive maintenance policy that manages to reduce the total operational cost and improve the availability of the system is of critical importance. A lot of research effort has been paid to this direction (see [NAG 06, SHE 15, THE 12]). A recent review paper on optimal maintenance policies can be found in [DIN 15]. In this chapter, a two-identical-unit cold standby system with imperfect switch is modeled. Each unit functions under multiple states of degradation, from its perfect state to the total failure [LIS 10]. The switching process between the two units is assigned to an automated mechanism. It is assumed that the mechanism can fail with a positive probability [PLA 09] and a manual switch is triggered to shift system control to the operating unit. Our aim is to compute the main dependability and performance measures for the proposed model in transient as well as in its asymptotic phase and to examine how unit inspection rate, switching mechanism success probability as well as the manual switch rate affect the entire system dependability and performance in both phases. The rest of this chapter is organized as follows. In section 8.2, the proposed model for the system under consideration is described analytically. In section 8.3, the main dependability and performance measures are defined in transient phase and in the steady state too. The optimization problems for detecting the optimal maintenance policy for system dependability and performance are presented in section 8.4. In section 8.5, some numerical results are presented to illustrate the theoretical framework provided and to also examine system behavior. This chapter concludes by providing a short discussion and mentioning some points for further research.

122

Data Analysis and Applications 4

8.2. Description of the system under maintenance and imperfect switch In this chapter, a two-identical-unit system which experiences deterioration is considered. One unit is operating and the other unit is in a cold standby mode. It is assumed that initially both units are in their perfect state, thus the whole system is fully operational. The operational unit can either be in its perfect state (O) or in one of the deterioration states Di , i = 1, 2, 3, where Di denotes the ith deterioration level prior to the total failure (F ). In order to delay or even avoid a total failure, depending on the deterioration level of the operating unit, minimal (m) or major (M ) maintenance actions are implemented. To identify the level of deterioration and thus to decide on the type of maintenance to be triggered, periodical unit inspection takes place. In particular, from its perfect state or from each of the deteriorations states, the operational unit may enter an inspection state Ik , k = 0, 1, 2, 3. Thus, the state space of the operational unit can be defined by E O = {O, D1 , D2 , D3 , I0 , I1 , I2 , I3 , m, M, F }. As far as the standby unit is concerned, its operational states can be denoted by: – OS : if the standby unit is in its perfect state; – DiS : if the standby unit is in the ith deterioration level (i = 1, 2, 3). When the operational unit enters a maintenance state (minimal or major) or experiences a failure, the system control is switched to the standby unit either automatically with probability c or manually, when the automated switch mechanism experiences a failure, with probability 1 − c. Such failures are mainly due to either frequent use, which incurs aging and degradation effects, or even due to extended periods of being idle. In the case of manual switch, the operating unit enters the so-called manual switch state (M S l , l = m, M, F ). Note that in this case, the automated switch mechanism is replaced by a new identical one. During manual switch, either due to a failure or due to maintenance of the operating unit, the state of the standby unit does not change. It is important to note that system control will be switched to the standby unit after the completion of manual switch only if this is in an operational state. Taking into account the aforementioned

Dependability and Performance Analysis

123

description, the state space of each system unit, either in operational or in standby mode, is denoted by E (E O ⊂ E): E = {O, OS , D1 , D2 , D3 , D1S , D2S , D3S , I0 , I1 , I2 , I3 , m, M, F, M S m , M S M , M S F } In order to denote the state of the system, a pair (i, j) is used, where i, j ∈ E denote the condition of the primary and the supplementary unit, respectively. Initially, the system starts to operate in its perfect state, i.e. the primary and the standby units are both in their fully operational state (O, OS ). From the perfect state, the system may enter one of the next deterioration levels (Di , OS ), i = 1, 2, 3 or state (F, O) if system control is switched automatically from the failed unit to the standby unit, or even state (F, M S F ) if the automated switch mechanism experiences a failure and the switching process is performed manually. Moreover, from state (O, OS ), the system may enter the inspection state (I0 , OS ). From each of the deterioration states D(i, j), i = 1, 2, 3, j ∈ {OS , D1S , D2S , D3S , m, M, F }, the operational unit may enter an inspection state (Ii , j), respectively, due to the assumption that during the inspection of the operating unit, the state of the standby unit does not change. In case unit deterioration occurs prior to inspection, the unit enters one of the next deterioration states (depending on the level of deterioration) or even the failure state in case of a sudden failure, which can occur mainly due to external factors. If after the inspection the operating unit is detected in its perfect state, or in the first deterioration state (D1 ), no maintenance action takes place. However, if the unit is either in the second (D2 ) or in the third (D3 ) deterioration level, minimal or major maintenance action is triggered, respectively, and the system control is switched to the standby unit, if this is allowed. Either minimal or major maintenance can be perfect, imperfect or failed. A perfect minimal maintenance restores the unit to its previous deterioration level (from state D2 to state D1 ), though an imperfect minimal maintenance restores the unit to the same deterioration level. Failed minimal maintenance leads the unit to a worse deterioration level or even to a total failure (from state D2 to state D3 or F ). Perfect major maintenance restores the unit to its perfect state (as good as new), though imperfect major maintenance can restore the system to any of the previous deterioration states (D1 or D2 ) or even to the same deterioration level (D3 ),

124

Data Analysis and Applications 4

with different rates. Failed major maintenance leads the unit to the total failure state (F ). The resulting system consists of 145 states in total. All states of the system as well as all possible transitions are presented in Table A.5 in the Appendix (section 8.7). Exponential distributions are assumed for all the sojourn times. Note that the failure rates for the deterioration process are state-dependent, i.e. the higher the deterioration level, the more probable it is for the system to enter a worse deterioration level [AGH 16]. Additionally, it is assumed that the inspection duration is negligible compared to any other sojourn time; thus, there is not enough time for any other transition to occur during inspection [XIE 05]. The transition rates among unit possible states are presented in Table 8.1. Thus, the evolution of the system in time is described by a Markov process Z = {Z(t), t ≥ 0}. Rate Description λ1 Deterioration rate from O to D1 λ12 Deterioration rate from O to D2 λ13 Deterioration rate from O to D3 λf 1 Sudden failure rate from O to F λ2 Deterioration rate from D1 to D2 λ22 Deterioration rate from D1 to D3 λf 2 Sudden failure rate from D1 to F λ3 Deterioration rate from D2 to D3 λf 3 Sudden failure rate from D2 to F λ4 Deterioration rate from D3 to F λIN Inspection rate

Rate Description μI Inspection response rate λm Minimal maintenance rate λIm Imperfect minimal maintenance rate λf m Failed minimal maintenance rate λF m Failure rate due to minimal maintenance failure λM Major maintenance rate λIM 1 Imperfect major maintenance rate of level 1 λIM 2 Imperfect major maintenance rate of level 2 λIM 3 Imperfect major maintenance rate of level 3 λF M Failure rate due to major maintenance failure λR Repair rate

Table 8.1. Transition rates

8.3. Dependability and performance measures Initially, we intend to evaluate the dependability and performance of the proposed model for the two-unit multi-state system under maintenance and imperfect switch with the aim to examine the behavior of the system in terms of availability and operational cost. Let the entire system state space be defined by E  . The elements of E  are shown in the first and third columns of Table A.5 in the Appendix (section 8.7). Note that E  is divided into two subsets U and D. Subset U contains all system operational states, although subset D contains all states in which none of the system units is in an operational mode (down states), such as the following relationships which hold true: E  = U ∪ D, U ∩ D = ∅.


Initially, the transient phase of the system is studied. However, since such systems are designed to operate continuously in time, their asymptotic behavior, in terms of dependability and performance, is also studied.

8.3.1. Transient phase

In the transient phase (as well as in the asymptotic phase), system availability is used as a dependability measure, while the total expected operational cost due to system downtime and due to any actions taken (inspection, maintenance, replacement of the automated switch mechanism, unit repair) is used as a system performance measure. To evaluate these measures at time t ≥ 0, the probability transition matrix for the Markov process that describes the system's evolution in time needs to be derived.

Let Q = [q_{(i,j),(i′,j′)}]_{(i,j),(i′,j′)∈E′} be the infinitesimal generator matrix for the homogeneous continuous-time Markov process Z = {Z(t), t ≥ 0} and the state space E′, which have already been defined, with q_{(i,j),(i′,j′)} ≥ 0 for (i, j) ≠ (i′, j′) and q_{(i,j),(i,j)} = −q_{(i,j)} = −∑_{(k,l)∈E′, (k,l)≠(i,j)} q_{(i,j),(k,l)}.

Let also α = (α(i, j))_{(i,j)∈E′} be the initial distribution of Z at time t = 0 and P(t) = [p_{(i,j),(i′,j′)}(t)], t ≥ 0, its transition function with:

p_{(i,j),(i′,j′)}(t) = Prob(Z(t) = (i′, j′) | Z(0) = (i, j)) = Prob(Z(t + h) = (i′, j′) | Z(h) = (i, j)), ∀h ≥ 0.   [8.1]

In this case, the solution of the Kolmogorov equation is:

P(t) = e^{tQ}   [8.2]

which is the probability transition matrix at t ≥ 0, with P(0) = I the identity matrix [SAD 05, KIJ 97].

8.3.1.1. Transient availability

Based on the two-unit multi-state deteriorating system model and the aforementioned assumptions, the system's instantaneous availability at time t ≥ 0 can be evaluated as follows [KOU 17]:

AV(t) = Prob(Z(t) ∈ U) = α · e^{tQ} · 1_{|E′|,|U|}   [8.3]


where 1_{|E′|,|U|} is a vector of dimension |E′| × 1 containing ones in the entries that correspond to system operational states and zeros in the entries that correspond to the non-operational states. Thus, 1_{|E′|,|U|} contains |U| ones and |E′| − |U| zeros.

8.3.1.2. Total expected operational cost

Minimal and major maintenance actions, as well as repair/replacement along with other action schedules, are usually designed to minimize a system's downtime. As a result, system designers are usually interested in minimizing the total downtime. Taking also into account that, for each unit, non-operational time incurs a cost for the system, the total expected cost incurred when the system is in a non-operational mode should also include a cost per unit of downtime [KOU 17, MAL 17]. In the proposed model, system downtime occurs when both units are non-operational, when the operating unit is inspected and during manual switch among units. In order to define the total expected downtime cost, let us define a reward function:

w^D(i, j) = C_D if (i, j) ∈ D, and 0 otherwise   [8.4]

where C_D is the cost per unit of downtime that occurs when the system is in a non-operational state. Thereafter, w^D(i, j) is used for evaluating the total expected downtime cost per unit time. In particular, let g(Z(t)) be the downtime reward rate at time t:

g(Z(t)) = ∑_{(i,j)∈E′} w^D(i, j) I_{{Z(t)=(i,j)}}   [8.5]

where I_{{Z(t)=(i,j)}} is an indicator function. Then, the total expected downtime cost can be expressed as the mean of the reward rate g(Z(t)):

TEDC(t) = E(g(Z(t))) = E( ∑_{(i,j)∈E′} w^D(i, j) I_{{Z(t)=(i,j)}} )
        = ∑_{(i,j)∈E′} w^D(i, j) E( I_{{Z(t)=(i,j)}} )
        = ∑_{(i,j)∈E′} w^D(i, j) Pr(Z(t) = (i, j))
        = ∑_{(i,j)∈E′} w^D(i, j) · π_{(i,j)}(t)   [8.6]

where π_{(i,j)}(t) is the probability of state (i, j) ∈ E′ at time t and can be obtained using [8.2]. Alternatively, TEDC(t) can be obtained according to the following equation:

TEDC(t) = C_D (α · e^{tQ} · 1_{|E′|,|D|})   [8.7]

where 1_{|E′|,|D|} is a vector of dimension |E′| × 1 containing ones in the entries that correspond to system down states and zeros in the entries that correspond to the operational states of the system. Thus, 1_{|E′|,|D|} contains |D| ones and |E′| − |D| zeros.

Beyond the downtime cost, whenever an action takes place, an additional action cost occurs. Thus, as already mentioned, inspection, minimal and major maintenance, unit repair, as well as automated switch mechanism replacement after it experiences a failure, all incur costs. Similarly to TEDC(t), to define the total expected action cost TEAC(t), the corresponding reward function is defined as follows:

w^A(i, j) = C_I if in state (i, j) a unit is under inspection,
            C_m if in state (i, j) a unit is under minimal maintenance,
            C_M if in state (i, j) a unit is under major maintenance,
            C_R if in state (i, j) a unit is under repair,
            C_S if in state (i, j) the switch mechanism should be replaced,
            0 otherwise   [8.8]

where C_I is the cost per unit time for inspecting a unit, C_m is the cost per unit time for implementing minimal maintenance, C_M is the cost per unit time for implementing major maintenance, C_R is the cost per unit time for unit repair/replacement, and C_S is the cost for replacing the switch mechanism after a failure. Depending on their nature, let C_I < C_m < C_M < C_S < C_R, since the cost of repairing/replacing a unit is higher than all other costs, while the cost of major maintenance is higher than the corresponding cost of minimal maintenance. Note that despite the inspection's negligible duration compared with all the other actions, it incurs a cost, which is the lowest among all others. Finally, it is assumed that the switch mechanism replacement cost is higher than the maintenance costs but lower than the unit repair cost. The corresponding action reward rate can be defined as follows:

f(Z(t)) = ∑_{(i,j)∈E′} w^A(i, j) I_{{Z(t)=(i,j)}}   [8.9]

and then the total expected action cost can be defined similarly to equation [8.6]. However, we will alternatively use equation [8.10] to define and evaluate this cost:

TEAC(t) = α · e^{tQ} · w^A   [8.10]

where w^A is a vector of dimension |E′| × 1 containing C_I, C_m, C_M, C_S, C_R in the entries that correspond to the system states in which the aforementioned (costly) actions take place, in accordance with equation [8.9].

Finally, whenever a system similar to the one that is modeled in this chapter is used, system designers are usually interested in evaluating a combined (downtime + action) cost. Let this cost be denoted as the total expected operational cost (TEOC). Then, the overall system cost at time t due to downtime and all actions taken can be defined as follows:

TEOC(t) = TEDC(t) + TEAC(t)   [8.11]
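All of the transient measures above reduce to products of the initial distribution, the matrix exponential e^{tQ} and a suitable reward vector. A minimal Python sketch of equations [8.3], [8.7], [8.10] and [8.11] follows; the generator Q, the initial distribution alpha, the 0/1 indicator vector ind_U of the up states and the action-cost vector w_A are assumed to be available (e.g. assembled from Tables 8.1 and 8.5).

```python
import numpy as np
from scipy.linalg import expm

def transient_measures(t, Q, alpha, ind_U, w_A, C_D):
    """AV(t), TEDC(t), TEAC(t) and TEOC(t); equations [8.3], [8.7], [8.10], [8.11]."""
    p_t = alpha @ expm(t * Q)           # state distribution of Z at time t
    av = p_t @ ind_U                    # [8.3]: probability of being in an up state
    tedc = C_D * (p_t @ (1 - ind_U))    # [8.7]: expected downtime cost
    teac = p_t @ w_A                    # [8.10]: expected action cost
    return av, tedc, teac, tedc + teac  # [8.11]: total expected operational cost
```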

8.3.2. Asymptotic analysis

Since such systems are usually designed to operate continuously in time, we are also interested in evaluating the asymptotic behavior of the system and, consequently, the limiting values of the defined dependability and performance measures. With availability and total operational cost defined as in [8.3] and [8.11], respectively, it is not difficult to evaluate their limiting values by letting t → ∞. Thus, the asymptotic availability and the total operational cost in steady state can be defined as follows:

AV = lim_{t→∞} AV(t)   [8.12]

TEOC = lim_{t→∞} TEOC(t)   [8.13]

Alternatively, the asymptotic availability and the total expected operational cost in steady state can be evaluated using the equations:

AV = ∑_{(i,j)∈U} π_{(i,j)}   [8.14]

TEOC = ∑_{(i,j)∈E′} w^D(i, j) π_{(i,j)} + ∑_{(i,j)∈E′} w^A(i, j) π_{(i,j)}   [8.15]

where π = (π_{(i,j)})_{(i,j)∈E′} is the steady-state probability distribution vector. The steady-state probability distribution for the proposed model can be derived by solving the following system of linear equations:

π · Q = 0,   ∑_{(i,j)∈E′} π_{(i,j)} = 1.
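Numerically, the steady-state distribution can be obtained by stacking the balance equations with the normalization constraint and solving in the least-squares sense; a minimal sketch:

```python
import numpy as np

def steady_state(Q):
    """Solve pi * Q = 0 together with sum(pi) = 1."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])   # balance equations plus normalization row
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Asymptotic measures [8.14]-[8.15], with ind_U, w_D and w_A as above:
# AV = pi @ ind_U and TEOC = pi @ (w_D + w_A)
```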

It is also worth mentioning that system designers are interested in determining the maintenance policies that asymptotically optimize the dependability and performance of a system. Thus, equations [8.14] and [8.15] are used as objective functions in some optimization problems that are formulated in the following section, with the inspection rate λ_IN as the decision variable.

8.4. Optimal maintenance policy

Determining the optimal inspection policy is equivalent to determining the maintenance policy that optimizes system dependability, system performance or even both of these measures. Although transient system behavior is of primary interest in this chapter, designing an optimal maintenance schedule would benefit the two-unit multi-state system in the long run. Thus, optimization problems for the dependability and performance measures that use the expressions derived in equations [8.14] and [8.15] as objective functions are formulated. However, once the maintenance schedule that optimizes asymptotic availability and/or performance is derived, it would be implemented in the considered system from the beginning of its lifetime. Thus, the effects of the aforementioned optimal maintenance schedules on the system's transient behavior can also be observed.

8.4.1. Optimal maintenance policy for maximizing system availability

Depending on system characteristics, a lower (λ_IN^{LB}) and an upper (λ_IN^{UB}) bound can be set for the inspection policy λ_IN, which is the decision variable. Consequently, the mathematical programming model for asymptotic system availability, with respect to the inspection policy, is:

max AV
s.t. λ_IN^{LB} ≤ λ_IN ≤ λ_IN^{UB}   [8.16]

Solving [8.16] provides the optimal inspection policy that maximizes the system's asymptotic availability.

8.4.2. Optimal maintenance policy for minimizing total expected operational cost

Depending on the system requirements, an inspection policy that minimizes the total operational cost can also be derived. Based on equation [8.15], the corresponding mathematical programming model that can be formulated for this case is:

min TEOC
s.t. λ_IN^{LB} ≤ λ_IN ≤ λ_IN^{UB}   [8.17]

Solving [8.17] provides the optimal inspection policy that minimizes the system's asymptotic total operational cost.
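Since the decision variable λ_IN is scalar and bounded, both problems can be solved with a one-dimensional bounded optimizer. A sketch follows, assuming a hypothetical helper build_Q(lam_in) that assembles the generator for a given inspection rate, the steady_state routine above, and the vectors ind_U, w_D, w_A introduced earlier:

```python
from scipy.optimize import minimize_scalar

LAM_LB, LAM_UB = 0.001, 0.1   # inspection rate bounds (h^-1), as in section 8.5.2

def asymptotic_teoc(lam_in):
    pi = steady_state(build_Q(lam_in))   # build_Q is a hypothetical model assembler
    return pi @ (w_D + w_A)              # asymptotic TEOC, equation [8.15]

# Problem [8.17]: minimize the asymptotic total expected operational cost.
res = minimize_scalar(asymptotic_teoc, bounds=(LAM_LB, LAM_UB), method="bounded")
lam_star, teoc_star = res.x, res.fun
# Problem [8.16] is analogous: maximize AV by minimizing -(pi @ ind_U).
```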


8.4.3. Optimal maintenance policy for multi-objective optimization problems

Note that any inspection policy selected to be adopted for the system has quite different effects on availability and on the overall cost. More specifically, performing inspections very often benefits the total expected operational cost, since through inspection, and thereafter maintenance, the system is prevented from failures, which incur a considerably high cost. However, since inspection states are considered down states, such an inspection policy will yield lower availability. Conversely, when inspection is performed less often, system availability benefits because the system does not enter down states as often. However, in the latter case, an increased overall cost results, since by delaying inspection, and hence maintenance, a total failure that incurs a high cost becomes more probable for the system.

Consequently, the system designer should take these features into account in order to schedule the optimal maintenance policy that simultaneously optimizes both measures. This is usually achieved by formulating and solving multi-objective optimization problems. In this chapter, the multi-objective optimization problems are solved as an optimization problem with constraints. More specifically, the most important measure is considered as the objective function and the remaining measure participates in the optimization problem as a constraint. Usually, especially in industrial applications, the designer is mainly interested in minimizing the system's operational cost. Thus, the optimization problem to be solved for the two-unit multi-state system includes TEOC as the objective function, while availability (AV) is included as a constraint. In particular, a lower availability threshold (AV_0) is set and the optimization problem is solved by requiring that system availability be higher than this threshold. Note that AV_0 depends on system characteristics and can be decided by the designer:

min TEOC
s.t. AV ≥ AV_0
     λ_IN^{LB} ≤ λ_IN ≤ λ_IN^{UB}   [8.18]
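Because the feasible set of [8.18] is a bounded interval, a simple sketch is to evaluate a grid of inspection rates and retain the cheapest one satisfying the availability constraint (same hypothetical helpers as above):

```python
import numpy as np

def solve_problem_818(AV_0, n_grid=200):
    """Grid-search approximation of [8.18]: min TEOC subject to AV >= AV_0."""
    best_lam, best_teoc = None, np.inf
    for lam_in in np.linspace(LAM_LB, LAM_UB, n_grid):
        pi = steady_state(build_Q(lam_in))
        av, teoc = pi @ ind_U, pi @ (w_D + w_A)
        if av >= AV_0 and teoc < best_teoc:
            best_lam, best_teoc = lam_in, teoc
    return best_lam, best_teoc
```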


However, in cases of systems where availability is of primary importance, such as life-critical systems, asymptotic availability can be considered as the objective function. In this case, the optimization is implemented with respect to an upper cost threshold (TEOC_0), which also depends on system characteristics and can likewise be provided by the designer.

8.5. Numerical results

The proposed model can be further examined and illustrated through some numerical results based on the experimental data provided in Table 8.2.

Parameter  Value          Parameter  Value          Parameter  Value
λ1         1/1,200 h−1    λIN        0.05 h−1       μI         120 h−1
λ12        1/2,000 h−1    λm         1 h−1          λR         0.025 h−1
λ13        1/1,500 h−1    λIm        0.002 h−1      b          2 h−1
λf1        1/10,000 h−1   λfm        1/100 h−1      CD         50 cu/h
λ2         1/1,000 h−1    λFm        1/10,000 h−1   CS         500 cu/h
λ22        1/1,500 h−1    λM         0.2 h−1        Cm         50 cu/h
λf2        1/1,000 h−1    λIM1       0.025 h−1      CM         200 cu/h
λ3         1/800 h−1      λIM2       0.015 h−1      CI         10 cu/h
λf3        5/1,000 h−1    λIM3       0.01 h−1       CR         1,000 cu/h
λ4         1/600 h−1      λFM        1/10,000 h−1   c          0.9 − 1

Table 8.2. Experimental data

8.5.1. Transient and asymptotic dependability and performance

Initially, let us set the probability of a perfect switch among units, in the case of maintenance or in the case of a total unit failure, to c = 0.95. Transient availability and total operational cost behavior for a time horizon of 10,000 hours are presented in Figures 8.1 and 8.2, respectively. As expected, system availability decreases with time while the overall cost increases, since, in the long run, the system will eventually experience failures. Additionally, the convergence of AV(t) and TEOC(t) to AV and TEOC, respectively, which indicates how fast the system reaches steady state, is also depicted in Figures 8.1 and 8.2. As can be observed, the system needs about 4,000 hours to reach steady state. This time interval is large enough, thus highlighting the necessity of transient analysis.


Figure 8.1. Transient availability in a time horizon of 10,000 hours

Figure 8.2. Transient total operational cost in a time horizon of 10,000 hours



However, it is interesting to examine how the proposed model's parameters affect the dependability and performance measures in the transient phase. Initially, the switch mechanism success probability c is examined, since it is one of the most critical parameters in a model that incorporates imperfect switch. As can be observed in Figure 8.3, availability increases for higher success probability values, as expected, since a higher success probability prevents the system from entering the manual switch states, which are down states. Correspondingly, as shown in Figure 8.4, the total expected operational cost does not seem to be significantly affected by changes in c. Although an increased c slightly reduces TEOC(t), since it indicates that the switching mechanism operates properly with high probability and does not need to be replaced (thus avoiding the replacement cost), the improvement in the overall cost is not significant.

Figure 8.3. Transient availability with respect to success probability c


Figure 8.4. Transient total expected operational cost with respect to success probability c

Moreover, the inspection rate λ_IN is of major importance too, since it determines the inspection policy, and thus the maintenance policy, to be adopted for the two-unit multi-state system. In Figure 8.5, the availability decreases as the inspection rate increases, since increasing λ_IN implies more frequent inspections, and thus maintenance actions, during which the system is in down states. Thus, in this case, the system's operational time decreases, and so does the availability. Conversely, as shown in Figure 8.6, the expected overall cost reduces as the inspection frequency increases. Inspecting, and consequently maintaining, the system more often manages to delay future failures that incur considerably higher costs. Thus, the total expected operational cost benefits from adopting a schedule that includes frequent maintenance actions.


Figure 8.5. Transient availability with respect to the inspection rate λIN

Figure 8.6. Transient total expected operational cost with respect to the inspection rate λIN


Figure 8.7. Transient availability with respect to the manual switch rate b

Since one of the innovative aspects of the proposed model for the system under consideration is the imperfect switch among its units, it is also interesting to examine how the time needed to switch system control manually, in the case of a failure of the automated mechanism, affects the dependability and performance measures. This can be examined through the manual switch rate. In Figure 8.7, we observe that an increase in the manual switch rate results in increased system availability, as expected. This is obvious, since in this case, the system spends less time in the non-operational state of manual switch. However, b does not seem to significantly affect the TEOC(t) indicator. This is more or less reasonable because entering the manual switch state affects the cost (by triggering the automated mechanism replacement cost), but the sojourn time in this state has no effect on the cost. Nevertheless, a slight cost improvement can be obtained when b increases.


Figure 8.8. Transient availability with respect to the minimal maintenance rate λm

The effect of the time needed for a unit to be restored to its previous deterioration level through minimal maintenance on dependability and performance is also examined in this section, through the minimal maintenance rate λ_m, which represents the rate of minimal maintenance completion. As expected, when minimal maintenance lasts a short time, transient availability increases, since the system is restored faster to an operational state. However, similarly to the previous case, the total expected cost is affected less by the duration of minimal maintenance than by the fact that the system enters the minimal maintenance state at all, which incurs a cost; nevertheless, a reduced minimal maintenance duration slightly improves the overall cost.

The effects of the time needed for a unit to be restored to its fully operational state through major maintenance on dependability and performance are also examined, through the major maintenance rate λ_M. Similarly to the minimal maintenance rate, and for exactly the same reason, transient system availability increases with the increase of λ_M, as shown in


Figure 8.9. In contrast to the minimal maintenance rate, λ_M seems to significantly affect TEOC(t). As can be observed in Figure 8.10, TEOC(t) reduces with the increase of λ_M. This is because major maintenance manages to restore the unit to its fully operational state, thus causing a significant delay in unit failures, or even in maintenance actions, that incur cost. Thus, when the unit is restored to the perfect operating state through faster major maintenance, the need to maintain this unit or to repair it after a failure is delayed, resulting in a reduction of the system's overall cost at time t.

Figure 8.9. Transient availability with respect to the major maintenance rate λM

The rate of repairing/replacing a failed unit is also considered in the analysis. In Figure 8.11, it can be observed that the faster the unit repair/replacement restores the system back to an operational state, the higher the availability, since the total system operational time increases. Respectively, as shown in Figure 8.12, the total expected operational cost reduces with fast unit repair/replacement, since restoring the unit to its fully operational state (as good as new) means that its lifetime starts again, and thus maintenance, repair or any other costly actions for the unit will be delayed.


Figure 8.10. Transient total operational cost with respect to the major maintenance rate λM

Figure 8.11. Transient availability with respect to the repair rate λR


Figure 8.12. Transient total operational cost with respect to the repair rate λR

From the system designer's point of view, it is also interesting to examine how all the action costs, as well as the downtime cost, affect TEOC(t). As shown in Figure 8.13, the inspection cost does not seem to critically affect TEOC(t), due to its small value compared to the rest of the action costs. However, as expected, there is a slight increase of TEOC(t) with the increase of C_I. The same behavior holds true for the minimal maintenance cost C_m, the downtime cost C_D and the automated switch mechanism cost C_S. Note that although C_S is much higher than the other costs that do not significantly affect TEOC(t), its changes do not seem to cause any important change in TEOC(t) either. However, changes in the major maintenance cost C_M, as well as in the unit repair/replacement cost C_R, have a notable effect on TEOC(t) (their increase results in a total operational cost increase, as expected), as shown in Figures 8.14 and 8.15, respectively.


Figure 8.13. Transient total operational cost with respect to inspection cost CI

Figure 8.14. Transient total operational cost with respect to major maintenance cost CM


Figure 8.15. Transient total operational cost with respect to unit repair/replacement cost CR

8.5.2. Optimal asymptotic maintenance policies implemented in the transient phase

Taking into account that the maintenance schedule to be adopted is represented by the inspection schedule in the proposed model, the results of the optimization problems defined in equations [8.16], [8.17] and [8.18] are presented in Tables 8.3 and 8.4. For all the optimization problems, we set λ_IN^{LB} = 0.001 h−1 and λ_IN^{UB} = 0.1 h−1. Note that, since we innovatively propose to model an imperfect switch for the two-unit multi-state deteriorating system with maintenance, we solve all of the optimization problems with respect to the success probability c too. From the results (see Table 8.3), it can be seen that the optimal inspection policy λ* that maximizes the system's asymptotic availability seldom consists of inspecting the operating unit (in fact, λ* is equal to the lowest possible value λ_IN^{LB} regardless of the success probability c). The result is expected, since frequent inspection leads the system to down states more frequently, hence reducing system availability. However, as also shown in Table 8.3, the total expected operational cost is minimized for more frequent inspection, regardless of the


success probability c. The explanation lies in the fact that frequent inspection triggers maintenance actions, hence avoiding or delaying operating unit failures that incur a high cost. Finally, for the multi-objective optimization problem (see Table 8.4), we observe that the optimal inspection policy λ* mostly depends on the objective function. Hence, λ* is closely related to the optimal solution of problem [8.17]. However, there are instances of problem [8.18], in particular when higher levels of availability are desired (AV_0 = 0.999), for which the optimal inspection policy is slightly different from the optimal inspection schedule for problem [8.17].

c       λ* (h−1)   AV*        λ* (h−1)   TEOC*
0.900   0.001      0.999650   0.1        20.0075
0.950   0.001      0.999681   0.1        19.9430
0.990   0.001      0.999706   0.1        19.9108
0.995   0.001      0.999709   0.1        19.8850

Table 8.3. Optimization results

c       AV0     λ* (h−1)   TEOC*     AV*
0.900   0.990   0.100      19.8818   0.998661
0.900   0.995   0.100      19.8818   0.998661
0.900   0.999   0.060      20.5512   0.999002
0.950   0.990   0.100      19.8818   0.998714
0.950   0.995   0.100      19.8818   0.998714
0.950   0.999   0.066      20.3976   0.999004
0.990   0.990   0.100      19.8818   0.998757
0.990   0.995   0.100      19.8818   0.998757
0.990   0.999   0.071      20.2921   0.999003
0.995   0.990   0.100      19.8818   0.998763
0.995   0.995   0.100      19.8818   0.998763
0.995   0.999   0.072      20.2798   0.99900029

Table 8.4. Multi-objective optimization results

By adopting the policy that optimizes asymptotic availability or cost from the beginning of the system's operational time, the dependability and performance measures benefit not only in the steady-state phase but also at any time instant t of the system's operational time. Since the multi-objective optimization problem simultaneously provides optimal cost and availability, the behavior of AV(t) and TEOC(t) under the optimal policies derived from [8.18] and shown in Table 8.4 is indicatively presented in Figures 8.16 and 8.17.

Figure 8.16. Transient availability for inspection policies from the multi-objective problem


Figure 8.17. Transient total operational cost for inspection policies from the multi-objective problem



8.6. Conclusion and future work

In this chapter, a two-unit multi-state deteriorating system with minimal and major maintenance actions, triggered to prevent or delay unit failures, is modeled using a continuous-time Markov process. Such systems can be met in many real-life applications. The innovative aspect of this chapter consists of modeling the imperfect switch among the system units. The imperfect switch models the cases when the automated switch mechanism fails to operate whenever needed, in order to switch control from the operating to the standby unit. The transient as well as the asymptotic behavior of the system under consideration, in terms of dependability and performance, expressed by availability and total operational cost, respectively, is analytically examined. Additionally, a sensitivity analysis with respect to the model parameters is implemented through numerical results. Such an analysis is useful for system designers, since it provides knowledge on how the two-unit system should be designed in order to achieve the desired levels of availability and/or operational cost. To this end, mathematical programming models for optimizing availability and cost are formulated and solved, in order to derive the optimal maintenance policies, expressed through the inspection rates. Consequently, we have provided an appropriate theoretical framework for designing efficient two-unit multi-state deteriorating systems with maintenance and imperfect switch among units.

The proposed model can be extended in various ways. First, the exponentially distributed sojourn times in system states can be relaxed and more precise models, like semi-Markov models, can be used to examine the system's behavior in terms of dependability and performance, in the transient and/or asymptotic phase. Second, the proposed approach can be generalized to model an n-unit multi-state system, from both theoretical and practical points of view, although the model complexity as well as the computational complexity will increase. A first attempt in this direction is already under evaluation, in which k-out-of-n systems with deteriorating multi-state units, with minimal and major maintenance, are modeled using the


proposed approach. Another interesting issue that needs to be examined in the future is that, after minimal and major maintenance, or after repair of the operating unit, the system returns to a state that consists of units with different ages. This is also the case for states in which there is a renewal of the units' sojourn times. In the current approach, each time a unit returns to its perfect state or to a lower deterioration state, the sojourn time of the other unit is considered to be renewed. This assumption can be relaxed, resulting in a more realistic model, by using a class of stochastic models called Markov regenerative processes (MRGPs) that can model the aforementioned issues in detail. Finally, real-life system practices, like opportunistic maintenance actions, can also be incorporated to extend the proposed model.

8.7. Appendix

Table 8.5 lists, in the form "From state → To state(s)", all 145 states of the system together with the possible transitions among them. For instance, from the perfect state (O, O^S) the system may move to (I_0, O^S), (D_1, O^S), (D_2, O^S), (D_3, O^S), (F, O) or (MS_F, O^S); from (D_1, O^S) it may move to (I_1, O^S), (D_2, O^S), (D_3, O^S), (F, O) or (MS_F, O^S); and analogous transitions hold for every combination of primary and standby unit states.

Table 8.5. States and possible transitions for the proposed model


8.8. References

[AGH 16] Aghezzaf E.-H., Khatab A., Le Tam P., "Optimizing production and imperfect preventive maintenance plannings integration in failure-prone manufacturing systems", Reliability Engineering & System Safety, vol. 145, pp. 190–198, 2016.
[AMA 06] Amari S.V., McLaughlin L., Pham H., "Cost-effective condition-based maintenance using Markov decision processes", RAMS'06: Annual Reliability and Maintainability Symposium, pp. 464–469, 23–26 January, 2006.
[CHE 02] Chen D., Trivedi K.S., "Closed-form analytical results for condition-based maintenance", Reliability Engineering and System Safety, vol. 76, no. 1, pp. 43–51, 2002.
[CHE 03] Chen D., Cao Y., Trivedi K.S. et al., "Preventive maintenance of multi-state system with phase-type failure time distribution and non-zero inspection time", International Journal of Reliability, Quality and Safety Engineering, vol. 10, no. 3, pp. 323–344, 2003.
[DIN 15] Ding S.H., Kamaruddin S., "Maintenance policy optimization - literature review and directions", The International Journal of Advanced Manufacturing Technology, vol. 76, no. 5, pp. 1263–1283, 2015.
[DUG 89] Dugan J.B., Trivedi K.S., "Coverage modelling for dependability analysis of fault-tolerant systems", IEEE Transactions on Computers, vol. 38, no. 6, pp. 775–787, 1989.
[GOE 85] Goel L.R., Gupta R., Singh S.K., "Cost analysis of a two-unit priority standby system with imperfect switch and arbitrary distributions", Microelectronics and Reliability, vol. 25, no. 1, pp. 65–69, 1985.
[KIJ 97] Kijima M., Markov Processes for Stochastic Modeling, CRC Press, Boca Raton, FL, USA, 1997.
[KOU 17] Koutras V.P., Malefaki S., Platis A.N., "Optimization of the dependability and performance measures of a generic model for multi-state deteriorating systems under maintenance", Reliability Engineering and System Safety, vol. 166, pp. 73–86, 2017.
[LAP 06] Lapa C.M.F., Pereira C.M.N.A., de Barros M.P., "A model for preventive maintenance planning by genetic algorithms based in cost and reliability", Reliability Engineering and System Safety, vol. 91, no. 2, pp. 233–240, 2006.
[LIS 07] Lisnianski A., Ding Y., Frenkel I. et al., "Maintenance optimization for multi-state aging systems", Proceedings of the 5th International Conference on Mathematical Methods in Reliability, Methodology and Practice, Glasgow, United Kingdom, 2007.
[LIS 10] Lisnianski A., Frenkel I., Ding Y., Multi-state System Reliability Analysis and Optimization for Engineers and Industrial Managers, Springer, London, 2010.
[MAL 17] Malefaki S., Koutras V.P., Platis A.N., "Optimizing availability and performance of a two-unit redundant multi-state deteriorating system", Recent Advances in Multi-state Reliability, pp. 71–105, Springer, Berlin, 2017.
[NAG 06] Naga P., Rao S., Naikan V.N.A., "A condition-based preventive maintenance policy for Markov deteriorating systems", International Journal of Performability Modeling, vol. 2, no. 2, pp. 175–189, 2006.
[NAT 11] Natvig B., "Probabilistic modeling of monitoring and maintenance", in Multi-state Systems Reliability Theory with Applications, John Wiley & Sons, Inc., Chichester, UK, 2011.
[PLA 09] Platis A., Drosakis E., "Coverage modeling and optimal maintenance frequency of an automated restoration mechanism", IEEE Transactions on Reliability, vol. 58, no. 3, pp. 470–475, 2009.
[SAD 05] Sadek A., Limnios N., "Nonparametric estimation of reliability and survival function for continuous-time finite Markov processes", Journal of Statistical Planning and Inference, vol. 133, no. 1, pp. 1–21, 2005.
[SHE 15] Sheu S.-H., Chang C.-C., Chen Y.-L. et al., "Optimal preventive maintenance and repair policies for multi-state systems", Reliability Engineering and System Safety, vol. 140, pp. 78–87, 2015.
[THE 12] Thein S., Chang Y.S., Makatsoris C., "A study of condition based preventive maintenance model for repairable multi-stage deteriorating system", International Journal of Advanced Logistics, vol. 1, no. 1, pp. 83–102, 2012.
[TRI 01] Trivedi K.S., Probability and Statistics with Reliability, Queuing, and Computer Science Applications, John Wiley & Sons, New York, USA, 2001.
[WAN 06] Wang K.-H., Dong W.-L., Ke J.-B., "Comparison of reliability and the availability between four systems with warm standby components and standby switching failures", Applied Mathematics and Computation, vol. 183, no. 2, pp. 1310–1322, 2006.
[XIE 05] Xie X., Yiguang H., Trivedi K.S., "Analysis of a two level software rejuvenation policy", Reliability Engineering and System Safety, vol. 87, no. 1, pp. 13–22, 2005.

9

Models for Time Series Whose Trend Has Local Maximum and Minimum Values

Economic time series usually include trends, and it is important to capture these trends adequately. In many cases, trends show repeated up-and-down behavior. In this study, estimation problems for the local maximum and minimum values included in such time series are considered by assuming that the trend is piecewise linear. However, it is not clear whether estimation is meaningful or not, unless the properties of the time series are clarified. The first purpose of this chapter is to propose two kinds of models for a time series whose trend is piecewise linear. One is a trend stationary model, and the other is a random walk type model. The proposed models provide a basis for discussing the appropriateness of estimation and prediction of local maximum and minimum values in the trend. Simulation studies suggest that estimation and prediction are meaningless in some cases when a time series includes a stochastic trend, even though the mean value function has local maximum and minimum values. The second purpose of this chapter is to propose a procedure for estimating the piecewise linear trend. Local maximum and minimum values are obtained directly from the estimated trend. The proposed method is based on a piecewise linear regression and an application of a fuzzy trend model.

9.1. Introduction

When a time series includes a temporal trend, it is important to capture the trend adequately. One of the typical patterns of temporal trends is repeated up-and-down behavior. Examples are series of stock prices or stock indices. In this study, we consider problems concerning local maximum and minimum values in such

Chapter written by Norio WATANABE.


a trend. Estimation and prediction of local maximum and minimum values are important topics. However, it is not clear whether these topics are meaningful or not, unless the properties or models of the time series are clarified.

The first purpose of this chapter is to propose two kinds of models for a time series whose trend is continuous piecewise linear. One is a trend stationary model, and the other is a random walk type model. Problems concerning the piecewise linear trend are discussed by [KIM 09] and [TIB 14], among others. The piecewise linear trend can be used for approximation in many cases, although it has a restricted form. One advantage of the assumption of a piecewise linear trend is that peaks and troughs are obtained directly from the estimated trend. We can discuss the problem of estimating local maximum and minimum values based on the proposed models. From simulation studies, we can say that the basis for estimation under random walk type models is usually weak, since there is no direct relationship between the up-and-down behavior of a time series and the local maximum and minimum values of the mean value function. On the other hand, it is reasonable to estimate local maximum and minimum values of trends for trend stationary models.

The second purpose of this chapter is to propose an estimation method for the piecewise linear trend. Temporal trend analysis is important in many fields, and there are various methods based on regression analysis (see [ŞEN 17]). Our method is based on a piecewise linear regression and an application of a fuzzy trend model [WAT 15]. Both are regression-analysis-based methods. The applicability of the proposed method is illustrated using an example.

9.2. Models

Let {x_n | n = 1, ..., N} be an observed time series whose trend has local maximum and minimum values. We consider two models for {x_n}. The first is a mean or trend stationary model, and the second is a random walk type model.

9.2.1. Model 1

The trend stationary model is defined as follows:

x_n = μ_n + v_n,   [9.1]


where {v_n} is a zero mean stationary process, and {μ_n} is the mean value function of {x_n} given by the recursion:

μ_n = μ_{n−1} + d_n,   (n = 1, ...)   [9.2]

with the initial value μ_0 = d_0. The series of constants {d_n} is defined as follows. Let {u(k) | k = 1, 2, ...} be a given series satisfying

u(k) u(k − 1) < 0,   [9.3]

and let S_n = (s_{1n}, s_{2n}) be a state variable determined stochastically by the equation:

S_n = S_{n−1} with probability g(n − s_{2,n−1}), and S_n = (1 + s_{1,n−1}, n) with probability 1 − g(n − s_{2,n−1})   [9.4]

for n ≥ 2, where S_1 = (1, 0), and g(t) is a monotonically decreasing function such that g(1) = 1 and g(t) ≥ 0 for t ≥ 1. Then, d_n is defined by

d_n = u(s_{1n}).   [9.5]

The series {d_n} is stochastic, but we consider it under the condition that {d_n} is given. The process {u(k)} can also be stochastic, as shown in an example in section 9.3.

In this model, the mean value function μ_n is the trend, and the trend is continuous piecewise linear. The assumption [9.3] means that the local maximum and minimum values appear by turns. In other words, the trend shows up-and-down behavior. Note that the assumption [9.3] can be weakened or removed. The first component of the state variable, s_{1n}, is the index of the current local linear trend. That is, μ_n is on the s_{1n}-th straight line. The second component, s_{2n}, is the time point at which the s_{1n}-th local linear trend begins. The time point of the local maximum or minimum value is given by

T_k = min_{n ∈ {n | s_{1n} = k}} n − 1,   (k = 1, 2, ...)   [9.6]

and we have T_0 = 0.
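The recursion [9.2]–[9.5] translates directly into a short simulation routine; a minimal Python sketch follows, in which the change probability g and the slope series u are supplied by the caller (with u(k) returning a fixed slope for each k), and d_0 = u(1) is assumed for the initial value:

```python
import numpy as np

def simulate_trend(N, g, u, rng=None):
    """Piecewise linear trend mu_0, ..., mu_N of model 1 via equations [9.2]-[9.5]."""
    rng = rng or np.random.default_rng()
    s1, s2 = 1, 0                      # state variable S_1 = (1, 0)
    mu = np.empty(N + 1)
    mu[0] = u(1)                       # mu_0 = d_0, taking d_0 = u(1)
    mu[1] = mu[0] + u(1)               # d_1 = u(s_11) = u(1)
    for n in range(2, N + 1):
        if rng.random() > g(n - s2):   # change of local trend, equation [9.4]
            s1, s2 = s1 + 1, n
        mu[n] = mu[n - 1] + u(s1)      # equations [9.2] and [9.5]
    return mu
```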


9.2.2. Model 2

The random walk type model is defined as follows:

x_n = x_{n−1} + d_n + e_n,   (n = 1, 2, ...)   [9.7]

where d_n is the series of constants defined in model 1, and {e_n} is a zero mean i.i.d. process with variance σ². We assume that the initial value x_0 is zero for simplicity. In model 2, the expectation of x_n is identical to μ_n in model 1 under the condition that {d_n} is given. That is, E(x_n | {d_m}) = μ_n.

We show two examples of the monotone function g(t) (t ≥ 1), where 1 − g(t) is the probability of occurrence of the next change in local trends. The first example is the exponential function:

g(t) = ρ^{t−1}   [9.8]

where ρ < 1. The second is the cosine-type function:

g(t) = 1 / (1 + ((t − 1)/m)^{2b}),   [9.9]

where m is a positive constant and b is a positive integer. We call equation [9.9] the cosine-type probability, since we can prove the formula:

1 / (1 + ((t − 1)/m)^{2b}) = (1 + cos(2 tan^{−1}(((t − 1)/m)^b))) / 2,   [9.10]

by using the equations cos 2θ = 2 cos²θ − 1 and tan^{−1} x = cos^{−1}(1/√(1 + x²)). Examples of the cosine-type probability are shown in Figure 9.1 for b = 1, 2, ..., 9 and m = 10, 15, 20, ..., 40. The furthest left curve is for m = 10 and the furthest right is for m = 40 in each graph.
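The identity [9.10] is also easy to verify numerically; a short sketch:

```python
import numpy as np

m, b = 10.0, 2
t = np.linspace(1.0, 60.0, 200)
x = ((t - 1.0) / m) ** b
lhs = 1.0 / (1.0 + x ** 2)                      # left-hand side of [9.10]
rhs = (1.0 + np.cos(2.0 * np.arctan(x))) / 2.0  # right-hand side of [9.10]
assert np.allclose(lhs, rhs)
```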

Figure 9.1. Examples of g: cosine-type functions (nine panels for b = 1, ..., 9, each showing curves for m = 10, 15, ..., 40)

9.3. Simulation

In this section, we show examples of the two models. The same sequence {d_n} is used in these examples. Let y(k) be independently distributed according to the chi-square distribution with d_F degrees of freedom. We set u(k) = ±0.1 y(k)/d_F, where d_F = 10. As the monotone function g, we adopt the cosine-type probability with m = 20 and b = 2. In model 1, we assume that the stationary process {v_n} is Gaussian white noise with variance 0.4². Figure 9.2 demonstrates an example of a time series generated by model 1, where N = 200. The trend μ_n is shown by the bold piecewise linear line in this figure. In model 2, we assume that the white noise {e_n} is Gaussian with variance 0.4². An example of a time series obtained by model 2 is shown in Figure 9.3.
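Using the simulate_trend sketch from section 9.2, the two example series can be generated as follows (again a sketch: the cache makes each slope u(k) fixed once drawn, and the alternating signs enforce assumption [9.3]):

```python
import numpy as np

rng = np.random.default_rng(1)
dF, N = 10, 200
g = lambda t: 1.0 / (1.0 + ((t - 1.0) / 20.0) ** 4)  # cosine-type probability, m = 20, b = 2

_u = {}
def u(k):
    """Slope of the k-th local trend, drawn once: +/- 0.1 y(k)/dF with chi-square y(k)."""
    if k not in _u:
        _u[k] = (-1) ** k * 0.1 * rng.chisquare(dF) / dF
    return _u[k]

mu = simulate_trend(N, g, u, rng)                 # common mean value function
x1 = mu + rng.normal(0.0, 0.4, N + 1)             # model 1, equation [9.1]
x2 = mu + np.cumsum(rng.normal(0.0, 0.4, N + 1))  # model 2, equation [9.7]:
                                                  # mean mu_n plus integrated noise
```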


The mean value function shown by the bold piecewise linear line in Figure 9.3 is the same as the trend in Figure 9.2.

Figure 9.2. An example of the trend stationary model

Figure 9.2 shows that it is reasonable to estimate the local maximum and minimum values of the trend in model 1, since these values are the local maximum and minimum values of the mean value function. On the other hand, the validity of such estimation for random walk type models is weak, since there is usually no direct relationship between the up-and-down behavior of a time series and the local maximum and minimum values of the mean value function, as shown in Figure 9.3. The estimated peaks and troughs have no meaning, unless the variance of the white noise is relatively small compared with the range of fluctuation of d_n. As a result, it is difficult to establish the validity of prediction methods for peaks and troughs in some cases when a stochastic trend is included in a time series. Another problem is whether a peak detection method for a time series from model 2 is useful or not in practical analysis, apart from the theoretical aspect.


Figure 9.3. An example of the random walk type model

9.4. Estimation of the piecewise linear trend

In this section, we propose a procedure for the identification and estimation of a trend in model 1 for the case when no information on change points is available. This procedure can be applied to a time series from model 2. However, it is difficult to evaluate the validity of the procedure for model 2, as was stated in section 9.3. We apply piecewise linear regression, since the trend is assumed to be continuous piecewise linear. However, it is not easy to identify the piecewise linear function when the number of segments is unknown. The piecewise linear trend can be estimated by the ℓ1 trend filtering method [KIM 09] without information on change points, but it is not clear how to choose the regularization parameter in the ℓ1 trend filtering method. Moreover, a shrinkage effect appears, as shown in Figure 9.2 in [TIB 14]. (Tibshirani [TIB 14] discussed trend filtering in detail.)


Thus, we adopt another approach and introduce an identification procedure by applying a simple fuzzy trend model for pre-estimation. General fuzzy trend models are discussed by Watanabe and Watanabe [WAT 15], for example. The outline of our procedure is as follows:

Identification procedure
Step 1: fitting a fuzzy trend model;
Step 2: detection of peaks and troughs;
Step 3: piecewise linear regression;
Step 4: modification of nodes.

The fuzzy trend model used here is a one-parameter model for a scalar time series based on the fuzzy if-then rule:

R_ℓ: If n is A_ℓ, then μ_n(ℓ) = α_ℓ, for ℓ = 1, 2, ..., L,

where the number of rules L is determined by the width parameter of the membership function of A_ℓ. We use the membership functions shown in Figure 9.4, where the width parameter is 12. The width parameter of A_ℓ must be set smaller when applying the model to the piecewise linear regression.

Let {α̂_ℓ | ℓ = 1, ..., L} be the estimate of the latent process {α_ℓ} in the fuzzy trend model. In step 2, peaks and troughs are detected by checking the changes of α̂_ℓ. We judge that a change occurs if |α̂_ℓ − α̂_{ℓ−1}| > t_α S_α, where t_α is a given positive constant and S_α is the sample standard deviation of {α̂_ℓ}. Then, the time points of the nodes in the piecewise linear function are pre-estimated.

The piecewise linear regression is easy if the time points of the nodes in the piecewise linear function are given (see [DRA 98], [GAL 73]); a sketch is given below. Finally, the time points are re-estimated within neighborhoods of the pre-estimated points by minimizing the mean squared error of the piecewise linear regression. The minimization is carried out sequentially, since it is difficult to re-estimate over all combinations of time points.
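Step 3 can be written as ordinary least squares on a "broken stick" basis, in which each node contributes a hinge term max(n − T_k, 0), so that the fit stays continuous while the slope is allowed to change at every node; a minimal sketch (the node time points are assumed to come from step 2):

```python
import numpy as np

def piecewise_linear_fit(x, nodes):
    """Continuous piecewise linear least-squares fit with given node time points."""
    n = np.arange(1, len(x) + 1, dtype=float)
    # Design matrix: intercept, global slope, and one hinge per node.
    B = np.column_stack([np.ones_like(n), n] +
                        [np.maximum(n - T, 0.0) for T in nodes])
    coef, *_ = np.linalg.lstsq(B, np.asarray(x, dtype=float), rcond=None)
    return B @ coef   # fitted trend; the slope changes by coef[2 + k] at nodes[k]
```

Step 4 then amounts to shifting each node within a neighborhood of its pre-estimate and keeping the shift that minimizes the residual mean squared error.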


Figure 9.4. Membership functions in the fuzzy trend model

Figures 9.5 and 9.6 demonstrate an application of the proposed method. In this example, we use the value 6 as the width parameter in the fuzzy trend model. Moreover, we set t_α = 0.2 in step 2. The estimated trend is shown by the bold line in Figure 9.5, where the time series is the same as that shown in Figure 9.2. The pre-estimated, re-estimated and true trends are shown in Figure 9.6.

Figure 9.5. An example: the estimated trend


Figure 9.6. An example: the true and estimated trends (final estimated, pre-estimated and true trends)

9.5. Conclusion

We proposed two models whose mean value functions are piecewise linear. The validity of estimation methods for peaks and troughs can be discussed based on the proposed models. As a result, it is found that validity is doubtful in some cases when stochastic trends are included. Thus, it is important to investigate the properties of an observed time series in practical analysis. This is a problem of statistical hypothesis testing or model selection.

Moreover, we proposed an identification method for the piecewise linear trend. Peaks and troughs in the trend are estimated simultaneously. The example in section 9.4 shows that the proposed method is applicable. When the constant t_α is to be selected from the data, it can be determined by an information criterion if some candidates for t_α are given. However, further simulation studies are required for verification.

After peaks and troughs are estimated, inference on {d_n} and {u(k)}, from which a trend is generated, becomes possible. For example, prediction of the time point and height of the next peak can be considered.


9.6. References

[DRA 98] Draper N.R., Smith H., Applied Regression Analysis, 3rd edition, Wiley, New York, 1998.
[GAL 73] Gallant A.R., Fuller W.A., "Fitting segmented polynomial regression models whose join points have to be estimated", JASA, vol. 68, no. 341, pp. 144–147, 1973.
[KIM 09] Kim S.-J., Koh K., Boyd S., Gorinevsky D., "ℓ1 trend filtering", SIAM Review, vol. 51, no. 2, pp. 339–360, 2009.
[ŞEN 17] Şen Z., "Temporal trend analysis", in Innovative Trend Methodologies in Science and Engineering, Şen Z. (ed.), pp. 133–174, Springer, Berlin, 2017.
[TIB 14] Tibshirani R.J., "Adaptive piecewise polynomial estimation via trend filtering", Annals of Statistics, vol. 42, no. 1, pp. 285–323, 2014.
[WAT 15] Watanabe E., Watanabe N., "Weighted multivariate fuzzy trend model for seasonal time series", in Stochastic Modeling, Data Analysis and Statistical Applications, Filus L. et al. (eds), pp. 443–450, ISAST, 2015.

10

How to Model the Covariance Structure in a Spatial Framework: Variogram or Correlation Function?

The basic Kriging model assumes a Gaussian distribution with stationary mean and stationary variance. In such a setting, the joint distribution of the spatial process is characterized by the common variance and the correlation matrix or, equivalently, by the common variance and the variogram matrix. In this chapter, we will discuss in detail the option to actually use the variogram as a parameterization.

10.1. Introduction

This chapter is a development of the authors' paper [PIS 16]. An application was implemented in the paper by Vicario et al. [VIC 16]. In the interest of clarity, we allow here for a little overlap with the papers referred to above. We discuss the notion of a variogram as it is used in geostatistics, and we offer some preliminary thoughts about the possibility of a non-parametric approach to Universal Kriging that aims to use the Bayes methodology. Variograms are due to Matheron [MAT 62] and there are many modern expositions, for example Cressie [CRE 93, Chapter 2], Chiles and Delfiner [CHI 12, Chapter 2], Gneiting et al. [GNE 01], and Gaetan and Guyon [GAE 10, Chapter 1].

Chapter written by Giovanni PISTONE and Grazia VICARIO.

168

Data Analysis and Applications 4

In section 10.2, we give a brief overview of the so-called Universal Kriging model and its parameterization with the Matheron’s variogram function. In section 10.3, we define the general variogram matrix and give a necessary and sufficient condition for a positive variance σ 2 and a matrix Γ to be a variogram matrix of a covariance σ 2 R, where R is a correlation matrix. In section 10.4, we provide some useful computations concerning the inverse variogram matrix. In section 10.5, we discuss an interpretation of the variogram matrix as related to a projection of the Gaussian field. In section 10.6, we discuss the shape of the set of parameters of the general Kriging model. The final section is the conclusion of this chapter. 10.2. Universal Krige setup We consider a Gaussian n-vector Y , n ≥ 2, whose mean has the form μ = μ1 and whose covariance matrix Σ = [σij ]ni,j=1 has the constant diagonal σii = σ 2 , i = 1, . . . , n. The assumption on the mean and the diagonal terms is the weakest stationarity assumption, i.e. a first-order stationarity. We can write Y ∼ Nn (μ1, σ 2 R), where μ is a general mean value, σ 2 is the common variance, and R = [ρij ]ni,j=1 is a generic correlation matrix. The variogram of Y is the n × n matrix Γ = [γij ]ni,j=1 , whose element γij is half the variance of the difference Yi − Yj . As the mean value is constant, the variance of the difference is equal to the second moment of the difference. It is expressed in terms of the common variance σ 2 and the correlations ρij , i, j = 1, . . . , n, as 2γij = Var (Yi − Yj ) = σ 2 (ei − ej ) R(ei − ej ) = σ 2 (ρii + ρjj − 2ρij ) = 2σ 2 (1 − ρij ) , and, in matrix form, as Γ = σ 2 (11 − R). The simple Gaussian model described above is commonly used in geostatistics, when each random component Yi of the random vector Y is associated with a location xi , i = 1, . . . , n, in a given region X, xi ∈ X, i = 1, . . . , n.

How to Model the Covariance Structure in a Spatial Framework

169

We briefly describe the most common setup in geostatistics. The elements of the variogram matrix γij are assumed to be a given function γ of the distance between two locations, γi,j = γ(d(xi , xj )). In such a case, the statistical model is characterized by the choice of a distance d(x, y), x, y ∈ X, and by the choice of a function γ, called the variogram function, defined on a real domain containing {d(x, y)|x, y ∈ X}. The existence of a positive σ 2 and a correlation matrix R = [ρij ] such that σ 2 (1 − ρij ) = γ(d(xi , xj )) imposes a non-trivial condition on the function γ (see Sasvári [SAS 94] and Gneiting et al. [GNE 01]). Such a model, where it is assumed that the vector of means is constant μ = μ1, is called a Universal Krige model. We do not consider the more general case of a non-constant mean. Krige has further qualified this model by adding assumptions on the variance function and suggesting a statistical method to estimate the value at an untried point x0 given a set of observation Y1 , . . . , Yn at points x1 , . . . , xn . Specifically: 1) Krige’s modeling idea is to assume the variogram function γ to be an increasing function on [0, ∞[, so that the variogram’s values are increasing with the distance. Moreover, the correlation between locations is assumed to be positive. The rational is to model a variability that increases with the distance and is bounded by a general variance: 0≤

1 Var (Yi − Yj ) = γij = γ(d(xi , xj )) = σ 2 (1 − ρij ) ≤ σ 2 . 2

The increasing function γ : R+ → R+ is assumed to be continuous for every possible value. As it is bounded at +∞, the general shape is as shown in Figure 10.1. 2) The parameters in the Krige’s universal model are unrestricted values of μ ∈ R, σ 2 > 0 and restricted values of R that are usually estimated over a suitable parametric model. 3) Krige’s idea is to predict the value Y0 = Y (x0 ) at an untried location x0 with the conditional expectation based on a plug-in estimate of the parameters. If I = {1, . . . , n} and the locations in the model are x0 , x1 , . . . , xn , the regression value is   ΣI,I ΣI,0 −1  Y0 − μ = Σ0,I ΣI,I (YI − μ1I ), with Σ = . Σ0,I σ02

170

Data Analysis and Applications 4







   

 

 

 



    



  





Figure 10.1. A general variogram function is null at 0, can have jump at 0 which is called nugget and has a finite limit at +∞ named sill. The range is a length such that the value is equal to the limit value for any practical purpose

The set of data that give the same prediction is an affine plane in Rn . The variance of the prediction is σ02 − Σ0,I Σ−1 I,I ΣI,0 . In this chapter, we do not follow this approach, but we adopt a general non-parametric attitude, where μ is a real number, σ 2 is a positive real number, R is a positive definite matrix with unit diagonal, possibly with positive entries. The variogram matrix Γ is not restricted, and we do not enforce the existence of any special form. 10.3. The variogram matrix Our plan now is to express the Krige’s computations in terms of Matheron’s variogram matrix Γ. One good reason to use Γ as a basic parameter is that its empirical estimator is unbiased.

How to Model the Covariance Structure in a Spatial Framework

171

Let us discuss in some detail the basic transformation of matrix parameters: Γ = σ 2 (11 − R) = σ 2 11 − Σ .

[10.1]

We note that Γ = 0 if, and only if, R = 11 , such an extreme case being always excluded in the following. In fact, in most cases, we will assume det R = 0. The entries of Γ are non-negative and bounded by 2σ 2 because the correlations are bounded between −1 and 1. If all the correlations are non-negative, the entries of Γ are bounded by σ 2 . The difference between the covariance matrix and the variogram matrix is a matrix of rank 1, Γ + Σ = σ 2 11 . Moreover, let us remark that 1  1 11 = (Γ + Σ) n nσ 2

[10.2]

is the orthogonal projector on the space of constant vectors span [10.1]. We denote S= as the cone of non-negative definite matrices with constant diagonal, S=1 as the convex set of correlation matrices and V as the cone of variogram matrices. We have the following characterization of V. P ROPOSITION 10.1.– A non-zero matrix Γ is the variogram matrix of some covariance matrix of the form Σ = σ 2 R, with σ 2 > 0 and R being a correlation matrix, if, and only if, the following three conditions hold: 1) Γ is symmetric, and has zero diagonal; 2) Γ is conditionally negative definite, i.e. w Γw ≤ 0 if w 1 = 0; 3) sup {x Γx|x 1 = 1} ≤ σ 2 . P ROOF.– Assume Γ = σ 2 (11 − R) = 0, with R being a correlation matrix and σ 2 > 0. Condition 1 follows from the definition. If we write a generic vector as x = w + α1 with w 1 = 0, we have n2 σ 2 α2 = x Γx + x Σx . In particular, condition 2 follows because α = 0 implies x Γx = −x Σx. Finally, if x 1 = 1, i.e. nα = 1, we have x Γx − σ 2 = x Σx ≥ 0 and condition 3 follows.

172

Data Analysis and Applications 4

Conversely, let us consider the matrix 11 − σ −2 Γ. It is symmetric, with unit diagonal. We only need to show it is positive definite: x (11 − σ −2 Γ)x = (x 1)2 − σ −2 x Γx     1 1  2 −2 x Γ x ≥ 0. = (x 1) (1 − σ x 1 x 1 

because of condition 3.

The lower bound imposed on σ 2 means that the parameterization with σ 2 , carrying one degree of freedom, and with Γ, carrying n(n − 1)/2 degrees of freedom, has a drawback in that the two parameters are not independently defined on a product set. Note the relation 1 Γ1 = σ 2 (n2 − 1 R1) that we will discuss in section 10.4. In conclusion, there is a one-to-one transformation of parameters (σ 2 , Γ) ↔ (σ 2 , R) ↔ Σ with σ 2 ∈ R> , Γ ∈ V, R ∈ S=1 , Σ ∈ S= , namely: 1) The mapping from Σ ∈ S= to the couple (σ 2 , Γ) ∈ R> × V factors as  S= Σ →

1 Tr (Σ) , n



1 Tr (Σ) n

−2  Σ = (σ 2 , R) ∈]0, ∞[×S=1

and ]0, ∞[×S=1 (σ 2 , R) → (σ 2 , σ 2 (11 − R)) 



= (σ 2 , Γ) ∈ (σ 2 , Γ) Γ ∈ V, sup x Γx x 1 = 1 ≤ σ 2 . 2) The inverse is  2



 (σ , Γ) Γ ∈ V, sup x Γx x 1 = 1 ≤ σ 2 (σ 2 , Γ) → σ 2 11 − Γ = Σ ∈ S=

How to Model the Covariance Structure in a Spatial Framework

173

10.4. Inverse variogram matrix Γ−1 The two equations presented above, both based on the definition of the variogram matrix as Γ = σ 2 (11 − R), provide a simple connection between the parameterization based on the covariance matrix Σ and the parameterization based on the couple σ 2 and Γ. However, we want to spell out the computation of another key statistical parameter, namely the concentration matrix Σ−1 . We begin by recalling a well-known equation in matrix algebra [PRE 96]. We review the result in detail as we need an exact statement of the conditions under which it is true in our case. P ROPOSITION 10.2 (Sherman–Morrison formula).– Assume the matrix A is invertible. The matrix 11 − A is invertible if, and only if, 1 A−1 1 = 1. In such a case, det(11 − A) = (−1)n (1 − 1 A−1 1) det A , (11 − A)−1 = −A−1 − (1 − 1 A−1 1)−1 A−1 11 A−1 . P ROOF.– The multilinear expansion of det(11 − A) is written in terms of the adjoints (−A)ij of each element (−A)ij by 

det(11 − A) = det (−A) +

n n

(−A)ij

j=1 i=1

= (−1)n det A − (−1)n−1

n

Aij

i,j=1

= (−1)n det A − (−1)n−1 1(adj A)1 . As det A = 0, we can factor out (−1)n (det A) to obtain det(11 − A) = (−1)n (det A)(1 − 1 A−1 1) and the statement about the determinant follows. The inversion formula is directly checked.  We are concerned with the invertibility of Γ = σ 2 (11 − R); therefore, we need to discuss the condition 1 R−1 1 = 1.

174

Data Analysis and Applications 4

P ROPOSITION 10.3.– Let R be a correlation matrix and assume det R = 0. Let λj > 0, j = 1, . . . , n, be the spectrum of R and uj be a set of unit eigenvectors. It holds:

1) Tr R = nj=1 λj = n and det R = nj=1 λj ≤ 1, with equality if, and only if, R = In ; 2) Tr R−1 = nj=1 λ−1 j ≥ n with equality if, and only if, R = In ; 3) 1 R−1 1 = 1. P ROOF.–



1) n = Tr R = nj=1 λj . From det R = nj=1 λj , as the arithmetic mean is larger than the geometric mean: ⎛

n

j=1 λj

1=

n

≥⎝

n 

⎞1

n 1

λj ⎠ = (det R) n ,

j=1

with equality if, and only if, the λj ’s are all equal, hence equal to 1, which happens if R = In . 2) The geometric mean is larger or equal than the harmonic mean, hence ⎛ 1 n

1 ≥ (det R) = ⎝

n 

j=1

⎞1

⎞−1 ⎛ n ⎠ , λj ⎠ ≥ n ⎝ λ−1 j n

j=1

with equality if, and only if, λj = 1, j = 1, . . . , n. It follows that −1 1 n j=1 λj ≥ 1. n  3) We derive a contradiction from 1 = 1 R−1 1. As R−1 = nj=1 λ−1 j uj uj n and j=1 (1 uj )2 = 1 2 = n2 , 

1=1R

−1

1=

n j=1

 2 λ−1 j (1 uj )

=n

2

n j=1

(λj )−1 θj ,

How to Model the Covariance Structure in a Spatial Framework

where θj = (1 u)2 /n2 ≥ 0 and λ−1 , we obtain 1 = n2

n

n

⎛ (λj )−1 θj ≥ n2 ⎝

j=1

j=1 θj

n

175

= 1. From the convexity of λ → ⎞−1

λ j θj ⎠

,

j=1

hence the contradiction 1≤

n 1 1 1 λj θj ≤ 2 max {λj |j = 1, . . . , n} ≤ . n2 n n j=1

 From Proposition 10.3, we immediately have the following result of interest. P ROPOSITION 10.4.– Assume the correlation matrix Σ = σ 2 R ∈ S= is invertible. It follows that Γ = σ 2 (11 − R) is invertible, with Γ−1 = −Σ−1 − (σ −2 − 1 Σ−1 1)−1 Σ−1 11 Σ−1

[10.3]

Σ−1 = −Γ−1 − (σ −2 − 1 Γ−1 1)−1 Γ−1 11 Γ−1

[10.4]

and

P ROOF.– From the assumption, it follows det R = 0 so σ −2 − 1 Σ−1 1 = σ −2 (1 − 1 R1) = 0 , hence the conclusion.



We can now analyze the likelihood of the Gaussian model N(μ1, σ 2 R) in terms of the variogram.

176

Data Analysis and Applications 4

First, we compute the determinant of the correlation matrix     det σ 2 R = det σ 2 11 − Γ   = σ 2n det 11 − σ −2 Γ       = σ 2n det −σ −2 Γ + 1 adj −σ −2 Γ 1 = det (−Γ) − σ 2 1 adj (−Γ) 1 . Second, we compute the quadratic form of the concentration matrix   y  Σ−1 y = y  −Γ−1 − (σ −2 − 1 Γ−1 1)−1 Γ−1 11 Γ−1 y = −y  Γ−1 y − (σ −2 − 1 Γ−1 1)−1 (y  Γ−1 1)2 . Third, we compute the log-likelihood with μ = 0:   log p y σ 2 , Γ    1 n = − log (2π) − log det 11 − σ −2 Γ 2 2 1  − 2 y (11 − σ −2 Γ)−1 y 2σ   1 n = − log (2π) − log det (−Γ) − σ 2 1 adj (−Γ) 1 2 2 1 2 1  −1 + y Γ y + (σ − 1 Γ−1 1)−1 (y  Γ−1 1)2 . 2 2 The essentials of the computations that lead to a maximum likelihood estimation of Γ are the following. In the direction of a generic symmetric matrix with zero diagonal H,      dH (Γ → log det 11 − σ −2 Γ ) = Tr (σ 2 11 − Γ)−1 H and dH (Γ → y  (11 − σ −2 Γ)−1 y)   = σ 2 Tr (σ 2 11 − Γ)−1 yy  (σ 2 11 − Γ)−1 H .

How to Model the Covariance Structure in a Spatial Framework

177

so that the normal equations for Γ reduce to the condition −(σ 2 11 − Γ)−1 + (σ 2 11 − Γ)−1 yy  (σ 2 11 − Γ)−1

is diagonal.

The approach with parameters σ 2 , Γ is feasible in principle, but it does not appear promising in terms of ease of computation. In section 10.5, we will see a different, possibly better, approach. 10.5. Projecting on span (1)⊥ We now change our point of view to consider the same problem from a different angle suggested by the observation that the variogram does not change if we change the general mean μ. In fact, we can associate the variogram with the state space description of the Gaussian vector. The following proposition is a new characterization of the variogram matrix in our setting. P ROPOSITION 10.5.– 1) The matrix Γ is a variogram matrix of a covariance matrix Σ ∈ V+ if, and only if, the matrix     1   1  Σ0 = − I − 11 Γ I − 11 n n

[10.5]

is symmetric, positive definite and with constant diagonal. 2) If Y0 ∼ Nn (0, Σ0 ), then its variogram is Γ, and Y0 is supported by span (1)⊥ . P ROOF.– 1) If Γ = σ 2 (11 − R) is the variogram matrix of Σ = σ 2 R, then from equation [10.5] we have     1  1   Σ0 = σ I − 11 R I − 11 , n n 2

178

Data Analysis and Applications 4

which is indeed positive definite. Let us show that the diagonal elements of Σ0 are constant:     1  1   2  (Σ0 )ii = σ ei I − 11 R I − 11 ei n n     1 1 2 = σ ei − 1 R ei − 1 n n   2 1  2  = σ ei Rei − ei R1 + 2 1 R1 n n   1  = σ2 1 R1 − 1 n2 Conversely, let us assume Σ0 is a covariance matrix. As ei − ej ∈ span (1)⊥ , the variogram of Σ0 has elements (ei − ej ) Σ0 (ei − ej )     1  1    = (ei − ej ) I − 11 (−Γ) I − 11 (ei − ej ) n n = −(ei − ej ) Γ(ei − ej ) = −γii − γjj + 2γij = 2γij .     2) As 1 (ei − ej ) = 0, then 1 I − n1 11 (−Γ) I − n1 11 1 = 0, hence the distribution of Y0 is supported by the space span (1)⊥ .  It is possible to split every Gaussian Y with covariance matrix Σ ∈ S= according to the splitting Rn = span (1) ⊕ span (1)⊥ . The corresponding projections split the Gaussian process Y into two components: one with the covariance as in Proposition 10.5, and the other proportional to the empirical mean. Note that the two components have a singular covariance matrix. P ROPOSITION 10.6.– Let Y ∼ Nn (μ, Σ), Σ = σ 2 R ∈ S+ with variogram Γ = σ 2 (11 − R). Let Y = I − n1 11 Y ∼ Nn (0, Σ0 ) be the projection of Y onto span (1)⊥ so that we can write Y = Y + Y , where each component of Y is the empirical mean n1 1 Y . 1) The distribution of Y depends on the variogram only,      1 1   Y ∼ Nn 0, − I − 11 Γ I − 11 , n n

How to Model the Covariance Structure in a Spatial Framework

179

and the variogram matrix of Y is Γ. 2) The distribution of n1 1 Y , conditionally to Y , is Gaussian with mean μ. P ROOF.–         1  1  1   1   − I − 11 Γ I − 11 = I − 11 Σ I − 11 . n n n n  This suggests the following empirical estimation algorithm: 1) project the independent sample data y1 , . . . , yN onto span (1)⊥ by 1 subtracting the empirical mean y+ = n yi , to obtain y1 = y1 − y+ , . . . , yN = yN − y+ . Use the empirical estimator of the variogram matrix on the projected data; 2) estimate μ with the empirical mean. In the same way, we could suggest the simulation of a random variable with variogram matrix Γ by the generation of Nn (0, Σ0 ) data in span (1)⊥ . These two suggestions will be further discussed in future work. 10.6. Elliptope We now turn to the geometrical description of the set of variograms. From the basic equation Γ = σ 2 (11 − R), it follows that the set of variogram matrices is an affine image of the set of correlation matrices in the space of symmetric matrices. The set of correlation matrices is a convex bounded set whose geometrical shape has been studied in a number of papers, i.e. [ROU 94] and [RAP 07]. Such a shape is of central interest in a non-parametric Bayesian approach to the statistics of the Universal Kriging model. It also appears in convex optimization, which is called elliptope. Let us discuss the case n = 3.

180

Data Analysis and Applications 4

All principal minors of R are non-negative, ⎤⎞ 1xy det (R) = det ⎝⎣x 1 z ⎦⎠ = 1 − x2 − y 2 − z 2 + 2xyz ≥ 0 yz1 ⎛⎡

and 1 − x2 , 1 − y 2 , 1 − z 2 ≥ 0. The last three inequalities define the cube Q = [−1, +1]3 , while the equation 1 − x2 − y 2 − z 2 + 2xyz = 0 is a cubic algebraic variety whose intersection with the cube Q is the border of the elliptope. All horizontal sections z = c, −1 ≤ c ≤ 1, of the elliptope are the interior of the ellipses 1 − x2 − y 2 + 2cxy ≥ c2 . It is the same for other sections (see Figures 10.2 and 10.3). Various proposals of a priori distribution on the elliptope exist (see, for example, [BAR 00]).

Figure 10.2. The 3-D elliptope. For a color version of this figure, see www.iste.co.uk/makrides/data4

How to Model the Covariance Structure in a Spatial Framework

181

Figure 10.3. Algebraic variety. For a color version of this figure, see www.iste.co.uk/makrides/data4

Going forward with the discussion of our example, the volume is easily computed and the uniform a priori is defined. Simulation is feasible for example by the rejection method. Another option is to write R = A A, where the columns of A are unit vectors. This gives another possible a priori starting from independent unit vectors. Simulation is feasible, for example starting with independent standard Gaussians. An interesting option is the Cholesky representation. A symmetric matrix A is positive definite if there exists an upper triangular matrix ⎤ ⎡ ⎤ ⎡ t11 t12 t13 t1 T = ⎣t2 ⎦ = ⎣ 0 t22 t23 ⎦ , t3 0 0 t33

tii ≥ 0

such that ⎤ t211 t11 t12 t11 t13 t212 + t222 t12 t13 + t22 t23 ⎦ . A = T  T = [ti · tj ]ij = ⎣t11 t12 t11 t13 t12 t13 + t22 t23 t213 + t223 + t233 ⎡

Moreover, t11 t22 t33 = 0 ⇔ T is unique and invertible if, and only if, A is invertible. It is an identifiable parameterization for non-singular matrices.

182

Data Analysis and Applications 4

In the case of the correlation matrix, R = T  T with ⎤ ⎡  ⎤ ⎡1 − t2 − t2 t12 t13 t1 12 13  ⎢ ⎥ + . T = ⎣t2 ⎦ = ⎣ 0 1 − t223 t23 ⎦ , ti ∈ 0i−1 × Sn−1+1  t3 0 0 1 It follows: ⎡

1



1 − t212 − t213 t12

⎢ 2 2 R=⎢ 1 ⎣ 1 − t12 − t13 t12   1 − t212 − t213 t13 t12 t13 + 1 − t223 t23



⎤ 1 − t212 − t213 t13 ⎥  t12 t13 + 1 − t223 t23 ⎥ ⎦ 1

and det (R) = (1 − t212 − t213 )(1 − t223 ) . 10.7. Conclusion Over the last few decades, Kriging models have been recommended not only for the original application, but also for spatial noisy data in general. The accuracy of this model strongly depends on the detection of the correlation structure of the responses. In the Bayesian approach, where the posterior distribution of a prediction Krige’s Y0 given the training set (Y1 , . . . , Yn ) requires less uncertainty as possible on the correlation function, the use of the variogram as a parameter should be preferred because it does not demand a parametric approach as the correlation estimation does. In a previous paper [PIS 16], the authors proved the equivalence between the variogram and spatial correlation function for stationary and intrinsically stationary processes. This study has been devoted to the characterization of matrices which are admissible variograms in the case of first-order stationarity. 10.8. Acknowledgements This paper was presented at the 4th Stochastic Modeling Techniques and Data Analysis International Conference, June 1–4, 2016, University of Malta, Valletta, Malta, with the title Bayes and Krige: Generalities. We thank both G. Kon Kam King (CCA, Torino) and L. Malagò (RIST, Cluj-Napoca) for

How to Model the Covariance Structure in a Spatial Framework

183

suggesting relevant references. We also thank E. Musso (Politecnico di Torino) for helping with the graphical representation of the elliptope. G. Pistone is supported by de Castro Statistics, Collegio Carlo Alberto, Turin, and he is a member of GNAFA-INDAM. 10.9. References [BAR 00] BARNARD J., M CCULLOCH R., M ENG X.-L., “Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage”, Statist. Sinica, vol. 10, no. 4, pp. 1281–1311, 2000. [CHI 12] C HILÈS J.-P., D ELFINER P., Geostatistics: Modeling Spatial Uncertainty, 2nd edition, John Wiley & Sons Inc., Hoboken, NJ, USA, 2012. [CRE 93] C RESSIE N.A.C., Statistics for Spatial Data, John Wiley & Sons Inc., Hoboken, NJ, USA, 1993. [GAE 10] G AETAN C., G UYON X., Spatial Statistics and Modeling, Translated by Kevin Bleakley, Springer, New York, USA, 2010. [GNE 01] G NEITING T., S ASVÁRI Z., S CHLATHER M., “Analogies and correspondences between variograms and covariance functions”, Adv. in Appl. Probab., vol. 33, no. 3, pp. 617–630, 2001. [MAT 62] M ATHERON G., Traité de Géostatistique Appliqué, Éditions Technip, Paris, France, 1962. [PIS 16] P ISTONE G., V ICARIO G., “A note on semivariogram”, in D I BATTISTA T., M ORENO E., R ACUGNO W. (eds), Topics on Methodological and Applied Statistical Inference, Springer, Cham, Switzerland, 2016. [PRE 96] P RESS W.H., T EUKOLSKY S.A., V ETTERLING W.T. et al., Numerical Recipes: The Art of Scientific Computing, Cambridge University Press, Cambridge, NY, USA, 1996. [ROU 94] ROUSSEEUW P.J., M OLENBERGHS G., “The shape of correlation matrices”, Amer. Statist., vol. 48, no. 4, pp. 276–279, 1994. [RAP 07] R APISARDA F., B RIGO D., M ERCURIO F., “Parameterizing correlations: A geometric interpretation”, IMA J. Manag. Math., vol. 18, no. 1, pp. 55–73, 2007. [SAS 94] S ASVÁRI Z., Positive Definite and Definitizable Functions, vol. 2, Akademie Verlag, Berlin, Germany, 1994. [VIC 16] V ICARIO G., C RAPAROTTA G., P ISTONE G., “Meta-models in computer experiments: Kriging vs artificial neural networks”, Quality and Reliability Engineering International, vol. 32, pp. 2055–2065, 2016.

11 Comparison of Stochastic Processes

This chapter explores a distance that enables us to compare Markovian processes. It shows the relationship of this distance to the Kullback–Leibler divergence and reveals its stochastic behavior in terms of the chi-square distribution. The distance enables us to decide whether there is any discrepancy between two samples of stochastic processes. When a discrepancy exists, the use of this distance allows us to find the strings where the discrepancy is manifested. We apply the distance to written texts of European Portuguese coming from two authors: Vieira, 1608 and Garrett, 1799. In the application, the distance reveals the linguistic configurations that expose discrepancies between written texts of different genres from the same author. This type of result could characterize linguistic genres and varieties in the same language.

11.1. Introduction By comparing several processes, it is possible to tackle real problems. In linguistics, for instance, different written texts of a single language should point out identical characteristics associated with the language, common to all of them. A comparison of texts would also be useful to point out linguistic varieties existing within a language (see [GAL 12]). But process comparison can also be implemented to processes that operate in parallel, for example, in the industrial field, often there are imposed operational constraints for processes to exhibit a similar behavior, in order to obtain a standard final material. On the other hand, the certainty that parallel processes follow the same behavior facilitates the implementation of maintenance control

Chapter written by Jesús Enrique G ARCÍA, Ramin G HOLIZADEH and Verónica Andrea G ONZÁLEZ -L ÓPEZ. Data Analysis and Applications 4: Financial Data Analysis and Methods, First Edition. Edited by Andreas Makrides, Alex Karagrigoriou and Christos H. Skiadas. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

186

Data Analysis and Applications 4

strategies. For this reason, it is relevant to be able to measure the similarity between processes. In García and González-López (2015) [GAR 15], a criterion d is proposed to achieve this objective. d is based on the conception of Partition Markov Models formulated over discrete Markov processes with finite memory and finite alphabets (see [GAR 17b]). When the processes have the same law and the samples are large enough, it is possible to prove that d converges to 0 almost surely. In this work, we explore other properties of this criterion, in order to construct a distance in the strict sense of the word. We show the relation that the distance d has with the Kullback–Leibler divergence, and we give a notion about its behavior in terms of the chi-square distribution. In addition, we apply this distance to a real problem. 11.2. Preliminaries Let (Xt ) be a discrete time (order o < ∞) Markov chain on a finite alphabet A. Let us call S = Ao the state space and denote the string am am+1 . . . an by anm , where ai ∈ A, m ≤ i ≤ n. For each a ∈ A and s ∈ S, P (a|s) = t−1 Prob(Xt = a|Xt−M = s). In a given sample xn1 , coming from the stochastic process, the number of occurrences of s in the sample xn1 is denoted by Nn (s) and the number of occurrences of s followed by a in the sample xn1 is denoted by Nn (s, a). In this way, NNnn(s,a) (s) is the estimator of P (a|s). D EFINITION 11.1.– Consider two Markov chains (X1,t ) and (X2,t ), of order o, k with finite alphabet A and state space S = Ao . With sample xnk,1 , for k = 1, 2 respectively. For any s ∈ S, 1 2 ds (xn1,1 , xn2,1 )

   Nn1 (s, a) α = Nn1 (s, a) ln (|A| − 1) ln(n1 + n2 ) Nn1 (s) a∈A   Nn2 (s, a) +Nn2 (s, a) ln Nn2 (s)   Nn1 +n2 (s, a) −Nn1 +n2 (s, a) ln Nn1 +n2 (s)

with Nn1 +n2 (s, a) = Nn1 (s, a) + Nn2 (s, a), Nn1 +n2 (s) = Nn1 (s) + Nn2 (s), 1 where Nn1 and Nn2 are given as usual, computed from the samples xn1,1 and n2 x2,1 respectively, and α being a real and positive value.

Comparison of Stochastic Processes

187

The most relevant properties of d are listed below. Both properties are consequence of results proved in García and González-López (2017) [GAR 17b]: 1 2 i) The function ds (xn1,1 , xn2,1 ) is a distance between the Markov chains relative to the specific string s ∈ S. If (Xi,t ), i = 1, 2, 3 are Markov chains under the assumptions of definition 11.1, with samples xni,1i , i = 1, 2, 3 respectively 1 2 ds (xn1,1 , xn2,1 ) ≥ 0 with equality ⇔

Nn2 (s, a) Nn1 (s, a) = ∀a ∈ A, Nn1 (s) Nn2 (s)

1 2 2 1 ds (xn1,1 , xn2,1 ) = ds (xn2,1 , xn1,1 ), 1 2 1 3 3 2 ds (xn1,1 , xn2,1 ) ≤ ds (xn1,1 , xn3,1 ) + ds (xn3,1 , xn2,1 ).

ii) Local behavior of processes laws. If the stochastic laws of (X1,t ) and 1 2 (X2,t ) are the same in s, then ds (xn1,1 , xn2,1 ) −→ 0. min(n1 ,n2 )→∞

Otherwise,

1 2 ds (xn1,1 , xn2,1 )

−→

min(n1 ,n2 )→∞

∞.

In the following result, we show the relationship between this distance and the Kullback–Leibler divergence D(P ||Q), a concept commonly used in this topic, but that does not constitute a distance. We also show the asymptotic behavior of the distance. We will use the following notations:   P (a) 2 D(P (·)||Q(·)) = a∈A a∈A P (a) ln( Q(a) ) and χ (P (·), Q(·)) = (P (a)−Q(a))2 , Q(a)

for two distributions P and Q defined in the alphabet A, with Q(a) = 0, a ∈ A. First, we will see how the quantity D(P (·)||Q(·)) behaves under certain conditions on P and Q. Consider the function f (x) = x ln(x), near to x = 1, by the Taylor’s expansion we have f (x) = (x − 1)+ (x−1)2 + δ(x)(x − 1)2 , where δ(x) = − (x−1) for some value t ∈ (x, 1) 2 6t2 (Lagrange’s form). We note that when x → 1, δ(x) → 0. Thus, for two probability distributions P and Q in A,  P (a)  P (a) = Q(a)f P (a) ln Q(a) Q(a) = P (a) − Q(a) + +δ

1 (P (a) − Q(a))2 2 Q(a)

 P (a) (P (a) − Q(a))2 Q(a)

Q(a)

,

188

Data Analysis and Applications 4

for a ∈ A,   P (a) (P (a) − Q(a))2 1 δ D(P (·)||Q(·)) = χ2 (P (·), Q(·)) + 2 Q(a) Q(a)

[11.1]

a∈A

and D(P (·)||Q(·)) 1 = + 2 χ (P (·), Q(·)) 2 If

P (a) Q(a)









 a∈A δ

P (a) Q(a)



(P (a)−Q(a))2 Q(a)

χ2 (P (·), Q(·))

→ 1, given  is positive and small enough, |δ 

a∈A δ

P (a) Q(a)



(P (a)−Q(a))2 Q(a)

χ2 (P (·), Q(·))



. P (a) Q(a)

[11.2]

| <  and

D(P (·)||Q(·)) 1

→ .

< , so 2 χ (P (·), Q(·)) 2

If one of the probabilities is the empirical distribution, say Pˆ (a) = X(a) k , where the occurrences of a in the sample of size k is denoted by X(a), and the 2  . sample is generated from the law Q, χ2 (Pˆ (·)||Q(·)) = k1 a∈A (X(a)−kQ(a)) kQ(a) 2  (X(a)−kQ(a)) Thus, if we introduce the quantity χ2,k (Pˆ (·), Q(·)) = a∈A , kQ(a) we can recognize the typical chi-square statistic. From equation [11.1], we obtain D(Pˆ (·)||Q(·)) =

1 2,k ˆ χ (P (·), Q(·)) 2k  1  Pˆ (a) (X(a) − kQ(a))2 δ + k Q(a) kQ(a)

[11.3]

a∈A

and when

Pˆ (a) Q(a)

→ 1,

1 D(Pˆ (·)||Q(·)) . → 2,k ˆ 2k χ (P (·), Q(·)) If we have two samples of sizes k1 and k2 generated from the law W, with Y (a) ˆ empirical distribution Pˆ (a) = X(a) k1 and Q(a) = k2 respectively. We obtain

Comparison of Stochastic Processes

189

(equation [11.1]) ˆ D(Pˆ (·)||Q(·)) =

 Pˆ (a) (X(a) − k W (a))2 1   W (a)  1 1 +δ ˆ ˆ k1 2 k W (a) 1 Q(a) Q(a) a∈A

 Pˆ (a) (Y (a) − k W (a))2 1   W (a)  1 2 +δ + ˆ ˆ k2 2 k W (a) 2 Q(a) Q(a) a∈A

+



1 + 2δ

a∈A

Hence, when

 Pˆ (a)  W (a) (Pˆ (a) − W (a)) −1 . ˆ ˆ Q(a) Q(a)

W (a) ˆ Q(a)

→ 1 and

Pˆ (a) ˆ Q(a)

→ 1,

ˆ D(Pˆ (·)||Q(·))

1 2,k1 (P ˆ (·), W (·)) 2k1 χ

+

[11.4]

1 2,k2 (Q(·), ˆ W (·)) 2k2 χ

→ 1.

[11.5]

These simple relationships between empirical distributions allows us to delineate the behavior of the distance ds (definition [11.1]). T HEOREM 11.1.– Let (Xk,t ) be a Markov chain of order o, with finite alphabet k A, state space S = Ao and xnk,1 a sample of the process for k = 1, 2. Consider  Nnk (s,·) Nn1 +n2 (s,·) also s ∈ S. If D Nn (s) Nn +n (s) < ∞, for k = 1, 2, then 1

k

1 2 ds (xn1,1 , xn2,1 )=

2

α (|A| − 1) ln(n1 + n2 )  N (s, ·) N  nk n1 +n2 (s, ·)

. Nnk (s)D Nnk (s) Nn1 +n2 (s)

k=1,2

When

Nnk (s,·)/Nnk (s) W (·)

2 ln(n1 + n2 )  k=1,2

→ 1 for k = 1, 2,

(|A| − 1) 1 2 ds (xn1,1 , xn2,1 ) ∼d α

χ2,Nnk (s)

 N (s, ·) N nk n1 +n2 (s, ·) , W (·) + χ2,Nn1 +n2 (s) , W (·) , Nnk (s) Nn1 +n2 (s)

190

Data Analysis and Applications 4

where ∼d means similarity in distribution. 1 2 ds (xn1,1 , xn2,1 ) is P ROOF.– Note that ln(n1 + n2 ) (|A|−1) α

   Nn1 (s, a) Nn2 (s, a) + Nn2 (s, a) ln = Nn1 (s, a) ln Nn1 (s) Nn2 (s) a∈A   Nn1 +n2 (s, a) −(Nn1 (s, a) + Nn2 (s, a)) ln . Nn1 +n2 (s)      N (s, a)   Nn1 +n2 (s, a) n1 − ln Nn1 (s, a) ln = Nn1 (s) Nn1 +n2 (s) a∈A         Nn2 (s, a) Nn1 +n2 (s, a) + Nn2 (s, a) ln − ln Nn2 (s) Nn1 +n2 (s) a∈A   Nn (s, a)  Nn (s, a) Nn +n (s, a)  k k ln / 1 2 = Nnk (s) Nnk (s) Nnk (s) Nn1 +n2 (s) 



a∈A

k=1,2



=

Nnk (s)D

k=1,2

 N (s, ·) N nk n1 +n2 (s, ·)

. Nnk (s) Nn1 +n2 (s)

Following equation [11.5]: 

Nnk (s)D

k=1,2

 N (s, ·) N n1 +n2 (s, ·) nk

∼d Nnk (s) Nn1 +n2 (s)

 Nn (s,·)  Nn (s) χ2,Nnk (s) Nnk (s) , W (·) k k 2 Nnk (s)

k=1,2

+

χ2,Nn1 +n2 (s)



Nn1 +n2 (s,·) Nn1 +n2 (s) , W (·)





Nn1 +n2 (s)

Then, 2 ln(n1 + n2 )

(|A| − 1) 1 2 ds (xn1,1 , xn2,1 ) ∼d α

.

Comparison of Stochastic Processes

 k=1,2

χ2,Nnk (s)

191

 N (s, ·) N nk n1 +n2 (s, ·) , W (·) + χ2,Nn1 +n2 (s) , W (·) . Nnk (s) Nn1 +n2 (s)

11.3. Application to linguistic data Tycho Brahe corpus is an annotated historical corpus, freely accessible at Galves, Andrade and Faria [GAL 10]. This corpus uses the chronological criterion of the author’s birthdate to assign a time for written texts. The subset of written texts included in this study listed in Table 11.1 consists of six texts from two authors. Author Vieira Vieira Vieira Date 1608 1608 1608 Type Dissertation Letters Sermons Notation 1608d 1608c 1608s Author Garrett Garrett Garrett Date 1799 1799 1799 Type Letters Narrative Theater Notation 1799c 1799n 1799t Table 11.1. The set of the Tycho Brahe corpus

Linguistic studies show that the variability observed in different written texts of European Portuguese involves, among other aspects, changes in the proportion of occurrence of the placement of the stress on the last or the penultimate syllable of the word and alterations in the use of monosyllables, with or without stress (see, for example, [FRO 12]). For this reason, we guide our inspection to the position in the word occupied by the stress and the size of the word (number of syllables). Each written text was processed with a slightly modified version of the perl-code “silaba” by Miguel Galves1. The software was used to extract two components of each orthographic word, denoted by (i, j), where i is the total number of syllables which compound the word, i = 1, 2, ..., 8, and j indicates the syllable (from left to right) in which the stress on the word is registered. When j = 0 means there is no stress on the word. The period (end of sentence) was codified as (0, 0). The alphabet A used here was defined as shown in Table 11.2.

1 This can be freely downloaded for academic purposes from www.ime.usp.br/∼tycho/ prosody/vlmc/tools/sil4.pl.

192

Data Analysis and Applications 4

Orthographic word code Element in the alphabet A Meaning (0, 0) 0 End of sentence (1, 1) 1 Monosyllable with stress (1, 0) 2 Monosyllable without stress (2, 2) 3 Dissyllable – stress on the last syllable (2, 1) 4 Dissyllable – stress on the first syllable (i, i), i ≥ 3 6 Oxytone word (i, i − 1), i ≥ 3 7 Paroxytone word (i, i − 2), i ≥ 3 8 Proparoxytone word Table 11.2. Definition of the alphabet A

We can define 1 2 dmax = max{ds (xn1,1 , xn2,1 ), s ∈ S}

[11.6]

smax = arg max{dmax}.

[11.7]

and

1 2 Observe that dmax <  if and only if ds (xn1,1 , xn2,1 ) < , ∀s ∈ S. That is, a small value of dmax indicates that the stochastic laws on s are similar for all s ∈ S. In other words, the distributions of the processes are similar.

As seen in definition 11.1, if the stochastic laws of (X1,t ) and (X2,t ) are the same in s, then 1 2 ds (xn1,1 , xn2,1 )

−→

min(n1 ,n2 )→∞

0.

In the same way, if the local laws for s are different, then 1 2 , xn2,1 ) ds (xn1,1

−→

min(n1 ,n2 )→∞

∞.

We can see that if dmax is large, smax is exactly the string we want to recognize, as being relevant in terms of discrepancy, but all the strings with a large relative value of d will reveal changes on the local laws of the processes relative to the string. In this application, a value larger than 1 will be considered significant.

Comparison of Stochastic Processes

193

We note that the comparison is made between the different texts of the same author. The memory o used in this application is equal to 2. Other studies in the area show that the strings 2-4, 7-2 and 2-7 (see Tables 11.3 and 11.4) are volatile configurations of European Portuguese (from the 16th Century to the 19th Century) – see García et al. (2017) [GAR 17a]. We can see that this characteristic persists when analyzing the variability of different written texts by the same author, whether that author is Vieira or Garrett. ds (1608c, 1608d) 1.02591 1.11191 1.13048 2.14046

s ds (1608c, 1608s) 7-6 1.18101 1-6 1.28567 3-6 1.98674 7-2 3.86756

ds (1799c, 1799t) 1.13432 1.20717 1.29197 2.15512 2.35864 2.84146 3.40959 3.46598

s ds (1608d, 1608s) 4-7 1.07770 2-3 1.07883 2-7 1.33124 2-4 1.67395 1.74245

s ds (1799t, 1799n) 1-7 1.01398 4-4 1.07517 7-0 1.24806 4-2 1.34589 4-7 2.56588 2-7 2.57690 7-2 3.56924 2-4 3.74332 4.49460

s 1-7 4-4 4-7 2-4 2-7

s 1-7 6-2 1-2 3-2 2-4 4-7 4-2 2-7 7-2

Table 11.3. Cases with values of d > 1 :1608c-1608d, 1608c-1608s,1608d-1608s, 1799c-1799t and 1799t-1799n. In bold, the dmax value (see equation [11.6]) and the smax string (see equation [11.7])

String Meaning 2–4 A monosyllable without stress followed by a dissyllable with stress on the first syllable 2–7 A monosyllable without stress followed by a paroxytone word 7–2 A paroxytone word followed by a monosyllable without stress Table 11.4. Meaning of each smax detected by dmax

Table 11.5 shows the transition probabilities P (a|smax) ∀a ∈ A, for each pair of compared texts. With this information, we can check the differences

194

Data Analysis and Applications 4

between the written texts in relation to the prosodic construction, for example P (2|7 − 2) is 0.34777 in the text 1799t (theater) and it goes to 0.15735 in the written text 1799n (narrative), both texts from Garrett. Moreover, the most probable choice for the second text, since the string 7-2 has been observed, is 7 (P (7|7 − 2) = 0.3241). a ∈ A smax: 7-2 1608c 0 0.00000 1 0.09461 2 0.20037 3 0.07616 4 0.31379 6 0.03763 7 0.26082 8 0.01662

(2.14046) smax: 2-4 1608d 1608c 0.00000 0.02277 0.09927 0.06624 0.15025 0.31714 0.04350 0.04251 0.29417 0.20402 0.03986 0.02638 0.34266 0.31144 0.03028 0.00949

a ∈ A smax: 2-4 1799c 0 0.05469 1 0.06448 2 0.28798 3 0.04389 4 0.19514 6 0.02971 7 0.30959 8 0.01452

(3.86756) smax: 2-7 1608s 1608d 0.06825 0.05221 0.07620 0.05200 0.35255 0.50989 0.04473 0.02926 0.22949 0.16547 0.02175 0.02337 0.19307 0.15053 0.01397 0.01726

(3.46598) smax: 7-2 1799t 1799t 0.13364 0.01100 0.10649 0.13265 0.27959 0.34777 0.04200 0.06598 0.24608 0.23505 0.01273 0.02749 0.17522 0.17113 0.00424 0.00893

(1.74245) 1608s 0.11444 0.05488 0.47124 0.03284 0.16809 0.02203 0.12464 0.01183

(4.49460) 1799n 0.00183 0.10353 0.15735 0.05153 0.29684 0.02863 0.32410 0.03619

Table 11.5. Conditional probabilities P (a|smax), ∀a ∈ A computed from each written text: 1608c, 1608d; 1608c, 1608s; 1608d, 1608s; 1799c, 1799t; 1799t, 1799n

We can define three groups of strings: (i) strings that show discrepancies between Vieira’s texts but not in the case of Garrett’s texts; (ii) strings that show discrepancies between Garrett’s texts and not in the case of Vieira’s texts; and (iii) strings that show discrepancies between texts for each of these authors (see the detailed description of each group in Table 11.6). Values of d greater than 1 have not been detected in the comparison between the texts: 1799c (letters) and 1799n (narrative). Thus, these texts can be considered as coming from the same Markovian process.

Comparison of Stochastic Processes

Author String Vieira 1–6 2–3 3–6 7–6 Garrett

1–2 3–2 4–2 6–2 7-0

Both

1–7 4–4 4–7

195

Meaning A monosyllable with stress followed by an oxytone word A monosyllable without stress followed by a dissyllable with stress on the last syllable A dissyllable with stress on the last syllable followed by an oxytone word A paroxytone word followed by an oxytone word A monosyllable with stress followed by a monosyllable without stress A dissyllable with stress on the last syllable followed by a monosyllable without stress A dissyllable with stress on the first syllable followed by a monosyllable without stress An oxytone word followed by a monosyllable without stress A paroxytone word followed by end of sentence A monosyllable with stress followed by a paroxytone word A dissyllable with stress on the first syllable followed by a dissyllable with stress on the first syllable A dissyllable with stress on the first syllable followed by a paroxytone word

Table 11.6. Strings (see Table 11.3) and meaning of the linguistic compositions that characterize the variability between the texts of the same author. We also list the strings (with d > 1) that are common among the authors; the constructions listed in Table 11.4 are excluded

11.4. Conclusion The distance proposed in this chapter has a clear relation to the Kullback–Leibler divergence, which is shown in Theorem 11.1. In addition, the adequately scaled distance has its stochastic behavior described by a sum of chi-square dependent random variables, which is also seen in Theorem 11.1. In relation to the application, note that the distance introduced here makes it possible to decide whether two Markovian stochastic processes follow the same law or not. Furthermore, it enables us to identify discrepancies, pointing out the strings responsible for them.

196

Data Analysis and Applications 4

11.5. References [FRO 12] F ROTA S., G ALVES C., V IGÁRIO M. et al., “The phonology of rhythm from Classical to Modern Portuguese”, Journal of Historical Linguistics, vol. 2.2, pp. 173–207, 2012. [GAL 10] G ALVES C., DE A NDRADE A.L., FARIA P., Parsed Corpus of Historical Portuguese, December 2010. http://www.tycho.iel.unicamp.br/∼tycho/corpus/texts/psd.zip.

Tycho Brahe Available at:

[GAL 12] G ALVES A., G ALVES C., G ARCIA J.E. et al., “Context tree selection and linguistic rhythm retrieval from written texts”, The Annals of Applied Statistics, vol. 6, no. 1, pp. 186– 209, 2012. [GAR 17a] G ARCÍA J.E., G HOLIZADEH R., G ONZÁLEZ -L ÓPEZ V.A., “Linguistic compositions highly volatile in Portuguese”, Cadernos de Estudos Lingüísticos, vol. 59, no. 3, pp. 617–630, 2017. [GAR 17b] G ARCÍA J.E., G ONZÁLEZ -L ÓPEZ V.A., “Consistent estimation of partition Markov models”, Entropy, vol. 19, no. 4, p. 160, 2017. [GAR 15] G ARCÍA J.E., G ONZÁLEZ -L ÓPEZ V.A., “Detecting Regime Changes in Markov Models”, in M ANCA R., M C C LEAN C., S KIADAS C.H. (eds), New Trends in Stochastic Modeling and Data Analysis, pp. 103–109, ISAST, Lisbon, 2015.

PART 3

Demographic Methods and Data Analysis

Data Analysis and Applications 4: Financial Data Analysis and Methods, First Edition. Edited by Andreas Makrides, Alex Karagrigoriou and Christos H. Skiadas. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

12 Conjoint Analysis of Gross Annual Salary Re-evaluation: Evidence from Lombardy ELECTUS Data

This chapter discusses annual salary enhancements for newly hired employees in companies taking on new graduates. The decision process for selecting candidates with the best skill sets seems to be one of the most difficult obstacles to overcome. In particular, the analysis is based on the Education-for-Labor Elicitation from Companies’ Attitudes towards University Studies project (ELECTUS), involving 471 enterprises, with 15 or more employees, operating in Lombardy. The recruiters’ preference analysis was carried out using conjoint analysis (CA). Starting from CA partworth utilities, a coefficient of economic revaluation was used to compare different salary profiles for five job vacancies.

12.1. Introduction In recent decades, tertiary-level education has expanded rapidly across many countries, as well as in Italy. In general, the expectation is that higher education should prepare young people to become highly productive and successful in the labor market. Sometimes, the skills required of the graduates for the job do not coincide with the skills offered by the graduates applying, Chapter written by Paolo M ARIANI, Andrea M ARLETTA and Mariangela Z ENGA.

200

Data Analysis and Applications 4

creating a mismatch between education and the labor market. In general, the mismatch occurs when: “...there is a difference between the skills a worker provides and the skills necessary for the job. In particular, working in a job below an individual’s level of skills limits individual productivity and leads to underutilization of education” [EUR 08]. The mismatch in Italy has been well documented by The Italian National Institute of Statistics (ISTAT), using data from a sample survey on university graduates’ vocational integration1. In fact, in 2015, the transition into the labor market was rather difficult for Humanities graduates (with 61.7% of the bachelor’s degree graduates and 73.4% of the master’s degree graduates being employed) and Earth Sciences graduates (with 58.6% of the bachelor’s degree graduates and 76.5% of the master’s degree graduates being employed). On the other hand, the master’s degree graduates in Defence and Security, Medicine and Engineering had the highest employment levels (99.4%, 96.5% and 93.9%, respectively). Moreover, 52.8% of the bachelor’s graduates and 41.9% of the master’s graduates found employment in “non-stable” jobs. Of course, this information has to be considered in light of the worldwide economic crisis, which significantly impacted the employment situation in the labor market in general. Looking at the Italian situation in 2015, while the overall unemployment rate was at 11.9%, the unemployment rate among young people (15–24 years old) increased dramatically to 40.3%. Moreover, unemployment in the service sector hit 16%2. This chapter concerns labor market inclusion policies for new graduates and the relationships between enterprises and universities. The study is based on the multi-center research project, Education-for-Labor Elicitation from Companies’ Attitudes towards University Studies (ELECTUS) [FAB 15], which involved several Italian universities. This work has three main objectives. First, it focuses on the identification of an ideal graduate profile for several job positions. Second, it seeks recommendations of some broad skills, universally recognized as “best practice” for graduates. Finally, the analysis attempts to give a comparative view of the differences between, and the assessment of, wages and skill sets for new graduates. 1 See: htpp://www.istat.it/en/archive/190700 2 See: http://ec.europa.eu/eurostat/data/database

Conjoint Analysis of Gross Annual Salary Re-evaluation

201

This chapter is structured as follows: section 1 serves as an introductory preface; section 2 introduces the methodology of conjoint analysis and the coefficient of economic valuation; section 3 presents the results from the ELECTUS research and section 4 provides the conclusions. 12.2. Methodology Conjoint analysis (CA) is among the methods that are mostly used to analyze consumer choices and to assign consumers’ utility, drawn from the properties of single characteristics of goods, services or, as in this application, jobs being offered on the market. Different models that have been used to measure the economic value derived from CA are described in the literature. In [BUS 04], the monetary value of a utility unit is computed as the ratio of the difference between its maximum price and its minimum price, compared to related utilities. The quantification of the monetary value of the utility, of a given attribute, is obtained by multiplying the monetary value for the utility perceived by customers, with the better or worse level of the attribute. Hu et al. [HU 12] introduced some measures of customers’ willingness to pay (WTP) following an interpretation of part-worth utilities and the offer of monetary values for various attributes. The ratio is created if a change in attribute increases welfare. Therefore, an individual will pay more to have that change in attribute and vice versa [DAR 08]. Following Louviere [LOU 00], these measures of WTP are calculated as the part-worth utility for the various attribute levels, divided by the negative of the marginal utility of income. In this work, the conjoint rating response format is used to gather and use additional information about respondents’ preferences. This preference model uses a part-worth utility linear function. Part-worth utilities are also assumed for each level of the various attributes, estimated by using ordinary least squares (OLS) multiple regression. In this formulation, attention is focused on a rating scale, opting for a very general preference model used in traditional CA. In fact, of all the attribute levels that describe the graduate, the information contained in the conjoint rating format is exploited by regressing individual

202

Data Analysis and Applications 4

responses on a piecewise linear function. A non-metric estimation procedure such as MONANOVA might be more appropriate than OLS, since the conjoint data are collected on a non-metric scale. However, as demonstrated in Carmone [CAR 78] and Cattin [CAT 82], OLS regression provides similar parameter estimates for both ranking and rating scales. Hence, OLS seems to be an appropriately reliable estimation procedure. The function is defined as follows: Uk =

n 

βi xik

[12.1]

i=0

where xi is equal to 1 and n is the number of all levels of the attributes which define the combination of a given good. Each xik variable is a dichotomous variable, which refers to a specific attribute level. This variable is equal to 1 if the corresponding attribute level is present in the combination of attributes that describes the alternative k. Otherwise, that variable will be 0. As a result, the utility associated with the alternative k (Uk ) is obtained by summing the terms βi xik over all attribute levels, where βi is the partial change in Uk for the presence of the attribute level i, with all other variables remaining constant. In this work, we refer to this piecewise linear function as a part-worth function model that gives a specific utility value for each level of the considered attributes, usually referred to as part-worth utility. Consequently, the number of parameters estimated by assuming the part-worth specification is greater than what is required by alternative preference model specifications, such as the vector model form and the ideal model. 12.2.1. Coefficient of economic valuation A coefficient based on part-worth utilities can determine the monetary variation associated with any change in the combination of the attributes assigned to a good, service or, in this case, a job, compared to the actual revenue generated by that job. Having chosen the preference model (and the rating scale), a coefficient of economic valuation is developed for a hypothetical change that may occur in the combination of attribute levels, as described by Mariani and Mussini [MAR 13].

Conjoint Analysis of Gross Annual Salary Re-evaluation

203

Total utility variation is computed by replacing one attribute level of status quo b, where b is the current profile of the job, with attribute level i (with i = 1, . . . , n), which is different from b. Mi is given by the ratio of the difference between the total utility of alternative i and the status quo b over the total utility of the status quo b; formally: Mi =

Ui − Ub Ub

[12.2]

where Ui denotes the sum of the utility scores associated with alternative profile i and Ub (assumed to be different from 0) denotes the sum of the part-worth utilities associated with the status quo b of the job. Equation [12.2] indicates whether the status quo b modification gives a loss or a gain. If Mi = 0, there is no loss or gain in terms of total utility. However, the utility change arising from an attribute-level modification can be considered more or less important by respondents. Hence, this change may have a more important economic impact with respect to a changed utility, which has a similar intensity but involves a less relevant attribute. As a solution, the relative importance of the modified attribute is used as a weighting [GAR 17]. The range of the utility values for each attribute from highest to lowest provides an indicator of how important the attribute is compared to the others. The larger the utility ranges, the more important is the role that the attributes play. This applies in the same way to smaller ranges. For any attribute j, the relative importance can be computed by dividing its utility range by the sum of all utility ranges as follows: I j = J

max (Wj ) − min (Wj )

j=1 [max (Wj )

− min (Wj )]

,

[12.3]

where J is the number of attributes and Wj is the set of part-worth utilities referring to the various levels of attribute j. Usually, importance values are represented as percentages with a total score of 100. Otherwise, these importance values may be expressed in terms of decimals whose sum is 1. If this is the case, entering the importance of the modified attribute in equation [12.2], the coefficient formulation becomes: M Iij = Mi ∗ Ij .

[12.4]

204

Data Analysis and Applications 4

Assuming a change in the status quo profile, formula [12.1] will be used to estimate the variation of the total revenue generated. Given the gross annual salary (GAS) associated with the status quo profile, the coefficient of economic valuation (CEV) is expressed as follows: Vij = M Iij ∗ GAS

[12.5]

where Vij denotes the amount of the salary variation. The variation Vij is obtained by supposing that the job’s monetary attribute will vary in proportion to the change in total utility. This assumption may seem restrictive. However, it is possible to argue that the monetary amount asked of an employer for a job reflects how that user values the combination of attributes of the job in terms of its utility. Under this hypothesis, it is credible to assess the economic value of a change in the combination of attributes as a function of the utility and importance of the modified attribute. In addition, CA serves to approximate the real structure of preferences, given that only a partial knowledge of preferences can be known. Therefore, it is possible to use the CEV as a monetary indicator that approximates the impact of a given utility change in monetary terms. The proposed coefficient was then applied to the ELECTUS survey. First, an ideal profile was obtained by maximizing part-worth utilities. Then, economic variations on the proposed gross annual salary for graduates were computed by using the coefficient. 12.3. Application and results The survey was conducted in 2015 using computer-assisted web interviewing (CAWI). Data were collected using a software program called Sawtooth3. Data manipulation and conjoint analysis were performed using R software and Conjoint package [BAK 12]. The questionnaire contained two sections: the first concerned the conjoint experiment for the five job positions and the second contained general information about the company (demographic data). The five job positions considered for the new graduates were administration clerk, HR assistant,

3 See: www.sawtoothsoftwaver.com

Conjoint Analysis of Gross Annual Salary Re-evaluation

205

marketing assistant, ICT professional and CRM assistant. Six attributes were used to specify the candidates’ profile: – Field of Study with 10 levels (philosophy and literature, educational sciences, political sciences/sociology, economics, law, statistics, industrial engineering, mathematics/computer sciences, psychology, foreign languages); – Degree Mark with 3 levels (low, 66–90, medium, 91–100, high, 101– 110+); – Degree Level with 2 levels (bachelor’s, master’s); – English Knowledge with 2 levels (suitable for communication with nonnative Italian speakers, inadequate for communication with non-native Italian speakers); – Relevant Work Experience with 4 levels (no experience at all, internship during or after completion of university studies, intermittent or occasional employment during university studies, one year or more of regular employment); – Willingness to Travel on Business with 3 levels (unwilling to travel on business, willing to travel on business only for short periods, willing to travel on business, even for long periods). After having rated the selected profile and chosen the best one, employers were asked to propose a gross annual salary for the chosen profile, to measure WTP [BRE 06]. As far as the Milano-Bicocca research unit was concerned, the interviewees were representatives of companies registered on the AlmaLaurea portal. AlmaLaurea is an Interuniversity Consortium established in 1994 and currently counts 75 Universities as members and represents about 90% of Italian graduates. There were 471 final respondents. The company profiles showed that most of the respondents had 15–49 employees (52%), followed by businesses with 50–249 employees (25.6%) and then enterprises with at least 250 employees (22.4%). The sectors most represented were industry services (62.1%), individual services (16.2%) and manufacturing (14.9%). The majority of the companies (89.4%) operated fully or partially in the domestic market. Moreover, they were mainly under the management of the entrepreneur (64.2%). Frequency distributions for ELECTUS data are found in Table 12.1.

206

Data Analysis and Applications 4

Company supervisor

Employees

Activity sectors

Activity market

Entrepreneur 64.2%

15–19 37.5% Services industry 62.1%

Both 45.7%

Manager 23.2%

20–49 14.5% Personal services 16.2%

National 43.8%

Other 12.6%

50–249 25.6%

Manufacturing 14.9%

250+ 22.4%

Other 6.8%

International 10.6%

Table 12.1. Basic features of ELECTUS companies. Source: ELECTUS data (2015)

Five CAs were achieved, corresponding to the different job positions in order to measure the entrepreneurs’ preferences. The results for part-worth utilities are presented in Table 12.2. It was deemed necessary to introduce and define cross-competencies or specialized skills. A cross-competence is defined as having part-worth utilities that are independently higher than required for the chosen vacancy. On the other hand, if the level of the attribute changes due to the job position, that competence is defined as specialized. It is important to recall that, since the definition of the sum of utilities for all levels of an attribute will be equal to 0, less desirable attributes may generate negative utilities. In the application, the part-worth utilities seemed to be similar for all the attributes, except for Field of Study. This means that other competencies have some levels that are universally identified as “best practice” for graduates. The Relevant Work Experience and English Knowledge attributes always generated the highest utilities for the same level for each vacancy. Fluent communication with foreigners and one or more years of regular employment experience are recognized as preferred. Degree Mark and Willingness to Travel on Business variables were competencies where the top two levels were consistently preferred. Hence, candidates holding a degree earned with medium to high marks, and who expressed their willingness to travel on business for short or long periods, were preferred.


Competencies                         AC        HR        ICT       MKT       CRM
Field of Study
  Philosophy and literature        –0.8312    0.1561   –0.6792   –0.1247   –0.5629
  Educational sciences             –0.5959    0.8598   –0.0759   –0.2299   –0.2086
  Political sciences                0.3031    0.1876   –0.7714    0.0313    0.1996
  Economics                         1.8811    0.3210    0.2981    1.3350    1.0165
  Law                               0.0737    0.5498    4.8612   –0.5211   –0.0909
  Statistics                        0.4506   –0.6956    0.3956   –0.0129   –0.1686
  Engineering                      –0.5488   –1.5581    0.8889   –0.4019    0.0469
  Computer sciences                 0.4444    2.9842    2.9842   –0.4163    0.0252
  Psychology                       –1.0678    1.5375   –1.0325    0.0974   –0.1557
  Foreign languages                –0.1091   –0.2371   –1.1121    0.2431   –0.1015
Degree Level
  Bachelor                          0.0485    0.0251   –0.0483   –0.0092   –0.0586
  Master                           –0.0485   –0.0251    0.0483    0.0092    0.0586
Degree Mark
  Low                              –0.3960   –0.2497   –0.1047   –0.1407   –0.2299
  Medium                            0.2169    0.0950   –0.0431    0.0203    0.1401
  High                              0.1790    0.1547    0.1478    0.1204    0.0898
English Knowledge
  Suitable                          0.4608    0.2699    0.0969    0.3145    0.2998
  Inadequate                       –0.4608   –0.2699   –0.0969   –0.3145   –0.2998
Relevant Work Experience
  No experience                    –0.3169   –0.1666    0.0303   –0.3177   –0.1619
  Internship                       –0.0045   –0.0019   –0.0182   –0.0464   –0.1313
  Occasional                       –0.1219   –0.1383   –0.1300    0.1736    0.1014
  Regular                           0.4433    0.3068    0.1179    0.1905    0.1918
Willingness to Travel on Business
  Unwilling to travel              –0.0793   –0.3530   –0.0768   –0.0862   –0.4198
  Short period                     –0.0279    0.0698   –0.0295    0.0610    0.2353
  Long period                       0.1072    0.2832    0.1063    0.0252    0.1845

Table 12.2. Competencies part-worth utilities for job positions. AC = administration clerk, HR = HR assistant, ICT = ICT professional, MKT = marketing assistant, CRM = CRM assistant

Utility scores for the Degree Level variable were very close to 0 for each position. This means that, for the respondents, there was no significant difference between a bachelor's and a master's degree. This was due to the fact that all the positions analyzed were entry-level roles, not requiring specialized skills.

The Field of Study attribute required a more complex analysis, since it is less uniform: a degree in one field could be best for one position and less good for another. Table 12.3 shows the ideal profiles for each job vacancy. As can be noted, these profiles were similar to one another, except for Field of Study. This confirms the hypothesis of the existence of some transferable (cross) and some specialized competencies.

Competencies               AC           HR           ICT          MKT          CRM
Field of Study             Economics    Psychology   Comp. sci.   Economics    Economics
Degree Level               Bachelor's   Bachelor's   Master's     Master's     Master's
Degree Mark                Medium       High         High         High         Medium
English Knowledge          Suitable     Suitable     Suitable     Suitable     Suitable
Relevant Work Experience   Regular      Regular      Regular      Regular      Regular
Willingness to Travel      Long         Long         Long         Short        Short

Table 12.3. Competencies attributes and ideal levels for job vacancies
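As an illustration, the ideal profile of Table 12.3 can be reproduced from Table 12.2 by selecting, for each attribute, the level with the largest part-worth utility. The sketch below does this for the administration clerk (AC) column; the Field of Study dictionary is abridged to three representative levels for brevity.

```python
# Part-worth utilities for the AC position (Table 12.2, abridged).
utilities_ac = {
    "Field of Study": {"Philosophy and literature": -0.8312,
                       "Economics": 1.8811, "Psychology": -1.0678},
    "Degree Level": {"Bachelor": 0.0485, "Master": -0.0485},
    "Degree Mark": {"Low": -0.3960, "Medium": 0.2169, "High": 0.1790},
    "English Knowledge": {"Suitable": 0.4608, "Inadequate": -0.4608},
    "Relevant Work Experience": {"No experience": -0.3169, "Internship": -0.0045,
                                 "Occasional": -0.1219, "Regular": 0.4433},
    "Willingness to Travel": {"Unwilling": -0.0793, "Short": -0.0279, "Long": 0.1072},
}

# Ideal profile: level with the maximum utility for each attribute.
ideal = {attr: max(levels, key=levels.get) for attr, levels in utilities_ac.items()}
print(ideal)  # Economics, Bachelor, Medium, Suitable, Regular, Long (cf. Table 12.3)
```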

Economics was the preferred Field of Study for three positions (AC, MKT and CRM) out of the five. A degree in psychology was desirable for the HR assistant role, while for a more technical position, such as ICT professional, the Field of Study with the largest part-worth utility was computer sciences/mathematics. For this reason, according to the previous definition, Field of Study is a specialized competence. The Relevant Work Experience and English Knowledge attributes show that the best level perceived does not depend on the duties to be performed, which means they could be considered cross-competencies. After all, it is easy to imagine that companies would prefer to employ a candidate with one year or more of regular work experience and who is capable of communicating in another language. Two levels are recognized as “best practice” for the attributes Degree Mark and Willingness to Travel on Business, which could be defined as being nearly cross-competencies.



Finally, since the part-worth utilities for the Degree Level variable were very close to 0 and there was no difference between its levels, it could be defined as a non-binding attribute.

Table 12.4 shows the importance indices (in percentage) for the five CAs, computed using equation [12.3]. It is important to remember that a higher value of the index corresponds to a more important competence for the respondents.

Competence                  AC        HR        ICT       MKT       CRM
Field of Study             53.35%    59.39%    80.54%    54.27%    42.98%
Degree Level                1.75%     1.04%     1.91%     0.54%     3.19%
Degree Mark                11.09%     7.67%     4.98%     7.63%    10.07%
English Knowledge          16.67%    10.55%     3.82%    18.39%    16.32%
Relevant Work Experience   13.75%     9.12%     4.89%    14.86%     9.62%
Willingness to Travel       3.37%    12.23%     3.61%     4.30%    17.83%

Table 12.4. Importance indices of competence attributes for job vacancies
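Assuming that equation [12.3] is the standard range-based importance index of conjoint analysis (the utility range of an attribute divided by the sum of the ranges over all attributes), the following sketch reproduces the AC column of Table 12.4 from the AC part-worth utilities of Table 12.2:

```python
# AC column of Table 12.2, grouped by attribute (level order as in the table).
ac_utilities = {
    "Field of Study": [-0.8312, -0.5959, 0.3031, 1.8811, 0.0737,
                       0.4506, -0.5488, 0.4444, -1.0678, -0.1091],
    "Degree Level": [0.0485, -0.0485],
    "Degree Mark": [-0.3960, 0.2169, 0.1790],
    "English Knowledge": [0.4608, -0.4608],
    "Relevant Work Experience": [-0.3169, -0.0045, -0.1219, 0.4433],
    "Willingness to Travel": [-0.0793, -0.0279, 0.1072],
}

# Range-based importance: range of each attribute over the sum of ranges.
ranges = {a: max(u) - min(u) for a, u in ac_utilities.items()}
total = sum(ranges.values())
for attr, r in ranges.items():
    print(f"{attr}: {100 * r / total:.2f}%")
# Field of Study: 53.35%, Degree Level: 1.75%, Degree Mark: 11.09%, ...
```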

For each profile, the attribute with the highest value of the index was Field of Study. This percentage reached a maximum of 80.54% for ICT professional; the only other position where this value was not over 50% was CRM assistant. Another high percentage for this role was Willingness to Travel, at 17.83%. English Knowledge appeared to be the most important competence after Field of Study: except for ICT professional, it was always over 10%, with a peak of 18.39% for marketing assistant (MKT). Third, on roughly the same level, there were Degree Mark, Relevant Work Experience and Willingness to Travel, with an average importance of about 10%. Finally, because all its part-worth utilities were close to 0, the importance index for Degree Level was very low for each job position.

The last step of the analysis consisted of the computation of the CEVs Mij. To obtain these figures, the part-worth utilities in Table 12.2 were used, combined with the importance indices in Table 12.4. As seen in the previous section, one of the limits of this methodology is that the variations can only be extrapolated by evaluating changes in utility for one attribute at a time.



In this case study, the attribute Field of Study was considered as not fixed; the utilities were therefore computed with all other attributes held constant. In particular, in this application, the status quo profile was the worst profile, because it minimized total utility: Ub is the sum of the lowest part-worth utilities (plus an intercept) over the attributes j. This means that all Mij coefficients and all variations Vij are positive.

Table 12.5 shows the values of (Mij + 1) × 100 for Field of Study. As expected, each Mij ≥ 0, and Mij = 0 only for the Field of Study with the minimum part-worth utility. Comparing the job positions, ICT professional shows the largest variations, because its importance index Ij was very high: a degree in mathematics or computer sciences is fully specialized for this position. The biggest Mij is for a degree in mathematics or computer sciences for ICT professional: in comparison to a graduate in foreign languages, the former earns roughly twice the gross annual salary.

Field of Study                    AC       HR       ICT      MKT      CRM
Philosophy and literature         102      118      111      104      100
Educational sciences              104      126      127      103      103
Political science/sociology       112      118      109      106      106
Economics                         125      120      136      119      112
Law                               110      122      106      100      104
Statistics                        113      109      139      105      103
Industrial engineering            104      100      152      101      105
Mathematics/computer sciences     113      104      206      101      105
Psychology                        100      133      102      106      103
Foreign languages                 108      114      100      108      104
Minimum GAS (€)                 19,230   19,600   11,420   21,070   24,960

Table 12.5. Values of coefficients (Mij + 1) × 100 for Field of Study by job vacancies
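A minimal sketch of the monetary re-evaluation step: assuming, as described above, that the variation for a Field of Study level and vacancy j is Vij = GASmin,j × Mij, the snippet below reproduces the euro figures quoted in the text for economics graduates from the coefficients of Table 12.5. Small discrepancies (e.g. 4807.5 versus the quoted 4809.75 for AC) are due to the rounding of the printed Mij values.

```python
# Minimum GAS per vacancy (last row of Table 12.5) and the Mij coefficients
# for economics, read off Table 12.5 as (printed value)/100 - 1.
min_gas = {"AC": 19230, "HR": 19600, "ICT": 11420, "MKT": 21070, "CRM": 24960}
m_econ = {"AC": 0.25, "HR": 0.20, "ICT": 0.36, "MKT": 0.19, "CRM": 0.12}

for job, gas in min_gas.items():
    print(job, round(gas * m_econ[job], 2))  # V_ij = GAS_min * M_ij
# AC 4807.5, HR 3920.0, ICT 4111.2, MKT 4003.3, CRM 2995.2
```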

It seems clear that economics graduates can expect higher re-evaluation percentage coefficients. In comparison to the lowest utility values, the Mij coefficients were close to 25% for administration clerk (against psychology) and 20% for HR assistant (against industrial engineering) and marketing assistant (against law). A degree in psychology was very favorable for HR assistant, but for other vacancies, its Mij was very low. Statistics and industrial engineering had good coefficients only for ICT professional.



Finally, for CRM assistant, all wages were very close to one another: in comparison to the worst field (philosophy and literature), the only field of study that went over 10% was economics.

With respect to the CEV coefficients, the minimum GAS was first computed: in the questionnaire, the entrepreneurs were asked to propose a possible GAS for each of the five job vacancies. The CEV shows very interesting results. For example, for economics graduates, the value of GAS increases by 4809.75 euros for AC, 3920 euros for HR, 4111.2 euros for ICT, 4003.3 euros for MKT and 2995.2 euros for CRM (with respect to the Field of Study with the minimum part-worth utility for each of the five job vacancies).

12.4. Conclusion

This chapter proposed the use of the Mariani–Mussini coefficients in combination with CA utilities. The aim was to give a monetary assessment of the competencies, in terms of gross annual salary, for five different job positions. This monetary evaluation was found to vary according to the competencies held by new graduates.

Data from the ELECTUS project applied in Lombardy showed the existence of different kinds of attributes, classified as cross, specialized or non-binding. In particular, Relevant Work Experience and English Knowledge appeared to be cross-competencies, while Field of Study indicated a specialized skill. For this reason, it was possible to create different ideal profiles on the basis of the required job position.

Using conjoint analysis, the competencies were measured according to their perceived importance. Field of Study proved to be the most relevant, especially when the job vacancy was more specialized. The high importance of Field of Study led to taking into account the monetary variation according to different academic fields. Economics was preferred for the roles of administration clerk, marketing assistant and CRM assistant; the gross annual salary for these positions was 25%, 19% and 12% higher, respectively, than for the degree course with the lowest utilities. When the position concerned human resources, a degree in psychology was preferred by companies; the salary for this role was 33% higher than for the field with the lowest utilities (industrial engineering). As information and communications technology is a specialized field, the CEVs are higher: the salary for a graduate in mathematics or computer sciences was twice that of a graduate in foreign languages.



Future research aims to focus attention in two directions. First, the results of stratified CA based on socio-demographic features of the companies responding to the ELECTUS project could be discussed. Second, it would clearly be advantageous to extend the Mariani–Mussini economic re-evaluation coefficient to more than one attribute, in order to jointly analyze several competencies and their possible interactions.

12.5. References

[BAK 12] BAK A., BARTLOMOWICZ T., "Conjoint analysis method and its implementation in conjoint R package", Data Analysis Methods and its Applications, pp. 239–248, 2012.

[BRE 06] BREIDERT C., Estimation of Willingness-to-Pay: Theory, Measurement, Application, Deutscher Universitats-Verlag, Wiesbaden, 2006.

[BUS 04] BUSACCA B., COSTABILE M., ANCARANI F., Prezzo e valore per il cliente. Tecniche di misurazione e applicazioni manageriali, Etas, Perugia, 2004.

[CAR 78] CARMONE F., GREEN P.E., JAIN A.K., "Robustness of conjoint analysis: Some Monte Carlo results", Journal of Marketing Research, vol. 15, pp. 300–303, 1978.

[CAT 82] CATTIN P., WITTINK D., "Commercial use of conjoint analysis: A survey", Journal of Marketing, vol. 46, pp. 44–53, 1982.

[DAR 08] DARBY K., BATTE M.T., ERNST S. et al., "Decomposing local: A conjoint analysis of locally produced foods", American Journal of Agricultural Economics, vol. 90, no. 2, pp. 476–486, 2008.

[EUR 08] EUROPEAN COMMISSION, Commission Staff Working Paper accompanying New Skills for New Jobs. Anticipating and Matching Labour Market and Skills Needs, Brussels, p. 868, COM, 2008.

[FAB 15] FABBRIS L., SCIONI M., "Dimensionality of scores obtained with a paired-comparison tournament system of questionnaire items", in MEERMAN A., KLIEWE T. (eds), Academic Proceedings of the 2015 University-Industry Interaction Conference: Challenges and Solutions for Fostering Entrepreneurial Universities and Collaborative Innovation, Amsterdam, 2015.

[GAR 17] GARAVAGLIA C., MARIANI P., "How much do consumers value protected designation of origin certifications? Estimates of willingness to pay for PDO dry-cured ham in Italy", Agribusiness, 2017.



[HU 12] HU W., BATTE M.T., WOODS T. et al., "Consumer preferences for local production and other value-added label claims for a processed food product", European Review of Agricultural Economics, vol. 39, no. 3, pp. 489–510, 2012.

[LOU 00] LOUVIERE J.J., HENSHER D.A., SWAIT J.D., Stated Choice Methods: Analysis and Application, Cambridge University Press, 2000.

[MAR 13] MARIANI P., MUSSINI M., "A new coefficient of economic valuation based on utility scores", Argumenta Oeconomica, vol. 2, pp. 33–46, 2013.

13 Methodology for an Optimum Health Expenditure Allocation

The Healthy Life Years Lost (HLYL) methodology was presented in two books in the Springer series on Demographic Methods and Population Analysis, vols 45 and 46, titled Exploring the Health State of a Population by Dynamic Modeling Methods and Demography and Health Issues: Population Aging, Mortality and Data Analysis, and was applied in particular chapters of these books. Here, the main part of the methodology is presented with more details and illustrations and is applied by developing and extending a life table, which is important for estimating the healthy life years lost and for fitting the health expenditure, in the case of Greece in 2009. The application results are quite promising and provide decision-makers and health agencies with a powerful tool to improve health expenditure allocation and future predictions.

13.1. Introduction

The OECD health care study on several European countries, including Greece, was launched several years ago, with the target of long-range forecasting of health care expenditure [OEC 13]. Although the economic recession of the last decade had a critical impact on many economic activities and on health-related costs, health systems continue to develop, following the advances in health research and health-related technology. The latter provide new advances in health services, with inevitably growing costs.

Chapter written by George MATALLIOTAKIS.




Another point is related to longevity and the steady growth of life expectancy, along with healthy life expectancy. Following several studies, healthy life expectancy has turned out to be an important measure of the health improvements in a country or population (see [ATS 08, SKI 07, SKI 18a, SKI 18b]). The World Health Organization provides a standard estimation of healthy life expectancy, termed HALE, for every country, whereas statistical institutes such as those of the European Union also provide their own estimates. The main finding is that the healthy life years lost, i.e. the difference between life expectancy and healthy life expectancy, tend to increase as life expectancy increases, resulting in an excess of health expenditure as the duration of life is extended. As we live longer, we need more years of treatment and health care during the period of the healthy life years lost to disability. Accordingly, health care systems, in both the state and the private sector, will face gradually increasing costs, which should be compensated with new funding.

13.2. The Greek case

Given the limited funding resources, health cost allocation becomes very important for both the public and private sectors. The standard methodology includes the cost allocation per age group, usually broken down into five-year ranges, as shown in Figure 13.1 for Greece in 2009. The percentage of expenditure per capita and age group starts from the first year of age, where an extra cost is related to the newborn. A reduced cost follows for the age groups 5–9 and 10–14, and then the costs grow steadily, with a sharp exponential growth after 55–59 and 60–64 years of age, reaching a maximum at 85–89 and 90–94 years of age. An important point is related to the minimum level of per capita expenditure, at 1.46%, corresponding to the age group 10–14 years, where the minimum mortality appears.

The per capita and age group estimates should be multiplied by the population per age group to provide the health expenditure per age group of the same year. Figure 13.2 shows that the population in 2009 has a maximum at 30–34 years of age, with close figures for the age groups 35–39 and 40–44 years, followed by a decline. The population of the age group 65–69 years shows an irregular decline.


Figure 13.1. Per capita and age group health expenditure in Greece in 2009. For a color version of this figure, see www.iste.co.uk/makrides/data4

Figure 13.2. Population per age group in Greece in 2009. For a color version of this figure, see www.iste.co.uk/makrides/data4




The percentage of the total expenditure per age group starts from the first year of age, where an extra cost is related to the newborn. A reduced cost follows for the next years, with a minimum at 10–14 years of age, and then the costs grow steadily, with a sharp exponential growth after 45–49 years of age, reaching a maximum at 75–79 years. A steady decline then follows, due to the declining number of people in the following age groups. An important point is related to the minimum level of per age group expenditure, at 1.95%, corresponding to the age group 10–14 years, where the minimum mortality appears (Figure 13.3). This is also an estimate of the minimum operational level of the health system of Greece.

Figure 13.3. Percentage of health expenditure in Greece in 2009. For a color version of this figure, see www.iste.co.uk/makrides/data4

In our calculations, we have taken into account the fertility rates by mother's age at childbirth, provided as births per 1,000 women. The data are provided by the Eurostat Demographic Statistics. Note that the mean age of women at childbirth is 30.4 years and the mean age at first birth is 29.0 years. Small changes appear from 2009 to 2016, with a fertility decline from 301.4 in 2009 to 275.6 in 2016. In contrast, fertility in 1960 was 446.2 births per 1,000 women (see Figure 13.4).



Figure 13.4. Births per 1000 women in Greece. For a color version of this figure, see www.iste.co.uk/makrides/data4

13.3. The basic table for calculations

The calculations for the Greek health expenditure for the year 2009 are based on the table provided in Figure 13.5, where the death probability is included in the third column and the data for the fertility rate are added, after the appropriate transformation, in the fourth column. The main calculations are done in the sixth column, where a correction parameter k is estimated via a regression analysis. Column seven is set for the minimum expenditure level, column eight includes the estimated values, and column nine includes the data sets provided by the OECD. The tenth column is set for the estimation of the sum of squared errors and the calculation of R2 and the standard error se.

The main task is to produce a standard system for the calculation of the health expenditure allocation per age group for a specific year in a country. Although the main health expenditure is devoted to high and very high ages, it is important to have a correct idea of the allocation per age group, including the expenditure for the very sensitive young ages and newborns. Following our estimates, 6.9% of the health expenditure for 2009 in Greece is allocated to the population from 1 to 14 years of age, 28.6% is allocated to the population from 15 to 49 years of age, including the costs for the period of women's fertility, and 21.1% is allocated to people aged 50–64 years. A considerable cost of 28.1% is allocated to people from 65 to 79 years of age, and the remaining 15.3% is devoted to people aged 80+ years. Note that the health expenditure for the 35 years from 15 to 49 is almost equal to the expenditure for the 15 years ranging from 65 to 79 years of age.



The remaining expenditure for the very old, 80+, is slightly more than double the costs spent on ages 1–14. However, the real health expenditure for the very old includes personal and family spending on health care that is not included in the main health statistics.

Our methodology is based on the estimation of the health expenditure allocation in a country via the healthy life years lost, a parameter that is provided by Eurostat for the European Union and by the World Health Organization (WHO) for all world countries. The two sets of estimates differ due to the methodology used: the WHO provides an estimate of the healthy life years lost due to severe causes of disability, whereas the Eurostat estimates are higher, including both severe and moderate disability causes. Following the WHO and Eurostat estimates, Greece is ranked in different places. According to Eurostat, Greece is ranked below the OECD25 countries, with 7.9 healthy life years at age 65 for men and 7.7 for women in 2015 (see Figure 13.6). The WHO estimates, however, give 17.28 healthy years for men and 19.89 healthy years for women at 60 years of age, also in 2015. Our estimates for 2009 for the total population (men and women) in Greece found 17.84 healthy life years at 60 years of age and 14.35 years at 65 years of age; the life expectancy at 60 and 65 years of age is 23.53 and 19.38 years, respectively.

Figure 13.5. Table for the estimation of health expenditure in Greece. For a color version of this figure, see www.iste.co.uk/makrides/data4
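The fitting step in the sixth column of Figure 13.5 can be illustrated by the following sketch. All numerical inputs below, and the health-state indicator combining death probabilities and transformed fertility rates, are hypothetical placeholders rather than the actual values of the figure; only the least-squares estimation of the correction parameter k, with the associated R2 and standard error, mirrors the procedure described above.

```python
import numpy as np

# Hypothetical per-age-group inputs: qx = death probabilities, fx = transformed
# fertility rates, oecd = observed expenditure shares (placeholders).
qx = np.array([0.004, 0.0002, 0.0003, 0.001, 0.003, 0.010, 0.030, 0.080])
fx = np.array([0.0, 0.0, 0.02, 0.05, 0.01, 0.0, 0.0, 0.0])
oecd = np.array([0.030, 0.020, 0.028, 0.040, 0.060, 0.110, 0.200, 0.300])

min_level = 0.0195       # minimum operational level (1.95% in the text)
indicator = qx + fx      # assumed health-state indicator per age group

# Least-squares estimate of k in: share = min_level + k * indicator.
k = np.sum(indicator * (oecd - min_level)) / np.sum(indicator ** 2)
fitted = min_level + k * indicator
resid = oecd - fitted
r2 = 1 - (resid @ resid) / np.sum((oecd - oecd.mean()) ** 2)
se = np.sqrt((resid @ resid) / (len(oecd) - 1))
print(round(k, 3), round(r2, 3), round(se, 4))
```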



Figure 13.6. Healthy life expectancy at 65 years of age (source: Eurostat database, 2017). For a color version of this figure, see www.iste.co.uk/makrides/data4
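Using the figures quoted above, the healthy life years lost implied by our 2009 estimates (the difference between life expectancy and healthy life expectancy) can be checked directly:

```python
# Healthy life years lost = life expectancy - healthy life expectancy,
# using the 2009 estimates for Greece quoted in the text.
le = {60: 23.53, 65: 19.38}    # life expectancy at ages 60 and 65
hle = {60: 17.84, 65: 14.35}   # healthy life years at ages 60 and 65
for age in (60, 65):
    print(age, round(le[age] - hle[age], 2))  # 60: 5.69, 65: 5.03 years lost
```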

13.4. The health expenditure in hospitals

A significant portion of health resources is spent in hospitals, both public and private. The distribution of the health costs in hospitals by age group was compared with the Dutch system, for which there are significant data for the years 1995–2009, and in which I was particularly interested as a hospital manager a few years ago. Figure 13.7 shows the percentage of expenditure per age group for the Netherlands hospital system in 1995 and 2009 [GHE 14], together with our estimates for the Greek system in 2009. The Greek system in 2009 is more similar to that of the Netherlands in 1995. The estimates for Greece come from the appropriate software that takes into account demographics from life tables. The minimum operational cost level is 2%.

13.5. Conclusion

So far, the methodology, applied with the support of the findings of organizations like the WHO and Eurostat, provides an advanced framework for the estimation of the best health expenditure allocation policy by decision-makers. Both data collection and model building turned out to be important tools for the government strategy on health policy.



As health systems become increasingly complex, the need for data development becomes crucial for achieving people's satisfaction with the provided health services. The simple resource allocation followed in previous health systems turns out to be unsatisfactory in the new era. As economic resources are limited, the quest for the optimum resource allocation is extremely important. Our methodology supports this quest, providing vital information for developing the best strategy.

Figure 13.7. Allocation of health expenditure per age group in hospitals. For a color version of this figure, see www.iste.co.uk/makrides/data4

13.6. References

[ATS 08] ATSALAKIS G., NEZIS D., MATALLIOTAKIS G. et al., "Forecasting mortality rate using a neural network with fuzzy inference system", 4th International Workshop on Applied Probability (IWAP 2008), Compiègne, France, July 7–10, 2008.

[GHE 14] GHEORGHE M., OBULQASIM P., VAN BAAL P., Estimating and Forecasting Costs of Illness in the Netherlands, 2014. Available at: Netherlands-Hospital-o21731-kvz-notitie2014-2-report-co-interpolation-and-extrapolation.

[OEC 13] OECD, Public Spending on Health and Long-term Care: A New Set of Projections, OECD, 2013.

[SKI 07] SKIADAS C., MATALLIOTAKIS G., SKIADAS C.H., "An extended quadratic health state function and the related density function for life table data", in SKIADAS C.H. (ed.), Recent Advances in Stochastic Modeling and Data Analysis, pp. 360–369, World Scientific, 2007.



[SKI 18a] SKIADAS C.H., SKIADAS C., Exploring the Health State of a Population by Dynamic Modeling Methods, Demographic Methods and Population Analysis, vol. 45, Springer, Cham, 2018.

[SKI 18b] SKIADAS C.H., SKIADAS C., Demography and Health Issues: Population Aging, Mortality and Data Analysis, Demographic Methods and Population Analysis, vol. 46, Springer, Cham, 2018.

14 Probabilistic Models for Clinical Pathways: The Case of Chronic Patients

This chapter presents ongoing research on the use of Markov models on electronic health records. The research addresses the reliability of the data flow and process flow through the different stages of patient status in the health care processes of a health care setting. Specifically, the use of mathematical modeling of clinical pathways is examined through a literature review. Clinical pathways translate the best available evidence into practice, indicating the most widely applicable order of treatment interventions for particular treatment goals. A special focus is given, through a case study, to the care of patients who are under the age of 18 and have respiratory health problems. Applying models to the follow-up of these, and more generally of chronic, patients is innovative and of great importance. Special attention is given to chronic patients, as the completion of the clinical pathway depends on parameters that implicitly relate to the provided health services, such as children reaching adulthood. The results of the models could become valuable knowledge tools to help health care providers.

14.1. Introduction

According to the World Health Organization [WHO 18], the ultimate goal of primary health care is better health for all. Primary health care does not simply lead to an approach that places greater emphasis on certain areas of the health system, such as prevention and out-of-hospital care in general.

Chapter written by Stergiani SPYROU, Anatoli KAZEKTSIDOU and Panagiotis BAMIDIS.



Indeed, the strategy proposed by primary health care also influences the general planning of a country's social and economic organization, particularly in sectors such as industry, agriculture and the environment [MOR 95].

An important tool for quality in medical practice and the comprehensive treatment of patients is the personal health record (PHR) and its development, the electronic health record (EHR). The EHR can improve the quality of health care, as well as the documentation of this care, both in the hospital part of the health system and in primary health care [SOT 02]. Particularly important for effective health care, and for actions aimed at improving its quality, are the quality, accuracy and completeness of the information contained in the health records. The transition from paper-based health records to electronic ones has therefore created expectations that electronic data will be used to measure and improve the quality of care provided to patients [GRE 11].

Evidence-based medicine is widely acknowledged as a systematic approach for delivering consistent, safe and credible health care [LUC 16]. Clinical pathways translate the best available evidence into practice, indicating the most widely applicable order of treatment interventions for particular treatment goals [ZHA 15]. One way for physicians to follow the latest evidence on a disease is to use clinical pathways. Essentially, these are guidelines for health care from a multifactorial point of view: algorithms designed to guide health professionals in dealing with a pathological condition. These pathways have the advantage of relying on medicine based on medical evidence rather than on tradition or habit [WAZ 01].

The continuous release of new medical evidence, combined with the requirements of everyday practice, makes it difficult for doctors to stay constantly informed about new findings. Clinical pathways are tools based on existing evidence and are the best link between proven knowledge and clinical practice. They provide instructions, procedures and time frames for managing specific situations or interventions [ROT 10]. Clinical pathways are therefore a systematic approach to guide health professionals in managing a specific pathological condition. The aim is to reduce unnecessary variations in health care, reduce overuse and make better use of available resources, improve patient education and improve the quality of care provided [BAN 04]. In the past few years, several articles have been published on clinical pathways for various chronic diseases, such as "Effects of clinical pathways for chronic obstructive pulmonary disease (COPD) on patient" [PLI 16], "Clinical pathway for patients with acute or chronic urticaria" [RUI 16], "Clinical pathways in chronic lymphocytic leukemia" [NAB 17] and "Data-driven clinical and cost pathways for chronic care delivery" [ZHA 16]. In conclusion, the use of clinical pathways in clinical practice and, more specifically, in the treatment and follow-up of patients with chronic diseases, is of great importance.



14.2. Models and clinical practice

Regarding reliability in the software engineering domain, Goseva-Popstojanova and Trivedi [GOS 03] presented architecture-based approaches to the reliability assessment of component-based software systems, describing how they can be used to examine software behavior early in the design phase. They proposed the following classification of architecture-based models:
a) state-based models: the architecture of the software is modeled as a discrete-time Markov chain (DTMC), a continuous-time Markov chain (CTMC) or a semi-Markov process (SMP);
b) path-based models: based on the same concept as state-based models, except that failure behavior is described along a path, and system reliability is computed by considering the possible execution paths of the program, either experimentally or algorithmically by testing;
c) additive models: focused on estimating the overall application reliability using the components' failure data; these models consider software reliability growth.

Applications of these ideas to the health care domain were presented in previous work [SPY 08]. The feasibility of applying such a model is examined here in the field of clinical pathways and electronic health records (EHRs). The model exploits stochastic modeling approaches with the ultimate scope of calculating the reliability of clinical pathways. Specifically, reliability analysis in the health care domain can be viewed in three different areas: 1) technical (medical equipment, etc.), 2) organizational (procedures) and 3) human reliability analysis [ZAI 12]. Consequently, reliability engineering is applied in the health care domain in the following areas:



– the area of system design, which includes:
- fault-tolerant systems;
- software reliability;
- human reliability analysis;
– the area of exploitation of the system, which includes:
- reliability data analysis;
- risk and reliability analysis;
- safety and maintenance analysis;
- software reliability;
- availability analysis;
- human reliability analysis.

The work on clinical data extraction and clinical pathways relies on the area of reliability data analysis. The mathematical models of these techniques mainly include binary-state and multi-state system models. More specifically, the mathematical techniques include (according to [SPY 08, ROU 02]):
1) failure mode and effect analysis;
2) reliability block diagrams;
3) fault tree analysis;
4) Markov analysis;
5) hybrid techniques, etc.

14.3. The Markov models in medical diagnoses

The Markov model is used in several cases in medical diagnosis and, more generally, in medical prognosis. The Markov model can be used instead of a decision tree, or within standard decision analysis, as it is characterized by (mathematical) simplicity and can represent many clinical problems [BEC 83]. Individual Monte Carlo simulation, the Markov cohort simulation and the fundamental matrix solution are the simplest approaches to evaluating Markov processes in medical diagnosis and prognosis.



Markov models are easy to apply, as the fundamental hypothesis is that the present state of a patient's health is sufficient to project the future states, i.e. all patients in a given state at a given time have the same prognosis, no matter how they got to the present state. The transition probability matrix is the key element of these models. Note that the transfer of control, i.e. the patient flow, has the Markov property (the future behavior of the system is conditionally independent of the past behavior). Based on this assumption, the patient flow behavior (transition states) for a clinical pathway can be modeled with an absorbing discrete-time Markov chain (DTMC) with a transition probability matrix $P = [p_{ij}]$, where $p_{ij}$ is the probability of transferring the control from state $i$ to state $j$:
$$P = \begin{pmatrix} Q & C \\ 0 & I \end{pmatrix},$$
where $Q$ is an $(n-m) \times (n-m)$ substochastic matrix describing the transitions among the transient states and $C$ contains the transition probabilities from transient to absorbing states. Then, the matrix $A = [a_{ik}]$, where $a_{ik}$ is the probability of reaching an absorbing state $k$ starting from a transient state $i$, is calculated [BEC 83] as
$$A = (I - Q)^{-1} \times C,$$
where $(I - Q)^{-1}$ is the fundamental matrix of the chain, whose row sums also give the mean time to absorption from each transient state.

Markov models are used in the field of clinical effectiveness along with cost-effectiveness [WIL 17]. Computer simulation has been employed to evaluate proposed changes in the delivery of health care. Most models have been applied in the medicine domain, for prognosis in clinical practice, and very few are associated with policy issues, so that the results of the analysis could be exploited by policy makers. Also, in the pharmacy domain, many researchers are interested in applying models and mathematical techniques with an economic view of the clinically best available practice [KIR 15, ATH 15, SOB 11, LAN 09].

14.3.1. The case of chronic patients

The research part of this chapter is based on data obtained from the outpatient clinic for respiratory problems of the Georgios Gennimatas hospital of Thessaloniki, Greece, and it concerns the period 2008–2015.



One main reason for this choice is the nature of the disease: a chronic disease that often requires long-term follow-up, with many fluctuations in its development, and hence the accumulation of large amounts of information. We also chose it due to the patients' ages and the fact that they will probably require follow-ups for many years. There is therefore a group of patients for whom an organized and systematic medical history is a necessity, contributing to a practical record for their care and therefore for their quality of life. The data were extracted from the paper-based records and were uploaded to the electronic medical record database. The record included 157 patients at the time of treatment, and 840 visits by patients aged 2–18 were recorded.

Figure 14.1. The process of applying the DTMC in chronic patients. For a color version of this figure, see www.iste.co.uk/makrides/data4
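The final step of Figure 14.1 can be sketched as follows. The transition probabilities below are hypothetical placeholders (the study's full transition matrix is not reproduced in this chapter), except for the 0.12 probability of moving from state 1 to NV, which is taken from the results reported below.

```python
import numpy as np

# Transient states: 0 = state 1 (monitoring), 1 = an intermediate state.
# Absorbing states: 0 = T (successful termination), 1 = NV (no visit).
# All probabilities are hypothetical except the 0.12 for state 1 -> NV.
Q = np.array([[0.50, 0.10],    # transitions among transient states
              [0.40, 0.20]])
C = np.array([[0.28, 0.12],    # transitions into absorbing states T, NV
              [0.30, 0.10]])

N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix: expected visits
A = N @ C                          # absorption probabilities a_ik
steps = N.sum(axis=1)              # mean number of visits before absorption

print(A[0])      # [P(T | start in state 1), P(NV | start in state 1)]
print(steps[0])  # expected number of visits starting from state 1
```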

This particular sample of patients was selected, firstly, because bronchial asthma is a respiratory illness that millions of people worldwide suffer from, adversely affecting their quality of life. In addition, when it comes to children, the situation is even more complicated, as the family is involved and it is sometimes difficult to cooperate with the child. It is also a disease that the child can display throughout his or her life, with outbreaks and remissions. In addition to the medical data from the paper-based medical records, each visit by each patient was categorized in association with the clinical pathway designed for our study.



In practice, this means that the patient's status, as recorded by the treating physician, was recorded in an encoded form. The doctor therefore immediately has a quick overview of the course of the patient's illness. This could be used, for example, in association with symptoms which, in some cases, such as outbreaks of bronchial asthma triggered by viruses during the winter months, may be severe. It is, of course, intended to make possible any necessary interventions at both the individual and the social level, after the necessary epidemiological investigations. As described in Figure 14.1, after completing the first step of the "EHR data extraction", a state transition diagram was created in order to apply the DTMC model and calculate the overall reliability of the pathway and the transition probabilities among states.

14.3.2. Results

The data of the study included 471 patient visits in state 1, of which 328 returned to state 1. Also, 264 visits resulted in non-attendance. Among the limitations is that an initial state is assumed for our scenario, as well as terminal states: the terminal state reached from state 1 (monitoring) symbolizes the successful termination (T) of the scenario, while the other termination states correspond to unsuccessful termination, which in our case involved non-attendance (No Visit, NV).

Among other results, the probability that a patient will switch from state 1 (steady state) to NV (non-attendance) is 12%. This is particularly important for the patient's progress and reflects the quality of the medical services of the clinic. In this particular disease, parents often find their children stable in the course of the disease and either slow down or terminate their child's follow-up. This often results in initial symptoms either escaping the attention of the family or being underestimated, so that a timely and most effective therapeutic intervention is not obtained. However, the probability of transition from the initial state ("1") to the state where the patients do not visit the clinic again (NV) does not seem to have any significant effect on the probability of successful termination ("T") of the whole clinical pathway: the probability of successful termination is about 0.38 and remains at 0.347 even when the probability from state "1" to state "NV" reaches 0.1.



Several other results have been examined, which indicate that a detailed investigation of the states and transition probabilities should be carried out in order to obtain valuable results from our model. It appears, therefore, that the digitization of medical records, in addition to the use of clinical pathways, makes it easy to obtain a quick overview of the course of the health of all patients in the clinic/outpatient clinic. Thus, interpretations of the results can be made, as well as an evaluation of the services the clinic provides and of any causes that may lead to the discontinuation of patient monitoring. In other words, "social" interventions can be made, because some reasons for the interruption of monitoring may have to do with the quality of the services provided, while others will be related to other social issues, for example, a low socio-economic level of the parents, who may have difficulties working with doctors [ZAC 15].

14.4. Conclusion

This chapter reveals the importance of mathematical models in guiding clinical practice. Great attention should be paid to the target population, as it should be defined in terms that are relevant to the decision and committed to receiving the interventions being modeled. The intention for the future is to plan a funding decision or reimbursement scenario for the follow-up of chronic patients, optimizing the available resources. Decision outcomes may include quality-of-life dimensions. More generally, managing the health services for chronic diseases has led to the need to develop models to guide public health care in general. To build and then validate the model, it is important that a team of physicians, along with experts in the field of reliability, cooperate in order to ensure that the modeling process adequately addresses the needs. The health policy area needs analytic models with the scope of making decisions with valuable outcomes for the stakeholders (doctors, patients, citizens, etc.).



14.5. References

[ATH 15] ATHANASAKIS K. et al., "Quantifying the economic benefits of prevention in a healthcare setting with severe financial constraints: The case of hypertension control", Clin. Exp. Hypertens., vol. 37, no. 5, pp. 375–380, 2015.

[BAN 04] BANASIAK N., MEADOWS-OLIVER M., "Inpatient asthma clinical pathways for the pediatric patient: An integrative review of the literature", Pediatr. Nurs., vol. 30, no. 6, pp. 447–450, 2004.

[BEC 83] BECK J.R., PAUKER S.G., "The Markov process in medical prognosis", Med. Decis. Making, vol. 3, no. 4, pp. 419–458, 1983.

[GOS 03] GOSEVA-POPSTOJANOVA K., "Architectural-level risk analysis using UML", IEEE Transactions on Software Engineering, vol. 29, no. 10, pp. 946–960, 2003.

[GRE 11] GREIVER M. et al., "Implementation of electronic medical records: Theory-informed qualitative study", Canadian Family Physician, vol. 57, no. 10, 2011.

[KIR 15] KIRSCH F., "A systematic review of Markov models evaluating multicomponent disease management programs in diabetes", Expert Rev. Pharmacoecon. Outcomes Res., vol. 15, no. 6, pp. 961–984, 2015.

[LAN 09] LANZARONE E., MATTA A., SCACCABAROZZI G., "A patient stochastic model to support human resource planning in home care", Production Planning & Control, vol. 21, no. 1, pp. 3–25, 2009.

[LUC 16] LUCKMANN R., "Evidence-based medicine: How to practice and teach EBM", in SACKETT D.L., STRAUS S.E., RICHARDSON W.S. et al. (eds), Journal of Intensive Care Medicine, 2nd edition, Churchill Livingstone, vol. 16, no. 3, pp. 155–156, 2016.

[MOR 95] MORAITIS E. et al., Study on the Organization and Operation of the Integrated Healthcare and Welfare Institution, M.o.H.a. Welfare, Athens, 1995.

[NAB 17] NABHAN C., MATO A.R., FEINBERG B.A., "Clinical pathways in chronic lymphocytic leukemia: Challenges and solutions", Am. J. Hematol., vol. 92, no. 1, pp. 5–6, 2017.

[PLI 16] PLISHKA C. et al., "Effects of clinical pathways for chronic obstructive pulmonary disease (COPD) on patient, professional and systems outcomes: Protocol for a systematic review", Syst. Rev., vol. 5, no. 1, p. 135, 2016.

[ROT 10] ROTTER T. et al., "Clinical pathways: Effects on professional practice, patient outcomes, length of stay and hospital costs", Cochrane Database Syst. Rev., vol. 3, p. CD006632, 2010.

[ROU 02] ROUVROYE J., VAN DEN BLIEK E., "Comparing safety analysis techniques", Reliability Engineering & System Safety, vol. 75, pp. 289–294, 2002.

[RUI 16] RUIZ-VILLAVERDE R. et al., "Clinical pathway for patients with acute or chronic urticaria: A consensus statement of the Andalusian section of the Spanish academy of dermatology and venereology (AEDV)", Actas Dermo-Sifiliográficas, vol. 107, no. 6, pp. 482–488, 2016.



[SOB 11] SOBOLEV B.G., SANCHEZ V., VASILAKIS C., "Systematic review of the use of computer simulation modeling of patient flow in surgical care", J. Med. Syst., vol. 35, no. 1, pp. 1–16, 2011.

[SOT 02] SOTO C.M., KLEINMAN K.P., SIMON S.R., "Quality and correlates of medical record documentation in the ambulatory care setting", BMC Health Services Research, vol. 2, no. 1, 2002.

[SPY 08] SPYROU S. et al., "A methodology for reliability analysis in health networks", IEEE Trans. Inf. Technol. Biomed., vol. 12, no. 3, pp. 377–386, 2008.

[WAZ 01] WAZEKA A. et al., "Impact of a pediatric asthma clinical pathway on hospital cost and length of stay", Pediatr. Pulmonol., vol. 32, no. 3, pp. 211–216, 2001.

[WHO 18] WHO, "Primary health care", available at: http://www.who.int/topics/primary_health_care/en/ [cited 1-2-2018], 2018.

[WIL 17] WILLIAMS C. et al., "Estimation of survival probabilities for use in cost-effectiveness analyses: A comparison of a multi-state modeling survival analysis approach with partitioned survival and Markov decision-analytic modeling", Med. Decis. Making, vol. 37, no. 4, pp. 427–439, 2017.

[ZAC 15] ZACHARIAH J.P., SAMNALIEV M., "Echo-based screening of rheumatic heart disease in children: A cost-effectiveness Markov model", J. Med. Econ., vol. 18, no. 6, pp. 410–419, 2015.

[ZAI 12] ZAITSEVA E., RUSIN M., "Healthcare system representation and estimation based on viewpoint of reliability analysis", Journal of Medical Imaging and Health Informatics, vol. 2, no. 1, pp. 80–86, 2012.

[ZHA 15] ZHANG Y., PADMAN R., PATEL N., "Paving the COWpath: Learning and visualizing clinical pathways from electronic health record data", J. Biomed. Inform., vol. 58, pp. 186–197, 2015.

[ZHA 16] ZHANG Y., PADMAN R., "Data-driven clinical and cost pathways for chronic care delivery", Am. J. Manag. Care, vol. 22, no. 12, pp. 816–820, 2016.

15 On Clustering Techniques for Multivariate Demographic Health Data

Demographic factors and the global economic crisis play a significant role and greatly affect the performance of national health care systems. The expenditure on health is an important indicator for understanding and measuring health care performance. For this purpose, in this chapter, we address and discuss issues related to health care systems, focus on the performance of such systems using multivariate statistical techniques, and classify OECD countries in clusters based on the similarities in their expenditure on health.

15.1. Introduction

In many countries of the Organisation for Economic Co-operation and Development (OECD), the health system is the largest service sector, with an average cost in the range of 9–10% of the country's gross domestic product (GDP). Although the disparity between countries is very significant, OECD countries currently spend record amounts on health care compared with other parts of the world. Health spending typically includes the costs for the prevention and treatment of diseases, as well as for the rehabilitation of the patient after treatment.

Chapter written by Achilleas ANASTASIOU, George MAVRIDOGLOU, Petros HATZOPOULOS and Alex KARAGRIGORIOU.




These costs cover hospital and non-hospital expenditure and include medical services in and outside hospitals, nursing costs in hospitals, costs for rehabilitating patients and pharmaceutical expenditure. The financing of the health system is based on resources from the government through general taxation, from social insurance funds through employers' and employees' wage-related contributions, and on private financing in the form of direct purchases and co-payments or through private insurance.

Demographic factors and the global economic crisis forced many countries to look into these costs and, especially, to seek ways to "force" these resources to produce better results. This led to a growing interest in comparisons of the performance of national health care systems. The expenditure on health is an important indicator for understanding and measuring health care performance. For this purpose, in this chapter, we classify OECD countries in clusters based on the similarities in their expenditure on health.

In this chapter, we examine the performance of the health system using multivariate statistical techniques, with special attention given to clustering and classification analysis. Relying on data from the World Health Organization (WHO) concerning the resources available to the health system and the results it produces, we attempt to highlight similarities and reveal differences between countries.

The rest of this chapter is organized as follows. In section 15.2, an overview of statistical techniques for the analysis of multivariate data is presented, with emphasis on cluster analysis techniques and methods. Section 15.3 is devoted to classification characteristics, and section 15.4 presents the results of the data analysis based on the health expenditure of OECD countries for the period 2000–2015.

15.2. Literature review

Cluster analysis is a method intended to classify groups of existing observations, using the information contained in a set of variables. In other words, cluster analysis examines how similar observations are on a number of variables, in order to create groups/clusters of observations that look alike.



The ideal would be to reach groups for which the observations within each group are homogeneous and the observations of different groups are as different as possible.

Cluster analysis first appears in papers related to anthropology and psychology [TRY 39, DRI 32, CLE 54]; a fairly comprehensive review of this early work is presented by Clements (1954). The significant development of clustering began in the mid-1970s and 1980s, with the increase in computational power [AND 73, EVE 80, ALD 84]. Nowadays, cluster analysis has found applications in sectors such as environmental science, biology, medicine, archaeology and marketing.

The idea behind cluster analysis is the organization of a collection of sample data (patterns) into clusters based on a similarity measure. In other words, given a set of data x1, x2, ..., xn, the analysis attempts to group the data in such a way that the "most similar" will be classified into the same group, while the "less alike" will form different groups [THE 08]. The proximity of any two elements is quantified with an appropriate proximity measure, which may be a similarity or distance measure and may or may not constitute a formal metric.

Typical steps in cluster analysis include the selection of the variables, the classification method, the metric distance and the linkage algorithm as well as, after the classification, the selection of an ideal number of clusters and, eventually, the evaluation and interpretation of the results. The questions that arise for the implementation of the above approach concern the way of measuring similarity and distance between groups/clusters, the way of defining the minimum distance for separating two groups, and the clustering criterion, since different criteria may give rise to different classifications with the same (or very similar) characteristics.

15.3. Classification characteristics

In cluster analysis, the construction of groups from a complex set of data requires a measure of similarity. There is often subjectivity in choosing a distance measure. It is advisable to choose a measure that takes into account the nature of the variables (discrete, continuous, binary), the measurement scale (nominal, ordinal, interval, ratio) and knowledge of the subject matter.



Generally, the distance $d$ between a random vector $X = (x_1, ..., x_p)$ and a random vector $Z = (z_1, ..., z_p)$ satisfies the following properties:
– positivity, i.e. $d(X, Z) \geq 0$, with $d(X, Z) = 0$ if and only if $X = Z$;
– symmetry, i.e. $d(X, Z) = d(Z, X)$;
– triangle inequality, i.e. $d(X, Z) \leq d(X, V) + d(V, Z)$ for any vector $V$.

15.3.1. Distance measures

The case of continuous data is perhaps the simplest, as well as the one that offers numerous possibilities. Indeed, many measures have been used to quantify the distance between continuous data. Some of the most popular are presented below. It should be pointed out that these measures do not necessarily satisfy all the properties mentioned above. In fact, for our purposes, only the first of the above three properties is mandatory, so that the distance between two objects or functions is non-negative.

– The Euclidean distance is defined by
$$d(X, Z) = |X - Z| = \sqrt{\sum_{i=1}^{p} (x_i - z_i)^2} = \sqrt{(X - Z)'(X - Z)}.$$
The Euclidean distance highly depends on the measurement units: by changing the scale, we obtain a different measure of distance. Large absolute values and outliers also have a much greater impact and often determine the magnitude of the distance. Finally, note that this distance ignores statistical properties and characteristics, such as the variability of each variable. A way to avoid these deficiencies is a proper standardization of each variable involved in the analysis.

– The city-block distance is defined by
$$d_C = \sum_{i=1}^{p} |x_i - z_i|$$
and is used in the presence of outliers, since the absolute value downweights the effect of extreme observations, as compared to the square used in the Euclidean distance.



– The Minkowski distance, given by
$$d_M = \left( \sum_{i=1}^{p} |x_i - z_i|^v \right)^{1/v},$$
is a generalization of the Euclidean distance which reduces the effect of outlying observations.

– The Chebyshev distance
$$d_T = \max_{i = 1, ..., p} |x_i - z_i|$$
greatly depends on differences in the measurement scale of the variables.

– The Czekanowski coefficient is defined by
$$d(X, Z) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, z_i)}{\sum_{i=1}^{p} (x_i + z_i)}.$$

– The Mahalanobis distance is given by
$$D_M(X, Z) = \sqrt{(X - Z)' \Sigma^{-1} (X - Z)}$$
and is a measure of distance constructed on statistical grounds, taking into account variances and covariances. It is often used to remove the multi-collinearity among variables. Note that the Mahalanobis distance resembles the so-called quadratic distance, which involves a general matrix Q in place of the inverse of the variance–covariance matrix Σ.

15.3.2. Clustering methods

The following are the two most common families of clustering algorithms:
– hierarchical methods: clusters are gradually formed either by joining smaller groups, continuously forming larger groups until all the data lie in a single cluster (agglomerative methods), or by dividing clusters into smaller ones until each observation forms its own cluster (divisive methods);



– partitional clustering: the data is divided into k segments, each segment corresponding to a single cluster; in contrast to hierarchical methods, the number of clusters to be created should be known in advance.

The most common algorithm in the large class of partitional clustering methods is the well-known K-means algorithm. The method works iteratively: it uses the concept of the center of the group and classifies the observations according to their distance from that center. The group center is defined as the mean value, over all group observations, of each variable, i.e. the vector of means. In contrast to hierarchical methods, the K-means algorithm does not require the distance matrix, and the data need not be stored during execution; it is therefore preferable for a large dataset. Another important point is how we measure the distance between groups. The most common methods of calculating this distance are nearest neighbor, furthest neighbor, average linkage, weighted average, centroid and the Ward method. Note that the Ward method differs greatly from all the others, as it is an ANOVA-based approach, i.e. it uses variance analysis to calculate the distances between clusters. For more details on classification and clustering techniques, the reader may refer, among others, to [AND 73, EVE 80, BER 03].

15.4. Data analysis

15.4.1. Data

The original data comes from the World Health Organization (WHO) and the Global Health Observatory (GHO)1. The full dataset consists of time series data from 2000 to 2015 and covers most of the countries (98 in total) monitored by the World Health Organization. The variables of the dataset can be divided into two distinct groups.

15.4. Data analysis

15.4.1. Data

The original data comes from the World Health Organization (WHO) and the Global Health Observatory (GHO)1. The full dataset consists of time series data from 2000 to 2015 and covers most of the countries (98 in total) monitored by the World Health Organization. The variables of the dataset can be divided into two distinct groups: the first group concerns variables that measure the resources, financial, material and human, flowing into the health system (input dataset) and affect the health of the population, and the second group concerns outcome variables (output dataset) of the input variables.

1 http://www.who.int/gho/en/

More specifically, there are 28 input variables, which can be further divided into three indicator subsets, namely economic inputs, human resources and infrastructures. A brief description is presented in Table 15.1. Furthermore, there are 20 output variables, which can be mainly characterized as demographic indices. A brief description is provided in Table 15.2.

1. Current health expenditure as a percentage of gross domestic product
2. Domestic general government health expenditure as a percentage of current health expenditure
3. Domestic private health expenditure as a percentage of current health expenditure
4. External health expenditure as a percentage of current health expenditure
5. Out-of-pocket expenditure as a percentage of current health expenditure
6. Pharmaceutical personnel density per 1,000 population
7. Physician density per 1,000 population
8. Laboratory health worker density per 1,000 population
9. Dentistry personnel density per 1,000 population
10. Community and traditional health worker density per 1,000 population
11. Other health worker density per 1,000 population
12. Environmental and public health worker density per 1,000 population
13. Health management & support worker density per 1,000 population
14. Nursing and midwifery personnel density per 1,000 population
15. Hospital beds per 1,000 population
16. Legislation
17. Cooperation
18. Surveillance
19. Response
20. Preparedness
21. Risk communication
22. Human resources
23. Laboratory
24. Points of entry
25. Zoonosis
26. Food safety
27. Chemical risk
28. Radionuclear risk

Table 15.1. Input variables


1. Stillbirths per 1,000 births
2. Infant mortality rate
3. Neonatal mortality rate
4. Under-five mortality rate
5. Mortality rate for 5–14 year-olds
6. Adult mortality rate
7. Adult mortality rate for male
8. Adult mortality rate for female
9. Life expectancy at birth
10. Life expectancy at birth for male
11. Life expectancy at birth for female
12. Life expectancy at age 60
13. Life expectancy at age 60 for male
14. Life expectancy at age 60 for female
15. Healthy life expectancy at birth
16. Healthy life expectancy at birth for male
17. Healthy life expectancy at birth for female
18. Healthy life expectancy at age 60
19. Healthy life expectancy at age 60 for male
20. Healthy life expectancy at age 60 for female

Table 15.2. Output variables

15.4.2. The analysis

This section presents the key results of the cluster analysis. The results are presented in two stages: initially, the results of each method are presented separately, and then an analysis is presented for comparative purposes. The methods used in this chapter require the number of clusters to be determined in advance. For this purpose, the silhouette method [KAS 17] is used: it measures how well each observation lies within its cluster, by comparing its mean distance to the members of its own cluster with its mean distance to the members of the nearest neighboring cluster, and the number of clusters maximizing the average silhouette width is selected. A sketch of this selection step is given below.
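A possible implementation of this selection step is sketched here: the average silhouette width is computed for a range of candidate numbers of clusters, and the value of k that maximizes it is retained. The data matrix X is hypothetical; this is illustrative code, not the authors' implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = rng.normal(size=(98, 28))  # hypothetical standardized data matrix

# Average silhouette width for k = 2, ..., 8; the largest value
# indicates the optimal number of clusters.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print(scores, "optimal k:", best_k)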

15.4.2.1. Analysis of the input dataset

According to Figure 15.1, the optimal number of clusters is 2 for the input dataset. Note that for the K-means algorithm, the optimal number of clusters is also equal to 2. Figures 15.2–15.4 provide the classification results for K-means (Figure 15.2) and hierarchical clustering using complete linkage (Figure 15.3) and Ward linkage (Figure 15.4). Table 15.3 gives the breakdown of countries into two clusters. The largest concentration of countries in a single cluster is observed with the hierarchical clustering with complete linkage (HCA-COM); the most balanced split, almost 50–50, is observed with the K-means (KM) algorithm.

Figure 15.1. Optimal K for hierarchical clustering. For a color version of this figure, see www.iste.co.uk/makrides/data4

Figure 15.2. Input K-means clustering. For a color version of this figure, see www.iste.co.uk/makrides/data4


Figure 15.3. Input hierarchical clustering (complete method). For a color version of this figure, see www.iste.co.uk/makrides/data4

Figure 15.4. Input hierarchical clustering (Ward method). For a color version of this figure, see www.iste.co.uk/makrides/data4

Cluster      KM              HCA COM         HCA WARD
             N      %        N      %        N      %
1            54     55.1%    84     85.7%    58     59.2%
2            44     44.9%    14     14.3%    40     40.8%
Total        98     100%     98     100%     98     100%

Table 15.3. Classification results for the input dataset

By comparing the classification of the countries under the HCA COM and Ward methods, we observe that 44 countries (44.9%) receive the same classification in the two hierarchical methods. Comparing the hierarchical classifications with K-means, the highest agreement is obtained with the complete linkage (54 countries, 55.1%); the Ward method (WARD) comes second with 44 countries (44.9%).

Methods               Similar classification    %
HCA COM vs. Ward      44                        44.9%
HCA COM vs. KM        54                        55.1%
HCA WARD vs. KM       44                        44.9%

Table 15.4. Input comparison
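The agreement percentages of Tables 15.4 and 15.6 can be obtained by comparing the label vectors produced by two methods, as in the minimal sketch below (with hypothetical labels). It assumes that the cluster numberings of the two methods already correspond; with only two clusters, relabeling one solution if the agreement falls below 50% is a simple safeguard.

import numpy as np

def agreement(labels_a, labels_b):
    # share of observations placed in the same cluster by both methods
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    return float(np.mean(labels_a == labels_b))

# hypothetical cluster labels for 98 countries from two methods
a = np.array([1] * 54 + [2] * 44)
b = np.array([1] * 44 + [2] * 54)
print(f"{100 * agreement(a, b):.1f}% similar classification")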

15.4.2.2. Analysis of the output dataset

Following the same procedure as in the previous section, we proceed in this section to build two classification groups for the output dataset. The ideal number of clusters is verified by the silhouette method for both clustering approaches. Figures 15.5–15.7 present the classification results for the output dataset. For the hierarchical clustering, the complete and Ward linkages have been used.

Table 15.5 gives the breakdown of countries into the two clusters. We observe that the largest concentration of countries in a single cluster is obtained with the HCA COM method (79%). All three methods separate the countries in a similar way, with each method assigning roughly two-thirds of the countries (from 65.5% to 79%) to the first cluster.


Figure 15.5. Output K-means clustering. For a color version of this figure, see www.iste.co.uk/makrides/data4

Figure 15.6. Output hierarchical clustering (complete method). For a color version of this figure, see www.iste.co.uk/makrides/data4


Figure 15.7. Output hierarchical clustering (Ward method). For a color version of this figure, see www.iste.co.uk/makrides/data4

Cluster      KM               HCA COM          HCA WARD
             N       %        N       %        N       %
1            82      68.9%    94      79.0%    78      65.5%
2            37      31.1%    25      21.0%    41      34.5%
Total        119     100%     119     100%     119     100%

Table 15.5. Summary output

Comparing the classification of the countries under the HCA COM and Ward methods, we observe that 103 countries (86.6%) receive the same classification in the two hierarchical methods, while 106 countries (89.07%) receive a common classification in all three methods, K-means included.

Methods               Similar classification    %
HCA COM vs. Ward      103                       86.6%
KM vs. HCA (all)      106                       89.07%
HCA COM vs. KM        107                       89.9%
HCA WARD vs. KM       115                       96.6%

Table 15.6. Output comparison


15.4.2.3. Comparison of input and output datasets

In Table 15.7, we observe that, of the 68 countries examined under the K-means algorithm for both input and output variables, 44 (64.7%) were classified in the same cluster for both datasets. The countries and their clusters are presented in Table 15.8.

Cluster      Input vs. output
             N      %
Same         44     64.7%
Different    24     35.3%
Total        68     100%

Table 15.7. Input–output K-means

Input      Output     Countries
Cluster 1  Cluster 1  Austria, Australia, Belgium, Bulgaria, Canada, Chile, China, Colombia, Denmark, Estonia, Finland, France, Georgia, Germany, Hungary, Iceland, Japan, Jordan, Latvia, Malaysia, Holland, New Zealand, Nicaragua, Norway, Oman, Poland, Portugal, Qatar, R. Korea, R. Moldova, Saudi Arabia, Singapore, Slovakia, Slovenia, Sweden, Switzerland, Thailand, Turkey, USA
Cluster 1  Cluster 2  Kazakhstan, India, Lithuania, Mongolia, Myanmar, Russia
Cluster 2  Cluster 1  Brazil, Ecuador, Ireland, Israel, Jamaica, Luxembourg, Maldives, Malta, Mauritius, Mexico, Montenegro, Pakistan, Panama, Serbia, Spain, Tunisia, the United Arab Emirates, UK
Cluster 2  Cluster 2  Botswana, Ethiopia, Gambia, Laos, Syria

Table 15.8. Input–output classification, K-means algorithm

In Table 15.9, we observe that, of the 43 countries examined under the Ward linkage method for both input and output variables, only 21 (48.8%) were ranked in the same cluster for both groups of variables.

Cluster      Input vs. output
             N      %
Same         21     48.8%
Different    22     51.2%
Total        43     100%

Table 15.9. Input–output Ward linkage


Finally, in Table 15.10, we observe that, of the 68 countries examined under the complete linkage method for both input and output variables, 58 (85.3%) were ranked in the same cluster for both groups of variables.

Cluster      Input vs. output
             N      %
Same         58     85.3%
Different    10     14.7%
Total        68     100%

Table 15.10. Input–output complete linkage

15.5. Conclusion

Demographic factors and the global economic crisis forced many countries to look into health costs and, especially, to seek ways to "force" these resources to produce better results. This has led to a growing interest in comparing the performance of national health care systems. Expenditure on health is an important indicator for understanding and measuring health care performance. For this purpose, in this chapter, we focused on the performance of health systems using multivariate statistical techniques and classified the countries under study into clusters based on the similarities of their expenditures on health.

15.6. References

[ALD 84] ALDENDERFER M.S., BLASHFIELD R.K., Cluster Analysis, Sage Publications, Newbury Park, 1984.

[AND 73] ANDERBERG M.R., Cluster Analysis for Applications, Academic Press, New York, 1973.

[BER 03] BERKES I., HORVÁTH L., KOKOSZKA P., "GARCH processes: Structure and estimation", Bernoulli, vol. 9, no. 2, pp. 201–227, 2003.

[CLE 54] CLEMENTS F.E., Use of Cluster Analysis with Anthropological Data, 1954. Available at: https://anthrosource.onlinelibrary.wiley.com/doi/pdf/10.1525/aa.1954.56.2.02a00040.

[DRI 32] DRIVER H.E., KROEBER A.L., Quantitative Expression of Cultural Relationships, University of California Publications in American Archaeology and Ethnology, vol. 31, pp. 211–256, 1932.

[EVE 80] EVERITT B., Cluster Analysis, Wiley, Chichester, 1980.

[KAS 17] KASSAMBARA A., Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning (Multivariate Analysis) (Volume 1), CreateSpace Independent Publishing Platform, Scotts Valley, 2017.

[THE 08] THEODORIDIS S., KOUTROUMBAS K., Pattern Recognition, 4th edition, Academic Press, New York, 2008.

[TRY 39] TRYON R.C., Cluster Analysis: Correlation Profile and Orthometric (Factor) Analysis for the Isolation of Unities in Mind and Personality, Edwards Brothers, Ann Arbor, 1939.

[WAR 63] WARD J.H., "Hierarchical grouping to optimize an objective function", Journal of the American Statistical Association, vol. 58, pp. 236–244, 1963.

16 Tobacco-related Mortality in Greece: The Effect of Malignant Neoplasms, Circulatory and Respiratory Diseases, 1994–2016

Smoking is a popular habit in the Greek population. Because of that (not forgetting the problem of second-hand smoking), there is an elevated risk for the development of tobacco-related diseases. Thus, the mortality caused by diseases of the circulatory system, by malignant neoplasms and by diseases of the respiratory system was studied. A combined procedure was applied to estimate the years of longevity lost, by age and gender, under the action of each tobacco-related cause of death. According to this procedure, the typical method of elimination of related causes of death is applied along with the method developed by Arriaga [ARR 84] for the examination of the differences in life expectancy between two life tables. The results of the analysis indicate a decreasing effect of the tobacco-related diseases of the circulatory system on longevity. A more or less stagnated picture emerges for neoplasms. Contrarily, the effects of the diseases of the respiratory system become more significant over time. Overall, the problem remains significant for the Greek population, a fact that outlines the need for the intensification of the existing policies, or the application of new ones, concerning smoking and second-hand smoking.

Chapter written by Konstantinos N. ZAFEIRIS.

16.1. Introduction

People in Greece start smoking at a very young age: about one-third of high school students have tried tobacco, while 16.2% are current smokers (year 2007; Kyrlesi et al. [KYR 07]). The prevalence of smoking and second-hand smoking is also high in newly recruited soldiers (years 2013 and 2014; Grapatsas et al. [GRA 17]), as in rural Greece and elsewhere (Stafylis et al. [STA 18]). According to Eurostat [EUR 16], about one in three individuals aged 15 and over was a smoker, and a vast majority of the population (64.2%) was exposed daily to tobacco smoke indoors in 2014, despite the relevant antismoking legislation (see Telionatis et al. [TEL 17]), which is seldom enforced effectively (Filippidis and Tzoulaki [FIL 16]). Consequently, as Stafylis et al. [STA 18] note, the risk for tobacco-related diseases must be considered more than significant.

According to the Harvard School of Public Health [HAR 11], we can speak of a "Greek tobacco epidemic", as in 2008, 46.4% of deaths were due to cardiovascular diseases, 25.7% due to cancer and 9.6% due to diseases of the respiratory system. Obviously, the cost of the effects of tobacco consumption on the population's health and longevity is huge, not forgetting at the same time the money required by individuals to satisfy this habit, the expenditure on social and medical care and the hospitalization of patients.

Therefore, the scope of this chapter is to analyze the mortality developments caused by diseases related to tobacco consumption. In the literature, such an analysis of Greek data, which usually ignores the temporal trends of the phenomenon, has been based on standardized mortality rates, age-specific mortality rates' levels and changes (see, for example, Filippidis et al. [FIL 17]; Laliotis et al. [LAL 16]; Rontos et al. [RON 10]; Nikolaidis et al. [NIK 04]) or on the smoking-attributable years of potential life lost (the product of smoking-attributable mortality and the relative life expectancy at the midpoint of each age category; see Harvard School of Public Health [HAR 11]; Baliunas et al. [BAL 07]). Instead, in this chapter, a new procedure will be presented and applied to the most recent available data, also taking into account the temporal trends of smoking-related mortality since 1994, in an effort to fill an existing gap in the published literature.

According to this procedure, the first step is to construct the actual (observed) life table, including the tobacco-related deaths, separately for each gender. Afterwards, these deaths are eliminated (see Namboodiri [NAM 87, pp. 92–107 and 137–163]; Preston et al. [PRE 01, pp. 71–89]) and a new life table is constructed. During this process, conventional methods are used (see Chiang [CHI 84]; Preston et al. [PRE 01, pp. 71–89]).


These two life tables are subsequently compared by applying the Arriaga method [ARR 84, ARR 89] in order to evaluate the effects of smoking-related deaths on life expectancies at birth and examine their impact on each age group of the human life span. The details of this procedure will be discussed in section 16.2.

16.1.1. Smoking-related diseases

According to the U.S. Department of Health and Human Services [OFF 04], smoking can lead to a variety of diseases, i.e. cancer, cardiovascular and respiratory diseases, and has several other effects on health, for example, reproductive ones. Auger et al. [AUG 14] (see also Baliunas et al. [BAL 07]) divide the main causes of death related to some extent to smoking into seven categories:

1) malignant neoplasms: (a) lip, oral cavity, pharynx, (b) esophagus, (c) larynx, (d) trachea, bronchus, lung, (e) stomach, (f) pancreas, (g) cervix uteri, (h) kidney, (i) bladder and (j) myeloid leukemia;

2) circulatory system: (a) ischemic heart diseases, (b) other heart diseases, (c) cerebrovascular disease, (d) atherosclerosis, (e) aortic aneurism, dissection, (f) other arterial diseases;

3) respiratory system: (a) influenza, pneumonia, (b) bronchitis, emphysema, (c) other chronic obstructive diseases;

4) digestive: ulcer;

5) mental-behavioral disorders;

6) exposure to smoke, fire, flames;

7) perinatal: (a) length of gestation, fetal growth, (b) sudden infant death syndrome.

It must be emphasized that, while the causal relationship between smoking and the appearance of these diseases is unquestionable, some other factors may play an important role too. For example, air pollution has been associated with increased mortality and morbidity due to cardiovascular and respiratory diseases (Hoek et al. [HOE 13]). Also, several carcinogens (among them the use of tobacco), radiation, some viruses and other factors may induce cancer (see Cooper [COO 00]). This subject will be briefly discussed later, where necessary.


16.2. Data and methods

Data come from the Eurostat database (https://ec.europa.eu/eurostat/data/database) and refer to the midyear population per gender and age for the years 1994–2016. The number of deaths by year, age, gender and cause of death was also retrieved and used for the analysis. However, although this database contains detailed information about the malignant neoplasms described in the previous section, no data are available for myeloid leukemia. Thus, this disease was omitted from the analysis, noting at the same time its very low impact on mortality (only 1.8% of the reported deaths from smoking-related malignant neoplasms in 2008; see Harvard School of Public Health [HAR 11, p. 26]). There are also no detailed data for atherosclerosis, aortic aneurism, dissection and "other arterial diseases" (these diseases account for 5.1% of the deaths due to smoking-related cerebrovascular diseases in 2008; see Harvard School of Public Health [HAR 11, p. 26]). Thus, the analysis concerning the deaths due to deficiencies of the circulatory system includes the following categories: cerebrovascular diseases, ischemic heart diseases and "other diseases of the circulatory system". For the diseases of the respiratory system, the analysis was done on the following categories: influenza (including swine flu), pneumonia, diseases of the lower respiratory system (asthma and status asthmaticus and other lower respiratory diseases) as well as other diseases of the respiratory system (remainder of J00-J99), as smoking can cause a great variety of pulmonary diseases (see Centers for Disease Control and Prevention (US) [CEN 10]), which are not cited separately in the database. Thus, the analysis was focused exclusively on the diseases that constitute the major causes of tobacco-related deaths in the Greek population. The elimination of these causes from the observed life table is sketched below.
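The following sketch illustrates the elimination step in a simplified single-decrement setting: tobacco-related deaths are subtracted from the age-specific death counts, and an abridged life table is built from the observed and the cause-deleted rates. All arrays are hypothetical, and the construction assumes deaths are spread evenly within each interval (a_x = 0.5), which is a simplification of the conventional (Chiang-type) methods cited above.

import numpy as np

def life_table(mx, n, ax=0.5):
    """Abridged life table from age-specific mortality rates mx with
    interval widths n; returns life expectancy at the start of each
    interval. Deaths are assumed to occur, on average, halfway through
    each interval."""
    qx = n * mx / (1.0 + n * (1.0 - ax) * mx)   # probabilities of dying
    qx[-1] = 1.0                                 # open-ended last interval
    lx = 100000.0 * np.cumprod(np.concatenate(([1.0], 1.0 - qx[:-1])))
    dx = lx * qx
    Lx = n * (lx - dx) + n * ax * dx             # person-years lived
    Lx[-1] = lx[-1] / mx[-1]                     # open-ended last interval
    Tx = np.cumsum(Lx[::-1])[::-1]               # person-years above age x
    return Tx / lx

# hypothetical deaths (all causes / tobacco-related) and exposures
widths = np.array([15., 15., 15., 20., 20., 10.])
deaths_all = np.array([300., 150., 400., 1500., 4000., 22500.])
deaths_tob = np.array([0., 10., 80., 500., 1500., 2000.])
expos = np.array([8e5, 8e5, 8e5, 9e5, 7e5, 1.5e5])

ex_obs = life_table(deaths_all / expos, widths)              # observed
ex_del = life_table((deaths_all - deaths_tob) / expos, widths)  # eliminated
print(f"e0 observed: {ex_obs[0]:.2f}, eliminated: {ex_del[0]:.2f}, "
      f"years lost: {ex_del[0] - ex_obs[0]:.2f}")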


The method used for the analysis was briefly discussed in section 16.1. It must be noted that many methods besides the Arriaga one could have been used for this purpose, like those of Andreev [AND 82], Pollard [POL 82, POL 88] and Andreev and Shkolnikov [AND 12] (see also Andreev et al. [AND 02]). The Arriaga method was chosen not only for its reliability, as seen in the published literature (see, for example, Yang et al. [YAN 12]; Auger et al. [AUG 14]; Le et al. [LE 15]; Sunberg et al. [SUN 18] and Chisumpa et al. [CHI 18]), but also because of its simplicity, as will be proven later.

We must also emphasize that Arriaga [ARR 84] compared the differences of life expectancy between two time points; however, this procedure can be easily applied in order to study longevity differences due to the effect of the smoking-related causes of death. According to this procedure, the effects of mortality change on life expectancies can be classified as "direct" and "indirect": "the direct effect on life expectancy is due to the change in life years within a particular age group as a consequence of the mortality change in that age group." The indirect effect "consists of the number of life years added to a given life expectancy because the mortality change within (and only within) a specific age group will produce a change in the number of survivors at the end of the age interval." Another effect springs from the interaction between the exclusive effect of each age group and the overall effect.

The direct effect, denoted by ${}_{i}DE_x$, is calculated as:

$${}_{i}DE_x = \frac{l_x^t}{l_a^t} \left( \frac{T_x^{t+n} - T_{x+i}^{t+n}}{l_x^{t+n}} - \frac{T_x^t - T_{x+i}^t}{l_x^t} \right)$$

The indirect effect, denoted by ${}_{i}IE_x$, is given by:

$${}_{i}IE_x = \frac{T_{x+i}^{t+n}}{l_a^t} \left( \frac{l_x^t \, l_{x+i}^{t+n}}{l_{x+i}^t \, l_x^{t+n}} - 1 \right)$$

The interaction ${}_{i}I_x$ is given by:

$${}_{i}I_x = {}_{i}OE_x - {}_{i}IE_x$$

where

$${}_{i}OE_x = \frac{T_{x+i}^{t+n}}{l_a^t} \left( \frac{l_x^t}{l_x^{t+n}} - \frac{l_{x+i}^t}{l_{x+i}^{t+n}} \right)$$

In these equations, the terms $x$ and $x+i$ refer to age groups, $t$ and $t+n$ correspond to time points (years), $l$ is the number of survivors at an exact age and $T$ is the number of person-years lived by the members of a population beyond that age.

The effect of each cause of death studied on mortality was calculated according to the following formula (Arriaga [ARR 89]; see also Preston et al. [PRE 01, pp. 84–86]):

$${}_{n}\Delta_x^i = {}_{n}\Delta_x \cdot \frac{{}_{n}R_x^{i,2} \, {}_{n}m_x^2 - {}_{n}R_x^{i,1} \, {}_{n}m_x^1}{{}_{n}m_x^2 - {}_{n}m_x^1}$$

where 2 represents the relevant values of the second life table, in which the tobacco-related causes of death have been eliminated, and 1 refers to the observed data (i.e. it includes tobacco-related diseases). ${}_{n}R_x^i$ is the proportion of deaths from cause $i$, ${}_{n}\Delta_x$ is the contribution of the all-cause mortality differences in the age group $x$ to $x+n$ to the differences in life expectancies, and ${}_{n}m_x$ corresponds to the age-specific mortality rates. Obviously, the term ${}_{n}R_x^{i,2}$ of the above equation is 0, since the tobacco-related causes are absent from the second life table.

According to this formula, the effects of each tobacco-related cause of death on longevity (denoting in that way life expectancy at birth, e0) can be estimated as the years of e0 lost due to a fatal disease. For example, an effect of 2 years lost under the action of a specific cause of death constitutes the additional number of years the population would be expected to live if this cause of death were absent. Besides this, a relative effect of each cause of death was calculated as the quotient of the years lost to the observed life expectancy at birth. Because the absolute and relative measures of this procedure follow a parallel course, only the absolute values will be discussed in this chapter, unless it is necessary to discuss the relative ones too. A sketch of the decomposition is given below.
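As an illustration of these formulas, the following sketch implements the decomposition and the cause-specific allocation. It is an illustrative implementation of the equations above, not the author's code; the life-table columns are hypothetical, with table 1 the observed life table and table 2 the life table in which the tobacco-related causes have been eliminated.

import numpy as np

def arriaga(lx1, Tx1, lx2, Tx2):
    """Contribution of each age group to e0(2) - e0(1), combining the
    direct, indirect and interaction effects; lx, Tx as defined in the
    text, with radix lx[0]."""
    la = lx1[0]
    k = len(lx1)
    contrib = np.zeros(k)
    for x in range(k - 1):
        # direct effect: change in life years lived within the age group
        DE = lx1[x] / la * ((Tx2[x] - Tx2[x + 1]) / lx2[x]
                            - (Tx1[x] - Tx1[x + 1]) / lx1[x])
        # indirect effect plus interaction (the "other" effects OE)
        OE = Tx2[x + 1] / la * (lx1[x] / lx2[x] - lx1[x + 1] / lx2[x + 1])
        contrib[x] = DE + OE
    # open-ended last age group: direct effect only
    contrib[-1] = lx1[-1] / la * (Tx2[-1] / lx2[-1] - Tx1[-1] / lx1[-1])
    return contrib

def cause_allocation(contrib, R1, mx1, mx2):
    """Allocation of each age group's contribution to a given cause
    (Arriaga [ARR 89]); R1 is the proportion of deaths from the cause in
    the observed table. The corresponding proportion in the cause-deleted
    table is 0, so the numerator reduces to -R1 * mx1."""
    return contrib * (0.0 - R1 * mx1) / (mx2 - mx1)

# hypothetical life-table columns for three broad age groups
lx1 = np.array([100000., 95000., 80000.])
Tx1 = np.array([7.30e6, 3.10e6, 1.20e6])
lx2 = np.array([100000., 96000., 83000.])
Tx2 = np.array([7.50e6, 3.25e6, 1.30e6])

contrib = arriaga(lx1, Tx1, lx2, Tx2)
print(contrib, contrib.sum())   # the sum equals e0(2) - e0(1)

Because the decomposition is exact, the age-group contributions sum to the total difference in life expectancy at birth, which is the "years lost" quantity reported in the results below.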

16.3. Results

16.3.1. Life expectancy at birth

Longevity, expressed by life expectancy at birth, has increased significantly in Greece since 1994. Until 2013, the improvements were almost linear in both genders (Figure 16.1). Afterwards, as clearly seen from the polynomial lines fitted to these data, a "plateau" has formed. In other words, a stalling in the longevity increase is observed in the male and female population of the country.

Figure 16.1. Life expectancy at birth, Greece 1994–2017. For a color version of this figure, see www.iste.co.uk/makrides/data4

The financial crisis which afflicted Greece after 2008 is well-known worldwide. Could these economic and the subsequent social developments have affected longevity in Greece? It is very difficult to speculate about this based on a single measurement. Zafeiris and Kostaki [ZAF 19] found that during this time, a significant decrease in the effects of the accident hump in the ages 15–30 was observed, mainly connected with the improvement of the infrastructure and the economic hardship of young people after 2008. Females are differentiated during this process: behavioral and lifestyle reasons are responsible for the severity of the accident hump, which is less important in comparison to males and is declining continuously.

In the literature, it has been seen that either a pro-cyclical or an anticyclical pattern of mortality may prevail in an era of economic recession. In the first case, mortality decreases in times of recession and increases in boom periods (see, for example, Ruhm [RUH 07]). In the second case, the opposite happens: mortality decreases in conditions of economic growth and increases in times of recession (see Catalano [CAT 97]). However, it must be taken into account that a "time lag" is needed for the effects of an economic crisis to be seen on mortality (for more on this concept, see Laporte [LAP 04]). Also, certain cause- and age-specific effects could be more complicated. For example, reduced economic activity may lead to a reduction in deaths from road accidents, while an increase in death rates among middle-aged persons, especially males, from lifestyle-related causes might be observed. In any case, a detailed discussion of the effects of economic crises on longevity in Greece is beyond the scope of this chapter. Instead, the effects of smoking-related mortality will be presented in the following sections.

16.3.2. Effects of the diseases of the circulatory system on longevity

The relationship between cardiovascular disease and smoking is indisputable. According to Ezzati et al. [EZZ 05], 10% of deaths worldwide due to this disease were attributed to smoking in 2000. In the United States, smoking caused 33% of deaths due to cardiovascular disease and 20% of deaths from ischemic heart disease in persons older than 35 years of age. According to the European Heart Network [EUR 17], cardiovascular disease is the leading cause of mortality in Europe, being responsible for approximately 45% of total deaths. Besides these direct effects, it must be noted that smoking is also related to other risk factors of this disease, such as glucose intolerance and low serum levels of high-density lipoprotein cholesterol (see, for example, Jacobs et al. [JAC 99]).

It is not surprising, then, that the overall effect of the diseases of the circulatory system is huge, this being the most important cause of death in Greece at least during the first years of the study (Figure 16.2). Afterwards, the effect on longevity decreases in both genders, as does its proportion of the observed life expectancy at birth, and the two genders tend to converge. For the sum of the diseases of the circulatory system studied, a significant female excess mortality is observed, a well-known phenomenon in the literature. Garcia et al. [GAR 16], in their review of cardiovascular disease in women, discuss several clinical perspectives which are responsible for this phenomenon, including hypertension, dyslipidemia, diabetes, smoking, obesity, physical inactivity, preterm delivery, hypertensive pregnancy disorders, gestational diabetes and menopausal transition. Female excess mortality is confirmed both in cerebrovascular diseases and in the "other heart diseases".


Figure 16.2. Effects of the diseases of the circulatory system on longevity. Percentages represent the proportion of the effect of each cause of death in the observed life expectancy at birth

In particular, the effect of cerebrovascular diseases in females increases, not without significant variations, until the beginning of the 21st century and decreases almost linearly afterwards. In fact, this effect increases from 5.2 years lost in 1994 to 6.1 in 2001 and then decreases to 2.6 years in 2016. At the same time, in males, after its peak in 1998 (3.8 years), the years lost because of these diseases decrease significantly to almost 2 years in 2016. Thus, a strong convergence trend of the two genders is evident.

In the category "other heart diseases", the phenomenon of female excess mortality is evident too. The effect of this group of diseases on longevity increases from 4.5 years in 1994 to 5.68 in 1998; afterwards, it fluctuates considerably, but at high levels, until 2009 (5.1 years) and decreases to 2.3 years in 2016. An almost parallel trend, but at lower levels, is observed in males: from 3.2 years lost in 1994, it increases to 3.76 years in 1998, fluctuates considerably and, after 2008 (3.3 years), decreases to 1.74 years in 2016.

The female excess mortality, however, is not confirmed for the ischemic heart diseases, for which mortality is higher in males. This is also a known phenomenon in the literature. According to Maas and Appelman [MAA 10], the prevalence of coronary heart disease before menopause is low in women and is predominantly attributed to smoking, while in general it is lower than in males. In Greece, the two genders followed an almost parallel temporal trend. In males, after an increase in the first years of the study, the effect of ischemic heart diseases reaches 4.1 years in 1997 and then, with some fluctuations, decreases to 2.8 years in 2016. In females, it reaches 2.7 years in 1997, fluctuates significantly afterwards and, after 2003–2004, decreases to 1.5 years.

Figure 16.3. Effects of the diseases of the circulatory system on longevity by broad age group. Males: white markers and continuous lines; females: black markers and dashed lines. Percentages represent the proportion of the effect of each cause of death in the observed life expectancy at birth

The analysis of the effects of the diseases of the circulatory system on longevity per broad age group leads to some notable conclusions (Figure 16.3). For all the diseases studied, the developments are governed by the mortality of the older group of people aged 65+ years. The effects of cerebrovascular diseases for this age group fluctuate significantly until 2005–2006, when a clear downward trend appears in both genders. Thus, in females, the effect of the mortality changes in the older age group decreases from 5.8 years lost in 2001 to 2.4 years in 2016. In males, the relevant figures are 3.8 years in 1998 and 1.97 years in 2016. The developments for this cause of death in the ages 45–64 years are smaller, as are their effects. In both genders, an almost linear trend limits the effects of these diseases to 0.12 years in females and 1.7 years in males. However, the female excess described in the previous paragraphs is not confirmed for that age group, as mortality is always higher in males.

A similar situation prevails for the group named "other heart diseases". In females, the effect of these diseases tends to increase until 1998 (the effect on life expectancy at birth is 5.5 years); it fluctuates afterwards and decreases from 5.2 years in 2006 to 2.3 years in 2016. In males, from 3.4 years in 1998, it gradually decreases to 1.5 in 2016. In the age group of 45–64 years, a picture similar to the one described in the previous paragraph is seen: the female excess is not confirmed, and mortality is higher in males. However, a significant difference exists. In females, the effect of this group of diseases on longevity decreases from 0.2 years in 1994 to 0.1 years in 2002 and remains practically unchanged afterwards. In males, it decreases from 0.3 years in 1994 to 0.2 years in 2003 and remains almost stable afterwards.

In the third group of diseases, the ischemic heart diseases, the magnitude of the differences between the two genders is smaller in the older people (65+ years old), and the two genders follow a parallel course, approaching an effect of 1.7 years in males and 1.3 years in females in 2016. The male excess mortality for this age group springs from the mortality gender differentials in the age group of 45–64 years. The relevant effect in males increases from 1.1 years in 1994 to 1.3 years in 1999, stagnates at around 1.2 years between 2000 and 2007 and decreases afterwards to 0.9 years in 2016. The effect in the female population is significantly smaller, about 0.2–0.3 years.

16.3.3. Effects of smoking-related neoplasms on longevity

After the diseases of the circulatory system, the tobacco-related neoplasms constitute the second most important aggravating factor of mortality in the Greek population (Figure 16.4). Their overall effect is about 3.5–4 years in males and 1–1.4 years in females. No clear temporal trends are observed concerning their effects on mortality; rather, the years lost because of this group of diseases fluctuate within the limits mentioned before.


Figure 16.4. Effects of tobacco-related neoplasms on longevity. Percentages represent the proportion of the effect of each cause of death in the observed life expectancy at birth


Among the tobacco-related neoplasms, the most important effect comes from the malignant neoplasms of the trachea, bronchus and lung. It is worth noting that, according to the American Lung Association1, smoking and second-hand smoking cause about 90% of lung cancer cases. Their effect on males ranges between 2.1 and 2.3 years. However, lifestyle differences concerning the trigger factors of these diseases are responsible for the gender differences in the years lost over time. In females, the years lost increase from 0.4 in 1994 to 0.6 in 2016, a sign of a marginal elevation of their effects due to an increase in the number of female smokers (see Sifaki-Pistolla et al. [SIF 17]).

1 https://www.lung.org/lung-health-and-diseases/lung-disease-lookup/lung-cancer/learn-about-lung-cancer/what-is-lung-cancer/what-causes-lung-cancer.html.

All the other tobacco-related neoplasms have a smaller impact on longevity. Of them, stomach cancer is of greater importance. According to Khani et al. [KHA 18], there is a strong and positive association between tobacco smoking and stomach cancer in men, while it is positive but weak in women. Since the middle of the last century, the incidence and mortality rate of stomach cancer in most high-income countries have declined. These findings are confirmed in the Greek population. Mortality due to stomach cancer is higher in males than in females. The years lost under the effects of this disease increase marginally from 0.36 in 1994 to about 0.4 in the period 1997–2002 and afterwards decrease to 0.3 in 2016. In females, the number of years lost is lower; it increases marginally to 0.3 years until 2002 and decreases to 0.2 years in 2016.

According to Zeegers et al. [ZEE 00], the risk of developing bladder cancer is two to four times higher in smokers than in non-smokers, while smoking is responsible for 23% of cases in females and 50% in males (Zeegers et al. [ZEE 04]). This picture is confirmed by the bladder cancer mortality rate in Greece. The years lost fluctuate between 0.3 and 0.4 years in males, while in females, the effect is always lower than 0.1 years.

The next disease, pancreatic cancer, is related to the number of cigarettes smoked and the duration of use (Iodice et al. [IOD 08]). A male excess mortality is observed for this cause of death. In males, the years lost because of pancreatic cancer, despite the temporal fluctuations, increase from 0.26 in 1994 to 0.36 in 2016. In females, they increase from 0.2 to 0.3 in the same years.

As for laryngeal cancers, according to Jones et al. [JON 16], alcohol and tobacco play a role in their development. The years lost because of larynx cancers are limited: below 0.16 in males and 0.02 in females. The same gender diversification is observed in kidney cancers. While the risk of developing such a cancer is related to tobacco smoking (Zeegers et al. [ZEE 00]), the mortality caused by it in males, despite its generally increasing temporal trend, remains at 0.15–0.17 years lost in 2014–2016. In females, the effect is minimal: less than 0.1 years lost over the whole period studied. The other three groups of cancers, those of the lip, oral cavity and pharynx, of the esophagus and of the cervix uteri (for females), always have minimal effects on longevity. A small increasing trend of mortality is observed under the action of lip, oral cavity and throat cancers.

Figure 16.5. Effects of malignant neoplasms on longevity by broad age group. Males: white markers and continuous lines; females: black markers and dashed lines. Percentages represent the proportion of the effect of each cause of death in the observed life expectancy at birth

If the effects of neoplasms on longevity by broad age group are studied (Figure 16.5; the younger age groups have been omitted due to their small effects on population longevity), then a stagnated picture emerges. It is seen that in both genders, the findings are formed mainly by the effects of cancers on the older people (65+ years) and, to a lesser degree, by the relevant effects in the people aged 45–64 years, while male excess mortality is confirmed in all of them. The picture of the "overall" cancers is similar to that of the cancers of the trachea, bronchus and lung, which constitute the most important factor for the regulation of longevity. Among them, a marginal decrease is observed after 2008 in males aged 45–64 years. In the older age group of 65+ years, the years lost tend to increase slightly, mainly in males.

16.3.4. Effects of respiratory diseases on longevity

According to Behr and Nowak [BEH 02], tobacco smoking affects the respiratory tract through two mechanisms: 1) by the induction of inflammation and 2) through its mutagenic/carcinogenic action. The second mechanism was discussed in the previous section. As for the first mechanism, smoking is the most important factor for the development of chronic obstructive pulmonary disease and interstitial lung diseases (idiopathic pulmonary fibrosis/usual interstitial pneumonia, desquamative interstitial pneumonia, respiratory bronchiolitis-associated interstitial lung disease, etc.). It must be stressed that in some diseases, a decreased incidence or severity is observed in smokers (e.g. hypersensitivity pneumonitis). In any case, several details of the smoking-related diseases can be found in Centers for Disease Control and Prevention [CEN 08]. Also, it must not be forgotten that other agents may play an important role in the development of the respiratory diseases, including environmental pollution, socioeconomic status, the number of family members in relation to residential space, the existence and nature of respiratory disease of co-habitants and their smoking habits, and the kind of heating and cooking sources (Sichletidis et al. [SIC 05]).

In that way, the loss of longevity because of the respiratory diseases tends to increase, with fluctuations, in males from 1.4 years in 1994 to 2.4 years in 2009 and, after a small retreat, remains at 2.3–2.4 years in the period 2014–2016 (Figure 16.6). In females, the same general trend is followed: while 1.2 years were lost because of the diseases of the respiratory system in 1994, the relevant value is 2.6 years in 2009 and 2.3–2.4 years in 2014–2016. Although a male excess mortality is evident in the first years of the study, females quickly converge with males, and in the period 2002–2012, the effects of respiratory diseases on their mortality are significantly higher than in males in absolute numbers. However, the relative effect of these diseases on longevity remains lower most of the time; only in the period 2002–2004 is it marginally higher. Overall, these developments can reasonably be attributed to the changing habits of women in Greece and the increase of smoking among them.


The most important effect on longevity comes from the group of diseases named "other diseases of the respiratory system". This group represents the remaining diseases of the respiratory system when influenza, pneumonia and chronic lower respiratory diseases are removed2. The latter three groups of diseases are presented separately in Figure 16.6.

Figure 16.6. Effects of the respiratory diseases on longevity. Percentages represent the proportion of the effect of each cause of death in the observed life expectancy at birth

The general pattern observed so far for the total of the respiratory diseases is largely followed by the "other diseases of the respiratory system". In males, the effect on longevity increases from 1 year in 1994 to 1.6 years in 2009, retreats slightly afterwards and, after an increasing course, reaches 1.6 years in 2015 and 1.5 years in 2016. In females, from 0.9 years in 1994, it reaches 1.8 in 2008 and, following a course parallel to that of males, stands at 1.7 years or more in the period 2014–2016. An analogous course is described by the relative frequencies of Figure 16.6. However, even if the differences between the two genders were very small in the period 1994–2000, afterwards the effects, both in relative and in absolute numbers, tend to be more important in the female population of the country. As pointed out before, this might be attributed to a progressively increased use of tobacco products by women.

2 See https://ec.europa.eu/eurostat/ramon/nomenclatures/index.cfm?TargetUrl=LST_NOM_DTL&StrNom=COD_2012&StrLanguageCode=EN&IntPcKey=31100702&StrLayoutCode=HIERARCHIC&IntCurrentPage=1.

The diseases of the lower respiratory system are presented separately in Figure 16.6, distinguished as "asthma and status asthmaticus" and "other lower respiratory diseases". Mortality from the group of diseases called "other lower respiratory diseases" tends to increase over time, despite the significant fluctuations which are observed. The effects on longevity are more moderate in comparison to the previous group of diseases studied. In males, the years lost increase from 0.3 in 1994 to a maximum of 0.7 in 2015. In females, the relevant values are 0.1 and 0.44. Additionally, a male excess mortality is observed for all the years studied, which can be attributed to the higher smoking prevalence in males. The effects of "asthma and status asthmaticus" on longevity are limited in both genders and decreasing for much of the period studied; a minor increase is observed in the last three years of the study.

As for pneumonia, an irregular pattern is followed over the years, corresponding to times of outbreak and recession of the disease. During this course, in males, the maximum of the years lost is 0.3 in 2007 and the minimum is 0.1 in 1994. The relevant figures for females are 0.23 years in 2009 and 0.15 in 1994. Influenza (including swine flu) exhibits some outbreaks and recessions after 2008; however, its effect on the longevity of the two genders is minimal.

As largely happened with the other diseases, the developments of the effects of the respiratory ones on longevity are governed by the temporal trends observed in the older people. These trends largely follow the relevant ones described previously for the overall effect of these diseases. The observed changes in the age group 45–64 years are minor. The temporal pattern is more complicated in the younger ages