Data Analysis and Applications 2 [1st edition] 9781786304476

This series of books collects a diverse array of work that provides the reader with theoretical and applied information


English Pages 252 [256] Year 2019


Data Analysis and Applications 2

Big Data, Artificial Intelligence and Data Analysis Set coordinated by Jacques Janssen

Volume 3

Data Analysis and Applications 2 Utilization of Results in Europe and Other Topics

Edited by

Christos H. Skiadas James R. Bozeman

First published 2019 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address: ISTE Ltd 27-37 St George’s Road London SW19 4EU UK

John Wiley & Sons, Inc. 111 River Street Hoboken, NJ 07030 USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd 2019 The rights of Christos H. Skiadas and James R. Bozeman to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. Library of Congress Control Number: 2018965157 British Library Cataloguing-in-Publication Data A CIP record for this book is available from the British Library ISBN 978-1-78630-447-6

Contents

Preface

Introduction
Gilbert SAPORTA

Part 1. Applications

Chapter 1. Context-specific Independence in Innovation Study
Federica NICOLUSSI and Manuela CAZZARO
1.1. Introduction
1.2. Parametrization for CS independencies
1.3. Stratified chain graph models
1.4. Application on real data
1.5. Conclusion
1.6. References

Chapter 2. Analysis of the Determinants and Outputs of Innovation in the Nordic Countries
Cátia ROSÁRIO, António Augusto COSTA and Ana LORGA DA SILVA
2.1. Introduction
2.2. Innovation
2.3. Methodology
2.4. Results
2.5. Conclusion
2.6. References

Chapter 3. Bibliometric Variables Determining the Quality of a Dentistry Journal
Pilar VALDERRAMA, Manuel ESCABIAS, Evaristo JIMÉNEZ-CONTRERAS, Mariano J. VALDERRAMA and Pilar BACA
3.1. Introduction
3.2. Statistical methodology
3.3. Results
3.4. Conclusions
3.5. Acknowledgment
3.6. References

Chapter 4. Analysis of Dependence among Growth Rates of GDP of V4 Countries Using Four-dimensional Vine Copulas
Jozef KOMORNÍK, Magda KOMORNÍKOVÁ and Tomáš BACIGÁL
4.1. Introduction
4.2. Theory
4.3. Results
4.4. Conclusion and future work
4.5. Acknowledgment
4.6. References

Chapter 5. Monitoring the Compliance of Countries on Emissions Mitigation Using Dissimilarity Indices
Eleni KETZAKI, Stavros RALLAKIS, Nikolaos FARMAKIS and Eftichios SARTZETAKIS
5.1. Introduction
5.2. The proposed method
5.2.1. Description of method for individual data
5.2.2. Description of method for grouped data
5.3. Application of method
5.3.1. Application of method for individual data
5.3.2. Application of method for grouped data
5.4. Conclusions
5.5. Appendix
5.6. References

Chapter 6. Maximum Entropy and Distributions of Five-Star Ratings
Yiannis DIMOTIKALIS
6.1. Introduction
6.2. Entropy framework to five-star ratings
6.3. Maximum entropy of ratings for values k = 1, 2, 3, ..., 30
6.3.1. Ratings with two outcomes (k = 1)
6.3.2. Ratings with three outcomes (k = 2)
6.3.3. Ratings with four outcomes (k = 3)
6.3.4. Ratings with five outcomes (k = 4)
6.3.5. Ratings entropy for outcomes k > 4
6.3.6. Maximum entropy constraints for the binomial distribution
6.4. Application to real five-star rating data
6.5. Conclusions
6.6. References

Part 2. The Impact of the Economic and Financial Crisis in Europe

Chapter 7. Access to Credit for SMEs after the 2008 Financial Crisis: The Northern Italian Perspective
Cinzia COLAPINTO and Mariangela ZENGA
7.1. Introduction
7.2. Italian SMEs and access to credit
7.3. The data
7.4. Methodology
7.5. Analysis and discussion
7.5.1. The measure for the Great Recession period (2008–2012)
7.5.2. The measure for the recovery period (2013–2015)
7.5.3. Comparing the two crisis phases
7.6. Conclusion
7.7. References

Chapter 8. Gender-Based Differences in the Impact of the Economic Crisis on Labor Market Flows in Southern Europe
Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU
8.1. Introduction
8.2. Data, methods and limitations
8.3. Results
8.4. Conclusions and discussion
8.5. References

Chapter 9. Measuring Labor Market Transition Probabilities in Europe with Evidence from the EU-SILC
Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU
9.1. Introduction
9.2. Data, methods and limitations
9.3. Results
9.4. Conclusions
9.5. References

Part 3. Student Assessment and Employment in Europe

Chapter 10. Almost Graduated, Close to Employment? Taking into Account the Characteristics of Companies Recruiting at a University Job Placement Office
Franca CRIPPA, Mariangela ZENGA and Paolo MARIANI
10.1. Introduction
10.2. Recruiters and graduates seeking an HEI common ground
10.3. Web survey pitfalls: considerations for data collection
10.4. Sampled recruiters: an outline
10.5. Conclusion
10.6. References

Chapter 11. How Variation of Scores of the Programme for International Student Assessment can be Explained through Analysis of Information
Valérie GIRARDIN, Justine LEQUESNE and Olivier THÉVENON
11.1. Introduction
11.2. Multiplicative models and Zighera’s parameterization
11.3. Application to PISA surveys
11.3.1. Data and variables
11.3.2. Analysis of scores in mathematics
11.3.3. Conclusion
11.4. References

Part 4. Visualization

Chapter 12. A Topological Discriminant Analysis
Rafik ABDESSELAM
12.1. Introduction
12.2. Topological equivalence
12.3. Topological discriminant analysis
12.4. Application example
12.5. Conclusion and perspectives
12.6. Appendix
12.7. References

Chapter 13. Using Graph Partitioning to Calculate PageRank in a Changing Network
Christopher ENGSTRÖM and Sergei SILVESTROV
13.1. Introduction
13.1.1. Computing PageRank
13.2. Changes in personalization vector
13.3. Adding or removing edges between components
13.3.1. Computations in practice
13.3.2. Adding or removing an edge inside a component
13.3.3. Maintaining the component structure
13.4. Conclusions
13.5. References

Chapter 14. Visualizing the Political Spectrum of Germany by Contiguously Ordering the Party Policy Profiles
Andranik TANGIAN
14.1. Introduction
14.2. The model
14.3. Conclusions
14.4. References

List of Authors

Index

Preface

Thanks to the significant work by the authors and contributors, we have developed this book, the second of two volumes. The data analysis field has been continuously growing over recent decades following the wide applications of computing and data collection along with new developments in analytic tools. Hence, the need for publications is evident. New works appear as printed or e-books covering the need for information from all fields of science and engineering thanks to the wide applicability of data analysis and statistics packages.

In this volume, we present the collected material in four parts, including 14 chapters, in a form that will provide the reader with theoretical and applied information on data analysis methods, models and techniques along with appropriate applications. The results of the work in these chapters are used for further study throughout Europe, including the Nordic countries, the V4 states, southern Europe, Germany and the United Kingdom. Other topics include computing, entropy, innovation and quality assurance.

Before the chapters, we include an excellent introductory and review paper titled “50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning” by Gilbert Saporta, a leading expert in the field. The paper was based on the speech given for the celebration of his 70th birthday at the ASMDA2017 International Conference in London (held in De Morgan House of the London Mathematical Society).

The current volume contains the following four parts:

Part 1, Applications, includes six chapters: “Context-specific Independence in Innovation Studies” by Federica Nicolussi and Manuela Cazzaro; “Analysis of the Determinants and Outputs of Innovation in the Nordic Countries” by Cátia Rosário, António Augusto Costa and Ana Lorga da Silva; “Bibliometric Variables Determining the Quality of a Dentistry Journal” by Pilar Valderrama, Manuel Escabias, Evaristo Jiménez-Contreras, Mariano J. Valderrama and Pilar Baca; “Analysis of Dependence among Growth Rates of GDP of V4 Countries Using Four-dimensional Vine Copulas” by Jozef Komorník, Magda Komorníková and Tomáš Bacigál; “Monitoring the Compliance of Countries on Emissions Mitigation Using Dissimilarity Indices” by Eleni Ketzaki, Stavros Rallakis, Nikolaos Farmakis and Eftichios Sartzetakis; and “Maximum Entropy and Distributions of Five-Star Ratings” by Yiannis Dimotikalis.

Part 2, The Impact of the Economic and Financial Crisis in Europe, contains one chapter about credit: “Access to Credit for SMEs after the 2008 Financial Crisis: The Northern Italian Perspective” by Cinzia Colapinto and Mariangela Zenga. This is followed by two chapters on the labor market: “Gender-Based Differences in the Impact of the Economic Crisis on Labor Market Flows in Southern Europe” and “Measuring Labor Market Transition Probabilities in Europe with Evidence from the EU-SILC”, both by Maria Symeonaki, Maria Karamessini and Glykeria Stamatopoulou.

Part 3, Student Assessment and Employment in Europe, opens with an article, related to Part 2, on university students who are about to graduate and hence are close to employment: “Almost Graduated, Close to Employment? Taking into Account the Characteristics of Companies Recruiting at a University Job Placement Office” by Franca Crippa, Mariangela Zenga and Paolo Mariani. It is followed by a paper on how students are assessed: “How Variation of Scores of the Programme for International Student Assessment Can be Explained through Analysis of Information” by Valérie Girardin, Justine Lequesne and Olivier Thévenon.

Part 4, Visualization, examines this topic in computing: “A Topological Discriminant Analysis” by Rafik Abdesselam, followed by “Using Graph Partitioning to Calculate PageRank in a Changing Network” by Christopher Engström and Sergei Silvestrov, and in politics: “Visualizing the Political Spectrum of Germany by Contiguously Ordering the Party Policy Profiles” by Andranik Tangian.

We would like to thank the authors of and contributors to this book. We pass on our sincere appreciation to the referees for their hard work and dedication in helping to improve the book. Finally, we express our thanks to the secretariat and, of course, the publishers.

December 2018
Christos H. SKIADAS, Athens, Greece
James R. BOZEMAN, Bormla, Malta

Introduction

50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning

Chapter written by Gilbert SAPORTA.

In 1962, J.W. Tukey wrote his famous paper “The Future of Data Analysis” and promoted exploratory data analysis (EDA), a set of simple techniques conceived to let the data speak, without prespecified generative models. In the same spirit, J.P. Benzécri and many others developed multivariate descriptive analysis tools. Since that time, many generalizations have occurred, but the basic methods (SVD, k-means, etc.) are still incredibly efficient in the Big Data era. On the other hand, algorithmic modeling or machine learning is successful in predictive modeling, the goal being accuracy and not interpretability. Supervised learning proves in many applications that it is not necessary to understand, when one needs only predictions. However, considering some failures and flaws, we advocate that a better understanding may improve prediction. Causal inference for Big Data is probably the challenge of the coming years.

It is a little presumptuous to want to make a panorama of 50 years of data analysis, while David Donoho (2017) has just published a paper entitled “50 Years of Data Science”. But 1968 is the year when I began my studies as a statistician and I would very much like to talk about the debates of the time and the digital revolution that profoundly transformed statistics and which I witnessed. The terminology followed this evolution–revolution: from data analysis to data mining and then to data science, while we went from a time when the asymptotics began at 30 observations with a few variables to the era of Big Data and high dimension.

I.1. The revolt against mathematical statistics

Since the 1960s, the availability of data has led to an international movement back to the sources of statistics (“let the data speak”) and to sometimes fierce criticisms of an abusive formalization. Along with John Tukey, who was cited above, here is a portrait gallery of some notorious protagonists in the United States, France, Japan, the Netherlands and Italy (for a color version of this figure, see www.iste.co.uk/skiadas/data2.zip).

John Wilder Tukey (1915–2000)

Jean-Paul Benzécri (1932–)

Chikio Hayashi (1918–2002)

Jan de Leeuw (1945–)

J. Douglas Carroll (1939–2011)

Carlo Lauro (1943–)

And an anthology of quotes:

He (Tukey) seems to identify statistics with the grotesque phenomenon generally known as mathematical statistics and find it necessary to replace statistics by data analysis. (Anscombe 1967)

Statistics is not probability; under the name of mathematical statistics was built a pompous discipline based on theoretical assumptions that are rarely met in practice. (Benzécri 1972)

The models should follow the data, not vice versa. (Benzécri 1972)

The use of the computer implies the abandonment of all the techniques designed before the advent of computing. (Benzécri 1972)

Statistics is intimately connected with science and technology, and few mathematicians have experience or understanding of methods of either. This I believe is what lies behind the grotesque emphasis on significance tests in statistics courses of all kinds; a mathematical apparatus has been erected with the notions of power, uniformly most powerful tests, uniformly most powerful unbiased tests, etc., and this is taught to people, who, if they come away with no other notion, will remember that statistics is about significant differences […]. The apparatus on which their statistics course has been constructed is often worse than irrelevant – it is misleading about what is important in examining data and making inferences. (Nelder 1985)

Data analysis was basically descriptive and non-probabilistic, in the sense that no reference was made to the data-generating mechanism. Data analysis favors algebraic and geometrical tools of representation and visualization.

This movement has resulted in conferences, especially in Europe. In 1977, E. Diday and L. Lebart initiated a series entitled Data Analysis and Informatics, and in 1981, J. Janssen was at the origin of the biennial ASMDA conferences (Applied Stochastic Models and Data Analysis), which are still continuing.

The principles of data analysis inspired those of data mining, which developed in the 1990s on the border between databases, information technology and statistics. Fayyad (1995) is credited with the following definition: “Data Mining is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”.
Hand et al. (2000) gave a more precise definition: “I shall define Data Mining as the discovery of interesting, unexpected, or valuable structures in large data sets”.

The metaphor of data mining means that there are treasures (or nuggets) hidden under mountains of data, which may be discovered by specific tools. Data mining is generally concerned with data which were collected for another purpose: it is a secondary analysis of databases that are collected not primarily for analysis, but for the management of individual cases. Data mining is not concerned with efficient methods for collecting data such as surveys and experimental designs (Hand et al. 2000).

I.2. EDA and unsupervised methods for dimension reduction

Essentially, exploratory methods of data analysis are dimension reduction methods: unsupervised classification or clustering methods operate on the number of statistical units, whereas factorial methods reduce the number of variables by searching for linear combinations associated with new axes of the space of individuals.

I.2.1. The time of syntheses

It was quickly realized that all the methods looking for eigenvalues and eigenvectors of matrices related to the dispersion of a cloud (total or intra) or of correlation matrices could be expressed as special cases of certain techniques.

Correspondence analyses (single and multiple) and canonical discriminant analysis are particular principal component analyses. It suffices to extend the classical Principal Components Analysis (PCA) by weighting the units and introducing metrics. The duality scheme introduced by Cailliez and Pagès (1976) is an abstract way of representing the relationships between arrays, matrices and associated spaces. The paper by De la Cruz and Holmes (2011) brought it back to light.

From another point of view (Bouroche and Saporta 1983), the main factorial methods, PCA and Multiple Correspondence Analysis (MCA), as well as multiple regression, are particular cases of canonical correlation analysis.

Another synthesis comes from the generalization of canonical correlation analysis to several groups of variables introduced by J.D. Carroll (1968). Given $p$ blocks of variables $X_j$, we look for components $z$ maximizing the following criterion:

$$\sum_{j=1}^{p} R^2(z, X_j).$$

The extension of this criterion in the form $\max_Y \sum_{j=1}^{p} \Phi(Y, X_j)$, where $\Phi$ is an adequate measure of association, leads to the maximum association principle (Tenenhaus 1977; Marcotorchino 1986; Saporta 1988), which also includes the case of k-means partitioning.
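Carroll's criterion can be illustrated numerically. The sketch below is not taken from the chapter: it assumes centered blocks of full column rank, and the function names are hypothetical. Under those assumptions, the maximizing component $z$ is the leading eigenvector of the sum of the orthogonal projectors onto the column spaces of the blocks, and the leading eigenvalue equals $\sum_j R^2(z, X_j)$.

```python
import numpy as np

def generalized_canonical_analysis(blocks):
    """Return (z, criterion): the first component maximizing sum_j R^2(z, X_j).

    Assumes every block has full column rank (hypothetical helper, for illustration).
    """
    n = blocks[0].shape[0]
    projector_sum = np.zeros((n, n))
    for X in blocks:
        Xc = X - X.mean(axis=0)          # center each block
        Q, _ = np.linalg.qr(Xc)          # orthonormal basis of its column space
        projector_sum += Q @ Q.T         # orthogonal projector onto that space
    eigvals, eigvecs = np.linalg.eigh(projector_sum)
    return eigvecs[:, -1], eigvals[-1]   # eigh sorts eigenvalues in ascending order

def r_squared(z, X):
    """Multiple R^2 of z regressed on the (centered) columns of X."""
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    coef = np.linalg.lstsq(Xc, zc, rcond=None)[0]
    fitted = Xc @ coef
    return float(fitted @ fitted / (zc @ zc))

# Synthetic data, purely illustrative: three blocks with 2, 3 and 4 variables.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(50, k)) for k in (2, 3, 4)]
z, criterion = generalized_canonical_analysis(blocks)
direct = sum(r_squared(z, X) for X in blocks)
print(criterion, direct)  # the two values agree up to rounding
```

The criterion is bounded above by the number of blocks, a bound reached only when a single component lies in every block's column space.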

The PLS approach to structural equation modeling also provides a global framework for many linear methods, as has been shown by Tenenhaus (1999) and Tenenhaus and Tenenhaus (2011).

Criterion | Analysis
$\max \sum_{j=1}^{p} r^2(c, x_j)$ with $x_j$ numerical | PCA
$\max \sum_{j=1}^{p} \eta^2(c, x_j)$ with $x_j$ categorical | MCA
$\max \sum_{j=1}^{p} R^2(c, X_j)$ with $X_j$ data set | GCA (Carroll)
$\max \sum_{j=1}^{p} \mathrm{Rand}(Y, x_j)$ with $Y$ and $x_j$ categorical | Central partition
$\max \sum_{j=1}^{p} \tau(y, x_j)$ with rank orders | Condorcet aggregation rule

Table I.1. Various cases of the maximum association principle
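The first row of Table I.1 can be checked numerically. The sketch below uses synthetic data and hypothetical names. It verifies that, for standardized numerical variables, the first principal component $c$ attains $\sum_j r^2(c, x_j) = \lambda_1$, the largest eigenvalue of the correlation matrix, and that an arbitrary linear combination does no better.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated variables (illustrative only).
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardized variables

R = np.corrcoef(X, rowvar=False)               # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)           # ascending eigenvalues
c = Z @ eigvecs[:, -1]                         # first principal component

def sum_sq_corr(component, data):
    """Sum over variables of the squared correlation with the component."""
    return sum(np.corrcoef(component, data[:, j])[0, 1] ** 2
               for j in range(data.shape[1]))

best = sum_sq_corr(c, X)
other = sum_sq_corr(Z @ rng.normal(size=5), X)  # an arbitrary competitor
print(best, eigvals[-1], other)
```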

I.2.2. The time of clusterwise methods

The search for partitions in k classes of a set of units belonging to a Euclidean space is most often done using the k-means algorithm: this method converges very quickly, even for large sets of data, but not necessarily toward the global optimum. Under the name of dynamic clustering, Diday (1971) has proposed multiple extensions, where the representatives of classes can be groups of points, varieties, etc. The simultaneous search for k classes and local models by alternating k-means and modeling is a geometric and non-probabilistic way of addressing mixture problems. Clusterwise regression is the best-known case: in each class, a regression model is fitted and the assignment to the classes is done according to the best model.

Clusterwise methods allow for non-observable heterogeneity and are particularly useful for large data sets where the relevance of a simple and global model is questionable. In the 1970s, Diday and his collaborators developed “typological” approaches for most linear techniques: PCA, regression (Charles 1977), discrimination. These methods are again the subject of numerous publications in association with functional data (Preda and Saporta 2005), symbolic data (de Carvalho et al. 2010) and in multiblock cases (De Roover et al. 2012; Bougeard et al. 2017).
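The alternation described above, fitting one regression per class and then reassigning each unit to the model that fits it best, can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any cited paper: the function name, the random initial partition and the synthetic two-line data are all hypothetical. Like k-means, the scheme never increases the total squared error but may stop at a local optimum.

```python
import numpy as np

def clusterwise_regression(X, y, k, n_iter=50, seed=0):
    """Alternate least-squares fits per class and reassignment to the best model."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])       # design matrix with intercept
    labels = rng.integers(0, k, size=n)        # random initial partition
    coefs = np.zeros((k, A.shape[1]))
    history = []                               # total squared error per iteration
    for _ in range(n_iter):
        # Modeling step: one regression per non-empty class.
        for g in range(k):
            members = labels == g
            if members.any():
                coefs[g] = np.linalg.lstsq(A[members], y[members], rcond=None)[0]
        # Assignment step: each unit goes to the class whose model fits it best.
        sq_res = (y[:, None] - A @ coefs.T) ** 2
        new_labels = sq_res.argmin(axis=1)
        history.append(float(sq_res.min(axis=1).sum()))
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, coefs, history

# Synthetic mixture of two lines (illustrative only).
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=200)
group = rng.integers(0, 2, size=200)
y = np.where(group == 0, 1 + 2 * x, 4 - 3 * x) + rng.normal(scale=0.05, size=200)
labels, coefs, history = clusterwise_regression(x[:, None], y, k=2)
print(coefs)      # one (intercept, slope) pair per class
print(history)    # non-increasing sequence of total squared errors
```

In practice, several random restarts are usually run and the partition with the smallest final error is kept.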


I.2.3. Extensions to new types of data

I.2.3.1. Functional data

Jean-Claude Deville (1974) showed that the Karhunen–Loève decomposition was nothing other than the PCA of the trajectories of a process, opening the way to functional data analysis (Ramsay and Silverman 1997). The number of variables being uncountably infinite, the notion of linear combination to define a principal component is extended to the integral $\xi = \int_0^T f(t) X_t \, dt$, $f(t)$ being an eigenfunction of the covariance operator:

$$\int_0^T C(t, s) f(s) \, ds = \lambda f(t).$$
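The eigenproblem of the covariance operator can be approximated on a grid by replacing the integral with a quadrature sum. The sketch below is only an illustration: it assumes a Brownian motion on $[0, 1]$, whose covariance $C(t, s) = \min(t, s)$ is not discussed in the text, because its eigenvalues are known in closed form, $\lambda_k = 1 / ((k - 1/2)^2 \pi^2)$, which gives a reference value for the discretization.

```python
import numpy as np

n, T = 200, 1.0
h = T / n                                   # quadrature weight (uniform grid)
t = h * np.arange(1, n + 1)                 # grid points t_1, ..., t_n
C = np.minimum.outer(t, t)                  # assumed covariance: Brownian motion

# Discretized operator: (h * C) f ~ integral of C(t, s) f(s) ds
eigvals, eigvecs = np.linalg.eigh(h * C)
order = np.argsort(eigvals)[::-1]           # sort in decreasing order
lam = eigvals[order]
f = eigvecs[:, order] / np.sqrt(h)          # rescale so that integral of f^2 dt ~ 1

# Principal component scores xi = integral of f(t) X_t dt for simulated paths.
rng = np.random.default_rng(0)
paths = np.cumsum(rng.normal(scale=np.sqrt(h), size=(500, n)), axis=1)
xi = h * paths @ f[:, 0]

print(lam[0], 4 / np.pi**2)                 # discretized vs. exact first eigenvalue
print(xi.var())                             # empirical variance, close to lam[0]
```

With real trajectories, $C$ would be replaced by the empirical covariance of the sampled curves, and the same discretization applies.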

Deville and Saporta (1980) then extended functional PCA to correspondence analysis of trajectories of a categorical process.

The dimension reduction offered by PCA makes it possible to solve the problem of regression on trajectories, a problem that is ill posed since the number of observations is smaller than the infinite number of variables. PLS regression, however, is better adapted in the latter case and makes it possible to deal with supervised classification problems (Costanzo et al. 2006).

I.2.3.2. Symbolic data analysis

Diday is at the origin of many works that have made it possible to extend almost all methods of data analysis to new types of data, called symbolic data. This is the case, for example, when the cell i, j of a data table is no longer a number, but an interval or a distribution. See Table I.2 for an example of a table of symbolic data (from Billard and Diday 2006).

$w_u$ | Court Type | Player Weight | Player Height | Racket Tension
$w_1$ | Hard | [65, 86] | [1.78, 1.93] | [14, 99]
$w_2$ | Grass | [65, 83] | [1.80, 1.91] | [26, 99]
$w_3$ | Indoor | [65, 87] | [1.75, 1.93] | [14, 99]
$w_4$ | Clay | [68, 84] | [1.75, 1.93] | [24, 99]

Table I.2. An example of interval data
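As a small illustration of how such a table might be held in memory, the sketch below stores each cell of Table I.2 as a (lower, upper) pair in a three-dimensional array. The summary by centers and half-ranges is one common simplification, assumed here for illustration; it is not presented in the text as the method of symbolic data analysis.

```python
import numpy as np

units = ["Hard", "Grass", "Indoor", "Clay"]            # court types, w_1 to w_4
variables = ["Player Weight", "Player Height", "Racket Tension"]

# intervals[i, j] = (lower bound, upper bound) of variable j for unit i
intervals = np.array([
    [[65, 86], [1.78, 1.93], [14, 99]],
    [[65, 83], [1.80, 1.91], [26, 99]],
    [[65, 87], [1.75, 1.93], [14, 99]],
    [[68, 84], [1.75, 1.93], [24, 99]],
])

centers = intervals.mean(axis=2)                        # interval midpoints
half_ranges = (intervals[:, :, 1] - intervals[:, :, 0]) / 2

for unit, c, r in zip(units, centers, half_ranges):
    print(unit, dict(zip(variables, zip(c, r))))
```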

I.2.3.3. Textual data

Correspondence analysis and classification methods were, very early, applied to the analysis of document-term and open-text tables (refer to Lebart et al. 1998 for a full presentation). Text analysis is now part of the vast field of text mining or text analytics.


I.2.4. Nonlinear data analysis

Dauxois and Pousse (1976) extended principal component analysis and canonical analysis to Hilbert spaces. By simplifying their approach, instead of looking for linear combinations of maximum variance as in PCA, $\max V\left(\sum_{j=1}^{p} a_j x_j\right)$ subject to $\|a\| = 1$, we look for separate nonlinear transformations $\Phi_j$ of each variable maximizing $V\left(\sum_{j=1}^{p} \Phi_j(x_j)\right)$. This is equivalent to maximizing the sum of the squares of the correlation coefficients between the principal component $c$ and the transformed variables, $\sum_{j=1}^{p} \rho^2\left(c, \Phi_j(x_j)\right)$, which is once again an illustration of the maximum association principle.

With a finite number of observations n, this is an ill-posed problem, and we need to restrict the set of transformations $\Phi_j$ to finite dimension spaces. A classical choice is to use spline functions as in Besse (1988). The search for optimal transformations has been the subject of work by the Dutch school, summarized in the book published by Gifi (1999).

Separate transformations are called semilinear. A different attempt to obtain “truly” nonlinear transformations is kernelization. In line with the work of V. Vapnik, Schölkopf et al. (1998) defined a nonlinear PCA in the following manner, where the entire vector $x = (x_1, x_2, \ldots, x_p)$ is transformed. Each point of the space of individuals E is transformed into a point in a space Φ(E) called extended space (or feature space) provided with a dot product. The dimension of Φ(E) can be very large and the notion of variable is lost. A metric multidimensional scaling is then performed on the transformed points according to the Torgerson method, which is equivalent to the PCA in Φ(E). Everything depends on the choice of the scalar product in Φ(E): if we take a scalar product that is easily expressed according to the scalar product of E, it is no longer necessary to know the transformation Φ, which is then implicit. All calculations are done in dimension n. This is the “kernel trick”.

Let $k(x, y)$ be a dot product in Φ(E) and $\langle x, y \rangle$ the dot product of E. We then replace the usual Torgerson’s matrix W by a matrix where each element is $k(x, y)$, then doubly center W in rows and columns: its eigenvectors are the principal components in Φ(E).
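The procedure just described (build the matrix of kernel values, doubly center it, take its eigenvectors) can be sketched in a few lines. The function names, the Gaussian (RBF) kernel and the synthetic data are assumptions made for illustration. Scaling each eigenvector by the square root of its eigenvalue is one common convention for the scores. With the linear kernel $k(x, y) = \langle x, y \rangle$, the sketch reproduces ordinary PCA scores up to sign, which gives a simple sanity check.

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Scores of a kernel PCA from an n x n kernel matrix K."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering operator
    Kc = J @ K @ J                             # double centering (rows and columns)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)    # principal component scores

def rbf_kernel(X, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2), an assumed example of k(x, y)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.clip(d2, 0.0, None))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                   # synthetic data, illustrative only

# Linear kernel: kernel PCA coincides with ordinary PCA (up to sign).
linear_scores = kernel_pca(X @ X.T)
Xc = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(Xc, full_matrices=False)
pca_scores = U[:, :2] * S[:2]

# Nonlinear kernel: the transformation Phi stays implicit.
rbf_scores = kernel_pca(rbf_kernel(X, gamma=0.5))
print(np.allclose(np.abs(linear_scores), np.abs(pca_scores)))
```

All computations involve only the n x n kernel matrix, which is the practical content of the “kernel trick”.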


Once kernel PCA was defined, many works followed, “kernelizing” various methods, such as Fisher discriminant analysis by Baudat and Anouar (2000), found independently under the name of LS-SVM by Suykens and Vandewalle (1999), the PLS regression of Rosipal and Trejo (2001), the unsupervised classification with kernel k-means already proposed by Schölkopf et al. and canonical analysis (Fyfe and Lai 2001). It is interesting to note that most of these developments came not from statisticians but from researchers in artificial intelligence or machine learning.

I.2.5. The time of sparse methods

When the number of dimensions (or variables) is very large, PCA, MCA and other factorial methods lead to results that are difficult to interpret: how to make sense of a linear combination of several hundred or even thousands of variables? The search for the so-called “sparse” combinations limited to a small number of variables, that is, with a large number of zero coefficients, has been the subject of the attention of researchers for about 15 years. The first attempts, requiring for example that the coefficients be equal to –1, 0 or 1, lead to non-convex algorithms that are difficult to use. The transposition to PCA of the LASSO regression of Tibshirani (1996) allowed exact and elegant solutions. Recall that the LASSO consists of performing a regression with an L1 penalty on the coefficients, which makes it possible to easily manage multicollinearity and high dimension:

$$\hat{\beta}_{lasso} = \arg\min_{\beta} \left( \| y - X\beta \|^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right).$$
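To see how the L1 penalty produces exact zeros, the LASSO criterion can be minimized by cyclic coordinate descent with soft thresholding. This is a minimal sketch, not the algorithm of any particular package: the function names, the simulated data and the values of `lam` are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 if |value| <= threshold."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise ||y - X b||^2 + lam * sum_j |b_j| by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that leaves out the contribution of variable j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # Exact one-dimensional minimiser for the criterion written above.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
true_beta = np.array([3.0, -2.0] + [0.0] * 8)       # only two active variables
y = X @ true_beta + 0.1 * rng.normal(size=100)

beta_small = lasso_coordinate_descent(X, y, lam=1.0)   # mild penalty
beta_large = lasso_coordinate_descent(X, y, lam=1e6)   # penalty large enough to zero everything
```

With a mild penalty the two active coefficients are recovered with slight shrinkage; with a very large penalty every coefficient is set exactly to zero.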

Zou et al. (2006) proposed modifying one of the many criteria defining the PCA of a table X: principal components z are such that:

$$\hat{\beta} = \arg\min_{\beta} \| z - X\beta \|^2 + \lambda \| \beta \|^2 + \lambda_1 \| \beta \|_1 .$$

The first constraint, in an L2 norm, only implies that the loadings have to be normalized; the second constraint, in an L1 norm, tunes the sparsity when the Lagrange multiplier λ1 varies. Computationally, we get the solution by alternating, until convergence, an SVD to get the components z when β is fixed, and an elastic net to find β when z is fixed. The positions of the null coefficients are not the same for the different components. The selection of the variables is therefore dimension by dimension. Although interpretability increases, the price is the loss of characteristic properties of PCA, such as the orthogonality of the principal components and/or the loadings. Since then, sparse variants of many methods have been developed, such as sparse PLS by Chun and Keleş (2010), sparse discriminant analysis by Clemmensen et al. (2011), sparse canonical analysis by Witten et al. (2009) and sparse multiple correspondence analysis by Bernard et al. (2012).
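The alternating scheme can be sketched for a single component. This is a simplified illustration, not the full algorithm of Zou et al. (2006): only the first component is computed, in which case the SVD step reduces to a normalization, and the elastic net is solved by a plain coordinate descent. All function names, the simulated data and the values of `lam` and `lam1` are assumptions.

```python
import numpy as np

def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def elastic_net(X, z, lam, lam1, n_sweeps=100):
    """Minimise ||z - X b||^2 + lam * ||b||^2 + lam1 * ||b||_1 by coordinate descent."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = z - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam1 / 2.0) / (col_sq[j] + lam)
    return beta

def sparse_first_component(X, lam, lam1, n_outer=20):
    # Start from the ordinary first principal axis (first right singular vector).
    alpha = np.linalg.svd(X, full_matrices=False)[2][0]
    for _ in range(n_outer):
        z = X @ alpha                            # component for the current axis
        beta = elastic_net(X, z, lam, lam1)      # sparse loadings with z fixed
        direction = X.T @ (X @ beta)             # axis update with beta fixed
        alpha = direction / np.linalg.norm(direction)
    return beta / np.linalg.norm(beta)

# Simulated table: one latent factor carried by the first three of eight variables.
rng = np.random.default_rng(2)
factor = rng.normal(size=(200, 1))
pattern = np.array([[1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
X = factor @ pattern + 0.1 * rng.normal(size=(200, 8))
X = X - X.mean(axis=0)

loadings = sparse_first_component(X, lam=1.0, lam1=20.0)
```

In this simulated setting the loadings of the five noise variables are set exactly to zero, while those of the three variables carrying the factor stay non-zero.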

I.3. Predictive modeling

A narrow view would limit data analysis to unsupervised methods, to use current terminology. Predictive or supervised modeling has evolved in many ways into a conceptual revolution comparable to that of the unsupervised. We have moved from a model-driven approach to a data-driven approach where the models come from the exploration of the data and not from a theory of the mechanism generating observations, thus reaffirming the second principle of Benzécri: “the models should follow the data, not vice versa”. The difference between these two cultures (generative models versus algorithmic models, or models to understand versus models to predict) has been theorized by Breiman (2001), Saporta (2008) and Shmueli (2010), and taken up by Donoho (2017). The meaning of the word model has evolved: from that of a parsimonious and understandable representation centered on the fit to observations (predict the past), we have moved to black-box-type algorithms, whose objective is to forecast new data as precisely as possible (predict the future). The success of machine learning, and especially the renewal of neural networks with deep learning, has been made possible by the increase in computing power, but also and above all by the availability of huge learning databases.

I.3.1. Paradigms and paradoxes

When we ask ourselves what a good model is, we quickly arrive at paradoxes. A generative model that fits well with collective data can provide poor forecasts when trying to predict individual behaviors. The case is common in epidemiology. On the other hand, good predictions can be obtained with uninterpretable models: targeting customers or approving loans does not require a consumer theory. Breiman remarked that simplicity is not always a quality:

Occam’s Razor, long admired, is usually interpreted to mean that simpler is better. Unfortunately in prediction, accuracy and simplicity (interpretability) are in conflict. Modern statistical thinking makes a clear distinction between the statistical model and the world. The actual mechanisms underlying the data are considered unknown. The statistical models do not need to reproduce these mechanisms to emulate the observable data. (Breiman 2001)

Other quotes illustrate these paradoxes:

Better models are sometimes obtained by deliberately avoiding to reproduce the true mechanisms. (Vapnik 2006)

Statistical significance plays a minor or no role in assessing predictive performance. In fact, it is sometimes the case that removing inputs with small coefficients, even if they are statistically significant, results in improved prediction accuracy. (Shmueli 2010)

In a Big Data world, estimation and tests become useless, because everything is significant! For instance, a correlation coefficient equal to 0.002 when the number of observations is $10^6$ is significantly different from 0, but without any interest. Usual distributional models are rejected since small discrepancies between model and data are significant. Confidence intervals have zero length. We should keep in mind the famous sentence of George Box: “All models are wrong, some are useful”.
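The arithmetic behind the correlation example can be checked directly. The sketch below assumes the usual large-sample normal approximation for the test of a zero correlation; the function name is hypothetical.

```python
import math

def correlation_p_value(r, n):
    """Two-sided p-value for H0: rho = 0, using the large-sample normal approximation."""
    # For large n, r * sqrt(n - 2) / sqrt(1 - r^2) is approximately standard normal.
    z = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return math.erfc(abs(z) / math.sqrt(2.0))

p_big = correlation_p_value(0.002, 10**6)    # one million observations
p_small = correlation_p_value(0.002, 10**3)  # one thousand observations
```

With one million observations the test statistic is close to 2 and the p-value is roughly 0.046, just under the conventional 5% level, whereas the same correlation with a thousand observations is nowhere near significance.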

I.3.2. From statistical learning theory to empirical validation

One of the major contributions of the theory of statistical learning developed by Vapnik and Chervonenkis was to give the conditions of generalizability of predictive algorithms and to establish inequalities on the difference between the empirical error of adjustment of a model to observed data and the theoretical error when applying this model to future data from the same unknown distribution. Although the theory is not easy to use, it has given rise to the systematization of the practice of dividing data into three subsets: learning, testing, validation (Hastie et al. 2001). There had been warnings in the past, like that of Paul Horst (1941), who said, “the usefulness of a prediction procedure is not established when it is found to predict adequately on the original sample; the necessary next step must be its application to at least a second group. Only if it predicts adequately on subsequent samples can the value of the procedure be regarded as established”, and the introduction of cross-validation by Lachenbruch and Mickey (1968) and Stone (1974). But it is only recently that the use of validation and test samples has become widespread and an essential step for any data scientist. However, there is still room for improvement if we go through the publications of certain areas where prediction is rarely checked on a hold-out sample.
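The practice of dividing data into three subsets can be sketched as follows. The proportions (60%, 20%, 20%), the function name and the seed are assumptions chosen for illustration; no particular split is prescribed in the text.

```python
import random

def three_way_split(n_obs, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition observation indices into learning, validation and test sets."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    n_learn = int(fractions[0] * n_obs)
    n_valid = int(fractions[1] * n_obs)
    learn = indices[:n_learn]
    valid = indices[n_learn:n_learn + n_valid]
    test = indices[n_learn + n_valid:]       # whatever remains is held out
    return learn, valid, test

learn, valid, test = three_way_split(1000)
```

The model is fitted on the learning set, tuned on the validation set, and its performance is reported only on the test set, which plays the role of the “second group” in Horst’s warning.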


I.3.3. Challenges

Supervised methods have become a real technology governed by the search for efficiency. There is now a wealth of methods, especially for binary classification: SVM, random forests, gradient boosting, neural networks, to name a few. Ensemble methods are superimposed to combine them (see Noçairi et al. 2016). Feature engineering consists of constructing a large number of new variables as functions of those observed and choosing the most relevant ones. While in some cases the gains over conventional methods are spectacular, this is not always the case, as noted by Hand (2006).

Software has become more and more available: in 50 years, we have moved from the era of large, expensive commercial systems (SAS, SPSS) to the distribution of free open source packages like R and scikit-learn. The benefits are immense for the rapid dissemination of new methods, but the user must be careful about the choice and often the lack of validation and quality control of many packages: it is not always clear if user-written packages are really doing what they claim to be doing. Hornik (2012) has already wondered if there are not too many R packages.

Ten years ago, in a resounding article, Anderson (2008) prophesied the end of theory because “the data deluge makes the scientific method obsolete”. In a provocative manner, he wrote “Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot”. This was, of course, misleading, and the setbacks of Google’s epidemic influenza forecasting algorithm provided a rebuttal (Lazer et al. 2014). Correlation is not causality and drawing causal inference from observational data has always been a tricky problem. As Box et al. (1978) put it, “To find out what happens when you change something, it is necessary to change it”. The best way to answer causal questions is usually to run an experiment. Drawing causal inference from Big Data is now a hot topic (see Bottou et al. 2013; Varian 2016).

Quantity is not quality and massive data can be biased and lead to unfortunate decisions reproducing the a priori assumptions that led to their collection. Many examples have been discovered related to discrimination or presuppositions about gender or race. More generally, the treatment of masses of personal data raises ethical and privacy issues when consent has not been gathered or has not been sufficiently explained. Books for the general public such as Keller and Neufeld (2014) and O’Neil (2016) have echoed this.


I.4. Conclusion

The past 50 years have been marked by dramatic changes in statistics. The ones that will follow will be no less formidable. The Royal Statistical Society is not afraid to write in its Data Manifesto “What steam was to the 19th century, and oil has been to the 20th, data is to the 21st”. The principles and methods of data analysis are still relevant, and exploratory (unsupervised) and predictive (supervised) analysis are two sides of the same approach. But as correlation is not enough, causal inference could be the new frontier and could go beyond the paradox of predicting without understanding by moving toward understanding in order to better predict, and act to change. As the job of the statistician or data scientist becomes more exciting, we believe that it will have to be accompanied by an awareness of social responsibility.

I.5. References

Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. http://www.wired.com/2008/06/pb-theory/.
Baudat, G., Anouar, F. (2000). Generalized discriminant analysis using a kernel approach. Neural Comput., 12(10), 2385–2404.
Bernard, A., Guinot, C., Saporta, G. (2012). Sparse principal component analysis for multiblock data and its extension to sparse multiple correspondence analysis. In: Proc. of 20th Int. Conference on Computational Statistics (COMPSTAT 2012), Colubi, A., Fokianos, K., Gonzalez-Rodriguez, G., Kontoghiorghes, E. (eds). International Statistical Institute (ISI), 99–106.
Besse, P. (1988). Spline functions and optimal metric in linear principal components analysis. In: Components and Correspondence Analysis, Van Rijckevorsel et al. (eds). John Wiley & Sons, New York.
Billard, L., Diday, E. (2012). Symbolic Data Analysis: Conceptual Statistics and Data Mining. John Wiley & Sons, Chichester.
Bottou, L. et al. (2013). Counterfactual reasoning and learning systems: The example of computational advertising. J. Machine Learn. Res., 14, 3207–3260.
Bougeard, S., Abdi, H., Saporta, G., Niang Keita, N. (2018). Clusterwise analysis for multiblock component methods. Advances in Data Analysis and Classification, 12(2), 285–313.
Box, G., Hunter, J.S., Hunter, W.G. (1978). Statistics for Experimenters. John Wiley & Sons, New York.
Breiman, L. (2001). Statistical modeling: The two cultures. Statist. Sci., 16(3), 199–231.
Cailliez, F., Pagès, J.P. (1976). Introduction à l’analyse des données. Smash, Paris.
Carroll, J.D. (1968). Generalisation of canonical correlation analysis to three or more sets of variables. Proc. 76th Annual Convention Am. Psychol. Assoc., 3, 227–228.
Chun, H., Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Statist. Soc. B, 72, 3–25.
Clemmensen, L., Hastie, T., Ersboell, K. (2011). Sparse discriminant analysis. Technometrics, 53(4), 406–413.
Costanzo, D., Preda, C., Saporta, G. (2006). Anticipated prediction in discriminant analysis on functional data for binary response. In: COMPSTAT'06, A. Rizzi (ed.). Physica-Verlag, 821–828.
De Roover, K., Ceulemans, E., Timmerman, M.E., Vansteelandt, K., Stouten, J., Onghena, P. (2012). Clusterwise simultaneous component analysis for analyzing structural differences in multivariate multiblock data. Psychol. Methods, 17(1), 100–119.
De la Cruz, O., Holmes, S.P. (2011). The Duality Diagram in Data Analysis: Examples of Modern Applications. Ann. Appl. Statist., 5(4), 2266–2277.
Deville, J.C. (1974). Méthodes statistiques et numériques de l’analyse harmonique. Ann. l’INSEE, 15, 3–101.
Deville, J.C., Saporta, G. (1980). Analyse harmonique qualitative. In: Data Analysis and Informatics, E. Diday (ed.). North-Holland, Amsterdam, 375–389.
Diday, E. (1974). Introduction à l’analyse factorielle typologique. Revue Statist. Appl., 22(4), 29–38.
Donoho, D. (2017). 50 Years of Data Science. J. Comput. Graph. Statist., 26(4), 745–766.
Friedman, J.H. (2001). The Role of Statistics in the Data Revolution? Int. Statist. Rev., 69(1), 5–10.
Fyfe, C., Lai, P.L. (2001). Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst., 10, 365–374.
Gifi, A. (1990). Non-linear multivariate analysis. John Wiley & Sons, New York.
Hand, D., Blunt, G., Kelly, M., Adams, N. (2000). Data mining for fun and profit. Statist. Sci., 15(2), 111–126.
Hand, D. (2006). Classifier Technology and the Illusion of Progress. Statist. Sci., 21(1), 1–14.
Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical Learning. Springer, New York.
Hornik, K. (2012). Are There Too Many R Packages? Aust. J. Statist., 41(1), 59–66.
Keller, M., Neufeld, J. (2014). Terms of Service: Understanding Our Role in the World of Big Data. Al Jazeera America, http://projects.aljazeera.com/2014/terms-of-service/#1.
Lazer, D., Kennedy, R., King, G., Vespignani, A. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176), 1203–1205.
Lebart, L., Salem, A., Berry, L. (1998). Exploring Textual Data. Kluwer Academic Publisher, Dordrecht, The Netherlands.
Marcotorchino, F. (1986). Maximal association as a tool for classification. In: Classification as a tool for research, Gaul & Schader (eds). North Holland, Amsterdam, 275–288.
Nelder, J.A. (1985). Discussion of Chatfield, C., The initial examination of data. J. R. Statist. Soc. A, 148, 214–253.
Noçairi, H., Gomes, C., Thomas, M., Saporta, G. (2016). Improving Stacking Methodology for Combining Classifiers; Applications to Cosmetic Industry. Electronic J. Appl. Statist. Anal., 9(2), 340–361.
O’Neil, C. (2016). Weapons of Math Destruction. Crown, New York.
Ramsay, J.O., Silverman, B. (1997). Functional data analysis. Springer, New York.
Rosipal, R., Trejo, L. (2001). Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. J. Machine Learn. Res., 2, 97–123.
Saporta, G. (1988). About maximal association criteria in linear analysis and in cluster analysis. In: Classification and Related Methods of Data Analysis, H.H. Bock (ed.). North-Holland, Amsterdam, 541–550.
Saporta, G. (2008). Models for understanding versus models for prediction. In: Compstat Proceedings, P. Brito (ed.). Physica Verlag, Heidelberg, 315–322.
Schölkopf, B., Smola, A., Müller, K.-R. (1998). Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Comput., 10(5), 1299–1319.
Shmueli, G. (2010). To explain or to predict? Statist. Sci., 25, 289–310.
Suykens, J.A.K., Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Process. Lett., 9(3), 293–300.
Tenenhaus, M. (1977). Analyse en composantes principales d’un ensemble de variables nominales ou numériques. Revue Statist. Appl., 25(2), 39–56.
Tenenhaus, M. (1999). L’approche PLS. Revue Statist. Appl., 47(2), 5–40.
Tenenhaus, A., Tenenhaus, M. (2011). Regularized Generalized Canonical Correlation Analysis. Psychometrika, 76(2), 257–284.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. B, 58, 267–288.
Tukey, J.W. (1962). The Future of Data Analysis. Ann. Math. Statist., 33(1), 1–67.
Vapnik, V. (2006). Estimation of Dependences Based on Empirical Data, 2nd edition. Springer, New York.
Varian, H. (2016). Causal inference in economics and marketing. Proc. Natl. Acad. Sci., 113, 7310–7315.
Witten, D., Tibshirani, R., Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3), 515–534.
Zou, H., Hastie, T., Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist., 15, 265–286.

PART 1

Applications

1 Context-specific Independence in Innovation Study

The study of (in)dependence relationships among a set of categorical variables collected in a contingency table is a broad topic. In this chapter, we focus on the so-called context-specific (CS) independence, where the conditional independence holds only in a subspace of the outcome space. The main aspects that we introduce concern the definition, in the same model, of marginal, conditional and CS independencies through the marginal models. Furthermore, we investigate how it is possible to test these CS independencies when there are ordinal variables. Finally, we propose a graphical representation of all the considered independencies, taking advantage of the chain graph model (CGM). We show the results of an application on “The Italian Innovation Survey” (Istat 2012).

1.1. Introduction

In the field of categorical variables, the term CS independence refers to a conditional independence that holds only for some modalities of the variable(s) in the conditioning set, but not for all. That is, given three variables X1, X2 and X3, we describe this situation as X1 ⊥ X2 | X3 = c3, where c3 ranges over a subset of all possible values of X3. Among others, Højsgaard (2004) and Nyman et al. (2016) studied this topic in detail. In this chapter, we extend the main results of these works by dealing with CS independencies concerning subsets of all the considered variables, including ordinal ones. For this aim, we use the hierarchical multinomial marginal models (HMMMs) (Bartolucci et al. 2007; Cazzaro and Colombi 2014). This parameterization is needed because we want a model in which marginal and conditional independencies can be tested simultaneously. In addition, it uses local logits evaluated on different marginal contingency tables in order to take into account the ordered modalities of the CS conditioning variables.

Chapter written by Federica NICOLUSSI and Manuela CAZZARO.

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


This chapter is organized as follows. In section 1.2, we introduce the constraints to impose on the HMMM in order to also represent CS independencies. The proposed model is also represented through a stratified chain graph model (SCGM), an extension of the stratified graphical model proposed by Nyman et al. (2016), which uses a chain graph model (CGM) to represent the classical conditional independencies and labeled arcs in the graph to denote CS independencies. The details are explained in section 1.3. Finally, we analyze a real data set, “The Italian Innovation Survey” (Istat 2012), in order to investigate the effect of innovation in different aspects of small and medium Italian enterprises on growth in revenue terms. The procedure and the results are discussed in section 1.4. In section 1.5, we summarize the main results of this work and outline future research.

1.2. Parametrization for CS independencies

Let us consider q categorical variables (X1, ..., Xq) taking values (i1, ..., iq) in the contingency table I = (n1 × ... × nq), where the modality ij of the generic variable Xj takes values in Ij. A parametrization of a model able to capture marginal and conditional independencies among non-ordinal variables comes through the marginal model (Bergsma and Rudas 2002), which defines the classical log-linear parameters on marginal distributions by respecting certain properties of completeness and hierarchy. The marginal parameters are $\eta_L^M(i_L)$, where M refers to the marginal set, L denotes the subset of variables to which the parameter pertains and $i_L$, in parentheses, the modalities of the variables selected in L (when the parentheses are omitted, this means that the parameters refer to each $i_L \in I_L$). The following example shows how to define the marginal parameters in order to describe a conditional independence.

Example 1.1. Let us consider a set of four variables, say X1, X2, X3 and X4, and suppose we are interested in describing the independence X1 ⊥ X2 | X3. For this aim, we have to define the marginal sets {(1, 2, 3), (1, 2, 3, 4)}, where (1, 2, 3, 4) is a shortcut for (X1 X2 X3 X4). Then, we define the classical log-linear parameters on the contingency table $I_{1,2,3}$ restricted to (1, 2, 3) and the remaining parameters on the unrestricted contingency table I. Finally, we have to constrain to zero the parameters associated with the statement of independence, $\eta_{1,2}^{1,2,3}$ and $\eta_{1,2,3}^{1,2,3}$.

Now, let us collect four subsets of variables, say A, B, C and D. As we mentioned, our aim is to find a parametrization able to describe, beyond the classical statements of conditional independence, the following statement of CS independence, formally:

$$A \perp B \mid (C = i_C, D), \qquad i_C \in K \qquad [1.1]$$

where $i_C$ is the vector of certain modalities of the variables in C, taking values in K, a subset of the modalities of C ($I_C$) for which the conditional independence holds. The independence in formula [1.1] holds if the marginal log-linear parameters satisfy the following constraints:

$$\sum_{v \in V} \sum_{c \in P(C)} \eta_{vc}^{M}(i_v i_c) = 0, \qquad i_v \in I_v, \quad i_c \in K \qquad [1.2]$$

where P(·) denotes the power set, V = {(P(A) \ ∅) ∪ (P(B) \ ∅) ∪ P(D)} and K is a subset of the modalities of C ($I_C$) for which the CS independence holds.

Example 1.2. (Recall Example 1.1) Let us suppose that we want to define, through the marginal model, the CS independence X1 ⊥ X2 | (X3, X4 = i4), with i4 ∈ K, where K ⊆ I4 is a subset of the modalities i4 of X4 for which the conditional independence holds. The constraints on the marginal parameters will be in this case

$$\eta_{1,2}^{1,2,3,4}(i_1 i_2) + \eta_{1,2,3}^{1,2,3,4}(i_1 i_2 i_3) + \eta_{1,2,4}^{1,2,3,4}(i_1 i_2 i_4) + \eta_{1,2,3,4}^{1,2,3,4}(i_1 i_2 i_3 i_4) = 0, \qquad i_1 \in I_1, \; i_2 \in I_2, \; i_3 \in I_3, \; i_4 \in K.$$
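A small numerical illustration may help to fix the idea of CS independence. The sketch below is a simplification of the examples above, with only three variables: two binary variables X1 and X2 and a conditioning variable X3 with three modalities. The conditional tables are hypothetical. For binary variables with baseline coding, the log odds ratio of X1 and X2 within a stratum of X3 equals the sum of the interaction parameters involving the pair (X1, X2) at that stratum, so a zero log odds ratio in a stratum corresponds to the kind of constraint imposed above.

```python
import math

# Hypothetical conditional tables P(X1, X2 | X3 = c), one 2 x 2 table per modality c of X3.
conditional_tables = {
    1: [[0.12, 0.28], [0.18, 0.42]],   # product of margins (0.4, 0.6) and (0.3, 0.7)
    2: [[0.40, 0.10], [0.10, 0.40]],   # X1 and X2 are associated in this stratum
    3: [[0.10, 0.10], [0.40, 0.40]],   # product of margins (0.2, 0.8) and (0.5, 0.5)
}

def log_odds_ratio(table):
    """Log odds ratio of a 2 x 2 table; it is zero exactly under independence."""
    return math.log(table[0][0] * table[1][1] / (table[0][1] * table[1][0]))

def cs_independence_levels(tables, tol=1e-9):
    """Modalities of the conditioning variable for which X1 and X2 are independent."""
    return [c for c, table in tables.items() if abs(log_odds_ratio(table)) < tol]

K = cs_independence_levels(conditional_tables)
```

Here the independence X1 ⊥ X2 | X3 = c holds for c in K = {1, 3} only, so the ordinary conditional independence X1 ⊥ X2 | X3 fails while the CS independence holds.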

Now, we consider the case where we have at least one ordinal variable. In this unexplored case, we move to the HMMM framework (Bartolucci et al. 2007; Cazzaro and Colombi 2014). In the HMMMs, beyond the baseline parameters, we can use parameters η coded with different criteria in order to take into account the possible natural order of the modalities. In this work, we take advantage of the local logits, which compare the probability of a cell $\pi_i$ with the previous one; for instance, referring to variable X1 we have $\eta_1^1(i_1) = \log\left(\frac{\pi_{i_1}}{\pi_{i_1 - 1}}\right)$.
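The relation between local and baseline logits can be verified on a small example. The probabilities below are hypothetical; the sketch only shows that summing the local logits up to a given modality gives the baseline logit of that modality, which is why sums over lower modalities appear in the constraints when local logits are used.

```python
import math

def local_logits(probs):
    """Local logits log(pi_i / pi_{i-1}) for i = 2, ..., k."""
    return [math.log(probs[i] / probs[i - 1]) for i in range(1, len(probs))]

def baseline_logits(probs):
    """Baseline logits log(pi_i / pi_1) for i = 2, ..., k."""
    return [math.log(probs[i] / probs[0]) for i in range(1, len(probs))]

probs = [0.5, 0.3, 0.2]        # hypothetical marginal distribution of an ordinal variable
local = local_logits(probs)
baseline = baseline_logits(probs)
# Cumulative sums of local logits telescope to the baseline logits.
cumulative = [sum(local[: i + 1]) for i in range(len(local))]
```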

The independence in formula [1.1] holds if the parameters of the HMMM, coded with local logits, satisfy the following constraints:

$$\sum_{v \in V} \sum_{c \in P(C)} \sum_{i_c^* \le i_c} \eta_{vc}(i_v i_c^*) = 0, \qquad i_v \in I_v, \quad i_c \in K \qquad [1.3]$$

where P(·) denotes the power set, V = {(P(A) \ ∅) ∪ (P(B) \ ∅) ∪ P(D)} and K is a subset of the modalities of C ($I_C$) for which the CS independence holds (Nicolussi and Cazzaro 2017).

Example 1.3. Considering the CS independence in Example 1.2 and adopting local logits for coding the conditioning variable, the constraints in formula [1.3] become

$$\eta_{1,2}^{1,2,3,4}(i_1 i_2) + \eta_{1,2,3}^{1,2,3,4}(i_1 i_2 i_3) + \sum_{i_4^*=1}^{i_4} \eta_{1,2,4}^{1,2,3,4}(i_1 i_2 i_4^*) + \sum_{i_4^*=1}^{i_4} \eta_{1,2,3,4}^{1,2,3,4}(i_1 i_2 i_3 i_4^*) = 0 \qquad [1.4]$$

with i1 ∈ I1, i2 ∈ I2, i3 ∈ I3 and i4 ∈ K. It is worthwhile to note that the constraints in formula [1.3], when we deal with local logits, correspond to the CS independence X1 ⊥ X2 | (X3, X4 ≤ i4), i4 ∈ K.

1.3. Stratified chain graph models

A chain graph is a graph with both directed and undirected arcs and without any directed or semidirected cycle. The vertices of a chain graph can be partitioned into so-called chain components, denoted by T1, ..., Ts. Within these chain components, there are only undirected arcs, and between vertices belonging to different components there are only directed arcs, all going in the same direction. The CGMs are graphical models that take advantage of chain graphs to describe a system of independencies. There are different types of CGM (Drton 2009), which interpret in different ways the presence/absence of directed/undirected arcs. In this work, we use the CGM of type I (Lauritzen and Wermuth 1989; Frydenberg 1990), as a natural generalization of classical graphical models. CGMs are used when the variables to analyze are of a different nature, such that they can be naturally collected in different components. Furthermore, it is reasonable to suppose that the dependence relationship between variables within the same component differs from the relationship between variables collected in different components. Therefore, it is possible to define an explicative order between the variables collected in different components. As shown in Rudas et al. (2010) and Nicolussi (2013), the marginal log-linear models and the HMMMs give a suitable parameterization for the CGM of type I.

Now, the extension of CGMs needed to represent the CS independencies closely follows the approach of Nyman et al. (2016) for undirected graphs. Thus, we introduce the stratified chain graph models (SCGMs) as an extension of stratified graphical models (Nyman et al. 2016). A stratified chain graph has, in addition to the features of the previous graphs, labeled arcs. These identify the “stratum” of the model, that is, the modality(ies) of the variable(s) in the conditioning set for which the CS independence holds.

Example 1.4. Let us consider five variables X1, X2, X3, X4 and X5. Suppose that according to the nature of the variables, we can split them into two components such that variables X1 and X2 can be considered explicative for X3, X4 and X5. The SCGM represented in Figure 1.1 is one possible situation that can occur. In this case, we have the conditional independencies X3 ⊥ X2 | X1 and X5 ⊥ X1 X2 | (X3, X4) and the CS independence X3 ⊥ X4 | (X1 = $i_1^*$, X2, X5 = $i_5^*$).

Figure 1.1. SCGM with the labeled arc X3 − X4 referring to modality $i_1^*$ of X1 and modality $i_5^*$ of X5
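The structure of a stratified chain graph can be sketched as a small data structure. This is only an illustration: the class name, its fields and the edge set used below are hypothetical, loosely inspired by Example 1.4, and are not claimed to reproduce Figure 1.1. The check covers the structural rules stated above (undirected arcs within components, directed arcs from earlier to later components, labels attached to existing arcs) and says nothing about the independencies implied.

```python
from dataclasses import dataclass, field

@dataclass
class StratifiedChainGraph:
    components: list                                  # ordered chain components T1, ..., Ts
    undirected: set = field(default_factory=set)      # arcs stored as frozenset({u, v})
    directed: set = field(default_factory=set)        # arcs stored as (parent, child)
    strata: dict = field(default_factory=dict)        # arc -> {variable: modality}

    def component_of(self, vertex):
        for index, comp in enumerate(self.components):
            if vertex in comp:
                return index
        raise KeyError(vertex)

    def is_valid(self):
        # Undirected arcs must join vertices of the same component.
        within = all(len({self.component_of(v) for v in arc}) == 1 for arc in self.undirected)
        # Directed arcs must go from an earlier component to a later one.
        forward = all(self.component_of(u) < self.component_of(v) for u, v in self.directed)
        # Labels can only be attached to arcs that exist in the graph.
        labelled = all(arc in self.undirected or arc in self.directed for arc in self.strata)
        return within and forward and labelled

# Hypothetical edge set with two components, {X1, X2} and {X3, X4, X5}.
graph = StratifiedChainGraph(
    components=[{"X1", "X2"}, {"X3", "X4", "X5"}],
    undirected={frozenset({"X1", "X2"}), frozenset({"X3", "X4"}),
                frozenset({"X3", "X5"}), frozenset({"X4", "X5"})},
    directed={("X1", "X3"), ("X1", "X4"), ("X2", "X4")},
    strata={frozenset({"X3", "X4"}): {"X1": "i1*", "X5": "i5*"}},
)
backward = StratifiedChainGraph(components=[{"X1"}, {"X3"}], directed={("X3", "X1")})
```

Here `graph.is_valid()` holds, while `backward.is_valid()` fails because its only directed arc points from the later component to the earlier one.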

1.4. Application on real data

In this section, we investigate the potential of a model that simultaneously considers marginal, conditional and CS independencies on a set of (ordinal) categorical variables. Our aim is to study the effect of innovation in small and medium Italian enterprises, from 2009 to 2012, on revenue growth. With the term “innovation”, we refer to any improvement in products, services, production line, logistic system or organization, and to investment in Research and Development (R&D). We used the “Italian innovation survey on SM enterprises” (Istat 2012). We considered the revenue growth in 2012, G (yes, no), henceforth denoted as variable 1, as the pure response variable. Then, we took into account innovation through three dichotomous variables referring to the period 2009–2012: innovation in products or services or production line or investment in R&D, IPSP (yes, no), innovation in the organization system, IORG (yes, no), and innovation in marketing strategies, IMAR (yes, no), henceforth denoted as variables 2, 3 and 4, respectively. Finally, other variables concerning the firm’s features in 2009–2012 were collected: the main market (in revenue terms), MARK (A = regional, B = national, C = international), the percentage of graduate employees, DEG (1 = 0–10%, 2 = 10–50%, 3 = 50–100%), and the enterprise size, TYP (1 = small, 2 = medium), henceforth denoted as variables 5, 6 and 7, respectively.

In order to analyze this data set, we build a chain graph with three components according to the nature of the variables: in the first component we collect the firm’s features (MARK 5, DEG 6, TYP 7), in the second component the innovation variables (IPSP 2, IORG 3, IMAR 4) and in the third component the revenue growth (G 1). Then, starting from the complete chain graph, which contains all possible edges and corresponds to the saturated HMMM, we tested all chain graph models of type I with only one missing edge, in order to investigate, one by one, which pairwise independence is plausible. The tests were carried out with the likelihood ratio test, comparing the likelihood of the unconstrained HMMM with the likelihood of the corresponding constrained model. In the HMMM, the parameters of dummy variables were coded with baseline logits, whereas the parameters referring to the ordinal variables MARK and DEG were coded with local logits. We removed from the complete chain graph all the edges whose removal was supported by the previous tests, thereby obtaining a reduced CGM. Subsequently, we tested the reduced CGM, adding back one by one all the edges previously removed. Table 1.1 shows the test statistic, the degrees of freedom and the P-value of the HMMM for the main significant models. The numbers involved in the independencies represent the variables in the order of presentation. The CGMs associated with these three HMMMs are depicted in Figure 1.2.

Name | Independencies | G2 | df | P-value
A | 1 ⊥ 4 | 2, 3, 5, 6, 7;  3 ⊥ 5 | 2, 4, 6, 7 | 100.88 | 84 | 0.1012
B | 1 ⊥ 4 | 2, 3, 5, 6, 7;  4 ⊥ 7 | 2, 3, 5, 6 | 91.87 | 81 | 0.1921
C | 1 ⊥ 4 | 2, 3, 5, 6, 7;  3 ⊥ 5 | 2, 4, 6, 7;  4 ⊥ 7 | 2, 3, 5, 6 | 112.02 | 93 | 0.0872

Table 1.1. Values of likelihood ratio test G2 of HMMM associated with CG models
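The P-values in Table 1.1 can be checked approximately from the reported G2 statistics and degrees of freedom. The sketch below uses the Wilson–Hilferty normal approximation to the chi-square distribution, chosen here only to keep the example self-contained; it is not the computation used in the chapter, and an exact value would come from the chi-square survival function of a statistical library. The function name is hypothetical.

```python
import math

def chi2_sf_wilson_hilferty(x, df):
    """Approximate P(chi2_df > x) with the Wilson-Hilferty normal approximation."""
    c = 2.0 / (9.0 * df)
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# G2 statistics and degrees of freedom as reported in Table 1.1.
models = {"A": (100.88, 84), "B": (91.87, 81), "C": (112.02, 93)}
p_values = {name: chi2_sf_wilson_hilferty(g2, df) for name, (g2, df) in models.items()}
rejected_at_10pct = [name for name, p in p_values.items() if p < 0.10]
```

Under this approximation the three P-values agree with the tabulated ones to about three decimal places, and model C is the only one rejected at the 10% level.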

It is clear (i.e. it is common to all models) that the growth (1) is independent of the innovation in marketing strategies (4) given the remaining variables (2, 3, 5, 6, 7). In model A, the innovation in the organization system (3) is independent of the market where the enterprise works (5) given the other variables concerning the innovation and the firm’s features (2, 4, 6, 7). In model B, instead, the innovation in marketing strategies (4) is independent of the enterprise’s size (7) given the other variables concerning the innovation and the firm’s features (2, 3, 5, 6). Model C is the union of the independencies in models A and B. As we can see from Table 1.1, choosing a type I error level α equal to 0.1, we reject the null hypothesis for model C; thus the data do not support model C. We therefore considered the three independencies characterizing model C as CS independencies and tested all possible alternatives. The most interesting models are reported in Table 1.2. The preferable model, according to the principle of parsimony, is C4. The difference between models C and C4 is the independence between the organization system (3) and the market where the enterprise works (5). In fact, in C4 this independence holds only when the conditioning variable, the percentage of graduate employees (6), is lower than 10% or greater than 50%, which we can take as an indicator of unspecialized or highly specialized firms. This means that only when the percentage of graduate employees is between 10% and 50% does the market affect the innovation in the organization system.

DEG 6

TYP 7

MARK 5

DEG 6

MARK 5

IORG 3

IPSP 2

(a)

DEG 6

MARK 5

G1

G1

IMAR 4

TYP 7

IMAR 4

IORG 3

IPSP 2

(b)

G1

IMAR 4

IORG 3

IPSP 2

(c)

Figure 1.2. CG models

The stratified chain graph associated with model C4 is depicted in Figure 1.3. In this graph, the label on the arc between the nodes MARK and IORG reports the modalities of the variable DEG for which the arc is removed. That is, MARK is independent of IORG given IPSP, IMAR, DEG and TYP only when the variable DEG takes its first or third modality.


Name  Independencies                G2      df  P-value
C1    1 ⊥ 4|2, 3, 5, 6, 7           94.75   85  0.22002
      3 ⊥ 5|2, 4, (6 = 1), 7
      4 ⊥ 7|2, 3, 5, 6
C2    1 ⊥ 4|2, 3, 5, 6, 7           102.77  85  0.09205
      3 ⊥ 5|2, 4, (6 = 2), 7
      4 ⊥ 7|2, 3, 5, 6
C3    1 ⊥ 4|2, 3, 5, 6, 7           101.08  85  0.1125
      3 ⊥ 5|2, 4, (6 = 3), 7
      4 ⊥ 7|2, 3, 5, 6
C4    1 ⊥ 4|2, 3, 5, 6, 7           105.09  89  0.1171
      3 ⊥ 5|2, 4, (6 = 1, 3), 7
      4 ⊥ 7|2, 3, 5, 6

Table 1.2. Values of likelihood ratio G2 test of HMMM
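The choice among the models in Table 1.2 can be sketched as a small selection rule. The text only says that the choice follows the principle of parsimony, so the rule below is an assumed reading rather than the authors' stated procedure: keep the models that are not rejected at α = 0.1 and, among those, prefer the one with the most degrees of freedom (that is, the fewest free parameters). The function name `most_parsimonious` is hypothetical.

```python
# (G2, df, P-value) as listed in Table 1.2
table_1_2 = {
    "C1": (94.75, 85, 0.22002),
    "C2": (102.77, 85, 0.09205),
    "C3": (101.08, 85, 0.1125),
    "C4": (105.09, 89, 0.1171),
}

ALPHA = 0.1  # reference level used in the text


def most_parsimonious(models, alpha):
    """Return the name of the non-rejected model with the largest df.

    Ties on df are broken by the larger P-value. Raises ValueError if every
    model is rejected at level alpha.
    """
    kept = {name: stats for name, stats in models.items() if stats[2] >= alpha}
    if not kept:
        raise ValueError("all models are rejected at the chosen level")
    return max(kept, key=lambda name: (kept[name][1], kept[name][2]))


print(most_parsimonious(table_1_2, ALPHA))
```

With the values of Table 1.2, this rule discards C2 and selects C4, in line with the choice reported in the text.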

Figure 1.3. SCG model C4


Finally, Table 1.3 reports the values of the second-order marginal log-linear parameters (referring to pairs of variables) of model C4. First, recall that these are defined in the first marginal distribution where they occur. In this case, the marginal subsets associated with the CG models in Figure 1.2 and the SCG model in Figure 1.3 are {(5, 6, 7), (2, 3, 4, 5, 6, 7), (1, 2, 3, 4, 5, 6, 7)}. Furthermore, recall that, in order to define the conditional (marginal) independencies in model C4, we have to constrain to zero the parameter $\eta_{1,4}^{1,2,3,4,5,6,7}$ and all the higher order parameters, defined in the marginal set (1, 2, 3, 4, 5, 6, 7), containing the pair of variables (1, 4), and also the parameter $\eta_{4,7}^{2,3,4,5,6,7}$ and all the higher order parameters, defined in the marginal set (2, 3, 4, 5, 6, 7), containing the pair of variables (4, 7). Finally, in order to define the CS independence, according to formula [1.3], we have to constrain to zero the sum of the parameters $\eta_{3,5}^{2,3,4,5,6,7}$ and all the higher order parameters, defined in the marginal set (2, 3, 4, 5, 6, 7), containing the pair of variables (3, 5) but where variable 6 takes value 1 or 3. Note that in Table 1.3 the parameters $\eta_{3,5}^{2,3,4,5,6,7}$ are free and take the value 0. This reveals the lack of relationship between the variables MARK and IORG, at least concerning the parameters of third or higher order.

Variable

Modalities

IPSP 2

Yes

G1

IPSP 2

IORG 3

IMAR 4

Yes

Yes

Yes

Yes

MARK 5

DEG 6

National International 10–50%

≥ 50%

0.1927 (0.0793)

IORG 3

IMAR 4

MARK 5

Yes

Yes

National International

DEG 6

10–50%

0.1023

1.8221

(0.0709)

(0.0827)

0

1.4848

1.9967

(0.0000)

(0.0907)

(0.0764)

0.0980

0.6378

0

0.3005

(0.0688)

(0.0960)

(0.0000)

(0.0928)

0.4668

0.1517

0

-0.2096

(0.1486)

(0.1815)

(0.0000)

(0.1912)

0.0332

0.5020

0.4372

(0.0821) (0.10400) ( 0.0988) ≥50% TYP 7

Medium

-0.1333

-0.0422

0.5048

(0.1436)

(0.2070)

(0.1624)

0.3700

0.6447

0.5687

(0.0790)

(0.1064)

(0.0868)

0.4323

0.6902

0.2547

(0.0927)

(0.0567)

(0.0856)

0.3451

0.1758

( 0.1746) (0.1024) 0

0.9878

(0.0000) (0.0497)

-0.1186 (0.1493) 0.7591 (0.0775)

1.1702

-0.3302

(0.0543) (0.0899)

Table 1.3. Second-order marginal log-linear parameters

From Table 1.3, we can see that there is a strong (positive) second-order association among the three innovation variables: (IPSP, IORG) with a log odds ratio of 1.82, (IPSP, IMAR) with a log odds ratio of 1.49 and (IMAR, IORG) with a log odds ratio of 2.00. In the graph, they correspond to the undirected arcs among nodes 2, 3 and 4.


This means that firms are more likely to innovate in several areas at once. Another strong association is between the firm’s size and the main market. In particular, it seems reasonable that the bigger the firm, the bigger the market where it operates.

It is also worthwhile to focus on the parameters concerning the variable DEG, which discriminates between a conditional and a CS independence in model C4. In particular, Table 1.3 shows that the parameters referring to the 10–50% modality (all positive) and those referring to the ≥ 50% modality (more than half of which are negative) point in opposite directions. This means that moving from the unspecialized firm (less than 10% graduates) to a medium specialized firm (10–50% graduates), we have a positive association with all the other variables. On the other hand, by considering the highly specialized firm (≥ 50% graduates) with respect to the medium specialized firm, we can see that there is a negative trend with the revenue growth. The same trend also occurs with the innovation in product, services, product line and R&D (IPSP), the main market (MARK) and the firm’s size (TYP). This change would probably have gone unobserved had the parameters been coded with baseline logits. Furthermore, had we accepted the conditional independence 3 ⊥ 5|2, 4, 6, 7, we would not have focused on variable 6.

1.5. Conclusion

In this chapter, we showed how to represent CS independencies in HMMMs when dealing with ordinal variables and when marginal and conditional independencies are also of interest. We also provided a graphical representation based on chain graphs in order to visually simplify the relationships among the variables.
The final SCGM was chosen following a two-step procedure: first identifying the best CGM and then looking at the problem at hand to find the “strata” of the graph. Further research will be dedicated to implementing a procedure that is able to test all possible models (testing all hypotheses of independence). Other research involves the definition of constraints for parameters coded with “global” or “continuation” logits. It would also be interesting to study the definition of SCGMs by considering the chain graph models of type 4 (Drton 2009), with the parameterization explained by Marchetti and Lupparelli (2011).

1.6. References

Bartolucci, F., Colombi, R., Forcina, A. (2007). An extended class of marginal link functions for modelling contingency tables by equality and inequality constraints. Statistica Sinica, 17, 691–71.
Bergsma, W.P., Rudas, T. (2002). Marginal models for categorical data. Annals of Statistics, 30(1), 140–159.
Cazzaro, M., Colombi, R. (2014). Marginal nested interactions for contingency tables. Communications in Statistics – Theory and Methods, 43(13), 2799–2814.
Drton, M. (2009). Discrete chain graph models. Bernoulli, 15(3), 736–753.
Istat. (2012). Italian innovation survey on SM enterprises.
Frydenberg, M. (1990). The chain graph Markov property. Scandinavian Journal of Statistics, 17(4), 333–353.
Lauritzen, S.L., Wermuth, N. (1989). Graphical models for associations between variables, some of which are qualitative and some quantitative. The Annals of Statistics, 17(1), 31–57.
Marchetti, G.M., Lupparelli, M. (2011). Chain graph models of multivariate regression type for categorical data. Bernoulli, 17(3), 827–844.
Nicolussi, F. (2013). Marginal parameterizations for conditional independence models and graphical models for categorical data. PhD Thesis. University of Milano Bicocca.
Nicolussi, F., Cazzaro, M. (2017). Context-specific independencies for ordinal variables in chain regression models. arXiv preprint arXiv:1712.05229.
Nyman, H., Pensar, J., Koski, T., Corander, J. (2016). Context-specific independence in graphical log-linear models. Computational Statistics, 31(4), 1493–1512.
Højsgaard, S. (2004). Statistical inference in context specific interaction models for contingency tables. Scandinavian Journal of Statistics, 31(1), 143–158.
Rudas, T., Bergsma, W.P., Németh, R. (2010). Marginal log-linear parameterization of conditional independence models. Biometrika, 97(4), 1006–1012.

2 Analysis of the Determinants and Outputs of Innovation in the Nordic Countries

In this chapter, we discuss the European Nordic countries (Denmark, Finland, Iceland, Norway and Sweden), which are referred to by the European Commission as countries with high innovation performance. The panel data analyzed cover the period between 1999 and 2014, and we studied how different inputs of innovation affect its different outputs. The “innovation” variable was constructed using factor analysis, given that the Organization for Economic Co-operation and Development considers that innovation is the result of a set of macromeasures common to different countries. The factor obtained through the exploratory factor analysis represents the results of innovative activity and economic performance. We analyzed how the quality of human capital and the research and development efforts carried out by different economic agents affect the results of innovation. It was possible to conclude that a higher proportion of applied research and more cooperation between the research carried out by companies, universities and the government lead to better economic results and to higher outcomes of intellectual property.

2.1. Introduction

Innovation is a central theme in the current literature, and the recognition of its importance has increased over the last few decades. As Porter (2007) has pointed out, innovation has become the challenge that defines global competitiveness.

Chapter written by Cátia ROSÁRIO, António Augusto COSTA and Ana LORGA DA SILVA.

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


Panel data refer to a sample with both a cross-sectional and a temporal dimension, merging a time series approach with a cross-section approach (Baltagi 2013). The use of panel data analysis allows us to analyze several individuals, in this case the Nordic countries. These countries were chosen because they are references in the field of innovation (European Commission 2016). The European Innovation Scoreboard (EIS) and the Global Innovation Index (GII) group different information for the construction of a single indicator of innovation. Therefore, factor analysis was used to construct a representative factor that can be considered as the output of the innovative activity. The main purpose of this study is to determine how different sources and approaches of research, as well as the quality of human capital, contribute to the outputs of innovation.

2.2. Innovation

Joseph Schumpeter mentioned the importance of innovation as a form of “creative destruction” that leads to value creation. Schumpeter (1939) pointed out that this “weed” goes beyond the simple idea of creating something new, since it can also lead to the creation of new markets. According to the Oslo manual (OECD 2005), innovation is the implementation of a new or significantly improved product (good or service), process or method of marketing, or a new organizational method in business practices, workplace and external relations. Cunha et al. (2016) specify organizational innovation as a way of establishing new agreements with clients or suppliers, new ways of providing after-sales service and new modus operandi for the relationship with customers, among other practices. The concepts of change, invention and creativity (Drucker 1997) are related to innovation. As stated by Schumpeter (1939), it is possible to distinguish between invention, innovation and diffusion.
Teixeira (2011) points out that innovation is a process composed of three phases: invention (creation of something new that results from the creation or acquisition of knowledge), innovation (transformation or application of new knowledge) and diffusion (acceptance and adoption of innovation, recognizing its economic utility). Innovation is a complex process, partly because of the many forms it can take. As mentioned by Sarkar (2014), it is a process that affects the different organizational areas. Depending on the degree of novelty of the results, innovation can be classified as incremental (associated with gradual improvements), radical (referring to the creation of something new) or disruptive (it can originate a new industry or create a symbiosis between unrelated technologies). Besides having different degrees of novelty, innovation can also be differentiated through its “object”: as mentioned in the Oslo manual (OECD 2005), there are four types of innovation, namely product, process, organizational and marketing. Because innovation takes different forms and affects several organizational areas, the way in which this process can be developed has evolved, as summarized in Table 2.1.

Innovation model | Description
1st Generation (1950 to 1960): Technology push | Sequential and linear process in which the market functions as a receiver of research results developed in universities, considering that basic research is sufficient.
2nd Generation (1960 to 1970): Market pull | Sequential and linear process where needs are opportunities to explore, initiating the process of creating ideas, directing R&D efforts.
3rd Generation (1970 to 1980): Coupling model | Continuous sequential process with interconnected steps that relate different sectors of the company to the scientific community and to other economic agents.
4th Generation (1980 to 1990): Integrated business processes | Parallel process, with integrated development where production and sales are integrated to work simultaneously in the development of products/services.
5th Generation (after 1990): System integration & networking | Process with vertical and horizontal integration within companies, broadening the horizons of collaborative research.

Table 2.1. Description of innovation models (source: adapted from Campos and Valadares [2008], Rothwell [1994] and Teixeira [2011])

According to the Oslo manual (OECD 2005), innovation goes beyond technological development; however, this remains the feature that has the greatest impact on organizations as well as on society. These models show the importance of research for innovation success and, as per Castilho et al. (2014), research is the development of a study based on a set of procedures that seek solutions to certain problems. This is a generic concept; however, it is known that innovation influences and is influenced by two types of research: basic (fundamental) and applied. As referred to by the World Intellectual Property Organization (WIPO) (2016), measures of innovation can be ambiguous, given that research is essential for innovative activity results. That is, the efforts employed essentially involve R&D, and research is the cornerstone of such efforts.

Basic research is mostly performed at universities and its contribution is mainly to scientific knowledge, while applied research is concerned with solving concrete “problems”. In this way, applied research responds more effectively to the purpose of innovation (Campos and Valadares 2008). According to the European Commission (2015), research has increasingly and significantly contributed to the results of innovative activity. This requires investment in R&D, which can be made by different agents: companies, universities and government. The cooperation of different agents enables the creation of synergies, leveraging the sharing of information, technology and results, as we can read in the Oslo manual (OECD 2005). This systemic view of innovation was approached by Etzkowitz and Leydesdorff (1998) through the triple helix model, which considers the coordination between the different mechanisms and institutions to be fundamental, as shown in Figure 2.1.

[Figure 2.1: innovation at the center of three agents: universities (creation and dissemination of knowledge), government (application of public policies that promote the development of science and technology) and companies (investment that aims to transform knowledge into products, services or processes that create value)]

Figure 2.1. Triple helix model (source: adapted from Leydesdorff and Ivanova [2016])

As shown in Figure 2.1, this relationship is considered by Etzkowitz (2002) using three dimensions: the relationship between each axis as a function of its economic mission, the mutual influence between them and the creation of a new layer of organizations that results from the interaction of these three agents. Therefore, the investment in R&D programs carried out by each one of them must consider the research that is carried out by the others.

Since innovation is so important, it is necessary to find the best way to evaluate it in order to take measures that promote it in a sustainable way. According to the OECD, the collection of data on scientific and technological capacities at the national level has become a priority. In 1963, the OECD developed a manual – the Frascati manual – which established a set of procedures for collecting data on human resources and R&D expenditures, so that it would be possible to assess and compare innovation between countries. In 1999, the OECD had 45 innovation indicators; it currently has about 200 innovation references to analyze scientific and technological practice, such as the international mobility of researchers and scientists, growth of the information economy, innovation by regions and industries, innovation strategies and others (OECD 2015).

The European Union (EU) uses a set of tools to collect information on innovation:
– Community Innovation Survey (CIS): It is required for the EU member States and is based on the conceptual framework set out in the Oslo manual, as well as on Eurostat methodological recommendations.
– EIS: The EIS was developed at the European Summit in Lisbon in 2000. Its purpose is to measure the innovation performance of EU countries and to compare it with that of other countries.

As mentioned by Godinho (2007) and Lhuillery et al. (2016), there are many ways in which innovation can be measured and there is no single indicator or measure that can reflect the full potential and innovative outcome of a country. The challenge is therefore even greater, because it is necessary to correctly articulate the different measures of innovation. Similarly, Roszko-Wojtowicz and Biateck (2016), using multidimensional statistics, concluded that the set of 25 indicators used by the EIS can be effectively reduced. However, they draw attention to the fact that it includes both inputs and outputs of innovation, which makes it difficult to analyze which inputs contribute the most to innovation success. In the same sense, Lhuillery et al. (2016) point out the importance of distinguishing these two components of innovation. In the same way, Sarkar (2014) reinforces the idea of a systemic approach that distinguishes input, process and output, allowing a better understanding not only of the innovation process but also of the determinants of its success.

2.3. Methodology

The data collected (1999–2014) refer to the Nordic countries: Denmark, Finland, Iceland, Norway and Sweden. The variables used were collected from the World Bank and the OECD.


Using exploratory factor analysis (EFA), a single factor was obtained, considered representative of the results of the innovation efforts. The construction of this factor included the following variables: registration of patents (PR) and trademarks (TR) made by residents, and exports of high technology in the pharmaceutical industry (EHTFI) and in the aerospace industry (EHTAI). The relational structure of the considered variables was evaluated by the EFA on the matrix of correlations, with extraction of the factors by the method of principal components. The retained factor had an eigenvalue greater than 1, in agreement with the scree plot and the percentage of variance retained. The use of different criteria allows a higher robustness in the retention of factors.

To evaluate the validity of the EFA, the Kaiser–Meyer–Olkin (KMO) criterion was used. With KMO = 0.729 and a P-value very close to 0 for Bartlett's test of sphericity, we can conclude that the EFA is adequate and that the variables are significantly correlated. Henson and Roberts (2006) maintain that there is no consensus regarding the minimum cumulative variance acceptable for all research areas. As mentioned by Taherdoost et al. (2014), in natural sciences the admissible values are higher than 95%, while in the humanities values between 50% and 60% are already acceptable. The 72% of variance retained was therefore considered acceptable to proceed with the analysis. The obtained factor is given as:

$$\widehat{\mathrm{INNOV}} = 0.296\,\mathrm{PR} + 0.331\,\mathrm{TR} + 0.253\,\mathrm{EHTFI} + 0.295\,\mathrm{EHTAI}$$
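As an illustration of how such a factor score can be computed, the sketch below applies the reported coefficients to standardized variables. Two points are assumptions of this sketch and are not stated in the text: that the coefficients apply to z-scores (plausible because the EFA was run on the correlation matrix) and that the population standard deviation is used. The toy observations are invented for illustration and are not the chapter's data.

```python
from statistics import mean, pstdev

# Score coefficients reported for the retained factor
WEIGHTS = {"PR": 0.296, "TR": 0.331, "EHTFI": 0.253, "EHTAI": 0.295}


def standardize(values):
    """Return z-scores (assumption: population standard deviation)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def innov_scores(data):
    """Weighted sum of the standardized indicators, one score per observation."""
    z = {name: standardize(data[name]) for name in WEIGHTS}
    n = len(data["PR"])
    return [sum(WEIGHTS[name] * z[name][i] for name in WEIGHTS) for i in range(n)]


# Hypothetical raw observations (illustrative values only)
toy = {
    "PR": [1200.0, 1500.0, 900.0, 2100.0],
    "TR": [3000.0, 3400.0, 2500.0, 4100.0],
    "EHTFI": [500.0, 650.0, 400.0, 900.0],
    "EHTAI": [200.0, 180.0, 150.0, 400.0],
}

scores = innov_scores(toy)
print([round(s, 3) for s in scores])
```

Because every indicator is centered before weighting, the scores average to zero across observations, and an observation that is highest on all four indicators receives the highest score.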

For each econometric model presented, a panel diagnosis was made to determine the most suitable model. Through the results of the F-statistic, the Breusch–Pagan test and the Hausman test, it was found that, in all cases, the fixed effects model is the most adequate (Pesaran 2015). The fixed effects models obtained are given as:

$$y_{it} = \beta_0 + \beta_1 x_{it1} + \dots + \beta_k x_{itk} + a_i + u_{it}$$

where:
– i = 1, ..., 5: countries;
– t = 1, ..., 16: years;
– k = 8: explanatory variables;
– $a_i$: fixed effect of each country;
– $u_{it}$: error term.

The models presented and analyzed in section 2.4 aim to answer the main question of this study: how do different sources of innovation contribute to the entrepreneurial and commercial outputs of innovation? The variables used can be grouped as follows:
– Business and commercial results of innovation: Patent registration (PR) and trademarks (TR) made by residents and exports of high technology goods/services in the pharmaceutical industry (EHTFI) and in the aerospace industry (EHTAI). This set of variables refers to the factor created using EFA, which is representative of the business and commercial outputs of innovation (Roszko-Wojtowicz and Biateck 2016).
– Human capital: It includes higher education in engineering (HEE), higher education in business and law sciences (HEBL) and vocational programs (VP). This set of variables relates to the quality of human capital (Valente 2014).
– Research and Development: It includes R&D expenditure by companies (BERD), universities (HERD) and government (GOVERD), the number of researchers (RES) and scientific publications (SP). These variables reflect innovation efforts (European Commission 2016).
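The fixed effects model above can be illustrated with the within (demeaning) estimator. The sketch below is a simplification and not the authors' estimation: it uses a single explanatory variable instead of eight, folds the intercept into the country effects, and runs on invented toy data. The function name `within_estimator` and the unit labels are hypothetical.

```python
from collections import defaultdict


def within_estimator(rows):
    """Estimate y_it = beta * x_it + a_i + u_it by demeaning within each unit.

    rows: iterable of (unit, x, y). Returns (beta, effects), where effects maps
    each unit to its estimated a_i (the common intercept is absorbed in a_i).
    """
    by_unit = defaultdict(list)
    for unit, x, y in rows:
        by_unit[unit].append((x, y))

    # Unit-specific means of x and y
    means = {
        u: (sum(x for x, _ in obs) / len(obs), sum(y for _, y in obs) / len(obs))
        for u, obs in by_unit.items()
    }

    # OLS slope on the demeaned data
    sxy = sum((x - means[u][0]) * (y - means[u][1]) for u, obs in by_unit.items() for x, y in obs)
    sxx = sum((x - means[u][0]) ** 2 for u, obs in by_unit.items() for x, _ in obs)
    beta = sxy / sxx

    effects = {u: my - beta * mx for u, (mx, my) in means.items()}
    return beta, effects


# Toy panel generated from y = 2 * x + a_i with a_A = 1 and a_B = 10 (no noise)
toy_panel = [
    ("A", 1.0, 3.0), ("A", 2.0, 5.0), ("A", 3.0, 7.0),
    ("B", 1.0, 12.0), ("B", 2.0, 14.0), ("B", 4.0, 18.0),
]

beta, effects = within_estimator(toy_panel)
print(round(beta, 6), {u: round(a, 6) for u, a in effects.items()})
```

Demeaning removes anything that is constant within a country, which is why the slope is recovered even though the two toy units have very different levels. The sketch also works with unbalanced panels, since each unit is demeaned with its own number of observations.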

2.4. Results

The results of the econometric models obtained are analyzed here. Their analysis is complemented with descriptive statistics of some of the considered variables, comparing the countries under study.

In model (1), R&D investments made by companies contribute positively to innovation, contrary to the investments made by the universities and the government, which, although not statistically significant, show a negative sign. From the perspective of researchers and scientific production, not all types of research appear to have a positive impact on innovation outputs. Regarding human capital, training in business and law sciences has a positive impact on achieving business and commercial results of innovation. Since the analysis of model (1) is not conclusive, it is of interest to verify whether the contribution of the explanatory variables differs across the different innovation outputs considered.


Dependent variables:
– Model (1), INNOV: Factor obtained through EFA.
– Model (2), PR: Number of patent registrations submitted by national applicants through the Patent Cooperation Treaty procedure or with a national patent office.
– Model (3), TR: Number of trademark applications made by national applicants in a particular national intellectual property office.
– Model (4), EHTFI: Volume of exports in the pharmaceutical industry, in millions of USD.
– Model (5), EHTAI: Volume of exports in the aerospace industry, in millions of USD.

Explanatory variables:
– BERD, HERD and GOVERD: Expenditures made by firms, universities and government in R&D, as % of GDP.
– HEE: Proportion of people with higher education in Engineering and Industry (relative to total people with higher education).
– HEBL: Proportion of people with higher education in Business and Law Sciences (relative to total people with higher education).
– VP: Percentage of vocational training programs in secondary and post-secondary (non-tertiary) education based on programs geared specifically to a given class of professions or trades.
– RES: Number of researchers per 1,000 persons employed.
– SP: Number of publications in scientific journals, per million dollars of GDP, which, according to the Policy Platform for Innovation developed by the OECD and World Bank, is a measure of the quality of scientific publications.

              (1) INNOV      (2) PR         (3) TR         (4) EHTFI      (5) EHTAI
BERD          0.564057***    411.014**      433.115        −57.6046       518.667***
              (0.168618)     (193.269)      (685.198)      (948.616)      (188.557)
HERD          −0.487268      −1690.37***    −1211.57       9654.62***     −632.148*
              (0.328954)     (378.529)      (1336.74)      (1857.92)      (369.301)
GOVERD        −0.536596      −504.28        5456.20***     −2206.12       −1198.35**
              (0.490689)     (555.119)      (1993.97)      (2724.68)      (541.587)
HEE           −0.209773      2964.19**      −10884.10**    −15313.5**     1658.23
              (1.04808)      (1195.02)      (4258.98)      (5865.50)      (1165.89)
HEBL          3.13243***     −27.3018       15305.0***     559.692        1546.51
              (1.01792)      (1152.79)      (4136.45)      (5658.20)      (1124.69)
VP            0.0110596*     −13.0432*      73.9348***     74.6904*       1.22994
              (0.0065235)    (7.50171)      (26.5092)      (36.8204)      (7.31883)
RES           0.00318607     9.32919        −79.5665       348.854**      −16.5242
              (0.0243858)    (27.5731)      (99.0943)      (135.336)      (26.9010)
SP            −6.85333       9146.21*       −56188.5***    −23028.8       −1675.69
              (4.46935)      (5115.01)      (18161.7)      (25105.8)      (4990.31)
Constant      −1.83473**     1718.95*       −997.705       −6861.26       −337.428
              (0.788494)     (904.877)      (3204.13)      (4441.38)      (882.818)
F             254.26***      128.37***      140.56***      136.70***      27.82***
(df)          (12,33)        (12,34)        (12,33)        (12,34)        (12,34)
Observations  46             47             46             47             47
R2            0.9893         0.9784         0.9808         0.9797         0.9076
Adjusted R2   0.9870         0.9739         0.9766         0.9751         0.8881

(Standard errors in parentheses) *** p

λ have all their elements equal to 1. $S(n_h)$ is the column vector of the $n_h$ elements s(i, h), where s(i, h) is the share of the individual i belonging to class h in total income. Expression [5.6] can be written as a sum of two components, $I_W$ and $I_B$, which correspond to the within-class inequality and the between-class inequality. Thus, we take the expression as

$$
\begin{aligned}
I_G = e'GS &= \sum_{p=1}^{m}\sum_{q=1}^{m}\left[e'(n_p)\cdot G(n_p,n_q)\cdot S(n_p)\right] \\
&= \sum_{p=1}^{m} e'(n_p)\cdot G(n_p,n_p)\cdot S(n_p) + \sum_{p=1}^{m}\sum_{q\neq p}\left[e'(n_p)\cdot G(n_p,n_q)\cdot S(n_p)\right] \\
&= I_W + I_B
\end{aligned}
\qquad [5.7]
$$

5.3. Application of method

Let us consider the 39 countries that participated in the negotiations for the signature of the Kyoto protocol, presented in Table 5.3. Table 5.3 also presents the percentage mitigation target of each country, the base year and the recorded carbon emissions of the base year. In this section, we describe an application of the proposed method to these 39 countries. The data correspond to the CO2 emissions of each country from 2005 to 2012. We distinguish two cases in our application.


5.3.1. Application of method for individual data

Initially, we measure the inequality considering countries as individuals. Our objective in this case is twofold. First, we calculate the dissimilarity for each year and then compare it across years. Second, we check whether the choice of expression [5.3] or [5.4] of the Gini index affects the results. According to Table 5.1, the values of the Gini index are close to the upper limit. This indicates that the differences in mitigation are very unequally distributed across countries. Studying the data, we can easily derive that by 2005, one-third of the examined member states had already managed to achieve their mitigation targets. The year 2005 was the formal starting period for the Kyoto protocol. By 2012, which was the target year, only 22 of the 39 members had managed to reach their targets. Some of the Kyoto members had shown selfish behavior. The countries with the biggest differences from their targets by 2012 were Canada (120,464.64 kt), Japan (200,128.96 kt) and the United States (632,002.9 kt). Moreover, we derive from Figure 5.1 that there is no significant difference between the results based on expression [5.3], which is restricted to positive values, and those based on expression [5.4], which calculates the Gini index with both positive and negative values. This means that our assumption of setting $y_i = 0$ for the negative values in the first case does not affect the results.

Year                          2005  2006  2007  2008  2009  2010  2011  2012
Only positive values          0.87  0.87  0.88  0.88  0.88  0.88  0.90  0.90
Positive and negative values  0.91  0.90  0.91  0.91  0.90  0.90  0.90  0.90

Table 5.1. Dissimilarity measurement for individual countries

Figure 5.1. Comparing methods for inequality measurement of Annex I countries, 2005–2012 period on individual data
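The "only positive values" case can be sketched as follows. Expressions [5.3] and [5.4] are not restated in this section, so the sketch does not reproduce them: it uses the textbook mean-difference form of the Gini index and simply sets negative differences to zero before computing it. The function names and the toy differences (in kt) are illustrative assumptions, not the chapter's data.

```python
def gini(values):
    """Gini index as sum_i sum_j |x_i - x_j| / (2 * n^2 * mean).

    Requires non-negative values with a positive total. With this definition
    the maximum for n units is (n - 1) / n, reached when one unit holds all.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total <= 0 or min(values) < 0:
        raise ValueError("values must be non-negative with a positive total")
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)  # n^2 * mean equals n * total


def gini_positive_only(differences):
    """Set negative differences to zero (y_i = 0), then compute the Gini index."""
    return gini([max(d, 0.0) for d in differences])


# Hypothetical differences from target, in kt (negative = target exceeded)
toy_differences = [-50.0, -10.0, 0.0, 5.0, 400.0]

print(round(gini_positive_only(toy_differences), 4))
print((len(toy_differences) - 1) / len(toy_differences))  # upper limit for n = 5
```

In the toy example, almost all of the positive difference is concentrated in one unit, so the index comes out just below its upper limit, which is the qualitative pattern described for Table 5.1.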


5.3.2. Application of method for grouped data

In this section, we separate the countries into four groups with a geographical orientation. The first group consists of the 15 countries of the European Community. These countries had signed and ratified the Kyoto protocol as members of the European Union of the time, and they also set a collective mitigation target. The second group comprises the countries that belong geographically to the European continent but were not yet members of the European Union in 1997. The third regional group consists of the countries of the North American continent, which in our case are the United States and Canada. These two countries are the biggest emitters that signed the Kyoto protocol, but their policies always showed mistrust in it: the United States did not ratify the protocol and Canada withdrew from the coalition in 2011. The fourth regional group includes the East Asia and Pacific countries, namely Australia, New Zealand and Japan¹.

Table 5.2 shows the values of the Gini index calculated both between groups and within groups. The between-group Gini index takes similarly high values across the years. This means that the inequality between groups remains the same despite the effort made by countries on emission mitigation. The same stability is observed for the inequality measured within groups.

Year      2005    2006    2007    2008    2009    2010    2011    2012
B.G.G.I.  0.7915  0.7891  0.8026  0.8037  0.8113  0.8205  0.8386  0.8294
W.G.G.I.  0.0707  0.0679  0.0663  0.0666  0.0645  0.0568  0.0538  0.0566

Table 5.2. Dissimilarity measurement for grouped countries
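The split of the Gini index into a within-group and a between-group part, as in [5.7], can be illustrated numerically. The sketch below does not use the chapter's G matrix, whose definition is not restated here. It assumes the equivalent mean-difference form, in which pairs of units from the same group contribute to the within component and pairs from different groups to the between component. The function name, group labels and values are invented for illustration.

```python
def gini_decomposition(values, groups):
    """Split the Gini index into (within, between) components.

    The index is sum_i sum_j |x_i - x_j| / (2 * n * total). Pairs whose units
    share a group label feed the within part; all other pairs feed the between
    part, so the two parts add up to the overall index.
    """
    n = len(values)
    total = sum(values)
    within = between = 0.0
    for i in range(n):
        for j in range(n):
            d = abs(values[i] - values[j])
            if groups[i] == groups[j]:
                within += d
            else:
                between += d
    scale = 2 * n * total
    return within / scale, between / scale


# Toy data: two groups that are internally similar but far apart from each other
toy_values = [1.0, 2.0, 10.0, 12.0]
toy_groups = ["g1", "g1", "g2", "g2"]

i_w, i_b = gini_decomposition(toy_values, toy_groups)
print(round(i_w, 4), round(i_b, 4), round(i_w + i_b, 4))
```

When groups are homogeneous internally but differ strongly from one another, the between component dominates, which is the same qualitative pattern as the B.G.G.I. and W.G.G.I. rows of Table 5.2.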

In this section, by applying the proposed method, we find that the inequality remains high over the examined period, whether the Gini index is calculated considering the countries as individuals or as groups. This means that some of the member countries act as “free riders” throughout the time period and exploit the benefits of the mitigation achieved by the other countries. Thus, the proposed method is capable of detecting “free riding” in existing climate coalitions.

1 Regional grouping follows the World Bank documentation (http://databank.worldbank.org/data/reports.aspx?Code=NY.GDP.MKTP.CD&id=1ff4a498&report_name=Popular-Indicators&populartype=series&ispopular=y)

5.4. Conclusions

The dissimilarity in GHG emissions mitigation is an obstacle that prevents countries from reaching a significant environmental agreement. This study proposes a method of calculating environmental inequality. The measurement of environmental inequality is a major issue, as it is directly linked to the impacts of climate change. Climate change burdens not only the natural environment but also the budgets of the countries. Applying the method with different expressions of the Gini index, we obtained some interesting results. First of all, using either of the two expressions of the Gini index does not significantly affect the results. It is also noted that the values of the Gini index remain high throughout the time period. This indicates that the inequality in emission mitigation did not change, although some countries claimed that they had mitigated their emissions in this period. Thus, we found that “free riding” is strongly implied in both cases, whether applying the expression that uses positive and negative differences or the expression that uses only the positive ones. Furthermore, when applying the method to grouped data, we find that the Gini index remains high between groups but is extremely low within groups, indicating that environmental policy is affected by the relationships that countries have with each other.

Figure 5.2. Between and within group inequality measurement of Annex I countries, 2005–2012 period, on grouped data

Monitoring the Compliance of Countries on Emissions Mitigation

5.5. Appendix

Annex I country              Percent of quantified       Base year          Base year
                             emission mitigation p_i     emissions e_bi
Australia                    1.08                        277,802.53         1990
Austria*                     0.92                        61,932.64          1990
Belgium*                     0.92                        118,684.50         1990
Bulgaria                     0.92                        98,815.11          1988
Canada                       0.94                        457,534.00         1990
Croatia                      0.95                        23,080.45          1990
Czech Republic               0.92                        163,864.20         1990
Denmark*                     0.92                        53,342.45          1990
Estonia                      0.92                        37,677.86          1990
Finland*                     0.92                        56,767.66          1990
France*                      0.92                        392,627.00         1990
Germany*                     0.92                        1,032,776.20       1990
Greece*                      0.92                        84,313.57          1990
Hungary                      0.94                        85,795.50          1985–1987 average
Iceland                      1.10                        2,158.64           1990
Ireland*                     0.92                        32,559.50          1990
Italy*                       0.92                        434,781.95         1990
Japan                        0.94                        1,144,129.51       1990
Latvia                       0.92                        18,622.93          1990
Liechtenstein                0.92                        203.06             1990
Lithuania                    0.92                        36,168.80          1990
Luxembourg*                  0.92                        12,219.20          1990
Monaco                       0.92                        105.37             1990
The Netherlands*             0.92                        159,389.50         1990
New Zealand                  1.00                        25,462.57          1990
Norway                       1.01                        34,766.97          1990
Poland                       0.94                        469,143.82         1988
Portugal*                    0.92                        40,261.95          1990
Romania                      0.92                        192,407.79         1989
Russian Federation           1.00                        2,500,352.09       1990
Slovakia                     0.92                        6,022.70           1990
Slovenia                     0.92                        16,281.84          1986
Spain*                       0.92                        228,511.44         1990
Sweden*                      0.92                        56,301.08          1990
Switzerland                  0.92                        44,553.30          1990
Ukraine                      1.00                        714,310.07         1990
United Kingdom*              0.92                        590,319.32         1990
United States of America     0.93                        5,100,000.00       1990

Table 5.3. Kyoto Annex I countries’ quantified emissions targets. Countries with (*) are members of the 1997 European Community


5.6. References

Boyce, J.K., Zwickl, K., Ash, M. (2016). Measuring environmental inequality. Ecological Economics, 124, 114–123.
Chancel, L., Piketty, T. (2015). Carbon and inequality from Kyoto to Paris: Trends in the global inequality of carbon emissions (1998–2013) and prospects for an equitable adaptation fund. PSE.
Duro, J.A., Padilla, E. (2006). International inequalities in per capita CO2 emissions: A decomposition methodology by Kaya factors. Energy Economics, 28(2), 170–187.
Duro, J.A. (2012). On the automatic application of inequality indexes in the analysis of the international distribution of environmental indicators. Ecological Economics, 76, 1–7.
Duro Moreno, J.A., Teixidó-Figueras, J., Padilla, E. (2013). Empirics of the international inequality in CO2 emissions intensity. Explanatory factors according to complementary decomposition methodologies. Working Papers. Department of Economics, Universitat Autonoma of Barcelona.
Duro, J.A., Teixido, F.J., Padilla, E. (2016). Empirics of the international inequality in emissions intensity: Explanatory factors according to complementary decomposition methodologies. Environmental & Resource Economics, 63(1), 57–77.
Heil, M.T., Wodon, Q.T. (1997). Inequality in CO2 emissions between poor and rich countries. J. Environ. Dev., 6(4), 426–452.
Heil, M.T., Wodon, Q.T. (2000). Future inequality in CO2 emissions and the impact of abatement proposals. Environmental and Resource Economics, 17(2), 163–181.
Heugues, M. (2014). International environmental cooperation: A new eye on the greenhouse gas emissions control. Ann. Oper. Res., 220(1), 239–262.
Padilla, E., Duro, J. (2013). Explanatory factors of CO2 per capita emission inequality in the European Union. Energy Policy, 62, 1320–1328.
Raffinetti, E., Siletti, E., Vernizzi, A. (2015). On the Gini coefficient normalization when attributes with negative values are considered. Statist. Methods Appl., 24(3), 507–521.
Sen, A. (1973). On Economic Inequality. Clarendon Press, Oxford, Great Britain.
Silber, J. (1989). Factor components, population subgroups and the computation of the Gini index of inequality. The Review of Economics & Statistics, 107–115.

6 Maximum Entropy and Distributions of Five-Star Ratings

The five-star rating system is the standard tool used to measure customer experience of products and services on websites. Online ratings are commonly represented by the frequencies of the observed ratings and by average values calculated from those frequencies. The purpose of this work is to investigate the appropriate statistical distribution of ratings based on the principle of entropy maximization. The expected value and variance constraints of entropy maximization are analyzed numerically by nonlinear constrained optimization. The binomial distribution appears, asymptotically, as the resulting max entropy distribution (MED) subject to the expected value and variance constraints. The implications of this result for ratings analysis are presented based on a data set of 1,000 real five-star rating samples. The max entropy truncated geometric distribution is set as a global upper limit and the binomial distribution as a local lower limit of data entropy, depending on the expected value and variance of the data.

6.1. Introduction

The five-star rating system is a type of Likert scale (Likert 1932) with five ordered responses; raters choose one of them to express their opinion about an entity (product, service, mobile application, film, hotel, tourism attraction, etc.). The five possible responses are typically numbered 1,2,3,4,5 or presented verbally by words or phrases, e.g. “terrible, poor, average, very good, excellent”. In every case, there are five ordered responses: the first two items represent a negative opinion (very bad, bad), the third, in the middle, a neutral opinion (average) and the last two a positive opinion (good, very good). Transforming the five possible responses to ordered integer values (1,2,3,4,5) or, for better statistical representation, (0,1,2,3,4) makes it possible to treat those ratings mathematically. This is a commonly accepted practice, where ratings in web pages are represented and visualized by their arithmetic mean, for an overall rating, e.g. 1, ..., 3, 3.5, ..., 4.5, 4.7, ..., 5. Alternatively, a percentage, e.g. 80%, 85%, 95%, ..., which is a normalized average, represents an overall score in the 0–100% interval.

Chapter written by Yiannis DIMOTIKALIS.

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.

Because of the broad diffusion of the five-star rating system on the Internet, there are billions of rated entities, with the number of ratings per entity ranging from a few tens for hotels or consumer products to several millions for mobile applications. Collections of several thousands of rating data sets are available (see Stanford Large Network Dataset Collection, SNAP; Leskovec and Krevl 2014). Their influence on potential customers in e-commerce decisions is a research topic in recommendation systems and data mining (see Hu et al. 2009; Aral 2013; Chung et al. 2013).

In this chapter, the statistical analysis of rating data based on entropy is the main research task. Entropy was introduced to statistical mechanics by Boltzmann in the late 19th century to measure the order and randomness of ideal gas particles. In five-star ratings, entropy measures the order and disorder (uncertainty) of the selections of a population whose states are reduced to five alternatives.

This work is organized as follows. In section 6.2, we analyze the approach of statistical entropy to five-star ratings and the statistical distributions resulting from the max entropy principle. In section 6.3, we apply entropy maximization subject to the expected value and variance constraints for cases of k = 2,3,4,5,6,...,30 response items in a rating system. In section 6.4, we exhibit the implications of the entropy approach for real five-star rating data using a large 1,000-sample rating data set. Finally, in section 6.5, we provide concluding remarks and discuss future directions.

6.2. Entropy framework to five-star ratings

Let N raters respond to a five-star rating about an entity, with the response items transformed to integer values i = 0,1,2,3,4. Let $N_0, N_1, N_2, N_3, N_4$ represent the number of raters choosing 0,1,2,3,4, respectively, where $N_0+N_1+N_2+N_3+N_4 = N$. Then the relative frequencies (%) are the empirical probabilities $p_i = N_i/N$, i = 0,1,2,3,4. Define a random variable (vector) $\vec{X} = (X_0, X_1, X_2, X_3, X_4)$ consisting of the numbers of raters giving each of the specific rates 0,1,2,3,4. If the probabilities of the response items are $\vec{p} = (p_0, p_1, p_2, p_3, p_4)$, then $\vec{X}$ follows a multinomial statistical distribution, $\vec{X} \sim \mathrm{Mult}(\vec{X}, N, \vec{p})$. The probability that $\vec{X} = (X_0, X_1, X_2, X_3, X_4)$ takes a particular value $\vec{x} = (N_0, N_1, N_2, N_3, N_4)$ is given by the multinomial distribution $\mathrm{Mult}(\vec{X} = \vec{x}, N, \vec{p})$.

The $\mathrm{Mult}(X_0=N_0, X_1=N_1, X_2=N_2, X_3=N_3, X_4=N_4, N, p_0, p_1, p_2, p_3, p_4)$ pdf is defined by:

$$\mathrm{Mult}(\vec{X}=\vec{x}, N, \vec{p}) = P(X_0=N_0, X_1=N_1, X_2=N_2, X_3=N_3, X_4=N_4, N, p_0, p_1, p_2, p_3, p_4) = \frac{N!}{N_0!\,N_1!\,N_2!\,N_3!\,N_4!}\; p_0^{N_0}\, p_1^{N_1}\, p_2^{N_2}\, p_3^{N_3}\, p_4^{N_4} \qquad [6.1]$$

where $p_0, p_1, p_2, p_3, p_4$ are the probabilities of each rate 0,1,2,3,4. The number of unique permutations of ratings is given by:

$$W = \frac{N!}{N_0!\,N_1!\,N_2!\,N_3!\,N_4!} = \frac{N!}{\prod_{i=0}^{4} N_i!}$$

The factorial term N! in the definition of W means that W grows extremely fast as N increases, so the permutations (possible states of rating) are practically impossible to enumerate. The rating data of Table 6.1, which contains three samples (#1, #2, #3) of five-star rating data, are used to show what those definitions mean in practice. The left part of Table 6.1 presents the available rating data, and the resulting frequencies $N_i/N$ are shown on the right side.

          Ratings N_i                              Freq. p_i = N_i/N
Entity    1*    2*    3*    4*    5*    Sum N      p_0      p_1      p_2      p_3      p_4
#1         1     1    11    19    13      45       2.2%     2.2%    24.4%    42.2%    28.9%
#2         6     2     8    23    31      70       8.6%     2.9%    11.4%    32.9%    44.3%
#3        12    31    26    26    15     110      10.9%    28.2%    23.6%    23.6%    13.6%

Table 6.1. Sample five-star rating data

In Table 6.2, we calculate the basic statistics of the Table 6.1 data: expected value E(X), normalized expected value E(X)% and variance Var(X). The factorial N! shows that even for those small values of N, the number of possible permutations (states of the rating system) is extremely large. In the last column, the $\mathrm{Mult}(N_i, N, p_i)$ probability of observing each particular state of the rating is extremely low because of the extremely large number of possible combinations (states).

Entity    E(X)    E(X)%     Var(X)    N!             W = N!/∏ N_i!    Mult(N_i, N, p_i)
#1        2.93    73.33%    0.82      1.2 × 10^56    4.0 × 10^21      6.9 × 10^-25
#2        3.01    75.36%    1.44      1.2 × 10^100   9.7 × 10^35      7.9 × 10^-40
#3        2.01    50.23%    1.50      1.6 × 10^178   1.9 × 10^70      7.1 × 10^-75

Table 6.2. Sample five-star rating data statistics and probabilities
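The first columns of Table 6.2 can be recomputed directly from the counts of Table 6.1. The following is a minimal sketch, not the author's code: the names `ratings` and `rating_stats` are hypothetical, and the rates are coded 0..4 as in the text. It uses exact integer arithmetic for the factorials.

```python
from math import factorial, prod

# Rating counts N_0..N_4 (1* to 5*) for the three entities of Table 6.1.
ratings = {
    "#1": [1, 1, 11, 19, 13],
    "#2": [6, 2, 8, 23, 31],
    "#3": [12, 31, 26, 26, 15],
}

def rating_stats(counts):
    """Return (N, E(X), E(X)/k, Var(X), W) for counts over the values 0..k."""
    n = sum(counts)
    k = len(counts) - 1
    probs = [c / n for c in counts]
    mean = sum(i * p for i, p in enumerate(probs))
    var = sum(i * i * p for i, p in enumerate(probs)) - mean ** 2
    # Multinomial coefficient W = N! / (N_0! * ... * N_k!), exact integers.
    w = factorial(n) // prod(factorial(c) for c in counts)
    return n, mean, mean / k, var, w

for name, counts in ratings.items():
    n, mean, mean_norm, var, w = rating_stats(counts)
    print(f"{name}: N={n} E(X)={mean:.2f} E(X)%={mean_norm:.2%} "
          f"Var(X)={var:.2f} N!={float(factorial(n)):.1e} W={float(w):.1e}")
```

Dividing E(X) by k = 4 gives the normalized expected value E(X)% of the table.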

Each combination of probabilities (W column of Table 6.2) represents a “state” of the rating system. The extremely small probabilities shown in the last column of Table 6.2 mean that it is practically impossible to observe any particular one of those states, which raises the question of which is the most “possible” state of the system. The most probable state is the one with the highest probability calculated by the multinomial distribution definition $\mathrm{Mult}(X_i, N, p_i)$ (equation [6.1]). Because those probabilities depend on W (the multinomial coefficient), one may think of maximizing W to find the most “logical” (expected) state of the system. Jaynes (1957) proposed maximizing log W / N in place of W, which leads to the definition of Shannon’s (1948) entropy maximization. The main derivation is relatively simple, starting from:

$$\frac{\log W}{N} = \frac{1}{N}\log\frac{N!}{N_0!\,N_1!\,N_2!\,N_3!\,N_4!} = \frac{1}{N}\log\frac{N!}{(Np_0)!\,(Np_1)!\,(Np_2)!\,(Np_3)!\,(Np_4)!} = \frac{1}{N}\left[\log N! - \log\big((Np_0)!\,(Np_1)!\,(Np_2)!\,(Np_3)!\,(Np_4)!\big)\right]$$

If N is large enough (N → ∞), using Stirling’s approximation $\log N! \approx N\log N - N$, after a few algebraic steps the result is:

$$\frac{\log W}{N} = p_0\log\frac{1}{p_0} + p_1\log\frac{1}{p_1} + p_2\log\frac{1}{p_2} + p_3\log\frac{1}{p_3} + p_4\log\frac{1}{p_4} = \sum_{i=0}^{4} p_i\log\frac{1}{p_i}$$

which is the definition of Shannon’s entropy:

$$H = \sum_{i=0}^{k} p_i\log\left(\frac{1}{p_i}\right) \qquad [6.2]$$
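The limit behind this derivation can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the chapter: it keeps the frequencies of entity #1 of Table 6.1 fixed, scales all counts by a common factor so that N grows, and compares log W / N with the Shannon entropy H. The function names are hypothetical, and `math.lgamma(n + 1)` is used as log(n!) to avoid huge integers.

```python
from math import lgamma, log

def log_w_over_n(counts):
    """log(W)/N with W = N!/(N_0!...N_k!), using lgamma(n + 1) = log(n!)."""
    n = sum(counts)
    log_w = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return log_w / n

def shannon_entropy(probs):
    """H = sum p_i * log(1/p_i) in nats; zero-probability terms contribute 0."""
    return sum(p * log(1.0 / p) for p in probs if p > 0)

base = [1, 1, 11, 19, 13]                 # entity #1 of Table 6.1
probs = [c / sum(base) for c in base]
h = shannon_entropy(probs)

# Scale every count by the same factor: the p_i stay fixed while N grows.
for scale in (1, 10, 100, 1000, 10000):
    counts = [c * scale for c in base]
    print(scale, round(log_w_over_n(counts), 4), round(h, 4))
```

For small N the two quantities differ noticeably, and the gap shrinks as N increases, which is the sense in which the equality holds for N → ∞.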


Finding the most probable state of the system is “equivalent” to maximizing the entropy of the system with respect to the probabilities. In practice, the statistical distribution of the probabilities $p_i$ is found by maximizing the entropy defined by equation [6.2]. The order and randomness (uncertainty) of ratings can be described by the entropy value calculated by this definition. Shannon’s entropy was originally derived for discrete systems using the multinomial distribution approach presented above. Introducing the principle of max entropy makes it possible to find statistical distributions that are more specific than the generic multinomial, under logical constraints (prior information) on the system (data).

The most common result is that the MED of $\max H(p_i) = \max \sum_{i=0}^{k} p_i\log(1/p_i)$ without any constraint is $p_i = p_0 = p_1 = \ldots = p_k = \frac{1}{k+1}$, the discrete uniform distribution. The idea is simple: without any information about the “structure” (probabilities) of the system, the most “possible” distribution of $p_i$ is the one with equal values. To derive this result by maximizing the entropy H, the logical constraint that the probabilities sum to 1 (second axiom of probabilities) is used, i.e. $\sum_{i=0}^{k} p_i = 1$. Traditionally, this is called “unconstrained” entropy maximization. The full “unconstrained” optimization (entropy H maximization) problem is:

$$\max H(p_i) = \max \sum_{i=0}^{k} p_i\log\left(\frac{1}{p_i}\right)$$

under the constraint that the sum of probabilities is equal to 1: $\sum_{i=0}^{k} p_i = 1$.

SOLUTION.– $p_i = p_0 = p_1 = \ldots = p_k = \frac{1}{k+1}$, the uniform distribution.

The solution is derived using the Lagrange method and maximizing the Lagrangian:

$$L(p_i, \lambda) = \sum_{i=0}^{k} p_i\log\left(\frac{1}{p_i}\right) + \lambda\left(\sum_{i=0}^{k} p_i - 1\right) \qquad [6.3]$$

using the partial derivative conditions $\frac{\partial L}{\partial p_i} = 0$ for optimization (λ is the Lagrange multiplier).
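The stationarity condition can be written out as a short worked step. This is a sketch added for clarity, assuming natural logarithms; it is not reproduced from the chapter.

```latex
\frac{\partial L}{\partial p_i}
  = \log\frac{1}{p_i} - 1 + \lambda = 0
  \;\Longrightarrow\;
  p_i = e^{\lambda - 1}, \qquad i = 0, 1, \ldots, k

\sum_{i=0}^{k} p_i = (k+1)\,e^{\lambda - 1} = 1
  \;\Longrightarrow\;
  p_i = \frac{1}{k+1}, \qquad
  H_{\max} = \sum_{i=0}^{k} \frac{1}{k+1}\log(k+1) = \log(k+1)
```

Every $p_i$ satisfies the same condition, so all probabilities are equal, and the constraint fixes their common value at 1/(k+1).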

It is also known (Jaynes 1957) that the solution of $\max \sum_{i=0}^{k} p_i\log(1/p_i)$ subject to the expected value constraint, E(X) = μ or $\sum_{i=0}^{k} i\,p_i = \mu$, is an exponential distribution. Among all the discrete distributions supported on the set $\{x_1, \ldots, x_k\}$ with a specified mean μ, the MED has the following form:

$$p_i = P(X = x_i) = C\,r^{i}, \quad i = 0, 1, 2, 3, \ldots \qquad [6.4]$$


where the values of C and r are determined by the constraints. This distribution is of exponential type; for discrete data, it is the discrete geometric distribution. The probabilities of the discrete geometric distribution are given by:

$$G(p) = (1-p)^{k-1}\,p, \quad k = 1, 2, \ldots, \infty \qquad [6.5]$$

where the values of k range from 1 to ∞. To restrict the range to (0, k), the truncated geometric Geo(k,p) is appropriate, defined by:

$$\mathrm{Geo}(k, p) = P(X = x) = \frac{p\,(1-p)^{x}}{1-(1-p)^{k+1}}, \quad x = 0, 1, 2, \ldots, k \qquad [6.6]$$
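A minimal sketch of the truncated geometric probabilities follows, assuming that p denotes the parameter of equation [6.6] and that the normalizing constant is the sum of the k+1 terms p(1−p)^x for x = 0..k. The function name `truncated_geometric` and the value p = 0.3 are illustrative choices, not taken from the chapter.

```python
def truncated_geometric(k, p):
    """pmf on x = 0..k: p(1-p)^x normalized by the sum of the k+1 terms."""
    weights = [p * (1.0 - p) ** x for x in range(k + 1)]
    total = sum(weights)            # geometric sum, equal to 1 - (1 - p)**(k + 1)
    return [w / total for w in weights]

pmf = truncated_geometric(4, 0.3)   # five-star case k = 4, illustrative p
ratios = [pmf[x + 1] / pmf[x] for x in range(4)]
print([round(v, 4) for v in pmf])
print([round(r, 4) for r in ratios])   # successive ratios are all equal to 1 - p
```

The constant ratio between successive probabilities is the property that identifies a geometric solution in the numerical approach.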

The denominator is the sum of the k+1 probabilities. In the commonly used “functional approach” to max entropy, the final solution is the functional form of a specific distribution pdf, obtained by the Lagrangian optimization procedure (partial derivatives). What remains is to identify the specific distribution function and its parameter values, i.e. the parameters C and r of equation [6.4]. In this work, the problem is approached from another point of view, the “numerical” approach. The max entropy problem is viewed as a constrained nonlinear optimization (programming) problem solved by nonlinear optimization software tools (Lindo). With this approach, the functional form of the solution (distribution function) is not required; the specific values $p_i$ of the distribution pdf are calculated directly. The distribution is then identified from properties of the pdf values, e.g. a constant ratio $p_{i+1}/p_i = r$ indicates that the solution $p_i$ follows a geometric distribution. This “numerical” nonlinear programming approach is equivalent to the “functional” approach. In optimization, the Lagrange multipliers λ are the values of the dual problem variables (Golan 2008). The main difference is that the Lagrange method determines the distribution function pdf first and then the specific $p_i$ values, whereas direct numerical optimization calculates the specific $p_i$ values as the values of the maximization variables.

The main difficulty with nonlinear programming maximization is convergence to a global maximum, because of possible local maxima. When the max entropy problem is solved by nonlinear programming for all possible values of a constraint, i.e. all possible values of E(X), the probabilities $p_i$ are determined numerically through a step-by-step procedure. Because each step in the value of E(X) is very close to the previous one, the problem of a good starting value is handled by using the solution of the previous step as the starting value for the next step. A good starting value at the beginning of this numerical approach is the uniform distribution ($p_i = 1/(k+1)$), the unique solution of the “unconstrained” problem.

In rating systems and data, the probabilities $p_0, p_1, \ldots, p_k$ are associated with the values of the rates 0,1,2,...,k. Let us define a random variable (r.v.) R taking values r = 0,1,2,...,k, each with probability $p_0, p_1, \ldots, p_k$. Then $P(R = r) = p_r$, r = 0,1,2,...,k, is called the rating distribution. It is not necessary to define any specific functional form $P(R = r) = p_r = f(r, k, \ldots)$; the pdf will be determined by the max entropy principle subject to constraints. However, the intervals over which the numerical approach is applied must be defined.

The expected value (average) E(R) = μ of the r.v. R is defined by $E(R) = \sum_{i=0}^{k} i\,p_i = \mu$. From this definition, the minimum value is E(R) = 0, when $p_0 = 1$, and the maximum value is E(R) = k, when $p_k = 1$. Thus, the expected value E(R) varies in the interval [0,k], and the “normalized” expected value E(R)/k always varies in the interval [0,1]. Let p = E(R)/k; then E(R) = kp for some p in the interval [0,1].

The variance of the r.v. R is defined by $Var(R) = E(R^2) - [E(R)]^2$, where $E(R^2) = \sum_{i=0}^{k} i^2 p_i$. From the definition, the minimum value $E(R^2) = 0$ occurs when $p_0 = 1$ and the maximum value $E(R^2) = k^2$ occurs when $p_k = 1$. Thus, $E(R^2)$ varies in the interval $[0, k^2]$ and the normalized value $E(R^2)/k^2$ varies in the interval [0,1]. Let $p = E(R^2)/k^2$, where p varies in the interval [0,1], so that $E(R^2) = k^2 p$. Substituting $E(R^2) = k^2 p$ into the variance definition gives $Var(R) = E(R^2) - [E(R)]^2 = k^2 p - (kp)^2 = k^2 p(1-p)$, and the “normalized” variance is $Var(R)/k = kp(1-p)$.
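The step-by-step sweep over values of E(R) described above can be imitated without a nonlinear programming package. The sketch below is a simplified stand-in, not the Lindo-based procedure of the chapter: it assumes the geometric form $p_i = C r^i$ of equation [6.4] and finds r by bisection so that the mean matches a target μ. The function name `med_given_mean` and the grid of 19 interior points are hypothetical choices.

```python
def med_given_mean(k, mu, tol=1e-12):
    """Probabilities p_i proportional to r**i on 0..k with mean mu (0 < mu < k)."""
    def mean_for(r):
        weights = [r ** i for i in range(k + 1)]
        return sum(i * w for i, w in enumerate(weights)) / sum(weights)

    lo, hi = 0.0, 1.0
    while mean_for(hi) < mu:          # widen the bracket until it holds the root
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < mu:        # the mean increases with r
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    weights = [r ** i for i in range(k + 1)]
    total = sum(weights)
    return [w / total for w in weights], r

k = 4                                  # five-star ratings coded 0..4
for step in range(1, 20):              # normalized mean E(R)/k = 5%, ..., 95%
    mu = k * step / 20
    probs, r = med_given_mean(k, mu)
    print(f"E(R)/k={step / 20:.2f} r={r:.4f} p={[round(q, 4) for q in probs]}")
```

Because this one-dimensional root search has a single solution for each μ, no warm start is needed here; the warm-start idea matters for the general solver used in the chapter.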

Those derivations for the random variable of rating R, E(R)/k = p and Var(R)/k = kp(1 − p), will be useful in the following sections to compare results and create “normalized” charts for different values of k = 1, 2, 3, ... (k + 1 is the number of outcomes in the rating). As stated before, our main goal is to find the specific discrete distribution that results from entropy maximization under constraints for our random variable R of ratings. The five-star ratings correspond to the case k = 4 (5 outcomes, i = 0,1,2,3,4). In section 6.3, we present analytically the cases starting from the minimum possible value k = 1 and then each value k = 2,3,4, to detect the behavior of the max entropy approach step by step. The extension to larger values k = 5,6,...,29 follows from the analysis of the k ≤ 4 cases.


6.3. Maximum entropy of ratings for values k = 1,2,3,...,30

In this section, we present the derivations of MEDs subject to (s.t.) constraints, starting from the simplest case k = 1 (two outcomes: 0,1). The presentation of all the cases k = 1, 2, 3, 4 exhibits the behavior of entropy maximization as the “complexity” of the rating system increases. For each value of k, three cases of MED are examined: (1) “unconstrained”, (2) subject to the expected value constraint E(R) = μ and (3) subject to the expected value E(R) = μ and variance Var(R) constraints.

6.3.1. Ratings with two outcomes (k = 1)

When k = 1, the two possible values of the random variable of rating R are 0 (fail) with probability $p_0 = 1-p$ and 1 (success) with probability $p_1 = p$. A typical example of this case is the result of a coin toss (head: success→1, tail: fail→0). A modern rating example is the very popular “like” in social media. For a social media “post”, the followers who like the post are interpreted as successes (1) and the rest of the followers as fails (0) of the r.v. R.

6.3.1.1. Unconstrained entropy maximization of rating k = 1

The “unconstrained” max entropy problem is the following:

$$\max H(p_i) = \max \sum_{i=0}^{k} p_i\log\left(\frac{1}{p_i}\right) = \max\left[p_0\log\frac{1}{p_0} + p_1\log\frac{1}{p_1}\right]$$

under the constraint that the sum of probabilities is equal to 1, $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 = 1$.

The well-known solution is $p_0 = p_1 = p = \frac{1}{2}$, a discrete uniform distribution. The value of the max entropy $H_{max}$ is:

$$H_{max} = H(p_0 = 1/2, p_1 = 1/2) = p_0\log\frac{1}{p_0} + p_1\log\frac{1}{p_1} = \frac{1}{2}\log\frac{1}{1/2} + \frac{1}{2}\log\frac{1}{1/2} = \log(2)$$

and, using natural logarithms, $H_{max} = \ln 2 = 0.693$.

Because the max entropy is $H_{max} = \ln(2)$, all entropy values are normalized by dividing by ln 2. For example, the normalized entropy $H^N$ of the rating $(p_0 = 1/3, p_1 = 2/3)$ is $H^N(1/3, 2/3) = H(1/3, 2/3)/\ln 2 = 0.636/0.693 = 91.83\%$ of the max entropy of a rating with two outcomes. Observing that in this case the expected value of the r.v. R is $E(R) = p_0\cdot 0 + p_1\cdot 1 = p_1$, it is possible to create a chart with the probability $p = p_1 = E(R)$ on the x-axis and the normalized entropy $H^N(p_0, p_1)$ on the vertical y-axis. This chart of normalized entropy % $H^N$ versus p = E(R) is presented in Figure 6.1.
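The normalized entropy for two outcomes is easy to reproduce. The following sketch uses natural logarithms and a hypothetical helper named `normalized_entropy`; it evaluates the example above and tabulates points of the curve of normalized entropy versus p = E(R).

```python
from math import log

def normalized_entropy(probs):
    """Shannon entropy (nats) divided by its maximum log(k+1) for k+1 outcomes."""
    h = sum(p * log(1.0 / p) for p in probs if p > 0)
    return h / log(len(probs))

print(f"{normalized_entropy([1 / 2, 1 / 2]):.2%}")   # uniform case
print(f"{normalized_entropy([1 / 3, 2 / 3]):.2%}")   # example from the text

# Points of the two-outcome entropy curve: x = p = E(R), y = H^N(1 - p, p).
curve = [(j / 20, normalized_entropy([1 - j / 20, j / 20])) for j in range(21)]
for p, hn in curve:
    print(f"p={p:.2f}  H^N={hn:.4f}")
```

Terms with zero probability are skipped, which matches the convention that they contribute nothing to the entropy, so the two end points of the curve evaluate to 0.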


Figure 6.1. Normalized entropy of ratings k = 1. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip

Shannon (1948) introduced the idea of information entropy and used this chart for the case of two outcomes, with the probability of success p on the x-axis, which is identical to E(R) = p on the x-axis of Figure 6.1. The unconstrained max normalized entropy $H^N(p_0 = 1/2, p_1 = 1/2) = 100\%$ is the point at the top (vertex) of the upside-down parabolic curve, shown by a green square at the x-axis value E(R) = p = 50%. At this point (case), the total rating population is divided into two subgroups of 50% each. The two points where the blue curve intersects the horizontal x-axis have zero entropy. On the left side is the point $[E(R) = 0, H^N(p_0 = 100\%, p_1 = 0\%) = 0]$, where all raters select 0. On the right side is the point $[E(R) = 1, H^N(p_0 = 0\%, p_1 = 100\%) = 0]$, where all raters select 1. In both cases, there is no uncertainty, so the entropy (uncertainty) is $H = H^N = 0$.

6.3.1.2. Constrained entropy maximization of rating k = 1 s.t. E(R) = μ

In this case, we add the E(R) = μ constraint to the previous case of “unconstrained” max entropy (section 6.3.1.1):

$$\max H(p_i) = \max \sum_{i=0}^{k} p_i\log\left(\frac{1}{p_i}\right) = \max\left[p_0\log\frac{1}{p_0} + p_1\log\frac{1}{p_1}\right]$$


under constraints C1, C2:

1) sum of probabilities equal to 1, $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 = 1$;

2) expected value $E(R) = \mu = kp = 1\cdot p = p$ ⇒ $E(R) = p_0\cdot 0 + p_1\cdot 1 = p$ ⇒ $p_1 = p$.

C2 gives $p_1 = p$; substituting into C1, $p_0 + p_1 = 1$, gives $p_0 = 1-p$. Therefore, the constraints C1 and C2 result in a unique solution, which is the MED for the system s.t. the expected value E(R) of ratings. By the Lagrangian for the general case (equation [6.4]), the solution has the form $p_i = P(R = i) = C\cdot r^i$, i = 0,1, a truncated geometric Geo(1,p) distribution. Then $p_0 = P(R=0) = C\cdot r^0 = C$ and $p_1 = P(R=1) = C\cdot r^1 = C\cdot r$. The values of C and r are calculated from the constraints C1, C2: $p_0 = 1-p$ and $p_1 = p = \mu$; substituting gives $C = 1-p$ and $r = p_1/C = p/(1-p)$. Note that it also holds that:

$$p_0 = P(R=0) = C = 1-p = p^0(1-p)^1 = \binom{k}{x}p^x(1-p)^{k-x}, \quad k = 1,\ x = 0$$

$$p_1 = P(R=1) = Cr = p = p^1(1-p)^0 = \binom{k}{x}p^x(1-p)^{k-x}, \quad k = 1,\ x = 1$$

equal to the probabilities of the binomial distribution Bin(1,p). Thus, for k = 1, the MED s.t. the E(R) = μ constraint is the binomial distribution Bin(k=1,p), with p being the probability of success and k = 1 the number of trials. It is known that Bin(1,p) is the Bernoulli distribution Ber(p). In Figure 6.1, the blue upside-down parabolic curve is the max entropy curve for the associated value of E(R)/k = p on the x-axis. This max entropy curve represents four MEDs: (1) the unconstrained max entropy point at the vertex $H^N(50\%, 50\%) = 100\%$, representing the uniform distribution, (2) the binomial Bin(1,p) distribution, which is identical to (3) the Bernoulli Ber(p), and (4) the truncated geometric distribution Geo(1,p) of “standard” max entropy s.t. the E(R) constraint. Because every entropy value of ratings lies on that curve, it is the entropy limit set of ratings for k = 1.

It is possible to find the max entropy curve under the E(R) constraint numerically. The normalized expected value of the r.v. R, $E(R)/k = E(R)/1 = p_1$, varies in the interval [0,1]. Using a nonlinear optimization software tool for values $E(R) = p_1$ = 0%, 1%, ..., 100%, the solution of each entropy maximization problem is a point on the max entropy curve with respect to the E(R) value. The red circles on the max entropy curve in Figure 6.1 indicate that the numerically computed entropy maximization solutions are points of the curve; they are the max entropy solutions calculated for $E(R) = p_1 = p$ values 0%, 5%, ..., 100%. The entropy curve is symmetric by the entropy definition, $H(p_0, p_1) = H(p_1, p_0)$, e.g. H(70%, 30%) = H(30%, 70%), meaning that in both cases the level of uncertainty (measured by entropy) is identical; the population is divided into two subgroups of 70–30%.

6.3.1.3. Constrained entropy maximization of rating k = 1 s.t. E(R) = μ and Var(R) = σ²

In this case, we add the Var(R) = σ² constraint to the previous case of section 6.3.1.2, $\max H(p_i) = \max \sum_{i=0}^{k} p_i\log(1/p_i)$, under constraints C1, C2, C3:

1) sum of probabilities equal to 1, $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 = 1$;

2) expected value $E(R) = \mu = p$ ⇒ $E(R) = p_0\cdot 0 + p_1\cdot 1 = p$ ⇒ $p_1 = p$;

3) variance $Var(R) = \sigma^2$ ⇒ $Var(R) = \sum_{i=0}^{k} i^2 p_i - \mu^2 = p_0\cdot 0^2 + p_1\cdot 1^2 - p^2 = p_1 - p^2 = p(1-p)$.

The first two constraints C1 and C2 result in the previously presented solution: $p_0 = 1-p$ and $p_1 = p$. This solution satisfies the added C3 constraint. For k = 1, the MED s.t. the E(R) constraint is therefore also the MED subject to E(R) and Var(R). In this case, the geometric Geo(1,p), binomial Bin(1,p) and Bernoulli Ber(p) distributions are identical.

6.3.2. Ratings with three outcomes (k = 2)

When k = 2, the three possible values of the r.v. R are 0 with probability $p_0$, 1 with probability $p_1$ and 2 with probability $p_2$. A modern rating example of this case is a rating system with three alternatives: dislike, neutral, like (typically named “bad”, “neutral”, “good”). This scheme is used in YouTube video ratings: viewers can choose “I like this” (rate: 2) or “I dislike this” (rate: 0), and the rest of the video viewers are treated as neutral (rate: 1).

6.3.2.1. Unconstrained entropy maximization of rating k = 2

The “unconstrained” max entropy problem in this case is:

$$\max H(p_i) = \max \sum_{i=0}^{k} p_i\log\left(\frac{1}{p_i}\right) = \max\left[p_0\log\frac{1}{p_0} + p_1\log\frac{1}{p_1} + p_2\log\frac{1}{p_2}\right]$$

under the constraint that the sum of probabilities is equal to 1, $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 + p_2 = 1$.

SOLUTION.– $p_0 = p_1 = p_2 = p = \frac{1}{3}$, the discrete uniform distribution. The value of the max entropy is $H_{max} = H(p_0 = 1/3, p_1 = 1/3, p_2 = 1/3) = \ln(3) = 1.0986$.

For normalization, entropy values are divided by ln(k+1) = ln(3), e.g. $H^N(1/2, 1/4, 1/4) = H(1/2, 1/4, 1/4)/\ln(3) = 1.0397/1.0986 = 94.64\%$ of the max entropy of the system. In this case, the r.v. R takes the values 0,1,2, and E(R) is normalized to the [0,1] interval by dividing by k = 2. The unconstrained max entropy point has the x-axis value E(R)/2 = 1/2 and the y-axis normalized entropy value $H^N = 100\%$. In Figure 6.2, this point of “unconstrained” max entropy (uniform distribution) is shown by a green square on top of the red parabolic curve.

In Figure 6.2, the green dotted line named “Iso-entropic ln(2)/ln(3)” appears with three points (indicated by green diamond-shaped markers) at the x-axis values 25%, 50% and 75%. This iso-entropic line corresponds to rating cases where raters select only two of the three available rates. The normalized entropy value is $H^N(1/2, 1/2, 0) = H(1/2, 1/2)/\ln 3 = \ln 2/\ln 3 = 63.093\%$. The first iso-entropic point, labeled {01}, is the conditional unconstrained max entropy $H(p_0, p_1, p_2 \mid p_2 = 0)$, i.e. the max entropy of the rating given that the probability $p_2 = 0$. It is a uniform distribution over the rates 0 and 1, with 50% of the raters in each rate. The other two iso-entropic points are $H(p_0, p_1, p_2 \mid p_1 = 0)$, labeled {02}, and $H(p_0, p_1, p_2 \mid p_0 = 0)$, labeled {12}, in Figure 6.2. Thus, in the case of rating with k = 2, the previous case of rating with k = 1 is present as a conditional case (unstable solution).

6.3.2.2. Constrained entropy maximization of rating k = 2 s.t. E(R) = μ

In this case, we add the E(R) = μ constraint to the previous unconstrained problem (section 6.3.2.1), $\max H(p_i)$, under constraints C1, C2:

1) sum of probabilities equal to 1, $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 + p_2 = 1$;

2) expected value $E(R) = \mu = p_0\cdot 0 + p_1\cdot 1 + p_2\cdot 2 = p_1 + 2p_2$ ⇒ $p_1 + 2p_2 = \mu$.

By the Lagrangian, the solution is $p_i = P(R = i) = C\cdot r^i$ for i = 0,1,2, a truncated geometric Geo(2,p) distribution. Then $p_0 = P(R=0) = C\cdot r^0 = C$, $p_1 = P(R=1) = C\cdot r^1 = C\cdot r$ and $p_2 = P(R=2) = C\cdot r^2$, with the values of C and r determined by the system of equations given by the C1, C2 constraints.

Figure 6.2. Normalized Entropy of Rating k=2. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip
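For k = 2, the geometric solution $p_i = C r^i$ of section 6.3.2.2 can be obtained in closed form. The sketch below is an illustration under stated assumptions, not the numerical tool used for the figure: eliminating C from the two constraints gives the quadratic $(2-\mu)r^2 + (1-\mu)r - \mu = 0$, whose positive root is r. The names `med_given_mean_k2` and `normalized_entropy` are hypothetical.

```python
from math import log, sqrt

def med_given_mean_k2(mu):
    """Solve p_i = C*r**i (i = 0,1,2), sum p_i = 1, p1 + 2*p2 = mu, 0 < mu < 2."""
    # Eliminating C gives (2 - mu) r^2 + (1 - mu) r - mu = 0; take the positive root.
    a, b, c = 2.0 - mu, 1.0 - mu, -mu
    r = (-b + sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    C = 1.0 / (1.0 + r + r * r)
    return [C, C * r, C * r * r], r

def normalized_entropy(probs):
    """Shannon entropy (nats) divided by log(number of outcomes)."""
    h = sum(p * log(1.0 / p) for p in probs if p > 0)
    return h / log(len(probs))

for mu in (0.5, 1.0, 1.5):
    probs, r = med_given_mean_k2(mu)
    print(f"E(R)/k={mu / 2:.2f} r={r:.4f} p={[round(q, 4) for q in probs]} "
          f"H^N={normalized_entropy(probs):.2%}")

print(f"{normalized_entropy([0.5, 0.25, 0.25]):.2%}")  # example of section 6.3.2.1
print(f"{normalized_entropy([0.5, 0.5, 0.0]):.3%}")    # iso-entropic value ln2/ln3
```

At μ = 1 (E(R)/k = 50%) the root is r = 1 and the solution reduces to the uniform distribution.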

In Figure 6.2 the values of this max entropy curve are the points of red parabolic curve with red circles, calculated numerically by a nonlinear optimization tool. We applied the same numerical approach of k=1, using 101 equally spaced values of E(R)/k in [0,1] interval. In each maximization the result is the max entropy value (objective function value) and the values of 3 variables p0 , p1 , p2 for the associated E(R)/k value on the x-axis. The calculated values of p0 , p1 , p2 are probabilities of a truncated Geometric distribution Geo(2,p), certified by the constant rate p1 /p0 =p2 /p1 =r. The “unconstrained” max entropy point of Uniform distribution is the max entropy s.t. E(R)/k=50%, then the distribution value Geo(2,p=50%) also lies on the red curve. Note that it is the point where the monotonic (unimodal) Geo(2,p) rate is r=p1 /p0 =p2 /p1 =1, the inflection point. Before that point (for p p1 > p2 decrease, after r > 1 increasing values p0 < p1 < p2 . In Figure 6.2, below (or “inside”) the blue Binomial Bin(2,p) curve there are 3 green parabolic curves named “Maximum Entropy H|pi =0” i=0,1,2. Each one is a conditional max entropy curve s.t. E(R), i.e. max entropy given that 1 of the 3 rates of


probability is zero, $H(p_0, p_1, p_2 \mid p_i = 0)$, when $p_0 = 0$ or $p_1 = 0$ or $p_2 = 0$. The three curves are Binomials Bin(k=1,p) defined on different parts of the x-axis. Note that the maximum points (vertices) of those three curves are the iso-entropic points presented in section 6.3.2.1 for the unconstrained case.

6.3.2.3. Constrained Entropy Maximization of Rating k=2 s.t. E(R)=μ and Var(R)=σ²

In this case the Var(R)=σ² constraint is added: Max[$H(p_i)$] under the constraints:
1) $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 + p_2 = 1$;
2) $E(R) = \mu = 0 \cdot p_0 + 1 \cdot p_1 + 2 \cdot p_2 = p_1 + 2p_2$, so $p_1 + 2p_2 = \mu = kp = 2p$;
3) $Var(R) = \sigma^2 = \sum_{i=0}^{k} i^2 p_i - \mu^2 = 0^2 p_0 + 1^2 p_1 + 2^2 p_2 - \mu^2$, so $p_1 + 4p_2 = \sigma^2 + \mu^2 = kp(1-p) + 4p^2$.

The solution must satisfy the constraints, which form the following 3×3 linear system of equations:
$$p_0 + p_1 + p_2 = 1$$
$$p_1 + 2p_2 = 2p$$
$$p_1 + 4p_2 = kp(1-p) + 4p^2$$

The determinant of the coefficient matrix is
$$\mathrm{Det} = \begin{vmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 0 & 1 & 4 \end{vmatrix} = 2 \neq 0,$$
so there is a unique solution.

The solution of the 3×3 system is $\{p_0 = p^2 - 2p + 1,\ p_1 = 2p - 2p^2,\ p_2 = p^2\}$, that is, $\{p_0 = (1-p)^2,\ p_1 = 2p(1-p),\ p_2 = p^2\}$; those values of $p_0$, $p_1$, $p_2$ are the probabilities of the Binomial Bin(k=2,p). Thus, for k=2 the entropy maximization distribution s.t. the E(R)=μ and Var(R)=σ² constraints is the Binomial distribution Bin(k=2,p). In Figure 6.2, the blue parabolic curve is the Binomial Bin(k=2,p). This curve is always below the Geo(2,p) curve, except at the starting and ending values of the x-axis, 0 and 1 respectively, where they are equal.
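The elimination behind this result can be reproduced with exact rational arithmetic. The sketch below is illustrative only: the helper names `det3`, `solve_constraints_k2` and `binomial_k2` are assumptions introduced here. It solves the three constraint equations for k=2 and compares the result with the Bin(2,p) probabilities.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve_constraints_k2(p):
    """Solve p0+p1+p2=1, p1+2*p2=2p, p1+4*p2=2p(1-p)+4p^2 by elimination."""
    p = Fraction(p)
    mean = 2 * p
    second_moment = 2 * p * (1 - p) + 4 * p**2
    p2 = (second_moment - mean) / 2      # third equation minus second gives 2*p2
    p1 = mean - 2 * p2                   # back-substitute into the second equation
    p0 = 1 - p1 - p2                     # back-substitute into the first equation
    return p0, p1, p2


def binomial_k2(p):
    """Probabilities of Bin(2, p)."""
    p = Fraction(p)
    return (1 - p) ** 2, 2 * p * (1 - p), p**2


coefficients = [[1, 1, 1], [0, 1, 2], [0, 1, 4]]
print(det3(coefficients))                       # 2
print(solve_constraints_k2(Fraction(3, 10)) == binomial_k2(Fraction(3, 10)))
```

Using `Fraction` avoids rounding, so the equality test is exact rather than approximate.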


By numerical computation of the maximization problem solution, the values of $p_0$, $p_1$, $p_2$ were found to be identical to those of the Binomial distribution. In Figure 6.2, this numerical solution is represented by the blue star points (*) over the blue line of the Bin(2,p) entropy curve. The three green parabolic curves named “Maximum Entropy H|pi =0”, i=0,1,2, are conditional max entropy curves s.t. E(R) and Var(R), i.e. max entropy given that one of the three rate probabilities is zero, $p_0 = 0$ or $p_1 = 0$ or $p_2 = 0$. The three curves are the Binomials Bin(k=1,p) presented in the previous case of max entropy s.t. E(R). The points of those three curves are conditional max entropy points both s.t. the E(R) constraint and s.t. the E(R) and Var(R) constraints. As noted for the case of k=1 in section 6.3.1, Geo(1,p) is identical to Bin(1,p).

Concluding for the case of rating k=2, the MED s.t. the E(R) constraint is a truncated geometric Geo(2,p), and the MED subject to the E(R) and Var(R) constraints is the binomial Bin(2,p). The area in between those two distribution curves (red geometric and blue binomial in Figure 6.2) is questionable. Recalling the maximization conditions, under the red curve of the truncated geometric lie all the possible distributions for rating data s.t. E(R)/k=p. Under the blue binomial Bin(k=2,p) curve lie all the possible distributions where E(R)/k=p and Var(R)=kp(1-p); above the Bin(2,p) curve, then, lie all the distributions for data with E(R)/k=p and Var(R)>kp(1-p). Data or distributions with a variance greater than the binomial variance are known as overdispersed compared to the binomial distribution.

6.3.3. Ratings with four outcomes (k=3)

When k=3, the four possible values (outcomes) of rating R are denoted as i with probability $p_i$, $P(R=i) = p_i$, i=0,1,2,3. This system of rating with four alternatives is not used in practice because there is no symmetry of responses.

6.3.3.1. Unconstrained entropy maximization of rating k=3

The max entropy principle in this case results in:
$$\max[H(p_i)] = \max\left[\sum_{i=0}^{k=3} p_i \log\left(\frac{1}{p_i}\right)\right], \quad \text{under } \sum_{i=0}^{k} p_i = 1:\ p_0 + p_1 + p_2 + p_3 = 1$$

SOLUTION.– $p_0 = p_1 = p_2 = p_3 = p = 1/4$, the discrete uniform distribution. The value of max entropy is $H_{max} = \ln 4 = 1.386$. A diagram of normalized entropy $H_N$ versus E(R)/k=p is presented in Figure 6.3. The max entropy point (Uniform distribution) is shown as a green square on top of Figure 6.3 (red curve).
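The normalized entropy values and iso-entropic levels quoted in this section can be recomputed directly from their definitions. The following minimal sketch assumes Python and natural logarithms; the function names are illustrative assumptions introduced here, not part of the original analysis.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability rates contribute nothing."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def normalized_entropy(probs):
    """Entropy divided by ln(k+1), where k+1 = len(probs) available rates."""
    return entropy(probs) / math.log(len(probs))


def iso_entropic_level(m, k):
    """Normalized entropy when only m of the k+1 rates are used, uniformly."""
    return math.log(m) / math.log(k + 1)


def iso_entropic_points(m, k):
    """Number of ways to choose which m of the k+1 rates are used."""
    return math.comb(k + 1, m)


print(round(100 * normalized_entropy([0.5, 0.25, 0.25]), 2))   # 94.64
print(round(100 * iso_entropic_level(2, 2), 3))                # 63.093
print(iso_entropic_points(3, 3), iso_entropic_points(2, 3))    # 4 6
```

The last line counts the subsets of rates behind each iso-entropic level for k=3; distinct subsets with the same mean share the same x-axis position.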


Figure 6.3. Normalized entropy of rating k=3. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip

In Figure 6.3, there are two iso-entropic lines. The first line is named “iso-entropic ln(2)/ln(4)” (green dotted line with diamond-shaped markers) and arises when only two of the four available rates are selected. The second line is named “iso-entropic ln(3)/ln(4)” (brown dotted line with triangle markers) and arises when only three of the four available rates are selected by raters. The five points of the first line (two of its iso-entropic points, {03} and {12}, coincide at E(R)/k=50%) and the four points of the second line are special cases of conditional “unconstrained” max entropy of the form: $H(p_0, p_1, p_2, p_3 \mid p_2 = p_3 = 0)$ labeled {01}, $H(p_0, p_1, p_2, p_3 \mid p_1 = p_3 = 0)$ labeled {02}, ..., $H(p_0, p_1, p_2, p_3 \mid p_3 = 0)$ labeled {012}, etc. The details of those iso-entropic points are similar to those analyzed for the previous case of k=2. It must be noted that the number of iso-entropic points is determined by binomial combinatorics. The upper brown line has $\binom{k+1}{3} = \binom{4}{3} = \frac{4!}{3!(4-3)!} = 4$ iso-entropic points, and the lower green line has $\binom{4}{2} = 6$ iso-entropic points.

6.3.3.2. Constrained entropy maximization of rating k=3 s.t. E(R)=μ

The E(R)=μ constraint is added to the prior “unconstrained” case: Max[$H(p_i)$], under the constraints:
1) $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 + p_2 + p_3 = 1$;


2) $E(R) = \mu = p_1 + 2p_2 + 3p_3$.
By Lagrangian the solution is $p_i = P(R=i) = C \cdot r^i$ for i=0,1,2,3, a truncated Geometric Geo(3,p) distribution. In Figure 6.3, the values of the max entropy curve (red parabolic curve with circles) are calculated numerically as explained in the case of k=2. The Geo(3,p) distribution of those $p_0$, $p_1$, $p_2$, $p_3$ values is confirmed by checking that the rate $r = p_{i+1}/p_i$ is constant. The maximum “unconstrained” entropy point (Uniform distribution) also lies on this curve, because it is the max entropy for the normalized expected value E(R)/k=50%.

In Figure 6.3, below (or “inside”) the blue binomial Bin(3,p) curve, there are four brown and six green parabolic curves. Each one is a conditional max entropy curve s.t. E(R), i.e. max entropy given that one (for brown curves) or two (for green curves) of the four rate probabilities are zero. The four brown curves are geometrics Geo(k=2,p), conditional max entropy curves s.t. E(R), mathematically $H(p_0, p_1, p_2, p_3 \mid p_i = 0)$, i=0,1,2,3. The six green curves are binomials Bin(k=1,p), conditional max entropy s.t. E(R), $H(p_0, p_1, p_2, p_3 \mid p_i = 0, p_j = 0)$ for all possible combinations of i,j=0,1,2,3 with i≠j.

6.3.3.3. Constrained entropy maximization of rating k=3 s.t. E(R)=μ and Var(R)=σ²

Here the Var(R) constraint is added to the previous case in section 6.3.3.2. Max[$H(p_i)$] under the constraints:
1) $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 + p_2 + p_3 = 1$;
2) $E(R) = \mu = 0 \cdot p_0 + 1 \cdot p_1 + 2 \cdot p_2 + 3 \cdot p_3 = p_1 + 2p_2 + 3p_3$;
3) $Var(R) = \sigma^2 = \sum_{i=0}^{k} i^2 p_i - \mu^2$, so $p_1 + 4p_2 + 9p_3 = \sigma^2 + \mu^2 = kp(1-p) + 9p^2$.

The solution needs to satisfy all the constraints, which form the following 3×4 linear system:
$$p_0 + p_1 + p_2 + p_3 = 1$$
$$p_1 + 2p_2 + 3p_3 = 3p$$
$$p_1 + 4p_2 + 9p_3 = kp(1-p) + 9p^2$$

The coefficient matrix is
$$A = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 \\ 0 & 1 & 4 & 9 \end{pmatrix}.$$
The first three columns of A have a nonzero determinant, so rank(A)=3; with four unknowns, the system of equations has infinitely many solutions. Maximization finds the solution with max entropy.
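Because rank(A)=3 with four unknowns, the feasible distributions form a one-parameter family, and the maximization can be illustrated with a one-dimensional search. The sketch below is not the authors' nonlinear optimization tool: it assumes Python, uses the vector (-1, 3, -3, 1), which satisfies all three homogeneous equations, as the free direction, and applies a ternary search (valid because entropy is concave along a line). All names are illustrative assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; non-positive entries are skipped."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def binomial_pmf(k, p):
    return [math.comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(k + 1)]


# Adding t times this vector leaves the sum, the mean and the second moment
# of a distribution on {0, 1, 2, 3} unchanged (each row of A annihilates it).
NULL_DIRECTION = (-1.0, 3.0, -3.0, 1.0)


def shifted(p, t):
    """Feasible distribution obtained by moving Bin(3, p) along the free direction."""
    return [b + t * v for b, v in zip(binomial_pmf(3, p), NULL_DIRECTION)]


def best_shift(p, iterations=200):
    """Ternary search for the shift t that maximizes entropy."""
    base = binomial_pmf(3, p)
    lo = max(-base[1] / 3.0, -base[3])    # keeps p1 and p3 nonnegative
    hi = min(base[0], base[2] / 3.0)      # keeps p0 and p2 nonnegative
    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if entropy(shifted(p, left)) < entropy(shifted(p, right)):
            lo = left
        else:
            hi = right
    return (lo + hi) / 2.0


print(best_shift(0.3))   # numerically close to 0: Bin(3, p) itself is the maximizer
```

A shift of essentially zero means that no feasible distribution has higher entropy than Bin(3,p), in line with the numerical result reported for k=3.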


By numerical computation of the maximization problem solution, the values of $p_i$ are identical to those of the binomial distribution Bin(3,p) for all values of p=E(R)/3 (x-axis values). In Figure 6.3, this result is represented by the blue star points (*) over the blue binomial Bin(3,p) entropy curve. For the conditional MED s.t. E(R) and Var(R), the six green curves of binomials Bin(k=1,p) remain. But for each of the four brown curves of geometric Geo(k=2,p), there is a Binomial Bin(k=2,p) curve, which is the conditional max entropy s.t. E(R) and Var(R), $H(p_0, p_1, p_2, p_3 \mid p_i = 0, p_j = 0)$ for all possible combinations of i,j=0,1,2,3 with i≠j. Those Binomial Bin(k=2,p) curves are presented in Figure 6.3 by brown dotted curves just below each geometric Geo(k=2,p) curve.

Summarizing for the case of k=3, the entropy maximization distribution s.t. the E(R)=μ and Var(R)=σ² constraints is the binomial distribution Bin(3,p). In this case, the linear system of constraints is not sufficient to determine a unique solution. The numerical solution values obtained by nonlinear maximization were found to be identical to the Bin(3,p) values.

6.3.4. Ratings with five outcomes (k=4)

When k=4, the five possible values (outcomes) of the r.v. R are i=0,1,2,3,4 with probability $p_i$. This is commonly named the five-star rating system. This system with five alternative responses is regularly used in rating practice. Several million real rating data sets are available. The value of N (responders) varies from a few tens to several million (e.g. Facebook app in Google store N>74,000,000).

6.3.4.1. Unconstrained entropy maximization of rating k=4

The max entropy principle in this case is calculated as:
$$\max[H(p_i)] = \max\left[\sum_{i=0}^{k=4} p_i \log\left(\frac{1}{p_i}\right)\right], \quad \text{under } \sum_{i=0}^{k} p_i = 1:\ p_0 + p_1 + p_2 + p_3 + p_4 = 1$$

SOLUTION.– $p_0 = p_1 = p_2 = p_3 = p_4 = p = 1/5$, the discrete uniform distribution. The max entropy value is $H_{max} = \ln 5 = 1.609$. A chart of normalized entropy $H_N$ versus E(R)/k=p is presented in Figure 6.4. The max entropy point (Uniform distribution) occurs for E(R)/k=50%, where $H_N = 100\%$; it is the point shown as a green square on top of Figure 6.4. The conditional unconstrained max entropy for k=4 gives the three iso-entropic lines (green, brown, purple dotted lines) in Figure 6.4. The upper purple iso-entropic line “ln(4)/ln(5)” crosses the blue binomial curve. k=4 is the first value where this phenomenon occurs in the entropy of ratings. The five points of this line, shown as purple squares in Figure 6.4, are conditional unconstrained max entropy


points. The first one, labeled {0123}, corresponds to max $H_N(p_0, p_1, p_2, p_3 \mid p_4 = 0) = \ln 4/\ln 5 = 86.13\%$. The descriptions of the remaining iso-entropic points are similar (see section 6.3.3.1). The number of iso-entropic points is determined by binomial combinations. For the upper iso-entropic line named “ln(4)/ln(5)” (purple line), the number of iso-entropic points is $\binom{k+1}{4} = \binom{5}{4} = 5$. For the next iso-entropic line, the brown line named “ln(3)/ln(5)”, the number of iso-entropic points is $\binom{5}{3} = 10$. The last, green iso-entropic line “ln(2)/ln(5)” has $\binom{5}{2} = 10$ iso-entropic points.

6.3.4.2. Constrained entropy maximization of rating k=4 s.t. E(R)=μ

Adding the E(R)=μ constraint to the “unconstrained” case in section 6.3.4.1: Max[$H(p_i)$] under the constraints:
1) $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 + p_2 + p_3 + p_4 = 1$;
2) E(R)=μ: $0 \cdot p_0 + 1 \cdot p_1 + 2 \cdot p_2 + 3 \cdot p_3 + 4 \cdot p_4 = \mu$.
By Lagrangian the solution is a truncated Geometric Geo(k=4,p) distribution. In Figure 6.4, the values of the max entropy curve (red parabolic curve with red circles) are calculated numerically by nonlinear programming maximization. The Geo(4,p) distribution of those $p_i$ values is confirmed by the constant rate $r = p_i/p_{i-1}$ for i=1,2,3,4. The max entropy point (Uniform distribution) lies on this curve, because it is the max entropy for E(R)/k=50%. The conditional max entropy s.t. E(R) for k=4 results in the green, brown and purple curves in Figure 6.4. Each one is a geometric Geo(i,p), i=1,2,3, entropy curve that corresponds to an iso-entropic point presented in section 6.3.4.1. For the purple curves, the definition of each curve is max entropy s.t. E(R), e.g. maximum $H_N(p_0, p_1, p_2, p_3 \mid p_4 = 0)$ for the curve with the maximum labeled {0123}, etc.

6.3.4.3. Constrained entropy maximization of rating k=4 s.t. E(R)=μ and Var(R)=σ²

In this case, the Var(R) constraint is added to the previous case in section 6.3.4.2: Max[$H(p_i)$] under the constraints:
1) $\sum_{i=0}^{k} p_i = 1$: $p_0 + p_1 + p_2 + p_3 + p_4 = 1$;
2) $E(R) = \mu = 0 \cdot p_0 + 1 \cdot p_1 + 2 \cdot p_2 + 3 \cdot p_3 + 4 \cdot p_4 = p_1 + 2p_2 + 3p_3 + 4p_4$;
3) $Var(R) = \sigma^2 = \sum_{i=0}^{k} i^2 p_i - \mu^2$, so $p_1 + 4p_2 + 9p_3 + 16p_4 = \sigma^2 + \mu^2 = kp(1-p) + 16p^2$.


The solution must satisfy the constraints, which form the following 3×5 linear system:
$$p_0 + p_1 + p_2 + p_3 + p_4 = 1$$
$$p_1 + 2p_2 + 3p_3 + 4p_4 = 4p$$
$$p_1 + 4p_2 + 9p_3 + 16p_4 = 4p(1-p) + 16p^2$$

Figure 6.4. Normalized entropy of rating k=4. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip



The coefficient matrix is
$$A = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 0 & 1 & 2 & 3 & 4 \\ 0 & 1 & 4 & 9 & 16 \end{pmatrix},$$
with rank(A)=3 (the first three columns have a nonzero determinant), so the system of equations has an infinite number of solutions. Maximization finds the solution (values of $p_i$) with max entropy. By numerical computation of the maximization problem solution, the values of $p_i$ are found to be slightly different from the Binomial distribution Bin(4,p) values. The entropy is a little greater than that of the binomial for every value of p=E(R)/k (x-axis values). Because the differences are very small, Figure 6.5 displays the difference between the entropy of the max entropy distribution s.t. the E(R) and Var(R) constraints and that of the Binomial Bin(k,p) for several values of k=4,5,6,...,29 (see legend of Figure 6.5). The orange curve with circles at the bottom of Figure 6.5 represents the differences of entropy for the case k=4, between the MED s.t. E(R) & Var(R) and


Bin(k=4,p). The curve of differences for k=4 reaches the maximum value of about $\Delta H_N = 0.014\%$
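The size of this gap can be illustrated at the single point p = 1/2. The sketch below rests on assumptions that are not taken from the chapter: it uses Python, restricts attention to p = 1/2, and relies on the observation that at this point the feasible set is symmetric under reversing the rates, so the maximizer lies on the line through Bin(4,1/2) spanned by the symmetric vector (1, -4, 6, -4, 1). A ternary search along that line then gives the entropy gain over the binomial. All names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; non-positive entries are skipped."""
    return -sum(q * math.log(q) for q in probs if q > 0)


def binomial_pmf(k, p):
    return [math.comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(k + 1)]


# Fourth finite difference: adding t times this vector leaves the sum, the mean
# and the second moment of a distribution on {0, 1, 2, 3, 4} unchanged.
SYMMETRIC_DIRECTION = (1.0, -4.0, 6.0, -4.0, 1.0)
BASE = binomial_pmf(4, 0.5)               # (1, 4, 6, 4, 1) / 16


def shifted(t):
    return [b + t * v for b, v in zip(BASE, SYMMETRIC_DIRECTION)]


def entropy_gain_at_half(iterations=200):
    """Ternary search for the shift t that maximizes entropy at p = 1/2."""
    lo = max(-BASE[0], -BASE[2] / 6.0)    # keeps p0, p2, p4 nonnegative
    hi = BASE[1] / 4.0                    # keeps p1, p3 nonnegative
    for _ in range(iterations):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if entropy(shifted(left)) < entropy(shifted(right)):
            lo = left
        else:
            hi = right
    t = (lo + hi) / 2.0
    gain = (entropy(shifted(t)) - entropy(BASE)) / math.log(5)
    return t, gain


t_best, delta_hn = entropy_gain_at_half()
print(t_best, 100 * delta_hn)   # shift and normalized entropy gain in percent
```

Under these assumptions the gain is small but strictly positive, of the same order as the ΔHN value quoted above, which is consistent with Bin(4,p) not being the exact maximizer for k=4.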

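| Covariate | Category | Mean | n | Std. dev. |
|---|---|---|---|---|
| Managing or leading partner’s age | ≤ 50 | 0.073 | 81 | 0.994 |
| | > 50 | -0.061 | 81 | 1.021 |
| NACE | Manufacturing | 0.107 | 44 | 0.999 |
| | Service, tourism, transportation | -0.176 | 70 | 1.028 |
| | IT/communication | -0.421 | 7 | 1.297 |
| | Construction | 0.436 | 27 | 0.840 |
| | Utilities | 0.178 | 5 | 0.413 |
| | Wholesale trade | -0.019 | 8 | 0.994 |
| | Other | -0.391 | 5 | 0.886 |
| Size | Micro | -0.043 | 103 | 0.965 |
| | Small | 0.138 | 46 | 1.020 |
| | Medium | -0.112 | 17 | 1.170 |

P-value ANOVA: 0.847; 0.398; 0.124; 0.529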

Table 7.3. Average mean of Rasch standardized scores for sociodemographic characteristics of SMEs and the ANOVA test

7.5.2. The measure for the recovery period (2013–2015)

In Table 7.4, the reliability measures for the Rasch analysis are reported.


| Item | Measure | SE | Infit MNSQ | Outfit MNSQ | OBS % | EXP % |
|---|---|---|---|---|---|---|
| I1 | -0.42 | 0.17 | 0.59 | 0.65 | 83.2 | 59.2 |
| I2 | -0.89 | 0.29 | 0.45 | 0.46 | 97.4 | 56.6 |
| I3 | 0.38 | 0.20 | 1.31 | 1.64 | 41.1 | 66.5 |
| I4 | 0.13 | 0.19 | 1.79 | 1.98 | 37.4 | 62.7 |
| I5 | -0.27 | 0.18 | 0.45 | 0.46 | 91.6 | 59.5 |
| I6 | 2.09 | 0.33 | 0.80 | 0.49 | 90.7 | 90.7 |
| I7 | -1.02 | 0.17 | 1.14 | 1.11 | 50.0 | 58.9 |

Table 7.4. The measure of the seven items (period 2013–2015)

Cronbach’s alpha is equal to 0.692 and shows sufficient internal consistency of the seven items. We assessed the unidimensionality of the model a priori (through a correspondence analysis). The proportion of inertia explained by its first dimension was not very low (63.9%) and the first axis clearly separated the two response options. Moreover, infit and outfit MNSQ estimates showed reasonable values. Table 7.4 reports the two situations on the level of difficulty accessing finance: the highest and the lowest and their respective scores, -1.02 (I7) and 2.09 (I6), as for the results of the 2008–2012 period. With respect to SMEs’ ability to overcome the difficulty, we see that no SME answers “1” to any questions in the questionnaire. As before, a score of -4.66 is given to SMEs experiencing less difficulty while a score of -0.55 refers to SMEs experiencing the greatest difficulty. On average, SMEs display a score of -2.31, which shows a sufficient level of difficulty. Figure 7.3 shows a histogram of the distribution of the new standardized measure.

Figure 7.3. The distribution of the standardized measure for obtaining financing


We turn our attention to the evaluation of differences in Rasch standardized scores for our sample with respect to the sociodemographic characteristics of SMEs. Table 7.5 presents mean Rasch standardized scores for each of the covariates and the P-values of an ANOVA test of differences between mean values.

| Covariate | Category | Mean | n | Std. dev. | P-value ANOVA |
|---|---|---|---|---|---|
| Age of the company | ≤ 5 years | 0.382 | 23 | 0.736 | 0.042** |
| | > 5 years | -0.077 | 140 | 1.030 | |
| Legal status | Sole proprietorship | 0.115 | 43 | 0.967 | 0.116 |
| | Private limited company, limited by shares (LTD) | 0.025 | 59 | 1.050 | |
| | Partnership | -0.118 | 27 | 0.909 | |
| | Private company limited by guarantee | 0.428 | 10 | 1.013 | |
| | Community interest company, limited by guarantee or shares | -0.588 | 17 | 1.021 | |
| | Other | -0.247 | 5 | 0.152 | |
| Gender of owner | Male | 0.028 | 130 | 0.973 | 0.248 |
| | Female | -0.202 | 31 | 1.088 | |
| Managing or leading partner’s age | ≤ 50 | 0.118 | 81 | 0.957 | 0.091* |
| | > 50 | -0.150 | 81 | 1.043 | |
| NACE | Manufacturing | 0.285 | 44 | 1.025 | 0.038** |
| | Service, tourism, transportation | -0.289 | 70 | 0.948 | |
| | IT/communication | 1.113 | 7 | 0.969 | |
| | Construction | 0.215 | 27 | 0.920 | |
| | Utilities | 0.628 | 5 | 0.150 | |
| | Wholesale trade | -0.223 | 8 | 1.357 | |
| | Other | -0.054 | 5 | 1.002 | |
| Size | Micro | -0.165 | 103 | 0.971 | 0.024** |
| | Small | 0.256 | 46 | 1.048 | |
| | Medium | 0.305 | 17 | 0.865 | |

Table 7.5. Average mean for Rasch standardized scores for sociodemographic characteristics of SMEs and ANOVA test (***, ** and * denote 1%, 5% and 10% significance levels, respectively)


The significant differences in means are found with respect to age, the NACE code and the size of an SME. This means that younger or bigger SMEs are associated with higher levels of difficulty accessing finance. Moreover, it seems that SMEs in economic sectors such as manufacturing, information and communication, construction and utilities are associated with higher levels of difficulty obtaining financing.

We built a regression tree (Figure 7.4) to analyze SMEs according to the difficulties they faced when accessing finance, based on their sociodemographic characteristics.

Figure 7.4. The regression tree of the indicators of SMEs’ difficulties in accessing financing

The analysis shows that only the statistical classification of economic activities in the European Community (NACE) provided a relevant contribution in explaining this indicator. The tree identifies two groups. The first group (indicated in Figure 7.4 as Leaf 1), associated with the lowest level of difficulty accessing finance (-0.268), consists of SMEs in service, tourism, transportation; wholesale trade and other. These enterprises represent 50% of our sample (83 enterprises). The second group (Leaf 2) represents the other 50% of the sample and is characterized by a positive indicator value (+0.268), which means the highest level of difficulty accessing finance. It consists of SMEs in manufacturing, information and communication, construction and utilities.

7.5.3. Comparing the two crisis phases

Comparing levels of difficulty accessing finance with respect to the two different identified phases, we find a correlation of 0.12; this indicates a low positive linear correlation between the two standardized measures. We then calculated the terciles (33rd and 66th percentiles) of the level of difficulty accessing finance for both periods. We defined SMEs experiencing a high level of difficulty as those that fall between the 66th and 100th percentiles of the distribution, a moderate level as those between the 33rd and 66th percentiles, and a low level as those below the 33rd percentile. Table 7.6 reports the contingency table of the levels of difficulty for the 2008–2012 period and those for the 2013–2015 period.
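The three-level classification described above can be sketched as follows. This is only an illustration in Python with made-up scores: the list `HYPOTHETICAL_SCORES`, the function name and the percentile convention of `statistics.quantiles` are assumptions, and the chapter does not state which percentile definition was used.

```python
import statistics

# Hypothetical standardized scores, for illustration only (not the study data).
HYPOTHETICAL_SCORES = [0.4, -2.1, 1.5, -0.2, 0.9, -1.3, 2.2, 0.0, -0.8]


def tertile_levels(scores):
    """Label each score Low, Moderate or High using the two tertile cut points."""
    lower, upper = statistics.quantiles(scores, n=3)   # two cut points
    levels = []
    for score in scores:
        if score < lower:
            levels.append("Low")
        elif score < upper:
            levels.append("Moderate")
        else:
            levels.append("High")
    return levels


levels = tertile_levels(HYPOTHETICAL_SCORES)
print(list(zip(HYPOTHETICAL_SCORES, levels)))
```

Applying the same labeling to both periods and cross-tabulating the two label lists gives a 3×3 contingency table of the kind reported in Table 7.6.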

| 2008 (rows) / 2013 (columns) | Low | Moderate | High | Total |
|---|---|---|---|---|
| Low | 18 | 24 | 14 | 56 |
| Moderate | 11 | 13 | 11 | 35 |
| High | 21 | 24 | 30 | 75 |
| Total | 50 | 61 | 55 | 166 |

Table 7.6. Contingency table of the levels of difficulty for the 2008–2012 period and those for the 2013–2015 period

We can consider Table 7.6 as a dynamic path of the levels of difficulty accessing financing between the two periods. Moreover, Table 7.7 reports the description of the nine groups determined by Table 7.6 with respect to the sociodemographic characteristics. Thus, three profiles can be assessed:
1) The “stayers” (along the main diagonal) maintain the same level of difficulty; 32.14% of SMEs with a low level of difficulty in the first period perform the same at the end of the second period; 37.14% of all SMEs with a moderate level in the first period have a moderate level in the second period; 40% of the “high-level stayer” SMEs in the first period feel the same at the end of the second period. This group is composed of small, older, LTD SMEs, with younger leading partners, in the manufacturing sector.
2) The half of the matrix under the main diagonal contains “positive mover” SMEs: they move from a higher level of difficulty to a lower level. For example, the 11 SMEs moving from a moderate to a low level of difficulty represent 31.43% of the SMEs that fell between the 33rd and 66th percentiles of the distribution of the level of difficulty in the 2008–2012 period. Moreover, 60% of SMEs with a high level of difficulty passed to a lower level in the 2013–2015 period.
3) The part of the matrix above the main diagonal contains “negative mover” SMEs: they shift from a lower to a higher level of difficulty accessing finance; the 38 SMEs moving from a low to a higher level represent 67.9% of the SMEs that were below the 33rd percentile of the distribution of difficulty in the first period. These SMEs are micro and older enterprises with sole proprietorship or LTD status, active in the service, tourism, transportation or manufacturing sectors.
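The shares behind the three profiles can be recomputed from the counts in Table 7.6. The sketch below assumes Python and reads the rows as the 2008–2012 levels and the columns as the 2013–2015 levels, which is the reading implied by the row totals 56, 35 and 75; the function names are illustrative assumptions.

```python
LEVELS = ("Low", "Moderate", "High")

# Counts from Table 7.6. Rows: level in 2008-2012; columns: level in 2013-2015.
COUNTS = [
    [18, 24, 14],
    [11, 13, 11],
    [21, 24, 30],
]


def profile_counts(table):
    """Number of stayers, positive movers and negative movers."""
    size = len(table)
    stayers = sum(table[i][i] for i in range(size))
    # Below the diagonal: the later level is lower than the earlier one.
    positive = sum(table[i][j] for i in range(size) for j in range(i))
    # Above the diagonal: the later level is higher than the earlier one.
    negative = sum(table[i][j] for i in range(size) for j in range(i + 1, size))
    return stayers, positive, negative


def row_percentages(table):
    """Share of each first-period group ending in each second-period level."""
    return [[100.0 * count / sum(row) for count in row] for row in table]


shares = row_percentages(COUNTS)
print(profile_counts(COUNTS))                    # (61, 56, 49)
print(round(shares[0][0], 2))                    # 32.14: low-level stayers
print(round(shares[1][0], 2))                    # 31.43: moderate to low
print(round(shares[0][1] + shares[0][2], 1))     # 67.9: low to a higher level
```

The three profile counts add up to the 166 SMEs in the sample.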


| Group | Age of company | Legal status | Gender of owner | Managing or leading partner’s age | NACE | Size |
|---|---|---|---|---|---|---|
| Low -> Low | >5 years 88.9% | LTD 44.4% | Male 76.5% | >50 years 55.6% | Service, Tourism, Transportation 55.6% | Micro 61.1% |
| Low -> Moderate | >5 years 75.0% | LTD 50% | Male 83.3% | ≤50 52.2% | Service, Tourism, Transportation 62.5% | Micro 62.5% |
| Low -> High | >5 years 78.6% | Sole proprietorship 37.5%; LTD 28.6% | Male 85.7% | >50 57.1% | Manufacturing 35.7%; Service, Tourism, Transportation 35.7% | Micro 50.0% |
| Moderate -> Low | >5 years 100.0% | Sole proprietorship 27.3%; Partnership 27.3%; CIC 27.3% | Male 72.7% | >50 81.8% | Service, Tourism, Transportation 81.8% | Micro 54.5% |
| Moderate -> Moderate | >5 years 61.5% | Sole proprietorship 54.5% | Male 90.9% | ≤50 69.2% | Service, Tourism, Transportation 46.2%; Manufacturing 23.1% | Micro 76.9% |
| Moderate -> High | >5 years 100.0% | Sole proprietorship 62.5% | Male 63.6% | >50 50.0% | Manufacturing 27.3%; Construction 27.3%; Utilities 27.3% | Micro 63.6% |
| High -> Low | >5 years 95.2% | LTD 38.1%; Sole proprietorship 19.0%; Partnership 19.0%; CIC 19.0% | Male 76.2% | >50 52.4% | Service, Tourism, Transportation 71.4% | Micro 52.4% |
| High -> Moderate | >5 years 87.5% | Partnership 33.3%; LTD 29.2% | Male 79.2% | >50 50.0% | Service, Tourism, Transportation 58.3% | Micro 87.5% |
| High -> High | >5 years 90.0% | LTD 53.3% | Male 89.3% | ≤50 60.0% | Manufacturing 53.3% | Small 56.7% |
| Total | >5 years 85.9% | Sole proprietorship 26.7%; LTD 36.6% | Male 80.7% | >50 50.0% | Service, Tourism, and Transportation 42.2%; Manufacturing 26.5% | Micro 62.0% |

Table 7.7. Description of the groups


7.6. Conclusion

The 2008 world financial crisis began with the subprime crisis, characterized by a housing crisis. It turned into a crisis of both the financial system and the real economy, and affected the ability of the private sector to access the credit needed to fund investment and consumption. The financial crisis had a severe impact on SMEs, which are a considerable source of jobs and profits, growth and innovation in all countries, and especially in Italy.

Based on a sample of 166 Italian firms, we provided some insights into the main difficulties faced by SMEs when trying to raise credit. While in the first phase of the crisis (2008–2012) all Italian SMEs faced a general difficulty in accessing finance, in the second phase (2013–2015) the difficulty in accessing finance varied according to the age, the NACE code and the size of the SME. We notice that young companies were the most affected. Some SMEs even experienced a worsening of their capability to access credit, in particular microenterprises and older enterprises with sole proprietorship or LTD status active in the service, tourism, transportation or manufacturing sectors.

Although the present study has yielded some interesting findings, it has some limitations. Our sample included only northern firms and, as is well documented in the banking literature, Italy’s southern regions are economically and financially less developed, and local firms have greater difficulties in accessing bank credit (Lucchetti et al. 2001; Alessandrini et al. 2009). Thus, it will be interesting to expand the research by including data from the southern part of Italy, and also to compare our results in a European perspective.

7.7. References

Alessandrini, P., Presbitero, A.F., Zazzaro, A. (2009). Banks, distances and firms’ financing constraints. Rev. Financ., 13(2), 261–307.
Bofondi, M., Carpinelli, L., Sette, E. (2017). Credit supply during a sovereign debt crisis. J. Eur. Econ. Assoc., doi:10.1093/jeea/jvx020.
Bond, T., Fox, C. (2007). Applying the Rasch Model: Fundamental Measurement in the Human Sciences, 2nd edition. LEA, Mahwah, New Jersey.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Chapman and Hall, New York.
Costa, S., Margani, P. (2009). Credit crunch in Italy: Evidence on new ISAE survey data. http://www.oecd.org/std/leading-indicators/43854874.pdf.
Corazza, M., Funari, S., Gusso, S. (2016). Creditworthiness evaluation of Italian SMEs at the beginning of the 2007-2008 crisis: An MCDA approach. N. Am. J. Econ. Financ., 38, 1–26.
Cox, D.R. (1970). The Analysis of Binary Data. Chapman and Hall, London.
Cronbach, L.J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3), 297–334.


Dainelli, F., Giunta, F., Cipolini, F. (2013). Determinants of SME credit worthiness under Basel rules: The value of credit history information. PSL. Q. Rev., 66, 21–47.
Del Giovane, P., Eramo, G., Nobili, A. (2011). Disentangling demand and supply in credit developments: A survey based analysis for Italy? J. Bank. Financ., 35(10), 2719–2732.
European Commission and European Central Bank: Survey on the Access to Finance of Enterprises. http://ec.europa.eu/growth/access-to-finance/data-surveys/index_en.htm. Cited 07/07/2016.
IBM SPSS: Decision Trees 23. http://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/23.0/en/client/Manuals/IBM_SPSS_Decision_Trees.pdf.
Lucchetti, R., Papi, L., Zazzaro, A. (2001). Banks’ inefficiency and economic growth: A micro-macro approach. Scott. J. Polit. Econ., 48(4), 400–424.
Paolazzi, L., Rapacciuolo, C. (2009). C’è credit crunch in Eurolandia e Italia? Confindustria, NCS 9(4).
Presbitero, A.F., Udell, G.F., Zazzaro, A. (2014). The home bias and the credit crunch: A regional perspective. J. Money Credit Bank, 46(1), 53–85.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Danish Institute for Educational Research, Copenhagen.
Winsteps Manuals. (2016). http://www.winsteps.com/manuals.htm.

8 Gender-Based Differences in the Impact of the Economic Crisis on Labor Market Flows in Southern Europe

This chapter1 presents an application of the non-homogeneous Markov system theory to labor force participation providing a cross-national, gender-based comparison of labor market transitions among Southern European countries. Raw data from the European Union Labour Force Survey (EU-LFS) from 2006 to 2013 is drawn to compare the distribution of transition probabilities from the labor market state of employment, unemployment and inactivity and vice versa, for Greece, Italy, Spain and Portugal and examine whether the distributions are gender sensitive. Moreover, the chapter examines whether school-to-work transition probabilities for these countries differ for males and females and to what extent. Additionally, the crisis’ impact on the individual’s labor market position is studied by country and by sex with the use of mobility indices.

8.1. Introduction

The study of transitions between employment, unemployment and inactivity and their evolution over time is crucial for the comprehension of labor market dynamics. Transition probabilities and school-to-work transitions are of utmost importance for understanding the mechanisms of the labor market and for enabling their modeling.

Chapter written by Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU. 1 This paper has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 649395 (NEGOTIATE – negotiating early job insecurity and labor market exclusion in Europe, Horizon 2020, Societal Challenge 6, H2020-YOUNG-SOCIETY-2014, Research and Innovation Action (RIA), duration: March 1, 2015–February 28, 2018).

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


Numerous attempts to model labor market transitions can be found in the literature (Brzinsky-Fay 2007; Ward-Warmedinge et al. 2013; Flek and Mysikova 2015; Symeonaki and Stamatopoulou 2015). The main objective of this chapter is to provide a cross-national, gender-based comparison of labor market transitions for Greece, Italy, Portugal and Spain using raw data drawn from the EU-LFS for the years 2006–2013 and the theory of non-homogeneous Markov systems (NHMS) (Vassiliou 1982). Markov systems are systematically used to model population systems and to create a more concrete background for a number of Markov chain population models. Numerous real-life probability population models can be adapted to this framework, since Markov systems provide a significant method to describe a population that is stratified into different states according to a specific characteristic and to model the transitions between these states and their evolution over time (Bartholomew 1982; Vassiliou 1982; Bartholomew 1991; Vassiliou and Symeonaki 1999; Symeonaki et al. 2002; Symeonaki and Stamou 2004; Vassiliou 2013; Malefaki et al. 2014; Symeonaki 2015). Moreover, this chapter studies whether, and to what extent, school-to-work transition probabilities for Greece, Italy, Portugal and Spain differ for males and females in the time period before and during the crisis. The crisis’ impact on the individual’s labor market situation is studied by country and by gender with the use of mobility indices (Bibby 1975; Shorrocks 1978; Bartholomew 1982; Prais 1995; Heineck and Riphahn 2007).

The chapter is structured as follows. In section 8.2, the proposed methodology and the limitations of both the data and the methodology are provided. In section 8.3, the results of the analysis are given, and section 8.4 provides the conclusions and discussion of the study.

8.2. Data, methods and limitations

In order to estimate the transition flows, we focus on data drawn from the EU-LFS and more specifically on raw data for the years 2006–2013 covering the Southern European countries. The EU-LFS is an important data source, as it provides extensive evidence on labor market participation and working conditions of European citizens. It allows multivariate analysis by gender, age, educational attainment and other sociodemographic characteristics, while common principles and guidelines are used to guarantee cross-country comparability. Limitations, however, arise due to differences in the national questionnaires. Even key variables such as “labor status at the time of the survey” are not collected for some of the participating countries. Moreover, due to its cross-sectional nature, the survey does not allow capturing flows

Gender-Based Differences in the Impact of the Economic Crisis


over time, as individuals cannot be tracked year after year. Moreover, measurement errors are present in the EU-LFS, as in all surveys, as a result of misreporting, mistakes in the recording of responses (by interviewers) and the use of proxy interviews (Pavlopoulos and Vermunt 2015). Finally, in the Portuguese databases from 2006 to 2010, the variable concerning the main labor status 1 year before the time of the survey (WSTAT1Y) limits the analysis. More particularly, the information on individuals being in retirement or permanently disabled is missing, and it seems that they are grouped under the label “other inactive persons”, which does not allow these states to be separated. Therefore, for Portugal, only outcomes for 2011–2013 are presented.

Concerning the methodology, the theory of Markov systems is used to analyze the data. More specifically, the current labor status and the situation 1 year before the survey of males and females are the key variables used to estimate the transition probabilities among labor states and the input probabilities to the different labor market states, with the aid of an NHMS model. In the proposed model, a population is stratified into distinct categories according to a certain characteristic, which in our case is the labor market status. In the EU-LFS survey, people are asked about their present labor market state and their state a year ago. The individual responds whether he/she is:
1) carrying out a job or profession, including unpaid work for a family business or holding, including an apprenticeship or paid traineeship, etc.;
2) unemployed;
3) a pupil, student, further training, unpaid work experience;
4) in retirement or early retirement or has given up business;
5) permanently disabled;
6) in compulsory military service;
7) fulfilling domestic tasks;
8) other inactive person.

If combined, these categories represent the following labor market states:
1 → employment (corresponding to the 1st category);
2 → unemployment (corresponding to the 2nd category);
3 → inactivity (corresponding to the 5th, 6th, 7th and 8th categories).

Evidently, the entrance to the labor market system is represented by the transition from the category “in Education or Training”, corresponding to the 3rd category, to


either one of the three labor market states of employment, unemployment or inactivity. These input probabilities, which are used as indicators of school-to-work transition, are the conditional probabilities $p_{0j}(t)$, where $j \in \{\text{employed}, \text{unemployed}, \text{inactive}\}$:

$p_{0j}(t)$ = prob{an individual is in state j at time t | he or she was a pupil, a student, in further training or unpaid work experience at time t − 1}.

A similar methodology is used in Flek and Mysikova (2015) to analyze labor market flows, i.e. flows between employment, unemployment and inactivity, using Markov transition systems in order to draw conclusions on unemployment dynamics in central Europe. Ward-Warmedinge et al. (2013) used a similar methodology to capture the flows affecting the changes in unemployment rates in European countries. Markov system analysis is also used in Symeonaki and Stamatopoulou (2015) in order to investigate labor market dynamics in Greece.

Labor mobility of individuals in Europe and its evolution over the years of the crisis is also estimated. More specifically, four different relative mobility indices are calculated in order to reveal the extent of the individuals’ transitions within the labor market system. Relative indices reveal the rate of labor market fluidity, and the ones used in the present analysis are the following well-established mobility indices, where $P = (p_{ij})$ denotes the transition probability matrix and $k$ the number of states.

The Prais–Shorrocks mobility index:

$$M_{PS} = \frac{1}{k-1}\,\bigl(k - \operatorname{tr}(P)\bigr) \tag{8.1}$$

The immobility index:

$$IM = \frac{\operatorname{tr}(P)}{k} \tag{8.2}$$

The Bartholomew mobility index:

$$M_B = \frac{1}{k} \sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}\,|i - j| \tag{8.3}$$

The Prais–Bibby mobility index:

$$M_T = 1 - \frac{\operatorname{tr}(P)}{k} \tag{8.4}$$
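As an illustration of how the indices [8.1]–[8.4] can be evaluated, the following minimal Python sketch computes all four from a transition matrix. The function name `mobility_indices` is hypothetical and not part of the original study; the matrix is the one printed in Table 8.1 for Greek males in 2006, used exactly as printed (its rows sum to slightly less than one).

```python
def mobility_indices(P):
    """Return the indices [8.1]-[8.4] for a square transition matrix P."""
    k = len(P)
    trace = sum(P[i][i] for i in range(k))
    # Bartholomew index: probability mass weighted by distance from the diagonal.
    weighted = sum(P[i][j] * abs(i - j) for i in range(k) for j in range(k))
    return {
        "M_PS": (k - trace) / (k - 1),  # Prais-Shorrocks, [8.1]
        "IM": trace / k,                # immobility, [8.2]
        "M_B": weighted / k,            # Bartholomew, [8.3]
        "M_T": 1 - trace / k,           # Prais-Bibby, [8.4]
    }

# Greek males, 2006, as printed in Table 8.1
# (states ordered: employment, unemployment, inactivity).
P_2006 = [
    [0.960, 0.016, 0.004],
    [0.290, 0.665, 0.024],
    [0.133, 0.102, 0.739],
]

indices = mobility_indices(P_2006)
for name, value in indices.items():
    print(f"{name} = {value:.3f}")
```

Rounded to three decimals, the sketch gives 0.318, 0.788, 0.235 and 0.212, which agree with the first row of Table 8.2. Note that, by definition, the Prais–Bibby index equals one minus the immobility index.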


The Prais–Shorrocks index, the immobility index and the Prais–Bibby index are based on the trace of the transition probability matrix, whereas the Bartholomew mobility index measures the amount of mobility off the main diagonal.

8.3. Results

This section presents the results of the analysis. Table 8.1 provides the input probability vectors and the transition matrices for Greece, for 2006–2013, for males and females. The values of the mobility indices, estimated from the transition probability matrices presented in Table 8.1, are shown in Table 8.2. Tables 8.3 and 8.4 present the respective results for Italy, Tables 8.5 and 8.6 for Portugal and Tables 8.7 and 8.8 for Spain.

8.4. Conclusions and discussion

The key finding of the study is that the impact of the crisis on labor market transitions is clear for the Southern European countries under study, and that gender differences are present in the transition probabilities, more so for Greece. The impact of the crisis is even more evident when school-to-work transition probabilities are considered, since these probabilities decreased dramatically from 2006 to 2013. Although there were gender differences in these input probabilities in 2006, the differences have faded and the school-to-work transition probabilities for males and females seem to converge to low values, providing evidence that young people’s pathways from school to sustained work have become rough and unpredictable in Southern Europe. More specifically, when the transitions from employment to unemployment are considered, one can clearly see that for Greece the probability of going from employment to unemployment increased 4.2 times for men and 2.3 times for women between 2006 and 2013, as the crisis substantially hit the so-called “male” occupations. The increase in these probabilities is rather similar for men and women in Italy, whereas in Spain the increase differs (2.7 times higher probabilities for men between 2006 and 2013 and 1.7 times higher for women).


P0(t) and the rows of P(t) are listed over the states employment, unemployment, inactivity.

Males
t     N        P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2006  103,906  (0.441, 0.272, 0.287)  (0.960, 0.016, 0.004)  (0.290, 0.665, 0.024)  (0.133, 0.102, 0.739)
2007  100,019  (0.405, 0.262, 0.333)  (0.960, 0.016, 0.003)  (0.277, 0.667, 0.025)  (0.136, 0.101, 0.735)
2008  99,424   (0.384, 0.264, 0.352)  (0.962, 0.016, 0.003)  (0.284, 0.668, 0.021)  (0.123, 0.112, 0.724)
2009  101,355  (0.360, 0.303, 0.337)  (0.950, 0.030, 0.004)  (0.238, 0.711, 0.026)  (0.112, 0.127, 0.721)
2010  102,861  (0.287, 0.310, 0.403)  (0.937, 0.041, 0.004)  (0.203, 0.749, 0.023)  (0.098, 0.130, 0.733)
2011  92,242   (0.215, 0.451, 0.334)  (0.919, 0.055, 0.004)  (0.134, 0.825, 0.018)  (0.054, 0.159, 0.737)
2012  82,100   (0.151, 0.462, 0.387)  (0.904, 0.067, 0.005)  (0.102, 0.859, 0.018)  (0.038, 0.163, 0.750)
2013  3,785    (0.194, 0.500, 0.306)  (0.911, 0.064, 0.005)  (0.114, 0.859, 0.009)  (0.048, 0.167, 0.746)

Females
t     N        P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2006  108,148  (0.361, 0.457, 0.182)  (0.933, 0.028, 0.022)  (0.196, 0.749, 0.044)  (0.014, 0.013, 0.963)
2007  104,499  (0.333, 0.432, 0.235)  (0.938, 0.025, 0.021)  (0.197, 0.750, 0.040)  (0.012, 0.013, 0.966)
2008  103,563  (0.320, 0.418, 0.262)  (0.936, 0.023, 0.021)  (0.216, 0.725, 0.044)  (0.013, 0.010, 0.967)
2009  106,071  (0.276, 0.796, 0.228)  (0.927, 0.034, 0.023)  (0.176, 0.761, 0.049)  (0.013, 0.016, 0.959)
2010  108,789  (0.250, 0.508, 0.242)  (0.922, 0.041, 0.019)  (0.138, 0.805, 0.046)  (0.012, 0.021, 0.955)
2011  97,477   (0.172, 0.598, 0.230)  (0.909, 0.054, 0.017)  (0.095, 0.853, 0.040)  (0.009, 0.022, 0.957)
2012  85,281   (0.165, 0.646, 0.189)  (0.897, 0.060, 0.021)  (0.078, 0.873, 0.034)  (0.008, 0.026, 0.953)
2013  3,903    (0.190, 0.611, 0.199)  (0.898, 0.064, 0.019)  (0.082, 0.871, 0.035)  (0.008, 0.025, 0.952)

Source: EU-LFS, 2006–2013.

Table 8.1. Input probabilities and transition probability matrices, Greece, 2006–2013

Greece   Year  M_PS   IM     M_B    M_T
Males    2006  0.318  0.788  0.235  0.212
         2007  0.319  0.787  0.232  0.213
         2008  0.323  0.785  0.228  0.215
         2009  0.309  0.794  0.218  0.206
         2010  0.290  0.806  0.200  0.194
         2011  0.260  0.827  0.161  0.173
         2012  0.244  0.838  0.145  0.162
         2013  0.242  0.839  0.153  0.161
Females  2006  0.177  0.882  0.118  0.118
         2007  0.173  0.885  0.114  0.115
         2008  0.186  0.876  0.120  0.124
         2009  0.176  0.882  0.116  0.118
         2010  0.159  0.894  0.103  0.103
         2011  0.141  0.906  0.076  0.094
         2012  0.139  0.908  0.085  0.092
         2013  0.139  0.907  0.087  0.093

Source: EU-LFS, 2006–2013; own calculations

Table 8.2. Prais–Shorrocks, immobility, Bartholomew and Prais–Bibby indices, Greece

Looking at the probabilities of going from unemployment to employment and their evolution over time, it is notable that these have decreased in all countries for both men and women. The probabilities of staying employed have also decreased. As far as the school-to-labor market entry probabilities are concerned, they have decreased considerably in all countries and the gender differences have subsided. This means that individuals have roughly the same chances of finding a job after education or training irrespective of their gender, with Greece exhibiting the lowest probabilities, approximately 0.19 for both men and women (Figures 8.1–8.4). In other words, in Greece in 2013 only 19% of new school leavers, either male or female, were able to find a job. These results are consistent with the generally accepted view that young people are the most affected by economic and financial crises, since they either have not yet moved from school to work to find their way into the labor market, or they have not yet built a reputation and proven themselves in the labor market arena.


P0(t) and the rows of P(t) are listed over the states employment, unemployment, inactivity.

Males
t     N        P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2006  234,386  (0.487, 0.446, 0.067)  (0.957, 0.023, 0.006)  (0.264, 0.705, 0.016)  (0.088, 0.056, 0.791)
2007  230,060  (0.491, 0.473, 0.036)  (0.958, 0.023, 0.005)  (0.272, 0.703, 0.014)  (0.070, 0.039, 0.831)
2008  226,882  (0.496, 0.455, 0.049)  (0.950, 0.026, 0.006)  (0.251, 0.724, 0.015)  (0.070, 0.039, 0.831)
2009  220,603  (0.409, 0.524, 0.067)  (0.943, 0.035, 0.009)  (0.207, 0.768, 0.014)  (0.053, 0.050, 0.836)
2010  219,404  (0.369, 0.558, 0.073)  (0.942, 0.034, 0.007)  (0.214, 0.753, 0.019)  (0.083, 0.056, 0.801)
2011  213,659  (0.366, 0.571, 0.068)  (0.945, 0.034, 0.006)  (0.230, 0.738, 0.018)  (0.082, 0.066, 0.781)
2012  197,389  (0.339, 0.593, 0.068)  (0.938, 0.044, 0.007)  (0.195, 0.771, 0.019)  (0.062, 0.072, 0.809)
2013  21,261   (0.301, 0.629, 0.070)  (0.934, 0.050, 0.006)  (0.174, 0.799, 0.013)  (0.067, 0.067, 0.809)

Females
t     N        P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2006  245,452  (0.416, 0.490, 0.094)  (0.937, 0.028, 0.023)  (0.210, 0.712, 0.064)  (0.024, 0.020, 0.941)
2007  241,106  (0.443, 0.482, 0.076)  (0.937, 0.028, 0.022)  (0.224, 0.701, 0.062)  (0.024, 0.020, 0.939)
2008  238,048  (0.458, 0.458, 0.084)  (0.935, 0.030, 0.021)  (0.210, 0.718, 0.060)  (0.025, 0.021, 0.940)
2009  231,629  (0.373, 0.525, 0.102)  (0.931, 0.035, 0.022)  (0.188, 0.748, 0.054)  (0.019, 0.019, 0.946)
2010  231,219  (0.343, 0.549, 0.108)  (0.932, 0.035, 0.019)  (0.192, 0.724, 0.074)  (0.023, 0.024, 0.939)
2011  226,761  (0.355, 0.523, 0.122)  (0.933, 0.036, 0.018)  (0.196, 0.709, 0.084)  (0.024, 0.026, 0.932)
2012  209,651  (0.324, 0.561, 0.115)  (0.929, 0.042, 0.019)  (0.182, 0.725, 0.081)  (0.022, 0.034, 0.931)
2013  21,565   (0.288, 0.594, 0.118)  (0.924, 0.050, 0.018)  (0.169, 0.747, 0.071)  (0.019, 0.034, 0.932)

Source: EU-LFS, 2006–2013.

Table 8.3. Input probabilities and transition probability matrices, Italy, 2006–2013

Italy    Year  M_PS   IM     M_B    M_T
Males    2006  0.274  0.818  0.182  0.182
         2007  0.254  0.831  0.166  0.169
         2008  0.248  0.838  0.161  0.165
         2009  0.227  0.849  0.143  0.151
         2010  0.252  0.832  0.168  0.168
         2011  0.268  0.821  0.175  0.179
         2012  0.241  0.839  0.156  0.161
Females  2006  0.205  0.863  0.139  0.137
         2007  0.212  0.859  0.142  0.141
         2008  0.204  0.864  0.138  0.136
         2009  0.188  0.875  0.126  0.125
         2010  0.202  0.865  0.136  0.135
         2011  0.213  0.858  0.142  0.142
         2012  0.208  0.862  0.140  0.138
         2013  0.198  0.868  0.133  0.132

Source: EU-LFS, 2006–2013; own calculations

Table 8.4. Prais–Shorrocks, immobility, Bartholomew and Prais–Bibby indices, Italy

P0(t) and the rows of P(t) are listed over the states employment, unemployment, inactivity.

Males
t     N       P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2011  53,966  (0.444, 0.493, 0.063)  (0.864, 0.073, 0.010)  (0.229, 0.721, 0.008)  (0.012, 0.026, 0.869)
2012  53,732  (0.355, 0.579, 0.066)  (0.831, 0.098, 0.011)  (0.179, 0.770, 0.010)  (0.017, 0.025, 0.886)
2013  3,591   (0.394, 0.542, 0.064)  (0.864, 0.078, 0.010)  (0.198, 0.748, 0.010)  (0.020, 0.026, 0.894)

Females
t     N       P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2011  58,638  (0.454, 0.480, 0.066)  (0.863, 0.061, 0.044)  (0.221, 0.702, 0.043)  (0.017, 0.022, 0.924)
2012  38,210  (0.342, 0.573, 0.085)  (0.844, 0.074, 0.049)  (0.177, 0.729, 0.055)  (0.014, 0.020, 0.935)
2013  3,836   (0.405, 0.522, 0.073)  (0.863, 0.073, 0.040)  (0.194, 0.728, 0.049)  (0.017, 0.025, 0.929)

Source: EU-LFS, 2006–2013.

Table 8.5. Input probabilities and transition probability matrices, Portugal, 2006–2013


Portugal  Year  M_PS   IM     M_B    M_T
Males     2011  0.273  0.818  0.127  0.182
          2012  0.256  0.829  0.123  0.171
          2013  0.247  0.835  0.124  0.165
Females   2011  0.256  0.830  0.156  0.170
          2012  0.246  0.836  0.151  0.164
          2013  0.240  0.840  0.152  0.160

Source: EU-LFS, 2006–2013; own calculations

Table 8.6. Prais–Shorrocks, immobility, Bartholomew and Prais–Bibby indices, Portugal

Figure 8.1. School-to-work probabilities, Greece, LFS 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip

Figure 8.2. School-to-work probabilities, Italy, LFS 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip

P0(t) and the rows of P(t) are listed over the states employment, unemployment, inactivity.

Males
t     N       P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2006  34,146  (0.643, 0.276, 0.081)  (0.941, 0.026, 0.018)  (0.428, 0.388, 0.140)  (0.099, 0.036, 0.751)
2007  34,602  (0.648, 0.269, 0.083)  (0.943, 0.024, 0.015)  (0.457, 0.379, 0.128)  (0.097, 0.038, 0.745)
2008  35,042  (0.579, 0.350, 0.071)  (0.923, 0.046, 0.015)  (0.391, 0.444, 0.125)  (0.068, 0.041, 0.785)
2009  36,643  (0.448, 0.457, 0.095)  (0.884, 0.081, 0.018)  (0.283, 0.592, 0.096)  (0.054, 0.042, 0.795)
2010  36,736  (0.394, 0.507, 0.099)  (0.906, 0.065, 0.014)  (0.264, 0.619, 0.089)  (0.038, 0.037, 0.830)
2011  35,068  (0.359, 0.545, 0.096)  (0.911, 0.060, 0.014)  (0.253, 0.640, 0.079)  (0.105, 0.043, 0.738)
2012  36,241  (0.316, 0.593, 0.091)  (0.895, 0.075, 0.013)  (0.210, 0.695, 0.072)  (0.062, 0.049, 0.792)
2013  16,344  (0.222, 0.548, 0.230)  (0.902, 0.070, 0.014)  (0.211, 0.700, 0.072)  (0.063, 0.062, 0.789)

Females
t     N       P0(t)                  P(t), row 1            P(t), row 2            P(t), row 3
2006  35,656  (0.586, 0.335, 0.079)  (0.907, 0.037, 0.044)  (0.369, 0.362, 0.233)  (0.047, 0.041, 0.872)
2007  36,219  (0.622, 0.293, 0.085)  (0.915, 0.033, 0.038)  (0.378, 0.395, 0.193)  (0.055, 0.043, 0.865)
2008  36,415  (0.550, 0.356, 0.094)  (0.908, 0.043, 0.036)  (0.361, 0.417, 0.195)  (0.046, 0.046, 0.875)
2009  38,242  (0.445, 0.449, 0.106)  (0.887, 0.065, 0.034)  (0.271, 0.530, 0.171)  (0.037, 0.058, 0.875)
2010  38,316  (0.391, 0.512, 0.097)  (0.903, 0.055, 0.029)  (0.250, 0.551, 0.172)  (0.032, 0.048, 0.887)
2011  36,632  (0.390, 0.510, 0.100)  (0.905, 0.053, 0.029)  (0.229, 0.608, 0.143)  (0.051, 0.051, 0.866)
2012  38,210  (0.356, 0.562, 0.082)  (0.894, 0.066, 0.027)  (0.190, 0.653, 0.136)  (0.035, 0.053, 0.881)
2013  16,408  (0.254, 0.509, 0.237)  (0.899, 0.066, 0.025)  (0.193, 0.655, 0.139)  (0.033, 0.067, 0.865)

Source: EU-LFS, 2006–2013.

Table 8.7. Input probabilities and transition probability matrices, Spain, 2006–2013


Spain    Year  M_PS   IM     M_B    M_T
Males    2006  0.460  0.693  0.288  0.307
         2007  0.462  0.692  0.290  0.308
         2008  0.424  0.717  0.256  0.283
         2009  0.365  0.757  0.215  0.243
         2010  0.323  0.785  0.186  0.215
         2011  0.355  0.763  0.224  0.237
         2012  0.309  0.794  0.185  0.206
         2013  0.304  0.797  0.190  0.203
Females  2006  0.429  0.714  0.287  0.286
         2007  0.413  0.725  0.278  0.275
         2008  0.400  0.733  0.270  0.267
         2009  0.354  0.764  0.236  0.236
         2010  0.329  0.780  0.216  0.220
         2011  0.310  0.793  0.212  0.207
         2012  0.286  0.809  0.190  0.191
         2013  0.290  0.806  0.194  0.194

Source: EU-LFS, 2006–2013; own calculations

Table 8.8. Prais–Shorrocks, immobility, Bartholomew and Prais–Bibby indices, Spain

Figure 8.3. School-to-work probabilities, Spain, LFS 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip


Figure 8.4. School-to-work probabilities, Portugal, LFS 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip

8.5. References

Bartholomew, D.J. (1982). Stochastic Models for Social Processes. Wiley, London.
Bartholomew, D.J., Forbes, A.F., McClean, S.I. (1991). Statistical Techniques for Manpower Planning. Wiley, London.
Bibby, J. (1975). Methods of Measuring Mobility. Qual. Quant., 9, 107–136.
Brzinsky-Fay, C. (2007). Lost in transition? Labour Market Entry sequences of school leavers in Europe. Eur. Sociol. Rev., 23(4), 409–422.
Flek, V., Mysikova, M. (2015). Unemployment dynamics in central Europe: A labour flow approach. Prague Economic Papers, 24(1), 73–87.
Heineck, G., Riphahn, R. (2007). Intergenerational transmission of educational attainment in Germany – the last five decades. Rev. Econ. Stat., 229, 36–60.
Malefaki, S., Limnios, N., Dersin, P. (2014). Reliability of maintained systems under a semi-Markov setting. Reliab. Eng. Syst. Safe., 131, 282–290.
Pavlopoulos, D., Vermunt, J.K. (2015). Measuring temporary employment, do survey or register data tell the truth? Survey Methodology, 41(1), 197–214.
Prais, S. (1955). Measuring social mobility. J. Royal Stat. Soc., 118, Series A, 56–66.
Shorrocks, A. (1978). The measurement of social mobility. Econometrica, 46, 1013–1024.
Symeonaki, M. (2015). Theory of fuzzy non-homogeneous Markov systems with fuzzy states. Qual. Quant., 49(6), 2369–2385.
Symeonaki, M., Stamatopoulou, G. (2015). A Markov system analysis application on labour market dynamics: The case of Greece. IWPLMS, 22-24 June, Athens.
Symeonaki, M., Stamou, G. (2004). Theory of Markov systems with fuzzy states. Fuzzy Sets Syst., 143, 427–445.
Symeonaki, M., Stamou, G., Tzafestas, S. (2002). Fuzzy non-homogeneous Markov systems. Applied Intelligence, 17(2), 203–214.
Vassiliou, P.-C.G. (1982). Asymptotic behaviour of Markov systems. Appl. Probab., 19, 851–857.
Vassiliou, P.-C.G. (2013). Fuzzy semi-Markov migration process in credit risk. Fuzzy Sets Syst., 223(0), 39–58.
Vassiliou, P.-C.G., Symeonaki, M. (1999). The perturbed non-homogeneous Markov system. Linear Algebra Appl., 289(1–3), 319–332.
Ward-Warmedinge, M., Melanie, E., Macchiarelli, C. (2013). Transitions in labour market status in the European Union. LEQS, 69.

9 Measuring Labor Market Transition Probabilities in Europe with Evidence from the EU-SILC

In this chapter¹, youth transition probabilities from school to employment, unemployment and inactivity are estimated for all European countries using raw data from the European Union Statistics on Income and Living Conditions (EU-SILC). More precisely, the input probabilities are measured using the EU-SILC data from 2006 to 2013. Replacing the European Panel Survey since 2003, EU-SILC provides timely and comparable cross-sectional, multidimensional data on income, poverty, social exclusion and living conditions, anchored in the European Statistical System. Methodologically, the theory of non-homogeneous Markov systems (NHMS) is used to measure the transition and input probabilities. In the proposed model, the population is stratified into distinct categories according to a certain characteristic, which in this case is the labor market status. The study examines whether the crisis has created convergence or divergence in youth transitions to employment among European countries, and traces the evolution of labor market probabilities before and during the crisis.

9.1. Introduction

Transitions of young individuals among different labor market states and youth labor market entry and integration are crucial to labor market and educational

Chapter written by Maria SYMEONAKI, Maria KARAMESSINI and Glykeria STAMATOPOULOU.
¹ This paper has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 649395 (NEGOTIATE – Negotiating early job-insecurity and labour market exclusion in Europe, Horizon 2020, Societal Challenge 6, H2020-YOUNG-SOCIETY-2014, Research and Innovation Action (RIA), Duration: 01 March 2015 – 28 February 2018).

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


policies. Numerous studies in the literature address this central subject. More precisely, Eurofound (2014) presents the labor market status of young individuals in European countries, with a special interest in the school-to-work transition. Brzinsky-Fay (2007), on the other hand, examines the sequences of school-to-work transitions in 10 European countries using optimal matching and cluster analysis. Christodoulakis and Mamatzakis (2009) use a Bayesian method that involves a Monte Carlo integration procedure to derive the empirical posterior distribution of transition probabilities from full-time employment to part-time employment, temporary employment and unemployment and vice versa in the EU 15. Moreover, Alvarez et al. (2008) present the labor dynamics of the population by fitting a stationary Markov chain to data drawn from the Argentine official labor survey. Betti et al. (2007) describe some aspects of school-to-work transitions with an emphasis on the patterns in Southern European countries. Furthermore, Ward-Warmedinge et al. (2013) provide evidence on labor market mobility in 23 European countries using the Eurostat’s Labour Force Survey (EU-LFS) data over the period 1998–2008. Flek and Mysikova (2015) estimate the flows between employment, unemployment and inactivity using Markov transition systems to draw conclusions on unemployment dynamics in Central Europe. The theory of non-homogeneous Markov system models (Vassiliou 1982) is used in Symeonaki and Stamatopoulou (2015) in order to investigate labor market dynamics in Greece, and in Karamessini et al. (2016) the same theory is used to estimate the school-to-labor market entry probabilities for a number of European countries with raw data drawn from the EU-LFS data sets for 2013. Bosch and Maloney (2007) estimate a set of statistics in order to examine and compare labor market dynamics based on the estimation of continuous time Markov transition processes.

The present study focuses on the measurement of the transition probabilities between labor market states, part-time and full-time employment and school-to-labor market entry probabilities for young individuals. The study uses the available raw microdata behind the EU-SILC survey in order to provide evidence as to whether the crisis has generated convergence or divergence in early job insecurity across Europe. At the same time, a number of mobility indices are estimated to contribute to the measurement of labor fluidity and its evolution over time.

The chapter is structured as follows. Section 9.2 provides information about the data used, the methods and the limitations, whereas section 9.3 presents the results concerning the dynamics of the labor market flows, with an emphasis on the evolution of the transition probabilities between labor market states, the evolution of school-to-labor market entry probabilities and the estimation of labor fluidity. Finally, the conclusions of the study are provided.

9.2. Data, methods and limitations

In order to estimate the transition probabilities of young Europeans between labor market states and their labor market entry and integration, we use raw data drawn from

Measuring Labor Market Transition Probabilities


the EU-SILC. More precisely, the analysis is performed using raw data drawn from the EU-SILC data sets for the years 2006–2013. The EU-SILC² survey aims at collecting timely and comparable cross-sectional and longitudinal multidimensional microdata on income, poverty, social exclusion and living conditions, and it is anchored in the European Statistical System (ESS). It provides two types of data:
– cross-sectional data pertaining to a given time or a certain time period with variables on income, poverty, social exclusion and other living conditions;
– longitudinal data pertaining to individual-level changes over time observed periodically over a 4-year period.

Measurement errors are present in the EU-SILC, as in all surveys, as a result of misreporting (by respondents), mistakes in the recording of responses (by interviewers) and the use of proxy interviews (Pavlopoulos and Vermunt 2015). The EU-SILC has the advantage, compared to the EU-LFS, of collecting income and age information. An additional advantage is that it provides information about transitions between part-time and full-time employment, unemployment and inactivity, and also from school to part-time and full-time employment. However, we cannot overlook the strong disadvantage of the EU-SILC data sets, namely their considerably smaller number of cases. This limits this kind of analysis in many countries, especially in the estimation of school-to-part-time employment, school-to-full-time employment or school-to-unemployment probabilities. For the purpose of the present study, the focus is on individuals aged between 15 and 29 years.

More specifically, the theory of Markov systems (Vassiliou 1982) is used to model raw data from the EU-SILC survey concerning the transitions of young individuals among the labor market states and their labor market entry. Markov systems are widely used to model population systems and provide a firm basis for a number of Markov chain population models. Various real-life population models fit this framework, since a Markov system describes a population that is stratified into different categories according to a certain characteristic and models the transitions among these categories and their evolution over time (Bartholomew 1982; Vassiliou 1982; Bartholomew et al. 1991; Vassiliou and Symeonaki 1999; Symeonaki et al. 2002; Symeonaki and Stamou 2004; Vassiliou 2013). More precisely, from the EU-SILC data sets the variables referring to the main labor activity in January and the main labor activity in December of the same year are used to estimate the transitions between the labor market states and the transitions from education or training to these states.

² http://ec.europa.eu/eurostat/web/microdata/european-union-statistics-on-income-and-living-conditions.


Evidently, for the EU-SILC data sets, the input to the labor market is represented by the transition from school to one of the following categories: full-time employment, part-time employment, unemployment and inactivity. These input probabilities are the conditional probabilities:

$p_{01}(t)$ = prob{an individual is full-time employed in December t − 1 | he or she was a pupil, a student, in further training or unpaid work experience in January t − 1};

$p_{02}(t)$ = prob{an individual is part-time employed in December t − 1 | he or she was a pupil, a student, in further training or unpaid work experience in January t − 1};

$p_{03}(t)$ = prob{an individual is unemployed in December t − 1 | he or she was a pupil, a student, in further training or unpaid work experience in January t − 1};

$p_{04}(t)$ = prob{an individual is inactive in December t − 1 | he or she was a pupil, a student, in further training or unpaid work experience in January t − 1}.
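The conditional probabilities above can be estimated as relative frequencies. The following is a minimal Python sketch, not the procedure actually used in the study: the function name `input_probabilities`, the state labels and the sample records are all hypothetical, and the denominator is restricted to respondents who had left education by December, which is an assumption made here so that the four probabilities sum to one.

```python
from collections import Counter

SCHOOL = "school"  # hypothetical label: pupil, student, further training, unpaid work experience
DESTINATIONS = ["full-time", "part-time", "unemployed", "inactive"]

def input_probabilities(records):
    """Estimate p01..p04 from (January state, December state) pairs."""
    # Keep only respondents who were in education in January and in one of
    # the four labor market states in December (an assumption of this sketch).
    leavers = Counter(dec for jan, dec in records
                      if jan == SCHOOL and dec in DESTINATIONS)
    total = sum(leavers.values())
    if total == 0:
        raise ValueError("no school-to-labor-market transitions in the data")
    return {state: leavers[state] / total for state in DESTINATIONS}

# Made-up records for illustration only, not EU-SILC data.
records = (
    [("school", "full-time")] * 2
    + [("school", "part-time")] * 1
    + [("school", "unemployed")] * 4
    + [("school", "inactive")] * 1
    + [("school", "school")] * 5         # still in education: ignored
    + [("unemployed", "full-time")] * 1  # not a school leaver: ignored
)

p0 = input_probabilities(records)
print(p0)
```

With these made-up records, eight respondents leave education, so the estimated vector is (0.25, 0.125, 0.5, 0.125). With real survey data, sampling weights would normally also enter the counts; that is omitted here for brevity.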

9.3. Results

For the estimation of the transition probabilities based on the EU-SILC data sets, Markov system theory is used and the focus is on young individuals aged between 15 and 29 years. In the EU-SILC questionnaire, there is a distinction between part-time and full-time employment that allows us to estimate the transitions among them. In this questionnaire, respondents state whether they are:
1) employed working full-time;
2) employed working part-time;
3) self-employed working full-time (including family worker);
4) self-employed working part-time (including family worker);
5) unemployed;
6) pupil, student, further training, unpaid work experience;
7) in retirement or in early retirement or has given up business;
8) permanently disabled or/and unfit to work;
9) in compulsory military or community service;
10) fulfilling domestic tasks and care responsibilities;
11) other inactive person.


These categories relate to the following labor market states:
1 → full-time employment (corresponding to the first and third categories);
2 → part-time employment (corresponding to the second and fourth categories);
3 → unemployment (corresponding to the fifth category);
4 → inactivity (corresponding to the eighth and 11th categories).

Thus, the state space is $S = \{1, 2, 3, 4\}$. The probabilities of the possible transitions for the years 2006–2013 are given by the sequence of transition probability matrices $\{P(t)\}_{t=2006}^{2013}$, where:

$$P(t) = \begin{pmatrix} p_{11}(t) & p_{12}(t) & p_{13}(t) & p_{14}(t) \\ p_{21}(t) & p_{22}(t) & p_{23}(t) & p_{24}(t) \\ p_{31}(t) & p_{32}(t) & p_{33}(t) & p_{34}(t) \\ p_{41}(t) & p_{42}(t) & p_{43}(t) & p_{44}(t) \end{pmatrix}, \quad t = 2006, \ldots, 2013 \tag{9.1}$$
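To make the construction of a matrix of the form [9.1] concrete, the sketch below maps questionnaire categories to the four states exactly as listed above and estimates each row of P(t) by row-normalizing transition counts. It is only an illustration under stated assumptions: the names `STATE_OF_CATEGORY` and `transition_matrix` and the sample pairs are hypothetical, categories not mentioned in the mapping are simply skipped, and an origin state with no observations yields a row of zeros.

```python
# Questionnaire category -> labor market state, as listed in the text.
STATE_OF_CATEGORY = {1: 1, 3: 1,   # full-time employment
                     2: 2, 4: 2,   # part-time employment
                     5: 3,         # unemployment
                     8: 4, 11: 4}  # inactivity

def transition_matrix(pairs, n_states=4):
    """Row-normalized counts of (January category, December category) pairs."""
    counts = [[0] * n_states for _ in range(n_states)]
    for jan_cat, dec_cat in pairs:
        i = STATE_OF_CATEGORY.get(jan_cat)
        j = STATE_OF_CATEGORY.get(dec_cat)
        if i is None or j is None:
            continue  # category outside the four states (e.g. education)
        counts[i - 1][j - 1] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Design choice of this sketch: an unobserved origin state gives zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Made-up category pairs for illustration only, not EU-SILC data.
pairs = [(1, 1), (1, 1), (1, 5), (3, 1),    # from full-time employment
         (2, 2), (2, 1),                    # from part-time employment
         (5, 5), (5, 1), (5, 5), (5, 11),   # from unemployment
         (11, 11), (8, 8),                  # from inactivity
         (6, 1), (1, 7)]                    # skipped: categories 6 and 7 are unmapped

P = transition_matrix(pairs)
for row in P:
    print(row)
```

For these made-up pairs, the first row is (0.75, 0, 0.25, 0): three of the four respondents who started in full-time employment stayed there and one became unemployed.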

As mentioned previously, EU-SILC has a strong advantage when compared to the EU-LFS, that of collecting income information, and a strong disadvantage, the considerably smaller number of cases. An additional advantage is that it provides information about transitions between part-time and full-time employment, unemployment and inactivity, and also from school to part-time and full-time employment. However, the small number of cases limits this kind of analysis in the majority of the countries, especially in the estimation of school-to-part-time employment, school-to-full-time employment or school-to-unemployment probabilities. To overcome this limitation, we grouped countries and examined them in more or less coherent groups. The criteria for this classification were derived from the general characteristics of the countries with regard to their type of welfare state. Esping-Andersen (1990) has provided a still-influential classification based on three types of welfare states: the liberal, exemplified by the United States, while in Europe the United Kingdom could be considered the most representative case; the corporatist–statist, based on a tripartite system of collective bargaining found in Germany, France and other continental countries; and the social democratic, identified in Scandinavian countries. Despite the validity and the influence of the typology suggested by Esping-Andersen, and despite the fact that he did not neglect nuances and specificities inside each type, major shortcomings were noticed, particularly linked to the restricted number of countries on which he based his quantitative and qualitative analysis. For the countries of Central and Eastern Europe, the explanation is simple since his declared interest was to examine the “worlds of welfare capitalism”. He


also neglected, however, the new EU Southern members, i.e. Greece, Spain and Portugal, which had specific characteristics that led scholars to claim a distinct Mediterranean welfare model (Ferrera 1996). Taking into account both Southern Europe and post-socialist Central and Eastern Europe, and taking some precaution against oversimplification, since particularly in the second case one can certainly define different subgroups, we classified countries by combining welfare state typologies with dominant employment/unemployment patterns, also keeping in mind the impact of the crisis on welfare policies and expenditure. We therefore opted for three inclusive categories:

– Southern European countries or countries with a sub-protective welfare regime (Gallie and Paugam 2000), defined either by rudimentary welfare structures compensated by strong family networks or by severe austerity policies linked to the 2008 financial and sovereign debt crisis: Greece, Ireland, Italy, Portugal, Spain and Cyprus.

– Postsocialist countries, including both cases considered as “successful” or “developing” countries regarding their welfare state (Fenger 2007): Bulgaria, Croatia, Czech Republic, Estonia, Hungary, Latvia, Lithuania, Poland, Romania, Serbia, Slovakia and Slovenia.

– Countries characterized by advanced welfare regimes, either of liberal, corporatist or social-democratic type: Austria, Belgium, Denmark, France, Germany, Luxembourg, the Netherlands, Switzerland and the United Kingdom.

Table 9.1 presents the estimates of the probabilities of remaining in full-time and part-time employment, unemployment and inactivity for these groups of countries. The full transition probability matrices can be found in Tables 9.2–9.4.
It is clear that in all three groups of countries, we have a decrease in the probabilities of remaining full-time employed and an increase in the probabilities of remaining part-time employed or unemployed, although the values of these probabilities are different for each group. For example, the probability that a young individual remains unemployed in the year 2013 is equal to 0.863 for the first group, 0.765 for the second and 0.682 for the third. The input probabilities to full-time/part-time employment, unemployment and inactivity are now estimated using the EU-SILC data sets (Table 9.5). It is clear that in the three groups, there is an increase in the probabilities of going straight from school to unemployment from 2008 to 2013. For the first group, 23.6% of new school leavers moved from school to unemployment in 2008, whereas in 2013 the figure was 37.2%. For the postsocialist countries, the increase is higher: 14.1% in 2008 and 40.1% in 2013. The increase is rather small for the countries belonging to the third group: 12.3% in 2008 and 16.9% in 2013. In Figure 9.1, the evolution of full-time and part-time employment probabilities for new school leavers is presented for the


years 2006–2013 for each group. For the first group, one can note a clear drop in the probability that a new school leaver finds full-time employment during these years and a simultaneous increase in the probability of moving into part-time employment. For the second group, one can distinguish a notable drop in the probability of moving to full-time employment, but a very small change in the probability of moving from school to part-time employment. These probabilities are low in all these years for the second group of countries. For the third group, the probability that a new school leaver moves to full-time employment did not change considerably between 2008 and 2013 (0.546 and 0.508, respectively). However, there is an increase in the probability of moving into part-time employment (0.187 and 0.223, respectively).

Transition probabilities

Group 1
Year   F_E → F_E   P_E → P_E   U → U   IA → IA
2006   0.942       0.867       0.791   0.888
2007   0.943       0.857       0.778   0.889
2008   0.943       0.854       0.796   0.899
2009   0.911       0.852       0.815   0.923
2010   0.912       0.840       0.809   0.899
2011   0.929       0.876       0.837   0.891
2012   0.924       0.865       0.863   0.914
2013   0.909       0.861       0.863   0.900

Group 2
Year   F_E → F_E   P_E → P_E   U → U   IA → IA
2006   0.936       0.707       0.491   0.720
2007   0.947       0.750       0.570   0.756
2008   0.943       0.747       0.509   0.825
2009   0.920       0.773       0.634   0.825
2010   0.904       0.788       0.714   0.816
2011   0.929       0.753       0.715   0.811
2012   0.930       0.764       0.694   0.799
2013   0.928       0.794       0.765   0.817

Group 3
Year   F_E → F_E   P_E → P_E   U → U   IA → IA
2006   0.938       0.838       0.669   0.818
2007   0.933       0.841       0.589   0.742
2008   0.925       0.758       0.581   0.724
2009   0.909       0.778       0.675   0.739
2010   0.904       0.804       0.674   0.765
2011   0.916       0.800       0.616   0.757
2012   0.911       0.800       0.623   0.775
2013   0.913       0.809       0.682   0.784

Source: EU-SILC data sets, 2006–2013

Table 9.1. Probability of remaining full-time employed, part-time employed, unemployed and inactive, EU-SILC, 2006–2013




Group 1

\[
P(2006) = \begin{pmatrix}
0.942 & 0.006 & 0.035 & 0.007 \\
0.048 & 0.867 & 0.051 & 0.016 \\
0.145 & 0.037 & 0.791 & 0.921 \\
0.032 & 0.010 & 0.022 & 0.921
\end{pmatrix}, \quad N = 16{,}408
\]

\[
P(2007) = \begin{pmatrix}
0.943 & 0.006 & 0.033 & 0.008 \\
0.067 & 0.857 & 0.038 & 0.025 \\
0.162 & 0.041 & 0.778 & 0.009 \\
0.043 & 0.018 & 0.018 & 0.912
\end{pmatrix}, \quad N = 17{,}964
\]

\[
P(2008) = \begin{pmatrix}
0.943 & 0.007 & 0.034 & 0.008 \\
0.059 & 0.854 & 0.042 & 0.030 \\
0.157 & 0.034 & 0.796 & 0.006 \\
0.045 & 0.018 & 0.021 & 0.911
\end{pmatrix}, \quad N = 19{,}333
\]

\[
P(2009) = \begin{pmatrix}
0.911 & 0.008 & 0.064 & 0.010 \\
0.044 & 0.852 & 0.064 & 0.010 \\
0.141 & 0.033 & 0.815 & 0.004 \\
0.050 & 0.011 & 0.010 & 0.923
\end{pmatrix}, \quad N = 17{,}367
\]

\[
P(2010) = \begin{pmatrix}
0.912 & 0.008 & 0.065 & 0.010 \\
0.052 & 0.840 & 0.071 & 0.011 \\
0.137 & 0.035 & 0.809 & 0.006 \\
0.046 & 0.017 & 0.025 & 0.899
\end{pmatrix}, \quad N = 19{,}298
\]

\[
P(2011) = \begin{pmatrix}
0.929 & 0.007 & 0.051 & 0.007 \\
0.037 & 0.876 & 0.056 & 0.010 \\
0.114 & 0.035 & 0.837 & 0.005 \\
0.037 & 0.019 & 0.044 & 0.891
\end{pmatrix}, \quad N = 19{,}914
\]

\[
P(2012) = \begin{pmatrix}
0.924 & 0.007 & 0.057 & 0.005 \\
0.044 & 0.865 & 0.055 & 0.016 \\
0.092 & 0.034 & 0.863 & 0.005 \\
0.037 & 0.021 & 0.025 & 0.914
\end{pmatrix}, \quad N = 19{,}693
\]

\[
P(2013) = \begin{pmatrix}
0.909 & 0.009 & 0.067 & 0.009 \\
0.036 & 0.861 & 0.075 & 0.011 \\
0.090 & 0.031 & 0.863 & 0.009 \\
0.029 & 0.014 & 0.037 & 0.900
\end{pmatrix}, \quad N = 19{,}630
\]

Source: EU-SILC, 2006–2013; own calculations

Table 9.2. Transition probability matrices, EU-SILC, 2006–2013, first group




Group 2

\[
P(2006) = \begin{pmatrix}
0.936 & 0.003 & 0.032 & 0.008 \\
0.163 & 0.707 & 0.031 & 0.037 \\
0.339 & 0.022 & 0.491 & 0.082 \\
0.059 & 0.005 & 0.022 & 0.904
\end{pmatrix}, \quad N = 22{,}251
\]

\[
P(2007) = \begin{pmatrix}
0.947 & 0.004 & 0.020 & 0.007 \\
0.182 & 0.750 & 0.027 & 0.009 \\
0.294 & 0.021 & 0.570 & 0.031 \\
0.062 & 0.007 & 0.017 & 0.904
\end{pmatrix}, \quad N = 25{,}373
\]

\[
P(2008) = \begin{pmatrix}
0.943 & 0.003 & 0.023 & 0.005 \\
0.152 & 0.747 & 0.020 & 0.044 \\
0.338 & 0.027 & 0.509 & 0.037 \\
0.064 & 0.006 & 0.013 & 0.908
\end{pmatrix}, \quad N = 29{,}835
\]

\[
P(2009) = \begin{pmatrix}
0.920 & 0.004 & 0.050 & 0.019 \\
0.146 & 0.773 & 0.044 & 0.007 \\
0.270 & 0.024 & 0.634 & 0.023 \\
0.094 & 0.019 & 0.046 & 0.825
\end{pmatrix}, \quad N = 23{,}839
\]

\[
P(2010) = \begin{pmatrix}
0.904 & 0.006 & 0.062 & 0.021 \\
0.119 & 0.788 & 0.045 & 0.014 \\
0.201 & 0.022 & 0.714 & 0.022 \\
0.088 & 0.012 & 0.048 & 0.816
\end{pmatrix}, \quad N = 28{,}482
\]

\[
P(2011) = \begin{pmatrix}
0.929 & 0.005 & 0.044 & 0.016 \\
0.161 & 0.753 & 0.049 & 0.016 \\
0.223 & 0.016 & 0.715 & 0.019 \\
0.096 & 0.015 & 0.044 & 0.811
\end{pmatrix}, \quad N = 29{,}280
\]

\[
P(2012) = \begin{pmatrix}
0.930 & 0.004 & 0.042 & 0.019 \\
0.120 & 0.764 & 0.048 & 0.024 \\
0.237 & 0.021 & 0.694 & 0.022 \\
0.095 & 0.012 & 0.055 & 0.799
\end{pmatrix}, \quad N = 27{,}879
\]

\[
P(2013) = \begin{pmatrix}
0.928 & 0.005 & 0.042 & 0.019 \\
0.116 & 0.794 & 0.045 & 0.021 \\
0.183 & 0.016 & 0.765 & 0.016 \\
0.105 & 0.010 & 0.041 & 0.817
\end{pmatrix}, \quad N = 29{,}369
\]

Source: EU-SILC, 2006–2013; own calculations

Table 9.3. Transition probability matrices, EU-SILC, 2006–2013, second group


Figure 9.1. Full-time and part-time employment probabilities for new school leavers, EU-SILC, 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip

We now focus on estimating the labor mobility of young individuals in Europe and its evolution for the years before and during the crisis. More specifically, four different relative mobility indices are calculated in order to reveal the extent of the young individuals’ transitions within the labor market system. Relative indices reveal the rate of labor-market fluidity, and the ones used in the present analysis are the following well-established mobility indices.

The Prais–Shorrocks mobility index:

\[
M_{PS} = \frac{1}{k-1}\,\bigl(k - \operatorname{tr}(P)\bigr)
\]

[9.2]

The immobility index:

\[
IM = \frac{\operatorname{tr}(P)}{k}
\]

[9.3]


The Bartholomew mobility index:

\[
M_{B} = \frac{1}{k} \sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}\,|i - j|
\]

[9.4]

The Prais–Bibby mobility index:

\[
M_{T} = 1 - \frac{\operatorname{tr}(P)}{k}
\]

[9.5]
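The four indices [9.2]–[9.5] can be sketched in a few lines of code. The example below is only an illustration: the function names are hypothetical, k is taken to be the number of labor market states (here 4), and the matrix entries are those read from the 2013 matrix of the first group in Table 9.2, with the state order assumed to be full-time employment, part-time employment, unemployment and inactivity.

```python
def trace(matrix):
    """Sum of the diagonal entries, tr(P)."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def prais_shorrocks(matrix):
    """M_PS = (k - tr(P)) / (k - 1), equation [9.2]."""
    k = len(matrix)
    return (k - trace(matrix)) / (k - 1)


def immobility(matrix):
    """IM = tr(P) / k, equation [9.3]."""
    return trace(matrix) / len(matrix)


def bartholomew(matrix):
    """M_B = (1/k) * sum_i sum_j p_ij * |i - j|, equation [9.4]."""
    k = len(matrix)
    total = sum(matrix[i][j] * abs(i - j) for i in range(k) for j in range(k))
    return total / k


def prais_bibby(matrix):
    """M_T = 1 - tr(P) / k, equation [9.5]."""
    return 1 - immobility(matrix)


# P(2013), first group, as read from Table 9.2 (state order assumed).
P_2013_GROUP1 = [
    [0.909, 0.009, 0.067, 0.009],
    [0.036, 0.861, 0.075, 0.011],
    [0.090, 0.031, 0.863, 0.009],
    [0.029, 0.014, 0.037, 0.900],
]

indices = {
    "M_PS": prais_shorrocks(P_2013_GROUP1),
    "IM": immobility(P_2013_GROUP1),
    "M_B": bartholomew(P_2013_GROUP1),
    "M_T": prais_bibby(P_2013_GROUP1),
}
```

Under these assumptions the results are approximately 0.156, 0.883, 0.169 and 0.117, which agree up to rounding with the 2013 values reported for the first group in the table of mobility indices. Since M_T = 1 − IM by definition, the two indices carry the same information.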

These mobility indices are estimated based on the transition probability matrices produced using the EU-SILC data sets for the years 2006–2013. The changes in the mobility indices for the three groups of countries are presented in Table 9.6. The line graphs in Figures 9.2–9.4 present the evolution of the mobility indices over the years 2006–2013. The differences between these groups of countries, when labor fluidity is considered, are evident. Small changes and low mobility are recorded for the first group of countries, higher mobility and bigger changes for the countries of the second group and smaller changes with the highest mobility recorded for the countries of the third group.

Figure 9.2. Mobility indices, Group 1, EU-SILC, 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip




Group 3

\[
P(2006) = \begin{pmatrix}
0.938 & 0.010 & 0.029 & 0.007 \\
0.057 & 0.838 & 0.040 & 0.035 \\
0.205 & 0.077 & 0.669 & 0.018 \\
0.052 & 0.022 & 0.025 & 0.891
\end{pmatrix}, \quad N = 15{,}004
\]

\[
P(2007) = \begin{pmatrix}
0.933 & 0.010 & 0.029 & 0.010 \\
0.074 & 0.841 & 0.029 & 0.021 \\
0.266 & 0.081 & 0.589 & 0.032 \\
0.068 & 0.016 & 0.016 & 0.881
\end{pmatrix}, \quad N = 20{,}835
\]

\[
P(2008) = \begin{pmatrix}
0.925 & 0.021 & 0.024 & 0.015 \\
0.098 & 0.758 & 0.035 & 0.072 \\
0.277 & 0.076 & 0.581 & 0.030 \\
0.071 & 0.024 & 0.016 & 0.870
\end{pmatrix}, \quad N = 21{,}422
\]

\[
P(2009) = \begin{pmatrix}
0.909 & 0.019 & 0.032 & 0.020 \\
0.105 & 0.778 & 0.035 & 0.040 \\
0.201 & 0.085 & 0.675 & 0.023 \\
0.103 & 0.066 & 0.035 & 0.739
\end{pmatrix}, \quad N = 21{,}034
\]

\[
P(2010) = \begin{pmatrix}
0.904 & 0.019 & 0.035 & 0.018 \\
0.083 & 0.804 & 0.039 & 0.039 \\
0.198 & 0.074 & 0.674 & 0.024 \\
0.083 & 0.060 & 0.035 & 0.765
\end{pmatrix}, \quad N = 21{,}872
\]

\[
P(2011) = \begin{pmatrix}
0.916 & 0.018 & 0.031 & 0.016 \\
0.088 & 0.800 & 0.037 & 0.030 \\
0.239 & 0.083 & 0.616 & 0.029 \\
0.102 & 0.066 & 0.026 & 0.757
\end{pmatrix}, \quad N = 22{,}065
\]

\[
P(2012) = \begin{pmatrix}
0.911 & 0.020 & 0.031 & 0.017 \\
0.095 & 0.800 & 0.039 & 0.025 \\
0.245 & 0.076 & 0.623 & 0.024 \\
0.102 & 0.057 & 0.023 & 0.775
\end{pmatrix}, \quad N = 22{,}652
\]

\[
P(2013) = \begin{pmatrix}
0.913 & 0.018 & 0.033 & 0.016 \\
0.092 & 0.809 & 0.035 & 0.027 \\
0.193 & 0.065 & 0.682 & 0.023 \\
0.084 & 0.059 & 0.023 & 0.784
\end{pmatrix}, \quad N = 20{,}609
\]

Source: EU-SILC, 2006–2013; own calculations

Table 9.4. Transition probability matrices, EU-SILC, 2006–2013, third group


Input probabilities

Group 1
Year   → F_E   → P_E   → U     → IA
2006   0.405   0.126   0.278   0.191
2007   0.488   0.204   0.205   0.103
2008   0.505   0.202   0.236   0.057
2009   0.439   0.158   0.280   0.123
2010   0.353   0.154   0.370   0.123
2011   0.417   0.177   0.316   0.090
2012   0.333   0.180   0.416   0.071
2013   0.314   0.200   0.372   0.114

Group 2
Year   → F_E   → P_E   → U     → IA
2006   0.614   0.052   0.229   0.105
2007   0.645   0.073   0.177   0.105
2008   0.695   0.065   0.141   0.099
2009   0.557   0.051   0.288   0.104
2010   0.427   0.062   0.427   0.084
2011   0.439   0.061   0.408   0.092
2012   0.481   0.056   0.368   0.053
2013   0.448   0.065   0.401   0.086

Group 3
Year   → F_E   → P_E   → U     → IA
2006   0.477   0.202   0.230   0.091
2007   0.571   0.134   0.134   0.161
2008   0.546   0.187   0.123   0.144
2009   0.572   0.171   0.136   0.121
2010   0.546   0.138   0.177   0.139
2011   0.541   0.171   0.164   0.124
2012   0.544   0.176   0.176   0.104
2013   0.508   0.223   0.169   0.100

Source: EU-SILC, 2006–2013; own calculations

Table 9.5. Input probabilities to full-time/part-time employment, unemployment and inactivity based on the EU-SILC, 2006–2013
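The comparison of school-to-unemployment probabilities between 2008 and 2013 discussed in the text can be reproduced from the → U column of Table 9.5. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the change is expressed in percentage points.

```python
# School-to-unemployment input probabilities (→ U) for 2008 and 2013,
# as read from Table 9.5.
TO_UNEMPLOYMENT = {
    "Group 1": {2008: 0.236, 2013: 0.372},
    "Group 2": {2008: 0.141, 2013: 0.401},
    "Group 3": {2008: 0.123, 2013: 0.169},
}


def change_in_points(series, start=2008, end=2013):
    """Change between two years, in percentage points, rounded to one decimal."""
    return round((series[end] - series[start]) * 100, 1)


changes = {group: change_in_points(series) for group, series in TO_UNEMPLOYMENT.items()}
largest_increase = max(changes, key=changes.get)
```

Under this reading, the increase is about 13.6 points for the first group, 26.0 for the second and 4.6 for the third, which matches the ordering described in the text.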

Figure 9.3. Mobility indices, Group 2, EU-SILC, 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip


Mobility indices

Group 1
Year   M_PS    M_T     M_B     IM
2006   0.170   0.128   0.207   0.872
2007   0.177   0.133   0.211   0.867
2008   0.169   0.127   0.201   0.873
2009   0.166   0.124   0.198   0.875
2010   0.180   0.135   0.206   0.865
2011   0.155   0.116   0.153   0.883
2012   0.144   0.108   0.167   0.891
2013   0.155   0.116   0.168   0.883

Group 2
Year   M_PS    M_T     M_B     IM
2006   0.282   0.212   0.443   0.714
2007   0.325   0.244   0.451   0.755
2008   0.325   0.244   0.390   0.756
2009   0.282   0.212   0.323   0.788
2010   0.259   0.194   0.291   0.805
2011   0.264   0.198   0.306   0.802
2012   0.271   0.203   0.310   0.796
2013   0.232   0.174   0.280   0.826

Group 3
Year   M_PS    M_T     M_B     IM
2006   0.245   0.185   0.298   0.815
2007   0.298   0.223   0.359   0.776
2008   0.337   0.253   0.371   0.747
2009   0.270   0.225   0.315   0.775
2010   0.284   0.213   0.310   0.786
2011   0.303   0.227   0.341   0.772
2012   0.297   0.222   0.337   0.777
2013   0.270   0.203   0.279   0.797

Source: EU-SILC, 2006–2013; own calculations

Table 9.6. Mobility indices using the EU-SILC data sets, 2006–2013

Figure 9.4. Mobility indices, Group 3, EU-SILC, 2006–2013. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip


9.4. Conclusions

The results presented in this chapter expose significant differences between the groups of European countries with respect to the transition probabilities of young individuals among labor market states, the school-to-part-time and school-to-full-time employment transition probabilities, the labor fluidity of young individuals and the way all of the above have progressed over time, during the years of the crisis. In the three groups of countries, a decrease in the probabilities of remaining full-time employed and an increase in the probabilities of remaining part-time employed are recorded. It is also clear that in the three groups of countries, there is an increase in the probabilities of going straight from school to unemployment from 2008 to 2013. For the postsocialist countries, the increase is higher than in the southern European countries, whereas the increase is rather small for the countries belonging to the third group. The differences between these groups of countries, when labor fluidity is considered, are also evident. Small changes and low mobility are recorded for the first group of countries, higher mobility and bigger changes for the countries of the second group and smaller changes with the highest mobility recorded for the countries of the third group.

9.5. References

Alvarez, E., Ciocchini, F., Konwar, K. (2008). A locally stationary Markov Chain model for labour dynamics. JDS., (7), 27–42.

Bartholomew, D.J. (1982). Stochastic Models for Social Processes. Wiley, London.

Bartholomew, D.J., Forbes, A.F., McClean, S.I. (1991). Statistical Techniques for Manpower Planning. Wiley, London.

Betti, G., Lemmi, A., Verma, V. (2007). A comparative analysis of school-to-work transitions in the European Union. Innovation, 18(4), 419–442.

Bosch, M., Maloney, W. (2007). Comparative analysis of labor market dynamics using Markov processes: An application to informality. IZA., 3038.

Brzinsky-Fay, C. (2007). Lost in transition? Labour Market Entry sequences of school leavers in Europe. Eur. Sociol. Rev., 23(4), 409–422.

Christodoulakis, G., Mamatzakis, C. (2009). Labour Market Dynamics in EU: a Bayesian Markov Chain Approach. Department of Economics, University of Macedonia. Discussion paper series, 07.

Esping-Andersen, G. (1990). Three Worlds of Welfare Capitalism. University Press, NJ.

Eurofound (2014). Mapping youth transitions in Europe. Publications Office of the European Union, Luxembourg.

Fenger, H.J.M. (2007). Welfare regimes in Central and Eastern Europe: Incorporating postcommunist countries in a welfare regime typology. Contemp. Issues Ideas Soc. Sci., 3(2), 1–30.

Ferrera, M. (1996). The southern model of welfare in social Europe. J. Eur. Soc. Policy, 6(1), 17–37.

Flek, V., Mysikova, M. (2015). Unemployment dynamics in central Europe: A labour flow approach. Prague Economic Papers, 24(1), 73–87.

Gallie, D., Paugam, S.E. (2000). Welfare Regimes and the Experience of Unemployment in Europe. Oxford University Press, Oxford.

Karamessini, M., Symeonaki, M., Stamatopoulou, G., Papazachariou, A. (2016). The careers of young people in Europe during the economic crisis: Identifying risk factors. https://blogg.hioa.no/negotiate/files/2015/04/NEGOTIATE-working-paper-noD3.2-The-careers-of-young-people-in-Eurpa-during-the-economic-crisis.pdf.

Pavlopoulos, D., Vermunt, J.K. (2015). Measuring temporary employment, do survey or register data tell the truth? Survey Methodology, 41(1), 197–214.

Symeonaki, M., Stamatopoulou, G. (2015). A Markov system analysis application on labour market dynamics: The case of Greece. IWPLMS, 22–24 June, Athens.

Symeonaki, M., Stamou, G. (2004). Theory of Markov systems with fuzzy states. Fuzzy Sets Syst., 143, 427–445.

Symeonaki, M., Stamou, G., Tzafestas, S. (2002). Fuzzy non-homogeneous Markov systems. Applied Intelligence, 17(2), 203–214.

Vassiliou, P.-C.G. (1982). Asymptotic behaviour of Markov systems. Appl. Probab., 19, 851–857.

Vassiliou, P.-C.G. (2013). Fuzzy semi-Markov migration process in credit risk. Fuzzy Sets Syst., 223(0), 39–58.

Vassiliou, P.-C.G., Symeonaki, M. (1999). The perturbed non-homogeneous Markov system. Linear Algebra Appl., 289(1–3), 319–332.

Ward-Warmedinge, M., Melanie, E., Macchiarelli, C. (2013). Transitions in labour market status in the European Union. LEQS, 69.

PART 3

Student Assessment and Employment in Europe

10 Almost Graduated, Close to Employment? Taking into Account the Characteristics of Companies Recruiting at a University Job Placement Office

In several areas worldwide, young people have faced increased employment vulnerability since the recession that followed the 2007–2008 crisis. In spite of a recent mild drop in youth unemployment, many graduates, both in developed and in developing countries, still find jobs whose quality falls below their expectations or experience long-term unemployment. In this context, graduate employability has emerged as a key topic for higher education institutions (HEI). So far, university system-level intervention, which in this work is examined for a specific Italian academic situation, has involved employers at distinct levels: first in academic governance and, more recently, in the direct interplay of the transition from HE to the labor market, by recruiting directly at university sites. In fact, Italian universities have become intermediaries between their graduates and potential employers, according to law 30/2003, which enabled job placement offices to play a key role in the interplay between graduates at the end of their university paths and companies in need of professionals. In this regard, we carried out a Computer-Assisted Web Interview (CAWI) survey of companies registered on the Almalaurea Portal for recruitment and linkage, namely with the Job Placement Office of the University of Milano-Bicocca. The results give some insight into companies’ structural characteristics as well as some of their attitudes, as prospective feedback to students and to their HEI from their counterparts, in the kaleidoscope of tertiary education.

10.1. Introduction

Chapter written by Franca CRIPPA, Mariangela ZENGA and Paolo MARIANI.

The financial crisis in 2007–2008 and the following recession exacerbated the vulnerability of youth employment (Schmitt 2008). Only recently has there been a mild recovery (ILO 2016a). Still, young people’s school-to-work transition stays difficult, as expectations often remain unmet when starting a profession, let alone the

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


time elapsed in finding a job altogether (ILO 2016b). This holds both in developed and in developing countries (Pukka 2016), and graduate employability has raised several issues for HEI. To increase the relevance of higher education provision for the productive reality, system-level intervention is twofold. On the one hand, academic governance to some extent includes employers, even though their presence is far from pervasive. On the other hand, according to law 30/2003 (Gazzetta Ufficiale 2016), job placement offices have been playing a key role in the interplay between new graduates and companies needing to fill specific positions. Administrative data sources on this theme are valuable for shedding light on educational processes, also with reference to other social actors, and thereby offer an inner viewpoint on youth social involvement. In this work, we focus on a specific Italian academic situation. In this regard, we consider companies registered on the Almalaurea Portal and referring to the University of Milano-Bicocca. Almalaurea is a university consortium for recruitment and linkage with the academic job placement offices, currently accounting for over 91% of university students in Italy. In June 2015, a CAWI was performed in the framework of an ongoing multicenter study, planned by the Department of Statistics, University of Padua, Italy. The University of Milano-Bicocca has been the first academic site, after the leading center of Padua, to complete the data processing among the adhering universities. Our results outline some characteristics of recruiters. In addition, we tried to capture some fragments of the communicative style between the productive reality and HEI education. From the students’ standpoint, this should carry both prospective empowerment and autonomy. The employability of a person and their capacity to adapt are linked to the way they are able to combine these different types of knowledge and build on them. In this context, individuals become the principal constructors of their own abilities (CEC 1995). Higher education is reckoned to provide great value in terms of youth growth and civic engagement, to which employment and earnings contribute, for young people and for entire communities, not only in terms of economic strength but also of quality of life (Kenyon 2009).

10.2. Recruiters and graduates seeking an HEI common ground

Social and human capital are highly interrelated, the former being apt to empower the latter (Coleman 1988). This also holds in the case of employability: there is no denying that tertiary education improves personal qualifications and increases the chances of obtaining a good job, whether a university program is vocational, i.e. oriented to prepare for a specific profession, or otherwise, as the modern economy highly values skills developed in less vocational majors. Programs for facilitating the transition from university to work have been developed by HEI, since law 30/2003 gave Italian universities the role of intermediaries between their graduates and the labor market. A systematic evaluation of these programs is not fully in place yet, while cross-national analysis has been explored at large (de Weert 2011). At any rate, permeability to the outside reality has been acknowledged to enhance the


academic integrity of HE (Crippa et al. 2011; Lowden et al. 2011). Within the previously mentioned extensive educational transformation, the University of Milano-Bicocca established its Job Placement Office in 2005, rectorial decree No. 0008765, June 29, 2004. Later, in line with the regulations of law L.183/2010 that reorganized the labor legislation, the academic site, represented by the Job Placement Office, joined FIxO (Fixo 2016), the governmental system for publishing on the web the anonymous curricula of both students and new graduates. In 2012, FIxO was turned into “School and University” (Scuola & Università), a wider program for reducing the gap between educational qualifications and positions in the workplace. At present, companies wishing to recruit at universities need to register on the Portal of Almalaurea. The latter is a website for recruitment and linkage, pooling databases from the Job Placement Offices of Italian universities. Our case is restricted to the University of Milano-Bicocca. Our survey belongs to a CAWI multicenter study, named ELECTUS (Education-for-Labour Elicitation from Companies’ Attitudes toward University Studies), planned by the University of Padua, Department of Statistical Sciences. At present, the latter has completed the whole data processing for 250 entrepreneurs of manufacturing companies and of industrial services of the Veneto region, Italy. The University of Milano-Bicocca, first among the other centers, has concluded the survey, with 471 companies responding. In this chapter, we first briefly address the issue of bias in web surveys, so as to assess the potential and the limitations of the study. Then, after describing the characteristics of recruiters adhering to the e-survey, we consider the elicitation experiment that is at the core of the ELECTUS project. The experiment applies conjoint measurement psychometric techniques (Luce and Tukey 1964; Krantz 1964; Luce 1977) with reference to the ranking of curricular characteristics for a set of specified professional positions.

10.3. Web survey pitfalls: considerations for data collection

Web survey features and validity issues are set in their correct perspective when keeping in mind that web surveys are self-administered, as no interviewers interact with respondents. Response rates are reported to be higher with electronic surveys than with paper surveys or interviews. In principle, the quicker the response time, the wider the magnitude of response. Still, this seems to apply only to the first few days; later on, response rates align with those of the other kinds of surveys (Opperman 1995). The Padua multicenter investigation imposed specific requirements for companies to enter the survey. After subsetting the target population registered on the Almalaurea portal, the Job Placement Office provided the whole mailing list and relevant information. We focus on non-sampling errors, which include all errors made in the process of providing answers. They can be divided into observation and non-observation errors. The former occur when collecting and recording information. They consist of the following: overcoverage, i.e. inclusion of units not belonging to the population; measurement errors, due to misunderstanding or untruthfulness in answering, leading to a gap between the true and the observed


value; processing effects, i.e. data imputation mistakes. They may also arise from question wording or, for example, from data entry. Observation errors are extraneous to our results too. In fact, recruiters’ characteristics are recorded according to standard procedures and, on the whole, they refer to objective aspects that are not easily misunderstood or eluded. Non-observation errors are errors that arise when not all planned measurements can be carried out. If some units of the target population are excluded due to the very way the survey is conducted, they never have a chance to answer. In web surveys, this phenomenon happens mainly when some units of the population do not have access to the Internet. This case does not apply to our survey, since all companies do have access to the Internet, having registered on Almalaurea online. A different type of non-observation error is non-response, occurring when a part of the elements in the sample (in the population, in our census investigation) does not provide information. Since the questionnaires are self-administered, web surveys have a potential for high non-response rates (Bethlehem 2010). In this respect, problems in dealing with the Internet add another source of non-response, due to connection, speed, costs and the like. Respondents can be excluded from the survey altogether, because they are not reached, or they may become frustrated and discontinue. Besides, self-selection is a potential source of bias too, even in a census survey. In fact, people who participate in the survey can be very different from other eligible units who do not answer the questionnaire. The generalizability of such observations will be limited.
The population of companies targeted in our census survey consisted of 4,183 potential recruiters registered on the Almalaurea Portal and referring to the University of Milano-Bicocca, namely to its Job Placement Office, which provided all the information necessary for carrying out the survey, as mentioned above.

Total available contacts            4,183
Completed e-questionnaires            471
Response rate                      11.26%
Uncompleted e-questionnaires          541
Untouched altogether                3,171
Composition:
  Undelivered mail                  1,060
  Unsent mail                          22
  Duplicated records                    6

Table 10.1. Return rate and specification of web survey results by response type

Undelivered mail accounted for 1,060 excluded statistical units. A further 28 cases were automatically excluded from the delivery itself, due to form flaws. The amount of delivered mail is computed as the difference between all units and the excluded ones, therefore accounting for 3,095 companies. Web questionnaires were administered from May 11 to June 11, 2015. After the first invitation mail, on May 11, non-respondents were solicited once a week for 3 weeks in a row (May 19,


May 26 and June 5). Completed questionnaires amounted to 471. Return details are summarized in Table 10.1. Even though unquestionably low, such a response rate is not uncommon in the literature (see Table 10.2).
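The return figures quoted above can be checked with simple arithmetic. The sketch below is illustrative: the variable names are hypothetical, the 28 automatically excluded cases are assumed to be the 22 unsent messages plus the 6 duplicated records, and the rate computed on delivered mail is a supplementary figure that the chapter itself does not report.

```python
# Counts as reported in the return-rate table and the surrounding text.
TOTAL_CONTACTS = 4183
COMPLETED = 471
UNCOMPLETED = 541
UNDELIVERED = 1060
UNSENT = 22       # assumed to be part of the 28 automatically excluded cases
DUPLICATES = 6    # assumed to be part of the 28 automatically excluded cases

# Delivered mail: all contacts minus the excluded units.
delivered = TOTAL_CONTACTS - UNDELIVERED - UNSENT - DUPLICATES

# Contacts that never opened the questionnaire.
untouched = TOTAL_CONTACTS - COMPLETED - UNCOMPLETED

# Response rate on all available contacts, in percent (the reported figure).
rate_on_contacts = round(100 * COMPLETED / TOTAL_CONTACTS, 2)

# Supplementary: response rate on delivered mail only, in percent.
rate_on_delivered = round(100 * COMPLETED / delivered, 2)
```

Under these assumptions, the delivered mail amounts to 3,095 companies, the untouched contacts to 3,171 and the response rate on all contacts to 11.26%, as reported; the rate on delivered mail alone would be about 15.2%.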

Survey                  Sample size   Response rate   Population
Couper (2001)                 7,000             62%   University of Michigan students
Asch (2001)                  14,150              8%   College-bound high school students and college students
Everingham (2001)             1,298             44%   RAND employees
Jones and Pitt (1999)           200             19%   University staff
Dillman et al. (1998)         9,522             41%   Purchasers of computer products
Dillman et al. (1998)         2,466             38%   Purchasers of computer products

Table 10.2. Return rate from CAWI (source: Schonlau et al. 2002)

Mixed-mode approaches and the related correction techniques (de Leeuw 2010) tend to improve response rates. They seem inappropriate in this case, though, owing to the complex structure of the questionnaire embedding the elicitation experiment. On the contrary, weighting for the characteristics of responding companies seems promising, as shown in a case study regarding Californians’ attitudes toward health care and health care providers, funded by the California Health Care Foundation (Schonlau et al. 2002). In the latter survey, self-selection led to regarding the results as a convenience sample, a statistical standpoint that lends itself to our data, markedly when the investigation is restricted to the so-called advanced tertiary economic sector, as services to companies are termed in Italy. In section 10.4, devoted to the descriptive analysis of our results, we provide the distribution of economic sectors, according to the international classification. Convenience sampling is considered most useful for a pilot analysis, where the goal is testing or measurement validation. To explore generalizable inferences, according to the case study, we need to consider some sort of correction. The Californians’ attitudes web survey, which did not employ a probability sample, had a response rate comparable to ours, since “of the 70,932 persons to whom an e-mail was sent, [....] 12 percent completed the survey” (Schonlau et al. 2002). Weights are derived exclusively through poststratification, matching the current population survey (CPS) for California with participants’ characteristics for variables derived from propensity scoring (PS). PS is a statistical technique that attempts to render two populations comparable, controlling for all variables regarded as affecting the comparison (Rosenbaum and Rubin 1983; Rosenbaum and Rubin 1984). The PS comparability issue requires an in-depth investigation that goes far beyond the present discussion. It is in fact relevant mainly for the preference elicitation experiment, which is only briefly sketched hereafter.
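To make the idea of poststratification weights concrete, the following minimal sketch computes one weight per stratum as the ratio between the population share and the respondent share. It is not the propensity-score procedure of Schonlau et al. (2002), only plain poststratification on a single variable. The respondent shares by company size are those reported in section 10.4; the population shares and all names are hypothetical and serve only as an illustration.

```python
def poststratification_weights(sample_shares, population_shares):
    """Weight for each stratum = population share / sample share."""
    weights = {}
    for stratum, sample_share in sample_shares.items():
        if sample_share <= 0:
            raise ValueError(f"stratum {stratum!r} has no respondents")
        weights[stratum] = population_shares[stratum] / sample_share
    return weights


# Respondent shares by company size (section 10.4).
SAMPLE_SHARES = {"small": 0.520, "medium": 0.256, "large": 0.224}

# Hypothetical population shares, invented for this illustration only.
POPULATION_SHARES = {"small": 0.60, "medium": 0.25, "large": 0.15}

weights = poststratification_weights(SAMPLE_SHARES, POPULATION_SHARES)

# After weighting, each stratum counts for its (assumed) population share.
weighted_shares = {s: weights[s] * SAMPLE_SHARES[s] for s in SAMPLE_SHARES}
```

Strata that are under-represented among respondents receive weights above 1 and over-represented strata receive weights below 1; whether such a correction is adequate for a self-selected sample depends on how well the stratifying variables capture the selection mechanism.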


10.4. Sampled recruiters: an outline

The survey was conducted using the commercial software Sawtooth® (www.sawtoothsoftware.com), which supports a choice-based conjoint experiment as well as the creation of an electronic questionnaire appropriate for such a collection technique. We used the same commercial software previously adopted by the University of Padua in order to keep the collection technique homogeneous. For the present statistical description, we applied IBM SPSS 23. Of all registrations to the Almalaurea portal referring to the Job Placement Office of the University of Milano-Bicocca, eligible companies were required to operate in the Lombardy region in the secondary sector or in the tertiary (service) sector, the latter providing services to the general population and to businesses. The quaternary sector included mainly information technology (IT). Companies had to satisfy the following requirements: being rooted in Lombardy, operating in the aforementioned sectors and having at least 15 employees. Among the 471 companies fully completing questionnaires, respondents were mostly entrepreneurs, partners or general directors (40.4%) or heads of company units (32.9%). Table 10.3 shows that a large proportion of the enterprises that interact directly with the Job Placement Office belong to the sector of services to other companies.

Sector                                 Percentage
Agriculture and food                          2.2
Construction                                  2.2
Tourism                                       2.4
Manufacturing                                14.9
Services for people and the family           16.2
Services for companies                       62.1
Total interviewed, absolute value: 471

Table 10.3. Sector of the company registering at Almalaurea for recruitment at the Job Placement Office, University of Milano-Bicocca

Companies’ profiles show that small size was prevalent (52%; 15–49 employees), followed by medium size (25.6%), ranging from 50 to 249 employees, and by large companies (22.4%) with 250 employees or more. The most represented activity sectors were services for companies (62.1%), services for people and the family (16.2%) and manufacturing (14.9%). The majority of companies (89.4%) operated fully or partially within the domestic market. Moreover, they were mainly under the management of the entrepreneur or a partner (63%). Entrepreneurs were mostly male (58.6%), younger than 49 years (76.1%) and had a high educational level. In fact, 18.8% of them held a secondary school diploma, 59.5% were graduates and 21.3% were postgraduates. The majority of the companies (94.2%) employed at least one graduate employee and, in the last 3 years, 90.7% of the entrepreneurs had been involved directly in the recruitment processes concerning HE graduates.

With respect to the recruitment process, the strategy for selecting HE graduates started by considering CVs sent directly to the company (76.8%), followed by searches in university databases such as Almalaurea (68.6%), then by internship (stage) experiences (43.2%) and specialized websites (38.4%). Although less frequently used, employment agencies (24.6%), social networks (20%) and recruiters (15.1%) were also common methods of job search. In general, the selection process appears complex and requires several employers’ attributes to be analyzed, taking into account the trade-offs among them (Marder 1999).

To provide a general idea of the complexity of the web survey, which prevented us from using a mixed-mode strategy, we sketch some results on the relative importance of six characteristics of HE graduate candidates: major field, degree level, final degree mark, English knowledge, previous work experience and willingness to travel on business. These are assessed with respect to five hypothetical job positions: administration clerk, marketing assistant, human resources (HR) assistant, customer relationships manager (CRM) and information and communication technology (ICT) professional. The choice of this pool of jobs is in line with the leading center, Padua. Candidate and job position traits are considered per se and not in a reciprocal trading situation. Table 10.4 shows the rankings of the profiles for the previously specified job positions.

| Preference | Administration clerk | Marketing assistant | HR assistant | CRM assistant | ICT professional |
|---|---|---|---|---|---|
| Major | 1 | 1 | 1 | 2 | 1 |
| Degree level | 3 | 4 | 3 | 4 | 3 |
| Degree mark | 5 | 6 | 5 | 6 | 4 |
| English knowledge | 4 | 2 | 4 | 3 | 2 |
| Work experience | 2 | 3 | 2 | 1 | 5 |
| Willingness to travel on business | 6 | 5 | 6 | 5 | 6 |

Table 10.4. Overall ranking of candidates’ characteristics for the five job positions in the preference elicitation experiment (values are ranks from 1 to 6, where 1 is the most important and 6 the least important characteristic)
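As a purely illustrative aid for reading Table 10.4, the rankings of two job positions can be compared with a Spearman rank correlation. This is not an analysis performed in the chapter; the function name and the dictionary layout are choices made for this sketch, and the ranks are those of Table 10.4 in row order (major, degree level, degree mark, English knowledge, work experience, willingness to travel).

```python
# Ranks from Table 10.4, one list per job position.
ranks = {
    "Administration clerk": [1, 3, 5, 4, 2, 6],
    "Marketing assistant":  [1, 4, 6, 2, 3, 5],
    "HR assistant":         [1, 3, 5, 4, 2, 6],
    "CRM assistant":        [2, 4, 6, 3, 1, 5],
    "ICT professional":     [1, 3, 4, 2, 5, 6],
}


def spearman(a, b):
    """Spearman rank correlation for two rankings without ties:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d2 / (n * (n * n - 1))


rho_admin_hr = spearman(ranks["Administration clerk"], ranks["HR assistant"])
rho_admin_marketing = spearman(ranks["Administration clerk"], ranks["Marketing assistant"])
```

A coefficient equal to 1 means that two positions rank the six characteristics identically, which is the case for the administration clerk and HR assistant columns.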

Across all positions considered, major field seemed to play a key role, while willingness to travel on business seemed to be the least relevant. English was especially important for positions in marketing and information and communication technology (ICT). Work experience was relevant for administrative and HR jobs, whereas it became far more meaningful for professions in customer relationship management (CRM) activities. Surprisingly, clerical positions in an administrative office and junior positions in an HR office showed the same ranking. Relevant characteristics at the preliminary phase of the recruitment process were, in order of importance, as follows: major field, previous work experience and degree level, whereas the less important characteristics were English knowledge, final degree mark and willingness to travel on business. For a junior marketing assistant position, entrepreneurs took into account the area of study, English knowledge and work experience. For a position as computer science technician, the order of preferred characteristics was as follows: field of study, English knowledge and degree. Finally, for a junior assistant manager position, entrepreneurs’ choices were based on work experience, field of study and English knowledge.

10.5. Conclusion

In youth civil and social engagement, entering a job represents a vital transition toward self-realization and self-reliance, with education and work both moving along a common trajectory. Current economic trends have posed a severe threat to a substantive effort in removing hindrances to the younger generation’s acquisition of their legitimate role. Nonetheless, the many actions undertaken so far prove to be of intrinsic and effective value, as they aim to allow both stakeholders, students and recruiters, to come to a commonly shared and, prospectively, openly discussed meeting ground. This process can offer them some sort of virtuous cycle, with feedback both to companies searching to fill graduate positions and to new graduates. The role of HEIs, in this context, is challenging, as they need to maintain a neutral stance, while mediating a communicative process that has been so long overlooked when not demeaned to “trivial matters”, as if being able to sustain oneself and to express one’s own potential could be alienated from fundamental rights. This implicitly enables employers to introduce themselves to the academic body, a first and crucial step in partnerships. The struggle undertaken by several countries to steer away from mismatch, and therefore from several social drawbacks, answers this need (de Weert 2011; Crippa and Civardi 2013).
Further analysis on eliciting preferences requires, besides statistical correction, additional insight into the complexity of the phenomena so as not to jump to hurried conclusions. For instance, one should not assume that credentials with low labor market outcomes have no value, especially in fields where jobs typically require advanced degrees (The Aspen Institute 2014). Similarly, gender issues need to be investigated with non-conventional analytical instruments (Mecatti et al. 2012).

10.6. References

Bethelem, J. (2010). Selection in web surveys. Int. Stat. Rev., 78(2), 161–188.

Crippa, F., Fabbris, L., Ferraresso, N. (2011). Il capitale umano dei laureati già lavoratori-studenti. In Criteri e indicatori per misurare l’efficacia delle attività universitarie, Fabbris, L. (ed.). Cleup, Padova, Italy, 65–104.

Crippa, F., Civardi, M. (2013). University outcomes and employability: Is a harmonisation feasible? Ital. J. Appl. Stat., 23(1), 37–50.

Coleman, J.S. (1988). Social capital in the creation of human capital. Am. J. Sociol., 94, 95–120.

Commission of the European Communities (CEC). (1995). White Paper on Education and Training, Teaching and Learning – Toward the Learning Society, http://eurlex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:51995DC0590&qid=1469309261982&from=EN.

Fixo. (2016). http://www.cliclavoro.gov.it/Progetti/Pagine/FIxO.aspx.

Gazzetta Ufficiale. (2016). 47 (27th February 2003). http://www.camera.it/parlam/leggi/03030l.htm.

ILO. (2016). Youth employment crisis easing but far from over (2014). http://www.ilo.org/global/about-the-ilo/newsroom/news/WCMS_412014/lang--en/index.htm.

ILO. (2016). World Employment Social Outlook. Trends 2016. http://www.ilo.org/wcmsp5/groups/public/---dgreports/---dcomm/---publ/documents/publication/wcms_443480.pdf.

Jansen, K.J., Corley, K.G., Jansen, B.J. (2006). E-Survey Methodology. In Handbook of Research on Electronic Surveys and Measurements, Gideon, L. (ed.), 1–8, IGI Global.

Kenyon, P. (2009). Partnership for Youth Employment. A review of selected community-based initiatives. Employment Working Paper No. 33.

Krantz, D.H. (1964). Conjoint measurement: The Luce-Tukey axiomatization and some extensions. J. Math. Psychol., 1(2), 248–277.

de Leeuw, E. (2010). Mixed-mode surveys and the internet. Surv. Pract., 3(6), 1–5.

Lowden, K., Hall, S., Elliot, D., Lewin, J. (2011). Employers’ Perceptions of the employability skills of new graduates, University of Glasgow, SCRE Centre and Edge Foundation, ISBN 978-0-9565604-3-8.

Luce, R.D. (1977). The choice axiom after twenty years. J. Math. Psychol., 15(3), 215–233.

Luce, R.D., Tukey, J.W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. J. Math. Psychol., 1(1), 1–27.

Marder, E. (1999). The assumptions of choice modelling: Conjoint analysis and SUMM? Can. J. Mark. Res., 18, 3–14.

Mecatti, F., Crippa, F., Farina, P. (2012). A special gen(d)er of statistics. Development and methodological prospects of gender statistic. Int. Stat. Rev., 80, 452–467.

Opperman, M. (1995). E-mail surveys–potentials and pitfalls. Mark. Res., 7(3), 29–33.

Pukka, J. (2015). Policy Measures to Support Higher Education Graduate Employment. http://ec.europa.eu/transparency/regexpert/index.cfm?do=groupDetail.groupDetailDoc&id=21624&no=3.

Rosenbaum, P.R., Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55.

Rosenbaum, P.R., Rubin, D.B. (1984). Reducing bias in observational studies using subclassification on the propensity score. J. Am. Stat. Assoc., 79(387), 516–524.

Schonlau, M., Fricker, R.D., Elliott, M.N. (2002). Conducting Research Surveys via E-mail and the Web. RAND Corporation, Santa Monica, CA. http://www.rand.org/pubs/monograph_reports/MR1480.html.

Schmitt, J. (2008). The decline of good jobs: How have jobs with adequate pay and benefits done? Challenge, 51(1), 5–25.

The Aspen Institute, From College to Jobs. (2014). Making Sense of Labor Market Returns to Higher Education. The Aspen Institute, College Excellence Program, Georgetown, Washington, D.C. U.S.A. https://cew.georgetown.edu/wpcontent/uploads/LaborMarketReturns_0.pdf.

de Weert, E. (2011). Perspectives on Higher Education and the labour market. Center for Higher Education Policy Studies – CHEPS. https://www.utwente.nl/bms/cheps/publications/Publications%202011/C11EW158%20Final%20version%20Themarapport%20onderwijs%20-%20arbeidsmarkt.pdf.

11 How Variation of Scores of the Programme for International Student Assessment can be Explained through Analysis of Information

The Programme for International Student Assessment (PISA) is a triennial international survey that aims at evaluating education systems worldwide by testing the skills of 15-year-old students in three competence fields. The present contribution analyzes variations of PISA scores from 2003 to 2012 in France and Germany for mathematics literacy through a multiplicative model whose choice of parameters was introduced by Zighera in 1985. Thanks to construction through repeated analysis of Kullback–Leibler divergence, the parameters are meaningful in terms of information for testing simultaneously both direct and crossed effects of all explanatory variables on the evolution of scores. In particular, Zighera’s method is shown to highlight evolution in the sociodemographic composition of the sampled population that may affect the observed evolution of performance of students.

Chapter written by Valérie GIRARDIN, Justine LEQUESNE and Olivier THÉVENON.

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.

11.1. Introduction

PISA, coordinated by the OECD, compares outcomes of learning in mathematics, reading and science literacy around the end of compulsory schooling in numerous countries. Surveys are conducted every 3 years with samples of 15-year-old pupils who answer many demographic and personal questions before the scored questions in three competence fields, one major and two minor. An OECD service is devoted to the development of questionnaires, management of surveys, analysis and interpretation of results through the diffusion of reports. The questionnaires do not test mastery of the school curriculum but how successfully pupils might cope with “everyday life” post-school situations; see Prais (2003). In Bodin (2006), the contents of PISA questionnaires in mathematics are shown to cover approximately only 15% of the programs of secondary school, where more than 85% of the questioned pupils are registered.

The reports provide data analysis primarily based on comparisons of mean levels or confidence intervals between countries, years, and other variables. A five-point margin of error is assigned to all mean scores, which makes, for instance, the widely publicized ranking of countries statistically doubtful. For well-argued critical studies, see Bodin (2009); Prais (2003); Rutkowski and Rutkowski (2016) and Wuttke (2007), among many others.

Since pupils are not being tracked individually, only relations between skills of pupils and various factors can be compared from one survey year to another. Most existing studies use decomposition methods of score differences, as developed by Oaxaca (1973) and Blinder (1973). These methods do not allow variations in the composition of a population, that is to say the simultaneous evolution of sociodemographic characteristics affecting school results, to be taken into account precisely. The present analysis aims to fill this gap by using a multiplicative modeling approach initiated in Zighera (1985).

Generally speaking, analysis of connections between qualitative variables essentially contributes to the understanding of a sociodemographic process. Its main interest consists of identifying whether two or several characteristics of a population are dependent or not, according to complex relations of causality. When the variable of interest is a behavior, one will seek to distinguish the time evolution of the structure of the population itself from the influence of the characteristics of this population on the studied behavior. More specifically, comparative analysis of two observed contingency tables leads us to consider one of the two tables’ distributions as reference, the other defining the marginal constraints.
The difficulty of the exercise comes with the growing number of variables and relations to be taken into account. From this point of view, multiplicative (also called log-linear) models allow various levels of interactions between variables to be considered for a better understanding of the nature of heterogeneity of the population. The model is obtained by minimizing the Kullback–Leibler divergence (KLD) – also called discrimination information – with respect to the reference among the distributions with the right margins. Although the obtained distribution is unique, different sets of parameters can be considered, because the marginal constraints are interdependent. Additional constraints have to be considered for ensuring identifiability (uniqueness), a precondition for sound statistical estimation of the parameters. In order to simplify the ensuing algebra, classical choices are a sum-to-zero constraint, or setting one parameter to zero for each modality.

With a different perspective, Zighera (1985) proposes a set of harmonic constraints linked to an analysis of the KLD of the model. The ensuing parameterization is studied in mathematical detail in Girardin et al. (2018), which will be summarized in section 11.2. The parameters are selected via simultaneous calculation, and hence more straightforwardly than in the backward elimination procedure detailed in Christensen (1990). The reference distribution can easily vary for considering variation of the observed distribution from a counterfactual situation. For sociodemographic data analysis, this fits especially well for studies of evolution over time of a single population. Usual parameterizations of multiplicative models retain only hierarchical associations: an interaction of higher level can be considered only if the terms of lower level are present in the model. Zighera’s parameterization overcomes this obstacle, which makes it more flexible than many classical approaches. Application to PISA scores will also show how Zighera’s approach highlights what precisely in the observed evolution can be explained by the composition of the samples.

In section 11.3, we will apply this method to analyze the evolution of mathematical skills of 15-year-old pupils, as measured by PISA in major years 2003 and 2012, in both France and Germany. While most existing studies of PISA results are based on interpreted data from the reports written by the OECD service, the following analysis is directly based on the raw PISA data sets, available as large ASCII files¹. This analysis will allow us to highlight which parts of variations and differences are due to the evolution of the structure of the population and effects of certain socioeconomic characteristics of this population, and which to the sampling method and its evolution from one survey to another. The population dynamics related to the evolution of school inequalities in mathematics will also appear to be different in the two countries.

11.2. Multiplicative models and Zighera’s parameterization

For a better understanding of the model, we will first briefly present classical results on multiplicative modeling for categorical data (see Christensen (1990) and Agresti (2002) for details).
We will then specialize to Zighera’s parameterization.

Multiplicative models apply to contingency tables of any dimension. However, to avoid cumbersome notation, this presentation will concern only three-dimensional tables. Usually, adjustment of the distribution with fixed marginals to the reference distribution is obtained by minimization of KLD. The ensuing model is multiplicative; a logarithmic transformation makes it additive – hence its alternative name of log-linear model. The obtained distribution is also the maximum likelihood estimator subject to the marginal constraints, a unique distribution with non-unique parameterization.

The approach of Zighera (1985), continued in Thévenon (2009), and thoroughly studied in Girardin et al. (2018) and Lequesne (2015), is based on constraints of identifiability related to analysis of information. What we call Zighera’s parameterization mainly allows for parameters meaningful in terms of information associated with an additive analysis of information of the model into marginal, cross-marginal and conditional effects – similar to analysis of variance (ANOVA).

1 Available at https://www.oecd.org/pisa/pisaproducts/.

A three-dimensional contingency table is the set of observations of three variables, say $X_1$, $X_2$ and $X_3$, with $I$, $J$ and $K$ modalities – or categories. The distribution of $(X_1, X_2, X_3)$, say $p \in \mathcal{P}$, has fixed margins $p^1 = (p_{i..})_i$ at order 1 and $p^{12} = (p_{ij.})_{(i,j)}$ at order 2, etc.; finally, $p_{...} = \sum_{ijk} p_{ijk} = 1$. The KLD of a distribution $p \in \mathcal{P}$ with respect to another $r \in \mathcal{P}$ is given by

$$K(p|r) = \sum_{ijk} p_{ijk} \log\left(p_{ijk}/r_{ijk}\right). \qquad [11.1]$$
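Definition [11.1] can be sketched in a few lines of code. The function below is an illustration only: it works on flattened lists of cell probabilities, uses the usual convention $0 \log 0 = 0$, and returns infinity when $p$ puts mass on a cell where $r$ is zero; the last choice is an assumption of this sketch, not something stated in the chapter.

```python
import math


def kld(p, r):
    """Kullback-Leibler divergence K(p|r) = sum_c p_c * log(p_c / r_c),
    for two distributions given as flat lists over the same cells."""
    total = 0.0
    for pc, rc in zip(p, r):
        if pc == 0.0:
            continue            # convention: 0 * log(0 / r) = 0
        if rc == 0.0:
            return math.inf     # assumption of this sketch: undefined case
        total += pc * math.log(pc / rc)
    return total


# Hypothetical two-cell example.
p_example = [0.5, 0.5]
r_example = [0.25, 0.75]
k_example = kld(p_example, r_example)   # equals 0.5 * log(4/3)
```

The value is zero when the two distributions coincide and positive otherwise, which is what makes the divergence usable for discriminating between distributions.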

For $K(p|r)$ to be well defined, set $p_{ijk} = 0$ whenever $r_{ijk} = 0$. Note that $K$ is convex and non-negative, null if and only if the two distributions are equal, which makes it a pertinent tool for discriminating between distributions (see Kullback (1959) and Cover and Thomas (1991)).

This discrimination information will be minimized at order 1 in the set $\mathcal{P}_m^{I}$ of all distributions with order 1 margins equal to the margins of a given distribution $m$; in mathematical words, $p_{i..} = m_{i..}$, $p_{.j.} = m_{.j.}$ and $p_{..k} = m_{..k}$ for all $i, j, k$. Similarly at order 2, minimization occurs in the set $\mathcal{P}_m^{II}$ of all distributions with order 1 and 2 margins equal to the margins of $m$. Note the nested structure, $\mathcal{P}_m^{II} \subset \mathcal{P}_m^{I}$.

The distribution $p$ that minimizes the information relative to a reference $r$ in $\mathcal{P}_m^{u}$, for $u = I$ or $II$, precisely such that $K(p|r) = \min_{q \in \mathcal{P}_m^{u}} K(q|r)$, takes the multiplicative form

$$p_{ijk} = \begin{cases} Z^{-1}\, r_{ijk}\, \Theta^1_i \Theta^2_j \Theta^3_k & \text{in } \mathcal{P}_m^{I}, \\ Z^{-1}\, r_{ijk}\, \Theta^1_i \Theta^2_j \Theta^3_k\, \Theta^{12}_{ij} \Theta^{13}_{ik} \Theta^{23}_{jk} & \text{in } \mathcal{P}_m^{II}, \end{cases} \qquad [11.2]$$

where $Z$ is the normalization constant depending on the parameters (see Csiszár 1975). The unique distribution $p$ can be computed, for instance, through the IPF (iterative proportional fitting) method initiated in Deming and Stephan (1940), together with one set of parameters $\Theta$. Since the constraints depend on each other, this set is not unique, and more constraints are necessary for ensuring uniqueness, that is, statistical identifiability. Classically, the parameters to be kept in the model are then chosen through successive goodness-of-fit tests in sequences of hierarchical models (see Christensen (1990) for details). For each tested submodel and related parameterization, all probabilities have to be estimated and test statistics computed.
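The IPF step mentioned here can be sketched for the order 1 case. This is a minimal illustration in plain Python, not the authors' program (which, according to the chapter, is written in C++): the function names, the fixed number of iterations and the two hypothetical 2 × 2 × 2 tables are assumptions made for the example.

```python
def margins(p):
    """Order 1 margins (p_i.., p_.j., p_..k) of a 3D table stored as nested lists."""
    I, J, K = len(p), len(p[0]), len(p[0][0])
    m1 = [sum(p[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    m2 = [sum(p[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    m3 = [sum(p[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    return m1, m2, m3


def ipf_order1(r, m, n_iter=200):
    """Rescale the reference table r until its order 1 margins match those of m.
    Each step multiplies the cells by a factor depending on one index only,
    so the result keeps the multiplicative form r * Theta1_i * Theta2_j * Theta3_k."""
    I, J, K = len(r), len(r[0]), len(r[0][0])
    t1, t2, t3 = margins(m)
    p = [[[r[i][j][k] for k in range(K)] for j in range(J)] for i in range(I)]
    for _ in range(n_iter):
        c1 = margins(p)[0]
        p = [[[p[i][j][k] * t1[i] / c1[i] for k in range(K)] for j in range(J)] for i in range(I)]
        c2 = margins(p)[1]
        p = [[[p[i][j][k] * t2[j] / c2[j] for k in range(K)] for j in range(J)] for i in range(I)]
        c3 = margins(p)[2]
        p = [[[p[i][j][k] * t3[k] / c3[k] for k in range(K)] for j in range(J)] for i in range(I)]
    return p


# Hypothetical reference table r and table m fixing the margins (both sum to 1).
r = [[[0.10, 0.15], [0.05, 0.20]], [[0.20, 0.05], [0.15, 0.10]]]
m = [[[0.05, 0.10], [0.10, 0.25]], [[0.15, 0.10], [0.10, 0.15]]]
p_fit = ipf_order1(r, m)
```

The fitted table has the order 1 margins of `m` while staying as close as possible to `r` in the sense of [11.1]; a fixed iteration count is used instead of a convergence test only to keep the sketch short.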


Moreover, when several submodels are not rejected, further tests are required for selecting a final parameterization. In contrast, Zighera’s parameterization allows for an all-in-one process, as we will show now, through a short presentation of the selection of its parameters. We refer the interested reader to Girardin et al. (2018) for all mathematical details and proofs.

The ANOVA method for linear models is well known to yield an analysis of the total variance in sums of model and residual variances. Then the most significant factors and interactions are selected through tests. Similarly, well-known properties of minimization of information yield, for any distribution $q$ with the prescribed margins, $K(q|r) = K(q|p) + K(p|r)$, where $K(p|r)$ is the information explained by model $p$ and $K(q|p)$ the residual information (see Csiszár 1975). The approach proposed in Zighera (1985) goes further by analyzing the total information $K(p|r)$ into a sum of information quantities measuring the marginal and crossed-effects of the explanatory variables.

Precisely, at order 1, the chosen parameters of the distribution [11.2] are the solution $\theta = (\theta^1, \theta^2, \theta^3)$ of Zighera’s constraints

$$\mathcal{Z}_m^{I}: \quad \sum_i \frac{m_{i..}}{\theta^1_i} = 1, \qquad \sum_j \frac{m_{.j.}}{\theta^2_j} = 1, \qquad \sum_k \frac{m_{..k}}{\theta^3_k} = 1.$$

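Note that these are harmonic constraints, and they make the model identifiable. Then we can write

$$\begin{aligned} K(p|r) &= -\log Z + \sum_i p_{i..} \log \theta^1_i + \sum_j p_{.j.} \log \theta^2_j + \sum_k p_{..k} \log \theta^3_k \\ &= -\log Z + \sum_i p_{i..} \log \frac{p_{i..}}{p_{i..}/\theta^1_i} + \sum_j p_{.j.} \log \frac{p_{.j.}}{p_{.j.}/\theta^2_j} + \sum_k p_{..k} \log \frac{p_{..k}}{p_{..k}/\theta^3_k} \\ &= -\log Z + K_1 + K_2 + K_3. \end{aligned} \qquad [11.3]$$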
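Decomposition [11.3] can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions: the reference table `r` and the starting parameters `Theta` are hypothetical, `p` is built directly in multiplicative form (so its own margins play the role of the fixed margins $m$), and the helper `rescale` is a name introduced here for the step that enforces the harmonic constraints.

```python
import math
from itertools import product


def rescale(Theta, margin):
    """Return theta = Theta * s with s = sum_i margin_i / Theta_i, so that
    the harmonic constraint sum_i margin_i / theta_i = 1 holds."""
    s = sum(m / t for m, t in zip(margin, Theta))
    return [t * s for t in Theta], s


# Hypothetical reference table r (2 x 2 x 2) and arbitrary positive parameters.
r = [[[0.10, 0.15], [0.05, 0.20]], [[0.20, 0.05], [0.15, 0.10]]]
Theta = [[1.5, 0.8], [0.7, 1.2], [1.1, 0.9]]
cells = list(product(range(2), repeat=3))

# p is proportional to r * Theta1_i * Theta2_j * Theta3_k.
u = {c: r[c[0]][c[1]][c[2]] * Theta[0][c[0]] * Theta[1][c[1]] * Theta[2][c[2]] for c in cells}
total = sum(u.values())
p = {c: v / total for c, v in u.items()}

# Order 1 margins of p (they play the role of the fixed margins m).
marg = [[sum(v for c, v in p.items() if c[a] == x) for x in range(2)] for a in range(3)]

theta, scale = [], []
for a in range(3):
    t, s = rescale(Theta[a], marg[a])
    theta.append(t)
    scale.append(s)

Z = total * scale[0] * scale[1] * scale[2]      # normalization constant for theta
K_a = [sum(m * math.log(t) for m, t in zip(marg[a], theta[a])) for a in range(3)]
K_total = sum(v * math.log(v / r[c[0]][c[1]][c[2]]) for c, v in p.items())
decomposition = -math.log(Z) + sum(K_a)         # right-hand side of [11.3]
```

With the rescaled parameters, `K_total` and `decomposition` agree up to rounding, and each entry of `K_a` is non-negative because it is itself a divergence of the form [11.1].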

For parameters satisfying $\mathcal{Z}_m^{I}$, all denominators are distributions, and hence all terms $K_a$, for $a = 1, 2, 3$, are KLD as defined by [11.1]. Testing $H_0$: “$K_a = 0$” against $H_1$: “$K_a \neq 0$” is equivalent to testing $H_0$: “$\theta^a = 1$” against $H_1$: “$\theta^a \neq 1$”, since $K_a$ is null if and only if $\theta^a = 1$. Taken together, these three tests yield the model closest to the data in terms of information.

Let us detail the procedure of testing, say for $X_1$. The marginal constraints are $p^n_{i..} = N^n_{i..}/n$, where $N^n_{i..}$ is the number of observations of $X_1 = i$ in the table with distribution $m$ that fixes the margins. An estimator $\theta^{(n)} = (\theta^{1(n)}, \theta^{2(n)}, \theta^{3(n)})$ solution of $\mathcal{Z}_m^{I}$ simply follows from the set $\Theta^{(n)}$ given by the IPF, by setting

$$\theta^{1(n)}_i = \Theta^{1(n)}_i \sum_{i'} \frac{m_{i'..}}{\Theta^{1(n)}_{i'}},$$

and so on. Then $K_1$ is estimated by

$$K^n_1 = \sum_i \frac{N^n_{i..}}{n} \log \theta^{1(n)}_i.$$

Under $H_0$, the test statistic $2nK^n_1$ converges to a $\chi^2(I-1)$-distribution. Under the alternative, it converges to the KLD of the true distribution relative to the null model, a positive quantity. A unique parameterization is thus obtained that retains only the parameters that are statistically significant in terms of information.
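A short check, not spelled out in the text, shows why multiplying the IPF parameters by the sum $\sum_{i'} m_{i'..}/\Theta^{1(n)}_{i'}$ gives a solution of the first harmonic constraint. The symbol $S^{(n)}$ is introduced here only for this verification:

```latex
S^{(n)} = \sum_{i'} \frac{m_{i'..}}{\Theta^{1(n)}_{i'}}, \qquad
\theta^{1(n)}_i = \Theta^{1(n)}_i \, S^{(n)}
\quad\Longrightarrow\quad
\sum_i \frac{m_{i..}}{\theta^{1(n)}_i}
= \frac{1}{S^{(n)}} \sum_i \frac{m_{i..}}{\Theta^{1(n)}_i}
= \frac{S^{(n)}}{S^{(n)}} = 1 .
```

Since each $\Theta^{a(n)}$ is only multiplied by a constant, the fitted distribution is unchanged; the constants are absorbed into the normalization $Z$.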

Similarly, at order 2, the two sets of Zighera’s constraints $\mathcal{Z}_m^{I}$ and

$$\mathcal{Z}_m^{II}: \quad \begin{cases} \sum_j m_{ij.}/\theta^{12}_{ij} = m_{i..}, \quad \sum_i m_{ij.}/\theta^{12}_{ij} = m_{.j.}, & i, j, \\ \sum_k m_{i.k}/\theta^{13}_{ik} = m_{i..}, \quad \sum_i m_{i.k}/\theta^{13}_{ik} = m_{..k}, & i, k, \\ \sum_k m_{.jk}/\theta^{23}_{jk} = m_{.j.}, \quad \sum_j m_{.jk}/\theta^{23}_{jk} = m_{..k}, & j, k, \end{cases}$$

induce the analysis

$$K(p|r) = -\log Z + K_1 + K_2 + K_3 + K_{\{1,2\}} + K_{\{1,3\}} + K_{\{2,3\}},$$

where the $K_a$ are order 1 divergences given by [11.3] and the $K_{\{a,b\}}$ are order 2 divergences. For instance, the interaction between $X_1$ and $X_2$ can thus be tested through $H_0$: “$K_{\{1,2\}} = 0$” or equivalently “$\theta^{12}_{ij} = 1$ for all $i, j$”. The margin constraints are $m^n_{ij.} = N^n_{ij.}/n$ where $N^n_{ij.}$ is the number of observations of $X_1 = i$, $X_2 = j$. Under $H_0$, the test statistic $2nK^n_{\{1,2\}}$, with $K^n_{\{1,2\}} = \sum_{i,j} (N^n_{ij.}/n) \log \theta^{12(n)}_{ij}$, converges to a $\chi^2((I-1)(J-1))$-distribution.

Unfortunately, no estimator of $\theta = (\theta^1, \theta^2, \theta^3, \theta^{12}, \theta^{13}, \theta^{23})$ can be deduced from the IPF procedure. Therefore, a numerical procedure based on the projected gradient method is developed in Girardin et al. (2018). All estimators and linked divergences are instantly computed through a program written in C++.

Further, an analysis of $K_{\{a,b\}}$ in terms of conditional information is induced by $\mathcal{Z}_m^{I}$ and $\mathcal{Z}_m^{II}$. For instance,

$$K_{\{1,2\}} = \sum_{ij} p_{ij.} \log \theta^{12}_{ij} = \sum_i p_{i..} \sum_j \frac{p_{ij.}}{p_{i..}} \log \frac{p_{ij.}/p_{i..}}{(p_{ij.}/p_{i..})/\theta^{12}_{ij}} = \sum_i p_{i..}\, K_{\{1,2\}/1=i}.$$

For parameters $\theta$ satisfying $\mathcal{Z}_m^{II}$, the divergence $K_{\{1,2\}/1=i}$ measures the crossed-effect between $X_1$ and $X_2$ for the category $i$ of $X_1$. Finally, if “$K_{\{1,2\}} = 0$” is rejected, then the interaction for a fixed category can be tested, for instance $H_0$: “$K_{\{1,2\}/1=i} = 0$” for a fixed $i$. Under $H_0$, the test statistic $2N^n_{i..}K^n_{\{1,2\}/1=i}$ converges to a $\chi^2(J-1)$-distribution.


Thus, Zighera’s approach leads to an all-in-one procedure for deciding which effects are significant in the model, with simultaneous computation of parameters and information. These parameters bear a meaning in terms of information of the model. They can also be interpreted in terms of odds ratios (see Girardin et al. 2018). In particular, the selected model may be non-hierarchical, while usual parameterizations of multiplicative models retain only hierarchical associations.

11.3. Application to PISA surveys

Zighera’s method is here applied to the evolution of PISA scores in mathematics between 2003 and 2012, first in France and then in Germany. The explanatory variables will be chosen to represent sociodemographic characteristics of pupils. The database and the choice of variables – gender, father’s occupational status (FOS) and parents’ country of birth – are detailed in section 11.3.1. In section 11.3.2, Zighera’s method leads to selecting for each country one descriptive model analyzing, through statistically significant criteria, the evolution of scores according to effects and crossed-effects of the three characteristics.

11.3.1. Data and variables

PISA services constitute a representative sample of 15-year-old pupils through a two-step stratified sample design. First, a minimum of 150 schools are selected with a probability of selection proportional to the school size. Replacement schools are also selected in case the schools first chosen do not respond. Then, a random sample of 35 pupils is chosen in each school. France and Germany have similar average school and pupil participation rates, around 90%. For precise rates, see OECD (2004, Annex A3) for 2003 and OECD (2013, Annex A2) for 2012. For questions on the sampling method, see, for example, Prais (2003) or Wuttke (2007).
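The two-step design described in this paragraph can be illustrated by a toy sketch. It is not the PISA sampling procedure, whose stratification and replacement rules are not detailed here: systematic sampling on cumulated sizes is one common way of selecting units with probability proportional to size, chosen for this example, and the school sizes, function names and sample sizes below are hypothetical.

```python
import random


def pps_systematic(sizes, n, rng):
    """Select n unit indices with probability proportional to size, using
    systematic sampling on the cumulated sizes (assumed technique)."""
    total = sum(sizes)
    step = total / n
    start = rng.random() * step          # random start in [0, step)
    chosen, cum, idx = [], 0.0, 0
    for h in range(n):
        point = start + h * step
        # advance to the unit whose cumulated-size interval contains `point`
        while idx < len(sizes) - 1 and cum + sizes[idx] <= point:
            cum += sizes[idx]
            idx += 1
        chosen.append(idx)
    return chosen


def two_stage_sample(school_sizes, n_schools, n_pupils, rng):
    """First stage: schools drawn proportionally to size.
    Second stage: simple random sample of pupils within each selected school.
    A school larger than the sampling step could be drawn twice; the
    hypothetical sizes below are all smaller than the step."""
    schools = pps_systematic(school_sizes, n_schools, rng)
    return {
        s: rng.sample(range(school_sizes[s]), min(n_pupils, school_sizes[s]))
        for s in schools
    }


# Hypothetical school sizes (number of 15-year-old pupils per school).
school_sizes = [120, 300, 80, 450, 200, 150, 90, 600, 240, 310, 60, 400]
rng = random.Random(0)
sample = two_stage_sample(school_sizes, n_schools=4, n_pupils=35, rng=rng)
```

Larger schools are more likely to be selected at the first stage, while the fixed number of pupils per school at the second stage roughly equalizes the overall selection probability of pupils.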
It is worth noting that samples of pupils are not representative of a particular school level because of disparities between countries with regard to the end of compulsory full-time schooling and to whether or not promotion to the next grade is automatic. In particular, the difference between the average scores obtained when sampling pupils by age and when sampling them by class might be something like 20 points (see Prais 2003). Indeed, 15-year-old pupils in the 10th school grade or above – “on time” with regard to school curriculum – represented 37.5% of the sample in Germany and 70.2% in France in 2012 (see OECD 2013). This makes abrupt comparison of score levels between countries doubtful, although it is often perceived by the general public as one of the main goals of PISA reports.

PISA evaluation is conducted every 3 years on three competence fields – learning in mathematics, reading and science literacy, with one domain chosen as major each time, which represents two-thirds of the evaluation. The share of linking items is the largest between major years: 84 of the 110 math items in 2012 were in common with 2003 versus 48 with 2006 and 35 with 2009 (Rutkowski and Rutkowski 2016). This led us to compare 2012 to 2003 for mathematics. Note that we do not compare performances of the two countries but evolution between the 2 years in each country. Still, in so doing, the two dynamics of evolution will appear as quite different.

The level of score in mathematics is represented by a variable, say $X_1$, and is clustered into seven classes defined by PISA. PISA reports show that the performance of pupils has decreased in France, with an average score of 511 in 2003 and 495 in 2012, and has increased in Germany from 503 in 2003 to 514 in 2012. Applying Zighera’s method, we aim at explaining this global evolution according to pertinent sociodemographic characteristics. For determining these variables, we have relied on both the raw database and the conclusions of PISA reports.

First, the score changes noticeably according to pupil’s gender, with better results for boys in both France and Germany, and a greater improvement of average German boys’ score than of girls’ score during the period (see Table 11.1).

| Country | Year | All pupils | Boys | Girls |
|---|---|---|---|---|
| France | 2003 | 511 | 515 | 507 |
| France | 2012 | 495 | 499 | 491 |
| Germany | 2003 | 503 | 508 | 499 |
| Germany | 2012 | 514 | 520 | 507 |

Table 11.1. Average scores in mathematics of French and German pupils by gender (OECD 2012a: Table I.2.3a, I.2.3b and I.2.3c)
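The score changes read from Table 11.1 can be tabulated with a few lines of code. The sketch only restates the figures of the table; the dictionary layout and the function name are choices made for this illustration.

```python
# Average PISA mathematics scores from Table 11.1.
scores = {
    ("France", 2003):  {"all": 511, "boys": 515, "girls": 507},
    ("France", 2012):  {"all": 495, "boys": 499, "girls": 491},
    ("Germany", 2003): {"all": 503, "boys": 508, "girls": 499},
    ("Germany", 2012): {"all": 514, "boys": 520, "girls": 507},
}


def change(country, group):
    """Difference between the 2012 and 2003 average scores."""
    return scores[(country, 2012)][group] - scores[(country, 2003)][group]


changes = {
    country: {g: change(country, g) for g in ("all", "boys", "girls")}
    for country in ("France", "Germany")
}
# France: -16 points for all pupils, boys and girls alike.
# Germany: +11 overall, +12 for boys and +8 for girls.
```

The uniform decrease in France and the larger gain of German boys compared with German girls are the raw differences that the multiplicative model then sets against changes in the composition of the samples.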

Second, PISA reports attest to a strong correlation between pupil’s performance and father’s occupational status; in particular, the web application Occupations@PISA2012² can be used for evaluating this correlation through average scores. Third, performance gaps in mathematics are also highlighted between pupils with at least one parent born in the country of the study and pupils declaring two parents born abroad; for instance, (OECD 2012b, Table II.3.6a) states that the average score in 2012 in France was 448 for second-generation pupils and 508 for native pupils. Taking into account these arguments, we have chosen to analyze the evolution of score in France and in Germany between 2003 and 2012 according to three sociodemographic variables coded as follows:

– $X_2$ = Gender: 1 – boy, 2 – girl;

– $X_3$ = Parents’ origin: 1 – both born in the country (0PBA), 2 – both born abroad (2PBA), 3 – one parent born abroad (1PBA);

– $X_4$ = Father’s occupational status (FOS): M – profession declared unclear or not declared, 1 – managers; professionals, 2 – technicians and associate professionals; clerical support workers; service and sales workers, 3 – skilled agricultural, forestry and fishery workers; craft and related trades workers, 4 – plant and machine operators and assemblers; workers in elementary occupations.

2 http://mi2.mini.pw.edu.pl:8080/SmarterPoland/PISAoccupations2012/.

In theory, any number of variables could be considered in a broad study. In practice, for a reasonable interpretation of all crossed-effects, the number of variables has to be limited – unless higher order KLD becomes very close to zero. In the present analysis, alternative choices would have been possible, for instance mother’s occupational status, or parents’ education level, instead of the classical FOS.

11.3.2. Analysis of scores in mathematics

We will apply Zighera’s method to the data set summarized in two four-dimensional contingency tables – available upon request, with marginals fixed at orders 1 and 2. For each analysis, the variable of interest is the score $X_1$. The variations of score can be analyzed through both direct effect of $X_1$ and through crossed-effects of $(X_1, X_2)$, $(X_1, X_3)$ and $(X_1, X_4)$. Further, variations in sociodemographic characteristics can also be analyzed by direct and crossed-effects of $X_2$, $X_3$, $X_4$.

11.3.2.1. Analysis of scores for France

For the comparison of results between 2003 and 2012 in France, the tests presented in section 11.2 lead to select the multiplicative model

$$p_{ijkl} = Z^{-1}\, r_{ijkl}\, \theta^1_i\, \theta^4_l\, \theta^{13}_{ik}\, \theta^{14}_{il}\, \theta^{24}_{jl}\, \theta^{34}_{kl}, \qquad [11.4]$$

with related parameters and information values presented in Table 11.2. At order 1, direct effects of variables $X_1$ and $X_4$ are retained, with the same marginal information $K_1 = K_4 = 0.0189$. Thus, the evolution of score is of the same order of magnitude in terms of information as the evolution of the distribution of FOS. With $\theta^{1}_{i} > 1$ for i = 0, 1, 2, the proportions of scores 0, 1 and 2 have increased in the sample, showing an overall decrease in score. With $\theta^{4}_{1} = 1.274$ and $\theta^{4}_{4} = 1.139$, the proportions of FOS 1 and 4 have increased in the sample. Precisely, margins show that 27.97% of the pupils declare FOS 1 in 2012 versus 23.58% in 2003, a relative increase of 18.6%.

Parameters and margins (margins in %)

| Score | i=0 | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | KLD |
|---|---|---|---|---|---|---|---|---|
| $\theta^{1}$ | 1.647 | 1.333 | 1.066 | 0.921 | 0.853 | 0.844 | 0.911 | 0.0189 |
| Margins (2003) | 4.85 | 10 | 20.45 | 25.60 | 23.35 | 11.91 | 3.83 | |
| Margins (2012) | 7.24 | 12.85 | 21.47 | 23.70 | 20.33 | 10.54 | 3.86 | |

| FOS | l=M | l=1 | l=2 | l=3 | l=4 | KLD |
|---|---|---|---|---|---|---|
| $\theta^{4}$ | 0.715 | 1.274 | 0.840 | 1 | 1.139 | 0.0189 |
| Margins (2003) | 12.56 | 23.58 | 27.67 | 21.19 | 15 | |
| Margins (2012) | 9.93 | 27.97 | 22.56 | 21.70 | 17.84 | |

Interaction Score–Parents’ origin

| $\theta^{13}$ | i=0 | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | KLD |
|---|---|---|---|---|---|---|---|---|
| k=1 | 0.922 | 0.998 | 0.967 | 1.021 | 1.005 | 1.063 | 0.969 | 0.0006 |
| k=2 | 1.084 | 0.977 | 1.091 | 0.891 | 0.936 | 0.600 | 3.080 | 0.0194 |
| k=3 | 1.195 | 1.053 | 1.091 | 0.981 | 1.008 | 0.806 | 0.698 | 0.0067 |
| KLD | 0.0046 | 0.0002 | 0.0015 | 0.0009 | 0.0002 | 0.0106 | 0.0549 | 0.0040 |

Interaction Score–FOS

| $\theta^{14}$ | i=0 | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | KLD |
|---|---|---|---|---|---|---|---|---|
| l=M | 0.924 | 0.934 | 1.123 | 1.047 | 0.996 | 0.797 | 1.132 | 0.0039 |
| l=1 | 1.140 | 1.129 | 0.924 | 0.930 | 1.090 | 1.043 | 0.816 | 0.0043 |
| l=2 | 1.048 | 1.149 | 1.042 | 0.943 | 1 | 0.788 | 1.533 | 0.0084 |
| l=3 | 0.994 | 0.912 | 0.979 | 1.180 | 0.805 | 1.298 | 1.178 | 0.0103 |
| l=4 | 1.022 | 0.954 | 0.998 | 0.982 | 1.053 | 1.151 | 0.858 | 0.0011 |
| KLD | 0.0017 | 0.0048 | 0.0016 | 0.0042 | 0.0057 | 0.0139 | 0.0357 | 0.0059 |

Interaction Gender–FOS

| $\theta^{24}$ | l=M | l=1 | l=2 | l=3 | l=4 | KLD |
|---|---|---|---|---|---|---|
| j=1 | 1.083 | 0.964 | 0.935 | 1.097 | 0.992 | 0.0020 |
| j=2 | 0.922 | 1.038 | 1.067 | 0.921 | 1.007 | 0.0017 |
| KLD | 0.0032 | 0.0007 | 0.0022 | 0.0038 | 0 | 0.0018 |

Interaction Parents’ origin–FOS

| $\theta^{34}$ | l=M | l=1 | l=2 | l=3 | l=4 | KLD |
|---|---|---|---|---|---|---|
| k=1 | 1.060 | 0.988 | 0.994 | 1.032 | 0.960 | 0.0004 |
| k=2 | 0.949 | 1.111 | 1.201 | 0.838 | 1.024 | 0.0083 |
| k=3 | 0.789 | 0.999 | 0.882 | 1.089 | 1.288 | 0.0103 |
| KLD | 0.0046 | 0.0006 | 0.0028 | 0.0034 | 0.0039 | 0.0027 |

Table 11.2. Order 1 (top) and order 2 (bottom) parameters of model [11.4] for France
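The margin comparison quoted above can be checked with a few lines of code. The following is a minimal sketch, not the chapter's procedure: it recomputes the relative increase of FOS 1 from the margins of Table 11.2 and, as an illustration only, a raw Kullback-Leibler divergence between the 2012 and 2003 FOS margins. The direction of the divergence (2012 relative to 2003) and the natural logarithm are assumptions, and the raw value need not coincide with the model-based information $K_4$, which comes from the fitted parameters.

```python
import math

# Margins (in %) of father's occupational status (FOS) for France, copied from Table 11.2.
fos_levels = ["M", "1", "2", "3", "4"]
margins_2003 = [12.56, 23.58, 27.67, 21.19, 15.00]
margins_2012 = [9.93, 27.97, 22.56, 21.70, 17.84]


def relative_change(before, after):
    """Relative change, in percent, from `before` to `after`."""
    return 100.0 * (after - before) / before


def kld(p, r):
    """Kullback-Leibler divergence of p with respect to r (natural logarithm).

    Both inputs are renormalized to sum to 1 before the computation.
    """
    p_total, r_total = sum(p), sum(r)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (ri / r_total))
        for pi, ri in zip(p, r)
        if pi > 0
    )


# FOS 1 is the second level: 23.58% in 2003 versus 27.97% in 2012.
fos1_change = relative_change(margins_2003[1], margins_2012[1])
# Assumed direction: 2012 distribution relative to the 2003 reference.
raw_kld = kld(margins_2012, margins_2003)

print(f"Relative change of FOS 1: {fos1_change:.1f}%")
print(f"Raw KLD between the 2012 and 2003 FOS margins: {raw_kld:.4f}")
```

The helper names `relative_change` and `kld` are hypothetical and used here for illustration only.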


Further, order 2 information analysis and tests show a significant interaction between score and FOS, with $K_{\{1,4\}} = 0.0059$. Variations of score are stronger for pupils declaring FOS 3, with $K_{\{1,4\}/4=3} = 0.0103$. Similarly, $\theta^{14}_{33}$, $\theta^{14}_{53}$ and $\theta^{14}_{63}$ are all greater than 1, showing that pupils with FOS 3 had a higher score in 2012 than in 2003.

However, questions arise from the comparison of the marginal distributions of the sample with the distribution of FOS in France³. For example, 27.97% of pupils declared FOS 1 in 2012, whereas only 11.1% of French men were managers and professionals in 2008³. In the same way, 21.70% of pupils declared FOS 3 in 2012, whereas only 10.9% of men were skilled agricultural, forestry and fishery workers, craft and related trades workers in 2008. Although tests do not select an order 1 effect of $X_2$, girls were overrepresented in the sample, precisely 52.72% in 2003 and 51.48% in 2012. Questions about the difference in response rates between boys and girls can be raised (see Wuttke (2007) for a discussion on disparities of gender distribution, and also on underrepresentation of pupils from vocational schools or on anomalous distribution of birth months).

Further, order 2 information $K_{\{1,3\}} = 0.0040$ shows that the evolution of score differs according to parents’ origin, especially for 2PBA, with conditional information $K_{\{1,3\}/3=2} = 0.0194$; indeed, with $\theta^{13}_{02} = 1.084$ and $\theta^{13}_{22} = 1.091$, an increased proportion of pupils with levels 0 and 2 was observed in 2012 in this category. Also, with $\theta^{13}_{62} = 3.080$, an increased proportion of pupils with level 6 was observed, suggesting that the level of pupils with a high score increased in this category. On the contrary, the level of pupils declaring 1PBA decreased, with $\theta^{13}_{i3} > 1$ for i = 0, 1, 2, 4, and the score level of native pupils (declaring 0PBA) did not evolve, with $\theta^{13}_{i1} \approx 1$ for all i. These results highlight a great heterogeneity in the evolution of score depending on the parents’ origin.

Besides the above analysis of scores, information and tests also reveal significant interactions between FOS and parents’ origin and between FOS and gender. First, $K_{\{3,4\}} = 0.0027$ shows that variations in the distribution of FOS depend on the parents’ origin, especially, with conditional information $K_{\{3,4\}/3=3} = 0.0103$, for 1PBA; precisely, $\theta^{34}_{34} = 1.288$ reveals that pupils declaring 1PBA also declared a greater proportion of FOS 4 in 2012 than in 2003, and margins show they constituted 17.34% of the database instead of 12.40%. Second, for example, with $\theta^{24}_{12} = 0.935 < 1 < \theta^{24}_{22} = 1.067$, a greater proportion of girls declared a middle-level profession for their fathers in 2012; again, this is confirmed by the database: girls accounted for 52.5% of pupils stating FOS 2 in 2012 versus 50.1% in 2003.

The already-known clear decrease in score levels of French pupils in mathematics over the observed period is confirmed. Further, the relationship between score level and parents’ origin is shown to have evolved considerably, especially for the pupils declaring 2PBA, showing a great increase in the heterogeneity of score results. Such a sociodemographic evolution of the samples seems too strong for the relatively short period of the survey; questions on the sampling methods on this particular point should be further analyzed.

3 According to INSEE Table https://www.insee.fr/fr/statistiques/1373359?sommaire=1373438.

11.3.2.2. Analysis of mathematics for Germany

For the comparison of results in Germany, the tests presented in section 11.2 lead to selecting the multiplicative model

$$p_{ijkl} = Z^{-1}\, r_{ijkl}\, \theta^{1}_{i}\, \theta^{3}_{k}\, \theta^{4}_{l}\, \theta^{12}_{ij}\, \theta^{13}_{ik}\, \theta^{14}_{il}\, \theta^{24}_{jl}\, \theta^{34}_{kl}, \qquad [11.5]$$

with related parameters and information values given in Table 11.3. At order 1, direct effects of variables $X_1$, $X_3$ and $X_4$ are retained. The evolution of FOS is about three times more marked than the evolution of score levels, with $K_4 = 0.0147$ and $K_1 = 0.0051$, which is confirmed by parameters and margins. For example, with $\theta^{4}_{4} = 1.275$, the proportion of pupils declaring FOS 4 has increased in the sample, precisely 15.72% in 2012 versus 12.60% in 2003. Conversely, with $\theta^{4}_{3} = 0.816$, FOS 3 has greatly decreased: 21.57% in 2012 versus 26.46% in 2003. With $\theta^{1}_{i} > 1$ for i = 2, 5, 6, the proportions of pupils with these levels have increased while level 0 has decreased, thus highlighting an improvement of score results. Differences of margins between 2003 and 2012, however, indicate that this evolution is much less important than that of FOS.

Further, with $K_{\{1,4\}} = 0.0153$, order 2 analysis shows that variations of score differ according to the FOS. With $K_{\{1,4\}/4=M} = 0.0322$, analysis of information shows a stronger interaction for pupils not declaring FOS. The proportion of those pupils clearly increased between 2003 and 2012, with $\theta^{4}_{M} = 1.248$. In 2012, they represented 20.55% of the sample, which is too high a proportion for a correct interpretation of a possible correlation between the score level and the FOS. Note that only 9.93% of the French pupils’ FOS were missing in 2012. This issue of missing FOS has already been noted, for instance in Goussé and Le Donné (2015). More generally, the high proportion of missing responses may be due to the translation of PISA questionnaires into German, because items were originally conceived simultaneously in English and French (Rey 2011; Wuttke 2007).

Still at order 1, with $K_3 = 0.0034$, the distribution of parents’ origin has evolved markedly; in particular, with $\theta^{3}_{2} = 0.817$, the proportion of pupils with 2PBA decreased. Note that this effect was not retained for France. This evolution may come from an increase in mixed marriages (Bourgeois 2006). Moreover, an important migratory flow has been observed in Germany since the 1960s, which reached a maximum in the 1990s (Münz and Ulrich 1998).

At order 2, with $K_{\{1,3\}} = 0.0043$, the variation of score is also linked to the parents’ origin, especially when at least one parent was born abroad, with conditional information $K_{\{1,3\}/3=2} = 0.0218$ and $K_{\{1,3\}/3=3} = 0.0310$. Indeed, the level of pupils declaring 2PBA has increased, with $\theta^{13}_{i2} > 1$ for i ≥ 3, while the level of pupils declaring 1PBA has decreased, with $\theta^{13}_{i3} > 1$ for i ≤ 2. On the contrary, conditional information $K_{\{1,3\}/3=1} = 0.0001$ and parameters $\theta^{13}_{i1} \approx 1$ for all i show that scores have not changed for native pupils.

Parameters and margins (margins in %)

| Score | i=0 | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | KLD |
|---|---|---|---|---|---|---|---|---|
| $\theta^{1}$ | 0.698 | 1.002 | 1.084 | 1.004 | 0.998 | 1.050 | 1.064 | 0.0051 |
| Margins (2003) | 8.24 | 11.53 | 17.99 | 23.40 | 21.80 | 12.26 | 4.79 | |
| Margins (2012) | 6.17 | 11.89 | 19.61 | 23.47 | 21.35 | 12.55 | 4.96 | |

| Origin | k=1 | k=2 | k=3 | KLD |
|---|---|---|---|---|
| $\theta^{3}$ | 1.018 | 0.817 | 1.196 | 0.0034 |
| Margins (2003) | 81.35 | 13.60 | 5.05 | |
| Margins (2012) | 82.70 | 11.16 | 6.13 | |

| FOS | l=M | l=1 | l=2 | l=3 | l=4 | KLD |
|---|---|---|---|---|---|---|
| $\theta^{4}$ | 1.248 | 0.924 | 0.956 | 0.816 | 1.275 | 0.0147 |
| Margins (2003) | 17.48 | 21.13 | 22.33 | 26.46 | 12.60 | |
| Margins (2012) | 20.55 | 20.20 | 21.96 | 21.57 | 15.72 | |

Interaction Score–Gender

| $\theta^{12}$ | i=0 | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | KLD |
|---|---|---|---|---|---|---|---|---|
| j=1 | 1.019 | 0.898 | 0.928 | 1.086 | 1.001 | 1.020 | 1.053 | 0.0021 |
| j=2 | 0.982 | 1.117 | 1.069 | 0.923 | 0.998 | 0.976 | 0.917 | 0.0021 |
| KLD | 0.0002 | 0.0059 | 0.0025 | 0.0033 | 0 | 0.0002 | 0.0022 | 0.0021 |

Interaction Score–Parents’ origin

| $\theta^{13}$ | i=0 | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | KLD |
|---|---|---|---|---|---|---|---|---|
| k=1 | 1.040 | 1.013 | 1.007 | 0.982 | 1.008 | 0.989 | 0.987 | 0.0001 |
| k=2 | 0.833 | 0.882 | 0.812 | 1.307 | 1.157 | 1.171 | 1.701 | 0.0218 |
| k=3 | 1.367 | 1.225 | 1.361 | 0.796 | 0.744 | 1.036 | 0.856 | 0.0310 |
| KLD | 0.0076 | 0.0030 | 0.0060 | 0.0053 | 0.0032 | 0.0007 | 0.0068 | 0.0043 |

Interaction Score–FOS

| $\theta^{14}$ | i=0 | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | KLD |
|---|---|---|---|---|---|---|---|---|
| l=M | 0.553 | 0.808 | 1.175 | 1.156 | 1.229 | 1.063 | 1.288 | 0.0322 |
| l=1 | 1.840 | 0.710 | 1.029 | 1.036 | 1.017 | 0.996 | 0.954 | 0.0048 |
| l=2 | 1.514 | 1.408 | 0.855 | 0.875 | 1.029 | 1.113 | 0.906 | 0.0141 |
| l=3 | 1.523 | 0.985 | 0.937 | 1.077 | 0.894 | 0.979 | 1.096 | 0.0073 |
| l=4 | 1.519 | 1.242 | 1.038 | 0.865 | 0.885 | 0.706 | 1.189 | 0.0196 |
| KLD | 0.1223 | 0.0263 | 0.0065 | 0.0065 | 0.0055 | 0.0066 | 0.0062 | 0.0153 |

Interaction Gender–FOS

| $\theta^{24}$ | l=M | l=1 | l=2 | l=3 | l=4 | KLD |
|---|---|---|---|---|---|---|
| j=1 | 1.024 | 0.892 | 1.047 | 1.008 | 1.042 | 0.0017 |
| j=2 | 0.971 | 1.126 | 0.959 | 0.992 | 0.958 | 0.0019 |
| KLD | 0.0004 | 0.0068 | 0.0010 | 0 | 0.0009 | 0.0018 |

Interaction Parents’ origin–FOS

| $\theta^{34}$ | l=M | l=1 | l=2 | l=3 | l=4 | KLD |
|---|---|---|---|---|---|---|
| k=1 | 1.148 | 1 | 0.960 | 0.947 | 0.929 | 0.0029 |
| k=2 | 0.321 | 0.885 | 1.779 | 1.139 | 1.245 | 0.1175 |
| k=3 | 0.469 | 1.111 | 0.917 | 1.530 | 1.065 | 0.0539 |
| KLD | 0.0605 | 0.0007 | 0.0139 | 0.0079 | 0.0072 | 0.0185 |

Table 11.3. Order 1 (top) and order 2 (bottom) parameters of model [11.5] for Germany

Analysis of information and tests at order 2 also reveals significant interactions between FOS and parents’ origin, with $K_{\{3,4\}} = 0.0185$. Precisely, conditional information values $K_{\{3,4\}/3=2} = 0.1175$ and $K_{\{3,4\}/3=3} = 0.0539$ show that the evolution of the FOS distribution in the sample depends on the parents’ origin, and is especially strong when both are born abroad. For example, $\theta^{34}_{22} = 1.779$ indicates that pupils declaring 2PBA also declared a greater proportion of FOS 2 in 2012 than in 2003, precisely 18.28% in 2012 versus 11.04% in 2003.

Finally, with $K_{\{1,2\}} = 0.0021$ and $K_{\{2,4\}} = 0.0018$, the evolution of both score level and distribution of FOS depends on gender. First, with $\theta^{12}_{i1} > 1$ for i > 3, boys’ level increased between 2003 and 2012, whereas girls’ level decreased, with $\theta^{12}_{i2} > 1$ for i = 1, 2; thus, the gap between boys and girls widened. Second, with $\theta^{24}_{21} = 1.126$, a larger proportion of girls indicated FOS 1 in 2012 than in 2003. Accordingly, the database of pupils indicating FOS 1 contained 51.7% girls in 2012 and only 46.9% in 2003.

The above analysis shows that the increase in score of German pupils is closely linked to all considered sociodemographic variables through crossed-effects. Indeed, PISA reports have shown that Germany is one of the countries in which educational achievement is the most closely correlated with the socioeconomic environment (Rey 2011). Apart from the analysis of the evolution of scores, Zighera’s method has allowed us to point out important variations in the distribution of the FOS in the sample. In particular, order 2 information and parameters show that these variations strongly depend on the parents’ origin. As for France, the strong evolution revealed in the sociodemographic composition of the sample raises questions.

To go further, it would be of interest to compare the evolution of PISA results of German pupils with that of Czech pupils, known to achieve similar performances (Greger 2012). Technically, Zighera’s method would allow a comparison between the German and Czech data tables for 2012 (or any other year), instead of comparing the 2012 table to the 2003 table in each country.

11.3.3. Conclusion

Zighera’s parameterization of multiplicative models is shown in Girardin et al. (2018) to have all required statistical properties for analyzing categorical data with any number of variables. Because of the nested structure of the model and structural properties of the KLD, Zighera’s method leads to parameters that are meaningful in terms of information. Moreover, computation of all parameters and information rests on a single process, thus yielding an all-in-one procedure that provides, through simultaneous tests, the model that best fits the data.
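The multiplicative form shared by models [11.4] and [11.5] can be illustrated on a small example. The sketch below is an assumption-laden toy, not Zighera's estimation procedure: it takes a hypothetical 2 × 2 reference table r, hypothetical direct and crossed parameters, and returns the normalized table p. A parameter greater than 1 inflates the corresponding cells relative to the reference, which is how the chapter reads the values in Tables 11.2 and 11.3.

```python
# Toy reference distribution r (playing the role of the earlier survey),
# with 2 levels for each of two variables. All numbers are hypothetical.
r = [[0.30, 0.20],
     [0.25, 0.25]]

# Hypothetical parameters, chosen only for illustration.
theta_1 = [1.2, 0.9]            # direct effect of the first variable
theta_4 = [0.8, 1.1]            # direct effect of the second variable
theta_14 = [[1.05, 0.95],       # crossed effect of the two variables
            [0.97, 1.02]]


def multiplicative_model(r, theta_row, theta_col, theta_cross):
    """Return p[i][l] = r[i][l] * theta_row[i] * theta_col[l] * theta_cross[i][l] / Z."""
    unnormalized = [
        [r[i][l] * theta_row[i] * theta_col[l] * theta_cross[i][l]
         for l in range(len(theta_col))]
        for i in range(len(theta_row))
    ]
    z = sum(sum(row) for row in unnormalized)  # normalizing constant Z
    return [[value / z for value in row] for row in unnormalized]


p = multiplicative_model(r, theta_1, theta_4, theta_14)
for row in p:
    print([round(value, 4) for value in row])
```

When every parameter equals 1, the function returns the reference table unchanged, which matches the reading of parameters close to 1 as "no evolution".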


Here, Zighera’s multiplicative approach has been used to analyze the evolution of mathematical literacy levels in PISA surveys of French and German pupils between 2003 and 2012. Both analyses clearly confirm that the score level of French pupils has decreased, whereas that of German pupils has increased during the observed period. Still, variations are shown to be driven by very different dynamics linked to the FOS and parents’ origin.

Although the choice of the three sociodemographic variables retained in this study originally aimed at explaining the evolution of score, the results reveal changes in the sociodemographic composition of the samples that are as important as those in the score levels, thus showing Zighera’s method to be a useful tool for controlling the statistical quality of samples. The relationships between observed variations in the sample and evolution of the whole population would deserve further study. It would be necessary to specify what, in the education system or family practices, produces such differences in the evolution of score according to the socioeconomic profile. Studying variations of PISA scores according to the school grade of pupils may be of great interest.

Note that Zighera’s method is applicable to many other fields and has proven to be especially useful for very large samples. For example, Thévenon (2009) takes this approach to explore European data on women’s labor force participation in order to analyze the effects of family characteristics on women’s activity behavior. The method would also fit the evaluation of public policies by comparing the distributions of population before and after a reform.

11.4. References

Agresti, A. (2002). Categorical Data Analysis, 2nd edition. John Wiley & Sons, Hoboken, New Jersey.
Blinder, A. (1973). Wage discrimination: Reduced form and structural estimates. J. Hum. Resour., 8, 436–455.
Bodin, A. (2006). Ce qui est vraiment évalué par PISA en mathématiques. Ce qui ne l’est pas. Un point de vue français, Conférence Franco-Finlandaise sur PISA. Bull. de l’APMEP, 463.
Bodin, A. (2009). L’étude PISA pour les mathématiques : résultats français et réactions. Gazette des mathématiciens, Société Mathématique de France, 120.
Bourgeois, I. (2006). Démographie : l’Allemagne est une société métissée, Regards sur l’économie allemande. Bull. écon. du CIRAC, 77, 38–39.
Christensen, R. (1990). Log-linear Models. Springer-Verlag, New York.
Cover, T.M., Thomas, J.A. (1991). Elements of Information Theory. John Wiley & Sons, New York.
Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3, 146–158.
Deming, W.E., Stephan, F.F. (1940). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann. Math. Stat., 11, 427–444.
Girardin, V., Lequesne, J., Ricordeau, A. (2018). Information-based parameterization of the log-linear model for categorical data analysis. Methodol. Comput. Appl. Probab., 20, 1105–1121.
Goussé, M., Le Donné, N. (2015). Why do Inequalities in 15-year-old Cognitive Skills Increase in France between 2000 and 2009? Les inégalités scolaires d’origines sociales et ethno-culturelles, CNESCO, Paris.
Greger, D. (2012). When PISA does not matter? The case of the Czech Republic and Germany. Hum. Aff., 22, 31–42.
Kullback, S. (1959). Information Theory and Statistics. John Wiley & Sons, New York.
Lequesne, J. (2015). Tests statistiques basés sur la théorie de l’information. Applications en biologie et en démographie. PhD Thesis, Université de Caen Normandie, France.
Münz, R., Ulrich, R. (1998). Les migrations en Allemagne : 1946-1996. Rev. Eur. Migr. Int., 14, 173–210.
Oaxaca, R. (1973). Male-female wage differentials in urban labor markets. Int. Econ. Rev., 14, 693–709.
OECD. (2004). Learning for tomorrow’s world: First results from PISA 2003. OECD Publishing.
OECD. (2012a). PISA 2012 Results: What students know and can do (volume I). Student performance in mathematics, reading and science. OECD Publishing.
OECD. (2012b). PISA 2012 Results: Excellence through equity. Giving every student the chance to succeed (volume II). Student performance in mathematics, reading and science. OECD Publishing.
OECD. (2013). PISA 2012 Results: Ready to learn (volume III). Students’ engagement, drive and self-beliefs. OECD Publishing.
Prais, S.J. (2003). Cautions on OECD’s recent educational survey (PISA). Oxford Rev. Educ., 29, 139–163.
Rey, O. (2011). PISA: ce que l’on en sait et ce que l’on en fait. Doss. Actual. Veill. Anal., 66.
Rutkowski, L., Rutkowski, D. (2016). A call for a more measured approach to reporting and interpreting PISA results. Educ. Res., 45, 252–257.
Thévenon, O. (2009). Increased women’s labour force participation in Europe: Progress in the work-life balance or polarization of behaviours? Popul., 64, 235–272.
Wuttke, J. (2007). Uncertainty and Bias in PISA. PISA According to PISA – Does PISA Keeps What It Promises? Hopmann, Brinek, Retzl, Eds. 241–263.
Zighera, J.A. (1985). Partitioning information in a multidimensional contingency table and centring of loglinear parameters. Appl. Stoch. Models Data Anal., 1, 93–108.

PART 4

Visualization

12 A Topological Discriminant Analysis

In this chapter, we propose a new discriminant approach, called topological discriminant analysis (TDA), which uses a proximity measure in a topological context. The results of any operation of clustering or classification of objects strongly depend on the proximity measure chosen. The user has to select one measure among many existing ones. Yet, from a discrimination point of view, according to the notion of topological equivalence chosen, some measures are more or less equivalent. The concept of topological equivalence uses the basic notion of a local neighborhood. In a discrimination context, we first define the topological equivalence between the chosen proximity measure and the perfect discrimination measure adapted to the data considered, through the adjacency matrix induced by each measure, and then propose a new topological method of discrimination using this selected proximity measure. To judge the quality of discrimination, in addition to the classical percentage of well-classified objects, we define a criterion for topological equivalence of discrimination. The principle of the proposed approach is illustrated using a real data set with conventional proximity measures from the literature for quantitative variables. The results of the proposed TDA, associated with the “best” discriminating proximity measure, are compared with those of classical metric models of discrimination, linear discriminant analysis and multinomial logistic regression.

Chapter written by Rafik ABDESSELAM.

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.

12.1. Introduction

In order to understand and act on situations that are represented by a set of objects, very often we are required to compare them. Humans perform this comparison subconsciously using the brain. But, in the context of artificial intelligence, we should be able to describe how the machine might perform this comparison. In this context, one of the basic elements that must be specified is the proximity measure between objects. Certainly, application context, prior knowledge, data type and many other factors can help in identifying the appropriate measure. However, the number of candidate measures may still remain quite large. In a discriminant context for example, can we consider that all the remaining measures are equivalent and just pick one of them at random? Or are there some that are equivalent and, if so, to what extent? This information might interest a user when seeking a specific measure. For instance, in information description and in supervised or unsupervised clustering, choosing a given proximity measure is an important issue. We know that the result of a query depends on the measure used. For this reason, in our context, users may wonder which one is more discriminant. Very often, they try many of them, randomly or sequentially, seeking a “suitable” discriminant proximity measure.

This issue arises in the context of unsupervised or supervised classification, that is, discrimination (Abdesselam 2014). The assignment or the classification of an object to a class partly depends on the learning database used. Depending on the selected proximity measure, this database changes and therefore the result of the classification also changes. Here, we are interested in the degree of topological equivalence of discrimination of these proximity measures. Several studies on topological equivalence of proximity measures have been proposed (Batagelj and Bren 1992; Rifqi et al. 2003; Batagelj and Bren 1995; Lesot et al. 2009; Zighed et al. 2012), but none of these propositions has an objective of discrimination. A criterion for comparing and selecting the “best” discriminant proximity measure is defined in Abdesselam (2014). We propose here, using this chosen “best” discriminant measure, a new approach called TDA.

This chapter is organized as follows. We recall in section 12.2 the basic notions of structure, graph and topological equivalence. In section 12.3, we present the principle of the topological discriminant analysis. Section 12.4 begins with an illustrative example with continuous data, followed by comparisons of performances between the proposed TDA and two other classical models of discrimination. A conclusion and some perspectives of this work are presented in section 12.4.

12.2. Topological equivalence

The topological equivalence is based on the concept of a topological graph, also referred to as a neighborhood graph. The basic idea is actually quite simple: two proximity measures are equivalent if the corresponding topological graphs induced on the set of objects remain identical. Measuring the similarity between proximity measures consists of comparing the neighborhood graphs. We will first define more precisely what a topological graph is and how to build it. Then, we propose a measure of proximity between topological graphs that will subsequently be used to compare the proximity measures.


Consider a set E = {x, y, z, . . .} of n = |E| objects in $\mathbb{R}^p$. We can, by means of a proximity measure u, define a neighborhood relationship $V_u$ to be a binary relationship on E × E. Thus, for a given proximity measure u, we can build a neighborhood graph on a set of individuals–objects, where the vertices are the individuals and the edges are defined by a property of the neighborhood relationship. There are many possibilities for building this binary neighborhood relationship. For example, we can build on E × E the minimal spanning tree (MST) (Kim and Lee 2003) and define, for two objects x and y, $V_u(x, y) = 1$ if the objects are directly connected by an edge of the tree and $V_u(x, y) = 0$ otherwise. So, $V_u$ forms the adjacency matrix associated with the MST graph, consisting of 0 and 1.

Figure 12.1. Minimal spanning tree graph – adjacency matrix
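The MST-based adjacency matrix described above can be sketched as follows. This is a minimal illustration using Prim's algorithm and the Euclidean distance on four hypothetical points in the plane; the function names, the toy points and the choice of zeros on the diagonal are assumptions made for the example, not part of the chapter.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of coordinates."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mst_adjacency(points, u=euclidean):
    """Binary adjacency matrix of a minimal spanning tree built with Prim's algorithm.

    The diagonal is left at 0 (an assumption of this sketch).
    """
    n = len(points)
    adjacency = [[0] * n for _ in range(n)]
    if n == 0:
        return adjacency
    in_tree = {0}
    while len(in_tree) < n:
        # Pick the shortest edge joining the current tree to an outside vertex.
        _, i, j = min(
            (u(points[a], points[b]), a, b)
            for a in in_tree
            for b in range(n)
            if b not in in_tree
        )
        adjacency[i][j] = adjacency[j][i] = 1
        in_tree.add(j)
    return adjacency


# Hypothetical objects in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (3.0, 1.5)]
V = mst_adjacency(points)
for row in V:
    print(row)
```

Any other proximity measure can be passed through the `u` argument; a different measure may produce a different tree and therefore a different adjacency matrix, which is the starting point of the comparison between measures.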

Alternatively, we can use the Gabriel graph (GG) (Gabriel and Sokal 1969; Matula and Sokal 1980; Park et al. 2006), in which all pairs of neighbor points (x, y) satisfy the following property:

PROPERTY 12.1.– Gabriel graph (GG): $\forall x, y \in E$; $\forall z \in E - \{x, y\}$:

$$V_u(x, y) = \begin{cases} 1 & \text{if } u(x, y) \le \min\left(\sqrt{u^2(x, z) + u^2(y, z)}\right) \\ 0 & \text{otherwise} \end{cases}$$

Geometrically, the hypersphere of diameter u(x, y) is empty. We can also choose the relative neighborhood graph (RNG) (Toussaint 1980; Jaromczyk and Toussaint 1992), where all pairs of neighbor points (x, y) satisfy the following property.


Figure 12.2. Gabriel graph – adjacency matrix. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip

PROPERTY 12.2.– Relative neighborhood graph (RNG): $\forall x, y \in E$; $\forall z \in E - \{x, y\}$:

$$V_u(x, y) = \begin{cases} 1 & \text{if } u(x, y) \le \max\left[u(x, z), u(y, z)\right] \\ 0 & \text{otherwise} \end{cases}$$

That is, $V_u(x, y)$ indicates whether or not the pair of points satisfies the ultra-triangular inequality of property 12.2 (the ultrametric condition). Geometrically, the RNG is a connection scheme in which two points are connected if the hyper-lunula (the intersection of the two hyperspheres centered on the two points, with radius equal to the distance between the points) is empty.

Figure 12.3. Relative neighborhood graph – adjacency matrix. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip
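The Gabriel and relative neighborhood conditions can be compared on a small example. The following is a minimal sketch, assuming the Euclidean distance and three hypothetical points; the function name `neighborhood_adjacency`, the rule labels and the zero diagonal are assumptions of the example. On these points the Gabriel graph keeps all three edges while the RNG drops the longest one, which reflects the fact that the RNG condition is the stricter of the two.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of coordinates."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def neighborhood_adjacency(points, rule, u=euclidean):
    """Adjacency matrix V_u for the 'gabriel' or 'rng' neighborhood rule.

    The diagonal is left at 0 (an assumption of this sketch).
    """
    n = len(points)
    V = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = u(points[i], points[j])
            connected = True
            for k in range(n):
                if k in (i, j):
                    continue
                d_ik, d_jk = u(points[i], points[k]), u(points[j], points[k])
                if rule == "gabriel":
                    bound = math.sqrt(d_ik ** 2 + d_jk ** 2)  # Gabriel condition
                elif rule == "rng":
                    bound = max(d_ik, d_jk)                   # RNG condition
                else:
                    raise ValueError(f"unknown rule: {rule}")
                if d_ij > bound:
                    # A third point violates the condition: x and y are not neighbors.
                    connected = False
                    break
            V[i][j] = V[j][i] = int(connected)
    return V


# Hypothetical objects forming a slightly flattened triangle.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.5)]
V_gabriel = neighborhood_adjacency(points, "gabriel")
V_rng = neighborhood_adjacency(points, "rng")
print("Gabriel:", V_gabriel)
print("RNG:    ", V_rng)
```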


For a given neighborhood property (MST, GG or RNG), each measure u generates a topological structure on the objects in E, which is totally described by the binary adjacency matrix $V_u$. Figures 12.1–12.3 show an example of each topological graph perfectly defined in $\mathbb{R}^2$ by the associated binary adjacency matrix $V_u$. In these examples, the proximity measure $u(x, y) = u_{Euc}(x, y) = \sqrt{\sum_{j=1}^{2} (x^j - y^j)^2}$ is the Euclidean distance.

12.3. Topological discriminant analysis

In this section, we use the following notations to present TDA on continuous explanatory variables. Let us denote
– $X_{(n,p)}$ as the data matrix associated with the set of the p centered continuous explanatory (discriminant) variables $\{x^j ; j = 1, p\}$, with n rows–objects and p columns–variables;
– $Y_{(n,q)}$ as the data matrix associated with the q dummy variables $\{y^k ; k = 1, q\}$ of the qualitative variable to explain y, with q modalities or groups to discriminate;
– $D_n = \frac{1}{n} I_n$ as the diagonal weights matrix of the n individuals, with $I_n$ the identity matrix of order n;
– $D_q = {}^{t}Y D_n Y$ as the diagonal weights matrix of the q modalities of the target variable y, defined by $[D_q]_{kk} = \frac{n_k}{n}$, ∀k = 1, q;
– $\chi^2_y = D_q^{-1}$ as the matrix associated with the chi-square distance;
– $G_{(q,p)} = \chi^2_y\, {}^{t}Y D_n X$ as the matrix associated with the q centers of gravity in $\mathbb{R}^p$.

Let E = {x, y, z, . . .} and $G = \{G_1, \cdots, G_k, \cdots, G_q\}$ be the sets of n = |E| objects and q = |G| centers of gravity in $\mathbb{R}^p$. We define a neighborhood relationship on E × G by means of the “best” discriminating proximity measure, previously selected (Abdesselam 2014), denoted u, and the associated binary adjacency matrix $V_u$. An object x ∈ E and a center of gravity $G_k$ ∈ G satisfy the neighborhood property 12.1, according to GG, if they are directly connected by an edge, of diameter $u(x, G_k)$. The vertices x and $G_k$ are neighbors within the meaning of Gabriel if and only if they satisfy the following property.

PROPERTY 12.3.– Gabriel graph (GG): $\forall x \in E$; $\forall G_l \neq G_k \in G$:

$$V_u(x, G_k) = \begin{cases} 1 & \text{if } u(x, G_k) \le \min \sqrt{u^2(x, G_l) + u^2(G_k, G_l)} \\ 0 & \text{otherwise} \end{cases}$$


From a geometrical point of view, the hypersphere of diameter $u(x, G_k)$ contains no center of gravity other than $G_k$; mathematically, this means that $\forall G_l \in G$, $u(x, G_k) \le u(x, G_l)$; thus, object x is closer to group $G_k$ than to any other group.

Figure 12.4. Gabriel graph – adjacency matrix. For a color version of the figure, see www.iste.co.uk/skiadas/data2.zip
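The object-to-center neighborhood relationship can be sketched in the same way. The code below is a minimal illustration of the Gabriel condition between objects and centers of gravity, assuming the Euclidean distance; the centers and objects are hypothetical and were placed so that each object satisfies the condition for exactly one center. The function name `object_center_adjacency` is an assumption of the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of coordinates."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def object_center_adjacency(objects, centers, u=euclidean):
    """V[x][k] = 1 when object x and center G_k satisfy the Gabriel condition, else 0."""
    V = []
    for x in objects:
        row = []
        for k, g_k in enumerate(centers):
            d_xk = u(x, g_k)
            # The condition must hold against every other center G_l.
            is_neighbor = all(
                d_xk <= math.sqrt(u(x, g_l) ** 2 + u(g_k, g_l) ** 2)
                for l, g_l in enumerate(centers)
                if l != k
            )
            row.append(int(is_neighbor))
        V.append(row)
    return V


centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]      # hypothetical G1, G2, G3
objects = [(-0.5, -0.5), (4.5, -0.5), (-0.5, 4.5)]  # hypothetical objects
V = object_center_adjacency(objects, centers)
for row in V:
    print(row)
```

Each row of `V` describes one object; in this toy configuration the single 1 in a row marks the group to which the object is assigned.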

Figure 12.4 shows an example of a topological graph (GG) perfectly defined in $\mathbb{R}^2$ and the associated binary adjacency matrix $V_u$ according to property 12.3. In this case, $u(x, G_k) = u_{Euc}(x, G_k)$ is the Euclidean distance. Thus, the object x is connected by an edge to the center of gravity $G_1$ because the circle of diameter $u(x, G_1)$ contains neither of the other two centers of gravity $G_2$ and $G_3$; hence $V_u(x, G_1) = 1$ and $V_u(x, G_2) = V_u(x, G_3) = 0$.

We denote by $V_{u^*}$ the reference adjacency matrix, corresponding to the “perfect” discrimination of the q groups according to an unknown “perfect” discriminant proximity measure denoted $u^*$. As with any discrimination technique, the performance of the TDA approach can be summarized by a confusion matrix that allows us to measure the error rate or the percentage of well-classified objects, measured by the quantity:

$$\%W.C. = \frac{100}{n}\, \operatorname{trace}\left({}^{t}V_{u^*} V_u\right)$$

where the reference binary adjacency matrix $V_{u^*}$ associated with the unknown “perfect” discriminant measure $u^*$ exactly corresponds to the binary matrix $Y_{(n,q)}$. For this topological approach, a further quality criterion can also be considered: the topological equivalence of discrimination $S(V_u, V_{u^*})$, which measures, according to property 2, the similarity between the best and the perfect adjacency matrices, as defined by the following property of concordance.


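PROPERTY 12.4.– Topological equivalence between two adjacency matrices:

$$S(V_u, V_{u^*}) = \frac{\sum_{k=1}^{n} \sum_{l=1}^{n} \delta_{kl}}{n^2} \quad \text{with} \quad \delta_{kl} = \begin{cases} 1 & \text{if } V_u(k, l) = V_{u^*}(k, l) \\ 0 & \text{otherwise.} \end{cases}$$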
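Both quality criteria can be computed directly from the two binary matrices. The following is a minimal sketch on hypothetical matrices with four objects and three groups; the function names are assumptions. For the concordance measure, the sketch divides the number of agreeing entries by the total number of entries, which equals $n^2$ for a square n × n matrix as in the formula above; applying it to an n × q matrix is an assumption of the example.

```python
def percent_well_classified(V_ref, V):
    """100/n * trace(t(V_ref) V) for two n x q binary matrices."""
    n = len(V_ref)
    # trace(t(V_ref) V) is the sum of the entrywise products.
    trace = sum(
        V_ref[x][k] * V[x][k]
        for x in range(n)
        for k in range(len(V_ref[x]))
    )
    return 100.0 * trace / n


def topological_equivalence(V_a, V_b):
    """Proportion of entries on which two binary adjacency matrices agree."""
    agreements = sum(
        int(a == b)
        for row_a, row_b in zip(V_a, V_b)
        for a, b in zip(row_a, row_b)
    )
    n_entries = sum(len(row) for row in V_a)
    return agreements / n_entries


# Hypothetical reference matrix (true groups) and TDA adjacency matrix.
V_ref = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
V_tda = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]

wc = percent_well_classified(V_ref, V_tda)
s = topological_equivalence(V_tda, V_ref)
print(f"%W.C. = {wc:.1f}, S = {s:.4f}")
```

In this toy case the second object is linked to the wrong group, so three objects out of four are well classified and two of the twelve entries disagree.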

In order to evaluate the discriminating power of the proposed topological approach, we compare it with two supervised models, linear discriminant analysis (LDA) and multinomial logistic regression (MLR), which are commonly used for dimensionality reduction and in machine learning applications. The general LDA approach is very similar to a principal component analysis (PCA), but in addition to finding the component axes that maximize the variance of the data (PCA), we are also interested in the axes that maximize the separation between multiple classes (LDA).

Unlike LDA and MLR, the proposed TDA does not build a function or model; it consists of a single step in which each object is directly classified according to the neighborhood graph, completely characterized by the associated adjacency matrix and the proximity measure chosen. This same step is also used to classify an anonymous object. Moreover, the TDA approach requires no specific assumption, presents no real drawback in its application and is not constrained in very high dimensions, apart from a complexity problem with massive data. This is not the case for LDA (which assumes normally distributed data, statistically independent features and identical covariance matrices for every class, and is sensitive to outliers, etc.) and MLR (which involves many specific statistical tests and parameter estimates, is affected by missing values, does not converge in the case of complete separation of classes, etc.).

12.4. Application example

To illustrate the application of TDA to a real data set, we use the famous iris data set collected by Anderson (1935), which inspired Fisher (1936) to develop LDA. This data set contains measurements for 150 iris flowers from three different species (setosa, virginica and versicolor). Four predictor features were measured on 50 samples for each species: sepal length, sepal width, petal length and petal width. The complete data have been deposited in the UCI machine learning repository (UCI Machine Learning Repository 1936); data matrices and their dimensions are presented in Table 12.1.

| Name | Explanatory continuous variables | Variable to explain |
|---|---|---|
| Iris | $X_{(n \times p)}$ | $Y_{(q)}$ |
| Dimension | 150 × 4 | 3 |

Table 12.1. Data set


The main results of the proposed TDA approach are presented in the following numerical tables. They allow us to visualize the proximity measures that are close to each other in a context of discrimination. First, we select the best discriminant measure for the considered data (Abdesselam 2014), then we perform TDA and, finally, we compare the obtained results with those of the LDA and MLR discrimination models. Table 12.7 shows some classic proximity measures used for continuous data that are defined on R^p. The iris data set used is from the UCI Machine Learning Repository (UCI Machine Learning Repository 1936).

G(q,p)       Sepal length   Sepal width   Petal length   Petal width
Setosa       5.006          3.428         1.462          0.246
Versicolor   5.936          2.770         4.260          1.326
Virginica    6.588          2.974         5.552          2.026

Table 12.2. Centers of gravity matrix in R^4

It was shown in Abdesselam (2014) and Zighed et al. (2012), using a series of experiments, that the choice of a proximity measure has an impact on the results of a supervised or unsupervised classification. In view of the results of the comparison and selection of measures (Abdesselam 2014), the unknown “perfect” discriminant measure u∗ would be closest to the cosine dissimilarity measure uCos, which would be, for these iris data, the “best” discriminant proximity measure among the 16 measures presented in Table 12.7. Thus, this first part indicates that the cosine dissimilarity measure is the “best” discriminant measure; it is the most appropriate measure to separate and differentiate the three species of iris flowers. The cosine measure between the three centers of gravity (Table 12.2) is given in Table 12.3.

uCos(Gk, Gl)   Setosa   Versicolor   Virginica
Setosa         0
Versicolor     0.075    0
Virginica      0.112    0.004        0

Table 12.3. Matrix of cosine measure between the centers of gravity
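As a quick check of Table 12.3, the cosine dissimilarity of Table 12.7 can be applied to the centers of gravity of Table 12.2. This is only a minimal sketch; the function name `cosine_dissimilarity` and the dictionary `centers` are illustrative names, not part of the original work.

```python
import math

def cosine_dissimilarity(x, y):
    """u_Cos(x, y) = 1 - <x, y> / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

# Centers of gravity from Table 12.2
# (sepal length, sepal width, petal length, petal width).
centers = {
    "Setosa": (5.006, 3.428, 1.462, 0.246),
    "Versicolor": (5.936, 2.770, 4.260, 1.326),
    "Virginica": (6.588, 2.974, 5.552, 2.026),
}

d_set_ver = cosine_dissimilarity(centers["Setosa"], centers["Versicolor"])
d_set_vir = cosine_dissimilarity(centers["Setosa"], centers["Virginica"])
d_ver_vir = cosine_dissimilarity(centers["Versicolor"], centers["Virginica"])
print(round(d_set_ver, 3), round(d_set_vir, 3), round(d_ver_vir, 3))
```

The three values agree with the lower triangle of Table 12.3 to within rounding.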

Table 12.4 summarizes the main results of the TDA with the cosine dissimilarity measure: the cross-classification table of predicted and actual species assignments (the confusion matrix) and the percentages of concordance and of well-classified objects. The main results of the proposed TDA, applied to each of the 16 adjacency matrices induced by the 16 proximity measures given in Table 12.7, are presented in Table 12.8. For the iris data set, it shows that the best TDA, with the highest percentage of well-classified objects (98.00%) and of topological equivalence (98.67%), is obtained with the cosine dissimilarity measure uCos.
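The topological equivalence percentage follows the measure S of Property 12.4: the share of entries on which two adjacency matrices agree. A minimal sketch, assuming the matrices are given as square lists of 0/1 values; the function name and the two toy matrices are illustrative only and are not taken from the iris data.

```python
def topological_equivalence(v_u, v_star):
    """S(V_u, V_u*) = (1 / n^2) * number of entries (k, l) where both matrices agree."""
    n = len(v_u)
    agreements = sum(
        1 for k in range(n) for l in range(n) if v_u[k][l] == v_star[k][l]
    )
    return agreements / (n * n)

# Two hypothetical 3 x 3 adjacency matrices differing in two entries.
v_u = [[1, 1, 0],
       [1, 1, 1],
       [0, 1, 1]]
v_star = [[1, 1, 0],
          [1, 1, 0],
          [0, 0, 1]]

s = topological_equivalence(v_u, v_star)
print(f"{100 * s:.2f}%")
```

Identical matrices give S = 1, that is, 100% topological equivalence.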


TDA                    Predicted
Actual        Setosa   Versicolor   Virginica
Setosa        50       0            0
Versicolor    0        47           3
Virginica     0        0            50
Well classified: 98.00%
Topological equivalence: 98.67%

Table 12.4. Confusion matrix – topological discriminant analysis

LDA                    Predicted
Actual        Setosa   Versicolor   Virginica
Setosa        50       0            0
Versicolor    0        48           2
Virginica     0        1            49
Well classified: 98.00%

Table 12.5. Confusion matrix – linear discriminant analysis

Tables 12.5 and 12.6 summarize the main results of the classical discriminant models in a metric context. Thus, from a comparison point of view, according to the criterion of the percentage of well-classified objects, the topological approach TDA presents a discriminating power substantially similar to those of the MLR and LDA metric approaches, with a percentage of well-classified objects of around 98% for the iris data.

MLR                    Predicted
Actual        Setosa   Versicolor   Virginica
Setosa        50       0            0
Versicolor    0        49           1
Virginica     0        1            49
Well classified: 98.67%

Table 12.6. Confusion matrix – multinomial logistic regression
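The percentages of well-classified objects can be recomputed directly from the three confusion matrices as the trace divided by the total count. A minimal sketch; the function name `well_classified` is an illustrative choice.

```python
def well_classified(confusion):
    """Percentage of objects on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 100.0 * correct / total

# Confusion matrices from Tables 12.4, 12.5 and 12.6 (rows: actual, columns: predicted).
tda = [[50, 0, 0], [0, 47, 3], [0, 0, 50]]
lda = [[50, 0, 0], [0, 48, 2], [0, 1, 49]]
mlr = [[50, 0, 0], [0, 49, 1], [0, 1, 49]]

for name, matrix in (("TDA", tda), ("LDA", lda), ("MLR", mlr)):
    print(name, f"{well_classified(matrix):.2f}%")
```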

12.5. Conclusion and perspectives

The choice of a proximity measure is very subjective; it is often based on habits or on criteria such as the interpretation of the a posteriori results. This work uses proximity measures and proposes a new topological approach in the context of discrimination. The proposed approach is based on the concept of the neighborhood graph induced by a proximity measure for continuous data. Results obtained by analyzing a real data set highlight the effectiveness of the proposed method. Further research will address the extension of TDA to binary, qualitative and also mixed (quantitative and qualitative) explanatory variables by choosing the best discriminant proximity measure adapted to the considered data in a topological context.


It would be interesting to extend this work to use a comparison criterion, other than a clustering technique, in order to validate the degree of topological equivalence of discrimination between the “best” and the “perfect” discriminant measures. We can, for example, use the non-parametric test of the Kappa concordance coefficient calculated from the associated adjacency matrix (Abdesselam and Zighed 2011). This will allow us to give a statistical significance of the degree of agreement between two similarity matrices and to validate or not the topological equivalence in discrimination, i.e. whether or not they induce the same neighborhood structure on the groups of objects to be separated.

12.6. Appendix

Measure                        Formula: Distance - Dissimilarity
Euclidean                      $u_{Euc}(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$
Manhattan (City-block)         $u_{Man}(x, y) = \sum_{j=1}^{p} |x_j - y_j|$
Minkowski                      $u_{Min_\gamma}(x, y) = \left( \sum_{j=1}^{p} |x_j - y_j|^{\gamma} \right)^{1/\gamma}$
Tchebychev                     $u_{Tch}(x, y) = \max_{1 \le j \le p} |x_j - y_j|$
Normalized Euclidean           $u_{NE}(x, y) = \sqrt{\sum_{j=1}^{p} \frac{1}{\sigma_j^2} \left[ (x_j - \bar{x}_j) - (y_j - \bar{y}_j) \right]^2}$
Mahalanobis                    $u_{Mah}(x, y) = \sqrt{(x - y)^t \, \Sigma^{-1} (x - y)}$
Cosine dissimilarity           $u_{Cos}(x, y) = 1 - \frac{\sum_{j=1}^{p} x_j y_j}{\sqrt{\sum_{j=1}^{p} x_j^2} \sqrt{\sum_{j=1}^{p} y_j^2}} = 1 - \frac{\langle x, y \rangle}{\|x\| \, \|y\|}$
Canberra                       $u_{Can}(x, y) = \sum_{j=1}^{p} \frac{|x_j - y_j|}{|x_j| + |y_j|}$
Squared Pearson correlation    $u_{Cor}(x, y) = 1 - \frac{\left( \sum_{j=1}^{p} (x_j - \bar{x})(y_j - \bar{y}) \right)^2}{\sum_{j=1}^{p} (x_j - \bar{x})^2 \sum_{j=1}^{p} (y_j - \bar{y})^2} = 1 - \frac{\left( \langle x - \bar{x}, y - \bar{y} \rangle \right)^2}{\|x - \bar{x}\|^2 \, \|y - \bar{y}\|^2}$
Squared Chord                  $u_{Cho}(x, y) = \sum_{j=1}^{p} \left( \sqrt{x_j} - \sqrt{y_j} \right)^2$
Doverlap measure               $u_{Dev}(x, y) = \max\left( \sum_{j=1}^{p} x_j, \sum_{j=1}^{p} y_j \right) - \sum_{j=1}^{p} \min(x_j, y_j)$
Weighted Euclidean             $u_{WEu}(x, y) = \sqrt{\sum_{j=1}^{p} \alpha_j (x_j - y_j)^2}$
Gower’s dissimilarity          $u_{Gow}(x, y) = \frac{1}{p} \sum_{j=1}^{p} |x_j - y_j|$
Shape distance                 $u_{Sha}(x, y) = \sqrt{\sum_{j=1}^{p} \left[ (x_j - \bar{x}_j) - (y_j - \bar{y}_j) \right]^2}$
Size distance                  $u_{Siz}(x, y) = \left| \sum_{j=1}^{p} (x_j - y_j) \right|$
LPower                         $u_{Lpo_\gamma}(x, y) = \sum_{j=1}^{p} |x_j - y_j|^{\gamma}$

Where p is the dimension of space, $x = (x_j)_{j=1,...,p}$ and $y = (y_j)_{j=1,...,p}$ are two points in R^p, $\bar{x}_j$ is the mean, $\sigma_j$ the standard deviation, $\alpha_j = 1/\sigma_j^2$, $\Sigma^{-1}$ the inverse of the variance and covariance matrix, and $\gamma > 0$.

Table 12.7. Some proximity measures for continuous data
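A few of the measures in Table 12.7 that depend only on the two points x and y can be sketched as plain functions. This is an illustrative sketch, not the authors' implementation; the function names are assumptions, and measures that need data-set statistics (means, standard deviations, covariance matrix) are left out.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, gamma):
    # gamma > 0; gamma = 1 gives Manhattan, gamma = 2 gives Euclidean.
    return sum(abs(a - b) ** gamma for a, b in zip(x, y)) ** (1.0 / gamma)

def tchebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def canberra(x, y):
    # Assumes no coordinate pair is (0, 0), which would divide by zero.
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(x, y))

def gower(x, y):
    return manhattan(x, y) / len(x)

x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
print(euclidean(x, y), manhattan(x, y), tchebychev(x, y), canberra(x, y))
```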

Name                      Measure        Topological       Confusion matrix               Well             Rank
                                         equivalence (%)   (row by row)                   classified (%)
Euclidean                 uEuc           95.11             [50 0 0; 0 46 4; 0 7 43]       92.67            6
Manhattan                 uMan           94.67             [50 0 0; 0 47 3; 0 9 41]       92.00            7
Minkowski                 uMin (γ = 3)   94.67             [50 0 0; 0 45 5; 0 7 43]       92.00            7
Tchebychev                uTch           94.22             [50 0 0; 0 45 5; 0 8 42]       91.33            11
Normalized Euclidean      uNEu           89.78             [49 1 0; 0 39 11; 0 11 39]     84.67            15
Mahalanobis               uMah           91.11             [49 1 0; 0 42 8; 0 11 39]      86.67            13
Cosine dissimilarity      uCos           98.67             [50 0 0; 0 47 3; 0 0 50]       98.00            1
Canberra                  uCan           96.89             [50 0 0; 0 47 3; 0 4 46]       95.33            4
Sq. Pearson correlation   uCor           97.33             [50 0 0; 0 47 3; 0 3 47]       96.00            2
Squared Chord             uCho           97.33             [50 0 0; 0 47 3; 0 3 47]       96.00            2
Doverlap measure          uDov           92.00             [50 0 0; 4 41 5; 0 9 41]       88.00            12
Weighted Euclidean        uWEu           56.89             [3 47 0; 0 50 0; 0 50 0]       35.33            16
Gower’s dissimilarity     uGow           94.67             [50 0 0; 0 47 3; 0 9 41]       92.00            7
Shape distance            uSha           96.44             [50 0 0; 0 47 3; 0 5 45]       94.67            5
Size distance             uSiz           90.22             [50 0 0; 4 40 6; 0 12 38]      85.33            14
LPower                    uLpo           94.67             [50 0 0; 0 45 5; 0 7 43]       92.00            7

Table 12.8. Main results of the TDA according to different proximity measures


12.7. References

Abdesselam, R. (2014). Proximity measures in topological structure for discrimination. In A Book Series SMTDA-2014, 3rd Stochastic Modeling Techniques and Data Analysis, Skiadas, C.H. (ed.), ISAST, International Conference, Lisbon, Portugal, 599–606.

Abdesselam, R., Zighed, A.D. (2011). Comparaison topologique de mesures de proximité. In Actes des XVIIIème Rencontres de la Société Francophone de Classification, 79–82.

Anderson, E. (1935). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59, 2–5.

Batagelj, V., Bren, M. (1992). Comparing resemblance measures. In Proc. International Meeting on Distance Analysis (DISTANCIA’92).

Batagelj, V., Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12, 73–90.

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, Part II, 7, 179–188.

Gabriel, K.R., Sokal, R.R. (1969). A new statistical approach to geographic variation analysis. Systematic Zoology, 18, 259–278.

Jaromczyk, J.W., Toussaint, G.T. (1992). Relative neighborhood graphs and their relatives. Proceedings of IEEE, 80(9), 1502–1517.

Kim, J.H., Lee, S. (2003). Bound for the minimal spanning tree of a complete graph. Statistics & Probability Letters, 4(64), 425–430.

Lesot, M.J., Rifqi, M., Benhadda, H. (2009). Similarity measures for binary and numerical data: A survey. In IJKESDP, 1(1), 63–84.

Matula, D.W., Sokal, R.R. (1980). Properties of Gabriel graphs relevant to geographic variation research and the clustering of points in the plane. Geographical Analysis, 12, 205–222.

Park, J.C., Shin, H., Choi, B.K. (2006). Elliptic Gabriel graph for finding neighbors in a point set and its application to normal vector estimation. Computer-Aided Design Elsevier, 38(6), 619–626.

Rifqi, M., Detyniecki, M., Bouchon-Meunier, B. (2003). Discrimination power of measures of resemblance. IFSA’03, Citeseer.

Toussaint, G.T. (1980). The relative neighbourhood graph of a finite planar set. Pattern Recognition, 12(4), 261–268.

UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Iris. Database from R.A. Fisher (1936).

Zighed, D.A., Abdesselam, R., Hadgu, A. (2012). Topological comparisons of proximity measures. In the 16th PAKDD 2012 Conference. In Part I, LNAI 7301, Tan, P.-N. et al. (eds), Springer-Verlag, Berlin, Heidelberg, 379–391.

13 Using Graph Partitioning to Calculate PageRank in a Changing Network

PageRank was first defined by S. Brin and L. Page in 1998 in order to rank home pages on the Internet according to the stationary distribution of a random walk on the web graph. While the original way to calculate PageRank is fast, the huge size and growth of the web have led to many attempts at improving the calculation speed of PageRank through various means. In this chapter, we will look at a slightly different but equally important problem, namely how to improve the calculation of PageRank in a changing network where PageRank of an earlier stage of the network is available. In particular, we consider two types of changes in the graph: changes in the personalization vector used in calculating PageRank, and added or removed edges between different strongly connected components (SCCs) in the network.

13.1. Introduction

PageRank was initially developed by S. Brin and L. Page to rank home pages on the Internet for the search giant Google (Brin and Page 1998). Since then, PageRank or similar methods have been used for a variety of applications in many types of networks, such as evaluating trust in P2P networks (Kamvar 2003) or Lazy PageRank used in clustering of networks (Chung and Tsiatas 2012). A large amount of work has been aimed at speeding up the calculation of PageRank, both in the form of approximations such as aggregation of web pages (Ishii 2009) and by handling certain vertices in the graph separately (Lee 2007; Yu 2012). However, not much work has been done on the problem of recalculating PageRank after making some changes to the network.

Chapter written by Christopher ENGSTRÖM and Sergei SILVESTROV.

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


In this chapter, we will build upon the PageRank method we proposed in Engström and Silvestrov (2015) and show how a partition of the network into SCCs can be used together with the old PageRank to calculate the new PageRank after making certain modifications to the graph. In particular, we will look at how changes in the personalization vector (see definition 13.1) or changes in edges between components can be handled.

PageRank can be computed as the stationary distribution of a random walk on a directed graph using a slightly modified adjacency matrix together with a low probability to take a random step to any vertex (ignoring the edges in the graph). In this chapter, we will use a slightly different definition of PageRank compared to the one by S. Brin and L. Page. It avoids the normalization, which makes it possible to easily compare rank between different graphs, and it does not require any modification to account for vertices with no outgoing edges.

DEFINITION 13.1.– Consider a random walk on a graph. In each step of the random walk, move to a new vertex from the current vertex by traversing a random edge from the current vertex with probability 0 < c < 1 and stop the random walk with probability 1 − c. Then PageRank $\vec{R}$ for a single vertex $v_j$ is defined as

$$R_j = \left( w_j + \sum_{v_i \in V,\, v_i \neq v_j} w_i P_{ij} \right) \left( \sum_{k=0}^{\infty} (P_{jj})^k \right), \tag{13.1}$$

where $P_{ij}$ is the probability to hit node $v_j$ in a random walk starting in node $v_i$. This can be seen as the expected number of visits to $v_j$ if we do multiple random walks, starting in every vertex a number of times described by a non-negative vector $\vec{w}$. $\vec{w}$ is called the personalization vector or weight vector for the graph.

It is worth pointing out that this is proportional to the ordinary (normalized) definition of PageRank and therefore gives the same ranking; we proved the proportionality in one of our earlier works (Engström and Silvestrov 2015). The reason for using a non-normalized variation of PageRank is to be able to divide the graph into multiple components and compute PageRank for each component individually without the need to keep track of complicated normalization constants between the components. The following standard definitions in graph theory are important for understanding this chapter.

DEFINITION 13.2.– An SCC of a directed graph G is a subgraph S of G such that for every pair of vertices u, v in S there is a directed path from u to v and from v to u. In addition, S is maximal in the sense that adding any other set of vertices and/or edges from G to S would break this property.


It is well known in graph theory that every directed graph can be uniquely partitioned into SCCs such that every vertex is part of exactly one SCC. This partitioning can be done using a depth-first search known as Tarjan’s algorithm (Tarjan 1972). It is also well known that this partitioning creates a partial order on these components, which we will use to label each component.

DEFINITION 13.3.– Consider a graph G with partition P into SCCs and the underlying directed acyclic graph (DAG) created by replacing every component with a single vertex. If there is an edge between any two vertices belonging to a pair of different components, then there is an edge in the same direction between the two vertices representing those two components as well.
– The level $L_C$ of component C is equal to the length of the longest path in the underlying DAG starting in C.
– The level $L_{e_i}$ of some vertex $e_i$ is defined as the level of the component to which $e_i$ belongs ($L_{e_i} \equiv L_C$, if $e_i \in C$).

An example of a graph, its partition into SCCs and the level of corresponding components can be seen in Figure 13.1.

Figure 13.1. Example of a graph and corresponding components from the SCC partitioning of the graph. Vertex labels denote the level of each vertex
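The partition into SCCs and the levels of definition 13.3 can be sketched as follows. This is a minimal recursive version of Tarjan's algorithm followed by a longest-path computation on the underlying DAG; the function names and the small example graph are illustrative assumptions and do not reproduce the graph of Figure 13.1.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm; graph maps each vertex to a list of its successors."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = [0]

    def visit(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def vertex_levels(graph, components):
    """Level of each vertex: longest path in the underlying DAG starting in its component."""
    comp_of = {v: i for i, comp in enumerate(components) for v in comp}
    dag = {i: set() for i in range(len(components))}
    for v, successors in graph.items():
        for w in successors:
            if comp_of[v] != comp_of[w]:
                dag[comp_of[v]].add(comp_of[w])
    level = {}

    def longest(i):
        if i not in level:
            level[i] = max((1 + longest(j) for j in dag[i]), default=0)
        return level[i]

    return {v: longest(comp_of[v]) for v in graph}


# Hypothetical example: {a, b} -> {c, d} -> {e}.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c", "e"], "e": []}
components = strongly_connected_components(graph)
levels = vertex_levels(graph, components)
print(levels)
```

The recursion keeps the sketch short; for very large graphs an iterative depth-first search would avoid Python's recursion limit.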

13.1.1. Computing PageRank

Using a partition of the graph into SCCs, it is possible to formulate a PageRank algorithm that takes this partitioning into account by considering each component individually rather than looking at the graph as a whole. Since vertices in non-normalized PageRank only affect the PageRank of vertices they can reach, a component can never influence the rank of any other higher or same level components. This means that it is possible to first calculate PageRank of all the highest level components, then the second highest and so on, while adjusting the personalization vector of lower level components to account for any edges between components. This formulation allows for the use of fewer iterations or even different methods for different components, which typically gives a lower overall number of iterations at the cost of performing an initial depth-first search. For a more detailed explanation of this method and how the personalization vector adjustments are made, we refer to our previous work (Engström and Silvestrov 2015).

1) Partition the graph into SCCs and find their corresponding levels.
2) For each level (starting at the highest):
- calculate PageRank for each component on the current level (can be done in parallel);
- adjust the weight vector for all lower-level components.

Calculating PageRank of a graph is equivalent to calculating the following power series (Andersson and Silvestrov 2008):

$$\vec{R} = \sum_{k=0}^{\infty} (cA^{\top})^k \vec{w}, \tag{13.2}$$

where A is the adjacency matrix, scaled in such a way that all edge weights are equal to $1/n_i$, where $n_i$ is the number of outgoing edges of vertex $v_i$. This ensures that A is a stochastic matrix except for possible zero-rows. 0 < c < 1 is a scalar that ensures that cA is substochastic and therefore also guarantees the convergence of the series. $\vec{w}$ is a personalization vector supplying an initial rank to each vertex. A proof showing that this is equivalent to definition 13.1 can be found in Engström and Silvestrov (2015).

13.2. Changes in personalization vector

Changes in the personalization vector are comparatively easy to handle since they do not change any path in the graph.
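The power series [13.2] can be sketched as a simple term-by-term accumulation. The sketch below assumes that the series is taken over the transpose of the row-scaled matrix A, so that rank flows along edge directions; the names `scaled_adjacency` and `pagerank_series` and the two small graphs are illustrative only.

```python
def scaled_adjacency(graph, vertices):
    """Row-scaled adjacency matrix: A[i][j] = 1/n_i for every edge v_i -> v_j."""
    position = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    a = [[0.0] * n for _ in range(n)]
    for v in vertices:
        successors = graph[v]
        for w in successors:
            a[position[v]][position[w]] += 1.0 / len(successors)
    return a


def pagerank_series(a, w, c=0.85, tol=1e-13):
    """Accumulate sum_{k>=0} (c A^T)^k w until the newest term is negligible."""
    n = len(w)
    term = list(w)
    rank = list(w)
    while max(abs(t) for t in term) > tol:
        # One more step of the walk: multiply the previous term by c * A^T.
        term = [c * sum(a[i][j] * term[i] for i in range(n)) for j in range(n)]
        rank = [r + t for r, t in zip(rank, term)]
    return rank


# Two-vertex cycle: each vertex gets 1 + c + c^2 + ... = 1 / (1 - c).
cycle = {"u": ["v"], "v": ["u"]}
rank_cycle = pagerank_series(scaled_adjacency(cycle, ["u", "v"]), [1.0, 1.0], c=0.5)

# Vertex without outgoing edges: no special handling is needed.
chain = {"u": ["v"], "v": []}
rank_chain = pagerank_series(scaled_adjacency(chain, ["u", "v"]), [1.0, 1.0], c=0.5)
print(rank_cycle, rank_chain)
```

Because the ranks are not normalized, the values can be compared across graphs, as noted after definition 13.1.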

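LEMMA 13.1.– Consider a graph with PageRank $\vec{R}^1$ and weight vector $\vec{w}^1$. Then the new PageRank $\vec{R}^2$, given a new personalization vector $\vec{w}^2 = \vec{w}^1 + \Delta\vec{w}$, can be written as:

$$R_j^2 = R_j^1 + \left( \Delta w_j + \sum_{v_i \in S,\, v_i \neq v_j} \Delta w_i P_{ij} \right) \left( \sum_{k=0}^{\infty} (P_{jj})^k \right). \tag{13.3}$$

PROOF.– The proof is very straightforward using the definition and factoring out the old rank.

$$\begin{aligned} R_j^2 &= \left( w_j^1 + \Delta w_j + \sum_{v_i \in S,\, v_i \neq v_j} (w_i^1 + \Delta w_i) P_{ij} \right) \left( \sum_{k=0}^{\infty} (P_{jj})^k \right) \\ &= \left( w_j^1 + \sum_{v_i \in S,\, v_i \neq v_j} w_i^1 P_{ij} \right) \left( \sum_{k=0}^{\infty} (P_{jj})^k \right) + \left( \Delta w_j + \sum_{v_i \in S,\, v_i \neq v_j} \Delta w_i P_{ij} \right) \left( \sum_{k=0}^{\infty} (P_{jj})^k \right). \end{aligned}$$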

As seen, rather than recalculating PageRank completely, it is possible to calculate only the change in rank, given that there are only changes to the personalization vector. Having an efficient way to calculate these kinds of changes is important, since changes in high-level components can be seen as a personalization vector change in lower-level components when each component is considered separately. This method fits well into the framework of the baseline method for calculating PageRank described earlier (skipping the component-finding step). This way, changes in one component naturally propagate as personalization vector changes on lower-level components, and we avoid doing any calculations for components that remain unchanged. Assuming that changes are small, we can also expect calculating the change in rank of a component to converge faster than starting from the beginning.
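Lemma 13.1 can be illustrated with the power series form: since the series is linear in the weight vector, the new rank is the old rank plus the series applied to the change in weights only. A minimal sketch under that reading; the helper `series`, the matrix `a` and the vectors `w1` and `dw` are illustrative assumptions.

```python
def series(a, w, c=0.85, tol=1e-13):
    """Accumulate sum_{k>=0} (c A^T)^k w for a row-scaled matrix A given as rows."""
    n = len(w)
    term, total = list(w), list(w)
    while max(abs(t) for t in term) > tol:
        term = [c * sum(a[i][j] * term[i] for i in range(n)) for j in range(n)]
        total = [s + t for s, t in zip(total, term)]
    return total


# Hypothetical three-vertex graph: 0 -> {1, 2}, 1 -> 2, 2 -> 0.
a = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
w1 = [1.0, 1.0, 1.0]   # old personalization vector
dw = [0.0, 0.5, 0.0]   # change in the personalization vector

r1 = series(a, w1)
# Update: only the (small) difference vector is propagated.
r2_update = [old + delta for old, delta in zip(r1, series(a, dw))]
# Reference: full recalculation with the new personalization vector.
r2_full = series(a, [x + y for x, y in zip(w1, dw)])
print(r2_update, r2_full)
```

When the change in weights is small, its series reaches the tolerance in fewer iterations than a full recalculation.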


13.3. Adding or removing edges between components

Next we are going to consider the main results of this chapter, namely the addition or removal of edges between two components under certain conditions as outlined below.

1) The only change in the graph is the addition and/or removal of one or more edges from a single “source” vertex to one or more “target” vertices belonging to other components.
2) New edges do not create any cycle in the underlying DAG.
3) The personalization vector remains unchanged in the “source” component.

The first condition is slightly more general than only allowing single edge additions/removals, while still limiting the amount of potential change in rank of the source component. The second condition is required in order to make sure that any changes in rank only depend on the source and target components and to avoid dependence on the rank of the target components for the rank of the source component. The second condition is most easily achieved by only allowing new edges to be formed from the source component to lower- or same-level components. By not allowing changes in the personalization vector for the source component, we again limit the amount of change in rank we get in this component. Changes in the personalization vector of the source component could be handled by first considering only the change in personalization vector as outlined in the previous section, and then continuing as if there was no change in the personalization vector.

For convenience, we introduce the following notation:
– $P_{\to j} = w_j + \sum_{v_i \in V,\, v_i \neq v_j} w_i P_{ij}$;
– $P_{ab}(c)$ is the probability to reach $v_b$ starting in $v_a$ after passing through $v_c$ at least once;
– $P_{ab}(\bar{c})$ is the probability to reach $v_b$ starting in $v_a$ without ever passing through $v_c$.

Note that using this notation, PageRank can also be written as:

$$R_a = \frac{P_{\to a}}{1 - P_{aa}}. \tag{13.4}$$

Before looking at the actual problem, we will start by formulating the following lemma that states that PageRank can be decomposed into two parts representing all paths that go through or do not go through some vertex va .


LEMMA 13.2.– PageRank of a single vertex $v_b$ can be written as:

$$R_b = \frac{P_{\to b}(\bar{a})}{1 - P_{bb}(\bar{a})} + \frac{R_a P_{ab}(\bar{a})}{1 - P_{bb}(\bar{a})}. \tag{13.5}$$

This can be seen as a decomposition of paths: the left expression is the sum of all paths that do not go through vertex $v_a$ and the right expression is the sum of all paths going through vertex $v_a$ at least once.

PROOF.– We start by rewriting PageRank as a sum of all visits to $v_b$ before any visit to $v_a$, plus all visits to $v_b$ after one visit to $v_a$ but before the second visit to $v_a$, and so on.

$$R_b = \frac{P_{\to b}(\bar{a})}{1 - P_{bb}(\bar{a})} + \frac{P_{\to a} P_{ab}(\bar{a})}{1 - P_{bb}(\bar{a})} + \frac{P_{\to a} P_{aa} P_{ab}(\bar{a})}{1 - P_{bb}(\bar{a})} + \frac{P_{\to a} (P_{aa})^2 P_{ab}(\bar{a})}{1 - P_{bb}(\bar{a})} + \ldots$$

The second and later expressions can be identified as a geometric sum resulting in:

$$= \frac{P_{\to b}(\bar{a})}{1 - P_{bb}(\bar{a})} + \sum_{k=0}^{\infty} \frac{P_{\to a} (P_{aa})^k P_{ab}(\bar{a})}{1 - P_{bb}(\bar{a})}.$$

Solving the geometric sum and using [13.4] completes the proof:

$$= \frac{P_{\to b}(\bar{a})}{1 - P_{bb}(\bar{a})} + \frac{P_{\to a} P_{ab}(\bar{a})}{(1 - P_{bb}(\bar{a}))(1 - P_{aa})} = \frac{P_{\to b}(\bar{a})}{1 - P_{bb}(\bar{a})} + \frac{R_a P_{ab}(\bar{a})}{1 - P_{bb}(\bar{a})}.$$

Adding or removing edges between components from a single vertex can be seen as a personalization vector change for the target components, as outlined in the previous section. The remaining problem is how this affects the source component itself, since changing the number of outgoing edges of a vertex changes the weight on each of those edges as well. This leads us to consider the problem of changing the weights on all outgoing edges from a single vertex, as seen in theorem 13.1.

THEOREM 13.1.– Consider a graph with PageRank $\vec{R}^1$, and let $e_a^1$ be the weight of all edges going out of vertex $v_a$. After changing edge weights on edges out of $v_a$ to $e_a^2$, the PageRank $R_a^2$ of vertex $v_a$ and PageRank $R_b^2$ of any other vertex $v_b \neq v_a$ can be written as:

$$R_a^2 = \frac{P_{\to a}^1}{1 - P_{aa}^1 \dfrac{e_a^2}{e_a^1}},$$

$$R_b^2 = R_b^1 + \frac{\left( R_a^2 \dfrac{e_a^2}{e_a^1} - R_a^1 \right) P_{ab}^1(\bar{a})}{1 - P_{bb}^1(\bar{a})}.$$

PROOF.– The first statement is easily shown to be correct. First of all, $P_{\to a}^2 = P_{\to a}^1$ since edges going out of $v_a$ have no effect on the first hitting probability of $v_a$ in a random walk on the graph. Similarly, it is easy to see that the probability of return paths only changes by the new edge weight $e_a^2$, which gives $P_{aa}^2 = P_{aa}^1 e_a^2 / e_a^1$. Both of these together prove the first statement.

For the second part, we start by decomposing PageRank using lemma 13.2:

$$R_b^2 = \frac{P_{\to b}^2(\bar{a})}{1 - P_{bb}^2(\bar{a})} + \frac{R_a^2 P_{ab}^2(\bar{a})}{1 - P_{bb}^2(\bar{a})} = \frac{P_{\to b}^1(\bar{a})}{1 - P_{bb}^1(\bar{a})} + \frac{R_a^2 P_{ab}^1(\bar{a}) \dfrac{e_a^2}{e_a^1}}{1 - P_{bb}^1(\bar{a})}.$$

Here, the second equality is found by realizing that $P_{\to b}^2(\bar{a}) = P_{\to b}^1(\bar{a})$ and $P_{bb}^2(\bar{a}) = P_{bb}^1(\bar{a})$, since we skip the paths through $v_a$ when calculating these, and $P_{ab}^2(\bar{a}) = P_{ab}^1(\bar{a}) e_a^2 / e_a^1$, since all these paths go through $v_a$ exactly once and therefore need to be scaled by the new edge weight.

Next, we apply lemma 13.2 again in reverse on the left-hand term, which completes the proof:

$$R_b^2 = R_b^1 - \frac{R_a^1 P_{ab}^1(\bar{a})}{1 - P_{bb}^1(\bar{a})} + \frac{R_a^2 P_{ab}^1(\bar{a}) \dfrac{e_a^2}{e_a^1}}{1 - P_{bb}^1(\bar{a})} = R_b^1 + \frac{\left( R_a^2 \dfrac{e_a^2}{e_a^1} - R_a^1 \right) P_{ab}^1(\bar{a})}{1 - P_{bb}^1(\bar{a})}.$$

While theorem 13.1 considers a whole graph, this graph change is identical to the one we would get in an SCC if we added or removed edges from one of the vertices in the component to other (lower level) components. This means we can use theorem 13.1 for the source component and lemma 13.1 for the changes in rank in the target and other lower-level components.

13.3.1. Computations in practice

To be able to use theorem 13.1 in practice, we would like to first rewrite it in a way that can be computed efficiently. Fortunately, it is easy to rewrite it using a power series, as seen below.

We start by defining the vector $\vec{Q}^a$, with elements $Q_b^a$, which can be seen as the expected number of hits to each vertex in the random walk on the graph where we start in $v_a$ and then, after leaving it, remove all edges out of $v_a$ (ensuring we do not count any paths going through $v_a$).

$$Q_b^a = \begin{cases} P_{aa}^1, & b = a \\ \dfrac{P_{ab}^1(\bar{a})}{1 - P_{bb}^1(\bar{a})}, & b \neq a \end{cases}$$

To calculate $\vec{Q}^a$, we define a new vector $\vec{U}$ as the 1-step probabilities of a random walk starting in $v_a$ before the edge weight change.

$$U_b = \begin{cases} e_{ab}, & (a, b) \in E \\ 0, & (a, b) \notin E \end{cases}$$

Let B be equal to the scaled adjacency matrix after removing all outgoing edges from $v_a$, which gives the following power series for calculating $\vec{Q}^a$:

$$\vec{Q}^a = \sum_{k=0}^{\infty} (cB^{\top})^k \vec{U}.$$

In the vector form, this gives:

$$R_a^2 = \frac{P_{\to a}^1 (1 - P_{aa}^1)}{(1 - P_{aa}^1) \left( 1 - P_{aa}^1 \dfrac{e_a^2}{e_a^1} \right)} = \frac{R_a^1 (1 - Q_a^a)}{1 - Q_a^a \dfrac{e_a^2}{e_a^1}},$$

$$\vec{R}_{\bar{a}}^2 = \vec{R}_{\bar{a}}^1 + \left( R_a^2 \frac{e_a^2}{e_a^1} - R_a^1 \right) \vec{Q}_{\bar{a}}^a.$$
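The update of theorem 13.1 can be checked numerically on a small example. The sketch below makes two assumptions that are not stated explicitly in the text: the 1-step probabilities in U are taken to include the factor c, and both power series are taken over the transposed matrices. The helper `series`, the matrix `A`, the vertex index `a_idx` and the ratio `ratio` (standing for $e_a^2 / e_a^1$) are illustrative names.

```python
def series(rows, v, c, tol=1e-14):
    """Accumulate sum_{k>=0} (c M^T)^k v for a matrix M given as a list of rows."""
    n = len(v)
    term, total = list(v), list(v)
    while max(abs(t) for t in term) > tol:
        term = [c * sum(rows[i][j] * term[i] for i in range(n)) for j in range(n)]
        total = [s + t for s, t in zip(total, term)]
    return total


c = 0.85
a_idx = 0                      # the "source" vertex v_a
A = [[0.0, 0.5, 0.5],          # v_0 -> v_1, v_2 (weight 1/2 each)
     [1.0, 0.0, 0.0],          # v_1 -> v_0
     [0.5, 0.5, 0.0]]          # v_2 -> v_0, v_1
w = [1.0, 1.0, 1.0]
ratio = (1.0 / 3.0) / 0.5      # v_0 gains one edge leaving the component: 1/2 -> 1/3

r1 = series(A, w, c)           # old PageRank

# Q^a: walk started in v_a, with all edges out of v_a removed afterwards.
B = [([0.0] * len(w) if i == a_idx else list(row)) for i, row in enumerate(A)]
U = [c * A[a_idx][b] for b in range(len(w))]   # assumption: includes the factor c
Q = series(B, U, c)

# Theorem 13.1 in vector form.
ra2 = r1[a_idx] * (1.0 - Q[a_idx]) / (1.0 - Q[a_idx] * ratio)
updated = [
    ra2 if b == a_idx else r1[b] + (ra2 * ratio - r1[a_idx]) * Q[b]
    for b in range(len(w))
]

# Reference: recompute from scratch with the rescaled row of v_a.
A2 = [([x * ratio for x in row] if i == a_idx else list(row)) for i, row in enumerate(A)]
direct = series(A2, w, c)
print(updated, direct)
```

Both vectors agree up to the tolerance of the series, while the update only needs the series for Q, whose starting vector U is sparse.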

So we see that to find the new rank after the change, we need to solve a problem similar to the original one (calculating it from scratch) but with a slightly different starting vector and matrix to work with. This approach has some potential advantages. First of all, we can expect the sum to converge at least as fast as the original problem due to B being identical to A with a couple of removed elements. Another potential advantage is that initially $\vec{U}$ will be sparse, hence making it possible to work with a sparse vector as well as matrix for the first couple of iterations.

13.3.2. Adding or removing an edge inside a component

While likely not particularly useful for calculations in practice, we will also show how these results can be used to find a two-step method for updating PageRank when adding or removing a single edge inside a component.


THEOREM 13.2.– Consider a graph with PageRank $\vec{R}^1$, and let $e_a^1$ be the weight of all edges going out of vertex $v_a$. After adding a new edge $e_{a\alpha}$ from $v_a$ to $v_\alpha$, the PageRank $R_a^2$ of vertex $v_a$ and PageRank $R_b^2$ of any other vertex $v_b \neq v_a$ can be written as:

$$R_a^2 = \frac{P_{\to a}^1}{1 - P_{aa}^1 \dfrac{e_a^2}{e_a^1} - P_{aa}^2(e_{a\alpha})},$$

$$R_b^2 = R_b^1 + \frac{\left( R_a^2 \dfrac{e_a^2}{e_a^1} - R_a^1 \right) P_{ab}^1(\bar{a}) + R_a^2 P_{ab}^2(e_{a\alpha}, \bar{a})}{1 - P_{bb}^1(\bar{a})}.$$

PROOF.– We begin by proving the first statement. Since adding an edge from a vertex also changes the weight on all other outgoing edges from it, this can be seen as two consecutive changes: first a change in edge weights as described in theorem 13.1, and second a weight change in vertex $v_\alpha$. If we first do the edge weight change, this gives:

$$R_a^2 = \frac{P_{\to a}^1}{1 - P_{aa}^1 \dfrac{e_a^2}{e_a^1}}.$$

We then add the new edge $e_{a\alpha}$ without changing any other edge weights; this results in an addition to the return probability of $v_a$ by going through the new edge: $P_{aa}^2(e_{a\alpha})$. Adding these two together gives PageRank after both changes as:

$$R_a^3 = \frac{P_{\to a}^3}{1 - P_{aa}^1 \dfrac{e_a^2}{e_a^1} - P_{aa}^3(e_{a\alpha})} = \frac{P_{\to a}^1}{1 - P_{aa}^1 \dfrac{e_a^2}{e_a^1} - P_{aa}^3(e_{a\alpha})},$$

where $P_{\to a}^3 = P_{\to a}^1$ since only edges out of $v_a$ were changed or added and these have no effect on $P_{\to a}$. Combining the two changes into one by swapping the 3 to 2 in the last expression proves the first statement.

For the second half, we use a similar approach as in the proof of theorem 13.1. From lemma 13.2, we have

$$R_b^2 = \frac{P_{\to b}^2(\bar{a})}{1 - P_{bb}^2(\bar{a})} + \frac{R_a^2 P_{ab}^2(\bar{a})}{1 - P_{bb}^2(\bar{a})} = \frac{P_{\to b}^1(\bar{a})}{1 - P_{bb}^1(\bar{a})} + \frac{R_a^2 \left( P_{ab}^2(\bar{e}_{a\alpha}, \bar{a}) + P_{ab}^2(e_{a\alpha}, \bar{a}) \right)}{1 - P_{bb}^1(\bar{a})},$$

where we once more use that $P_{\to b}^2(\bar{a}) = P_{\to b}^1(\bar{a})$ and $P_{bb}^2(\bar{a}) = P_{bb}^1(\bar{a})$, and we furthermore divide the last term into those paths going through the new edge and those that do not. For the paths that do not go through the new edge, the only change is the weight on the outgoing edges of $v_a$, so we get:

$$P_{ab}^2(\bar{e}_{a\alpha}, \bar{a}) = P_{ab}^1(\bar{a}) \frac{e_a^2}{e_a^1}.$$


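Using this and lemma 13.2 in reverse on the left-hand term, we get

$$R_b^2 = R_b^1 - \frac{R_a^1 P_{ab}^1(\bar{a})}{1 - P_{bb}^1(\bar{a})} + \frac{R_a^2 \left( P_{ab}^1(\bar{a}) \dfrac{e_a^2}{e_a^1} + P_{ab}^2(e_{a\alpha}, \bar{a}) \right)}{1 - P_{bb}^1(\bar{a})},$$

$$R_b^2 = R_b^1 + \frac{\left( R_a^2 \dfrac{e_a^2}{e_a^1} - R_a^1 \right) P_{ab}^1(\bar{a}) + R_a^2 P_{ab}^2(e_{a\alpha}, \bar{a})}{1 - P_{bb}^1(\bar{a})}.$$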

This completes the proof. Unfortunately, theorem 13.2 cannot be used to update PageRank using a single power series similar to changing weights or vertices or outgoing edges from a vertex. This is due to there being double the amount of unknowns, Pa a2 (eaα ) and 2 Pab (eaα , a ¯), it is still possible to do it in two steps however, first by doing an edge weight change and then vertex weight change from the new edge. We can make one positive observation however, namely that adding more than one outgoing edge from a single vertex is not significantly harder than adding one. Adding several outgoing edges can also be described in two steps, first by changing the weight on old outgoing edges and then calculate the additional rank added from new edges simultaneously in the same way as we would do for one. Finally, we note that although the theorem is stated for a single added outgoing edge here, almost the same result and proof can be created for the removal of a single outgoing edge as well. 13.3.3. Maintaining the component structure If a new edge targets a higher level component, we might create a new cycle, requiring the merge of two or more components. This means that we need to be able to adjust the component structure, otherwise the overall method would no longer give the correct ranks. Unfortunately, there is no simple equation for the rank of the vertices in the merged component. Fortunately any change in other non-merged components would still propagate naturally as weight adjustments even if their level might possibly have changed. Rather than looking at how to update the rank of the merged component, we will focus on showing how the underlying component structure could be maintained at a low cost. We start by creating the underlying DAG by collapsing each old component into a vertex. Put an edge between every pair of vertices whenever there is an edge between corresponding component (after adding or removing new edges). After this


new underlying graph is drawn, we can partition it into SCCs as normal and find their levels using a depth-first search. Any two vertices that belong to the same component in the underlying graph indicate a merge of the corresponding components in the original graph. This gives rise to the following proposed adjustments to the original method for updating PageRank after changes in the graph:

1) adjust the component structure by doing a depth-first search on the underlying DAG and any newly created vertices;
2) maintain both the final weight vector (with weight adjustments) for each component and the difference in the original weight vector before/after the graph changes;
3) starting at the highest level, for each level and each component, calculate PageRank depending on the type of changes:
- only weight change: calculate PageRank of the difference in weight vector and add it to the old rank;
- only vertex va changed in the component (different edge weights): use the old rank and a random walk from va as described in section 13.3.1;
- other internal changes: use an iterative method with the old rank as initial guess (e.g. Jacobi method or a power series);
4) adjust the weight vector for all lower-level components.

We note that although this method can find merges of components relatively easily, because the underlying graph is typically much smaller than the original graph, it cannot detect when a component could be split into two separate components because of deleted edges within it. Detecting such splits requires redoing the depth-first search in every component with deleted edges, which is considerably more time consuming, especially in the presence of a giant component containing a large share of the total number of vertices.
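The merge detection described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the mapping `comp_of` (vertex to old component id), the function names and the toy graph are hypothetical, every vertex (including a new one, as a singleton) is assumed to have an entry in `comp_of`, and Kosaraju's two-pass depth-first search is used to find the SCCs of the collapsed graph.

```python
from collections import defaultdict

def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative). Returns {node: representative of its SCC}."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: depth-first search on the graph, recording nodes by finishing time.
    finished, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            nxt = next((w for w in neighbours if w not in seen), None)
            if nxt is None:
                finished.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(fwd[nxt])))

    # Pass 2: search the reversed graph in decreasing finishing time.
    rep = {}
    for root in reversed(finished):
        if root in rep:
            continue
        rep[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for w in rev[node]:
                if w not in rep:
                    rep[w] = root
                    stack.append(w)
    return rep

def merged_components(comp_of, edges):
    """Return the sets of old components that merge after the edge changes.

    comp_of: {vertex: id of its old component}; edges: current (u, v) pairs.
    """
    # Collapse each old component into one vertex of the underlying graph.
    underlying = {(comp_of[u], comp_of[v]) for u, v in edges
                  if comp_of[u] != comp_of[v]}
    rep = strongly_connected_components(sorted(set(comp_of.values())), underlying)
    groups = defaultdict(set)
    for comp, root in rep.items():
        groups[root].add(comp)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical example: three old components A -> B -> C.
comp_of = {1: "A", 2: "A", 3: "B", 4: "C", 5: "C"}
old_edges = [(1, 2), (2, 1), (2, 3), (3, 4), (4, 5), (5, 4)]
print(merged_components(comp_of, old_edges))             # no merge
print(merged_components(comp_of, old_edges + [(4, 2)]))  # A, B and C form a cycle
```

Because the search runs on one vertex per old component rather than on the original graph, its cost depends on the number of components and inter-component edges only; as noted in the text, this sketch finds merges but not possible splits.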
Since having a component that is in fact two separate components is not a huge problem computationally and does not affect the validity of the method, we propose skipping this step unless there is reason to believe a large component could be split up into multiple similar-size components as the result of merges and subsequent deletions.

13.4. Conclusions

In this chapter, we have shown how it is possible to update PageRank in a changing network given two common types of change to the graph or personalization vector. First, we showed how changes in the personalization vector can be handled by simply using the normal iteration procedure with the difference in the personalization vector


before and after. The same method also gives a natural way to update the rank of one component because of changes in other higher level components. Second, we showed how the special case of adding or removing edges between components can sometimes be handled more effectively. Although the method can be trivially extended to handle multiple source vertices by treating them as a sequence of single-source changes, this means calculating multiple power sums, which in most cases is unlikely to be faster than calculating PageRank normally. We also showed how the change in PageRank after the addition of edges from a single source vertex inside a component can be computed as two separate changes, by first considering the weight change of old edges and then the new edges themselves. Unfortunately, since this requires solving two linear systems of the same size instead of one, it is unlikely to be useful for computation in practice.

In the future, we would like to look at other special cases that could potentially be handled more effectively in a similar way, for example multiple vertex/edge deletions inside or between components, or finding a better solution for the multi-source problem considered here.

13.5. References

Andersson, F., Silvestrov, S. (2008). The mathematics of internet search engines. Acta Appl. Math., 104, 211–242.
Brin, S., Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7), 107–117.
Chung, F., Tsiatas, A. (2012). Finding and visualizing graph clusters using PageRank optimization. Internet Mathematics, 8, 46–72.
Engström, C., Silvestrov, S. (2015). A componentwise PageRank algorithm. 16th ASMDA 2015 Conference and Demographics Proceedings, 3, 1–14.
Ishii, H. (2009). Distributed randomized PageRank computation based on web aggregation. Proceedings of the 48th IEEE Conference on Decision and Control held jointly with the 28th Chinese Control Conference (CDC/CCC 2009), 3026–3031.
Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H. (2003). The EigenTrust algorithm for reputation management in P2P networks. Proceedings of the 12th International Conference on World Wide Web, WWW ’03, 640–651.
Lee, C.P., Golub, G.H., Zenios, S.A. (2007). A two-stage algorithm for computing PageRank and multistage generalizations. Internet Mathematics, 4(4), 299–327.
Tarjan, R.E. (1972). Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2), 146–160.
Yu, Q., Miao, Z., Wu, G., Wei, Y. (2012). Lumping algorithms for computing Google’s PageRank and its derivative, with attention to unreferenced nodes. Information Retrieval, 15(6), 503–526.

14 Visualizing the Political Spectrum of Germany by Contiguously Ordering the Party Policy Profiles

The data from the German voting advice application (VAA), the Wahl-O-Mat, are used to empirically construct and visualize the political spectrum of Germany. For this purpose, we consider the positions of 28 German parties on 38 policy issues declared before the 2013 federal election and associate the parties with the 38-dimensional vectors of their policy profiles. These vectors are used to define the party proximity in the party space. Principal component analysis (PCA) reveals that the parties constitute a thin ellipsoid whose two longest diameters cover 83.4% of the total variance. Reducing the model to just these two dimensions, a one-dimensional party ordering is found, which is exactly the left-right axis rolled into a circumference. This reflects the fact that the far-left and far-right ends of the political spectrum approach each other, although they remain distant, which gives the spectrum its horseshoe-like shape.

14.1. Introduction

Discussing radical changes in the world order at the turn of the century, many political scientists started promoting the viewpoint that the left-right alignment of parties is becoming outdated (see, for instance, Giddens 1994; Manin 1997; Mitchell 2007; Voda 2014). It is argued that after the fall of the Soviet Union and the Eastern Bloc, the class opposition lost the impetus of its inspiration by a systemic alternative. On the other hand, climate change, globalization, the West’s competition with an inexorably rising China and India, an aging population, migration, ethnic tensions, religious intolerance and international terrorism have swayed public attention away from left-right political confrontations toward less ideological and more pragmatic matters. For instance, subordinating supranational class interests to

Chapter written by Andranik TANGIAN.

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


national geopolitical challenges, Streeck (1999) develops the idea of employer–employee “competitive solidarity”, which must somehow supplant the concept of class conflict, convince employees to cooperate with employers instead of demanding better working conditions and enhance national cohesion against external threats. Some authors emphasize that, due to increasing interdependence between countries, political platforms have come to be perceived as a constraint on responding flexibly to the trends of globalization. This results in the emergence of manager-type politicians who are less bound to a platform and instead compete for votes by adjusting their positions to the numerous cleavages of society and advertising themselves in the media before large audiences:

In party democracy, electoral cleavages reflect class division. In a number of Western societies, the situation today is different. No socioeconomic or cultural cleavage is evidently more important and stable than others. To be sure, citizens do not constitute a homogeneous mass that can be divided in any manner by the choices they are offered, but the social and cultural lines of cleavage are numerous, cross-cutting and rapidly changing. Such an electorate is capable of a number of splits. The number of floating voters who do not cast their ballot on the basis of stable party identification is increasing. A growing segment of the electorate tends to vote according to the stakes and issues of each election (Manin 1997, pp. 209, 223, 231).

From all of this, it is concluded that the political spectrum is becoming essentially multidimensional, replacing the former left-right ideological alignment. This viewpoint is reflected in numerous studies, in particular the prize-winning MANIFESTO project/database, with its over 400-dimensional tabular representation of party programs from more than 50 countries covering all free democratic elections since 1945 (Budge et al. 2001; Klingemann et al. 2006; Budge and McDonald 2007) (see also Linhart and Shikano 2007; Volkens et al. 2013; WZB 2015). The VAAs (interactive Internet sites that help users find the candidates whose policy profiles are closest to theirs) implemented in about 20 countries also assume multiple cleavages and, correspondingly, essentially multidimensional political spectra (Kieskompas 2006; EU Profiler 2009; Garzia and Marschall 2014; Vote Match Europe 2015).

In this chapter, we empirically test the thesis about the multiplicity of equally significant political dimensions with an example from the German political space, as it is given in the German VAA, Wahl-O-Mat (an invented word composed from the German Wahl, or election, plus Automat). The Wahl-O-Mat provides the official positions of the 28 political parties that participated in the 2013 Bundestag (federal) election on 38 topical policy issues (“Introduce a nationwide minimum wage?” –

yes/no, “Introduce a general speed limit on motorways?” – yes/no, etc.) (Bundeszentrale für politische Bildung 2013).

To perform our study, we associate the parties with the 38-dimensional vectors of their policy profiles and use them to define the party proximity in the party space. The statement in question, that the party space is essentially multidimensional, would imply that the parties are scattered through the party space more or less homogeneously, resulting in a ball-shaped cloud of “observations”. However, applying PCA to the parties’ proximity matrix reveals that the parties constitute a thin ellipsoid whose two longest diameters cover 83.4% of the total variance. Reducing the model to these two dimensions, a one-dimensional contiguous party ordering is found that is exactly the left-right axis rolled into a circumference. This reflects the fact that the far-left and far-right ends approach each other, although they remain somewhat distant, so that the political spectrum is Ω-shaped (i.e. looks like a horseshoe). It should be emphasized that the left-right axis arises without any normative assumptions but “objectively” from the party positions on issues with no apparent ideological or class-oriented content. This empirical finding calls into question the assertion that the left-right axis is outdated.

The circular shape of the political spectrum explains why linear empirical models fail to recognize its one-dimensionality (Sulakshin 2010; Voda 2014): a circumference, although one-dimensional itself, cannot be placed in a one-dimensional Euclidean space; to accommodate it, a Euclidean space with at least two line axes is needed. Thus, our finding bridges two types of spatial political models (Gill and Hangartner 2010, Sect. 8): directional models of successive policy shifts with circular representations and angular measures (Matthews 1979; Grofman 1985; Schofield 1985; Rabinowitz and MacDonald 1989; Linhart and Shikano 2007), and proximity models that describe the distance between political agents in the Euclidean space with line axes.

14.2. The model

Recall that the Wahl-O-Mat is the German version of the Dutch Internet site StemWijzer (“VoteMatch”), which was originally developed in the 1990s to involve young people in political participation (Pro demos 2015). Both websites help users locate themselves on the political landscape by testing how well their opinions match party positions. Before an election (local, regional, federal or European), a special governmental supervising committee compiles a list of questions on topical policy issues (“Introduce a nationwide minimum wage?” – yes/no, “Introduce a general speed limit on motorways?” – yes/no, etc.) and asks the parties participating in the election for their answers. A user of the site answers the same questions, optionally attributing weights to reflect their importance, and then the program compares his or her political profile with those of the parties and finds the best-matching party, the


second best-matching party, etc. On the occasion of the 2013 federal election, the Wahl-O-Mat was visited about 13 million times, which is about 30% of the 44 million Germans who took part in the election (Bundeswahlleiter 2013; Bundeszentrale für politische Bildung 2013). The 2013 Wahl-O-Mat had the following 38 questions (Bundeszentrale für politische Bildung 2013):

1) Minimum wage. Introduce a nationwide minimum wage.
2) Childcare subsidy. Childcare subsidy for parents whose kids do not attend state-sponsored day care.
3) Speed limit. Introduce a general speed limit on highways.
4) Euro. Germany should retain the Euro as its currency.
5) Electricity prices. Electricity prices should be more heavily regulated by the state.
6) Video surveillance. Video surveillance in public spaces should be expanded.
7) Basic income. Germany should introduce an unconditional basic income.
8) Organic agriculture. Only organic agriculture should be subsidized.
9) Separate school education. All children, regardless of cultural heritage, should receive equal education.
10) Top income tax rate. The top income tax rate should be increased.
11) Leaving NATO. Germany should leave NATO.
12) Coal-fired power. No new construction of coal-fired energy plants.
13) “Morning after” pill. The “morning after” pill must be available by prescription only.
14) Nationalization of banks. All banks in Germany should be nationalized.
15) Refugee policy. Germany should accept more refugees.
16) Compensation for care time. State compensation for the time employees spend caring for incapacitated relatives.
17) Party prohibition. Political parties that are unconstitutional should continue to be illegal.
18) State subsidies to students. The level of federal student financial aid should be independent of the parents’ income.
19) Border control. Border control should be reintroduced.
20) Female quota. Institute a legal quota for women on company governing boards.


21) Financial equalization of states. Financially stronger federal states should not have to support weaker states as much.
22) Retirement at 67. The legally mandated retirement age should be brought back down.
23) Immigrants in public services. The government should employ more people with immigrant backgrounds.
24) Exports of munitions. Exports of munitions should be forbidden.
25) Income tax. Retain the tax law that favors the spouse who earns much less than the other spouse or nothing.
26) EU-membership for Turkey. Germany should champion Turkey’s bid for EU membership.
27) Supplementary income. Bundestag members should reveal their supplementary incomes to the last euro.
28) EII-contribution. Energy-intensive industries should bear more of the costs of the transition to renewable energy.
29) Sanctions against recipients of ALG II. Reduce long-term unemployment benefits for those who turn down a job offer.
30) Church tax. The state should continue to collect tithes on behalf of religious institutions.
31) Public health insurance. Require all citizens to enroll in the public health insurance system.
32) Eurobonds. Every state in the euro zone should be liable to pay its own debts.
33) Adoption rights for homosexuals. Homosexual couples should be allowed to adopt.
34) Preventive communication data collection. No collection of communication data (e.g. telephone, internet) without probable cause.
35) Rental price control. Limit rent price increases, also upon turnover of renters.
36) Double citizenship. German citizens should not be allowed to hold additional nationalities.
37) Passenger-car toll on highways. Institute a passenger-car toll on the national highways.
38) Referenda. Introduce referenda at the federal level.

The positions on these 38 questions of 28 German parties who participated in the 2013 federal election are shown in Table 14.1. Our goal is to arrange the parties in a


contiguous way, with neighboring parties having similar policy profiles (columns i, j of Table 14.1). For policy profiles i, j, their proximity pij is identified with their correlation ρij expressed in percent, so that for equal profiles i, j the proximity is pij = 1.0 × 100% = +100%, and for opposite ones the proximity is pij = −100%. When computing the pairwise correlation between columns i, j of Table 14.1, rows with a missing value in either column are omitted. As a result, two unequal party profiles can have 100% proximity, e.g. when one is complete and the other incomplete, and all non-missing values are equal in both profiles. This implies that the 100% proximity relation is not transitive. For example, the proximity of GRÜNE-PIRATEN and of PIRATEN-Nichtwähler is 100%, whereas that of GRÜNE-Nichtwähler is only 85%. This could be “repaired” by replacing missing values with the neutral response 1/2. In the given context, however, it is safer to omit missing data than to replace them. For instance, there is controversial evidence reported on the Québec and Scotland independence referenda (Durand 2015): two-thirds of those who abstain from a judgment in a pre-referendum poll ultimately vote “No” (for the status quo) at the referendum, resulting in diverging poll outcomes (with missing answers interpreted as indifference) and referenda outcomes (with disclosed positions). The non-replacement of missing values is also justified by the MCAR test (missing completely at random), which tests the null hypothesis that missingness is MCAR versus the alternative hypothesis that missingness is not MCAR (Little 1988). The test yields χ2 = 588,850 with 929 degrees of freedom and P-value = 0.00001, an extremely significant deviation from the null hypothesis. Thus, the missing data are highly unlikely to be MCAR and should not be “repaired” (this observation is due to Adrian del Pino from Chile).
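The effect of omitting missing answers pair by pair can be illustrated with a short sketch. The three toy profiles below are hypothetical (they are not columns of Table 14.1), missing answers are encoded as NaN, and the function name `proximity` is introduced only for this example.

```python
import numpy as np

def proximity(a, b):
    """Correlation (in %) of two profiles, using only questions answered by both."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    keep = ~(np.isnan(a) | np.isnan(b))        # drop rows with any missing value
    return 100.0 * np.corrcoef(a[keep], b[keep])[0, 1]

nan = float("nan")
x = [1, -1,   1,  -1, 1, 1]   # complete profile
y = [1, -1, nan, nan, 1, 1]   # incomplete profile, agrees with x and z where defined
z = [1, -1,  -1,   1, 1, 1]   # complete profile, differs from x on two questions

print(proximity(x, y), proximity(y, z), proximity(x, z))
```

In this toy case the pairs (x, y) and (y, z) both reach 100% because they are compared only on the four questions y answers, while x and z, compared on all six questions, reach only 25%. This is the same non-transitivity as in the GRÜNE, PIRATEN and Nichtwähler example.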
The proximity triangle for the party policy profiles (the bottom-left half of the proximity matrix computed for Table 14.1) is shown in Figure 14.1 as a “relief table”: the cells with high and low values are dark (as mountains and deep ocean in geographic maps) and those with moderately positive and negative values are pale (as plains or shallow waters). The random distribution of colors in Figure 14.1 indicates that parties with close profiles are not close neighbors in the given party ordering. Here, the parties are ordered by votes received in the 2013 federal election. Indeed, under a contiguous party ordering, the proximity triangle would look well structured: dark cells would build a “mountain ridge” along the diagonal, having at their foot pale elements, and finally dark elements for “ocean depths”. The dispersion of “heavy” and “light” elements is characterized by the index S, the sum of proximity coefficients multiplied by the squared Manhattan distances to the diagonal:

$$S = \sum_{i>j} p_{ij}\,(i-j-1)^2 = \sum_{i>j} \rho_{ij}\,(i-j-1)^2 \times 100\%. \qquad [14.1]$$
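A minimal sketch of index [14.1] follows. The function name `dispersion_index` and the 4 × 4 toy proximity matrix are assumptions made for illustration; blank (missing) cells are not handled.

```python
import numpy as np

def dispersion_index(P):
    """Index S of [14.1]: sum over i > j of p_ij * (i - j - 1)**2."""
    P = np.asarray(P, dtype=float)
    i, j = np.tril_indices(P.shape[0], k=-1)   # all index pairs with i > j
    return float(np.sum(P[i, j] * (i - j - 1) ** 2))

# Toy proximities (in %) of four parties already in a contiguous order:
# proximity falls steadily with distance in the ordering.
P = np.array([[100,  50,   0, -50],
              [ 50, 100,  50,   0],
              [  0,  50, 100,  50],
              [-50,   0,  50, 100]])

perm = [0, 3, 1, 2]                            # a non-contiguous reordering
P_shuffled = P[np.ix_(perm, perm)]

print(dispersion_index(P), dispersion_index(P_shuffled))
```

For this toy matrix the contiguous order gives S = −200 and the shuffled order gives S = 100, so the better-structured triangle has the lower index.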

Party positions: + positive, − negative, ? missing opinion or abstention

Columns (parties): CDU/CSU, SPD, DIE LINKE, GRÜNE, FDP, AfD, PIRATEN, NPD, FREIE WÄHLER, Tierschutzpartei, ÖDP, REP, Die PARTEI, pro Deutschland, BP, Volksabstimmung, RENTNER, Partei der Vernunft, MLPD, PBC, BIG, BüSo, DIE FRAUEN, Nichtwähler, Bündnis 21/RRP, DIE VIOLETTEN, FAMILIE, PSG

Rows: question number

1

− + + + − − + + − + + − + + − + + − + + + + + ? + + + +

2

+ − − − ? − − + − − + + − ?

3

− − + + − − ? − − + + − − − − ?

4

+ + + + + − + − + + + − + − − + + − ? + + − ? + + + + −

5

− + + ?

6

+ ? − − − − − ? + − − + − + − + + − − + − − ? ? + − + −

7

− − ? ?

− − + − − + − − + − − − − − ? − + − ? ? − + − +

8

− ? ? ?

− − + + − + + − + + − + − − − − − − + ? + + + −

9

+ + + + + + + − + + + − + − + ?

+ − + ? + + + ? + + + +

10

− + + + − − ? + − + + − + − − ?

+ − + − + − + − + + + +

11

− − + − − − − + − − − ? − + − ?

− + + − − + + − − ? − +

12

− − + + − ? + + − + + − ? − + ?

− − + ? + − + ? + + + ?

13

+ − − − ? + − + + − + + − + + + − − − + + + − ? + + + −

14

− − ? − − − − + − − − − + − − − − − + − − − + − − − − +

15

− + + + ? − + − − ?

16

− + + + − + + + + + + + + + + + + − + + + + + ? + + + +

17

+ + + + + + + − + + + + + + − + + − ? + + + ? + + + + −

18

− ? + + + − + + + − + − + + + − − − − + − − + ? − + + +

19

− − − − − − − + − − − + − + − − − − − − − ? − − − − ? −

20

? + + + − − ? − − + − − + − − − − − ? − ? ? + ? − + + −

21

? − − − − + ? ? ?

22

− + + − ? − ? + + + − − + + + + + − + − + + ? ? + + ? +

23

+ + + + + ? + − + + + − + − − ?

? − + − + + + ? ? + + +

24

− − + ?

− − + ? − ? + + − + + +

25

+ ? − − + + − + + + + + + + + + + − − + − + − ? + − ? −

26

− + + + ? − ? − − − ?

27

? + + + − + + + − + + + + + + + + − + − + + + + + + + +

28

− ? + + − − + + + + + − + + + + + − + + + − + + + + + +

29

+ + − − + + − ? + ?

30

+ + − ?

31

− + + + − − + + − − + − ? + − + + − + − + − + ? + − + +

32

+ ? − − + + − + + + + + ? + + + + + − + − + − ? + + + −

33

− + + + + − + − ?

34

− − + + + + + + − + + + + + + + + + + − + + + + + + + +

+ + − − − + − + − − − + + − − − + − − − + ? − + + +

− − ? + + + − − + + + + + − ? + + − + + + + + +

+ − + − − ?

− ?

+ − ?

+ ?

− − + + − + + ? + + ? ?

? − + ? + + + ? ? + + +

− + − ? − − − ? − − + −

− − − − + − − − − + − ? ? ? + − −

− + − − + − + + − + − − − ? + − − −

? + − ? + − + − − + + ?

+ ?

− − − + + ? − ? − − − −

− + − − + + + + − − − + + + + ? +


35

+ + + + − − + + + + + − + − + + + − + − + + + ? + + + +

36

+ − − − − + − + − + ?

37

? − − ?

38

− + + + + + + + + + + + + + + + − + + + + + + + + + + +

+ − + + ?

+ − − + − ? − − − − − −

− ? − − − − − + + − + − ? − − − − − ? ? − − − −

Table 14.1. Party positions on policy questions

[Figure: bottom-left triangle of the 28 × 28 proximity matrix, with rows and columns labeled by the parties numbered 1 to 28. Cells are shaded by correlation band: ρ < −60, −60 ≤ ρ < −40 and −40 ≤ ρ < −20 (each statistically significant, P ≤ 0.10); −20 ≤ ρ < 0 and 0 ≤ ρ < 20 (or statistically non-significant, P > 0.10); 20 ≤ ρ < 40, 40 ≤ ρ < 60 and ρ ≥ 60 (each statistically significant, P ≤ 0.10).]

Figure 14.1. Proximity triangle (correlations ρij in %) for the parties ordered by votes received

If the dispersion is low, i.e. “heavy” elements are located along the diagonal and “light” elements are concentrated in the bottom-left corner of the triangle, then the index S is low, and vice versa. Here, the large S = 9,857 indicates that the party


ordering is not contiguous and, consequently, cannot be regarded as accurately representing the German political spectrum.

Thus, we have at our disposal the (28 × 28)-matrix of pairwise correlations between the 28 party policy profiles, regarded as the degree of proximity between them. To visualize the German political spectrum, we must locate the parties in a low-dimensional space, retaining their proximity relations as accurately as possible. Using PCA, we perform dimensionality reduction; for an introduction to PCA, see Hyun et al. (2009), Jackson (1988), Krzanowski (1988) and Seber (1984). As it is based on linear transformations, the PCA approximates a “cloud of observations”, given as vectors in an n-dimensional space, by an ellipsoid whose first diameter is directed along the observations’ maximal variance, whose second diameter is directed along the second maximal variance, etc. The largest diameters are the principal components, that is, the most important dimensions that “explain” most of the variance, and other dimensions are omitted without much loss of information. These new axes are linear combinations of the initial axes and are interpreted either as composite factors or just as a geometric characteristic of the set of observations.

For example, making a 2D map of a relatively small country that is actually located on a 3D globe requires reducing three dimensions to two. For this purpose, the least significant dimension associated with the earth’s curvature is omitted and only the North-South and East-West directions (explained by two principal components) are retained. For Chile, which is a North-South strip 4,300 km long and on average 180 km wide, the second principal component would be associated with the earth’s curvature rather than with the East-West dimension. Then, the Chile map based on the first and second components would look like an arc (the side view of Chile on the globe) instead of the usual bird’s-eye view.
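The Chile example can be checked numerically with a rough sketch. The assumptions are mine, not the chapter's: a spherical Earth of radius 6,371 km, a strip of 4,300 km by 180 km sampled uniformly at random, and a strip centred on latitude 0 and longitude 0 (PCA is unaffected by rotating the globe, so the actual location of Chile does not matter here).

```python
import numpy as np

R = 6371.0                      # mean Earth radius in km (assumed value)
rng = np.random.default_rng(0)
n = 2000

# Strip 4,300 km long and 180 km wide, centred on latitude 0, longitude 0.
lat = rng.uniform(-0.5, 0.5, n) * 4300.0 / R   # radians
lon = rng.uniform(-0.5, 0.5, n) * 180.0 / R    # radians

xyz = np.column_stack([
    R * np.cos(lat) * np.cos(lon),   # axis 0: "up" at the strip centre (curvature)
    R * np.cos(lat) * np.sin(lon),   # axis 1: east-west
    R * np.sin(lat),                 # axis 2: north-south
])

# PCA: eigenvectors of the covariance matrix, largest eigenvalue first.
eigval, eigvec = np.linalg.eigh(np.cov(xyz, rowvar=False))
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

axes = ["up (curvature)", "east-west", "north-south"]
for k in range(3):
    dominant = axes[int(np.argmax(np.abs(eigvec[:, k])))]
    print(f"component {k + 1}: {dominant}, share {eigval[k] / eigval.sum():.4f}")
```

Under these assumptions the first component follows the North-South direction, the second follows the curvature direction, and the East-West direction comes only third, which matches the arc-shaped map described in the text.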
Note, however, that making a map with n reference points, e.g. cities, is not that straightforward. If a (3 × n)-table of the cities’ 3D coordinates is reduced to two dimensions, the priority is given to optimally reducing the dimensionality of the set of coordinates. The accuracy of the airline distances between cities is secondary, and it degrades from the map’s center to its edges, where the earth’s curvature manifests itself most. In critical cases, as in the case of Chile, the formally correct result becomes useless. To avoid this shortcoming, the (n × n)-table of intercity airline distances should be considered instead of the cities’ 3D coordinates. Its reduction to two dimensions results in a uniformly accurate approximation of the distances and the map looks normal.

To construct a contiguous party arrangement respecting the proximity of policy profiles, we follow Friendly (2002) and Friendly and Kwan (2003); the approach is the same as making a map of Chile with respect to the distances between major


cities. First, we apply PCA to the (28 × 28) proximity matrix, whose bottom-left half is shown in Figure 14.1. In our model, this matrix plays the same role as the matrix of distances between cities in the above example, whereas the (38 × 28)-matrix in Table 14.1 is analogous to the set of cities’ 3D coordinates. The PCA finds 28 eigenvectors (orthogonal principal components) ek , k = 1, . . . , 28, of the covariance matrix of the proximity matrix and orders them by decreasing eigenvalues. The vectors of a party’s proximity to other parties pj = {pij }, j = 1, . . . , 28 (columns of the proximity matrix) are represented in this new orthogonal basis by the coordinates {ekj }. Their projections onto the plane of the first two principal components, which cover 83.4% of the total variance, are displayed in Figure 14.2.
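The projection step can be sketched as follows. This is a simplified illustration rather than the chapter's exact procedure: it decomposes the symmetric proximity matrix directly, in the spirit of an eigenvector plot of a correlation matrix, and the function name `principal_plane` and the toy horseshoe data are assumptions.

```python
import numpy as np

def principal_plane(P):
    """Eigen-decompose a symmetric proximity matrix, largest eigenvalue first.

    Returns the first two eigenvectors (one row per party) and the share of
    the total variance carried by each component.
    """
    P = np.asarray(P, dtype=float)
    eigval, eigvec = np.linalg.eigh(P)        # eigh returns ascending eigenvalues
    order = np.argsort(eigval)[::-1]          # reorder to descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    return eigvec[:, :2], eigval / eigval.sum()

# Toy data (hypothetical): 8 "parties" placed on a horseshoe, with proximity
# equal to the cosine of the angle between them.
theta = np.linspace(0.2 * np.pi, 1.8 * np.pi, 8)
P = np.cos(theta[:, None] - theta[None, :])

coords, share = principal_plane(P)
print(np.round(share[:3], 3))   # the first two components carry all the variance
```

For this idealized horseshoe the first two components carry the whole variance; for the real party data the text reports 83.4%.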

[Figure: party vectors plotted in the plane of the first component (71.61% of the variance) and the second component (11.75% of the variance)]

Figure 14.2. Principal components analysis solution: eigenvector plot for the correlation matrix of the party profiles


The proximity of party profiles is approximated by the cosine of angles αj between party vectors in this plane, where

$$\alpha_j = \begin{cases} \arctan\left(\dfrac{e_{2j}}{e_{1j}}\right) & \text{if } e_{1j} > 0, \\[4pt] \arctan\left(\dfrac{e_{2j}}{e_{1j}}\right) + \pi & \text{otherwise.} \end{cases}$$

Thereby, we obtain a circular ordering, in which neighboring parties have close policy profiles. This circular ordering can be unfolded to a linear one by splitting it at the greatest angle, which lies between the vectors of the far-left Trotskyist party, the PSG, and the far-right nationalist party, the NPD. Going clockwise, we obtain a plausible left-right party ordering:

– PSG: Partei für Soziale Gleichheit, Sektion der Vierten Internationale (Party for Social Equality, Section of the Fourth International) founded in 1997, a Trotskyist party;
– MLPD: Marxistisch-Leninistische Partei Deutschlands (Marxist-Leninist Party of Germany) founded in 1982, an anti-revisionist party, referring to Marx, Engels, Lenin, Stalin and Mao Zedong;
– DIE FRAUEN: feministische Partei DIE FRAUEN (Feminist Party) founded in 1995, promoting the rights of women;
– DIE LINKE (The Left) founded in 2007 as the merger of East German communists and the Electoral Alternative for Labor and Social Justice (WASG), a left-wing breakaway from the SPD;
– Die PARTEI: Partei für Arbeit, Rechtsstaat, Tierschutz, Eliteförderung und basisdemokratische Initiative (Party for Work, Rule-of-Law, Protection of Animals, Advancement of Elites, and Grassroot-Democratic Initiative) founded in 2004, a left populist party with totalitarian trends;
– DIE VIOLETTEN (The Violet for spiritual Policy) founded in 2001, claiming to represent “alternative spiritual politics in the new age”;
– GRÜNE: BÜNDNIS 90/DIE GRÜNEN (Alliance 90/The Greens) founded in 1993 as the merger of DIE GRÜNEN (West Germany) and BÜNDNIS 90 (East Germany), both with a social-democratic background;
– PIRATEN: Piratenpartei Deutschland (Pirate Party of Germany) founded in 2006, a part of the international Pirate movement promoting the information society with free access to all digital media;
– BIG: Bündnis für Innovation und Gerechtigkeit (Alliance for Innovation and Justice) founded in 2010, a party of Muslims promoting their integration;
– Tierschutzpartei: Mensch Umwelt Tierschutz (Human Environment Animal Welfare) founded in 1993, a party promoting the introduction of animal rights into the German constitution;


– FAMILIE: Familien-Partei Deutschlands (The Family Party of Germany) founded in 1983, a party promoting family values;
– Nichtwähler (Party of Non-Voters) founded in 1998, a party with a social-democratic background that promotes improving representative democracy by introducing elements of direct democracy;
– SPD: Sozialdemokratische Partei Deutschlands (Social Democratic Party) founded in 1863;
– ÖDP: Ökologisch-Demokratische Partei (Ecological Democratic Party) founded in 1982, a conservative environmentalist party;
– Bündnis 21/RRP: Bündnis 21/Rentnerinnen- und Rentner-Partei (Alliance 21/Female and Male Pensioner Party) founded in 2007, promoting improvements to the pension, health and education systems;
– Volksabstimmung (Referendum Party) founded in 1997, a party promoting Swiss-style direct democracy;
– RENTNER (German Party of Pensioners) founded in 2002, a social-welfare-state party bridging the interests of generations;
– FDP: Freie Demokratische Partei (Free Democratic Party) founded in 1948, a liberal political party close to employers’ organizations;
– FREIE WÄHLER (Free Voters) founded in 2009, a party opposing the EU financial policy;
– PBC: Partei Bibeltreuer Christen (Party of Bible-abiding Christians) founded in 1989, a conservative evangelical party, opposing antisemitism, same-sex marriage and abortion;
– CDU/CSU: union of Germany’s two main conservative parties, Christlich Demokratische Union Deutschlands (Christian Democratic Union of Germany) founded in 1950 and Christlich-Soziale Union in Bayern (Christian Social Union of Bavaria) founded in 1945;
– AfD: Alternative für Deutschland (Alternative for Germany) founded in 2013, a conservative, euro-currency-sceptic party;
– BüSo: Bürgerrechtsbewegung Solidarität (Civil Rights Movement Solidarity) founded in 1992, part of the worldwide LaRouche (U.S. politician) Youth movement, with a republican orientation but promoting worldwide solidarity, e.g. abolishing the debts of the Third World;
– BP: Bayernpartei (Bavaria Party) founded in 1946, a separatist Bavarian party advocating Bavarian independence within the European Union;


– REP: Die Republikaner (The Republicans) founded in 1983, a national conservative party opposing immigration;
– pro Deutschland: Bürgerbewegung pro Deutschland (Pro Germany Citizens’ Movement) founded in 2005, a far-right populist party opposing illegal immigration, multinational corporations and financial institutions;
– Partei der Vernunft (Party of Reason) founded in 2009, a right-liberal party promoting the ideas of the Austrian School of economics: minimal state, free market, decentralization of political power and subsidiarity;
– NPD: Nationaldemokratische Partei Deutschlands (National Democratic Party of Germany) founded in 1964, a far-right German nationalist party.
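The unfolding step described before the party list (an angle for each party vector in the plane of the first two components, a cut at the greatest angular gap, then a clockwise reading) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names `party_angles` and `unfold_circular_order` are hypothetical, the inputs are assumed to be the two coordinates e1j and e2j of each party, and `arctan2` is used as a stand-in that agrees with the formula for αj whenever e1j is nonzero. Because the signs of principal axes are arbitrary, which end of the resulting list is "left" has to be checked against the data.

```python
import numpy as np

def party_angles(e1, e2):
    """Angles alpha_j in [-pi/2, 3*pi/2), matching the piecewise arctan formula."""
    alpha = np.arctan2(e2, e1)  # values in (-pi, pi]
    # Third-quadrant vectors: arctan(e2/e1) + pi equals arctan2 + 2*pi there.
    return np.where(alpha < -np.pi / 2, alpha + 2 * np.pi, alpha)

def unfold_circular_order(names, e1, e2):
    """Cut the circular ordering at the largest angular gap and read it clockwise."""
    alpha = party_angles(np.asarray(e1, dtype=float), np.asarray(e2, dtype=float))
    order = np.argsort(alpha)                  # counter-clockwise circular order
    sorted_alpha = alpha[order]
    # Gaps between consecutive angles, including the wrap-around gap.
    gaps = np.diff(np.append(sorted_alpha, sorted_alpha[0] + 2 * np.pi))
    cut = int(np.argmax(gaps))                 # largest gap follows position `cut`
    ccw = np.roll(order, -(cut + 1))           # linear order starting after the cut
    return [names[i] for i in ccw[::-1]]       # reversed: clockwise reading

# Toy example with four hypothetical parties placed at 10, 100, 170 and 60 degrees.
deg = np.radians([10, 100, 170, 60])
print(unfold_circular_order(["D", "B", "A", "C"], np.cos(deg), np.sin(deg)))
```

In the toy example the largest gap (200 degrees) lies between the vectors at 170 and 10 degrees, so the clockwise reading starts at the party at 170 degrees and ends at the one at 10 degrees.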

The correlation triangle with the new party ordering is shown in Figure 14.3. It has the desired ridge of dark correlation peaks of neighboring parties along the diagonal, pale low-correlation cells of more distant parties, then a dark band of opposing parties and, finally, the pale bottom-left vertex, indicating that the far-left and far-right parties have something in common. The index S = −3,592 in Figure 14.3, compared with S = 9,857 in Figure 14.1, confirms a much better contiguity of this party ordering than of the initial one based on the votes received in the 2013 election.

[Figure 14.3: image not reproduced. Rows and columns are the 28 parties, numbered from 1 (PSG) to 28 (NPD). Cell shading by correlation ρ (in %): ρ < −60, −60 ≤ ρ < −40 and −40 ≤ ρ < −20, each if statistically significant (P ≤ 0.10); −20 ≤ ρ < 0 and 0 ≤ ρ < 20, or statistically non-significant (P > 0.10); 20 ≤ ρ < 40, 40 ≤ ρ < 60 and ρ ≥ 60, each if statistically significant (P ≤ 0.10).]

Figure 14.3. Proximity triangle (correlations ρij in %) for the parties ordered with the 2D PCA


14.3. Conclusions

The circular ordering of 28 German parties, obtained purely formally with the 2D PCA without any normative assumption, highlights the left-right ideological axis rolled into an incomplete circumference, or a horseshoe-shaped body. To be precise, the German political spectrum consists of party vectors located approximately along the equator of a multidimensional unit sphere. Due to minor deviations of these vectors from the circular axis, the spectrum assumes some volume, looking like an unfastened belt encircling the sphere. We conclude with three implications:

1) Geometric implication: The German political spectrum can be regarded as approximately one-dimensional in the topological sense, but its internal proximity relations require at least two Euclidean dimensions to adequately reflect its circularity.

2) Electoral implication: The approximate one-dimensionality of the political spectrum, as a precondition of consistent elections in Duncan Black’s setting of single-peaked preferences along some common ordering of candidates, can explain, at least partially, why voting paradoxes are not observed in real-world elections as frequently as the theory predicts.

3) Political implication: Finally, the “objectively revealed” left-right scale calls into question the assertion that the left-right characterization of parties is outdated. That assertion, which removes class opposition from the political agenda, argues for the non-antagonistic nature of modern Western capitalism. Through this work, we disagree, even if indirectly, with the apologetics of the current system.

List of Authors

Rafik ABDESSELAM, University of Lyon, France
Pilar BACA, Department of Dentistry, University of Granada, Spain
Tomáš BACIGÁL, Slovak University of Technology, Bratislava, Slovakia
James R. BOZEMAN, American University of Malta, Bormla, Malta
Manuela CAZZARO, University of Milano-Bicocca, Italy
Cinzia COLAPINTO, Ca’ Foscari University of Venice, Italy; Nazarbayev University, Astana, Kazakhstan
António Augusto COSTA, ECEO, Universidade Lusófona de Humanidades e Tecnologias, Lisbon, Portugal
Franca CRIPPA, University of Milano-Bicocca, Italy
Yiannis DIMOTIKALIS, Technological Education Institute of Crete, Greece
Christopher ENGSTRÖM, The School of Education, Culture and Communication, Mälardalen University, Västerås, Sweden
Manuel ESCABIAS, Department of Statistics and Operations Research, University of Granada, Spain

Data Analysis and Applications 2: Utilization of Results in Europe and Other Topics, First Edition. Edited by Christos H. Skiadas and James R. Bozeman. © ISTE Ltd 2019. Published by ISTE Ltd and John Wiley & Sons, Inc.


Nikolaos FARMAKIS, Aristotle University of Thessaloniki, Greece
Valérie GIRARDIN, Université de Caen Normandie, France
Evaristo JIMÉNEZ-CONTRERAS, Department of Library and Information Sciences, University of Granada, Spain
Maria KARAMESSINI, Panteion University of Social and Political Sciences, Athens, Greece
Eleni KETZAKI, Department of Mathematics, Aristotle University of Thessaloniki, Greece
Jozef KOMORNÍK, Comenius University, Bratislava, Slovakia
Magda KOMORNÍKOVÁ, Slovak University of Technology, Bratislava, Slovakia
Justine LEQUESNE, Centre Henri Becquerel, Université de Caen Normandie, France
Ana LORGA DA SILVA, ECEO-FCSEA, Universidade Lusófona de Humanidades e Tecnologias, Lisbon, Portugal
Paolo MARIANI, University of Milano-Bicocca, Italy
Federica NICOLUSSI, University of Milano-Bicocca, Italy
Stavros RALLAKIS, Interdepartmental Programme of Postgraduate Studies in Business Administration, University of Macedonia, Thessaloniki, Greece
Cátia ROSÁRIO, ECEO, Universidade Lusófona de Humanidades e Tecnologias, Lisbon
Gilbert SAPORTA, Conservatoire National des Arts et Métiers, Paris, France
Eftichios SARTZETAKIS, Department of Economics, University of Macedonia, Thessaloniki, Greece
Sergei SILVESTROV, The School of Education, Culture and Communication, Mälardalen University, Västerås, Sweden
Christos H. SKIADAS, ManLab, Technical University of Crete, Chania, Greece


Glykeria STAMATOPOULOU, Panteion University of Social and Political Sciences, Athens, Greece
Maria SYMEONAKI, Panteion University of Social and Political Sciences, Athens, Greece
Andranik TANGIAN (Melik-Tangyan), Institute of Economic Theory, Karlsruhe Institute of Technology, Germany
Olivier THÉVENON, OCDE-INED, Paris, France
Pilar VALDERRAMA, School of Communication and Documentation, University of Granada, Spain
Mariano J. VALDERRAMA, Department of Statistics and Operations Research, University of Granada, Spain
Mariangela ZENGA, University of Milano-Bicocca, Italy


Index

A, C
adjacency matrix, 167, 169–173, 176
analysis of information, 149, 151, 152, 160, 162
ARIMA-GARCH filter, 37, 42, 43
climate agreements, 49
CO2 emissions, 51, 53
context-specific independence, 3
correlation, 37, 42–46
credit, 91–93, 97, 105
customer experience, 59

D, E
dissimilarity indices, 49, 50
economic crisis, 107
economy, 92, 105
employment, 121–127, 130, 133, 135, 139, 140, 145
entropy, 59, 60, 62–86
  maximization, 59, 60, 62, 63, 65–70, 72–77, 79, 86
environmental inequality, 49, 50
Europe, 121, 122, 125, 126, 130, 135
EU-SILC, 121–134
exploratory data analysis, xiii

F, G
financial crisis, 91, 92, 105
five-star rating, 59–62, 65, 76, 83, 86
GDP, 37, 42, 44, 45
gender, 107, 108, 111, 113, 119
Germany, 193, 196, 197, 203–205
Gini index, 51, 52, 54–56
graduate, 139, 140, 141, 144–146
graph partitioning, 179
graphical models, 6

H, I
higher education institution (HEI), 139
H-index, 30
innovation, 3, 4, 7–9, 11, 12, 15–26

J, L
job placement office, 139–142, 144
journal citation report, 29
labor market, 121–125, 130, 135
Likert scale, 59
logit regression, 29, 30, 32, 34, 35

M
machine learning, xiii
Markov system, 107–110
maximum entropy, 59, 60, 66, 71, 73, 80, 82
mobility index (indices), 110, 111
multiplicative model, 149, 151, 155, 157, 160, 162

O, P
ordinal variables, 3, 4, 12
PageRank, 179–191
panel data, 15, 16
perfect discrimination measure, 167
personalization vector, 179, 180, 182–185, 190
prediction, xiii, xxii
Programme for International Student Assessment, 149
proximity
  measure, 167–169, 171–177
  triangle, 198, 200, 205

R, S, T
recruitment, 139–141, 144, 145
regression model, 29, 30, 34, 35
removing edges, 184, 185, 191
small- and medium-sized enterprise (SME), 91
Southern Europe, 107, 108, 111
topological discriminant analysis, 167, 168, 171, 175

V, W, Z
variation of score, 149, 160
Vine copula, 37, 39, 41, 42, 44–47
Wahl-O-Mat, 193–196
Zighera’s method, 149, 155–157, 162, 163
