Advances in Data Science: Symbolic, Complex, and Network Data 1786305763, 9781786305763

Data science unifies statistics, data analysis and machine learning to achieve a better understanding of the masses of d


English Pages 258 [245] Year 2020


Advances in Data Science

Big Data, Artificial Intelligence and Data Analysis Set coordinated by Jacques Janssen

Volume 4

Advances in Data Science Symbolic, Complex and Network Data

Edited by

Edwin Diday Rong Guan Gilbert Saporta Huiwen Wang

First published 2020 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address: ISTE Ltd 27-37 St George’s Road London SW19 4EU UK

John Wiley & Sons, Inc. 111 River Street Hoboken, NJ 07030 USA

www.iste.co.uk

www.wiley.com

© ISTE Ltd 2020 The rights of Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. Library of Congress Control Number: 2019951813 British Library Cataloguing-in-Publication Data A CIP record for this book is available from the British Library ISBN 978-1-78630-576-3

Contents

Preface

Part 1. Symbolic Data

Chapter 1. Explanatory Tools for Machine Learning in the Symbolic Data Analysis Framework
Edwin DIDAY
1.1. Introduction
1.2. Introduction to Symbolic Data Analysis
1.2.1. What are complex data?
1.2.2. What are “classes” and “class of complex data”?
1.2.3. Which kind of class variability?
1.2.4. What are “symbolic variables” and “symbolic data tables”?
1.2.5. Symbolic Data Analysis (SDA)
1.3. Symbolic data tables from Dynamic Clustering Method and EM
1.3.1. The “dynamical clustering method” (DCM)
1.3.2. Examples of DCM applications
1.3.3. Clustering methods by mixture decomposition
1.3.4. Symbolic data tables from clustering
1.3.5. A general way to compare results of clustering methods by the “explanatory power” of their associated symbolic data table
1.3.6. Quality criteria of classes and variables based on the cells of the symbolic data table containing intervals or inferred distributions
1.4. Criteria for ranking individuals, classes and their bar chart descriptive symbolic variables
1.4.1. A theoretical framework for SDA
1.4.2. Characterization of a category and a class by a measure of discordance

1.4.3. Link between a characterization by the criteria W and the standard Tf-Idf
1.4.4. Ranking the individuals, the symbolic variables and the classes of a bar chart symbolic data table
1.5. Two directions of research
1.5.1. Parametrization of concordance and discordance criteria
1.5.2. Improving the explanatory power of any machine learning tool by a filtering process
1.6. Conclusion
1.7. References

Chapter 2. Likelihood in the Symbolic Context
Richard EMILION and Edwin DIDAY
2.1. Introduction
2.2. Probabilistic setting
2.2.1. Description variable and class variable
2.2.2. Conditional distributions
2.2.3. Symbolic variables
2.2.4. Examples
2.2.5. Probability measures on (C, C), likelihood
2.3. Parametric models for p = 1
2.3.1. LDA model
2.3.2. BLS method
2.3.3. Interval-valued variables
2.3.4. Probability vectors and histogram-valued variables
2.4. Nonparametric estimation for p = 1
2.4.1. Multihistograms and multivariate polygons
2.4.2. Dirichlet kernel mixtures
2.4.3. Dirichlet Process Mixture (DPM)
2.5. Density models for p ≥ 2
2.6. Conclusion
2.7. References

Chapter 3. Dimension Reduction and Visualization of Symbolic Interval-Valued Data Using Sliced Inverse Regression
Han-Ming WU, Chiun-How KAO and Chun-houh CHEN
3.1. Introduction
3.2. PCA for interval-valued data and the sliced inverse regression
3.2.1. PCA for interval-valued data
3.2.2. Classic SIR
3.3. SIR for interval-valued data
3.3.1. Quantification approaches
3.3.2. Distributional approaches
3.4. Projections and visualization in DR subspace
3.4.1. Linear combinations of intervals
3.4.2. The graphical representation of the projected intervals in the 2D DR subspace
3.5. Some computational issues
3.5.1. Standardization of interval-valued data
3.5.2. The slicing schemes for iSIR
3.5.3. The evaluation of DR components
3.6. Simulation studies
3.6.1. Scenario 1: aggregated data
3.6.2. Scenario 2: data based on interval arithmetic
3.6.3. Results
3.7. A real data example: face recognition data
3.8. Conclusion and discussion
3.9. References

Chapter 4. On the “Complexity” of Social Reality. Some Reflections About the Use of Symbolic Data Analysis in Social Sciences
Frédéric LEBARON
4.1. Introduction
4.2. Social sciences facing “complexity”
4.2.1. The total social fact, a designation of “complexity” in social sciences
4.2.2. Two families of answers
4.2.3. The contemporary deepening of the two approaches, “reductionist” and “encompassing”
4.2.4. Issues of scale and heterogeneity
4.3. Symbolic data analysis in the social sciences: an example
4.3.1. Symbolic data analysis
4.3.2. An exploratory case study on European data
4.3.3. A sociological interpretation
4.4. Conclusion
4.5. References

Part 2. Complex Data

Chapter 5. A Spatial Dependence Measure and Prediction of Georeferenced Data Streams Summarized by Histograms
Rosanna VERDE and Antonio BALZANELLA
5.1. Introduction
5.2. Processing setup
5.3. Main definitions
5.4. Online summarization of a data stream through CluStream for Histogram data
5.5. Spatial dependence monitoring: a variogram for histogram data
5.6. Ordinary kriging for histogram data
5.7. Experimental results on real data
5.8. Conclusion
5.9. References

Chapter 6. Incremental Calculation Framework for Complex Data
Huiwen WANG, Yuan WEI and Siyang WANG
6.1. Introduction
6.2. Basic data
6.2.1. The basic data space
6.2.2. Sample covariance matrix
6.3. Incremental calculation of complex data
6.3.1. Transformation of complex data
6.3.2. Online decomposition of covariance matrix
6.3.3. Adopted algorithms
6.4. Simulation studies
6.4.1. Functional linear regression
6.4.2. Compositional PCA
6.5. Conclusion
6.6. Acknowledgment
6.7. References

Part 3. Network Data

Chapter 7. Recommender Systems and Attributed Networks
Françoise FOGELMAN-SOULIÉ, Lanxiang MEI, Jianyu ZHANG, Yiming LI, Wen GE, Yinglan LI and Qiaofei YE
7.1. Introduction
7.2. Recommender systems
7.2.1. Data used
7.2.2. Model-based collaborative filtering
7.2.3. Neighborhood-based collaborative filtering
7.2.4. Hybrid models
7.3. Social networks
7.3.1. Non-independence
7.3.2. Definition of a social network
7.3.3. Properties of social networks
7.3.4. Bipartite networks
7.3.5. Multilayer networks
7.4. Using social networks for recommendation
7.4.1. Social filtering
7.4.2. Extension to use attributes
7.4.3. Remarks
7.5. Experiments
7.5.1. Performance evaluation
7.5.2. Datasets
7.5.3. Analysis of one-mode projected networks
7.5.4. Models evaluated
7.5.5. Results
7.6. Perspectives
7.7. References

Chapter 8. Attributed Networks Partitioning Based on Modularity Optimization
David COMBE, Christine LARGERON, Baptiste JEUDY, Françoise FOGELMAN-SOULIÉ and Jing WANG
8.1. Introduction
8.2. Related work
8.3. Inertia based modularity
8.4. I-Louvain
8.5. Incremental computation of the modularity gain
8.6. Evaluation of I-Louvain method
8.6.1. Performance of I-Louvain on artificial datasets
8.6.2. Run-time of I-Louvain
8.7. Conclusion
8.8. References

Part 4. Clustering

Chapter 9. A Novel Clustering Method with Automatic Weighting of Tables and Variables
Rodrigo C. DE ARAÚJO, Francisco DE ASSIS TENORIO DE CARVALHO and Yves LECHEVALLIER
9.1. Introduction
9.2. Related Work
9.3. Definitions, notations and objective
9.3.1. Choice of distances
9.3.2. Criterion W measures the homogeneity of the partition P on the set of tables
9.3.3. Optimization of the criterion W
9.4. Hard clustering with automated weighting of tables and variables
9.4.1. Clustering algorithms MND–W and MND–WT
9.5. Applications: UCI data sets
9.5.1. Application I: Iris plant
9.5.2. Application II: multi-features dataset
9.6. Conclusion
9.7. References

Chapter 10. Clustering and Generalized ANOVA for Symbolic Data Constructed from Open Data
Simona KORENJAK-ČERNE, Nataša KEJŽAR and Vladimir BATAGELJ
10.1. Introduction
10.2. Data description based on discrete (membership) distributions
10.3. Clustering
10.3.1. TIMSS – study of teaching approaches
10.3.2. Clustering countries based on age–sex distributions of their populations
10.4. Generalized ANOVA
10.5. Conclusion
10.6. References

List of Authors

Index

Preface

This book contains a selection of papers presented at two recent international workshops devoted to progress in the analysis of complex data. The first workshop, ADS’16, short for Advances in Data Science, was held in October 2016 at Beihang University, Beijing, China, at the initiative of Professor Huiwen Wang. The second workshop, entitled Data Science: New Data and Classes, was held a few months later, in January 2017, at Paris-Dauphine University, Paris, France, at the invitation of Professor Edwin Diday. Several members of the Scientific Committees and participants were common to both. Each workshop gathered about 50 participants by invitation only.

After the workshops, we decided that some of the papers presented deserved to be made available to a wider audience, and we asked the authors to prepare revised versions of their papers. Most of them agreed, and the 10 papers collected in this volume were blind-reviewed by referees, revised, and finally edited. The papers are grouped into four sections: symbolic data, complex data, network data, and clustering.

For their dedication, we thank Paula Brito, Francisco de A.T. de Carvalho, Jie Gu, George Hébrail, Yves Lechevallier, Wen Long, Monique Noirhomme, Francesco Palumbo, Ming Ye, and Jichang Zhao.

We would also like to thank the sponsors of both meetings:

– ADS’16, Beijing: School of Economics and Management, and the Complex Data Analysis Research Center of Beihang University, School of Statistics and Mathematics of Central University of Finance and Economics. The Beijing workshop was financially supported by the NFSC Major International Joint Research Project (Grant number 71420107025), co-organized by Professor Huiwen Wang and Professor Gilbert Saporta.

– Data Science: New Data and Classes, Paris: Lamsade and Ceremade Labs of Paris-Dauphine University, the French Statistical Society (SfdS), the French Speaking Society for Classification (SFC), and the Society for Knowledge Discovery (EGC).

Edwin DIDAY
Rong GUAN
Gilbert SAPORTA
Huiwen WANG
October 2019

Part 1 Symbolic Data

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

1 Explanatory Tools for Machine Learning in the Symbolic Data Analysis Framework

The aim of this chapter is mainly to give explanatory tools for the understanding of standard, complex and big data. First, we recall some basic notions in Data Science: what are complex data? What are classes and classes of complex data? Which kinds of internal class variability can be considered? Then, we define “symbolic data” and “symbolic data tables”, which express the within-class variability, and we give some advantages of this kind of class description. Often in practice the classes are given. When they are not given, clustering can be used to build them, for example by the Dynamic Clustering Method (DCM), from which DCM regression, DCM canonical analysis, DCM mixture decomposition, and the like can be obtained. Aggregating the descriptions of these classes yields a symbolic data table. We say that the description of a class is much more explanatory when it is given by symbolic variables (closer to the natural language of the users) than by its usual analytical multidimensional description. The explanatory and characteristic power of classes can then be measured by criteria based on the symbolic data description of these classes, which induces a way of comparing clustering methods by their explanatory power. These criteria are defined in a Symbolic Data Analysis framework for categorical variables, based on three random variables defined on the ground population. Tools are then given for ranking individuals, classes and their symbolic descriptive variables from the most to the least characteristic. These characteristics are not only explanatory but can also express the concordance or the discordance of a class with the other classes. We suggest several directions of research, mainly on parametric aspects of these criteria and on improving the explanatory power of Machine Learning tools. We finally present the conclusion and the wide domain of potential applications in socio-demography, medicine, web security and so on.

Chapter written by Edwin DIDAY.

1.1. Introduction

A “Data Scientist” is someone who is able to extract new knowledge from Standard, Big and Complex Data. Here we consider complex data as data that cannot be expressed in terms of a standard data table, where units are described by quantitative and qualitative variables. Complex data arise with unstructured data, unpaired samples, and multisource data (such as mixtures of numerical, textual, image and social network data). The aggregation, fusion, and summarization of such data can be done into classes of row units that are considered as new units. Classes can be obtained by unsupervised learning, giving a concise and structured view of the data. In supervised learning, classes are used in order to provide efficient rules for the allocation of new units to a class. A third way is to consider classes as new units described by “symbolic” variables whose values are “symbols” such as intervals, probability distributions, weighted sequences of numbers or categories, functions, and the like, in order to express their within-class variability. For example, “Regions” express the variability of their inhabitants, “Companies” express the variability of their web intrusions, and “Species” express the variability of their specimens. One of the advantages of this approach is that unstructured data and unpaired samples at the level of row units become structured and paired at the level of classes (see section 1.2.4).

Three principles guide this chapter in conformity with the Data Science framework. First, new tools are needed to transform huge data bases intended for management into data bases usable by Data Science tools.
This transformation leads to the construction of new statistical units described by aggregated data in terms of symbols, as single-valued data are not suitable because they cannot incorporate the additional information on data structure available in symbolic data. Second, we work on the symbolic data as they are given in data bases and not as we wish they were given. For example, if the data contain intervals, we work on them even if the within-interval uniformity is statistically not satisfactory. Moreover, by considering Min–Max intervals, we can obtain useful knowledge, complementary to the one given without the uniformity assumption. Hence, we consider Min–Max or interquartile intervals in a setting where the aim is to extract useful knowledge from the data and not only to infer models (even if inferring models, as in standard statistics, can certainly give complementary knowledge). Third, we use marginal descriptions of classes by vectors of univariate symbols, rather than joint symbolic descriptions by multivariate symbols: 99% of the users would say that a joint distribution describing a class often contains too many low or 0 values and so has poor explanatory power in comparison with the marginal distributions describing the same class. For example, with 10 variables of 5 categories each, the joint multivariate distribution leads to a sparse symbolic data table where the classes are described by a unique bar chart symbolic variable value containing 5^10 (about 9.8 million) categories and taking, for each class, 5^10 low or 0 values. On the other hand, the 10 marginal bar chart symbolic variables describe the classes by vectors of 10 bar charts of 5 categories each (50 values in total), which are easy to interpret and to compare between classes. Nevertheless, a compromise can be obtained by considering joint instead of marginal distributions for the most dependent variables.

Symbolic Data Analysis (SDA) is an extension of standard data analysis and data mining to symbolic data. The theory and practice of SDA have been developed in several books [AFO 18a], [BIL 06], [BOC 00], [DID 08], many papers (see overviews in [BIL 03] and [DID 16]), and several international workshops (http://vladowiki.fmf.uni-lj.si/doku.php?id=sda:meet:pa18). Special issues related to SDA have been published, for example, in the RNTI journal, edited by Guan et al. [GUA 13], on Advances in Theory and Applications of High Dimensional and Symbolic Data Analysis; in the ADAC journal on SDA, edited by Brito et al. [BRI 16]; and in IEEE Transactions on Cybernetics [SU 16].

This chapter is organized into five sections. Section 1.2 aims to define symbolic data issued from the descriptions of classes of statistical units (called “individuals”) in order to take their internal variability into account. “Complex data”, “classes”, and “classes of complex data” are defined. The symbolic data appear in the cells of a “symbolic data table”, where the rows describe classes and the columns are associated with symbolic-valued variables. Some advantages of symbolic data are finally given in this section.
Section 1.3 is devoted to the case where the classes are not given, but built by a clustering process. We illustrate this case with two clustering tools: the Dynamic Clustering Method (DCM) and mixture decomposition with the Expectation–Maximization (EM) method. We present different variants of the DCM, which can lead to different kinds of clusters depending on the kind of cluster representative: regression, canonical analysis, distributions, and so on. Then, we show how to build a symbolic data table from the results of these clustering methods. Several criteria measuring the explanatory power of a symbolic data table are suggested. Consequently, the explanatory quality of clustering methods can be compared by these criteria.
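As a rough illustration of the alternating scheme behind dynamic clustering, the sketch below iterates an allocation step (each individual joins the class with the closest representative) and a representation step (each representative is recomputed from its class). It is only a minimal sketch under stated assumptions: the function names (`dynamic_clustering`, `mean_representative`, `squared_euclidean`), the toy points, and the initialization by the first k points are hypothetical choices made for the example and are not taken from the chapter. With the mean as representative, the scheme reduces to a k-means-like procedure; other representatives (regressions, distributions, and so on) would be plugged in through the `represent` and `distance` arguments.

```python
def dynamic_clustering(points, k, represent, distance, n_iter=100):
    """Alternate allocation and representation steps until the partition is stable.

    Illustrative assumption: representatives are initialized with the first k points.
    """
    representatives = list(points[:k])
    assignment = None
    for _ in range(n_iter):
        # Allocation step: assign each point to the closest representative.
        new_assignment = [
            min(range(k), key=lambda j: distance(p, representatives[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the partition no longer changes
        assignment = new_assignment
        # Representation step: recompute each non-empty class representative.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                representatives[j] = represent(members)
    return assignment, representatives


def mean_representative(members):
    """Representative of a class: the coordinate-wise mean (k-means-like variant)."""
    dim = len(members[0])
    return tuple(sum(m[d] for m in members) / len(members) for d in range(dim))


def squared_euclidean(p, q):
    """Squared Euclidean distance between a point and a representative."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Hypothetical toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels, representatives = dynamic_clustering(
    points, 2, mean_representative, squared_euclidean
)
print(labels)
print(representatives)
```

Passing the representation and distance functions as arguments is a design choice meant to mirror the idea that the kind of cluster representative determines the variant of the method.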


In section 1.4, our aim is to define other kinds of explanatory criteria in the case where the initial variables defined on the ground population are categorical. We introduce, in this case, a theoretical framework of SDA based on three random variables. From this framework, we define two kinds of bar chart. The first, called “fx(c)”, assigns to each category x its frequency in the class, and the second, called “gc, E (x)”, associates its frequency to each event E containing fx(c). These functions yield the characterization of pairs (category, class) by different kinds of criteria. We show that these criteria generalize to symbolic data the standard Tf-Idf widely used in text mining (see, for example, [ROB 04]). According to these criteria, the individuals, the classes, the symbolic variables and the symbolic data tables can be ranked from the most to the least characteristic.

Finally, in section 1.5, we suggest two directions of research. First, in this SDA framework, there are different possible parameterizations of the criteria expressed in terms of concordance or discordance of a class with the other classes. An interesting open question is to find under which conditions, when a sequence of partitions converges toward a trivial partition, such parametric criteria defined on classes converge toward a parametric distribution defined on Ω, since it is interesting and economical to go from distributions on classes to the distribution on the population (in the case of concordance or discordance). Second, as explaining for understanding is complementary to discriminating for learning, we suggest a filtering process that improves the explanatory power on a filtered sub-population without degrading the discriminating power of any machine learning tool.

1.2. Introduction to Symbolic Data Analysis

1.2.1. What are complex data?
By definition, “complex data” are any data set that cannot be considered as a “standard statistical units × standard variables” data table. This is the case when data are defined by several data tables with different statistical units, and different and unpaired variables coming from multiple sources, sometimes at multiple levels.

Example of complex data in Official Statistics: the units are REGIONS described by several data tables. For example, each region is described by a first data table where the units are hospitals and the variables are the size of the hospital, the number of patients during given periods, and so on. In a second data table, the units are schools described by the number of pupils, their results in their examinations, and so on. In a third data table, the units are inhabitants described by socio-demographic variables. More details on this example are given in section 1.2.4.

1.2.2. What are “classes” and “class of complex data”?

“Classes” are, as usual, subsets of any statistical set of units, for example: teams of football players, regions of inhabitants, levels of consumption in health insurance, and so on. By definition, a “class of complex data” is a vector of standard classes defined on different statistical spaces of units. For example, in Official Statistics, a Region can be considered as a class of complex data denoted as CR = (Ch, Cs, Ci), where Ch is the class of hospitals, Cs is the class of schools, and Ci is the class of inhabitants of this region.

1.2.3. Which kind of class variability?

Classes of statistical units (i.e. individuals) can express different kinds of variability, mainly based on place, time and individuals. Three kinds of variability often occur in practice. The first is the variability between several individuals when time and space are fixed, for example, the variability between the inhabitants of a region at a given period. The second is the variability of a single individual considered at different times and/or places, for example, the performance variability of a team player at different times and/or places. The third concerns the case where the individual and the time are fixed and the place varies inside the individual. This is the case for the variability of an individual between some of its parts, for example, the variability between the cracks of a cooling tower of a nuclear power plant (see [AFO 10]).

1.2.4. What are “symbolic variables” and “symbolic data tables”?

The first characteristic of “symbolic variables” is that they are defined on classes. Their second characteristic is that their values take into account the variability between the individuals inside these classes by means of “symbols” representing more than a single category or number. Hence, the standard operators on numbers cannot be applied to the values of these kinds of variables, so these values are not numerical: that is why they are called “symbolic” and are represented by “symbols” such as intervals, bar charts, and the like. A “symbolic data table” is a table where classes of individuals are described by at least one symbolic variable. Standard variables can also describe classes by considering the set of classes as a new set of units of a higher level.


Advances in Data Science

Table 1.1 shows an example of a symbolic data table. The statistical units of the ground population are players of French Cup teams, and the classes of players are the teams Paris, Lyon, Marseille, and Bordeaux. The variability of the players inside each team is expressed by the following symbolic variables: “Weight”, whose value is the [min, max] interval of the weights of the players of the associated team; “National Country”, whose value is the list of their nationalities; and “Age”, whose value is the bar chart of the frequencies of the players’ ages in the intervals [less than 20], ]20, 25], ]25, 30], and [more than 30], denoted respectively as (0), (1), (2), and (3) in Table 1.1. The symbolic variable “Age” is called a “bar chart variable” because the age intervals on which it is defined are the same for all the classes and can, therefore, be considered as categories. The last variable is numerical: its value for a team is the frequency of the French players of this team among all the French players of all the teams. Hence, this variable produces a vertical bar chart, in contrast with the horizontal bar chart values of the symbolic variable “Age” in Table 1.1. By adding, for the other nationalities, the same kind of column as for the French, we can obtain a new symbolic variable whose values are lists of numbers, where each number is the frequency of the players of a given nationality in a team among all the players having this nationality in all the teams. A team can also be described by standard numerical or categorical variables such as, for example, its expenses or the number of goals in a season.

French Cup teams | Weight | National Country | Age | Frequency of French among all French
Paris | [73, 85] | {France, Argentina, Senegal} | {(0) 30%, (1) 70%} | 30%
Lyon | [68, 90] | {France, Brazil, Italy} | {(0) 30%, (1) 65%, (2) 5%} | 25%
Marseille | [77, 85] | {France, Brazil, Algeria} | {(1) 40%, (2) 52%, (3) 8%} | 28%
Bordeaux | [80, 90] | {France, Argentina} | {(0) 40%, (1) 60%} | 17%

Table 1.1. An example of symbolic data table where teams of the French Cup are described by three symbolic variables of interval, sequence of categories, “horizontal” bar charts, and a numerical variable inducing a “vertical” bar chart
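The aggregation that leads from a ground data table to a table such as Table 1.1 can be sketched in a few lines. The example below is only an illustration under stated assumptions: the player records, the function names `age_bin` and `symbolic_table`, and the dictionary layout are hypothetical and do not come from the original study.

```python
from collections import Counter

# Hypothetical ground data: (team, weight in kg, nationality, age) per player.
players = [
    ("Paris", 73, "France", 19),
    ("Paris", 85, "Argentina", 23),
    ("Paris", 80, "Senegal", 24),
    ("Lyon", 68, "France", 18),
    ("Lyon", 90, "Brazil", 22),
    ("Lyon", 75, "France", 27),
]


def age_bin(age):
    """Map an age to the categories (0)..(3) used for the bar chart variable."""
    if age <= 20:
        return 0
    if age <= 25:
        return 1
    if age <= 30:
        return 2
    return 3


def symbolic_table(rows):
    """Aggregate individuals into one symbolic description per class (team)."""
    by_team = {}
    for team, weight, country, age in rows:
        by_team.setdefault(team, []).append((weight, country, age))
    french_total = sum(1 for _, _, country, _ in rows if country == "France")
    table = {}
    for team, members in by_team.items():
        weights = [w for w, _, _ in members]
        bins = Counter(age_bin(a) for _, _, a in members)
        french = sum(1 for _, c, _ in members if c == "France")
        table[team] = {
            "weight": (min(weights), max(weights)),               # interval
            "countries": {c for _, c, _ in members},              # list of categories
            "age": {b: bins[b] / len(members) for b in sorted(bins)},  # bar chart
            "french_share": french / french_total,                # numerical variable
        }
    return table


table = symbolic_table(players)
```

Each row of `table` is a class described by an interval, a set of categories, a bar chart, and a number whose values over the classes sum to 1, which is the "vertical" bar chart mentioned in the text.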

This example is built from a standard ground data table. A symbolic data table can also be built in the case of complex data. For example, National Statistical Institutes (NSI) organize censuses in their regions on different kinds of populations:


hospitals, schools, inhabitants, and so on. For each region, each of these populations, which have different sizes, is associated with its own descriptive variables. For hospitals: number of beds, doctors, patients, and so on; for schools: number of pupils, teachers, and so on; for inhabitants: gender, age, socio-professional category, and so on. The regions are the classes of units described by the variables available for all these populations. If we have n regions and N populations (hospitals, schools, and so on, for each region), then we get, after the symbolic description of each region, a symbolic data table with n rows and p = p1 + … + pN columns, where pj is the number of ground variables associated with the jth population. Of course, other variables (standard or symbolic) can be added in order to describe other aspects of the regions. Therefore, the data that are unstructured and the samples that are unpaired at the level of the ground units become structured and paired at the level of the classes, in a symbolic data table of n rows and p columns. This example illustrates the importance of complex data in SDA, as they constitute a natural and abundant source of symbolic data.

1.2.5. Symbolic Data Analysis (SDA)

The first aim of SDA is to describe classes by vectors of symbolic data in an explanatory way. Its second aim is to extend Data Mining and Statistics to new kinds of data, the symbolic data, issued from standard or complex data often coming from the industrial domain. We cannot say that SDA gives better results than standard data analysis; we can only say that SDA can give good complementary results when we need to work on units that have a higher level of generality and an internal variability. For example, if we wish to know what makes a good player, the data naturally concern individual units, but if we wish to know what makes a good team, the natural units are the teams, which are classes of individuals. Moreover, SDA has several advantages. As the number of classes is lower than the number of individuals, SDA facilitates the interpretation of results in symbolic decision trees, symbolic factorial analysis, and so on. SDA reduces the size of simple, complex, or big data. It also reduces missing data and alleviates confidentiality issues (as individuals are often confidential but classes are not). It allows new variables to be added at the right level of generality.


1.3. Symbolic data tables from Dynamic Clustering Method and EM

1.3.1. The “dynamical clustering method” (DCM)

Starting from a given partition $P = (c_1, \ldots, c_k)$ of a population, this method is based on the alternate use of a representation function g (which associates a representation L with a class c) and an allocation function f (which associates a class c with any individual w of the population: C(w) = c), in order to improve a given criterion at each step until convergence. More precisely, starting from a partition P of the initial population, the representation function applied to the classes $c_i$ produces a vector of representations $L = (L_1, \ldots, L_k)$, where $g(c_i) = L_i$. A quality criterion can be defined in the following way: $W(P, L) = \sum_{i=1}^{k} z(c_i, L_i)$, where W measures the fit between each class $c_i$ and its representation $L_i$. W decreases when this fit increases. Starting from a partition $P^{(n)}$, the value of the sequence $u_n = W(P^{(n)}, L^{(n)})$ decreases at each step n of the algorithm. Indeed, during the allocation step, an individual w belonging to a class $P_i^{(n)}$ is assigned to a new class $P_j^{(n+1)} = f(w)$ iff $W(P^{(n+1)}, L^{(n)}) \le W(P^{(n)}, L^{(n)}) = u_n$. Then, starting from the new partition $P^{(n+1)}$, we can always define a new representation vector $L^{(n+1)} = (L_1^{(n+1)}, \ldots, L_k^{(n+1)})$, where, for any i = 1 to k, $L_i^{(n+1)} = g(P_i^{(n+1)})$ fits $P_i^{(n+1)}$ better than $L_i^{(n)}$ or remains unchanged (i.e. $L_i^{(n+1)} = L_i^{(n)}$). This means: $z(P_i^{(n+1)}, L_i^{(n+1)}) \le z(P_i^{(n+1)}, L_i^{(n)})$ for i = 1 to k. Hence, at this step, we have $u_{n+1} = W(P^{(n+1)}, L^{(n+1)}) \le W(P^{(n+1)}, L^{(n)}) \le W(P^{(n)}, L^{(n)}) = u_n$. As this inequality is true for any n, this positive sequence decreases and converges.

Moreover, note that if $W(P, L) = \sum_{i=1}^{k} \sum_{w \in c_i} z(w, L_i)$, then the allocation step consists of moving w from a class $c_i$ to another class $c_j$ when $z(w, L_j) < z(w, L_i)$. In this case, $L_j$ can be called a “prototype”. Another condition of convergence is that, for any c and L, $z(c, g(c)) \le z(c, L)$; in that case, g(c) is an “optimal prototype”.

1.3.2. Examples of DCM applications

The classical k-means method is the case where the $L_k$ are the means of the classes $c_k$. When, in the DCM, the $L_k$ are probability densities, we have a mixture decomposition method (see [DID 75], [DID 05]), which improves the fit (in terms of likelihood) between each class of the partition and its associated density function. More precisely, in


this case, the allocation function associates each individual with the density function that takes the highest value for this individual. When the representation is a regression, each individual is allocated to the class $c_i$ whose regression $L_i$ it fits best (see [CHA 77]); the same holds, more generally, when the representation is a canonical axis (see [DID 78], [DID 86]). There are many other possibilities: the representation of a class can be, for example, a distance [DID 77], a functional curve [DID 76], points of the population [DID 73], or a factorial axis [DID 72].

DCM canonical analysis is a general method whose aim (see Figure 1.1), given p blocks of variables, is to find simultaneously the k classes of individuals and the m canonical axes that fit best. The mathematical question, as expressed in [DID 86], is to maximize a criterion W(P, …, Z), written as a sum of terms of the form ⟨…, …⟩ / (‖…‖ ‖…‖), whose arguments are the linear combinations of the variables of the ith block of variables for the jth class of individuals, associated with the jth canonical axis and the …th axis among the m axes.

Figure 1.1. DCM Canonical analysis of k blocks of individuals described by p blocks of variables
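The alternation between the representation step g and the allocation step f described in section 1.3.1 can be sketched for the simplest case mentioned in section 1.3.2, where each representation is the mean of its class (k-means). This is a minimal sketch on one-dimensional data, assuming z(w, L) is the squared distance; the function names `criterion` and `dcm_means` and the data are hypothetical.

```python
def criterion(points, assign, protos):
    """W(P, L): sum over individuals of z(w, L_i) = squared distance to the prototype."""
    return sum((p - protos[a]) ** 2 for p, a in zip(points, assign))


def dcm_means(points, assign, k, max_iter=100):
    """Alternate representation and allocation steps, starting from a partition."""
    protos = [0.0] * k
    history = []
    for _ in range(max_iter):
        # Representation step g: each prototype becomes the mean of its class.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty class keeps its previous prototype
                protos[j] = sum(members) / len(members)
        history.append(criterion(points, assign, protos))
        # Allocation step f: each individual moves to its closest prototype.
        new_assign = [
            min(range(k), key=lambda j: (p - protos[j]) ** 2) for p in points
        ]
        if new_assign == assign:
            break
        assign = new_assign
    return assign, protos, history


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
assign, protos, history = dcm_means(points, [0, 1, 0, 1, 0, 1], k=2)
```

The list `history` stores the successive values $u_n$ of the criterion, which should never increase from one iteration to the next.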

Note that the DCM canonical analysis contains as special cases:
– DCM principal component analysis [DID 72] [see Figure 1.2(a)];
– DCM regression [CHA 77] [see Figure 1.2(c)].

 


In the case of categorical variables, it leads in [DID 78] to:
– DCM factorial correspondence analysis;
– DCM discriminant analysis [see Figure 1.2(b)].

Figure 1.2. (a) DCM PCA: find simultaneously the classes and the first axes of local PCA that fit best. (b) DCM discriminant analysis: find simultaneously the classes and the first axes of local factorial discriminant analysis that fit best. (c) DCM regression: find simultaneously the classes and the local regressions that fit best. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

For an overview of DCM, see [DID 79] and [DID 80].

1.3.3. Clustering methods by mixture decomposition

In the case of mixture decomposition, by DCM [DID 75], [DID 05] for partitioning or by EM [DEM 77] for fuzzy partitioning, a joint probability density is associated with each obtained cluster. More precisely, the DCM aims to build iteratively a partition $P = (P_1, \ldots, P_k)$ and, simultaneously, a vector of probability densities $L = (L_1, \ldots, L_k)$ in a dynamical clustering process. The obtained partition iteratively maximizes the following criterion, denoted as W, where $w(P_i, L_i)$ denotes the likelihood of $L_i$ for $P_i$:

$W(P, L) = \sum_{i=1}^{k} w(P_i, L_i)$.

The EM method aims to obtain a fuzzy partition maximizing the likelihood of the probability density f, where the $f_i(\cdot, a_i)$ are the probability densities of the mixture decomposition, following a given model with parameters $a_i$, such that, at the individual w:

$f(w, a) = \sum_{i=1}^{k} p_i f_i(w, a_i)$ with $\sum_{i=1}^{k} p_i = 1$.


Note that the fuzzy partition obtained by EM fits the obtained probability densities, whereas the exact partition induced by these densities (each individual w being assigned to the class whose density takes the highest value at w) does not fit the associated probability densities. As shown in Figure 1.3 for a two-class partition, the classes do not fit their associated probability densities. Conversely, the exact partition given by DCM fits its obtained probability densities, but does not fit its associated fuzzy partition. Therefore, both methods can benefit from being used alternately in order to improve their obtained partition (exact or fuzzy).

Figure 1.3. The exact partition induced from the density functions obtained from EM does not fit these obtained density functions. This is not the case at the convergence of DCM where the obtained partition and its associated density functions fit exactly

1.3.4. Symbolic data tables from clustering

A symbolic data table in which each unit is one of the obtained clusters can be built in three ways: directly, if the obtained clusters define a partition; from the marginals induced by the joint distribution associated with each cluster provided by EM or DCM; or from the membership weights of the individuals, if we have fuzzy clusters as in EM mixture decomposition. If $L_m$ is the representative of the class $c_m$, then the weight $t_k(w_i)$ of an individual $w_i$ in the class $c_k$ is computed from $d(w_i, L_k)$, where d is the dissimilarity used by the clustering method that produced the classes. Then, the histogram for the mth class and the jth variable is given by:

$\left( \frac{\sum_{i} t_m(w_i)\, \delta_j^1(x_{ij})}{\sum_{i} t_m(w_i)}, \ldots, \frac{\sum_{i} t_m(w_i)\, \delta_j^V(x_{ij})}{\sum_{i} t_m(w_i)} \right)$ [1.1]


where $x_{ij}$ is the value taken by the variable $X_j$ for the individual $w_i$; in other words, $X_j(w_i) = x_{ij}$. $(\delta_j^1, \ldots, \delta_j^V)$ is a vector of Dirac masses defined on V intervals $(I_j^1, \ldots, I_j^V)$ partitioning the domain $D_j$ of the numerical variable $X_j$, such that $\delta_j^v(x_{ij})$ takes the value 1 if $x_{ij} \in I_j^v$ and 0 elsewhere. When the $I_j^v$ are categorical values instead of intervals, we obtain a bar chart, and $\delta_j^v(x_{ij})$ takes the value 1 if $x_{ij}$ is the category $I_j^v$ and the value 0 elsewhere. When, instead of a fuzzy partition $(t_1, \ldots, t_K)$ such as the one given by EM, we have an exact partition $(P_1, \ldots, P_K)$, such as the one induced by the EM densities or given directly by DCM, we can build a histogram or a bar chart in the same way by setting $t_k(w_i) = 1$ if $w_i \in P_k$ and 0 otherwise, for any i = 1, …, n and k = 1, …, K.

In SDA, in order to increase the explanatory power of the obtained symbolic data table: first, the number of intervals is preferably kept small (about 5, although it can be increased if needed); second, the size and the position of these intervals can be obtained in an optimal way in order to maximize the distance between the symbolic descriptions of the classes (see [DID 13a]); and, third, weights can be added to the variables when the clustering method uses DCM with cluster representatives such as PCA, regression, canonical analysis, or a multi-block approach as in [BOU 17] and [BOU 18], by considering that the categories of each symbolic bar chart variable constitute the blocks of standard numerical variables. In this way, from any clustering method, we can obtain a symbolic data table on which SDA can be applied. The aim of SDA is to study the obtained symbolic data table in order to gain complementary knowledge and more explanatory results than the usual standard interpretation.

By contrast, in standard mixture decomposition, the description of each class is often just given by the analytical expression of the joint probability density fi associated with each class. For example, in the case of the Gaussian model, the joint density is described by a large correlation matrix, which is hard to interpret when the variables are numerous. A more explanatory way is to describe the joint densities fi associated with each class Ci by their marginals fij. These marginals associated with each class can then be described by several kinds of symbolic data, such as histograms, interquartile intervals, or any other kind of symbolic data. The three steps of describing clusters, from the least to the most explanatory, are summarized in Figure 1.4. Note that the obtained symbolic data table is not only by itself a complementary way to interpret the results of the mixture decomposition but,


moreover, it is a starting point for applying SDA tools, such as a symbolic PCA, to the obtained clusters or fuzzy clusters and their symbolic descriptions, in order to considerably enhance their interpretation.

Figure 1.4. After a mixture decomposition: from the least to the most explanatory description of the obtained clusters
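The histogram of equation [1.1] can be illustrated with a short sketch: each component is the sum of the membership weights of the individuals falling in an interval, divided by the total weight. The function name `symbolic_histogram`, the interval convention (closed on the left, with the last interval also closed on the right), and the data are assumptions made for this illustration only.

```python
def symbolic_histogram(values, weights, edges):
    """Histogram of one class for one variable, in the spirit of equation [1.1].

    values[i]  : value x_ij of the variable for the individual w_i
    weights[i] : membership weight t_m(w_i) of w_i in the class
    edges      : V + 1 increasing bounds defining V intervals
    """
    total = sum(weights)
    hist = [0.0] * (len(edges) - 1)
    for x, t in zip(values, weights):
        for v in range(len(hist)):
            is_last = v == len(hist) - 1
            # Assumed convention: [a, b) intervals, last one closed on the right.
            if edges[v] <= x < edges[v + 1] or (is_last and x == edges[-1]):
                hist[v] += t
                break
    return [h / total for h in hist]


values = [1.0, 2.5, 3.0, 4.5]
edges = [0, 2, 4, 6]

# Fuzzy memberships (as given by a fuzzy clustering) and an exact class {w1, w2}.
fuzzy = symbolic_histogram(values, [0.9, 0.6, 0.2, 0.1], edges)
exact = symbolic_histogram(values, [1, 1, 0, 0], edges)
```

With weights equal to 0 or 1, the same function gives the histogram of an exact class, which corresponds to the exact-partition case described in the text.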

1.3.5. A general way to compare results of clustering methods by the “explanatory power” of their associated symbolic data table

There are several ways of building a symbolic data table from a given set of classes. The best way in SDA is to get a meaningful symbolic data table by maximizing the discrimination power of the symbolic data associated with each variable for each class. A discrimination degree can be calculated as a normalized sum (to be maximized) of the pairwise dissimilarities between the symbolic descriptions. Such dissimilarities can be found in [BOC 00], [DEC 98], [DEC 06], [DID 08], [GOW 91], [ICH 94], and [IRP 08]. In the case of histogram-valued variables, an example of a discriminating tool is given in [DID 13a]. In summary, there are at least two (related) ways to obtain a meaningful symbolic data table:
– maximizing the distances between rows;
– minimizing the entropy in each cell of the symbolic data table.
More details are given in [DID 19].

1.3.6. Quality criteria of classes and variables based on the cells of the symbolic data table containing intervals or inferred distributions

Much work has long been done on robust intervals, which can be useful to measure the robustness of a symbolic interval-valued variable (see [HOR 98], [ROY 86]). The quality of the inferred distributions contained in the cells of the obtained symbolic data table can also be measured by classical model


selection criteria such as the Bayesian Information Criterion (BIC), Minimum Description Length (MDL), Akaike’s Information Criterion (AIC), Minimum Message Length (MML), or other criteria of this kind based on likelihood estimation. In section 1.4, other kinds of explanatory power of clustering methods, based on the symbolic data table that they induce by aggregation, are given.

1.4. Criteria for ranking individuals, classes and their bar chart descriptive symbolic variables

1.4.1. A theoretical framework for SDA

A general theoretical framework for SDA can be found in [EMI 18]. Here, we define a theoretical framework for SDA in the case of bar chart symbolic variables. Let C, X, and A be three random variables defined on the ground population Ω in the following way.

C is a class variable: Ω → P such that C(w) = c, where c is a class of a given partition P.

X can be considered as a vector of variables, and X(w) is then a vector of categories (called “metabins”, see [DID 13b]) containing one category for each variable. In this case, X is a mapping from Ω to M, the set of metabins. Of course, it can be interesting to study the joint explanatory power of such metabins for a given class. Nevertheless, following the principle that the explanatory power of a metabin is the sum of the explanatory powers of its bins, we use in the rest of this paper the marginal case, by considering X as a single variable. In the following, all the criteria express a marginal explanatory power, which is why X is defined in the following way: X is a variable Ω → M such that X(w) = x is a category among a set of categories M.

A is an aggregation function that associates with a class c a symbol s = A(c), which can be a min–max interval, an interquartile interval, a cumulative distribution, a quantile function, a bar chart, and the like, issued from an aggregative process on individuals. Here, s is restricted to be a mapping from M to [0, 1].


From C, X, and A, we can build the so-called “symbolic random variable” $S_H$, which is defined on Ω and takes its values in [0, 1], such that $S_H$ associates with $w \in \Omega$ the value $S_H(w) = A(C(w))(X(w))$. In the following, we consider that M is restricted to be a set of categories and that $f_c: M \to [0, 1]$ is the bar chart induced by the class c = C(w). Then, if x = X(w), we have $S_H(w) = f_c(x)$, where $f_c = A(c)$ is a bar chart symbol. This symbol $f_c$ is “horizontal” in Figure 1.5, which is why H is the index of S.

We can also define another bar chart $g_x: P \to [0, 1]$ such that, if c = C(w) and x = X(w), then $g_x(c) = |\{w' \in \Omega / f_{C(w')}(x) = f_c(x)\}| / |\Omega|$. If $v(x) = \{f_c(x) / c \in P\}$, we can now define another symbolic random variable $S_V$ by $S_V(w) = g_x(c)$, where $A(v(x)) = g_x$ is a bar chart symbol. More generally, we can define

$g_{x,E}(c) = |\{w' \in \Omega / f_{C(w')}(x) \in E\}| / |\Omega| = S_{VE}(w)$,

where E is an interval included in [0, 1], which generalizes the preceding case where E was reduced to $f_c(x)$. These functions are illustrated by an example shown in Figure 1.5.

Figure 1.5. A symbolic data table reduced to six classes among many others and to a unique symbolic variable X = Height of bar chart value (with seven categories), for each class. Each class ci is associated with a bar chart fci and we represent the gxE(c) value for the category x = 7, the interval event E around fc5(7) = 0.2, and for the class c5
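The two bar charts of section 1.4.1 can be computed directly from a ground population. The sketch below is illustrative only: the population `omega`, the function names `bar_charts` and `g`, and the choice of a closed interval E = [lo, hi] are assumptions, not part of the original framework.

```python
from collections import Counter

# Hypothetical ground population: (class C(w), category X(w)) for each individual w.
omega = [
    ("c1", "a"), ("c1", "a"), ("c1", "b"), ("c1", "b"),
    ("c2", "a"), ("c2", "b"), ("c2", "b"), ("c2", "b"),
    ("c3", "b"), ("c3", "b"),
]


def bar_charts(population):
    """Return {c: f_c}, where f_c maps each category to its frequency in the class c."""
    counts, sizes = {}, Counter()
    for c, x in population:
        counts.setdefault(c, Counter())[x] += 1
        sizes[c] += 1
    return {c: {x: n / sizes[c] for x, n in cnt.items()} for c, cnt in counts.items()}


def g(population, f, x, event):
    """g_{x,E}: share of individuals w' whose class has f_{C(w')}(x) in E = [lo, hi]."""
    lo, hi = event
    hits = sum(1 for c, _ in population if lo <= f[c].get(x, 0.0) <= hi)
    return hits / len(population)


f = bar_charts(omega)
s_h = [f[c][x] for c, x in omega]                 # S_H(w) = f_{C(w)}(X(w))
s_ve = [g(omega, f, x, (0.2, 0.6)) for _, x in omega]  # S_VE(w) for E = [0.2, 0.6]
```

Here `s_h` reads each individual's value "horizontally" in the bar chart of its own class, while `s_ve` reads it "vertically" across all the classes.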


1.4.2. Characterization of a category and a class by a measure of discordance

We generally say that a category x is “characteristic” of a class c if it is frequent in this class and rare in the other classes. In other words, there is a “discordance” between the class c and the other classes for this category x. More generally, if E denotes an interval contained in [0, 1] such that $f_c(x)$ belongs to E, the category x is “characteristic” of the class c for this event E if $f_c(x)$ is large and $f_{c'}(x)$ rarely belongs to E when c’ ≠ c varies among all the classes of the partition P.

A characterization criterion W of a category x and a class c can be measured by $W(x, c) = f_c(x)/g_{x,E}(c)$ or by $W(x, c) = -f_c(x) \operatorname{Log}(g_{x,E}(c))$ (many other variants are possible, as in the case of the Tf-Idf criteria, even if this criterion is very different from the standard Tf-Idf). In order to have a criterion varying between 0 and 1, we can use: $W(x, c) = f_c(x)/(1 + g_{x,E}(c))$.

Moreover, given an event E, these criteria W express how much a category x is “discordant” for a class c versus the other classes c’ of the given partition P: a category x is all the more discordant for a given class c and an event E as its frequency in the class c is large and the proportion, in the ground population Ω, of the individuals w belonging to any class c’ such that $f_{c'}(x)$ belongs to the event E(x, c) is low.

Given x and c, several choices of E can be of interest. We now give four examples of events E:
– for a characterization of x and c in the neighborhood of $f_c(x)$: $E_1 = [f_c(x) - \varepsilon, f_c(x) + \varepsilon]$ for ε > 0 and $f_c(x) \in [\varepsilon, 1 - \varepsilon]$;
– for a characterization of the values higher than $f_c(x)$: $E_2 = [f_c(x), 1]$;
– for a characterization of the values lower than $f_c(x)$: $E_3 = [0, f_c(x)]$;
– in order to characterize the existence of the category x: $E_4 = \,]0, 1]$.

Hence, a category x is characteristic of a class c when it is frequent in the class c and rarely appears in the classes c’ ≠ c: with a frequency in a neighborhood of $f_c(x)$ if E = E1; with a frequency above (respectively, under) $f_c(x)$ if E = E2 (respectively, E = E3); with a frequency strictly higher than 0 (i.e. it rarely appears in the classes c’ different from c) if E = E4. In all these cases, $g_{x,E}(c)$ is low and so W is high if $f_c(x)$ is high.

Singular, typical, or specific characteristic discordance of a category

When w, such that X(w) = x and C(w) = c, varies in Ω, there are four cases to consider, depending on whether the category x is frequent (denoted Fc) or rare (denoted Rc) in c, and on whether this category x is frequent (denoted FE) or rare (denoted RE) in the set of classes c’ ≠ c such that $f_{c'}(x)$ belongs to E. Hence, we have four cases, called Fc FE, Fc RE, Rc FE, and Rc RE, depending on x or c’; the cases Fc FE and Rc RE cannot give any specific value to W(x, c). The case Fc RE, where the category is frequent in c [i.e. $f_c(x)$ high] and its frequency inside the classes c’ ≠ c rarely falls in E [i.e. $g_{x,E}(c)$ low], leads to a value of W(x, c) close to 1. In this case, we can say that the category is “typical” of c, and the criterion W measures its typicality. The case Rc FE, where the category is rare in c [i.e. $f_c(x)$ low] and its frequency in the classes c’ ≠ c frequently falls inside E [i.e. $g_{x,E}(c)$ high], leads to a value of W(x, c) close to 0. In this case, we can say that the category is “singular” in comparison with the other classes, and the criterion W measures the singularity of the category. A category x with a value W(x, c) close to 0 or 1 can be said to be “specific” to c, as it is typical or singular. Therefore, we can say that a category x of a variable Y is specific to c if $-W(x, c) \operatorname{Log} W(x, c)$ is low. We can also say that, in comparison with the other classes, the class c is “discordant” for the category x, and that the explanatory power of a couple (x, c) is $1 + W(x, c) \operatorname{Log} W(x, c)$.

1.4.3. Link between a characterization by the criteria W and the standard Tf-Idf

The basic idea of the Tf-Idf is to characterize a category of a class by the fact that it is frequent inside the class and rare in the other classes of the given partition P. In other words, if n(x, c) is the number of occurrences of x in c, K is the number of classes, and k(x) is the number of classes containing x, then the Tf-Idf of x and c can be written: Tf-Idf(x, c) = (n(x, c)/|c|) (K/k(x)). Hence, the Tf-Idf is all the greater as x appears frequently in the class and rarely in the other classes.
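The three forms of the criterion W of section 1.4.2 and the four events E1 to E4 can be tried on a toy example. In the sketch below, the frequencies `freq` (the values $f_c(x)$ of one category x in four classes), the class sizes, the value of ε, and all function names are hypothetical; each event is encoded as a triple (lower bound, upper bound, whether the lower bound is excluded).

```python
import math

# Hypothetical values f_c(x) of a single category x, and class sizes |c|.
freq = {"c1": 0.60, "c2": 0.05, "c3": 0.00, "c4": 0.10}
size = {"c1": 25, "c2": 25, "c3": 25, "c4": 25}


def events(fc, eps=0.1):
    """The four events of the text as (lo, hi, lo_open) triples."""
    return {
        "E1": (fc - eps, fc + eps, False),  # neighborhood of f_c(x)
        "E2": (fc, 1.0, False),             # values higher than f_c(x)
        "E3": (0.0, fc, False),             # values lower than f_c(x)
        "E4": (0.0, 1.0, True),             # ]0, 1]: existence of the category
    }


def g(event):
    """g_{x,E}: share of individuals whose class frequency of x falls in E."""
    lo, hi, lo_open = event
    inside = 0
    for c, v in freq.items():
        above = v > lo if lo_open else v >= lo
        if above and v <= hi:
            inside += size[c]
    return inside / sum(size.values())


def w_ratio(c, event):
    return freq[c] / g(event)


def w_log(c, event):
    return -freq[c] * math.log(g(event))


def w_bounded(c, event):
    return freq[c] / (1.0 + g(event))  # stays between 0 and 1


ev = events(freq["c1"])
scores = {name: w_bounded("c1", e) for name, e in ev.items()}
```

In this toy case the category is frequent in c1 and rare elsewhere, so the bounded criterion is highest for E1 and E2 (only c1 falls in the event) and lowest for E3 (every class falls in it).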


PROPOSITION.– If the classes c of the partition P have the same size and their elements w are either all such that $f_c(x) = 0$ or all such that $f_c(x) > 0$, then W(x, c, ]0, 1]) is the standard Tf-Idf for the value (x, c).

Proof: By definition:

$n(x, c)/|c| = f_c(x)$ [1.2]

From the hypothesis that all the classes have the same size, we get: K |c| = |Ω|. From the hypothesis that the elements of a class either all take the value x or all take another value, we get: $k(x)\,|c| = |\{w / f_{C(w)}(x) > 0\}|$. Therefore, $K/k(x) = (K |c|)/(k(x) |c|) = |\Omega| / |\{w / f_{C(w)}(x) > 0\}|$. Hence, finally:

$K/k(x) = 1/g_{x,]0,1]}(c)$ [1.3]

Therefore, Tf-Idf(x, c) = (n(x, c)/|c|) (K/k(x)) implies, from [1.2] and [1.3]:

$\text{Tf-Idf}(x, c) = f_c(x)/g_{x,]0,1]}(c) = W(x, c)$

End of proof.

Note that there are several other closely related ways to define the Tf-Idf, for example, by using a Log as follows: Tf-IdfLog(x, c) = (n(x, c)/|c|) Log(K/k(x)). In this case, by setting $W_{Log}(x, c) = -f_c(x) \operatorname{Log}(g_{x,]0,1]}(c))$, we obtain:

$\text{Tf-Idf}_{Log}(x, c) = f_c(x) \operatorname{Log}(1/g_{x,]0,1]}(c)) = W_{Log}(x, c, ]0, 1])$

More generally, let Tf-Idf’(x, c) = Tf(n(x, c)/|c|) Idf(K/k(x)), where Tf and Idf are other ways to define the two factors of the Tf-Idf. Then, by setting $W'(x, c) = Tf(f_c(x))\, Idf(1/g_{x,]0,1]}(c))$ and by using [1.2] and [1.3], we obtain: Tf-Idf’(x, c) = W’(x, c, ]0, 1]).
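The identity between the Tf-Idf and the criterion W for the event ]0, 1] can be checked numerically on a small example with classes of equal size. The data and the function names `tf_idf` and `w_e4` below are hypothetical; the check relies on identity [1.3], i.e. on the fact that, with equal class sizes, the share of individuals belonging to a class that contains x equals k(x)/K.

```python
# Hypothetical partition: class name -> categories X(w) of its members (all of size 4).
population = {
    "c1": ["x", "x", "y", "y"],
    "c2": ["x", "y", "y", "z"],
    "c3": ["y", "y", "z", "z"],
}


def tf_idf(x, c):
    """Tf-Idf(x, c) = (n(x, c) / |c|) * (K / k(x))."""
    n_xc = population[c].count(x)
    big_k = len(population)
    k_x = sum(1 for members in population.values() if x in members)
    return (n_xc / len(population[c])) * (big_k / k_x)


def w_e4(x, c):
    """W(x, c) = f_c(x) / g_{x,E}(c) for the event E = ]0, 1]."""
    f = {name: m.count(x) / len(m) for name, m in population.items()}
    n_omega = sum(len(m) for m in population.values())
    g = sum(len(population[name]) for name in population if f[name] > 0) / n_omega
    return f[c] / g


pairs = [(x, c) for x in ("x", "y", "z") for c in population]
gaps = [abs(tf_idf(x, c) - w_e4(x, c)) for x, c in pairs]
```

All the values in `gaps` are zero up to rounding errors; with classes of unequal sizes the two quantities would in general differ.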

Another kind of SDA characterization criterion, closer to the Tf-Idf

It is based on replacing g by g’:

$g'_{x,E}(c) = |\{c' \in P / f_{c'}(x) \in E\}| / |P|$

By setting $W'(x, c) = f_c(x)/g'_{x,E}(c)$, $W'(x, c) = -f_c(x) \operatorname{Log}(g'_{x,E}(c))$, or $W'(x, c) = f_c(x)/(1 + g'_{x,E}(c))$, we get characterization criteria equal to variants of the Tf-Idf when E = ]0, 1] (i.e. E = E4). With this criterion W’, the typicality, singularity, or specificity can be calculated in the same way as with W, but with a different meaning since, in this case, the category x appears in a frequent (denoted FE) or rare (denoted RE) number of classes c’ ≠ c such that $f_{c'}(x)$ belongs to E.

Other criteria of characterization

Other kinds of characterization criteria can be used. The popular “test value”, developed in [LEB 84], may also be used to measure the characterization of a category in a bar chart contained in a cell. The p-value is the level of marginal significance within a statistical hypothesis test, representing the probability of the occurrence of a given event. A simple alternative is the ratio between the frequency of a category in a class and the mean of the frequencies of the same category in all the classes of the given partition.

1.4.4. Ranking the individuals, the symbolic variables and the classes of a bar chart symbolic data table

A bar chart symbolic data table is defined by a set of p symbolic variables Xj in columns, describing K classes ck in rows, and containing in each cell a bar chart of categories denoted zjm, for m = 1 to mj, associated with each variable Xj. In the following, the characterization criteria W cover all kinds of variants, including W’. Each category associated with each cell of a bar chart symbolic data table can be characterized by a W criterion. Given an individual w, we have its class C(w) = ck and its categorical value Xj(w) = zjm for any variable Xj. Then, by summing the characterization values W of the categories zjm, for j = 1 to p, in the row associated with the class ck, we get a characterization of this individual w. Then, by summing the characterization values W of a selected category in all the cells of each row (respectively, column) of this symbolic data table, we obtain a characterization of each class (respectively, variable). In the same way, by summing the characterization values W over all the cells, we can obtain a characterization of the symbolic data table. In that way, we can find a typical, singular, or specific ranking of individuals, classes, bar chart variables, or symbolic data tables. For example, the characterization measure of an individual w for the jth variable such that Xj(w) = zjm, C(w) = ck for the event E is defined by:


$W(z_{jm}, c_k) = f_{c_k}(z_{jm}) / (1 + g_{z_{jm},E}(c_k))$.

Therefore, the characterization measure of an individual w can be: $CI(w) = \sum_{j=1}^{p} W(X_j(w), C(w))$. We can then define a typicality measure of a symbolic variable Xj by: $CV(X_j) = \sum_{k=1}^{K} \max_{m=1,\ldots,m_j} W(z_{jm}, c_k)$. We can also define a typicality measure of a class c by: $CC(c) = \sum_{j=1}^{p} \max_{m=1,\ldots,m_j} W(z_{jm}, c)$. We can finally define a typicality measure of a partition P (which can be called the “typicality of the symbolic data table” defined by the symbolic class descriptions) by: $CT(P) = \sum_{k=1}^{K} CC(c_k)$. The singularity measure can be calculated by using the min instead of the max.

Ranking

We can then order, from the least to the most characteristic, the individuals w, the symbolic variables Xj for j = 1 to p, the classes ck for k = 1 to K, and the symbolic data tables, by using, respectively, the CI, CV, CC, and CT characteristic measures. These orders are, respectively, denoted as OCI, OCV, OCC, and OCT. Note that, in all these cases, we can associate a metabin with each individual or class by choosing the most characteristic category of each variable. A hierarchical or pyramidal clustering on these metabins can facilitate the interpretation of the most characteristic individuals or classes, as they can be close in the ranking but for different reasons (expressed by different metabins). For example, in a study on cause-specific mortality in European countries [AFO 18], Bulgaria and Romania, with Singularity Levels (SL) equal to 271 and 252, respectively, were the most discordant, down to Poland (SL = 152), with a high level of mortality mainly due to circulatory problems. Denmark (SL = 152) and France (SL = 149) were close to Poland in the ranking, but for very different reasons, namely a low level of mortality.

From these basic criteria (CI, CV, CC, and CT), many others can be considered, for example, by using the mean, the sum, or the median instead of the Max or the Min. Also, instead of giving the same weight to the categories in the sums, we can give them different weights, such as those obtained by a DCM canonical analysis given


in [DID 86] or more recently in [BOU 17] and [BOU 18], considering that the categories of each symbolic bar chart variable constitute the blocks of standard numerical variables.

1.5. Two directions of research

1.5.1. Parametrization of concordance and discordance criteria

Note first that, instead of considering the characterization criteria W that express a “discordance” between a class and the other classes, we can consider a criterion that expresses a “concordance”. This criterion, denoted Wconc, has many variants. Nevertheless, it can basically be written (with E fixed and x = X(w)) in the following form:

$S_{conc}(w) = f_{C(w)}(X(w)) \, g_{x,E}(C(w))$

Therefore: $S_{conc}(w) = S_H(w) \, S_{VE}(w)$.

This criterion varies between 0 and 1 and expresses a concordance between the class C(w) and the other classes, as its highest value is obtained for individuals w such that the frequency of the category X(w) in C(w) is high and, simultaneously, the number of individuals in the other classes (for W) or the number of other classes (for W’) having a high frequency for X(w) is also high. The discordance can be written in the same way:

$S_{disc}(w) = f_{C(w)}(X(w)) / (1 + g_{x,E}(C(w)))$

Therefore: $S_{disc}(w) = S_H(w)/(1 + S_{VE}(w))$.

PROPOSITION.– When the partition $P$ is the trivial partition [i.e. $C(w) = \{w\}$] and $X(w) = x$, then we have $S_{conc}(w) = f_E(x)$ and $S_{disc}(w) = 1/(1 + f_E(x))$, where $f_E$ is the frequency of individuals $w$ such that $f(X(w)) \in E$.

 


Proof: In the case where $C(w) = \{w\}$ and $X(w) = x$, we have:

\[ f_{C(w)}(X(w)) = f_{\{w\}}(x) = 1, \]

as it is the frequency of the category $x$ in a class containing only $w$, whose associated category is $x$. Moreover, with $c = C(w)$, we have:

\[
\begin{aligned}
g_x(c) &= |\{w' \in \Omega : f_{C(w')}(x) = f_c(x)\}| / |\Omega| \\
&= |\{w' \in \Omega : f_{\{w'\}}(x) = f_c(x)\}| / |\Omega| \\
&= |\{w' \in \Omega : f_{\{w'\}}(x) = 1\}| / |\Omega| \\
&= f(x),
\end{aligned}
\]

the frequency of $x$ in the ground population $\Omega$. Therefore, $g_x(c)$ becomes the frequency of $x$ in $\Omega$. In the same way, $g_{x,E}(c)$ becomes the frequency denoted $f_E(x)$ of $E$. Hence, when $P$ is the trivial partition [i.e. $C(w) = \{w\}$] and $X(w) = x$, we obtain:

\[
\begin{aligned}
S_{conc}(w) &= f_{C(w)}(X(w)) \, g_{x,E}(C(w)) = f_E(x), \\
S_{disc}(w) &= \frac{f_{C(w)}(X(w))}{1 + g_{x,E}(C(w))} = \frac{1}{1 + f_E(x)}.
\end{aligned}
\]
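The trivial-partition result can be checked numerically with a small sketch. It rests on an assumption that should be read with care: $g_{x,E}(c)$ is taken here to be the proportion of individuals $w'$ whose class frequency $f_{C(w')}(x)$ lies in a closed interval $E$, which may differ from the exact definition used earlier in the chapter. All function names and the toy data are hypothetical.

```python
def class_freq(members, x, X):
    """f_c(x): relative frequency of category x among the members of class c."""
    return sum(1 for w in members if X[w] == x) / len(members)

def g(x, E, classes, X):
    """Assumed g_{x,E}: share of individuals whose class has f_{C(w')}(x) in E."""
    lo, hi = E
    n = sum(len(c) for c in classes)
    return sum(len(c) for c in classes if lo <= class_freq(c, x, X) <= hi) / n

def s_conc(w, classes, X, E):
    """Concordance: f_{C(w)}(X(w)) * g_{x,E}(C(w))."""
    c = next(c for c in classes if w in c)
    return class_freq(c, X[w], X) * g(X[w], E, classes, X)

def s_disc(w, classes, X, E):
    """Discordance: f_{C(w)}(X(w)) / (1 + g_{x,E}(C(w)))."""
    c = next(c for c in classes if w in c)
    return class_freq(c, X[w], X) / (1.0 + g(X[w], E, classes, X))

# Four individuals with one categorical variable, and the trivial partition.
X = ["a", "a", "b", "a"]
trivial = [[0], [1], [2], [3]]
E = (0.5, 1.0)  # with singleton classes, f is 0 or 1, so E selects f = 1

conc_a = s_conc(0, trivial, X, E)  # equals the frequency of "a" in the population
disc_a = s_disc(0, trivial, X, E)  # equals 1 / (1 + frequency of "a")
disc_b = s_disc(2, trivial, X, E)  # the rarer category is the more discordant
```

Under this assumed definition, the concordance of an individual reduces to the population frequency of its category, and the individual carrying the rare category "b" receives the larger discordance.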

These results lead to an explanatory interpretation of $S_{conc}$ and $S_{disc}$ in the case of such a trivial partition: the greater the frequency of an individual's category inside the interval $E$, the greater its concordance. In the same way, the smaller the frequency of its category in $E$, the greater its discordance.

The parametric case: If $S_H$ (respectively, $S_V$) depends on a parameter $a$ (respectively, $b$) under some model assumption (multinomial, Dirichlet, or the like), and given a sample $\{w_1, \dots, w_n\}$ of $\Omega$, we have:

\[
\begin{aligned}
S_{conc}(w, a, b) &= S_H(w, a) \, S_V(w, b), \\
S_{disc}(w, a, b) &= \frac{S_H(w, a)}{1 + S_V(w, b)}.
\end{aligned}
\]

The parameters $a$ and $b$ can be estimated by maximizing the following likelihood:

\[ L_{conc}(S_{conc}; a, b) = \prod_{i=1}^{n} S_{conc}(w_i, a, b). \]


In the same way, we can parameterize the discordance by maximizing:

\[ L_{disc}(S_{disc}; a, b) = \prod_{i=1}^{n} \frac{S_H(w_i, a)}{1 + S_V(w_i, b)}. \]

If we define a law $F$ on the $f_c$ and a law $G$ on the $g_{x,E}(c)$, a more accurate parameterization can be obtained in the following way: find $a'$, $b'$, $a$, $b$ which maximize:

\[ L_{conc}(R_{conc}; a', b', a, b, E) = \prod_{i=1}^{n} R_{conc}(w_i, a', b') \, S_H(w_i, a) \, S_V(w_i, b), \]

where

\[ R_{conc}(w_i, a', b') = F(f_{C(w_i)}; a, a') \, G(g_{X(w_i),E}; b, b'), \]

and

\[ L_{disc}(R_{disc}; a', b', a, b, E) = \prod_{i=1}^{n} R_{disc}(w_i, a', b') \, \frac{S_H(w_i, a)}{1 + S_V(w_i, b)}, \]

where

\[ R_{disc}(w_i, a', b') = \frac{F(f_{C(w_i)}; a, a')}{1 + G(g_{X(w_i),E}; b, b')}. \]

As in section 1.4.4, the orders OCI, OCV, OCC, and OCT can be used with all these kinds of criteria. An interesting open question is to find under which conditions, when a sequence of partitions converges toward the trivial partition, the parametric concordance $S_{conc}(w, a, b)$ converges toward a parametric frequency $f_E(X(w), a, b)$ on $\Omega$ with the same parameters. The same question arises for the discordance case. Another open question is to extend the concordance and discordance to the case where $X$ is a numerical random variable and the values of the symbolic variables are distributions. Much also remains to be done by considering joint explanatory criteria instead of the marginal ones used in this chapter, since, in the case of dependencies, the results of both approaches can be very different. A compromise can be to apply DCM canonical analysis and to use the features induced by the best explanatory canonical axis.

1.5.2. Improving the explanatory power of any machine learning tool by a filtering process

Explaining for understanding is complementary to discriminating for learning (see [DID 18]). Our aim is now to give a filtering process that improves the explanatory power on a filtered subpopulation, without degrading the discriminating power of any machine learning tool. We suppose here that we have already obtained a clustering from a basic sample where the predictive values are known in the case of supervised data. We have to consider two cases, depending on whether the data are supervised or not.


In the case of unsupervised data, we have to allocate new individuals to the best fitting representative associated with each cluster. For example, in the case of k-means, we associate any new individual with the cluster of the closest mean. In the DCM case where the class representative is a distribution (as in DCM mixture decomposition, see [DID 75], [DID 05]), any new individual is allocated to the cluster whose associated density function maximizes the likelihood of this individual. For any individual and in any case, we can obtain an order of preference of the clusters, from the best fitting representative for this individual to the least fitting one. Hence, in this way, from any individual, we can place the clusters in an order denoted as O1.

In the case of supervised data, there are two steps. In the first step, the aim is to allocate a new individual (whose predictive value is not given) to the best cluster. In the second step, the aim is to obtain the predictive value of this new individual from the local model associated with this cluster. For example, if we allocate a new individual to a cluster modeled by a local regression given by a DCM regression (as in [CHA 77]), then we can obtain its predictive value by using this regression. The same can be done if, instead of a local regression, we have a local decision tree, a local SVM, a local neural network, and so on. In order to find the best cluster allocation for a new individual, we can only use the given data without the predicted value variable since, for new individuals, this value is of course not given. Coming back to the basic sample, where the predicted value variable associated with each individual is now its cluster, we can use a supervised machine learning tool (SVM, neural network or any black box machine learning method) on these data. In that way, any new individual can be associated with a preference order of the clusters from the best allocation to the worst.
Hence, in this way, an individual can place the clusters in an order denoted as O2. We can also associate with any new individual its fit to the symbolic description associated with any obtained cluster. For example, in the numerical case, if the symbolic descriptions are density functions $f_j$, we can use the likelihood product of $f_j(x_j)$ for $j = 1, \dots, p$, where $x_j$ is the value taken by this individual for the $j$th initial variable. We can then order the clusters from their best to their worst fit to this individual. We can also replace $f_j(x_j)$ by $W(x_j, c)$ in the categorical case. Hence, in this way, an individual can place the clusters in an order denoted as O3. O3 is an explanatory order as it is based on the symbolic description of the clusters.

Finally, given a new individual, we can order the obtained clusters in three ways: O1, O2, and O3. Several strategies (see below) are then possible. Having chosen one of them, we can continue the machine learning process: we allocate the new individual to a cluster, add it to this cluster, then find a best fit representative, and so on, until the convergence of DCM, that is, until a new partition and its associated local models are obtained.

Machine learning filtering strategies: The idea is to add (i.e. to filter) a new individual to the cluster, and to the aggregation process leading to a new symbolic description, if it simultaneously improves as much as possible the fit between the cluster and its representative and the explanatory power of its associated symbolic description. The first kind of filtering strategy is to continue the learning process with only the individuals that have the same cluster (i.e. the same leader) in first position in the orders O1, O2, and O3. Another kind of strategy is to continue the machine learning process with only the individuals whose best-positioned clusters (in the three orders) are not beyond a given rank $k$. The individual is then allocated to the cluster of best rank according to O1, O2, or O3, alternately or depending on whether more explanatory power or a better decision is desired. Other strategies are also possible by adding OCI and (or) OCL to the orders O1, O2, and O3, or any order on the individuals given a priori by an expert. It is also possible to reduce the number of variables by selecting the first ones in the OCV order or by any machine learning method aiming to select variables. In any filtering strategy, the learning process progresses with individuals that improve the explanatory power of the machine learning as much as possible, without degrading, or degrading only slightly, the efficiency of the obtained rules. When a subpopulation is obtained, the process can continue with the remaining population and lead to other subpopulations, and so on, when the population increases or until the whole population has been considered.

1.6. Conclusion

The aim of this chapter was to give tools related to the part of our brain that needs to understand what happens, and not to the other parts of our brain that need to take efficient and quick decisions without knowing how (e.g. for face recognition). Classes obtained by clustering, or given a priori in unsupervised or supervised machine learning, are here considered as new units to be described in their main facets and to be studied by taking care of their internal variability. We have shown that classes can be obtained by DCM, which has many variants depending on the kind of representative (means as in the popular k-means clustering, but more generally principal components, regressions, distributions, canonical axes, etc.). Then, we have given tools for building symbolic data describing these classes, on which SDA methodology can be applied. We have focused on bar chart symbolic descriptions of classes, but of course other kinds of symbolic representation of classes can be done in


the same spirit. Several explanatory criteria have been defined, from which individuals, classes, symbolic variables, and symbolic data tables can be ordered from the most to the least characteristic. Much remains to be done in order to compare and improve the different criteria, and to extend them to the parametric and numerical cases. These tools have potential applications in many domains, for example, in order to compare the explanatory power of clustering algorithms or, more generally, to improve the explanatory power of machine learning. In practice, these criteria can be applied in order to find discordant countries of Europe on political questions or on illnesses and causes of death, to rank power plant cooling towers, to find boats at risk in a harbor, to find singular stock behavior, to find typical text sections of a book or specific web intrusions, to find discordant or concordant images, and so on. These explanatory tools have an immense potential for research and applications.

1.7. References

[AFO 10] AFONSO F., DIDAY E., BADEZ N. et al., “Symbolic data analysis of complex data: application to nuclear power plant”, COMPSTAT’2010, Paris, 2010. [AFO 18a] AFONSO F., DIDAY E., TOQUE C., Data Science par Analyse des Données Symboliques, Editions Technip, Paris, France, 2018. [AFO 18b] AFONSO F., DOLINAR A.L., KORENJAK-CERNE S. et al., “Analysis of gender-age-cause-specific mortality in European countries with SYR software”, in BRITO D. (ed), Symbolic Data Analysis Workshop SDA 2018, available at: https://sda2018.wixsite.com/ipvc, 2018. [BIL 03] BILLARD L., DIDAY E., “From the statistics of data to the statistic of knowledge: symbolic data analysis”, Journal of the American Statistical Association, vol. 98, no. 462, pp. 470–487, 2003. [BIL 06] BILLARD L., DIDAY E., Symbolic Data Analysis: Conceptual Statistics and Data Mining, Wiley, Chichester, p. 321, 2006.
[BOC 00] BOCK H.H., DIDAY E., Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, Springer-Verlag, Heidelberg, p. 425, 2000. [BOU 17] BOUGEARD S., ABDI H., SAPORTA G. et al., “Clusterwise analysis for multiblock component methods”, Advances in Data Analysis and Classification, vol. 12, no. 2, pp. 285–313, 2017. [BOU 18] BOUGEARD S., CARIOU, V., SAPORTA, G. et al., “Prediction for regularized clusterwise multiblock regression”, Applied Stochastic Models in Business and Industry, vol. 34, no. 2, pp. 1–16, 2018. [BRI 15] BRITO P., NOIRHOMME-FRAITURE M., ARROYO J., “Special issue on symbolic data analysis”, Editorial. Advances in Data Analysis and Classification, vol. 9, pp. 1–4, 2015.


[CHA 77] CHARLES C., Régression typologique et reconnaissance des forms, Thesis, Université Paris IX-Dauphine and INRIA Rocquencourt, Paris, France, 1977. [DEC 98] DE CARVALHO F.A.T., “Extension based proximity coefficients between constrained Boolean symbolic objects”, in HAYASHI C., YAJIMA K., BOCK H.H. et al. (eds), Proceedings of IFCS’96, Springer-Verlag, Berlin, pp. 370–378, 1998. [DEC 06] DE CARVALHO F.A.T., SOUZA R., CHAVENT M. et al., “Adaptive Hausdorff distances and dynamic clustering of symbolic interval data”, Pattern Recognition Letters, vol. 27, pp. 167–179, 2006. [DID 72] DIDAY E., Introduction à l’Analyse factorielle typologique, Report Laboria no. 27, INRIA, Rocquencourt, France, 1972. [DID 73] DIDAY E., “The dynamic clusters method in non-hierarchical clustering”, International Journal of Computer and Information Science, vol. 2, no. 1, pp. 61–88, 1973, doi: 10.1007/BF00987153. [DID 75] DIDAY E., SCHROEDER A., “A new approach in mixed distributions detection”, RAIRO, vol. 10, no. 6, pp. 75–106, 1975. [DID 76] DIDAY E., SCHROEDER A., “A new approach in mixed distributions detection”, A. Schroeder. RAIRO-Operations Research – Recherche Opérationnelle, vol. 10, no. 6, pp. 75–106, 1976. [DID 77] DIDAY E., GOVAERT G., “Classification avec distances adaptatives”, RAIRO, no. 4, 1977. [DID 78] DIDAY E., Analyse canonique du point de vu de la classification automatique, Report Laboria (April 1978) no. 293. INRIA, Rocquencourt, France. [DID 79] DIDAY E., SIMON J.C., “Clustering Analysis”, in FU K.S. (ed.), Communication and Cybernetics Digital Pattern Recognition, Springer Verlag, 1979. [DID 80] DIDAY E. et al., “Optimisation en classification automatique”, INRIA, France, 1980. [DID 86] DIDAY E., “Canonical analysis from the automatic classification point of view”, Control and Cybernetics, vol. 15, no. 2, pp. 115–137, 1986. 
[DID 05] DIDAY E., VRAC M., “Mixture decomposition of distributions by Copulas in the symbolic data analysis framework”, Discrete Applied Mathematics (DAM), vol. 147, no. 1, pp. 27–41, 2005. [DID 08] DIDAY E., NOIRHOMME M. (eds), Symbolic Data Analysis and the SODAS Software, Wiley, Chichester, 2008. [DID 13a] DIDAY E., AFONSO F., HADDAD R., “The symbolic data analysis paradigm, discriminant discretization and financial application”, HDSDA 2013 Conference, Beijing, China and RNTI-E-25, Paris, Hermann, pp. 1–14, 2013. [DID 13b] DIDAY E., “Principal component analysis for bar charts and Metabins tables”, Statistical Analysis and Data Mining, vol. 6, pp. 403–430, 2013, doi:10.1002/sam.11188.


[DID 16] DIDAY E., “Thinking by classes in data science: the symbolic data analysis”, WIREs Computational Statistics Symbolic Data Analysis, vol. 8, pp. 172–205, 2016. [DID 19] DIDAY E., “Pouvoir explicatif et discriminant de variables et de tableaux de données symboliques”, RNTI Journal. [DEM 77] DEMPSTER A., LAIRD N., RUBIN D., “Maximum likelihood from incomplete data with the EM algorithm”, Royal Statistical Society: Series B (Statistical Methodology), vol. 39, pp. 1–38, 1977. [EMI 18] EMILION R., DIDAY E., “Symbolic data analysis basic theory”, in DIDAY E., GUAN R., SAPORTA G., et al. (eds), Advances in Data Sciences, ISTE Ltd, London and John Wiley & Sons, New York, 2018. [GUA 13] GUAN R., LECHEVALLIER Y., SAPORTA G. et al., Advances in Theory and applications of High Dimensional and Symbolic Data Analysis, vol. E25, RNTI, Hermann, MO, 2013. [GOW 91] GOWDA K.C., DIDAY E., “Symbolic clustering using a new dissimilarity measure”, Pattern Recognition, vol. 24, pp. 567–578, 1991. [HOR 98] HORN S., PESCE A.J., COPELAND B.E., “A robust approach to reference interval estimation and evaluation”, Clinical Chemistry, vol. 44, pp. 622–631, 1998. [ICH 94] ICHINO M., YAGUCHI H., “Generalized Minkowski metrics for mixed feature-type data analysis”, IEEE Transactions on Systems, Man, and Cybernetics, vol. 24, pp. 698–707, 1994. [IRP 08] IRPINO A., VERDE R., “Dynamic clustering of interval data using a Wassersteinbased distance”, Pattern Recognition Letter, vol. 29, pp. 1648–1658, 2008. [LEB 84] LEBART L., MORINEAU A., WARWICK K.M., Multivariate Descriptive Statistical Analysis, Wiley, New York, NY, 1984. [ROB 04] ROBERTSON S., “Understanding inverse document frequency: on theoretical arguments for IDF”, Journal of Documentation, vol. 60, no. 5, pp. 503–520, 2004, doi:10.1108/00220410410560582. [ROY 86] ROYALL R.M., “Model robust confidence intervals using maximum likelihood estimators”, International Statistical Review, vol. 54, pp. 221–226, 1986. 
[SU 16] SU S.-F., PEDRYCZ W., HONG T.-P. et al., “Special issue on granular/symbolic data processing”, IEEE Transactions on Cybernetics, vol. 46, pp. 344–401, 2016. [VIN 11] VINET F., BOISSIER L., DEFOSSEZ, S., “La mortalité comme expression de la vulnérabilité humaine face aux catastrophes naturelles: deux inondations récentes en France (Xynthia, Var, 2010)”, [VertigO] La revue électronique en sciences de l’environnement, vol. 11, no. 2, 2011.

2 Likelihood in the Symbolic Context

2.1. Introduction

Analyzing classes (or groups) of raw data, with each class being considered as a statistical unit, can be of interest, for example, when dealing with objects having a complex behavior, or dealing with a very large dataset split into groups of data. While a standard statistical analysis can work when each class is summarized by the class mean, the problem becomes less obvious when the class summarization is, for example, an estimator of the distribution of that class data. According to E. Diday, who introduced the paradigm of “Symbolic Data Analysis” [DID 87], a symbol of a statistical unit is any mathematical object summarizing the variability internal to that unit (see also [BIL 03, BIL 06, BOC 99, DID 16]). For example, a symbol of a class of real data can be just a real number (e.g. the class mean or its variance), but also an interval (the class range), a function (e.g. the class empirical c.d.f. or a histogram built from that class), or a probability distribution (e.g. a theoretical distribution estimated from that class data).

In this chapter, our first aim is to propose a probabilistic framework for properly defining symbolic data as statistical units, slightly modifying the framework proposed in [EMI 15]. Our second aim is to consider the problem of defining distributions on symbols or, more simply, to propose some likelihood functions for finite-dimensional symbols. In fact, in the case where symbols are probability distributions, the problem is not new: J.F.C. Kingman [KIN 75], in a paper introducing the famous Poisson–Dirichlet random distribution, mentions many previous papers where “people are interested in describing the probability distribution of objects which are themselves probability distributions.” Likelihood for interval-valued variables was studied in [LER 11] and [SAN 11] and, more recently, in [ZHA 16]. As likelihood

Chapter written by Richard EMILION and Edwin DIDAY.

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.


for symbols assigns a number to each symbol, it can be used for ranking, prediction, outlier detection based on symbols, and so on. A recent method [BER 18] seems to indicate that likelihood for symbols even substantially reduces the complexity of the standard likelihood estimation for a large dataset.

This chapter is structured as follows. Symbols and likelihood on symbols with respect to a class variable are rigorously defined in a probabilistic setting in section 2.2. Density models for one variable of probability vector symbols are presented in the parametric case and in the nonparametric case in sections 2.3 and 2.4, respectively. The case of several correlated variables is presented in section 2.5 before a short conclusion in section 2.6.

2.2. Probabilistic setting

2.2.1. Description variable and class variable

The following setting was introduced in [EMI 15]. Consider a population of individuals represented by a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and let $(X, C)$ be a pair of observable correlated random variables (r.v.) defined as follows. The r.v.

\[ X : \Omega \longrightarrow \mathbb{V}, \ \text{measurable w.r.t. } \mathcal{F} \text{ and } \mathcal{V}, \qquad [2.1] \]

describes the individuals, $(\mathbb{V}, \mathcal{V})$ being a measurable space of descriptions. Generally, $\mathbb{V}$ will be a measurable subset of $\mathbb{R}^p$, and $X$ will be denoted by $(X_1, \dots, X_p)$. The class variable

\[ C : \Omega \longrightarrow \mathbb{C}, \ \text{measurable w.r.t. } \mathcal{F} \text{ and } \mathcal{C}, \]

assigns a class label to each individual, $(\mathbb{C}, \mathcal{C})$ being a measurable space of class labels. Generally, $\mathbb{C}$ will be finite or countable, but in some cases, it can be uncountable. By class with label $c \in \mathbb{C}$, or class $c$ for short, we mean the following subset of individuals:

\[ (C = c) = \{\omega \in \Omega : C(\omega) = c\}. \qquad [2.2] \]

We will assume that the singletons $\{c\}$ belong to $\mathcal{C}$, $\forall c \in \mathbb{C}$, so that the classes, for $c$ in the range of $C$, form a measurable partition of $\Omega$.


2.2.2. Conditional distributions

Recall that the distribution of $X$, denoted by $P_X$, is the probability measure defined on $\mathcal{V}$ by $P_X(A) = \mathbb{P}(X \in A)$, $\forall A \in \mathcal{V}$, and the distribution of $C$, denoted by $P_C$, is the probability measure defined on $\mathcal{C}$ by $P_C(B) = \mathbb{P}(C \in B)$, $\forall B \in \mathcal{C}$. As usual, if $\mathbb{C}$ is finite or countable, then the σ-algebra $\mathcal{C}$ is the powerset of $\mathbb{C}$. Class $c$ and $P_C$ are such that

\[ P_C(\{c\}) = \mathbb{P}(C = c), \ \forall c \in \mathbb{C}. \]

The variability of $X$ within each class $c \in \mathbb{C}$ is described by the conditional distribution $P_{X|C=c}$ of $X$ given $C = c$. Recall that if the range of $C$ is countable, this conditional distribution is a probability measure on $\mathcal{V}$ defined as

\[ P_{X|C=c}(A) = \frac{\mathbb{P}(X \in A, C = c)}{\mathbb{P}(C = c)}, \ \forall A \in \mathcal{V}, \]

whenever $\mathbb{P}(C = c) > 0$, while the definition is rather nontrivial if the range of $C$ is uncountable. Further, we will assume the existence of an RCP (Regular Conditional Probability), which ensures that the mapping

\[ (c, A) \longmapsto P_{X|C=c}(A) \qquad [2.3] \]

is measurable with respect to the σ-algebra $\mathcal{C} \otimes \mathcal{V}$ and the Borel σ-algebra of $[0, 1]$.

2.2.3. Symbolic variables

The following setting was introduced in [EMI 15], but the variable $S$ was defined there on $\Omega$, while it is defined here on $\mathbb{C}$ since each $c \in \mathbb{C}$ should be considered a statistical unit. Let us denote by $M_1(\mathbb{V})$ the space of all probability measures on $(\mathbb{V}, \mathcal{V})$ endowed, as usual, with the smallest σ-algebra making measurable all the mappings

\[ \varphi_A : M_1(\mathbb{V}) \longrightarrow [0, 1], \quad P \longmapsto P(A), \]

for all $A \in \mathcal{V}$.


The main object of our interest is the following distribution-valued mapping $T$, defined on the class label space $\mathbb{C}$ and taking value in $M_1(\mathbb{V})$:

\[ T : \mathbb{C} \longrightarrow M_1(\mathbb{V}), \quad c \longmapsto P_{X|C=c}. \]

Observe that a consequence of the existence of an RCP is that the mapping $T$ is measurable due to [2.3], since

\[ \varphi_A \circ T : \mathbb{C} \longrightarrow [0, 1], \quad c \longmapsto P_{X|C=c}(A) \]

is measurable $\forall A \in \mathcal{V}$. However, as a general probability measure is not always easily handled, it will be more convenient to deal with some functions of $T$, justifying the following definition.

DEFINITION.– A symbolic variable $S$ of the context $(X, C)$ is any measurable function of $T$, say $S = f(T)$, where $f : M_1(\mathbb{V}) \longrightarrow \mathbb{S}$ is a measurable function taking value in some measurable space $(\mathbb{S}, \mathcal{S})$, so that

\[ S : \mathbb{C} \longrightarrow \mathbb{S}, \quad S(c) = f(P_{X|C=c}). \]

Any $S(c)$, where $c \in \mathbb{C}$, will be a “symbolic data” summarizing the variability of the data within class $c$, that is, the variability of $X(\omega)$ for $\omega \in (C = c)$ (see [2.2]). On the other hand, note that $S$, just like $T$, is defined on $\mathbb{C}$ and not on $\Omega$, the space of basic individuals. Therefore, the correlation of $X$ and $S$ has no meaning. However, we may eventually be interested in the correlation of $X$ and $S \circ C$, since both of these r.v. are defined on $\Omega$. The variable $S \circ C$ was the $S$ proposed in [EMI 15], but if our aim is to consider classes as statistical units, then $S$ should be defined on $\mathbb{C}$.

Generally, the distribution of $(X, C)$ depends on a parameter, e.g. $\theta$, and the notation $S(c|\theta)$ will be adopted to emphasize the dependence of $S$ on $\theta$. If $\theta$ is random, as in the Bayesian setting, we will be led to consider the mapping

\[ \mathbb{C} \times \Omega \longrightarrow \mathbb{S}, \quad (c, \omega) \longmapsto S(c|\theta(\omega)). \qquad [2.4] \]
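The definition of a symbolic variable as a function of the conditional distribution can be made concrete with a small sketch, in which the empirical distribution of the observations of each class stands in for $P_{X|C=c}$. This plug-in choice, the function names and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean

def empirical_conditionals(xs, cs):
    """Group the observations by class label: empirical stand-in for P_{X|C=c}."""
    groups = defaultdict(list)
    for x, c in zip(xs, cs):
        groups[c].append(x)
    return dict(groups)

def symbolic_variable(xs, cs, f):
    """S(c) = f(P_{X|C=c}), with f applied to the sample observed in class c."""
    return {c: f(sample) for c, sample in empirical_conditionals(xs, cs).items()}

def cond_prob(xs, cs, c, in_A):
    """Estimate P(X in A | C = c) = P(X in A, C = c) / P(C = c) by counting."""
    n_c = sum(1 for ci in cs if ci == c)
    if n_c == 0:
        raise ValueError("class with no observation: P(C = c) is estimated as 0")
    n_Ac = sum(1 for x, ci in zip(xs, cs) if ci == c and in_A(x))
    return n_Ac / n_c

xs = [1.0, 2.0, 3.0, 10.0, 12.0]   # descriptions X(omega)
cs = ["a", "a", "a", "b", "b"]     # class labels C(omega)

means = symbolic_variable(xs, cs, mean)                             # real-valued symbol
ranges = symbolic_variable(xs, cs, lambda s: (min(s), max(s)))      # interval symbol
p = cond_prob(xs, cs, "a", lambda x: x >= 2.0)                      # P(X >= 2 | C = a)
```

Each dictionary is indexed by the class labels, not by the individuals, which mirrors the fact that $S$ is defined on the class label space rather than on $\Omega$.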


2.2.4. Examples

2.2.4.1. Interval-valued variables

Assume that $X : \Omega \longrightarrow \mathbb{V} = \mathbb{R}$ and that the support of $P_{X|C=c}$ is a closed interval $[l_c, u_c]$; then $S(c) = [l_c, u_c]$ defines an interval-valued symbolic variable. More generally, assume that $X = (X_1, \dots, X_p) : \Omega \longrightarrow \mathbb{V} = \mathbb{R}^p$, where $p \geq 1$ is an integer, and assume that the support of $P_{X_j|C=c}$ is a closed interval $[l_c^j, u_c^j]$; then

\[ S(c) = ([l_c^1, u_c^1], \dots, [l_c^p, u_c^p]) \]

defines an interval vector-valued symbolic variable. A simpler symbolic variable, taking value in $\mathbb{S} = \mathbb{R}^{2p}$, is

\[ S(c) = ((l_c^1, u_c^1), \dots, (l_c^p, u_c^p)), \]

and a more convenient one is

\[ S(c) = ((a_c^1, b_c^1), \dots, (a_c^p, b_c^p)), \qquad [2.5] \]

where $a_c^j = (l_c^j + u_c^j)/2$ is the midrange of the $j$th interval and $b_c^j = (u_c^j - l_c^j)/2$ is its half-length.
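The midrange and half-length representation can be sketched as follows. The sketch assumes that the bounds of the support are estimated by the sample minimum and maximum within the class, which is only one possible estimator; the function names and data are hypothetical.

```python
def interval_symbol(sample):
    """Return (l, u, a, b): bounds, midrange a = (l + u) / 2, half-length b = (u - l) / 2."""
    l, u = min(sample), max(sample)
    return l, u, (l + u) / 2, (u - l) / 2

def interval_vector_symbol(rows):
    """rows: p-dimensional observations of one class; returns one (a_j, b_j) per variable."""
    return [interval_symbol(column)[2:] for column in zip(*rows)]

# Three individuals of one class c, described by p = 2 real variables.
rows_c = [(1.0, 10.0), (3.0, 14.0), (2.0, 12.0)]
symbol_c = interval_vector_symbol(rows_c)
```

Working with the pairs (midrange, half-length) rather than (lower, upper) removes the ordering constraint between the two coordinates, leaving only the requirement that the half-length be nonnegative.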

2.2.4.2. Probability-vector-valued and histogram variables

Assume that $X : \Omega \longrightarrow \mathbb{V}$, where $\mathbb{V}$ is a compact subset of $\mathbb{R}$, and let $V_1, \dots, V_m$ be a partition of $\mathbb{V}$ into $m$ adjacent intervals of finite length. Then,

\[ S(c) = (P_{X|C=c}(V_1), \dots, P_{X|C=c}(V_m)) \qquad [2.6] \]

defines a symbolic variable taking value in the $m$-simplex of probability vectors $\mathbb{T}_m$:

\[ \mathbb{S} = \mathbb{T}_m = \Big\{ y = (y_1, \dots, y_m) \in \mathbb{R}_+^m : \sum_{l=1}^{m} y_l = 1 \Big\}. \]

Note that, if $\lambda$ denotes the Lebesgue measure on $\mathbb{R}$, the stepwise function that takes the values

\[ \frac{P_{X|C=c}(V_1)}{\lambda(V_1)}, \dots, \frac{P_{X|C=c}(V_m)}{\lambda(V_m)} \]

on $V_1, \dots, V_m$, respectively, is nothing but the popular histogram function with respect to that partition.
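As a minimal sketch of [2.6] and of the associated histogram, the class probabilities can be estimated by relative frequencies over the fixed partition. The representation of the partition by its cut points, the half-open bin convention and the function names are assumptions of this sketch.

```python
from bisect import bisect_right

def probability_vector(sample, edges):
    """Estimate (P(V_1), ..., P(V_m)) for bins V_l = [e_{l-1}, e_l), last bin closed.

    edges is the increasing list of cut points e_0 < e_1 < ... < e_m.
    """
    m = len(edges) - 1
    counts = [0] * m
    for x in sample:
        if not edges[0] <= x <= edges[-1]:
            raise ValueError("observation outside the partitioned set")
        counts[min(bisect_right(edges, x) - 1, m - 1)] += 1
    return [count / len(sample) for count in counts]

def histogram_heights(prob, edges):
    """Histogram function: P(V_l) divided by the length of V_l."""
    return [p / (edges[l + 1] - edges[l]) for l, p in enumerate(prob)]

edges = [0.0, 1.0, 2.0, 4.0]        # V_1 = [0, 1), V_2 = [1, 2), V_3 = [2, 4]
sample_c = [0.5, 1.5, 1.7, 3.0]     # observations of X within one class c
s_c = probability_vector(sample_c, edges)
heights = histogram_heights(s_c, edges)
```

The vector `s_c` lies in the simplex, whereas `heights` integrates to one over the partitioned set, which is the difference between the probability-vector symbol and the histogram function.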


Instead of having a fixed partition for all the labels $c$, we can get more flexibility by considering moving adjacent intervals, obtained from fixed adjacent intervals by a translation by a real $t_c$ that depends on $c$, e.g. $t_c + V_1, \dots, t_c + V_m$, where $V_1, \dots, V_m$ are fixed adjacent intervals chosen by the user. Then,

\[ S(c) = (t_c, P_{X|C=c}(t_c + V_1), \dots, P_{X|C=c}(t_c + V_m)) \]

defines a symbolic variable taking value in $\mathbb{S} = \mathbb{R} \times \mathbb{T}_m$. As in section 2.2.4.1, this can be extended to the case where $X$ takes value in $\mathbb{R}^p$, with then $\mathbb{S} = (\mathbb{R} \times \mathbb{T}_{m_1}) \times \dots \times (\mathbb{R} \times \mathbb{T}_{m_p})$.

2.2.4.3. Unpaired data and marginal distributions

In the two preceding examples, we were considering conditionals of the margins $X_j$, namely $P_{X_j|C=c}$, rather than conditionals of $X$, for $X = (X_1, \dots, X_j, \dots, X_p)$ taking value in $\mathbb{R}^p$, so that the symbolic variable is of the following form:

\[ S(c) = (f_1(P_{X_1|C=c}), \dots, f_p(P_{X_p|C=c})). \]

A first reason for this is that the estimation of a joint distribution is often a complex problem. Another is the case of unpaired data: samples of $X_1$, ..., samples of $X_p$ are observed, but the observations do not come from the same individuals and, moreover, these samples can be of different sizes, while samples of $X$ are not observed. Then, obviously, neither the correlations of the $X_j$'s nor the conditional joint distributions $P_{X|C=c}$ can be estimated, while the marginal distributions $P_{X_j|C=c}$ can be estimated.

2.2.4.4. Finite mixtures

If $P_X$ is a mixture of $K$ distributions belonging to a specific parametric family, e.g.

\[ P_X = \sum_{k=1}^{K} q_k D_k(\theta_k), \ \text{where } q_k \geq 0 \text{ and } \sum_{k=1}^{K} q_k = 1, \]

then $S(c) = (q_c, \theta_c)$, $c = 1, \dots, K$, defines a symbolic variable. In fact, when the parameters of the mixture are estimated, for example, by EM-like algorithms, the r.v. $C$ taking value in $\mathbb{C} = \{1, 2, \dots, K\}$ is actually a random fuzzy variable assigning to each individual a membership degree to class $k$.
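The membership degrees mentioned for the finite mixture case can be sketched as posterior class probabilities. The sketch assumes Gaussian components with known parameters, a choice made here for illustration since the text leaves the parametric family $D_k$ unspecified; the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def membership_degrees(x, weights, params):
    """P(C = k | X = x) for the mixture sum_k q_k N(mu_k, sigma_k^2)."""
    joint = [q * normal_pdf(x, mu, sigma) for q, (mu, sigma) in zip(weights, params)]
    total = sum(joint)
    return [value / total for value in joint]

weights = [0.5, 0.5]               # q_1, q_2
params = [(0.0, 1.0), (4.0, 1.0)]  # theta_k = (mu_k, sigma_k)

symbols = list(zip(weights, params))              # S(c) = (q_c, theta_c), c = 1, 2
halfway = membership_degrees(2.0, weights, params)
near_first = membership_degrees(0.0, weights, params)
```

An individual located halfway between the two components belongs equally to both classes, which is the sense in which the class variable behaves as a fuzzy variable rather than a hard label.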


2.2.4.5. Function-valued variables

Assume that $X : \Omega \longrightarrow \mathbb{V} = \mathbb{R}$ and that $P_{X|C=c}$ has the density $d_c$, c.d.f. $F_c$, and quantile function $\Psi_c = F_c^{-1}$. Then,

\[ S(c) = d_c, \quad S(c) = F_c, \quad S(c) = \Psi_c, \]

respectively, are three function-valued symbolic variables. The idea of using quantile functions in the symbolic context is due to M. Ichino [ICH 11].

2.2.5. Probability measures on $(\mathbb{C}, \mathcal{C})$, likelihood

The most natural probability measure $Q$ that can be put on the class label space $(\mathbb{C}, \mathcal{C})$ is the distribution $P_C$ of the r.v. $C$, and this does not pose any problem when $\mathbb{C}$ is finite or countable. However, if $S$ takes value in $\mathbb{R}^m$ for some integer $m \geq 1$ and has a density $d_S$ with respect to the Lebesgue measure $\lambda_m$ on $\mathbb{R}^m$, then $\mathbb{C}$ should be uncountable and $Q$ should be such that $Q(\{c\}) = 0$. It can even be shown, under some conditions, that $Q$ should be nonatomic. Several previous papers, which propose a density model in the symbolic context, do not pay attention to these basic conditions required for $\mathbb{C}$ and $Q$.

Therefore, we consider a pair $(X, C)$ with an uncountable space of class labels $\mathbb{C}$, its σ-algebra $\mathcal{C}$, and a nonatomic probability measure $Q$ defined on $\mathcal{C}$, and $S = f(T) : \mathbb{C} \longrightarrow \mathbb{R}^m$ a symbolic variable whose distribution has a density, e.g. $d_S$, with respect to the Lebesgue measure $\lambda_m$ on $\mathbb{R}^m$, where

\[ d_S : \mathbb{R}^m \longrightarrow \mathbb{R}_+, \quad \underline{s} \longmapsto d_S(\underline{s}), \qquad [2.7] \]

the underline denoting vectors.

Observe that if $S$, as in [2.6], takes value in the simplex $\mathbb{T}_m$ considered as the set of probability distributions on the finite set $\{1, \dots, m\}$, then the distribution of $S$ is a distribution on a space of distributions, or a distribution of distributions for short. Extending this to infinite sets was an exciting problem solved in 1973 in a celebrated paper by T.S. Ferguson [FER 73]. Our problem is to propose some appropriate models for estimating $d_S$ given an $n$-sample of symbolic data $\underline{s}_1, \dots, \underline{s}_n$ such that

\[ \underline{s}_i = (s_{i,1}, \dots, s_{i,m}) = S^{(i)}(c) \in \mathbb{R}^m, \quad S^{(i)} \overset{\text{i.i.d.}}{\sim} Q_S, \quad i = 1, \dots, n, \]

for some $c \in \mathbb{C}$ expressing the randomness of the sample of symbols.


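As usual, if the density $d_S$ defined in [2.7] depends on a parameter $\theta$ belonging to a measurable space $\Theta$, then $d_S$ will be denoted by $d_S(\cdot|\theta)$ and $d_S(\underline{s})$ by $d_S(\underline{s}|\theta)$. Given a sample of symbols $\underline{s}_1, \dots, \underline{s}_n$, the likelihood function is

\[ L_{\underline{s}_1, \dots, \underline{s}_n}(\theta) = \prod_{i=1}^{n} d_S(\underline{s}_i|\theta), \quad \theta \in \Theta. \]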

In some of the examples given in the next section, our models will appear as hierarchical models.

2.3. Parametric models for p = 1

2.3.1. LDA model

In our first example, we detail, in our symbolic context, the interesting Latent Dirichlet Allocation model [BEN 09a], or LDA model for short, which is popular in text mining and text classification, and can also be used in various domains such as collaborative filtering, as detailed in [BEN 09a]. In the following, a document and a corpus, that is, a finite set of documents, are defined in mathematical terms, and their respective probabilities are computed in the proposed model. Recall that a random probability vector $\theta = (\theta_1, \dots, \theta_k) : \Omega \longrightarrow \mathbb{T}_k$ follows a Dirichlet distribution with the parameter $\alpha = (\alpha_1, \dots, \alpha_k) \in \mathbb{R}_+^k$ if $(\theta_1, \dots, \theta_{k-1})$ has the popular Dirichlet density

\[ Dd(y|\alpha) = \frac{\Gamma(\alpha_1 + \dots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \, y_1^{\alpha_1 - 1} \cdots y_{k-1}^{\alpha_{k-1} - 1} \Big(1 - \sum_{i=1}^{k-1} y_i\Big)^{\alpha_k - 1} \mathbb{1}_{U_{k-1}}(y), \qquad [2.8] \]

where

\[ U_{k-1} = \Big\{ y = (y_1, \dots, y_{k-1}) \in \mathbb{R}_+^{k-1} : \sum_{i=1}^{k-1} y_i \leq 1 \Big\}. \]

With regard to [2.1], the LDA model assumes that $p = 1$ and first concerns: a categorical r.v. $X : \Omega \longrightarrow \mathbb{V} = \{1, \dots, k\}$ taking value in a finite set of $k$ topics, an integer-valued r.v. $N : \Omega \longrightarrow \mathbb{N} = \{0, 1, 2, \dots\}$, and a random probability vector $\theta = (\theta_1, \dots, \theta_k) : \Omega \longrightarrow \mathbb{T}_k$, such that

\[
\begin{cases}
(N, \theta) \sim Poisson(\xi) \otimes Dirichlet(\alpha), \\
\mathbb{P}(X = i|\theta) = \theta_i, \ i = 1, \dots, k.
\end{cases} \qquad [2.9]
\]


In fact, in [BEN 09a], topic $i = 1, \dots, k$ is represented by the unit basis vector in dimension $k$, having component $i$ equal to one and all other components equal to zero, and the categorical distribution in the second line of [2.9] is written in the equivalent form $P_{X \mid \theta} = \mathrm{Multinomial}(\theta)$. Next, let $\{1, \dots, V\}$ be a finite set of $V$ words, and let $\beta = (\beta_{i,j})$, $i = 1, \dots, k$, $j = 1, \dots, V$, be a $k \times V$ Markov matrix, each of its $k$ rows being a probability vector in dimension $V$. A document is considered as an outcome $c$ of our class random variable $C$, defined as a sequence of random words:

$$ \begin{cases} C(\omega) = (W^{(1)}(\omega), \dots, W^{(N(\omega))}(\omega)), \quad \omega \in \Omega \\ \text{where, given } N \text{ and } \theta, \\ X^{(r)} \overset{\text{i.i.d.}}{\sim} P(X \mid \theta), \text{ for each } r = 1, \dots, N \\ W^{(r)} : \Omega \to \{1, \dots, V\}, \ r = 1, \dots, N, \text{ are independent} \\ P(W^{(r)} = v \mid X^{(r)}) = \beta_{(X^{(r)}, v)}, \text{ for each } r = 1, \dots, N, \ v = 1, \dots, V. \end{cases} \qquad [2.10] $$

The first line in [2.10] concerns the $N$ observed words $W^{(1)}(\omega), \dots, W^{(N(\omega))}(\omega)$, the third line concerns the distribution of the unobserved topics, and the last line specifies how word $r$ is distributed given topic $r$. Again, in [BEN 09a], the last line in [2.10] is written in the equivalent form $P_{W^{(r)} \mid X^{(r)}} = \mathrm{Multinomial}(\beta_{(X^{(r)}, \cdot)})$. Also, according to [2.2], given a class label $c = (w_1, \dots, w_N)$, class $c$ is defined as

$$ (C = c) = \{ \omega \in \Omega : W^{(1)}(\omega) = w_1, \dots, W^{(N)}(\omega) = w_N \}. $$

The variability of the topic variable of a document $c$ having $N$ words, e.g. $c = (w_1, \dots, w_N)$ with unobserved topics $(x_1, \dots, x_N)$, respectively, can be described by the symbol

$$ s(c) = \Big( \sum_{r=1}^{N} 1_{(x_r = 1)}, \dots, \sum_{r=1}^{N} 1_{(x_r = k)} \Big) $$

where $1_A$ is equal to 1 if expression $A$ is true and is 0 otherwise. Observe that the symbol is therefore latent. The distribution of the random symbol defined on $\Omega$ by

$$ S = s \circ C = \Big( \sum_{r=1}^{N} 1_{(X^{(r)} = 1)}, \dots, \sum_{r=1}^{N} 1_{(X^{(r)} = k)} \Big), $$

given $N$ and $\theta$, is therefore

$$ \begin{cases} P_{S \mid N, \theta} = \mathrm{Multinomial}(N, \theta), \text{ that is} \\ P(S = (n_1, \dots, n_k) \mid N = n, \theta) = \dfrac{n!}{n_1! \cdots n_k!} \, \theta_1^{n_1} \cdots \theta_k^{n_k} \quad \text{if } n_1 + \dots + n_k = n \end{cases} \qquad [2.11] $$

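and this defines our symbolic likelihood in the LDA model. From [2.9] and [2.10], it can be seen that the probability of a topic $x_r$ and of a word $w_r$ can be written as follows:

$$ \begin{cases} p(x_r \mid \theta) = \prod_{i=1}^{k} \theta_i^{1_{x_r = i}} \\ p(w_r \mid x_r, \beta) = \prod_{j=1}^{V} \beta_{x_r, j}^{1_{w_r = j}} \end{cases} \qquad [2.12] $$

since, for a given $x_r$, exactly one of the indicators $1_{x_r = i}$, $i = 1, \dots, k$, is equal to 1 and the others are equal to 0, and the same holds for $1_{w_r = j}$, $j = 1, \dots, V$. Then,

$$ \begin{cases} p(x_r, w_r \mid \theta, \beta) = \prod_{i=1}^{k} \theta_i^{1_{x_r = i}} \prod_{j=1}^{V} \beta_{x_r, j}^{1_{w_r = j}} \\ p(w_r \mid \theta, \beta) = \sum_{x_r} \prod_{i=1}^{k} \theta_i^{1_{x_r = i}} \prod_{j=1}^{V} \beta_{x_r, j}^{1_{w_r = j}} \end{cases} \qquad [2.13] $$

so that, given $N$, the probability of a document is

$$ \begin{cases} p(w_1, \dots, w_N \mid \theta, \beta, N) = \prod_{r=1}^{N} \sum_{x_r} \prod_{i=1}^{k} \theta_i^{1_{x_r = i}} \prod_{j=1}^{V} \beta_{x_r, j}^{1_{w_r = j}} \\ p(w_1, \dots, w_N \mid \alpha, \beta, N) = \int Dd(\theta \mid \alpha) \prod_{r=1}^{N} \sum_{x_r} \prod_{i=1}^{k} \theta_i^{1_{x_r = i}} \prod_{j=1}^{V} \beta_{x_r, j}^{1_{w_r = j}} \, d\theta \end{cases} \qquad [2.14] $$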

where $Dd(\theta \mid \alpha)$ is understood as $Dd(\theta_1, \dots, \theta_{k-1} \mid \alpha)$ defined in [2.8]. Finally, if a corpus $CO$ is defined as a set of $M$ documents with parameters $(\theta_d, N_d)$, $d = 1, \dots, M$, respectively, then the computation of the probability of a corpus is straightforward and provides the following likelihood, used in the estimation procedure in [BEN 09a]:

$$ P(CO) = \prod_{d=1}^{M} \int Dd(\theta_d \mid \alpha) \prod_{r=1}^{N_d} \sum_{x_{dr}} \prod_{i=1}^{k} \theta_{d,i}^{1_{x_{dr} = i}} \prod_{j=1}^{V} \beta_{x_{dr}, j}^{1_{w_{dr} = j}} \, d\theta_d. $$

In [BEN 09a], the LDA model was fit to a TREC AP corpus of M = 16,333 documents with V = 23,075 words and k = 100 topics, in order to classify these documents.
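The generative mechanism [2.10], the latent symbol $s(c)$ and the likelihoods [2.11] and [2.14] can be sketched in Python as follows. This is only an illustration under stated assumptions: topics and words are indexed from 0, the document length is supplied by the caller rather than drawn from the Poisson distribution in [2.9], and all function names are hypothetical.

```python
import math
import random


def sample_document(n_words, theta, beta, rng):
    """Sample topics x_r ~ Categorical(theta), then words w_r ~ Categorical(beta[x_r])."""
    k, n_vocab = len(theta), len(beta[0])
    topics = rng.choices(range(k), weights=theta, k=n_words)
    words = [rng.choices(range(n_vocab), weights=beta[x], k=1)[0] for x in topics]
    return topics, words


def topic_count_symbol(topics, k):
    """Latent symbol s(c): number of words assigned to each of the k topics."""
    counts = [0] * k
    for x in topics:
        counts[x] += 1
    return counts


def multinomial_logpmf(counts, theta):
    """Log of the symbolic likelihood [2.11], with n = sum(counts)."""
    out = math.lgamma(sum(counts) + 1)
    for n_i, t_i in zip(counts, theta):
        out -= math.lgamma(n_i + 1)
        if n_i > 0:  # skip empty cells so that 0 * log(0) is never evaluated
            out += n_i * math.log(t_i)
    return out


def document_logprob(words, theta, beta):
    """Log of p(w_1, ..., w_N | theta, beta, N) in [2.14]: each topic is summed out."""
    return sum(
        math.log(sum(t * row[w] for t, row in zip(theta, beta))) for w in words
    )
```

The inner sum in `document_logprob` uses the fact that, once the indicators in [2.13] are resolved, $p(w_r \mid \theta, \beta) = \sum_{i} \theta_i \beta_{i, w_r}$. The integral over $\theta$ in the second line of [2.14] is not computed here.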


2.3.2. BLS method

In our second example, we consider, again in our symbolic context, a recent method due to B. Beranger, H. Lin and S.A. Sisson [BER 18], where symbols and the likelihood of such symbols are defined from samples of a r.v. and are used to provide a very fast approximation of the classical likelihood. The setting is somewhat similar to that of the preceding LDA model. Let $X : \Omega \to \mathbb{R}^p$ be a r.v. with density function $d_X(\cdot \mid \theta)$ depending on a parameter $\theta$. For any integer $N \ge 2$, our class random variable $C$ is defined on $\Omega$ as follows:

$$ \begin{cases} C(\omega) = (X^{(1)}(\omega), \dots, X^{(N)}(\omega)), \quad \omega \in \Omega, \text{ where} \\ X^{(r)} \overset{\text{i.i.d.}}{\sim} P_X, \quad r = 1, \dots, N. \end{cases} \qquad [2.15] $$

Again, according to [2.2], given a class label $c = (x_1, \dots, x_N)$ (that is, an observed sample), class $c$ is defined as

$$ (C = c) = \{ \omega \in \Omega : X^{(1)}(\omega) = x_1, \dots, X^{(N)}(\omega) = x_N \} $$

and its probability is

$$ P(C = c) = \prod_{r=1}^{N} d_X(x_r \mid \theta). $$

Let $l \ge 2$ be an integer, and let $B_b$, $b = 1, \dots, l$, be a finite measurable partition of $\mathbb{R}^p$. Here, the variability of $X$ can be described by a counting symbol $s$:

$$ s(c) = \Big( \sum_{r=1}^{N} 1_{(x_r \in B_1)}, \dots, \sum_{r=1}^{N} 1_{(x_r \in B_l)} \Big) $$

and the distribution of the random symbol defined on $\Omega$ by

$$ S = s \circ C = \Big( \sum_{r=1}^{N} 1_{(X^{(r)} \in B_1)}, \dots, \sum_{r=1}^{N} 1_{(X^{(r)} \in B_l)} \Big) $$

is $\mathrm{Multinomial}(N, p_1, \dots, p_l)$, with $p_b = \int_{B_b} d_X(x \mid \theta) \, dx$, $b = 1, \dots, l$, just as in the LDA case [2.11]:

$$ P(S = (n_1, \dots, n_l)) = \frac{N!}{n_1! \cdots n_l!} \, p_1^{n_1} \cdots p_l^{n_l}, \qquad n_1 + \dots + n_l = N, \qquad [2.16] $$

defining a symbolic likelihood.


Further, it is shown in [BER 18] that the classical likelihood $L$ can be recovered very quickly by choosing appropriate partitions such that

$$ \lim_{l \to +\infty} P(S = (n_1, \dots, n_l)) = L(x_1, \dots, x_N \mid \theta) = \prod_{r=1}^{N} d_X(x_r \mid \theta). $$
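To make the counting symbol and the symbolic likelihood [2.16] concrete, here is a minimal Python sketch in the univariate case $p = 1$. The Gaussian model for $d_X(\cdot \mid \theta)$, the partition of the real line by a sorted list of breakpoints, and the function names are assumptions made for illustration only; they are not taken from [BER 18].

```python
import bisect
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def count_symbol(sample, breaks):
    """Counting symbol s(c) for the bins (-inf, b_1], (b_1, b_2], ..., (b_last, +inf).

    breaks must be sorted in increasing order.
    """
    counts = [0] * (len(breaks) + 1)
    for x in sample:
        counts[bisect.bisect_left(breaks, x)] += 1
    return counts


def symbolic_loglik(counts, breaks, mu, sigma):
    """Log of the multinomial symbolic likelihood [2.16] under a N(mu, sigma^2) model."""
    edges = [-math.inf] + list(breaks) + [math.inf]
    loglik = math.lgamma(sum(counts) + 1)
    for n_b, lo, hi in zip(counts, edges, edges[1:]):
        p_b = normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)  # mass of bin B_b
        loglik -= math.lgamma(n_b + 1)
        if n_b > 0:  # empty bins contribute nothing
            loglik += n_b * math.log(p_b)
    return loglik
```

Only the $l$ counts are needed to evaluate `symbolic_loglik`, whatever the sample size $N$, which is the source of the speed-up over the classical likelihood. Maximising it over `mu` and `sigma` would be done with any numerical optimiser.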

The partition size $l$ should be optimized to obtain enough precision in the likelihood approximation. An illustrative example is the following. Let $X$ be a pair of $p = 2$ variables, $X$ = (wind direction, atmospheric pressure), measured daily at several geographical locations $\omega$ during $N$ days. Let $B_b$, $b = 1, \dots, l$, be a partition of $\mathbb{R}_+^2$ into products of intervals. A model for the density of $X$ can be a pair of dependent Gamma distributions derived from a pair of dependent Gaussian distributions.

2.3.3. Interval-valued variables

As in [2.5], assume that $X : \Omega \to V = \mathbb{R}$, that the support of $P_{X \mid C = c}$ is $[l_c, u_c]$ and that $S(c) = (a_c, b_c)$, where $a_c = \frac{l_c + u_c}{2}$ and $b_c = \frac{u_c - l_c}{2}$. Since $a_c \in \mathbb{R}$ and $b_c \in \mathbb{R}_+$, a natural model for the density of $S$ is $N(\mu, \sigma^2) \otimes \Gamma(a, b)$ or $N(\mu, \sigma^2) \otimes \log N(\nu, \tau^2)$, the product of a normal distribution with a Gamma or a log-normal one, respectively. This was introduced in [BRI 12] where, however, more realistic mixtures of such distributions are not considered.

2.3.4. Probability vectors and histogram-valued variables

As in [2.6], consider the symbolic variable defined by $S(c) = (P_{X \mid C = c}(V_1), \dots, P_{X \mid C = c}(V_{m_1}))$, taking value in the $m_1$-simplex of probability vectors $T_{m_1} = \{ y = (y_1, \dots, y_{m_1}) \in \mathbb{R}_+^{m_1} : \sum_{l=1}^{m_1} y_l = 1 \}$. Nice density models for $S$ are provided by mixtures of standard distributions on the simplex. Examples of such distributions include Aitchison distributions, normal distributions on the simplex, Dirichlet distributions, and the like. See a nice presentation, e.g. in [PAW 15, pp. 112–127]. The case of Dirichlet mixtures is detailed below.

2.3.4.1. Mixtures of Dirichlet distributions

Mixtures of Dirichlet distributions were used in a symbolic context in various applied fields [CAL 11]. Let $\alpha = (\alpha_1, \dots, \alpha_l, \dots, \alpha_{m_1})$ with $\alpha_l > 0$. The Dirichlet density belongs to the exponential family so that if $d_S$ is a mixture

$$ d_S(\cdot \mid (\alpha_k, q_k)_{k = 1, \dots, K}) = \sum_{k=1}^{K} q_k \, Dd(\cdot \mid \alpha_k), \quad \text{where } \alpha_k = (\alpha_{k,1}, \dots, \alpha_{k,m_1}), $$

of such distributions, in short a Dd mixture, then this mixture is identifiable. Its parameters $(\alpha_k, q_k)_k$ can be estimated, for example, by the popular EM algorithm,


its variants SEM, SAEM, MCEM, and the like. In our notation below, for any observed probability vector $s = (s_1, \dots, s_l, \dots, s_{m_1})$, we agree that $Dd(s \mid \alpha) = Dd((s_1, \dots, s_{m_1 - 1}) \mid \alpha)$, where the right-hand side is defined in [2.8].

2.3.4.2. EM algorithm for a mixture of Dirichlet distributions

Given an $n$-sample of probability vectors

$$ s_i = (s_{i,1}, \dots, s_{i,m_1}) \in T_{m_1}, \quad i = 1, \dots, n, \qquad [2.17] $$

the EM-like estimation procedures introduce a latent class variable $C^{(2)}$, the superscript (2) denoting a second level of classes, where $C^{(2)}$ is now defined on the class label probability space $C$ and is such that

$$ \begin{cases} C^{(2)} : (C, \mathcal{C}, Q) \to \{1, \dots, K\} \\ Q(C^{(2)} = k) = q_k, \quad k = 1, \dots, K \\ Q_{S \mid C^{(2)} = k} = Dd(\cdot \mid \alpha_k). \end{cases} $$

In the "E" part, the expectation of the log likelihood of the complete variable $(S, C^{(2)})$ is computed with respect to the conditional distribution of $C^{(2)}$, given $S$ and a value $\alpha_{old}, q_{old}$ of the parameters $\alpha$ and $q$ that have to be estimated:

$$ E_{Q_{C^{(2)} \mid S, \alpha_{old}, q_{old}}} \big( LL((s_i, c_i^{(2)})_i, \alpha, q) \big) = \sum_{i=1}^{n} \sum_{k=1}^{K} t_{i,k} \big( \log Dd(s_i \mid \alpha_k) + \log q_k \big) $$

with

$$ t_{i,k} = \frac{q_{k,old} \, Dd(s_i \mid \alpha_{old,k})}{\sum_{r=1}^{K} q_{r,old} \, Dd(s_i \mid \alpha_{old,r})} $$

and

$$ \log Dd(s_i \mid \alpha_k) = \log \Gamma\Big( \sum_{l=1}^{m_1} \alpha_{k,l} \Big) - \sum_{l=1}^{m_1} \log \Gamma(\alpha_{k,l}) + \sum_{l=1}^{m_1} (\alpha_{k,l} - 1) \log(s_{i,l}). $$

The "M" part of EM consists of finding the new parameters that maximize the above expectation. Differentiating with respect to $\alpha_{k,l}$ yields $m_1$ equations for component $k$, the weights $t_{i,k}$ being computed at $(\alpha_{old}, q_{old})$:

$$ \sum_{i=1}^{n} t_{i,k} \Big( z\Big( \sum_{h=1}^{m_1} \alpha_{k,h} \Big) - z(\alpha_{k,l}) + \log(s_{i,l}) \Big) = 0, \quad l = 1, \dots, m_1, $$


where $z = \Gamma' / \Gamma$ denotes the logarithmic derivative of the $\Gamma$ function, known as the digamma function. The solution requires numerical methods. An implementation, using the "dfsane" function of the R package "BB", can be found in [XIA 17].

2.3.4.3. Clustering quality of a mixture

Likelihood is a statistical criterion of the mixture quality, but another criterion, based more on the clustering aspect, is defined in [XIA 17]. Indeed, each step of EM-like algorithms provides fuzzy classes, since $t_{i,k}$ is the posterior probability of $C^{(2)} = k$ given $S = s_i$ and can be seen as the membership degree of $s_i$ in the fuzzy class $k$. Hence, defining the fuzzy mean of class $k$ as $\sum_{i=1}^{n} t_{i,k} s_i / \sum_{i=1}^{n} t_{i,k}$ and choosing a distance on the set of probability vectors, we can compute the within variance of each class and the variance between classes, obtaining a clustering quality of the mixture. Such a criterion can be used to compare various algorithms; see [XIA 17].

2.3.4.4. Consistency and mixtures of Dirichlet processes

The probability vectors in [2.17], used in the above EM-like algorithms, were computed with respect to a partition of $\mathbb{R}$ into $m_1$ adjacent intervals. If the support of $P_X$ is included in a bounded interval, the partition intervals can be taken of finite length. When this length goes to 0 and $m_1 \to +\infty$, the consistency of the estimated mixture is proved in [EMI 12] for a model derived from a mixture of Dirichlet processes, using the martingale theorem and a nice property of Dirichlet processes introduced by Ferguson [FER 73]:

DEFINITION.– Let $\alpha$ be a positive finite measure on a measurable space $\mathcal{Y}$. A random probability measure $P : C \to M_1(\mathcal{Y})$ is a Dirichlet process with parameter $\alpha$, in short $P \sim D(\alpha)$, if for any measurable finite partition $A_1, \dots, A_d$ of $\mathcal{Y}$,

$$ (P(A_1), \dots, P(A_d)) \sim Dd(\alpha(A_1), \dots, \alpha(A_d)). $$

Writing $\alpha = r P_0$, where $r = \alpha(\mathcal{Y}) > 0$, then $P_0 = \alpha / r$ is a probability measure on $\mathcal{Y}$, and a standard notation is $P \sim D(r, P_0)$, since it can be seen that $P_0$ is the expectation of $P$, i.e. $E(P(A)) = P_0(A)$ for any measurable set $A$. The distribution $D(r, P_0)$ of $P$ is a probability measure on $M_1(\mathcal{Y})$. It is a very rich, powerful, and popular tool in nonparametric Bayesian statistics. The Dirichlet distribution $Dd(\alpha_1, \dots, \alpha_d)$ is just the particular case obtained when taking $\mathcal{Y} = \{1, \dots, d\}$.
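Returning to the EM quantities of sections 2.3.4.2 and 2.3.4.3, the posterior memberships $t_{i,k}$ and the fuzzy class means can be sketched in Python as below. This is a hedged illustration, not the implementation of [XIA 17]: the function names are hypothetical, the observed probability vectors are assumed to have strictly positive components, and the M step (which needs a numerical root finder for the digamma equations) is deliberately left out.

```python
import math


def dirichlet_logpdf(s, alpha):
    """log Dd(s | alpha) for a probability vector s with strictly positive entries."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, s))


def e_step(samples, alphas, weights):
    """Memberships t[i][k] of s_i in component k, at the current (old) parameters."""
    t = []
    for s in samples:
        logs = [math.log(q) + dirichlet_logpdf(s, a) for q, a in zip(weights, alphas)]
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]  # subtract the max for stability
        total = sum(unnorm)
        t.append([u / total for u in unnorm])
    return t


def fuzzy_means(samples, t):
    """Fuzzy mean of each class k: sum_i t[i][k] * s_i / sum_i t[i][k]."""
    n_comp, dim = len(t[0]), len(samples[0])
    means = []
    for k in range(n_comp):
        mass = sum(row[k] for row in t)
        means.append(
            [sum(row[k] * s[l] for row, s in zip(t, samples)) / mass for l in range(dim)]
        )
    return means
```

Computing the memberships from log densities, with the largest value subtracted before exponentiating, avoids underflow when a Dirichlet density is extremely small at some $s_i$.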


2.4. Nonparametric estimation for p = 1

Instead of proposing parametric models, a nonparametric or semiparametric approach is possible to estimate the density $d_S$.

2.4.1. Multihistograms and multivariate polygons

First, define multi-dimensional bins: for example, if $S$ takes value in the simplex $T_{m_1}$, such bins can be products of $m_1$ one-dimensional bins in [0,1]. Then simply count the observations in each bin. See an implementation in the R language in the book by M.L. Rizzo [RIZ 08, pp. 305–310].

2.4.2. Dirichlet kernel mixtures

If $S$ takes value in the simplex $T_{m_1}$, a Dirichlet kernel estimator is introduced in [EMI 16] as follows.

DEFINITION.– Given a sample of probability vectors $s_i$, $i = 1, \dots, n$, of size $m_1$, define the multivariate Dirichlet kernel as the function

$$ K_H(x) = |H|^{-\frac{1}{2}} \sum_{i=1}^{n} Dd\big( H^{-\frac{1}{2}} (x - s_i) \big), \quad x \in T_{m_1 - 1}, \qquad [2.18] $$

where $H$ is an $(m_1 - 1) \times (m_1 - 1)$ symmetric positive definite bandwidth matrix to be estimated, and $|H|$ denotes the determinant of $H$.

Then, we can proceed to the estimation of a Dirichlet kernel mixture using the npEM algorithm [BEN 09a] implemented in the R package mixtools [BEN 09b]. We omit the details. As observed in [EMI 16], many other kernels can be defined by replacing in [2.18] the Dirichlet density $Dd$ by any other distribution on the simplex, such as those mentioned in [PAW 15, pp. 112–127].

2.4.3. Dirichlet Process Mixture (DPM)

In the finite mixture of $K$ Dirichlet distributions or of $K$ kernels, the mixing class variable, denoted by $C^{(2)}$, takes value in $\{1, \dots, K\}$. As suggested in [EMI 17], a more general model is the following hierarchical model, called the Dirichlet Process Mixture (DPM), which is an infinite mixture of $Dd$'s, the mixing distribution being the distribution $D$ of a Dirichlet process:

$$ \begin{cases} s_i \mid \alpha_i \sim Dd(\alpha_i) \\ \alpha_i \mid P \sim P \\ P \mid r, P_0 \sim D(r P_0). \end{cases} $$

Here, the $Dd$'s can be replaced by various distributions on the simplex or by the kernels defined in [2.18] and in the remark thereafter. See also [TOS 10, pp. 79–86].
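One way to simulate from the DPM hierarchy above is to integrate out $P$ and use the Pólya urn (Blackwell–MacQueen) scheme, in which the $(i+1)$-th parameter is a fresh draw from $P_0$ with probability $r / (r + i)$ and otherwise a copy of one of the $i$ earlier parameters chosen uniformly. The Python sketch below follows this scheme, which is a standard representation of the Dirichlet process and is not described in the chapter itself. The base measure `example_p0` is a purely hypothetical choice, as are the function names.

```python
import random


def sample_dpm(n, r, draw_from_p0, rng):
    """Draw s_1, ..., s_n from the DPM hierarchy by the Polya urn scheme (P integrated out)."""
    alphas, samples = [], []
    for i in range(n):
        # A new parameter from P0 with probability r / (r + i), otherwise reuse an old one.
        if rng.random() < r / (r + i):
            alpha = draw_from_p0(rng)
        else:
            alpha = rng.choice(alphas)
        alphas.append(alpha)
        # s_i | alpha_i ~ Dd(alpha_i), obtained by normalising Gamma(alpha_l, 1) draws.
        gammas = [rng.gammavariate(a, 1.0) for a in alpha]
        total = sum(gammas)
        samples.append([g / total for g in gammas])
    return alphas, samples


def example_p0(rng, dim=3):
    """Hypothetical base measure P0: independent coordinates 0.5 + Exponential(1)."""
    return tuple(0.5 + rng.expovariate(1.0) for _ in range(dim))
```

Because earlier parameters are reused, the sample splits into a random number of groups sharing the same $\alpha_i$, which is what makes the model an infinite mixture whose number of occupied components is learned from the data.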


2.5. Density models for p ≥ 2

We now consider, as in [2.5], the case where $X = (X_1, \dots, X_p) : \Omega \to V = \mathbb{R}^p$, where $p \ge 2$ is an integer, and $S(c) = (f_1(P_{X_1 \mid C = c}), \dots, f_p(P_{X_p \mid C = c}))$. As we have to take into account the dependency between the $p$ columns of symbolic data, we are led to use, for example, dependent Gamma distributions in the case of intervals and dependent Dirichlet distributions in the case of probability vectors. The following constructions were proposed in [EMI 16].

Note that a $\Gamma(a, b)$ distribution is the distribution of $\Psi_{(a,b)}(U)$, where $\Psi_{(a,b)}$ is the quantile function of a $\Gamma(a, b)$-distributed r.v. and $U$ is a uniform r.v. Taking $U = \phi(N(0, 1))$, where $\phi$ denotes the cdf of a standard Gaussian distribution, it is seen that $\Gamma(a, b)$ is the distribution of the transform $\Psi_{(a,b)}(\phi(N(0, 1)))$ of a standard Gaussian. Hence, taking dependent standard Gaussians and applying such a transform, we can get dependent Gamma distributions. A similar construction holds for defining dependent log-normal distributions. Mixtures of such dependent distributions provide some models for interval-valued symbolic variables.

In order to define $p$ dependent Dirichlet distributions, the first one being in dimension $m_1$, ..., the $p$th in dimension $m_p$, we start from $m = m_1 + m_2 + \dots + m_p$ standard Gaussians such that the first $m_1$ are independent, the next $m_2$ are independent, ..., the last $m_p$ are independent, but there is dependency between these $p$ blocks of variables. This can be specified through an $m \times m$ matrix $A$ from which is derived a covariance matrix $\Sigma = A^T A$ having $p$ diagonal blocks of identity matrices of size $m_1, \dots, m_p$. Applying transforms, we then get Gamma-distributed r.v. having a similar dependency structure.
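The transform $\Psi_{(a,b)}(\phi(N(0,1)))$ can be sketched in Python for a pair of dependent Gamma variables. Several choices below are assumptions made only for illustration: $b$ is read as a scale parameter, the quantile function is obtained by bisection on a power-series evaluation of the Gamma cdf (a simple but not especially fast method), and the dependency is a single correlation `rho` between two Gaussians rather than a general matrix $A$.

```python
import math
import random


def normal_cdf(z):
    """cdf of the standard Gaussian distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def gamma_cdf(x, shape):
    """Regularised lower incomplete gamma function P(shape, x), by its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / shape
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= x / (shape + n)
        total += term
        n += 1
    return total * math.exp(shape * math.log(x) - x - math.lgamma(shape))


def gamma_quantile(u, shape, scale):
    """Quantile function of Gamma(shape, scale), by bisection on gamma_cdf."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep away from 0 and 1
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, shape) < u:  # grow the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, shape) < u:
            lo = mid
        else:
            hi = mid
    return scale * 0.5 * (lo + hi)


def sample_dependent_gammas(n, rho, params, rng):
    """Pairs (G1, G2) with Gamma marginals, coupled through correlated standard Gaussians."""
    (a1, b1), (a2, b2) = params
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1, u2 = normal_cdf(z1), normal_cdf(z2)
        pairs.append((gamma_quantile(u1, a1, b1), gamma_quantile(u2, a2, b2)))
    return pairs
```

Each marginal keeps its Gamma distribution whatever the value of `rho`, since only the joint law of the underlying Gaussians changes; this is the copula-like feature of the construction.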
Then, dividing the first $m_1$ independent gamma variables by their sum, the next $m_2$ independent gamma variables by their sum, and so on up to the last $m_p$ independent gamma variables divided by their sum, we get the desired $p$ dependent Dirichlet distributions. A similar construction can be used for defining other $p$ dependent distributions on $p$ simplexes of different dimensions. Note that such constructions provide a kind of copula model between multivariate Dirichlet distributions, and finding new constructions would be of interest. Finally, multihistograms and finite mixtures of kernels built from such density functions provide nonparametric models for $p \ge 2$. Infinite mixtures of kernels can be derived using a DPM just as in the case $p = 1$.

2.6. Conclusion

We have proposed a probabilistic setting that modifies the one proposed in [EMI 15], so that the classes of data can be considered as statistical units, and likelihood for symbolic variables that take value in $\mathbb{R}^m$, for some $m = 1, 2, \dots$, is clearly defined. We have been led to define distributions on distributions and, thus, to use some nonparametric Bayesian statistical tools such as the Dirichlet distribution, dependent Dirichlet distributions, kernels, and the Dirichlet process. Symbolic likelihood can be of great interest when dealing with objects described by probability distributions, for example, for ranking or detecting outliers. We have shown that our formalism is appropriate for hierarchical models such as the LDA model, which we can actually extend to $p$ variables, $p \ge 2$, and also when dealing with very large datasets, as shown in [ZHA 16] and [BER 18], and it can be expected that the scope of this field will be extended in forthcoming works.

2.7. References

[BEN 09a] BENAGLIA T.C.D., HUNTER D.R., “An EM-like algorithm for semi- and non-parametric estimation in multivariate mixtures”, Journal of Computational and Graphical Statistics, vol. 18, no. 2, pp. 505–526, 2009.

[BEN 09b] BENAGLIA T.C.D., CHAUVEAU D., HUNTER D.R., YOUNG D.S., “mixtools: an R package for analyzing mixture models”, Journal of Statistical Software, vol. 32, no. 4, pp. 1–29, 2009.

[BER 18] BERANGER B., LIN H., SISSON S.A., “New models for symbolic data”, arXiv e-prints, 2018.

[BIL 03] BILLARD L., DIDAY E., “From the statistics of data to the statistics of knowledge: symbolic data analysis”, Journal of the American Statistical Association, vol. 98, no. 462, pp. 470–487, 2003.

[BIL 06] BILLARD L., DIDAY E., Symbolic Data Analysis: Conceptual Statistics and Data Mining, John Wiley & Sons, New York, 2006.

[BOC 99] BOCK H., DIDAY E., Analysis of Symbolic Data, Springer, Berlin, 1999.

[BRI 12] BRITO P., DUARTE SILVA A., “Modelling interval data with Normal and Skew-Normal distributions”, Journal of Applied Statistics, vol. 39, no. 1, pp. 3–20, 2012.

[CAL 11] CALIFE R., EMILION R., SOUBDHAN T., “Classification of wind speed distributions”, Renewable Energy, vol. 36, no. 11, pp. 3091–3097, 2011.

[DID 87] DIDAY E., “The symbolic approach in clustering and related methods of data analysis”, in BOCK H. (ed.), Proceedings of IFCS, North-Holland, Aachen, 1987.

[DID 16] DIDAY E., “Thinking by classes in data science: symbolic data analysis”, WIREs Computational Statistics Symbolic Data Analysis, 191, vol. 8, no. 5, pp. 388–398, 2016.

[EMI 12] EMILION R., “Unsupervised classification of objects described by nonparametric distributions”, Statistical Analysis and Data Mining, vol. 5, no. 5, pp. 388–398, 2012.


[EMI 15] EMILION R., “Some mathematical problems in symbolic data analysis”, Proceedings of the ISI 60th World Statistics Congress, Rio de Janeiro, Brazil, July 26–31, 2015.

[EMI 16] EMILION R., “Models in SDA: Dependent Dirichlet distributions and kernels”, Proceedings of the International Workshop on Advances in Data Science, Beijing, China, 2016.

[EMI 17] EMILION R., “Random intervals as random distributions”, Proceedings of the Third International Symposium on Interval Data Modelling, Beijing, China, 2017.

[FER 73] FERGUSON T.S., “A Bayesian analysis of some nonparametric problems”, Annals of Statistics, vol. 1, no. 2, pp. 209–230, 1973.

[ICH 11] ICHINO M., “The quantile method for symbolic principal component analysis”, Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 4, no. 2, pp. 184–198, 2011.

[KIN 75] KINGMAN J.F.C., “Random discrete distributions”, Journal of the Royal Statistical Society. Series B, vol. 37, no. 1, pp. 1–15, 1975.

[LER 11] LE-RADEMACHER J., BILLARD L., “Likelihood functions and some maximum likelihood estimators for symbolic data”, Journal of Statistical Planning and Inference, vol. 141, no. 4, pp. 1593–1602, 2011.

[PAW 15] PAWLOWSKY-GLAHN V., EGOZCUE J.J., TOLOSANA-DELGADO R., Modeling and Analysis of Compositional Data, Wiley, Chichester, 2015.

[RIZ 08] RIZZO M.L., Statistical Computing with R, Chapman & Hall/CRC, Boca Raton, 2008.

[SAN 11] SANKARARAMAN S., MAHADEVAN S., “Likelihood-based representation of epistemic uncertainty due to sparse point data and/or interval data”, Reliability Engineering and System Safety, vol. 96, no. 5, pp. 814–824, 2011.

[TOS 10] TOSSA A., BERNARD D., EMILION R. et al., A Model for Dissipation: Cascades SDE with Markov Regime-Switching and Dirichlet Prior, Springer, Berlin, 2010.

[XIA 17] XIA B., WANG H., EMILION R. et al., “EM algorithm for Dirichlet samples and its application to movie data”, Proceedings of the Symposium on The Service Innovation Under The Background of Big Data & IEEE Workshop on Analytics and Risk, 2017.

[ZHA 16] ZHANG X., SISSON S.A., “Constructing likelihood functions for interval-valued random variables”, arXiv:1608.00107 [stat.ME], vol. 83, no. 5, pp. 1056–1063, 2016.

3 Dimension Reduction and Visualization of Symbolic Interval-Valued Data Using Sliced Inverse Regression

Dimension reduction of interval-valued data is an active research topic in symbolic data analysis (SDA). The main thread has focused on the extension of principal component analysis (PCA). In this study, we extend classic sliced inverse regression (SIR), an alternative dimension reduction method, to interval-valued data to create a method we call interval SIR (iSIR). SIR is a popular slice-based sufficient dimension reduction technique for exploring the intrinsic structure of high-dimensional data. It has been extended and applied to different data types, such as survival data, time-series data, functional data, and longitudinal data. This study considered three families of symbolic-numerical-symbolic approaches to implement iSIR: quantification approaches, distributional approaches, and interval arithmetic approaches. Each family consists of several methods. We evaluated the methods for low-dimensional discriminative and visualization purposes by means of simulation studies and through application to an empirical dataset. Comparison with results obtained via symbolic principal component analysis was also reported. The results provided clues for selecting an appropriate extension of iSIR to analyze the interval-valued data.

Chapter written by Han-Ming WU, Chiun-How KAO and Chun-houh CHEN.

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

3.1. Introduction

In contrast with conventional numerical data tables where an observation on $p$ random variables is realized by a single point in $\mathbb{R}^p$, an interval-valued dataset contains observations on $p$ random variables that are $p$-dimensional hyperrectangles in $\mathbb{R}^p$. The analysis of interval-valued data usually serves as the basic principle for analyzing other types of symbolic data, such as multi-valued data, modal-valued data, and modal multi-valued data. One source of the interval data is the aggregation of huge datasets. The aggregated data can be of a manageable size while retaining as much of the information of the original data as possible. [DID 88] introduced the concept of symbolic data analysis (SDA) and [BIL 03], [BIL 06] provided an overview of the statistical methodologies for analyzing such data. Among these methods, dimension reduction based on principal component analysis (PCA) for interval-valued data, which we refer to as interval PCA (iPCA), is a major research theme (see section 3.2.1). PCA aims at identifying a small number of new uncorrelated variables that account for as much of the variability in the data as possible. It usually serves as the base algorithm for performing further analysis.

In addition to PCA, sliced inverse regression (SIR), which was proposed in [LI 91], is a sufficient dimension reduction method based on the linear projection of input variables onto the latent variables (components). SIR identifies compact representations of data to explore the intrinsic structure of high-dimensional observations. Effective dimension reduction (EDR) was also introduced, and SIR has been successfully extended and used in various applications [CHE 98], [COO 01]. SIR theory has been generalized and applied to different data types, such as survival data [COO 03], [LI 99], time-series data [BEC 03], functional data [FER 05], longitudinal data [LI 09], and nonlinear manifold data [YAO 13]. These extensions were designed to extract the prominent linear or nonlinear subspaces from classic datasets with numerical values.
In this study, we extended the classic SIR to the interval-valued data and called the new method interval SIR (iSIR). The idea was inspired by iPCA. PCA is based on the eigen-decomposition of the sample variance–covariance matrix of a data matrix. Therefore, the core step in establishing iPCA is to find the sample variance–covariance matrix of the interval-valued data. SIR can be regarded as a PCA-like procedure that performs generalized eigen-decomposition on the so-called sample-weighted variance–covariance matrix of a data matrix. As a result, it is straightforward to implement iSIR by estimating the sample-weighted variance–covariance matrix for the interval-valued data. SIR creates the components by modeling the relationship between input ($X$) and response ($Y$) variables while maintaining most of the information in the input variables. The response in the model can be continuous or discrete, and it plays the role of being sliced; that is, the slices defined must form a partition of the $Y$ values. Suppose that the interval-valued data were obtained from multiple classes and that class labels were available; the information provided by these class labels could be the natural slices adopted in iSIR. If the data do not have class labels, we can apply a clustering algorithm to the interval-valued data to produce pseudo slices (see section 3.5.2 for details).

In this study, we consider three families of symbolic-numerical-symbolic approaches to develop iSIR, that is, the input interval-valued data are analyzed by the classic SIR algorithm and the output is provided in terms of intervals. The methods include quantification approaches, distributional approaches, and interval arithmetic approaches, and several methods were proposed for each family of approach. For exploratory data analysis, by utilizing the first two or three extracted variates, one can observe the structure of the set of symbolic points, such as the presence of clusters or outliers. The result is useful for subsequent analysis. We employed maximum covering area rectangles [CHO 98] to visualize the projected intervals in the EDR subspace. We evaluated several iSIR methods for low-dimensional discriminative and visualization purposes through simulation studies and application to an empirical dataset. Comparison with the results obtained by iPCA was also conducted. Our experimental results suggest that the directions found by iSIR when using center and vertices methods are particularly suitable for discriminative purposes.

The rest of the chapter is structured as follows: Section 3.2 reviews the extension of PCA to the interval-valued data, followed by a brief description of the classic SIR. Section 3.3 considers three families of symbolic-numerical-symbolic approaches to extend SIR to the interval-valued data. The methods for low-dimensional representation and visualization in the reduced space are introduced in section 3.4. Section 3.5 presents several practical issues regarding the slicing strategies of the iSIR algorithm when an existing known set of ground truth or class labels of data were not available, in addition to the standardization methods of each iSIR approach.
We evaluate the implemented iSIR methods and compare the results with those of iPCA for low-dimensional discriminative and visualization purposes by means of simulation studies in section 3.6 and through application to two empirical datasets in section 3.7. We then conclude with our results in section 3.8.

3.2. PCA for interval-valued data and the sliced inverse regression

3.2.1. PCA for interval-valued data

The earliest extensions of PCA to interval-valued data are the quantification methods, in which the given interval-valued data table is transformed into a classic numerical data table so that traditional PCA can be applied. The most well-known approaches are the vertices method of PCA (V-PCA) [CAZ 97], [CHO 98] and the center method of PCA (C-PCA) [CHO 00]. V-PCA decomposes the correlation matrix of the vertices of the observed hyperrectangles, whereas C-PCA decomposes the correlation matrix of the centers of intervals. C-PCA ignores the interval variations contained in the data, whereas V-PCA projects the vertices in the PC subspace. [LAU 00] considered three variants of iPCA, namely, range transformation PCA (RT-PCA), symbolic object PCA (SO-PCA), and the mixed method of SO-PCA and RT-PCA. They attempted to improve the factorial visualization and overcome the drawbacks of the vertices method, where vertices are treated as single independent units described by points. [PAL 03] proposed MR-PCA through a hybrid approach based on the midpoints and radii of the intervals. The hybrid approach regards intervals as a pair of midpoint and radius and employs interval arithmetic to calculate the sample variance–covariance matrix. [DUR 04] extended MR-PCA using a least-squares approach. [GIO 06] and [LAU 06] used interval algebra and optimization theory to develop iPCA. The solution was obtained by solving an interval eigenvector problem. This approach is different from others in many respects since it fully utilizes the interval nature. [IRP 06] considered PCA for time-dependent interval-valued data. Several methods, such as V-PCA, C-PCA, S-PCA, MR-PCA, and iPCA, have been reviewed and compared [CAR 06b], [LAU 08]. Some of the methods have been implemented in the SODAS software [DID 03]. [ICH 11] presented another quantification method using quantiles for symbolic data tables based on the monotone structure of objects.

In addition to the quantification method and the interval arithmetic method, methods have been proposed to estimate the variance–covariance matrices of interval-valued variables under the assumption that the interval observations are uniformly distributed within the interval. For instance, [WAN 12] defined the inner product of the intervals and proposed complete information PCA (CI-PCA). [LER 12] proposed symbolic covariance PCA (SC-PCA). With the exception of iPCA, most PCA extensions for interval-valued data belong to the category of symbolic-numerical-symbolic approaches.

3.2.2. Classic SIR

Assume that $X = (X_1, \cdots, X_p)^T$ are $p$-dimensional independent variables, and $Y$ is the univariate dependent variable. Let $\{y_i, x_i\}_{i=1}^{n}$ be the realizations of the variables $(Y, X)$, where $x_i = (x_{i1}, \cdots, x_{ip})^T$ is the $i$th numerical observation. [LI 91] presented a regression model as the prototypical framework for dimension reduction

$$ Y = f(\beta_1^T X, \cdots, \beta_K^T X, \epsilon), \qquad [3.1] $$

where βs are vectors with dimensions p × 1, p ≥ K, ϵ is the random noise independent of X, and f is an arbitrary function. βs are referred to as the effective dimension reduction or projection directions. The dimensions of X can be reduced by projecting along the EDR directions while maintaining most of the information about Y conveyed in X. SIR is a linear algorithm for estimating EDR directions based on the sample {yi , xi }ni=1 . Under model assumption [3.1] and the linearity condition presented


in [LI 91], it has been shown that the centered inverse regression curve E(X|Y) − E(X) is contained in the linear subspace spanned by $\beta_k^T \Sigma_x$ (k = 1, · · · , K), where $\Sigma_x$ denotes the covariance matrix of X. According to this property, the estimates of the βs are obtained by performing an eigen-decomposition of the sample-weighted covariance matrix $\hat{\Sigma}_w$ of sample $\{y_i, x_i\}_{i=1}^n$ with respect to the sample variance–covariance matrix $\hat{\Sigma}_x$:

$$\hat{\Sigma}_w \beta_j = \lambda_j \hat{\Sigma}_x \beta_j, \qquad j = 1, \cdots, p, \qquad [3.2]$$

where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$ are eigenvalues and $\hat{\Sigma}_w$ is constructed by

$$\hat{\Sigma}_w = \sum_{h=1}^{H} p_h (\bar{x}^{(h)} - \bar{x})(\bar{x}^{(h)} - \bar{x})^T, \qquad [3.3]$$

where $p_h$ is the proportion of all observed $y_i$s that fall into the hth slice ($I_h$), $\bar{x}$ is the grand mean, and $\bar{x}^{(h)}$ is the sample mean for the hth slice. Then, the leading K eigenvectors $\hat{\beta}_j$ are used as projection directions. The reduced subspace considered by the classic SIR is a K-dimensional linear subspace spanned by $\{\hat{\beta}_1, \ldots, \hat{\beta}_K\}$. [COO 94], [COO 96] named this reduced subspace the central dimension reduction subspace, $S_{Y|X}$, for the regression of Y on X, in which $\{\beta_1, \ldots, \beta_K\}$ form a basis for $S_{Y|X}$. SIR has served as the base algorithm for slice-based sufficient dimension reduction [COO 98]. We denote these eigenvectors in terms of the eigen-decomposition operator:

$$\hat{\beta}_k = \mathrm{Evec}(\hat{\Sigma}_w, \hat{\Sigma}_x). \qquad [3.4]$$
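The classic SIR steps in equations [3.2]–[3.4] can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `sir_directions`, the equal-count slicing of the sorted responses, and the Cholesky whitening used to solve the generalized eigenproblem are all assumptions made for the example.

```python
import numpy as np

def sir_directions(x, y, n_slices=10, n_dirs=1):
    """Classic SIR sketch: leading eigenpairs of Sigma_w b = lambda Sigma_x b."""
    n, p = x.shape
    x_bar = x.mean(axis=0)
    sigma_x = np.cov(x, rowvar=False, bias=True)      # sample covariance of X
    sigma_w = np.zeros((p, p))
    # Assumed slicing scheme: (nearly) equal-sized slices of the sorted responses.
    for idx in np.array_split(np.argsort(y), n_slices):
        d = x[idx].mean(axis=0) - x_bar               # slice mean minus grand mean
        sigma_w += (len(idx) / n) * np.outer(d, d)    # weight p_h, equation [3.3]
    # Turn the generalized problem into a symmetric one via Cholesky whitening.
    chol_inv = np.linalg.inv(np.linalg.cholesky(sigma_x))
    evals, v = np.linalg.eigh(chol_inv @ sigma_w @ chol_inv.T)
    order = np.argsort(evals)[::-1][:n_dirs]
    return evals[order], chol_inv.T @ v[:, order]     # map back: beta = L^{-T} v

# Toy data (hypothetical): a linear model whose true direction is (1, 1, 1, 1, 0).
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 5))
y = x[:, :4].sum(axis=1) + rng.standard_normal(1000)
evals, dirs = sir_directions(x, y)
b1 = dirs[:, 0] / np.linalg.norm(dirs[:, 0])
print(np.round(np.abs(b1), 2))
```

The whitening step is only one way to solve the generalized eigenproblem; any generalized symmetric eigensolver would serve the same purpose.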

Similar to PCA, SIR is based on the projection of input variables onto the latent variables (components). However, in contrast with PCA, SIR creates components by modeling the relationship between input and response variables while maintaining most of the information in the input variables. SIR can be seen as a PCA-like procedure performed on the random variable E(X|Y) rather than on X. That is, SIR looks for linear combinations of X that maximize $\mathrm{Var}(E(\beta^T X|Y))/\mathrm{Var}(\beta^T X)$ instead of just $\mathrm{Var}(\beta^T X)$. Whereas PCA leads to an eigen-problem, SIR leads to a generalized eigen-problem.

3.3. SIR for interval-valued data

To obtain the estimated βs of equation [3.4] for interval-valued data, it is sufficient to estimate the symbolic sample variance–covariance matrix $\hat{\Sigma}_x$ and the symbolic sample-weighted variance–covariance matrix $\hat{\Sigma}_w$ and to conduct generalized eigen-decomposition. Similar to the numerical case, we assume that $\Xi = (\Xi_1, \cdots, \Xi_p)^T$ are p-dimensional independent interval-valued variables, and Ψ is a univariate dependent interval-valued variable. Denote the realizations of the interval-valued variables (Ψ, Ξ) by $\{\psi_i, \xi_i\}_{i=1}^n$, where $\psi_i = [c_i, d_i]$, $c_i \leq d_i$, $\xi_i = (\xi_{i1}, \cdots, \xi_{ip})^T$, $\xi_{ij} = [a_{ij}, b_{ij}]$, $a_{ij} \leq b_{ij}$, i = 1, · · · , n, j = 1, · · · , p.

We describe two families (tracks) of symbolic-numerical-symbolic approaches to extend SIR to analyze interval-valued data: the quantification approach and the distributional approach. The quantification approach is the most common method used for analyzing symbolic interval-valued data. The analysis proceeds by transforming the symbolic input data into a classic numerical data table so that the classic algorithm can be applied. The distributional approach, by contrast, assumes some distribution within the observed intervals and computes covariances directly from the interval-valued data.

3.3.1. Quantification approaches

The quantification approaches employ suitable coding to quantify $\xi_i$ such that the data can be handled by the classic SIR algorithm. Since the role of the response $\psi_i$ is to slice the data (see section 3.5.2 for slicing schemes), it is not subjected to the same manipulation as $\xi_i$.

3.3.1.1. The center method (CM)

The center method uses the interval midpoints as the reference points of the intervals, and the interval variation contained in the data is ignored. Let $\xi^c = (\xi_1^c, \cdots, \xi_n^c)^T$ be the data of centers and $\xi_{ij}^c$ be the ijth element of $\xi^c$, where

$$\xi_i^c = \left( \frac{a_{i1} + b_{i1}}{2}, \cdots, \frac{a_{ip} + b_{ip}}{2} \right)^T, \qquad i = 1, \cdots, n.$$

Matrix $\xi^c$ is then treated as though it represents the classic p-variate data for n individuals.

3.3.1.2. The vertices method (VM)

Let $q_i$ be the number of nontrivial intervals for the ith observation [i.e. $q_i = \sum_{j=1}^{p} I(a_{ij} < b_{ij})$, where I(·) is an indicator function]. The vertices method stacks the vertices of each hyperrectangle to form the vertices matrix $\xi^v = (\xi^v_{V_1}, \cdots, \xi^v_{V_n})^T$ with dimensions $\sum_{i=1}^{n} 2^{q_i} \times p$, where

$$\xi^v_{V_i} = \begin{pmatrix} a_{i1} & \cdots & a_{ip} \\ a_{i1} & \cdots & b_{ip} \\ \vdots & \ddots & \vdots \\ b_{i1} & \cdots & a_{ip} \\ b_{i1} & \cdots & b_{ip} \end{pmatrix}^T, \qquad i = 1, \cdots, n. \qquad [3.5]$$

The rows of matrix $\xi^v$ represent all the vertices $V_{k_i}$, $k_i = 1, \cdots, 2^{q_i}$, i = 1, · · · , n, of n hyperrectangles.
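The center and vertices codings can be illustrated with a short sketch. The helper names `center_matrix` and `vertices_matrix` and the toy bounds are hypothetical; the second function also returns, as a convenience not described in the text, the index of the observation that owns each vertex.

```python
import numpy as np
from itertools import product

def center_matrix(a, b):
    """Center method: one row of interval midpoints per observation."""
    return (a + b) / 2.0

def vertices_matrix(a, b):
    """Vertices method: stack the 2**q_i vertices of each hyperrectangle."""
    rows, owner = [], []
    for i in range(a.shape[0]):
        # A trivial interval (a_ij == b_ij) contributes a single coordinate.
        choices = [(a[i, j],) if a[i, j] == b[i, j] else (a[i, j], b[i, j])
                   for j in range(a.shape[1])]
        for vertex in product(*choices):
            rows.append(vertex)
            owner.append(i)          # remember which observation owns the vertex
    return np.array(rows), np.array(owner)

# Toy lower and upper bounds for n = 2 observations and p = 3 variables.
a = np.array([[1.0, 2.0, 0.0], [0.0, 5.0, 1.0]])
b = np.array([[3.0, 4.0, 0.0], [1.0, 5.0, 1.0]])
xi_c = center_matrix(a, b)
xi_v, owner = vertices_matrix(a, b)
print(xi_c.shape, xi_v.shape)
```

Here the first observation has two nontrivial intervals and the second has one, so the stacked matrix has 4 + 2 rows.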


3.3.1.3. The quantile method (QM)

The quantile matrix of $\{\xi_i\}_{i=1}^n$ is defined by $\xi^q = (\xi^q_{Q_1}, \cdots, \xi^q_{Q_n})^T$, where

$$\xi^q_{Q_i} = \begin{pmatrix} q_{i01} & \cdots & q_{i0p} \\ \vdots & \ddots & \vdots \\ q_{im1} & \cdots & q_{imp} \end{pmatrix}^T, \quad \text{and} \quad q_{ikj} = a_{ij} + (b_{ij} - a_{ij}) \frac{k}{m}$$

for i = 1, · · · , n, j = 1, · · · , p, k = 0, · · · , m. The number m (m ≥ 1) is user defined. Note that the construction of $q_{ikj}$ relies on an assumption of uniformity within each observed interval. The dimensions of $\xi^q$ are (m + 1)n × p. The main advantage of the quantile method is that it is able to simultaneously manipulate histograms, nominal multi-value types, and other variable types. [ICH 11] proposed the QM for symbolic-object PCA by applying PCA based on the Kendall or Spearman's rank or Pearson's correlation matrix of $\xi^q$; that is, by conducting eigenvalue decomposition of a correlation matrix of $\xi^q$. This is equivalent to performing PCA on the covariance matrix of the standardized variables of $\xi^q$. Therefore, iSIR with QM can be implemented by conducting the classic SIR on the standardized variables of $\xi^q$.

3.3.1.4. The stacked endpoints method (SE)

The stacked endpoints method stacks the lower and upper values of an interval variable. SE is a special case of QM with m = 1 and is called the original object splitting method [ICH 11]. SE regards the minimum $a_{ij}$ and the maximum $b_{ij}$ of the intervals as single realizations of the variable $\Xi_j$. Let $\xi^s$ denote the stacked matrix of size 2n × p; the ith individual is

$$\xi^s_i = \begin{pmatrix} a_{i1} & \cdots & a_{ip} \\ b_{i1} & \cdots & b_{ip} \end{pmatrix}, \qquad i = 1, \cdots, n.$$

The estimated βs are obtained by applying the classic SIR algorithm to the standardized variables of $\xi^s$.

3.3.1.5. The fitted values method (FV)

The fitted values method is motivated by the MinMax method [LIM 08], where MinMax is used to fit a linear regression model to symbolic interval-valued variables. FV quantifies the interval variables by the fitted values of the maximum of the intervals based on the following fitted simple linear regression model:

$$\hat{b}_{ij} = \hat{\eta}_{0j} + \hat{\eta}_{1j} a_{ij}, \qquad j = 1, \cdots, p, \qquad i = 1, \cdots, n.$$

Therefore, the transformed matrix is $\xi^f = (\xi^f_1, \cdots, \xi^f_n)^T$ where $\xi^f_i = (\hat{b}_{i1}, \cdots, \hat{b}_{ip})^T$.
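As a rough illustration of the last three codings, the sketch below builds the quantile matrix, the stacked endpoints matrix (the m = 1 case), and the fitted values matrix. The function names are hypothetical, and ordinary least squares via `np.polyfit` is assumed for the simple regression of the upper bounds on the lower bounds.

```python
import numpy as np

def quantile_matrix(a, b, m):
    """Rows q_ik = a_i + (b_i - a_i) * k / m for k = 0..m, stacked over i."""
    k = np.arange(m + 1) / m                               # (m + 1,) fractions
    q = a[:, None, :] + (b - a)[:, None, :] * k[None, :, None]
    return q.reshape(-1, a.shape[1])                       # ((m + 1) * n, p)

def stacked_endpoints(a, b):
    """Stacked endpoints: the quantile matrix with m = 1."""
    return quantile_matrix(a, b, m=1)

def fitted_values(a, b):
    """Per variable, regress upper bounds on lower bounds and keep the fits."""
    out = np.empty(a.shape, dtype=float)
    for j in range(a.shape[1]):
        slope, intercept = np.polyfit(a[:, j], b[:, j], 1)  # assumed OLS fit
        out[:, j] = intercept + slope * a[:, j]
    return out

a = np.array([[0.0, 10.0], [1.0, 30.0], [2.0, 20.0]])
b = np.array([[4.0, 20.0], [3.0, 35.0], [8.0, 50.0]])
xi_q = quantile_matrix(a, b, m=4)
xi_s = stacked_endpoints(a, b)
xi_f = fitted_values(a, b)
print(xi_q.shape, xi_s.shape, xi_f.shape)
```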


3.3.2. Distributional approaches

The distributional approaches assume that the possible interval observations $u_{ij}$ in a given interval $[a_{ij}, b_{ij}]$ are uniformly distributed within that interval and that each individual has the same probability of being observed. [BER 00] defined the empirical density function for an interval-valued random variable $\Xi_j$ as a mixture of n uniform distributions,

$$f_j(u) = \frac{1}{n} \sum_{i=1}^{n} \frac{I(u \in [a_{ij}, b_{ij}])}{b_{ij} - a_{ij}}, \qquad j = 1, \cdots, p.$$

The symbolic sample mean and variance of $\Xi_j$ can be derived as, respectively,

$$\bar{\xi}_j = \frac{1}{2n} \sum_{i=1}^{n} (a_{ij} + b_{ij}), \qquad j = 1, \cdots, p, \qquad [3.6]$$

and

$$\varsigma_j^2 = \frac{1}{3n} \sum_{i=1}^{n} (b_{ij}^2 + a_{ij} b_{ij} + a_{ij}^2) - \frac{1}{4n^2} \left[ \sum_{i=1}^{n} (a_{ij} + b_{ij}) \right]^2. \qquad [3.7]$$
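Equations [3.6] and [3.7] translate directly into array operations. The sketch below uses hypothetical function names; it also checks numerically that a single interval [0, 1] recovers the variance 1/12 of a uniform distribution.

```python
import numpy as np

def symbolic_mean(a, b):
    """Equation [3.6]: average of the interval midpoints, per variable."""
    return (a + b).sum(axis=0) / (2 * a.shape[0])

def symbolic_variance(a, b):
    """Equation [3.7]: variance of the mixture of n uniform distributions."""
    n = a.shape[0]
    second_moment = (b**2 + a * b + a**2).sum(axis=0) / (3 * n)
    return second_moment - (a + b).sum(axis=0) ** 2 / (4 * n**2)

# One interval [0, 1]: the mixture is a single uniform distribution.
a1, b1 = np.array([[0.0]]), np.array([[1.0]])
print(symbolic_mean(a1, b1), symbolic_variance(a1, b1))
```

When every interval is degenerate (a = b), the two functions reduce to the classic sample mean and the sample variance with divisor n.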

The symbolic sample slice mean is

$$\bar{\xi}_j^{(h)} = \frac{1}{2 n_h} \sum_{i=1}^{n} (a_{ij} + b_{ij}) I(\psi_i \in I_h), \qquad h = 1, \cdots, H, \quad j = 1, \cdots, p,$$

where $n_h = \sum_{i=1}^{n} I(\psi_i \in I_h)$ is the number of observed $\psi_i$s that fall into the hth slice, $I_h$. Note that the symbolic sample variance equals the classic sample variance when $a_{ij} = b_{ij}$. The estimation of the symbolic sample-weighted variance–covariance $\hat{\Sigma}_w$ is then straightforward from equation [3.3] once the symbolic mean $\bar{\xi}_j$ and the symbolic sample slice mean $\bar{\xi}_j^{(h)}$ are obtained. Three methods for calculating the symbolic sample covariance for interval-valued variables are introduced below.

3.3.2.1. The empirical joint density method (EJD)

The empirical joint density function of the interval variables $\Xi_j$ and $\Xi_{j'}$ can be expressed as

$$f_{jj'}(u, v) = \frac{1}{n} \sum_{i=1}^{n} \frac{I(u \in [a_{ij}, b_{ij}], v \in [a_{ij'}, b_{ij'}])}{(b_{ij} - a_{ij})(b_{ij'} - a_{ij'})}, \qquad j \neq j'.$$

Based on this empirical joint density function, [BIL 03] derived the symbolic sample covariance for interval variables $\Xi_j$ and $\Xi_{j'}$ as

$$\varsigma_{jj'} = \frac{1}{4n} \sum_{i=1}^{n} \left[ (a_{ij} + b_{ij})(a_{ij'} + b_{ij'}) \right] - \frac{1}{4n^2} \left[ \sum_{i=1}^{n} (a_{ij} + b_{ij}) \right] \left[ \sum_{i=1}^{n} (a_{ij'} + b_{ij'}) \right]. \qquad [3.8]$$
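A possible way to assemble the EJD variance–covariance matrix is sketched below: off-diagonal entries follow equation [3.8] and diagonal entries follow equation [3.7]. The function name and the random toy data are assumptions for illustration only.

```python
import numpy as np

def ejd_covariance(a, b):
    """EJD sketch: off-diagonals from [3.8], diagonal from [3.7]."""
    n = a.shape[0]
    s = a + b                                    # a_ij + b_ij
    tot = s.sum(axis=0)
    cov = s.T @ s / (4 * n) - np.outer(tot, tot) / (4 * n**2)           # [3.8]
    var = (b**2 + a * b + a**2).sum(axis=0) / (3 * n) - tot**2 / (4 * n**2)  # [3.7]
    cov[np.diag_indices_from(cov)] = var
    return cov

rng = np.random.default_rng(2)
center = rng.standard_normal((50, 3))
radius = np.abs(rng.standard_normal((50, 3)))
a, b = center - radius, center + radius
sigma_ejd = ejd_covariance(a, b)
print(np.round(sigma_ejd, 3))
```

The off-diagonal entries coincide with the covariances of the interval centers, while each diagonal entry exceeds the variance of the centers by the average of $(b_{ij} - a_{ij})^2/12$, the within-interval uniform variance.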

[WAN 12] defined the inner product of the intervals under the assumption that each data unit is regarded as a random variable that obeys a uniform distribution and is infinitely dense in $[a_{ij}, b_{ij}]$. Based on these definitions and assumptions, it can be shown that the sample variance of the centralized interval-valued variable $\Xi_j$ and the sample covariance of two centralized interval-valued variables $\Xi_j$ and $\Xi_{j'}$ are given exactly by equations [3.7] and [3.8] without their second terms. [WAN 12] also showed that the off-diagonal elements of the variance–covariance matrices of $\xi^c$ and $\xi^v$ are exactly the same as those of $\varsigma_{jj'}$ of equation [3.8]. The only differences are the variances of the interval-valued variables.

3.3.2.2. The symbolic covariance method (GQ)

An alternative expression of the symbolic sample variance for $\Xi_j$ in equation [3.7] is

$$\varsigma_j^2 = \frac{1}{3n} \sum_{i=1}^{n} \left[ (a_{ij} - \bar{\xi}_j)^2 + (a_{ij} - \bar{\xi}_j)(b_{ij} - \bar{\xi}_j) + (b_{ij} - \bar{\xi}_j)^2 \right].$$

[BIL 06] generalized the above equation to formulate the symbolic sample covariance for $\Xi_j$ and $\Xi_{j'}$ as

$$\varsigma_{jj'} = \frac{1}{3n} \sum_{i=1}^{n} G_j G_{j'} [Q_j Q_{j'}]^{1/2}, \qquad j, j' = 1, \cdots, p, \qquad [3.9]$$

where for $J = j, j'$,

$$Q_J = (a_{iJ} - \bar{\xi}_J)^2 + (a_{iJ} - \bar{\xi}_J)(b_{iJ} - \bar{\xi}_J) + (b_{iJ} - \bar{\xi}_J)^2,$$

$$G_J = \begin{cases} -1, & \text{if } \xi^c_{iJ} \leq \bar{\xi}_J, \\ 1, & \text{if } \xi^c_{iJ} > \bar{\xi}_J, \end{cases}$$

and $\xi^c_{iJ}$ is the midpoint of the interval $[a_{iJ}, b_{iJ}]$.

3.3.2.3. The total sum of products (SPT)

[BIL 07], [BIL 08] further demonstrated that the sample variance in equation [3.7] is a function of the total sum of squares (SST) and that the SST can be decomposed into the sum of the internal (within) variation and the between variation. The total sum of products (SPT) is the sum of the within sum of products and the between sum of products. [BIL 08] extended equation [3.7] to the bivariate case to obtain the sample covariance of $\Xi_j$ and $\Xi_{j'}$ based on the decomposition of the SPT as

$$\varsigma_{jj'} = \frac{1}{6n} \sum_{i=1}^{n} \left[ 2(a_{ij} - \bar{\xi}_j)(a_{ij'} - \bar{\xi}_{j'}) + (a_{ij} - \bar{\xi}_j)(b_{ij'} - \bar{\xi}_{j'}) + (b_{ij} - \bar{\xi}_j)(a_{ij'} - \bar{\xi}_{j'}) + 2(b_{ij} - \bar{\xi}_j)(b_{ij'} - \bar{\xi}_{j'}) \right]. \qquad [3.10]$$
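The GQ and SPT covariances in equations [3.9] and [3.10] can be written as matrix products. The following is a sketch under the stated formulas, with hypothetical function names; it relies on the fact that both diagonals should reproduce the symbolic variance of equation [3.7].

```python
import numpy as np

def _deviations(a, b):
    """Bounds centered at the symbolic mean [3.6]."""
    mean = (a + b).sum(axis=0) / (2 * a.shape[0])
    return a - mean, b - mean, mean

def gq_covariance(a, b):
    """Equation [3.9]: (1 / 3n) * sum_i G_j G_j' sqrt(Q_j Q_j')."""
    da, db, mean = _deviations(a, b)
    q = da**2 + da * db + db**2
    g = np.where((a + b) / 2.0 <= mean, -1.0, 1.0)   # sign from the midpoint
    s = g * np.sqrt(q)
    return s.T @ s / (3 * a.shape[0])

def spt_covariance(a, b):
    """Equation [3.10]: total sum of products divided by 6n."""
    da, db, _ = _deviations(a, b)
    return (2 * da.T @ da + da.T @ db + db.T @ da + 2 * db.T @ db) / (6 * a.shape[0])

rng = np.random.default_rng(3)
center = rng.standard_normal((40, 3))
radius = np.abs(rng.standard_normal((40, 3)))
a, b = center - radius, center + radius
sigma_gq, sigma_spt = gq_covariance(a, b), spt_covariance(a, b)
print(np.round(np.diag(sigma_gq), 3), np.round(np.diag(sigma_spt), 3))
```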

The definitions and calculations of the symbolic sample covariance in equations [3.8]–[3.10] are consistent with the results in the classic data case if $a_{ij} = b_{ij}$ for i = 1, · · · , n, j = 1, · · · , p. If $j = j'$, equation [3.10] reduces to the sample variance of the interval-valued variable j, as given in equation [3.7].

3.4. Projections and visualization in DR subspace

3.4.1. Linear combinations of intervals

The projections or linear combinations of intervals on the DR subspace are computed after the corresponding estimated $\hat{\beta}$s of various DR methods are obtained. Let $z_{ik} = [z^a_{ik}, z^b_{ik}]$, $z^a_{ik} \leq z^b_{ik}$, i = 1, · · · , n, k = 1, · · · , K be the ith projected interval of $\xi_i$ on the kth component. According to Moore's linear combination rule of p intervals [MOO 66], the ith projected observation is computed by

$$z_{ik} = \hat{\beta}_k^T \xi_i = \sum_{j=1}^{p} \hat{\beta}_{kj} \xi_{ij} = [z^a_{ik}, z^b_{ik}], \qquad [3.11]$$

where

$$z^a_{ik} = \sum_{j=1}^{p} \hat{\beta}_{kj} (\eta a_{ij} + (1 - \eta) b_{ij}) \quad \text{and} \quad z^b_{ik} = \sum_{j=1}^{p} \hat{\beta}_{kj} ((1 - \eta) a_{ij} + \eta b_{ij}),$$

with $\eta = I(\hat{\beta}_{kj} > 0)$. The linear combination rule is equivalent to the direct projection by the vertices of the intervals [DOU 11]; that is, the matrix of vertices (equation [3.5]) is multiplied by the kth estimated $\hat{\beta}$ to calculate the coordinates of the vertices on the kth factorial axis. For example, in the cases of VM, SE, FV, EJD, GQ, and SPT, the projections $z_{ik} = [z^a_{ik}, z^b_{ik}]$ can also be computed by

$$z_{ik} = \left[ \min_{u \in V_i} \left\{ \hat{\beta}_k^T \xi^v_u \right\}, \max_{u \in V_i} \left\{ \hat{\beta}_k^T \xi^v_u \right\} \right].$$
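Moore's rule in equation [3.11] and its equivalence with the projection of the vertices can be checked numerically. In this sketch the function `project_intervals` and the helper `brute_force` are hypothetical names, and the coefficient matrix `beta` is arbitrary rather than an estimated direction.

```python
import numpy as np
from itertools import product

def project_intervals(a, b, beta):
    """Moore's rule: a, b are (n, p) bounds, beta is (p, K); returns (n, K) bounds."""
    pos = np.maximum(beta, 0.0)      # coefficients with eta = 1
    neg = np.minimum(beta, 0.0)      # coefficients with eta = 0
    z_lo = a @ pos + b @ neg         # positive weights take a_ij, others b_ij
    z_hi = b @ pos + a @ neg
    return z_lo, z_hi

def brute_force(a, b, beta, i, k):
    """Project every vertex of hyperrectangle i on component k."""
    vals = [float(np.dot(beta[:, k], v)) for v in product(*zip(a[i], b[i]))]
    return min(vals), max(vals)

a = np.array([[1.0, 2.0], [0.0, -1.0]])
b = np.array([[3.0, 5.0], [2.0, 4.0]])
beta = np.array([[1.0, 0.5], [-1.0, 2.0]])
z_lo, z_hi = project_intervals(a, b, beta)
print(z_lo[0, 0], z_hi[0, 0])        # -4.0 1.0
```

Splitting the coefficients by sign avoids enumerating the $2^{q_i}$ vertices, which matters when p is large.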

In the case of CM, the projection $z_{ik} = [z^a_{ik}, z^b_{ik}]$ is computed by

$$z^a_{ik} = \sum_{j=1}^{p} \min_{a_{ij} \leq \xi^c_{ij} \leq b_{ij}} \left\{ (\xi^c_{ij} - \bar{\xi}^c_j) \hat{\beta}_{kj} \right\} \quad \text{and} \quad z^b_{ik} = \sum_{j=1}^{p} \max_{a_{ij} \leq \xi^c_{ij} \leq b_{ij}} \left\{ (\xi^c_{ij} - \bar{\xi}^c_j) \hat{\beta}_{kj} \right\},$$

where $\bar{\xi}^c_j$ is the mean of the jth variable of $\xi^c$. For QM and SE, the $z_{ik}$s are obtained by projecting the quantiles on the estimated components as follows:

$$z_{ik} = \left[ \min_{u \in Q_i} \{ \hat{\beta}_k^T \xi^q_u \}, \max_{u \in Q_i} \{ \hat{\beta}_k^T \xi^q_u \} \right], \qquad i = 1, \cdots, n, \quad k = 1, \cdots, K.$$

3.4.2. The graphical representation of the projected intervals in the 2D DR subspace

A main goal of dimension reduction is data visualization. Interval-valued data are projected into a two- or three-dimensional DR subspace to enable users to explore the data structure. Suppose there exists an interval object in 3D space, as shown in Figure 3.1(a). The linear projection of this interval object by the DR algorithms is equivalent to transforming the object from the input coordinates to the DR coordinates, as shown in Figure 3.1(b). Several methods have been proposed to visualize interval objects in a lower dimensional subspace, such as the 2D convex hull [VER 97] and parallel edge connected shapes (PECS) [IRP 03]. In the following, we introduce three additional types of graphical representation of the projected interval object on the DR subspace.

3.4.2.1. The maximum covering area rectangle (MCAR)

The maximum covering area rectangle (MCAR) [CAZ 97], [CHO 98] is widely used for graphical representation of interval objects on a DR subspace, owing to its simplicity. It constructs observations $\{z_{ik}\}_{i=1}^n$ as k-dimensional hyperrectangles (usually k = 2) on a DR subspace, where the sides of the hyperrectangles are parallel to the axes. As shown in Figure 3.1(c), the circumscribed rectangle with the green dashed line in the first DR plane is an example of MCAR. The main drawback of MCAR is that the interval objects are over-sized in the DR subspace with respect to the real objects in $R^p$. For example, the hyperrectangle in DR space may include data points that do not belong to the original observations. In this study, MCAR is adopted by CM, VM, and FV.

3.4.2.2. The polytopes representation

The polytopes representation for interval data was proposed in [LER 12]. The vertices of interval-valued data are projected onto a lower dimensional subspace, and the edges of points are formed in the DR subspace by connecting the vertices, as in the original input space $R^p$. For example, for the vertices of the ith interval-valued observation, the projections are

$$z^v_{V_i k} = \hat{\beta}_k^T x^v_{V_i}, \qquad i = 1, \cdots, n, \qquad k = 1, \cdots, K.$$

Figure 3.1. Visualization of a 3D interval-valued object in the linear 2D dimension-reduced subspace. (a) A 3D interval-valued object in the sample space. (b) The linear transformed object in the DR space. (c) The representation methods of a projected interval-valued object in the first DR plane: the maximum covering area rectangle (green dashed line), 2D convex hull (blue solid line), and polytopes representation (red dotted line). For a color version of this figure, see www.iste.co.uk/diday/advances.zip


The edges with red dotted lines in the first DR plane in Figure 3.1(c) show an example of the polytopes representation for an interval object. The polytopes representation improves MCAR and provides a true projection of the observed interval data. In this study, the polytopes representation is adopted by EJD, GQ, and SPT.

3.4.2.3. The arrow lines representation

[ICH 11] proposed a quantile method to perform PCA. The symbolic objects in the DR planes are displayed as arrow lines connecting the so-called quantile sub-objects. The quantile method can be used for other types of symbolic objects, such as histogram-valued data, and nominal and ordinal multi-valued type data. In this study, the arrow lines representation is adopted by QM and SE.

3.5. Some computational issues

3.5.1. Standardization of interval-valued data

Before performing the iSIR algorithm, we first standardize each interval-valued variable of the data to have zero mean and unit variance. This standardization prevents variables with larger measurement scales from dominating those with smaller scales. [CAR 06b] proposed three standardization approaches for interval-valued variables: dispersion of the interval centers, dispersion of the interval boundaries, and the global range. In this study, the general rule for standardizing interval-valued variables $\Xi_j$, j = 1, · · · , p, consists in performing the same transformation separately on both the lower and upper bounds of all intervals $[a_{ij}, b_{ij}]$, which standardizes all point values between $a_{ij}$ and $b_{ij}$ in the same linear way:

$$\xi^s_{ij} = \left[ \frac{a_{ij} - \bar{\xi}_j}{\varsigma_j}, \frac{b_{ij} - \bar{\xi}_j}{\varsigma_j} \right],$$

where $\bar{\xi}_j$ is the sample mean of interval variable $\Xi_j$ given in equation [3.6] and $\varsigma_j$ is the sample standard deviation of interval variable $\Xi_j$. The sample standard deviation $\varsigma_j$ is estimated differently for different iSIR methods. With the exception of the distributional approaches, the general estimate of $\varsigma_j$ is given by the standard deviation of the interval centers, i.e.

$$\varsigma_j = \left[ \frac{1}{n} \sum_{i=1}^{n} \left( \frac{a_{ij} + b_{ij}}{2} - \bar{\xi}_j \right)^2 \right]^{1/2}.$$

For EJD, GQ, and SPT, $\varsigma_j$ is obtained from equation [3.7].
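The standardization rule above can be sketched as follows. The function name and the boolean switch `distributional` are hypothetical: the switch selects the standard deviation from equation [3.7] (as for EJD, GQ, and SPT) instead of the standard deviation of the interval centers.

```python
import numpy as np

def standardize_intervals(a, b, distributional=False):
    """Apply the same linear map to lower and upper bounds of every interval."""
    n = a.shape[0]
    mean = (a + b).sum(axis=0) / (2 * n)                     # equation [3.6]
    if distributional:
        # Symbolic variance of equation [3.7], written with the mean.
        var = (b**2 + a * b + a**2).sum(axis=0) / (3 * n) - mean**2
    else:
        # Variance of the interval centers (divisor n).
        var = (((a + b) / 2.0 - mean) ** 2).sum(axis=0) / n
    sd = np.sqrt(var)
    return (a - mean) / sd, (b - mean) / sd

rng = np.random.default_rng(4)
center = 5.0 + 3.0 * rng.standard_normal((60, 2))
radius = np.abs(rng.standard_normal((60, 2)))
a, b = center - radius, center + radius
a_s, b_s = standardize_intervals(a, b)
a_d, b_d = standardize_intervals(a, b, distributional=True)
print(np.round(((a_s + b_s) / 2).var(axis=0), 6))
```

Because the map is increasing, the order of the bounds is preserved, so each standardized pair is still a valid interval.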

3.5.2. The slicing schemes for iSIR

The slicing step in the classic SIR algorithm partitions data based on numerical responses $y_i$ and produces class labels for the data. If the response is categorical, the categories provide a natural slicing. If no ground truth or class labels are available, [SET 04] suggested applying K-means clustering to the numeric data table to produce pseudo slices. For interval-valued data, the aforementioned rules can be applied directly to categorical or numerical responses. If the responses are intervals, one easy way to produce slices is to slice the centers of the response intervals, $\psi^c_i = (c_i + d_i)/2 \in R$. Alternatively, applying clustering algorithms to $\psi^c_i$ is a straightforward way to obtain slices. If the desired responses $\psi_i$ or the class labels of $\xi_i$ are not available, we can apply the existing clustering algorithms for interval-valued data [BIL 06], [CAR 06a], [CAR 07] to obtain pseudo slices. When SIR is applied to data for which the response is unavailable, the only free parameter is the number of pseudo slices, which has to be specified manually. As noted in [LI 91] and [LIQ 12], SIR appears to be relatively insensitive to the number of slices H, whose choice is not as crucial as that of a smoothing parameter in most nonparametric regression or density estimation problems. As a result, SIR can be easily applied to real-world data without much effort on parameter tuning. On the other hand, K-means is usually suggested as a tool for obtaining the data partitions, since it is the most popular choice in practice when there is no prior knowledge about the data. Using such clustering provides a simple and quick way of forming the data slices. More information concerning the choice of the number of slices H and the SIR estimation procedure can be found in [LI 91] and [CHE 98].

3.5.3. The evaluation of DR components

In the simulation studies, we have used $R^2(b)$ (equation [3.12]) as a validity index since the true EDR directions are known. For the real data example, the eigenvalues can be used as an effective index for various DR methods of iPCA. In classical PCA, the proportion of variation explained by the ith principal component is given by the ratio of the ith eigenvalue to the sum of all eigenvalues. On the other hand, the eigenvalues of SIR in equation [3.2] can be interpreted as the associated $R^2$ values in multiple linear regression [CHE 98]. The key point in SIR is the generalized eigen-decomposition of the estimate of Cov(E[X|Y]) for $X \in R^p$ and $Y \in R$. Each eigenvalue expresses the proportion of the between-classes variance explained by the DR components of SIR; that is, the eigenvalues reflect the amount of variance explained in the grouping variable. Therefore, to some extent, the eigenvalues obtained from PCA and SIR are not comparable. The eigenvalues obtained from SIR are seldom used as indices to evaluate the performance of the DR components.


Instead, the evaluation depends on the application. For example, [YAO 13] used the classification error rates as the validity indices in the classification problems based on the DR components of SIR. The quality assessment for various DR methods of interval-valued data using comparable validation indices requires further investigation.

3.6. Simulation studies

In this section, we evaluate the performance of the various iSIR implementations described in section 3.3 in estimating the EDR directions on a number of synthetic datasets. Three models, namely the linear model, the quadratic model, and the rational model, described in equations [6.1]–[6.3] of [LI 91], are employed to generate the original numerical data:

$$y = x_1 + x_2 + x_3 + x_4 + 0 \cdot x_5 + \epsilon, \qquad [6.1]$$

$$y = x_1 (x_1 + x_2 + 1) + \sigma \cdot \epsilon, \qquad [6.2]$$

and

$$y = \frac{x_1}{0.5 + (x_2 + 1.5)^2} + \sigma \cdot \epsilon. \qquad [6.3]$$

The synthetic interval-valued data are then constructed under two different scenarios. One is based on the aggregation of the numerical data generated from the above models. The other is based on interval arithmetic, such that the intervals are constructed from midranges and centers of the corresponding intervals.

3.6.1. Scenario 1: aggregated data

A possible source of interval data is the aggregation of a large numerical dataset. Assume that $x_1, \cdots, x_p$ and ϵ follow i.i.d. standard normal distributions in models [6.1]–[6.3] of [LI 91]. We take p = 5 in model [6.1] with a sample size n = 1,000 and p = 10, σ = 0.5 and σ = 1.0 in models [6.2] and [6.3] with the sample size n = 4,000. Interval-valued observations were generated by aggregating consecutive sorted y's of size 10 from the above numerical data. That is, the generated numerical datasets were first sorted by the values of the y's; then, we aggregated consecutive individual observations of size 10 into interval-valued observations that span the minimum value to the maximum value. Therefore, the sample sizes for the interval-valued data were 100, 400, and 400 for models [6.1]–[6.3].

3.6.2. Scenario 2: data based on interval arithmetic

Let $\xi_i = [a_i, b_i]$ and $\xi_j = [a_j, b_j]$ be real intervals and ◦ one of the basic operations, namely, addition, subtraction, multiplication, and division, that is, ◦ ∈ {+, −, ×, /}. Then, the corresponding operations for intervals $\xi_i$ and $\xi_j$ can be defined by $\xi_i \circ \xi_j = \{x \circ y \,|\, x \in \xi_i, y \in \xi_j\}$, where we assume $0 \notin \xi_j$ in the case of division. The interval ξ = [a, b] can be expressed by pairs of midpoints and radii (or midranges), $\{\xi^m = (a + b)/2, \xi^r = (b - a)/2\}$. For practical applications, the above four operations can be simplified further [MOO 66], [NEU 90]:

– The constant multiple rule: $c\xi_i = [ca_i, cb_i]$.
– The sum rule: $\xi_i + \xi_j = [a_i + a_j, b_i + b_j]$, expressed by $\{\xi^m_i + \xi^m_j, \xi^r_i + \xi^r_j\}$.
– The difference rule: $\xi_i - \xi_j = [a_i - b_j, b_i - a_j]$.
– The product rule: $\xi_i \times \xi_j = [\min c_{ij}, \max c_{ij}]$, where $c_{ij} = \{a_i a_j, a_i b_j, b_i a_j, b_i b_j\}$.
– The division rule: $\xi_i / \xi_j = [a_i, b_i]/[a_j, b_j] = [a_i, b_i] \times [1/b_j, 1/a_j]$, if $0 \notin [a_j, b_j]$.

Denote the midranges and centers of the explanatory interval variable $x_i$ and noise interval $\epsilon_i$ by $x^r_i$, $x^c_i$ and $\epsilon^r_i$, $\epsilon^c_i$, respectively. Assume that the centers $x^c_i$ and $\epsilon^c_i$ follow i.i.d. standard normal distributions, and the midranges $x^r_i$ and $\epsilon^r_i$ follow i.i.d. folded standard normal distributions. Note that the midranges and the centers are independent. The interval variable $x_i$ and the interval noise $\epsilon_i$ are constructed by $[x^c_i - x^r_i, x^c_i + x^r_i]$ and $[\epsilon^c_i - \epsilon^r_i, \epsilon^c_i + \epsilon^r_i]$, respectively. The interval response $y_i$ is generated according to models [6.1]–[6.3] based on the interval arithmetic. As in the previous subsection, we consider p = 5 in model [6.1] with the sample size n = 100 and p = 10, σ = 0.5 and σ = 1.0 in models [6.2] and [6.3] with the sample size 400.

3.6.3. Results

The generated interval-valued data are used to estimate the EDR directions. Any vector proportional to β = (1, 1, 1, 1, 0) is an EDR direction in model [6.1]. The true EDR directions in models [6.2] and [6.3] are the vectors in the plane generated by (1, 0, · · · , 0) and (0, 1, 0, · · · , 0) in 10-dimensional Euclidean space.
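For reference, the two constructions of section 3.6 that produce these interval-valued data can be sketched for the linear model [6.1]. The seed, the variable names, and the use of only the sum rule (which is all that model [6.1] requires, since the term $0 \cdot x_5$ contributes the interval [0, 0]) are assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Scenario 1: aggregate consecutive sorted observations into min-max intervals.
n, p, block = 1000, 5, 10
x = rng.standard_normal((n, p))
y = x[:, :4].sum(axis=1) + rng.standard_normal(n)       # model [6.1]
order = np.argsort(y)                                   # sort by response
xs = x[order].reshape(n // block, block, p)             # groups of 10 rows
ys = y[order].reshape(n // block, block)
a, b = xs.min(axis=1), xs.max(axis=1)                   # bounds of the predictors
c, d = ys.min(axis=1), ys.max(axis=1)                   # bounds of the response

# Scenario 2: centers are N(0, 1), midranges are folded N(0, 1).
m = 100
xc, xr = rng.standard_normal((m, p)), np.abs(rng.standard_normal((m, p)))
ec, er = rng.standard_normal(m), np.abs(rng.standard_normal(m))
a2, b2 = xc - xr, xc + xr
# Sum rule: lower bounds add up, upper bounds add up.
c2 = a2[:, :4].sum(axis=1) + (ec - er)
d2 = b2[:, :4].sum(axis=1) + (ec + er)

print(a.shape, c.shape, a2.shape, c2.shape)
```

Models [6.2] and [6.3] would additionally need the product and division rules, which are not covered by this sketch.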
For model [6.1], we calculate the means and the standard deviations (in parentheses) of $\hat{\beta}_1$ for H = 10 after 100 replicates. For models [6.2] and [6.3], we use the squared multiple correlation coefficient [LI 91] between the projected variable $b^t x$ and the space $S_{Y|X}$ spanned by the true EDR directions as a measure for evaluating the effectiveness of the estimated EDR directions,

$$R^2(b) = \max_{\beta \in S_{Y|X}} \frac{(b \hat{\Sigma}_x \beta^t)^2}{(b \hat{\Sigma}_x b^t)(\beta \hat{\Sigma}_x \beta^t)}. \qquad [3.12]$$
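The maximization in equation [3.12] has a closed form when the true subspace is given by a basis matrix: by the Cauchy–Schwarz inequality, the maximum is the squared multiple correlation of $b^t x$ on the basis directions. The sketch below uses that closed form; the function name and the identity covariance in the example are assumptions.

```python
import numpy as np

def r_squared(b, basis, sigma_x):
    """R^2(b) of [3.12] for the subspace spanned by the columns of `basis`."""
    sb = sigma_x @ b
    cross = basis.T @ sb                       # b Sigma beta^t for each basis vector
    gram = basis.T @ sigma_x @ basis           # Sigma-inner products within the basis
    return float(cross @ np.linalg.solve(gram, cross) / (b @ sb))

p = 10
sigma_x = np.eye(p)                            # assumed covariance for the example
basis = np.eye(p)[:, :2]                       # plane spanned by e_1 and e_2
inside = np.zeros(p); inside[:2] = 1.0         # lies in the true plane
outside = np.zeros(p); outside[2] = 1.0        # orthogonal to the true plane
half = np.zeros(p); half[[0, 2]] = 1.0         # half in, half out
print(r_squared(inside, basis, sigma_x),
      r_squared(outside, basis, sigma_x),
      r_squared(half, basis, sigma_x))
```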

The means and standard deviations of $R^2(\hat{\beta}_1)$ and $R^2(\hat{\beta}_2)$ were computed after 100 replicates with H = 10. The comparison results for the two scenarios are shown in Tables 3.1–3.3. As we can see from Table 3.1, for the linear model [6.1], the estimates of the EDR directions by CM, VM, FV, EJD, and GQ for both scenarios are quite good, and the means are all proportional to the target vector (1, 1, 1, 1, 0). Nevertheless, FV has high standard deviations in both scenarios, and GQ has high standard deviations in the aggregation case. On the other hand, QM (m = 4) and SE perform poorly in both scenarios for the linear model [6.1]. Note that SPT performs poorly in the aggregation scenario but well in the interval arithmetic scenario.

Tables 3.2 and 3.3 list the means and standard deviations of $R^2(\hat{\beta}_1)$ and $R^2(\hat{\beta}_2)$ of the various iSIR methods for the quadratic model [6.2] and the rational model [6.3]. On average, CM, VM, FV, EJD, GQ, and SPT perform better than the other methods for models [6.2] and [6.3] in the two scenarios regardless of the noise level. Moreover, these methods have higher $R^2$ values in the data aggregation case than in the interval arithmetic case. By contrast, among all methods, QM and SE have the worst performance in both scenarios. In addition, QM and SE perform slightly better in the interval arithmetic case than in the aggregation case. In summary, the performance of these eight methods for estimating the true EDR directions for models [6.2] and [6.3] is similar to that of model [6.1]. For comparison, we performed the classic SIR on the numerical data without data aggregation in the first scenario and on the generated data, where all $x^r_i$ and $\epsilon^r_i$ were set to zero, in the second scenario. The results of the EDR estimates and the squared multiple correlation coefficient are shown at the bottom of Tables 3.1–3.3.

3.7. A real data example: face recognition data

Face recognition data have been widely used in the SDA literature to illustrate iPCA [DOU 11], [LER 96], [LER 12]. The data include six measurements characterizing a man's face (Figure 3.2, [CHO 98]): the outer (inner) length between the eyes [AD (BC)], the length from the outer right (left) eye to the upper middle lip between the nose and mouth [AH (DH)], and the length from the upper middle lip to the outside of the mouth on the right (left) side of the mouth [EH (GH)]. The distance measure is expressed as the number of pixels on a face image. Nine men were measured, each with three observations (cameras), resulting in a total of 27 observations. The measurements for each observation were obtained from a sequence of images covering a range of values, resulting in interval-valued variables. Figure 3.3 displays the index plots for each measurement for all observations, where the x-axis is the range of values for the six interval measurements and the y-axis is the index of the observation. It is difficult to discern a meaningful grouping pattern from this plot. Figure 3.4 shows the scatterplot matrix for the six variables. An immediate observation is that the three faces of each individual cluster together, thus


(a) Aggregation

iSIR   $\hat{\beta}_{11}$      $\hat{\beta}_{12}$      $\hat{\beta}_{13}$      $\hat{\beta}_{14}$      $\hat{\beta}_{15}$
CM     0.502 (0.080)    0.510 (0.089)    0.494 (0.088)    0.494 (0.089)     0.007 (0.095)
VM     0.501 (0.046)    0.510 (0.046)    0.491 (0.049)    0.497 (0.040)    −0.002 (0.040)
QM     0.287 (0.136)    0.272 (0.134)    0.246 (0.134)    0.258 (0.127)    −0.846 (0.162)
SE     0.287 (0.136)    0.272 (0.134)    0.246 (0.134)    0.258 (0.127)    −0.846 (0.162)
FV     0.492 (0.147)    0.488 (0.153)    0.522 (0.167)    0.495 (0.161)     0.045 (0.548)
EJD    0.502 (0.046)    0.510 (0.047)    0.492 (0.050)    0.497 (0.040)    −0.001 (0.041)
GQ     0.504 (0.126)    0.513 (0.136)    0.498 (0.138)    0.485 (0.110)     0.001 (0.108)
SPT    0.343 (0.087)    0.346 (0.086)    0.355 (0.093)    0.346 (0.082)    −0.719 (0.021)
SIR    0.499 (0.014)    0.500 (0.015)    0.501 (0.017)    0.500 (0.014)     0.001 (0.017)

(b) Interval arithmetic

iSIR   $\hat{\beta}_{11}$      $\hat{\beta}_{12}$      $\hat{\beta}_{13}$      $\hat{\beta}_{14}$      $\hat{\beta}_{15}$
CM     0.494 (0.054)    0.505 (0.052)    0.499 (0.050)    0.502 (0.048)    −0.000 (0.065)
VM     0.488 (0.077)    0.515 (0.073)    0.503 (0.072)    0.494 (0.066)     0.006 (0.073)
QM     0.414 (0.173)    0.423 (0.264)    0.389 (0.246)    0.375 (0.241)    −0.598 (0.235)
SE     0.414 (0.173)    0.423 (0.264)    0.389 (0.246)    0.375 (0.241)    −0.598 (0.235)
FV     0.488 (0.094)    0.506 (0.105)    0.502 (0.108)    0.503 (0.097)    −0.003 (0.104)
EJD    0.490 (0.061)    0.510 (0.058)    0.502 (0.056)    0.498 (0.050)     0.003 (0.064)
GQ     0.489 (0.064)    0.508 (0.057)    0.501 (0.058)    0.502 (0.050)    −0.002 (0.066)
SPT    0.498 (0.053)    0.503 (0.054)    0.494 (0.056)    0.504 (0.049)    −0.000 (0.053)
SIR    0.494 (0.054)    0.505 (0.052)    0.499 (0.050)    0.502 (0.048)    −0.000 (0.065)

Table 3.1. Mean and standard deviations (in parentheses) of $\hat{\beta}_1$ for the linear model [6.1] by various iSIR implementations after 100 replications (H = 10). The interval data are constructed based on (a) data aggregation and (b) linear combinations of intervals


(a) Aggregation

       σ = 0.5                             σ = 1.0
iSIR   $R^2(\hat{\beta}_1)$      $R^2(\hat{\beta}_2)$      $R^2(\hat{\beta}_1)$      $R^2(\hat{\beta}_2)$
CM     0.956 (0.036)    0.924 (0.049)    0.963 (0.022)    0.896 (0.051)
VM     0.986 (0.011)    0.968 (0.017)    0.977 (0.015)    0.930 (0.033)
QM     0.371 (0.035)    0.260 (0.019)    0.266 (0.017)    0.180 (0.016)
SE     0.371 (0.035)    0.260 (0.019)    0.266 (0.017)    0.180 (0.016)
FV     0.840 (0.081)    0.777 (0.079)    0.872 (0.071)    0.703 (0.179)
EJD    0.984 (0.014)    0.964 (0.021)    0.977 (0.015)    0.927 (0.035)
GQ     0.977 (0.015)    0.956 (0.022)    0.969 (0.017)    0.921 (0.037)
SPT    0.953 (0.027)    0.841 (0.082)    0.935 (0.035)    0.741 (0.155)
SIR    0.995 (0.004)    0.991 (0.005)    0.992 (0.004)    0.979 (0.009)

(b) Interval arithmetic

       σ = 0.5                             σ = 1.0
iSIR   $R^2(\hat{\beta}_1)$      $R^2(\hat{\beta}_2)$      $R^2(\hat{\beta}_1)$      $R^2(\hat{\beta}_2)$
CM     0.955 (0.035)    0.884 (0.056)    0.935 (0.032)    0.818 (0.091)
VM     0.950 (0.035)    0.880 (0.061)    0.932 (0.032)    0.817 (0.086)
QM     0.692 (0.043)    0.637 (0.061)    0.683 (0.047)    0.594 (0.095)
SE     0.692 (0.043)    0.637 (0.061)    0.683 (0.047)    0.594 (0.095)
FV     0.940 (0.034)    0.761 (0.142)    0.917 (0.042)    0.636 (0.194)
EJD    0.954 (0.034)    0.883 (0.058)    0.935 (0.031)    0.820 (0.088)
GQ     0.953 (0.037)    0.884 (0.055)    0.935 (0.032)    0.820 (0.090)
SPT    0.944 (0.037)    0.704 (0.217)    0.930 (0.050)    0.562 (0.240)
SIR    0.949 (0.035)    0.879 (0.060)    0.921 (0.041)    0.784 (0.123)

Table 3.2. Mean and standard deviations (in parentheses) of $R^2(\hat{\beta}_1)$ and $R^2(\hat{\beta}_2)$ for the quadratic model [6.2] by various iSIR implementations after 100 replications (H = 10). The interval data are constructed based on (a) data aggregation and (b) linear combinations of intervals


(a) Aggregation

| iSIR | R²(β̂₁), σ = 0.5 | R²(β̂₂), σ = 0.5 | R²(β̂₁), σ = 1.0 | R²(β̂₂), σ = 1.0 |
|---|---|---|---|---|
| CM | 0.987 (0.007) | 0.953 (0.020) | 0.970 (0.012) | 0.813 (0.106) |
| VM | 0.993 (0.004) | 0.973 (0.013) | 0.976 (0.011) | 0.857 (0.081) |
| QM | 0.336 (0.012) | 0.231 (0.011) | 0.215 (0.012) | 0.140 (0.022) |
| SE | 0.336 (0.012) | 0.231 (0.011) | 0.215 (0.012) | 0.140 (0.022) |
| FV | 0.982 (0.008) | 0.796 (0.113) | 0.948 (0.023) | 0.548 (0.225) |
| EJD | 0.993 (0.004) | 0.972 (0.013) | 0.977 (0.010) | 0.854 (0.083) |
| GQ | 0.984 (0.007) | 0.962 (0.017) | 0.973 (0.013) | 0.849 (0.086) |
| SPT | 0.983 (0.010) | 0.909 (0.041) | 0.962 (0.025) | 0.715 (0.162) |
| SIR | 0.995 (0.002) | 0.987 (0.006) | 0.989 (0.005) | 0.958 (0.021) |

(b) Interval arithmetic

| iSIR | R²(β̂₁), σ = 0.5 | R²(β̂₂), σ = 0.5 | R²(β̂₁), σ = 1.0 | R²(β̂₂), σ = 1.0 |
|---|---|---|---|---|
| CM | 0.817 (0.127) | 0.762 (0.132) | 0.763 (0.132) | 0.598 (0.180) |
| VM | 0.811 (0.126) | 0.761 (0.124) | 0.748 (0.143) | 0.599 (0.169) |
| QM | 0.583 (0.118) | 0.559 (0.102) | 0.536 (0.107) | 0.428 (0.138) |
| SE | 0.583 (0.118) | 0.559 (0.102) | 0.536 (0.107) | 0.428 (0.138) |
| FV | 0.756 (0.126) | 0.703 (0.129) | 0.697 (0.155) | 0.532 (0.205) |
| EJD | 0.817 (0.124) | 0.765 (0.127) | 0.756 (0.138) | 0.598 (0.173) |
| GQ | 0.809 (0.130) | 0.756 (0.126) | 0.759 (0.132) | 0.598 (0.180) |
| SPT | 0.944 (0.040) | 0.786 (0.123) | 0.906 (0.058) | 0.586 (0.203) |
| SIR | 0.942 (0.031) | 0.855 (0.073) | 0.883 (0.063) | 0.530 (0.230) |

Table 3.3. Means and standard deviations (in parentheses) of R²(β̂₁) and R²(β̂₂) for the quadratic model [6.3] by various iSIR implementations after 100 replications. The interval data are constructed based on (a) data aggregation and (b) linear combinations of intervals


validating their within-subject coherence. The strong positive correlation between the pair of measurements {AD, BC} is expected, as both are related to face width. Similarly, due to facial symmetry, the pairs {AH, DH} and {EH, GH} are strongly positively correlated. By contrast, a negative correlation is observed for the pairs {AH, GH} and {DH, EH}, which measure two complementary distances on opposite sides of a person’s face.

Figure 3.2. The six face measurements for the face recognition dataset [Dou 11].

The goal of using PCA is to explore the grouping structure of faces based on the identification of facial features. However, the information on the men’s identity was not used in iPCA. We used this information as the discrete response ỹ in iSIR. The first factorial planes obtained by iPCA and iSIR for the eight methods, with their corresponding eigenvalues, are shown in Figure 3.5, where the three faces belonging to the same person share the same color and tend to cluster together. iPCA looks for axes that maximize the variability across all 27 observations, regardless of whether some observations belong to the same person. In contrast, iSIR looks for axes that maximize the variability across the nine men and, hence, retains information about the dependency between the three observations of each man. Across the eight methods, the first iPCA component accounts for 42.7%–46.8% of the variation, while the first two PCs together account for 72.7%–81.9% of the total variation. Among them, the SE method of iPCA slightly outperforms the others. On the other hand, the proportion of the between-class (between-men) variance explained by the first two DR components of the eight iSIR methods ranges from 35% to 47.6%, with VM performing best. Although the amount of variance in the grouping variable explained by the eight iSIR methods was low and some clusters of faces were not clearly distinguishable, iSIR yields more compact clusters than iPCA.
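The distinction between total variability and between-person variability can be made concrete on a single variable. The sketch below is only an illustration under stated assumptions: the interval values, the person labels P1 to P3, and the function names are hypothetical, and the intervals are reduced to their midpoints in the spirit of the centers method (CM). It is not the chapter's iSIR implementation, which works on all variables jointly through an eigen-decomposition.

```python
from statistics import mean

# Hypothetical interval observations of ONE variable: 3 people, 3 images each.
intervals = {
    "P1": [(10.0, 12.0), (10.5, 12.5), (9.5, 11.5)],
    "P2": [(14.0, 16.0), (14.5, 16.5), (13.5, 15.5)],
    "P3": [(10.0, 14.0), (12.0, 16.0), (11.0, 13.0)],
}


def centers(obs):
    """Centers method (CM): replace each interval by its midpoint."""
    return [(lo + hi) / 2.0 for lo, hi in obs]


def between_class_share(groups):
    """Share of the total sum of squares that lies between the group means."""
    mids = {g: centers(obs) for g, obs in groups.items()}
    all_values = [v for vals in mids.values() for v in vals]
    grand_mean = mean(all_values)
    # Total variability: what a PCA-type criterion looks at.
    total_ss = sum((v - grand_mean) ** 2 for v in all_values)
    # Between-group variability: what a SIR-type criterion looks at.
    between_ss = sum(
        len(vals) * (mean(vals) - grand_mean) ** 2 for vals in mids.values()
    )
    return between_ss / total_ss


share = between_class_share(intervals)
print(f"between-person share of the variance: {share:.3f}")
```

A share close to 1 means that the repeated observations of each person are tightly grouped relative to the differences between persons, which is the situation in which slicing by identity is most informative.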


Figure 3.3. The index plot for the six variables of the face recognition data. The colors represent the nine men. For a color version of this figure, see www.iste.co.uk/diday/advances.zip


Figure 3.4. The scatterplot matrix for the six variables of the face recognition data. The colors represent the nine men. For a color version of this figure, see www.iste.co.uk/diday/advances.zip


Figure 3.5. The 2D projection plot of the face recognition dataset achieved by various interval PCA and SIR methods. The two percentages beside the names of the DR methods are the two largest eigenvalues, associated with the first two DR components (see the text for an explanation). The colors represent the nine men. For a color version of this figure, see www.iste.co.uk/diday/advances.zip


The first DR component (DR1) of iPCA and iSIR can be interpreted as an overall measure of face length, where longer faces have smaller DR1 values. The second DR component (DR2) of iPCA and iSIR is interpreted as a measure of face width, where wider faces have larger DR2 values. Five distinct groups of the faces of the nine men could be roughly identified for the eight iPCA methods: {ROM}, {ISA}, {PHI, JPL, HUS}, {LOT, KHA}, and {FRA, INC}. In the MCAR representation of the 27 observations, the distinct clusters are harder to discern than with the projected polytopes and the arrow lines. Note that INC’s face has higher variability than the others on the second DR component.

For the iSIR methods, the interpretation of the DR components using the eight methods is similar to that of iPCA, where DR1 describes the face length and DR2 characterizes the face width. By contrast, the sets of observations belonging to the same person are more compact than in the iPCA methods, and the between-person distances are larger in iSIR. Specifically, the polytopes and the arrow lines of HUS’s face have distinct features and form a cluster that is completely separated from the others in Figure 3.5(b). The MCAR and the arrow lines of INC’s face in the first factorial plane of iSIR are more separable from the others than those in the first factorial plane of iPCA. For iPCA, the observations of HUS, JPL, and PHI overlap considerably, whereas HUS is more separable from JPL and PHI in iSIR. The results obtained by the GQ method of iSIR are not compatible with any of the other methods. The arrow-line representation for the QM and SE methods of iSIR shrinks to points. Clearly, the polytope and arrow-line representations improve on the MCAR representation of the observations in the DR subspace.

3.8. Conclusion and discussion

In this study, we generalized the classic SIR to interval-valued symbolic data through the quantification and distributional approaches. For the quantification approaches, the interval SIR algorithms were implemented with CM, VM, QM, SE, and FV, which are methods that apply the classic SIR to transformed data matrices. For the distributional approaches, EJD, GQ, and SPT were used to compute the sample-weighted covariance matrix of the interval-valued data, and the interval SIR algorithms were realized through generalized eigen-decomposition. iSIR preserves the computational advantages of the classic SIR and is capable of capturing the intrinsic structure of interval-valued data. Eight methods with their corresponding graphical representations in a lower dimensional subspace were compared for both iPCA and iSIR. The results for both simulated and real data showed that iSIR is a powerful and promising feature extraction approach that achieves superior performance to iPCA. In contrast with iPCA, the main strength of the iSIR algorithm is that it enables the utilization of the class information. Therefore, iSIR provides data visualization capabilities for the presence of clusters or outliers, yielding superior discriminatory


power in lower dimensional subspace. When the entities of interest in a study are classes of interval-valued symbolic observations sharing similar attributes rather than individuals, iSIR is suggested as a new DR tool to analyze such data with low computational cost and possibly substantial performance gains.

Regarding which DR method should be used in practice, some guidance is provided as follows. If one would like to explore the structure of interval-valued data quickly, CM is the preferable first choice because it is relatively easy to implement and computationally efficient. However, CM ignores the variation within intervals. VM is simple as well, although it is not recommended for data with a large number of variables because the data matrix is artificially inflated. In addition, VM treats the vertices as independent observations, which may not always hold. The advantage of the distributional approaches is that the resulting symbolic covariance matrix fully utilizes all the information in the data. They assume that values within an interval are uniformly distributed across the interval; however, this is often not the case. The unique feature of QM is that it can handle histograms, nominal multi-valued data, and other types of symbolic data simultaneously. In summary, there is no consensus in the literature on which method is best in general. Each method should be considered according to the data structure, the tools at hand, and the aims of the analysis.

3.9. References

[BEC 03] BECKER C., FRIED R., “Sliced inverse regression for high-dimensional time series”, Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft für Klassifikation, University of Munich, Munich, pp. 3–11, 2003.

[BER 00] BERTRAND P., GOUPIL F., “Descriptive statistics for symbolic data”, in BOCK H.-H., DIDAY E. (eds), Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data, pp. 103–124, Springer-Verlag, Berlin, 2000.

[BIL 03] BILLARD L., DIDAY E., “From the statistics of data to the statistics of knowledge: symbolic data analysis”, Journal of the American Statistical Association, vol. 98, no. 462, pp. 470–487, 2003.

[BIL 06] BILLARD L., DIDAY E., Symbolic Data Analysis: Conceptual Statistics and Data Mining, John Wiley & Sons, Ltd, Chichester, UK, 2006.

[BIL 07] BILLARD L., “Dependencies and variation components of symbolic interval-valued data”, in BRITO P., BERTRAND P., CUCUMEL G. et al. (eds), Selected Contributions in Data Analysis and Classification, pp. 3–12, Springer-Verlag, Berlin, 2007.


[BIL 08] BILLARD L., “Sample covariance functions for complex quantitative data”, in MIZUTA M., NAKANO J. (eds), Proceedings of the International Association of Statistical Computing Conference, Yokohama, pp. 157–163, 2008.

[CAR 06a] DE CARVALHO F.A.T., SOUZA R., CHAVENT M. et al., “Adaptive Hausdorff distances and dynamic clustering of symbolic interval data”, Pattern Recognition Letters, vol. 27, no. 3, pp. 167–179, 2006.

[CAR 06b] DE CARVALHO F.A.T., BRITO P., BOCK H.H., “Dynamic clustering for interval data based on L2 distance”, Computational Statistics, vol. 21, no. 2, pp. 231–250, 2006.

[CAR 07] DE CARVALHO F.A.T., “Fuzzy c-means clustering methods for symbolic interval data”, Pattern Recognition Letters, vol. 28, pp. 423–437, 2007.

[CAZ 97] CAZES P., CHOUAKRIA A., DIDAY E. et al., “Extension de l’analyse en composantes principales à des données de type intervalle”, Revue de Statistique Appliquée, vol. 45, pp. 5–24, 1997.

[CHE 98] CHEN C.H., LI K.C., “Can SIR be as popular as multiple linear regression?”, Statistica Sinica, vol. 8, pp. 289–316, 1998.

[CHO 98] CHOUAKRIA A., Extension de l’analyse en composantes principales à des données de type intervalle, PhD thesis, University of Paris IX Dauphine, Paris, France, 1998.

[CHO 00] CHOUAKRIA A., CAZES P., DIDAY E., “Symbolic principal component analysis”, in BOCK H.-H., DIDAY E. (eds), Analysis of Symbolic Data, Springer-Verlag, Berlin, 2000.

[COO 94] COOK R.D., “On the interpretation of regression plots”, Journal of the American Statistical Association, vol. 89, pp. 177–190, 1994.

[COO 96] COOK R.D., “Graphics for regressions with a binary response”, Journal of the American Statistical Association, vol. 91, pp. 983–992, 1996.

[COO 98] COOK R.D., Regression Graphics, Wiley, New York, 1998.

[COO 01] COOK R.D., YIN X., “Dimension-reduction and visualization in discriminant analysis”, Australian & New Zealand Journal of Statistics, vol. 43, pp. 147–200, 2001.

[COO 03] COOK R.D., “Dimension reduction and graphical exploration in regression including survival analysis”, Statistics in Medicine, vol. 22, pp. 1399–1413, 2003.

[DID 88] DIDAY E., “The symbolic approach in clustering and related methods of data analysis: the basic choices”, in BOCK H.-H. (ed.), Classification and Related Methods of Data Analysis, Proceedings of IFCS’87, pp. 673–684, North Holland, Amsterdam, 1988.

[DID 03] DIDAY E., ESPOSITO F., “An introduction to symbolic data analysis and the SODAS software”, Intelligent Data Analysis, vol. 7, no. 6, pp. 583–601, 2003.


[DOU 11] DOUZAL-CHOUAKRIA A., BILLARD L., DIDAY E., “Principal component analysis for interval-valued observations”, Statistical Analysis and Data Mining, vol. 4, pp. 229–246, 2011.

[DUR 04] D’URSO P., GIORDANI P., “A least squares approach to principal component analysis for interval valued data”, Chemometrics and Intelligent Laboratory Systems, vol. 70, pp. 179–192, 2004.

[FER 05] FERRE L., YAO A.F., “Smoothed functional inverse regression”, Statistica Sinica, vol. 15, pp. 665–683, 2005.

[GIO 06] GIOIA F., LAURO N.C., “Principal component analysis on interval data”, Computational Statistics, vol. 21, pp. 343–363, 2006.

[ICH 11] ICHINO M., “The quantile method for symbolic principal component analysis”, Statistical Analysis and Data Mining, vol. 4, no. 2, pp. 184–198, 2011.

[IRP 03] IRPINO A., LAURO C., VERDE R., “Visualizing symbolic data by closed shapes”, in SCHADER M., GAUL W., VICHI M. (eds), Between Data Science and Applied Data Analysis, pp. 244–251, Springer-Verlag, Berlin, 2003.

[IRP 06] IRPINO A., “‘Spaghetti’ PCA analysis: an extension of principal components analysis to time dependent interval data”, Pattern Recognition Letters, vol. 27, pp. 504–513, 2006.

[LAU 00] LAURO N.C., PALUMBO F., “Principal component analysis of interval data: a symbolic analysis approach”, Computational Statistics, vol. 15, no. 1, pp. 73–87, 2000.

[LAU 06] LAURO N.C., GIOIA F., “Dependence and interdependence analysis for interval-valued variables”, in BATAGELJ V., BOCK H.-H., FERLIGOJ A. et al. (eds), Data Science and Classification, pp. 171–183, Springer-Verlag, Berlin, 2006.

[LAU 08] LAURO N.C., VERDE R., IRPINO A., “Principal component analysis of symbolic data described by intervals”, in DIDAY E., NOIRHOMME-FRAITURE M. (eds), Symbolic Data Analysis and the SODAS Software, pp. 279–311, John Wiley & Sons, Ltd, Chichester, UK, 2008.

[LER 96] LEROY B., CHOUAKRIA A., HERLIN I. et al., “Approche géométrique et classification pour la reconnaissance de visage, reconnaissance des formes et intelligence artificielle”, INRIA, IRISA and CNRS, France, pp. 548–557, 1996.

[LER 12] LE-RADEMACHER J., BILLARD L., “Symbolic-covariance principal component analysis and visualization for interval-valued data”, Journal of Computational and Graphical Statistics, vol. 21, no. 2, pp. 413–432, 2012.

[LI 91] LI K.C., “Sliced inverse regression for dimension reduction”, Journal of the American Statistical Association, vol. 86, pp. 316–342, 1991.

[LI 99] LI K.C., WANG J.L., CHEN C.H., “Dimension reduction for censored regression data”, Annals of Statistics, vol. 27, no. 1, pp. 1–23, 1999.


[LI 09] LI L., YIN X., “Longitudinal data analysis using sufficient dimension reduction method”, Computational Statistics and Data Analysis, vol. 53, pp. 4106–4115, 2009.

[LIM 08] LIMA NETO E.D.A., DE CARVALHO F.D.A., “Centre and Range method for fitting a linear regression model to symbolic interval data”, Computational Statistics & Data Analysis, vol. 52, pp. 1500–1515, 2008.

[LIQ 12] LIQUET B., SARACCO J., “A graphical tool for selecting the number of slices and the dimension of the model in SIR and SAVE approaches”, Computational Statistics, vol. 27, no. 1, pp. 103–125, 2012.

[MOO 66] MOORE R.E., Interval Analysis, Prentice Hall, Englewood Cliffs, NJ, 1966.

[NEU 90] NEUMAIER A., Interval Methods for Systems of Equations, Cambridge University Press, Cambridge, UK, 1990.

[PAL 03] PALUMBO F., LAURO C.N., “A PCA for interval valued data based on midpoints and radii”, in YANAI H., OKADA A., SHIGEMATU K. et al. (eds), New Developments in Psychometrics, Springer-Verlag, Japan, pp. 641–648, 2003.

[SET 04] SETODJI C.M., COOK R.D., “K-means inverse regression”, Technometrics, vol. 46, pp. 421–429, 2004.

[VER 97] VERDE R., DE ANGELIS P., “Symbolic objects recognition on a factorial plan”, IV International Meeting of Multidimensional Data Analysis (NGUS’97), Bilbao, Spain, 1997.

[WAN 12] WANG H., GUAN R., WU J., “CIPCA: complete-information-based principal component analysis for interval-valued data”, Neurocomputing, vol. 86, pp. 158–169, 2012.

[YAO 13] YAO W.T., WU H.M., “Isometric sliced inverse regression for nonlinear manifolds learning”, Statistics and Computing, vol. 23, pp. 563–576, 2013.

4 On the “Complexity” of Social Reality. Some Reflections About the Use of Symbolic Data Analysis in Social Sciences

4.1. Introduction

The theme of the “complexity” of the social world has long been present in methodological and epistemological reflection on the specificities of the social sciences (e.g. [SIM 22]). Although it is often a textbook commonplace or the theme of a more or less vague essayistic discourse, we can, on the contrary, consider it an important issue if we wish, as is the order of the day, to enhance the dialogue between the natural sciences and the social sciences. This theme brings into play the very notion of modeling and its limits when it is transferred more or less mechanically from physics, chemistry, or biology to the “social” world, in the name of the proven effectiveness of the mathematization of these disciplines.

In this text, we begin by discussing various strategies developed in social sciences research, particularly since the advent of the “statistics for researchers” [ROU 98], to rigorously treat this “complexity”, which can be defined in various ways (section 4.2). This allows us to address, in section 4.3, some of the characteristics of Symbolic Data Analysis (SDA) that make it a promising area of research from the perspective of the study of complex phenomena in the social sciences. An exploratory case study, based on European EU-SILC 2013 cross-sectional data, is presented and discussed for this purpose. It leads us to conclude

Chapter written by Frédéric LEBARON.

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.


with a research program that is both methodological and empirical, to which the use of the analysis of symbolic data can strongly contribute.

4.2. Social sciences facing “complexity”

After evoking the founding notion of the “total social fact” in order to start from a minimal prior definition, we present two families of methodological answers to the issue of complexity in the social sciences, then their recent evolutions and their manifestations in social sciences methodology, before referring more specifically to the question of “scales” in the analysis of social facts.

4.2.1. The total social fact, a designation of “complexity” in social sciences

By “complexity” of reality, we mean here something that is at first sight trivial: social facts always consist of a composite set of various types of realities that are simultaneously present and located on different “planes”, which it is difficult, and even artificial, to separate from one another in thought. This idea can be related to that of the “total social fact”, developed by the French sociologist and anthropologist Marcel Mauss [MAU 23]: all social reality is at the same time psychological and institutional, individual and collective, economic, political, and legal, material and ideal, local and global, and so on.

Faced with this characteristic, analyses of “social” reality have tried, from the beginning, to organize social facts according to various principles, and to set such “dimensions” (“sectors”, “domains”, “planes”, etc.) against one another. The endless debates aroused by these “divisions” have nourished social thought and its controversies.

With the advent of social statistics, in the 18th and especially in the 19th century, a new scientific field emerged, built on a set of empirical data and mathematical methods, which appreciably altered the way this problem is apprehended. This evolution was thus accompanied by an implicit reformulation of the theme of “complexity” in the social sciences and by the appearance of particular “solutions” of quite diverse natures.

4.2.2. Two families of answers

Two main lines of “response” emerged and developed with the advent of “statistics for researchers” in the social sciences: one can be described as “reductionist” and the other as “encompassing”.


The “reductionist” approach, which has had the strongest echo in economics, is often associated, even today, with the current idea of “modeling” (implied: a priori and also, most often, formal and framed by probability theory). The aim is to represent the studied phenomenon by reducing it to the interplay, determined by the model deduced from the theory and the state of knowledge, of a few well-identified forces or generating mechanisms (see [BRE 08]).

The “encompassing” approach studies the most “exhaustive” possible systems of relationships between phenomena as they are observable in reality, without prejudging the play of certain forces rather than others, with the fundamental idea that it is through the analytical interpretation of the data in all their extent that a model will gradually emerge, somehow a posteriori (for economics in the tradition of Benzécri, see e.g. [DES 09]). In this second case, however, a “framework model” always presides over the initial collection of information, without which no structure can emerge from reality (see [LEB 15], [LER 15]).

4.2.3. The contemporary deepening of the two approaches, “reductionist” and “encompassing”

These two approaches, both very fruitful, have since continued to develop and consolidate separately.

As far as the modeling approach is concerned, economics, but also demography, political science, and sociology – at least in their “methodological individualist” or “analytical” variants – have gone very far in trying to represent the main mechanisms of the reality under study from a simplified point of view, in order to test the existence of these mechanisms. Regression methods, within the framework of the General Linear Model, made it possible to carry out this program on a more strictly statistical level. However, by seeking to integrate more and more variables, the use of certain methods associated with the “reductionist” approach, initially designed according to parsimony norms, has increasingly come close to the “exhaustive” goal associated with more “encompassing” methods, to the point that, for example, some specialists in time-series econometrics today explicitly defend an approach that is also called “encompassing” (see [MEU 15], following the British economist David Hendry).

The “modeling” approach also continues to diversify and to be refined in order to overcome the limitations of previous models, for example, with the sophistication of models in the analysis of time series and the ways they developed in finance to escape the hegemony of inadequate representations based on normality assumptions [LÉV 02], [WAL 13].


The strict “reductionist” path continues, of course, to be explored and applied in a more classical way in economics and also in sociology: “multi-agent” simulation methods, for example, extend in contemporary sociology the more clearly “analytical” or “deductive” perspective illustrated by the research of [BOU 75].

On the other side, that of approaches commonly described as “inductive”, one also observes a form of deepening and diversification. The birth of Geometric Data Analysis (GDA) around Jean-Paul Benzécri contributed to the wide success of the so-called “exploratory” and “multidimensional” methods, most often combined with clustering techniques [BEN 73]. In this tradition too, there have been many innovations, and they have contributed to enriching the “encompassing” approach, without excluding attempts to move closer to the more “parsimonious” approach associated with the almost exclusive use of regression in large sectors of the social sciences. We can talk here about geometric modeling, whose idea is to stay closer to the complexity of social phenomena (see [LEB 15]). More and more, this approach is applied to complex data, taking into account time and geographical space, presenting different formats, and located on different scales. The construction of multidimensional spaces allowing the geometric representation of reality is both the instrument and the objective, and constitutes a cross-cutting contribution of SDA and Geometric Data Analysis to the methodology of the social sciences.

4.2.4. Issues of scale and heterogeneity

One of the contemporary issues in the statistical methodology of the social sciences is the consideration of different scales and, more broadly, of heterogeneities of all kinds in the analysis of social facts.

On the side of modeling methods, this leads in particular to the rise of “multi-level” methods that are better able to take into account the embeddedness of individuals in structures (geographical units in particular) which are themselves nested within each other, up to a very global level [COU 04].

In terms of “encompassing” methods of analysis, SDA (see [BIL 06], [BOC 00], [DID 08], [DID 16]) is one of the innovations and perspectives aimed at simultaneously analyzing data that belong to different levels and are heterogeneous by nature (e.g. a region described by variables for hospitals at one level and variables for schools at another level). Hence, the objects of study (regions, for example) describe classes of individual entities (hospitals and schools, for example) by histograms, intervals, sequences of values, and so on, taking into account the variability inside these classes. We can also mention the analysis of spatial interactions, which enriches the perspective of “modeling” complex data [PUM 10].
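The region example can be sketched in a few lines. The following is a minimal illustration under stated assumptions: the region names, the bed and pupil counts, and the function name `interval_by_class` are all hypothetical, and the [min, max] interval is only one of the possible symbolic summaries mentioned above (histograms or sequences of values would be built in the same spirit).

```python
from collections import defaultdict

# Hypothetical micro-data at two different levels, both attached to a region.
hospitals = [("North", 120), ("North", 340), ("South", 80), ("South", 95)]  # (region, beds)
schools = [("North", 410), ("North", 650), ("North", 530), ("South", 220)]  # (region, pupils)


def interval_by_class(records):
    """Summarize each class by the [min, max] interval of its members' values."""
    values = defaultdict(list)
    for cls, value in records:
        values[cls].append(value)
    return {cls: (min(v), max(v)) for cls, v in values.items()}


beds = interval_by_class(hospitals)
pupils = interval_by_class(schools)

# One symbolic description per region, combining variables from both levels.
regions = {
    region: {"hospital_beds": beds[region], "school_pupils": pupils[region]}
    for region in sorted(beds.keys() & pupils.keys())
}

for region, description in regions.items():
    print(region, description)
```

Each region is thus described by interval-valued variables that keep track of the variability among its hospitals and among its schools, rather than by a single average per variable.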


4.3. Symbolic data analysis in the social sciences: an example

After recalling the perspective of symbolic data analysis, a case study of European data is briefly presented. The first results obtained are then the subject of a sociological commentary.

4.3.1. Symbolic data analysis

The analysis of symbolic data was first defined, according to [VER 14], as “a branch of data analysis that develops exploratory techniques to deal with ‘symbolic’ data, that is to say, variables taking their values in the form of intervals, of multi-categories, or of sets of categories with which a ‘mode’ is associated, which can be a probability, a frequency, a weight”. More recently, E. Diday has defined it as “an extension of standard data analysis of individuals to the analysis of classes of individuals” [DID 16].

We can, therefore, compare some of the developments of SDA with Geometric Data Analysis methods [LER 15], the originality of SDA as a family of methods being the coding of initially heterogeneous variables and its explicitly claimed “multi-scale” character, which lead to representing heterogeneity geometrically and in the form of a summary on the main axes. In addition, SDA leads us to rethink all statistical methods, which means that it is indeed very general in scope and can absorb regression techniques as well as more intrinsically geometric tools.

From a general methodological point of view, it can be argued that the coding of heterogeneous variables is related to the construction of a table and, therefore, of a relevant metric, whereas the “multi-scale” nature of SDA brings it closer to the approach of “structured data analysis” in the GDA framework.

4.3.2. An exploratory case study on European data

We will closely follow the approach proposed by ALO and IAA [ALO 15] concerning data from the European Social Survey (ESS). In our case, the data retained are categorized variables, represented in Symbolic Data Analysis in the form of histograms. The numerical variables present in our database have been left out for the time being.

4.3.2.1. The data

The European Union Statistics on Income and Living Conditions (EU-SILC) is an annual European survey commissioned by Eurostat from National Statistical Institutes since


2004. It focuses on living conditions at the household and individual levels and includes socio-demographic variables (age, sex, region, etc.), housing variables (type of housing, location, housing status), and variables on the employment situation in the last job (income, type of contract, etc.), on consumption (possession of certain goods such as a computer, a car, etc.), and on income.

The database on which we are working, that of the EU-SILC 2013 cross-sectional survey, comprises 362,301 individuals from 32 countries (including the 28 EU countries), after retaining only the individuals who declared an occupation within the meaning of ISCO/CITP 2008 (see table in the appendix) and after eliminating individuals with a non-response to at least one of the 10 variables of interest, from which the active variables of the geometric analysis will be chosen (see the list of studied variables below).

4.3.2.2. Aggregation of micro-data and construction of concepts

To construct the concepts of Symbolic Data Analysis, we chose to cross two factors, the country (n = 32) and the social group defined by ISCO/CITP at level 1 (nine modalities).

ISCO-08 titles at level 1 (from the French version):

– Directors, Senior Managers, and Managers;
– Intellectual and scientific professions;
– Intermediate professions;
– Administrative employees;
– Personnel in direct services to individuals, traders, and salesmen;
– Farmers and skilled workers in agriculture, forestry, and fishing;
– Skilled trades in industry and crafts;
– Plant and machine operators and assembly workers;
– Elementary occupations.

The countries are Germany (DE), Austria (AT), Belgium (BE), Bulgaria (BG), Cyprus (CY), Croatia (HR), Denmark (DK), Spain (ES), Estonia (EE), Greece (EL), Hungary (HU), Ireland (IE), Iceland (IS), Italy (IT), Finland (FI), France (FR), Latvia (LV), Lithuania (LT), Luxembourg (LU), Malta (MT), Norway (NO), the Netherlands (NL), Poland (PL), Portugal (PT), the Czech Republic (CZ), the United Kingdom (UK), Romania (RO), Serbia (RS), Slovakia (SK), Slovenia (SI), Sweden (SE), and Switzerland (CH).
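Crossing the 32 countries with the nine ISCO groups gives at most 32 × 9 = 288 concepts, each described by the distribution of its members' answers. The sketch below shows one possible way to carry out this aggregation for a single categorical variable. It is an illustration under stated assumptions, not the processing chain actually used for the study: the respondents, the answer labels, and the function name `build_concepts` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical respondents: (country, ISCO level-1 code, answer to a yes/no item).
respondents = [
    ("FR", 2, "yes"), ("FR", 2, "yes"), ("FR", 2, "no"),
    ("FR", 9, "no"), ("FR", 9, "no"),
    ("DE", 2, "yes"),
]


def build_concepts(rows):
    """Aggregate individuals into (country, ISCO group) concepts.

    Each concept is described by a histogram: the relative frequency
    of every answer category observed among its members.
    """
    counts = defaultdict(Counter)
    for country, isco, answer in rows:
        counts[(country, isco)][answer] += 1
    concepts = {}
    for key, counter in counts.items():
        n = sum(counter.values())
        concepts[key] = {category: c / n for category, c in counter.items()}
    return concepts


concepts = build_concepts(respondents)
for key in sorted(concepts):
    print(key, concepts[key])
```

With several categorical variables, the same aggregation would be repeated per variable, so that each concept becomes a row of histogram-valued cells in the symbolic data table.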


4.3.2.3. Elementary descriptive study of variables and visualization in histogram form

Following the pattern of a previous study [LEB 15], [LEB 17], we then selected variables by focusing on four themes of interest:

– economic conditions and social exclusion;
– housing;
– physical environment and physical security;
– health.

The four selected rubrics were chosen to summarize the multidimensional nature of concrete living conditions. They express various species of capital: economic resources; resources related to housing (which refer to a concrete aspect of economic capital); the “social” and “physical” environment, with questions on the living environment and physical security (which also measure resources related to the immediate daily-life context); and finally health, a physical capital that is partly “biological” but strongly influenced by social factors. We, therefore, have here a simplified summary of the main dimensions considered fundamental in the Stiglitz-Sen-Fitoussi report (2009). These data, however, are very strongly focused on social exclusion and the most severe material deprivation.

General socio-demographic variables of the individual

Sex (RB090): man (1)/woman (2)

Marital status (PB190): never married (1)/married (2)/separated (3)/widowed (4)/divorced (5)

Consensual union (PB200): yes, on a legal basis (1)/yes, without a legal basis (2)/no (3)

Country of birth (PB210): specific code

Citizenship (PB220A): specific code

NACE (PL111): code of the branch

ISCO category (ISCO: see above).

Type of contract (PL140): permanent job (1)/temporary job or fixed-term contract (2)
Manager position (PL150): supervision (1)/no supervision (2)
Degree of urbanization (DB100): densely populated region (1)/intermediate region (2)/sparsely populated region (3)
Highest level of ISCED degree attained (PE040): pre-primary education (1)/lower secondary education (2)/upper secondary education (3)/post-secondary non-tertiary education (4)/first level of tertiary education, not leading to an advanced research qualification (5)/second level of tertiary education, leading to an advanced research qualification (6)

Economic conditions, precariousness, and exclusion of households

Late payments on service invoices in the last 12 months (HS021): yes, once (1)/yes, twice or more (2)/no (3)
Late payments on hire purchase invoices or other loans (HS031): yes, once (1)/yes, twice or more (2)/no (3)
Ability to afford an annual vacation week away from home (HS040): yes (1)/no (2). This question is of the same nature as the previous one, with a dimension that is probably even more dependent on the perception of what is “far from home”, whose meaning is not simple.
Ability to afford a meal with meat, chicken, fish (or vegetarian equivalent) every other day (HS050): yes (1)/no (2). This is a partly “subjective” question, answered by the respondent on behalf of the household. It provides a good indicator of poverty, insofar as it refers to a “capability” or a material possibility, partly subjective.
Ability to cope with unexpected expenses (HS060): yes (1)/no (2)
Possession of a telephone, including mobile (HS070): yes (1)/no (2)
Possession of a color TV (HS080): yes (1)/no (2)
Possession of a computer (HS090): yes (1)/no, cannot afford (2)/no, other reason (3). There are three response modalities, making it possible to distinguish two (subjective) reasons for not possessing a computer.
Possession of a washing machine (HS100): yes (1)/no (2)

Possession of a car (HS110): yes (1)/no, cannot afford (2)/no, other reason (3)
Ability to make ends meet (HS120): very hard (1)/with difficulty (2)/with some difficulty (3)/relatively easily (4)/easily (5)/very easily (6)
The financial burden of repayment of purchases on credit or debt (HS140): a heavy burden (1)/a light burden (2)/no burden at all (3)

Housing conditions

These are questions about the residential situation of households. One question was asked about the financial burden of the total cost of housing (which depends on the perception of what is a “heavy” or “light” burden) and another about the type of housing. The first refers to the weight of access to housing in the standard of living; the second describes the actual housing conditions and environment.

Problems with housing: too dark, not enough light (HS160): yes (1)/no (2)
The financial burden of the total cost of housing (HS140): a heavy burden (1)/a light burden (2)/no burden at all (3). It is a three-way question of the same nature as the previous one.
Type of accommodation (HH010): separate house (1)/semi-detached (2)/apartment in a building with fewer than 10 apartments (3)/apartment in a building with more than 10 apartments (4)/other (5)
Property status (HH021): full owner (1)/first-time homeowner (2)/tenant or subtenant at market price (3)/rented at a reduced price (4)/rented free (5)
Leaking roof, or other degradations of the apartment (HH040): yes (1)/no (2)
Ability to maintain heat in the apartment (HH050): yes (1)/no (2)

Residential environment

The three dichotomous (yes/no) questions asked imply a personal evaluation, the notions of “noise”, “criminal violence and vandalism”, “pollution”, and so on being left to the discretion of the respondent.

Noise of neighbors or the street (HS170): yes (1)/no (2)
Pollution, dirt, or other environmental problem (HS180): yes (1)/no (2)

Criminal violence or vandalism in the neighborhood (HS190): yes (1)/no (2)

Health

Again, two questions are used which are partly dependent on a subjective assessment (with the notion of “limitation” and that of “unmet need”).

General health (PH010): very good (1)/good (2)/fair (3)/bad (4)/very bad (5)
Suffers from a chronic illness (PH020): yes (1)/no (2)
Limited activities due to health problems (PH030): yes (1)/yes, in part (2)/no (3)
Unmet need for examination or medical treatment in the last 12 months (PH040): yes, on at least one occasion (1)/no, on no occasion (2)

The analyzed database, created using the Syr software, contains 288 concepts and 38 variables. The data can be presented in the form of a table such as Table 4.1 (extracted from a larger table).

Table 4.1. Histograms of variables for each concept. For a color version of this table, see www.iste.co.uk/diday/advances.zip

The visualization in the form of histograms makes it possible to show and study, in a simple way, the differences between frequency distributions within countries and between countries, variable by variable. In the example selected, we can see that Belgian executives and the Belgian intellectual and scientific professions (BE_1 and BE_2) report difficulties in “making ends meet” much less often than the Austrian “elementary professions” (AT_9) (variable HS120, ordered modalities from 1 “very difficult” to 6 “very easy”). The differences between countries are also notable. We also note, with the variable PE040 (ordered from lowest to highest level of qualification), that the social categories are strongly differentiated in terms of diploma levels. This is of course because the level of qualification is one of the criteria used to define them: the chosen concepts correspond to specific socio-demographic units.

4.3.2.4. Principal component analysis

The PCA was carried out with eight active variables:

Economic conditions:
Possession of a computer (HS090): yes (1)/no, cannot afford (2)/no, other reason (3);
Ability to afford an annual vacation week away from home (HS040): yes (1)/no (2);
Ability to afford a meal with meat, chicken, fish (or vegetarian equivalent) every other day (HS050): yes (1)/no (2);
Financial burden of repayment of purchases on credit or debt (HS140): a heavy burden (1)/a light burden (2)/no burden at all (3).

Environment:
Criminal violence or vandalism in the neighborhood (HS190): yes (1)/no (2).

Health:
Limited activities due to health problems (PH030): yes (1)/yes, in part (2)/no (3).

Housing:
Type of accommodation (HH010): separate house (1)/semi-detached (2)/apartment in a building with fewer than 10 apartments (3)/apartment in a building with more than 10 apartments (4)/other (5);
Property status (HH021): full owner (1)/first-time homeowner (2)/tenant or subtenant at market price (3)/rented at a reduced price (4)/rented free (5).

The first axis of the PCA is interpreted as an axis of general economic and social comfort. It is positively correlated with computer ownership, homeownership, the ability to afford a week’s vacation away from home, and the ability to afford a meal of meat, fish or a vegetarian equivalent at least every other day.

The second axis of the PCA is strongly correlated with one variable, crime in the neighborhood, and to a lesser extent with the mode of residence and the state of health. Moving down the axis, health worsens, but housing is more often a separate residence with no surrounding crime. We thus find an axis close to what we observe in the individual data.

Figure 4.1. Active variables in planes 1–2

Axis 3 (see Figure 4.2) is related to the mode of residence, health, and the tenant situation. It contrasts more “precarious” situations at the bottom with more “stable” situations at the top. The projection of the additional variables completes this interpretation. Axis 1 is an axis linked to the economic and social hierarchy, while Axis 2 is an urban/rural axis. The cloud of concepts (country-classes) shows the wide dispersion of the country-classes in terms of living conditions. On the left, we observe in particular the categories “basic professions” and “farmers” of Bulgaria and Romania, while on the right we can see the emergence of the upper and middle categories, or even the working-class ones, of the countries of Northern and Western Europe.

On Axis 2, farmers stand out clearly at the bottom, as opposed to more urban groups located in countries with varying levels of development (Norway, Finland, Austria, Slovenia, Slovakia, Croatia, etc.). Finally, Axis 3 contrasts working-class groups in certain countries, such as the United Kingdom or Belgium, which are characterized by high precariousness, with higher categories in other countries, such as the countries of Central and Eastern Europe.

Figure 4.2. Active variables in planes 3–4

Finally, we can represent a “structuring factor” (or supplementary element) in the space of concepts. The NetSyr software uses a representation in the form of geometric figures determined by the nine points corresponding to the other variable. The following graph shows that the different countries are partially superimposed in the center of the cloud, but that each is distinguished by the position and shape of its figure, reflecting a significant interaction between the country and social class factors.

Figure 4.3. Cloud of concepts in planes 1–2

Figure 4.4. Cloud of concepts in planes 3–4

Figure 4.5. Cloud of countries in planes 1–2. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

4.3.2.5. Visualization of “heterogeneities”

Figure 4.6. Cloud of histograms in planes 1–2. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

SDA (with the help of the NetSyr program) also makes it possible to represent the heterogeneities observed in the multidimensional space constructed using PCA. By way of example, we represent here the different configurations of diplomas among the nine French ISCO groups. We see that, in planes 1–2 of the PCA, only groups six and nine are on the left of the first axis and that group six stands out very clearly at the bottom of the second axis. The other groups are clearly ordered along Axis 1 (the group of intellectual and scientific professions lying a little further to the right), in line with the socio-economic interpretation that we have given of it. These groups differ greatly in the structure of diplomas within them.

4.3.2.6. Classification

The NetSyr program performs a classification by the K-means method. Here we present only a glimpse in planes 1–2, with a classification into three groups of concepts. These three classes can then be described by the properties of the concepts that constitute them.

Figure 4.7. Cloud of clusters from the K-means procedure in planes 1–2. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

4.3.3. A sociological interpretation

Here we find results similar to those of analyses conducted at the micro-data level [LEB 17], but located here from the outset at the level of concepts specifically constructed for the analysis, namely, the country-classes (for a very similar analysis in terms of units of analysis and approach, see [HUG 14]). A strong hierarchical opposition structures the space of countries and social groups in Europe and clearly differentiates living conditions. It is compounded by a differentiation linked to the residential environment (urban/rural) and another linked to the degree of insecurity. Each “country-class” constitutes a point in a space structured in this way but, of course, the underlying individual dispersions are large: here we have worked first at the level of the aggregates that constitute the country-groups. The “geometric map” presented above resembles the “geographical map” of countries, reflecting the fact that the European space is also a social space, and each area is specifically differentiated. Results specific to Symbolic Data Analysis add the possibility of visualizing, by means of histograms in the constructed space, the relative “heterogeneities” of the different country-classes according to the criteria concerned.

The interest of this type of analysis is thus to make possible, in a global framework, the geometric visualization of heterogeneous and “multi-level” data. This potentially enriches the understanding of social processes, which are always located on several scales (the global economic and social system, the regional area, the country, and, at the finest level, the individual or even the practice or activity) and always involve subtle articulations of internal variations located at these different scales.

4.4. Conclusion

The approach to statistical methodology in the social sciences discussed briefly here can thus be combined with a “global” sociological perspective, and thereby contribute to the description and the progressive formalization of social structures that are nested in each other and characterized by strong heterogeneities.

This research program can be extended to all kinds of observation scales, from the very micro (the individual activity) to the very macro (the global space). Unlike the multi-level analysis approach, this perspective belongs to the lineage of an “encompassing” approach to the study of complex social facts, which aims to account for the multidimensionality of reality without reducing it to a few forces assumed a priori. However, it shares the same will to integrate theories and methods of the social sciences [COU 04], [PUM 10].

The integration of the two traditions – no matter how difficult, at least at first glance – appears to be an important objective for the future of the social sciences, especially if we consider their increasingly dense interactions with the natural sciences. The links between these approaches and more “qualitative” approaches to sociology, such as ethnographic observation as it has developed in sociology and anthropology [WEB 15], are also among the important future issues, on which a perspective more attentive to complexity and its rigorous formalization can also shed interesting light.

4.5. References

[BEN 73] BENZÉCRI J.-P., Analyse des Données. Tome II: Analyse des Correspondances, Dunod, Paris, 1973.
[BIL 06] BILLARD L., DIDAY E., Symbolic Data Analysis: Conceptual Statistics and Data Mining, Wiley, Hoboken, 2006.
[BOC 00] BOCK H.H., DIDAY E. (eds), Analysis of Symbolic Data, Springer, Berlin, 2000.
[BOU 75] BOUDON R., L’inégalité des chances, Fayard/Pluriel, Paris, 1975 [2011].
[BRE 08] BRESSOUX P., Modélisation statistique appliquée aux sciences sociales, De Boeck, Brussels, 2008.
[COU 04] COURGEAU D., Du groupe à l’individu. Synthèse multiniveau, INED, Paris, 2004.
[DES 09] DESBOIS D., “La place de l’a priori dans l’analyse des données économiques ou le programme fort des méthodes inductives au service de l’hétérodoxie”, Revue Modulad, vol. 39, pp. 176–181, 2009.
[DID 08] DIDAY E., NOIRHOMME-FRAITURE M. (eds), Symbolic Data Analysis and the SODAS Software, Wiley, Hoboken, 2008.
[DID 16] DIDAY E., “Thinking by classes in data science: the symbolic data analysis paradigm”, WIREs Computational Statistics, vol. 8, pp. 172–205, doi: 10.1002/wics.1384, 2016.
[DUR 95] DURKHEIM E., Les règles de la méthode sociologique, Flammarion, Paris, 1895 [2010].
[GRE 84] GREENACRE M.J., Theory and Application of Correspondence Analysis, Academic Press, Inc., London, 1984.
[HUG 14] HUGRÉE C., PÉNISSAT E., SPIRE A. et al., “Capital culturel et pratiques culturelles. Les enjeux d’une comparaison européenne depuis l’enquête SILC-EU 2006”, communication au colloque Les classes sociales en Europe, AFS, 2014.
[HUM 10] HUMAN DEVELOPMENT REPORT 2010 (20th Anniversary Edition). The Real Wealth of Nations: Pathways to Human Development. Published for the United Nations Development Programme, available at: http://hdr.undp.org/en/reports/global/hdr2010/chapters/, 2010.
[LEB 16] LEBARON F., L’espace des conditions de vie des actifs occupés en Europe, Working document, INSEE, 2016.
[LEB 17] LEBARON F., BLAVIER P., “Classes et nations en Europe. Quelle articulation?”, Actes de la recherche en sciences sociales, vol. 219, pp. 80–97, 2017.
[LER 14] LE ROUX B., Analyse géométrique des données multidimensionelles, Dunod, Paris, 2014.
[LER 15] LE ROUX B., LEBARON F., “Les idées-clés de l’analyse géométrique des données”, in LEBARON F., LE ROUX B. (eds), La méthodologie de Pierre Bourdieu en action. Espace culturel, espace social et analyse des données, Dunod, Paris, 2015.
[LÉV 02] LÉVY-VEHEL J., WALTER C., Les marchés fractals, PUF, Paris, 2002.
[MAU 23] MAUSS M., Essai sur le don, PUF, Paris, 1923 [2012].
[MEU 15] MEURIOT V., Une étude critique et réflexive de l’économétrie des séries temporelles (1974–1982), HDR, Université de Versailles-Saint-Quentin-en-Yvelines/Université Paris-Saclay, Paris, 2015.
[PAS 91] PASSERON J.-C., Le raisonnement sociologique, L’espace non-popperien du raisonnement naturel, Paris, Nathan, 1991.
[PUM 10] PUMAIN D., SAINT-JULIEN T., Analyse spatiale des interactions, Armand Colin, Paris, 2010.
[ROU 98] ROUANET H., ROUANET H., BERNARD J.M. et al., New Ways in Statistical Methodology: From Significance Tests to Bayesian Inference (Foreword by SUPPES P.), Peter Lang, Berne, 1998.
[SIM 22] SIMIAND F., Statistique et expérience. Remarques de méthode, Marcel Rivière, Paris, 1922.
[VER 14] VERDE R., DIDAY E., “Symbolic data analysis: a factorial approach based on fuzzy coded data”, in BLASIUS J., GREENACRE M. (eds), Visualization and Verbalization of Data, CRC Press, Boca Raton, pp. 255–270, 2014.
[WAL 13] WALTER C., Le modèle de marche aléatoire en finance, Economica, Paris, 2013.
[WEB 15] WEBER F., Brève histoire de l’anthropologie, Flammarion, Paris, 2015.
[WOR 10] WORLD DEVELOPMENT INDICATORS 2010, World Bank, Washington, DC, 2010.

Part 2 Complex Data

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

5 A Spatial Dependence Measure and Prediction of Georeferenced Data Streams Summarized by Histograms

5.1. Introduction

Massive datasets having the form of continuous streams with no fixed length are becoming very common due to the availability of sensor networks that can perform, at a very high frequency, repeated measurements of some variables. Knowledge extraction from such data must consider the technological characteristics of the tools for data acquisition as well as the nature of the monitored phenomenon. Often, data acquisition is performed by sensors with limited storage and processing resources. Moreover, the communication among sensors is constrained by their physical distribution or by limited bandwidths. Finally, the recorded data often relate to highly evolving phenomena, for which it is necessary to use algorithms that adapt the knowledge as new observations arrive.

The prevailing paradigm for the analysis of data in this context is centralized data stream analysis. Observations, recorded by sensors, are organized and processed by a single unit that provides the results of queries. In this case, the single processing unit must be efficient in space and time: the data have to be processed on the fly, at the speed at which they are recorded, and algorithms need to adapt their behavior over time, consistently with the dynamic nature of the data.

Chapter written by Rosanna VERDE and Antonio BALZANELLA.

In the framework of distributed stream processing, this chapter deals with the monitoring of data stream spatial dependence and with the prediction of data at spatial locations where there are no sensors. In the framework of data stream mining, when data are recorded by spatially located sensors, we assume that it remains true that the measurements detected at nearby locations are more similar than those in distant places. Due to the highly evolving nature of data streams and to their potentially unbounded size, the spatial dependence of the data can itself evolve over time. Thus, a first challenge is to measure the spatial dependence and its evolution over time. A second challenge is related to data prediction: analysts could be interested in predicting the behavior of the monitored phenomenon at spatial locations where there is no sensor or where a sensor is not able to record data.

In order to face these challenges, we propose to split the incoming streams into non-overlapping time windows; in each window, the subsequences are synthesized through histograms. Histogram representation makes it possible to get a low storage cost and to keep track of the data distribution. Micro-clustering of the histogram representations is also performed by a CluStream algorithm in order to obtain a further dimensional reduction of the streams: histogram data are replaced by the centroid of the cluster to which they are allocated. The centroids, or prototypes of the clusters, are also represented as average histogram data. According to this new kind of data stream representation, we provide a variogram-like tool for measuring the spatial dependence and a new kriging-based approach that makes it possible to predict data stream distributions in unsampled locations. The two proposed tools are based on a suitable distance measure for comparing histograms, the L2-Wasserstein metric [IRP 15].

An interesting decomposition of this distance between histograms into two additive components makes it possible to emphasize the differences in position and in variability and shape of the compared distributions. Following this, the variogram for histogram data can be computed according to the two components, in order to measure the spatial variability due to position and that due to the variability and shape of the data distributions. Similarly, the kriging prediction model can be performed highlighting the contribution of the two components.

As will be detailed in the following sections, we adopt a processing scheme based on distributed computing: summaries of data streams as histograms are given as input to a local computation unit. Then, the online updated outputs are sent to a central computation unit that performs variogram computing. On user demand, it is possible to query the central processing unit to obtain a spatial prediction of sensor data, using the kriging model.

We intend to make an original contribution on this topic, on which, to our knowledge, the literature is quite limited. Most of the methods for analyzing spatial data streams propose a solution to traditional data mining challenges (clustering, classification, and summarization) on streams that record the position of moving objects [DIN 02, WEI 13]. Our strategy differs from these proposals in several aspects: it performs part of the processing at the sensors rather than all the computations in a single centralized unit; spatial dependence monitoring is performed starting from histograms rather than scalar measurements. Moreover, the prediction phase aims at estimating the data distribution of a time period instead of single scalar values.

This chapter is organized as follows: section 5.2 introduces the processing setup; section 5.3 provides the notation and the main definitions; section 5.4 describes our strategy for summarizing the parallel arriving data streams through histograms; section 5.5 provides our proposal for monitoring the spatial dependence; section 5.6 describes the kriging approach proposed for predicting the spatial distribution where there are no records from sensors; section 5.7 shows an application of the strategy to simulated and real data.

5.2. Processing setup

We assume a sensor network for monitoring a physical phenomenon in different places of a geographic area. Each sensor performs repeated measurements, over a long time, at a very high frequency. We assume that there is a spatial dependence between the data recorded by the sensors: nearby sensors record more similar measurements than distant ones. Finally, sensors do not communicate with each other but only with a central node.

Each data stream recorded by a sensor is processed individually in two summarization steps. In the first one, the incoming data streams are split into non-overlapping windows. Each sub-sequence in a window is summarized by a histogram. A histogram representation of the sub-sequences makes it possible to keep information about the empirical data distribution and, at the same time, to get an effective data compression. In the second step, the summarized sub-sequences are clustered into very homogeneous classes using a CluStream [AGG 03] algorithm adapted to histogram data. This step allows a further dimensional reduction of the data streams: the histogram representations of the sub-sequences of streams are replaced by the histogram prototypes of the micro-clusters to which they belong.

Figure 5.1. (a) Time series recorded by a sensor and (b) its representation by histograms

At the central node, the micro-cluster prototypes are stored for each data stream. The central processing node is also used for computing the variogram on such summaries of data, which extends the definition of the trace-variogram in spatio-functional data analysis [DEL 10] to histogram data. The idea is to consider the inverse of the cumulative distribution function, that is, the quantile function, associated with each histogram as functional data, so that the main theoretical formulations of the variogram for functional data can be applied to address our challenge. The variogram for histogram data can be used for predicting the histogram of the data in an unsampled spatial location or at time intervals for which there are no observations.

5.3. Main definitions

In this section, a classical definition of data streams is introduced. Let $Y = \{Y_1, \ldots, Y_i, \ldots, Y_n\}$ be a set of $n$ data streams made of real-valued observations $y_{ij}$ on a discrete time grid $T = \{t_1, \ldots, t_j, \ldots\}$, with $t_j \in \Re$ and $t_j > t_{j-1}$. We assume that each data stream $Y_i = \{(y_{i1}, t_1), \ldots, (y_{ij}, t_j), \ldots\}$ is made up of observations recorded by a sensor. Sensors are located at $s_i \in S$, with $S \subseteq \Re^2$ as the geographic space.

We assume that incoming data are recorded online, whereas only subsets of the streams can be kept in memory. Due to this, data stream techniques are performed on observations in the most recent batch and on some synopsis of the old data that are no longer available.

Formally, incoming data streams are split into non-overlapping windows identified by $w = 1, \ldots, \infty$. A window is an ordered subset of $T$, having size $b$, which frames a data batch $Y^w = \{Y_1^w, \ldots, Y_i^w, \ldots, Y_n^w\}$, where $Y_i^w = \{(y_{ij}, t_j), \ldots, (y_{i(j+b)}, t_{j+b})\}$ is a subsequence of $Y_i$. According to our proposal, every time a new batch of data is available, the observations of each subsequence $Y_i^w$ are summarized by a histogram, formally defined as $H_i^w = \{(I_{i,1}^w, \pi_{i,1}^w), \ldots, (I_{i,l}^w, \pi_{i,l}^w), \ldots, (I_{i,L}^w, \pi_{i,L}^w)\}$, where the $I_{i,l}^w$ are a set of $L$ non-overlapping intervals (bins) into which the support $D_i^w = [\underline{y}_i^w; \overline{y}_i^w]$ of $Y_i^w$ has been partitioned, and the $\pi_{i,l}^w$ are the associated weights (relative frequencies).
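The windowing and histogram summarization just defined can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `summarize_stream`, the use of equal-width bins over the observed range of each window (via `numpy.histogram`), and the choice to drop a trailing incomplete window are ours, not choices stated in the chapter.

```python
import numpy as np

def summarize_stream(values, window_size, n_bins):
    """Split one stream into non-overlapping windows of length `window_size`
    and summarize each window by a histogram (bin edges, relative frequencies).

    Assumption of this sketch: equal-width bins spanning the observed range of
    each window; a trailing incomplete window is ignored.
    """
    values = np.asarray(values, dtype=float)
    n_windows = len(values) // window_size
    summaries = []
    for w in range(n_windows):
        batch = values[w * window_size:(w + 1) * window_size]
        counts, edges = np.histogram(batch, bins=n_bins)
        weights = counts / counts.sum()  # relative frequencies, sum to 1
        summaries.append((edges, weights))
    return summaries

# Usage: a toy stream of 20 observations, windows of size b = 10, L = 5 bins
summaries = summarize_stream(np.arange(20), window_size=10, n_bins=5)
```

Each summary stores only `L + 1` bin edges and `L` weights, whatever the window size `b`, which is the source of the compression mentioned in the text.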

The use of the histogram as a tool to synthesize data makes it possible to keep information about the main moments of the distribution (related to the position, variability, and shape indices). Moreover, to compare histogram data, we need to introduce a suitable distance. In particular, in the field of data analysis on theoretical or empirical distribution functions, [IRP 15] suggested using the L2-Wasserstein metric, also known as Mallows' distance, in the family of Wasserstein distances. Let two probability density functions (pdfs) be denoted by $\phi$ and $\phi'$, with $\Phi$ and $\Phi'$ being the respective cumulative distribution functions (cdfs). The L2-Wasserstein distance is expressed as the distance between the quantile functions $\Phi^{-1}$ and $\Phi'^{-1}$, which are the inverse functions of the cdfs, as follows:

$$d_W(\phi, \phi') := \sqrt{\int_0^1 \left(\Phi^{-1}(\xi) - \Phi'^{-1}(\xi)\right)^2 d\xi} \qquad [5.1]$$
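As a hedged numerical illustration of [5.1], the sketch below approximates the L2-Wasserstein distance between two histograms. It assumes that observations are uniformly spread within each bin (so that the quantile function is piecewise linear), that all bin weights are strictly positive, and it replaces the integral by a midpoint rule on a grid of probability levels; the function names and the grid size are ours.

```python
import numpy as np

def histogram_quantile(edges, weights, p):
    """Quantile function of a histogram, assuming a uniform density inside
    each bin (piecewise-linear cdf). Assumes strictly positive weights."""
    edges = np.asarray(edges, dtype=float)
    cdf = np.concatenate(([0.0], np.cumsum(weights)))
    cdf /= cdf[-1]                       # guard against rounding in the weights
    return np.interp(p, cdf, edges)      # invert the cdf by linear interpolation

def wasserstein_l2(hist_a, hist_b, n_grid=1000):
    """Approximate d_W of [5.1] with a midpoint rule on (0, 1)."""
    p = (np.arange(n_grid) + 0.5) / n_grid
    qa = histogram_quantile(*hist_a, p)
    qb = histogram_quantile(*hist_b, p)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))

# Usage: two histograms with the same shape, the second shifted by 2 units
h1 = ([0.0, 1.0, 2.0, 3.0], [0.2, 0.5, 0.3])
h2 = ([2.0, 3.0, 4.0, 5.0], [0.2, 0.5, 0.3])
distance = wasserstein_l2(h1, h2)
```

For two histograms that differ only by a shift, the quantile functions differ by a constant, so the distance reduces to the size of the shift.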

A further implication, due to [DEL 99], is that the squared Wasserstein distance $d_W^2$ can be decomposed into two additive components:

$$d_W^2(\phi, \phi') = \underbrace{(\mu - \mu')^2}_{\text{Location}} + \underbrace{\underbrace{(\sigma - \sigma')^2}_{\text{Size}} + \underbrace{2\sigma\sigma'(1 - \rho)}_{\text{Shape}}}_{\text{Variability}} \qquad [5.2]$$
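The decomposition [5.2] can be checked numerically. The sketch below is an illustration under our own assumptions, not the authors' implementation: it works on two raw samples of equal size rather than on histograms, so that the empirical quantile functions are simply the sorted samples, and it takes $\rho$ to be the Pearson correlation between the two sorted samples.

```python
import numpy as np

def wasserstein_decomposition(x, y):
    """Squared L2-Wasserstein distance between two equal-size, non-constant
    samples, with its location, size and shape terms as in [5.2]."""
    qx = np.sort(np.asarray(x, dtype=float))   # empirical quantile function of x
    qy = np.sort(np.asarray(y, dtype=float))   # empirical quantile function of y
    if qx.shape != qy.shape:
        raise ValueError("this sketch assumes samples of equal size")
    d2 = float(np.mean((qx - qy) ** 2))
    location = float((qx.mean() - qy.mean()) ** 2)
    sx, sy = qx.std(), qy.std()                # population standard deviations
    size = float((sx - sy) ** 2)
    rho = np.corrcoef(qx, qy)[0, 1]            # correlation of quantile functions
    shape = float(2.0 * sx * sy * (1.0 - rho))
    return d2, location, size, shape

# Usage: a Gaussian sample against an exponential sample
rng = np.random.default_rng(0)
d2, location, size, shape = wasserstein_decomposition(
    rng.normal(size=500), rng.exponential(size=500)
)
```

With these definitions the three terms add up to `d2` up to floating-point error, and a pure shift of one sample leaves only the location term non-zero.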

The first component emphasizes the difference in position of the two distributions through the (squared Euclidean) distance between the respective means. The second component takes into consideration the different variability structure of the compared distributions. It is expressed by the distance between the respective standard deviations (the size component) and by a term that is a function of the standard deviations and of the positive correlation $\rho$ between the quantile functions, which depends on the shape of the distribution functions ($\rho = 1$ when the two distributions are equal except for the location and scale parameters).

5.4. Online summarization of a data stream through CluStream for Histogram data

A CluStream algorithm adapted to histogram data [AGG 03, BAL 13] is introduced in order to get a further dimensional reduction of the data. It performs an effective and quick summarization of the incoming data by computing synopses that correspond to the centers of low-variability (micro-)clusters. In order to have a high representativity of the input data, the number of clusters to be kept updated is not specified as a parameter; only its maximum value is fixed, in order to manage the memory resources.

The micro-cluster is the data structure we use for data summarization. For each stream $Y_i$, we keep a set $\mu C_i = \{\mu C_i^1, \ldots, \mu C_i^k, \ldots, \mu C_i^K\}$ of micro-clusters, where $\mu C_i^k$ records the following information:
– $H_i^k$: histogram centroid;
– $n_i^k$: number of allocated items;
– $\sigma_i^k$: L2-Wasserstein-based standard deviation;
– $Sw_i^k$: sum of window Ids;
– $SSw_i^k$: sum of squared window Ids.

Since the data received from the previous summarization step are histograms, we have adapted the micro-cluster definition accordingly. In particular, the micro-cluster centroid is the average histogram computed coherently with the L2-Wasserstein metric. Similarly, $\sigma_i^k$ records the standard deviation for the set of histograms in a micro-cluster, computed according to the same metric. The two statistics $Sw_i^k$ and $SSw_i^k$ allow us to keep synthetic information about the allocation times. In particular, by

$$\bar{w}_i^k = \frac{Sw_i^k}{n_i^k} \qquad \text{and} \qquad \sigma^2_{w_i^k} = \frac{SSw_i^k}{n_i^k} - \left(\frac{Sw_i^k}{n_i^k}\right)^2$$

it is possible to recover, respectively, the average of the allocation times and their variance, in terms of time windows.
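The role of the two temporal statistics can be illustrated with a minimal sketch. The class below keeps only the counters related to allocation times (the histogram centroid and its standard deviation are left out for brevity); the class and method names are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    """Temporal part of a micro-cluster: n, Sw and SSw from the list above."""
    n: int = 0             # number of allocated histograms
    sum_w: float = 0.0     # Sw: sum of window identifiers
    sum_sq_w: float = 0.0  # SSw: sum of squared window identifiers

    def add(self, window_id: int) -> None:
        """Register the allocation of the histogram of window `window_id`."""
        self.n += 1
        self.sum_w += window_id
        self.sum_sq_w += window_id ** 2

    def mean_window(self) -> float:
        """Average allocation time, Sw / n."""
        return self.sum_w / self.n

    def var_window(self) -> float:
        """Variance of the allocation times, SSw / n - (Sw / n)^2."""
        return self.sum_sq_w / self.n - (self.sum_w / self.n) ** 2

# Usage: histograms from windows 1, 2 and 3 fall into the same micro-cluster
stats = WindowStats()
for w in (1, 2, 3):
    stats.add(w)
```

Because only three running totals are stored, the mean and variance of the allocation times can be updated in constant time and memory, without keeping the list of window identifiers.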

The initialization step is performed offline by using a standard k-means clustering algorithm in order to create a set of initial micro-clusters. Then the online process of updating the micro-clusters is initiated. Whenever a new window $w$ of data is available and the corresponding histogram $H_i^w$ is built, CluStream allocates the latter to an existing micro-cluster or generates a new one. The first preference is to assign the histogram to the closest existing micro-cluster, that is, to the $\mu C_i^k$ such that $d_W^2(H_i^w, H_i^k) < d_W^2(H_i^w, H_i^{k'})$ (with $k' \neq k$ and $k' = 1, \ldots, K$), provided that $d_W^2(H_i^w, H_i^k) < u$. The threshold value $u$ makes it possible to check that $H_i^w$ falls within the maximum boundary of the micro-cluster, chosen as a factor of the standard deviation of the histograms in $\mu C_i^k$.

The allocation of a histogram to a micro-cluster involves updating $n_i^k \to n_i^k + 1$, the micro-cluster centroid, and the standard deviation. As stated in [IRP 15], these can be computed easily from the centers and radii of the allocated histograms. Finally, it is necessary to update the sum and the sum of squares of the window identifiers, considering the time window w of the histogram $H_i^w$. If $H_i^w$ is outside the maximum boundary of every existing micro-cluster, a new micro-cluster is created, setting $H_i^w$ as centroid and $n_i^k = 1$. The standard deviation $\sigma_i^k$ is defined in a heuristic way by setting it to the squared L2-Wasserstein distance to the closest cluster. Behaviors that are no longer active will be summarized by micro-clusters having low values of $\bar{w}_i^k$ and $\sigma^2_{w_i^k}$, since they have not been updated for a long time.
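The allocation rule described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each histogram is represented by its quantile function sampled on a uniform grid of levels, the squared L2-Wasserstein distance is approximated by the mean squared difference of the sampled quantiles, and the boundary test is read as `d2 < (u * sigma) ** 2`. The names `MicroCluster`, `d2w` and `allocate` are hypothetical, and the handling of the maximum number of micro-clusters (deletion or merging) is omitted.

```python
import numpy as np


def d2w(q1, q2):
    """Approximate squared L2-Wasserstein distance between two quantile vectors."""
    return float(np.mean((np.asarray(q1, float) - np.asarray(q2, float)) ** 2))


class MicroCluster:
    def __init__(self, q, w, sigma0):
        q = np.asarray(q, dtype=float)
        self.n = 1                          # allocated histograms
        self.q_sum = q.copy()               # running sum of quantile vectors
        self.sq_sum = float(np.mean(q ** 2))  # running sum of squared norms
        self.sw = w                         # sum of window ids
        self.ssw = w * w                    # sum of squared window ids
        self.sigma0 = sigma0                # heuristic spread used while n == 1

    @property
    def centroid(self):
        # The average quantile function is the L2-Wasserstein mean in 1D.
        return self.q_sum / self.n

    @property
    def sigma(self):
        if self.n == 1:
            return self.sigma0
        var = self.sq_sum / self.n - float(np.mean(self.centroid ** 2))
        return float(np.sqrt(max(var, 0.0)))

    def add(self, q, w):
        q = np.asarray(q, dtype=float)
        self.n += 1
        self.q_sum += q
        self.sq_sum += float(np.mean(q ** 2))
        self.sw += w
        self.ssw += w * w


def allocate(clusters, q, w, u):
    """Allocate q to the nearest micro-cluster or open a new one; return its index."""
    q = np.asarray(q, dtype=float)
    d2 = [d2w(q, c.centroid) for c in clusters]
    k = int(np.argmin(d2))
    if d2[k] < (u * clusters[k].sigma) ** 2:   # assumed form of the boundary test
        clusters[k].add(q, w)
        return k
    # New micro-cluster: spread set heuristically to the squared distance
    # to the closest existing cluster, as described in the text.
    clusters.append(MicroCluster(q, w, sigma0=d2[k]))
    return len(clusters) - 1
```

Keeping running sums rather than the allocated histograms themselves means that the centroid and the spread can be refreshed without revisiting past data.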

If there is no old micro-cluster to delete, the two nearest micro-clusters are merged into one instead.

In the next section, the evaluation of the spatial dependence is performed, taking into account the behavior of the data of each stream. The set of data summaries is sent to a central computation node, which ensures a low network load and a high-quality result. To reach this goal, the communication between the processing node at the sensor and the centralized processing unit consists of two tasks. The first task, performed at predefined time stamps (e.g. every 20 windows), consists of sending the micro-cluster centroids to the central computation node. The second task, performed at every time window, consists of sending the identifier of the micro-cluster to which the histogram of the window has been allocated. The idea is to summarize the data in a window through the centroid to which it has been allocated. If an updated set of centroids is kept by the central node, it is sufficient to send the identifier of the micro-cluster rather than its data.

5.5. Spatial dependence monitoring: a variogram for histogram data

In this section, we introduce the procedure performed at the central node for measuring and monitoring the spatial dependence. In particular, we consider a new tool, namely the variogram for histogram data, which extends the classical variogram to histogram data.


We briefly recall the classical definition of a traditional tool for evaluating the spatial dependence of georeferenced data: the variogram. Given a random process Y, stationary and isotropic, the variogram γ is expressed as a function of the distance $h = \|s_i - s_j\|$ between the spatial locations. An unbiased estimator can be expressed by the empirical variogram of a set of observations $y_i$ located at $s_i$ (i = 1, . . . , k) as follows:

$$\hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{i,j \in N(h)} (y_i - y_j)^2 \qquad [5.3]$$

where $N(h)$ is the set of pairs of observations such that $\|s_i - s_j\| = h$ and $|N(h)|$ is the number of pairs in the set. The variogram estimator cannot be computed at every lag distance h, because of the limited availability of observations. Thus, empirical variograms are often approximated by model functions ensuring validity [CHI 12].

In order to define the variogram for histogram data, we consider the quantile functions associated with the histograms $H_i^w$ (with i = 1, . . . , n), at the window w, as realizations of a functional spatial random process $\chi_s$. Formally:

Let $\{\chi_s : s \in D \subseteq \mathbb{R}^d\}$ be a spatial random process, where s is a data location in a d-dimensional Euclidean space and D is a fixed subset of $\mathbb{R}^d$ with positive volume. We choose n points $s_1, \ldots, s_i, \ldots, s_n$ in D to observe the random functions $\chi_{s_i}(\xi)$, with i = 1, . . . , n. We assume that each $\chi_{s_i}(\xi)$, with $\xi \in [0, 1]$, is a random element of $L_2$ equipped with a Borel σ-algebra, where $L_2 = L_2([0, 1])$ is the set of measurable real-valued functions defined on [0, 1] satisfying $\int_0^1 \chi_{s_i}(\xi)^2 \, d\xi < \infty$.

The space $L_2$ is a separable Hilbert space with the inner product $\langle \chi_{s_i}, \chi'_{s_i} \rangle = \int_0^1 \chi_{s_i}(\xi) \chi'_{s_i}(\xi) \, d\xi$ and the norm $\|\chi_{s_i}\| = \sqrt{\int_0^1 \chi_{s_i}(\xi)^2 \, d\xi}$. The norm $\|\cdot\|$ defines the metric:

$$d(\chi_{s_i}, \chi'_{s_i}) = \sqrt{\int_0^1 \left(\chi_{s_i}(\xi) - \chi'_{s_i}(\xi)\right)^2 d\xi} \qquad [5.4]$$

We assume that for each $\xi \in [0, 1]$, the spatial process is second-order stationary and isotropic, that is, the mean and variance functions are constant and the covariance depends only on the distance between the sampling points:
– $E(\chi_s(\xi)) = m(\xi)$, for all $\xi \in [0, 1]$ and $s \in D$;
– $V(\chi_s(\xi)) = \sigma^2(\xi)$, for all $\xi \in [0, 1]$ and $s \in D$;
– $Cov(\chi_{s_i}(\xi), \chi_{s_j}(\xi)) = C(h, \xi)$, for all $s_i, s_j \in D$ and $\xi \in [0, 1]$, where $h = \|s_i - s_j\|$.


If we assume that the random elements in $L_2$ are quantile functions as defined in equation [5.3], that is, they are non-decreasing continuous functions in [0, 1], the metric in equation [5.4] corresponds to the L2-Wasserstein distance for distribution functions, introduced in [MAL 72] and recalled in equation [5.1]. This makes it possible to start from the variogram for spatio-functional data introduced in [DEL 10, MON 15] to define a variogram function for histogram data. In order to estimate γ(h), consistently with [DEL 10], we can use the following estimator:

$$\hat{\gamma}(h) = \frac{1}{2|N(h)|} \sum_{i,j \in N(h)} \int_0^1 \left(\chi_{s_i}(\xi) - \chi_{s_j}(\xi)\right)^2 d\xi \qquad [5.5]$$

This corresponds to estimating the variogram on histograms using the squared L2-Wasserstein distance:

$$\eta(h) = \frac{1}{2|N(h)|} \sum_{i,j \in N(h)} d_W^2(H_i^w, H_j^w) \qquad [5.6]$$
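The estimator in equation [5.6] can be sketched numerically as below. The sketch assumes that each histogram is given as a quantile function sampled on a common uniform grid of levels, so that the squared L2-Wasserstein distance is approximated by a mean of squared differences, and that pairs are grouped into lag classes defined by `lag_edges`. The function name and the binning scheme are assumptions introduced for illustration.

```python
import numpy as np


def histogram_variogram(coords, quantiles, lag_edges):
    """Empirical variogram for histogram data.

    coords    : (n, d) array of sensor locations
    quantiles : (n, m) array, row i = quantile function of H_i^w on a uniform grid
    lag_edges : increasing sequence of lag class boundaries
    """
    coords = np.asarray(coords, dtype=float)
    quantiles = np.asarray(quantiles, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)          # each unordered pair once
    h = np.linalg.norm(coords[i] - coords[j], axis=1)  # spatial lags
    d2 = np.mean((quantiles[i] - quantiles[j]) ** 2, axis=1)  # approx. d_W^2

    eta = np.full(len(lag_edges) - 1, np.nan)
    for b in range(len(lag_edges) - 1):
        in_bin = (h >= lag_edges[b]) & (h < lag_edges[b + 1])
        if in_bin.any():
            eta[b] = d2[in_bin].sum() / (2 * in_bin.sum())
    return eta
```

Lag classes with no pairs are returned as NaN rather than zero, so that an empty class is not mistaken for an absence of variability.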

As with the variogram for classic quantitative data, η(h) is non-negative, since it is an average of squared distances, and η(0) = 0. Moreover, η(h) is a conditionally negative definite function, so that all linear combinations of the random variables have non-negative variances. Using the decomposition of the squared L2-Wasserstein distance mentioned in equation [5.2], the variogram for histogram data η(h) can be expressed as the sum of two components, denoted by $\eta_l(h)$ and $\eta_v(h)$, that is:

$$\eta(h) = \underbrace{\frac{1}{2|N(h)|} \sum_{i,j \in N(h)} (\mu_{s_i} - \mu_{s_j})^2}_{\eta_l(h)} + \underbrace{\frac{1}{2|N(h)|} \sum_{i,j \in N(h)} \left[ (\sigma_{s_i} - \sigma_{s_j})^2 + 2\sigma_{s_i}\sigma_{s_j}(1 - \rho) \right]}_{\eta_v(h)} \qquad [5.7]$$
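The decomposition underlying equation [5.7] can be checked numerically for a single pair of distributions. In the sketch below, each distribution is represented by its quantile function sampled on a uniform grid (an assumption), µ and σ are the mean and standard deviation of the sampled quantiles, and ρ is their correlation. The function name is hypothetical, and both standard deviations are assumed to be strictly positive.

```python
import numpy as np


def wasserstein_components(q1, q2):
    """Split the squared L2-Wasserstein distance into location, size and shape."""
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    mu1, mu2 = q1.mean(), q2.mean()
    s1, s2 = q1.std(), q2.std()               # population standard deviations
    rho = np.mean((q1 - mu1) * (q2 - mu2)) / (s1 * s2)
    location = (mu1 - mu2) ** 2               # difference in position
    size = (s1 - s2) ** 2                     # difference in spread
    shape = 2.0 * s1 * s2 * (1.0 - rho)       # difference in shape
    return location, size, shape


levels = (np.arange(50) + 0.5) / 50           # uniform grid of quantile levels
qa = levels ** 2                              # two arbitrary non-decreasing
qb = 3.0 + 2.0 * np.sqrt(levels)              # quantile functions
total = float(np.mean((qa - qb) ** 2))        # approximate d_W^2
parts = wasserstein_components(qa, qb)
```

On a discrete grid the three terms add up exactly to the mean squared difference of the quantiles, which is why the location and variability variograms sum to η(h).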

The first component is a variogram accounting only for the histogram locations, since it focuses on the difference in position among the distributions. The second component is related to the different variability structure of the compared distributions, due to the different standard deviations (the “size component”) and to the different shapes of the density functions (the “shape component”).

In this chapter, we propose to use the variogram for histogram data, as defined in equation [5.6], as a measure of the spatial dependence among the streams. If there is spatial dependence, we expect that nearby sensors tend to have lower average distances while distant sensors tend to be more different, so that η(h) is an increasing function.

Every time a new window becomes available, it is possible to measure the spatial dependence by receiving at the central node, for each data stream, the identifier of the micro-cluster to which the histogram of the window has been allocated. This approach makes it possible to measure the spatial dependence of the data in a window by using the micro-cluster centroids rather than the raw sensor data. The variogram computation therefore has to be performed on the micro-cluster centroids. Thus, the variogram for the data synthesis, in a window w, becomes:

$$\eta(h) = \frac{1}{2|N(h)|} \sum_{i,j \in N(h)} d_W^2(H_i^k, H_j^k) \qquad [5.8]$$



where $H_i^k$ and $H_j^k$ are the micro-cluster centroids to which, respectively, the histograms $H_i^w$ and $H_j^w$, which represent the subsequences $Y_i^w$, $Y_j^w$, have been allocated. The proposed variogram for histogram data provides a different measure of the spatial dependence at every time window w, starting from the micro-cluster identifiers sent by the sensors to the central computation node. Since the variogram is computed by considering the average of the pairwise distances at each lag distance h, the variogram $\eta^w(h)$ for the window w can be obtained incrementally from the variogram $\eta^{w-1}(h)$ at the window w − 1, by updating the average values as follows:

$$\eta^w(h) = \frac{1}{2|N^w(h)|} \left[ \eta^{w-1}(h)\,|N(h)| + \sum_{i,j \in N(h)} d_W^2(H_i^k, H_j^{k'}) \right] \qquad [5.9]$$
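One simple way to obtain an aggregate variogram incrementally is sketched below. It is an assumption-laden illustration rather than a transcription of equation [5.9]: the aggregate is taken to be the average over all pairs seen so far in each lag class, so only a running sum of squared distances and a running pair count are stored. The class name `RunningVariogram` is hypothetical.

```python
import numpy as np


class RunningVariogram:
    """Aggregate variogram over windows from running sums (illustrative)."""

    def __init__(self, n_lags):
        self.sum_d2 = np.zeros(n_lags)    # accumulated squared distances per lag class
        self.n_pairs = np.zeros(n_lags)   # accumulated number of pairs per lag class

    def update(self, lag_index, d2):
        """Add one window: lag_index[p] is the lag class of pair p, d2[p] its distance."""
        lag_index = np.asarray(lag_index, dtype=int)
        d2 = np.asarray(d2, dtype=float)
        np.add.at(self.sum_d2, lag_index, d2)
        np.add.at(self.n_pairs, lag_index, 1)

    def value(self):
        """Current aggregate variogram; NaN for lag classes without pairs."""
        with np.errstate(invalid="ignore", divide="ignore"):
            return self.sum_d2 / (2.0 * self.n_pairs)
```

Because the sensor locations are fixed, the same `lag_index` array can be reused at every window; only the distances between the allocated centroids change.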

This is useful when we need an aggregate measurement of the spatial dependence. It should be noted that the set of pairs i, j involved in the average computation depends only on the spatial location of the sensors that record the data.

5.6. Ordinary kriging for histogram data

We introduce a kriging predictor in order to predict the data stream distribution in a time window $w_0$ at an unsampled spatial location $s_0$. Since the data are represented by distributions (histograms), we generalize the classic kriging predictor to histogram data as well as to the quantile functions associated with the empirical distributions. As in the classical case, the weights depend on the spatial dependence measured by the variogram. In this context of analysis, we refer to the variogram for histogram data, as defined in the previous section.


We consider the following linear predictor:

$$\hat{Q}(\xi) = \sum_{i=1}^{n} \lambda_i Q_i^w(\xi) \qquad [5.10]$$

where $Q(\xi)$ is the piecewise quantile function to predict; $Q_i^w(\xi)$ is the piecewise quantile function associated with $H_i^w$; and $\lambda_i \in \Lambda \subset \mathbb{R}$ are the kriging weights. The proposed linear predictor has the same form as the classical ordinary kriging predictor; however, it uses quantile functions rather than scalars. Consistently with the L2-Wasserstein metric, we propose to predict the histogram at the location $s_0$ for the window $w_0$ as a linear combination of the quantile functions of the histograms $H_i^w$ (with i = 1, . . . , n) observed for the same window $w_0$ at the locations $s_1, \ldots, s_n$. Because of the spatial dependence, the kriging coefficients in equation [5.10] are such that the locations closer to the prediction point have a greater influence than locations farther away.

If we assume that the quantile functions are realizations of a functional spatial random process, we can refer to the ordinary kriging for functional data in [GIR 11] for developing our kriging predictor for histogram data. In particular, in order to ensure that $\hat{Q}(\xi)$ is a Best Linear Unbiased Predictor (BLUP) for $Q(\xi)$, with $E\left[\int_0^1 (\hat{Q}(\xi) - Q(\xi)) \, d\xi\right] = 0$, the weights are the solution of the following constrained optimization problem:

$$\min_{\lambda_1, \ldots, \lambda_n} E\left[ \int_0^1 \left( \sum_{i=1}^{n} \lambda_i Q_i^w(\xi) - Q(\xi) \right)^2 d\xi \right] \quad \text{s.t.} \quad \sum_{i=1}^{n} \lambda_i = 1 \qquad [5.11]$$

where $\sum_{i=1}^{n} \lambda_i = 1$ is the unbiasedness constraint.

Consistently with [WAC 03] and [GIR 11], the objective function to minimize can be expressed as:

$$E\left[ \int_0^1 \left( \hat{Q}(\xi) - Q(\xi) \right)^2 d\xi \right] + 2\mu \left( \sum_{i=1}^{n} \lambda_i - 1 \right) \qquad [5.12]$$

where µ is the Lagrange multiplier.

To ensure that the predicted function $\hat{Q}(\xi)$ is a quantile function, we need $\lambda_i \geq 0$ for all i = 1, . . . , n. Thus, the weights are shifted and normalized to sum to 1: $\lambda_i^* = \lambda_i + |\min_i \lambda_i|$ (with i = 1, . . . , n), rescaled so that $\sum_{i=1}^{n} \lambda_i^* = 1$.
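The weight computation can be sketched as follows, under stated assumptions. The constrained problem is solved through the standard ordinary kriging linear system built from variogram values, which is a common way of handling the Lagrangian but is not spelled out in the text. The shift of the weights is applied here only when some weight is negative, which is one reading of the adjustment above. All function names are hypothetical.

```python
import numpy as np


def ordinary_kriging_weights(gamma, gamma0):
    """Solve the ordinary kriging system.

    gamma  : (n, n) variogram values between the sampled locations
    gamma0 : (n,) variogram values between each sampled location and s_0
    """
    gamma = np.asarray(gamma, dtype=float)
    gamma0 = np.asarray(gamma0, dtype=float)
    n = len(gamma0)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = gamma
    A[n, n] = 0.0                          # row/column of the unbiasedness constraint
    b = np.append(gamma0, 1.0)
    solution = np.linalg.solve(A, b)       # [lambda_1..lambda_n, Lagrange multiplier]
    return solution[:n]


def nonnegative_weights(lam):
    """Shift by |min| if needed (assumed reading), then rescale to sum to 1."""
    lam = np.asarray(lam, dtype=float)
    shifted = lam + abs(min(lam.min(), 0.0))
    return shifted / shifted.sum()


def predict_quantile(lam, Q):
    """Linear predictor: weighted combination of the rows of Q (n x m quantile grid)."""
    return np.asarray(lam, dtype=float) @ np.asarray(Q, dtype=float)
```

With non-negative weights summing to one, the prediction is a convex combination of non-decreasing functions and is therefore itself non-decreasing, as required of a quantile function.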


5.7. Experimental results on real data

The strategy proposed in this chapter has been tested on a real dataset. The dataset collects the records of 54 Mica2Dot sensors placed at the Intel Berkeley Research Lab, between February 28 and April 5, 2004, relating to humidity, temperature, light, and voltage. The values are recorded once every 31 seconds. The x and y coordinates of the sensors (in meters relative to the upper right corner of the lab) are also available. Among these variables, the temperature records of each sensor are chosen for our analysis. We have a set of 54 time series, each made up of 65,000 observations.

To perform the analysis, the input parameters have to be set: we fix the number of bins L for the histogram to 10, which is a good compromise between the accuracy of the distribution approximation and the compactness of the representation; we choose the size b of each window (i.e. the number of observations) to be 116, so that each window collects 1 hour of sensor recordings. To initialize the micro-clusters, we fix the training period to $w_{init} = 50$, corresponding to 50 time windows of data for each data stream, and on these data we use a k-means-like algorithm for clustering histogram data based on the L2-Wasserstein distance.

The parameters $K^*$ and u, which together control the amount of summarization of the data, have to be set taking into account the available storage resources. If there is no a priori knowledge of the variability in the data or of the presence of a clustering structure, $K^*$ should be set as high as possible and u close to 1. This leads to many homogeneous and representative micro-clusters. In our test, we set $K^* = 100$ and u = 2. The threshold ϵ is set equal to 2.49. This ensures that the minimum number of curve pairs falling into each interval is 30, according to a widely used rule of thumb for variogram estimation [JOU 04].
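The windowing parameters above can be illustrated with a short sketch that cuts a series into non-overlapping windows of b = 116 observations (116 × 31 s = 3,596 s, about one hour) and summarizes each window by an L = 10 bin histogram. The use of equal-width bins over each window's own range, and the function name, are assumptions made for illustration.

```python
import numpy as np


def window_histograms(series, b=116, L=10):
    """Summarize consecutive windows of b observations by L-bin histograms.

    Returns a list of (bin_edges, relative_frequencies), one per complete window.
    """
    series = np.asarray(series, dtype=float)
    n_windows = len(series) // b            # an incomplete last window is dropped
    summaries = []
    for w in range(n_windows):
        chunk = series[w * b:(w + 1) * b]
        counts, edges = np.histogram(chunk, bins=L)   # equal-width bins (assumed)
        summaries.append((edges, counts / counts.sum()))
    return summaries
```

Each window of 116 raw values is thus reduced to 11 bin edges and 10 relative frequencies before any clustering takes place.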
The online procedure performed on the dataset provides, as a first output, the results of the summarization performed by CluStream for histogram data. We first show the number of micro-clusters obtained by processing the data of each sensor. Figure 5.2 highlights that, for 16 streams, it has been necessary to generate 100 micro-clusters in order to get summaries of low-variability clusters, while only for three streams is the number of micro-clusters lower than 15. The average number of micro-clusters generated for summarizing each data stream is 66.17, and the standard deviation is 33.6.


Figure 5.2. Plot of the number of micro-clusters generated by the analysis of each data stream

The analysis provides monitoring of spatial dependence over time. We show at first the variograms on the time windows and then their decomposition into the location and size-shape components.

Figure 5.3. Variogram for histogram data over the time windows. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

Looking at Figure 5.3, it is possible to note that for windows w < 180, there is low variability in the data and a limited spatial dependence, because the increase of the variability over the lag distances is small. In windows w > 180, the variability in the data is higher, and nearby sensors (h < 10) tend to have a high spatial dependence. As the spatial distance h grows, the variability becomes stable in most of the time windows. Only in a few cases is there still a considerable increase of the variability with the growth of h. Looking at Figure 5.4, we can get insights into the contribution of the location and variability components to the spatial dependence structure over time.


Figure 5.4. Location (top) and variability (bottom) components of the variograms, over the time windows. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

The location component, related to the average of the distributions, has a higher impact on the aggregate variograms than the variability component, whose contribution varies across the time windows. In particular, in the first time windows (w < 200), there is no real spatial dependence in the variability and shape component, while in the more recent data (w > 200) it is more evident.

The next step in this application is the evaluation of the kriging performance. We first need to select a time window and a spatial location. The proposed kriging predictor uses the variogram to determine the weights and, thus, to compute the estimated quantile function. In Figure 5.5, we show the predicted versus observed quantile functions and the corresponding histograms for the time window 50 and the first sensor. This gives an overview of the kriging performance.


Figure 5.5. (a) Predicted versus observed quantile functions and (b) the corresponding histograms for the time window 50 of the first sensor. The red is for the observed function while the green is for the predicted one. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

Figure 5.6. Box plot of the correlation coefficients between observed data and predicted data, computed over the windows. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

In order to get a global evaluation of the kriging performance on the proposed real-world dataset, we have used a cross-validation test. For each stream and each time window, we have computed the kriging prediction using the variogram of the chosen time window. This allows us to compare the quantile function of the observed data with the corresponding kriging prediction. Through the correlation coefficient proposed in [IRP 15], we have measured the association between the prediction and the reference data. The results in Figure 5.6 highlight that the first and third quartiles are, respectively, at ρ = 0.82 and ρ = 0.875, while the worst prediction has a correlation coefficient ρ = 0.75. This highlights the good prediction performance on the test dataset.

5.8. Conclusion

We have proposed a strategy for monitoring the spatial dependence of data recorded by spatially located sensors. This strategy allows us to estimate the data distribution at unsampled spatial locations. The main contribution consists of measuring, online, the spatial dependence using appropriate summaries of the streams computed over time. The summary tool has been chosen by evaluating its capability to record information. In this way, a histogram representation of the subsequences of incoming data is proposed as an effective tool for keeping information about the data distribution. Moreover, the analysis of histogram data can be performed by using an appropriate dissimilarity measure: the L2-Wasserstein distance. This has been widely used in many techniques developed for histogram data. The transformation of the data into sequences of histograms has required an extension of the variogram model. According to the decomposition property of the L2-Wasserstein distance, the new variogram model allows us to highlight two sources of spatial dependence of the histogram data, depending on the position (average) and variability/shape components of the distributions. The proposed variogram has also been used for computing a new kriging predictor with the aim of estimating a data distribution in a place where no sensor is available. As in the classic ordinary kriging for geostatistical data, the predictor is a weighted average of sample values which uses the variogram to account for the variability between the samples and the variances between the samples and the point to estimate. The proposed strategy offers an innovative contribution, generalizing classical spatial statistics measures to the analysis of continuous sequences of data recorded by sensors.
The use of a suitable summarization of the subsequences arriving over time allows the monitoring of the data distributions, as well as the detection of changes in the structure of the data. The evaluation of the performance on simulated and real data has confirmed that, where the hypotheses on the data are satisfied, our strategy is effective both in monitoring the spatial dependence and in predicting the data distribution.

5.9. References

[AGG 03] AGGARWAL C.C., HAN J., WANG J. et al., “Clustream: a framework for clustering evolving data streams”, Very Large Data Bases, vol. 29, pp. 81–92, 2003.
[BAL 13] BALZANELLA A., RIVOLI L., VERDE R., “Data Stream Summarization by Histograms Clustering”, in GIUDICI P., INGRASSIA S., VICHI M. (eds), Statistical Models for Data Analysis, Studies in Classification, Data Analysis, and Knowledge Organization, Springer International Publishing, Heidelberg, pp. 27–35, 2013.
[CHI 12] CHILES J.P., DELFINER P., Geostatistics, Modelling Spatial Uncertainty, 2nd ed., Wiley-Interscience, Hoboken, 2012.
[DEL 99] DEL BARRIO E., CUESTA-ALBERTOS J.A., MATRÁN C. et al., “Tests of goodness of fit based on the L2-Wasserstein distance”, Annals of Statistics, vol. 27, no. 4, pp. 1230–1239, 1999.
[DEL 10] DELICADO P., GIRALDO R., COMAS C. et al., “Statistics for spatial functional data: some recent contributions”, Environmetrics, vol. 21, nos 3–4, pp. 224–239, 2010.
[DIN 02] DING Q., DING Q., PERIZZO W., “Decision tree classification of spatial data streams using Peano Count Trees”, Proceedings of the 2002 ACM Symposium on Applied Computing, ACM, New York, USA, March 2002.
[GIR 11] GIRALDO R., DELICADO P., MATEU J., “Ordinary kriging for function-valued spatial data”, Environmental and Ecological Statistics, vol. 18, no. 3, pp. 411–426, 2011.
[IRP 15] IRPINO A., VERDE R., “Basic statistics for distributional symbolic variables: a new metric-based approach”, Advances in Data Analysis and Classification, vol. 9, no. 2, pp. 143–175, 2015.
[JOU 04] JOURNEL A., HUIJBREGTS C.J., Mining Geostatistics, The Blackburn Press, Caldwell, 2004.
[MAL 72] MALLOWS C.L., “A note on asymptotic joint normality”, Annals of Mathematical Statistics, vol. 43, no. 2, pp. 508–515, 1972.
[MON 15] MONTERO J.-M., FERNÁNDEZ-AVILÉS G., MATEU J., An Introduction to Functional Geostatistics, John Wiley & Sons, Hoboken, pp. 274–294, 2015.
[WAC 03] WACKERNAGEL H., Multivariate Geostatistics, Springer, Berlin, 2003.
[WEI 13] WEI L.-Y., PENG W.-C., “An incremental algorithm for clustering spatial data streams: exploring temporal locality”, Knowledge and Information Systems, vol. 37, no. 2, pp. 453–483, 2013.

6 Incremental Calculation Framework for Complex Data

In today’s digital world, the data for business analysis are increasingly collected from diverse sources. This contributes to the complexity of data analysis because it introduces many uncommon data types, such as functional data, compositional data, histogram data, and so on. Much attention has been paid to statistical methods for these new types of data. However, there have been few studies on incremental calculation for these complex data types, which is now necessary in practical applications. In this chapter, we develop an incremental calculation framework for complex data that can be applied to various data types. We first transform the complex data into basic data and then propose an incremental calculation method based on this type of data. The incremental calculation framework can be applied to many frequently used statistical models built on the covariance matrix. We take linear regression of functional data and principal component analysis of compositional data as examples to discuss incremental calculation for complex types of data. The simulation results show both the efficiency and effectiveness of this framework.

Chapter written by Huiwen WANG, Yuan WEI and Siyang WANG.

6.1. Introduction

With the maturity of “big data” technology, the data used for business analysis are collected from various sources, which makes the data types quite complex. For example, in the case of movie box office prediction, data from Internet searches, comments in blogs, film types, and showtimes can all be used to achieve a more precise prediction result [MES 13]. As the data from these different sources are no longer confined to conventional numerical data, other data types are beginning to play an important role in practical applications.

There have been plenty of studies on how to deal with these non-numerical data. Functional data, for instance, are applied in a wide range of areas, promoting the development of functional data analysis [FER 06], [RAM 91]. Since functional data are intrinsically infinite dimensional, the commonly used approach for dimension reduction is to expand the functional data on some basis functions. The coefficients of these basis functions can then be used in further statistical modeling [RAM 06]. Compositional data are another commonly used data type, as demonstrated for example by applications in economic and engineering data analysis. This type of data is often expressed in proportions or percentages, conveying quantitatively how the parts make up a whole [AIT 82]. All components of compositional data are subject to non-negativity and constant-sum (e.g. 100 weight percent) constraints, which are hard to satisfy in statistical modeling. To meet this challenge, the general idea is to first remove the constraints by some transformation, such as the additive log-ratio (alr) transformation, the centered log-ratio (clr) transformation [AIT 82], or the isometric log-ratio (ilr) transformation [EGO 03]. Then, traditional statistical methods can be applied to the transformed vectors. Many other non-numerical data types have also been discussed, such as interval-valued data [DID 88], histogram data [NAG 07], tensor data [LU 11], and others [CHE 13], [TAO 08], [TAO 09]. In this chapter, we collectively refer to these non-numerical data as complex data.

In recent years, the data collected for analysis have become not only more varied in type, but also much denser in frequency. This makes online modeling necessary in practical applications.
Studies of incremental calculation have attracted attention since Giraud-Carrier defined the notion of incrementality for learning tasks and algorithms [GIR 00]. This work clarified that an incremental learning algorithm should be able to update its hypotheses using only the previous results and the current data. Up to now, the main works in this area can be classified into three directions: online matrix decomposition, specific incremental algorithms, and frameworks for online modeling.

Online matrix decomposition laid the foundation for incremental calculation. Gill et al. [GIL 74] gave several methods for modifying the Cholesky factors of a matrix following a rank-one change. In addition, Gu et al. [GU 94] presented stable and fast algorithms for updating the SVD of an m × n matrix when appending a row, which reduced the floating point operations from $O((m + n)\min^2(m, n))$ to $O((m + n)\min(m, n)\log_2^2 \varepsilon)$, where ε is the machine precision. Levey and Lindenbaum [LEV 00] proposed the Sequential Karhunen–Loeve (SKL) algorithm, presenting a new incremental eigenbasis updating method when given one or more pieces of additional training data. Ross et al. [ROS 08] improved this SKL method by introducing a forgetting parameter, which ensured that less modeling power was expended on the older observations. Some other online matrix decomposition methods can be found in [BRA 02], [LI 00], [WEN 03].

Based on these works, there have been plenty of papers concerning incremental algorithms for specific models, such as incremental multiple linear regression [HUI 14], online principal component analysis [MAI 10], online logistic regression [XIE 13], online group lasso [YAN 10], online dictionary learning [MAI 09], and so on.

By comparison, there are few papers about frameworks for online modeling. Chu et al. [CHU 07] first proposed a “summation form”, which adapted Google’s MapReduce paradigm and achieved basically linear speedup with an increasing number of processors for a variety of statistical models. This work mainly focused on parallel computing and did not address incremental calculation. Luts et al. [LUT 14] realized incremental calculation for various semi-parametric regression methods by employing online mean field variational ideas and putting these models into the framework of a graphical model. Some other works on designing a framework for incremental computing can be found in [BRO 13], [LUT 15] and [PAT 09].

Figure 6.1. The incremental calculation framework for complex data

However, although there have been plenty of works on incremental calculation for numeric data, as introduced earlier, there are very few incremental computing methods for complex data. In this chapter, we propose a framework to realize incremental modeling for complex data, which is shown in Figure 6.1. The main idea of this framework is that we first transform the complex data into vector data, which is referred to as the basic data type in this chapter, and then build the incremental computing models on this basic data type. During this process, the transformation plays a key role and much attention should be paid to error control.

Note that plenty of frequently used statistical models are built on a covariance matrix, such as ordinary least squares regression (OLS), principal component analysis (PCA), linear discriminant analysis (LDA), Lasso, Elastic Net, sparse principal component analysis (SPCA), and so on. Therefore, by incremental calculation of the covariance matrix, these methods can easily be developed into online algorithms. In this chapter, we propose two online covariance matrix decomposition methods for basic data. The first method, ICMD1, provides precise decomposition results and computes efficiently when the data dimension is not high. The second decomposition method, ICMD2, can adapt to the high-dimensional case. We provide functional linear regression and compositional PCA as examples to show the efficiency of the proposed framework.

The remainder of the chapter is organized as follows. In section 6.2, we define the basic data space and propose some commonly used operations. In section 6.3, we propose incremental calculation algorithms for complex data, and present online functional linear regression and online compositional PCA as examples. Section 6.4 presents the simulation results. Section 6.5 concludes the chapter.

6.2. Basic data

The vector data type is a commonly used data type, which is referred to as basic data in this chapter. As numeric data can be seen as 1-dimensional vector data, the operations on vector data also apply to numeric data. In this section, we will first propose the definition of the basic data space Φ and then introduce some frequently used operations in the basic data space.

6.2.1. The basic data space

Let us define the basic data space as Φ. Φ is an inner product space, and addition and scalar multiplication satisfy the axioms of a linear space [DUN 71]. Suppose Φ is the space of s-dimensional basic data (s = 1 for numeric data); then $x \in \mathbb{R}^s$ for all $x \in \Phi$. For all $x, y \in \Phi$ and $\alpha \in \mathbb{R}$, addition and scalar multiplication are defined as:

$$x \oplus y = (x_1 + y_1, x_2 + y_2, \ldots, x_s + y_s)^T \qquad [6.1]$$

$$\alpha \otimes x = (\alpha x_1, \alpha x_2, \ldots, \alpha x_s)^T \qquad [6.2]$$

where “$(\,)^T$” is the operation of matrix transpose. Then, subtraction can be deduced as:

$$x \ominus y = (x_1 - y_1, x_2 - y_2, \ldots, x_s - y_s)^T \qquad [6.3]$$

Let us define the inner product of $x, y \in \Phi$ as:

$$\langle x, y \rangle = \sum_{i=1}^{s} x_i y_i \qquad [6.4]$$
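The operations in equations [6.1]–[6.4] map directly onto array arithmetic. The sketch below represents an s-dimensional basic datum as a NumPy vector; the function names `oplus`, `otimes`, `ominus` and `inner` are hypothetical labels chosen to mirror the symbols ⊕, ⊗, ⊖ and ⟨·,·⟩.

```python
import numpy as np


def oplus(x, y):
    """Addition in the basic data space, equation [6.1]."""
    return np.asarray(x, dtype=float) + np.asarray(y, dtype=float)


def otimes(alpha, x):
    """Scalar multiplication, equation [6.2]."""
    return alpha * np.asarray(x, dtype=float)


def ominus(x, y):
    """Subtraction deduced from the two operations above, equation [6.3]."""
    return oplus(x, otimes(-1.0, y))


def inner(x, y):
    """Inner product, equation [6.4]."""
    return float(np.dot(np.asarray(x, dtype=float), np.asarray(y, dtype=float)))


x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
```

For s = 1 these functions reduce to ordinary arithmetic on real numbers, which matches the remark that numeric data are a special case of basic data.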

Based on the definition of the basic data space, we next introduce the basic data matrix and the n-dimensional basic data space, denoted by $\Phi^n$.


Let $X_{n \times p} = (x_{ij})_{n \times p}$ be a basic data matrix with n observations and p variables. In this matrix, each cell $x_{ij} \in \Phi$ (i = 1, 2, . . . , n; j = 1, 2, . . . , p) is of the basic data type. Note that each cell of the basic data matrix is a vector with a group structure, which is different from the p-dimensional sample vector space in multivariate statistical analysis. $X_{n \times p}$ can also be written as $X_{n \times p} = (x_1, x_2, \ldots, x_p)$, where $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$ is the jth variable of $X_{n \times p}$ (j = 1, 2, . . . , p). $x_j$ is in the n-dimensional basic data space, and we denote this space by $\Phi^n$. For $x_j, x_k \in \Phi^n$ and $\alpha \in \mathbb{R}$, the addition, scalar multiplication, and inner product are defined as:

$$x_j \oplus x_k = (x_{1j} \oplus x_{1k}, \ldots, x_{nj} \oplus x_{nk})^T \qquad [6.5]$$

$$\alpha \otimes x_j = (\alpha \otimes x_{1j}, \ldots, \alpha \otimes x_{nj})^T \qquad [6.6]$$

$$\langle x_j, x_k \rangle_n = \sum_{i=1}^{n} \langle x_{ij}, x_{ik} \rangle \qquad [6.7]$$

where $x_{ij}$ is s-dimensional basic data. Let the dth component of the basic data be $x^d_{ij}$ $(d = 1, \ldots, s)$. By equations [6.1], [6.2] and [6.4], we get $x_{ij} \oplus x_{ik} = (x^1_{ij} + x^1_{ik}, \ldots, x^s_{ij} + x^s_{ik})^T$, $\alpha \otimes x_{ij} = (\alpha x^1_{ij}, \ldots, \alpha x^s_{ij})^T$ and $\langle x_{ij}, x_{ik} \rangle = \sum_{d=1}^{s} \langle x^d_{ij}, x^d_{ik} \rangle$. Thus, with basic data operations, equations [6.5]–[6.7] can be calculated.

6.2.2. Sample covariance matrix

For variable $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T \in \Phi^n$, denote the sample mean as:

$$\bar{x}_j = \frac{1}{n} \otimes \bigoplus_{i=1}^{n} x_{ij} \in \Phi \qquad [6.8]$$

Based on this, we can centralize $x_j$. Defining $\mathrm{cen}(x_j) = (\bar{x}_j, \ldots, \bar{x}_j)^T \in \Phi^n$, the centralized variable is:

$$x_j \ominus \mathrm{cen}(x_j) = (x_{1j} \ominus \bar{x}_j, x_{2j} \ominus \bar{x}_j, \ldots, x_{nj} \ominus \bar{x}_j)^T \qquad [6.9]$$

For all $x_j, x_k \in \Phi^n$, the sample covariance, denoted $\mathrm{Cov}_s(x_j, x_k)$, can be calculated with:

$$\mathrm{Cov}_s(x_j, x_k) = \frac{1}{n} \langle x_j \ominus \mathrm{cen}(x_j), x_k \ominus \mathrm{cen}(x_k) \rangle_n = \frac{1}{n} \langle x_j, x_k \rangle_n - \langle \bar{x}_j, \bar{x}_k \rangle \qquad [6.10]$$

With the operations defined in equations [6.4] and [6.7], the sample mean and sample covariance are easily obtained. The sample covariance matrix of the basic data matrix $X_{n \times p}$ can then be calculated with $\mathrm{Cov}_s(X) = [\mathrm{Cov}_s(x_j, x_k)]_{p \times p}$ $(j, k = 1, 2, \ldots, p)$.
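As a minimal sketch of equation [6.10], the following Python code computes the sample covariance matrix of a basic data matrix with NumPy. The array layout (n, p, s) and the function name `basic_cov` are assumptions made for this illustration and do not come from the chapter.

```python
import numpy as np

def basic_cov(X):
    """Sample covariance matrix of a basic data matrix, following [6.10].

    X has shape (n, p, s): n observations, p variables, and every cell is an
    s-dimensional vector (s = 1 gives ordinary numeric data).
    """
    n = X.shape[0]
    # <x_j, x_k>_n = sum_i <x_ij, x_ik>  (equations [6.7] and [6.4])
    gram = np.einsum("ijd,ikd->jk", X, X)
    # sample means, one s-dimensional vector per variable (equation [6.8])
    mean = X.mean(axis=0)
    # Cov_s(x_j, x_k) = <x_j, x_k>_n / n - <mean_j, mean_k>
    return gram / n - mean @ mean.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4, 3))   # hypothetical data: n = 50, p = 4, s = 3
C = basic_cov(X)                  # a 4 x 4 matrix
```

Under this layout, the result should equal the sum over the s components of the usual (biased) numeric covariance matrices, which gives a simple way to check the sketch.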


6.3. Incremental calculation of complex data

Although data types are becoming more and more diverse in practical applications, many complex data types can be turned into basic data through a specific transformation. Here, we first introduce how to convert them into the basic data space, and then give the online updating method for a sample covariance matrix in this space. As plenty of statistical models are built on a covariance matrix, updating the sample covariance matrix online yields an incremental calculation framework that can be adapted to many statistical algorithms.

6.3.1. Transformation of complex data

Taking the commonly used functional data and compositional data as examples, the transformation methods are given as follows.

Functional data

Suppose a stochastic process $\{x(t) : t \in F, \int_F x^2(t)\,dt < \infty\}$ is defined on $(\Omega, \mathcal{B}, P)$, and the sample paths of $\{x(t) : t \in F\}$ are in $L^2(F)$, the set of all square-integrable functions on $F$. $x(t)$ is called functional data, often assumed to be in a Hilbert space [RAM 06]. To express different kinds of functional data, such as curves, images and arrays, $F$ can be a subset of $\mathbb{R}$, $\mathbb{R}^p$, or other spaces.

To deal with this infinite-dimensional functional data, the general way is to approximate it by some transformation technique [RAM 02]. The mainstream technique is the functional basis expansion approach. The choice of basis functions can be divided into two categories [WAN 15]: one uses data-adaptive basis functions that are determined by the covariance function of the functional data, for example the finite approximation functional PCA (FPCA) approach [YAO 06]; the other projects the functional data onto a given set of basis functions, such as spline basis functions [SCH 07], and chooses the number of basis functions according to the data. In this chapter, we take the centralized standard orthogonal B-spline functions as basis functions, and expand $x(t)$ upon them. The approximation error can be controlled by choosing the number of basis functions, and the sets of basis coefficients are used for the next modeling step. For instance, let $\phi(t) = [\phi_k(t)]$, $k = 1, 2, \ldots, K$, be the centralized standard orthogonal basis functions, and we get

$$x(t) \simeq \sum_{k=1}^{K} b_k \phi_k(t), \qquad [6.11]$$

where the basis coefficient $b_k$ is numeric data and can be used for statistical modeling. Thus, by expanding the functional data on basis functions, we can transform it into the basic data space.


Compositional data

Define the D-part compositional data $x$ as $x = (x^1, x^2, \ldots, x^D)^T$, where $x^i > 0$ and $\sum_{i=1}^{D} x^i = 1$ $(i = 1, 2, \ldots, D)$. Compositional data have "non-negative" and "constant-sum" constraints, which bring difficulties to statistical modeling [AIT 82]. Several log-ratio transformations have been proposed to release these constraints. Among these transformations, the isometric log-ratio (ilr) method can solve the multicollinearity problem for compositional data, and it transforms the compositional data unit exactly into the basic data space. Let $z = \mathrm{ilr}(x)$; then $z$ can be calculated by

$$z^k = \sqrt{\frac{k}{k+1}} \log \frac{\sqrt[k]{\prod_{j=1}^{k} x^j}}{x^{k+1}}, \quad k = 1, 2, \ldots, D - 1. \qquad [6.12]$$

Obviously, $z$ is of the basic data type, and we can use the operations defined in section 6.2 for the transformed compositional data. In addition, the inverse-ilr transformation can be calculated by

$$y^u = \sum_{k=u}^{D} \frac{z^k}{\sqrt{k(k+1)}} - \sqrt{\frac{u-1}{u}}\, z^{u-1}, \qquad x^u = \frac{\exp(y^u)}{\sum_{u=1}^{D} \exp(y^u)} \quad (1 \le u \le D,\ z^0 = z^D = 0). \qquad [6.13]$$

With the inverse-ilr transformation, compositional data can be generated from the basic data type.

6.3.2. Online decomposition of covariance matrix

After transforming the complex data into the basic data type, we can develop online covariance decomposition algorithms. In practical applications, the decomposition algorithm is usually chosen according to the dimension of the data set. In this part, we propose two incremental covariance updating methods for basic data. The first incremental covariance matrix decomposition method (ICMD1) computes efficiently when the data dimension is low (e.g. $p < 100$), while the second (ICMD2), which draws on the idea of Ross et al. [ROS 08], is more suitable for the high-dimensional situation.

ICMD1

Denote $X = (x_1, x_2, \ldots, x_p)^T$ as a $p \times n$ basic data matrix, where variable $x_j = (x_{1j}, \ldots, x_{nj}) \in \Phi^n$ $(j = 1, \ldots, p)$ has $n$ observations and $x_{ij} \in \Phi$ $(i = 1, \ldots, n)$ is basic data. With $m$ new observations $X^*_{p \times m}$, the updated matrix $\tilde{X}_{p \times (n+m)}$ can be written as $\tilde{X} = [X, X^*]$. Let $\bar{x}_j$, $\bar{x}^*_j$ and $\bar{\tilde{x}}_j$ be the sample means of the old data $x_j$, the new data $x^*_j$ and the updated combined data $\tilde{x}_j$, respectively, where $\tilde{x}_j = [x_j, x^*_j]$. Then, we can update the sample mean with

$$\bar{\tilde{x}}_j = \frac{1}{n+m} \otimes (n \otimes \bar{x}_j \oplus m \otimes \bar{x}^*_j) \qquad [6.14]$$

For all $\tilde{x}_j, \tilde{x}_k \in \Phi^{m+n}$, the online updating method of the sample covariance can be derived as follows:

$$\mathrm{Cov}_s(\tilde{x}_j, \tilde{x}_k) = \frac{1}{n+m} \langle \tilde{x}_j, \tilde{x}_k \rangle_{n+m} - \langle \bar{\tilde{x}}_j, \bar{\tilde{x}}_k \rangle = \frac{1}{n+m} \left( \langle x_j, x_k \rangle_n + \langle x^*_j, x^*_k \rangle_m \right) - \langle \bar{\tilde{x}}_j, \bar{\tilde{x}}_k \rangle \qquad [6.15]$$
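The updates [6.14] and [6.15] can be sketched in Python for numeric data (s = 1) as follows. This is an illustration under stated assumptions rather than the authors' implementation: the class name `RunningCov` and its interface are hypothetical, and only the sample size, the mean and the matrix of inner products are stored.

```python
import numpy as np

class RunningCov:
    """Running sample covariance for numeric data (s = 1), after [6.14]-[6.15]."""

    def __init__(self, X):
        # X: (p, n) block of initial observations, one column per observation
        self.n = X.shape[1]
        self.mean = X.mean(axis=1)
        self.gram = X @ X.T            # [<x_j, x_k>_n], a p x p matrix

    def update(self, X_new):
        # X_new: (p, m) block of new observations
        m = X_new.shape[1]
        total = self.n + m
        # equation [6.14]: weighted combination of the old and new means
        self.mean = (self.n * self.mean + m * X_new.mean(axis=1)) / total
        # the inner products of the old and new blocks simply add up
        self.gram += X_new @ X_new.T
        self.n = total

    def cov(self):
        # equation [6.15]
        return self.gram / self.n - np.outer(self.mean, self.mean)

rng = np.random.default_rng(0)
X_old = rng.normal(size=(5, 100))     # hypothetical old data, p = 5, n = 100
X_new = rng.normal(size=(5, 10))      # hypothetical new block, m = 10
rc = RunningCov(X_old)
rc.update(X_new)
eigvals, eigvecs = np.linalg.eigh(rc.cov())   # exact eigen decomposition
```

The old observations are never revisited: each update touches only the m new columns and the stored p × p matrix, before the p × p decomposition is recomputed.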

Therefore, the update of the sample covariance matrix only depends on the sample means, the sample sizes and the variable inner products of the old and the newly added data. Considering the updated sample covariance matrix of $\tilde{X}$, $\mathrm{Cov}_s(\tilde{X}) = [\mathrm{Cov}_s(\tilde{x}_j, \tilde{x}_k)]_{p \times p}$, its elements can be updated by equations [6.14] and [6.15]. Then, by a matrix decomposition method (singular value decomposition (SVD), for example), we can derive the updated eigenvalues and eigenvectors from $\mathrm{Cov}_s(\tilde{X})$. The time complexity is $O(mp^2 + p^3)$, where $m$ represents the number of new samples. This method calculates the eigenvalues and eigenvectors precisely, and it can greatly reduce the computing time and storage space when $n \gg m$ and $p$ is not high.

ICMD2

When $p$ grows, it is not easy to decompose the $p \times p$ covariance matrix online. Borrowing the sample mean updating method in Ross et al. [ROS 08], we introduce the ICMD2 method. First, we propose Lemma 6.1 for basic data.

LEMMA 6.1.– Denote $S = n\,\mathrm{Cov}_s(X)$, $S^* = m\,\mathrm{Cov}_s(X^*)$ and $\tilde{S} = (m+n)\,\mathrm{Cov}_s(\tilde{X})$. The following equation for the basic data type can be proved:

$$\tilde{S} = S + S^* + \frac{nm}{n+m} \left[ \langle (\bar{x}^*_j \ominus \bar{x}_j), (\bar{x}^*_k \ominus \bar{x}_k) \rangle \right]_{p \times p}$$

From Lemma 6.1, the decomposition of $(\tilde{X} \ominus \bar{\tilde{X}})$ is equal to the decomposition of the horizontal concatenation of $(X \ominus \bar{X})$, $(X^* \ominus \bar{X}^*)$ and one additional vector $\sqrt{\frac{nm}{n+m}} \otimes (\bar{X}^* \ominus \bar{X})$, where $X \ominus \bar{X} = (x_1 \ominus \mathrm{cen}(x_1), \ldots, x_p \ominus \mathrm{cen}(x_p))^T$. We define

$$\hat{X}^* = \left[ (X^* \ominus \bar{X}^*),\ \sqrt{\frac{nm}{n+m}} \otimes (\bar{X}^* \ominus \bar{X}) \right]_{p \times (m+1)}$$

to take the updating of the sample mean into account. For the covariance matrix of basic data, the following Lemma 6.2 holds:


LEMMA 6.2.– Let $x_{ij} \in \Phi$ be the ith observation of the jth variable of $X_{p \times n}$ $(i = 1, \ldots, n;\ j = 1, \ldots, p)$, and let $x^d_{ij}$ be the dth part of $x_{ij} = (x^1_{ij}, \ldots, x^s_{ij})$ $(d = 1, \ldots, s)$. Then, by the definition in [6.10], we have:

$$\mathrm{Cov}_s(X) = \sum_{d=1}^{s} \left[ \frac{1}{n} \sum_{i=1}^{n} x^d_{ij} x^d_{ik} - \bar{x}^d_j \bar{x}^d_k \right]_{p \times p} = \mathrm{Cov}_s(Z)$$

where $Z = [X^1, \ldots, X^s]$ is the expanded $(p \times ns)$-dimensional matrix of $X$. Thus, the covariance of the basic data matrix $X$ is equal to the covariance of the numeric data matrix $Z$.

From Lemma 6.2, we expand the basic data matrix $\hat{X}^*$ to $Z^*$, where $Z^* = [\hat{X}^{*1}, \ldots, \hat{X}^{*s}]_{p \times s(m+1)}$, and can still achieve equivalent eigen decomposition results. Let $\Gamma = U D U^T$ be a rank-$k$ approximation to $S$, where the $k \times k$ diagonal matrix $D$ approximates the first $k$ eigenvalues and the $p \times k$ matrix $U$ approximates the corresponding eigenvectors. Let $Z'$ be the standardized component of $Z^*$ orthogonal to the eigenvectors $U$. The eigenvalue decomposition equation of $\tilde{S}$ can be rewritten as:

$$\tilde{\Gamma} = [U, Z']\, Q\, [U, Z']^T \qquad [6.16]$$

with

$$Q = \begin{bmatrix} D + U^T Z^* Z^{*T} U & U^T Z^* Z^{*T} Z' \\ Z'^T Z^* Z^{*T} U & Z'^T Z^* Z^{*T} Z' \end{bmatrix} \qquad [6.17]$$

It then suffices to perform the eigen decomposition of the matrix $Q$ of dimension $(k + s(m+1)) \times (k + s(m+1))$. Writing $Q = V \Sigma V^T$ with $V$ orthogonal and $\Sigma$ diagonal, the decomposition of $\tilde{\Gamma}$ is simply expressed as $\tilde{U} \tilde{D} \tilde{U}^T$, where $\tilde{D} = \Sigma$ and

$$\tilde{U} = [U, Z']\, V \qquad [6.18]$$

We can keep the $k$ largest eigenvalues by deleting the small ones, and drop the associated eigenvectors in $\tilde{U}$.
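To make equations [6.16]–[6.18] concrete, here is a minimal Python sketch for numeric data (s = 1). It is an illustration under assumptions, not the authors' code: the function name `icmd2_update`, the `keep` and `tol` parameters and the use of an SVD to orthonormalize the residual (in place of a generic `orth` routine) are choices made for this example.

```python
import numpy as np

def icmd2_update(U, d, Z_star, keep, tol=1e-10):
    """One ICMD2-style update for numeric data (s = 1), after [6.16]-[6.18].

    U (p, k) and d (k,) hold the current eigenvectors and eigenvalues of S;
    Z_star (p, m + 1) is the centred new block plus the mean-correction column.
    """
    k = U.shape[1]
    # component of Z* orthogonal to span(U), orthonormalised through an SVD
    resid = Z_star - U @ (U.T @ Z_star)
    left, sv, _ = np.linalg.svd(resid, full_matrices=False)
    Z_perp = left[:, sv > tol * max(1.0, np.linalg.norm(Z_star))]
    basis = np.hstack([U, Z_perp])                 # [U, Z']
    # Q of equation [6.17]: basis^T Z* Z*^T basis, plus D in the top-left block
    proj = basis.T @ Z_star
    Q = proj @ proj.T
    Q[:k, :k] += np.diag(d)
    evals, V = np.linalg.eigh(Q)
    order = np.argsort(evals)[::-1][:keep]         # largest eigenvalues first
    return basis @ V[:, order], evals[order]       # equation [6.18]

# Hypothetical example: old data of exact rank k, so that U D U^T equals S.
rng = np.random.default_rng(1)
p, n, m, k = 8, 200, 4, 3
X_old = rng.normal(size=(p, k)) @ rng.normal(size=(k, n))
X_new = rng.normal(size=(p, m))
mu_old, mu_new = X_old.mean(axis=1), X_new.mean(axis=1)

Xc = X_old - mu_old[:, None]
evals, evecs = np.linalg.eigh(Xc @ Xc.T)
U, d = evecs[:, -k:], evals[-k:]

Z_star = np.hstack([X_new - mu_new[:, None],
                    np.sqrt(n * m / (n + m)) * (mu_new - mu_old)[:, None]])
U_new, d_new = icmd2_update(U, d, Z_star, keep=p)
```

In this example the old data have exact rank k, so the update should reproduce the scatter matrix of the combined data; with a truncated U the result is only an approximation, which is the trade-off ICMD2 accepts in exchange for decomposing a small matrix.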


In addition, a forgetting parameter $f$ can also be introduced in this method. By changing $D$ in [6.17] to $fD$ and forming the sample mean updating equation as

$$\bar{\tilde{x}}_j = \frac{1}{fn + m} \otimes (fn \otimes \bar{x}_j \oplus m \otimes \bar{x}^*_j), \qquad [6.19]$$

ICMD2 can be adapted to unstable situations. This method requires $O(p(k + sm + s))$ storage space and $O(p(k + sm + s)^2)$ computation time for the $m$ new samples.

6.3.3. Adopted algorithms

In this section, we provide functional linear regression and compositional PCA as examples to show the application of the incremental calculation framework for complex data.

Functional linear regression

We use the ICMD1 method to build the online functional linear regression method in this section. Assume the functional linear regression model as:

$$y = \int \beta(t) x(t)\,dt + \varepsilon \qquad [6.20]$$

where $y$ and $x(t)$ have been centralized. By expanding $x(t)$ and $\beta(t)$ on the basis functions $\phi(t) = [\phi_k(t)]$ $(k = 1, 2, \ldots, K)$, we get:

$$x(t) \approx \sum_{k=1}^{K} a_k \phi_k(t), \qquad \beta(t) \approx \sum_{k=1}^{K} b_k \phi_k(t). \qquad [6.21]$$

Then, equation [6.20] can be converted to:

$$y = \int \Big( \sum_{k=1}^{K} b_k \phi_k(t) \Big) \Big( \sum_{k=1}^{K} a_k \phi_k(t) \Big) dt + \omega = \sum_{k=1}^{K} b_k a_k + \omega \qquad [6.22]$$

where "$\omega$" is the regression fitting error; the second equality follows from the orthonormality of the basis functions. Obviously, equation [6.22] is a multivariate regression model of 1-dimensional basic data, where $b_k$ $(k = 1, 2, \ldots, K)$ are the parameters to be estimated. Here, we use the operation symbols of numeric data for the purpose of simplicity. Define $S_{xy} = n[\mathrm{Cov}_s(a_k, y)]_{K \times 1}$ and $S_x = n[\mathrm{Cov}_s(a_k, a_j)]_{K \times K}$ $(k, j = 1, 2, \ldots, K)$. Then, we have:

$$\hat{b} = (S_x)^{-1} S_{xy}. \qquad [6.23]$$
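The estimate [6.23], combined with the updating rules [6.14] and [6.15], can be sketched in Python as below. This is a hedged illustration: the class `OnlineRegression` and its method names are hypothetical, the basis coefficients of each curve are assumed to be already available as the rows of a matrix, and the linear system is solved directly instead of forming the inverse.

```python
import numpy as np

class OnlineRegression:
    """Least-squares fit of y on basis coefficients a_1..a_K, updated in blocks."""

    def __init__(self, A, y):
        # A: (n, K) basis coefficients of the old curves, y: (n,) responses
        self.n = len(y)
        self.a_bar, self.y_bar = A.mean(axis=0), y.mean()
        self.W_a, self.W_ay = A.T @ A, A.T @ y      # raw inner products

    def update(self, A_new, y_new):
        # A_new: (m, K) coefficients of the new curves, y_new: (m,) responses
        m = len(y_new)
        total = self.n + m
        self.a_bar = (self.n * self.a_bar + A_new.sum(axis=0)) / total
        self.y_bar = (self.n * self.y_bar + y_new.sum()) / total
        self.W_a += A_new.T @ A_new
        self.W_ay += A_new.T @ y_new
        self.n = total

    def coef(self):
        # S_x and S_xy are n times the sample covariances, as in [6.23]
        S_x = self.W_a - self.n * np.outer(self.a_bar, self.a_bar)
        S_xy = self.W_ay - self.n * self.y_bar * self.a_bar
        return np.linalg.solve(S_x, S_xy)

rng = np.random.default_rng(0)
A = rng.normal(size=(300, 5))                       # hypothetical coefficients
y = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)
model = OnlineRegression(A[:250], y[:250])
model.update(A[250:], y[250:])
b_hat = model.coef()
```

Because only sums, means and inner products are stored, the incremental estimate should coincide with the least-squares fit on the full centralized data set, up to rounding.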


With the newly added data $x(t)^*, y^*$, expand $x(t)^* \approx \sum_{k=1}^{K} a^*_k \phi_k(t)$. By equations [6.14] and [6.15], obtain the updated sample covariance matrices $\tilde{S}_x$ and $\tilde{S}_{xy}$. Then, we can easily calculate $\hat{\tilde{b}}$ incrementally.

Also, we can rebuild the functional regression coefficient from the estimated $\hat{\tilde{b}}$ and the basis functions $\phi(t)$:

$$\hat{\tilde{\beta}}(t) \approx \sum_{k=1}^{K} \hat{\tilde{b}}_k \phi_k(t). \qquad [6.24]$$

Let $\bar{a}, \bar{y}$ be the sample means of the old data, and define $W_a = [\langle a_j, a_k \rangle_n]_{K \times K}$ and $W_{ay} = [\langle a_j, y \rangle_n]_{K \times 1}$. The summary of online functional linear regression is presented in Algorithm 6.1.

Algorithm 6.1. Online functional linear regression

(1) Initialize. Input: $\bar{a}$, $\bar{y}$, $n$, $W_a$, $W_{ay}$, $x(t)^*$, $y^*$.
(2) Expand new samples on basis functions. Calculate: $x(t)^* \approx \sum_{k=1}^{K} a^*_k \phi_k(t)$.
(3) Online updating with ICMD1:
    $\bar{\tilde{a}} = \frac{1}{n+m} (n\bar{a} + \sum_{i=1}^{m} a^*_i)$,
    $\bar{\tilde{y}} = \frac{1}{n+m} (n\bar{y} + \sum_{i=1}^{m} y^*_i)$,
    $\tilde{W}_{ay} = W_{ay} + [\langle a^*_j, y^* \rangle_m]_{K \times 1}$,
    $\tilde{W}_a = W_a + [\langle a^*_j, a^*_k \rangle_m]_{K \times K}$,
    $\tilde{S}_{ay} = \tilde{W}_{ay} - (n+m)[\langle \bar{\tilde{a}}_j, \bar{\tilde{y}} \rangle]_{K \times 1}$,
    $\tilde{S}_a = \tilde{W}_a - (n+m)[\langle \bar{\tilde{a}}_j, \bar{\tilde{a}}_k \rangle]_{K \times K}$,
    $\hat{\tilde{b}} = (\tilde{S}_a)^{-1} \tilde{S}_{ay}$.
(4) Rebuild: $\hat{\tilde{\beta}}(t) \approx \sum_{k=1}^{K} \hat{\tilde{b}}_k \phi_k(t)$.
(5) Update: $n \leftarrow n + m$, $\bar{a} \leftarrow \bar{\tilde{a}}$, $\bar{y} \leftarrow \bar{\tilde{y}}$, $W_{ay} \leftarrow \tilde{W}_{ay}$, $W_a \leftarrow \tilde{W}_a$.

Compositional PCA

This part gives the online compositional PCA method with the ICMD2 algorithm. Let $C_{p \times n}$ be a compositional data matrix, where $c_{ij}$ $(i = 1, 2, \ldots, p;\ j = 1, 2, \ldots, n)$ is d-part compositional data. For each $c_{ij}$, use the ilr transformation in [6.12] to get matrix $X$, where $x_{ij} \in \mathbb{R}^{d-1}$ is of the basic data type.


Denote $D_{k \times k}$ as the diagonal matrix of the $k$ largest eigenvalues of $S$ and $U_{p \times k}$ as the corresponding eigenvectors, where $S = n\,\mathrm{Cov}_s(X)$. With the newly arriving data $C^*_{p \times m}$, get the transformed matrix $X^*$, and generate the $p \times (m+1)$ basic data matrix

$$\hat{X}^* = \left[ (X^* \ominus \bar{X}^*),\ \sqrt{\frac{nm}{n+m}} \otimes (\bar{X}^* \ominus \bar{X}) \right]_{p \times (m+1)}.$$

We expand $\hat{X}^*$ to $Z^* = [\hat{X}^{*1}, \ldots, \hat{X}^{*s}]$ (here $s = d - 1$), and calculate $Z'$ using $Z' = \mathrm{orth}(Z^* - U U^T Z^*)$, which is the component of $Z^*$ orthogonal to $U$. Then, calculate the matrix decomposition online with the ICMD2 algorithm by [6.16]–[6.18], and get the updated eigenvalues $\tilde{D}$ and eigenvectors $\tilde{U}$. Let $\tilde{u}_k$ be the kth updated eigenvector; then the kth updated principal component can be derived with $\tilde{F}_k = \sum_{j=1}^{p} \tilde{u}_{kj} \tilde{x}_j$. With the inverse-ilr transformation in equation [6.13], the compositional principal component can be rebuilt. The summary of online compositional PCA is shown in Algorithm 6.2.

Algorithm 6.2. Online compositional PCA

(1) Initialize. Input: $\bar{X}$, $n$, $U$, $D$, $C^*$.
(2) ilr transform: $X^* = \mathrm{ilr}(C^*)$.
(3) Online updating method:
    Calculate: $\bar{X}^* = \frac{1}{m} \otimes \bigoplus_{i=1}^{m} x^*_i$,
    Generate: $\hat{X}^* = [(X^* \ominus \bar{X}^*),\ \sqrt{\frac{nm}{n+m}} \otimes (\bar{X}^* \ominus \bar{X})]_{p \times (m+1)}$,
    Expand: $Z^* = [\hat{X}^{*1}, \ldots, \hat{X}^{*(d-1)}]$, $Z' = \mathrm{orth}(Z^* - U U^T Z^*)$,
    Compute: $\tilde{Q} = \begin{bmatrix} D + U^T Z^* Z^{*T} U & U^T Z^* Z^{*T} Z' \\ Z'^T Z^* Z^{*T} U & Z'^T Z^* Z^{*T} Z' \end{bmatrix}$,
    Decompose: $\tilde{Q} = \tilde{V} \tilde{D} \tilde{V}^T$,
    Get: $\tilde{U} = [U, Z'] \tilde{V}$.
(4) Rebuild the kth updated compositional principal component:
    Calculate: $\tilde{F}_k = \sum_{j=1}^{p} \tilde{u}_{kj} \tilde{x}_j$,
    Transform: $\tilde{F}'_k = \text{inverse-ilr}(\tilde{F}_k)$.
(5) Update: $\bar{X} \leftarrow \frac{1}{n+m} \otimes (n \otimes \bar{X} \oplus m \otimes \bar{X}^*)$ (computed with the previous $n$), $n \leftarrow n + m$, $U \leftarrow \tilde{U}$, $D \leftarrow \tilde{D}$.
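Steps (2) and (4) of Algorithm 6.2 rely on the transformations [6.12] and [6.13]. A minimal Python sketch of both is given below, assuming a single composition stored as a 1-D NumPy array with strictly positive parts; the function names `ilr` and `inverse_ilr` are chosen for this illustration, and the shift by the maximum before exponentiating is an added numerical safeguard that does not change the result.

```python
import numpy as np

def ilr(x):
    """ilr coordinates [6.12] of a composition x with D strictly positive parts."""
    log_x = np.log(np.asarray(x, dtype=float))
    D = log_x.size
    k = np.arange(1, D)                       # k = 1, ..., D - 1
    # log of the geometric mean of the first k parts
    log_gmean = np.cumsum(log_x)[:-1] / k
    return np.sqrt(k / (k + 1)) * (log_gmean - log_x[1:])

def inverse_ilr(z):
    """Composition recovered from ilr coordinates z, following [6.13]."""
    z = np.asarray(z, dtype=float)
    D = z.size + 1
    k = np.arange(1, D)
    scaled = z / np.sqrt(k * (k + 1))
    # tail[u-1] = sum_{k=u}^{D-1} z_k / sqrt(k(k+1)); the z_D term is zero
    tail = np.append(np.cumsum(scaled[::-1])[::-1], 0.0)
    # second term: sqrt((u-1)/u) * z_{u-1}, with z_0 = 0
    u = np.arange(1, D + 1)
    prev = np.concatenate(([0.0], z))
    y = tail - np.sqrt((u - 1) / u) * prev
    w = np.exp(y - y.max())                   # shift for numerical stability
    return w / w.sum()

x = np.array([0.2, 0.3, 0.5])                 # a three-part composition
z = ilr(x)                                    # two ilr coordinates
x_back = inverse_ilr(z)                       # should recover x
```

Applying `ilr` to every cell of a compositional data matrix yields cells in $\mathbb{R}^{d-1}$, which can then be handled with the basic data operations of section 6.2.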


6.4. Simulation studies

In this section, we conduct two simulations, online functional regression and online compositional PCA, to show the efficiency of the incremental calculation framework. In the first simulation, we vary the sample size and compare the computing time and estimation accuracy of the online algorithm based on ICMD1 with those of the classical regression method for functional data. To show the behavior under different dimensions, in the second simulation, we vary $p$ from 5 to 500 for compositional PCA, and compare the simulation results of online PCA based on ICMD1, ICMD2 and the classical PCA method. The simulation environment is an Intel i7 2.50 GHz CPU with 16 GB RAM, using the Python programming language.

6.4.1. Functional linear regression

The data sets were drawn according to the functional linear regression model in [KAT 12]:

$$y = \int_0^1 \beta(t) x(t)\,dt + \varepsilon,$$

$$\beta(t) = \sum_{j=1}^{50} \beta_j \phi_j(t), \quad \beta_1 = 0.3, \quad \beta_j = 4(-1)^{j+1} j^{-2},\ j \ge 2, \quad \phi_j(t) = \sqrt{2} \cos(j\pi t),$$

$$x(t) = \sum_{j=1}^{50} \gamma_j z_j \phi_j(t), \quad \gamma_j = (-1)^{j+1} j^{-1},$$

where $z_j \sim U[-\sqrt{3}, \sqrt{3}]$ and $\varepsilon \sim N(0, 0.09)$.

Online linear functional regression based on ICMD1 and classical non-incremental functional regression were conducted in this simulation for comparison. Third-order centralized standard orthogonal B-spline functions were used as the basis functions, whose number was set to 5. We drew n = 100, 500, 1,000, 5,000 and 10,000 samples by random sampling and ran 100 rounds under each sample size. The number of newly added samples was set to 1% of the data set size (i.e. m = 1, 5, 10, 50, 100).

Figure 6.2 shows the calculation time of the two methods. The X-axis represents the sample sizes, ranging from 100 to 10,000, and the Y-axis the computing time. Two lines are drawn in Figure 6.2: the red solid line for incremental calculation and the other one for non-incremental calculation. The difference in computing time between the two methods increases gradually with the sample size. As the incremental calculation is not affected by the original sample size, its advantage becomes obvious when the data set is big enough.
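As an illustration of the data-generating model of this section, the following Python sketch draws the scores $z_j$ and the responses $y$. It rests on two assumptions that should be checked against [KAT 12]: $N(0, 0.09)$ is read as a variance of 0.09 (standard deviation 0.3), and the integral is evaluated in closed form using the orthonormality of $\sqrt{2}\cos(j\pi t)$ on $[0, 1]$. The function names are hypothetical.

```python
import numpy as np

J = 50
j = np.arange(1, J + 1)
beta = 4.0 * (-1.0) ** (j + 1) / j**2     # beta_j for j >= 2
beta[0] = 0.3                             # beta_1
gamma = (-1.0) ** (j + 1) / j

def simulate(n, rng, noise_sd=0.3):
    """Draw n pairs (z, y); z holds the random scores z_j of each curve."""
    z = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, J))
    # the basis is orthonormal on [0, 1], so the integral reduces to a sum
    signal = (z * gamma) @ beta
    return z, signal + rng.normal(0.0, noise_sd, size=n)

def basis(t):
    """Evaluate phi_1, ..., phi_J on a grid t in [0, 1]; shape (J, len(t))."""
    return np.sqrt(2) * np.cos(np.outer(j, np.pi * t))

def curve(z_row, t):
    """Evaluate x(t) for one row of scores on a grid t in [0, 1]."""
    return (gamma * z_row) @ basis(t)

rng = np.random.default_rng(0)
z, y = simulate(1000, rng)
t = np.linspace(0.0, 1.0, 201)
x0 = curve(z[0], t)                       # first simulated curve on the grid
```

The sampled curves `curve(z[i], t)` would then be expanded on the chosen B-spline basis before the online regression is applied; that expansion step is not shown here.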


Figure 6.2. Computing time for functional data. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

Figure 6.3 shows the true $\beta(t)$ and the estimated $\hat{\beta}(t)$ for functional data analysis when n = 1,000. The performance of incremental functional linear regression based on ICMD1 is measured by the mean squared error (MSE) of $\hat{\beta}(t)$, which is defined as $MSE(\hat{\beta}(t)) = \frac{1}{n_1} \sum_{i=1}^{n_1} (\hat{\beta}(t_i) - \beta(t_i))^2$, where $t_i$ $(1 \le i \le n_1)$ are equispaced grid points. Table 6.1 shows the MSE of $\hat{\beta}(t)$. As ICMD1 is a precise decomposition algorithm, the online method gets exactly the same results as the classical non-incremental method. Hence, we show the results of the two algorithms in one row in Table 6.1. The estimated $\hat{\beta}(t)$ of the functional linear regression model demonstrates the effectiveness of this incremental calculation framework.

Figure 6.3. $\hat{\beta}(t)$ and $\beta(t)$ for functional data. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

n                 100       500       1000      5000      10000
Inc. & Non-inc.   0.0588    0.0226    0.0194    0.0138    0.0133
(std.)            (0.0661)  (0.0197)  (0.0131)  (0.0037)  (0.0004)

Table 6.1. Estimated MSE of $\hat{\beta}(t)$


6.4.2. Compositional PCA In this section, we conduct the simulation of the PCA method for three-part compositional data. The PCA model was designed according to Cardot et al. [CAR 15]. First, we created a Gaussian random vector X ∈ Rd (d = 2p) with zero mean and covariance matrix Cov(X) = (min(i, j)/d)1≤i,j≤d , where i, j were the index of rows and columns. Then, the three-part compositional data set Cp×n was generated by inverse-ilr transformation, e.g. c11 = (c111 , c211 , c311 ) = inverse-ilr(x11 , x21 ). Note that when d was large enough, the eigenvalues of the scaled covariance Cov(X)/d decreased rapidly to zero so that most of the variability of X was concentrated in a low-dimensional subspace of Rd . In each simulation, iid realizations of X were generated with n = 500 and p ∈ {10, 50, 100, 300, 500}. The component number was set as K = 2. Initialized by batch PCA, three methods were conducted in this simulation: online PCA based on ICMD1, online PCA based on ICMD2, and classical PCA method. Running 100 rounds in each simulation, the computation time (in seconds, for one iteration) is shown in Figure 6.4.

Figure 6.4. Computing time of 3 compositional PCA methods. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

The X-axis represents the sample dimensions, ranging from 5 to 500, and the Y-axis the computing time. Three lines are drawn in Figure 6.4: the red solid line for classical PCA, the blue dashed line for PCA with ICMD1, and the green dot-dash line for PCA with ICMD2. We can see that with data dimension p < 100, there is no significant difference in computing time between the three methods. But when p grows, the modeling time of classical PCA rises rapidly and becomes much longer than that of the two incremental methods. We give the detailed mean and standard deviation of the computing time in Table 6.2. As compositional PCA with the ICMD2 method only decomposes a matrix of dimension k + (d − 1)(m + 1), it is more efficient than ICMD1 when p > 100.


p         5        10       100      200      500
PCA       0.001    0.002    0.018    0.103    0.611
(std.)    (0.002)  (0)      (0.002)  (0.007)  (0.027)
ICMD1     0.001    0.001    0.008    0.022    0.105
(std.)    (0)      (0)      (0.001)  (0.002)  (0.004)
ICMD2     0.001    0.001    0.006    0.011    0.028
(std.)    (0)      (0)      (0.001)  (0.001)  (0.003)

Table 6.2. Computing time of three compositional PCA methods

We use the correlation between the estimated unit eigenvector $u_k$ and the one computed by the classical PCA method, $u'_k$, also normalized, to measure the eigenvector estimation error of online PCA. The correlation is represented by their inner product $\Gamma = \langle u_k, u'_k \rangle$ [WEN 03]. Since $\| u_k - u'_k \|^2 = 2(1 - \langle u_k, u'_k \rangle)$, the closer $\Gamma$ is to 1, the better the estimate. Here, we set the newly added sample size as 100. The first two eigenvectors explain about 70%–45% of the variance of the data set (depending on the dimension). The estimated $\Gamma$ for the first two eigenvectors based on the online methods is shown in Table 6.3.

              p = 5    p = 10   p = 100  p = 200  p = 500
ICMD1: E1     1.000    1.000    1.000    1.000    1.000
(std)         (0.000)  (0.000)  (0.000)  (0.000)  (0.000)
ICMD1: E2     1.000    1.000    1.000    1.000    1.000
(std)         (0.000)  (0.000)  (0.000)  (0.000)  (0.000)
ICMD2: E1     1.000    1.000    1.000    0.999    0.999
(std)         (0.000)  (0.000)  (0.000)  (0.001)  (0.001)
ICMD2: E2     0.994    0.976    0.978    0.929    0.896
(std)         (0.012)  (0.024)  (0.021)  (0.146)  (0.211)

Table 6.3. The estimated Γ for the first two eigenvectors

As online PCA with ICMD1 gives exactly the same eigenvectors as classical PCA, the Γ of ICMD1 equals 1 in Table 6.3. For PCA with the ICMD2 method, the first estimated eigenvector converges to that of classical PCA (as Γ ≈ 1), while Γ for the second estimated eigenvector ranges from about 0.99 down to 0.89, decreasing as p grows. Therefore, when p is low, we can use the online methods based on ICMD1 to achieve an exact result at high speed. When p grows high, we prefer the online models built on ICMD2 for more efficient computation.


6.5. Conclusion

In this chapter, we propose an incremental calculation framework for complex data, which is applicable to many frequently used statistical models built on a covariance matrix. First, we transform the complex data into a basic data type. Then, with the operations on basic data introduced in this chapter, we propose two online covariance matrix updating methods in the basic data space. The first method, ICMD1, gives exact estimation results at high speed when the data dimension is not high, while the second decomposition method, ICMD2, adapts to the high-dimensional case. These two online methods can greatly reduce the computing time as well as the storage space when the previous sample size is much larger than the size of the new data. We use functional linear regression and compositional PCA as examples to show the application of the incremental calculation framework. The simulation results demonstrate both the efficiency and the effectiveness of this framework.

6.6. Acknowledgment

This research is supported partly by the National Natural Science Foundation of China (Grant No. 71420107025) and Beijing Key Laboratory of Emergency Support Simulation Technologies for City Operations.

6.7. References

[AIT 82] AITCHISON J., “The statistical analysis of compositional data”, Journal of the Royal Statistical Society. Series B (Methodological), vol. 44, no. 2, pp. 139–177, 1982.
[BRA 02] BRAND M., “Incremental singular value decomposition of uncertain data with missing values”, European Conference on Computer Vision, Springer, Berlin, Germany, pp. 707–720, 2002.
[BRO 13] BRODERICK T., BOYD N., WIBISONO A. et al., “Streaming variational Bayes”, Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 2, pp. 1727–1735, 2013.
[CAR 15] CARDOT H., DEGRAS D., “Online Principal Component Analysis in High Dimension: Which Algorithm to Choose?”, arXiv preprint arXiv:1511.03688, 2015.
[CHE 13] CHENG J., BIAN W., TAO D., “Locally regularized sliced inverse regression based 3D hand gesture recognition on a dance robot”, Information Sciences, vol. 221, pp. 274–283, 2013.
[CHU 07] CHU C., KIM S.K., LIN Y.-A. et al., “Map-reduce for machine learning on multicore”, Advances in Neural Information Processing Systems, vol. 19, p. 281, 2007.
[DID 88] DIDAY E., “The symbolic approach in clustering and related methods of data analysis”, Proceedings of IFCS, Classification and Related Methods of Data Analysis, vol. 1988, pp. 673–384, 1988.


[DUN 71] DUNFORD N., SCHWARTZ J.T., BADE W.G. et al., Linear Operators, Wiley-Interscience, New York, 1971.
[EGO 03] EGOZCUE J.J.E., PAWLOWSKY-GLAHN V., MATEU-FIGUERAS G.O.R. et al., “Isometric logratio transformations for compositional data analysis”, Mathematical Geology, vol. 35, no. 3, pp. 279–300, 2003.
[FER 06] FERRATY F.E.D.E, VIEU P., Nonparametric Functional Data Analysis: Theory and Practice, Springer Science & Business Media, Berlin, 2006.
[GIL 74] GILL P.E., GOLUB G.H., MURRAY W. et al., “Methods for modifying matrix factorizations”, Mathematics of Computation, vol. 28, no. 126, pp. 505–535, 1974.
[GIR 00] GIRAUD-CARRIER C., “A note on the utility of incremental learning”, AI Communications, vol. 13, no. 4, pp. 215–223, 2000.
[GU 94] GU M., EISENSTAT S.C., A Stable and Fast Algorithm for Updating the Singular Value Decomposition, Yale University, New Haven, CT, 1994.
[HUI 14] HUIWEN W., YUAN W., LELE H. et al., “Incremental algorithm of multiple linear regression model”, Journal of Beijing University of Aeronautics and Astronautics, vol. 40, no. 11, pp. 1487–1491, 2014.
[KAT 12] KATO K., “Estimation in functional linear quantile regression”, The Annals of Statistics, vol. 40, no. 6, pp. 3108–3136, 2012.
[LEV 00] LEVEY A., LINDENBAUM M., “Sequential Karhunen–Loeve basis extraction and its application to images”, IEEE Transactions on Image Processing, vol. 9, no. 8, pp. 1371–1374, 2000.
[LI 00] LI W., YUE H.H., VALLE-CERVANTES S. et al., “Recursive PCA for adaptive process monitoring”, Journal of Process Control, vol. 10, no. 5, pp. 471–486, 2000.
[LU 11] LU H., PLATANIOTIS K.N., VENETSANOPOULOS A.N., “A survey of multilinear subspace learning for tensor data”, Pattern Recognition, vol. 44, no. 7, pp. 1540–1551, 2011.
[LUT 14] LUTS J., BRODERICK T., WAND M.P., “Real-time semiparametric regression”, Journal of Computational & Graphical Statistics, vol. 23, no. 3, pp. 589–615, 2014.
[LUT 15] LUTS J., “Real-time semiparametric regression for distributed data sets”, IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 2, pp. 545–557, 2015.
[MAI 09] MAIRAL J., BACH F., PONCE J. et al., “Online dictionary learning for sparse coding”, Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, Quebec, pp. 689–696, 2009.
[MAI 10] MAIRAL J., BACH F., PONCE J. et al., “Online learning for matrix factorization and sparse coding”, Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.


[MES 13] MESTYÁN M., TAHA Y., “Early prediction of movie box office success based on Wikipedia activity big data”, PLoS ONE, vol. 8, no. 8, p. 71226, 2013.
[NAG 07] NAGABHUSHAN P., KUMAR R.P., “Histogram PCA”, Advances in Neural Networks–ISNN 2007, pp. 1012–1021, Springer, 2007.
[PAT 09] PATEN B., HERRERO J., BEAL K. et al., “Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment”, Bioinformatics, vol. 25, no. 3, pp. 295–301, 2009.
[RAM 91] RAMSAY J.O., DALZELL C.J., “Some tools for functional data analysis”, Journal of the Royal Statistical Society. Series B (Methodological), vol. 53, pp. 539–572, 1991.
[RAM 02] RAMSAY J.O., SILVERMAN B.W., Applied Functional Data Analysis: Methods and Case Studies, vol. 77, Springer, Berlin, 2002.
[RAM 06] RAMSAY J.O., Functional Data Analysis, Wiley Online Library, Hoboken, NJ, 2006.
[ROS 08] ROSS D.A., LIM J., LIN R.-S. et al., “Incremental learning for robust visual tracking”, International Journal of Computer Vision, vol. 77, nos 1–3, pp. 125–141, 2008.
[SCH 07] SCHUMAKER L., Spline Functions: Basic Theory, Cambridge University Press, 2007.
[TAO 08] TAO D., LI X., WU X. et al., “Tensor rank one discriminant analysis – convergent method for discriminative multilinear subspace selection”, Neurocomputing, vol. 71, no. 10, pp. 1866–1882, 2008.
[TAO 09] TAO D., LI X., WU X. et al., “Geometric mean for subspace selection”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 260–274, 2009.
[WAN 15] WANG J.-L., CHIOU J.-M., MUELLER H.-G., “Review of functional data analysis”, arXiv preprint arXiv:1507.05135, 2015.
[WEN 03] WENG J., ZHANG Y., HWANG W.-S., “Candid covariance-free incremental principal component analysis”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, pp. 1034–1040, 2003.
[XIE 13] XIE Y., WILLETT R., “Online logistic regression on manifolds”, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3367–3371, 2013.
[YAN 10] YANG H., XU Z., KING I. et al., Online Learning for Group Lasso, International Conference on Machine Learning, Omnipress, 2010.
[YAO 06] YAO F., LEE T., “Penalized spline models for functional principal component analysis”, Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 3–25, 2006.

Part 3 Network Data

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

7 Recommender Systems and Attributed Networks

Chapter written by Françoise FOGELMAN-SOULIÉ, Lanxiang MEI, Jianyu ZHANG, Yiming LI, Wen GE, Yinglan LI and Qiaofei YE.

7.1. Introduction

With the wide availability of big data, algorithms and software techniques face new challenges [LAN 01]. The volume of data is very large; on the web, we may have millions of observations. Algorithms need to scale to such volumes, and implementations may need to be further optimized, using either multi-threading/multi-processing or distributed big data platforms such as Hadoop/Spark. Any algorithm that does not scale, e.g. one with more than linear complexity, would very rapidly become useless. On the other hand, with the increase in big data sources, we can merge many files, thus obtaining a large number of attributes for each observation (potentially hundreds of thousands), i.e. large variety. This large variety results in very high-dimensional spaces, where distances, similarities and nearest neighbors become meaningless [AGG 01], thus posing significant problems for algorithms using nearest-neighbor techniques. In addition to these issues, data are increasingly linked, i.e. very often there exist relationships between observations. This is the case when data come from social networks (SNs in the rest of this chapter), such as social sites (e.g. Facebook and Twitter), information networks (e.g. citation networks and terrorist networks), or even biological networks (e.g. protein–protein interaction networks). The independent and identically distributed (i.i.d.) assumption, at the basis of data mining techniques, is then no longer valid, which is of course a big challenge for analysis where i.i.d. is the standard hypothesis.

One of the most successful applications of big data is recommendation. Online experience is nowadays systematically enhanced through Recommender Systems (RS in the rest of this chapter), bringing to users recommendations for movies [BEN 07], [EKS 11], products [LIN 03], songs [AIO 13], friends, banners or content on social sites [NGO 13], tags [MAR 11], or travels [RIC 02]. Better exploitation of big data in recommender systems thus has the potential for application across many domains. Volume, variety and non-i.i.d.-ness are big issues for RSs nowadays, and very few existing systems are capable of efficiently exploiting all of the available data and producing sound recommendations. Improvements on such issues would bring significant economic benefits and increase users’ satisfaction.

In this chapter, we show how volume, variety and linked data issues can be handled in the case of recommender systems. In particular, we describe social network analysis (SNA in the rest of this chapter) tools and show how to use them for RSs. The chapter is organized as follows: in section 7.2, we introduce recommender systems and the major algorithms to produce recommendations; in section 7.3, we present social networks and their properties and discuss the particular case of bipartite networks; in section 7.4, we present new models for recommendation using social networks and discuss the problem of using attributes for enhancing model quality; in section 7.5, we present experiments and discuss our results. We conclude with perspectives in section 7.6.

7.2. Recommender systems

Recommender systems are widely deployed in online sites [DES 04], [NIN 11], [PAR 15], [RIC 11]. They present to users a ranked list of items that are potentially interesting to them, hopefully increasing users’ interest and views or purchases on the site. There exist two main families of techniques to build Recommender Systems, depending on the available data: supervised model-based and (unsupervised) neighborhood-based models. Model-based techniques are regression models for rating prediction.
They essentially use explicit users’ behavior in the form of ratings; the most common technique produces a latent space through matrix factorization and predicts a rating as the dot product of the user’s and the item’s representations in this latent space; the highest predicted ratings provide the list of items to recommend. Most academic work in this area was produced at the time of the Netflix prize challenge [KOR 11], [TÖS 09]. More recently, however, work has focused on neighborhood-based collaborative filtering models, which are unsupervised learning-to-rank techniques. These Top-N recommenders use implicit users’ feedback (such as views, purchases, clicks, and downloads extracted from site logs) and identify neighborhoods of similar users or items. Top-N recommenders are the most widely used systems on online sites, because ratings are not always available and, when available, are much scarcer than implicit feedback. Item-based collaborative filtering (using neighborhoods of similar items, CF-IB in the rest of the chapter) was shown [DES 04] to be better than the user-based version (neighborhoods of similar users, CF-UB in the rest of the chapter); item-based neighborhood-based collaborative filtering is the most common choice for real-world Top-N recommenders. In the rest of this chapter, we will mostly focus on neighborhood-based CF-IB Top-N RSs, unless stated otherwise.

The latent-space approach has been shown to perform best for rating prediction [KOR 11], but neighborhood-based techniques are best for Top-N recommenders with implicit data [AIO 13], [DES 04], [CHR 16]. While latent-space methods (and more generally model-based techniques) produce a global model, neighborhood-based CF models are essentially local models, aggregating feedback behavior as a linear combination of feedback in the neighborhood of an item (or a user). The coefficients of this linear combination are item similarities in the classical CF-IB model [DES 04] and can also be learned through optimization (see, for example, the SLIM model [NIN 11]).

Recommender systems in real-world applications involve high-dimensional datasets: there could be millions of users and hundreds of thousands of items. It should thus be expected that recommendation techniques, particularly neighborhood-based collaborative filtering, suffer from the curse of dimensionality. It would thus be natural to try to reduce dimensions, because this would help reduce computation and storage costs, while hopefully increasing performance [CAM 03]. Dimensionality reduction techniques are an intrinsic part of embedding and representation learning, which is becoming one of the most active domains in machine learning, especially since the rebirth of neural networks and deep learning. In this chapter, we will not discuss this issue further but, when presenting some of our results, we will point to it as a frequent origin of the difficulties we found.

All notations used in this chapter are listed in Table 7.1.

7.2.1. Data used

RSs generate ranked lists of recommendations on the basis of available data sources.
In this chapter, we will exploit behavioral data observed from users’ feedback on items (clicks, insertion-into-caddy, purchases, downloads, etc.); in the following, we use the term consumption to designate any of these forms of implicit feedback. In the case of explicit feedback, the behavioral data are ratings provided by users on items. Content data may exist on users (although rarely in publicly available datasets, for privacy reasons) and on items (available in various public data sources) in the form of users’ or items’ attributes. This is why we will mostly focus, in this chapter, on using items’ attributes.

Users’ feedback on items is represented by an interaction matrix R which encodes the actions of users on items. Symbols a, u, v will designate users and i, j items. If there are n users and m items, matrix R is of dimension n × m and, in the case of implicit feedback, entry $r_{ai}$ in R is 1 if user a has consumed item i, 0 otherwise ($r_{ai}$ could also indicate the intensity of the implicit feedback [KOR 11] or the explicit rating, but we will not use this possibility in this chapter). Obviously, for any user a, there are many more 0 entries (not consumed) than 1 entries (consumed), i.e. the density of matrix R is very small. We will use column vectors to represent users and items: columns of matrix R, i.e. items’ representations, are denoted $r_{\cdot j}$ (for item j), and users’ representations are denoted $r_{a\cdot}^t$ (transposed rows of R). I(a) denotes the set of items a consumed and U(i) the set of users who consumed i.

Notation | Meaning
$r_{ui}$ | Element of matrix R: interaction of user u in set of users U with item i in set of items I ($r_{ui} = 1$ if u consumed item i, 0 otherwise)
n (respectively, m) | Total number of users or lines in R (respectively, items or columns)
I(a) | Set of items consumed by a
U(i) | Set of users who consumed i
$r_{a\cdot}^t$ | Column representation for user a (transposed line of matrix R)
$r_{\cdot i}$ | Column representation for item i (column of matrix R)
sim(a, u), sim(i, j) | Similarity of users a and u, of items i and j
Asym.Cos(i, j) | Asymmetric cosine similarity of items i and j
simbeh(i, j) | Behavior similarity computed on matrix R
$A_I$, $a^I_{if}$; $A_U$, $a^U_{af}$ | Items’ attributes matrix $A_I$ and element of matrix (value of f-th attribute of item i); users’ attributes matrix and element of $A_U$
simatt(i, j) | Attribute similarity computed on matrix A
$F_I$, $F_U$ | Total number of items’ (respectively, users’) attributes or columns in $A_I$, $A_U$
K(a), N(i) | Neighborhood of user a, of item i
Top-N(i) | Top-N neighborhood of item i
Score(a, i) | Score of user a for item i
$(i_{a1}, i_{a2}, \ldots, i_{ak})$ | Ranked list (of length k) of items recommended to user a
G simbeh xx, G simatt xx | Similarity graph, obtained when using similarity function xx (either behavior or attribute) between two items (or users)
Items’, users’ graphs | Projected uni-mode graphs from transaction bipartite graph
p | Number of cross-validation samples (train, test)

Table 7.1. Notations


7.2.2. Model-based collaborative filtering

One of the most efficient and most frequently used methods for rating prediction is model-based collaborative filtering based on matrix factorization [FAN 11], [KOR 11], [ZHA 06], in which users and items are represented in the same low-dimensional latent factor space: in this space, users’ representations are close to the representations of the items they rated [KOR 09]. For Singular Value Decomposition (SVD), the new representations of users $\hat{V}$ and items $\hat{M}$ minimize the Frobenius norm $\|R - V^t \cdot M\|_F^2$, which is classically solved by minimizing a regularized squared error [TÖS 09]:

$$(\hat{V}, \hat{M}) = \arg\min_{V,M} \sum_{u,i} \left[ \left( r_{ui} - V_u^t \cdot M_i \right)^2 + \lambda_1 \left\| V_u^t \right\|^2 + \lambda_2 \left\| M_i \right\|^2 \right] \qquad [7.1]$$

For Non-Negative Matrix Factorization (NMF) [ZHA 06], the minimization is subject to a non-negativity constraint on matrices V and M. The predicted rating is then computed as the dot product of the user’s and item’s representations in latent space, and predicted ratings are then ranked to produce the list of the k largest predicted ratings:

$$\hat{r}_{ai} = \hat{V}_{a\cdot} \cdot \hat{M}_{\cdot i} \qquad [7.2]$$

$$i_{a1}, i_{a2}, \ldots, i_{am} \ \text{such that} \ \hat{r}_{a i_{a1}} \ge \hat{r}_{a i_{a2}} \ge \ldots \ge \hat{r}_{a i_{ak}} \ge \ldots \ge \hat{r}_{a i_{am}} \qquad [7.3]$$
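As an illustration of equations [7.1]–[7.3], the following minimal sketch evaluates the regularized objective for given factors and ranks items by predicted rating. It is only a sketch under stated assumptions: the function names, the toy factors, and the choice to sum over every entry of R (rather than over observed ratings only) are hypothetical and are not taken from the chapter.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def regularized_loss(R, V, M, lam1, lam2):
    """Objective of equation [7.1], summed over all (u, i) entries of R.

    V[u] and M[i] are the latent vectors of user u and item i
    (assumption: every entry of R is included in the sum).
    """
    total = 0.0
    for u, row in enumerate(R):
        for i, r_ui in enumerate(row):
            err = r_ui - dot(V[u], M[i])
            total += err ** 2 + lam1 * dot(V[u], V[u]) + lam2 * dot(M[i], M[i])
    return total


def top_k_items(V, M, a, k):
    """Rank items by predicted rating for user a (equations [7.2]-[7.3])."""
    predicted = [dot(V[a], M_i) for M_i in M]
    ranked = sorted(range(len(M)), key=lambda i: predicted[i], reverse=True)
    return ranked[:k]


# Hypothetical toy factors: 2 users, 3 items, 2 latent dimensions.
R = [[1, 0, 1],
     [0, 1, 1]]
V = [[1.0, 0.0], [0.0, 1.0]]
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

loss = regularized_loss(R, V, M, lam1=0.1, lam2=0.1)
best_for_user = top_k_items([[0.9, 0.2]], M, a=0, k=2)
```

With these toy factors the reconstruction is exact, so the loss reduces to the regularization terms alone; in practice V and M would be fitted by an optimizer such as stochastic gradient descent.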

7.2.3. Neighborhood-based collaborative filtering

Neighborhood-based collaborative filtering techniques [BER 15], [CRE 10], [DES 04] use three elements: a similarity function between items for CF-IB (or users in CF-UB); a neighborhood of items (or users) built from this similarity; and a score function that aggregates the active user’s tastes for items in the neighborhood of those consumed before (item-based CF; user-based is similar [BER 15]).

1) Similarity between items is computed between items’ representation vectors $r_{\cdot j}$ in matrix R; we call it behavioral similarity in the rest of this chapter. Classical similarity measures [KOR 11] include cosine, Pearson Correlation Coefficient (PCC), asymmetric cosine [AIO 13], support and confidence (in a graph-based formalism [BER 15] described in section 7.4.1), asymmetric confidence [DES 04], or dot product as used in matrix factorization techniques [KOR 11]. Asymmetric cosine is defined for some parameter γ by:

$$\text{Asym.Cos}(i, j) = \frac{\sum_{u=1}^{n} r_{ui} \times r_{uj}}{\left[ \sum_{u=1}^{n} (r_{ui})^2 \right]^{\gamma} \times \left[ \sum_{u=1}^{n} (r_{uj})^2 \right]^{1-\gamma}} \quad (0 \le \gamma \le 1) \qquad [7.4]$$

Obviously, asymmetric cosine with γ = 0.5 is identical to cosine.


2) Neighborhood: the Top-N neighborhood of an item i is the set N(i) of the N items most similar to i. Performances (measured by the precision and recall indicators described in section 7.5) are well known to depend heavily on N: Figure 7.1 shows the necessary locality in a CF-IB model obtained in our experiments for dataset MovieLens 1M (described in section 7.5.3). Using a global model, with N = m, the number of items, is obviously not optimal.

$$N(i) = \{ j_1, j_2, \ldots, j_N \mid sim(i, j_1) \ge sim(i, j_2) \ge \ldots \ge sim(i, j_N) \ge sim(i, j) \ \forall j \ne j_1, j_2, \ldots, j_N \} \qquad [7.5]$$

Other neighborhoods can be used, such as all items with a similarity larger than a threshold, or all m items (giving a global model, as said earlier). Usually, the Top-N neighborhood is the most efficient choice.

Figure 7.1. Influence of N on performances of a Top-N CF model on MovieLens 1M. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

3) Scoring function: many scoring functions can be used [AIO 13], [BER 15], [KOR 11]. The most common is the weighted average score (denoted WAS in the following), defined as:

$$\text{WAS:} \quad \forall i \notin I(a), \ Score(a, i) = \sum_{j \in N(i)} r_{aj} \times sim(i, j) \qquad [7.6]$$

In equation [7.6], the scoring function is a local linear combination of interactions. Other common choices include the average score AVES or the local weighted average score LWAS$_q$ (large values of q put more focus on the highest similarities; in this case, we can assume N(i) to contain all m items, as in [AIO 13]; LWAS$_1$ is obviously the same as WAS):

$$\text{AVES:} \quad \forall i \notin I(a), \ Score(a, i) = \frac{1}{\text{card}[N(i)]} \sum_{j \in N(i)} r_{aj} \qquad [7.7]$$

$$\text{LWAS}_q\text{:} \quad \forall i \notin I(a), \ Score(a, i) = \sum_{j \in N(i)} r_{aj} \times [sim(i, j)]^q, \quad q \in \mathbb{N}, \ q \ge 1 \qquad [7.8]$$
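The asymmetric cosine of equation [7.4] can be sketched as follows. This is an illustrative sketch only: the function name, the list-of-rows representation of R, the toy data, and the convention of returning 0 for an item nobody consumed are assumptions, not part of the original text.

```python
def asymmetric_cosine(R, i, j, gamma=0.5):
    """Asymmetric cosine between item columns i and j of R (equation [7.4]).

    R is a list of user rows; R[u][i] is 1 if user u consumed item i, else 0.
    """
    dot = sum(row[i] * row[j] for row in R)
    norm_i = sum(row[i] ** 2 for row in R)
    norm_j = sum(row[j] ** 2 for row in R)
    if norm_i == 0 or norm_j == 0:
        # Convention chosen here (assumption): unconsumed items have similarity 0.
        return 0.0
    return dot / (norm_i ** gamma * norm_j ** (1 - gamma))


# Hypothetical toy interaction matrix: 4 users x 3 items.
R = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
]

sym = asymmetric_cosine(R, 0, 1)             # gamma = 0.5: plain cosine
pop_to_rare = asymmetric_cosine(R, 0, 2, 1)  # normalizes by item 0 only
rare_to_pop = asymmetric_cosine(R, 2, 0, 1)  # normalizes by item 2 only
```

With γ = 1 the measure is normalized by the first item only, so it is no longer symmetric: a rarely consumed item can be very similar to a popular one without the converse holding.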

Depending upon the choices of similarity functions, neighborhoods, and scoring functions, different performances can be obtained. Neighborhood-based CF techniques, also called Top-N recommenders, are very simple to implement and perform well. CF-IB was shown to be two orders of magnitude faster than CF-UB, with similar [DES 04] or better [SAR 01] performance. In our experiments, we tested various similarity functions, neighborhoods, and scoring functions, but found that the choices of equations [7.4]–[7.6] were almost always best. We thus mostly report results obtained for these choices. Note that, as was shown in [CRE 10], 0 entries should not be treated as missing for Top-N recommenders (as they are in rating prediction models), and our experiments were run using this choice. The general item-based CF algorithm is described in Table 7.2.

Choose similarity function, neighborhood and scoring function.
For user a and item i, not consumed by a,
- For all items j ≠ i, compute similarities sim(i, j)
- Compute neighborhood N(i) of item i
- Compute Score(a, i)
- Rank items i in decreasing score values:
  $$Score(a, i_{a1}) \ge Score(a, i_{a2}) \ge \cdots \ge Score(a, i_{am}) \qquad [7.9]$$
- Recommend to a the ranked list (of length k) $(i_{a1}, i_{a2}, \ldots, i_{ak})$

Table 7.2. Item-based collaborative filtering
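The steps of Table 7.2 can be sketched in a few lines. This is a minimal, unoptimized illustration under stated assumptions: it uses plain cosine (asymmetric cosine with γ = 0.5), a Top-N neighborhood, and the WAS score of equation [7.6]; the function names and the toy matrix are hypothetical.

```python
def cosine(R, i, j):
    """Cosine similarity between item columns i and j of R."""
    dot = sum(row[i] * row[j] for row in R)
    norm_i = sum(row[i] ** 2 for row in R) ** 0.5
    norm_j = sum(row[j] ** 2 for row in R) ** 0.5
    return dot / (norm_i * norm_j) if norm_i and norm_j else 0.0


def top_n_neighborhood(R, i, n_neighbors):
    """N(i): the n_neighbors items most similar to item i (equation [7.5])."""
    others = [j for j in range(len(R[0])) if j != i]
    others.sort(key=lambda j: cosine(R, i, j), reverse=True)
    return others[:n_neighbors]


def recommend(R, a, n_neighbors, k):
    """Item-based CF: score the items user a has not consumed, return the top k."""
    scores = {}
    for i in range(len(R[0])):
        if R[a][i]:
            continue  # only items not yet consumed by a are scored
        neighborhood = top_n_neighborhood(R, i, n_neighbors)
        # WAS score, equation [7.6]
        scores[i] = sum(R[a][j] * cosine(R, i, j) for j in neighborhood)
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical toy interaction matrix: 4 users x 4 items.
R = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

recommended = recommend(R, a=3, n_neighbors=2, k=2)
```

A production implementation would precompute the item similarity matrix once, typically with sparse matrices, rather than recomputing similarities for each candidate item as this sketch does.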

Note that model-based collaborative filtering techniques can also be used for implicit feedback (the “rating” then takes the single value 1 when the item is consumed): predicted ratings are ranked to generate the top k recommended items [LI 15]. However, various authors found that these techniques underperform the Top-N collaborative filtering methods presented earlier [AIO 13], [BER 15], [CHR 16], [PAR 15], [PRA 11]. This is probably because model-based techniques are global, whereas locality is needed, as illustrated in Figure 7.1.

The scaling behavior of neighborhood-based collaborative filtering techniques is related to the cost of computing the similarity matrix between users (user-based CF-UB) or items (item-based CF-IB). Since there are usually “only” a few tens or hundreds of thousands of items, compared to a few million users, item-based CF is more frequently deployed than user-based CF. Note that neighborhood-based CF can be used efficiently for both explicit and implicit interactions, contrary to most model-based CF techniques.

It should be noted at this point that both CF-IB and CF-UB suffer from the curse of dimensionality. In a typical situation where the number of users n is a few million and the number of items m is a few hundred thousand, we need to compute similarities between item vectors of dimension n or user vectors of dimension m. Clearly, these are high-dimensional spaces where similarities get very close to each other, and some similarity functions might be preferable to others (e.g. [AGG 01] shows that fractional metrics $L_k$, with k < 1, behave better than $L_2$). In addition, many hubs, i.e. nodes with high similarities to many nodes, tend to appear in many lists of Top-N neighbors and thus in recommended lists.

7.2.4. Hybrid models

An important characteristic of all CF techniques is that they only use interaction data: even in situations where lots of data on users or items exist, using such data does not seem to really improve performances (see, for example, the discussion in the Kaggle Million Song Dataset Challenge¹ or [AIO 13]). It has been reported in the literature that, because interaction data already contain a lot of information, it is very hard for attributes to bring much improvement. However, various authors have tried to incorporate additional data into behavioral data and have reported some improvement. For example, to exploit descriptions of items, [FAN 11], [POZ 16] extend matrix factorization techniques, while [HOR 13] integrates a CF technique with two content-based techniques that only use users’/items’ attributes (content-based CB: see section 7.2.4.1). Social recommenders [ZHE 07] use social data, but very few RS techniques use all data. Hybrid RSs [BUR 07] should exploit users’ and items’ attributes and social data, if available, at the same time as interaction data.
Using all the data would seem to be the right way to go, especially in Big Data situations, where large variety means that many different data sources exist. Hence, there is a need for a systematic way of handling both interactions and users’/items’ data at the same time, which is what we propose to do in this chapter with the help of social network tools. Traditionally, hybrid systems incorporate attribute similarities into the behavior similarity function. This can be seen as a problem of aggregating similarities computed from different criteria [ADO 11] and, in this chapter, we focus on blending these two criteria:

– Criterion 1: behavior similarity, for items consumed in a similar fashion to those consumed by user a. This is the similarity used in CF-IB, denoted as simbeh(i, j), or sim(i, j) if there is no possible confusion (as in the earlier section);

– Criterion 2: attribute similarity, for items that have attributes similar to those consumed by user a. This is the similarity used for content-based RSs (CB), denoted as simatt(i, j).

¹ https://www.kaggle.com/c/msdchallenge/forums/t/2365/challenge-retrospective-methodsdifficulties/14396#post14396.

7.2.4.1. Attributes similarities

Suppose the items are described by a vector of attributes (or features), i.e. we have a matrix $A_I$ of dimension $m \times F_I$, where m is the number of items and $F_I$ is the number of items’ attributes (or features): $a^I_{if}$ is the value of the f-th attribute of item i. We can then compute the attribute similarity simatt(i, j) between two items i and j using any of the functions described in section 7.2.3, but using matrix $A_I$ instead of matrix R. In the same fashion, we can define matrix $A_U$ of dimension $n \times F_U$, where n is the number of users and $F_U$ the number of users’ attributes. Obviously, computing similarities is expensive, running in $O(m^2 + mF_I^2)$ [or $O(n^2 + nF_U^2)$ for user-based], which can be reduced by taking advantage of sparsity or by locality-sensitive hashing [LES 14].

Content-based (CB) recommendations can be obtained by a CF-IB model using attribute similarities [PAZ 07]: items are recommended on the basis of their attribute similarities to items already consumed. CB recommenders usually perform poorly, but they can work even in cold-start situations, for new items.

7.2.4.2. Aggregating similarities

First, we need to aggregate p = 2 similarity functions: one on behavioral data, simbeh, and one on attribute data, simatt. Denote by $x = (x_1, \ldots, x_p)$ the p similarities that we need to merge; here, p = 2, but we could actually want to merge more similarities. There exist many aggregating functions, among which the most common are weighted quasi-arithmetic means [BEL 11]. For a given (strictly monotonic) function $g: [0, 1] \to \mathbb{R}$ and a weight vector $w = (w_1, \ldots, w_p)$ such that $\forall j, w_j \ge 0$ and $\sum_{j=1}^{p} w_j = 1$, the weighted quasi-arithmetic mean is defined by:

$$WM_g(x) = g^{-1}\left( \sum_{j=1}^{p} w_j \times g(x_j) \right) \qquad [7.10]$$

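For different choices of function g, we obtain the arithmetic mean WAM, the geometric mean WGM, the harmonic mean WHM, and the power mean WPM$_r$:

$$\text{WAM}(x) = \sum_{j=1}^{p} x_j \times w_j, \qquad \text{WGM}(x) = \prod_{j=1}^{p} (x_j)^{w_j}$$

$$\text{WHM}(x) = \left( \sum_{j=1}^{p} \frac{w_j}{x_j} \right)^{-1}, \qquad \text{WPM}_r(x) = \left( \sum_{j=1}^{p} w_j \times (x_j)^r \right)^{\frac{1}{r}}$$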


For p = 2, we denote $w_1 = \alpha$, $w_2 = 1 - \alpha$ and mostly use WAM$_\alpha$ as follows:

$$\text{WAM}_\alpha(x_1, x_2) = \alpha x_1 + (1 - \alpha) x_2 \qquad [7.11]$$
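The weighted quasi-arithmetic means of equation [7.10] and the blend of equation [7.11] can be illustrated as follows. This is a sketch only: the function names and the example similarity values are hypothetical, and the geometric and harmonic means as written here assume strictly positive similarities.

```python
import math


def weighted_quasi_arithmetic_mean(x, w, g, g_inv):
    """Equation [7.10]: g_inv applied to the weighted sum of g(x_j)."""
    return g_inv(sum(w_j * g(x_j) for x_j, w_j in zip(x, w)))


def wam(x, w):
    """Weighted arithmetic mean: g is the identity."""
    return weighted_quasi_arithmetic_mean(x, w, lambda t: t, lambda t: t)


def wgm(x, w):
    """Weighted geometric mean: g = log (requires every x_j > 0)."""
    return weighted_quasi_arithmetic_mean(x, w, math.log, math.exp)


def whm(x, w):
    """Weighted harmonic mean: g(t) = 1/t (requires every x_j > 0)."""
    return weighted_quasi_arithmetic_mean(x, w, lambda t: 1 / t, lambda t: 1 / t)


def wpm(x, w, r):
    """Weighted power mean of order r: g(t) = t**r."""
    return weighted_quasi_arithmetic_mean(
        x, w, lambda t: t ** r, lambda t: t ** (1 / r)
    )


def blend(sim_beh, sim_att, alpha):
    """Equation [7.11]: WAM with weights (alpha, 1 - alpha)."""
    return wam((sim_beh, sim_att), (alpha, 1 - alpha))


# Hypothetical behavior and attribute similarities for one pair of items.
x = (0.8, 0.2)
w = (0.5, 0.5)
means = {"WAM": wam(x, w), "WGM": wgm(x, w), "WHM": whm(x, w), "WPM2": wpm(x, w, 2)}
merged = blend(0.8, 0.2, alpha=0.7)
```

For the same inputs the harmonic mean is the smallest and the power mean of order 2 the largest, which shows how the choice of g changes the weight given to the larger of the two similarities.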

After merging attribute similarity into behavioral similarity, we then proceed with our CF algorithm as usual by choosing a neighborhood and a scoring function. We would expect increased performances of our new algorithm (denoted by CB-CF in the rest of the chapter). A similar technique has been shown to give interesting results for cold start situations [DIN 15].

7.3. Social networks

7.3.1. Non-independence

None of the models presented so far takes into account possible couplings between users and items, except in the case of social models ([MA 08], [MA 13], [ZHE 07], for example), where recommendations make use of friends’ tastes. However, there certainly exist implicit couplings which should be taken into account: if two users consume the exact same items, there is probably some underlying relationship which should help make better recommendations. In the same fashion, if two items were both consumed by many consumers, there certainly is some (hidden) relationship between them. Ignoring these implicit relationships may result in poor recommendations. Work on non-i.i.d. models is recent [CAO 16], while much work has been done on graph representations to handle the relationships between users and items [AGG 99], [SAR 01], [SHA 17]. We will present here a general framework introduced in [BER 15], which is a graph-based model taking advantage of relationships, and show how it can be extended to use items’ (or users’) attributes.

7.3.2. Definition of a social network

Social networks are ubiquitous: we are all familiar with Facebook, Twitter, and WeChat, which are explicit networks where entities are connected together through links indicating friendship, for example. But there also exist many implicit networks, where relationships between entities may not be apparent and need to be discovered by some analysis; however, this hidden implicit network structure can explain various behaviors such as product adoption, cascade formation, or virality [EAS 10].
A social network is a graph G = (V, E), i.e. a collection of entities V with relationships E between them; entities are also called nodes or vertices; relationships are also called edges or links. In addition to being a graph, a social network has a locality property defined (informally) as follows: if node A has a relationship with nodes B and C, then the probability that B and C have a relationship too is higher than random (this is also called triadic closure). This means there are more triangles than would be expected if the graph were random (where a triangle is a set of three interconnected nodes).

A social network is often represented by its adjacency matrix A, where $A_{ij} = 1$ if node i has a link to node j, 0 otherwise. Networks can be directed ($A_{ij} \ne A_{ji}$) or undirected ($A_{ij} = A_{ji}$). Networks can be weighted, where $w_{ij}$ is the weight of the link between i and j. In an attributed network, nodes have attributes.

7.3.3. Properties of social networks

From this (rather) informal definition, various characteristic properties emerge in any social network domain (i.e. they are universal). These properties are very different from those of random networks [BAR 16]:

1) Degree distribution: the degree of a node i is defined as:

$$\deg(i) = \sum_{j=1}^{n} A_{ij} \qquad [7.12]$$

where n is the number of nodes. A social network has a degree distribution which is a power law, i.e. for some parameter β and constant C, the probability $p_k$ of degree k is:

$$p_k = C \times k^{-\beta} \qquad [7.13]$$

A network with that property is called a scale-free network. Typical values for β in scale-free networks are 2–3 [BAR 16]. Random networks have a degree distribution which is Poisson type.

2) Clustering coefficient: the clustering coefficient of a node i measures the density of links in node i’s immediate neighborhood and is defined as:

$$C_i = \frac{|\{ e_{jk} \in E : j, k \in N_i \}|}{k_i \times (k_i - 1)} = \frac{\sum_{j,k \in N_i} A_{jk}}{k_i \times (k_i - 1)} \qquad [7.14]$$

where $N_i$ is the neighborhood of i, i.e. the set of nodes directly connected to i (also called the first circle), and $k_i$ is the degree of i. For social networks, the average of the clustering coefficients of nodes with degree k, C(k), decreases with k. In random networks, the clustering coefficient is independent of the node’s degree.

3) Small world: a social network has the small-world property, i.e. the average distance between two nodes in the network is small, where the distance between two nodes is the length of the shortest path between them (also called degree of separation). In the famous experiment by Milgram [MIL 67], it was 6.5. This distance is larger for random networks.

4) Community size distribution: a social network can be decomposed into tightly knit groups of nodes (called communities). The distribution of communities’ sizes has a fat tail, i.e. there are many small communities and a few very large ones [BAR 16]. Random networks do not have any community structure.
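The node characteristics of equations [7.12] and [7.14] can be computed directly from an adjacency matrix, as in the following sketch. The function names, the toy graph, and the convention that a node with fewer than two neighbors has clustering coefficient 0 are assumptions made for illustration.

```python
def degree(A, i):
    """Degree of node i (equation [7.12]) from adjacency matrix A."""
    return sum(A[i])


def clustering_coefficient(A, i):
    """Clustering coefficient of node i (equation [7.14]).

    The double sum runs over ordered pairs (j, k) of neighbors of i, so each
    undirected link between two neighbors is counted twice, which matches
    the k_i * (k_i - 1) denominator.
    """
    neighbors = [j for j, a_ij in enumerate(A[i]) if a_ij and j != i]
    k_i = len(neighbors)
    if k_i < 2:
        return 0.0  # convention chosen here (assumption)
    links = sum(A[j][k] for j in neighbors for k in neighbors if j != k)
    return links / (k_i * (k_i - 1))


# Hypothetical undirected graph: a triangle (0, 1, 2) plus a pendant node 3
# attached to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

degrees = [degree(A, i) for i in range(len(A))]
coefficients = [clustering_coefficient(A, i) for i in range(len(A))]
```

In this toy graph the highest-degree node has the lowest non-zero clustering coefficient, in line with the decreasing C(k) described above; such per-node values can also be stored as additional node attributes.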

152

Advances in Data Science

If we want to assess whether a graph is a social network, we can compare these four properties with those of random networks. We can also use the node characteristics defined earlier, such as degree, clustering coefficient, or community index, as new attributes to enhance performances of our models [NGO 13]. We will show in the following sections how this could be useful for RSs.

7.3.4. Bipartite networks

In multipartite networks, nodes are from different families and a link can only exist between nodes in different families. In a bipartite network (also called two-mode network), there exist two families of nodes, for example, users and items, and there are links only between a node in one family, a user, and a node in the other family, an item: the link represents a relationship such as user purchasing item, clicking on item, and so on. A bipartite network is thus a graph G = (U, I, E) where U and I are two collections of entities and E is the set of relationships between them. Its adjacency matrix A has dimensions n × m where n (respectively, m) is the number of nodes in set U (respectively, I) and $A_{ui} = 1$ if node u ∈ U has a connection with node i ∈ I (i.e. (u, i) ∈ E), 0 otherwise.

A two-mode bipartite graph can be projected into two one-mode undirected weighted networks $G_U = (U, E_U)$ and $G_I = (I, E_I)$ with sets of nodes U and I, where there is an edge between nodes in U if they have common neighbors in I (respectively, nodes in I if they have common neighbors in U). The weight of the edge can be set in many ways [ZHO 07]. Most classically, the weight is the number of common neighbors; for example, for the bipartite purchasing network, the weight in $G_I$ is the number of users who purchased both items i and j:

$$\forall i, j \in I, \quad w_{ij} = \sum_{a \in U} A_{ai} \times A_{aj} \qquad [7.15]$$

Note that if we are projecting the bipartite purchasing network G = (U, I, E), then its adjacency matrix is A = R and the weight in the items one-mode projection $G_I$ is the dot product of items i and j represented in R:

$$w_{ij} = r_{\cdot i}^t \times r_{\cdot j} \qquad [7.16]$$
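The one-mode projections defined by equation [7.15] can be sketched as follows. The function names and the toy matrix are hypothetical, and leaving the diagonal at 0 (no self-loops) is a choice made for this illustration.

```python
def project_items(A):
    """Items projection G_I: W[i][j] = number of users linked to both i and j.

    A is the n x m bipartite adjacency matrix (users in rows, items in columns).
    """
    n_users, n_items = len(A), len(A[0])
    W = [[0] * n_items for _ in range(n_items)]
    for i in range(n_items):
        for j in range(n_items):
            if i != j:  # no self-loops (assumption of this sketch)
                W[i][j] = sum(A[a][i] * A[a][j] for a in range(n_users))
    return W


def project_users(A):
    """Users projection G_U: the same computation on the transposed matrix."""
    transposed = [list(column) for column in zip(*A)]
    return project_items(transposed)


# Hypothetical bipartite purchasing network: 4 users x 4 items.
A = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]

W_items = project_items(A)
W_users = project_users(A)
```

Off the diagonal, `W_items` is the matrix product of the transpose of A with A, which is the dot product of equation [7.16] when A = R.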

Figure 7.2 (left) shows a bipartite network, with top nodes (U) shown as blue circles and bottom nodes (I) as red squares, and its two one-mode projections with weights computed by equation [7.15] in $G_I$ and its equivalent for $G_U$. Note that there is no reason why projected one-mode networks should obey the locality property of social networks and thus enjoy the characteristics of social networks (we will discuss this in section 7.5.4).


Figure 7.2. Bipartite network (left) and its two one-mode projections (middle and right). For a color version of this figure, see www.iste.co.uk/diday/advances.zip

7.3.5. Multilayer networks

A multilayer network (also called multiplex or multigraph) G = (V, E, L) is a collection of (single-layer) networks on the same vertices V (we do not consider here multilayer networks with different sets of vertices) [KIV 14]. The edge set satisfies E ⊆ V × V × L: an element (u, v, l) ∈ E is an edge between u, v ∈ V in the network of layer l ∈ L. Figure 7.3 shows such a network.

Figure 7.3. Multilayer network (left) and its OR-aggregated network (right)

A multilayer network can be aggregated into a one-mode network, by taking the OR, AND or any function of the edges in the different layers (tensor flattening [KIV 14]). Node characteristics (degree, clustering coefficient, etc.) can be defined in the aggregated one-mode network as usual, but other node characteristics exist that are especially designed for multilayer networks. In particular, we will use the following [BRO 13]:

– The neighborhood MN(x, α) of node x for parameter α ∈ ℕ, α > 0 is:

$$MN(x, \alpha) = \{ y : |\{ l : (x, y, l) \in E \ \text{or} \ (y, x, l) \in E \}| \ge \alpha \} \qquad [7.17]$$

– The cross-layer clustering coefficient of node x for parameter α > 0 is:

$$CLCC(x, \alpha) = \frac{\sum_{l \in L} \sum_{y, z \in MN(x, \alpha)} w(y, z, l)}{2 \, |MN(x, \alpha)| \times |L|} \qquad [7.18]$$

where w(y, z, l) denotes the weight of edge (y, z, l) in layer l.


– The multilayer degree centrality (three versions) of node x for parameter α > 0 is:

$$\text{Version } i: \quad MDC^{(i)}(x, \alpha) = \frac{\sum_{l \in L} \sum_{y \in N(x, l)} [w(x, y, l) + w(y, x, l)]}{(m - 1) \times |D_i|} \qquad [7.19]$$

where N(x, l) is the neighborhood of node x in layer l, $D_1 = |L|$, $D_2 = |MN(x, 1)|$ and $D_3 = \sum_{l \in L} |N(x, l)|$.

7.4. Using social networks for recommendation

Graph-based methods for recommendation represent user-item interactions as a bipartite graph. They then use various methods to produce the recommendations: direct paths [AGG 99], random walks [BAC 11], and resource allocation to compute weights in the two-mode projection [WAN 16], [ZHO 07]. Recent contributions address the issue of ranking users’ preferences, which is not handled in the bipartite network, and thus propose to use a tripartite network [SHA 17]. In [BER 15], we have introduced a graph-based framework extending neighborhood-based CF methods, which we now describe.

7.4.1. Social filtering

Social filtering (SF in the rest of this chapter) works as CF, by defining similarity, neighborhood, and scoring function. The differences with CF lie in the similarity and neighborhood graph used. From the various definitions we have introduced, we can build items’ multilayer networks G = (I, E, L) with the following layers (Figure 7.3) (the users’ case is similar):

– Layer 1, the items graph, is the one-mode projected items network $G_I = (I, E_I)$. We may retain an edge only if its weight is at least some threshold θ ≥ 0:

$$w_{ij} \ge \theta \qquad [7.20]$$

– Layer 2 is the behavior network G simbeh xx = (I, E xx), where simbeh xx is some particular similarity function computed using matrix R (e.g. xx = cosine; note that there are as many multilayer networks as there are behavior similarity functions xx), and there is an edge between nodes i and j with weight simbeh xx(i, j) if it satisfies, for some threshold θ′ ≥ 0:

$$\text{simbeh xx}(i, j) \ge \theta' \qquad [7.21]$$

We can now see that in CF, we use similarities simbeh xx computed using matrix R, such as asymmetric cosine in equation [7.4]. This gives us the behavior network G simbeh xx, where xx is the particular similarity function used (asymmetric cosine, for example). Neighborhoods are defined in that graph, such as Top-N in equation [7.5]. Finally, scoring functions are defined as in equations [7.6]–[7.8].


In SF, we can use not only the similarities as in CF, but also other similarities [BER 15] such as support, confidence, and asymmetric confidence [AIO 13], all of these being defined on R, that is on behavior network G simbeh xx. We may also use the dot product in items graph $G_I$ (equations [7.16] and [7.18]) or any other similarity function. We can then define neighborhoods in this behavior network G simbeh xx as before, and also in the projected one-mode items network, $G_I = (I, E_I)$, for example, Top-N for dot product similarity, first circle or community. Scoring functions are as for CF.

The item-based social filtering algorithm, SF-IB (the user-based version SF-UB is defined in a similar way, see [BER 15]), is thus exactly as CF-IB in Table 7.2, the only difference being the larger choice of similarity functions and neighborhoods. Table 7.3 shows different possible neighborhoods. Social filtering thus extends not only collaborative filtering, but also association rules, content-based filtering, or social-based filtering [BER 15]. We now discuss how we can extend it to use attributes, which was not possible in [BER 15].

Neighborhood | Acronym | Items graph | Behavior similarity graph
Top-N similarity | Top-N | SF (dot product) | CF, SF
Similarity ≥ θ | simθ | SF (dot product) | CF, SF
First circle | 1st circ | SF | SF
Community | comm | SF | SF

Table 7.3. Neighborhoods for collaborative and social filtering (CF and SF)

7.4.2. Extension to use attributes

As discussed in section 7.2.4.1, if we have items’ attributes in the form of a matrix $A_I$ of dimensions $m \times F_I$, we can define the similarity simatt xx(i, j) between items i and j based on those attributes, using any of the similarity functions defined previously. This gives us a new layer in the previous multilayer network G = (I, E, L):

– Layer 3 is the items network G simatt xx = (I, E simatt xx). We usually impose a constraint on edges that similarity is at least some threshold θ′′ ≥ 0:

$$\text{simatt xx}(i, j) \ge \theta'' \qquad [7.22]$$

To define a hybrid social filtering model, we proceed as described in section 7.2.4.2: we first merge attribute similarity simatt xx and any of the previously defined behavior similarities (section 7.4.1) using equation [7.11]. We then choose any neighborhood (Table 7.3) and scoring function. We denote these merged models as CF-CB or SF-CB, depending on the similarity and neighborhood we choose. Other models might be defined as well, such as merging layers in the multilayer network, but will not be described here.


7.4.3. Remarks

Various points might not have been apparent from the previous presentation, but have a significant impact on results presented later:

– When computing items’ similarity (the users’ case is similar), we use items’ representations, either from behavior matrix R or from attribute matrix $A_I$. These items’ representations are of high dimension (a few thousand in our experiments below). We should then expect very small differences in similarity, because of the curse of dimensionality. Reducing dimensionality, for example, through PCA, should help.

– We compare similarities of all pairs of items (i, j), but small similarities most probably are not useful when we take neighborhoods such as Top-N. Since most similarity functions between items i and j, except PCC, are null whenever the number of users who purchased them both is 0, i.e. when weight $w_{ij}$ in graph $G_I$ is 0 (equation [7.16]), thresholds on the various layers of multilayer network G = (I, E, L) should be aligned. Filtering low similarities, e.g. through equations [7.20]–[7.22], would significantly speed up computations, at no performance cost.

– There are many available choices in similarity, neighborhood, and scoring functions, and, contrary to the case of explicit rating prediction, we cannot use a global error function to optimize across all choices, but must resort to a painful search through the entire hyper-parameter space.

7.5. Experiments

We now describe the results of experiments illustrating the different methods we have introduced in the previous sections.

7.5.1. Performance evaluation

We split the dataset with 90% of users for training and 10% for test, as proposed in [AIO 13]. All transactions of users in the training set are used for training. For testing, half of the transactions of the users in the test set are used as input to produce recommendations; the other half is used for evaluation (comparing the obtained recommendations with these transactions).

In what follows, we may repeat this on p cross-validation (CV) samples, for p draws of (training, test) samples. To evaluate performances, we denote by $L_a = (i_{a1}, i_{a2}, \ldots, i_{ak})$ the list of k items recommended to user a in the Test set of size $l_{Test}$, from the data in the input part of the Test set; $T_a$ is the target set for a, i.e. the items actually consumed by user a in the evaluation part of the Test set. Note that, for Top-N recommenders, performance indicators for rating predictors, such as RMSE (Root Mean Square Error), MAE (Mean Absolute Error), and Area Under the Curve (AUC) cannot be used.


Performance is measured by the classical Precision and Recall indicators (at k, the number of recommended items), defined as:

\[
\mathrm{Precision@}k = \frac{1}{l_{Test}} \sum_{a \in Test} \frac{|L_a \cap T_a|}{|L_a|},
\qquad
\mathrm{Recall@}k = \frac{1}{l_{Test}} \sum_{a \in Test} \frac{|L_a \cap T_a|}{|T_a|}
\qquad [7.23]
\]
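To make equation [7.23] concrete, here is a minimal sketch of how the two indicators could be computed. The function name, the dictionary-based data layout and the toy data are illustrative assumptions, not the authors' implementation.

```python
def precision_recall_at_k(recommended, targets):
    """Average Precision@k and Recall@k over the test users (equation [7.23]).

    recommended: dict user -> list of the k recommended items (L_a)
    targets:     dict user -> set of items actually consumed (T_a)
    """
    users = list(recommended)
    precision = recall = 0.0
    for a in users:
        rec = set(recommended[a])
        hits = len(rec & targets[a])                      # |L_a ∩ T_a|
        precision += hits / len(rec) if rec else 0.0
        recall += hits / len(targets[a]) if targets[a] else 0.0
    n = len(users)                                        # l_Test
    return precision / n, recall / n


# Hypothetical toy data: two test users, k = 4 recommendations each.
recommended = {"u1": ["i1", "i2", "i3", "i4"], "u2": ["i2", "i5", "i6", "i7"]}
targets = {"u1": {"i1", "i3", "i9"}, "u2": {"i8"}}
precision, recall = precision_recall_at_k(recommended, targets)
```

In this toy case, user u1 has two hits out of four recommendations and three target items, and user u2 has none, so the averages are 0.25 for precision and 1/3 for recall.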

7.5.2. Datasets

We have evaluated our methods on MovieLens 1M [HAR 16], a dataset widely used in the literature. It contains 1,000,209 ratings, on a 5-star scale, given to approximately 3,900 movies by 6,040 users who joined MovieLens in 2000 (each user has at least 20 ratings). There is also a time stamp, which we did not use. Since we are interested in Top-N recommenders, we transform the data into binary implicit feedback, as proposed in [CHR 16] and [KOR 11]: all existing ratings are set to 1 and missing ratings to 0, so that the data only show whether the user did or did not rate the movie. The density of interaction matrix R is only 4.19%.

For users, we have three attributes (age, sex, and occupation; we did not use zip code). We also derived 16 graph attributes from the users' multilayer network: degree, clustering coefficient, and community in the users' graph and users' behavior graph; cross-layered clustering coefficient, degree centrality versions 1–3 and community in the multilayer networks with two layers (G simbeh cosine and G simatt cosine), with threshold θ′ = 0.05 in equation [7.21], to have layers of approximately the same density, and α = 1 or 2 [BRO 13]. Community is encoded as a categorical variable.

For movies, we only have two attributes (title and genre). With these attributes, we can collect additional information from the Open Movie Database public API². After some basic cleaning, we retain 8 of the 19 attributes, plus ratings: Actors, Awards, Country, Director, Genre, Language, Writer, and IMDb Rating, which we call original attributes. The other attributes were dropped either because they were largely missing (Metascore) or redundant (imdbVote). We also used basic text processing to extract words from Plot and Title.

Encoding was then done as follows. Attributes such as Awards or Writer may contain complex information (“Nominated for three Oscars. Another 22 wins and 17 nominations” or “Mark Twain (novel), Stephen Sommers (screenplay), David Loughery (screenplay)”). We extracted from these records three variables for Awards (Oscars, Wins, and Nominations) and, for Writer, the list of writers without their tasks. Attributes such as Actors, Country, Director, Genre, Language, or Writer are binary-encoded. Text was encoded with tf-idf, yielding the Text attributes. After these transformations, we obtain the final dataset described in Table 7.4.
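The conversion to binary implicit feedback described in this section can be sketched as follows. This is a minimal illustration on a hypothetical toy rating matrix (with 0 standing for a missing rating), not the preprocessing code actually used for MovieLens.

```python
import numpy as np

# Hypothetical explicit ratings (0 = missing); rows are users, columns are movies.
ratings = np.array([
    [5, 0, 3, 0],
    [0, 0, 4, 1],
    [2, 0, 0, 0],
])

# Binary implicit feedback: 1 if the user rated the movie, 0 otherwise.
R = (ratings > 0).astype(np.int8)

# Density of R: share of (user, movie) pairs with an observed interaction.
density = R.sum() / R.size
```

Here 5 of the 12 cells are observed, so the toy density is about 41.7%, far above the 4.19% reported for the real dataset.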

2 http://www.omdbapi.com/.


                                   Original   After cleaning/encoding
Number of users                    6,040      6,040
Number of movies                   3,883      3,883
Number of non 0 entries in R       38,191     56,106
Number of movies attributes        10         15,973
Number of movies text attributes   2          20,259
Number of users attributes         3          30
Number of users graph attributes   16         48

Table 7.4. MovieLens 1M dataset

7.5.3. Analysis of one-mode projected networks

As discussed earlier, filtering connections with small weights in the one-mode projected graph (equation [7.20]) eliminates small similarities, which might not be useful when we look for Top-N neighborhoods. In this section, we analyze the impact of this filtering on the structure of the projected graphs. As described in section 7.3.3, social networks, i.e. graphs with the locality property, have characteristics very different from those of random networks. We now look at the characteristics of the one-mode projected networks (items and users) derived from the bipartite graph constructed on the MovieLens dataset described in Table 7.4. Depending on the threshold θ in equation [7.20], we get the densities (twice the ratio of the number of edges to the square of the number of nodes) and the numbers of remaining nodes shown in Table 7.5.

Filter    θ     Density   #nodes
Movies    0     0.824     3,704
Movies    4     0.562     3,396
Movies    30    0.226     2,710
Users     0     0.824     6,039
Users     14    0.393     6,025
Users     30    0.204     5,028

Table 7.5. MovieLens 1M dataset with filtering small weights
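The quantities reported in Table 7.5 can be illustrated with the following minimal sketch. It assumes that the weight of an edge between two items is the number of users who interacted with both, that an edge is kept when its weight is strictly greater than θ, and that density is computed on the remaining nodes. The exact forms of equations [7.16] and [7.20] are not reproduced here, so the function name and these conventions are assumptions.

```python
import numpy as np

def project_and_filter(R, theta):
    """One-mode item projection of a binary user-item matrix R, filtered at theta.

    Assumption: w_ij counts the users who interacted with both i and j,
    and an edge (i, j) is kept when w_ij > theta.
    """
    W = R.T @ R                      # co-occurrence counts between items
    np.fill_diagonal(W, 0)           # no self-loops
    A = W > theta                    # filtered adjacency matrix (boolean, symmetric)
    kept = A.any(axis=1)             # items that still have at least one edge
    n_nodes = int(kept.sum())
    n_edges = int(A.sum()) // 2      # each undirected edge appears twice in A
    density = 2 * n_edges / n_nodes**2 if n_nodes else 0.0
    return A, n_nodes, n_edges, density


# Hypothetical toy matrix: 4 users (rows) by 4 items (columns).
R = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])
for theta in (0, 1):
    _, n_nodes, n_edges, density = project_and_filter(R, theta)
    print(theta, n_nodes, n_edges, density)
```

On this toy example, raising θ from 0 to 1 removes five of the six edges and two of the four nodes, which mirrors the trade-off between sparsity and coverage discussed in this section.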

Obviously, filtering has a significant impact on density, but too much filtering means that some items (or users) are eliminated and thus cannot be recommended (or recommended to). Figure 7.4 shows the characteristics of the one-mode projected networks: the degree distribution (the figure shows the probability of degree in terms of degree in log-log coordinates) is not a power law, but not a Poisson distribution either (the users' graph has a lot of hubs, with very high degree); the clustering coefficient distribution is not decreasing but not constant either (the figure shows the average clustering coefficient for a given degree: there are many clustering coefficients corresponding to large degrees, i.e. hubs very connected to their neighbors); the community size distribution has a long tail (not shown here). Therefore, we can see that these networks are not social networks, but not random networks either. Also, when the filtering threshold θ increases, these networks start behaving more like social networks, which means that our social network analysis tools (e.g. community detection) and intuition would apply better. In particular, the large number of hubs (nodes with high degree) is very apparent here, all the more so for θ = 0. In the following experiments, we chose an intermediate threshold: θ = 4 for items and θ = 14 for users.

Figure 7.4. Degree distributions of the projected Item graph (top row; θ = 0 left, 4 middle, 30 right) and User graph (middle row; θ = 0 left, 14 middle, 30 right). Clustering coefficient distributions of the projected Item graph (bottom row; same θ). Log-log coordinates

When finding the optimal N* in Top-N models, we try to get the best precision (equation [7.23]), that is, the average over the test set of the precisions for users a in the test set. If user a has a small degree deg(a) (in the neighborhood graph), then there will not be enough neighbors to find N* users, hence a will have poor precision. If user a has a large degree (deg(a) > N*), then there will be enough neighbors to find the best N* users, and a will have good precision. Hence, to get a good average precision, we need to balance the numbers of nodes with small and large degrees. For example, we could start looking for N* around the median of the degrees. For our test dataset, at thresholds 4 and 14 (for items and users), we found medians of 212 and 1,182.

7.5.4. Models evaluated

We use the following baseline models:

– Popularity: the popularity of each movie is computed as the number of users who rated it, and items are recommended by decreasing popularity. All users get the same list of recommended items (excluding the items they have already rated):

\[
Pop(i) = \sum_{a} r_{ai} \qquad [7.24]
\]

– Content-based recommendation: we implement this as CF-IB with cosine or asymmetric cosine similarity and Top-N neighborhood in attributes graph G simatt xx, with xx = cosine or asymmetric cosine.

– Item-based collaborative filtering CF-IB with cosine or asymmetric cosine similarity and Top-N neighborhood in behavior graph G simbeh xx.

– User-based collaborative filtering CF-UB with asymmetric cosine similarity and Top-N neighborhood in behavior graph G simbeh AsymCos.

– Model-based matrix factorization: SVD and NMF.

All these models are classical in the literature and our implementation is in line with the references. However, because of the many implementation details, there is not one reference that can serve as the basis for comparison for all these techniques. We have thus preferred to run all these models in the same implementation framework to make comparisons. We compared the above models to SF models with other similarities (dot product) and other neighborhoods (first circle), and tested linearly merging attributes into the models to get CF-CB or SF-CB models (this new possibility was not included in [BER 15]).

7.5.5. Results

Results are shown in the following tables, which report the average over five cross-validation samples, using the WAS scoring function (equation [7.6]), for various similarities (cosine, asymmetric cosine with γ = 0.2, or dot product).

Table 7.6 shows that Popularity has very poor performance. Model-based matrix factorization models, SVD and NMF, are worse than the best CF-IB (asymmetric cosine with Top-N neighborhood, N = 160, in Table 7.6). SF models with first circle are worse than CF models with Top-N neighborhoods (confirming the necessary locality of models as shown in Figure 7.1). CF-UB is worse than CF-IB, in line with claims in the literature. Note that the optimal N* values (shown in parentheses in the table: 160, 190, and 1000) are close to the medians indicated in section 7.5.3 (212 and 1,182). Performance heavily depends on N, as illustrated in Figure 7.1 for this CF-IB model. In [BER 15], SF was shown to be able to outperform classical CF models for some datasets. However, our results show that, in the present case, CF is the best.

Model        Behavior sim.   Neighborhood    Precision (%)   Recall (%)
Popularity                                   3.76            0.29
SVD                                          54.01           5.54
NMF                                          52              5.33
CF-IB        Asym.Cos        Top-N (160)     59.07           6.47
CF-IB        Cosine          Top-N (190)     55.82           6.01
CF-UB        Asym.Cos        Top-N (1000)    43.02           3.32
SF-IB        Asym.Cos        First circle    43.9            4.22
SF-IB        Cosine          First circle    52.83           5.57

Table 7.6. Performances of models on MovieLens without attributes
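As an illustration of two of the models compared in Table 7.6, the following minimal sketch implements the popularity baseline of equation [7.24] and an item-based recommender with cosine similarity and a Top-N neighborhood. The scoring used here is a plain sum of similarities, chosen for brevity; it is an assumption and not the WAS function of equation [7.6]. All function names and the toy data are hypothetical.

```python
import numpy as np

def recommend_popular(R, user, k):
    """Top-k items by popularity Pop(i) = sum_a r_ai, skipping items already rated."""
    scores = R.sum(axis=0).astype(float)
    scores[R[user] > 0] = -np.inf               # never recommend an item already rated
    return np.argsort(-scores, kind="stable")[:k].tolist()

def cosine_item_similarity(R):
    """Cosine similarity between the item columns of R, with a zero diagonal."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0                     # items nobody rated keep similarity 0
    S = (R.T @ R) / np.outer(norms, norms)
    np.fill_diagonal(S, 0.0)
    return S

def recommend_item_based(R, S, user, k, n_neighbors):
    """Top-k items, each scored on its n_neighbors most similar items only."""
    S_top = np.zeros_like(S)
    for i in range(S.shape[0]):
        nearest = np.argsort(-S[i], kind="stable")[:n_neighbors]
        S_top[i, nearest] = S[i, nearest]
    scores = S_top @ R[user]                    # sum of similarities to the user's items
    scores[R[user] > 0] = -np.inf
    return np.argsort(-scores, kind="stable")[:k].tolist()


# Hypothetical binary user-item matrix: 4 users (rows) by 4 items (columns).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
])
S = cosine_item_similarity(R)
popular = recommend_popular(R, user=2, k=2)
personalized = recommend_item_based(R, S, user=2, k=1, n_neighbors=2)
```

The parameter n_neighbors plays the role of N in the Top-N neighborhood: a small value keeps only the strongest similarities, while a value equal to the number of items uses every nonzero similarity.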

We then used attributes, merging their similarity into the behavior similarity as described in section 7.4.2. Results in Table 7.7 show again that Content-based models are poor; attributes marginally increase performance for CF-CB-IB, CF-CB-UB*, and CF-CB-UB** (with users' attributes and graph attributes). The best model is CF-CB-IB with asymmetric cosine and Top-N neighborhood. Note that the (average) merging coefficient α is relatively large, indicating that behavior data contribute more than attributes (α = 1 means no attribute).

Model        Behavior sim.   Neighborhood    Attribute sim.   α      Precision   Recall
CB                           Top-N (18)      Asym.Cos                25.74       2.2
CB                           Top-N (15)      Cosine                  25.68       2.2
SF-CB-IB     Asym.Cos        First circle    Asym.Cos         0.5    48.06       4.8
SF-CB-IB     Cosine          First circle    Cosine           0.8    53.94       5.8
CF-CB-IB     Asym.Cos        Top-N (130)     Asym.Cos         0.6    59.24       6.54
CF-CB-IB     Cosine          Top-N (130)     Cosine           1      54.94       5.88
CF-CB-UB*    Asym.Cos        Top-N (1000)    Asym.Cos         0.66   43.18       3.34
CF-CB-UB**   Asym.Cos        Top-N (1000)    Asym.Cos         0.9    43.22       3.34

Table 7.7. Performances of hybrid models on MovieLens with attributes
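The role of the merging coefficient α can be sketched as follows. The exact merging formula of section 7.4.2 is not reproduced here, so the convex combination below is an assumption, chosen because it is linear and reduces to pure behavior similarity when α = 1, as stated above. The function name and matrices are hypothetical.

```python
import numpy as np

def merge_similarities(S_beh, S_att, alpha):
    """Assumed linear merge: alpha * behavior similarity + (1 - alpha) * attribute similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * S_beh + (1.0 - alpha) * S_att


# Hypothetical 2 x 2 similarity matrices for two items.
S_beh = np.array([[0.0, 0.8], [0.8, 0.0]])
S_att = np.array([[0.0, 0.2], [0.2, 0.0]])
S = merge_similarities(S_beh, S_att, alpha=0.6)
```

With α = 0.6, the merged similarity between the two items is 0.6 × 0.8 + 0.4 × 0.2 = 0.56. In practice, α would be selected by searching over a grid and keeping the value giving the best precision.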

The small increase in performance is probably due to the high dimensionality of the data on which we compute similarities (see Table 7.4). We thus reduced dimensions by running PCA on the items' vectors and then applied the various models in the space of the PCA representations. Table 7.8 shows some of the results obtained: using PCA directly on the behavior data increases the performance of Content-Based CB and CF-IB (d is the dimension of the PCA space giving the best precision). If we run PCA on the behavior data and the attributes (CF-IB*), or on those and the text attributes (CF-IB**), we can see that while attributes do improve performance, additional textual attributes fail to do so. In particular, it seems that first computing a PCA representation of behavior data and attributes, and then running CF (CF-IB*), gets better performance than merging similarities as in CF-CB-IB; but this did not hold when text attributes were also included (CF-IB**).

d       Model      Behavior sim.   Neighborhood   Precision (%)   Recall (%)
2,461   CB         Asym.Cos        Top-N (20)     26.23           2.29
2,461   CB         Cosine          Top-N (10)     26.14           2.28
50      CF-IB      Asym.Cos        Top-N (264)    59.86           6.63
50      CF-IB      Dot product     Top-N (150)    45.56           4.43
130     CF-IB*     Asym.Cos        Top-N (240)    60.32           6.76
130     CF-IB*     Dot product     Top-N (160)    46.40           4.66
5       CF-IB**    Asym.Cos        Top-N (488)    48.66           4.57
130     CF-IB**    Dot product     Top-N (100)    44.13           4.44

Table 7.8. Performances of models after PCA
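The dimension-reduction step can be sketched as follows: item vectors are projected onto their first d principal components, and similarities are then computed in the reduced space. This is a minimal illustration using an SVD-based PCA on random toy data. The function names, the choice of cosine similarity and the value d = 3 are assumptions made for the example.

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto their first d principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T                             # shape: (n_rows, d)

def cosine_similarity_rows(Z):
    """Cosine similarity between the rows of Z."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                          # guard against all-zero rows
    Zn = Z / norms
    return Zn @ Zn.T


rng = np.random.default_rng(0)
R = (rng.random((30, 8)) < 0.3).astype(float)        # toy binary user-item matrix
items = R.T                                          # one row per item (8 items, 30 users)
Z = pca_project(items, d=3)                          # items in a 3-dimensional PCA space
S = cosine_similarity_rows(Z)                        # item-item similarities after PCA
```

The resulting matrix S can then be fed to any neighborhood-based recommender in place of the similarity computed on the raw high-dimensional vectors, with d treated as one more hyper-parameter to tune.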

All the experiments described earlier have been run on one server with 2 TB of RAM (all datasets fit entirely in RAM) and one thread. Obviously, computation time can be very large, so we tested multi-thread and Spark implementations. As can be seen in Table 7.9, for the MovieLens 1, 10, and 20 Million datasets [HAR 16], multi-threading can indeed speed up computations a lot. The Spark implementation (on a 20-node cluster) is hard to handle and is not worthwhile for these (not so) small datasets: it is slower than its 20-thread counterpart.

Dataset          1 thread    20 threads   50 threads   Spark
MovieLens 1M     55.8s       20.8s        19.8s        24.3s
MovieLens 10M    1875.8s     59.8s        70.8s        79.0s
MovieLens 20M    6435.8s     132.8s       201.8s       164.8s

Table 7.9. Computing time in seconds for multi-threads and Spark

In this section, we have shown results in line with the literature on CF (Tables 7.6 and 7.7), but did not find, on the dataset presented, any performance increase with our social network SF model. We discussed the “social network”-ness of projected networks and showed how their study could give us an indication of the optimal neighborhood size N* for Top-N. We also presented ways to produce new features, extracting them from our multilayer network representation (CF-CB-UB**), and showed that these provided improved performance. Finally, we demonstrated that the curse of dimensionality is indeed an issue for CF or SF and showed how a simple PCA-based representation can help achieve improved performance. Obviously, more work is needed in this direction, as demonstrated by the lack of benefit from text attributes.

7.6. Perspectives

We have shown how recommender systems can be framed as multi-layer or bipartite graphs. Social network analysis allowed us to derive hybrid RSs using attributes, predict a range of optimal N for Top-N recommenders (median of the degree distribution in the neighborhood network), and extract graph attributes, used to enrich the original attributes to improve performance (these techniques extend those presented in [BER 15]). We showed that distributed implementations, such as Spark, do not speed up computations compared to multi-threading.

We have discussed the difficulties of computing similarities and neighborhoods with high-dimension data. Filtering low items' (or users') similarities and representing data in low-dimension spaces (e.g. with PCA) are key factors for obtaining good performance in recommender systems. While social networks bring us an interesting way of representing data, the fundamental issue of data representation in low-dimension space has a strong potential to bring better performance. We leave this topic for our future research.

7.7. References

[ADO 11] ADOMAVICIUS G., MANOUSELIS N., KWON Y., “Multi-criteria recommender systems”, in RICCI F., ROKACH L., SHAPIRA B. et al. (eds), Recommender Systems Handbook, Springer Science and Business Media, LLC, Boston, MA, pp. 769–803, 2011.

[AGG 99] AGGARWAL C.C., WOLF J.L., WU K.-L. et al., “Horting hatches an egg: a new graph-theoretic approach to collaborative filtering”, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 201–212, 1999.

[AGG 01] AGGARWAL C.C., HINNEBURG A., KEIM D.A., “On the surprising behavior of distance metrics in high dimensional spaces”, ICDT, vol. 1, pp. 420–434, 2001.

[AIO 13] AIOLLI F., “Efficient top-n recommendation for very large scale binary rated datasets”, Proceedings of the 7th ACM Conference on Recommender Systems, ACM, pp. 273–280, 2013.


[BAC 11] BACKSTROM L., LESKOVEC J., “Supervised random walks: predicting and recommending links in social networks”, Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, pp. 635–644, 2011.

[BAR 16] BARABÁSI A.-L., Network Science, Cambridge University Press, Cambridge, 2016.

[BEL 11] BELIAKOV G., CALVO T., JAMES S., “Aggregation of preferences in recommender systems”, in RICCI F., ROKACH L., SHAPIRA B. et al. (eds), Recommender Systems Handbook, Springer Science and Business Media, LLC, Boston, 2011.

[BEN 07] BENNETT J., LANNING S., “The Netflix prize”, Proceedings of KDD Cup and Workshop, vol. 2007, ACM, New York, NY, USA, 2007.

[BER 15] BERNARDES D., DIABY M., FOURNIER R. et al., “A social formalism and survey for recommender systems”, ACM SIGKDD Explorations Newsletter, vol. 16, no. 2, pp. 20–37, ACM, 2015.

[BRO 13] BRODKA P., KAZIENKO P., “Multi-layered social networks”, in ALHAJJ R., ROKNE J. (eds), Encyclopedia of Social Network Analysis and Mining, Springer, Boston, pp. 998–1013, 2013.

[BUR 07] BURKE R., “Hybrid Web recommender systems”, in BRUSILOVSKY P., KOBSA A., NEJDL W. (eds), The Adaptive Web, Methods and Strategies of Web Personalization, Lecture Notes in Computer Science, vol. 4321, pp. 377–408, Springer Verlag, Boston, 2007.

[CAM 03] CAMASTRA F., “Data dimensionality estimation methods: a survey”, Pattern Recognition, vol. 36, no. 12, pp. 2945–2954, 2003.

[CAO 16] CAO L., “Non-IID recommender systems: a review and framework of recommendation paradigm shifting”, Engineering, vol. 2, no. 2, pp. 212–224, 2016.

[CHR 16] CHRISTAKOPOULOU E., KARYPIS G., “Local item-item models for top-n recommendation”, Proceedings of the 10th ACM Conference on Recommender Systems, ACM, pp. 67–74, 2016.

[CRE 10] CREMONESI P., KOREN Y., TURRIN R., “Performance of recommender algorithms on top-n recommendation tasks”, Proceedings of the Fourth ACM Conference on Recommender Systems, ACM, pp. 39–46, 2010.

[DES 04] DESHPANDE M., KARYPIS G., “Item-based top-n recommendation algorithms”, ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 143–177, 2004.

[DIN 15] DING S.-H., JI D.-H., WANG L.-L., “Collaborative filtering recommendation algorithm based on user attributes and scores”, Computer Engineering and Design, vol. 2, p. 39, 2015.


[EAS 10] EASLEY D., KLEINBERG J., Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge University Press, Cambridge, 2010.

[EKS 11] EKSTRAND M.D., RIEDL J.T., KONSTAN J.A. et al., “Collaborative filtering recommender systems”, Foundations and Trends in Human–Computer Interaction, vol. 4, no. 2, pp. 81–173, 2011.

[FAN 11] FANG Y., SI L., “Matrix co-factorization for recommendation with rich side information and implicit feedback”, International Workshop on Information Heterogeneity and Fusion in Recommender Systems, pp. 65–69, 2011.

[HAR 16] HARPER M., KONSTAN J.A., “The MovieLens datasets: history and context”, ACM Transactions on Interactive Intelligent Systems (TiiS), vol. 5, no. 4, pp. 19:1–19:19, 2016.

[HOR 13] HORNUNG T., ZIEGLER C.-N., FRANZ S. et al., “Evaluating hybrid music recommender systems”, Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)-Volume 01, IEEE Computer Society, pp. 57–64, 2013.

[KIV 14] KIVELÄ M., ARENAS A., BARTHELEMY M. et al., “Multilayer networks”, Journal of Complex Networks, vol. 2, no. 3, pp. 203–271, 2014.

[KOR 09] KOREN Y., BELL R., VOLINSKY C., “Matrix factorization techniques for recommender systems”, Computer, vol. 42, no. 8, pp. 30–37, 2009.

[KOR 11] KOREN Y., BELL R., “Advances in collaborative filtering”, in RICCI F., ROKACH L., SHAPIRA B. et al. (eds), Recommender Systems Handbook, Springer Science and Business Media, LLC, Boston, pp. 145–186, 2011.

[LAN 01] LANEY D., “3D Data management: controlling data volume, velocity and variety”, META Group Research Note, vol. 6, Application Delivery Strategies, 2001.

[LES 14] LESKOVEC J., RAJARAMAN A., ULLMAN J.D., Mining of Massive Datasets, 2nd ed., Cambridge University Press, Cambridge, 2014.

[LI 15] LI S., KAWALE J., FU Y., “Deep collaborative filtering via marginalized denoising auto-encoder”, Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, ACM, pp. 811–820, 2015.

[LIN 03] LINDEN G., SMITH B., YORK J., “Amazon.com recommendations: item-to-item collaborative filtering”, IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.

[MA 08] MA H., YANG H., LYU M.R. et al., “SoRec: social recommendation using probabilistic matrix factorization”, Computational Intelligence, vol. 28, no. 3, pp. 931–940, 2008.

[MA 13] MA H., “An experimental study on implicit social recommendation”, International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 73–82, 2013.


[MAR 11] MARINHO L.B., NANOPOULOS A., SCHMIDT-THIEME L. et al., “Social tagging recommender systems”, in RICCI F., ROKACH L., SHAPIRA B. et al. (eds), Recommender Systems Handbook, Springer Science and Business Media, LLC, pp. 615–644, Boston, 2011.

[MIL 67] MILGRAM S., “The small world problem”, Psychology Today, vol. 2, no. 1, pp. 185–195, 1967.

[NGO 13] NGONMANG B., VIENNET E., SEAN S. et al., “Monetization and services on a real online social network using social network analysis”, 2013 IEEE 13th International Conference on Data Mining Workshops (ICDMW), IEEE, 2013.

[NIN 11] NING X., KARYPIS G., “Slim: Sparse linear methods for top-n recommender systems”, 2011 IEEE 11th International Conference on Data Mining (ICDM), IEEE, pp. 497–506, 2011.

[PAR 15] PARASCHAKIS D., NILSSON B.J., HOLLÄNDER J., “Comparative evaluation of top-n recommenders in e-commerce: an industrial perspective”, 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), IEEE, pp. 1024–1031, 2015.

[PAZ 07] PAZZANI M.J., BILLSUS D., “Content-based recommendation systems”, in BRUSILOVSKY P., KOBSA A., NEJDL W. (eds), The Adaptive Web, LNCS 4321, Springer-Verlag, Berlin and Heidelberg, pp. 325–341, 2007.

[POZ 16] POZO M., CHIKY R., MÉTAIS E., “Enhancing collaborative filtering using implicit relations in data”, Transactions on Computational Collective Intelligence XXII, Springer, pp. 125–146, 2016.

[PRA 11] PRADEL B., SEAN S., DELPORTE J. et al., “A case study in a recommender system based on purchase data”, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 377–385, 2011.

[RIC 02] RICCI F., “Travel recommender systems”, IEEE Intelligent Systems, vol. 17, no. 6, pp. 55–57, 2002.

[RIC 11] RICCI F., ROKACH L., SHAPIRA B., “Introduction to recommender systems handbook”, in RICCI F., ROKACH L., SHAPIRA B. et al. (eds), Recommender Systems Handbook, Springer Science and Business Media, LLC, Boston, 2011.

[SAR 01] SARWAR B., KARYPIS G., KONSTAN J. et al., “Item-based collaborative filtering recommendation algorithms”, Proceedings of the 10th International Conference on World Wide Web, ACM, pp. 285–295, 2001.

[SHA 17] SHAMS B., HARATIZADEH S., “Graph-based collaborative ranking”, Expert Systems With Applications, vol. 67, pp. 59–70, Elsevier, 2017.

[TÖS 09] TÖSCHER A., JAHRER M., BELL R.M., “The Big Chaos solution to the Netflix grand prize”, AT&T Labs - Research Tech Report, September 2009.


[WAN 16] WANG J., SHAO F., WU S. et al., “Weighted bipartite network projection for personalized recommendations”, Journal of Advances in Computer Networks, vol. 4, no. 1, pp. 64–69, 2016.

[ZHA 06] ZHANG S., WANG W., FORD J. et al., “Learning from incomplete ratings using non-negative matrix factorization”, Proceedings of the 2006 SIAM International Conference on Data Mining, SIAM, pp. 549–553, 2006.

[ZHE 07] ZHENG R., PROVOST F., GHOSE A., “Social network collaborative filtering: Preliminary results”, Proceedings 6th Workshop on eBusiness WEB2007, December 2007.

[ZHO 07] ZHOU T., REN J., MEDO M. et al., “Bipartite network projection and personal recommendation”, Physical Review E, vol. 76, no. 4, p. 046115, 2007.

8 Attributed Networks Partitioning Based on Modularity Optimization

We have proposed I-Louvain, a method for clustering graph nodes, which uses a criterion based on inertia combined with Newman's modularity: it can detect communities in attributed graphs where real-valued attributes are associated with vertices. In this chapter, we demonstrate that the optimization of this global criterion can be done at the local level, ensuring execution speed. Thus, as in the well-known Louvain algorithm, the global modularity of a new partition can be quickly updated in our I-Louvain algorithm. Our experiments on synthetic datasets show that this method is able to handle reasonably large networks and that combining the relational information with the attributes may allow communities to be detected more effectively than using only one type of information.

Chapter written by David COMBE, Christine LARGERON, Baptiste JEUDY, Françoise FOGELMAN-SOULIÉ and Jing WANG.

8.1. Introduction

Community detection deals with the unsupervised clustering of graph vertices in social networks. Each vertex represents, for instance, a person, and the graph structure models the social relations between people. There is a link between two nodes of the graph if the corresponding persons have a social interaction. The particular meaning of this social interaction depends on the social network: it could be an observed relation (using e.g. proximity sensors or logs of emails), or the users themselves can declare a “friendship” relation between them. The goal is then to partition the vertices of the graph into so-called “communities” such that, on average, the vertices in the same community are more strongly connected than with nodes in different communities [BIC 13], [FJÄ 98], [LAN 09], [NEW 04a], [SCH 07]. Note that many definitions of communities exist, resulting in assortative communities (nodes in a community are mostly connected to nodes in their own community, the definition we assume here [GIR 02]), disassortative communities (nodes in a community are mostly connected to nodes outside their own community), or a combination [NEW 15]. Communities can define a partition of the nodes (as assumed here) or may be overlapping.

Among the main methods proposed in the literature, many optimize an objective function (modularity, ratio cut or its variants, etc.) that measures the quality of the partition [DIN 01], [KER 70], [NEW 04b], [SHI 00]. Other methods are based on minimum cuts [FLA 03], spectral methods [VON 07], the Markov clustering algorithm and its extensions [SAT 09], or stochastic block models [KAR 11]. We refer the reader to the review articles of Fortunato [FOR 10], [FOR 16] for a thorough discussion of state-of-the-art community detection methods.

Together with the topological information (the graph), many data sources may also provide attributes describing the vertices of the network. Most of the previous methods do not use these attributes. On the other hand, unsupervised clustering methods only use the attributes to compute partitions. An interesting challenge is, therefore, to combine the topological information (contained in the graph) with the attribute information to build the communities. However, these different sources might not be aligned: structure and attributes may “disagree” [PEE 17], and attributes may be irrelevant to the network structure or may capture different properties than the communities. Yet one could hope to benefit from the two sources, when one is missing or noisy, and obtain better communities.
Thus, several methods have recently been proposed to take into account both the relational information and the attributes, to detect patterns in attributed graphs [PRA 13], [STA 13] or for hybrid clustering [COM 12], [ELH 13]. Various techniques can be used to incorporate both sources, for example linear combination [RUA 13] or probabilistic generative models [YAN 13].

We have introduced a method, called I-Louvain [COM 15], which allows the partitioning of the vertices of an attributed graph when numerical attributes are associated with the vertices. In social networks, these attributes can correspond to features (age or weight) or to tf-idf vectors representing documents associated with the nodes, for example. Our method is based on a local optimization of a global criterion, which is a function of the modularity [NEW 06] on the one hand, and of a new measure based on inertia on the other hand.

After a presentation of related work in section 8.2, we define this measure, called inertia-based modularity, in section 8.3, and the method I-Louvain in section 8.4. In section 8.5, we demonstrate that this method benefits from a local optimization of the global criterion. Finally, the experimental study in section 8.6 confirms that clustering based on both the relational information and the attributes provides more meaningful clusters than methods taking into account only one type of information (attributes or edges), for a class of artificial datasets generated through our network generator DANCer. It also confirms that this method is able to handle reasonably large networks.

8.2. Related work

Recently, methods exploiting both sources of information (structure and attributes) have been introduced in order to detect communities in social networks or graphs where vertices have attributes.

Steinhaeuser and Chawla [STE 08] propose measuring the similarity between vertices according to their attributes and then using the result as the weight of the edge linking the two vertices. After this pre-processing, they use a graph partitioning method in order to cluster the new weighted graph. In the hierarchical clustering of Li et al. [LI 08], after a first phase consisting in detecting community seeds with the relational information, the final communities are built under constraints defined by the attributes. This leads to merging the seeds on the basis of the similarity of their attributes. Therefore, in these methods, the two types of information are not exploited simultaneously, but one after the other.

Zhou et al. [ZHO 09], [ZHO 10] exploit the attributes in order to extend the original graph. They add new vertices representing the attributes and new edges that link original vertices having similar attributes through these new vertices. A graph partitioning is then carried out on this augmented graph. However, this approach cannot be used when the attributes have continuous values: it works only with categorical attributes.

Ester et al. [EST 06] study the “connected k-center problem” and propose a method called NetScan, which is an extended version of the K-means algorithm with an internal connectivity constraint [GE 08]. Under this constraint, two vertices in the same cluster are connected by a path that is internal to the cluster. In NetScan, as in many other partitioning methods, the number of clusters has to be known in advance. However, this condition is relaxed in the work of Moser [MOS 07].

CESNA was introduced by Yang et al. [YAN 13] to identify Communities from Edge Structure and Node Attributes. One advantage of this method is its ability to detect overlapping communities by modeling the interaction between the network structure and the node attributes.

Some other methods, focusing on dense subgraph detection, integrate the homogeneity of the attributes inside the subgraphs, cf. for instance [GÜN 10], [GÜN 11].
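The pre-processing idea described above for [STE 08], where attribute similarity becomes the weight of each existing edge, can be illustrated with the following minimal sketch. The similarity function (a decreasing function of the Euclidean distance), the function names and the toy graph are assumptions made for the example; they are not taken from [STE 08].

```python
import math

def attribute_similarity(x, y):
    """Similarity in (0, 1], decreasing with Euclidean distance (illustrative choice)."""
    return 1.0 / (1.0 + math.dist(x, y))

def reweight_edges(edges, attributes):
    """Return {(u, v): weight}, the weight being the similarity of the endpoints' attributes."""
    return {(u, v): attribute_similarity(attributes[u], attributes[v]) for u, v in edges}


# Hypothetical attributed graph: a path a - b - c - d with 2-dimensional attributes.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
attributes = {"a": (1.0, 0.0), "b": (1.0, 0.0), "c": (4.0, 4.0), "d": (4.0, 5.0)}
weights = reweight_edges(edges, attributes)
```

In this toy graph, the edge between b and c receives the lowest weight (1/6), so a weighted partitioning method would tend to cut there. This also shows why the two sources are used one after the other: the structure only decides which pairs of vertices receive a weight at all.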

Probabilistic generative models can be used [JIN 17b] to integrate communities based on structure and clusters based on attributes and to learn the correlations between the two in order to increase the quality of the detected communities. This approach also provides semantics to interpret communities. Stochastic block models are also used in [PEE 17] to evaluate the relationship between structure and attributes and to test statistically whether the two sources are correlated.

We can also mention a family of methods that propose to extend the well-known Louvain algorithm [BLO 08]; for this reason, they are probably the works most closely related to ours. Dang et al. [DAN 12] suggest modifying the modularity by considering not only the link between two vertices but also the similarity of their attributes. Thus, both types of information are simultaneously considered in the partitioning process, but with this approach, the communities provided can contain non-linked vertices. In [CRU 11], the optimization phase of the Louvain algorithm is based not only on the modularity but also on the entropy of the partition but, here again, the two types of information are not exploited simultaneously.

Finally, deep learning methods, with their strong abilities to learn proper representations of complex data sources, have been used [JIN 17a] to combine – nonlinearly – structure and content. By integrating a modularity-based embedding (for communities based on structure) and a normalized cut embedding (for content-based clustering), the deep learning model can automatically balance the respective importance of the two data sources. This model exploits the two data sources simultaneously.

Recently, some of these methods have been compared, and these experiments have confirmed that the detection of communities in an attributed graph is not a trivial problem [COM 12], [ELH 13].
To solve it efficiently, we consider that the attributes and the relational information must be exploited simultaneously, which is not the case for several of the methods cited above. Moreover, most of the methods discussed previously exploit categorical attributes but are not suited to numerical attributes. This is the reason why we proposed I-Louvain, a method to detect communities in a graph where numerical attributes are associated with the vertices [COM 15]. These attributes can correspond to features (age or weight) or to a tf-idf vector representing documents associated with the vertex. I-Louvain consists in optimizing the modularity introduced by Newman [NEW 06], on the one hand, and a new measure that is defined in the next section, on the other hand.

8.3. Inertia based modularity

Let $V$ be a set of $N$ vertices represented in a real vector space such that each element $v \in V$ is described by a vector of attributes $v = (v_1, \ldots, v_{|T|}) \in \mathbb{R}^{|T|}$. The inertia $I(V)$ of $V$ through its center of gravity $g$, also called second central moment, is a homogeneity measure defined by

$$I(V) = \sum_{v \in V} \|v - g\|^2,$$

where $\|v' - v\|$ denotes the Euclidean distance between $v$ and $v'$, and $g = (g_1, \ldots, g_{|T|})$, the center of gravity of $V$, is such that $g_j = \frac{1}{N} \sum_{v \in V} v_j$.

The inertia $I(V, v)$ of $V$ through $v$ is equal to the sum of the squared Euclidean distances between $v$ and the other elements of $V$:

$$I(V, v) = \sum_{v' \in V} \|v' - v\|^2.$$

Given a partition $\mathcal{P} = \{C_1, \ldots, C_r\}$ of $V$ in $r$ disjoint clusters, we introduce a quality measure $Q_{inertia}(\mathcal{P})$ of $\mathcal{P}$ defined by:

$$Q_{inertia}(\mathcal{P}) = \sum_{(v,v') \in V \times V} \left[ \left( \frac{I(V,v) \cdot I(V,v')}{(2N \cdot I(V))^2} - \frac{\|v - v'\|^2}{2N \cdot I(V)} \right) \cdot \delta(c_v, c_{v'}) \right] \tag*{[8.1]}$$

where $c_v$ denotes the cluster of $v \in V$ and $\delta$ is the Kronecker function, equal to 1 if $c_v$ and $c_{v'}$ are equal and 0 otherwise.

Thus, while the modularity introduced by Newman considers the strength of the link between vertices in order to cluster strongly connected vertices, our measure attempts to cluster the elements that are the most similar. This appears in the second term of equation [8.1], which is a function of the square of the distance between $v$ and $v'$ and corresponds to an observed distance between $v$ and $v'$. This observed distance is compared with an expected distance inferred from their respective inertia. This expected distance, which appears in the first term of equation [8.1], is a function of the squared distances of each of the elements $v$ and $v'$ to the other elements of $V$. Therefore, for each pair of elements $(v, v')$ from the same community, $Q_{inertia}$ compares the expected distance with the observed distance. If the former is greater than the latter, then $v$ and $v'$ are good candidates to be assigned to the same cluster.

Given the normalization factors in the denominators of the expected and observed distances, the criterion $Q_{inertia}$ ranges between −1 and 1. Indeed, the maximum value of the left term in the subtraction (equation [8.1]), containing the product of the inertia for all pairs of elements, is 1. Similarly, the right term of the criterion $Q_{inertia}$ (equation [8.1]) cannot exceed 1. Both terms are non-negative. Consequently, the measure, constrained by the Kronecker function, varies between −1 and 1.

This criterion has several interesting properties. First, its value is unchanged when a constant is added to the attribute vectors and/or when they are multiplied by a nonzero scalar. Second, the order of the attributes has no effect on the result.
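The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`sq_dist`, `total_inertia`, `inertia_through`, `q_inertia`) and the toy points are hypothetical, attribute vectors are plain tuples of floats, and the double sum of equation [8.1] is evaluated naively in quadratic time.

```python
from itertools import product


def sq_dist(a, b):
    """Squared Euclidean distance between two attribute vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_inertia(points):
    """I(V): sum of squared distances to the center of gravity g."""
    n, dim = len(points), len(points[0])
    g = [sum(p[j] for p in points) / n for j in range(dim)]
    return sum(sq_dist(p, g) for p in points)


def inertia_through(points, v):
    """I(V, v): sum of squared distances from v to every element of V."""
    return sum(sq_dist(p, v) for p in points)


def q_inertia(points, labels):
    """Inertia-based modularity of equation [8.1] for a hard partition."""
    n = len(points)
    norm = 2 * n * total_inertia(points)  # 2N * I(V)
    if norm == 0:
        raise ValueError("Q_inertia is undefined when all vectors are identical")
    through = [inertia_through(points, p) for p in points]
    q = 0.0
    for i, j in product(range(n), repeat=2):  # ordered pairs, including i == j
        if labels[i] == labels[j]:            # Kronecker delta
            expected = through[i] * through[j] / norm ** 2
            observed = sq_dist(points[i], points[j]) / norm
            q += expected - observed
    return q


# Hypothetical toy data: two tight groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(q_inertia(points, [0, 0, 1, 1]))  # groups similar points together
print(q_inertia(points, [0, 1, 0, 1]))  # mixes the two groups
```

On this toy example, the partition that groups similar points obtains the higher score, and rescaling or shifting all vectors leaves the score unchanged, in line with the first property stated above.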

However, this criterion also has limitations. It is undefined if the vectors are identical, since the total inertia is then zero. This is not really a problem, because in this case, the detection of the communities will be based only on the relational data. Moreover, like the modularity introduced by Newman, this criterion could present a resolution limit. If so, the solution proposed by Arenas et al. [ARE 08] or Reichardt et al. [REI 06] could be adapted to our criterion.

8.4. I-Louvain

As stated above, a direct application of our measure $Q_{inertia}$ is community detection in social networks represented by an attributed graph $G = (V, E)$, where $V$ is a set of vertices, $E$ is a set of edges, and each vertex $v \in V$ is described by a real attribute vector $v = (v_1, \ldots, v_j, \ldots, v_{|T|}) \in \mathbb{R}^{|T|}$ [ZHO 09]. In this section, we propose a community detection method for real attributed graphs, which exploits the inertia-based modularity $Q_{inertia}$ jointly with the Newman modularity $Q_{NG}(\mathcal{P})$. Our method, called I-Louvain, is based on the exploration principle of the Louvain method. It consists in the optimization of the global criterion $QQ^{+}(\mathcal{P})$ defined by:

$$QQ^{+}(\mathcal{P}) = Q_{NG}(\mathcal{P}) + Q_{inertia}(\mathcal{P}) \tag*{[8.2]}$$

with:

$$Q_{NG}(\mathcal{P}) = \frac{1}{2m} \sum_{v,v'} \left[ \left( A_{vv'} - \frac{k_v \cdot k_{v'}}{2m} \right) \delta(c_v, c_{v'}) \right] \tag*{[8.3]}$$

where $k_v$ is the degree of vertex $v \in V$, $A$ is the adjacency matrix associated with $G$, $m$ is the number of edges, $\delta$ is the Kronecker function, and $c_v$ is the community of $v$.

It may be noted that a combination of these criteria other than the sum could be used, for instance, to give more importance to one kind of data. However, in the general case, we consider that attributes and relational information have the same weight. In addition, it is not useful to normalize the criteria $Q_{NG}(\mathcal{P})$ and $Q_{inertia}(\mathcal{P})$ because both already take values between −1 and 1, as mentioned in the previous section.

The I-Louvain method is presented in Figure 8.1. The process begins with the discrete partition in which each vertex is in its own cluster (line 1). The algorithm is divided into two phases that are repeated. The first one is an iterative phase, coming from the Louvain algorithm, which consists in considering each vertex $v$ and its neighbors in the graph and evaluating the modularity gain induced by a move of $v$ from its community to that of each of its neighbors. Vertex $v$ is then assigned to the community for which the gain of the global criterion $QQ^{+}(\mathcal{P})$, defined in equation [8.2], is maximum and strictly positive.

This process is applied repeatedly and sequentially for all vertices until no further improvement can be obtained. This procedure is very efficient because most vertices have few neighbors, and we thus do not have to evaluate many potential moves of vertex v.
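The choice made for one vertex in this first phase can be sketched as follows. This is a deliberately naive illustration, not the algorithm of Figure 8.1: the global criterion of equation [8.2] is recomputed from scratch for every candidate move rather than updated incrementally, and the names `newman_modularity`, `inertia_modularity`, `global_criterion` and `best_move`, as well as the toy graph, are hypothetical. The sketch assumes an unweighted, undirected graph given by its adjacency matrix and a precomputed matrix of squared distances, and it uses the identity that the sum of all I(V, v) equals 2N · I(V).

```python
def newman_modularity(adj, labels):
    """Q_NG of equation [8.3] for an undirected graph given by its adjacency matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # 2m: every edge contributes to two degrees
    q = 0.0
    for v in range(n):
        for w in range(n):
            if labels[v] == labels[w]:
                q += adj[v][w] - degree[v] * degree[w] / two_m
    return q / two_m


def inertia_modularity(sq_dist, labels):
    """Q_inertia of equation [8.1] from a matrix of squared distances."""
    n = len(sq_dist)
    through = [sum(row) for row in sq_dist]  # I(V, v) for every v
    norm = sum(through)                      # equals 2N * I(V)
    q = 0.0
    for v in range(n):
        for w in range(n):
            if labels[v] == labels[w]:
                q += through[v] * through[w] / norm ** 2 - sq_dist[v][w] / norm
    return q


def global_criterion(adj, sq_dist, labels):
    """Sum of the two criteria, as in equation [8.2]."""
    return newman_modularity(adj, labels) + inertia_modularity(sq_dist, labels)


def best_move(adj, sq_dist, labels, v):
    """Community of a neighbor giving the largest strictly positive gain for v."""
    current = global_criterion(adj, sq_dist, labels)
    best_label, best_gain = labels[v], 0.0
    neighbors = [w for w in range(len(adj)) if adj[v][w] and w != v]
    for c in sorted({labels[w] for w in neighbors} - {labels[v]}):
        trial = list(labels)
        trial[v] = c
        gain = global_criterion(adj, sq_dist, trial) - current
        if gain > best_gain:  # keep only strict improvements
            best_label, best_gain = c, gain
    return best_label, best_gain


# Hypothetical toy graph: two triangles joined by the edge 2-3,
# with one numerical attribute per vertex.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
sq_dist = [[(a - b) ** 2 for b in values] for a in values]

label, gain = best_move(adj, sq_dist, [0, 1, 2, 3, 4, 5], 0)
print(label, gain)
```

Recomputing the criterion for every candidate is quadratic in the number of vertices, which is why an incremental update of the gain matters in practice.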

Figure 8.1. Algorithm of I-Louvain

If there is an increase of the modularity during the first phase, the second phase consists in building a new graph $G'$ from the partition $\mathcal{P}'$ obtained at the end of the previous phase. This second phase involves two procedures: Fusion Matrix Adjacency and Fusion Matrix Inertia.

Procedure Fusion Matrix Adjacency is identical to the one used in the Louvain method [BLO 08] and exploits only the relational information. It consists in building a new graph $G'$. The vertices of this new graph $G'$ correspond to the communities obtained at the end of the previous phase. The weights of the edges between these new vertices are given by the sum of the weights of the edges between vertices in the corresponding two communities. The edges between vertices of the same community lead to a self-loop for this community in the new network.

Procedure Fusion Matrix Inertia exploits the attributes and allows computation of the distances between the vertices of $G'$ from the distances between the vertices of $G$. If the graph $G$ considered at the beginning of the iterative phase includes $|V|$ vertices, then the matrix $D$ is a symmetric square matrix of size $|V| \times |V|$ in which each term $D[a, b]$ is the square of the distance between the vertices $v_a$ and $v_b$ of $V$. At the end of the iterative phase, a partition $\mathcal{P}'$ of $V$ in $k$ communities is obtained, in which each community will correspond to a vertex of $V'$ in the new graph $G'$, built by the procedure Fusion Matrix Adjacency. Matrix $D'$ associated with this new graph $G'$ is defined by:

$$D'[x, y] = \sum_{(v_a, v_b) \in V \times V} D[v_a, v_b] \cdot \delta(\tau(v_a), x) \cdot \delta(\tau(v_b), y) \tag*{[8.4]}$$

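where the function $\tau$ gives, for each vertex $v \in V$, the vertex $v' \in V'$ corresponding to its cluster in $\mathcal{P}'$.

8.5. Incremental computation of the modularity gain

One advantage of the Louvain method is the local optimization of the modularity done during the first phase [AYN 13]. In the same way, in I-Louvain, the global modularity of a new partition can be quickly updated. There is no need to compute it again from scratch after each move of a vertex. Indeed, the modularity gain can be computed using only local information concerning the move of the vertex from its community to that of its neighbors. This local optimization of the inertia-based modularity $Q_{inertia}(\mathcal{P})$ and, consequently, of the global criterion $QQ^{+}$ is detailed below.

Consider two partitions: $\mathcal{P} = (A, B, C_1, \ldots, C_r)$, the original partition, and $\mathcal{P}' = (A \setminus \{u\}, B \cup \{u\}, C_1, \ldots, C_r)$, the partition induced by the move of a vertex $u$ from its community $A$ to the community $B$, where $A \setminus \{u\}$ denotes the community $A$ deprived of the vertex $u$ and $B \cup \{u\}$ the community $B$ extended with $u$. In order to simplify notations in the sequel, $D_{vv'}$ denotes the term $D[v, v']$ of matrix $D$.

The inertia-based modularity of the partition $\mathcal{P}$ is defined by:

$$Q_{inertia}(\mathcal{P}) = \sum_{C \in \mathcal{P}} \frac{1}{2N \cdot I(V)} \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \tag*{[8.5]}$$

$$\begin{aligned}
Q_{inertia}(\mathcal{P}) = {} & \frac{1}{2N \cdot I(V)} \sum_{v,v' \in A} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{v,v' \in B} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{C \neq A,B} \; \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right]
\end{aligned} \tag*{[8.6]}$$

$$\begin{aligned}
Q_{inertia}(\mathcal{P}) = {} & \frac{1}{2N \cdot I(V)} \sum_{v,v' \in A \setminus \{u\}} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \left[ \frac{I(V,u)^2}{2N \cdot I(V)} - D_{uu} \right] \\
& + \frac{1}{N \cdot I(V)} \sum_{v \in A \setminus \{u\}} \left[ \frac{I(V,u) \cdot I(V,v)}{2N \cdot I(V)} - D_{uv} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{v,v' \in B} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{C \neq A,B} \; \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right]
\end{aligned} \tag*{[8.7]}$$

The inertia-based modularity of the partition $\mathcal{P}'$ is equal to:

$$Q_{inertia}(\mathcal{P}') = \sum_{C \in \mathcal{P}'} \frac{1}{2N \cdot I(V)} \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \tag*{[8.8]}$$

$$\begin{aligned}
Q_{inertia}(\mathcal{P}') = {} & \frac{1}{2N \cdot I(V)} \sum_{v,v' \in A \setminus \{u\}} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{v,v' \in B \cup \{u\}} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{C \neq A \setminus \{u\},\, B \cup \{u\}} \; \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right]
\end{aligned} \tag*{[8.9]}$$

$$\begin{aligned}
Q_{inertia}(\mathcal{P}') = {} & \frac{1}{2N \cdot I(V)} \sum_{v,v' \in A \setminus \{u\}} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{v,v' \in B} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& + \frac{1}{2N \cdot I(V)} \left[ \frac{I(V,u)^2}{2N \cdot I(V)} - D_{uu} \right] \\
& + \frac{1}{N \cdot I(V)} \sum_{v \in B} \left[ \frac{I(V,u) \cdot I(V,v)}{2N \cdot I(V)} - D_{uv} \right] \\
& + \frac{1}{2N \cdot I(V)} \sum_{C \neq A \setminus \{u\},\, B \cup \{u\}} \; \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right]
\end{aligned} \tag*{[8.10]}$$

Consequently, the modularity gain induced by the transformation of $\mathcal{P}$ into $\mathcal{P}'$ equals:

$$\Delta Q_{inertia} = Q_{inertia}(\mathcal{P}') - Q_{inertia}(\mathcal{P}) \tag*{[8.11]}$$

$$\begin{aligned}
\Delta Q_{inertia} = {} & \Bigg( \frac{1}{2N \cdot I(V)} \sum_{v,v' \in A \setminus \{u\}} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& \quad + \frac{1}{2N \cdot I(V)} \sum_{v,v' \in B} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& \quad + \frac{1}{2N \cdot I(V)} \left[ \frac{I(V,u)^2}{2N \cdot I(V)} - D_{uu} \right] \\
& \quad + \frac{1}{N \cdot I(V)} \sum_{v \in B} \left[ \frac{I(V,u) \cdot I(V,v)}{2N \cdot I(V)} - D_{uv} \right] \\
& \quad + \frac{1}{2N \cdot I(V)} \sum_{C \neq A \setminus \{u\},\, B \cup \{u\}} \; \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \Bigg) \\
& - \Bigg( \frac{1}{2N \cdot I(V)} \sum_{v,v' \in A \setminus \{u\}} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& \quad + \frac{1}{2N \cdot I(V)} \sum_{v,v' \in B} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \\
& \quad + \frac{1}{2N \cdot I(V)} \left[ \frac{I(V,u)^2}{2N \cdot I(V)} - D_{uu} \right] \\
& \quad + \frac{1}{N \cdot I(V)} \sum_{v \in A \setminus \{u\}} \left[ \frac{I(V,u) \cdot I(V,v)}{2N \cdot I(V)} - D_{uv} \right] \\
& \quad + \frac{1}{2N \cdot I(V)} \sum_{C \neq A,B} \; \sum_{v,v' \in C} \left[ \frac{I(V,v) \cdot I(V,v')}{2N \cdot I(V)} - D_{vv'} \right] \Bigg)
\end{aligned} \tag*{[8.12]}$$

$$\Delta Q_{inertia} = \frac{1}{N \cdot I(V)} \sum_{v \in B} \left[ \frac{I(V,u) \cdot I(V,v)}{2N \cdot I(V)} - D_{uv} \right] - \frac{1}{N \cdot I(V)} \sum_{v \in A \setminus \{u\}} \left[ \frac{I(V,u) \cdot I(V,v)}{2N \cdot I(V)} - D_{uv} \right] \tag*{[8.13]}$$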
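As a sanity check on the derivation, the gain of equation [8.13] can be compared numerically with the difference between two full evaluations of the criterion. The sketch below is illustrative only: the helper names (`precompute`, `q_inertia`, `delta_q_inertia`) and the toy values are hypothetical, and the squared distances are assumed to be available as a dense matrix. It relies on the identity that the sum of all I(V, v) equals 2N · I(V).

```python
def precompute(sq_dist):
    """Return I(V, v) for every v, and their sum, which equals 2N * I(V)."""
    through = [sum(row) for row in sq_dist]
    return through, sum(through)


def q_inertia(sq_dist, labels):
    """Inertia-based modularity recomputed from scratch (equation [8.5])."""
    through, norm = precompute(sq_dist)
    n = len(labels)
    total = 0.0
    for v in range(n):
        for w in range(n):
            if labels[v] == labels[w]:
                total += through[v] * through[w] / norm - sq_dist[v][w]
    return total / norm


def delta_q_inertia(sq_dist, labels, u, target, through, norm):
    """Gain of equation [8.13] when u leaves its community A for `target` (B)."""
    source = labels[u]
    if target == source:
        return 0.0  # no move, no gain

    def term(v):
        return through[u] * through[v] / norm - sq_dist[u][v]

    joined = sum(term(v) for v, c in enumerate(labels) if c == target)
    left = sum(term(v) for v, c in enumerate(labels) if c == source and v != u)
    return 2.0 * (joined - left) / norm  # 1 / (N * I(V)) == 2 / norm


# Hypothetical toy data: six vertices with one numerical attribute each.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
sq_dist = [[(a - b) ** 2 for b in values] for a in values]
labels = [0, 0, 1, 1, 1, 1]  # vertex 2 starts in the less similar community
through, norm = precompute(sq_dist)

gain = delta_q_inertia(sq_dist, labels, u=2, target=0, through=through, norm=norm)
moved = labels[:2] + [0] + labels[3:]
brute_force = q_inertia(sq_dist, moved) - q_inertia(sq_dist, labels)
print(gain, brute_force)  # the two values agree up to rounding
```

Only the members of the two communities involved are visited by `delta_q_inertia`, whereas the from-scratch evaluation visits every pair of vertices.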

One can note that the variation of modularity resulting from the removal of vertex $u$ from its community $A$, which corresponds to the second term in equation [8.13], is the same whatever its new community. It follows that candidate communities can be compared by taking into account only the increase (or decrease) induced by the assignment of $u$ to its new community, corresponding to the first term in equation [8.13]. This confirms that the optimization of $Q_{inertia}$ can be done using a local computation, based on the information related to the assignment of vertex $u$ to its new community.

8.6. Evaluation of I-Louvain method

Our experiments aim at evaluating the performance of I-Louvain, which exploits attributes and relational data, on artificial networks. The datasets have been generated with the DANCer-Generator [LAR 17]. This generator has been chosen because it allows us to build an attributed graph having a community structure as well as known properties of real-world networks such as preferential attachment and homophily. We refer the reader interested in experiments done on real networks to [COM 15].

8.6.1. Performance of I-Louvain on artificial datasets

In the first set of experiments, I-Louvain is compared with methods based on only one type of data: K-means for the attributes and Louvain for the relations. The Louvain source code is the one proposed by Thomas Aynaud in 2009¹. We retained Louvain not only because I-Louvain is an extension of this algorithm but also because it is known for its ability to handle large datasets. Indeed, with a computational complexity of O(N log N), it has been shown to have a lower computational complexity ([XIE 11], [YAN 16]) than other methods such as FastGreedy, Leading eigenvector, Spinglass, or Walktrap. Moreover, in a very recent study, Louvain surpasses Infomap in nearly all of the experiments [EMM 16]. Ten graphs, each having 10,000 nodes, have been generated, and the average, maximum, and minimum performance evaluations computed on the ten graphs are reported as final results.
In this experiment, since we have a ground truth (set in DANCer), the results are evaluated using the Normalized Mutual Information (NMI) derived from the mutual information (MI) and entropy (H), and defined by [STR 03]:

$$NMI(\mathcal{P}_1, \mathcal{P}_2) = \frac{MI(\mathcal{P}_1, \mathcal{P}_2)}{\sqrt{H(\mathcal{P}_1)\, H(\mathcal{P}_2)}} \tag*{[8.14]}$$

It allows us to evaluate the similarity between the ground truth P1 and the partition P2 , provided by an algorithm. The higher the NMI is, the better the result. Table 8.1 presents the results provided by I-Louvain and those obtained by Louvain and K-means. The results confirm the interest of using the two kinds of information.
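For readers who wish to reproduce this kind of evaluation, equation [8.14] can be sketched with the Python standard library alone. This is an illustrative sketch rather than the code used in the experiments: the function names are hypothetical, partitions are assumed to be given as lists of cluster labels aligned on the same vertices, and the value returned when a partition has zero entropy is a convention chosen here, not one stated in the text.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy H of a partition given as a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information MI between two partitions of the same elements."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))  # sizes of the pairwise cluster intersections
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI of equation [8.14], normalized by the geometric mean of the entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Assumed convention: two single-cluster partitions agree perfectly.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


ground_truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 0, 0, 0, 0]
print(nmi(ground_truth, found))
```

The measure depends only on how the elements are grouped, not on the label values, so renaming the clusters of a partition does not change its NMI with the ground truth.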

1 http://perso.crans.org/aynaud/communities/.

Method      NMI (average)   Min      Max
Louvain     0.7640          0.7203   0.8032
K-means     0.5199          0.4373   0.6022
I-Louvain   0.8151          0.7795   0.8894

Table 8.1. Evaluation according to the normalized mutual information (NMI)

Indeed, the average NMI is equal to 0.7640 for Louvain, whereas it is only 0.5199 for K-means, even though the number of clusters to be identified is given as a parameter to this algorithm. With an average NMI equal to 0.8151, our proposed algorithm I-Louvain outperforms both methods. These results confirm the benefit of I-Louvain for improving community detection.

8.6.2. Run-time of I-Louvain

In the second set of experiments, we study how the run-times of I-Louvain and Louvain evolve when the number of vertices increases. We consider attributed networks where the number of nodes varies from 10,000 to 100,000. For each size between 10,000 and 100,000, five graphs have been generated with DANCer, and the average results are computed on these five graphs. Figure 8.2 presents the run-time evolution against the number of vertices $|V|$.

Figure 8.2. Run-time of I-Louvain and Louvain as a function of the number of nodes |V|. For a color version of this figure, see www.iste.co.uk/diday/advances.zip

Louvain is still several orders of magnitude faster. Indeed, in the iterative phase, for each vertex v, both algorithms compute the gain in modularity obtained by moving this vertex to the community of one of its neighbors.

The gain in Newman modularity (used by Louvain) is a sum over the neighbors of v (for most nodes of most networks, the number of neighbors is small and can be considered as a small constant). However, the change in inertia-based modularity (used by I-Louvain) is a sum over all the nodes of the new community, which can be large. Thus, the cost of updating the inertia-based modularity can be several orders of magnitude larger than that of updating the Newman modularity.

8.7. Conclusion

In this chapter, we have studied the problem of attributed graph clustering when the vertices are described by real attributes. Inspired by the Newman modularity, we have presented a modularity measure based on inertia. This measure is suited for assessing the quality of a partition of elements represented by attributes in a real vector space. We also introduced I-Louvain, an algorithm that combines our criterion with Newman’s modularity in order to detect communities in attributed graphs. We demonstrated formally that this new algorithm can be optimized in its iterative phase by an incremental computation of the inertia. As shown in the experiments, by jointly using the relational information and the attributes, I-Louvain detects the communities more accurately than methods using only the attributes or only the relations.

This improvement is obtained, of course, at the cost of a more time-consuming algorithm. However, we are currently investigating the use of distributed computing to make this algorithm more scalable. Indeed, there are some implementations of the Louvain algorithm using the Spark framework². To make a distributed version, it is necessary that the choice of the new community for each node (line 8 of the algorithm) is done in parallel for all the vertices simultaneously. This means that the algorithm cannot use the knowledge of the new community of a node when computing the new community of another one. Therefore, the new community computed for a node may not be the optimal one, and this distributed Louvain implementation may not converge as well, or it may require more iterations to do so. Indeed, some preliminary experiments have shown that, on some graphs, the NMI obtained with this Spark implementation of Louvain is significantly lower than with the classic Louvain algorithm. However, we think that it is possible to improve the convergence of the distributed version of Louvain at the cost of an increased number of iterations. Since, in each iteration, computations on all vertices are done in parallel, we nevertheless expect the total computation time to be competitive on large graphs. Thanks to the incremental computation of the inertia, a distributed version of I-Louvain should then also be efficient, even on very large networks.

2 https://github.com/zhanif3/spark-distributed-louvain-modularity.

8.8. References

[ARE 08] ARENAS A., FERNÁNDEZ A., GÓMEZ S., “Analysis of the structure of complex networks at different resolution levels”, New Journal of Physics, vol. 10, no. 5, p. 053039, 2008.

[AYN 13] AYNAUD T., BLONDEL V., GUILLAUME J.-L. et al., “Multilevel local optimization of modularity”, in BICHOT C., SIARRY P. (eds), Graph Partitioning, ISTE Ltd, London and John Wiley & Sons, New York, pp. 315–345, 2013.

[BIC 13] BICHOT C., SIARRY P., Graph Partitioning, ISTE Ltd, London and John Wiley & Sons, New York, 2013.

[BLO 08] BLONDEL V.D., GUILLAUME J.-L., LAMBIOTTE R. et al., “Fast unfolding of community hierarchies in large networks”, CoRR, vol. abs/0803.0476, 2008.

[COM 12] COMBE D., LARGERON C., EGYED-ZSIGMOND E. et al., “Combining relations and text in scientific network clustering”, International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 1280–1285, 2012.

[COM 15] COMBE D., LARGERON C., GÉRY M. et al., “I-Louvain: an attributed graph clustering method”, Proceedings of Advances in Intelligent Data Analysis XIV – 14th International Symposium, IDA 2015, Saint Etienne, France, October 22–24, 2015, pp. 181–192, 2015.

[CRU 11] CRUZ J.D., BOTHOREL C., POULET F., “Entropy based community detection in augmented social networks”, Computational Aspects of Social Networks (CASoN 2011), pp. 163–168, 2011.

[DAN 12] DANG T.A., VIENNET E., “Community detection based on structural and attribute similarities”, International Conference on Digital Society (ICDS), pp. 7–12, 2012.

[DIN 01] DING C., HE X., ZHA H. et al., “A min-max cut algorithm for graph partitioning and data clustering”, IEEE International Conference on Data Mining, pp. 107–114, 2001.

[ELH 13] ELHADI H., AGAM G., “Structure and attributes community detection: comparative analysis of composite, ensemble and selection methods”, 7th Workshop on Social Network Mining and Analysis, SNAKDD ’13, ACM, New York, NY, USA, pp. 10:1–10:7, 2013.

[EMM 16] EMMONS S., KOBOUROV S., GALLANT M. et al., “Analysis of network clustering algorithms and cluster quality metrics at scale”, PLOS ONE, vol. 11, no. 7, p. e0159161, 2016.

[EST 06] ESTER M., GE R., GAO B. et al., “Joint cluster analysis of attribute data and relationship data: the connected k-center problem”, SIAM International Conference on Data Mining, ACM Press, pp. 25–46, 2006.

[FJÄ 98] FJÄLLSTRÖM P.-O., “Algorithms for graph partitioning: a survey”, Science, vol. 3, no. 10, 1998.

[FLA 03] FLAKE G., TARJAN R., TSIOUTSIOULIKLIS K., “Graph clustering and minimum cut trees”, Internet Mathematics, vol. 1, no. 4, pp. 385–408, 2003.

[FOR 10] FORTUNATO S., “Community detection in graphs”, Physics Reports, vol. 486, nos 3–5, pp. 75–174, June 2010.

[FOR 16] FORTUNATO S., HRIC D., “Community detection in networks: a user guide”, CoRR, vol. abs/1608.00163, 2016.

[GÜN 10] GÜNNEMANN S., FÄRBER I., BODEN B. et al., “Subspace clustering meets dense subgraph mining: a synthesis of two paradigms”, IEEE International Conference on Data Mining, pp. 845–850, 2010.

[GÜN 11] GÜNNEMANN S., BODEN B., SEIDL T., “DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors”, Machine Learning and Knowledge Discovery in Databases, Springer, pp. 565–580, 2011.

[GE 08] GE R., ESTER M., GAO B.J. et al., “Joint cluster analysis of attribute data and relationship data”, ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 1–35, 2008.

[GIR 02] GIRVAN M., NEWMAN M.E.J., “Community structure in social and biological networks”, PNAS, vol. 99, no. 12, pp. 7821–7826, 2002.

[JIN 17a] JIN D., GE M., LI Z. et al., “Using deep learning for community discovery in large social networks”, IEEE 29th International Conference on Tools with Artificial Intelligence, ICTAI, 2017.

[JIN 17b] JIN D., WANG X., HE D. et al., “Identification of generalized communities with semantics in networks with content”, IEEE 29th International Conference on Tools with Artificial Intelligence, ICTAI, 2017.

[KAR 11] KARRER B., NEWMAN M.E.J., “Stochastic blockmodels and community structure in networks”, Physical Review E, vol. 83, p. 016107, 2011.

[KER 70] KERNIGHAN B.W., LIN S., “An efficient heuristic procedure for partitioning graphs”, Bell System Technical Journal, vol. 49, no. 2, pp. 291–307, 1970.

[LAN 09] LANCICHINETTI A., FORTUNATO S., “Community detection algorithms: a comparative analysis”, Physical Review E, vol. 80, no. 5, p. 056117, 2009.

[LAR 17] LARGERON C., MOUGEL P.N., BENYAHIA O. et al., “DANCer: dynamic attributed networks with community structure generation”, Knowledge and Information Systems, vol. 53, no. 1, pp. 109–151, 2017.

[LI 08] LI H., NIE Z., LEE W.-C.W. et al., “Scalable community discovery on textual data with relations”, 17th ACM Conference on Information and Knowledge Management, pp. 1203–1212, 2008.

[MOS 07] MOSER F., GE R., ESTER M., “Joint cluster analysis of attribute and relationship data without a-priori specification of the number of clusters”, 13th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 510–519, 2007.

[NEW 04a] NEWMAN M., “Detecting community structure in networks”, The European Physical Journal B-Condensed Matter and Complex Systems, vol. 38, no. 2, pp. 321–330, 2004.

[NEW 04b] NEWMAN M., GIRVAN M., “Finding and evaluating community structure in networks”, Physical Review E, vol. 69, no. 2, pp. 1–16, 2004.

[NEW 06] NEWMAN M., “Modularity and community structure in networks”, Proceedings of the National Academy of Sciences of the United States of America, vol. 103, no. 23, pp. 8577–8696, 2006.

[NEW 15] NEWMAN M.E.J., PEIXOTO T.P., “Generalized communities in networks”, Physical Review Letters, vol. 115, p. 088701, 2015.

[PEE 17] PEEL L., LARREMORE D.B., CLAUSET A., “The ground truth about metadata and community detection in networks”, Science Advances, vol. 3, no. 5, p. e1602548, 2017.

[PRA 13] PRADO A., PLANTEVIT M., ROBARDET C. et al., “Mining graph topological patterns: finding covariations among vertex descriptors”, IEEE Trans. Knowl. Data Eng., vol. 25, no. 9, pp. 2090–2104, 2013.

[REI 06] REICHARDT J., BORNHOLDT S., “Statistical mechanics of community detection”, Physical Review E, vol. 74, no. 1, p. 016110, 2006.

[RUA 13] RUAN Y., FUHRY D., PARTHASARATHY S., “Efficient community detection in large networks using content and links”, Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pp. 1089–1098, 2013.

[SAT 09] SATULURI V., PARTHASARATHY S., “Scalable graph clustering using stochastic flows: applications to community discovery”, 15th SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 737–746, 2009.

[SCH 07] SCHAEFFER S., “Graph clustering”, Computer Science Review, vol. 1, no. 1, pp. 27–64, 2007.

[SHI 00] SHI J., MALIK J., “Normalized cuts and image segmentation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.

[STA 13] STATTNER E., COLLARD M., “From frequent features to frequent social links”, International Journal of Information System Modeling and Design (IJISMD), vol. 4, no. 3, pp. 76–98, 2013.

[STE 08] STEINHAEUSER K., CHAWLA N.V., “Community detection in a large real-world social network”, in LIU H., SALERNO J.J., YOUNG M.J. (eds), Social Computing, Behavioral Modeling, and Prediction, Springer, Boston, MA, pp. 168–175, 2008.

[STR 03] STREHL A., GHOSH J., “Cluster ensembles – a knowledge reuse framework for combining multiple partitions”, The Journal of Machine Learning Research, vol. 3, pp. 583–617, 2003.

[VON 07] VON LUXBURG U., “A tutorial on spectral clustering”, Statistics and Computing, vol. 17, no. 4, pp. 395–416, 2007.

[XIE 11] XIE J., SZYMANSKI B.K., “Community detection using a neighborhood strength driven label propagation algorithm”, Network Science Workshop, 2011.

[YAN 13] YANG J., MCAULEY J.J., LESKOVEC J., “Community detection in networks with node attributes”, ICDM, pp. 1151–1156, 2013.

[YAN 16] YANG Z., ALGESHEIMER R., TESSONE C.J., “A comparative analysis of community detection algorithms on artificial networks”, CoRR, vol. abs/1608.00763, 2016.

[ZHO 09] ZHOU Y., CHENG H., YU J., “Graph clustering based on structural/attribute similarities”, VLDB Endowment, vol. 2, no. 1, pp. 718–729, 2009.

[ZHO 10] ZHOU Y., CHENG H., YU J.X., “Clustering large attributed graphs: an efficient incremental approach”, 2010 IEEE International Conference on Data Mining, pp. 689–698, 2010.

Part 4 Clustering

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

9 A Novel Clustering Method with Automatic Weighting of Tables and Variables

9.1. Introduction

Clustering analysis is one of the most important methods for unsupervised learning from data and is widely used in areas such as pattern recognition, bioinformatics, data mining and image processing, among others. Clustering aims to provide homogeneous clusters such that the similarity between the objects within the same group is high, and the similarity between objects belonging to different groups is low [JAI 10]. The main objective of data clustering is to classify a set of objects into groups by minimizing an objective criterion W that measures the homogeneity of the partition of the objects.

Problems involving the classification of complex data appear every day, so new clustering models need to consider different perspectives, views, or tables to cope with these problems. Nowadays, it is increasingly common to have separate tables that describe objects from different perspectives or views. This approach can be used to solve clustering problems in many application fields. A perspective or view is described by a set of variables; the table contains the values taken by this set of variables, defined by the user, on an observed sample. For example, in a multi-source approach, different representations of several sensors or signatures (Fourier and Karhunen coefficients) can be proposed to describe the same observation. Each of these datasets can be considered as a separate view of different sensors. In the field of marketing, customer information is available on

Chapter written by Rodrigo C. DE ARAÚJO, Francisco DE ASSIS TENORIO DE CARVALHO and Yves LECHEVALLIER.

different databases (bank, store, administration, etc.). On social networks, various sources (emails, collaborations, etc.) are represented by multiple tables of data. In XML documents, several blocks (images, video and text) or sections of Web pages can be considered as different points of view. In Bioinformatics, gene expression and location data can contribute to detect gene iteration or regulation. Currently, there are two approaches to manage multi-view data in the clustering task. Distributed methods: they aim to cluster the views independently, using the same or different algorithms. The resulting groups provided by each view are combined to obtain a final partition [REZ 09], [XIE 13]. Centralized methods: they take into account all views simultaneously to perform the grouping of data. Each view has a relevance weight that will influence the formation of the groups of individuals [TZO 12]. In this work, we present a centralized method where weights are assigned simultaneously to the tables or views and to the variables of each table during the partitioning process. The proposed method is a multiple tables hard clustering algorithm with automated computation of weights for both tables and variables, in such a way that the relevant tables, as well as the relevant variables in each table, are selected for clustering. Our approach aims at building a single partition of objects that takes into account multiple data tables or multi-views simultaneously. These tables are generated using different sets of variables, and a specific distance is assigned to each variable. Our method provides a partition and a Fr´echet mean for each cluster and learns a weight vector for each table; each variable optimizes an adequacy criterion that measures the fitting between clusters and their Fr´echet means using adaptive weights. The Fr´echet mean is a generalization of centroids to metric spaces, giving a single representative and central object for each cluster. 
A method in which both table and variable weights are taken into account has been presented by Xu et al. in [XU 16], but the corresponding algorithm depends on two input parameters.

9.2. Related Work

Research on the clustering of multiple tables or datasets has been extensive in recent years, resulting in the emergence of different algorithms. Some centralized methods and algorithms are briefly presented hereafter:
– Cofkm [2009], a centralized multiple-view clustering method proposed by G. Cleuziou et al. [CLE 09];
– MVKKM [2012], a multi-view kernel k-means proposed by Tzortzis and Likas [TZO 12];

A Novel Clustering Method with Automatic Weighting of Tables and Variables


– Co-regspec [2013], a co-regularized spectral clustering proposed by X. Wang et al. [WAN 13];
– TW-k-means [2013], an automated two-level variable weighting algorithm proposed by Chen et al. [CHE 13];
– WMCFS [2016], a weighted multi-view clustering with feature selection proposed by Y.-M. Xu et al. [XU 16].

These methods optimize an adequacy criterion W based on the metric space. Our approach uses this strategy.

The method presented by De Carvalho et al. in [CAR 12] is able to cluster objects considering different views that are represented by matrices of dissimilarity (relational data). This method takes all views into account, assigning weights to them during the iterative steps of the algorithm.

Tzortzis and Likas [TZO 12] proposed two algorithms that weight the different views according to their contribution to the resulting partition. Although the algorithms consider every view when grouping elements, the performance of the model depends on an input parameter that controls the sparsity of the view weights. There is no simple way to find the value of this parameter, and its tuning requires a priori labels.

The model presented by Wang et al. in [WAN 13] assigns weights to each attribute of each view locally. For example, a particular attribute of a view may affect the formation of each cluster differently. The method, however, does not consider that each view may also have an independent relevance weight that influences the partitioning of the objects. Thus, views that have many noisy attributes can greatly disrupt the clustering of the data.

Xu et al. [XU 16] proposed an algorithm in which the weighting of the views is taken into account and variable selection is used to cluster the objects. The proposed algorithm needs the tuning of two parameters: the former controls the sparsity of the weights of the views, and the latter controls the sparsity of the weights of the variables.
Choosing these parameters is a tricky problem, because they may vary within a relatively wide range, and a bad choice can negatively affect the performance of the algorithm. Moreover, the parameter tuning is different for each data set, and the tuning procedure is only viable if the objects are labeled a priori.

9.3. Definitions, notations and objective

Let $E = \{e_1, \ldots, e_n\}$ be a set of $n$ objects that are described according to $V$ data tables or views $X_1, \ldots, X_v, \ldots, X_V$. The vector $x = (x^{(1,1)}, \ldots, x^{(d_V,V)})$ is an element of the global representation space $\Phi$. Table 9.1 represents the $V$ data tables.


Table 1: $X_1$ | ... | Table v: $X_v$ | ... | Table V: $X_V$
$x_1^{(1,1)} \ldots x_1^{(d_1,1)}$ | ... | $x_1^{(1,v)} \ldots x_1^{(d_v,v)}$ | ... | $x_1^{(1,V)} \ldots x_1^{(d_V,V)}$
... | ... | ... | ... | ...
$x_i^{(1,1)} \ldots x_i^{(d_1,1)}$ | ... | $x_i^{(1,v)} \ldots x_i^{(d_v,v)}$ | ... | $x_i^{(1,V)} \ldots x_i^{(d_V,V)}$
... | ... | ... | ... | ...
$x_n^{(1,1)} \ldots x_n^{(d_1,1)}$ | ... | $x_n^{(1,v)} \ldots x_n^{(d_v,v)}$ | ... | $x_n^{(1,V)} \ldots x_n^{(d_V,V)}$

Table 9.1. Multiple tables

Each object $e_i \in E$ is described by a vector $x_i = (X_i^1, \ldots, X_i^v, \ldots, X_i^V)$, where $X_i^v = (x_i^{(1,v)}, \ldots, x_i^{(d_v,v)}) \in \Phi^v$ is the description of the object $e_i$ in the table $v$, and $d_v$ is the number of variables in the table $v$. $x_i^{(j,v)} \in \Phi_j^v$ is the value of the object $e_i$ for the variable $j$ belonging to the table $v$, where $\Phi_j^v$ is the representation space of the variable $j$ belonging to the table $v$.

9.3.1. Choice of distances

The choice of the distance depends on the type of variable and the clustering objective of the user. In data analysis, variables used to describe the objects are single-valued. However, in many real situations, single-valued variables can be very restrictive, especially when analyzing a group of objects where the variability inherent to the group must be taken into account. A group of objects cannot be properly described by the usual single-valued variables without loss of the information concerning the variability. Symbolic Data Analysis [BOC 00] defines two types of variables.

Single-valued variables (classical variables):
– quantitative or numerical variables;
– qualitative or categorical variables;
– binary variables.

Multi-valued variables (symbolic variables):
– interval-valued variables or quantitative multi-valued variables;
– categorical multi-valued variables;
– modal or histogram variables.

When the table or the global table contains different types of variables, it is necessary to select a distance adapted to each type of variable.


The choice of the distance $d_{(j,v)}$ in the space $\Phi_j^v$ is left to the user. $d_{(j,v)}$ is the distance associated with the variable $j$ belonging to the table $v$, so that $(\Phi_j^v, d_{(j,v)})$ is a metric space. For example, when the variable is numerical, the representation space $\Phi_j^v$ is $\mathbb{R}$. When the variable $(j, v)$ is an interval-valued variable, $\Phi_j^v$ is the set of bounded closed intervals of $\mathbb{R}$. The global distance $d$ for mixed variables or different tables is a positively weighted linear combination of distances, given by the following formula:

\[
d(x_i, x_l) = \sum_{v=1}^{V} \omega_v \sum_{j=1}^{d_v} \lambda_j^v \, d_{(j,v)}\big(x_i^{(j,v)}, x_l^{(j,v)}\big) \qquad [9.1]
\]

where:
– $x_i = (x_i^{(1,1)}, \ldots, x_i^{(d_V,V)}) \in \Phi$;
– $\lambda_j^v$ is the weight of the variable $j$ of the table $v$ and $\lambda_j^v \ge 0$;
– $\omega_v$ is the weight of the table $v$ and $\omega_v \ge 0$;
– $d_{(j,v)}$ is the distance associated with the variable $j$ of the table $v$.

Since all $(\Phi_j^v, d_{(j,v)})$ are metric spaces, $(\Phi, d)$ is a metric space because all weights are positive or null.

The Fréchet mean is a generalization of centroids to metric spaces, giving a single representative and central object for a cluster. Let $(\Phi, d)$ be a metric space, and let each object $e_i \in E$ be described by a vector $x_i \in \Phi$. For any element $x$ in $\Phi$, the Fréchet variance $w$ is the sum of squared distances from $x$ to the $x_i$:

\[
w(E, x) = \sum_{i=1}^{n} d^2(x_i, x)
\]

$M$ is the set of elements of $\Phi$ that locally minimize the function $w$. If the set $M$ contains a single element $g$, then it is the Fréchet mean. For a cluster $C_k$, it is given by

\[
g = \arg\min_{x \in \Phi} \sum_{e_i \in C_k} d^2(x_i, x) \qquad [9.2]
\]
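As a numerical illustration of the Fréchet mean on the real line, the sketch below minimizes the Fréchet variance by brute force over a finite grid of candidates. The sample, the grid and the function names are hypothetical choices made for this illustration; they are not part of the method itself.

```python
from statistics import mean, median

def frechet_variance(points, x, dist):
    """Frechet variance w: sum of squared distances from candidate x to all points."""
    return sum(dist(p, x) ** 2 for p in points)

def frechet_mean(points, candidates, dist):
    """Candidate with the smallest Frechet variance (first one wins on ties)."""
    return min(candidates, key=lambda x: frechet_variance(points, x, dist))

# Hypothetical sample and candidate grid 0.0, 0.1, ..., 15.0 (illustration only).
points = [1.0, 2.0, 4.0, 9.0, 14.0]
grid = [i / 10 for i in range(151)]

def euclid(a, b):
    return abs(a - b)            # squared: (a - b)^2

def sqrt_euclid(a, b):
    return abs(a - b) ** 0.5     # squared: |a - b|

g_mean = frechet_mean(points, grid, euclid)
g_median = frechet_mean(points, grid, sqrt_euclid)
print(g_mean, mean(points))      # the minimizer coincides with the arithmetic mean
print(g_median, median(points))  # the minimizer coincides with the median
```

The grid search is only meant to make the definition tangible; for these two distances the minimizer is of course available in closed form.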

For real numbers with the usual Euclidean distance, the Fréchet mean is the classical mean. The median is also a Fréchet mean, obtained with the square root of the Euclidean distance.

9.3.2. Criterion W measures the homogeneity of the partition P on the set of tables

For the variable $(j, v)$ and the cluster $C_k$ of the partition $P$, the Fréchet variance is $w(C_k, g_k^{(j,v)})$, where $g_k^{(j,v)}$ is the Fréchet mean. The Fréchet variance $J_j^v(P)$ of the partition $P$ is $\sum_{k=1}^{K} w(C_k, g_k^{(j,v)})$. The weight vector $\Omega$ is associated with the tables, and the weight matrix $\Lambda$ is associated with the variables. Therefore, the criterion $W$ is defined by:

\[
\begin{aligned}
W(P, G, \Lambda, \Omega) &= \sum_{k=1}^{K} W_k = \sum_{k=1}^{K} \sum_{e_i \in C_k} \sum_{v=1}^{V} \omega_v \sum_{j=1}^{d_v} \lambda_j^v \, d_{(j,v)}^2\big(x_i^{(j,v)}, g_k^{(j,v)}\big) \\
&= \sum_{v=1}^{V} \omega_v \sum_{j=1}^{d_v} \lambda_j^v \sum_{k=1}^{K} \sum_{e_i \in C_k} d_{(j,v)}^2\big(x_i^{(j,v)}, g_k^{(j,v)}\big) \\
&= \sum_{v=1}^{V} \omega_v \sum_{j=1}^{d_v} \lambda_j^v \sum_{k=1}^{K} w\big(C_k, g_k^{(j,v)}\big) \\
&= \sum_{v=1}^{V} \omega_v J_v(P) = \sum_{v=1}^{V} \omega_v \sum_{j=1}^{d_v} \lambda_j^v J_j^v(P)
\end{aligned}
\]

where:
– a global Fréchet mean $G = (G_1, \ldots, G_K)$ is represented in Table 9.2, where $G_k$ is the Fréchet mean of the cluster $C_k$;
– $G_k = (G_k^1, \ldots, G_k^v, \ldots, G_k^V)$, where $G_k^v = (g_k^{(1,v)}, \ldots, g_k^{(j,v)}, \ldots, g_k^{(d_v,v)})$ is an element of $\Phi^v$ and $g_k^{(j,v)} \in \Phi_j^v$;
– $W_k$ measures the homogeneity of the cluster $C_k$;
– $J_v(P)$ measures the homogeneity of the table $v$;
– $J_j^v(P) = \sum_{k=1}^{K} \sum_{e_i \in C_k} d_{(j,v)}^2\big(x_i^{(j,v)}, g_k^{(j,v)}\big)$ is the Fréchet variance of the variable $j$ in the table $v$ on the partition $P$ and measures its homogeneity;
– $d_{(j,v)}\big(x_i^{(j,v)}, g_k^{(j,v)}\big)$ measures the local distance between an object $e_i \in C_k$ and the Fréchet mean $g_k^{(j,v)}$ on the variable $j$ from the view $v$.

      | Table 1: $G^1$ | ... | Table v: $G^v$ | ... | Table V: $G^V$
$G_1$ | $g_1^{(1,1)} \ldots g_1^{(d_1,1)}$ | ... | $g_1^{(1,v)} \ldots g_1^{(d_v,v)}$ | ... | $g_1^{(1,V)} \ldots g_1^{(d_V,V)}$
...   | ... | ... | ... | ... | ...
$G_k$ | $g_k^{(1,1)} \ldots g_k^{(d_1,1)}$ | ... | $g_k^{(1,v)} \ldots g_k^{(d_v,v)}$ | ... | $g_k^{(1,V)} \ldots g_k^{(d_V,V)}$
...   | ... | ... | ... | ... | ...
$G_K$ | $g_K^{(1,1)} \ldots g_K^{(d_1,1)}$ | ... | $g_K^{(1,v)} \ldots g_K^{(d_v,v)}$ | ... | $g_K^{(1,V)} \ldots g_K^{(d_V,V)}$

Table 9.2. Fréchet mean tables
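The criterion W compares the data tables with the Fréchet mean tables, and this can be sketched in a few lines. The example below assumes numerical variables with squared Euclidean distances; the nested-list layout (`tables[v][i][j]`, `prototypes[k][v][j]`) and the toy values are hypothetical conventions chosen for the illustration.

```python
def criterion_w(tables, labels, prototypes, lam, omega):
    """Criterion W for numerical variables with squared Euclidean distances.

    tables[v][i][j]     : value of object i on variable j of table v
    labels[i]           : index k of the cluster containing object i
    prototypes[k][v][j] : prototype (Frechet mean) of cluster k
    lam[v][j], omega[v] : variable and table weights
    """
    total = 0.0
    for v, table in enumerate(tables):
        for i, row in enumerate(table):
            proto = prototypes[labels[i]][v]
            for j, value in enumerate(row):
                total += omega[v] * lam[v][j] * (value - proto[j]) ** 2
    return total

# Hypothetical toy data: 4 objects, table 0 with one variable, table 1 with two.
tables = [
    [[0.0], [2.0], [10.0], [12.0]],
    [[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [5.0, 7.0]],
]
labels = [0, 0, 1, 1]
prototypes = [
    [[1.0], [1.0, 2.0]],    # cluster 0: means of objects 0 and 1
    [[11.0], [5.0, 6.0]],   # cluster 1: means of objects 2 and 3
]
unit_lam = [[1.0], [1.0, 1.0]]

w_unit = criterion_w(tables, labels, prototypes, unit_lam, [1.0, 1.0])
w_weighted = criterion_w(tables, labels, prototypes, unit_lam, [2.0, 0.5])
print(w_unit, w_weighted)  # 8.0 10.0
```

With unit weights each table contributes 4 to the criterion; the second call only changes the table weights, so the value moves from 4 + 4 to 2 x 4 + 0.5 x 4.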


The criterion $W$ measures the error between the multiple tables (Table 9.1) and the Fréchet mean tables (Table 9.2). Minimizing the criterion $W$ yields the best partition $P$.

9.3.3. Optimization of the criterion W

Trivial solutions of the optimization of the criterion $W$ would be obtained with:
– $\omega_v = 0$, $\forall v = 1$ to $V$; or
– $\lambda_j^v = 0$, $\forall v = 1$ to $V$ and $\forall j = 1$ to $d_v$.

Therefore, it is necessary to introduce some constraints. The criterion proposed by Y.-M. Xu et al. [XU 16] is the objective criterion $W(P, G, \Lambda, \Omega)$ with a penalization value $\beta \sum_{v=1}^{V} \sum_{j=1}^{d_v} (\lambda_j^v)^2$. This penalization is the product of an a priori coefficient $\beta$ and a sum over the weights of $\Lambda$ on the set of variables. The estimation of $\beta$ is not easy. $p$ is an exponent parameter, and the experimental results of Y.-M. Xu et al. [XU 16] show that there exists a relatively wide range of $p$ values. The use of additive constraints adds a penalization value to the criterion $W$ in the optimization process and requires the estimation of two meta-parameters. Therefore, we propose using the following multiplicative constraints:

\[
\prod_{v=1}^{V} \omega_v = 1 \ \text{ and } \ \omega_v > 0, \ \forall v = 1 \text{ to } V
\]

and

\[
\prod_{j=1}^{d_v} \lambda_j^v = 1 \ \text{ and } \ \lambda_j^v > 0, \ \forall j = 1 \text{ to } d_v \text{ and } \forall v = 1 \text{ to } V
\]

The additive constraints allow the weights to be interpreted as probabilities. Under the multiplicative constraints, the product of the weights can be interpreted as a volume, because these weights are the diagonal elements of the Mahalanobis matrix: if the Mahalanobis matrix is a diagonal matrix, the product of the weights is equal to the determinant of the matrix. With these multiplicative constraints, it is possible to optimize the criterion $W$, and it is not necessary to define meta-parameters.

We propose an algorithm that groups elements, taking into account both the weights of the tables and the weights of the variables. The model is based on optimizing an objective function at each iteration, where the weights of the tables and the variables vary until the final clusters are defined. Our algorithm is named MND–WT; when there is a single table, it is named MND–W.


9.4. Hard clustering with automated weighting of tables and variables

The MND–W and MND–WT algorithms optimize the criterion $W(P, G, \Lambda, \Omega)$ defined in the previous section. The objective is to build a partition $P$ and a Fréchet mean $G_k$ for each cluster, and to learn the weights $\Lambda$ and $\Omega$ associated with the variable set and the table set, respectively. Four steps are used; during these steps, the partition $P$, the Fréchet mean $G$ on the partition $P$, and the weights $\Omega$ and $\Lambda$ are estimated and change at each iteration.

9.4.1. Clustering algorithms MND–W and MND–WT

The algorithms start with an initialization step and then alternate three or four steps:
– Initialization, then repeat the following steps:
– Step 1: Build the partition $P$.
– Step 2: Compute the Fréchet mean set $G$.
– Step 3: Compute the weight matrix $\Lambda$ of the variable set.
– Step 4: Compute the weight vector $\Omega$ of the table set (this step does not exist in the MND–W algorithm).
until convergence. The adequacy criterion $W$ then reaches a stationary value representing a local minimum.

9.4.1.1. Initialization Step

The parameter $K$ (number of clusters) and the distance $d_{(j,v)}$ for each variable $j$ and each table $v$ are defined by the user. The optimization of the criterion $W(P, G, \Lambda, \Omega)$ can be performed when three of its four arguments are fixed. Therefore, our initialization step determines the Fréchet mean $G$, the weight matrix $\Lambda$ associated with the variable set, and the weight vector $\Omega$ associated with the table set as follows:
– randomly select $K$ elements $\{e_{(1)}, \ldots, e_{(K)}\}$ of $E$, and initialize $G_k$ with $x_{(k)}$ for $k = 1$ to $K$;
– set the matrix $\Lambda = (\Lambda^1, \ldots, \Lambda^v, \ldots, \Lambda^V)$, where $\Lambda^v = (\lambda_1^v, \ldots, \lambda_j^v, \ldots, \lambda_{d_v}^v)$ and $\lambda_j^v$ is the weight of the variable $j$ from the table $v$;
– set the vector $\Omega = (\omega_1, \ldots, \omega_V)$, the weighting vector of the tables, where $\omega_v$ is the weight of the table $v$.


We propose to use the initial solution in which the weights $\Lambda$ and $\Omega$ are equal to 1:
– $\lambda_j^v = 1$, $\forall v = 1$ to $V$ and $\forall j = 1$ to $d_v$;
– $\omega_v = 1$, $\forall v = 1$ to $V$.

Therefore, the first step of the algorithm must be the search for the best partition associated with this initialization.

9.4.1.2. Step 1: Build the best partition P

The Fréchet mean set $(G_1, \ldots, G_K)$ and the weights $\Omega$ and $\Lambda$ are fixed. This step is the classical k-means [MAC 67] assignment step. The partition $P = (C_1, \ldots, C_K)$ is created or updated by:

\[
C_k = \Big\{ e_i \in E : \sum_{v=1}^{V} \omega_v \sum_{j=1}^{d_v} \lambda_j^v \, d_{(j,v)}^2\big(x_i^{(j,v)}, g_k^{(j,v)}\big) \le \sum_{v=1}^{V} \omega_v \sum_{j=1}^{d_v} \lambda_j^v \, d_{(j,v)}^2\big(x_i^{(j,v)}, g_h^{(j,v)}\big), \ \forall h = 1, \ldots, K \Big\}
\]

The partition $P$ depends on the global Fréchet mean $G = (G_1, \ldots, G_K)$ and on the two sets of weights $\Omega$ and $\Lambda$.

9.4.1.3. Step 2: Find the best global Fréchet mean G

The partition $P = (C_1, \ldots, C_K)$ and the weights $\Omega$ and $\Lambda$ are fixed.

This step builds a Fréchet mean $G_k = (g_k^{(1,1)}, \ldots, g_k^{(j,v)}, \ldots, g_k^{(d_V,V)}) \in \Phi$ of the cluster $C_k$, where $g_k^{(j,v)}$ is a Fréchet mean of $C_k$ on the variable $j$ belonging to the table $v$. However, if the set of elements of $\Phi_j^v$ that minimize the Fréchet function $w(C_k, g)$ of equation [9.3] contains several elements, then several strategies can be used to select the element that will be the Fréchet mean $g_k^{(j,v)}$.

\[
w(C_k, g) = \sum_{e_i \in C_k} d_{(j,v)}^2\big(x_i^{(j,v)}, g\big) \qquad [9.3]
\]

If the representation space $\Phi_j^v$ is a finite set, the Fréchet mean is easily obtained. For example, if the representation space $\Phi_j^v$ is the set $E$, then the Fréchet mean is the representative vector $x$ of the object $e \in E$ that minimizes the Fréchet function $w$ (see [CAR 12]).


If the representation space $\Phi_j^v$ is an infinite set, then the existence of the Fréchet mean depends on the distance $d_{(j,v)}$. The Fréchet mean $g_k^{(j,v)}$ exists with the Euclidean ($L_2$) and $L_1$ distances on numerical variables, the $\chi^2$ distance on categorical single-valued or multi-valued variables, and the $L_1$, $L_2$ and Hausdorff distances [CHA 06] on interval-valued variables. When the solution $g$ of equation [9.3] is not unique, the Fréchet mean $g$ can be obtained using different strategies. We propose two strategies that are easily applicable. The first strategy is to reduce the representation space $\Phi_j^v$, and the second is to construct a factorial representation of $\Phi_j^v$.

The first strategy is to replace the space $\Phi_j^v$ with $E$, the set of $n$ objects. Therefore, the Fréchet mean is (see [CAR 12]):

\[
g_k^{(j,v)} = \arg\min_{g \in E} \sum_{e_i \in C_k} d_{(j,v)}^2\big(x_i^{(j,v)}, g\big)
\]
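The first strategy amounts to a medoid-like search among the observed values. A minimal sketch, assuming an interval-valued variable compared with the Hausdorff distance between closed intervals; the function names and the four intervals are hypothetical and serve only as an illustration.

```python
def hausdorff(a, b):
    """Hausdorff distance between two closed intervals a = (lo, hi) and b = (lo, hi)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

def restricted_frechet_mean(values, cluster, dist):
    """Observed value minimizing the sum of squared distances to the cluster members.

    values  : values of all n objects of E on one variable (the candidate set)
    cluster : indices of the objects belonging to C_k
    """
    def variance(g):
        return sum(dist(values[i], g) ** 2 for i in cluster)
    return min(values, key=variance)

# Hypothetical interval-valued variable observed on four objects.
intervals = [(0, 1), (1, 3), (2, 4), (10, 12)]
g = restricted_frechet_mean(intervals, cluster=[0, 1, 2], dist=hausdorff)
print(g)  # (1, 3)
```

Because the candidate set is finite, a minimizer always exists; ties are resolved here by taking the first candidate, which is one possible convention among others.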

The second strategy is to build a factorial representation $(f(x_1^{(j,v)}), \ldots, f(x_n^{(j,v)}))$ of the table $v$, where $f(x_i^{(j,v)}) \in \mathbb{R}^p$, that depends only on the distances $d_{(j,v)}(x_i^{(j,v)}, x_{i'}^{(j,v)})$ between the objects $e_i \in E$ and $e_{i'} \in E$ on the table $v$. This approach uses the Torgerson formula:

\[
\big\langle f(x_i^{(j,v)}), f(x_{i'}^{(j,v)}) \big\rangle = \frac{1}{2} \Big[ \frac{1}{n} \sum_{e_h \in E} d_{(j,v)}^2\big(x_i^{(j,v)}, x_h^{(j,v)}\big) + \frac{1}{n} \sum_{e_h \in E} d_{(j,v)}^2\big(x_{i'}^{(j,v)}, x_h^{(j,v)}\big) - \frac{1}{n^2} \sum_{e_h \in E} \sum_{e_m \in E} d_{(j,v)}^2\big(x_m^{(j,v)}, x_h^{(j,v)}\big) - d_{(j,v)}^2\big(x_i^{(j,v)}, x_{i'}^{(j,v)}\big) \Big]
\]

If this symmetric bilinear form is positive semi-definite, then we have a Euclidean representation of the set of objects, and the Euclidean distance $d_{(j,v)}$ is defined by:

\[
d_{(j,v)}\big(x_i^{(j,v)}, x_{i'}^{(j,v)}\big) = \big\| f(x_i^{(j,v)}) - f(x_{i'}^{(j,v)}) \big\|
\]
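The double centering behind the Torgerson formula can be checked numerically. The sketch below uses the standard form of the formula on squared distances, which is an assumption about the intended expression, and hypothetical points on the real line whose distances are Euclidean by construction.

```python
def torgerson_gram(d2):
    """Double centering of a matrix of squared distances d2[i][j].

    Returns B with B[i][j] = <f(x_i), f(x_j)> for the centered representation f.
    """
    n = len(d2)
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    return [[0.5 * (row_mean[i] + row_mean[j] - grand_mean - d2[i][j])
             for j in range(n)] for i in range(n)]

# Hypothetical points on the real line.
x = [0.0, 1.0, 3.0]
d2 = [[(a - b) ** 2 for b in x] for a in x]
B = torgerson_gram(d2)

# B reproduces the inner products of the centered points ...
center = sum(x) / len(x)
expected = [[(a - center) * (b - center) for b in x] for a in x]
# ... and the squared distances are recovered as B[i][i] + B[j][j] - 2 * B[i][j].
recovered = [[B[i][i] + B[j][j] - 2 * B[i][j] for j in range(3)] for i in range(3)]
print(B[0][1], expected[0][1])
```

When the input distances are not Euclidean, the matrix B has negative eigenvalues and only an approximate factorial representation can be built from it.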

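9.4.1.4. Step 3: Compute the best weight matrix Λ of the variable set

During this step, the fixed elements are the partition $P = (C_1, \ldots, C_K)$ of $E$ into $K$ clusters, the global Fréchet mean $(G_1, \ldots, G_K)$, and the weight vector $\Omega$ of the table set.

This step provides the optimal solution that minimizes the criterion $W(P, G, \Lambda, \Omega)$ with respect to the weight matrix $\Lambda$ under the constraints $\lambda_j^v > 0$ and $\prod_{j=1}^{d_v} \lambda_j^v = 1$. The solution is found using the Lagrange multiplier method on the couple $(v, j)$, where $v$ is the table and $j$ is the variable in the table $v$.

Minimizing $W(P, G, \Lambda, \Omega)$ with respect to the weight component $\lambda_j^v$ under the constraints $\lambda_j^v > 0$ and $\prod_{j=1}^{d_v} \lambda_j^v = 1$ is equivalent to minimizing the following Lagrangian, where $\lambda$ (without indices) denotes the Lagrange multiplier:

\[
L(\lambda_j^v, \lambda) = \omega_v \lambda_j^v J_j^v(P) + \lambda \Big( \prod_{u=1}^{d_v} \lambda_u^v - 1 \Big) \qquad [9.4]
\]

The derivatives of the two terms are:

\[
\frac{\partial}{\partial \lambda_j^v} \big( \omega_v \lambda_j^v J_j^v(P) \big) = \omega_v J_j^v(P) \qquad [9.5]
\]

\[
\frac{\partial}{\partial \lambda_j^v} \, \lambda \Big( \prod_{u=1}^{d_v} \lambda_u^v - 1 \Big) = \lambda \prod_{u=1, u \ne j}^{d_v} \lambda_u^v \qquad [9.6]
\]

With equations [9.5] and [9.6], equation [9.4] gives:

\[
\frac{\partial}{\partial \lambda_j^v} L(\lambda_j^v, \lambda) = \omega_v J_j^v(P) + \lambda \prod_{u=1, u \ne j}^{d_v} \lambda_u^v = 0 \qquad [9.7]
\]

From [9.7], $\lambda = - \dfrac{\omega_v \lambda_j^v J_j^v(P)}{\prod_{u=1}^{d_v} \lambda_u^v}$ and, as $\prod_{u=1}^{d_v} \lambda_u^v = 1$, we get $\lambda = -\omega_v J_j^v(P) \lambda_j^v$. Isolating the $\lambda_j^v$ term, we get:

\[
\lambda_j^v = - \frac{\lambda}{\omega_v J_j^v(P)} \qquad [9.8]
\]

Taking the product of $\lambda = -\omega_v \lambda_u^v J_u^v(P)$ over $u = 1$ to $d_v$ gives:

\[
\prod_{u=1}^{d_v} \lambda = (\lambda)^{d_v} = (-\omega_v)^{d_v} \prod_{u=1}^{d_v} \lambda_u^v J_u^v(P) = (-\omega_v)^{d_v} \prod_{u=1}^{d_v} J_u^v(P) \qquad [9.9]
\]

Since [9.8] with $\lambda_j^v > 0$ requires $\lambda < 0$, equation [9.9] gives $\lambda = -\omega_v \big( \prod_{u=1}^{d_v} J_u^v(P) \big)^{1/d_v}$. With equations [9.8] and [9.9], we get:

\[
\lambda_j^v = \frac{\omega_v \big( \prod_{u=1}^{d_v} J_u^v(P) \big)^{1/d_v}}{\omega_v J_j^v(P)} = \frac{\big( \prod_{u=1}^{d_v} J_u^v(P) \big)^{1/d_v}}{J_j^v(P)} \qquad [9.10]
\]

This result is important because the weight matrix $\Lambda$ of the variables is independent of the weight vector $\Omega$. In fact, the updating of $\Lambda$ must be performed before the updating of $\Omega$, since the table criterion $J_v(P)$ used for $\Omega$ depends on $\Lambda$. In conclusion, the weight $\lambda_j^v$ that minimizes the criterion $W$ under the constraints $\lambda_j^v > 0$ and $\prod_{j=1}^{d_v} \lambda_j^v = 1$ is:

\[
\lambda_j^v = \frac{\big( \prod_{u=1}^{d_v} J_u^v(P) \big)^{1/d_v}}{J_j^v(P)}
\]

where $J_j^v(P) = \sum_{k=1}^{K} \sum_{e_i \in C_k} d_{(j,v)}^2\big(x_i^{(j,v)}, g_k^{(j,v)}\big)$ is the Fréchet variance of the variable $j$ in the table $v$ on the partition $P$ and measures its homogeneity.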
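The closed-form update of the variable weights is a ratio between a geometric mean and a Fréchet variance. A minimal sketch for one table, assuming strictly positive Fréchet variances; the function name and the three input values are hypothetical.

```python
import math

def update_lambda(j_values):
    """Weights lambda_j = (prod_u J_u)^(1/d) / J_j for the variables of one table.

    j_values[j] is the Frechet variance J_j^v(P); all entries are assumed > 0.
    """
    d = len(j_values)
    geometric_mean = math.exp(sum(math.log(j) for j in j_values) / d)
    return [geometric_mean / j for j in j_values]

# Hypothetical Frechet variances of the three variables of one table.
lam = update_lambda([1.0, 4.0, 16.0])
print(lam)             # approximately [4.0, 1.0, 0.25]
print(math.prod(lam))  # approximately 1.0
```

The most homogeneous variable (smallest Fréchet variance) receives the largest weight, and the product of the weights stays equal to 1 whatever the scale of the variances. A variable with a zero variance would need special handling, which this sketch does not attempt.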


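9.4.1.5. Step 4: Compute the best weight vector Ω of the tables

The partition $P = (C_1, \ldots, C_K)$ of $E$ into $K$ clusters, the global Fréchet mean $G = (G_1, \ldots, G_K)$, and the weight matrix $\Lambda$ of the variable set are fixed. To obtain the weight vector $\Omega$, we use the approach developed by De Carvalho et al. in [CAR 12]. The solution is found using the Lagrange multiplier method on the set of tables:

\[
L(\omega_v, \lambda) = W(P, G, \Lambda, \Omega) + \lambda \Big( \prod_{u=1}^{V} \omega_u - 1 \Big) \qquad [9.11]
\]

Differentiating $W(P, G, \Lambda, \Omega)$ with respect to the weight component $\omega_v$, we get:

\[
\frac{\partial}{\partial \omega_v} W(P, G, \Lambda, \Omega) = J_v(P) \qquad [9.12]
\]

and

\[
\frac{\partial}{\partial \omega_v} \, \lambda \Big( \prod_{u=1}^{V} \omega_u - 1 \Big) = \lambda \prod_{u=1, u \ne v}^{V} \omega_u \qquad [9.13]
\]

With equations [9.12] and [9.13], equation [9.11] gives:

\[
\frac{\partial}{\partial \omega_v} L(\omega_v, \lambda) = J_v(P) + \lambda \prod_{u=1, u \ne v}^{V} \omega_u = 0 \qquad [9.14]
\]

As $\prod_{u=1}^{V} \omega_u = 1$ implies $\prod_{u \ne v} \omega_u = 1/\omega_v$, isolating the $\omega_v$ term gives:

\[
\omega_v = - \frac{\lambda}{J_v(P)} \qquad [9.15]
\]

Taking the product of $\lambda = -\omega_v J_v(P)$ over all tables gives:

\[
\prod_{u=1}^{V} \lambda = (\lambda)^{V} = (-1)^{V} \prod_{v=1}^{V} \omega_v J_v(P) = (-1)^{V} \prod_{v=1}^{V} J_v(P) \qquad [9.16]
\]

With equations [9.15] and [9.16], we get:

\[
\omega_v = \frac{\big( \prod_{u=1}^{V} J_u(P) \big)^{1/V}}{J_v(P)} \qquad [9.17]
\]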
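With the closed-form weights for variables and tables, the four steps can be chained into one loop. The following is a compact sketch under stated assumptions, not a reference implementation: numerical variables only, squared Euclidean distances (so that Fréchet means are cluster means), deterministic seeds instead of random initialization, and a hypothetical floor `eps` to keep Fréchet variances positive. The function names and toy data are illustrative.

```python
import math

def geo_weights(costs):
    """w_i = geometric_mean(costs) / costs_i, so that the product of the w_i is 1."""
    g = math.exp(sum(math.log(c) for c in costs) / len(costs))
    return [g / c for c in costs]

def mnd_wt(tables, seeds, max_iter=100, eps=1e-12):
    """Sketch of the four-step loop; tables[v][i][j] is object i, variable j, table v."""
    n_tables, n = len(tables), len(tables[0])
    dims = [len(table[0]) for table in tables]
    protos = [[list(tables[v][s]) for v in range(n_tables)] for s in seeds]
    lam = [[1.0] * d for d in dims]   # initial variable weights equal to 1
    omega = [1.0] * n_tables          # initial table weights equal to 1
    labels = None

    def cost(i, k):
        # Weighted squared distance between object i and prototype k.
        return sum(
            omega[v] * sum(lam[v][j] * (tables[v][i][j] - protos[k][v][j]) ** 2
                           for j in range(dims[v]))
            for v in range(n_tables))

    for _ in range(max_iter):
        # Step 1: assign each object to its closest prototype.
        new_labels = [min(range(len(seeds)), key=lambda k: cost(i, k)) for i in range(n)]
        if new_labels == labels:
            break  # the partition is stable
        labels = new_labels
        # Step 2: prototypes are cluster means (an empty cluster keeps its prototype).
        for k in range(len(seeds)):
            members = [i for i in range(n) if labels[i] == k]
            for v in range(n_tables):
                for j in range(dims[v]):
                    if members:
                        protos[k][v][j] = sum(tables[v][i][j] for i in members) / len(members)
        # Step 3: variable weights from the Frechet variances J_j^v(P).
        J = [[max(eps, sum((tables[v][i][j] - protos[labels[i]][v][j]) ** 2 for i in range(n)))
              for j in range(dims[v])] for v in range(n_tables)]
        lam = [geo_weights(J[v]) for v in range(n_tables)]
        # Step 4: table weights from J_v(P) = sum_j lambda_j^v * J_j^v(P).
        Jv = [sum(l * c for l, c in zip(lam[v], J[v])) for v in range(n_tables)]
        omega = geo_weights(Jv)
    return labels, protos, lam, omega

# Hypothetical toy data: table 0 holds one separating and one noisy variable, table 1 is noise.
tables = [
    [[0.0, 1.0], [0.2, 3.0], [0.4, 2.0], [5.0, 2.0], [5.2, 1.0], [5.4, 3.0]],
    [[0.0], [4.0], [2.0], [2.0], [0.0], [4.0]],
]
labels, protos, lam, omega = mnd_wt(tables, seeds=[0, 3])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(lam, omega)
```

On this toy example the separating variable and its table end up with the largest weights. In practice the loop would be restarted from several random seeds and the run with the smallest criterion kept.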


In conclusion, the weight $\omega_v$ that minimizes the criterion $W$ under the constraints $\omega_v > 0$ and $\prod_{u=1}^{V} \omega_u = 1$ is:

\[
\omega_v = \frac{\big( \prod_{u=1}^{V} J_u(P) \big)^{1/V}}{J_v(P)} \qquad [9.18]
\]

where

\[
J_v(P) = \sum_{j=1}^{d_v} \lambda_j^v J_j^v(P) = \sum_{j=1}^{d_v} \lambda_j^v \sum_{k=1}^{K} \sum_{e_i \in C_k} d_{(j,v)}^2\big(x_i^{(j,v)}, g_k^{(j,v)}\big)
\]

measures the homogeneity of the table $v$.

9.5. Applications: UCI data sets

This section discusses the performance of the proposed algorithms MND–WT and MND–W in comparison with other algorithms. To compare the clustering performances, the overall error rate of classification (OERC) is computed. The misclassification rate OERC measures the difference between the a priori partition and the partition provided by the clustering algorithm. It corresponds to seeking a decision rule that minimizes the probability of error. Therefore, the misclassification rate OERC is equal to:

\[
OERC = 1 - \frac{\text{Number of correctly classified objects}}{\text{Number of objects in the dataset}} \qquad [9.19]
\]
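Computing the OERC for a clustering requires deciding which cluster corresponds to which class. The sketch below assumes a one-to-one matching that maximizes the number of correctly classified objects; this matching rule, the function name and the toy confusion matrix are assumptions made for the illustration, since the matching procedure is not detailed here.

```python
from itertools import permutations

def oerc(confusion):
    """OERC from a square confusion matrix (rows: clusters, columns: a priori classes).

    Clusters are matched to classes by the permutation that maximizes the number of
    correctly classified objects (exhaustive search, fine for a small number of clusters).
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    best = max(sum(confusion[c][perm[c]] for c in range(k))
               for perm in permutations(range(k)))
    return 1 - best / n

# Hypothetical 3 x 3 confusion matrix with 15 objects.
toy = [
    [5, 0, 0],
    [0, 1, 4],
    [0, 4, 1],
]
print(oerc(toy))  # 2 misclassified objects out of 15
```

For a larger number of clusters, the exhaustive search over permutations would be replaced by an assignment algorithm.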

For each algorithm, the best result was selected from 500 runs according to the clustering adequacy criterion $W$. This chapter considers two data sets: the Iris plant data set and the Multi-Features data set. These data sets are available at www.ics.uci.edu./mlearn/MLRepository.html (UCI machine learning repository data sets).

9.5.1. Application I: Iris plant

This well-known data set contains 150 plants belonging to three types of iris plants: Iris setosa, Iris versicolor, and Iris virginica. Each class contains 50 objects. One class (Setosa) is linearly separable from the other two. Each object is described by four numerical variables that can be split into two tables:
– table 1 (Sepal) contains two variables: (1) sepal length and (2) sepal width;
– table 2 (Petal) contains two variables: (3) petal length and (4) petal width.


Different classical clustering methods and weighted clustering methods are applied with K = 3.

Two methods, presented by De Carvalho et al. [CAR 12], are applied to four dissimilarity matrices (one per variable):
– MRDCA–RWL: dynamic clustering algorithm with a weight for each dissimilarity matrix estimated for each cluster (locally estimated);
– MRDCA–RWG: dynamic clustering algorithm with a weight for each dissimilarity matrix estimated globally.

Two methods are applied to two tables (two variables per table):
– MND–W: dynamic clustering algorithm with a weight for each variable (single data table);
– MND–WT: dynamic clustering algorithm with a weight for each variable and for each table.

Table 9.3 gives the weighting of variables for MRDCA–RWG, the weighting of variables by classes for MRDCA–RWL, Λ the vector of variables for MND–W and MND–WT, and Ω the vector of tables for MND–WT.

Data matrix    MRDCA–RWG   MRDCA–RWL                          MND–W    MND–WT
                           Cluster 1   Cluster 2   Cluster 3           Variable   Table
Sepal length   0.5523      0.4215      0.4423      0.4145     0.4631   1.2714     0.3642
Sepal width    0.2971      0.5146      0.3555      0.0994     0.2865   0.7865     0.3642
Petal length   2.9820      2.3212      2.0378      7.3868     2.6720   0.9732     2.7455
Petal width    2.0428      1.9861      3.1202      3.2822     2.8209   1.0275     2.7455

Table 9.3. Iris data set: weighting vectors
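The MND–W and MND–WT weights of Table 9.3 can be checked against the multiplicative constraints. The short script below hard-codes the printed values and uses a tolerance for their four-decimal rounding; the comparison of the combined MND–WT weight with the MND–W weight is an observation on these printed figures, not a statement taken from the chapter.

```python
import math

# Weights as printed in Table 9.3 (rows: sepal length, sepal width, petal length, petal width).
mnd_w = [0.4631, 0.2865, 2.6720, 2.8209]
mnd_wt_variable = [1.2714, 0.7865, 0.9732, 1.0275]
mnd_wt_table = [0.3642, 0.3642, 2.7455, 2.7455]   # table weight repeated for each variable

tol = 1e-3  # tolerance for the four-decimal rounding of the printed values

# Product constraints: all variables (MND-W), variables within each table and tables (MND-WT).
products = {
    "MND-W variables": math.prod(mnd_w),
    "MND-WT sepal variables": math.prod(mnd_wt_variable[:2]),
    "MND-WT petal variables": math.prod(mnd_wt_variable[2:]),
    "MND-WT tables": mnd_wt_table[0] * mnd_wt_table[2],
}
for name, value in products.items():
    print(f"{name}: {value:.4f}")

# Combined MND-WT weight omega_v * lambda_j^v of each variable, next to the MND-W weight.
combined = [t * l for t, l in zip(mnd_wt_table, mnd_wt_variable)]
for c, w in zip(combined, mnd_w):
    print(f"{c:.4f}  {w:.4f}")
```

All products are equal to 1 up to rounding, and on this data set the combined table-times-variable weights of MND–WT appear to coincide with the MND–W weights.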

Concerning the partitions given by these methods, the variables (3) petal length and (4) petal width have the highest relevance weights, which shows the importance of these variables in the determination of the partition.

The performance of the clustering algorithms, measured by the misclassification rate OERC, is given in Table 9.4 for the following clustering methods:
– HAC: hierarchical agglomerative clustering (Ward);
– k-means: traditional clustering;
– EM: expectation maximization method;
– CARD-R: clustering and aggregation of relational data [FRI 07];
– NERF [HAT 94];
– MFCMdd-RWL-P: relational partitioning fuzzy clustering based on multiple dissimilarity matrices;
– MVFCMddV: multi-view relational fuzzy c-medoid vectors clustering algorithm.

Name            OERC
HAC             0.1666
EM              0.1733
K-means         0.1666
FCMdd           0.1530
NERF            0.1600
CARD-R          0.0460
MRDCA–RWG       0.0400
MRDCA–RWL       0.0460
MFCMdd-RWL-P    0.0530
MVFCMddV        0.0930
MND–W           0.0400
MND–WT          0.0400

Method families distinguished in the table: classical methods; clustering on distance tables; weighted clustering methods.

Table 9.4. Misclassification rate OERC

            Classes
Clusters    Setosa   Versicolour   Virginica
1           50       0             0
2           0        2             46
3           0        47            4

Table 9.5. Confusion matrix by MRDCA–RWG, MND–W and MND–WT

The OERC of the confusion matrix of Table 9.5 is equal to (2 + 4)/150 = 0.0400.

A validation is carried out with 500 runs, and the run set is divided into 10 blocks. In Table 9.6, three methods (K-means, MND–W, and MND–WT) are used with three input parameters σ, Λ, and Ω. Four output criteria, Criterion, Miss, R, and Val, allow us to evaluate these methods and their input parameters:
– σ: Yes = data are not normalized, No = data are normalized;
– Λ: Yes = the weights of the variables are equal to 1, No = the weights of the variables are adaptive;
– Ω: Yes = the weights of the tables are equal to 1, No = the weights of the tables are adaptive;
– Criterion: minimum value of the criterion observed in the run set;
– Miss: number of misclassified objects (OERC = Miss/n);
– R: number of runs where the minimum value of Criterion is observed;
– Val: number of blocks where the minimum value of Criterion is observed.


               Parameters        Results
Methods        σ     Λ     Ω     Criterion     Miss   R     Val
K-means (1)    No    Yes   Yes   78.940841     16     210   10
K-means (2)    Yes               140.026045    25     30    10
MND–W (1)      No    No          68.417838     6      177   10
MND–W (2)      Yes   No          98.397939     6      178   10
MND–WT (1)     No    No    Yes   78.344647     15     178   10
MND–WT (2)     Yes   No    Yes   99.831480     6      229   10
MND–WT (3)     No    Yes   No    71.021692     15     217   10
MND–WT (4)     Yes   Yes   No    131.692682    26     202   10
MND–WT (5)     No    No    No    68.417838     6      177   10
MND–WT (6)     Yes   No    No    98.397939     6      178   10

Table 9.6. Iris data set: validation

With this data set, the effective strategy is to adjust the weights of the variables (Λ = No), while the weights of the tables are equal to 1 (Ω = Yes). With this strategy, the MND–W and MND–WT algorithms obtain the same result, whether or not the variables are normalized. With MND–W (1) and MND–WT (5), the criterion is equal to 68.417838, and the number of misclassified objects is equal to 6; with MND–W (2) and MND–WT (6), the value of Criterion is equal to 98.397939, and the number of misclassified objects is equal to 6. However, with MND–WT (2), the number of misclassified objects is also equal to 6, but the value of Criterion is greater.

9.5.2. Application II: multi-features dataset

The multiple features dataset (abbr. Mfeat) is a dataset consisting of handwritten digits (0–9). Mfeat contains 2,000 objects divided into 10 classes (0–9). Each object is described by a set of 649 variables. These 649 variables are divided into six tables (see Table 9.7).

Table        #variables   Description
mfeat-fac    216          Profile correlations
mfeat-fou    76           Fourier coefficients of the character shapes
mfeat-kar    64           Karhunen–Loève coefficients
mfeat-mor    6            Morphological features
mfeat-pix    240          Pixel averages in 2 x 3 windows
mfeat-zer    47           Zernike moments

Table 9.7. Multiple features dataset tables


The random initialization of the algorithm is repeated 500 times, and the number of classes is equal to 10. The best partition according to the adequacy criterion $W$ is selected. We use the three multiple table sets proposed by Y.-M. Xu et al. [XU 16], and the data are not normalized.

Table        #variables   fac+fou    fac+zer    fac+fou+kar+zer+pix
mfeat-fac    216          0.492372   0.742381   0.633875
mfeat-fou    76           2.030984              1.091199
mfeat-kar    64                                 1.352603
mfeat-pix    240                                0.441812
mfeat-zer    47                      1.347018   2.419272

Table 9.8. Multiple features: weight vector Ω

Table 9.8 gives the results of MND–WT for the weight vector Ω on three table selections. The mfeat–zer table obtains the most relevant weights on the “fac+zer” and “fac+fou+kar+zer+pix” selections, and the mfeat–fou table on the “fac+fou” selection. In the “fac+fou” selection, the role of the mfeat–fac table is very weak in the determination of the partition.

Method        fac+fou   fac+zer   fac+fou+kar+zer+pix
#variables    292       263       643
k-means       0.361     0.390     0.322
EM            0.345     0.405     0.364
LLC-fs        0.224     0.292     0.185
MVKKM         0.175     0.292     0.354
Co-regspec    0.218     0.348     0.265
WMCFS         0.165     0.206     0.164
MND–W         0.147     0.173     0.137
MND–WT        0.094     0.162     0.110

Table 9.9. Multiple features: misclassification rate OERC

In Table 9.9, we compare the performance of the proposed methods MND–W and MND–WT with the results of six methods (k-means, EM, LLC-fs, MVKKM, Co-regspec, and WMCFS). The last method, WMCFS, was proposed by Y.-M. Xu et al. [XU 16]. The introduction of the weight vector Ω in the MND–WT method reduces the OERC. This reduction is significant for the fac+fou tables (36%) but lower for the fac+zer tables (6%). For the last group of tables, the reduction is 16%.

In Table 9.10, the values of the criterion $W$ and the misclassification rate OERC are very similar for the k-means and MND–W algorithms but very different for the MND–WT algorithm. Grouping variables according to their descriptions in tables improves the result of the classification. The description by Zernike moments (mfeat–zer table) is the most relevant.

Method     Criterion    OERC     Miss
k-means    778288.97    0.1345   267 / 2000
MND–W      738929.10    0.1335   268 / 2000
MND–WT     638159.78    0.110    220 / 2000

Table 9.10. Multiple features: data are normalized

9.6. Conclusion

In this chapter, we presented a novel weighted multi-table clustering method that partitions objects by simultaneously adapting the weights of the variables and the weights of the tables or views. For each variable, it is possible to choose a specific distance. A global criterion is proposed. To optimize this criterion, we designed a k-means-like iteration, which consists of four main steps and converges to satisfactory results. This clustering algorithm gives a partition of the input data and a corresponding Fréchet mean vector for each cluster. However, the proposed iterative procedure only gives a locally optimal partition. The aim was to obtain a final partition giving a consensus between the different tables or views describing the objects. The weight associated with a table (the Ω vector) measures the role of this table in the classification. The weight of a variable determines its influence in the classification.

Conducted on two real datasets, the experiments show that, in these cases, the proposed algorithm is superior to the corresponding algorithm of Xu et al. [XU 16].

The proposed method has two disadvantages: (i) there is no theoretical guarantee that the globally optimal partition is found and (ii) the appropriate number of clusters must be given by the user. Both datasets contain only numerical variables, so we will continue our experiments on data tables containing categorical variables or interval-valued variables.

9.7. References

[BOC 00] BOCK H.H., DIDAY E., Analysis of Symbolic Data, Exploratory Methods for Extracting Statistical Information from Complex Data, Springer, Berlin Heidelberg, 2000.


[CAR 12] DE CARVALHO F.A.T., LECHEVALLIER Y., DE MELO F.M., “Partitioning hard clustering algorithms based on multiple dissimilarity matrices”, Pattern Recognition, vol. 45, pp. 447–464, 2012.
[CAR 15] DE CARVALHO F.A.T., DE MELO F.M., LECHEVALLIER Y., “A multiview relational fuzzy c-medoid vectors clustering algorithm”, Neurocomputing, vol. 163, pp. 115–123, 2015.
[CHA 06] CHAVENT M., DE CARVALHO F.A.T., LECHEVALLIER Y. et al., “New clustering methods for interval data”, Computational Statistics, vol. 21, no. 2, pp. 211–229, 2006.
[CHE 13] CHEN X., XU X., HUANG J.Z. et al., “Tw-(k)-means: Automated two-level variable weighting clustering algorithm for multiview data”, IEEE Transactions on Knowledge and Data Engineering, vol. 25, pp. 932–944, 2013.
[CLE 09] CLEUZIOU G., EXBRAYAT M., MARTIN L. et al., “COFKM: A centralized method for multiple-view clustering”, ICDM 2009 Ninth IEEE International Conference on Data Mining, Miami, USA, pp. 752–757, 2009.
[FRI 07] FRIGUI H., HWANG C., RHEE F.C.-H., “Clustering and aggregation of relational data with applications to image database categorization”, Pattern Recognition, vol. 40, pp. 3053–3068, 2007.
[HAT 94] HATHAWAY R.J., BEZDEK J.C., “Nerf c-means: Non-Euclidean relational fuzzy clustering”, Pattern Recognition, vol. 27, no. 3, pp. 429–437, 1994.
[JAI 10] JAIN A.K., “Data clustering: 50 years beyond k-means”, Pattern Recognition Letters, vol. 31, pp. 651–666, 2010.
[MAC 67] MACQUEEN J., “Some methods for classification and analysis of multivariate observations”, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
[REZ 09] REZA G., NASIR S.M.D., HAMIDAH I. et al., “A survey: Clustering ensembles techniques”, Proceedings of World Academy of Science, Engineering and Technology, vol. 38, pp. 644–653, 2009.
[TZO 12] TZORTZIS G., LIKAS A., “Kernel-based weighted multi-view clustering”, Proceedings of the 12th International Conference on Data Mining, pp. 675–684, 2012.
[WAN 13] WANG H., NIE F., HUANG H., “Multi-view clustering and feature learning via structured sparsity”, Proceedings of the 30th International Conference on Machine Learning, pp. 352–360, 2013.
[XIE 13] XIE X., SUN S., “Multi-view clustering ensembles”, ICMLC, pp. 51–56, 2013.
[XU 16] XU Y.M., WANG C.D., LAI J.H., “Weighted multi-view clustering with feature selection”, Pattern Recognition, vol. 53, pp. 25–35, 2016.

10 Clustering and Generalized ANOVA for Symbolic Data Constructed from Open Data

10.1. Introduction

Official statistics are very important sources of open data where National Statistical Offices play a vital role. More and more societies favor the idea of freely available data and, therefore, many governmental institutions have also established open data websites. At the international level, such sources of open data are, for example, the United Nations open data website [UN 17], The World Bank Open Data [WB 17], and The European Union Open Data Portal [EUR 17]. A commonly used technique to present their data in a transparent and compact way is aggregation. There are several important properties and advantages of data aggregation:
– it is usually the first step to make a large amount of data manageable;
– it extracts (first) information from big data;
– it protects the privacy of individuals (persons, companies etc.);
– it produces second-level units of data.

Aggregated data present original individual units at a higher level, which enables a different view of the data. Symbolic Data Analysis (SDA) provides tools for the analysis of such higher second-level units. Second-level units in SDA are called concepts or classes (Diday, inspired by Aristotle’s collection of works on logic The Organon [ARI] in which he distinguishes between first-level objects called

Chapter written by Simona KORENJAK-ČERNE, Nataša KEJŽAR and Vladimir BATAGELJ.

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.


individuals and second-level objects). They represent a natural extension of aggregated descriptions of individuals. SDA is an extension of the standard data analysis. Following the SDA approach, the aggregation process returns second-level units, symbolic objects (SOs), in which more information is usually preserved (e.g., a frequency distribution of individual values instead of just a mode). In order to find the answers to theoretical hypotheses, symbolic data tables with complex/structured data as table entries are the input to the SDA methods (several practical examples can be found in the SDA literature, for example, in [BIL 06], [NOI 11], [BRI 14] or [DID 16]).

In this chapter, we present a review of our contributions to one such SDA topic, namely, clustering, adapted for symbolic data representations based on distributions of values. The adaptation of the classical methods was directly motivated by analyses of open data sets. It can be used with several dissimilarities [BAT 15a]. The usage is illustrated with applications on two different open data sets:
– TIMSS (Trends in International Mathematics and Science Study) by combining teachers’ and students’ data sets [KOR 11]; and
– countries’ data descriptions based on their age–sex population distributions [KOR 15].

Furthermore, we present some basic ideas on how to generalize the well-known analysis of variance (ANOVA) for cases where no assumptions from classical ANOVA hold [BAT 15b]. The generalized method can be used on the described second-level units, which we demonstrate on the example of population pyramids and the HDI.

10.2. Data description based on discrete (membership) distributions

With aggregation, a large set of (primary) units is partitioned into mutually disjoint sets/groups $P = \{P_j\}$. The representation of the group $P_j$ is a second-level unit $X_j$.
In this chapter, we discuss the case when an aggregated unit is represented by a distribution of values (with frequencies, relative frequencies or subtotals). We call this distribution a discrete (membership) distribution. Its categories are discrete values of primary units that have been aggregated. More formally, to obtain such a representation, the domain of each variable $V_i$ ($i = 1, \ldots, m$) is partitioned into $k_i$ subsets $\{V_{ij},\ j = 1, \ldots, k_i\}$. The set of (second-level) units $\mathbf{U}$ consists of symbolic objects. An SO $X$ is described with a list of descriptions of variables $V_i$, $i = 1, \ldots, m$:

$$X = [x_1, x_2, \ldots, x_m], \tag*{[10.1]}$$


where $m$ denotes the number of variables, and $x_i$ is a list of numerical values (usually, frequencies or subtotals over the corresponding groups) $x_i = [x_{i1}, x_{i2}, \ldots, x_{ik_i}]$. The same description (with the list of values for each symbolic variable) is used for a description of a cluster of symbolic objects $C$. Such a representation of SO is based on weighted modal (or histogram, if $V_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, k_i$, are intervals) type of symbolic variables. Let the sum of values of a variable $V_i$ be denoted with $n_{x_i}$:

$$n_{x_i} = \sum_{j=1}^{k_i} x_{ij}.$$

Then, the corresponding empirical probability distribution is

$$p_{x_i} = \frac{1}{n_{x_i}}\, x_i = [p_{x_{i1}}, p_{x_{i2}}, \ldots, p_{x_{ik_i}}].$$

A symbolic object $X$ is in this way described with a list of couples

$$X = [(n_{x_1}, p_{x_1}), (n_{x_2}, p_{x_2}), \ldots, (n_{x_m}, p_{x_m})]. \tag*{[10.2]}$$

The advantages of such a data description are:
– the description of each group has a fixed size;
– we can deal with variables that are based on a different number of original (individual) units;
– it preserves more information about the original first-level (primary) units and about groups than the usual single value of an appropriate statistic (e.g., a mean value) used in the classical approach;
– it produces uniform descriptions for all measurement types of variables;
– it is also compatible with the merging of disjoint clusters, i.e., knowing the descriptions of clusters $C_1$ and $C_2$, $C_1 \cap C_2 = \emptyset$, we can easily calculate the empirical probability distribution of their union as a weighted sum.

The second property is very useful when a data set is a combination of more than one initial data set, e.g., in the application on TIMSS data [KOR 11], or when we study demographic structures, e.g., age–sex structures [KOR 15] or causes of deaths by age and gender.
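As an illustration of the description [10.2] and of the merging property, the following minimal Python sketch builds the couple (n, p) for a single variable from a list of frequencies and merges two disjoint groups as a weighted sum. The function names `describe` and `merge` and the frequencies are hypothetical and are not taken from the clamix package.

```python
from typing import List, Tuple

# (n, p): sum of values and empirical probability distribution of one variable.
Description = Tuple[float, List[float]]


def describe(freqs: List[float]) -> Description:
    """Turn a list of frequencies x_i into the couple (n_{x_i}, p_{x_i})."""
    n = float(sum(freqs))
    if n == 0:
        # Assumption of this sketch: an empty group keeps an all-zero distribution.
        return 0.0, [0.0] * len(freqs)
    return n, [f / n for f in freqs]


def merge(a: Description, b: Description) -> Description:
    """Description of the union of two disjoint groups as a weighted sum."""
    (na, pa), (nb, pb) = a, b
    n = na + nb
    if n == 0:
        return 0.0, [0.0] * len(pa)
    return n, [(na * x + nb * y) / n for x, y in zip(pa, pb)]


# Hypothetical frequencies of one variable with three categories.
c1 = describe([10, 30, 60])
c2 = describe([30, 50, 20])
union = merge(c1, c2)
direct = describe([40, 80, 80])  # description computed from the pooled frequencies
```

In this sketch, `union` and `direct` agree, which is the sense in which the description is compatible with merging: the pooled distribution can be recovered from the two couples without going back to the primary units.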


10.3. Clustering

Cluster analysis or clustering is the task of assigning a set of objects into groups called clusters so that the objects in the same cluster are more similar to each other than to those in other clusters. When we want to use clustering to solve a concrete research problem, the choice of a dissimilarity measure significantly affects the clustering result. It is, therefore, of crucial importance (1) how we choose a proper dissimilarity to reveal the structure that we are looking for, and (2) how we select a proper method to obtain optimal cluster representatives (that answer the initial research questions). We mainly focus on the latter issue in our adaptation of the clustering methods. We describe the issue of dissimilarity selection only in relation to the adaptation of methods (for more, see [KEJ 11]). We define a clustering problem as an optimization problem: find a partition $\mathbf{C}^*$ in a set of feasible partitions $\Phi$ for which

$$P(\mathbf{C}^*) = \min_{\mathbf{C} \in \Phi} P(\mathbf{C}),$$

where $P(\mathbf{C})$ is a criterion function. $P(\mathbf{C})$ is based on the dissimilarities between units and/or cluster representatives. For solving the clustering problem for SOs described with discrete distributions, we adapted the following classical clustering methods:
– the leaders method (a generalization of the k-means method [AND 73], [HAR 75] and dynamic clouds [DID 79]);
– the agglomerative hierarchical clustering method (for example, Ward’s hierarchical clustering method [WAR 63]).

Besides a separate usage of each of them, one can combine both methods if they are based on the same criterion function (namely, use the same dissimilarity measure). The leaders method can be used with large data sets; however, the number of clusters has to be prespecified. Applying the compatible (based on the same criterion function) hierarchical method to a sample can help to determine the expected number of clusters. A compatible hierarchical clustering method can also be used after the leaders method on its resulting clustering to uncover the structure of the clustering and the number of clusters. There are two basic choices in the leaders method:
– how we select a representation of units, clusters, and cluster representatives;
– which dissimilarity measures we use between units, between clusters, and between a unit and a cluster representative.

The main aim of our adapted methods is to obtain optimal clusters’ representatives that resolve the following issues:


– To consider demographic structure as SO: the optimal cluster representative should be meaningful (interpretable), namely, it should represent the demographic structure of the population of all units from the cluster. This was the motivation for the inclusion of weights into the representation of SOs and into the clustering criterion function.
– Patents’ citation data set: the error measure/dissimilarity should consider all component values of a variable equally (for example, squared Euclidean distance favors the largest component value). This was the motivation for proposing alternative dissimilarities.

Our approach is based on the additive model. The criterion function $P(\mathbf{C})$ is the sum of all cluster errors. The error of a cluster $p(C)$ is the sum of dissimilarities of its units from the cluster’s optimal representative, the leader $T_C$:

$$P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p(C) \quad \text{where} \quad p(C) = \sum_{X \in C} d(X, T_C).$$

The set of feasible partitions $\Phi$ is a set of partitions into $k$ clusters of a finite set of units $\mathbf{U}$. We assume that a leader has the same structure of description as SOs (see [10.1]), i.e., it is represented with nonnegative vectors $t_i$ of the size $k_i$ for each variable $V_i$; its representation space is $\mathcal{T} = (\mathbb{R}_0^+)^{k_1} \times (\mathbb{R}_0^+)^{k_2} \times \cdots \times (\mathbb{R}_0^+)^{k_m}$. For a given representative $T \in \mathcal{T}$ and a cluster $C$, we define the cluster error with respect to $T$:

$$p(C, T) = \sum_{X \in C} d(X, T),$$

where $d$ is the selected dissimilarity measure. The best representative, the leader $T_C$, is then the one that minimizes the sum of errors within the cluster:

$$T_C = \arg\min_{T} p(C, T).$$

Then, we define

$$p(C) = p(C, T_C) = \min_{T} \sum_{X \in C} d(X, T).$$

A dissimilarity measure between SOs and $T$ is defined as a weighted average (convex combination)

$$d(X, T) = \sum_{i=1}^{m} \alpha_i\, d_i(x_i, t_i), \qquad \alpha_i \ge 0, \qquad \sum_{i=1}^{m} \alpha_i = 1,$$

where $\alpha_i$ are weights for variables. They allow the user to specify the importance of the variables. If not determined otherwise, they are all set to $\alpha_i = \frac{1}{m}$.


For each variable, we set

$$d_i(x_i, t_i) = \sum_{j=1}^{k_i} w_{x_{ij}}\, \delta(p_{x_{ij}}, t_{ij}), \qquad w_{x_{ij}} \ge 0,$$

where $w_{x_{ij}}$ are weights for each variable’s component and $\delta$ is a basic dissimilarity. The adapted clustering methods are implemented in the R package clamix [BAT 19] that supports the clustering of (very) large data sets of mixed (measured in different scales) units. The basic dissimilarities $\delta$ included in the R package clamix are described in Batagelj et al. [BAT 15a]. For example, the selection of $\delta(p_x, t) = (p_x - t)^2$ represents an extension of the squared Euclidean distance on SOs described with discrete distributions. Five other proposed basic dissimilarities for $\delta$ represent relative error measures proposed by Kejžar et al. [KEJ 11], extended on SOs. The new leader $T_C$ of the cluster $C$ is determined with

$$T_C = \arg\min_{T} \sum_{X \in C} d(X, T) = \arg\min_{T} \sum_{X \in C} \sum_{i=1}^{m} \alpha_i\, d_i(x_i, t_i) = \arg\min_{T} \sum_{i} \alpha_i \sum_{X \in C} d_i(x_i, t_i) = \Big[ \arg\min_{t_i} \sum_{X \in C} d_i(x_i, t_i) \Big]_{i=1}^{m},$$

where $t_i = [t_{i1}, \ldots, t_{ik_i}]$. The solution $t_i$ of the obtained optimization problem depends on the nature of the selected basic dissimilarity $\delta$. To make the adapted leaders method and the adapted agglomerative hierarchical clustering method compatible, the dissimilarity $D(C_u, C_v)$ in the agglomerative hierarchical clustering is determined by the following formula

$$D(C_u, C_v) = p(C_u \cup C_v) - p(C_u) - p(C_v).$$

The dissimilarity between the two clusters is the same as the cluster error of the merged cluster diminished by both the cluster errors. For the dissimilarity $\delta = (p_x - t)^2$, we get the generalization of the Ward hierarchical method

$$D(C_u, C_v) = \sum_{i=1}^{m} \alpha_i \sum_{j=1}^{k_i} \frac{w_{u_{ij}} \cdot w_{v_{ij}}}{w_{u_{ij}} + w_{v_{ij}}}\, (u_{ij} - v_{ij})^2,$$

where $u_{ij}$ and $v_{ij}$ are the leader’s components of the clusters $C_u$ and $C_v$, respectively, i.e.,

$$u_{ij} = \frac{1}{w_{u_{ij}}} \sum_{X \in C_u} w_{x_{ij}} \cdot p_{x_{ij}}, \qquad w_{u_{ij}} = \sum_{X \in C_u} w_{x_{ij}}, \quad \text{and}$$

$$v_{ij} = \frac{1}{w_{v_{ij}}} \sum_{X \in C_v} w_{x_{ij}} \cdot p_{x_{ij}}, \qquad w_{v_{ij}} = \sum_{X \in C_v} w_{x_{ij}}.$$
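The compatibility of the two methods for the squared basic dissimilarity can be checked numerically. The sketch below is a simplified illustration and not the clamix implementation: it assumes a single symbolic variable (m = 1, so α = 1), hypothetical weights and distributions, and hypothetical function names. It computes the leader as the weighted mean per component, the cluster error, and the Ward-type dissimilarity, and compares the latter with p(Cu ∪ Cv) − p(Cu) − p(Cv).

```python
from typing import List, Sequence, Tuple

# One unit, one symbolic variable: (component weights w_xj, distribution p_xj).
Unit = Tuple[Sequence[float], Sequence[float]]


def leader(cluster: Sequence[Unit]) -> Tuple[List[float], List[float]]:
    """Return (w_C, t): summed weights and the weighted-mean leader per component."""
    k = len(cluster[0][1])
    w_sum = [sum(w[j] for w, _ in cluster) for j in range(k)]
    t = [
        sum(w[j] * p[j] for w, p in cluster) / w_sum[j] if w_sum[j] > 0 else 0.0
        for j in range(k)
    ]
    return w_sum, t


def cluster_error(cluster: Sequence[Unit]) -> float:
    """p(C): sum over units and components of w * (p - t)^2, with t the leader."""
    _, t = leader(cluster)
    return sum(w[j] * (p[j] - t[j]) ** 2 for w, p in cluster for j in range(len(t)))


def ward_dissimilarity(cu: Sequence[Unit], cv: Sequence[Unit]) -> float:
    """D(Cu, Cv) for delta = (p - t)^2 and a single variable."""
    wu, u = leader(cu)
    wv, v = leader(cv)
    return sum(
        wu[j] * wv[j] / (wu[j] + wv[j]) * (u[j] - v[j]) ** 2
        for j in range(len(u))
        if wu[j] + wv[j] > 0
    )


# Hypothetical clusters with three categories per unit.
cu = [([2, 2, 2], [0.2, 0.3, 0.5]), ([1, 1, 1], [0.5, 0.3, 0.2])]
cv = [([3, 3, 3], [0.1, 0.1, 0.8])]
merge_cost = cluster_error(cu + cv) - cluster_error(cu) - cluster_error(cv)
d_uv = ward_dissimilarity(cu, cv)
```

For these inputs `d_uv` and `merge_cost` coincide up to rounding, which is what makes it possible to continue a leaders clustering with the hierarchical method under the same criterion function.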

A detailed definition of the methods’ compatibility and the derivations for the leaders and for the dissimilarities $D(C_u, C_v)$ can be found in Batagelj et al. [BAT 15a].

10.3.1. TIMSS – study of teaching approaches

For studying teaching approaches, we used the TIMSS – Trends in International Mathematics and Science Study open data set [IEA 04], [TIM 04] for the years 1999 and 2003 (joint work with Barbara Japelj Pavešić, National Coordinator of the International Research of Trends in Knowledge of Mathematics and Science for Slovenia, The Educational Research Institute, Slovenia). The aim of the study was to find groups of teachers with similar teaching approaches, where we combined the data set of teachers’ answers with the data set of students’ answers. The data set for the year 2003 includes a sample consisting of 6,552 teachers and 131,000 students, representing more than 10 million students of the 8th grade in 30 countries. All answers in the questionnaire were categorized (including age). The data description used can be explained in a more general framework, i.e., as the so-called ego-centered or personal networks (see the basic scheme in Figure 10.1), which are rather common in social sciences. The ego-centered network consists of two related data sets: egos and alters. Each unit in the first data set, i.e., ego, can be related to different units from the second data set, i.e., alters.

Figure 10.1. Ego-centered network


In our study, units of analysis were teachers, described by their variables: gender, age, education, their work in classes, pedagogical approaches used in the class, opinions about mathematics, classroom activities, the use of IT, and the issues on homework. For each teacher, there was also a description for distributions of students’ answers, describing students’ attitudes toward mathematics, such as valuing math, enjoying learning, and self-confidence in mathematics, and activities in class such as students’ use of IT, their participation in learning mathematics, and their strengths in mathematics. These values were collected with separate questionnaires for students.

Figure 10.2. Teacher–students ego-centered network (Source: [KOR 11])

We combined both data descriptions into one $SO(X) = [X, A(X)]$, where the units were described with ego $X$ and alters variables $A(X)$ as a symbolic object. For example, the SO description of the teacher with id 4567 is

$$SO_{4567} = \big[\, \underbrace{(1, [0,0,0,1])}_{T_1},\ \underbrace{(1, [0,0,0,0,1,0])}_{T_2},\ \ldots,\ \underbrace{(100, [0.47, 0.16, 0.37, 0, 0])}_{S_1},\ \underbrace{(100, [0,0,1,0])}_{S_2},\ \ldots \,\big],$$

where teacher variables ($T_1, T_2, \ldots$) have only a singular value, but the alters (students’) variables $S_1, S_2, \ldots$ contain distributions of students’ answers. Most of them are distributed over the following four subsets: 1 = strongly agree, 2 = agree, 3 = disagree and 4 = strongly disagree, which express how much they agree/disagree with the statement that is considered as a student variable.


One hundred and one variables were included in the clustering process: 77 from the teachers’ and 24 from the students’ questionnaire. The adapted hierarchical method with squared Euclidean distance was used (without weights). We identified five main clusters. One of them contains units with mostly missing values. Teachers in the other clusters differ in the usage of computers and calculators in their lectures, in assigning and monitoring homework and in testing the knowledge of their students. We also examined whether the obtained clusters are linked to other variables that were not included in the clustering process, such as students’ achievements and the teacher’s country of origin. For example, in the TIMSS study, students are assigned to different benchmark levels of mathematical knowledge. The distribution of students reaching benchmarks for four clusters (cluster 2 with missing answers was omitted) is presented in Figure 10.3.

Figure 10.3. Benchmark levels of mathematics achievement reached by students (source: [KOR 11])

Additional details on the obtained results can be found in [KOR 11].

10.3.2. Clustering countries based on age–sex distributions of their populations

On the web, data on age–sex distributions (population pyramids) are openly available for countries and, for many countries, also for their administrative units. Although the population pyramid is simple and easy to understand, it well


reflects characteristics of the observed time and region. It is mostly influenced by population processes (fertility, mortality and migration), and policies (social and political) can also have a strong influence on its shape, e.g., birth control in China, wars, and lifestyle. Because of this, population pyramids are often connected with the developing stage of the represented regions. The base for the graphical representation with the population pyramid is age–sex distribution of the population of the particular region in the particular time. We considered age–sex distribution as SO in the following way: each age–sex distribution (population pyramid) is described with two vectors of frequencies (one for each gender), representing the distributions of men/women by age. A region $X$ (world country, US county, municipality, and sub-national area) represented with population pyramids (age–sex distributions of the population) and cluster of regions $C_u$ is described with two symbolic variables:

$$X = [(n_{xM}, p_{xM});\ (n_{xF}, p_{xF})], \qquad C_u = [(n_{uM}, p_{uM});\ (n_{uF}, p_{uF})],$$

where $n_M$ is the number of men, $p_M$ is the vector of relative frequencies of men over age groups, $n_F$ is the number of women (female), and $p_F$ is the vector of relative frequencies of women over age groups. For example, the population of Ljubljana on July 1, 2011 was split into three economic age groups 0–19, 20–64 and 65+, where $n_{LjM}$ = 134,410 men and $n_{LjF}$ = 145,488 women, and the corresponding frequency distributions over the economic age groups are [25,396, 90,466, 18,548] for men and [24,204, 91,899, 29,385] for women. The description of the corresponding SO (see expression [10.2]) is

$$X_{Lj} = \big[\, \underbrace{(134\,410,\ [0.189, 0.673, 0.138])}_{\text{men: } (n_{LjM},\ p_{LjM})};\ \underbrace{(145\,488,\ [0.166, 0.632, 0.202])}_{\text{women: } (n_{LjF},\ p_{LjF})} \,\big].$$

For the cluster $C_u$, it holds

$$n_{ui} = \sum_{X \in C_u} n_{xi} \quad \text{and} \quad p_{ui} = \frac{1}{n_{ui}} \sum_{X \in C_u} n_{xi} \cdot p_{xi}, \qquad i = M, F.$$

The dissimilarity between clusters $C_u$ and $C_v$ is in this case rewritten as

$$D(C_u, C_v) = \frac{1}{2} \left( \frac{n_{uM} \cdot n_{vM}}{n_{uM} + n_{vM}}\, \|p_{uM} - p_{vM}\|^2 + \frac{n_{uF} \cdot n_{vF}}{n_{uF} + n_{vF}}\, \|p_{uF} - p_{vF}\|^2 \right).$$
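The Ljubljana description and the dissimilarity above can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, the Ljubljana frequencies are the ones quoted in the text, and the second region is invented for illustration. A single region is treated as a one-unit cluster.

```python
from typing import List, Sequence, Tuple

# [(n_M, p_M), (n_F, p_F)]: sizes and relative age distributions for men and women.
Pyramid = Tuple[Tuple[int, List[float]], Tuple[int, List[float]]]


def pyramid(men: Sequence[int], women: Sequence[int]) -> Pyramid:
    """Build the SO description of a region from age-group frequencies."""
    n_m, n_f = sum(men), sum(women)
    return (n_m, [x / n_m for x in men]), (n_f, [x / n_f for x in women])


def squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def dissimilarity(x: Pyramid, y: Pyramid) -> float:
    """Size-weighted squared Euclidean dissimilarity, averaged over both genders."""
    total = 0.0
    for (n_x, p_x), (n_y, p_y) in zip(x, y):
        total += n_x * n_y / (n_x + n_y) * squared_distance(p_x, p_y)
    return 0.5 * total


# Frequencies quoted in the text for Ljubljana on July 1, 2011.
ljubljana = pyramid([25396, 90466, 18548], [24204, 91899, 29385])
# Hypothetical second region, used only to have something to compare with.
other = pyramid([3000, 6000, 1000], [2800, 6000, 1700])

men_rounded = [round(v, 3) for v in ljubljana[0][1]]
women_rounded = [round(v, 3) for v in ljubljana[1][1]]
d_lj_other = dissimilarity(ljubljana, other)
```

Rounding the relative frequencies to three decimals gives the vectors shown in the description of Ljubljana above. Because the sizes enter the dissimilarity as weights, a difference in shape between two large regions counts for more than the same difference between two small ones.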


We used the adapted hierarchical clustering method with weighted squared Euclidean distance as dissimilarity for the applications on several open data sets:
– Slovenian municipalities on July 1, 2011, where data were obtained from the National Statistical Office of the Republic of Slovenia. Analyses were made with the original data with 21 five-year groups (0–4 years, 5–9 years, 10–14 years, . . . , 95–99 years, 100+) and also with data aggregated into three economic age groups (0–14 years, 15–64 years, and 65+ years) (joint work with Jože Sambt, University of Ljubljana, Faculty of Economics [KOR 12]);
– population pyramids of the world countries obtained from the International Data Base (IDB) [US 08]. The data are divided into 17 five-year groups (0–4 years, 5–9 years, 10–14 years, . . . , 75–79 years, 80+), where we also observed time changes with the 5-year time-lag (for years 1996, 2001 and 2006);
– US counties from US Census 2000 Summary File 1, prepared by U.S. Census Bureau [US 11]. Data for the year 2000 include 3,219 US counties. The data for the year 2010 include 3,221 US counties with the additional variable ethnicity that was included in our analysis;
– Brazilian municipalities with IBGE – Brazilian Institute of Geography and Statistics data [BRA 14], where we analyzed 5,570 municipalities for 2010. We used data descriptions based on the age–sex structures and also on age–area (urban/rural) structures;
– sub-national areas in Latin America and the Caribbean with IPUMS dataset of census microdata from 1960 to 2011 (joint work with Ludi Simpson, University of Manchester, UK).
The main characteristic of the adapted clustering method based on squared Euclidean distance is that with the inclusion of sizes as weights for each variable (the number of men/women) into the clustering process, the obtained optimal cluster representative is again age–sex distribution of the region determined by the corresponding cluster (thus, we get meaningful cluster representative). Note, however, that age groups are considered as categories (not intervals and without ordering). The main aim of the analysis of the countries based on their age–sex distributions, obtained with the adapted clustering methods, was to identify groups of countries with similar age–sex structures and to identify groups of countries with similar structural changes over time [KOR 15]. In order to achieve a relevant comparison, 215 of the countries for which data were available at all three time points (1996, 2001, and 2006) were included in the analysis. With the symbolic data descriptions of the population age–sex distributions, we save complete information about distributions, and with the inclusion of the sizes as weights into the clustering process, we obtained meaningful optimal cluster representatives, i.e., age–sex distributions of the population included in the countries inside the clusters. We identified four main clusters for each of the observed years. Their shapes rather


well reflect basic demographical developing stages. Clusters are studied in detail also for partitions at lower levels.

Figure 10.4. Four main clusters obtained from 215 of the countries for the years 1996, 2001 and 2006 with their population pyramids, number of countries in each cluster, and overall population (source: [KOR 15])

Observation over time showed that the shapes of the age–sex structures of the countries mostly changed from the more expansive shape (with a large number of people in young ages and fast decline for older people) to a stationary or even constrictive shape that usually expresses a more developed stage (with lower fertility and mortality rates and longer life expectancy). Further observation of the time changes revealed five main clusters of similar time changes of population age–sex distributions over the observed time. More details about the results can be found in [KOR 15]. The application to sub-national age–sex distributions from Latin America and the Caribbean [KOR 17] was a part of a wider project comparing sub-national demographic development in Latin America and the Caribbean [SIM 16]. The main focus of the study was to examine sub-national time series of age–sex structures for many countries in Latin America and the Caribbean, to summarize the diversity and the socio-demographic associates of changing age–sex structures, and to identify and characterize the development of those age–sex structures over time, useful to the practice of demographic projections. As a clustering result, we identified four main shapes for the population pyramids that are strongly related to the additional sociodemographic indicators for clusters’ descriptions. Most of the time movements of the observed regions were from clusters with indicators expressing less developed stages


to more developed ones. Observation of population pyramids over time revealed that the shape of the age–sex structure of some of the areas significantly changed over the observed time period from 1960 to 2010 [e.g., Federal District (Brazil)]. The changes can be explained with the additional knowledge of special circumstances in this area. The clustering method also revealed some areas with rather unusual shapes that require a more detailed study of the data and of social and political situations in the observed area and time. Dissimilarities among structures in different decades indicate that age–sex structures of the observed areas become more similar over time.

10.4. Generalized ANOVA

ANalysis Of VAriance (ANOVA) is one of the most common statistical approaches for detecting “differences” among groups/clusters. It is based on
– squared Euclidean distance;
– sum-of-squares decomposition equality: $SS_T = SS_B + SS_W$, where $SS_T$ stands for total sum of squares (deviations of the values around the total mean), $SS_B$ for between-group sum of squares (deviations of group means around total mean), and $SS_W$ for within-group sum of squares (sum of the deviations of values around group mean, i.e., sum of the group errors);
– assumptions about distributions: normal distributions with equal variances.

We are interested in an extension/adaptation of the ANOVA method with a general measure of spread. We present here our basic ideas about a possible generalization of the standard approach that enables a more general usage. We propose to combine some available theoretical results for each of the following three main steps:
1) selection of an appropriate measure of spread;
2) construction of a test statistic for non-parametric multivariate analysis; and
3) calculation of a P-value.

The sum of squares of the group can also be viewed as the error of the group, denoted by $p(C)$ (see the previous section), or the inertia, sometimes denoted by $I(C)$.
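From mechanics, we know the Huygens theorem

$$I_T = I_B + I_W,$$

where $I_T$ stands for total inertia, $I_B$ for between-group inertia, and $I_W$ for within-group inertia. For the basic dissimilarity $\delta = (p_x - t)^2$, the total inertia is

$$I_T = \sum_{X \in \mathbf{U}} d(X, T_{\mathbf{U}}),$$

where

$$d(X, T_{\mathbf{U}}) = \sum_{i} \alpha_i\, d_i(x_i, t_{\mathbf{U}i}) = \sum_{i} \alpha_i\, w_{x_i} \|p_{x_i} - t_{\mathbf{U}i}\|^2, \quad \text{and} \quad t_{\mathbf{U}i} = \frac{1}{\sum_{X \in \mathbf{U}} w_{x_i}} \sum_{X \in \mathbf{U}} w_{x_i} \cdot p_{x_i},$$

where $\mathbf{U}$ denotes the whole set of units.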


The between inertia is

$$I_B = \sum_{C \in \mathbf{C}} d(T_C, T_{\mathbf{U}}), \quad \text{with} \quad d(T_C, T_{\mathbf{U}}) = \sum_{i} \alpha_i\, w_{C_i} \|t_{C_i} - t_{\mathbf{U}i}\|^2.$$

And the within inertia is

$$I_W = P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p(C) = \sum_{C \in \mathbf{C}} \sum_{X \in C} d(X, T_C),$$

where $d(X, T_C) = \sum_i \alpha_i\, d_i(x_i, t_{C_i}) = \sum_i \alpha_i\, w_{x_i} \|p_{x_i} - t_{C_i}\|^2$, and the representative of variable $i$ is $t_{C_i} = \frac{1}{w_{C_i}} \sum_{X \in C} w_{x_i} \cdot p_{x_i}$, where $w_{C_i} = \sum_{X \in C} w_{x_i}$.

With a general dissimilarity, using the ideas from [BAT 88]:

1. Define the cluster error $p(C)$

$$p(C) = \frac{1}{2 \cdot w(C)} \sum_{X \in C} \sum_{Y \in C} w(X) \cdot w(Y) \cdot d(X, Y),$$

which is a generalization of the classical formula for the squared Euclidean distance

$$p(C) = \frac{1}{2 \cdot n_C} \sum_{X \in C} \sum_{Y \in C} \|X - Y\|^2.$$

2. Introduce a generalized (possibly imaginary) center $\tilde{C}$ of a cluster $C$ defined with the extension of a dissimilarity to units and cluster centers

$$d(Y, \tilde{C}) = d(\tilde{C}, Y) = \frac{1}{w(C)} \Big( \sum_{X \in C} w(X) \cdot d(X, Y) - p(C) \Big),$$

where $Y$ is a unit or a cluster center. The definition of the generalized center is based on the classical formula for the center $\bar{C} = \arg\min_{Y} \sum_{X \in C} \|X - Y\|^2$ with the squared Euclidean distance and the equality

$$\|Y - \bar{C}\|^2 = \frac{1}{n_C} \sum_{X \in C} \big( \|X - Y\|^2 - \|X - \bar{C}\|^2 \big).$$

3. The generalized Huygens theorem holds: IT = IB + IW ,


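where

$$I_T = p(\mathbf{U}) = \frac{1}{2 \cdot w(\mathbf{U})} \sum_{X, Y \in \mathbf{U}} w(X) \cdot w(Y) \cdot d(X, Y),$$

$$I_W = \sum_{C \in \mathbf{C}} p(C),$$

$$I_B = \sum_{C \in \mathbf{C}} w(C) \cdot d(\tilde{C}, \tilde{\mathbf{U}}) = I_T - I_W.$$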
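The three steps above only need a dissimilarity matrix and unit weights. As a hedged sketch (the function names and the toy data are hypothetical and not part of clamix or TraMineR), the inertias can be computed as follows; the between inertia is obtained as the difference of the total and within inertia, as in the generalized Huygens theorem.

```python
from typing import Iterable, List, Sequence, Tuple


def cluster_error(d: Sequence[Sequence[float]], w: Sequence[float],
                  members: Sequence[int]) -> float:
    """p(C) = 1 / (2 w(C)) * sum over X, Y in C of w(X) w(Y) d(X, Y)."""
    w_c = sum(w[i] for i in members)
    pair_sum = sum(w[i] * w[j] * d[i][j] for i in members for j in members)
    return pair_sum / (2.0 * w_c)


def inertias(d: Sequence[Sequence[float]], w: Sequence[float],
             clusters: Iterable[Sequence[int]]) -> Tuple[float, float, float]:
    """Return (I_T, I_W, I_B) with I_B = I_T - I_W."""
    i_t = cluster_error(d, w, range(len(w)))
    i_w = sum(cluster_error(d, w, c) for c in clusters)
    return i_t, i_w, i_t - i_w


# Toy check with squared Euclidean distance on four points of the real line,
# where the classical sums of squares are easy to compute by hand.
points: List[float] = [0.0, 2.0, 10.0, 12.0]
d = [[(a - b) ** 2 for b in points] for a in points]
w = [1.0] * len(points)
i_t, i_w, i_b = inertias(d, w, [[0, 1], [2, 3]])
```

For this toy data the result matches the classical decomposition: the total sum of squares around the mean 6 is 104, the within-group sum of squares is 4 and the between-group sum of squares is 100.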

A problem might occur since the extended “dissimilarity” $d(Y, \tilde{C})$ between the (imaginary) center and each unit is not necessarily nonnegative for every dissimilarity. In [BAT 88], it is shown that the triangle inequality is a sufficient condition for the extended dissimilarity $d(Y, \tilde{C})$ to be nonnegative. Therefore, in the next step, we show how it is possible to produce a dissimilarity from a general one that can be used in the generalized ANOVA process. According to [JOL 86], for a general dissimilarity measure $d$, there exists a unique nonnegative real number $p$, called the metric index, such that $d^{\alpha}$ is a metric$^1$ for all $\alpha \le p$, and $d^{\alpha}$ is not a metric for all $\alpha > p$. If a dissimilarity $d$ is not a metric, it can be transformed into one using the power transformation. Therefore, we can first find the metric index $p$ of an arbitrarily chosen dissimilarity $d$ and in the generalized Huygens theorem
– use $d$ if $p \ge 1$;
– otherwise (if $p < 1$) use $d^p$.

The test statistic for the generalized ANOVA that we used is in line with the approach of Anderson and McArdle [AND 01], [MCA 01], which was applied to ecology data, and the approach of Studer et al. [STU 11], applied to the life trajectory analysis. The construction of their test statistic is based on the ratio of sums of squares as in the classical ANOVA:

$$F = \frac{I_B / (m - 1)}{I_W / (n - m)},$$

where $m$ is the number of clusters and $n$ is the number of units. The sums of squares are substituted by the generalized inertias $I_B$ and $I_W$, respectively. Since the distribution of $F$ is in the case of different dissimilarities not necessarily the F-distribution, P-values are calculated by a nonparametric (permutation) method.
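To make the permutation idea concrete, here is a small Python sketch of the pseudo-F statistic computed from a dissimilarity matrix and a permutation P-value. It is not the TraMineR dissassoc procedure: it assumes unit weights, uses hypothetical function names and toy data, and adopts the common (hits + 1) / (permutations + 1) convention for the P-value, which is an assumption of this example. It also assumes that the within inertia is positive.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def error(d: Sequence[Sequence[float]], members: Sequence[int]) -> float:
    """Unweighted cluster error: sum of pairwise dissimilarities / (2 * size)."""
    return sum(d[i][j] for i in members for j in members) / (2.0 * len(members))


def pseudo_f(d: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """F = (I_B / (m - 1)) / (I_W / (n - m)) with I_B = I_T - I_W."""
    n = len(labels)
    groups: Dict[Hashable, List[int]] = {}
    for i, g in enumerate(labels):
        groups.setdefault(g, []).append(i)
    m = len(groups)
    i_t = error(d, range(n))
    i_w = sum(error(d, c) for c in groups.values())
    return ((i_t - i_w) / (m - 1)) / (i_w / (n - m))


def permutation_test(d: Sequence[Sequence[float]], labels: Sequence[Hashable],
                     n_perm: int = 999, seed: int = 0) -> Tuple[float, float]:
    """Return the observed pseudo-F and its permutation P-value."""
    rng = random.Random(seed)
    observed = pseudo_f(d, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign group labels at random
        if pseudo_f(d, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: two well-separated groups on the real line, squared Euclidean distance.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
d = [[(a - b) ** 2 for b in points] for a in points]
f_value, p_value = permutation_test(d, ["a", "a", "a", "b", "b", "b"])
```

With only six units there are few distinct relabelings, so the P-value cannot become very small in this toy case; with real data the number of permutations and the random seed are choices left to the analyst.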

1 Dissimilarity d is metric if besides non-negativity, identity and symmetry, also triangle inequality holds, i.e., for each triple of units X, Y and Z, it holds d(X, Z) ≤ d(X, Y ) + d(Y, Z).


McArdle and Anderson [MCR 01] showed that their method can be used with an arbitrary semimetric measure. Here, we add that the method can be used with a general dissimilarity measure as long as the dissimilarity between (imaginary) center C˜ and each unit is nonnegative. The nonnegativity can be achieved by the application of the metric index on the dissimilarity matrix just before using the nonparametric method. The computations for the generalized ANOVA from dissimilarity measures were made by using the procedure dissassoc from the R package TraMineR by Studer et al. [STU 11]. It computes and tests the rate of discrepancy (defined from a dissimilarity matrix) explained by categorical variable(s). To demonstrate the proposed approach, we performed these steps on the data of the countries described with the age–sex structures of their population for the year 2005. The data were obtained from the International Data Base (IDB) for the year 2005 [US 08]. Populations are divided into 17 five-year groups (0–4 years, 5–9 years, 10–14 years, ..., 75–79 years, 80+) for each gender. Unit representation is based on the same symbolic data analysis approach as in the application of population pyramids: data representation with two vectors, i.e., distributions of men/women over age-groups. Groups were determined by the Human Development Index (HDI) found in The United Nations Development Program (UNDP) [UN 15b], which was developed by Pakistani economist Mahbub ul Haq to emphasize the importance of people, not only economy, for human development. It is a summary measure of the average achievement in key dimensions of human development: a long and healthy life, indicated by life expectancy at birth, being knowledgeable, considering mean years of schooling and expected years of schooling, and having a decent standard of living, where measurement is based on GNI (Gross National Income) per capita. 
We calculated the dissimilarity between countries X and Y with the formula

$$d(X, Y) = \frac{1}{2}\left( \frac{n_{xM} \cdot n_{yM}}{n_{xM} + n_{yM}} \, \|p_{xM} - p_{yM}\|^2 + \frac{n_{xF} \cdot n_{yF}}{n_{xF} + n_{yF}} \, \|p_{xF} - p_{yF}\|^2 \right).$$

Since we used the squared Euclidean distance, the extended dissimilarity $d(U, \tilde{C})$ is nonnegative and we would not need to calculate the metric index. We do this here for demonstration purposes. The metric index for the obtained dissimilarity matrix is p = 0.06438. In the process, we used $d^p$ instead of d.

The HDI is used to rank countries by human development in the annual Human Development Reports prepared and published by The United Nations Development Program (UNDP). Data for the year 2005 from Table 2: Human Development


Index trends, 1980–2013, include 173 of the world's countries [UN 15a]. Three classes for the year 2005 (The Human Development Report 2007/2008) were determined as:
– high (HDI 0.800 or more);
– medium (HDI from 0.500 to 0.799);
– low (HDI below 0.500).

The results of the generalized ANOVA for 173 countries for the year 2005, based on $d^p$ and three HDI classes (obtained by using the clamix program to calculate the dissimilarities between countries and TraMineR's procedure dissassoc to compute the inertias and the F-value), are

$$I_T = 155.05, \quad I_W = 144.48, \quad I_B = I_T - I_W = 10.57,$$

$$F = \frac{I_B/(m - 1)}{I_W/(n - m)} = \frac{10.57/(3 - 1)}{144.48/(173 - 3)} = 6.22.$$
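The dissimilarity d(X, Y) between two countries that underlies these results can be sketched in Python as follows. This is a minimal illustration, not the clamix implementation. It assumes that p_xM and p_xF are the relative frequencies of men and women of unit X over the age groups and that n_xM and n_xF are the corresponding total counts, which is our reading of the notation. The function names and the toy counts (three age groups instead of 17) are hypothetical.

```python
def distribution(counts):
    """Turn age-group counts into relative frequencies (a discrete distribution)."""
    total = sum(counts)
    return [c / total for c in counts]


def squared_distance(p, q):
    """Squared Euclidean distance between two distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weight(n_x, n_y):
    """Weight n_x * n_y / (n_x + n_y) applied to each squared distance."""
    return n_x * n_y / (n_x + n_y)


def dissimilarity(men_x, women_x, men_y, women_y):
    """d(X, Y) for two units given counts of men and women per age group."""
    n_xm, n_ym = sum(men_x), sum(men_y)
    n_xf, n_yf = sum(women_x), sum(women_y)
    term_m = weight(n_xm, n_ym) * squared_distance(
        distribution(men_x), distribution(men_y)
    )
    term_f = weight(n_xf, n_yf) * squared_distance(
        distribution(women_x), distribution(women_y)
    )
    return 0.5 * (term_m + term_f)


# Hypothetical counts for two units and three age groups.
men_x, women_x = [30, 50, 20], [25, 50, 25]
men_y, women_y = [10, 60, 30], [20, 60, 20]

print(dissimilarity(men_x, women_x, men_y, women_y))
```

The weights make a difference in shape count more when both units are large, so two small units with different age structures are closer than two large units with the same difference in their distributions.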

The results obtained show a larger discrepancy between groups than within them (P-value = 0.0002; the P-value for the world country example is the same with or without the metric index). This indicates that there are noticeable differences in the age–sex structure of the population between the groups of countries determined by their HDI.

10.5. Conclusion

Open data are very often available in aggregated form and can be considered as so-called second-level units. To preserve the internal variation of the original (primary) units, these second-level units need a more complex representation of the aggregated values than the usual single mean value. SOs provide such a description, and SDA methods can be used to analyze them. The institutions that offer open data are, therefore, invited to produce/release the aggregated data in the form of SOs.

In this chapter, we presented adapted clustering methods for second-level units that were motivated by the analysis of some open data sets. The main aim of the presented methods is to produce meaningful (informative) optimal cluster representatives. In order to obtain the desired properties of optimal cluster representatives, we proposed some alternative dissimilarity measures between second-level units represented with empirical discrete (membership) distributions and the inclusion of weights. We demonstrated their usage with applications to the TIMSS open database and to demographic age–sex structures on different sets of territorial units.

In order to study differences among pre-specified groups of units, we presented an approach to generalize ANOVA with the following two main advantages: (1) it can


be used with any dissimilarity measure and (2) it is nonparametric: it has no a priori assumptions about variable distributions.

NOTE.– This work was partially supported by the Slovenian Research Agency, Programmes P1-0294 and P3-0154 and by Russian Academic Excellence Project ‘5-100’.

10.6. References

[AND 73] ANDERBERG M., Cluster Analysis for Applications, Academic Press, New York, 1973.
[AND 01] ANDERSON M.J., “A new method for non-parametric multivariate analysis of variance”, Austral Ecology, vol. 26, no. 1, pp. 32–46, 2001.
[ARI] ARISTOTLE, “The Organon”. Available at: https://archive.org/details/AristotleOrganon.
[BAT 88] BATAGELJ V., “Generalized Ward and related clustering problems”, in BOCK H.H. (ed.), Classification and Related Methods of Data Analysis, North-Holland, Amsterdam, pp. 67–74, 1988.
[BAT 19] BATAGELJ V., KEJŽAR N., “Clamix – clustering symbolic objects R package”. Available at: https://r-forge.r-project.org/projects/clamix/, 2019.
[BAT 15a] BATAGELJ V., KEJŽAR N., KORENJAK-ČERNE S., “Clustering of modal valued symbolic data”, ArXiv e-prints 1507.06683, July 2015.
[BAT 15b] BATAGELJ V., KEJŽAR N., KORENJAK-ČERNE S., “Generalized ANOVA for SDA”, SDA Workshop 2015, University of Orléans, November 17–19, 2015. Available at: http://www.univ-orleans.fr/mapmo/colloques/sda2015/, pp. 45–46, 2015.
[BIL 06] BILLARD L., DIDAY E., Symbolic Data Analysis: Conceptual Statistics and Data Mining, John Wiley, Chichester, 2006.
[BRA 14] BRAZILIAN INSTITUTE OF GEOGRAPHY AND STATISTICS (IBGE), “Census 2000 Summary File 1 [Data file]”. Available at: http://www.cidades.ibge.gov.br/xtras/home.php, accessed 2014.
[BRI 14] BRITO P., “Symbolic data analysis: another look at the interaction of data mining and statistics”, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 4, pp. 281–295, 2014.
[DID 79] DIDAY E., Optimisation en classification automatique, Institut national de recherche en informatique et en automatique, Rocquencourt, 1979.
[DID 16] DIDAY E., “Thinking by classes in data science: the symbolic data analysis paradigm”, Wiley Interdisciplinary Reviews: Computational Statistics, vol. 8, no. 5, pp. 172–205, 2016.
[EUR 17] EUROPEAN UNION, “The European Union open data portal”. Available at: https://data.europa.eu/euodp/en/home, 2017.

[HAR 75] HARTIGAN J., Clustering Algorithms, Wiley-Interscience, New York, 1975.
[IEA 04] IEA – INTERNATIONAL ASSOCIATION FOR EVALUATION OF EDUCATIONAL ACHIEVEMENT, “IEA website, data repository for TIMSS data”. Available at: http://www.iea.nl/data.html, accessed 2004.
[JOL 86] JOLY S., LE CALVÉ G., “Etude des puissances d’une distance”, Statistique et Analyse de Données, pp. 30–50, North-Holland, Amsterdam, 1986.
[KEJ 11] KEJŽAR N., KORENJAK-ČERNE S., BATAGELJ V., “Clustering of distributions: a case of patent citations”, Journal of Classification, vol. 28, no. 2, pp. 156–183, 2011.
[KOR 11] KORENJAK-ČERNE S., BATAGELJ V., JAPELJ PAVEŠIĆ B., “Clustering large data sets described with discrete distributions and its application on TIMSS data set”, Statistical Analysis and Data Mining, vol. 4, no. 2, pp. 199–215, 2011.
[KOR 12] KORENJAK-ČERNE S., BATAGELJ V., SAMBT J. et al., “Hierarchical clustering method for discrete distributions with the case of clustering population pyramids of Slovenian municipalities”, Facing Demographic Challenges: Proceedings of the 15th International Multiconference Information Society IS 2012, October 8–9, 2012, Ljubljana, Slovenia, volume B, (Informacijska družba, ISSN 1581-9973), Institut Jožef Stefan, Ljubljana, pp. 31–35, 2012 (in Slovenian).
[KOR 15] KORENJAK-ČERNE S., KEJŽAR N., BATAGELJ V., “A weighted clustering of population pyramids for the world’s countries, 1996, 2001, 2006”, Population Studies, vol. 69, no. 1, pp. 105–120. Available at: http://www.tandfonline.com/doi/full/10.1080/00324728.2014.954597, 2015.
[KOR 17] KORENJAK-ČERNE S., SIMPSON L., “Clustering age-sex structures to monitor their development over time: Latin America and the Caribbean sub-national areas 1960-2011. In ALAP (La Asociación Latinoamericana de Población) Project report”. Available at: http://www.cmist.manchester.ac.uk/research/projects/s-alyc/, 2017.
[MCA 01] MCARDLE B.H., ANDERSON M.J., “Fitting multivariate models to community data: a comment on distance-based redundancy analysis”, Ecology, vol. 82, no. 1, pp. 290–297, 2001.
[NOI 11] NOIRHOMME-FRAITURE M., BRITO P., “Far beyond the classical data models: symbolic data analysis”, Statistical Analysis and Data Mining, vol. 4, no. 2, pp. 157–170, 2011.
[SIM 16] SIMPSON L., GONZALES L., “Comparative subnational demographic development in Latin America and the Caribbean (s-ALyC)”. Available at: http://www.cmist.manchester.ac.uk/research/projects/s-alyc/, 2016.

[STU 11] STUDER M., RITSCHARD G., GABADINHO A. et al., “Discrepancy analysis of state sequences”, Sociological Methods & Research, vol. 40, no. 3, pp. 471–510, 2011.
[TIM 04] TIMSS & PIRLS INTERNATIONAL STUDY CENTER, BOSTON COLLEGE, LYNCH SCHOOL OF EDUCATION, USA, “TIMSS – Trends in International Mathematics and Science Study open data set. TIMSS 1999 and TIMSS 2003 [Data files]”. Available at: http://timss.bc.edu, accessed 2004.
[UN 15a] UNITED NATIONS, “Human Development data, The United Nations Development Program (UNDP)”. Available at: http://hdr.undp.org/en/data, accessed 2015.
[UN 15b] UNITED NATIONS, “Human development reports, The United Nations Development Program (UNDP)”. Available at: http://hdr.undp.org/en/content/human-development-index-hdi, accessed 2015.
[UN 17] UNITED NATIONS, “The United Nations open data website”. Available at: http://data.un.org/, 2017.
[US 08] U.S. CENSUS BUREAU, “IDB: International Data Base”. Available at: http://www.census.gov/ipc/www/idbnew.html, accessed 2008.
[US 11] U.S. CENSUS BUREAU, “Census 2000 Summary File 1 [Data file]”. Available at: http://factfinder.census.gov/, accessed 2011.
[WAR 63] WARD J., “Hierarchical grouping to optimize an objective function”, Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.
[WB 17] THE WORLD BANK, “The World Bank Open Data”. Available at: http://data.worldbank.org, 2017.

List of Authors

Antonio BALZANELLA
Department of Mathematics and Physics
University of Campania “Luigi Vanvitelli”
Caserta
Italy

Vladimir BATAGELJ
Institute of Mathematics, Physics and Mechanics
Ljubljana
Slovenia

Chun-houh CHEN
Academia Sinica
Taipei
Taiwan

David COMBE
Hubert Curien Laboratory
Lyon University/Jean Monnet University
Saint-Etienne
France

Rodrigo C. DE ARAÚJO
Computer Center
Federal University of Pernambuco
Recife
Brazil

Francisco de Assis Tenorio DE CARVALHO
Computer Center
Federal University of Pernambuco
Recife
Brazil

Edwin DIDAY
CEREMADE
Paris Dauphine-PSL University
France

Richard EMILION
Denis Poisson Institute
University of Orléans
France

Advances in Data Science: Symbolic, Complex and Network Data, First Edition. Edited by Edwin Diday, Rong Guan, Gilbert Saporta and Huiwen Wang. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.


Françoise FOGELMAN-SOULIÉ
Hub France IA
Paris
France

Wen GE
School of Computer Software
Tianjin University
China

Rong GUAN
School of Statistics and Mathematics
Central University of Finance and Economics
Beijing
China

Baptiste JEUDY
Hubert Curien Laboratory
Lyon University/Jean Monnet University
Saint-Etienne
France

Chiun-How KAO
Academia Sinica
Taipei
Taiwan

Nataša KEJŽAR
Faculty of Medicine
University of Ljubljana
Slovenia

Simona KORENJAK-ČERNE
School of Economics and Business
University of Ljubljana
Slovenia

Christine LARGERON
Hubert Curien Laboratory
Lyon University/Jean Monnet University
Saint-Etienne
France

Frédéric LEBARON
Department of Social Sciences
ENS Paris-Saclay
Cachan
France

Yves LECHEVALLIER
INRIA
Rocquencourt
France

Yiming LI
School of Computer Software
Tianjin University
China

Yinglan LI
School of Computer Software
Tianjin University
China

Lanxiang MEI
School of Computer Software
Tianjin University
China

Gilbert SAPORTA
Cédric Lab
National Conservatory of Arts and Crafts (CNAM)
Paris
France


Rosanna VERDE
Department of Mathematics and Physics
University of Campania “Luigi Vanvitelli”
Caserta
Italy

Huiwen WANG
Beijing Advanced Innovation Center for Big Data and Brain Computing
School of Economics and Management
Beihang University
Beijing
China

Jing WANG
Hubert Curien Laboratory
Lyon University/Jean Monnet University
Saint-Etienne
France

Siyang WANG
School of Statistics and Mathematics
Central University of Finance and Economics
Beijing
China


Yuan WEI
Beijing Advanced Innovation Center for Big Data and Brain Computing
School of Economics and Management
Beihang University
Beijing
China

Han-Ming WU
Department of Statistics
National Taipei University
New Taipei City
Taiwan

Qiaofei YE
School of Computer Software
Tianjin University
China

Jianyu ZHANG
School of Computer Software
Tianjin University
China

Index

A, C aggregation, 50, 63, 65–68 ANOVA (analysis of variance), 209, 210, 221, 223–225 attributed graph, 169, 170, 172, 174, 179, 181 canonical analysis, 3, 5, 11, 14, 22, 25 clamix, 214, 225 class, 31–34, 36, 37, 39, 41, 43–46 clustering analysis, 189 dynamic, 202 graph nodes, 169 k-means, 190, 191, 197, 202–206 micro-, 102–104, 106, 107, 110, 112, 113 partitioning, 189–191, 193–202, 205, 206 community detection, 169–172, 174, 180, 181 complexity, 79 D data aggregated, 209, 225 analysis, 192 big, 3, 4, 9 complex, 3–9 stream mining, 102 TIMSS, 210, 211, 215, 217, 225

Dirichlet density, 38, 42, 45 distribution, 31, 38, 42–47 kernel estimator, 45 kernel mixture, 42, 45 Latent Dirichlet Allocation (LDA), 38, 40, 41, 47 process, 44, 45, 47 Process Mixture (DPM), 45, 46 discrete (membership) distribution, 210, 212, 214, 225 distances, 190, 192–194, 196, 198, 203 choice of, 192, 193, 206 Dynamic Clustering Method (DCM), 3, 5, 10–14, 22, 25–27 E, F eigen-decomposition, 120, 126, 127, 130, 133, 134 empirical joint density function (EJD), 56, 58, 61, 65–68, 73 exploratory data analysis, 51 tools, 3 face recognition data, 65, 69–73 feedback explicit, 142–144, 148, 150, 156 implicit, 142–144, 147, 148, 150, 157 framework, 119 Fréchet mean, 190, 193–200, 206


G, H, I, K geometry, 82–84, 91, 95 high-dimensional, 122, 125, 135 histogram, 31, 35, 42, 45, 46, 101 incremental calculation, 119 individuals, 80, 82–85, 90, 95 interval arithmetic, 49, 51, 52, 63–68 data, 192, 193, 198, 206 kriging, 102, 103, 110, 111, 114–116 L, M leaders method (adapted), 212, 214, 215 Louvain algorithm, 169, 172, 174–176, 179–181 I-Louvain, 169, 170, 172, 174–176, 179–181 machine learning, 3 matrix factorization, 142, 145, 148, 160 maximum covering area rectangle (MCAR), 51, 59–61, 73 methodology, 80, 82, 95 mixture decomposition, 3, 5, 10, 12–15, 26 modeling, 79, 81, 82 multi-source datasets, 189, 190 multi-view datasets, 190, 191, 202 multidimensionality, 82, 85, 95 N, O, P neighborhood-based collaborative filtering (CF), 142, 143, 145, 147–150, 154, 155, 160–163 network attributed, 141, 169 bipartite, 142, 144, 152–154, 158, 163 social, 141, 142, 148, 150–152, 154, 158, 159, 162, 163, 169–171, 174 online learning, 121 polytopes representation, 59–61, 73 precision and recall, 146, 157, 159, 161, 162

principal components analysis (PCA), 12, 14, 15, 49–53, 55, 61, 62, 69, 72 interval (iPCA), 50–52, 62, 65, 69, 73 Q, R quantile functions, 104–106, 108–111, 114, 115 random distribution, 31 ranking classes, 16, 21, 22 recommender system (RS), 141 regression, 3, 5, 11, 12, 14, 26, 27 functional, 129, 131 relevance variable, 190, 202, 206 view, 190, 191 S sensor data analysis, 101 similarity, 141–150, 154–156, 158, 160–163 social analysis, 169 social filtering, 154, 155, 160–163 space, 95 sociology, 80–85, 89, 94–96 spatial data analysis, 103 prediction, 102 standardization, 51, 55, 61 structures, 81–83, 94, 95 demographic (age–sex), 210, 211, 213, 217–221, 224, 225 symbolic data, 3, 31, 34, 37, 46 analysis (SDA), 5, 6, 9, 14–16, 20, 27, 49, 50, 65 symbolic sample covariance, 50, 52, 53, 56–58, 74 V, W variogram, 102, 104, 107–110, 112–116 Wasserstein distance, 102, 105–107, 109, 111, 112, 116 weighted multi-view clustering, 191, 206 weighting, 189
