Complex Networks X: Proceedings of the 10th Conference on Complex Networks CompleNet 2019 [1st ed.] 978-3-030-14458-6, 978-3-030-14459-3

This book aims to bring together researchers and practitioners working across domains and research disciplines to measur


English Pages X, 183 [181] Year 2019


Table of contents :
Front Matter ....Pages i-x
Front Matter ....Pages 1-1
Observability of Dynamical Networks from Graphic and Symbolic Approaches (Irene Sendiña-Nadal, Christophe Letellier)....Pages 3-15
Exploratory Factor Analysis of Graphical Features for Link Prediction in Social Networks (Lale Madahali, Lotfi Najjar, Margeret Hall)....Pages 17-31
An Efficient Approach for Counting Occurring Induced Subgraphs (Luciano Grácio, Pedro Ribeiro)....Pages 33-45
Front Matter ....Pages 47-47
A Generalized Configuration Model with Degree Correlations (Duan-Shin Lee, Cheng-Shang Chang, Hung-Chih Li)....Pages 49-61
Missing Data Augmentation for Bayesian Exponential Random Multi-Graph Models (Robert W. Krause, Alberto Caimo)....Pages 63-72
Front Matter ....Pages 73-73
Spread and Control of Misinformation with Heterogeneous Agents (Pedro Cisneros-Velarde, Diego F. M. Oliveira, Kevin S. Chan)....Pages 75-83
Long-Term Behavior in Evolutionary Dynamics from Ergodicity Breaking (Jan Nagler, Frank Stollmeier)....Pages 85-95
Combining Path-Constrained Random Walks to Recover Link Weights in Heterogeneous Information Networks (Hong-Lan Botterman, Robin Lamarche-Perrin)....Pages 97-109
Front Matter ....Pages 111-111
The Structure and Evolution of an Offline Peer-to-Peer Financial Network (Pantelis Loupos, Alexandros Nathan)....Pages 113-122
Modelling Students’ Thematically Associated Knowledge: Networked Knowledge from Affinity Statistics (Ismo T. Koponen)....Pages 123-134
Graph Convolutional Networks on Customer/Supplier Graph Data to Improve Default Prediction (Alejandro Martínez, Jordi Nin, Elena Tomás, Alberto Rubio)....Pages 135-146
Multidimensional Outlier Detection in Interaction Data: Application to Political Communication on Twitter (Audrey Wilmet, Robin Lamarche-Perrin)....Pages 147-155
Social Media Vocabulary Reveals Education Attainment of Populations (Harith Hamoodat, Eraldo Ribeiro, Ronaldo Menezes)....Pages 157-168
Exploring the Role and Nature of Interactions Between Institutes in a Local Affiliation Network (Chakresh Kumar Singh, Ravi Vishwakarma, Shivakumar Jolad)....Pages 169-181
Back Matter ....Pages 183-183


Springer Proceedings in Complexity

Sean P. Cornelius Clara Granell Martorell Jesús Gómez-Gardeñes Bruno Gonçalves Editors

Complex Networks X Proceedings of the 10th Conference on Complex Networks CompleNet 2019

Springer Proceedings in Complexity

Series editors
Henry Abarbanel, San Diego, USA
Dan Braha, Dartmouth, USA
Péter Érdi, Kalamazoo, USA
Karl Friston, London, UK
Hermann Haken, Stuttgart, Germany
Viktor Jirsa, Marseille, France
Janusz Kacprzyk, Warsaw, Poland
Kunihiko Kaneko, Tokyo, Japan
Scott Kelso, Boca Raton, USA
Markus Kirkilionis, Coventry, UK
Jürgen Kurths, Potsdam, Germany
Andrzej Nowak, Warsaw, Poland
Hassan Qudrat-Ullah, Toronto, Canada
Linda Reichl, Austin, USA
Peter Schuster, Vienna, Austria
Frank Schweitzer, Zürich, Switzerland
Didier Sornette, Zürich, Switzerland
Stefan Thurner, Vienna, Austria

Springer Complexity Springer Complexity is an interdisciplinary program publishing the best research and academic-level teaching on both fundamental and applied aspects of complex systems—cutting across all traditional disciplines of the natural and life sciences, engineering, economics, medicine, neuroscience, social, and computer science. Complex Systems are systems that comprise many interacting parts with the ability to generate a new quality of macroscopic collective behavior the manifestations of which are the spontaneous formation of distinctive temporal, spatial, or functional structures. Models of such systems can be successfully mapped onto quite diverse “real-life” situations like the climate, the coherent emission of light from lasers, chemical reaction–diffusion systems, biological cellular networks, the dynamics of stock markets and of the Internet, earthquake statistics and prediction, freeway traffic, the human brain, or the formation of opinions in social systems, to name just some of the popular applications. Although their scope and methodologies overlap somewhat, one can distinguish the following main concepts and tools: self-organization, nonlinear dynamics, synergetics, turbulence, dynamical systems, catastrophes, instabilities, stochastic processes, chaos, graphs and networks, cellular automata, adaptive systems, genetic algorithms, and computational intelligence. The three major book publication platforms of the Springer Complexity program are the monograph series “Understanding Complex Systems” focusing on the various applications of complexity, the “Springer Series in Synergetics”, which is devoted to the quantitative theoretical and methodological foundations, and the “SpringerBriefs in Complexity” which are concise and topical working reports, case-studies, surveys, essays, and lecture notes of relevance to the field. 
In addition to the books in these two core series, the program also incorporates individual titles ranging from textbooks to major reference works.

More information about this series at http://www.springer.com/series/11637


Editors
Sean P. Cornelius, Center for Complex Network Research, Northeastern University, Boston, MA, USA
Clara Granell Martorell, Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Tarragona, Tarragona, Spain
Jesús Gómez-Gardeñes, Department of Condensed Matter Physics, University of Zaragoza, Zaragoza, Spain
Bruno Gonçalves, JPMorgan Chase & Co (United States), New York, NY, USA

ISSN 2213-8684 ISSN 2213-8692 (electronic) Springer Proceedings in Complexity ISBN 978-3-030-14458-6 ISBN 978-3-030-14459-3 (eBook) https://doi.org/10.1007/978-3-030-14459-3 Library of Congress Control Number: 2019933408 © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The International Workshop on Complex Networks, CompleNet (www.complenet.org), was initially proposed in 2008, and the first workshop took place in 2009 in Catania. The initiative was the result of efforts from researchers from (i) the BioComplex Laboratory in the Department of Computer Sciences at Florida Institute of Technology, USA, and (ii) the Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Università di Catania, Italy. CompleNet aims to bring together researchers and practitioners working on complex networks or related areas. In the past two decades we have indeed witnessed an exponential increase in the number of publications in this field. From biology to computer science, from economics to social systems, complex networks are becoming pervasive in many fields of science. It is this interdisciplinary nature of complex networks that CompleNet aims to explore.

CompleNet 2019 was the tenth event in the series and was hosted at Universitat Rovira i Virgili in Tarragona, Spain, from March 18 to 21, 2019. This book includes the peer-reviewed works presented at CompleNet 2019. We received an impressive 128 submissions from 35 countries around the world. Each submission was reviewed by at least 3 members of the Program Committee. Acceptance was based on the relevance to the symposium themes, clarity of presentation, originality, and accuracy of results and proposed solutions. After the review process, 10 full papers and 4 short papers were selected to be included in this book. The 14 contributions address many topics related to complex networks and have been organized in four major groups: (1) Network Theory, (2) Network Models, (3) Processes on Networks, and (4) Applications.

We would like to thank the Program Committee members for their work in promoting the event and refereeing submissions. We are grateful to our speakers: Antoine Allard, Francesca Colaiori, Manlio De Domenico, Raissa M. D’Souza, James Gleeson, Marta C. González, Roger Guimerá, Philipp Hövel, Sonia Kéfi, Susanna Manrubia, Chiara Poletto, Mason Porter, H. Eugene Stanley; their presentations were one of the reasons CompleNet 2019 was such a success.

Sean P. Cornelius, Boston, MA, USA
Clara Granell Martorell, Tarragona, Spain
Jesús Gómez-Gardeñes, Zaragoza, Spain
Bruno Gonçalves, New York, NY, USA
March 2019

Contents

Part I Network Theory
Observability of Dynamical Networks from Graphic and Symbolic Approaches (Irene Sendiña-Nadal and Christophe Letellier)
Exploratory Factor Analysis of Graphical Features for Link Prediction in Social Networks (Lale Madahali, Lotfi Najjar, and Margeret Hall)
An Efficient Approach for Counting Occurring Induced Subgraphs (Luciano Grácio and Pedro Ribeiro)

Part II Network Models
A Generalized Configuration Model with Degree Correlations (Duan-Shin Lee, Cheng-Shang Chang, and Hung-Chih Li)
Missing Data Augmentation for Bayesian Exponential Random Multi-Graph Models (Robert W. Krause and Alberto Caimo)

Part III Processes on Networks
Spread and Control of Misinformation with Heterogeneous Agents (Pedro Cisneros-Velarde, Diego F. M. Oliveira, and Kevin S. Chan)
Long-Term Behavior in Evolutionary Dynamics from Ergodicity Breaking (Jan Nagler and Frank Stollmeier)
Combining Path-Constrained Random Walks to Recover Link Weights in Heterogeneous Information Networks (Hong-Lan Botterman and Robin Lamarche-Perrin)

Part IV Applications
The Structure and Evolution of an Offline Peer-to-Peer Financial Network (Pantelis Loupos and Alexandros Nathan)
Modelling Students’ Thematically Associated Knowledge: Networked Knowledge from Affinity Statistics (Ismo T. Koponen)
Graph Convolutional Networks on Customer/Supplier Graph Data to Improve Default Prediction (Alejandro Martínez, Jordi Nin, Elena Tomás, and Alberto Rubio)
Multidimensional Outlier Detection in Interaction Data: Application to Political Communication on Twitter (Audrey Wilmet and Robin Lamarche-Perrin)
Social Media Vocabulary Reveals Education Attainment of Populations (Harith Hamoodat, Eraldo Ribeiro, and Ronaldo Menezes)
Exploring the Role and Nature of Interactions Between Institutes in a Local Affiliation Network (Chakresh Kumar Singh, Ravi Vishwakarma, and Shivakumar Jolad)

Author Index

Contributors

Hong-Lan Botterman: Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6, Paris, France
Alberto Caimo: Dublin Institute of Technology, Dublin, Ireland
Kevin S. Chan: U.S. Army Research Laboratory, Adelphi, MD, USA
Cheng-Shang Chang: Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan, R.O.C
Pedro Cisneros-Velarde: University of California, Santa Barbara, CA, USA
Luciano Grácio: CRACS & INESC-TEC, DCC-FCUP, Universidade do Porto, Porto, Portugal
Margeret Hall: University of Nebraska at Omaha, Omaha, NE, USA
Harith Hamoodat: BioComplex Laboratory, Florida Institute of Technology, Melbourne, FL, USA
Shivakumar Jolad: Indian Institute of Technology, Gandhinagar, India
Ismo T. Koponen: Department of Physics, University of Helsinki, Helsinki, Finland
Robert W. Krause: University of Groningen, Groningen, The Netherlands
Robin Lamarche-Perrin: CNRS, Institut des système complexes de Paris Île-de-France, ISC-PIF, Paris, France
Duan-Shin Lee: Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan, R.O.C
Christophe Letellier: Normandie Université CORIA, Campus Universitaire du Madrillet, Saint-Etienne du Rouvray, France
Hung-Chih Li: Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan, R.O.C
Pantelis Loupos: Graduate School of Management, University of California, Davis, Davis, CA, USA
Lale Madahali: University of Nebraska at Omaha, Omaha, NE, USA
Alejandro Martínez: CFIS, Universitat Politècnica de Catalunya-BarcelonaTECH, Barcelona, Spain
Ronaldo Menezes: BioComplex Laboratory, University of Exeter, Exeter, UK
Jan Nagler: Deep Dynamics Group & Centre for Human and Machine Intelligence, Frankfurt School of Finance & Management, Frankfurt, Germany
Lotfi Najjar: University of Nebraska at Omaha, Omaha, NE, USA
Alexandros Nathan: McCormick School of Engineering, Northwestern University, Evanston, IL, USA
Jordi Nin: BBVA Data & Analytics, Barcelona, Catalonia, Spain; Universitat de Barcelona (UB), Barcelona, Catalonia, Spain
Diego F. M. Oliveira: U.S. Army Research Laboratory, Adelphi, MD, USA; Network Science and Technology Center, Rensselaer Polytechnic Institute, Troy, NY, USA
Eraldo Ribeiro: Florida Institute of Technology, Melbourne, FL, USA
Pedro Ribeiro: CRACS & INESC-TEC, DCC-FCUP, Universidade do Porto, Porto, Portugal
Alberto Rubio: BBVA Data & Analytics, Barcelona, Catalonia, Spain
Irene Sendiña-Nadal: Complex Systems Group & GISC, Universidad Rey Juan Carlos, Móstoles, Madrid, Spain; Center for Biomedical Technology, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, Spain
Chakresh Kumar Singh: Indian Institute of Technology, Gandhinagar, India
Frank Stollmeier: Network Dynamics, Max Planck Institute for Dynamics and Self-Organization (MPIDS), Göttingen, Germany; Institute for Nonlinear Dynamics, Faculty of Physics, University of Göttingen, Göttingen, Germany
Elena Tomás: BBVA Data & Analytics, Madrid, Spain
Ravi Vishwakarma: IISER, Kolkata, India
Audrey Wilmet: Sorbonne Université, UMR 7606, LIP6, Paris, France

Part I

Network Theory

Observability of Dynamical Networks from Graphic and Symbolic Approaches Irene Sendiña-Nadal and Christophe Letellier

Abstract

A dynamical network, a graph whose nodes are dynamical systems, is usually characterized by a large dimensional space which is not always accessible due to the impossibility of measuring all the variables spanning the state space. Therefore, it is of the utmost importance to determine a reduced set of variables providing all the required information to non-ambiguously distinguish its different states. Inherited from control theory, one possible approach is based on the use of the observability matrix defined as the Jacobian matrix of the change of coordinates between the original state space and the space reconstructed from the measured variables. The observability of a given system can be accurately assessed by symbolically computing the complexity of the determinant of the observability matrix and quantified by symbolic observability coefficients. In this work, we extend the symbolic observability, previously developed for dynamical systems, to networks made of coupled d-dimensional node dynamics (d > 1). From the observability of the node dynamics, the coupling function between the nodes, and the adjacency matrix, it is indeed possible to construct the observability of a large network with an arbitrary topology.

Keywords: Dynamical network · Observability

I. Sendiña-Nadal: Complex Systems Group & GISC, Universidad Rey Juan Carlos, Móstoles, Madrid, Spain; Center for Biomedical Technology, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, Spain
C. Letellier: Normandie Université CORIA, Campus Universitaire du Madrillet, Saint-Etienne du Rouvray, France
© Springer Nature Switzerland AG 2019. S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_1


1 Introduction

Consider a network composed of $N$ nodes, each with a $d$-dimensional dynamics, whose interactions are given by an adjacency matrix $A$. We can distinguish three levels of description of this network: (i) the node dynamics, through the corresponding $d \times d$ node Jacobian matrix $J_n$; (ii) the topology, described by the $N \times N$ adjacency matrix $A$; and (iii) the whole dynamical network, described by the $d \cdot N \times d \cdot N$ network Jacobian matrix $J_N$. There are two possible conventions for writing the adjacency matrix, one being the transpose of the other. To avoid unnecessarily complicated notation, we retain the convention used by Newman [11], in which each element $A_{ij}$ of the adjacency matrix $A$ corresponds to an edge from node $j$ to node $i$. The node Jacobian matrix $J_n$, computed from the set of $d$ differential equations governing the node dynamics, allows an easy construction of the fluence graph describing how the $d$ variables of the node dynamics interact. Such fluence graphs were used by Lin for assessing the controllability of linear systems [9], and the theory was later extended to address their observability [2]. When dealing with dynamical networks, it is important to distinguish the adjacency matrix $A$ from the network Jacobian matrix $J_N$, since the observability of a network has very often been wrongly investigated by taking into account only the adjacency matrix [1, 3, 15] and disregarding the node dynamics. We show that such an approach does not always provide correct results. Without loss of generality, we illustrate our methodology for assessing the observability of dynamical networks by considering networks of diffusively coupled Rössler systems [12] ($d = 3$). The knowledge gathered from the analysis of dyads and triads of Rössler systems will guide us in proposing rules to handle larger networks systematically, by decomposing them into blocks whose observability properties are known.

In order to select a reduced set of variables, we use a graphical approach based on a pruned fluence graph of the network Jacobian matrix $J_N$, as developed in [7]. Then, the symbolic observability coefficients are computed as detailed in [8] and, when full observability is detected, the analytical determinant of the observability matrix is checked to rigorously validate the graphical and symbolic results.

2 Theoretical Background

2.1 Observability Matrix

Let us consider a $d \cdot N$-dimensional network $\mathcal{N}$ composed of $N$ nodes, each one having an associated $d$-dimensional dynamics. The network state is represented by the state vector $x \in \mathbb{R}^{d \cdot N}$ whose components are given by

$$\dot{x}_i = f_i(x_1, x_2, x_3, \ldots, x_{d \cdot N}), \qquad (i = 1, 2, \ldots, d \cdot N), \tag{1}$$

where $f_i$ is the $i$th component of the vector field $f$. The corresponding network Jacobian matrix $J_{ij} = \partial f_i / \partial x_j$ can be expressed as

$$J_N = I_N \otimes J_n - \rho (L \otimes H) \tag{2}$$

reflecting its structure: $N$ diagonal blocks containing the node Jacobian matrix $J_n$, plus a second term giving the contribution to the network dynamics from the topology, encoded in the Laplacian matrix $L = (L_{ij}) = (k_i \delta_{ij} - A_{ij})$ with $k_i$ the degree of node $i$, and from the linear coupling function $H \in \mathbb{R}^{d \times d}$. $I_N$ is the square identity matrix of size $N$, the symbol $\otimes$ stands for the Kronecker product, and $\rho$ is the coupling constant.

Let us introduce the measurement vector $h(x) \in \mathbb{R}^m$ whose $m$ components are the measured variables. Observability cannot be assessed from these $m$ measured variables alone. Indeed, to construct the observability matrix $O$ of the network dynamics from these $m$ measured variables, it is also necessary to specify the $d_r - m$ variables required for completing the vector $X \in \mathbb{R}^{d_r}$ spanning the reconstructed space in which the dynamics is investigated. For these reasons, and as introduced by Lin [9], we will speak about the observability of the pair $[J_N, X]$ to make explicit the fact that the network described by $J_N$ is observable via the $m$ measured variables and $d_r - m$ of their derivatives. Since we are interested in the smallest state space in which the dynamics can be investigated, we will limit ourselves to the case where $d_r = d \cdot N$.

The observability of a dynamical network can be defined as follows. Let us consider the case when $m = 1$ (a generalization to larger $m$ is straightforward), and let $X \in \mathbb{R}^{d \cdot N}$ be the vector spanning the reconstructed space obtained by using the $(d \cdot N - 1)$ successive Lie derivatives of the measured variable. The dynamical system (1) is said to be state observable at time $t_f$ if every initial state $x(0)$ can be uniquely determined from the knowledge of a finite time series $\{X\}_{\tau=0}^{t_f}$. In practice, it is possible to test whether the pair $[J_N, X]$ is observable by computing the rank of the observability matrix [4], that is, the Jacobian matrix of the Lie derivatives of $h(x)$.
Differentiating the measured vector $h(x)$ yields

$$\frac{d}{dt} h(x) = \frac{\partial h}{\partial x} f(x) = \mathcal{L}_f h(x),$$

where $\mathcal{L}_f h(x)$ is the Lie derivative of $h(x)$ along the vector field $f$. The $k$th order Lie derivative is given by

$$\mathcal{L}_f^k h(x) = \frac{\partial \mathcal{L}_f^{k-1} h(x)}{\partial x} f(x),$$

$\mathcal{L}_f^0 h(x) = h(x)$ being the zeroth order Lie derivative, that is, the measured variable itself. Therefore, the $d \cdot N \times d \cdot N$ observability matrix $O_X$ can be written as

$$O_X(x) = \left[ dh(x), \; d\mathcal{L}_f h(x), \; \ldots, \; d\mathcal{L}_f^{d \cdot N - 1} h(x) \right]^T. \tag{3}$$

The pair $[J_N, X]$ is state observable if and only if the observability matrix has full rank, that is, $\mathrm{rank}(O_X) = d \cdot N$. The Jacobian matrix of the coordinate transformation $\Phi : \mathbb{R}^{d \cdot N} \to \mathbb{R}^{d \cdot N}$ between the original state space $\mathbb{R}^{d \cdot N}(x)$ and the reconstructed space $\mathbb{R}^{d \cdot N}(X)$ is the observability matrix $O_X$ [6].


2.2 Symbolic Observability Coefficients

The procedure to calculate the symbolic observability coefficients is implemented in successive steps as follows [8]. The first step is devoted to the construction of the symbolic Jacobian matrix $\tilde{J}_N$ by replacing each constant element $J_{ij}$ by "$1$", each non-constant polynomial element $J_{ij}$ by "$\bar{1}$", and each rational element $J_{ij}$ by "$\bar{\bar{1}}$" when the $j$th variable is present in the denominator or by "$\bar{1}$" otherwise. Rational terms are distinguished from non-constant polynomial terms since they strongly reduce the observability.

The second step corresponds to the construction of the symbolic observability matrix $\tilde{O}_X$ [8]. When $m$ variables are measured, $\tilde{O}_X$ is built from blocks of size $(\kappa_i + 1) \times d$, where $\kappa_i$ is the number of derivatives of the $i$th measured variable and $m + \sum_{i=1}^{m} \kappa_i = d \cdot N$; each block is constructed following the same rules as introduced in [8] for univariate measures.

The third step consists in computing the symbolic observability coefficients. The determinant of $\tilde{O}_X$ is computed according to the symbolic algebra defined in [8] and expressed as products and sums of the symbolic terms $1$, $\bar{1}$, and $\bar{\bar{1}}$, whose numbers of occurrences are stored in the variables $N_1$, $N_{\bar{1}}$, and $N_{\bar{\bar{1}}}$, respectively. A special condition is required for rational systems: if $N_{\bar{1}} = 0$ and $N_{\bar{\bar{1}}} \neq 0$, then $N_{\bar{1}} = N_{\bar{\bar{1}}}$. The symbolic observability coefficient for the reconstructed vector $X$ is then equal to

$$\eta_X = \frac{1}{D} N_1 + \frac{1}{D^2} N_{\bar{1}} + \frac{1}{D^3} N_{\bar{\bar{1}}}, \qquad D = \max(1, N_1) + N_{\bar{1}} + N_{\bar{\bar{1}}},$$

with $0 \le \eta_X \le 1$, where $\eta_X = 1$ corresponds to a reconstructed vector $X$ providing full observability. It was shown that the observability can be considered good when $\eta_X \ge 0.75$ [13].
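The last step of this procedure is plain arithmetic on the three occurrence counts, and can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name and argument names are hypothetical, the coefficient is read as $\eta_X = N_1/D + N_{\bar{1}}/D^2 + N_{\bar{\bar{1}}}/D^3$, and the special condition for rational systems is assumed to mean "no polynomial terms but at least one rational term". The counts passed in the usage lines are made-up inputs, not values taken from any system in this chapter.

```python
def symbolic_observability_coefficient(n1, n1_bar, n1_bbar):
    """Return eta_X from the occurrence counts of the symbolic terms.

    n1      -- occurrences of the constant symbol 1
    n1_bar  -- occurrences of the non-constant polynomial symbol (1 with one bar)
    n1_bbar -- occurrences of the rational symbol (1 with two bars)
    """
    # Assumed reading of the special condition for rational systems:
    # rational terms present but no polynomial ones -> copy the rational count.
    if n1_bar == 0 and n1_bbar != 0:
        n1_bar = n1_bbar
    d = max(1, n1) + n1_bar + n1_bbar
    return n1 / d + n1_bar / d**2 + n1_bbar / d**3


# Hypothetical counts, for illustration only.
print(symbolic_observability_coefficient(1, 0, 0))            # 1.0
print(round(symbolic_observability_coefficient(5, 1, 0), 2))  # 0.86
```

A determinant made of constants only gives $D = N_1$ and hence $\eta_X = 1$, while every polynomial or rational term both enlarges $D$ and is weighted by a higher power of $1/D$, which is what pushes the coefficient below 1.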

2.3 Selecting the Variables to Measure

A systematic check of all possible combinations of $m$ measured variables and their $d \cdot N - m$ derivatives becomes a daunting task for large $N$ and large $d$. It is therefore crucial to provide methods that reveal a tractable set of variables providing full observability of a system. This may be achieved with a graphical approach [7], which is an improved version of the procedure introduced by Liu et al. [10]. A pruned fluence graph is drawn with $d \cdot N$ vertices (one per variable) and a directed edge $x_j \to x_i$ between variables $x_j$ and $x_i$ whenever the element $J_{ij}$ of $J_N$ is a nonzero constant. At least one variable from each root strongly connected component (rSCC) of the pruned fluence graph has to be measured [7]. An rSCC is a subgraph in which there is a directed path from each node to every other node in the subgraph and which has no outgoing edges. As we will see, a pruned fluence graph provides a necessary but not sufficient reduced set of variables to measure for getting an observable pair $[J_N, X]$.
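The graphical selection can be sketched in a few lines of standard-library Python. The sketch assumes that the Jacobian is supplied as a matrix of labels ("zero", "const" for a nonzero constant, "poly" for a state-dependent entry), which is a hypothetical encoding chosen for this illustration; the helper names are also hypothetical. The input used here is the structure of the Rössler node Jacobian given in the next section, and self-loops are skipped since they do not affect the components.

```python
def pruned_fluence_graph(symbolic_jacobian, names):
    """Edges x_j -> x_i for every off-diagonal nonzero constant J_ij."""
    edges = {name: set() for name in names}
    for i, row in enumerate(symbolic_jacobian):
        for j, kind in enumerate(row):
            if i != j and kind == "const":
                edges[names[j]].add(names[i])
    return edges


def reachable(edges, start):
    """Set of vertices reachable from start (start included)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in edges[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def root_sccs(edges):
    """Strongly connected components that have no edge leaving them."""
    reach = {v: reachable(edges, v) for v in edges}
    sccs = {frozenset(u for u in reach[v] if v in reach[u]) for v in edges}
    return [set(c) for c in sccs if all(edges[v] <= c for v in c)]


# Structure of the Rössler node Jacobian (rows: dx/dt, dy/dt, dz/dt).
ROSSLER = [
    ["zero", "const", "const"],  # -y - z
    ["const", "const", "zero"],  # x + a*y
    ["poly", "zero", "poly"],    # b + z*(x - c)
]
graph = pruned_fluence_graph(ROSSLER, ["x", "y", "z"])
print(root_sccs(graph))  # a single component containing x and y
```

The reachability-based component search is quadratic in the number of variables, which is fine for motifs; for large networks a linear-time algorithm such as Tarjan's would be the natural replacement.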


3 Observability of the Node Dynamics

The node dynamics corresponds to the Rössler system [12] $(x_1, x_2, x_3) = (x, y, z)$ whose evolution is governed by the vector field $(f_1, f_2, f_3) = [-y - z, \; x + ay, \; b + z(x - c)]$, whose Jacobian matrix is

$$J_n = \begin{bmatrix} 0 & -1 & -1 \\ 1 & a & 0 \\ z & 0 & x - c \end{bmatrix}. \tag{4}$$

Its nonzero constant elements $J_{ij}$ lead to the pruned fluence graph shown in Fig. 1, which has a single rSCC (dashed oval) containing variables $x$ and $y$. Variable $z$ can thus be discarded from measurements but at least variable $x$ or $y$ must be measured. The symbolic observability coefficients for the pairs $[J_n, (x, \dot{x}, \ddot{x})]$, $[J_n, (y, \dot{y}, \ddot{y})]$, and $[J_n, (z, \dot{z}, \ddot{z})]$ are $\eta_{x\dot{x}\ddot{x}} = 0.86$, $\eta_{y\dot{y}\ddot{y}} = 1.00$, and $\eta_{z\dot{z}\ddot{z}} = 0.44$, respectively. This means that the pair $[J_n, (y, \dot{y}, \ddot{y})]$ is fully observable, the observability of the pair $[J_n, (x, \dot{x}, \ddot{x})]$ is good, and the pair $[J_n, (z, \dot{z}, \ddot{z})]$ is poorly observable. The pruned fluence graph returns the two variables providing the largest observability coefficients when each one of them is measured alone.

This can be analytically confirmed by computing the determinants of the corresponding observability matrices, which are Det $O_{x\dot{x}\ddot{x}} = x - (a + c)$, Det $O_{y\dot{y}\ddot{y}} = 1$, and Det $O_{z\dot{z}\ddot{z}} = z^2$, respectively. If Det $O_X = 0$ for a subset $M_{\mathrm{obs}} \subset \mathbb{R}^d$ of the state space associated with the node dynamics, then $M_{\mathrm{obs}}$ is nonobservable through the measurements and is called the singular observability manifold. Since Det $O_{y\dot{y}\ddot{y}} = 1$, $M_{\mathrm{obs}}$ is an empty set, and the pair $[J_n, (y, \dot{y}, \ddot{y})]$ is actually fully observable. When the reconstructed space is spanned by $X = (x, \dot{x}, \ddot{x})$, the plane defined by $x = a + c$ is nonobservable. The plane $z = 0$ is nonobservable when $z$ is measured. It was shown that the complexity of the determinant, assessed, for instance, by the order of its expression (1 for Det $O_{x\dot{x}\ddot{x}}$, 0 for Det $O_{y\dot{y}\ddot{y}}$, and 2 for Det $O_{z\dot{z}\ddot{z}}$), is related to the observability: the larger the order, the less observable the pair $[J_n, X]$ [5].

Fig. 1 Pruned fluence graph of the Rössler system, where an edge is drawn between variables $x_i$ and $x_j$ whenever $J_{ij}$ is a nonzero constant. A dashed oval surrounds the root strongly connected component (rSCC). Edges $x_i \to x_i$ are omitted since they do not contribute to the determination of the rSCC
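As a worked check of two of the determinants quoted above, the Lie derivatives can be written out by hand for the Rössler vector field $f = (-y - z, \; x + ay, \; b + z(x - c))$. The derivation below is a sketch added for illustration; it only uses the definitions of Sect. 2.1, with each row of the observability matrix being the gradient of one Lie derivative with respect to $(x, y, z)$.

```latex
% Measured variable h = y
\begin{align*}
\mathcal{L}_f^0 h &= y, \\
\mathcal{L}_f^1 h &= x + a y, \\
\mathcal{L}_f^2 h &= (-y - z) + a (x + a y) = a x + (a^2 - 1) y - z,
\end{align*}
\[
O_{y\dot{y}\ddot{y}} =
\begin{bmatrix} 0 & 1 & 0 \\ 1 & a & 0 \\ a & a^2 - 1 & -1 \end{bmatrix},
\qquad
\operatorname{Det} O_{y\dot{y}\ddot{y}}
= -1 \cdot \begin{vmatrix} 1 & 0 \\ a & -1 \end{vmatrix} = 1 .
\]

% Measured variable h = x
\begin{align*}
\mathcal{L}_f^0 h &= x, \\
\mathcal{L}_f^1 h &= -y - z, \\
\mathcal{L}_f^2 h &= -(x + a y) - \bigl(b + z (x - c)\bigr),
\end{align*}
\[
O_{x\dot{x}\ddot{x}} =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & -1 & -1 \\ -1 - z & -a & c - x \end{bmatrix},
\qquad
\operatorname{Det} O_{x\dot{x}\ddot{x}}
= (-1)(c - x) - (-1)(-a) = x - (a + c) .
\]
```

The first determinant is a nonzero constant, so no state is singular when $y$ is measured; the second vanishes exactly on the plane $x = a + c$, which is the singular observability manifold mentioned in the text.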


4 Observability of Small Network Motifs

4.1 Dyads (N = 2)

Let us start with a small network motif of two Rössler systems bidirectionally coupled through either $x$, $y$, or $z$. From the analysis of this basic motif we will derive general rules for assessing the observability of larger networks. When the two nodes are coupled through the $x$ variable, the corresponding $J_N$ is given by

$$J_N = \begin{bmatrix}
-\rho_x & -1 & -1 & \rho_x & 0 & 0 \\
1 & a & 0 & 0 & 0 & 0 \\
z_1 & 0 & x_1 - c & 0 & 0 & 0 \\
\rho_x & 0 & 0 & -\rho_x & -1 & -1 \\
0 & 0 & 0 & 1 & a & 0 \\
0 & 0 & 0 & z_2 & 0 & x_2 - c
\end{bmatrix},$$

where $H = H_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ has been used in Eq. (2) and the subscripts 1 and 2 label the node to which each variable belongs. Figure 2 shows the pruned fluence graphs obtained from $J_N$ for the three coupling configurations. Below each graph, a compact representation at the level of the adjacency matrix is also provided, indicating as well the coupling nature of the bidirectional links. There is only one rSCC when the two nodes are coupled via either variable $x$ or $y$, whereas there are two rSCCs when they are coupled via variable $z$. This suggests that at least one variable has to be measured among $\{x_1, y_1, x_2, y_2\}$ in the first two cases (Figs. 2a and b) and one ($x_i$ or $y_i$) in each rSCC in the latter case. In the following, we analyze in detail the cases where $m = 1, 2$, that is, 1 or 2 measured variables, which can be acquired in $N_m = 1$ or 2 nodes. We computed the symbolic observability coefficients $\eta_X$ for all possible reconstructed vectors $X$. A summary of these analyses is reported in Table 1.
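The dyad Jacobian can be cross-checked numerically against Eq. (2). The sketch below is an illustration under stated assumptions, not the authors' code: the Laplacian is taken with the standard sign convention $L = K - A$ (degree matrix minus adjacency matrix), which is the one that reproduces the $-\rho_x$ diagonal and $+\rho_x$ off-diagonal entries of the matrix above; the block-diagonal term evaluates $J_n$ at each node's own state; and the parameter values and states are arbitrary numbers chosen for the example. All function names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def rossler_jacobian(x, z, a, c):
    """Node Jacobian of the Rössler system evaluated at one node's state."""
    return [[0.0, -1.0, -1.0],
            [1.0, a, 0.0],
            [z, 0.0, x - c]]


def network_jacobian(states, adjacency, coupling, rho, a, c):
    """J_N = blockdiag(J_n) - rho * (L kron H), with L = K - A (assumed sign)."""
    n, d = len(states), 3
    degree = [sum(row) for row in adjacency]
    laplacian = [[(degree[i] if i == j else 0) - adjacency[i][j]
                  for j in range(n)] for i in range(n)]
    jac = [[-rho * v for v in row] for row in kron(laplacian, coupling)]
    for k, (x, _y, z) in enumerate(states):
        block = rossler_jacobian(x, z, a, c)
        for r in range(d):
            for s in range(d):
                jac[k * d + r][k * d + s] += block[r][s]
    return jac


# Illustrative values only: two nodes, bidirectional link, coupling through x.
H_X = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
ADJACENCY = [[0, 1], [1, 0]]
STATES = [(1.0, 2.0, 3.0), (-1.0, 0.5, 2.0)]  # (x, y, z) of nodes 1 and 2
J = network_jacobian(STATES, ADJACENCY, H_X, rho=0.1, a=0.2, c=5.7)
for row in J:
    print(row)
```

Replacing `H_X` by a matrix with its single 1 at position (2, 2) or (3, 3) gives the $y$- and $z$-coupled dyads, with the coupling terms then appearing in the corresponding rows and columns.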

Fig. 2 Pruned fluence graphs (top) and network connection motifs (bottom) for small network motifs ($N = 2$) of Rössler systems coupled through their different variables. The root strongly connected components (rSCCs) are shown in dashed lines. (a) $\rho_x \neq 0$: 1 rSCC. (b) $\rho_y \neq 0$: 1 rSCC. (c) $\rho_z \neq 0$: 2 rSCCs
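The rSCC counts stated for the three dyads can be reproduced with a short sketch. It assumes that the only constant off-diagonal entries inside a Rössler node are $J_{12}$, $J_{13}$, and $J_{21}$, and that bidirectional coupling through one variable adds a constant entry in each direction between the two copies of that variable; the diagonal entries are ignored since self-loops do not change the components. Function and variable names are hypothetical.

```python
# Constant-entry edges inside one Rössler node (source -> target):
# J_12 and J_13 give y -> x and z -> x, J_21 gives x -> y.
NODE_EDGES = [("y", "x"), ("z", "x"), ("x", "y")]


def dyad_graph(coupled):
    """Pruned fluence graph of two Rössler nodes coupled through `coupled`."""
    edges = {v + str(k): set() for k in (1, 2) for v in "xyz"}
    for k in (1, 2):
        for src, dst in NODE_EDGES:
            edges[src + str(k)].add(dst + str(k))
    # Off-diagonal coupling entries are nonzero constants: one edge each way.
    edges[coupled + "1"].add(coupled + "2")
    edges[coupled + "2"].add(coupled + "1")
    return edges


def count_root_sccs(edges):
    """Number of strongly connected components without outgoing edges."""
    def reach(start):
        seen, stack = {start}, [start]
        while stack:
            for nxt in edges[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    reachable = {v: reach(v) for v in edges}
    components = {frozenset(u for u in reachable[v] if v in reachable[u])
                  for v in edges}
    return sum(all(edges[v] <= comp for v in comp) for comp in components)


counts = {var: count_root_sccs(dyad_graph(var)) for var in "xyz"}
print(counts)  # {'x': 1, 'y': 1, 'z': 2}
```

With $z$ coupling, the pair $\{z_1, z_2\}$ forms a strongly connected component but still has edges leaving it toward $x_1$ and $x_2$, so it is not a root component; the two $\{x_i, y_i\}$ pairs remain separate roots, which is why two variables must be measured in that case.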


Table 1 Symbolic observability coefficients η for the dyads shown in Fig. 2

H | m = 1 | m = 2, Nm = 1 | m = 2, Nm = 2
Hx | $\eta_{x^6} = 0.65$; $\eta_{y^6} = 0.41$; $\eta_{z^6} = 0.03$ | $\eta_{y^5 z} = 0.91$ | $\eta_{y_1^3 y_2^3} = 1$ (Det O = 1); $\eta_{y_1^2 y_2^4} = 1$ (Det O = $\rho_z$); $\eta_{x_1 y_2^5} = 0.91$; $\eta_{x_1^3 x_2^3} = 0.79$
Hy | $\eta_{x^6} = 0.66$; $\eta_{y^6} = 0.56$; $\eta_{z^6} = 0.31$ | $\eta_{x^5 y} = 1$ (Det O = $-\rho_y^3$); $\eta_{x^2 y^4} = 1$ (Det O = $\rho_y^3$); $\eta_{x^5 z} = 1$ (Det O = $\rho_y^3$); $\eta_{y^4 z^2} = 0.86$; $\eta_{y^5 z} = 0.77$ | $\eta_{y_1^3 y_2^3} = 1$ (Det O = 1); $\eta_{y_1^2 y_2^4} = 1$ (Det O = $-\rho_y$); $\eta_{x_1^2 y_2^4} = 1$ (Det O = $-\rho_y^3$); $\eta_{x_1^3 y_2^3} = 0.91$; $\eta_{x_1^3 x_2^3} = 0.91$
Hz | $\eta_{x^6} = 0.72$; $\eta_{y^6} = 0.37$; $\eta_{z^6} = 0.11$ | $\eta_{x^5 y} = 0.72$; $\eta_{x^5 z} = 0.72$; $\eta_{x^2 z^4} = 0.72$; $\eta_{y^2 z^4} = 0.72$ | $\eta_{y_1^3 y_2^3} = 1$ (Det O = 1); $\eta_{y_1^2 y_2^4} = 1$ (Det O = $-\rho_z$); $\eta_{x_1^2 y_2^4} = 1$ (Det O = $-\rho_z$); $\eta_{y_1 y_2^5} = 0.86$; $\eta_{x_1^3 x_2^3} = 0.79$

The type of coupling function H, the number m of measured variables, and the number of nodes Nm where they are measured are also reported. Analytical determinants are reported only in those cases when η = 1. To shorten the notation of the reconstructed vector, we used $y^3$ instead of $(y, \dot{y}, \ddot{y})$, where the exponent refers to the number of derivatives (including the variable itself). The index is omitted when only the variable itself appears in the reconstruction vector

When a single variable is measured (m = 1), the pair [JN, X] is always poorly observable, even when there is a single rSCC (via Hx or Hy). The symbolic observability coefficients are well confirmed by the determinants, which are at least second-order polynomials (not shown). When two variables are measured (m = 2) in a single node (Nm = 1), full observability of the pair [JN, X] is obtained only through Hy. We found three possibilities for the reconstructed state vector X providing ηX = 1. In these cases, the corresponding determinants depend on $\rho_y^3$ (Table 1). Such a strong dependency on the coupling strength could deteriorate the observability when ρy becomes small. On the other hand, when the two measured variables come from two different nodes, there is a wide variety of possibilities providing a fully observable pair [JN, X]. There is a strong advantage in using variable y and its first two derivatives in each node, since this provides full observability and the determinant does not depend on the coupling strength (Det OX = 1). Among the other possibilities offering full observability is the reconstructed vector $X = (x_1^2, y_2^4)$ (where the notation $x_i^j$ designates the first j Lie derivatives of variable $x_i$, the first one being the variable itself), using either the coupling function Hy or Hz. However, when looking at the corresponding determinants, $\Delta_{x_1^2 y_2^4} = \rho_y^3$ and $\Delta_{x_1^2 y_2^4} = -\rho_z$, respectively, we unexpectedly notice that the observability depends on the coupling strength in a weaker way when nodes are coupled via variable z than via variable y. Even more surprising is the case when nodes are coupled via variable x, since the determinant (not shown in Table 1) is $\Delta_{x_1^2 y_2^4} = 0$, indicating that the network is not observable at all. Consequently, the observability of our network with a given topology and reconstructed state vector X strongly depends on the coupling variable. Notice also that the graphical analysis of the pruned fluence graph of JN provides a necessary condition on the variables to be measured for full observability, but it may not be sufficient. For example, in Fig. 2b, it recommends measuring either x or y from the single rSCC, but a second variable is needed to get full observability. This leads to the following propositions.

Proposition 1 The minimal number mmin of variables necessary to measure for getting full observability of a d · N-dimensional network N is equal to the number Nr of root strongly connected components. Each measured variable has to be chosen in a different root strongly connected component.

Corollary 1 If additional variables are required to get full observability of a d · N-dimensional network, they will be selected in the Nr root strongly connected components and, preferably, in those whose cardinality is the largest.

Thus, with these rules from the analysis of the pruned fluence graph, the number of vectors X is sufficiently reduced to make exhaustive computations of the symbolic observability coefficients feasible.

Proposition 2 When the node dynamics is fully observable from one of its variables, then the network N is fully observable if that variable is measured at each node (m = N), independently of the coupling function and topology, even when the network is not completely connected.

Corollary 2 When the number Nm of measured nodes is such that Nm < N, by definition, the choice of the variables to measure depends not only on the adjacency matrix A and coupling function H but also on the node dynamics.

Proposition 3 When a network N of Rössler systems is coupled via the variable z, then m = N nodes must necessarily be measured to get full observability.
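As a worked illustration of the premise of Proposition 2, the single-node case can be checked by hand. The sketch below assumes the standard form of the Rössler system [12] with parameters a, b, c (the node equations are not restated in this section, so this form is an assumption) and ignores all coupling terms.

```latex
\begin{aligned}
\dot{x} &= -y - z, \qquad \dot{y} = x + a y, \qquad \dot{z} = b + z (x - c), \\
X &= (y, \dot{y}, \ddot{y}) = \bigl( y,\; x + a y,\; a x + (a^2 - 1) y - z \bigr), \\
\mathcal{O}_X &= \frac{\partial X}{\partial (x, y, z)} =
\begin{pmatrix} 0 & 1 & 0 \\ 1 & a & 0 \\ a & a^2 - 1 & -1 \end{pmatrix},
\qquad \operatorname{Det} \mathcal{O}_X = 1 .
\end{aligned}
```

Under this assumption the determinant is a nonzero constant, so an isolated node is fully observable from y, which is in line with the entries Det O = 1 reported in Table 1 when y and its first two derivatives are used in each node.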

4.2 Triads (N = 3)

Let us now consider motifs of N = 3 nodes. To limit the number of cases to discuss, we analyze only triad networks coupled through variable y, since this is the sole coupling configuration for which a dyad of Rössler systems is fully observable from measurements in a single node (see Table 1). In order to refer to all the possible triad motifs shown in Fig. 3, we distinguish them by the number l of directed edges, Tl, such that there are five classes of motifs: T2 (3), T3 (4), T4 (4), T5 (1), and T6 (1), where the number in parentheses is the number of motifs in each class.

Fig. 3 Network connection motifs for triad networks (N = 3) of Rössler systems coupled by variable x or y. Only the rSCCs are shown (dashed line). (a) T2a. (b) T2b. (c) T2c. (d) T3a. (e) T3b. (f) T3c. (g) T3d. (h) T4a. (i) T4b. (j) T4c. (k) T4d. (l) T5. (m) T6

Let us start with the triad T2a shown in Fig. 3a. There is a single root strongly connected component, comprising the vertices x2 and y2. According to this graph, measuring either x2 or y2 in node 2 should provide full observability of the triad T2a. However, when two variables are measured in that node, the largest observability coefficient is $\eta_{y_2^8 z_2} = 0.59$. Measuring a third variable in node 2 does not improve the observability, since the symbolic observability coefficient becomes null. The triad T2a is therefore poorly observable when measurements are only performed in the rSCC. This leads to the following proposition.

Proposition 4 In a network of N Rössler systems, it is not possible to reconstruct with full observability the space associated with three nodes from measurements in a single node.

Therefore, the nine dimensions of a Rössler triad cannot be observed by measuring in just one node. However, from the dyad analysis, when the coupling function is via the y variable, it is possible to reconstruct the six associated dimensions from measurements (m = 2) in a single node. In order to investigate the largest dimension that can be reconstructed, we consider the vector $X = (x_2^2, y_2^4, y_3^3)$, which provides full observability of the triad T2a (Det OX = $-\rho_y^3$) by performing m = 3 measurements, two in node 2 and one in node 3. Now, we proceed by progressively adding an extra Lie derivative of y2 and removing one from y3 until full observability is lost for $X = (x_2^2, y_2^7)$. For the case $X = (x_2^2, y_2^5, y_3^2)$, Det OX = $-\rho_y^5$ and therefore a fully observable pair [JN, X] is still obtained. However, one more Lie derivative of y2, $X = (x_2^2, y_2^6, y_3)$, leads to Det OX = $-\rho_y^3 [(a + c - x_1)(x_1 - x_3) + y_1 + 2 z_1 - z_3 - 1]$, that is, observability is only good, since a singular observability manifold appears with this first-order determinant and ηX = 0.82.
Therefore, the largest dimension that can be reconstructed with full observability from measurements in a single Rössler node is seven. The triads T4d and T6 led to similar results (Table 2): full observability of a triad of Rössler systems is only possible when no more than seven dimensions are reconstructed from measurements in a single node.


Table 2 Determinants of the observability matrix and symbolic observability coefficients for three of the triads shown in Fig. 3 and for different reconstructed vectors X

Triad | $X = (x_2^2, y_2^4, y_3^3)$ | $X = (x_2^2, y_2^5, y_3^2)$ | $X = (x_2^2, y_2^6, y_3)$ | $X = (x_2^2, y_2^7)$
T2a | Det OX = $-\rho_y^3$, ηX = 1 | Det OX = $-\rho_y^5$, ηX = 1 | Det OX = $\rho_y^7 P_2$, ηX = 0.82 | Det OX = 0, ηX = 0.69∗
T4d | Det OX = $-\rho_y^3$, ηX = 1 | Det OX = $-\rho_y^5$, ηX = 1 | Det OX = $\rho_y^7 P_3$, ηX = 0.82 | Det OX = $\rho_y^9 P_5$, ηX = 0.69
T6 | Det OX = $-\rho_y^3$, ηX = 1 | Det OX = $\rho_y^4 P_1$, ηX = 0.88 | Det OX = $\rho_y^7 P_3$, ηX = 0.68 | Det OX = $\rho_y^7 P_5$, ηX = 0.60

Determinants with a polynomial of degree i dependence are indicated with Pi . The value of the observability coefficient with a ∗ is spurious due to symmetries in the observability matrix that cancel the determinant and that the symbolic formalism does not detect: it should be zero

As soon as eight dimensions are recovered from one node, the reconstructed vector provides poor observability of the whole system.

Proposition 5 In a network of Rössler systems, it is not possible to reconstruct with full observability more than two nodes from measurements in a single node.

Proposition 6 When N > 2 Rössler systems are coupled, full observability is only possible if at least $N_m = \frac{N}{2} + (N \bmod 2)$ nodes are measured and m = N variables are measured.

This proposition could be specific to the Rössler system or be more generic; this will be further investigated elsewhere. An additional question to address to complete the observability analysis of the triad is whether it is possible to reconstruct a node from another one not directly connected to it. Let us consider the triad T2a and the reconstructed vector $X = (x_2^2, y_2^4, y_1^3)$, where m = 3 measurements are performed in nodes 1 and 2. Note that in triad T2a information flows from node 3 to node 2 through node 1, which is measured. The three extra dimensions reconstructed from node 2 cannot be used for node 3 (not directly connected) but only for node 1. Node 1 is thus observed twice, leading to Det OX = 0: there is null observability of the triad T2a from such a reconstructed vector X. In contrast, the vector $X = (y_2^3, x_1^2, y_1^4)$ provides full observability of triad T2a, node 3 being reconstructed from node 1. We therefore state the following proposition.

Proposition 7 A necessary condition for having full observability of a nonmeasured node ni from a measured one nj is that there is an edge from ni to nj.

Corollary 3 If the node dynamics is a Rössler system, a nonmeasured node can be fully observable if it is directly coupled via variable y to a measured one.


5 Larger Networks

5.1 Star Network

Let us consider a star network of N nodes, N − 1 of which are leaves and one of which acts as the hub. The number Nr of rSCCs depends on the number l of edges and on how these edges are directed. When couplings are bidirectional, there is a single rSCC that contains all the nodes. When unidirectional couplings are all directed to the hub, the hub is the rSCC. When all the edges are outgoing from the hub, there are Nr = N − 1 rSCCs, each one made of one leaf. In a random star network with Nout edges outgoing from the hub, there are Nout rSCCs, each one made of one of the leaves receiving one of these outgoing edges. In all cases, according to Propositions 1 and 5, m = N variables must be measured in Nm = Nr nodes.
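The star-network cases above can be reproduced with a short node-level sketch. It assumes that an rSCC is a strongly connected component with no edge leaving it, which is the reading consistent with the hub and leaf examples in this subsection; the function names are hypothetical, and the paper's analysis works on the pruned fluence graph of variables rather than on whole nodes.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm; returns a list of frozensets of nodes."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # First pass: iterative DFS on the graph, recording finishing order.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Second pass: DFS on the reversed graph in decreasing finishing order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(frozenset(component))
    return components


def root_sccs(nodes, edges):
    """SCCs with no outgoing edge (assumed definition of an rSCC)."""
    components = strongly_connected_components(nodes, edges)
    component_of = {n: c for c in components for n in c}
    has_exit = {component_of[u] for u, v in edges
                if component_of[u] != component_of[v]}
    return [c for c in components if c not in has_exit]


# Star with hub 0 and four leaves; an edge (u, v) points from u to v.
hub, leaves = 0, [1, 2, 3, 4]
star_nodes = [hub] + leaves
edges_to_hub = [(leaf, hub) for leaf in leaves]
edges_from_hub = [(hub, leaf) for leaf in leaves]
print(len(root_sccs(star_nodes, edges_to_hub)))    # hub is the only rSCC
print(len(root_sccs(star_nodes, edges_from_hub)))  # one rSCC per leaf
```

Counting components this way only gives Nr; which variables to measure inside each component still has to come from the symbolic coefficients.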

5.2 Ring Network

In a ring network of Rössler systems coupled via variable y, according to Propositions 1 and 5, m = N variables must be measured in $N_m = \frac{N}{2} + (N \bmod 2)$ nodes for full observability of the pair [JN, X]. This result does not depend on the directionality of the edges (they can be either bidirectional or unidirectional). Along the ring, one node in two is measured.

5.3 Random Network (N = 28)

We here investigate a random network made of 28 Rössler systems bidirectionally coupled via variable y according to the topology (Fig. 4) of an electronic network [14]. Applying Propositions 1 and 5, nodes are grouped by pairs depending on their edges. One possibility is to measure m = N + 1 variables in 15 nodes, namely, in nodes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 19, and 23. Measurements are needed in $N_m = \frac{N}{2} + 1$ nodes since two nodes, 22 and 23, are only connected to node 1, from which it is not possible to reconstruct three nodes according to Proposition 4. One of these two nodes must therefore also be measured. In all measured nodes but nodes 19 and 23, the reconstructed vector is $X_i = (x_i^2, y_i^4)$, while in nodes 19 and 23 it is $X_j = y_j^3$. The symbolic coefficient is equal to one and the analytical determinant of the observability matrix is Det OX = $\rho_y^{39}$, therefore validating all our results. The expression of this determinant could mean that the observability is strongly sensitive to the coupling value. Nevertheless, since nodes are grouped by pairs, the dependency on the coupling value should in practice not be worse than the one observed for a pair of nodes, that is, a dependency on $\rho_y^3$.
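The counting behind Sects. 5.2 and 5.3 can be checked with a few lines of arithmetic. This is a minimal sketch under two stated assumptions: N/2 in Proposition 6 is read as integer division, and each Rössler node contributes three state dimensions. The function names are hypothetical.

```python
STATE_DIM = 3  # assumed: each Rössler node has the state (x, y, z)


def min_measured_nodes(n_nodes):
    """N_m = N/2 + (N mod 2), with N/2 read as integer division (assumption)."""
    return n_nodes // 2 + n_nodes % 2


def reconstructed_dimension(n_six_dim_vectors, n_three_dim_vectors):
    """Total dimension from vectors (x_i^2, y_i^4) and y_j^3."""
    return 6 * n_six_dim_vectors + 3 * n_three_dim_vectors


n_nodes = 28
lower_bound = min_measured_nodes(n_nodes)   # bound from Proposition 6
measured = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 13, 14, 15, 19, 23]
three_dim_nodes = [19, 23]                  # nodes using X_j = y_j^3
six_dim_nodes = [n for n in measured if n not in three_dim_nodes]

total = reconstructed_dimension(len(six_dim_nodes), len(three_dim_nodes))
print(lower_bound, len(measured), total, STATE_DIM * n_nodes)
```

With 13 six-dimensional vectors and 2 three-dimensional ones, the reconstructed dimension matches the 3N dimensions of the network. The exponent 39 = 13 × 3 in the determinant is also what one would get if each six-dimensional vector contributed a factor $\rho_y^3$, as for the dyad; that reading is an inference and is not stated in the text.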


Fig. 4 Topology of the random network (N = 28) used in Ref. [14] to implement a network of electronic Rössler-like circuits. Nodes are grouped by pairs to get full observability of this network from measurements in Nm = 15 nodes

6 Conclusion

We showed that it is possible to construct a procedure to reliably determine the observability of networks whose node dynamics are structurally identical (the governing equations have the same functional form but parameter values can differ). A reduced set of variables to measure in a network of N nodes that provides full observability can be selected using a graphical approach [7]. Symbolic coefficients are then computed to quantify the observability of the network dynamics provided by the measurements [8]. To be fully reliable, network observability must be investigated from the complete Jacobian matrix JN of the network, which encodes the topology, the coupling function, and the node dynamics. Nevertheless, some systematic rules for assessing the observability of the network can be derived from the node Jacobian matrix Jn and the coupling function of dyads and triads. First, we determine the observability of the node dynamics. Then, using the results obtained from the analysis of a dyad, general rules can be established and applied to larger networks. In the case of Rössler systems, it is not possible to reconstruct more than two nodes from measurements in one node. It is necessary to measure in at least $\frac{N}{2} + (N \bmod 2)$ nodes to get full observability of a network made of N Rössler systems coupled via variable y. For any other coupling, N nodes have to be measured. Therefore, the coupling function may critically affect the network observability.

Acknowledgements ISN acknowledges partial support from the Ministerio de Economía y Competitividad of Spain under project FIS2017-84151-P and from the Group of Research Excellence URJC-Banco de Santander.


References

1. Bianchin, G., Frasca, P., Gasparri, A., Pasqualetti, F.: The observability radius of networks. IEEE Trans. Autom. Control 62(6), 3006–3013 (2017)
2. Chan, B.Y., Shachter, R.D.: Structural controllability and observability in influence diagrams. In: Proceedings of the Eighth International Conference on Uncertainty in Artificial Intelligence (UAI’92), pp. 25–32. Morgan Kaufmann Publishers, San Francisco (1992)
3. Hasegawa, T., Takaguchi, T., Masuda, N.: Observability transitions in correlated networks. Phys. Rev. E 88, 042809 (2013)
4. Hermann, R., Krener, A.: Nonlinear controllability and observability. IEEE Trans. Autom. Control 22(5), 728–740 (1977)
5. Letellier, C., Aguirre, L.A.: Investigating nonlinear dynamics from time series: The influence of symmetries and the choice of observables. Chaos 12(3), 549–558 (2002)
6. Letellier, C., Aguirre, L.A., Maquet, J.: Relation between observability and differential embeddings for nonlinear dynamics. Phys. Rev. E 71(6), 066213 (2005)
7. Letellier, C., Sendiña-Nadal, I., Aguirre, L.A.: A nonlinear graph-based theory for dynamical network observability. Phys. Rev. E 98, 020303(R) (2018)
8. Letellier, C., Sendiña-Nadal, I., Bianco-Martinez, E., Baptista, M.S.: A symbolic network-based nonlinear theory for dynamical systems observability. Sci. Rep. 8, 3785 (2018)
9. Lin, C.T.: Structural controllability. IEEE Trans. Autom. Control 19(3), 201–208 (1974)
10. Liu, Y.Y., Slotine, J.J., Barabási, A.L.: Observability of complex systems. Proc. Natl. Acad. Sci. 110(7), 2460–2465 (2013)
11. Newman, M.E.: Networks: an introduction. Oxford University Press, Oxford (2010)
12. Rössler, O.E.: An equation for continuous chaos. Phys. Lett. A 57(5), 397–398 (1976)
13. Sendiña-Nadal, I., Boccaletti, S., Letellier, C.: Observability coefficients for predicting the class of synchronizability from the algebraic structure of the local oscillators. Phys. Rev. E 94(4), 042205 (2016)
14. Sevilla-Escoboza, R., Buldú, J.M.: Synchronization of networks of chaotic oscillators: structural and dynamical datasets. Data in Brief 7, 1185–1189 (2016)
15. Van Mieghem, P., Wang, H.: The observable part of a network. IEEE/ACM Trans. Netw. 17(1), 93–105 (2009)

Exploratory Factor Analysis of Graphical Features for Link Prediction in Social Networks

Lale Madahali, Lotfi Najjar, and Margeret Hall

Abstract Social networks attract much attention due to their ability to replicate social interactions at scale. Link prediction, or the assessment of which unconnected nodes are likely to connect in the future, is an interesting but non-trivial research area. Three approaches exist to deal with the link-prediction problem: feature-based models, Bayesian probabilistic models, and probabilistic relational models. In feature-based methods, graphical features are extracted and used for classification. Usually, these features are subdivided into three feature groups based on their formula. Some formulas are extracted based on neighborhood graph traversal. Accordingly, there exist three groups of features: neighborhood features, path-based features, and node-based features. In this paper, we attempt to validate the underlying structure of topological features used in feature-based link prediction. The results of our analysis differ from the prevailing grouping of these features, which indicates that the current literature's classification of feature groups should be redefined. Thus, the contribution of this work is exploring the factor loading of graphical features in link prediction in social networks. To the best of our knowledge, no prior studies have addressed it.

Keywords Social networks analysis · Exploratory factor analysis · Data mining

L. Madahali · L. Najjar · M. Hall
University of Nebraska at Omaha, Omaha, NE, USA

© Springer Nature Switzerland AG 2019
S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_2

1 Introduction

A social network is a social structure that is composed of a set of actors (also known as players, agents, or nodes) and a set of relationships between these actors. It can be represented as a graph in which nodes (vertices) represent people (actors) and edges represent relationships between them. Link prediction, or the prediction of future connections between unconnected nodes in a graph [1], has many applications within and outside of the domain of social networks, e.g., finding interactions between proteins in bioinformatics [2]; in e-commerce and recommender systems [3]; and in the defense and security domains in identifying terrorist cells [4]. Traditionally there are three approaches for addressing the link-prediction problem: feature-based link prediction, Bayesian probabilistic models, and probabilistic relational models. In feature-based link prediction, the problem is seen as a supervised classification problem in which each record corresponds to a relationship. In Bayesian probabilistic models, the main idea is to assess a probability score denoting the existence of a future relationship between two nodes that are not connected. This score can also be used as a feature in classification. Probabilistic relational models (PRM) can incorporate attributes of edges and vertices to create the combined probability distribution of a set of nodes and edges. There are two approaches to PRM: one based on Bayesian networks, which is used for directed links, and one based on relational Markov networks, which is used for undirected links [5]. In most feature-based link-prediction studies, researchers categorize graphical features as neighborhood features, node-based features, and path-based features [5]. The general aspects considered in each category are as follows:

• Neighborhood: common neighbors, Jaccard coefficient, and Adamic/Adar
• Path-based: shortest path count, Katz, hitting time, and rooted pagerank
• Node-based: preferential attachment, clustering coefficient, and simrank

The features’ formulae are elaborated later. There are different ways researchers approach link-prediction problems. In some studies, researchers investigate the structure of the networks to try to introduce new graphical features for improving prediction quality [6]. Other authors work on machine learning techniques.
In these studies, algorithms are created or manipulated to enhance performance, since supervised link prediction culminates in a classification problem [7]. The literature to date focuses on introducing new features and manipulating algorithms to increase performance. In our estimation, current research has not considered the underlying real structure of these features in prediction, or whether all of these features are needed for prediction. This is akin to the curse of dimensionality problem in machine learning [8], and leads us to posit that this under-consideration by the research community has caused structural biases in existing analyses. This is what we examine in this paper. Factor analysis is a useful tool for probing relationships between variables. By collapsing a large number of variables into a few understandable factors, it allows investigating concepts that are not easily measured directly. It is also computationally efficient, well-benchmarked in the literature, and follows the premise that simple solutions are preferable to complex solutions when results are similar or the same. Factor analysis is employed to analyze the relationship between features and to see how they load on underlying factors. We investigate which features load under each produced factor, and whether they deviate from expectation. To the best of our knowledge, no prior studies have completed a factor analysis of topological features in link-prediction problems. Until now it has been conceptually assumed that the attributes must go together.


Until now, the literature has not shown that these features must belong together mathematically. This paper tries to show mathematically whether the features belong together. Therefore, the research question in this study is “Considering graphical features in a social network graph for link prediction, what is the structure of retained factors for further analysis?” Features are extracted from the Hep-ph dataset, which is a coauthorship network from 1992 to 2000. It has 16,402 nodes and 156,742 edges, where nodes are authors and edges represent relationships based on some type of collaboration, such as publishing a paper or working on a project together. The related work, methodology, experimental setup, results, implications, and conclusion are described below.
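To make the feature groups named in the introduction concrete, the sketch below computes three neighborhood features (common neighbors, Jaccard coefficient, Adamic/Adar) and one node-based feature (preferential attachment) for a candidate pair in a toy undirected graph. The formulas are the standard textbook definitions, which may differ in detail from those used later in the paper; the graph and the function names are hypothetical.

```python
import math


def build_adjacency(edge_list):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum over shared neighbors z of 1 / log(degree(z)).

    A shared neighbor of two distinct nodes has degree >= 2, so log > 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])


def preferential_attachment(adj, u, v):
    """Product of the two degrees."""
    return len(adj[u]) * len(adj[v])


# Toy coauthorship graph (illustrative only, not the Hep-ph data).
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
pair = ("a", "d")  # currently unconnected candidate link
features = {
    "common_neighbors": common_neighbors(adj, *pair),
    "jaccard": jaccard(adj, *pair),
    "adamic_adar": adamic_adar(adj, *pair),
    "preferential_attachment": preferential_attachment(adj, *pair),
}
print(features)
```

In a feature-based setting, one such feature vector per candidate pair would form a record of the classification table, and the columns of that table are the variables a factor analysis would be run on.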

2 Related Work

2.1 Link Prediction

As stated earlier, one of the favored approaches to link prediction is feature-based link prediction. Madahali et al. [7] used data mining techniques to improve the performance of machine learning algorithms. By applying preprocessing techniques to the data and combining algorithms, they improved performance in terms of F-measure and AUC (area under the curve). Liben-Nowell et al. [1] extracted and worked with graphical features, considering a core set of co-authors who collaborated at least three times during the training and test intervals. They treated each of these features as a predictor and compared its performance with that of a random predictor. Their result is a list of links with associated probabilities. Graph-based features are the most common features used in feature-based link prediction [5]. Cukierski et al. [9] extracted 94 distinct graph features and used them as classification input to random forests. Furthermore, they noted the big-data problem in large graphs. They concluded that the best classification performance is achieved through a combination of different categories of features, as these capture different aspects of the graph structure. They reported an area under the curve of 0.9695. Introducing new features is a common way to contribute in this area. In [6] the authors define two new features, friends-measure and same-community: friends-measure is the number of connections between the neighbors of two nodes, and same-community indicates whether two nodes are members of a common community. They ran their experiments on ten datasets and showed improved performance. In addition to topological features, other features such as node and edge features can contribute to improvement. Al Hasan et al. [4] introduced new features such as keyword match count or the sum of papers in coauthorship networks.
Scellato et al. [10] used Gowalla, a location-based social network. They introduced a new feature called “place-friends”, which identifies users who visited the same place. They defined new features useful for prediction that rely on the properties of the places visited by users. Finally, they built their predictions with a supervised link-prediction framework based on these features. Social networks are extremely dynamic: millions of nodes and edges are added and deleted in a day. To deal with this challenge, Song et al. [11] introduced proximity measures together with an algorithm to estimate proximity in very dynamic networks. Proximity measures show how far apart or close nodes are in a social network. Measuring proximity underlies a range of applications in the social sciences, business, information technology, computer networks, and cybersecurity. Having defined various proximity measures, they found, first, that the effectiveness of different proximity measures varies among networks and, second, that considering several proximity measures contributes to better accuracy. The link-prediction problem is sometimes formulated as a random walk from a source node to the target. Supervised random walks can be applied to many problems that require learning to rank nodes in a graph, such as link recommendation, anomaly detection, missing-link prediction, and searching and ranking [12]. The authors of [12] provided a random walk solution on Facebook that learns a function assigning a score to each edge, so that the algorithm can identify nodes that are more likely to be connected. They found that their algorithm outperforms unsupervised approaches and feature-based prediction in terms of AUC. Using game theory, Zappella et al. [13] introduced a new approach based on a graph transduction game. Using a dataset from the Tuenti social network, they showed that their method outperforms standard local measures and yields significant improvements in terms of mean average precision and reciprocal rank.

2.2 Applied Factor Analysis

Generally speaking, the goal of factor analysis is to reduce the dimensionality of the original space and arrive at a new, smaller space by grouping intercorrelated variables under more general variables [14]. Factor analysis therefore brings two advantages: the possibility of gaining a clear view of the data and the possibility of using the output in subsequent analysis [15, 16] by reducing the space of the feature vector. Many studies give a thorough theoretical overview of factor analysis [15, 16]. Factor analysis plays a major role in improving psychological research [17]. It has been used as an analytic technique for extracting interrelationship patterns, reducing data, classifying and describing data, data transformation, hypothesis testing, and mapping construct space [16]. The journal Psychometrika (the primary journal of the Psychometric Society, a professional body devoted to psychometrics and quantitative psychology) has devoted more pages to factor analysis than to any other quantitative topic in the behavioral sciences [18], and the number of studies applying factor analysis has increased dramatically [19]. As stated, exploratory factor analysis (EFA) is commonly used in psychology to recognize influential contributing factors. In [20] the authors measure personality and trait affect to see how various psychological factors contribute to the degree and nature of posting status updates on Facebook. They introduced an instrument to measure two types of status updates on Facebook: positive and negative content. Using this instrument along with instruments that measure personality and trait affect, they determined how much different psychological factors contribute to the degree and nature of posting status updates on Facebook. They used Pearson correlations and partial-least-squares structural equation modeling as their statistical techniques. They found support for the role of personality and trait affect in understanding the types of people who post status updates on Facebook. In [21] the authors try to answer two questions at the intersection of psychometrics, data mining, machine learning, and physics education research. They analyzed a large set of students’ responses to the mechanics reasoning inventory (MRI), an assessment designed to measure strategic knowledge. They use EFA to identify the basic mental constructs of students in order to reduce the number of variables necessary to explain the data below the number of items. They give two reasons for using EFA. First, it is the model that reflects their theoretical assumption about the relationship between mental constructs and observed item responses, since factors are formulated as the cause of the correlated item responses. Second, EFA models measurement errors and unique sources of variance in item responses.
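The idea of grouping intercorrelated variables under a more general one can be illustrated with a deliberately simplified sketch. It computes the Pearson correlation matrix of three toy variables and then the loadings on the leading eigenvector of that matrix by power iteration. This is a principal-component-style approximation of a single factor, not a full exploratory factor analysis (there is no estimation of unique variances and no rotation), and it is not the procedure used in any of the cited studies; the data and function names are hypothetical.

```python
import math


def pearson_matrix(columns):
    """Pearson correlation matrix of equally long numeric columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def leading_factor(corr, iterations=500):
    """Leading eigenvalue and loadings of a correlation matrix.

    Power iteration from a uniform start vector; loadings are the
    eigenvector entries scaled by the square root of the eigenvalue.
    """
    k = len(corr)
    vec = [1.0 / math.sqrt(k)] * k
    eigenvalue = 0.0
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        eigenvalue = math.sqrt(sum(x * x for x in nxt))
        vec = [x / eigenvalue for x in nxt]
    return eigenvalue, [math.sqrt(eigenvalue) * x for x in vec]


# Three toy variables: the first two rise together, the third falls.
toy_columns = [
    [1, 2, 3, 4, 5, 6],
    [2, 4, 5, 8, 10, 13],
    [6, 5, 4, 4, 2, 1],
]
corr = pearson_matrix(toy_columns)
eigenvalue, loadings = leading_factor(corr)
print(eigenvalue, loadings)
```

Variables with large loadings of the same sign move together, and a large loading of opposite sign marks a variable that moves against them; deciding how many factors to retain and how to rotate them is where a real EFA goes beyond this sketch.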

2.3 Factor Analysis in Computing Marsden et al. [22] present the results of an exploratory factor analysis (EFA) in professionals’ attitudes toward online communication. They conduct a survey of 100 professionals based on the theoretical approaches to the scientific study of online communication identified by [23]. They found three constructs: (1) media choice, (2) the hyperpersonal options of online communication, and (3) social cues in online communication. These constructs reflect the scientific approaches that can explain the dynamics of online communication. These attitudinal approaches have several advantages. First, it can influence the communication itself and second, an impact on the use and the quality of online communication. Third, the results showed attitudes which will be useful for successful online communication. In [24] the authors developed a scenario such that each scenario contained a single essay, related to identifying privacy, accuracy, property, and access (PAPA), within a context, and with the respondent as the individual encountering the dilemma. According to Mason [25], these are the four main ethical issues of the information age. Violation of each issue shows the injured participant’s perspective to identify consequences coming from the use of information and information technology in an unethical way. They carried out two pilot tests to refine the wording of the 16 scenarios used in the final questionnaire [24]. The survey results showed that people have high egocentricity and concern for themselves; few are concerned with or aware of other stakeholders. [24] hypothesize that the lack of awareness of the stakeholder is the result of the mediation of technology which creates a gap


between computer user and stakeholder. Consequently, this psychological distance may explain the growing rate of unethical computer usage, such as break-ins and viruses.

Researchers claim that there is value in studying social networks simultaneously and that the advantage of social network use becomes clear when multiple services are reviewed together [26, 27]. Spiliotopoulos and Oakley [27] conducted a study of 198 Facebook users and predicted how likely a Facebook user is to become a Twitter user based on their Facebook usage. They collected 12 activity measures via the Facebook API and derived five discrete usage dimensions. The factor analysis result shows that Facebook usage for participants is multidimensional. This implies that people can be classified along these dimensions and into groups with very different usage styles. They claim that treating usage behavior as a multidimensional concept provides a more accurate description of individual behaviors than a general measure of Facebook usage. Therefore, their findings show that distinguishing features are drivers for adopting one social network over another. Conversely, if multiple services share the same features, users tend to use those features in only one social network.

EFA can be used for reordering different measurements in various areas. In the study carried out by Jha et al. [28], the researchers explored and prioritized the factors that influence, control, and empower modern ERP implementation for small and medium enterprises. To accomplish this, they examined the latest trends in modern ERP functioning at Delhi-NCR companies through EFA, together with the reliability of all the constructs that emerged during their work. They concluded that optimizing software engineering, project management, and lean six sigma techniques helped in the successful implementation of ERP in small and medium enterprises. In the paper written by Schreiber et al. [29], the researchers try to find a valid categorization and to examine the performance and properties of a range of h-type indices. Using EFA, they studied the relationship between the h-index, its variants, and some standard bibliometric indicators of 26 physicists from the Science Citation Index in the Web of Science. They showed that for their dataset a distinction is possible between the quality and quantity of scientific output.

3 Methodology

In this section, we explicate the features extracted from the dataset graph and then explain the EFA procedure. As mentioned, the Hep-ph dataset and the graphical features in this study are common in link-prediction problems. LPmade [30] was used to extract these features.


3.1 Link-Prediction Graphical Features

The following are feature formulas used in most link-prediction studies.

Adamic/Adar: In the context of web mining, Adamic/Adar measures the similarity between two webpages and thereby determines how two pages are related [31].

$$\sum_{z \,:\, \text{common features between } a, b} \frac{1}{\log(\mathrm{frequency}(z))} \tag{1}$$

Thus, in link prediction this converts to

$$\sum_{z \,:\, \text{common neighbors between } a, b} \frac{1}{\log |\Gamma(z)|} \tag{2}$$

where $\Gamma(z)$ is the set of neighbors of node z.

Common neighbors: This feature calculates the number of common neighbors between two nodes [1]. Suppose $\Gamma(a)$ is the set of neighbors of node a. Common neighbors between two nodes are defined as follows:

$$|\Gamma(a) \cap \Gamma(b)| \tag{3}$$

Clustering coefficient: It measures the extent to which the friends of a node are friends with each other [32].

$$CC = \frac{3 \times \text{number of triangles}}{\text{number of connected triplets of vertices}} \tag{4}$$
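As an illustration of Eq. (4), the following minimal sketch counts connected triplets and closed triplets directly. The adjacency-dictionary input format and the function name `global_clustering` are assumptions made for this example; they are not taken from LPmade.

```python
from itertools import combinations

def global_clustering(adj):
    """Return 3 * triangles / connected triplets for an undirected graph.

    adj maps each node to the set of its neighbors (hypothetical input format).
    """
    triplets = 0  # paths of length two, counted once per center node
    closed = 0    # triplets whose two end nodes are also linked
    for center, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            triplets += 1
            if v in adj[u]:
                closed += 1
    # Every triangle closes three triplets (one per center node),
    # so `closed` already equals 3 * number of triangles.
    return closed / triplets if triplets else 0.0

# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc = global_clustering(toy)  # 1 triangle, 5 connected triplets -> 0.6
```

The toy graph has one triangle and five connected triplets, so the coefficient is 3/5.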

Jaccard coefficient: It is used for measuring similarity and diversity. It is the quotient of the intersection of the two neighbor sets and the union of the two neighbor sets [33].

$$J(a, b) = \frac{|\Gamma(a) \cap \Gamma(b)|}{|\Gamma(a) \cup \Gamma(b)|} \tag{5}$$

Preferential attachment: It is a network formation model. The higher the degree of a node, the more likely it is to connect to other nodes. The probability of collaboration of nodes a and b is proportional to the product of the number of neighbors they have [34].

$$|\Gamma(a)| \cdot |\Gamma(b)| \tag{6}$$
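The four neighborhood-based scores in Eqs. (2), (3), (5), and (6) can be sketched in a few lines. This is only an illustration under the assumption that the graph is given as a dictionary of neighbor sets; the function name `neighborhood_scores` and the toy graph are hypothetical.

```python
import math

def neighborhood_scores(adj, a, b):
    """Neighborhood-based link-prediction scores for the node pair (a, b).

    adj maps each node to the set of its neighbors (hypothetical input format).
    """
    na, nb = adj[a], adj[b]
    common = na & nb
    union = na | nb
    return {
        # Eq. (3): number of shared neighbors
        "common_neighbors": len(common),
        # Eq. (2): a shared neighbor z is linked to both a and b,
        # so |adj[z]| >= 2 and log|adj[z]| > 0
        "adamic_adar": sum(1.0 / math.log(len(adj[z])) for z in common),
        # Eq. (5): intersection over union of the two neighbor sets
        "jaccard": len(common) / len(union) if union else 0.0,
        # Eq. (6): product of the two degrees
        "preferential_attachment": len(na) * len(nb),
    }

# Toy graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
toy = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
scores = neighborhood_scores(toy, "a", "b")
```

For the pair (a, b) the shared neighbors are c and d, giving 2 common neighbors, a Jaccard coefficient of 2/3, and a preferential-attachment score of 6.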

Katz: It keeps a collection of paths and assigns higher weights to shorter paths. This weight is calculated by graph traversal rather than by matrix operations. Its first parameter is a maximum distance away from the source and the second parameter is the damping factor β.

$$\sum_{l=1}^{\infty} \beta^{l} \cdot \left|\mathrm{paths}^{\langle l \rangle}_{a,b}\right|, \tag{7}$$

where $\mathrm{paths}^{\langle l \rangle}_{a,b}$ is the set of all l-length paths from a to b, and β > 0 is a parameter of the predictor. Katz has two variants: (1) unweighted, in which $|\mathrm{paths}^{\langle 1 \rangle}_{a,b}| = 1$ if a and b have cooperated and 0 otherwise, and (2) weighted, in which $|\mathrm{paths}^{\langle 1 \rangle}_{a,b}|$ is the number of times that a and b have cooperated [35].

Prop flow: According to [36], this algorithm runs a breadth-first search that receives a maximum distance to explore in the network before terminating.

Rooted pagerank: Pagerank is a commonly used algorithm in web mining, and specifically in web structure mining. It is an indicator of the importance of a page [37, 38]. Rooted pagerank is the pagerank algorithm in which the random walk has a restart parameter α [1]. A simple version of pagerank is as follows:

$$PR(u) = (1 - d) + d \sum_{v \in B(u)} \frac{PR(v)}{N_v}, \tag{8}$$

where d is the probability that a user follows the direct links, and (1 − d) is the pagerank contribution from pages that are not linked directly. B(u) is the set of pages that link to u, and $N_v$ is the number of outgoing links of page v. The parameter d is called the damping factor and is frequently set to 0.85 [39].

Simrank: A recursive algorithm in which the similarity of two nodes is proportional to the similarity of their neighbors [1]. The starting point is similarity(a, a) := 1.

$$\mathrm{similarity}(a, b) := \gamma \cdot \frac{\sum_{x \in \Gamma(a)} \sum_{y \in \Gamma(b)} \mathrm{similarity}(x, y)}{|\Gamma(a)| \cdot |\Gamma(b)|}, \qquad \gamma \in [0, 1] \tag{9}$$
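The recursion in Eq. (9) is usually evaluated by fixed-point iteration. The sketch below is one such iteration under stated assumptions: the value γ = 0.8, the number of iterations, and the convention that a pair involving a node without neighbors scores 0 are illustrative choices, not settings reported in this study.

```python
def simrank(adj, gamma=0.8, iterations=10):
    """Iterate Eq. (9) starting from similarity(a, a) = 1 and 0 elsewhere.

    adj maps each node to the set of its neighbors (hypothetical input format).
    gamma and iterations are illustrative defaults.
    """
    nodes = list(adj)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[(a, b)] = 1.0
                elif not adj[a] or not adj[b]:
                    new[(a, b)] = 0.0  # assumption: isolated nodes score 0
                else:
                    total = sum(sim[(x, y)] for x in adj[a] for y in adj[b])
                    new[(a, b)] = gamma * total / (len(adj[a]) * len(adj[b]))
        sim = new
    return sim

# Toy star graph: center c with two leaves x and y.
star = {"c": {"x", "y"}, "x": {"c"}, "y": {"c"}}
sim = simrank(star)
```

In the star graph the two leaves share their only neighbor, so their similarity equals γ, while the center and a leaf never have similar neighbors and stay at 0.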

Shortest path count: This feature calculates the number of shortest paths between two nodes. This means that it executes a breadth-first search terminating at the level at which the target is found and counts the number of times the target is encountered at that level [7].

Idegree: The degree of the target node.

Jdegree: The degree of the source node.
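The two path-based features that are computed by graph traversal can be sketched as follows. This is a hedged illustration, not the LPmade implementation: the Katz sum of Eq. (7) is truncated at `max_length`, paths are assumed to be allowed to revisit nodes, and the default value of `beta` is an arbitrary choice for the example.

```python
from collections import deque

def katz_truncated(adj, a, b, beta=0.005, max_length=5):
    """Truncated version of Eq. (7): sum of beta**l * (number of l-step paths).

    Assumes paths may revisit nodes; beta and max_length are illustrative.
    """
    counts = {a: 1}  # number of paths of the current length from a to each node
    score = 0.0
    for length in range(1, max_length + 1):
        nxt = {}
        for u, c in counts.items():
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0) + c
        counts = nxt
        score += (beta ** length) * counts.get(b, 0)
    return score

def shortest_path_count(adj, source, target):
    """Count shortest paths with a breadth-first search that stops at the target level."""
    dist = {source: 0}
    ways = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            break  # all nodes one level above the target were already processed
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                ways[v] = ways[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                ways[v] += ways[u]
    return ways.get(target, 0)

# Toy graph: the 4-cycle 0-1-2-3-0.
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
n_shortest = shortest_path_count(square, 0, 2)                 # two shortest paths
katz_02 = katz_truncated(square, 0, 2, beta=0.5, max_length=2)  # 0.5**2 * 2
```

Opposite corners of the 4-cycle are joined by two 2-step paths and no 1-step path, so the truncated Katz score with β = 0.5 is 0.5.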

3.2 Factor Analysis

The goal of factor analysis is to group intercorrelated variables together under more general variables [40]. Factor analysis starts with a correlation matrix, in which the interrelations between variables are shown. The goal is to classify variables that


highly correlate with a group of other variables under one variable. These variables should have low correlation with variables outside of that group [15]. All measured variables are related to a latent factor. The resultant factors define new dimensions that can be visualized as classification axes along which measurement variables can be plotted [15]. This projection of the scores of the original variables on the factor yields two kinds of information: factor scores and factor loadings. Factor scores are the scores of a variable (feature) on a factor [14]. Factor loadings are the correlations of the original variables with a factor. The factor scores can be used as new scores in multiple regression analysis, whereas factor loadings show the importance of a particular variable to a factor [15]. This information is used for interpreting and naming the factors.

Measurements: As factor analysis starts with a correlation matrix, variables should be at least at an interval level. Normality is not required in EFA except when statistical tests for the significance of factors are used. The sample size is important as correlations are not resistant [42] and can affect reliability [15]. According to [15], a host of studies address the necessary sample size for factor analysis, leading to many "rules-of-thumb." The general conclusion of these studies is that the most important factors in a reliable factor analysis are the absolute sample size and the absolute magnitude of factor loadings [15]. The Kaiser–Meyer–Olkin measure of sampling adequacy (KMO) can be employed as a measure of sample size adequacy. If the KMO value is greater than 0.5, the sample is adequate. Additionally, the anti-image matrix of covariances and correlations can be employed. The sample size is adequate if all elements on the diagonal of this matrix are greater than 0.5 [15].

Correlation matrix: Regarding the correlation matrix, there are two important points. The variables have to be intercorrelated, but not too highly correlated, since very high correlations make it difficult to recognize the unique contribution of the variables to a factor [15]. Bartlett's test of sphericity is used to check the intercorrelation: it tests whether the correlation matrix is an identity matrix, in which case there are no correlations between the variables.

The number of factors to be retained: The number of positive eigenvalues of the correlation matrix determines the number of factors to be retained. However, this is not always true, since sometimes some eigenvalues are positive but very close to zero. Thus, there are some rules-of-thumb for determining the number of retained factors [14, 15]. Based on the Guttman–Kaiser criterion [40], only factors with an eigenvalue larger than 1 are kept. Then, there are two ways of selecting which factors to keep. The first is to keep the factors that together account for 70–80% of the variance. The second is to generate a scree plot and keep all factors before the threshold. After factor extraction, we have to check the communalities. If the communalities are low, the extracted factors account for only a small part of the variance, so more factors might be retained in order to account for more of the variance.

Factor rotation: Two types of rotation are standard: orthogonal and oblique. The difference is that in the orthogonal rotation there is no correlation between the extracted factors, whereas in the oblique rotation there is. A straightforward way to choose the type of rotation is to perform the analysis with the two types


of rotation. If the oblique rotation shows a trivial correlation between the extracted factors, then choosing orthogonal rotation makes sense [15].
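Two of the checks described in this section can be sketched without a statistics package. The sketch below computes the usual chi-square statistic of Bartlett's test of sphericity from a correlation matrix, and applies the eigenvalue-greater-than-1 rule and a cumulative-variance target to a list of eigenvalues. The function names, the 0.8 target, and the example eigenvalues are hypothetical; the study itself used SPSS, and no p-value is computed here.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    det = 1.0
    size = len(m)
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det

def bartlett_statistic(corr, n_samples):
    """Chi-square statistic and degrees of freedom of Bartlett's test of sphericity."""
    p = len(corr)
    chi2 = -((n_samples - 1) - (2 * p + 5) / 6.0) * math.log(determinant(corr))
    dof = p * (p - 1) // 2
    return chi2, dof

def retained_factors(eigenvalues, variance_target=0.8):
    """Number of factors kept by the eigenvalue > 1 rule and by cumulative variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    kaiser = sum(1 for ev in ordered if ev > 1.0)
    cumulative, by_variance = 0.0, len(ordered)
    for count, ev in enumerate(ordered, start=1):
        cumulative += ev / total
        if cumulative >= variance_target:
            by_variance = count
            break
    return kaiser, by_variance

# Hypothetical inputs: a 2x2 correlation matrix and four made-up eigenvalues.
chi2, dof = bartlett_statistic([[1.0, 0.5], [0.5, 1.0]], n_samples=100)
kaiser, by_variance = retained_factors([2.2, 1.1, 0.5, 0.2])
```

An identity correlation matrix has determinant 1 and therefore a statistic of zero; the stronger the intercorrelation, the smaller the determinant and the larger the statistic.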

4 Experimental Setup, Results, and Discussion

SPSS 23.0 is used in this study. For all the experiments, principal component analysis with varimax rotation was carried out to construct the factor structure. The sample size is adequate; the KMO measure of sampling adequacy exceeds the threshold of 0.5. Bartlett's test of sphericity is significant at p < 0.005. In the anti-image matrix, all the diagonal correlations are greater than the threshold of 0.5. There were no cross loadings and the cutoff for factor loadings was set at 0.5. For the experiments, we first set the number of retained factors to three and then to four. In the 3-component structure (Table 1), the clustering coefficient unexpectedly does not load on any of the components. In the 4-component analysis (Table 2), all the anti-image correlation diagonal values are greater than 0.5. Cumulative variance is 82.360%. The only feature that loads on factor 4 is the clustering coefficient, with a factor loading of 0.958. The values of Cronbach's α for the four factors are −0.002, 0.643, 1.000, and 0.000, respectively. As shown in Table 2, the Jaccard coefficient, rooted pagerank, Idegree, propflow, and simrank load on factor 1. While it may be expected that Idegree and Jdegree load on the same factor, they did not. Common neighbor, Adamic/Adar, Katz, and preferential attachment load on factor 2. Jdegree and shortest path count load on factor 3.

Table 1 Rotated factors, factor loadings, and individual and cumulative variances

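| Feature | Component 1 | Component 2 | Component 3 |
|---|---|---|---|
| Common neighbor | | 0.954 | |
| Adamic/Adar | | 0.852 | |
| Jaccard coefficient | 0.692 | | |
| Clustering coefficient | | | |
| Rooted pagerank | 0.847 | | |
| Katz | | 0.907 | |
| Idegree | −0.685 | | |
| Jdegree | | | 0.969 |
| Preferential attachment | | 0.578 | |
| Propflow | 0.812 | | |
| Shortest path count | | | 0.969 |
| Simrank | 0.754 | | |
| % of variance | 36.077 | 23.721 | 14.878 |
| Cumulative variance | 36.077 | 59.798 | 74.676 |
| Cronbach's α | −0.002 | 0.018 | 1.000 |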


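Table 2 Rotated factors, loadings, individual, and cumulative variances for four components

| Feature | Component 1 | Component 2 | Component 3 | Component 4 |
|---|---|---|---|---|
| Common neighbor | | 0.961 | | |
| Adamic/Adar | | 0.858 | | |
| Jaccard coefficient | 0.739 | | | |
| Clustering coefficient | | | | 0.958 |
| Rooted pagerank | 0.880 | | | |
| Katz | | 0.918 | | |
| Idegree | −0.597 | | | |
| Jdegree | | | 0.970 | |
| Preferential attachment | | 0.571 | | |
| Propflow | 0.817 | | | |
| Shortest path count | | | 0.970 | |
| Simrank | 0.795 | | | |
| % of variance | 36.077 | 23.721 | 14.878 | 7.684 |
| Cumulative variance | 36.077 | 59.798 | 74.676 | 82.360 |
| Cronbach's α | −0.002 | 0.643 | 1.000 | 0.000 |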

Given the Cronbach's α result, we conducted the experiments again on standardized data for four factors. The anti-image diagonal values are greater than 0.5. The cumulative variance is 74.676%. Again, the clustering coefficient did not load on any factor. Thus, we removed the clustering coefficient and carried out the factor analysis again. The cumulative variance reached 80.488%. In the next experiment, we set the number of retained factors to four. The results are shown in Table 2. The only feature that loaded on factor 4 was the clustering coefficient. Furthermore, the cumulative variance in this condition reached 82.360%. As stated earlier, the value of Cronbach's α is 0.596 for the first factor, 0.877 for the second factor, and 1.000 for the third factor. The average is therefore 0.824, which is an acceptable value (more than 0.7). Thus, data standardization resulted in higher reliability. Figure 1 visualizes the common conceptual structure of the graphical features, whereas Fig. 2 shows the statistical structure obtained from the feature loadings.
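For readers unfamiliar with the reliability measure, the following sketch computes Cronbach's α with the standard variance-based formula and shows, on made-up numbers, how α can change when items on very different scales are standardized. This is only one plausible reading of why standardization raised the reported values; the toy data, the helper names, and the use of sample variances are assumptions for the example.

```python
import math

def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def zscores(values):
    """Standardize a column to mean 0 and unit sample variance."""
    mean = sum(values) / len(values)
    sd = math.sqrt(variance(values))
    return [(v - mean) / sd for v in values]

def cronbach_alpha(items):
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score).

    items is a list of k columns, each holding the scores of the same n cases.
    """
    k = len(items)
    totals = [sum(case) for case in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1.0 - item_var / variance(totals))

# Hypothetical items that are perfectly correlated but on different scales.
item_a = [1.0, 2.0, 3.0, 4.0]
item_b = [100.0, 200.0, 300.0, 400.0]
alpha_raw = cronbach_alpha([item_a, item_b])
alpha_std = cronbach_alpha([zscores(item_a), zscores(item_b)])

# Average of the three alpha values reported in the text.
mean_alpha = round((0.596 + 0.877 + 1.000) / 3, 3)
```

On the raw toy data the large-scale item dominates the total score and α is close to zero, whereas after standardization the two perfectly correlated items give α = 1.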

5 Implications

In most link-prediction studies, researchers focus on defining new features, creating and manipulating learning algorithms, and manipulating data to deal with the problem. In order to analyze the real relationships and correlations between these features, and their real groupings, we ran EFA on many common features in this problem. Based on our analysis, the groupings differ from those employed in the standard literature. These new groupings are advantageous especially when there are a host of features to deal with. They may also help to reduce intercorrelated errors and biases in the analyses. Interestingly, the clustering coefficient was the only feature which did not load on any other factor. Future researchers may either remove the clustering coefficient or consider it as a new feature category. In the latter case, it will be interesting to introduce new features similar and related to it.

Fig. 1 The conceptual structure of feature grouping. Neighborhood: Jaccard coefficient, common neighbors, Adamic/Adar. Path-based: shortest path count, Katz, hitting time, rooted pagerank. Node-based: preferential attachment, clustering coefficient, simrank.

Fig. 2 Statistical grouping of features. Factor 1: Jaccard coefficient, rooted pagerank, Idegree, propflow, simrank. Factor 2: common neighbor, Adamic/Adar, Katz, preferential attachment. Factor 3: Jdegree, shortest path count. Factor 4: clustering coefficient.


Furthermore, Idegree and Jdegree do not load on the same factor, indicating that they are structurally dissimilar. This has an unknown impact on the extant literature, which must be examined in future work. Future researchers working with learning classifiers may use EFA to reduce the complexity of their datasets, simplifying and reducing factors in a way that reflects the structure of the data. This helps to reduce model and computational complexity in terms of time and memory. This simplified approach assists in increasing generalizability in measurement [41].

6 Conclusion and Limitations

In this study, we examined the features of the link-prediction problem by analyzing their relationships and correlations. Usually, these graphical features are grouped into three categories: neighborhood, path, and node features. These groupings are traditionally based only on the formulas of the features, and until now it was unknown whether they are an accurate representation of the underlying structure of the features. Our approach looks at the problem from a very different perspective. Our results are intriguing in that features grouped in the same category did not load on a common factor in EFA. Further efforts are required to fully empirically validate the underlying structure of the features and its impact on the results of other literature. It is our position that, given the structural differences in commonly used features, current link-prediction research may be subject to unknown structural biases. Correcting these biases may improve the results and prediction quality of current and future works.

A limitation of this work is that it considers common features but not all possible features from the literature. Another limitation is that this paper analyzes a coauthorship network. The results we found may not hold in the case of, e.g., a social media network. Further research is required. Another direction could be to consider as many features as possible to validate the relationships and factor loadings.

References

1. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58, 1019–1031 (2007)
2. Airoldi, E.M.: Mixed membership stochastic block models for relational data with application to protein-protein interactions. In: Proceedings of the International Biometrics Society Annual Meeting, pp. 1–34 (2006)


3. Huang, Z., Li, X., Chen, H.: Link prediction approach to collaborative filtering. In: Proc. 5th ACM/IEEE-CS Jt. Conf. Digit. Libr. (JCDL '05) (2005)
4. Al Hasan, M., Chaoji, V., Salem, S., Zaki, M.: Link prediction using supervised learning. In: SDM'06: Workshop on Link Analysis, Counter-terrorism and Security (2006)
5. Al Hasan, M., Zaki, M.J.: A survey of link prediction in social networks. In: Social Network Data Analytics, pp. 243–275. Springer, New York (2011)
6. Fire, M., Tenenboim-Chekina, L., Puzis, R., Lesser, O., Rokach, L., Elovici, Y.: Computationally efficient link prediction in a variety of social networks. ACM Trans. Intell. Syst. Technol. 5, 10 (2013)
7. Madahali, L., Sherkat, E., Hall, M.: A comprehensive study on improving supervised feature based link prediction in social networks. In: 1st North American Social Networks (NASN) Conference, Washington (2017)
8. Hastie, T., Tibshirani, R., Friedman, J.: Unsupervised learning. In: The Elements of Statistical Learning, pp. 485–585. Springer, New York (2009)
9. Cukierski, W., Hamner, B., Yang, B.: Graph-based features for supervised link prediction. In: The 2011 International Joint Conference on Neural Networks (IJCNN), pp. 1237–1244. IEEE, Piscataway (2011)
10. Scellato, S., Noulas, A., Mascolo, C.: Exploiting place features in link prediction on location-based social networks. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '11, p. 1046. ACM Press, New York (2011)
11. Song, H.H., Cho, T.W., Dave, V., Zhang, Y., Qiu, L.: Scalable proximity estimation and link prediction in online social networks. In: Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement Conference, pp. 322–335. ACM, New York (2009)
12. Backstrom, L., Leskovec, J.: Supervised random walks: predicting and recommending links in social networks (2010)
13. Zappella, G., Karatzoglou, A., Baltrunas, L.: Games of friends: a game-theoretical approach for link prediction in online social networks. In: Workshops at the Twenty-Seventh AAAI Conference on Artificial Intelligence (2013)
14. Rietveld, T., Van Hout, R.: Statistical Techniques for the Study of Language and Language Behaviour. Walter de Gruyter, Berlin (1993)
15. Field, A.: Discovering Statistics Using SPSS for Windows: Advanced Techniques for Beginners (Introducing Statistical Methods Series). SAGE Publications, Thousand Oaks (2000)
16. Rummel, R.J.: Applied Factor Analysis. Northwestern University Press, Evanston (1988)
17. Spearman, C.: General intelligence, objectively determined and measured. Am. J. Psychol. 15, 201 (1904). https://doi.org/10.2307/1412107
18. Nunnally, J.C., Bernstein, I.H.: Psychometric Theory. McGraw-Hill, New York (1978)
19. Comrey, A.L.: Common methodological problems in factor analytic studies. J. Consult. Clin. Psychol. 46, 648–659 (1978). https://doi.org/10.1037/0022-006X.46.4.648
20. Dupuis, M., Khadeer, S., Huang, J.: "I Got the Job!": an exploratory study examining the psychological factors related to status updates on Facebook. Comput. Human Behav. 73, 132–140 (2017). https://doi.org/10.1016/j.chb.2017.03.020
21. Lee, S., Kimn, A., Chen, Z., Paul, A., Pritchard, D.: Factor analysis reveals student thinking using the mechanics reasoning inventory. In: L@S 2017 - Proc. 4th ACM Conf. Learn. Scale, pp. 197–200. ACM, New York (2017). https://doi.org/10.1145/3051457.3053984
22. Marsden, N.: Attitudes towards online communication: an exploratory factor analysis. In: 2013 ACM Conf. Comput. People Res. SIGMIS-CPR 2013, pp. 147–152. ACM, New York (2013). https://doi.org/10.1145/2487294.2487326
23. Walther, J.B.: Computer-mediated communication: Impersonal, interpersonal, and hyperpersonal interaction. Communic. Res. 23, 3–43 (1996)
24. Conger, S., Loch, K.D., Helft, B.L.: Information technology and ethics. In: Proceedings of the Conference on Ethics in the Computer Age, pp. 22–27. ACM Press, New York (1994)
25. Mason, R.O.: Four Ethical Issues of the Information Age. MIS Q. 10, 5 (1986). https://doi.org/10.2307/248873


26. Hall, M., Mazarakis, A., Peters, I., Chorley, M., Caton, S., Mai, J.-E., Strohmaier, M.: Following user pathways: cross platform and mixed methods analysis. In: Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. 3400–3407. ACM Press, San Jose (2016)
27. Spiliotopoulos, T., Oakley, I.: An exploratory study on the use of Twitter and Facebook in tandem. In: Proc. 2015 Br. HCI Conf. - Br. HCI '15, pp. 299–300. ACM, New York (2015). https://doi.org/10.1145/2783446.2783620
28. Jha, R., Saini, A.K.: An exploratory factor analysis on pragmatic Lean ERP implementation for SMEs. In: Proc. 2012 2nd IEEE Int. Conf. Parallel, Distrib. Grid Comput. PDGC 2012, pp. 474–479. IEEE, Piscataway (2012). https://doi.org/10.1109/PDGC.2012.6449867
29. Schreiber, M., Malesios, C.C., Psarakis, S.: Exploratory factor analysis for the Hirsch index, 17 h-type variants, and some traditional bibliometric indicators. J. Informetr. 6, 347–358 (2012). https://doi.org/10.1016/j.joi.2012.02.001
30. Lichtenwalter, R.N., Chawla, N.V.: Lpmade: Link prediction made easy. J. Mach. Learn. Res. 12, 2489–2492 (2011)
31. Adamic, L.A., Adar, E.: Friends and neighbors on the web. Soc. Networks. 25, 211–230 (2003)
32. Zhou, T., Yan, G., Wang, B.-H.: Maximal planar networks with large clustering coefficient and power-law degree distribution. Phys. Rev. E. 71(4), 046141 (2004)
33. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Delhi (2008)
34. Newman, M.E.: Clustering and preferential attachment in growing networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 64, 025102 (2001)
35. Katz, L.: A new status index derived from sociometric analysis. Psychometrika. 18, 39–43 (1953)
36. Lichtenwalter, R.N., Lussier, J.T., Chawla, N.V.: New perspectives and methods in link prediction. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 243–252. ACM, New York (2010)
37. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab, Cornwall-on-Hudson (1999)
38. Brin, S., Page, L.: The anatomy of a large scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30, 107–117 (1998). https://doi.org/10.1016/s0169-7552(98)00110-x
39. Tyagi, N., Sharma, S.: Weighted page rank algorithm based on number of visits of links of web page. Int. J. Soft Comput. Eng. 2, 2231–2307 (2012)
40. Moore, D.S., Mccabe, G.P.: STATISTIEK IN DE PRAKTIJK Theorieboek. Academic Service, Den Haag (2006)
41. Lee, A.S., Baskerville, R.L.: Generalizing generalizability in information systems research. Inf. Syst. Res. 14, 221–243 (2003)

An Efficient Approach for Counting Occurring Induced Subgraphs

Luciano Grácio and Pedro Ribeiro

Abstract Counting subgraph occurrences is a hard but very important task in complex network analysis, with applications in concepts such as network motifs or graphlet degree distributions. In this paper we present a novel approach for this task that takes advantage of knowing that a large fraction of subgraph types does not appear at all in real-world networks. We describe a pattern-growth methodology that is able to iteratively build subgraph patterns that do not contain smaller non-occurring subgraphs, significantly pruning the search space. By using the g-trie data structure, we are able to efficiently count only those subgraphs that we are interested in, reducing the total computation time. The obtained experimental results are very promising, allowing us to avoid the computation of up to 99.78% of all possible subgraph patterns. This showcases the potential of this approach and paves the way for reaching previously unattainable subgraph sizes.

Keywords Subgraph counting · Pattern-growth · Induced subgraphs · Occurring subgraphs · Network motifs

1 Introduction

Many real-world systems can be modeled and analyzed using complex networks. Being able to extract information from these networks is therefore a vital task with applications in a multitude of domains [2]. One very important network mining primitive is the ability to count the occurrences of induced subgraphs. The frequency with which different subgraph types appear inside a network provides a rich characterization of its topology and lies at the core of concepts such as network motifs [10] and graphlet degree distributions [13].

L. Grácio · P. Ribeiro
CRACS & INESC-TEC, DCC-FCUP, Universidade do Porto, Porto, Portugal

© Springer Nature Switzerland AG 2019
S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_3


Counting subgraphs in general graphs is, however, a computationally very hard task. Even just knowing if a subgraph appears at all inside another graph, that is, determining if its frequency is higher than zero, is already an NP-complete problem [3]. This limits the applicability, since the computation time grows exponentially with the size of both the subgraphs and the networks containing them.

Current approaches are typically focused on computing the frequency of all possible subgraph types of a given size k. As k increases, the number of different types increases exponentially. For instance, there are only two different undirected subgraph types of size 3 (a triangle and a chain), but more than $10^5$ of size 8 and more than $10^9$ of size 11. However, in real-world networks, many of these subgraphs do not occur at all (i.e., many types have a frequency of zero). In fact, as we show in Sect. 4, as k increases we might have less than 1% of occurring subgraphs. Present counting methodologies are generally oblivious to the absence of many subgraph types and do not take advantage of this.

In this work we propose an approach that fully exploits this characteristic of the subgraph frequency distribution in order to improve efficiency and reduce computation time. We adopt a pattern-growth strategy in which we start by computing the frequency of subgraphs of a smaller initial size $k_i$. We then iteratively increase this size up to a desired size k by growing only the occurring subgraphs. We do so by making sure that we only generate subgraph types that do not contain smaller non-occurring subgraphs inside themselves, which greatly reduces the possible subgraph search space. As the base counting algorithm we use g-tries [14], a data structure that can efficiently store and count any given custom set of subgraphs. This allows the use of information about subgraphs that do not appear, guiding the computation toward subgraphs that do occur and avoiding unnecessary work.

The experimental results obtained with our proof-of-concept implementation are very promising and show the potential of the approach. We were always able to avoid computing the vast majority of the subgraphs, up to 99.5% in some networks, effectively reducing the computation time. We also show that as we increase the size k, the fraction of non-occurring subgraphs also keeps increasing and appears to follow a logistic function, which further emphasizes the gains of our methodology and its capability to reach previously unfeasible sizes.

The remainder of this paper is organized as follows: Section 2 discusses background concepts, defines the problem we are solving, and gives a brief overview of related work. Section 3 presents our proposed methodology in detail. In Sect. 4 we provide experimental results, showcasing the gains obtained by our approach. We finally give concluding remarks in Sect. 5.


2 Background

2.1 Notation

We first review the main graph concepts used, establishing a coherent terminology to be used throughout this paper. A graph G can be defined as a set of nodes (or vertices) V(G) and a set of edges E(G), each connecting two nodes. An n-graph is a graph of size n, that is, a graph with n nodes. For the sake of simplicity, we will only consider undirected graphs, in which edges do not express direction, but our results extend naturally to directed graphs.

Two graphs G and H are isomorphic if there is a one-to-one mapping between the nodes of both graphs such that there is an edge between two nodes of G if and only if their corresponding nodes in H also form an edge. An automorphism is an isomorphism of a graph onto itself. The equivalence classes of the nodes under the action of the automorphisms are called orbits, that is, two nodes belong to the same orbit if mapping one to the other gives rise to an isomorphic graph. Figure 1 exemplifies this concept where, for instance, the star-shaped T4 pattern has two orbits: the white center node and the black periphery nodes.

H is a subgraph of G if V(H) ⊆ V(G) and E(H) ⊆ E(G). This subgraph is said to be induced if, for all a, b ∈ V(H), (a, b) ∈ E(H) ⇐⇒ (a, b) ∈ E(G). We will only be considering induced subgraphs. Note that the non-induced occurrences are also implicitly counted when finding the induced occurrences. If G contains H as a subgraph, we say that it is a supergraph of H. The set of all supergraphs of H is denoted as super(H).

The subgraph types of a given size k are the set of all possible connected and non-isomorphic k-graphs. The frequency of a subgraph type is the number of times it occurs inside another graph. For the purposes of this paper we consider the classical and most used definition of frequency, which allows overlapping of nodes and edges between different occurrences (for other possible and less used frequency concepts we refer the reader to [16]). The set of all the subgraph types of size k with a frequency higher than zero in a graph G is denoted as k-O(G) (the "O" stands for occurring). For the non-occurring set of k-subgraphs we use k-NO(G).

Figure 2 illustrates these concepts, showing the occurrences of all subgraphs of sizes 3 and 4. For instance, the frequency of subgraph type T1 is 4. Note how a supergraph does not necessarily have a frequency smaller than or equal to that of its subgraphs. For example, the single occurrence of the triangle T2 induces two occurrences of the pattern T6,

Fig. 1 All possible subgraph types of sizes 3 and 4, and their corresponding orbits (nodes with the same color have the same orbit in the corresponding subgraph)


L. Grácio and P. Ribeiro

Fig. 2 Example graph and its corresponding 3- and 4-subgraph occurrences

which contains the triangle. This has implications for computational tractability, as there is no downward closure property on the frequencies, and it separates our problem from the more classical frequent subgraph mining task [8].
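The lack of downward closure can be checked on a small example. The following is a minimal sketch in Python; the five-node graph (a triangle with two pendant nodes) is a hypothetical example chosen here and is not the graph of Fig. 2, and the helper names are illustrative only. It counts induced triangles and induced occurrences of the triangle with one attached node (the pattern the text calls T6).

```python
from itertools import combinations

def induced_edges(edges, nodes):
    """Edges of the subgraph induced by `nodes`."""
    eset = {frozenset(e) for e in edges}
    return [(a, b) for a, b in combinations(nodes, 2) if frozenset((a, b)) in eset]

def count_triangles(n, edges):
    """Number of 3-node subsets whose induced subgraph has all 3 edges."""
    return sum(len(induced_edges(edges, s)) == 3 for s in combinations(range(n), 3))

def count_tailed_triangles(n, edges):
    """Number of 4-node subsets inducing a triangle plus one pendant node.

    Among 4-node graphs, the degree sequence (1, 2, 2, 3) identifies this type.
    """
    total = 0
    for s in combinations(range(n), 4):
        sub = induced_edges(edges, s)
        degrees = sorted(sum(v in e for e in sub) for v in s)
        total += degrees == [1, 2, 2, 3]
    return total

# Hypothetical example: triangle 0-1-2 with pendant nodes 3 (on 0) and 4 (on 1).
example = [(0, 1), (1, 2), (0, 2), (0, 3), (1, 4)]
print(count_triangles(5, example), count_tailed_triangles(5, example))
```

In this example the larger pattern occurs more often than the triangle it contains, so a frequency threshold on the smaller pattern cannot be used to prune the larger one; only a frequency of exactly zero can.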

2.2 Problem Definition

Definition 1 (The Occurring Induced Subgraph Counting Problem) Given a graph G and an integer k, determine the exact frequencies of the subgraphs in k-O(G), that is, count all induced occurrences of k-subgraphs that occur at least once in G. Two occurrences are considered different if they have at least one node or edge that they do not share. Other nodes and edges can overlap.

Note that k-NO(G) is just the complement of k-O(G) and is also implicitly calculated by solving this problem. Furthermore, for most applications the interest is in the positive frequencies. For instance, network motifs can be thought of as overrepresented subgraphs [10], and the first step toward their discovery is computing the frequencies of the subgraphs that do appear in the given network.

2.3 Related Work

Current approaches for exact counting of general subgraphs typically follow one of three strategies. Some methods are network-centric, in the sense that given a number k, they compute the frequency of all possible subgraphs of size k. They generally work by enumerating all connected sets of k nodes and by identifying the isomorphism class of each occurrence. Examples of this are ESU [19] and FaSE [11]. They differ from our work because they do not explicitly take advantage of information about subgraphs that are certain not to exist. Other methods are subgraph-centric, meaning that they only count occurrences of a single individual subgraph type. Examples of this are Grochow [5] and ISMAGS [7]. In principle they could be coupled with our approach as counting algorithms. However,


they do not use information from previously computed frequencies. Finally, set-centric methods lie in between the two previously described approaches. They allow for counting custom sets of subgraphs, larger than a single subgraph but smaller than all possible k-subgraphs. The best-known example of this approach is the g-trie data structure [14] and the associated counting algorithms. We leverage precisely this capability by integrating it in our workflow, as described in Sect. 3, and we add the capability of only searching for subgraphs that do not contain non-occurring smaller subgraphs.

Besides these general and exact approaches there is a multitude of other algorithms. Some are geared toward specific types of subgraphs (for example, star-shaped subgraphs [4] or undirected subgraphs up to size five [6]). Others, such as MOSS [18] or Rand-FaSE [12], trade accuracy for speed and provide only approximate results. We should also mention that several works exploit parallelism to further improve the computation time. Here we are more focused on the core methodological approach, and our proof-of-concept implementation is sequential, but our work can be adapted to use parallelism. In fact, there is an almost direct way of benefiting from this, since there already exist distributed-memory [15] and shared-memory [1] parallel implementations of g-tries, which we use as our base counting algorithm.

3 Proposed Methodology

3.1 Overview

We propose an incremental approach as detailed in the following algorithm:

Algorithm 1: High-level overview of our proposed methodology
Input : A graph G and an integer k ≥ 4
Output: The exact frequencies of all subgraphs in k-O(G)
 1  C3 ← all 3-subgraph types                // current candidate subgraphs
 2  B ← ∅                                    // blacklisted subgraphs
 3  for i ← 3 to k do
 4      Fi ← frequencies of all i-subgraphs ∈ Ci in graph G
 5      if i < k then
 6          B ← B ∪ all i-subgraphs ∈ Ci with frequency = 0
 7          Ci+1 ← (i + 1)-super(Ci) that do not have any subgraph in B
 8      end
 9  end
10  return Fk

We assume that k ≥ 4 since 2-subgraphs are just edges and computing 3-subgraph frequencies would be equivalent to simply counting without the opportunity to use a blacklist to limit the search. Note also that since we are considering


simple undirected subgraphs, the initial candidate list C3 consists of exactly two possible subgraphs: a chain and a triangle, corresponding to patterns T1 and T2 in Fig. 1. In the following sections we give more detail on how to achieve each of the steps described by this algorithm and how the whole process results in an efficient computation of the occurring k-subgraphs.
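The control flow of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: every helper below (`canon`, `census`, `supergraphs`, `contains`) is a hypothetical brute-force stand-in assumed here for illustration. In particular, isomorphism classes are identified by trying all relabelings, and the census enumerates every node subset instead of using g-tries, so the sketch shows the logic of the candidate and blacklist sets but none of the efficiency.

```python
from itertools import combinations, permutations

def canon(n, edges):
    """Brute-force canonical form: smallest sorted edge tuple over all relabelings."""
    return (n, min(tuple(sorted(tuple(sorted((p[a], p[b]))) for a, b in edges))
                   for p in permutations(range(n))))

def is_connected(n, edges):
    adj = {v: set() for v in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == n

def census(n, edges, size):
    """Count connected induced subgraphs with `size` nodes, keyed by canonical type."""
    eset = {frozenset(e) for e in edges}
    counts = {}
    for nodes in combinations(range(n), size):
        idx = {v: i for i, v in enumerate(nodes)}
        sub = [(idx[a], idx[b]) for a, b in combinations(nodes, 2)
               if frozenset((a, b)) in eset]
        if is_connected(size, sub):
            t = canon(size, sub)
            counts[t] = counts.get(t, 0) + 1
    return counts

def supergraphs(pattern):
    """Types with one more node, attached to every non-empty subset of old nodes."""
    n, edges = pattern
    return {canon(n + 1, list(edges) + [(v, n) for v in subset])
            for r in range(1, n + 1) for subset in combinations(range(n), r)}

def contains(pattern, small):
    """True if type `small` occurs as an induced subgraph of `pattern`."""
    n, edges = pattern
    return small in census(n, edges, small[0])

def occurring_frequencies(n, edges, k):
    chain = canon(3, [(0, 1), (1, 2)])
    triangle = canon(3, [(0, 1), (1, 2), (0, 2)])
    cands = {chain, triangle}                                    # line 1: C3
    blacklist = set()                                            # line 2: B
    for i in range(3, k + 1):                                    # line 3
        counts = census(n, edges, i)
        freq = {c: counts.get(c, 0) for c in cands}              # line 4: Fi
        if i < k:
            blacklist |= {c for c in cands if freq[c] == 0}      # line 6
            cands = {s for c in cands if freq[c] > 0             # line 7
                     for s in supergraphs(c)
                     if not any(contains(s, b) for b in blacklist)}
    return {c: f for c, f in freq.items() if f > 0}
```

For example, `occurring_frequencies(5, [(0, 1), (1, 2), (2, 3), (3, 4)], 5)` on a five-node path blacklists the triangle at size 3 and the star and the 4-cycle at size 4, and returns a single type (the five-node path) with frequency 1.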

3.2 Counting Sets of Subgraphs

The entire approach relies on keeping a list of current candidate subgraphs Ci that excludes subgraphs that are sure not to occur, hence pruning the search space. We then naturally need the ability to compute the frequency of all subgraphs in this candidate list (line 4 in Algorithm 1), both for adding more non-occurring subgraphs to our blacklist and for generating the next set of candidates with one more node.

Computing these frequencies efficiently is not a trivial task. Taking advantage of the information shared between the candidates is essential, all the more so because we are ruling out non-occurring subgraphs, which also highlights the topological characteristics that the occurring subgraphs should not have. With that purpose in mind we use g-tries, previously developed by us, which are efficient data structures for storing and finding sets of subgraphs. In the same way that a prefix tree can efficiently store strings by storing common prefixes only once, a g-trie takes advantage of the similarities between subgraphs and represents them with the aim of minimizing redundancy. Figure 3 exemplifies the concept. Note how descendants of a g-trie node share the same subgraph.

After inserting the candidates (Ci) in a g-trie, we perform a census operation that computes their frequencies in G. Given the space constraints, we refer the reader to [14] for an in-depth explanation of how g-tries do this very efficiently, including how the g-trie itself is built given a set of subgraphs and how the usage of symmetry-breaking conditions is able to constrain the search.

Fig. 3 An example g-trie containing all undirected 3- and 4-subgraphs
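The prefix-tree analogy can be made concrete with a heavily simplified sketch. The Python code below is an assumption-laden illustration, not the g-trie of [14]: graphs are inserted under the node labeling they are given with, each trie level stores how one node connects to the earlier ones, and canonical labeling, symmetry-breaking conditions, and the census itself are all omitted. The names `rows`, `TrieNode`, `insert`, and `size` are hypothetical.

```python
def rows(n, edges):
    """Row d holds 0/1 flags saying which earlier nodes 0..d-1 node d connects to."""
    eset = {frozenset(e) for e in edges}
    return [tuple(int(frozenset((d, j)) in eset) for j in range(d)) for d in range(n)]

class TrieNode:
    def __init__(self):
        self.children = {}      # connection row -> child TrieNode
        self.is_graph = False   # True if a stored graph ends at this node

def insert(root, n, edges):
    """Insert a graph; graphs with equal leading rows share the same trie path."""
    node = root
    for row in rows(n, edges):
        node = node.children.setdefault(row, TrieNode())
    node.is_graph = True

def size(node):
    """Total number of trie nodes, including the root."""
    return 1 + sum(size(child) for child in node.children.values())

root = TrieNode()
insert(root, 3, [(0, 1), (1, 2)])          # chain
insert(root, 3, [(0, 1), (1, 2), (0, 2)])  # triangle
print(size(root))
```

The chain and the triangle share their first two levels (a single node, then a node connected to it) and only split at the third node, which is the kind of redundancy reduction the paragraph above describes.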


3.3 Blacklisting Subgraphs

In the process of counting occurrences of a set of subgraphs, we gather information not only about the subgraphs that occur in G but also about the ones that do not. This section describes how we use that information. The two main properties we exploit are the following:

1. If a subgraph G′ does not occur in G, no element of super(G′) occurs in G. When we find a non-occurring subgraph G′ ∈ NO(G), we can also rule out the possibility that any subgraph that contains G′ exists in G.
2. Any smaller subgraph of a subgraph contained in the blacklist has a positive frequency. If it were non-occurring, it would have been put into the blacklist during a previous iteration and, by property (1), the subgraph containing it would never have become a candidate. This means that the blacklist is in essence minimal, storing the smallest possible subgraphs that do not occur in G.

We now describe how to implement and use the blacklist.

Creation. Once we have computed all i-subgraphs ∈ Ci that do not occur in G, we insert them into a new g-trie, which we append to the blacklist B, currently implemented as an array of g-tries (line 6 of Algorithm 1).

Usage. The blacklist B is used every time we consider adding a new candidate subgraph S (line 7 of Algorithm 1). This process is quite similar to the occurrence counting solution described above, but in this case we are searching for the blacklisted subgraphs in S, and the process is interrupted once any of them is found. We iterate through all the g-tries in B, from the smallest to the largest size. If we discover an occurrence in S of any of the subgraphs in any of the g-tries, the process is interrupted and we know that we do not need to append S to the next iteration's candidates Ci+1. We denote this as is_blacklisted(S). Keep in mind that there is also no need to append S to B, since it already contains a subgraph that is blacklisted. In fact, not having to keep these discarded candidates in memory is one of the main advantages of this approach. If no occurrences are found in any of the g-tries, we append the candidate to Ci+1.
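The early-terminating blacklist check can be sketched as follows. This is a minimal Python illustration under stated assumptions: the blacklist is a plain dictionary from subgraph size to a list of patterns rather than an array of g-tries, and `occurs_induced` is a hypothetical brute-force induced-subgraph test standing in for the g-trie search.

```python
from itertools import combinations, permutations

def occurs_induced(small, big):
    """True if pattern `small` = (n, edges) occurs as an induced subgraph of `big`."""
    (ns, es), (nb, eb) = small, big
    es = {frozenset(e) for e in es}
    eb = {frozenset(e) for e in eb}
    for nodes in combinations(range(nb), ns):
        for image in permutations(nodes):
            # image[v] is the node of `big` playing the role of node v of `small`
            if all((frozenset((image[a], image[b])) in eb) == (frozenset((a, b)) in es)
                   for a, b in combinations(range(ns), 2)):
                return True
    return False

def is_blacklisted(candidate, blacklist):
    """Scan blacklisted patterns from the smallest size up; stop at the first hit."""
    for size in sorted(blacklist):
        if any(occurs_induced(b, candidate) for b in blacklist[size]):
            return True
    return False

triangle = (3, [(0, 1), (1, 2), (0, 2)])
tailed_triangle = (4, [(0, 1), (1, 2), (0, 2), (2, 3)])
path4 = (4, [(0, 1), (1, 2), (2, 3)])
blacklist = {3: [triangle]}
print(is_blacklisted(tailed_triangle, blacklist), is_blacklisted(path4, blacklist))
```

With the triangle blacklisted, the triangle with an attached node is rejected while the four-node path passes the filter, matching the T1/T2/T6 scenario used in the text.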

3.4 Candidate Generation

In the candidate generation phase of the iterative process, Ci+1 is created (line 7 of Algorithm 1). To do this we start by considering a limited set of candidate subgraphs, generating super(i-O(G)), that is, the set of graphs of size i + 1 that contain at least one occurring subgraph of Ci. Note that we still need to pass the generated subgraphs through the blacklist filter, since it is likely that some will contain blacklisted subgraphs. For instance, consider the subgraph types exemplified in Fig. 1 and imagine a situation where the chain T1 occurs and the


triangle T2 does not (and hence is included in the blacklist). The subgraph type T6 would be generated as an expansion of T1, but would still be filtered out by the blacklist, since it contains T2. We start by presenting a naive solution to this problem of supergraph generation, in which, given an i-subgraph H, we want to generate all the (i + 1)-subgraphs in super(H).

Exhaustive Node Enumeration. We add a new node and try to connect it to all possible subsets of the existing i nodes. This results in 2^i − 1 possible connected (i + 1)-subgraphs (the −1 term comes from excluding the empty set of current nodes, which would result in an unconnected new node). The main problem with this exponential solution is that many subgraphs of the same type are redundantly created. In fact, connecting the new node to any two subsets that exhibit the same orbit classes will result in isomorphic subgraphs. Recall that the concept of orbits is exemplified in Fig. 1. For instance, connecting a new node to any single node in the triangle T2 (with only one orbit) will always result in subgraph T6. To address this problem, we developed a smarter method for generating supergraphs, which we now describe.

Orbit-Aware Enumeration. We only connect the new node to all possible orbit combinations. For instance, if a 4-subgraph has two nodes in orbit a and two nodes in orbit b (an example of this is subgraph T7), instead of naively trying all possible 2^4 − 1 = 15 sets, we only connect to 8 subsets of existing orbits: {a}, {a, a}, {b}, {b, a}, {b, a, a}, {b, b}, {b, b, a}, and {b, b, a, a}. To implement this strategy we use very efficient third-party software (nauty [9]) to compute the automorphisms and orbit classes. While our orbit-aware strategy avoids duplicates growing from the same subgraph, it might still happen that duplicate (i + 1)-subgraphs emerge from different i-subgraphs, due to the effect already described when considering the usage of the blacklist. For instance, consider Fig. 1 and subgraph type T6, which would be generated as a supergraph of T1, but also as a supergraph of T2. To avoid these duplicates and keep only one copy of each isomorphic subgraph class, we insert all candidates in a g-trie, which efficiently checks whether the subgraph was already inserted.
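The saving in the two-orbit example can be reproduced by enumerating, for each orbit, how many of its nodes the new node is attached to. The sketch below is a minimal Python illustration; the orbit sizes are supplied by hand (the paper obtains them with nauty), and `orbit_combinations` is a hypothetical helper name.

```python
from itertools import product

def orbit_combinations(orbit_sizes):
    """Yield every non-empty choice of how many nodes to take from each orbit.

    `orbit_sizes` maps an orbit label to the number of nodes in that orbit.
    """
    for counts in product(*(range(s + 1) for s in orbit_sizes.values())):
        if any(counts):  # skip the empty choice, which leaves the new node isolated
            yield dict(zip(orbit_sizes, counts))

# Two nodes in orbit a and two nodes in orbit b, as in the example above.
combos = list(orbit_combinations({"a": 2, "b": 2}))
print(len(combos), 2**4 - 1)  # orbit-aware choices vs. exhaustive subsets
```

For orbits of sizes s_1, ..., s_r the number of choices is (s_1 + 1)···(s_r + 1) − 1, which gives 3 · 3 − 1 = 8 here against the 15 subsets of the exhaustive enumeration.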

4 Experimental Results

All tests were executed on a laptop with an Intel i7-6700HQ CPU and 16 GB of RAM. We used three different undirected real-world networks with varied topological characteristics, as described in Table 1. In order to test and explore our methodology, we ran a thorough experiment in which we kept increasing the subgraph size k being computed and we took note of the following values:


Table 1 Real-world networks used in our experiments

Name       Size   Nr. edges   Description
karate       34          78   Social ties between members of a karate club [20]
circuit     252         399   Electronic circuit (s420) [10]
euroroad   1174        1417   Europe main roads network [17]

Table 2 Results obtained for the karate network

Size k   Occurring   Blacklisted     Blocked          All   Avoided   Time (s)
     4           6             0           0            6     0.00%       0.00
     5          21             0           0           21     0.00%       0.00
     6          89            23           0          112     0.00%       0.03
     7         476            81         297          853     2.22%       0.14
     8        2612           832        6821       11,117    10.34%       1.73
     9      11,569          3272     153,286      261,080    37.28%      33.28
    10      40,069          7886   2,348,283   11,716,571    79.89%     405.25

Table 3 Results obtained for the circuit network

Size k   Occurring   Blacklisted     Blocked             All   Avoided   Time (s)
     4           5             1           0               6     0.00%       0.00
     5          11             7           3              21     3.70%       0.00
     6          33            14          56             112    14.39%       0.02
     7          89            37         486             853    33.37%       0.05
     8         293            61        3891          11,117    63.80%       0.20
     9        1001           142      31,123         261,080    88.03%       1.20
    10        3659           379     259,851      11,716,571    97.79%       8.44
    11      13,462          1203   2,174,907   1,006,700,565    99.78%      66.77

– occurring: number of occurring subgraphs of size k in G;
– blacklisted: number of subgraphs kept in the blacklist B after that iteration;
– blocked: number of subgraph types that were generated but were not inserted into the candidates because they contained blacklisted subgraphs;
– all: number of all possible undirected subgraph types of size k;
– avoided: percentage of all whose generation was avoided by our solution;
– time: execution time, measured as elapsed time from launching the program up to computing the k-frequencies, including everything in between.

We now present three tables with the obtained results (Tables 2, 3, 4). The results are very promising. The first main aspect to notice is that in all networks we are able to avoid computing the frequency of the vast majority of the entire set of all possible k-subgraph types (column avoided), with values reaching up to 99.78% with k = 11 in the circuit network. Although the exact percentage may vary, it seems to follow a logistic distribution in the tested networks, as shown in Fig. 4. This provides further evidence that our strategy has the potential to avoid most of the computation by making use of the information of smaller


Table 4 Results obtained for the euroroad network

Size k   Occurring   Blacklisted     Blocked             All   Avoided   Time (s)
     4           5             1           0               6     0.00%       0.00
     5          13             5           3              21     3.70%       0.00
     6          41            15          51             112    10.07%       0.02
     7         137            38         501             853    25.90%       0.06
     8         511           105        4830          11,117    53.41%       0.33
     9        1937           299      47,290         261,080    81.61%       2.42
    10        7428           909     453,838      11,716,571    96.12%      19.23
    11      28,190          2521   4,143,557   1,006,700,565    99.59%     175.27

Fig. 4 Fraction of avoided subgraphs with increasing size k

sized subgraphs. Furthermore, we do this while being able to provide exact results, assuring accuracy and always knowing all subgraph types that occur at least once.

By looking at the blacklist size (column blacklisted), we can get a better notion of how we are filtering out non-occurring subgraphs. Gains are more immediate and larger when smaller subgraphs already have zero frequency. For instance, the clique of size 4 (type T8 in Fig. 1) does not occur in the euroroad network. As a result we blacklist it in the first iteration, avoiding the computation of all its supergraphs. However, even in the cases where we start blacklisting later (for instance, the karate network contains all possible size 4 and 5 subgraph types and the blacklist only contains subgraphs of size ≥ 6), the quantity of blacklisted subgraphs still steadily grows and the quantity of avoided subgraphs seems to follow a similar shape. This gives empirical evidence that it might take longer, but as soon as we discover subgraphs that do not occur, they quickly give rise to many other related non-occurring subgraph types. Further preliminary experiments with other networks also support this observation. We should also point out how small the blacklist is when compared with the set of all possible subgraphs, showcasing our capability of representing the patterns known not to occur in a very compact way.


Another aspect to mention is that the number of subgraph types that appear at least once (column occurring) is always a small fraction of all possible types. Furthermore, this fraction of occurring subgraphs keeps shrinking as k increases, which indicates that our approach might scale to larger values of k.

Regarding execution time, even with just our preliminary non-optimized implementation we are already competitive up to size 9 when compared with the set-centric and very efficient g-trie base method starting with a g-trie containing all possible k-subgraphs (our times are slightly slower but still of the same order of magnitude). For sizes 10 and 11, this base approach is not even applicable, as the number of all possible subgraphs is so large that it is impossible to store them all in a g-trie. This showcases the need for a methodology like the one we propose.

As a final remark we would like to comment on the potential of our approach. Previous approaches that rely on knowing beforehand all possible subgraph types are severely limited in time and memory, as the number of possible subgraph types grows superexponentially (column all). This is the case for general set-centric or subgraph-centric algorithms when we do not have a way of filtering subgraphs. It is also the case for network-centric methods, which typically have a large memory footprint that might even grow exponentially as k increases (an example of this is FaSE [11]). Even less memory-hungry analytic approaches might rely on knowing all types, as in the case of ORCA [6], which builds an intricate system of equations relating all the different subgraph types to accelerate the computation and cannot directly compute only a small fraction of them. Our approach provides a more efficient way of exploring the search space and can be combined with any base approach that takes advantage of knowing which subgraph types we are looking for. Furthermore, the memory requirements are bounded only by the number of occurring and blacklisted subgraphs, which greatly extends the applicability of the method to larger subgraph sizes.

5 Conclusion

Being able to count the frequency of subgraphs is a fundamental graph mining task with many applications. In this paper we present a novel approach for computing the frequency of all k-subgraphs that appear at least once. We propose a pattern-growth approach that is able to prune the search space. For this purpose, we keep a minimal list of subgraphs that we know do not occur and we use it as a blacklist to filter out non-occurring subgraphs. With this, we are able to iteratively keep increasing the subgraph size with a very compact list of viable candidate subgraphs. The results obtained are very promising and we are able to avoid the computation of up to 99.78% of all possible subgraphs. The experiments showcase the capability of our approach and indicate that we can contribute toward increasing the feasible subgraph sizes. For future work, we plan to further optimize our implementation and provide the community with an open-source tool. Furthermore, we intend to make a more


systematic and thorough experimentation on a large set of both synthetic and real-world networks, gaining more insight into the distributions of occurring and blacklisted subgraphs, and also into the avoided fraction of subgraphs. We would also like to better explore the candidate generation algorithm and to explore other pruning options besides only avoiding non-occurring subgraphs.

Acknowledgements This work is financed by the ERDF (European Regional Development Fund) through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020 Programme) within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT, Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology), as part of project UID/EEA/50014/2013.

References

1. Aparício, D.O., Ribeiro, P.M.P., da Silva, F.M.A.: Parallel subgraph counting for multicore architectures. In: 2014 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), pp. 34–41. IEEE, Piscataway (2014)
2. Costa, L.d.F., Oliveira Jr, O.N., Travieso, G., Rodrigues, F.A., Villas Boas, P.R., Antiqueira, L., Viana, M.P., Correa Rocha, L.E.: Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Adv. Phys. 60(3), 329–412 (2011)
3. Eppstein, D.: Subgraph isomorphism in planar graphs and related problems. In: Graph Algorithms and Applications I, pp. 283–309. World Scientific, Singapore (2002)
4. Gonen, M., Ron, D., Shavitt, Y.: Counting stars and other small subgraphs in sublinear-time. SIAM J. Discret. Math. 25(3), 1365–1411 (2011)
5. Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Annual International Conference on Research in Computational Molecular Biology, pp. 92–106. Springer, Berlin (2007)
6. Hočevar, T., Demšar, J.: A combinatorial approach to graphlet counting. Bioinformatics 30(4), 559–565 (2014)
7. Houbraken, M., Demeyer, S., Michoel, T., Audenaert, P., Colle, D., Pickavet, M.: The index-based subgraph matching algorithm with general symmetries (ISMAGS): exploiting symmetry for faster subgraph enumeration. PLoS One 9(5), e97896 (2014)
8. Jiang, C., Coenen, F., Zito, M.: A survey of frequent subgraph mining algorithms. Knowl. Eng. Rev. 28(1), 75–105 (2013)
9. McKay, B.D., Piperno, A.: Practical graph isomorphism, II. J. Symb. Comput. 60, 94–112 (2014)
10. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
11. Paredes, P., Ribeiro, P.: Towards a faster network-centric subgraph census. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 264–271. ACM, New York (2013)
12. Paredes, P., Ribeiro, P.: Rand-fase: fast approximate subgraph census. Soc. Netw. Anal. Min. 5(1), 17 (2015)
13. Pržulj, N.: Biological network comparison using graphlet degree distribution. Bioinformatics 23(2), e177–e183 (2007)
14. Ribeiro, P., Silva, F.: G-tries: a data structure for storing and finding subgraphs. Data Min. Knowl. Disc. 28(2), 337–377 (2014)
15. Ribeiro, P., Silva, F., Lopes, L.: Efficient parallel subgraph counting using g-tries. In: 2010 IEEE International Conference on Cluster Computing, pp. 217–226. IEEE, Piscataway (2010)
16. Schreiber, F., Schwöbbermeyer, H.: Frequency concepts and pattern detection for the analysis of motifs in networks. In: Transactions on Computational Systems Biology III, pp. 89–104. Springer, Berlin (2005)
17. Šubelj, L., Bajec, M.: Robust network community detection using balanced propagation. Eur. Phys. J. B 81(3), 353–362 (2011)
18. Wang, P., Zhao, J., Zhang, X., Li, Z., Cheng, J., Lui, J.C., Towsley, D., Tao, J., Guan, X.: Moss-5: A fast method of approximating counts of 5-node graphlets in large graphs. IEEE Trans. Knowl. Data Eng. 30(1), 73–86 (2018)
19. Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Trans. Comput. Biol. Bioinform. 3(4), 347–359 (2006)
20. Zachary, W.W.: An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33(4), 452–473 (1977)

Part II

Network Models

A Generalized Configuration Model with Degree Correlations

Duan-Shin Lee, Cheng-Shang Chang, and Hung-Chih Li

Abstract In this paper we present a generalization of the classical configuration model. Like the classical configuration model, the generalized configuration model allows users to specify an arbitrary degree distribution. In our generalized configuration model, we partition the stubs in the configuration model into b blocks of equal sizes and choose a permutation function h for these blocks. In each block, we randomly designate a number of stubs proportional to q as type 1 stubs, where q is a parameter in the range [0, 1]. Other stubs are designated as type 2 stubs. To construct a network, we randomly select an unconnected stub. Suppose that this stub is in block i. If it is a type 1 stub, we connect it to a randomly selected unconnected type 1 stub in block h(i). If it is a type 2 stub, we connect it to a randomly selected unconnected type 2 stub. We repeat this process until all stubs are connected. Under an assumption, we derive a closed form for the joint degree distribution of two random neighboring vertices in the constructed graph. Based on this joint degree distribution, we show that the Pearson degree correlation function is linear in q for any fixed b. By properly choosing h, we show that our construction algorithm can create assortative networks as well as disassortative networks. We verify our results by extensive computer simulations.

Keywords Configuration model · Assortative mixing · Degree correlation

1 Introduction

Recent advances in the study of networks that arise in the field of computer communications, social interactions, biology, economics, information systems, etc., indicate that these seemingly widely different networks possess a few common properties. Perhaps the most extensively studied properties are power-law degree

D.-S. Lee · C.-S. Chang · H.-C. Li
Institute of Communications Engineering, National Tsing Hua University, Hsinchu, Taiwan, R.O.C.
© Springer Nature Switzerland AG 2019
S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_4


D.-S. Lee et al.

distributions [1], the small-world property [17], and network transitivity or “clustering” [17]. Other important research subjects on networks include network resilience, existence of community structures, synchronization, and spreading of information or epidemics. A fundamental issue relevant to all the above research issues is the correlation between properties of neighboring vertices. In the ecology and epidemiology literature, this correlation between neighboring vertices is called assortative mixing. In general, assortative mixing is a concept that attempts to describe the correlation between properties of two connected vertices. Take social networks for example. Vertices may have age, weight, or wealth as their properties. It is found that friendship between individuals is strongly affected by age, race, income, or languages spoken by the individuals. On the one hand, if vertices with similar properties are more likely to be connected together, we say that the network shows assortative mixing. On the other hand, if vertices with different properties are likely to be connected together, we say that the network shows disassortative mixing. It is found that social networks tend to show assortative mixing, while technology networks, information networks, and biological networks tend to show disassortative mixing [12]. The assortativity level of a network is commonly measured by a quantity proposed by Newman [11] called the assortativity coefficient. If the property being measured is degree, the assortativity coefficient reduces to the standard Pearson correlation coefficient [11]. Specifically, let X and Y be the degrees of a pair of randomly selected neighboring vertices. The Pearson degree correlation function is the correlation coefficient of X and Y, i.e.,

\[
\rho(X, Y) \stackrel{\text{def}}{=} \frac{E(XY) - E(X)E(Y)}{\sigma_X \sigma_Y}, \tag{1}
\]

where σX and σY denote the standard deviations of X and Y, respectively. We refer the reader to [3, 6, 10–13, 18] for more information on the assortativity coefficient and other related measures. In this paper we shall focus on degree as the vertex property. Researchers have found that assortative mixing plays a crucial role in the dynamic processes, such as information or disease spreading, taking place on the topology defined by the network [3, 4, 7, 10, 15]. Assortativity also has a fundamental impact on network resilience as the network loses vertices or edges [16]. In order to study information propagation or network resilience, researchers may need to build models with assortative mixing or disassortative mixing. Newman [11] and Xulvi-Brunet et al. [18] proposed algorithms to generate networks with assortative mixing or disassortative mixing based on an idea of rewiring edges. Boguñá et al. [3] proposed a class of random networks in which hidden variables are associated with the vertices of the networks. Establishment of edges is controlled by the hidden variables. Ramezanpour et al. [14] proposed a graph transformation method to convert a configuration model into a graph with degree correlations and non-vanishing clustering coefficients as the network grows in size. However, the degree distribution no longer remains the same as the network is transformed. Zhou [19] also proposed a method to generate networks with assortative mixing


or disassortative mixing using a Monte Carlo sampling method. Compared with these methods, our method has the advantage that specified degree distributions are preserved in the constructed networks. In addition, our method allows us to derive a closed form for the Pearson degree correlation function for two random neighboring vertices.

In this paper we propose a method to generate random networks that possess either the assortative mixing property or the disassortative mixing property. Our method is based on a modified construction method of the configuration model proposed by Bender et al. [2] and Molloy et al. [9]. The modified construction method is as follows. Given a degree distribution, generate a sequence of degrees. Each vertex is associated with a set of “stubs”, where the number of stubs is equal to the degree of the vertex. We sort the stubs of all vertices in ascending order of degree (descending order would be fine as well) and divide the sorted stubs into b blocks. We associate each block with another block. This association forms a permutation, i.e., no two distinct blocks are associated with a common block. We randomly designate a fixed number of stubs in each block as type-1 stubs. The rest of the stubs are all designated as type-2 stubs. To connect stubs, we randomly select a stub. If it is a type-1 stub, we connect it to a randomly selected type-1 stub in the associated block. If it is a type-2 stub, we connect it to a randomly selected type-2 stub out of all type-2 stubs. We repeat this process until all stubs are connected. To generate a generalized configuration model with the assortative mixing property, we select a permutation of blocks such that a block of stubs with large degrees is associated with another block of large degrees. To generate a network with disassortative mixing, we select a permutation such that a block of large degrees is associated with a block of small degrees. We present the details of the construction algorithm in Sect. 2. For this model, we derive a closed form for the Pearson correlation coefficient of the degrees of two neighboring vertices. From the Pearson correlation coefficient of degrees we show that our constructed network can be assortative or disassortative as desired.

The rest of this paper is organized as follows. In Sect. 2 we present our construction method of a random network. In Sect. 3 we derive a closed form for the joint degree distribution of two randomly selected neighboring vertices from a network constructed by the algorithm in Sect. 2. In Sect. 4, we show that the Pearson degree correlation function of two neighboring vertices is linear. We then show how the permutation function h should be selected such that a constructed random graph is assortatively or disassortatively mixed. Numerical examples and simulation results are presented in Sect. 5. Finally, we give conclusions in Sect. 6.
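The Pearson degree correlation of Eq. (1) can be estimated directly from an edge list. The following Python sketch is an illustration under stated assumptions rather than code from the paper: the function name `degree_correlation` is hypothetical, the graph is assumed to be simple and undirected, and each edge is counted in both directions so that X and Y have the same distribution (hence E(X) = E(Y) and σX = σY).

```python
def degree_correlation(edges):
    """Pearson correlation between the degrees at the two ends of a random edge."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    # Each undirected edge contributes both ordered pairs (X, Y) and (Y, X).
    pairs = [(deg[a], deg[b]) for a, b in edges] + [(deg[b], deg[a]) for a, b in edges]
    n = len(pairs)
    ex = sum(x for x, _ in pairs) / n              # E(X) = E(Y)
    exy = sum(x * y for x, y in pairs) / n         # E(XY)
    var = sum(x * x for x, _ in pairs) / n - ex * ex  # sigma_X * sigma_Y
    # Note: var is zero for regular graphs, where the coefficient is undefined.
    return (exy - ex * ex) / var

star = [(0, 1), (0, 2), (0, 3)]
print(degree_correlation(star))
```

A star is an extreme disassortative case: every edge joins the highest-degree vertex to a degree-1 vertex, so the coefficient is −1.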

2 Construction of a Random Network

Research on random networks was pioneered by Erdős and Rényi [5]. Although the Erdős–Rényi model allows researchers to study many network problems, it is limited in that the vertex degree has a Poisson distribution asymptotically as the network grows in size. The configuration model [2, 9] can be considered as an


extension of the Erdős–Rényi model that allows general degree distributions. Configuration models have been used successfully to study the size of giant components, network resilience when vertices or edges are removed, and epidemic spreading on networks. We refer the reader to [12] for more details. In this paper we propose an extension of the classical configuration model. This model generates networks with specified degree sequences. In addition, one can specify a positive or a negative degree correlation for the model. Let there be n vertices and let $p_k$ be the probability that a randomly selected vertex has degree k. Based on the sequence $\{p_k\}$, we draw a degree sequence $k_i$, $i = 1, 2, \ldots, n$, for the n vertices. We give each vertex i a total of $k_i$ stubs. There are $2m = \sum_{i=1}^{n} k_i$ stubs in total, where m is the number of edges of the network. In a classical configuration model, we randomly select an unconnected stub, say s, and connect it to another randomly selected unconnected stub in $[1, 2m] - \{s\}$. We repeat this process until all stubs are connected. The resulting network can be viewed as a matching of the 2m stubs. Each possible matching occurs with equal probability. The consequence of this construction is that the degree correlation of a randomly selected pair of neighboring vertices is zero. To achieve nonzero degree correlation, we arrange the 2m stubs in ascending order (descending order will also work) according to the degree of the vertices to which the stubs belong. We label the stubs accordingly. We partition the 2m stubs evenly into b blocks, selecting the integer b such that 2m is divisible by b. Each block has 2m/b stubs. Block i, where $i = 1, 2, \ldots, b$, contains stubs $(i - 1)(2m/b) + j$ for $j = 1, 2, \ldots, 2m/b$. Next, we choose a permutation function h of $\{1, 2, \ldots, b\}$. If h(i) = j, we say that block j is associated with block i.
In this paper we select h such that h(h(i)) = i, i.e., blocks i and h(i) are mutually associated with each other. In each block, we randomly designate 2mq/b stubs as type-1 stubs, where q is a parameter in the range [0, 1]. The other stubs are designated as type-2 stubs. Randomly select an unconnected stub. Suppose that this stub is in block i. If it is a type-1 stub, connect this stub to a randomly selected unconnected type-1 stub in block h(i). If it is a type-2 stub, connect it to a randomly selected unconnected type-2 stub in [1, 2m]. We repeat this process until all stubs are connected. The construction algorithm is shown in Algorithm 1. We make a few remarks. First, note that in networks constructed by this algorithm, there are mq edges that have two type-1 stubs on their two sides. These edges create degree correlation in the network. On the other hand, there are m(1 − q) edges in the network that have two type-2 stubs on their two sides. These edges do not contribute towards degree correlation in the network. Second, random networks constructed by this algorithm possess the following property. A randomly selected stub connects to another randomly selected stub in the associated block with probability q. With probability 1 − q, this stub connects to a randomly selected stub in [1, 2m]. Finally, note that standard configuration models can have multiple edges connecting two particular vertices. There can also be edges connecting a vertex to itself. These


Inputs: degree sequence {k_i : i = 1, 2, ..., n};
Outputs: graph (G, V, E);
Create 2m stubs arranged in descending order;
Divide 2m stubs into b blocks evenly. Initially, all stubs are unconnected.
For each block, randomly designate 2mq/b stubs as type 1 stubs. All other stubs are designated as type 2 stubs;
while there are unconnected stubs do
    Randomly select a stub. Assume that the stub is in block i;
    if type 1 stub then
        connect this stub with a randomly selected unconnected type 1 stub in block h(i);
    else
        connect this stub with a randomly selected unconnected type 2 stub in [1, 2m];
    end
end

Algorithm 1: Construction algorithm

are called multiedges and self edges. In our constructed networks, multiedges and self edges can also exist. However, it is not difficult to show that the expected density of multiedges and self edges approaches zero as n becomes large. Due to space limitations, we shall not address this issue in this paper.
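The construction in Algorithm 1 can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 0-based block indices, and the list `h` encoding the permutation are all introduced here for illustration. Instead of drawing stubs one at a time, the sketch shuffles each pool of stubs and pairs them off, which is assumed to give the same uniformly random matching inside each pool. It also assumes that 2mq/b is an integer and that every pool matched internally contains an even number of stubs.

```python
import random
from collections import Counter


def generalized_configuration_model(degrees, b, q, h, seed=None):
    """Sketch of Algorithm 1; returns a list of edges (u, v).

    degrees: list of vertex degrees k_i
    b:       number of blocks (2m must be divisible by b)
    q:       fraction of type-1 stubs in each block
    h:       list with h[i] = block associated with block i (0-based)
    """
    rng = random.Random(seed)
    # One stub per unit of degree, arranged in ascending order of vertex degree.
    order = sorted(range(len(degrees)), key=lambda v: degrees[v])
    stubs = [v for v in order for _ in range(degrees[v])]
    if len(stubs) % b:
        raise ValueError("2m must be divisible by b")
    size = len(stubs) // b
    n1 = round(q * size)  # number of type-1 stubs per block
    if abs(n1 - q * size) > 1e-9:
        raise ValueError("2mq/b must be an integer")

    type1, type2 = [], []
    for i in range(b):
        block = stubs[i * size:(i + 1) * size]
        rng.shuffle(block)  # random designation of the type-1 stubs
        type1.append(block[:n1])
        type2.extend(block[n1:])

    edges = []
    for i in range(b):
        j = h[i]
        if h[j] != i:
            raise ValueError("h must satisfy h[h[i]] == i")
        if i < j:
            # Type-1 stubs of block i are matched with those of block h(i).
            edges.extend(zip(type1[i], type1[j]))
        elif i == j:
            if n1 % 2:
                raise ValueError("a self-associated block needs an even "
                                 "number of type-1 stubs")
            edges.extend(zip(type1[i][0::2], type1[i][1::2]))

    if len(type2) % 2:
        raise ValueError("the number of type-2 stubs must be even")
    rng.shuffle(type2)  # type-2 stubs are matched across all blocks
    edges.extend(zip(type2[0::2], type2[1::2]))
    return edges


def degree_counts(edges):
    """Degree of every vertex in an edge list (a self edge counts twice)."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts
```

For example, `generalized_configuration_model([1, 1, 2, 2, 3, 3], b=2, q=0.5, h=[1, 0])` associates the low-degree block with the high-degree block, and `degree_counts` can be used to confirm that the prescribed degree sequence is preserved.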

3 Joint Distribution of Degrees

Consider a randomly selected edge in a random network constructed by the algorithm described in Sect. 2. In this section we analyze the joint degree distribution of the two vertices at the two ends of the edge. Let s(k) be the set of stubs that belong to vertices of degree k. We make the following assumption.

Assumption 1 Stubs in s(k) all belong to a single block for any degree k.

This assumption is approximately true if the block size, 2m/b, is much greater than the number of stubs associated with any degree, i.e., $2m/b \gg n k p_k$ for all blocks and $k = 0, 1, 2, \ldots$. Note that the construction algorithm described in Sect. 2 works without this assumption. However, this assumption allows us to derive a very simple expression for the joint probability mass function (PMF) of the degrees X and Y at the two ends of the edge. This simple expression allows us to analyze the assortativity and disassortativity of the model. Let $S_i$ be the set of stubs in the i-th block, i.e.,

$$S_i = \{ s : 2m(i-1)/b \le s < 2m \cdot i/b \}. \tag{2}$$


Let $H_i$ be the set of degrees that correspond to stubs in set $S_i$. That is,

$$H_i = \{ k : s(k) \subseteq S_i \}. \tag{3}$$

We randomly select a stub in the range [1, 2m]. Denote this stub by $s_Y$. Let $v_Y$ be the vertex to which $s_Y$ belongs, and let Y be the degree of $v_Y$. Let $s_X$ be the stub to which stub $s_Y$ connects according to the construction algorithm in Sect. 2. Denote by $v_X$ the vertex to which stub $s_X$ belongs, and let X be the degree of $v_X$. Suppose that vertex $v_X$ is located in block i. It follows that vertex $v_Y$ is located in block h(i). Since $s_Y$ is randomly selected, the PMF of Y is

$$\Pr(Y = k) = \frac{n k p_k}{2m} = \frac{k p_k}{E(\tilde{X})}, \tag{4}$$

where $\tilde{X}$ is the degree of a randomly selected vertex. Since the selection of vertices is random,

$$\Pr(\tilde{X} = k_i) = \frac{1}{n}$$

for $i = 1, 2, \ldots, n$. Thus, the expectation of $\tilde{X}$ is

$$E(\tilde{X}) = \frac{\sum_{i=1}^{n} k_i}{n} = \frac{2m}{n}.$$

In this section we shall study the joint PMF of X and Y. We first study the marginal PMF of X. Suppose x is a degree in $H_i$. The total number of stubs which belong to vertices with degree x is $n x p_x$. With probability q, stub $s_Y$ is of type 1. Now randomly select a type-1 stub in block i to connect to $s_Y$. Given that stub $s_Y$ is of type 1, the probability that it connects to a stub belonging to a vertex with degree x is

$$\frac{q\, n x p_x}{2mq/b - \delta(i, h(i))},$$

where $\delta(\cdot,\cdot)$ is the Kronecker delta function. On the other hand, given that stub $s_Y$ is of type 2, the probability is

$$\frac{(1-q)\, n x p_x}{2m(1-q) - 1}.$$

Thus,

$$\Pr(X = x \mid Y = y) = \frac{q^2\, n x p_x}{2mq/b - \delta(i, h(i))} + \frac{(1-q)^2\, n x p_x}{2m(1-q) - 1} \tag{5}$$

for $y \in H_{h(i)}$. If $y \in H_j$ for $j \ne h(i)$,

$$\Pr(X = x \mid Y = y) = \frac{(1-q)^2\, n x p_x}{2m(1-q) - 1}. \tag{6}$$

Now assume that the network is large. That is, we consider a sequence of constructed graphs in which $n \to \infty$ and $m \to \infty$, while keeping $2m/n = E(\tilde{X})$. Under these asymptotics, Eqs. (5) and (6) converge to

$$\Pr(X = x \mid Y = y) \to \begin{cases} \dfrac{qb + (1-q)}{E(\tilde{X})}\, x p_x, & y \in H_{h(i)}, \\[2ex] \dfrac{1-q}{E(\tilde{X})}\, x p_x, & y \in H_j,\ j \ne h(i). \end{cases} \tag{7}$$

From the law of total probability we have

$$\Pr(X = x) = \sum_{y \in H_{h(i)}} \Pr(X = x \mid Y = y) \Pr(Y = y) + \sum_{j \ne h(i)} \sum_{y \in H_j} \Pr(X = x \mid Y = y) \Pr(Y = y). \tag{8}$$

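Substituting (4) and (7) into (8), we have

$$\Pr(X = x) = \sum_{y \in H_{h(i)}} \frac{qb + (1-q)}{E(\tilde{X})}\, x p_x\, \frac{y p_y}{E(\tilde{X})} + \sum_{j \ne h(i)} \sum_{y \in H_j} \frac{1-q}{E(\tilde{X})}\, x p_x\, \frac{y p_y}{E(\tilde{X})}. \tag{9}$$

Since the partition of stubs is uniform,

$$\sum_{y \in H_j} n y p_y = 2m/b$$

and thus,

$$\sum_{y \in H_j} y p_y = E(\tilde{X})/b$$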

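for any $j = 1, 2, \ldots, b$. Substituting this into (9), we have

$$\Pr(X = x) = \frac{x p_x}{E(\tilde{X})}. \tag{10}$$

From (7) we derive the joint PMF of X and Y:

$$\Pr(X = x, Y = y) = \Pr(X = x \mid Y = y) \Pr(Y = y) = \begin{cases} (bq + 1 - q)\, \dfrac{x p_x}{E(\tilde{X})}\, \dfrac{y p_y}{E(\tilde{X})}, & x \in H_i,\ y \in H_{h(i)}, \\[2ex] (1-q)\, \dfrac{x p_x}{E(\tilde{X})}\, \dfrac{y p_y}{E(\tilde{X})}, & x \in H_i,\ y \in H_j,\ j \ne h(i) \end{cases} = C_{ij}\, \frac{x y\, p_x p_y}{(E(\tilde{X}))^2}, \tag{11}$$

where

$$C_{ij} = \begin{cases} bq + 1 - q, & h(i) = j, \\ 1 - q, & h(i) \ne j. \end{cases} \tag{12}$$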

We summarize the results in the following theorem.

Theorem 1 Let G be a graph generated by the construction algorithm described in Sect. 2 based on a sequence of degrees $k_1, k_2, \ldots, k_n$. Randomly select an edge from G. Let X and Y be the degrees of the two vertices at the two ends of the edge. Then, the marginal PMFs of X and Y are given in (10) and (4), respectively. The joint PMF of X and Y is given in (11).
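The joint PMF in (11) and (12) can be evaluated directly for a small example. The sketch below is illustrative only: the function name `joint_pmf`, the 0-based block indices, and the toy degree distribution (degrees 1 and 2, each owning half of the stubs so that the partition is uniform and Assumption 1 holds) are assumptions made here, not part of the paper. Exact fractions are used so that the total probability and the marginal in (10) can be checked without rounding.

```python
from fractions import Fraction


def joint_pmf(p, blocks, h, q):
    """Joint PMF of (X, Y) following (11) and (12).

    p:      dict mapping degree k to p_k
    blocks: blocks[i] is the list of degrees H_i (0-based block index)
    h:      list with h[i] = block associated with block i
    q:      fraction of type-1 stubs
    """
    b = len(blocks)
    mean_deg = sum(k * pk for k, pk in p.items())  # E(X~)
    pmf = {}
    for i, H_i in enumerate(blocks):
        for j, H_j in enumerate(blocks):
            # C_ij from (12)
            c_ij = (b * q + 1 - q) if h[i] == j else (1 - q)
            for x in H_i:
                for y in H_j:
                    pmf[(x, y)] = c_ij * x * y * p[x] * p[y] / mean_deg ** 2
    return pmf


# Toy example (hypothetical numbers): k * p_k is the same for both degrees.
p = {1: Fraction(2, 3), 2: Fraction(1, 3)}
pmf = joint_pmf(p, blocks=[[1], [2]], h=[0, 1], q=Fraction(1, 2))
total = sum(pmf.values())
marginal_x1 = sum(v for (x, _), v in pmf.items() if x == 1)
```

In this toy case the probabilities sum to one and the marginal of X at degree 1 equals $1 \cdot p_1 / E(\tilde{X}) = 1/2$, in agreement with (10).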

4 Assortativity and Disassortativity

In this section, we present an analysis of the Pearson degree correlation function of two random neighboring vertices. The goal is to search for a permutation function h such that the numerator of (1) is non-negative (resp. non-positive) for the constructed network. From (10), we obtain the expected value of X:

$$E(X) = \sum_{x} x \Pr(X = x) = \frac{1}{E(\tilde{X})} \sum_{i=1}^{b} \sum_{x \in H_i} x^2 p_x = \frac{1}{E(\tilde{X})} \sum_{i=1}^{b} u_i, \tag{13}$$

where

$$u_i \overset{\text{def}}{=} \sum_{x \in H_i} x^2 p_x. \tag{14}$$

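Now we consider the expected value of the product XY. We have from (11) that

$$E(XY) = \sum_{x} \sum_{y} x y \Pr(X = x, Y = y) = \sum_{i=1}^{b} \sum_{j=1}^{b} \sum_{x \in H_i} \sum_{y \in H_j} \frac{C_{ij}\, x^2 y^2 p_x p_y}{(E(\tilde{X}))^2} = \sum_{i} \sum_{j} \frac{C_{ij}\, u_i u_j}{(E(\tilde{X}))^2} = \frac{1}{(E(\tilde{X}))^2} \left( (1-q) \sum_{i} \sum_{j} u_i u_j + qb \sum_{i} u_i u_{h(i)} \right). \tag{15}$$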


Note from (13) and (15) that

$$E(XY) - E(X)E(Y) = \frac{q}{(E(\tilde{X}))^2} \left( b \sum_{i} u_i u_{h(i)} - \sum_{i} \sum_{j} u_i u_j \right). \tag{16}$$

Based on (16), we summarize the Pearson degree correlation function in the following theorem.

Theorem 2 Let G be a graph generated by the construction algorithm in Sect. 2. Randomly select an edge from the graph. Let X and Y be the degrees of the two vertices at the two ends of this edge. Then, the Pearson degree correlation function of X and Y is

$$\rho(X, Y) = cq, \tag{17}$$

where

$$c = \frac{b \sum_{i} u_i u_{h(i)} - \sum_{i} \sum_{j} u_i u_j}{\sigma_X \sigma_Y\, (E(\tilde{X}))^2},$$

and $\sigma_X$ and $\sigma_Y$ are the standard deviations of the PMFs in (10) and (4).

In view of (17), the sign of $\rho(X, Y)$ depends on the constant c. To generate assortative (resp. disassortative) mixing random graphs, we first sort the $u_i$'s in descending order and then choose the permutation h that maps the largest $u_i$'s to the largest (resp. smallest) $u_i$'s. This is formally stated in the following corollary.

Corollary 1 Let $\pi(\cdot)$ be the permutation such that $u_{\pi(i)}$ is the i-th largest number among $u_i$, $i = 1, 2, \ldots, b$, i.e.,

$$u_{\pi(1)} \ge u_{\pi(2)} \ge \ldots \ge u_{\pi(b)}.$$

(i) If we choose the permutation h with $h(\pi(i)) = \pi(i)$ for all i, then the constructed random graph is assortative mixing.
(ii) If we choose the permutation h with $h(\pi(i)) = \pi(b + 1 - i)$ for all i, then the constructed random graph is disassortative mixing.

The proof of Corollary 1 is based on the famous Hardy, Littlewood, and Pólya rearrangement inequality (see, e.g., the book [8], p. 141).

Proposition 1 (Hardy, Littlewood, and Pólya Rearrangement Inequality) Let $u_i$, $v_i$, $i = 1, 2, \ldots, b$, be two sets of real numbers. Let $u_{[i]}$ (resp. $v_{[i]}$) be the i-th largest number among $u_i$, $i = 1, 2, \ldots, b$ (resp. $v_i$, $i = 1, 2, \ldots, b$). Then

$$\sum_{i=1}^{b} u_{[i]} v_{[b-i+1]} \le \sum_{i=1}^{b} u_i v_i \le \sum_{i=1}^{b} u_{[i]} v_{[i]}. \tag{18}$$

Proof (Corollary 1) (i) Consider the circular shift permutation $\sigma_j(\cdot)$ with $\sigma_j(i) = ((i + j - 1) \bmod b) + 1$ for $j = 1, 2, \ldots, b$. From symmetry, we have $\sigma_j(i) = \sigma_i(j)$. Thus,

$$\sum_{i=1}^{b} \sum_{j=1}^{b} u_i u_j = \sum_{i=1}^{b} \sum_{j=1}^{b} u_i u_{\sigma_i(j)} = \sum_{j=1}^{b} \sum_{i=1}^{b} u_i u_{\sigma_j(i)}. \tag{19}$$

Using the upper bound of the Hardy, Littlewood, and Pólya rearrangement inequality in (19) and $h(\pi(i)) = \pi(i)$ yields

$$\sum_{i=1}^{b} u_i u_{\sigma_j(i)} \le \sum_{i=1}^{b} u_{[i]} u_{[i]} = \sum_{i=1}^{b} u_{\pi(i)} u_{h(\pi(i))} = \sum_{i=1}^{b} u_i u_{h(i)}. \tag{20}$$

In view of (16) and (19), we conclude that the generated random graph is assortative mixing. (ii) Using the lower bound of the Hardy, Littlewood, and Pólya rearrangement inequality in (19) and $h(\pi(i)) = \pi(b + 1 - i)$ yields

$$\sum_{i=1}^{b} u_i u_{\sigma_j(i)} \ge \sum_{i=1}^{b} u_{[i]} u_{[b+1-i]} = \sum_{i=1}^{b} u_{\pi(i)} u_{h(\pi(i))} = \sum_{i=1}^{b} u_i u_{h(i)}. \tag{21}$$

In view of (16) and (19), we conclude that the generated random graph is disassortative mixing. 
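The constant c of Theorem 2 and the two permutations of Corollary 1 can be computed for a small example. The sketch below is only an illustration under assumptions introduced here: the function names, the 0-based indexing, and the toy distribution are hypothetical, and it uses the fact that X and Y share the same marginal PMF, so that $\sigma_X \sigma_Y$ equals the variance of that PMF.

```python
from fractions import Fraction


def block_moments(p, blocks):
    """u_i = sum of x^2 p_x over the degrees x in block H_i, as in (14)."""
    return [sum(x * x * p[x] for x in H) for H in blocks]


def correlation_constant(p, blocks, h):
    """Constant c of Theorem 2, so that rho(X, Y) = c * q."""
    b = len(blocks)
    u = block_moments(p, blocks)
    mean_deg = sum(k * pk for k, pk in p.items())  # E(X~)
    # X and Y share the marginal PMF x p_x / E(X~), so
    # sigma_X * sigma_Y is the variance of that PMF.
    first = sum(k ** 2 * pk for k, pk in p.items()) / mean_deg
    second = sum(k ** 3 * pk for k, pk in p.items()) / mean_deg
    variance = second - first ** 2
    numerator = b * sum(u[i] * u[h[i]] for i in range(b)) - sum(u) ** 2
    return numerator / (variance * mean_deg ** 2)


def corollary_permutations(u):
    """Permutations of Corollary 1 (0-based): identity and reversed ranks."""
    b = len(u)
    # rank[i] plays the role of pi(i + 1): index of the (i+1)-th largest u.
    rank = sorted(range(b), key=lambda i: u[i], reverse=True)
    assortative = list(range(b))
    disassortative = [0] * b
    for i in range(b):
        disassortative[rank[i]] = rank[b - 1 - i]
    return assortative, disassortative


# Toy example (hypothetical numbers): two blocks holding degrees 1 and 2.
p = {1: Fraction(2, 3), 2: Fraction(1, 3)}
blocks = [[1], [2]]
h_assort, h_disassort = corollary_permutations(block_moments(p, blocks))
c_assort = correlation_constant(p, blocks, h_assort)
c_disassort = correlation_constant(p, blocks, h_disassort)
```

In this toy case the identity permutation gives c = 1 and the reversed permutation gives c = −1, so $\rho(X, Y) = q$ and $\rho(X, Y) = -q$ respectively, consistent with the signs stated in Corollary 1.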

5 Numerical and Simulation Results

We report our simulation results in this section. Recall that we derived the degree covariance of two neighboring vertices based on Assumption 1. Assumption 1 is somewhat restrictive. For degree sequences that do not satisfy Assumption 1, the analyses in Sects. 3 and 4 are only approximate. In this section, we compare simulation results with the analytical results in Sect. 4. We have simulated the construction of networks with 4000 vertices. We use the batch means simulation method to control the simulation variance. Specifically, each simulation is repeated 100 times to obtain 100 graphs. Equation (1) was applied to compute the assortativity coefficient for each graph. One average is computed for


Fig. 1 Degree correlation of an assortative model. Power-law degree distribution and b = 6 (correlation coefficient versus q; curves: simulation, analysis)

Fig. 2 Degree correlation of a disassortative model. Power-law degree distribution and b = 6 (correlation coefficient versus q; curves: simulation, analysis)

every twenty repetitions. Ninety percent confidence intervals are computed based on the five averages. We have run an extensive number of simulations for uniform and Poisson degree distributions. We have found that the simulation results for the Pearson degree correlation coefficient agree extremely well with (17) for a wide range of b and q. Due to space limitations, we do not present these results in the paper. We have also simulated power-law degree distributions. Specifically, we assume that the exponent of the power-law distribution is negative two, i.e., $p_k \approx k^{-2}$ for large k. We first fix b at six. The degree correlations for power-law degree distributions are shown in Figs. 1 and 2 for an assortatively mixed network and a disassortatively mixed network, respectively. The discrepancy between the simulation result and the analytical result is quite noticeable in Fig. 1 when q is large, while the two results agree very well in Fig. 2. This is because power-law distributions can generate very large sample values for degrees. As a result, Assumption 1 may fail in this case.
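The batch means procedure described in this section (100 repetitions, one average per twenty repetitions, a ninety percent confidence interval from the five averages) can be sketched as follows. The text does not state how the interval is formed, so the use of a Student-t quantile with four degrees of freedom is an assumption of this sketch, as are the function and constant names.

```python
import math
import statistics

# Assumed: 0.95 quantile of Student's t with 4 degrees of freedom, which
# gives a two-sided 90% interval when there are five batch means.
T_95_DF4 = 2.132


def batch_means_ci(values, batch_size=20, t_quantile=T_95_DF4):
    """Return (grand mean, half-width) of an interval built from batch means."""
    if len(values) % batch_size:
        raise ValueError("len(values) must be a multiple of batch_size")
    means = [statistics.fmean(values[i:i + batch_size])
             for i in range(0, len(values), batch_size)]
    grand_mean = statistics.fmean(means)
    half_width = t_quantile * statistics.stdev(means) / math.sqrt(len(means))
    return grand_mean, half_width
```

Here `values` would hold the 100 assortativity coefficients, one per simulated graph; the reported interval is the grand mean plus or minus the half-width.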

Fig. 3 Degree correlation of an assortative model. Power-law degree distribution and b = 2 (correlation coefficient versus q; curves: simulation, analysis)

We decrease b to two, which increases the block size. The corresponding Pearson degree correlation function for an assortatively mixed network is presented in Fig. 3. One can see that the approximation accuracy is dramatically increased as the block size is increased.

6 Conclusions

In this paper we have presented an extension of the classical configuration model. Like a classical configuration model, the extended configuration model allows users to specify an arbitrary degree distribution. In addition, the model allows users to specify a positive or a negative assortativity coefficient. We derived a closed form for the assortativity coefficient of this model. We verified our result with simulations.

Acknowledgements This research was supported in part by the National Science Council, Taiwan, R.O.C., under Contract NSC-105-2221-E-007-036-MY3.

References

1. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286, 509–512 (1999)
2. Bender, E.A., Canfield, E.R.: The asymptotic number of labelled graphs with given degree sequences. J. Comb. Theory Series A 24, 296–307 (1978)
3. Boguñá, M., Pastor-Satorras, R., Vespignani, A.: Absence of epidemic threshold in scale-free networks with degree correlations. Phys. Rev. Lett. 90(2), 028701 (2003)
4. Eguíluz, V.M., Klemm, K.: Epidemic threshold in structured scale-free networks. Phys. Rev. Lett. 89, 108701 (2002)
5. Erdős, P., Rényi, A.: On random graphs. Publ. Math. 6, 290–297 (1959)
6. Litvak, D.N., van der Hofstad, R.: Degree-degree correlations in random graphs with heavy-tailed degrees. arXiv preprint arXiv:1202.3071 (2012)
7. Marian, B., Pastor-Satorras, R.: Epidemic spreading in correlated complex networks. Phys. Rev. E 66, 047104 (2002)
8. Marshall, A.W., Olkin, I., Arnold, B.: Inequalities: Theory of Majorization and Its Applications. Springer Science & Business Media (2010)
9. Molloy, M., Reed, B.: A critical point for random graphs with a given degree sequence. Random Struct. Algorithm. 6, 161–179 (1995)
10. Moreno, Y., Gómez, J.B., Pacheco, A.F.: Epidemic incidence in correlated complex networks. Phys. Rev. E 68, 035103 (2003)
11. Newman, M.: Mixing patterns on networks. Phys. Rev. E 67, 026126 (2003)
12. Newman, M.: Networks: An Introduction. Oxford University Press, Oxford (2010)
13. Nikoloski, Z., Deo, N., Kucera, L.: Degree-correlation of a scale-free random graph process. In: Proceedings of the Theoretical Computer Science (DMTCS), pp. 239–244. AE, EuroComb (2005)
14. Ramezanpour, A., Karimipour, V., Mashaghi, A.: Generating correlated networks from uncorrelated ones. Phys. Rev. E 67, 046107 (2005)
15. Schläpfer, M., Buzna, L.: Decelerated spreading in degree-correlated networks. Phys. Rev. E 85, 015101 (2012)
16. Vázquez, A., Moreno, Y.: Resilience to damage of graphs with degree correlations. Phys. Rev. E 67, 015101 (2003)
17. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998)
18. Xulvi-Brunet, R., Sokolov, I.M.: Changing correlations in networks: assortativity and dissortativity. Acta Phys. Pol. B 36, 1431–1455 (2005)
19. Zhou, J., Xu, X., Zhang, J., Sun, J., Small, M., Lu, J.-A.: Generating an assortative network with a given degree distribution. Int. J. Bifurcation Chaos 18, 3495–3502 (2008)

Missing Data Augmentation for Bayesian Exponential Random Multi-Graph Models

Robert W. Krause and Alberto Caimo

Abstract In this paper we present an estimation algorithm for Bayesian exponential random multi-graphs (BERmGMs) under missing network data. Social actors are often connected with more than one type of relation, thus forming a multiplex network. It is important to consider these multiplex structures simultaneously when analyzing a multiplex network. The importance of proper models of multiplex network structures is even more pronounced under the issue of missing network data. The proposed algorithm is able to estimate BERmGMs under missing data and can be used to obtain proper multiple imputations for multiplex network structures. It is an extension of Bayesian exponential random graphs (BERGMs) as implemented in the Bergm package in R. We demonstrate the algorithm on a well-known example, with and without artificially simulated missing data.

1 Introduction

In recent years, it has become more and more apparent that understanding social structure often requires taking more than one type of social relation into account, in so-called multiplex networks, or multi-graphs. Notable examples include the important interrelations between friendships and advice-seeking behavior [18], the importance of antipathies in the maintenance of friendship group structures [19], and the relationships between joint drug use, sexual relations, and co-visitation of social venues [6]. These papers, and many others, show that to understand one type of relation it is often necessary to also take other types of relations between actors into account. Unfortunately, generative algorithms for the estimation of multiplex network models are scarce.

R. W. Krause
University of Groningen, Groningen, The Netherlands

A. Caimo
Dublin Institute of Technology, Dublin, Ireland

© Springer Nature Switzerland AG 2019
S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_5


Generative models for the analysis of multiplex networks offer a great advantage for the treatment of missing data. While missing data is a problem for all (social) sciences, network models suffer particularly under missing data because of the strong dependencies within the data. Non-response by one participant does not only mean that we know less about this participant; we also know less about the social network of all other participants. After all, the missing participant could have nominated any of the other participants, thus potentially changing the network structure drastically. Previous work has primarily focused on single-layer networks. Statistical tools have been developed to handle missing social network data, to obtain both more reliable model estimates [7, 10, 11, 17] and reliable descriptive statistics [12]. Extending this work to multiplex models is likely to lead to less biased model estimates (less biased statistics), because multiplexity allows information from observed layers to be used for the augmentation of missing data in other layers. In this paper we propose an extension of previous work on missing data augmentation to the context of exponential random multi-graph models (ERmGMs). We advance the literature in two ways: first, by proposing an estimation procedure for Bayesian ERmGMs, and second, by providing proper multiple imputation of missing multiplex network data. In Sect. 2 of this paper we introduce the exponential random graph model family and the proposed multiplex extension algorithm. Section 3 details the missing data problem for networks and how the algorithm is extended to properly estimate BERmGMs under missing data, and an example application is presented in Sect. 4. We end the paper with a discussion of our findings and corresponding recommendations.

2 Bayesian ERmGMs

Before we describe the proposed algorithm for BERmGMs, we will first introduce the ERG-family and network multiplexity.

2.1 Bayesian Inference for ERGMs

The exponential random graph model family (ERGMs; [13]) is most commonly used to analyze cross-sectional network data. ERGMs model an observed network, or graph, as a function of sufficient network statistics (primarily counts of subgraph configurations, e.g., number of ties, number of reciprocated ties, or number of transitive triplets). A network graph is expressed as a random n × n adjacency matrix Y with $Y_{ij} = 1$ when there is a tie from node i to node j and $Y_{ij} = 0$ when there is no tie. Usually, edges connecting a node with itself are not allowed ($Y_{ii} = 0$). Networks can be directed or undirected ($Y_{ij} = Y_{ji}$). Let $\mathcal{Y}$ denote the set of all possible networks on n nodes and let y be a realization of $Y \in \mathcal{Y}$. Then, in Bayesian ERGMs (BERGMs) the posterior probability of the parameters conditional on the data is given by

$$p(\theta \mid y) = \frac{\exp[\theta^{T} s(y)]}{z(\theta)}\, \frac{p(\theta)}{p(y)}, \tag{1}$$

with θ being a vector of model parameters, s(y) a vector of corresponding sufficient network statistics, z(θ ) the normalizing constant, p(θ ) the prior distribution of the parameters, and p(y) is the marginal probability. See Lusher et al. for an introduction to ERGMs [13].

2.2 Multiplexity

Multiplex networks are structures with multiple different types of relations on the same set of nodes. Multiplex networks can thus be expressed as a random n × n × m adjacency array or cube Y with $Y_{ijm} = 1$ when there is a tie from node i to node j on network m and $Y_{ijm} = 0$ when there is no such tie on network m. Each layer m of the multiplex network can be either directed or undirected. Multiplex ERGMs were first introduced by Pattison and Wasserman [15] and later extended by Wang [21]. Multiplexity increases the complexity of network models by an additional factor: while a single-layer directed network has $2^{n \times n - n}$ possible configurations (e.g., a network of 20 nodes has $\sim 2.5 \times 10^{114}$ possible configurations), this number increases exponentially with the number of layers, to $2^{(n \times n - n) \times m}$ (e.g., a multiplex network of 20 nodes with 2 layers has $\sim 6.1 \times 10^{228}$ possible configurations).
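The configuration counts quoted above can be reproduced with exact integer arithmetic. The snippet is a small sketch; the function name `n_configurations` is introduced here for illustration only.

```python
def n_configurations(n, layers=1):
    """Number of directed multiplex networks on n nodes without self-ties."""
    return 2 ** ((n * n - n) * layers)


single_layer = n_configurations(20)
two_layers = n_configurations(20, layers=2)

# Scientific notation makes the orders of magnitude easy to compare.
print(f"{single_layer:.1e}", f"{two_layers:.1e}")
```

Adding a second layer squares the number of configurations, which is why the exponent doubles from 114 to 228.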

2.3 Posterior Parameter Estimation for BERmGMs

The Markov-Chain Monte-Carlo (MCMC) estimation algorithm of the posterior $p(\theta \mid y)$ is an extension of the approximate exchange algorithm introduced by Caimo and Friel [1] and currently implemented in the Bergm package in R [2]. The algorithm samples from the following distribution:

$$p(\theta', y', \theta \mid y) \propto p(y \mid \theta)\, p(\theta)\, \epsilon(\theta' \mid \theta)\, p(y' \mid \theta'), \tag{2}$$

with $p(y' \mid \theta')$ being the likelihood on which the simulated data $y'$ are defined, belonging to the same exponential family of densities as $p(y \mid \theta)$, and $\epsilon(\theta' \mid \theta)$ being an arbitrary proposal distribution for the parameter $\theta'$. This proposal distribution is set to be a normal distribution centered at $\theta$. At each MCMC iteration, the exchange algorithm consists of three main steps. First, a Gibbs update of $\theta'$ is proposed. Second, a Gibbs update of $y'$ follows, which is a draw from $p(\cdot \mid \theta')$ obtained with an MCMC algorithm [9]. Third, an exchange, or swap, from the current state $\theta$ to the proposed new parameter $\theta'$ is proposed. This deterministic proposal is accepted with the following probability:


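$$\min\left(1,\ \frac{q_{\theta}(y')\, p(\theta')\, \epsilon(\theta \mid \theta')\, q_{\theta'}(y)}{q_{\theta}(y)\, p(\theta)\, \epsilon(\theta' \mid \theta)\, q_{\theta'}(y')} \times \frac{z(\theta)\, z(\theta')}{z(\theta')\, z(\theta)}\right) \tag{3}$$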

where $q_{\theta}$ and $q_{\theta'}$ indicate the unnormalized likelihoods for parameters $\theta$ and $\theta'$, respectively. The intractable normalizing constants cancel each other out in this equation, thus avoiding the problem of calculating them. The exchange algorithm is asymptotically exact [4]. Concretely, the algorithm is implemented in the following way:

Algorithm 1 Approximate exchange algorithm for BERmGMs
Initialise θ
for i = 1, ..., N do
    Generate θ′ from ε(·|θ)
    loop
        for m = 1, ..., M do
            Simulate one (or a few) tie swap y′_m from p(·|θ′, y′)
        end for
    end loop
    Update θ → θ′ with the log of the probability:
        min(0, [θ − θ′]^T [s(y′) − s(y)] + log(p(θ′)/p(θ)))    (4)
end for

The key change to the regular Bergm algorithm is in the network simulation loop which is here sampling from a multiplex network space. Instead of directly simulating a new multiplex network y  with the proposed parameter θ  , the simulation is done iteratively for each of the M layers of ties by proposing one (or a few) tie swap on each layer, conditional on the proposed parameter vector θ  and on the current state of y  , thus including all tie swaps simulated on all layers of the network in this and previous iterations. This is repeated until convergence is reached and a sample is drawn from p(·|θ  ). Adaptive procedures such as the adaptive direction sampling [1, 20] or the delayed rejection sampling [3] can be adopted.
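The acceptance step in (4) only needs the two parameter vectors, the two vectors of sufficient statistics, and the prior. The following sketch illustrates that computation; it is not code from the Bergm package, and the function names and the flat prior are assumptions introduced for the example.

```python
def log_acceptance(theta, theta_prop, s_obs, s_sim, log_prior):
    """Log acceptance probability of the exchange move, following (4).

    theta, theta_prop: current and proposed parameter vectors
    s_obs, s_sim:      sufficient statistics s(y) and s(y')
    log_prior:         function returning the log prior density of a vector
    """
    inner = sum((t - tp) * (sim - obs)
                for t, tp, obs, sim in zip(theta, theta_prop, s_obs, s_sim))
    return min(0.0, inner + log_prior(theta_prop) - log_prior(theta))


def flat_log_prior(theta):
    """Hypothetical improper flat prior, used only for illustration."""
    return 0.0
```

In a sampler, the move from θ to θ′ would be accepted when the logarithm of a uniform random number on (0, 1) falls below the returned value.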

2.4 Cross Network Effects

Currently three fundamental dyadic cross network effects are implemented for the algorithm: (1) co-occurrence, (2) entrainment, and (3) cross network reciprocity. Co-occurrence expresses the tendency of edges on one layer to occur with edges on another layer in an undirected graph, and entrainment is its directed counterpart. The corresponding sufficient statistic can thus be calculated similarly for both:

$$s_{\mathrm{co|ent}}(y) = \sum_{i \ldots} y_{ij1}\, y_{ij2}. \tag{5}$$

… 1, remain constant if $\bar{r} = 1$ and decay if $0 \le \bar{r} < 1$. As in this example, multiplicative noise generally has a net-negative effect on growth in the long term [39–41]. Models of evolutionary game theory are more complex but share the same underlying property, which leads to noise-induced non-ergodic behavior. In the classical Hawk–Dove game two birds meet and compete for a shareable resource V, the positive payoff. If a Hawk meets a Dove, the Hawk alone gets the resource; if two Doves meet, they share the resource; and if two Hawks meet, they fight for the resource, which costs energy and implies the risk of getting injured, formalized by a negative payoff −C. Since 50% of the Hawks win and 50% of the Hawks lose a fight, the average payoff of a Hawk meeting a Hawk in the limit of an infinite population is $\frac{V - 0}{2} + \frac{0 - C}{2} = \frac{V - C}{2}$. Figure 1a shows that for V = 1 and C = 1.5 the time-discrete replicator dynamics leads to an evolutionarily stable state in which a larger population of Hawks coexists with a smaller population of Doves. However, in a changing environment the payoff matrix will not be constant. For example, the abundance of the food resource may change periodically with the seasons, or the risk of death


Fig. 1 Selection reversal in a Hawk–Dove game with constant, periodic, and random payoff. (a) describes a traditional Hawk–Dove game. The population starts at x1 = x2 = 0.5 (50 % Hawks and 50% Doves) and converges to an evolutionarily stable state where x1 > x2 . Periodically (b) or randomly fluctuating payoffs (c) shift the evolutionarily stable state such that x1 < x2

caused by an injury may depend on the presence of predators. Figure 1b, c shows how the evolutionarily stable state can change if V or C fluctuates such that their averages are still the same as in Fig. 1a. Similar to the aforementioned example with the exponential growth process, the noise has a net-negative effect on the long-term growth of the strategies in replicator dynamics, too. Due to the specific structure of the Hawk–Dove game payoff matrix, the negative effect of the noise of both V and C is stronger for the population of Hawks than for the Doves, such that with sufficient noise the Doves dominate the population in the evolutionarily stationary state. Next, we show that these anomalous effects are generic for evolutionary games. In evolutionary game theory the interactions are usually formalized in a payoff function, which specifies the reward from the interaction with another player that is received by a given individual. In the simplest case, a game with two strategies is determined by a payoff matrix M with 2 × 2 matrix elements. We describe the state of the population as $\mathbf{x}$ ($\sum_i x_i = 1$), where $x_i \ge 0$ is the fraction of players with strategy $i \in \{1, 2\}$. Players with strategy i receive the payoff $P_i = (M\mathbf{x})_i + b$, where the background fitness b ensures that the payoff is positive. The assumption that species that receive a higher payoff reproduce faster can be formalized by the replicator equation, which is used here in its time-discrete form [42]

$$x_i^{(t+1)} = x_i^{(t)} \cdot r_i(\mathbf{x}^{(t)}, M), \tag{1}$$

with

$$r_i(\mathbf{x}^{(t)}, M) = \frac{(M\mathbf{x}^{(t)})_i + b}{\mathbf{x}^{(t)T} M \mathbf{x}^{(t)} + b} = \frac{P_i}{\langle P \rangle} \tag{2}$$

and the average payoff of the population $\langle P \rangle = x_1 P_1 + x_2 P_2$.
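The time-discrete replicator equation (1) with the growth rate (2) can be iterated numerically for the constant Hawk–Dove game with V = 1 and C = 1.5. This is a minimal sketch under assumptions not fixed by the text: the payoff matrix layout (row strategy receives the payoff, Hawk first, a Dove meeting a Hawk receives 0), the background fitness b = 1, and all function names are chosen here for illustration.

```python
def hawk_dove_matrix(V, C):
    """Payoffs to the row strategy; index 0 = Hawk, index 1 = Dove."""
    return [[(V - C) / 2, V],
            [0.0, V / 2]]


def replicator_step(x1, M, b):
    """One step of the time-discrete replicator equation for two strategies."""
    x = (x1, 1.0 - x1)
    payoff = [M[i][0] * x[0] + M[i][1] * x[1] + b for i in range(2)]  # P_i
    mean_payoff = x[0] * payoff[0] + x[1] * payoff[1]                 # <P>
    return x1 * payoff[0] / mean_payoff


def iterate(x1, M, b, steps):
    """Apply the replicator step repeatedly and return the final x1."""
    for _ in range(steps):
        x1 = replicator_step(x1, M, b)
    return x1


# Start from 50% Hawks and 50% Doves, as in Fig. 1a.
hawk_fraction = iterate(0.5, hawk_dove_matrix(V=1.0, C=1.5), b=1.0, steps=1000)
```

Under these assumptions the Hawk fraction settles at V/C = 2/3, the state in which both strategies receive the same payoff, so Hawks outnumber Doves as described for the constant-payoff case.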


Following Smith [43], “a population is said to be in an ‘evolutionarily stable state’ [henceforth ESS] if its genetic composition is restored by selection after a disturbance, provided the disturbance is not too large.” Hence the ESS describes the long-term behavior of the system and are stable stationary states of Eq. (1). For a constant payoff matrix M, the stationary states x∗ satisfy ri (x∗ , M) = 1. If two species coexist, r1 (x∗ , M) = r2 (x∗ , M) implies that both receive the same payoff P1 = P2 = P , as otherwise the species with the higher payoff would move the system away from this state due to faster growth. Now consider continuously changing payoffs with finite means. The stationary states x∗ (t) are solutions of

$$\bar{r}_i(x^*, M) := \lim_{T \to \infty} \left( \prod_{t=0}^{T-1} r_i\big(x^*(t), M^{(t)}\big) \right)^{1/T} = 1, \tag{3}$$

where $M^{(t)}$ is the time-dependent payoff matrix. Equation (3) defines the geometric average, indicated henceforth by the bar. If the payoff matrix changes deterministically with period $T$, a stationary state is a periodic function $x^*(t) = x^*(t + T)$; if it changes randomly, a stationary state is a random function $x^*(t)$ with distribution $\rho^*(x)$.

But how does one calculate the stationary states for periodically and randomly changing payoff matrices? In contrast to the normal ESS, the stationary states are not solutions of $\langle P_1 \rangle = \langle P_2 \rangle$, where $\langle P_i \rangle := \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \big(M^{(t)} x^*(t)\big)_i$ is the arithmetic time average of the received payoff. Equation (3) implies that $\bar{r}_1(x^*, M) = \bar{r}_2(x^*, M) = 1$, and, using Eq. (2), that

$$\bar{P}_1 = \bar{P}_2. \tag{4}$$

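If the fluctuations are small, we can approximate the geometric mean by $\bar{P}_i = \langle P_i \rangle - \frac{\sigma_i^2}{2 \langle P_i \rangle} + O(\sigma_i^4)$, where $\sigma_i^2 = \mathrm{Var}[P_i]$. Using this approximation in Eq. (4) yields

$$\langle P_1 \rangle - \frac{\sigma_1^2}{2 \langle P_1 \rangle} = \langle P_2 \rangle - \frac{\sigma_2^2}{2 \langle P_2 \rangle}. \tag{5}$$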
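The small-fluctuation approximation of the geometric mean can be motivated by a short expansion. The following is a sketch, not taken from the chapter: it writes $P_i = \langle P_i \rangle + \delta_i$ with $\langle \delta_i \rangle = 0$ and $\langle \delta_i^2 \rangle = \sigma_i^2$ (the symbol $\delta_i$ is introduced here only for this derivation) and keeps terms up to second order in $\delta_i$.

```latex
\begin{align*}
\bar{P}_i &= \exp\big( \langle \ln P_i \rangle \big), \\
\ln P_i &= \ln \langle P_i \rangle + \frac{\delta_i}{\langle P_i \rangle}
           - \frac{\delta_i^2}{2 \langle P_i \rangle^2} + \dots, \\
\langle \ln P_i \rangle &\approx \ln \langle P_i \rangle
           - \frac{\sigma_i^2}{2 \langle P_i \rangle^2}, \\
\bar{P}_i &\approx \langle P_i \rangle
           \exp\!\left( - \frac{\sigma_i^2}{2 \langle P_i \rangle^2} \right)
           \approx \langle P_i \rangle - \frac{\sigma_i^2}{2 \langle P_i \rangle}.
\end{align*}
```

The geometric mean is therefore always at most the arithmetic mean, and the gap grows with the variance of the payoff.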

Equation (5) shows that $\langle P_1 \rangle$ and $\langle P_2 \rangle$ are generally different, which is why we call these stationary states unfair. It includes the case of constant payoff values as a special case.¹ Figure 2a illustrates how payoff fluctuations may change the evolutionary dynamics and thereby transform one game into another game. Figure 2b shows how the arithmetic and the geometric average of the payoffs the two species receive deviate.

¹ Note that $\sigma_1$ and $\sigma_2$ depend on the stationary state $x_1$ and on the variance and covariance of the payoff values $M = [m_1, m_2, m_3, m_4]$. If $\sigma_1 = \sigma_2 = 0$, Eq. (5) reduces to $\langle P_1 \rangle = \langle P_2 \rangle$. For small fluctuations we can approximate them as $\sigma_1^2 \approx E[x_1]^2 \mathrm{Var}[m_1] + (1 - E[x_1])^2 \mathrm{Var}[m_2] + 2 (E[x_1] - E[x_1]^2) \mathrm{Cov}[m_1, m_2]$ and $\sigma_2^2 \approx E[x_1]^2 \mathrm{Var}[m_3] + (1 - E[x_1])^2 \mathrm{Var}[m_4] + 2 (E[x_1] - E[x_1]^2) \mathrm{Cov}[m_3, m_4]$.

Fig. 2 Fluctuations transform a Hawk–Dove game into a prisoner’s dilemma and cause “unfair” stable coexistence. (a) Shown is the anomalous stationary state (solid line: stable; dashed line: unstable) of the fraction of cooperators $x_1$ as a function of the noise intensity. Due to alternating payoff values, the stationary states consist of two periodic points (green and blue). With increasing intensity, the dynamical structure of a Hawk–Dove game first changes to a game without analog in traditional games (N.N.) and finally to a prisoner’s dilemma game. (b) The difference of the averaged payoffs received by the two players, corresponding to the stationary states of coexistence in (a). In the arithmetic mean the received payoffs are unfair. In the geometric mean they are equal, as predicted by Eq. (4)

3 Deterministic Payoff Fluctuations

We first consider deterministic payoff fluctuations under the replicator equation (Eq. (1)). To find the stationary state $x^*$ we solve Eq. (3). We assume that $M^{(t)}$ is a sequence with period $T$. Consequently, the stationary state $x^{*(t)}$ is periodic as well and $P(x, M) = \frac{1}{T} \sum_{t=0}^{T-1} \delta\big(x - x^{*(t)}\big)\, \delta\big(M - M^{(t)}\big)$. Equation (3) reduces to

$$\bar{r}_i(x^*, M) = \left( \prod_{t=t'}^{t'+T-1} r_i\big(x^{*(t)}, M^{(t)}\big) \right)^{1/T} = 1. \tag{6}$$


Note that Eq. (6) has only one free variable because, if one periodic point $x^{*(t')}$ is given, the others are determined by Eq. (1). As an illustrative example, assume an alternating payoff matrix $M^{(t)} = M + (-1)^t \sigma \tilde{M}$. Then $x^{*(t)} = x^* + (-1)^t \Delta x^*$ has the same form and can be found by solving Eq. (6), which reduces to

$$\bar{r}_i(x^*, M) = \sqrt{ r_i\big(x^{*(t)}, M^{(t)}\big) \cdot r_i\big(x^{*(t+1)}, M^{(t+1)}\big) } = 1. \tag{7}$$

Figure 2 shows the stationary states of a game with the payoff function

$$M^{(t)} = \begin{pmatrix} 1.1 & 0.8 \\ 2 & 0 \end{pmatrix} + (-1)^t \sigma \begin{pmatrix} -0.33 & 1 \\ 1 & 0 \end{pmatrix}. \tag{8}$$

For $\sigma = 0$ this is a Hawk–Dove game. For small $\sigma$, the stationary states predicted by Eq. (7) deviate only slightly from the ESS of the Hawk–Dove game. There is a first bifurcation at $\sigma \approx 4.07$, from one stable stationary state (solid curves) to two. At $\sigma \approx 6.4$ there is a second bifurcation where the first branch, the stable coexistence, disappears. The bifurcation behavior induces a pronounced hysteresis effect.

Ergodicity breaking causes anomalous payoff expectations for the players, as shown in Fig. 2b. The arithmetic mean of the payoff difference that the players receive also shows a pronounced hysteresis effect. For the geometric mean, as predicted by Eq. (4), this effect is absent.

More generally, fluctuations can even change the number, the positions, and the stability of stationary states, and the dynamics can be structurally very different from the dynamics of games with constant payoffs, as shown in Fig. 3. In Fig. 3a large fluctuations induce the onset of cooperation for the prisoner’s dilemma, as it is effectively transformed into a Hawk–Dove game with stable coexistence. Figures 3b, c, and d show how increasing fluctuations successively transform three other classical games either into different classical games or into games without classical analogs (denoted as “N.N.”).
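The statement that the geometric means of the payoffs coincide while the arithmetic means differ can be checked numerically. The sketch below iterates Eq. (1) with the alternating matrix of Eq. (8) for a small amplitude and then evaluates one full period. The values `sigma = 1.0` and `b = 10.0`, the burn-in length, and all function names are assumptions chosen for illustration; the background fitness used for Fig. 2 is not stated in this section.

```python
import math

M0 = [[1.1, 0.8], [2.0, 0.0]]          # constant part of Eq. (8)
M_TILDE = [[-0.33, 1.0], [1.0, 0.0]]   # fluctuating part of Eq. (8)


def payoff_matrix(t, sigma):
    """Alternating payoff matrix M^(t) = M + (-1)^t * sigma * M~."""
    sign = 1.0 if t % 2 == 0 else -1.0
    return [[M0[i][j] + sign * sigma * M_TILDE[i][j] for j in range(2)]
            for i in range(2)]


def payoffs(x, M, b):
    """Payoffs P_i = (M x)_i + b of the two strategies."""
    return [M[i][0] * x[0] + M[i][1] * x[1] + b for i in range(2)]


def simulate(sigma, b=10.0, x1=0.5, burn_in=5000):
    """Iterate Eq. (1), then record one full period (two steps)."""
    x = [x1, 1.0 - x1]
    record = []
    for t in range(burn_in + 2):
        P = payoffs(x, payoff_matrix(t, sigma), b)
        mean_P = x[0] * P[0] + x[1] * P[1]
        if t >= burn_in:
            record.append((x[0], P, mean_P))
        x = [x[i] * P[i] / mean_P for i in range(2)]
    return record


rec = simulate(sigma=1.0)
# Geometric mean of the growth factors over one period, Eq. (7).
geo_r = [math.sqrt(rec[0][1][i] / rec[0][2] * rec[1][1][i] / rec[1][2])
         for i in range(2)]
# Geometric and arithmetic time averages of the payoffs.
geo_P = [math.sqrt(rec[0][1][i] * rec[1][1][i]) for i in range(2)]
arith_P = [(rec[0][1][i] + rec[1][1][i]) / 2 for i in range(2)]

print("periodic points x1:", rec[0][0], rec[1][0])
print("geometric growth factors:", geo_r)
print("geometric payoff means:", geo_P)
print("arithmetic payoff means:", arith_P)
```

For this small amplitude the population settles on a period-2 orbit in the interior, so the product of the growth factors over one period equals one for both strategies.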

Fig. 3 Evolutionarily stable states with increasing fluctuation intensity. Stable and unstable states (solid and dashed lines) $x_1^*(\sigma)$ for games with alternating payoff fluctuations (blue and green are the two periodic points). The payoff matrices are $M^{(t)} = [3, 1, 4, 2] + (-1)^t \sigma [0, 0, 0, 1]$ in (a), $M^{(t)} = [4, 1, 3, 2] + (-1)^t \sigma [1, 0, 0, 1]$ in (b), $M^{(t)} = [2, 3, 4, 1] + (-1)^t \sigma [0, 1.3, 1.3, 0]$ in (c), and $M^{(t)} = [3, 2, 4, 1] + (-1)^t \sigma [-0.75, 1, -2, 1]$ in (d). In each example the background fitness is $b = 10$. The names of the games are identified using criteria described elsewhere [44]. For the same games but with stochastic instead of alternating noise, the background shows the average of three stationary distributions resulting from the initial distributions $\delta(x)$, $\delta(x - 0.5)$, and $\delta(x - 1)$

4 Discussion

Payoff noise in evolutionary dynamics is multiplicative and as such causes ergodicity breaking. The consequences have intricate effects on the coevolution of strategies. Depending on the details of the system, on the intensity of the fluctuations, and even on their covariance, ergodicity breaking shifts the payoffs out of equilibrium, shifts the stationary states, and thereby leads to fundamental structural changes of the dynamics. In evolutionary games with constant payoffs, the condition for stable coexistence is that all species have equal growth rates. With fluctuating payoffs this condition generalizes to equal time-averaged growth rates, which typically are different from ensemble averages in non-ergodic systems. When one naively replaces fluctuating payoffs with their average values, the ensemble averages of the growth rates are recovered, but these averages do not correctly predict the dynamics.

Games with fluctuating payoffs require a novel classification that cannot be based on payoff ranking schemes. We developed a classification that primarily considers the dynamical structure [44]. Our classification may be applied to evolutionary games where the payoff structure cannot be described by a simple payoff matrix, or when other modifications affect the dynamical structure. Examples include complex interactions of microbes such as cooperating and free-riding yeast cells, where the payoff is a nonlinear function of the densities [45].

Payoff fluctuations can cause two strategies that coexist in an evolutionarily stable state to receive different time-averaged payoffs. However, these “unfair” stable states are not mutationally stable. Mutations, in fact, would turn the “unfair” stable state into a meta-game, where the beneficiary aims to increase and the victim aims to escape the unfairness. Strategies of this meta-game could be tuning the adaptation or reproduction rate according to the environmental fluctuation [32]. Phenotypic plasticity [46] and bet-hedging [47] may reduce the necessity to adapt at all.

In general, the understanding of evolutionary games in fluctuating environments may be particularly relevant to understanding and controlling microbiological systems. Examples of evolutionary games in microbiology are diverse and include yeast cells [45], viruses [48], and bacteria [49–52]. Because many of these microbes evolve in natural and artificial environments that are fluctuating, the presented effects are relevant in biotechnology and healthcare. A stable coexistence of antibiotic-sensitive bacteria with antibiotic-degrading bacteria has been shown to be a stable state of a Hawk–Dove-like game [49, 52] if the antibiotic concentration is constantly above the concentration which the sensitive bacteria could tolerate alone. Our framework qualitatively describes the competitive interplay of bacterial strains in a fluctuating environment, for instance, in a patient who is given a daily dose of antibiotic instead of a continuous infusion.

Simple experimental settings can directly demonstrate the consequences of nonergodic anomalous long-term behavior in microbiological systems. The expected shifts and bifurcations in the stationary states of strategies for two strains, or species, competing for resources (and survival) suggest studying the (co)evolutionary dynamics for a fluctuating control parameter $c$ that, e.g., switches between two levels in a square-wave fashion, $c = [c_+, c_-, c_+, c_-, c_+, \ldots]$, where $c_+ = c + A$ and $c_- = c - A$. For increasing fluctuation amplitude $A$, the stationary state is expected to shift, or to change discontinuously, both as a result of ergodicity breaking. The strongest effect is expected for fluctuations on the same time scale as the reproduction period of the model organisms. However, quantitative predictions require much more specific model systems [53].

To conclude, caution is advised when predictions are based on averaged observables, in particular, averaged payoff structures. Our framework predicts anomalous stationary states as a generic result of ergodicity breaking in evolutionary dynamics that depend on the amplitude and covariance of the fluctuations.

References

1. Traulsen, A., Nowak, M.A.: Evolution of cooperation by multilevel selection. Proc. Natl. Acad. Sci. USA 103(29), 10952–10955 (2006)
2. Lehmann, L., Keller, L., West, S., Roze, D.: Group selection and kin selection: two concepts but one process. Proc. Natl. Acad. Sci. USA 104(16), 6736–6739 (2007)
3. Nowak, M.A.: Five rules for the evolution of cooperation. Science 314(5805), 1560–1563 (2006)
4. Nowak, M.A., May, R.M.: Evolutionary games and spatial chaos. Nature 359(6398), 826–829 (1992)
5. Van den Broeck, C., Parrondo, J.M.R., Toral, R., Kawai, R.: Nonequilibrium phase transitions induced by multiplicative noise. Phys. Rev. E 55(4), 4084–4094 (1997)
6. Toral, R.: Noise-induced transitions vs. noise-induced phase transitions. AIP Conf. Proc. 1332, 145–154 (2011)
7. Horsthemke, W., Lefever, R.: Noise-Induced Transitions: Theory and Applications in Physics, Chemistry, and Biology. Springer, Berlin (1984)
8. Gammaitoni, L., Hänggi, P., Jung, P., Marchesoni, F.: Stochastic resonance. Rev. Mod. Phys. 70(1), 223–287 (1998)
9. Leigh, E.G.: The average lifetime of a population in a varying environment. J. Theor. Biol. 90(2), 213–239 (1981)


10. Lande, R.: Risks of population extinction from demographic and environmental stochasticity and random catastrophes. Am. Nat. 142(6), 911–927 (1993)
11. Foley, P.: Predicting extinction times from environmental stochasticity and carrying capacity. Conserv. Biol. 8(1), 124–137 (1994)
12. Ovaskainen, O., Meerson, B.: Stochastic models of population extinction. Trends Ecol. Evol. 25(11), 643–652 (2010)
13. Schreiber, S.J.: Interactive effects of temporal correlations, spatial heterogeneity and dispersal on population persistence. Proc. Biol. Sci. 277(1689), 1907–1914 (2010)
14. Morales, L.M.: Viability in a pink environment: why “white noise” models can be dangerous. Ecol. Lett. 2(4), 228–232 (1999)
15. Heino, M., Ripa, J., Kaitala, V.: Extinction risk under coloured environmental noise. Ecography 23(2), 177–184 (2000)
16. Wilmers, C.C., Post, E., Hastings, A.: A perfect storm: the combined effects on population fluctuations of autocorrelated environmental noise, age structure, and density dependence. Am. Nat. 169(5), 673–683 (2007)
17. Schwager, M., Johst, K., Jeltsch, F.: Does red noise increase or decrease extinction risk? Single extreme events versus series of unfavorable conditions. Am. Nat. 167(6), 879–888 (2006)
18. Heino, M., Sabadell, M.: Influence of coloured noise on the extinction risk in structured population models. Biol. Conserv. 110(3), 315–325 (2003)
19. Ruokolainen, L., Lindén, A., Kaitala, V., Fowler, M.S.: Ecological and evolutionary dynamics under coloured environmental variation. Trends Ecol. Evol. 24(10), 555–563 (2009)
20. Greenman, J.V., Benton, T.G.: The impact of environmental fluctuations on structured discrete time population models: resonance, synchrony and threshold behaviour. Theor. Popul. Biol. 68(4), 217–235 (2005)
21. Kamenev, A., Meerson, B., Shklovskii, B.: How colored environmental noise affects population extinction. Phys. Rev. Lett. 101, 268103 (2008)
22. Nowak, M.A., Sasaki, A., Taylor, C., Fudenberg, D.: Emergence of cooperation and evolutionary stability in finite populations. Nature 428(6983), 646–650 (2004)
23. Traulsen, A., Nowak, M.A., Pacheco, J.M.: Stochastic dynamics of invasion and fixation. Phys. Rev. E 74(1), 011909 (2006)
24. Altrock, P.M., Traulsen, A.: Fixation times in evolutionary games under weak selection. New J. Phys. 11, 013012 (2009)
25. Assaf, M., Mobilia, M., Roberts, E.: Cooperation dilemma in finite populations under fluctuating environments. Phys. Rev. Lett. 111(23), 238101 (2013)
26. Ashcroft, P., Altrock, P.M., Galla, T.: Fixation in finite populations evolving in fluctuating environments. J. R. Soc. Interface 11(100), 20140663 (2014)
27. Houchmandzadeh, B.: Fluctuation driven fixation of cooperative behavior. Biosystems 127, 60–66 (2015)
28. Baron, J.W., Galla, T.: Sojourn times and fixation dynamics in multi-player games with fluctuating environments. arXiv, 1612.05530 [q-bio.PE] (2016)
29. Foster, D., Young, P.: Stochastic evolutionary game dynamics. Theor. Popul. Biol. 38(2), 219–232 (1990)
30. Fudenberg, D., Harris, C.: Evolutionary dynamics with aggregate shocks. J. Econ. Theory 57(2), 420–441 (1992)
31. Hofbauer, J., Imhof, L.A.: Time averages, recurrence and transience in the stochastic replicator dynamics. Ann. Appl. Probab. 19(4), 1347–1368 (2009)
32. Traulsen, A., Röhl, T., Schuster, H.G.: Stochastic gain in population dynamics. Phys. Rev. Lett. 93(2), 028701 (2004)
33. Houchmandzadeh, B., Vallade, M.: Selection for altruism through random drift in variable size populations. BMC Evol. Biol. 12, 61 (2012)
34. Huang, W., Hauert, C., Traulsen, A.: Stochastic game dynamics under demographic fluctuations. Proc. Natl. Acad. Sci. USA 112(29), 9064–9069 (2015)
35. Gokhale, C.S., Hauert, C.: Eco-evolutionary dynamics of social dilemmas. Theor. Popul. Biol. 111, 28–42 (2016)


36. Constable, G.W.A., Rogers, T., McKane, A.J., Tarnita, C.E.: Demographic noise can reverse the direction of deterministic selection. Proc. Natl. Acad. Sci. USA 113(32), E4745–E4754 (2016)
37. Taylor, C., Fudenberg, D., Sasaki, A., Nowak, M.A.: Evolutionary game dynamics in finite populations. Bull. Math. Biol. 66(6), 1621–1644 (2004)
38. Moran, P.A.P.: Random processes in genetics. Math. Proc. Cambridge Philos. Soc. 54, 60–71 (1958)
39. Lewontin, R.C., Cohen, D.: On population growth in a randomly varying environment. Proc. Natl. Acad. Sci. USA 62(4), 1056–1060 (1969)
40. Peters, O.: Optimal leverage from non-ergodicity. Quant. Finance 11(11), 1593–1602 (2011)
41. Peters, O., Klein, W.: Ergodicity breaking in geometric Brownian motion. Phys. Rev. Lett. 110, 100603 (2013)
42. Taylor, P.D., Jonker, L.B.: Evolutionary stable strategies and game dynamics. Math. Biosci. 40(1–2), 145–156 (1978)
43. Smith, J.M.: Evolution and the Theory of Games. Cambridge University Press, Cambridge (1982)
44. Stollmeier, F., Nagler, J.: Unfair and anomalous evolutionary dynamics from fluctuating payoffs. Phys. Rev. Lett. 120, 058101 (2018)
45. Gore, J., Youk, H., van Oudenaarden, A.: Snowdrift game dynamics and facultative cheating in yeast. Nature 459, 253–256 (2009)
46. Pigliucci, M.: Evolution of phenotypic plasticity: where are we going now? Trends Ecol. Evol. 20(9), 481–486 (2005)
47. Bergstrom, T.C.: On the evolution of hoarding, risk-taking, and wealth distribution in nonhuman and human populations. Proc. Natl. Acad. Sci. USA 111, 10860–10867 (2014)
48. Turner, P.E., Chao, L.: Escape from prisoner’s dilemma in RNA phage Φ6. Am. Nat. 161(3), 497–505 (2003)
49. Yurtsev, E.A., Chao, H.X., Datta, M.S., Artemova, T., Gore, J.: Bacterial cheating drives the population dynamics of cooperative antibiotic resistance plasmids. Mol. Syst. Biol. 9, 683 (2013)
50. Kirkup, B.C., Riley, M.A.: Antibiotic-mediated antagonism leads to a bacterial game of rock-paper-scissors in vivo. Nature 428(6981), 412–414 (2004)
51. Griffin, A.S., West, S.A., Buckling, A.: Cooperation and competition in pathogenic bacteria. Nature 430(7003), 1024–1027 (2004)
52. Dugatkin, L.A., Perlin, M., Scott Lucas, J., Atlas, R.: Group-beneficial traits, frequency-dependent selection and genotypic diversity: an antibiotic resistance paradigm. Proc. R. Soc. Lond. Ser. B. 272(1558), 79–83 (2005)
53. de Vos, M.G.J., Zagorski, M., McNally, A., Bollenbach, T.: Interaction networks, ecological stability, and collective antibiotic tolerance in polymicrobial infections. Proc. Natl. Acad. Sci. USA 114(40), 10666–10671 (2017)

Combining Path-Constrained Random Walks to Recover Link Weights in Heterogeneous Information Networks

Hong-Lan Botterman and Robin Lamarche-Perrin

Abstract Heterogeneous information networks (HINs) are abstract representations of systems composed of multiple types of entities and their relations. Given a pair of nodes in a HIN, this work aims at recovering the exact weight of the link incident to these two nodes, knowing some other links present in the HIN. This weight is approximated by a linear combination of probabilities obtained from path-constrained random walks performed on the HIN, i.e., random walks where the walker is forced to follow only a specific sequence of node types and edge types, commonly called a meta path. This method is general enough to compute the link weight between any types of nodes. Experiments on Twitter data show the applicability of the method.

1 Introduction

Networked entities are ubiquitous in real-world applications. Examples of such entities are humans in social or communication activities and proteins in biochemical interactions. Heterogeneous information networks (HINs), abstract representations of systems composed of multiple types of entities and their relations, are good candidates to model such entities together with their relations, since they can effectively fuse a huge quantity of information and contain rich semantics in nodes and links. In the last decade, heterogeneous information network analysis has attracted growing interest, and many novel data mining tasks have been designed for such networks, such as similarity search, link prediction, clustering, and classification, to name just a few.

H.-L. Botterman
Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6, Paris, France

R. Lamarche-Perrin
CNRS, Institut des Systèmes Complexes de Paris Île-de-France, ISC-PIF, Paris, France

© Springer Nature Switzerland AG 2019
S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_8


The goal of this work is to recover, for a given pair of nodes in a weighted HIN, the actual weight of the link incident to these two nodes, knowing some other links present in the HIN. Trying to capture not only the presence of a link but also its actual weight can be useful, for instance, in recommendation systems, where the weight can be taken as the “rating” a user would give to an item. Another application would be the detection of disease-gene candidates thanks to the prediction of protein–protein interactions.

This problem can be related to the node similarity problem, since similar nodes tend to be connected. Indeed, the similarity score between two nodes, the result of a particular function of these two nodes, can be seen as the strength of their connection and hence as the weight of the link connecting them. Here, the particular function is related to a random walk on the graph. In HINs, most of the similarity scores [6, 9] are based on the concept of meta path, roughly defined as a concatenation of node types linked by corresponding link types. The type of a node or link is basically a label in the abstract representation. Meta paths can be used as a constraint on a classic random walk: the walker is allowed to take only paths satisfying a particular meta path. These path-constrained random walks are able to take into account explicitly the different semantics present in HINs.

Back to our goal, the target weight is approximated by a linear combination of probabilities obtained from path-constrained random walks performed on the HIN. The proposed method aims at finding a relevant set of meta paths and the best possible coefficients such that the difference between the exact link weight and its approximation is minimized.

The rest of this paper is organized as follows: In Sect. 2, we recall basic concepts about HINs and present the problem statement. Section 3 explains our method, and we apply it to Twitter data related to the Football World Cup 2014 in Sect. 4. We review some related work in Sect. 5 and, finally, we conclude and give some perspectives in Sect. 6.

2 Preliminary Concepts

In this section, we recall some basic concepts of weighted HINs that are useful in the following and define the “weight recovering” problem. Figure 1 illustrates this section.

Fig. 1 (Left) Example of a HIN composed of five node types, represented by diverse shapes, and multiple link types. (Middle) Its associated network schema composed of five nodes and twenty links. (Right) Illustration of the problem statement. For each pair of nodes in (filled square, filled diamond), there is possibly a path connecting them. The link weight is approximated by a linear combination of the path-constrained random walk results

Definition 1 (Weighted Directed Multigraph) A weighted directed multigraph is a 5-tuple $G := (V, E, w, \mu_s, \mu_t)$ with $V$ the node set, $E$ the link set, $w : E \to \mathbb{R}$ the function that assigns each link a real weight, $\mu_s : E \to V$ the function that assigns each link a source node, and $\mu_t : E \to V$ the function that assigns each link a target node.

Definition 2 (Heterogeneous Information Network) A HIN $H := (G, \mathcal{V}, \mathcal{E}, \phi, \psi)$ is a weighted directed multigraph $G$ along with $\mathcal{V}$ the node type set, $\mathcal{E}$ the link type set, $\phi : V \to \mathcal{V}$ the function that assigns a node type to each node, and $\psi : E \to \mathcal{E}$ the function that assigns a link type to each link such that, if two links belong to the same link type, the two links share the same starting and target node type, i.e., $\forall\, e_1, e_2 \in E$, $\psi(e_1) = \psi(e_2) \Rightarrow \big( \phi(\mu_s(e_1)) = \phi(\mu_s(e_2)) \wedge \phi(\mu_t(e_1)) = \phi(\mu_t(e_2)) \big)$.

Understanding the node types and link types in a complex HIN is not always easy; thus, it is sometimes necessary to provide the meta level (i.e., schema-level) description of the network. Therefore, the concept of network schema is proposed to describe the meta structure of a network.

Definition 3 (HIN Schema) Let $H$ be a HIN. The schema $T_H$ for $H$ is a directed graph defined on the node types $\mathcal{V}$ and link types $\mathcal{E}$, i.e., $T_H := (\mathcal{V}, \mathcal{E}, \nu_s, \nu_t)$ with $\nu_s : \mathcal{E} \to \mathcal{V} : E^* \mapsto \nu_s(E^*) := \phi(\mu_s(e))$ the function that assigns each link a source node and $\nu_t : \mathcal{E} \to \mathcal{V} : E^* \mapsto \nu_t(E^*) := \phi(\mu_t(e))$ the function that assigns each link a target node, where $e \in \psi^{-1}(E^*)$ and $\psi^{-1}$ is the pseudo-inverse of $\psi$ defined by $\psi^{-1} : \mathcal{E} \to 2^E : E^* \mapsto \{ e \in E \mid \psi(e) = E^* \}$.

We can indeed take any element $e \in \psi^{-1}(E^*)$ since $\{ e \in E \mid \psi(e) = E^* \}$ is the equivalence class of any of its elements, with the equivalence relation “has the same type as.” By the definition of HINs, it is sufficient to take one member of the equivalence class to know the node types that the link type connects.

Two entities in a HIN can be linked via different paths, and these paths have different semantics. These paths can be described by meta paths as follows.

Definition 4 (Meta Path) A meta path $\mathcal{P}$ of length $n - 1 \in \mathbb{N}$ is a sequence of node types $V_1, \cdots, V_n \in \mathcal{V}$ linked by link types $E_1, \cdots, E_{n-1} \in \mathcal{E}$ as follows:

$$\mathcal{P} = V_1 \xrightarrow{E_1} V_2 \cdots V_{n-1} \xrightarrow{E_{n-1}} V_n,$$

which can also be denoted as $\mathcal{P} = E_1 E_2 \cdots E_{n-1}$.

Given a meta path $\mathcal{P} = V_1 \xrightarrow{E_1} V_2 \cdots V_{n-1} \xrightarrow{E_{n-1}} V_n$ and a path $P = v_1 \xrightarrow{e_1} v_2 \cdots v_{n-1} \xrightarrow{e_{n-1}} v_n$, if $\forall\, i \in \{1, \ldots, n\}$, $\phi(v_i) = V_i$, and $\forall\, i \in \{1, \ldots, n-1\}$, $\mu_s(e_i) = v_i$, $\mu_t(e_i) = v_{i+1}$, and $\psi(e_i) = E_i$, then the path $P$ satisfies the meta path $\mathcal{P}$ and we note $P \in \mathcal{P}$. Hence, a meta path is a set of paths.
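The condition for a path to satisfy a meta path can be written as a small check. The sketch below assumes a dictionary-based encoding of $\phi$, $\psi$, $\mu_s$, and $\mu_t$; the node names, link names, and the function `satisfies_meta_path` are hypothetical and serve only as an illustration of the definition.

```python
def satisfies_meta_path(nodes, links, meta_nodes, meta_links, phi, psi, mu_s, mu_t):
    """Return True if the path (nodes, links) satisfies the meta path.

    nodes: [v_1, ..., v_n], links: [e_1, ..., e_{n-1}]
    meta_nodes: [V_1, ..., V_n], meta_links: [E_1, ..., E_{n-1}]
    phi / psi map nodes / links to their types; mu_s / mu_t map links to
    their source / target node.
    """
    if len(nodes) != len(meta_nodes) or len(links) != len(meta_links):
        return False
    if len(links) != len(nodes) - 1:
        return False
    # Every node must carry the node type required at its position.
    if any(phi[v] != V for v, V in zip(nodes, meta_nodes)):
        return False
    # Every link must join consecutive nodes and carry the required link type.
    for i, (e, E) in enumerate(zip(links, meta_links)):
        if mu_s[e] != nodes[i] or mu_t[e] != nodes[i + 1] or psi[e] != E:
            return False
    return True


# Toy HIN: two users, one hashtag, three typed links (all names are made up).
phi = {"u1": "user", "u2": "user", "h1": "hashtag"}
psi = {"e1": "RT", "e2": "UH", "e3": "MT"}
mu_s = {"e1": "u1", "e2": "u2", "e3": "u1"}
mu_t = {"e1": "u2", "e2": "h1", "e3": "u2"}

meta_nodes = ["user", "user", "hashtag"]
meta_links = ["RT", "UH"]

ok = satisfies_meta_path(["u1", "u2", "h1"], ["e1", "e2"],
                         meta_nodes, meta_links, phi, psi, mu_s, mu_t)
wrong_type = satisfies_meta_path(["u1", "u2", "h1"], ["e3", "e2"],
                                 meta_nodes, meta_links, phi, psi, mu_s, mu_t)
print(ok, wrong_type)
```

The second call fails only because the first link is a mention rather than a retweet, which shows that the link types, not just the node types, constrain the path.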


Problem 1 (Combination of Meta Paths) Define a HIN as $H = (G, \mathcal{V}, \mathcal{E}, \phi, \psi)$, with $G = (V, E, w, \mu_s, \mu_t)$ a directed weighted multigraph, and a target meta path $E_c$ between two node types. The problem is to find a set of relevant meta paths $\mathcal{E}_{\mathcal{P}}$ and a linear function $F$ of (functions that themselves depend on) these meta paths that best quantifies, for each pair of nodes in $H$, the strength of their connection via $E_c$.

3 Method

We present our method for solving Problem 1 in three steps. Without loss of generality, $\mathcal{V} = \{V_1, \cdots, V_m\}$ and $\mathcal{E} = \{E_1, \cdots, E_r\}$, and we note $E_c$ the target meta path defined between $V_1$ and $V_n$. We consider a meta path

$$\mathcal{P} = V_1 \xrightarrow{E_{j_1}} V_{i_2} \cdots V_{i_{n-1}} \xrightarrow{E_{j_{n-1}}} V_n$$

different from $E_c$, where $i_2, \cdots, i_{n-1}$ and $j_1, \cdots, j_{n-1}$ are indices that can take integer values between 1 and $m$ and between 1 and $r$, respectively.

3.1 Path-Constrained Random Walk

Given $v_n \in V_n$ and $v_1 \in V_1$, the probability of reaching $v_n$ from $v_1$ following the meta path $\mathcal{P}$ is simply defined by the random walk starting at $v_1$ and ending at $v_n$ following only paths satisfying $\mathcal{P}$. Formally,

$$\begin{aligned}
P\big((v_n | v_1) \mid \mathcal{P}\big) &= \sum_{v_{n-1} \in V_{i_{n-1}}} P\big((v_{n-1} | v_1) \mid \mathcal{P}^{1, i_{n-1}}\big)\, P\big((v_n | v_{n-1}) \mid \mathcal{P}^{i_{n-1}, n}\big) \\
&= \sum_{v_{n-1} \in V_{i_{n-1}}} P\big((v_{n-1} | v_1) \mid \mathcal{P}^{1, i_{n-1}}\big)\, \frac{w_{E_{j_{n-1}}}(v_{n-1}, v_n)}{\sum_k w_{E_{j_{n-1}}}(v_{n-1}, v_k)}
\end{aligned} \tag{1}$$

with $\mathcal{P} =: \mathcal{P}^{1,n}$, $\mathcal{P}^{a,b}$ the truncated meta path of $\mathcal{P}$ from node type $V_a$ to $V_b$, $V_a, V_b \in \mathcal{V}$, $w_{E_i}(v_j, v_k)$ the weight of the link of type $E_i$ between nodes $v_j$ and $v_k$, and $P\big((v_2 | v_1) \mid \mathcal{P}^{1, i_2}\big) = w_{E_{j_1}}(v_1, v_2) / \sum_k w_{E_{j_1}}(v_1, v_k)$ the basis of the recurrence. Furthermore, we forbid the walker to return to the initial node on the penultimate step of the walk, i.e., if $V_{i_{n-1}} = V_1$, the sum in Eq. (1) only runs over all $v_{n-1} \neq v_1$. This prevents us from using what we are looking for to find what we are looking for.
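The recurrence in Eq. (1) can be evaluated by propagating a probability distribution along the link types of the meta path. The following is a minimal sketch under stated assumptions: the nested-dictionary layout `weights[E][u][v]`, the toy weights, and the function name `pcrw` are hypothetical, and the rule about the penultimate node is implemented by dropping the probability mass that sits on the start node before the last step.

```python
def pcrw(start, meta_links, weights):
    """Distribution P((v_n | start) | meta path) following Eq. (1).

    meta_links: list of link types [E_j1, ..., E_j(n-1)].
    weights[E][u][v]: weight of the link of type E from u to v.
    Returns a dict mapping each reachable end node to its probability.
    """
    dist = {start: 1.0}
    last = len(meta_links) - 1
    for step, link_type in enumerate(meta_links):
        nxt = {}
        for u, p in dist.items():
            # Penultimate node must differ from the start node.
            if step == last and step > 0 and u == start:
                continue
            out = weights[link_type].get(u, {})
            total = sum(out.values())
            if total == 0:
                continue  # dead end: no outgoing link of this type
            for v, w in out.items():
                # Transition probability is the row-normalized link weight.
                nxt[v] = nxt.get(v, 0.0) + p * w / total
        dist = nxt
    return dist


# Toy weights (made up): retweets between users, hashtags posted by users.
weights = {
    "RT": {"u1": {"u2": 2, "u3": 1}, "u2": {"u1": 1, "u3": 1}},
    "UH": {"u1": {"h2": 1}, "u2": {"h1": 1, "h2": 1}, "u3": {"h1": 3}},
}

two_step = pcrw("u1", ["RT", "UH"], weights)
three_step = pcrw("u1", ["RT", "RT", "UH"], weights)
print(two_step)
print(three_step)
```

In the three-step walk, the mass that returns to `u1` after two retweet steps is discarded, so the hashtag posted by `u1` itself does not contribute to the result.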


3.2 Linear Regression Model

Since $H$ is a HIN, multiple types of links can connect the nodes. Hence, there is no reason to restrict ourselves to a single meta path to compute the reachability of one node from another. As a result, the similarity between $v_n$ and $v_1$ is defined by several path-constrained random walk results combined through a linear regression model of the form

$$F\big((v_n | v_1) \mid \mathcal{E}_{\mathcal{P}}\big) = \beta_0 + \sum_{\mathcal{P} \in \mathcal{E}_{\mathcal{P}}} \beta_{\mathcal{P}}\, P\big((v_n | v_1) \mid \mathcal{P}\big),$$

where $\mathcal{E}_{\mathcal{P}}$ is the set of relevant meta paths and $\beta := [\beta_0, \beta_1, \cdots, \beta_{|\mathcal{E}_{\mathcal{P}}|}]^T$ is the vector of real-valued coefficients. The coefficients stress the contribution of each meta path to the final similarity score, i.e., our approximation $F\big((v_n | v_1) \mid \mathcal{E}_{\mathcal{P}}\big)$ of the exact link weight $w_{E_c}(v_1, v_n)$. The choice of a linear model is simply motivated by its interpretability in our particular case. Since the components of $\beta$ are not confined to $[0, 1]$ and do not sum to 1, $F$ is a real-valued function whose image is not confined to $[0, 1]$ either. Given example node pairs and their link weights, $\beta$ is estimated by the least squares method, which is appreciated for its applicability and simplicity.
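For reference, the least squares estimate can be written in closed form. The notation below is introduced only for this sketch: $N$ example node pairs indexed by $q$, a design matrix $X$ whose row $q$ contains a 1 followed by the path-constrained random walk probabilities of pair $q$ for each meta path in $\mathcal{E}_{\mathcal{P}}$, and a vector $y$ of the corresponding exact link weights $w_{E_c}$.

```latex
\begin{align*}
X_{q\cdot} &= \big[\, 1,\; P(\text{pair } q \mid \mathcal{P}_1),\; \dots,\;
              P(\text{pair } q \mid \mathcal{P}_{|\mathcal{E}_{\mathcal{P}}|}) \,\big],
              \qquad y_q = w_{E_c}(\text{pair } q), \\
\hat{\beta} &= \arg\min_{\beta} \; \lVert y - X \beta \rVert_2^2
             = (X^T X)^{-1} X^T y
             \quad \text{if } X \text{ has full column rank.}
\end{align*}
```

When two meta paths yield strongly correlated columns of $X$, the matrix $X^T X$ becomes ill-conditioned and the individual coefficients are poorly determined.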

3.3 Forward Selection Procedure

In order to determine the set $\mathcal{E}_{\mathcal{P}}$, we use forward selection with p-value and $r^2$ criteria. This is a greedy approach, but it is very simple and intuitive. The p-values are used to test the significance of each predictor. Given the hypothesis $H_0 : \beta = 0$ against the hypothesis $H_1 : \beta \neq 0$, the p-value $p$ is the probability, under $H_0$, of obtaining a statistic at least as extreme as the value observed on the sample. We reject the hypothesis $H_0$ at the level $\alpha$ in favor of $H_1$ if $p \le \alpha$. Otherwise, we do not reject $H_0$. In contrast, the $r^2$ score is used to test the quality of the entire model. It is the proportion of the variance in the dependent variable that is predictable from the predictors. So, given $k$ predictors or explanatory variables, which are the meta paths, the forward selection procedure works as follows:

• Start with a null model, i.e., no predictor but only an intercept. Typically, this is the average of the dependent variable;
• Try $k$ linear regression models and choose the one which gives the best model with respect to the criterion. In our case, the one that maximizes the coefficient of determination $r^2$; and
• Search among the remaining variables for the one that, added to the model, gives the best result, i.e., the highest $r^2$ such that all the variables in the model are significant, i.e., their p-values are below the chosen threshold. Iterate this step until no further improvement.
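The procedure above can be sketched with the standard library alone. This is an illustration under several assumptions, not the authors' implementation: the significance check is applied at every step, the p-values use a normal approximation to the Student t distribution (a t distribution would be more appropriate for small samples), and the data, predictor labels, and function names are synthetic and hypothetical.

```python
import math


def solve(A, y):
    """Solve the square system A z = y by Gauss-Jordan elimination."""
    n = len(A)
    aug = [list(A[i]) + [y[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols(columns, y):
    """Least squares with intercept; returns (beta, r2, slope p-values)."""
    n, k = len(y), len(columns) + 1
    rows = [[1.0] + [c[i] for c in columns] for i in range(n)]
    XtX = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(XtX, Xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    rss = sum(e * e for e in resid)
    mean_y = sum(y) / n
    r2 = 1.0 - rss / sum((yi - mean_y) ** 2 for yi in y)
    s2 = rss / (n - k)  # residual variance estimate
    p_values = []
    for j in range(1, k):
        unit = [1.0 if a == j else 0.0 for a in range(k)]
        var_j = s2 * solve(XtX, unit)[j]  # j-th diagonal entry of s2 (X^T X)^-1
        z = abs(beta[j]) / math.sqrt(var_j)
        p_values.append(math.erfc(z / math.sqrt(2)))  # two-sided, normal approx.
    return beta, r2, p_values


def forward_selection(predictors, y, alpha=0.05):
    """Greedy selection on r2, keeping only models with all slopes significant."""
    selected, best_r2 = [], 0.0  # the intercept-only model has r2 = 0
    remaining = set(predictors)
    while remaining:
        best = None
        for name in sorted(remaining):
            cols = [predictors[m] for m in selected + [name]]
            _, r2, p_values = ols(cols, y)
            significant = all(p <= alpha for p in p_values)
            if significant and r2 > best_r2 and (best is None or r2 > best[1]):
                best = (name, r2)
        if best is None:
            break  # no admissible improvement: stop
        selected.append(best[0])
        best_r2 = best[1]
        remaining.remove(best[0])
    return selected, best_r2


# Synthetic data: y depends on the first two predictors only.
a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
b = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.5]
c = [0.3, 0.7, 0.2, 0.9, 0.4, 0.1, 0.8, 0.5, 0.6, 0.0]
noise = [0.01, -0.01, 0.02, -0.02, 0.0, 0.01, -0.01, 0.02, -0.02, 0.0]
y = [0.05 + 1.0 * ai + 0.5 * bi + ei for ai, bi, ei in zip(a, b, noise)]
predictors = {"RT-UH": a, "MT-UH": b, "RP-UH": c}

selected, r2 = forward_selection(predictors, y)
print(selected, r2)
```

In this synthetic example the two predictors that actually generate `y` are picked first, in decreasing order of their individual explanatory power.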


4 Experiments

We present the dataset on which we test the proposed method as well as the construction of the resulting graphs. Then, we report our results concerning different tests, namely, the importance of meta path length, a description task, and finally a recovery task.

4.1 Dataset Description and Setup

The data we use is a set of tweets collected from Twitter during the 2014 Football World Cup, which ran from June 12 to July 13, 2014. Twitter allows multiple kinds of interactions between its users. Here, we consider retweet (RT), reply (RP), and mention (MT) actions, plus the posting of hashtags (UH). Based on these actions, we construct a HIN with node types V = {users, hashtags} and edge types E = {RT, RP, MT, UH}, as illustrated in Fig. 2. Each node represents a user or a hashtag. We create a link from u1 to u2 if u1 retweets, replies to, or mentions u2, and the weight of the link corresponds to the number of times u1 performs that action toward u2 during the whole World Cup. For the user–hashtag graph, a link exists between u and h if h appears in one of u’s posts, and the weight of the link corresponds to the number of times u posts h during the whole World Cup.

Fig. 2 Illustration of the graph construction representing the Twitter interactions (panels: Reply, Retweet, Mention, Users-Hashtags). The underlying HIN is such that V = {users, hashtags} and E = {RT, RP, MT, UH}

The RT graph is composed of 6069 nodes and 19495 links, the RP graph is composed of 8560 nodes and 11782 links, and the MT graph is composed of 11782 nodes and 60506 links. The Pearson coefficient between the stochastic matrices is 0.1776, 0.6783, and 0.4286 for RT/RP, RT/MT, and RP/MT, respectively. Thus, the retweet and mention relationships are clearly correlated, which may cause problems for the proposed method, as we shall see, since the least squares method is well known to be sensitive to correlated predictors. Since the data is related to the World Cup, the most used hashtags of the bipartite user–hashtag graph are those referring to the 32 countries involved in the final phase as well as those referring directly to the event (#WorldCup2014, #Brazil, #Brasil2014, #CM2014, etc.). The semi-finalists have the greatest in-strength.
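The graph construction above can be sketched in a few lines. This is a minimal illustration on hypothetical toy events, not the authors' pipeline: the function names are invented for the example, and computing the Pearson coefficient over all ordered user pairs (including absent links as zeros) is an assumption about how the matrices were compared.

```python
from collections import Counter
from math import sqrt

def weighted_edges(events):
    """Count how many times each (source, target) action occurs."""
    return Counter(events)

def row_stochastic(weights):
    """Divide each link weight by the total out-weight of its source node."""
    out_strength = Counter()
    for (src, _), w in weights.items():
        out_strength[src] += w
    return {(src, dst): w / out_strength[src] for (src, dst), w in weights.items()}

def pearson(p, q, nodes):
    """Pearson coefficient between two matrices flattened over all ordered node pairs."""
    pairs = [(a, b) for a in nodes for b in nodes]
    x = [p.get(pair, 0.0) for pair in pairs]
    y = [q.get(pair, 0.0) for pair in pairs]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy events: each tuple is one action from the first user toward the second.
retweets = [("u1", "u2"), ("u1", "u2"), ("u1", "u3"), ("u2", "u3")]
mentions = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u1")]

P_rt = row_stochastic(weighted_edges(retweets))
P_mt = row_stochastic(weighted_edges(mentions))
users = ["u1", "u2", "u3"]

print(P_rt[("u1", "u2")])                  # share of u1's retweets that go to u2
print(round(pearson(P_rt, P_mt, users), 4))
```

Row normalization is what turns raw interaction counts into the transition probabilities that a path-constrained random walk needs.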

4.2 Results

We apply the proposed method to determine whether the hashtags posted by users (UH) can be explained by other relations (RT, RP, MT, and their combinations). For instance, given a user u, explaining UH by RT-UH and MT-RP-UH means that the hashtags posted by u are, to some extent, a combination of those posted by the users retweeted by u and those posted by the users who received a reply from users mentioned by u. In other words, we try to understand whether, in the case of the 2014 Football World Cup, the probability that users post hashtags can be explained by the relations these users have with other users and the probability that the latter post specific words.
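To make a predictor such as RT-UH concrete, the sketch below composes two one-step transition tables into the probability of reaching a hashtag from a user along that meta path. The weights and helper names are hypothetical, and the nested-dictionary representation is only one convenient choice.

```python
def normalize_rows(weights):
    """Turn {source: {target: weight}} into one-step transition probabilities."""
    probs = {}
    for src, row in weights.items():
        total = sum(row.values())
        if total > 0:
            probs[src] = {dst: w / total for dst, w in row.items()}
    return probs

def compose(first, second):
    """Probability of going x -> y with `first`, then y -> z with `second`."""
    result = {}
    for x, row in first.items():
        reached = {}
        for y, p_xy in row.items():
            for z, p_yz in second.get(y, {}).items():
                reached[z] = reached.get(z, 0.0) + p_xy * p_yz
        result[x] = reached
    return result

# Hypothetical toy weights: u1 retweets u2 three times and u3 once.
RT = normalize_rows({"u1": {"u2": 3, "u3": 1}})
UH = normalize_rows({"u2": {"#h1": 2, "#h2": 2}, "u3": {"#h2": 5}})

RT_UH = compose(RT, UH)   # meta path RT-UH
print(RT_UH["u1"])        # {'#h1': 0.375, '#h2': 0.625}
```

Longer meta paths such as MT-RP-UH follow by calling `compose` repeatedly, for example `compose(compose(MT, RP), UH)`.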

4.2.1 Meta Paths of Length 2

We test linear regression models with all the possible combinations of variables of length 2 (see Table 1). This test gives a first glimpse of the contribution of the simplest predictors. First, the more predictors, the higher the r². Nevertheless, this does not mean that all variables are significant. Indeed, the analysis of the coefficients and p-values reveals the correlation of some variables.

Table 1 Coefficients and p-values for linear regressions whose variables correspond to meta paths of length 2

Mod.  Var.     Coef.        p-value  r²
0     Average  1.8704e-05   –        0.2992
1     RT-UH    0.6273       –        0.3594
2     RP-UH    0.4291       –        0.2289
3     MT-UH    1.0289       –        0.4606
4     RT-UH    0.5795       0.0062   0.6116
      RP-UH    0.3957       0.0105
5     RT-UH    −0.3578      0.0612   0.5943
      MT-UH    1.4534       0.0087
6     RP-UH    0.0051       0.0138   0.6111
      MT-UH    0.9391       0.0057
7     RT-UH    −0.1283      0.0791   0.6818
      RP-UH    0.0791       0.0113
      MT-UH    1.1466       0.0111

Model 0 corresponds to the null model: no predictor but one intercept that is the average of the explained variable

In models 5 and 7, the RT-UH coefficient is negative with a p-value greater than 0.05, a consequence of the correlation with the MT-UH variable. In summary, and according to Table 1, the best model in which every predictor is significant would be model 4, whose predictors are RT-UH and RP-UH. This means that, for a given user, the hashtags she posts can be explained by the hashtags posted by the users she retweets, with a contribution of 0.5795, and by the users she replies to, with a contribution of 0.3957. This model accounts for 61.16% of the variance.
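The exhaustive comparison behind Table 1 can be reproduced in spirit with a small sketch that fits every subset of three predictors by least squares and reports r². Everything here is illustrative: the numbers are hypothetical, the regression is fitted without an intercept (an assumption), and p-values are left out for brevity.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit(columns, y):
    """Least squares without intercept, via the normal equations: (coefficients, r2)."""
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    beta = solve(gram, rhs)
    fitted = [sum(b * col[i] for b, col in zip(beta, columns)) for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1 - ss_res / ss_tot

# Hypothetical flattened PCRW probabilities, one value per (user, hashtag) pair.
predictors = {
    "RT-UH": [0.5, 0.1, 0.0, 0.4, 0.2, 0.8],
    "RP-UH": [0.1, 0.6, 0.3, 0.0, 0.7, 0.2],
    "MT-UH": [0.4, 0.2, 0.1, 0.5, 0.3, 0.7],
}
# Toy target built so that it depends only on RT-UH and RP-UH.
target = [0.6 * a + 0.4 * b for a, b in zip(predictors["RT-UH"], predictors["RP-UH"])]

results = {}
for size in (1, 2, 3):
    for names in combinations(predictors, size):
        beta, r2 = fit([predictors[n] for n in names], target)
        results[names] = (beta, r2)
        print(names, [round(b, 4) for b in beta], round(r2, 4))
```

On this toy target the pair (RT-UH, RP-UH) recovers the generating coefficients, while adding MT-UH cannot lower the residual, which mirrors the remark that more predictors always raise r² without necessarily being significant.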

4.2.2 Importance of Meta Path Length

This subsection looks at the length of the meta paths for a given link type. More specifically, we compute, for each link type, the r² score when the only predictor is associated with a random walk of length l = 1, ..., 10 in that link type. Intuitively, the importance of a meta path decreases with its length, since longer meta paths cover more extended neighborhoods and the information is therefore more diffuse. This is corroborated by the left panel of Fig. 3. Each link type brings a different quantity of information, and the MT type is the most informative for our purpose. In addition, this test exposes a characteristic of the reply dynamics: most of the time, replies involve only two people [7]. This is reflected in the oscillations of the reply scores. The scores associated with odd-length random walks are low since the walker is forbidden to return to the initial node on the penultimate step of the walk. We also draw in black the scores obtained when we do not differentiate the link types, i.e., when all the link weights between two nodes are aggregated. This score is below the average score of the three link types. One can see that taking only the mention or retweet type is more informative than the aggregation.

Fig. 3 (Left) Linear regression r² scores with one meta path according to its length. (Right) Linear regression r² scores according to the number of meta paths of the same link type

The right panel of Fig. 3 shows r² scores when we combine, in the model, variables of different lengths related to the same link type. More precisely, the score at abscissa l corresponds to the model whose predictors are meta paths of length smaller than or equal to l + 1 and whose first l steps are in the same type of links. Again, the more variables, the better the score. The increase is not linear, however; the largest improvement happens when we combine length-1 and length-2 variables. We can also observe that the scores given by the RT and MT types are very similar when considering more than two variables, whereas there is a clear difference in the r² score for a single variable. Once again, the score for the aggregation is shown and is far below the other scores. This indicates that it is potentially interesting to distinguish the types of links. In summary, these tests tend to show that considering meta paths that are too long or too numerous is not necessarily useful in our case.
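The oscillation of the reply scores can be illustrated with a toy reply graph in which two users only reply to each other. The sketch below tracks where a walker starting at u1 sits after each step. It is a plain random walk on hypothetical data and does not model the additional no-return constraint mentioned in the text.

```python
def step(dist, transitions):
    """Advance a probability distribution over nodes by one random-walk step."""
    nxt = {}
    for node, mass in dist.items():
        for dst, p in transitions.get(node, {}).items():
            nxt[dst] = nxt.get(dst, 0.0) + mass * p
    return nxt

# Hypothetical reply graph: u1 and u2 only reply to each other; u3 replies to u1.
RP = {"u1": {"u2": 1.0}, "u2": {"u1": 1.0}, "u3": {"u1": 1.0}}

dist = {"u1": 1.0}      # the walker starts at u1
positions = []          # probability of being back at u1 after each step
for length in range(1, 7):
    dist = step(dist, RP)
    positions.append(dist.get("u1", 0.0))
    print(length, dist)
```

In a two-person conversation the walker alternates between the two participants, so which hashtags the walk "sees" depends on the parity of its length.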

4.2.3 Forward Linear Regression for Description

We apply the proposed algorithm to the entire dataset with a threshold α = 0.05 for p-values. The number of meta paths grows exponentially with the length and, since the length is unbounded, the set of possible meta paths is infinite. Here, the k potential predictors are those of length less than or equal to 4. This choice is motivated by the test performed in the previous subsection. In addition, the semantics of longer paths are less clear than those of shorter paths. Results are reported in Table 2. The final model contains five predictors, related to meta paths of length at most 3, and no intercept. This regression model accounts for 71.29% of the variance.

Table 2 Results for the forward stepwise linear regression

Mod.  Var.       Coef.        p-value  r²
0     Average    1.8704e-05   –        0.2992
1     MT-UH      1.0289       0.0057   0.4606
2     MT-UH      0.9391       0.0137   0.6112
      RP-UH      0.0052       0.0062
3     MT-UH      0.8464       0.0124   0.6682
      RP-UH      0.0335       0.0138
      RT-RP-UH   0.1077       0.0063
4     MT-UH      0.8114       0.0109   0.6947
      RP-UH      0.0362       0.0142
      RT-RP-UH   0.0766       0.0143
      RP-MT-UH   0.0676       0.0094
5     MT-UH      0.1974       0.0146   0.7129
      RP-UH      0.5556       0.0125
      RT-RP-UH   0.0650       0.0160
      RP-MT-UH   0.1591       0.0124
      MT-RT-UH   0.0074

To support the goodness of fit of the model, we plot in Fig. 4 the density, in log–log scale, of the observed values versus the estimated values. The green line represents the ideal case where estimated values match observed ones. Most of the data points fall close to this line, which indicates that a linear model is a good choice.

Fig. 4 Density plot of observed versus estimated values for model 5. The green line represents the perfect matching between observed and estimated data

The best improvement comes with the addition of the second variable. The model with two predictors is actually a local extremum (see Table 1). This points to two weaknesses of the method: there is no guarantee of finding the best model, and the order in which the variables are selected matters. Note that the first two variables belong to the most direct relationships (meta paths of length 2), which is intuitive: the direct neighborhood of a user thus created shares common topics with her. The last meta path included in the model causes a large change in the other coefficients. This suggests either that this meta path is correlated with other meta paths already present in the model or that outliers are present. The ordinary least squares method is well known to be sensitive to both.
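A forward stepwise selection with a p-value threshold can be sketched as follows. This is one common variant and not necessarily the exact rule of the proposed algorithm: at each step it adds the candidate that gives the highest r² among those for which every coefficient stays below α, and it uses a normal approximation for the p-values so that only the standard library is needed. The data are hypothetical and chosen with disjoint supports so that the result can be checked by hand.

```python
from math import erfc, sqrt

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(columns, y):
    """No-intercept least squares: (coefficients, two-sided p-values, r2)."""
    k, n = len(columns), len(y)
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    beta = solve(gram, [sum(u * v for u, v in zip(ci, y)) for ci in columns])
    fitted = [sum(beta[j] * columns[j][i] for j in range(k)) for i in range(n)]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / n
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    sigma2 = ss_res / (n - k)
    p_values = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        variance = sigma2 * solve(gram, unit)[j]  # j-th diagonal of sigma^2 (X'X)^-1
        if variance <= 0.0:
            p_values.append(0.0)
        else:
            z = abs(beta[j]) / sqrt(variance)
            p_values.append(erfc(z / sqrt(2)))    # normal approximation (assumption)
    return beta, p_values, 1 - ss_res / ss_tot

def forward_select(candidates, y, alpha=0.05):
    """Greedily add predictors while all p-values of the enlarged model stay below alpha."""
    selected = []
    while True:
        best = None
        for name in candidates:
            if name in selected:
                continue
            trial = selected + [name]
            _, p_values, r2 = ols([candidates[n] for n in trial], y)
            if max(p_values) < alpha and (best is None or r2 > best[1]):
                best = (name, r2)
        if best is None:
            return selected
        selected.append(best[0])

# Hypothetical predictors and target.
candidates = {
    "MT-UH":    [1, 1, 1, 1, 0, 0, 0, 0],
    "RP-UH":    [0, 0, 0, 0, 1, 1, 0, 0],
    "RT-RP-UH": [0, 0, 0, 0, 0, 0, 1, 1],
}
y = [0.9, 0.7, 0.9, 0.7, 0.4, 0.2, 0.1, 0.0]

selected = forward_select(candidates, y)
beta, p_values, r2 = ols([candidates[n] for n in selected], y)
print(selected, [round(b, 4) for b in beta], round(r2, 4))
```

Because the procedure is greedy, the order in which variables enter determines which models are ever examined, which is the weakness noted above.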

4.2.4 Forward Linear Regression for Recovery Task

We validate the method with a task that aims to recover the weights of missing links. In other words, this part tries to answer the question: is it possible to know, quantitatively, how some people post hashtags, knowing how other people behave? To do so, we select 80% of the users and train the algorithm on them to obtain the vector β. Then, we apply it to the remaining 20% and compute the r² associated with each model. Since the selection is random, we generate ten training sets. The final models do not include the same variables as before; not surprisingly, they depend on the 80% selected. The number of predictors is five or six.

Fig. 5 Boxplot of the r² scores of training sets and test sets. The training set scores increase with the number of predictors in the model, while for the test sets the scores seem to reach a threshold

Nevertheless, whatever the training set, the meta path MT-UH is always the first predictor to be selected. After that, there is no consensus on the second variable, but RP-UH and RT-RP-UH always compete for second place. Again, it is not surprising to obtain the RP-UH variable since, for a user, it is related to one of the closest neighborhoods with respect to our graph construction. Although the r² scores of the final models reach, on average, 0.7 for the training samples, we only get, on average, a score of 0.5 for the test sets (Fig. 5). One also observes that a model that fits the training set better does not necessarily give the best recovery. Indeed, it is sometimes better to consider a model with fewer regressors, and thus a lower r² on the training set, to obtain a better recovery.
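The recovery protocol (train on 80% of the users, score on the remaining 20%, repeat ten times) can be sketched as below. For brevity the model has a single predictor and no intercept, which is a simplification of the multi-predictor models in the text, and the data are synthetic.

```python
import random

def fit_slope(xs, ys):
    """Least squares slope for y = beta * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def r2_score(xs, ys, beta):
    """Coefficient of determination of the fitted line on the given points."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def flatten(data_by_user, users):
    """Collect the (predictor, target) pairs of the given users into two lists."""
    xs = [x for u in users for x, _ in data_by_user[u]]
    ys = [y for u in users for _, y in data_by_user[u]]
    return xs, ys

def holdout_scores(data_by_user, train_fraction=0.8, repeats=10, seed=0):
    """Return a list of (train r2, test r2), one pair per random user split."""
    rng = random.Random(seed)
    users = sorted(data_by_user)
    scores = []
    for _ in range(repeats):
        shuffled = users[:]
        rng.shuffle(shuffled)
        cut = int(train_fraction * len(shuffled))
        train_x, train_y = flatten(data_by_user, shuffled[:cut])
        test_x, test_y = flatten(data_by_user, shuffled[cut:])
        beta = fit_slope(train_x, train_y)   # fitted on training users only
        scores.append((r2_score(train_x, train_y, beta),
                       r2_score(test_x, test_y, beta)))
    return scores

# Synthetic data (hypothetical): 20 users, 5 (predictor, target) pairs each.
data_rng = random.Random(1)
data_by_user = {}
for user in range(20):
    xs = [data_rng.random() for _ in range(5)]
    data_by_user[user] = [(x, 0.8 * x + data_rng.gauss(0, 0.05)) for x in xs]

scores = holdout_scores(data_by_user)
for train_r2, test_r2 in scores:
    print(round(train_r2, 3), round(test_r2, 3))
```

Splitting by user rather than by individual link keeps all links of a held-out user unseen during training, which matches the question of describing some people from the behavior of others.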

5 Related Work

As previously explained, our work is based on node similarity measures. Several recent measures tackle the problem of node similarity in HINs, taking into account not only the structural similarity of two entities but also the meta paths connecting them. Among these measures, PathCount (PC, [9]) and path-constrained random walk (PCRW, [6]) are the two most basic and have given rise to several extensions [1, 2, 5, 12]. The methods related to PC are based on the count of paths between a pair of nodes, given a meta path. PathSim [10] measures the similarity between two objects of the same type along a symmetric meta path, which is restrictive since many valuable paths are asymmetric and the relatedness between entities of different types is also informative. Two measures based on it [3, 4] incorporate more information such as node degrees and transitivity. However, all these methods have the drawback of favoring highly connected objects since they deal with raw data. The methods related to PCRW are based on random walks and thus on the probability of reaching a node from another one, given a meta path. Considering a random walk implies a normalization and, depending on the data, offers better results. An adaptation, HeteSim [8], measures the meeting probability between two walkers starting from opposite extremities of a path, given a meta path. However, this method requires the decomposition of atomic relations, which is very costly for large graphs. To address this issue, AvgSim [11] computes the similarity between two nodes using random walks conditioned by a meta path and its inverse. It is, however, mostly suited to undirected networks since, in these cases, it is just as sensible to walk a path in one direction as in the other. In these cited works, when the similarity scores are used for link prediction/detection, the scores are ranked and the presence of links is then inferred from this ranking. Some works also try to combine meta paths, but the target values to recover are binary; the networks are unweighted. In contrast with these works, we place ourselves in the general framework of directed and weighted HINs. We do not use rankings but directly take as link weights the similarity measures obtained by means of an adequate combination of PCRWs. This makes it possible to perform not only description tasks but also, to some extent, recovery tasks.
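The difference between count-based and walk-based measures can be seen on a tiny example. The sketch below computes, for the same hypothetical weights, a PathCount-style score (sum of products of raw weights along the meta path) and a PCRW-style score (the same product after row normalization). The helper names are invented for the illustration.

```python
def normalize_rows(weights):
    """Turn {source: {target: weight}} into one-step transition probabilities."""
    return {src: {dst: w / sum(row.values()) for dst, w in row.items()}
            for src, row in weights.items() if sum(row.values()) > 0}

def compose(first, second):
    """Sum over intermediate nodes y of first[x][y] * second[y][z]."""
    result = {}
    for x, row in first.items():
        reached = {}
        for y, w_xy in row.items():
            for z, w_yz in second.get(y, {}).items():
                reached[z] = reached.get(z, 0.0) + w_xy * w_yz
        result[x] = reached
    return result

# Hypothetical weights: u1 retweets a very active "hub" and a "friend" once each.
RT = {"u1": {"hub": 1, "friend": 1}}
UH = {"hub": {"#a": 100}, "friend": {"#b": 1}}

path_count = compose(RT, UH)                              # raw counts
pcrw = compose(normalize_rows(RT), normalize_rows(UH))    # random-walk probabilities

print(path_count["u1"])   # {'#a': 100.0, '#b': 1.0}
print(pcrw["u1"])         # {'#a': 0.5, '#b': 0.5}
```

The raw count is dominated by the highly active node, whereas the normalized walk weighs both neighbors equally, which is the bias toward highly connected objects described above.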

6 Conclusion and Perspectives

We have considered a linear combination of path-constrained random walks to try to explain, to some extent, a specific meta path in a HIN. The proposed method makes it possible to express the weight of a link between two nodes knowing the other links in the graph. This could be useful for prediction or recommendation tasks. In particular, we have shown on our dataset that the hashtags posted by a specific user are mainly related to those posted by her direct neighborhood, especially the MT and RP neighborhoods. The method has also shown that the RT relation is not really useful for our purpose. Nevertheless, the main drawback of the method is its sensitivity to outliers. Hence, more robust least squares alternatives could be envisaged, such as least trimmed squares or parametric alternatives. Furthermore, we have provided all the meta paths whose length is no longer than four. Even if it is motivated by previous tests, this threshold is clearly data-dependent and relies on the knowledge of the user. Hence, it could be interesting to build a method able to find relevant meta paths by itself. Finally, all data have been aggregated in time. Since it is possible to extract the time stamps of tweets, future work could integrate time by defining a random walk process on a temporal graph or by counting temporal paths (plus normalization). This would restrict the possibilities of the walker and may improve the quality of the model.


Acknowledgements This work is funded in part by the European Commission H2020 FETPROACT 2016-2017 program under grant 732942 (ODYCCEUS), by the ANR (French National Agency of Research) under grant ANR-15- E38-0001 (AlgoDiv), and by the Île-de-France Region and its program FUI21 under grant 16010629 (iTRAC).

References

1. Fang, Y., Lin, W., Zheng, V.W., Wu, M., Chang, K.C., Li, X.: Semantic proximity search on graphs with metagraph-based learning. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 277–288 (2016)
2. Gupta, M., Kumar, P., Bhasker, B.: DPRel: a meta-path based relevance measure for mining heterogeneous networks. Inf. Syst. Front. (2017). https://doi.org/10.1007/s10796-017-9811-x
3. He, J., Bailey, J., Zhang, R.: Exploiting transitive similarity and temporal dynamics for similarity search in heterogeneous information networks. In: International Conference on Database Systems for Advanced Applications, DASFAA (2014)
4. Hou U.L., Yao, K., Mak, H.F.: Pathsimext: revisiting pathsim in heterogeneous information networks. In: Li, F., Li, G., Hwang, S.-W., Yao, B., Zhang, Z. (eds.) Web-Age Information Management, pp. 38–42. Springer, Cham (2014)
5. Huang, Z., Zheng, Z., Cheng, R., Sun, Y., Mamoulis, N., Li, X.: Meta structure: computing relevance in large heterogeneous information networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16, pp. 1595–1604. ACM, New York (2016)
6. Lao, N., Cohen, W.W.: Relational retrieval using a combination of path-constrained random walks. Mach. Learn. 81(1), 53–67 (2010)
7. Macskassy, S.A.: On the study of social interactions in Twitter. In: Sixth International AAAI Conference on Weblogs and Social Media, ICWSM (2012)
8. Shi, C., Kong, X., Huang, Y., Yu, P.S., Wu, B.: Hetesim: a general framework for relevance measure in heterogeneous networks. IEEE Trans. Knowl. Data Eng. 26(10), 2479–2492 (2014)
9. Sun, Y., Barber, R., Gupta, M., Aggarwal, C.C., Han, J.: Co-author relationship prediction in heterogeneous bibliographic networks. In: Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining, ASONAM’11, pp. 121–128. IEEE Computer Society, Washington (2011)
10. Sun, Y., Han, J., Yan, X., Yu, P.S., Wu, T.: Pathsim: meta path-based top-k similarity search in heterogeneous information networks. Proc. VLDB Endowment 4(11), 992–1003 (2011)
11. Xiao, D., Meng, X., Li, Y., Shi, C., Wu, B.: AVGSIM: relevance measurement on massive data in heterogeneous networks. J. Theor. Appl. Inf. Technol. 84(1), 101–110 (2016)
12. Zhou, Y., Huang, J., Sun, H., Sun, Y.: Recurrent meta-structure for robust similarity measure in heterogeneous information networks. ArXiv e-prints (2017)

Part IV

Applications

The Structure and Evolution of an Offline Peer-to-Peer Financial Network Pantelis Loupos and Alexandros Nathan

Abstract

In this work, we investigate the structure and evolution of Venmo, one of the most popular peer-to-peer (P2P) mobile payment applications. A unique aspect of the network under consideration is that the edges among nodes represent financial transactions between individuals who shared an offline social interaction. We present a series of static and dynamic measurements that summarize the key aspects of any social network, namely the degree distribution, density, and connectivity. We find that the degree distributions do not follow a power-law distribution. With regard to density and connectivity, Venmo’s network exhibits densification and resilience, which indicate a high level of engagement from its users. Last, we examine the “topological” version of the small-world hypothesis, and find that Venmo users are separated by a mean of 5.9 steps and a median of 6 steps.

1 Introduction

Over the last two decades, there has been a surge of research on complex networks. The availability of high-quality network data, as a result of the rise of the internet and information technology, has facilitated the computational analysis of social networks. Examples of networks that have been studied include scientific co-authorship networks, online social networks, and employee communication networks [5, 8, 24]. While there have been several large-scale studies that focus on the evolution of online social networks [14, 18, 20], offline social network data have been scarce and small in scale.

P. Loupos
Graduate School of Management, University of California, Davis, Davis, CA, USA

A. Nathan
McCormick School of Engineering, Northwestern University, Evanston, IL, USA
e-mail: [email protected]

© Springer Nature Switzerland AG 2019
S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_9


In this work, we study the structural evolution of Venmo, the most popular peer-to-peer (P2P) mobile payment application, especially among millennials.¹ Venmo allows peers to send money to each other for a variety of social activities, such as splitting the bill after a group outing at a restaurant or sharing a cab.² Capturing these face-to-face, offline activities at scale is what makes Venmo’s social network so remarkable: we are given the unique opportunity to analyze the interactions of individuals who are in close geographic proximity and engage in shared experiences. Our investigation is organized into two parts. In the first part, we present a series of static measurements that characterize the degree distributions of Venmo. Since Venmo is a directed social network, we look at three degree sequences: (a) degree (undirected), (b) in-degree, and (c) out-degree. We find that none of the degree sequences follows a power-law distribution, confirming the results of a 2018 study showing that real-world social networks are rarely scale-free [7]. In the second part, we present a series of dynamic measurements that characterize the structural evolution of Venmo’s network. These include network density, clustering coefficient, component structure, and degrees of separation. Our results indicate an unusually high level of density and connectedness; the key finding here is evidence in support of Milgram’s “six degrees of separation” hypothesis. To the best of our knowledge, this is the first structural analysis of such a large-scale P2P financial network. Given the scarcity of large-scale offline networks, our main contributions are the documentation of Venmo’s structural properties and their comparison against other well-studied online social networks [2, 20, 21], as well as offline financial ones [25].

2 Data Overview

This work uses data from Venmo, a P2P mobile payment service owned by PayPal. Founded in 2009, Venmo operates in the United States and has succeeded in transforming financial transactions into sharing experiences. What makes Venmo stand out against its competitors is its social nature. Upon logging into the application, users gain access to a Facebook-like news feed, which is composed of public transactions. The individual who initiates the transaction is required to accompany the post with a description of what the money was used for, while the dollar amount is left out for privacy reasons. While users may opt to hide their transactions by adjusting their privacy settings, according to Dan Schulman, CEO of PayPal, “90% of transactions are shared”.³ For the purposes of our analysis, we used Venmo’s API [17] to collect the entire transaction history of 1,765,776 users, which corresponds to approximately 120 million transactions. We gathered the complete financial history of these accounts in order to map the evolution of Venmo’s network from its inception in 2009 up to late 2016. As our goal is to explore human interactions over a financial network, we exclude from our analysis Venmo accounts that belong to either businesses or charity organizations; this leaves us with 1,748,119 distinct users.

¹ https://bit.ly/2qhCRvG
² https://lendedu.com/blog/venmo
³ http://fortune.com/2017/11/17/dan-schulman-paypal-venmo/

3 Results and Discussion

In this section, we present some key metrics that characterize the overall properties of a network. First, we introduce some notation: let G_t := (V_t, E_t) be a dynamic social network, where V_t is the set of nodes that have joined the network by time t, and E_t consists of all edges that have been formed by time t. We treat time as a sequence of discrete monthly intervals. In what follows, we present results for the case where nodes and edges can only be added to the network, that is, for t_2 ≥ t_1, we have V_{t_1} ⊆ V_{t_2} and E_{t_1} ⊆ E_{t_2}.
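The cumulative, add-only snapshots G_t can be built directly from a timestamped transaction list. The sketch below is illustrative only: the `(month, sender, receiver)` record layout and the function name are assumptions, not the format of the Venmo data.

```python
def cumulative_snapshots(transactions):
    """Map each month t to (V_t, E_t), the nodes and directed edges seen up to t."""
    by_month = {}
    for month, sender, receiver in transactions:
        by_month.setdefault(month, []).append((sender, receiver))
    nodes, edges = set(), set()
    snapshots = {}
    for month in sorted(by_month):
        for sender, receiver in by_month[month]:
            nodes.update((sender, receiver))
            edges.add((sender, receiver))
        # Freeze copies so that later months do not alter earlier snapshots.
        snapshots[month] = (frozenset(nodes), frozenset(edges))
    return snapshots

# Hypothetical transactions: (month index, sender, receiver).
transactions = [(1, "a", "b"), (1, "b", "c"), (2, "a", "c"), (3, "d", "a"), (3, "a", "b")]
snapshots = cumulative_snapshots(transactions)

for month, (nodes, edges) in snapshots.items():
    print(month, len(nodes), len(edges))
```

Because nothing is ever removed, each snapshot contains the previous one, which is exactly the monotonicity condition stated above.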

3.1 Degree Distribution

The degree distribution of a network is one of its most important properties. Given that Venmo’s transactions are directed in nature, we look at three degree sequences: (a) degree (undirected), (b) in-degree, and (c) out-degree. The degree distributions are computed for a static snapshot of the network, when it has reached its steady state (month 74). Figure 1 shows the three degree plots. Following the approach in [7, 8], we fit power laws and test their plausibility. We find the p-value to be ≈0, ≈0, and ≈0.01 for the degree, out-degree, and in-degree, respectively. Therefore, we can safely reject the scale-free hypothesis.

Fig. 1 Degree, In-Degree, and Out-Degree distributions. We can safely reject the scale-free hypothesis as the p-value is ≈0, ≈0, and ≈0.01 for the degree, out-degree, and in-degree, respectively. (a) Degree. (b) In-Degree. (c) Out-Degree

While a number of studies have provided evidence that power-law distributions are common across social networks [1, 3, 4, 6, 12, 15, 16, 19, 24, 26], a recent comprehensive study found that scale-free structures are rare in real-world networks [7]: out of 1000 networks, only 4% exhibited a scale-free structure. While preferential attachment has been one of the most popular frameworks for explaining power-law distributions, the growing evidence of the rarity of scale-free structures reinforces the need for new network formation mechanisms.
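The three degree sequences can be derived from a directed edge list as sketched below. Counting distinct counterparties (rather than the number of transactions) and taking the undirected degree as the number of distinct neighbors in either direction are assumptions made for the illustration. The power-law fitting and plausibility test of [7, 8] are not reproduced here.

```python
from collections import defaultdict

def degree_sequences(edges):
    """Return (degree, in_degree, out_degree) dictionaries for a directed edge list."""
    senders = defaultdict(set)     # node -> distinct nodes that paid it
    receivers = defaultdict(set)   # node -> distinct nodes it paid
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        receivers[src].add(dst)
        senders[dst].add(src)
    out_degree = {n: len(receivers[n]) for n in nodes}
    in_degree = {n: len(senders[n]) for n in nodes}
    # Undirected degree: distinct neighbors regardless of payment direction.
    degree = {n: len(receivers[n] | senders[n]) for n in nodes}
    return degree, in_degree, out_degree

# Hypothetical payments: (sender, receiver).
edges = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "a")]
degree, in_degree, out_degree = degree_sequences(edges)
print(degree["a"], in_degree["a"], out_degree["a"])   # 3 2 2
```

Note that a reciprocated pair such as a and b contributes one to the undirected degree of each node but one to both the in-degree and the out-degree.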

3.2 Density and Connectivity

3.2.1 Density and Densification

We first focus our attention on network density. Density is defined as the ratio of undirected edges to nodes. As can be seen in Fig. 2a, density increases monotonically over time, which implies that the creation of new links outpaces user growth. This stands in direct contrast with the findings of [20] (Yahoo! 360 and Flickr), who observe three different stages: an initial upward trend, followed by a dip, and finally a gradual steady increase. We believe that the observed monotonicity is due to two main reasons: first, Venmo has a quick time to transaction for new sign-ups, meaning that users typically join Venmo with the goal of completing a transaction, and second, Venmo’s users become highly engaged with the application.

Fig. 2 In the DPL plot the dashed line is the diagonal with slope 1. The best fitted line to our data has a slope of α = 1.19. (a) Density over time. (b) DPL plot

Densification, which was first described in [22], provides an alternative way to investigate the same phenomenon. In a network that exhibits densification over time, the average degree also increases. An easy way to show this is via the Densification Power Law (DPL) plot (see Fig. 2b). The DPL plot is the log-log plot of the number of edges |E_t| against the number of nodes |V_t| at several snapshots t. We subsequently calculate the slope α of the line that best fits our data. A value of α > 1 implies that the average degree increases over time; for Venmo’s network, we have α = 1.19. This is consistent with the four networks that were examined in [22]; more specifically, the value of α = 1.19 is very close to the value of 1.15 that was found for the affiliation graph of co-authors in arXiv.
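Both quantities are simple to compute from per-snapshot counts. The sketch below takes density as edges divided by nodes, as defined above, and estimates the DPL exponent as the ordinary least squares slope of log(edges) against log(nodes). The counts are synthetic, generated so that the exponent is 1.2 by construction, and fitting by least squares on the logs is an assumption about how the best-fit line was obtained.

```python
from math import log

def density(num_nodes, num_edges):
    """Ratio of edges to nodes in one snapshot."""
    return num_edges / num_nodes

def loglog_slope(xs, ys):
    """Least squares slope of log(y) against log(x)."""
    lx = [log(x) for x in xs]
    ly = [log(y) for y in ys]
    mean_x, mean_y = sum(lx) / len(lx), sum(ly) / len(ly)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic snapshot sizes (hypothetical): edges grow as nodes ** 1.2.
node_counts = [100, 1_000, 10_000, 100_000]
edge_counts = [n ** 1.2 for n in node_counts]

densities = [density(n, e) for n, e in zip(node_counts, edge_counts)]
alpha = loglog_slope(node_counts, edge_counts)
print([round(d, 3) for d in densities])
print(round(alpha, 3))
```

With an exponent above 1 the edge-to-node ratio grows with network size, which is why a rising density curve and α > 1 describe the same phenomenon.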

3.2.2 Clustering Coefficient

The clustering coefficient measures the extent to which an individual’s friends know each other. As expected, a high clustering coefficient implies a large proportion of triangles (triads) in the network. An alternative interpretation involves the notion of resilience: the clustering coefficient expresses the extent to which the neighbors of node i will reach out to each other if node i is removed from the network. Following the formulation in [28], let a_ij be the element of the adjacency matrix indicating the existence or absence of an (undirected) edge between nodes i and j, let k_i denote the degree of node i, and let n be the number of nodes. The average clustering coefficient is defined as:

$$ C^{avg} = \frac{1}{n} \sum_{i} \frac{\sum_{j,k} a_{ij} a_{jk} a_{ki}}{k_i (k_i - 1)/2} \qquad (1) $$

Figure 3 shows the evolution of the clustering coefficient over time. We see two distinct phases: a sharp increase, followed by a plateau around 0.2. This is in contrast with other networks, such as Google+, which shows three phases (decrease, increase, and decrease again) [14].

Fig. 3 Clustering coefficient averaged across all nodes; it increases over time

Fig. 4 Sub-figure (a) shows the fraction of users that belong to the giant component over time. Sub-figure (b) shows the overall number of components over time

A more relevant comparison involves Ripple [25], one of the most well-known financial networks with a presence in the cryptocurrency space. As it turns out, the clustering coefficient of Ripple fluctuates between 0.07 and 0.13, which suggests that its network is less resilient than that of Venmo. This could imply that studying the resilience of P2P financial networks could be beneficial for devising robust, decentralized financial networks.
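The average clustering coefficient of Eq. (1) can be computed from adjacency sets as sketched below. Two conventions are assumed here and may differ from the authors' implementation: each pair of neighbors is counted once (unordered pairs j < k), and nodes with fewer than two neighbors contribute zero to the average.

```python
from itertools import combinations

def average_clustering(adjacency):
    """Mean over nodes of (links among neighbors) / (possible links among neighbors)."""
    total = 0.0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        if k < 2:
            continue  # no pair of neighbors: local coefficient taken as 0
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
        total += links / (k * (k - 1) / 2)
    return total / len(adjacency)

# Hypothetical undirected graph: a triangle a-b-c plus a pendant node d attached to a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_avg = average_clustering(adjacency)
print(round(c_avg, 4))   # 0.5833
```

Here a has three neighbors of which only one pair is linked (1/3), b and c each sit in a closed triangle (1), and d has a single neighbor (0), giving (1/3 + 1 + 1 + 0) / 4.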

3.2.3 Component Structure

We investigate here the existence of a giant component, a common feature of most networks [10], as well as the migration patterns of the smaller components. As illustrated in Fig. 4a, Venmo’s transaction network exhibits a giant component that eventually contains over 99% of all nodes of the network. Figure 4b shows how the number of components changes over time. Interestingly, the number of components keeps increasing over time, only to decrease sharply towards the end; the tipping point is around month 60. Note also that the distribution of the component size undergoes a significant transition following month 60. Figure 5 shows the before and after distributions. While the majority of components at month 60 are of size 2, at month 79 only the larger components survive, with the distribution exhibiting a heavier tail. It is noteworthy that this behavior is consistent with the random graph model introduced by Erdős and Rényi [11].
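The number of components and the share of nodes in the giant component can be obtained with a union-find structure over the undirected edges of a snapshot. The sketch below is a generic illustration on hypothetical edges, not the authors' code.

```python
from collections import Counter

def component_sizes(edges):
    """Sizes of the connected components of an undirected edge list, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return sorted(Counter(find(n) for n in list(parent)).values(), reverse=True)

# Hypothetical snapshot: one chain of four users and two isolated pairs.
edges = [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8)]
sizes = component_sizes(edges)
giant_fraction = sizes[0] / sum(sizes)

print(len(sizes), sizes, giant_fraction)   # 3 [4, 2, 2] 0.5
```

Running this on successive cumulative snapshots would give the two curves of Fig. 4, and the list of sizes at a given month gives the component size distribution.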

Fig. 5 Sub-figure (a) shows the component size distribution at time 60, and sub-figure (b) shows the same plot at time 79

3.2.4 Degrees of Separation

Last, we turn our attention to the “topological” [13] version of the small-world hypothesis, which has long attracted the interest of social scientists. To run our analysis, we focus only on the giant component and use R’s igraph package [9] to calculate all shortest paths between two randomly chosen nodes. We repeat this for 2000 random pairs and average the results.⁴ Due to the computational intensity of the problem, we run our calculations every 6 months.

⁴ We experimented with different values and found that after 2000 our results do not change.

Fig. 6 Mean and median shortest path distance over time

As can be seen in Fig. 6, the average distance follows three distinct phases. Initially, we have a plateau until month 21. Then, we have a sharp increase, reaching its maximum value at month 33, and finally the average distance slowly converges to its final value of 5.90, with a median of 6. Milgram and Travers, in their monumental experiments [23, 27], claimed that the number of degrees of separation between people is six. We should point out here, however, that their results correspond to the “algorithmic” version of the small-world hypothesis, which provides an upper bound on the average distance. Over the years, researchers studying a variety of networks have come up with a range of results: Goel et al. [13] (message chain experiments) and Leskovec and Horvitz [21] (Microsoft Messenger) found the median shortest path to be 7, whereas an investigation of the Facebook social graph [2] found the average degrees of separation to be 4.5. It is noteworthy that among these important works, our research is the only one that is in line with Milgram’s results.
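The sampling procedure can be sketched with a breadth-first search on random node pairs. The authors used R’s igraph package; the Python version below is only an illustration of the same idea on a hypothetical ring graph, with a fixed seed and a smaller number of pairs.

```python
import random
from collections import deque
from statistics import mean, median

def bfs_distance(adjacency, source, target):
    """Length of a shortest path from source to target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def sampled_distances(adjacency, num_pairs, seed=0):
    """Shortest path lengths for randomly chosen pairs of distinct nodes."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    distances = []
    for _ in range(num_pairs):
        source, target = rng.sample(nodes, 2)
        distances.append(bfs_distance(adjacency, source, target))
    return distances

# Hypothetical connected graph: a ring of 10 users.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}

distances = sampled_distances(ring, num_pairs=200)
print(mean(distances), median(distances))
```

Restricting the sample to one connected component, as done in the text with the giant component, guarantees that every sampled pair has a finite distance.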

4 Conclusion

In this work, we investigate the evolution of the structural properties of Venmo. To the best of our knowledge, Venmo is the largest dataset of P2P financial activity ever to be analyzed. Its unique aspect of reflecting offline shared social activities among individuals makes documenting its properties worthwhile. We compare the attributes of Venmo’s network with those of other well-known networks and assess their differences. Our first main finding is the absence of a power-law distribution, which is contrary to the popular belief that scale-free structures are prevalent in real-world networks. Second, we find that, unlike online social networks such as Flickr, Yahoo! 360, or Google+, Venmo exhibits high levels of density and connectivity, which translate to high engagement and customer retention in business settings. Finally, our research sheds some more light on the well-known small-world phenomenon. Contrary to previous work on large-scale networks, we provide evidence that supports Milgram’s original six degrees of separation hypothesis.

References 1. Abello, J., Buchsbaum, A.L., Westbrook, J.R.: A functional approach to external graph algorithms. In: European Symposium on Algorithms, pp. 332–343. Springer, Berlin (1998) 2. Backstrom, L., Boldi, P., Rosa, M., Ugander, J., Vigna, S.: Four degrees of separation. In: Proceedings of the 4th Annual ACM Web Science Conference, pp. 33–42. ACM, New York (2012)

The Structure and Evolution of an Offline Peer-to-Peer Financial Network

121

3. Barabási, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
4. Barabási, A.-L., Albert, R., Jeong, H.: Scale-free characteristics of random networks: the topology of the world-wide web. Physica A Stat. Mech. Appl. 281(1), 69–77 (2000)
5. Barabási, A.-L., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A Stat. Mech. Appl. 311(3–4), 590–614 (2002)
6. Broder, A., Kumar, R., Maghoul, F., Raghavan, S., Rajagopalan, P., Stata, R., Tomkins, A., Wiener, J.: Graph structure in the web: experiments and models. In: 9th World Wide Web Conference (2000)
7. Broido, A.D., Clauset, A.: Scale-free networks are rare. arXiv preprint arXiv:1801.03400 (2018)
8. Clauset, A., Shalizi, C.R., Newman, M.E.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)
9. Csardi, G., Nepusz, T.: The igraph software package for complex network research. InterJ. Complex Syst. 1695(5), 1–9 (2006)
10. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, Cambridge (2010)
11. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5(1), 17–60 (1960)
12. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. In: ACM SIGCOMM Computer Communication Review, vol. 29, pp. 251–262. ACM, New York (1999)
13. Goel, S., Muhamad, R., Watts, D.: Social search in small-world experiments. In: Proceedings of the 18th International Conference on World Wide Web, pp. 701–710. ACM, New York (2009)
14. Gong, N.Z., Xu, W., Huang, L., Mittal, P., Stefanov, E., Sekar, V., Song, D.: Evolution of social-attribute networks: measurements, modeling, and implications using Google+. In: Proceedings of the 2012 ACM Conference on Internet Measurement Conference, pp. 131–144. ACM, New York (2012)
15. Huberman, B.A., Adamic, L.A.: Internet: growth dynamics of the world-wide web. Nature 401(6749), 131 (1999)
16. Kleinberg, J.M., Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.S.: The web as a graph: measurements, models, and methods. In: International Computing and Combinatorics Conference, pp. 1–17. Springer, Berlin (1999)
17. Kraft, B., Mannes, E., Moldow, J.: Security Research of a Social Payment App (2014). https://courses.csail.mit.edu/6.857/2014/files/13-benkraftjmoldow-mannes-venmo.pdf
18. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: Structure and evolution of blogspace. Commun. ACM 47(12), 35–39 (2004)
19. Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A.: Trawling the web for emerging cybercommunities. Comput. Netw. 31(11–16), 1481–1493 (1999)
20. Kumar, R., Novak, J., Tomkins, A.: Structure and evolution of online social networks. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’06, pp. 611–617. ACM, New York (2006)
21. Leskovec, J., Horvitz, E.: Planetary-scale views on a large instant-messaging network. In: Proceedings of the 17th International Conference on World Wide Web, pp. 915–924. ACM, New York (2008)
22. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 177–187. ACM, New York (2005)
23. Milgram, S.: The small-world problem. Psychol. Today 1(1), 61–67 (1967)
24. Mislove, A., Marcon, M., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, pp. 29–42. ACM, New York (2007)


P. Loupos and A. Nathan

25. Moreno-Sanchez, P., Modi, N., Songhela, R., Kate, A., Fahmy, S.: Mind your credit: assessing the health of the ripple credit network. arXiv preprint arXiv:1706.02358 (2017)
26. Redner, S.: How popular is your paper? An empirical study of the citation distribution. Eur. Phys. J. B 4(2), 131–134 (1998)
27. Travers, J., Milgram, S.: An experimental study of the small world problem. In: Social Networks, pp. 179–197. Elsevier, Amsterdam (1977)
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440 (1998)

Modelling Students’ Thematically Associated Knowledge: Networked Knowledge from Affinity Statistics

Ismo T. Koponen

Abstract Students’ knowledge is often organized around relations and key concepts, but it sometimes also resembles associative knowledge, where connections between knowledge elements are based on thematic resemblance without an overarching organization based on substantiation or logical reasoning. Because associative knowledge, while also important for learning, is known to be structured very differently from more organized knowledge, a closer look at students’ thematically associated knowledge is warranted. In this study we model students’ thematically associative knowledge as a network of pairwise associative connections. The model is based on the assumption that associative knowledge is to a large degree governed by the intrinsic affinity of the knowledge elements that make up the thematically associated knowledge base. The model introduced here makes minimal assumptions about the affinity distribution of such knowledge. The results show that in this case, under very general conditions, the network of associative knowledge is characterized by inverse power laws of degree, eigenvector, and betweenness centralities. These results agree with the empirically found properties of students’ associative networks.

1 Introduction

Knowledge acquisition and processing strategies are specific to the context of learning and to the type of targeted knowledge. A starting point for learning is often familiarization with key concepts or key knowledge items, which are then processed further and integrated into more coherent knowledge structures [9, 15]. In that knowledge processing, students’ familiarization with the target knowledge often starts with proposing thematic connections between the knowledge items and their possible relationships, for example in the form of concept maps or mind maps.

I. T. Koponen, Department of Physics, University of Helsinki, Helsinki, Finland
© Springer Nature Switzerland AG 2019. S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_10


Such connections can be taken as basically pairwise (dyadic) thematic word or term associations, where associative connections are established on the basis of thematic resemblance or kinds of family resemblance, but where no detailed substantiation or justification of connections is provided [17, 19]. The knowledge processing may then continue with more organized strategies and concept map-type scaffoldings in the form of integrated knowledge systems [9], known to be useful in a variety of learning contexts, for example in learning science [1, 11, 15, 18, 19] as well as in learning history [22]. Cognitively oriented research on learning claims that associative knowledge is different from knowledge which is structured through more complex relational dependencies [7, 8, 15]. This may well also hold for thematically associated knowledge, where a common theme or topic is the basis of associative connections. The difference between such thematically associative and structured knowledge was also noted in a study focusing on how students organize their knowledge using concept maps [14]. A recent empirical study [13] shows that in the context of the history of science, at least students’ preliminary knowledge is structured differently from substantiated knowledge [12]. To better understand students’ knowledge organization strategies, a recent study modelled them in the form of concept networks, using simple linkage motifs to generate the networks. The model produced networks which structurally closely match concept networks made by students when the construction of networks is rule-based [12]. In that case, networks have degree distributions which are centered and thus have a scale, and they have relatively high clustering [12].
Motivated by the notion that thematically associative knowledge [13] may be very different from rule-based, relational knowledge [12], we focus here on modelling associative networks, which are based on pairwise (dyadic) knowledge item associations, and model such networks as affinity-based networks. We model knowledge item networks and their properties as they are reported in a recent study addressing thematically associative knowledge items in the context of the history of science [13]. That study involved a group of 25 students. The set of knowledge items came from preparatory tasks to explore the history of science over three centuries between 1550 and 1850, and how that history of science was embedded in the culture, society, and politics of the same era [13]. The data consisted of about 1300 different knowledge items and about 2500 different pairwise thematically associative connections between them. The study showed that the resulting network was characterized by a fat-tailed distribution of node degrees [13]. Betweenness centrality was also found to follow a fat-tailed distribution. In all cases, the distributions were reasonably well fitted by inverse power law distributions, which allowed them to be described by a single relevant parameter, the power λ, where 1 < λ < 2. In what follows, we refer to such heavy-tailed distributions as inverse power laws. The present study concentrates on rationalizing a generative model which can reproduce the inverse power laws of centrality distributions as found in the real thematically associative networks. We show that to reproduce the empirically found properties of students’ networks as reported in Ref. [13] we can construct a kind of minimal model as an affinity-based network [2, 6, 21], with very minimal assumptions about the distribution of the affinities. The resulting network comes close to the empirically found networks when students produce connections between knowledge elements based on thematic associations [13], but is very different from networks found in cases where students organize their knowledge in rule-based ways [12]. The results support the notion that the ways students handle knowledge organized around associative connections lead to a very different knowledge organization from situations where they use relational, rule-based dependencies; both strategies lead to simple regularities, but to different ones.

2 The Empirical Case: Associative Knowledge

The empirical findings on the associative network and its properties that are modelled here are based on results recently reported in an empirical study addressing how students make thematic associative connections between different knowledge items in a science (physics) history course for third- and fourth-year students (pre-service teachers). The aim of the course was to discuss the history of physics as part of science history, as part of the history of humanities and arts, and as part of general history, in expanding circles. The results reported in Ref. [13] are based on data from pre-tasks in which students explored and constructed connections between the historical characters, scientists, ideas, inventions, institutions, etc. that they thought were of major interest or importance for the history of science and history in general. Students reported the connections in the form of pairwise associations (dyads), for example [galilei ↔ heliocentricmodel]. These data were used in Ref. [13] to construct a complex, thematically associative network of about 1300 nodes and 2500 links. Further details of the course, the empirical analysis, and the results are reported in Ref. [13]. The main finding of the empirical study was that thematically associative networks, at the group level when all student networks were collated, had heavy-tailed distributions of degree centrality D and betweenness centrality B. What is of interest here for modelling is that the values of D and B turned out to have approximately an inverse power law type distribution with the inverse power λ ∈ [1.5, 2.0]. Of course, the networks were not scale invariant, and the inverse power law should be taken only as an appropriate fit that reveals the heavy-tailed nature of the distribution of the values D and B [13].

3 The Model

The basic assumption of the model is that the structure of the students’ thematically associative knowledge, as captured by the network consisting of all the different pairwise connections, is determined solely by the intrinsic affinity α_k of the knowledge elements k = 1, 2, ..., N. The intrinsic affinity is substantially higher for some knowledge elements than for others. The formation of links between knowledge elements is likewise determined solely by their affinity. The affinity α_k, however, cannot be directly available to students. A more plausible assumption is that an affinity-related ranking R_k of knowledge elements is the basis for forming the linkages. Here, we assume that the appropriate ranking is simply equal to the cumulative distribution R_k of affinities

\[ R_k = \sum_{i=1}^{k} \alpha_i \Big/ \sum_{i=1}^{N} \alpha_i , \qquad R_k \in [0, 1]. \tag{1} \]
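The cumulative ranking of Eq. (1) can be illustrated with a minimal sketch. The affinity values and the function name `cumulative_ranking` are hypothetical, and the order in which the elements are listed is an assumption, since the text does not fix it.

```python
from itertools import accumulate


def cumulative_ranking(affinities):
    """Return R_k = (alpha_1 + ... + alpha_k) / (alpha_1 + ... + alpha_N) for k = 1..N."""
    total = sum(affinities)
    if total <= 0:
        raise ValueError("affinities must have a positive sum")
    return [partial / total for partial in accumulate(affinities)]


# Hypothetical affinities for N = 4 knowledge elements (illustration only).
alphas = [1.0, 1.0, 2.0, 4.0]
ranking = cumulative_ranking(alphas)
print(ranking)
```

By construction the ranking is non-decreasing and reaches 1 for the last element, which is the property the normalization of the linking probability relies on.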

We next assume that a characteristic value R̄ exists, which may depend on the task, the time allowed for the task, and the average competency of the students participating in the task. The probability π_k that a given knowledge element k is linked to another knowledge element is then assumed to correspond to a maximally uncertain choice under the simple constraint that R̄ is fixed. The probability of formation of a link between knowledge elements p and q is then assumed to be proportional to the product π_p π_q. The probability π_k is now obtained through maximization of the information-theoretical (Shannon-Jaynes) entropy function [10]. Here, to allow as broad a generality as possible, we adopt the generalized (Tsallis) q-entropy [16, 23–25] in the form

\[ I_q = \frac{1}{q}\Big(1 - \sum_i \pi_i^{1+q}\Big), \qquad q \in \,]-1, 1[. \tag{2} \]

The exponent q governs the non-extensivity of the entropy. The normal, extensive Shannon-Jaynes entropy I = −Σ_i π_i log π_i is recovered in the limit q → 0 [16, 23]. The next step is then to introduce multipliers for variational maximization of the entropy function in Eq. (2). The resulting distribution π_k, which maximizes the q-entropy given the constraint R̄ = constant, is a q-exponential [16, 25] (for details of the derivation, see Ref. [16])

\[ \pi_k = \pi_0 \exp_q[-\beta R_k], \qquad \text{where } \exp_q[x] = \Big[1 + \frac{q}{1+q}\,x\Big]^{1/q}. \tag{3} \]

The function exp_q[x] is a q-deformed (or q-generalized) exponential function which reduces to the normal exponential function in the limit q → 0. The parameter β is the multiplier corresponding to the constraint that R̄ is kept constant. Note that now β < 0 because R_k → 1 when k → ∞ and exp_q[x] must be an increasing function [16, 25]. Another multiplier, corresponding to the normalization condition, is absorbed in the normalization π_0. As the functional form of π_k is now known, we require that π_k → 1 when R_k → 1. This fixes the normalization coefficient to the value π_0 = 1/exp_q[−β] = [1 − βq/(1 + q)]^{−1/q}, where β < 0.
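As a hedged numerical sketch of Eq. (3), the snippet below evaluates the q-exponential and the resulting linking probability, with the normalization chosen so that π_k = 1 at R_k = 1, that is, π_0 = 1/exp_q[−β]. The function names and the values q = −0.5 and β = −0.5 are illustrative assumptions, not parameters taken from the study.

```python
import math


def exp_q(x, q):
    """q-deformed exponential [1 + q*x/(1+q)]**(1/q); equals exp(x) in the limit q -> 0."""
    if abs(q) < 1e-12:
        return math.exp(x)
    base = 1.0 + q * x / (1.0 + q)
    if base <= 0.0:
        raise ValueError("argument outside the domain of exp_q")
    return base ** (1.0 / q)


def linking_probability(R_k, beta, q):
    """pi_k = pi_0 * exp_q(-beta * R_k), with pi_0 fixed by pi_k = 1 at R_k = 1."""
    pi_0 = 1.0 / exp_q(-beta, q)
    return pi_0 * exp_q(-beta * R_k, q)


# Illustrative parameter values only (assumed, with beta < 0 as in the text).
q, beta = -0.5, -0.5
probs = [linking_probability(R, beta, q) for R in (0.125, 0.25, 0.5, 1.0)]
print(probs)
```

With β < 0 the probability grows with the ranking R_k, so elements near the top of the cumulative affinity distribution are the most likely to receive links.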


A result similar to Eq. (3) is obtained through an entirely different chain of arguments, starting from the affinity distribution and finding a linking probability that leads to an inverse power law type degree distribution but depends on the affinity distribution only through the cumulative distribution R_k [21]. The derivation in Ref. [21] shows that it is always possible to find an affinity distribution which satisfies Eq. (3) for a given power λ of the inverse power law degree distribution, which for node degrees d has the form

\[ P(d) \propto d^{-\lambda}, \qquad \lambda \ge 1. \tag{4} \]

The advantage of the derivation in Ref. [21] is the explicit connection between the parameters appearing in the linking probability and the power of the degree distribution in Eq. (4), as well as the minimum and maximum degrees k_min and k_max, respectively, allowed by the choice of the parameters. We utilize these results and rewrite the parameter dependencies of Eq. (3) as follows:

\[ q = 1 - \lambda, \quad \lambda \in \,]1, 2[, \qquad \beta = \frac{1 - r^{\lambda-1}}{1-\lambda} \]

1 if i and j are both paid by k and zero otherwise.¹ Summing over all k, the cocitation C_ij of i and j is

\[ C_{ij} = \sum_{k=1}^{n} A_{ik} A_{jk} = \sum_{k=1}^{n} A_{ik} A^{T}_{kj}, \]

where A^T_kj is an element of the transpose of A. So, we define the cocitation matrix C to be the n × n matrix with elements C_ij, which is given by

\[ C = A A^{T}. \]

Now we can build a cocitation graph in which there is an edge between i and j if C_ij > 0, for i ≠ j.
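The cocitation construction can be sketched in a few lines of plain Python. The 4-node matrix below is hypothetical, and reading A[i][k] = 1 as "i and k are linked by a payment, with k as the shared counterpart" is an assumption made for illustration.

```python
def cocitation(A):
    """Return C = A A^T for a square 0/1 adjacency matrix given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cocitation_edges(C):
    """Undirected edges (i, j) with i < j wherever C_ij > 0; the diagonal is ignored."""
    n = len(C)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if C[i][j] > 0]


# Hypothetical example: nodes 0 and 1 both share counterpart 2; node 0 also links to 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
C = cocitation(A)
edges = cocitation_edges(C)
print(C)
print(edges)
```

The diagonal entry C_ii counts the counterparts of node i, which is why the graph is built only from the off-diagonal entries.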

2.2 Graph Convolutional Networks

Generalizing general-purpose neural models such as recurrent neural networks (RNNs) [8] or convolutional neural networks (CNNs) [16] to work on arbitrarily structured graphs is a challenging problem. Most graph convolutional networks (GCNs) share a similar architecture, because filter parameters are typically shared over all (or a subset of) nodes within the graph [6]. The aim of GCNs is to learn a function from a set of features on a graph G = (N, E) with the following inputs:

• A feature description N_i for every node i, represented in an N × D feature matrix X, where N and D stand for the number of nodes and the number of input features, respectively.
• A representation of the graph structure in matrix form, usually an adjacency matrix A.

The GCN then produces a node-level output Z, an N × F feature matrix, where F corresponds to the number of output features per node. Each neural network layer can be written as a non-linear function

\[ H^{(l+1)} = f(H^{(l)}, A), \]

with an input layer H^{(0)} = X and an output layer H^{(L)} = Z, where L is the number of layers. Concrete GCN models differ only in how f(·, ·) is chosen and parameterized.
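To make the layer-wise rule concrete, here is a minimal sketch of one possible choice of f, namely f(H, A) = ReLU(A H W) with a weight matrix W. This particular choice, the toy matrices, and the helper names are assumptions for illustration; the text does not state which propagation rule is used, and common GCN variants also normalize A, which is omitted here.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def relu(M):
    """Element-wise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in M]


def gcn_layer(H, A, W):
    """One propagation step H_next = ReLU(A H W)."""
    return relu(matmul(matmul(A, H), W))


# Toy graph with N = 3 nodes (self-loops included), D = 2 input and F = 2 output features.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W0 = [[1.0, -1.0],
      [0.5, 0.5]]

Z = gcn_layer(X, A, W0)  # a single layer, so H(0) = X and H(1) = Z
print(Z)
```

Stacking several such calls, each with its own weight matrix, gives the multi-layer form H^{(l+1)} = f(H^{(l)}, A).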

¹ Note that edges in our supplier/customer network represent a payment.

GCNs on Customer/Supplier Graph Data to Improve Default Prediction


3 Default Prediction Models Based on Customer/Supplier Graphs

This section describes the information required to construct the customer/supplier graph, the baseline default prediction models, and the two alternatives considered in this work to incorporate graph information into default prediction models.

3.1 Customer/Supplier Graph Construction

There are multiple ways to model a customer/supplier graph. The connections linking the companies together may represent contract sharing, material flows, financial flows, or co-patent information, among others. All these kinds of graphs are different, and care must be taken to ensure that the network construction fits the purpose of the study. In our case, we gathered anonymized quarterly data from the official customer/supplier third-party payment declarations collected by the risk management department of a financial institution. This declaration is used as a mechanism to prevent fraud in companies’ VAT declarations and corresponds to the exchange of customer/supplier services. In addition, we have removed from the graph all companies belonging to the financial services and public institutions sectors. Such companies are important hubs that make the graph more connected but do not necessarily propagate default dynamics, since they may be subsidized and we do not account for this kind of external influence. The graph can also be processed and transformed to improve the default prediction task. Since our main hypothesis is that clients in default will probably propagate instability to their providers, we have transformed the graph by normalizing the edge weights by the in-strength to capture this dependence. This transformation results in the graph G. We have then gone a step further and transformed this graph into a graph of the companies that share a supplier, G_CS, and a graph of the companies that share a client, G_CC (cocitation graphs). We have modified the initial cocitation approach and weighted the edges by the similarity between the companies, as detailed in Sect. 2.1.
We have also enriched this similarity measurement with additional information related to the probability that failure propagates, such as whether the two companies are competitors or whether they have the same range of diversification according to their clients. In our case, we have modified the similarity weights to assess the ability of a supplier to influence a neighbor in the common customer graph: if both operate within the same sector, the similarity weight is preserved; otherwise, the similarity is divided by the number of sectors. This results in two further graphs, G_CC* and G_CS*, to be analyzed.

The collected data cover January 2014 to December 2015. The direction of the edges follows the path of money injection (from the client to the supplier). All edge weights (total money transferred) are aggregated annually (from January to December 2014) and normalized by the in-strength of the destination node. After removing self-loops, a directed and weighted network G with 168,305 nodes and 310,084 edges is obtained. For each available company, we extracted its operating revenue (a categorical value) and its financial health metric (a continuous variable in the range [−50, 100]). We attempt to predict future default, that is, whether the company will enter default within 1 year of the prediction. Therefore, future default labels at December 2014 were set to 0 or 1 depending on whether a company entered default in the following year. For that purpose, companies are also assigned an initial default label with values 0 or 1 depending on whether the company is or was in default during the year before the prediction (January 2014–December 2014). The initial default rate is 3.2%. Note that only a small fraction of the companies included in the graph have been analyzed by the experts of the risk management department of a financial institution: only 17,289 nodes, and 10,347 if we restrict ourselves to the largest connected component (6% of the network). Therefore, the network contains a large percentage of missing values for the financial health metric. As shown in Table 1, most of the firms included in the network are micro-SMEs and small companies, with an annual revenue smaller than 5 million euros (about 65% of the reported financial health values).

A. Martínez et al.

Table 1 Network company revenue distribution

Revenue category (million euros)   Number   Percentage
Micro-SME (0–1)                    2768     26.75
Small (1–5)                        3953     38.20
Medium (5–50)                      2631     25.43
Large (+50)                        995      9.62
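The edge-weight normalization used to build G can be sketched as follows. The company identifiers, the payment amounts, and the dictionary layout are hypothetical; only the two steps (drop self-loops, divide each weight by the in-strength of the destination node) come from the text, and strictly positive amounts are assumed.

```python
from collections import defaultdict


def normalize_by_in_strength(edges):
    """Drop self-loops, then divide each edge weight by the destination node's in-strength.

    `edges` maps (client, supplier) pairs to the total amount paid in the period.
    """
    kept = {(u, v): w for (u, v), w in edges.items() if u != v}
    in_strength = defaultdict(float)
    for (_, v), w in kept.items():
        in_strength[v] += w
    return {(u, v): w / in_strength[v] for (u, v), w in kept.items()}


# Hypothetical annual payments (client -> supplier).
payments = {("a", "c"): 30.0, ("b", "c"): 10.0, ("c", "c"): 5.0, ("a", "b"): 8.0}
G = normalize_by_in_strength(payments)
print(G)
```

After this step the incoming weights of every supplier sum to 1, so each weight can be read as the share of that supplier's income that depends on a given client.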

3.2 Baseline Models

Logistic regression is the most widely used default prediction model in risk assessment. It has several advantages, such as its linearity, low computational cost, and parameter interpretability. However, logistic regression has well-known drawbacks, the most important being that it cannot properly combine a large number of input variables. For this reason, a common practice is to combine several variables into one. Following this idea, the global risk management department of a financial institution designed an early warning system in which a financial health value (FHV) is computed by aggregating many balance-sheet values and ratios. Depending on each company’s revenue, balance-sheet values and ratios may differ considerably. For this reason, four different FHV formulas were developed, one for each revenue category. For our experiments we use three different baseline models: a single logistic regression considering only the FHV, a single logistic regression considering the FHV and the revenue category, and finally, four different logistic regressions (one for each revenue category) considering only the FHV. As Table 2 shows, the AUC values are almost the same in all cases, the largest being 0.780. We consider this value as our baseline. To obtain these AUC values, we cross-validate the results using 5 stratified folds. The folds are built so that the percentage of samples of each class is preserved. This is especially important in an imbalanced classification problem such as the one we address here. All the experiments described later in this paper follow the same procedure so that the AUC results can be compared fairly.

Table 2 Baseline models based on a logistic regression model over the FHV aggregated metric

Model              AUC
logit(FHV)         0.772
logit(FHV, cat)    0.773
logit_cat(FHV)     0.780
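The evaluation protocol (stratified folds plus AUC) can be sketched without any external library. The round-robin fold assignment and the pairwise AUC below are illustrative stand-ins, since the text does not say which implementation was used; in practice the samples would normally be shuffled first.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Assign sample indices to folds so that each fold keeps the overall class ratio."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)  # round-robin within each class
    return folds


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced sample: 5 defaults among 50 companies.
labels = [1] * 5 + [0] * 45
folds = stratified_folds(labels)
print([sum(labels[i] for i in fold) for fold in folds])
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Each fold then contains the same share of defaulted companies as the full sample, which keeps the per-fold AUC estimates comparable.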

3.3 Graph Metrics Based Models

Network science focuses on how graph topology affects system properties via numerous graph metrics. These metrics have proved useful for many purposes. However, graph metrics do not translate to other disciplines in a straightforward manner [2]. In particular, much effort has been devoted to developing novel metrics in supply chain risk management [27]. Adapting them to risk assessment faces similar challenges. Many metric definitions and trials are needed to find a suitable set of features. Moreover, many of these features are correlated, so de-correlation approaches and knowledge of the context must be considered [4]. We have designed features suitable for risk assessment starting from the hypothesis that they should be related to the label we want to predict, in this case future default. It is therefore plausible that the probability of a company j entering default in the next year is related to the dependence of this company on clients i in default. This is what we have named the default ratio at distance 1, R_d1, which can be calculated from the transposed adjacency matrix and the initial default node vector d as R_d1 = A^T · d. We have also defined the default ratio at distance 2 as R_d2 = R_d1 · d. It is reasonable to think that the smaller the net cash across a node, the more difficulties the company may go through. Since the in-strength of every node is normalized to 1, this net balance is given by the out-strength S_out, the sum of the weights of the outgoing edges of each node. The above features are still local. To take into consideration the global topology of the graph, we need a measure which takes into account the structure of faraway neighborhoods. In order to consider the diffusion of default across the graph, we have computed a personalized PageRank (PPR), in which nodes in default are heavily weighted.
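The local features R_d1 and S_out, together with one possible personalized PageRank, can be sketched as below. The 3-node weighted matrix is hypothetical (rows are clients, columns are suppliers), and restricting teleportation entirely to defaulted nodes, the damping factor, and the handling of dangling nodes are assumptions; the text only says that nodes in default are heavily weighted.

```python
def default_ratio_d1(A, d):
    """R_d1 = A^T d: share of each node's normalized income that comes from defaulted clients."""
    n = len(A)
    return [sum(A[i][j] * d[i] for i in range(n)) for j in range(n)]


def out_strength(A):
    """S_out: sum of the outgoing edge weights of each node."""
    return [sum(row) for row in A]


def personalized_pagerank(A, d, damping=0.85, iterations=100):
    """Power iteration whose teleportation vector is concentrated on defaulted nodes."""
    n = len(A)
    total = sum(d)  # assumes at least one defaulted node
    teleport = [x / total for x in d]
    s_out = out_strength(A)
    rank = teleport[:]
    for _ in range(iterations):
        new = [(1.0 - damping) * t for t in teleport]
        for i in range(n):
            for j in range(n):
                if s_out[i] > 0:
                    new[j] += damping * rank[i] * A[i][j] / s_out[i]
                else:  # dangling node: return its mass to the teleport set
                    new[j] += damping * rank[i] * teleport[j]
        rank = new
    return rank


# Hypothetical weights A[i][j] for an edge from client i to supplier j.
A = [[0.0, 1.0, 0.75],
     [0.0, 0.0, 0.25],
     [0.0, 0.0, 0.0]]
d = [0, 1, 0]  # node 1 is initially in default

r_d1 = default_ratio_d1(A, d)
s_out = out_strength(A)
ppr = personalized_pagerank(A, d)
print(r_d1, s_out, ppr)
```

In this toy graph only node 2 has a defaulted client, so it is the only node with a non-zero R_d1, while the PageRank mass stays on the defaulted node and the nodes reachable from it.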

Table 3 Graph metrics statistical descriptors summary

Metric            Mean         Std         Median
R_d1              0.0221       0.133       0.000
R_d2              0.0232       0.079       0.000
PPR               0.000006     0.000023    0.000
S_out             0.676        4.138       0.012
R_d1 (G_CC)       1.345        2.056       0.718
R_d2 (G_CC)       106.601      227.731     24.752
R_d1 (G_CS)       3.39         8.18        1.29
R_d2 (G_CS)       1376.60      6250.19     150.44
UCC (G_CC)        0.830        0.268       1.000
UCC (G_CS)        0.878        0.222       1.000
WCC (G_CC)        0.303        0.184       0.288
WCC (G_CS)        0.363        0.148       0.371
size_cc (G_CC)    77,875.54    6792.42     78,468
size_cc (G_CS)    114,276.16   13,763.74   115,934
size_scc          3711.37      8891.97     1.000

For the common supplier graph G_CS and the common client graph G_CC we have considered the default ratios at distances 1 and 2, although we expect the relation of these features to future default to differ from that in the original graph G. Consider two neighbors in G_CC that also happen to be competitors. If one of them defaults, the situation of the other may improve, since their shared customers may increase their dependence on the one that continues its economic activity. However, this only happens as long as the remaining provider is able to meet the demands of the customers, that is, as long as it is able to produce and sell more goods. For this reason, for these graphs we have also considered the weighted and unweighted local clustering coefficients (denoted WCC and UCC, respectively). Finally, we have calculated the size of the connected component the node belongs to in the undirected cocitation graphs (size_cc) and the size of the strongly connected component in the directed, original graph (size_scc). Table 3 shows summary statistics for all these metrics. For all the experiments, when computing AUC values, we only consider nodes for which the FHV variable has been calculated. As Table 4 shows, adding graph metrics increases the AUC only marginally with respect to our baseline (from 0.780 to 0.781). Moreover, we do not see any additional improvement when considering the cocitation graph metrics, meaning that the information may be redundant.
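The unweighted local clustering coefficient (UCC) and the connected component size (size_cc) for an undirected graph can be sketched as follows. The adjacency dictionary is a hypothetical toy graph; the weighted coefficient (WCC) is left out because several definitions exist and the text does not say which one was used.

```python
from collections import deque


def local_clustering(adj, node):
    """Unweighted local clustering: fraction of neighbor pairs that are themselves linked."""
    nbrs = sorted(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a in range(k) for b in range(a + 1, k) if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))


def component_sizes(adj):
    """Map every node to the size of its connected component (breadth-first search)."""
    size_of = {}
    for start in adj:
        if start in size_of:
            continue
        seen, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        for u in seen:
            size_of[u] = len(seen)
    return size_of


# Hypothetical undirected cocitation graph with two components.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {5}, 5: {4}}
ucc0 = local_clustering(adj, 0)
ucc1 = local_clustering(adj, 1)
sizes = component_sizes(adj)
print(ucc0, ucc1, sizes)
```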

Table 4 Logistic regression models including financial health value, revenue category, and the different graph metrics features

Model                                                              AUC
logit(FHV, CAT, G(R_d1), G(R_d2), G(S_out), G(PPR))                0.781
logit(FHV, CAT, G_CC(R_d1), G_CC(R_d2), G_CC(WCC), G_CC(UCC))      0.780
logit(FHV, CAT, G_CS(R_d1), G_CS(R_d2), G_CS(WCC), G_CS(UCC))      0.781
logit(FHV, CAT, G_CC*(R_d1), G_CC*(R_d2), G_CC*(UCC))              0.781
logit(FHV, CAT, G_CS*(R_d1), G_CS*(R_d2), G_CS*(UCC))              0.781

3.4 GCN Based Models

We now focus on how a simple GCN model, based on a convolution kernel equal in size to a row of the adjacency matrix, works on G and its alternative representations. To consider level 1 and level 2 neighborhoods, we follow the same logic described in Sect. 3.3, where R_d1 and R_d2 were computed. Figure 1 describes the concrete architecture used in our experiments. Specifically, the 1st level and 2nd level layers correspond to the convolutions. These two layers compress all neighborhood information into a single value. The activation function in these convolutions is a ReLU [19] function. We have also tested other non-linear functions such as tanh and sigmoid; however, the results were worse than those reported in Table 5. Our intuition is that ReLU performs better because it has no upper bound, which allows it to assign much larger weights to the very few defaulted nodes. The convolution outputs are then combined with all the other node information (FHV, category, and graph metrics) using a softmax layer with a sigmoid activation function. This last layer acts as a logistic regression classifier, so that the contribution of the convolutions can be compared fairly with the baseline and graph metrics based models.

Fig. 1 Description of the GCN architecture applied

Table 5 GCN-based models including financial health value, revenue category, and the convolution value obtained with level 1 and 2 neighbors

Model                                              AUC
Only using GCN over G                              0.547
GCN over G and FHV, CAT and G metrics              0.793
GCN over G_CC and FHV, CAT and G_CC metrics        0.787
GCN over G_CC* and FHV, CAT and G metrics          0.790
GCN over G_CS and FHV, CAT and G_CS metrics        0.792
GCN over G_CS* and FHV, CAT and G_CS* metrics      0.792

The loss function selected for this experiment is the weighted binary cross entropy [5], which takes into account the different importance of the classes. The optimizer was Adam, a variant of stochastic gradient descent, used with batches larger than a single sample; specifically, the batch size during training was 256. We followed the same approach as when the stratified folds were created, and built stratified batches that respect the original class ratios. Experiments were carried out with an initial learning rate of 0.01, which we reduce during the training phase to minimize the risk of divergence. Finally, to avoid over-fitting, we fixed an early stopping criterion: training stops when the loss function increases in several consecutive epochs. Note that the most challenging part of this experiment was memory management, because of the size of the adjacency matrix. Table 5 shows that including the graphs and their metrics in the GCN slightly improves on the logistic regression models (the best AUC rises from 0.781 to 0.793). These results, although moderate and with large deviations, are promising enough to continue with this approach and keep improving current risk assessment models. Note that a 1% improvement in an imbalanced classification problem such as the one addressed in this work means a much larger improvement in the classification error of the minority class, i.e. the defaulted companies.
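The two training ingredients named above, a class-weighted binary cross entropy and an early stopping rule, can be sketched in plain Python. The positive-class weight and the patience of 3 epochs are assumed values; the text only speaks of different class importance and of "several consecutive epochs".

```python
import math


def weighted_bce(y_true, y_prob, pos_weight, eps=1e-12):
    """Mean binary cross entropy with the positive-class term scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def should_stop(loss_history, patience=3):
    """True once the loss has increased in `patience` consecutive epochs."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical predictions for one defaulted and one healthy company.
plain = weighted_bce([1, 0], [0.5, 0.5], pos_weight=1.0)
weighted = weighted_bce([1, 0], [0.5, 0.5], pos_weight=3.0)
print(plain, weighted)
print(should_stop([1.0, 0.9, 0.95, 1.0, 1.1]))
```

Raising the positive-class weight makes a misclassified defaulted company cost more than a misclassified healthy one, which is the usual way to compensate for a default rate of only a few percent.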

4 Conclusions

In this paper we have enriched risk assessment models for the prediction of future default by adding large-scale relational information about the ecosystem of each company. Our real-world network of Spanish companies, gathered from the official customer/supplier third-party payment declaration, is one of the largest real-world customer/supplier networks appearing in the literature. We have proposed and compared two different ways of incorporating graph knowledge into risk assessment models. On the one hand, we have found that cocitation graph construction and metrics do not improve the performance of the model over the original graph metrics. On the other hand, we have found that graph convolutional networks, although challenging to apply because of the size of the adjacency matrix, increase the model AUC from 0.78 to 0.79 while avoiding over-fitting problems. Although this improvement means a much larger improvement in the classification error of the minority class, it is still only moderate with respect to the current model. One reason may be the large amount of missing values in the network, in both edge and node information. Another is that customer/supplier information may not be enough, and the network may need to be enriched with other macroeconomic data, such as geographical connections or competitor relationships among the companies, to better predict future default patterns. We expect to improve these results by tackling such issues in future studies.

References
1. Battiston, S., Gatti, D.D., Gallegati, M., Greenwald, B., Stiglitz, J.E.: Credit chains and bankruptcy propagation in production networks. J. Econ. Dyn. Control. 31(6), 2061–2084 (2007)
2. Brintrup, A., Ledwoch, A.: Supply network science: emergence of a new perspective on a classical field. Chaos Interdiscip. J. Nonlinear Sci. 28, 033120 (2018)
3. Brintrup, A., Ledwoch, A., Barros, J.: Topological robustness of the global automotive industry. Logist. Res. 9(1), 1 (2015)
4. Costa, L.D.F., Rodrigues, F.A., Travieso, G., Villas Boas, P.R.: Characterization of complex networks: a survey of measurements. Adv. Phys. 56(1), 167–242 (2007)
5. De Boer, P.T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Ann. Oper. Res. 134(1), 19–67 (2005)
6. Duvenaud, D.K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., Adams, R.P.: Convolutional networks on graphs for learning molecular fingerprints. In: Advances in Neural Information Processing Systems (NIPS) 28, pp. 2224–2232 (2015)
7. Goldin, I., Mariathasan, M.: The Butterfly Defect: How Globalization Creates Systemic Risks, and What to Do About It. Princeton University Press, Princeton (2014)
8. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649 (2013)
9. Henaff, M., Bruna, J., LeCun, Y.: Deep convolutional networks on graph-structured data. CoRR, abs/1506.05163 (2015)
10. Iori, G., Jafarey, S., Padilla, F.G.: Systemic risk on the interbank market. J. Econ. Behav. Organ. 61(4), 525–542 (2006)
11. Keqiang, W., Zhaofeng, Z., Dongchuan, S.: Structure analysis of supply chain networks based on complex network theory. In: 2008 Fourth International Conference on Semantics, Knowledge and Grid, pp. 493–494 (2008)
12. Kim, Y., Choi, T.Y., Yan, T., Dooley, K.: Structural investigation of supply networks: a social network analysis approach. J. Oper. Manag. 29(3), 194–211 (2011)
13. Kim, Y., Chen, Y.-S., Linderman, K.: Supply network disruption and resilience: a network structural perspective. J. Oper. Manag. 33–34, 43–59 (2015)
14. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (2017)
15. Kito, T., Brintrup, A., New, S., Reed-Tsochas, F.: The structure of the Toyota supply network: an empirical analysis. In: Said Business School (2015)
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS) 25, pp. 1097–1105 (2012)
17. Ledwoch, A., Yasarcan, H., Brintrup, A.: The moderating impact of supply network topology on the effectiveness of risk management. Int. J. Prod. Econ. 197, 13–26 (2018)
18. Mizgier, K.J., Wagner, S.M., Holyst, J.A.: Modeling defaults of companies in multi-stage supply chain networks. Int. J. Prod. Econ. 135(1), 14–23 (2012). Advances in Optimization and Design of Supply Chains.
19. Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, pp. 807–814. Omnipress, Madison (2010)
20. Pathak, S., Day, J.M., Nair, A., Sawaya, W.J., Murat, K.M.: Complexity and adaptivity in supply networks: building supply network theory using a complex adaptive systems perspective. Decis. Sci. 38(4), 547–580 (2007)
21. Perera, S., Bell, M.G.H., Bliemer, M.C.J.: Network science approach to modelling the topology and robustness of supply chain networks: a review and perspective. Appl. Netw. Sci. 2, 33 (2017)
22. Roukny, T., Bersini, H., Pirotte, H., Caldarelli, G., Battiston, S.: Default cascades in complex networks: topology and systemic risk. Sci. Report. 3, 2759 (2013)
23. Small, H.: Co-citation in the scientific literature: a new measure of the relationship between two documents. J. Am. Soc. Inf. Sci. 24, 265–269 (1973)
24. Tedeschi, G., Mazloumian, A., Gallegati, M., Helbing, D.: Bankruptcy cascades in interbank markets. PLoS One 7(12), 1–10 (2013)
25. Weisbuch, G., Battiston, S.: From production networks to geographical economics. J. Econ. Behav. Organ. 64, 448–469 (2007)
26. Wycisk, C., McKelvey, B., Hulsmann, M.: “Smart parts” supply networks as complex adaptive systems: analysis and implications. Int. J. Phys. Distrib. Logist. Manag. 38(2), 108–125 (2008)
27. Yan, T., Choi, T.Y., Kim, Y., Yang, Y.: A theory of the nexus supplier: a critical supplier from a network perspective. J. Supply Chain Manag. 51(1), 52–66 (2015)
28. Zeng, Y., Xiao, R.: Modelling of cluster supply network with cascading failure spread and its vulnerability analysis. Int. J. Prod. Res. 52(23), 6938–6953 (2014)
29. Zsidisin, G.A.: A grounded definition of supply risk. J. Purch. Supply Manag. 9(5), 217–224 (2003)

Multidimensional Outlier Detection in Interaction Data: Application to Political Communication on Twitter
Audrey Wilmet and Robin Lamarche-Perrin

Abstract We introduce a method that aims to provide a better understanding of how millions of interactions may result in global events. Given a set of dimensions and a context, we find different types of outliers: a user whose activity during a given hour is abnormal compared to their usual behavior, a relationship between two users that is abnormal compared to all other relationships, etc. We apply our method to a set of retweets related to the 2017 French presidential election and show that it yields interesting insights into political organization on Twitter.

A. Wilmet, Sorbonne Université, UMR 7606, LIP6, Paris, France
R. Lamarche-Perrin, CNRS, Institut des Systèmes Complexes de Paris Île-de-France, ISC-PIF, Paris, France
© Springer Nature Switzerland AG 2019. S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_12

1 Introduction Within Twitter, users can post information via tweets as well as spread information by retweeting tweets of other users. This dissemination of information from a variety of perspectives may lead to global events which affect users’ opinions. In this paper, we introduce a method which aims at getting a better understanding of how these interactions are organized. To this end, we look for outliers in interaction data formed from a set of retweets.

The problem of outlier detection on Twitter has been approached in various ways depending on how outliers are defined. Some researchers consider outliers as real-world events taking place at a given place and at a given moment. For example, Sakaki et al. [10] and Bruns et al. [1] trace specific keywords attributed to an event and find such outliers by monitoring temporal changes in word usage within tweets. In other approaches, authors infer a similarity between each pair of tweets from timestamps, geo-localizations, and tweet contents, and find events as clusters of similar tweets [3, 8, 13]. Other researchers instead consider outliers as users with abnormal behaviors according to different criteria. For instance, Varol et al. [12] detect bots by means of a supervised machine learning technique; Stieglitz et al. [11] focus on influential users by investigating the correlation between the vocabulary they use in tweets and the number of times they are retweeted; and Ribeiro et al. [9] detect hateful users by means of a lexicon-based method.

With our approach, we treat these different types of outliers in a unified way and consider different perspectives from which outliers are deemed abnormal, similarly to what Grasland et al. [6] do in the case of media coverage. Hence, we consider not only different entities but also different contexts in which outliers are defined. In this way, our framework aims to give a more complete picture of how users act, interact, and are organized over time.

Our method is the following. We consider an interaction to be a triplet (s, a, t) meaning that user s, called the spreader, has retweeted a tweet of user a, called the author, at time t. Then, we model the set of interactions as a 3-D data cube which gives us access to local information, such as the number of retweets between two users during a specific hour, as well as more global and aggregated information, such as the total number of retweets. By combining and comparing these different quantities, we find outliers according to different contexts. We describe our method in Sect. 2, then apply it to a set of retweets related to the 2017 French presidential election in Sect. 3. Finally, we conclude the paper with future work in Sect. 4.

2 Method We denote the set of interactions by a set E of triplets such that (s, a, t) ∈ E indicates that s, called the spreader, has retweeted a, called the author, at time t. We model this set as a data cube. In this section, we formally define this tool as well as three operations we can apply on it to shape various contexts and find relevant outliers.

2.1 Data Cube Definition A data cube is a general term used to refer to a multi-dimensional array of values [7]. Given $n$ dimensions characterized by $n$ sets $X_1, \dots, X_n$, we can build $N = \sum_{i=0}^{n} \binom{n}{n-i}$ data cubes, each representing a different degree of aggregation of the data. Within this set of data cubes, we call the base cuboid the cube which has the lowest degree of aggregation. We denote it $C_n(X, f)$, where $X = X_1 \times \dots \times X_n$ is the Cartesian product of the $n$ sets $X_1, \dots, X_n$, and $f$ is a feature which maps each cell $c = (x_1, \dots, x_n) \in X$ to a numerical value. In this paper, the three dimensions we consider are: the spreaders, denoted $S$, the authors, denoted $A$, and time, denoted $T$. In addition, we divide the temporal dimension into the sub-dimensions days, denoted $D$, and hours, denoted $H$, such that $t = (d, h)$ with $(d, h) \in D \times H$. The feature we choose to analyze relationships between dimensions is the quantity of interaction, denoted $v$. It gives the number of retweets for any combination of the three dimensions. In the base cuboid, $v(s, a, (d, h))$ gives the number of times $s$ retweeted $a$ during hour $h$ of day $d$:

$$v : S \times A \times D \times H \longrightarrow \mathbb{N}, \qquad (s, a, (d, h)) \longmapsto v(s, a, (d, h)).$$

For instance, in Fig. 1, the gray cell in the base cuboid indicates that $s_2$ retweeted $a_4$ 40 times on day $d_1$ at hour $h_1$. For the sake of clarity, in the following we will refer to $(s, a, (d, h))$ as $(s, a, d, h)$.

Fig. 1 Aggregation, expansion, and filtering on the base cuboid. [Diagram: the base cuboid (spreaders, authors, time), with example cell v(s2, a4, d1, h1) = 40; the 2-D cuboids (spreaders, time), (spreaders, authors), and (authors, time); the 1-D cuboids (spreaders), (time), and (authors). Labeled operations: filtering on spreaders, aggregation on spreaders, expansion on time.]
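As an illustration of the base cuboid, the sketch below counts retweets per cell $(s, a, d, h)$ from a list of interaction triplets. It is a minimal example under stated assumptions: the function name `base_cuboid`, the sparse `Counter` representation, and the toy interactions are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter
from datetime import datetime


def base_cuboid(interactions):
    """Count retweets per cell (s, a, d, h).

    `interactions` is an iterable of (spreader, author, timestamp)
    triplets; only non-empty cells are stored.
    """
    v = Counter()
    for spreader, author, ts in interactions:
        v[(spreader, author, ts.date(), ts.hour)] += 1
    return v


# Hypothetical toy interactions, for illustration only.
E = [
    ("s1", "a1", datetime(2016, 8, 24, 20, 5)),
    ("s1", "a1", datetime(2016, 8, 24, 20, 40)),
    ("s2", "a1", datetime(2016, 8, 24, 21, 10)),
    ("s2", "a2", datetime(2016, 8, 25, 9, 0)),
]
v = base_cuboid(E)
```

A sparse mapping is one possible design choice here: with hundreds of thousands of users, a dense array over all spreaders, authors, days, and hours would be mostly zeros.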

2.2 Data Cube Operations We can explore data through three operations called aggregation, expansion, and filtering (see Fig. 1).

Aggregation is the operation which consists in seeing information at a more global level. Given a data cube $C_n(X, f)$, the aggregation operation along the dimension $X_i$ leads to a data cube of dimension $n - 1$, $C_{n-1}(X', f)$, where $X' = X_1 \times \dots \times X_{i-1} \times X_{i+1} \times \dots \times X_n$ and such that for all $c' = (x_1, \dots, x_{i-1}, \cdot, x_{i+1}, \dots, x_n) \in X'$, $f(c') = \sum_{x_i \in X_i} f(c)$. For instance, one can aggregate along the hour dimension such that $v(s, a, d, \cdot) = \sum_{h \in H} v(s, a, d, h)$ gives the total number of times $s$ retweeted $a$ during day $d$.

Expansion is the reverse operation which consists in seeing information at a more local level by introducing additional dimensions. Given a data cube $C_n(X, f)$, the expansion operation on the dimension $X_{n+1}$ leads to a data cube of dimension $n + 1$, $C_{n+1}(X', f)$, where $X' = X \times X_{n+1}$.

Filtering is the operation which consists in focusing on one specific subset of data. Given a data cube $C_n(X, f)$, the filtering operation leads to a sub-cube $C_n(X', f)$ such that $X' = X'_1 \times \dots \times X'_n$ with $X'_1 \subseteq X_1, \dots, X'_n \subseteq X_n$.
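Aggregation and filtering can be sketched on a sparse cube stored as a mapping from cells to counts. This is a hedged illustration: the helper names `aggregate` and `filter_cube`, the use of `"."` to mark an aggregated dimension, and the toy values are assumptions made for this example.

```python
from collections import Counter


def aggregate(cube, axis):
    """Sum the feature over dimension `axis`, marking it with '.' in the key."""
    out = Counter()
    for cell, value in cube.items():
        key = cell[:axis] + (".",) + cell[axis + 1:]
        out[key] += value
    return out


def filter_cube(cube, allowed):
    """Keep cells whose coordinate on each listed axis lies in the allowed subset.

    `allowed` maps an axis index to a set of values; unlisted axes are
    left unrestricted.
    """
    return Counter({
        cell: value
        for cell, value in cube.items()
        if all(cell[axis] in values for axis, values in allowed.items())
    })


# Toy cube with cells (s, a, d, h); the values are hypothetical.
cube = Counter({
    ("s1", "a1", "d1", 20): 2,
    ("s2", "a1", "d1", 20): 1,
    ("s2", "a1", "d1", 21): 3,
    ("s2", "a2", "d2", 9): 4,
})
per_day = aggregate(cube, axis=3)               # v(s, a, d, .)
per_hour = aggregate(aggregate(cube, 0), 1)     # v(., ., d, h)
only_s2 = filter_cube(cube, {0: {"s2"}})        # filtering on spreaders
```

Expansion is not shown because it cannot be computed from an aggregated cube alone: it requires going back to finer-grained data, for example the base cuboid.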

2.3 Data Cube Contexts Our goal is to find abnormal cells, i.e., n-tuples $x \in X$ for which the observation $f(x)$ is abnormal. As an observation’s abnormality is relative to the elements to which it is compared [2], a given cell may be abnormal or not depending on the context. More precisely, the context is the set of observations which are taken into account in order to assess the abnormality of a cell; we denote it $O = \{o(x) \mid x \in X\}$. The most elementary context we can consider is the set of raw observations $O = \{f(x) \mid x \in X\}$; we call it the basic context. We present three other relevant contexts that can be shaped from the data cube thanks to the previous operations: the aggregated context, the expected context, and the restrained context.

Aggregated Context The cube under study, $C_n(X, f)$, is evaluated with respect to a more aggregated data cube, $C_m(X', f)$, with $n > m$ and $X = X' \times Y$, where $Y$ is the Cartesian product of the aggregated dimensions. To this end, for each cell $x = (x', y) \in X$ such that $x' \in X'$ and $y \in Y$, we measure the proportion, $p(x)$, between the quantity $f(x', y)$, within $C_n(X, f)$, and the quantity $f(x')$, within $C_m(X', f)$,

$$p(x) = \frac{f(x', y)}{f(x')},$$

and find abnormal cells in $C_n(X, f)$ according to the context $O = \{p(x) \mid x \in X\}$.

Example In data cube $C_3(A \times D \times H, v)$, relatively to data cube $C_2(D \times H, v)$, an abnormal cell $c^* = (a^*, d^*, h^*)$ indicates that the proportion of retweets received by author $a^*$ among all retweets of hour $h^*$ of day $d^*$,

$$p(a^*, d^*, h^*) = \frac{v(\cdot, a^*, d^*, h^*)}{v(\cdot, \cdot, d^*, h^*)},$$

is abnormal compared to most proportions of retweets received by authors during 1 h (independently of the hour of the day and of the day under consideration).

Expected Context The cube under study, $C_n(X, f)$, is evaluated with respect to an expected value. The latter, denoted $f_{\exp}$, is obtained by averaging $f$ over one or more of its variables. Formally, let $Y$ be the Cartesian product of the averaged dimensions such that $X = X' \times Y$. For each $x' \in X'$, we have

$$f_{\exp}(x') = \frac{1}{|Y|} \sum_{y \in Y} f(x', y).$$

Subsequently, for each cell $x = (x', y) \in X$ such that $x' \in X'$ and $y \in Y$, we measure a distance $l(x)$ between $f(x', y)$, within $C_n(X, f)$, and its expected value $f_{\exp}(x')$, and find abnormal cells in $C_n(X, f)$ according to the context $O = \{l(x) \mid x \in X\}$. When the feature consists in counting the number of interactions of cell $x$, as $v(x)$ does, it can be modelled by a Poisson counting process of intensity $f_{\exp}$ [6]. In this case, the distance $l(x)$ can be obtained as follows. If $f(x) \geq f_{\exp}(x')$, we calculate the probability of observing a value $f(x)$ or more, knowing that we should have observed $f_{\exp}$ on average. We denote this probability $q(\mathrm{Pois}(f_{\exp}) \geq f(x))$. By symmetry, we obtain $O = \{l(x) \mid x \in X\}$ such that

$$l(x) = \begin{cases} -\log\big(q(\mathrm{Pois}(f_{\exp}) \geq f(x))\big) & \text{if } f(x) \geq f_{\exp}, \\ \log\big(q(\mathrm{Pois}(f_{\exp}) < f(x))\big) & \text{if } f(x) < f_{\exp}. \end{cases} \tag{1}$$

Defined as such, $l(x)$ allows us to take into account how significantly a value deviates from its expected value.

Example In $C_3(A \times D \times H, v)$, relatively to $C_2(D \times H, v)$, an abnormal cell $c^* = (a^*, d^*, h^*)$ indicates that the distance $l(a^*, d^*, h^*)$ between the proportion of retweets received by author $a^*$ among all retweets of hour $h^*$ of day $d^*$,

$$p(a^*, d^*, h^*) = \frac{v(\cdot, a^*, d^*, h^*)}{v(\cdot, \cdot, d^*, h^*)},$$

and its expected proportion $p_{\exp}$ during this specific hour of the day $h^*$,

$$p_{\exp}(a^*, h^*) = \frac{1}{|D|} \sum_{d \in D} p(a^*, d, h^*),$$

is abnormal compared to most distances observed for other triplets in $A \times D \times H$.

Restrained Context It consists in evaluating an observation’s abnormality with respect to a subset of all observations, which is another way to focus on local patterns. To this end, we consider the set $O = \{o(x) \mid x \in X'\}$ such that $X' \subset X$.

Taken separately, each of these contexts allows us to study interactions from a different perspective. In our method, we combine them, which leads to numerous kinds of outliers.
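The aggregated and expected contexts can be made concrete with a short sketch that computes the proportion $p(x)$, the expected value $f_{\exp}$, and the signed Poisson distance of Eq. (1). Several points are assumptions of this example rather than details from the paper: the helper names, the treatment of absent cells as zeros, the restriction to integer counts with $f_{\exp} > 0$, the value $-\infty$ returned when the lower-tail probability is zero, and the log-space summation used to keep far tails numerically stable.

```python
import math


def proportion(fine, coarse, project):
    """Aggregated context: p(x) = f(x', y) / f(x'), where project(x) returns x'."""
    return {x: value / coarse[project(x)] for x, value in fine.items()}


def expected(cube, project, size_y):
    """Expected context: f_exp(x') = (1 / |Y|) * sum over y of f(x', y).

    Cells absent from `cube` count as zero; `size_y` is |Y|.
    """
    totals = {}
    for x, value in cube.items():
        key = project(x)
        totals[key] = totals.get(key, 0.0) + value
    return {key: total / size_y for key, total in totals.items()}


def _log_pmf(k, lam):
    """Logarithm of P(Pois(lam) = k)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def _log_sum_exp(logs):
    top = max(logs)
    return top + math.log(sum(math.exp(t - top) for t in logs))


def distance(observed, f_exp):
    """Signed distance l(x) of Eq. (1) for an integer count and f_exp > 0."""
    if observed >= f_exp:
        # Upper tail P(Pois >= observed): terms decrease from `observed` on,
        # so the sum is truncated once terms become negligible.
        first = _log_pmf(observed, f_exp)
        logs, k = [], observed
        while True:
            term = _log_pmf(k, f_exp)
            if term < first - 40.0:
                break
            logs.append(term)
            k += 1
        return -_log_sum_exp(logs)
    if observed == 0:
        return float("-inf")  # P(Pois < 0) = 0, so its logarithm is -infinity
    # Lower tail P(Pois < observed), with the strict inequality of Eq. (1).
    return _log_sum_exp([_log_pmf(k, f_exp) for k in range(observed)])


# Hypothetical toy values for cells (a, d, h) and (d, h).
fine = {("a1", "d1", 20): 3, ("a2", "d1", 20): 1}
coarse = {("d1", 20): 4}
p = proportion(fine, coarse, lambda x: x[1:])

hourly = {("d1", 20): 2, ("d2", 20): 4, ("d1", 21): 3}
f_exp = expected(hourly, lambda x: x[1], size_y=2)   # average over 2 days
```

With these toy values, `distance(hourly[("d2", 20)], f_exp[20])` compares a count of 4 with an expectation of 3.0 and returns a small positive distance, whereas a count far above its expectation would give a large one.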

152

A. Wilmet and R. Lamarche-Perrin

3 Experiments We apply our method to retweets related to political communication during the 2017 French presidential election. Our dataset consists of the set of retweets $E$, such that $(s, a, t) \in E$ means that $s$ retweeted $a$ at time $t$, where either the corresponding tweet contains politics-related keywords or $a$ belongs to a set of 3700 French political actors listed by the Politoscope project [4]. The dataset covers the month of August 2016; it contains 1,142,004 retweets and involves 211,155 different users. By combining different contexts, we analyze the possible reasons why some events emerge more than others and, in particular, determine whether they are global phenomena or whether they originate from specific actors only.

We define an event to be an abnormal hour $(d^*, h^*) \in D \times H$. If we look for abnormal hours in the basic context, i.e. for abnormal observations $o^* \in O$ such that $O = \{v(\cdot, \cdot, d, h) \mid (d, h) \in D \times H\}$, extreme values would only highlight trivial abnormalities which might only be related to the circadian rhythm as well as the overall trend of the month (see Fig. 2).

Fig. 2 Number of retweets per hour, $v(\cdot, \cdot, d, h)$, for each hour $(d, h)$ of the month of August 2016. Note that due to a server failure from Tuesday the 9th to Thursday the 11th, no activity is observed during this period

To detect more subtle and local events, we can consider the aggregated and expected context. Indeed, in this context, the resulting abnormal hours are independent of daily variations as well as of the time of the day: in $C_2(D \times H, v)$, relatively to $C_1(D, v)$, we consider a cell $c^* = (d^*, h^*)$ to be abnormal if the distance, $l(d^*, h^*)$, between the proportion of retweets observed during hour $h^*$ among all retweets of day $d^*$,

$$p(d, h) = \frac{v(\cdot, \cdot, d, h)}{v(\cdot, \cdot, d, \cdot)},$$

and its expected proportion $p_{\exp}$ during this specific hour of the day $h^*$,

$$p_{\exp}(h^*) = \frac{1}{|D|} \sum_{d \in D} p(d, h^*),$$

is abnormal compared to most distances observed for other hours $(d, h) \in D \times H$.

Fig. 3 Abnormal hours in the aggregated and expected context. [Plot: distribution of $l(d, h)$, with the abnormal hour (24th, 20h) annotated.]

Figure 3 shows the distribution of the set of observations $O = \{l(d, h) \mid (d, h) \in D \times H\}$. As expected, most observations $o \in O$ follow a normal distribution centered on 0 (gray zone), whereas some significantly deviate from it. This means that most proportions are likely to be generated by a Poisson counting process of intensity $p_{\exp}(d, h)$ while others are not. We find 15 abnormal hours for which the proportion of retweets is significantly higher than expected: $O^*$ = {(3rd, 11h), (12th, 23h), (21st, 21h), (22nd, 17h), (22nd, 18h), (22nd, 19h), (24th, 20h), (24th, 21h), (24th, 22h), (25th, 19h), (26th, 16h), (27th, 15h), (28th, 14h), (28th, 15h), (29th, 8h)}.

Now, we focus on determining whether an hour’s abnormality is due to specific authors, who have been retweeted predominantly, or, on the contrary, results from a more global phenomenon. To do so, we study interactions in a restrained context by considering the entities $(a, d, h) \in A \times T^*$, where $T^* \subseteq O^*$. For the same reasons as above, we use the aggregated and expected context: in data cube $C_3(A \times T^*, v)$, relatively to data cube $C_2(T^*, v)$, we consider a cell $c^* = (a^*, d^*, h^*)$ to be abnormal if the distance, $l(a^*, d^*, h^*)$, between the proportion of retweets received by author $a^*$ during hour $(d^*, h^*)$ among all retweets of hour $(d^*, h^*)$,

$$p(a^*, d^*, h^*) = \frac{v(\cdot, a^*, d^*, h^*)}{v(\cdot, \cdot, d^*, h^*)},$$

and its expected proportion $p_{\exp}$ during this specific hour of the day $h^*$,

$$p_{\exp}(a^*, h^*) = \frac{1}{|D|} \sum_{d \in D} p(a^*, d, h^*),$$

is abnormal compared to most distances of other triplets $(a, d, h) \in A \times T^*$.

Fig. 4 Evolution of abnormal authors on the 24th of August from 20 to 22 h

Figure 4 displays the distribution of the set $O = \{l(a, d, h) \mid (a, d, h) \in A \times \{(24\text{th}, h')\}\}$, where $h'$ successively takes the values 20, 21, and 22 h. This event corresponds to a television interview of Nicolas Sarkozy at 20 h. The study of this event with our method enables us to illustrate political communication via Twitter: the more time passes, the more homogeneous the distributions become. This shows that the event becomes a global phenomenon as information spreads: first, politicians tweet and are retweeted during the interview; 1 h later, journalists propagate their analyses and are retweeted, and information reaches individuals, who start to tweet and be retweeted as well; finally, at 22 h, information reaches a larger scale and more and more individuals react and get retweeted.
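The text describes abnormal hours as observations that deviate significantly from a bulk of values centered on 0, but it does not state the decision rule that was used. The sketch below swaps in a common robust rule, flagging values that lie more than `k` scaled median absolute deviations (MAD) from the median. This rule, the function name `flag_outliers`, the threshold `k`, and the toy distances are all assumptions made for illustration and should not be read as the authors' procedure.

```python
import statistics


def flag_outliers(distances, k=5.0):
    """Return the keys whose distance lies more than k scaled MADs from the median.

    `distances` maps a cell, e.g. (day, hour), to its distance l.
    """
    values = list(distances.values())
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0:
        return []
    return sorted(key for key, v in distances.items() if abs(v - center) > k * scale)


# Hypothetical distances l(d, h) for a few hours of one day.
toy = {
    ("24th", 12): 1.0,
    ("24th", 13): -2.0,
    ("24th", 14): 0.5,
    ("24th", 15): -0.5,
    ("24th", 16): 3.0,
    ("24th", 20): 950.0,
}
abnormal = flag_outliers(toy)
```

On this toy input only the hour `("24th", 20)` is flagged; a median-based rule is used here because a handful of extreme distances would otherwise inflate a mean and standard deviation.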

4 Conclusion and Future Work We provided a method to explore temporal interactions and find outliers in a multitude of different situations. We applied it to a set of politics-related retweets and showed that it successfully highlights events as well as abnormally retweeted users. Section 3 only presents a small part of the possibilities offered by our method. For instance, we could continue our study and look into the spreaders dimension to explore the cause of an author’s emergence. More generally, we could split the authors or spreaders dimension into sub-dimensions according to their political leaning. Using restrained contexts, this would allow us to study the behavior of each community separately as well as interactions between communities. We could also include additional semantic dimensions. For instance, we could consider 5-tuples (s, a, d, h, k) meaning that s retweeted a tweet written by a and containing the hashtag k at time (d, h). Applying similar contexts along the hashtag dimension would give us many more details on the content of events. Finally, the reaction to a television show through Twitter, as with Nicolas Sarkozy’s interview, shows that this study could be of interest to research on the use of a second web-connected screen while watching television; see, for instance, the work of Gil de Zúñiga et al. [5].

Acknowledgements This work is funded in part by the European Commission H2020 FETPROACT 2016-2017 program under grant 732942 (ODYCCEUS), by the ANR (French National Agency of Research) under grant ANR-15- E38-0001 (AlgoDiv), and by the Ile-de-France Region and its program FUI21 under grant 16010629 (iTRAC).

References
1. Bruns, A., Burgess, J.E., Crawford, K., Shaw, F.: #qldfloods and @QPSMedia: Crisis communication on Twitter in the 2011 South East Queensland floods. ARC Centre of Excellence for Creative Industries and Innovation, Queensland University of Technology, Brisbane (2012)
2. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. (CSUR) 41(3), 15 (2009)
3. Dong, X., Mavroeidis, D., Calabrese, F., Frossard, P.: Multiscale event detection in social media. Data Min. Knowl. Disc. 29(5), 1374–1405 (2015)
4. Gaumont, N., Panahi, M., Chavalarias, D.: Reconstruction of the socio-semantic dynamics of political activist Twitter networks–method and application to the 2017 French presidential election. PLoS ONE 13(9), e0201879 (2018)
5. Gil de Zúñiga, H., Garcia-Perdomo, V., McGregor, S.C.: What is second screening? Exploring motivations of second screen use and its effect on online political participation. J. Commun. 65(5), 793–815 (2015)
6. Grasland, C., Lamarche-Perrin, R., Loveluck, B., Pecout, H.: International agenda-setting, the media and geography: a multi-dimensional analysis of news flows. L’Espace géographique 45(1), 25–43 (2016)
7. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam (2011)
8. Li, R., Lei, K.H., Khadiwala, R., Chang, K.C.-C.: Tedas: a twitter-based event detection and analysis system. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 1273–1276. IEEE, Piscataway (2012)
9. Ribeiro, M.H., Calais, P.H., Santos, Y.A., Almeida, V.A., Meira Jr., W.: Characterizing and Detecting Hateful Users on Twitter. arXiv preprint arXiv:1803.08977 (2018)
10. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th International Conference on World Wide Web, pp. 851–860. ACM, New York (2010)
11. Stieglitz, S., Dang-Xuan, L.: Political communication and influence through microblogging–An empirical analysis of sentiment in Twitter messages and retweet behavior. In: 2012 45th Hawaii International Conference on System Science (HICSS), pp. 3500–3509. IEEE, Piscataway (2012)
12. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online human-bot interactions: detection, estimation, and characterization. arXiv preprint arXiv:1703.03107 (2017)
13. Walther, M., Kaisser, M.: Geo-spatial event detection in the twitter stream. In: European Conference on Information Retrieval, pp. 356–367. Springer, Berlin (2013)

Social Media Vocabulary Reveals Education Attainment of Populations
Harith Hamoodat, Eraldo Ribeiro, and Ronaldo Menezes

Abstract Educational attainment is a major indicator of a nation’s development stage. Yet we often have to rely on census information that is collected too infrequently to capture the dynamics of certain regions. In this context, social media can act as a sensor of populations with up-to-date information. In this paper, we focus on revealing the relationship between social media vocabulary and educational attainment. This work was performed at the county level for 5 different geographical areas in the United States by comparing the educational attainment reported by the American Community Survey with the level of vocabulary used in social media in the same region over the same period as the census (2010–2015). Our results show that the social media vocabulary level can reveal the educational attainment of a population for specific areas, with a few exceptions concentrated in counties that have a high population density. As a secondary contribution, the sampling method used to calculate the vocabulary level in social media may also help when using large Twitter datasets in the context of social sensing.

H. Hamoodat, BioComplex Laboratory, Florida Institute of Technology, Melbourne, FL, USA
E. Ribeiro, Florida Institute of Technology, Melbourne, FL, USA
R. Menezes, BioComplex Laboratory, University of Exeter, Exeter, UK
© Springer Nature Switzerland AG 2019. S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_13

1 Introduction Social scientists and economists have for many years linked populations’ vocabulary size with educational attainment and other socioeconomic indicators [8, 9, 17, 22, 32]. Vocabulary size has also been linked to students’ progress in school and life [30]. The reason for this is simply that the knowledge a person has about a subject is based on the vocabulary that surrounds that subject [21]. Nowadays, social media is considered a good source of data about society, especially given the increasing use of social media and social networking websites by organizations and influential individuals [18]. Social media has become a common part of the daily life of billions of people worldwide [14]. Users of social media post their thoughts, activities, and opinions about almost every aspect of life [12]. Social media content therefore gives researchers new opportunities to study a variety of social phenomena such as tourism and hospitality [10, 20], immigration [3], education [24], and health [23, 26].

In this paper, we are interested in how the vocabulary people use in social media may reflect the level of education of a specific geographic region. Work on how social media acts as a sensor of the real world offers better ways to gauge the world because access to the data is instantaneous. Our approach is novel in that most studies of the relation between vocabulary and education depend on classical methods such as questionnaires or tests. Many linguists have shown the relationship between education and vocabulary size. Nagy and Anderson found that students from grades 3 through 12 can learn some 3000 new words each year if they read between 500,000 and a million running words of text per school year [25]. Another study reported large differences in vocabulary size between first graders (i.e., 2500 to 26,000 words) and graduate students (i.e., 19,000 to 200,000 words) [15]. This raises the question of vocabulary use online, which remains largely unexplored [28]. Becker [7] highlighted the importance of vocabulary growth by connecting vocabulary size to the academic achievement of disadvantaged students; he found that vocabulary knowledge was the main factor restricting the reading and academic progress of that specific student population beyond grade 3. Since Becker related vocabulary growth to academic achievement, several works have agreed that vocabulary acquisition is important to academic progress and that the relation between reading comprehension and vocabulary size is strong and unambiguous [5, 11, 31].

Our paper is structured as follows. In Sect. 2, we discuss the methods used to collect, pre-process, normalize, and standardize the data. Our results are presented in Sect. 3, where we show that our approach appears to indicate a relation between vocabulary size and educational attainment for specific regions in the USA. We conclude our work in Sect. 4.

2 Methodology 2.1 Data Curation Twitter is one of the most popular microblogging platforms on the Internet. Although other platforms such as Facebook and WhatsApp have more active users, Twitter’s popularity among researchers remains high because of the openness of its

Social Media Vocabulary Reveals Education Attainment of Populations

Fig. 1 Tweets distribution in 58 world cities. Each red dot represents a city; the cities are listed per country. This is the same data used by Lamanna et al. [19]

Table 1 Number of users and tweets for each year in the dataset

Year    Users        Tweets
2010    404,139      4,160,014
2011    1,591,575    20,703,647
2012    4,157,311    50,609,869
2013    6,970,707    107,347,885
2014    8,343,348    192,852,312
2015    6,357,308    193,259,271

data and its special structure [1]. The number of works that use Twitter as a social sensor continues to grow in number and variety [4, 6, 27, 29]. In our research, we used a large Twitter dataset containing about 569 million geolocalized tweets belonging to approximately 17 million users and covering about 5 years (from May 07, 2010 to July 01, 2015). The tweets are spread over 58 cities in 32 countries (see Fig. 1). After removing tweets that lacked the necessary information (i.e., missing latitude, longitude, or text), which represented 0.05% of the original dataset, we split the resulting dataset by calendar year into 6 parts, 2010 to 2015 (2010 and 2015 are partial; the other years are complete). It is worth noting that the number of tweets and users increases dramatically from 2010 to 2015 (Table 1), which is consistent with Twitter’s official statistics. We also cleaned the dataset to remove spurious records (e.g., users who appear to be moving in the city too fast), duplicated tweets, and users who have more than one tweet in less than 2 s (possibly online bots). In total, the cleaning procedure preserved 98% of users and more than 93% of tweets from the initial dataset (Table 2).
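The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (dicts with the keys `user`, `time`, `lat`, `lon`, `text`) and the function name `clean_tweets` are hypothetical, duplicates are taken to mean identical user, time, and text, and the speed-based filter for users moving too fast is omitted because the text does not give its threshold.

```python
from collections import defaultdict


def clean_tweets(tweets, min_gap_seconds=2.0):
    """Apply three of the filters described in the text.

    Each tweet is a dict with hypothetical keys:
    'user', 'time' (seconds), 'lat', 'lon', 'text'.
    """
    # 1. Drop records that lack latitude, longitude, or text.
    complete = [
        t for t in tweets
        if t.get("lat") is not None and t.get("lon") is not None and t.get("text")
    ]

    # 2. Drop duplicates (assumed here: same user, time, and text).
    seen = set()
    unique = []
    for t in complete:
        key = (t["user"], t["time"], t["text"])
        if key not in seen:
            seen.add(key)
            unique.append(t)

    # 3. Flag users with two tweets less than min_gap_seconds apart.
    times_by_user = defaultdict(list)
    for t in unique:
        times_by_user[t["user"]].append(t["time"])
    suspected_bots = set()
    for user, times in times_by_user.items():
        times.sort()
        if any(later - earlier < min_gap_seconds
               for earlier, later in zip(times, times[1:])):
            suspected_bots.add(user)

    # Remove every tweet of a flagged user.
    return [t for t in unique if t["user"] not in suspected_bots]
```

Removing all tweets of a flagged user, rather than only the offending pair, follows the wording "users who have more than one tweet in less than 2 s"; whether the original pipeline did the same is not stated.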


H. Hamoodat et al.

Table 2 Number of users and tweets for each year after removing tweets which do not reflect human physical presence in either the reported place or time

Year    Users        Tweets
2010    395,733      3,640,600
2011    1,563,298    18,111,598
2012    4,152,449    46,844,592
2013    6,961,825    102,898,323
2014    8,334,160    179,256,310
2015    6,346,505    178,663,575

Fig. 2 Removing tweets located outside the specific boundaries. Fulton county in Atlanta city is shown here as an example. The dark points represent the tweets inside the county of interest

It is worth mentioning that the dataset was collected by specifying a bounding box for each city on our list. However, in order to study several regions that may share borders, we used only the tweets that fall inside the borders of the county of interest and ignored all the tweets surrounding the cities or regions, as shown in Fig. 2. In this work, we tested 5 different regions in the United States to cover a variety of demographic areas. Because English is the dominant language in the United States, we do not consider texts in other languages. Furthermore, numbers, special characters, links, function words, and punctuation were removed from the tweet text because we do not want them to be considered part of the vocabulary. Table 3 shows the number of counties in each area studied and the range of the number of tweets in each area (variation among the counties).
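The text-cleaning step can be illustrated with a small tokenizer. The sketch below is an assumption about how such a filter might look, not the authors' code: the name `tweet_tokens`, the regular expressions, and the deliberately tiny `FUNCTION_WORDS` list are all hypothetical, and a real run would need a fuller function-word list and a language filter.

```python
import re

# Tiny illustrative stop list (hypothetical); a real run would use a
# much fuller list of English function words.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "it", "at", "on", "for",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def tweet_tokens(text):
    """Return the content words of one tweet.

    Links are removed first, then every character that is not a lowercase
    letter or whitespace (digits, punctuation, special characters) is
    replaced by a space, and finally function words are dropped.
    """
    text = URL_RE.sub(" ", text.lower())
    text = NON_LETTER_RE.sub(" ", text)
    return [word for word in text.split() if word not in FUNCTION_WORDS]
```

Note that this simple rule splits contractions (for example "don't" becomes "don" and "t") and keeps the letters of tokens such as "7pm"; whether that matters depends on how the vocabulary is counted downstream.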


Table 3 Number of counties for each region and the region’s minimum and maximum number of tweets among all counties

Region           Number of counties    Range of tweets
Atlanta          29                    5.5 K–1.4 M
Houston          8                     45 K–4.3 M
Chicago          14                    7 K–4.3 M
San Francisco    8                     70 K–1.2 M
Maryland         14                    36 K–1.2 M

The second dataset in this work records the educational attainment for each county in the United States. The records are from the US Census Bureau’s American Community Survey (ACS) for the same period as our Twitter dataset (i.e., 2010–2015). This dataset is used as the ground truth in our study.

2.2 Data Sampling

The number of tweets in our dataset varies considerably from region to region (see Table 3). As a result, if we want to capture the level of vocabulary of a region, r, we must do so in a way that avoids bias towards regions with a larger number of tweets. Our approach is relatively simple and consists of taking a certain number of samples, s, for each region, where each sample has n tweets. We describe later how we assigned values to s and n; given that these values have been chosen correctly, we calculate the vocabulary index of a region for a given number of samples, V_r(s), using Eq. (1). In essence, V_r(s) is the average, over the samples, of the proportion of distinct words in a Twitter sample to the total number of words in the same sample for a given region, and is given by:

\[ V_r(s) = \frac{1}{s} \sum_{k=1}^{s} \frac{U_k}{N_s}, \tag{1} \]

where U_k is the number of distinct words in the k-th random sample of n tweets, N_s is the total number of words in the same sample for region r, and s is the number of samples chosen in each region. This number must be chosen so that the normalization retains enough information regarding the growth in vocabulary in a region. To determine the number of samples s and the size of each sample, we analyzed the variance and the stability of the growth in the vocabulary as a function of the sample size. Here, we propose a new approach that normalizes the dataset by fixing the number of tweets for all the counties under study. We used the β parameter extracted from Heaps’ law (see Eq. (2)), which describes the vocabulary growth in texts [2], to determine the number of samples and the size of each sample. In order to find an appropriate number of samples s, we calculate the value of β for the different


Fig. 3 The effect of increasing the number of samples and the size of the sample on the average β value. Each row represents one county in the Atlanta region (Heard County, Barrow County, and Fulton County). Every plot in a row corresponds to one sample size (1000, 3000, 5000, 7000, or 10,000 tweets) and covers 1 to 10 samples. The 5th plot was removed for Heard County because its number of tweets is less than 10,000. This approach has been used for all cities; we show only Atlanta here

number of samples, from 1 to 10. For each number of samples, s, we repeat the calculations 100 times and use the variance of β to determine the value of s (Fig. 4a); both the stability and the variance of β inform our choice of s. Heaps’ law was proposed by Herdan [16] as a variation of Zipf’s law. It describes the growth of vocabulary as a function of the number of words in a text, capturing the diminishing returns in finding new distinct words as one is exposed to more text. The law is given by:

\[ V_R(n) = K n^{\beta}, \tag{2} \]

where V_R is the number of distinct words (vocabulary) in a text of size n, and K and β are parameters determined experimentally. Similar to the above method, we use the β parameter to examine and determine the size of the sample n. The values of β are estimated for different values of n, from 1000 to 10,000 in steps of 1000. We repeat the computation 100 times for each n in order to find the stable point for n. All previous calculations were done for all counties of the 5 regions (Fig. 3). After completing the calculation, we found that s is inversely proportional to n. Using knee detection [13] in Fig. 4b, the size of the sample appears to be between 3000 and 4000 tweets. Furthermore, the number of samples should be above 4, as indicated by the yellow region in Fig. 4a.
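To make the two quantities of this section concrete, the following sketch computes the vocabulary index of Eq. (1) and a Heaps' law exponent as in Eq. (2). It rests on assumptions not stated in the text: tweets are already tokenized into lists of words, the denominator of Eq. (1) is the word count of the same k-th sample, sampling is without replacement, and β is obtained by an ordinary least-squares fit in log-log space. The function names are hypothetical.

```python
import math
import random


def vocabulary_index(tweets, s, n, rng=None):
    """Average distinct-to-total word ratio over s random samples (Eq. 1).

    tweets: list of token lists, one per tweet (assumed non-empty).
    s: number of samples; n: number of tweets per sample.
    """
    rng = rng or random.Random(0)
    ratios = []
    for _ in range(s):
        sample = rng.sample(tweets, n)              # n tweets, no replacement
        words = [w for tokens in sample for w in tokens]
        ratios.append(len(set(words)) / len(words))  # U_k / N for this sample
    return sum(ratios) / s


def heaps_beta(words):
    """Estimate beta in V(n) = K * n**beta from a stream of words.

    Fits a straight line to (log n, log V(n)) by least squares; this
    fitting choice is an assumption, the paper does not specify one.
    """
    seen = set()
    xs, ys = [], []
    for position, word in enumerate(words, start=1):
        seen.add(word)
        xs.append(math.log(position))
        ys.append(math.log(len(seen)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Repeating `heaps_beta` over many random samples for each candidate s and n, and recording the mean and variance of the result, would give curves of the kind used to choose s and n.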

Fig. 4 Detection of sample size and sample count (axes: average β value against number of samples). On the left, the yellow box marks the chosen sample count; on the right, the detection of the sample size is shown. We again used the 29 counties of the Atlanta region as an example. (a) Number of samples. (b) Size of samples

3 Results and Discussion

Education attainment is an important part of well-being and is used in measures of income growth and quality of life, which are crucial factors in classifying a country as developed, developing, or under-developed. According to the US census data of the American Community Survey (ACS), the level of education varies quite broadly in the USA, from high in some counties to low in others. The educational attainment used in this work is that of the fraction of the population that is 25 years old and over (see footnote 1). Starting with Atlanta, we compared the ground truth from the ACS with the calculated value of V_r(s) for each county. We found that the correlation between the values was high, with r = 0.881 and a significant p-value (Fig. 5b). Moreover, we noticed that 13 counties show a perfect correlation. We then calculated the difference between the two datasets for each county in order to provide a visual representation of the correlations. The result is shown in Fig. 5a. In Maryland, we also observed a high correlation between the Twitter vocabulary and the ACS data for education attainment, with r = 0.819 (see Fig. 6b). Figure 6a shows the difference between the Twitter vocabulary estimate and the ACS data. Again, some counties have a higher difference than others, but the variance is smaller than the one we observed in Atlanta. The 3rd region in our results is San Francisco in California. We tested 8 counties comprising San Francisco and its surrounding counties. The difference between the calculated values of V_r(s) and the ACS data is small

1 People aged 25 to 34 form the largest fraction of Twitter users, with the second largest being those between 35 and 44 years old.


Fig. 5 Atlanta city region. (a) Map of the 29 counties; the color represents the difference between the educational attainment and the vocabulary ratio estimated from Twitter (color scale from 0.00 to 7.27). (b) Correlation analysis between the two measures for the region of Atlanta

Fig. 6 Maryland state region. (a) Map of the 14 counties; the color represents the difference between the educational attainment and the vocabulary ratio (color scale from 0.00 to 5.09). Harford county has the highest disagreement between the calculated Twitter vocabulary and the ACS information about education attainment. (b) Correlation between the two datasets for the counties in Maryland

compared with all other regions, as shown in Fig. 7a. Figure 7b shows the high correlation observed between the two datasets, with r = 0.881 and a significant p-value. The lowest correlation among the areas we studied in this work is for Chicago and its surrounding areas, with r = 0.753, which is nevertheless considered a high correlation (see Fig. 8b). In this case, it became more apparent that counties with cosmopolitan centers, such as Cook county (which contains the city of Chicago itself), tend to have the largest differences between vocabulary and education attainment. In Sect. 4 we propose some reasons for such differences. Fig. 8a, as before, shows a


Fig. 7 Eight counties from California in the San Francisco area. (a) Map of counties; the color represents the difference between the educational attainment and the vocabulary ratio (color scale from 0.00 to 1.77). The differences between the two datasets are small, with the highest value, 1.77, for Santa Clara county. (b) Correlation between the two datasets in San Francisco

Fig. 8 Chicago region with 14 counties. (a) Map of counties; the color represents the difference between the educational attainment and the vocabulary ratio (color scale from 0.00 to 4.28). The highest difference value is for Kane county, followed by Cook county. (b) Correlation analysis between the two datasets for the region of Chicago

color map in which the differences between the two metrics are shown by the color intensity. The last region in this work is Houston, which contains 8 counties. The correlation between the two datasets is r = 0.871 (see Fig. 9b). Similar to Chicago, we noticed that the third most populous county in the United States, Harris county, has the highest difference between the Twitter vocabulary and the ACS data, with a value of 5.86 (see Fig. 9a).
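The per-region correlations reported in this section compare, county by county, the vocabulary index with the ACS educational attainment. The text does not name the coefficient; the sketch below assumes the Pearson correlation coefficient and uses only the standard library. The function name `pearson_r` is hypothetical, and the significance test (p-value) is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences.

    xs could hold the vocabulary index of each county in a region and
    ys the corresponding educational attainment from the ACS.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Because the two measures live on different scales, the per-county differences shown in the maps presumably rely on the normalized and standardized values mentioned in Sect. 2; that step is not shown here.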


Fig. 9 Houston region with 8 counties. (a) Map of counties; the color represents the difference between the ACS educational attainment and the vocabulary ratio (color scale from 0.00 to 5.86). As in the case of Chicago, Harris county, being one of the most densely populated counties in the USA, has the highest difference and thus the weakest link between education attainment and vocabulary index. (b) Correlation between the two datasets for the Houston counties, which despite Harris county is still quite high overall

4 Conclusion

In this work, we showed that the vocabulary level on social media tends to reflect the educational attainment of the location the tweets come from. Our results show a high correlation between the two datasets for most of the regions. Furthermore, our method can serve as a simple measure of educational attainment for regions where official data may not be available, which is another contribution in the context of social media as sensors. This result alone opens several research avenues because it may enable, for instance, the classification of individuals within regions; that is, our study deals with averages, but it would be important to also look at finer granularity and perhaps understand the vocabulary level and the mix or composition of people within regions. Our results indicate that in areas with higher population density the correlation is weaker. Looking further into this, if we remove the non-English tweets from the dataset, the disagreement is less prominent, supporting the hypothesis that it comes from the cosmopolitan nature of such areas; the number of non-English tweets in more rural areas is smaller (as a fraction of the total number of tweets). Our results also suggest that social media text can be treated as regular text because the diversity of subjects leads to a richer vocabulary. Also, our proposed sampling method for Twitter data, which determines a minimum number of samples, can be used to study very large Twitter datasets while reducing the processing costs.


For future work, we plan to apply our method to several other cities and study the effect of tourism (transient population) on vocabulary growth. Also, we will study how vocabulary size changes across seasons, given that temporal aspects may influence how people communicate. Finally, it would be worth studying the relationship between vocabulary size and other factors such as income level and city size.

Acknowledgements

The authors would like to thank Bruno Gonçalves for providing the Twitter dataset from his work [19]. The data was invaluable to us.

References

1. Ahmed, W.: Using Twitter as a data source: an overview of social media research tools (updated for 2017). Impact of Social Sciences Blog (2017)
2. Al Rozz, Y., Hamoodat, H., Menezes, R.: Characterization of written languages using structural features from common corpora. In: Workshop on Complex Networks CompleNet, pp. 161–173. Springer, Berlin (2017)
3. Aswad, F.M.S., Menezes, R.: Refugee and immigration: Twitter as a proxy for reality. In: FLAIRS Conference, pp. 253–258 (2018)
4. Bagavathi, A., Krishnan, S.: Social sensors early detection of contagious outbreaks in social media. In: International Conference on Applied Human Factors and Ergonomics, pp. 400–407. Springer, Berlin (2018)
5. Baker, S.K., Simons, D.C., Kamenui, E.J.: Vocabulary acquisition: synthesis of the research. Technical Report 13, National Center to Improve the Tools of Educator, University of Oregon, Eugene (1995)
6. Bauman, K., Tuzhilin, A., Zaczynski, R.: Using social sensors for detecting emergency events: a case of power outages in the electrical utility industry. ACM Trans. Manag. Inf. Syst. 8(2–3), 7 (2017)
7. Becker, W.: Teaching reading and language to the disadvantaged what we have learned from field research. Harv. Educ. Rev. 47(4), 518–543 (1977)
8. Biemiller, A., Boote, C.: An effective method for building meaning vocabulary in primary grades. J. Educ. Psychol. 98(1), 44 (2006)
9. Bornstein, M.H., Haynes, O.M.: Vocabulary competence in early childhood: measurement, latent construct, and predictive validity. Child Dev. 69(3), 654–671 (1998)
10. Brahmbhatt, J., Menezes, R.: On the relation between tourism and trade: a network experiment. In: 2013 IEEE 2nd Network Science Workshop (NSW), pp. 74–81. IEEE, Piscataway (2013)
11. Brown, T.S., Perry, Jr. F.L.: A comparison of three learning strategies for ESL vocabulary acquisition. Tesol Q. 25(4), 655–670 (1991)
12. Chew, C., Eysenbach, G.: Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak. PloS One 5(11), e14118 (2010)
13. Clarke, B., Valdes, C., Dobra, A., Clarke, J.: A Bayes testing approach to metagenomic profiling in bacteria. Stat. Interface 8(2), 173–185 (2015)
14. Golder, S.A., Macy, M.W.: Digital footprints: opportunities and challenges for online social research. Annu. Rev. Sociol. 40, 129–152 (2014)
15. Graves, M.F.: Chapter 2: vocabulary learning and instruction. Rev. Res. Educ. 13(1), 49–89 (1986)
16. Herdan, G.: Type-Token Mathematics, vol. 4. Mouton, Berlin (1960)
17. Hoff, E.: The specificity of environmental influence: socioeconomic status affects early vocabulary development via maternal speech. Child Dev. 74(5), 1368–1378 (2003)
18. Kaplan, A.M., Haenlein, M.: Users of the world, unite! the challenges and opportunities of social media. Bus. Horiz. 53(1), 59–68 (2010)
19. Lamanna, F., Lenormand, M., Salas-Olmedo, M.H., Romanillos, G., Gonçalves, B., Ramasco, J.J.: Immigrant community integration in world cities. PloS One 13(3), e0191612 (2018)
20. Leung, D., Law, R., Van Hoof, H., Buhalis, D.: Social media in tourism and hospitality: a literature review. J. Travel Tour. Mark. 30(1–2), 3–22 (2013)
21. Marzano, R.J., Pickering, D.J.: Building Academic Vocabulary: Teacher’s Manual. ERIC (2005)
22. Milton, J., Treffers-Daller, J.: Vocabulary size revisited: the link between vocabulary size and academic achievement. Appl. Linguist. Rev. 4(1), 151–172 (2013)
23. Moorhead, S.A., Hazlett, D.E., Harrison, L., Carroll, J.K., Irwin, A., Hoving, C.: A new dimension of health care: systematic review of the uses, benefits, and limitations of social media for health communication. J. Med. Int. Res. 15(4), e85 (2013)
24. Moran, M., Seaman, J., Tinti-Kane, H.: Teaching, Learning, and Sharing: How Today’s Higher Education Faculty Use Social Media. Babson Survey Research Group. Pearson, Boston (2011)
25. Nagy, W.E., Anderson, R.C.: How many words are there in printed school English? Read. Res. Q. 19, 304–330 (1984)
26. Pacheco, D.F., Pinheiro, D., Cadeiras, M., Menezes, R.: Characterizing organ donation awareness from social media. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 1541–1548. IEEE, Piscataway (2017)
27. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th International Conference on World Wide Web, pp. 851–860. ACM, New York (2010)
28. Schwartz, H.A., Eichstaedt, J.C., Kern, M.L., Dziurzynski, L., Ramones, S.M., Agrawal, M., Shah, A., Kosinski, M., Stillwell, D., Seligman, M.E.P., Ungar, L.H.: Personality, gender, and age in the language of social media: the open-vocabulary approach. PloS One 8(9), e73791 (2013)
29. Siragusa, G., Leone, V.: Such a wonderful place: extracting sense of place from Twitter. In: European Semantic Web Conference, pp. 397–405. Springer, Berlin (2018)
30. Sprenger, M.: Teaching the Critical Vocabulary of the Common Core: 55 Words that Make or Break Student Understanding. ASCD, Alexandria (2013)
31. Walker, D., Greenwood, C., Hart, B., Carta, J.: Prediction of school outcomes based on early language production and socioeconomic factors. Child Dev. 65(2), 606–621 (1994)
32. Wilson, S., Thorne, A., Stephens, M., Ryan, J., Moore, S., Milton, J., Brayley, G.: English vocabulary size in adults and the link with educational attainment. Lang. Focus 2(2), 44–69 (2016)

Exploring the Role and Nature of Interactions Between Institutes in a Local Affiliation Network

Chakresh Kumar Singh, Ravi Vishwakarma, and Shivakumar Jolad

Abstract In this work, we have studied the collaboration and citation network between Indian Institutes from publications in American Physical Society (APS) journals between 1970 and 2013. We investigate the role of geographic proximity in the network structure and find that it is the characteristics of the Institution, rather than the geographic distance, that play the dominant role in collaboration networks. We find that Institutions with better federal funding dominate the network topology and play a crucial role in overall research output. We find that the citation flow across different categories of institutions is strongly linked to the collaborations between them. We have estimated the knowledge flow in and out of Institutions and identified the top knowledge sources and sinks.

1 Introduction

Academic institutes in a country are the biggest stakeholders in knowledge production, diffusion, and innovation. Institutions nurture the manpower and provide the resources to conduct research. The cumulative effort of academic institutions, industry, and government agencies is essential for building an efficient knowledge economy [1, 2]. Studies [3] suggest that developed countries dominate in their share of total research output measured via publications and citations. However, in recent years developing countries like India, Brazil, and China have significantly increased their global share of research output. Exploring and understanding the major factors and policies leading to this accelerated growth is of interest to both academicians and policy makers [3–6].

C. K. Singh · S. Jolad
Indian Institute of Technology, Gandhinagar, India

R. Vishwakarma
IISER, Kolkata, India

© Springer Nature Switzerland AG 2019
S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3_14


C. K. Singh et al.

Flow of scientific knowledge across people, institutions, and countries through collaborations and citations determines the evolution of scientific discoveries and technological growth. Quantitative analysis of different forms of networks constructed from bibliometric data provides insight into the underlying structural and dynamic properties of scientific collaboration [7, 8]. In the last two decades, the rapid growth of network science and the availability of large-scale data on scientific publications have led to large-scale studies of the patterns of scientific collaboration and citations [9, 10]. Analyses of the evolution of co-authorship and citation networks have largely focused on the interactions between individuals and Institutions at the global level to explain the functioning of the ecosystem of scientific collaboration [11, 12]. These studies have shown broad features such as power-law behavior of the collaboration networks [9], preferential attachment [13], knowledge flow maps [11], aging in collaboration strength and citations [14–16], and geographic proximity [2, 17–19].

In this work, we focus on the collaboration and citation networks in American Physical Society (APS) journals with at least one author with an Indian affiliation. The motivation behind restricting ourselves to a country-specific study at the mesoscopic Institution level is threefold. First, studies on large-scale datasets of scientific collaboration networks at the global level often mask the small-scale dynamics that are specific to Institutions, cities, and countries. While large-scale studies highlight the global average trend in network measures, small-scale studies give deeper insight into the nature of the interactions between institutes that drive the collaborations [11, 18–21]. Second, investigating the behavior of these networks at the country level helps us reveal a multitude of factors, such as the type, characteristics, and location of Institutions, which influence collaborations. Third, extracting the factors influencing collaborations is useful in framing higher education and research policy, and in allocating and prioritizing resources at the Institutional level.

In this work, we have constructed different types of networks representing collaboration between Institutions and citation flow between Institutions and, more broadly, across categories of Institutions. Using different network measures, we have analyzed the strength of collaboration between Institutions and the importance of Institutions, constructed the spatial network of collaborations, and analyzed the role of geographical proximity in collaboration.

2 Data

We use journal papers published by the American Physical Society (APS) between 1970 and 2013 in the journals Physical Review A–E, Physical Review Letters, and Reviews of Modern Physics. Since our study is restricted to India, we have chosen all the articles with at least one author with an Indian affiliation. The total number of such papers was 14,704. From each of these articles we extract the affiliations of all the authors. We mark all the non-Indian affiliations in our subset as “Foreign” and extract only their respective countries. For the Indian affiliations, we extract the Institute name, type of Institution, city, and pin-code. Different authors choose to write their affiliation in different ways, as there is no standardized naming of Institutions and their addresses in India. Ambiguity in affiliation naming could be due to different ways of writing the address, inclusion or exclusion of department names, usage of abbreviations for the Institute or department, separators and punctuation, etc. For example, Jawaharlal Nehru Centre for Advanced Scientific Research (JNCASR), Bangalore has been written in 110 ways (some of which are given in the updated manuscript) and Indian Institute of Technology (IIT) Bombay appeared in 143 different ways. We first compared the institutes’ names and clustered similar affiliations. Second, we manually checked every affiliation for repetitions and assigned unique IDs. In this way we reduced the total number of distinct institutes from 7180 to 677. Out of the reduced set, we could map 628 institutes to their pin-code locations. After cleaning the data, we classified each institute based on the categories in Table 1, and constructed networks for our analysis. We use the classification of Indian higher education Institutions by the University Grants Commission (UGC) of India [22], which is based on the degree-awarding category, managing bodies such as state, central, or private, and sources of funding (see Table 1). We also included special categories which are certified by the UGC but not given a standard category (such as Private Institutes and State Research Institutes).

Table 1 Categories of institutions

Type of institutes                   Acronym   Function
National Research Institutes         NRI       Research institutions funded by the central government
Institutes of National Importance    INI       Teaching (both UG and PG) and research institutions, declared as INI by Government of India
Central Universities                 CU        Public universities formed by central act
State Universities                   SU        Public universities formed by state act
State Colleges                       SC        Colleges affiliated to state universities
Central Colleges                     CC        Colleges affiliated to central universities
Deemed Universities                  DU        Public or private universities which can award degrees on their own, and declared as deemed by UGC
Private Universities                 PU        Universities established through a state or central act by a sponsoring body
Private Institutes                   PI        Stand-alone private institutions recognized by government
State Research Institutes            SRI       Research institutions funded by the state government
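The first, automatic pass of the affiliation cleaning (comparing names and clustering similar spellings) can be illustrated as follows. The paper does not say which similarity measure was used; this sketch assumes a character-level ratio from Python's `difflib`, a greedy assignment to the first sufficiently similar cluster, and a hypothetical threshold of 0.85. It would not merge abbreviations such as "IIT Bombay" with the spelled-out name, which is one reason a manual check like the one described above remains necessary.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation by spaces, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    return " ".join(name.split())


def cluster_affiliations(names, threshold=0.85):
    """Greedily group affiliation strings whose normalized forms are similar.

    Returns a list of clusters; the first string of each cluster acts as
    its representative. The threshold value is an illustrative assumption.
    """
    clusters = []
    for raw in names:
        key = normalize(raw)
        for cluster in clusters:
            representative = normalize(cluster[0])
            if SequenceMatcher(None, key, representative).ratio() >= threshold:
                cluster.append(raw)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([raw])
    return clusters
```

Each resulting cluster would then receive a unique ID after manual inspection.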


3 Methods

We have explored collaborations by constructing networks at the level of the Institution, its geographic location, and its category. This allows us to explore the network properties at multiple scales by constructing super nodes from individual nodes. We also explore citations between these Institutions to assess the knowledge flow between Institutions and their categories.

Construction of Networks

Institute Collaboration Network: We construct a weighted undirected network with institutes as nodes, where the edge weight between two nodes i, j represents the number of co-authored pairs between these Institutions. In Fig. 1, we show the map of collaborations between Indian Institutions and different countries of the world.

Institute Citation Networks: Here, the weighted directed network is constructed with institutes as nodes, and for two nodes i, j, the edge weight e(i → j) from i to j denotes the number of times authors from i have cited authors from j.

Network Based on Institution Type: Institutions of the same type are clubbed into a single super node, and the network based on collaborations or citations between super nodes is constructed as in Figs. 4 and 7.

To track the evolution of these networks, we construct cumulative graphs at 1-year intervals from 1970 to 2013. At a given time t, the network has information about all the collaborations or citations between the nodes from 0 to t.

Network Measures

We measure the normalized strength of collaboration between two institutes, following [18], by

\[ N_{ij} = \frac{C_{ij}}{w_i \times w_j}, \]

where C_ij is the number of common papers between nodes i and j, and w_i and w_j are the number of papers published individually by i and

Fig. 1 Map of India’s global collaboration based on the publications in APS journals between 1970 and 2013. Each red dot within India is an institute, while the dots outside India represent the capital cities of the respective countries


j, respectively. To characterize the structural significance of nodes in the network we use the following measures: betweenness centrality, average degree, clustering, and PageRank centrality [23]. The knowledge flow out of and into a node i is measured in the Institute citation network as

\[ \text{(a)}\quad F_i^{\mathrm{out}} = k_i^{\mathrm{in}} \times \frac{W_i^{\mathrm{in}}}{W_i^{\mathrm{in}} + W_i^{\mathrm{out}}} \qquad \text{and} \qquad \text{(b)}\quad F_i^{\mathrm{in}} = -k_i^{\mathrm{out}} \times \frac{W_i^{\mathrm{out}}}{W_i^{\mathrm{in}} + W_i^{\mathrm{out}}}, \]

where k_i^in and k_i^out are the in-degree and out-degree of a node, and W_i^in and W_i^out are its total incoming and outgoing weights, respectively. For our analysis, we performed measurements on the cumulative collaboration and citation networks between institutes up to 2013. The centrality value of each super node in every case is the average of the values of its constituents. We measure the distance between two Institutions as the Vincenty (great-arc) distance between the pin-codes representing these Institutions. We group the distances into 50 km bins. The Gephi [24] software and the NetworkX [25] package in Python were used for calculations and visualizations.
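As an illustration of two of the measures in this section, the sketch below computes the normalized collaboration strength N_ij from a list of papers and assigns a pair of locations to a 50 km distance bin. Several points are assumptions rather than the authors' implementation: each paper is represented by the set of institute IDs on it, w_i is taken to be the number of papers with at least one author from i, and the haversine formula on a spherical Earth is swapped in for the Vincenty distance named in the text. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def collaboration_strengths(papers):
    """Return {(i, j): N_ij} with N_ij = C_ij / (w_i * w_j).

    papers: iterable of sets of institute IDs, one set per paper.
    """
    w = Counter()  # w_i: number of papers involving institute i
    c = Counter()  # C_ij: number of papers shared by i and j
    for institutes in papers:
        w.update(institutes)
        for i, j in combinations(sorted(institutes), 2):
            c[(i, j)] += 1
    return {(i, j): count / (w[i] * w[j]) for (i, j), count in c.items()}


def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine distance in km (a spherical stand-in for Vincenty)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def distance_bin(d_km, width_km=50.0):
    """1-based bin index k such that width*(k-1) <= d < width*k."""
    return int(d_km // width_km) + 1
```

Grouping the N_ij values by `distance_bin` of the distance between the two institutes' pin-code coordinates would give the per-bin distributions examined in Sect. 4.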

4 Results

In our analysis, we have addressed four questions related to collaboration, affiliation, distance between Institutions, and type of Institution, based on the analysis of the different types of network discussed in the Methods section.

Does Collaboration Depend on Geographic Proximity?

With the advancement in telecommunication and transportation technology, it seems natural that communication has overcome the distance barrier [26, 27]. However, studies have shown that geographic proximity still plays a role in establishing connections [2, 18, 19]. In our study we address this question by measuring the change in the frequency and strength of collaboration with the distance between Institutions. The top panel of Fig. 2 shows box plots of the strength of collaboration (N_ij), as defined in Sect. 3, for different distance bins. Each bin b_k is 50 km wide and includes all the pairs i, j such that 50(k − 1) ≤ d_ij < 50k. There is a broad declining trend in the median of the normalized collaboration strength with distance. However, after the 31st bin (1500–1550 km), there is a surge in collaborations and then the trend is uneven. The bottom left panel (b) shows the average strength of collaboration ⟨N_ij⟩ versus distance for two time periods, 1970–95 and 1996–2013, corresponding to the pre- and post-Internet eras in India. We see a flattening of the trend for 1996–2013, which indicates a weaker dependence on distance with the advancement of telecommunication technologies. Panel (c) shows the cumulative strength of collaborations up to 2013 on a log-linear scale. After b_1, there is a big drop in N_ij. People collaborate mostly within their own Institutions and with people in their city. Afterwards, the collaborations broadly decrease, but there are many spikes


Fig. 2 Dependence of the strength of collaboration on the geographic distance between Institutes. Panel (a): box plot of the collaboration strength $N_{ij}$ versus distance (in multiples of 50 km). Panel (b): mean strength of collaboration $\langle N_{ij} \rangle$ versus distance for two time periods (note the change in trend for 1996–2013). Panel (c): average cumulative strength of collaboration $\langle N_{ij} \rangle$ (1970–2013) versus distance in multiples of 50 km (note the change in y-axis scale)

in between, which is likely due to peaks in the pair correlation function between the populations of cities, $C(r) = \langle P(x)\,P(x + r) \rangle$. There is no indication of a power-law decay in $\langle N_{ij} \rangle$ with distance. To explain the variance in collaboration versus distance, we split the collaborations into Institutional groups (categories) as in Table 1 and study the frequency of collaborations between four different pairs of groups, SC−X, SU−X, NRI−X, and INI−X, as in Fig. 3. Here X denotes all categories of Institutions combined. The State Colleges (SC) and State Universities (SU) collaborate more strongly with Institutions in close proximity than with those in farther cities (top panels). On the other hand, National Research Institutions (NRI) and Institutes of National Importance (INI) do not show a strong dependence on distance. In all graphs we notice an increase in the frequency of collaborations at distances of 750–1650 km (bins 15–35). This is largely due to collaborations between Institutions located in highly populous metropolitan areas such as Delhi, Kolkata, Mumbai, Bangalore, and Chennai. The aerial distances between these cities lie in this range. We argue that the strength of collaboration between NRI and INI in major cities can be the reason for the fluctuations in Fig. 2c.

Does Collaboration Between Institutions Depend on Their Productivity?

The number of publications by authors affiliated with an institute is a strong indicator of its research output. We hypothesize that collaboration strength depends on the


Fig. 3 Comparing the frequency of collaboration of SCs, SUs, NRIs, and INIs with institutes of other categories, denoted by -X. SCs and SUs have more local collaborations, while NRIs and INIs have collaborations spread over wider distances

Table 2 Number of papers from different types of institutes in the dataset studied till 2013

                      NRI    INI    CU     SU     SC     CC    DU    PU    PI    SRI
Papers                9292   2635   2083   3438   1482   1     9     85    57    25
Institutions          76     46     32     109    301    1     4     18    19    6
Papers per institute  122.3  57.3   65.1   31.5   4.9    1     2.25  4.7   2.68  4.17

category of Institutions and its productivity. We build a network of Institutional categories by creating super nodes from the individual nodes, as described in the methods section. In Table 2, we tabulate the number of papers, the number of institutions, and the papers per Institute in each category. Of all the publications in the dataset, NRIs contribute 63% of the papers, followed by SUs (23%) and INIs (18%). The total research productivity is highest for NRIs (9292), followed by SUs (3438) and INIs (2635). The average productivity (papers per Institute) is highest for NRIs (122.3), followed by CUs (65.1) and INIs (57.3). In Fig. 4, we show the collaboration network between Institution categories (panel a) and the corresponding weighted adjacency matrix (panel b). In panel (a), the size of a node represents the total number of publications. Edge width shows the number of collaborations between authors of the Institutions. Groups are arranged in decreasing order of their productivity, measured in papers per Institution in the


Fig. 4 Collaboration between different types of institutions. (a) Network representation. The size of a node is proportional to the total number of papers published by institutes in that category (categories as in Table 1). Edge width is proportional to the number of co-authorship events. Self edges represent collaboration amongst institutes of the same kind. (b) Matrix representation of the collaboration in panel (a), where the types of institutions are sorted according to their productivity

category. We see that the highly productive groups in the top left corner collaborate most among themselves. The NRI, CU, and INI lead in relative contribution. Some premier institutions that fall in these categories are the Indian Institute of Science (IISc), Saha Institute of Nuclear Physics (SINP), Punjab University, Banaras Hindu University (BHU), Institute of Mathematical Sciences (IMSc), Tata Institute of Fundamental Research (TIFR), and the various Indian Institutes of Technology (IITs). These institutes are mostly autonomous and are the most favored centers for pursuing higher education in India.

Network Structural Differences Across Different Institutions and Their Types

In Fig. 5, we show the cumulative Institute collaboration network from APS publications in India as of 2013. The nodes are colored according to their category as in Fig. 4a and spatially located based on their pin-codes. In Fig. 6, we compare four different measures (average degree, clustering coefficient, betweenness, and PageRank) for the five most productive categories of Institutions. These measures help us to assess the strength and dominant role of each category of Institutions within the network. Average degree gives the average number of connections per node, betweenness measures the centrality of a node in connecting different parts of the network, PageRank measures the importance of a node, and clustering measures the average connectivity of a node's neighborhood [13, 23]. NRIs have the highest average degree, betweenness, and PageRank, indicating their dominant position in the collaboration network. Central Universities have the highest average clustering coefficient, highlighting their role in bringing different types of Institutions into collaborations. State Colleges, though they fare low in average degree, betweenness, and PageRank, tend to form highly clustered groups in the network.
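The per-category comparison of centrality measures can be sketched with NetworkX along the following lines. This is a minimal illustration, not the authors' pipeline: the node attribute name `category`, the helper `category_averages`, and the toy graph are all assumptions made for the example. PageRank is left out of the sketch; `nx.pagerank` also returns a per-node dictionary, so it could be averaged in the same way.

```python
from collections import defaultdict

import networkx as nx


def category_averages(G, category_attr="category"):
    """Average degree, clustering and betweenness over the nodes of each category."""
    measures = {
        "degree": dict(G.degree()),
        "clustering": nx.clustering(G),
        "betweenness": nx.betweenness_centrality(G),
    }
    # Group node names by the (assumed) category attribute
    members = defaultdict(list)
    for node, cat in G.nodes(data=category_attr):
        members[cat].append(node)
    # Category value = mean of the values of its constituent nodes
    return {
        cat: {
            name: sum(values[n] for n in nodes) / len(nodes)
            for name, values in measures.items()
        }
        for cat, nodes in members.items()
    }


# Toy collaboration graph with made-up institutes (illustrative only)
G = nx.Graph()
G.add_nodes_from(["A", "B"], category="NRI")
G.add_nodes_from(["C", "D", "E"], category="SC")
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "E"), ("C", "D")])

averages = category_averages(G)
for cat, vals in averages.items():
    print(cat, {name: round(v, 3) for name, v in vals.items()})
```

In this toy graph the two "NRI" nodes bridge the rest of the network and so score higher on degree and betweenness, while the "SC" nodes sit in a closed triangle and score higher on clustering, which mirrors the qualitative pattern reported for Fig. 6.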


Fig. 5 Collaboration between Indian institutes marked by their pin-codes in 2013. The nodes are colored based on their type as in Fig. 4

Fig. 6 Comparison of centrality measures for institutes grouped into different categories


Fig. 7 Institutes clubbed as super nodes representing citations exchanged between different types of institutes. Size of the node is proportional to the total number of papers published from institutes falling in the category. Edge width is proportional to the number of citations exchanged

Does Knowledge Flow Across Institutions Depend on the Category of Institutions?

Citations are an indirect measure of the flow of ideas between authors. At an aggregate level, citations between Institutions are an indicator of the knowledge flow across them [11]. The knowledge flow network based on the citations exchanged between Institutions (see methods for details) is shown in Fig. 7 (left). The corresponding directed and weighted adjacency matrix between types of Institutions is shown in the right panel of Fig. 7. Node size represents the total number of papers published. The NRI category is the largest in the group and also shows the most in-citations within the group. The matrix shows that the maximum citation flow is between highly productive Institutions such as NRI, CU, INI, SU, and SC. The pattern is similar to what we observe in Fig. 4. In Fig. 8, we show the Giant Connected Component (GCC) of the knowledge network at the Institutional level, and highlight the Institutes which receive many in-citations. These can be considered knowledge hubs in the Institutional network and are located in the major cities of India. Of all the nodes in the GCC, NRIs, INIs, CUs, and SUs have nodes that act as knowledge centers. Based on the given dataset, the biggest center for knowledge sharing is the Tata Institute of Fundamental Research (NRI). To compare the inward and outward flow of knowledge, we compute the effective inflow $F_i^{\text{in}}$ and outflow $F_i^{\text{out}}$ measures (see methods for details) for different Institutions in the GCC. The results, split according to category, are shown in Fig. 9. We find that the top knowledge sources also act as knowledge sinks.
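The weighted in- and out-degrees that underlie the hub ranking can be sketched with a NetworkX directed graph. Everything in this example is an assumption made for illustration: the institute names and citation counts are invented, an edge (u, v, w) is taken to mean that u cites v a total of w times, and the GCC is taken to be the largest weakly connected component. The effective flow measures defined in the methods section are not reproduced here.

```python
import networkx as nx

# Toy citation network; edge (u, v, w): institute u cites institute v w times
# (the edge direction is an assumption of this sketch)
C = nx.DiGraph()
C.add_weighted_edges_from([
    ("SU_1", "NRI_1", 5),
    ("SC_1", "NRI_1", 2),
    ("NRI_1", "INI_1", 3),
    ("INI_1", "NRI_1", 4),
    ("NRI_1", "SU_1", 1),
    ("PU_1", "PU_2", 1),  # small component outside the GCC
])

# Restrict to the largest weakly connected component
gcc_nodes = max(nx.weakly_connected_components(C), key=len)
gcc = C.subgraph(gcc_nodes)

# Total incoming and outgoing citation weights per institute
w_in = dict(gcc.in_degree(weight="weight"))
w_out = dict(gcc.out_degree(weight="weight"))

# Rank institutes by weighted in-degree (citations received)
hubs = sorted(gcc.nodes, key=lambda n: w_in[n], reverse=True)
for n in hubs:
    print(n, "in:", w_in[n], "out:", w_out[n])
```

Ranking by weighted in-degree corresponds to the node sizing used in Fig. 8; with real data the edge list would come from the aggregated institute-level citation counts.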


Fig. 8 Dominant institutes in the knowledge network constructed from the dataset. Size is proportional to weighted in-degree. All these institutes are located in major cities of India and act as knowledge hubs

Fig. 9 Effective incoming and outgoing citations shared by each node. This represents the knowledge transferring ($F_i^{\text{out}}$, positive y-axis) and receiving ($F_i^{\text{in}}$, negative y-axis) capacity of every node in the network. Each category of institute is color coded

5 Conclusion

To the best of our knowledge, this is the first study to map the collaboration and knowledge flow between different types of institutions in India. Using the Indian physics collaboration network, we have investigated whether the geographic scaling law (inverse distance) in scientific collaborations, usually studied at the global level, is valid at a smaller scale. Our results indicate no strong evidence for an inverse power-law dependence of collaboration strength on distance.


We have identified the types of Institutions which dominate the research output in India, measured through the number of papers, collaborations, and knowledge flow. We find that National Research Institutions (NRI), Central Universities (CU), and Institutes of National Importance (INI) dominate the research output in Physics based on the APS dataset. Major cities such as Delhi, Mumbai, Kolkata, Bangalore, and Chennai are the largest knowledge hubs in India, followed by Kanpur, Allahabad, Ahmedabad, and Bhubaneswar. These cities are also known to host premier educational and research Institutions in the country. State Universities and State Colleges collaborate closely with Institutions near them, while National Institutions like NRIs and INIs have broad collaborations in all major cities across India. Highly productive Institutions collaborate more amongst each other and cite each other’s work more frequently. We identified leading Institutions which act as knowledge sources. Our study was limited to Physics papers published in American Physical Society (APS) journals from 1970 to 2013 with at least one Indian affiliation. This does not cover the full spectrum of publications in India over different disciplines. Hence broad generalizations on the scientific output and flow cannot be made. However, results from our analysis are in agreement with reports that study India’s research output on a larger scale, and they give a reasonable idea of the existing knowledge network in India. We believe this study could be helpful for framing policies to promote research collaborations between institutes and the sharing of resources. In future, we plan to scale this study to include larger datasets, cover more indexed publications, and implement network modeling to understand the dynamics behind the observed evolution.

References

1. Chen, K., Zhang, Y., Zhu, G., Mu, R.: Do research institutes benefit from their network positions in research collaboration networks with industries or/and universities? Technovation (2017). https://doi.org/10.1016/j.technovation.2017.10.005
2. Laursen, K., Reichstein, T., Salter, A.: Exploring the effect of geographical proximity and university quality on university–industry collaboration in the United Kingdom. Reg. Stud. 45(4), 507–523 (2011)
3. SJR SCImago Journal & Country Rank [Portal]. Retrieved on September 1st, 2018 from http://www.scimagojr.com
4. Garfield, E.: Mapping science in the third world. Sci. Public Policy 10(3), 112–127 (1983)
5. Gupta, B., Dhawan, S.: Status of India in science and technology as reflected in its publication output in the Scopus international database, 1996–2006. Scientometrics 80(2), 473–490 (2009)
6. Arunachalam, S., Srinivasan, R., Raman, V.: Science in India–a profile based on India’s publications as covered by science citation index 1989–1992. Curr. Sci. 74(5), 433–441 (1998)
7. Herrera, M., Roberts, D.C., Gulbahce, N.: Mapping the evolution of scientific fields. PloS One 5(5), e10355 (2010)
8. Singh, C.K., Jolad, S.: Structure and evolution of Indian physics co-authorship networks. arXiv preprint arXiv:1801.05400 (2018)
9. Newman, M.E.J.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. 98(2), 404–409 (2001)
10. Barabási, A.-L., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Physica A Stat. Mech. Appl. 311(3–4), 590–614 (2002)
11. Mazloumian, A., Helbing, D., Lozano, S., Light, R.P., Börner, K.: Global multi-level analysis of the ‘scientific food web’. Sci. Rep. 3, 1167 (2013)
12. Dong, Y., Ma, H., Shen, Z., Wang, K.: A century of science: globalization of scientific collaborations, citations, and innovations. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1437–1446. ACM, New York (2017)
13. Newman, M.E.J.: Clustering and preferential attachment in growing networks. Phys. Rev. E 64(2), 025102 (2001)
14. Börner, K., Maru, J.T., Goldstone, R.L.: The simultaneous evolution of author and paper networks. Proc. Natl. Acad. Sci. 101(suppl 1), 5266–5273 (2004)
15. Hajra, K.B., Sen, P.: Aging in citation networks. Physica A Stat. Mech. Appl. 346(1–2), 44–48 (2005)
16. Wang, D., Song, C., Barabási, A.-L.: Quantifying long-term scientific impact. Science 342(6154), 127–132 (2013)
17. Katz, J.S.: Geographical proximity and scientific collaboration. Scientometrics 31(1), 31–43 (1994)
18. Pan, R.K., Kaski, K., Fortunato, S.: World citation and collaboration networks: uncovering the role of geography in science. Sci. Rep. 2, 902 (2012)
19. Ma, H., Fang, C., Pang, B., Li, G.: The effect of geographical proximity on scientific cooperation among Chinese cities from 1990 to 2010. PloS One 9(11), e111705 (2014)
20. Gaskó, N., Lung, R.I., Suciu, M.A.: A new network model for the study of scientific collaborations: Romanian computer science and mathematics co-authorship networks. Scientometrics 108(2), 613–632 (2016)
21. Hou, H., Kretschmer, H., Liu, Z.: The structure of scientific collaboration networks in scientometrics. Scientometrics 75(2), 189–202 (2007)
22. University Grants Commission (UGC) website. https://www.ugc.ac.in/
23. Newman, M.: Networks: An Introduction. OUP, Oxford (2010)
24. Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media, Munich (2009)
25. Hagberg, A., Schult, D., Swart, P.: Networkx: Python Software for the Analysis of Networks. Technical report. Mathematical Modeling and Analysis, Los Alamos National Laboratory, New Mexico (2005). http://networkx.lanl.gov
26. Friedman, T.: The World is Flat. Farrar, Straus and Giroux, New York, vol. 488 (2005)
27. Graham, S.: The end of geography or the explosion of place? Conceptualizing space, place and information technology. Prog. Hum. Geogr. 22(2), 165–185 (1998)

Author Index

B Botterman, H.-L., 97

C Caimo, A., 63 Chan, K.S., 75 Chang, C.-S., 49 Cisneros-Velarde, P., 75

G Grácio, L., 33

H Hall, M., 17 Hamoodat, H., 157

J Jolad, S., 169

K Koponen, I.T., 123 Krause, R.W., 63

L Lamarche-Perrin, R., 97, 147 Lee, D.-S., 49 Letellier, C., 3 Li, H.-C., 49 Loupos, P., 113

M Madahali, L., 17 Martínez, A., 135 Menezes, R., 157

N Nagler, J., 85 Najjar, L., 17 Nathan, A., 113 Nin, J., 135

O Oliveira, D.F.M., 75

R Ribeiro, E., 157 Ribeiro, P., 33 Rubio, A., 135

S Sendiña-Nadal, I., 3 Singh, C.K., 169 Stollmeier, F., 85

T Tomás, E., 135

V Vishwakarma, R., 169

W Wilmet, A., 147

© Springer Nature Switzerland AG 2019 S. P. Cornelius et al. (eds.), Complex Networks X, Springer Proceedings in Complexity, https://doi.org/10.1007/978-3-030-14459-3
