
Springer Proceedings in Complexity

Andreia Sofia Teixeira · Diogo Pacheco · Marcos Oliveira · Hugo Barbosa · Bruno Gonçalves · Ronaldo Menezes Editors

Complex Networks XII Proceedings of the 12th Conference on Complex Networks CompleNet 2021

Springer Proceedings in Complexity

Springer Proceedings in Complexity publishes proceedings from scholarly meetings on all topics relating to the interdisciplinary studies of complex systems science. Springer welcomes book ideas from authors. The series is indexed in Scopus. Proposals must include the following:
- name, place and date of the scientific meeting
- a link to the committees (local organization, international advisors etc.)
- scientific description of the meeting
- list of invited/plenary speakers
- an estimate of the planned proceedings book parameters (number of pages/articles, requested number of bulk copies, submission deadline)
Submit your proposals to: [email protected]

More information about this series at http://www.springer.com/series/11637


Editors

Andreia Sofia Teixeira, Hospital da Luz Learning Health, Luz Saúde, Lisboa, Portugal; INESC-ID, Lisboa, Portugal
Diogo Pacheco, Department of Computer Science, University of Exeter, Exeter, UK
Marcos Oliveira, Department of Computer Science, University of Exeter, Exeter, UK; GESIS – Leibniz Institute for the Social Sciences, Köln, Nordrhein-Westfalen, Germany
Hugo Barbosa, Department of Computer Science, University of Exeter, Exeter, UK
Bruno Gonçalves, Data for Science, Inc, New York, NY, USA
Ronaldo Menezes, Department of Computer Science, University of Exeter, Exeter, UK

ISSN 2213-8684, ISSN 2213-8692 (electronic)
Springer Proceedings in Complexity
ISBN 978-3-030-81853-1, ISBN 978-3-030-81854-8 (eBook)
https://doi.org/10.1007/978-3-030-81854-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The International Workshop on Complex Networks, CompleNet (www.complenet.org), was initially proposed in 2008, and the first workshop took place in 2009 in Catania. The initiative was the result of efforts from researchers from (i) the BioComplex Laboratory at the Department of Computer Sciences at Florida Institute of Technology, USA, and (ii) the Dipartimento di Ingegneria Informatica e delle Telecomunicazioni, Università di Catania, Italy.

CompleNet aims at bringing together researchers and practitioners working on complex networks or related areas. In the past two decades, we have indeed witnessed an exponential increase in the number of publications in this field. From biology to computer science and from economics to social systems, complex networks are becoming pervasive in many fields of science. It is this interdisciplinary nature of complex networks that CompleNet aims at addressing.

CompleNet Live 2021 was the 12th event in the series and was hosted online during May 24–26, 2021. This book includes the peer-reviewed list of works presented at CompleNet Live 2021. We received 107 submissions from 30 countries. Each submission was reviewed by at least three members of the Program Committee. Acceptance was judged based on the relevance to the symposium themes, clarity of presentation, originality and accuracy of results and proposed solutions. After the review process, ten full papers and one short paper were selected to be included in this book.

We would like to thank the Program Committee members for their work in promoting the event and refereeing submissions. We are grateful to our speakers: Elisa Omodei, Danielle Bassett, Laura Alessandretti, Jon Kleinberg, Brooke Foucault Welles, and James Bagrow; their presentations are one of the reasons CompleNet Live 2021 was such a success.

May 2021

Andreia Sofia Teixeira
Diogo Pacheco
Marcos Oliveira
Hugo Barbosa
Bruno Gonçalves
Ronaldo Menezes

Contents

Effects of Hidden Users on Cascade-Based Community Detection
  Daiki Suzuki and Sho Tsugawa

Game of Thieves and WERW-Kpath: Two Novel Measures of Node and Edge Centrality for Mafia Networks
  Annamaria Ficara, Rebecca Saitta, Giacomo Fiumara, Pasquale De Meo, and Antonio Liotta

Common Knowledge on Facebook Communication Networks: Models and Experimental Findings
  Sarah McDonald and Gizem Korkmaz

Logistics Route Planning in Agent-Based Simulation and Its Optimization Represented in Higher-Order Markov-Chain Networks
  Ryota Ikai, Shigeyuki Miyagi, and Osamu Sakai

Degree-Degree Correlation in Networks with Preferential Attachment Based Growth
  Sergei Mironov, Sergei Sidorov, and Igor Malinskii

On Measuring the Diversity of Organizational Networks
  Zeinab S. Jalali, Krishnaram Kenthapadi, and Sucheta Soundarajan

An Interpretable Graph-Based Mapping of Trustworthy Machine Learning Research
  Noemi Derzsy, Subhabrata Majumdar, and Rajat Malik

MAVAC: Mapping and Visualization of Academic Collaborations with a Focus on Diversity
  Logan McNichols, Steven Pineda, Emma Sauerborn, Brandon Tat, Kevin Yoo, Jane Lehr, Zoë Wood, and Theresa Migler

Modelling Damage Propagation in Complex Networks: Life Exists in Half-Chaos
  Andrzej Gecow and Mariusz Nowostawski

Information Seeking as an Evolutionary Game
  Markus Brede

How Correlated Are Community-Aware and Classical Centrality Measures in Complex Networks?
  Stephany Rajeh, Marinette Savonnet, Eric Leclercq, and Hocine Cherifi

Author Index

Effects of Hidden Users on Cascade-Based Community Detection

Daiki Suzuki (corresponding author) and Sho Tsugawa

University of Tsukuba, Tsukuba, Ibaraki 305-8573, Japan
[email protected], [email protected]

Abstract. Community detection in social networks is an important research topic in the field of network science. Recently, cascade-based community detection algorithms that discover communities in a network only from records of information diffusion cascades on the network have been gathering attention. Although cascade-based community detection algorithms are expected to be useful, their effectiveness is considerably affected by missing data in available cascades. On social media, there exists a certain number of hidden users whose information diffusion behavior cannot be observed due to privacy settings. In this paper, we investigate the robustness of cascade-based community detection algorithms against missing data due to such hidden users. Namely, we investigate how such hidden users affect cascade-based community detection algorithms through experiments using both synthetic cascade data and actual cascade data from Twitter. The results show that even when 10% of users are hidden, the accuracies of existing community detection algorithms are nearly the same as when information about all users is available, suggesting high robustness of existing algorithms. However, we show that when many hidden users exist (20% or more), the accuracies of cascade-based community detection algorithms significantly degrade.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 1–11, 2021. https://doi.org/10.1007/978-3-030-81854-8_1

1 Introduction

Community detection in social networks has been an important research topic in the field of network science. Many social networks have a community structure consisting of subnetworks of tightly connected nodes with sparse links between them. Community detection is the task of finding densely connected subnetworks (communities) in a given network [4]. Many community detection algorithms have been proposed and shown to be useful for various applications, including viral marketing [7] and information recommendation [12].

Generally, two types of community detection algorithms have been proposed. One is based on the network's topological structure [2,4] and the other is based on the records of information diffusion cascades [1,10,11]. Most existing community detection algorithms are structure-based, but cascade-based community detection algorithms have recently gained attention. Cascade-based community detection algorithms obtain communities in a network only from records of information diffusion cascades in the network, without using the explicit network structure [10,11]. In some cases, records of information diffusion cascades are easier to obtain than information about the explicit structure of the social network. This is true, for example, if we have already collected tweets and retweets related to a certain topic on Twitter, and we wish to identify user communities spreading information about that topic. While we already have the information cascade records (the tweets and retweets), obtaining the explicit social network of who-follows-whom relationships among the users is quite costly, due to strong restrictions in the Twitter application programming interface when the number of target users is large. In such cases, it is useful to extract communities only from information cascade records.

Although cascade-based community detection algorithms are expected to be useful, their effectiveness is considerably affected by missing data in the available cascades [9–11]. The basic idea behind cascade-based community detection is that users belonging to the same community are likely to be involved in the same diffusion cascades [14]. Figure 1 illustrates this idea. There exists a set S = {B, C, D} of users who frequently spread information posted by user A. In this case, {A} ∪ S is extracted as a single community. In this network, if all information cascade records related to user A are missing, relationships among the users in S are hidden, resulting in inaccurate community detection. When observing information diffusion on social media, the diffusion behavior of some number of hidden users cannot be observed owing to their privacy settings. The existence of such hidden users should significantly impact cascade-based community detection algorithms.

However, the effects of hidden users on cascade-based community detection have not been investigated. This study aims to clarify the effects of missing data in information diffusion cascades on the effectiveness of cascade-based community detection algorithms. In particular, we assume that the information diffusion behavior of some hidden users cannot be observed, and analyze how the existence of those hidden users affects community detection. While previous studies [9–11] have focused on the situation where randomly selected cascades are missing, this study considers the situation where all diffusion events involving selected hidden users are missing. We select hidden users based on multiple criteria and examine how the missing information affects the effectiveness of the cascade-based community detection algorithm Clique(0) [9] and its extension CosineSim [10]. Both algorithms have been shown to be effective when sufficiently many diffusion cascades are available [10]. We evaluate how hidden users affect CosineSim and Clique(0) through experiments using cascades on Twitter and synthetic cascades generated from statistical models.

The remainder of this paper is organized as follows: In Sect. 2, we introduce studies related to cascade-based community detection. In Sect. 3, we describe the problem formulation, datasets, and research methodologies. Section 4 presents the results and a discussion. Finally, Sect. 5 concludes this paper and discusses future work.

Fig. 1. Example of cascade-based community detection, showing information diffusion cascade records (left) and a graph constructed from those records (right). In the graph, links connect nodes participating in the same cascade. Identifying dense subgraphs in the graph shows communities of nodes that are frequently involved in the same information diffusion.
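The graph construction described in the caption of Fig. 1 can be sketched roughly as follows. This is a minimal illustration rather than the implementation used in the paper: the function name `cooccurrence_graph` and the toy cascade records are assumptions, and the link weight here is simply the number of cascades shared by two users.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(cascades):
    """Count, for every unordered user pair, the cascades in which both appear."""
    weights = Counter()
    for cascade in cascades:
        users = sorted(set(cascade))  # ignore order and repeated users
        for u, v in combinations(users, 2):
            weights[(u, v)] += 1
    return weights


# Hypothetical records: B, C and D repeatedly spread posts by A,
# while E and F only interact with each other.
cascades = [
    ["A", "B", "C"],
    ["A", "C", "D"],
    ["A", "B", "D"],
    ["E", "F"],
]
graph = cooccurrence_graph(cascades)
for pair, weight in sorted(graph.items()):
    print(pair, weight)
```

In this toy input, every pair within {A, B, C, D} is linked and no link joins that group to {E, F}, which is the kind of dense subgraph the caption refers to.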

2 Related Work

Cascade-based community detection is a relatively new research topic, first studied by Barbieri et al. [1], who proposed a community detection algorithm using both diffusion cascades and the network structure. The proposed algorithm detects communities by using cascades to infer connection strengths between nodes. More recently, community detection problems using only diffusion cascades have also been studied [9–11]. Ramezani et al. [11] proposed the cascade-based community detection algorithms R-CoDi and D-CoDi. These algorithms construct graphs representing relationship strengths among nodes from cascades, then use the constructed graphs to identify communities. Prokhorenkova et al. [9] took a similar approach, proposing algorithms called Path, Clique, and Clique(0), along with CosineSim in a later extension [10]. These algorithms construct weighted graphs that represent relationships among users from records of diffusion cascades. Each algorithm determines the link weights in a different way. The Louvain algorithm [2] is then applied to the constructed weighted graph to obtain communities. This study uses CosineSim and Clique(0) because these algorithms achieve the highest accuracy for community detection [9,10].

3 Problem Formulation and Methodology

3.1 Cascade-Based Community Detection Problem and Evaluation Indices

This paper considers the cascade-based community detection problem formulated in Ref. [10], where the social network is described as a graph $G = (V, E)$. The set of information diffusion cascades is $D = \{d_1, d_2, \ldots, d_n\}$, where $d_i = (v_1, v_2, \ldots, v_m)$ is an ordered set of nodes in $G$, with $v_j$ representing a node involved in the $j$-th reposting (retweet) of diffusion cascade $d_i$. The goal in the cascade-based community detection problem is to obtain the community $c(u)$ to which each node $u \in V$ belongs by using the set $D$ of information diffusion cascades. Note that when obtaining communities, set $D$ is available, but social network $G$ is not.

We evaluate the effectiveness of community detection algorithms by comparing obtained communities with ground-truth community labels of nodes. As indices for measuring the degree of coincidence between the obtained communities and the ground-truth node labels, we use Normalized Mutual Information (NMI) [3] and the Adjusted Rand Index (ARI) [6]. Both indices are widely used for evaluating community detection algorithms when ground-truth community labels are available [10].
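For readers who want to see the two indices concretely, the following is a minimal pure-Python sketch. It assumes the common normalization NMI = 2I(A;B) / (H(A) + H(B)) and the Hubert-Arabie form of ARI; the text above does not restate the formulas, so these forms, the function names `nmi` and `ari`, and the toy labels are assumptions for illustration only.

```python
from collections import Counter
from math import comb, log


def nmi(labels_a, labels_b):
    """Normalized mutual information, 2*I(A;B) / (H(A) + H(B))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single community
    return 2 * mutual / (h_a + h_b)


def ari(labels_a, labels_b):
    """Adjusted Rand index from pair counts (requires at least two nodes)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. one community in both partitions
    return (sum_joint - expected) / (maximum - expected)


# Hypothetical labels: the detected partition matches the ground truth
# up to a renaming of the communities.
truth = [0, 0, 0, 1, 1, 1]
detected = [1, 1, 1, 0, 0, 0]
print(nmi(truth, detected), ari(truth, detected))
```

Both indices depend only on how nodes are grouped, not on the label names, which is why a relabeled but otherwise identical partition scores as a perfect match.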

3.2 Dataset

We use a dataset of retweet diffusion cascades on Twitter (retweet data) and a dataset of synthetic diffusion cascades generated from a model (generated data).

Retweet data [13]: The retweet data, also used in Ref. [13], are a dataset of tweets and their retweets posted between March 24 and April 25, 2012. This dataset also contains the social network representing mutual follow relationships among the users who were involved in the tweets and retweets in the dataset. For each tweet in the dataset, the user who posted the tweet and the users who retweeted the tweet are available, ordered by the time of posting. Each tweet $i$ corresponds to diffusion cascade $d_i$, and the set of all available tweet cascades in the dataset corresponds to $D$. Following Ref. [11], the set of users appearing in more than five diffusion cascades is the target user set $V$. Ground-truth community labels for nodes are obtained from the social network structure among the target users because explicit community labels are unavailable in this dataset. We construct a network $G = (V, E)$ that represents mutual follow relationships among the target user set $V$. We then obtain communities from network $G$ by using the Louvain algorithm [2] for detecting structure-based communities. The obtained communities are considered as the ground-truth communities in this dataset.

Generated data: The generated data are a dataset of cascades artificially generated by a model of probabilistic information diffusion [11]. In this dataset, social network $G$ is a network of 1000 nodes generated with the Lancichinetti-Fortunato-Radicchi (LFR) model [8], which generates a network with a community structure. Set $D$ is generated by simulating information diffusion based on the exponential independent cascade (EIC) model [5]. The information diffusion cascade $d_i$ is the order of nodes receiving information in the $i$-th simulation run of the EIC model. When the network is generated in the LFR model, the community to which each node belongs is given. We thus obtain ground-truth community labels from the given communities.
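The target-user rule for the retweet data (users appearing in more than five cascades) can be sketched as follows. The function name `target_users`, its `min_cascades` parameter, and the toy cascades are hypothetical, and the sketch assumes that a user is counted once per cascade.

```python
from collections import Counter


def target_users(cascades, min_cascades=5):
    """Return the users appearing in strictly more than `min_cascades` cascades."""
    appearances = Counter()
    for cascade in cascades:
        appearances.update(set(cascade))  # count each cascade once per user
    return {user for user, count in appearances.items() if count > min_cascades}


# Hypothetical records: "a" appears in 11 cascades, "b" in 6, "c" in 5.
toy_cascades = [["a", "b"]] * 6 + [["a", "c"]] * 5
print(sorted(target_users(toy_cascades)))
```

Under this reading of the rule, user "c" is excluded because it appears in exactly five cascades, not more than five.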

Table 1. Summary of the datasets

| Dataset | Retweet data | Generated data |
|---|---|---|
| Number of users | 132,556 | 866 |
| Number of information diffusion cascades | 982,888 | 1,000 |
| Total number of information diffusion events | 2,152,091 | 31,148 |
| Average length of information diffusion cascades | 2.1675 | 31.148 |
| Number of ground-truth communities | 28 | 211 |

Table 1 summarizes the datasets. The total number of information diffusion events in the table is defined as the sum of all cascade lengths, where the length of cascade $d_i$ is its number of elements.
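As a small illustration of these definitions, the sketch below computes the total number of diffusion events and the average cascade length for a list of cascades. The function name `cascade_statistics` and the toy input are assumptions; the last line only checks that the generated-data column of Table 1 is consistent with this definition (31,148 events over 1,000 cascades).

```python
def cascade_statistics(cascades):
    """Return (total number of diffusion events, average cascade length)."""
    total_events = sum(len(cascade) for cascade in cascades)
    return total_events, total_events / len(cascades)


# Hypothetical input with two cascades of lengths 3 and 2.
total, average = cascade_statistics([["a", "b", "c"], ["a", "b"]])
print(total, average)

# Generated data in Table 1: total events divided by number of cascades.
generated_average = 31148 / 1000
print(generated_average)
```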

3.3 Experimental Methodology

To investigate the effects of hidden users on cascade-based community detection, we first select the hidden users $V' \subset V$. We use three criteria to determine the hidden users:

(1) Active user selection, in which hidden users are selected in descending order of the number of diffusion cascades that they are involved in.
(2) Inactive user selection, in which hidden users are selected in ascending order of the number of diffusion cascades that they are involved in.
(3) Random user selection, in which hidden users are selected uniformly at random from the user set.

We next delete diffusion events in which hidden users $V'$ are involved, and generate an incomplete cascade set $D'$ from the complete cascade set $D$. For each cascade $d_i \in D$, if $v \in V'$ is included in $d_i$, we delete $v$ from $d_i$. If the cascade length of $d_i$ becomes 1 or 0 due to the deletion of $v$, $d_i$ itself is also deleted. Finally, we obtain communities from the incomplete cascades $D'$ using CosineSim and Clique(0), and calculate NMI and ARI scores for the obtained communities.

Both CosineSim and Clique(0) obtain communities by applying the Louvain algorithm to a weighted graph constructed from diffusion cascades. The definition of link weights in the constructed graph is different for each algorithm. Let $D_{v_i}$ be a $1 \times |D|$ vector whose $j$-th element is 1 if $v_i$ exists in cascade $d_j$, and 0 otherwise. Then, link weight $r_{v_i,v_j}$ on link $(v_i, v_j)$ in CosineSim is defined as

$$ r_{v_i,v_j} = \frac{D_{v_i} \cdot D_{v_j}}{|D_{v_i}|\,|D_{v_j}|}. \tag{1} $$

In contrast, link weight $r_{v_i,v_j}$ for Clique(0) is defined as

$$ r_{v_i,v_j} = \sum_{d_k \in D} \frac{1}{s(d_k, v_i, v_j)}, \tag{2} $$

where $s(d_k, v_i, v_j)$ is the distance between users $v_i$ and $v_j$ in cascade $d_k$. Note that the distance between two adjacent nodes in a cascade is defined as 1.
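The hiding procedure and the two link-weight definitions can be sketched together as below. This is a rough sketch under several assumptions that are not stated in the text: ties in activity are broken by user name, the number of hidden users is the fraction of users rounded down, the norm in Eq. (1) is the Euclidean norm, the distance in Eq. (2) is the difference between positions in the cascade, the sum in Eq. (2) runs over cascades containing both users, and no user appears twice in one cascade. All function names are hypothetical, and the Louvain step is omitted.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def select_hidden_users(cascades, fraction, criterion, seed=0):
    """Pick hidden users by the 'active', 'inactive' or 'random' criterion."""
    counts = Counter()
    for cascade in cascades:
        counts.update(set(cascade))
    users = sorted(counts)  # deterministic base order, also used for ties
    k = int(fraction * len(users))
    if criterion == "active":
        ranked = sorted(users, key=lambda u: -counts[u])
    elif criterion == "inactive":
        ranked = sorted(users, key=lambda u: counts[u])
    elif criterion == "random":
        ranked = random.Random(seed).sample(users, len(users))
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return set(ranked[:k])


def remove_hidden_users(cascades, hidden):
    """Delete hidden users and drop cascades left with fewer than two users."""
    incomplete = []
    for cascade in cascades:
        kept = [u for u in cascade if u not in hidden]
        if len(kept) >= 2:
            incomplete.append(kept)
    return incomplete


def cosine_sim_weights(cascades):
    """Eq. (1): cosine similarity of binary cascade-membership vectors."""
    membership = defaultdict(set)  # user -> indices of cascades containing it
    for idx, cascade in enumerate(cascades):
        for u in cascade:
            membership[u].add(idx)
    weights = {}
    for u, v in combinations(sorted(membership), 2):
        common = len(membership[u] & membership[v])
        if common:
            norm = sqrt(len(membership[u]) * len(membership[v]))
            weights[(u, v)] = common / norm
    return weights


def clique0_weights(cascades):
    """Eq. (2): sum over shared cascades of 1 / positional distance."""
    weights = defaultdict(float)
    for cascade in cascades:
        for i, j in combinations(range(len(cascade)), 2):
            u, v = sorted((cascade[i], cascade[j]))
            weights[(u, v)] += 1.0 / (j - i)
    return dict(weights)


# Hypothetical cascades; A is the most active user.
example = [["A", "B", "C"], ["A", "C", "D"], ["A", "B"], ["E", "F"]]
hidden = select_hidden_users(example, fraction=0.2, criterion="active")
incomplete = remove_hidden_users(example, hidden)
print(hidden, incomplete)
print(cosine_sim_weights(incomplete))
print(clique0_weights(incomplete))
```

In this toy input, hiding A removes the cascade (A, B) entirely because only one user remains, which mirrors the deletion rule described above.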

Fig. 2. Comparison of the results of Clique(0) among the three methods for hidden user selection while changing the number of deleted (hidden) users (retweet data). Panels: (a) NMI; (b) ARI; (c) number of events used for community detection; (d) number of detected communities.

4 Results and Discussion

First, we investigate how the number of hidden users affects the results of cascade-based community detection algorithms. Figures 2 and 3 compare the results of Clique(0) among the three methods for hidden user selection in the retweet data and the generated data, respectively. The results for CosineSim are not shown here owing to space limitations, but we compare CosineSim and Clique(0) below. Figures 2a, b, 3a, and b show that hiding inactive users has only a small effect on NMI and ARI scores, which evaluate the accuracy of community detection. In contrast, hiding active users significantly degrades NMI and ARI scores when the ratio of hidden users is large. This can be explained by the number of extracted communities. Figures 2d and 3d show that the number of detected communities increases as the number of deleted users increases when active users are deleted. Active users may play central roles in their communities, so if these users are hidden, their communities will be separated into multiple small communities, possibly increasing the number of detected communities. These observations suggest that hiding active users has a larger impact on cascade-based community detection than does hiding inactive or random users. Note that even when hiding active users, NMI and ARI scores when the fraction of hidden users is 0.1 are nearly the same as those when no users are hidden, suggesting that Clique(0) is robust in cases of few hidden users.

Fig. 3. Comparison of the results of Clique(0) among the three methods for hidden user selection while changing the number of deleted (hidden) users (generated data). Panels: (a) NMI; (b) ARI; (c) number of events used for community detection; (d) number of detected communities.

We next compare the results of Clique(0) among the three methods for hidden user selection when the same number of events is deleted. As Figs. 2c and 3c show, the number of available information diffusion events significantly differs among the three methods for hidden user selection. Therefore, the reason for the larger effects of hiding active users might be that the number of available diffusion events is smaller than when hiding inactive or random users. Figures 4 and 5 show the results from Clique(0) when the number of deleted events is changed. For comparison, the results when randomly selected diffusion events are deleted are also included in those figures. Figures 4a and b show that in the retweet data, when the fraction of deleted events exceeds 0.4, hiding active users has larger effects on NMI and ARI scores than does hiding inactive users. This indicates that cascade events of active users are more useful for extracting communities of other users than those of inactive users. However, results from the generated data (Fig. 5) show different tendencies; the NMI and ARI scores are lowest when deleting inactive users. We believe this is because the number of users significantly decreases when inactive users are deleted in the generated data (Fig. 5d).

Fig. 4. Comparison of results from Clique(0) among the three methods of hidden user selection when the same number of events is used for community detection (retweet data). Panels: (a) NMI; (b) ARI; (c) number of target users for community detection; (d) number of detected communities.

Finally, we compare the results from CosineSim and Clique(0) on each dataset. In what follows, we use NMI as the index for measuring the effectiveness of community detection algorithms, showing only the results when hiding random users. Figures 6 and 7 show NMI scores when changing the number of deleted users and the number of deleted events in the retweet data and the generated data, respectively. Figure 6 shows that there is nearly no difference between NMI scores from the two algorithms. This can be explained by the length of the information diffusion cascades in the retweet data. Table 1 shows that the average length of the information diffusion cascades in the retweet data is 2.16, which is very short. When cascades are short, weights between two users in both algorithms are simply determined by the number of cascades in which both users are involved, which results in the small difference between the two algorithms.

Fig. 5. Comparison of results from Clique(0) among the three methods of hidden user selection when the same number of events is used for community detection (generated data). Panels: (a) NMI; (b) ARI; (c) number of target users for community detection; (d) number of detected communities.

Fig. 6. NMI comparison of the two algorithms (retweet data). Panels: (a) when deleting a fixed number of users; (b) when deleting a fixed number of events.

Figure 7 shows that both algorithms are affected by missing data, but Clique(0) achieves higher NMI scores than does CosineSim. As Table 1 shows, the average length of cascades in the generated data is 31.14, resulting in differences between the two algorithms. As explained in Sect. 3, CosineSim simply counts the number of common cascades between two users, while Clique(0) considers the distance between two users in a common cascade. Since users who are distant in a cascade are likely to also be distant in the social network, Clique(0), which assigns lower weights for such distant users, is more effective than is CosineSim, which does not consider distances between users.
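The following worked calculation illustrates this argument under two assumptions that the text does not state explicitly: $|D_{v_i}|$ in Eq. (1) is the Euclidean norm, and $s(d_k, v_i, v_j)$ in Eq. (2) is the difference between the positions of the two users in $d_k$, with the sum taken over cascades containing both users. The symbols $n_i$, $n_j$, and $n_{ij}$ are introduced here only for illustration.

```latex
% n_i: number of cascades containing v_i
% n_{ij}: number of cascades containing both v_i and v_j
\begin{align*}
\text{CosineSim:}\quad
  r_{v_i,v_j} &= \frac{n_{ij}}{\sqrt{n_i\, n_j}}, \\
\text{Clique(0), all cascades of length 2:}\quad
  r_{v_i,v_j} &= \sum_{d_k \ni v_i, v_j} \frac{1}{1} = n_{ij}, \\
\text{Clique(0), users at positions 1 and 31 of one cascade:}\quad
  \frac{1}{s(d_k, v_i, v_j)} &= \frac{1}{31 - 1} = \frac{1}{30}.
\end{align*}
```

When every cascade has length 2, both weights depend only on co-occurrence counts, which is consistent with the near-identical scores on the retweet data. In a cascade of 31 users, the first and last users add a full count of 1 to $n_{ij}$ for CosineSim but only 1/30 to the Clique(0) weight.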

Fig. 7. NMI comparison of the two algorithms (generated data). Panels: (a) when deleting a fixed number of users; (b) when deleting a fixed number of events.

5 Conclusion and Future Work

We evaluated the robustness of cascade-based community detection algorithms against missing data in information diffusion cascades on social networks. In particular, we analyzed the robustness of CosineSim and Clique(0) against three types of missing data: missing cascades of active users, inactive users, and random users. Our results suggest that CosineSim and Clique(0) are generally robust against missing data due to hidden users, such that even when 10% of the users are hidden, the accuracies of these algorithms are nearly the same as when information about all users is available. In contrast, we showed that when many users (20% or more) are hidden, the accuracies of cascade-based community detection are significantly degraded. Further investigations are necessary to clarify the patterns of missing data that have large impacts on community detection, but our results show that hiding active users has a larger impact than does hiding inactive or random users.

In future work, we plan to investigate the robustness of cascade-based community detection algorithms on other datasets. We obtained some contradictory results from the retweet and generated datasets, so validating these results with other datasets is an important future task. Another important task is to investigate patterns of missing data that significantly affect cascade-based community detection algorithms.

References

1. Barbieri, N., Bonchi, F., Manco, G.: Efficient methods for influence-based network-oblivious community detection. ACM Trans. Intell. Syst. Technol. (TIST) 8(2), 1–31 (2016)
2. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), P10008 (2008)
3. Danon, L., Diaz-Guilera, A., Duch, J., Arenas, A.: Comparing community structure identification. J. Stat. Mech. Theory Exp. 2005(09), P09008 (2005)
4. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)
5. Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and influence. ACM Trans. Knowl. Discov. Data (TKDD) 5(4), 1–37 (2012)
6. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
7. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2003), pp. 137–146 (2003)
8. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
9. Prokhorenkova, L., Tikhonov, A., Litvak, N.: Learning clusters through information diffusion. In: Proceedings of the 30th International Conference on The World Wide Web Conference (WWW 2019), pp. 3151–3157 (2019)
10. Prokhorenkova, L., Tikhonov, A., Litvak, N.: When less is more: systematic analysis of cascade-based community detection (2020). arXiv preprint arXiv:2002.00840
11. Ramezani, M., Khodadadi, A., Rabiee, H.R.: Community detection using diffusion information. ACM Trans. Knowl. Discov. Data (TKDD) 12(2), 1–22 (2018)
12. Sahebi, S., Cohen, W.W.: Community-based recommendations: a solution to the cold start problem. In: Proceedings of Workshop on Recommender Systems and the Social Web (RSWEB 2011), p. 60 (2011)
13. Weng, L., Menczer, F., Ahn, Y.Y.: Virality prediction and community structure in social networks. Sci. Rep. 3, 2522 (2013)
14. Zhang, Y., Lyu, T., Zhang, Y.: COSINE: community-preserving social network embedding from information diffusion cascades. In: Proceedings of Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), pp. 2620–2627 (2018)

Game of Thieves and WERW-Kpath: Two Novel Measures of Node and Edge Centrality for Mafia Networks

Annamaria Ficara1,2(B), Rebecca Saitta2, Giacomo Fiumara2, Pasquale De Meo2, and Antonio Liotta3

1 University of Palermo, Palermo, Italy [email protected]
2 University of Messina, Messina, Italy [email protected],{gfiumara,pdemeo}@unime.it
3 Free University of Bozen-Bolzano, Bolzano, Italy [email protected]

Abstract. Real-world complex systems can be modeled as homogeneous or heterogeneous graphs composed of nodes connected by edges. The importance of nodes and edges is formally described by a set of measures called centralities, which are typically studied for graphs of small size. The proliferation of digital data collection has led to huge graphs with billions of nodes and edges. For this reason, we focus on two new algorithms, Game of Thieves and WERW-Kpath, which are computationally light alternatives to canonical centrality measures such as degree, node and edge betweenness, closeness and clustering. We explore the correlation among these measures using Spearman's correlation coefficient on real criminal networks extracted from judicial documents of three Mafia operations. The results of our analysis indicate that Game of Thieves could be used as a more economical replacement to rank both nodes and edges, and WERW-Kpath to rank edges.

Keywords: Complex networks · Mafia networks · Centrality · Computational complexity · Correlation

1 Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 12–23, 2021. https://doi.org/10.1007/978-3-030-81854-8_2

Many real-world complex systems can be modeled as homogeneous or heterogeneous networks, composed respectively of nodes that have the same function or of two or more classes of nodes categorized by both function and utility. Nodes can be anything (e.g. people, computers, organizations, etc.). They are interconnected, which means that two nodes may be connected by a link or edge (e.g. two people meet or call each other, two computers are connected by a cable, two organizations exchange goods, etc.) [45]. Networks can be represented mathematically by graphs, which come with a theoretical framework that allows researchers to focus on the structure of networks in order to make statements about the behavior of an entire social group. Graph theory and social network analysis (SNA) [14] provide a set of measures called centrality to formally describe the importance or the influence of nodes and edges. Traditionally, centrality has been studied for graphs of relatively small size. However, in the last few years, the proliferation of digital data collection has led to huge graphs with billions of nodes and edges. There is a clear need to develop more efficient, scalable, and accurate algorithms [30]. For this reason, we focus on two new algorithms, Game of Thieves and Weighted Edge Random Walks (WERW) - K Path, whose performance is promising in terms of both time and effectiveness [23,25]. The results provided by these algorithms on different datasets are compared to those of known centrality measures such as degree, (node/edge) betweenness, closeness, and clustering, and they are validated through a correlation analysis.

In our previous work [23], we analyzed Pearson's r, Spearman's ρ and Kendall's τ correlation coefficients among the classical node centrality metrics and Game of Thieves using artificial and real networks with different numbers of nodes (i.e. 62–22,963). In that paper, we used Game of Thieves only to compute the centrality of nodes. In another study [25], we analyzed artificial networks with a fixed number of nodes (i.e. 10,000) and an increasing number of edges (i.e. 50,000–500,000). We also started to use Game of Thieves to compute edge centrality, comparing it with WERW-Kpath.

More generally, correlations between the canonical measures of node centrality have been studied by many researchers [39,41,43,46]. Valente et al. [46] investigated correlations among degree, in-degree, out-degree, betweenness, s-betweenness, closeness-in, closeness-out, s-closeness, integration, radiality and eigenvector using data originally collected in seven studies, which included 62 sociometric networks in a variety of settings. Bröhl and Lehnertz [5] adapted various widely used centrality concepts for nodes to edges, in order to find which edges in a network are important between other pairs of nodes. They noticed that there is an abundance of methods for measuring the centrality of individual nodes, but only a few metrics for measuring the centrality of individual edges. In this respect, Meghanathan and Yang [34] explored the correlation between edge betweenness centrality and neighbourhood overlap with the same purpose as our work, namely to find a computationally light alternative to rank the edges in a graph.

In this paper, we combine these two kinds of analysis. First, we explore the correlation among Game of Thieves and degree, betweenness, closeness and clustering for measuring node centrality. Then, we explore the correlation between edge betweenness centrality, Game of Thieves and WERW-Kpath for measuring edge centrality. These two kinds of studies are conducted on a specific type of complex criminal network, namely Mafia networks. Among the major native mafia-like organizations are the Sicilian Mafia (the original “Mafia” or Cosa Nostra) and the Calabrian 'Ndrangheta. They are loose confederations of about one hundred groups, also called cosche or families, which have affected social and economic life, especially in Southern Italy, since at least the 19th century [36,37]. SNA tools can be used in Mafia networks to describe their structure and functioning [1,19,20], to construct crime prevention systems [2,9], to identify leaders [28] or to evaluate police interventions aimed at dismantling and disrupting criminal networks [11,18].

Our correlation analysis was conducted on Mafia networks because they perfectly reproduce the characteristics of larger networks, and the results obtained on them therefore support our proposal to use Game of Thieves and WERW-Kpath to compute node and edge centrality, given their competitive execution time. Mafia networks are in fact the best example of real-world networks where the geometry of connections is crucial for understanding the activity of the network, as these connections are the building blocks of the entire Mafia, and more generally of organized crime [33].

2 Materials and Methods

A criminal network can be represented by a simple graph or network, defined as a set of n nodes or actors connected by a set of m edges or links that in some way support the commission of illegal actions [32]. In particular, our analysis focuses on six real criminal networks related to three distinct Mafia operations called Montagna [9,11,12,22], Infinito [6–8,28] and Oversize [2].

Montagna Meetings (MM) and Montagna Phone Calls (MPC) are two unique weighted and undirected networks that we extracted from the pre-trial detention order issued by the Preliminary Investigation Judge of Messina at the end of the Montagna operation in 2007. MM contains 256 connections, emerging from police physical surveillance, between 101 suspected criminals close to the Sicilian Mafia. MPC contains 124 phone calls, emerging from police audio surveillance, among 100 suspects. These networks share 47 nodes and are available on Zenodo [10].

Infinito Summits (IS) is a weighted and undirected network that we extracted from the 2-mode matrix¹ on the UCINET [3] website. Unlike the original matrix, we chose not to consider the 5 isolated nodes, which are negligible when centrality is considered. IS contains 1619 connections among 151 suspected members of the Calabrian 'Ndrangheta criminal organization connected with some cosche in Milan, whose attendance was registered by police through wiretapping and observations during the Infinito operation (2007–2009).

¹ Available at: https://sites.google.com/site/ucinetsoftware/datasets/covert-networks/ndranghetamafia2

Oversize Wiretap Records (OWR), Oversize Arrest Warrant (OAW) and Oversize Judgment (OJU) are three undirected and unweighted networks that we rebuilt from the 1-mode matrices available on Figshare [38]. Berlusconi et al. constructed these matrices to represent three networks with the same number of nodes (i.e. 182) from three judicial documents corresponding to three different stages of the Oversize operation (2000–2009). Unlike them, we discarded the isolated nodes, which are negligible when centrality is considered. OWR contains 247 wiretap conversations transcribed by the police among 182 individuals involved in illicit activities close to the Calabrian 'Ndrangheta. OAW contains 189 connections among 146 suspected criminals emerging from the police physical surveillance. OJU contains 113 wiretap conversations among 89 suspects emerging from the trial and other sources of evidence such as wiretapping and audio surveillance. A more detailed description of these networks can be found in [21].

To find an important node or edge in a graph, and therefore an influential person or connection in a criminal network, we use a family of measures called centrality. There are five main centrality measures used in SNA:

– Degree centrality (DC) [26], which determines the importance of a node based on the number of edges incident upon it. Its computational cost is $O(m)$.
– Betweenness centrality (BC) [4], which quantifies how many times a node acts as a bridge along the shortest path between two other nodes. Its computational complexity lies between $O(mn)$ and $O(n^3)$.
– Edge betweenness centrality (EBC) [4], which quantifies how many times an edge acts as a bridge along the shortest path between two other nodes. Its computational complexity is the same as that of BC.
– Closeness centrality (CL) [26], which determines the importance of a node based on its proximity to all the other nodes in the graph. Its computational cost is $O(n^3)$.
– Clustering coefficient (CC) [42], which indicates how well connected the neighborhood of a node is. Its computational complexity is $O(n^2)$. If the neighborhood is fully connected, the CC is 1; if there are hardly any connections in the neighborhood, it is close to 0.

The computational complexity of these measures makes their computation infeasible in practice on large real networks.
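To make two of the definitions above concrete, the following is a minimal Python sketch, assuming an undirected graph stored as a dictionary that maps each node to the set of its neighbors. The function names and the toy graph are illustrative assumptions; they are not taken from the implementation used for the experiments, which relies on NetworkX.

```python
from itertools import combinations


def degree(adj, v):
    """Number of edges incident upon node v."""
    return len(adj[v])


def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves connected."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        # With fewer than two neighbors there is no pair to connect.
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 attached to 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print({v: (degree(toy, v), clustering_coefficient(toy, v)) for v in toy})
```

In the toy graph, node 1 has a fully connected neighborhood and therefore a CC of 1, whereas node 3 has a single neighbor and a CC of 0.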
For this reason we focus on two computationally affordable alternatives: Game of Thieves for the computation of both node and edge centrality, and WERW-Kpath for edge centrality.

Game of Thieves (GoT) [35] proceeds in $T = \log^3 n$ epochs. At the start, each node usually holds n virtual diamonds (vdiamonds) and 1 thief. At each epoch, a thief located on a node randomly picks one of its neighbors and moves to it; if he finds a vdiamond there, he fetches it and brings it back to his home node. At this point the vdiamond becomes available to the other thieves, who can steal it. When the game has run for T epochs, the centrality of a node is computed as the average number of vdiamonds present on it. An important node is indicated by a small average number of vdiamonds, because the most central nodes are visited by many thieves and are quickly depleted. The centrality of an edge is computed as the average number of thieves carrying a vdiamond who pass through it over the T epochs. In this case, the most important edge is indicated by a high average number of thieves. These computations have a computational cost between $O(\log^2 n)$ and $O(\log^3 n)$.
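The game described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the reference implementation [35]: the graph is undirected, has no isolated nodes and is given as adjacency lists; a searching thief chooses any neighbor uniformly at random; a loaded thief retraces his own path one hop per epoch; and the default number of epochs uses the natural logarithm. The function name `game_of_thieves` and its parameters are hypothetical.

```python
import math
import random
from collections import defaultdict


def game_of_thieves(adj, epochs=None, seed=None):
    """Simplified Game of Thieves on an undirected graph.

    adj maps each node to a list of neighbors (no isolated nodes).
    Returns (node_score, edge_score): the average number of vdiamonds per
    node and the average number of loaded thieves per edge.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    if epochs is None:
        # Assumption: T = log^3 n with the natural logarithm.
        epochs = max(1, math.ceil(math.log(n) ** 3))

    diamonds = {v: n for v in nodes}     # each node starts with n vdiamonds
    paths = {v: [v] for v in nodes}      # one thief per node, starting at home
    loaded = {v: False for v in nodes}   # is the thief carrying a vdiamond?
    diamond_sum = {v: 0 for v in nodes}
    edge_sum = defaultdict(int)

    for _ in range(epochs):
        for home in nodes:
            path = paths[home]
            if loaded[home]:
                # Carry the vdiamond one hop back along the path.
                here = path.pop()
                back = path[-1]
                edge_sum[frozenset((here, back))] += 1
                if len(path) == 1:       # the thief is home again
                    diamonds[home] += 1
                    loaded[home] = False
            else:
                # Search: move to a random neighbor and steal if possible.
                nxt = rng.choice(adj[path[-1]])
                path.append(nxt)
                if diamonds[nxt] > 0:
                    diamonds[nxt] -= 1
                    loaded[home] = True
        for v in nodes:
            diamond_sum[v] += diamonds[v]

    node_score = {v: diamond_sum[v] / epochs for v in nodes}
    edge_score = {e: c / epochs for e, c in edge_sum.items()}
    return node_score, edge_score


# Hypothetical star graph: node 0 is the hub, nodes 1 to 4 are leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
node_score, edge_score = game_of_thieves(star, epochs=5, seed=0)
print(node_score)
```

In the star graph every leaf thief must pass through the hub, so the hub is depleted first and ends with the smallest average number of vdiamonds, which is the signature of a central node in GoT.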


WERW-Kpath (WKP) [15–17] is an algorithm that computes the Kpath edge centrality of a graph with a near-linear cost $O(km)$. It consists of three main steps: (i) node and edge weight assignment; (ii) simulation of message propagations through simple random walks of length at most k; (iii) final weight computation. In the first stage of the algorithm, weights are assigned to both nodes and edges. The node weight is used to select the source node from which each message propagation simulation starts. The edge weight is the initial value of the edge centrality and is updated during the execution of the algorithm. Message propagation is then simulated in the graph using random walkers forced to follow simple paths of bounded length, up to a constant, user-defined value k, without passing more than once through the same edge. Reasonable values for the length k of the k-path random walks lie in the interval [5, 20]. For our tests we chose k = 10. An edge is central if it is frequently exploited to diffuse information.

To understand whether GoT and WKP can be used as replacements for the classical node and edge centrality measures, we carried out a correlation analysis using the most well-known correlation coefficients: Pearson's r [13], Spearman's ρ [44] and Kendall's τ [31]. Correlation coefficients are used to measure the strength of association among GoT and DC, BC, CL and CC for node centrality, and among GoT, WKP and EBC for edge centrality. Since the results for the three coefficients were quite similar, we chose to show only those for Spearman's ρ because it best captures the relationships among all the considered measures (see Figs. 2 and 4). This correlation coefficient is determined by dividing the covariance of the rank values of two variables (i.e. a measure of how the two variables change together) by the product of the standard deviations of those rank values (i.e. a measure of the dispersion of data from its average).
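The message-propagation step of WERW-Kpath described above can be illustrated with a deliberately simplified, unweighted Python sketch. It is an assumption-laden stand-in rather than the algorithm of [15–17]: source nodes and next edges are chosen uniformly at random instead of by node and edge weights, and an edge's score is simply the fraction of simulated walks that traverse it. The function name `kpath_edge_scores` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def kpath_edge_scores(adj, k=10, num_walks=1000, seed=None):
    """Score edges by how often bounded random walks traverse them.

    adj maps each node to a list of neighbors (undirected graph).
    Each walk uses at most k edges and never reuses an edge.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    counts = defaultdict(int)
    for _ in range(num_walks):
        current = rng.choice(nodes)      # assumption: uniform source node
        used = set()
        for _ in range(k):
            options = [u for u in adj[current]
                       if frozenset((current, u)) not in used]
            if not options:
                break                    # the walker is stuck, stop early
            nxt = rng.choice(options)    # assumption: uniform edge choice
            edge = frozenset((current, nxt))
            used.add(edge)
            counts[edge] += 1
            current = nxt
    return {e: c / num_walks for e, c in counts.items()}


# Hypothetical path graph 0 - 1 - 2.
path_graph = {0: [1], 1: [0, 2], 2: [1]}
scores = kpath_edge_scores(path_graph, k=10, num_walks=200, seed=1)
print(scores)
```

Because a walk never reuses an edge, each score lies between 0 and 1, and edges that many walks are forced to cross receive the highest scores.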
Spearman's ρ is computed in Python 3, as are the centrality metrics DC, BC, EBC, CL and CC, for which we also used the NetworkX module [29]. GoT² was originally implemented in Python 2 and we ported it to Python 3. WKP was originally implemented in Java³, but we ported it to Python 2 and it is in the testing phase. Moreover, to get more reliable results, we ran GoT and WKP 30 times on each network and then computed the mean and standard deviation of the resulting values. This was necessary because the centrality values produced by these measures change at each execution: unlike DC, BC, EBC, CL and CC, they are not deterministic.
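As an illustration of the definition of Spearman's ρ given above (covariance of the rank values divided by the product of their standard deviations), the following plain-Python sketch computes the coefficient without external libraries. The helper names and the sample values are hypothetical and are not data from our experiments; tied values receive the average of their ranks, which is a common convention assumed here.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Covariance of the ranks divided by the product of their deviations.

    Raises ZeroDivisionError if all values of x (or y) are identical.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)   # the 1/n factors cancel out


# Hypothetical scores for five nodes: a high degree goes with few vdiamonds.
degree_scores = [4, 1, 1, 1, 1]
vdiamond_scores = [0.8, 5.0, 5.2, 4.6, 5.4]
print(spearman_rho(degree_scores, vdiamond_scores))
```

The sample values give a negative coefficient, mirroring the fact that GoT assigns small values to central nodes while degree assigns large ones.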

3 Results and Discussion

The first experiment we performed consists in the identification of the ten most important nodes in the six Mafia networks described in Sect. 2 using GoT and the four canonical node centrality measures DC, BC, CL and CC (see Fig. 1).

² Available in Python 2 at: http://github.com/dcmocanu/centrality-metrics-complex-networks
³ Available in Java at: http://www.emilio.ferrara.name/code/werw-kpath/


GoT seems to be able to identify the most important nodes in the networks in much the same way as DC, BC and CL. CC instead seems to identify different nodes as the most central. To understand whether these measures can effectively find the key roles in a Mafia network, we have to consider the typical hierarchical structure of a Mafia Family [24]. At the top is the Boss. Just below him is the Underboss. In between the Boss and the Underboss are the roles of the Consigliere (i.e. an advisor to the boss) and the Messaggero (i.e. a messenger). Below the Underboss are the Caporegimes, who manage their Soldiers. The lowest level of a family is composed of Associates such as entrepreneurs, police officers, pharmacists and other people who are not actual members of the Mafia but work for it.

Regarding the Montagna Operation, GoT is able to identify the two most important nodes, 18 and 47, which are respectively the Caporegime of the Mistretta family and the deputy Caporegime of the Batanesi family. Mistretta and Batanesi are the two Mafia families at the centre of this investigation. Node 22 is also important because it represents a pharmacist, who can have a key role in drug synthesis processes that require pharmacological and chemical knowledge [11]. Nodes 68, 27 and 25 are Caporegimes, while the others are associates such as entrepreneurs. More details about the roles of nodes in the Montagna Operation can be found in our previous work [24] and on Zenodo [10]. When applied to IS, GoT is able to identify the leaders of the criminal network, such as node 78, which is always at the top rank across all measures, as described by Grassi et al. [28]. The authors discovered 22 leaders in the Infinito Operation, among which we find nodes 114, 9 and 6, also found by GoT. For the Oversize Operation, GoT was able to discover two drug wholesalers (i.e. nodes 26 and 39), a drug seller (i.e. node 13) and also the boss' son and important drug dealer (i.e. node 49).
Fig. 1. 10 top ranked nodes computed with DC, BC, CL, CC and GoT

The results described above seem to be confirmed by our correlation analysis. To make it more comprehensible, we recall that a correlation coefficient can assume any value in the interval between −1 and +1, end values included, and that it is interpreted according to the following rules [40]:

– Values equal to 0 indicate no relationship;
– Values equal to +1 (or −1) indicate a perfect positive (or negative) relationship;
– Values between 0 and 0.3 (or 0 and −0.3) indicate a weak positive (or negative) relationship;
– Values between 0.3 and 0.7 (or −0.3 and −0.7) indicate a moderate positive (or negative) relationship;
– Values between 0.7 and 1 (or −0.7 and −1) indicate a strong positive (or negative) relationship.

Spearman's coefficient (see Fig. 2) shows that in most cases there is a strong negative correlation among GoT, BC and DC, and a moderate negative correlation between GoT and CL. The relationship between GoT and CC is more peculiar: it shows a strong positive correlation in networks such as MM or IS and a moderate or strong negative correlation in the other networks. In OWR we obtained a weak or absent correlation among all the measures. This result is an exception with respect to the other graphs, and it can be explained by the peculiar degree distribution [21] of this network, which has only two hubs while most nodes have only 1 connection. The negative correlation is easily explained by the fact that the most important nodes for GoT are those with the smallest number of vdiamonds and thus the smallest centrality values.

We then continued with the experiments on edge centrality, computing the ten most important edges in the six Mafia networks using GoT, WKP and the canonical edge centrality measure EBC (see Fig. 3). This time, WKP seems to be more precise than GoT in identifying the most important edges found by EBC. Spearman's coefficient in Fig. 4 shows that in all cases WKP has a moderate positive correlation with EBC and a moderate or strong negative correlation with GoT. GoT has a weak or absent correlation with EBC. The correlation between WKP and EBC is an important finding because WKP could be used, for example, in community detection algorithms in which EBC is recalculated at each iteration to choose the edge to remove, until no edges remain. This repeated computation requires a lot of time for very large networks, but it could be done in a much shorter time through WKP. GoT remains a better choice if we want to compute the centrality of nodes and edges in a graph at the same time.
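As a small reading aid for Figs. 2 and 4, the interpretation rules from [40] recalled earlier in this section can be written as a Python function. The function name and the handling of the boundary values 0.3 and 0.7 (assigned here to the stronger class) are assumptions, since the rules as stated leave the endpoints open.

```python
def correlation_strength(rho):
    """Translate a correlation coefficient into a verbal label."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if rho == 0:
        return "no relationship"
    sign = "positive" if rho > 0 else "negative"
    magnitude = abs(rho)
    if magnitude == 1:
        label = "perfect"
    elif magnitude < 0.3:
        label = "weak"
    elif magnitude < 0.7:   # assumption: exactly 0.3 counts as moderate
        label = "moderate"
    else:                   # assumption: exactly 0.7 counts as strong
        label = "strong"
    return f"{label} {sign}"


# Hypothetical coefficients, not values read from the figures.
for rho in (-0.85, -0.5, 0.1, 0.0, 1.0):
    print(rho, correlation_strength(rho))
```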


Fig. 2. Spearman’s ρ rank correlation coeﬃcient among GoT, DC, BC, CL and CC

Fig. 3. 10 top ranked edges computed with EBC, GoT and WKP


Fig. 4. Spearman’s ρ rank correlation coeﬃcient among GoT, WKP and EBC

4 Conclusions

In this work we used real-world Mafia networks to examine the correlation between well-known and more recently proposed centrality measures. A strong correlation between two metrics implies that the one with the larger computational complexity can be replaced by the other. The correlation between the centrality metrics is studied through Spearman's correlation coefficient. An important finding is that DC and BC are strongly negatively correlated with the GoT algorithm. WKP and EBC are positively correlated, while GoT and WKP are negatively correlated. Our results show that GoT has a strong correlation with the classical node centrality metrics, and for this reason it can substitute them in the computation of node centrality in very large networks. It also has a negative correlation with WKP, which can be used to substitute EBC in large networks. As future work, we want to test these correlations on other (and larger) real networks and verify them on community detection algorithms such as Girvan-Newman [27]. We conclude that GoT and WKP, owing to their computational complexity, represent a step forward compared to the classical centrality algorithms and can substitute them in the computation of node and edge centrality, respectively, in very large networks.

References

1. Agreste, S., Catanese, S., De Meo, P., Ferrara, E., Fiumara, G.: Network structure and resilience of Mafia syndicates. Inf. Sci. 351, 30–47 (2016). https://doi.org/10.1016/j.ins.2016.02.027
2. Berlusconi, G., Calderoni, F., Parolini, N., Verani, M., Piccardi, C.: Link prediction in criminal networks: a tool for criminal intelligence analysis. PLoS ONE 11(4), 1–21 (2016). https://doi.org/10.1371/journal.pone.0154244
3. Borgatti, S.P., Everett, M.G., Freeman, L.C.: UCINET for Windows: Software for Social Network Analysis. Analytic Technologies, Harvard, MA (2002)
4. Brandes, U.: On variants of shortest-path betweenness centrality and their generic computation. Soc. Netw. 30(2), 136–145 (2008). https://doi.org/10.1016/j.socnet.2007.11.001
5. Bröhl, T., Lehnertz, K.: Centrality-based identification of important edges in complex networks. Chaos Interdiscipl. J. Nonlinear Sci. 29(3), 033115 (2019). https://doi.org/10.1063/1.5081098
6. Calderoni, F.: Identifying mafia bosses from meeting attendance. In: Masys, A.J. (ed.) Networks and Network Analysis for Defence and Security, pp. 27–48. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-04147-6_2
7. Calderoni, F.: Predicting organized crime leaders. In: Bichler, G., Malm, A. (eds.) Disrupting Criminal Networks: Network Analysis in Crime Prevention, pp. 89–110. Lynne Rienner Publishers, Boulder (2015). http://hdl.handle.net/10807/68084
8. Calderoni, F., Brunetto, D., Piccardi, C.: Communities in criminal networks: a case study. Soc. Netw. 48, 116–125 (2017). https://doi.org/10.1016/j.socnet.2016.08.003
9. Calderoni, F., Catanese, S., De Meo, P., Ficara, A., Fiumara, G.: Robust link prediction in criminal networks: a case study of the Sicilian Mafia. Expert Syst. Appl. 161, 113666 (2020). https://doi.org/10.1016/j.eswa.2020.113666
10. Cavallaro, L., et al.: Criminal Network: The Sicilian Mafia. “Montagna Operation”, July 2020. https://doi.org/10.5281/zenodo.3938818
11. Cavallaro, L., et al.: Disrupting resilient criminal networks through data analysis: the case of Sicilian Mafia. PLoS ONE 15(8), 1–22 (2020). https://doi.org/10.1371/journal.pone.0236476
12. Cavallaro, L., et al.: Graph comparison and artificial models for simulating real criminal networks. In: Benito, R., Cherifi, C., Cherifi, H., Moro, E., Rocha, L., Sales-Pardo, M. (eds.) Complex Networks and Their Applications IX. COMPLEX NETWORKS 2020. Studies in Computational Intelligence, vol. 944, pp. 286–297. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65351-4_23
13. Chen, P., Popovich, P.: Correlation: Parametric and Nonparametric Measures. Sage University Papers Series, no. 07–139, Sage Publications, Thousand Oaks (2002)
14. Crossley, N.: Social Network Analysis (chap. 6), pp. 87–103. Wiley, Hoboken (2019). https://doi.org/10.1002/9781119429333.ch6
15. De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Enhancing community detection using a network weighting strategy. Inf. Sci. 222, 648–668 (2013). https://doi.org/10.1016/j.ins.2012.08.001
16. De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: Mixing local and global information for community detection in large networks. J. Comput. Syst. Sci. 80(1), 72–87 (2014). https://doi.org/10.1016/j.jcss.2013.03.012
17. De Meo, P., Ferrara, E., Fiumara, G., Ricciardello, A.: A novel measure of edge centrality in social networks. Knowl. Based Syst. 30, 136–150 (2012). https://doi.org/10.1016/j.knosys.2012.01.007
18. Duijn, P.A.C., Kashirin, V., Sloot, P.M.A.: The relative ineffectiveness of criminal network disruption. Sci. Rep. 4(1), 4238 (2014). https://doi.org/10.1038/srep04238
19. Ferrara, E., De Meo, P., Catanese, S., Fiumara, G.: Detecting criminal organizations in mobile phone networks. Expert Syst. Appl. 41(13), 5733–5750 (2014). https://doi.org/10.1016/j.eswa.2014.03.024
20. Ferrara, E., De Meo, P., Catanese, S., Fiumara, G.: Visualizing criminal networks reconstructed from mobile phone records. In: CEUR Workshop Proceedings, vol. 1210 (2014)
21. Ficara, A., et al.: Criminal networks analysis in missing data scenarios through graph distances (2021)
22. Ficara, A., et al.: Social network analysis of Sicilian Mafia interconnections. In: Cherifi, H., Gaito, S., Mendes, J.F., Moro, E., Rocha, L.M. (eds.) Complex Networks and Their Applications VIII. COMPLEX NETWORKS 2019. Studies in Computational Intelligence, vol. 882, pp. 440–450. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-36683-4_36
23. Ficara, A., Fiumara, G., De Meo, P., Liotta, A.: Correlations among game of thieves and other centrality measures in complex networks. In: Fortino, G., Liotta, A., Gravina, R., Longheu, A. (eds.) Data Science and Internet of Things. Internet of Things, pp. 43–62. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-67197-6_3
24. Ficara, A., Fiumara, G., De Meo, P., Catanese, S.: Multilayer network analysis: the identification of key actors in a Sicilian Mafia operation. In: Perakovic, D., Knapcikova, L. (eds.) Future Access Enablers for Ubiquitous and Intelligent Infrastructures. FABULOUS 2021. LNICS, SITE, vol. 382, pp. 120–134. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-78459-1_9
25. Ficara, A., Fiumara, G., De Meo, P., Liotta, A.: Correlation analysis of node and edge centrality measures in artificial complex networks. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International Congress on Information and Communication Technology. LNNS 216, vol. 3. Springer, Cham (2021). https://doi.org/10.1007/978-981-16-1781-2_78
26. Freeman, L.C.: Centrality in social networks conceptual clarification. Soc. Netw. 1(3), 215–239 (1978). https://doi.org/10.1016/0378-8733(78)90021-7
27. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002). https://doi.org/10.1073/pnas.122653799
28. Grassi, R., Calderoni, F., Bianchi, M., Torriero, A.: Betweenness to assess leaders in criminal networks: new evidence using the dual projection approach. Soc. Netw. 56, 23–32 (2019). https://doi.org/10.1016/j.socnet.2018.08.001
29. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring network structure, dynamics, and function using NetworkX. In: Varoquaux, G., Vaught, T., Millman, J. (eds.) Proceedings of the 7th Python in Science Conference, Pasadena, CA USA, pp. 11–15 (2008)
30. Kang, U., Papadimitriou, S., Sun, J., Tong, H.: Centralities in large networks: algorithms and observations. In: Proceedings of the 11th SIAM International Conference on Data Mining, SDM 2011, pp. 119–130 (2011). https://doi.org/10.1137/1.9781611972818.11
31. Kendall, M., Gibbons, J.: Rank Correlation Methods. Charles Griffin Book, E. Arnold (1990)
32. von Lampe, K., Johansen, P.O.: Organized crime and trust: on the conceptualization and empirical relevance of trust in the context of criminal networks. Glob. Crime 6(2), 159–184 (2004). https://doi.org/10.1080/17440570500096734
33. Mastrobuoni, G., Patacchini, E.: Organized crime networks: an application of network analysis techniques to the American Mafia. Rev. Netw. Econ. 11(3) (2012). https://doi.org/10.1515/1446-9022.1324
34. Meghanathan, N., Yang, F.: Correlation analysis: edge betweenness centrality vs. neighbourhood overlap. Int. J. Netw. Sci. 1(4), 299–324 (2019). https://doi.org/10.1504/IJNS.2019.102284
35. Mocanu, D.C., Exarchakos, G., Liotta, A.: Decentralized dynamic understanding of hidden relations in complex networks. Sci. Rep. 8(1), 1571 (2018). https://doi.org/10.1038/s41598-018-19356-4
36. Paoli, L.: Italian organised crime: mafia associations and criminal enterprises. Glob. Crime Today Chang. Face Org. Crime 6(1), 19–31 (2004). https://doi.org/10.1080/1744057042000297954
37. Paoli, L.: Mafia Brotherhoods: Organized Crime, Italian style. Oxford University Press, Oxford Scholarship Online (2008). https://doi.org/10.1093/acprof:oso/9780195157246.001.0001
38. Piccardi, C., Berlusconi, G., Calderoni, F., Parolini, N., Verani, M.: Oversize network (2016). https://doi.org/10.6084/m9.figshare.3156067.v1
39. Rajeh, S., Savonnet, M., Leclercq, E., Cherifi, H.: Investigating centrality measures in social networks with community structure. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) Complex Networks and Their Applications IX. COMPLEX NETWORKS 2020. Studies in Computational Intelligence, vol. 943, pp. 211–222. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65347-7_18
40. Ratner, B.: The correlation coefficient: its values range between +1/−1, or do they? J. Target. Meas. Anal. Mark. 17(2), 139–142 (2009). https://doi.org/10.1057/jt.2009.5
41. Ronqui, J.R.F., Travieso, G.: Analyzing complex networks through correlations in centrality measurements. J. Stat. Mech. Theory Exp. 2015(5), P05030 (2015). https://doi.org/10.1088/1742-5468/2015/05/p05030
42. Saramäki, J., Kivelä, M., Onnela, J.P., Kaski, K., Kertész, J.: Generalizations of the clustering coefficient to weighted complex networks. Phys. Rev. E 75(2), 027105 (2007). https://doi.org/10.1103/PhysRevE.75.027105
43. Shao, C., Cui, P., Xun, P., Peng, Y., Jiang, X.: Rank correlation between centrality metrics in complex networks: an empirical study. Open Phys. 16(1), 1009–1023 (2018). https://doi.org/10.1515/phys-2018-0122
44. Spearman, C.: General intelligence, objectively determined and measured. Am. J. Psychol. 15(2), 201–292 (1904). https://doi.org/10.2307/1412107
45. van Steen, M.: Graph Theory and Complex Networks: An Introduction. Maarten van Steen (2010)
46. Valente, T.W., Coronges, K., Lakon, C., Costenbader, E.: How correlated are network centrality measures? Connections (Toronto, Ont.) 28(1), 16–26 (2008)

Common Knowledge on Facebook Communication Networks: Models and Experimental Findings

Sarah McDonald and Gizem Korkmaz

Biocomplexity Institute, University of Virginia, Arlington, VA 22209, USA
{sm9dv,gkorkmaz}@virginia.edu

Abstract. This paper develops a game-theoretic model of collective action on communication networks based on interactions on Facebook. We characterize the communication patterns that facilitate common knowledge and coordination for two cases in which the network structure is (i) known by everyone, and (ii) locally known. We show theoretically the implications of these two models. We find that when the network structure is globally known, agents must learn each other's thresholds through maximal, reciprocal paths of distance-2, or through a common neighbor, to have common knowledge and participation. When the network structure is locally known, all agents must have at least one outgoing link and all agents must be neighbors. We conduct human subject experiments to identify the effects of both network topology and communication and to test the predictions of these models. Our data reveal that choices are affected by the network structure and that they move towards the theoretical predictions with communication.

Keywords: Common knowledge · Collective action · Communication networks · Facebook

1 Introduction

Social media platforms such as Facebook and Twitter are critical tools in organizing collective action, such as Occupy Wall Street, the Arab Spring, the more recent Black Lives Matter protests [5], and the 2021 storming of the US Capitol. These events motivate research aimed at understanding how actionable information and knowledge spread on social media platforms (e.g., [8,12]). In collective actions, such as protests, an individual wants to participate only if joined by others; hence the participation decision depends on their expectation of what others will do [7]. Game-theoretic models of collective action and common knowledge consider a population of heterogeneous agents who differ in their willingness to participate in protests [7,12]. The number of participants at or above which an individual would choose to participate defines the individual's threshold. Coordination requires that people know each other's thresholds and that this information is common knowledge (CK) among a sufficient number of people. These models study how network structure, i.e., local communication patterns, affects CK and participation decisions.

In this paper, we investigate how local interactions on social networks can facilitate coordination and collective action by modeling directed communication on Facebook. We develop game-theoretic models of Facebook-type networks, where information spreads locally, to characterize network structures that facilitate CK among a group of agents leading to collective action. In our model, agents post their thresholds on other agents' walls, where each post is viewable by the wall owner's friends. We discuss the conditions and minimal substructures required for CK in the cases where the network structure is assumed to be globally and locally known. Our model builds upon the work of Korkmaz et al. [12] and explores how the direction of communication may influence network structures that produce CK. The main contributions and findings of this paper are listed below:

1. We develop a game-theoretic model of directed communication on Facebook-type networks. In this model, agents post their thresholds (willingness to participate) on their friends' walls. Our model relaxes the assumption of undirected communication in Korkmaz et al. [12], and assumes that communication can occur in one direction (messages do not need to be reciprocated). This is the first model of Facebook to our knowledge that explores non-reciprocated directed wall posts.
2. We characterize network structures that generate CK of thresholds among a group of agents when the network structure is globally known by everyone (a standard assumption in game-theoretic network models). In this model, as opposed to previous work by Chwe [6,7], we find that cliques (reciprocated communication between every pair of nodes) are not necessarily required for CK to occur. We find that complete tripartite networks with various partitions (presented in Sect. 4), and networks with maximal, reciprocal paths of distance-2 between all pairs of agents, are sufficient substructures for CK.
3. To make the model more realistic, we relax the assumption that the network structure is known by everyone and assume instead that the connections are only locally known by friends. We identify the minimal substructures required for CK to occur. We find that the necessary and sufficient substructure is such that all agents have at least one outgoing link, and for each pair of agents, at least one agent must write on the wall of the other. These findings differ from prior work by Korkmaz et al. [12], which found complete bipartite graphs as the necessary and sufficient substructure.
4. We conduct a human subjects experiment to collect data and test the predictions of our models. We look at different network structures (star, clique, circle), and find that for all these networks, communication increases participation in collective action. Moreover, the conditions and substructures we have theoretically characterized further increase participation in cliques, which result in the highest participation.

The remainder of this paper is organized as follows: Sect. 2 reviews relevant literature. Section 3 introduces the models. In Sect. 4, we formalize the resulting conditions and minimal substructures necessary for common knowledge. In Sects. 5 and 6, we discuss the experimental design and present findings. Section 7 discusses our next steps and areas of further research.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 24–37, 2021. https://doi.org/10.1007/978-3-030-81854-8_3

2 Related Work

This work is related to prior models of coordination and collective action, in particular, threshold models (e.g., Granovetter [9] and Schelling [15]), and models of information diffusion (e.g., Jackson and Lopez-Pintado [10] and Tsai et al. [19]). In threshold-based models, the decision of an agent is based on the number of other agents who have taken the same action; for example, in our model, an agent participates if a set number (their threshold) of other agents also participate. Oliver [13] provides a review of theories and models of collective action. Our research aims to define the characteristics of the network structures that encourage the spread of common knowledge. Centola and Macy [3] and Siegel [16] discuss the importance of the network structure for information diffusion and for collective action. Prior research has also modeled information diffusion on Facebook and Twitter social networks (e.g., [8,14,17]).

Our model builds upon game-theoretic models of collective action in Chwe [6,7] and Korkmaz et al. [12]. The latter is the only study that models "Facebook-type" communication, where an agent posts their threshold on other agents' walls. We extend [12] by allowing for directed (unreciprocated) communication. We discuss our contributions to these stylized models in the Introduction. Experimental studies of common knowledge either use 2-player coordination games [18] or place players on 5-player communication networks [11] to test the theoretical predictions of the Chwe model [6,7].

3 The Model

3.1 Preliminaries

There is a finite set of agents $V = \{1, 2, \dots, n\}$ and each agent $i \in V$ chooses an action $a_i \in \{r, s\}$, where $r$ is 'revolt', the 'risky' action, and $s$ is 'stay at home', the 'safe' action. Each agent $i$ has a private threshold $\theta_i \in \{1, 2, \dots, n+1\}$ representing their willingness to participate; an agent wants to revolt only if the total number of agents who revolt is greater than or equal to her threshold. Agents only know their own thresholds, and the network (described below) allows them to share this information with their neighbors. Given agent $i$'s threshold $\theta_i$ and everyone's actions, her utility is given by

$$
U_i(\theta_i, a_i, a_{-i}) =
\begin{cases}
0, & \text{if } a_i = s \\
1, & \text{if } a_i = r \wedge \#\{j \in V \mid a_j = r\} \ge \theta_i \\
-z, & \text{if } a_i = r \wedge \#\{j \in V \mid a_j = r\} < \theta_i
\end{cases}
$$

where $-z < 0$ is the penalty if she revolts and not enough people join her. Thus, a person will revolt as long as she is sure that there is a sufficient number of people revolting. A person always gets utility 0 by staying at home. When she revolts, she gets utility 1 if the total number of people revolting (including herself) is at least $\theta_i$. Individuals are located on a communication network where they can share $\theta$. We formalize the Facebook network and friends below.

Definition 1 (Network). The communication network is represented as a directed graph $G(V, E)$, where $V = \{1, 2, \dots, n\}$ defines the set of nodes, and $E$ denotes the set of edges. Let the pairwise relationship between two individuals $i$ and $j$ be represented by the pairwise variable $g_{ij}$. When $g_{ij} = 0$, there is no link from $i$ to $j$; when $g_{ij} = 1$, $i$ and $j$ are connected by an outgoing link from $i$ to $j$. In $G$, $g_{ij} = 1$ implies that $i$ (the writer) communicates their threshold $\theta_i$ on $j$'s (the receiver's) "wall" and this is observable by all of $j$'s friends.

Definition 2 (In/Out Neighbors). Let the set of in-neighbors of $i$ be represented by $N_i^{in} := \{j \in V : g_{ji} = 1\}$. Similarly, the set of out-neighbors of $i$ is given as $N_i^{out} := \{j \in V : g_{ij} = 1\}$.

Definition 3 (Friends). The friends of $i$ are defined as $N_i := \{j \in V \setminus \{i\} : j \in N_i^{in} \cup N_i^{out},\ i \in V\}$.

All friends of the receiver can observe the directed communication from the writer to the receiver (the directed link), as well as the writer's threshold. This represents the friend-of-friend communication of Facebook. We also define directed paths of distance-2 (distance-2 in/out neighborhoods) as $N_i^{2,in} := \{j \in V \mid \exists r \in N_i^{in} \wedge j \in N_r^{in}\}$ and $N_i^{2,out} := \{j \in V \mid \exists r \in N_i^{out} \wedge j \in N_r^{out}\}$, respectively.

Definition 4 (Reciprocal Ball). We define the reciprocal ball of $i$ as $B_i^* := \{j \in V : j \in (N_i^{in} \cap N_i^{out}) \vee j \in (N_i^{in} \cap N_i^{2,out}) \vee j \in (N_i^{2,in} \cap N_i^{out}) \vee j \in (N_i^{2,in} \cap N_i^{2,out})\}$. When $j \in B_i^*$, we say that there are maximal, reciprocal paths of distance-2 between $i$ and $j$. Due to symmetry, $i \in B_j^*$ implies that $j \in B_i^*$.

3.2 Knowledge and Common Knowledge

Here we provide a mathematical formalization of common knowledge based on the Stanford Encyclopedia of Philosophy [20]. Let $\Omega$ be the set of possible worlds, which consists of all possible combinations of agents' thresholds and edges in our model. The actual world (the actual thresholds and edges) $\omega_{actual}$ is an element of the set of worlds, so $\omega_{actual} \in \Omega$. Events $E$ are subsets of $\Omega$, so $E \subseteq \Omega$. The event $E$ is true if $E \subseteq \Omega$ and $\omega_{actual} \in E$. Given that $E \subseteq \Omega$, if an agent $i$ knows that event $E$ is the case, we formally state that $K_i(E)$, i.e., the event $E$ is in agent $i$'s knowledge function. For example, since each agent $i$ knows their own threshold, $\theta_i$, all events that have $\theta_i$ will be in agent $i$'s knowledge function.

Definition 5. Agent $i$'s possibility set is defined as $P_i(\omega) \equiv \bigcap \{E \mid \omega \in K_i(E)\}$. The collection of sets $\mathcal{P}_i = \bigcup_{\omega \in \Omega} P_i(\omega)$ is $i$'s private information partition. The knowledge function of $i$ is defined by $K_i(E) = \{\omega \in \Omega \mid P_i(\omega) \subseteq E\}$.


An agent's possibility set indicates all possible worlds that the actual world could be, given the information that the agent has. The knowledge function of an agent states that $i$ knows $E$ if the possibility set of $i$ is contained in $E$. Below we define first order mutual knowledge: $i$ knows $E$ and $j$ knows $E$. Similarly, second order mutual knowledge means that $i$ knows that $j$ knows $E$ and $j$ knows that $i$ knows $E$, and so on, which is defined as $m$th order mutual knowledge. Common knowledge refers to the case when $m \to \infty$.

Definition 6. Let a set $\Omega$ of possible worlds together with a set of agents $N$ be given. The event that $E$ is (first order) mutual knowledge for the agents of $N$, $K_N^1(E)$, is the set defined by $K_N^1(E) \equiv \bigcap_{i \in N} K_i(E)$. The event that $E$ is $m$th order mutual knowledge for the agents of $N$, $K_N^m(E)$, is defined recursively as the set $K_N^m(E) \equiv \bigcap_{i \in N} K_i(K_N^{m-1}(E))$. The event that $E$ is common knowledge among the agents of $N$, $K_N^*(E)$, is defined as the set $K_N^*(E) \equiv \bigcap_{m=1}^{\infty} K_N^m(E)$.
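The knowledge operators in Definitions 5 and 6 can be illustrated on a small finite set of worlds. The following is a minimal sketch, not the authors' implementation: it assumes each agent's information is given directly as a partition of the worlds (a list of sets), and the four-world example and the names `agent_a`, `coarse_b`, and `fine_b` are hypothetical. Because knowing an event implies the event holds, the sequence of mutual-knowledge sets can only shrink, so on a finite set of worlds iterating until nothing changes yields the common-knowledge event.

```python
def cell_of(partition, world):
    """Return P_i(world): the cell of an agent's partition that contains `world`."""
    for cell in partition:
        if world in cell:
            return cell
    raise ValueError(f"{world!r} is not covered by the partition")


def knows(partition, event, worlds):
    """K_i(E): the worlds whose possibility set is contained in the event."""
    return {w for w in worlds if cell_of(partition, w) <= event}


def mutual_knowledge(partitions, event, worlds):
    """First order mutual knowledge: the intersection of K_i(E) over all agents."""
    result = set(worlds)
    for partition in partitions:
        result &= knows(partition, event, worlds)
    return result


def common_knowledge(partitions, event, worlds):
    """Iterate mutual knowledge until it stops shrinking (finite set of worlds)."""
    current = mutual_knowledge(partitions, event, worlds)
    while True:
        following = mutual_knowledge(partitions, current, worlds)
        if following == current:
            return current
        current = following


# Hypothetical example: four worlds and the event E = {1, 2, 3}.
worlds = {1, 2, 3, 4}
event = {1, 2, 3}
agent_a = [{1, 2}, {3, 4}]
coarse_b = [{1, 2, 3}, {4}]
fine_b = [{1, 2}, {3}, {4}]

print(mutual_knowledge([agent_a, coarse_b], event, worlds))  # {1, 2}
print(common_knowledge([agent_a, coarse_b], event, worlds))  # set()
print(common_knowledge([agent_a, fine_b], event, worlds))    # {1, 2}
```

In the first pairing both agents know $E$ in worlds 1 and 2, but the second agent cannot tell that the first one knows it, so second order mutual knowledge already fails and no world supports common knowledge.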

Next, we discuss the implications of assumptions about the network knowledge in our model: (i) known by everyone, and (ii) only locally known.

3.3 Network Structure Globally Known

First, we assume that the network structure is commonly known by everyone – a standard assumption made in game-theoretic network models (e.g., [7]). Here the set of possible states, $\Omega$, is defined only by the thresholds of agents. The set of states can be formally written as $\Omega = \Theta^n = \{1, 2, \dots, n+1\}^n$; a state can be formally written as $\omega = [(\theta_i)_{i \in N}]$. The state $\omega \in \Omega$ is an $n$-tuple. For example, $\omega = (1, 1)$ is the state for two agents who each have threshold 1.

Figure 1 shows three examples of directed networks with four agents. In Fig. 1(A), agent $i$ knows the threshold of $k$ directly, and the threshold of $l$ through the wall of $k$. However, agent $i$ does not know the threshold of $j$ because $j$ does not communicate her threshold to $i$, directly or through the wall of a friend. Therefore, agent $i$ cannot count on $j$ to participate, and no one participates. In Fig. 1(B), we add a bidirectional link between agents $j$ and $l$, meaning that both agents write directly on each other's walls. Now, agent $i$ knows $\theta_j$ through $k$'s wall. Agent $j$ knows $\theta_i$ and $\theta_k$ directly, and $\theta_l$ through $k$'s wall. All agents know the thresholds of all other agents and know that they know each other's thresholds, so CK is achieved, and all agents participate.

Fig. 1. Networks with four agents with $\theta_i = 4\ \forall i$. Panels (A), (B), and (C).
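The participation logic behind these examples follows from the utility function of Sect. 3.1. The snippet below is a small sketch of that function for the four-agent setting of Fig. 1, where every threshold is 4. The value `z = 1.0` is an assumption made for illustration (the text only requires $-z < 0$), and the function and variable names are hypothetical.

```python
def utility(i, thresholds, actions, z=1.0):
    """Utility of agent i; `actions` maps every agent to 'r' (revolt) or 's' (stay home)."""
    if actions[i] == "s":
        return 0.0  # staying at home always yields 0
    revolting = sum(1 for a in actions.values() if a == "r")  # includes i herself
    return 1.0 if revolting >= thresholds[i] else -z


agents = ["i", "j", "k", "l"]
thresholds = {a: 4 for a in agents}      # as in Fig. 1
all_revolt = {a: "r" for a in agents}
without_j = dict(all_revolt, j="s")      # j stays home, everyone else revolts

print(utility("i", thresholds, all_revolt))  # 1.0
print(utility("i", thresholds, without_j))   # -1.0
print(utility("j", thresholds, without_j))   # 0.0
```

With a threshold of 4, a single missing participant turns the payoff from revolting negative, which is why agent $i$ stays home whenever she cannot count on $j$.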

3.4 Network Structure Locally Known

One of the key features of large social networks in the real world is that agents only have local information about the network (who talks to whom). Here we relax the assumption that the network is commonly known by everyone, and instead assume that this information is locally known by friends (i.e., person $i$ can observe his friend $j$'s friends if they write on $j$'s wall). Hence, in this model the possible set of states is defined by the thresholds of all agents and the edges between agents, which can be formally written as $\omega = [(\theta_i)_{i \in N}, (g_{ij})_{i,j \in N}]$. The set of states is $\Omega = \Theta^n \times G$ with $G = \{0, 1\}^{P(n,2)}$, where $P(n,2) = \frac{n!}{(n-2)!}$ is the number of 2-permutations on $n$ nodes. So, the state $\omega \in \Omega$ will be an $(n + P(n,2))$-tuple. For example, $\omega = (2, 2, 2, 0, 0, 0, 0, 0, 0)$ for three agents who each have a threshold of 2 and have no links to other agents.

To exemplify the difference between the models, we return to Fig. 1(B) and assume that the network structure is unknown. Each agent knows the threshold of every other agent. However, $i$ does not observe the communication between $j$ and $l$. Agent $i$ cannot be sure that $l$ knows $\theta_k$. Agent $l$ knows $\theta_k$ through the wall of $j$, but since $i$ does not observe that $j$ and $l$ are friends, $i$ does not know that $l$ knows $\theta_k$. Thus, $i$ cannot count on $l$ to participate, and in this case no agents participate. Now, we assume that agents $i$ and $l$ are friends, and $i$ writes on the wall of $l$ as in Fig. 1(C). Now, $i$ observes that $j$ writes directly on the wall of $l$ because $i$ is a friend of $l$. Agent $i$ now knows that $l$ knows his threshold and that $l$ knows the threshold of $k$. In this case, CK is achieved, and all agents participate since they have sufficiently low thresholds.
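The sizes of the two state spaces can be checked with a few lines of arithmetic. This is an illustrative sketch only; the function names are hypothetical, and `math.perm` (Python 3.8 or later) supplies the number of 2-permutations $P(n,2)$.

```python
from math import perm


def state_length(n):
    """Length of a state tuple when edges are part of the state: n + P(n, 2)."""
    return n + perm(n, 2)


def state_space_size(n, network_globally_known):
    """Number of states: (n + 1)^n threshold profiles, times 2^P(n, 2) edge patterns if local."""
    threshold_profiles = (n + 1) ** n
    if network_globally_known:
        return threshold_profiles
    return threshold_profiles * 2 ** perm(n, 2)


# The three-agent example from the text: thresholds (2, 2, 2) and no links.
n = 3
example_state = tuple([2] * n + [0] * perm(n, 2))

print(example_state)                 # (2, 2, 2, 0, 0, 0, 0, 0, 0)
print(state_length(n))               # 9
print(state_space_size(n, True))     # 64
print(state_space_size(n, False))    # 4096
```

Even for three agents, treating the edges as uncertain multiplies the number of states by $2^6 = 64$, which gives a sense of how much more agents must rule out when the network is only locally known.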

4 Theoretical Results

In this section, we characterize the conditions that are necessary for common knowledge to occur on a communication network. We discuss the conditions for when the network structure is globally and locally known, respectively.

4.1 Globally Known Network Structure

Here we provide the theoretical results of the model with the standard assumption that the network structure is known by everyone in the network.

Theorem 1. Given the actual state $\hat{\omega} \in \Omega$ and the event $\hat{E}$, where $\hat{E} \equiv \{\omega \in \Omega : \omega = [(\theta_i)_{i \in N}] \text{ with } (\theta_i)_{i \in M \subseteq N} = (\hat{\theta}_i)_{i \in M \subseteq N}\}$, $\hat{\omega} \in K_i K_j \dots K_m(\hat{E})$ for all $i, j, \dots, m \in M \subseteq N$ iff $\forall i, j \in M$, (1a) $i \in B_j^*$ $\vee$ (1b) $\exists l \in M : i, j \in N_l^{in}$.

Theorem 1 implies that all agents know each other's thresholds, either directly (1a) or through the wall of a common friend (1b). For a subset of agents to achieve common knowledge, each pair of agents in the subset must know each other's threshold. So, for each pair of agents in the subset, there must be either (1a) reciprocal maximal paths of distance-2 between the agents or (1b) the agents communicate their thresholds to a common friend. Figure 2 illustrates network examples to provide insights about these necessary and sufficient conditions.
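The pairwise conditions of Theorem 1 can be checked mechanically on a small directed graph. The following sketch is an illustration under stated assumptions rather than the authors' code: the subset $M$ is taken to be the whole node set, edges are `(writer, receiver)` pairs, and the node itself is dropped from its own reciprocal ball. All function names are hypothetical.

```python
from itertools import combinations


def in_out_neighbors(nodes, edges):
    """Build N_i^in and N_i^out from (writer, receiver) pairs (Definition 2)."""
    n_in = {v: set() for v in nodes}
    n_out = {v: set() for v in nodes}
    for writer, receiver in edges:
        n_out[writer].add(receiver)
        n_in[receiver].add(writer)
    return n_in, n_out


def two_step(neighbors, i):
    """Distance-2 neighborhood: union of neighbors[r] over r in neighbors[i]."""
    reached = set()
    for r in neighbors[i]:
        reached |= neighbors[r]
    return reached


def reciprocal_ball(i, n_in, n_out):
    """B_i^* from Definition 4, with i itself removed."""
    in1, out1 = n_in[i], n_out[i]
    in2, out2 = two_step(n_in, i), two_step(n_out, i)
    ball = (in1 & out1) | (in1 & out2) | (in2 & out1) | (in2 & out2)
    return ball - {i}


def satisfies_theorem1(nodes, edges):
    """True if every pair meets Cond. (1a) or Cond. (1b), with M = all nodes."""
    n_in, n_out = in_out_neighbors(nodes, edges)
    for i, j in combinations(nodes, 2):
        cond_1a = i in reciprocal_ball(j, n_in, n_out)
        cond_1b = any(i in n_in[l] and j in n_in[l] for l in nodes)
        if not (cond_1a or cond_1b):
            return False
    return True


# A directed 3-cycle, as described for Fig. 2(A): i -> j -> k -> i.
cycle = [("i", "j"), ("j", "k"), ("k", "i")]
print(satisfies_theorem1(["i", "j", "k"], cycle))        # True

# Two agents writing on a third agent's wall, who writes to no one.
shared_wall = [("i", "l"), ("j", "l")]
print(satisfies_theorem1(["i", "j", "l"], shared_wall))  # False
```

In the second example the pair $(i, j)$ meets Cond. (1b) through $l$'s wall, but $l$ never communicates her own threshold, so the pairs involving $l$ fail both conditions.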

Fig. 2. Networks illustrating the conditions of Theorem 1. (A) Cond. (1a) is satisfied in a 3-node network. (B) Cond. (1a) is satisfied in a 4-node network. (C) Cond. (1b) is satisfied (every pair of agents has a common friend they communicate to). (D) Either Cond. (1a) or (1b) is satisfied for different pairs of nodes.

Figure 2(A) illustrates a 3-person network forming a cycle. In this network, all agents are in the reciprocal ball of each other (Cond. (1a)), hence they have CK about each other's thresholds (they know each other's thresholds and they know that they know and so on). In Fig. 2(B), four agents share CK of thresholds as Cond. (1a) is again satisfied. Figure 2(C) shows a graph where Cond. (1a) is not satisfied for some pairs of agents (i.e., il, jl, lk), but there is a common friend that both agents communicate their threshold to (Cond. (1b)). For example, agents $i$ and $l$ both write to the wall of agent $j$. In this network, agent $l$ does not have any incoming links. Figure 2(D) presents a network structure where agent pairs satisfy either Cond. (1a) (ij, il, ik, jk, kl) or Cond. (1b) (jl).

There are no clear graph sub-families that include all graphs that meet the conditions in Theorem 1. However, we formally state three interesting network substructures that are sufficient for common knowledge to occur in Theorem 2. We use the naming convention provided in [2] for the tripartite graphs, namely, 10-030C and 12-120D triads.

Theorem 2. If there is a graph $G(V, E)$ and a subgraph formed by $M \subseteq N$ such that (2a) there are maximal, reciprocal distance-2 paths between nodes; or (2b) the graph forms a complete tripartite graph with cyclic partitions (10-030C triads); or (2c) the graph forms a complete tripartite graph with partitions in the form of a 12-120D triad, then $\hat{\omega} \in K_M^*(\hat{E})$ $\forall m \in M$.

Figure 3 illustrates network substructures that generate CK of thresholds among a group of agents. Figure 3(A) shows the substructure referred to as a complete tripartite with cyclic partitions (10-030C triads) [2], and a corresponding 4-node example network. In 10-030C triads, there are reciprocal distance-2 paths between agents in each partition P1, P2, and P3. For example, all agents in P1 write to all agents in P2. All agents in P2 write on the wall of all agents in P3.
Since all agents in P1 write to all agents in P2, there is some node in P2 such that each pair of agents in P1 write to that node, and thus can observe each other's threshold. Figure 3(B) shows complete tripartite partitions in the form of a 12-120D triad [2] and a corresponding example network. Agents in P1 and P3 know each other's thresholds directly. Agents in P2 and P3 know each other's thresholds through the walls of agents in P1; similarly, agents in P1 and P2 know each other's thresholds through the wall of P3. All agents within a partition know each other's thresholds. All agents in P2 write to a common wall in P3, all agents in P3 write to a common wall in P1, and all agents in P1 write to a common wall in P3. Hence, CK is obtained in this network.

Fig. 3. Network substructures identified in Theorem 2 that generate CK of thresholds. P1, P2, and P3 represent partitions of nodes. (A) Complete tripartite digraph with cyclic partitions (top) and a 4-node network example (bottom-left). (B) Complete tripartite digraph with partitions in the form of 12-120D (top) and a 4-node example (bottom-left). (C) 5-node tripartite digraph with non-complete partitions.

As Theorem 2 suggests, the three network structures characterized are sufficient but not necessary. In Fig. 3(C), we show an example of a graph that is a non-complete tripartite graph, where partitions are not based on 10-030C or 12-120D triads, but each node is contained in a 10-030C or 12-120D triad. In Fig. 3(C), agent $i$ learns the thresholds of all other agents directly, except for agent $k$, whose threshold is learned through the wall of agent $l$. Agent $j$ learns $\theta_k$ directly, $\theta_l$ and $\theta_m$ through the wall of agent $i$, and $\theta_i$ through the wall of agent $k$. Agent $k$ learns $\theta_j$, $\theta_l$, and $\theta_m$ through the wall of agent $i$, and $\theta_i$ directly. Agent $l$ learns $\theta_i$, $\theta_k$, and $\theta_m$ directly, and $\theta_j$ through the wall of agent $i$. Agent $m$ learns $\theta_j$ and $\theta_l$ through the wall of agent $i$, and $\theta_i$ and $\theta_k$ through the wall of agent $l$. So, each agent knows the threshold of all other agents. While the graph is tripartite, it is not complete. Agents $j$ and $m$ have no direct communication and are not in the same partition.

4.2 Locally Known Network Structure

When the network structure is not assumed to be known (and is part of the uncertainty of the actual state), agents need to know, in addition to others' thresholds, what others know; hence they need to observe the communication between others, at least locally. The conditions for CK when the network is locally known are stated in Theorem 3.

Theorem 3. Given the actual state $\hat{\omega} \in \Omega$ and the event $\hat{E}$, where $\hat{E} = \{\omega \in \Omega : \omega = [(\theta_i)_{i \in N}, (g_{ij})_{i,j \in N}] \text{ with } (\theta_i)_{i \in M \subseteq N} = (\hat{\theta}_i)_{i \in M \subseteq N} \text{ and } (g_{ij})_{i,j \in M \subseteq N} = (\hat{g}_{ij})_{i,j \in M \subseteq N}\}$, $\hat{\omega} \in K_i K_j \dots K_m(\hat{E})$, $\forall i, j, \dots, m \in M \subseteq N$ if and only if (3a) $\forall i, j \in M$, $i \in B_j^* \vee \exists l \in M : i, j \in N_l^{in}$ and (3b) $\forall i, j \in M$, $\forall k \in M \setminus \{i, j\}$, $k \in N_i \cap N_j$.


Theorem 3 retains Cond. (1a) and (1b) of Theorem 1 (stated as Cond. (3a) above) because these conditions ensure that each pair of agents communicate their thresholds to one another. Additionally, each agent must know that all other agents know their threshold. As a direct result of how we define communication in our model, agents observe the communication between a writer and a receiver if they are friends with the receiver. Since it is possible that a receiver agent also writes back to the writer, to know the existence of any edge between a pair of agents, all other agents must be friends with both agents in the pair, which is formalized in Cond. (3b).

Fig. 4. Networks illustrating the conditions of Theorem 3. (A) Cond. (3a) and (3b) are satisfied. (B) Cond. (3a) is satisfied, but Cond. (3b) is not satisfied. (C) Cond. (3b) is satisfied, but Cond. (3a) is not satisfied in a 3-node network.
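Cond. (3b) and the observation rule described above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: $M$ is taken to be the whole node set, and it is assumed that an agent also observes posts she writes herself or receives on her own wall. The three-node edge list is read off the textual description of Fig. 4(C) (the figure itself is not reproduced here), so it should be treated as an assumption; all function names are hypothetical.

```python
from itertools import combinations


def friends(nodes, edges):
    """N_i from Definition 3: everyone i writes to or is written to by."""
    result = {v: set() for v in nodes}
    for writer, receiver in edges:
        result[writer].add(receiver)
        result[receiver].add(writer)
    return result


def observed_edges(k, edges, friend_sets):
    """Posts k can see: those involving k, plus posts on the walls of k's friends."""
    return {(w, r) for (w, r) in edges if k in (w, r) or r in friend_sets[k]}


def satisfies_cond_3b(nodes, edges):
    """Cond. (3b): for every pair (i, j), every other agent k is a friend of both."""
    friend_sets = friends(nodes, edges)
    return all(
        k in friend_sets[i] and k in friend_sets[j]
        for i, j in combinations(nodes, 2)
        for k in nodes
        if k not in (i, j)
    )


nodes = ["i", "j", "k"]
fig_4c = [("i", "j"), ("i", "k"), ("j", "k")]   # assumed from the description of Fig. 4(C)
sparser = [("i", "k"), ("j", "k")]              # i and j are no longer friends

print(satisfies_cond_3b(nodes, fig_4c))                               # True
print(satisfies_cond_3b(nodes, sparser))                              # False
print(sorted(observed_edges("k", fig_4c, friends(nodes, fig_4c))))    # all three edges
```

In the sparser network, agent $i$ is not a friend of $j$, so the pair $(j, k)$ has an outside agent who is not connected to both of them and Cond. (3b) fails.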

Figure 4(A) shows an example of a graph where both conditions are met. There are reciprocal, maximal distance-2 paths between agents $i$ and $k$, $l$ and $k$, $i$ and $l$, $j$ and $k$, and $j$ and $l$. Agents $i$ and $j$ both write on the wall of agent $k$. All agents are neighbors, so for every pair of agents, all other agents are neighbors with both agents in the pair.

In Fig. 4(B), each agent knows the threshold of every other agent directly or indirectly (Cond. (3a) is satisfied). However, agent $i$ does not observe that agent $j$ writes on agent $k$'s wall, although agent $i$ observes that agent $k$ writes on agent $j$'s wall because $i$ is neighbors with $j$. Cond. (3b) ensures that each agent knows the communication between others, i.e., what others know. Further, Cond. (3a) does not necessarily imply Cond. (3b). Agent $j$ is not in the neighborhood of agent $l$. So, agent $j$ does not observe that agent $k$ writes on agent $l$'s wall or that agent $i$ writes on agent $l$'s wall.

In Fig. 4(C), Cond. (3b) is fulfilled. Agent $i$ knows that $j$ writes to $k$, because $i$ is in the neighborhood of $j$ and $k$. Agent $j$ knows that $i$ writes on the wall of $k$ because agent $j$ is in the neighborhood of $i$ and $k$. Finally, agent $k$ knows that agent $i$ writes on the wall of agent $j$ because agent $k$ is in the neighborhood of $i$ and $j$. Each agent observes the links between all other agents. However, agent $k$ does not write to the wall of any agent, so Cond. (3a) is not satisfied. Hence, this example also shows that Cond. (3b) does not imply Cond. (3a).

Theorem 4. Given a graph $G(V, E)$, if there exists a set of people $M \subseteq N$ such that $\hat{\omega} \in K_M^*(\hat{E})$, $\forall m \in M$, then there must exist a subgraph formed by this set $M$ such that (4a) $\forall i \in M$, $N_i^{out} \neq \emptyset$ and (4b) $\forall i, j \in M$, $i \in N_j$.

We find and prove that for CK to occur, the minimal substructure is a graph such that all agents have at least one outgoing link, and all agents are neighbors.


Cond. (4a) implies that each node must have an outgoing link. If a pair of agents are in the reciprocal ball of each other, then by the definition of the reciprocal ball, each agent has at least one outgoing link. If a pair of agents both write on the wall of another agent, then they both have at least one outgoing link. Cond. (4b) implies that each node must be neighbors with all other nodes. Returning to Fig. 4(A), each agent has at least one outgoing link, and all agents are neighbors, showing that this graph is a minimal substructure for CK to occur.
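The two necessary conditions of Theorem 4 are simple to test on a candidate subgraph. The sketch below is illustrative only: it takes $M$ to be the whole node set, uses `(writer, receiver)` pairs for edges, and reports the two conditions separately. The function name is hypothetical, and the second edge list is the one assumed from the textual description of Fig. 4(C), in which agent $k$ writes to no one.

```python
def theorem4_conditions(nodes, edges):
    """Return (cond_4a, cond_4b) for the subgraph on `nodes`."""
    out = {v: set() for v in nodes}
    friends = {v: set() for v in nodes}
    for writer, receiver in edges:
        out[writer].add(receiver)
        friends[writer].add(receiver)
        friends[receiver].add(writer)
    cond_4a = all(out[v] for v in nodes)  # every agent has an outgoing link
    cond_4b = all(j in friends[i] for i in nodes for j in nodes if i != j)  # all are neighbors
    return cond_4a, cond_4b


nodes = ["i", "j", "k"]
cycle = [("i", "j"), ("j", "k"), ("k", "i")]
fig_4c = [("i", "j"), ("i", "k"), ("j", "k")]   # assumed from the description of Fig. 4(C)

print(theorem4_conditions(nodes, cycle))    # (True, True)
print(theorem4_conditions(nodes, fig_4c))   # (False, True)
```

Since the theorem states necessary conditions, a result of `(True, True)` does not by itself establish CK; a failure of either condition rules it out.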

5 Experimental Design

In this section, we describe the laboratory study conducted to characterize common knowledge and collective action on communication networks. With the experiments, we aim to (i) study how communication affects participation in collective action, (ii) explore the effects of network structure, and (iii) test the predictions of our theoretical models by analyzing the change in participation levels when the conditions specified by the theorems in Sect. 4 are met as a result of subjects' messaging patterns.

We developed an experimental platform using oTree [4] and recruited 120 students (80% undergraduate and 10% graduate students; most (86%) between ages 18 and 22; 62% identified as female). The experimental design (completely randomized crossover design) included 8 sessions of 15 subjects who were randomly divided into 3 groups of 5 players at the beginning of the sessions. Players completed 15 independent rounds, and in each round, the incentivized task of the coordination game was to choose (simultaneously with the other group members) whether to participate in a group activity or not. Each player was assigned a random avatar and a threshold (high or low) that dictated how many other players from their group needed to participate for them to make the highest payoff. The experiments are designed as one-shot games such that rounds are completely independent from each other and players do not observe the outcome of the rounds until the end of the experiment. Future work may explore repeated games where individual perceptions of each player are based on previous outcomes (Antonioni et al. [1]).

Between-session conditions involved two messaging conditions (no messaging, and Facebook (FB)-wall messaging), and two network knowledge conditions: global (players were provided the full network structure of their group) or local (players only observed their neighbors/friends). Within-session conditions (15 rounds) involved the thresholds of agents (low, 1, or high, 3) and the network structure of the 5-player group (star, circle, clique), which determines players' friends in the group. Friends could observe each other's thresholds and could use the messaging tool to post their intentions to participate on each other's walls in the FB-wall messaging sessions. Players could choose to post either "I will participate" or "I will not participate" on their own or their friends' walls. Each player goes through five rounds of each network structure, at five different threshold combinations (so there are 15 subsessions per session), and chooses whether to participate in the group activity.
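The counts in this design can be tallied with simple arithmetic. The sketch below is only a bookkeeping aid: it assumes the 8 sessions are split evenly across the four between-session conditions, which the text does not state explicitly, and all variable names are hypothetical. Under that assumption the tally gives 30 group-level observations per network structure and messaging condition, which is the number of data points reported in Sect. 6.

```python
sessions = 8
subjects_per_session = 15
group_size = 5
structures = ["star", "circle", "clique"]
threshold_combinations = 5
messaging_conditions = 2      # no messaging, FB-wall messaging
knowledge_conditions = 2      # global, local

subjects = sessions * subjects_per_session                       # recruited students
groups_per_session = subjects_per_session // group_size
rounds_per_session = len(structures) * threshold_combinations

# Assumption: sessions are divided evenly over the 2 x 2 between-session conditions.
sessions_per_condition = sessions // (messaging_conditions * knowledge_conditions)

# Group-level observations for one structure under one messaging/knowledge condition.
observations_per_cell = sessions_per_condition * groups_per_session * threshold_combinations

print(subjects, groups_per_session, rounds_per_session)   # 120 3 15
print(sessions_per_condition, observations_per_cell)      # 2 30
```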

6 Experimental Results

Testing Global Network Knowledge Model. We analyze whether the theoretical predictions of our models result in an increase in participation. Based on our model, in the global network knowledge and wall messaging cases we expect all players to participate if the conditions in Theorem 1 hold (for each pair of players, either there are reciprocal paths of distance-2 between them or they both write on the wall of their common friend). We collected 30 data points (groups of 5 players in each) for each network structure and communication type combination.¹

Fig. 5. Comparison of participation for globally known network structure, by network type, communication type, and Theorem 1 conditions.

Figure 5 provides stacked bar charts that show participation in a given network structure, subset by communication type, and if the conditions from Theorem 1 were satisﬁed. The light green indicates the cases where all 5 players in the group chose to participate. First, we observe that the network structure plays an important role; cliques (where all players are connected to four neighbors) resulting in the highest group participation for all cases, followed by star, and circle (every player has two neighbors). When there is no communication, the cases where all 5 players in the group participate are low for circle and star networks (6 out of 30 sessions). For the clique, all players participated in 16 of the 30 sessions. This is due to the fact that even in absence of messaging, there 1

In the wall messaging sessions, we remove the cases when a participant posted conﬂicting messages to their neighbors.

Common Knowledge on Facebook

35

is a high level of connectivity in these networks and every player in the group can observe every other players’ thresholds. When messaging is introduced, we compare the cases when the conditions of Theorem 1 are met as a result of the messaging decisions of the players, and when the conditions are not met. We predict an increase in participation for the former. In the circle networks, in 11 of the 27 networks, all players chose to participate. Additionally, the conditions of Theorem 1 are met in nine of the 11 networks for which all players participated, demonstrating that when the conditions for common knowledge are met, the number of networks in which all players participate increases. We observe similar behavior in the star networks. Finally, for the cliques, we observe that all players in 24 of the 29 networks participated (again, high participation due to high level of connectivity compared to other networks). Of those 24 cases, 22 networks satisﬁed the conditions of Theorem 1. In all cases where Theorem 1 is met, all players participated. In all networks where Theorem 1 was satisﬁed, at least three players participated. Testing Local Network Knowledge Model. We explore the results for the cases where the network structure is locally known. Theorem 4 conditions require that all agents are neighbors and have at least one outgoing link. We focus on cliques to test whether these conditions result in increased participation, as they are the only structures that could facilitate these communication patterns. Figure 6 shows the level of participation (number of sessions) for the clique networks under no messaging and wall messaging. When there is no communication, in 10 out of the 30 clique sessions all ﬁve players participated. As discussed before, a high level of participation even in absence of messaging is due to the high level of connectivity (and access to thresholds) in this network type. 
When players communicated through wall posts, all five participants chose to participate in 29 out of the 30 networks. We observe that when the conditions of Theorem 4 were met, the number of players who participate increases.

Fig. 6. Comparison of participation for locally known network structure in clique networks by communication type and Theorem 4 conditions.

7

Conclusion and Further Research

We introduce a game-theoretic model of collective action on networks with directed Facebook communication. This work characterizes the minimal substructures required for common knowledge to occur. We test the models using human subject experiments. Future work involves using real network data (e.g., [21]) to study the dynamics of our models in larger complex networks, modeling


S. McDonald and G. Korkmaz

different communication platforms (e.g., Twitter), and open-form messaging in our experiments. Analysis of the content of these messages will provide insights about the impact of the message tone/sentiment on collective action.

Acknowledgment. This material is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-17-1-0378. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Air Force.

References

1. Antonioni, A., Martinez-Vaquero, L.A., Mathis, C., Peel, L., Stella, M.: Individual perception dynamics in drunk games. Phys. Rev. E 99(5), 052311 (2019)
2. Batagelj, V., Mrvar, A.: A subquadratic triad census algorithm for large sparse networks with small maximum degree. Soc. Netw. 23(3), 237–243 (2001)
3. Centola, D., Macy, M.: Complex contagions and the weakness of long ties. Am. J. Sociol. 113(3), 702–734 (2007)
4. Chen, D.L., Schonger, M., Wickens, C.: oTree: an open-source platform for laboratory, online, and field experiments. J. Behav. Exp. Financ. 9, 88–97 (2016)
5. Choudhury, M.D., Jhaver, S., Sugar, B., Weber, I.: Social media participation in an activist movement for racial equality. In: Proceedings of the 10th International Conference on Web and Social Media, ICWSM 2016, pp. 92–101 (2016)
6. Chwe, M.S.Y.: Structure and strategy in collective action. Am. J. Sociol. 105(1), 128–156 (1999). https://www.jstor.org/stable/10.1086/210269
7. Chwe, M.S.Y.: Communication and coordination in social networks. Rev. Econ. Stud. 67(1), 1–16 (2000). http://www.jstor.org/stable/2567025
8. Gonzalez-Bailon, S., Borge-Holthoefer, J., Rivero, A., Moreno, Y.: The dynamics of protest recruitment through an online network. Nature 1, 1–7 (2011)
9. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6), 1420–1443 (1978). https://www.jstor.org/stable/2778111
10. Jackson, M.O., Lopez-Pintado, D.: Diffusion and contagion in networks with heterogeneous agents and homophily. Netw. Sci. 1(1), 49–67 (2013)
11. Korkmaz, G., Capra, M., Vega-Redondo, F., et al.: Coordination and common knowledge on communication networks. In: Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, pp. 1062–1070 (2018)
12. Korkmaz, G., Kuhlman, C.J., Marathe, A., Marathe, M.V., Vega-Redondo, F.: Collective action through common knowledge using a Facebook model. In: Lomuscio, A., Scerri, P., Bazzan, A., Huhns, M. (eds.) Proceedings of the 13th International Conference on Autonomous Agents and Multiagent Systems, pp. 253–260 (2014)
13. Oliver, P.E.: Formal models of collective action. Annu. Rev. Sociol. 19, 271–300 (1993). https://doi.org/10.1146/annurev.so.19.080193.001415
14. Romero, D.M., Meeder, B., Kleinberg, J.: Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on Twitter. In: World Wide Web (2011)
15. Schelling, T.C.: Micromotives and Macrobehavior. W. W. Norton, New York City (1978)
16. Siegel, D.A.: Social networks and collective action. Am. J. Polit. Sci. 53(1), 122–138 (2009)
17. Sun, E., Rosenn, I., Marlow, C.A., Lento, T.M.: Gesundheit! Modeling contagion through Facebook news feed. In: Proceedings of the 3rd International ICWSM Conference, pp. 146–153 (2009)
18. Thomas, K.A., DeScioli, P., Haque, O.S., Pinker, S.: The psychology of coordination and common knowledge. J. Pers. Soc. Psychol. 107(4), 657 (2014)
19. Tsai, J., Bowring, E., Marsella, S., Tambe, M.: Empirical evaluation of computational fear contagion models in crowd dispersions. JAAMAS 27, 200–217 (2013)
20. Vanderschraaf, P., Sillari, G.: Common knowledge. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy (2014)
21. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction in Facebook. In: WOSN (2009)

Logistics Route Planning in Agent-Based Simulation and Its Optimization Represented in Higher-Order Markov-Chain Networks

Ryota Ikai¹, Shigeyuki Miyagi¹,², and Osamu Sakai¹,²(✉)

¹ Department of Electronic Systems Engineering, The University of Shiga Prefecture, Hassaka-cho 2500, Hikone, Shiga 522-8533, Japan
[email protected], [email protected]
² Regional ICT Research Center for Human, Industry and Future, The University of Shiga Prefecture, Hassaka-cho 2500, Hikone, Shiga 522-8533, Japan

Abstract. Route planning in logistics, in which multiple pickup and delivery positions exist in a road network, is a complicated task with many choices in path selection, each of which influences the subsequent procedures. Solving this task by multi-agent simulations, we examine the route optimization process by monitoring motions in networks based on simple or higher-order Markov chains (MCs). Agent footprints, which spread over the entire network in the initial phase, converge on a small number of edges as the transportation path gets shorter. When we increase the order of MCs in agent mobilities, the MC networks are enlarged and possess a large number of nodes and edges with structural regularity, so that one node contains a partial trace history, while the optimized route, which frequently overlaps edge groups with high transition probabilities, is equivalent to a smaller and more noticeable subgraph around a local optimal solution. In other words, this localization of the traces indicates a convergence level in optimization, which can be a measure for route planning in logistics.

Keywords: Route planning · Markov chain · Network visualization

1

Introduction

Logistics activities or freight transports currently play crucial roles both in our daily life and in supply chain management for industrial activities [1–3]. Urban freight transport involves a number of elements associated with its efficiency, such as the geographical road system, vehicle capacity, real constraints and demands, and so forth [4]. In this study, we focus on route planning with multiple pickup and delivery locations. A related and well-known traditional topic is

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 38–50, 2021. https://doi.org/10.1007/978-3-030-81854-8_4

Higher-Order Markov-Chain Networks for Logistics


the Travelling Salesman Problem (TSP), in which one seeks the shortest route that passes all destinations once [5]. As a more practical task, the Vehicle Routing Problem (VRP) generalizes TSP by matching the solution in route optimization with pickup/delivery requests [6,7]. Such route planning problems are closely related to tasks in network science [8]. Feasible paths for vehicles are usually so complex that representing them visually as a network or graph with nodes and edges is indispensable, where nodes indicate geographical locations, including pickup, delivery and passing points, and edges correspond to possible paths along which a vehicle in a real circumstance or an agent in a virtual phase moves. Based on such geographical networks, mobility networks in which nodes are departure, arrival or temporally recorded locations have been examined in a number of previous studies [9–11]. Statistical properties of human mobility networks clarified so far from monitored big data are categorized as Lévy flight random walks [12], memoryless deterministic walks [13,14], and so forth.

In our case, in which vehicles move so as to achieve the shortest trace, route planning is a kind of optimization problem, different from statistical analyses, and memories of previously visited locations affect the optimal selection(s) of possible future locations. To include hysteresis in understanding such vehicle behaviors, the concept of Markov chains (MCs) [15], which represent memory effects through a varying discrete order, is classical but powerful in various areas of computer science. For instance, in reinforcement learning, Markov decision processes indicate that events and states in the past can determine future probabilities of possible actions and environments [16].
In our previous study, we demonstrated that the consistency in mobilities between sightseeing vehicles and virtual agents across MCs of varying order reveals the degree of memory effects that are dominant in a series of vehicle actions [17]. In this study, we perform route planning by multi-agent simulations and examine the temporary solutions of agents using network visualization. The multi-agent simulations in a square-lattice network are performed by changing transition probabilities from one node to another, and we approach the optimal solution for route planning with multiple pickup and delivery locations, following previous similar studies based on reinforcement learning. What we find and stress in this study is that such route optimization processes are well visualized in higher-order MC networks that vary from closely packed smaller networks to relatively loose larger networks. These findings are significant both as a new kind of network analysis and as a measure for practical vehicle route planning for freight transport.

2

Higher-Order Markov Chain Networks

The concept of MCs, or Markov information sources, is well known in information theory [15] and has many historical applications in various scientific fields [18,19]. A simple MC and a second-order Markov chain are popular in graph representation, and a conceptual analogue forms a fundamental part of reinforcement learning, where it is referred to as a Markov decision process [16]. In more specific


R. Ikai et al.

Fig. 1. Graphs derived, analyzed and used in this study. All edges between nodes are bidirectional. (a) Simple (or first-order) Markov-chain (MC) network for 5 × 5 square-lattice network. (b) Second-order MC network based on (a). (c) Third-order MC network based on (a). (d) Modified third-order MC network based on (a). (e) Fourth-order MC network based on (a).

previous research achievements, person trip patterns are modelled by MCs [21], and probabilistic processes in human mobility behaviors are investigated using profile hidden Markov models [22]. Community detection by Markov-stability optimization in complex networks is also a promising method beyond the typical means based on modularity [20]. In our previous study, we defined and used higher-order MC networks whose structure includes information on the geometrical structure on which vehicles can move and on their previously visited locations as memories [17]. One point we newly introduced is that not only simple or second-order MCs but also higher-order MCs are useful, since the varying order adjusts the length of the memories that vehicle mobilities possess. A network of higher-order MCs is huge in comparison with lower-order cases, and traces of commercial vehicles observed in a given sightseeing area in a Japanese city fall into the category of scale-free networks.


Fig. 2. Degree distributions of graphs shown in Fig. 1. Instead of probability, raw data of node count are plotted in vertical axis. Statistical network parameters of nodes and edges are described in the inset.

However, in this study, the structural information embedded in higher-order MC networks is simpler; the graph composed of spatial nodes and paths between them (i.e., the simple or first-order MC network) is limited to a small regular lattice network in our current model, so as to contain general essentials that are valid universally. The higher-order MC networks then preserve regularity in their network topology (see Fig. 1). The details of higher-order MC networks are described in our previous study [17], and we briefly review them in the following. For an order $x$ of a Markov information source in the general definition, the transition probability to a state $n_i$, which is equivalent to a spatial location at iteration step $i$, is given by:

$$p(n_i \mid n_{i-1}, \cdots, n_{i-x}) := p(n_i = n \mid n_{i-1} = n^{(1)}, \cdots, n_{i-x} = n^{(x)}). \tag{1}$$

This depends on the series of $x$ locations visited before $n_i$ (i.e., $n_{i-1}, \cdots, n_{i-x}$), where $n$ and $n^{(x)}$ represent the geographical positions at the current iteration step and at $x$ iterations before, respectively; in this study, all positions or locations are in a square-lattice network, which is a simple or first-order MC network. In the $x$th-order MC network, an edge is directional, starting from a node that is a group of geographical locations ($n_{i-1}$-$n_{i-2}$-$\cdots$-$n_{i-x}$) and ending at that of ($n_i$-$n_{i-1}$-$\cdots$-$n_{i-(x-1)}$). That is, when the shortest route is composed of $x + 1$ locations, the $x$th-order MC network is appropriate to represent the route as a very simple node-node pair with a single edge. In other words, we obtain this simple representation after constructing a first-order MC network, a second-order MC network, and further $i$th-order MC networks, where the maximum $i$ is $x$.

In this study, another representation of higher-order MC networks is introduced to reduce the construction process. While considering Eq. (1) as a basis, we create a modified $x$th-order MC network using an edge linking the node ($n_i$-$n_{i-1}$-$\cdots$-$n_{i-(x-1)}$) with ($n_{i+(x-1)}$-$\cdots$-$n_{i+1}$-$n_i$). A pair of adjacent nodes shares one common element, or one geographical location $n_i$, and the past locations ($n_{i-1}, \cdots, n_{i-(x-1)}$) are linked by the edge with the future ones for $n_i$ ($n_{i+(x-1)}, \cdots, n_{i+1}$). By this representation, the route composed of $x + 1$ locations is shown as a very simple node-node pair with one edge in a $(x/2+1)$th-order MC network.

Figure 1 shows first-, second-, third- and fourth-order MC networks when the first-order MC network, i.e., the graph of the geographical locations for agents, is the 5 × 5 square-lattice network. As shown in Fig. 2, the numbers of nodes and edges increase drastically as the order $x$ of MCs gets higher, while the degree distributions remain in a similar range, where we measure degree by counting both incoming and outgoing edges ($k_{in}$ and $k_{out}$, respectively). As we noted, a fixed-length route on the geographical network is represented more compactly as the order gets higher. Although it may not be realistic to create MC networks with an order higher than 10, this visualization for route simplification is a merit of creating MC networks. We note that the configuration (in Fig. 1) and the degree distribution (in Fig. 2) of the modified higher-order MC networks are exceptional. Each node belongs to one of two subgraphs, since the two subgraphs do not share nodes derived from the trace sequences. The degree is higher than in the regular MC networks because the partial trace shared by each node becomes larger.
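The construction of the regular $x$th-order MC network described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `lattice_neighbors` and `higher_order_network` are hypothetical, immediate backtracking is assumed to be allowed, and every directed edge is counted separately. The modified representation is not covered.

```python
from itertools import product


def lattice_neighbors(size=5):
    """Adjacency of a size x size square lattice (the first-order MC network)."""
    nbrs = {}
    for r, c in product(range(size), repeat=2):
        cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        nbrs[(r, c)] = [(a, b) for a, b in cand if 0 <= a < size and 0 <= b < size]
    return nbrs


def higher_order_network(nbrs, order):
    """Nodes are walks of `order` locations, newest first; edges shift a walk by one step."""
    # Enumerate all walks with `order` locations (assumption: backtracking allowed).
    walks = [(n,) for n in nbrs]
    for _ in range(order - 1):
        walks = [(nxt,) + w for w in walks for nxt in nbrs[w[0]]]
    # Edge from (n_{i-1}, ..., n_{i-x}) to (n_i, n_{i-1}, ..., n_{i-(x-1)}).
    edges = [(w, (nxt,) + w[:-1]) for w in walks for nxt in nbrs[w[0]]]
    return walks, edges


nbrs = lattice_neighbors(5)
for x in (1, 2, 3):
    nodes, edges = higher_order_network(nbrs, x)
    print(f"order {x}: {len(nodes)} nodes, {len(edges)} directed edges")
```

Under these assumptions the node count of each order equals the directed-edge count of the order below, which is one way to see why the networks grow so quickly with $x$. The counts printed here need not match the inset of Fig. 2 if the original construction counts edges or excludes walks differently.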

3

Agent-Based Simulations in Square-Lattice Network

3.1

Algorithm Used for Route Optimization

Multi-agent simulations have been studied extensively, with successful achievements in simulating motions of agents that mimic players in the physical environment and their interactions. VRP is one of the possible examples for which multi-agent simulations are valid. After rigorous mathematical classification of solutions using search trees [6], some previous studies on VRP were successful in route planning [7]. Reinforcement learning frequently supports such agent simulations for route optimization [2], and here we also perform agent-based simulations [23], empowered by techniques suggested by reinforcement learning. We stress that our optimization method might not be the best for shortening the time needed to find solutions for VRP; we aim at visualizing the results of agent-based simulations in large-size networks to verify their validity.

Fig. 3. One example of agent-based simulation run. (a) Pickup (or production) and delivery (or consumption) pattern for 5 × 5 square-lattice network. No packages are picked up or delivered at nodes in dark gray. (b) Count of passing in calculated route in initial phase of episode sequence. (c) Count of passing in calculated route in optimized phase of episode sequence. (d) Calculated length of route, minimum number of passing edges, as function of episode number. Case with constant transition probabilities is also shown in comparison.

In this study, we use a 5 × 5 square-lattice network as a virtual geographical situation. Here, we define the route length as the count of edges along an agent trace, usually for the case with the minimum edge count. We perform our agent-based simulation according to the following procedure:

1. Install an arbitrary number (from 1 to 10) of virtual packages at 5 pickup locations selected at random.
2. Set an arbitrary number (from 1 to 10) of packages for delivery at destinations that are arbitrarily selected, excluding the pickup locations, where the total number of packages to be delivered is equal to that of the pickup packages.
3. Start two agents from randomly selected pickup locations.
4. Move agents from one location to another along one edge per iteration step in a stochastic process according to the transition probabilities of the first-order MC network, without any observation of each other's actions.
5. Make agents load as many packages as possible at pickup locations, and unload packages at delivery locations, with up to 20 packages carried per agent.
6. Repeat steps 4 and 5 until the agents stop because all packages have been delivered to their destinations, or until the total route length exceeds 100 edges.
7. Repeat steps 1 to 6 (which constitute one episode) for 100,000 episodes, changing the transition probabilities from one location to another when the total route length is shorter than the minimum one in the previous episodes.
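The procedure above can be sketched as a short Python program. It is a simplified reading of the steps and not the authors' code. Labeled assumptions: the pickup/delivery pattern is drawn once and kept fixed over all episodes, the two agents move in turns, delivery demand is not capped at 10 per site, and a small floor weight `EPS` keeps every edge selectable after retuning. Weights are set proportional to the passing counts of the shortest route found so far, as this section describes for the tuning of transition probabilities. All function names are hypothetical.

```python
import random
from collections import Counter

SIZE, CAPACITY, MAX_LEN, EPS = 5, 20, 100, 0.1  # EPS is an assumed floor weight


def neighbors(node):
    """Lattice neighbours of a (row, col) location."""
    r, c = node
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(a, b) for a, b in cand if 0 <= a < SIZE and 0 <= b < SIZE]


def make_pattern(rng, n_sites=5):
    """Steps 1-2: random pickup stock and a delivery demand with the same total."""
    nodes = [(r, c) for r in range(SIZE) for c in range(SIZE)]
    sites = rng.sample(nodes, 2 * n_sites)
    pickup = {s: rng.randint(1, 10) for s in sites[:n_sites]}
    demand = {s: 1 for s in sites[n_sites:]}
    for _ in range(sum(pickup.values()) - n_sites):  # simplification: no per-site cap
        demand[rng.choice(sites[n_sites:])] += 1
    return pickup, demand


def run_episode(rng, counts, pickup, demand, n_agents=2):
    """Steps 3-6: return the traversed edges, or None if MAX_LEN is reached."""
    pickup, demand = dict(pickup), dict(demand)
    pos = [rng.choice(list(pickup)) for _ in range(n_agents)]
    load = [0] * n_agents
    route = []

    def service(i):  # step 5: load as much as possible, then unload what is required
        here = pos[i]
        take = min(pickup.get(here, 0), CAPACITY - load[i])
        if take:
            pickup[here] -= take
            load[i] += take
        drop = min(demand.get(here, 0), load[i])
        if drop:
            demand[here] -= drop
            load[i] -= drop

    for i in range(n_agents):
        service(i)
    while sum(demand.values()) > 0:
        if len(route) >= MAX_LEN:
            return None
        i = len(route) % n_agents  # agents take turns (an assumption)
        options = neighbors(pos[i])
        weights = [counts[(pos[i], n)] + EPS for n in options]
        nxt = rng.choices(options, weights=weights)[0]
        route.append((pos[i], nxt))
        pos[i] = nxt
        service(i)
    return route


def optimize(episodes=2000, seed=0):
    """Step 7: retune the edge weights whenever a shorter total route is found."""
    rng = random.Random(seed)
    pickup, demand = make_pattern(rng)
    counts, best, history = Counter(), None, []
    for _ in range(episodes):
        route = run_episode(rng, counts, pickup, demand)
        if route is not None and (best is None or len(route) < len(best)):
            best, counts = route, Counter(route)
        history.append(len(best) if best is not None else None)
    return best, history
```

Calling `optimize(episodes=2000, seed=0)` returns the shortest route found and the running minimum of the route length per episode, which is the kind of curve plotted in Fig. 3(d). The episode count is reduced here for illustration; the source uses 100,000 episodes.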


Here, the variation of the transition probabilities, which are tuned to be proportional to the passing count on each edge in the current minimum-length route, is essential in our route planning; it is the element that our simulation procedure borrows from reinforcement learning. Actions, represented by agent motions and package loading/unloading and affected by the locations of both pickup and delivery, change the numbers of remaining and required packages. These numbers form the state, which is represented by the remaining packages, their pickup and delivery locations, and the transition probabilities. In other words, agents learn which edges are favorable at each path selection from changes of the package pickup-delivery state, and they are greedy since the transition probabilities, with stochastic effects, determine their actions. The words action, state, and greedy follow the terminology of reinforcement learning [16]. We again stress that these agent-based simulations are not designed for rapid calculation of the best solution for route planning; if we aimed at that, we would use more sophisticated processes such as the ε-greedy method [16]. Instead, by simplifying our method for route planning, we concentrate on the visualization and analysis of route optimization using large-size networks.

3.2

Numerical Results

Fig. 4. Statistical trend of calculated route length throughout 30 simulation runs for case shown in Fig. 3. Center curves are mean values varying as episode proceeds, with vertical length of hatched region for standard deviation.

One example run of our agent simulation by the algorithm described in Sect. 3.1 is shown in Fig. 3, which illustrates how typical optimization proceeds, using raw data of an agent route for one given pickup-delivery location pattern. Pickup-delivery locations are randomly distributed (see Fig. 3(a)), and no optimized route suggests itself at a glance. The same holds for a general information processor: a virtual agent goes back and forth with a number of meaningless detours at the initial steps, as shown in the count of passing from source to target nodes (see Fig. 3(b)). After a large number of episodes have accumulated, shorter routes with fewer detours are found, and the transition probabilities from one node to another (easily calculated from the count of passing) are renewed intermittently and automatically (see Fig. 3(c)). In contrast, as shown in Fig. 3(d), the optimization proceeds slowly and reaches an insufficient level when we perform a simulation run with a similar pickup/delivery pattern without tuned transition probabilities. This example of a typical simulation run indicates that our method of tuning transition probabilities in the algorithm works well for route optimization.

We observed a similar tendency in the statistical scatter of the data over 30 simulation runs, where we assume arbitrary and random pickup/delivery locations for each run (see Fig. 4). Although the route shortening takes place intermittently within one simulation run, the step-down timings are random, so the mean values with statistical deviation follow gradually declining curves. Without tuning of the transition probabilities, the route shortening with increasing episode number is slower. In these series of agent-based simulations, route planning is performed successfully for various patterns of pickup and delivery locations. We again note that this optimization might not yield the best solution, but the combination of stochastic optimization before modification of the transition probabilities and local search after their intermittent changes works well, which is sufficient for our purpose: visualization of optimization processes in route planning, which is demonstrated in the next section.
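The statistical summary behind a plot such as Fig. 4 can be sketched as follows: each run yields a step-like running minimum of the route length, and averaging these steps over runs with random step-down timings gives a gradually declining mean. The helper names and the tiny data set are made up for illustration, and the sample standard deviation is an assumption, since the source does not say which estimator is used.

```python
from statistics import mean, stdev


def running_minimum(lengths):
    """Shortest route length found up to each episode; None marks a failed episode."""
    best, out = None, []
    for length in lengths:
        if length is not None and (best is None or length < best):
            best = length
        out.append(best)
    return out


def summarize(runs):
    """Per-episode mean and sample standard deviation of the running minimum."""
    curves = [running_minimum(r) for r in runs]
    stats = []
    for values in zip(*curves):
        found = [v for v in values if v is not None]
        m = mean(found) if found else None
        s = stdev(found) if len(found) > 1 else 0.0
        stats.append((m, s))
    return stats


# Made-up route lengths for three runs of four episodes each.
runs = [[40, 38, None, 30], [None, 45, 33, 33], [50, 50, 28, 29]]
stats = summarize(runs)
```

Each run on its own is a staircase, while the entries of `stats` decrease more smoothly because the steps of different runs occur at different episodes.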

4

Agent Traces Represented in Higher-Order Markov-Chain Networks

As indicated in Sect. 2, higher-order MC networks are promising as a platform for displaying optimized routes because the total number of necessary edges decreases in principle as the order increases. Here, based on the route optimization performed in Sect. 3, we visualize the changes of the transition probabilities on the higher-order MC networks and compare them with the spatial pattern of the initial pickup-delivery locations. In VRP, a display using search trees of the possible path selections was used with mathematical expressions [6], which is complete and accurate. However, the size of the search tree is huge and impractical for route planning in actual use. In contrast, our method based on higher-order MC networks provides rather small, compact and concise charts, and information is condensed visually, as shown below. As defined in Eq. (1), the transition probabilities of higher-order MC networks ($p(n_i \mid n_{i-1}, \cdots, n_{i-x})$ for the $x$th order), which vary as episodes accumulate, are different from those for the simple MC network ($p(n_i \mid n_{i-1})$). At the very first episode, all of them are equal for each node history ($n_{i-1}, \cdots, n_{i-x}$). As the episode count increases, some of them get higher, while others go down. Then,


Fig. 5. Trend display of high-transition-probability edges (more than 0.8, in red) in case for Fig. 3. Edges with zero transition probability are removed, and gray scale for edges in gray is proportional to corresponding transition probability from 0 to 0.8. (a) In second-order MC network at episode #9. (b) In second-order MC network at episode #125. (c) In second-order MC network at episode #16215. (d) In modiﬁed third-order MC network at episode #9. (e) In modiﬁed third-order MC network at episode #125. (f) In modiﬁed third-order MC network at episode #16215.

agents are likely to trace the paths with higher p, ﬁnding shorter routes stochastically. We recall that nodes in the higher-order MC networks represent not simple geographical locations but traces that include multiple nodes and edges in the lattice network mimicking spatial locations. That is, edges (or next nodes) with high p in higher-order MC networks represent favorable partial routes to search for shorter total routes. In this context, we investigate episode trends of the transition probabilities, and visualize the edges that possess high p values for one simulation run shown in Fig. 3. When we take speciﬁc snapshots of the transition probabilities at three episode numbers (#9, #125 and #16,215), their trends clarify route optimization by our agent-based simulations (see Fig. 5). Rough trends common to all MC networks are: increase of both higher transition probabilities (more than 0.8)

and zero transition probabilities. This fact indicates that, as searches for shorter routes proceed, the transition probabilities tend to converge to either high or zero values, with a clearer division between the two. A closer look at the trends shows that the enhancement of higher probabilities is more pronounced in higher-order MC networks than in lower-order MC networks. That is, as shown in Fig. 6, the trend analysis of optimization performance in route planning verifies convergence more clearly in higher-order MC networks than in a geographical network (i.e., first-order MC network), which is popular in general studies. This increase of analysis data with memory effects in route planning is effective for confirming the optimization level represented in the network display.

Fig. 6. Trends of proportion spectra of edges with transition probability (p) at each level in MC networks in agent-based simulation run shown in Fig. 3. Four levels (p = 0, 0 < p ≤ 0.5, 0.5 < p ≤ 0.8, p > 0.8) compose proportional spectra. (a) In first-order MC network. (b) In second-order MC network. (c) In third-order MC network. (d) In modified third-order MC network.

Fig. 7. Statistical data of count of edges with high transition probability (more than 0.8) over 30 simulation runs for MC networks investigated here.

To confirm these trends statistically, we obtain similar trends of the mean values of the transition probabilities over 30 simulation runs (see Fig. 7). As the order of MC networks increases, the count of edges with transition probabilities more than 0.8 decreases gradually, and this tendency is in good

agreement with the examples shown in Fig. 6 after conversion of this raw count of edges to proportions of the total edge numbers.

Finally, going back to the specific case, we reconfirm the consistency of actual geographical routes with high-probability edges in a higher-order MC network. Figure 8 shows the 5 × 5 lattice network with agent routes, equal to the first-order MC network, together with the corresponding high-probability edges in the second-order MC network. The optimized route coincides well with the high-probability edges and the pickup/delivery locations. This specific comparison between the assumed and the obtained spatial arrangements indicates that our method based on higher-order MC networks works well for the analysis of route planning for logistics in a simple case similar to that investigated here.

Fig. 8. Consistency of high-transition-probability edges in second-order MC network (on right-hand side) with actual routes in first-order MC network after optimization (on left-hand side, episode #16215) in case shown in Fig. 3.

In practice, more complicated and larger lattice networks will be appropriate for real solutions in logistics, in which a sophisticated algorithm combined with heuristic approaches [24] may work for route planning, and our method based on even higher-order MC networks (possibly beyond third order) will still be valid for monitoring optimization levels, without any fundamental limitations.
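One way to obtain quantities like those analyzed in this section is to estimate the $x$th-order transition probabilities of Eq. (1) from recorded location sequences and then sort the observed edges into the probability levels used for Fig. 6. The sketch below assumes simple frequency counting, which the source does not confirm as the authors' procedure. The function names and the letter-labeled traces are hypothetical, and edges that never occur (the p = 0 level) are not counted.

```python
from collections import Counter, defaultdict


def transition_probabilities(traces, order):
    """Estimate p(n_i | n_{i-1}, ..., n_{i-order}) by frequency counting."""
    counts = defaultdict(Counter)
    for trace in traces:
        for i in range(order, len(trace)):
            history = tuple(trace[i - order:i])  # stored oldest first
            counts[history][trace[i]] += 1
    return {h: {n: c / sum(nxt.values()) for n, c in nxt.items()}
            for h, nxt in counts.items()}


def level_proportions(probs):
    """Share of observed edges in each non-zero probability level."""
    levels = Counter()
    for nxt in probs.values():
        for p in nxt.values():
            if p > 0.8:
                levels["p>0.8"] += 1
            elif p > 0.5:
                levels["0.5<p<=0.8"] += 1
            else:
                levels["0<p<=0.5"] += 1
    total = sum(levels.values())
    return {k: v / total for k, v in levels.items()}


# Hypothetical traces over locations labeled A-E.
traces = [list("ABCD"), list("ABCD"), list("ABCE")]
probs = transition_probabilities(traces, order=2)
shares = level_proportions(probs)
```

In this toy example the history (A, B) always leads to C, so that edge falls in the highest level, while the history (B, C) splits between D and E. As routes converge, more histories behave like the first case, which corresponds to the growth of the p > 0.8 share discussed above.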

5

Concluding Remarks

We demonstrate the validity of route visualization in higher-order MC networks for logistics planning. In contrast to simple comprehension based on a geographical network, we can obtain clear images of the effectiveness of route planning by observing paths with high transition probabilities in higher-order MC networks. This method applies not only to logistics planning but also, possibly, to vehicle and human traces with memory effects coming from actions in the past.

Acknowledgements. All the authors thank Air Business Lab Co. Ltd. for its support and useful comments, especially those by T. Oobori, for this study. This work is partially supported by Regional ICT Research Center of Human, Industry and Future, and by the Cabinet Office, Government of Japan.


References

1. Taniguchi, E., Thompson, R.G., Yamada, T., Van Duin, R.: City Logistics-Network Modelling and Intelligent Transport Systems. Pergamon, Amsterdam (2001)
2. Tamagawa, D., Taniguchi, E., Yamada, T.: Evaluating city logistics measures using a multi-agent model. Proc. Soc. Behav. Sci. 2, 6001–6012 (2010)
3. Crainic, T.G., Montreuil, B.: Physical internet enabled hyperconnected city logistics. Transp. Res. Proc. 12, 383–398 (2016)
4. Lu, E.H.-C., Yang, Y.-W.: A hybrid route planning approach for logistics with pickup and delivery. Expert Syst. Appl. 118, 482–492 (2019)
5. Menger, K.: Das Botenproblem. Ergebnisse Eines Mathematischen Kolloquiums 2, 11–12 (1932)
6. Christofides, N., Mingozzi, A., Toth, P.: Exact algorithms for the vehicle routing problem, based on spanning tree and shortest path relaxations. Math. Program. 20, 255–282 (1981)
7. Goetschalckx, M., Jacobs-Blecha, C.: The vehicle routing problem with backhauls. Eur. J. Oper. Res. 42, 39–51 (1989)
8. Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002)
9. Peng, C., Jin, X., Wong, K.C., Shi, M., Lio, P.: Collective human mobility pattern from taxi trips in urban area. PLoS ONE 7, e34487-1–8 (2012)
10. Sagarra, O., Szell, M., Santi, P., Díaz-Guilera, A., Ratti, C.: Supersampling and network reconstruction of urban mobility. PLoS ONE 10, e0134508-1–15 (2012)
11. Tachet, R., et al.: Scaling law of urban ride sharing. Sci. Rep. 7, 42868-1–6 (2017)
12. Brockmann, D., Hufnagel, L., Geisel, T.: The scaling laws of human travel. Nature 439, 462–465 (2006)
13. Lima, G.F., Martinez, A.S., Kinouchi, O.: Deterministic walks in random media. Phys. Rev. Lett. 87, 010603-1–4 (2001)
14. Terçariol, C.A.S., Martinez, A.S.: Analytical results for the statistical distribution related to a memoryless deterministic walk: dimensionality effect and mean-field models. Phys. Rev. E 72, 021103-1–8 (2005)
15. Cover, T.M., Thomas, J.A.: Elements of Information Theory, 2nd edn. Wiley, Hoboken (2006)
16. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn. MIT Press, Cambridge (2018)
17. Yamamoto, K., Miyagi, S., Sakai, O.: Order estimation of Markov-Chain processes in complex mobility network embedded in vehicle traces. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) COMPLEX NETWORKS 2020. SCI, vol. 944, pp. 231–242. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65351-4_19
18. Sasaki, T.: Estimation of person trip patterns through Markov chains. In: Newell, G.F. (ed.) Traffic Flow and Transportation, pp. 119–130. Elsevier, Amsterdam (1972)
19. Adler, T., Ben-Akiva, M.: A theoretical and empirical model of trip chaining behavior. Transp. Res. B 13B, 243–257 (1979)
20. Liu, Z., Barahona, M.: Geometric multiscale community detection: Markov stability and vector partitioning. J. Complex Netw. 6, 157–172 (2017)
21. Saadi, I., Mustafa, A., Teller, J., Cools, M.: Forecasting travel behavior using Markov chains-based approaches. Transp. Res. C 69, 402–417 (2016)
22. Liu, F., Janssens, D., Cui, J.X., Wets, G., Cools, M.: Characterizing activity sequences using profile Hidden Markov Models. Expert Syst. Appl. 42, 5705–5722 (2015)
23. Lozano, Á., De Paz, J., Villarrubia González, G., Iglesia, D., Bajo, J.: Multi-agent system for demand prediction and trip visualization in bike sharing systems. Appl. Sci. 8, 67-1–21 (2018)
24. Lin, S., Kernighan, B.W.: An effective heuristic algorithm for the traveling-salesman problem. Oper. Res. 21, 498–516 (1973)

Degree-Degree Correlation in Networks with Preferential Attachment Based Growth

Sergei Mironov, Sergei Sidorov (✉), and Igor Malinskii

Saratov State University, Saratov, Russian Federation

Abstract. The paper's main focus is the analysis of degree-degree correlation in complex networks generated following two growth models based on the preferential attachment mechanism: the Barabási-Albert model and the triadic closure model. The average nearest neighbor degree (ANND) value of k-degree nodes, defined as their neighbors' average degree, is employed to quantify the degree-degree assortativity in the networks. First, we derive the dynamics of the average degree of neighbors for every node. Then we find the distributions of the ANND-values at each iteration in both networks. Results show that both networks are uncorrelated for nodes with high degrees, while for small degrees both networks exhibit degree-degree disassortativity.

Keywords: Degree-degree correlation · Complex networks · Preferential attachment model · Assortative network · Nearest neighbor · Pair correlation · Scale-free networks

1 Introduction

One of the features regularly observed in complex networks is assortativity [8,14], which indicates that nodes tend to connect with similar nodes. The most studied instance is degree assortativity, which arises when hubs tend to link to other high-degree nodes and small-degree nodes more frequently link to other small-degree nodes. Conversely, a network exhibits degree disassortativity if high-degree nodes tend to connect to small-degree nodes, and vice versa. Many social networks are assortative, while some biological and technological networks are disassortative (see [13,14]). A network is called degree-degree correlated if it shows either assortativity or disassortativity. If there is no evident preference, the network is said to display neutral mixing.

This work was supported by the Ministry of Science and Higher Education of the Russian Federation in the framework of the basic part of the scientific research state task, project FSRR-2020-0006.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 51–58, 2021. https://doi.org/10.1007/978-3-030-81854-8_5


To quantify assortativity in networks, one may employ the correlation coefficient proposed in [12]. Nonetheless, paper [9] presents evidence that this measure may not be appropriate for some kinds of networks. Therefore, in this paper we use the average nearest neighbor degree (ANND) to estimate network assortativity. Let $G = (V, E)$ be an undirected graph of size $|V| = n$. If there is at least one node of degree $k$ in the graph, then the ANND of nodes of degree $k$ can be defined as follows [4,19]:

$$\Phi_n(k) = \frac{\frac{1}{k}\sum_{l>0} l \sum_{(i,j)\in E} I\{d_i = k \;\&\; d_j = l\}}{\frac{1}{n}\sum_{i\in V} I\{d_i = k\}},$$

where $d_i$ denotes the degree of node $i$, and $I\{a\} = 1$ if condition $a$ is fulfilled and 0 otherwise. The ANND is one of the generally accepted measures for investigating similarities between the degrees of neighboring nodes, i.e. for examining degree-degree correlations, which have been the topic of many recent research papers [2,5,6,11,15,16,18,20].

The aim of this paper is to study the properties of the ANND in growth networks constructed in accordance with two growth models that use the preferential attachment mechanism: the Barabási-Albert model and the triadic closure model proposed in [7]. We show that the theoretical values of $\Phi(k)$ at a given iteration fluctuate, on average, within a narrow range for sufficiently large degrees $k$, while for small $k$ both networks exhibit degree-degree disassortativity. Results are derived using the mean-field approach and complement those of paper [21].
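As an illustration of the verbal definition (the average degree of the neighbors of degree-$k$ nodes), the following minimal Python sketch computes that per-degree average from an edge list. It assumes a simple undirected graph without self-loops or multi-edges; the function name `annd_by_degree` and the toy graph are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def annd_by_degree(edges):
    """Map each degree k to the mean, over degree-k nodes, of the
    average degree of their neighbours."""
    neighbours = defaultdict(list)
    for u, v in edges:              # undirected: record both directions
        neighbours[u].append(v)
        neighbours[v].append(u)
    degree = {node: len(adj) for node, adj in neighbours.items()}

    sums = defaultdict(float)       # sum of per-node neighbour averages, by degree
    counts = defaultdict(int)       # number of nodes with that degree
    for node, adj in neighbours.items():
        avg_neighbour_degree = sum(degree[j] for j in adj) / degree[node]
        sums[degree[node]] += avg_neighbour_degree
        counts[degree[node]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Toy example (hypothetical): a star with centre 0 and three leaves.
star = [(0, 1), (0, 2), (0, 3)]
print(annd_by_degree(star))
```

In the star, the low-degree leaves see only the hub and the hub sees only leaves, which is the disassortative pattern discussed above.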

2 ANND Distribution in the Barabási-Albert Networks

A network in the Barabási-Albert model [1] is constructed iteratively according to the following rules. Let $t$ denote the iteration. After each new node $t$ is added, it is connected to $m$ existing nodes of the network, which are selected with probability dependent on their degree (i.e. the preferential attachment mechanism). Let $d_i(t)$, $s_i(t) = \sum_{j:\,(i,j)\in E} d_j(t)$ and $\alpha_i(t) = \frac{s_i(t)}{d_i(t)}$ denote the degree of node $i$, the total degree of all neighbors of node $i$, and the average degree of all neighbors of node $i$ at iteration $t$, respectively. Then the empirical values of $\Phi(k)$ can be obtained as follows:

$$\Phi_t(k) \sim \frac{1}{|E_k|}\sum_{\{i:\, d_i = k\}} \alpha_i, \tag{1}$$

where the sum is taken over all nodes of degree $k$, and $|E_k|$ denotes the number of such nodes in the network. The values of $d_i(t)$ or $s_i(t)$ may change at iteration $t$ in the following cases:

– if newborn node $t+1$ links to node $i$, then $s_i(t+1) = s_i(t) + m$ and $d_i(t+1) = d_i(t) + 1$. The probability of this case is $\frac{d_i(t)}{2t}$;


– if newborn node $t+1$ links to one of the neighbors of node $i$ by one of its edges $j = 1, \dots, m$, then $s_i(t+1) = s_i(t) + 1$, while $d_i(t+1) = d_i(t)$. The probability of this case is $\frac{s_i(t)}{2t}$.

Let us now find the expected value of $\alpha_i(t)$ using the mean-field approach. We first find how the value of $\alpha_i(t)$ changes after adding new node $t+1$ with $m$ links. We have

$$\Delta\alpha_i(t+1) = \alpha_i(t+1) - \alpha_i(t) = \frac{d_i(t)}{2t}\left(\frac{s_i(t)+m}{d_i(t)+1} - \frac{s_i(t)}{d_i(t)}\right) + \frac{s_i(t)}{2t}\left(\frac{s_i(t)+1}{d_i(t)} - \frac{s_i(t)}{d_i(t)}\right) = m\left(\frac{1}{2t} - \frac{1}{2t(d_i(t)+1)}\right) + \alpha_i(t)\,\frac{1}{2t(d_i(t)+1)}. \tag{2}$$

The approximation to difference Eq. (2) is the differential equation

$$\frac{d(\alpha_i(t)-1)}{dt} = \frac{m}{2t} + (\alpha_i(t)-1)\,\frac{1}{2t(d_i(t)+1)}, \tag{3}$$

the solution of which is

1 dt m log t − m C + log t + αi (t) = C + 2 2 2t(di (t) + 1) 2 dt dt m − m . (4) 4t2 (di (t) + 1) 4t2 (di (t) + 1)

Then it can be found that

$$\bar{\alpha}_i(t) := E(\alpha_i(t)) \sim C + \frac{m}{2}\log t + o\!\left(t^{-\frac{1}{2}}\right), \tag{5}$$

i.e. the average degree of node neighbors asymptotically follows $\frac{m}{2}\log t$ for all nodes. The evolution of the $\alpha_i(t)$-values for nodes $i = 100, 250, 5000$, averaged over 100 independent simulated networks of size $n = 50{,}000$ with $m = 4$, is presented in Fig. 1(a). The averaged empirical values of $\alpha_i(t)$ oscillate in the vicinity of the trajectory obtained from Eq. (5).

It follows from $\alpha_i(i) = \frac{m}{2}\log i$ that $C \sim 0$ for sufficiently large $t$. Therefore, the expected values of $\alpha_i(t)$ are asymptotically equal to each other for all nodes $i$. From this fact it follows that the expected values of $\Phi_n(k)$ defined in (1) are approximately equal to each other, i.e. $\Phi_n(k_1) \approx \Phi_n(k_2)$ for any $k_1 \neq k_2$.

To confirm this conclusion we generated 100 independent Barabási-Albert networks of the same size $n = 50{,}000$ with $m = 4$, found the $\Phi_n(k)$-values for every existing degree $k$, and then averaged them for every $k$ over these simulated networks. Figure 1(b) presents the dependence of the $\Phi_n(k)$-values calculated in this way on $k$. The figure shows that these averaged $\Phi_n(k)$-values lie around a constant line for sufficiently large degrees $k$, i.e. $\Phi_n(k) \sim \mathrm{const}$. However, there is clear disassortativity for small degrees $k$. The results show that the networks constructed in accordance with the Barabási-Albert model are not degree-degree correlated for sufficiently large degrees.
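The kind of simulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the seed graph (a complete graph on $m + 1$ nodes), the helper names `simulate_ba` and `neighbour_average_degree`, and the smaller size $n = 5000$ are all choices made here for brevity. It produces a single network, whereas the results reported in the text are averages over 100 networks.

```python
import math
import random


def simulate_ba(n, m, seed=0):
    """Grow a Barabasi-Albert-style graph and return its adjacency lists."""
    rng = random.Random(seed)
    # Assumed seed graph: a complete graph on m + 1 nodes.
    adj = [[j for j in range(m + 1) if j != i] for i in range(m + 1)]
    # Every node appears in `stubs` once per incident edge, so a uniform
    # draw from it picks a node with probability proportional to degree.
    stubs = [i for i in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:      # m distinct targets, no multi-edges
            targets.add(rng.choice(stubs))
        adj.append([])
        for target in sorted(targets):
            adj[new].append(target)
            adj[target].append(new)
            stubs.extend((new, target))
    return adj


def neighbour_average_degree(adj, i):
    """alpha_i = s_i / d_i: average degree of the neighbours of node i."""
    return sum(len(adj[j]) for j in adj[i]) / len(adj[i])


n, m = 5000, 4
adj = simulate_ba(n, m)
for i in (100, 250, 1000):
    print(i, round(neighbour_average_degree(adj, i), 2),
          "| mean-field (m/2) log n:", round(m / 2 * math.log(n), 2))
```

Individual values from one run are noisy; averaging over many seeds, as done in the text, is what makes the comparison with Eq. (5) meaningful.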

Fig. 1. BA networks: (a) the dynamics of $\alpha_i(t)$ (curves $\alpha_{100}(t)$, $\alpha_{250}(t)$, $\alpha_{5000}(t)$ and $c \log t$); (b) the dependence of $\Phi_n(k)$ on $k$.

3 Triadic Closure Model Analysis

One of the notable models utilizing the triadic closure mechanism was examined by Holme and Kim [7]. It is an extension of the Barabási-Albert model of preferential attachment [3] and uses the idea of triadic closure previously described in papers [10,17]. Like the BA model, this model generates networks with heavy-tailed degree distributions, but with much higher clustering, similar to real-world networks.

The triadic closure model by Holme and Kim [7] generates growth networks as follows. At each iteration $t$:

1. One newborn node $t$ is added;
2. $m$ links connect the newborn node to existing nodes of the network as follows:
   (a) the first link connects node $t$ with node $i$ using the preferential attachment mechanism (i.e. the probability of being connected to node $i$ is proportional to its degree $d_i(t)$);
   (b) each of the remaining $m - 1$ edges links the new node $t$ as follows:
      (b1) with a fixed probability $0 < p < 1$, the link is attached to an arbitrary neighbor of node $i$ (so-called triad formation);
      (b2) with probability $1 - p$, the link is attached to one of the existing nodes using preferential attachment.

It was shown in [7] that the model can produce networks with various levels of clustering by selecting $p$ and $m$. On the other hand, their degree distributions follow a power law with exponent $\gamma = -3$ for any $p$, i.e. the same as in the BA model.

Throughout the rest of this section we will assume $p \neq 0$ and $m \geq 2$. Let nodes $j$ and $i$ be neighbors, i.e. $(j, i) \in E(t)$. It was shown in paper [7] that the clustering coefficient tends to some constant value $\theta = \theta(p, m)$ as the number of iterations $t$ increases. Then the probability that a randomly chosen neighbor of node $j$ is also a neighbor of node $i$ can be approximated by the averaged clustering coefficient $\theta(t)$, which in turn can be approximated by the constant $\theta$.
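The growth rules (a), (b1) and (b2) can be sketched as follows. This is a minimal Python illustration, not the implementation of [7]: the seed graph (a complete graph on $m + 1$ nodes), the function name `simulate_triadic_closure`, and the fallback to preferential attachment when node $i$ has no eligible neighbor left are assumptions made here.

```python
import random


def simulate_triadic_closure(n, m, p, seed=0):
    """Grow a network with steps (a), (b1), (b2); return adjacency sets."""
    rng = random.Random(seed)
    # Assumed seed graph: a complete graph on m + 1 nodes.
    adj = [set(j for j in range(m + 1) if j != i) for i in range(m + 1)]
    stubs = [i for i in range(m + 1) for _ in range(m)]  # degree-weighted pool

    def connect(u, v):
        adj[u].add(v)
        adj[v].add(u)
        stubs.extend((u, v))

    for new in range(m + 1, n):
        adj.append(set())
        # (a) first link: preferential attachment picks node i
        i = rng.choice(stubs)
        connect(new, i)
        # (b) remaining m - 1 links
        for _ in range(m - 1):
            candidates = sorted(w for w in adj[i]
                                if w != new and w not in adj[new])
            if candidates and rng.random() < p:
                # (b1) triad formation: link to a neighbour of node i
                target = rng.choice(candidates)
            else:
                # (b2) preferential attachment (also used as a fallback,
                # by assumption, when no neighbour of i is available)
                target = rng.choice(stubs)
                while target == new or target in adj[new]:
                    target = rng.choice(stubs)
            connect(new, target)
    return adj


g = simulate_triadic_closure(2000, 4, 0.5)
print("nodes:", len(g), "edges:", sum(len(a) for a in g) // 2)
```

Setting `p` close to 0 makes almost every link a preferential-attachment link, so the sketch then behaves like the BA growth rule.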


Let us study how the values of $d_i(t)$ and $s_i(t)$ change during iteration $t+1$. Denote $\Delta s_i(t+1) := s_i(t+1) - s_i(t)$ and $\Delta d_i(t+1) := d_i(t+1) - d_i(t)$. First, we examine the cases in which the degree of node $i$ changes as the result of linking the new node $t+1$ to node $i$. This may occur at steps (a), (b1) and (b2). All these cases are presented in Table 1 with the corresponding probabilities.

Table 1. The cases in which the degree of node $i$ changes

| | Step (a) | Step (b1) | Step (b2) |
|---|---|---|---|
| $\Delta s_i(t+1)$ | $m + p(m-1)$ | $m + p(m-2)\theta$ | $m$ |
| $\Delta d_i(t+1)$ | 1 | 1 | 1 |
| Probability | $\frac{d_i(t)}{2\lvert E(t)\rvert}$ | $\frac{d_i(t)}{2\lvert E(t)\rvert}\,p(m-1)$ | $\frac{d_i(t)}{2\lvert E(t)\rvert}\,(1-p)(m-1)$ |

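Now let us consider those cases in which no links are drawn to node $i$ at iteration $t+1$, but nevertheless one of its neighbors increases its degree. All three cases are presented in Table 2 with the corresponding probabilities.

Table 2. The cases in which the degree of node $i$ does not change

| | Step (a) | Step (b1) | Step (b2) |
|---|---|---|---|
| $\Delta s_i(t+1)$ | $1 + p(m-1)\theta$ | $1 + p(m-2)\theta^2$ | 1 |
| $\Delta d_i(t+1)$ | 0 | 0 | 0 |
| Probability | $\frac{s_i(t)}{2\lvert E(t)\rvert} - \frac{d_i(t)}{2\lvert E(t)\rvert}\,p(m-1)$ | $\frac{s_i(t)-d_i(t)}{2\lvert E(t)\rvert}\,p(m-1)$ | $\frac{s_i(t)}{2\lvert E(t)\rvert}\,(1-p)(m-1)$ |

Then we can write the following difference equation:

$$\begin{aligned}
E(\Delta\alpha_i(t+1)) ={}& \left(\frac{s_i + m + p(m-1)}{d_i+1} - \frac{s_i}{d_i}\right)\frac{d_i}{2|E|} \\
&+ \left(\frac{s_i + m + p(m-2)\theta}{d_i+1} - \frac{s_i}{d_i}\right)\frac{d_i}{2|E|}\,p(m-1) \\
&+ \left(\frac{s_i + m}{d_i+1} - \frac{s_i}{d_i}\right)\frac{d_i}{2|E|}\,(1-p)(m-1) \\
&+ \left(\frac{s_i + 1 + p(m-1)\theta}{d_i} - \frac{s_i}{d_i}\right)\left(\frac{s_i}{2|E|} - \frac{d_i}{2|E|}\,p(m-1)\right) \\
&+ \left(\frac{s_i + 1 + p(m-2)\theta^2}{d_i} - \frac{s_i}{d_i}\right)\frac{s_i - d_i}{2|E|}\,p(m-1) \\
&+ \left(\frac{s_i + 1}{d_i} - \frac{s_i}{d_i}\right)\frac{s_i}{2|E|}\,(1-p)(m-1) \\
\approx{}& \frac{1}{2|E|}\,\frac{s_i}{d_i}\,p(m-1)\,\theta\bigl(1 + p(m-2)\theta\bigr) + m\,\frac{1}{2|E|}\left(m - \frac{p(m-1)}{m}\bigl(1 + p\theta + p(m-2)\theta^2\bigr)\right),
\end{aligned}$$

which approximately corresponds to the differential equation

$$\frac{d\alpha_i(t)}{dt} = \left(C_2(p,m) - \frac{1}{2}\right)\frac{1}{t}\,\alpha_i(t) + \frac{1}{t}\,C_1(p,m), \tag{6}$$

where

$$C_1(p,m) = \frac{1}{2}\left(m - \frac{p(m-1)}{m}\bigl(1 + p\theta + p(m-2)\theta^2\bigr)\right), \qquad C_2(p,m) = \frac{1}{2}\left(1 + \frac{p(m-1)}{m}\bigl(\theta + p(m-2)\theta^2\bigr)\right).$$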

The solution of (6) has the form $\alpha_i(t) = u(t)v(t)$, where $v(t)$ is the solution of the differential equation

$$\frac{dv(t)}{dt} = v(t)\left(C_2(p,m) - \frac{1}{2}\right)\frac{1}{t} \tag{7}$$

and $u(t)$ must satisfy

$$\frac{du(t)}{dt}\,v(t) = \frac{1}{t}\,C_1(p,m). \tag{8}$$

The solution of (7) is

$$v(t) = t^{C_2(p,m) - \frac{1}{2}}, \tag{9}$$

while the solution of (8) is

$$u(t) = \frac{C_1(p,m)}{\frac{1}{2} - C_2(p,m)}\,t^{-C_2(p,m)+\frac{1}{2}} + C.$$

Then

$$\alpha_i(t) = u(t)v(t) = c_1 + c_2\,t^{C_2(p,m)-\frac{1}{2}}, \tag{10}$$

i.e. the average degree of a node's neighbors asymptotically follows $t^{C_2(p,m)-\frac{1}{2}}$ (for the growth networks generated by the triadic closure model), which is quite different from the behavior of networks generated by the BA model, for which $\alpha_i(t)$ demonstrates logarithmic growth. The evolution of the $\alpha_i(t)$-values for nodes $i = 100, 250, 1000$, averaged over 100 independent simulations of size $T = 50{,}000$ with $m = 4$, is presented in Fig. 2(a). The averaged empirical values of $\alpha_i(t)$ oscillate in the vicinity of the trajectory obtained from Eq. (10).

It follows from $\alpha_i(i) = c_2\, i^{C_2(p,m)-\frac{1}{2}}$ that $c_1 \sim 0$ for sufficiently large $t$. Therefore, the expected values of $\alpha_i(t)$ are asymptotically equal to each other for all nodes $i$. From this fact it follows that the expected values of $\Phi_n(k)$ defined in (1) are approximately equal to each other, i.e. $\Phi_n(k_1) \approx \Phi_n(k_2)$ for any $k_1 \neq k_2$.
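To get a feel for the size of the growth exponent $C_2(p,m) - \frac{1}{2}$, the constants of Eq. (6) can be evaluated numerically. The Python sketch below is only an illustration: the function names are hypothetical, the value $\theta = 0.2$ is an arbitrary assumption (the paper does not report $\theta$ here), and the additive constant is taken from the product $u(t)v(t)$ of the solutions of (7) and (8), which requires $C_2 \neq \frac{1}{2}$.

```python
def c1(p, m, theta):
    """C_1(p, m) as defined after Eq. (6)."""
    return 0.5 * (m - p * (m - 1) / m * (1 + p * theta + p * (m - 2) * theta ** 2))


def c2(p, m, theta):
    """C_2(p, m) as defined after Eq. (6)."""
    return 0.5 * (1 + p * (m - 1) / m * (theta + p * (m - 2) * theta ** 2))


def alpha_mean_field(t, p, m, theta, free_const):
    """u(t) * v(t) from Eqs. (7)-(9): C1 / (1/2 - C2) + free_const * t^(C2 - 1/2)."""
    exponent = c2(p, m, theta) - 0.5
    return c1(p, m, theta) / (-exponent) + free_const * t ** exponent


p, m, theta = 0.5, 4, 0.2            # theta is an assumed illustrative value
print("C1 =", c1(p, m, theta))
print("C2 =", c2(p, m, theta))
print("growth exponent C2 - 1/2 =", c2(p, m, theta) - 0.5)
```

For parameter values of this order the exponent is small, so over the simulated range the power law grows slowly.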


To confirm this conclusion we generated 100 independent networks of the same size $n = 50{,}000$ with $m = 4$, found the $\Phi_n(k)$-values for each existing degree $k$, and then averaged them for every $k$ over these simulated networks. Figure 2(b) presents the dependence of the $\Phi_n(k)$-values calculated in this way on $k$. The figure shows that these averaged $\Phi_n(k)$-values lie around a constant line for sufficiently large degrees $k$, i.e. $\Phi_n(k) \sim \mathrm{const}$. However, there is clear disassortativity for small degrees $k$. The results show that the networks constructed in accordance with the triadic closure model are not degree-degree correlated for sufficiently large degrees.

Fig. 2. Networks based on the triadic closure model: (a) the dynamics of $\alpha_i(t)$ (curves $\alpha_{100}(t)$, $\alpha_{250}(t)$, $\alpha_{5000}(t)$ and $a_1 t^{a_2}$); (b) the dependence of $\Phi_n(k)$ on $k$.

4 Conclusion

The preferential attachment (PA) mechanism is commonly observed in the growth of real networks. It is therefore important to study the features of networks of this type, including their assortativity properties. For this reason, this paper considers two models that generate networks using the PA mechanism: the classical Barabási-Albert model and the triadic closure model. We use the mean-field approach to find the dynamics of the expected average degree of the neighbors of every node in both networks and show that the networks built in accordance with these models are degree-degree uncorrelated for sufficiently large degrees. However, both networks demonstrate negative correlation for small $k$.

References

1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47–97 (2002). https://doi.org/10.1103/revmodphys.74.47
2. Allen-Perkins, A., Pastor, J.M., Estrada, E.: Two-walks degree assortativity in graphs and networks. Appl. Math. Comput. 311(C), 262–271 (2017). https://doi.org/10.1016/j.amc.2017.05.025
3. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999). https://doi.org/10.1126/science.286.5439.509
4. Catanzaro, M., Boguñá, M., Pastor-Satorras, R.: Generation of uncorrelated random scale-free networks. Phys. Rev. E 71 (2005). https://doi.org/10.1103/PhysRevE.71.027103
5. Farzam, A., Samal, A., Jost, J.: Degree difference: a simple measure to characterize structural heterogeneity in complex networks. Sci. Rep. 10(21348) (2020). https://doi.org/10.1038/s41598-020-78336-9
6. de Franciscis, S., Johnson, S., Torres, J.J.: Enhancing neural-network performance via assortativity. Phys. Rev. E 83, 036114 (2011). https://doi.org/10.1103/PhysRevE.83.036114
7. Holme, P., Kim, B.J.: Growing scale-free networks with tunable clustering. Phys. Rev. E 65(2), 026107 (2002). https://doi.org/10.1103/PhysRevE.65.026107
8. Lee, D.-S., Chang, C.-S., Zhu, M., Li, H.-C.: A generalized configuration model with degree correlations and its percolation analysis. Appl. Netw. Sci. 4(1), 1–21 (2019). https://doi.org/10.1007/s41109-019-0240-2
9. Litvak, N., van der Hofstad, R.: Uncovering disassortativity in large scale-free networks. Phys. Rev. E 87, 022801 (2013). https://doi.org/10.1103/PhysRevE.87.022801
10. Mack, G.: Universal dynamics, a unified theory of complex systems. Emergence, life and death. Commun. Math. Phys. 219(1), 141–178 (2001). https://doi.org/10.1007/s002200100397
11. Nandi, G., Das, A.: An efficient link prediction technique in social networks based on node neighborhoods. Int. J. Adv. Comput. Sci. Appl. 9(6), 257–266 (2018). https://doi.org/10.14569/ijacsa.2018.090637
12. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett. 89, 208701 (2002). https://doi.org/10.1103/PhysRevLett.89.208701
13. Newman, M.E.J.: Mixing patterns in networks. Phys. Rev. E 67, 026126 (2003). https://doi.org/10.1103/PhysRevE.67.026126
14. Noldus, R., Van Mieghem, P.: Assortativity in complex networks. J. Compl. Netw. 3(4), 507–542 (2015). https://doi.org/10.1093/comnet/cnv005
15. Pelechrinis, K., Wei, D.: VA-index: quantifying assortativity patterns in networks with multidimensional nodal attributes. PLOS ONE 11(1), 1–13 (2016). https://doi.org/10.1371/journal.pone.0146188
16. Samanta, S., Dubey, V.K., Sarkar, B.: Measure of influences in social networks. Appl. Soft Comput. 99, 106858 (2021). https://doi.org/10.1016/j.asoc.2020.106858
17. Stolov, Y., Idel, M., Solomon, S.: What are stories made of? - Quantitative categorical deconstruction of creation. Int. J. Mod. Phys. C 11(04), 827–835 (2000). https://doi.org/10.1142/S0129183100000699
18. Uribe-Leon, C., Vasquez, J.C., Giraldo, M.A., Ricaurte, G.: Finding optimal assortativity configurations in directed networks. J. Compl. Netw. 8(6) (2021). https://doi.org/10.1093/comnet/cnab004
19. Yao, D., van der Hoorn, P., Litvak, N.: Average nearest neighbor degrees in scale-free networks. Internet Math. 2018, 1–38 (2018). https://doi.org/10.24166/im.02.2018
20. Zhou, D., Stanley, H.E., D'Agostino, G., Scala, A.: Assortativity decreases the robustness of interdependent networks. Phys. Rev. E 86, 066103 (2012). https://doi.org/10.1103/PhysRevE.86.066103
21. Zhuang-Xiong, H., Xin-Ran, W., Han, Z.: Pair correlations in scale-free networks. Chin. Phys. 13(3), 273–278 (2004). https://doi.org/10.1088/1009-1963/13/3/001

On Measuring the Diversity of Organizational Networks

Zeinab S. Jalali¹ (✉), Krishnaram Kenthapadi², and Sucheta Soundarajan¹

¹ Syracuse University, Syracuse, NY, USA, {zsaghati,susounda}@syr.edu
² Amazon AI, Sunnyvale, CA, USA

Abstract. The interaction patterns of employees in social and professional networks play an important role in the success of employees and organizations as a whole. However, in many ﬁelds there is a severe underrepresentation of minority groups; moreover, minority individuals may be segregated from the rest of the network or isolated from one another. While the problem of increasing the representation of minority groups in various ﬁelds has been well-studied, diversiﬁcation in terms of numbers alone may not be suﬃcient: social relationships should also be considered. In this work, we consider the problem of assigning a set of employment candidates to positions in a social network so that diversity and overall ﬁtness are maximized, and propose Fair Employee Assignment (FairEA), a novel algorithm for ﬁnding such a matching. The output from FairEA can be used as a benchmark by organizations wishing to evaluate their hiring and assignment practices. On real and synthetic networks, we demonstrate that FairEA does well at ﬁnding high-ﬁtness, high-diversity matchings.

1 Introduction

In order for commercial and non-profit organizations to succeed, it is important for those organizations to recruit a workforce that is not only skilled, but also diverse, as diversity has been positively associated with performance [15]. However, diversity cannot be measured only in terms of numbers: it is known that negative effects may occur when a network is structured in such a way that resources are not accessible through the social capital available to members of a minority group [9]. Social capital consists of bridging resources from outside of an individual's group (inter-group connections) and bonding resources from internal group connections (intra-group connections) [13].

The literature contains a number of metrics for measuring network diversity/segregation, the most prominent being assortativity [17]. However, when dealing with dynamic networks where new nodes are being added, it is useful to know not only what the diversity of a specific network snapshot is after those nodes are added, but how good it could have been. In other words, if new nodes join a network, what is the best assortativity that one could possibly achieve, given the pre-existing structure of the network and restrictions on where the new nodes can join?

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 59–72, 2021. https://doi.org/10.1007/978-3-030-81854-8_6

Our work is motivated by the example of an organization that is evaluating its hiring and employee assignment practices with respect to the diversity (gender, race, etc.) of the organizational network. When positions are open, some set of candidates apply for those positions. Each candidate has some amount (possibly zero) of suitability for each of the open positions. If one's goal is to minimize segregation in the network while ensuring that each position is filled by a candidate who is suitable for that position, which candidate should one hire for each position? If there are significant gender disparities in applications across job categories (e.g., if software engineer candidates are disproportionately male), then it may not be possible to achieve perfect diversity in hiring and assignment; nonetheless, it is useful to know how well one can do. One can imagine similar examples for, say, new graduate students joining an existing scientific collaboration network.

There has been a great deal of recent interest in the fairness of hiring/assignment procedures (e.g., the Rooney Rule used by the American National Football League [5]). This is because one cannot simply eliminate an existing professional network and replace it with a diverse network; moreover, at the hiring stage, the candidate pool may itself be non-diverse or exhibit correlations between protected attributes and skillsets.

In this work, we present FairEmployeeAssignment (FairEA), a novel algorithm for determining how to assign a set of attributed node candidates to open positions in a network such that both the fitness of the match between nodes and open positions and the diversity of the resulting network are maximized. We experimentally demonstrate that FairEA outperforms baseline strategies at achieving these goals.
We note that in the United States and other countries, it is illegal to make employment decisions based on protected attributes like race or religion.¹ For this reason, although the output of FairEA is a matching of candidates to positions, it is not intended to be used directly to make assignments. Rather, the matching should be used as a baseline, and compared with the organization's actual practices in employee hiring and assignment, to assess the quality of an organization's assignment practices with respect to the diversity of the organizational network.

2 Related Work

The recent scientific literature contains many studies on modeling bias in human recruiting systems. For example, [20] examines strategies for hiring diverse faculty in universities, [2] shows the different likelihoods of hiring and promotion for candidates from different groups with equal skills, and [16] addresses the tradeoff between performance goals and company diversity. [15] shows a positive relationship between the racial and gender diversity of board members and the performance of nonprofits. Unfortunately, automating recruiting systems will not necessarily solve problems of discrimination in hiring [8]. One solution is to make sure that protected attributes do not influence algorithmic decisions, but [7] shows that gender bias exists even after scrubbing gender indicators from a classifier. While traditional approaches measure the diversity of organizations in terms of numbers [18], organizations are social networks, and network factors influence entrepreneurial success, mobility through occupational ladders, and access to employment [19].

¹ https://www.eeoc.gov/laws/practices/

Fig. 1. Overview of the problem.

3 Problem Formulation

We formulate this problem as a multi-objective problem in which the goal is to assign a set of newly-hired employees/employment candidates (without loss of generality, the 'candidates') to open positions so as to maximize (1) the fitness of employees to positions and (2) the diversity of the organizational network, under the constraint that all open positions must be filled. We compute diversity as assortativity, which measures the extent to which 'like connect to like'. Figure 1 shows an overview of the problem. The input for this problem consists of the following:

(1) An undirected network $G = (P, E)$, representing the professional network of an organization. Nodes represent positions ($s$ filled, $m$ open). Each edge $(p_i, p_j)$ represents either a real or expected professional interaction between the employees who currently fill or will fill positions $p_i$ and $p_j$ (i.e., those employees do interact, or are expected to interact once the positions are filled).

(2) A set of $t$ candidates ($t \geq m$). If $t = m$, this problem represents the case where new employees have already been hired and need to be assigned, e.g., newly hired software engineers are being assigned to teams. If $t > m$, this problem can be viewed as a combination of the hiring and assignment problems.

(3) The fitness of each candidate $c_j$ for each position $o_i$ (how well-qualified $c_j$ is for $o_i$). We assume that it is possible to match candidates to open positions such that each open position is filled, subject to having at least one candidate with greater than zero fitness for each open position.

(4) An attribute of interest, such as gender, that divides employees/candidates into $k$ classes: $class_1, \dots, class_k$. We assume that this attribute is categorical and that each node can be a member of just one class (e.g., minority and majority).

The output is a matching of candidates to open positions. We refer to the input and output in the rest of the paper as described in Table 1.

Table 1. Notation

| Symbol | Definition |
|---|---|
| $G(P, E)$ | Unweighted, undirected attributed graph |
| $O = \{o_1, \dots, o_m\}$ | Set of open/unfilled positions |
| $F = \{f_1, \dots, f_s\}$ | Set of filled positions |
| $P = \{p_1, \dots, p_n\}$ | Set of positions (nodes of network $G$), $F \cup O = P$ |
| $Q = \{q_1, \dots, q_s\}$ | Set of current employees ($q_1, \dots, q_s$ fill $f_1, \dots, f_s$) |
| $C = \{c_1, \dots, c_t\}$ | Set of candidates to fill open positions |
| $\mathbf{W}_{m\times t}$ | Fitness matrix, $w_{ij}$: fitness of candidate $c_j$ for position $o_i$ |
| $\mathbf{A}_{n\times n}$ | Adjacency matrix of network $G$ |
| $\mathbf{A1}_{s\times s}$, $\mathbf{A2}_{s\times m}$, $\mathbf{A3}_{m\times m}$ | Sub-matrices of $\mathbf{A}$ ($F$ to $F$, $F$ to $O$, $O$ to $O$ edges) |
| $\mathbf{Y}_{n\times k}$ | Membership matrix of current/future employees to $k$ classes |
| $\mathbf{YF}_{s\times k}$, $\mathbf{YO}_{m\times k}$ | Sub-matrices of $\mathbf{Y}$, membership of current and future employees |
| $\mathbf{YC}_{t\times k}$ | Membership matrix of candidates to $k$ classes |
| $\mathbf{X}_{m\times t}$ | Binary output matrix, $x_{ij} = 1$ if $c_j$ is assigned to $o_i$ |
| $\mathbf{U}_{k\times k}$ | $u_{ij}$: fraction of $class_i$ to $class_j$ edges |

3.1 Objectives

FairEA attempts to solve a multi-objective optimization problem with fitness and diversity-related objectives, described below.

Maximizing Fitness: The first goal is to maximize the fitness of the assignment, subject to the constraint that all positions are filled and each candidate fills at most one position. This objective corresponds to the organization's primary goal of recruiting employees with the required skill sets [6]. We compute the overall fitness of a matching by summing the fitness of each pair of matched open positions and candidates. Higher values indicate a matching of better qualified candidates for open positions. This goal can be formulated as an optimization problem:

$$\max f_1 = \sum_{i,j} w_{ij}\, x_{ij}, \quad \text{such that } \sum_{1\le j\le t} x_{ij} = 1 \text{ for each } 1 \le i \le m, \text{ and } \sum_{1\le i\le m} x_{ij} \le 1 \text{ for each } 1 \le j \le t.$$

Maximizing Diversity: We compute diversity as assortativity, which measures the extent to which a minority group is integrated into the larger network based on the number of inter- vs intra-group connections. If we consider the matrix $\mathbf{U}$ with $u_{ij}$ being the fraction of edges in the network that connect a node from class $i$ to class $j$, the assortativity coefficient is equal to

$$\frac{\mathrm{Trace}(\mathbf{U}) - \|\mathbf{U}^2\|}{1 - \|\mathbf{U}^2\|},$$

where $\|\mathbf{U}^2\|$ means the sum over all elements in $\mathbf{U}^2$ [17]. Positive values indicate more intra-group connections and negative values indicate more inter-group connections in the network. Our goal is to minimize the absolute value of assortativity, so that groups are neither segregated nor isolated. Let $\mathbf{A1}$, $\mathbf{A2}$, and $\mathbf{A3}$ be sub-matrices of the adjacency matrix $\mathbf{A}$, where $\mathbf{A1}_{(s\times s)}$ shows the existence of the edges between nodes in $F$, $\mathbf{A2}_{(s\times m)}$ shows the existence of the edges from nodes in $F$ to nodes in $O$, and $\mathbf{A3}_{(m\times m)}$ shows the existence of the edges between nodes in $O$. Let $\mathbf{YO}_{m\times k} = \mathbf{X} \cdot \mathbf{YC}$ be the binary matrix of the membership of the candidates that will fill the open positions to each attribute class. Then we have

$$\mathbf{U} = \mathbf{Y}^T \cdot \mathbf{A} \cdot \mathbf{Y} = \mathbf{YF}^T \cdot (\mathbf{A1} \cdot \mathbf{YF} + \mathbf{A2} \cdot \mathbf{YO}) + \mathbf{YO}^T \cdot (\mathbf{A2}^T \cdot \mathbf{YF} + \mathbf{A3} \cdot \mathbf{YO}),$$

and this goal can be formulated as an optimization problem:

$$\min f_2 = \left|\frac{\mathrm{Trace}(\mathbf{U}) - \|\mathbf{U}^2\|}{1 - \|\mathbf{U}^2\|}\right|.$$

Other Constraints: In certain cases, there may be other constraints that one wishes to consider. For example, if a minority group is very small, a low-homophily network would indicate that members of that group are isolated. Organizations may want to avoid such an outcome, because such isolated individuals may be unable to find other members of their group and form a community of support, where peers share their experiences, discuss their problems, and provide social support for each other [21]. In such cases, appropriate constraints can be added (e.g., a minimum number of members from each protected group per team). We give an example of this in the next section.

3.2 Challenges

First, note that the problem considered here is NP-hard via a reduction from Unweighted Max Cut [11]. Second, the diversity objective (minimizing the absolute value of assortativity) is not convex, and is neither sub-modular nor super-modular. Third, although the problem can be formulated as an integer program, solving it in this form is computationally slow. These challenges suggest that even a fast approximation algorithm may not exist. As such, we present a heuristic and demonstrate its strong performance experimentally.
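The two objectives of Sect. 3.1 can be illustrated with a small Python sketch. It is a plain-list illustration written for this text, not the FairEA implementation: all function names are hypothetical, the toy matrices are made up, and the mixing matrix is normalized by counting every undirected edge in both directions so that its entries sum to one.

```python
def fitness(W, X):
    """f1: sum of w_ij * x_ij over positions i and candidates j."""
    return sum(w * x for w_row, x_row in zip(W, X)
               for w, x in zip(w_row, x_row))


def is_feasible(X):
    """Each open position filled exactly once, each candidate used at most once."""
    rows_ok = all(sum(row) == 1 for row in X)
    cols_ok = all(sum(col) <= 1 for col in zip(*X))
    return rows_ok and cols_ok


def mixing_matrix(edges, label, k):
    """U[a][b]: fraction of edge ends that join class a to class b."""
    counts = [[0.0] * k for _ in range(k)]
    for u, v in edges:               # count each undirected edge both ways
        counts[label[u]][label[v]] += 1
        counts[label[v]][label[u]] += 1
    total = 2 * len(edges)
    return [[c / total for c in row] for row in counts]


def assortativity(U):
    """(Trace(U) - ||U^2||) / (1 - ||U^2||); undefined if ||U^2|| = 1."""
    k = len(U)
    trace = sum(U[a][a] for a in range(k))
    sum_sq = sum(U[a][c] * U[c][b]
                 for a in range(k) for b in range(k) for c in range(k))
    return (trace - sum_sq) / (1 - sum_sq)


# Toy data (hypothetical): 2 open positions, 3 candidates.
W = [[0.9, 0.2, 0.5],
     [0.4, 0.8, 0.1]]
X = [[1, 0, 0],
     [0, 1, 0]]
print("feasible:", is_feasible(X), "f1 =", fitness(W, X))

# Path 0-1-2-3 with classes [0, 0, 1, 1].
U = mixing_matrix([(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1], 2)
print("f2 =", abs(assortativity(U)))
```

A matching is then judged by the pair $(f_1, f_2)$: larger $f_1$ and smaller $f_2$ are both preferred.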

4 Method

We propose FairEmployeeAssignment (FairEA), a method for assigning candidates to positions with the goals of maximizing fitness and diversity. Assume that we are given input and desire output as described in Sect. 3. In our initial discussion, we assume that candidates are divided into $k = 2$ classes; in Sect. 4.2, we explain how FairEA can be generalized to $k > 2$. FairEA consists of a sequence of iterations, where each iteration $i$ consists of the following three steps:

1. Select two subsets $O_i$ from $O$ and $C_i$ from $C$ using the FairEA selection process described later in this section.
2. Assign $C_i$ to $O_i$ using the FairEA matching process, as described later in this section.
3. If $|O_i| < |O|$, then increment $i$ and return to (1). Otherwise, terminate.

FairEA Selection Process: For each pair of an open position $o_a \in O$ and a candidate $c_b \in C$ with $w_{ab} > 0$, consider two scores: the fitness score, given by $w_{ab}$, and the diversity score. The diversity score is defined as 1 if $c_b \in class_j$ ($j \in \{1, 2\}$) and the number of positions adjacent to $o_a$ filled by an employee from $class_j$ is less than the number of positions adjacent to $o_a$ filled by an employee from the other class. Using these two scores, we use the Pareto optimality technique described in [3] to select subsets from $O$ and $C$. At each iteration $i$ in which the selection process is called, the output contains all the open positions and candidates that are present in the pairs appearing in the top $i$ Pareto front sets. A Pareto front set consists of all points that are not dominated by any other point (a point $(x_1, y_1)$ is dominated by $(x_2, y_2)$ if $x_2 > x_1, y_2 > y_1$, or $x_2 \geq x_1, y_2 > y_1$, or $x_2 > x_1, y_2 \geq y_1$). To find the top $i$ Pareto front sets, one finds the first Pareto front set as just described, removes all selected points, and repeats for $i$ iterations.

FairEA Matching Process: The FairEA matching process is based on the augmenting path approach from the Hungarian algorithm for weighted bipartite matching [10]:

1. Generate a bipartite graph $B$ from $C_i$ to $O_i$, where edges represent each pair of an open position $o_a \in O$ and a qualified candidate $c_b \in C$ with $w_{ab} > 0$.
To set weights: – Sort the ﬁtness score and diversity score computed, as described earlier. – Set edge weights based on their position in the Pareto front sets levels. If (oa , cb ) is present in level l, its score is equal to 1l . 2. Create a labeling l (l[j] = 0 for each node vj in B), an empty matching M , and an empty bipartite graph Bl . 3. If all elements in Oi are matched and present in M , update matrix X based on matching M and stop. Otherwise, update labeling l and bipartite graph Bl . – For each unmatched open position o in graph B, set l(o) to the maximum weight of edges connected to node o in B. – For each matched open positions o in M (if any), if o is matched to c, l(o) = weight(c, o)).

On Measuring the Diversity of Organizational Networks

   – For each candidate c in graph B, set l(c) = 0.
   – Graph Bl contains edges (o, c) where o ∈ Oi, c ∈ Ci, l[o] + l[c] ≤ weight(o, c).
4. Pick an unmatched open position oa ∈ Oi.
5. Let S = {oa} and T = {}, and let N(S) be the set of neighbors of nodes from S in Bl.
6. If N(S) = T, update labels:
   – α = min over o ∈ S, c ∉ T of (l(o) + l(c) − w(o, c)); l(o) = l(o) − α if o ∈ S; l(c) = l(c) + α if c ∈ T.
7. If N(S) ≠ T, pick c ∈ N(S) − T.
   – If c is not matched, find an augmenting path from oa to c. Augment M and update G based on the new assignments. Update weights of edges in B using the same approach described in (1) and return to (2).
   – If c is matched to ob, extend the alternating tree. Add ob to S and add c to T and return to (4).
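The selection process and the edge-weighting rule of the matching process can be illustrated with a minimal Python sketch. It peels off successive Pareto front sets from (fitness, diversity) score pairs, selects the positions and candidates in the top i fronts, and assigns each pair the weight 1/l for its front level l. The function names, the dictionary layout, and the example identifiers are our own assumptions for illustration and are not taken from the authors' code.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]        # (open position, candidate); hypothetical identifiers
Scores = Tuple[float, float]  # (fitness score, diversity score)


def dominates(p: Scores, q: Scores) -> bool:
    """True if p is at least as good as q on both scores and strictly better on one."""
    return p[0] >= q[0] and p[1] >= q[1] and (p[0] > q[0] or p[1] > q[1])


def pareto_levels(scores: Dict[Pair, Scores]) -> Dict[Pair, int]:
    """Assign each pair the index of the Pareto front set it belongs to (1 = first)."""
    remaining = dict(scores)
    levels: Dict[Pair, int] = {}
    level = 1
    while remaining:
        # A pair is on the current front if no remaining pair dominates it.
        front = [
            pair for pair, s in remaining.items()
            if not any(dominates(t, s) for t in remaining.values())
        ]
        for pair in front:
            levels[pair] = level
            del remaining[pair]
        level += 1
    return levels


def select_top_fronts(levels: Dict[Pair, int], i: int) -> Tuple[Set[str], Set[str]]:
    """Open positions and candidates appearing in the top i Pareto front sets."""
    positions = {o for (o, _), lvl in levels.items() if lvl <= i}
    candidates = {c for (_, c), lvl in levels.items() if lvl <= i}
    return positions, candidates


def edge_weights(levels: Dict[Pair, int]) -> Dict[Pair, float]:
    """Edge weight 1/l for a pair that sits in front level l."""
    return {pair: 1.0 / lvl for pair, lvl in levels.items()}
```

Peeling the fronts by repeated scans is quadratic per front, which is adequate for a sketch; a sort-based non-dominated sorting routine would be preferable for large candidate pools.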

In each iteration, the matching is improved and G is updated. For newly assigned positions, the attribute will be the attribute of the matched candidate.

4.1 Handling Constraints

FairEA can handle other diversity-related goals in the form of constraints. As an example, we consider the constraint that no minority individual should be isolated from other minority individuals. To address this (or any) constraint, add a step to FairEA. Suppose a company has k disjoint teams and wants to ensure that minorities from each team i are grouped with at least ti other members of that minority group. In this step, FairEA assigns a set of best qualified candidates from each class j to open positions in the team i that has fewer than the threshold ti employees from class j. (In practice, this can be accomplished via cluster hiring.) A threshold of 0 indicates that avoiding isolation is not necessary. In this step, after each matching, FairEA ensures that all the remaining open positions can be filled (i.e., there is at least one distinct candidate with fitness greater than zero for each remaining open position). It may sometimes not be possible to reach the threshold for a specific team (e.g., there are not enough open positions in the team). In such cases, so as not to prevent assignment of a qualified minority to such a team, the algorithm can perform the assignment but notify the organization. This information can then be used by the organization: e.g., individuals who lack access to other minorities can be enrolled in a mentor/mentee program.

Specifically, FairEA performs the following: For each team i, denote the number of individuals from classj, j ∈ {1, 2}, as |cji|, where |cji| < ti. Sort all pairs {(oa, cb), oa ∈ O and oa ∈ classj and cb ∈ C} based on wab (fitness of cb for oa) in descending order. Next, iterate over all the elements (oa, cb) in the sorted set. If both oa and cb are not already matched (i.e., all elements of row a and column b in matrix X are zero) and there is at least one possible complete matching from remaining candidates to remaining open positions, set xab = 1 and remove matched elements from O and C. Continue the iteration until ti − |cji| = 0 (sufficient new matchings are established) or there are no elements in the set. In the end, if |cji| is still less than ti, return team i for notifying the organization.

4.2 Variations on FairEA

It is easy to modify FairEA for other settings:

Non-binary Attributes: If the protected attribute has more than two classes, the diversity score calculation has to be changed. For each pair of open position oa ∈ O and qualified candidate cb ∈ C (cb ∈ classj), where wab > 0, the diversity score is defined as (ct − ca)/ct, where ct is the total number of filled positions adjacent to oa and ca is the number of filled positions adjacent to oa that are filled by an employee from classj.

Multiple Attributes of Interest: If there are multiple attributes of interest (e.g., race and gender), by combining them into one new attribute we can address the problem using FairEA for k classes. For instance, suppose we have k1 classes for gender and k2 classes for race; we can generate a new attribute called identity with at most k1 ∗ k2 classes. Because this may be a large number of combinations, these intersectional classes may be merged as appropriate.
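As a rough illustration of the non-binary diversity score, the sketch below computes (ct − ca)/ct from the classes of the positions adjacent to an open position, and builds a combined identity attribute from two attributes. The function names, the use of None for unfilled neighbors, and the value 0 returned when no neighbor is filled are assumptions made for this example; the text does not specify the empty-neighborhood case.

```python
from typing import Hashable, Iterable, Optional, Tuple


def diversity_score(candidate_class: Hashable,
                    neighbor_classes: Iterable[Optional[Hashable]]) -> float:
    """(ct - ca) / ct for one (open position, candidate) pair.

    neighbor_classes holds the class of the employee in each position adjacent
    to the open position; None marks an adjacent position that is unfilled.
    """
    filled = [c for c in neighbor_classes if c is not None]
    ct = len(filled)  # total number of filled adjacent positions
    if ct == 0:
        return 0.0    # assumption: no filled neighbors gives no diversity signal
    ca = sum(1 for c in filled if c == candidate_class)  # same class as candidate
    return (ct - ca) / ct


def combine_attributes(gender: Hashable, race: Hashable) -> Tuple[Hashable, Hashable]:
    """Combined 'identity' attribute with at most k1 * k2 distinct values."""
    return (gender, race)
```

With a combined attribute, the same score applies unchanged: the candidate class and the neighbor classes are simply the tuples returned by combine_attributes.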

5 Experimental Setup

In our experiments, we first demonstrate that FairEA does well at matching employees to positions (with respect to fitness and diversity). Second, we provide an example of FairEA on a real-world organizational network.²

5.1 Datasets

Table 2. Dataset statistics.

Name     #nodes   #edges   Assort. coeff.   Attributes (maj, min)
CC(M)    46       552      −0.02            (77%, 23%)
CC(H)    46       552      0.37             (57%, 43%)
RT(M)    77       1341     0.02             (88%, 12%)
RT(H)    77       1341     0.43             (65%, 35%)
Nor(L)   1522     4143     −0.19            (61%, 39%)
Nor(M)   1091     3418     0.08             (90%, 10%)
Nor(H)   1421     3855     0.29             (64%, 36%)
FO(H)    288      2602     0.86             (70%, 30%)
DO(H)    265      921      0.92             (70%, 30%)
SF(L)    1000     4000     −0.30            (69%, 31%)
SF(M)    1000     4000     0.07             (69%, 31%)
SF(H)    1000     4000     0.39             (69%, 31%)

It is difficult to get real data for this problem, in particular on candidate pools and demographics. Thus, to test FairEA under a wide array of conditions, we use both real and synthetic network topologies and attributes and simulate candidate pools. Statistics of datasets are shown in Table 2. (L, M, H) notations indicate low, medium, and high values of assortativity (segregation) in the network.

² For replication, we have posted our code and data at https://github.com/SaraJalali/FairEmployeeAssignment.


Real Network Topologies. First, we use the Norwegian Interlocking Directorate network (Nor), which describes connections among directors of public companies in Norway³. This dataset includes the ‘gender’ attribute. We selected two snapshots of this network, the first from February 2003 (Nor(M)) and the second from October 2009 (Nor(L)). These networks have, respectively, the highest and lowest levels of gender assortativity among all snapshots. We also add synthetic attributes to the August 2011 snapshot to obtain a network with high assortativity, denoted as Nor(H). Next, we consider a set of intra-organizational networks showing interactions in a Consulting Company (CC) and a Research Team (RT)⁴, and consider different attributes to generate high and medium levels of assortativity. CC(M) is obtained from the ‘gender’ attribute, and CC(H) from the ‘location’ attribute (‘Europe’ vs. ‘USA’). RT(M) uses the ‘tenure’ attribute (less than one year vs. more than one year), and RT(H) uses the ‘location’ attribute (‘London’ vs. the rest of Europe). (In practice, not all of these attributes are things we care about in the context of network diversity; they were chosen for purposes of demonstrating the algorithm.) Additionally, these two sets of networks contain information about the organizational level of employees, which we use in our second experiment.

Synthetic Networks. To evaluate FairEA on additional network topologies, we construct synthetic datasets. First, FO (functional organization) has 6 teams and 12 sub-teams with an equal number of nodes in each team, and follows the FedEx organizational chart pattern [14]. DO (divisional organization) has 3 divisions and 40 teams with an equal number of nodes in each team, and follows the Department of Energy organizational chart pattern [14]. We add a synthetic binary attribute so that members of each team are from one class. We next generated a set of power-law (SF) networks [12]. We generate three networks with the same topology, and assign attributes to have low SF(L), medium SF(M), and high SF(H) assortativity levels.

5.2 Open Positions, Teams, and Candidate Pool

For each network, we run 100 trials. In each trial, we sample 10%, 20%, and 30% of nodes randomly as open positions. To simulate the pool of candidates we consider two cases: (1) the candidate pool consists of the nodes set to open (with the same attributes), and (2) the candidate pool consists of two copies of each node set to open. The first setting corresponds to the case where a ‘batch’ of new employees has been hired, and now the employees need to be assigned to teams without considering the hiring process. The second setting corresponds to the case where we consider both hiring and assignment procedures. Moreover, the way we assign attributes to nodes ensures that changes in homophily are actually due to employee assignment, rather than changes in attributes.

³ http://www.boardsandgender.com/data.php.
⁴ https://toreopsahl.com/datasets/.

5.3 Fitness Functions

The fitness function governs which candidates are suitable for which positions. For the first set of experiments (the evaluation of FairEA), we consider two fitness functions. In F1, candidates are qualified for four randomly selected positions with fitness equal to a random number in (0, 1). In F2, candidates are fit for the four open positions closest to the position that the candidate had previously filled, with fitness equal to a random number in (0, 1).

5.4 Baseline Methods

We use three baseline methods: (1) Random, which randomly assigns qualified candidates to each open position; (2) the weighted Hungarian algorithm, where the input is a bipartite graph whose two sides correspond to open positions and candidates. An edge (oa, cb) exists if wab > 0, and the weight of this edge is the sum of wab and the diversity score as described in Sect. 4; and (3) Optimization, which uses the IPOPT solver in the GEKKO optimization suite [1] for solving the optimization problem with the two goals of maximizing fitness and diversity. This is the simplified version of the problem where fitness is maximized, as described in Sect. 3.1, and diversity is optimized by decreasing the gap between the number of neighbors from classi and the number of neighbors from classj for each newly assigned position.

5.5 Metrics

We report results using the following metrics:

– The overall fit score is the sum of the fitness scores for each matching. Let $FS_h$ and $FS_l$ be the overall fit scores of the best and worst possible matchings in terms of fitness of employees for the open positions, respectively, and let $FS_a$ be the overall fit score of the network G after assignment using the desired method. Then we define
\[ \text{Percentage Improvement in Fitness} = \frac{FS_a - FS_l}{FS_h - FS_l} \cdot 100. \]

– The diversity of the network is measured by the assortativity coefficient [17]. Let $AC_b$ be the assortativity coefficient of $G'$, the subgraph of the initial network G consisting only of filled positions, and let $AC_a$ be the assortativity coefficient of the network G after assignments are made. Then
\[ \text{Percentage Improvement in Assortativity} = \frac{|AC_b| - |AC_a|}{|AC_b|} \cdot 100. \]

– The fraction of minorities in team i is
\[ FM_i = \frac{\min(|c_i^1|, \ldots, |c_i^k|)}{|c_i^1| + \cdots + |c_i^k|}, \]
where $|c_i^j|$ is the number of individuals from class$_j$ in team i. The Isolation Score is the average fraction of minorities:
\[ \text{Isolation Score} = \frac{1}{k} \sum_{1 \le i \le k} FM_i. \]
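The three metrics are simple ratios, and a short Python sketch may make their direction and normalization easier to check. The formulas follow our reading of the definitions above (fitness normalized between the worst and best matchings, assortativity improvement as the relative drop in absolute value, isolation as the mean minority fraction over teams). The function names and the per-team list of class counts are assumptions for illustration.

```python
from typing import Sequence


def pct_improvement_fitness(fs_a: float, fs_l: float, fs_h: float) -> float:
    """Position of the achieved fit score between the worst (fs_l) and best (fs_h)."""
    return (fs_a - fs_l) / (fs_h - fs_l) * 100.0


def pct_improvement_assortativity(ac_b: float, ac_a: float) -> float:
    """Relative reduction in |assortativity| from before (ac_b) to after (ac_a)."""
    return (abs(ac_b) - abs(ac_a)) / abs(ac_b) * 100.0


def fraction_of_minorities(class_counts: Sequence[int]) -> float:
    """Share of the smallest class among all members of one team."""
    total = sum(class_counts)
    return min(class_counts) / total if total else 0.0  # assumption: empty team -> 0


def isolation_score(teams: Sequence[Sequence[int]]) -> float:
    """Average fraction of minorities over all teams."""
    return sum(fraction_of_minorities(t) for t in teams) / len(teams)
```

For example, a team with 8 members of one class and 2 of the other has a minority fraction of 0.2, and a perfectly balanced two-class team has 0.5, which is the maximum for k = 2.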

6 Results and Analysis

We first compare FairEA to the baseline algorithms in order to evaluate its performance algorithmically. Next, we use real intra-organizational networks to demonstrate how FairEA might be used in practice.

Fig. 2. Comparison results of percentage improvement in fitness and assortativity for FairEA and baseline methods. The ideal solution lies in the top right corner. Hungarian is good at increasing diversity and IPOPT is good at increasing fitness, but FairEA is good at increasing both. Hungarian performs well when assortativity is low (network is diverse).

6.1 FairEA Evaluation

Here, we compare FairEA and baseline algorithms with respect to diversity and fitness. Figure 2 shows results for percentage improvement in fitness and percentage improvement in assortativity, where the number of candidates is equal to the number of open positions, with fitness function F1 (candidates are qualified for positions across the network) and 10% open positions. The ideal solution (high fitness and diversity) is in the top right, depicted as a star. Results for fitness function F2 (candidates are qualified for positions in a specific area) are similar. In most cases, results for FairEA, IPOPT and Hungarian are in a nondominated set, with results of FairEA having the lowest crowding distance. More simply, we see that Hungarian does very well with respect to diversity (especially when the network was already diverse), IPOPT does very well with respect to increasing fitness, and FairEA does well at increasing both.

To summarize results, we compute the average percentage improvement in fitness and average percentage improvement in assortativity over all datasets with high, medium and low levels of assortativity for each method. When the size of the candidate pool is equal to the number of open positions, FairEA achieves at least 97% of the maximum fitness score while improving the assortativity coefficient value by 39%, 56% and 67% for 10%, 20% and 30% of open positions, respectively. (Results were similar for other experimental settings.) Overall, while IPOPT increases fitness, it performs poorly on diversity. This demonstrates that simply considering the number of neighbors of a node from each class for newly assigned candidates is not sufficient. Hungarian performs well when the number of open positions is small, but performance decreases as the number of open positions increases. In contrast, FairEA consistently does well.

Fig. 3. Results of assortativity coefficient and isolation score over original network (Org) and networks after assignment with different isolation thresholds ti ∈ {0, 2, 0.05 · |ki|, 0.1 · |ki|, 0.2 · |ki|}, where |ki| is the size of team i. Both networks have the potential to become fair by at least 50%.

6.2 Example Usage of FairEA

We next illustrate how FairEA can be used in practice (i.e., to evaluate an organization’s hiring/assignment practices) on the intra-organizational networks CC and RT, which contain position level-related annotations. In such a setting, the organization would identify the set of all positions that have been open in the recent past (whatever timespan is desired), and would use the actual applicants to those positions to form the candidate pool. Because we do not have access to this data, we mark a random p% of the positions as open and consider the employees that fill those positions as candidates. We say that individuals are fit for positions at their level. Recall that in addition to optimizing for diversity and fitness, FairEA can accommodate constraints related to isolation (ensuring that minority individuals are not too far away from other minority individuals, which can have a negative effect on effectiveness [4]), and such constraints may affect performance with respect to fitness and diversity. Here, in addition to evaluating fitness and diversity, we also evaluate the effect of such a constraint.

We compute the Percentage Improvement in Fitness, Percentage Improvement in Assortativity and Percentage Isolation Score of FairEA’s results when requiring that the number of minority group individuals in each group is at least {0, 2, 0.05 · |ki|, 0.1 · |ki|, 0.2 · |ki|}, where |ki| is the size of team i. We consider networks CC(H) and RT(H): the original networks have (Assortativity Coefficient, Isolation Score) values of (0.38, 0.07) and (0.43, 0.02), respectively. Both of these networks are extremely segregated and have fewer than 10% minorities in each team. When applying FairEA with 20% and 30% of the positions open and different threshold levels for isolation, we see large improvements in segregation. Figure 3 shows the results of Assortativity Coefficient and Isolation Score: these large improvements in both assortativity and isolation indicate that both networks have great potential to become more fair.

7 Discussion, Limitations, and Conclusion

In this work, we proposed FairEA, a novel algorithm that can be used to gauge discrimination with respect to a protected attribute. Compared to baselines, FairEA does well at finding high-diversity, high-fitness matchings. While FairEA addresses an abstracted problem, it is a step towards a computational approach to create a diverse workplace in terms of social connections. This work is intended as a step towards remedying segregation and isolation in organizational networks by providing a simple approach to assessing the quality of hiring/assignment practices. We acknowledge that there are substantially more considerations (such as the process used to generate the candidate pool) that go into evaluating hiring/assignment procedures than those described here, and hope that future work will build on what we have presented.

References

1. Beal, L.D.R., Hill, D.C., Martin, R.A., Hedengren, J.D.: Gekko optimization suite. Processes 6(8), 106 (2018)
2. Bjerk, D.: Glass ceilings or sticky floors? Statistical discrimination in a dynamic model of hiring and promotion. Econ. J. 118(530), 961–982 (2008)
3. Censor, Y.: Pareto optimality in multiobjective problems. Appl. Math. Optim. 4(1), 41–59 (1977)
4. Cohen, S.G., Bailey, D.E.: What makes teams work: group effectiveness research from the shop floor to the executive suite. J. Manag. 23(3), 239–290 (1997)
5. Collins, B.W.: Tackling unconscious bias in hiring practices: the plight of the Rooney rule. NYUL Rev. 82, 870 (2007)
6. Craig, M.: Cost effectiveness of retaining top internal talent in contrast to recruiting top talent. In: Competition Forum, vol. 13. American Society for Competitiveness (2015)
7. De-Arteaga, M., et al.: Bias in bios: a case study of semantic representation bias in a high-stakes setting. In: Conference on Fairness, Accountability, and Transparency (2019)
8. Dobbe, R., Dean, S., Gilbert, T., Kohli, N.: A broader view on bias in automated decision-making: reflecting on epistemology and dynamics. arXiv preprint arXiv:1807.00553 (2018)
9. Fung, K.Y.: Network diversity and educational attainment: a case study in China. J. Chin. Sociol. 2(1), 12 (2015)
10. Grinman, A.: The Hungarian algorithm for weighted bipartite graphs. Massachusetts Institute of Technology (2015)
11. Grötschel, M., Pulleyblank, W.R.: Weakly bipartite graphs and the max-cut problem. Oper. Res. Lett. 1(1), 23–27 (1981)
12. Holme, P., Kim, B.J.: Growing scale-free networks with tunable clustering. Phys. Rev. E 65(2) (2002)
13. Lin, N.: A network theory of social capital. Handb. Soc. Capital 50(1), 69 (2008)
14. Lumenlearning: Common organizational structures. Accessed Oct 2020
15. Mazzola, M.E.C., Pontacolon, J.L., Claudio, A., Salguero, J.A., James, M., Yawson, R.: Recruiting for success. Does board diversity matter? (2020)
16. Newman, D.A., Lyon, J.S.: Recruitment efforts to reduce adverse impact: targeted recruiting for personality, cognitive ability, and diversity. J. Appl. Psychol. 94(2), 298 (2009)
17. Newman, M.E.J.: Mixing patterns in networks. Phys. Rev. E 67(2) (2003)
18. O’Leary, B.J., Weathington, B.L.: Beyond the business case for diversity in organizations. Empl. Responsib. Rights J. 18(4), 283–292 (2006)
19. Portes, A.: Social capital: its origins and applications in modern sociology. Ann. Rev. Sociol. 24(1), 1–24 (1998)
20. Sgoutas-Emch, S., Baird, L., Myers, P., Camacho, M., Lord, S.: We’re not all white men: using a cohort/cluster approach to diversify stem faculty hiring. Thought Action 32(1), 91–107 (2016)
21. Williams, S.N., Thakore, B.K., McGee, R.: Providing social support for underrepresented racial and ethnic minority PhD students in the biomedical sciences: a career coaching model. CBE–Life Sci. Educ. 16(4), ar64 (2017)

An Interpretable Graph-Based Mapping of Trustworthy Machine Learning Research

Noemi Derzsy¹, Subhabrata Majumdar¹(B), and Rajat Malik²

¹ Data Science and AI Research, AT&T Chief Data Office, New York, NY, USA {nderzsy,subho}@att.com
² Data Science and AI Research, AT&T Chief Data Office, Bedminster, NJ, USA [email protected]

Abstract. There is an increasing interest in ensuring machine learning (ML) frameworks behave in a socially responsible manner and are deemed trustworthy. Although considerable progress has been made in the field of Trustworthy ML (TwML) in the recent past, much of the current characterization of this progress is qualitative. Consequently, decisions about how to address issues of trustworthiness and future research goals are often left to the interested researcher. In this paper, we present the first quantitative approach to characterize the comprehension of TwML research. We build a co-occurrence network of words using a web-scraped corpus of more than 7,000 peer-reviewed recent ML papers, consisting of papers both related and unrelated to TwML. We use community detection to obtain semantic clusters of words in this network that can infer relative positions of TwML topics. We propose an innovative fingerprinting algorithm to obtain probabilistic similarity scores for individual words, then combine them to give a paper-level relevance score. The outcomes of our analysis inform a number of interesting insights on advancing the field of TwML research.

Keywords: Trustworthy machine learning · Natural language processing · Research space · Co-occurrence network · Community detection

1 Introduction

With the unprecedented increase in the deployment of machine learning (ML) systems in the real world, there is an increasing need to ensure that such systems behave in a socially responsible manner. Responding to this challenge, in the recent past there has been a surge of interest from ML researchers and practitioners in developing algorithms and models that embody qualities such as fairness, explainability, privacy, and robustness. This sub-field of ML is often referred to by umbrella terms such as Responsible ML or Trustworthy ML [3,30,32].

Scientific literature on trustworthy ML (TwML) has grown rapidly in the past few years. While this presents tremendous opportunities for future technical work, there is a lack of codification and characterization of this knowledge base. Considering the interdisciplinary nature of the area and its major venues of publication (e.g. FAccT¹ and AIES²), with participation from fields like social sciences and public policy, such mapping is important to not only summarize existing work but also to inform new application areas on their relevance to TwML. To this end, there are a number of quality review and summary articles [3,5,16], as well as books [11]. However, by nature these are qualitative and static, and leave it to the reader to judge whether a topic of interest is relevant to TwML.

In this paper, we take a quantitative approach to characterize the comprehension of trustworthy ML research and its position with respect to contemporary scholarly work in the broader field of ML. Our text analytics approach is based on a weighted co-occurrence network of words which occur in the text of more than 7,000 peer-reviewed ML papers published in the last 5 years, both related and unrelated to TwML. We use this network for two purposes. First, we use community detection to obtain semantic clusters of words and infer the relative position of TwML topics. Second, we propose a novel relevance score to quantify the ‘closeness’ of individual words to TwML concepts, then combine these word-level scores in an interpretable manner to obtain paper-level relevance scores.

Related Work. Easy availability of bibliographic data has enabled a number of recent studies that aim to map scientific research spaces based on past scholarly work. Such studies span both general science [7,34], and specific domains like Physics [4,19], Bioinformatics [13], as well as rapidly emerging interest areas like COVID-19 [33]. The use of complex network models is extremely popular in these studies. In previous work, network models have been built on data from citations [21], author information and collaborations [4,6], pre-categorized research topics, such as Physics and Astronomy Classification Scheme (PACS) codes [4,19], or words in the text of papers [13]. Popular methods for constructing scientific knowledge fall into two broad categories: embedding-based methods, and co-occurrence networks/knowledge graphs. Embedding-based methods, such as [4], typically map entities like words, sentences and documents to a high-dimensional numerical space (using techniques such as Word2vec), then form edges between two entities (such as words, authors, citations) based on their similarity as measured by some similarity metric. On the other hand, the second category of methods builds a network directly based on the connectivity patterns of entities. Edges or edge weights in the graph may correspond to specific relationships between entities for knowledge graphs [2], or measures such as co-occurrence indicators/weights [13,22]. To the best of our knowledge, the literature lacks a study which maps the research landscape and characterizes existing knowledge of TwML. In this paper we aim to fill this gap.

N. Derzsy, S. Majumdar and R. Malik (alphabetical authors).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 73–85, 2021. https://doi.org/10.1007/978-3-030-81854-8_7

¹ https://facctconference.org.
² https://www.aies-conference.com.

Graph-Based Mapping of Trustworthy ML Research


Goals and contributions. Contrasting with the mostly exploratory and inferential nature of studies on other fields of research, we aim for both inference and prediction. Specifically, our goal is to answer the following research questions in a data-driven manner:

Q1: Within the co-occurrence network of words in recent ML literature, can we characterize the relative position of words and concepts as they relate to TwML?
Q2: Can we predict which scientific papers that are not on TwML topics may be relevant to this sub-field?
Q3: For words or terms not directly pertaining to TwML, can we infer their context similarity with words/terms that do?

To address these questions, our contributions are:

1. We map the space of recent ML research using a word-level network generated from papers that cover general ML, as well as work specifically on TwML.
2. We propose a probabilistic fingerprinting method that quantifies the relevance of a word/paper to TwML. Word-level relevance scores are aggregated in an interpretable manner to obtain the paper-level scores.
3. Taking a broader point of view, we explore the network of words to identify words that, even though not overtly related to TwML (i.e. unlike terms such as fairness, transparency), are conceptually related to TwML.

As the nascent area of Trustworthy ML matures, through this paper we aim to initiate the quantitative study of this body of literature to channel future work in interesting directions, as well as create connections with new application areas.

2 Materials and Methods

2.1 Data

To begin with, we scraped 7107 papers from the Proceedings of Machine Learning Research³ (PMLR) website. These papers have been presented at peer-reviewed conferences and workshops, and contain ML research covering a breadth of topics. Including this corpus in our analysis ensures we have a diverse dictionary of words and their co-occurrences.

Due to the wide scope of the PMLR corpus, papers that specifically focus on TwML are mixed in with ones that do not. To create a distinctive set of papers that focus on TwML, we used a two-pronged approach. First, we obtained 221 papers from the ACM Digital Library⁴ that were published in past FAccT conferences (2018–2020), and labeled all of them as TwML-focused. Second, we curated a set of 74 words and terms related to TwML, starting from a list of obvious ‘seed’ words (e.g. ‘bias’, ‘fairness’) and manually iterating to include their variants relevant for TwML (e.g. ‘algorithmic bias’ but not ‘biased coin’) present in the FAccT corpus. We then label a paper in the larger PMLR corpus as TwML-focused if it contains at least one occurrence of any of these words, resulting in 263 more such papers. We label all the other PMLR papers as non-TwML-focused.

³ http://proceedings.mlr.press.
⁴ https://dl.acm.org.

[Figure 1: flowchart with the stages Identify TwML-related terms, Preprocessing, Split data, Build graph, and Scoring.]
Fig. 1. Schematic of the methodology. Blue and yellow indicate papers in a corpus labeled as TwML and non-TwML.

Pre-processing. In a scientific paper, raw data, tables, plots, proofs, and references can all contribute to a noisy dictionary. Authors spend considerable time deciding titles and writing abstracts to make them stand out more. Further, abstracts often present a high-level summary of the research problem and methodology [31]. Therefore, for our analysis, we restrict our corpus to only include titles, keywords (when present), and abstracts. We start with standard text pre-processing steps: convert text to lowercase, remove special and numeric characters, tokenize, remove stop words and single character words, and then finally stem words using the Snowball stemmer⁵. After pre-processing, we use simple random sampling to split the corpus, assigning 90% of all papers (i.e., PMLR+FAccT) to the training set and the remainder to the test set.

2.2 Methods

Our methodological work has 3 components: (1) building a network of words and detecting communities of similar words, (2) fingerprinting papers as relevant to TwML, and (3) discovery of non-TwML words potentially relevant to the area of TwML. Figure 1 illustrates these processes, and we detail them below.

Network of Words. Using the pre-processed text from our training corpus, we build a word co-occurrence network by connecting each pair of stemmed words that appear in the same abstract. Connections between words (i.e., nodes) are represented with a weighted edge. The weight reflects co-occurrence, i.e., the number of times the pair of words appeared together in an abstract. This construction scheme generates an undirected, weighted network of words. Next, we detect communities in this network and identify which communities our predefined list of TwML words occur in. In order to effectively perform community detection on the network, we use additional cleaning steps to denoise the graph by removing very high-frequency words. We apply a differential edge cutoff: we remove the top 10% highest connectivity non-TwML words and the top 25% of highest connectivity TwML words that originate from splitting compound words (e.g., ‘algorithmic bias’ → ‘algorithm’ and ‘bias’). Note that this splitting also converts the 74 TwML-specific words into 41 individual stemmed words. Finally, we use the Louvain community detection algorithm [1] to identify densely connected communities within the above network.

⁵ https://snowballstem.org.

Bi-level fingerprinting. We use a novel fingerprinting algorithm (Algorithm 1) to obtain probabilistic similarity scores for individual words or papers.

Algorithm 1. Algorithm for word-level relevance scoring

procedure scoreword(word, TwMLwords, Graph)
    if word in TwMLwords then
        return 1;
    else
        path = [];
        if word in Graph then
            for twml in TwMLwords do
                sp = weighted shortest path(word, twml);
                if sp != null then
                    path.append(sp);
        return path.mean() / path.max()
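A minimal Python sketch of Algorithm 1 is given below, using only the standard library. Several details are our assumptions rather than the authors' implementation: the graph is an adjacency dictionary whose values are already edge lengths (the text does not say how co-occurrence weights are converted to distances), and a word that is missing from the graph or cannot reach any TwML word receives a score of 0.

```python
import heapq
from typing import Dict, Set

Graph = Dict[str, Dict[str, float]]  # node -> {neighbor: positive edge length}


def dijkstra(graph: Graph, source: str) -> Dict[str, float]:
    """Weighted shortest path lengths from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, length in graph.get(node, {}).items():
            nd = d + length
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def score_word(word: str, twml_words: Set[str], graph: Graph) -> float:
    """Word-level relevance: 1 for TwML words, else mean/max of path lengths."""
    if word in twml_words:
        return 1.0
    if word not in graph:
        return 0.0  # assumption: unknown words get no relevance
    dist = dijkstra(graph, word)
    paths = [dist[t] for t in twml_words if t in dist]
    if not paths:
        return 0.0  # assumption: no reachable TwML word gives score 0
    return (sum(paths) / len(paths)) / max(paths)
```

Running a single Dijkstra pass from the query word and reading off the distances to all TwML words avoids one shortest-path search per TwML word.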

To begin with, we score each word individually based on its weighted shortest path distance from TwML words, computed using Dijkstra's algorithm. We calculate the relevance score for a full paper as the weighted average of the word-level scores of all words in that paper, assigning larger weights to words that belong to the same community as TwML words:

$$s_{\text{paper}} = \frac{\sum_{i=1}^{N} s_i \left[ w_1 \times \mathbb{1}_{i \in TC} + w_2 \times \mathbb{1}_{i \in NTC} \right]}{\sum_{i=1}^{N} \left[ w_1 \times \mathbb{1}_{i \in TC} + w_2 \times \mathbb{1}_{i \in NTC} \right]}. \tag{1}$$

Here, $s_i > 0$ is the relevance score of the $i$th word in the paper. Given weights $w_1 > w_2 \geq 0$, the contribution of a word to the paper-level score is $w_1 s_i$ if it belongs to either of the two communities rich in TwML words (Table 1; indicated by TwML community, or TC), and $w_2 s_i$ if it belongs to any other community (indicated by non-TwML community, or NTC). To score a paper, we consider the $N$ words in its abstract that yield non-zero scores through Algorithm 1. Finally, the denominator normalizes a paper-level score by the maximum possible value, and the score is set to 0 if all word-level scores are 0 in a paper. If the relevance score of a paper is ≥ 0.5, then we flag the paper as potentially TwML-related. We use grid search to find optimal values of the weights: $w_1 = 3$, $w_2 = 0.5$.
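The paper-level score in Eq. 1 can be illustrated with a short sketch. The function names and the input format (a list of word and score pairs, plus the set of words that fall in the two TwML-rich communities) are hypothetical; the default weights and the 0.5 threshold are the values reported above.

```python
def score_paper(word_scores, twml_community_words, w1=3.0, w2=0.5):
    """Weighted average of word-level scores, following Eq. 1.

    `word_scores` is a list of (word, score) pairs for one abstract and
    `twml_community_words` is the set of words in the TwML-rich communities
    (both are hypothetical input formats).
    """
    numerator = 0.0
    denominator = 0.0
    for word, score in word_scores:
        if score <= 0:
            continue  # only words with non-zero scores enter the sum
        weight = w1 if word in twml_community_words else w2
        numerator += weight * score
        denominator += weight
    # The score is 0 when no word contributes.
    return numerator / denominator if denominator > 0 else 0.0


def is_twml_related(paper_score, threshold=0.5):
    """Flag a paper as potentially TwML-related."""
    return paper_score >= threshold
```

For example, an abstract with one TC word scoring 1.0 and one NTC word scoring 0.2 gives (3 × 1.0 + 0.5 × 0.2) / (3 + 0.5) ≈ 0.89, which exceeds the threshold.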


N. Derzsy et al.

Table 1. TwML words in communities. The first two rows contain the bulk of TwML-related words. The second row can be interpreted as a community relating to differential privacy.

| Community size | Number of TwML words | TwML words |
|---|---|---|
| 1127 | 26 | sensit, bias, decis, constraint, impact, group, remov, discrimin, attribut, demograph, fair, gender, implicit, interpret, mitig, pariti, treatment, unfair, criteria, dispar, sex, subgroup, transpar, crimin, racial, justic |
| 405 | 7 | differenti, mechan, privaci, privat, concern, individu, preserv |
| 488 | 2 | metric, definit |
| 301 | 2 | account, procedur |
| 1228 | 1 | discoveri |
| 980 | 1 | trustworthi |
| 250 | 1 | hindsight |
| 748 | 1 | unbias |

Because of how the above paper-level relevance scores (Eq. 1) are calculated, our probabilistic fingerprinting method is inherently interpretable. By analyzing the breakdown of a paper-level score into its constituent word-level scores, the user can obtain potential reasons why a paper may or may not be highly relevant to TwML. We discuss this in Sects. 3 and 4.

Contextual Similarity of Non-TwML Words. Our last goal is to expand the existing list of TwML words with additional words that are conceptually related to TwML. The reason for doing this is two-fold. Firstly, in the current work we rely solely on the TwML words as an initial seed list of mostly technical words that are used for multiple purposes. However, expanding this existing list with additional contextually similar words would result in a more inclusive set that can improve the fingerprinting process. Secondly, we wish to identify broad areas of interest for future research using these conceptually similar words. To this end, we utilize the connectivity information of non-TwML words with TwML words. We extract all the direct connections of TwML words, along with their corresponding edge weights, which indicate the strength of their connection. In addition, we score each direct neighbor using Algorithm 1, which informs us of the overall connectivity of that word with TwML words as a whole. Finally, we apply threshold cutoffs on edge weights and word relevance scores, and identify words above the thresholds as potentially of interest.

3 Results

Fig. 2. Network of ML research space constructed from the PMLR+FAccT corpus. Each node represents a term, and each edge represents the number of times a pair of words co-occur in an abstract. We highlight the two communities containing most TwML words; nodes are colored according to community membership. TwML words are highlighted separately per their subject area.

Table 2. Performance of paper-level scoring. AUC = Area Under Curve uses paper-level scores (Eq. 1). For other metrics, we use an upper cutoff of 0.5.

| Corpus | AUC | Precision | Recall | F1 score |
|---|---|---|---|---|
| PMLR | 0.81 | 0.42 | 0.81 | 0.55 |
| FAccT | – | 1 | 0.88 | 0.94 |
| Overall | 0.82 | 0.47 | 0.82 | 0.6 |

Network of words. Only about 7% (484 out of 7328) of all papers are TwML-related. Previous studies have empirically observed that complex methods such as knowledge graphs or high-dimensional numeric embeddings are less reliable for characterizing rare concepts or terms [15,29]. Because of this rarity issue of TwML papers, we use a word co-occurrence network in place of more sophisticated methods. The resulting network contains 10,698 nodes and 254,347 edges. The community detection algorithm generated 25 communities, with a modularity score of 0.33. As given in Table 1, TwML-related words are concentrated in two communities. Among them, seven words that are mostly related to Differential Privacy (DP) separate from the rest into one community (second row in Table 1). Another community of 1127 words contains 26 other TwML-specific words. For convenience we shall refer to these communities as the DP and non-DP community, respectively. The remaining 8 TwML words, which are mostly ambiguous (such as ‘metric’ or ‘procedur’) or general (such as ‘trustworthi’), are distributed across 6 communities. Figure 2 visualizes the overall network, focusing on the two TwML-specific communities. We categorize the TwML words into four subject-based categories:

– Privacy: ‘privaci’, ‘differenti’, ‘privat’, ‘guarantee’, ‘concern’, ‘preserv’,
– Interpretability: ‘transpar’, ‘interpret’, ‘account’,
– General: ‘trustworthi’, ‘mechan’, ‘algorithm’, ‘data’,
– Fairness: all others.

From the relative position of words in each category in Fig. 2, it is evident that a number of privacy-specific and fairness-specific words cluster together, and these two clusters are well-separated from each other.

Fingerprinting of Papers. Because of the probabilistic nature of our fingerprinting process, it can be used to classify whether or not a paper is related to TwML. Table 2 presents the results across different metrics and the two corpora. Our method exhibits good recall values across the two corpora. The precision in the PMLR corpus (and hence the overall precision, as PMLR forms a large proportion of the overall set of papers) is low. This indicates that there are probably a number of papers that do not contain our pre-specified TwML words, but may be related to this subject based on their contents. Note that since all papers in the FAccT corpus are labeled as TwML-related, area under curve (AUC) is undefined for this corpus, and it exhibits perfect precision. We present the non-TwML papers with the highest relevance scores in Table 3, and the word-level relevance scores for selected papers in Fig. 3. A number of papers in Table 3 are on topics that have received less attention in TwML literature [9,16], such as reinforcement learning, active learning, bandit algorithms, and outlier detection. The word-level scores (Fig. 3) give interpretability to the paper-level fingerprinting. As an example, paper 7 in Table 3 [18] gets a high score because of the word ‘movi’ from the non-DP community, and ‘fisher’, which belongs to neither of the two TwML-word rich communities. Contextualizing these words, a Fisher information-based approach similar to [18] may be relevant for obtaining fairly calibrated movie ratings and recommendations [28].

Contextual Similarity. To expand the existing list of TwML words with additional conceptually related words, we use the edge weights and relevance scores of words that are direct neighbors of a TwML word in our co-occurrence network to identify the appropriate threshold cutoff. Filtering for words that share at least one edge of weight ≥ 100 with a TwML word and have a relevance score ≥ 0.5 resulted in a subset of 290 words. Words in this list can be further assessed for their significance. Table 4 highlights 10 such words.
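The filtering step just described can be sketched as follows. The adjacency-dictionary format, the function name, and the decision to report the strongest edge to any TwML word are assumptions made for illustration; the thresholds of 100 and 0.5 come from the text.

```python
def similar_words(cooccurrence, twml_words, relevance, min_weight=100, min_score=0.5):
    """Select non-TwML neighbors of TwML words that pass both thresholds.

    `cooccurrence` maps word -> {neighbor: co-occurrence count} and
    `relevance` maps word -> word-level relevance score (hypothetical formats).
    Returns {word: strongest edge weight to any TwML word}.
    """
    selected = {}
    for twml in twml_words:
        for neighbor, weight in cooccurrence.get(twml, {}).items():
            if neighbor in twml_words:
                continue  # only non-TwML words are candidates
            if weight >= min_weight and relevance.get(neighbor, 0.0) >= min_score:
                selected[neighbor] = max(weight, selected.get(neighbor, 0))
    return selected
```

Sorting the returned dictionary by relevance score or by edge weight would give a ranked list of candidate words for manual assessment.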


Table 3. Top 25 papers with the highest fingerprinting scores.

| Index | Paper title | Score |
|---|---|---|
| 1 | Sparse reinforcement learning via convex optimization | 0.72 |
| 2 | Boosting with online binary learners for the multiclass bandit problem | 0.71 |
| 3 | Dirichlet process mixtures of generalized linear models | 0.7 |
| 4 | Optimal δ-correct best-arm selection for heavy-tailed distributions | 0.69 |
| 5 | Lifted weight learning of Markov logic networks revisited | 0.7 |
| 6 | Efficient computation of updated lower expectations for imprecise continuous-time hidden Markov chains | 0.64 |
| 7 | Enhanced statistical rankings via targeted data collection | 0.62 |
| 8 | Multi-observation elicitation | 0.6 |
| 9 | Spotlighting anomalies using frequent patterns | 0.6 |
| 10 | Class proportion estimation with application to multiclass anomaly rejection | 0.57 |
| 11 | Exact subspace segmentation and outlier detection by low-rank representation | 0.57 |
| 12 | Wasserstein propagation for semi-supervised learning | 0.56 |
| 13 | Multitask principal component analysis | 0.56 |
| 14 | Risk-aware active inverse reinforcement learning | 0.56 |
| 15 | Optimal densification for fast and accurate minwise hashing | 0.55 |
| 16 | A Bayesian approach for inferring local causal structure in gene regulatory networks | 0.54 |
| 17 | Lifting high-dimensional non-linear models with Gaussian regressors | 0.52 |
| 18 | Qualitative multi-armed bandits: a quantile-based approach | 0.51 |
| 19 | Safe policy improvement with baseline bootstrapping | 0.51 |
| 20 | Cooperative online learning: keeping your neighbors updated | 0.51 |
| 21 | Analysis of empirical MAP and empirical partially Bayes: can they be alternatives to variational Bayes? | 0.5 |
| 22 | Tree-based inference for Dirichlet process mixtures | 0.5 |
| 23 | Sequence prediction using neural network classifiers | 0.5 |
| 24 | Variance reduction for faster non-convex optimization | 0.5 |
| 25 | Stochastic variance reduction for nonconvex optimization | 0.5 |

Fig. 3. Word level relevance scores for selected papers. [Figure not reproduced: four bar-chart panels (Paper 7, Paper 10, Paper 14, Paper 19), each plotting Word against Relevance score, with bars colored by Community (DP, Non-DP, Others).]

4 Discussion

A number of interesting insights emerge from the above analysis.

Network of Words. The differential distribution of TwML words within communities, as observed in Table 1, indicates that TwML papers tend to focus more on certain lines of research, methods or applications than others. In the context of ML bias and fairness, this is echoed by the review article of [16]. They observed that addressing group fairness in classification problems has received disproportionately high interest compared to other fairness categories (e.g. individual fairness, subgroup fairness) and types of methods (e.g. clustering, graph embedding); see Table 7 therein. Within the TwML words, Differential Privacy (DP)-specific words and those related to fairness and transparency group separately into two different communities. A potential reason for this may be that DP is a comparatively older research area, and has seen more theoretical developments than relatively new topics like fairness or transparency.

Paper-level Fingerprinting. All papers in Table 3 with high relevance scores are on comparatively complex algorithms. A number of these areas have been heavily researched of late, such as reinforcement learning (RL; papers 1,14,19), bandit problems (2,4,18), anomaly detection (2,9,10,11), representation learning (11,13,15), multitask problems (2,8,10,13), Dirichlet process (3,22), and nonconvex optimization (24,25). The word-level breakdown of relevance scores (Fig. 3) gives further insights into how the concepts in these papers may be related to TwML. Top scores for papers 7 and 19 come from TwML words that belong to the non-DP community. Looking into their subject matters, paper 7 [18] studies statistical ranking for dependent network data. Interestingly, a very recent paper that is not in our analyzed corpus studied the problem of applying fairness constraints on node ranks in a graph [12]. Paper 19 is on safe policy improvement in RL [27]. Safe policies in RL refer to policies that maximize expected return in problems where ensuring certain safety constraints is important alongside satisfactory performance [8]. In the context of ML fairness, safe policies can potentially be policies that satisfy equitable performance guarantees for sensitive demographic subgroups.

Table 4. Contextually similar words to TwML.

| Word | Weight | Score | Community |
|---|---|---|---|
| race | 354 | 0.98 | Non-DP |
| drug | 324 | 0.88 | Others |
| tamper | 324 | 0.88 | Non-DP |
| stereotyp | 318 | 0.87 | Non-DP |
| membership | 222 | 0.82 | Non-DP |
| physiolog | 252 | 0.81 | Others |
| censor | 180 | 0.78 | Non-DP |
| facial | 198 | 0.76 | Others |
| secur | 177 | 0.70 | Non-DP |
| skin | 180 | 0.67 | Non-DP |

Table 5. Top 10 words with highest scores in either of the TwML communities.

| Word | Score | Community |
|---|---|---|
| movi | 0.887 | Non-DP |
| dp | 0.825 | DP |
| mdps | 0.825 | Non-DP |
| membership | 0.815 | Non-DP |
| multiclass | 0.755 | DP |
| vb | 0.729 | Non-DP |
| triplet | 0.646 | Non-DP |
| chi | 0.645 | DP |
| ordinary | 0.627 | DP |
| opt | 0.604 | DP |


In Table 5, we summarize the words with the highest scores among words that occur in any of the 25 papers in Table 3, and belong to either the DP or non-DP community. Among words belonging to the DP community, ‘multiclass’ is interesting. After a small number of papers in the early 2010s [20,24], multiclass problems in DP have started to receive more attention recently [25]. Words in the non-DP cluster, on the other hand, refer to methods or algorithms: ‘mdps’ is Markov Decision Processes, ‘vb’ is variational Bayes, and ‘triplet’ is triplet loss. Each of these categories is relevant to ML fairness or explainability. For example, [10,14] incorporate causality and fairness notions in ML models using variational inference. Russell and Santos [23] explain reward functions in MDPs by building a classification model with rewards as outputs. A recent preprint [26] applies the triplet loss in the context of fairness.

Contextual Similarity. A large number of ‘similar’ words that are heavily connected with TwML words neither (a) pertain to algorithms or methods, nor (b) belong to the DP community. Table 4 presents ten such words. In contrast to words highly important to fingerprinting of papers (Table 5), these similar words mostly refer to application aspects of fairness (‘race’, ‘stereotyp’, ‘facial’, ‘skin’), privacy and security (‘tamper’, ‘membership’, ‘secur’), as well as other practical issues (‘drug’, ‘physiolog’, ‘censor’). This potentially suggests two things. Firstly, application-oriented keywords are closely associated with TwML terms, and should be used to characterize the research landscape of this interdisciplinary field. Secondly, such application areas may foster new connections with TwML topics, especially the ones each such word relates to.

5 Conclusion

In this paper, we present the first quantitative study of the trustworthy ML research space. Using network analysis methods we identify the similarity and clustering patterns of TwML vs. non-TwML words, propose a novel fingerprinting method to predict which papers may be related to TwML, and provide word-level contextual similarity insights. As indicated by Table 3, Fig. 3, and Table 5, potential areas of future exploration include multiclass problems in differential privacy, and work that focuses on fairness and transparency aspects of newer research areas in broader ML. Contextually similar non-TwML words in Table 4 suggest the need for more practice-oriented work in this field, which recent studies have acknowledged [3,17]. Through this work, we hope to motivate further quantitative characterization of TwML literature. For example, a higher proportion of each paper's content, instead of only title, abstract, and keywords, may be used. The document corpus being analyzed can be specifically tailored to the end goals of the analysis (e.g. inference vs. prediction, or exploring new connections between theoretical and applied topics). Such explorations will facilitate and guide future ML research by identifying methodological gaps, as well as create novel opportunities for applying existing analytical techniques to new practical problems.


References

1. Blondel, V.D., et al.: Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), P10008 (2008)
2. Buscaldi, D., et al.: Mining scholarly data for fine-grained knowledge graph construction. CEUR Workshop Proc. 2377, 21–30 (2019)
3. Cheng, L., et al.: Socially responsible AI algorithms: Issues, purposes, and challenges (2021). arXiv:2101.02032
4. Chinazzi, M., Gonçalves, B., Zhang, Q., Vespignani, A.: Mapping the physics research space: a machine learning approach. EPJ Data Sci. 8(1), 1–18 (2019). https://doi.org/10.1140/epjds/s13688-019-0210-z
5. Chouldechova, A., Roth, A.: A snapshot of the frontiers of fairness in machine learning. Commun. ACM 63, 82–89 (2020)
6. Cimini, G., Zaccaria, A., Gabrielli, A.: Investigating the interplay between fundamentals of national research systems: performance, investments and international collaborations. J. Informetr. 10(1), 200–211 (2016)
7. Fortunato, S., et al.: Science of science. Science 359(6379), eaao0185 (2018)
8. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(42), 1437–1480 (2015)
9. Gong, M., et al.: A Survey on Differentially Private Machine Learning [Review Article]. IEEE Comput. Intell. Mag. 15(2), 49–64 (2020)
10. Helwegen, R., et al.: Improving fair predictions using variational inference in causal models. arXiv:2008.10880 (2020)
11. Kearns, M., Roth, A.: The Ethical Algorithm: The Science of Socially Aware Algorithm Design. Oxford University Press Incorporated, Oxford (2019)
12. Krasanakis, E., et al.: Applying fairness constraints on graph node ranks under personalization bias. In: Complex Networks & Their Applications IX, vol. 944, pp. 610–622. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-65351-4_49
13. Li, T., et al.: Co-occurrence network of high-frequency words in the bioinformatics literature: structural characteristics and evolution. Appl. Sci. 8(10), 1994 (2018)
14. Madras, D., et al.: Fairness through causal awareness: Learning causal latent-variable models for biased data. In: FAT-2019, pp. 349–358 (2019)
15. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing, 1st edn. MIT Press, Cambridge (1999)
16. Mehrabi, N., et al.: A Survey on Bias and Fairness in Machine Learning (2019). arXiv:1908.09635
17. Mills, S., et al.: Six Steps to Bridge the Responsible AI Gap (2020). https://www.bcg.com/publications/2020/six-steps-for-socially-responsible-artificial-intelligence
18. Osting, B., et al.: Enhanced statistical rankings via targeted data collection. In: ICML-2013, pp. 489–497 (2013)
19. Palmucci, A., et al.: Where is your field going? A machine learning approach to study the relative motion of the domains of physics. PLoS ONE 15(6), e0233997 (2020)
20. Pathak, M.A., Raj, B.: Large margin multiclass Gaussian classification with differential privacy. In: Dimitrakakis, C., Gkoulalas-Divanis, A., Mitrokotsa, A., Verykios, V.S., Saygin, Y. (eds.) Privacy and Security Issues in Data Mining and Machine Learning. PSDML 2010. Lecture Notes in Computer Science, pp. 99–112. Springer, Berlin (2011). https://doi.org/10.1007/978-3-642-19896-0_9
21. Portenoy, J., et al.: Leveraging citation networks to visualize scholarly influence over time. Front. Res. Metr. Anal. 2, 8 (2017)
22. Radhakrishnan, S., et al.: Novel keyword co-occurrence network-based methods to foster systematic reviews of scientific literature. PLoS ONE 12(9), e0185771 (2017)
23. Russell, J., Santos, E.: Explaining reward functions in Markov decision processes. In: Proceedings of the Thirty-Second International Florida Artificial Intelligence Research Society Conference, Sarasota, Florida, USA, May 19–22 2019, pp. 56–61 (2019)
24. Sazonova, V., Matwin, S.: Combining binary classifiers for a multiclass problem with differential privacy. Trans. Data Priv. 7, 51–70 (2014)
25. Senekane, M.: Differentially private image classification using support vector machine and differential privacy. Mach. Learn. Knowl. Extr. 1(1), 483–491 (2019)
26. Serna, I., et al.: SensitiveLoss: Improving Accuracy and Fairness of Face Representations with Discrimination-Aware Deep Learning. arXiv:2004.11246 (2020)
27. Simão, T.D., Spaan, M.: Safe policy improvement with baseline bootstrapping in factored environments. In: AAAI-2019, pp. 4967–4974 (2019)
28. Steck, H.: Calibrated recommendations. In: RecSys-2018, pp. 154–162 (2018)
29. Tacchella, A., et al.: Novel keyword co-occurrence network-based methods to foster systematic reviews of scientific literature. PLoS ONE 12(9), e0185771 (2017)
30. Toreini, E., et al.: The Relationship between trust in AI and trustworthy machine learning technologies. In: FAT-2020, pp. 272–283 (2020)
31. Tullu, M.S.: Writing the title and abstract for a research paper: being concise, precise, and meticulous is the key. Saudi J. Anaesth. 13(Suppl 1), S12–S17 (2019)
32. Xiong, P., et al.: Towards a Robust and Trustworthy Machine Learning System Development (2021). arXiv:2101.03042
33. Yeganova, L., et al.: Navigating the landscape of COVID-19 research through literature analysis: a bird's eye view (2020). arXiv:2008.03397
34. Zeng, A., et al.: The science of science: from the perspective of complex systems. Phys. Rep. 714–715, 1–73 (2017)

MAVAC: Mapping and Visualization of Academic Collaborations with a Focus on Diversity

Logan McNichols, Steven Pineda, Emma Sauerborn, Brandon Tat, Kevin Yoo, Jane Lehr, Zoë Wood, and Theresa Migler

California Polytechnic State University, San Luis Obispo, CA 93401, USA
[email protected]

Abstract. Understanding the ways in which academic researchers collaborate in STEM fields helps us understand not only how scientific work is conducted, but also allows us to examine equity in scientific collaborations. We present our work on constructing, exploring and visualizing a collaboration network of co-authorship in computing and other fields to better understand collaboration trends and geospatial relationships between collaborators. We examine these networks with a focus on variance and similarities between subnetworks generated from a seed set of inferred gender and a seed set of inferred race and ethnicity of faculty members. Results related to variance in clustering coefficients and average degree are presented.

Keywords: Network visualization · Collaboration network · Gender and ethnicity

1 Introduction

We present our work on the construction, visualization and analysis of academic collaboration networks. This work is rooted in our research focused on developing a better understanding of successful academic collaborations and how various factors affect or are expressed as variance in subnetworks. Towards this end, we work with two different networks rooted in the higher education systems in California. The Cal State University (CSU) and University of California (UC) educational systems are some of the largest and most diverse in the world, with the CSU system serving close to half a million students and the UC system serving close to a quarter of a million students. The CSU system focuses on undergraduate education, while the UC system offers both undergraduate and PhD level education (both systems also offer master's degrees). Construction of complete networks is ongoing; however, here we present work with a CSU-focused network via the Cal Poly Collaboration Network and a general UC computing faculty network.

Supported by the Cal Poly Research, Scholarly and Creative Activities (RSCA) Program, with funds from the CSU Chancellor's Office & Cal Poly Provost's Office.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 86–97, 2021. https://doi.org/10.1007/978-3-030-81854-8_8

For the academic collaboration networks described in this paper, vertices represent seed faculty members and the researchers that they collaborate with. The seed faculty members come from either the California Polytechnic State University in San Luis Obispo (the highest ranked regional school in the CSU system) or from the aggregate data for the University of California's ten schools, for faculty with a research interest in Computing (representing a wide array of departments all related to computing). Future work includes the expansion of the network to all CSU campuses; however, the project originated with a local project focused on the Cal Poly Collaboration Network, which has 3,751 seed vertices. In the UC Computing Collaboration Network there are 27,103 seed vertices. The collaborators of these seed researchers contribute 12,924 vertices in the Cal Poly Collaboration Network and 463,854 vertices in the UC Computing Collaboration Network. Two researchers are connected if they have ever coauthored a paper that appears in the Scopus database. Scopus is an abstract and citation database of peer-reviewed literature containing the work of 34,346 peer-reviewed journals [1].

In this paper we describe the construction of these networks and their embedded versions (the networks embedded on the globe with respect to the locations of the academic institutions of the researchers), we describe the visualization system, and we present initial analysis of the networks. An important component of this work is the embedded visualization of the networks to allow for exploration and visualization of variance and trends. Our visualization and analysis are conducted with an eye toward understanding and comparing networks of researchers from different gender and ethnicity groups to ultimately aid in understanding how to promote successful collaborations for all academic researchers.
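The network definition given in this introduction (seed researchers, their collaborators, and an edge whenever two researchers have coauthored a paper) can be sketched as below. The input format, the function name, and the choice to keep only publications with at least one seed author are assumptions for illustration, not details taken from the authors' implementation.

```python
from itertools import combinations


def build_collaboration_network(publications, seed_ids):
    """Build an undirected coauthorship network around a set of seed researchers.

    `publications` is an iterable of author-ID lists, one per paper
    (a hypothetical format). Returns (neighbors, collaborators), where
    `neighbors` maps each vertex to the set of its coauthors and
    `collaborators` is the set of non-seed vertices.
    """
    seeds = set(seed_ids)
    neighbors = {seed: set() for seed in seeds}
    for authors in publications:
        unique = set(authors)
        if not unique & seeds:
            continue  # assumption: only papers with a seed author are used
        for a, b in combinations(sorted(unique), 2):
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)
    collaborators = set(neighbors) - seeds
    return neighbors, collaborators
```

With this representation, the counts of seed and collaborator vertices quoted above would correspond to `len(seed_ids)` and `len(collaborators)` respectively.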

2 Related Works

Collaboration networks, where vertices represent people and two people are connected if they have ever collaborated, are among the earliest and most studied networks in network science [4,11,13,23,24,31]. Many of these collaboration networks include only vertices from a certain field, for example, where vertices represent mathematicians and two mathematicians are connected if they have coauthored a paper [18]. Less work has been done to study collaboration networks with an eye toward geospatial organization. However, Abramo et al. found that among Italian academics women researchers register a greater capacity to collaborate, with the exception of international collaboration, where there is still a gap in comparison to male colleagues [2]. Pan et al. found that collaboration strengths between cities decrease with an increase in distance and follow gravity laws [25]. Fortunato et al. found that scientists who left their country of origin had higher citation scores than those who did not [15].

There has been much work studying collaboration networks with respect to gender, but much less with respect to race and ethnicity. Ductor et al. showed that women have fewer collaborators, collaborate more often with the same coauthors, and a higher fraction of their co-authors are co-authors of each other (they have a higher clustering coefficient) [14]. Holman and Morandin examined similar properties in the realm of life science by studying the PubMed biomedical article database and found that researchers preferentially co-publish with colleagues of the same gender, and show that this ‘gender homophily’ is slightly stronger today than it was 10 years ago [20]. Using a dataset with more than 270,000 scientists from Brazil, Araújo et al. show that while men are more likely to collaborate with other men, women are more egalitarian regardless of how many collaborators each scientist has, with the exception of engineering where the bias disappears with increasing number of collaborators [3]. West et al. showed that on average women publish fewer papers than men [32]. These related works use various tools, such as gender-api.com or the US Social Security Administration website, to help infer the gender of researchers.

This research is a part of an ongoing project, with McNichols et al. presenting results on hierarchical subnetworks in an early version of a less complete Cal Poly Collaboration network with respect to gender [21]. Nakamichi et al. studied four disjoint departmental networks, Computer Science, Mathematics, Biology, and Electrical Engineering, with respect to gender [22]. In this paper, we present work related to an enhanced network via the use of the Scopus database rather than Google Scholar and Microsoft Researcher, which were found to be less complete. Prior related work also includes early work on visualization via the Google Maps API [8]. We also present a broader scope of the network, with seed vertices representing all researchers at Cal Poly indexed by Scopus along with a large fraction of researchers from UC schools, and an analysis of race and ethnicity along with gender. The current work includes a much more robust visualization system with a user interface to select institutions, various network views, edge coloring and other features described below.

This work in many ways is motivated by various studies which have shown positive trends for gender-heterogeneous working groups [7] and general results showing the value of diversity on research teams, including a study conducted in 2006 suggesting that a diversity of perspectives can provide more innovative solutions to multifaceted problems [26,30]. However, Hofstra et al. found that there is a diversity paradox: diversity is known to breed innovation, yet the underrepresented people that diversify organizations have less successful careers within those organizations [19]. A long-term goal of this project is to contribute insight into understanding and helping faculty build and maintain successful collaboration networks.

3 System Overview and Methods of Construction

We present information here about the construction, visualization and analysis of the academic collaboration network using a variety of computational tools and applications built by the research team. The current network includes: 3,751 seed vertices from the Cal Poly Collaboration Network with 12,924 collaborator vertices and 27,103 seed vertices

Constructing and Visualizing a Large Collaboration Network

89

from the UC Computing Collaboration Network with 463,854 collaborator vertices. This data was gathered from Scopus, Elsevier’s abstract and citation database [6]. Tables containing the researchers aﬃliated with Cal Poly and the ten universities in the University of California system were downloaded from the website. Scopus provides each author with a set of subject areas. For Cal Poly, all researchers were included. For the UCs, only researchers with a subject of computer science were considered. This is a much broader category than the faculty members of the UC computer science departments, and includes all researchers working in a ﬁeld related to computer science. For example, approximately 10–20% of researchers from each UC school had a subject area including ‘computer science’, which included many researchers whose primary department was from other ﬁelds. This wider net allows us to examine the computing ﬁeld as a whole. In the Scopus system, each author is given a unique ID. The author name, surname, and list of publications were collected through the use of the Scopus API using this ID. Data access was managed via the open source python library pybliometrics [28]. A unique identiﬁer for each publication was obtained, which was used for a second data query for publication data. Speciﬁcally, information was gathered about publication title, venue, citation count, list of coauthors, and the aﬃliation ID for each coauthor. For some publications, both the department and university of the researchers were provided. Other times, only the university was provided. Both were collected when available. Finally, the aﬃliation IDs were queried to yield the aﬃliation name, city, state, country, and postal code. Latitude and longitude coordinates were obtained from these ﬁelds using the open source library GeoPy. Location was derived with the most speciﬁc data possible (i.e. aﬃliation name, city, state, and country). 
Most locations could be identified from either the affiliation name or the city. Future work will address visualizing any geospatial errors.

To create subnetworks based on inferred gender and ethnicity information, we use the NamSor API [29]. NamSor was also used to infer gender in an extensive 2020 Elsevier report on gender's impact on research [12]. NamSor classifies gender as either male or female and ethnicity as Asian, Hispanic/Latino, Black non-Hispanic, or White non-Hispanic. Each classification comes with a calibrated probability that it is correct. We were not able to use the Black non-Hispanic classification reliably (discussed below).

We acknowledge that any inferred classification is potentially misleading and that intersectional identities are an important consideration for this kind of research [9,10,27]. We attempt to make choices that respect the complexity of researcher identity and will continue to attend to this aspect of the research. The risk of such inference is the erasure of an individual's unique identity, including through the limitation of binary gender categories. We feel this risk is mitigated by our focus on aggregate data and trends, where these kinds of differentiation provide insight into system-level concerns when striving for equity. Future work includes the goal of allowing researchers to self-identify.
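Returning to the geocoding step of the data collection, deriving a location from the most specific affiliation data available could be sketched as follows. This is only an illustration under assumptions: the function names `candidate_queries` and `locate`, the field names, and the fallback order are hypothetical and are not taken from the authors' code. The `geocode` argument stands in for any callable that maps a query string to a (latitude, longitude) pair or `None`; a GeoPy geocoder could be wrapped to fit that shape.

```python
from typing import Callable, Dict, List, Optional, Tuple

Coordinates = Tuple[float, float]

# Hypothetical field order, from most to least specific.
FIELDS = ["name", "city", "state", "country"]


def candidate_queries(affiliation: Dict[str, str]) -> List[str]:
    """Build query strings, dropping the most specific field each time."""
    present = [affiliation[f] for f in FIELDS if affiliation.get(f)]
    return [", ".join(present[i:]) for i in range(len(present))]


def locate(
    affiliation: Dict[str, str],
    geocode: Callable[[str], Optional[Coordinates]],
) -> Optional[Coordinates]:
    """Return coordinates for the first (most specific) query that resolves."""
    for query in candidate_queries(affiliation):
        result = geocode(query)
        if result is not None:
            return result
    return None


# Stand-in geocoder for illustration only: it knows a single city.
def fake_geocode(query: str) -> Optional[Coordinates]:
    known = {"San Luis Obispo, CA, United States": (35.28, -120.66)}
    return known.get(query)


example = {
    "name": "Some Department",  # hypothetical affiliation record
    "city": "San Luis Obispo",
    "state": "CA",
    "country": "United States",
}
coords = locate(example, fake_geocode)
```

Here the full affiliation name does not resolve, so the sketch falls back to the city-level query. Passing the geocoder in as a callable keeps the fallback logic independent of any particular geocoding service.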


L. McNichols et al.

Fig. 1. Top: view of the visualization application, showing Cal Poly Network (Global View), including the UI shown on the left-hand side. Bottom: Cal Poly Network focused at a national level

4 Visualization

To facilitate the exploration of trends in a large dataset, we created a system to display collaboration networks embedded on a world map using the Google Maps API. The application consists of a global world map and a sidebar of user interface (UI) elements. Figure 1 shows the visualization of Cal Poly’s collaboration network. The sidebar allows users to select a given institution, whose collaboration network can be displayed on the global map. Once an institution is selected, the sidebar shows a list of other institutions that the selected school has collaborated with. This ‘Institution View’ can be toggled to show a ‘Collaborator View’, which displays all of the school’s authors and their publications. The sidebar also allows the user to ﬁlter the data by inferred gender of authors at the seed institution. We represent institutions on the map with a marker icon


located at the corresponding institution's latitude and longitude coordinates. The API provides built-in support for examining the map at various scales (zoom levels), and institution markers scale with the zoom level to remain visible. To represent edges in the network, the system uses polylines colored according to the number of shared publications.

The primary goal of our visualization tool is to convey potential trends that can drive inquiry and questions. Embedding the collected data on a map makes it easier for viewers to identify geospatial trends. For example, Fig. 1, which displays global and national views of Cal Poly's network, shows that Cal Poly has collaborated with many institutions in the US and Europe. In the network visualized by the current system, an edge represents a collaboration between institutions.

We also convey trends by encoding information in the color of the edges on the map. Coloring edges based on meaningful data visually signifies the relative strength of connections between vertices. We prioritize higher-weighted edges by drawing them on top. In Fig. 1, the edges are colored by the number of publications shared between institutions: red edges represent a higher number of shared publications and yellow edges a lower number. We calculate a weight between 0 and 1 and use it to interpolate between yellow and red; completely red edges denote that the two institutions have more than 100 shared publications.
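The edge coloring described above can be illustrated with a short sketch. The text states only that a weight in [0, 1] interpolates between yellow and red and that fully red corresponds to more than 100 shared publications. The linear mapping from publication count to weight, the saturation exactly at 100, the RGB endpoints, and the names `edge_weight`, `edge_color`, and `draw_order` are assumptions made here for illustration.

```python
SATURATION = 100  # assumed count at which an edge becomes fully red

YELLOW = (255, 255, 0)
RED = (255, 0, 0)


def edge_weight(shared_publications: int) -> float:
    """Map a publication count to a weight in [0, 1] (assumed linear, capped)."""
    return min(max(shared_publications, 0) / SATURATION, 1.0)


def edge_color(shared_publications: int) -> str:
    """Linearly interpolate each RGB channel from yellow (0) to red (1)."""
    w = edge_weight(shared_publications)
    channels = [round(lo + (hi - lo) * w) for lo, hi in zip(YELLOW, RED)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


def draw_order(edges: dict) -> list:
    """Sort edge keys so that higher-weighted edges are drawn last (on top)."""
    return sorted(edges, key=lambda e: edge_weight(edges[e]))


# Hypothetical edges keyed by institution pair, valued by shared publications.
edges = {("CP", "X"): 120, ("CP", "Y"): 3, ("CP", "Z"): 40}
order = draw_order(edges)
colors = {e: edge_color(n) for e, n in edges.items()}
```

Only the green channel changes between the two endpoints, so the interpolation reduces to fading green from 255 to 0 as the weight grows.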

(a) inferred male author subnetwork

(b) inferred female author subnetwork

Fig. 2. Visualization of two subnetworks for the Cal Poly Collaboration network based on inferred gender of the seed author. Lines represent a shared publication and line coloring denotes the strength of the collaboration with red indicating a stronger collaboration (greater than 100 publications).

To explore more speciﬁc trends and to drive further analysis, the system includes support to generate visualizations based on subnetworks ﬁltered on various author taxonomies. Filtered subnetwork visualizations allow the analysis and


comparison of various aspects of the authors' identities, including department, gender, and ethnicity. The results section presents visualizations of such subnetworks; ethnicity subnetwork visualizations are to be completed in future work.

5 Network Analysis

Here we present topological network metrics for 28 networks derived from the Cal Poly Collaboration Network and the UC Computing Collaboration Network. Our primary interest is in collaborations between individuals, which we think are best understood by looking at small research teams. Thus, we include only publications with 10 or fewer authors in the following analysis. Future work may examine changing this co-author limit.

Our research uses categorizations of the race, ethnicity, and gender identities of academic researchers in order to understand how aggregate networks of researchers with these identities function. The goal of this kind of research is to better understand whether there are differences that should be considered when pursuing institutional change in support of equity. For example, Bensimon et al. [5] describe equity-mindedness as an approach that is “color-conscious” rather than “color-blind”; recognizes “that beliefs, expectations, and practices assumed to be neutral can have outcomes that are racially disadvantageous”; takes institutional “responsibility for the elimination of inequality”; and is “[a]ware that while racism is not always overt, racialized patterns nevertheless permeate policies and practices in higher education institutions.”

We consider the entire Cal Poly Collaboration Network, where the seed vertices are all faculty at Cal Poly, as well as subsets of the seeds of this larger network. Specifically, we group the Cal Poly seed vertices by inferred gender identity (female/male) and by inferred ethnicity (Asian, Hispanic/Latino, White non-Hispanic). Table 1 includes the data for all categorizations. We also consider similar subnetworks for faculty working in the computing fields at California schools: the entire UC computing subnetwork, the Cal Poly computing subnetwork, and the UCSB (University of California, Santa Barbara) computing subnetwork.
We chose the UCSB subnetwork because UCSB is similar to Cal Poly in geographic location and size, allowing us to consider trends where the main difference is a focus on undergraduate education versus a PhD-granting institution. We analyze each of these networks with respect to the average degree and the global clustering coefficient. The average degree of a network is the number of neighbors of a vertex, averaged over all vertices in the network. In our networks this measures the average number of collaborators. The global clustering coefficient is the ratio of the number of closed triplets to the number of all triplets (both open and closed), where a triplet is a vertex together with two of its neighbors and each triangle contributes three closed triplets. The clustering coefficient measures how cohesive one's neighborhood is. In the context of collaboration networks, it measures how much one's collaborators collaborate with each other.
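The two metrics defined above can be computed directly from an adjacency structure. The following is a minimal sketch, not the authors' implementation: the helper names, the toy publication list, and the choice to build an unweighted co-authorship graph are assumptions for illustration. The sketch also applies the restriction to publications with at most 10 authors.

```python
from itertools import combinations

MAX_AUTHORS = 10  # publications with more authors are skipped


def build_graph(publications):
    """Build an undirected co-authorship graph as {author: set(neighbors)}."""
    graph = {}
    for authors in publications:
        if len(authors) > MAX_AUTHORS:
            continue
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of neighbors per vertex."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def global_clustering(graph):
    """Closed triplets divided by all triplets (open and closed)."""
    closed = total = 0
    for nbrs in graph.values():
        for u, v in combinations(nbrs, 2):  # one triplet centered on this vertex
            total += 1
            if v in graph[u]:
                closed += 1
    return closed / total if total else 0.0


# Toy data: one three-author paper and one two-author paper.
papers = [["a", "b", "c"], ["c", "d"]]
g = build_graph(papers)
avg_deg = average_degree(g)        # 2.0
clustering = global_clustering(g)  # 0.6 (3 closed triplets out of 5)
```

In the toy graph, the triangle a-b-c contributes three closed triplets, and the extra edge c-d adds two open triplets centered on c, which gives 3/5.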


Table 1. The number of seed authors, the number of the seeds' collaborators, the number of the seeds' papers, the average degree (d̄), and the clustering coefficient (C) for each subnetwork, based on the seed authors' affiliation: Cal Poly (CP), Univ. of California Computing (UC computing), Santa Barbara Computing (UCSB computing), and Cal Poly Computing (CP computing), as well as the seed authors' inferred gender and the seed authors' inferred ethnicity (Asian (A), Hispanic/Latino (HL), White non-Latino (W NL), or unclassified (U)).

Affiliation      G/E      Authors (seed)   Collaborators   Papers    d̄       C
CP               All      3751             12924           12250     8.44    0.30
CP               Female   996              4731            3192      8.43    0.40
CP               Male     2056             9382            8615      9.15    0.28
CP               A        705              4092            3312      9.12    0.27
CP               HL       240              1360            706       8.53    0.46
CP               W NL     2631             9875            8936      8.23    0.33
CP               U        175              1431            685       8.63    0.53
UC computing     All      27103            463854          491893    33.29   0.12
UC computing     Female   4357             95858           70106     29.55   0.18
UC computing     Male     15458            358206          373682    39.14   0.12
UC computing     A        14956            221952          214521    25.12   0.15
UC computing     HL       1234             33218           23963     31.21   0.20
UC computing     W NL     8693             253528          240115    44.72   0.12
UC computing     U        2220             83886           69325     46.37   0.15
CP computing     All      670              5522            5039      13.55   0.29
CP computing     Female   135              1635            1097      15.43   0.34
CP computing     Male     432              4159            3941      14.08   0.28
CP computing     A        168              1526            1474      12.16   0.26
CP computing     HL       36               191             103       7.29    0.59
CP computing     W NL     439              4326            3618      14.62   0.30
CP computing     U        27               319             230       12.46   0.50
UCSB computing   All      2840             50692           58718     33.42   0.11
UCSB computing   Female   431              9159            8086      28.05   0.17
UCSB computing   Male     1652             40731           47274     41.14   0.10
UCSB computing   A        1345             18584           20278     22.50   0.13
UCSB computing   HL       148              3334            3516      27.78   0.22
UCSB computing   W NL     1187             33026           36263     45.38   0.10
UCSB computing   U        160              6186            6095      42.68   0.15

For the construction of the subnetworks, we classify gender as unknown if the probability that NamSor gives is below 80%. For classifying ethnicity, we needed to carefully reﬁne our ﬁltering based on the NamSor data. We noticed


that the distribution of inferred ethnicity did not match known data for the faculty population. Specifically, the inferred numbers of Black non-Hispanic and Hispanic/Latino authors did not match the known demographics of the university. We believe this issue arose because NamSor uses a calibration based on a random cross section of the United States, which is not always reflective of local demographics. For this work, we propose an adjustment to the ethnicity probabilities based on knowledge of the overall demographics of the Cal Poly faculty. After adjustment, we consider the inferred ethnicity unknown if the probability is below 50%.

However, even with this adjustment, we found the Black non-Hispanic classifications to be unreliable for our datasets. Our network is primarily composed of mature adults, and research indicates that the prevalence of names with a high 'Black Name Index' among African American babies has varied over time [17]; the historical legacy of slavery has also had an impact on African American names. Given these factors, together with the known occupational under-representation associated with academia [16], name-based ethnicity inference for Black non-Hispanic faculty members was too unreliable for our dataset. Future work includes developing alternative methods to filter for this important subnetwork and its trends. For completeness, we also consider the subset of seed researchers whose ethnicity could not be classified.

For the remaining classifications, we apply an adjustment that assumes that names are independent of faculty affiliations after conditioning on ethnicity. Let $\mathcal{E}$ be the set of ethnicities under consideration and $e \in \mathcal{E}$. Let $E$ be a random variable representing the ethnicity of an author with name $N$ and affiliation $A$. We can then estimate the conditional probability that the author's ethnicity is $e$ given $N$ and $A$ as:

$$\Pr(E = e \mid N, A) = \frac{\dfrac{\Pr(E = e \mid N)\,\Pr(E = e \mid A)}{\Pr(E = e)}}{\displaystyle\sum_{f \in \mathcal{E}} \dfrac{\Pr(E = f \mid N)\,\Pr(E = f \mid A)}{\Pr(E = f)}}.$$

In our computations, demographic data from 'The Cal Poly Experience Faculty Campus Climate Survey on Diversity Equity and Inclusion' and the UC website are used for context. The expression above requires that Pr(E = f | N) is known for each ethnicity f, but NamSor only provides probabilities for the two most likely ethnicities. We chose to divide the remaining probability evenly between the two less likely ethnicities. An exploration of the parameter space shows that this assumption has a fairly small effect. We acknowledge that this categorization can be an oversimplification of multiracial identities.
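The adjustment described above can be sketched in a few lines of Python. This is an illustrative sketch under assumptions rather than the authors' code: the function names, the short ethnicity labels, and all numeric values are hypothetical and made up for the example. The sketch fills in the two unreported name probabilities by splitting the leftover mass evenly, applies the adjustment formula, and marks the result as unclassified ("U") when the top adjusted probability is below 50%.

```python
ETHNICITIES = ["A", "HL", "B", "W"]  # hypothetical short labels


def fill_name_probs(top_two):
    """Spread the probability NamSor did not report evenly over the rest."""
    remaining = 1.0 - sum(top_two.values())
    others = [e for e in ETHNICITIES if e not in top_two]
    share = remaining / len(others) if others else 0.0
    return {e: top_two.get(e, share) for e in ETHNICITIES}


def adjust(p_name, p_affil, p_prior):
    """Pr(E=e|N,A) proportional to Pr(E=e|N) * Pr(E=e|A) / Pr(E=e)."""
    scores = {e: p_name[e] * p_affil[e] / p_prior[e] for e in ETHNICITIES}
    total = sum(scores.values())
    return {e: s / total for e, s in scores.items()}


def classify(posterior, threshold=0.5):
    """Return the most likely ethnicity, or 'U' if it is below the threshold."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else "U"


# Made-up inputs for illustration only.
name_probs = fill_name_probs({"W": 0.55, "HL": 0.35})
prior = {"A": 0.25, "HL": 0.25, "B": 0.25, "W": 0.25}
affiliation = {"A": 0.20, "HL": 0.10, "B": 0.05, "W": 0.65}

posterior = adjust(name_probs, affiliation, prior)
label = classify(posterior)
```

When the affiliation distribution equals the prior, the adjustment leaves the name-based probabilities unchanged, which is a convenient sanity check on the formula.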

6 Results

This ongoing research project includes a system for gathering, enumerating, creating, and visualizing a network of academic collaboration networks. We have presented our process for the creation of this network and various subnetworks. At this time, we have initial results on subnetworks based on seed author locations (Cal Poly vs UCs), seed author inferred gender and seed author inferred


ethnicity. Table 1 shows the complete network analysis for the two collaboration networks. In general, the average degree for the Cal Poly network is lower than for the UCs. This is also seen in Fig. 3, which compares visualizations of the two schools' subnetworks for computing faculty only. Not surprisingly, the UC subnetwork shows many more edges (publications) than the undergraduate-serving Cal Poly network. UC Santa Barbara and Cal Poly are similarly sized schools, with UC Santa Barbara ranked as an 'R1' research institution (including PhD programs). Interestingly, the Cal Poly computing subnetwork has a higher overall average degree than the full Cal Poly network. Thus, comparing the two computing subnetworks gives a better view of the difference in academic collaborations between institution types.

The inferred female subnetworks have a higher clustering coefficient, and most have a slightly lower average degree, than the inferred male subnetworks, consistent with previous work [3,14,20,32]. The exception is the Cal Poly computing network, where the inferred female subnetwork has a higher average degree. Figure 2 shows a visualization of the inferred male and inferred female subnetworks of the Cal Poly Collaboration Network. Our data shows the larger span of the inferred male author subnetwork in contrast to the more pronounced strong connections of the inferred female author subnetwork. Among the classified ethnicity subnetworks, the inferred Hispanic/Latino subnetwork has the highest clustering coefficient for every affiliation. The clustering was also observed to be substantially higher for the Cal Poly computing subnetwork than for the UCSB computing subnetwork. Further exploration of the variations in these networks is part of future work.

(a) Network of Cal Poly authors in computing

(b) Network of UCSB authors in computing

Fig. 3. Visualization of two subnetworks for only the seed vertices of authors classiﬁed in the computing ﬁelds. Comparison of Cal Poly (left) and UC Santa Barbara (right). Edge coloring based on strength of collaboration.

Acknowledgements. We would like to give our sincerest thanks to previous members of this team eﬀort, Connor Carroll, Nupur Garg, Gabriel Medina-Kim, Lauren Nakamichi, Viet Nguyen, Christian Rapp, and Barbara Walker.


References

1. Scopus. https://www.scopus.com/search/form.uri?display=basic#basic. Accessed 10 May 2021
2. Abramo, G., D'Angelo, C.A., Murgia, G.: Gender differences in research collaboration. J. Informetr. 7(4), 811–822 (2013). https://doi.org/10.1016/j.joi.2013.07.002
3. Araújo, E.B., Araújo, N.A.M., Moreira, A.A., Herrmann, H.J., Andrade, J.S., Jr.: Gender differences in scientific collaborations: Women are more egalitarian than men. PLOS ONE 12(5), 1–10 (2017). https://doi.org/10.1371/journal.pone.0176791
4. Barabási, A., Jeong, H., Néda, Z., Ravasz, E., Schubert, A., Vicsek, T.: Evolution of the social network of scientific collaborations. Phys. A Statist. Mech. App. 311(3), 590–614 (2002). https://doi.org/10.1016/S0378-4371(02)00736-7
5. Bensimon, E.M., Dowd, A.C., Witham, K.: Diversity And Democracy: Five Principles For Enacting Equity By Design. Association of American Colleges & Universities, Washington, D.C. (2016). A Voice And A Force For Liberal Education
6. Burnham, J.: Scopus database: a review. Biomed. Digital Libr. 3, 1 (2006). https://doi.org/10.1186/1742-5581-3-1
7. Campbell, L.G., Mehtani, S., Dozier, M.E., Rinehart, J.: Gender-heterogeneous working groups produce higher quality science. PLOS ONE 8(10), 1–6 (2013). https://doi.org/10.1371/journal.pone.0079147
8. Carroll, C., Garg, N., Migler, T., Walker, B., Wood, Z.: Mapping and visualization of publication networks of public university faculty in computer science and electrical engineering. CATA 2, 1–12 (2020)
9. Collins, P.: Intersectionality's definitional dilemmas. Ann. Rev. Sociol. 41, 1–20 (2015)
10. Crenshaw, K.: Mapping the margins: intersectionality, identity politics, and violence against women of color. Stanford Law Rev. 43, 1241 (1991)
11. De Castro, R., Grossman, J.: Famous trails to Paul Erdős with a sidebar by Paul M. B. Vitanyi. Math. Intell. 21, 1–36 (1999)
12. De Kleijn, M., Jayabalasingham, B., Falk-Krzesinski, H., Collins, T., Kuiper-Hoyng, L., Cingolani, I., Zhang, J., Roberge, G.: The researcher journey through a gender lens: an examination of research participation, career progression and perceptions across the globe. Technical report, Elsevier, March 2020
13. Ding, Y.: Scientific collaboration and endorsement: network analysis of coauthorship and citation networks. J. Informetr. 5(1), 187–203 (2011). https://doi.org/10.1016/j.joi.2010.10.008
14. Ductor, L., Goyal, S., Prummer, A.: Gender & collaboration. Cambridge working papers in economics, Faculty of Economics, University of Cambridge (2018). https://EconPapers.repec.org/RePEc:cam:camdae:1820
15. Fortunato, S., Bergstrom, C.T., Börner, K., Evans, J.A., Helbing, D., Milojević, S., Petersen, A.M., Radicchi, F., Sinatra, R., Uzzi, B., Vespignani, A., Waltman, L., Wang, D., Barabási, A.L.: Science of science. Science 359, 6379 (2018). https://doi.org/10.1126/science.aao0185
16. Foundation, N.S.: National center for science and engineering statistics. In: Women, Minorities, and Persons with Disabilities in Science and Engineering: 2019. Special report NSF, pp. 19–304 (2019). https://www.nsf.gov/statistics/wmpd
17. Fryer, R.G., Levitt, S.D.: The causes and consequences of distinctively black names. Q. J. Econ. 119, 767–805 (2004)
18. Grossman, J.W., Ion, P.D.F.: On a portion of the well-known collaboration graph. Congr. Numer. 108, 129–131 (1995)
19. Hofstra, B., Kulkarni, V.V., Munoz-Najar Galvez, S., He, B., Jurafsky, D., McFarland, D.A.: The diversity–innovation paradox in science. Proc. Natl. Acad. Sci. 117(17), 9284–9291 (2020). https://doi.org/10.1073/pnas.1915378117
20. Holman, L., Morandin, C.: Researchers collaborate with same-gendered colleagues more often than expected across the life sciences. PLOS ONE 14, e0216128 (2019)
21. McNichols, L., Medina-Kim, G., Nguyen, V.L., Rapp, C., Migler, T.: Gender's influence on academic collaboration in a university-wide network. In: Cherifi, H., Gaito, S., Mendes, J., Moro, E., Rocha, L. (eds.) Complex Networks and Their Applications VIII, vol. 882, pp. 94–104. Springer International Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-36683-4_8
22. Nakamichi, L., Migler, T., Wood, Z.: An analysis of four academic department collaboration networks with respect to gender. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) Complex Networks & Their Applications IX, vol. 943, pp. 262–272. Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-65347-7_22
23. Newman, M.E.J.: Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. Sci. 101(suppl 1), 5200–5205 (2004)
24. Newman, M.: Scientific collaboration networks I network construction and fundamental results. Phys. Rev. E Statist. Nonlinear Soft Matter Phys. 64, 016131 (2001)
25. Pan, R.K., Kaski, K., Fortunato, S.: World citation and collaboration networks: uncovering the role of geography in science. Sci. Reports 2, 902 (2012). https://doi.org/10.1038/srep00902
26. Rainey, Katherine, Dancy, Melissa, Mickelson, Roslyn, Stearns, Elizabeth, Moller, Stephanie: Race and gender differences in how sense of belonging influences decisions to major in STEM. Int. J. STEM Educ. 5(1), 1–14 (2018). https://doi.org/10.1186/s40594-018-0115-6
27. Rainey, K., Dancy, M., Mickelson, R., Stearns, E., Moller, S.: Race and gender differences in how sense of belonging influences decisions to major in stem. Int. J. STEM Educ. 5(1), 10 (2018)
28. Rose, M., Kitchin, J.: pybliometrics: Scriptable bibliometrics using a python interface to Scopus. SoftwareX 10, 1–6 (2019). https://doi.org/10.1016/j.softx.2019.100263
29. Santamaría, L., Mihaljevic, H.: Comparison and benchmark of name-to-gender inference services. PeerJ Comput. Sci. 4, e156 (2018)
30. Reich, S.M., Reich, J.A.: Cultural competence in interdisciplinary collaborations: a method for respecting diversity in research partnerships. Am. J. Commun. Psychol. 38, 51–62 (2006)
31. Watts, D.: Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton Studies in Complexity. Princeton University Press, Princeton (2004)
32. West, J.D., Jacquet, J., King, M.M., Correll, S.J., Bergstrom, C.T.: The role of gender in scholarly authorship. PLOS ONE 8(7), 1–6 (2013). https://doi.org/10.1371/journal.pone.0066212

Modelling Damage Propagation in Complex Networks: Life Exists in Half-Chaos

Andrzej Gecow and Mariusz Nowostawski

Computer Science Department, IDI, NTNU, Norwegian University of Science and Technology, Gjøvik, Norway
https://sites.google.com/site/andrzejgecow/home, https://www.ntnu.edu/employees/mariusz.nowostawski

Abstract. Human institutions and administrative units, technological processes, technical constructions, and living organisms can all be modelled as dynamical, discrete, and finite complex networks. This is an important practical approach to modelling complex dynamical systems. Kauffman's original hypothesis, life on the edge of chaos, imposes considerable limitations on such modelling. The problem is exacerbated by the fact that estimated network parameters are usually far from the range indicated by the hypothesis itself. This report describes experimental evidence and argues that the assumptions leading to Kauffman's conclusion are too simplistic. Kauffman's approach to predicting statistical stability uses random networks approximated by infinite and continuous systems. These networks can be either stable (i.e. ordered) or unstable (i.e. chaotic). A slight adjustment of the network parameters produces a rapid jump between chaos and order, called a phase transition. Only for network parameters near this phase transition do the network changes have properties suitable for describing and capturing the stability of modelled real-world objects. Note, however, that the modelled real systems are certainly neither infinite nor continuous. Their parameters usually lie in the region where a random system is chaotic, far from the narrow phase transition, yet the objects do not exhibit chaotic behaviour and are not random. In this work, we demonstrate a third system state that we dub half-chaos. Half-chaotic systems are tuned such that they can exhibit both small and large reactions to small network disturbances. This third class of networks allows current theories to be expanded for improved modelling of complex dynamical systems. We argue that half-chaotic systems better describe the dynamical phenomena appearing in the real world.

Keywords: Complex systems · Complex networks · Kauffman networks · Chaos · Phase transition

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 98–107, 2021. https://doi.org/10.1007/978-3-030-81854-8_9

Life in Half-Chaos

1 Introduction

The widely accepted Kauffman hypothesis, life on the edge of chaos, imposes considerable and inconvenient limitations on modelling a large range of real-world phenomena. The problem is exacerbated by estimates that usually place network parameters far from the range indicated by the hypothesis itself. The assumptions leading to Kauffman's conclusion turn out to be too simplistic, and it is these simplifications that create the uncomfortable constraints. Existing studies [12,13] demonstrate the over-simplification and show that the Kauffman hypothesis should be revised. In this article we first describe the Kauffman hypothesis and then demonstrate the limitations resulting from Kauffman's assumptions. Subsequently, we provide a better methodology that does not have those limitations and can be used to model a broader class of systems, which operate in the regime that we call half-chaos.

2 Kauffman's Hypothesis

The main characteristic of the chaotic behaviour of physical dynamical systems is their high sensitivity to small perturbations of the initial conditions, which leads to very different final effects. In other words, for very similar initial conditions, the system evolution is unpredictable and can result in widely different outcomes. Kauffman [17–19] uses the term chaos to describe the behaviour of finite, discrete, deterministic networks. Note that contemporary theories use a definition of chaos based on Lyapunov coefficients for functions in infinite, continuous spaces. The term (deterministic) chaos is not reserved for just one of continuous/infinite or discrete/finite systems; it can be, and has been, applied to both. Specifically, chaos theory has been used for finite discrete networks [2,3].

Different experiments make different assumptions about the nature of the underlying (finite and discrete) process. Those assumptions sometimes involve using a network with an infinite number of nodes, as well as relying on specific ways of calculating node values, such as probability distributions, e.g. [9], or asynchronous function updates [5]. Some assumptions, e.g. about asynchronous activation of the update function, allow researchers to use Lyapunov exponents and Jacobians in their analysis, e.g. [4]. Note, however, that by making some of those assumptions researchers have missed properties that are inherent in finite discrete networks but absent in networks with infinite and continuous functions. For example, in traditional chaos theory for networks with continuous functions, it does not make sense to define the probability of an event in which a given node obtains specific input values that have already occurred in the past. In contrast, such a probability can be defined and measured for finite and discrete networks; see, for example, [1].
Analogously, the concept of the attractor length, or path length to an attractor can be deﬁned in discrete ﬁnite systems, but cannot be deﬁned in the context of continuous space. In discrete step-wise synchronous systems it can be simply deﬁned as the number of transformation


A. Gecow and M. Nowostawski

steps in a process. This is an important difference: it allows real-world phenomena that map to finite and discrete networks to be investigated in more detail than with traditional chaos theory based on continuous spaces.

Current modelling methodologies assume that finite and discrete networks can be either stable (i.e. ordered) or unstable (i.e. chaotic). A slight change in the network parameters produces a fast jump between chaos and order, called a phase transition. Only for network parameters in the immediate vicinity of this phase transition are the changes neither too small nor too large, and thus suitable for describing the stability of the adaptive evolution of modelled real-world objects. This is commonly known as Kauffman's hypothesis and is captured by the succinct phrase: life evolves on the edge of chaos.

In this article we describe our experimental finding of half-chaotic networks. A half-chaotic network exhibits both very large and very small changes in response to similar disturbances. This range of network behaviours is not taken into account by current modelling methodologies. The discovery of half-chaotic regimes is not a refutation of the existing theories. Rather, it provides evidence that in some areas the current theory is not adequate for describing the complexity of the discrete systems being modelled. Similarly, Newtonian mechanics is not false, but it yields false predictions for objects moving near the speed of light.
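Because a finite, discrete, deterministic system must eventually revisit a state, the path length to an attractor and the attractor length discussed in this section can be measured by simply counting transformation steps. The following sketch illustrates this on a toy update rule; the function name `transient_and_cycle` and the example map are assumptions for illustration and are not taken from the paper.

```python
def transient_and_cycle(step, state):
    """Iterate a deterministic map until a state repeats.

    Returns (path length to the attractor, attractor length), both
    counted in transformation steps. States must be hashable.
    """
    first_seen = {}
    t = 0
    while state not in first_seen:
        first_seen[state] = t
        state = step(state)
        t += 1
    start = first_seen[state]  # time at which the repeated state first occurred
    return start, t - start


# Toy deterministic map on the finite set {0, ..., 9}.
def toy_step(x):
    return (x * x + 1) % 10


# Starting from 3: 3 -> 0 -> 1 -> 2 -> 5 -> 6 -> 7 -> 0 -> ...
path_len, cycle_len = transient_and_cycle(toy_step, 3)  # (1, 6)
```

For a network, the state would be the tuple of all node states, so the same counting applies unchanged. No analogous finite count exists for a trajectory in a continuous space, where an exact return to a previous state is not expected.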

3 Experiments and Network Parameters

The experimental basis for classifying a particular network as chaotic or ordered is the distribution of damage size, i.e. the size distribution of avalanches [27] caused by small disturbances. Figure 1 depicts the levels of damage equilibrium (maximum damage, dmx) calculated for chaotic networks from Derrida's annealed approximation model [8,9]. Chaotic networks are not suitable for modelling most of the interesting real-world complex systems. For a random network, the parameters s and K determine whether it lies in the chaotic region; half-chaotic networks with the same parameters nevertheless show enough order (see the middle of Fig. 2) to be used for modelling real-world complex systems.

In Fig. 2, there are two peaks in the damage distribution. The right peak is near the Derrida equilibrium level [11]; it contains the results of chaotic responses to small disturbances. The left peak contains damage so small that the responses are taken as ordered. The shares of ordered and chaotic responses are shown in the middle of Fig. 2. Typical ordered or chaotic systems show only one of the peaks (very similar to the corresponding peak in Fig. 2). Thus, systems exhibiting the distributions depicted in Fig. 2 are neither ordered nor chaotic. We call them half-chaotic. These systems are distinct from ordered and chaotic systems, the only two states previously assumed under Kauffman's models. The existence of this third state for discrete systems was unknown before our experimental work.

In this work and in our experiments we have used non-Boolean Kauffman networks, i.e. we continue to use the term Kauffman network even for networks


Fig. 1. Maximum damage, dmx, as a function of the parameters K and s. Kauffman explored only s = 2 and K = 2. As can be seen in panel c, this case is especially extreme: there is no dmx. The existence of dmx is necessary to model an objective threshold for losing the stability and identity of a modelled system. This makes the model more adequate for representing real-world phenomena. Parameters s and K greater than 2 make a random system chaotic and, under the original Kauffman hypothesis, unusable for modelling. In half-chaos, however, they can be used. They are necessary to explain structural tendencies of evolving objects. Panel b is called a Derrida plot. It is the result of the annealed approximation built for complex random networks. The figure depicts values for parameters s and K greater than 2. Panel a shows an average d(t), which helps to build intuition for dmx.

with s > 2. The experiments are run on networks with N = 400 nodes; each node has K inputs and k outputs. K is fixed to 3 for a given network configuration, whereas k may differ between nodes. Signals on inputs and outputs have s = 4 equally probable variants. Kauffman used only s = 2, so his networks were Boolean. In a random Boolean network the phase transition between order and chaos appears at K = 2. Other researchers have attempted to introduce more, though not equally probable, signal states; see, for example, [20,24]. This was done for the ordered phase of fully random networks. We use connectivity K > 2, as suggested earlier by other researchers [3,25], and s > 2, which for random networks practically always results in chaos. Note that the networks we use are not fully random but half-chaotic; their damage distribution also has the ordered peak on the left side of Fig. 2, and they are much more adequate for describing real-world phenomena [11,12].

A node uses its function and input signals to calculate its state (its output signal) and then sends this signal to all k of its outputs. The calculation takes a single time step: if the input signals are from time t, then the new state is sent to the receivers at time t + 1. This type of calculation of the network function is called synchronous. Two types of networks are used in Fig. 2: scale-free [7] (labelled f) and Erdős-Rényi [10] random (labelled r). These two network types are generated by different methods of building the network structure (the connections between nodes), and they differ in their distributions of k. The node functions and the nodes' initial states are random. Our networks are deterministic, i.e. for two identical networks, if the processes start from the same initial node states, then they will have the same states after the same number of time steps.
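The synchronous update described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' simulator: each node's K inputs are wired uniformly at random (so the number of outputs k varies between nodes, but the scale-free variant is not reproduced), node functions are stored as full lookup tables with random entries, and all names (`make_network`, `table_index`, `sync_step`, `run`) are hypothetical. The parameters N = 400, K = 3, and s = 4 follow the text.

```python
import random

N, K, S = 400, 3, 4  # nodes, inputs per node, equally probable signal variants


def make_network(seed):
    """Create random wiring, random node functions, and random initial states."""
    rng = random.Random(seed)
    inputs = [[rng.randrange(N) for _ in range(K)] for _ in range(N)]
    # One lookup table per node: S**K entries, one per input combination.
    tables = [[rng.randrange(S) for _ in range(S ** K)] for _ in range(N)]
    states = [rng.randrange(S) for _ in range(N)]
    return inputs, tables, states


def table_index(signals):
    """Encode a sequence of K signals as a base-S integer."""
    index = 0
    for signal in signals:
        index = index * S + signal
    return index


def sync_step(inputs, tables, states):
    """Compute all states at time t + 1 from the states at time t."""
    return [
        tables[node][table_index(states[src] for src in inputs[node])]
        for node in range(N)
    ]


def run(seed, steps):
    """Run a network built from `seed` for the given number of steps."""
    inputs, tables, states = make_network(seed)
    for _ in range(steps):
        states = sync_step(inputs, tables, states)
    return states


# Determinism: identical networks started from identical states stay identical.
same = run(seed=1, steps=50) == run(seed=1, steps=50)
```

Building the new state list from the old one, rather than updating nodes in place, is what makes the step synchronous: every node reads only signals from time t.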

A. Gecow and M. Nowostawski

In our experiments we have introduced some adjustments in order to obtain networks that are not completely random. We have conducted three experiments, named 5, 6 and 7. In experiment 5, the networks are corrected so that they are initially point-attractor systems (PAS). As the initial state of the network, all input and output values are randomly assigned to the nodes' inputs and outputs. Then, for each node, the node's function for the current input values is modified so that the function value equals the currently assigned (random) value of the output. A small disturbance in the network is defined as a permanent change of the function value for the initial input state of one chosen node. By permanent change we mean a change in the node's function that is not reversed after a single simulation step but stays as the new (updated) node function permanently. This is important because, in the case of a permanent change, the length of the attractor path influences the stability of the network. For long attractors, the network has a higher probability of again supplying the same concrete input values to the node that disturbed the network in the first place, causing further disturbance. If the attractor path is short, the network is more likely to settle before such a state re-occurs. If we compare the disturbed and undisturbed networks, then the number of nodes whose states differ between them at time t is a number A (A stands for the Avalanche property as defined in [21]), and the damage is d = A/N. The maximal t is fixed to tmx = 2000 steps. The trajectory of the system (and the value of A) typically becomes periodic, which means that the process reaches an attractor. A(tmx) is the effect of one disturbance in one network and, in practice, can be either ordered (in the left peak) or chaotic (in the right peak). To measure the state of a particular network (ordered, half-chaotic or chaotic), we consecutively make all possible disturbances to the same initial network. There are (s − 1)N of them. Because there is a clear gap between the peaks, deciding which peak each A(tmx) belongs to is simple. The result for a particular network, q, called its degree of order, is the content of the left peak divided by the number of trials. The deep gap between the peaks, absent in Kauffman's approach, constitutes a real threshold for losing the stability and identity of a modelled system, which makes the model much more similar to the behaviour of real-world phenomena. The results from many networks (typically a few hundred) with the same s, K, type and experiment were collected and are depicted in Fig. 2. In all experiments described here s = 4 and K = 3. Kauffman could not use s > 2 in a fully random network, because for s > 2 there is no phase transition, even for the minimal sensible K = 2; only chaotic behaviour is observed. Most researchers believe that s > 2 is not necessary, because each network of conditions can be described using a Boolean network. That belief leads, however, to the false conclusion that the results of statistical investigations of networks do not depend on the value of s [11]. Kauffman states [19] that in the range of random networks, all possible networks are present. Therefore, if they can only be either ordered or chaotic, then it is impossible to find phenomena such as half-chaotic networks among the range of networks that are not fully random. Therefore he does not take into account

Life in Half-Chaos

Fig. 2. Distribution of damage size. A (Avalanche) is the number of nodes whose states differ between the disturbed and undisturbed networks at maximal t (tmx = 2000). The networks have N = 400 nodes and parameters s = 4 and K = 3. The damage is defined as d = A/N. The first character of a curve label (5, 6, 7) indicates the experiment, while the second indicates the network type (f: scale-free, r: random Erdős-Rényi). The results from a few hundred networks for each experiment are depicted on the graphs. For each particular network there is an individual line with two peaks. The networks are neither ordered nor chaotic; thus, they are half-chaotic. The range of the argument A (0 ≤ A ≤ 400) is divided into three parts: the left part, 0–10, for an exact view of the left peaks, where experiment 6 exhibits an unusable feature; the gap between A = 10 and A = 270, where the probability is negligible; and the area of the right peaks at the chaotic equilibrium of damage. In the middle, over the gap, the fractions of ordered (q) and chaotic (1 − q) avalanche events after a small disturbance are shown. Within the range of q, the order resulting from the absence of outputs in some nodes (k = 0) in the random Erdős-Rényi network (r) is shown in yellow. All results presented here concern only the effects of limiting global attractors (experiment 6) or local attractors (in the unfrozen lakes of activity). Typical chaotic networks have q too small to be visible.

that living organisms are not random due to the effects of natural selection. These false conclusions result from a simplification: applying a theory developed for infinite and continuous space to finite and discrete networks. An experimental statistical search is also misleading. While the number of half-chaotic networks is very large, it is still very much smaller than the number of all random networks, so finding a half-chaotic network via a random search is improbable. Many works have assumed that the stability of living organisms cannot be explained in any way other than as an ordered phase near the phase transition to chaos [16,21–23], leading to conclusions such as "order for free" [15]. These sharp limitations on the s and K parameters prohibit the use of values estimated from nature [3] in the modelling of practically all interesting objects, and as a consequence the results of these models were usually unsatisfactory [25]. Demonstrating the existence of half-chaos releases modelling from these limitations. In addition, half-chaotic networks have some properties that should be expected in modelled objects but were not known until now, for example the evolutionary stability of half-chaos, short attractors, and ice. These properties are clarified below.
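
The point-attractor construction, the permanent disturbance, and the measures A and q described in this section can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: all function names are hypothetical, the wiring is uniformly random, and the network size, `t_max` and the peak-separating `threshold` in the usage lines are chosen small only so that the example runs quickly (the text uses N = 400 and tmx = 2000).

```python
import random


def index_of(state, node_inputs, s):
    """Encode the input signals of one node as a base-s table index."""
    idx = 0
    for src in node_inputs:
        idx = idx * s + state[src]
    return idx


def step(state, inputs, tables, s):
    """One synchronous update of all nodes."""
    return [tables[i][index_of(state, inputs[i], s)] for i in range(len(state))]


def make_pas(n, k_in, s, rng):
    """Random network corrected so that its random initial state is a fixed point."""
    inputs = [[rng.randrange(n) for _ in range(k_in)] for _ in range(n)]
    tables = [[rng.randrange(s) for _ in range(s ** k_in)] for _ in range(n)]
    state = [rng.randrange(s) for _ in range(n)]
    for i in range(n):
        # Force the function value for the current inputs to equal the current output.
        tables[i][index_of(state, inputs[i], s)] = state[i]
    return inputs, tables, state


def damage(inputs, tables, state, s, node, new_value, t_max):
    """A(t_max): number of differing node states after one permanent function change."""
    disturbed = [row[:] for row in tables]
    disturbed[node][index_of(state, inputs[node], s)] = new_value
    a, b = list(state), list(state)
    for _ in range(t_max):
        a = step(a, inputs, tables, s)
        b = step(b, inputs, disturbed, s)
    return sum(x != y for x, y in zip(a, b))


def degree_of_order(inputs, tables, state, s, t_max, threshold):
    """Fraction of the (s - 1) * N disturbances whose damage is at most `threshold`."""
    ordered = trials = 0
    for node in range(len(state)):
        for value in range(s):
            if value == state[node]:
                continue  # not a change of the function value
            trials += 1
            if damage(inputs, tables, state, s, node, value, t_max) <= threshold:
                ordered += 1
    return ordered / trials


rng = random.Random(1)
inputs, tables, state = make_pas(40, 3, 4, rng)
print(step(state, inputs, tables, 4) == state)  # the initial state is a point attractor
q_order = degree_of_order(inputs, tables, state, 4, t_max=50, threshold=4)
print(q_order)
```

The `threshold` argument stands in for the gap between the two peaks; in the experiments reported here that gap is read off the measured distribution rather than fixed in advance.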

4 Evolutionary Stability of Half-Chaotic Networks

In the first round of earlier experiments [13], PAS networks were investigated and half-chaotic systems were explored. In this work, a new disturbance was made from the same initial network state in order to measure the system's type (ordered, half-chaotic or chaotic). From this finding, questions arise: What conditions must evolutionary changes meet in order to remain in the half-chaotic state? How long can evolution continue within half-chaos? Accepting (as an evolutionary change) one small disturbance that leads to large damage (i.e. chaotic changes, in the range of the right peak) moves the system into normal chaos. However, when the response to such a disturbance is ordered (in the range of the left peak), the majority of networks created by that small disturbance are also PAS (>99%). The evolution from PAS to PAS can be infinite, but such a model of evolution is unsatisfactory because it does not reflect the complexities and dynamic properties of observed real-world phenomena. Therefore, in the next experiments (5, 6 and 7), evolution was investigated without accepting PAS. In other words, disturbances leading to small damage were accepted as evolutionary changes only when the length of the new attractor path was ≥ 7. Note that a PAS has an attractor path of length 1. This change was intended to move the system away from PAS towards chaos more rapidly, but the process always stabilised in half-chaos. Limiting the evolution of a half-chaotic system to changes causing small damage is enough to remain in half-chaos. This property is called the evolutionary stability of half-chaos. In experiment 5, the new evolution was started from PAS, which is an exceptional case for our networks. The whole network was frozen, with each node iced; in other words, the node states did not change over time. Most changes normally accepted in PAS also lead to PAS. This is the case for over 99% of accepted changes, across various network types and parameters s and K. This type of system evolution can go on for quite a long time before something interesting takes place. Therefore, to force the system to move out of PAS, we introduced the concept of cumulative changes, defined as changes that lead to an attractor of length at least 7. Those changes are left in the network and contribute to the evolution, whereas changes that lead to shorter attractors are ignored. Using this kind of cumulative change, PAS disappeared in the first step of evolution, but most of the nodes stayed iced, with only small unfrozen lakes of activity emerging. This is similar to Kauffman's description of a liquid area [19], where he saw life, but it was limited to areas near s = 2 and K = 2, whereas in half-chaos it is also observed for s > 2 and K > 2. The PAS was used as a system with an extremely short attractor to avoid a return to the initial input state for which the node function was permanently changed. After the single function change that initiates the damage, the system typically evolves towards an attractor, which continues to cycle through the same sequence of system states. During the cycle, the node that was the subject of the damage initiation will go through a number of different input states. The number of possible input states is larger for larger s and K; nevertheless, the set of possible input states is always finite. Note that the longer the system

cycle, and the more different input states the node goes through, the higher the probability that the node will re-encounter the same input state that initiated the damage. Re-occurrence of the same input state on the node re-initiates the damage. We call this secondary initiation of the damage. Such secondary initiation can occur up to the second round of the attractor. The probability of damage fading out after one initiation is not very low, but fade-out after a few different initiations is highly improbable. This is why the length of the attractor plays a significant role in the dynamics of the network. The probability of a node reaching a stable state depends on the length of the attractor, in other words on the number of different states that the given node will go through. The fewer transitions the node needs to go through, the higher the probability that the node will reach a state that stabilises the initiated damage. This applies to both primary and secondary damage initiations. In experiment 6, the new evolution started from an attractor that was 21 steps long. In such a half-chaotic network, no ice and no unfrozen lakes are observed. However, the left peak was very different from that in the other experiments. It contained practically only A = 0, meaning no evolution, and a negligible number of A = 1 and A = 2. In all other experiments the tails of the left peaks decreased much more slowly (see the left part of Fig. 2), which is much more adequate for describing naturally occurring real-world phenomena. In experiment 7, half-chaotic networks were built based on the features of the networks in experiment 5: ice with a few unfrozen lakes. The results of experiment 7 are similar to those of experiment 5. In another experiment the networks grew during evolution; addition and removal of nodes were used as disturbances. The results were notably similar to those of experiments 5 and 7.
In all experiments with evolution, the evolutionary stability of half-chaos was observed.
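
Because the networks are deterministic and finite, the attractor length that governs the acceptance rule in this section can be found by iterating until a state repeats. The sketch below is an illustration only: `attractor_length` and `accept_change` are hypothetical names, the `damage_threshold` parameter is an assumed stand-in for "small damage", and only the minimum attractor length of 7 is taken from the text.

```python
def attractor_length(state, update, max_steps):
    """Iterate a deterministic update until a state repeats.

    Returns (transient_length, cycle_length), or None if no state
    repeats within max_steps iterations.
    """
    seen = {}
    current = tuple(state)
    for t in range(max_steps + 1):
        if current in seen:
            return seen[current], t - seen[current]
        seen[current] = t
        current = tuple(update(list(current)))
    return None


def accept_change(damage_size, cycle_length, damage_threshold, min_cycle=7):
    """Accept a disturbance only if damage is small and the new attractor is long enough."""
    return damage_size <= damage_threshold and cycle_length >= min_cycle


# Toy updates standing in for a network step function.
counter_mod_5 = lambda s: [(s[0] + 1) % 5]   # pure cycle of length 5
saturating = lambda s: [min(s[0] + 1, 3)]    # transient of 3 steps, then a fixed point

print(attractor_length([0], counter_mod_5, 100))
print(attractor_length([0], saturating, 100))
print(accept_change(damage_size=2, cycle_length=1, damage_threshold=10))
```

A point attractor has cycle length 1, so under this rule a change that leads from PAS to PAS is rejected even though its damage is small, which mirrors the "evolution without accepting PAS" described above.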

5 Conclusions

Kauffman considered random Boolean networks (RBNs) and on their basis formulated his hypothesis, limiting connectivity to K = 2 and the number of signal variants to s = 2. However, those parameters are restrictive and do not allow exploration of all possible network state regimes, as has been demonstrated in our own experiments. The process of natural evolution and naturally occurring phenomena exceed these conservative limits. RBNs also lie at the basis of theories regarding the gene regulatory network (GRN). Their properties were mainly investigated by mathematicians [2,3,25,27] and compared with the results of experiments on the genomes of real living objects [22]. Despite some agreement, doubts remained, as the estimates of connectivity K in the experimental results did not meet the limitations resulting from the Kauffman hypothesis. The GRN was, however, an appealing concept; it was picked up by many biologists and is still in use today [26,28]. Most authors do not go into the basics of the model and its deeper connections with the hypothesis, but simply use the conclusions of the model. Overall, the simplistic Kauffman model exhibits severe limitations, as demonstrated by our current and previous experimental results. In 2003 Banzhaf [6]

presented a competitive model of gene regulation much closer to the phenomena occurring in the natural processes of mutation and gene reading. Banzhaf's model focuses on and describes quite different aspects of regulation, and it does not belong to the same category of models. Half-chaos introduces a real, objective threshold of damage size that leads to the loss of stability and identity of the modelled system, which is absent in Kauffman's model. This in turn should revitalise the current model, provide biologists with a larger and more experiential range of conclusions, and allow them to re-examine models of GRNs and extend them to new ranges of parameters. Parameters s and K in the range where random networks are chaotic are a necessary prerequisite for the explanation of structural tendencies [14]. Structural tendencies are an important phenomenon that corresponds to the regularities of the evolution of ontogenesis.

References
1. Adiga, A., Galyean, H., Kuhlman, C.J., Levet, M., Mortveit, H.S., Wu, S.: Network structure and activity in boolean networks. In: Kari, J. (ed.) AUTOMATA 2015. LNCS, vol. 9099, pp. 210–223. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-47221-7_16
2. Aldana, M.: Boolean dynamics of networks with scale-free topology. Physica D Nonlinear Phenom. 185(1), 45–66 (2003)
3. Aldana, M., Coppersmith, S., Kadanoff, L.P.: Boolean dynamics with random couplings. In: Kaplan, E., Marsden, J.E., Sreenivasan, K.R. (eds.) Perspectives and Problems in Nonlinear Science, pp. 23–89. Springer, New York (2003). https://doi.org/10.1007/978-0-387-21789-5_2
4. Baetens, J.M., De Baets, B.: Phenomenological study of irregular cellular automata based on Lyapunov exponents and Jacobians. Chaos Interdiscip. J. Nonlinear Sci. 20(3), 033112 (2010)
5. Baetens, J.M., Van der Weeën, P., De Baets, B.: Effect of asynchronous updating on the stability of cellular automata. Chaos Solitons Fractals 45(4), 383–394 (2012)
6. Banzhaf, W.: Artificial regulatory networks and genetic programming. In: Riolo, R., Worzel, B. (eds.) Genetic Programming Theory and Practice. Genetic Programming Series, vol. 6, pp. 43–61. Springer, Boston (2003). https://doi.org/10.1007/978-1-4419-8983-3_4
7. Barabási, A.L., Albert, R., Jeong, H.: Mean-field theory for scale-free random networks. Physica A Stat. Mech. Appl. 272(1–2), 173–187 (1999)
8. Derrida, B., Weisbuch, G.: Evolution of overlaps between configurations in random boolean networks. J. Phys. 47(8), 1297–1303 (1986)
9. Derrida, B., Pomeau, Y.: Random networks of automata: a simple annealed approximation. EPL (Europhys. Lett.) 1(2), 45 (1986)
10. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5(1), 17–60 (1960)
11. Gecow, A.: Emergence of Matured Chaos During, Network Growth, Place for Adaptive Evolution and More of Equally Probable Signal Variants as an Alternative to Bias p. E. Tlelo-Cuautle, InTech (2011)
12. Gecow, A.: Life evolves in experimentally confirmed ‘half-chaos’ of not fully random networks, but not ‘on the edge of chaos’. In: Skiadas, C.H. (ed.) 13th Chaotic Modeling and Simulation International Conference, vol. CHAOS2020, pp. 259–270, 9–12 June 2020. http://www.cmsim.org/images/CHAOS2020-Proceedings-A-Gr1-316.pdf
13. Gecow, A.: Life is not on the edge of chaos but in a half-chaos of not fully random systems. Definition and simulations of the half-chaos in complex networks. In: Bracken, P., Uzunov, D.I. (eds.) A Collection of Papers on Chaos Theory and Its Applications, p. 122. IntechOpen, 14 April 2021. https://doi.org/10.5772/intechopen.91599. https://www.intechopen.com/books/a-collection-of-papers-onchaos-theory-and-its-applications/life-is-not-on-the-edge-of-chaos-but-in-a-halfchaos-of-not-fully-random-systems-definition-and-simu
14. Gecow, A., Nowostawski, M., Purvis, M.K.: Structural tendencies in complex systems development and their implication for software systems. J. UCS 11(2), 327–356 (2005)
15. Kauffman, S.: At Home in the Universe: The Search for the Laws of Self-Organization and Complexity. Oxford University Press, Oxford (1996)
16. Kauffman, S., Peterson, C., Samuelsson, B., Troein, C.: Genetic networks with canalyzing boolean rules are always stable. Proc. Natl. Acad. Sci. 101(49), 17102–17107 (2004)
17. Kauffman, S.A.: Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theor. Biol. 22(3), 437–467 (1969)
18. Kauffman, S.A.: Requirements for evolvability in complex systems: orderly dynamics and frozen components. Physica D Nonlinear Phenom. 42(1–3), 135–152 (1990)
19. Kauffman, S.A., Strohman, R.C.: The Origins of Order: Self Organization And Selection in Evolution, vol. 454. Oxford University Press, New York (1994)
20. Luque, B., Ballesteros, F.J.: Random walk networks. Physica A Stat. Mech. Appl. 342(1–2), 207–213 (2004)
21. Serra, R., Villani, M., Graudenzi, A., Kauffman, S.: Why a simple model of genetic regulatory networks describes the distribution of avalanches in gene expression data. J. Theor. Biol. 246(3), 449–460 (2007)
22. Serra, R., Villani, M., Semeria, A.: Genetic network models and statistical properties of gene expression data in knock-out experiments. J. Theor. Biol. 227(1), 149–157 (2004)
23. Shmulevich, I., Kauffman, S.A., Aldana, M.: Eukaryotic cells are dynamically ordered or critical but not chaotic. Proc. Natl. Acad. Sci. 102(38), 13439–13444 (2005)
24. Solé, R.V., Luque, B., Kauffman, S.: Phase transition in random networks with multiple states. Technical report 00-02-011, Santa Fe Institute (2000). arXiv preprint adap-org/9907011
25. Turnbull, L., et al.: Connectivity and complex systems: learning from a multidisciplinary perspective. Appl. Net. Sci. 3(1), 1–49 (2018)
26. Uller, T., Moczek, A.P., Watson, R.A., Brakefield, P.M., Laland, K.N.: Developmental bias and evolution: a regulatory network perspective. Genetics 209(4), 949–966 (2018)
27. Villani, M., La Rocca, L., Kauffman, S.A., Serra, R.: Dynamical criticality in gene regulatory networks. Complexity 2018 (2018)
28. Wilkins, A.S.: A striking example of developmental bias in an evolutionary process: the “domestication syndrome”. Evol. Dev. 22(1–2), 143–153 (2020)

Information Seeking as an Evolutionary Game

Markus Brede
University of Southampton, Southampton, Hampshire SO17 1BJ, UK

Abstract. In this paper we present a game-theoretical model of rumour propagation on social networks. Agents face a choice between making investments at some cost to establish the truth about some underlying fact and copying the views of their network neighbours at no cost. Agents are also assumed to derive a benefit from knowledge about the truth. Considering rumour propagation at a fast time-scale and strategy adaptation at a slower time-scale, we present an analysis of the outcomes of the resulting evolutionary game. Depending on network structure and cost-benefit ratios, we find transitions between one- and two-cluster solutions, marked either by the existence of only one type of strategy or by the coexistence of two strategies with low and high investments. We establish that clustering in the social network typically suppresses the two-cluster solution, thus inhibiting the spread of high-investment strategies and leading to lower population-level awareness of the truth. Moreover, we also investigate the influence of free provision of additional high-quality information by stubborn agents. Counter-intuitively, we find that the presence of such agents encourages free riding, an effect that over-compensates the increased presence of higher-quality information and has overall detrimental effects on the population.

1 Introduction

With an increasing availability of the internet and rapid development of social media, the low costs of engagement have led to strongly increasing rates of change in collective attention [1]. Additionally, recent surveys have shown that more and more adults habitually source their news from social media [2]. However, platforms like Twitter or Facebook have allowed news stories and other content to be published with little fact-checking, such that individual users can in some cases reach as many readers as traditional news outlets [3]. On such media platforms, true and false news can both spread and reach large audiences, with false news potentially spreading faster than true news [5]. In fact, studies show that the spread of misinformation has affected human responses to natural disasters [6] and terrorist attacks [7]. Relevant in this context is also what has been termed the "news finds me perception", i.e. a tendency of social media users to believe that they can indirectly stay informed about current events from peer information without actively following the news [4]. The consequence of these developments is an increasingly fertile ground for the spread of "fake news" [3] and a lowering of individuals' standards for the sources from which their news comes [4]. Effects of such misinformation might increasingly lead to filter bubbles [8], echo chambers [9] and ideological polarisation [10].

In this paper, we are interested in a mathematical analysis of the incentives for spreading fake news. Our work thus builds on two branches of social physics: the spread and evolution of opinions in (networked) populations and evolutionary game theory; see, e.g., [11] and [12] for reviews. For this purpose, we characterise agents by strategies that capture a trade-off between sourcing information by copying from peers and sourcing high-quality information by costly verification. We then consider a voter-dynamics-like [13] spreading process of information at fast time-scales and a slower evolution of strategies via a mechanism of copying from better-performing neighbours, as in [14]. The mathematical treatment of this process follows a mean-field approach, similar to recent work on voting dynamics in the context of influence maximisation [15,16]. Additionally, in the latter part of the paper, we consider the introduction of extra sources of high-quality information, which is inspired by the consideration of stubborn agents or zealots in work related to opinion formation [17] or evolutionary game theory [18].

The remainder of the paper is organised as follows. In Sect. 2 we introduce the mathematical model of incentives for rumour propagation and detail aspects of the numerical treatment. Starting with a simple mean-field analysis followed by a comparison to results from numerical simulation, Sect. 3 then presents our main results. We conclude with a summary and discussion in Sect. 4.

© Crown 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 108–119, 2021. https://doi.org/10.1007/978-3-030-81854-8_10

2 Model Description

Consider a social network composed of $N$ agents whose connections are given by an adjacency matrix $A = (a_{ij})$, where $a_{ij} = 1$ if there is a link between $i$ and $j$ and $a_{ij} = 0$ otherwise. Agents in our model society are interested in finding out the truth about some external event about which rumours are propagated along the social network. For this purpose, an agent $i$ is further characterised by a strategy $p_i \in [0, 1]$, which models the probability with which the agent will spend effort to verify the truth. Alternatively, when it is its turn to update its beliefs, with probability $1 - p_i$ an agent will seek information from a randomly selected neighbour by just copying the current belief of that neighbour. We will also assume that agents in our model society have some propensity to forget about the truth, hence they constantly need to update their beliefs either by verification at some cost or by copying the beliefs of their social peers (which we assume not to be costly). In more detail, we consider the following updating process: (i) with probability $1 - q$ an agent will forget about the truth; alternatively, with probability $q$ the agent will update its beliefs. (ii) When updating beliefs, with probability $p_i$ the agent will spend some cost $c > 0$ to learn the truth; alternatively, with probability $1 - p_i$ the agent will update its belief by copying the belief of a randomly selected network neighbour (at no cost). We assume that steps (i) and (ii)

are iterated at a fairly fast timescale and agents will derive utility from knowledge of the truth (which bestows a benefit $b > 0$). Utilities are diminished by the potential cost $c$ agents expend on verifying the truth. Suppose $P_i$ gives the stationary probability that agent $i$ knows the truth. Agent $i$ is then supposed to derive utility $U_i = -q p_i c + P_i b$. As, in the following, only relative utilities are of interest, we set

$$U_i = -r p_i + P_i, \tag{1}$$

where $r = qc/b$ quantifies a cost-benefit ratio of knowing the truth relative to average costs of learning the truth. Note that the updating of agent probabilities to know the truth corresponds to a Markov process. Accordingly, the probability $P_i^{(t+1)}$ that agent $i$ knows the truth at time $t + 1$ can be obtained from

$$P_i^{(t+1)} = q p_i + q (1 - p_i) \frac{1}{k_i} \sum_j a_{ij} P_j^{(t)}, \tag{2}$$

where the first term represents learning the truth directly, the second term reflects copying of beliefs of randomly chosen neighbours, and $k_i = \sum_j a_{ij}$ is the degree of agent $i$. Stationary state probabilities can be obtained by solving the linear system

$$\left(I - \mathrm{diag}\left(q \frac{1 - p_i}{k_i}\right) A\right) P = q p, \tag{3}$$

where $\mathrm{diag}\left(q \frac{1 - p_i}{k_i}\right)$ is the diagonal matrix with entries $q \frac{1 - p_i}{k_i}$ along its diagonal. Note that the left-hand matrix in the equation is diagonally dominant. Solutions of the linear system can thus be conveniently obtained through Jacobi iteration. Further, seeding Jacobi iteration with stationary states from previous rounds of the evolutionary game described below allows for considerable speedup of the numerics of determining stationary states, enabling simulations to be performed on fairly large social networks.

On a longer timescale, we consider the evolution of strategies $p_i$ through an evolutionary game in which agents select random neighbours and imitate strategies of these neighbours depending on utility differentials, subject to some noise. More precisely, if agent $i$ with utility $U_i$ selects agent $j$ for strategy updating, agent $i$ will copy agent $j$'s strategy according to so-called Fermi pairwise updating, i.e. with probability $1/(1 + \exp(-(U_j - U_i)/K))$, where $K$ quantifies the noisiness of the updating process, such that for large $K$ agents have larger chances to imitate inferior strategies [14]. When copying a strategy from a neighbour, we also introduce a small amount of noise, such that agent $i$ will update its strategy $p_i \to p_j + \eta$, where $\eta$ is a small random number chosen uniformly from $[-p_{mut}, p_{mut}]$. The latter choice allows for the maintenance of diversity in the population and might model small errors in perception when imitating strategies. We then seed the initial population of agents by choosing the $p_i$ uniformly at random from $[0, 1]$ and then carry out parallel updates of agent strategies until a quasi-stationary state has been reached.
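
The two numerical ingredients of this section, the Jacobi iteration for the stationary probabilities and the Fermi imitation probability, can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the author's implementation: the function names, the adjacency-list representation, the tolerance and the overflow guard are all choices made here, and the network is assumed to have no self-loops and no isolated nodes.

```python
import math


def stationary_truth(neighbours, p, q, start=None, tol=1e-12, max_iter=10_000):
    """Jacobi iteration for Eq. (3).

    Each sweep applies P_i <- q p_i + q (1 - p_i) * (mean of P over i's neighbours),
    using only values from the previous sweep. `start` allows a warm start
    from an earlier stationary state.
    """
    n = len(p)
    P = list(start) if start is not None else [0.0] * n
    for _ in range(max_iter):
        new = [
            q * p[i]
            + q * (1.0 - p[i]) * sum(P[j] for j in neighbours[i]) / len(neighbours[i])
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, P)) < tol:
            return new
        P = new
    return P


def utilities(P, p, r):
    """Relative utilities of Eq. (1): U_i = -r p_i + P_i."""
    return [-r * p_i + P_i for p_i, P_i in zip(p, P)]


def fermi_probability(u_i, u_j, K):
    """Probability that agent i copies agent j under Fermi pairwise updating."""
    x = -(u_j - u_i) / K
    if x > 700.0:  # exp(x) would overflow a float; the probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


# Usage on a 6-node ring where every agent uses the same strategy.
ring = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
p = [0.5] * 6
P = stationary_truth(ring, p, q=0.9)
U = utilities(P, p, r=0.2)
print(P[0], U[0], fermi_probability(U[0], U[1], K=1e-4))
```

For a uniform strategy the iteration should settle on the same value for every agent, which gives a quick sanity check against the mean-field expression derived in the next section.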

3 Results

3.1 Mean-Field Solution

In this section we derive an approximate solution to equilibrium states of the evolutionary dynamics based on a mean-field approximation to Eq. (3). There, we assume that agents are typically exposed to neighbours with a mean probability $\langle P \rangle$ of knowing the truth. Averaging over Eq. (3), one obtains

$$\langle P \rangle = q \langle p \rangle + q (1 - \langle p \rangle) \langle P \rangle, \tag{4}$$

where $\langle \cdot \rangle$ stands for averages over the whole population. One easily obtains

$$\langle P \rangle = \frac{q \langle p \rangle}{1 - q (1 - \langle p \rangle)}. \tag{5}$$

To proceed, from Eq. (1) we can then obtain the utility of agent $i$ as

$$U_i = p_i \left(-r + q (1 - \langle P \rangle)\right) + q \langle P \rangle, \tag{6}$$

which gives an average utility of

$$\langle U \rangle = \langle p \rangle \left(-r + q (1 - \langle P \rangle)\right) + q \langle P \rangle. \tag{7}$$

Let us consider the noiseless case of an evolutionary dynamics with $p_{mut} = 0$ and $K \to 0$, in which an agent $i$ strictly copies a strategy of $j$ if $U_j > U_i$. In the latter case frequencies of strategies evolve according to the differences of their utilities relative to the average utility in the population and hence a stationary state for strategy $i$ is reached if $U_i = \langle U \rangle$. Combining Eqs. (6) and (7) and inserting the (approximate) result for $\langle P \rangle$ from Eq. (5), one obtains the equilibrium condition

$$\langle p \rangle = (1 - q) \left(\frac{1}{r} - \frac{1}{q}\right), \tag{8}$$

which also entails that the average probability of knowing the truth is given by $\langle P \rangle = q$ if $r < r_0$, $\langle P \rangle = 1 - r/q$ if $r_0 < r < r_1$, and $\langle P \rangle = 0$ otherwise, with $r_0 = q(1 - q)$ giving a lower bound when $\langle p \rangle(r) = 1$ and $r_1 = q$ an upper bound when $\langle p \rangle(r) = 0$. Equation (7) also allows us to determine a socially optimal utility, i.e. finding a strategy that each agent should follow to maximize the utility of the population. With some standard calculus one obtains

$$p^{soc} = 1 - \frac{1}{q} \left(1 - \sqrt{\frac{q (1 - q)}{r}}\right), \tag{9}$$

with $P^{soc} = 1 - \sqrt{\frac{1 - q}{q} r}$. However, note that the above only holds for a uniform population in which every agent follows the same strategy. Differences between the mean-field average probability of knowing the truth and the socially optimal one illustrate the dilemma nature of the truth-finding game, which tends to lead to either over- or under-investment into efforts of truth-finding of individual agents.
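
The closed forms of Eqs. (5), (8) and (9) are easy to evaluate numerically, which gives a quick way to compare the evolutionary equilibrium with the social optimum for a given pair of parameters. The following is a small sketch with hypothetical function names. Clipping the equilibrium strategy to [0, 1] follows the bounds r_0 and r_1 stated above; applying the same clipping to the socially optimal strategy is an assumption made here.

```python
import math


def mean_truth(p, q):
    """Eq. (5): mean probability of knowing the truth for a mean strategy p."""
    return q * p / (1.0 - q * (1.0 - p))


def equilibrium_strategy(r, q):
    """Eq. (8), clipped to [0, 1] (p = 1 for r < q(1 - q), p = 0 for r > q)."""
    p = (1.0 - q) * (1.0 / r - 1.0 / q)
    return min(1.0, max(0.0, p))


def social_optimum(r, q):
    """Eq. (9); the clipping to [0, 1] is an assumption of this sketch."""
    p = 1.0 - (1.0 - math.sqrt(q * (1.0 - q) / r)) / q
    return min(1.0, max(0.0, p))


def utility(p, r, q):
    """Utility of a uniform population, U = -r p + P, with P from Eq. (5)."""
    return -r * p + mean_truth(p, q)


q, r = 0.9, 0.2
p_eq = equilibrium_strategy(r, q)
p_soc = social_optimum(r, q)
print(p_eq, mean_truth(p_eq, q), utility(p_eq, r, q))
print(p_soc, mean_truth(p_soc, q), utility(p_soc, r, q))
```

For these example values the equilibrium strategy lies below the socially optimal one, so the population under-invests in verification; other parameter choices can be explored by changing `q` and `r`.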

3.2 Mean-Field Solution with Additional Sources of Truth

Consider a scenario in which an external agent attempts to improve the quality of information in the artificial society by providing accurate information. The external agent does so by introducing a correct source of information and attempting to make it available to agents in the society. Formally, this scenario can be described by the addition of an external agent who is always in the state $P = 1$ and forms a certain number $k_e$ of additional unidirectional links to every agent in the society, such that they can copy correct information. We describe the intensity or influence of this external agent by the fraction of total links it provides, i.e. $\rho = k_e/(k + k_e)$. Note that, as this agent has a fixed strategy, such an external agent can be equated to a stubborn agent or zealot [17,18]. However, since the agent is not available for strategy comparisons, our description differs from considerations of zealots in evolutionary games as in [18]. Including true sources of information at intensity $\rho$, Eq. (4) has to be modified to

$$\langle P \rangle = q \langle p \rangle + q (1 - \langle p \rangle) \left((1 - \rho) \langle P \rangle + \rho\right), \tag{10}$$

where the first term in brackets on the right-hand side reflects copying from peers and the second describes copying from the external agent, who has a share $\rho$ of the links to a typical agent. From the above, we can easily find an expression for $\langle P \rangle$, i.e.

$$\langle P \rangle = q \frac{\langle p \rangle (1 - \rho) + \rho}{1 - q (1 - \langle p \rangle)(1 - \rho)}. \tag{11}$$

Proceeding as in the previous section, we can again compute average utilities and again find an equilibrium condition for strategy $i$, if $U_i = \langle U \rangle$. After some algebraic manipulation, one then obtains

$$\langle p \rangle = (1 - q) \left(\frac{1}{r} - \frac{1}{q (1 - \rho)}\right) - \frac{\rho}{1 - \rho}, \tag{12}$$

with a lower floor of $\langle p \rangle = 0$. In a similar vein to the previous section, we also find that $\langle P \rangle = 1 - \frac{r}{q} \frac{1}{1 - \rho}$. Thus, interestingly, we see that the introduction of additional sources of easily available true information tends to have a negative effect on the overall state of the artificial society.
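
The effect of the external source can be checked numerically by evaluating Eqs. (11) and (12) for increasing intensity. The sketch below uses hypothetical function names; the lower floor at 0 is taken from the text, while the upper cap at 1 is an assumption added here for symmetry with the previous section.

```python
def mean_truth_with_source(p, q, rho):
    """Eq. (11): mean probability of knowing the truth with an external source."""
    return q * (p * (1.0 - rho) + rho) / (1.0 - q * (1.0 - p) * (1.0 - rho))


def equilibrium_strategy_with_source(r, q, rho):
    """Eq. (12) with the floor at 0 from the text and an assumed cap at 1."""
    p = (1.0 - q) * (1.0 / r - 1.0 / (q * (1.0 - rho))) - rho / (1.0 - rho)
    return min(1.0, max(0.0, p))


q, r = 0.9, 0.2
awareness = {}
for rho in (0.0, 0.2, 0.5):
    p_eq = equilibrium_strategy_with_source(r, q, rho)
    awareness[rho] = mean_truth_with_source(p_eq, q, rho)
    print(rho, p_eq, awareness[rho])
```

For these example parameters, a moderate intensity lowers the mean awareness relative to the case without the source, whereas at the largest intensity the equilibrium strategy has hit its floor of zero and awareness rises again.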
Free information gives additional incentives to copy neighbour states, and in the regime $p > 0$ the resultant decrease in agents' incentives to find true information at a cost over-compensates the additional feed-in of correct information. The effect, however, reverses once the floor of $p = 0$ is reached and compensation by strategy adjustment becomes impossible.

3.3 Comparison to Results from Numerical Simulation

In this section we describe results obtained from numerical simulations of the evolutionary dynamics of the model described in Sect. 2 on various types of complex networks. To investigate the influence of network topology, we have explored a range of typical network structures. These include ring graphs and 2d lattices with von Neumann neighbourhoods and periodic boundary conditions (to explore the influence of spatial embedding and dimensionality), and regular random graphs and Barabási-Albert-type scale-free networks (to explore the influence of degree heterogeneity). In Subsect. 3.4 we further carry out experiments on a variant of small-world networks to disentangle effects of clustering and typical path lengths. Unless otherwise stated, experiments are carried out on networks of size $N = 10^4$ with average degree $k = 4$ using $p_{mut} = 10^{-3}$ and $K = 10^{-4}$, and simulations are run for $T = 10^5$ iterations.

Information Seeking as an Evolutionary Game

Fig. 1. Evolution of average strategies $p$ and probability of awareness of truth $P$ for a regular random network with parameters $K = 10^{-3}$, $q = 0.9$ and $r = 0.2$.

Fig. 2. Dependence of the average equilibrium strategy $p$ (top row) and of the average probability to be aware of the truth $P$ (bottom row) on the cost-benefit ratio $r$ for $q = 0.9$, $q = 0.5$, and $q = 0.2$ (from left to right) for regular random graphs (RRG), Barabási-Albert type scale-free networks (SFNW), a ring graph (RING) and a 2d lattice (LAT). For comparison, the mean-field solution and the socially optimal solution from Subsect. 3.1 are also shown as solid lines. Error bars are about the size of the symbols.
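Two of the topologies listed above are simple enough to build by hand. The following sketch constructs them as adjacency lists using only the standard library. It assumes that the ring graph with average degree $k = 4$ links each node to its two nearest neighbours on either side, which the text does not state explicitly; the function names are hypothetical.

```python
def ring_graph(n, k=4):
    """Ring in which each node links to its k/2 nearest neighbours on either side."""
    half = k // 2
    return {
        i: sorted({(i + d) % n for d in range(-half, half + 1) if d != 0})
        for i in range(n)
    }


def lattice_2d(side):
    """side x side square lattice, von Neumann neighbourhood, periodic boundaries."""
    adj = {}
    for x in range(side):
        for y in range(side):
            adj[x * side + y] = [
                ((x + 1) % side) * side + y,  # neighbour below
                ((x - 1) % side) * side + y,  # neighbour above
                x * side + (y + 1) % side,    # neighbour to the right
                x * side + (y - 1) % side,    # neighbour to the left
            ]
    return adj


# Network size used in the paper: N = 10^4 nodes, degree 4.
ring = ring_graph(10**4)
lattice = lattice_2d(100)
print(ring[0], lattice[0])
```

Both graphs are regular with degree 4, so any difference in outcomes between them reflects spatial structure rather than degree.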

Fig. 3. (left) Dependence of the standard deviation $\sigma$ of the distribution of strategies on the cost-benefit ratio $r$ for different amounts of noise $K$ in strategy updating. Simulations have been carried out on a regular random graph for $q = 0.9$. Error bars are about the size of the symbols. For each choice of $K$, we observe a transition from a low standard deviation regime to a high standard deviation regime at some value of $r$. (middle) Distribution of (normalized) strategies for $q = 0.9$ and $K = 0.001$ for different $r$. (right) Comparison of the dependence of $\sigma$ on $r$ for $K = 0.003$ and $q = 0.9$ for different types of complex networks.

Figure 1 illustrates the dynamics observed in a typical evolutionary simulation seeded with $p_i = 0$, $i = 1, ..., N$. We observe that after some transient the system settles into a quasi-stationary state with small fluctuations around it caused by the noise in the evolutionary dynamics. We then measure stationary outcomes by averaging quantities over the last $T/2$ iterations after discarding the first $T/2$ as a transient.

The panels in Fig. 2 illustrate typical equilibrium results on different networks for various parameter settings. In the top row, we find the dependence of the average strategy on the cost-benefit ratio for different networks and different settings of the updating probability $q$. Results are generally in very good agreement with the mean-field estimate of Eq. (8), but we notice systematic deviations for larger $r$, which are particularly noticeable for large $q$. In these cases, the mean-field approximation tends to under-estimate the equilibrium probability. We make two further observations. First, the panels of the top row also compare equilibrium strategies to the socially optimal strategy calculated in Subsect. 3.1 and, as expected, the evolved strategies systematically show an under-investment in independent exploration of the truth. Second, inspection of results for different networks points to nearly identical behaviour independent of network topology. Very close inspection, however, indicates that simulation results obtained for ring graphs systematically differ from those found for the other classes of networks.

Finally, in the bottom row of Fig. 2, we see simulation results for the dependence of the average probability to be aware of the truth $P$ on the cost-benefit ratio, again evaluated for different networks and for the settings of $q$ that correspond to the results shown in the top row of the figure. We now note stronger deviations from mean-field results, which are again particularly apparent for large $q$, while mean-field results for low $q$ still match the simulation data very well.
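The measurement protocol described above (discard the first $T/2$ iterations, average over the last $T/2$) amounts to a one-line helper. The sketch below is only an illustration; the function name and the toy trajectory are hypothetical and do not come from the paper.

```python
from statistics import fmean


def stationary_average(series):
    """Mean over the second half of a time series; the first half is the transient."""
    half = len(series) // 2
    return fmean(series[half:])


# Hypothetical toy trajectory that relaxes from 0 towards 0.4.
trajectory = [0.4 * (1.0 - 0.5 ** t) for t in range(40)]
print(stationary_average(trajectory))
```

Applied to recorded time series of the average strategy and the average awareness, this yields one stationary estimate per run.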


To proceed, we explore the reasons for the deviations from mean-field results in some parameter regimes. For this purpose, we have investigated the distributions of strategies in the stationary state. Depending mainly on the cost-benefit ratio $r$, the choice of the updating probability $q$, and the selection intensity $K$, we generally find two types of regimes: (i) a regime marked by a tightly peaked monomodal distribution of strategies and (ii) a regime in which the distribution of strategies is bimodal (see the middle panel of Fig. 3). The two regimes can be distinguished by the standard deviation $\sigma$ of the distribution of strategies, which is relatively small in the first case and significantly larger in the second. Accordingly, in Fig. 3 (left) we show the dependence of $\sigma$ on $r$ for data obtained for regular random graphs. We see that, depending on $K$, a strong increase in $\sigma$ occurs at some value of $r$, which demarcates a region of low $\sigma$ for low $r$ from a region of high $\sigma$ for large $r$. For the curve for $K = 0.001$ the corresponding distributions of strategies are shown in the middle panel. The distribution of strategies starts off with one sharply focused peak for low $r$, then gradually broadens as $r$ is increased, until two distinct peaks can be distinguished for large $r$. The observed scenario corresponds to a continuous transition between a region in parameter space distinguished by one homogeneous group of strategies and another region in which two groups of low-$p$ and high-$p$ strategies can coexist. Not surprisingly, closer comparison for the scenarios illustrated in Fig. 2 also shows that discrepancies between numerical results and mean-field results become more prominent the farther one moves into the two-cluster region. In this two-cluster region, one finds assortative mixing of strategies, i.e. nodes following low-$p$ strategies tend to be neighbours of nodes with similar strategies, and vice versa for high-$p$ strategies.
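The two diagnostics used in this paragraph, the spread $\sigma$ of the strategy distribution and the tendency of similar strategies to be neighbours, can be sketched as follows. This is a minimal illustration and not the author's analysis code: `adj` (adjacency lists) and `p` (strategy per node) are hypothetical inputs, and measuring assortative mixing by the Pearson correlation of strategies across links is one common choice that the text does not specify.

```python
from statistics import fmean, pstdev


def strategy_spread(p):
    """Population standard deviation sigma of the strategies p_i."""
    return pstdev(list(p.values()))


def strategy_assortativity(adj, p):
    """Pearson correlation of (p_i, p_j) over all directed neighbour pairs.

    Undefined (division by zero) if all strategies are identical.
    """
    xs = [p[i] for i in adj for j in adj[i]]
    ys = [p[j] for i in adj for j in adj[i]]
    mx, my = fmean(xs), fmean(ys)
    cov = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    var_x = fmean([(x - mx) ** 2 for x in xs])
    var_y = fmean([(y - my) ** 2 for y in ys])
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy example: a ring of 8 nodes split into a low-p and a high-p block.
adj = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
p = {i: 0.1 if i < 4 else 0.9 for i in range(8)}
print(strategy_spread(p), strategy_assortativity(adj, p))
```

A clearly positive correlation together with a large $\sigma$ would indicate the two-cluster regime, in which well-mixed approximations are not expected to hold.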
Thus, the mixing assumptions of the simple mean-field description are violated and no close correspondence of results can be expected. Interestingly, even though, apart from some noticeable distinction for ring graphs, both the $p(r)$ and $P(r)$ dependencies in Fig. 2 show hardly any difference between the other investigated types of complex networks, subtle differences are revealed when inspecting the $\sigma(r)$ curves in Fig. 3 (right). For scale-free networks, regular random networks, and 2d lattices, we observe the above-described transition between one-cluster and two-cluster states. However, the onset of the two-cluster phase is found at slightly different cost-benefit ratios for different networks, occurring at the lowest $r$ for random networks, at slightly larger $r$ for scale-free networks, and at substantially larger $r$ for the lattices. In contrast, for the ring graphs no two-cluster solutions are found at all and variances are found to decrease monotonically with $r$. We have verified this observation in other settings for different choices of $K$ and $q$, in none of which two-cluster solutions were found for ring graphs. A closer investigation of this will be presented in Subsect. 3.4 below.

Last in this subsection, we also present results from evolutionary simulations for the scenario in which additional sources of accurate information have been provided to the artificial society as described in Subsect. 3.2 and compare to the mean-field solution obtained there. Simulation data shown in Fig. 4 (left and middle panels) give the dependence of the averages of $p$ and $P$ on the intensity $\rho$ at which the high-quality information is provided. The right panel of Fig. 4 adds information on the dependence of $\sigma$ on $\rho$. As already argued in Subsect. 3.2, we see that providing additional high-quality information tends to result in overall reduced awareness of the truth. We observe a strong decline of average strategies $p$ with $\rho$, i.e. providing high-quality information tends to encourage free-riding and incentivises agents to rely more on copying information from their network neighbours rather than investigating the truth themselves at a cost. The consequent reduction in effort to explore the truth then overcompensates the effects of the additional information (cf. Fig. 4, middle panel). However, note that at some value of $\rho$ the floor of $p = 0$ is reached, and from that point onward further increasing $\rho$ will naturally lead to increases in $P$. In Fig. 4 (right) we also see that the provision of additional information tends to lead to a much larger dispersion of strategies and tends to encourage two-cluster solutions for larger $\rho$ (note that the standard deviations found in Fig. 4 are much larger than those found in other scenarios, such as in Fig. 3). Hence, a deteriorating match of numerical results to mean-field results is expected for increasing $\rho$, which is evident in the comparisons presented in the first two panels of Fig. 4.

Fig. 4. Dependence of (left) the average equilibrium strategy $p$, (middle) the average probability of an agent to know the truth $P$ (measured in units of $q$), and (right) the standard deviation $\sigma$ of the stationary distribution of strategies $p$ on the relative amount of free high-quality information $\rho$ for some choices of the cost-benefit ratio $r$ and the probability of updating $q$ ($q = 0.2$, $r = 0.163$; $q = 0.5$, $r = 0.26$; $q = 0.9$, $r = 0.1$). Symbols correspond to simulation results obtained on a regular random graph. Error bars are approximately the same size as the symbols. Solid lines correspond to the mean-field estimates derived in Subsect. 3.2.

3.4 The Effects of Clustering

In Subsect. 3.3 we have noted that the dependence of average strategies on $r$ is close to mean-field expectations for all investigated types of networks, with some differences between networks being observed for larger $r$. We also saw that differences between most investigated types of networks were relatively small, the most notable being those between ring graphs and the other network types. In this section we explore in more depth which network characteristic is the main determinant of these deviations.

Fig. 5. Dependence of (left) the relative average strategy $p$ and (middle) the relative average probability to know the truth $P$ on the clustering coefficient $\alpha$ of the regular graph. The right-hand panel shows the dependence of the average strategy on the average shortest path length $d_{avg}$ of the graph. Results are obtained from simulations with $q = 0.2$ for the two leftmost panels and from networks of sizes ranging from $10^3$ to $10^5$ for the right-hand panel. Error bars are about the size of a symbol.

For this purpose, we modify the ring graph to construct small-world type networks [19]. Somewhat differently from the procedure introduced by Watts and Strogatz, we do this by randomly picking two pairs of connected nodes and swapping links between them, which guarantees that the degree distribution of the modified graph remains regular. On the other hand, the random rewiring introduces long-distance links that dramatically shrink the average shortest path length and eventually destroy all clustering. We have carried out evolutionary simulations on such networks and recorded equilibrium strategies for networks with tuneable fractions of rewired links. Accordingly, in the first and second panels of Fig. 5 we show the dependence of equilibrium strategies $p$ and equilibrium probabilities to be aware of the truth $P$ on the clustering coefficient $\alpha$. To showcase effects for different parameter settings, all results have been normalised by the value obtained on a regular random graph where $\alpha \approx 0$ (obtained when a very large fraction of links has been rewired). Whilst results are essentially independent of network topology for small $r$, one notes a marked dependence on $\alpha$ for large cost-benefit ratio $r$.

A question remains whether the observed decrease in $p$ is mainly caused by shrinkage in average shortest path lengths or is mainly a result of changes in clustering resulting from rewiring. To probe this question, we have carried out experiments on ring graphs of different sizes, ranging from $N = 10^3$ to $N = 10^5$ nodes. In the last panel of Fig. 5 we see that the resultant average strategies are essentially independent of network size and average shortest path length. This indicates that the decline in $p$ is caused by clustering: even though the effect is relatively small, local cohesiveness and community structure in a network can serve as an impediment to the persistence of high-quality information in our artificial society.
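The rewiring procedure described above can be sketched as a degree-preserving double-edge swap. The code below is an illustration under stated assumptions, not the author's implementation: the ring is taken to link each node to its two nearest neighbours on either side, swaps that would create self-loops or duplicate links are simply rejected, and $\alpha$ is computed as the average local clustering coefficient. All function names are hypothetical.

```python
import random


def ring_graph(n, k=4):
    """Ring in which each node links to its k/2 nearest neighbours on either side."""
    half = k // 2
    return {i: {(i + d) % n for d in range(-half, half + 1) if d != 0} for i in range(n)}


def rewire(adj, n_swaps, rng):
    """Replace links (a, b), (c, d) by (a, d), (c, b); node degrees are unchanged.

    Loops until n_swaps swaps succeed, so it assumes valid swaps exist.
    """
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    done = 0
    while done < n_swaps:
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second link
        if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
            continue  # would create a self-loop or a duplicate link
        adj[a].remove(b); adj[b].remove(a)
        adj[c].remove(d); adj[d].remove(c)
        adj[a].add(d); adj[d].add(a)
        adj[c].add(b); adj[b].add(c)
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return adj


def clustering(adj):
    """Average local clustering coefficient of the graph."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


graph = ring_graph(1000)
print("ring:", clustering(graph))
rewire(graph, 5000, random.Random(0))
print("after rewiring:", clustering(graph))
```

Varying `n_swaps` tunes the fraction of rewired links and hence the clustering, while every node keeps degree 4.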

4 Summary and Conclusions

In this paper we have presented a game-theoretical model that analyses decisions of agents to source information in an artificial society connected by a social network. Considering that sourcing accurate information independently comes at a cost, whereas information from peers might be error-prone but can be sourced for free, we have analysed an evolutionary dynamics in which agents can adapt their strategies to source information. Equilibrium outcomes of the game reveal an underlying dilemma in information sourcing: agents tend to free-ride on information provided by peers and under-invest in sourcing independent information, which promotes the spread of misinformation.

From a more theoretical point of view, the presented game is somewhat different from typically studied evolutionary games on graphs. Whereas in the prototypical situation often studied when analysing the evolution of cooperation payoffs of agents are only functions of the strategies of their direct network neighbours, in the presented game information sourcing introduces a coupling of payoffs of agents to all other agents in the network. In our formulation of the game this coupling is linked to the updating probability $q$ (or forgetting probability $1 - q$), which determines a typical chain length along which information can be propagated. For low $q$, agent payoffs mainly depend on strategies of direct neighbours, whereas in the limit of $q \to 1$ strategies of all other agents in the network may play a role.

Analysing the above game, we have provided a simple mean-field solution for the information spreading process which also allows analysis of equilibrium outcomes of the game. These solutions have been compared to results from numerical simulation and are in good agreement in phases dominated by unimodal distributions of strategies. We also find a crossover to phases in which two strategies, characterised by low and high investment in individual exploration of the truth, coexist. In the latter phases marked deviations from the simple mean-field analysis are found.

Generally, as one would expect, we find that the overall awareness of the truth in the agent population improves when information sourcing is cheap relative to the benefits of being aware of the correct information and declines when costs are increased through the control parameter $r$. We also note that aggregate order parameters, like the average strategy $p$ and the average awareness of the truth $P$, are relatively independent of network structure, such that degree heterogeneity or average shortest path lengths are only minor determinants. However, we also find that clustering can have a detrimental influence on awareness of the truth. The observed effect seems reminiscent of the effect of "echo chambers", which have been found to increase polarisation in other studies [9].

References

1. Lorenz-Spreen, P., Monsted, P., Hövel, P., Lehmann, S.: Accelerating dynamics of collective attention. Nat. Commun. 10, 1759 (2019)
2. Gottfried, J., Shearer, E.: News use across social media platforms 2016. Pew research centre (2016). http://www.journalism.org
3. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. J. Econ. Perspect. 31, 211–236 (2017)
4. Gil de Zúñiga, H., Weeks, B., Ardèvol-Abreu, A.: Effects of the news-finds-me perception in communication: social media use implications for news seeking and learning about politics. J. Comput. Mediat. Commun. 22, 105–123 (2017)
5. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359, 1146–1151 (2018)
6. Mendoza, M., Poblete, B., Castillo, C.: Twitter under crisis: can we trust what we RT? In: Proceedings of the First Workshop on Social Media Analytics, pp. 71–89. ACM (2010)
7. Starbird, K., Maddock, J., Orand, M., Achterman, P., Mason, R.M.: Rumors, false flags, and digital vigilantes: misinformation on Twitter after the 2013 Boston marathon bombing. In: iConference 2014 Proceedings, pp. 654–662. iSchools (2014)
8. Spohr, D.: Fake news and ideological polarization: filter bubbles and selective exposure on social media. Bus. Inf. Rev. 34, 150–160 (2017)
9. Baumann, F., Lorenz-Spreen, P., Sokolov, I.M., Starnini, M.: Modeling echo chambers and polarization dynamics in social networks. Phys. Rev. Lett. 124, 048301 (2020)
10. Dylko, I., Dolgov, I., Hoffman, W.: Comput. Hum. Behav. 73, 181–190 (2017)
11. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Rev. Mod. Phys. 81, 591–646 (2009)
12. Szabó, G., Fath, G.: Evolutionary games on graphs. Phys. Rep. 446, 97–216 (2007)
13. Holley, R., Liggett, T.: Ergodic theorems for weakly interacting infinite systems and the voter model. Ann. Probab. 3, 643–663 (1975)
14. Szabó, G., Töke, S.: Evolutionary prisoner's dilemma game on a square lattice. Phys. Rev. E 58, 69 (1998)
15. Masuda, N.: Opinion control in complex networks. New J. Phys. 17, 033031 (2015)
16. Brede, M., Restocchi, V., Stein, S.: Resisting influence: how the strength of predispositions to resist control can change strategies for optimal opinion control in the voter model. Front. Robot. AI 5, 34 (2018)
17. Mobilia, M.: Does a single zealot affect an infinite group of voters? Phys. Rev. Lett. 91, 028701 (2003)
18. Masuda, N.: Evolution of cooperation driven by zealots. Sci. Rep. 2, 646 (2012)
19. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393, 440–442 (1998)

How Correlated Are Community-Aware and Classical Centrality Measures in Complex Networks?

Stephany Rajeh, Marinette Savonnet, Eric Leclercq, and Hocine Cherifi

Laboratoire d'Informatique de Bourgogne, University of Burgundy, Dijon, France
[email protected]

Abstract. Unlike classical centrality measures, recently developed community-aware centrality measures use a network's community structure to identify influential nodes in complex networks. This paper investigates their relationship on a set of fifty real-world networks originating from various domains. Results show that classical and community-aware centrality measures generally exhibit low to medium correlation values. These results are consistent across networks. Transitivity and efficiency are the most influential macroscopic network features driving the correlation variation between classical and community-aware centrality measures. Additionally, the mixing parameter, the modularity, and the MaxODF are the main mesoscopic topological properties exerting the most substantial effect.

Keywords: Centrality · Influential nodes · Community structure

1 Introduction

Identifying influential nodes is crucial for accelerating or mitigating propagation processes in complex networks. To this end, numerous classical centrality measures relying on various topological properties have been proposed. One can distinguish two main categories: local and global measures [9]. Local metrics use information in the node's neighborhood, while global ones gather information from the whole network. Note that some works combine local and global information [5]. Another set of centrality measures uses information on the community structure to quantify the influence of the nodes. In this paper, we refer to them as "community-aware" centrality measures. Unlike classical centrality measures, community-aware centrality measures distinguish intra-community links from inter-community links. Intra-community links join nodes from the same community. They are related to the node's local influence inside its community. Inter-community links join nodes belonging to different communities. Therefore, they quantify the node's impact at the global level.

Community-aware centrality measures differ in how they integrate the intra-community and inter-community links. Community Hub-Bridge, proposed by [2], selects hubs within large communities and bridges simultaneously. Comm Centrality [4] combines the intra-community and inter-community links of a node by prioritizing the latter. Community-based Centrality [22] weights an intra-community link by its community size and an inter-community link by the size of the communities it joins. K-shell with Community [10] is based on a linear combination of the k-shell values of a node computed separately on the network of intra-community links and on the network of inter-community links. Participation Coefficient [3] and Community-based Mediator [21] tend to select important nodes based on the heterogeneity of their intra-community and inter-community links. The Participation Coefficient of a node decreases if it does not participate in any community other than its own. Community-based Mediator reduces to the normalized degree centrality if the proportions of intra-community and inter-community links of a node are equal. Modularity Vitality [11] is a signed community-aware centrality measure. It is based on the modularity variation when removing a node from the network. Since bridges connect different communities, their presence decreases modularity. Therefore, nodes with negative Modularity Vitality values are bridges. In contrast, since hubs tend to increase a network's modularity, nodes with positive Modularity Vitality values are local hubs.

Many studies are devoted to the interactions between classical centrality measures [8,12,16,18,20]. However, the relationship between classical and community-aware centrality measures is almost unexplored [17]. Our goal in this paper is to gain a better understanding of this issue. In other words, we intend to answer the following questions: 1) What is the relationship between classical and community-aware centrality measures? 2) What is the influence of the macroscopic and mesoscopic topological properties on their relationship? The paper is organized as follows. First, the classical and community-aware centrality measures are introduced. In the subsequent two sections, the analyses of the correlation and the network topology are presented. Finally, the conclusion is given.

Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-030-81854-8_11) contains supplementary material, which is available to authorized users.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, pp. 120–132, 2021. https://doi.org/10.1007/978-3-030-81854-8_11

2 Classical and Community-Aware Centrality Measures

This study investigates ten classical centrality measures, of which five are local (Degree, Leverage, Laplacian, Diffusion Degree, and Maximum Neighborhood Component) and five are global (Betweenness, Closeness, Katz, PageRank, and Subgraph). Table 1 reports their definitions. They are compared with the seven community-aware measures introduced earlier and described in Table 2. Table 3 lists the fifty real-world networks used in the experiments. They come from various domains (animal, biological, collaboration, online/offline social networks, infrastructural, and miscellaneous). Since the community structure is sensitive to the community detection algorithm, Louvain and Infomap [13] are used to extract intra-community and inter-community links. Due to space constraints, the networks' topological characteristics and the results based on Louvain are provided in the supplementary materials (footnote 1). Furthermore, as there are no fundamental differences, we restrict our attention to the results based on the community structure revealed by Infomap.

Fig. 1. Distribution of Kendall's Tau correlation between classical and community-aware centrality measures for each network. Colors represent the network's domain. Animal networks are green. Biological networks are pink. Collaboration networks are blue. Offline social networks are violet. Infrastructural networks are grey. Actor networks are yellow. Miscellaneous networks are brown. Online social networks are orange.

3 Correlation Analysis

The first investigation concerns how classical and community-aware centrality measures correlate for a given network. For each of the fifty networks, Kendall's Tau correlation is computed for all possible combinations of the ten classical ($\alpha_i$) and seven community-aware ($\beta_j$) centrality measures. Figure 1 shows the distributions of the correlation values for each network. There is no

Footnote 1: https://github.com/StephanyRajeh/MixedCommunityAwareCentralityAnalysis


Table 1. Definitions of classical centrality measures ($\alpha(i)$). $a_{i,j}$ denotes the connectivity of node $i$ to node $j$ in the adjacency matrix $A$. $N$ is the total number of nodes. $k_i$ and $k_j$ are the degrees of nodes $i$ and $j$, respectively. $N_1(i)$ is the set of direct neighbors of node $i$. $\epsilon_i$ and $\epsilon_j$ are the propagation probabilities of nodes $i$ and $j$, respectively ($\epsilon$ is set to 1 for all nodes in this study). $\sigma(s, t)$ is the number of shortest paths between nodes $s$ and $t$, and $\sigma_i(s, t)$ is the number of shortest paths between nodes $s$ and $t$ that pass through node $i$. $d(i, j)$ is the shortest-path distance between nodes $i$ and $j$. $a^p_{ij}$ is the connectivity of node $i$ with respect to all the other nodes at a given order $p$ of the adjacency matrix $A^p$. $s^p$ is the attenuation factor, where $s \in [0,1]$. $\alpha_p(i)$ and $\alpha_p(j)$ are the PageRank centralities of node $i$ and node $j$, respectively. $d$ is the damping parameter (set to 0.85 in this study). $v_j$ refers to an eigenvector of the adjacency matrix $A$, associated with its eigenvalue $\lambda_j$.

Centrality measure and description | Definition
Degree: based on the total sum of the neighbors of a node | $\alpha_d(i) = \sum_{j=1}^{N} a_{ij}$
Leverage: a signed centrality based on the quantity of connections compared to its neighbors | $\alpha_{lev}(i) = \frac{1}{k_i} \sum_{j=1}^{N} \frac{k_i - k_j}{k_i + k_j}$
Laplacian: based on how much damage a node causes in the network after its removal | $\alpha_{lap}(i) = k_i^2 + k_i + 2 \sum_{j \in N_1(i)} k_j$
Diffusion: based on the diffusive power of a node and that of its neighbors weighted by their propagation probabilities | $\alpha_{dif}(i) = \epsilon_i \times \alpha_d(i) + \sum_{j \in N_1(i)} \epsilon_j \times \alpha_d(j)$
Maximum Neighborhood Component: based on the node's largest connected component (LCC) size established by its neighborhood | $\alpha_m(i) = |LCC \in N_1(i)|$
Betweenness: based on the number of shortest paths between two other nodes that a node falls on | $\alpha_b(i) = \sum_{s,t \neq i} \frac{\sigma_i(s,t)}{\sigma(s,t)}$
Closeness: based on how close, on average, a node is to all other nodes in the network | $\alpha_c(i) = \frac{N - 1}{\sum_{j=1}^{N-1} d(i,j)}$
Katz: based on the quantity, quality, and the subsequent distances of other nodes connected to a specific node | $\alpha_k(i) = \sum_{p=1}^{\infty} \sum_{j=1}^{N} s^p a^p_{ij}$
PageRank: based on the quantity and quality of nodes connected to a specific node under a Markov chain process | $\alpha_p(i) = \frac{1 - d}{N} + d \sum_{j \in N_1(i)} \frac{\alpha_p(j)}{k_j}$
Subgraph: based on a node's participation in closed walks, with paths starting and ending with the same node | $\alpha_s(i) = \sum_{j=1}^{N} (v_j^i)^2 e^{\lambda_j}$
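Three of the local measures in Table 1 follow directly from node degrees, which makes them easy to sketch in code. The snippet below is a minimal illustration of the Degree, Laplacian, and Diffusion formulas for an unweighted, undirected graph stored as adjacency lists; the function names and the toy star graph are hypothetical, and the propagation probability is passed as a single value `prob` (set to 1, as in the study).

```python
def degree(adj, i):
    """Degree centrality: number of direct neighbors of node i."""
    return len(adj[i])


def laplacian(adj, i):
    """Laplacian centrality: k_i^2 + k_i + 2 * (sum of the neighbors' degrees)."""
    k = len(adj[i])
    return k * k + k + 2 * sum(len(adj[j]) for j in adj[i])


def diffusion(adj, i, prob=1.0):
    """Diffusion degree: prob * k_i + sum over neighbors j of prob * k_j."""
    return prob * len(adj[i]) + sum(prob * len(adj[j]) for j in adj[i])


# Hypothetical toy graph: a star with centre 0 and leaves 1, 2, 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
for node in star:
    print(node, degree(star, node), laplacian(star, node), diffusion(star, node))
```

The global measures of Table 1 require shortest paths or spectral quantities and are better taken from an established graph library.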


Table 2. Definitions of community-aware centrality measures ($\beta(i)$). $c_k$ is the $k$-th community. $k_i^{intra}$ and $k_i^{inter}$ represent the intra-community and inter-community links of a node. $N_c$ is the total number of communities. $k_{i,c}$ is the number of links node $i$ has in a given community $c$. $k_i$ is the total degree of node $i$. $N$ is the total number of nodes. $H_i = [-\sum \rho_i^{intra} \log(\rho_i^{intra})] + [-\sum \rho_i^{inter} \log(\rho_i^{inter})]$ is the entropy of node $i$ based on its $\rho^{intra}$ and $\rho^{inter}$, which represent the density of the communities a node links to. $\chi = \frac{k_i^{intra}}{\max_{(j \in c)} k_j^{intra}} \times R$ and $\varphi = \frac{k_i^{inter}}{\max_{(j \in c)} k_j^{inter}} \times R$. $\mu_{c_k}$ is the proportion of inter-community links over the total community links in community $c_k$. $R$ is a constant to scale intra-community and inter-community values to the same range. $M$ is the modularity of a network and $M(G_i)$ is the modularity of the network after the removal of node $i$. $n_c$ is the number of nodes in community $c$. $\beta^{intra}(i)$ and $\beta^{inter}(i)$ represent the k-shell value of node $i$ by only considering intra-community links and inter-community links, respectively. $\delta$ is set to 0.5 in this study.

Centrality measure and description | Definition
Community Hub-Bridge [2]: based on weighting the intra-community links by the node's community size and the inter-community links by the node's number of neighboring communities | $\beta_{CHB}(i) = |c_k| \times k_i^{intra} + |NNC_i| \times k_i^{inter}$
Participation Coefficient [3]: based on the heterogeneity of a node's links, where the more external links a node has, the higher its centrality | $\beta_{PC}(i) = 1 - \sum_{c=1}^{N_c} \left(\frac{k_{i,c}}{k_i}\right)^2$
Community-based Mediator [21]: based on the entropy of a node's intra-community and inter-community links | $\beta_{CBM}(i) = H_i \times \frac{k_i}{\sum_{i=1}^{N} k_i}$
Comm Centrality [4]: based on weighting the intra-community and inter-community links by the proportion of external links and prioritizes bridges | $\beta_{Comm}(i) = (1 + \mu_{c_k}) \times \chi + (1 - \mu_{c_k}) \times \varphi^2$
Modularity Vitality [11]: a signed community-aware centrality based on the modularity change a node causes after its removal from the network | $\beta_{MV}(i) = M(G_i) - M(G)$
Community-based Centrality [22]: based on weighting the intra-community and inter-community links by the size of their belonging communities | $\beta_{CBC}(i) = \sum_{c=1}^{N_c} k_{i,c} \left(\frac{n_c}{N}\right)$
K-shell with Community [10]: based on the k-shell hierarchical decomposition of the local network (formed by intra-community links) and the global network (formed by inter-community links) | $\beta_{ks}(i) = \delta \times \beta^{intra}(i) + (1 - \delta) \times \beta^{inter}(i)$
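To make the split into intra-community and inter-community links concrete, the sketch below evaluates two of the Table 2 measures, Community Hub-Bridge and Participation Coefficient, on a toy graph. It is an illustration under stated assumptions rather than the authors' implementation: the graph is unweighted and undirected, `community` maps each node to a community label, and $|NNC_i|$ is read as the number of communities other than its own that node $i$ links to. All names are hypothetical.

```python
from collections import Counter


def links_per_community(adj, community, i):
    """Counter mapping each community label to the number of links node i has in it."""
    return Counter(community[j] for j in adj[i])


def hub_bridge(adj, community, i):
    """Community Hub-Bridge: |c_k| * k_intra + |NNC_i| * k_inter."""
    per_comm = links_per_community(adj, community, i)
    own = community[i]
    k_intra = per_comm.get(own, 0)
    k_inter = len(adj[i]) - k_intra
    size = sum(1 for c in community.values() if c == own)
    n_neighboring = sum(1 for c in per_comm if c != own)
    return size * k_intra + n_neighboring * k_inter


def participation(adj, community, i):
    """Participation Coefficient: 1 - sum_c (k_{i,c} / k_i)^2; requires k_i > 0."""
    k = len(adj[i])
    per_comm = links_per_community(adj, community, i)
    return 1.0 - sum((kc / k) ** 2 for kc in per_comm.values())


# Hypothetical toy graph: two triangles joined by the link 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
for node in adj:
    print(node, hub_bridge(adj, community, node), participation(adj, community, node))
```

In this toy graph only the two bridge nodes have a non-zero Participation Coefficient, which illustrates how the measure singles out nodes with links outside their own community.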


Table 3. The fifty real-world networks used in this study, divided into eight different domains. All network data can be obtained from the cited resources [1, 6, 7, 15, 19].

Domain | Network's name and number
Animal networks | Dolphins (1), Reptiles (2)
Biological networks | Budapest Connectome (3), Blumenau Drug (4), E. coli Transcription (5), Human Protein (6), Interactome Vidal (7), Kegg Metabolic (8), Malaria Genes (9), Mouse Visual Cortex (10), Yeast Collins (11), Yeast Protein (12)
Collaboration networks | DBLP (13), AstroPh (14), C.S. PhD (15), GrQc (16), NetSci (17), New Zealand Collaboration (18)
Offline social networks | Adolescent health (19), Jazz (20), Zachary Karate Club (21), Madrid Train Bombings (22)
Infrastructural networks | EU Airlines (23), EuroRoad (24), Internet Autonomous Systems (25), Internet Topology Cogentco (26), London Transport (27), U.S. Power Grid (28), U.S. Airports (29), U.S. States (30)
Actor networks | Game of Thrones (31), Les Misérables (32), Marvel Partnerships (33), Movie Galaxies (34)
Miscellaneous networks | 911AllWords (35), Bible Nouns (36), Board of Directors (37), DNC Emails (38), Football (39), Polbooks (40)
Online social networks | DeezerEU (41), Ego Facebook (42), Facebook Friends (43), Facebook Organizations (44), Caltech (45), Facebook Politician Pages (46), Hamsterster (47), PGP (48), Princeton (49), Retweets Copenhagen (50)

consistency of the distribution for networks from the same domain. Indeed, their distributions can be quite different. For example, although EU Airlines (23) and EuroRoad (24) both belong to the infrastructural networks domain (grey color), EU Airlines (23) has a wide distribution while that of EuroRoad (24) is much narrower. One can notice that most networks exhibit a unimodal distribution. Yet, bimodal distributions are also seen, such as in the networks Movie Galaxies (34), 911AllWords (35), and Football (39). Whatever the network considered, the most frequent value of the distribution lies around 0.5. The average median of all the distributions is 0.43 ± 0.1. The average interquartile range is 0.37 ± 0.1. Finally, the average mean of the distribution for all networks is 0.37 ± 0.07. In other words, most pairs of classical and community-aware centrality measures tend to exhibit medium to low correlation values. Yet, a few high correlation values are also observed. To check the consistency of Kendall's Tau correlation values for the various pairs of community-aware and classical centralities across networks, we proceed as follows. Each network is represented by a sample made of thirty-five correlation pair values. The Pearson correlation values between each pair of samples


Fig. 2. Distribution of Pearson’s correlation for the heatmaps of the Kendall’s Tau correlation between classical and community-aware centrality of all networks.

Fig. 3. Mean and standard deviation of the Kendall’s Tau correlation for each classical and community-aware centrality measures pair (αi , βj ) across the ﬁfty networks.

are then computed to quantify the two networks' statistical proximity. Figure 2 illustrates their distribution. Overall, results across networks are well correlated: the Pearson correlation values range from 0.6 to 1, with a mean of 0.80 and a median of 0.82. Note that 911AllWords, Football, and, to a lesser extent, Ego Facebook deviate from the general trend, which is why the distribution has a fat left tail. Hence, one can conclude that the correlation of classical and community-aware centrality measures is rather consistent across networks. Finally, having checked that Kendall's Tau correlation values are consistent across networks, we calculate the mean and standard deviation for each combination $(\alpha_i, \beta_j)$ across the fifty networks. This allows us to study whether community-aware centrality measures behave differently. Results reported in Fig. 3 show that the correlation patterns of the various community-aware centrality measures are very different. Modularity Vitality ($\beta_{MV}$) is the only community-aware centrality measure exhibiting a negative correlation with classical centrality measures. Furthermore, its mean standard deviation is high. As it is a signed community-aware centrality measure, this result is not unexpected. The remaining community-aware centrality measures can be ranked according to their correlation values. Community Hub-Bridge ($\beta_{CHB}$) and Participation Coefficient ($\beta_{PC}$) tend to show a low positive mean correlation with all classical centrality measures (≤0.4), except for $(\alpha_b, \beta_{PC})$, which amounts to 0.46. Their mean standard deviation is generally close to 0.15. Comm Centrality ($\beta_{Comm}$) has a minimum mean correlation of 0.27 and a maximum mean correlation of 0.54. The standard deviation of $\beta_{Comm}$ ranges from 0.11 to 0.21. Next comes Community-based Mediator ($\beta_{CBM}$), where the mean correlation is between 0.43 and 0.6. Its mean standard deviation is near 0.15 for all combinations except for $(\alpha_m, \beta_{CBM})$, which amounts to 0.21. Finally, Community-based Centrality ($\beta_{CBC}$) and K-shell with Community ($\beta_{ks}$) exhibit a higher correlation with classical centrality measures than the other community-aware centrality measures. Indeed, the mean correlation reaches a maximum of 0.83 for $(\alpha_d, \beta_{ks})$. Their standard deviation ranges from 0.14 to 0.21. These results corroborate the observation of high correlation values in each network's distribution reported in Fig. 1. Indeed, these values correspond to $\beta_{CBC}$ and $\beta_{ks}$.
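The two-stage comparison used in this section (Kendall's Tau between centrality pairs within a network, then Pearson correlation between the resulting samples of two networks) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the tau-b variant is assumed because the text does not say how ties are handled, returning 0.0 for a constant ranking is a convention chosen here, and the two tiny "networks" with their centrality scores are made-up values.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two equally long score lists (O(n^2) pairs)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: counts toward neither denominator term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    if untied_x == 0 or untied_y == 0:
        return 0.0  # convention chosen here for constant score lists
    return (concordant - discordant) / sqrt(untied_x * untied_y)


def pearson(u, v):
    """Pearson correlation between two equally long samples."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def correlation_sample(classical, community_aware):
    """One tau value per (classical, community-aware) pair, in a fixed key order."""
    return [
        kendall_tau_b(classical[a], community_aware[b])
        for a in sorted(classical)
        for b in sorted(community_aware)
    ]


# Made-up node scores for two hypothetical networks (4 and 5 nodes).
net_a = correlation_sample(
    {"degree": [1, 2, 3, 4], "betweenness": [1, 3, 2, 4]},
    {"PC": [4, 3, 2, 1], "CBC": [1, 2, 3, 4]},
)
net_b = correlation_sample(
    {"degree": [1, 2, 3, 4, 5], "betweenness": [2, 1, 3, 5, 4]},
    {"PC": [5, 4, 3, 1, 2], "CBC": [1, 2, 3, 5, 4]},
)
consistency = pearson(net_a, net_b)
print(net_a, net_b, consistency)
```

Because each sample lists the pairs in the same order, a Pearson value close to 1 means that the two networks rank the centrality pairs similarly, which is the notion of consistency used above.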

4 Network Topology Analysis

Correlation values between classical and community-aware centrality measures of each network are further processed. For a given network, each community-aware centrality measure is reduced to the mean of the Kendall's Tau correlation values computed with the ten classical centrality measures. Simple linear regression is then performed to investigate the relationship between this mean and various topological properties of the networks. The average correlation values are the dependent variables, while the topological properties are the independent variables. The macroscopic features used are Density, Transitivity, Assortativity, Average distance, Diameter, Efficiency, and the Degree distribution exponent. The mesoscopic features used are Modularity, Mixing parameter, Internal distance, Internal density, Max-ODF, Average-ODF, Flake-ODF, Embeddedness, and Hub dominance. If the p-value is below 0.05, the relationship between the dependent and independent variables is considered statistically significant. Figure 4 presents the two extreme cases of statistical dependency between the mean correlation and the topological features. The first case concerns Community-based Mediator (the mean value shows significant linear relationships with nine topological features). The second is Modularity Vitality (the mean value shows no significant linear relationship with any topological property). The remaining figures and the linear regression parameter estimates for each community-aware centrality measure are provided in the supplementary materials. Regarding macroscopic topological properties, we observe three situations: a network characteristic exhibits a significant linear relationship with the mean of three, two, or no community-aware centrality measures. In that sense, transitivity and efficiency are the most influential macroscopic topological features [14]. They show significant relationships with the mean of three different community-aware centrality measures.
Then come density, assortativity, diameter, and average distance, which each affect two community-aware centrality measures. Finally, the degree distribution exponent is the only macroscopic feature that does not show any significant relationship. Transitivity has a significant negative association with the mean of Community-based Mediator ($\beta_{CBM}$) and Participation Coefficient ($\beta_{PC}$). Indeed, increasing transitivity leads to more triangles in the network. As Community-based Mediator


Fig. 4. Relationship of the mean of the correlation between the community-aware centralities “Community-based Mediator (βCBM )” and “Modularity Vitality (βM V )” combined with all classical centrality measures as a function of the topological properties of real-world networks. The line is ﬁtted by linear regression using ordinary least squares. “P” indicates p ≤ 0.05. “P” and * indicate p ≤ 0.01. The colors of the data points represent the network’s domain.

is based on the entropy of the intra-community and inter-community links of a node, transitivity may increase the diﬀerence between the two, resulting in a lower correlation. As the Participation Coeﬃcient also exploits the margin of the proportion of the inter-community and intra-community links, it behaves similarly. One observes a positive association with transitivity for


Community-based Centrality ($\beta_{CBC}$). If the whole network forms a single community, $\beta_{CBC}$ reduces to degree centrality [22]. Consequently, the correlation between $\beta_{CBC}$ and classical measures tends to increase as transitivity increases. Efficiency has a significant positive association with Comm Centrality ($\beta_{Comm}$), Community-based Centrality ($\beta_{CBC}$), and K-shell with Community ($\beta_{ks}$). An increase in efficiency means that the average shortest path distance in a network is getting smaller. In other words, the network is more efficient when nodes are closely connected. Therefore, community-aware centrality measures tend to be more correlated with classical ones. Density has a significant positive association with Comm Centrality ($\beta_{Comm}$) and Community-based Centrality ($\beta_{CBC}$). An increase in density means more links between nodes. Accordingly, $\beta_{Comm}$ and $\beta_{CBC}$ become more similar to classical centrality measures. Assortativity has a significant negative association with the mean of Community-based Mediator ($\beta_{CBM}$) and Participation Coefficient ($\beta_{PC}$). An increase in assortativity means that there are more interactions between peers in the networks. It may also increase the margin of difference between intra-community and inter-community links. Assortative networks tend to form communities with nodes of "similar" degree. Consequently, intra-community and inter-community link densities may further differ from one community to another. Hence, a lower correlation between $\beta_{CBM}$/$\beta_{PC}$ and classical centrality measures is observed. Diameter and average distance both have a significant negative association with the mean of Community-based Centrality ($\beta_{CBC}$) and K-shell with Community ($\beta_{ks}$). An increase in both measures means that nodes are more distant from each other. These two community-aware centrality measures are the most sensitive to distance-related measures. Regarding the mesoscopic topological features, one can distinguish two cases.
The mixing parameter, modularity, and Max-ODF are each significantly linearly related to the mean of three community-aware centrality measures. For the remaining features, linear dependence exists with the mean of two community-aware centrality measures. The mixing parameter has a significant positive association with the mean of Community Hub-Bridge ($\beta_{CHB}$), Participation Coefficient ($\beta_{PC}$), and Community-based Mediator ($\beta_{CBM}$). An increase in the mixing parameter translates into a weaker community structure. As a result, these community-aware centrality measures tend to extract information similar to that of classical centrality measures. Modularity has a significant negative association with the mean of Community-based Mediator ($\beta_{CBM}$), Community-based Centrality ($\beta_{CBC}$), and K-shell with Community ($\beta_{ks}$). An increase in modularity means that communities are tightly connected. As a result, these measures extract different information than classical centrality measures when the network is highly modular. Max-ODF has a significant positive association with the mean of Community-based Mediator ($\beta_{CBM}$), Community-based Centrality ($\beta_{CBC}$), and K-shell with Community ($\beta_{ks}$). Since Max-ODF is based on the nodes with the highest number of inter-community links in their community, its increase leads to more connections between highly connected nodes in different communities, weakening the community structure. Therefore, the correlation of $\beta_{CBM}$, $\beta_{CBC}$, and $\beta_{ks}$ with classical centrality measures increases. Internal distance shows a significant


positive linear relationship with the mean of Participation Coefficient ($\beta_{PC}$) and a negative one with the mean of Community-based Centrality ($\beta_{CBC}$). As $\beta_{PC}$ exploits the heterogeneity between intra-community and inter-community links of a node, an increase in internal distance decreases the margin between intra-community and inter-community links. Consequently, the correlation between $\beta_{PC}$ and classical centrality measures increases. The opposite effect occurs with $\beta_{CBC}$. Internal density has a negative influence on the mean of Community-based Mediator ($\beta_{CBM}$) and Participation Coefficient ($\beta_{PC}$). An increase in internal density means that communities are condensed with inner connections. As $\beta_{CBM}$ and $\beta_{PC}$ exploit the margin of difference of a node's intra-community links to its inter-community links, both will favor an increase in internal density. Average-ODF has a significant positive relationship with the mean of Community-based Mediator ($\beta_{CBM}$) and K-shell with Community ($\beta_{ks}$). Since it is based on the proportion of inter-community links, the weaker the community structure, the higher the correlation with classical centrality measures. Flake-ODF has a similar positive linear relationship with the mean of $\beta_{CBM}$ and $\beta_{ks}$. Indeed, it is another way of quantifying the strength of the community structure. Embeddedness has a negative relationship with the mean of Community-based Mediator ($\beta_{CBM}$) and K-shell with Community ($\beta_{ks}$). Indeed, being based on the proportion of intra-community links, it is the opposite of Average-ODF. Finally, hub dominance has a significant positive relationship with the mean of Community-based Centrality ($\beta_{CBC}$) and K-shell with Community ($\beta_{ks}$). A higher hub dominance means fewer tightly connected communities. As a result, $\beta_{CBC}$ behaves more like degree centrality, and the correlation of $\beta_{CBC}$ with classical centrality measures increases.
Concerning $\beta_{ks}$, higher hub dominance induces more similar intra-community and inter-community links and a higher correlation with classical centrality measures.
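The regression procedure described at the start of this section can be sketched as follows: each community-aware measure is first reduced to its mean Kendall's Tau over the classical measures, and this mean is then regressed on one topological property by ordinary least squares. This is a minimal standard-library sketch, not the authors' code. The function names and the four data points are hypothetical. The sketch stops at the t statistic of the slope; the p-value compared with 0.05 in the text would be obtained from a Student's t distribution with n − 2 degrees of freedom, which the sketch leaves to a statistics package.

```python
from math import sqrt


def mean_correlation(taus):
    """Reduce one community-aware measure to its mean tau over classical measures."""
    return sum(taus) / len(taus)


def simple_ols(x, y):
    """Ordinary least squares fit y = intercept + slope * x.

    Returns (slope, intercept, r, t_stat), where t_stat tests slope = 0
    with len(x) - 2 degrees of freedom.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary for the fit to be defined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    if abs(r) < 1.0:
        t_stat = r * sqrt((n - 2) / (1.0 - r * r))
    else:
        t_stat = float("inf") * r  # perfect fit: the statistic diverges
    return slope, intercept, r, t_stat


# Hypothetical values: modularity of four networks and, for each network,
# the Kendall's Tau values of one community-aware measure with two classical ones.
modularity = [0.2, 0.4, 0.6, 0.8]
taus_per_network = [[0.8, 0.6], [0.7, 0.5], [0.5, 0.3], [0.4, 0.2]]
mean_tau = [mean_correlation(t) for t in taus_per_network]

slope, intercept, r, t_stat = simple_ols(modularity, mean_tau)
print(slope, intercept, r, t_stat)
```

With these made-up numbers the slope is negative, which mirrors the direction of the modularity association reported above; the values themselves carry no empirical meaning.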

5 Conclusion

This study investigates the relationship between classical and community-aware centrality measures. First, results show that the Kendall's Tau correlation between classical and community-aware centrality measures is generally medium to low. Second, the correlation patterns are fairly consistent across networks. Moreover, the community-aware centrality measures can be classified into four groups according to their correlation pattern with classical centrality measures. More specifically, Modularity Vitality shows a low negative correlation. Low positive correlation characterizes Community Hub-Bridge and Participation Coefficient. A positive medium correlation is observed for Comm Centrality and Community-based Mediator. Finally, Community-based Centrality and K-shell with Community show a high positive correlation. Transitivity and efficiency are the most influential macroscopic features, while the mixing parameter, modularity, and Max-ODF are the predominant mesoscopic features. The results of this study pave the way for the development of effective community-aware centrality measures. Indeed, they demonstrate that integrating knowledge about the network community structure brings a new perspective on node influence.


References

1. Clauset, A., Tucker, E., Sainz, M.: The Colorado index of complex networks (2016). https://icon.colorado.edu/
2. Ghalmane, Z., El Hassouni, M., Cherifi, H.: Immunization of networks with non-overlapping community structure. SNAM 9(1), 1–22 (2019)
3. Guimera, R., Amaral, L.A.N.: Functional cartography of complex metabolic networks. Nature 433(7028), 895–900 (2005)
4. Gupta, N., Singh, A., Cherifi, H.: Community-based immunization strategies for epidemic control. In: 2015 7th International Conference on Communication Systems and Networks (COMSNETS), pp. 1–6. IEEE (2015)
5. Ibnoulouafi, A., El Haziti, M., Cherifi, H.: M-centrality: identifying key nodes based on global position and local degree variation. J. Stat. Mech. Theory Exp. 2018(7), 073407 (2018)
6. Kunegis, J.: Handbook of network analysis [konect project]. arXiv:1402.5500 (2014)
7. Latora, V., Nicosia, V., Russo, G.: Complex Networks: Principles, Methods and Applications. Cambridge University Press, Cambridge (2017). https://www.complex-networks.net/datasets.html
8. Li, C., Li, Q., Van Mieghem, P., Stanley, H.E., Wang, H.: Correlation between centrality metrics and their application to the opinion model. EPJ B 88(3), 1–13 (2015)
9. Lü, L., Chen, D., Ren, X.L., Zhang, Q.M., Zhang, Y.C., Zhou, T.: Vital nodes identification in complex networks. Phys. Rep. 650, 1–63 (2016)
10. Luo, S.L., Gong, K., Kang, L.: Identifying influential spreaders of epidemics on community networks. arXiv preprint arXiv:1601.07700 (2016)
11. Magelinski, T., Bartulovic, M., Carley, K.M.: Measuring node contribution to community structure with modularity vitality. IEEE Trans. Netw. Sci. Eng. 8(1), 707–723 (2021)
12. Oldham, S., Fulcher, B., Parkes, L., Arnatkevicute, A., Suo, C., Fornito, A.: Consistency and differences between centrality measures across distinct classes of networks. PLoS ONE 14(7), e0220061 (2019)
13. Orman, G.K., Labatut, V., Cherifi, H.: Qualitative comparison of community detection algorithms. In: Cherifi, H., Zain, J.M., El-Qawasmeh, E. (eds.) DICTAP 2011. CCIS, vol. 167, pp. 265–279. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22027-2_23
14. Orman, K., Labatut, V., Cherifi, H.: An empirical study of the relation between community structure and transitivity. In: Menezes, R., Evsukoff, A., González, M. (eds.) Complex Networks. SCI, vol. 424, pp. 99–110. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-30287-9_11
15. Peixoto, T.P.: The netzschleuder network catalogue and repository (2020). https://networks.skewed.de/
16. Rajeh, S., Savonnet, M., Leclercq, E., Cherifi, H.: Interplay between hierarchy and centrality in complex networks. IEEE Access 8, 129717–129742 (2020)
17. Rajeh, S., Savonnet, M., Leclercq, E., Cherifi, H.: Investigating centrality measures in social networks with community structure. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) Complex Networks & Their Applications IX. COMPLEX NETWORKS 2020. SCI, vol. 943, pp. 211–222. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65347-7_18
18. Ronqui, J.R.F., Travieso, G.: Analyzing complex networks through correlations in centrality measurements. J. Stat. Mech. Theory Exp. 2015(5), P05030 (2015)
19. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics and visualization. In: AAAI (2015)
20. Schoch, D., Valente, T.W., Brandes, U.: Correlations among centrality indices and a class of uniquely ranked graphs. Soc. Netw. 50, 46–54 (2017)
21. Tulu, M.M., Hou, R., Younas, T.: Identifying influential nodes based on community structure to speed up the dissemination of information in complex network. IEEE Access 6, 7390–7401 (2018)
22. Zhao, Z., Wang, X., Zhang, W., Zhu, Z.: A community-based approach to identifying influential spreaders. Entropy 17(4), 2228–2252 (2015)

Author Index

B: Brede, Markus, 108
C: Cherifi, Hocine, 120
D: De Meo, Pasquale, 12; Derzsy, Noemi, 73
F: Ficara, Annamaria, 12; Fiumara, Giacomo, 12
G: Gecow, Andrzej, 98
I: Ikai, Ryota, 38
J: Jalali, Zeinab S., 59
K: Kenthapadi, Krishnaram, 59; Korkmaz, Gizem, 24
L: Leclercq, Eric, 120; Lehr, Jane, 86; Liotta, Antonio, 12
M: Majumdar, Subhabrata, 73; Malik, Rajat, 73; Malinskii, Igor, 51; McDonald, Sarah, 24; McNichols, Logan, 86; Migler, Theresa, 86; Mironov, Sergei, 51; Miyagi, Shigeyuki, 38
N: Nowostawski, Mariusz, 98
P: Pineda, Steven, 86
R: Rajeh, Stephany, 120
S: Saitta, Rebecca, 12; Sakai, Osamu, 38; Sauerborn, Emma, 86; Savonnet, Marinette, 120; Sidorov, Sergei, 51; Soundarajan, Sucheta, 59; Suzuki, Daiki, 1
T: Tat, Brandon, 86; Tsugawa, Sho, 1
W: Wood, Zoë, 86
Y: Yoo, Kevin, 86

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 A. S. Teixeira et al. (Eds.): Complex Networks XII, SPCOM, p. 133, 2021. https://doi.org/10.1007/978-3-030-81854-8