Complex Networks and Their Applications VIII: Volume 1 Proceedings of the Eighth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2019 [1st ed. 2020] 978-3-030-36686-5, 978-3-030-36687-2

This book highlights cutting-edge research in the field of network science, offering scientists, researchers, students,


English Pages XXVII, 979 [992] Year 2020


Table of contents :
Front Matter ....Pages i-xxvii
Front Matter ....Pages 1-1
LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction (Emmanouil Krasanakis, Symeon Papadopoulos, Yiannis Kompatsiaris)....Pages 3-14
A Gradient Estimate for PageRank (Paul Horn, Lauren M. Nelsen)....Pages 15-26
A Persistent Homology Perspective to the Link Prediction Problem (Sumit Bhatia, Bapi Chatterjee, Deepak Nathani, Manohar Kaul)....Pages 27-39
The Role of Network Size for the Robustness of Centrality Measures (Christoph Martin, Peter Niemeyer)....Pages 40-51
Novel Edge and Density Metrics for Link Cohesion (Cetin Savkli, Catherine Schwartz, Amanda Galante, Jonathan Cohen)....Pages 52-63
Facility Location Problem on Network Based on Group Centrality Measure Considering Cooperation and Competition (Takayasu Fushimi, Seiya Okubo, Kazumi Saito)....Pages 64-76
Finding Dominant Nodes Using Graphlets (David Aparício, Pedro Ribeiro, Fernando Silva, Jorge Silva)....Pages 77-89
Sampling on Networks: Estimating Eigenvector Centrality on Incomplete Networks (Nicolò Ruggeri, Caterina De Bacco)....Pages 90-101
Front Matter ....Pages 103-103
Repel Communities and Multipartite Networks (Jerry Scripps, Christian Trefftz, Greg Wolffe, Roger Ferguson, Xiang Cao)....Pages 105-115
The Densest k Subgraph Problem in b-Outerplanar Graphs (Sean Gonzales, Theresa Migler)....Pages 116-127
Spread Sampling and Its Applications on Graphs (Yu Wang, Bortik Bandyopadhyay, Vedang Patel, Aniket Chakrabarti, David Sivakoff, Srinivasan Parthasarathy)....Pages 128-140
Eva: Attribute-Aware Network Segmentation (Salvatore Citraro, Giulio Rossetti)....Pages 141-151
Exorcising the Demon: Angel, Efficient Node-Centric Community Discovery (Giulio Rossetti)....Pages 152-163
Metrics Matter in Community Detection (Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, David W. Matula)....Pages 164-175
An Exact No Free Lunch Theorem for Community Detection (Arya D. McCarthy, Tongfei Chen, Seth Ebner)....Pages 176-187
Impact of Network Topology on Efficiency of Proximity Measures for Community Detection (Rinat Aynulin)....Pages 188-197
Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks (Mário Cordeiro, Rui Portocarrero Sarmento, Pavel Brazdil, Masahiro Kimura, João Gama)....Pages 198-210
Change Point Detection in a Dynamic Stochastic Blockmodel (Peter Wills, François G. Meyer)....Pages 211-222
A General Method for Detecting Community Structures in Complex Networks (Vesa Kuikka)....Pages 223-237
A New Metric for Package Cohesion Measurement Based on Complex Network (Yanran Mi, Yanxi Zhou, Liangyu Chen)....Pages 238-249
A Generalized Framework for Detecting Social Network Communities by the Scanning Method (Tai-Chi Wang, Frederick Kin Hing Phoa)....Pages 250-261
Comparing the Community Structure Identified by Overlapping Methods (Vinícius da F. Vieira, Carolina R. Xavier, Alexandre G. Evsukoff)....Pages 262-273
Semantic Frame Induction as a Community Detection Problem (Eugénio Ribeiro, Andreia Sofia Teixeira, Ricardo Ribeiro, David Martins de Matos)....Pages 274-285
A New Measure of Modularity in Hypergraphs: Theoretical Insights and Implications for Effective Clustering (Tarun Kumar, Sankaran Vaidyanathan, Harini Ananthapadmanabhan, Srinivasan Parthasarathy, Balaraman Ravindran)....Pages 286-297
Front Matter ....Pages 299-299
Crying “Wolf” in a Network Structure: The Influence of Node-Generated Signals (Tomer Tuchner, Gail Gilboa-Freedman)....Pages 301-312
Vaccination Strategies on a Robust Contact Network (Christopher Siu, Theresa Migler)....Pages 313-324
Total Positive Influence Domination on Weighted Networks (Danica Vukadinović Greetham, Nathaniel Charlton, Anush Poghosyan)....Pages 325-336
Modelling Spatial Information Diffusion (Zhuo Chen, Xinyue Ye)....Pages 337-348
Rejection-Based Simulation of Non-Markovian Agents on Complex Networks (Gerrit Großmann, Luca Bortolussi, Verena Wolf)....Pages 349-361
Community-Aware Content Diffusion: Embeddednes and Permeability (Letizia Milli, Giulio Rossetti)....Pages 362-371
Can WhatsApp Counter Misinformation by Limiting Message Forwarding? (Philipe de Freitas Melo, Carolina Coimbra Vieira, Kiran Garimella, Pedro O. S. Vaz de Melo, Fabrício Benevenuto)....Pages 372-384
Modeling Airport Congestion Contagion by SIS Epidemic Spreading on Airline Networks (Klemens Köstler, Rommy Gobardhan, Alberto Ceria, Huijuan Wang)....Pages 385-398
A Population Dynamics Approach to Viral Marketing (Pedro C. Souto, Luísa V. Silva, Diego Costa Pinto, Francisco C. Santos)....Pages 399-411
Integrating Environmental Temperature Conditions into the SIR Model for Vector-Borne Diseases (Md Arquam, Anurag Singh, Hocine Cherifi)....Pages 412-424
Opinion Diffusion in Competitive Environments: Relating Coverage and Speed of Diffusion (Valeria Fionda, Gianluigi Greco)....Pages 425-435
Beyond Fact-Checking: Network Analysis Tools for Monitoring Disinformation in Social Media (Stefano Guarino, Noemi Trino, Alessandro Chessa, Gianni Riotta)....Pages 436-447
Suppressing Information Diffusion via Link Blocking in Temporal Networks (Xiu-Xiu Zhan, Alan Hanjalic, Huijuan Wang)....Pages 448-458
Using Connected Accounts to Enhance Information Spread in Social Networks (Alon Sela, Orit Cohen-Milo, Eugene Kagan, Moti Zwilling, Irad Ben-Gal)....Pages 459-468
Designing Robust Interventions to Control Epidemic Outbreaks (Prathyush Sambaturu, Anil Vullikanti)....Pages 469-480
Front Matter ....Pages 481-481
The Impact of Network Degree Correlation on Parrondo’s Paradox (Ye Ye, Xiao-Rong Hang, Lin Liu, Lu Wang, Neng-gang Xie)....Pages 483-494
Analysis of Diversity and Dynamics in Co-evolution of Cooperation in Social Networking Services (Yutaro Miura, Fujio Toriumi, Toshiharu Sugawara)....Pages 495-506
Shannon Entropy in Time–Varying Clique Networks (Marcelo do Vale Cunha, Carlos César Ribeiro Santos, Marcelo Albano Moret, Hernane Borges de Barros Pereira)....Pages 507-518
Two-Mode Threshold Graph Dynamical Systems for Modeling Evacuation Decision-Making During Disaster Events (Nafisa Halim, Chris J. Kuhlman, Achla Marathe, Pallab Mozumder, Anil Vullikanti)....Pages 519-531
Spectral Evolution of Twitter Mention Networks (Miguel Romero, Camilo Rocha, Jorge Finke)....Pages 532-542
Front Matter ....Pages 543-543
Minimum Entropy Stochastic Block Models Neglect Edge Distribution Heterogeneity (Louis Duvivier, Céline Robardet, Rémy Cazabet)....Pages 545-555
Three-Parameter Kinetics of Self-organized Criticality on Twitter (Victor Dmitriev, Andrey Dmitriev, Svetlana Maltseva, Stepan Balybin)....Pages 556-565
Multi-parameters Model Selection for Network Inference (Veronica Tozzo, Annalisa Barla)....Pages 566-577
Scott: A Method for Representing Graphs as Rooted Trees for Graph Canonization (Nicolas Bloyet, Pierre-François Marteau, Emmanuel Frénod)....Pages 578-590
Cliques in High-Dimensional Random Geometric Graphs (Konstantin Avrachenkov, Andrei Bobu)....Pages 591-600
Universal Boolean Logic in Cascading Networks (Galen Wilkerson, Sotiris Moschoyiannis)....Pages 601-611
Fitness-Weighted Preferential Attachment with Varying Number of New Connections (Juan Romero, Jorge Finke, Andrés Salazar)....Pages 612-620
Rigid Graph Alignment (Vikram Ravindra, Huda Nassar, David F. Gleich, Ananth Grama)....Pages 621-632
Detecting Hotspots on Networks (Juan Campos, Jorge Finke)....Pages 633-644
Front Matter ....Pages 645-645
A Transparent Referendum Protocol with Immutable Proceedings and Verifiable Outcome for Trustless Networks (Maximilian Schiedermeier, Omar Hasan, Lionel Brunie, Tobias Mayer, Harald Kosch)....Pages 647-658
Utilizing Complex Networks for Event Detection in Heterogeneous High-Volume News Streams (Iraklis Moutidis, Hywel T. P. Williams)....Pages 659-672
Drawing Networks of Political Leaders: Global Affairs in The Economist’s KAL’s Cartoons (Nikita Golubev, Alina V. Vladimirova)....Pages 673-681
Shielding and Shadowing: A Tale of Two Strategies for Opinion Control in the Voting Dynamics (Guillermo Romero Moreno, Long Tran-Thanh, Markus Brede)....Pages 682-693
Front Matter ....Pages 695-695
Stable and Uniform Resource Allocation Strategies for Network Processes Using Vertex Energy Gradients (Mikołaj Morzy, Tomi Wójtowicz)....Pages 697-708
Cascading Failures in Weighted Networks with the Harmonic Closeness (Yucheng Hao, Limin Jia, Yanhui Wang)....Pages 709-720
Learning to Control Random Boolean Networks: A Deep Reinforcement Learning Approach (Georgios Papagiannis, Sotiris Moschoyiannis)....Pages 721-734
Comparative Network Robustness Evaluation of Link Attacks (Clara Pizzuti, Annalisa Socievole, Piet Van Mieghem)....Pages 735-746
MAC: Multilevel Autonomous Clustering for Topologically Distributed Anomaly Detection (M. A. Partha, C. V. Ponce)....Pages 747-760
Network Strengthening Against Malicious Attacks (Qingnan Rong, Jun Zhang, Xiaoqian Sun, Sebastian Wandelt)....Pages 761-772
Identifying Vulnerable Nodes to Cascading Failures: Optimization-Based Approach (Richard J. La)....Pages 773-782
Ensemble Approach for Generalized Network Dismantling (Xiao-Long Ren, Nino Antulov-Fantulin)....Pages 783-793
Front Matter ....Pages 795-795
A Simple Approach to Attributed Graph Embedding via Enhanced Autoencoder (Nasrullah Sheikh, Zekarias T. Kefato, Alberto Montresor)....Pages 797-809
Matching Node Embeddings Using Valid Assignment Kernels (Changmin Wu, Giannis Nikolentzos, Michalis Vazirgiannis)....Pages 810-821
Short Text Tagging Using Nested Stochastic Block Model: A Yelp Case Study (John Bowllan, Kailey Cozart, Seyed Mohammad Mahdi Seyednezhad, Anthony Smith, Ronaldo Menezes)....Pages 822-833
Domain-Invariant Latent Representation Discovers Roles (Shumpei Kikuta, Fujio Toriumi, Mao Nishiguchi, Tomoki Fukuma, Takanori Nishida, Shohei Usui)....Pages 834-844
Inductive Representation Learning on Feature Rich Complex Networks for Churn Prediction in Telco (María Óskarsdóttir, Sander Cornette, Floris Deseure, Bart Baesens)....Pages 845-853
On Inferring Monthly Expenses of Social Media Users: Towards Data and Approaches (Danila Vaganov, Alexander Kalinin, Klavdiya Bochenina)....Pages 854-865
Evaluating the Community Structures from Network Images Using Neural Networks (Md. Khaledur Rahman, Ariful Azad)....Pages 866-878
Gumbel-Softmax Optimization: A Simple General Framework for Combinatorial Optimization Problems on Graphs (Jing Liu, Fei Gao, Jiang Zhang)....Pages 879-890
TemporalNode2vec: Temporal Node Embedding in Temporal Networks (Mounir Haddad, Cécile Bothorel, Philippe Lenca, Dominique Bedart)....Pages 891-902
Deep Reinforcement Learning for Task-Driven Discovery of Incomplete Networks (Peter Morales, Rajmonda Sulo Caceres, Tina Eliassi-Rad)....Pages 903-914
Evaluating Network Embedding Models for Machine Learning Tasks (Ikenna Oluigbo, Mohammed Haddad, Hamida Seba)....Pages 915-927
A BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media (Marzieh Mozafari, Reza Farahbakhsh, Noël Crespi)....Pages 928-940
Front Matter ....Pages 941-941
A Simple Differential Geometry for Networks and Its Generalizations (Emil Saucan, Areejit Samal, Jürgen Jost)....Pages 943-954
Characterizing Distances of Networks on the Tensor Manifold (Bipul Islam, Ji Liu, Romeil Sandhu)....Pages 955-964
Eigenvalues and Spectral Dimension of Random Geometric Graphs in Thermodynamic Regime (Konstantin Avrachenkov, Laura Cottatellucci, Mounia Hamidouche)....Pages 965-975
Back Matter ....Pages 977-979


Studies in Computational Intelligence 881

Hocine Cherifi · Sabrina Gaito · José Fernando Mendes · Esteban Moro · Luis Mateus Rocha   Editors

Complex Networks and Their Applications VIII Volume 1 Proceedings of the Eighth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2019

Studies in Computational Intelligence Volume 881

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are submitted to indexing to Web of Science, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink.

More information about this series at http://www.springer.com/series/7092


Editors
Hocine Cherifi, University of Burgundy, Dijon Cedex, France
Sabrina Gaito, Universita degli Studi di Milano, Milan, Italy
José Fernando Mendes, University of Aveiro, Aveiro, Portugal
Esteban Moro, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
Luis Mateus Rocha, Indiana University, Bloomington, IN, USA

ISSN 1860-949X
ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-36686-5
ISBN 978-3-030-36687-2 (eBook)
https://doi.org/10.1007/978-3-030-36687-2
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The International Conference on Complex Networks and Their Applications was initiated in 2011. Since then, it has grown to become one of the major international events in network science. The aim is to support the rise of the scientific community that studies the world through the lens of networks. Every year, it brings together researchers from a wide variety of scientific backgrounds, ranging from finance and economy, medicine and neuroscience, biology and earth sciences, sociology and politics, computer science and physics, and many others, in order to review the current state of the field and formulate new directions. The scientific topics are just as varied: network theory, network models, network geometry, community structure, network analysis and measures, link analysis and ranking, resilience and control, machine learning and networks, dynamics on/of networks, diffusion, and epidemics. Let us also mention some current applications such as social and urban networks, human behavior, urban systems, mobility, and quantifying success. The great diversity of the participants allows for cross-fertilization between fundamental issues and innovative applications. The papers selected for the volumes of proceedings from the eighth edition, hosted by the Calouste Gulbenkian Foundation in Lisbon (Portugal) from December 10 to December 12, 2019, clearly reflect the multiple aspects of complex network issues as well as the high quality of the contributions. This edition attracted numerous authors from all over the world, with 470 submissions from 58 countries. All the submissions were peer-reviewed by at least 3 independent reviewers from our strong International Program Committee in order to ensure the high quality of the contributed material as well as adherence to the conference topics. After the review process, 161 papers were selected to be included in the proceedings.
The success of this edition is undoubtedly due to the work of the authors, who provided high-quality papers. It is also owed to our keynote speakers and their fascinating plenary lectures. Their talks provided outstanding coverage of the broad field of complex networks.


• Lada Adamic (Facebook Inc.): “Reflections of social networks”
• Reka Albert (Pennsylvania State University, USA): “Network-based dynamic modeling of biological systems: toward understanding and control”
• Ulrik Brandes (ETH Zürich, Switzerland): “On a positional approach to network science”
• Jari Saramäki (Aalto University, Finland): “Temporal networks: past, present, future”
• Stefan Thurner (Medical University of Vienna, Austria): “How to eliminate systemic risk from financial multilayer networks”
• Michalis Vazirgiannis (LIX, École Polytechnique, France): “Machine learning for graphs based on kernels”
Prior to the conference, for the traditional tutorial sessions, Maria Ángeles Serrano (Universitat de Barcelona, Spain) and Diego Saez-Trumper (Wikimedia Foundation) delivered insightful talks on “Mapping networks in latent geometry: models and applications” and “Wikimedia public (research) resources,” respectively. We sincerely thank our advisory board members for inspiring the essence of the conference: Jon Crowcroft (University of Cambridge, UK), Raissa D’Souza (University of California, Davis, USA), Eugene Stanley (Boston University, USA), and Ben Y. Zhao (University of Chicago, USA). We record our thanks to our fellow members of the Organizing Committee: Luca Maria Aiello (Nokia Bell Labs, UK) and Rosa Maria Benito (Universidad Politecnica de Madrid, Spain), our satellite chairs; Nuno Araujo (University of Lisbon, Portugal), Huijuan Wang (TU Delft, the Netherlands), and Taha Yasseri (University of Oxford, UK) for chairing the lightning sessions; Gitajanli Yadav (University of Cambridge, UK), Jinhu Lü (Chinese Ac. Science, Beijing, China), and Maria Clara Gracio (University of Evora, Portugal) for managing the poster sessions; and Bruno Gonçalves (NYU, USA), our tutorial chair.
We extend our thanks to Carlos Gershenson (Universidad Nacional Autónoma de México, Mexico), Michael Schaub (MIT, USA), Leto Peel (Université Catholique de Louvain, Belgium), and Feng Xia (Dalian University of Technology, China), the publicity chairs, for advertising the conference in America, Asia, and Europe, thereby encouraging participation. We would also like to acknowledge Roberto Interdonato (CIRAD UMR TETIS, Montpellier, France) and Andreia Sofia Teixeira (University of Lisbon, Portugal), our sponsor chair and social media chair, respectively. Our thanks also go to Chantal Cherifi (University of Lyon, France), publication chair, and to the Milan team (University of Milan, Italy), Matteo Zignani, the Web chair, and Christian Quadri, the submission chair, for the tremendous work they have done in maintaining the Web site and the submission system. We would also like to record our appreciation for the work of the Local Committee chair, Manuel Marques-Pita, and members Pedro Souto, Flávio L. Pinheiro, Rion Bratting Correia, Lília Perfeito, Sara Mesquita, Sofia Pinto, Simone Lackner, and João Franco, whose work contributed to the success of this edition.


Our deep thanks go to the members of Instituto Gulbenkian de Ciência, Rita Caré, Regina Fernandes, and Greta Martins, as well as to Paulo Madruga from Fundação Calouste Gulbenkian, for their precious support and dedication. We are also indebted to our partners, Alessandro Fellegara and Alessandro Egro (Tribe Communication), for their passion and patience in designing the visual identity of the conference. We would like to express our gratitude to the publishers involved in sponsoring the conference: Frontiers and Springer Nature. Our deepest appreciation goes to all those who have helped us make this meeting a success. Sincere thanks to the contributors; the success of the technical program would not be possible without their creativity. Finally, we would like to express our most sincere thanks to the Program Committee members for their huge efforts in producing 1453 high-quality reviews in a very limited time. These volumes make the most advanced contribution of the international community to the research issues surrounding the fascinating world of complex networks. We hope that you enjoy the papers as much as we enjoyed organizing the conference and putting this collection of papers together.
Hocine Cherifi
Sabrina Gaito
José Fernando Mendes
Esteban Moro
Luis Mateus Rocha

Organization and Committees

General Chairs
Hocine Cherifi, University of Burgundy, France
José Fernando Mendes, University of Aveiro, Portugal
Luis Mateus Rocha, Indiana University Bloomington, USA

Advisory Board
Jon Crowcroft, University of Cambridge, UK
Raissa D’Souza, University of California, Davis, USA
Eugene Stanley, Boston University, USA
Ben Y. Zhao, University of Chicago, USA

Program Chairs
Sabrina Gaito, University of Milan, Italy
Esteban Moro, Universidad Carlos III de Madrid, Spain

Program Co-chairs
Joana Gonçalves-Sá, Universidade NOVA de Lisboa, Portugal
Francisco Santos, University of Lisbon, Portugal

Satellite Chairs
Luca Maria Aiello, Nokia Bell Labs, UK
Rosa M. Benito, Universidad Politecnica de Madrid, Spain


Lightning Chairs
Nuno Araujo, University of Lisbon, Portugal
Huijuan Wang, TU Delft, the Netherlands
Taha Yasseri, University of Oxford, UK

Poster Chairs
Gitajanli Yadav, University of Cambridge, UK
Jinhu Lü, Chinese Ac. Science, Beijing, China
Maria Clara Gracio, University of Evora, Portugal

Publicity Chairs
Carlos Gershenson, UNA de Mexico, Mexico
Leto Peel, UCLouvain, Belgium
Michael Schaub, MIT, USA
Feng Xia, Dalian University of Technology, China

Tutorial Chair
Bruno Gonçalves, NYU, USA

Sponsor Chair
Roberto Interdonato, CIRAD - UMR TETIS, France

Social Media Chair
Andreia Sofia Teixeira, University of Lisbon, Portugal

Local Committee Chair
Manuel Marques-Pita, University Lusófona, Portugal

Local Committee
Rion Bratting Correia, Instituto Gulbenkian de Ciência, Portugal
João Franco, Nova SBE, Portugal
Simone Lackner, Nova SBE, Portugal
Sara Mesquita, Nova SBE, Portugal
Lília Perfeito, Nova SBE, Portugal
Flávio L. Pinheiro, NOVA IMS, Portugal
Sofia Pinto, Nova SBE, Portugal
Pedro Souto, University of Lisbon, Portugal
Andreia Sofia Teixeira, University of Lisbon, Portugal

Publication Chair
Chantal Cherifi, University of Lyon, France

Submission Chair
Christian Quadri, University of Milan, Italy

Web Chair
Matteo Zignani, University of Milan, Italy

Program Committee
Aguirre Jacobo, Centro Nacional de Biotecnología (CNB-CSIC), Spain
Ahmed Nesreen, Intel, USA
Aida Masaki, Tokyo Metropolitan University, Japan
Aiello Luca Maria, Nokia Bell Labs, UK
Aiello Marco, University of Stuttgart, Germany
Aktas Mehmet, University of Central Oklahoma, USA
Akutsu Tatsuya, Kyoto University, Japan
Albert Reka, Pennsylvania State University, USA
Allard Antoine, Laval University, Canada
Aloric Aleksandra, Institute of Physics Belgrade, Serbia
Altafini Claudio, Linköping University, Sweden
Alvarez-Zuzek Lucila G., IFIMAR-UNMdP, Argentina
Alves Luiz G. A., Northwestern University, USA
Amblard Fred, IRIT - University Toulouse 1 Capitole, France
An Chuankai, Dartmouth College, USA
Angione Claudio, Teesside University, UK
Angulo Marco Tulio, UNAM, Mexico
Antonioni Alberto, Carlos III University of Madrid, Spain
Antulov-Fantulin Nino, ETH Zurich, Switzerland
Araujo Nuno, Universidade de Lisboa, Portugal
Arcaute Elsa, University College London, UK
Aref Samin, MPI for Demographic Research, Germany
Arenas Alex, URV, Spain
Ares Saul, Centro Nacional de Biotecnología (CNB-CSIC), Spain
Argyrakis Panos, Aristotle University of Thessaloniki, Greece
Aste Tomaso, University College London, UK

Atzmueller Martin, Tilburg University, the Netherlands
Avrachenkov Konstantin, Inria, France
Baggio Rodolfo, Bocconi University, Italy
Banisch Sven, MPI for Mathematics in the Sciences, Germany
Barnett George, University of California, Davis, USA
Barucca Paolo, University College London, UK
Basov Nikita, St. Petersburg State University, Russia
Baxter Gareth, University of Aveiro, Portugal
Beguerisse Diaz Mariano, Spotify Limited, UK
Benczur Andras A., ICSC, Hungarian Academy of Sciences, Hungary
Benito Rosa M., Universidad Politécnica de Madrid, Spain
Bianconi Ginestra, Queen Mary University of London, UK
Biham Ofer, The Hebrew University of Jerusalem, Israel
Boguna Marian, University of Barcelona, Spain
Bonato Anthony, Ryerson University, Canada
Bongiorno Christian, Università degli Studi di Palermo, Italy
Borg Anton, Blekinge Institute of Technology, Sweden
Borge-Holthoefer Javier, Internet Interdisciplinary Institute IN3 UOC, Spain
Borgnat Pierre, CNRS, Laboratoire de Physique ENS de Lyon, France
Bornholdt Stefan, Universität Bremen, Germany
Bovet Alexandre, Université Catholique de Louvain-la-Neuve, Belgium
Braha Dan, NECSI, USA
Brandes Ulrik, ETH Zurich, Switzerland
Brede Markus, University of Southampton, UK
Bressan Marco, Sapienza University of Rome, Italy
Brockmann Dirk, Humboldt University of Berlin, Germany
Bródka Piotr, Wroclaw University of Science and Technology, Poland
Burioni Raffaella, Università di Parma, Italy
Campana Paolo, University of Cambridge, UK
Cannistraci Carlo Vittorio, TU Dresden, Germany
Carchiolo Vincenza, Università di Catania, Italy
Cardillo Alessio, Universitat Rovira i Virgili, Spain
Casiraghi Giona, ETH Zurich, Switzerland
Cattuto Ciro, ISI Foundation, Italy
Cazabet Remy, Université Lyon 1, CNRS, LIRIS, France
Chakraborty Abhijit, University of Hyogo, Japan
Chakraborty Tanmoy, IIIT Delhi, India
Chavalarias David, CNRS, CAMS/ISC-PIF, France
Chawla Nitesh V., University of Notre Dame, USA
Chen Kwang-Cheng, University of South Florida, USA
Cheng Xueqi, Institute of Computing Technology, China

Cherifi Hocine, University of Burgundy, France
Cherifi Chantal, Lyon 2 University, France
Chin Peter, Boston University, USA
Chung Fu Lai, The Hong Kong Polytechnic University, Hong Kong
Cinelli Matteo, University of Rome “Tor Vergata”, Italy
Clegg Richard, Queen Mary University of London, UK
Cohen Reuven, Bar-Ilan University, Israel
Coscia Michele, IT University of Copenhagen, Denmark
Costa Luciano, Universidade de Sao Paulo, Brazil
Criado Regino, Universidad Rey Juan Carlos, Spain
Cucuringu Mihai, University of Oxford and The Alan Turing Institute, UK
Darwish Kareem, Qatar Computing Research Institute, Qatar
Dasgupta Bhaskar, University of Illinois at Chicago, USA
Davidsen Joern, University of Calgary, Canada
De Bie Tijl, Ghent University, Belgium
De Meo Pasquale, Vrije Universiteit Amsterdam, the Netherlands
De Vico Fallani Fabrizio, Inria - ICM, France
Del Genio Charo I., Coventry University, UK
Delellis Pietro, University of Naples Federico II, Italy
Delvenne Jean-Charles, University of Louvain, Belgium
Deng Yong, Xi’an Jiaotong University, China
Devezas José, INESC TEC and DEI-FEUP, Portugal
Di Muro Matías, Universidad Nacional de Mar del Plata, Argentina
Diesner Jana, University of Illinois at Urbana-Champaign, USA
Douw Linda, Amsterdam UMC, the Netherlands
Duch Jordi, Universitat Rovira i Virgili, Spain
Eismann Kathrin, University of Bamberg, Germany
El Hassouni Mohammed, Mohammed V University, Morocco
Emmerich Michael T. M., Leiden University, the Netherlands
Emmert-Streib Frank, Tampere University of Technology, Finland
Ercal Gunes, SIUE, USA
Faccin Mauro, Université Catholique de Louvain, Belgium
Fagiolo Giorgio, Sant’Anna School of Advanced Studies, Italy
Flammini Alessandro, Indiana University Bloomington, USA
Foerster Manuel, University of Hamburg, Germany
Frasca Mattia, University of Catania, Italy
Fu Xiaoming, University of Gottingen, Germany
Furno Angelo, Université de Lyon, France
Gaito Sabrina, University of Milan, Italy
Gallos Lazaros, Rutgers University, USA
Galán José Manuel, Universidad de Burgos, Spain
Gama Joao, University of Porto, Portugal

Gandica Yerali, Université Catholique de Louvain, Belgium
Gao Jianxi, Rensselaer Polytechnic Institute, USA
Garcia David, Complexity Science Hub Vienna, Austria
Gates Alexander, Indiana University Bloomington, USA
Gauthier Vincent, Institut Mines-Telecom, CNRS SAMOVAR, France
Gera Ralucca, Naval Postgraduate School, USA
Giordano Silvia, SUPSI, Switzerland
Giugno Rosalba, University of Verona, Italy
Gleeson James, University of Limerick, Ireland
Godoy Antonia, Rovira i Virgili University, Spain
Goh Kwang-Il, Korea University, South Korea
Gomez-Gardenes Jesus, Universidad de Zaragoza, Spain
Gonçalves Bruno, New York University, USA
Gonçalves-Sá Joana, Nova School of Business and Economics, Portugal
Grabowicz Przemyslaw, MPI for Software Systems, Germany
Grujic Jelena, Vrije Universiteit Brussel, Belgium
Guillaume Jean-Loup, Université de la Rochelle, France
Gunes Mehmet, University of Nevada, Reno, USA
Guney Emre, Pompeu Fabra University, Spain
Guo Weisi, University of Warwick, UK
Gómez Sergio, Universitat Rovira i Virgili, Spain
Ha Meesoon, Chosun University, South Korea
Hackl Jürgen, ETH Zurich, Switzerland
Hagberg Aric, Los Alamos National Laboratory, USA
Hancock Edwin, University of York, UK
Hankin Chris, Imperial College London, UK
Hayashi Yukio, Japan Advanced Inst. of Science and Technology, Japan
Heinimann Hans R., ETH Zurich, Switzerland
Helic Denis, Graz University of Technology, Austria
Hens Chittaranjan, CSIR-Indian Institute of Chemical Biology, India
Hernandez Laura, Université de Cergy-Pontoise, France
Heydari Babak, Northeastern University, USA
Hoevel Philipp, University College Cork, Ireland
Holme Petter, Tokyo Institute of Technology, Japan
Hong Seok-Hee, University of Sydney, Australia
Hoppe Ulrich, University Duisburg-Essen, Germany
Hu Yanqing, Sun Yat-sen University, China
Huang Junming, Princeton University, USA
Hébert-Dufresne Laurent, University of Vermont, USA
Iannelli Flavio, Humboldt University of Berlin, Germany
Ikeda Yuichi, Kyoto University, Japan
Interdonato Roberto, CIRAD - UMR TETIS, France


Iori Giulia Iorio Francesco Iosifidis George Iovanella Antonio Ivanov Plamen Iñiguez Gerardo Jalan Sarika Jalili Mahdi Jankowski Jaroslaw Javarone Marco Alberto Jeong Hawoong Jia Tao Jin Di Jo Hang-Hyun Jouve Bertrand Jędrzejewski Arkadiusz Kaltenbrunner Andreas Kanawati Rushed Karsai Márton Kaya Mehmet Kelen Domokos Kenett Yoed Kenett Dror Kertesz Janos Keuschnigg Marc Khansari Mohammad Kheddouci Hamamache Kim Hyoungshick Kitsak Maksim Kivela Mikko Klemm Konstantin Klimek Peter Kong Xiangjie Koponen Ismo Korhonen Onerva Kutner Ryszard Lambiotte Renaud Largeron Christine Larson Jennifer Lawniczak Anna T. Leclercq Eric Lee Deok-Sun


City, University of London, UK Wellcome Sanger Institute, UK Trinity College Dublin, Ireland University of Rome “Tor Vergata”, Italy Boston University, USA Central European University, Hungary IIT Indore, India RMIT University, Australia West Pomeranian University of Technology, Poland Coventry University, UK KAIST, South Korea Southwest University, Chongqing, China Tianjin University, China Asia Pacific Center for Theoretical Physics, South Korea CNRS, France Wrocław University of Science and Technology, Poland NTENT, Spain Université Paris 13, France ENS de Lyon, France Firat University, Turkey Hungarian Academy of Sciences, Hungary University of Pennsylvania, USA Johns Hopkins University, USA Central European University, Hungary Linköping University, Sweden University of Tehran, Iran University Claude Bernard Lyon 1, France Sungkyunkwan University, South Korea Northeastern University, USA Aalto University, Finland IFISC (CSIC-UIB), Spain Medical University of Vienna, Austria Dalian University of Technology, China University of Helsinki, Finland Université de Lille, France University of Warsaw, Poland University of Oxford, UK Université de Lyon, France New York University, USA University of Guelph, Canada University of Burgundy, France Inha University, South Korea


Lehmann Sune Leifeld Philip Lerner Juergen Lillo Fabrizio Livan Giacomo Longheu Alessandro Lu Linyuan Lu Meilian Lui John C. S. Maccari Leonardo Magnani Matteo Malliaros Fragkiskos Mangioni Giuseppe Marathe Madhav Mariani Manuel Sebastian Marik Radek Marino Andrea Marques Antonio Marques-Pita Manuel Martin Christoph Masoller Cristina Mastrandrea Rossana Masuda Naoki Matta John Mccarthy Arya Medo Matúš Menche Jörg Mendes Jose Fernando Menezes Ronaldo Meyer-Baese Anke Michalski Radosław Milli Letizia Mitra Bivas Mitrovic Marija Mizera Andrzej Mokryn Osnat Molontay Roland Mondragon Raul Mongiovì Misael


Technical University of Denmark, Denmark University of Essex, UK University of Konstanz, Germany University of Bologna, Italy University College London, UK University of Catania, Italy University of Fribourg, Switzerland Beijing University of Posts and Telecommunications, China The Chinese University of Hong Kong, Hong Kong University of Venice, Italy Uppsala University, Sweden University of Paris-Saclay, France University of Catania, Italy University of Virginia, USA University of Zurich, Switzerland Czech Technical University, Czechia University of Florence, Italy King Juan Carlos University, Spain Universidade Lusofona, Portugal Leuphana University of Lüneburg, Germany Universitat Politècnica de Catalunya, Spain IMT Institute of Advanced Studies, Italy University at Buffalo, State University of New York, USA SIUE, USA Johns Hopkins University, USA University of Electronic Science and Technology of China, China Austrian Academy of Sciences, Austria University of Aveiro, Portugal University of Exeter, UK FSU, USA Wrocław University of Science and Technology, Poland University of Pisa, Italy Indian Institute of Technology Kharagpur, India Institute of Physics Belgrade, Serbia University of Luxembourg, Luxembourg University of Haifa, Israel Budapest University of Technology and Economics, Hungary Queen Mary University of London, UK Consiglio Nazionale delle Ricerche, Italy


Moro Esteban Moschoyiannis Sotiris Moses Elisha Mozetič Igor Murata Tsuyoshi Muscoloni Alessandro Mäs Michael Neal Zachary Nour-Eddin El Faouzi Oliveira Marcos Omelchenko Iryna Omicini Andrea Palla Gergely Panzarasa Pietro Papadopoulos Fragkiskos Papadopoulos Symeon Papandrea Michela Park Han Woo Park Juyong Park Noseong Passarella Andrea Peel Leto Peixoto Tiago Perc Matjaz Petri Giovanni Pfeffer Juergen Piccardi Carlo Pizzuti Clara Poledna Sebastian Poletto Chiara Pralat Pawel Preciado Victor Przulj Natasa Qu Zehui Quadri Christian Quaggiotto Marco Radicchi Filippo Ramasco Jose J. Reed-Tsochas Felix Renoust Benjamin Ribeiro Pedro Riccaboni Massimo Ricci Laura Rizzo Alessandro


Universidad Carlos III de Madrid, Spain University of Surrey, UK Weizmann Institute of Science, Israel Jozef Stefan Institute, Slovenia Tokyo Institute of Technology, Japan TU Dresden, Germany ETH Zurich, the Netherlands Michigan State University, USA IFSTTAR, France Leibniz Institute for the Social Sciences, USA TU Berlin, Germany Università di Bologna, Italy HAS, Hungary Queen Mary University of London, UK Cyprus University of Technology, Cyprus Information Technologies Institute, Greece SUPSI, Switzerland Yeungnam University, South Korea KAIST, South Korea George Mason University, USA IIT-CNR, Italy Université Catholique de Louvain, Belgium University of Bath, Germany University of Maribor, Slovenia ISI Foundation, Italy Technical University of Munich, Germany Politecnico di Milano, Italy CNR-ICAR, Italy IIASA and Complexity Science Hub Vienna, Austria Sorbonne Université, France Ryerson University, Canada University of Pennsylvania, USA UCL, UK Southwest University, China University of Milan, Italy ISI Foundation, Italy Northwestern University, USA IFISC (CSIC-UIB), Spain University of Oxford, UK Osaka University, Japan University of Porto, Portugal IMT Institute for Advanced Studies, Italy University of Pisa, Italy Politecnico di Torino, Italy


Rocha Luis M. Rocha Luis E. C. Rodrigues Francisco Rosas Fernando Rossetti Giulio Rossi Luca Roth Camille Roukny Tarik Saberi Meead Safari Ali Saniee Iraj Santos Francisco C. Saramäki Jari Sayama Hiroki Scala Antonio Schaub Michael Schich Maximilian Schifanella Rossano Schoenebeck Grant Schweitzer Frank Segarra Santiago Sharma Aneesh Sharma Rajesh Sienkiewicz Julian Singh Anurag Skardal Per Sebastian Small Michael Smolyarenko Igor Smoreda Zbigniew Snijders Tom Socievole Annalisa Sole Albert Song Lipeng Stella Massimo Sullivan Blair D. Sun Xiaoqian Sundsøy Pål Szymanski Boleslaw Tadic Bosiljka Tagarelli Andrea Tajoli Lucia Takemoto Kazuhiro Takes Frank Tang Jiliang


Indiana University Bloomington, USA Ghent University, Belgium University of São Paulo, Brazil Imperial College London, UK ISTI-CNR, Italy IT University of Copenhagen, Denmark CNRS, Germany Massachusetts Institute of Technology, USA UNSW, Australia Friedrich-Alexander-Universität, Germany Bell Labs, Alcatel-Lucent, USA Universidade de Lisboa, Portugal Aalto University, Finland Binghamton University, USA Institute for Complex Systems/CNR, Italy Massachusetts Institute of Technology, USA The University of Texas at Dallas, USA University of Turin, Italy University of Michigan, USA ETH Zurich, Switzerland Rice University, USA Google, USA University of Tartu, Estonia Warsaw University of Technology, Poland NIT Delhi, India Trinity College Dublin, Ireland The University of Western Australia, Australia Brunel University, UK Orange Labs, France University of Groningen, the Netherlands CNR and ICAR, Italy Universitat Rovira i Virgili, Spain North University of China, China Institute for Complex Systems Simulation, UK University of Utah, USA Beihang University, China NBIM, Norway Rensselaer Polytechnic Institute, USA Jozef Stefan Institute, Slovenia University of Calabria, Italy Politecnico di Milano, Italy Kyushu Institute of Technology, Japan Leiden University and University of Amsterdam, the Netherlands Michigan State University, USA


Tarissan Fabien Tessone Claudio Juan Thai My Théberge François Tizzoni Michele Togni Olivier Traag Vincent Antonio Trajkovic Ljiljana Treur Jan Tupikina Liubov Török Janos Uzzo Stephen Valdez Lucas D. Valverde Sergi Van Der Hoorn Pim Van Der Leij Marco Van Mieghem Piet Van Veen Dirk Vazirgiannis Michalis Vedres Balazs Vermeer Wouter Vestergaard Christian Lyngby Vodenska Irena Wachs Johannes Wang Xiaofan Wang Lei Wang Huijuan Wen Guanghui Wilfong Gordon Wilinski Mateusz Wilson Richard Wit Ernst Wu Bin Wu Jinshan Xia Feng Xia Haoxiang Xu Xiaoke Yagan Osman Yan Gang Yan Xiaoran Zhang Qingpeng


CNRS - ENS Paris-Saclay (ISP), France Universität Zürich, Switzerland University of Florida, USA Tutte Institute for Mathematics and Computing, Canada ISI Foundation, Italy University of Burgundy, France Leiden University, the Netherlands Simon Fraser University, Canada Vrije Universiteit Amsterdam, the Netherlands Ecole Polytechnique, France Budapest University of Technology and Economics, Hungary New York Hall of Science, USA FAMAF-UNC, Argentina Institute of Evolutionary Biology (CSIC-UPF), Spain Northeastern University, USA University of Amsterdam, the Netherlands Delft University of Technology, the Netherlands ETH Zurich, Singapore-ETH Centre, Switzerland AUEB, Greece CEU, Hungary Northwestern University, USA CNRS and Institut Pasteur, France Boston University, USA Central European University, Hungary Shanghai Jiao Tong University, China Beihang University, China Delft University of Technology, the Netherlands Southeast University, China Bell Labs, USA Scuola Normale Superiore di Pisa, Italy University of York, UK University of Groningen, the Netherlands Beijing University of Posts and Telecommunications, China Beijing Normal University, China Dalian University of Technology, China Dalian University of Technology, China Dalian Minzu University, China CyLab-CMU, USA Tongji University, China Indiana University Bloomington, USA City University of Hong Kong, USA


Zhang Zi-Ke Zhao Junfei Zhong Fay Zignani Matteo Zimeo Eugenio Zino Lorenzo Zippo Antonio Zlatic Vinko Zubiaga Arkaitz


Hangzhou Normal University, China Columbia University, USA CSUEB, USA Università degli Studi di Milano, Italy University of Sannio, Italy Politecnico di Torino, Italy Consiglio Nazionale delle Ricerche, Italy Sapienza University of Rome, Italy Queen Mary University of London, UK

Contents

Link Analysis and Ranking

LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction
Emmanouil Krasanakis, Symeon Papadopoulos, and Yiannis Kompatsiaris

A Gradient Estimate for PageRank
Paul Horn and Lauren M. Nelsen

A Persistent Homology Perspective to the Link Prediction Problem
Sumit Bhatia, Bapi Chatterjee, Deepak Nathani, and Manohar Kaul

The Role of Network Size for the Robustness of Centrality Measures
Christoph Martin and Peter Niemeyer

Novel Edge and Density Metrics for Link Cohesion
Cetin Savkli, Catherine Schwartz, Amanda Galante, and Jonathan Cohen

Facility Location Problem on Network Based on Group Centrality Measure Considering Cooperation and Competition
Takayasu Fushimi, Seiya Okubo, and Kazumi Saito

Finding Dominant Nodes Using Graphlets
David Aparício, Pedro Ribeiro, Fernando Silva, and Jorge Silva

Sampling on Networks: Estimating Eigenvector Centrality on Incomplete Networks
Nicolò Ruggeri and Caterina De Bacco

Community Structure

Repel Communities and Multipartite Networks
Jerry Scripps, Christian Trefftz, Greg Wolffe, Roger Ferguson, and Xiang Cao


The Densest k Subgraph Problem in b-Outerplanar Graphs
Sean Gonzales and Theresa Migler

Spread Sampling and Its Applications on Graphs
Yu Wang, Bortik Bandyopadhyay, Vedang Patel, Aniket Chakrabarti, David Sivakoff, and Srinivasan Parthasarathy

EVa: Attribute-Aware Network Segmentation
Salvatore Citraro and Giulio Rossetti

Exorcising the Demon: Angel, Efficient Node-Centric Community Discovery
Giulio Rossetti

Metrics Matter in Community Detection
Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, and David W. Matula

An Exact No Free Lunch Theorem for Community Detection
Arya D. McCarthy, Tongfei Chen, and Seth Ebner

Impact of Network Topology on Efficiency of Proximity Measures for Community Detection
Rinat Aynulin

Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks
Mário Cordeiro, Rui Portocarrero Sarmento, Pavel Brazdil, Masahiro Kimura, and João Gama

Change Point Detection in a Dynamic Stochastic Blockmodel
Peter Wills and François G. Meyer

A General Method for Detecting Community Structures in Complex Networks
Vesa Kuikka

A New Metric for Package Cohesion Measurement Based on Complex Network
Yanran Mi, Yanxi Zhou, and Liangyu Chen

A Generalized Framework for Detecting Social Network Communities by the Scanning Method
Tai-Chi Wang and Frederick Kin Hing Phoa

Comparing the Community Structure Identified by Overlapping Methods
Vinícius da F. Vieira, Carolina R. Xavier, and Alexandre G. Evsukoff

Semantic Frame Induction as a Community Detection Problem
Eugénio Ribeiro, Andreia Sofia Teixeira, Ricardo Ribeiro, and David Martins de Matos


A New Measure of Modularity in Hypergraphs: Theoretical Insights and Implications for Effective Clustering
Tarun Kumar, Sankaran Vaidyanathan, Harini Ananthapadmanabhan, Srinivasan Parthasarathy, and Balaraman Ravindran

Diffusion and Epidemics

Crying “Wolf” in a Network Structure: The Influence of Node-Generated Signals
Tomer Tuchner and Gail Gilboa-Freedman

Vaccination Strategies on a Robust Contact Network
Christopher Siu and Theresa Migler

Total Positive Influence Domination on Weighted Networks
Danica Vukadinović Greetham, Nathaniel Charlton, and Anush Poghosyan

Modelling Spatial Information Diffusion
Zhuo Chen and Xinyue Ye

Rejection-Based Simulation of Non-Markovian Agents on Complex Networks
Gerrit Großmann, Luca Bortolussi, and Verena Wolf

Community-Aware Content Diffusion: Embeddednes and Permeability
Letizia Milli and Giulio Rossetti

Can WhatsApp Counter Misinformation by Limiting Message Forwarding?
Philipe de Freitas Melo, Carolina Coimbra Vieira, Kiran Garimella, Pedro O. S. Vaz de Melo, and Fabrício Benevenuto

Modeling Airport Congestion Contagion by SIS Epidemic Spreading on Airline Networks
Klemens Köstler, Rommy Gobardhan, Alberto Ceria, and Huijuan Wang

A Population Dynamics Approach to Viral Marketing
Pedro C. Souto, Luísa V. Silva, Diego Costa Pinto, and Francisco C. Santos

Integrating Environmental Temperature Conditions into the SIR Model for Vector-Borne Diseases
Md Arquam, Anurag Singh, and Hocine Cherifi

Opinion Diffusion in Competitive Environments: Relating Coverage and Speed of Diffusion
Valeria Fionda and Gianluigi Greco


Beyond Fact-Checking: Network Analysis Tools for Monitoring Disinformation in Social Media
Stefano Guarino, Noemi Trino, Alessandro Chessa, and Gianni Riotta

Suppressing Information Diffusion via Link Blocking in Temporal Networks
Xiu-Xiu Zhan, Alan Hanjalic, and Huijuan Wang

Using Connected Accounts to Enhance Information Spread in Social Networks
Alon Sela, Orit Cohen-Milo, Eugene Kagan, Moti Zwilling, and Irad Ben-Gal

Designing Robust Interventions to Control Epidemic Outbreaks
Prathyush Sambaturu and Anil Vullikanti

Dynamics on/of Networks

The Impact of Network Degree Correlation on Parrondo’s Paradox
Ye Ye, Xiao-Rong Hang, Lin Liu, Lu Wang, and Neng-gang Xie

Analysis of Diversity and Dynamics in Co-evolution of Cooperation in Social Networking Services
Yutaro Miura, Fujio Toriumi, and Toshiharu Sugawara

Shannon Entropy in Time–Varying Clique Networks
Marcelo do Vale Cunha, Carlos César Ribeiro Santos, Marcelo Albano Moret, and Hernane Borges de Barros Pereira

Two-Mode Threshold Graph Dynamical Systems for Modeling Evacuation Decision-Making During Disaster Events
Nafisa Halim, Chris J. Kuhlman, Achla Marathe, Pallab Mozumder, and Anil Vullikanti

Spectral Evolution of Twitter Mention Networks
Miguel Romero, Camilo Rocha, and Jorge Finke

Network Models

Minimum Entropy Stochastic Block Models Neglect Edge Distribution Heterogeneity
Louis Duvivier, Céline Robardet, and Rémy Cazabet

Three-Parameter Kinetics of Self-organized Criticality on Twitter
Victor Dmitriev, Andrey Dmitriev, Svetlana Maltseva, and Stepan Balybin

Multi-parameters Model Selection for Network Inference
Veronica Tozzo and Annalisa Barla


Scott: A Method for Representing Graphs as Rooted Trees for Graph Canonization
Nicolas Bloyet, Pierre-François Marteau, and Emmanuel Frénod

Cliques in High-Dimensional Random Geometric Graphs
Konstantin Avrachenkov and Andrei Bobu

Universal Boolean Logic in Cascading Networks
Galen Wilkerson and Sotiris Moschoyiannis

Fitness-Weighted Preferential Attachment with Varying Number of New Connections
Juan Romero, Jorge Finke, and Andrés Salazar

Rigid Graph Alignment
Vikram Ravindra, Huda Nassar, David F. Gleich, and Ananth Grama

Detecting Hotspots on Networks
Juan Campos and Jorge Finke

Political Networks

A Transparent Referendum Protocol with Immutable Proceedings and Verifiable Outcome for Trustless Networks
Maximilian Schiedermeier, Omar Hasan, Lionel Brunie, Tobias Mayer, and Harald Kosch

Utilizing Complex Networks for Event Detection in Heterogeneous High-Volume News Streams
Iraklis Moutidis and Hywel T. P. Williams

Drawing Networks of Political Leaders: Global Affairs in The Economist’s KAL’s Cartoons
Nikita Golubev and Alina V. Vladimirova

Shielding and Shadowing: A Tale of Two Strategies for Opinion Control in the Voting Dynamics
Guillermo Romero Moreno, Long Tran-Thanh, and Markus Brede

Resilience and Control

Stable and Uniform Resource Allocation Strategies for Network Processes Using Vertex Energy Gradients
Mikołaj Morzy and Tomi Wójtowicz

Cascading Failures in Weighted Networks with the Harmonic Closeness
Yucheng Hao, Limin Jia, and Yanhui Wang


Learning to Control Random Boolean Networks: A Deep Reinforcement Learning Approach
Georgios Papagiannis and Sotiris Moschoyiannis

Comparative Network Robustness Evaluation of Link Attacks
Clara Pizzuti, Annalisa Socievole, and Piet Van Mieghem

MAC: Multilevel Autonomous Clustering for Topologically Distributed Anomaly Detection
M. A. Partha and C. V. Ponce

Network Strengthening Against Malicious Attacks
Qingnan Rong, Jun Zhang, Xiaoqian Sun, and Sebastian Wandelt

Identifying Vulnerable Nodes to Cascading Failures: Optimization-Based Approach
Richard J. La

Ensemble Approach for Generalized Network Dismantling
Xiao-Long Ren and Nino Antulov-Fantulin

Machine Learning and Networks

A Simple Approach to Attributed Graph Embedding via Enhanced Autoencoder
Nasrullah Sheikh, Zekarias T. Kefato, and Alberto Montresor

Matching Node Embeddings Using Valid Assignment Kernels
Changmin Wu, Giannis Nikolentzos, and Michalis Vazirgiannis

Short Text Tagging Using Nested Stochastic Block Model: A Yelp Case Study
John Bowllan, Kailey Cozart, Seyed Mohammad Mahdi Seyednezhad, Anthony Smith, and Ronaldo Menezes

Domain-Invariant Latent Representation Discovers Roles
Shumpei Kikuta, Fujio Toriumi, Mao Nishiguchi, Tomoki Fukuma, Takanori Nishida, and Shohei Usui

Inductive Representation Learning on Feature Rich Complex Networks for Churn Prediction in Telco
María Óskarsdóttir, Sander Cornette, Floris Deseure, and Bart Baesens

On Inferring Monthly Expenses of Social Media Users: Towards Data and Approaches
Danila Vaganov, Alexander Kalinin, and Klavdiya Bochenina

Evaluating the Community Structures from Network Images Using Neural Networks
Md. Khaledur Rahman and Ariful Azad


Gumbel-Softmax Optimization: A Simple General Framework for Combinatorial Optimization Problems on Graphs
Jing Liu, Fei Gao, and Jiang Zhang

TemporalNode2vec: Temporal Node Embedding in Temporal Networks
Mounir Haddad, Cécile Bothorel, Philippe Lenca, and Dominique Bedart

Deep Reinforcement Learning for Task-Driven Discovery of Incomplete Networks
Peter Morales, Rajmonda Sulo Caceres, and Tina Eliassi-Rad

Evaluating Network Embedding Models for Machine Learning Tasks
Ikenna Oluigbo, Mohammed Haddad, and Hamida Seba

A BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media
Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi

Network Geometry

A Simple Differential Geometry for Networks and Its Generalizations
Emil Saucan, Areejit Samal, and Jürgen Jost

Characterizing Distances of Networks on the Tensor Manifold
Bipul Islam, Ji Liu, and Romeil Sandhu

Eigenvalues and Spectral Dimension of Random Geometric Graphs in Thermodynamic Regime
Konstantin Avrachenkov, Laura Cottatellucci, and Mounia Hamidouche

Author Index

Link Analysis and Ranking

LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction

Emmanouil Krasanakis, Symeon Papadopoulos, and Yiannis Kompatsiaris

CERTH-ITI, Thessaloniki, Greece
{maniospas,papadop,ikom}@iti.gr

Abstract. An emerging problem in network analysis is ranking network nodes based on their relevance to metadata groups that share attributes of interest, for example in the context of recommender systems or node discovery services. For this task, it is important to evaluate ranking algorithms and parameters and select the ones most suited to each network. Unfortunately, large real-world networks often comprise sparsely labelled nodes that hinder supervised evaluation, whereas unsupervised measures of community quality, such as density and conductance, favor structural characteristics that may not be indicative of metadata group quality. In this work, we introduce LinkAUC, a new unsupervised approach that evaluates network node ranks of multiple metadata groups by measuring how well they predict network edges. We explain that this accounts for relation knowledge encapsulated in known members of metadata groups and show that it enriches density-based evaluation. Experiments on one synthetic and two real-world networks indicate that LinkAUC agrees with AUC and NDCG for comparing ranking algorithms more than other unsupervised measures.

1 Introduction

It is well known that network nodes can be organized into communities [1–4] identified through either ground truth structural characteristics or shared node attributes [5–7]. A common task in network analysis is to rank all network nodes based on their relevance to such communities, especially of the second type [8,9], which are commonly referred to as metadata groups. Ranking nodes is particularly important in large social networks, where metadata group boundaries can be vague [10,11]. Node ranks can also be used by recommender systems that combine them with other characteristics, in which case it is important that they be of high quality across the whole network. Some of the most well-known algorithms that discover communities with only a few known members also rely on ranking mechanisms and work by thresholding their outcome [12,13].

Node ranks for metadata groups are a form of recommendation and their quality is usually (e.g. in [14]) evaluated with well-known recommender system measures [15–17], such as AUC and NDCG. Since calculating these measures requires knowledge of node labels, the efficacy of ranking algorithms needs to be demonstrated on labeled networks, such as those of the SNAP repository.¹ However, different algorithms and parameters are more suited to different networks, for example based on how well their assumptions match structural or metadata characteristics. At the same time, large real-world networks are often sparsely labeled, prohibiting supervised evaluation. In such cases, there is a need to evaluate ranking algorithms on the network at hand using unsupervised procedures.

A first take on unsupervised evaluation would be to generalize traditional structural community measures, such as density [18], modularity [19] and conductance [20], to support ranks. However, these measures are designed with structural ground truth communities in mind and often fail to assess hierarchical dependencies or other meso-scale (instead of local) features [6,11,21] that may characterize metadata groups. To circumvent this problem, we propose utilizing the network’s structure and the existence of multiple metadata groups; under the assumption that network edges are influenced by node metadata similarity [6], a phenomenon known as homophily in social networks [22], we assess the quality of ranks for multiple metadata groups based on their ability to predict network edges. We show that this practice enriches density-based evaluation and that it agrees with supervised measures better than other unsupervised ones.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 3–14, 2020. https://doi.org/10.1007/978-3-030-36687-2_1

2 LinkAUC

The main idea behind our approach is that, if there is little information to help evaluate node ranks, we can evaluate other related structural characteristics instead. To this end, we propose using node rank distributions across metadata groups to derive link ranks between nodes. Link ranks can in turn be evaluated through their ability to predict the network’s edges. An overview of the proposed scheme is shown in Fig. 1. In this section, we first justify why we expect node rank quality to follow link rank quality (Subsect. 2.1) and formally describe the evaluation process of the latter using AUC (Subsect. 2.2). We then show that link rank quality enriches density-based evaluation (Subsect. 2.3).

Fig. 1. Proposed scheme for evaluating ranking algorithms. Lighter colored nodes have higher ranks and lighter colored edges have lower link ranks.

¹ https://snap.stanford.edu/data/

2.1 Link Ranks

Let $r_i$ be vectors whose elements $r_{ij}$ estimate the relevance of network nodes $j$ to metadata groups $i = 1, \dots, n$. Motivated by latent factor models for link prediction [23] and collaborative filtering [24], we consider $R = [r_1 \dots r_n]$ a matrix factorization of the network. Its rows $R_j = [r_{1j} \dots r_{nj}]$ represent the distribution of ranks of network nodes $j$ across metadata groups. Following the principles of previous link prediction works [25,26], if network construction is influenced predominantly by structure-based and metadata-based characteristics, this factorization can help predict network edges by linking nodes with similar rank distributions. We calculate the similarities of rank distributions between nodes $j, k$ using the dot product² as $\hat{M}_{jk} = R_j \cdot R_k$. These form a matrix of link ranks:

$$\hat{M} = R R^T \qquad (1)$$
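As a minimal sketch of Eq. (1), the snippet below builds link ranks from a small rank matrix with NumPy. The function name `link_ranks` and the toy rank values are hypothetical and chosen only for illustration; rows of `R` are nodes and columns are metadata groups, matching the row convention described above.

```python
import numpy as np

def link_ranks(R):
    """Return the link rank matrix R R^T for a node-by-group rank matrix R."""
    R = np.asarray(R, dtype=float)
    return R @ R.T

# Hypothetical ranks of 4 nodes (rows) for 2 metadata groups (columns).
R = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.7],
])
M_hat = link_ranks(R)  # entry (j, k) is the dot product of rows j and k
```

In this toy example, nodes 0 and 1 have similar rank distributions and therefore receive a higher link rank with each other than nodes 0 and 2 do.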

Accurate link prediction using link ranks implies good metadata group representations. To empirically understand this claim, let us consider ranking algorithms that can be expressed as network filters $f(M) = \sum_{n=0}^{\infty} a_n M^n$ [27] of the network’s adjacency matrix $M$, where $a_n$ are the weights placed on random walks of length $n$. For example, personalized PageRank and Heat Kernels arise from exponentially degrading weights and the Taylor expansion coefficients of an exponential function respectively. If applied on query vectors $q_i$, where $q_{ij}$ are proportional to probabilities that nodes $j$ belong to metadata groups $i$, network filters produce ranks $r_i = f(M) q_i$ of how much nodes pertain to the metadata groups. Organizing multiple queries into a matrix $Q = [q_1 \dots q_n]$:

$$R = f(M) Q \;\Rightarrow\; \hat{M} = f(M)\, Q Q^T f^T(M) \qquad (2)$$
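The filter formulation can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the series is truncated to a finite number of terms, the adjacency matrix is symmetrically normalized before filtering, and the personalized PageRank weights are taken as $a_n = (1-\alpha)\alpha^n$. The helper names `filter_ranks` and `pagerank_weights` and the toy graph are hypothetical.

```python
import numpy as np

def filter_ranks(M, Q, weights):
    """Apply f(M) = sum_n weights[n] * M^n to each column of Q."""
    M = np.asarray(M, dtype=float)
    power_term = np.asarray(Q, dtype=float)  # holds M^n Q, starting at n = 0
    R = np.zeros_like(power_term)
    for a_n in weights:
        R += a_n * power_term
        power_term = M @ power_term
    return R

def pagerank_weights(alpha=0.85, terms=50):
    """Geometric weights a_n = (1 - alpha) * alpha^n, truncated to `terms` terms."""
    return [(1 - alpha) * alpha ** n for n in range(terms)]

# Toy undirected graph: two triangles joined by the edge (2, 3).
A = np.zeros((6, 6))
for j, k in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[j, k] = A[k, j] = 1.0
degrees = A.sum(axis=0)
M = A / np.sqrt(np.outer(degrees, degrees))  # symmetric normalization (assumption)

# One query per metadata group, each seeded with a single known member.
Q = np.zeros((6, 2))
Q[0, 0] = 1.0
Q[5, 1] = 1.0

R = filter_ranks(M, Q, pagerank_weights())
M_hat = R @ R.T  # link ranks, as in Eq. (2)
```

Accumulating `M @ power_term` avoids forming matrix powers explicitly, which keeps each step a matrix-vector style product; a Heat Kernel variant would only swap the weight list.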

Equation (2) is a quadratic form of $f(M)$ around the kernel $QQ^T$ and, as such, propagates link ranks between queries to the rest of the link candidates. Therefore, if queries adequately predict the links between involved query nodes and link ranks can predict the network’s edges, then the algorithm with filter $f(M)$ is a good rank propagation mechanism. At best, queries form an orthonormal basis of ranks $QQ^T = I$ and this process can express any symmetric link prediction filter [25,26,28] by decomposing it to $f(M) f^T(M)$.

2.2 Link Rank Evaluation Using AUC

When evaluating link ranks, it is often desirable to exclude certain links, such as withheld test edges or those absent due to systemic reasons (e.g. users may not be allowed to befriend themselves in social networks). To model this, we devise the notion of a network group that uses a binary matrix $\mathcal{M}$ to remove non-comparable links of the network's adjacency matrix $M$ by projecting the latter to $M \odot \mathcal{M}$, where $\odot$ is the Hadamard product performing elementwise multiplication. For example, network groups of zero-diagonal networks correspond to $\mathcal{M} = \mathbf{1} - I$, where $\mathbf{1}$ are matrices of ones and $I$ identity matrices.

² Cosine similarity would arise by a fixed-flow assumption of the ranking algorithm that performs row-wise normalization of $R$ before the dot product.

To help evaluate link ranks $\hat{M}$ against a network with adjacency matrix $M$ within a network group $\mathcal{M}$, we introduce a transformation $vec_{\mathcal{M}}(M)$ that creates a vector containing all elements of $M$ for which $\mathcal{M} \neq 0$ in a predetermined order. Then, link ranks can be evaluated by comparing $vec_{\mathcal{M}}(\hat{M})$ with $vec_{\mathcal{M}}(M)$.

A robust measure that compares operating characteristic trade-offs at different decision thresholds is the Area Under Curve (AUC) [29], which has been previously used to evaluate link ranks [26]. When network edges are not weighted, if $TPR(\theta)$ and $FPR(\theta)$ are the true positive and false positive rates of a decision threshold $\theta$ on $vec_{\mathcal{M}}(\hat{M})$ predicting $vec_{\mathcal{M}}(M)$, the AUC of link ranks becomes:

$$LinkAUC = \int_{-\infty}^{\infty} TPR(\theta)\, FPR'(\theta)\, d\theta \tag{3}$$
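The evaluation of (3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes NumPy, a hypothetical toy network, and it computes the AUC through the equivalent rank-sum (Mann-Whitney) statistic instead of integrating the ROC curve. The function names are chosen here for illustration.

```python
import numpy as np

def vec_group(X, group):
    """vec_M(X): entries of X where the binary network group is non-zero (row-major order)."""
    return X[group != 0]

def auc(scores, labels):
    """AUC through the rank-sum statistic; tied scores share their average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores))
    start = 0
    while start < len(scores):
        end = start
        while end + 1 < len(scores) and sorted_scores[end + 1] == sorted_scores[start]:
            end += 1
        ranks[order[start:end + 1]] = 0.5 * (start + end) + 1.0   # average 1-based rank
        start = end + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def link_auc(R, M, group):
    """LinkAUC of node ranks R (nodes x groups) against an unweighted adjacency matrix M."""
    M_hat = R @ R.T                       # link ranks of Eq. (1)
    return auc(vec_group(M_hat, group), vec_group(M, group))

# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
M = np.zeros((6, 6))
for j, k in edges:
    M[j, k] = M[k, j] = 1.0

group = 1.0 - np.eye(6)                   # zero-diagonal network group
R = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
score = link_auc(R, M, group)             # 13/14: only the bridge edge 2-3 is ranked low
uninformative = link_auc(np.ones((6, 1)), M, group)   # all link ranks tie, AUC of 0.5
```

Ranks that separate the two triangles miss only the bridging edge, whereas constant ranks carry no information about linkage.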

This measure evaluates whether actual linkage is assigned higher ranks across the network [30] without being affected by edge sparsity. These properties make LinkAUC preferable to precision-based evaluation of link ranks, which assesses the correctness of only a fixed number of top predictions [26].

2.3 Relation to Rank Density

The density of a network is defined as the portion of edges compared to the maximum number of possible ones [31,32]. Using the notion of volume $vol(M)$ to annotate the number of edges in a network with adjacency matrix $M$, the density of its projection inside the network group $\mathcal{M}$ becomes $D_{\mathcal{M}}(M) = \frac{vol(M \odot \mathcal{M})}{vol(\mathcal{M})}$. We similarly define rank density by substituting the volume with the expected volume $vol(M, r)$ of the fuzzy set of subgraphs arising from ranks being proportional to the probabilities that nodes are involved in links:

$$vol(M, r) = E_{v \sim r}\left[v^T M v\right] = \frac{r^T M r}{\|r\|_1^2} \Rightarrow D_{\mathcal{M}}(M, r) = \frac{vol(M \odot \mathcal{M}, r)}{vol(\mathcal{M}, r)} = \frac{r^T (M \odot \mathcal{M}) r}{r^T \mathcal{M} r} \tag{4}$$

where $\|\cdot\|_1$ is the L1 norm, calculated as the sum of vector elements, and $v$ are binary vectors of vertices sampled with probabilities $r$.

We first examine the qualitative relation between link ranks and rank density for a single metadata group $R = r_1$. Annotating as $m \geq \theta$ the vectors arising from binary thresholding on the elements of $m = \frac{vec_{\mathcal{M}}(\hat{M})}{\|vec_{\mathcal{M}}(\hat{M})\|_1}$ and selecting thresholds $\theta[k]$ that determine the top-$k$ link ranks up to all $K$ link candidates ($\theta[K] = 0$):

$$m = \sum_{k=1}^{K-1} (m \geq \theta[k])(\theta[k] - \theta[k+1]) \Rightarrow D_{\mathcal{M}}(M, r_1) = \frac{vec_{\mathcal{M}}^T(M)\, vec_{\mathcal{M}}(\hat{M})}{\|vec_{\mathcal{M}}(\hat{M})\|_1} = \int_{-\infty}^{\infty} TP(\theta)\, P'(\theta)\, d\theta$$

where $TP$ and $P$ denote the number of true positive and the number of positive thresholded link ranks respectively. At worst, every new positive link after a certain point would be a false positive. Using the big-O notation this can be written as $\frac{\partial FPR(\theta)}{\partial P(\theta)} \in O(1)$ and hence:

$$LinkAUC \in O\left(D_{\mathcal{M}}(M, r_1)\right) \tag{5}$$

We next consider the case where discovered ranks form non-overlapping metadata groups, i.e. each node has non-zero rank only for one group. This may happen when query propagation stops before it reaches other metadata groups. Annotating $\hat{M}_i = r_i r_i^T$, for non-overlapping ranks $r_i \cdot r_j = 0$ for $i \neq j$, we rewrite (1) as $\hat{M} = \sum_i \hat{M}_i \Rightarrow vec_{\mathcal{M}}(\hat{M}) = \sum_i vec_{\mathcal{M}}(\hat{M}_i)$ and, similarly to before:

$$LinkAUC \in O\left(\sum_i D_{\mathcal{M}}(M, r_i)\, vol(\mathcal{M}, r_i)\, \|r_i\|_1^2\right)$$

This averages group densities and weights them by $vol(\mathcal{M}, r_i)\|r_i\|_1^2$. Hence, when metadata groups are non-overlapping, high LinkAUC indicates high rank density. Finally, for overlapping metadata groups, LinkAUC involves inter-group links in its evaluation. Since averaging density-based evaluations across groups ignores these links, LinkAUC can be considered an enrichment of rank density in the sense that it bounds it when metadata groups do not overlap but accounts for more information when they do.
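The rank density of (4) and its link-rank form for a single metadata group can be checked on a small example. The sketch below assumes NumPy and a hypothetical toy network; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def expected_volume(M, r):
    """vol(M, r) = r^T M r / ||r||_1^2 for non-negative ranks r."""
    r = np.asarray(r, dtype=float)
    return float(r @ M @ r) / float(r.sum()) ** 2

def rank_density(M, r, group):
    """D_M(M, r) = r^T (M * group) r / r^T group r, with * the elementwise product."""
    r = np.asarray(r, dtype=float)
    return float(r @ (M * group) @ r) / float(r @ group @ r)

# Hypothetical toy network: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
M = np.zeros((6, 6))
for j, k in edges:
    M[j, k] = M[k, j] = 1.0
group = 1.0 - np.eye(6)                       # zero-diagonal network group

# Binary ranks recover the usual density of the selected subgraph.
triangle_density = rank_density(M, [1, 1, 1, 0, 0, 0], group)   # a triangle is fully linked
network_density = rank_density(M, np.ones(6), group)            # 14 of 30 ordered node pairs

# For one group, rank density equals the normalized link-rank form used in Sect. 2.3.
r = np.random.default_rng(0).random(6)
M_hat = np.outer(r, r)
mask = group != 0
vec_form = float(M[mask] @ M_hat[mask]) / float(M_hat[mask].sum())
```

With binary ranks the measure reduces to ordinary density, which is why it can be read as a fuzzy extension of it.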

3 Experiments

To assess the merit of evaluating node ranks using LinkAUC, we devise a series of experiments where we test a number of different algorithms on several ranking tasks of varying degrees of difficulty across labeled networks. We use the ranks produced by these experiments to compare various unsupervised measures with supervised ones; the latter form the ground truth that unsupervised measures need to reproduce, but would not be computable if node labels were sparse or missing.

For every network, we start with known binary vectors $c_i$, whose elements $c_{ij}$ show whether nodes $j$ are members of metadata groups $i$. We use a uniform sampling process $U$ to withhold a small set of evaluation nodes $eval_i \sim U(c_i, 1\%)$ and edges $(eval_i \times eval_i) \odot M$ that, by merit of their small number, do not significantly affect the ranking algorithm outcomes. We also procure varied length query vectors $q_i \sim U(c_i - eval_i, f)$ that serve as inputs to the ranking algorithms, where their relative size compared to the group is selected amongst $f \in \{0.1\%, 1\%, 10\%\}$. Depending on whether query nodes are adequately many or too few, we expect algorithms to encounter low and high difficulty respectively.

3.1 Networks

Experiments are conducted on three networks: a synthetic one constructed through a stochastic block model [33] and two real-world ones often used to evaluate metadata group detection, the Amazon co-purchasing [34] and the DBLP author co-authorship networks. These networks were selected on merit of being fully labeled, hence enabling supervised evaluation to serve as ground truth. They also comprise multiple metadata groups and unweighted edges needed for LinkAUC.

The stochastic block model is a popular method to construct networks of known communities [35,36], where the probability of two nodes being linked is determined by which communities they belong to. Our synthetic network uses the randomly generated 5 × 5 block probability matrix of Fig. 2 with blocks of 2K-5K nodes. The Amazon network comprises links between frequently co-purchased products³ that form communities based on their type (e.g. Book, CD, DVD, Video). We use the 2011 version of the DBLP dataset⁴, which comprises 1.6M papers from the DBLP database, from which we extracted an author network based on co-authorship relations. In this network, authors form overlapping metadata groups based on academic venues (journals, conferences) they have published in. To experiment with smaller portions of query nodes and limit the running time of experiments, we select only the metadata groups with ≥5K nodes for the real-world networks. A summary of these is presented in Table 1.

Table 1. Networks and the number of metadata groups used in experiments.

Network       Nodes   Edges   Groups
Synthetic     15K     0.4M    5
Amazon [34]   0.5M    1.8M    4
DBLP [37]     1.0M    11.3M   52

Fig. 2. Stochastic block model used to create the synthetic network.
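How a stochastic block model generates a labeled network can be sketched as follows. The block sizes and the probability matrix below are hypothetical placeholders, not the values of Fig. 2, and the dense sampling shown here is only practical for small networks.

```python
import numpy as np

def sample_sbm(block_sizes, block_probs, seed=0):
    """Sample an undirected, unweighted network without self-loops from a stochastic block model."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)   # community of every node
    probs = np.asarray(block_probs, dtype=float)[labels][:, labels]  # edge probability per node pair
    upper = np.triu(rng.random(probs.shape) < probs, k=1)          # decide each unordered pair once
    M = (upper | upper.T).astype(int)
    return M, labels

# Hypothetical example with three communities that link mostly internally.
sizes = [30, 40, 50]
block_probs = [[0.30, 0.02, 0.01],
               [0.02, 0.25, 0.03],
               [0.01, 0.03, 0.20]]
M, labels = sample_sbm(sizes, block_probs)
```

The returned `labels` play the role of the known metadata groups, so supervised measures can be computed on the sampled network.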

³ https://snap.stanford.edu/data/amazon-meta.html.
⁴ DBLP-Citation-network V4 from https://aminer.org/citation.

3.2 Ranking Algorithms

We use both heuristic and established algorithms to rank the relation of network nodes to metadata groups. Our goal is not to select the best algorithm but to obtain ranks with many different methods and then use these ranks to compute the evaluation measures to be compared. The considered algorithms are:

PPR [12,38]. Personalized PageRank with symmetric matrix Laplacian normalization arising from a random walk with restart strategy. It iterates $r_i \leftarrow a D^{-1/2} M D^{-1/2} r_i + (1 - a) q_i$, where $D$ is the diagonal matrix of node degrees. Throughout our experiments, we select the well-performing parameter $a = 0.99$.

PPR+Inflation [13]. Adds all neighbors of the original query nodes to the query to further spread PPR.

PPR+Oversampling [39]. Adds nodes with high PPR ranks to the query vector before rerunning the algorithm.

HK [40]. Heat Kernel ranks obtained through an exponential degradation filter $r_i = e^{-t} \sum_{k=0}^{N} \frac{t^k}{k!} (D^{-1/2} M D^{-1/2})^k q_i$. This places higher weights on shorter paths instead of uniformly spreading them across longer random walks. Hence, it discovers denser local structures at the cost of not spreading ranks too much. We selected $t = 5$ and stopped iterations when $(D^{-1/2} M D^{-1/2})^k q_i$ converged.

HPR. A heuristic adaptation of PPR that borrows assumptions of heat kernels to place emphasis on short random walks $r_i \leftarrow \frac{t}{k} a (D^{-1/2} M D^{-1/2} - I) r_i + (1 - a) q_i$, where $k$ is the current iteration, $t = 5$ and $a = 0.99$.

3.3 Measures

The following measures are calculated for the node ranks of metadata groups produced in each experiment. Recall that, when network labels are sparse, supervised measures that serve as the ground truth of evaluation may be inapplicable. Unsupervised measures other than LinkAUC are computed on the training edges, as the sparsity of withheld group members $eval_i$ does not allow meaningful structural scores. LinkAUC, on the other hand, is applicable regardless of the evaluation edge set's sparsity. To avoid data overlap between rank calculation and evaluation, which could overestimate the latter, supervised measures and LinkAUC use only the test group members and edges.

Unsupervised Measures

Conductance - Compares the probability of a random walk to move outside a community vs. to return to it [41]. Using the same probabilistic formulation as for rank density we define rank conductance:

$$\phi_{\mathcal{M}}(M, r) = \frac{r^T (M \odot \mathcal{M})(C - r)}{r^T (M \odot \mathcal{M}) r} \tag{6}$$

where $C = 1$ is a max-probability parameter. (Comparisons are preserved for any value.) Lower conductance indicates better community separation.

Gap Conductance - Conductance of binarily cutting the network on the maximal percentage gap between $\frac{r_{ij}}{degree(j)}$ for each community $i$ [42,43]. We use this as an alternative to sweeping strategies [12,13], which took too long to run.

Density - The rank-based extension of density in (4).

LinkAUC - AUC of link ranks calculated through (1), where columns are divided by their maximal value and then each node's row representation is L2-normalized, making link ranks represent cosine similarity between edge nodes. This is our proposed unsupervised measure.

Supervised Measures (Ground Truth)

NodeAUC - AUC of node ranks, averaged across metadata groups $i$.

NDCG - Normalized discounted cumulative gain across all network nodes. For this non-parametric statistic, ranks derive ordinalities $ord[j]$ for nodes $j$ (i.e. the highest ranked node is assigned $ord[j] = 1$). For each metadata group $i$, assigning to nodes $j$ relevance scores of 1 if they belong to it and 0 otherwise:

$$NDCG_i = \frac{\sum_{j: j \in eval_i} 1/\log_2(ord[j] + 1)}{\sum_{c=1}^{|eval_i|} 1/\log_2(c + 1)} \tag{7}$$

NDCG is usually used to evaluate whether a fixed top-k nodes are relevant to the metadata group. However, we are interested in evaluating the relevant nodes of the whole network and hence we make this measure span all nodes. This makes it similar to AUC in that values closer to 1 indicate that metadata group members are ranked as more relevant to the group compared to non-group members. Its main difference is that more emphasis is placed on the top discoveries.

3.4 Results

In Fig. 3 we present the outcome of evaluating different algorithms on the various experiment setups, i.e. tuples of networks, seed node fractions and ranking algorithms. Each point corresponds to a different unsupervised (vertical axes) - supervised (horizontal axes) measure pair calculated for a different experiment setup (i.e. combination of seed node sizes and ranking algorithms) and is obtained by averaging the measures across 5 repetitions of the setup. Unsupervised measures are considered to yield descriptive evaluations when they correlate to supervised ones for the same network (each network is involved in 15 experiment setups arising from the combination of $|f| = 3$ different seed node sizes with one of the 5 different ranking algorithms).

We can see that LinkAUC is the unsupervised measure whose behavior most closely resembles that of the supervised ones. In particular, Table 2 shows that LinkAUC has a strong positive correlation with NodeAUC and a positive correlation with NDCG for all three networks, outperforming the other metrics in all but one experiment. To make sure that these findings cannot be attributed to non-linear relations with other measures, we confirm them using both Pearson and Spearman correlation, where the latter is a non-parametric metric that compares the ordinality of measure outcomes. The slightly weaker correlation of LinkAUC with NDCG can be attributed to the latter's tendency to place more emphasis on the top predictions, which makes it overstate the correctness of rank quality compared to AUC when the rest of ranks are inaccurate.

Looking at the other unsupervised measures, fuzzy definitions of conductance and density sometimes degrade for higher NodeAUC values. This can be attributed to these metrics measuring local-scale features, which are not always a good indication of the quality of larger metadata groups.

It must be noted that gap conductance also exhibits strong correlation with the supervised measures on the real-world networks. However, especially for the synthetic network, it frequently assumes a value of 1 that reflects its inability to discover clear-cut boundaries. This casts doubt on the validity of using it for evaluating ranks in new networks, since similar structural deficiencies can render it uninformative.
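The difference between the two correlation coefficients used in this comparison can be sketched as follows. The numbers are made up purely for illustration and are not measurements from the experiments; the rank-based helper is a simplified version that assumes no tied values.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation, which measures linear association."""
    return float(np.corrcoef(x, y)[0, 1])

def ordinal_ranks(values):
    """Position of every value in sorted order (assumes no ties)."""
    return np.argsort(np.argsort(values))

def spearman(x, y):
    """Spearman correlation, i.e. Pearson correlation of the ordinal ranks."""
    return pearson(ordinal_ranks(x), ordinal_ranks(y))

# Made-up scores for 15 hypothetical setups: monotone but strongly non-linear relation.
supervised = np.linspace(0.5, 1.0, 15)
unsupervised = np.exp(5 * supervised)

linear_agreement = pearson(supervised, unsupervised)
ordinal_agreement = spearman(supervised, unsupervised)
```

Here the ordinal agreement is perfect while the linear one is not, which is why reporting both coefficients guards against conclusions that depend on the shape of the relation.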

[Figure 3: eight scatter-plot panels, one for each pair of unsupervised measure (Conductance, Gap Conductance, Density, LinkAUC; vertical axes) and supervised measure (NodeAUC, NDCG; horizontal axes), with separate series for the Synth, Amazon and DBLP networks.]

Fig. 3. Scatter plots and least square lines of unsupervised vs. supervised measures. Each point corresponds to a different experiment setup.

Table 2. Correlations between unsupervised and supervised measures. The strongest correlations for each dataset are bolded.

Pearson Correlation (Synth, Amazon, DBLP), Spearman Correlation (Synth, Amazon, DBLP)

With NodeAUC

Conductance

−27%

14%

25%

1%

Gap Cond/nce −28%

−67% −70% −23%

Density

58%

−22% −59%

LinkAUC

84%

92%

55%

84%

85%

40%

1%

24%

5%

−72% −88% 6% −45% 95%

90%

−3%

−9%

With NDCG Conductance

−23%

−26%

Gap Cond/nce −19%

−69% −71% −19%

−68% −85%

Density

38%

−63% −74%

−21% −75%

LinkAUC

56%

65%

84%

45% 71%

85%

88%

4 Conclusions and Future Work

In this work we proposed a new unsupervised procedure that evaluates node ranks of multiple metadata groups based on how well they predict network edges. We explained the intuitive motivation behind this approach and experimentally showed that it closely follows supervised rank evaluation across a number of different experiments, many of which are inadequately evaluated by other unsupervised community quality measures. Based on our findings, our approach can be a better alternative to existing rank evaluation strategies in unlabeled networks whose metadata propagation mechanisms are unknown. This indicates that network structure and awareness of multiple metadata groups are two promising types of ground truth that can help evaluate metadata group ranks. In the future, we are interested in performing experiments across more networks and comparing our approach with additional unsupervised measures.

Acknowledgements. This work was partially funded by the European Commission under contract numbers H2020-761634 FuturePulse and H2020-825585 HELIOS.

References

1. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. 659, 1–44 (2016)
2. Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th International Conference on World Wide Web, pp. 631–640. ACM (2010)
3. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput. Surv. (CSUR) 45(4), 43 (2013)
4. Papadopoulos, S., Kompatsiaris, Y., Vakali, A., Spyridonos, P.: Community detection in social media. Data Min. Knowl. Discov. 24(3), 515–554 (2012)
5. Hric, D., Darst, R.K., Fortunato, S.: Community detection in networks: structural communities versus ground truth. Phys. Rev. E 90(6), 062805 (2014)
6. Hric, D., Peixoto, T.P., Fortunato, S.: Network structure, metadata, and the prediction of missing nodes and annotations. Phys. Rev. X 6(3), 031038 (2016)
7. Peel, L., Larremore, D.B., Clauset, A.: The ground truth about metadata and community detection in networks. Sci. Adv. 3(5), e1602548 (2017)
8. Perer, A., Shneiderman, B.: Balancing systematic and flexible exploration of social networks. IEEE Trans. Visual Comput. Graphics 12(5), 693–700 (2006)
9. De Domenico, M., Solé-Ribalta, A., Omodei, E., Gómez, S., Arenas, A.: Ranking in interconnected multilayer networks reveals versatile nodes. Nat. Commun. 6, 6868 (2015)
10. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6(1), 29–123 (2009)
11. Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 11(3), 033015 (2009)
12. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 475–486. IEEE (2006)
13. Whang, J.J., Gleich, D.F., Dhillon, I.S.: Overlapping community detection using neighborhood-inflated seed expansion. IEEE Trans. Knowl. Data Eng. 28(5), 1272–1284 (2016)
14. Hsu, C.-C., Lai, Y.-A., Chen, W.-H., Feng, M.-H., Lin, S.-D.: Unsupervised ranking using graph structures and node attributes. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 771–779. ACM (2017)
15. Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Recommender Systems Handbook, pp. 257–297. Springer (2011)
16. Wang, Y., Wang, L., Li, Y., He, D., Chen, W., Liu, T.-Y.: A theoretical analysis of NDCG ranking measures. In: Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013), vol. 8, p. 6 (2013)
17. Isinkaye, F., Folajimi, Y., Ojokoh, B.: Recommendation systems: principles, methods and evaluation. Egypt. Inform. J. 16(3), 261–273 (2015)
18. Kowalik, Ł.: Approximation scheme for lowest outdegree orientation and graph density measures. In: International Symposium on Algorithms and Computation, pp. 557–566. Springer (2006)
19. Newman, M.E.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)
20. Chalupa, D.: A memetic algorithm for the minimum conductance graph partitioning problem, arXiv preprint arXiv:1704.02854 (2017)
21. Jeub, L.G., Balachandran, P., Porter, M.A., Mucha, P.J., Mahoney, M.W.: Think locally, act locally: detection of small, medium-sized, and large communities in large networks. Phys. Rev. E 91(1), 012821 (2015)
22. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Annu. Rev. Sociol. 27(1), 415–444 (2001)
23. Duan, L., Ma, S., Aggarwal, C., Ma, T., Huai, J.: An ensemble approach to link prediction. IEEE Trans. Knowl. Data Eng. 29(11), 2402–2416 (2017)
24. Koren, Y., Bell, R.: Advances in collaborative filtering. In: Recommender Systems Handbook, pp. 77–118. Springer (2015)
25. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
26. Lü, L., Zhou, T.: Link prediction in complex networks: a survey. Phys. A 390(6), 1150–1170 (2011)
27. Ortega, A., Frossard, P., Kovačević, J., Moura, J.M., Vandergheynst, P.: Graph signal processing: overview, challenges, and applications. Proc. IEEE 106(5), 808–828 (2018)
28. Martínez, V., Berzal, F., Cubero, J.-C.: A survey of link prediction in complex networks. ACM Comput. Surv. (CSUR) 49(4), 69 (2017)
29. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
30. Mason, S.J., Graham, N.E.: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation. Q. J. R. Meteorol. Soc. 128(584), 2145–2166 (2002)
31. Schaeffer, S.E.: Graph clustering. Comput. Sci. Rev. 1(1), 27–64 (2007)
32. Görke, R., Kappes, A., Wagner, D.: Experiments on density-constrained graph clustering. J. Exp. Algorithmics (JEA) 19, 3–3 (2015)
33. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: first steps. Soc. Netw. 5(2), 109–137 (1983)
34. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing. ACM Trans. Web (TWEB) 1(1), 5 (2007)
35. Rohe, K., Chatterjee, S., Yu, B., et al.: Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Stat. 39(4), 1878–1915 (2011)
36. Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 62(1), 471–487 (2016)
37. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998. ACM (2008)
38. Lofgren, P., Banerjee, S., Goel, A.: Personalized pagerank estimation and search: a bidirectional approach. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 163–172. ACM (2016)
39. Krasanakis, E., Schinas, E., Papadopoulos, S., Kompatsiaris, Y., Symeonidis, A.: Boosted seed oversampling for local community ranking. Inf. Process. Manag. 102053 (2019, in press). https://service.elsevier.com/app/answers/detail/a_id/11241/supporthub/scopus/
40. Kloster, K., Gleich, D.F.: Heat kernel based community detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395. ACM (2014)
41. Andersen, R., Chung, F., Lang, K.: Local partitioning for directed graphs using pagerank. Internet Math. 5(1–2), 3–22 (2008)
42. Borgs, C., Chayes, J., Mahdian, M., Saberi, A.: Exploring the community structure of newsgroups. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 783–787. ACM (2004)
43. Gleich, D., Kloster, K.: Seeded pagerank solution paths. Eur. J. Appl. Math. 27(6), 812–845 (2016)

A Gradient Estimate for PageRank

Paul Horn¹ and Lauren M. Nelsen²

¹ University of Denver, Denver, CO 80208, USA, https://cs.du.edu/~paulhorn
² University of Indianapolis, Indianapolis, IN 46227, USA, https://sites.google.com/view/laurennelsen

Abstract. Personalized PageRank has found many uses in not only the ranking of webpages, but also algorithmic design, due to its ability to capture certain geometric properties of networks. In this paper, we study the diffusion of PageRank: how varying the jumping (or teleportation) constant affects PageRank values. To this end, we prove a gradient estimate for PageRank, akin to the Li-Yau inequality for positive solutions to the heat equation (for manifolds, with later versions adapted to graphs).

Keywords: PageRank · Discrete curvature · Random walks · Gradient estimate

1 Introduction/Background

Personalized PageRank, developed by Brin and Page [3], ranks the importance of webpages 'near' a seed. PageRank can be thought of in a variety of ways, but one of the most important points of view of PageRank is that it is the distribution of a random walk allowed to diffuse for a geometrically distributed number of steps. A key parameter in PageRank, then, is the 'jumping' or 'teleportation' constant which controls the expected length of the involved random walks. As the jumping constant controls the length, it controls locality, that is, how far from the seed the random walk is (likely) willing to stray. When the jumping constant is small, the involved walks are (on average) short, and the mass of the distribution will remain concentrated near the seed. As the jumping constant increases, the involved walk will (likely) be much longer. This allows the random walk to mix, and the involved distribution tends towards the stationary distribution of the random walk.

As the PageRank of individual vertices (for a fixed jumping constant) can be thought of as a measure of importance to the seed, this importance diffuses as the jumping constant increases. In this paper, we are interested in how this importance diffuses as the jumping constant increases. This diffusion is related to the network's geometry; in particular, the importance can get 'caught' by small cuts. This partially accounts for PageRank's importance in web search but has other uses as well; for instance, Andersen, Chung and Lang use PageRank to implement local graph partitioning algorithms in [1].

This paper seeks to understand the diffusion of influence (as the jumping constant changes) in analogy to the diffusion of heat. The study of solutions to the heat equation $\Delta u = \frac{\partial}{\partial t} u$ on both graphs and manifolds has a long history, motivated by its close ties to geometric properties of graphs. On graphs, the relationship between heat flow and PageRank has been exploited several times. For instance, Chung [4] introduced the notion of heat kernel PageRank and used it to improve the algorithm of Andersen, Chung and Lang for graph partitioning.

A particularly useful way of understanding positive solutions to the heat equation is through curvature lower bounds, which can be used to prove 'gradient estimates', which bound how heat diffuses locally in space and time and which can be integrated to obtain Harnack inequalities. Most classical of these is the Li-Yau inequality [13], which (in its simplest form) states that if $u$ is a positive solution on a non-negatively curved $n$-dimensional compact manifold, then $u$ satisfies

$$\frac{|\nabla u|^2}{u^2} - \frac{u_t}{u} \leq \frac{n}{2t}. \tag{1}$$

In the graph setting, Bauer et al. proved a gradient estimate for the heat kernel on graphs in [2]. In this paper we aim to prove a similar inequality for PageRank. Our gradient estimate, which is formally stated as Theorem 1 below, is proved using the exponential curvature dimension inequality CDE, introduced by Bauer et al. We mention that, in some ways, our inequality is more closely related to another inequality of Hamilton [9] which bounds merely $\frac{|\nabla u|^2}{u^2}$, and was established for graphs by Horn in [10]. Other related works establish gradient estimates for eigenfunctions for the Laplace matrix; these include [6].

This paper is organized as follows: In the next section we introduce definitions for both PageRank and the graph curvature notions used. We further establish a useful 'time parameterization' for PageRank, which allows us to think of increasing the jumping constant as increasing a time parameter, and makes our statements and proofs cleaner. In Sect. 3 we prove a gradient estimate for PageRank. In Sect. 4 we use this gradient estimate to prove a Harnack-type inequality that allows us to compare PageRank at two vertices in a graph.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 15–26, 2020. https://doi.org/10.1007/978-3-030-36687-2_2

2 Preliminaries

2.1 Spectral Graph Theory and Graph Laplacians

Spectral graph theory involves associating a matrix (or operator) with a graph and investigating how eigenvalues of the associated matrix reflect graph properties. The most familiar such matrix is the adjacency matrix $A$, whose rows and columns are indexed with vertices and

$$a_{ij} = \begin{cases} 1 & \text{if } v_i \sim v_j \\ 0 & \text{else.} \end{cases}$$

In this work, the principal matrix that we will consider is the normalized Laplace operator $\Delta = D^{-1}A - I$, where $D$ is the diagonal matrix of vertex degrees and $D^{-1}A$ is the transition probability matrix for a simple random walk. As a quick observation, note that $\Delta$ is non-positive semidefinite. This is contrary to usual sign conventions in graph theory, but is the proper sign convention for the Laplace-Beltrami operator in Riemannian manifolds, the analogy to which we emphasize in this paper. Also note that this matrix is (up to sign) the unsymmetrized version of the normalized Laplacian popularized by Chung (see [7]), $\mathcal{L} = I - D^{-1/2} A D^{-1/2}$.

2.2 PageRank

(Personalized) PageRank was introduced as a ranking mechanism [12], to rank the importance of webpages with respect to a seed. To define personalized PageRank, we introduce the following operator, which we call the PageRank operator. This operator, $P(\alpha)$, also written $P_\alpha$, is defined as follows:

$$P(\alpha) = (1 - \alpha) \sum_{k=0}^{\infty} \alpha^k W^k = (1 - \alpha)(I - \alpha W)^{-1},$$

where $W = D^{-1}A$ is the transition probability matrix for a simple random walk. Here the parameter $\alpha$ is known as the jumping or teleportation constant. For a finite $n$-vertex graph, $P(\alpha)$ is a square matrix; the personalized PageRank vector of a vector $u : V \to \mathbb{R}$ is

$$u^T P(\alpha) = (1 - \alpha) \sum_{k=0}^{\infty} \alpha^k u^T W^k.$$
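As a numerical illustration of the PageRank operator, the sketch below compares the closed form $(1-\alpha)(I-\alpha W)^{-1}$, which is the form used in the proofs of this section, with the corresponding power series including its $k = 0$ term. It assumes NumPy, a hypothetical small graph and an illustrative value of $\alpha$; the function names are not from the paper.

```python
import numpy as np

def random_walk_matrix(A):
    """W = D^{-1} A for an adjacency matrix A without isolated vertices."""
    return A / A.sum(axis=1, keepdims=True)

def pagerank_operator(A, alpha):
    """Closed form (1 - alpha) (I - alpha W)^{-1}."""
    W = random_walk_matrix(A)
    return (1 - alpha) * np.linalg.inv(np.eye(len(A)) - alpha * W)

def pagerank_series(A, alpha, terms):
    """Truncated series (1 - alpha) * sum_{k=0}^{terms-1} alpha^k W^k."""
    W = random_walk_matrix(A)
    total = np.zeros_like(W)
    power = np.eye(len(A))            # holds W^k
    for k in range(terms):
        total += alpha ** k * power
        power = power @ W
    return (1 - alpha) * total

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

alpha = 0.8                                   # illustrative jumping constant
P = pagerank_operator(A, alpha)
P_series = pagerank_series(A, alpha, terms=400)

u = np.zeros(6)
u[0] = 1.0                                    # seed concentrated on vertex 0
personalized = u @ P                          # personalized PageRank vector u^T P(alpha)
```

Every row of `P` sums to one, so `personalized` is again a probability distribution over the vertices.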

It has been noticed ([4,5]) that PageRank has many similarities to the heat kernel $e^{t\Delta}$. Chung defined the notion of 'Heat Kernel PageRank' to exploit these similarities. In this work, we take inspiration in the opposite direction: we are interested in understanding the action of the PageRank operator in analogy to solutions of the heat equation.

In order to emphasize our point of view, we note that graph theorists view the heat kernel operator in two different ways: For a vector $u : V \to \mathbb{R}$, studying the evolution of $u^T e^{t\Delta}$ as $t \to \infty$ is really studying the evolution of the continuous time random walk, while studying the evolution of $e^{t\Delta} u$ as $t \to \infty$ is studying the solutions to the heat equation $\Delta u = u_t$. The differing behavior of these two evolutions comes from the fact that (for irregular graphs) the left and right eigenvectors of $\Delta = W - I$ are different: the left Perron-Frobenius eigenvector of $\Delta$ is proportional to the degrees of a graph (as it captures the stationary distribution of the random walk) while the right Perron-Frobenius eigenvector is the constant vector. In particular, as $t \to \infty$ the vector $e^{t\Delta} u$ tends to a constant. Physically, this represents the 'heat' on a graph evening out, and this regularization (and the rate of regularization) is related to a number of geometric features of a graph.

A similar feature holds for PageRank. As $\alpha \to 1$, $u^T P(\alpha)$ tends to a vector proportional to degrees, but $P(\alpha) u$ regularizes. In this paper we study this regularization. Although we do not study the PageRank vector explicitly, we note that the left and right action of the PageRank operator are closely related. For an undirected graph $u^T P(\alpha) = (P(\alpha)^T u)^T = (D P(\alpha) D^{-1} u)^T$, so that the regularization of $D^{-1} u$ can be translated into information on the 'mixing' of the personalized PageRank vector seeded at $u$.

To complete the analogy between $P(\alpha) u$ and $e^{t\Delta} u$, it is helpful to come up with a time parameterization $t = t(\alpha)$ so we can view the regularization as a function of 'time', in analogy to the heat equation. To do this in the best way, it is useful to think of $\alpha = \alpha(t)$ and compute $\frac{\partial}{\partial t} P_\alpha$.

Proposition 1

$$\frac{\partial}{\partial t} P_\alpha = \frac{\alpha'}{(1 - \alpha)^2} \Delta P_\alpha^2,$$

where $\Delta P_\alpha^2 = \Delta P_\alpha (P_\alpha)$.

Proof. Notice that the chain rule and algebra reveal

$$\frac{\partial}{\partial t} P_\alpha = \frac{\partial}{\partial t} (1 - \alpha)(I - \alpha W)^{-1} = \alpha' \left( (\alpha W - I)(I - \alpha W)^{-2} + (1 - \alpha) W (I - \alpha W)^{-2} \right) = \frac{\alpha'}{(1 - \alpha)^2} \Delta P_\alpha^2. \qquad \square$$

This is remarkably close to the heat equation if $\alpha'(t) = (1-\alpha)^2$; solving this separable differential equation yields $\alpha = \alpha(t) = 1 - \frac{1}{t+C}$. Since we desire a parameterization so that $\alpha(0) = 0$ and $\alpha \to 1$ as $t \to \infty$, this gives us $C = 1$, whence we obtain:
\[ \alpha(t) = 1 - \frac{1}{t+1}, \tag{2} \]
\[ t = \frac{\alpha}{1-\alpha}. \tag{3} \]
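For readers who want the intermediate step, the separable equation $\alpha' = (1-\alpha)^2$ can be integrated as follows. This is a routine calculation, included only as a worked check of Eqs. 2 and 3:

```latex
\begin{align*}
\frac{d\alpha}{dt} = (1-\alpha)^2
  &\;\Longrightarrow\; \int \frac{d\alpha}{(1-\alpha)^2} = \int dt
   \;\Longrightarrow\; \frac{1}{1-\alpha} = t + C
   \;\Longrightarrow\; \alpha(t) = 1 - \frac{1}{t+C}, \\
\alpha(0) = 0
  &\;\Longrightarrow\; C = 1,
  \qquad
  t = \frac{1}{1-\alpha} - 1 = \frac{\alpha}{1-\alpha}.
\end{align*}
```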

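Given the time parameterization in Eq. 2, we get the following corollary to Proposition 1.

Corollary 2
\[ \frac{\partial}{\partial t} P_\alpha = \Delta P_\alpha^2, \]
where $\Delta P_\alpha^2 = \Delta P_\alpha(P_\alpha)$.

Proof. From Proposition 1 and our choice of parameterization, we see that
\[ \frac{\partial}{\partial t} P_\alpha = \frac{\alpha'}{(1-\alpha)^2}\, \Delta P_\alpha^2 = \frac{\frac{1}{(t+1)^2}}{\left(\frac{1}{t+1}\right)^2}\, \Delta P_\alpha^2 = \Delta P_\alpha^2. \]
□

Fix a vector $u : V \to \mathbb{R}$. From now on, we let
\[ f = P_\alpha u. \tag{4} \]

Lemma 1. For $f = P_\alpha u$ and $t = \frac{\alpha}{1-\alpha}$, we have that $\Delta f = \frac{f-u}{t}$.

Proof. We know that $W = D^{-1}A$ and $\Delta = W - I$, so
\begin{align*}
\Delta P_\alpha &= (W - I)(1-\alpha)(I - \alpha W)^{-1} \\
&= -\frac{1}{\alpha}(I - \alpha W)(1-\alpha)(I - \alpha W)^{-1} + \frac{1-\alpha}{\alpha} \cdot (1-\alpha)(I - \alpha W)^{-1} \\
&= \frac{1-\alpha}{\alpha}(P_\alpha - I).
\end{align*}
Hence
\[ \Delta f = \Delta P_\alpha u = \frac{1-\alpha}{\alpha}(P_\alpha - I)u = \frac{f-u}{t}. \]
□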
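Lemma 1 and Corollary 2 are easy to sanity-check numerically. The following is a minimal sketch in plain Python, not part of the paper: the 4-vertex path graph, the seed vector `u`, the value of `t`, and all helper names are illustrative assumptions. It computes $f = P_\alpha u$ by fixed-point iteration and compares both sides of each identity.

```python
# Numerical sanity check of Lemma 1 and Corollary 2 on a small graph.
# Assumptions (not from the paper): the 4-vertex path graph, the seed u,
# and the helper names below are illustrative choices.

def random_walk_matrix(adj):
    """W = D^{-1} A for an undirected graph given as an adjacency dict."""
    n = len(adj)
    W = [[0.0] * n for _ in range(n)]
    for x, nbrs in adj.items():
        for y in nbrs:
            W[x][y] = 1.0 / len(nbrs)
    return W

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def pagerank(W, u, alpha, iters=500):
    """f = P_alpha u = (1 - alpha)(I - alpha W)^{-1} u, computed by iterating
    f <- (1 - alpha) u + alpha W f, which converges because alpha < 1."""
    f = list(u)
    for _ in range(iters):
        Wf = matvec(W, f)
        f = [(1 - alpha) * ui + alpha * wi for ui, wi in zip(u, Wf)]
    return f

def laplacian(W, v):
    """Delta v = (W - I) v."""
    return [wi - vi for wi, vi in zip(matvec(W, v), v)]

def alpha_of(t):
    """Time parameterization of Eq. 2."""
    return 1.0 - 1.0 / (t + 1.0)

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
W = random_walk_matrix(adj)
u = [1.0, 0.0, 0.0, 0.0]
t = 2.0
f = pagerank(W, u, alpha_of(t))

# Lemma 1: Delta f = (f - u) / t
lemma1_lhs = laplacian(W, f)
lemma1_rhs = [(fi - ui) / t for fi, ui in zip(f, u)]

# Corollary 2 applied to u: d/dt (P_alpha u) = Delta P_alpha (P_alpha u),
# with the time derivative approximated by a central difference.
h = 1e-4
f_plus = pagerank(W, u, alpha_of(t + h))
f_minus = pagerank(W, u, alpha_of(t - h))
cor2_lhs = [(a - b) / (2 * h) for a, b in zip(f_plus, f_minus)]
cor2_rhs = laplacian(W, pagerank(W, f, alpha_of(t)))
```

The two sides of Lemma 1 should agree to rounding error, while the Corollary 2 comparison is only as accurate as the finite-difference step `h` allows.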

2.3 Graph Curvature

In this paper we study the regularization of $P(\alpha)u$ for an initial seed $u$ as $\alpha \to 1$. On one hand, the information about this regularization is contained in the spectral decomposition of the random walk matrix $W$. The eigenvalues of $P(\alpha)$ are determined by the eigenvalues of $W$: indeed, if $\lambda$ is an eigenvalue of $W$, then $\frac{1-\alpha}{1-\alpha\lambda}$ is an eigenvalue of $P_\alpha$. One may then observe that, as $\alpha \to 1$, all eigenvalues of $P_\alpha$ tend to zero except for the one arising from the eigenvalue 1 of $W$, and this is what causes the regularization. Thus the difference between $P_\alpha u$ and the constant vector can be bounded in terms of (say) the infinity norms of eigenvectors of $P_\alpha$ and $\alpha$ itself.

On the other hand, curvature lower bounds (in graphs and manifolds) have proven to be important ways to understand the local evolution of solutions to the heat equation. As we have already noted important similarities between heat solutions and PageRank, we seek similar understanding in the present case. Curvature, for graphs and manifolds, gives a way of understanding the local geometry of the object. A manifold (or graph) satisfying a curvature lower bound at every point has a locally constrained geometry, which allows a local understanding of heat flow through which a ‘gradient estimate’ can be proved. These gradient estimates can then be ‘integrated’ over space-time to yield Harnack inequalities, which compare the ‘heat’ of different points at different times.
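The spectral picture above can be made concrete with a few lines of Python. This is only an illustration: the sample spectrum is a hypothetical list of values in $[-1, 1]$, not data from the paper, and `pagerank_eigenvalue` is a name introduced here.

```python
# Eigenvalue map lambda -> (1 - alpha) / (1 - alpha * lambda) for P_alpha.
# The sample spectrum below is a hypothetical example, not data from the paper.

def pagerank_eigenvalue(lam, alpha):
    """Eigenvalue of P_alpha = (1 - alpha)(I - alpha W)^{-1} that corresponds
    to the eigenvalue lam of the random walk matrix W."""
    return (1.0 - alpha) / (1.0 - alpha * lam)

sample_spectrum = [1.0, 0.5, 0.0, -0.5, -1.0]

table = {
    alpha: [pagerank_eigenvalue(lam, alpha) for lam in sample_spectrum]
    for alpha in (0.5, 0.9, 0.99, 0.999)
}
for alpha, eigs in table.items():
    print(alpha, [round(e, 4) for e in eigs])
```

As `alpha` approaches 1, the entry for `lam = 1.0` stays at 1 while every other entry shrinks toward zero, which is the regularization described in the text.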

While a direct analogue of the Ricci curvature is not defined in a graph setting, a number of graph theoretical analogues have been developed recently in an attempt to apply geometrical ideas in the graph setting. In the context of proving gradient estimates of heat solutions, a new notion of curvature known as the exponential curvature dimension inequality was introduced in [2]. In order to discuss the exponential curvature dimension inequality, we first need to introduce some notation. The Laplace operator, $\Delta$, on a graph $G$ is defined at a vertex $x$ by
\[ \Delta f(x) = \sum_{y \sim x} (f(y) - f(x)). \]

Definition 3. The gradient form $\Gamma$ is defined by
\begin{align*}
2\Gamma(f, g)(x) &= (\Delta(f \cdot g) - f \cdot \Delta(g) - \Delta(f) \cdot g)(x) \\
&= \sum_{y \sim x} (f(y) - f(x))(g(y) - g(x)),
\end{align*}
and we write $\Gamma(f) = \Gamma(f, f)$.

In general, there is no “chain rule” that holds for the Laplacian on graphs. However, the following formula does hold for the Laplacian and will be useful to us:
\[ \Delta f = 2\sqrt{f}\,\Delta\sqrt{f} + 2\Gamma(\sqrt{f}). \tag{5} \]
We define an iterated gradient form, $\Gamma_2$, that will be of use to us for the notion of graph curvature that we are using.

Definition 4. The gradient form $\Gamma_2$ is defined by
\[ 2\Gamma_2(f, g) = \Delta\Gamma(f, g) - \Gamma(f, \Delta g) - \Gamma(\Delta f, g), \]
and we write $\Gamma_2(f) = \Gamma_2(f, f)$.

At the heart of the exponential curvature dimension inequality is an idea that had been used previously based on the Bochner formula. The Bochner formula reveals a connection between solutions to the heat equation and the curvature of a manifold. Bochner’s formula tells us that if $M$ is a Riemannian manifold and $f$ is in $C^\infty(M)$, then
\[ \frac{1}{2}\Delta|\nabla f|^2 = \langle \nabla f, \nabla\Delta f \rangle + \|\mathrm{Hess} f\|_2^2 + \mathrm{Ric}(\nabla f, \nabla f). \]
The Bochner formula implies that for an $n$-dimensional manifold with Ricci curvature at least $K$, we have
\[ \frac{1}{2}\Delta|\nabla f|^2 \ge \langle \nabla f, \nabla\Delta f \rangle + \frac{1}{n}(\Delta f)^2 + K|\nabla f|^2. \tag{6} \]
An important insight of Bakry and Emery was that an object satisfying an inequality like (6) could be used as a definition of a curvature lower bound even when curvature could not be directly defined. Such an inequality became known as a curvature dimension inequality, or the CD inequality. Bauer et al. introduced a modification of the CD inequality that defines a new notion of curvature on graphs that we will use here [2], the exponential curvature dimension inequality.

Definition 5. A graph is said to satisfy the exponential curvature dimension inequality CDE(n, K) if, for all positive $f : V \to \mathbb{R}$ and at all vertices $x \in V(G)$ satisfying $(\Delta f)(x) < 0$,
\[ \Delta\Gamma(f) - 2\Gamma\left(f, \frac{\Delta f^2}{2f}\right) \ge \frac{2}{n}(\Delta f)^2 + 2K\Gamma(f), \tag{7} \]
where the inequality in (7) is taken pointwise.

While the inequality (7) may seem somewhat unwieldy, it arises, as shown in [2], from ‘baking in’ the chain rule and is actually equivalent to the standard curvature dimension inequality (6) in the setting of diffusive semigroups (where the Laplace operator satisfies the chain rule). Additionally, in [2], it is shown that some graphs, including the Ricci flat graphs of Chung and Yau, satisfy CDE(n, 0) (and hence are non-negatively curved for this curvature notion), and some general curvature lower bounds for graphs are given. An important observation is that this notion of curvature only requires looking at the second neighborhood of each vertex, and hence this kind of curvature is truly a local property (a curvature lower bound can be certified by only inspecting second neighborhoods of vertices).
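The Laplace operator and the gradient form of Definition 3 are simple to compute directly. The sketch below, in plain Python, checks on one small example that the two expressions for $\Gamma(f, g)$ in Definition 3 agree and that identity (5) holds. The graph (a 4-cycle with a chord), the test functions, and the function names are arbitrary illustrative choices, not taken from the paper.

```python
import math

# Unnormalized graph Laplacian and gradient form on a small example graph.
# Assumptions: the 4-cycle with a chord and the test functions are arbitrary.

adj = {0: [1, 3], 1: [0, 2, 3], 2: [1, 3], 3: [0, 1, 2]}

def laplacian(f, x):
    """Delta f(x) = sum over neighbors y of x of (f(y) - f(x))."""
    return sum(f[y] - f[x] for y in adj[x])

def gamma_sum(f, g, x):
    """Gamma(f, g)(x) from the neighbor-sum form of Definition 3."""
    return 0.5 * sum((f[y] - f[x]) * (g[y] - g[x]) for y in adj[x])

def gamma_laplace(f, g, x):
    """Gamma(f, g)(x) from the form Delta(fg) - f Delta(g) - Delta(f) g."""
    fg = [a * b for a, b in zip(f, g)]
    return 0.5 * (laplacian(fg, x) - f[x] * laplacian(g, x) - laplacian(f, x) * g[x])

f = [1.0, 4.0, 9.0, 2.5]      # positive, so sqrt(f) is defined
g = [0.3, -1.0, 2.0, 0.7]
sqrt_f = [math.sqrt(v) for v in f]

# Largest disagreement between the two forms of Gamma(f, g) over all vertices.
definition_gap = max(abs(gamma_sum(f, g, x) - gamma_laplace(f, g, x)) for x in adj)

# Identity (5): Delta f = 2 sqrt(f) Delta sqrt(f) + 2 Gamma(sqrt(f))
identity_gap = max(
    abs(laplacian(f, x)
        - 2 * sqrt_f[x] * laplacian(sqrt_f, x)
        - 2 * gamma_sum(sqrt_f, sqrt_f, x))
    for x in adj
)
```

Both gaps should vanish up to floating-point rounding. Checking CDE(n, K) itself would require quantifying over all positive functions, which this sketch does not attempt.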

3 Gradient Estimate for PageRank

Our main result will make use of the following lemma, adapted from a lemma in [2].

Lemma 2 ([2]). Let $G(V, E)$ be a (finite or infinite) graph, and let $f, H : V \times \{t^*\} \to \mathbb{R}$ be functions. If $f \ge 0$ and $H$ has a local maximum at $(x^*, t^*) \in V \times \{t^*\}$, then
\[ \Delta(fH)(x^*, t^*) \le (\Delta f)H(x^*, t^*). \]

Our goal is to show that
\[ \frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}} \le \frac{C(t)}{t} \]
for some function $C(t)$. However, $\frac{C(t)}{t}$ is badly behaved as $t \to 0$. The way that we handle this is by showing that $H := t \cdot \frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}} \le C(t)$. If $H$ is a function from $V \times [0, \infty) \to \mathbb{R}$, then instead consider $H$ as a function from $V \times [0, T] \to \mathbb{R}$ for some $T > 0$. Then, by compactness, there is a point $(x^*, t^*)$ in $V \times [0, T]$ at which $H(x, t)$ is maximized. At this maximum, we know that $\Delta H \le 0$ and $\frac{\partial}{\partial t} H \ge 0$. Since $L = \Delta - \frac{\partial}{\partial t}$, this implies that at the maximum point, $LH \le 0$. Using the CDE inequality, along with some other lemmas and an identity, we are able to relate $H^2$ to $H$ itself. This allows us to find an upper bound for $H$, and thus for $\frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}}$. Our situation is a little easier, because we consider a fixed $t$. A simple computation shows the following:

Lemma 3. Let $G$ be a graph, and suppose $0 \le f(x) \le M$ for all $x \in V(G)$ and $t \in [0, \infty)$, and let $H = \frac{t\,\Gamma(\sqrt{f})}{\sqrt{f \cdot M}}$. Then
\[ \Delta\sqrt{f} = \frac{f-u}{2t\sqrt{f}} - \frac{\sqrt{M}\,H}{t}. \]
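The computation behind Lemma 3 is short. One way to see it, sketched here by combining identity (5) with Lemma 1 and assuming $f > 0$ so that division by $\sqrt{f}$ is allowed, is:

```latex
\begin{align*}
\Delta\sqrt{f}
  &= \frac{\Delta f - 2\Gamma(\sqrt{f})}{2\sqrt{f}}
  && \text{by (5)} \\
  &= \frac{f-u}{2t\sqrt{f}} - \frac{\Gamma(\sqrt{f})}{\sqrt{f}}
  && \text{by Lemma 1, } \Delta f = \tfrac{f-u}{t} \\
  &= \frac{f-u}{2t\sqrt{f}} - \frac{\sqrt{M}\,H}{t}
  && \text{since } \Gamma(\sqrt{f}) = \tfrac{H\sqrt{f \cdot M}}{t}.
\end{align*}
```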

This identity plays a similar role to the identity $\Delta \log u = \frac{\Delta u}{u} - \frac{|\nabla u|^2}{u^2}$ that is key in the Li-Yau inequality on manifolds, and the identity $\frac{\Delta\sqrt{u}}{\sqrt{u}} = \frac{\Delta u}{2u} - \frac{|\nabla\sqrt{u}|^2}{u}$, which is behind the Li-Yau inequality for graphs. Lemma 3 is similar to these other identities and the CDE inequality allows us to exploit this relationship.

Theorem 1. Let $G$ be a graph satisfying CDE(n, 0). Suppose $0 \le f \le M$ for all $x \in V(G)$ and $t \in (0, \infty)$. Then
\[ \frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}} \le \frac{n+4}{n+2} \cdot \frac{1}{t} + 2\sqrt{\frac{n}{n+2}} \cdot \frac{1}{\sqrt{t}}. \]
Note that this theorem actually is more akin to a ‘Hamilton-type’ gradient estimate, as it is an estimate in space only (and not time). Due to space constraints, the full proof of Theorem 1 is deferred to the full version of the paper. It proceeds by the maximum principle, similarly to the proof of the Li-Yau inequality in [2], but requires additional care in handling some terms since the heat equation is not specified; these give rise to its form. For the convenience of the reader, we sketch the main ideas of the proof in an appendix.

4 Harnack-Type Inequality

We can use Theorem 1 to prove a result comparing PageRank at two vertices in a graph depending on the distance between them. This result is similar to a Harnack inequality. The classical form of a Harnack inequality is the following.

Proposition 6 ([2]). Suppose $G$ is a graph satisfying CDE(n, 0). Let $T_1 < T_2$ be real numbers, let $d(x, y)$ denote the distance between $x, y \in V(G)$, and let $D = \max_{v \in V(G)} \deg(v)$. If $u$ is a positive solution to the heat equation on $G$, then
\[ u(x, T_1) \le u(y, T_2) \left(\frac{T_2}{T_1}\right)^{n} \exp\left(\frac{4D\,d(x, y)^2}{T_2 - T_1}\right). \]

This result allows one to compare heat at different points and different times. This can make it possible to deduce geometric information about the graph, such as bottlenecking. Delmotte [8] showed that Harnack inequalities not only allow us to compare heat at different points in space and time; they also have geometric consequences, such as volume doubling and satisfying the Poincaré inequality. Horn, Lin, Liu, and Yau [11] completed the work of Delmotte by proving that even more geometric information can be obtained from Harnack inequalities.

Using Theorem 1, we are able to relate PageRank at different vertices, but our result is not quite of the right form to be a Harnack inequality. In Theorem 1, an ideal conclusion would be to have an $f$ instead of $\sqrt{f \cdot M}$ in the denominator. Since we do not, this makes proving a “Harnack-type” inequality, directly comparing the two values in terms of themselves and their distance, more difficult. (A somewhat similar technique is used by Horn in [10] on the heat equation, but in the case of [10] the gradient estimate is scaled better, yielding stronger results.)

To prove our Harnack-type inequality, we will use a lemma comparing PageRank at adjacent vertices. From now on, we will consider $t$ fixed and write $f(x)$ instead of $f(x, t)$. If a vertex, $w$, is adjacent to a vertex, $z$, then we want to lower bound $f(z)$ by a function only involving $f(w)$. The trick to this is to rewrite $\frac{\sqrt{f(w)}}{\sqrt{f(z)}}$ so that we can use Theorem 1 in order to get rid of the ‘$\sqrt{f(z)}$’ in the denominator.

Lemma 4. Let $D = \max_{v \in V(G)} \deg(v)$. If $w \sim z$, then
\[ \frac{\sqrt{f(w)}}{\sqrt{f(z)}} \le \frac{2CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)}} + 2. \]

Proof. If $\sqrt{f(z)} \ge \frac{1}{2}\sqrt{f(w)}$, then $\frac{\sqrt{f(w)}}{\sqrt{f(z)}} \le 2 \le \frac{2CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)}} + 2$.

If $\sqrt{f(z)} < \frac{1}{2}\sqrt{f(w)}$, then
\begin{align*}
\frac{\sqrt{f(w)}}{\sqrt{f(z)}} &= \frac{\sqrt{f(w)} - \sqrt{f(z)} + \sqrt{f(z)}}{\sqrt{f(z)}} \\
&= \frac{\sqrt{f(w)} - \sqrt{f(z)}}{\sqrt{f(z)}} + 1 \\
&= \frac{D\left(\sqrt{f(w)} - \sqrt{f(z)}\right)^2}{D\sqrt{f(z)}\left(\sqrt{f(w)} - \sqrt{f(z)}\right)} + 1. \tag{8}
\end{align*}
Now applying the gradient estimate (Theorem 1) yields
\begin{align*}
(8) &\le \frac{CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)} - \sqrt{f(z)}} + 1 \\
&\le \frac{CD\sqrt{M}}{t} \cdot \frac{2}{\sqrt{f(w)}} + 1 \quad \text{since } \sqrt{f(z)} < \tfrac{1}{2}\sqrt{f(w)} \\
&\le \frac{2CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)}} + 2.
\end{align*}
□

We note that this can be carefully iterated to compare PageRank of vertices at a given distance. This proof, however (and even its rather complicated statement), is deferred to the full journal version of the paper due to space considerations.

5 Conclusions, Applications, and Future Work

In this paper we investigated PageRank as a diffusion, using recently developed notions of discrete curvature. These results, while theoretical (and in some cases not as strong as would be desired due to the dependence on the maximum value ‘M’ in the gradient estimate), show that curvature aspects of graphs can be used to understand relative importance in networks, at least when ranking is based on random walk based diffusions. Regarding these points, we highlight the following:

– Curvature is a local property, based only on second neighborhood conditions. An upshot of this is that it can be certified quickly. While the work here focuses on situations where the entire graph is non-negatively curved for simplicity, work in [2,10] shows that these methods can be used when only parts of the graph satisfy such a curvature condition, by using cut-off functions. In principle these yield algorithms that are linear, either in the size of the graph or even in a considered portion of the graph, verifying curvature conditions and elucidating PageRank’s diffusion in bounded degree graphs.
– The influence of the jumping constant on PageRank has been important for certain algorithms (such as in [1]), but was originally picked rather arbitrarily (see, e.g., [14]). A more rigorous study of this phenomenon seems important for the analysis of complex networks and this paper should be seen as part of this thrust.
– There are several interesting areas for improvement here: The non-ideal scaling in Theorem 1 leads to a weaker than ideal result in Lemma 4. While Lemma 4 seems a reasonable result, when iterated it quickly loses power (unlike the Harnack inequality from a ‘properly scaled’ gradient estimate like in Proposition 6). While a ‘properly scaled’ Theorem 1 may not even be true, we suspect the scaling can be improved. An interesting question is whether a true ‘Hamilton type’ gradient estimate is true: Is $\Gamma(\sqrt{f})/f \le C \log(M/f)\,t^{-1}$? Note that the addition of the logarithmic term damages a Harnack inequality, but the results obtainable from this are far better than we obtain. Also, a version including the time derivative term is desirable.

Acknowledgments. Horn’s work was partially supported by Simons Collaboration grant #525309.

Appendix

The proof of Theorem 1 includes some rather lengthy computations, and is deferred to the full paper. For the benefit of readers, however, we have included a sketch here which highlights the initial part of the proof, where one relates the quantity to be bounded with its own square using CDE.

Proof. (Proof sketch for Theorem 1). Let $H = \frac{t\,\Gamma(\sqrt{f})}{\sqrt{f \cdot M}}$. Fix $t > 0$. Let $(x^*, t)$ be a point in $V \times \{t\}$ such that $H(x, t)$ is maximized. We desire to bound $H(x^*, t)$.

Our goal, then, is to apply the CDE inequality to $\Delta(\sqrt{f} H)$. In order to do this, we must ensure that $\Delta\sqrt{f} < 0$, but a computation shows that if $\Delta\sqrt{f} \ge 0$, then $H \le \frac{1}{2}$, so this is allowable. Following this, one computes by bounding the arising $\Delta\Gamma(\sqrt{f})$ by CDE. One bounds the ensuing terms; clearly
\[ \frac{2}{t}\Gamma(\sqrt{f}) - \frac{2}{t}\Gamma\left(\sqrt{f}, \frac{u}{\sqrt{f}}\right) \ge -\frac{2}{t}\Gamma\left(\sqrt{f}, \frac{u}{\sqrt{f}}\right), \]
and by Lemma 3, $\Delta\sqrt{f} = \frac{f-u}{2t\sqrt{f}} - \frac{\sqrt{M}\,H}{t}$. Then one bounds:
\begin{align*}
\Delta(\sqrt{f} H) &\ge \frac{2}{\sqrt{M}\,nt}\left(\frac{(f-u)^2}{4f} - \frac{(f-u)\sqrt{M}\,H}{\sqrt{f}} + MH^2\right) - \frac{2}{\sqrt{M}}\Gamma\left(\sqrt{f}, \frac{u}{\sqrt{f}}\right) + \frac{2\Gamma(\sqrt{f})}{\sqrt{M}} \\
&\ge \frac{2}{\sqrt{M}\,nt}\left(MH^2 - \sqrt{f}\sqrt{M}\,H\right) - \frac{1}{\sqrt{M}}\sum_{y \sim x}\left(u(y)\left(1 - \frac{\sqrt{f(x)}}{\sqrt{f(y)}}\right) + u(x)\left(1 - \frac{\sqrt{f(y)}}{\sqrt{f(x)}}\right)\right) \\
&\ge \frac{2}{\sqrt{M}\,nt}\left(MH^2 - \sqrt{f}\sqrt{M}\,H\right) - \sqrt{M}.
\end{align*}
Now one proceeds carefully, noting that we have related $H$ and its square and thus, in principle at least, have recorded an upper bound for $H$. Now we continue to compute to recover the result.

Remark: In a typical application of the maximum principle, one maximizes over $[0, T]$ and then uses information from the time derivative. Here, we don’t do this. This is important because one obtains an inequality of the form
\[ H^2 \le C_1 \cdot H + C_2 \cdot t. \]
Because of the dependence of this inequality on the time where the maximum occurs, if the $t^*$ maximizing the function over all $[0, T]$ is considered, then the result will depend on $t^*$, giving a bound like $H \le \frac{\sqrt{2C_2 t^*}}{t}$. However, since we are able to do the computation at $t$, this problem does not arise.

References

1. Andersen, R., Chung, F., Lang, K.: Local partitioning for directed graphs using PageRank. Internet Math. 5(1–2), 3–22 (2008)
2. Bauer, F., Horn, P., Lin, Y., Lippner, G., Mangoubi, D., Yau, S.-T.: Li-Yau inequality on graphs. J. Differ. Geom. 99(3), 359–405 (2015)
3. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1), 107–117 (1998). Proceedings of the Seventh International World Wide Web Conference
4. Chung, F.: The heat kernel as the pagerank of a graph. Proc. Natl. Acad. Sci. 104(50), 19735–19740 (2007)
5. Chung, F.: PageRank as a discrete Green’s function. In: Geometry and Analysis, no. 1, volume 17 of Advanced Lectures in Mathematics (ALM), pp. 285–302. International Press, Somerville (2011)
6. Chung, F., Lin, Y., Yau, S.-T.: Harnack inequalities for graphs with non-negative Ricci curvature. J. Math. Anal. Appl. 415(1), 25–32 (2014)
7. Chung, F.R.K.: Spectral Graph Theory, volume 92 of CBMS Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC; by the American Mathematical Society, Providence (1997)
8. Delmotte, T.: Parabolic Harnack inequality and estimates of Markov chains on graphs. Rev. Mat. Iberoamericana 15(1), 181–232 (1999)
9. Hamilton, R.S.: A matrix Harnack estimate for the heat equation. Comm. Anal. Geom. 1(1), 113–126 (1993)
10. Horn, P.: A spacial gradient estimate for solutions to the heat equation on graphs. SIAM J. Discrete Math. 33(2), 958–975 (2019)
11. Horn, P., Lin, Y., Liu, S., Yau, S.-T.: Volume doubling, Poincaré inequality and Gaussian heat kernel estimate for non-negatively curved graphs. J. Reine Angew. Math. (to appear)
12. Jeh, G., Widom, J.: Scaling personalized web search. In: Proceedings of the 12th World Wide Web Conference (WWW), pp. 271–279 (2003)
13. Li, P., Yau, S.-T.: On the parabolic kernel of the Schrödinger operator. Acta Math. 156(3–4), 153–201 (1986)
14. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. In: Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, pp. 161–172 (1998)

A Persistent Homology Perspective to the Link Prediction Problem

Sumit Bhatia1(B), Bapi Chatterjee2, Deepak Nathani3, and Manohar Kaul3

1 IBM Research AI, New Delhi, India [email protected]
2 Institute of Science and Technology, Klosterneuburg, Austria [email protected]
3 Indian Institute of Technology, Hyderabad, Sangareddy, India {me15btech11009,mkaul}@iith.ac.in

Abstract. Persistent homology is a powerful tool in Topological Data Analysis (TDA) to capture topological properties of data succinctly at different spatial resolutions. For graphical data, shape and structure of the neighborhood of individual data items (nodes) is an essential means of characterizing their properties. We propose the use of persistent homology methods to capture structural and topological properties of graphs and use it to address the problem of link prediction. We achieve encouraging results on nine different real-world datasets that attest to the potential of persistent homology based methods for network analysis.

1 Introduction

A graph structure representing pairwise relations or interactions among individuals or entities recurs in diverse real-world applications such as social and professional networks, biological phenomena such as protein-protein interactions [10], and citation and collaboration networks [4]. In all these applications, understanding how the network evolves and the ability to predict the formation of new, hitherto non-existent links is extremely useful and has crucial applications such as predicting target genes for cancer research [31], social network analysis, and recommendation systems.

The Link Prediction Problem: Let U denote the set of all possible edges in graph G = (V, E) with V as the vertex set, E as the edge set, and n = |V|. If G is undirected, |U| = C(n, 2) = n(n − 1)/2, whereas, if G is directed, |U| = 2 × C(n, 2) = n(n − 1). The set U − E is called the set of potential links. Often, in real-world settings, only a small subset of links u ⊆ U will materialize in future, with |u| ≪ |U|. For example, in a typical social network that has hundreds of millions of users (nodes), each user may be friends (form an edge) with only a few hundred users. Given G = (V, E), the task of identifying the edges e ∈ u is challenging and requires understanding and modelling the differences between the sets u and U − u.

B. Chatterjee: Supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 754411 (ISTPlus).
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 27–39, 2020. https://doi.org/10.1007/978-3-030-36687-2_3

Why Persistent Homology for Link Prediction? Persistent homology (PH) [11,12] is an algebraic tool for describing the structural features of a topological space at different spatial resolutions. By embedding a high-dimensional dataset in a topological space, PH allows us to extract and study crucial information about the structure and shape of the dataset in a succinct manner. Since understanding the evolution and formation of edges in networks involves analyzing the structure and shape of the underlying networks, we posit that PH offers a theoretically sound framework to study such topological properties of networks. As an emerging technique in data mining, PH has been successfully applied in various applications such as text analysis [42], image analysis [8], temporal network analysis [16,33], and network classification [7].

Persistence Diagram: A popular tool from the realm of PH is the persistence diagram (PD). Homology of a point set X roughly characterizes it in terms of shape-features like connected components, tunnels and voids. Given a graph G = (V, E), mapping the nodes $v_i \in V$ to the points $\{x_i\}_{1 \le i \le n} \in X$, homology of X exhibits G in terms of the shape-features formed by its nodes and edges. However, these features depend a lot on the resolution or the scale at which they are studied, and it is crucial to study them across a spectrum of spatial resolutions. The features that persist across resolutions constitute its persistent homology (PH), represented by its PDs. A PD is depicted as a set of points in a two-dimensional space whose coordinates correspond to the resolutions at which the topological features are “born” and, subsequently, “die”.
Differences between the PHs of two graphs (or subgraphs) can be captured by a dissimilarity measure such as the Wasserstein or Bottleneck distance [12, Chapter VIII] between their corresponding PDs. Using such dissimilarity measures between PDs, we study how the PH of an adaptive-sized extended neighborhood of the query nodes changes when an edge is added to (or removed from) the graph.

Our Contributions: We describe a novel approach for predicting links in networks by utilizing the Persistence Diagrams of different neighborhood sub-graphs around the query nodes. Specifically, we characterize the existence of a potential link between a pair of query nodes in terms of a dissimilarity measure between a number of specially constructed neighborhoods. We first present the necessary mathematical notions to describe our method: the PD of a graph and the distance measures between PDs (Sect. 2). We then argue and explain that for a pair of nodes, the PDs of the subgraph induced by their extended neighborhood should not change much by addition or removal of a naturally existing edge. We also provide a theoretical insight into the working of our approach (Sect. 3). We describe and discuss the experiments conducted using nine different real-world network datasets that provide strong empirical evidence for the potential of application of PH for link prediction, and network analysis in general. Our proposed approach achieves robust performance across all the datasets when compared with six commonly used baseline methods for link prediction (Sect. 4).

Overview of and Comparison with Related Work: Most methods for link prediction utilize the structural properties of the underlying network to predict formation of new edges. Some of the most frequently used methods [1,29] utilize the intuition that the likelihood of a link between two nodes is high if they share many common neighbors. Despite being widely adopted due to their intuitive nature and ease of computation, such methods are limited to the second order neighborhood of the source node and ignore the global structural information about the underlying network. On the other hand, studying the shape features of the graph at varying resolutions enables us to capture the global structure information. Other approaches that consider global information for link prediction include measures based on an ensemble of all paths (such as the Katz score [21]), measures derived from conducting random walks over the graph [2,18], and learning continuous vector representations of nodes in the graph such that the nodes sharing similar structural properties are mapped close to each other in the latent space (e.g., DeepWalk [34], LINE [38], node2vec [15], struc2vec [35]). Ensemble methods that complement the network information with external information such as text documents have also been proposed [6]. In contrast to these methods that need to explore the entire graph for capturing global information, our approach is adaptive: we only study the combined neighborhood whose size varies depending on the sparsity of the graph. Thus, we can also avoid the large cost of exploring the entire graph.

2 Persistent Homology of a Graph

For a self-contained exposition, we briefly present the definitions of main concepts used in this work. For a detailed description of the concepts used, an interested reader may refer to advanced textbooks on computational topology, such as the one by Edelsbrunner and Harer [12]. A quick yet sufficient introduction to some more basic concepts can also be found in the extended pre-print of this paper [5].

Persistence Diagram: Let $\Delta$ be a finite abstract simplicial complex and $\{\Gamma_i\}_{i \in I}$ s.t. $\emptyset = \Gamma_0 \subseteq \Gamma_1 \subseteq \Gamma_2 \dots \subseteq \Gamma_p = \Delta$ be a filtration of $\Delta$. For a pair $i, j$ s.t. $0 \le i \le j \le p$, this inclusion relation among the $\Gamma_i$ induces a homomorphism on the simplicial homology group of each dimension $n \in \mathbb{Z}$ given by $f_n^{i,j} : H_n(\Gamma_i) \to H_n(\Gamma_j)$. The $n$th persistent homology (PH) group is the image of the homomorphism $f_n^{i,j}$, given by $\mathrm{Im}(f_n^{i,j})$. In turn, the $n$th persistent Betti number is defined as the rank of $\mathrm{Im}(f_n^{i,j})$, given by $\beta_n^{i,j} = \mathrm{rank}(\mathrm{Im}(f_n^{i,j}))$. The $n$th persistent Betti number counts how many homology classes of dimension $n$ survive a passage from $\Gamma_i$ to $\Gamma_j$. We say that a homology class $\alpha \in H_n(\Gamma_i)$ is born at resolution $i$ if it did not come from a previous sub-complex: $\alpha \notin \mathrm{Im}(f_n^{i-1,i})$. Similarly, we say that a homology class dies at resolution $j$ if it does not belong to the sub-complex $\Gamma_j$ and belonged to previous sub-complexes. A persistence diagram (PD) is a plot of the points $(i, j)$ corresponding to the birth and death resolutions, respectively, for each of the homology classes. Because a homology class cannot die before it is born, every point $(i, j)$ lies above the diagonal $x = y$. If a homology class does not die after its birth, we draw a vertical line starting from the diagonal in correspondence to its birth. For practical purposes, we take a persistence threshold $\tau$, and assume that every homology class dies at the resolution $\tau$. A typical PD is shown in Fig. 1.

Fig. 1. PD (axes: Birth, Death)

Distance Between PDs: Let $P_1$ and $P_2$ be two PDs. Let $\eta$ be a bijection between the points in the two diagrams. We define the following two distance measures:

(a) Wasserstein-q distance:
\[ W_q(P_1, P_2) = \left( \inf_{\eta : P_1 \to P_2} \sum_{p \in P_1} \|p - \eta(p)\|_\infty^q \right)^{\frac{1}{q}} \tag{1} \]

(b) Bottleneck distance:
\[ W_\infty(P_1, P_2) = \inf_{\eta : P_1 \to P_2} \sup_{p \in P_1} \|p - \eta(p)\|_\infty \tag{2} \]
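For very small diagrams, Eqs. (1) and (2) can be evaluated by brute force over all bijections. The sketch below is only an illustration of the two formulas as stated: it assumes both diagrams have the same number of points, whereas practical implementations usually also allow points to be matched to the diagonal. The function names and the two example diagrams are hypothetical.

```python
from itertools import permutations

# Brute-force evaluation of Eqs. (1) and (2) for two small persistence diagrams.
# Assumptions: both diagrams have the same number of points, so eta ranges over
# plain bijections as in the text; matching to the diagonal is not handled.

def linf(p, q):
    """L-infinity distance between two (birth, death) points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def wasserstein(P1, P2, q=2):
    """Wasserstein-q distance: minimize the q-th power sum over all bijections."""
    best = min(
        sum(linf(p, P2[j]) ** q for p, j in zip(P1, perm))
        for perm in permutations(range(len(P2)))
    )
    return best ** (1.0 / q)

def bottleneck(P1, P2):
    """Bottleneck distance: minimize the largest matched distance."""
    return min(
        max(linf(p, P2[j]) for p, j in zip(P1, perm))
        for perm in permutations(range(len(P2)))
    )

# Two hypothetical diagrams with three (birth, death) points each.
P1 = [(0.0, 1.0), (0.0, 3.0), (1.0, 2.0)]
P2 = [(0.0, 1.5), (0.0, 3.0), (1.0, 4.0)]

w2 = wasserstein(P1, P2, q=2)
w_inf = bottleneck(P1, P2)
```

The enumeration costs on the order of n! evaluations, so this is useful only for checking intuition on toy inputs.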

The Wasserstein-q distance is sensitive to small differences in the PDs, whereas the Bottleneck distance captures relatively large differences.

Rips Complex: A Vietoris-Rips complex, also called a Rips complex, is an abstract simplicial complex defined over a finite set of points $X = \{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ in a metric space $(\mathcal{X}, d)$. Given $X$ and a real number $r > 0$, a Rips complex $R(X, r)$ is formed by connecting the points for which the balls of radius $\frac{r}{2}$ centered at them intersect. In the context of the same point set, we use $R_r$ to denote $R(X, r)$. A 1-simplex is formed by connecting two such points and corresponds to an edge. A 2-simplex is formed by 3 such points and corresponds to a triangular face.

Rips Filtration: Given a set of points $X = \{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$, let $0 = r_0 \le r_1 \le r_2 \dots \le r_m$ denote a finite sequence of increasing real numbers, which we use to construct Rips complexes $\{R_{r_i}\}_{i=1}^{m}$ as defined above. Clearly, by construction of Rips complexes the sequence $\{R_{r_i}\}_{i=1}^{m}$ is nested and thus provides a filtration of $R_{r_m}$:
\[ \emptyset = R_{r_0} \subseteq R_{r_1} \subseteq R_{r_2} \dots \subseteq R_{r_m}. \]
Deriving the PH groups via homomorphism over a Rips filtration, we obtain a PD associated with the point set $X$. Please note that to compute the Rips filtration associated with a point set $X$ we need only the relative pairwise distances between the points $x_i \in X$. Essentially, we need a symmetric distance matrix $D = \{d(x_i, x_j)\}_{i=1,j=1}^{n,n}$ to compute the PD of $X$. Next, we will use this method to compute the PD of a graph.

Remark: Without going into details, we would like to mention that there are many choices of filtration and distance metric available when applying PH to a graph¹; however, for this application, computational simplicity and well-developed software that could scale to real-world datasets were the main factors for us to decide on the Rips filtration with the shortest-path metric.

2.1 Persistence Diagram of a Graph

Consider a graph $G = (V, E)$, where $V = \{v_i\}_{i=1}^{n}$ is the node set and $E = \{e_i\}_{i=1}^{m}$ is the edge set. We associate a positive weight $w_{e_i} \in \mathbb{R}$, $w_{e_i} > 0$, with each of the elements $e_i \in E$. For an unweighted graph, $w_{e_i} = 1$, $\forall e_i \in E$. If two nodes are not connected by an edge, we take the (virtual) edge-weight between them as $\infty$, which for practical purposes is taken as a large positive real number $M \in \mathbb{R}$. The shortest-path distance $D_{sp}(v_i, v_j)$ between the nodes $v_i, v_j \in V$ is defined as the sum of the weights of the edges on a minimum-weight path starting at $v_i$ and terminating at $v_j$.

Now consider a metric space $(\mathcal{X}, d)$ equipped with a metric $d$. Let $X = \{x_i\}_{i=1}^{n}$ be a set of points in $(\mathcal{X}, d)$ such that the points in $X$ correspond to the nodes in $V = \{v_i\}_{i=1}^{n}$. In an undirected graph, where the shortest-path distance $D_{sp}$ between any two nodes is symmetric, it makes a natural choice for a metric. We can verify that $D_{sp}$ satisfies all the properties of a metric: for arbitrary $v_i, v_j, v_k \in V$, (a) $D_{sp}(v_i, v_j) \ge 0$, (b) $D_{sp}(v_i, v_j) = 0 \iff v_i = v_j$, (c) $D_{sp}(v_i, v_j) = D_{sp}(v_j, v_i)$ and (d) $D_{sp}(v_i, v_j) + D_{sp}(v_j, v_k) \ge D_{sp}(v_i, v_k)$. Therefore, for points $x_i, x_j \in X$, which correspond to $v_i, v_j \in V$, we take the metric as $d(x_i, x_j) = D_{sp}(v_i, v_j)$.

For a directed graph, the shortest-path distance between two nodes is not symmetric. In this case, $d(x_i, x_j) = D_{sp}(v_i, v_j)$ provides a quasi-metric: it satisfies (a), (b) and (d) as described above. From a quasi-metric $d(x_i, x_j)$, we derive a metric as follows: $f_a(x_i, x_j) = a \times d(x_i, x_j) + (1 - a) \times d(x_j, x_i)$ where $a \in [0, 1/2]$ [40]. For $a = \frac{1}{2}$, $f_a(x_i, x_j)$ is the average of the two directed distances. In this work, for a metric space representation of a directed graph, we take $d(x_j, x_i) = \frac{D_{sp}(v_i, v_j) + D_{sp}(v_j, v_i)}{2}$, where $x_i, x_j \in X$ correspond to $v_i, v_j \in V$.

Computing the all-pair-shortest-path (APSP) in an undirected graph [19] gives a symmetric distance matrix $D = \{d_{ij}\}_{i=1,j=1}^{n,n}$. For a directed graph, the distance matrix is not symmetric; therefore, to impose a metric structure we apply the aforementioned method and replace both $d_{ij}$ and $d_{ji}$ by $\frac{d_{ij} + d_{ji}}{2}$. With that, we have a complete pipeline to compare the shape-features of two graphs (or subgraphs) using PH.
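The distance-matrix step can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: it uses Floyd-Warshall for brevity where the paper cites Johnson's algorithm [19], and the example graph, the function names, and the constant `BIG_M` (standing in for the large number $M$) are assumptions made here.

```python
# Shortest-path distance matrix of a small directed graph, symmetrized by
# averaging the two directed distances.
# Assumptions: Floyd-Warshall is used here for brevity (the paper cites
# Johnson's algorithm [19]); the example graph and the constant BIG_M, which
# plays the role of the large number M in the text, are illustrative choices.

BIG_M = 1e6

def shortest_path_matrix(n, weighted_edges):
    """All-pairs shortest-path distances for directed edges (i, j, weight)."""
    d = [[0.0 if i == j else BIG_M for j in range(n)] for i in range(n)]
    for i, j, w in weighted_edges:
        d[i][j] = min(d[i][j], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def symmetrize(d):
    """Average the two directed distances so the result is symmetric."""
    n = len(d)
    return [[(d[i][j] + d[j][i]) / 2.0 for j in range(n)] for i in range(n)]

# Directed 3-cycle 0 -> 1 -> 2 -> 0 with unit weights.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0)]
directed = shortest_path_matrix(3, edges)
metric = symmetrize(directed)
```

In this example the directed distances differ by direction (one hop one way, two hops back), and averaging gives 1.5 for every pair. The resulting symmetric matrix is the kind of input a Rips-filtration routine expects.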

3 Link Prediction via Persistent Homology

¹ https://topology.ima.umn.edu/node/53.

Having discussed the background needed to quantify the differences between a pair of subgraphs with respect to their shape-features, we describe how to use it to understand and predict the existence of a potential link. First, we summarize the entire pipeline of computing the PD for a graph G.

PD Computation: To start with, we compute the all-pair-shortest-path distance matrix D using Johnson's algorithm [19]. If G is directed, D is made symmetric as described in Sect. 2.1. Thereafter, D and a persistence threshold τ are used to compute the PD of G. Efficient implementations of PD computation, such as the one by Bauer [3], can be used for this purpose.

Now consider the combined-neighborhood of nodes u and v as shown in Fig. 2. We consider two scenarios with respect to reasonably extended neighborhoods of the two nodes, shown in Fig. 2(a) and (b). A potential link is shown by the dotted curve. Essentially, predicting a link between an arbitrary pair of nodes lies on a spectrum of scenarios that starts at the one shown in Fig. 2(a) and stretches towards ones similar to Fig. 2(b). As we explained before, a link is more likely to exist as we move away from the case of Fig. 2(a) on this spectrum.

[Figure: two panels, each showing the k-nbd of u and the k-nbd of v.]
Fig. 2. Combined-neighborhood of u and v, when they have (a) no edges connecting them and (b) multiple edges connecting them.

With that observation, we explore how the difference in shape-measures, given by the distances between the PDs of a number of subgraphs induced by the combined-neighborhood of u and v, varies across arbitrary pairs of nodes. This is presented in Algorithm 1. Given a graph $G = (\{v_i\}_{i=1}^{n}, \{e_k\}_{k=1}^{m})$, for a $k \le n$, we first compute the subgraphs of G induced by the i-hop neighbors of u and of v, where $1 \le i \le k$ (lines 2 and 3). Thereafter, we compute the subgraph induced by the i-hop combined-neighborhood of the two nodes, where $1 \le i \le r$ (line 4). The radii of the individual and combined neighborhoods, k and r, respectively, are chosen such that there is a positive probability of the combined-neighborhood being covered by the union of the two individual neighborhoods, and therefore $k \le 2r$.
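The neighborhood steps (lines 2 to 4 of Algorithm 1) and the two edge variants (lines 5 and 6) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the graph is assumed undirected and stored as a dictionary of neighbor sets, and the query node itself is assumed to belong to its own neighborhood.

```python
from collections import deque


def k_hop_nodes(adj, source, k):
    """Nodes within k hops of `source` (BFS); `source` itself is included (assumption)."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if hops[node] == k:
            continue  # do not expand beyond radius k
        for nbr in adj[node]:
            if nbr not in hops:
                hops[nbr] = hops[node] + 1
                queue.append(nbr)
    return set(hops)


def induced_subgraph(adj, nodes):
    """Keep only the given nodes and the edges between them."""
    return {n: {m for m in adj[n] if m in nodes} for n in nodes}


def combined_neighborhood(adj, u, v, r):
    """Subgraph induced by the nodes within r hops of u or v or both."""
    return induced_subgraph(adj, k_hop_nodes(adj, u, r) | k_hop_nodes(adj, v, r))


def with_edge(sub, u, v):
    """Copy of the subgraph augmented with the edge (u, v)."""
    out = {n: set(nbrs) for n, nbrs in sub.items()}
    out[u].add(v)
    out[v].add(u)
    return out


def without_edge(sub, u, v):
    """Copy of the subgraph with the edge (u, v) removed if present."""
    out = {n: set(nbrs) for n, nbrs in sub.items()}
    out[u].discard(v)
    out[v].discard(u)
    return out


# Path graph 0 - 1 - 2 - 3 - 4 - 5 with query nodes u = 0 and v = 5.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
n_u = induced_subgraph(path, k_hop_nodes(path, 0, 1))
n_uv = combined_neighborhood(path, 0, 5, 2)
n_uv_plus = with_edge(n_uv, 0, 5)
n_uv_minus = without_edge(n_uv, 0, 5)
```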
From this subgraph, we induce two subgraphs corresponding to the existence and non-existence of a link between the query nodes (lines 5 and 6). Following our intuition that a missing link in a complete graph is highly likely to exist, we also construct a complete graph over the nodes of the combined neighborhood (line 7). Having collected these subgraphs, we compute their PDs as described previously. In the PDs, we consider only the 0th PH groups. This is because the cycles in a graph, which correspond to its 1st PH group, are never destroyed as there are no 2-faces. Thus, for our purpose, distances between the 1-dimensional PDs of the subgraphs would not help much. In the subsequent discussion, by topological features we mean the 0-dimensional features, i.e., the number of connected components. We compute the Wasserstein-2 distances $d_1$, $d_2$, $d_3$ and $d_4$, and the Bottleneck distances $d_5$, $d_6$, $d_7$ and $d_8$ between the PDs, as shown in lines 11 to 15. They signify how dissimilar the induced subgraphs are with respect to their shape-features. We use the $d_i$, $1 \le i \le 8$, in our experiments to perform link prediction as a ranking task (Sect. 4).

Algorithm 1. Bottleneck and Wasserstein-2 distance computation.
Input: Graph G, nodes u, v, neighborhood radius k, combined-neighborhood radius r, persistence threshold τ, a boolean isD to indicate if directed.
 1: Algorithm GetDist(G, u, v, k, r, τ, isD)
 2:   $N_u^k$ ← GetNbrs(u, k);   // Induced subgraph over i-hop neighbors of u, where 1 ≤ i ≤ k.
 3:   $N_v^k$ ← GetNbrs(v, k);
 4:   $N_{u,v}^{r}$ ← GetCombinedNbrs(u, v, r);   // Induced subgraph over i-hop neighbors of u or v or both, where 1 ≤ i ≤ r.
 5:   $N_{u,v}^{r+}$ ← $N_{u,v}^{r}$ ∪ (u, v);   // Induced subgraph $N_{u,v}^{r}$ augmented with the edge (u, v).
 6:   $N_{u,v}^{r-}$ ← $N_{u,v}^{r}$ − (u, v);   // Induced subgraph $N_{u,v}^{r}$ without the edge (u, v).
 7:   $C(N_{u,v}^{r})$ ← MakeComplete($N_{u,v}^{r}$);   // The complete graph over the nodes of $N_{u,v}^{r}$.
 8:   $P_u$ ← PD($N_u^k$, τ, isD);   // Persistence diagram of the subgraph induced by $N_u^k$.
 9:   $P_v$ ← PD($N_v^k$, τ, isD); $P^{+}$ ← PD($N_{u,v}^{r+}$, τ, isD);
10:   $P^{-}$ ← PD($N_{u,v}^{r-}$, τ, isD); $P^{c}$ ← PD($C(N_{u,v}^{r})$, τ, isD);
11:   $d_1$ ← W-2-Dis($P^{+}$, $P^{-}$); $d_2$ ← W-2-Dis($P^{+}$, $P^{c}$);
12:   $d_3$ ← W-2-Dis($P^{+}$, $P_u$); $d_4$ ← W-2-Dis($P^{+}$, $P_v$);
13:   // Wasserstein-2 distances between the Ps
14:   $d_5$ ← B-Dis($P^{+}$, $P^{-}$); $d_6$ ← B-Dis($P^{+}$, $P^{c}$);
15:   $d_7$ ← B-Dis($P^{+}$, $P_u$); $d_8$ ← B-Dis($P^{+}$, $P_v$);
16:   // Bottleneck distances between the Ps
17:   $\tilde{d}$ ← {$d_1$, $d_2$, $d_3$, $d_4$, $d_5$, $d_6$, $d_7$, $d_8$};   // A vector of the eight distances.
18:   Output $\tilde{d}$;
19: end Algorithm

Computational Cost: To implement Algorithm 1, we leveraged parallelization as much as possible. For example, for the shortest-path computation, we use a simple shared-memory, thread-based parallelization that applies Dijkstra's algorithm, which runs in $\tilde{O}(|V|^2)$ (assuming $|V| > |E|$), from each of the nodes, and thus pay roughly $\tilde{O}(|V|^3 / p)$, where p is the number of threads; we store the APSP matrix in a database. The neighborhood and combined-neighborhood computation steps are linear in the maximum degree, thus $O(|V|)$. The PD computation is performed by a reduction of the APSP matrix at a cost of $O(|V|^3)$ arithmetic operations. The $W_q$ and $W_\infty$ distance computation steps are linear in the size of the PDs.
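Because only the 0th PH groups are used, the PD step can be illustrated without a full persistence library. The sketch below is a stand-in for Ripser, not the authors' code: in a Vietoris–Rips filtration every 0-dimensional class is born at 0 and dies at the distance at which its component merges with another, so a union-find pass over the sorted pairwise distances suffices. The function name and the exact treatment of the threshold τ (merges above τ are simply not performed) are assumptions.

```python
import math


def zero_dim_persistence(dist, tau=math.inf):
    """0-dimensional persistence pairs (birth, death) from a distance matrix.

    Components that never merge at a finite distance <= tau get death = math.inf.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    diagram = []
    for d, i, j in pairs:
        if d > tau or math.isinf(d):
            break  # remaining pairs are beyond the threshold
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j      # two components merge at distance d
            diagram.append((0.0, d))     # one of them dies here
    survivors = n - len(diagram)
    diagram.extend((0.0, math.inf) for _ in range(survivors))
    return diagram


# Shortest-path distances of the unweighted path graph 0 - 1 - 2 - 3.
path_dist = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
pd_path = zero_dim_persistence(path_dist)

# Two close points and one far point, with a threshold below the far distance.
far_dist = [[0, 1, 5], [1, 0, 5], [5, 5, 0]]
pd_far = zero_dim_persistence(far_dist, tau=4)
```

For the path graph, three components die at distance 1 and one survives; with the threshold of 4, the far point never merges and contributes a second infinite bar.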

Effectively, Algorithm 1 costs $O(|V|^3)$. Next, we sketch a theoretical justification of our approach.

3.1 Why This Algorithm Works?

While the commonly used link-prediction heuristics [1,23,29,39] have been empirically validated, to the best of our knowledge, only a limited number of works [9,36] have explored why such methods should work. McPherson et al. [28] suggest that networks of real-life interactions stem from homophily. Hoff et al. [17] introduced a statistical model for such networks, which was extended by Sarkar et al. [36]. Essentially, all these models represent a graph node by a point in a latent d-dimensional Euclidean space and suggest that the probability of the existence of a link between two query nodes u and v can be defined in terms of a parameterized logistic function of the distance between the corresponding points as follows [36]:

$$P(u \sim v \mid d_{uv}) = \frac{1}{1 + e^{\alpha (d_{uv} - r)}} \qquad (3)$$

where $u \sim v$ denotes the existence of a link between the nodes u and v, and $\alpha$ and $r$ are the parameters of function-sharpness and sociability of the nodes, respectively. Thus, a smaller distance $d_{uv}$ in the latent space implies a higher probability of a link between u and v.

Within the space available, we now explain how decreasing the distances $d_1$ to $d_8$ in Algorithm 1 corresponds to decreasing $d_{uv}$ in Eq. (3). First, note that the distances $d_i$, $1 \le i \le 8$, are essentially based on the optimal matchings between the PDs and behave very differently from Euclidean metrics. See Eqs. (1) and (2): the higher the value of $\eta(p)$ for each $p \in P_1$, the lower are $W_q(P_1, P_2)$ and $W_\infty(P_1, P_2)$. $\eta(p)$, as a bijection, represents matchings between the PDs $P_1$ and $P_2$. Thus, lower values of the $d_i$ reflect higher matchings between the PDs, indicating that the compared subgraphs have more similar topological features. An attentive reader will also have noticed that the PDs that we compare to generate the $d_i$ correspond to the simplicial sub-complexes over subsets of the same dataset, obtained by the embedding of a graph in a metric space. It is easy to observe that these subsets overlap by virtue of the construction of the subgraphs induced by the combined neighborhoods of the query nodes. In this setting, a higher matching in the PDs indicates highly similar topological features, and these similar features are over the common subset of the subgraphs. Now, we discuss the individual subgraph comparison summaries captured by the $d_i$:
(a) $d_1$ and $d_5$: smaller values of $d_1$ and $d_5$ indicate that augmenting a possible edge between the query nodes does not change the topological features of the subgraph induced by the combined neighborhood.
(b) $d_3$ and $d_7$: their smaller values imply that the combined neighborhood itself is not much different from the neighborhood of the first node in terms of the topological features.
(c) $d_4$ and $d_8$: same as (b) for the second node.
(d) $d_2$ and $d_6$: smaller values of $d_2$ and $d_6$ indicate that the subgraph induced by the combined-neighborhood is closer to a complete graph in terms of topological features.
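The logistic link model of Eq. (3) is easy to evaluate numerically. The following sketch only illustrates the formula; the function name and the parameter values in the example are hypothetical and are not taken from the paper.

```python
import math


def link_probability(d_uv, alpha, r):
    """Eq. (3): P(u ~ v | d_uv) = 1 / (1 + exp(alpha * (d_uv - r)))."""
    return 1.0 / (1.0 + math.exp(alpha * (d_uv - r)))


# Illustrative parameters (assumed): sharpness alpha = 2, sociability radius r = 1.
probabilities = [link_probability(d, alpha=2.0, r=1.0) for d in (0.5, 1.0, 2.0)]
```

For a positive $\alpha$ the probability decreases monotonically in $d_{uv}$ and equals exactly 1/2 at $d_{uv} = r$, which is the sense in which a smaller latent distance implies a more likely link.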


Let $n_l(u, v)$ denote the number of paths of length $l$ between the nodes u and v. From the above summary, in general terms, it can be inferred that the smaller the values of the $d_i$, $1 \le i \le 8$, (a) the closer the combined-neighborhood lies to the structure shown in Fig. 2(b) on the spectrum of scenarios mentioned in Sect. 3 (for example, smaller $d_2$ and $d_6$ indicate that the combined-neighborhood is similar to a complete graph, in which the likelihood of completion of a missing link is very high), and (b) because of the higher overlap of the neighborhood subgraphs, $n_l(u, v)$ is non-zero for an increasing number of small path lengths $l$. In our method, the metric space embedding of the graph translates it into a point cloud in Euclidean space where, even though the points are at nondeterministic positions, the distances between them are deterministic. Essentially, it aligns with the deterministic model (see Sects. 3 and 4 of [36]), with (a) identical radii for unweighted graphs and (b) non-identical radii for weighted graphs. Thus, in the spirit of the discussion in Sect. 5 on the bounds over $d_{uv}$, Lemma 5.7, and Theorem 5.8 in the paper by Sarkar et al. [36], and inferring from point (b) in the previous paragraph, $P(u \sim v \mid d_{uv})$ increases as the values of the $d_i$, $1 \le i \le 8$, decrease.

4 Experiments

4.1 Experimental Protocol

Datasets: Table 1 lists the nine publicly available datasets that were used for evaluating our proposed approach. The datasets are from different domains and are widely used in the study of complex networks.

Table 1. Different datasets used in experiments

Dataset        # nodes   # edges   N/w type
DC [32]            112       425   Word co-occurrence n/w
ATC (a)           1226      2615   Air traffic n/w
Cora [26]         2708      5429   Citation n/w
Euroad [37]       1174      1417   Road n/w
Figeyes [14]      2239      6452   Protein interaction n/w
Yeast [10]        1870      2277   Protein interaction n/w
Power [41]        4941      6594   Power grid n/w
arXiv [24]        5242     14496   Collaboration n/w
Twitter [27]     23370     33101   Social n/w
(a) http://research.mssm.edu/maayan/datasets/qualitative_networks.shtml

Baselines: We compare the performance of our approach with six frequently used methods for link prediction. We consider Common Neighbors (CN), Adamic-Adar (AA) [1] and Milne-Witten (MW) [29] as representative local methods. We chose Preferential Attachment (PA), node2vec (N2V) [15], and struc2vec (S2V) [35] as representative global methods.

Implementation: We implemented our approach in C++ using the Ripser library [3] for computing PDs. We used the publicly available code² of Kerber et al. [22] to compute the $W_2$ and $W_\infty$ distances. We fixed the persistence threshold τ = 4, since it was empirically found that beyond τ = 4 the PD did not change. The neighborhood and combined-neighborhood radii k and r are taken as $L/4$ and $L/2$ (rounded to integers), respectively, where L is the shortest-path distance between the two query nodes. This selection of k and r adapts to the position of the query nodes and ensures that there is a reasonable intersection of their neighborhoods. Empirically, we found that increasing these values did not change the distances $d_i$ but only increased the computation time. We implemented the baselines AA, MW, CN, and PA in C++ and used the author-provided source code for node2vec and struc2vec. For all the datasets, we removed 5% of the edges, making sure that the residual graph remains connected. We then compare the performance of the different methods in recovering the removed edges using information from the residual graph (Sect. 4.2). All the datasets and our source code are available for download.³

² https://bitbucket.org/grey_narn/hera/src/master/.

4.2 Results and Discussions

Traditionally, the problem of link prediction has been addressed as a ranking problem where, given a source node, a ranked list of target nodes is produced, ordered by the likelihood of a potential link being formed between the source and the target nodes [20,25]. The baselines CN, AA, MW, and PA by definition output a score between the source and target node that can be used as the ranking function. The other two baselines, N2V and S2V, learn continuous vector representations for each node in the graph. A typical way to rank target nodes given a source node is to rank them based on their distance from the source node [30]. Hence, for these methods, given a source node, we produce a ranked list of all the other nodes in the graph ordered by the Euclidean distance between the source and target node vectors. Given a pair of source and target nodes, our proposed approach produces eight different distance values (Algorithm 1) capturing different topological properties. To produce a ranked list that combines the properties captured by the different distance functions, we use the rank product metric [13] to combine the ranked lists produced by the individual distance functions and obtain the final ranking of target nodes with respect to a given source node. For a node i, the rank product is computed as $rp_i = \left( \prod_{j=1}^{m} r_{ij} \right)^{1/m}$, where $r_{ij}$ is the rank of node i in the j-th ranked list. Table 2 summarizes the results achieved by the six baselines and our proposed approach (PH). We report Hit Rate@N (for N ∈ {10, 50, 100}), i.e., the proportion of edges for which the correct target node was ranked in the top N positions. Observe that our approach outperforms the baselines in most cases, and is a close second in others. Also note that while the methods based on immediate neighborhood achieve the best values for five out of nine datasets in terms of Hits@10, the methods that utilize global network information generally outperform the local methods at higher ranks.
This is expected as the local methods work in a small, though highly relevant, search space of nodes in the immediate neighborhood of the query nodes. Thus, they are able to predict the links for the few test cases that lie in this small search space. However, they fail for hard test cases that lie outside this search space. For instance, in the Euroad dataset, only 6 out of 70 test cases lie in the first-order neighborhood of the query nodes, resulting in poor performance of the local methods. On the other hand, the global methods (N2V, S2V, PH) outperform at higher ranks as they are not limited to this small search space. The robust performance achieved by the proposed approach, for all the datasets and at different ranks, is commendable given that it uses only eight features (distance functions comparing the topological properties) that can be computed with relative ease compared to the computationally expensive learning of vector representations (as is the case with node2vec and struc2vec). Further, unlike the CN, AA, MW, and PA baselines, which are also easy to compute, the proposed approach is built upon solid theoretical foundations and is not limited to the immediate neighborhood of the query nodes.

³ https://github.com/sumit-research/persistent-homology-link-prediction.

Table 2. Performance of different methods on nine different datasets for the link prediction task. Hits at ranks 10, 50, and 100 are reported. For each dataset, the best value at a given rank is marked with an asterisk.

Hits @ 10
Dataset   CN     AA     MW     PA     S2V    N2V    PH
DC        .190   .285   .142   .333*  .142   .095   .000
ATC       .100*  .061   .053   .023   .038   .061   .077
Cora      .180   .080   .074   .016   .028   .048   .332*
Euroad    .085   .085   .085   .014   .000   .100   .185*
Figeyes   .000   .006   .000   .000   .006   .012*  .003
Yeast     .212   .247*  .159   .008   .017   .150   .183
Power     .227   .209   .182   .000   .015   .246   .267*
Arxiv     .580   .587*  .135   .015   .122   .480   .237
Twitter   .055*  .046   .047   .000   .003   .000   .003

Hits @ 50
Dataset   CN     AA     MW     PA     S2V    N2V    PH
DC        .571   .666   .619   .571   .476   .476   .761*
ATC       .161   .076   .092   .130   .138   .238   .263*
Cora      .232   .080   .074   .038   .052   .118   .338*
Euroad    .085   .085   .085   .028   .114   .557   .600*
Figeyes   .012   .006   .018   .003   .018   .024   .027*
Yeast     .256   .292   .283   .079   .053   .292   .339*
Power     .255   .255   .255   .009   .039   .574   .595*
Arxiv     .849   .874*  .526   .070   .219   .823   .723
Twitter   .085   .053   .161*  .002   .010   .001   .117

Hits @ 100
Dataset   CN     AA     MW     PA     S2V    N2V    PH
DC        .714   .714   .714   1.00*  .952   .952   1.00*
ATC       .161   .076   .092   .215   .184   .384*  .372
Cora      .252   .080   .074   .040   .072   .144   .338*
Euroad    .085   .085   .085   .071   .214   .742*  .728
Figeyes   .015   .006   .024   .015   .043   .043   .046*
Yeast     .256   .292   .292   .159   .106   .362   .385*
Power     .255   .255   .255   .030   .072   .680   .747*
Arxiv     .904   .918*  .709   .114   .238   .897   .865
Twitter   .087   .053   .236   .011   .015   .001   .276*
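The rank-product aggregation and the Hit Rate@N metric used in this section can be sketched as follows. This is an illustrative sketch rather than the authors' evaluation code: the function names, the tie-breaking rule, and the toy data are assumptions.

```python
import math


def ranks_from_distances(distances):
    """Map each candidate to its 1-based rank; a smaller distance gives a better rank.

    Ties are broken by candidate name (an assumption made for determinism).
    """
    ordered = sorted(distances, key=lambda node: (distances[node], node))
    return {node: position for position, node in enumerate(ordered, start=1)}


def rank_product(rank_lists):
    """Geometric mean of each candidate's ranks across the m ranked lists."""
    m = len(rank_lists)
    return {
        node: math.prod(ranks[node] for ranks in rank_lists) ** (1.0 / m)
        for node in rank_lists[0]
    }


def hit_rate_at_n(final_rankings, true_targets, n):
    """Fraction of test edges whose true target appears in the top n positions."""
    hits = sum(
        1 for ranking, target in zip(final_rankings, true_targets)
        if target in ranking[:n]
    )
    return hits / len(true_targets)


# Two hypothetical distance functions over three candidate targets.
d_first = {"a": 0.1, "b": 0.5, "c": 0.3}
d_second = {"a": 0.4, "b": 0.2, "c": 0.3}
rp = rank_product([ranks_from_distances(d_first), ranks_from_distances(d_second)])

# Hit rate over two hypothetical test edges whose true target is "b".
rankings = [["a", "b", "c"], ["c", "a", "b"]]
hits_at_2 = hit_rate_at_n(rankings, ["b", "b"], 2)
```

Candidates with a smaller rank product are ranked higher in the final list; in the toy data, "c" is ranked second by both distance functions and therefore has a rank product of exactly 2.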

5 Conclusions and Future Work

We proposed an approach inspired by persistent homology to model link formation in graphs and used it to predict missing links. Our approach achieved robust and stable performance across nine datasets, outperforming many frequently used baseline methods despite being relatively simple and computationally less expensive. Given that the topological features succinctly capture information about the shape and structure of the network and can be computed without extensive training, it will be worth exploring how these features can be combined with other techniques for network analysis.

References

1. Adamic, L.A., Adar, E.: Friends and neighbors on the web. Soc. Netw. 25(3), 211–230 (2003)
2. Backstrom, L., Leskovec, J.: Supervised random walks: predicting and recommending links in social networks. In: WSDM 2011 (2011)
3. Bauer, U.: Ripser (2018). https://github.com/Ripser/ripser
4. Bhatia, S., Caragea, C., Chen, H.H., Wu, J., Treeratpituk, P., Wu, Z., Khabsa, M., Mitra, P., Giles, C.L.: Specialized research datasets in the citeseerx digital library. D-Lib Mag. 18(7/8) (2012)
5. Bhatia, S., Chatterjee, B., Nathani, D., Kaul, M.: Understanding and predicting links in graphs: a persistent homology perspective. arXiv preprint arXiv:1811.04049 (2018)
6. Bhatia, S., Vishwakarma, H.: Know thy neighbors, and more!: studying the role of context in entity recommendation. In: Hypertext (HT), pp. 87–95 (2018)
7. Carstens, C.J., Horadam, K.J.: Persistent homology of collaboration networks. Math. Probl. Eng. 2013, 7 (2013)
8. Chung, M.K., Bubenik, P., Kim, P.T.: Persistence diagrams of cortical surface data. In: International Conference on Information Processing in Medical Imaging, pp. 386–397 (2009)
9. Cohen, S., Zohar, A.: An axiomatic approach to link prediction. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
10. Coulomb, S., Bauer, M., Bernard, D., Marsolier-Kergoat, M.C.: Gene essentiality and the topology of protein interaction networks. Proc. R. Soc. B: Biol. Sci. 272(1573), 1721–1725 (2005)
11. Edelsbrunner, H., Harer, J.: Persistent homology-a survey. Contemp. Math. 453, 257–282 (2008)
12. Edelsbrunner, H., Harer, J.: Computational Topology - An Introduction. American Mathematical Society, Providence (2010)
13. Eisinga, R., Breitling, R., Heskes, T.: The exact probability distribution of the rank product statistics for replicated experiments. FEBS Lett. 587(6), 677–682 (2013)
14. Ewing, R.M., Chu, P., Elisma, F., Li, H., Taylor, P., Climie, S., McBroom-Cerajewski, L., Robinson, M.D., O'Connor, L., Li, M., et al.: Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol. 3(1), 89 (2007)
15. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD, pp. 855–864 (2016)
16. Hajij, M., Wang, B., Scheidegger, C., Rosen, P.: Visual detection of structural changes in time-varying graphs using persistent homology. In: PacificVis, pp. 125–134. IEEE (2018)
17. Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002)
18. Jeh, G., Widom, J.: SimRank: a measure of structural-context similarity, pp. 538–543. ACM (2002)
19. Johnson, D.B.: Efficient algorithms for shortest paths in sparse networks. J. ACM (JACM) 24(1), 1–13 (1977)
20. Kataria, S., Mitra, P., Bhatia, S.: Utilizing context in generative Bayesian models for linked corpus. In: AAAI, vol. 10, p. 1 (2010)
21. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43 (1953)
22. Kerber, M., Morozov, D., Nigmetov, A.: Geometry helps to compare persistence diagrams. In: 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 103–112. SIAM (2016)
23. Leskovec, J., Backstrom, L., Kumar, R., Tomkins, A.: Microscopic evolution of social networks. In: KDD, pp. 462–470 (2008)
24. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1(1) (2007)
25. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
26. Lu, Q., Getoor, L.: Link-based classification. In: Fawcett, T., Mishra, N. (eds.) ICML, pp. 496–503. AAAI Press (2003). http://www.aaai.org/Library/ICML/2003/icml03-066.php
27. McAuley, J., Leskovec, J.: Learning to discover social circles in ego networks. In: NIPS, pp. 548–556 (2012)
28. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Annu. Rev. Sociol. 27(1), 415–444 (2001)
29. Milne, D., Witten, I.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp. 25–30 (2008)
30. Misra, V., Bhatia, S.: Bernoulli embeddings for graphs. In: AAAI, pp. 3812–3819 (2018)
31. Nagarajan, M., et al.: Predicting future scientific discoveries based on a networked analysis of the past literature. In: KDD, pp. 2019–2028. ACM (2015)
32. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74(3), 036104 (2006)
33. Pal, S., Moore, T.J., Ramanathan, R., Swami, A.: Comparative topological signatures of growing collaboration networks. In: Workshop on Complex Networks CompleNet, pp. 201–209. Springer (2017)
34. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: KDD, pp. 701–710 (2014)
35. Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: learning node representations from structural identity. In: KDD, pp. 385–394 (2017)
36. Sarkar, P., Chakrabarti, D., Moore, A.W.: Theoretical justification of popular link prediction heuristics. In: IJCAI (2011)
37. Šubelj, L., Bajec, M.: Robust network community detection using balanced propagation. Eur. Phys. J. B 81(3), 353–362 (2011)
38. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: Large-scale information network embedding. In: WWW, pp. 1067–1077 (2015)
39. Tang, L., Liu, H.: Relational learning via latent social dimensions. In: KDD, pp. 817–826 (2009)
40. Turner, K.: Generalizations of the rips filtration for quasi-metric spaces with persistent homology stability results. arXiv preprint arXiv:1608.00365 (2016)
41. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393(6684), 440 (1998)
42. Zhu, X.: Persistent homology: an introduction and a new text representation for natural language processing. In: IJCAI (2013)

The Role of Network Size for the Robustness of Centrality Measures

Christoph Martin and Peter Niemeyer

Institute of Information Systems, Leuphana University of Lüneburg, 21335 Lüneburg, Germany
{cmartin,niemeyer}@uni.leuphana.de

Abstract. Measurement errors are omnipresent in network data. Studies have shown that these errors have a severe impact on the robustness of centrality measures. It has been observed that the robustness mainly depends on the network structure, the centrality measure, and the type of error. Previous findings regarding the influence of network size on robustness are, however, inconclusive. Based on twenty-four empirical networks, we investigate the relationship between global network measures, especially network size and average degree, and the robustness of the degree, eigenvector centrality, and PageRank. We demonstrate that, in the vast majority of cases, networks with a higher average degree are more robust. For random graphs, we observe that the robustness of Erdős–Rényi (ER) networks decreases with an increasing average degree, whereas with Barabási–Albert networks, the opposite effect occurs: with an increasing average degree, the robustness also increases. As a first step into an analytical discussion, we prove that for ER networks of different size but with the same average degree, the robustness of the degree centrality remains stable.

Keywords: Centrality · Robustness · Measurement error · Missing data · Noisy data · Sampling

1 Introduction

Networks are used to model various real-world phenomena. Typical use cases are (online) social networks, web graphs, protein-protein interaction networks, infrastructure networks, and many more [23]. Networks model the pairwise relationship of objects, which makes them sensitive to errors in the data underlying the network. The reasons for such errors are manifold. When collecting data for a social network, for example, actors may be missing on the day of the survey, or the number of friends that can be nominated may be limited by the survey questionnaire [30]. The collection of protein-protein interaction data is, depending on the method used, inevitably associated with uncertainty, which is consequently also part of the network constructed from this data [6]. When creating co-authorship or citation networks, authors or papers can be included multiple times or not at all, for example, due to incorrect spelling [8,26]. All these errors affect the outcome of network analysis methods and thus the conclusions that depend on these methods [17,20].

In the field of network analysis, centrality measures are commonly used to analyze individual nodes. These measures map a real number to every node in the network, which can be used to rank the nodes by their "importance"; the definition of importance here depends on the domain and the specific research question. It is well known that errors in the network data can have a severe impact on the reliability of centrality measures. For example, the best-ranked actor in the erroneous network might not actually be the best-ranked actor in the clean network. We measure this impact using the concept of robustness of centrality measures, which is the rank correlation between the centrality values in the clean and the erroneous network [13,15,21,31]. The effects of errors on the robustness of centrality measures depend on several variables, e.g., the type of centrality measure, the type and extent of the error, the network structure, and how we measure the robustness [9,27].

In this article, we study the robustness of centrality measures in larger networks. We are especially interested in whether global network measures can explain the robustness. Existing studies are inconclusive about this, especially about the relationship between network size and robustness. No relationship between size and robustness is noticeable in the empirical part of [24]. In [5] and [3], the authors observed that larger network size could be related to both higher and lower robustness, depending on the network structure. In [31], the smaller network is usually more robust than the larger one. In contrast, [27] noticed that larger networks are frequently more robust. Moreover, existing studies have mostly been concerned with smaller networks (approx. fewer than 1000 nodes). For a comprehensive review of the existing work on the robustness of centrality measures, we refer to [28].

To examine these contrary observations in greater detail, we proceed as follows in this paper. First, we investigate the robustness of centrality measures in 24 empirical networks coming from diverse domains. We focus on degree, eigenvector centrality, and PageRank. Both the eigenvector centrality and PageRank are feedback measures and fast to calculate [16]. However, they have rarely been considered simultaneously in previous studies. Since PageRank can be very stable in scale-free networks [10], a comparison with the eigenvector centrality is of interest. We hardly observe any association between network size and robustness, but a high correlation between average degree and robustness. This observation holds for all considered centrality measures and error types that involve removing nodes or edges. We further investigate the effect of network size on the robustness using the Erdős–Rényi (ER) and the Barabási–Albert (BA) random graph models. For both models, we observe that the robustness is independent of network size if the average degree remains constant. If the average degree increases, then centrality measures in BA graphs become more robust, in contrast to ER graphs.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 40–51, 2020. https://doi.org/10.1007/978-3-030-36687-2_4


As a first step into an analytical discussion, we prove that for ER networks of different size but with the same average degree, the robustness of the degree centrality remains stable. As a consequence, there exist robust and non-robust networks of varying sizes, at least w.r.t. the degree centrality.

2 Methods

A graph $G(V, E)$ consists of a vertex set $V$ and an edge set $E$, $E \subseteq \binom{V(G)}{2}$. We denote the number of vertices in G by $N = |V(G)|$ and the number of edges by $M = |E(G)|$. All graphs considered in this paper are undirected, unweighted, and simple, i.e., they contain neither loops nor multiple edges. The adjacency matrix of a graph is denoted by $A$, where $A_{i,j} = 1$ if there is an edge between vertices $v_i$ and $v_j$ (i.e., $\{v_i, v_j\} \in E(G)$) and 0 otherwise. The neighborhood of a node u is $\Gamma(u) = \{v : \{u, v\} \in E(G)\}$, the set of nodes that are connected to u. The degree is the number of connections that a node has, $\mathrm{degree}(u) = |\Gamma(u)|$. The degree of an edge is the sum of the degree values of the source and target node, $\mathrm{degree}(\{u, v\}) = \mathrm{degree}(u) + \mathrm{degree}(v)$. The average degree of a graph G is defined as $\mathrm{degree}(G) = \frac{1}{N} \sum_{u} \mathrm{degree}(u)$. Density is the ratio of the number of existing edges to the maximum possible number of edges in a graph, $\mathrm{density}(G) = \frac{2M}{N(N-1)}$. The transitivity of a graph is defined as $\mathrm{transitivity}(G) = \frac{\text{number of closed paths of length two}}{\text{number of paths of length two}}$ [23].

In contrast to global measures, centrality measures map a real number to every node in the graph. These values are, depending on the context, often interpreted as a proxy for the "importance" of the node and thus used to rank the nodes. We denote the centrality value of a specific node u in a graph G w.r.t. a centrality measure c by $c_G(u)$. If the context permits, we do not explicitly mention the graph. The vector of centrality values for all nodes in G is defined as $c(G) = (c_G(v_1), \ldots, c_G(v_N))$. The most straightforward centrality measure is the degree centrality, which was already discussed above, $\mathrm{degree}(u) = |\Gamma(u)|$. The eigenvector centrality of a node u is proportional to the sum of the centrality values of its neighbors: $\mathrm{evc}(u) = \lambda^{-1} \sum_{v \in \Gamma(u)} \mathrm{evc}(v)$, where $\lambda$ is the largest eigenvalue of $A$ [2]. The PageRank is defined as $\mathrm{PageRank}(u) = d \sum_{v \in \Gamma(u)} \frac{\mathrm{PageRank}(v)}{\mathrm{degree}(v)} + (1 - d)$, with $d$ as damping factor (in our case 0.85) [4].
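The global measures and the PageRank recursion defined above can be illustrated on a toy graph. The sketch below is a plain fixed-point iteration under stated assumptions: the function names, the dictionary-of-sets graph representation, the starting value of 1 per node, and the fixed number of iterations are all illustrative choices, and every node is assumed to have at least one neighbor.

```python
def average_degree(adj):
    """(1/N) * sum of node degrees."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def density(adj):
    """2M / (N (N - 1)) for an undirected simple graph."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1))


def transitivity(adj):
    """Closed paths of length two divided by all paths of length two."""
    closed = 0
    paths = 0
    for nbrs in adj.values():
        ordered = sorted(nbrs)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                paths += 1               # path a - center - b
                if b in adj[a]:
                    closed += 1          # the path is closed by the edge a - b
    return closed / paths if paths else 0.0


def pagerank(adj, d=0.85, iterations=100):
    """Iterate PageRank(u) = d * sum_{v in N(u)} PageRank(v) / degree(v) + (1 - d)."""
    rank = {u: 1.0 for u in adj}
    for _ in range(iterations):
        rank = {
            u: d * sum(rank[v] / len(adj[v]) for v in adj[u]) + (1 - d)
            for u in adj
        }
    return rank


# Triangle 0 - 1 - 2 with a pendant node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
pr = pagerank(graph)
```

With this (unnormalized) form of the recursion, the PageRank values sum to N on a graph without isolated nodes, and the highest-degree node 2 receives the largest value in the toy graph.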

Error Mechanisms. When collecting data, external factors and the selection of the sampling method can lead to inaccurate network data. We use four procedures to simulate the impact of errors on information about nodes and edges. We call these procedures error mechanisms. They simulate an error that affects the nodes or edges of a network. Their inputs are a graph G and a parameter α which controls the intensity of the error. The procedure returns one network from the set of all possible erroneous versions of the graph G. For a more detailed discussion of the error mechanisms, see [21]. In this study, we use the following error mechanisms:

add edges (e+): αN edges are added to the graph. The new edges are chosen uniformly at random from the $\binom{N}{2} - M$ possible edges.
remove edges unif. (e−): αN edges are removed from the graph. The edges are chosen uniformly at random from E(G).
remove edges degree (e−(p)): αN edges are removed as well. The edges are, however, chosen with probability proportional to the edge degree, i.e., $P(\{u, v\}) = \frac{\mathrm{degree}(\{u, v\})}{\sum_{e \in E(G)} \mathrm{degree}(e)}$.
remove nodes (n−): αN nodes are removed from the graph. The nodes are chosen uniformly at random from V(G).

Robustness of Centrality Measures. To quantify the impact of errors in data collection on centrality measures, we use the concept of robustness, which measures how the ranking of nodes, induced by the centrality measure, changes. In the same way as [15,31], we use Kendall’s tau (“tau-b”) rank correlation coefficient [14]. For two graphs on the same vertex set, G and H, and a centrality measure c, the robustness is defined as follows:

$$\tau_c(G, H) = \frac{n_c - n_d}{\sqrt{(n_c + n_d + n_t)(n_c + n_d + n_t')}} \tag{1}$$

The numbers of concordant pairs and discordant pairs w.r.t. c(G) and c(H) are $n_c$ and $n_d$, respectively. A pair of nodes u, v is concordant if $(c_G(v) - c_G(u)) \cdot (c_H(v) - c_H(u)) > 0$ and discordant if $(c_G(v) - c_G(u)) \cdot (c_H(v) - c_H(u)) < 0$. Ties in c(G) and c(H) (i.e., $c_G(v) - c_G(u) = 0$ or $c_H(v) - c_H(u) = 0$, respectively) are denoted by $n_t$ and $n_t'$. If G and H are not on the same vertex set, then, similar to [31], we only consider nodes that exist in both graphs.

Random Graph Models. The Erdős–Rényi random graph model has two parameters: the number of nodes n and the edge probability p. Since all node pairs are connected with the same probability p, the degree distribution of the nodes in this model follows a binomial distribution [7]. In contrast, the Barabási–Albert model is based on the idea of preferential attachment. Consequently, the probability that a new node will connect to an existing node depends on the degree of the existing node. This model also has two parameters. In addition to the number of nodes n, the parameter m specifies the number of connections that a new node makes to existing nodes. Due to this generation process, the degree distribution of the nodes in graphs generated by this model follows a power-law distribution [1].¹
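The robustness measure of Eq. (1) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, pairs that are tied in both rankings are left out of every count (as in the usual tau-b definition), and the number of edges to remove is passed in explicitly as `count` rather than derived from α.

```python
import math
import random
from itertools import combinations


def kendall_tau_b(c_g, c_h):
    """Kendall's tau-b between two centrality dicts (node -> value).

    Only nodes that occur in both dicts are compared, mirroring the
    restriction to common nodes described in the text.
    """
    common = [u for u in c_g if u in c_h]
    n_c = n_d = n_t = n_t_prime = 0
    for u, v in combinations(common, 2):
        diff_g = c_g[v] - c_g[u]
        diff_h = c_h[v] - c_h[u]
        if diff_g == 0 and diff_h == 0:
            continue              # tied in both rankings: not counted
        if diff_g == 0:
            n_t += 1              # tie in c(G) only
        elif diff_h == 0:
            n_t_prime += 1        # tie in c(H) only
        elif diff_g * diff_h > 0:
            n_c += 1              # concordant pair
        else:
            n_d += 1              # discordant pair
    denominator = math.sqrt((n_c + n_d + n_t) * (n_c + n_d + n_t_prime))
    return (n_c - n_d) / denominator if denominator else float("nan")


def degree_centrality(nodes, edges):
    """degree(u) for every node, given an undirected edge list."""
    degree = {u: 0 for u in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def remove_edges_uniform(edges, count, rng):
    """Error mechanism e-: keep all but `count` uniformly chosen edges."""
    return rng.sample(edges, len(edges) - count)


# Toy example (an assumption for illustration): a small graph G, its
# erroneous version H, and the robustness of the degree centrality.
rng = random.Random(1)
nodes = list(range(6))
edges_g = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3), (3, 4), (4, 5)]
edges_h = remove_edges_uniform(edges_g, 2, rng)
tau = kendall_tau_b(degree_centrality(nodes, edges_g),
                    degree_centrality(nodes, edges_h))
```

Because removing edges keeps every node in the graph, both degree dictionaries share the same keys here; for the node-removal mechanism, the restriction to common nodes in `kendall_tau_b` takes effect.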

3 Experiments with Empirical Networks

In this section, we investigate the relationship between the robustness of centrality measures in empirical networks and global network measures. We consider the following centrality measures: degree, eigenvector centrality, and PageRank. Regarding the global measures, we focus on the network size, average degree, density, and transitivity. For our empirical study, we use the following undirected and unweighted networks available through the Koblenz Network Collection [18] (24 networks at the time of the beginning of this study): zachary (34 nodes, 78 edges), dolphins (62, 159), jazz (198, 2,742), pdzbase (212, 244), vidal (3,133, 6,726), facebook (4,039, 88,234), powergrid (4,941, 6,594), CA-GrQc (5,242, 14,496), reactome (6,327, 147,547), CA-HepTh (9,877, 25,998), pgp (10,680, 24,316), CA-HepPh (12,008, 118,521), CA-AstroPh (18,772, 198,110), CA-CondMat (23,133, 93,497), deezer-RO (41,773, 125,826), deezer-HU (47,538, 222,887), deezer-HR (54,573, 498,202), brightkite (58,228, 214,078), livemocha (104,103, 2,193,083), petster-cat (149,700, 5,449,275), douban (154,908, 327,162), gowalla (196,591, 950,327), dblp (317,080, 1,049,866), and petster-dog (426,820, 8,546,581). As part of the data pre-processing, we have removed any existing loops. If a network consists of several components, we only consider the largest connected component.

¹ We use NetworkX (version 2.2, [12]) to generate random graphs and calculate centrality measures.

Table 1. Robustness of Konect networks, aggregated over all networks.

               Error mechanism
Centrality     e+            e−            e−(p)         n−
               Mean   Std    Mean   Std    Mean   Std    Mean   Std
Degree         0.85   0.06   0.89   0.05   0.91   0.05   0.90   0.05
Eigenvector    0.75   0.18   0.81   0.11   0.73   0.15   0.79   0.15
PageRank       0.73   0.07   0.82   0.07   0.86   0.07   0.83   0.07

To analyze the effects of different errors on the robustness of centrality measures in the empirical networks, we use a simulation-based experimental procedure. An iteration of the experiment is performed as follows: Starting from a graph G (one of the 24 networks described above), we apply the error mechanism with the intensity α. The resulting modified graph is called H. Finally, we calculate the robustness of the centrality measure c: $\tau_c(G, H)$ (as defined in Sect. 2). We repeat this procedure 100 times for each network for all combinations of centrality measure (degree, eigenvector centrality, PageRank), error mechanism (add edges, remove edges (unif. and prop.), and remove nodes), and error level (α ∈ {0.1, 0.2, . . . , 0.5}). Similar to previous studies in this area (as discussed in Sect. 1), we observe that the robustness declines with an increasing level of error. Since the results for the other error levels yield the same conclusions regarding the role of the network structure, and since the impact of the error level is not our main objective, we subsequently focus on an error level of α = 0.2. When observing the average across all networks (Table 1), degree centrality is always the most robust. For the removal error mechanisms (e−, e−(p), and n−), the PageRank
is more robust than the eigenvector centrality. In the case of additional edges, the opposite effect can be observed. Regarding the standard deviation, the ranking is constant across all error types: degree centrality varies least, followed by PageRank. The robustness of the eigenvector centrality fluctuates most; sometimes the standard deviation is two to three times as large as for the degree centrality and the PageRank. Concerning the effect of the type of measurement error on the robustness, degree centrality and PageRank behave similarly. The absence of edges proportional to the edge degree has the weakest effect, spurious edges the most substantial. For eigenvector centrality, on the other hand, the first error type has the strongest influence on the robustness. As we look at the relationship between robustness and global network measures, we notice that there are both large networks that are very vulnerable to errors (e.g., douban) and small networks that are very robust (e.g., jazz). In the following, we will discuss the relationship between global network measures and robustness in more detail. Table 2 lists the Kendall rank correlation between the average robustness and the respective values for the global network measures. For all removal error types, the robustness tends to be higher with an increasing average degree. We observe almost perfect correlation for cases where edges or nodes are missing uniformly at random and still high correlation values when edges are missing proportionally. For the degree centrality, the correlation is also high in the case of spurious edges. For PageRank and eigenvector centrality, this is not the case. While for the transitivity a moderate correlation with the robustness can still be observed, the number of nodes, as well as the density, are, in most cases, uncorrelated with the robustness.
This observation is rather unexpected since growing graphs often show “densification”, which means the average degree grows with the number of nodes [19]. Figure 2 shows the behavior of robustness for three groups in each panel in an exemplary fashion. The first panel shows the observation for PageRank and add edges. This behavior is typical for all centrality measures under the influence of additional edges, there is no obvious pattern. The middle panel shows the combination of PageRank and missing edges uniform. In this case, the relationship between average degree and robustness is most prominent. Robustness is high when the average degree is also high. The variance of robustness is also low. This Table 2. Empirical networks: rank correlation between global measures and the average robustness. Error mechanism Centrality Degree e+

e−

Eigenvector e−(p) n−

e+

e−

Avg degree

0.77 0.97 0.63

0.98

0.27 0.92

Density

0.18 0.01 0.01

0.03 −0.38 0.02

Nodes

0.02 0.31 0.16

0.29

Transitivity

0.26 0.23 0.58

0.23 −0.63 0.27

PageRank e−(p) n− 0.52

e+

e−

e−(p) n−

0.93

0.43 0.96 0.72

0.95

0.30 −0.11

0.26 0.07 0.00

0.15

0.43 −0.18 0.26 0.22

0.18

0.04

0.23

0.52 0.26 −0.14 0.50

0.22 0.16 0.49

behavior occurs for PageRank and degree for all cases of missing edges (uniform and proportional) and missing nodes. The robustness of the eigenvector centrality in case of missing nodes is depicted in the last panel. There is a recognizable association, but in this case, the variance is higher than in most other cases.

4 Experiments with Random Graphs

In Sect. 3, we examined the robustness of 24 empirical networks and observed that there exist small and robust as well as large and sensitive networks regarding the robustness of centrality measures. We observed that there is little association between network size and robustness. We found, however, that in many cases, the higher the average degree of the network, the higher the robustness. In this section, we use the ER and the BA model to control the average degree and to measure the effects of its change on robustness. For this purpose, we choose two different perspectives. First, we keep the average degree constant and change the size of the network. Then we control the average degree while keeping the network size constant. The experimental setup is similar to that of Sect. 3. Instead of using empirical networks, however, we generate ER and BA graphs with an average degree of 10 and a network size n ∈ {100, 500, 1000, 1500, . . . , 10000}, which we call G. For the second part of the experiments, we fix the network size at n = 1000 and select the parameters p and m such that we obtain graphs with an average degree between 4 and 100. Then we apply the error mechanism with intensity α to G, which results in the erroneous network H and calculate the robustness of the centrality measure c: τc (G, H). For both parts of the experiment, we repeat this procedure 100 times for the two random graph models and the varying values for the network size and the average degree for all combinations of centrality measure (degree, eigenvector centrality, PageRank), the four error mechanisms, and error level (α ∈ {0.1, 0.2, . . . , 0.5}). The results for the ER graphs are homogeneous. For all centrality measures and error mechanisms, we observe the same behavior: robustness does not change with increasing network size. However, the variance decreases with increasing network size. It decreases sharply with the first increases in network size. 
Above 2000 nodes, the change is hardly visible. For the BA graphs, we observe, with two exceptions, the same behavior as for the ER graphs. Figure 1 shows the three different characteristics (all results in this figure are for the eigenvector centrality). The middle panel in Fig. 1 represents the robustness behavior in BA graphs in almost all cases. The robustness is independent of the network size. Only the variance decreases with increasing size, whereas the variance is relatively small already. The outer panels show the two exceptional cases. The absence of edges proportional to the degree of edge (left panel) reduces the robustness of the eigenvector centrality with increasing network size. If nodes are missing (right panel), the robustness is, as in most other cases, independent of the network size, but the variance is much larger and hardly declines with increasing size.
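The choice of model parameters for a target average degree can be sketched as follows. This is a minimal illustration under assumptions that are not spelled out in the text: for the ER model the expected average degree is p(n − 1), so p is set to the target divided by n − 1; for the BA model each new node contributes m edges, so the average degree is roughly 2m when n is much larger than m. The function names are hypothetical, and the small ER sampler is written in plain Python only to keep the example free of third-party packages.

```python
import random
from itertools import combinations


def er_edge_probability(n, avg_degree):
    """Edge probability p so that the expected average degree is avg_degree."""
    return avg_degree / (n - 1)


def ba_attachment_parameter(avg_degree):
    """Approximate BA parameter m, using average degree ~ 2m for large n."""
    return max(1, round(avg_degree / 2))


def sample_er_edges(n, p, rng):
    """Sample an Erdős–Rényi graph: keep each node pair with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


# Illustrative run with a fixed seed: target average degree 10, as in the
# first part of the experiments (the size n = 400 is chosen only for speed).
rng = random.Random(0)
n = 400
p = er_edge_probability(n, 10)
edges = sample_er_edges(n, p, rng)
observed_avg_degree = 2 * len(edges) / n
m = ba_attachment_parameter(10)
```

The observed average degree of a single sample fluctuates around the target, which is one reason for repeating each configuration many times.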

Fig. 1. Results for the robustness of centrality measures in BA graphs. Here, the network size increases while the average degree remains constant; the error level is 0.2. For better readability, only network sizes up to 5000 are shown; for larger values, hardly any changes occur.

Fig. 2. Illustrative examples for the three different behaviors of the robustness of empirical networks. The networks are sorted by their average degree (ascending). The median robustness is indicated in each box; whiskers are 1.5 times the interquartile range.

The results for the second part of the experiment are listed in Table 3; the values are the rank correlations between the average degree and the robustness in ER and BA graphs. For the ER graphs, the pattern consists of two parts, independent of the type of error. The robustness of the eigenvector centrality and the PageRank is, except for the smallest initial increases, constant and thus independent of the increase of the average degree. With degree centrality, on the other hand, robustness decreases with an increasing average degree. The decreases occur especially during the initial increases of the average degree (approximately the range between 4 and 25), where the robustness decreases by 0.1. While degree centrality shows a strong negative correlation, in most other cases, no or only a weak negative correlation can be observed.

The results for the BA graphs show a consistent pattern. In all cases, regardless of centrality measure and error type, a higher average degree is accompanied by higher robustness. There is a very high, positive rank correlation between the average degree and the associated robustness. The increases in robustness associated with the increase in the average degree are particularly strong for initial increases; further increases still have a positive effect on robustness, but this effect diminishes. The only exception to this is the eigenvector centrality, which resembles a linear relationship. The variance is slightly higher for the error type missing nodes than for the other error types. In the case of eigenvector centrality, the variance is much higher.

Table 3. The rank correlations between the average degree and the robustness are listed for the cases of BA and ER graphs under the influence of different error mechanisms with an error level of 0.2.

               Error mechanism
Centrality     BA graphs                      ER graphs
               e+     e−     e−(p)   n−       e+      e−      e−(p)   n−
Degree         0.91   0.92   0.90    0.87     −0.66   −0.64   −0.64   −0.51
Eigenvector    0.82   0.93   0.95    0.69     −0.07    0.02    0.04    0.15
PageRank       0.93   0.94   0.93    0.91     −0.33   −0.25   −0.41   −0.02

In the previous experiments, we observed that the robustness of degree centrality is independent of the size of the network. We will now take a more detailed look at this scenario. We focus on the degree centrality, as this is the most feasible for an analytical perspective [22,25,29]. We will show that for ER graphs and sufficiently large network size, the robustness is independent of the network size if the average degree remains constant. To analyze the degree centrality in more detail, we use the following terms: G is the unmodified graph and H is the erroneous graph (a “modified” version of G, i.e., H is on the same vertex set as G or on a subset of that vertex set). The error level is denoted by α (i.e., the fraction of edges deleted). Additionally, let $v_1, v_2$ be two nodes drawn randomly from V(H). Then, $D_i$ is the random variable for the degree of node $v_i$ in G and $X_i$ is the random variable for the degree decrease of node $v_i$ (i.e., the difference of the degree of node $v_i$ in G and in H). On this basis, we define $P(D_1 = d_1, X_1 = x_1, D_2 = d_2, X_2 = x_2)$ as the joint probability that specific values for $d_1, x_1, d_2, x_2$ occur together. We abbreviate this by $P(d_1, x_1, d_2, x_2)$. We demonstrate how to use $P(d_1, x_1, d_2, x_2)$ to calculate the robustness. Summing $P(d_1, x_1, d_2, x_2)$ over the quadruples that correspond to concordant (discordant) pairs of nodes, we can calculate the probability for $(v_1, v_2)$ to be concordant (discordant). For the case of missing edges, the probability for $(v_1, v_2)$ to be concordant is

$$P_c = \sum_{\substack{d_1 < d_2 \\ d_1 - x_1 < d_2 - x_2}} P(d_1, x_1, d_2, x_2) + \sum_{\substack{d_1 > d_2 \\ d_1 - x_1 > d_2 - x_2}} P(d_1, x_1, d_2, x_2). \tag{2}$$

Analogously, the probability for $(v_1, v_2)$ to be discordant is

$$P_d = \sum_{\substack{d_1 < d_2 \\ d_1 - x_1 > d_2 - x_2}} P(d_1, x_1, d_2, x_2) + \sum_{\substack{d_1 > d_2 \\ d_1 - x_1 < d_2 - x_2}} P(d_1, x_1, d_2, x_2). \tag{3}$$

1. For details, see Baker [2]. An example is given in Fig. 4.

Slice Construction. The dynamic programming solution follows a divide and conquer paradigm, where we divide the graph into so-called slices and calculate the tables for each slice before merging them together. Each vertex in each tree (and hence each interior face and exterior edge in each component of the graph) corresponds to a particular slice. The idea is to first define left and right boundaries for each tree vertex, and then define their slices by taking the induced subgraph of all vertices and edges that exist between the boundaries. The analogy here is that one can obtain a slice of pie by first deciding where the two cut lines will be (the left and right boundaries), and then the slice will be the pie that exists between your cuts. The full construction of slices is given in the full version of this paper. In Fig. 4, each tree vertex has its left boundary vertices to the left and its right boundary vertices to the right (note that the right boundary vertices for a tree vertex are the same as the left boundary vertices of the next tree vertex).

Fig. 4. Trees with slice boundaries

Dynamic Program. In this section, we detail the dynamic program that solves the densest k subgraph problem for b-outerplanar graphs. The dynamic program is given by the procedures table, adjust, merge, extend, and contract. Note that adjust, merge, extend, and contract are original to this paper, whereas the table procedure is given by Baker [2]. The pseudocode for these procedures can be found in the extended version of the paper. The program constructs a table for each slice. The table for a level i slice consists of $2^{2i}$ entries: one entry for each subset of the boundary vertices (the left and right boundaries for a level i slice contain exactly i vertices each, for a total of 2i boundary vertices). An entry contains a number for each value of k′ = 0, . . . , k, where this number is the maximum number of edges over all subgraphs of the slice of exactly k′ vertices that contain the corresponding subset of the boundary vertices. The main procedure of the program is table (see the extended version), which takes as input a tree vertex v = (x, y). This procedure contains four conditional branches. The first branch handles the case when v represents a face f that does not enclose a level i + 1 component. In this case, the procedure makes a recursive call to table on each child of v and merges the resulting tables together. The second conditional branch handles the case when v represents a face f that encloses a level i + 1 component C. In this case, the procedure makes a recursive call to table on the tree vertex that represents C. The resulting table
is then passed to the contract function, which turns a level i + 1 table into a level i table by removing the level i + 1 boundary vertices from the table, and the contracted table is then returned. The third conditional branch handles the case when v is a level 1 leaf. In this case, the procedure returns a template table that works for all level 1 leaf vertices, since any level 1 leaf represents a level 1 exterior edge of the graph. The fourth conditional branch is slightly more complicated. This branch handles the case when v is a level i > 1 leaf vertex. The idea is to break up the slice of v into subslices, compute the table for an initial subslice, and extend this table by merging the tables for the subslices clockwise and counterclockwise from the initial subslice. These subslices have their own respective subboundaries. In the case that the slice of v is simply a line of vertices, no subslices can be created, so the procedure will return the table for the whole slice by passing the vertex to the create procedure, which effectively applies brute force to create the table for the subslice determined by the second parameter p (this is explained in more detail below). In the case that the slice of v is not just a line, we can create tables for subslices. Since we are dealing with a planar graph, there exists a level i − 1 vertex zp such that all level i − 1 vertices other than zp that are adjacent to x are clockwise from zp , and all level i − 1 vertices other than zp that are adjacent to y are counterclockwise from zp . Here, zp is the only level i − 1 vertex in slice(v) that can be adjacent to both x and y (although it might not be adjacent to either). 
So, we construct an initial level i table for the subslice corresponding to zp using create, and then we make as many calls as necessary to the merge and extend procedures (described below) to extend the table on one side with subslices constructed from the vertices adjacent to x, and on the other side with subslices constructed from the vertices adjacent to y.
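The table layout and the brute-force idea behind create can be sketched as follows. This is a hypothetical illustration, not the procedure from the paper or its extended version: the function name `brute_force_table` and the dictionary representation are assumptions, and the sketch reads "contains the corresponding subset" as "the chosen vertices meet the boundary in exactly that subset". Entries that are missing from the dictionary play the role of undefined table values.

```python
from itertools import combinations


def brute_force_table(vertices, edges, boundary, k):
    """Map (boundary subset A, size k') to the maximum number of edges.

    The maximum is taken over all vertex sets of exactly k' vertices whose
    intersection with the boundary is A. Every vertex subset of size at
    most k is enumerated, so this is only feasible for very small slices.
    """
    edge_set = {frozenset(e) for e in edges}
    boundary = frozenset(boundary)
    table = {}
    for size in range(min(k, len(vertices)) + 1):
        for subset in combinations(vertices, size):
            chosen = frozenset(subset)
            key = (chosen & boundary, size)
            # Count the edges with both end points among the chosen vertices.
            induced = sum(1 for e in edge_set if e <= chosen)
            if key not in table or induced > table[key]:
                table[key] = induced
    return table


# Toy slice (an assumption for illustration): the path a - b - c with
# boundary vertices a and c.
table = brute_force_table(["a", "b", "c"], [("a", "b"), ("b", "c")],
                          {"a", "c"}, 3)
```

With two boundary vertices the table has one entry per subset of {a, c}, that is four subsets, each with a value for every feasible k′. For example, the entry for the subset {a, c} and k′ = 3 is 2, since the whole path is selected.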

Fig. 5. The slice for (c, d) is split into subslices for computing tables.

For example, Fig. 5 shows how we split slice((c, d)) into subslices. The algorithm finds that E is a level 1 vertex such that the level 1 vertices adjacent to c are clockwise from E and the level 1 vertices adjacent to d are counterclockwise from E (of which there are none). A call to create constructs the table for the initial subslice with subboundaries c, E and d, E. We then merge the other subslices in
a clockwise fashion, first merging the table for the subslice with subboundaries c, C and c, E, and then merging the table for the subslice with subboundaries c, C and c, B.

The adjust procedure (see the extended version for pseudocode) takes as input a table T, which represents a slice with left boundary L and right boundary R. Let x and y be the highest level boundary vertices in L and R, respectively. This procedure checks if x ≠ y, and if so, adds 1 to the table entry where x and y are both included. Unlike in Baker’s original procedure, the case when x = y is not handled in this procedure and is instead handled in merge.

The merge procedure, given in Fig. 6 in Sect. A of the Appendix, takes as input two tables T1 and T2 such that the right boundary of the slice that T1 represents is the same as the left boundary of the slice that T2 represents. Let the left and right boundaries of the slice that T1 represents be L and M, and let the left and right boundaries of the slice that T2 represents be M and R. The resulting table T returned by merge represents the slice with left and right boundaries L and R. This procedure constructs T by creating an entry for each subset A of L ∪ R. As stated earlier, each entry contains a number for each value of k′, where k′ ranges from 0 to the number of vertices in the union of the two slices represented by T1 and T2, and where the number is the maximum number of edges over all subgraphs of exactly k′ vertices containing A. Each of these individual subgraphs contains a different subset B of M.

The contract procedure (see the extended version for pseudocode) changes a level i + 1 table T into a level i table T′. Here, T represents the slice for (z, z), where (z, z) is the root of a tree corresponding to a level i + 1 component C contained in a level i face f, and T′ is the table for the slice of vertex(f). Let S = slice((z, z)) and let S′ = slice(vertex(f)). Let the left and right boundaries of S′ be L and R, respectively. Then the left and right boundaries of S are of the form z, L and z, R, respectively. For each subset of L ∪ R and for each value of k′, T contains two numbers: one that includes z and one that does not include z. So, for each subset A of L ∪ R and each value of k′, we set T(A, k′) equal to the larger of these two numbers.

The create procedure takes as input a leaf vertex v = (x, y) in a tree that corresponds to a level i + 1 component enclosed by a face f, and a number p ≤ t + 1, where the children of vertex(f) are u1, u2, . . . , ut. This procedure simply applies brute force to create the table for the subgraph containing the edge (x, y), the subgraph induced by the left boundary of up if p ≤ t or the right boundary of up−1 if p = t + 1, and any edges from x or y to the level i vertex of this boundary.

Lastly, the extend procedure, given in Fig. 7 in Sect. A of the Appendix, takes as input a level i + 1 vertex z and a table T representing a level i slice, and produces a table T′ for a level i + 1 slice. Let L and R be the boundaries for the level i slice represented by T. The boundaries for the new slice will be L ∪ {z} and R ∪ {z}. For each subset A of L ∪ R and each value of k′, the new table T′ contains two values: one that includes z and one that does not include z. The entries T′(A, k′) which do not include z can simply be set to their original values in T. For the entries T′(A ∪ {z}, k′) that do include z, we first check
that T(A, k′ − 1) is not undefined. If this is the case, we set T′(A ∪ {z}, k′) to T(A, k′ − 1) plus the number of edges between z and every vertex in A.

We claim that calling the above algorithm on the root of the level 1 tree results in a correct table for the slice of the root and that this slice is actually the entire graph. Since the level 1 root is of the form (x, x), its left and right boundaries are both equal to x. Thus, the table for this root has exactly 4 rows, and two of these rows are invalid since they attempt to include one copy of x and not the other (which is nonsensical). The two remaining rows have numbers corresponding to the maximum number of edges in subgraphs of size exactly k′ for k′ = 0, . . . , k. Taking the maximum between the two numbers corresponding to k′ = k gives us the size of the densest k subgraph. A proof of correctness and an analysis of the $O(k^2 8^b n)$ running time can be found in the extended version.

4 Polynomial Time Approximation Scheme and Future Work

When searching for a polynomial time approximation scheme (PTAS) for planar graph problems, one often attempts to use Baker’s technique. For this technique, we assume that we have a dynamic programming solution to the given problem in b-outerplanar graphs. This technique works as follows: Given a planar graph G and a positive number ε, let b = 1/ε. Perform a breadth-first search on G to obtain a BFS tree T, and number the levels of T starting from the root, which is level 0. For each i = 0, 1, . . . , b − 1, let $G_i$ be the subgraph of G induced by the vertices on the levels of T that are not congruent to i modulo b. $G_i$ is likely disconnected. Let the connected components of $G_i$ be $G_{i,0}, G_{i,1}, \ldots$. Since each $G_{i,j}$ is (b − 1)-outerplanar by construction, we may run the given dynamic program on each $G_{i,j}$ and combine the solutions over all j to obtain a solution $S_i$ for the graph $G_i$. We then take the maximum $S_i$, denoted S, as our approximate solution.

We hypothesize that this technique will not work for the densest k subgraph problem on planar graphs. The reason is that, with a potentially large number of disconnected components, the approximate solution cannot be guaranteed to be within the bound given by ε. Suppose we have an approximate solution S for the densest k subgraph problem on some planar graph G. Note that S is the exact solution for the graph $G_i$ for some i, meaning S does not account for any vertices on the levels of T which are congruent to i modulo b. While S could still be very dense, it is possible that most (if not all) of the edges in G are between vertices in different levels of T. This allows for the possibility that no matter which $S_i$ is chosen as the maximum, each graph $G_i$ is missing too many edges to closely approximate the optimal solution. For future work, it would be of great interest to prove that such a construction is impossible using Baker’s technique.

Acknowledgements. We would like to express our sincere thanks to Samuel Chase for his collaboration on our initial explorations of finding a PTAS for the densest k subgraph problem. We would like to thank our reviewer who pointed us to previous work on this problem [3].

A Pseudocode Selections

procedure merge(T1, T2)
    let T be an initially empty table;
    let L and M be the left and right boundaries of the slice that T1 represents;
    let M and R be the left and right boundaries of the slice that T2 represents;
    for each subset A of L ∪ R do
        for each k′ = 0, . . . , max_{T1} k′ + max_{T2} k′ − |M| do
            let V be an initially empty list;
            for each subset B of M do
                let n = 0;
                let x and y be the top level vertices in L and R, respectively;
                if x = y and x and y are both in A then
                    let n = 1;
                for each k1, k2 satisfying k1 + k2 − |B| − n = k′ do
                    let m be the number of edges between vertices in B;
                    let v = T1((A ∩ L) ∪ B, k1) + T2((A ∩ R) ∪ B, k2) − m;
                    if v is not undefined then
                        add v to V;
            if V is not empty then
                let T(A, k′) = max_{v ∈ V} v;
            else
                let T(A, k′) be undefined;
    return T;

Fig. 6. The merge procedure.

procedure extend(z, T)
    let T′ be a table that is initialized with every entry in T;
    let L and R be the boundaries for the slice represented by T;
    for each subset A of L ∪ R do
        for each k′ = 0, . . . , max_T k′ do
            if T(A, k′ − 1) is not undefined then
                let m be the number of edges between z and every vertex in A;
                let T′(A ∪ {z}, k′) = T(A, k′ − 1) + m;
            else
                let T′(A ∪ {z}, k′) be undefined;
    return T′;

Fig. 7. The extend procedure.
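As an illustration, the extend procedure of Fig. 7 can be transcribed into Python roughly as follows. This is a sketch under assumptions rather than the authors' implementation: tables are represented as dictionaries keyed by (frozenset of boundary vertices, k′), a missing key stands for an undefined entry, the neighbours of z are passed in as a set, and the upper limit of the loop over k′ is supplied as the hypothetical parameter `k_max`. The loop starts at k′ = 1 because T(A, −1) is always undefined.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for chosen in combinations(items, size):
            yield frozenset(chosen)


def extend(z, table, boundary, neighbors_of_z, k_max):
    """Add the level i + 1 vertex z to a level i table.

    Entries without z are copied unchanged; an entry with z and k' vertices
    is built from the entry without z and k' - 1 vertices, plus the number
    of edges between z and the selected boundary vertices.
    """
    new_table = dict(table)
    for a in subsets(boundary):
        edges_to_z = len(a & neighbors_of_z)
        for k in range(1, k_max + 1):
            previous = table.get((a, k - 1))
            if previous is not None:      # T(A, k' - 1) is defined
                new_table[(a | {z}, k)] = previous + edges_to_z
    return new_table


# Toy table (an assumption for illustration): a slice consisting of the
# single boundary vertex a, extended by a vertex z that is adjacent to a.
old_table = {(frozenset(), 0): 0, (frozenset({"a"}), 1): 0}
new_table = extend("z", old_table, {"a"}, {"a"}, 2)
```

In this toy case the extended table keeps the two original entries and gains two more: selecting only z yields no edge, while selecting both a and z yields the single edge between them.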

References

1. Angel, A., Sarkas, N., Koudas, N., Srivastava, D.: Dense subgraph maintenance under streaming edge weight updates for real-time story identification. Proc. VLDB Endow. 5(6), 574–585 (2012)
2. Baker, B.S.: Approximation algorithms for NP-complete problems on planar graphs. J. ACM 41, 153–180 (1994)
3. Bourgeois, N., Giannakos, A., Lucarelli, G., Milis, I., Paschos, V.T.: Exact and approximation algorithms for densest k-subgraph. In: WALCOM: Algorithms and Computation, pp. 114–125. Springer, Heidelberg (2013)
4. Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, pp. 95–106. ACM, New York, NY, USA (2008)
5. Chen, J., Saad, Y.: Dense subgraph extraction with application to community detection. IEEE Trans. Knowl. Data Eng. 24(7), 1216–1230 (2010)
6. Corneil, D.G., Perl, Y.: Clustering and domination in perfect graphs. Discret. Appl. Math. 9(1), 27–39 (1984)
7. Du, X., Jin, R., Ding, L., Lee, V.E., Thornton Jr, J.H.: Migration motif: a spatial-temporal pattern mining approach for financial markets. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pp. 1135–1144. ACM, New York, NY, USA (2009)
8. Feige, U., Kortsarz, G., Peleg, D.: The dense k-subgraph problem. Algorithmica 29, 2001 (1999)
9. Fratkin, E., Naughton, B.T., Brutlag, D.L., Batzoglou, S.: MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics 22, 150–157 (2006)
10. Gibson, D., Kleinberg, J., Raghavan, P.: Inferring web communities from link topology. In: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space—Structure in Hypermedia Systems, HYPERTEXT ’98, pp. 225–234. ACM, New York, NY, USA (1998)
11. Gibson, D., Kumar, R., Tomkins, A.: Discovering large dense subgraphs in massive graphs. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB ’05, pp. 721–732. VLDB Endowment (2005)
12. Goldberg, A.V.: Finding a maximum density subgraph. Technical report, University of California at Berkeley, Berkeley, CA, USA (1984)
13. Mark Keil, J., Brecht, T.B.: The complexity of clustering in planar graphs. J. Comb. Math. Comb. Comput. 9, 155–159 (1991)
14. Langston, M.A., Lin, L., Peng, X., Baldwin, N.E., Symons, C.T., Zhang, B., Snoddy, J.R.: A combinatorial approach to the analysis of differential gene expression data: the use of graph algorithms for disease prediction and screening. In: Methods of Microarray Data Analysis IV, pp. 223–238. Springer (2005)
15. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133 (2004)

Spread Sampling and Its Applications on Graphs

Yu Wang1(B), Bortik Bandyopadhyay1, Vedang Patel1, Aniket Chakrabarti2, David Sivakoff1, and Srinivasan Parthasarathy1

1 The Ohio State University, Columbus, OH 43210, USA
[email protected], [email protected]
2 Microsoft Inc., Hyderabad, India

Abstract. Efficiently finding small, highly diverse samples from large graphs has many practical applications, such as community detection and online surveys. This paper proposes a novel, scalable node sampling algorithm for large graphs that achieves better spread, or diversity, across the communities intrinsic to the graph without requiring any costly pre-processing steps. The proposed method leverages a simple iterative sampling technique controlled by two parameters: the infection rate, which controls the dynamics of the procedure, and the removal threshold, which affects the end-of-procedure sample size. We demonstrate that our method achieves very high community diversity with an extremely low sampling budget on both synthetic and real-world graphs, with either balanced or imbalanced communities. Additionally, we use the proposed technique for treatment assignment in a Network A/B Testing scenario at a very low sampling budget (only 2%), and demonstrate competitive performance against the baseline on both synthetic and real-world graphs.

Keywords: Graph sampling · Social network analysis

1 Introduction

Networks are a powerful tool for representing relational data in various domains: an email network in a corporation, a co-sponsorship network in Congress, a co-authorship network in academia, etc. Given the ubiquity of the Internet, we can collect relational data at an immense scale (Facebook, Twitter, etc.). Such a huge amount of data restrains us from conducting complicated analyses: PageRank [28] computation has time complexity $O(|V|^3)$; community detection using the Girvan-Newman method [8] takes $O(|E|^2 |V|)$ time; comparing the similarity between two large networks (each with ∼70 million edges) using a state-of-the-art method takes 10 min [19]. Sampling is often touted as a means to combat the inherent complexity of analyzing large networks [20].

Network sampling is broadly classified into edge and node sampling strategies. Edge sampling seeks to sample pairs of nodes (dyads) from a network, and one of its applications is to infer the network topological structure [36]. Node sampling seeks to sample nodes from a network, and one of its applications is to infer the distribution of network statistics (node degree, node label, etc.). Existing node sampling techniques include selection-based sampling (uniform [20]) and chain-based sampling (forest fire [20], random walk [4,9]).

The idea of sampling subjects from different groups is called stratified sampling [22,24]. A typical design for stratified sampling first divides the population into different strata (groups) using some population characteristics and then samples individuals from each stratum. Social networks are well known to have community structure [7]. Nodes within a community are more similar to each other than nodes across communities, and individuals in a social network tend to interact more frequently with others from the same community [25]. In this graph/network context, conventional stratified sampling first detects communities and then samples from each community. Community discovery, based on topological and/or attribute-based graph characteristics, is a very time-consuming procedure even with state-of-the-art implementations [3].

Chain-based sampling methods sample connected subgraphs and hence are more likely to be stuck in one community or a few nearby (in terms of topological distance) communities, resulting in lower community diversity among the sampled nodes. Uniform node selection sampling has better community coverage than chain-based sampling, but it tends to under-sample small communities when community sizes are imbalanced. Also, from the perspective of an end-to-end application's performance, to continue benefiting from sampling approaches, the sampling budget (i.e., the number of nodes to be sampled) has to be kept as low as possible, which reduces the chance of high community coverage under practical settings on large-scale graphs.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 128–140, 2020. https://doi.org/10.1007/978-3-030-36687-2_11
We propose a new graph sampling method, spread sampling, that achieves better community coverage than existing algorithms at low sampling budgets, even in graphs with imbalanced communities, yielding a sample that is more representative in terms of community diversity than those of other methods. Under an appropriate user-chosen parameter configuration, the proposed method penalizes sampling neighboring nodes, cliques, or near-clique structures, and hence allows for better overall community coverage of the network without any of the costly pre-processing steps required by a typical stratified sampling approach. We demonstrate its applications to community detection seeding and network A/B testing.

2 Related Work

Network Sampling: The work by Handcock and Gile [12] systematically studies node sampling on social networks. It proposes the concepts of model-based sampling and design-based sampling. Our proposed method is a design-based sampling method. Node selection sampling (a kind of uniform node sampling) is studied by Leskovec [20]. This method is easy to implement but approximates several network statistics poorly.


Traditional chain-based sampling methods are biased towards high-degree hub nodes. Two strategies, in-sample correction and post-sample correction, have been proposed to address this issue. In-sample correction modifies the random walk such that the equilibrium distribution is uniform [9,26,34]. Post-sample correction uses estimators that account for the bias incurred by the sampling method [11,13]. Maiya and Berger-Wolf [23,24] propose a crawling-based approach that samples a connected sub-graph with a good community structure (high precision). Our work differs in that we seek to identify samples spread across communities (high recall). We empirically show that spread sampling has better community coverage than all baselines.

Community Detection: We choose community detection to test the efficacy and efficiency of spread sampling. Modularity maximization [3,8] and n-cut maximization [14] based approaches were surveyed by Fortunato [7]. The personalized PageRank based approach [1] is a variant of the n-cut maximization method. Kloumann and Kleinberg [16] compared several seed expansion based community detection methods and concluded that Personalized PageRank [1] works best. They find that seeding by uniform sampling results in better recall than high-degree sampling in community detection. We show that seeding by spread sampling achieves an even better result than uniform sampling.

Network A/B Testing: Spread sampling can also be applied to network A/B testing, a statistical experiment widely used in modern social network settings to determine the effect of a treatment. The experimenters apply a treatment, such as introducing a new feature of an online service, to a subset of customers [17,18], while keeping the remaining users without that feature in a control group. They then measure a quantity of interest for each user to compute the overall effect of the newly introduced feature, usually defined as the Average Treatment Effect (ATE) by Gui et al. [10].
While classical A/B Testing experiments [31] assume independence of user behavior (the SUTVA assumption), this assumption is invalid in the context of social networks due to direct interaction edges between users [10]. Broadly speaking, the two popular sampling approaches proposed to tackle the ATE estimation problem for social networks under interference are node-based [2,15] and cluster-based [10,27,33] strategies. More recently, Saveski et al. [32] have proposed a hybrid strategy combining the node-based and cluster-based randomization approaches. One practical problem is the cost of running the experiments in production deployments, which increases with the size of the treatment group. Hence an effective node sampling strategy is required, one that achieves a competitive ATE estimation error even at a low treatment (sampling) budget.

3 Methodology

3.1 Designing Spread Sampling

We wish to obtain a sample that spreads out over the graph with a limited budget. Intuitively, a spread-out sample has nodes with very few of their neighbors in the sampled set. Hence we design an iterative sampling approach alternating between two steps: (1) sample uniformly from the candidate nodes; (2) remove nodes neighboring the existing sample. The first step leads to a near-uniform sample, while the second step spreads out the sampled nodes.

The algorithm, described in Algorithm 1, is an iterative sampling method. During the sampling process, three sets are maintained: the sample set, the removal set, and the candidate set. The sample set contains all the sampled nodes; the removal set contains nodes that have at least k sampled neighbors; all other nodes are in the candidate set. Both the sample set and the removal set expand monotonically as the iteration proceeds, while the candidate set shrinks. The rationale for the removal set is that if a node has enough neighbors in the sample, we should not sample it, since our goal is to “spread out” the sample with a limited budget.

Algorithm 1. Spread Sampling
Input: infection rate q, removal threshold k, a connected undirected graph G, target sample size;
Output: A set S of sampled nodes.
1: Initialize candidate set C = V(G), sample set S = ∅, removal set R = ∅;
2: while C is not empty and |S| is smaller than the target size do
3:   for each node u in C, sample it with probability q, and add the sampled nodes to S;
4:   C = C − S; {remove sampled nodes from the candidate set}
     {Below: if a candidate node has at least k neighbors sampled, remove it from the candidate set}
5:   B_k = {v ∈ C | |N(v) ∩ S| ≥ k};
6:   R = R ∪ B_k; {R is the removal set}
7:   C = C − R;
8: end while
9: return S
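The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the adjacency-dictionary input, the function name `spread_sample`, the `seed` argument, and the sortable node labels are all assumptions made here for reproducibility. As in Algorithm 1, the last round may overshoot the target size.

```python
import random


def spread_sample(adj, q, k, target_size, seed=None):
    """Sketch of Algorithm 1 on an adjacency dict {node: set_of_neighbors}.

    q is the per-round sampling probability, k the removal threshold.
    Node labels are assumed to be sortable so that runs are reproducible.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    rng = random.Random(seed)
    candidates = set(adj)                    # C
    sample = set()                           # S
    removed = set()                          # R
    sampled_nbrs = {v: 0 for v in adj}       # |N(v) ∩ S| for every node

    while candidates and len(sample) < target_size:
        # Step 3: each candidate is sampled independently with probability q.
        newly = [u for u in sorted(candidates) if rng.random() < q]
        sample.update(newly)
        candidates.difference_update(newly)  # Step 4: C = C - S
        # Steps 5-7: drop candidates that now have at least k sampled neighbors.
        for u in newly:
            for w in adj[u]:
                sampled_nbrs[w] += 1
                if w in candidates and sampled_nbrs[w] >= k:
                    candidates.remove(w)
                    removed.add(w)
    return sample


# Usage: a ring of 20 nodes, removal threshold 1, target of 5 nodes.
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
picked = spread_sample(ring, q=0.2, k=1, target_size=5, seed=0)
```

Keeping a running count of sampled neighbors means each round only touches the neighborhoods of newly sampled nodes, instead of rescanning every candidate.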

The sampling algorithm has two parameters: a single-step infection rate and a removal threshold. A low removal threshold removes nodes more aggressively, and hence is more likely to sample non-adjacent nodes; an extremely high removal threshold removes no nodes and hence yields uniform sampling. A low single-step infection rate also favors non-adjacent nodes; an extremely high rate reduces the chance of removal and hence favors uniform sampling.

3.2 High Community Diversity

We use the community coverage ratio to quantify how well spread-out the sampled nodes are. We define the community coverage ratio¹ of a sample set S as the fraction of communities represented by the sampled nodes in S. It can be formulated as:

$$\mathrm{CoverageRatio}(S) = \frac{\left|\bigcup_{i \in S} \{c_i\}\right|}{|C|},$$

where S is the sample set, $c_i$ is the community of node i, and C is the set of all the communities.

¹ By design, our method achieves 100% expansion quality (the ratio of the neighborhood size of the sample to the number of unsampled nodes, as defined in [24]) when the infection rate is exactly one node per step and the removal threshold is one.

We compare against five baseline sampling methods: uniform sampling, expanding snowball (XSN [24]), degree-inverse sampling, Louvain [3] + stratification sampling, and METIS [14] + stratification sampling. XSN is a greedy sampling method that samples a community-diversified connected component, and it performs very well on graphs with balanced communities. We compare against degree-inverse sampling because our method has a degree-inverse property for a small removal threshold. We show that merely sampling nodes with degree-inverse probability cannot achieve community coverage as high as our method's. For stratification sampling, we first run state-of-the-art community detection methods, then sample the nodes with probability proportional to the sizes of the detected communities.² Random-walk-based sampling methods [20], including multi-walkers [29], have inferior community coverage, and hence we do not include those results. We report results on an SBM graph with imbalanced communities for various sampling budgets. All algorithms achieve high community coverage as the sampling budget increases. Spread sampling (SS) achieves the best community coverage ratio among all methods at minimal sampling budgets (Fig. 1). More results can be found in Sect. 3.3.3 of [35].
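The community coverage ratio defined above is straightforward to compute once each node carries a ground-truth community label. The sketch below assumes non-overlapping labels stored in a dictionary; the function name `coverage_ratio` and the toy data are hypothetical.

```python
def coverage_ratio(sample, community_of):
    """Fraction of ground-truth communities hit by at least one sampled node.

    community_of maps every node to its (single) community label.
    """
    all_communities = set(community_of.values())
    covered = {community_of[i] for i in sample}
    return len(covered) / len(all_communities)


# Toy example: six nodes in three communities, two sampled nodes.
community_of = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}
ratio = coverage_ratio({0, 3}, community_of)  # two of three communities covered
```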

Fig. 1. Community coverage ratio on SBM10k Imbalanced. At very small sample budget (5%, 10%), spread sampling (SS) covers significantly more communities than the baselines.

3.3 Complexity Analysis

According to Algorithm 1, there are $|C_t|$ candidate nodes in the while-loop at the $t$-th iteration, and $|C_{t+1}| = (1 - q)|C_t|$ for infection rate $q$ (in expectation, before any removals). The bottleneck is the candidate-node update (lines 4–7): for each candidate node, we need to scan all its neighbors to determine the validity of its candidacy. This procedure incurs $|C_t|\bar{d}$ queries per iteration, where $\bar{d} \equiv 2|E|/|V|$ is the average degree. Hence, the overall time complexity is

$$O\Big(\sum_{t \ge 0} |C_t|\,\bar{d}\Big) \overset{\text{geometric sum with ratio } 1-q}{=} O(|C_0|\,\bar{d}) = O(n\bar{d}) = O(|E|) \overset{\text{sparse [5], } |E| = O(|V|)}{=} O(|V|).$$

We do not store edge information; all the edge information is retrieved from the graph. Hence the space complexity is $O(|V|)$.

² Evaluation is always performed on the ground-truth communities.
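The geometric-sum step in the time complexity can be spelled out as follows. This is a sketch under the idealized recursion $|C_{t+1}| = (1-q)|C_t|$ stated above, with $|C_0| = n = |V|$ and $q$ treated as a constant; removals only shrink the candidate set further.

```latex
\sum_{t \ge 0} |C_t|\,\bar{d}
  = \bar{d}\,|C_0| \sum_{t \ge 0} (1-q)^t
  = \frac{|C_0|\,\bar{d}}{q}
  = \frac{n}{q}\cdot\frac{2|E|}{|V|}
  = \frac{2|E|}{q}
  = O(|E|) \quad \text{for constant } q .
```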

3.4 Impact of Sampling Parameters

The spread sampling method has two parameters: the single-step infection rate q, which controls the sampling dynamics, and the removal threshold k, which controls the end-of-procedure status. We have analyzed the impact of these parameters on community coverage. Intuitively, a low infection rate and a low removal threshold incur more neighborhood removal, and hence result in better community coverage. Due to space constraints, we refer the reader to [35] for a detailed theoretical and empirical evaluation of the community coverage and sampling probability of our proposed technique.

4 Applications of Spread Sampling

We run experiments on both the synthetic and the real-world graphs listed in Table 1 (|C| is the number of communities, and CC stands for clustering coefficient).

Table 1. Graphs used in experiments

Graph                        |V|     |E|     |C|     CC
ER (1k, 0.4)                 1k      100k    N/A     0.3992
ER (10k, 0.001)              10k     50k     N/A     0.0011
PL (1k, 4, 0.01)             1k      4k      N/A     0.0343
PL (1k, 4, 0.8)              1k      4k      N/A     0.3720
SBM10k balanced              10k     594k    100     0.0165
SBM10k imbalanced            10k     103k    500     0.1265
BTER syn (fit power-law)     1k      33k     10      0.2696
BTER fb (fit FB below)       4k      86k     250     0.5964
Email [21]                   1k      25k     42      0.3994
FB [21]                      4k      88k     N/A     0.6055
DBLP [21]                    317k    1M      13k     0.6324
AMAZON [21]                  335k    926k    75k     0.3967
Live-Journal (lj) [21]       4M      35M     288k    0.2843
Youtube [21]                 1.1M    3M      8k      0.0808

4.1 Overlapping Community Detection Seeding

Personalized PageRank (PPR) [1] is a well-established method for local graph partitioning, i.e., identifying a community (a densely connected sub-graph [30,37]) from a small seed set. An extensive comparison by Kloumann and Kleinberg [16] shows that PPR has the best performance among all local partitioning methods. They focus on the comparison of different methods and claim that uniform sampling is better than high-degree node sampling. Our spread sampling, with different parameter settings, can be specialized to uniform sampling ($k > d_{\max}$) or low-degree node sampling (small k).

We run PPR community detection following the same procedure and on the same datasets (plus one more, LiveJournal) as in Sect. 2 of [16] and compare various seeding methods: SS with small k (low-degree heuristic), SS with large k, uniform sampling, and the high-degree heuristic.³ For each ground-truth community, we test with sampling budgets of 5% and 10%, and obtain a consistent ranking of the seeding methods. PPR assigns each node a conductance score in [0, 1]. For each to-be-detected community, we sort all the nodes in descending order of the score and take the top-C nodes as the community members, where C is the size of the ground-truth community. For SS, we fix q to be exactly one node per step and vary the removal threshold k from 1 to 50. Note that a large k degenerates to uniform sampling.

Our key finding is that the low-degree heuristic works best in overlapping community detection; the recall of each seeding method is reported in Table 2.⁴ The low-degree heuristic works best on graphs with overlapping communities because high-degree nodes belong to several communities, while low-degree nodes are “core members” of a community [30,39]. Expansion from a high-degree node returns nodes from different communities, and hence the recall is low; expansion from “core members” is more likely to return nodes within the same community, and hence yields better recall. The recall on the youtube graph is very low, which is probably due to its low clustering coefficient.

Table 2. Community detection recall with varying seed strategies

Data       k = 1            k = 10           Uniform          Max Deg.
amazon     .5768 ± .038     .5808 ± .040     .5617 ± .042     .5612 ± .043
dblp       .2512 ± .004     .2396 ± .003     .2383 ± .004     .1479 ± .001
lj         .1328 ± .002     .1313 ± .003     .1311 ± .002     .1123 ± .001
youtube    .0227 ± .004     .0200 ± .003     .0138 ± .002     .0108 ± .001

³ The neighborhood inflation method [38] is also a highly cited work. We did not compare against it since, according to [16], PPR performs better than the neighborhood inflation method.

⁴ $\text{Precision} \equiv \frac{\text{True Positive}}{|C_{\text{detect}}|} \overset{\text{exp. setup}}{=} \frac{\text{TP}}{|C_{\text{ground truth}}|} \equiv \text{recall}$.
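The top-C selection rule and the resulting recall can be illustrated with a short sketch. The function name `top_c_recall`, the tie-breaking by node label, and the scores below are hypothetical; the scores merely stand in for the output of a PPR run. Because exactly C nodes are returned, precision and recall coincide, as noted in the footnote.

```python
def top_c_recall(scores, ground_truth):
    """Recall of the top-C scored nodes, with C = |ground-truth community|.

    scores maps node -> score (higher means more likely a member).
    """
    c = len(ground_truth)
    # Sort by descending score; ties are broken by node label (an assumption).
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    detected = set(ranked[:c])
    true_positives = len(detected & set(ground_truth))
    return true_positives / c


# Hypothetical scores for six nodes; the true community is {0, 1, 2}.
scores = {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.7, 4: 0.05, 5: 0.0}
recall = top_c_recall(scores, {0, 1, 2})  # top-3 is {0, 1, 3}, so 2 of 3 are correct
```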

4.2 Network A/B Testing

There are several reasons to consider spread sampling as a treatment assignment (i.e., node sampling) strategy for Network A/B testing. First, spread sampling intuitively attempts to reduce the homophily effect [2] and potential information leaks [10]. Second, it only requires knowledge of the immediate neighborhood of specific nodes in the seed set. Third, the spread sampling method samples a set with higher population diversity in terms of community representation (in comparison to alternatives) at a lower sampling budget, while also being relatively cheap to compute: there is no need for costly pre-processing steps such as clustering the graph and subsequent assignment. A lower sampling budget is particularly useful when multiple alternatives are being tested (e.g., alternative advertisement variations or alternative features), which is usually the case. Large sampling budgets are an imposition on testers and can lead to tester fatigue.

Dataset and Simulation: Our goal is to assign treatment using node sampling strategies, viz. uniform sampling (US) and spread sampling (SS), and then simulate user responses with the Linear Probit model [6,10] to compute the average treatment effect (ATE). The simulation steps are summarized in Algorithm 2. We run 1000 simulations (as in Sect. 5.1.1 of [10]) with each of the node sampling strategies on four small networks: PL(1k,4,0.01), PL(1k,4,0.8), FB, and BTER (Table 1), of which BTER has community structure. The Linear Probit model for response simulation is proposed by [10]:

$$Y^*_{i,t} = \lambda_0 + \lambda_1 Z_i + \lambda_2 \frac{A_{\cdot i}^{T} Y_{t-1}}{D_{ii}} + U_{i,t}$$

$$Y_{i,t} = I(Y^*_{i,t} > 0)$$

where Z is the treatment assignment vector, A is the adjacency matrix, D is the diagonal matrix of node degrees, and U is a user-dependent stochastic component. We fix the baseline parameter $\lambda_0 = -1.5$, vary the treatment effect parameter $\lambda_1 \in \{0.25, 0.5, 0.75, 1.0\}$ and the network effect parameter $\lambda_2 \in \{0, 0.1, 0.25, 0.5, 0.75\}$ as in [10], and keep a very low fixed sampling budget of only 2%.

Estimator Choice: We use two existing state-of-the-art estimators: SUTVA, designed assuming no interference effect, and LMI, designed to work well in the presence of interference effects.
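The response model above can be simulated directly. The sketch below is illustrative only and makes two assumptions that the text does not state: the initial responses are all zero, and the stochastic component $U_{i,t}$ is drawn from a standard normal distribution. The function names and the ring graph are hypothetical.

```python
import random


def simulate_probit(adj, z, lam0, lam1, lam2, steps=3, seed=None):
    """Simulate binary responses under the linear probit model.

    adj: {node: set_of_neighbors}; z: {node: 0 or 1} treatment assignment.
    Assumptions of this sketch: Y_0 = 0 for all nodes and U_{i,t} ~ N(0, 1).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    y = {i: 0 for i in nodes}                      # Y_0
    for _ in range(steps):
        y_next = {}
        for i in nodes:
            deg = len(adj[i])
            # (A_{.i}^T Y_{t-1}) / D_ii: fraction of neighbors that responded.
            peer = sum(y[j] for j in adj[i]) / deg if deg else 0.0
            latent = lam0 + lam1 * z[i] + lam2 * peer + rng.gauss(0.0, 1.0)
            y_next[i] = 1 if latent > 0 else 0
        y = y_next
    return y


def mean_response(y):
    """Average binary response over all nodes."""
    return sum(y.values()) / len(y)


# Ground-truth ATE on a small ring: everyone treated versus no one treated.
ring = {i: {(i - 1) % 50, (i + 1) % 50} for i in range(50)}
all_on = simulate_probit(ring, {i: 1 for i in ring}, -1.5, 1.0, 0.75, seed=1)
all_off = simulate_probit(ring, {i: 0 for i in ring}, -1.5, 1.0, 0.75, seed=1)
ate_gt = mean_response(all_on) - mean_response(all_off)
```

Reusing the same seed for the treated and control runs fixes the noise draws, which mirrors the seed-fixing step of the simulation procedure.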


Algorithm 2. Simulation Steps
Input: $\lambda_0$, $\lambda_1$, $\lambda_2$, $\rho$, sampling flag
1: Generate treatment response $Y_{i,t}$ for all nodes i using $Z_i = 1$.
2: Generate control response $Y_{i,t}$ for all nodes i using $Z_i = 0$.
3: Compute the ground truth ATE ($ATE_{GT}$) using the above responses.
4: for (l = 1; l <= 1000; l++) do
5:   Set seed(l) to fix $U_{i,t}$ generation.
6:   Select $\rho$% of the nodes using SRSWOR or SS, based on the sampling flag, and consider those nodes for treatment.
7:   Use the above treatment assignment to construct the experiment assignment vector of all nodes ($Z_{exp}$) by setting those entries to 1 and keeping 0 for the others.
8:   Compute the fractional neighborhood exposure vector.
9:   Use $Z_{exp}$ to generate the binary response $Y^*_{i,t}$ for all nodes by simulating the Probit Model [6,10] with t = 3.
10:  Apply the estimator of choice to compute the empirical $\widehat{ATE}$.
11: end for
12: Use $ATE_{GT}$ and $\widehat{ATE}$ (computed by averaging the 1000 simulation estimates) to compute the Absolute Error (relative error).

SUTVA for Node Sampling [10]:

$$\delta = \Big[\frac{1}{N_1} \sum_{\{i: z_i = 1\}} Y_i(Z = z_i)\Big] - \Big[\frac{1}{N_0} \sum_{\{i: z_i = 0\}} Y_i(Z = z_i)\Big]$$

$N_1$ is the number of nodes in the treatment group and $N_0$ is the number of nodes in the control group. $Y_i(Z = 1)$ is the response of node i in treatment and $Y_i(Z = 0)$ is the response of node i in control.

Fraction Neighborhood - Linear Model I [10]:

$$g(Z_i, \sigma_i) = \alpha + \beta Z_i + \gamma \sigma_i$$

$$\hat{\delta}_{LI} = \hat{\beta} + \hat{\gamma}$$

$Z_i = 1$ indicates that node i is in the treatment group, and 0 indicates that it is in the control group. $\sigma_i$ is the fraction of i's neighbors in the treatment group. $\beta$ is the treatment effect parameter, and $\gamma$ is the parameter that captures the network effect. Linear regression is used to estimate $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$, which are then used to compute $\hat{\delta}_{LI}$ [10]. Note that applying the fractional LMI model to node sampling is novel to this work (previous efforts only examined the performance of this idea with a cluster-sampling-based assignment strategy at a high budget). We did not compare against [10] since that work requires half of the users to be sampled, while we focus on ATE estimation at a minimal sampling budget (2%). We show that as the network effect (the interaction among users) increases, the SS method increasingly outperforms uniform sampling.
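Both estimators are simple to compute from a response vector and a treatment assignment. The following standard-library sketch is an illustration, not the authors' code: the function names, the hand-written 3x3 solver, and the toy path graph are assumptions. It also assumes that both groups are non-empty and that the regressors are not collinear; otherwise the solver divides by zero.

```python
def sutva_estimate(y, z):
    """Difference in mean response between treated and control nodes."""
    treated = [y[i] for i in y if z[i] == 1]
    control = [y[i] for i in y if z[i] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def exposure(adj, z):
    """sigma_i: fraction of node i's neighbors that are treated."""
    return {i: (sum(z[j] for j in adj[i]) / len(adj[i]) if adj[i] else 0.0)
            for i in adj}


def solve3(m, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * p for x, p in zip(a[r], a[col])]
    return [a[r][3] / a[r][r] for r in range(3)]


def lmi_estimate(y, z, sigma):
    """Fit y ~ alpha + beta*z + gamma*sigma by least squares; return beta + gamma."""
    rows = [(1.0, float(z[i]), sigma[i]) for i in y]
    targets = [float(y[i]) for i in y]
    # Normal equations: (X^T X) theta = X^T y.
    xtx = [[sum(r[p] * r[s] for r in rows) for s in range(3)] for p in range(3)]
    xty = [sum(r[p] * t for r, t in zip(rows, targets)) for p in range(3)]
    _, beta, gamma = solve3(xtx, xty)
    return beta + gamma


# Toy check on a 6-node path with an exactly linear (non-binary) response.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
z = {0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 0}
sigma = exposure(path, z)
y_lin = {i: 0.1 + 0.3 * z[i] + 0.2 * sigma[i] for i in path}
delta_lmi = lmi_estimate(y_lin, z, sigma)   # recovers beta + gamma up to rounding
delta_sutva = sutva_estimate({0: 1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}, z)
```

In practice a numerical library would replace the hand-written solver; it is spelled out here only to keep the example self-contained.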


Fig. 2. ATE estimation bias comparison for the BTER synthetic graph with balanced communities. Although SS does not have an advantage over US when the network effect is weak, the advantage of SS becomes increasingly clear as the network effect increases. Spread sampling samples users that are far apart and hence reduces the homophily bias.

Results: Table 3 shows that spread sampling, together with the LMI estimator (SS+LMI), significantly outperforms uniform sampling on multiple datasets. Figure 2 shows that although SS does not have an advantage over US when the network effect is weak, the advantage of SS becomes increasingly evident as the network effect increases. This is not hard to explain: strong network effects mean that users interact strongly with each other and hence exhibit the homophily effect [2]. Spread sampling samples users that are far apart and hence reduces the homophily bias.

Table 3. ATE estimation with $\lambda_0 = -1.5$, $\lambda_1 = 1.0$, $\lambda_2 = 0.75$ (strong network effect), and a sampling budget of 2%; SS uses q = one node and k = 1.

Dataset             GT      US SUTVA         SS SUTVA         SS LMI
PL (1k, 4, 0.8)     0.35    0.26 (26.89%)    0.26 (25.90%)    0.29 (18.46%)
PL (1k, 4, 0.01)    0.35    0.26 (26.79%)    0.26 (27.93%)    0.29 (19.44%)
FB                  0.34    0.26 (24.61%)    0.25 (25.53%)    0.29 (15.78%)
BTER syn            0.36    0.26 (28.23%)    0.25 (29.41%)    0.28 (20.67%)

5 Conclusions

We propose a simple yet elegant procedure, spread sampling (SS), for sampling nodes within a graph. We show that spread sampling tries to sample nodes from all regions of the graph, thereby improving community coverage over existing baselines, especially on networks with imbalanced communities. We apply SS to two practical applications: community detection and network A/B testing. Seeding PPR-based community detection with SS leads to higher recall than existing heuristics. SS-based network A/B testing outperforms competitive strawman solutions on a range of graph models, particularly in the presence of moderate to high network interference effects.

Acknowledgments. This paper is funded by NSF grants DMS-1418265, IIS-1550302, and IIS-1629548.

References

1. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: FOCS 2006, pp. 475–486 (2006)
2. Backstrom, L., Kleinberg, J.: Network bucket testing. In: Proceedings of the 20th International Conference on World Wide Web, pp. 615–624. ACM (2011)
3. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), P10008 (2008)
4. Chiericetti, F., Dasgupta, A., Kumar, R., Lattanzi, S., Sarlós, T.: On sampling nodes in a network. In: Proceedings of the 25th International Conference on World Wide Web, pp. 471–481. International World Wide Web Conferences Steering Committee (2016)
5. Chung, F.: Graph theory in the information age. Not. AMS 57(6), 726–732 (2010)
6. Karrer, B., Eckles, D., Ugander, J.: Design and analysis of experiments in networks: reducing bias from interference (2014)
7. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
8. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
9. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: a case study of unbiased sampling of OSNs. In: 2010 Proceedings IEEE Infocom, pp. 1–9. IEEE (2010)
10. Gui, H., Xu, Y., Bhasin, A., Han, J.: Network A/B testing: from sampling to estimation. In: Proceedings of the 24th International Conference on World Wide Web, pp. 399–409. ACM (2015)
11. Hand, D.J.: Statistical analysis of network data: methods and models by Eric D. Kolaczyk. Int. Stat. Rev. 78(1), 135–135 (2010)
12. Handcock, M.S., Gile, K.J.: Modeling social networks from sampled data. Ann. Appl. Stat. 4(1), 5 (2010)
13. Hansen, M.H., Hurwitz, W.N.: On the theory of sampling from finite populations. Ann. Math. Stat. 14(4), 333–362 (1943)
14. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)


15. Katzir, L., Liberty, E., Somekh, O.: Framework and algorithms for network bucket testing. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012, pp. 1029–1036. ACM, New York (2012)
16. Kloumann, I.M., Kleinberg, J.M.: Community membership identification from small seed sets. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1366–1375. ACM (2014)
17. Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., Pohlmann, N.: Online controlled experiments at large scale. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pp. 1168–1176. ACM, New York (2013)
18. Kohavi, R., Deng, A., Longbotham, R., Xu, Y.: Seven rules of thumb for web site experimenters. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pp. 1857–1866. ACM, New York (2014)
19. Koutra, D., Shah, N., Vogelstein, J.T., Gallagher, B., Faloutsos, C.: DeltaCon: principled massive-graph similarity function with attribution. ACM Trans. Knowl. Discov. Data (TKDD) 10(3), 28 (2016)
20. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 631–636. ACM (2006)
21. Leskovec, J., Sosič, R.: SNAP: a general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8(1), 1 (2016)
22. Lohr, S.: Sampling: Design and Analysis. Nelson Education, Toronto (2009)
23. Maiya, A.S., Berger-Wolf, T.Y.: Expansion and search in networks. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 239–248. ACM (2010)
24. Maiya, A.S., Berger-Wolf, T.Y.: Sampling community structure. In: Proceedings of the 19th International Conference on World Wide Web, pp. 701–710. ACM (2010)
25. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
26. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
27. Middleton, J.A., Aronow, P.M.: Unbiased estimation of the average treatment effect in cluster-randomized experiments
28. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical report, Stanford InfoLab (1999)
29. Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random walks. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, pp. 390–403. ACM (2010)
30. Ruan, Y., Fuhry, D., Liang, J., Wang, Y., Parthasarathy, S.: Community discovery: simple and scalable approaches. In: User Community Discovery, pp. 23–54. Springer (2015)
31. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688 (1974)
32. Saveski, M., Pouget-Abadie, J., Saint-Jacques, G., Duan, W., Ghosh, S., Xu, Y., Airoldi, E.M.: Detecting network effects: randomizing over randomized experiments. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1027–1035. ACM (2017)


33. Ugander, J., Karrer, B., Backstrom, L., Kleinberg, J.: Graph cluster randomization: network exposure to multiple universes. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pp. 329–337. ACM, New York (2013)
34. Wang, D., Li, Z., Xie, G.: Towards unbiased sampling of online social networks. In: 2011 IEEE International Conference on Communications (ICC), pp. 1–5. IEEE (2011)
35. Wang, Y.: Revisiting network sampling. Ph.D. thesis, The Ohio State University (2019)
36. Wang, Y., Chakrabarti, A., Sivakoff, D., Parthasarathy, S.: Fast change point detection on dynamic social networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2992–2998. AAAI Press (2017)
37. Wang, Y., Chakrabarti, A., Sivakoff, D., Parthasarathy, S.: Hierarchical change point detection on dynamic networks. In: Proceedings of the 2017 ACM on Web Science Conference, pp. 171–179. ACM (2017)
38. Whang, J.J., Gleich, D.F., Dhillon, I.S.: Overlapping community detection using seed set expansion. In: CIKM, pp. 2099–2108. ACM (2013)
39. Yang, J., Leskovec, J.: Structure and overlaps of communities in networks. arXiv preprint arXiv:1205.6228 (2012)

Eva: Attribute-Aware Network Segmentation

Salvatore Citraro and Giulio Rossetti(B)

KDD Lab, ISTI-CNR, Pisa, Italy
[email protected], [email protected]

Abstract. Identifying topologically well-defined communities that are also homogeneous w.r.t. attributes carried by the nodes that compose them is a challenging social network analysis task. We address such a problem by introducing Eva, a bottom-up low complexity algorithm designed to identify network hidden mesoscale topologies by optimizing structural and attribute-homophilic clustering criteria. We evaluate the proposed approach on heterogeneous real-world labeled network datasets, such as co-citation, linguistic, and social networks, and compare it with state-of-art community discovery competitors. Experimental results underline that Eva ensures that network nodes are grouped into communities according to their attribute similarity without considerably degrading partition modularity, both in single and multi node-attribute scenarios.

1 Introduction

Among the most frequent data mining tasks, segmentation requires partitioning a given population into internally homogeneous clusters so as to better identify different cohorts of individuals sharing a common set of features. Classical approaches [1] model this problem on relational data, with each individual (data point) described by a structured list of attributes. Indeed, in several scenarios, this modeling choice represents an excellent proxy to address context-dependent questions (e.g., segmenting retail customers or music listeners by their adoption behaviors). However, such methodologies by themselves are not able to answer a natural, yet non-trivial question: what does it mean to segment a population for which the social structure is known in advance? A first way of addressing such an issue can be identified in the complex network counterpart to the data mining clustering problem, Community Discovery.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 141–151, 2020. https://doi.org/10.1007/978-3-030-36687-2_12

Node clustering, also known as community discovery, is one of the most productive subfields of complex network analysis. Many algorithms have been proposed so far to efficiently and effectively partition graphs into connected clusters, often maximizing specifically tailored quality functions. One of the reasons this task is considered among the most challenging and intriguing ones is its ill-posedness: there does not exist a single, universally shared definition of what a community should look like. Every algorithm, every study, defines node partitions by focusing on specific topological aspects (internal density, separation, ...), thus leading to the possibility of identifying different, even conflicting, clusters on top of the same topology. Generalizing, we can define the community discovery problem using a meta definition such as the following:

Definition 1 (Community Discovery (CD)). Given a network G, a community c = {v1, v2, ..., vn} is a set of distinct nodes of G. The community discovery problem aims to identify the set C of all the communities in G.

Classical approaches to the CD problem focus on identifying a topologically accurate segmentation of nodes. Usually, the identified clusters (either crisp or overlapping, producing complete or partial node coverage) are driven only by the distribution of edges across network nodes. Such a constraint, in some scenarios, is not enough. Nodes, the proxies for the individuals we want to segment, are carriers of semantic information (e.g., age, gender, location, spoken language, ...). However, segmenting individuals by only considering their social ties might produce well-defined, densely connected cohorts whose homogeneity w.r.t. the semantic information is not guaranteed. Usually, when used to segment a population embedded into a social context, CD approaches are applied assuming an intrinsic social homophily of individuals, often summarized with the motto "birds of a feather flock together". Indeed, such a correlation might exist in some scenarios; however, it is not always given, and its strength could be negligible. To address such an issue, in this work we approach a specific instance of the CD problem, namely Labeled Community Discovery:

Definition 2 (Labeled Community Discovery (LCD)). Let G = (V, E, A) be a labeled graph where V is the set of vertices, E the set of edges, and A a set of categorical attributes such that A(v), with v ∈ V, identifies the set of labels associated to v.
The labeled community discovery problem aims to find a node partition C = {c1, ..., cn} of G that maximizes both topological clustering criteria and label homophily within each community.

LCD focuses on obtaining topologically well-defined partitions (as in CD) that also result in homogeneous labeled communities. An example of a context in which an LCD approach could be helpful is the identification, and impact evaluation, of echo chambers in online social networks, a task that cannot be easily addressed relying only on standard CD methodologies. In this work, we introduce a novel LCD algorithm, Eva (Louvain Extended to Vertex Attributes), tailored to extract label-homogeneous communities from a complex network. Our approach is configured as a multi-criteria optimization and extends a classical hierarchical algorithmic schema used by state-of-art CD methodologies.

The paper is organized as follows. In Sect. 2 we introduce Eva, discussing its rationale and its computational complexity. In Sect. 3 we evaluate the proposed method on real-world datasets, comparing its results with state-of-art competitors. Finally, in Sect. 4 the literature relevant to our work is discussed, and Sect. 5 concludes the paper.

Eva: Attribute-Aware Network Segmentation

2 The Eva Algorithm

In this section, we present our solution to the LCD problem: Eva. Eva is designed as a multi-objective optimization approach. It adopts a greedy modularity optimization strategy, inherited from the Louvain algorithm [2], pairing it with the evaluation of intra-community label homophily. Eva's main goal is maximizing the intra-community label homophily while assuring high partition modularity. In the following, we detail the algorithm rationale and study its complexity. Eva is designed to handle networks whose nodes possess one or more labels having categorical values.

Algorithm Rationale. The algorithmic schema of Eva is borrowed from the Louvain one: a bottom-up, hierarchical approach designed to optimize a well-known community fitness function called modularity.

Definition 3 (Modularity). Modularity is a quality score that measures the strength of the division of a network into modules. It takes values in [−1, 1] and, intuitively, measures the fraction of the edges that fall within the given partition minus the expected fraction if they were distributed following a null model. Formally:

$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w) \tag{1}$$

where m is the number of graph edges, $A_{vw}$ is the entry of the adjacency matrix for v, w ∈ V, $k_v$, $k_w$ are the degrees of v, w, and $\delta(c_v, c_w)$ is an indicator function taking value 1 iff v, w belong to the same cluster, 0 otherwise.

Eva leverages the modularity score to incrementally update community memberships. Unlike Louvain, such an update is weighted in terms of another fitness function tailored to capture the overall label dispersion within communities, namely purity.

Definition 4 (Purity). Given a community c ∈ C, its purity is the product of the frequencies of the most frequent labels carried by its nodes. Formally:

$$P_c = \prod_{a \in A} \frac{\max\left(\sum_{v \in c} a(v)\right)}{|c|} \tag{2}$$

where A is the label set, a ∈ A is a label, and a(v) is an indicator function that takes value 1 iff a ∈ A(v). The purity of a partition is then the average of the purities of the communities that compose it:

$$P = \frac{1}{|C|} \sum_{c \in C} P_c \tag{3}$$

Purity assumes values in [0, 1] and it is maximized when all the nodes belonging to the same community share the same attribute profile.
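As an illustration of Definition 4, the following minimal Python sketch computes the purity of a community and of a partition. It is not taken from the Eva code base: the function names and the data layout (a dictionary mapping each node to a dictionary of categorical attribute values) are assumptions made for this example, and Eq. 2 is read as the product, over attributes, of the relative frequency of the most frequent value.

```python
from collections import Counter


def community_purity(community, labels):
    """Purity of one community: product, over attributes, of the relative
    frequency of the most frequent value of that attribute (our reading of Eq. 2).

    community: non-empty set of nodes (hypothetical layout).
    labels: dict node -> dict attribute name -> categorical value.
    """
    size = len(community)
    if size == 0:
        raise ValueError("community must not be empty")
    # Assumption: every node carries the same attribute names.
    attributes = labels[next(iter(community))].keys()
    purity = 1.0
    for attr in attributes:
        counts = Counter(labels[v][attr] for v in community)
        purity *= max(counts.values()) / size
    return purity


def partition_purity(partition, labels):
    """Average of the community purities (Eq. 3)."""
    return sum(community_purity(c, labels) for c in partition) / len(partition)


if __name__ == "__main__":
    labels = {
        1: {"lang": "it", "role": "a"},
        2: {"lang": "it", "role": "b"},
        3: {"lang": "en", "role": "b"},
        4: {"lang": "en", "role": "b"},
    }
    print(partition_purity([{1, 2}, {3, 4}], labels))
```

In the example, the first community is homogeneous on one attribute but split on the other, so its purity is halved, while the second community is fully homogeneous.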

Python code available at: https://github.com/GiulioRossetti/EVA.


S. Citraro and G. Rossetti

Algorithm 1. EVA
 1: function EVA(G, α)
 2:     C ← Initialize(G)
 3:     Z ← αP + (1 − α)Q
 4:     Zprev ← −∞
 5:     while Z > Zprev do
 6:         C ← MoveNodes(G, C, α)
 7:         G ← Aggregate(G, C)
 8:         Zprev ← Z
 9:         Z ← αP + (1 − α)Q
10:     return C

The primary assumption underlying the purity definition is that node labels can be considered as independent and identically distributed random variables: in such a scenario, considering the product of maximal frequency labels is equivalent to computing the probability that a randomly selected node in the given community has exactly that specific label profile. Eva takes into account both modularity and purity while incrementally identifying a network partition. To do so, it combines them linearly, thus implicitly optimizing the following score:

$$Z = \alpha P + (1 - \alpha) Q \tag{4}$$

where α is a trade-off parameter that allows tuning the importance of each component to better adapt the algorithm results to the analyst's needs.

Eva pseudocode is given in Algorithm 1. Our approach takes as input a labeled graph G and a trade-off value α, and returns a partition C. As a first step (line 2), Eva assigns each node to a singleton community and computes the initial quality Z as a function of both modularity and purity. After the initialization step, the algorithm main loop is executed (lines 5–9). Eva computation, as in Louvain, can be broken into two main components: (i) greedy identification of the community merging move that produces the optimal increase of the partition quality (line 6), and (ii) network reconstruction (line 7).

Algorithm 2 details the procedure applied to identify the best move among the possible ones. Eva inner loop cycles over the graph nodes and, for each of them, evaluates the gain in terms of modularity and purity obtained by moving it to the community of a neighboring node (Algorithm 2, lines 8–14). For each pair (v, w) the local gain produced by the move is computed: Eva compares such value with the best gain identified so far and, if needed, updates the latter to keep track of the newly identified optimum. In case of ties, the move that results in a higher increase of the community size is preferred (Algorithm 2, lines 15–18). Such a procedure is repeated until no more moves are possible (Algorithm 2, line 19). As a result of Algorithm 2, the original allocation of nodes to communities is updated. After this step, the aggregate function (Algorithm 1, line 7) hierarchically updates the original graph G, transforming its communities into nodes, thus allowing the algorithm main loop to be repeated until there are no moves able to increase the partition quality (Algorithm 1, lines 8–9).
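To make the combined score of Eq. 4 concrete, the sketch below evaluates the modularity of Eq. 1 on a small graph and mixes it with a purity value. It is a plain-Python illustration under stated assumptions, not the Eva implementation: the adjacency-dictionary layout and the names `modularity` and `eva_score` are hypothetical, the graph is assumed undirected, unweighted and without self-loops, and the quadratic double loop is chosen for readability rather than efficiency.

```python
def modularity(adj, membership):
    """Modularity Q of Eq. 1.

    adj: dict node -> set of neighbours (symmetric, no self-loops).
    membership: dict node -> community identifier.
    """
    two_m = sum(len(nbrs) for nbrs in adj.values())  # 2m = sum of degrees
    if two_m == 0:
        return 0.0
    q = 0.0
    for v in adj:
        for w in adj:
            if membership[v] != membership[w]:
                continue  # delta(c_v, c_w) = 0
            a_vw = 1.0 if w in adj[v] else 0.0
            q += a_vw - len(adj[v]) * len(adj[w]) / two_m
    return q / two_m


def eva_score(purity, modularity_value, alpha):
    """Linear combination Z = alpha * P + (1 - alpha) * Q of Eq. 4."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * purity + (1.0 - alpha) * modularity_value


if __name__ == "__main__":
    # Two triangles joined by the edge (2, 3).
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
    membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
    q = modularity(adj, membership)
    print(q, eva_score(0.75, q, alpha=0.5))
```

With α = 0 the score reduces to modularity alone and with α = 1 to purity alone, which is the trade-off described in the text.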


Algorithm 2. EVA - MoveNodes
 1: function MoveNodes(G, C, α)
 2:     Cbest ← C
 3:     repeat
 4:         for all v ∈ V(G) do
 5:             s ← |C[v]|
 6:             sbest ← s
 7:             gbest ← −∞
 8:             for all u ∈ Γ(v) do
 9:                 Cnew ← C
10:                 Cnew[v] ← C[u]
11:                 snew ← |Cnew[v]|
12:                 qgain ← QCnew − QC
13:                 pgain ← PCnew − PC
14:                 g ← α pgain + (1 − α) qgain
15:                 if g > gbest or (g == gbest and snew > s) then
16:                     gbest ← g
17:                     sbest ← snew
18:                     Cbest ← Cnew
19:     until C == Cbest
20:     return Cbest

Eva Complexity. Being a Louvain extension, Eva shares the same time complexity, namely O(|V| log |V|). Regarding space consumption, the increase w.r.t. Louvain is related only to the data structures used for storing node labels. Considering k labels, the space required to associate them to each node in G is O(k|V|): assuming k [...]

[...] 0.4 rapidly reach saturation. Table 3 compares modularity and purity for four different instantiations of Eva on the Amh dataset while varying the number of node attributes from 1 to 5. We can observe that both quality functions are stable w.r.t. the number of attributes and that α = 0.8 offers a viable compromise to our aim.

Table 3. Multi-attribute: modularity and purity comparison of Eva over Amh

          Modularity                        Purity
          Amh1  Amh2  Amh3  Amh4  Amh5      Amh1  Amh2  Amh3  Amh4  Amh5
Eva0.1    .43   .43   .43   .43   .43       .49   .13   .09   .04   .03
Eva0.5    .43   .43   .43   .43   .43       .49   .73   .77   .79   .80
Eva0.8    .43   .42   .42   .42   .43       .95   .93   .95   .94   .96
Eva0.9    .42   .36   .38   .40   .40       .97   .95   .97   .95   .98

4 Related Work

In this section, a brief overview of previous studies addressing LCD is presented. As previously discussed, classic CD algorithms deal only with the topological information since their clustering schemes are established by optimizing structural quality functions. In this scenario, LCD is a challenging and more sophisticated task, aiming to balance the weight of topological and attribute-related information expressed by data-enriched networks to extract coherent and well-defined communities. At the moment, an emerging LCD algorithm classification proposal [13] organizes the existing algorithms in three families on the basis of the different methodological principles they leverage: (i) topological-based LCD, where the attribute information is used to complement the topological one that guides the partition identification; (ii) attributed-based LCD, where topology is used as refinement for partitions identified leveraging the information offered by node attributes; (iii) hybrid LCD approaches, where the two types of information are exploited in a complementary way to obtain the final partition.

Examples of topological-based LCD are of three types: those that weight the edges taking account of the attribute information [14], those that use a label-augmented graph [15], and those that extend a topological quality function in order to also consider the attribute information [12,16]. All three methodologies share the idea that the attribute information should be attached to the topological one, while, in attributed-based LCD, attributes are merged with the structural information into a similarity function between vertices [12,17]. Finally, examples of hybrid LCD approaches are those that use an ensemble method to combine the found partitions [18] and those that use probabilistic models treating vertex attributes as hidden variables [19].

5 Conclusion

In this paper, we introduced Eva, a scalable algorithmic approach to address the LCD problem that optimizes the topological quality of the communities alongside attribute homophily. Experimental results highlight how the proposed method outperforms CD and LCD state-of-art competitors in terms of community purity and modularity, allowing the identification of high-quality results even in multi-attribute scenarios. As future work, we plan to generalize the Eva methodology, allowing the selection of alternative quality functions, both topological (e.g., conductance rather than modularity) and attribute-related (e.g., making assumptions for the purity computation other than the independence of the vertex attributes). Moreover, we plan to integrate our approach within the CDlib project [20] and to extend it to support numeric node attributes.

Acknowledgment. This work is partially supported by the European Community's H2020 Program under the funding scheme "INFRAIA-1-2014-2015: Research Infrastructures" grant agreement 654024, http://www.sobigdata.eu, "SoBigData".

References

1. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967)
2. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008)
3. McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the construction of internet portals with machine learning. Inf. Retrieval 3, 127–163 (2000)
4. Leskovec, J., Mcauley, J.J.: Learning to discover social circles in ego networks. In: Advances in Neural Information Processing Systems, pp. 539–547 (2012)
5. Neville, J., Jensen, D., Friedland, L., Hay, M.: Learning relational probability trees. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2003, pp. 625–630. ACM (2003)
6. Trask, A., Michalak, P., Liu, J.: sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. CoRR, vol. abs/1511.06388 (2015)
7. Traud, A.L., Mucha, P.J., Porter, M.A.: Social structure of Facebook networks. CoRR, vol. abs/1102.2166 (2011)
8. Traag, V.A., Waltman, L., van Eck, N.J.: From Louvain to Leiden: guaranteeing well-connected communities. CoRR, vol. abs/1810.08473 (2018)
9. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. 104(1), 36–41 (2007)
10. Rosvall, M., Bergstrom, C.T.: Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. 105(4), 1118–1123 (2008)
11. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007)
12. Dang, T.A., Viennet, E.: Community detection based on structural and attribute similarities. In: International Conference on Digital Society (ICDS) (2012)
13. Falih, I., Grozavu, N., Kanawati, R., Bennani, Y.: Community detection in attributed network. In: Companion Proceedings of The Web Conference, pp. 1299–1306 (2018)
14. Neville, J., Adler, M., Jensen, D.: Clustering relational data using attribute and link information. In: 18th International Joint Conference on Artificial Intelligence, pp. 9–15 (2003)
15. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute similarities. Proc. VLDB Endow. 2, 718–729 (2009)
16. Combe, D., Largeron, C., Géry, M., Egyed-Zsigmond, E.: I-Louvain: an attributed graph clustering method. In: Advances in Intelligent Data Analysis XIV, pp. 181–192. Springer, Cham (2015)
17. Falih, I., Grozavu, N., Kanawati, R., Bennani, Y.: ANCA: attributed network clustering algorithm. In: Complex Networks and Their Applications, vol. VI, pp. 241–252. Springer, Cham (2018)
18. Elhadi, H., Agam, G.: Structure and attributes community detection: comparative analysis of composite, ensemble and selection methods. In: Proceedings of the 7th Workshop on Social Network Mining and Analysis, SNAKDD 2013, pp. 10:1–10:7. ACM (2013)
19. Yang, J., McAuley, J., Leskovec, J.: Community detection in networks with node attributes. In: 2013 IEEE 13th International Conference on Data Mining, pp. 1151–1156, December 2013
20. Rossetti, G., Milli, L., Cazabet, R.: CDLIB: a python library to extract, compare and evaluate communities from complex networks. Appl. Netw. Sci. 4(1), 52 (2019)

Exorcising the Demon: Angel, Efficient Node-Centric Community Discovery

Giulio Rossetti
KDD Lab, ISTI-CNR, Pisa, Italy
[email protected]

Abstract. Community discovery is one of the most challenging tasks in social network analysis. During the last decades, several algorithms have been proposed with the aim of identifying communities in complex networks, each one searching for mesoscale topologies having different and peculiar characteristics. Among such vast literature, an interesting family of Community Discovery algorithms, designed for the analysis of social network data, is represented by overlapping, node-centric approaches. In this work, following such line of research, we propose Angel, an algorithm that aims to lower the computational complexity of previous solutions while ensuring the identification of high-quality overlapping partitions. We compare Angel, both on synthetic and real-world datasets, against state-of-the-art community discovery algorithms designed for the same community definition. Our experiments underline the effectiveness and efficiency of the proposed methodology, confirmed by its ability to constantly outperform the identified competitors.

Keywords: Complex network analysis · Community discovery

1 Introduction

Community discovery (henceforth CD), the task of decomposing a complex network topology into meaningful node clusters, is allegedly the oldest and most discussed problem in complex network analysis [3,6]. One of the main reasons behind the attention such a task has received during the last decades lies in its intrinsic complexity, strongly tied to its overall ill-posedness. Indeed, one of the few universally accepted axioms characterizing this research field regards the impossibility of providing a single shared definition of what a community should look like. Usually, every CD approach is designed to provide a different point of view on how to partition a graph: in this scenario, the solutions proposed by different authors were often proven to perform well when specific assumptions can be made on the analyzed topology. Nonetheless, decomposing a complex structure into a set of meaningful components represents per se a step required by several analytical tasks, a need that has transformed what is usually considered a problem definition weakness, the existence of multiple partition criteria, into one of its major strengths. Such peculiarity has led to the definition of several "meta" community definitions, often tied to specific analytical needs. Classic works intuitively describe communities as sets of nodes closer among them than with the rest of the network, while others only define such topologies as dense network subgraphs. A general, high-level formulation of the Community Discovery problem is the following:

Definition 1 (Community Discovery (CD)). Given a network G, a community C is a set of distinct nodes: C = {v1, v2, ..., vn}. The community discovery problem aims to identify the set C of all the communities in G.

In this work, we introduce a CD algorithm, Angel, tailored to extract overlapping communities from a complex network. Our approach is primarily designed for social network analysis and belongs to a well-known subfamily of Community Discovery approaches often identified by the keywords bottom-up and node-centric [18]. Angel aims to provide a fast way to compute reliable overlapping network partitions. The proposed approach focuses on lowering the computational complexity of existing methods, proposing scalable sequential (although easily parallelizable) solutions to a very demanding task: overlapping network decomposition.

The paper is organized as follows. In Sect. 2 we introduce Angel, discussing its rationale, the properties it holds, and its computational complexity. In Sect. 3 we evaluate the proposed method on both synthetic and real-world datasets for which ground truth communities are known in advance. To better discuss the resemblance of Angel partitions to ground truth ones as well as its execution times, we compare the proposed method with state-of-art competitors sharing the same rationale. Finally, in Sect. 4 the literature relevant to our work is discussed and Sect. 5 concludes the paper.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 152–163, 2020. https://doi.org/10.1007/978-3-030-36687-2_13

2 Angel

In this section, we present our bottom-up solution to the community discovery problem: Angel. Our approach, as we will discuss, follows a well-known pattern composed of two phases: (i) construction of local communities moving from ego-network structures and (ii) definition of mesoscale topologies by aggregating the identified local-scale ones. Since Angel's main goal is reducing the computational complexity of previous node-centric approaches, we will detail the merging strategy it implements to build up the final community partition and, finally, we will discuss its properties and study its complexity.

Code available at: https://github.com/GiulioRossetti/ANGEL.

Algorithm Rationale. The algorithmic schema of Angel is borrowed from the Demon [4] one, an approach whose main goal was to identify local communities capturing individual nodes' perspectives on their neighborhoods and to use them to build mesoscale ones. Angel takes as input a graph G, a merging threshold φ and an empty set of communities C.

ALGORITHM 1. Angel
Input: G : (V, E), the graph; φ, the merging threshold.
Output: C, a set of overlapping communities.
 1  for v ∈ V do                            // Step #1
 2      e ← EgoMinusEgo(v, G)               // Step #2
 3      C(v) ← LabelPropagation(e)          // Step #3
 4      C ← C ∪ C(v)
 5  ncoms = |C|
 6  acoms = 0
 7  while ncoms != acoms do                 // Step #4
 8      acoms = ncoms
 9      C ← DecreasingSizeSorting(C)        // Step #5
10      for c ∈ C do
11          C ← PrecisionMerge(c, C, φ)     // Step #6
12      ncoms = |C|
13  return C

The main loop of the algorithm cycles over each node, so as to generate all the possible points of view of the network structure (Step #1 in Algorithm 1). To do so, for each node v, it applies the EgoMinusEgo(v, G) operation (Step #2 in Algorithm 1) as defined in [4]. Such function extracts the ego-network centered in the node v (i.e., the graph induced on G and built upon v and its first order neighbors), then removes v from it, obtaining a novel, filtered graph substructure. Angel removes v since, by definition, it is directly linked to all nodes in its ego-network, connections that would lead to noise in the identification of local communities. Obviously, a single node connecting the entire sub-graph will make all nodes very close, even if they are not in the same local community.

Once the ego-minus-ego graph is obtained, Angel computes the local communities it contains (Step #3 in Algorithm 1). The algorithm performs this step by using a community discovery algorithm borrowed from the literature: Label Propagation (LP) [13]. This choice, as in [4], is justified by the fact that: (i) LP has low algorithmic complexity (∼ O(N), with N the number of nodes), and (ii) it returns results of a quality comparable to more complex algorithms [3]. Reason (i) is particularly important since Step #3 of Angel needs to be performed once for every node of the network, thus making it unacceptable to spend super-linear time on each node. Notice that instead of LP any other community discovery algorithm (overlapping or not) can be used, impacting both the algorithmic complexity and the partition quality. Given the linear complexity (in the number of nodes of the extracted ego-minus-ego graph) of Step #3, we refer to this as the inner loop for finding the local communities. Due to the importance of LP for our approach, and to shed light on how it works, we briefly describe its classical formulation [13].
Suppose that a node v has neighbors v1, v2, ..., vk and that each one of them carries a label denoting the community that it belongs to: then, at each iteration, the label of v is updated to the majority label of its neighbors. As the labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label. At the end of the propagation process, nodes with the same labels are grouped as one community. In case of bow-tie situations (i.e., a node having an equal maximum number of neighbors in two or more communities), the classic definition of the LP algorithm randomly selects a single label for the contended node. Angel, conversely, handles this situation by allowing soft community memberships, thus producing deterministic local partitions.

The result of Steps #1–3 of Algorithm 1 is a set of local communities C(v), according to the perspective of a specific node, v, of the network. Unlike Demon, Angel does not reintroduce the ego in each local community, so as to reduce the noisy effects hubs play during the merging step. Local communities are likely to be an incomplete view of the real community structure of G. Thus, the result of Angel needs further processing: namely, merging each local community with the ones already present in C. Once the outer loop on the network nodes is completed, Angel leverages the PrecisionMerge function to compact the community set C so as to avoid the presence of fully contained communities in it. Such function (Step #6, detailed in Algorithm 2) implements a deterministic merging strategy and is applied iteratively until reaching convergence (Step #4), i.e., until the communities in C cannot be merged further. To assure that all the possible community merges are performed, at each iteration C is ordered from the smallest community to the biggest (Algorithm 1, Step #5).

This merging step is crucial since it needs to be repeated for each of the local communities. In Demon such an operation requires computing, for each pair of communities (x, y), x ∈ C(v) and y ∈ C, an overlap measure (e.g., the Jaccard index) and evaluating whether its value exceeds a user-defined threshold. This approach, although valid, has a major drawback: given a community x ∈ C(v), it requires O(|C|) evaluations to identify its best match among its peers. Indeed, such a strategy represents a costly bottleneck, requiring an overall O(|C|^2) complexity when applied to all the identified local communities.

ALGORITHM 2. PrecisionMerge
Input: x, a community; C, a set of overlapping communities; φ, the merging threshold.
Output: C, a set of overlapping communities.
1  com_to_freq ← community_frequency(x)    // Step #A
2  for com, freq ∈ com_to_freq do
3      if freq / |x| ≥ φ then               // Step #B
4          C = C − {x, com}
5          C = C ∪ {x ∪ com}
6  return C
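The PrecisionMerge procedure of Algorithm 2 can be illustrated with the following Python sketch. It is not the reference Angel implementation: the data layout (a dictionary of community identifiers to node sets, plus a node-to-identifiers index) and the rule of storing x as a new community when no merge fires are assumptions made for this example, and x is treated as an incoming local community that is not yet stored in C.

```python
from collections import Counter


def precision_merge(x, communities, node_to_coms, phi):
    """Merge x into every stored community whose precision w.r.t. x is >= phi.

    communities: dict community id -> set of nodes (updated in place).
    node_to_coms: dict node -> set of community ids (updated in place).
    Assumption: if no stored community qualifies, x is stored as a new one.
    """
    x = set(x)
    if not x:
        return communities
    # Step #A: count how many nodes of x already belong to each community.
    com_to_freq = Counter(cid for v in x for cid in node_to_coms.get(v, ()))
    merged = False
    for cid, freq in com_to_freq.items():
        # Step #B: precision = fraction of x's nodes found in community cid.
        if freq / len(x) >= phi:
            communities[cid] |= x
            for v in x:
                node_to_coms.setdefault(v, set()).add(cid)
            merged = True
    if not merged:
        new_id = max(communities, default=-1) + 1  # hypothetical id scheme
        communities[new_id] = x
        for v in x:
            node_to_coms.setdefault(v, set()).add(new_id)
    return communities


if __name__ == "__main__":
    communities = {0: {1, 2, 3, 4}}
    node_to_coms = {v: {0} for v in communities[0]}
    precision_merge({3, 4, 5}, communities, node_to_coms, phi=0.6)
    precision_merge({5, 6, 7}, communities, node_to_coms, phi=0.6)
    print(communities)
```

Because each node carries the identifiers of the communities it belongs to, the precision of x against every candidate is obtained with a single pass over the nodes of x, without computing set intersections against all of C.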
Angel aims to drastically reduce such computational complexity by performing the matches leveraging a greedy strategy. To do so, it proceeds in the following way:

(i) Angel assumes that each node carries, as additional information, the identifiers of all the communities in C it already belongs to;
(ii) in Step #A (Algorithm 2), for each local community x, the frequency of the community identifiers associated with its nodes is computed;
(iii) in Step #B, for each pair (community_id, frequency), its Precision w.r.t. x is computed, namely the percentage of nodes in x that also belong to community_id;
(iv) iff the precision ratio is greater than (or equal to) a given threshold φ, the local community x is merged with community_id: their union is added to C and the original communities are removed from the same set.

Operating in this way, the time-expensive computation of community intersections required by Jaccard-like measures is avoided, since all the containment testing can be done in place.

Angel Properties. The proposed approach possesses two nice properties: it produces a deterministic output (once the network G and threshold φ are fixed), and it allows for a parallel implementation.

Property 1 (Determinism). There exists a unique C = Angel(G, φ) for any given G and φ, regardless of the order in which the nodes of G are visited.

To prove the determinism of Angel it is necessary to break its execution into two well-defined steps: (i) local community extraction and (ii) merging of local communities.

(i) Local communities: Label Propagation identifies communities by applying a greedy strategy. In its classical formulation [13] it does not assure convergence to a stable partition due to the so-called "label ping-pong problem" (i.e., an instability scenario primarily due to bow-tie configurations). However, as already discussed, Angel addresses such a problem by relaxing the single-label constraint on nodes, thus allowing for the identification of a stable configuration of overlapping local communities.
(ii) Merging: this step operates on a well-determined set of local communities on which the PrecisionMerge procedure is applied iteratively. Since we explicitly impose the community visit ordering, the determinism of the solution is given by construction.

Property 2 (Compositionality). Angel is easily parallelizable since the local community extraction can be applied locally on well-defined subgraphs (i.e., ego-minus-ego networks).
Given a graph G = (V, E) it is possible to instantiate Angel local community extraction simultaneously on all the nodes u ∈ V and then apply the PrecisionMerge recursively in order to reduce and compact the final overlapping partition:

$$\mathrm{Angel}(G, \phi) = \mathrm{PMerge}\left(\bigcup_{u \in V} LP(EME(u))\right) \tag{1}$$

The underlying idea is to operate community merging only when all the local communities are already identified (i.e., LabelPropagation is applied to all the ego-minus-ego networks of the nodes u ∈ V, LP(EME(u)) in Eq. 1, as shown in Fig. 1). Moreover, this parallelization schema is assured to produce the same network partition obtained by the original sequential approach due to the determinism property.

Angel Complexity. To evaluate the time complexity we proceed by decomposing Angel into its main components. Given the pseudocode description provided in Algorithm 1, we can divide our approach into the following sub-procedures:


Fig. 1. Angel parallelization schema. The graph G is decomposed into |V| ego-minus-ego networks by a dispatcher D and distributed to n workers {LP0, ..., LPn} that extract local communities from them. At the end of such parallel process, a collector C iteratively applies PrecisionMerge until obtaining the final overlapping partition.
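The inner term of Eq. 1, i.e. the per-node extraction LP(EME(u)) that the workers of Fig. 1 carry out, can be sketched in plain Python as follows. This is an illustrative simplification rather than Angel's code: the adjacency-dictionary layout and function names are hypothetical, and the label propagation shown is a basic asynchronous variant with deterministic tie-breaking (keep the current label if it is among the most frequent, otherwise take the smallest), not the soft-membership variant used by Angel.

```python
from collections import Counter


def ego_minus_ego(adj, v):
    """Subgraph induced by the neighbours of v, with v itself removed."""
    nodes = set(adj[v])
    return {u: adj[u] & nodes for u in nodes}


def label_propagation(adj, max_rounds=100):
    """Simplified asynchronous label propagation with deterministic ties."""
    labels = {v: v for v in adj}
    for _ in range(max_rounds):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            best = {lab for lab, c in counts.items() if c == top}
            if labels[v] not in best:
                labels[v] = min(best)
                changed = True
        if not changed:
            break
    groups = {}
    for v, lab in labels.items():
        groups.setdefault(lab, set()).add(v)
    return list(groups.values())


def local_communities(adj):
    """Union over all nodes u of LP(EME(u)); each iteration is independent."""
    found = []
    for u in sorted(adj):
        found.extend(label_propagation(ego_minus_ego(adj, u)))
    return found


if __name__ == "__main__":
    # Node 0 is linked to two otherwise disconnected pairs (1, 2) and (3, 4).
    adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {0, 3}}
    print(label_propagation(ego_minus_ego(adj, 0)))
```

Since each call on an ego-minus-ego network depends only on that subgraph, the loop in `local_communities` could be distributed across workers and the merge applied once all results are collected.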

(i) Outer loop (lines 3–6): the algorithm cycles over the network nodes to extract the ego-minus-ego networks and identify local communities. This main loop has thus complexity O(|V|).
(ii) Local Communities extraction: the Label Propagation algorithm has complexity O(n + m) [13], where n is the number of nodes and m is the number of edges of the ego-minus-ego network. Let us assume that we are working with a scale-free network, whose degree distribution is $p_k = k^{-\alpha}$: in this scenario the majority of the identified ego-minus-ego networks are composed by n [...]

[...] 0 [8,9]
– Communicability (Comm): $K = \sum_{n=0}^{\infty} \frac{[\ldots]}{n!}$
– Forest: $K = \sum_{n=0}^{\infty} \alpha^n (-L)^n = (I + \alpha L)^{-1}$, $\alpha > 0$ [3]

Impact of Network Topology on Measures Efficiency


– Heat: $K = \sum_{n=0}^{\infty} \frac{\alpha^n (-L)^n}{n!} = \exp(-\alpha L)$, $\alpha > 0$ [14]
– PageRank (PR): $K = (I - \alpha P)^{-1}$, $0 < \alpha < 1$ [18]
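The closed forms of two of the kernels listed above can be checked numerically on a small graph. The sketch below is a dependency-free illustration, not code from the study: the helper names are hypothetical, L is taken to be the combinatorial Laplacian D − A, and P is assumed to be the row-stochastic transition matrix D^{-1}A (the graph must therefore have no isolated nodes). The Heat kernel is omitted because it needs a matrix exponential.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small dense matrices)."""
    n = len(matrix)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def forest_kernel(A, alpha):
    """K = (I + alpha * L)^{-1}, with L = D - A (assumed Laplacian)."""
    n = len(A)
    deg = [sum(row) for row in A]
    M = [[(1.0 if i == j else 0.0)
          + alpha * ((deg[i] if i == j else 0.0) - A[i][j])
          for j in range(n)] for i in range(n)]
    return invert(M)


def pagerank_kernel(A, alpha):
    """K = (I - alpha * P)^{-1}, with P = D^{-1} A (assumed transition matrix)."""
    n = len(A)
    deg = [sum(row) for row in A]
    M = [[(1.0 if i == j else 0.0) - alpha * A[i][j] / deg[i]
          for j in range(n)] for i in range(n)]
    return invert(M)


if __name__ == "__main__":
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # path graph on three nodes
    for row in forest_kernel(path, alpha=1.0):
        print([round(x, 4) for x in row])
```

Two sanity checks follow from the definitions: each row of the Forest kernel sums to 1, and each row of the PageRank kernel sums to 1/(1 − α).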

It is important to note that, because of the kernel definitions, the eigenvectors are the same for the Walk and Communicability measures, and for the Forest and Heat measures [1]. Consequently, Spectral clustering will lead to the same partitions for these 2 pairs, and we will use only 3 measures (i.e., Walk, Forest, and PageRank) instead of 5 when discussing results for the Spectral method.

3.4 Network Generation

To generate networks with different topologies, we use the LFR model introduced by Lancichinetti et al. in [15]. Generated networks share a number of features with real networks, e.g., the power law degree and community size distributions. By changing the model input parameters, one can get networks varying in size, average degree, power law exponent for the degree and community size distributions, minimum and maximum size of clusters, and cluster quality, i.e., the fraction of inter-community edges. Together, this allows one to get graphs with completely different structures. We discuss the chosen input parameters and the reasons for this choice in Sect. 4.

3.5 Clustering Quality Evaluation

For clustering quality evaluation, the Adjusted Rand Index (ARI) introduced in [10] is used. Reasons for using this quality index are provided in [16]. ARI plays an important role in the study, so we briefly explain it. The Rand Index was originally introduced in [20]. If $X$ and $Y$ are two different partitions (clusterings) of $n$ elements, let $a$ be the number of pairs of elements that are in the same cluster in both $X$ and $Y$, and $b$ the number of pairs of elements that are in different clusters in both $X$ and $Y$. Then the Rand Index equals $\frac{a+b}{\binom{n}{2}}$. The idea is simple: it is the number of agreements between the two partitions divided by the total number of pairs. Unfortunately, the Rand Index has a drawback: its expected value is not zero for random partitions. It therefore needs to be corrected, and ARI is the corrected version of the Rand Index: $\mathrm{ARI} = \frac{\mathrm{Index} - \mathrm{ExpectedIndex}}{\mathrm{MaxIndex} - \mathrm{ExpectedIndex}}$. For ARI, 1 corresponds to perfect matching, while 0 is characteristic of random labeling.
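The two indices can be illustrated with a short sketch. It assumes the standard pair-counting form of ARI from Hubert and Arabie [10], in which Index is the number of pairs placed together in both partitions, ExpectedIndex is its expectation under random labeling with fixed cluster sizes, and MaxIndex is the mean of the two within-partition pair counts. The function names and the toy labelings are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(labels_x, labels_y):
    """Return (pairs together in both, together in X, together in Y, all pairs)."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both labelings must cover the same elements")
    both = sum(comb(c, 2) for c in Counter(zip(labels_x, labels_y)).values())
    in_x = sum(comb(c, 2) for c in Counter(labels_x).values())
    in_y = sum(comb(c, 2) for c in Counter(labels_y).values())
    return both, in_x, in_y, comb(len(labels_x), 2)


def rand_index(labels_x, labels_y):
    """(a + b) / C(n, 2): the fraction of element pairs the partitions agree on."""
    both, in_x, in_y, total = pair_counts(labels_x, labels_y)
    a = both                        # same cluster in X and in Y
    b = total - in_x - in_y + both  # different clusters in X and in Y
    return (a + b) / total


def adjusted_rand_index(labels_x, labels_y):
    """(Index - ExpectedIndex) / (MaxIndex - ExpectedIndex)."""
    both, in_x, in_y, total = pair_counts(labels_x, labels_y)
    if total == 0:
        return 1.0  # fewer than two elements: nothing to disagree on
    expected = in_x * in_y / total
    maximum = (in_x + in_y) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (both - expected) / (maximum - expected)


# Identical partitions up to a renaming of the cluster labels.
SAME_RI = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
SAME_ARI = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])

# "Crossed" partitions: no pair is kept together by both labelings.
CROSSED_RI = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
CROSSED_ARI = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In the crossed example the plain Rand Index is still 1/3, because two pairs are separated in both partitions, whereas ARI is negative (−0.5). This shows the chance correction at work.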

4 Experimental Methodology

In this study, measures are tested in experiments with networks generated using the LFR model. To obtain different topologies, the following 6 input parameters of the LFR model are varied:


R. Aynulin

– Network size (n). Real networks differ in size. Due to computational limits, we cannot generate very large networks, so graphs are generated with the following numbers of nodes: 100, 300, 500, 1000, 2000, 3000.
– Average degree (m). Varying the average degree from 2 to 15 with step 1 yields networks ranging from very sparse (for which the clustering quality will be close to 0, regardless of other parameters) to fairly dense (with the clustering quality close to 1 for networks with good community structure).
– Power law exponent for the degree distribution (τ1). The power law exponent is usually considered to be in the range from 2 to 3 [15,17]. We use the following values of τ1: 2.0, 2.2, 2.4, 2.5, 2.6, 2.8, 3.0.
– Power law exponent for the community size distribution (τ2). Like the degree distribution, the community size distribution was also reported to follow a power law, with the typical limits 1 < τ2 < 2 [15]. Networks generated in this study have the power law exponent for the community size distribution varying from 1 to 2 with step 0.25.
– Minimum and maximum community size (cmin and cmax). By changing the limits for the community size, we can obtain networks with many small communities, few big communities, and intermediate stages between them. As a baseline, for n = 300, the following limits are used: [20, 50], [50, 80], [80, 140], [140, 185]; if the network size is different, the limits are scaled accordingly.
– Fraction of inter-community edges (μ). This parameter controls the quality of communities. We vary μ from 0.1 to 0.6 with step 0.1.

Graphs are generated in the following way: the basic configuration is n = 300, m = 5, τ1 = 2.5, τ2 = 1.5, cmin = 80, cmax = 140, μ = 0.2. Then, one of the parameters is varied from the basic configuration within the limits described above. After generation, networks are clustered with each of the measures listed in Sect. 3.3 using the Ward and Spectral methods.

Each of the measures depends on a parameter. Therefore, we also search for the optimal parameter, and the results include the clustering quality for the optimal parameter. As already noted, the quality of the produced partitions is evaluated using ARI. To get a stable result, for each combination of parameters, 100 graphs are generated and the quality index is averaged over them.
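The one-parameter-at-a-time design described above can be sketched as a small configuration generator. The dictionary keys, the function names and, in particular, the linear rounding rule used to scale the community size limits with n are assumptions made for illustration; the text only states that the limits are "scaled accordingly".

```python
# Basic configuration taken from the text.
BASELINE = {"n": 300, "m": 5, "tau1": 2.5, "tau2": 1.5,
            "cmin": 80, "cmax": 140, "mu": 0.2}

# Values of each varied parameter, as listed in the text.
SWEEPS = {
    "n": [100, 300, 500, 1000, 2000, 3000],
    "m": list(range(2, 16)),
    "tau1": [2.0, 2.2, 2.4, 2.5, 2.6, 2.8, 3.0],
    "tau2": [1.0, 1.25, 1.5, 1.75, 2.0],
    "community_limits": [(20, 50), (50, 80), (80, 140), (140, 185)],
    "mu": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
}


def scaled_limits(cmin, cmax, n, base_n=300):
    """Scale community size limits with n (assumed: linear, rounded)."""
    return round(cmin * n / base_n), round(cmax * n / base_n)


def configurations():
    """Yield (varied parameter, configuration) pairs, one parameter at a time."""
    for name, values in SWEEPS.items():
        for value in values:
            config = dict(BASELINE)
            if name == "community_limits":
                config["cmin"], config["cmax"] = value
            elif name == "n":
                config["n"] = value
                config["cmin"], config["cmax"] = scaled_limits(
                    BASELINE["cmin"], BASELINE["cmax"], value)
            else:
                config[name] = value
            yield name, config


CONFIGS = list(configurations())
BASELINE_HITS = sum(1 for _, config in CONFIGS if config == BASELINE)
```

Each sweep passes through the basic configuration once, which is why the same point can be marked on every panel of the result figures. In an actual experiment each configuration would be handed to an LFR generator 100 times; that step is omitted here.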

5 Results

In this section, we discuss the results of the experiments described above. Figure 1 presents results for the Spectral method. As noted in Sect. 3.3, it is meaningful here to analyze results for only 3 of the 5 measures under study. The basic set of parameters is marked by a circle on each of the graphs. The x-axis shows the values of the varied parameter, while the average ARI is plotted on the y-axis. When the community size limits are varied, the average number of clusters in the generated networks is plotted on the x-axis.


Fig. 1. Results for the Spectral method, point n = 300, m = 5, τ1 = 2.5, τ2 = 1.5, cmin = 80, cmax = 140, μ = 0.2 is marked

Both the clustering algorithm and the proximity measure may depend on the network topology. Features common to the plots for different proximity measures show how the algorithm depends on the topology, while deviations from the general picture show how an individual proximity measure depends on the network topology.


As can be seen, for the Spectral algorithm, all proximity measures behave similarly when the topology changes. Their ranking by the quality index is also largely preserved. We can therefore conclude that, when using the Spectral method, the relative clustering quality of each measure is almost independent of the network topology, and the top-performing measure is Walk for most topologies.

Let us now look at the common features of the plots for different measures and analyze how the clustering quality depends on the network structure for the Spectral algorithm itself. According to Fig. 1a, the Spectral method copes well with an increase in network size. This is not generally true for all community detection algorithms [19]. There is a rapid increase in the clustering quality when the average degree increases (Fig. 1b). This can be explained by the definition of a community on which the Spectral method is based. Like most clustering algorithms, this method looks for groups of nodes which are densely connected, and this is hard to do when there are almost no edges in the network. Figure 1c reveals an interesting relationship between the quality and the power law exponent for the degree distribution. For example, there are several local maxima and minima, and after the local minimum at τ1 = 2.2 there is a peak at τ1 = 2.4. So far, there is no explanation for such behavior, and it can be explored in more depth in future studies. In Fig. 1d, we can see that the quality is almost independent of τ2. According to Fig. 1e, the clustering quality is better when there are many small communities than when there are few big communities. Finally, in Fig. 1f, one can see the expected steep decline in quality when the fraction of inter-cluster edges increases.

Results for the Ward algorithm are presented in Fig. 2. This algorithm is more sensitive to the choice of the proximity measure. This is most noticeable in Fig. 2c, where we vary the power law exponent for the degree distribution. However, we can still find measures which perform well for most topologies (Walk and Communicability), and the ranking of measures by quality generally remains the same. So, in general, the superiority of one measure over another is a fundamental property which does not depend on the network topology. However, there are a few exceptions. For example, PageRank outperforms all the other measures when there are many small clusters (Fig. 2e). An interesting relation can be seen in Fig. 2f: the Forest and Heat measures are slightly worse than the others when there is a clear cluster structure and μ = 0.1. But as soon as the cluster structure becomes slightly less distinct and the fraction of inter-cluster edges increases to 0.2, their quality drops rapidly to zero. So, when using the Ward algorithm, Forest and Heat can detect clusters only if the community structure is distinct and there are almost no edges between clusters.

Let us now analyze the features common to all the measures, from which we can assess the impact of network topology on the Ward algorithm itself.


Fig. 2. Results for the Ward method, point n = 300, m = 5, τ1 = 2.5, τ2 = 1.5, cmin = 80, cmax = 140, μ = 0.2 is marked

Due to the computation limits, we used networks with n ≤ 1000 for clustering with the Ward method. However, even for this network size interval, in Fig. 2a we can see that the performance of the Ward method degrades when the network size increases.


Similarly to the Spectral method, the quality of clustering increases with the increase in the average degree (Fig. 2b) and decreases with the increase in μ (Fig. 2f). Also, according to Fig. 2e, many small clusters are better than few big clusters for the Ward method. The explanation for these properties is the same as for the Spectral method. Figure 2d shows that the power law exponent for the community size distribution still doesn’t essentially affect the efficiency of community detection, although there are more fluctuations in comparison to the Spectral method. According to Fig. 2c, the relation between the efficiency and τ1 is fuzzy, and it is hard to detect any common properties for all the measures. We can also make some conclusions about the comparative efficiency of the Ward and the Spectral algorithms. According to the results of the experiments, the Spectral method outperforms the Ward method in most cases.

6 Conclusion

In this paper, we studied how the network topology affects the quality of community detection for the graph measures Walk, Communicability, Forest, Heat, and PageRank. A variety of network topologies were generated using the LFR model, and the resulting graphs were clustered using the Ward and Spectral methods in combination with each of the above measures. We found that the efficiency of proximity measures depends on the network topology to some extent. However, this dependence is not critical, and measures which are efficient for most topologies can be found. For the Spectral method, the most efficient measure is Walk. When the Ward method is used, the Walk and Communicability measures outperform the others in most cases. We have also found some features common to all the measures. Using these common features, we can conclude how the algorithms themselves depend on the network topology. For example, both the Ward and the Spectral methods perform better with many small clusters than with few big clusters.

References

1. Avrachenkov, K., Chebotarev, P., Rubanov, D.: Kernels on graphs as proximity measures. In: International Workshop on Algorithms and Models for the Web-Graph. LNCS, vol. 10519, pp. 27–41. Springer (2017)
2. Aynulin, R.: Efficiency of transformations of proximity measures for graph clustering. In: International Workshop on Algorithms and Models for the Web-Graph. LNCS, vol. 11631, pp. 16–29. Springer (2019)
3. Chebotarev, P.Y., Shamis, E.: On the proximity measure for graph vertices provided by the inverse Laplacian characteristic matrix. In: 5th Conference of the International Linear Algebra Society, Georgia State University, Atlanta, pp. 30–31 (1995)
4. Chebotarev, P.: The walk distances in graphs. Discrete Appl. Math. 160(10–11), 1484–1500 (2012)
5. Costa, L.d.F., Oliveira Jr., O.N., Travieso, G., Rodrigues, F.A., Villas Boas, P.R., Antiqueira, L., Viana, M.P., Correa Rocha, L.E.: Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Adv. Phys. 60(3), 329–412 (2011)
6. Deza, M.M., Deza, E.: Encyclopedia of Distances. Springer, Berlin (2016)
7. Emmons, S., Kobourov, S., Gallant, M., Börner, K.: Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS One 11(7), e0159161 (2016)
8. Estrada, E.: The communicability distance in graphs. Linear Algebra Appl. 436(11), 4317–4328 (2012)
9. Fouss, F., Yen, L., Pirotte, A., Saerens, M.: An experimental investigation of graph kernels on a collaborative recommendation task. In: Sixth International Conference on Data Mining (ICDM 2006), pp. 863–868. IEEE (2006)
10. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
11. Ivashkin, V., Chebotarev, P.: Do logarithmic proximity measures outperform plain ones in graph clustering? In: International Conference on Network Analysis. PROMS, vol. 197, pp. 87–105. Springer (2016)
12. Jeub, L.G., Balachandran, P., Porter, M.A., Mucha, P.J., Mahoney, M.W.: Think locally, act locally: detection of small, medium-sized, and large communities in large networks. Phys. Rev. E 91(1), 012821 (2015)
13. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43 (1953)
14. Kondor, R., Lafferty, J.: Diffusion kernels on graphs and other discrete input spaces. In: International Conference on Machine Learning, pp. 315–322 (2002)
15. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
16. Milligan, G.W., Cooper, M.C.: A study of the comparability of external criteria for hierarchical cluster analysis. Multivar. Behav. Res. 21(4), 441–458 (1986)
17. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003)
18. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab (1999)
19. Pasta, M.Q., Zaidi, F.: Topology of complex networks and performance limitations of community detection algorithms. IEEE Access 5, 10901–10914 (2017)
20. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
21. Schenker, A., Last, M., Bunke, H., Kandel, A.: Comparison of distance measures for graph-based clustering of documents. In: International Workshop on Graph-Based Representations in Pattern Recognition. LNCS, vol. 2726, pp. 202–213. Springer (2003)
22. Sommer, F., Fouss, F., Saerens, M.: Comparison of graph node distances on clustering tasks. In: International Conference on Artificial Neural Networks. LNCS, vol. 9886, pp. 192–201. Springer (2016)
23. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
24. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58(301), 236–244 (1963)
25. Yen, L., Vanvyve, D., Wouters, F., Fouss, F., Verleysen, M., Saerens, M.: Clustering using a random walk based distance measure. In: ESANN, pp. 317–324 (2005)

Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks

Mário Cordeiro1,2(B), Rui Portocarrero Sarmento2, Pavel Brazdil2, Masahiro Kimura3, and João Gama2

1 Faculty of Engineering, University of Porto, Porto, Portugal
[email protected]
2 Laboratory of Artificial Intelligence and Decision Support, Porto, Portugal
3 Faculty of Science and Technology, Ryukoku University, Kyoto, Japan

Abstract. Discovering communities in a network is a fundamental and important problem in complex networks. Finding the most influential actors among their peers is a major task. On the one hand, studies on community detection ignore the influence of actors and communities; on the other hand, ignoring the hierarchy and community structure of the network neglects the actor or community influence. We bridge this gap by combining a dynamic community detection method with a dynamic centrality measure. The proposed enhanced dynamic hierarchical community detection method computes centrality for nodes and aggregated communities and selects each community's representative leader using the ranked centrality of every node belonging to the community. This method is then able to unveil, track, and measure the importance of the main actors and of the network's intra- and inter-community structural hierarchies based on a centrality measure. The empirical analysis performed using two temporal networks showed that the method is able to find and track community leaders in evolving networks.

Keywords: Community detection · Dynamic networks · Community leaders · Centrality measures

1 Introduction

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 198–210, 2020. https://doi.org/10.1007/978-3-030-36687-2_17

Typical tasks of Social Network Analysis (SNA) involve: the identification of the most influential, prestigious and central actors, using statistical measures; the identification of hubs and authorities, using link analysis algorithms; the discovery of communities, using community detection techniques; the visualization of the interactions between actors; or the spreading of information. These tasks are instrumental in the process of extracting knowledge from networks and consequently in the process of problem-solving with network data. In particular, centrality measures help us to identify relevant nodes and quantify the notion of importance of an actor in the network. Recently, researchers have invested a lot

Community Leaders in Dynamic Social Networks


of effort in the development of efficient algorithms able to compute centrality measures of nodes in very large evolving networks. Community detection is also a key task to unveil and understand the underlying structure of complex networks. Because community structures are very common in complex networks, detecting and identifying these communities is key to understanding hidden features of a network. Detecting communities summarizes interactions between members, providing a deep understanding of interesting characteristics shared between members of the same community. Social network communities usually have one leader, which in many cases represents the most influential, prestigious or central actor of the whole community. Identifying these actors should not be neglected.

Fig. 1. Zachary karate club community detection. Node 33 represents the “John A” group and Node 0 the “Mr. Hi” group: Classical (a) vs Hierarchical (b) → (c) → (d).

Fig. 2. Jemaah Islamiyah Bali bombings cell. Strategy (Samudra), logistics commander (Idris) and team’s gofer (Imron): Classical (a) vs Hierarchical (b) → (c).

Currently, the majority of the methods and approaches to the abovementioned tasks present some limitations. Quantifying the importance of actors in social networks using centrality measures based only on the local or global connectivity of the nodes is considered inappropriate, primarily because it disregards the hierarchy and community structure inherent in all human social networks. On the other hand, even the community detection methods that provide a hierarchy and community structure of a network do not give any insight into what these individual


M. Cordeiro et al.

communities represent in the overall network, nor information about community influence (as shown in Figs. 1¹ and 2²). In brief, community detection methods do not answer essential questions such as: What is the importance of each community among all the other identified communities? What are the major representative nodes of these densely connected sets of actors? What is the underlying structure of the community? In short, who are the most influential actors in the community, the so-called community leaders? To answer these questions, we propose, to the best of our knowledge, the first method which combines a dynamic Laplacian centrality [7] with a dynamic hierarchical community detection method [6]. The objective is to identify, rank and track communities and their leaders over time in evolving networks. The main contribution of the proposed method is threefold: first, the new method identifies and ranks communities according to their influence in the overall network; second, it ranks the most influential actors in each community to select a community leader (the most representative node of the community); and finally, it provides an in-depth community hierarchy and structure of the individual communities that form the full network. Extracting the hierarchical structure of communities and their influence in terms of leadership (i.e., the Leadership Hierarchy) at each time-step in a large evolving network is an important task. However, it is challenging, since it generally requires a large amount of computation. This paper presents a promising solution. Moreover, the method supports both an incremental-only and a fully dynamic setting to perform community detection using local modularity optimization, and to compute the Laplacian centrality for nodes and individual communities in evolving networks. By incremental only we mean networks in which only new nodes and edges are added in subsequent snapshots; by fully dynamic we mean support for both addition and removal of nodes and edges.
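The distinction between the two settings can be illustrated by comparing two consecutive snapshots. The sketch below is a simplification that assumes undirected snapshots given as edge lists and ignores isolated nodes; the function names are hypothetical and not part of the proposed method.

```python
def snapshot_delta(previous_edges, current_edges):
    """Return the sets of undirected edges added and removed between snapshots."""
    def normalise(edges):
        # An undirected edge (u, v) is the same as (v, u).
        return {frozenset(edge) for edge in edges}

    previous, current = normalise(previous_edges), normalise(current_edges)
    return current - previous, previous - current


def update_setting(previous_edges, current_edges):
    """Name the setting needed to process the change between two snapshots."""
    _, removed = snapshot_delta(previous_edges, current_edges)
    return "fully dynamic" if removed else "incremental only"


# Only a new edge (3, 4) appears: additions alone.
GROWING = update_setting([(1, 2), (2, 3)], [(2, 1), (2, 3), (3, 4)])

# Edge (2, 3) disappears while (3, 4) appears: removals are involved.
CHANGING = update_setting([(1, 2), (2, 3)], [(1, 2), (3, 4)])
```

A method restricted to the incremental-only setting could process the first change but not the second.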

2 Related Methods and Techniques

Cordeiro et al. [8] review the state of the art in selected aspects of evolving social networks and present open research challenges related to Online Social Networks. In the present work, the focus of research is on maintenance methods, for which it is desirable to maintain the results of the data mining process continuously over time [1]. More specifically, we are interested in identifying, ranking and tracking communities and their leaders continuously over time. To accomplish this, two different kinds of methods are required: the detection and identification of community structures, which represent occurrences of groups of nodes in the network that are more densely connected internally than with the rest of the network; and actor-level or node-level statistical measures, such as centrality measures and leadership, to determine the importance of an actor or node within

¹ Figure 1 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Intro.html.
² Figure 2 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Intro2.html.


the network, i.e., to reveal the individuals in which the most important relationships are concentrated and to give an idea of their social power among their peers.

Community Detection Methods: Fortunato [10], in a comprehensive survey devoted to the methods and techniques for finding communities in static networks, classifies hierarchical clustering methods into two types: divisive algorithms, such as Girvan and Newman; and agglomerative algorithms, such as [3], in which a greedy modularity optimization is used. Concerning dynamic networks, Aggarwal and Subbian [1] propose a division into methods for slowly evolving networks and methods for streaming networks. Methods for slowly evolving networks are: [3] when used on batch snapshots; [33], a modularity-based method employing the principles of Palla et al. [26] on community life events (growth, contraction, merging, splitting, birth and death); QCA [24], a fast and adaptive algorithm based on [3] and preferable neighboring groups; AFOCS [23], a modified QCA that allows the detection of overlapping communities; label propagation techniques, more specifically speaker-listener label propagation (SLPA), such as LabelRank, GANXiSw and LabelRankT, all with good performance in overlapping community detection [36]; and Cordeiro et al. [6], a modularity-based fully dynamic community detection algorithm in which dynamically added and removed nodes and edges only affect their related communities. By reusing the community structure obtained in previous iterations, the local modularity optimization step operates on smaller networks where only affected communities are disbanded to their origin.
Streaming networks methods include: [35], which uses a local weighted-edge-based pattern (LWEP) as a summary to cluster weighted graph streams and perform dynamic community detection; almost-linear-time simple label propagation algorithms [17,29]; spectral-based efficient memory-limited streaming clustering algorithms [38]; random and adaptive sampling methods [38]; SONIC, a find-and-merge type of overlapping community detection [31]; and SCoDA, a linear streaming algorithm for very large networks [11]. By combining consensus clustering [16] in a self-consistent way with any of the above methods, the stability and accuracy of the resulting partitions are enhanced considerably.

Centrality Measures: There is no consensus on a best centrality measure for graphs; several measures are accepted and give centrality values that are valuable in different scenarios. In [8] it is shown that most of the commonly used centrality measures for static networks have versions for evolving networks. Commonly used measures are the following. Betweenness Centrality B(v) provides an indicator of the extent to which a node lies between other nodes in the network. Popular implementations are the Brandes [4] algorithm for static networks and Nasre et al. [22] for incremental updates; Kas et al. [12] adapted the classical Betweenness algorithm for evolving graphs. Closeness Centrality C(v) quantifies the reachability of every network node from a given starting node. It gives an overall indicator of the actor's position in the complete network by measuring, on average, how long it takes to reach any other network node. Two methods proposed in the literature use incremental updates of the Closeness Centrality measure in evolving networks: Kas et al. [13] and Sariyuce et al. [30]. Eigenvector Centrality E(v) assumes that the status of a node is recursively defined by the status of its first-degree connections, i.e., the nodes that are directly connected to it [25]. It is often implemented for static networks using Katz centrality and Google’s PageRank. Variants for evolving networks are [2,9] and [15]. Laplacian Centrality L(v) makes it possible to include intermediate, circumstantial information about the surroundings of a node or vertex when computing its centrality. The Laplacian centrality of a given vertex v is described as a function of the number of 2-walks in the network in which v takes part. The fact that Laplacian Centrality is a local measure [27,28] motivated the incremental versions of [27] and [7], whose efficiency is improved by computing single-node centralities only for nodes affected by the removal or addition of edges in network snapshots.

Influential Communities and Community Leaders: In analogy to actor-level or node-level statistical measures that determine the importance of an actor or node within the whole network, the influence of a community measures the impact of that community on the overall network. This topic, ignored by traditional community detection algorithms, has been introduced in the field very recently [20]. Commonly used approaches are: an efficient search method for the top-r k-influential communities, proposed by [20] (r is the number of communities; community nodes have degree at least k); the maximal kr-clique community [18], which uses heuristics based on common graph structures like cliques; the k-influential community, based on the concept of k-core to capture the influence of a community [21]; and the skyline community, combining the concepts of k-core and a multi-valued network [19]. Also ignored by existing community detection algorithms, the role of community leaders has arisen in recent research. This problem is addressed by leader-follower model-based methods, which hold that a community is a set of follower nodes around a leader: Top Leaders [14], Licod [37] and [32]. Recently, Sun et al. [34] proposed an agglomerative clustering method that measures the leadership of each node and lets each one adhere to its local leader, forming dependence trees. This leader-aware community detection algorithm finds community structures as well as the leaders of each community. Other proposals include the method of [39], which identifies influential nodes with community structure using the information transfer probability between any pair of nodes and the k-medoid clustering algorithm. In [40], a centrality index (the community-based centrality) was introduced to identify influential spreaders in a network based on its community structure. The index considers both the number and the sizes of the communities that are directly linked by a node. None of the above-referenced influential community methods or community leader methods is suitable for evolving networks: all of them were devised and designed for handling static networks.
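The locality of the Laplacian centrality can be made concrete with a short sketch. It assumes the usual closed form for unweighted, undirected graphs, $L(v) = d_v^2 + d_v + 2\sum_{u \in N(v)} d_u$, where $d_v$ is the degree of $v$ and $N(v)$ its neighbors. The 7-node edge list is hypothetical: it was chosen so that the per-node values match the labels l1 to l7 printed in Fig. 4, but the actual edges of that figure are not recoverable here. The function names are likewise illustrative and are not the implementation of [7].

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def laplacian_centrality(adjacency, node):
    """d^2 + d + 2 * (sum of neighbour degrees), for unweighted graphs."""
    degree = len(adjacency[node])
    neighbour_degrees = sum(len(adjacency[nbr]) for nbr in adjacency[node])
    return degree * degree + degree + 2 * neighbour_degrees


def add_edge_incremental(adjacency, scores, u, v):
    """Insert edge (u, v) and recompute only the scores that can change."""
    adjacency[u].add(v)
    adjacency[v].add(u)
    # A score depends on the node's own degree and its neighbours' degrees,
    # so only the endpoints and their neighbours are affected.
    affected = {u, v} | adjacency[u] | adjacency[v]
    for node in affected:
        scores[node] = laplacian_centrality(adjacency, node)
    return affected


# Hypothetical toy network (assumption, see the lead-in).
EDGES = [(1, 2), (1, 3), (2, 3), (3, 5), (4, 5), (5, 6), (4, 7), (6, 7)]
ADJACENCY = build_adjacency(EDGES)
SCORES = {node: laplacian_centrality(ADJACENCY, node) for node in ADJACENCY}
BEFORE = dict(SCORES)

# Incremental update for the new edge (4, 6), then a full batch recomputation.
AFFECTED = add_edge_incremental(ADJACENCY, SCORES, 4, 6)
BATCH = {node: laplacian_centrality(ADJACENCY, node) for node in ADJACENCY}
```

On this toy graph the incremental update touches nodes 4, 5, 6 and 7 only, and its scores coincide with those of the full recomputation.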

Fig. 3. Calculated node centralities with edge {(4, 6)} added: (a) Batch; (b) Incremental.

3 Identifying, Ranking and Tracking Community Leaders

To answer the questions raised above, this work proposes a new method to solve the problem of identifying, ranking and tracking community leaders in evolving networks. The method combines an agglomerative hierarchical community detection algorithm [6] with an efficient centrality measure [7], both designed to handle evolving networks and with proven efficiency in large networks. With this new method, communities can be detected and tracked over time very efficiently. In parallel, for each community, a leader is chosen to represent the community according to its centrality among its community peers. The main purpose is to identify and track communities over time and to establish intra-community and inter-community leadership hierarchies.

3.1 Dynamic Community Detection

The dynamic community detection algorithm proposed by Cordeiro et al. [6] shares the greedy optimization method of the static version by Blondel et al. [3], i.e., it attempts to optimize the modularity of a partition of the network in successive iterations. Communities are calculated by maximizing the objective function in a two-step optimization in each iteration. In the first step (step 1), small communities are formed by optimizing the modularity locally. Only local changes in communities are allowed in this step. In the following step (step 2), nodes belonging to the same community are aggregated into a single node that represents a community in a new aggregated network of communities. These steps are repeated iteratively until no increase in modularity is possible, producing a hierarchy of communities. This algorithm is a modification of the original Louvain method: nodes and edges are dynamically added and removed in a way that only affects their related communities, by handling the 4 different types of addition of nodes/edges and the 4 types of removal of edges/nodes. Thereby, in every iteration the algorithm leaves unchanged all the communities that were not affected by modifications to the network. Efficiency is achieved by reusing the community structure obtained in previous iterations, so the local modularity optimization step operates on smaller networks where only affected communities are disbanded to their origin.

3.2 Dynamic Laplacian Centrality

As stated before, the Laplacian Centrality metric is not a global measure, i.e., it is a function of the local degree plus the degrees of the neighbors (with different weights for each). In [27,28] it is shown that the local degree and the degrees of the 1st-order neighbors are all that is needed to calculate the metric for unweighted networks. Note that a similar approximation can be made for weighted networks. For evolving networks, the algorithm proposed by Cordeiro et al. [7] performs selective Laplacian Centrality calculations only for the nodes affected by the addition and removal of edges in the snapshot (i.e., it reuses Laplacian Centrality information from the previous snapshot). To demonstrate this property, the toy network example in Fig. 3 is used to show the locality of the Laplacian Centrality. Dark grey nodes are affected by the addition of edges. The incremental version shows in light grey the nodes whose centralities need to be recalculated due to their neighborhood with affected nodes. In comparison to the batch version, only 4 out of 7 nodes required centralities to be computed in the incremental version.

Fig. 4. Community leaders algorithm steps: (a) original network; (b) initial communities; (c) step 1 of 1st iteration; (d) step 2 of 1st iteration; (e) step 1 of 2nd iteration; (f) step 2 of 2nd iteration. Cx represents the community of node x; ly represents the Laplacian centrality of node y; lCz represents the Laplacian centrality of community Cz. Steps and iterations follow the same flow defined by Cordeiro et al. [6].

3.3 Combining Centrality and Dynamic Community Detection

Our proposed method has a twofold enhancement:

Representative Nodes Chosen W.r.t. Centrality: The algorithm of Blondel et al. [3] is non-deterministic and produces a solution for the communities that is very unstable between separate runs on the same snapshot. This issue is attenuated in [6], which produces more stable communities by using the concept of local modularity. Nevertheless, [6] still retains one of the major drawbacks of the method: community representative nodes are randomly chosen and have no particular significance among their community peers. This means that, apart from their hierarchical structure, higher-level aggregated networks are meaningless. In short, top-most nodes are not representative of their community and often change from iteration to iteration.

Intra-community and Inter-community Centralities: Both the Blondel et al. [3] and Cordeiro et al. [6] methods were designed primarily to perform efficient community detection on large networks. Neither of them uses node or actor-level measures to quantify the importance of a node or community over its peers. By combining the dynamic community detection algorithm [6] with the dynamic Laplacian Centrality method [7], we enable the disclosure of both inter-community and intra-community centralities. By inter-community centrality we mean finding the most central or influential community or communities in the network. By intra-community centrality we mean finding the most central or influential nodes belonging to a community.

Fig. 5. Community leadership hierarchies. Panels: (a) original network; (b) step 1 of 2nd iteration.

Using the toy network of Fig. 4 and the pseudo-code of Algorithm 1, we now describe the complete method. Immediately after the initial partition (Line 6), Laplacian centralities are computed for every node in the network (Line 7, Fig. 4b). Then, while the maximum modularity is not reached, or while edges or nodes are still being added to or removed from the network (Line 10), a new community detection step (step 1) is performed (Line 11, Fig. 4c). The representative node of each community is chosen as the node with the highest Laplacian Centrality in that community (Line 12, Fig. 4c).
In the following step (step 2), the newly generated aggregated network of communities (Line 16) already includes the Laplacian Centralities computed for nodes that changed community (Line 17, Fig. 4d) and for nodes affected by the addition or removal of edges (Line 21 and Line 29 respectively). A new 2-step iteration is then performed, as shown in Fig. 4e and f. The method also provides ways to perform a hierarchical analysis of the importance of communities. Detailed observation of Fig. 4f shows that the network has two major communities (C5 and C3). The non-normalized Laplacian Centrality values obtained reveal C5 as the most important community in the network (lC5 = 184 vs lC3 = 119). Moving backwards in the hierarchy, it can also be observed that C5 is composed of two sub-modules or sub-communities: C5 and C4. C3 is composed of 3 nodes ({1, 2, 3}). Centrality ranks within individual communities can also be observed in Fig. 4c: node 3 is the most central within community C3. By taking centrality as a measure of community leadership, we can visualize the network of Fig. 4 in a hierarchical way, as shown in Fig. 5b.
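The two centrality-related steps of the method (Line 7 and Line 12 of Algorithm 1) can be sketched in a few lines. This is only an illustrative sketch for unweighted, undirected graphs, not the authors' implementation: it assumes the closed form l_v = d_v^2 + d_v + 2 Σ_{u ∈ N(v)} d_u for the Laplacian centrality (the drop in Laplacian energy when v is removed), and the function names, the toy edge list and the community labels "A" and "B" are hypothetical.

```python
from collections import defaultdict


def laplacian_centrality(adj):
    """Laplacian centrality of every node of an unweighted, undirected graph.

    adj maps each node to the set of its neighbours. The value for node v is
    d_v^2 + d_v + 2 * sum(d_u for u in N(v)), so it only depends on the degree
    of v and on the degrees of its first-order neighbours.
    """
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    return {v: deg[v] ** 2 + deg[v] + 2 * sum(deg[u] for u in adj[v]) for v in adj}


def community_leaders(centrality, membership):
    """Pick, for each community, the member with the highest centrality."""
    best = {}
    for node, comm in membership.items():
        if comm not in best or centrality[node] > centrality[best[comm]]:
            best[comm] = node
    return best


# Hypothetical toy graph: two triangles joined by the edge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

membership = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
centrality = laplacian_centrality(adj)
print(community_leaders(centrality, membership))  # {'A': 3, 'B': 4}
```

In this toy graph the two bridge nodes obtain the highest centrality (26, against 16 for the other nodes), so each becomes the representative of its community.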

4

Results

Figures 1 and 2 show the effectiveness of the method in identifying and ranking community leaders in static networks. For evolving social networks, we carried out a visual empirical analysis in an incremental network setting, using two distinct datasets. Fig. 6 (see footnote 3) uses the Zachary karate club dataset, divided into 4 snapshots containing an equal number of randomly chosen edges. Fig. 7 (see footnote 4) uses the temporal collaboration network of Jure Leskovec and Andrews Ng [5], in which the 20-year co-authorship of both authors is partitioned into four 5-year intervals. In these figures, the left side shows the graphs resulting from the direct application of the method to each one of the increments (from t = 0 to t = 3). The vertical stack of graphs represents each one of the levels of the hierarchical community detection algorithm. The right side shows the hierarchy of each aggregated community, presented in isolation for the last increment (t = 3). For readability, the size of each node reflects the normalized centrality of the node or aggregated community, newly added edges are drawn as dashed edges, and colours are maintained for nodes belonging to the same community across plots.

In Fig. 6a, at the first increment (t = 0), it is visible that nodes 0 (C0) and 33 (C33) are important nodes in the network. In Fig. 6a (t = 0), at Layer 4, they represent the two highest-centrality communities in the network. In Fig. 6a (t = 1), node 33 (C33) loses its importance to node 30 (C30), and regains it in Fig. 6a (t = 2). The final community hierarchies are shown in Fig. 6b: the algorithm, using locality modularity maximization, found four final communities instead of the expected two (the “John A” group and the “Mr. Hi” group), although the centrality values of C31 and C6 are very low.

In the first two increments of Fig. 7a (t = 0 and t = 1), which represent the first 10 years of collaboration, the Andrews Ng node (C_AndrewsNg) is the most important, with a few other small and less important communities around it. Note that Jure Leskovec led an initial community in Fig. 7a (t = 1). In Fig. 7a (t = 3), Jure Leskovec emerges as the most important community, followed by Andrews Ng. An interesting observation is that both communities are separated by the Christopher Potts community; Christopher Potts is known for being the only author with collaborations with both. Figure 7b shows the four most important community hierarchies and their respective normalized centralities: C_JureLeskovec with 0.775, C_AndrewsNg with 0.643, and two Andrews Ng related communities, C_QuocLe with 0.083 and C_AdamCoates with 0.053.

Fig. 6. Leadership Hierarchies for the Zachary karate club dataset. Panels: (a) increments t = 0 .. t = 3; (b) community leadership analysis.

Fig. 7. Leadership Hierarchies for the Jure Leskovec and Andrews Ng temporal collaboration network. Panels: (a) increments t = 0 .. t = 3; (b) community leadership analysis.

Footnote 3: Figure 6 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Karate.html.
Footnote 4: Figure 7 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Jure.html.

Algorithm 1. Dynamic Community Leaders Algorithm
 1: V ← {u1, u2, .., uv}, E ← {(i1, j1), (i2, j2), .., (ie, je)}
 2: A ← array{(i1, j1), .., (im, jm)}
 3: R ← array{(i1, j1), .., (in, jn)}
 4: procedure Main(G ← (V, E), A, R)
 5:   Cll ← {C1, C2, .., Cn}, Cul ← {}, Caux ← Cll
 6:   InitPartition(Caux)
 7:   Lcent ← LapCent(Cll)    ▷ initial centralities
 8:   mod ← Modularity(Caux), old_mod ← 0
 9:   m ← 1, n ← 1
10:   while (mod ≥ old_mod ∨ m ≤ |A| ∨ v ≤ |R|) do
11:     Caux ← OneLevel(Caux)
12:     Caux ← RepresentativeNodesByMaxCentrality(Caux, Lcent)
13:     n, c ← CommunityChangedNodes(Cll, Caux)
14:     Cll ← UpdateCommunities(Cll, n, c)
15:     old_mod ← mod, mod ← Modularity(Cll)
16:     Cul ← PartitionToGraph(Cll)
17:     Lcent ← LapCentOnNodes(Cul, n, c)    ▷ centralities in new network
18:     if m ≤ |A| then
19:       src, dest ← A[m]
20:       anodes ← AffectedByAddition(src, dest, Cll)
21:       Lcent ← LapCentOnNodes(Cll, anodes)    ▷ affected by addition
22:       Cll ← AddEdge(src, dest, Cll)
23:       Cll ← DisbandCommunities(Cll, anodes)
24:       Cul ← SyncCommunities(Cll, Cul, anodes)
25:     end if
26:     if n ≤ |R| then
27:       src, dest ← R[n]
28:       anodes ← AffectedByRemoval(src, dest, Cll)
29:       Lcent ← LapCentOnNodes(Cll, anodes)    ▷ affected by removal
30:       Cll ← RemoveEdge(src, dest, Cll)
31:       Cll ← DisbandCommunities(Cll, anodes)
32:       Cul ← SyncCommunities(Cll, Cul, anodes)
33:     end if
34:     Caux ← Cul, m ← m + 1, n ← n + 1
35:   end while
36: end procedure

5

Conclusions

In evolving networks, there is a clear gap between traditional community detection methods and the task of ranking and tracking community leaders. The proposed method aims to bridge this gap by combining two techniques with proven results in their respective domains: dynamic hierarchical community detection, enhanced with a dynamic Laplacian centrality method. The method was applied to a human social network and to a temporal collaboration network, and the empirical analysis of the results suggests that it is a promising basis for future research. In both cases, the method was able to unveil, track, and measure the importance of the main actors and of the intra- and inter-community structural hierarchies of the network using a centrality measure.

References

1. Aggarwal, C., Subbian, K.: Evolutionary network analysis: a survey. ACM Comput. Surv. (CSUR) 47(1), 1–36 (2014)
2. Bahmani, B., Chowdhury, A., Goel, A.: Fast incremental and personalized PageRank. Proc. VLDB Endow. 4(3), 173–184 (2010)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theor. Exp. 2008(10), P10008 (2008)
4. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25, 163–177 (2001)
5. Chen, P.Y., Hero, A.O.: Multilayer spectral graph clustering via convex layer aggregation: theory and algorithms. IEEE Trans. Sig. Inf. Process. Netw. 3, 553–567 (2017)
6. Cordeiro, M., Sarmento, R., Gama, J.: Dynamic community detection in evolving networks using locality modularity optimization. Soc. Netw. Anal. Min. 6(1), 15 (2016)
7. Cordeiro, M., Sarmento, R.P., Brazdil, P., Gama, J.: Dynamic laplace: efficient centrality measure for weighted or unweighted evolving networks. CoRR abs/1808.02960 (2018)
8. Cordeiro, M., Sarmento, R.P., Brazdil, P., Gama, J.: Evolving networks and social network analysis methods and techniques. In: Višňovský, J., Radošinská, J. (eds.) Social Media and Journalism, chap. 7. IntechOpen, Rijeka (2018)
9. Desikan, P., Pathak, N., Srivastava, J., Kumar, V.: Incremental page rank computation on evolving graphs. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, WWW 2005, pp. 1094–1095. ACM, New York (2005)
10. Fortunato, S.: Community detection in graphs, June 2009
11. Hollocou, A., Maudet, J., Bonald, T., Lelarge, M.: A linear streaming algorithm for community detection in very large networks. CoRR abs/1703.02955 (2017)
12. Kas, M., Wachs, M., Carley, K.M., Carley, L.R.: Incremental algorithm for updating betweenness centrality in dynamically growing networks. In: 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), pp. 33–40, August 2013
13. Kas, M., Carley, K.M., Carley, L.R.: Incremental closeness centrality for dynamically changing social networks. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013, pp. 1250–1258. ACM, New York (2013)
14. Khorasgani, R.R., Chen, J., Zaiane, O.R.: Top leaders community detection approach in information networks. In: Proceedings of the 4th Workshop on Social Network Mining and Analysis (2010)
15. Kim, K.S., Choi, Y.S.: Incremental iteration method for fast PageRank computation. In: Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, IMCOM 2015, pp. 80:1–80:5. ACM, New York (2015)
16. Lancichinetti, A., Fortunato, S.: Consensus clustering in complex networks. Sci. Rep. 2, 336 (2012)
17. Leung, I.X., Hui, P., Liò, P., Crowcroft, J.: Towards real-time community detection in large networks. Phys. Rev. E - Stat. Nonlinear Soft Matter Phys. 79, 066107 (2009)
18. Li, J., Wang, X., Deng, K., Yang, X., Sellis, T., Yu, J.X.: Most influential community search over large social networks. In: Proceedings - International Conference on Data Engineering (2017)
19. Li, R.H., Qin, L., Ye, F., Yu, J.X., Xiaokui, X., Xiao, N., Zheng, Z.: Skyline community search in multi-valued networks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2018)
20. Li, R.H., Qin, L., Yu, J.X., Mao, R.: Influential community search in large networks. Proc. VLDB Endowment 8, 509–520 (2015)
21. Li, R.H., Qin, L., Yu, J.X., Mao, R.: Finding influential communities in massive networks. VLDB J. 26, 751–776 (2017)
22. Nasre, M., Pontecorvi, M., Ramachandran, V.: Betweenness centrality - incremental and faster. CoRR abs/1311.2147 (2013)
23. Nguyen, N.P., Dinh, T.N., Tokala, S., Thai, M.T.: Overlapping communities in dynamic networks: their detection and mobile applications. In: Proceedings of the Annual International Conference on Mobile Computing and Networking, MOBICOM (2011)
24. Nguyen, N.P., Dinh, T.N., Xuan, Y., Thai, M.T.: Adaptive algorithms for detecting community structure in dynamic social networks. In: INFOCOM, pp. 2282–2290. IEEE (2011)
25. Oliveira, M.D.B., Gama, J.: An overview of social network analysis. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 2(2), 99–115 (2012)
26. Palla, G., Barabási, A.L., Vicsek, T.: Quantifying social group evolution. Nature 446(7136), 664–667 (2007)
27. Qi, X., Duval, R.D., Christensen, K., Fuller, E., Spahiu, A., Wu, Q., Wu, Y., Tang, W., Zhang, C.: Terrorist networks, network energy and node removal: a new measure of centrality based on laplacian energy. Soc. Netw. 02(01), 19–31 (2013)
28. Qi, X., Fuller, E., Wu, Q., Wu, Y., Zhang, C.Q.: Laplacian centrality: a new centrality measure for weighted networks. Inf. Sci. 194, 240–253 (2012)
29. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E - Stat. Nonlinear Soft Matter Phys. 76, 036106 (2007)
30. Sariyuce, A.E., Kaya, K., Saule, E., Catalyurek, U.V.: Incremental algorithms for closeness centrality. In: Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013, pp. 487–492 (2013)
31. Sarıyüce, A.E., Gedik, B., Jacques-Silva, G., Wu, K.L., Çatalyürek, Ü.V.: SONIC: streaming overlapping community detection. Data Min. Knowl. Discov. 30, 819–847 (2016)
32. Shah, D., Zaman, T.: Community detection in networks: the leader-follower algorithm. Sort 1050, 2 (2010)
33. Shang, J., Liu, L., Xie, F., Chen, Z., Miao, J., Fang, X., Wu, C.: A real-time detecting algorithm for tracking community structure of dynamic networks. In: 2012 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining Workshops, SNAKDD, vol. 12 (2012)
34. Sun, H., Du, H., Huang, J., Li, Y., Sun, Z., He, L., Jia, X., Zhao, Z.: Leader-aware community detection in complex networks. Knowl. Inf. Syst. 1–30 (2019)
35. Wang, C.D., Lai, J.H., Yu, P.S.: Dynamic community detection in weighted graph streams. In: Proceedings of the 2013 SIAM International Conference on Data Mining, SDM 2013 (2013)
36. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput. Surv. 45, 43 (2013)
37. Yakoubi, Z., Kanawati, R.: LICOD: a leader-driven algorithm for community detection in complex networks. Vietnam J. Comput. Sci. 1, 241–256 (2014)
38. Yun, S.Y., Lelarge, M., Proutiere, A.: Streaming, memory limited algorithms for community detection. In: Advances in Neural Information Processing Systems (2014)
39. Zhang, X., Zhu, J., Wang, Q., Zhao, H.: Identifying influential nodes in complex networks with community structure. Knowl.-Based Syst. 42, 74–84 (2013)
40. Zhao, Z., Wang, X., Zhang, W., Zhu, Z.: A community-based approach to identifying influential spreaders. Entropy 17, 2228–2252 (2015)

Change Point Detection in a Dynamic Stochastic Blockmodel

Peter Wills and François G. Meyer

Department of Applied Mathematics, University of Colorado at Boulder, Boulder, CO 80309, USA
[email protected]

Abstract. We study a change point detection scenario for a dynamic community graph model, which is formed by adding new vertices and randomly attaching them to the existing nodes. The goal of this work is to design a test statistic to detect the merging of communities without solving the problem of identifying the communities. We propose a test that can ascertain when the connectivity between the balanced communities is changing. In addition to the theoretical analysis of the test statistic, we perform Monte Carlo simulations of the dynamic stochastic blockmodel to demonstrate that our test can detect changes in graph topology, and we study a dynamic social-contact graph.

Keywords: Change detection · Dynamic stochastic blockmodel · Graph distance

1

Introduction

Some of the most well-known empirical network datasets reflect social connective structure between individuals, often in online social network platforms such as Facebook and Twitter. These networks exhibit structural features such as communities and highly connected vertices, and can undergo significant structural changes as they evolve in time. Examples of such structural changes include the merging of communities, or the emergence of a single user as a connective hub between disparate regions of the graph. The main contribution of this work is a rigorous analysis of a dynamic community graph model, which we call the dynamic stochastic blockmodel. Models of dynamic community networks have recently been proposed. The simplest incarnation of such models, the dynamic stochastic blockmodel, is the subject of our study. This model is formed by adding new vertices, and randomly attaching them to the existing nodes. We circumvent the problem of decomposing each graph into communities, and propose instead a test that can ascertain when the connectivity between the balanced communities is changing. Because the evolution of the graph is stochastic, one expects random fluctuations of the graph topology.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 211–222, 2020. https://doi.org/10.1007/978-3-030-36687-2_18


We propose a hypothesis test to detect the abnormal growth of the balanced stochastic blockmodel. The stochastic blockmodel represents the quintessential exemplar of a network with community structure. In fact, it is shown in [28] that any sufficiently large graph behaves approximately like a stochastic blockmodel. This model is also amenable to a rigorous mathematical analysis, and is indeed at the cutting edge of rigorous probabilistic analysis of random graphs [1].

2

Graph Models

We recall the definition of the two-community stochastic blockmodel [1].

Definition 1. Let n ∈ N, and let p, q ∈ [0, 1]. We denote by SBM(n, p, q) the probability space formed by the graphs defined on the set of vertices [n], constructed as follows. We split the vertices [n] into two communities C_1 and C_2, formed by the odd and the even integers in [n] respectively. We denote by n_1 = ⌊(n + 1)/2⌋ and n_2 = ⌊n/2⌋ the sizes of C_1 and C_2 respectively. Edges within each community are drawn randomly from independent Bernoulli random variables with probability p. Edges between communities are drawn randomly from independent Bernoulli random variables with probability q.

2.1

The Dynamic Stochastic Blockmodel

Several dynamic stochastic blockmodels have been proposed in recent years (e.g., [10,11,18,22,27,29,31], and references therein). Existing dynamic stochastic block models assume that the number of nodes is fixed, and that community membership is random. Some authors propose a Markovian model for the community membership [30,31], while others assume that the graphs in the sequence are independent realizations in time [5]. Our work is more similar to that of [4], where the authors study changes in the dynamics of a preferential attachment model, the size of which grows as a function of time. Similarly, we investigate a growing stochastic block model, and we are interested in the regime of large graphs (n → ∞), where the probabilities of connection within each community, p_n, and across communities, q_n, go to zero as the size of the graph, n, goes to infinity.

In order to guarantee that at each time n we study the growth of a graph ∼ SBM(n, p_n, q_n), we cannot simply assume that the graphs G_1 = (V_1, E_1), ..., G_n = (V_n, E_n) form a sequence of nested subgraphs, where we would have V_1 ⊂ ··· ⊂ V_n and E_1 ⊂ ··· ⊂ E_n. Instead, our study focuses on the transition between a random realization (V_n, E_n) ∼ SBM(n, p_n, q_n) and the graph formed by adding a new node n + 1 and random edges to (V_n, E_n).

Formally, the dynamic stochastic blockmodel is defined recursively (see Table 1). G_1 is formed by a single vertex. We assume that we have constructed G_1, ..., G_n and we proceed with the construction of G_{n+1}. First, we replace G_n by a graph H_n ∼ SBM(n, p_n, q_n), and we consider the graph formed by adding a new node n + 1 (assigned to either C_1 or C_2 according to the parity of n), and we define

$$ V_{n+1} \triangleq V_n \cup \{n + 1\}. \tag{1} $$

Random edges are then assigned from n + 1 to each vertex in the same community with probability p_n and to each vertex of the opposite community with probability q_n. This leads to a new set of edges, E_{n+1}, and the corresponding graph (see Fig. 1),

$$ G_{n+1} \triangleq (V_{n+1}, E_{n+1}). \tag{2} $$

We note that H_n is different from G_n; indeed, G_n was created by adding a node and some edges to the graph H_{n−1} ∼ SBM(n − 1, p_{n−1}, q_{n−1}), whereas H_n is a realization of SBM(n, p_n, q_n). Table 1 summarizes the construction of the sequence G_1, G_2, ...

Fig. 1. Left: the stochastic blockmodel H_n = (V_n, E_n) is comprised of two communities C_1 (red) and C_2 (blue). A new vertex (green) is added to V_n, and random edges are created between n + 1 and vertices in V_n. This leads to a new set of edges, E_{n+1}, and the corresponding new graph G_{n+1} defined by (2).

Table 1. Row n depicts the construction of G_{n+1} as a function of the random seed graph H_n. The distance (last column) is always defined with respect to the seed graph H_n on the vertices 1, ..., n that led to the construction of G_{n+1}.

Time index n | Probabilities of connection | H_n = seed graph at time n | Growth sequence to generate G_{n+1} | Definition of the graph distance
0            |                             |                            | ∅ → G_1                             |
1            | {p_1, q_1}                  | H_1 ∼ SBM(1, p_1, q_1)     | H_1 → G_2                           | d_rp(H_1, G_2)
2            | {p_2, q_2}                  | H_2 ∼ SBM(2, p_2, q_2)     | H_2 → G_3                           | d_rp(H_2, G_3)
⋮            | ⋮                           | ⋮                          | ⋮                                   | ⋮
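The recursive construction summarized in Table 1 can be illustrated with a short simulation. The sketch below is not taken from the paper: the function names are hypothetical, vertices are labelled 1, ..., n, and community membership follows Definition 1 (odd vertices in C_1, even vertices in C_2), so the community of the new vertex n + 1 is read off its own parity.

```python
import random


def community(v):
    """Community of vertex v (1-based labels): odd -> C_1, even -> C_2."""
    return 1 if v % 2 == 1 else 2


def sample_sbm(n, p, q, rng):
    """Draw the edge set of one graph H_n ~ SBM(n, p, q) on vertices 1..n."""
    edges = set()
    for u in range(1, n + 1):
        for v in range(u + 1, n + 1):
            prob = p if community(u) == community(v) else q
            if rng.random() < prob:
                edges.add((u, v))
    return edges


def grow(n, edges, p, q, rng):
    """Build the edge set of G_{n+1}: add vertex n + 1 to H_n and connect it
    to same-community vertices w.p. p and to the other community w.p. q."""
    new = n + 1
    grown = set(edges)
    for u in range(1, n + 1):
        prob = p if community(u) == community(new) else q
        if rng.random() < prob:
            grown.add((u, new))
    return grown


rng = random.Random(0)  # fixed seed, only for reproducibility of the sketch
H = sample_sbm(8, 0.9, 0.1, rng)  # seed graph H_8
G = grow(8, H, 0.9, 0.1, rng)     # grown graph G_9
print(len(H), len(G))
```

At each time step a fresh seed graph H_n is drawn, so a sequence G_1, G_2, ... is obtained by calling sample_sbm and grow once per value of n rather than by growing the previous G_n.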


We conclude this section with the definition of the expected degrees and the number of across-community edges, k_n.

Definition 2 (Degrees and number of across-community edges). Let G ∼ SBM(n, p, q). We denote by d_{n_1} = p n_1 the expected degree within community C_1, and by d_{n_2} = p n_2 the expected degree within community C_2. We denote by k_n the binomial random variable that counts the number of cross-community edges between C_1 and C_2.

Because asymptotically n_1 ∼ n_2, we ignore the dependency of the expected degree on the specific community when computing asymptotic behaviors for large n. More precisely, we loosely write 1/d_n when either 1/d_{n_1} or 1/d_{n_2} could be used.

3

The Resistance Perturbation Distance

In order to study the dynamic evolution of the graph sequence, we focus on changes between two successive time steps n and n + 1. These changes are formulated in terms of changes in connectivity between G_{n+1} and the seed graph H_n (see Table 1). To construct the statistic that can detect the merging of communities without identifying the communities, we use the resistance perturbation distance [17]. This graph distance can be tuned to quantify configurational changes that occur on a graph at different scales: from the local scale formed by the local neighbors of each vertex, to the largest scale that quantifies the connections between clusters, or communities [17] (see [2,6] for recent surveys on graph distances, and [19] for a distance similar to the resistance perturbation distance).

The Effective Resistance. For the sake of completeness, we review the concept of effective resistance (e.g., [7–9,12]). Given a graph G = (V, E), we transform G into a resistor network by replacing each edge e by a resistor with conductance w_e (i.e., with resistance 1/w_e). The effective resistance between two vertices u and v in V is defined as the voltage applied between u and v that is required to maintain a unit current through the terminals formed by u and v. To simplify the discussion, we will only consider graphs that are connected with high probability. All the results can be extended to disconnected graphs as explained in [17].

Definition 3 (The resistance perturbation distance). Let G^(1) = (V, E^(1)) and G^(2) = (V, E^(2)) be two graphs defined on the same vertex set V. Let R^(1) and R^(2) denote the effective resistances of G^(1) and G^(2) respectively. We define the resistance-perturbation distance to be

$$ d_{rp}\left(G^{(1)}, G^{(2)}\right) \triangleq \sum_{u \in V} \sum_{v \in V,\, v \neq u} \left| R^{(1)}_{uv} - R^{(2)}_{uv} \right|. \tag{3} $$


The resistance-perturbation distance cannot be used to compare graphs defined on different vertex sets, V^(1) and V^(2). If V^(1) and V^(2) share many nodes, then we can compute the restriction of the perturbation distance on the intersection V^(1) ∩ V^(2). In the following we compare two graphs H_n and G_{n+1} that share all the nodes but one newly added node. We therefore extend the definition of the perturbation distance as follows.

Definition 4 (Extension of the resistance perturbation distance). Let H_n = (V_n, E_n) ∼ SBM(n, p_n, q_n), and let G_{n+1} ≜ (V_{n+1}, E_{n+1}), defined by (2). We define the resistance-perturbation distance between H_n and G_{n+1} as follows

$$ d_{rp}(H_n, G_{n+1}) \triangleq \sum_{u \in V_n} \sum_{v \in V_n,\, v \neq u} \left| R^{(1)}_{uv} - R^{(2)}_{uv} \right|, \tag{4} $$

where R^(1) and R^(2) denote the effective resistances in H_n and G_{n+1} respectively.

Because n + 1 did not exist at time n, it is not meaningful to compute R_{v(n+1)} for v ∈ V_n. Therefore we only compute the effective resistances for u, v ∈ V_n in (4). In the remainder of the paper, we use the notation d_rp to denote the extended resistance perturbation distance defined by (4).
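Definitions 3 and 4 can be made concrete numerically. The sketch below relies on the standard identity R_uv = L⁺_uu + L⁺_vv − 2 L⁺_uv, where L⁺ is the Moore–Penrose pseudoinverse of the graph Laplacian. It assumes connected graphs with unit conductances and vertices labelled 0, ..., n − 1; the function names are hypothetical and this is not the authors' code.

```python
import numpy as np


def effective_resistance(n, edges):
    """Matrix R with R[u, v] = effective resistance between u and v.

    Assumes a connected graph on vertices 0..n-1 with unit conductances.
    """
    L = np.zeros((n, n))
    for u, v in edges:
        L[u, u] += 1.0
        L[v, v] += 1.0
        L[u, v] -= 1.0
        L[v, u] -= 1.0
    Lp = np.linalg.pinv(L)  # pseudoinverse of the graph Laplacian
    d = np.diag(Lp)
    return d[:, None] + d[None, :] - 2.0 * Lp


def resistance_perturbation_distance(R1, R2, m=None):
    """Sum of |R1[u, v] - R2[u, v]| over ordered pairs u != v among the first
    m vertices (all vertices of R1 by default). Both diagonals are zero, so
    summing the full m-by-m block gives the same value."""
    if m is None:
        m = R1.shape[0]
    return float(np.abs(R1[:m, :m] - R2[:m, :m]).sum())


# Path 0-1-2 versus the triangle obtained by adding the edge (0, 2).
R_path = effective_resistance(3, [(0, 1), (1, 2)])
R_tri = effective_resistance(3, [(0, 1), (1, 2), (0, 2)])
print(resistance_perturbation_distance(R_path, R_tri))  # approximately 4.0
```

For the extended distance of Definition 4, the resistances of G_{n+1} are computed on all n + 1 vertices and only the block of the first n vertices is compared, which corresponds to calling the second function with m = n.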

4

Main Result

Figure 1 illustrates the statement of the problem. As a new vertex (shown in green) is added to the graph H_n, the connectivity between the communities can increase, if edges are added between C_1 and C_2, or the communities can remain separated if no across-community edges are created. If the addition of the new vertex promotes the merging of C_1 and C_2, then we consider the new graph G_{n+1} to be structurally different from G_n; otherwise G_{n+1} remains structurally the same as G_n (see Fig. 1). As explained in Theorem 1, the resistance perturbation distance between time n and n + 1 (see Table 1), measured by d_rp(H_n, G_{n+1}) (defined by (4)), is able to distinguish between connectivity changes within a community and changes across communities.

Theorem 1 (The statistic under the null and alternative hypotheses). Let H_n = (V_n, E_n) ∼ SBM(n, p_n, q_n) with p_n = ω(log n/n), and p_n/n < q_n < (p_n/n)^{3/4}. Let G_{n+1} be the graph generated according to the dynamic stochastic blockmodel described by (2) and Table 1. To test the hypothesis k_{n+1} = k_n (null hypothesis) versus k_{n+1} > k_n (alternative hypothesis) we use the statistic Z_n defined by

$$ Z_n \triangleq \frac{p_n}{4}\, E\left[ d_{rp}(H_n, G_{n+1}) \right] - 1. \tag{5} $$


The expected value of the statistic E[Z_n] is given by

$$ E[Z_n] = \begin{cases} O\left(\dfrac{1}{\sqrt{d_n}}\right), & \text{conditioned on } k_{n+1} = k_n \text{ (null)},\\[2ex] \dfrac{2 p_n}{n^2 q_n^2} + O\left(\dfrac{1}{\sqrt{d_n}}\right), & \text{conditioned on } k_{n+1} > k_n \text{ (alternative)}. \end{cases} \tag{6} $$

The theoretical analysis of the dynamic stochastic block model SBM(n, p_n, q_n), provided by Theorem 1, reveals that if one could ignore the within-community random connectivity changes, which have size O(1/√d_n), then one should always be able to detect the addition of across-community edges using the global metric provided by the test statistic Z_n. The condition q_n < (p_n/n)^{3/4} therefore guarantees that within-community connectivity changes do not obfuscate across-community connectivity changes triggered by the increase in across-community edges.

Without loss of generality, we consider that the new node n + 1 is added to C_2 (C_1 and C_2 play symmetric roles). The main result relies on the following two ingredients.

1. The community C_2 is approximately an Erdős–Rényi graph SBM(n_2, p_n), wherein the effective resistance R_uv concentrates around 2/(n_2 p_n) [23];
2. The effective resistance between u ∈ C_1 and v ∈ C_2 depends mostly on the bottleneck formed by the k_n across-community edges, R_uv ≈ 1/k_n [14,15], and the number of across-community edges, k_n, concentrates around q_n n_1 n_2.

Under the null hypothesis, about n_2 p_n nodes in C_2 will become incident to the new edges created by the addition of node n + 1. For each of these nodes, the new degree d'_u becomes d_u + 1 w.h.p., and therefore R'_uv = R_uv − 1/d_u^2, for all v ∈ C_2. By symmetry, R'_uv = R_uv − 1/d_v^2, for all u ∈ C_2, and the total perturbation for u ∈ C_2, v ∈ C_2 is ≈ 2 n_2 · n_2 p_n / d_n^2 = 2/p_n. We derive the same estimate for the perturbation R_uv − R'_uv for u ∈ C_1, v ∈ C_2 or u ∈ C_2, v ∈ C_1. We conclude that d_rp ≈ 4/p_n under the null hypothesis.

Under the alternative hypothesis k_{n+1} = k_n + 1 w.h.p., and thus ΔR_uv ≈ −1/k_n^2 ≈ −1/(n_1 n_2 q_n)^2. This perturbation affects every pair of nodes (u, v) where u ∈ C_1 and v ∈ C_2, therefore d_rp ≈ 2/(n_1 n_2 q_n^2) = 8/(n q_n)^2.
There is an additional term of order O(1/p_n^2) that accounts for the changes in effective resistance within C_2 (the community wherein node n + 1 is added). In order to estimate the noise term, O(1/√d_n), we need to construct estimates of the effective resistance that are more precise than those that can be found elsewhere (e.g., [23], but see [21] for estimates similar to ours, obtained with different techniques). The full detailed rigorous proof of Theorem 1 is provided in the supplementary material [25]; we give in the following the key steps.

Proof (Proof of Theorem 1). The proof proceeds in two steps: we first analyze the null hypothesis, and then the alternative hypothesis. Due to space limitation we only present the alternative case, k_{n+1} > k_n. We have

$$ E\left( d_{rp}(H_n, G_{n+1}) \right) = \sum_{u \in C_1} \sum_{v \in C_1} E\left( |R_{uv} - R'_{uv}| \right) + \sum_{u \in C_2} \sum_{v \in C_2} E\left( |R_{uv} - R'_{uv}| \right) + \sum_{u \in C_1} \sum_{v \in C_2} E\left( |R_{uv} - R'_{uv}| \right) + \sum_{u \in C_2} \sum_{v \in C_1} E\left( |R_{uv} - R'_{uv}| \right). \tag{7} $$

The first sum in (7) is equal to

$$ \sum_{u \in C_1} \sum_{v \in C_1} E\left( |R_{uv} - R'_{uv}| \right) = \frac{2 n_1}{d_n^2} \left( 1 + O\left( \frac{1}{\sqrt{d_n}} \right) \right). \tag{8} $$

Similarly, we have

$$ \sum_{u \in C_2} \sum_{v \in C_2} E\left( |R_{uv} - R'_{uv}| \right) = n_2 (n_2 - 1)\, 2 p_n \left( \frac{1}{d_n^2} + O\left( \frac{1}{d_n^{5/2}} \right) \right). \tag{9} $$

The estimates (8) and (9), which quantify the connectivity within both communities, are oblivious to the increase in the across-community connectivity (k_{n+1} > k_n). We need the third and fourth terms in (7), which are significantly affected by the increase in across-community edges, to detect a change in the effective resistance. Indeed, we have

$$ \sum_{u \in C_1} \sum_{v \in C_2} E\left( |R_{uv} - R'_{uv}| \right) = \frac{1}{p_n} \left( 1 + \frac{p_n}{n_1 n_2 q_n^2} + O\left( \frac{1}{\sqrt{d_n}} \right) \right). \tag{10} $$

The symmetric case where u ∈ C_2 and v ∈ C_1 leads to the same exact expression. Finally, we can assemble the expected resistance perturbation distance by combining the terms (8), (9), (10), and we obtain the advertised result.

We conclude the proof with the condition on q_n that guarantees that Z_n can detect the alternative hypothesis. As soon as q_n < (p_n/n)^{3/4}, the Z_n statistic under the alternative hypothesis is larger than the noise term O(1/√d_n). The theoretical condition on q_n will be confirmed experimentally in the next section. The proofs of (8), (9), and (10) are rather technical and are provided in the supplementary material [25].
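The assembly step mentioned in the proof can be written out explicitly. The following is only a sketch of the bookkeeping, under the approximations n_1 ≈ n_2 ≈ n/2 and d_n ≈ p_n n_2 used throughout the text; it takes the estimates (8), (9) and (10) as given and is not a substitute for the proof in the supplementary material.

```latex
\begin{align*}
\text{(8):}\quad & \frac{2 n_1}{d_n^2}
   \approx \frac{2 n_1}{p_n^2 n_2^2}
   \approx \frac{2}{p_n}\cdot\frac{1}{d_n}
   = \frac{1}{p_n}\, O\!\left(\frac{1}{\sqrt{d_n}}\right), \\
\text{(9):}\quad & n_2 (n_2 - 1)\, 2 p_n \cdot \frac{1}{d_n^2}
   \approx \frac{2 p_n n_2^2}{p_n^2 n_2^2}
   = \frac{2}{p_n}, \\
\text{(10), counted twice:}\quad &
   \frac{2}{p_n} + \frac{2}{n_1 n_2 q_n^2}
   \approx \frac{2}{p_n} + \frac{8}{n^2 q_n^2}, \\
\text{sum:}\quad & E\left[ d_{rp}(H_n, G_{n+1}) \right]
   = \frac{4}{p_n} + \frac{8}{n^2 q_n^2}
     + \frac{1}{p_n}\, O\!\left(\frac{1}{\sqrt{d_n}}\right), \\
\text{statistic:}\quad & E[Z_n]
   = \frac{p_n}{4}\, E\left[ d_{rp}(H_n, G_{n+1}) \right] - 1
   = \frac{2 p_n}{n^2 q_n^2} + O\!\left(\frac{1}{\sqrt{d_n}}\right).
\end{align*}
```

The last line matches the alternative case of (6): the leading term 4/p_n is cancelled by the normalization in (5), which leaves the across-community contribution 2 p_n/(n^2 q_n^2) on top of the noise term.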



5 Experiments

Synthetic Experiments. Figure 2 shows numerical evidence supporting Theorem 1. The experiment involves Monte Carlo simulations of the dynamic stochastic blockmodel, with 64 random realizations for each $q_n$. The empirical distribution of $Z_n$ is computed under the null hypothesis (green line) and the alternative hypothesis (red line). The theoretical estimate given by (6) under the alternative hypothesis is also displayed (blue line). The size of the graph is $n = 2{,}048$,


Fig. 2. Statistic $Z_n$ defined by (5) computed under the null hypothesis (green line) and the alternative hypothesis (red line) for several values of the inverse across-community edge density, $2p_n/(n^2 q_n^2)$. The theoretical estimate of $Z_n$ under the alternative hypothesis, given by (6), is displayed as a blue line.

the density of edges is $p_n = \log^2(n)/n$, and the across-community edge density ranges from $q_{\max} = \sqrt{2}\,(p_n/n)^{3/4}$ down to $q_{\min} = q_{\max}/100$. For each value of $q_n$, we display the statistic $Z_n$ as a function of the inverse across-community edge density, $2p_n/(n^2 q_n^2)$. As the inverse density of across-community edges increases, the statistic $Z_n$ can more easily detect the alternative hypothesis. The theoretical analysis, provided by (6), is confirmed: as $q_n$ becomes larger than $O\left((p_n/n)^{3/4}\right)$, the values of the statistic $Z_n$ computed under the null and alternative hypotheses merge. The across-community edge density $q_n$ becomes too large for the global statistic $Z_n$ to “sense” perturbations triggered by connectivity changes between the communities. The expected value $E[Z_n] = 2p_n/(nq_n)^2$ under the alternative hypothesis becomes smaller than the noise term $O(1/\sqrt{d_n})$, and the test statistic $Z_n$ fails to detect the alternative hypothesis.

Analysis of a Primary School Face-to-Face Contact Network. In this section we provide an experimental extension of Theorem 1, wherein there are 10 communities, but the number of nodes is fixed. The across-community and within-community edge densities fluctuate rapidly as a function of time. Our goal is to experimentally validate the ability of the resistance-perturbation distance to detect significant structural changes between the communities, while remaining impervious to random changes within each community. The data are part of a study [20] where RFID tags were used to record face-to-face contact between students in a primary school. Events punctuate the school day of the children, and lead to fundamental topological changes in the


Fig. 3. Left to right: snapshots of the face-to-face contact network at 9:00 a.m., 10:20 a.m., 12:45 p.m., and 2:03 p.m.

contact network (see Fig. 3). The school is composed of ten classes: each of the five grades (1 to 5) is divided into two classes. Each class forms a community of connected students; classes are weakly connected. During the school day, events such as lunch periods (12:00–1:00 p.m. and 1:00–2:00 p.m.) and recess (10:30–11:00 a.m. and 3:30–4:00 p.m.) trigger significant increases in the number of links between the communities, and disrupt the community structure (see Fig. 3).

The construction of the dynamic graphs proceeds as follows. We divide the school day into $N = 150$ time intervals of $\Delta t \approx 200$ s. We denote by $t_i = 0, \Delta t, \ldots, (N-1)\Delta t$ the corresponding temporal grid. For each $t_i$ we construct an undirected unweighted graph $G_{t_i}$, where the $n = 232$ nodes correspond to the 232 students in the 10 classes, and an edge is present between two students if they were in contact (according to the RFID tags) during the time interval $[t_{i-1}, t_i)$.
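The construction of the snapshots can be sketched as follows. This is a minimal illustration under stated assumptions: the contact records are taken to be triples `(t, i, j)` with a time stamp in seconds and two student identifiers (a hypothetical input format, not necessarily that of the data set in [20]), and snapshot `k` collects the contacts that fall in the `k`-th interval of length $\Delta t$ counted from `t_start`.

```python
def build_snapshots(contacts, t_start, delta_t, num_intervals):
    """Bin timestamped contacts into undirected, unweighted edge sets.

    contacts: iterable of (t, i, j) triples (hypothetical format).
    Returns a list of sets; snapshot k holds the edges observed in
    [t_start + k * delta_t, t_start + (k + 1) * delta_t).
    """
    snapshots = [set() for _ in range(num_intervals)]
    for t, i, j in contacts:
        k = int((t - t_start) // delta_t)
        if 0 <= k < num_intervals and i != j:
            # Store each undirected edge once, with sorted endpoints.
            snapshots[k].add((min(i, j), max(i, j)))
    return snapshots


# Toy usage: three contacts, two intervals of 200 s.
example = build_snapshots([(0, 1, 2), (150, 2, 1), (250, 3, 4)], 0, 200, 2)
print(example)
```

Repeated contacts between the same pair within one interval collapse into a single edge, which matches the unweighted graphs described above.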

Fig. 4. Primary school data set: resistance perturbation distance $d_{rp}$, edit distance $\widehat{D}_E$, and DeltaCon distance $\widehat{D}_{DC}$.


Fig. 5. Primary school data set: NetSimile $\widehat{D}_{NS}$ and edit distance $\widehat{D}_E$ (left); combinatorial Laplacian $\widehat{D}_L$, normalized Laplacian $\widehat{D}_{\mathcal{L}}$, and adjacency $\widehat{D}_A$ (right).

The purpose of the analysis is to assess whether distances can detect changes in the topology coupled with the hidden events that control the network connectivity. We also want to verify whether the distances are robust against random changes within each classroom that do not affect the communication between the classes. We compare the resistance perturbation distance $d_{rp}$ to the following distances: (1) DeltaCon distance $\widehat{D}_{DC}$ [13], (2) NetSimile distance $\widehat{D}_{NS}$ [3], (3) edit distance $\widehat{D}_E$, and (4) three spectral distances: combinatorial Laplacian $\widehat{D}_L$, normalized Laplacian $\widehat{D}_{\mathcal{L}}$, and adjacency $\widehat{D}_A$. The spectral distance between graphs $G$ and $G'$ is the $\ell_2$ norm of the difference between the two spectra, $\{\lambda_i\}$ and $\{\lambda'_i\}$, of the corresponding matrices [26]. For each distance measure $d$, we define a normalized distance contrast

$$
\widehat{D}(t_i) = d(G_{t_{i-1}}, G_{t_i})/\overline{D},
$$

where $\overline{D} = N^{-1}\sum_i d(G_{t_{i-1}}, G_{t_i})$. All experiments were conducted using the NetComp library, which can be found on GitHub at [24].

Figure 4 displays the normalized temporal differences for the resistance distance, edit distance, and DeltaCon distance. The stochastic variability in the connectivity appreciably influences the high frequency (fine scale) eigenvalues; spectral distances, which are computed using all the eigenvalues, lead to very noisy estimates of the temporal differences (see Fig. 5). NetSimile is also significantly affected by these random fluctuations. The volume of the dynamic network changes rapidly, and the edit distance can reliably monitor these large scale changes. However, it entirely misses the significant events that disrupt the graph topology: onset and end of morning recess, onset of first lunch, end of second lunch (see Fig. 4). The resistance distance can detect subtle topological changes that are coupled to latent events that dynamically modify the networks, while remaining impervious to random local changes, which do not affect the large scale connectivity structure (see Fig. 4).
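The quantities compared above can be sketched with NumPy. This is a minimal illustration and not the NetComp implementation [24]: the function names are hypothetical, the effective resistances are obtained from the pseudoinverse of the combinatorial Laplacian (which assumes that both graphs are connected and share the same node set), and the normalization divides by the mean of the available consecutive distances rather than by $N$.

```python
import numpy as np


def effective_resistance(A):
    """Matrix of effective resistances R_uv of a connected graph."""
    L = np.diag(A.sum(axis=1)) - A          # combinatorial Laplacian
    L_pinv = np.linalg.pinv(L)
    d = np.diag(L_pinv)
    return d[:, None] + d[None, :] - 2.0 * L_pinv


def resistance_perturbation_distance(A1, A2):
    """Sum over ordered node pairs of |R_uv - R'_uv|."""
    return float(np.abs(effective_resistance(A1) - effective_resistance(A2)).sum())


def adjacency_spectral_distance(A1, A2):
    """l2 norm of the difference between the sorted adjacency spectra."""
    ev1 = np.sort(np.linalg.eigvalsh(A1))
    ev2 = np.sort(np.linalg.eigvalsh(A2))
    return float(np.linalg.norm(ev1 - ev2))


def normalized_contrast(graphs, distance):
    """d(G_{i-1}, G_i) divided by the mean consecutive distance."""
    d = np.array([distance(graphs[i - 1], graphs[i]) for i in range(1, len(graphs))])
    return d / d.mean()


# Toy usage: a path on three nodes versus a triangle.
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
triangle = np.ones((3, 3)) - np.eye(3)
print(resistance_perturbation_distance(path, triangle))
print(normalized_contrast([path, triangle, path], resistance_perturbation_distance))
```

In the toy example the path has resistances 1, 1 and 2 and the triangle has resistance 2/3 between every pair, so the distance over ordered pairs is 4.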

6 Discussion

We note that the condition on $q_n$ in Theorem 1 guarantees that the communities could be recovered using other techniques (e.g., spectral clustering). Our global approach, which does not require the detection of the communities, can be computed efficiently (at a cost that is comparable to fast spectral clustering algorithms). Indeed, we have developed in [17] fast (linear in the number of edges) randomized algorithms that can quickly compute an approximation to the $d_{rp}$ distance (see [16] for the publicly available codes). In the context of streaming graphs, we described in [17] algorithms to compute fast updates of the $d_{rp}$ distance when a small number of edges are added or deleted. We are currently exploring several extensions of the current model. The scenario of the primary school dataset, wherein the graph size is fixed and a latent process controls the addition and deletion of edges, is an important extension of the current model.

Acknowledgements. F.G.M. was supported by the National Science Foundation (CCF/CIF 1815971), and by a Jean d’Alembert Fellowship.

References

1. Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theor. 62(1), 471–487 (2016)
2. Akoglu, L., Tong, H., Koutra, D.: Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov. 29(3), 626–688 (2015). https://doi.org/10.1007/s10618-014-0365-y
3. Berlingerio, M., Koutra, D., Eliassi-Rad, T., Faloutsos, C.: NetSimile: a scalable approach to size-independent network similarity. CoRR abs/1209.2684 (2012). http://dblp.uni-trier.de/db/journals/corr/corr1209.html#abs-1209-2684
4. Bhamidi, S., Jin, J., Nobel, A., et al.: Change point detection in network models: preferential attachment and long range dependence. Ann. Appl. Probab. 28(1), 35–78 (2018)
5. Bhattacharjee, M., Banerjee, M., Michailidis, G.: Change point estimation in a dynamic stochastic block model. arXiv preprint arXiv:1812.03090 (2018)
6. Donnat, C., Holmes, S., et al.: Tracking network dynamics: a survey using graph distances. Ann. Appl. Stat. 12(2), 971–1012 (2018)
7. Doyle, P., Snell, J.: Random walks and electric networks. AMC 10, 12 (1984)
8. Ellens, W., Spieksma, F., Mieghem, P.V., Jamakovic, A., Kooij, R.: Effective graph resistance. Linear Algebra Appl. 435(10), 2491–2506 (2011). http://www.sciencedirect.com/science/article/pii/S0024379511001443
9. Ghosh, A., Boyd, S., Saberi, A.: Minimizing effective resistance of a graph. SIAM Rev. 50(1), 37–66 (2008)
10. Ho, Q., Song, L., Xing, E.P.: Evolving cluster mixed-membership blockmodel for time-varying networks. J. Mach. Learn. Res. 15, 342–350 (2015)
11. Kim, B., Lee, K.H., Xue, L., Niu, X., et al.: A review of dynamic network models with latent variables. Stat. Surv. 12, 105–135 (2018)
12. Klein, D., Randić, M.: Resistance distance. J. Math. Chem. 12(1), 81–95 (1993)
13. Koutra, D., Shah, N., Vogelstein, J.T., Gallagher, B., Faloutsos, C.: DELTACON: principled massive-graph similarity function with attribution. ACM Trans. Knowl. Discov. Data (TKDD) 10(3), 28 (2016)
14. Levin, D.A., Peres, Y., Wilmer, E.L.: Markov Chains and Mixing Times. American Mathematical Soc. (2009)
15. Lyons, R., Peres, Y.: Probability on trees and networks (2005). http://mypage.iu.edu/~rdlyons/
16. Monnig, N.D.: The Resistance-Perturbation-Distance. https://github.com/natemonnig/Resistance-Perturbation-Distance (2016)
17. Monnig, N.D., Meyer, F.G.: The resistance perturbation distance: a metric for the analysis of dynamic networks. Discrete Appl. Math. 236, 347–386 (2018). http://www.sciencedirect.com/science/article/pii/S0166218X17304626
18. Peel, L., Clauset, A.: Detecting change points in the large-scale structure of evolving networks. In: AAAI, pp. 2914–2920 (2015)
19. Sricharan, K., Das, K.: Localizing anomalous changes in time-evolving graphs. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1347–1358. ACM (2014)
20. Stehlé, J., Voirin, N., Barrat, A., Cattuto, C., Isella, L., Pinton, J.F., Quaggiotto, M., Van den Broeck, W., Régis, C., Lina, B., et al.: High-resolution measurements of face-to-face contact patterns in a primary school. PLoS One 6(8), e23176 (2011)
21. Sylvester, J.A.: Random walk hitting times and effective resistance in sparsely connected Erdős–Rényi random graphs. arXiv preprint arXiv:1612.00731 (2016)
22. Tang, X., Yang, C.C.: Detecting social media hidden communities using dynamic stochastic blockmodel with temporal Dirichlet process. ACM Trans. Intell. Syst. Technol. (TIST) 5(2), 36 (2014)
23. Von Luxburg, U., Radl, A., Hein, M.: Hitting and commute times in large random neighborhood graphs. J. Mach. Learn. Res. 15(1), 1751–1798 (2014)
24. Wills, P.: The NetComp Python library (2019). https://www.github.com/peterewills/netcomp
25. Wills, P., Meyer, F.G.: Change point detection in a dynamic stochastic block model (2019). https://ecee.colorado.edu/~fmeyer/pub/WillsMeyer2019.pdf
26. Wills, P., Meyer, F.G.: Metrics for graph comparison: a practitioner’s guide. arXiv preprint arXiv:1904.07414 (2019)
27. Wilson, J.D., Stevens, N.T., Woodall, W.H.: Modeling and estimating change in temporal networks via a dynamic degree corrected stochastic block model. arXiv preprint arXiv:1605.04049 (2016)
28. Wolfe, P.J., Olhede, S.C.: Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936 (2013)
29. Xing, E.P., Fu, W., Song, L.: A state-space mixed membership blockmodel for dynamic network tomography. Ann. Appl. Stat. 4(2), 535–566 (2010)
30. Xu, K.: Stochastic block transition models for dynamic networks. In: Artificial Intelligence and Statistics, pp. 1079–1087 (2015)
31. Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R.: Detecting communities and their evolutions in dynamic social networks-a Bayesian approach. Mach. Learn. 82(2), 157–189 (2011)

A General Method for Detecting Community Structures in Complex Networks

Vesa Kuikka

Finnish Defence Research Agency, Tykkikentäntie 1, PO Box 10, 11311 Riihimäki, Finland
[email protected]

Abstract. We present a general method for detecting communities and their sub-structures in a complex network. The novelty of the method is to separate the network model and the community detection model. Network connectivity and influence spreading models are used as examples of network models. Depending on the network model, different communities and sub-structures can be found. We illustrate the results with two empirical network topologies. In these cases the strongest detected communities are very similar for the two network models. We use a community detection method that is based on searching local maxima of an influence measure describing interactions between nodes in a network.

Keywords: Complex networks · Community detection · Influence spreading model · Network connectivity · Community influence measure

1 Introduction

Methods for detecting communities in social, biological and technological networks have been studied extensively in the literature, yet no commonly accepted definition of a community exists. Different mathematical methods and algorithms have been presented for detecting communities in complex network topologies [3, 9, 12–14]. Modularity maximization and spectral graph partitioning are two examples in the wide context of community detection methods [4, 6, 7, 10, 12].

Modularity measures the strength of division of a network into modules. One definition of a community is a locally dense connected sub-graph in a network [2]. Modularity has been defined as the fraction of links falling within the given groups minus the expected fraction if links were distributed at random. In order to compute the numerical value of modularity, each link is cut into two halves, called stubs. The expected number of links is computed by rewiring each stub randomly to any other stub in the network, except itself, but allowing self-loops when a stub is rewired to another stub from the same node. Mathematically, modularity can be expressed as

$$
M = \frac{1}{2m}\sum_{vw}\left[A_{vw} - \frac{k_v k_w}{2m}\right]\frac{s_v s_w + 1}{2}. \tag{1}
$$

© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 223–237, 2020. https://doi.org/10.1007/978-3-030-36687-2_19


In Eq. (1), $v$ and $w$ are nodes in the network, $2m$ is the number of stubs in the network, $k_v$ is the degree of node $v$, $A_{vw} = 1$ means that there is a link between nodes $v$ and $w$, and $A_{vw} = 0$ means that there is no link between the two nodes. Matrix $A$ is called the adjacency matrix. The membership variable $s_v$ indicates the community to which node $v$ belongs: $s_v = 1$ if node $v$ belongs to community 1 and $s_v = -1$ if node $v$ belongs to community 2. Equation (1) holds for partitioning into two modules, but it can be generalized for partitioning into a desired number of modules. Modularity suffers from a resolution limit and is unable to detect small communities [2, 12]. In matrix terms, Eq. (1) is

$$
M = \frac{1}{4m}\, s^T B s,
$$

where $B_{vw} = A_{vw} - \frac{k_v k_w}{2m}$ is called the modularity matrix. The equation for $M$ is similar in form to an expression used in spectral partitioning of graphs for the cut size of a network in terms of the graph Laplacian. This similarity can be used for deriving a spectral algorithm for community detection. The eigenvector corresponding to the largest eigenvalue of the modularity matrix assigns nodes to communities according to the signs of the vector elements. Classical graph partitioning is the problem of dividing the nodes of a network into a given number of non-overlapping groups of given sizes such that the number of links between groups is minimized [12].

The Louvain algorithm and Infomap are two fast algorithms for community detection that have been briefly described in [2]. These algorithms have gained popularity because of their suitability for identifying communities in very large networks. Both algorithms optimize a quality function. For the Louvain algorithm the quality function is modularity, and for Infomap it is an entropy-based measure. In the Louvain algorithm modularity is optimized by local changes of the modularity measure, and communities are obtained by aggregating the modules to build larger communities. Infomap compresses the information about a random walker exploring the graph [2].

In this paper, we take an alternative approach where, instead of running community detection based on the adjacency matrix $A$ of a graph, an influence matrix $C$ is first constructed that contains information about a social influence process over the paths of a graph [8]. An element $C_{vw}$ of the social influence matrix accounts for interactions over all the paths from source node $v$ to target node $w$. In order to study local interactions in the network, a maximum path length $L_{\max}$ can be used in the algorithm. In addition, we account for the fact that communities can mean different things depending on the processes that are supposed to operate on a network.
This is demonstrated by substituting the social influence matrix with the network connectivity matrix known from classical communication theory [1]. In the community detection algorithm of this paper, a sum over rows and columns of matrix $C$ is used as the quality function. The sum includes the rows and columns that correspond to node pairs inside the community and to node pairs outside the community. This measure is different from the modularity $M$ of Eq. (1). One idea for future studies is to use the measure $M$ where the adjacency matrix $A$ is substituted with the influence matrix $C$ in a standard framework of community detection.
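The two-community modularity of Eq. (1), its matrix form, and the sign-based spectral assignment described in this section can be checked numerically. The following sketch is only an illustration with hypothetical function names; it uses a toy graph of two triangles joined by a single link and is not part of the method proposed in this paper.

```python
import numpy as np


def modularity_matrix(A):
    """B_vw = A_vw - k_v k_w / (2m)."""
    k = A.sum(axis=1)
    return A - np.outer(k, k) / k.sum()


def modularity_elementwise(A, s):
    """Eq. (1): (1 / 2m) * sum_vw B_vw * (s_v s_w + 1) / 2."""
    two_m = A.sum()
    return float((modularity_matrix(A) * (np.outer(s, s) + 1) / 2).sum() / two_m)


def modularity_matrix_form(A, s):
    """Matrix form: (1 / 4m) * s^T B s."""
    two_m = A.sum()
    return float(s @ modularity_matrix(A) @ s) / (2.0 * two_m)


def spectral_bisection(A):
    """Assign s_v = +1 or -1 from the signs of the leading eigenvector of B."""
    eigenvalues, eigenvectors = np.linalg.eigh(modularity_matrix(A))
    leading = eigenvectors[:, np.argmax(eigenvalues)]
    return np.where(leading >= 0, 1, -1)


# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the link (2, 3).
A = np.zeros((6, 6))
for v, w in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[v, w] = A[w, v] = 1.0

s = spectral_bisection(A)
print(s, modularity_elementwise(A, s), modularity_matrix_form(A, s))
```

For this toy graph the leading eigenvector separates the two triangles, and both expressions give $M = 5/14$; they agree in general because the rows of $B$ sum to zero.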


2 Community Detection Method

Typically, a node has higher influence on neighbouring nodes than on nodes that are far away in the network topology. Influence increases with the number of alternative paths between nodes. Most community detection methods calculate only the local influence among nodes in a network structure. More accurate results can be obtained when longer path lengths are included in the model and computations. In order to balance between the increasing number of alternative paths and the distance between a source node and a target node, weighting factors can be used to describe probabilities of influence via links between two nodes in the network. Weighting factors for links or nodes, or both, together with the network topology are the main input data for network models. In dynamic models, the spreading time distribution or the time dependency of node and link attribute values describes the influence spreading or the changing network structure. In static connectivity models, the probability of a functioning connection between a source node and a target node plays the role of a weighting factor.

The novelty of the community detection method of this paper is to separate the network model and the community detection model. This means that the same community detection algorithm can be used with different network models. We present two examples of network models: a network connectivity model [1] and an influence spreading model [8]. In both cases, we use the same community detection algorithm. In these two examples, interactions between nodes are described as spreading probabilities or probabilities of functioning connections.

Technically, interactions between the pairs of nodes in a network are expressed in an $N \times N$ community influence matrix $C$, where $N$ is the number of nodes in the network. The method detects any kind of structure in topological complex networks: non-overlapping, overlapping and hierarchical community structures [8]. As a special case, communities consisting of two or more distinct sub-communities that have no direct contact can be discovered. The method is based on searching local maxima of a community influence measure computed from the elements of the community influence matrix $C_{s,t}$, $s, t = 1, \ldots, N$. Our basic model has the following form for the community influence measure:

$$
P = \sum_{s,t \in V} C_{s,t} + \sum_{s,t \in \bar{V}} C_{s,t}. \tag{2}
$$

In Eq. (2), the first summation is over the pairs of nodes in a sub-set $V$ of nodes in the network, and the second summation is over the pairs of nodes in the remaining sub-set $\bar{V}$ of nodes. Cross terms are ignored in this version of the model because they describe interactions between the two sub-sets and are not directly involved in the internal cohesion of the two sub-sets. The simplest method to search for local maxima of the community influence measure is to start from a random division of the network and move one node at a time from one side to the other. If the numerical value of $P$ increases, the move is kept; otherwise, the node is returned to its original side. In either case, the search continues with the next node. This procedure is continued until


moving any of the nodes in the original network does not increase the value of the community influence measure [8]. The first cut crucially impacts the set of local maxima that can be found. More local maxima are searched by starting from a new random division of the network. This is repeated until no more local maxima are found in a reasonable number of trials. Finally, an understanding of the landscape of local maxima can be achieved. In particular, this part of the method is proposed for analysing and identifying sub-structures in larger and less well-studied networks. In this way, community detection methods are tools for understanding network data in general.

Several methods for speeding up the computing process exist. For example, instead of a random starting division, intersections and subtractions of previously found solutions or some pre-existing information about the closeness of nodes can be used. The community influence matrix $C$ and the topology of the network are possible sources of the closeness information. The current version of the computer program for maximizing Eq. (2) scales up to about 40,000 nodes on a personal computer because the matrix $C$ is kept in working memory. The optimization of memory and processing capabilities is not directly related to the community detection problem but to the optimization algorithm of the quality function. Scaling of the network influence spreading model depends more on processing power than on memory size [8]. Examples of computing times for social media networks have been provided in [8]: Facebook (4,039 nodes and 88,234 links), 4 min; Twitter (81,306 nodes and 1,768,149 links), 5 h; and Google+ (107,614 nodes and 13,673,453 links), 4 days. Efficient algorithms have been developed for computing network connectivity. Scaling of the network connectivity model has been discussed more in the literature [2].

The method assumes that the network is divided into two communities.
However, the model provides many solutions for the local maxima of Eq. (2). Communities with high rankings according to the value of $P$ in Eq. (2) are candidates for the split of the original community in real life. Note that this may not be the most probable solution of the community formation process. Later, in Sect. 4, we will present results for both of the community measures: the strength $P$ of the split into two communities in Eq. (2) and the statistical measure describing the probability of community formation. In practice, real-world social networks separate into two parts, although it is possible that at a later point in time more sub-communities appear. However, this kind of community formation is a special case of the basic model. If the original network is first divided into $A$ and $B$, and later $B$ is divided into $B_1$ and $B_2$, usually divisions $(A \cup B_1) \cup B_2$ and $(A \cup B_2) \cup B_1$ are also local maxima of Eq. (2).

To be precise, interactions between nodes in one community also include interactions mediated via paths through other communities. In the model, these paths are included whenever the source node and the target node are inside one community. We assume that the same community influence matrix is valid, that it does not depend on the communities, and that there is no need to re-calculate the matrix between iterations. If communities have self-intensifying properties or a complex dependency on other communities, the community influence matrix should be re-calculated during the optimization of Eq. (2). Later, we will also present results for a modification of Eq. (2). We define the modified community influence measure as

$$
P = \frac{N^2}{N_V^2 + N_{\bar{V}}^2}\left(\sum_{s,t \in V} C_{s,t} + \mu \sum_{s,t \in \bar{V}} C_{s,t}\right). \tag{3}
$$

The numbers of nodes in the original network and in the two factions are denoted by $N$, $N_V$, and $N_{\bar{V}}$. The factor before the summations compensates for the fact that the number of summands in Eq. (2) depends on the relative sizes of the two divisions $V$ and $\bar{V}$. The phenomenological parameter $\mu$ inside the brackets is used to model possible asymmetry between $V$ and $\bar{V}$. Stronger interactions may exist in $V$ than in $\bar{V}$ when nodes in community $V$ are connected around a common ideology and nodes in community $\bar{V}$ are left outside without a similar connecting factor. As another application, modifications of Eq. (2) can approximate the above-mentioned re-calculation of the community influence matrix. Equation (3) will be applied only to a human social network because the interpretation of a common opinion or belief may not be justified for animal social networks.

In this paper, the form of Eq. (3) is experimental because it superimposes a model on another model already captured by the influence matrix $C$. It would be better to include such a process in the definition of $C$ and avoid the use of a free parameter. This can be achieved directly by using empirical or evaluated node and link activity values (weights). If this kind of data is available, effects that are modelled with parameter $\mu$ can be included in the elements of matrix $C$. In the same fashion, social pressure against the other community could be included in matrix $C$ for describing how a community is trying to convince outsiders to join it.

In this paper, we demonstrate principles of the method and its parameters with two small well-known empirical networks. In these cases, solutions agree with the ground truth divisions and many other methods, with the same documented exceptions [11, 14]. Essential aspects of the method are that the network model and the community detection model can be separated and the combined method can be used to investigate complex network sub-structures.
One or both of the sub-models can be substituted with other models. In fact, a new idea would be to focus more on the effect of different network or influence models in a standard framework of community detection. Community detection of empirical examples with different network processes can yield new results if the process described by the influence matrix $C$ is adjusted to the respective empirical case.
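The node-moving search for local maxima described in this section can be sketched as follows. This is a minimal illustration and not the program of [8]: the function names are hypothetical, the sums are assumed to run over all ordered pairs including the diagonal, and the toy matrix $C$ is invented for the example. Passing a value for `mu` switches from Eq. (2) to the modified measure of Eq. (3).

```python
import numpy as np


def influence_measure(C, in_v, mu=None):
    """Eq. (2) if mu is None, otherwise the modified measure of Eq. (3).

    in_v is a boolean mask marking the nodes of V; the rest form V-bar.
    ASSUMPTION: sums run over all ordered pairs (s, t), diagonal included.
    """
    inside = C[np.ix_(in_v, in_v)].sum()
    outside = C[np.ix_(~in_v, ~in_v)].sum()
    if mu is None:
        return float(inside + outside)
    n, n_v = len(in_v), int(in_v.sum())
    return float(n ** 2 / (n_v ** 2 + (n - n_v) ** 2) * (inside + mu * outside))


def local_search(C, start, mu=None):
    """Move one node at a time; keep a move only if the measure increases."""
    in_v = np.array(start, dtype=bool)
    best = influence_measure(C, in_v, mu)
    improved = True
    while improved:
        improved = False
        for node in range(len(in_v)):
            in_v[node] = not in_v[node]          # tentative move
            value = influence_measure(C, in_v, mu)
            if value > best:
                best, improved = value, True     # keep the move
            else:
                in_v[node] = not in_v[node]      # return the node
    return in_v, best


# Toy influence matrix: two blocks of three strongly interacting nodes.
C = np.full((6, 6), 0.01)
C[:3, :3] = 0.9
C[3:, 3:] = 0.9
np.fill_diagonal(C, 0.0)

rng = np.random.default_rng(0)
split, value = local_search(C, rng.random(6) < 0.5)
print(split, value)
```

Repeating the search from several random starting divisions, as the text proposes, yields a collection of local maxima; which of them is reached depends on the first cut.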

3 Example Network Models

Two different network models are used to demonstrate the generality of the community detection method: the classical network connectivity model [1] and an influence spreading model proposed in [7, 8]. We briefly present the main features of the two models.

The influence spreading model is designed for describing complex social interactions in a social network structure. These interactions propagate via connections, or paths, between people. We assume that the information content may change and the


social influence is developing during the spreading process. We allow repeated attempts of social influence from one source node to target nodes via all alternative paths. This also includes loops (but no self-loops), where one node can occur several times on a path. The spreading probabilities between all pairs of nodes in a network can be calculated from the node and link weighting values between neighbouring nodes, and the temporal spreading distribution as a function of the number of links between the source node and the target node. Node and link weighting factors describe probabilities of forwarding information to neighbouring nodes in the network [8]. Node and link weighting factors along a path are taken into account by the factor $W_L$, in which the relevant weighting values are multiplied. The following equation shows how two paths from a source node to a target node, with path lengths $L_1$ and $L_2$ and a common path length of $L_3$, are combined:

$$
P_{i,\min(L_1,L_2)}(T) = W_{L_1} D_{L_1}(T) + P_{i-1,L_2}(T) - \frac{W_{L_1} D_{L_1}(T)\, P_{i-1,L_2}(T)}{W_{L_3} D_{L_3}},
$$

$$
i = 1, \ldots, N_L - 1 \quad (L_1, L_2 \le L_{\max}), \qquad P_{0,L_2} = W_{L_2} D_{L_2}(T).
$$

The quantity $P_{i-1,L_2}(T)$ is the intermediate result at step $i$ of the iteration, and $N_L$ is the number of different paths between the two nodes. The temporal distribution at time $T$ of the spreading process [9] is denoted by $D_L(T)$. The maximum path length of the calculations is $L_{\max}$. Finally, all $N_L$ paths go into the result $P_{N_L-1}(T)$ for the probability of influence propagation from source node $s$ to target node $t$. We denoted this quantity by $C_{s,t}$ in Eq. (2). A more detailed algorithm of the model has been presented in [8].

The network connectivity model is designed for describing the reliability of communication networks [1]. If the reliability values between any neighbouring pairs of nodes in the network are known, reliability values between any pairs of nodes in the network can be computed. Reliability is identified with the probability of an operational connection in a time unit. From the general reliability theory [1], the reliability of a network $V$ is

$$
r(V) = \sum_{S \in O} \prod_{e \notin S} (1 - p_e) \prod_{e \in S} p_e,
$$

where $S$ is a set of operational links with which the network is connected, and $O$ is the set of all connected states of the network. Links are denoted by $e$ and the probability of an operational link is denoted by $p_e$. If the probabilities $p_e$ are equal,

$$
r(V) = 1 - \sum_{h=1}^{N_L} \prod_{s=0}^{h} (-1)^{h-s} \binom{N_L - s}{N_L - h} H_s\, p_e^h,
$$

where $H_s$ is the sum of indicator functions when the number of broken links is $s$. The above equations are polynomials of the order of the number of links $N_L$ in the network.


In this form, the equations describe reliability of the entire network. In our case we apply the results for pairs of nodes by only taking the relevant terms in the summations. The influence spreading model and the network connectivity model have been designed for different application areas. In this study we use these models to demonstrate a community detection method with different network models. In the network connectivity model, connectivity is required in both directions between two nodes, and consequently connectivity is a symmetric property. The influence spreading model is less restrictive in this respect.
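For a small network, the connectivity between a pair of nodes can be computed directly by enumerating the link states, in the spirit of the first formula for $r(V)$. The sketch below is only an illustration with hypothetical function names: it assumes independent links and interprets the pairwise quantity as the probability that the source and the target are joined by operational links. It is not the efficient algorithm referred to in the text, since its cost grows exponentially with the number of links.

```python
from itertools import product


def is_connected(up_links, source, target):
    """Check whether target is reachable from source over operational links."""
    reached, frontier = {source}, [source]
    while frontier:
        node = frontier.pop()
        for a, b in up_links:
            if a == node and b not in reached:
                reached.add(b)
                frontier.append(b)
            elif b == node and a not in reached:
                reached.add(a)
                frontier.append(a)
    return target in reached


def pair_connectivity(p, source, target):
    """Sum the probabilities of all link states that connect the pair.

    p: dict mapping an undirected link (a, b) to its probability p_e of
    being operational (links are ASSUMED to be independent).
    """
    links = list(p)
    total = 0.0
    for state in product((False, True), repeat=len(links)):
        prob = 1.0
        for up, link in zip(state, links):
            prob *= p[link] if up else 1.0 - p[link]
        up_links = [link for up, link in zip(state, links) if up]
        if is_connected(up_links, source, target):
            total += prob
    return total


# Toy usage: a triangle in which every link is operational with probability 0.5.
triangle = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.5}
print(pair_connectivity(triangle, 0, 1))
```

In the triangle, nodes 0 and 1 are connected either directly or through node 2, which gives $1 - 0.5 \cdot 0.75 = 0.625$. The result is symmetric in the two nodes, in line with the remark that connectivity is a symmetric property.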

4 Detected Communities and Their Sub-structures

We use two empirical networks to demonstrate the method and to compare results between two network models: network connectivity and network spreading models. Zachary’s karate club and Lusseau’s dolphin networks have been used as empirical example networks in several studies in the literature [5, 11, 14].

We do not expect exactly the same results from the two network models because of their different applications, definitions and parameterizations. However, in the cases of the two example network topologies, the most important communities and their sub-communities are surprisingly close to each other. More differences appear in weaker communities and in sub-structures. The fact that the same community detection algorithm provides reasonable results for different network models suggests that the method is generally valid for community detection in various applications. On the other hand, features of the community detection method are useful because sub-structures are also uncovered. The community detection method is not limited to particular network models: directed, weighted, time-dependent, and layered network models can be used.

The network connectivity model describes connectivity between pairs of nodes in the network. This is calculated by considering all possible paths between a source node and a target node. We use the same parameter value for describing an operational link between two neighbouring nodes and utilize the second formula of $r(V)$ in Sect. 3. However, in social networks weighting factors are used for describing probabilities of social influence. Although we apply the network model originally designed for physical communication network modelling, we use low values for the parameter, as in the modelling of influence spreading.

The main focus of this paper is on the methodology, which is why simple real-world social networks, Zachary’s karate club and Lusseau’s bottlenose dolphin network, are used to demonstrate the method. In addition, we document very detailed results provided by the model to demonstrate the granularity and different aspects of the model. However, these results are not analysed in detail because such low-level empirical information is not available. Usually the model predicts the strongest communities accurately, but weaker structures are more sensitive to network models and parameter values.

4.1 Zachary’s Karate Club

Wayne W. Zachary observed 34 members of a karate club over a period of two years [14]. During the study a disagreement developed between the administrator of the club and the club’s instructor. The instructor started a new club, taking 16 members of the original club with him. Figure 1 shows the karate club social network, where lines 5 and 14 indicate the two factions after the split of the club, with the exception of node 9, who joined the other club. The instructor is node 1 and the administrator is node 34. Zachary’s karate club and Lusseau’s dolphin networks are social networks in which low link weights better describe the probability of social influence. On the other hand, with high parameter values only a single community containing all the nodes of the network is detected. This is not an interesting case in our study. Connectivity and influence spreading probabilities describe different phenomena, but they can have some common interpretation in social networks.

Fig. 1. Zachary’s karate club network with divisions indicating detected communities. Divisions correspond to lines in Tables 3 and 4 (these can be compared with Tables 1 and 2).

Next, we present results for Zachary’s karate club from the two network models. In Table 1, columns ‘A0.05’ and ‘A0.1’ are from the network connectivity model and the other five columns are from the network spreading model. Nine different solutions for communities are detected. These are lines 1–9 in Tables 1 and 2. The numerical values of the community influence measure of Eq. (2) from the two network models for the nine detected divisions are shown in the left part of Table 1. The corresponding values of the statistical community measures are shown in the middle part of the table. The statistical values are probabilities of a split into the two communities. These results are simulated by starting from random initial configurations. The


Table 1. The values of the community influence measure of Eq. (2) from the two network models for the nine detected divisions are shown in the left part of the table. The corresponding values of the statistical community measures are shown in the middle part of the table. Columns ‘A0.05’ and ‘A0.1’ show the results from the network connectivity model with connectivity probabilities p = 0.05 and p = 0.1 between neighbouring nodes. Columns ‘P0.05’ and ‘P0.1’ show the results from the influence spreading model with influence spreading probabilities wL = 0.05 and wL = 0.1. The next column ‘PT0.1’ shows the results during the spreading process at time T = 0.1 (wL = 1.0) (all the other columns show results for time approaching infinity, T → ∞). Column ‘L0.1’ shows the results with the limited path length Lmax = 2 (wL = 0.1). Column ‘VL0.05’ shows the results with the limited number of visits V = 1 on a node during the influence spreading process (wL = 0.05, Lmax = 2). The right part of the table shows aggregated data from the right part of Table 3.

Left part: community influence measure of Eq. (2)

Line  A0.05  A0.1   P0.05  P0.1   PT0.1  L0.1   VL0.05
1     10.35  -      10.62  -      19.24  23.89  9.72
2     9.18   24.90  9.42   -      17.20  21.29  8.76
3     8.69   -      8.91   -      16.23  20.04  8.25
4     8.72   23.58  8.94   -      16.33  20.12  8.32
5     8.58   -      8.79   -      16.04  19.70  8.16
6     8.03   21.46  8.22   -      15.03  18.34  7.67
7     7.94   -      8.12   -      14.84  18.10  7.56
8     -      24.78  -      -      -      -      -
9     -      -      -      29.27  -      -      -

Middle part: statistical community measure

Line  A0.05   A0.1    P0.05   P0.1   PT0.1   L0.1    VL0.05
1     10.7 %  -       10.3 %  -      10.8 %  10.7 %  11.1 %
2     13.1 %  10.2 %  13.2 %  -      14.3 %  15.0 %  14.9 %
3     4.6 %   -       5.1 %   -      3.2 %   2.0 %   3.9 %
4     2.6 %   0.4 %   2.8 %   -      5.2 %   5.3 %   4.7 %
5     1.6 %   -       1.6 %   -      1.8 %   1.7 %   2.1 %
6     0.4 %   0.1 %   0.5 %   -      0.5 %   0.4 %   0.5 %
7     0.4 %   -       0.4 %   -      0.4 %   0.3 %   0.4 %
8     -       0.1 %   -       -      -       -       -
9     -       -       -       9.0 %  -       -       -

Right part: aggregated data from Table 3

1 2 3 4 5 6 7 8 9

A0.05 11.6 % 7.8 % 2.7 % 5.5 % 0.9 % 0.5 % 0.4 % 0.0 % 7.2 %

A0.1

P0.05 P0.1 PT0.1 L0.1 VL0.05 11.3 % 11.4 % 10.7 % 11.9 % 6.0 % 3.1 % 5.0 % 2.8 % 3.1 % 3.1 % 5.0 % 5.2 % 5.2 % 6.8 % 0.6 % 15.1 % 15.7 % 16.6 % 17.0 % 0.9 % 1.0 % 1.0 % 1.2 % 1.1 % 1.1 % 1.1 % 1.2 % 0.5 % 0.6 % 0.5 % 0.6 % 0.1 % 5.3 %

first division in line 1 has the highest community measure of Eq. (2) for ‘A0.05’ in the network connectivity model and in the four influence spreading model calculations with different model parameters. Table 2 shows the nodes included in the communities. For example, the first line indicates that nodes {5, 6, 7, 11, 17} and {1, 2, 3, 4, 8, 9, 10, 12, …, 16, 18, …, 34} are members of the two detected communities. The last two columns show that the numbers of nodes in the two communities are 5 and 29.

Table 2. Nodes in communities corresponding to lines in Table 1. For example, line 1 means that nodes 5, 6, 7, 11, and 17 are members of the first community. Communities detected in runs correspond to columns in Table 1 (for example, 1010111 means that the division in line 1 is found in runs ‘A0.05’, ‘P0.05’, ‘PT0.1’, ‘L0.1’, and ‘VL0.05’). The last two columns show the number of nodes in the two factions of the network.

Line  Nodes                               Found in runs  N1  N2
1     0000111000100000100000000000000000  1010111        5   29
2     1111111100111100110101000000000000  1110111        16  18
3     0000000000000011001010110010010011  1010111        10  24
4     0000000011000011001010110011011011  1110111        14  20
5     1101000100011100010101000000000000  1010111        10  24
6     1111000100011100010101001100100100  1110111        15  19
7     0000111000100011101010110010010011  1010111        15  19
8     1101111100111000110001000000000000  0100000        13  21
9     1101111100111100110101000000000000  0001000        15  19
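The encoding used in Table 2 can be unpacked mechanically. The following minimal Python sketch assumes that the i-th character of a ‘Nodes’ string refers to node i (counted from 1) and that the seven ‘Found in runs’ flags follow the column order of Table 1. The function names are illustrative and are not part of the original method.

```python
# Column order of Table 1, assumed to match the order of the run flags.
RUNS = ["A0.05", "A0.1", "P0.05", "P0.1", "PT0.1", "L0.1", "VL0.05"]


def decode_membership(bits):
    """Split a 0/1 membership string into the two factions (1-based node ids)."""
    first = {i for i, b in enumerate(bits, start=1) if b == "1"}
    second = {i for i, b in enumerate(bits, start=1) if b == "0"}
    return first, second


def decode_runs(flags):
    """Return the names of the runs in which a division was found."""
    return [name for name, flag in zip(RUNS, flags) if flag == "1"]


# Line 1 of Table 2.
first, second = decode_membership("0000111000100000100000000000000000")
print(sorted(first), len(first), len(second))
print(decode_runs("1010111"))
```

For line 1 this yields the five-node community {5, 6, 7, 11, 17}, a complement of 29 nodes, and the five runs named in the caption of Table 2.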


Table 3. Values of the community influence measure of Eq. (3) with the parameter value of l = 0.95 (left part of the table). The corresponding values of the statistical community measures are shown in the right part of the table. The results can be compared with Table 1 where Eq. (2) is used as the quality function.

Left part: community influence measure of Eq. (3)

Line  A0.05  A0.1   P0.05  P0.1   PT0.1  L0.1   VL0.05
1     17.8   -      18.3   -      33.2   41.1   16.8
2     17.9   48.6   -      -      -      -      -
3     16.7   45.1   17.1   -      31.2   37.8   15.9
4     14.9   -      15.3   -      27.9   34.5   14.2
5     17.7   48.1   18.2   56.5   33.2   41.2   16.9
6     17.0   45.7   17.4   -      31.7   39.1   16.2
7     18.0   -      18.4   -      33.4   41.4   16.9
8     16.6   -      17.0   -      31.1   38.1   15.8
9     16.0   -      16.3   -      29.9   36.7   15.2
10    15.2   -      15.8   -      28.4   35.2   14.5
11    15.5   -      15.9   -      29.1   35.5   14.8
12    15.5   -      15.8   -      28.9   35.0   14.6
13    15.3   -      -      -      -      -      -
14    -      -      18.4   -      33.6   41.5   17.1
15    -      -      17.3   -      31.6   -      -
16    -      -      15.9   -      29.1   35.4   14.9
17    -      -      -      55.9   -      -      -

Right part: statistical community measure

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

A0.05 5.7 % 7.2 % 4.0 % 2.7 % 7.8 % 1.5 % 5.9 % 0.9 % 1.1 % 0.2 % 0.3 % 0.2 % 0.2 %

A0.1

P0.05 P0.1 5.5 %

5.3 % 0.4 % 7.3 % 4.1 % 6.0 % 3.1 % 5.0 % 0.2 % 7.8 % 0.9 % 5.8 % 0.9 % 0.6 % 1.1 % 0.1 % 0.4 % 0.1 % 0.3 %

PT0.1 L0.1 VL0.05 5.5 % 5.3 % 6.0 % 7.5 % 4.1 % 2.8 % 8.2 % 1.1 % 5.9 % 1.0 % 0.8 % 1.1 % 0.1 %

8.0 % 4.0 % 3.1 % 8.6 % 1.17 % 5.45 % 1.05 % 0.2 % 1.06 % 0.05 %

8.0 % 4.5 % 3.1 % 9.0 % 2.3 % 6.0 % 1.2 % 0.3 % 1.2 % 0.1 %

0.49 % 0.46 % 0.5 % 0.12 % 0.28 % 0.08 % 0.13 % 0.1 %

Table 4. Nodes in communities corresponding to lines in Table 3. Communities detected in runs correspond to columns in Table 3. The last two columns show the number of nodes in the two factions of the network.

Line  Nodes                               Found in runs  N1  N2
1     0000111000100000100000000000000000  1010111        5   29
2     0010000011000011001010111111111111  1100000        19  15
3     0000000011000011001010110011011011  1110111        14  20
4     1111111111111100110101001101101100  1010111        24  10
5     1111111100111100110101000000000000  1111111        16  18
6     1111111100111100110101001100100100  1110111        20  14
7     1111000111011111011111111111111111  1010111        29  5
8     0010111011100011101010111111111111  1010111        24  10
9     1111000100011100010101000000000000  1010111        11  23
10    1111000100011100010101001100100100  1010111        15  19
11    0000111011100011101010110011011011  1010111        19  15
12    1111000111011100010101001101101100  1010111        19  15
13    0000111000100011101010110010010011  1000000        15  19
14    0000000011000011001010111111111111  0010111        18  16
15    1111111101111100110101001101100100  0010100        22  12
16    1111000101011100010101001101100100  0010111        17  17
17    1101111100111000110001000000000000  0001000        13  21
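Several rows of Table 4 describe the same split with the two factions swapped, and the text following the tables adds the statistical values of such rows (5.7 % + 5.9 % = 11.6 % for lines 1 and 7). The sketch below illustrates that bookkeeping under the assumption that two rows belong together exactly when their membership strings are bitwise complements. The helper names are hypothetical, and only four rows of Table 4 and the two percentages quoted in the text are used.

```python
def complement(bits):
    """Swap the two factions of a division given as a 0/1 membership string."""
    return "".join("1" if b == "0" else "0" for b in bits)


def complementary_pairs(rows):
    """Return pairs of line numbers whose membership strings are complements."""
    lines = sorted(rows)
    return [
        (a, b)
        for idx, a in enumerate(lines)
        for b in lines[idx + 1:]
        if rows[a] == complement(rows[b])
    ]


# Four rows copied from Table 4 (line number -> membership string).
table4 = {
    1: "0000111000100000100000000000000000",
    5: "1111111100111100110101000000000000",
    7: "1111000111011111011111111111111111",
    14: "0000000011000011001010111111111111",
}
pairs = complementary_pairs(table4)

# Statistical values (in percent) quoted in the text for lines 1 and 7.
quoted = {1: 5.7, 7: 5.9}
aggregated = round(quoted[1] + quoted[7], 1)
print(pairs, aggregated)
```

With these inputs the sketch pairs line 1 with line 7 and line 5 with line 14, and the aggregated value for the 5/29 division is 11.6 %.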


Note that runs ‘A0.1’ and ‘P0.1’ have not found the first community, as can also be seen in Table 2 from the second and the fourth zero in ‘1010111’. Fewer communities are found with higher weights, and it is even possible that the strongest community is not found or that a new combination of nodes emerges. Comparing lines 2 and 9 in Table 2 reveals that the only difference is node 3 moving to the larger faction. This configuration is the only one detected with the higher influence parameter value of wL = 0.1 in the influence spreading model. Three columns show calculations from the influence spreading model with three different parameters: ‘PT0.1’ with the time of spreading T = 0.1, ‘L0.1’ with the maximum spreading path length Lmax = 2, and ‘VL0.05’ with the limited number of visits on one node during the influence spreading V = 1 (with Lmax = 2). These results agree with the basic calculations of ‘A0.05’ and ‘P0.05’. It is possible that these results are different in more complex network topologies or network configurations.

Tables 3 and 4 show the results of the modified community influence measure of Eq. (3). The division into 20 and 14 nodes in lines 3 and 6 has high probabilities of community formation for low weighting values (with the exception of ‘A0.05’), but the division into 18 and 16 nodes has high numerical values of Eq. (3) in lines 5 and 14. The division into 16 and 18 nodes is also predicted by Eq. (2) with a high community formation probability, but the division into 5 and 29 nodes has the highest numerical value of P in Eq. (2) in most cases. As a general observation, several differences in details exist between the results for the network connectivity model and the influence spreading model.

Tables 3 and 4 have more lines than Tables 1 and 2 because now {V, V̄} and {V̄, V} are two different solutions of the optimization problem in Eq. (3). For example, lines 1 and 7 in Tables 3 and 4 correspond to line 1 in Tables 1 and 2: the value of the statistical community measure for the division into 5 and 29 nodes (or 29 and 5) is 5.7% + 5.9% = 11.6%. The aggregated data from Table 3 is collected in the right part of Table 1. The numerical value of 10.7% in the middle part of Table 1 is different because it is computed from Eq. (2) instead of Eq. (3). The correct way is to compare rankings between columns in Table 1. The aggregated data illustrate that the division into 14 and 20 nodes (lines 3 and 6 in Fig. 1) is stronger than in the basic model of Eq. (2).

4.2 Lusseau’s Bottlenose Dolphin Network

A population of 62 bottlenose dolphins was observed over a period of seven years [11]. A temporary disappearance of dolphin SN100 led to the fission of the dolphin community into two factions. The dolphin social network is found to be similar to a human social network in some respects, but assortative mixing by degree is not observed within the community [5, 11]. Assortative mixing measures the bias in favour of interactions between nodes with similar characteristics.

Fig. 2. Dolphins’ social network with some results from Tables 5 and 6.

Line 1 in Fig. 2 shows the split observed in real life, with the one exception of dolphin SN89. Only a few representative divisions are shown in Fig. 2. Table 5 shows the list of different structures detected by the basic community detection method of Eq. (2). In the upper part of the table, numerical values of the community detection measure of Eq. (2) are shown. In the lower part of the table, some representative values of the statistical community detection measure are shown. Nodes included in the communities of Fig. 2 are documented in Table 6. The complete list of detected communities is presented to illustrate the model. We discuss only the main results because weaker solutions may not have any interpretation in real life. However, weaker communities may also be potential starting points for future developments in the community structure.

The division indicated by line 1 has the highest ranking in most cases according to the community detection measure of Eq. (2) and the statistical measure describing the probability of community formation. The community indicated by line 2 is documented in the literature and agrees with other research [5, 11]. Line 4 in Tables 5 and 6 has the community of 15 nodes {6, 7, 10, 14, 18, 23, 32, 33, 40, 42, 49, 55, 57, 58, 61}. This can be an indication of mediating roles of the six nodes {2, 8, 20, 26, 27, 28}. The names of these six dolphins are also documented in Fig. 2.

Table 5. Values of the community influence measure of Eq. (2) for the dolphin network of Fig. 2. Notations are the same as in Tables 1 and 3. Values of the statistical measures are shown only for the most important four lines in the lower part of the table.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

A0.05 A0.075 A0.1 A0.15 A0.17 A0.19 P0.01 P0.05 P0.1 T0.2 L0.17 VL0.01 21.13 37.99 62.43 3.25 21.57 75.02 6.16 98.93 3.23 19.10 3.02 19.42 5.73 87.84 3.01 17.81 30.93 2.84 18.12 5.40 81.37 2.83 20.79 37.37 3.20 21.24 6.08 97.20 3.19 18.19 31.69 2.89 18.50 5.48 83.34 2.88 18.16 31.57 2.89 18.45 5.48 2.88 17.97 31.23 49.59 107.91 2.86 18.28 57.90 5.44 82.12 2.86 17.72 30.83 48.92 106.34 140.66 2.82 18.02 57.12 5.36 81.22 2.81 17.53 17.82 17.72 30.80 48.92 2.82 18.02 5.36 80.93 2.81 17.53 17.82 0.00 17.70 30.71 48.75 2.82 17.99 5.36 2.81 17.54 2.78 17.87 5.28 2.77 17.76 31.03 2.82 18.06 5.36 81.23 2.81 18.00 2.85 5.40 2.84 17.38 16.00 16.59 61.63 145.07 198.34 74.24 145.44 197.78 268.01 3.18 21.02 6.03 3.17 2.80 5.32 2.79 2.82 5.35 2.81 57.16 A0.05

A0.075 A0.1 A0.15 A0.17 A0.19 P0.01 P0.05 P0.1 T0.2 L0.17 VL0.01

1 30.55 % 32.93 % 35.07 % 29.62 % 30.81 % 36.01 % 29.60 % 8.72 % 29.40 % 2 7.59 % 8.29 % 7.23 % 8.28 % 32.54 % 8.34 % 7 1.11 % 0.81 % 9.51 % 33.70 % 1.68 % 1.54 % 8.57 % 1.68 % 0.82 % 1.69 % 8 0.95 % 0.61 % 0.44 % 7.25 % 30.85 % 1.29 % 1.10 % 0.29 % 1.28 % 0.85 % 1.25 % 21 0.002 % 27.66 %

The division indicated by line 8 is shown in Fig. 2 because it has a high ranking according to the statistical measure with high weighting values in the network connectivity model but a lower ranking in the influence spreading model. This is only one example showing that the network model is important and that different network models provide different results.


Table 6. Nodes in communities corresponding to lines in Table 5. Communities detected in runs correspond to columns in Table 5. The last two columns show the number of nodes in the two factions of the network.

Line  Nodes  Found in runs  N1  N2
1            1110001110111  21  41
2            1000001100111  12  50
3            1100001100111  27  35
4            1100001100111  15  47
5            1100001100111  29  33
6            1100001100101  27  35
7            1111001110111  23  39
8            1111101110111  24  38
9            1000000100000  31  31
10           1110001100111  22  40
11           1000000100000  31  31
12           1110001100101  22  40
13           1000001100101  25  37
14           1100001100111  26  36
15           1000001000101  30  32
16           1000000000000  28  34
17           1000000000000  30  32
18           1000000000000  23  39
19           0011100010000  14  48
20           0001000000000  18  44
21           0000110000000  17  45
22           0000001100101  11  51
23           0000001000101  30  32
24           0000001000101  24  38
25           0000000010000  23  39

5 Conclusions

We propose a general community detection method for analysing social, biological and technological network structures. The main result of this study is the separation of the network model from the community detection method. Different network models can be used to provide input for the community detection algorithm. We demonstrate this with the classical network connectivity model and a recent influence spreading model. Two real-world social networks, Zachary’s karate club and Lusseau’s dolphin network, are used to illustrate the method.

Communities and sub-structures are local maxima of a quality function that we call the community influence measure. A set of local maxima of the community influence measure is searched for by repeating the procedure several times, each time starting from a random initial split of the original network. Weak interactions among network nodes produce more solutions than strongly connected networks. This is a way of obtaining an understanding of the landscape of local maxima and of identifying sub-structures in networks.

The second main result of this paper is the presentation of two different approaches for ranking the detected communities and sub-communities in the influence spreading model. This is an important question because the method provides many solutions, and the strongest communities are candidates for real-life communities and their sub-structures. The first alternative is the optimal numerical value of the community influence measure and the second alternative is the statistical quantity measuring the probability of forming a community. This latter measure favours larger communities that may have lower values of the community influence measure but a higher probability of forming a community given that the initial state is random. Interestingly, Zachary’s karate club has split into two divisions according to the highest value of the statistical probability measure. Both measures correctly predict the real-life split of the dolphin social network (except dolphin SN89) for weakly interacting connections. The network connectivity model predicts different high-ranking divisions for more strongly interacting dolphins with both community ranking measures.

We conclude that the network model and different processes on networks can have a significant impact on communities and their sub-structures. Community detection methods can provide useful tools for analysing empirical examples of different network processes. Existing community detection algorithms identify communities based on the adjacency matrix. A standard framework of community detection with an influence matrix, instead of the adjacency matrix, could be used to study the effects of different processes on networks.

References

1. Ball, M.O., Colbourn, C.J., Provan, J.S.: Network reliability. In: Handbooks in Operations Research and Management Science, vol. 7, pp. 673–762 (1995)
2. Barabási, A.-L.: Network Science. Cambridge University Press, Cambridge (2016)
3. Coscia, M., Giannotti, F., Pedreschi, D.: A classification for community discovery methods in complex networks. Stat. Anal. Data Min. 4(5), 512–546 (2011)
4. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. 659(11), 1–44 (2016)
5. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. U.S.A. 99(12), 7821–7826 (2002)
6. Karrer, B., Newman, M.E.J.: Stochastic blockmodels and community structure in networks. Phys. Rev. E 83(1), 016107 (2011)
7. Kuikka, V.: Influence spreading model used to community detection in social networks. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds.) Complex Networks & Their Applications VI. COMPLEX NETWORKS 2017. Studies in Computational Intelligence, vol. 689, pp. 202–215. Springer, Cham (2018)
8. Kuikka, V.: Influence spreading model used to analyse social networks and detect sub-communities. Comput. Soc. Netw. 5, 12 (2018). https://doi.org/10.1186/s40649-018-0060-z
9. Lancichinetti, A., Fortunato, S.: Community detection algorithms: a comparative analysis. Phys. Rev. E 80, 056117 (2009)
10. Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 11, 033015 (2009)
11. Lusseau, D., Newman, M.E.J.: Identifying the role that animals play in their social networks. Proc. R. Soc. London Ser. B 271, S477 (2004)
12. Newman, M.E.J.: Networks, An Introduction. Oxford University Press, Oxford (2010)
13. Yang, Z., Algesheimer, R., Tessone, C.J.: A comparative analysis of community detection algorithms on artificial networks. Sci. Rep. 6, 30750 (2016). https://doi.org/10.1038/srep30750
14. Zachary, W.W.: An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33, 452–473 (1977)

A New Metric for Package Cohesion Measurement Based on Complex Network

Yanran Mi, Yanxi Zhou, and Liangyu Chen

Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China
[email protected]

Abstract. With software evolution and code expansion, software structure becomes more and more complex. Refactoring can be used to improve the structural design and decrease the complexity of software. In this paper, we propose a cohesion metric that can be used for package refactoring. It considers not only intra-package and inter-package dependencies, but also inter-package backwards dependencies. Through theoretical verification and empirical verification on multiple open-source software systems, our metric is shown to measure software structure effectively.

Keywords: Software dependency network · Software refactoring · Package cohesion measurement · Software metric

1 Introduction

Software is not stationary but evolves gradually. To meet new requirements from a changing world, software functionality needs to be updated and extended with new code. The amount of code grows during software evolution, and the code structure becomes more and more complicated. Therefore, it is easy to deviate from the original design rules, which degrades software quality and comprehensibility and finally creates “technical debt” [1]. To address this problem, more researchers in software engineering are using complex network methods to analyze the structural characteristics of software. By combining complex networks and software engineering, different software systems can be investigated from a macroscopic perspective, such as the Linux kernel [2], open-source systems [3] and so on.

Faced with increasing software complexity, the software structure urgently needs to be adjusted without degrading functionality. Refactoring can improve software design and increase the maintainability and usability of software [4]. Simple refactoring by moving code manually is time consuming and has limited effect. Recently, there has been much research on cohesion metrics as guidelines for automatic refactoring.

Current cohesion metrics tend to focus on the class level. Chidamber and Kemerer defined a set of CK metrics, in which LCOM (Lack of Cohesion in Methods) is used to measure cohesion [5]. Harrison et al. proposed a set of MOOD metrics, which includes CF (Coupling Factor) [6]. This metric measures cohesion indirectly by measuring coupling: it divides the sum of dependencies between all classes by the sum of all possible dependencies between all classes. Bieman et al. proposed a method based on the number of instance variables shared by methods [7] and defined TCC (Tight Class Cohesion) and LCC (Loose Class Cohesion). Briand et al. defined a graph of cohesive relationships and a set of measurement tools [8]. Briand et al. also defined a rigorous and understandable set of criteria for judging whether a cohesion metric is reasonable [9]. Counsell et al. tested a series of C++ systems using Hamming distance and showed that the measures of cohesion and coupling are intrinsically linked [10]. Badri et al. compared metrics on a series of Java systems and found that the lower the degree of cohesion, the higher the coupling [11]. Therefore, it is worthwhile to add external dependencies to cohesion metrics.

Unlike the class cohesion measurement methods mentioned above, some class-based package cohesion measurement methods have been proposed in recent years. Misic proposed a cohesion metric and concluded that relying solely on the internal relationships of a package is not sufficient to determine cohesion [12]. Abdeen et al. presented a cohesion metric with cyclic dependencies [13]. Gupta et al. presented a cohesion metric that considers hierarchical relationships [14].

This paper presents an improved software cohesion metric based on complex networks. Compared with previous work, the new metric takes into account the overall dependencies between classes and also considers the backwards dependencies of classes. We first prove that it meets the four principles of cohesion metrics proposed by Briand [9]. Based on this metric, we provide a refactoring algorithm that adapts package-class relations for better cohesion. Finally, through experiments on multiple open-source systems, we verify the validity of our metric and the efficiency of the refactoring algorithm.

The remainder of this paper is organized as follows. In Sect. 2, we present the fundamental notations. In Sect. 3, we describe the new cohesion metric and its refactoring algorithm. In Sect. 4, we use experiments to verify the validity and efficiency of our metric. Section 5 concludes our work.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 238–249, 2020. https://doi.org/10.1007/978-3-030-36687-2_20

2 Preliminary

2.1 Basis of Attributes of Community

Definition 1. Let $G = (V, E, C)$ be a network, where $V$ denotes the set of vertices and $E$ denotes the set of edges. $C = \{C_1, C_2, \cdots, C_k\}$ is the set of communities, where each vertex belongs to exactly one community, i.e., $\forall i \neq j,\ C_i \cap C_j = \emptyset$. Let $A = (a_{ij})$ be the adjacency matrix of $G$. Note that for networks generated from software, the community information is given by the package-class structure.


Definition 2. The total number of edges of the network $G$ is

\[ M = \frac{1}{2} \sum_{i,j} a_{ij}. \tag{1} \]

Definition 3. The sum of intra-edges of all communities is

\[ Q_{real} = \frac{1}{2} \sum_{i,j} a_{ij}\, \delta(C_i, C_j), \tag{2} \]

where $C_i$ and $C_j$ are the communities that vertices $i$ and $j$ belong to, respectively. If they belong to the same community, $\delta$ is 1; otherwise $\delta$ is 0.

Definition 4. The sum of intra-edges of all communities for the null model that has the same scale as the real network is

\[ Q_{null} = \frac{1}{2} \sum_{i,j} p_{ij}\, \delta(C_i, C_j), \tag{3} \]

where $p_{ij}$ is the expected number of edges between vertices $i$ and $j$.

Definition 5. For community $k$, the sum of all intra-edges is

\[ E_{I_k} = \sum_{i,j} a_{ij}\, \delta(C_i, C_j, k), \tag{4} \]

where $C_i$ and $C_j$ are the communities that vertices $i$ and $j$ belong to. If $C_i = C_j = C_k$, $\delta$ is 1; otherwise $\delta$ is 0.

Definition 6. For community $k$, the sum of external dependencies is

\[ E_{X_k} = \sum_{i,j} a_{ij}\, \delta(C_i, C_j, k), \tag{5} \]

where $C_i$ and $C_j$ are the communities that vertices $i$ and $j$ belong to. If $C_i = C_k$ and $C_j \neq C_k$, $\delta$ is 1; otherwise $\delta$ is 0.

Definition 7. For community $k$, the sum of external backwards dependencies is

\[ E_{XB_k} = \sum_{i,j} a_{ij}\, \delta(C_i, C_j, k), \tag{6} \]

where $C_i$ and $C_j$ are the communities that vertices $i$ and $j$ belong to. If $C_i \neq C_k$ and $C_j = C_k$, $\delta$ is 1; otherwise $\delta$ is 0.
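The three per-community sums of Definitions 5 to 7 can be computed in a single pass over the adjacency matrix. The sketch below is only an illustration: it assumes that a[i][j] holds the weight of the edge from vertex i to vertex j, that community[i] is the label of the community of vertex i, and that the function name edge_sums is a hypothetical helper rather than anything defined in the paper.

```python
def edge_sums(a, community, k):
    """Return (E_I, E_X, E_XB) for community k.

    Assumed convention: a[i][j] is the weight of the edge i -> j.
    """
    e_i = e_x = e_xb = 0
    for i, row in enumerate(a):
        for j, w in enumerate(row):
            src_in = community[i] == k
            dst_in = community[j] == k
            if src_in and dst_in:
                e_i += w    # both endpoints inside community k
            elif src_in:
                e_x += w    # edge leaving community k
            elif dst_in:
                e_xb += w   # edge entering community k from outside
    return e_i, e_x, e_xb


# Toy network: vertices 0 and 1 form community 0, vertices 2 and 3 community 1.
a = [
    [0, 1, 0, 0],  # 0 -> 1 (weight 1)
    [0, 0, 1, 0],  # 1 -> 2 (weight 1)
    [0, 0, 0, 0],
    [2, 0, 0, 0],  # 3 -> 0 (weight 2)
]
community = [0, 0, 1, 1]
print(edge_sums(a, community, 0))
print(edge_sums(a, community, 1))
```

In this toy network, community 0 has one internal edge, one outgoing edge of weight 1 and one incoming edge of weight 2, so the first call gives (1, 1, 2) and the second gives (0, 2, 1).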

2.2 Class Dependency Graph

We take a software system developed in Java as an example to illustrate the construction process of a software network.

Definition 8. A class dependency graph (CDG) [15] is a directed graph $G_c = (V_c, E_c, C)$, where $V_c$ is the set of vertices (classes), $E_c$ is the set of edges, and $C$ is the set of communities. Every package is mapped to a community of the network. In the directed network, there is an edge between two classes if and only if at least one of the following dependencies holds between them:

• R1 (Inheritance and implementation): $v_j$ extends or implements $v_i$;
• R2 (Aggregation): $v_j$ is the data type of a member variable in $v_i$;
• R3 (Parameter): $v_j$ is the data type of a parameter, return value or declared exception of a member function in $v_i$;
• R4 (Signature): $v_j$ is the type of a local member variable in $v_i$;
• R5 (Invocation): $v_j$ is invoked inside a member function of $v_i$.

We assume that the weights of the above five dependencies are the same, so the dependency between two classes is the sum of all the dependencies between them. After defining the dependency rules, from the Java source code shown in Fig. 1 we can build the CDG shown in Fig. 2. Compared with existing coarse-grained software networks, our CDG, based on five dependencies, represents the software structure features more intuitively and clearly.

Fig. 1. Example of Java classes

In Fig. 1, classes A, B, and C belong to package1, classes D, E, and F belong to package2, and classes H, I, and J belong to package3. There are three dependencies between classes: D depends on A, F depends on D, and I (double) depends on E. According to the above five dependencies, we get the CDG in Fig. 2.
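The construction of the CDG for this example can be sketched in a few lines of Python. The sketch makes two assumptions that the text does not state explicitly: an edge points from the depending class to the class it depends on, and the "double" dependency of I on E is counted as two unit-weight dependencies. The names PACKAGE, DEPENDENCIES and build_cdg are illustrative only.

```python
from collections import Counter

# Package membership of the classes in Fig. 1.
PACKAGE = {
    "A": "package1", "B": "package1", "C": "package1",
    "D": "package2", "E": "package2", "F": "package2",
    "H": "package3", "I": "package3", "J": "package3",
}

# One entry per detected dependency (depending class, class depended on).
# Assumption: "I (double) depends on E" is recorded as two dependencies.
DEPENDENCIES = [("D", "A"), ("F", "D"), ("I", "E"), ("I", "E")]


def build_cdg(dependencies):
    """Sum unit-weight dependencies into weighted directed edges."""
    weights = Counter()
    for src, dst in dependencies:
        weights[(src, dst)] += 1
    return weights


cdg = build_cdg(DEPENDENCIES)
inter_package = sorted(
    edge for edge in cdg if PACKAGE[edge[0]] != PACKAGE[edge[1]]
)
print(dict(cdg))
print(inter_package)
```

Under these assumptions the edge from I to E has weight 2, the edge from F to D stays inside package2, and the two inter-package edges are D to A and I to E.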

3 Cohesion Metrics Based on Complex Network

3.1 Cohesion Metrics

High cohesion is an important goal in software design, since it has a great impact on software maintainability and reusability. However, manual evaluation of cohesion is time consuming and labor intensive. Therefore, standard cohesion metrics are needed in place of manual evaluation to support automated code refactoring.

Fig. 2. Example of a class dependency graph

Newman et al. [16] proposed a modularity metric,

\[ Q = \frac{Q_{real} - Q_{null}}{M}, \tag{7} \]

where $M$ is the number of edges in the network. Since the number of dependencies in the software may be less than the number of dependencies in the null model, this modularity does not meet the non-negativity principle proposed by Briand [9]. Abdeen [13] proposed a package cohesion metric,

\[ Q = \frac{E_I}{E_I + E_X}, \tag{8} \]

where $E_I$ represents the number of edges of internal dependencies in a package, and $E_X$ is the number of edges of external dependencies between packages. For a package, this metric considers not only the internal dependencies but also the external dependencies; however, it omits the backwards dependencies. From the perspective of software quality, excessive inter-package calls brought by backwards dependencies have a higher probability of affecting overall package reusability.

The modularity in complex networks is consistent with the software design principle of high cohesion and low coupling. However, when it is simply applied as a cohesion metric, it does not satisfy the cohesion measurement principles proposed by Briand [9]. This is because modularity in complex networks is based on a random network, whereas the corresponding reference network for software should satisfy the condition that there is no intra-package dependency. Therefore, taking the backwards dependencies into account, we define the package cohesion metric $Q_c$ for software networks,

\[ Q_c = \frac{E_I}{E_I + E_X + \alpha E_{XB}}. \tag{9} \]

When $E_I + E_X + \alpha E_{XB} = 0$, $Q_c$ is set to 0, which avoids a zero denominator. $E_I$ is the number of edges inside the community/package; $E_X$ is the sum of the weights of the edges from all classes in this package to classes in other packages; $E_{XB}$ is the sum of the weights of the edges from all classes outside this package to classes inside it. $\alpha$ is an empirical scale factor. Since the influence of backwards dependencies on a class is significantly smaller than that of its own dependencies, in this paper we tentatively let $\alpha = 0.5$ according to experience. Algorithm 1 calculates the package cohesion.

Algorithm 1. Package cohesion calculation algorithm
Input: adjacency matrix A; communities C; vertex number N; the package p1 to be calculated.
Output: the cohesion of package p1.
Set E_I, E_X, E_XB to 0
for i = 0 to N − 1 with C_i = p1 do
    for j = 0 to N − 1 do
        if C_j = p1 then
            E_I += w_ij
        else
            E_X += w_ij
    for j = 0 to N − 1 do
        if C_j ≠ p1 then
            E_XB += w_ji
Calculate Q_c by formula (9) and return the result

Let us turn to the complexity analysis. In Algorithm 1, let $N_{p_1}$ be the number of classes in $p_1$. The algorithm executes a nested loop: the outer loop runs $N_{p_1}$ times, and the two inner loops together run $2N$ times, where $N$ is the number of nodes (classes). Therefore, the complexity of Algorithm 1 is $O(N_{p_1} N)$. When Algorithm 1 is performed on all packages, the total complexity is $O((N_{p_1} + N_{p_2} + \cdots + N_{p_x}) \cdot N)$. Since $N_{p_1} + N_{p_2} + \cdots + N_{p_x} = N$, the total complexity is $O(N^2)$. It is worth noting that the total software cohesion is the average of the cohesions of all packages.

3.2 Theoretical Verification of Our Cohesion Metric

We now verify theoretically whether the proposed cohesion metric is reasonable. Briand proposed four principles for validating cohesion metrics [9]. Here, we use these four principles to establish the validity of our metric.

Proposition 1: Formula (9) satisfies the four verification principles proposed by Briand.

Proof:

(1) Non-negativity. In formula (9), $E_I$, $E_X$ and $E_{XB}$ are non-negative, so $Q_c$ is also non-negative.

(2) Maximum and minimum. $Q_c$ reaches its maximum when $E_X + \alpha E_{XB} = 0$ and $E_I > 0$, in which case $Q_c = E_I / E_I = 1$. $Q_c$ reaches its minimum when $E_I = 0$ and the denominator is positive, in which case $Q_c = 0$.

(3) Monotonicity. Suppose some intra-package edges are added. Let $E_{In}$ denote the number of intra-package edges after the addition and $Q_{cn}$ the new package cohesion (here and in the rest of the proof, $E_{XB}$ is written for $\alpha E_{XB}$ to keep the expressions short). Then

$$Q_{cn} - Q_c = \frac{(E_X + E_{XB})(E_{In} - E_I)}{(E_{In} + E_X + E_{XB})(E_I + E_X + E_{XB})}.$$

We can see that $Q_{cn} - Q_c > 0$ when $E_{In} > E_I$, so the metric satisfies monotonicity.

(4) Cohesive modules. Consider two packages $a$ and $b$ such that the classes in package $a$ have no dependencies or backwards dependencies on the classes in package $b$. The cohesions of packages $a$ and $b$ are

$$Q_{ca} = \frac{E_{Ia}}{E_{Ia} + E_{Xa} + E_{XBa}}, \qquad Q_{cb} = \frac{E_{Ib}}{E_{Ib} + E_{Xb} + E_{XBb}}.$$

We then merge $a$ and $b$ into a new package $c$. The cohesion of $c$ is

$$Q_{cc} = \frac{E_{Ia} + E_{Ib}}{E_{Ia} + E_{Xa} + E_{XBa} + E_{Ib} + E_{Xb} + E_{XBb}}.$$

We use $N_a / D_a$ to denote

$$Q_{ca} - Q_{cc} = \frac{E_{Ia}(E_{Xb} + E_{XBb}) - E_{Ib}(E_{Xa} + E_{XBa})}{(E_{Ia} + E_{Xa} + E_{XBa} + E_{Ib} + E_{Xb} + E_{XBb})(E_{Ia} + E_{Xa} + E_{XBa})},$$

and $N_b / D_b$ to denote

$$Q_{cb} - Q_{cc} = \frac{E_{Ib}(E_{Xa} + E_{XBa}) - E_{Ia}(E_{Xb} + E_{XBb})}{(E_{Ia} + E_{Xa} + E_{XBa} + E_{Ib} + E_{Xb} + E_{XBb})(E_{Ib} + E_{Xb} + E_{XBb})}.$$

Obviously, $D_a, D_b > 0$ and $N_a = -N_b$; therefore $Q_{ca} - Q_{cc} \geq 0$ or $Q_{cb} - Q_{cc} \geq 0$ holds. This means $Q_{cc} \leq \max\{Q_{ca}, Q_{cb}\}$, that is, the cohesion of the merged package does not exceed the larger of the cohesions of the two original packages.

In summary, we have proved that our cohesion metric satisfies Briand's four principles.

3.3 Refactoring Algorithm

Based on our cohesion metric, we propose a refactoring algorithm to optimize software structures. We borrow the idea of the well-known community detection algorithm CNM. The original CNM algorithm treats each node as a single community and then merges communities iteratively to increase modularity until the modularity no longer increases. However, this does not fit software networks, since software systems natively have a package structure in their code. We therefore propose a greedy algorithm that starts from the original package structure. For each class, we tentatively move it to every package whose classes have dependencies with that class, calculate the resulting average package cohesion, and pick the move with the maximum value. We repeat this process until all classes have been visited.

Algorithm 2. Refactoring algorithm based on our cohesion metric
Input: Adjacency matrix A; communities C; vertex number N.
Output: Q_start, Q_refactored and a set of refactoring suggestions.
1: Calculate the original average package cohesion Q_start according to Algorithm 1.
2: while there is an unvisited class do
3:   Select an unvisited class A, and set Q_max = 0.
4:   for each package (the traversed package) do
5:     Set D_n = the number of dependencies of A on the traversed package.
6:     Set D_o = the number of dependencies of A on the original package.
7:     Set D_nb = the corresponding backwards dependencies on the traversed package.
8:     Set D_ob = the corresponding backwards dependencies on the original package.
9:     if (D_n − D_o > threshold1 or D_nb − D_ob > threshold2 or (D_n ≥ 1 and D_nb ≥ 1)) then
10:      Calculate the average cohesion Q_o of the original package and the traversed package.
11:      Move A to the traversed package and update C.
12:      Re-calculate the average cohesion Q_r of the original package and the traversed package.
13:      if Q_r > Q_max then
14:        Q_max = Q_r, mark p_max as the traversed package.
15:      Move A back to the original package and update C.
16:   Set A as visited.
17:   if Q_max > 0 then
18:     Move A to package p_max and update C.
19:     Set every class that depends on A as unvisited.
20: Calculate the average package cohesion Q_refactored and return.

Note 1: The thresholds are chosen empirically. When the number of classes is large, the thresholds are relatively large, and vice versa. In our experiments, threshold1 is 2 and threshold2 is 3.

Algorithm 2 terminates when no class movement increases the software package cohesion. During a class movement, only the cohesions of the source and destination packages change, so we only consider the cohesion change of these two packages. The main time-consuming part of Algorithm 2 is the while-loop, from the 2nd line to the 20th line. Let $N$ be the number of classes and $N_p$ the number of packages. For one package, we use Algorithm 1 to calculate its cohesion, so the complexity of the for-loop at the 4th line is $O(N^2 N_p)$. Therefore, the total complexity of Algorithm 2 is $O(N^3 N_p)$. Since this is a typical greedy algorithm, the result may be only a local optimum. However, during the moving process the average package cohesion increases monotonically, so the correctness of the refactoring algorithm can be guaranteed.
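The greedy procedure of Algorithm 2 could be sketched in Python as below. This is a simplified illustration under assumptions, not the authors' code: all names are hypothetical, classes are processed in first-in-first-out order, each internal edge is counted once in the cohesion, and a move is accepted only when it raises the averaged cohesion of the two affected packages (the printed listing instead compares Q_r with Q_max initialised to 0). The stricter acceptance rule is chosen here so that the sketch visibly terminates.

```python
from collections import deque


def cohesion(w, pkg, p, alpha=0.5):
    """Formula (9) for package p; returns 0.0 when the denominator is 0."""
    n = len(pkg)
    e_i = e_x = e_xb = 0.0
    for i in range(n):
        if pkg[i] != p:
            continue
        for j in range(n):
            if pkg[j] == p:
                e_i += w[i][j]
            else:
                e_x += w[i][j]
                e_xb += w[j][i]
    denom = e_i + e_x + alpha * e_xb
    return e_i / denom if denom > 0 else 0.0


def refactor(w, pkg, threshold1=2, threshold2=3, alpha=0.5, eps=1e-12):
    """Greedy class moves in the spirit of Algorithm 2; returns (labels, moves)."""
    pkg = list(pkg)
    n = len(pkg)
    labels = sorted(set(pkg))
    queue, queued = deque(range(n)), set(range(n))   # the "unvisited" classes
    moves = []

    def pair_cohesion(p, q):
        # Only the source and destination packages change when a class moves.
        return (cohesion(w, pkg, p, alpha) + cohesion(w, pkg, q, alpha)) / 2

    while queue:
        a = queue.popleft()
        queued.discard(a)
        home = pkg[a]
        dep = {p: 0.0 for p in labels}    # dependencies of a on each package
        back = {p: 0.0 for p in labels}   # backwards dependencies on a
        for j in range(n):
            if j != a:
                dep[pkg[j]] += w[a][j]
                back[pkg[j]] += w[j][a]
        best_gain, best_target = eps, None
        for target in labels:
            if target == home:
                continue
            eligible = (dep[target] - dep[home] > threshold1
                        or back[target] - back[home] > threshold2
                        or (dep[target] >= 1 and back[target] >= 1))
            if not eligible:
                continue
            q_o = pair_cohesion(home, target)
            pkg[a] = target               # tentative move
            q_r = pair_cohesion(home, target)
            pkg[a] = home                 # move back
            if q_r - q_o > best_gain:
                best_gain, best_target = q_r - q_o, target
        if best_target is not None:
            pkg[a] = best_target
            moves.append((a, home, best_target))
            for j in range(n):            # revisit classes that depend on a
                if j != a and w[j][a] > 0 and j not in queued:
                    queue.append(j)
                    queued.add(j)
    return pkg, moves


# Hypothetical example: class 2 sits in "a" but only interacts with "b".
w = [[0] * 5 for _ in range(5)]
w[0][1] = w[1][0] = 1
w[2][3] = w[2][4] = w[3][2] = w[4][2] = w[3][4] = 1
new_pkg, moves = refactor(w, ["a", "a", "a", "b", "b"])
```

Because every accepted move strictly increases the sum of package cohesions over a fixed set of package labels, and the number of class-to-package assignments is finite, the loop in this sketch cannot cycle.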

4 Experiment and Analysis

4.1 Refactoring and Analysis

Our experiments were run on a computer with an Intel i5-8400 CPU, 16 GB of DDR4 memory, and Windows 10. We selected 10 open-source software systems for refactoring verification. The software statistics and cohesion results are listed in Table 1. The last two columns show that the cohesion improves in every case after refactoring by Algorithm 2.

Table 1. Refactoring results for multiple Java software systems

| Name | PN | CN | EN | NP | RCN | WFR | CB | CA |
|---|---|---|---|---|---|---|---|---|
| Ant 1.9.9 | 58 | 998 | 5169 | 2 | 671 | 51 | 0.223 | 0.251 |
| Cglib-nodep 3.2.6 | 9 | 202 | 914 | 1 | 151 | 8 | 0.274 | 0.327 |
| Emma 2.0.5313 | 11 | 143 | 574 | 0 | 143 | 17 | 0.327 | 0.386 |
| Hsqldb 2.4.0 | 21 | 553 | 4519 | 2 | 311 | 27 | 0.288 | 0.320 |
| Jaxen 1.1.6 | 16 | 204 | 947 | 0 | 204 | 18 | 0.315 | 0.447 |
| Jgroups 4.0.10 | 31 | 859 | 4454 | 2 | 460 | 60 | 0.230 | 0.292 |
| Ormlite 5.0 | 11 | 176 | 938 | 0 | 176 | 23 | 0.263 | 0.353 |
| PDF-Renderer 1.0.5 | 9 | 154 | 548 | 0 | 154 | 14 | 0.357 | 0.396 |
| RabbitMQ Client 5.0.0 | 7 | 402 | 1661 | 1 | 189 | 4 | 0.492 | 0.509 |
| Tomcat 9.0.1 | 42 | 663 | 1887 | 2 | 522 | 21 | 0.413 | 0.451 |

PN: Package number; CN: Class number; EN: Edge number; NP: Neglected packages; RCN: Rest class number; WFR: Waiting for refactoring; CB: Cohesion before; CA: Cohesion after

We also compare our metric with Newman modularity and Abdeen package cohesion. Table 2 shows the change in cohesion under the three methods. In most cases the change is positive for all three methods; only the Newman modularity of RabbitMQ is slightly reduced, by 0.001, and the Abdeen cohesion of Jgroups is reduced by 0.003. We calculate the correlation of our metric with Newman modularity and with Abdeen cohesion. The Pearson correlation coefficient between our metric and Newman modularity is 0.840, with a significance level of 0.001. The Pearson correlation coefficient between our metric and the Abdeen cohesion metric is 0.720, with a significance level of 0.009. Both correlations are strong and statistically significant, which shows that our proposed cohesion metric is reasonable. The strong correlation with Newman modularity also indicates that, in some cases, the proposed metric can replace Newman modularity in software measurement. Our metric thus builds on the work of Newman and Abdeen and is both reasonable and stable.

Table 2. The impact of refactoring on other metrics

| Name | NM (Before/After/Diff) | ACM (Before/After/Diff) | Ours (Before/After/Diff) |
|---|---|---|---|
| Ant | 0.250/0.298/+0.048 | 0.260/0.298/+0.038 | 0.223/0.251/+0.029 |
| Cglib-nodep | 0.291/0.356/+0.065 | 0.322/0.396/+0.074 | 0.270/0.327/+0.058 |
| Emma | 0.348/0.488/+0.140 | 0.489/0.493/+0.004 | 0.327/0.386/+0.058 |
| Hsqldb | 0.235/0.255/+0.020 | 0.376/0.402/+0.026 | 0.288/0.320/+0.032 |
| Jaxen | 0.303/0.505/+0.202 | 0.384/0.506/+0.122 | 0.315/0.447/+0.133 |
| Jgroups | 0.228/0.259/+0.031 | 0.399/0.397/−0.003 | 0.230/0.292/+0.061 |
| Ormlite | 0.265/0.341/+0.076 | 0.395/0.449/+0.055 | 0.262/0.353/+0.092 |
| PDF renderer | 0.292/0.312/+0.020 | 0.422/0.443/+0.021 | 0.357/0.396/+0.040 |
| RabbitMQ Client | 0.287/0.286/−0.001 | 0.744/0.756/+0.012 | 0.492/0.509/+0.018 |
| Tomcat | 0.579/0.613/+0.033 | 0.536/0.579/+0.043 | 0.414/0.451/+0.037 |

NM: Newman's modularity; ACM: Abdeen cohesion; Ours: Our cohesion
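The reported correlations can be checked with a few lines of Python. The text does not state which values were correlated; the sketch below assumes the Diff columns of Table 2, and under that assumption it gives coefficients of about 0.84 (Newman modularity) and 0.72 (Abdeen cohesion), in line with the reported values. The function name `pearson` is a hypothetical helper, and no significance test is included.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Diff columns of Table 2, in row order (Ant ... Tomcat).
nm = [0.048, 0.065, 0.140, 0.020, 0.202, 0.031, 0.076, 0.020, -0.001, 0.033]
acm = [0.038, 0.074, 0.004, 0.026, 0.122, -0.003, 0.055, 0.021, 0.012, 0.043]
ours = [0.029, 0.058, 0.058, 0.032, 0.133, 0.061, 0.092, 0.040, 0.018, 0.037]

r_nm = pearson(ours, nm)     # our metric vs. Newman modularity
r_acm = pearson(ours, acm)   # our metric vs. Abdeen cohesion
print(round(r_nm, 2), round(r_acm, 2))
```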

4.2 Randomly Disturb and Recover

In this section, we run a second experiment, consisting of a disturbing step and a recovering step, to verify the validity and efficiency of our metric. We first regard a software system as a "PERFECT" system in which classes are well assigned to packages. We then randomly disturb one package, that is, a certain proportion of the classes of that package are randomly moved into other packages. In the verification step, we run the refactoring algorithm on the disturbed software and check whether the algorithm finds the disturbed classes and places them back into the original package. The recovery rate is calculated as $P = N_{Recovered} / N_{disturbed}$, where $N_{Recovered}$ is the number of disturbed classes recovered by the algorithm and $N_{disturbed}$ is the total number of disturbed classes. In our experiments, we apply the disturb-recover process to 10 open-source Java software systems. For each system, we repeat the process 100 times. Finally, we compare our method with Pan's method [17].

Table 3. Results of disturbing and recovering

| Name | NDC | NRC | LRP(%) | URP(%) | RP(%) | TC(s) | PRP(%) | PTC(s) |
|---|---|---|---|---|---|---|---|---|
| Ant | 35 | 28 | 77.7 | 82.9 | 80.0 | 1802 | −− | −− |
| Cglib-nodep | 8 | 6 | 66.3 | 82.5 | 74.0 | 12 | 79.0 | 169 |
| Emma | 11 | 10 | 84.5 | 94.5 | 88.7 | 7 | 82.2 | 154 |
| Hsqldb | 17 | 14 | 76.5 | 84.1 | 80.9 | 268 | 84.5 | 5962 |
| Jaxen | 14 | 9 | 60.0 | 75.0 | 69.5 | 24 | 72.9 | 574 |
| Jgroups | 32 | 23 | 70.0 | 78.4 | 74.0 | 574 | −− | −− |
| Ormlite | 13 | 9 | 63.1 | 73.8 | 67.2 | 16 | 79.5 | 213 |
| PDF renderer | 11 | 9 | 73.6 | 88.2 | 80.4 | 7 | 85.1 | 115 |
| RabbitMQ Client | 14 | 12 | 77.1 | 87.1 | 81.2 | 32 | 85.3 | 650 |
| Tomcat | 30 | 26 | 82.3 | 87.3 | 84.9 | 349 | 85.0 | 5795 |

NDC: Number of disturbed classes; NRC: Number of recovered classes; LRP: Minimal percentage of recovering; URP: Maximal percentage of recovering; RP: Recovering percentage; TC: Average time consumption; PRP: Pan's recovering percentage; PTC: Pan's average time consumption; −− means no result in 2 h
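The disturbing step and the recovery rate P of this section could be sketched as follows. This is only an illustration under assumptions: the function names, the way the proportion is rounded, and the uniform choice of the destination package are hypothetical, since the text does not specify these details. The refactoring step itself is not included; any relabelling returned by a refactoring procedure can be passed to `recovery_rate`.

```python
import random


def disturb(packages, target, proportion, rng):
    """Move a proportion of the classes of `target` into other packages at random.

    Returns the disturbed labelling and the indices of the moved classes.
    Assumes `target` is non-empty and at least one other package exists.
    """
    disturbed = list(packages)
    members = [i for i, p in enumerate(packages) if p == target]
    others = sorted(set(packages) - {target})
    k = max(1, round(proportion * len(members)))
    moved = rng.sample(members, k)
    for i in moved:
        disturbed[i] = rng.choice(others)
    return disturbed, moved


def recovery_rate(original, refactored, moved):
    """P = N_Recovered / N_disturbed over the disturbed classes."""
    recovered = sum(1 for i in moved if refactored[i] == original[i])
    return recovered / len(moved)


rng = random.Random(0)                      # fixed seed for reproducibility
original = ["a"] * 10 + ["b"] * 5
disturbed, moved = disturb(original, "a", 0.3, rng)

# A perfect recovery restores the original labelling; no recovery keeps the disturbance.
p_perfect = recovery_rate(original, original, moved)
p_none = recovery_rate(original, disturbed, moved)
```

As a sanity check against Table 3, 28 recovered classes out of 35 disturbed ones for Ant correspond to P = 0.8.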


Table 3 shows that the refactoring algorithm recovers the disturbed classes with high probability: the average recovering percentage ranges from 67.2% to 88.7%. In addition, the fluctuation of the recovery rate is small, indicating that the algorithm is stable. This suggests that the proposed cohesion metric effectively measures and reflects the software structure. Our method and Pan's method perform similarly in recovering percentage, but our method takes much less time (for example, 268 s versus 5962 s on Hsqldb).

5 Conclusion

Most current research focuses on cohesion metrics at the level of classes and methods, and few researchers combine complex networks with software refactoring. In this paper, we take advantage of complex network methods and incorporate backwards dependencies into class dependency relations. We propose a new metric based on complex networks, together with a refactoring algorithm built on it. Theoretical verification and experiments on real software show that the metric and its refactoring algorithm measure the software structure correctly and effectively.

References

1. Tom, E., Aurum, A., Vidgen, R.: An exploration of technical debt. J. Syst. Softw. 86(6), 1498–1516 (2013)
2. Wang, L., Yu, P., Wang, Z., Yang, C., Ye, Q.: On the evolution of linux kernels: a complex network perspective. J. Softw. Evol. Process 25(5), 439–458 (2013)
3. Myers, C.R.: Software systems as complex networks: structure, function, and evolvability of software collaboration graphs. Phys. Rev. E 68(4), 046116 (2003)
4. Fowler, M.: Refactoring: improving the design of existing code. In: 11th European Conference. Jyväskylä, Finland (1997)
5. Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994)
6. Harrison, R., Counsell, S.J., Nithi, R.V.: An evaluation of the mood set of object-oriented software metrics. IEEE Trans. Softw. Eng. 24(6), 491–496 (1998)
7. Bieman, J.M., Kang, B.-K.: Cohesion and reuse in an object-oriented system. In: ACM SIGSOFT Software Engineering Notes, vol. 20, no. SI, pp. 259–262 (1995)
8. Briand, L.C., Morasca, S., Basili, V.R.: Defining and validating measures for object-based high-level design. IEEE Trans. Softw. Eng. 25(5), 722–743 (1999)
9. Briand, L.C., Morasca, S., Basili, V.R.: Property-based software engineering measurement. IEEE Trans. Softw. Eng. 22(1), 68–86 (1996)
10. Counsell, S., Mendes, E., Swift, S.: Comprehension of object-oriented software cohesion: the empirical quagmire. In: Proceedings 10th International Workshop on Program Comprehension, pp. 33–42. IEEE (2002)
11. Badri, L., Badri, M., Toure, F.: Exploring empirically the relationship between lack of cohesion and testability in object-oriented systems. In: International Conference on Advanced Software Engineering and Its Applications, pp. 78–92. Springer (2010)
12. Misic, V.B.: Cohesion is structural, coherence is functional: different views, different measures. In: Proceedings of Seventh International Conference on Software Metrics Symposium, METRICS 2001, pp. 135–144. IEEE (2001)
13. Abdeen, H., Ducasse, S., Sahraoui, H., Alloui, I.: Automatic package coupling and cycle minimization. In: 2009 16th Working Conference on Reverse Engineering, WCRE 2009, pp. 103–112. IEEE (2009)
14. Gupta, V., Chhabra, J.K.: Package level cohesion measurement in object-oriented software. J. Braz. Comput. Soc. 18(3), 251–266 (2012)
15. Shen, P., Chen, L.: Complex network analysis in Java application systems. J. East Chin. Normal Univ. 38–51 (2017)
16. Newman, M.E., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
17. Pan, W., Li, B., Jiang, B., Liu, K.: Recode: software package refactoring via community detection in bipartite software networks. Adv. Complex Syst. 17(07n08), 1450006 (2014)

A Generalized Framework for Detecting Social Network Communities by the Scanning Method

Tai-Chi Wang1 and Frederick Kin Hing Phoa2(B)

1 National Center for High-Performance Computing, Hsinchu, Taiwan
2 Academia Sinica, Taipei, Taiwan
[email protected]
http://phoa.stat.sinica.edu.tw:10080

Abstract. With the popularity of social media, recognizing and analyzing social network patterns have become important issues. A society offers a wide variety of possible communities, such as schools, families, firms and many others. The study and detection of these communities have been popular among business and social science researchers. Under the Poisson random graph assumption, the scan statistics have been verified as a useful tool to determine the statistical significance of both structure and attribute clusters in networks. However, the Poisson random graph assumption may not be fulfilled in all networks. In this paper, we first generalize the scan statistics by considering the individual diversity of each edge. Then we construct the random connection probability model and the logit model, and demonstrate the effectiveness of the generalized method. Simulation studies show that the generalized method achieves better detection than the existing methods.

Keywords: Network analysis · Graphical models · Empirical likelihood methods

1 Introduction

The growth of big data and the popularity of social media have driven a research migration toward the recognition and analysis of social network structures. One of the most essential features of social networks is the community structure, which is defined as groups of vertices that share common properties or play similar roles in a graph or network [6]. We can recognize part of social functioning by understanding these communities. The study of these social communities is of interest in many fields, such as e-commerce [17,23]. When the boundaries of communities are defined, we are able to classify the nodes in a network into different communities [3,8]. Therefore, methodologies for community detection have drawn much attention among researchers in different fields.

Community detection methods are generally designed by comparing the similarity within the groups and analyzing the difference between the inside and the outside of the groups. A popular method is the modularity-based method [20], which uses a modularity measure to evaluate the similarities and connections of groups. Several extended methods [9,15,19] have been developed from this criterion. However, modularity optimization suffers from a severe limitation: it fails to find communities that are smaller than a given scale, even when they are very pronounced and easy to observe [7]. Furthermore, the modularity methods do not provide tests of the statistical significance of the detected communities. This has led to rapid progress in developing statistical tests for the significance of detected communities; see [10,11,14,22,27,28] and many others for details.

These existing methods take the difference between a selected group and its unselected counterpart into account, but most of them assume that the nodes of the selected groups follow the same distribution. In practice, many studies have pointed out that networks follow a scale-free model [2] or an exponential random graph [24]. We thus generalize the scan method for community detection by considering the likelihood of all edges under different types of baseline network models.

In this study, we provide a generalized framework of the scan statistic that considers the individual difference of each edge instead of assuming that the number of edges in a subset follows a homogeneous Poisson distribution. If the probabilities of the edges can be properly specified, we can construct the likelihood of a network, and the scan statistic can then be derived by appropriate formulations. We briefly review the standard framework of the scan statistic in Sect. 2 and introduce its generalized framework in Sect. 3. Simulations are performed in Sect. 4 to verify the detection performance of the proposed method. Finally, a brief conclusion and discussion are reported at the end.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 250–261, 2020. https://doi.org/10.1007/978-3-030-36687-2_21

2 The Standard Framework of Scan Statistics

The scan statistic is a useful tool for finding cluster patterns in many kinds of data, in both the time domain [18] and the spatial domain [13]. This statistic was first applied to social networks in [27], and [28] extended the method to consider both attribute and structure clusters. The basic idea of the scan statistic is to divide the studied region/network into a selected part inside the scanning window and an unselected part outside it; this provides a systematic mechanism to detect clusters/communities.

Recall the standard framework of the scan statistic in [28]. Let $G = (V_G, E_G)$ be an undirected graph with vertex set $V_G = \{v_1, \ldots, v_{|V_G|}\}$ and edge set $E_G$, and let the degrees of the vertices be $k = \{k_1, \ldots, k_{|V_G|}\}$. Moreover, we define the total number of edges as $|E_G| = \sum_{i=1}^{|V_G|} k_i / 2$. Then, under the random graph assumption with degree vector $k$, the expected number of edges connecting the node pair $(v_i, v_j)$ is $e_{ij} = k_i k_j / (2|E_G|)$ for $i \neq j$, and $e_{ii} = k_i^2 / (4|E_G|)$.

Under the Poisson random graph [5], a scan statistic is used to evaluate the expected numbers of edges of the selected subgraph and of its unselected counterpart. Suppose a subgraph $Z$ is selected based on a scanning window. The corresponding $|E_Z|$ and expected value $\mu(Z)$ are defined as $k_{V_Z}/2$ and $k_{V_Z}^2 / (4|E_G|)$, respectively. One can apply a likelihood ratio test to evaluate the statistical significance of the selected subgraph when compared to the Poisson random graph. Thus, the likelihood ratio statistic of a selected subgraph $Z$ is

$$LR(Z) = \frac{L_Z}{L_0} = \begin{cases} \left(\dfrac{|E_Z|}{\mu(Z)}\right)^{|E_Z|} \left(\dfrac{|E_G| - |E_Z|}{\mu(G) - \mu(Z)}\right)^{|E_G| - |E_Z|} & \text{if } \hat{\alpha} > \hat{\beta}, \\ 1 & \text{otherwise,} \end{cases} \tag{1}$$

where $\hat{\alpha} = |E_Z| / \mu(Z)$ and $\hat{\beta} = (|E_G| - |E_Z|) / (\mu(G) - \mu(Z))$ are the corresponding maximum likelihood estimates (MLEs) for the selected subgraph $Z$ and the unselected counterpart $Z^c$ under the Poisson random graph model. By scanning the whole region, the test statistic is the one with the maximum logarithmic likelihood ratio, that is,

$$\lambda(\hat{Z}) = \max_Z \ln LR(Z). \tag{2}$$
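The quantities in Eqs. (1) and (2) can be illustrated with a short Python sketch. It is a hedged illustration only: the function names are hypothetical, the counts $|E_Z|$, $\mu(Z)$, $|E_G|$ and $\mu(G)$ are taken as given inputs, and the convention $0 \cdot \ln 0 = 0$ is an assumption made for the boundary case $|E_Z| = |E_G|$.

```python
from math import log


def expected_edges(k):
    """e_ij = k_i k_j / (2|E_G|) for i != j and e_ii = k_i^2 / (4|E_G|)."""
    two_m = sum(k)                      # sum of degrees equals 2|E_G|
    n = len(k)
    return [[k[i] * k[j] / two_m if i != j else k[i] ** 2 / (2 * two_m)
             for j in range(n)] for i in range(n)]


def log_lr(e_z, mu_z, e_g, mu_g):
    """ln LR(Z) from Eq. (1); requires 0 < mu_z < mu_g."""
    alpha_hat = e_z / mu_z
    beta_hat = (e_g - e_z) / (mu_g - mu_z)
    if alpha_hat <= beta_hat:
        return 0.0                      # LR(Z) = 1 in the "otherwise" branch
    value = e_z * log(alpha_hat)
    if e_g > e_z:                       # skip the term 0 * ln(0)
        value += (e_g - e_z) * log(beta_hat)
    return value


def scan_statistic(candidates, e_g, mu_g):
    """Eq. (2): maximum of ln LR(Z) over candidate windows given as (|E_Z|, mu(Z))."""
    return max(log_lr(e_z, mu_z, e_g, mu_g) for e_z, mu_z in candidates)


# Star graph with 5 nodes: node 0 has degree 4, the others degree 1.
e = expected_edges([4, 1, 1, 1, 1])

# Hypothetical windows on a graph with |E_G| = mu(G) = 20.
lam = scan_statistic([(10, 5.0), (2, 5.0)], e_g=20, mu_g=20.0)
```

For the star graph the expected values sum to the number of edges: the diagonal terms and the off-diagonal pairs (each counted once) add up to 4.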

The subgraph $\hat{Z}$ with the maximum $LR(\cdot)$ is identified as a community if the null hypothesis is rejected. Because the set of candidate subgraphs is large, the scanning method often suffers from the multiple testing problem. Monte Carlo testing is one of the best solutions to this problem [13] when developing a testing procedure. For the spatial cluster detection problem, [13] suggested generating the Monte Carlo samples for an attribute of interest by randomly permuting the observations across nodes. However, randomly reassigning the edges of a network may not be an appropriate way to carry this idea over to networks. First, because the nodes have different degrees, it is inappropriate to assume that all of them connect with equal probability. Second, a subgraph with few nodes is likely to admit only a few different combinations of edges. For example, a star graph has only one corresponding graph expression with the fixed degree sequence (Fig. 1(a)). Therefore, we adopt the method provided in [28], which constructs Monte Carlo graphs under the null hypothesis. For example, under the random graph assumption, the expected connection probability of each edge is $e_{ij} = k_i k_j / (2|E_G|)$. This idea is demonstrated in Fig. 1, where the original network is a star graph with 5 nodes (Fig. 1(a)). Only node 1 has degree 4; the other nodes have degree 1. We can then evaluate the connection probability of each edge from the degrees (Fig. 1(b)) and generate random graphs based on these connection probabilities (Fig. 1(c)).

This study focuses on determining the community effect by means of the likelihood ratio test. We suggest using the likelihood-based pseudo R-squared [16].
The pseudo R-squared, which plays a role similar to that of $R^2$ in ordinary linear regression, is used to assess the goodness of fit in logistic regression analysis or other likelihood-based models. Based on the deviance measure $D$, the likelihood-based pseudo R-squared, $R_L^2$, is expressed as

$$R_L^2 = \frac{D_{null} - D_{fitted}}{D_{null}}, \tag{3}$$

where $D_{null} = -2 \log(\text{likelihood of the null model})$ and $D_{fitted} = -2 \log(\text{likelihood of the fitted model})$. Maximizing $R_L^2$ is equivalent to minimizing $D_{fitted}$ and maximizing the likelihood of the fitted model. No matter what model we use to fit or test the communities, this criterion can be used to verify the effects of communities in social networks.

Fig. 1. Example of Monte Carlo graph: (a) star graph; (b) complete graph, with edge connection probabilities 0.5 and 0.125; (c) random graph.

The same testing algorithm is used to evaluate the newly generated data and to determine the Monte Carlo p-value. Suppose a simulation with a large number of iterations, such as 99 or 999, is executed. The Monte Carlo p-value with $R$ runs is computed as

$$p = \frac{\#\{R_L^2(MC_r) \geq R_L^2(obs)\} + 1}{R + 1}, \tag{4}$$

where $R_L^2(obs)$ is the $R_L^2$ for the observed data, and $R_L^2(MC_r)$ is the $R_L^2$ for the $r$-th Monte Carlo data set. The p-value estimates the probability of finding values at least as extreme as the observed one. If the p-value is smaller than a pre-specified criterion (e.g., 0.05), it is statistically significant to declare that there is a community; that is, the observed value is unlikely to occur under the null hypothesis.
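The Monte Carlo graph construction and the p-value of Eq. (4) could be sketched as follows. The sketch is hedged in several ways: the function names are hypothetical, probabilities are capped at 1 (an assumption for high-degree nodes that the text does not discuss), and the edge count is used as a stand-in statistic in place of $R_L^2$, whose computation depends on the fitted model.

```python
import random


def connection_probabilities(k):
    """e_ij = k_i k_j / (2|E_G|) for every node pair i < j, capped at 1."""
    two_m = sum(k)                      # sum of degrees equals 2|E_G|
    n = len(k)
    return {(i, j): min(1.0, k[i] * k[j] / two_m)
            for i in range(n) for j in range(i + 1, n)}


def sample_graph(prob, rng):
    """Draw each edge independently with its own connection probability."""
    return {pair for pair, p in prob.items() if rng.random() < p}


def monte_carlo_p_value(observed, simulated):
    """Eq. (4): (#{simulated >= observed} + 1) / (R + 1)."""
    count = sum(1 for s in simulated if s >= observed)
    return (count + 1) / (len(simulated) + 1)


# Star graph of Fig. 1: node 0 has degree 4, the other four nodes have degree 1.
prob = connection_probabilities([4, 1, 1, 1, 1])

rng = random.Random(1)                  # fixed seed for reproducibility
graphs = [sample_graph(prob, rng) for _ in range(99)]

# Placeholder statistic: number of edges (stands in for R_L^2 in this sketch).
simulated = [len(g) for g in graphs]
p_value = monte_carlo_p_value(observed=4, simulated=simulated)
```

With R = 99 runs the smallest attainable p-value is 1/100, which is why run counts such as 99 or 999 are convenient.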

3 The Generalized Likelihood-Based Framework

In [28], the methods are constructed under the Poisson random graph assumption. By assuming a homogeneous likelihood within the subgraphs, the test statistics are used to evaluate the likelihoods of different subgraphs. However, the homogeneity assumption may not always hold in practice, so we have to consider the diversity inside the subgraphs. In this study, we take the individual difference of each edge into account, which means that we assume each edge has its own likelihood/probability depending on the nature of the network. Following this idea, suppose the network of interest is $G$. The joint likelihood of all edges in the network is expressed as

$$L(G) = \prod_{(i,j) \in E_G} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}, \tag{5}$$

where $(i, j)$ is the edge of the node pair $v_i$ and $v_j$, $E_G$ is the set of edges in $G$, $p_{ij}$ is the probability that edge $(i, j)$ exists, and $y_{ij} = 1$ if the edge $(i, j)$ occurs in $G$ and 0 otherwise.


A community is defined as a group of nodes that have more connections inside the group than with nodes outside the group. To verify the existence of a community, we select a subgraph $Z$ from $G$ based on the scan statistic described in Sect. 2. The original network $G$ is divided into two parts, the selected subgraph $Z$ and its unselected counterpart $Z^c$. The joint likelihood based on this separation is

$$L(Z, Z^c) = \prod_{(i,j) \in E(Z)} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}} \prod_{(i,j) \in E(Z^c)} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}, \tag{6}$$

where $E(Z)$ and $E(Z^c)$ are the sets of edges in $Z$ and $Z^c$ respectively, and the other notations are the same as those in Eq. (5). Based on this likelihood, we determine the difference between $Z$ and $Z^c$ by specifying different forms of the probability $p_{ij}$. If the connection probability of each edge is well specified, communities can be verified in this likelihood-based manner. We now demonstrate how to apply this notion to two different network construction regimes: the random connection probability model (Sect. 3.1) and the logit model (Sect. 3.2).

3.1 Random Connection Probability Model

We introduce the connection probabilities in some random network models in this section. Consider a random graph with a given degree sequence $k = (k_1, \ldots, k_n)$ for a vector of nodes $v = (v_1, \ldots, v_n)$. We denote by $p_{ij}$ the probability of having an edge between $v_i$ and $v_j$, which is proportional to the product $k_i k_j$. Specifically, $p_{ij} = k_i k_j / (2|E_G|)$ for $i \neq j$. Since there is no unified terminology for this probability, we call it the random connection probability (RCP) here. We take the RCP as the baseline model and apply it to the generalized framework.

Suppose a subgraph $Z$ is selected and all edges are independent but governed by the subgraph they belong to. To measure the difference between the selected subgraph and its unselected counterpart, we introduce a parameter $\gamma$ for the connection probabilities within $Z$ and another parameter $\eta$ for the connection probabilities within $Z^c$. $\gamma$ and $\eta$ can be treated as the strengths of connection in $Z$ and $Z^c$, respectively. Then Eq. (6) becomes

$$L(Z, Z^c) = \prod_{(i,j) \in Z} (\gamma p_{ij})^{y_{ij}} (1 - \gamma p_{ij})^{1 - y_{ij}} \prod_{(i,j) \in Z^c} (\eta p_{ij})^{y_{ij}} (1 - \eta p_{ij})^{1 - y_{ij}}. \tag{7}$$

As mentioned for Eq. (3), maximizing $R_L^2$ is equivalent to maximizing the joint likelihood. To obtain $D_{fitted}$, we search for the maximum likelihood estimates of $\gamma$ and $\eta$. Under the independence assumption, these two parameters are estimated separately. Taking the partial derivatives of the logarithm of Eq. (7) with respect to $\gamma$ and $\eta$ gives

$$\frac{\partial \log L(Z)}{\partial \gamma} = \sum_{(i,j) \in Z} \frac{y_{ij} - \gamma p_{ij}}{\gamma (1 - \gamma p_{ij})} \quad \text{and} \quad \frac{\partial \log L(Z^c)}{\partial \eta} = \sum_{(i,j) \in Z^c} \frac{y_{ij} - \eta p_{ij}}{\eta (1 - \eta p_{ij})}. \tag{8}$$

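The maximum likelihood estimates (MLEs) $\hat{\gamma}$ and $\hat{\eta}$ are obtained by setting the derivatives in Eq. (8) to zero, so the $D_{fitted}$ with $\hat{\gamma}$ and $\hat{\eta}$ is

$$D_{fitted}(Z, Z^c) = -2 \log L(Z, Z^c), \qquad \log L(Z, Z^c) = \sum_{(i,j) \in Z} \left( y_{ij} \log \frac{\hat{\gamma} p_{ij}}{1 - \hat{\gamma} p_{ij}} + \log(1 - \hat{\gamma} p_{ij}) \right) + \sum_{(i,j) \in Z^c} \left( y_{ij} \log \frac{\hat{\eta} p_{ij}}{1 - \hat{\eta} p_{ij}} + \log(1 - \hat{\eta} p_{ij}) \right), \tag{9}$$

while the logarithm of the baseline likelihood in Eq. (5) is

$$\log L(G) = \sum_{(i,j) \in G} \left( y_{ij} \log \frac{p_{ij}}{1 - p_{ij}} + \log(1 - p_{ij}) \right).$$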
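A minimal numerical sketch of the RCP fit is given below. It is an illustration under assumptions rather than the authors' procedure: the function names are hypothetical, each node pair is represented as a tuple $(y_{ij}, p_{ij})$, the root of the score in Eq. (8) is found by bisection (the text does not say how the equations are solved), and the restriction $\hat{\gamma} \geq \hat{\eta}$ is only checked, not enforced.

```python
from math import log


def log_lik(scale, pairs):
    """Bernoulli log-likelihood with edge probabilities scale * p_ij."""
    return sum(log(scale * p) if y else log(1 - scale * p) for y, p in pairs)


def score(scale, pairs):
    """Derivative of log_lik with respect to scale, as in Eq. (8)."""
    return sum((y - scale * p) / (scale * (1 - scale * p)) for y, p in pairs)


def mle_scale(pairs, iterations=200):
    """Root of the score on (0, 1 / max p_ij); assumes at least one edge is present."""
    lo = 1e-12
    hi = (1 - 1e-12) / max(p for _, p in pairs)
    if score(hi, pairs) >= 0:           # likelihood still increasing at the boundary
        return hi
    for _ in range(iterations):         # the score is decreasing, so bisection applies
        mid = (lo + hi) / 2
        if score(mid, pairs) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pseudo_r2(pairs_z, pairs_zc):
    """R_L^2 of Eq. (3) with D_null at gamma = eta = 1 and D_fitted at the MLEs."""
    gamma, eta = mle_scale(pairs_z), mle_scale(pairs_zc)
    d_null = -2 * (log_lik(1.0, pairs_z) + log_lik(1.0, pairs_zc))
    d_fitted = -2 * (log_lik(gamma, pairs_z) + log_lik(eta, pairs_zc))
    return gamma, eta, (d_null - d_fitted) / d_null


# Hypothetical data: baseline probability 0.2 for every pair;
# 6 of 10 pairs inside Z are connected, 1 of 10 pairs outside Z.
pairs_z = [(1, 0.2)] * 6 + [(0, 0.2)] * 4
pairs_zc = [(1, 0.2)] * 1 + [(0, 0.2)] * 9
gamma_hat, eta_hat, r2 = pseudo_r2(pairs_z, pairs_zc)
```

When all pairs share the same baseline probability p, the score equation has the closed-form solution (observed edge fraction) / p, which gives 3 inside Z and 0.5 outside Z in this example and serves as a check on the bisection.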

We restrict $\hat{\gamma} \geq \hat{\eta}$ and $0 \leq \hat{p}_{ij} \leq 1$ for all $(i, j)$, where $\hat{p}_{ij} = \hat{\gamma} p_{ij}$ for $(i, j) \in Z$ and $\hat{p}_{ij} = \hat{\eta} p_{ij}$ for $(i, j) \in Z^c$, since communities are defined to have higher connection probabilities. On the other hand, $D_{null}$ is determined under the null hypothesis, that is, there is no community effect, with $p_{ij} = k_i k_j / (2|E_G|)$ and $\gamma = \eta = 1$. $D_{null}$ is then obtained directly from the likelihood in Eq. (5).

3.2 Logit Model

The exponential random graph is another useful regime for constructing a network. Under this random graph, the connection probability follows an exponential function [24], that is,

$$\Pr(Y = y) = \frac{1}{\kappa} \exp\Big\{ \sum_A \xi_A g_A(y) \Big\}, \tag{10}$$

where $A$ are the configurations of $G$, $\xi_A$ are the parameters of configuration $A$, and $g_A(y) = 1$ if the configuration is observed in the network $y$. In this study, we consider a simple case of the exponential random graph model, where the connection probability of edge $(i, j)$ is related to the configuration of nodes $i$ and $j$. It is equivalent to the logit model described in [25], and the model is expressed as

$$\mathrm{logit}(p(y_{ij} = 1)) = \alpha_i + \alpha_j, \tag{11}$$

where $\alpha_i$ and $\alpha_j$ are the effects of these two nodes. We can estimate $\alpha$ with this logit model from the observed network. Denoting the parameter estimates by $\hat{\alpha}_0$, the corresponding deviance is $D_{null} = -2 \log L(\hat{\alpha}_0)$.

When the community effect is considered, we follow the same idea as the scanning method. Suppose a subgraph $Z$ is selected. We add a community effect to Eq. (11), which gives

$$\mathrm{logit}(p(y_{ij} = 1 \mid Z, Z^c)) = \alpha_i + \alpha_j + \delta \times I_{ij}, \tag{12}$$

where $I_{ij} = 1$ when $(i, j) \in Z$ and 0 otherwise, and $\delta$ is called the community effect. We can also estimate $\delta$ from the observed network and the selected subgraph. Denoting the parameter estimates by $\{\hat{\alpha}_z, \hat{\delta}\}$, the corresponding deviance is $D_{fitted} = -2 \log L(\hat{\alpha}_z, \hat{\delta})$.

In general, we suggest estimating the logit model with a sparse matrix; the computing time for estimation is approximately 5 times shorter than with the indicator functions.
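The deviances of the logit model in Eqs. (11) and (12) could be evaluated as in the following sketch. It does not fit the parameters; it only computes the deviance for given values of the node effects and the community effect. The names are hypothetical, and treating $I_{ij} = 1$ exactly when both nodes belong to $Z$ is an assumption about how "$(i, j) \in Z$" is meant.

```python
from math import exp, log


def edge_probability(alpha, i, j, delta=0.0, in_z=False):
    """Inverse logit of alpha_i + alpha_j + delta * I_ij, Eqs. (11) and (12)."""
    linear = alpha[i] + alpha[j] + (delta if in_z else 0.0)
    return 1.0 / (1.0 + exp(-linear))


def deviance(edges, alpha, delta=0.0, z=frozenset()):
    """-2 log-likelihood over all node pairs i < j.

    edges -- set of observed edges as pairs (i, j) with i < j
    alpha -- list of node effects
    z     -- node set of the selected subgraph Z
    """
    n = len(alpha)
    log_lik = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_probability(alpha, i, j, delta, i in z and j in z)
            log_lik += log(p) if (i, j) in edges else log(1.0 - p)
    return -2.0 * log_lik


# Hypothetical network with 4 nodes where nodes 0, 1, 2 form a triangle.
edges = {(0, 1), (0, 2), (1, 2)}
alpha = [0.0, 0.0, 0.0, 0.0]            # illustrative values, not estimates

d_null = deviance(edges, alpha)                              # no community effect
d_fitted = deviance(edges, alpha, delta=2.0, z={0, 1, 2})    # with community effect
r2 = (d_null - d_fitted) / d_null                            # pseudo R-squared, Eq. (3)
```

In practice the parameters would be estimated by logistic regression before the deviances are compared; the fixed values above only show how a positive community effect lowers the deviance when Z is densely connected.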

4 Simulation Study and Comparison

Using simulated data sets, we compare our proposed models with the original Poisson model in terms of type I error, detection power, and detection accuracy. To verify the improvement in detection accuracy, the proposed methods are compared with the scan statistic and the traditional Poisson model.

4.1 Type I Error

In the simulation for checking the type I error, we fix the number of nodes in the synthetic networks at 100. The edges among these nodes follow a Bernoulli distribution with four different connection probabilities $p_0 = 1/5$, $1/10$, $1/15$, and $1/20$, so the expected degree of each node is approximately 20, 10, 6.7, and 5, respectively. A thousand runs are executed in each simulation case, and the significance level is 0.05. Table 1 shows that the type I error is close to the nominal level in most cases, the main exception being $p_0 = 1/20$. In addition, the results differ between the RCP model and the logit model. The type I error of the RCP model is more consistent (around 0.06), though the values are slightly higher than the nominal significance level (0.05). The type I error of the logit model varies more widely, especially in the case of $p_0 = 1/20$. Thus, if the type I error has to be minimized in a community detection task, we suggest using the RCP model for its consistency.

Table 1. Type I error

| Connection probability | 1/5 | 1/10 | 1/15 | 1/20 |
|---|---|---|---|---|
| RCP model | 0.059 | 0.054 | 0.058 | 0.068 |
| Logit model | 0.050 | 0.033 | 0.095 | 0.129 |

4.2 Testing Power

In the following simulation, we use a setting similar to that of the previous subsection. According to the results reported in [28], the community size and the connection probability of the community play important roles in community detection. Therefore, we construct simulation cases in which the studied network contains $K$ community nodes and $100 - K$ usual nodes. The connection probability of the usual nodes is 1/20, and that of the community nodes is set to 1/4, 1/2, 3/4, and 1. For each combination of community size and connection probability, 100 simulations are executed.

Table 2 shows the testing powers for different community sizes and connection probabilities. When the community size gets larger and/or the connection probability gets higher, the power gets higher. However, the detection power is not good enough when the community size is small. Even when the connection probability is one, that is, the community is a clique of size 5, the testing powers are only 0.68 and 0.42 for the RCP model and the logit model, respectively. On the other hand, the testing power is good when the community size is large. Taking a community size of 20 as an example, the detection power of the logit model is already perfect when the connection probability is only 1/2. On balance, the logit model performs better than the RCP model. A possible explanation is that the logit model captures more individual differences among nodes, while the RCP model is restricted by the random assumption: the baseline probability is already higher when the community size is large, which leaves little room to find a significant effect in the estimation of $\gamma$ in Eq. (9).

Table 2. Testing power

| Community size | RCP: 5 | RCP: 10 | RCP: 15 | RCP: 20 | Logit: 5 | Logit: 10 | Logit: 15 | Logit: 20 |
|---|---|---|---|---|---|---|---|---|
| pc = 1/4 | 0.10 | 0.04 | 0.16 | 0.08 | 0.18 | 0.09 | 0.25 | 0.29 |
| pc = 1/2 | 0.10 | 0.28 | 0.82 | 0.88 | 0.13 | 0.43 | 0.88 | 1.00 |
| pc = 3/4 | 0.21 | 0.94 | 1.00 | 1.00 | 0.19 | 0.95 | 1.00 | 1.00 |
| pc = 1 | 0.68 | 1.00 | 1.00 | 1.00 | 0.42 | 1.00 | 1.00 | 1.00 |

4.3 Comparisons of Detection Accuracy

In this section, the accuracy of both the RCP and logit models is checked, and they are compared with the traditional Poisson model. Since the accuracy of community membership is considered, some clearly defined accuracy measures other than the testing power are also used for evaluation. We first define the terms true positive, false positive, true negative, and false negative. True positive (TP) cells represent the true community nodes that are correctly detected as community members; false positive (FP) cells represent the usual nodes that are incorrectly detected as community members; true negative (TN) cells represent the usual nodes that are not identified as community members; false negative (FN) cells represent the true community nodes that are not identified as community members. In addition, since only one ground-truth community is set in the simulation study, we use some common criteria to evaluate our detection results. To check whether the method can identify community nodes, the recall (r), defined as TP/(TP+FN), is used to measure the proportion of identified community nodes among all true community nodes. The precision (p), defined as TP/(TP+FP), is used to measure the proportion of true community nodes among the identified community nodes. The F1 score and the Jaccard similarity are two common criteria for evaluating community detection and are defined respectively as

$$F_1 = 2\,\frac{pr}{p+r}, \quad \text{and} \quad J(C^*, C) = \frac{|C^* \cap C|}{|C^* \cup C|},$$

where $C^*$ is the detected community and $C$ is the ground-truth community. In addition, we adopt the modularity proposed by [19] to measure similarity among the detection groups. The modularity is defined as

$$Q = \frac{1}{2|E_G|} \sum_{ij} \left( e_{ij} - \frac{k_i k_j}{2|E_G|} \right) s_i s_j = \frac{1}{2|E_G|}\, s^T B s,$$

where $e_{ij}$ is 1 if $v_i$ and $v_j$ are adjacent and 0 otherwise, and $s$ is a ±1 vector in which 1 indicates that an element belongs to the target group and −1 indicates that it does not.
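As a minimal sketch of the criteria defined above (not the authors' code), the following Python function computes the precision, recall, F1 score and Jaccard similarity from a detected node set and a ground-truth node set. The function name `detection_scores` and the example node sets are illustrative assumptions.

```python
def detection_scores(detected, truth):
    """Precision, recall, F1 and Jaccard for one detected community."""
    detected, truth = set(detected), set(truth)
    tp = len(detected & truth)   # community nodes correctly detected
    fp = len(detected - truth)   # usual nodes wrongly detected
    fn = len(truth - detected)   # community nodes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    union = len(detected | truth)
    jaccard = tp / union if union else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "jaccard": jaccard}


# Hypothetical example: a true community of 20 nodes and a detected
# set of 20 nodes that overlaps it in 15 nodes.
truth = range(0, 20)
detected = range(5, 25)
print(detection_scores(detected, truth))
```

Because only one ground-truth community is planted in each simulated network, comparing two plain node sets is sufficient here; a setting with several communities would first need a matching between detected and true communities.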

Fig. 2. Comparisons among three models (Poisson, RCP, and logit) for community size S = 20. Panels: (a) Precision, (b) Recall, (c) F1-score, (d) Jaccard, (e) Modularity, each plotted against the connection probability.

We only consider the most significant community, so the testing power is excluded here. For a concise comparison, we only show the results for the community of size 20; please refer to the Appendix for the detailed values of the other cases. The comparison measures are shown in Fig. 2. All measures show similar patterns, and the improvement of our proposed methods is clearly observable. In general, the detection results can be separated into two regimes: small connection probabilities (0.25 and 0.5) and large connection probabilities (0.75 and 1). For large connection probabilities, little difference exists among the three models, and the logit model is the best. When the connection probability is small, the generalized methods are significantly better than the Poisson model, with the logit model being about twice as accurate as the Poisson model.

5 Discussion and Conclusion

In this study, we propose a generalized framework of the scan statistic and suggest the pseudo-R² measure as a testing criterion for identifying communities in social networks. By considering the heterogeneity of the probability of each edge, the proposed method is flexible enough to be applied with different random models. We provide the RCP model and the logit model to demonstrate the effectiveness of this generalized method. Both models have acceptable type I errors and detection powers, and they also improve the detection accuracy of community members. In addition, two empirical examples show that the proposed methods are practical for real data. Although the accuracy of our new models is better than that of the Poisson model, our method cannot fully replace the original one because of its computing load. Since the provided models, especially the logit model, contain some parameters without closed forms, numerical procedures must be used to find the estimates. The number of parameters (the number of nodes) is huge when dealing with a large network, and a computer with current specifications usually cannot afford to run such an estimation procedure. In addition, the scan statistic relies on Monte Carlo testing to obtain the Monte Carlo p-value. It takes a long time to reproduce synthetic data and to execute the R Monte Carlo replications. In the simulation study (the studied network with 100 nodes), the RCP model takes 5 min, whereas the logit model takes around half an hour, to conduct a complete testing procedure with 99 runs on our personal computer (Intel Core i7-4770 CPU, 3.40 GHz). Although the scanning method seems to be a time-consuming approach, it can be executed via parallel computing by distributing the calculation of the test statistics of independent scanning windows according to its systematic searching regime.
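To make the cost of the Monte Carlo step concrete, the following Python sketch shows the usual rank-based Monte Carlo p-value, (1 + number of simulated statistics at least as large as the observed one) / (R + 1), with R = 99 replications. This is an illustration under stated assumptions rather than the authors' procedure: the function names are hypothetical, and the toy statistic (the edge count inside a 20-node window of a Bernoulli graph with p0 = 1/20) merely stands in for the scan statistic, which would have to be re-estimated in every replication.

```python
import random


def monte_carlo_p_value(observed, simulate_statistic, runs=99, seed=0):
    """Rank-based Monte Carlo p-value of an observed test statistic."""
    rng = random.Random(seed)
    # Count null replications whose statistic reaches the observed value.
    exceed = sum(1 for _ in range(runs) if simulate_statistic(rng) >= observed)
    return (exceed + 1) / (runs + 1)


def edges_in_window(rng, n_window=20, p0=1 / 20):
    """Toy null statistic (assumption): number of edges among n_window
    nodes when every pair is connected independently with probability p0."""
    pairs = n_window * (n_window - 1) // 2
    return sum(1 for _ in range(pairs) if rng.random() < p0)


# Hypothetical observed value: 25 edges inside the candidate window.
print(monte_carlo_p_value(25, edges_in_window))
```

With R = 99 the smallest attainable p-value is 0.01, and each replication repeats the full estimation, which is why the replications dominate the running time and why they are natural units to distribute across parallel workers.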
On the other hand, instead of using the Monte Carlo procedure, a possible solution is to apply the false discovery rate (FDR) [1] to account for the type I errors when conducting the multiple tests. Without the Monte Carlo procedure, we can save much time when executing the scanning method. We are also looking forward to reducing the computing time by accelerating the computing algorithm.

Another restriction of the scan statistic is the shape of the scanning window, which decides the range of a community. We use circular windows to generate candidate subsets for detecting communities, but the circular window is not the only choice. One may apply other window shapes when detecting communities in social networks. Some meta-heuristic optimization methodologies are also considered good methods for finding the best communities [15]. On the other hand, the selection of the radius is another problem. In real data, we do not know the true extent of communities. We select 60% as the maximum size of a community in the empirical studies, but in most real cases, one should consult field experts about the maximum size of a community for a better result.

The generalized framework is flexible for testing communities. The two models provided in this study are just two simple forms. If the connection probabilities between nodes can be estimated a priori, this framework is easy to apply to constructing the scan statistic. However, the probabilities of edges are very difficult to construct. More effort is needed to construct the probabilities of edges and to make this method more flexible. Moreover, the approach of statistically evaluating community detection is not restricted to the scanning method we proposed in this project. It could be applied to several state-of-the-art methods in community detection, including at least the Louvain algorithm and methods based on stochastic block modeling.

Acknowledgement. This work was partially supported by the Ministry of Science and Technology (Taiwan) Grant Numbers 107-2118-M-001-011-MY3 and 108-2321-B001-016.

References

1. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc.: Ser. B 57(1), 289–300 (1995)
2. Catanzaro, M., Boguñá, M., Pastor-Satorras, R.: Generation of uncorrelated random scale-free networks. Phys. Rev. E 71(2), 027103 (2005)
3. Csermely, P.: Creative elements: network-based predictions of active centres in proteins and cellular and social networks. Trends Biochem. Sci. 33(12), 569–576 (2008)
4. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, Cambridge (2010)
5. Erdős, P., Rényi, A.: On random graphs. Publ. Math. Debr. 6, 290–297 (1959)
6. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
7. Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. 104(1), 36–41 (2007)
8. Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry 40(1), 35–41 (1977)
9. Guimera, R., Sales-Pardo, M., Amaral, L.A.N.: Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70(2), 025101 (2004)
10. Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social networks. J. R. Stat. Soc.: Ser. A 170(2), 301–354 (2007)
11. Heard, N.A., Weston, D.J., Platanioti, K., Hand, D.J.: Bayesian anomaly detection methods for social networks. Ann. Appl. Stat. 4(2), 645–662 (2010)
12. Javadi, S.H.S., Khadivi, S., Shiri, M.E., Xu, J.: An ant colony optimization method to detect communities in social networks. In: Proceedings of Advances in Social Networks Analysis and Mining, pp. 200–203 (2014)
13. Kulldorff, M.: A spatial scan statistic. Commun. Stat.-Theory Methods 26(6), 1481–1496 (1997)
14. Lancichinetti, A., Radicchi, F., Ramasco, J.: Statistical significance of communities in networks. Phys. Rev. E 81(4), 046110 (2010)
15. Liu, J., Liu, T.: Detecting community structure in complex networks using simulated annealing with k-means algorithms. Phys. A 389(11), 2300–2309 (2010)
16. Magee, L.: R² measures based on Wald and likelihood ratio joint significance tests. Am. Stat. 44(3), 250–253 (1990)
17. Moody, J., White, D.R.: Structural cohesion and embeddedness: a hierarchical concept of social groups. Am. Sociol. Rev. 68(1), 103–127 (2003)
18. Naus, J.I.: Approximations for distributions of scan statistics. J. Am. Stat. Assoc. 77(377), 177–183 (1982)
19. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69(6), 066133 (2004)
20. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
21. Patil, G.P., Taillie, C.: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environ. Ecol. Stat. 11(2), 183–197 (2004)
22. Perry, M.B., Michaelson, G.V., Ballard, M.A.: On the statistical detection of clusters in undirected networks. Comput. Stat. Data Anal. 68, 170–189 (2013)
23. Reddy, P.K., Kitsuregawa, M., Sreekanth, P., Rao, S.S.: A graph based approach to extract a neighborhood customer community for collaborative filtering. Databases Netw. Inf. Syst. 2544, 188–200 (2002)
24. Robins, G., Pattison, P., Kalish, Y., Lusher, D.: An introduction to exponential random graph (p*) models for social networks. Soc. Netw. 29(2), 173–191 (2007)
25. Strauss, D., Ikeda, M.: Pseudolikelihood estimation for social networks. J. Am. Stat. Assoc. 85(409), 204–212 (1990)
26. Tango, T., Takahashi, K.: A flexibly shaped spatial scan statistic for detecting clusters. Int. J. Health Geogr. 4(1), 11 (2005)
27. Wang, B., Philips, J.M., Schreiber, R., Wilkinson, D.M., Mishra, N., Tarjan, R.: Spatial scan statistics for graph clustering. In: Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 727–738 (2008)
28. Wang, T.C., Phoa, F.K.H.: A scanning method for detecting clustering pattern of both attribute and structure in social networks. Phys. A 445, 295–309 (2016)
29. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393(6684), 440–442 (1998)

Comparing the Community Structure Identified by Overlapping Methods

Vinícius da F. Vieira1(B), Carolina R. Xavier1, and Alexandre G. Evsukoff2

1 Federal University of São João del-Rei, São João del-Rei, Brazil
[email protected]
2 COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

Abstract. Community detection is one of the most important tasks in network analysis. Recently, an increasing number of researchers have been dedicated to investigating networks in which the nodes participate concomitantly in more than one community. This work presents a comparative study of five state-of-the-art methods for overlapping community detection from the perspective of the structural properties of the communities identified by them. Experiments with benchmark and ground-truth networks show that, although the methods are able to identify modular communities, they often miss many structural properties of the communities, such as the number of nodes in the overlapping region and the membership of the nodes.

Keywords: Overlapping community detection · Structural properties · Comparison of methods

1 Introduction

One of the most important topological properties of complex networks is the organization of nodes into communities, a division of the nodes into groups with dense internal connections and sparse external connections. The discovery and investigation of communities in real-world networks can reveal key functional and structural aspects in many contexts. Traditional methods for community detection aim at dividing a network into groups as a partition problem, i.e., a node must belong to one and only one community, and some measures have been proposed in the literature to assess the quality of communities in networks. However, it is very intuitive that, in real-world networks, elements can participate concomitantly in several communities. For example, a person can maintain relationships with family members, coworkers and the members of a sports club, or a protein can interact with other proteins in many different metabolic reactions. In this sense, more recently, several works have dedicated efforts to understanding the properties and characteristics of overlapping communities in networks, in order to explore this natural aspect of real-world phenomena [7,11,12].

An increasing number of authors seek to develop methods for detecting the overlapping community structure in networks. Some authors consider a specific criterion to characterize a good division of a network into communities and define methods to optimize such a criterion, such as Lancichinetti et al. [5], who define a fitness function for communities and propose a heuristic to optimize this quantity. The works of Nicosia et al. [6] and Shen et al. [10] also aim to detect overlapping communities by considering a variation of Newman's modularity and propose optimization methods for community structure. Other authors propose methods from different perspectives. Palla et al. [7] present an algorithm for community detection based on the identification of adjacent k-cliques (CFinder), allowing nodes to participate in several communities at the same time. Yang and Leskovec [12] present a model for the representation of overlapping community structure and propose a method to fit the community structure to the model as an integer optimization problem.

Some works found in the literature present reviews and comparative studies of overlapping community detection methods and can be used as good references [1,4,11]. The work of Xie et al. [11] and the work of Amelio et al. [1] present a wide description of methods for the identification of overlapping communities. Additionally, Xie et al. compare the methods regarding their ability to identify the nodes in the overlapping region. Hric et al. [4] investigate whether the communities identified by the methods agree with ground-truth communities.

The present work also performs a comparative analysis of some of the state-of-the-art methods for overlapping community detection in complex networks. However, unlike the works of Xie et al. and Hric et al., this work does not focus on quality measures and hit rates of the methods. From a different perspective, the experiments conducted in this work investigate the structural properties of the communities identified by the studied methods. This approach was previously followed by Hric et al. [4], in a work where the authors test a basic hypothesis of community detection methods: that the topological organization of a network alone is able to reveal its communities. The authors conducted experiments to assess whether community detection methods are able to correctly classify the ground-truth communities (which they call metadata groups), disregarding objective measures, and verified that, in most cases, there is a substantial difference between identified and metadata groups.

In this work, community detection methods are evaluated with benchmark networks, frequently used to evaluate community detection algorithms, and with networks with ground truth. Different from the work of Hric et al. [4], where the authors evaluate the methods by their ability to identify the correct communities, this work focuses on the structural properties of the communities produced by the methods in order to investigate the occurrence of patterns in the topology of the extracted communities and their similarity to ground-truth communities. Quality measures of the overlapping community structure identified by the methods are also presented in order to better illustrate the analysis.

The investigation conducted allows us to verify that the methods are able to extract highly modular communities from small and medium-sized networks, confirming the results obtained by other authors in the literature. It is also observed that most state-of-the-art methods are unable to identify the correct groups for nodes in networks with ground truth. Moreover, the identified communities are often very distinct from those observed for the ground-truth networks. The investigation of ground-truth networks combined with the benchmark networks also suggests that the structural properties of the communities obtained by the community detection methods are more related to the method itself and its way of operation than to the topological organization of the networks.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 262–273, 2020. https://doi.org/10.1007/978-3-030-36687-2_22

2 Methods for Overlapping Community Detection

The problem of overlapping community detection can be stated as follows. Consider a network G(V, E), where V represents the set of nodes and E represents the set of edges, such that n = |V| and m = |E|. For the sake of simplicity, the edges can be considered unweighted and undirected. G can be represented by an adjacency matrix A. An overlapping community structure C can be defined as a cover of V in $n_c$ communities, $C = \{C_i, i = 1 \ldots n_c\}$, and a node u can participate in one or more communities $C_i$ with a belonging factor $\alpha_{u,C_i}$ of node u in community $C_i$ such that $0 \le \alpha_{u,C_i} \le 1$, $\forall u \in V$, $\forall C_i \in C$, and $\sum_{i=1}^{n_c} \alpha_{u,C_i} = 1$, $\forall u \in V$. As in the partition problem, the sense of community becomes more evident as the difference between intra- and inter-community edges increases, and there is no universal definition for a particular division of the network into communities, overlapping or not.

An increasing number of works aim to understand and detect the organization of nodes in communities without limiting the number of groups a node can take part in. In order to define a method for detecting overlapping communities (CFinder), Palla et al. [7] observe that typical communities are basically the union of complete subgraphs of size k and define a community as a union of adjacent k-cliques. In this sense, a single node can belong to several communities and an overlapping structure can be identified. A game theory perspective is considered by Zhou et al. [14] for the problem of overlapping communities, in a work in which the authors propose a function to assess the quality of the union of hierarchical groups and a greedy algorithm to identify overlapping communities.

Variations of the label propagation strategy [8] are at the core of many state-of-the-art works in the literature for overlapping community detection. Xie et al. [11] propose a method based on a variation of label propagation (SLPA) in which each node can assume the role of a listener/speaker and receive/propagate a label retained in a memory attached to it. The method of Coscia et al. [2] (Demon) is also based on a variation of label propagation, but it considers an ego-network for each node and evaluates the labels of each group shared by the node. Gregory [3] describes a dynamic label propagation method for overlapping community detection (COPRA) in which the belonging factor of each node to each community is locally updated by considering the belonging factors of its neighbors, and a parameter is defined to control the maximum number of communities in which a node can participate. Yang and Leskovec [12] propose a Community-Affiliation Model that considers denser connections in the overlapping regions and fit it to networks.
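To illustrate how a union of adjacent k-cliques yields overlapping communities, the sketch below implements a brute-force clique percolation in Python. It is a toy illustration for small graphs, not the CFinder implementation: it assumes that two k-cliques are adjacent when they share k − 1 nodes, enumerates all k-node subsets, and the function name `clique_percolation` and the example graph are hypothetical.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Communities as unions of k-cliques that share k - 1 nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    # Enumerate every k-clique by brute force (exponential in general).
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: merge cliques that are adjacent.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example: two triangles that share only node 3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
print(clique_percolation(edges, 3))
```

In this example the two triangles share a single node, so they are not adjacent for k = 3 and remain separate communities, with node 3 belonging to both. Nodes that are in no k-clique receive no community at all, which is one way a method can leave nodes unassigned.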


In another work, a similar approach is proposed by the same authors [13], where they combine a stochastic model and a non-negative matrix factorization method to detect communities in huge real-world networks (Bigclam). The number of communities to be considered by the method for a network is estimated by forcing the model to use the minimal number of communities while still accurately modeling the network. Some works propose variations of the widely adopted Newman's modularity in order to adapt it to the assessment of overlapping communities [6,10]. This work considers the adaptation of the modularity proposed by Shen et al. [10], which relaxes the constraint that a node can belong to only one community and defines a more flexible factor $\alpha_{u,C_i}$ that denotes how much a node u belongs to the community $C_i$:

$$Q_{ov} = \frac{1}{2m} \sum_{C_i \in C} \sum_{uv} \left[ A_{uv} - \frac{k_u k_v}{2m} \right] \alpha_{u,C_i}\, \alpha_{v,C_i}. \qquad (1)$$
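A direct, unoptimized evaluation of Eq. (1) can be sketched in Python as follows. The sketch assumes equal belonging factors, i.e. $\alpha_{u,C_i} = 1/O_u$ where $O_u$ is the number of communities containing u, which is one common choice and is not stated in the text above; the function name `overlapping_modularity` and the example graph are hypothetical.

```python
def overlapping_modularity(edges, cover):
    """Evaluate Eq. (1) on an undirected, unweighted graph.

    Assumption: alpha_{u,C} = 1 / (number of communities containing u).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    m = len(edges)
    degree = {u: len(nb) for u, nb in adj.items()}
    # O_u: number of communities in which each node participates.
    memberships = {}
    for community in cover:
        for u in community:
            memberships[u] = memberships.get(u, 0) + 1
    q = 0.0
    for community in cover:
        for u in community:
            for v in community:  # ordered pairs, including u == v
                a_uv = 1.0 if v in adj.get(u, ()) else 0.0
                expected = degree.get(u, 0) * degree.get(v, 0) / (2 * m)
                q += (a_uv - expected) / (memberships[u] * memberships[v])
    return q / (2 * m)


# Hypothetical example: two triangles joined through node 3, which
# belongs to both communities.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
cover = [{1, 2, 3}, {3, 4, 5}]
print(overlapping_modularity(edges, cover))
```

When the cover is a partition, every belonging factor equals one and the expression reduces to Newman's modularity, which gives a simple sanity check for an implementation.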

Before presenting and discussing the experiments performed in this work, it is worth mentioning that most methods in the literature for the identification of overlapping community structure are highly dependent on parameter adjustments, which may pose a concern in their use. The methods of Lancichinetti et al. [5] and Rhouma et al. [9] require the definition of resolution parameters and mixing parameters, which control the size of the resulting communities. The community structure found by the method of Lancichinetti et al. also depends on the correct definition of the parameter μ in the fitness function. The method of Shen et al. [10] is based on the identification of cliques, and the definition of a proper number of cliques is crucial to the detection of meaningful communities, which is performed by the authors as an exploratory analysis. In CFinder [7], the parameter k, which specifies the size of the k-cliques, must be set. COPRA [3] requires the user to set the maximum number of communities in which a node can participate. The graph models proposed by Yang and Leskovec for overlapping community structure [12] also depend on the definition of a parameter that controls the probability of two nodes being connected even if they are not in the same community. Furthermore, their methods for overlapping community detection [12,13] depend on the definition of the number of communities in the network, which is calculated by solving another optimization problem in the original works. The number of communities is also a parameter required by the genetic algorithm for overlapping community detection of Shen et al., where it represents the size of a chromosome in the population. Moreover, that method requires the setting of many parameters (number of generations, population size, stop criterion, crossover rate and mutation rate), like any genetic algorithm. A different kind of parametrization is required by the method of Nicosia et al., which demands the definition of an association function for the belonging coefficients of two nodes to a community. In the context of this work, we adopted the defaults defined by the authors or the set of parameters suggested by them, as discussed in the next section.

3 Experiments and Discussion

The main purpose of this work is to investigate some of the most important methods for overlapping community detection found in the literature from the perspective of the structural properties of the communities found. For this, two sets of networks were considered: a set of benchmark networks, frequently explored for the evaluation of community detection methods, and a set of sampled ground-truth networks. A sampling strategy was adopted for the ground-truth networks because the methods cannot handle their complete versions, which have millions of nodes. A variation of the sampling strategy proposed by Yang and Leskovec [12] for ground-truth networks is adopted. The original strategy randomly chooses a seed node u from the network and identifies all the communities in which u participates. Then, it takes all the nodes from these communities and determines the subgraph induced by these nodes. In this work, we do not select the seed node randomly. Instead, we first identify the node that participates in the most communities and define it as the seed node. By doing this, we expect to obtain a sample network with denser overlapping regions that is closer to a real-world scenario.

Although it would be interesting to investigate variations of the parameters, the scope of this work required fixing a set of parameters for the considered methods. In an attempt to get the best possible results, the methods were parametrized as suggested by their authors, as follows: CFinder [7] (size of k-cliques k = 4), Bigclam [13] (minimum number of communities mc = 5, maximum number of communities xc = 100, number of trials nc = 10), Demon [2] (merging threshold ε = 0.3, minimum community size mc = 3), COPRA [3] (maximum memberships of nodes v = 8) and SLPA [11] (propagation threshold r = 0.45).
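The seed-based sampling strategy described earlier in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function name `sample_by_seed` is hypothetical, ties between equally overlapping nodes are broken by the smallest node identifier (an assumption), and the toy edge list and communities are invented for the example.

```python
def sample_by_seed(edges, communities):
    """Induced subgraph around the node with the most memberships."""
    # Count in how many ground-truth communities each node participates.
    count = {}
    for community in communities:
        for u in community:
            count[u] = count.get(u, 0) + 1
    if not count:
        raise ValueError("no ground-truth communities given")
    # Seed: node with the most memberships (smallest id wins ties).
    seed = max(sorted(count), key=count.get)
    # Keep every node of every community that contains the seed.
    kept = [set(c) for c in communities if seed in c]
    nodes = set().union(*kept)
    # Induced subgraph: edges with both endpoints among the kept nodes.
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return seed, nodes, sub_edges


# Hypothetical toy network with four ground-truth communities.
edges = [(1, 2), (2, 3), (3, 4), (4, 7), (7, 8), (5, 6), (3, 5)]
communities = [{1, 2, 3}, {3, 4}, {3, 5, 6}, {7, 8}]
print(sample_by_seed(edges, communities))
```

Choosing the most overlapping node as the seed guarantees that at least one node of the sample sits in the intersection of all retained communities, which is the property the sampling variation aims for.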
The set of networks investigated is composed of five benchmark networks: CA-GrQc (see footnote 1) (n = 5242, m = 14496), CAHepPh (see footnote 1) (n = 12008, m = 118521), Cit-HepTh (see footnote 1) (n = 27770, m = 352324), Email (see footnote 2) (n = 1133, m = 5451) and Keys (see footnote 2) (n = 10680, m = 24319); and four sampled ground-truth networks: Amazon (see footnote 1) (n = 10308, m = 21575), DBLP (see footnote 1) (n = 7515, m = 31784), Live Journal (see footnote 1) (n = 5843, m = 12543) and Youtube (see footnote 1) (n = 6792, m = 28177).

3.1 Quality Measures

Table 1 shows some results regarding objective measures of the communities extracted by the explored methods. The overlapping modularity Qov considered implements the extension proposed by Shen et al. [10] (Eq. 1). The Overlapping Normalized Mutual Information (ONMI) [5] is presented only for networks with ground-truth information. The execution time for each method and each network is also presented (see footnote 3).

First, it is possible to observe from Table 1 that, for most ground-truth networks, the methods were able to execute in a reasonable time, allowing them to be used in small and medium-sized real-world applications. For some benchmark networks, CFinder and Demon presented a high execution time, making them prohibitive for real-time situations. Bigclam and SLPA, especially, presented consistent results, processing almost all the networks in a short time. For slightly larger networks, the methods start to show a high execution time and some considerations must be made. All the methods showed a high execution time for CitHepTh (with ∼27k nodes and ∼352k edges). It is important to highlight that CFinder was unable to run on CitHepTh and CAHepPh.

Since the main purpose of this work is neither to compare objective measures of the methods nor to produce a ranking of methods, some observations regarding quality measures are made with the sole purpose of better understanding the structural properties of the identified communities. Considering the overlapping modularity Qov, COPRA, Bigclam and SLPA were able to identify modular communities in almost all networks. The exception is the Youtube network, for which only COPRA was able to identify a relevant community structure. It is interesting to notice that, for the Amazon network, Bigclam, Demon, COPRA and SLPA were able to identify communities with a much more modular structure than the ground truth, especially SLPA, which obtained Qov = 0.80, while the ground-truth modularity is Qov = 0.40. The Overlapping Normalized Mutual Information (ONMI) values in Table 1 show that the methods were unable to identify the communities correctly, although the results obtained by COPRA stand out from those of the other methods. Some hypotheses to explain why the methods mostly show a low ONMI while presenting a high Qov are discussed in the next sections.

Table 1. Execution time (in seconds), overlapping modularity Qov and Overlapping Normalized Mutual Information (ONMI), when applicable, for the studied networks. GT denotes the ground-truth communities.

Network    GT     CFinder                Bigclam                Demon                  COPRA                  SLPA
           Qov    Time     Qov   ONMI    Time     Qov   ONMI    Time     Qov   ONMI    Time     Qov   ONMI    Time     Qov   ONMI
Amazon     0.40   0.65     0.33  0.04    0.31     0.81  0.11    2.60     0.51  0.05    14.82    0.53  0.25    2.50     0.80  0.05
DBLP       0.57   2.43     0.38  0.04    1.11     0.43  0.03    6.54     0.20  0.04    2.00     0.50  0.25    1.77     0.52  0.07
LJ         0.59   0.35     0.15  0.00    0.58     0.42  0.01    1.03     0.21  0.05    0.88     0.52  0.12    1.06     0.58  0.04
Youtube    0.30   16.32    0.02  0.00    3.77     0.06  0.00    2.76     0.03  0.03    0.72     0.36  0.16    1.07     0.01  0.00
CAGrQc     -      0.71     0.46  -       0.29     0.61  -       2.13     0.41  -       0.87     0.47  -       1.14     0.66  -
CAHepPh    -      -        -     -       5.03     0.35  -       74.65    0.14  -       2.37     0.16  -       4.31     0.25  -
CitHepTh   -      -        -     -       35.37    0.16  -       86.18    0.04  -       9.67     0.01  -       15.02    0.14  -
Email      -      0.27     0.27  -       0.69     0.18  -       0.54     0.06  -       0.22     0.51  -       0.24     0.11  -
Keys       -      2111.34  0.38  -       0.93     0.63  -       2.98     0.31  -       1.64     0.68  -       2.50     0.71  -

Footnotes:
1 Downloaded from: http://snap.stanford.edu/data/.
2 Downloaded from: http://www-personal.umich.edu/~mejn/netdata/.
3 The computational environment consists of an Intel Core i9-9900K processor with 32 GB RAM running Ubuntu 18.04.


Fig. 1. Complementary Cumulative Distribution Function of the node memberships in the communities obtained by the studied methods. Panels: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) CitHepTh, (h) Email, (i) Keys.

3.2 Node Membership

Considering the features of the community detection methods and the results for the objective quality measures previously presented, it is possible to investigate other aspects of the structural properties of the networks. Figure 1 shows the Complementary Cumulative Distribution Function (CCDF) of node memberships, i.e., the number of communities in which each node participates, for the communities identified by each method and for the ground-truth communities of each network. Figure 1 clearly shows that, except for COPRA and SLPA, the considered methods do not necessarily assign a community to each node, so some nodes can be left with no community and, thus, the probability that a node belongs to at least one community can be less than one. This behavior particularly stands out for CFinder, which severely underestimates the number of nodes that belong to a community. It is also possible to observe that the CCDFs for the ground-truth communities show an almost linear behavior (in log-log scale), which was not captured by any of the studied methods. It is also interesting to notice that, despite differences in scale, some methods, like SLPA, Demon and Bigclam, show a very similar behavior for most networks, suggesting that the resulting community structure depends less on the relations existing in the network than on the strategy of the algorithm. It is important to note that most methods depend on parameters, which may strongly affect the community structure, and a deeper investigation should be performed for each method in order to draw more conclusive observations.

3.3 Community Size

Fig. 2. Complementary Cumulative Distribution Function of the sizes of the communities obtained by the studied methods. Panels: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) CitHepTh, (h) Email, (i) Keys.

The CCDFs of the sizes of the communities identified by the methods were also considered and the results are presented in Fig. 2. As for the membership distribution, Fig. 2 shows that, despite differences in scale, the ground-truth communities show an almost linear behavior (in log-log scale) over a wide range of the distribution, which was not captured by any of the methods. First, it is worth noting that the communities obtained by COPRA are very similar in size to the ground truth. However, this is atypical and such similarity cannot be observed in the other scenarios. It is interesting to notice that the shapes of the curves observed for Demon and, more clearly, for Bigclam are very consistent across most networks, though very different from the ground truth. This suggests that the structural properties of the communities may be highly affected by the mechanisms used by the methods and not by the network itself, which must be better understood through further investigation of the methods. Bigclam, for instance, requires the specification of the maximum number of communities, which naturally impacts the size of the communities. This can underestimate the number of large communities, as observed for three of the ground-truth networks (Amazon, DBLP and Live Journal). A deeper investigation of the properties of CAHepPh and CitHepTh must also be conducted in order to understand the similarity of the curves obtained by the methods in these networks. We can also notice that large communities are likely to occur in these networks and the probability of occurrence of large communities identified by the studied methods is very low. If we take into consideration the results presented in Table 1, it is noteworthy that the Qov of the community structure identified for these networks is very high for most methods. On the other hand, the Qov observed for Youtube is very low for most methods. This fact must be further investigated, but some hypotheses can be raised. Possibly, the methods are mistakenly disregarding larger communities because they are prone to identifying highly modular structures, even if these are very small compared to real-world scenarios. This way of operating may be causing the methods to fail to discover the right communities, which is reflected in the low ONMI observed in Table 1.
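The CCDFs of node memberships and community sizes discussed in Sects. 3.2 and 3.3 can be sketched as follows. This is a minimal illustration, not the authors' code: the representation of a cover as a list of node sets, the function names, and the toy data are all assumptions made for the example. Nodes without a community get a membership of zero, which is why the probability of belonging to at least one community can fall below one.

```python
from collections import Counter


def ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for a list of integers."""
    n = len(values)
    counts = Counter(values)
    points = []
    remaining = n  # number of observations with value >= current x
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points


def membership_counts(nodes, cover):
    """Number of communities each node belongs to (0 if unassigned)."""
    counts = {node: 0 for node in nodes}
    for community in cover:
        for node in community:
            counts[node] = counts.get(node, 0) + 1
    return list(counts.values())


# Hypothetical toy cover over six nodes; node 5 is left without a community.
nodes = range(6)
cover = [{0, 1, 2}, {2, 3}, {2, 3, 4}]

membership_ccdf = ccdf(membership_counts(nodes, cover))
size_ccdf = ccdf([len(community) for community in cover])

print(membership_ccdf)  # P(membership >= 1) is 5/6 because node 5 is unassigned
print(size_ccdf)
```

Plotting both lists on log-log axes gives curves of the kind shown in Figs. 1 and 2.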


Fig. 3. Complementary Cumulative Distribution Function of the sizes of the overlapping regions obtained by the studied methods. Panels: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) CitHepTh, (h) Email, (i) Keys.

3.4 Overlapping Size

The size of the overlapping region, i.e., the number of nodes in each region formed by the intersection of communities, was also investigated and the CCDFs of these quantities are presented in Fig. 3. The methods do not show a clear and constant pattern for the distribution of overlapping sizes, and the curves observed for a given method are very distinct from one network to another. It is worth noting that, for most methods, the occurrence of overlapping regions is very unlikely. The exception is Bigclam, which identifies communities with relevant overlapping regions, especially in the ground-truth networks. Still considering the networks with ground truth, the number of small overlapping regions is very high. Moreover, for three ground-truth networks (Amazon, DBLP and Live Journal), the methods underestimate the sizes of overlapping regions over the whole range of the distribution. In the communities identified by COPRA, SLPA and CFinder, the occurrence of nodes in the overlapping region is extremely rare. Therefore, we can argue that these methods are treating community detection as a partition problem for a large number of nodes. CFinder and, especially, SLPA also identify very small overlapping regions for almost all networks. Taken together with the results from Table 1, this suggests that the methods fail to identify the overlapping nodes and concentrate their hits on the nodes that belong to only one community. Xie et al. [11] investigated the ability of overlapping community detection methods to identify overlapping nodes in synthetic networks and verified that, in that scenario, the methods were able to classify them. Further investigation must be performed in order to confirm or reject the hypothesis that the methods underestimate the size of overlapping regions due to a misclassification of overlapping nodes in the context explored in the present work.
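One possible way to measure the quantities discussed above is sketched below. The text does not spell out how an overlapping region is delimited, so this example assumes that a region is the non-empty intersection of a pair of communities; the function names and the toy cover are hypothetical. The second function reports the fraction of assigned nodes that sit in more than one community, which is close to zero when a method effectively produces a partition.

```python
from itertools import combinations


def overlap_sizes(cover):
    """Sizes of all non-empty pairwise intersections between communities."""
    sizes = []
    for a, b in combinations(cover, 2):
        shared = len(set(a) & set(b))
        if shared:
            sizes.append(shared)
    return sizes


def fraction_overlapping_nodes(cover):
    """Fraction of assigned nodes that belong to more than one community."""
    memberships = {}
    for community in cover:
        for node in community:
            memberships[node] = memberships.get(node, 0) + 1
    if not memberships:
        return 0.0
    overlapping = sum(1 for k in memberships.values() if k > 1)
    return overlapping / len(memberships)


# Hypothetical toy cover: nodes 2 and 3 lie in overlapping regions.
cover = [{0, 1, 2}, {2, 3}, {2, 3, 4}, {5, 6}]

sizes = overlap_sizes(cover)
fraction = fraction_overlapping_nodes(cover)
print(sizes, fraction)
```

The list returned by `overlap_sizes` is the kind of sample whose CCDF is plotted in Fig. 3, under the pairwise assumption stated above.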

Fig. 4. Probability of occurrence of an edge between two nodes with respect to the number of communities shared by them. Panels: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) Email, (h) Keys.

3.5 Edge Probability on the Overlapping Region

According to Yang and Leskovec [12], most methods found in the literature tend to identify communities with fewer edges in the overlapping region than in the non-overlapping region. However, the authors investigate a set of ground-truth communities and empirically show that the regions of the network where communities overlap tend to be denser than the rest of the network, i.e., the more communities two nodes share, the higher the probability that an edge exists between them. In this work, we investigate whether the findings of Yang and Leskovec are observed in the ground-truth networks and whether the investigated methods are able to capture this behavior, even for the benchmark networks. Figure 4 shows the probability that an edge exists between two nodes as a function of the number of communities shared by them. First, it is important to mention that this experiment is very memory-consuming and could not be performed for CitHepTh. From Fig. 4 it is possible to notice that there is, in fact, a positive correlation between the number of shared communities and the edge probability in ground-truth networks. This behavior is captured by Bigclam, although the edge probability decreases after a certain number of shared communities in almost all networks. Most other methods fail to capture dense overlapping regions, as stated by Yang and Leskovec [12]. The exception is Demon, which identifies dense overlapping regions but largely overestimates the probability that an edge exists. For the other methods, it is difficult to make more conclusive observations. However, when the other results are considered, a clearer view of the results presented in Fig. 4 is possible. For instance, in SLPA, COPRA and CFinder, large overlapping regions and nodes that belong to several communities are rare. Thus, it is expected that the probability of an edge in the overlapping region does not show a clear pattern across the different networks.
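The quantity plotted in Fig. 4 can be estimated along the following lines. This is a sketch under stated assumptions rather than the procedure actually used in the experiments: it enumerates every pair of nodes, which is quadratic in the number of nodes and is in line with the memory and time cost mentioned above. The function name, the undirected edge list, and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edge_probability_by_shared(nodes, edges, cover):
    """Map k -> fraction of node pairs sharing exactly k communities that are linked."""
    memberships = defaultdict(set)
    for index, community in enumerate(cover):
        for node in community:
            memberships[node].add(index)
    edge_set = {frozenset(edge) for edge in edges}  # undirected edges
    pairs = defaultdict(int)
    linked = defaultdict(int)
    for u, v in combinations(nodes, 2):
        k = len(memberships[u] & memberships[v])
        pairs[k] += 1
        if frozenset((u, v)) in edge_set:
            linked[k] += 1
    return {k: linked[k] / pairs[k] for k in sorted(pairs)}


# Hypothetical toy network: a path 0-1-2-3-4 covered by three communities.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
cover = [{0, 1, 2}, {1, 2, 3}, {3, 4}]

probabilities = edge_probability_by_shared(nodes, edges, cover)
print(probabilities)
```

In this toy case the estimated probability grows with the number of shared communities, which is the pattern reported for the ground-truth networks.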

4 Conclusions and Future Directions

This work presents a comparative analysis of five state-of-the-art methods for overlapping community detection in complex networks, considering a set of four ground-truth networks and five benchmark networks. Unlike other works, where community detection methods are compared regarding objective measures, in this work the investigation is performed from the perspective of the community structure identified by the methods. Different ways to characterize the organization of communities were tested and analyzed in combination with objective quality measures for the network cover: an extension of Newman's modularity and an adaptation of NMI for overlapping communities. The methods were able to identify modular community structures, resulting in large values of Qov. However, they were unable to estimate the correct community cover, which is evidenced by the low values obtained for ONMI. The analysis of the community structure, especially when we consider the size of the overlapping region and the number of memberships of the nodes, allows us to raise some hypotheses to explain how the methods can present low values for ONMI while finding modular communities. The overlapping community methods, particularly those based on label propagation, tend to find small and very modular communities while disregarding the nodes in the overlapping region. Naturally, the results obtained by the analysis conducted in this work may not be observed in other networks and no general conclusion can be drawn regarding them. Deeper investigation must be performed to better understand the communities identified by the methods and relate them to the mechanisms of the methods. The same can be stated about the networks, and further investigation must be conducted in this sense, first to describe in detail the properties of the communities and then to relate them to the ground-truth communities in different contexts.

In order to better understand the methods for community detection, it would also be interesting to evaluate whether there is some agreement between the communities identified by the methods. Nevertheless, from the experiments performed in this work it can be argued that, although very convenient when assessing the quality of community detection algorithms, objective measures can miss important aspects of community structure in real-world networks. Mainly because objective measures for overlapping communities are extensions of measures for non-overlapping communities, especially Qov and ONMI considered in this work, they do not reflect the behavior of the communities in the overlapping region. The results observed in the experiments are very consistent with other works in the literature [4,13] and suggest that other aspects, besides objective functions, must be considered when designing community detection methods. Moreover, although some remarks can be made to better understand overlapping community detection methods, this work brings more questions than answers regarding the community detection problem in real-world applications, and we expect that it can serve as a basis for further studies in this direction.

Acknowledgement. The authors would like to thank the Brazilian research funding agencies CNPq and Capes for the support to this work.


References

1. Amelio, A., Pizzuti, C.: Overlapping community discovery methods: a survey. CoRR 1411.3935 (2014)
2. Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Uncovering hierarchical and overlapping communities with a local-first approach. ACM Trans. Knowl. Discov. Data 9(1), 6:1–6:27 (2014)
3. Gregory, S.: Finding overlapping communities in networks by label propagation. New J. Phys. 12(10), 103018 (2010)
4. Hric, D., Darst, R.K., Fortunato, S.: Community detection in networks: structural communities versus ground truth. Phys. Rev. E 90, 062805 (2014)
5. Lancichinetti, A., Fortunato, S., Kertesz, J.: Detecting the overlapping and hierarchical community structure of complex networks. New J. Phys. 11, 033015 (2009)
6. Nicosia, V., Mangioni, G., Carchiolo, V., Malgeri, M.: Extending the definition of modularity to directed graphs with overlapping communities. J. Stat. Mech: Theory Exp. 2009(03), P03024 (2009)
7. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043), 814 (2005)
8. Raghavan, N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007)
9. Rhouma, D., Romdhane, L.B.: An efficient algorithm for community mining with overlap in social networks. Expert Syst. Appl. 41(9), 4309–4321 (2014)
10. Shen, H.W.: Detecting the overlapping and hierarchical community structure in networks, pp. 19–44. Springer, Heidelberg (2013)
11. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks: the state of the art and comparative study. CoRR 1110.5813 (2011)
12. Yang, J., Leskovec, J.: Community-affiliation graph model for overlapping network community detection. In: Proceedings of the 2012 IEEE 12th International Conference on Data Mining, ICDM 2012, pp. 1170–1175. IEEE Computer Society, Washington, DC (2012)
13. Yang, J., Leskovec, J.: Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, pp. 587–596. ACM, New York (2013)
14. Zhou, L., Lü, K., Yang, P., Wang, L., Kong, B.: An approach for overlapping and hierarchical community detection in social networks based on coalition formation game theory. Expert Syst. Appl. 42(24), 9634–9646 (2015)

Semantic Frame Induction as a Community Detection Problem

Eugénio Ribeiro1,2,5(B), Andreia Sofia Teixeira1,3,4, Ricardo Ribeiro1,5, and David Martins de Matos1,2

1 INESC-ID, Lisbon, Portugal [email protected]
2 Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
3 Center for Social and Biomedical Complexity, School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, USA
4 Indiana University Network Science Institute (IUNI), Indiana University, Bloomington, IN, USA
5 Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal

Abstract. Resources such as FrameNet provide semantic information that is important for multiple tasks. However, they are expensive to build and, consequently, are unavailable for many languages and domains. Thus, approaches able to induce semantic frames in an unsupervised manner are highly valuable. In this paper we approach that task from a network perspective as a community detection problem that targets the identification of groups of verb instances that evoke the same semantic frame. To do so, we apply a graph-clustering algorithm to a graph with contextualized representations of verb instances as nodes connected by an edge if the distance between them is below a threshold that defines the granularity of the induced frames. By applying this approach to the benchmark dataset defined in the context of the SemEval shared task we outperformed all the previous approaches to the task.

Keywords: Semantic frames · Contextualized representations · Community detection · Graph clustering

1 Introduction

A word may have different senses depending on the context in which it appears. Thus, in order to understand its meaning, we must analyze that context and identify the semantic frame that is being evoked [12]. Consequently, sets of frame definitions and annotated datasets that map text into the semantic frames it evokes are important resources for multiple Natural Language Processing (NLP) tasks [1,10,23]. Among such resources, the most prominent is FrameNet [5], which provides a set of more than 1,200 generic semantic frames, as well as over 200,000 annotated sentences in English. However, this kind of resource is expensive and time-consuming to build, since both the definition of the frames and the annotation of sentences require expertise in the underlying knowledge. Furthermore, it is difficult to decide both the granularity and the domains to consider while defining the frames. Thus, such resources only exist for a small number of languages [8] and even English lacks domain-specific resources in multiple domains.

An approach to alleviate the effort in the process of building semantic frame resources is to induce the frames evoked by a collection of documents using unsupervised approaches. However, most research on this subject focused on arguments and the induction of their semantic roles [16,25,26] or on the induction of semantic frames from verbs with two arguments [18,28]. To address this issue and define a benchmark for future research, a shared task was proposed in the context of SemEval 2019 [21]. This task focused on the unsupervised induction of FrameNet-like frames through the grouping of verbs and their arguments according to the requirements of three different subtasks. The first of those subtasks focused on clustering instances of verbs according to the semantic frame they evoke, while the others focused on clustering the arguments of those verbs, both according to the frame-specific slots they fill and their semantic role.

In this paper we approach the first subtask from a network perspective. First, we generate a network in which the nodes correspond to contextualized representations of each verb instance. Then, we create an edge between two nodes if the distance between them is lower than a certain threshold, which controls the granularity of the induced frames. Finally, we apply a graph-clustering approach to identify communities of nodes that evoke the same frame.

In the remainder of the paper, we start by providing an overview of previous approaches to the task, in Sect. 2. Then, in Sect. 3, we describe our induction approach. Section 4 describes our experimental setup. The results of our experiments are presented and discussed in Sect. 5. Finally, Sect. 6 summarizes the conclusions of our work and provides pointers for future work.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 274–285, 2020. https://doi.org/10.1007/978-3-030-36687-2_23

2 Related Work

Before the shared task in the context of SemEval 2019, there were already some approaches to unsupervised semantic frame induction. For instance, LDAFrames [18] relied on topic modeling and, more specifically, on Latent Dirichlet Allocation (LDA) [7], to jointly induce semantic frames and their frame-specific semantic roles. On the other hand, Ustalov et al. [28] approached the induction of frames through the triclustering of Subject-Verb-Object (SVO) triples using the Watset fuzzy graph-clustering algorithm [27], which induces word-sense information in the graph before clustering. However, although these approaches are able to induce semantic frames, they can only be applied to verb instances with certain characteristics, such as a fixed number of arguments. Since we are approaching one of the subtasks defined in the context of SemEval 2019's Task 2, the most important approaches to describe in this section are those which competed in that subtask.

Arefyev et al. [3] achieved the highest performance in the competition using a two-step agglomerative clustering approach. First, it generates a small set of large clusters containing instances of verbs which have at least one sense that evokes the same frame. Then, the verb instances of each cluster are clustered again to distinguish the different frames that are evoked according to the different senses. In both steps, the generation of the representations of the instances relies on BERT [11]. Nonetheless, while the first step relies on the contextualized representation given by an empirically selected layer of the model, the second step uses BERT as a language model to generate possible context words that provide cues for the sense of the verb instance. To do so, multiple Hearst-like patterns [15] are applied to the sentence in which the verb instance occurs and the context words correspond to those generated to fill the slots in the patterns. The representation of the instance is then given by a tf-idf-weighted average of the representations of the most probable context words. The number of clusters in the first step was obtained by performing grid search while clustering the development and test data together. The selected value corresponds to the one that led to maximum performance on the development data. In the second step, clusters with fewer than 20 instances or containing specific undisclosed verbs were left intact. For the remaining clusters, the number of clusters was selected to maximize the silhouette score.

Anwar et al. [2] used a simpler approach based on the agglomerative clustering of contextualized representations of the verb instances. The number of clusters was defined empirically. In the system submitted for participation in the competition, the contextualized representations were obtained by concatenating the context-free representation of the verb instance obtained using Word2Vec [19] with the tf-idf-weighted average of the representations of the remaining words in the sentence. However, in a post-evaluation experiment, better results were achieved using the mean of contextualized representations generated by ELMo [20].

Finally, Ribeiro et al. [22] also relied on contextualized representations of the verb instances, but used a graph-based approach. They experimented with both the sum of the representations generated by ELMo [20] and those generated by the last layer of the BERT model [11]. Better results were achieved with the former. The contextualized representations are used as the nodes in a graph and connected by a distance-weighted edge if the cosine distance between them is below a threshold based on a function of the mean and standard deviation of the pairwise distances between the nodes. Finally, the Chinese Whispers [6] algorithm is applied to the graph to identify communities of nodes that evoke the same frame. Although the high performance achieved on the development data did not generalize to the test data, this simple approach has the potential to achieve better results with some modifications. Thus, the work described in this paper is based on this approach.

3 Semantic Frame Induction Approach

In general, our approach, summarized in Algorithm 1, is very similar to the one used by Ribeiro et al. [22] in the context of the SemEval shared task. It starts by generating a contextualized representation of each verb instance. These representations are then used as the nodes in a network or graph in which each pair of nodes is connected through an edge if the distance between them is below a certain threshold. Finally, the Chinese Whispers algorithm is applied to the graph to identify communities of verb instances that evoke the same frame. However, our approach includes some key modifications that improve its performance.

Algorithm 1. Frame Induction Approach
Input: S // The set of sentences
Input: T // The set of head tokens to cluster
Input: Embed // The approach for generating contextualized representations
Input: d // The neighboring threshold
Output: C // The set of clusters
1: V ← {Embed(S_t, t) : t ∈ T}
2: D ← {1 − cos(θ_{v,v′}) : (v, v′) ∈ V², v ≠ v′} // θ_{v,v′} is the angle between v and v′
3: W ← {1 − D_{v,v′} : (v, v′) ∈ V², v ≠ v′} // The weights of the edges
4: E ← {(v, v′, W_{v,v′}) : (v, v′) ∈ V², v ≠ v′, D_{v,v′} < d}
5: G ← (V, E)
6: C ← ChineseWhispers(G)
7: return C
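Steps 2 to 5 of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the embedding step is assumed to have already produced one vector per verb instance, the graph is stored as a dictionary of neighbor weights instead of a NetworkX graph, and the toy vectors and the threshold value are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle between u and v); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_graph(vectors, d):
    """Return {node: {neighbor: weight}} with an edge when the distance is below d."""
    graph = {i: {} for i in range(len(vectors))}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            distance = cosine_distance(vectors[i], vectors[j])  # step 2
            if distance < d:                                    # step 4
                weight = 1.0 - distance                         # step 3: cosine similarity
                graph[i][j] = weight
                graph[j][i] = weight
    return graph


# Hypothetical 2-dimensional stand-ins for contextualized representations.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
graph = build_graph(vectors, d=0.5)
print(graph)
```

With these toy vectors only the two pairs of nearly parallel vectors are connected, so a lower threshold yields sparser graphs and finer-grained clusters.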

Starting with the representation of verb instances, the use of contextualized word representations in all of the approaches that competed in the SemEval shared task shows their importance for distinguishing different word senses, which evoke different frames. Ribeiro et al. [22] experimented with representations generated by both ELMo and BERT and achieved better results using the former. Furthermore, in their experiments, Arefyev et al. [3] noticed that BERT tends to generate representations of the different forms of the same lexeme which are distant in terms of the typically used euclidean and cosine distances. They tried to identify a distance metric that was appropriate for correlating such representations, but were unsuccessful. Thus, although it is not the current state-of-the-art approach for generating contextualized word representations, we rely on ELMo in our approach. The generated representations include a context-free representation and context information at two levels. According to the experiments performed by the authors of ELMo, the first level is typically related to the syntactic context, while the second is typically related to the semantic context. In addition to the combination of all information, we also explore the use of each level independently. This way, we are able to assess which information is actually important for the task. To generate the contextualized representation of multi-word verb instances, we use a dependency parser to identify the head word and use the corresponding representation, since it contains information from the other words.

In our approach, the contextualized representations of the verb instances are used as the nodes of a graph. To generate the edges, the first step is to calculate the pairwise distance between those representations. We use the cosine distance since it is bounded and the magnitude of word vectors is typically related to the number of occurrences. Thus, the angle between the vectors is a better indicator of similarity. Furthermore, the euclidean distance has issues in spaces with high dimensionality. Still, we performed preliminary experiments to confirm that using the cosine distance leads to better results than the euclidean distance.

Each pair of nodes in the graph is connected through an edge if the distance between them is below a certain threshold. The definition of this threshold is particularly important, since it controls the granularity of the induced frames. Having control over this granularity is important, since it allows us to induce more specific or more abstract frames, both of which are relevant in different scenarios. Furthermore, this control allows us to define the granularity on a small set of instances and then induce frames with a similar granularity on a different set. The latter was the main issue of Ribeiro et al.'s [22] approach at SemEval, whose performance on the development set did not generalize to the test set. That happened because the threshold was selected using a function of the statistics of the distribution of pairwise distances, which vary according to the contexts covered by the datasets and the number of instances. Consequently, applying the same function to the development and test sets led to the generation of frames with different granularity. We fix this issue by defining the threshold through grid search on the development set and then using the same fixed threshold across sets.

Another difference of our approach is the weighting of the edges. While Ribeiro et al. [22] attributed a weight corresponding to the distance between the nodes, we weight the edges using the cosine similarity. This is more appropriate, since the Chinese Whispers [6] algorithm that we use to identify the communities of nodes that evoke the same frame attributes more importance to edges with higher weight.
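The effect of the edge-weighting choice can be illustrated with a single weighted vote of the kind Chinese Whispers performs. The sketch below is only an illustration: the helper name, the neighborhood, and the distances are hypothetical, and it models one node choosing among its neighbors' labels rather than a full clustering run.

```python
def winning_label(neighbors, weight_of):
    """Label with the highest summed edge weight among (label, distance) neighbors."""
    totals = {}
    for label, distance in neighbors:
        totals[label] = totals.get(label, 0.0) + weight_of(distance)
    return max(totals, key=totals.get)


# Hypothetical neighborhood: a close neighbor in cluster "A" and a farther one in "B".
neighbors = [("A", 0.1), ("B", 0.4)]

by_distance = winning_label(neighbors, lambda d: d)          # distance used as weight
by_similarity = winning_label(neighbors, lambda d: 1.0 - d)  # cosine similarity used as weight

print(by_distance, by_similarity)
```

With distance weights the vote favors the farther neighbor, whereas with similarity weights it favors the closer one, which is the behavior wanted when grouping instances that evoke the same frame.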
Chinese Whispers is a simple but effective graph-clustering algorithm based on the idea that nodes that broadcast the same message to their neighbors should be aggregated. It starts by assigning each node to a different cluster. Then, in each iteration, the nodes are processed in random order and each node is assigned to the cluster with the highest sum of edge weights in its neighborhood. This process is repeated until there are no changes or the maximum number of iterations is reached. Chinese Whispers is appropriate for this task since it identifies the number of clusters on its own, is able to handle clusters of different sizes, and scales well to large graphs. Furthermore, it typically outperforms other clustering approaches on NLP tasks.
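The procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the implementation used in the experiments: the graph is assumed to be a dictionary mapping each node to its weighted neighbors, a node keeps its current label on ties (an assumption made here to keep the sketch stable), and the toy graph is hypothetical.

```python
import random


def chinese_whispers(graph, max_iterations=20, seed=None):
    """Cluster a weighted graph given as {node: {neighbor: weight}}."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}  # every node starts in its own cluster
    order = list(graph)
    for _ in range(max_iterations):
        rng.shuffle(order)  # nodes are processed in random order
        changed = False
        for node in order:
            totals = {}
            for neighbor, weight in graph[node].items():
                label = labels[neighbor]
                totals[label] = totals.get(label, 0.0) + weight
            if not totals:
                continue  # isolated nodes keep their own label
            best = max(totals, key=totals.get)
            # Assumption: switch only when strictly better than the current label.
            if totals[best] > totals.get(labels[node], 0.0):
                labels[node] = best
                changed = True
        if not changed:
            break
    clusters = {}
    for node, label in labels.items():
        clusters.setdefault(label, set()).add(node)
    return list(clusters.values())


# Hypothetical graph: two triangles joined by one weak edge between nodes 2 and 3.
graph = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 0.1},
    3: {2: 0.1, 4: 1.0, 5: 1.0},
    4: {3: 1.0, 5: 1.0},
    5: {3: 1.0, 4: 1.0},
}
clusters = chinese_whispers(graph, seed=0)
print(clusters)
```

The number of clusters is not given as input; it emerges from the label propagation, which is the property that makes the algorithm suitable when the number of frames is unknown.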

4 Experimental Setup

In this section we describe our experimental setup in terms of data, evaluation approach, and implementation details.

4.1 Dataset

In our experiments, we used the same dataset used in the context of SemEval 2019's Task 2. This dataset consists of sentences extracted from the Penn Treebank 3.0 [17] and annotated with FrameNet frames. Since we are focusing on clustering verb instances into semantic frame heads, we are not interested in the annotations of the arguments. The development set consists of 600 verb instances extracted from 588 sentences and annotated with 41 different frames. The test set consists of 4,620 verb instances extracted from 3,346 sentences and annotated with 149 different frames. Additionally, all the sentences are annotated with morphosyntactic information in the CoNLL-U format [9].

4.2 Evaluation Approach

For direct comparison with the approaches that competed in SemEval's task, we evaluate our approach using the same metrics used in the task: Purity F1, which is the harmonic mean of purity and inverse-purity [24], and BCubed F1, which is the harmonic mean of BCubed Precision and BCubed Recall [4]. While the former focuses on the quality of each cluster independently, the latter focuses on the distribution of instances of the same category across the clusters. Additionally, we report the number of induced clusters. Since the Chinese Whispers algorithm is not deterministic, the values we report for these metrics refer to the mean and standard deviation over 30 runs. Since we are approaching the problem from a network-based perspective, we also report the number of edges, the diameter, and the clustering coefficient of the network corresponding to the neighboring threshold with highest performance in each scenario. In addition to the approaches that competed in SemEval's task, we also compare the performance of our approach with a baseline that generates one cluster per verb.

4.3 Implementation Details

To obtain the contextualized representation of the verb instances we used the ELMo model provided by the AllenNLP package [13] to generate the contextualized embeddings for every sentence in the dataset and then selected the representations of the head token of each instance. The representation of each verb instance is then given by three vectors of dimensionality 1,024, corresponding to the context-free representation of the head token and the two levels of context information. We experimented both with each vector independently, as well as their combination. To combine the vectors we used their sum, since it represents the variation of the context-free representation according to the context. To apply the Chinese Whispers algorithm, we relied on Ustalov’s [29] implementation in Python, which requires the graph to be built using the NetworkX package [14]. We did not use weight regularization and performed a maximum of 20 iterations. Finally, to obtain the syntactic dependencies used to determine the head token of multi-word verbs, we used the annotations provided with the dataset, which were obtained automatically using a dependency parser.
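The two evaluation metrics described in Sect. 4.2 can be computed along the following lines. This is a sketch based on the standard definitions of purity, inverse-purity, and BCubed precision and recall; it is not the official scorer of the shared task, and the function names and the toy labels are hypothetical. Both functions take one gold frame and one predicted cluster identifier per verb instance.

```python
from collections import Counter


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative values (0 if both are 0)."""
    return 2 * a * b / (a + b) if a + b else 0.0


def purity_f1(gold, pred):
    """Harmonic mean of purity and inverse-purity."""
    n = len(gold)
    pairs = Counter(zip(pred, gold))  # (cluster, class) -> number of instances
    best_per_cluster = {}
    best_per_class = {}
    for (cluster, label), count in pairs.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
        best_per_class[label] = max(best_per_class.get(label, 0), count)
    purity = sum(best_per_cluster.values()) / n
    inverse_purity = sum(best_per_class.values()) / n
    return harmonic_mean(purity, inverse_purity)


def bcubed_f1(gold, pred):
    """Harmonic mean of BCubed precision and BCubed recall."""
    n = len(gold)
    pairs = Counter(zip(pred, gold))
    cluster_sizes = Counter(pred)
    class_sizes = Counter(gold)
    precision = sum(pairs[(c, g)] / cluster_sizes[c] for c, g in zip(pred, gold)) / n
    recall = sum(pairs[(c, g)] / class_sizes[g] for c, g in zip(pred, gold)) / n
    return harmonic_mean(precision, recall)


# Hypothetical example: five instances, two gold frames, two induced clusters.
gold = ["A", "A", "A", "B", "B"]
pred = [0, 0, 1, 1, 1]

print(purity_f1(gold, pred))
print(bcubed_f1(gold, pred))
```

Because Chinese Whispers is not deterministic, such scores would be computed once per run and then averaged, as done over the 30 runs reported in the paper.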

5 Results and Discussion

Before starting the discussion, it is important to make some remarks regarding the presentation of the results. First, although the cosine distance varies in the interval [0, 2], for readability, we only plot the results in the interval [0, 1], since for neighboring thresholds above that value the verb instances are always grouped into a single cluster. Furthermore, we do not include the value of the graph diameter in our tables, since the graph corresponding to the threshold that leads to higher performance in each scenario is never connected. Thus, the diameter is always infinite.

Fig. 1. Results on the development data using the different levels of ELMo representations. The x axis refers to the neighboring threshold used to create the edges.

Table 1. Results on the development data using the different levels of ELMo representations. d refers to the neighboring threshold. CC refers to the clustering coefficient.

                   d     Edges   CC    Clusters      Purity F1     BCubed F1
Context-Free       0.57  39,441  0.97  22.03 ± 0.31  95.57 ± 0.58  93.35 ± 0.71
Syntactic Context  0.41  27,201  0.80  30.93 ± 0.25  94.32 ± 0.16  91.65 ± 0.20
Semantic Context   0.53  20,660  0.67  24.47 ± 0.52  88.64 ± 0.33  81.92 ± 0.54

Free + Syntactic   0.47  37,913  0.94  22.97 ± 0.41  95.83 ± 0.28  93.66 ± 0.32
All                0.45  21,448  0.73  34.73 ± 0.51  93.93 ± 0.30  91.04 ± 0.59

Starting with the information provided by the multiple levels included in ELMo representations, in Fig. 1 and the first block of Table 1, we can see that, independently, the context-free representation is the most informative of the three and the most robust to changes in the threshold, with a wide interval with reduced decrease in performance around the threshold with highest performance. The initial drop in the number of clusters is due to its lack of context information, which makes all the instances of the same verb become connected as soon as the threshold is higher than zero.


The lower performance of the levels that provide context information on their own was expected, since they represent changes in the word sense of the verb according to the context, but lack information regarding the verb itself. Surprisingly, the level that typically captures the semantic context leads to worse performance than that which captures syntactic context and even harms performance in combination with the other levels. However, this can be explained by the fact that the ELMo model was trained for a specific task and, consequently, the semantic context is overfit to that task. On the other hand, the syntactic context is more generic and, since the sense of a verb can be related to the syntactic tree in which it occurs, it provides important information for the task.

Fig. 2. Results on the development data according to the weighting of the edges. The x axis refers to the neighboring threshold used to create the edges.

As shown in the second block of Table 1, the highest performance is achieved when using the combination of the context-free representation and the syntactic context. Still, the average increase in BCubed F1 in relation to using the context-free representation on its own is just 0.33 percentage points, which suggests that the context information is only able to disambiguate a small number of specific cases. However, the threshold that leads to the highest performance in the combination is lower. This means that the graph has fewer edges and, consequently, is less connected. Still, the number of clusters, around 23, is nearly half of the number of frames in the gold standard, 41, which means that the graph should be even less connected. Since the performance decreases for lower thresholds, this suggests that either the representations or the distance metric are unable to capture all the information required to group the instances in FrameNet-like frames.

Table 2. Results on the development data according to the weighting of the edges. d refers to the neighboring threshold. CC refers to the clustering coefficient.

            d     Edges   CC    Clusters      Purity F1     BCubed F1
Weighted    0.47  37,913  0.94  22.97 ± 0.41  95.83 ± 0.28  93.66 ± 0.32
Unweighted  0.46  37,415  0.93  22.90 ± 0.30  95.77 ± 0.16  93.56 ± 0.31


Regarding the weighting of the edges, the results in Table 2 show that the difference in average top performance is just 0.06 and 0.10 percentage points in terms of Purity F1 and BCubed F1, respectively. This suggests that the presence of the edges is more important for the approach than their weight. Still, in Fig. 2 we can see that using weighted edges increases the robustness of the approach to changes in the neighboring threshold.

Fig. 3. Results on the test data. The x axis refers to the neighboring threshold used to create the edges.

Table 3. Results on the test data. d refers to the neighboring threshold. CC refers to the clustering coefficient.

                d     Edges    CC    Clusters       Purity F1     BCubed F1
Dev. Threshold  0.47  347,202  0.91  196.63 ± 1.68  79.97 ± 0.21  73.07 ± 0.25
Best Threshold  0.49  364,829  0.91  186.33 ± 0.98  80.26 ± 0.17  73.43 ± 0.19

Figure 3 shows the results achieved when applying the same approach to the test data. Although the performance is lower, we can observe patterns similar to those observed on the development data. The only difference is that there is a more pronounced performance drop immediately after the threshold that leads to highest performance. Nonetheless, as shown in Table 3, the threshold selected on development data, 0.47, is lower than but very close to the best threshold on test data, 0.49. This shows that our grid-search approach to define the threshold generalizes well. Still, the average performance loss in relation to using the best threshold is 0.29 and 0.36 percentage points in terms of Purity F1 and BCubed F1, respectively. It is interesting to observe that, contrary to what happened on development data, the approach overestimates the number of clusters. However, this can be explained by the fact that the test data includes more instances of different verbs that evoke the same frame. Once again, this suggests that either the representations or the distance metric are unable to capture all the required information.
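The threshold selection discussed above can be pictured as a plain grid search on development data. The sketch below is only an assumption about how such a search might be organized: the grid step of 0.01, the helper name `select_threshold`, and the stand-in scoring function `toy_score` (which takes the place of clustering the development data and computing a metric such as BCubed F1) are all hypothetical. Averaging over 30 seeded runs mirrors the way the results are reported.

```python
from statistics import mean


def select_threshold(thresholds, score_fn, runs=30):
    """Return the threshold with the highest mean score over `runs` seeded runs."""
    best_threshold, best_score = None, float("-inf")
    for threshold in thresholds:
        score = mean(score_fn(threshold, seed) for seed in range(runs))
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


def toy_score(threshold, seed):
    """Hypothetical stand-in for clustering plus evaluation on development data."""
    return 1.0 - (threshold - 0.47) ** 2


# Assumed grid: thresholds 0.00, 0.01, ..., 1.00.
grid = [step / 100 for step in range(0, 101)]
dev_threshold, dev_score = select_threshold(grid, toy_score)
print(dev_threshold, dev_score)
```

The selected value would then be kept fixed when clustering the test data, so that no test annotations influence the granularity of the induced frames.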


Table 4. Comparison with previous approaches in terms of performance on the test data.

                               Purity F1  BCubed F1
Baseline                       73.78      65.35
Ribeiro et al. [22]            75.25      65.32
Anwar et al. [2]               76.68      68.10
Arefyev et al. [3]             78.15      70.70
Our Approach (Dev. Threshold)  79.97      73.07

Finally, Table 4 compares the results of our approach with those of the systems that competed in the SemEval shared task. First of all, it is important to note that while Ribeiro et al.'s [22] approach, on which ours is based, performed worse than the one-frame-per-verb baseline, ours surpasses it by 4.37 percentage points in terms of Purity F1 and 7.72 percentage points in terms of BCubed F1. This shows the importance of discarding the semantic context provided in the ELMo representations and, most importantly, of identifying a neighboring threshold that allows the approach to generalize. Furthermore, our approach also outperforms the more complex approach by Arefyev et al. [3] by 2.37 percentage points in terms of BCubed F1. Consequently, it achieves the current state-of-the-art performance on the task.

6 Conclusions

In this paper we have approached semantic frame induction as a community detection problem by applying the Chinese Whispers graph-clustering algorithm to a network with contextualized representations of verb instances as nodes connected by an edge if the cosine distance between them is below a threshold that defines the granularity of the induced frames. We have shown that the best performance is achieved when using verb instance representations given by the combination of the context-free and syntactic context levels of ELMo representations. The semantic context level impairs the performance since it is overfit to the task on which the model was trained. We have also observed that weighting the edges with the cosine similarity between the nodes improves the robustness to changes in the neighboring threshold.

We have performed our experiments on the benchmark dataset defined in the context of SemEval 2019's Task 2, which allows us to compare our results with those of previous approaches. In this context, the most important step is to identify the threshold that defines the correct granularity according to the gold standard annotations. We did so by performing grid search on the development data and used the same fixed threshold on the test data. This way, we solved the main issue of the approach on which ours was based, which was its lack of generalization ability. In fact, the difference between the best threshold on the development set and that which would lead to the best performance on the test set was just 0.02. Using this approach we were able to outperform the more complex approach that won the SemEval shared task by 2.37 percentage points in terms of BCubed F1. Thus, it achieves the current state-of-the-art performance on the task.

Although we were able to outperform all the previous approaches on the task, the 73.07 BCubed F1 score achieved on the test data shows that the approach is not able to capture all the information required to induce FrameNet-like frames and that there is still room for improvement. Thus, as future work, we intend to assess the cases that our approach fails to cluster to check whether a different clustering approach or additional features are required, or an adaptation of the contextualized representations is enough. Regarding the latter, it would be interesting to assess whether fine-tuning the ELMo representations to the task would make the semantic context level provide relevant information. Finally, since this approach achieves state-of-the-art performance when inducing semantic frames from verb instances, we intend to assess whether it is also appropriate to induce the semantic roles and the frame-specific slots filled by the arguments of the verbs.

Acknowledgements. This work was supported by Portuguese national funds through Fundação para a Ciência e a Tecnologia (FCT), with reference UID/CEC/50021/2019, and PT2020, project number 39703 (AppRecommender).

References

1. Aharon, R.B., Szpektor, I., Dagan, I.: Generating entailment rules from FrameNet. In: ACL, vol. 2, pp. 241–246 (2010)
2. Anwar, S., Ustalov, D., Arefyev, N., Ponzetto, S.P., Biemann, C., Panchenko, A.: HHMM at SemEval-2019 Task 2: unsupervised frame induction using contextualized word embeddings. In: SemEval, pp. 125–129 (2019)
3. Arefyev, N., Sheludko, B., Davletov, A., Kharchev, D., Nevidomsky, A., Panchenko, A.: Neural GRANNy at SemEval-2019 Task 2: a combined approach for better modeling of semantic relationships in semantic frame induction. In: SemEval, pp. 31–38 (2019)
4. Bagga, A., Baldwin, B.: Algorithms for scoring coreference chains. In: LREC, pp. 563–566 (1998)
5. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: ACL/COLING, vol. 1, pp. 86–90 (1998)
6. Biemann, C.: Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In: Workshop on Graph-based Methods for Natural Language Processing, pp. 73–80 (2006)
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
8. Boas, H.C. (ed.): Multilingual FrameNets in Computational Lexicography: Methods and Applications. Mouton de Gruyter, Berlin (2009)
9. Buchholz, S., Marsi, E.: CoNLL-X shared task on multilingual dependency parsing. In: CoNLL, pp. 149–164 (2006)
10. Das, D., Chen, D., Martins, A.F.T., Schneider, N., Smith, N.A.: Frame-semantic parsing. Comput. Linguist. 40(1), 9–56 (2014)
11. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, vol. 1, pp. 4171–4186 (2019)
12. Fillmore, C.J.: Frame semantics and the nature of language. In: Annals of the New York Academy of Sciences (Origins and Evolution of Language and Speech), vol. 280, pp. 20–32 (1976)
13. Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N.F., Peters, M., Schmitz, M., Zettlemoyer, L.S.: AllenNLP: a deep semantic natural language processing platform. CoRR abs/1803.07640 (2017). http://arxiv.org/abs/1803.07640
14. Hagberg, A., Schult, D., Swart, P.: NetworkX. GitHub (2004). https://networkx.github.io/
15. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: COLING, vol. 2, pp. 539–545 (1992)
16. Lang, J., Lapata, M.: Similarity-driven semantic role induction via graph partitioning. Comput. Linguist. 40(3), 633–670 (2014)
17. Marcus, M., Santorini, B., Marcinkiewicz, M.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 330–331 (1993)
18. Materna, J.: LDA-frames: an unsupervised approach to generating semantic frames. In: CICLing, pp. 376–387 (2012)
19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
20. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: NAACL-HLT, vol. 1, pp. 2227–2237 (2018)
21. QasemiZadeh, B., Petruck, M.R.L., Stodden, R., Kallmeyer, L., Candito, M.: SemEval-2019 Task 2: unsupervised lexical frame induction. In: SemEval, pp. 16–30 (2019)
22. Ribeiro, E., Mendonça, V., Ribeiro, R., Martins de Matos, D., Sardinha, A., Santos, A.L., Coheur, L.: L2F/INESC-ID at SemEval-2019 Task 2: unsupervised lexical semantic frame induction using contextualized word representations. In: SemEval, pp. 130–136 (2019)
23. Shen, D., Lapata, M.: Using semantic roles to improve question answering. In: EMNLP-CoNLL, pp. 12–21 (2007)
24. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining (2000)
25. Titov, I., Khoddam, E.: Unsupervised induction of semantic roles within a reconstruction-error minimization framework. In: NAACL-HLT, vol. 1, pp. 1–10 (2015)
26. Titov, I., Klementiev, A.: A Bayesian approach to unsupervised semantic role induction. In: EACL, vol. 1, pp. 12–22 (2012)
27. Ustalov, D., Panchenko, A., Biemann, C.: Watset: automatic induction of synsets from a graph of synonyms. In: ACL, vol. 1, pp. 1579–1590 (2017)
28. Ustalov, D., Panchenko, A., Kutuzov, A., Biemann, C., Ponzetto, S.P.: Unsupervised semantic frame induction using triclustering. In: ACL, vol. 2, pp. 55–62 (2018)
29. Ustalov, D., et al.: Chinese Whispers for Python. GitHub (2018). https://github.com/nlpub/chinese-whispers-python/

A New Measure of Modularity in Hypergraphs: Theoretical Insights and Implications for Effective Clustering

Tarun Kumar1,2(B), Sankaran Vaidyanathan3, Harini Ananthapadmanabhan2, Srinivasan Parthasarathy4, and Balaraman Ravindran1,2

1 Robert Bosch Centre for Data Science and AI (RBCDSAI), Chennai, India
2 Department of Computer Science and Engineering, IIT Madras, Chennai, India
3 College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, USA
4 Department of Computer Science and Engineering, The Ohio State University, Columbus, USA

Abstract. Many real-world systems consist of entities that exhibit complex group interactions rather than simple pairwise relationships; such multi-way relations are more suitably modeled using hypergraphs. In this work, we generalize the framework of modularity maximization, commonly used for community detection on graphs, for the hypergraph clustering problem. We introduce a hypergraph null model that can be shown to correspond exactly to the configuration model for undirected graphs. We then derive an adjacency matrix reduction that preserves the hypergraph node degree sequence, for use with this null model. The resultant modularity function can be maximized using the Louvain method, a popular fast algorithm known to work well in practice for graphs. We additionally propose an iterative refinement over this clustering that exploits higher-order information within the hypergraph, seeking to encourage balanced hyperedge cuts. We demonstrate the efficacy of our methods on several real-world datasets.

T. Kumar and S. Vaidyanathan: Equal contribution.
S. Vaidyanathan: Work done while the author was at IIT Madras.
H. Ananthapadmanabhan: Currently at Google, Bangalore, India.
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 286–297, 2020. https://doi.org/10.1007/978-3-030-36687-2_24

1 Introduction

While most approaches for learning clusters on graphs assume pairwise (or dyadic) relationships between entities, many entities in real-world network systems engage in more complex, multi-way (super-dyadic) relations. Hypergraphs provide a natural representation for such super-dyadic relations; for example, in a co-citation network, a hyperedge could represent a group of co-cited papers. Indeed, learning on hypergraphs has been gaining recent traction [7,21,25,26]. Analogous to the graph clustering task, hypergraph clustering seeks to find dense connected components within a hypergraph [19]. This has been applied to varied problems such as VLSI placement [11], image segmentation [13], and modeling eco-biological systems [6], among others. A few previous works on hypergraph clustering [1,14,15,20,23] have limited their focus to k-uniform hypergraphs, where all hyperedges have the same fixed size. [27] extends the Spectral Clustering framework for general hypergraphs by proposing a suitable hypergraph Laplacian, which implicitly defines a reduction of the hypergraph to a graph [2,16].

Modularity maximization [17] is an alternative methodology for clustering on graphs, which additionally provides a useful metric for measuring cluster quality in the modularity function and returns the number of clusters automatically. In practice, a fast and scalable greedy optimization algorithm known as the Louvain method [3] is commonly used. However, extending the modularity function to hypergraphs is not straightforward. One approach would be to reduce a hypergraph to a simple graph using a clique expansion and then employ a standard modularity-based solution. Such an approach would lose critical information encoded within the super-dyadic hyperedge structure. A clique expansion would also not preserve the hypergraph's node degrees, which are required for the null model that modularity maximization methods are based on. Encoding the hyperedge-centric information present within the hypergraph is key to the development of an appropriate modularity-based framework for clustering.

Additionally, when viewing the clustering problem via a minimization function (analogous to minimizing the cut), there are multiple ways to cut a hyperedge. Based on the proportion and assignments of nodes on different sides of the cut, the clustering will change. One way of incorporating information based on properties of hyperedges or their vertices is to introduce hyperedge weights based on a metric or function of the data.
Building on this idea, we make the following contributions in this work:

– We define a null model on hypergraphs, and prove its equivalence to the configuration model [18] for undirected graphs. We derive a node-degree preserving graph reduction to satisfy this null model. Subsequently, we define a modularity function using the above which can be maximized using the Louvain method.
– We propose an iterative hyperedge reweighting procedure that leverages information from the hypergraph structure and the balance of hyperedge cuts.
– We empirically evaluate the resultant algorithm, titled Iteratively Reweighted Modularity Maximization (IRMM), on several real-world datasets and demonstrate its efficacy and efficiency over competitive baselines.

2 Background

2.1 Hypergraphs

Let G = (V, E, w) be a hypergraph, with vertex set V and hyperedge set E. Each hyperedge e can be associated with a positive weight w(e). The degree of a vertex v is denoted as $d(v) = \sum_{e \in E, v \in e} w(e)$. The degree of a hyperedge e is the number of nodes it contains, denoted by $\delta(e) = |e|$. We denote the number of vertices as n = |V| and the number of edges as m = |E|. The incidence matrix H is given by h(v, e) = 1 if vertex v is in hyperedge e, and 0 otherwise. W is the hyperedge weight matrix and $D_e$ is the edge degree matrix; these are diagonal matrices of size m × m. $D_v$ is the vertex degree matrix, of size n × n.

Clique Reduction: For any hypergraph, one can compute its clique reduction [9] by replacing each hyperedge by a clique formed from its node set. The adjacency matrix for the clique reduction of a hypergraph with incidence matrix H is $A = H W H^T$. $D_v$ may be subtracted from this matrix to remove self-loops.

2.2 Modularity

Modularity [17] is a metric of clustering quality that measures whether the number of within-cluster edges is greater than its expected value. This is defined as:

$$Q = \frac{1}{2m} \sum_{ij} [A_{ij} - P_{ij}]\, \delta(g_i, g_j) \tag{1}$$

where $P_{ij}$ denotes the expected number of edges between nodes i and j. The configuration model [18] used for graphs produces random graphs with a fixed degree sequence, by drawing random edges such that the node degrees are preserved. For two nodes i and j, with degrees $k_i$ and $k_j$ respectively, we have:

$$P_{ij} = \frac{k_i k_j}{\sum_{j \in V} k_j}$$

3 Hypergraph Modularity

We propose a simple but novel node-degree preserving null model for hypergraphs. Analogous to the configuration model for graphs, the sampling probability for a node is proportional to the number (or in the weighted case, the total weight) of hyperedges it participates in. Specifically, we have:

$$P^{hyp}_{ij} = \frac{d(i) \times d(j)}{\sum_{v \in V} d(v)} \tag{2}$$

The above null model preserves the node degree sequence of the original hypergraph. When using this null model to define modularity, we get the expected number of hyperedges that two nodes i and j participate in. However, while taking the clique reduction, the degree of a node in the corresponding graph is not the same as its degree in the original hypergraph, as verified below.

Lemma 1. For the clique reduction of a hypergraph with incidence matrix H, the degree of a node i in the reduced graph is given by:

$$k_i = \sum_{e \in E} H(i, e)\, w(e)\, (\delta(e) - 1)$$

Proof. The adjacency matrix of the reduced graph is given by $A^{clique} = H W H^T$, with entries

$$(H W H^T)_{ij} = \sum_{e \in E} H(i, e)\, w(e)\, H(j, e)$$

Note that we do not have to consider self-loops, since they are not cut during the modularity maximization process. This is done by explicitly setting $A_{ii} = 0$ for all i. We can write the degree of a node i in the reduced graph as:

$$k_i = \sum_{j} A_{ij} = \sum_{j: j \neq i} \sum_{e \in E} H(i, e)\, w(e)\, H(j, e) = \sum_{e \in E} H(i, e)\, w(e) \sum_{j: j \neq i} H(j, e) = \sum_{e \in E} H(i, e)\, w(e)\, (\delta(e) - 1)$$

As shown above, the node degree is overcounted by a factor of $(\delta(e) - 1)$ for each hyperedge e. We can hence correct it by scaling down each w(e) by a factor of $(\delta(e) - 1)$. This leads to the following corrected adjacency matrix:

$$A^{hyp} = H W (D_e - I)^{-1} H^T \tag{3}$$
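Lemma 1 can be checked numerically on a small example. The sketch below is illustrative only: the toy hypergraph, its weights, and the helper names `clique_reduction` and `lemma1_degrees` are assumptions of this example, and the diagonal is kept at zero as in the proof.

```python
from itertools import combinations


def clique_reduction(n, hyperedges, weights):
    """Adjacency matrix of the clique reduction (H W H^T with a zero diagonal)."""
    adj = [[0.0] * n for _ in range(n)]
    for nodes, w in zip(hyperedges, weights):
        for i, j in combinations(sorted(nodes), 2):
            adj[i][j] += w
            adj[j][i] += w
    return adj


def lemma1_degrees(n, hyperedges, weights):
    """Right-hand side of Lemma 1: sum of w(e) * (delta(e) - 1) over e containing i."""
    deg = [0.0] * n
    for nodes, w in zip(hyperedges, weights):
        for i in nodes:
            deg[i] += w * (len(nodes) - 1)
    return deg


# Toy hypergraph (hypothetical): one hyperedge of size 4 and one of size 2.
hyperedges = [{0, 1, 2, 3}, {2, 3}]
weights = [2.0, 1.0]

adj = clique_reduction(4, hyperedges, weights)
reduced_degrees = [sum(row) for row in adj]
hypergraph_degrees = [
    sum(w for nodes, w in zip(hyperedges, weights) if v in nodes) for v in range(4)
]
print(reduced_degrees, lemma1_degrees(4, hyperedges, weights), hypergraph_degrees)
```

In this toy case the degrees in the clique reduction are larger than the hypergraph degrees d(v), which is the mismatch that the scaling by δ(e) − 1 is meant to remove.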

Proposition 1. For the reduction of a hypergraph given by the adjacency matrix $A = H W (D_e - I)^{-1} H^T$, the degree of a node i in the reduced graph (denoted $k_i$) is equal to its hypergraph node degree d(i).

Proof. We have,

$$(H W (D_e - I)^{-1} H^T)_{ij} = \sum_{e \in E} \frac{H(i, e)\, w(e)\, H(j, e)}{\delta(e) - 1}$$

Note again that we do not have to consider self-loops, since they are not cut during the modularity maximization process (explicitly setting $A_{ii} = 0$ for all i). We can rewrite the degree of a node i in the reduced graph as

$$k_i = \sum_{j} A_{ij} = \sum_{e \in E} \frac{H(i, e)\, w(e)}{\delta(e) - 1} \sum_{j: j \neq i} H(j, e) = \sum_{e \in E} H(i, e)\, w(e) = d(i)$$

We can use this node-degree preserving reduction, with the diagonals zeroed out, to implement the null model from Eq. 2. As in Eq. 1, we can obtain an expression for the hypergraph modularity, which can be maximized using a Louvain-style algorithm.

$$Q^{hyp} = \frac{1}{2m} \sum_{ij} [A^{hyp}_{ij} - P^{hyp}_{ij}]\, \delta(g_i, g_j) \tag{4}$$


As with any weighted graph, the range of this function is [−1, 1]. We would get Qhyp = −1 when no pair of nodes in a hyperedge belong to the same cluster, and Qhyp = 1 when any two nodes that are part of the same hyperedge are always part of the same cluster. Qhyp = 0 when, for any pair of nodes i and j, the number of hyperedges that contain both i and j is equal to the number of randomly wired hyperedges containing both i and j, given by the null model.
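To make Eq. 3 and Eq. 4 concrete, the following sketch builds the degree-preserving reduction for a toy hypergraph and evaluates the hypergraph modularity of two clusterings. It is a minimal illustration under stated assumptions: the normalizer 2m is taken to be the total degree of the reduced graph (which equals the sum of the hypergraph degrees), singleton hyperedges are skipped because δ(e) − 1 would be zero, and all function names and the example data are hypothetical.

```python
from itertools import combinations
import math


def hypergraph_degrees(n, hyperedges, weights):
    """d(v): total weight of the hyperedges that contain v."""
    deg = [0.0] * n
    for nodes, w in zip(hyperedges, weights):
        for v in nodes:
            deg[v] += w
    return deg


def reduced_adjacency(n, hyperedges, weights):
    """Entries of H W (D_e - I)^{-1} H^T with the diagonal set to zero."""
    adj = [[0.0] * n for _ in range(n)]
    for nodes, w in zip(hyperedges, weights):
        if len(nodes) < 2:
            continue  # assumption: singleton hyperedges are ignored
        share = w / (len(nodes) - 1)
        for i, j in combinations(sorted(nodes), 2):
            adj[i][j] += share
            adj[j][i] += share
    return adj


def hypergraph_modularity(n, hyperedges, weights, labels):
    """Eq. 4, using the null model of Eq. 2."""
    deg = hypergraph_degrees(n, hyperedges, weights)
    adj = reduced_adjacency(n, hyperedges, weights)
    total = sum(deg)  # assumption: plays the role of 2m
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - deg[i] * deg[j] / total
    return q / total


hyperedges = [{0, 1, 2}, {2, 3}, {3, 4, 5}]
weights = [1.0, 1.0, 1.0]
adj = reduced_adjacency(6, hyperedges, weights)
deg = hypergraph_degrees(6, hyperedges, weights)
q_split = hypergraph_modularity(6, hyperedges, weights, [0, 0, 0, 1, 1, 1])
q_single = hypergraph_modularity(6, hyperedges, weights, [0] * 6)
print(q_split, q_single)
```

The row sums of `adj` reproduce the hypergraph degrees, as Proposition 1 states, and the clustering that keeps the two size-3 hyperedges intact scores higher than placing every node in a single cluster.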

4 Iterative Hyperedge Reweighting

When improving clustering, we look at minimizing the number of between-cluster edges that get cut, which for a hypergraph is given by the total volume of the hyperedge cut. We first consider the two-cluster case as in [27], where the set V is partitioned into two clusters, $S$ and $S^c$. For a given hyperedge e, the volume of the cut is proportional to $|e \cap S||e \cap S^c|$, which gives the number of cut sub-edges when the hyperedge is reduced to a clique. This is minimized when all vertices of e go into one partition. A min-cut algorithm would hence favour cuts that are as unbalanced as possible [10].

For a given hyperedge, if there were a larger portion of its vertices in one cluster and a smaller portion in the other, it is likely that the smaller group of vertices are actually similar to the rest and should be pulled into the larger cluster. Similarly, the vertices in a hyperedge that is cut equally between clusters are equally likely to lie in either of the clusters. We would hence want unbalanced hyperedges to be retained in clusters and the more balanced hyperedges to be cut. This can be done by increasing the weights of hyperedges that get unbalanced cuts, and decreasing the weights of hyperedges that get more balanced cuts. Considering the case where a hyperedge is partitioned into two, for a cut hyperedge with $k_1$ and $k_2$ nodes in each partition ($k_1, k_2 \neq 0$), we have the following equation that operationalizes the aforementioned scheme:

$$t = \left(\frac{1}{k_1} + \frac{1}{k_2}\right) \times \delta(e) \tag{5}$$

For two partitions, $\delta(e) = k_1 + k_2$. In Fig. 4, note that t is minimized when $k_1 = k_2 = \delta(e)/2$, which gives t = 4. We can then generalize Eq. 5 to c partitions as follows:

$$w'(e) = \frac{1}{m} \sum_{i=1}^{c} \frac{1}{k_i + 1} [\delta(e) + c] \tag{6}$$

Here, both the +1 and +c terms are added for smoothing, to account for cases where any of the $k_i$'s are zero. We divide by m to normalize the weights (Fig. 1). Let $w_t(e)$ be the weight of hyperedge e in the t-th iteration, and $w'(e)$ be the weight computed at a given iteration; then the weight update rule can be written as:

$$w_{t+1}(e) = \alpha w_t(e) + (1 - \alpha) w'(e) \tag{7}$$
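The three reweighting formulas can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, `partition_sizes` stands for the counts $k_1, \dots, k_c$ of a hyperedge's nodes in each cluster, and the default α = 0.5 simply mirrors the value chosen in the experimental settings.

```python
import math


def two_way_score(k1, k2):
    """Eq. 5: t = (1/k1 + 1/k2) * delta(e), with delta(e) = k1 + k2 and k1, k2 != 0."""
    return (1.0 / k1 + 1.0 / k2) * (k1 + k2)


def reweight(partition_sizes, m):
    """Eq. 6: smoothed weight w'(e) for a hyperedge split into c parts of sizes k_i."""
    c = len(partition_sizes)
    delta = sum(partition_sizes)
    return sum(1.0 / (k + 1) for k in partition_sizes) * (delta + c) / m


def update_weight(w_t, w_new, alpha=0.5):
    """Eq. 7: moving average of the previous weight and the newly computed one."""
    return alpha * w_t + (1.0 - alpha) * w_new


# A hyperedge with 20 nodes, cut 2:18 versus 10:10.
unbalanced = two_way_score(2, 18)
balanced = two_way_score(10, 10)
print(unbalanced, balanced)
```

The unbalanced 2:18 cut receives a larger value than the balanced 10:10 cut, so after the update of Eq. 7 a hyperedge that is cut unevenly becomes heavier and is more likely to be kept inside a single cluster in the next round of modularity maximization.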

Fig. 1. Reweighting for different hyperedge cuts. Panels: t = (1/2 + 1/18) × 20 = 11.111 for a 2:18 cut; t = (1/10 + 1/10) × 20 = 4 for a 10:10 cut.

4.1 A Simple Example

The following example illustrates the effect of a single iteration of hyperedge reweighting. As seen in Fig. 2, the initial clustering of this hypergraph resulted in two highly unbalanced cuts. Cut 1 and Cut 2 each split hyperedge h2 in a 1:4 ratio, and also each split hyperedge h3 in a 1:2 ratio respectively. Cut 3 splits hyperedge h1 in a 2:3 ratio. After reweighting, the 1:4 splits are removed. This reduces the number of cuts from 3 to just 1, leaving two neat clusters. The single nodes in h1 and h3, initially assigned to another cluster, have been pulled back into their respective (larger) clusters. This captures the intended behaviour of the reweighting scheme as described earlier.

Fig. 2. Effect of iterative reweighting. (a) Before reweighting. (b) After reweighting.

5 Evaluation on Ground Truth

We use the average F1 measure [24] and Rand Index scores to evaluate clustering performance on real-world data with ground truth class labels. Average F1 scores are obtained by computing the F1 score of the best matching ground-truth class to the observed cluster, and the F1 score of the best matching observed cluster to the ground truth class, then averaging the two scores. The proposed methods are shown in the results table as Hypergraph-Louvain and IRMM.

Number of Clusters Returned: While implementing the modularity function defined in Eq. 4, we use the Louvain method to find clusters by maximizing the hypergraph modularity. By default, the algorithm determines the number of clusters automatically. To return a fixed number of clusters c, we used hierarchical clustering with the average linkage criterion as a post-processing step.

Settings for IRMM: We tuned hyperparameter α over the set of values 0.1, 0.2, ..., 0.9. We found no considerable difference in the resultant F1 scores, and only minimal difference in the rate of convergence, over a wide range of values. As α is a scalar coefficient in a moving average, it did not result in a large difference in resultant weight values when set in this useful range. Hence, for the experiments, we chose to leave it at α = 0.5. For the iterations, we set the stopping threshold for the weights at 0.01.

5.1 Compared Methods

Clique Reductions: We took the clique reduction of the hypergraph and ran the graph versions of Spectral Clustering and the Louvain method. These are referred to as Clique-Spectral and Clique-Louvain respectively.

Hypergraph Spectral Clustering: Here, the top c eigenvectors of the Hypergraph Laplacian (as defined in [27]) were found, and then clustered using bisecting k-means. We refer to this method as Hypergraph-Spectral.

hMETIS [12] and PaToH [5]: These are hypergraph partitioning algorithms that are commonly used. We used the codes provided by the respective authors.

5.2 Datasets

For all datasets, we took the single largest connected component of the hypergraph. In each case, the class labels were taken as ground truth clusters. More statistics on the datasets are given in Table 1.

Table 1. Dataset description

Dataset          # nodes  # hyperedges  Avg. hyperedge degree  Avg. node degree  # classes
TwitterFootball  234      3587          15.491                 237.474           20
Cora             2708     2222          3.443                  2.825             7
Citeseer         3264     3702          27.988                 31.745            6
MovieLens        3893     4677          79.875                 95.961            2
Arnetminer       21375    38446         4.686                  8.429             10


MovieLens [4]: We used the director relation to define hyperedges and build a co-director hypergraph, where the nodes represent movies. A group of nodes are connected by a hyperedge if they were directed by the same individual.

Cora and Citeseer: In these datasets, nodes represent papers. The nodes are connected by a hyperedge if they share the same set of words [22].

TwitterFootball: This is taken from one view of the Twitter dataset [8]. It represents members of 20 different football clubs of the English Premier League. Here, the nodes are the players, and hyperedges are formed based on whether they are co-listed.

Arnetminer: In this significantly larger co-citation network, the nodes represent papers and hyperedges are formed between co-cited papers. We used the nodes from the CS discipline, and its 10 sub-disciplines were treated as clusters.

5.3 Experiments

We compare the average F1 and Rand Index (RI) scores for the different datasets on all the given methods. For Louvain methods, the number of clusters was returned by the algorithm in an unsupervised manner and the same number was used for spectral methods. The results are given in Table 2. We also ran the same experiments with the number of clusters set to the number of ground truth classes, using the postprocessing methodology described earlier (Table 3).

5.4 Results

In both experiment settings, IRMM shows the best average F1 scores on all datasets, and the best Rand Index scores on all but one dataset. Additionally, both hypergraph modularity maximization methods show competitive performance with respect to the baselines. Figure 3 shows the results for varying number of clusters. Clique-Louvain sometimes returned a lower number of clusters than IRMM, and hence these curves are shorter in some of the plots. On TwitterFootball, IRMM returns fewer clusters than the number of ground truth classes, and hence the corresponding entry in Table 3 is left blank. On some datasets, the best performance is achieved when the number of clusters returned by the Louvain method is used (e.g., Citeseer, Cora), and on others when the ground truth number of classes is used (e.g., Arnetminer, MovieLens). This could be based on the structure of clusters in the network and its relationship to the ground truth classes. A class could have comprised multiple smaller clusters, which were detected by the Louvain algorithm. In other cases, the cluster structure could have corresponded better to the class labels. It is evident that hypergraph-based methods outperform the respective clique reduction methods on all datasets and in both experiment settings. We infer that super-dyadic relational information captured by the hypergraph has a positive impact on the clustering performance.
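For readers who want to reproduce scores of this kind, the two evaluation measures can be sketched as follows. This is an illustrative reading of the definitions rather than the evaluation code used for the tables: the average F1 is assumed to average the best-match F1 over all ground-truth classes and over all observed clusters before taking the mean of the two directions, and the helper names and toy labels are hypothetical.

```python
from itertools import combinations


def f1(a, b):
    """F1 overlap between two node sets."""
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


def groups(labels):
    """Turn a label list into a list of node sets, one per label."""
    out = {}
    for node, label in enumerate(labels):
        out.setdefault(label, set()).add(node)
    return list(out.values())


def average_f1(true_labels, pred_labels):
    """Mean of best-match F1 from classes to clusters and from clusters to classes."""
    classes, clusters = groups(true_labels), groups(pred_labels)
    class_side = sum(max(f1(c, k) for k in clusters) for c in classes) / len(classes)
    cluster_side = sum(max(f1(k, c) for c in classes) for k in clusters) / len(clusters)
    return 0.5 * (class_side + cluster_side)


def rand_index(true_labels, pred_labels):
    """Fraction of node pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


truth = [0, 0, 1, 1]
print(average_f1(truth, [5, 5, 7, 7]), rand_index(truth, [0, 1, 0, 1]))
```

Both measures depend only on how nodes are grouped, not on the label values themselves, so a clustering that matches the classes up to renaming scores 1 on each.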


Table 2. Average F1 and Rand Index (RI) scores; number of clusters returned by the Louvain method

| Method | Citeseer F1 | Citeseer RI | Cora F1 | Cora RI | MovieLens F1 | MovieLens RI | TwitterFootball F1 | TwitterFootball RI | Arnetminer F1 | Arnetminer RI |
|---|---|---|---|---|---|---|---|---|---|---|
| hMETIS | 0.1087 | 0.6504 | 0.1075 | 0.7592 | 0.1291 | 0.4970 | 0.3197 | 0.7639 | 0.0871 | 0.0416 |
| PaToH | 0.0532 | 0.6612 | 0.1171 | 0.6919 | 0.1104 | 0.4987 | 0.1132 | 0.7553 | 0.0729 | 0.0052 |
| Clique Spectral | 0.1852 | 0.7164 | 0.1291 | 0.2478 | 0.1097 | 0.4806 | 0.4496 | 0.7486 | 0.0629 | 0.0610 |
| Hypergraph Spectral | 0.2774 | 0.8210 | 0.2517 | 0.5743 | 0.118 | 0.4977 | 0.5055 | 0.9016 | 0.0938 | 0.0628 |
| Clique Louvain | 0.1479 | 0.7361 | 0.2725 | 0.7096 | 0.1392 | 0.4898 | 0.2238 | 0.6337 | 0.1378 | 0.0384 |
| Hypergraph Louvain | 0.2782 | 0.7899 | 0.3248 | 0.8238 | 0.1447 | 0.4988 | 0.5461 | 0.9056 | 0.1730 | 0.0821 |
| IRMM | 0.4019 | 0.7986 | 0.3709 | 0.8646 | 0.1963 | 0.5091 | 0.5924 | 0.9448 | 0.1768 | 0.0967 |

Table 3. Average F1 and Rand Index (RI) scores; number of clusters set to the number of ground truth classes

| Method | Citeseer F1 | Citeseer RI | Cora F1 | Cora RI | MovieLens F1 | MovieLens RI | TwitterFootball F1 | TwitterFootball RI | Arnetminer F1 | Arnetminer RI |
|---|---|---|---|---|---|---|---|---|---|---|
| hMETIS | 0.1451 | 0.6891 | 0.2611 | 0.7853 | 0.4445 | 0.5028 | 0.3702 | 0.7697 | 0.3267 | 0.3116 |
| PaToH | 0.0710 | 0.7312 | 0.1799 | 0.7208 | 0.3239 | 0.4984 | 0.1036 | 0.7618 | 0.2756 | 0.182 |
| Clique Spectral | 0.2917 | 0.7369 | 0.2305 | 0.3117 | 0.2824 | 0.4812 | 0.4345 | 0.7765 | 0.387 | 0.3762 |
| Hypergraph Spectral | 0.3614 | 0.8267 | 0.2672 | 0.5845 | 0.3057 | 0.5006 | 0.5377 | 0.9112 | 0.4263 | 0.3851 |
| Clique Louvain | 0.1479 | 0.7361 | 0.2725 | 0.7096 | 0.2874 | 0.4982 | 0.2238 | 0.6337 | 0.4587 | 0.4198 |
| Hypergraph Louvain | 0.3491 | 0.8197 | 0.3314 | 0.8441 | 0.3411 | 0.5119 | 0.5461 | 0.9056 | 0.4948 | 0.5359 |
| IRMM | 0.4410 | 0.8245 | 0.3966 | 0.889 | 0.4445 | 0.5347 | - | - | 0.5299 | 0.5506 |

5.5

Effect of Reweighting on Hyperedge Cuts

The plots in Fig. 4 illustrate the effect of hyperedge reweighting over iterations. We computed the relative size of the largest partition of each hyperedge and binned the values in intervals of width 0.1. The plots show the fraction of hyperedges that fall in each bin at each iteration.

$$\text{relative size}(e) = \max_i \frac{\text{number of nodes of } e \text{ in cluster } i}{\text{number of nodes in the hyperedge } e}$$

We refer to a hyperedge as fragmented if the relative size of its largest partition is lower than a threshold (here set at 0.3), and as dominated if the relative size of its largest partition is higher than the threshold. Fragmented edges are likely to be balanced, since the largest cluster size is low. On the smaller TwitterFootball dataset, which has a greater number of ground truth classes, we see that the number of dominated edges decreases and the number of fragmented edges increases. This is as expected; the increase in fragmented edges is likely to correspond to more balanced cuts. A similar trend is reflected in the larger Cora dataset.
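The relative-size statistic and its binning can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `relative_size`, `bin_fractions`, `is_fragmented` and the `cluster_of` mapping are hypothetical and not taken from the paper, and the bins are taken as half-open intervals [0, 0.1), [0.1, 0.2), and so on.

```python
from collections import Counter

def relative_size(hyperedge, cluster_of):
    """Share of the hyperedge's nodes that fall in its largest cluster."""
    counts = Counter(cluster_of[v] for v in hyperedge)
    return max(counts.values()) / len(hyperedge)

def is_fragmented(hyperedge, cluster_of, threshold=0.3):
    """Fragmented if the largest partition is below the threshold (0.3 in the text)."""
    return relative_size(hyperedge, cluster_of) < threshold

def bin_fractions(hyperedges, cluster_of, n_bins=10):
    """Fraction of hyperedges per relative-size bin of width 1/n_bins."""
    hist = [0] * n_bins
    for e in hyperedges:
        largest = max(Counter(cluster_of[v] for v in e).values())
        # Integer arithmetic avoids floating-point trouble at bin boundaries.
        idx = min(largest * n_bins // len(e), n_bins - 1)
        hist[idx] += 1
    return [h / len(hyperedges) for h in hist]

# Toy example (hypothetical data): 7 nodes in 4 clusters, 3 hyperedges.
cluster_of = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3, 5: 3, 6: 3}
hyperedges = [[0, 1, 2, 3], [4, 5, 6], [0, 4, 2]]
fractions = bin_fractions(hyperedges, cluster_of)
```

Tracking `fractions` after each reweighting iteration would give one curve per bin, which is the kind of plot Fig. 4 describes.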

Fig. 3. Symmetric F1 scores for varying number of clusters. Panels: (a) Citeseer, (b) Cora, (c) Arnetminer.

Fig. 4. Reweighting for different hyperedge cuts. Panels: (a) TwitterFootball, (b) Cora.

6

Conclusion

In this work, we have considered the problem of modularity maximization on hypergraphs. In presenting a modularity function for hypergraphs, we derived a node degree preserving graph reduction and a hypergraph null model. To refine the clustering further, we proposed a hyperedge reweighting procedure that balances the cuts induced by the clustering method. Empirical evaluations on real-world data illustrated the performance of our resultant method, entitled Iteratively Reweighted Modularity Maximization (IRMM). We leave the exploration of additional constraints and hyperedge-centric information in the clustering framework for future work.

Acknowledgements. This work was partially supported by Intel research grant RB/18-19/CSE/002/INTI/BRAV to BR.

References

1. Agarwal, S., Lim, J., Zelnik-Manor, L., Perona, P., Kriegman, D., Belongie, S.: Beyond pairwise clustering. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 838–845, June 2005
2. Agarwal, S., Branson, K., Belongie, S.: Higher order learning with graphs. In: ICML 2006: Proceedings of the 23rd International Conference on Machine Learning, pp. 17–24 (2006)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), P10008 (2008)
4. Cantador, I., Brusilovsky, P., Kuflik, T.: 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In: Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011. ACM, New York (2011)
5. Çatalyürek, Ü., Aykanat, C.: PaToH (partitioning tool for hypergraphs), pp. 1479–1487. Springer, Boston (2011). https://doi.org/10.1007/978-0-387-09766-4_93
6. Estrada, E., Rodriguez-Velazquez, J.A.: Complex networks as hypergraphs. arXiv preprint physics/0505137 (2005)
7. Feng, F., He, X., Liu, Y., Nie, L., Chua, T.S.: Learning on partial-order hypergraphs. In: Proceedings of the 2018 World Wide Web Conference, WWW 2018, pp. 1523–1532. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2018). https://doi.org/10.1145/3178876.3186064
8. Greene, D., Sheridan, G., Smyth, B., Cunningham, P.: Aggregating content and network information to curate Twitter user lists. In: Proceedings of the 4th ACM RecSys Workshop on Recommender Systems and the Social Web, RSWeb 2012, pp. 29–36. ACM, New York (2012). https://doi.org/10.1145/2365934.2365941
9. Hadley, S.W., Mark, B.L., Vannelli, A.: An efficient eigenvector approach for finding netlist partitions. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 11(7), 885–892 (1992)
10. Hein, M., Setzer, S., Jost, L., Rangapuram, S.S.: The total variation on hypergraphs - learning on hypergraphs revisited. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS 2013, vol. 2, pp. 2427–2435. Curran Associates Inc., USA (2013). http://dl.acm.org/citation.cfm?id=2999792.2999883
11. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998). https://doi.org/10.1137/S1064827595287997
12. Karypis, G., Kumar, V.: Multilevel k-way hypergraph partitioning. VLSI Design 11(3), 285–300 (2000)
13. Kim, S., Nowozin, S., Kohli, P., Yoo, C.D.: Higher-order correlation clustering for image segmentation. In: Advances in Neural Information Processing Systems, pp. 1530–1538 (2011)
14. Leordeanu, M., Sminchisescu, C.: Efficient hypergraph clustering. In: Proceedings of the 15th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 22, pp. 676–684. PMLR (2012). http://proceedings.mlr.press/v22/leordeanu12.html
15. Liu, H., Latecki, L.J., Yan, S.: Robust clustering as ensembles of affinity relations. In: Advances in Neural Information Processing Systems (2010)
16. Louis, A.: Hypergraph Markov operators, eigenvalues and approximation algorithms. In: Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC 2015, pp. 713–722. ACM, New York (2015)
17. Newman, M.E.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)
18. Newman, M.E.: Networks: An Introduction. Oxford University Press Inc., New York (2010)
19. Papa, D.A., Markov, I.L.: Hypergraph partitioning and clustering. In: Approximation Algorithms and Metaheuristics. Citeseer (2007)
20. Rota Bulò, S., Pelillo, M.: A game-theoretic approach to hypergraph clustering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1312–1327 (2013)
21. Saito, S., Mandic, D., Suzuki, H.: Hypergraph p-Laplacian: a differential geometry view. In: AAAI Conference on Artificial Intelligence (2018)
22. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Mag. 29(3), 93 (2008)
23. Shashua, A., Zass, R., Hazan, T.: Multi-way clustering using super-symmetric nonnegative tensor factorization. In: Proceedings of the 9th European Conference on Computer Vision, ECCV 2006, vol. IV, pp. 595–608. Springer, Heidelberg (2006). https://doi.org/10.1007/11744085_46
24. Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. In: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, MDS 2012, pp. 3:1–3:8. ACM, New York (2012). https://doi.org/10.1145/2350190.2350193
25. Zhang, M., Cui, Z., Jiang, S., Chen, Y.: Beyond link prediction: predicting hyperlinks in adjacency space. In: AAAI Conference on Artificial Intelligence (2018)
26. Zhao, X., Wang, N., Shi, H., Wan, H., Huang, J., Gao, Y.: Hypergraph learning with cost interval optimization. In: AAAI Conference on Artificial Intelligence (2018)
27. Zhou, D., Huang, J., Schölkopf, B.: Learning with hypergraphs: clustering, classification, and embedding. In: Advances in Neural Information Processing Systems, pp. 1601–1608 (2007)

Diffusion and Epidemics

Crying “Wolf” in a Network Structure: The Influence of Node-Generated Signals

Tomer Tuchner (1) and Gail Gilboa-Freedman (2)

(1) Efi Arazi School of Computer Science, IDC Herzliya, P.O. Box 167, 4610101 Herzliya, Israel, [email protected]
(2) Adelson School of Entrepreneurship, IDC Herzliya, P.O. Box 167, 4610101 Herzliya, Israel, [email protected]

Abstract. Research into rumor spreading in a social network has largely assumed that information may only originate from an external content provider. In today’s age, individual nodes may also be content providers. The propagation of a signal generated by a node in the network may contribute to or diminish the efforts of information diffusion, as signals become an imprecise indication of a node’s knowledge. We present a model that allows for incorporating node-generated information into the well-studied area of modeling rumor spread in a network. We capture this by a stochastic information transmission mechanism at each node, with a positive probability of spreading the rumor without holding its value. Simulations are performed using synthetic Watts-Strogatz networks, along with a real-world Facebook sample graph. Using decision trees as a descriptive tool, we examine the effects of the rate at which internal non-informed nodes generate information on the properties of the rumor spread process. As our main results we show that increasing the rate of information generated by non-informed nodes may have a monotonic or non-monotonic influence on the rumor spread time, depending on whether the network is sparse or not. We also identify that a strategy of increasing external communication in order to gain a higher pureness level tends to be effective only for a medium range of this generation rate and only in sparse networks.

Keywords: Rumor spread · Advertising · Word of mouth · Social networks · Decision trees · Predictive models

1

Introduction

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 301–312, 2020. https://doi.org/10.1007/978-3-030-36687-2_25

With the spread of new social media technologies, which guide more and more of our access to news [9], our responses to everything from natural disasters [10,22] to terrorist attacks [28] are increasingly disrupted by the spread of unreliable rumors online. The source of these unreliable rumors is often internal, with claims potentially generated by a single non-informed user (for example via a Twitter post) and presented as news. Recipients of this information may then knowingly or unknowingly spread signals that are false.

To examine rumor spread behavior in a network, and how it is affected by unreliable information, let us recall the fable of the boy who cried “wolf”. The boy amuses himself by crying “wolf” to see the panic he causes in the community, but consequently fails to get assistance when a real threat appears. The current study extends the “wolf” story and investigates the influence of “wolf” cries in a network structure, which are equivalent to generating information that does not originate from an external source of information. This information flow results in suspicion towards received data.

Unlike existing models that describe the transmission of unreliable information [18], our model assumes not only that nodes transfer unreliable information, but also that information may be generated by nodes which have not received any information regarding the rumor from an external source. The model assumes that curiosity arises about the value of some variable (for example, how many casualties there were in an earthquake), and that there is some external trustworthy source of information (for example, a news channel) that spreads the real value of the variable (for example, that 23 people were killed in the earthquake). It also assumes that there is an internal diffusion of the information inside the network, from one person to another, and that the internal diffusion can be of the real value (23) or of false data (other values, invented by non-informed individuals).

The fact that there is an internal diffusion of data, generated by nodes that have not yet received any information about the rumor, has potentially two opposing impacts on the rate of propagation. On the one hand, spontaneous generation of information in the social network may increase the aggregate growth of the informed population.
On the other hand, it results in suspicion towards information, which may cause a slowdown in the data spread rate.

The spread of the rumor involves a large number of actions taken by a large number of entities which interact with each other, generating aggregated patterns which are hard to predict and often impossible to treat analytically [30]. For this reason, we take a numerical approach, running simulations of rumor spread processes using combinations of the model parameters. Given that some of the model parameters have a non-linear effect, we use decision trees to analyze the simulation results [23]. We highlight this approach as being of potential value for other numerical studies on complex networks that depend on a large number of variables, with complex relationships between variables and research targets.

We examine the rate at which internal non-informed nodes generate information, and reach two interesting results that focus on the effects this rate has on: (1) the rumor spread time; and (2) the level of pureness of informed nodes, i.e. nodes receiving information originating from an external source. Concerning the first effect, we examine when a high rate of faked signals (i.e., signals generated by a node that does not hold a value of the rumor) can harm the rumor diffusion in terms of spread time. We show that the answer is tightly bound to whether the network is sparse or not. Concerning the second effect, we identify that a strategy of increasing external communication in order to gain a higher pureness level tends to be effective for a medium range of this generation rate, and only in sparse networks.

2

Related Work

In today’s “post-truth” age [14], there is increasing interest in the concept of “fake news” – i.e., false stories – in the scientific literature, specifically their epidemiology [17], detection [3], and impact. This phenomenon has the potential to influence attitudes toward journalistic objectivity [21], and may impose real costs on society and politics [1]. From a corporate point of view, false stories have the potential to damage a firm’s or brand’s image and propel firms into financial disaster [11]. Of course, false stories have always existed, but the ability of social media platforms to spread such narratives rapidly and aggressively gives the question new importance. One recent study focusing on the social network Twitter found that fake news “diffused significantly farther, faster, deeper, and more broadly than the truth in all categories of information” [29].

Reliability in general, and rumor reliability in particular, is a central concept in theories of decision-making [25,27,33], cooperation [6], communication [26], viral marketing [13,16], and markets [2]. It is a vast research topic spanning multiple disciplines. The study of reliability commonly draws on network models, which often use the term reliability for the probability that the proportion of informed individuals exceeds a certain value. Some network models associate reliability with social cohesion [31]. Modern network models define reliability as the probability of data transmission from one element to another [7], which is the probability that a given node will be informed.

Our model follows previous work in probability theory on interacting particle systems [20]. We formulate the simplest extension of an independent cascade model, a type of model that has been investigated in the context of marketing and word-of-mouth processes [8,16].
The dynamic at the node level follows a predefined scheme of response probabilities and is a function of the state of the nodes with which it interacts, as in [12]. Our contribution is in considering a richer dynamic for these interactions, specifically the possibility of activation by non-informed nodes.

We simulate the rumor spread process on a Facebook graph sample taken from the SNAP project [19], on a random graph, and on a series of synthetic Watts-Strogatz networks [32]. Like other studies in the literature on rumor spreading online [4], we also consider a mid-sized network (500 nodes). The intuition behind this is that opinions and rumors often spread within a particular online community which is not that large, for example people contributing to an online forum or people tweeting and re-tweeting some hashtag.

We analyze the simulation results by organizing them in a decision tree for each phenomenon of the rumor spread process we aim to understand. Decision trees are widely used in Machine Learning [15], with the purpose of predicting a target value (a class) from some input features. To build the decision tree, a modeler uses a data set that includes a list of measurable properties (features), one of which is the target. Decision trees are considered to be one of the most popular predictive models (see [24] for a survey). They are also used as descriptive tools [5].

3

Useful Definitions

Below are several definitions of terms used in the article. The purpose of this section is to help the reader understand the model description, which contains some new technical terms. The reader may skip to Sect. 4 and return to this section when a term requires further review.

1. External Communication – Transmission of information from an external advertiser (source) to any node in the graph.
2. Internal Communication – Transmission of information between neighboring nodes in the graph.
3. Informed/Non-informed Node – At each iteration, each node can be in one of two states: Informed or Non-informed. Informed nodes are nodes that hold information about the rumor. Non-informed nodes are nodes that do not hold such information.
4. Pure/Non-pure Node – Each Informed node is in either a Pure or a Non-pure state, and thus there are Pure Informed Nodes and Non-pure Informed Nodes. The pureness state of a node is determined when it is activated (becomes Informed). A Non-informed node that has been activated by an external source, or by a Pure node, is Pure. Otherwise, it is Non-pure. (This is a recursive definition.)
5. Faked/Held Signal – Produced when a Non-informed/Informed node (respectively) transmits information to its neighbors.
6. Reliable/Unreliable Signal – Each signal may be Reliable or Unreliable. Faked signals are unreliable. Held signals are reliable/unreliable when held by pure/non-pure nodes, respectively.
7. Suspicion Factor – The probability that a Non-informed node which receives information from a neighboring node chooses to accept that information and become Informed (rather than to reject it and stay Non-informed).

4

Model

We introduce a model for rumor spread over a network. The network is represented by a graph with a finite set of nodes and a set of undirected edges. We define three states of nodes: Non-informed (a node that does not hold information; see def. 3), Pure-informed (a node that holds information originating from a reliable source; see def. 4) and Non-pure informed (a node that holds information originating from an unreliable source; see def. 4). The novelty of our model is in the fact that non-informed nodes have a positive probability of generating and spreading information on each iteration.

On each iteration, every Non-informed node spreads information to its neighbors with probability Q (rate of faked signals, which are always unreliable; see def. 5 and 6), whereas informed nodes spread information to their neighbors with probability P (rate of held signals, either reliable or not; see def. 5). Every Non-informed node can be activated (become Informed) either by an external source (see def. 1), with probability α separately for each node, or by the internal communication with its neighbors (see def. 2), where the probability that it adopts the information from its neighbors is multiplied by the suspicion factor P/(P + Q) (see def. 7). Nodes that become informed cannot be deactivated (return to being Non-informed) in later stages of the process.

The intuition behind the suspicion factor is that the possibility of signals being generated by non-informed nodes may hold nodes back from adopting any rumor that they receive. Without any prior knowledge of whether a node is informed or not, it would be intuitive for its neighbor to consider the likelihood that a signal was generated by an informed node, and to become informed with a probability that increases with this likelihood. The process is fully described in Algorithm 1.

Regarding the question of whether a node becomes pure or non-pure following an activation, we have to consider the possibility that in a single iteration it may choose to adopt received signals from several sources at once. For consistency of the pureness concept, we define that if a node chooses to adopt some unreliable signal in some iteration, then it becomes non-pure; in other words, non-pure activation is dominant. In Table 1 we show a numerical example of the signal distribution depending on a node’s state.
Every row in the matrix represents the state of a node, and every column represents the conditional probability of the signal it may transmit, depending on its state. In the given example, P = 0.8 and Q = 0.6. Please note two important points: (1) a non-informed node spreads unreliable signals (since it does not hold the advertised value); and (2) an informed node spreads reliable or unreliable signals in agreement with its state, pure or non-pure (according to whether the value it holds originates from an external source or not).

Table 1. Numerical example for the distribution of the signal transmitted by a node in a specific iteration, depending on the state of the node. In this example, P = 0.8 and Q = 0.6.

| State | Unreliable signal | Silence | Reliable signal |
|---|---|---|---|
| Non-informed | 0.6 | 0.4 | 0 |
| Pure informed | 0 | 0.2 | 0.8 |
| Non-pure informed | 0.8 | 0.2 | 0 |
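The per-state signal distribution and the suspicion factor follow directly from P and Q. The short Python sketch below reproduces the Table 1 example; the function names and the string labels for the states are hypothetical choices made for illustration, not part of the paper.

```python
def signal_distribution(state, p, q):
    """Probabilities of the signal a node emits in one iteration, given its state."""
    if state == "non-informed":   # faked signals are always unreliable
        return {"unreliable": q, "silence": 1 - q, "reliable": 0.0}
    if state == "pure":           # held signal from a pure node is reliable
        return {"unreliable": 0.0, "silence": 1 - p, "reliable": p}
    if state == "non-pure":       # held signal from a non-pure node is unreliable
        return {"unreliable": p, "silence": 1 - p, "reliable": 0.0}
    raise ValueError(f"unknown state: {state}")

def suspicion_factor(p, q):
    """Probability of accepting a received signal, P / (P + Q)."""
    return p / (p + q)

# Values from the Table 1 example.
table = {s: signal_distribution(s, 0.8, 0.6)
         for s in ("non-informed", "pure", "non-pure")}
rho = suspicion_factor(0.8, 0.6)
```

With P = 0.8 and Q = 0.6 the suspicion factor is 0.8/1.4, so a received signal is adopted a little more than half of the time.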

Algorithm 1. Rumor spread process with faked signals

Input:
– P, Q, α: held signal, faked signal and advertising rates
– Network G with N nodes and adjacency matrix A
Output:
– State: array with final state for each node (0 for non-informed, 1 for pure informed, 2 for non-pure informed)
– Time: array with activation time for each node

ρ = P/(P + Q) // Suspicion factor
State = Time = zeros[1, N]
i = 0, informed = 0
while informed < 0.95 · N do
    i = i + 1
    // Randomizing transmission for this iteration
    Transmit = zeros[1, N]
    for n = 1 to N do
        if State[n] > 0 then // Informed
            Transmit[n] = 1 w.p. P
        else // Non-informed
            Transmit[n] = 1 w.p. Q
        end if
    end for
    // Calculating new states
    NewState = State // Informed nodes keep their state
    for n = 1 to N do
        if State[n] = 0 then // Non-informed
            ActivatedByPure = 0
            for j = 1 to N do
                if AND(Transmit[j] = 1, State[j] = 1, A[n, j] = 1) then
                    // Transmitting neighbor is pure
                    ActivatedByPure = 1 w.p. ρ // Adopted with suspicion
                end if
            end for
            if ActivatedByPure = 0 then
                ActivatedByPure = 1 w.p. α // Activated by external
            end if
            if ActivatedByPure = 1 then
                NewState[n] = 1 // Pure
                Time[n] = i
            end if
            ActivatedByNonpure = 0
            for j = 1 to N do
                if AND(Transmit[j] = 1, State[j] ≠ 1, A[n, j] = 1) then
                    // Transmitting neighbor is non-pure or non-informed
                    ActivatedByNonpure = 1 w.p. ρ // Adopted with suspicion
                end if
            end for
            if ActivatedByNonpure = 1 then // Non-pure activation is dominant
                NewState[n] = 2 // Non-pure
                Time[n] = i
            end if
            if NewState[n] > 0 then // Activated
                informed = informed + 1
            end if
        end if
    end for
    State = NewState
end while
Return State, Time
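Algorithm 1 can be turned into a short runnable sketch. The Python version below is an illustration under stated assumptions rather than the authors' code: it uses adjacency lists instead of an adjacency matrix, reads "w.p. ρ" as an independent trial per transmitting neighbor, lets informed nodes keep their state between iterations, and adds a hypothetical `max_iter` guard (P + Q must be positive, and the process may never finish if α = Q = 0).

```python
import random

NON_INFORMED, PURE, NON_PURE = 0, 1, 2

def simulate_rumor(adj, p, q, alpha, target=0.95, seed=0, max_iter=100_000):
    """Run the spread process; adj[n] lists the neighbors of node n."""
    rng = random.Random(seed)
    n_nodes = len(adj)
    rho = p / (p + q)                      # suspicion factor
    state = [NON_INFORMED] * n_nodes
    time = [0] * n_nodes                   # activation iteration per node
    informed, it = 0, 0
    while informed < target * n_nodes and it < max_iter:
        it += 1
        # Informed nodes transmit w.p. P, non-informed nodes w.p. Q.
        transmit = [rng.random() < (p if s != NON_INFORMED else q) for s in state]
        new_state = list(state)            # informed nodes keep their state
        for n in range(n_nodes):
            if state[n] != NON_INFORMED:
                continue
            by_pure = rng.random() < alpha  # external source is always reliable
            by_nonpure = False
            for j in adj[n]:
                if transmit[j] and rng.random() < rho:  # adopted despite suspicion
                    if state[j] == PURE:
                        by_pure = True
                    else:                   # sender is non-pure or non-informed
                        by_nonpure = True
            if by_nonpure:                  # non-pure activation is dominant
                new_state[n] = NON_PURE
            elif by_pure:
                new_state[n] = PURE
            if new_state[n] != NON_INFORMED:
                time[n] = it
                informed += 1
        state = new_state
    return state, time

# Usage on a hypothetical ring of 20 nodes.
ring = [[(i - 1) % 20, (i + 1) % 20] for i in range(20)]
final_state, activation_time = simulate_rumor(ring, p=0.5, q=0.2, alpha=0.05, seed=1)
```

The spread time discussed in the text corresponds to `max(activation_time)`, and the pureness level to the share of `PURE` entries among informed nodes.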

5

Example: Facebook Sample Graph

We start by simulating our model on a Facebook graph sample, and on an Erdős-Rényi random graph with the same number of nodes (4039) and the same average degree (approx. 44). The Facebook graph sample was taken from the SNAP project [19]. The random graph was generated with a computer program in Matlab. Figure 1 displays the number of iterations needed to reach 95% of informed nodes against Q, where P and α are fixed at 0.01, for the two networks (i.e., the Facebook sample and the random setup).

Before we discuss whether Q strengthens or harms rumor spread in different settings, we wish to examine whether it actually has a two-sided effect on the spread process. As we see in this example of the Facebook network, the spread of the rumor speeds up as Q increases, but only at the far left of the graph, where Q is low. Beyond a certain point, increasing Q further slows the speed at which the rumor spreads. Figure 1 demonstrates that the ability to speed up the spread of the rumor by increasing the rate of faked signals may or may not exist, depending on the network topology.

Fig. 1. Extent of rumor spread as a function of the probability of faked signals. The Y-axis is the number of iterations needed to reach 95% of the nodes being informed, and the X-axis is Q, where P and α are fixed at 0.01. Curves: Facebook, Random.

6

Methods

We simulate the rumor spread process as described by the model for a variety of model parameters and network structures. We then analyze the simulation results, using decision trees for classifying the simulations by their properties in decision tree structures.

6.1

Preparation of Networks

We wrote a Matlab program for generating synthetic networks using the algorithm of Watts and Strogatz [32]. These networks have 500 nodes and vary in the parameters K and β, which describe, respectively, the ratio between edges and nodes in the graph, and how close the graph is to a random network: β = 1 implies a random network, while β < 1 yields properties of a small-world structure (closer to a social network).

6.2

Numerical Simulations
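The simulations in this section run on the Watts-Strogatz networks of Sect. 6.1. The authors used Matlab; the following Python sketch is a stand-in written for illustration, assuming the standard construction (a ring lattice in which each node is linked to its K nearest neighbors on each side, followed by rewiring of each lattice edge with probability β). The function name and the adjacency-list output are hypothetical choices.

```python
import random

def watts_strogatz(n, k, beta, seed=0):
    """Ring lattice with k neighbors per side (n > 2k), each edge rewired w.p. beta."""
    rng = random.Random(seed)
    adj = [set() for _ in range(n)]
    # Regular ring lattice: n * k edges, average degree 2k.
    for i in range(n):
        for d in range(1, k + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    # Rewire the far end of each lattice edge with probability beta,
    # avoiding self-loops and duplicate edges.
    for d in range(1, k + 1):
        for i in range(n):
            j = (i + d) % n
            if j in adj[i] and rng.random() < beta:
                candidates = [c for c in range(n) if c != i and c not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].remove(j)
                    adj[j].remove(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return [sorted(neighbors) for neighbors in adj]

# Example with smaller, hypothetical sizes than the 500-node networks in the text.
small_world = watts_strogatz(50, 3, 0.5, seed=1)
lattice = watts_strogatz(50, 3, 0.0)
```

Rewiring keeps the number of edges fixed at N·K, so K remains half the average degree for every β.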

Combinations of model parameters (α, P, Q) and network parameters (K, β) were considered in a full factorial design experiment. Consistent with previous literature [8], we set the advertising rate α lower than P (the probability that an Informed node generates a signal). We also set Q, the probability of a faked signal (generated by a Non-informed node), to be lower than P, under the reasonable assumption that being informed about a rumor should not reduce the probability of spreading it. Each of the five input parameters was varied to produce a variety of spread process simulations. For each set of parameters we ran 20 simulations and calculated their mean results. In total we examined 6,720 sets of parameters (excluding those with α = Q = 0), in each case running the rumor spread process until 95% of the network was informed. The parameter ranges were set as follows:

1. K – Half the average degree: 3, 10, 15, 20
2. β – Randomness of the graph: 0.5, 0.8, 1
3. α – Advertising rate: 0–0.01 (steps of 0.00125)
4. P – Held signal rate: 0.01–0.07 (steps of 0.01)
5. Q – Faked signal rate: 0–P (9 values, constant step)

6.3

Analysis: Decision Trees

After generating all possible outcomes over the set of simulation parameters, we analyzed the influence of the parameters on the behavior of the rumor spread. For this purpose, we took the decision trees approach to elicit relevant observations. The results served as input data for training a decision tree, i.e., a set of rules organized in a hierarchical structure that can serve as a predictive model for the relevant measure. This tool can reveal observations that are non-intuitive and that therefore would not likely be advanced and tested by a “human” learning approach. We wrote a computer program in Python to build the trees. The observations obtained from the decision trees are rigorous in the sense that each one corresponds to a node or leaf in the tree, which represents a hypothesis with high significance according to traditional methods.
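The set of simulation parameters referred to above is the full factorial design of Sect. 6.2, and its size can be checked by enumeration. The sketch below assumes that the α range includes both endpoints (9 values) and that the 9 values of Q are evenly spaced from 0 to P inclusive; the constant and function names are hypothetical.

```python
from itertools import product

K_VALUES = [3, 10, 15, 20]                        # half the average degree
BETA_VALUES = [0.5, 0.8, 1.0]                     # randomness of the graph
ALPHA_VALUES = [0.00125 * i for i in range(9)]    # 0 to 0.01
P_VALUES = [0.01 * i for i in range(1, 8)]        # 0.01 to 0.07

def parameter_grid():
    """All (K, beta, alpha, P, Q) combinations, skipping alpha = Q = 0."""
    grid = []
    for k, beta, alpha, p in product(K_VALUES, BETA_VALUES, ALPHA_VALUES, P_VALUES):
        for step in range(9):                     # 9 values of Q from 0 to P
            q = p * step / 8
            if alpha == 0 and q == 0:             # no signal could ever start
                continue
            grid.append((k, beta, alpha, p, q))
    return grid

grid = parameter_grid()
```

The count works out as 4 · 3 · 9 · 7 · 9 = 6,804 combinations, minus the 4 · 3 · 7 = 84 with α = Q = 0, which gives the 6,720 sets reported in the text.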


Each decision tree specifies the properties of its nodes. Specifically, for each leaf (which is a node with no arrow coming out from it), the tree specifies: the number of samples that were sorted into this leaf over the categories of the target; their distribution in terms of how many samples fall in each category; and the prediction assigned to this leaf. For each internal node (not a leaf), the graph also specifies the splitting criteria. We examined the classifications of the simulation results in the trees, and derived our observations based on values in the same class. We are most interested in classes that are homogeneous in terms of the target values of the simulations that fall into them.
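To make the notion of a splitting criterion concrete, the sketch below finds the best single-threshold rule on one feature by minimizing weighted Gini impurity. The paper does not state which criterion or library was used, so Gini impurity, the function names and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 for a homogeneous class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best rule 'value <= threshold'."""
    n = len(values)
    best = (None, gini(labels))            # no split at all is the baseline
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2                # midpoint between adjacent values
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (thr, score)
    return best

# Hypothetical toy data: K per simulation and whether a turning point was seen.
k_values = [3, 3, 10, 15, 20, 20]
turning_point = ["yes", "yes", "no", "no", "no", "no"]
threshold, impurity = best_split(k_values, turning_point)
```

A full tree repeats this search over all features and recurses on each side; leaves whose impurity is close to zero are the homogeneous classes of interest.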

7

Results

Firstly, we examine the two-sided impact of information generated by non-informed nodes on the time needed for a rumor to spread. Increasing the rate at which such information is generated increases both the transmission rate and suspicion. When there exists a value of Q for which lower values are associated with an acceleration of the rumor spread and higher values with a slowdown, we say that the network exhibits a turning point. We saw an example of this in Fig. 1 for the Facebook graph. If a turning point exists, then we cannot be sure that it is beneficial to increase internal communication of faked signals in order to achieve a faster rumor spread. We get a simple decision tree (see Fig. 2), showing that for a sparse network (k = 3) it is not always beneficial to increase internal communication (the probability of a turning point is 0.95, while for all simulations it is 0.37), and for a highly-connected dense network (k > 3) it is beneficial to increase internal communication (the probability of no turning point is 0.82, while for all simulations it is 0.63). Intuitively, when the graph is highly connected, signals are spread to many nodes in each iteration, overcoming suspicion.

k 0.002, Q ≤ 0.021), if the graph is sparse (k = 3), we can increase the advertising rate (α ≥ 0.00375) in order to get high pureness and avoid low pureness. When α ≤ 0.0025 the probability of low pureness is 0.96, but if α ≥ 0.00375 the probability of low pureness is 0.33. We see that for high or low values of Q, increasing the advertising rate is not what determines pureness, because there are then too many or too few unreliable signals, so that pureness is low or high, respectively.

$$\frac{dI_v(t)}{dt} > 0,$$

$$\beta_{hv}\, b(T)\, S_v(t)\, I_k^h(t) - \mu_v I_v(t) > 0$$

$$I_v(t) < \frac{\beta_{hv}\, b(T)\, S_v(t)\, I_k^h(t)}{\mu_v} \tag{7}$$

At the outset of an epidemic, $S_v(t) \approx 1$. Death is an instantaneous process, therefore $\mu_v = 1$, and hence $\beta_{hv}\, b(T)\, I_k^h(t) > I_v(t)$. This should be greater than one. The basic reproduction number of the vector is given by $R_0^v = \beta_{hv}\, b(T)\, I_k^h(t)$. By substituting the value of $I_v(t)$ into Eqs. 2 and 3, the host equations can be written as:

$$\frac{dS_k^h(t)}{dt} = -\beta_h k\, S_k^h(t) I_k^h(t) - \beta_{vh}\, b(T)\, S_k^h(t) \{\beta_{hv}\, b(T)\, I_k^h(t)\} \tag{8}$$

$$\frac{dI_k^h(t)}{dt} = \beta_h k\, S_k^h(t) I_k^h(t) + \beta_{vh}\, b(T)\, S_k^h(t) \{\beta_{hv}\, b(T)\, I_k^h(t)\} - I_k^h(t) \tag{9}$$

$$\frac{dR_k^h(t)}{dt} = I_k^h(t) \tag{10}$$

Generally, a healthy host node is infected, and this infected node is then converted into a recovered node. So we can say that $S_k^h(t)$ is converted into $R_k^h(t)$. Therefore, from Eqs. 8 and 10,

$$\frac{dS_k^h(t)}{dR_k^h(t)} = \frac{-\left(\beta_h k\, S_k^h(t) + \beta_{vh}\, b(T)\, S_k^h(t)\, \beta_{hv}\, b(T)\right) I_k^h(t)}{I_k^h(t)} \tag{11}$$

where Eq. 11 shows the rate of change of susceptible nodes with respect to recovered nodes. Integrating both sides of Eq. 11 gives

$$S_k^h(t) = e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}) R_k^h(t)} \tag{12}$$

The negative exponent in Eq. 12 shows that the number of susceptible nodes decreases as they are converted into recovered nodes. The epidemic reaches a steady state as t → ∞, hence $I_k^h(\infty) = 0$. Therefore, the normalization condition for the steady state gives

$$S_k^h(\infty) = e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}) R_k^h(\infty)} \tag{13}$$

$$R_k^h(\infty) = 1 - e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}) R_k^h(\infty)} \tag{14}$$

Now let $f(R_k^h(\infty)) = 1 - e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}) R_k^h(\infty)}$ be a function of $R_k^h(\infty)$, which is strictly increasing. Setting $R_k^h(\infty) = 0$ gives the trivial solution, which corresponds to the disease-free state.


M. Arquam et al.

Now we need to find a nontrivial solution, which lies between 0 and 1. For this, the following condition must be satisfied:

$$\left.\frac{df(R_k^h(\infty))}{dR_k^h(\infty)}\right|_{R_k^h(\infty)=0} > 1$$

$$\left.(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv})\, e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}) R_k^h(\infty)}\right|_{R_k^h(\infty)=0} > 1$$

$$(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}) > 1$$

Hence the basic reproduction number $R_0^h$ must satisfy $(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}) > 1$ for the epidemic to spread in the host population. Therefore, $R_0^h = \beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}$, where $\beta_h$, $k$, $\beta_{vh}$ and $\beta_{hv}$ are constants. Hence, $R_0^h$ increases linearly with the square of the biting rate, i.e. $b(T)^2$. This basic reproduction number is also called the critical threshold of disease spreading.
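As an illustration of the threshold condition, the steady-state relation of Eq. 14 can be solved numerically by fixed-point iteration. The Python sketch below is not part of the original study: the function names, the starting guess of 0.5, the tolerance, and any parameter values passed to it are assumptions chosen only for illustration. It uses the expression $R_0^h = \beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}$ derived above.

```python
import math


def basic_reproduction_host(beta_h, k, beta_vh, beta_hv, b):
    """Host threshold R_0^h = beta_h * k + beta_vh * b^2 * beta_hv."""
    return beta_h * k + beta_vh * b ** 2 * beta_hv


def final_size(r0, tol=1e-12, max_iter=100_000):
    """Iterate R <- 1 - exp(-r0 * R), the steady-state relation of Eq. 14.

    The starting guess 0.5 is an arbitrary nonzero value (an assumption),
    chosen so that the iteration can reach a nontrivial solution if one exists.
    """
    r = 0.5
    for _ in range(max_iter):
        r_next = 1.0 - math.exp(-r0 * r)
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r  # not converged within max_iter; best estimate so far
```

When `r0` is below 1 the iteration collapses towards the trivial solution 0, whereas for `r0` above 1 it settles on a solution strictly between 0 and 1, in line with the condition derived above.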

5 Simulation of the Model and Results Analysis

In this section, we first report the simulation setup, and then we discuss the results of the simulation performed using the temperature dependent SIR model using a homogeneous network (Watts–Strogatz model) as the underlying contact network. The various parameter values used for the simulations are listed in Table 2. These values have been chosen according to a literature review.

Table 2. Parameter values used in the simulation

| Name of parameter | Value |
|---|---|
| Host contact network size | 2000 |
| Connectivity probability for Watts–Strogatz model | 0.2 |
| Number of neighbours of each node | 300 |
| Vector population size | 100000 |
| Spreading rate between host to host (β_h) | 0.6 |
| Recovery rate (μ_h) | 1 |
| Death rate of vector (μ_v) | 1 |
| Spreading rate between vector to host (β_vh) | 0.4 |
| Spreading rate between host to vector (β_hv) | 0.6 |
| Biting rate of vector (b_0) at T_0 | 0.4 |
| T_0 | 25 °C |
| Range of temperature (T) | [4.3, 37] °C |

Integrating Temperature into the SIR Model for Vector-Borne Diseases


We focus on the effect of temperature on the dynamics of epidemics on the host contact network as well as the vector population. In the simulation, if the temperature is in the range T < 0 °C or T > 37 °C, the vector biting rate is zero. In other words, outside the limit temperatures, no vectors are present. Within that range of temperature the critical threshold is given by:

$$R_0^h = \beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}$$

The epidemic spreading with the modified SIR model on a homogeneous network as underlying topology (Watts–Strogatz model) is shown in Fig. 4. We took the value of temperature T ranging from 4.3 °C to 37 °C (the temperature of Delhi NCR in 2018) to analyze the effect of temperature on the infection process. The evolution of the epidemic spreading for the host population is reported in Fig. 4(a). Similar results for the vector population are shown in Fig. 4(b). These figures show that the infection increases with time until the optimum temperature is reached. After that, the infection starts decreasing. The timespan of the epidemic depends upon the existence of the vector population, as shown in Fig. 4(b).

Fig. 4. Epidemic spreading in host and vector population

One can observe that the vector population vanishes quickly because of its short life span, but it increases the epidemic threshold, as shown in Fig. 5. Figures 5(a) and 5(b) illustrate the evolution of the epidemic threshold in the vector population as the temperature varies. Figures 5(c) and 5(d) present the variation of the epidemic threshold in the host population as the temperature changes. The infection threshold varies from 0.8 to more than 0.9 in the host population, while it does not change much in the vector population because the life span of the vector is very short. These results corroborate findings reported in the literature that the transmission probability from host to vector is greater than the transmission probability from vector to host.


Fig. 5. Effect of temperature on infection threshold in SIR considering a homogeneous contact network

Fig. 6. Biting of vector population

We also analyse the effect of temperature on the biting rate, which depends upon the total population of vectors. As the temperature increases from 25 °C, mosquitoes keep biting until the maximum temperature is reached. Once the temperature reaches 37 °C, biting ceases as the vector population vanishes. The biting rate is plotted in Fig. 6. Figure 6(a) shows that biting is at its maximum in the middle of the spreading process, while Fig. 6(b) shows that biting increases with temperature up to the ambient temperature. After that, the vector population starts dying. Finally, after the maximum temperature is reached, the vector population is eliminated. Figure 7 shows the effect of temperature on infection spreading in the vector as well as the host population. An infected vector can cause infection in multiple hosts, and the vector population is much larger than the host population. Therefore, infection in the host population increases more than in the vector population.


Fig. 7. Effect of temperature on infection spreading in vector and host populations

6 Conclusion and Future Work

In this work, we propose and investigate a modified SIR model which integrates the effect of temperature on the spreading of vector-borne diseases. Here, we consider two types of populations: (1) the host population with the three states of the SIR model and (2) the vector population with the two states of the SI model. Favourable temperature increases the disease spreading from vector to host and, by a cascading effect, within the host population. We show that the threshold of the spreading rate of the disease increases linearly with the square of the biting rate b(T), which is defined as a function of temperature. Simulations are performed using the proposed modified SIR model on a homogeneous contact network. They show that temperature increases the critical threshold value of the spreading rate. Additionally, if the temperature increases above 37 °C, the epidemic dies out due to the extinction of the vector population. Results from real disease data are plotted, which show a similar infection pattern in the host population. We plan to develop this work in various future directions. An important extension is to include the effect of humidity, as most diseases spread after the rainy season, especially in India. Furthermore, more realistic scenarios need to be considered concerning the host contact network topology, such as scale-free networks, modular networks and dynamic networks [17–19]. The movement of the population may also be considered.

References
1. World Health Organization et al.: Global strategy for dengue prevention and control 2012-2020 (2012)
2. Anderson, R.M., May, R.M., Anderson, B.: Infectious Diseases of Humans: Dynamics and Control, vol. 28. Wiley Online Library, Hoboken (1992)
3. Esteva, L., Vargas, C.: Analysis of a dengue disease transmission model. Math. Biosci. 150(2), 131–151 (1998)
4. de Pinho, S.T.R., Ferreira, C.P., Esteva, L., Barreto, F.R., Morato e Silva, V.C., Teixeira, M.G.L.: Modelling the dynamics of dengue real epidemics. Philos. Trans. Roy. Soc. A: Math. Phys. Eng. Sci. 368(1933), 5679–5693 (2010)
5. Focks, D.A., Daniels, E., Haile, D.G., Keesling, J.E.: A simulation model of the epidemiology of urban dengue fever: literature analysis, model development, preliminary validation, and samples of simulation results. Am. J. Trop. Med. Hygiene 53(5), 489–506 (1995)
6. Waikhom, P., Jain, R., Tegar, S.: Sensitivity and stability analysis of a delayed stochastic epidemic model with temperature gradients. Model. Earth Syst. Environ. 2(1), 49 (2016)
7. Liu-Helmersson, J., Stenlund, H., Wilder-Smith, A., Rocklöv, J.: Vectorial capacity of Aedes aegypti: effects of temperature and implications for global dengue epidemic potential. PLoS One 9(3), e89783 (2014)
8. Polwiang, S.: The seasonal reproduction number of dengue fever: impacts of climate on transmission. PeerJ 3, e1069 (2015)
9. Wang, W., Mulone, G.: Threshold of disease transmission in a patch environment. J. Math. Anal. Appl. 285(1), 321–335 (2003)
10. Auger, P., Kouokam, E., Sallet, G., Tchuente, M., Tsanou, B.: The Ross-Macdonald model in a patchy environment. Math. Biosci. 216(2), 123–131 (2008)
11. Nekovee, M., Moreno, Y., Bianconi, G., Marsili, M.: Theory of rumour spreading in complex social networks. Phys. A 374(1), 457–470 (2007)
12. Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47 (2002)
13. Pastor-Satorras, R., Castellano, C., Van Mieghem, P., Vespignani, A.: Epidemic processes in complex networks. Rev. Mod. Phys. 87(3), 925 (2015)
14. Vespignani, A.: Modelling dynamical processes in complex socio-technical systems. Nat. Phys. 8(1), 32 (2012)
15. Moreno, Y., Pastor-Satorras, R., Vespignani, A.: Epidemic outbreaks in complex heterogeneous networks. Eur. Phys. J. B-Condens. Matter Complex Syst. 26(4), 521–529 (2002)
16. Li, X., Wang, X.: Controlling the spreading in small-world evolving networks: stability, oscillation, and topology. IEEE Trans. Autom. Control 51(3), 534–540 (2006)
17. Orman, K., Labatut, V., Cherifi, H.: An empirical study of the relation between community structure and transitivity. In: Menezes, R., Evsukoff, A., González, M. (eds.) Complex Networks. Studies in Computational Intelligence, vol. 424, pp. 99–110 (2013)
18. Gupta, N., Singh, A., Cherifi, H.: Centrality measures for networks with community structure. Phys. A 452, 46–59 (2016)
19. Ghalmane, Z., El Hassouni, M., Cherifi, C., Cherifi, H.: Centrality in modular networks. EPJ Data Sci. 8(1), 1–27 (2019)

Opinion Diffusion in Competitive Environments: Relating Coverage and Speed of Diffusion

Valeria Fionda and Gianluigi Greco

Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
{fionda,greco}@mat.unical.it

Abstract. The paper analyzes how two opinions/products/innovations diffuse in a network according to non-progressive dynamics. We show that the final configuration of the network strongly depends on their relative speed of diffusion. In particular, we characterize how the number of agents that will eventually adopt an opinion (at the end of the diffusion process) is related to the speed of propagation of that opinion. Moreover, we study how the minimum speed of propagation required to converge to consensus on a given opinion is related to the percentage of agents that initially act as seeds for that opinion. Our results complement earlier works in the literature on competitive opinion diffusion, by depicting a clear picture of the relationships between coverage and speed of diffusion.

Keywords: Competitive opinion diffusion · Linear threshold models

1 Introduction

The mechanisms according to which opinions form and diffuse over social networks have attracted much research in recent years. A number of models have been proposed, and their properties have been studied both theoretically and experimentally (see, e.g., [11]). Abstracting from their specific technical differences, these models can be classified into two broad categories, non-progressive and progressive ones (cf. [19]), the difference between them being whether or not an agent/individual/node that has been influenced to adopt some opinion can eventually reconsider her decision. Most of the literature focuses on the latter kind of models, which are indeed reminiscent of very influential earlier studies in economics [27] and sociology [16,17]. However, there are scenarios where the progressive behaviour is unrealistic [4,6,8,10]. This typically happens when social environments host opinions that compete with each other and when agents can oscillate in adopting one of them, being subject to the social pressure of their neighbors (see, e.g., [12–14]). Practical applications of non-progressive models have been pointed out in the context of the diffusion of competing (product) innovations, in the usage of mobile apps, and for analyzing the cycles of opinions that are in fashion [22]. In these applications, we typically select an initial set S0 of seeds to propagate an opinion, say b(lack); but the diffusion process of b competes with the cascade of influence of another opinion, say w(hite).

In the paper we contribute to shedding light on the theoretical and practical behaviour of such non-progressive models for opinion diffusion. Our analysis moves from the observation that the outcome of these models, i.e., the final configuration of the social network, crucially depends on the order in which the various agents change their mind. In fact, the setting has a kind of game-theoretic flavour and strategic aspects naturally emerge (see, e.g., [28]). Here we depart from the strategic analysis and we focus on an aspect that has received considerably less attention in the literature. Indeed, we focus on answering questions that relate the speed of propagation of some given opinion to the overall number of agents that will eventually adopt that opinion. As an example, assume that b propagates at the same speed as w, that is, if we consider two updates of opinions in the social network, then (on average) we observe one agent adopting opinion b and one agent adopting opinion w; and assume that a seed composed of 5% of the agents, initially holding opinion b, is able to spread that opinion to 40% of the agents. Then, we may ask: What happens if b propagates in the network two times faster than w? Is it the case that the final coverage will be about 80%?

More generally, for a given set S0 of initial seeds that hold opinion b, in the paper we study how the overall number of agents that will eventually adopt that opinion varies as a function of the relative speed of propagation of b compared to that of w. Moreover, we study how the minimum cardinality of a seed from which consensus on b can be eventually achieved varies (again) as a function of the relative speed of b. Our analysis includes a thorough experimental campaign which we conducted on competitive environments built over synthetic and real social networks. On these environments, we essentially considered a (deterministic) linear threshold [19] model of opinion diffusion and we provide results suggesting that the rate of adoption of an opinion dramatically impacts (i.e., much more than one would naturally envisage) its capacity to spread over the whole network.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 425–435, 2020. https://doi.org/10.1007/978-3-030-36687-2_35

2 A Formal Framework for Competitive Diffusion

In this section we introduce our formal framework for opinion diffusion. In particular, we define notions that are meant to formalize the concepts of speed and coverage of diffusion and we illustrate how these concepts are related to each other. In the exposition we consider specific kinds of networks, which are rather unlikely to occur in real-world instances. This has been done with the aim of providing sharp theoretical bounds on these relationships. Bounds that hold in practice are instead singled out in Sect. 3.


Fig. 1. Illustrations for the results in Sect. 2.

Preliminaries. Let G = (N, E) be a social network, that is, an undirected graph encoding the interactions of a set N of agents. Throughout the paper, we consider a setting where two opinions/products/innovations, say b and w, compete for diffusing over G. We adopt a linear threshold model of diffusion [19], by assuming that the thresholds are known a priori (rather than being initially selected at random). Indeed, thresholds can be learned over some available data by means of mining techniques [15,18] or, in many cases, we might just want to analyze and reason about scenarios where agents are characterized by some specific behaviour. Here, we focus on majority agents, which have attracted much research as a prototypical behaviour in opinion diffusion (e.g., [1,7,9,20,24,25]). Formally, for each agent x ∈ N, the set {y | {x, y} ∈ E} of her neighbors is denoted by δ(x), and her associated threshold is denoted by σ(x), with 1 ≤ σ(x) ≤ |δ(x)|. Then, we say that x is a majority agent if σ(x) = |δ(x)|/2.

To keep notation simple, a configuration for G is just defined as the set S ⊆ N of all agents that hold opinion b, so that agents in N \ S hold opinion w. An agent x ∈ S (resp., x ∈ N \ S) is stable with respect to that configuration if |δ(x) ∩ S| ≥ σ(x) (resp., |δ(x) ∩ (N \ S)| ≥ σ(x)). A configuration S is stable if all agents in N are stable. A dynamic for G is a sequence of configurations π = S0, ..., Sk such that Sk is stable and, for each h ∈ {1, ..., k}, Sh is obtained from Sh−1 by flipping the opinion of an agent that is not stable. Note that, at each time step, we assume that precisely one agent is selected to change her opinion. Indeed, this assumption is appropriate when analyzing scenarios involving large social networks (cf. [7]).

Speed and Coverage. Let π = S0, ..., Sk be a dynamic for G = (N, E). Let $I^b(\pi, G) \subseteq \{1, ..., k\}$ be the set of all time steps i such that Si is obtained from Si−1 by swapping to b the opinion of some agent in G. Moreover, let $I^{b+w}(\pi, G) \subseteq \{1, ..., k\}$ be the set of all time steps i such that in Si−1 there are at least two agents, one with opinion b and another with opinion w, that are not stable. In words, $I^b(\pi, G)$ captures the time steps where some agent adopted opinion b, whereas $I^{b+w}(\pi, G)$ captures the time steps where both b and w can potentially propagate. Then, we define the relative speed of b in π as the ratio $rs_b(\pi, G) = |I^b(\pi, G) \cap I^{b+w}(\pi, G)| / |I^{b+w}(\pi, G)|$. That is, $rs_b(\pi, G)$ measures the fraction of the time steps in which b has been propagated over all time steps in which w might have been propagated too.
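To make the definitions above concrete, the following Python sketch simulates one dynamic for majority agents and returns the final configuration together with the realized relative speed of b. It is only an illustration and not the authors' implementation: the function names, the rounding of the majority threshold up to ⌈|δ(x)|/2⌉, and the randomized scheduler that lets b propagate with probability `alpha` at each contested step are assumptions made here. With this scheduler the realized relative speed matches `alpha` only on average, except when `alpha` is 0 or 1.

```python
import random


def threshold(adj, x):
    # Assumption: majority threshold rounded up, sigma(x) = ceil(|delta(x)| / 2).
    return (len(adj[x]) + 1) // 2


def unstable_agents(adj, S):
    """Return (unstable agents holding b, unstable agents holding w)."""
    unstable_b, unstable_w = [], []
    for x, nbrs in adj.items():
        # Number of neighbors sharing the current opinion of x.
        same = sum((y in S) == (x in S) for y in nbrs)
        if same < threshold(adj, x):
            (unstable_b if x in S else unstable_w).append(x)
    return unstable_b, unstable_w


def run_dynamic(adj, seeds, alpha, rng):
    """Flip one unstable agent per step until the configuration is stable.

    adj maps each agent to the set of its neighbors; seeds is S_0.
    Returns the final set of b-agents and the realized relative speed of b
    (None if no step was contested by both opinions).
    """
    S = set(seeds)
    contested = 0  # steps where both b and w could propagate
    b_steps = 0    # contested steps in which b propagated
    while True:
        unstable_b, unstable_w = unstable_agents(adj, S)
        if not unstable_b and not unstable_w:
            break
        if unstable_b and unstable_w:
            contested += 1
            if rng.random() < alpha:
                S.add(rng.choice(unstable_w))     # a w-agent adopts b
                b_steps += 1
            else:
                S.remove(rng.choice(unstable_b))  # a b-agent adopts w
        elif unstable_w:
            S.add(rng.choice(unstable_w))
        else:
            S.remove(rng.choice(unstable_b))
    speed = b_steps / contested if contested else None
    return S, speed
```

On a star whose center is the only seed, this sketch ends with no agent holding b when `alpha` is 0 and with every agent holding b when `alpha` is 1, which illustrates on a tiny instance how strongly the outcome can depend on the relative speed.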


For each rational number α, with 0 ≤ α ≤ 1, let $\Pi^b_\alpha(S_0, G)$ be the set of all dynamics π starting at the configuration S0 and such that $rs_b(\pi, G) \le \alpha$. Moreover, let $\mathrm{cov}^b_\alpha(S_0, G)$ be the maximum coverage of b from S0, that is, the maximum number of agents that hold opinion b at the end of any dynamic $\pi \in \Pi^b_\alpha(S_0, G)$. Hence, $\mathrm{cov}^b_\alpha(S_0, G)$ measures the capacity of spreading opinion b by considering dynamics for which the relative speed of b is at most α. We next relate the coverage $\mathrm{cov}^b_\alpha(S_0, G)$ to the relative speed α, in particular by showing that the impact of the speed on the coverage of the diffusion might be dramatic.

Theorem 1. Let S0 be a configuration for a social network G = (N, E). Then, for each pair 0 ≤ α ≤ α′ ≤ 1, it holds that $\mathrm{cov}^b_\alpha(S_0, G) \le \mathrm{cov}^b_{\alpha'}(S_0, G)$. Moreover, there is a class of social networks $\{G_n = (\{1, ..., n\}, E_n)\}_{n>3}$ such that $\mathrm{cov}^b_0(\{1, 2\}, G_n) = 0$ and $\mathrm{cov}^b_1(\{1, 2\}, G_n) = n$.

Proof (Sketch). The monotonic behaviour of cov w.r.t. the relative speed α is easily seen to hold. Concerning the second part of the statement, consider the graph $G_n$ shown in Fig. 1(a) and any dynamic starting with the configuration {1, 2}. When we focus on dynamics in $\Pi^b_0(\{1, 2\}, G_n)$, for n > 3, the evolution of the network is entirely deterministic: agent 1 and agent 2 will change their opinion to w, in this order. That is, $\mathrm{cov}^b_0(\{1, 2\}, G_n) = 0$. On the other hand, within the dynamics in $\Pi^b_1(\{1, 2\}, G_n)$, we can choose the one propagating b to agent 3, then to agent 4, and so on, until all agents in $G_n$ are covered. That is, $\mathrm{cov}^b_1(\{1, 2\}, G_n) = n$. □

Speed and Consensus. The second parameter we are interested in analyzing as a function of the relative speed is the number of seeds that are necessary to converge to consensus (i.e., to a scenario where all agents hold, w.l.o.g., opinion b). To this end, let $\gamma^b_\alpha(G)$ be the minimum cardinality over all the sets S0 such that $\mathrm{cov}^b_\alpha(S_0, G) = |N|$. Again, a dramatic gap emerges between the values of $\gamma_\alpha$ at the opposite boundaries of α.

Theorem 2. Let S0 be a configuration for a social network G = (N, E). Then, for each pair 0 ≤ α ≤ α′ ≤ 1, it holds that $\gamma^b_\alpha(G) \ge \gamma^b_{\alpha'}(G)$. Moreover, there is a class of social networks $\{G_n = (\{1, ..., n\}, E_n)\}_{n>3}$ such that $\gamma^b_0(G_n) = n/2$ and $\gamma^b_1(G_n) = 2$.

Proof (Sketch). The anti-monotonic behaviour of $\gamma_\alpha$ is easily seen to hold. Then, consider again the graph $G_n$ in Fig. 1(a), and recall from the proof of Theorem 1 that $\gamma^b_1(G_n) \le 2$. In fact, there is no way that a seed with one node only can propagate opinion b to all remaining agents; hence, $\gamma^b_1(G_n) = 2$ actually holds. To conclude, note that when considering α = 0, the only way to forbid the propagation of opinion w is precisely to include in the initial configuration agent 1 and half of the remaining agents. Hence, $\gamma^b_0(G_n) = n/2$. □

In fact, it is known that $\gamma^b_1(G) \le |N|/2$ always holds [3]. Therefore, the above result might tempt the reader to believe that the |N|/2 bound holds independently of the speed of the propagation. We next show that this is not the case.


Theorem 3. There is a class of social networks $\{G_n = (\{1, ..., 5n\}, E_n)\}_{n \ge 1}$ such that $\gamma^b_0(G_n) = 4n$.

Proof. Consider the graph $G_n$ shown in Fig. 1(b), which consists of a gadget over 5 agents cloned n times, and the configuration depicted there where 4n agents initially hold opinion b. It is immediate to check that from this configuration consensus on b can be eventually achieved even with dynamics for which α = 0. Now, the crucial observation is that it is not possible to end up with a consensus on b if, at any time step, two agents that are connected in $G_n$ both hold opinion w. To avoid this obstruction, it can be checked that in every gadget at least four agents must initially hold opinion b. □

3 Experimental Evaluation

Our experimental campaign involved both synthetic and real networks. Networks can be characterized by the following parameters:
• Degree distribution p(k): the probability distribution of node degrees over the whole network (i.e., the probability that a randomly chosen node has degree k);
• Average node degree ⟨k⟩;
• Neighbor degree distribution q(k): the probability that a randomly chosen edge is connected to a node of degree k;
• Joint degree distribution e(k1, k2): the probability that the two endpoints of a randomly chosen edge have degrees k1 and k2, respectively;
• Assortativity or degree correlation $r_{kk}$: the Pearson correlation between degrees of connected nodes. In assortative networks ($r_{kk} > 0$) nodes are connected to nodes having similar degree, and in disassortative networks ($r_{kk} < 0$) they link to nodes having dissimilar degree. Note that two networks having the same degree distribution can differ in their assortativity.

The initial configuration S0 of the network can be specified by fixing the percentage of agents that initially hold opinion b and the joint probability distribution P(o, k), that is, the probability that a node of degree k has opinion o ∈ {b, w}. Such joint probability can be used to compute $\rho_{k,o}$, that is, the correlation between node degrees and opinions. In general, randomly picking the agents having opinion b in the network leads to a configuration where $\rho_{k,o} \approx 0$. By following the approach used in [21], we changed the opinion degree correlation by swapping the opinions of agents. In particular, given two agents n and n′ having opinions b and w respectively, swapping their opinions will lead to an increase of $\rho_{k,o}$ if the degree of n is lower than that of n′. Thus, to increase the value of the opinion degree correlation, we iteratively picked a node n with opinion b and a node n′ with opinion w and swapped their opinions only if the degree of n′ was bigger than the degree of n, and repeated until the desired correlation value was reached or no more swaps were possible.
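The swapping procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in the experiments: the function names, the 0/1 encoding of opinions (1 for b, 0 for w) used to compute the Pearson correlation, the uniform random choice of the two agents, and the `max_tries` safeguard are all choices made here.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def opinion_degree_correlation(degree, S):
    """Correlation between node degree and opinion (1 if the node holds b)."""
    nodes = list(degree)
    return pearson([degree[v] for v in nodes],
                   [1.0 if v in S else 0.0 for v in nodes])


def raise_opinion_degree_correlation(degree, S, target, rng, max_tries=10_000):
    """Swap opinions between a b-agent and a higher-degree w-agent until the
    target correlation is reached or no swap can increase it any further."""
    S = set(S)
    for _ in range(max_tries):
        others = set(degree) - S
        if not S or not others:
            break
        if opinion_degree_correlation(degree, S) >= target:
            break
        # No w-agent has a larger degree than some b-agent: nothing to swap.
        if min(degree[v] for v in S) >= max(degree[v] for v in others):
            break
        n = rng.choice(sorted(S))         # agent currently holding b
        n2 = rng.choice(sorted(others))   # agent currently holding w
        if degree[n2] > degree[n]:
            S.remove(n)
            S.add(n2)
    return S
```

Each accepted swap keeps the number of b-agents fixed and strictly increases the total degree of the agents holding b, so the correlation computed by `opinion_degree_correlation` can only grow along the procedure.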


For each input network we performed two types of experiments. The first experiment was meant to empirically validate Theorem 1: we fixed the number of seeds (agents initially having opinion b) and analyzed the coverage of opinion b according to different propagation speeds 0 ≤ α ≤ 1. The results of the first experiment are reported in Sect. 3.1. The second experiment was meant to empirically validate Theorem 2: we analyzed how the coverage of the network varies by increasing the initial number of seeds for a fixed propagation speed. The results of the second experiment are discussed in Sect. 3.2. All results are reported as the average over 30 runs.

Synthetic Networks. We used two types of synthetic networks: scale-free networks and Erdős–Rényi networks. We created both types of networks by using the generator implemented in the SNAP library (https://snap.stanford.edu/data/). In particular, we generated scale-free networks with a specified degree sequence by specifying the number of vertices and the exponent λ of the power law distribution $p(k) \sim k^{-\lambda}$, where p(k) indicates the fraction of nodes that have degree k. In particular, we generated three undirected scale-free networks with 10000 nodes having power law distributions $k^{-2.1}$, $k^{-2.4}$ and $k^{-3.1}$, respectively. From each of these initial networks we obtained three networks by changing their assortativity. To generate Erdős–Rényi networks we specified the number of nodes and edges in order to obtain a specific average node degree. In particular, we generated an undirected Erdős–Rényi network having 10000 nodes and 12500 edges, thus with average node degree ⟨k⟩ = 2.5. From this initial network, by using the rewiring procedure, we obtained three networks having assortativity equal to −0.5, 0 and 0.5, respectively. Starting from each generated network we used Newman's edge rewiring procedure [23] to change its assortativity. This is an iterative procedure that at each iteration randomly chooses two disjoint edges and swaps their endpoints if doing so changes their degree correlation. The procedure stops when the desired degree assortativity is achieved.

Real Networks. We considered a benchmark consisting of 14 graph datasets, whose main features are summarized in Table 1, which reports: the name of the dataset, the number of nodes, the number of edges, the assortativity coefficient, the average degree, the maximum degree and the coefficient of the power law distribution that best approximates the degree distribution of the network, computed according to the Bhattacharyya distance [5]. The datasets fb-Art, fb-Ath, fb-Com, fb-Gov, fb-NS, fb-Pol, fb-PF and fb-TvS have been extracted from the Facebook (fb) dataset by considering respectively artists', athletes', companies', government's, news sites', politicians', public figures' and TV shows' pages only. The datasets dz-HR, dz-HU and dz-RO have been extracted from the Deezer dataset by considering the friendship networks of users in Croatia, Hungary and Romania, respectively.
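The edge rewiring step used for the synthetic networks can be illustrated with the short Python sketch below. It is a simplified stand-in and not the procedure of [23] as implemented by the authors: the function names, the acceptance rule (a swap is kept only if it moves the assortativity closer to the target), the tolerance `tol` and the step limit are assumptions made for illustration. The input is assumed to be a simple undirected graph given as a list of edges without self-loops or duplicates.

```python
import random


def assortativity(edges, degree):
    """Pearson correlation between the degrees at the two ends of each edge."""
    xs, ys = [], []
    for u, v in edges:
        # Count every undirected edge in both directions.
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean = sum(xs) / n  # xs and ys have the same mean by symmetry
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / var


def rewire_towards(edges, target, rng, steps=2000, tol=0.01):
    """Degree-preserving edge swaps that move assortativity towards target."""
    edges = [tuple(e) for e in edges]
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    if len(edges) < 2:
        return edges, 0.0
    present = {frozenset(e) for e in edges}
    r = assortativity(edges, degree)
    for _ in range(steps):
        if abs(r - target) <= tol:
            break
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # the two edges must be disjoint
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue  # the swap would create a duplicate edge
        trial = list(edges)
        trial[i], trial[j] = (a, d), (c, b)
        r_new = assortativity(trial, degree)
        if abs(r_new - target) < abs(r - target):
            present -= {frozenset((a, b)), frozenset((c, d))}
            present |= {frozenset((a, d)), frozenset((c, b))}
            edges, r = trial, r_new
    return edges, r
```

Because every swap replaces edges (a, b) and (c, d) with (a, d) and (c, b), each node keeps its degree, so only the degree correlation changes. Recomputing the assortativity from scratch at every step keeps the sketch short but would be too slow for networks of the size used in the experiments.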


Table 1. Real networks characteristics.

| Network | Number of nodes | Number of edges | r_kk | ⟨k⟩ | k* | λ |
|---|---|---|---|---|---|---|
| fb [26] | 134873 | 1380293 | 0.0740 | 20.462 | 1469 | 1.3 |
| deezer [26] | 143884 | 846915 | 0.3320 | 11.772 | 420 | 1.3 |
| dblp [29] | 317080 | 1049866 | 0.2665 | 6.622 | 343 | 1.5 |
| fb-Art | 50521 | 819306 | −0.019 | 32.43 | 1469 | 1.2 |
| fb-Ath | 13868 | 86858 | −0.027 | 12.52 | 468 | 1.3 |
| fb-Com | 14120 | 52310 | 0.014 | 7.40 | 215 | 1.4 |
| fb-Gov | 7058 | 89455 | 0.029 | 25.34 | 697 | 1.2 |
| fb-NS | 27930 | 206259 | 0.022 | 14.76 | 678 | 1.3 |
| fb-Pol | 5908 | 41729 | 0.018 | 14.12 | 323 | 1.3 |
| fb-PF | 11573 | 67114 | 0.202 | 11.59 | 326 | 1.4 |
| fb-TvS | 3895 | 17262 | 0.561 | 8.86 | 126 | 1.3 |
| dz-HR | 54573 | 498202 | 0.197 | 18.26 | 420 | 1.2 |
| dz-HU | 47538 | 222887 | 0.207 | 9.38 | 112 | 1.2 |
| dz-RO | 41773 | 125826 | 0.114 | 6.02 | 112 | 1.4 |

3.1 Speed and Coverage

The aim of the first experiment was to empirically validate Theorem 1. To this end, for each input network we fixed the number of seeds to 10% of the nodes and varied the propagation speed α of opinion b. Results are reported in Fig. 2 for synthetic networks and Fig. 3 for real networks. In all charts the diffusion speed α is reported on the x-axis while the percentage of nodes having opinion b after propagation is reported on the y-axis. In Fig. 2 each row reports the results obtained on a particular type of network by considering three different values of assortativity. The series correspond to various values of the opinion degree correlation of the initial configuration. Figure 3 reports in the left chart the results obtained on the real networks by considering an initial configuration having opinion degree correlation $\rho_{k,o} = 0.5$, while in the right chart an initial opinion degree correlation equal to $\rho_{k,o} = 0.3$ has been considered. In both charts the series correspond to networks.

The results of this experiment are consistent with Theorem 1 on both synthetic and real networks since, in general, larger values of α correspond to a larger coverage percentage after propagation. In particular, we noticed that for synthetic networks we generally obtained a larger variation of the coverage percentage for lower values of assortativity $r_{kk}$ (see Fig. 2). As for scale-free networks, we noticed that, in general, greater values of the power law exponent λ correspond to a more regular increase of the coverage percentage with respect to α (i.e., lower values of λ require larger values of α to obtain a significant increase in the coverage percentage). Furthermore, for Erdős–Rényi networks and for scale-free networks with a high value of λ, when small values of α are considered it can be noted that the coverage after propagation is higher for lower values of opinion degree correlation $\rho_{k,o}$ (see Fig. 2). Finally, by analyzing Fig. 3, it can be noticed that larger values of opinion degree correlation $\rho_{k,o}$ correspond to both larger coverage values at the end of propagation and a more regular increase of the coverage percentage with respect to α. Indeed, consider for example the fb-NS dataset (green line in Fig. 3): the coverage percentage of 25% is reached for α = 0.8 if $\rho_{k,o} = 0.5$ and for α ∼ 0.95 if $\rho_{k,o} = 0.3$.

Fig. 2. Coverage percentage of b nodes after propagation on the synthetic networks.

Fig. 3. Coverage percentage of b nodes after propagation on the real networks.


Fig. 4. Coverage percentage of b nodes after propagation on the real networks for different initial configurations.

3.2 Speed and Consensus

In the second experiment we want to empirically validate Theorem 2. To this aim, for each real network we varied the number of seeds from 5% to 25% of the nodes of the network and analyzed the number of nodes having opinion b at the end of the propagation for different diffusion speeds α. In this experiment we always considered an initial configuration having opinion degree correlation ρk,o equals to 0.5. Results are reported in Fig. 4. In particular, in all charts the diffusion speed α is reported on the x-axis while the percentage of nodes having opinion b after propagation is reported on the y-axis. Each chart in the figure refers to a different network and series correspond to different initial percentage of nodes having opinion b. By looking at the results it can be concluded that Theorem 2 holds on all networks. In fact, each coverage percentage is reached for smallest values of α if the initial number of seed |S| is increased. For example,


by looking at the Facebook network (top left chart in Fig. 4) it can be noted that the final coverage percentage of 40% is reached for α = 1 if |S| = 5%|N|, for α = 0.7 if |S| = 15%|N| and for α = 0 if |S| = 25%|N|. It can also be noted that larger values of assortativity generally correspond to a slower increase in the coverage percentage with respect to the diffusion speed.

4 Conclusion

We have studied the relationship between speed and coverage of diffusion in a competitive environment, by considering two orthogonal perspectives. On the one hand, after having fixed some initial configuration (set of seeds), we have analyzed how the speed of an opinion affects the number of agents that will eventually adopt some desired opinion. On the other hand, we have analyzed how the minimum number of seeds required to reach consensus varies depending on the speed at which that opinion propagates. The model we have adopted is a natural variant of the linear threshold model, and we have focused on the well-studied setting of majority agents [1,2,7,9,20,24,25]. Hence, the most natural avenue for further research is to conduct an analysis similar to the one discussed in this paper for other settings characterized by different thresholds.

References

1. Auletta, V., Caragiannis, I., Ferraioli, D., Galdi, C., Persiano, G.: Minority becomes majority in social networks. In: Proceedings of WINE 2015, pp. 74–88 (2015)
2. Auletta, V., Ferraioli, D., Fionda, V., Greco, G.: Maximizing the spread of an opinion when tertium datur est. In: Proceedings of AAMAS 2019, pp. 1207–1215 (2019)
3. Auletta, V., Ferraioli, D., Greco, G.: Reasoning about consensus when opinions diffuse through majority dynamics. In: Proceedings of IJCAI 2018, pp. 49–55 (2018)
4. Bharathi, S., Kempe, D., Salek, M.: Competitive influence maximization in social networks. In: Proceedings of WINE 2007, pp. 306–311 (2007)
5. Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35, 99–109 (1943)
6. Borodin, A., Filmus, Y., Oren, J.: Threshold models for competitive influence in social networks. In: Saberi, A. (ed.) Proceedings of WINE 2010, pp. 539–550 (2010)
7. Bredereck, R., Elkind, E.: Manipulating opinion diffusion in social networks. In: Proceedings of IJCAI 2017, pp. 894–900 (2017)
8. Budak, C., Agrawal, D., El Abbadi, A.: Limiting the spread of misinformation in social networks. In: Proceedings of WWW 2011, pp. 665–674 (2011)
9. Chen, N.: On the approximability of influence in social networks. SIAM J. Discrete Math. 23(3), 1400–1415 (2009)
10. Chen, W., Collins, A., Cummings, R., Ke, T., Liu, Z., Rincon, D., Sun, X., Wei, W., Wang, Y., Yuan, Y.: Influence maximization in social networks when negative opinions may emerge and propagate. In: Proceedings of SDM 2011, pp. 379–390 (2011)
11. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, Cambridge (2010)
12. Fazli, M.A., Ghodsi, M., Habibi, J., Jalaly, P., Mirrokni, V., Sadeghian, S.: On non-progressive spread of influence through social networks. Theoret. Comput. Sci. 550, 36–50 (2014)
13. Frischknecht, S., Keller, B., Wattenhofer, R.: Convergence in (social) influence networks. In: Afek, Y. (ed.) Proceedings of DISC 2013, pp. 433–446 (2013)
14. Goles, E., Olivos, J.: Periodic behaviour of generalized threshold functions. Discrete Math. 30(2), 187–189 (1980)
15. Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in social networks. In: Proceedings of WSDM 2010, pp. 241–250 (2010)
16. Granovetter, M.: The strength of weak ties. Am. J. Sociol. 78(6), 1360–1380 (1973)
17. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6), 1420–1443 (1978)
18. Gursoy, F., Gunnec, D.: Influence maximization in social networks under deterministic linear threshold model. Knowl.-Based Syst. 161, 111–123 (2018)
19. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. Theory Comput. 11(4), 105–147 (2015)
20. Khoshkhah, K., Soltani, H., Zaker, M.: On dynamic monopolies of graphs: the average and strict majority thresholds. Discrete Optim. 9(2), 77–83 (2012)
21. Lerman, K., Yan, X., Wu, X.-Z.: The “majority illusion” in social networks. PLoS One 11(2), e0147617+ (2016)
22. Lou, V.Y., Bhagat, S., Lakshmanan, L.V.S., Vaswani, S.: Modeling non-progressive phenomena for influence propagation. In: Proceedings of COSN 2014, pp. 131–138 (2014)
23. Newman, M.: Assortative mixing in networks. Phys. Rev. Lett. 89(20), 208701+ (2002)
24. Peleg, D.: Size bounds for dynamic monopolies. Discrete Appl. Math. 86(2), 263–273 (1998)
25. Peleg, D.: Local majorities, coalitions and monopolies in graphs: a review. Theoret. Comput. Sci. 282(2), 231–257 (2002)
26. Rozemberczki, B., Davies, R., Sarkar, R., Sutton, C.: GEMSEC: graph embedding with self clustering. CoRR (2018)
27. Schelling, T.C.: Micromotives and Macrobehavior. W. W. Norton & Company, New York (1978)
28. Tzoumas, V., Amanatidis, C., Markakis, E.: A game-theoretic analysis of a competitive diffusion process over social networks. In: Proceedings of WINE 2012, pp. 1–14 (2012)
29. Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. In: Proceedings of ICDM-12, pp. 745–754 (2012)

Beyond Fact-Checking: Network Analysis Tools for Monitoring Disinformation in Social Media

Stefano Guarino¹,²(✉), Noemi Trino², Alessandro Chessa²,³, and Gianni Riotta²

¹ Institute for Applied Computing, National Research Council, Rome, Italy
[email protected]
² Data Lab, Luiss “Guido Carli” University, Rome, Italy
³ Linkalab, Cagliari, Italy

Abstract. Operated by the H2020 SOMA Project, the recently established Social Observatory for Disinformation and Social Media Analysis supports researchers, journalists and fact-checkers in their quest for quality information. At the core of the Observatory lies the DisInfoNet Toolbox, designed to help a wide spectrum of users understand the dynamics of (fake) news dissemination in social networks. DisInfoNet combines text mining and classification with graph analysis and visualization to offer a comprehensive and user-friendly suite. To demonstrate the potential of our Toolbox, we consider a Twitter dataset of more than 1.3M tweets focused on the Italian 2016 constitutional referendum and use DisInfoNet to: (i) track relevant news stories and reconstruct their prevalence over time and space; (ii) detect central debating communities and capture their distinctive polarization/narrative; (iii) identify influencers both globally and in specific “disinformation networks”.

Keywords: Social network analysis · Disinformation · Classification

1 Introduction

“SOMA – Social Observatory for Disinformation and Social Media Analysis” is an H2020 Project aimed at supporting, coordinating and guiding the efforts of researchers, fact-checkers and journalists countering online and social disinformation, to safeguard a fair political debate and a responsible, shared set of information for our citizens. At the core of the Observatory is a web-based collaborative platform for the verification of digital (user-generated) content and the analysis of its prevalence in the social debate, based on a special instance of (SOMA partner) ATC’s Truly Media¹. In this paper, we present the first prototype of the DisInfoNet Toolbox, designed to support the users of the SOMA verification platform in understanding the dynamics of (fake) news dissemination in social media and tracking down the origin and the broadcasters of false

¹ https://www.truly.media/.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 436–447, 2020. https://doi.org/10.1007/978-3-030-36687-2_36

Network Tools for Social Disinformation


information. We overview current features, preview future extensions, and report on the insights provided by our tools in the analysis of a Twitter dataset.

Data collected on social media is paramount for understanding disinformation disorders [7], as it is instrumental to: (i) quantitative analyses of the diffusion of unreliable news stories [1]; (ii) comprehending the relevance of disinformation in the social debate, possibly incorporating thematic, polarity or sentiment classification [34]; (iii) unveiling the structure of social ties and their impact on (dis)information flows [3]. DisInfoNet was designed to support all of the above and more: it allows tracking specific news pieces in the data and visualizing their prevalence over time and space, classifying content in a semi-automatic fashion (relying on the clustering of a keyword/hashtag co-occurrence graph), and extracting, analyzing and visualizing social interaction graphs, with embedded community detection and user classification. Additional features will soon enrich the Toolbox, such as a user-friendly interface for the Structural Topic Model [29], supporting sentiment analysis both globally and at topic level [16].

To demonstrate the potential of DisInfoNet, we also present an analysis of a dataset of over 1.3M Italian tweets dating back to November 2016 and focused on the constitutional referendum held on December 4, 2016. The significant diffusion of fake news during the political campaign before the vote, together with the dichotomous structure of referendums fostering user polarization, makes this dataset especially fit for purpose. Additionally, the distance in time from such a crucial political event makes it easier to treat sensitive issues like disinformation while avoiding the risk of recentism in analyzing social phenomena. We found evidence of a few relevant false stories in our dataset and, by relating polarization and network analysis, we were able to gain a better understanding of their patterns of production/propagation and contrast, and of the role of renowned authoritative accounts as well as outsiders and bots in driving the production and sharing of news stories. From a purely quantitative point of view, it is worth noting that our findings diverge significantly from what was observed by (SOMA partner) Pagella Politica at the time [26], underlining once more that Twitter and Facebook provide very different perspectives on society and that further support from social media platforms is paramount for the research community.

2 Related Work

As reported by a recent Science Policy Forum article [21], stemming the viral diffusion of fake news and characterizing disinformation networks largely remain open problems. Besides the technical setbacks, the existence of the so-called “continued influence effect of misinformation” is widely acknowledged among socio-political scholars [31], which calls into question the intrinsic potential of debunking in countering the proliferation of fake news. Yet, the body of research work on fake news detection and (semi-)automatic debunking is vast and heterogeneous, relying on linguistics [22], deep syntax analysis [14], knowledge networks [11], or data mining [30]. Attempts at designing an end-to-end fact-checking system exist [19], but are mostly limited to detecting and evaluating strictly factual


S. Guarino et al.

claims. Even supporting professional fact-checkers by automating stance detection is problematic, because relatedness is far easier to capture than agreement/disagreement [18]. Approaches specifically conceived for measuring the credibility of social media rumours appear to benefit from the combined effectiveness of analyzing textual features, classifying users’ posting and re-posting behaviors, examining external citation patterns, and comparing same-topic messages [5,10,35]. Unfortunately, this is well beyond what social media analytics and editorial fact-checking tools on the market permit. In this context, DisInfoNet was designed to help researchers, journalists and fact-checkers characterize the prevalence and dynamics of disinformation on social media. Recent work confirmed the general perception that, on average, fake news spreads farther, faster, deeper and more broadly than true news [1,34]. The prevalence of false information is often deemed to be caused by the presence of “fake” and automated profiles, usually called bots [6]. The role of bots in disinformation campaigns is however far from being sorted out: although bots seem to be mainly responsible for fake news production and are used to boost the perceived authority of successful (human) sources of disinformation [3], they have been found to accelerate the spread of true and false news at the same rate [34]. Models for explaining the success of false information without a direct reference to bots have also been recently proposed, based either on information overload vs. limited attention [28], or on information theory and (adversarial) noise decoding [8]. Finally, investigating the relation between polarization and information spreading has been shown to be instrumental for both uncovering the role of disinformation in a country’s political life [7] and predicting potential targets for hoaxes and fake news [33].

3 The Toolbox

DisInfoNet is a Python library built on top of well-known packages (e.g., igraph, scikit-learn, NumPy, Gensim), soon to be available under the GPL on GitLab². It provides modules for managing archives, processing and classifying text, building and analyzing graphs, and more. It is memory-efficient to support large datasets and, although a few functions are optimized for Twitter data, generally flexible. At the same time, DisInfoNet implements a pipeline designed to enable journalists and fact-checkers with no coding expertise to assess the prevalence of disinformation in social media data. This pipeline, depicted in Fig. 1, consists of three main tools which may be controlled by a single configuration file, soon to be replaced by a user-friendly dashboard embedded in the SOMA platform. One of DisInfoNet’s main features is the ability to extract and examine both keyword co-occurrence graphs and user interaction graphs induced by a specific set of themes of interest, thus providing valuable insights into the contents and the actors of the social debate around disinformation stories. The first tool of DisInfoNet’s pipeline is the Subject Finder. It filters a dataset and returns information about the prevalence of themes or news pieces

² Please contact the authors if you wish to be notified when the code is released.


Fig. 1. DisInfoNet’s main pipeline.

of interest. It uses keyword-based queries (migration to document similarity is in progress) to extract (parsed) records into a CSV file. For instance, for Twitter data it returns tweets with covariates such as author, timestamp, geolocation, retweet count, hashtags and mentions. It also plots the temporal and spatial distribution of all records and of query-matching records.

The Classifier partitions records into classes based on a semi-automatic “self-training” process. By building and clustering a keyword co-occurrence graph (which the user may prune of central yet generic and/or out-of-context keywords, detrimental to clustering), it presents the user with an excerpt of the keywords associated with the obtained classes. Significantly, this means using far more keywords than any fully manual approach would permit, without sacrificing accuracy, and possibly discovering previously unknown and highly informative keywords. The user can select and label the classes of interest, which are used to automatically extract a training set. The Classifier then selects the best-performing model among a few alternatives (currently, Logistic Regression and Gradient Boosting Classifier, with 10-fold cross-validation) and predicts a label for all records. When only two classes are used (e.g., republican vs. democratic, right- vs. left-wing, pro vs. against; discussing theme A vs. theme B), the obtained classification may also be extended to users (e.g., authors) by averaging over the classification of all records associated with a specific user.

Finally, the Graph Analyzer incorporates functions for graph mining and visualization. It first extracts a directed user interaction graph, wherein two users (e.g., authors) are connected based on how often they interact (e.g., cite each other).
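The extraction of a weighted, directed interaction graph described above can be sketched as follows. This is a minimal illustration and not DisInfoNet's code: the record fields (`author`, `retweeted`), the function name and the edge direction (from the retweeting user to the retweeted one) are all assumptions made for the example.

```python
from collections import Counter

def build_interaction_graph(records):
    """Return {(source, target): weight} for a directed, weighted graph.

    Each record is a hypothetical dict with an "author" and a list of
    "retweeted" authors; these field names are assumptions, not the
    Toolbox's actual schema.
    """
    weights = Counter()
    for rec in records:
        for target in rec.get("retweeted", []):
            if target != rec["author"]:  # ignore self-interactions
                weights[(rec["author"], target)] += 1
    return dict(weights)

# Toy records (invented for illustration only).
records = [
    {"author": "alice", "retweeted": ["bob"]},
    {"author": "alice", "retweeted": ["bob", "carol"]},
    {"author": "bob", "retweeted": ["alice"]},
    {"author": "carol", "retweeted": []},
]
edges = build_interaction_graph(records)
```

The edge weight counts how often one user interacts with another, which is the quantity the text says the connection is based on; mentions or replies could be handled the same way by swapping the field being read.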
It then computes a set of global and local metrics, including: distances, eccentricity, radius and diameter; clustering coefficient; degree and assortativity; PageRank, closeness and betweenness centrality [24]. It also partitions the graph into communities, relying on the well-known Louvain [4] or Leading Eigenvector [25] algorithms, and applies the Guimerà–Amaral cartography [17], based on


discerning inter- and intra-community connections. This results in a number of tables and plots.
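As an illustration of how a cartography based on inter- and intra-community connections can be computed, the sketch below evaluates the two standard Guimerà–Amaral quantities for each node: the within-community degree z-score and the participation coefficient P = 1 − Σ_s (k_s / k)², where k_s is the number of links a node has to community s and k is its degree. It is a dependency-free sketch on an undirected graph; the function name, the input format and the use of the population standard deviation are assumptions, and the Toolbox itself (built on igraph) may proceed differently.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cartography(adj, community):
    """Return {node: (z, P)} for an undirected graph.

    adj: {node: set(neighbors)}; community: {node: community label}.
    z is the within-community degree z-score, P the participation coefficient.
    """
    # Number of links from each node to each community.
    links = {u: defaultdict(int) for u in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            links[u][community[v]] += 1

    # Within-community degrees, grouped by community.
    within = defaultdict(list)
    for u in adj:
        within[community[u]].append(links[u][community[u]])

    roles = {}
    for u in adj:
        c = community[u]
        k = len(adj[u])
        mu, sigma = mean(within[c]), pstdev(within[c])
        z = (links[u][c] - mu) / sigma if sigma > 0 else 0.0
        p = 1.0 - sum((n / k) ** 2 for n in links[u].values()) if k > 0 else 0.0
        roles[u] = (z, p)
    return roles

# Two triangles joined by a single bridge between "c" and "d".
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
roles = cartography(adj, community)
```

In this toy graph only the two bridge endpoints have a non-zero participation coefficient, since all other nodes link exclusively to their own community.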

4 Politics and Information in 2016 Italy

The 2013 election imposed an unprecedented tri-polar equilibrium on the Italian political scene, with the 5 Stars Movement (5SM) breaking the traditional left-right framework, and the rise of the populist right party Northern League (NL). In 2016, the Italian government led by the center-left Democratic Party (PD) promoted a constitutional reform which led to a referendum, held on December 4, 2016. Both the 5SM and the NL opposed the referendum, making the NO faction a composite front supported by a wide spectrum of formations with alternative yet sometimes overlapping political justifications. In this framework, populist movements showed an extraordinary ability in setting the agenda, by imposing carefully selected instrumental news-frames and narratives that found the perfect breeding ground in Italy, the country of political disaffection par excellence [12]. New media, in particular, offered an unprecedented opportunity: to maintain a critical (even conspiratorial) attitude towards the establishment-dominated media, while enhancing the role of alternative/social media as strategic resources for community-building and alternative agenda setting [2]. In these contexts, Twitter plays a strategic role for newly born political parties, which through the activation of the two-way street mediatization may incorporate their proposals into conventional media [9]. The dichotomous structure of the referendum was however instrumental to both sides for aligning the various issues along a pro-/anti-status quo spectrum. The final victory of the NO caused Renzi’s resignation as Head of Government and paved the way for the definitive affirmation of the 5SM and the NL, who in 2018 joined forces in forming a so-called “government of change”.

4.1 Disinformation Stories

In order to identify relevant themes of disinformation in the political campaign, we relied on the activity of fact-checking and news agencies, who reported lists of fake news that went viral during the referendum campaign. Mostly based on the work by fact-checking web portal Bufale.net [23], online newspaper Il Post [27], and SOMA partner and political fact-checking agency Pagella Politica [26], we were able to identify the twelve main pieces of disinformation related to the referendum. To widen the scope of the analysis, we considered stories and speculations that reflect information disorders in a broader sense, from rumors, hearsay, clickbait items and unintentionally propagated misinformation, to conspiracy theories and organized propaganda, often used by the two sides to accuse one another. We then classified these disinformation stories into four categories: (i) the QUOTE category includes entirely fabricated quotes of public figures endorsing one or the other faction or defaming voters of the other side; (ii) the


CONSQ group of news contains manipulated interpretations of genuine information about the (potential) consequences of the reform; (iii) the PROPG category includes news inserted in a typical populist frame, opposing the people to the élite; (iv) finally, the FRAUD category questions the integrity of the electoral process, alleging unauthorized access to voting machines and the alteration of voting results. Due to page restrictions, in this paper we only study disinformation at this category level, deferring a detailed analysis at news-story level to future work. Significantly, this type of category-based approach is fully supported by DisInfoNet and easily available through the configuration file.
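To make the category-based, keyword-driven approach concrete, the following sketch tags a text with every category whose keywords it contains. It is purely illustrative: the configuration format of DisInfoNet is not described in the text, so the dictionary layout, the function name and the placeholder keywords (`quote_kw`, `consq_kw`, and so on) are all hypothetical.

```python
import re

# Hypothetical category -> keyword mapping. The real queries used for the
# twelve stories are not given in the text; these are placeholders.
CATEGORY_KEYWORDS = {
    "QUOTE": ["quote_kw"],
    "CONSQ": ["consq_kw"],
    "PROPG": ["propg_kw"],
    "FRAUD": ["fraud_kw"],
}

def match_categories(text, category_keywords=CATEGORY_KEYWORDS):
    """Return the sorted list of categories with a keyword occurring in `text`."""
    lowered = text.lower()
    matched = []
    for category, keywords in category_keywords.items():
        # Whole-word, case-insensitive match of any keyword of the category.
        if any(re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
               for kw in keywords):
            matched.append(category)
    return sorted(matched)

labels = match_categories("A tweet mentioning fraud_kw and PROPG_KW today.")
```

A record can match more than one category under this scheme; whether the Toolbox allows overlapping categories is not stated in the text.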

5 Findings

In this section, we demonstrate the potential of DisInfoNet by analyzing a dataset of more than 1.3M tweets to shed light on the dynamics of social disinformation as Italy approached the referendum.

5.1 Disinformation Prevalence

With each of the selected news stories represented by a suitable keyword-based query, we ran the Subject Finder to identify our set of disinformation tweets, have them labelled with categories, and obtain the plots in Fig. 2 showing their temporal and geographical distribution. In Fig. 2a we see the one-day rolling mean of the four classes across November 2016, compared with the overall trend. The presence of disinformation in the dataset is limited, yet non-negligible: except for QUOTE tweets, each of the other three classes accounts for roughly 0.3–0.6% of the records (see the tweet counts in Table 1). The volume of discussion about fake/distorted news stories does not simply increase as the referendum approaches, as the general discussion does; rather, different stories have different spikes, possibly related to events (e.g., a politician giving an interview) or to the activity of some influencer. Regarding the geography of the debate, we found that only 29716 tweets (2.21% of the whole dataset) were geotagged, and this percentage is even lower (≈1%) among disinformation tweets (see Table 1 for details), possibly because users involved in this type of discussion are more concerned about privacy than average. The map, reported in Fig. 2b, shows some activity in Great Britain and the Benelux area, but disinformation topics appear to be substantially absent outside Italy.

5.2 Polarization and Disinformation

The Classifier can now be used to gain a better understanding of the relation between polarization and disinformation in our dataset. During the semi-automatic self-training process, we pruned a few central but out-of-context hashtags (e.g., “#photo” and “#trendingtopic”) and let the Classifier run Louvain’s algorithm and plot the hashtag graph. This graph, reported in Fig. 3, shows that: (i) hashtags used by the NO and YES supporters are

Fig. 2. The temporal and spatial distribution of disinformation tweets. (a) Temporal distribution by class. (b) Spatial distribution by class.

strongly clustered; (ii) “neutral” hashtags (such as those used by international reporters) also cluster together; (iii) a few hashtags are surprisingly high-ranked, such as “#ottoemezzo”, a popular and supposedly impartial political talk-show, which is central in the NO cluster, thus confirming regular patterns of behavior in the “second-screen” use of social network sites to comment on television programs [32]. In particular, it is easy to identify two large clusters of hashtags clearly characterizing the two sides: the YES cluster is dominated by the hashtags “#bastaunsì” (“a yes is enough”) and “#iovotosi” (“I vote yes”), whereas the NO cluster is dominated by “#iovotono” (“I vote no”), “#iodicono” (“I say no”) and “#renziacasa” (“Renzi go home”). In this perspective, both communities show clear segregation and high levels of clustering by political alignment, thus confirming the hypothesis of social-media platforms as echo chambers, with political exchanges exhibiting “a highly partisan community structure with two homogeneous clusters of users who tend to share the same political identity” [12]. By interacting with the Classifier, we selected the aforementioned YES and NO clusters as the sets of hashtags to be used for building a training set. Labelling works as follows: −1 (NO) if the tweet only contains hashtags from the NO cluster; +1 (YES) if the tweet only contains hashtags from the YES cluster; 0 (UNK) if the tweet contains a mix of hashtags from the two clusters. Significantly, we also obtained a continuous score in [−1, 1] for each user, as the average score of the user’s tweets. When run after the Subject Finder, the Classifier also plots a histogram that helps relate classification and disinformation, reported in Fig. 3b.
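The labelling rule and the per-user score just described can be sketched as follows. This is an illustrative reading rather than the Toolbox's implementation: the function names are hypothetical, hashtags outside both clusters are assumed to be ignored, and the user score here averages the hashtag-based labels directly, whereas in the pipeline the averaged labels would presumably be those predicted by the trained model.

```python
from collections import defaultdict

def label_tweet(hashtags, yes_tags, no_tags):
    """Return +1 (YES), -1 (NO), 0 (UNK, mixed), or None if no cluster hashtag occurs."""
    has_yes = any(h in yes_tags for h in hashtags)
    has_no = any(h in no_tags for h in hashtags)
    if has_yes and has_no:
        return 0
    if has_yes:
        return 1
    if has_no:
        return -1
    return None  # carries no cluster hashtag: not usable for the training set

def user_scores(tweets, yes_tags, no_tags):
    """Average the labels of each user's labelled tweets into a score in [-1, 1]."""
    per_user = defaultdict(list)
    for author, hashtags in tweets:
        label = label_tweet(hashtags, yes_tags, no_tags)
        if label is not None:
            per_user[author].append(label)
    return {user: sum(vals) / len(vals) for user, vals in per_user.items()}

# Hashtags quoted in the text; the tweets themselves are invented examples.
YES_TAGS = {"#bastaunsì", "#iovotosi"}
NO_TAGS = {"#iovotono", "#iodicono", "#renziacasa"}
tweets = [
    ("u1", ["#iovotono", "#renziacasa"]),
    ("u1", ["#iovotosi"]),
    ("u2", ["#bastaunsì"]),
    ("u3", ["#iovotono", "#iovotosi"]),
    ("u4", ["#photo"]),
]
scores = user_scores(tweets, YES_TAGS, NO_TAGS)
```

A score close to −1 or +1 indicates a consistently polarized user, while values near 0 arise either from mixed tweets or from a user posting on both sides.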
We immediately see that UNK tweets are substantially negligible, while NO tweets are almost 1.5× more frequent than YES tweets, supporting the widespread belief that the NO front was significantly more active than its counterpart in the social debate. Disinformation news stories mostly follow the general trend, but: (i) topics of the QUOTE and PROPG classes, which gather attack vectors frequently used by the populist parties, are especially popular among NO supporters (hence, debunking efforts are invisible); (ii) on the other hand, YES supporters are more active than average in the CONSQ

Fig. 3. The hashtag graph and the classification results. (a) The hashtag graph, with clusters highlighted, vertex size by PageRank. (b) The polarization of tweets, in total and for the four disinformation classes.

topics, probably due to the concurrent attempts at promoting the referendum and at tackling the fears of potential NO voters.

5.3 Interaction Graphs and Disinformation

Finally, we used the Graph Analyzer to better understand the dynamics of disinformation networks in our dataset. Due to page restrictions, in the following we only focus on retweets and on the CONSQ and PROPG disinformation classes, leaving a more detailed analysis to future work. Among the three supported types of interaction, in fact, retweeting is the simplest endorsement tool [20], commonly used for promoting ideas and campaigns and for community building, possibly relying on semi-automatic accounts. On the other hand, the CONSQ and PROPG classes appeared to be the most informative, for both their different polarity distribution and their almost non-intersecting sets of influencers. First of all, we obtained a number of macroscopic descriptors that yield insights into the structural similarities and differences of the two graphs, reported in Table 1. The CONSQ and PROPG graphs are similar in size (2755 vertices and 3786 edges vs. 2126 and 2886) and have similarly sized in- and out-hubs (628 and 16 vs. 653 and 18), but the diameter of the CONSQ graph is significantly smaller (12 vs. 30) despite it having a larger average distance (2.73 vs. 1.64). These numbers suggest that PROPG disinformation stories travelled less on average, but were sporadically able to reach very peripheral users. Additionally, we see that the clustering coefficient of the two graphs is almost identical and rather small (≈0.004), more than one order of magnitude smaller than the clustering coefficient of the whole graph. This suggests that these disinformation networks

Table 1. Dataset overview. The last seven columns refer to the retweet graph.

          Tweets   Geotags (%)    Vertices  Edges   Max in-deg.  Max out-deg.  Clustering  Diam.  Avg. dist.
Dataset   1344216  29716 (2.21%)  72574     451423  4813         1541          0.0483      149    4.81044
CONSQ     7909     71 (0.90%)     2755      3786    628          16            0.0039      12     2.72581
PROPG     4345     47 (1.08%)     2126      2886    653          18            0.00385     30     1.63941
FRAUD     5362     69 (1.29%)     2195      3452    692          13            0.00321     8      2.45673
QUOTE     57       1 (1.75%)      9         8       8            1             0.0         1      1.0

may not be “self-organizing” and their structure might be governed by artificial diffusion patterns. For a closer analysis, Fig. 4 shows, for both classes, the network composed of the top 500 users by PageRank. In these plots, users are colored by their polarity and edges take the average color of the connected vertices. The size of a vertex is proportional to its PageRank, whereas the width of an edge is proportional to its weight, i.e., the number of interactions between the two users. These plots highlight a number of interesting aspects. First of all, the NO front appears to be generally dominant, with relevant YES actors only emerging in the debate on the alleged consequences of the referendum. Also, there seems to be limited interaction between YES and NO supporters, as can be noted by the fact that edges almost always link vertices of similar or even identical color. Among the leaders of the NO front, we find well-known public figures (e.g., politicians Renato Brunetta and Fabio Massimo Castaldo in the PROPG graph) along with accounts not associated with any publicly known individual. In most cases, these are militants of the NO front, sometimes having multiple aliases, whose activity is characterized by a high number of retweets and mentions of well-known actors belonging to the same community (e.g., Antonio Bordin, Claudio Degl’Innocenti, Angelo Sisca, Liberati Linda). Additional insights can be gained by using Truthnest³, a tool developed by SOMA partner ATC, which reports analytics on the usage patterns of a specified account, summarized into a bot-likelihood score. One of the most influential nodes of the PROPG graph, @INarratore, turned out to have a suspiciously high 60% bot-score, as well as only 1% of original tweets and a considerable number of “suspicious followers”. In the same graph, @dukana2 has a 50% bot-score, while the account @advalita has been suspended from Twitter.
In the CONSQ graph, the most central user is @ClaudioDeglinn2, characterized by a relatively low 10% bot-score, but apparently in control of at least 7 other aliases and strongly connected with other amplification accounts. Two of these “amplifiers” are especially noteworthy: @IPredicatore, having a 40% bot-score, and @PatriotaIl, having a 30% bot-score, mentioning @ClaudioDeglinn2 in more than 20% of its tweets, and producing only 3% original tweets. Altogether, we seem to have found indicators of coordinated efforts to avoid bot detection tools while reaching peripheral users and expanding the network.
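The selection of the most central users by PageRank, used above to build the plots of Fig. 4, can be illustrated with a short sketch. The Toolbox relies on igraph for its metrics; the code below instead uses a plain, dependency-free power iteration on a weighted directed graph, so the function names, the damping factor of 0.85, the fixed number of iterations and the edge direction (retweeter to retweeted) are assumptions of this example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted directed graph.

    edges: {(source, target): weight}. Returns {node: score}; scores sum to 1.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (src, _), w in edges.items():
        out_weight[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank

def top_k(rank, k):
    """Return the k nodes with the highest score, best first."""
    return sorted(rank, key=rank.get, reverse=True)[:k]

# Toy retweet graph: three users retweet "hub", which retweets "a" once.
edges = {("a", "hub"): 3, ("b", "hub"): 1, ("c", "hub"): 2, ("hub", "a"): 1}
rank = pagerank(edges)
leaders = top_k(rank, 2)
```

Restricting a plot to the top-ranked users then amounts to keeping only the edges whose endpoints both appear in the returned list.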

³ https://app.truthnest.com/.

Fig. 4. 500 top users by PageRank. Color is by polarity, size by PageRank. (a) The PROPG graph. (b) The CONSQ graph.

6 Conclusion

In this paper, we publicly presented, to both the scientific and the fact-checking community, an integrated toolbox for monitoring social disinformation, conceived as part of the H2020 Social Observatory for Disinformation and Social Media Analysis. Our DisInfoNet Toolbox builds on well-established techniques for text and graph mining to provide a wide spectrum of users with instruments for quantifying the prevalence of disinformation and understanding its dynamics of diffusion on social media. We presented a case study analysis focused on the 2016 Italian constitutional referendum, wherein the natural bipolar political structure of the debate helps in reducing one of the most frequent problems in opinion detection on social media, related to the identification of all possible political orientations (associated with communities). Following the literature [12,15], we resorted to retweets in order to analyze accounts and their interactions according to their possible political orientation. The combined analysis of political communities and network clustering and centrality shows how the referendum caused a clear segregation by political alignment [13], indicating the existence of different echo chambers. From a thematic point of view, news stories related to conspiracy theories and distrust of the political élite were especially popular and traveled deeper than any other category of disinformation. We found evidence of a correlation between users’ polarization and participation in disinformation campaigns, and by highlighting the primary actors of disinformation production and propagation we could manually tell apart public figures, activists and potential bots. Our DisInfoNet Toolbox will soon be available online and will be extended in the near future. We believe that the state-of-the-art techniques for classification and network analysis embedded in the Toolbox will pave the way for future

446

S. Guarino et al.

research in the area, crucial to the preservation of our public conversation and the future of our democracies.

References

1. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. J. Econ. Perspect. 31(2), 211–36 (2017)
2. Alonso-Muñoz, L., Casero-Ripollés, A.: Communication of European populist leaders on Twitter: agenda setting and the 'more is less' effect. Prof. Inform. 27(6), 1193–1202 (2018)
3. Bessi, A., Ferrara, E.: Social bots distort the 2016 US presidential election online discussion (2016)
4. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008)
5. Boididou, C., Middleton, S.E., Jin, Z., Papadopoulos, S., Dang-Nguyen, D.T., Boato, G., Kompatsiaris, Y.: Verifying information with multimedia content on Twitter. Multimed. Tools Appl. 77(12), 15545–15571 (2018)
6. Boshmaf, Y., Muslukhov, I., Beznosov, K., Ripeanu, M.: Design and analysis of a social botnet. Comput. Netw. 57(2), 556–578 (2013)
7. Bovet, A., Makse, H.A.: Influence of fake news in Twitter during the 2016 US presidential election. Nat. Commun. 10(1), 7 (2019)
8. Brody, D.C., Meier, D.M.: How to model fake news. arXiv preprint arXiv:1809.00964 (2018)
9. Casero-Ripollés, A., Feenstra, R.A., Tormey, S.: Old and new media logics in an electoral campaign: the case of Podemos and the two-way street mediatization of politics. Int. J. Press/Polit. 21(3), 378–397 (2016)
10. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, pp. 675–684. ACM (2011)
11. Ciampaglia, G.L., Shiralkar, P., Rocha, L.M., Bollen, J., Menczer, F., Flammini, A.: Computational fact checking from knowledge networks. PLoS One 10(6), e0128193 (2015)
12. Conover, M., Ratkiewicz, J., Francisco, M.R., Gonçalves, B., Menczer, F., Flammini, A.: Political polarization on Twitter. In: ICWSM, vol. 133, pp. 89–96 (2011)
13. Conover, M.D., Gonçalves, B., Flammini, A., Menczer, F.: Partisan asymmetries in online political activity. EPJ Data Sci. 1(1), 6 (2012)
14. Feng, V.W., Hirst, G.: Detecting deceptive opinions with profile compatibility. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 338–346 (2013)
15. Garimella, K., Weber, I.: A long-term analysis of polarization on Twitter. CoRR abs/1703.02769 (2017)
16. Guarino, S., Santoro, M.: Multi-word structural topic modelling of ToR drug marketplaces. In: 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pp. 269–273. IEEE (2018)
17. Guimerà, R., Nunes Amaral, L.: Functional cartography of complex metabolic networks. Nature 433, 895 (2005)
18. Hanselowski, A., PVS, A., Schiller, B., Caspelherr, F., Chaudhuri, D., Meyer, C.M., Gurevych, I.: A retrospective analysis of the fake news challenge stance detection task. arXiv:1806.05180 (2018)
19. Hassan, N., Zhang, G., Arslan, F., Caraballo, J., Jimenez, D., Gawsane, S., Hasan, S., Joseph, M., Kulkarni, A., Nayak, A.K., et al.: ClaimBuster: the first-ever end-to-end fact-checking system. Proc. VLDB Endow. 10(12), 1945–1948 (2017)
20. Kantrowitz, A.: The man who built the retweet: "we handed a loaded weapon to 4-year-olds" (2019). www.buzzfeednews.com/article/alexkantrowitz/how-theretweet-ruined-the-internet. Accessed 05 Aug 2019
21. Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Metzger, M.J., Nyhan, B., Pennycook, G., Rothschild, D., et al.: The science of fake news. Science 359(6380), 1094–1096 (2018)
22. Markowitz, D.M., Hancock, J.T.: Linguistic traces of a scientific fraud: the case of Diederik Stapel. PLoS One 9(8), e105937 (2014)
23. Mastinu, L.: TOP 10 Bufale e disinformazione sul Referendum (2016). www.bufale.net/top-10-bufale-e-disinformazione-sul-referendum/. Accessed 05 Jul 2019
24. Newman, M., Barabasi, A.L., Watts, D.J. (eds.): The Structure and Dynamics of Networks. Princeton University Press, Princeton (2006)
25. Newman, M.E.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74(3), 036104 (2006)
26. Politica, R.P.: La notizia più condivisa sul referendum? È una bufala (2016). https://pagellapolitica.it/blog/show/148/la-notizia-pi%C3%B9-condivisa-sul-referendum-%C3%A8-una-bufala. Accessed 05 Jul 2019
27. Post, R.I.: Nove bufale sul referendum (2016). www.ilpost.it/2016/12/02/bufalereferendum/. Accessed 05 Jul 2019
28. Qiu, X., Oliveira, D.F., Shirazi, A.S., Flammini, A., Menczer, F.: Limited individual attention and online virality of low-quality information. Nat. Hum. Behav. 1(7), 0132 (2017)
29. Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S.K., Albertson, B., Rand, D.G.: Structural topic models for open-ended survey responses. Am. J. Polit. Sci. 58(4), 1064–1082 (2014)
30. Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on social media: a data mining perspective. ACM SIGKDD Explor. Newsl. 19(1), 22–36 (2017)
31. Skurnik, I., Yoon, C., Park, D.C., Schwarz, N.: How warnings about false claims become recommendations. J. Consum. Res. 31(4), 713–724 (2005)
32. Trilling, D.: Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Soc. Sci. Comput. Rev. 33(3), 259–276 (2015)
33. Vicario, M.D., Quattrociocchi, W., Scala, A., Zollo, F.: Polarization and fake news: early warning of potential misinformation targets. ACM Trans. Web (TWEB) 13(2), 10 (2019)
34. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)
35. Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., Procter, R.: Detection and resolution of rumours in social media: a survey. ACM Comput. Surv. (CSUR) 51(2), 32 (2018)

Suppressing Information Diffusion via Link Blocking in Temporal Networks

Xiu-Xiu Zhan, Alan Hanjalic, and Huijuan Wang(B)

Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands
[email protected]

Abstract. In this paper, we explore how to effectively suppress the diffusion of (mis)information by blocking/removing the temporal contacts between selected node pairs. Information diffusion can be modelled as, e.g., an SI (Susceptible-Infected) spreading process on a temporal social network: an infected (information-possessing) node spreads the information to a susceptible node whenever a contact happens between the two nodes. Specifically, the link (node pair) blocking intervention is introduced for a given period and for a given number of links, limited by the intervention cost. We address the question: which links should be blocked in order to minimize the average prevalence over time? We propose a class of link properties (centrality metrics) based on the information diffusion backbone [19], which characterizes the contacts that actually appear in diffusion trajectories. Centrality metrics of the integrated static network are also considered. For each centrality metric, the links with the highest values are blocked for the given period. Empirical results on eight temporal network datasets show that the diffusion backbone based centrality metrics outperform the other metrics, whereas the betweenness of the static network performs reasonably well, especially when the prevalence grows slowly over time.

Keywords: Link blocking · Link centrality · Information diffusion backbone · Temporal network · SI spreading

1 Introduction

The development of sensor technology and electronic communication services provides us with access to rich human interaction data, including proximity data such as human face-to-face contacts and electronic communication data such as email exchanges, message exchanges and phone calls [6,14,18]. The recorded human interactions can be represented as temporal networks, in which each interaction is represented as a contact at a given time step between two nodes. The availability of such social temporal networks inspires us to explore further how to suppress the diffusion of (mis)information that unfolds on them. One possible intervention is to block the links (i.e., remove contacts between node pairs), but only for a given period and for given node pairs, limited by the intervention cost. In this work, we address the question: which links should we block for a given period in order to minimize the prevalence averaged over time, i.e., to prevent or delay the diffusion on temporal networks?

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 448–458, 2020. https://doi.org/10.1007/978-3-030-36687-2_37

Progress has been made recently in understanding, e.g., which temporal topological properties (temporal centrality metrics) a node should have to be selected as the seed node that starts the information diffusion in order to maximize the final prevalence [3,5,8,13,15,16], and which temporal topological properties make a link appear more frequently in a diffusion trajectory [19]. These works explored in general the relation between a node's or link's topological properties and its role in a dynamic process on a temporal network. Our question, which links should be blocked to suppress information diffusion, reveals the role of a link within a given period in a diffusion process in relation to the link's temporal topological properties.

As a starting point, we consider the Susceptible-Infected (SI) model as the information diffusion process. A seed node possesses the information (is infected) at time $t = 0$ whereas all the other nodes are susceptible. An infected node spreads the information to a susceptible node whenever a contact happens between the two nodes. Given a temporal network within the observation time window $[0, T]$, we would like to choose a given number of links to block within a period $[t_s, t_e]$ in order to suppress the diffusion. We propose a comprehensive set of link centrality metrics that characterize diverse temporal topological properties. Each centrality metric is used to rank the links, and we remove the links with the highest centrality values for the period $[t_s, t_e]$. One group of centrality metrics is based on the information diffusion backbone [19], which characterizes how the contacts appear in a diffusion trajectory and thus contribute to the diffusion process.
Centrality metrics of the integrated static network, where two nodes are connected if they have at least one contact, are also considered. We also propose the temporal link gravity, generalized from the static node gravity model [9]. We run the SI spreading on the original temporal network as well as on the temporal network after link blocking. The difference in prevalence accumulated over time is used to evaluate the performance of the link blocking strategies/metrics. Our experiments on eight real-world temporal networks show that the diffusion backbone based metrics and the betweenness of the static integrated network clearly outperform the rest. The backbone based metrics perform better when the prevalence increases fast over time, whereas the betweenness of the static network performs better when the prevalence increases slowly. This observation holds for diverse choices of the blocking period $[t_s, t_e]$ and of the number of links to block. Our finding points out that both temporal and static centrality metrics, with different computational complexities, are crucial in identifying a link's role in a dynamic process.

The rest of the paper is organized as follows. We propose the methodology in Sect. 2. In Sect. 2.1, the representation of a temporal network is introduced. In Sect. 2.2, the construction of the diffusion backbone is illustrated. Afterwards, we propose the link centrality metrics in Sect. 2.3. In Sect. 2.4, the link blocking procedure and the performance evaluation method are given. We describe the empirical temporal networks used in this work in Sect. 3. The results of the link blocking strategies on the empirical temporal networks are analyzed in Sect. 4. We conclude our paper in Sect. 5.

2 Methods

2.1 Representation of Temporal Networks

A temporal network within a given time window $[0, T]$ is represented as $G = (\mathcal{N}, \mathcal{L})$, where $\mathcal{N}$ denotes the node set and the number of nodes is $N = |\mathcal{N}|$. The contact set $\mathcal{L} = \{l(j, k, t),\ t \in [0, T],\ j, k \in \mathcal{N}\}$ contains the element $l(j, k, t)$ representing that a contact between nodes $j$ and $k$ occurs at time step $t$. The integrated weighted network of $G$ is denoted by $G_W = (\mathcal{N}, \mathcal{L}_W)$. The weight $w_{jk}$ of link $l(j, k)$ counts the number of contacts between node $j$ and node $k$.
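The representation above can be illustrated with a minimal Python sketch. The contact list is a hypothetical toy example (not one of the datasets used in this paper), and the function name `integrated_weights` is an illustrative choice; the sketch only shows how the weights of the integrated network follow from the contact set.

```python
from collections import Counter

# Hypothetical toy contacts (j, k, t): nodes j and k are in contact at step t.
contacts = [(1, 2, 1), (2, 3, 1), (1, 2, 3), (3, 4, 4), (2, 3, 5), (1, 2, 6)]

def integrated_weights(contacts):
    """Return {(j, k): w_jk} with j < k, counting the contacts of each node pair."""
    weights = Counter()
    for j, k, _t in contacts:
        weights[(min(j, k), max(j, k))] += 1
    return dict(weights)

weights = integrated_weights(contacts)
nodes = sorted({n for j, k, _t in contacts for n in (j, k)})
```

Storing each node pair with the smaller label first treats contacts as undirected, which is an assumption of this sketch.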

2.2 Information Diffusion Backbone

The information diffusion backbone was proposed to characterize how node pairs appear in a diffusion trajectory and thus contribute to the actual diffusion process [19]. To illustrate our method, we construct the backbone for the SI model with infection probability $\beta = 1$, which means that an infected node infects a susceptible node with certainty whenever the two nodes have a contact. The backbone can also be constructed for the SI model with any infection probability $\beta \in [0, 1]$. We first record the spreading tree $\mathcal{T}_i$ of each node $i$ by setting $i$ as the seed of the SI spreading process starting at $t = 0$. The spreading tree $\mathcal{T}_i$ is the union of the contacts through which the information propagates. The diffusion backbone $G_B$ is defined as the union of all the spreading trees, i.e.,

$$G_B = (\mathcal{N}, \mathcal{L}_B) = \bigcup_{i=1}^{N} \mathcal{T}_i .$$

We use $\mathcal{N}$ and $\mathcal{L}_B$ to represent the node set and the link set, respectively. Each link $l(j, k)$ in $\mathcal{L}_B$ is associated with a weight $w^B_{jk}$, counting the number of contacts between $j$ and $k$ that appear in the diffusion trees (trajectories) initiated from every node. An example of how we construct the diffusion backbone $G_B$ is given in Fig. 1(a–c).
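A minimal Python sketch of this construction for $\beta = 1$ is given below. The toy contact list is hypothetical, and two details are assumptions of the sketch rather than statements from the paper: a node infected at step $t$ can transmit only from the next step on, and if several infected nodes contact the same susceptible node at the same step, the first listed contact is recorded in the tree.

```python
from collections import defaultdict

def spreading_tree(contacts, seed):
    """SI spreading with beta = 1 from `seed`. Returns the contacts (j, k, t)
    through which the information travelled, oriented from infector j to k."""
    by_time = defaultdict(list)
    for j, k, t in contacts:
        by_time[t].append((j, k))
    infected = {seed}
    tree = []
    for t in sorted(by_time):
        newly = {}
        for j, k in by_time[t]:
            for src, dst in ((j, k), (k, j)):
                # Assumption: only nodes infected before step t transmit, and
                # the first listed contact wins if several could infect dst.
                if src in infected and dst not in infected and dst not in newly:
                    newly[dst] = (src, dst, t)
        infected.update(newly)
        tree.extend(newly.values())
    return tree

def diffusion_backbone(contacts):
    """Backbone weights w^B_jk: how often a contact between j and k appears
    in the spreading trees rooted at every node."""
    nodes = {n for j, k, _t in contacts for n in (j, k)}
    w_B = defaultdict(int)
    for seed in nodes:
        for src, dst, _t in spreading_tree(contacts, seed):
            w_B[(min(src, dst), max(src, dst))] += 1
    return dict(w_B)

# Hypothetical toy network with 4 nodes and 4 time steps.
contacts = [(1, 2, 1), (2, 3, 2), (1, 3, 3), (3, 4, 4)]
tree_1 = spreading_tree(contacts, seed=1)
backbone = diffusion_backbone(contacts)
```

In this toy example the link between nodes 3 and 4 carries the information in all four trees, so it receives the largest backbone weight.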

2.3 Link Centrality Metrics

We first propose three backbone based link centrality metrics:

• Backbone Weight. The backbone weight $w^B_{jk}$ of a link $l(j, k)$ counts how many times the link or its contacts appear in spreading trees (trajectories) initialized from every node.
• Time-confined Backbone Weight $[t_s, t_e]$. Furthermore, we define the time-confined information diffusion backbone $G_{B^*}$, which generalizes our previous backbone definition. The backbone $G_{B^*}$ confined within a time window $[t_s, t_e]$ is the union of all the spreading trees but only of the contacts that occur within $[t_s, t_e]$. Hence, two nodes in $G_{B^*}$ are connected if at least one contact between them within $[t_s, t_e]$ appears in a diffusion tree rooted at any node. The weight $w^{B^*}_{jk}$ of link $l(j, k)$ in $G_{B^*}$ equals the number of times that contacts between $j$ and $k$ within $[t_s, t_e]$ appear in the spreading trees rooted at every node. The link weight in $G_{B^*}$ characterizes how frequently a link, within $[t_s, t_e]$, contributes to the information diffusion. An example of the time-confined backbone construction is given in Fig. 1(d), where $t_s = 2$, $t_e = 5$. Take link $l(2, 4)$ as an example. It appears in the spreading trees twice, both at time step $t_1$, which is outside the range $[t_s = 2, t_e = 5]$. Therefore, $w^{B^*}_{24} = 0$. Link $l(2, 3)$ appears at time steps $t_8, t_3, t_3, t_3, t_3$ in the spreading trees, and only the time step $t_8$ is outside the range $[2, 5]$. Hence, $w^{B^*}_{23} = 4$.
• Backbone Betweenness. The backbone betweenness is defined to measure the link influence in disseminating global information. Given a spreading tree $\mathcal{T}_i$, the number of descendant nodes of link $l(j, k)$ is denoted as $B^i_{jk}$. We define the backbone betweenness $B_{jk}$ of link $l(j, k)$ as the average number of descendant nodes over all the spreading trees, i.e., $B_{jk} = \frac{1}{N} \sum_{i \in \mathcal{N}} B^i_{jk}$.

Fig. 1. (a) A temporal network $G$ with $N = 5$ nodes and $T = 8$ time steps. (b) Spreading trees rooted at every seed node. The time step on each link denotes the time of the contact through which information diffuses. (c) The diffusion backbone $G_B$. (d) Diffusion backbone $G_{B^*}$ confined within $t_s = 2$, $t_e = 5$. When we consider only the contacts that appear in the time window $[t_s, t_e] = [2, 5]$, the value on the link shows the link weight in $G_{B^*}$.

We also consider the following centrality metrics derived from the integrated weighted network. Only the links in the integrated network are candidates for blocking. All the following metrics are zero for a node pair that is not connected in the integrated network.

• Degree Product. The degree product of a link $l(j, k)$ is the product of the degrees of its two end nodes in $G_W$, i.e., $d_j \cdot d_k$.

• Strength Product. The node strength of a node $j$ in $G_W$ is defined as $s_j = \sum_{k \in \Gamma_j} w_{jk}$, where $\Gamma_j$ is the neighbor set of node $j$. Hence, the strength of a node equals the total weight of all the links incident to this node. We define the strength product of a link $l(j, k)$ as $s_j \cdot s_k$.
• Static Betweenness. The static betweenness centrality of a link is the number of shortest paths between all node pairs that pass through the link. To compute the shortest paths, we define the distance of each link in the integrated network $G_W$ as inversely proportional to its link weight in $G_W$. This choice follows the assumption that links with a higher weight in $G_W$ can spread information faster [12].
• Link Weight. The link weight $w_{jk}$ of a link $l(j, k)$ in $G_W$ is the total number of contacts between nodes $j$ and $k$ in the temporal network $G$ within the observation window $[0, T]$.
• Time-confined Link Weight $[t_s, t_e]$ refers to the number of contacts between the two end nodes that occur in $[t_s, t_e]$.
• Temporal Link Gravity. The link gravity between nodes $j$ and $k$ has been defined by regarding the node degree as the mass and the length $H_{jk}$ of the shortest path between $j$ and $k$ on the static network $G_W$ as the distance. The static gravity of node $j$ can be further defined as $\sum_{k \neq j} \frac{d_j d_k}{H_{jk}^2}$. The static node gravity has been used to select the seed node of an information diffusion process in order to maximize the prevalence [9], motivated by the fact that it contains both the neighborhood and the path information of a node. We generalize the gravity definition to temporal networks. The temporal link gravity of $l(j, k)$ is defined as $\frac{1}{2} \left( \frac{d_j d_k}{Q_{jk}^2} + \frac{d_j d_k}{Q_{kj}^2} \right)$, where $Q_{jk}$ is the number of links of the shortest path from $j$ to $k$ in all the directed spreading trees (see Fig. 1(b)). Specifically, the shortest directed path from $j$ to $k$ is computed in each spreading tree rooted at one seed node. We consider the shortest among these $N$ shortest directed paths, and its length (number of links) is $Q_{jk}$.
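As a rough illustration of the backbone based metrics at the start of this list, the following Python sketch computes the time-confined backbone weight and the backbone betweenness from a set of spreading trees. The trees are hypothetical toy data, and the sketch assumes that the descendants of a link are its receiving end node together with every node below it in the tree; the paper does not spell out this convention here.

```python
from collections import defaultdict

# Hypothetical spreading trees: seed -> directed contacts (infector, infected, t).
trees = {
    1: [(1, 2, 1), (2, 3, 2), (3, 4, 4)],
    2: [(2, 1, 1), (2, 3, 2), (3, 4, 4)],
    3: [(3, 2, 2), (3, 1, 3), (3, 4, 4)],
    4: [(4, 3, 4)],
}

def pair(j, k):
    """Undirected key of a node pair."""
    return (min(j, k), max(j, k))

def time_confined_backbone_weight(trees, ts, te):
    """Count, per node pair, the tree contacts that occur within [ts, te]."""
    w = defaultdict(int)
    for tree in trees.values():
        for src, dst, t in tree:
            if ts <= t <= te:
                w[pair(src, dst)] += 1
    return dict(w)

def descendants_below(tree):
    """Map each directed tree link (src, dst) to the size of the subtree at dst."""
    children = defaultdict(list)
    for src, dst, _t in tree:
        children[src].append(dst)

    def size(node):
        return 1 + sum(size(child) for child in children.get(node, []))

    return {(src, dst): size(dst) for src, dst, _t in tree}

def backbone_betweenness(trees):
    """Average over all trees of the number of descendants of each link."""
    total = defaultdict(int)
    for tree in trees.values():
        for (src, dst), n_desc in descendants_below(tree).items():
            total[pair(src, dst)] += n_desc
    return {link: s / len(trees) for link, s in total.items()}

w_confined = time_confined_backbone_weight(trees, ts=2, te=3)
betweenness = backbone_betweenness(trees)
```

A link that does not appear in a given tree contributes zero descendants for that tree, which is why the sum is divided by the number of trees rather than by the number of appearances.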

2.4 Link Blocking and Evaluation

We illustrate the link blocking procedure and the evaluation method to measure the effectiveness of link blocking strategies. Given a temporal network, we specify the time window in which links are blocked as $[t_s, t_e]$. For each time window $[t_s, t_e]$, we count the number of node pairs $|\mathcal{L}^*_W(t_s, t_e)|$ that have at least one contact within $[t_s, t_e]$ and block 5%, 10%, 20%, 40%, 60%, 80% and 100% of these $|\mathcal{L}^*_W(t_s, t_e)|$ links, respectively, using each centrality metric. The number of links to be blocked is further expressed as the fraction $f$ of the number of links in the integrated network. For each centrality metric, we block the given fraction $f$ of links that have the highest values for the given period $[t_s, t_e]$, i.e., we remove all the contacts within $[t_s, t_e]$ associated with the selected links. We run the SI spreading model by setting each node as the seed node, on the original temporal network as well as on the temporal network after the link blocking. The average prevalence is the average over each possible seed node. The average prevalence of the SI diffusion at any time $t$ when the selected fraction $f$ of links is blocked within $[t_s, t_e]$ and when no links are blocked is denoted as $\rho_f(t)$ and $\rho_o(t)$ respectively, where $t \in \{0, 1, \ldots, T\}$. The effectiveness of each centrality metric is evaluated by

$$\rho_D(f) = \frac{\sum_{t=1}^{T} \left( \rho_o(t) - \rho_f(t) \right)}{\sum_{t=1}^{T} \rho_o(t)} \tag{1}$$

which corresponds to the area below the original prevalence curve $\rho_o(t)$ and above the prevalence curve $\rho_f(t)$ with link blocking, normalized by the area under $\rho_o(t)$ (shown in Fig. 2(b)). A larger $\rho_D(f)$ implies a more effective link blocking strategy in suppressing the SI spreading.
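The procedure can be sketched in Python as follows. The contact list and the centrality scores are hypothetical toy values, the helper names are illustrative, and the sketch assumes that a node infected at step $t$ transmits only from step $t + 1$ on.

```python
from collections import defaultdict

def average_prevalence(contacts, nodes, T):
    """Fraction of infected nodes at t = 0..T, averaged over all seeds (SI, beta = 1)."""
    by_time = defaultdict(list)
    for j, k, t in contacts:
        by_time[t].append((j, k))
    rho = [0.0] * (T + 1)
    for seed in nodes:
        infected = {seed}
        for t in range(T + 1):
            step = by_time.get(t, [])
            # Assumption: nodes infected at step t transmit from step t + 1 on.
            new = {k for j, k in step if j in infected} | {j for j, k in step if k in infected}
            infected |= new
            rho[t] += len(infected) / len(nodes)
    return [r / len(nodes) for r in rho]

def top_links(scores, count):
    """Node pairs with the `count` highest centrality values."""
    return set(sorted(scores, key=scores.get, reverse=True)[:count])

def block_links(contacts, blocked, ts, te):
    """Remove every contact of a blocked node pair that falls within [ts, te]."""
    return [(j, k, t) for j, k, t in contacts
            if not (ts <= t <= te and (min(j, k), max(j, k)) in blocked)]

def rho_D(rho_o, rho_f):
    """Eq. (1): normalized area between the two prevalence curves, t = 1..T."""
    return sum(o - f for o, f in zip(rho_o[1:], rho_f[1:])) / sum(rho_o[1:])

contacts = [(1, 2, 1), (2, 3, 2), (1, 3, 3), (3, 4, 4)]  # hypothetical toy network
nodes = [1, 2, 3, 4]
scores = {(1, 2): 2, (2, 3): 3, (1, 3): 1, (3, 4): 4}    # hypothetical centrality values
blocked = top_links(scores, 1)
rho_o = average_prevalence(contacts, nodes, T=4)
rho_f = average_prevalence(block_links(contacts, blocked, ts=4, te=4), nodes, T=4)
effect = rho_D(rho_o, rho_f)
```

Here blocking the single highest-ranked link during the last time step removes about 10% of the area under the original prevalence curve.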

3 Data Description

In this paper, we use eight temporal network datasets to investigate the link blocking problem in temporal networks. The datasets can be classified into two categories according to the contact type, i.e., proximity (Haggle [1], HighSchool 2012 (HS2012) [4], HighSchool 2013 (HS2013) [10], Reality Mining (RM) [2], Hypertext 2009 (HT2009) [7], Primary School (PS) [17] and Infectious [7]) and electronic communication (Manufacturing Email (ME) [11]). The detailed topological features of these datasets are shown in Table 1, including the number of nodes, time steps and contacts, as well as the number of links, link density, average degree and average link weight in $G_W$. On each temporal network, we run the SI spreading process starting at every node as the seed. The average prevalence $\rho$ over time for each dataset is shown in Fig. 2(a), where the time step is normalized by the time span $T$ of the observation time window. The spreading speed, i.e., how fast the prevalence grows over time, is quite different across networks. Two networks (Haggle and Infectious) show a slow and relatively linear increase in prevalence over time, due to the low link density in these two networks (Table 1). In the other networks, however, the prevalence increases dramatically at the early stage of the spreading process and converges to about 100%.
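The derived columns of Table 1 appear to follow the standard definitions for an undirected network, which is an assumption of the sketch below since the paper does not state the formulas: link density $2|\mathcal{L}_W| / (N(N-1))$, average degree $2|\mathcal{L}_W| / N$ and average link weight $|\mathcal{L}| / |\mathcal{L}_W|$. The function name is illustrative, and the input numbers are the HT2009 values from Table 1.

```python
def integrated_stats(n_nodes, n_contacts, n_links):
    """Link density, average degree and average link weight of G_W,
    assuming the standard definitions for an undirected network."""
    density = 2 * n_links / (n_nodes * (n_nodes - 1))
    avg_degree = 2 * n_links / n_nodes
    avg_weight = n_contacts / n_links
    return density, avg_degree, avg_weight

# HT2009 row of Table 1: N = 113, |L| = 20,818 contacts, |L_W| = 2,196 links.
density, avg_degree, avg_weight = integrated_stats(113, 20818, 2196)
```

Rounded to the precision of the table, these values agree with the HT2009 entries for link density, average degree and average link weight.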

4 Empirical Results

In this section, we evaluate the effectiveness of the aforementioned centrality metrics in selecting the links to be blocked within $[t_s, t_e]$. We consider diverse time windows $[t_s, t_e]$, as listed in Table 2. Intervention can be introduced at different diffusion phases. Hence, $t_s \in \{T_{10\%I}, T_{20\%I}, T_{30\%I}, T_{40\%I}, T_{50\%I}\}$, where $T_{10\%I}$ is the time when the average prevalence without blocking reaches $\rho = 10\%$ (see Fig. 2(a)). The duration of each time window is set to the time it takes the average prevalence to increase by the last 10% before $t_s$. If $t_s = T_{20\%I}$, the duration of the time window is $t_e - t_s = T_{20\%I} - T_{10\%I}$. If $t_s = T_{10\%I}$, the duration of the time window is $t_e - t_s = T_{10\%I} - T_{0\%I} = T_{10\%I}$. The number of links to block has also been chosen systematically.

Table 1. Basic properties of the empirical networks. The number of nodes ($N$), the original length of the observation time window ($T$ in number of steps), the total number of contacts ($|\mathcal{L}|$) and the number of links ($|\mathcal{L}_W|$), link density, average node degree ($d$) and average link weight ($w$) in $G_W$ are shown.

Network      N     T        |L|         |L_W|   Link density   d       w
Haggle       274   15,662   28,244      2,124   0.0568         15.50   13.30
HS2012       180   11,273   45,047      2,220   0.1378         24.67   20.29
HS2013       327   7,375    188,508     5,818   0.1092         35.58   32.40
HT2009       113   5,246    20,818      2,196   0.3470         38.87   9.48
Infectious   410   1,392    17,298      2,765   0.0330         13.49   6.26
ME           167   57,791   82,876      3,250   0.2345         38.92   25.50
PS           242   3,100    125,773     8,317   0.2852         68.74   15.12
RM           96    33,452   1,086,404   2,539   0.5568         52.90   427.89

Fig. 2. (a) Evolution of the average prevalence $\rho$ of the SI model ($\beta = 1$) for the eight empirical datasets. (b) An example of the area difference between the original spreading curve ($\rho_o$) and the curve ($\rho_f$) after blocking a fraction $f$ of links.

We take $[t_s = T_{10\%I}, t_e = 2T_{10\%I}]$ as an example to illustrate our findings. Figure 3 shows the effectiveness of each centrality metric as a function of $f$, which is the number of links blocked normalized by the number of links in the integrated network. The random selection of links from those that have at least one contact within $[t_s, t_e]$ is used as a baseline, in which each point is the average over 100 realizations. We find that four link centrality metrics always outperform the random selection: static betweenness, backbone weight, time-confined backbone weight $[t_s, t_e]$ and backbone betweenness. In Haggle and Infectious, the best performance comes from static betweenness, whereas the time-confined backbone weight $[t_s, t_e]$ outperforms the other metrics in the other six networks. Figure 2 shows that the prevalence grows slowly over time in Haggle and Infectious. Hence, the static betweenness seems a suitable link blocking strategy for networks with a slow spreading speed. However, for networks where information propagates fast, the time-confined backbone weight $[t_s, t_e]$ is a good indicator to select the links to block. Furthermore, we find that time-confined link weight $[t_s, t_e]$ outperforms link weight and time-confined backbone weight $[t_s, t_e]$ outperforms the backbone weight. This implies that considering the temporal topological features of a link within the blocking time window is crucial for the link selection.

Table 2. The time window $[t_s, t_e]$ we choose for link blocking based on the average prevalence $\rho$ when $\beta = 1$. For instance, $T_{10\%I}$ represents the time when the prevalence reaches $\rho = 0.1$.

Network      [T10%I, 2T10%I]   [T20%I, 2T20%I − T10%I]   [T30%I, 2T30%I − T20%I]
Haggle       [3293, 6586]      [8416, 13539]             [9523, 10630]
HS2012       [403, 806]        [675, 947]                [925, 1175]
HS2013       [50, 100]         [113, 176]                [195, 277]
HT2009       [332, 664]        [377, 422]                [439, 501]
Infectious   [410, 820]        [553, 696]                [751, 949]
ME           [168, 336]        [285, 402]                [461, 637]
PS           [136, 272]        [276, 416]                [287, 298]
RM           [5, 10]           [34, 63]                  [111, 188]

Network      [T40%I, 2T40%I − T30%I]   [T50%I, 2T50%I − T40%I]
Haggle       [12440, 15357]            [12668, 12896]
HS2012       [1043, 1161]              [1109, 1175]
HS2013       [236, 277]                [369, 502]
HT2009       [568, 697]                [790, 1012]
Infectious   [955, 1159]               [1062, 1169]
ME           [731, 1001]               [1387, 2043]
PS           [323, 359]                [347, 371]
RM           [133, 155]                [257, 381]

Fig. 3. The effectiveness $\rho_D(f)$ of each centrality metric in selecting the links to block within the time window $[T_{10\%I}, 2T_{10\%I}]$. The points on each curve correspond to blocking 5%, 10%, 20%, 40%, 60%, 80% and 100% of the $|\mathcal{L}^*_W(t_s = T_{10\%I}, t_e = 2T_{10\%I})|$ links, respectively. The x-axis $f$ is the number of links blocked normalized by the number of links in the integrated network.

Fig. 4. Average link blocking performance of each centrality metric over different numbers of blocked links, within different time windows and in different networks. The x-axis shows the time windows. We only show the starting time $t_s$ of each time window for simplicity; the ending time of each window can be found in Table 2.

For a given time window $[t_s, t_e]$, we define the average performance of a centrality metric as the area under $\rho_D(f)$ over the whole range of $f$. The average performance is further normalized by the maximal average performance among all the centrality metrics for the given $[t_s, t_e]$. This average performance over diverse numbers of links to be blocked allows us to evaluate whether the performance of these centrality metrics is stable when the time window varies. Figure 4 verifies that our findings within $[t_s = T_{10\%I}, t_e = 2T_{10\%I}]$ from Fig. 3 can be generalized to the other time windows.
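Two bookkeeping steps of this section can be sketched in Python: the construction of the blocking windows from the times at which the prevalence reaches 10%, 20%, and so on, and the normalized average performance. The function names are illustrative, and integrating $\rho_D(f)$ with the trapezoidal rule is an assumption of the sketch, since the paper does not state how the area is computed. The window example uses the Haggle times from Table 2; the $\rho_D$ values are hypothetical.

```python
def blocking_windows(times):
    """times = [T_10%I, T_20%I, ...]. Window k is [T_k, 2*T_k - T_(k-1)],
    with T_0%I = 0, so each window lasts as long as the previous 10% increase."""
    windows, prev = [], 0
    for ts in times:
        windows.append((ts, 2 * ts - prev))
        prev = ts
    return windows

def average_performance(f_values, rho_d_values):
    """Area under rho_D(f), here by the trapezoidal rule (an assumption)."""
    pts = list(zip(f_values, rho_d_values))
    return sum((f2 - f1) * (r1 + r2) / 2 for (f1, r1), (f2, r2) in zip(pts, pts[1:]))

def normalize(perf_by_metric):
    """Divide each average performance by the best one for this window."""
    best = max(perf_by_metric.values())
    return {metric: p / best for metric, p in perf_by_metric.items()}

haggle_windows = blocking_windows([3293, 8416, 9523, 12440, 12668])
area = average_performance([0.0, 0.5, 1.0], [0.0, 0.2, 0.4])  # hypothetical curve
```

The windows computed for Haggle match the corresponding row of Table 2.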

5 Conclusion

In this paper, we investigate how different link blocking strategies could suppress the information diffusion process on temporal networks. The spreading process is modeled by the SI model with infection probability $\beta = 1$. We propose diverse classes of link centrality metrics to capture different temporal topological properties of links, including the information diffusion backbone based metrics and the static link centrality metrics. According to each metric, we select a given number of links that have the highest centrality values and block them for the given period $[t_s, t_e]$. The effect of such link blocking is evaluated via the extent to which the prevalence is suppressed over time. The empirical results from eight temporal network datasets show that four metrics outperform the random link selection, namely backbone weight, time-confined backbone weight $[t_s, t_e]$, backbone betweenness and static betweenness. An interesting finding is that the backbone based metrics, especially time-confined backbone weight $[t_s, t_e]$, perform well in networks where information becomes prevalent fast, whereas the static betweenness performs best in networks where information propagates slowly. These observations hold for different choices of the time window and of the number of links to be blocked. Our findings point out the importance of both temporal and static centrality metrics in determining a link's role in a diffusion process. Moreover, the time-confined metrics, which explicitly explore the role in the global diffusion process of the contacts that occur within the time window, seem promising in identifying the links to block. In this work, we select links based on centrality metrics that are derived from the temporal network information over the whole observation window $[0, T]$. Our study thus reveals the relation between the temporal topological properties of links or contacts and their role in a diffusion process. A more challenging question is how to identify the links to block based only on the temporal network information observed so far within $[0, t_s]$.

Acknowledgements. This work has been partially supported by the China Scholarship Council (CSC).

References

1. Chaintreau, A., Hui, P., Crowcroft, J., Diot, C., Gass, R., Scott, J.: Impact of human mobility on opportunistic forwarding algorithms. IEEE Trans. Mob. Comput. 6, 606–620 (2007)
2. Eagle, N., Pentland, A.S.: Reality mining: sensing complex social systems. Pers. Ubiquit. Comput. 10(4), 255–268 (2006)
3. Estrada, E.: Communicability in temporal networks. Phys. Rev. E 88(4), 042811 (2013)
4. Fournet, J., Barrat, A.: Contact patterns among high school students. PLoS One 9(9), e107878 (2014)
5. Grindrod, P., Parsons, M.C., Higham, D.J., Estrada, E.: Communicability across evolving networks. Phys. Rev. E 83(4), 046120 (2011)
6. Holme, P.: Modern temporal network theory: a colloquium. Eur. Phys. J. B 88(9), 234 (2015)
7. Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J.F., Van den Broeck, W.: What's in a crowd? Analysis of face-to-face behavioral networks. J. Theor. Biol. 271(1), 166–180 (2011)
8. Li, C., Li, Q., Van Mieghem, P., Stanley, H.E., Wang, H.: Correlation between centrality metrics and their application to the opinion model. Eur. Phys. J. B 88(3), 65 (2015)
9. Li, Z., Ren, T., Ma, X., Liu, S., Zhang, Y., Zhou, T.: Identifying influential spreaders by gravity model. Sci. Rep. 9(1), 8387 (2019)
10. Mastrandrea, R., Fournet, J., Barrat, A.: Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS One 10(9), e0136497 (2015)
11. Michalski, R., Palus, S., Kazienko, P.: Matching organizational structure and social network extracted from email communication. In: International Conference on Business Information Systems, pp. 197–206. Springer (2011)
12. Newman, M.E.: Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys. Rev. E 64(1), 016132 (2001)
13. Pastor-Satorras, R., Castellano, C., Van Mieghem, P., Vespignani, A.: Epidemic processes in complex networks. Rev. Mod. Phys. 87(3), 925 (2015)
14. Peters, L.J., Cai, J.J., Wang, H.: Characterizing temporal bipartite networks - sequential- versus cross-tasking. In: International Conference on Complex Networks and their Applications, pp. 28–39. Springer (2018)
15. Qu, C., Zhan, X., Wang, G., Wu, J., Zhang, Z.K.: Temporal information gathering process for node ranking in time-varying networks. Chaos: Interdisc. J. Nonlinear Sci. 29(3), 033116 (2019)
16. Rocha, L.E., Masuda, N.: Random walk centrality for temporal networks. New J. Phys. 16(6), 063023 (2014)
17. Stehlé, J., Voirin, N., Barrat, A., Cattuto, C., Isella, L., Pinton, J.F., Quaggiotto, M., Van den Broeck, W., Régis, C., Lina, B., et al.: High-resolution measurements of face-to-face contact patterns in a primary school. PLoS One 6(8), e23176 (2011)
18. Takaguchi, T., Sato, N., Yano, K., Masuda, N.: Importance of individual events in temporal networks. New J. Phys. 14(9), 093003 (2012)
19. Zhan, X.X., Hanjalic, A., Wang, H.: Information diffusion backbones in temporal networks. Sci. Rep. 9(1), 6798 (2019)

Using Connected Accounts to Enhance Information Spread in Social Networks

Alon Sela1,2,3,6(✉), Orit Cohen-Milo4,5, Eugene Kagan1,2, Moti Zwilling6,7, and Irad Ben-Gal2

1 Industrial Engineering Department, Ariel University, Ariel 40700, Israel
[email protected]
2 Industrial Engineering Department, Tel Aviv University, Tel Aviv 39040, Israel
3 Physics Department, Bar Ilan University, Tel Aviv 5290002, Israel
4 Economics Department, Hebrew University, Jerusalem 9190501, Israel
5 Economics Department, Ben-Gurion University, Tel Aviv 8410501, Israel
6 Ariel Cyber Innovation Center (ACIC), Ariel University, Ariel 40700, Israel
7 Business and Management Department, Ariel University, Ariel 40700, Israel

Abstract. This article presents a new operation mode of social bots: the creation of bots in dense, highly connected substructures of the network, named Spreading Groups. Spreading Groups are groups of bots and human-managed accounts that operate in social networks. They are often used to bias the natural spread of opinion and to promote and over-represent an agenda. These bot accounts are mixed with regular users and repeatedly echo their agenda, disguised as real humans who simply share their own personal thoughts. This mixture makes the bots more difficult to detect and more influential. We show that if these connected substructures repeatedly echo a message within their group, messages spread more efficiently than under a random spread by a similar number of unconnected bots. In particular, groups of bots were found to be as influential as groups of similar size constructed from the most influential users in the social network (e.g., those with the highest eigenvector centrality). They were also found to be, on average, twice as influential as groups of similar size made up of random bots.

Keywords: Social networks · Information spread · Spreading Groups · Bot

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 459–468, 2020. https://doi.org/10.1007/978-3-030-36687-2_38

1 Introduction

Spreading Groups are groups of bots (automatic software agents) that operate in social networks in order to bias the natural spread of opinion and over-represent an agenda. These bot accounts are mixed with regular users (i.e., human beings) and repeatedly echo their agenda, disguised as accounts of real humans who simply share their own personal thoughts. Through this method, bots and groups of bots amplify the agenda of their creator and influence the opinions of real users in order to spread a defined ideology [1, 2]. The spread of information in cyberspace can trigger and deeply influence political, economic and social changes [1, 3, 4]. Ideological political struggles, such as the


Boycott, Divestment, Sanctions (BDS) Movement [5–7], the Arab Spring [8, 9], civil ideological spread [10], and the Russian effort to spread harmful fake information [11], are all examples of political players who fiercely use the social media arena as their battlefield. The importance of social networks is based on their ability to quickly spread all types of views with few censoring limitations. Social network websites can easily spread information to remote populations at an unprecedented speed. This ability to spread information efficiently gives rise to counter efforts to manipulate the spread. One such manipulation is the inflation of an agenda or a message by fake accounts and social bots. Social bots are defined as software agents that mimic human activity in a social network in an endless effort to spread a defined agenda. Many studies have found that social bots operate in large numbers inside each of the many existing social network platforms [12–16]. Social bots first collect followers; then, after a dormant phase, they begin to spread their agenda within their crowd.

There are currently several methods to detect individual bots. These methods are often based on machine learning classification algorithms that find differences in account features between bots and humans. While some studies claim that these algorithms can detect up to approximately 95% of bots [17], there is a growing consensus that between 5% [18] and 15% [19] of accounts are bots. Human classification, for example, can detect only about 20% of the bots [20]. These two numbers contradict each other to some degree. If algorithms can detect as many as 95% of bot accounts, one would expect the number of bots to decrease with time (assuming Twitter and other social platforms delete these accounts), or at least that only 5% of accounts would be considered bots.
Since this does not happen, and since bots have been considered a great concern for some years, it may be concluded that many bots are not detectable by these artificial intelligence algorithms. The model proposed in this study, and its supporting data, might explain the above inconsistency. While Twitter and similar platforms find individual bots, it is likely that groups of bots can stay “under the radar” for longer periods, or even stay undetected permanently. This is because each bot can operate in a different period and thus be considered human. Furthermore, the model, as well as the data, shows that it is worthwhile for bot creators to connect these bots. As shown, the creation of connected structures of bots increases the spread of a message among human accounts by a factor of 3–28, depending on the initial conditions.

We compared the spreading rates of Spreading Groups not only to random users, but also to groups of the most influential users in the network. We found that connected Spreading Groups of bots reach a spreading rate of 75%–108% (depending on the initial conditions) of that of a similarly sized group constructed from the most influential spreaders in the network (those with the highest PageRank or eigenvector centrality). As will be shown in the following results section, these results can easily be explained and are highly applicable, since the creation of interconnected bot accounts is an easy, realistic and technically feasible task.


The current work extends our previous work, in which we proposed and demonstrated the operation mode of Spreading Groups of bots [2]. While the model in the previous work assumed only one seeding period before the natural spread through human accounts, in this work we examine the previous models further and allow information to be spread over longer durations. This assumption is related to the work of the Debot project [21], which detected the activity of bots through temporal correlations between their messages. Accounts that have a high temporal correlation of activity, and/or spread related topics, are likely to be classified as bots. Thus, seeding messages in different periods of time breaks this temporal correlation and makes detection much harder. Since different accounts from the Spreading Group are active in each cascade, accounts in a Spreading Group can remain undetected for longer durations. While connected bot structures that repeatedly echo messages within the group and do not operate in a single peak have a slower spread rate, their influence is still expected to be much stronger, mainly because they remain undetected for longer and thus collect more followers.

2 Model

The Spreading Group model consists of two stages. In the first stage, the Spreading Group is created within the network. In the second stage, different spreading strategies are inspected, with seeds allocated either all at once before the natural infection stage, or gradually together with the natural infection. Note that we refer to the infections of non-bot accounts as natural infections. Similarly, intended infections by bots are referred to as seedings.

2.1 Creation of Spreading Groups in the Network

In order to create the required structure, we first construct a network of n/2 nodes through a Preferential Attachment process [22]. Then, we select a set of r% of these nodes and label them as the “Spreading Group”. We then apply the following algorithm d times to add links to the Spreading Group. Last, we continue growing the network with the remaining n/2 nodes by preferential attachment. The meta code to enrich the group is the following.

CONSTRUCTION OF SPREADING GROUP:
For each node in Spreading-Group:
    Choose randomly node_1
    Choose randomly node_2
    if (node_1 != node_2) && no link (node_1, node_2):
        create-link (node_1, node_2)
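The construction steps above can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the parameter m (links added per new node), the seed clique used to start the growth, and the choice to draw both node_1 and node_2 from inside the Spreading Group are all assumptions made for the example. Here r is given as a fraction rather than a percentage.

```python
import random


def preferential_attachment(adj, n_target, m, rng):
    """Grow adj (node -> set of neighbours) until it has n_target nodes."""
    # One list entry per edge endpoint, so a uniform draw is degree-proportional.
    endpoints = [u for u, nbrs in adj.items() for _ in nbrs]
    while len(adj) < n_target:
        new = len(adj)  # nodes are labelled 0, 1, 2, ...
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            endpoints += [new, t]
    return adj


def densify_group(adj, group, d, rng):
    """Repeat the link-enrichment pass d times (assumed: both ends in the group)."""
    members = sorted(group)
    for _ in range(d):
        for _node in members:
            node_1 = rng.choice(members)
            node_2 = rng.choice(members)
            if node_1 != node_2 and node_2 not in adj[node_1]:
                adj[node_1].add(node_2)
                adj[node_2].add(node_1)


def build_network(n, r, d, m=2, seed=0):
    """Return (adjacency dict, Spreading Group) for a network of n nodes."""
    rng = random.Random(seed)
    # Hypothetical starting point: a small clique of m + 1 nodes.
    adj = {i: set(range(m + 1)) - {i} for i in range(m + 1)}
    preferential_attachment(adj, n // 2, m, rng)       # first n/2 nodes
    size = max(2, round(r * len(adj)))                 # r as a fraction
    group = set(rng.sample(range(len(adj)), size))
    densify_group(adj, group, d, rng)                  # enrich the group
    preferential_attachment(adj, n, m, rng)            # remaining n/2 nodes
    return adj, group
```

Because the group is densified before the second growth phase, its members enter that phase with a higher degree and therefore attract more of the later links.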

This process forms a network that is basically a preferential attachment network, but that also contains a denser structure made up of r% of the nodes, i.e., the Spreading Group.

2.2 Four Seeding Strategies

We defined four main seeding strategies, and for each of them we inspected the outcome of gradual vs. instant seeding. The four strategies were:

1. Random Seeding – In this strategy, S seeds are randomly selected from the entire network and are seeded; i.e., their status changes to “infected”.
2. Group Seeding – In this strategy, S seeds are randomly selected solely from the Spreading Group and are seeded; i.e., their status changes to “infected”.
3. PageRank Seeding – In this strategy, the nodes are ordered according to their PageRank scores and the S nodes with the highest scores are seeded.
4. Eigenvector Centrality Seeding – In this strategy, the nodes are ordered according to their eigenvector centrality scores and the S nodes with the highest scores are seeded.

Illustrations of these four seeding strategies are presented in Fig. 1, where the seeds are represented by a red target sign and the Spreading Group is represented by orange links (see A, B). In the lower panels, the positions of the seeded nodes in the plot were changed to make their relatively high ranks visible.

Fig. 1. Four seeding strategies of the Spreading Group model: random seeds – A (upper left), Spreading Group – B (upper right). Highest Eigenvector centrality seeds – C (lower left). Highest PageRank centrality seeds – D (lower right).
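The four seed-selection rules can be illustrated with a small, self-contained sketch. It is not taken from the paper: the function names, the damping factor 0.85, the fixed iteration counts, and the use of simple power iteration for both centralities are assumptions made for the example. The network is assumed to be given as a dictionary mapping each node to the set of its neighbours.

```python
import random


def pagerank(adj, damping=0.85, iters=100):
    """PageRank by power iteration on an undirected adjacency dict."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(iters):
        # Rank held by isolated nodes is redistributed uniformly.
        dangling = sum(rank[u] for u in adj if not adj[u])
        new = {u: (1 - damping) / n + damping * dangling / n for u in adj}
        for u, nbrs in adj.items():
            for v in nbrs:
                new[v] += damping * rank[u] / len(nbrs)
        rank = new
    return rank


def eigenvector_centrality(adj, iters=100):
    """Leading eigenvector by power iteration on A + I.

    Adding the identity leaves the eigenvectors unchanged and avoids the
    oscillation that plain iteration on A shows for bipartite graphs.
    """
    x = {u: 1.0 for u in adj}
    for _ in range(iters):
        new = {u: x[u] + sum(x[v] for v in adj[u]) for u in adj}
        norm = max(new.values()) or 1.0
        x = {u: value / norm for u, value in new.items()}
    return x


def select_seeds(adj, group, S, strategy, rng):
    """Return the set of S seed nodes chosen by the named strategy."""
    if strategy == "random":
        return set(rng.sample(sorted(adj), S))
    if strategy == "group":
        return set(rng.sample(sorted(group), S))
    if strategy == "pagerank":
        score = pagerank(adj)
    elif strategy == "eigenvector":
        score = eigenvector_centrality(adj)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Highest score first; ties are broken by node label.
    return set(sorted(adj, key=lambda u: (-score[u], u))[:S])
```

On a star graph, for instance, both centrality-based strategies with S = 1 pick the hub, while the group strategy only ever returns members of the supplied group.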

2.3 Temporal Examination of Seeding and Spread

While many works treat the spread of information as a two-phase process, in which an act of seeding is followed by a process of infections, our previous works have shown the high importance of timing [23, 24, 26]. Timing acts in spreading processes as a double-edged sword. On the one hand, as the seeding process is extended in time, potentially infected nodes forget, and as a result the probability that a message is accepted through natural infection is reduced. On the other hand, as time passes, more nodes become infected, and thus a well-planned seeding action can boost the spread [23, 25]. Overall, we now believe that, when there is no planning or examination of the infection cascades, a process in which seeding occurs only at the initial stage of the spread can reach more nodes than a process in which the seeds are allocated gradually. Nevertheless, if the gradual allocation of seeds is well planned, we have shown that a 10%–20% improvement can be obtained by correct scheduling of seeds [23, 25]. As a consequence of these results, we also inspected seeding through a multi-step process. In this case, we divided the S seeds into seeding batches such that $S = s \times T_s$, where $s$ denotes the number of seeds used in each period and $T_s \in \{t = 1, 2, 3, \ldots\}$ denotes the spreading periods.

2.4 Modeling the Retention Loss

An important part of the study of information spread focuses on the spread of news. As the word “news” implies, a message needs to be something new in order to capture users' attention and be spread forward. In our model, we assumed that the human retention loss function follows an exponential decay. This assumption is based on studies of memory by Ebbinghaus [25] and his followers, which are now over a century old. The probability of infection is initially set to $p_0$; then, at each step and for each infected node, this probability is reduced according to the exponential decay

$$p = p_0 e^{-ct} \qquad (1)$$
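To make Eq. (1) concrete, the short sketch below evaluates the decay and computes how many steps pass before the probability drops under a lower bound LB, the stopping rule used by the model. The negative sign in the exponent is assumed from the phrase "exponential decay", and the function names and the numeric values in the usage note are hypothetical, chosen only for illustration.

```python
import math


def infection_probability(p0, c, t):
    """Eq. (1): probability of infection t steps after a node was infected."""
    return p0 * math.exp(-c * t)


def steps_until_non_infectious(p0, c, lower_bound):
    """Smallest integer t >= 0 with p0 * exp(-c * t) < lower_bound."""
    if p0 < lower_bound:
        return 0
    # p0 * exp(-c t) < LB  <=>  t > ln(p0 / LB) / c
    return math.floor(math.log(p0 / lower_bound) / c) + 1
```

With the illustrative values p0 = 0.5, c = 0.5 and LB = 0.01, the bound ln(50)/0.5 is about 7.8, so a node would stop being infectious at step 8.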

Since the probability of infection p decays indefinitely, the model's runs need a termination rule: when p falls below a lower bound, p < LB, the node's status changes to non-infectious. In addition, if no nodes are infected for more than 10 periods, the model terminates. These stopping criteria allow the simulation runs to terminate within reasonable run times.

2.5 Simulation Scheme

To conclude, the study monitors the spread achieved by the different spreading groups, while an external layer varies the different runtime parameters. The different seeding strategies are generated according to the following scheme.

SIMULATION SCHEME:
Create a network of n/2 nodes    # By preferential attachment
Select r% of nodes -> {SG}       # Usually 0.001

d(B)(T) > d(A+B)(T). The paradox does not occur. However, when r = −0.9 (disassortative network), the average population gain d(T) of Game B is −0.11, and the average population gain d(T) of the randomized Game A + B is 0.18, so d(A+B)(T) > 0 > d(B)(T). Thus, the strong paradox occurs. Based on these findings, the mechanism of the paradox is analyzed.

Fig. 2. Changes in average population gain over time

Fig. 3. Changes in proportion of favorable branch (Branch 1) over time

(1) Strong paradox in disassortative networks

(1) When Game B is played individually, the average population gain is negative. The reasons are as follows. At the group level, the system environment is benign at the beginning of the game: in Game B, equal capital makes every node play Branch 1, whose winning probability is p1 = 0.695, so it is the favorable branch. Because small-degree nodes outnumber large-degree nodes, small-degree nodes have more opportunities for capital growth than large-degree nodes. The capital growth of small-degree nodes increases their chance of playing Branch 2 (the unfavorable branch, with p2 = 0.28) in subsequent games; that is, a small-degree node becomes more likely to hold more capital than the average of its neighbors. As a result, capital gradually changes from growth to decline as the game progresses. In disassortative networks, the neighbors of large-degree nodes are mostly small-degree nodes, so the decrease in the capital of small-degree nodes further increases the chance that large-degree nodes play Branch 2; that is, it increases the chance that the capital of a large-degree node exceeds the average capital of its neighbors. This development among small-degree and large-degree nodes makes the probability of playing Branch 1 (the favorable branch) decrease continuously until it falls below 53.01%, the probability in a fair game. (In a fair game the mathematical expectation is E = 0, so the probability f of playing Branch 1 follows from the equation E = f[1 × p1 + (−1) × (1 − p1)] + (1 − f)[1 × p2 + (−1) × (1 − p2)] = 0.) Thus the average gain of the whole population gradually decreases.

The Impact of Network Degree Correlation on Parrondo’s Paradox


Fig. 2 also shows that the average population gain first increases and then, after about $2 \times 10^4$ games, gradually decreases until it becomes negative. Figure 4(a) shows the relationship between node degree and the average gain of subgroups in a disassortative network. When Game B is played alone, there is no obvious relationship between node degree and the average gain of subgroups: the average gains of subgroups with different degrees are evenly distributed and concentrated in a narrow interval. The reasons are as follows. First, in the disassortative network, the neighbors of large-degree nodes are mainly small-degree nodes, and the neighbors of small-degree nodes are mainly large-degree nodes. Second, when Branch 1 is the favorable branch, the capital of a node and that of its neighbors rise and fall in synchrony (when the average capital of the neighbors is greater than or equal to the capital of the node, the node plays the favorable Branch 1 and its capital increases; when the capital of the node is greater than the average capital of its neighbors, the node plays the unfavorable Branch 2 and its capital decreases). Therefore, when Game B is played alone, the gains of large-degree and small-degree nodes stay synchronized in winning and losing to a certain extent, so there is no obvious relationship between node degree and the average gain of subgroups.

(2) When the randomized Game A + B is played, there is a positive correlation between the average gains of subgroups and node degree. The average gains of subgroups with degrees less than 38 are negative, while those of the other subgroups are positive. This result is due to the “agitating” role of Game A.
Half of the time in the randomized Game A + B is spent playing the zero-sum Game A, and the game relation is set to cooperation. In the cooperative pattern, the subject gives one unit to the object for free. In Game A, every node is equally likely to be selected as the subject, but large-degree nodes, which have more neighbors, are more likely to be selected as the object. Thus, when Game A is played, nodes with larger degrees hold more capital and their capital increases more markedly, so capital flows toward nodes with larger degrees. Therefore, there is a positive correlation between the average gains of subgroups and node degree, and a larger degree yields a higher average subgroup gain. At the same time, in the disassortative network, the neighbors of small-degree nodes are mainly large-degree nodes, so the large capital of large-degree nodes increases the chance that small-degree nodes play the favorable branch of Game B. At the group level, the agitation of Game A increases the chance that nodes play the favorable branch of Game B (Fig. 3 shows that this probability increases from 50.20% when Game B is played individually to 55.45% when the randomized Game A + B is played). This makes the result of the randomized Game A + B positive (as shown in Fig. 2). Therefore, the ratcheting mechanism of Game B (its structure contains asymmetric branches) and the agitation effect of Game A are the key to producing Parrondo's paradox. In addition, the disassortative network is conducive to the development of the ratcheting mechanism.
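The fair-game threshold quoted in this analysis can be checked numerically. Setting E = 0 in the expectation equation and solving for f gives f = (1 − 2·p2) / (2·(p1 − p2)). The sketch below evaluates this closed form and the per-play expectation of Game B for a given share f of Branch 1 plays; the function names are assumptions made for the example, and the expectation is a per-play quantity, not the reported average population gain d(T).

```python
def expected_gain(f, p1, p2):
    """Per-play expectation of Game B when Branch 1 is played with share f.

    A win pays +1 and a loss pays -1, so each branch contributes (2p - 1).
    """
    return f * (2 * p1 - 1) + (1 - f) * (2 * p2 - 1)


def fair_branch1_probability(p1, p2):
    """Share f of Branch 1 plays that makes the expectation exactly zero."""
    return (1 - 2 * p2) / (2 * (p1 - p2))


f_fair = fair_branch1_probability(0.695, 0.28)
print(f"fair-game share of Branch 1: {f_fair:.2%}")
print(f"expectation at 50.20%: {expected_gain(0.5020, 0.695, 0.28):+.4f}")
print(f"expectation at 55.45%: {expected_gain(0.5545, 0.695, 0.28):+.4f}")
```

For p1 = 0.695 and p2 = 0.28 the threshold evaluates to 0.44/0.83, about 53.01%, in agreement with the text. The expectation is negative at 50.20% and positive at 55.45%, which matches the signs of the reported gains for Game B alone and for the randomized Game A + B. Applying the same formula to the competitive parameters used later in the paper (p1 = 0.175, p2 = 0.85) gives a Branch 1 share of about 51.85%, i.e. about 48.15% for the favorable Branch 2.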


Y. Ye et al.

Fig. 4. Changes in average sub-population gain over node degree: (a) in disassortative network; (b) in assortative network

(2) No paradox in the assortative network

Figure 4(b) shows the relationship between node degree and the average subgroup gain in the assortative network. The following can be observed.

(1) When Game B is played alone, there is no obvious relationship between node degree and the average subgroup gain. The average gains of subgroups with different degrees are evenly distributed and concentrated in a narrow interval, and the average gain of the population is negative. The reason is similar to that for Game B in the disassortative network: the capital of nodes and that of their neighbors rise and fall in synchrony. The negative average gain of the population (shown in Fig. 2) is related to the parameters (p1 = 0.695 and p2 = 0.28). Under this set of parameters, as shown in Fig. 3, the probability of playing the favorable branch of Game B once the system is stable is 50.22%, which is lower than the fair-game value (53.01%).

(2) When the randomized Game A + B is played, the relationship between the average subgroup gain and node degree can be divided into three sections: for subgroups of large-degree nodes (degree greater than 50) and small-degree nodes (degree less than 25), the average gains are positively correlated with node degree, while for subgroups of medium-degree nodes (degree between 25 and 50), the average gains are not significantly correlated with node degree. The reasons for this difference are as follows. In Game A with the cooperative pattern, a node with a larger degree holds more capital, and its capital increases more markedly. Capital therefore flows to the large-degree nodes, and a positive correlation between degree and subgroup gain arises. The “synchronization of increase and decrease” of Game B leads to the convergence of average gains among subgroups with different degrees. At the same time, because node degrees in random networks follow a normal distribution (i.e., there are few large-degree and small-degree nodes and many medium-degree nodes), the average gains of subgroups with different degrees gradually move toward the middle, eventually producing the result shown in Fig. 4(b). Moreover, because the average gain of the medium-degree subpopulation, which accounts for a large proportion of the nodes, is negative, the average gain of the population is negative (as shown in Fig. 2). At the group level, the agitation effect of Game A reduces the opportunity for nodes to play the favorable branch of Game B (Fig. 3 shows that the


probability decreases from 50.22% when Game B is played individually to 49.10% when the randomized Game A + B is played), so the average population gain of the randomized Game A + B is less than that of Game B played alone. Thus, no paradox occurs.

4.2.2 The Competitive Pattern: p1 = 0.175, p2 = 0.85

For this set of parameters, when r = 0.9 (assortative network), the average population gain d(T) of Game B is 0.59, and the average population gain d(T) of the randomized Game A + B is 0.34, so d(B)(T) > d(A+B)(T) > 0. The paradox does not occur. However, when r = −0.9 (disassortative network), the average population gain d(T) of Game B is −0.094, and the average population gain d(T) of the randomized Game A + B is 0.32, so d(A+B)(T) > 0 > d(B)(T). Thus, the strong paradox occurs. Based on these findings, the mechanism of the paradox is analyzed. Figure 5 shows the change of the average population gain over time in the disassortative network. Figure 6 shows the change of the proportion of the favorable branch (Branch 2) over time in the disassortative network.

Fig. 5. Changes in average population gain over time

Fig. 6. Changes in proportion of favorable branch (Branch 2) over time

(1) Strong paradox in disassortative networks

Figure 7(a) shows the relationship between node degree and the average subgroup gain in a disassortative network. When Game B is played individually, there is a positive relationship between the average subgroup gain and node degree: a larger degree yields a greater average subgroup gain. The average gains of subgroups whose degrees are less than 40 are negative, while the average gains of subgroups with other degrees are positive. The reason for this result is as follows. For p1 = 0.175 and p2 = 0.85, when the capital of a node is not greater than the average capital of its neighbors, the neighboring environment of the node is unfavorable (because the node then plays Branch 1 of Game B, whose probability of winning is p1 = 0.175); when the capital of a node is greater than the average capital of its neighbors, the neighboring environment of the node is favorable (because the node then plays Branch 2 of Game B, whose probability of winning is p2 = 0.85). At the beginning, all nodes have the same capital, so every node plays the unfavorable branch (Branch 1) and its capital decreases. Since a large-degree node has


more neighbors, its neighborhood environment improves quickly; that is, the probability that the large-degree node plays the favorable branch increases quickly. In the disassortative network (r = −0.9), the neighbors of small-degree nodes are mainly large-degree nodes. The capital of large-degree nodes is large, which makes small-degree nodes more likely to play the unfavorable branch, and this reduces the capital of small-degree nodes. The capital reduction of small-degree nodes further improves the neighborhood environment of the large-degree nodes connected to them (it lowers the average capital of the neighbors around large-degree nodes). During the game, the favorable neighboring environment of large-degree nodes and the unfavorable neighboring environment of small-degree nodes are continuously reinforced, which ultimately makes the average gains of larger-degree subgroups larger and those of smaller-degree subgroups smaller. At the group level, Fig. 6 shows that the probability of playing the favorable branch when Game B is played individually (47.58%) is lower than the fair-game value (48.15%). This makes the average population gain of Game B negative.

(2) When the randomized Game A + B is played, there is no obvious correlation between the average gains of subgroups and node degree. The average gain of the population is positive. This result is due to the “agitating” role of Game A. Half of the time in the randomized Game A + B is spent playing the zero-sum Game A, and the game relation is set to competition (the probability of winning is 0.5 for both subject and object). Because the game between large-degree and small-degree nodes is zero-sum with equal probabilities of winning and losing, small-degree nodes can increase their capital, which disrupts the process that reinforces the favorable environment of large-degree nodes and the unfavorable environment of small-degree nodes. At the group level, the agitation of Game A increases the chance that nodes play the favorable branch of Game B (Fig. 6 shows that this probability increases from 47.58% when Game B is played individually to 49.86% when the randomized Game A + B is played). This makes the result of the randomized Game A + B positive. Therefore, the ratcheting mechanism of Game B (its structure contains asymmetric branches) and the

Fig. 7. Changes in average sub-population gain over node degree


agitation effect of Game A are the key to producing Parrondo's paradox. In addition, the disassortative network is conducive to the development of the ratcheting mechanism.

(2) No paradox in the assortative network

Figure 7(b) shows the relationship between node degree and the average subgroup gain in the assortative network. There is no obvious relationship between node degree and the average subgroup gain. The reasons for this result are as follows. First, in the assortative network, the neighbors of each node are mainly nodes with similar degrees. Second, when Branch 2 is the favorable branch, the capital of a node and that of its neighbors show “opposite increase and decrease” (when the average capital of the neighbors is larger than the capital of the node, the node plays the unfavorable branch and its capital decreases; when the average capital of the neighbors is smaller than the capital of the node, the node plays the favorable Branch 2 and its capital increases). Therefore, when Game B is played alone, the environment is the same for individuals with different degrees, there are no significant differences among the average subgroup gains for different degrees, and there is no obvious relationship between node degree and the average subgroup gain. The average gain of the population is positive because of the parameters p1 = 0.175 and p2 = 0.85: under these parameters, the probability of playing the favorable Branch 2 of Game B (49.71%) is higher than the fair-game value (48.15%). For the randomized Game A + B, the competitive mode is used when Game A is played, and the probability of winning is 0.5 for both the subject and the object, which leads to more uniform gains among subgroups with different degrees. The graph also shows that the fluctuation is smaller. At the group level, the agitation of Game A increases the chance that nodes play the favorable branch of Game B (the calculation results show that this probability increases from 49.71% when Game B is played individually to 49.91% when the randomized Game A + B is played). However, half of the time in the randomized Game A + B is spent playing the zero-sum Game A, so the average population gain of the randomized Game A + B is less than that of Game B played individually, and there is no paradox.
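The game rules referred to throughout this analysis can be summarized in a short sketch. It is an illustration only and not the authors' simulation code: the function names are made up, and several details are assumptions, namely that the object of Game A is a uniformly random neighbor of the subject, that the stake in the competitive pattern is one unit, that the node playing Game B is drawn uniformly at random, and that an isolated node compares its capital with itself. The network is a dictionary mapping each node to the set of its neighbors, and capital is a dictionary of node capitals.

```python
import random


def play_game_a(capital, adj, mode, rng):
    """One zero-sum round of Game A between a subject and one of its neighbors."""
    subject = rng.choice(sorted(adj))           # every node equally likely
    if not adj[subject]:
        return
    obj = rng.choice(sorted(adj[subject]))      # hubs are picked more often
    if mode == "cooperation":
        giver, receiver = subject, obj          # subject gives one unit for free
    elif mode == "competition":
        # Each side wins with probability 0.5; the loser pays one unit.
        giver, receiver = (obj, subject) if rng.random() < 0.5 else (subject, obj)
    else:
        raise ValueError(f"unknown mode: {mode}")
    capital[giver] -= 1
    capital[receiver] += 1


def play_game_b(capital, adj, p1, p2, rng):
    """One round of Game B; returns the branch (1 or 2) that was played."""
    node = rng.choice(sorted(adj))
    nbrs = adj[node]
    avg = sum(capital[v] for v in nbrs) / len(nbrs) if nbrs else capital[node]
    # Branch 1 when the node holds no more than its neighbors' average capital.
    branch = 1 if capital[node] <= avg else 2
    p_win = p1 if branch == 1 else p2
    capital[node] += 1 if rng.random() < p_win else -1
    return branch


def play_randomized(capital, adj, p1, p2, mode, rng):
    """Randomized Game A + B: each round is Game A or Game B with probability 0.5."""
    if rng.random() < 0.5:
        play_game_a(capital, adj, mode, rng)
    else:
        play_game_b(capital, adj, p1, p2, rng)
```

Game A never changes the total capital, so any change in the population's gain comes from Game B; Game A only alters which branch of Game B the nodes end up playing.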

5 Conclusions

(1) In this paper, a multi-agent Parrondo's model based on complex networks is established. Two different interactive behavior modes, competition and cooperation, are adopted in Game A. Furthermore, the gradual change of the parameter space of Parrondo's paradox from the assortative random network to the disassortative random network under different behavioral modes is analyzed, as is the relationship between the parameter space and the degree correlation of networks. The simulation results show that: (1) different behavioral modes affect the parameter space in which the paradox occurs; (2) under the same behavior mode, a smaller degree correlation coefficient yields a larger parameter space in which the paradox occurs.

(2) For the competitive and the cooperative behaviors, one set of probability parameters is adopted for each to analyze the micro-causes of the strong paradox in


disassortative random networks in detail. Furthermore, the interaction mechanisms among the asymmetric structure of Game B, the “agitation” effect of Game A under different behavior modes, and the network topology are demonstrated.

Acknowledgments. This project was supported by the National Natural Science Foundation of China (Grant No. 11705002) and by the Ministry of Education, Humanities and Social Sciences research projects (15YJCZH210; 19YJAZH098; 18YJCZH102).

References 1. Harmer, G.P., Abbott, D.: Losing strategies can win by Parrondo’s paradox. Nature 402 (6764), 864–870 (1999) 2. Harmer, G.P., Abbott, D., Taylor, P.G., Parrondo, J.M.R.: Brownian ratchets and Parrondo’s games. Chaos 11(3), 705–714 (2001) 3. Parrondo, J.M.R., Harmer, G.P., Abbott, D.: New paradoxical games based on Brownian ratchets. Phys. Rev. Lett. 85(24), 5226–5229 (2000) 4. Shu, J.J., Wang, Q.W.: Beyond Parrondo’s Paradox. Sci. Rep. 4(4244), 1–9 (2014) 5. Toral, R.: Capital redistribution brings wealth by Parrondo’s paradox. Fluct. Noise Lett. 2(3), 305–311 (2002) 6. Ye, Y., Xie, N.G., Wang, L.G., Meng, R., Cen, Y.W.: Study of biotic evolutionary mechanisms based on the multi-agent Parrondo’s games. Fluct. Noise Lett. 11(2), 352–364 (2012) 7. Mihailović, Z., Rajković, M.: Cooperative Parrondo’s games on a two-dimensional lattice. Phys. A 365, 244–251 (2006) 8. Ye, Y., Xie, N.G., Wang, L.G., Wang, L., Cen, Y.W.: Cooperation and competition in history-dependent Parrondo’s game on networks. Fluct. Noise Lett. 10(3), 323–336 (2011) 9. Wang, L.G., Xie, N.G., Xu, G., Wang, C., Chen, Y., Ye, Y.: Game-model research on coopetition behavior of Parrondo’s paradox based on network. Fluct. Noise Lett. 10(1), 77– 91 (2011) 10. Meyer, D.A., Blumer, H.: Quantum parrondo games: biased and unbiased. Fluct. Noise Lett. 2(04), 257–262 (2002) 11. Miszczak, J.A., Pawela, L., Sladkowski, J.: General model for an entanglement-enhanced composed quantum game on a two-dimensional lattice. Fluct. Noise Lett. 13(02), 1450012 (2014) 12. Ye, Y., Cheong, K.H., Cen, Y.W., Xie, N.G.: Effects of behavioral patterns and network topology structures on Parrondo’s paradox. Sci. Rep. 6, 37028 (2016) 13. Ye, Y., Wang, L., Xie, N.G.: Parrondo’s games based on complex networks and the paradoxical effect. PLoS ONE 8(7), e67924 (2013) 14. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett. 89(20), 2087011 (2002) 15. 
Xulvi-Brunet, R., Sokolov, I.M.: Reshuffling scale-free networks: from random to assortative. Phys. Rev. E 70(6), 066102 (2004) 16. Klemm, K., Eguíluz, V.M.: Growing scale-free networks with small-world behavior. Phys. Rev. E 65(5), 057102 (2002)

Analysis of Diversity and Dynamics in Co-evolution of Cooperation in Social Networking Services

Yutaro Miura1(✉), Fujio Toriumi2, and Toshiharu Sugawara1

1 Waseda University, Tokyo 1698555, Japan
2 The University of Tokyo, Tokyo 1138656, Japan

Abstract. We investigated how users of social networking services (SNSs) dynamically identify their own reasonable strategies by applying a co-evolutionary algorithm to an agent-based game-theoretic model of SNSs. People often use SNSs such as Twitter, Facebook, and Instagram, but they can also free-ride without providing any content because providing information incurs costs. Numerous studies on evolutionary network analysis have been conducted to investigate why people continue to post articles. In these studies, genetic algorithms (GAs) have often been used to find reasonable strategies for SNS users. Although the strategies evolved in these studies are usually common to all users, appropriate strategies must be diverse because they are used in various circumstances. In this paper, we use a co-evolutionary algorithm, the multiple-world GA (MWGA), to analyze the various strategies that individual agents acquire through co-evolution with their neighboring agents. We also show that the fitness values obtained with the MWGA were higher than those obtained with the conventional GA. Finally, we show that the MWGA enables us to observe the dynamic processes of co-evolution, i.e., why agents reach their own strategies in different circumstances. This analysis helps in understanding the behaviors of various users through mutual interactions with neighboring users.

Keywords: Social networking services · Public goods game · Coevolutionary dynamics · Complex networks

1 Introduction

T. Sugawara: This work was partly supported by KAKENHI (17KT0044, 19H02376, 18H03498).
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 495–506, 2020. https://doi.org/10.1007/978-3-030-36687-2_41

Social networking services (SNSs), such as Twitter, Facebook, Instagram and LinkedIn, have become an indispensable part of people’s lives. They are virtual places where people communicate in groups of close friends, communities, companies, and organizations, and they are used for various activities such as advertisement, marketing, and political campaigns as well as private and local communication. SNSs are maintained by the massive content generated through users’ voluntary participation; therefore, an SNS disappears if too little content is posted. Conversely, users can become free riders (or lurkers) who only read content without posting articles, because posting imposes some costs on users. Understanding why some users voluntarily post so much content while others behave as free riders is a crucial issue and helps to keep SNSs thriving.

Many studies have analyzed the factors that lead users to become free riders on SNSs [5,7,10]. Sun et al. [10] clarified the factors that make users stop generating content in online communities using a motivation model that determines online behavior. They argued that the factors causing free riding are, for example, low-quality messages, low response rates, and long response delays; these are inevitable, given the different levels and types of participants. They also claimed that free riders could be encouraged to participate by introducing external stimuli (rewards) and new norms. Some studies analyzed the network features of failed SNSs [5,7].

A number of studies have tried to identify incentives for voluntary participation on SNSs using game-theoretic simulation models. For example, Toriumi et al. [11] proposed the meta-rewards game to model users’ behaviors on SNSs using evolutionary game theory. The meta-rewards game extends Axelrod’s meta-norms game [2], which is a kind of public-goods game used to see what could prompt cooperation in social dilemma situations. Their experimental results indicated that cooperation on SNSs emerged when rewards for comment returns (i.e., meta-rewards) were introduced. Hirahara et al.
[6] proposed an SNS-norms game by adding the structural features of SNSs to the meta-rewards game; for example, users who respond to comments on articles are likely to be the users who originally posted the articles. They then conducted agent-based simulations on artificial complex networks and a Facebook ego network to identify the optimal behavior in the game.

Although evolutionary algorithms used in network analysis, including these studies [6,11], have attempted to search for a common optimal or better strategy for the entire network, two issues must be considered. First, applying genetic algorithms (GAs) to find an optimal solution (strategy) that is common to all agents does not fit actual SNSs; for example, the strategy learned by a hub agent (such as a celebrity) with many followers is not necessarily advantageous for general (non-hub) agents. Of course, all users may be homogeneous in the sense that they attempt to increase the rewards they receive, but they are in diverse circumstances because their numbers of followers/friends are quite different; thus, they develop their own behavioral strategies in an SNS as strategies emerge among the surrounding users. Second, the learned strategy is only the final result of long-term interactions, and it says nothing about the process of learning; i.e., the strategies of neighboring agents affect each other as they are learned, and thus the learning process is more complicated but worth understanding.


This discussion motivated us to introduce co-evolution, a phenomenon in which different species affect each other and evolve together, into evolutionary network analysis based on game theory [3]. Along this line, Miura et al. [9] proposed the multiple-world genetic algorithm (MWGA), a co-evolutionary genetic algorithm for (complex) networks that maintains the diversity of nodes. In the MWGA, the network including all nodes (users) is duplicated into several networks, in which nodes in the same position have different strategies and all users interact with diverse neighbors in the different network worlds. All agents then simultaneously learn their own reasonable strategies by considering the strategies examined in all duplicated networks. In this study, we introduced the MWGA into the agent-based simulation of the SNS-norms game and analyzed the diverse strategies that arise in various circumstances, which addresses the first issue. We also found that the MWGA makes it possible to follow how agents learn strategies in accordance with the strategies learned by their neighbors, so we analyzed this process of simultaneous learning, which addresses the second issue, and identified why agents adopted the strategies to which they converged.

2 Modeling Social Networking Services

2.1 Agents

We briefly describe the model, including the agents and their behaviors on SNSs, which is based on the networked evolutionary game. This model is identical to that proposed by Hirahara et al. [6]. Let $G = (V, E)$ be a graph representing the underlying network, where $V = \{v_1, \ldots, v_n\}$ is the set of nodes that correspond to agents representing users on an SNS and $E$ is the set of links between agents. The links in $E$ represent the interaction structure among agents. An agent has two behavioral strategies, cooperation and defection. Cooperation involves contributing to the SNS, i.e., posting articles and comments, and defection involves free riding, i.e., obtaining benefits by just reading posted articles and comments. Agents have two learning parameters specifying their behavioral strategies: the probability $B_i$ of posting a new article and the probability $L_i$ of posting a comment on a posted article or on a posted comment. $B_i$ and $L_i$ are each encoded by a three-bit gene (in total, a 6-bit gene) for the MWGA; thus, each takes one of the discrete values 0/7, 1/7, ..., 7/7.
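The gene encoding can be illustrated with a minimal Python sketch. The paper states only that each probability is encoded by three bits and takes a value in {0/7, ..., 7/7}; the bit order (most significant bit first) and the choice to place the bits for the article-posting probability before those for the comment-posting probability are assumptions made here for illustration, as is the function name `decode_gene`.

```python
def decode_gene(gene):
    """Split a 6-bit gene into (B, L), each in {0/7, 1/7, ..., 7/7}.

    Assumed layout (not specified in the paper): bits 0-2 encode B,
    bits 3-5 encode L, most significant bit first.
    """
    if len(gene) != 6 or any(bit not in (0, 1) for bit in gene):
        raise ValueError("gene must be a sequence of six bits")

    def to_fraction(bits):
        # Read three bits as an unsigned integer 0..7 and scale to [0, 1].
        value = bits[0] * 4 + bits[1] * 2 + bits[2]
        return value / 7

    return to_fraction(gene[:3]), to_fraction(gene[3:])
```

With this layout, the gene `[1, 1, 1, 0, 0, 0]` would describe an agent that always posts articles and never comments.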

2.2 SNS-Norms Game

A round of the SNS-norms game proceeds as follows (see also Fig. 1). For agent $v_i \in V$, the initial values of $B_i$ and $L_i$ are randomly determined before the game starts. In the $t$-th round, a value $0 \le S_{i,t} \le 1$, which represents the amount of fun or interest in the content of the article that $v_i$ intends to post, is randomly selected for $v_i$. If $S_{i,t} \ge 1 - B_i$, $v_i$ posts an article and receives a negative reward as cost $F$ ($< 0$). A neighboring agent $v_j$ that reads the article receives $M > 0$ as a reward. After that, $v_j$ posts a comment on the read article with probability $L_j$ and receives cost $C < 0$. Then, $v_i$ gets a reward $R > 0$ because it received the comment on the article it posted. When $v_i$ receives a comment from $v_j$, $v_i$ gives a comment return with probability $L_i$. In that case, $v_i$ receives a negative reward $C' < 0$ and $v_j$ receives a reward $R' > 0$ as a meta-reward.

Fig. 1. SNS-norms game.

Fig. 2. Conceptual structure of multiple-world GA.

Table 1. Sum of rewards (costs) for each action.

Type of action              Cooperate                                      Defect
Article (post article a)    $F + R \log_e(N_c(a) + 1) + C' N_{cc}(a)$      $0$
Comment (post comment c)    $M + C + R' \log_e(N_m(c) + 1)$                $M$

Table 1 shows the agent’s total rewards for each action, where $N_c(a)$ is the number of comments on article $a$ posted by neighboring agents, and $N_{cc}(a)$ is the number of comment returns on the received comments. $N_m(c)$ is the number of comment returns on comment $c$, which is 0 or 1 in the SNS-norms game. Note that the reward for posting an article follows the Weber-Fechner law [4], and it increases logarithmically with the number of comments received [8]. The SNS-norms game is an evolutionary game aimed at finding the agents’ reasonable strategies, specified by $B_i$ and $L_i$, to gain higher rewards through interactions.
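The reward totals in Table 1 can be written out as a short Python sketch. The numeric values are those listed in Table 2 of the paper; the function names and the constant names `C_RET` and `R_META` (standing for the comment-return cost and the meta-reward, written C′ and R′ here) are hypothetical and chosen only for illustration.

```python
import math

# Parameter values taken from Table 2 of the paper.
F, M = -3.0, 1.0          # cost of posting an article, reward for reading one
C, C_RET = -2.0, -2.0     # cost of a comment (C) and of a comment return (C')
R, R_META = 9.0, 9.0      # reward for a received comment (R) and comment return (R')


def article_reward(n_comments, n_comment_returns):
    """Total reward for posting one article (Table 1, article row, cooperate)."""
    return F + R * math.log(n_comments + 1) + C_RET * n_comment_returns


def comment_reward(got_comment_return):
    """Total reward for reading an article and commenting on it
    (Table 1, comment row, cooperate); N_m(c) is 0 or 1."""
    n_m = 1 if got_comment_return else 0
    return M + C + R_META * math.log(n_m + 1)
```

Because the comment reward enters through a logarithm, each additional comment on an article adds less than the previous one, which is the diminishing marginal utility that the text attributes to the Weber-Fechner law.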

2.3 Structure of Network

Real-world networks observed in human society have three complex network properties: a high clustering coefficient, the small-world property, and the scale-free property [1]. We generate an artificial agent network following the connecting nearest neighbor model (CNN model) [12], which has all three properties. The characteristics of networks generated by the CNN model (CNN networks) are determined by the parameter $u$, the probability of turning a potential edge into a real one, where the set of potential edges is defined as $\{(v_i, v_j) \notin E \mid v_i, v_j \in V \text{ and } \exists v_k \in V \text{ s.t. } (v_i, v_k), (v_j, v_k) \in E\}$.
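A minimal, unoptimized Python sketch of such a growth process is given below. The paper specifies only the parameter u and the notion of a potential edge (an unlinked pair of agents with a common neighbor); the growth rule used here is an assumption based on the usual description of the CNN model: with probability u a randomly chosen potential edge becomes a real edge, and otherwise a new node is attached to a uniformly chosen existing node. The function name `cnn_network` and the two-node starting graph are likewise illustrative choices.

```python
import random


def cnn_network(n, u, seed=None):
    """Grow an undirected graph with n >= 2 nodes.

    Returns a dict mapping each node to the set of its neighbors.
    """
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}          # assumed seed graph: a single edge
    potential = set()               # unlinked pairs that share a neighbor

    def add_edge(a, b):
        adj[a].add(b)
        adj[b].add(a)
        potential.discard((min(a, b), max(a, b)))
        # Each other neighbor z of one endpoint now shares that endpoint
        # as a common neighbor with the other endpoint.
        for x, y in ((a, b), (b, a)):
            for z in adj[x]:
                if z != y and z not in adj[y]:
                    potential.add((min(y, z), max(y, z)))

    while len(adj) < n:
        if potential and rng.random() < u:
            # Convert one potential edge (O(|potential|) per draw).
            a, b = rng.choice(tuple(potential))
            add_edge(a, b)
        else:
            # Add a new node linked to a uniformly chosen existing node.
            new = len(adj)
            target = rng.randrange(new)
            adj[new] = set()
            add_edge(new, target)
    return adj
```

For example, `cnn_network(1000, 0.9)` uses the same N and u as Table 3; under this rule roughly nine potential edges are converted for every node added, so the average degree should be of the same order as the reported 19.8, although the sketch is not claimed to reproduce the paper's networks exactly.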

3 Co-evolutionary Game

We utilized the MWGA [9] as the co-evolutionary computation to find diverse reasonable strategies for individual agents, and we briefly explain it here. Existing studies use genetic algorithms in which agents inherit the better strategies of neighboring agents for the next generation, under the assumption that neighbors’ excellent strategies are worth mimicking. However, in actual social networks, which have the three properties of complex networks, agents are in diverse surroundings; thus, their strategies must also differ.

3.1 Multiple-World GA

In the MWGA, we make $W$ copies of the master network $G = (V, E)$, each of which is denoted by $G^l = (V^l, E^l)$ for $1 \le l \le W$, where $W$ is a positive integer called the multiple-world number. We represent the set of agents in the $l$-th world as $V^l = \{v_1^l, \ldots, v_n^l\}$ ($\cong V$). An example of a multiple-world GA structure is illustrated in Fig. 2. The set of copies of agent $v_i \in V$ is denoted by $A_i = \{v_i^1, \ldots, v_i^W\}$, and the agents in $A_i$ occupy the same position in all worlds. Initial genes are randomly given to all agents. Therefore, the agents in $A_i$ have different genes and behave differently, even though they are copies of a single agent $v_i$. For $v_i$’s neighboring agent $v_j$, the agents in $A_j$ also have different strategies, so the agents in $A_i$ experience the game differently and receive different rewards as a result. In the following experiments on the SNS-norms game, we assumed that all agents had four chances to post articles during a generation and then simultaneously entered the co-evolution phase consisting of three operators: (parent) selection, crossover, and mutation. Then, with the new genes, all agents entered the next generation and repeated this process a certain number of times.
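The multiple-world layout can be sketched as a simple data structure in Python. Since all worlds share the same edge set, only the genes need to be stored per world; the nested-list representation and the helper names `init_worlds` and `copies_of` are assumptions made for illustration and are not taken from the paper.

```python
import random

GENE_BITS = 6  # three bits for B_i plus three bits for L_i


def init_worlds(n_agents, n_worlds, seed=None):
    """Return worlds[l][i]: the random 6-bit gene of agent i in world l."""
    rng = random.Random(seed)
    return [[[rng.randint(0, 1) for _ in range(GENE_BITS)]
             for _ in range(n_agents)]
            for _ in range(n_worlds)]


def copies_of(worlds, i):
    """Genes of the set A_i: the copies of agent i across all worlds."""
    return [world[i] for world in worlds]
```

With `worlds = init_worlds(1000, 30)`, the call `copies_of(worlds, i)` would return the 30 genes that later compete with each other in the selection step for agent i.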

3.2 Genetic Operations

In the parent selection phase, agent $v_i^l \in \bigcup_{l=1}^{W} V^l$ selects two agents as parents from $A_i$ by following the probability distribution $\{P_i^l\}_{l=1}^{W}$,

$$P_i^l = \frac{(f(v_i^l) - f_{\min})^2}{\sum_{v \in A_i} (f(v) - f_{\min})^2}, \qquad (1)$$

where $f(v_i^l)$ is the fitness function, the value of which is the sum of the rewards received during the SNS-norms games in the current generation (see Table 1), and $f_{\min} = \min_{v \in A_i} f(v)$. The genes for the next generation are generated from the selected parents by applying uniform crossover and flip-bit mutation with a probability of 0.005 for each bit. For example, if $W = 30$, the gene of approximately one agent in $A_i$ mutates in every generation (because $30 \times 6 \times 0.005 = 0.9$). Then, the agent with the generated gene is placed as $v_i^l$ in the next generation.
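The three operators can be sketched in Python as follows. The selection weights follow Eq. (1); two details are assumptions not stated in the paper: the uniform fallback when all copies have the same fitness (the denominator of Eq. (1) is then zero), and drawing the two parents with replacement. The function names are hypothetical.

```python
import random


def selection_probs(fitness):
    """Selection probabilities over the copies in A_i, following Eq. (1)."""
    f_min = min(fitness)
    weights = [(f - f_min) ** 2 for f in fitness]
    total = sum(weights)
    if total == 0:
        # Assumed fallback: all copies are equally fit, so pick uniformly.
        return [1 / len(fitness)] * len(fitness)
    return [w / total for w in weights]


def next_gene(genes, fitness, rng, mutation_rate=0.005):
    """Create one child gene from the copies of a single agent."""
    probs = selection_probs(fitness)
    # Parent selection (assumed to be with replacement).
    parent1, parent2 = rng.choices(genes, weights=probs, k=2)
    # Uniform crossover: each bit comes from either parent with equal chance.
    child = [rng.choice(pair) for pair in zip(parent1, parent2)]
    # Flip-bit mutation applied independently to each bit.
    return [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
```

Calling `next_gene` once per world for the same agent, each time with the genes and fitness values of all W copies, would produce that agent's W genes for the next generation. Note that the squared weights give the least fit copy a selection probability of zero.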

Table 2. Parameter values in experiments.

Parameter       Description                                            Value
$W$             Multiple-world number                                  30
-               Mutation rate                                          0.005
$F$             Cost of posting article                                −3.0
$M$             Reward for reading article                             1.0
$C$ and $C'$    Costs of comment and comment return                    −2.0
$R$ and $R'$    Rewards for receiving comment and comment return       9.0

Table 3. Parameters and characteristics of the CNN networks.

Parameter       Description                                            Value
$N$             Number of agents                                       1000
$u$             Probability of changing a potential edge to an edge    0.9
-               Average degree (average number of friends)             19.8
-               Average clustering coefficient                         0.468
-               Average characteristic path length                     3.31
-               Average power-law exponent                             −1.128

4 Experiments and Discussion

We simulated SNS-norms games on a CNN network and investigated how agents individually learn (co-evolve) their strategies (the probability $B$ of posting an article and the probability $L$ of posting a comment) through interactions with their neighboring agents. Table 2 shows the parameters used in the experiments. The rewards and costs were defined on the basis of the experiments done by Axelrod [2]. The parameters of the CNN networks are listed in Table 3. These parameters also follow existing studies [6,11] so that we can describe the effect of the MWGA appropriately. We conducted the experiments using the MWGA for 2000 generations. The values of $B_i$ and $L_i$ for agent $v_i$ were defined as the averages over the $W$ agents in $A_i$.

4.1 Comparison of Learned Strategies and Fitness Values

In our experiment, we analyzed the effects of evolutionary computation using the MWGA on strategy learning in the SNS-norms game and compared them with those using the conventional GA. Figure 3 shows the transition of all agents’ average probabilities of posting an article $B$ and a comment $L$ in the SNS-norms game. Figure 4 shows the transition of the average fitness value of all agents. Figure 3 indicates two different results: $B$ and $L$ converged to around 0.5 and 0.4 when using the MWGA, while they were 0.05 and 1.0 when using the conventional GA. As shown in Fig. 4, the average fitness value converged to approximately 10 using the conventional GA but to 80 when using the MWGA. These differences suggest that the MWGA evolved different strategies that yield much higher fitness values.

Fig. 3. Probability of posting.

Fig. 4. Fitness value.

To confirm that agents gained higher fitness values with the MWGA, we calculated the improvement in the fitness values of individual agents, i.e., the fitness value with the MWGA minus that with the conventional GA. The results are plotted in Fig. 5, where the vertical axis shows the average improvement in fitness between the 1500th and 2000th generations and the horizontal axis shows the degrees of the agents. Figure 5 indicates that almost all the agents (more than 99%), especially those with higher degrees, gained a higher fitness value. Only a few agents lost fitness value, and we think that this is due to mutation. These results indicate that the agents could find their own better strategies.

Fig. 5. Improvement in fitness value.

4.2 Distribution of Strategies and Dynamics of Learning Process

Next, we investigated the distribution of the strategies learned by agents and their learning dynamics in the SNS-norms game. First, we plotted the tuple of $B_i$, $L_i$, and the agent’s degree (which corresponds to the number of friends) in Fig. 6(a) (when using the conventional GA) and in Fig. 6(b) (when using the MWGA).

Fig. 6. Distribution of $B_i$ and $L_i$: (a) conventional GA; (b) multiple-world GA.

Figure 6(a) reveals similar strategies regardless of the degree values. However, agents learned various strategies when using the MWGA. For example, agents with higher degrees were likely to post articles but not to post comments (article writers). Normal agents (with relatively low degrees) could be divided into three groups: a group in which agents actively post articles and comments, one in which agents post articles but do not post comments very much, and one in which agents are free riders with $B_i$ and $L_i$ of almost zero. We think that these results indicate that the agents could learn diverse strategies depending on their surroundings.

We then focused on the dynamics of strategy learning by agents with the MWGA. Figure 7 shows the distribution of $B$ and $L$ at certain generations of co-evolution in the SNS-norms game. In the 0th generation, because each agent was given a strategy at random, the average article posting rate $B$ and the average comment posting rate $L$ followed a normal distribution centered at $B_i = 0.5$ and $L_i = 0.5$ (Fig. 7(a)). In the 10th generation (Fig. 7(b)), we can see that all agents learned a high comment posting rate. Moreover, the agents with high degrees (hub agents) learned low article posting rates, while the agents with low degrees (non-hub agents) learned high values for both posting rates, $B$ and $L$. Then, agents with low degrees gradually learned that a lower $L$ was a better strategy for gaining high fitness values. After that, the agents with both high and low degrees gradually turned to the strategy of posting articles, so their position in the graph gradually moved to the right corner (i.e., $L$ decreased). Figure 7(c) shows the middle of this movement process. At the same time, a few agents with low degrees became free riders with $B$ and $L$ of almost 0.0. We found that these free riders were mainly connected with high-degree (hub) agents: because the hub agents became likely to post articles but hardly ever posted comments, such free riders could get sufficient rewards just by reading articles from the hub agents. After the aforementioned transition, the hub agents who posted only comments could not receive comment returns, so they stopped posting comments.

Fig. 7. Dynamics in co-evolved strategies: (a) 0th generation; (b) 10th generation; (c) 100th generation; (d) 500th generation.

However, this just resulted in lower rewards, and they began to find that posting articles could gain more rewards. Thus, hub agents moved from the left corner in Fig. 7(c) to the right corner shown in Fig. 7(d), which is the distribution in the 500th generation and is almost identical to the final state shown in Fig. 6(b). Therefore, hub agents ended up with high $B$ and low $L$, meaning they became article authors and hardly posted comments. Along with this dynamic change in agents’ behaviors, the number of free riders increased, because the free riders could read many articles without doing anything if the hub agents posted articles.

4.3 Discussion

Our experimental results indicate that the diverse strategies of individual agents and the dynamic co-evolution process during the learning of strategies can be observed by simulating the SNS-norms game with the MWGA. In this section, we discuss why agents in various circumstances adopted their own strategies.

First, we focus on the agents with high degrees (hub agents). When using the conventional GA, the hub agents behaved in the end as comment writers to gain higher fitness, as shown in Fig. 6(a). Hub agents could generally gain rewards more easily because they had many friends who might post articles and comments even though the friends’ $B$ and $L$ were low. However, this situation does not necessarily apply to non-hub agents, although they are likely to learn from hub agents when using the conventional GA, because hub agents usually receive more rewards (and thus have higher fitness values). This results in a low article posting rate $B$ and in agents not gaining many rewards. In contrast, Fig. 6(b) shows that hub agents behave as article writers with the MWGA, which is quite different from the result with the conventional GA. They could gain enough rewards by receiving many comments because many of their neighboring agents have high $L$. In actual SNSs, celebrities and popular users post articles to get feedback and consent but hardly ever respond to the posts from normal users. We believe that this result is in good agreement with the actual activities in an SNS.

Second, we tried to understand why some agents turned into free riders in the end. For this purpose, we focused on a certain free rider $v_f \in V$ and its neighbors and investigated how their strategies changed across generations. Figure 8 shows the transition of $B$, $L$, and the fitness values of the agent of interest $v_f$ (Fig. 8(a)) and its neighbors. Note that agent $v_f$ had five neighbors: one non-hub agent that, apart from $v_f$, was connected only with hub agents (in this sense, the non-hub agent and $v_f$ were in similar circumstances), and four hub agents with an average degree of 150. We only show two hub agents (Fig. 8(b)–(c)) and one non-hub agent (Fig. 8(d)) because all hub agents connected to $v_f$ behaved in the same way.

Fig. 8. Strategy and fitness dynamics of the focused agent and its neighbors: (a) agent of interest $v_f$; (b) hub agent A; (c) hub agent B; (d) non-hub agent C.

Looking at the hub agents (Fig. 8(b)–(c)), their fitness values gradually declined (but were still higher than those of non-hub agents), although their values of $B$ and $L$ were almost unchanged until the 130th generation. This is because their neighboring non-hub agents, including the agent of interest (Fig. 8(a)), learned low $L$ (posted fewer comments) and high $B$ to gain high fitness. After that, the hub agents changed strategies to behave as article writers, and $v_f$ could receive only a few comments even if it posted articles. Thus, $v_f$ gradually stopped posting articles. Then, it found that it could receive sufficient rewards just by reading articles from its neighboring hub agents without posting articles or comments. Therefore, we can say that normal non-hub agents connected mainly with hub agents are likely to become free riders. Applied to an actual SNS, this suggests that users who tend to follow only hub accounts, such as news communicators/broadcasters and celebrities who just post articles, easily become free riders who do not post any content. However, Fig. 6(b) also shows that many active non-hub agents posted both articles and comments. We found that these agents are connected not only to hub agents who post articles but also to quite a few non-hub friends; they behaved actively to maintain the activity among these friends. These phenomena seem to match the behavior in actual SNSs, but such realistic behaviors and diverse strategies could not evolve when using the conventional GA.

In Fig. 4, we can see that the fitness value fluctuated during learning with the MWGA: it increased until the 20th generation, decreased to around 85 by the 100th generation, increased to 105 by the 200th generation, and then gradually decreased to around 80. In very early generations, all agents tried to post articles and comments.
However, because all the agents were rational and looked for the strategies that brought more rewards, we think that a dilemma situation appeared, i.e., these rational behaviors decreased the fitness values of neighboring agents, who also looked for strategies to gain more rewards. This seems to be the reason for the first decrease. Then, after the 100th generation, the hub agents gradually changed their behaviors and became article writers. This resulted in the temporary increase around the 200th generation. Next, the dilemma situation appeared again, the total rewards decreased, and the number of free riders increased. This interpretation is also supported by the data in Figs. 3 and 8.

5 Conclusion

We investigated how users of social networking services (SNSs) dynamically identify their own reasonable strategies using the SNS-norms game, which is a game-theoretic model of SNSs, with the MWGA, a co-evolutionary algorithm with which diverse strategies emerge depending on the circumstances of agents. Through comparison experiments with existing studies that use the conventional GA, we found that the strategies of agents obtained with the MWGA had higher fitness values. In addition, we could observe the dynamic evolving process of individual agents. This feature of the MWGA is quite helpful for understanding the phenomena that occur in SNSs and the reasons behind them, which are quite complicated because agents’ strategies are mutually influenced by the strategies selected by neighboring agents. On the basis of our experimental results using a simulation with the MWGA, we could reproduce a plausible model of dynamic behaviors that explains well the process of behavior selection in actual SNSs. We plan to clarify which network characteristics, including those of the neighboring agents, determine the agents’ strategies in the co-evolutionary SNS model. The findings can be applied to friend recommendation systems on SNSs to increase the activity level of free riders.

References

1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002). https://doi.org/10.1103/RevModPhys.74.47
2. Axelrod, R.: An evolutionary approach to norms. Am. Polit. Sci. Rev. 80(4), 1095–1111 (1986)
3. Ebel, H., Bornholdt, S.: Coevolutionary games on networks. Phys. Rev. E 66(5), 056118 (2002)
4. Fechner, G.T., Howes, D.H., Boring, E.G.: Elements of Psychophysics, vol. 1. Holt, Rinehart and Winston, New York (1966)
5. Garcia, D., Mavrodiev, P., Schweitzer, F.: Social resilience in online communities: the autopsy of Friendster. In: Proceedings of the First ACM Conference on Online Social Networks, COSN 2013, pp. 39–50. ACM, New York (2013). https://doi.org/10.1145/2512938.2512946
6. Hirahara, Y., Toriumi, F., Sugawara, T.: Evolution of cooperation in SNS-norms game on complex networks and real social networks. In: International Conference on Social Informatics, pp. 112–120. Springer (2014)
7. Lőrincz, L., Koltai, J., Győr, A.F., Takács, K.: Collapse of an online social network: burning social capital to create it? Soc. Netw. 57, 43–53 (2019)
8. Miura, Y., Toriumi, F., Sugawara, T.: Evolutionary learning model of social networking services with diminishing marginal utility. In: Companion Proceedings of the The Web Conference 2018, WWW 2018, pp. 1323–1329. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2018). https://doi.org/10.1145/3184558.3191573
9. Miura, Y., Toriumi, F., Sugawara, T.: Multiple world genetic algorithm to analyze individually advantageous behaviors in complex networks. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO 2019, pp. 297–298. ACM, New York (2019). https://doi.org/10.1145/3319619.3321989
10. Sun, N., Rau, P.P.L., Ma, L.: Understanding lurkers in online communities: a literature review. Comput. Hum. Behav. 38, 110–117 (2014)
11. Toriumi, F., Yamamoto, H., Okada, I.: Why do people use social media? Agent-based simulation and population dynamics analysis of the evolution of cooperation in social media. In: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, Volume 02, pp. 43–50. IEEE Computer Society (2012)
12. Vázquez, A.: Growing network with local rules: preferential attachment, clustering hierarchy, and degree correlations. Phys. Rev. E 67(5), 056104 (2003)

Shannon Entropy in Time–Varying Clique Networks

Marcelo do Vale Cunha1,2(✉), Carlos César Ribeiro Santos1, Marcelo Albano Moret1,3, and Hernane Borges de Barros Pereira1,3

1 Centro Universitário SENAI CIMATEC, Salvador, BA 41650-010, Brazil
2 Instituto Federal da Bahia, Barreiras, BA 47808-006, Brazil
3 Universidade do Estado da Bahia, Salvador, BA 41150-000, Brazil

Abstract. Recent works have used information theory in complex networks. Studies often discuss the entropy of the degree distribution of a network. However, there is no specific work on entropy in clique networks. This work therefore proposes a method to calculate the entropy of a clique network, as well as its theoretical maximum and minimum values. The entropies are calculated for a dataset of semantic networks of titles of scientific papers from the journals Nature and Science over approximately a decade. The journals are modeled as time–varying graphs, and each system is analyzed using a sliding time window. The results show the entropy values of vertices and edges in each window arranged as time series, and they also suggest the moments of greater or lesser vocabulary diversification and when this diversity brings the studied journals closer together or moves them apart. In this way, this report contributes to the study of clique networks and of the diffusion of human knowledge in journals of high scientific impact.

Keywords: Networks of cliques · Shannon entropy · Time–varying graphs · Semantic networks · Social network analysis

1 Introduction

The mathematical formalism of information as an entropy measure was first introduced by Claude Shannon in 1948. According to Shannon, information theory makes it possible to investigate and compare systems through random variables inherent in the composition of the system or in its properties [1]. As a consequence, the theory reaches several areas, such as biology, economics, and confined quantum systems, among others [2–4]. It may also provide a methodological link that unites different areas [5], including statistical and thermodynamic physics, in which several recent works have shown the importance of information entropy [6, 7]. Recently, many authors have introduced these concepts to measure the information contained in the degree or geodesic distance distributions of real networks or of classical network models, in order to differentiate these systems by the heterogeneity of their links [8–10].

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 507–518, 2020. https://doi.org/10.1007/978-3-030-36687-2_42


Time is very important in the analysis of systems whose elements connect to each other. In 2012, [11, 12] formalized diverse concepts and metrics used in time–varying networks, creating the concept of a time–varying graph (TVG). For a more comprehensive approach, [13] presented suggestions for specific algorithms and metrics for the many applications that require this model. Clique networks are widely applicable and fit the modeling of various social systems, e.g. movie actor networks [14], co-authoring networks [15], concept networks [16] and semantic networks [17–20]. The study of clique networks formed by words contributes mainly to the study of the organization of human language. In this context, knowledge representation systems such as scientific journals can be studied through the semantic networks of titles of scientific papers (STNs). STNs have gained a prominent role in research aimed at understanding the behavior and structure of this system, which consists of words that summarize the main contributions of works published in important scientific journals [18–20]. Despite the growing interest in Shannon entropy in various areas, no studies using this measure in clique networks were found. Therefore, this work proposes a method that calculates vertex and edge entropy in clique networks and calculates the maximum and minimum limits of the entropy values, according to the initial conditions of the clique networks studied here. The dataset used in this report consists of the STNs of the journals Nature and Science from 1998 to 2008. To extract more accurate information from the system, the associated network was constructed as a TVG. The results seek correlations between the two journals in a given period and compare different times of the same journal, contributing to the study of the scientific dissemination of important reports over the studied decade.

2 Background

2.1 Information Entropy

Information theory has evolved in recent decades and has been applied in different fields, such as telecommunications, computer science, physics, genetics, ecology and the discussion of the fundamental process of scientific observation [21]. The mathematical concept of information developed by Claude Shannon considers that the information contained in a message is associated with the number of possible values or states that this message may have [1]. Thus, if the system has only one possible state (e.g. the degree of the vertices in a regular network), no information is obtained upon inspection. The more possible states a system has, the more information it contains; in other words, more can be learned by discovering its actual state. Entropy is the expected value of the uncertainty of a random variable $X$ (a system state) with respect to a probability distribution, Eq. 1:

$$H(X) = -k \sum_i p_i \log p_i. \qquad (1)$$

In Eq. 1, $X$ is the random variable, $p_i$ is the probability of state $i$ of this variable (with $\sum_i p_i = 1$), and $k$ is a constant that sets the unit of measure; if $k$ is chosen so that the logarithm is taken in base 2, the entropy value

Shannon Entropy in Time–Varying Clique Networks


is given in bits. Hereafter, this value of $k$ will be used. Each calculated entropy has an associated maximum and minimum value. When these limits are known, they help to evaluate how much the real value deviates from these idealized situations. For the probability distribution of a random variable $X$, the minimum entropy occurs when the uncertainty is minimal. For example, when there is only one possible state for $X$, we are 100% sure about this state, so $H(X) = 0$. On the other hand, the maximum entropy occurs when all $N$ possible states of the variable are equally likely, i.e. $p_i = \frac{1}{N}$ and $H(X) = -\sum_{i=1}^{N} \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N$. Thus, the entropy of a random variable $X$ with $N$ possible states lies within these limits, as shown in Eq. 2,

$$0 \le H(X) \le \log_2 N \ \text{bits}. \qquad (2)$$
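The two limits in Eq. 2 can be checked numerically. The following is a minimal sketch in Python, not part of the original work; the function name `shannon_entropy` and the example counts are illustrative assumptions.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# One possible state: no uncertainty, so the entropy is zero.
h_min = shannon_entropy([10])

# N equiprobable states: maximum uncertainty, log2(N) bits.
N = 8
h_max = shannon_entropy([1] * N)

# A skewed distribution over 4 states lies strictly between 0 and log2(4).
h_mid = shannon_entropy([5, 1, 1, 1])
```

Working with raw counts rather than probabilities avoids a separate normalization step and mirrors how frequencies are later used for vertices and edges.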

2.2 Time Varying Graphs

Real networks are strongly influenced by the dynamics of their vertices (entering and leaving the network) and by changes in the connections between them. A proper study of such systems therefore has to consider the temporal elements in their sets of vertices and edges. Among the various ways to study the effects of time on a network, one particularly useful model is the time–varying graph (TVG). Following the formalization in [11, 12], a TVG can be understood as a static graph $G = (V, E)$ plus temporal parameters (functions or sets): $\sigma$ (latency function), $\gamma$ (presence function) and $\Gamma$ (lifetime). Thus, a TVG is the quintuple shown in Eq. 3,

$$G = (V, E, \gamma, \sigma, \Gamma). \qquad (3)$$

In Eq. 3, $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices and $E = \{e_1, e_2, \ldots, e_m\}$ is the set of edges of the system, where $e_k = (v_i, v_j)$ (with $i \neq j$ and $i, j = 1, 2, \ldots, n-1, n$). For these sets, $n = |V|$ and $m = |E|$. The time parameters are as follows. The set $\Gamma \subseteq \mathbb{Z}$, $\Gamma = \{t_1, t_2, t_3, \ldots, t, \ldots, t_{T-1}, t_T\}$, represents the system lifetime, discrete in time. Each element of $\Gamma$ represents a date or time instant. The interval between the extreme dates is the total time $T = t_T - t_1 + 1$. The smallest variation between dates (two consecutive instants) is the time unit of $\Gamma$. The presence function $\gamma : E \times \Gamma \to \{0, 1\}$ indicates whether a given edge exists at a given time $t \in \Gamma$. The latency function $\sigma$ represents the time required to form an edge.

2.3 Clique Networks

A clique is a maximal complete graph or subgraph, i.e. one in which every vertex is connected to all the others. A clique network is thus a graph formed by the union of cliques, through the processes of juxtaposition and overlapping, in which cliques share common vertices and common edges, respectively [22]. Figure 1 shows the process of forming these networks.
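The union of cliques can be illustrated with a short Python sketch. It is only an illustration under simple assumptions: cliques are given as lists of hashable labels, edges are stored as sorted pairs, and the names `clique_network`, `initial_vertices` and `initial_edges` are hypothetical.

```python
from itertools import combinations

def clique_network(cliques):
    """Union of cliques: return (vertices, edges) of the resulting network."""
    vertices, edges = set(), set()
    for clique in cliques:
        members = sorted(set(clique))
        vertices.update(members)
        # Every pair inside a clique is connected; pairs are sorted tuples,
        # so an edge shared by two cliques is stored only once.
        edges.update(combinations(members, 2))
    return vertices, edges

cliques = [
    ["a", "b", "c"],  # joined to the next clique by vertex "c" only (juxtaposition)
    ["c", "d", "e"],  # shares vertices "d", "e" and edge (d, e) with the next one (overlap)
    ["d", "e", "f"],
]
vertices, edges = clique_network(cliques)

# Counts before the union, taken clique by clique.
initial_vertices = sum(len(set(c)) for c in cliques)
initial_edges = sum(len(set(c)) * (len(set(c)) - 1) // 2 for c in cliques)
```

In this toy case the union has fewer vertices and edges than the sum over the separate cliques, because the shared vertex and the shared edge are merged.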


Fig. 1. (a) Cliques on the original configuration; (b) cliques joined from common vertices forming a clique network. In juxtaposition cliques are joined by only one vertex, while in overlap the union is made by at least two vertices and an edge.

Several works investigate systems that fit the clique network model, from the social [15] to the biological field [23], alongside theoretical works on these networks such as [24] and [22], which proposed a set of indices to capture the properties of clique networks and a method to characterize the small-world phenomenon in this type of network. Semantic clique networks, in which the words that make up a sentence of a text, a university course syllabus, a set of keywords, or the title of an article are the vertices of a clique, are increasingly being studied. For instance, [16] analyzes the structure of meaningful concepts in written discourse, while [17, 25] used semantic clique networks to analyze the relationship between words that emerge in oral speech. Others have proposed important methodologies for studying semantic networks of titles of scientific papers (STN): [18, 19] studied the topological structure of STNs as a way to analyze the efficiency of information diffusion, and [26] used STNs to compare the titles of journal articles on mathematics education in English and Portuguese. The work in [27] considered a time–varying STN and observed a memory effect in the network.

3 Materials and Method

3.1 Dataset

The dataset is composed of the titles of the articles published in the journals Nature and Science from 1999 to 2008 [18]. These journals were chosen because of their high impact factors and similar publication frequency in the collected period. Table 1 summarizes the collected data. We assume that the system to be analyzed comprises only lexical words, so grammatical words (e.g. prepositions, articles, and pronouns) are not considered elements of the system because they have no intrinsic meaning.


Table 1. Data on Nature and Science journals from 1999 to 2008.

Data's information                              Nature    Science
Publication frequency                           Weekly    Weekly
Number of articles published (1999 to 2008)     11798     3490
Number of weeks (1999 to 2008)                  512       514

The words of these titles were treated according to the rules proposed in [18]. After treatment, the data were organized so that each journal has a set of text files, where each file contains the treated titles corresponding to one week of publication (one issue of the journal). From the treated data, clique networks are built in which each vertex represents a treated word and the edges connect words that belong to the same title. The final network is called a semantic network of titles of scientific papers (STN).

3.2 Construction of Time–Varying Semantic Networks of Titles of Scientific Papers

The STN is built for different periods. In other words, the time–varying semantic network of titles of scientific papers (TVSNT) takes into account the temporal information contained in its titles when the network is constructed. We use the same parameters as a TVG: the set of vertices $V$ is formed by the treated words of each STN in the collected period, and the set of edges $E$ is formed by the pairs of words that belong to the same title. For the time parameters, the elements of the set $\Gamma$ represent the collection time of the titles, given in weeks, since this is the minimum publication period of the journals (footnote 1). The presence function $\gamma$ indicates whether two words occur in the same title at least once at a given instant. For this work, we do not use the latency function $\sigma$.

3.3 Sliding Window

The network can be observed at one or more consecutive instants of the set $\Gamma$. To broaden the possibilities of network analysis, this work proposes the use of a sliding window function $w_{\tau,s}$, where $\tau$ is the size of the time window and $s$ is the step taken by the window in time. Assuming that the values of $\tau$ and $s$ are constant and chosen by the researcher, the set of windows that fit into a TVG is $\{\tau_1, \tau_2, \ldots, \tau_k, \ldots, \tau_{K-1}, \tau_K\}$. In this set, the time distance between two consecutive windows is given by $s = t_k - t_{k-1}$, where $t_k = t \in \Gamma$ is the first instant or date of the window $\tau_k$. Thus, $K \in \mathbb{N}$ with $K = \frac{T - \tau}{s}$ and $t + \tau \le T$. Therefore, the TVG can be written as a set of static graphs, Eq. 4:

$$G = \{G_{\tau_k}\} = \{G_{\tau_1}, G_{\tau_2}, \ldots, G_{\tau_k}, \ldots, G_{\tau_{K-1}}, G_{\tau_K}\}, \qquad \tau_k = [t + (k-1)s,\ t + (k-1)s + \tau - 1]. \qquad (4)$$

Footnote 1: For Nature, $T = 514$ weeks and for Science, $T = 512$ weeks.
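As an illustration of Eq. 4, the Python sketch below generates the windows and builds one static graph per window. It is a sketch under stated assumptions, not the authors' implementation: weeks are numbered from 1, the data layout `titles_by_week` and the function names are hypothetical, and the loop simply keeps every window that fits inside the lifetime rather than relying on a closed-form count.

```python
from itertools import combinations

def windows(T, tau, s, t1=1):
    """Windows [t1 + (k-1)s, t1 + (k-1)s + tau - 1] that fit in [t1, t1 + T - 1]."""
    result = []
    start = t1
    while start + tau - 1 <= t1 + T - 1:
        result.append((start, start + tau - 1))
        start += s
    return result

def window_graph(titles_by_week, window):
    """Static graph of one window: an edge is present if two words share
    a title in at least one week of the window (the presence function)."""
    first, last = window
    vertices, edges = set(), set()
    for week in range(first, last + 1):
        for title in titles_by_week.get(week, []):
            words = sorted(set(title))
            vertices.update(words)
            edges.update(combinations(words, 2))
    return vertices, edges

# Toy data (hypothetical): week -> list of treated titles, each a list of words.
titles_by_week = {
    1: [["gene", "expression", "yeast"]],
    2: [["quantum", "dot"]],
    3: [["gene", "yeast", "genome"]],
    4: [["climate", "model"]],
}
ws = windows(T=4, tau=2, s=1)
graphs = [window_graph(titles_by_week, w) for w in ws]
```

With a step smaller than the window size, consecutive windows overlap, so a title can contribute to more than one static graph.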

3.4 Metrics Used

For each clique network in a window, the following properties were used [22]:

• $n_q$: number of cliques in the initial configuration, which is the number of titles;
• $n$: number of vertices in the final configuration, which is the number of different words in the network;
• $m$: number of edges in the final configuration, which is the number of distinct word pairs in the title network;
• $m_0$: number of edges in the initial configuration, which is the number of word pairs in the titles;
• $n_0$: number of vertices in the initial configuration, i.e. the total number of words, $n_0 \ge n$;
• $\vartheta(v_i)$: frequency of vertex $i$ in the initial configuration, which is the number of titles containing vertex $i$ $(1 \le \vartheta(v_i) \le n_q)$;
• $\vartheta(v_i, v_j)$: frequency of edge $ij$ in the initial configuration, which is the number of titles containing both words $i$ and $j$ $(1 \le \vartheta(v_i, v_j) \le n_q$, with $i, j \in \{1, 2, \ldots, n-1, n\}$, $i \neq j$ and $ij = ji)$;
• $q_{max}$: number of vertices of the largest clique in the initial configuration, which is the largest title size $(1 \le q_{max} \le n)$;
• $q_{min}$: number of vertices of the smallest clique in the initial configuration, which is the smallest title size $(1 \le q_{min} \le n)$;
• $q_i$: number of vertices of clique $i$ in the initial configuration, which is the size of title $i$ $(1 \le i \le n)$.

3.5 Information Entropy in Titles Networks

In our study, two random variables arising from the process of network formation are considered: the vertex and the edge. The probabilities of occurrence of a vertex $i$ and of an edge $ij$ in the network in each TVG window $\tau_k$ are calculated according to Eqs. 5 and 6. To simplify the notation we write $\tau_k = t$, so care must be taken not to confuse the window number with a date,

$$p_i(t) = \left( \frac{\vartheta(v_i)}{n_0} \right)_t, \qquad (5)$$

$$p_{ij}(t) = \left( \frac{\vartheta(v_i, v_j)}{m_0} \right)_t. \qquad (6)$$

Equations 7 and 8 give the Shannon entropy of these distributions, where $H_v(t)$ and $H_e(t)$ are the entropies of vertices and edges, respectively, at a given time $t$,

$$H_v(t) = -\sum_{i=1}^{n} p_i(t) \log_2 p_i(t), \qquad (7)$$

$$H_e(t) = -\sum_{ij=1}^{m} p_{ij}(t) \log_2 p_{ij}(t). \qquad (8)$$
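Equations 5 to 8 translate directly into a frequency count followed by an entropy sum. The Python sketch below shows one way to do this for the titles of a single window; the function name `window_entropies` and the toy titles are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def window_entropies(titles):
    """Return (H_v, H_e) in bits for a list of titles (each a list of treated words)."""
    vertex_freq, edge_freq = Counter(), Counter()
    for title in titles:
        words = sorted(set(title))
        vertex_freq.update(words)                 # number of titles containing each word
        edge_freq.update(combinations(words, 2))  # number of titles containing each word pair
    n0 = sum(vertex_freq.values())  # vertices in the initial configuration
    m0 = sum(edge_freq.values())    # edges in the initial configuration
    h_v = -sum((c / n0) * math.log2(c / n0) for c in vertex_freq.values())
    h_e = -sum((c / m0) * math.log2(c / m0) for c in edge_freq.values())
    return h_v, h_e

# Toy window (hypothetical): one title is repeated, so one edge has frequency 2.
titles = [["gene", "yeast"], ["gene", "yeast"], ["quantum", "dot"]]
h_v, h_e = window_entropies(titles)
```

Here the number of distinct words is `len(vertex_freq)` and the number of distinct edges is `len(edge_freq)`, which correspond to $n$ and $m$ in the list of metrics.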

3.6 Limits for Entropy Value

The limits shown in Eq. 2 (Sect. 2.1) are not necessarily attained by the entropies associated with the construction of clique networks. In this section, the extremes are recalculated based on boundary conditions (or initial conditions) for the formation of a clique network. The following conditions were used for the journals studied (footnote 2): the number of cliques in the initial configuration $n_q$; the size of the largest clique $q_{max}$; the size of the smallest clique $q_{min}$; the number of vertices $n$; and the number of vertices in the initial configuration $n_0$.

Minimum Entropy. The minimum entropy value is associated with maximum certainty about the variable. Two factors contribute strongly to this: (i) a minimum number of possible states for the variable and (ii) greater repetition of one or a few of its possible states. To meet the boundary conditions, the $n$ vertices are distributed over the $n_q$ cliques without repeating any vertex, with the number of vertices per clique $q_i$ neither exceeding the maximum value nor falling below the minimum value, i.e. $q_{min} \le q_i \le q_{max}$. Figure 2 shows a scheme of this arrangement, named configuration 1. In it, there are $x$ cliques of size $q$ and $y$ cliques of size $q + 1$, so that $x + y = n_q$ and $xq + y(q + 1) = n$, which gives $q = \lfloor n / n_q \rfloor$ and $y = n - q n_q$.

Fig. 2. Example of a scheme based on configuration 1 and configuration 2 using real data from the window $t = 202$ of the TVG of the Science titles network shown in this paper. In this window, there are $n_q = 160$ titles with $n = 761$ different words from a total of $n_0 = 968$ words. Configuration 1 minimizes the number of edges in the network and consequently the edge entropy, $H_e^{min} = 10.49$ bits, and maximizes the vertex entropy, $H_v^{max} = 9.57$ bits. Configuration 2 minimizes the vertex entropy; for $t = 202$ in the TVG of Science, $H_v^{min} = 8.34$ bits.
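Configuration 1 can be worked through numerically for the window quoted in the Fig. 2 caption. The Python sketch below is an illustration only: it assumes that, with no repeated vertex or edge, both distributions are uniform, and it does not check the $q_{min}$ and $q_{max}$ constraints. The function names are hypothetical.

```python
import math

def configuration_1(n, n_q):
    """Spread n distinct vertices over n_q cliques: x cliques of size q, y of size q + 1."""
    q = n // n_q
    y = n - q * n_q
    x = n_q - y
    return q, x, y

def limits_configuration_1(n, n_q):
    """Return (H_e_min, H_v_max) in bits, assuming uniform distributions
    because no vertex and no edge is repeated in this configuration."""
    q, x, y = configuration_1(n, n_q)
    # A clique of size k contributes k(k-1)/2 edges.
    m = x * q * (q - 1) // 2 + y * (q + 1) * q // 2
    return math.log2(m), math.log2(n)

# Values from the Fig. 2 caption: n_q = 160 titles, n = 761 different words.
q, x, y = configuration_1(n=761, n_q=160)
h_e_min, h_v_max = limits_configuration_1(n=761, n_q=160)
```

Under these assumptions the sketch gives 39 cliques of size 4 and 121 cliques of size 5, and the two entropies come out close to the 10.49 and 9.57 bits reported in the caption.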

Footnote 2: Depending on the investigated system, it may not be necessary to use all of these conditions, or it may be necessary to include others or replace the existing ones.


This configuration generates the lowest entropy for the edges of the network because it guarantees the smallest number of edges. In general, the repetition of a state also reduces the entropy of a variable. In clique networks, however, this does not hold for edges, because a repeated edge exists in more than one clique. The two vertices that compose it are then forced to connect to all the other vertices of each of those cliques, which causes a considerable increase in the number of edges, that is, of possible states, and consequently an increase in entropy.

For the minimum vertex entropy, we start from configuration 1 and add the remaining $n_0 - n$ repeated vertices one by one, giving the first vertices added the maximum possible repetition. Thus, if $n_0 - n > n_q - 1$, there will be a vertex present in all $n_q$ cliques. If $(n_0 - n) - (n_q - 1) > n_q - 1$, the process continues with another vertex that will be in every clique, or in as many cliques as possible. For each vertex added to all cliques, the value $n_q - 1$ is subtracted from the number of vertices that have not yet been added, until this subtraction results in a number $n' \le n_q - 1$; the last vertex is then added, clique by clique, into $n'$ cliques. This configuration, called configuration 2, increases the probability of some vertices so as to reduce the entropy to the smallest possible value while respecting the initial conditions of the problem.

Maximum Entropy. For maximum edge entropy, the number of edges should be increased as much as possible while avoiding their repetition. For this purpose, the vertices are distributed according to the initial conditions so that there is a group of $x$ cliques of size $q_{max}$ and another group of $y$ cliques of size $q_{min}$, with the possibility of one clique of size $q_\Delta$ such that $q_{min} < q_\Delta < q_{max}$, as shown in Fig. 3 (initial configuration 3). After that, the remaining $n_0 - n$ repeated vertices are added one by one to the cliques, with maximum repetition per vertex for the first vertices (final configuration 3), as shown in Fig. 3. This procedure increases the number of cliques of maximum size, which increases the number of distinct edges and consequently their entropy.

For the maximum vertex entropy, configuration 1 already gives the largest possible entropy, since all vertices appear without repetition, so the maximum vertex entropy is given by the well–known expression $H_v^{max} = \log_2 n$. For the dataset of this work, $n \ge n_q$ in every window. For larger time windows, it might happen that $n < n_q$. In this case, some adjustments are required in the calculation of the limits; for example, in configuration 1, $q = 0$, $q + 1 = 1$, $y = n$ and $x = n_q - n$. This violates the boundary conditions, since $q = 0 < q_{min}$. Thus, some of the $n_0 - n$ repeated vertices will need to be distributed among the cliques so that each one has $q = q_{min}$ vertices. Due to the page limit, this case will be developed in a subsequent article.
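The construction of configuration 2 can be sketched as a greedy assignment of the repeated vertices. The Python code below is one possible reading of the procedure, under the assumptions that each vertex starts with frequency 1, that a vertex can join at most $n_q - 1$ further cliques, and that the $q_{max}$ constraint is not checked. The function names and the toy numbers are hypothetical.

```python
import math

def configuration_2_frequencies(n, n0, n_q):
    """Vertex frequencies after adding the n0 - n repeated vertices
    with maximal repetition for the first vertices."""
    freqs = [1] * n         # configuration 1: every vertex appears once
    remaining = n0 - n      # repeated vertices still to be placed
    i = 0
    while remaining > 0 and i < n:
        extra = min(remaining, n_q - 1)  # at most n_q - 1 further cliques per vertex
        freqs[i] += extra
        remaining -= extra
        i += 1
    return freqs

def entropy_bits(freqs):
    """Shannon entropy in bits of the distribution defined by the frequencies."""
    total = sum(freqs)
    return -sum((f / total) * math.log2(f / total) for f in freqs)

# Toy case (hypothetical numbers): 6 distinct words, 11 word occurrences, 4 titles.
freqs = configuration_2_frequencies(n=6, n0=11, n_q=4)
h_v_min = entropy_bits(freqs)
h_v_max = math.log2(6)
```

In the toy case the first vertex ends up in all 4 cliques and the second in 3, which concentrates probability on a few vertices and lowers the entropy below $\log_2 n$.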


Fig. 3. Example of a scheme designed for configuration 3 using real data from the window $t = 188$ of the TVG of the Science title network shown in this paper. In this window, at the beginning of the configuration, there are $n_q = 146$ titles with $n = 728$ different words from a total of $n_0 = 968$ words. Repeated vertices are added until the total number of vertices reaches $n = n_0 = 898$; this is the configuration that maximizes the number of edges in the network, $H_e^{max} = 11.50$ bits.

4 Results and Discussion

Figure 4 shows the vertex and edge entropies, as well as their respective maximum and minimum values, over time for the two journals. Figure 5 shows the entropies normalized by their extremes, $H' = \frac{H - H_{min}}{H_{max} - H_{min}}$, for the vertices and edges of the two journals over time. The graphs reveal some interesting results. The moments at which the entropy moves away from its maximum may indicate trends in the journal's vocabulary at the time. The vertex entropy values are higher and vary much less than the edge entropy values. Moreover, in various intervals $H_v$ and $H_e$ show opposite growth trends. Increasing $H_e$ implies the generation of new edges, which is possible through the addition of repeated vertices to the cliques, and this makes $H_v$ decrease. Besides, in some of the studied periods the edge entropies of the two journals move in opposite directions, with one journal reaching a high entropy value while the other shows a low one. Although it is not necessarily a maximum value, $H_e = \log_2 m$ is plotted to show how similar and strongly correlated the real edge entropy is with this value. This indicates windows whose clique networks have little edge overlap, but with potential for more.
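The normalization used for Fig. 5 is a simple rescaling to the interval between the two limits. A minimal Python sketch follows; the function name and the numerical values are hypothetical and are not taken from the dataset.

```python
def normalize_entropy(h, h_min, h_max):
    """Rescale h to [0, 1] using H' = (H - H_min) / (H_max - H_min)."""
    if h_max == h_min:
        raise ValueError("H_max and H_min coincide; normalization is undefined")
    return (h - h_min) / (h_max - h_min)

# Hypothetical values in bits for one window.
h_norm = normalize_entropy(h=9.2, h_min=8.4, h_max=9.6)
```

A normalized value near 1 means the observed entropy is close to its maximum for the given boundary conditions, and a value near 0 means it is close to the minimum.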


Fig. 4. Vertex entropy $(H_v)$ and edge entropy $(H_e)$ for the journal TVGs, with sliding window $w_{8,1}$. Windows $\tau = [t, t + 8]$ from $t_1$ = January 5, 1999 for Science and $t_1$ = January 7, 1999 for Nature.

Fig. 5. Entropies normalized by their maximum and minimum extremes for two journals.

Although entropy measures are sensitive to sample size, we use here the entire dataset from the collected period. This allows a proper comparison of the two journals, even when their entropy values are close. It is worth noting that the real vertex entropy satisfies $H_v \approx \log_2 n$ in every time window of both journals. For the edge entropy, there are periods in which the values deviate from the corresponding maximum.


The entropy values calculated here do not require a null model (i.e. a random network) for comparison, since the process of constructing configurations 1, 2 and 3 is already randomized. It is also important to emphasize that a network of cliques has high clustering, which means that there is no corresponding random network, since in random networks the clustering coefficient tends to zero $(C \to 0)$ [28].

5 Conclusions

The results showed a strong correlation between the entropy values and their respective maximum values, especially for the vertex entropy. The graphs also show that the journals have a greater diversity of words than of word pairs. In other words, with the journal's vocabulary in a window, there are many more possible combinations for word pairs than for repeating them in the titles. The method of constructing semantic clique networks is consistent with previous works regarding the vocabulary diversity of high-impact scientific journals. The study of vertex and edge entropy in clique networks can be combined with the study of the emergence of communities in these networks, as well as with correlations with other indicators specific to this type of network (e.g. fidelity incidence [17], reference diameter and fragmentation [22], among others).

Acknowledgment. This paper was financially supported by the Rectory of Research and Innovation of the Federal Institute of Bahia (PRPGI-IFBA) and the Senai Cimatec-BA University Center, from its preparation to its presentation at Complex Networks 2019.

References

1. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
2. Mousavian, Z., Kavousi, K., Masoudi–Nejad, A.: Information theory in systems biology. Part I: gene regulatory and metabolic networks. In: Seminars in Cell & Developmental Biology, vol. 51, pp. 3–13. Academic Press (2016)
3. Mishra, S., Ayyub, B.M.: Shannon entropy for quantifying uncertainty and risk in economic disparity. Risk Anal. 39(10), 2160–2181 (2019)
4. Nascimento, W.S., Prudente, F.V.: Shannon entropy: a study of confined hydrogenic–like atoms. Chem. Phys. Lett. 691, 401–407 (2018)
5. Zenil, H., Kiani, N.A., Tegnér, J.: Methods of information theory and algorithmic complexity for network biology. In: Seminars in Cell & Developmental Biology, vol. 51, pp. 32–43. Academic Press (2016)
6. Zurek, W.H.: Complexity, vol. 8. Entropy and the Physics of Information. CRC Press, Boca Raton (2018)
7. Gao, X., Gallicchio, E., Roitberg, A.E.: The generalized Boltzmann distribution is the only distribution in which the Gibbs-Shannon entropy equals the thermodynamic entropy. J. Chem. Phys. 151(3), 034113 (2019)
8. Solé, R.V., Valverde, S.: Information theory of complex networks: on evolution and architectural constraints. In: Complex Networks, pp. 189–207. Springer, Berlin (2004)
9. Ji, L., Bing–Hong, W., Wen–Xu, W., Tao, Z.: Network entropy based on topology configuration and its computation to random networks. Chin. Phys. Lett. 25(11), 4177 (2008)
10. Viol, A., Palhano-Fontes, F., Onias, H., de Araujo, D.B., Hövel, P., Viswanathan, G.M.: Characterizing complex networks using entropy–degree diagrams: unveiling changes in functional brain connectivity induced by Ayahuasca. Entropy 21(2), 128 (2019)
11. Nicosia, V., Tang, J., Musolesi, M., Russo, G., Mascolo, C., Latora, V.: Components in time varying–graphs. Chaos: Interdisc. J. Nonlinear Sci. 22(2), 023101 (2012)
12. Casteigts, A., Flocchini, P., Quattrociocchi, W., Santoro, N.: Time–varying graphs and dynamic networks. Int. J. Parallel Emergent Distrib. Syst. 27(5), 387–408 (2012)
13. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519(3), 97–125 (2012)
14. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
15. Newman, M.E.: Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64(1), 016131 (2001)
16. Caldeira, S.M., Lobao, T.P., Andrade, R.F.S., Neme, A., Miranda, J.V.: The network of concepts in written texts. Eur. Phys. J. B-Condens. Matter Complex Syst. 49(4), 523–529 (2006)
17. Teixeira, G.M., Aguiar, M.S.F., Carvalho, C.F., Dantas, D.R., Cunha, M.V., Morais, J.H.M., Pereira, H.B.B., Miranda, J.G.V.: Complex semantic networks. Int. J. Modern Phys. C 21(03), 333–347 (2010)
18. Pereira, H.D.B., Fadigas, I.S., Senna, V., Moret, M.A.: Semantic networks based on titles of scientific papers. Phys. A: Stat. Mech. Appl. 390(6), 1192–1197 (2011)
19. Pereira, H.B.B., Fadigas, I.S., Monteiro, R.L.S., Cordeiro, A.J.A., Moret, M.A.: Density: a measure of the diversity of concepts addressed in semantic networks. Phys. A: Stat. Mech. Appl. 441, 81–84 (2016)
20. Grilo, M., Fadigas, I.S., Miranda, J.G.V., Cunha, M.V., Monteiro, R.L.S., Pereira, H.B.B.: Robustness in semantic networks based on cliques. Phys. A: Stat. Mech. Appl. 472, 94–102 (2017)
21. Brillouin, L.: Science and Information Theory. Courier Corporation, Chelmsford (2013)
22. Fadigas, I.D.S., Pereira, H.B.D.B.: A network approach based on cliques. Phys. A: Stat. Mech. Appl. 392(10), 2576–2587 (2013)
23. Adamcsek, B., Palla, G., Farkas, I.J., Derényi, I., Vicsek, T.: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8), 1021–1023 (2006)
24. Derényi, I., Palla, G., Vicsek, T.: Clique percolation in random networks. Phys. Rev. Lett. 94(16), 160202 (2005)
25. Lima–Neto, J.L.A., Cunha, M., Pereira, H.B.B.: Redes semânticas de discursos orais de membros de grupos de ajuda mútua. Obra Digit.: J. Commun. Technol. 14, 51–66 (2018)
26. Henrique, T., de Sousa Fadigas, I., Rosa, M.G., de Barros Pereira, H.B.: Mathematics education semantic networks. Soc. Netw. Anal. Mining 4(1), 200 (2014)
27. Cunha, M.V., Rosa, M.G., Fadigas, I.S., Miranda, J.G.V., Pereira, H.B.B.: Redes de títulos de artigos científicos variáveis no tempo. In: Anais do II Brazilian Workshop on Social Network Analysis and Mining, CSBC 2013, Maceió–AL, pp. 194–205 (2013)
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393(6684), 440–442 (1998)

Two-Mode Threshold Graph Dynamical Systems for Modeling Evacuation Decision-Making During Disaster Events

Nafisa Halim1, Chris J. Kuhlman2(B), Achla Marathe2, Pallab Mozumder3, and Anil Vullikanti2

1 Boston University, Boston, MA 02218, USA [email protected]
2 University of Virginia, Charlottesville, VA 22904, USA {cjk8gx,achla,vsakumar}@virginia.edu
3 Florida International University, Miami, FL 33199, USA [email protected]

Abstract. Recent results from social science have indicated that neighborhood effects have an important role in an evacuation decision by a family. Neighbors evacuating can motivate a family to evacuate. On the other hand, if a lot of neighbors evacuate, then the likelihood of an individual or family deciding to evacuate decreases, for fear of looting. Such behavior cannot be captured using standard models of contagion spread on networks, e.g., threshold models. Here, we propose a new graph dynamical system model, 2mode-threshold, which captures such behaviors. We study the dynamical properties of 2mode-threshold in different networks, and find significant differences from a standard threshold model. We demonstrate the utility of our model through agent based simulations on small world networks of Virginia Beach, VA. We use it to understand evacuation rates in this region, and to evaluate the effects of the model and of different initial conditions on evacuation decision dynamics.

1 Introduction

Background. Extreme weather events displaced 7 million people from their homes in the first six months of 2019 alone [23]. With global warming, these events are becoming more frequent and more damaging. In 2017–2018 alone, there were 24 major events. In 2017, there were 16 weather events that together cost over $306 billion, according to NOAA. In 2018, there were eight hurricanes, two of which were category 3 or higher and caused more than $50 billion in damages.

Motivation. Timely evacuation is the only action that can reduce risk in many of these events. Although more people are exposed to these weather events, technological improvements in weather prediction, early warning systems, emergency management, and information sharing through social media have helped keep the number of fatalities fairly low. During Hurricane Fani [17], a record 3.4 million people were evacuated in India and Bangladesh and fewer than 100 fatalities were recorded [23]. However, in many disaster events, e.g. Hurricane Sandy, the fraction of people who evacuated has been much lower than what local governments would like. The decision to evacuate or not is very complex and depends on a large number of social, demographic, familial, and psychological factors, including forecasts, warnings, and risk perceptions [13,14,19,25,26]. Two specific factors have been shown to have an important effect on evacuation decisions. First, peer effects, i.e., whether neighbors and others in the community have evacuated, are important. Up to a point, this has a positive impact on the evacuation probability of a household, i.e., as more neighbors evacuate, a household becomes more likely to evacuate. Second, concerns about property, e.g., due to looting if a lot of people have already left, counteract the first effect and therefore have a negative impact on the evacuation probability. An important public policy goal in disaster planning and response is to increase the evacuation rates in an affected region, and understanding how this happens is crucial.

© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 519–531, 2020. https://doi.org/10.1007/978-3-030-36687-2_43

Summary of Results. There is a lot of work on modeling peer effects, e.g., the spread of diseases, information, fads and other contagions [1,5,7]. A number of models have been proposed, such as independent cascade [15] and different types of threshold models (e.g., [6,24]). These are defined on a network, with each node in state 0 or 1 (0 indicating non-evacuating, 1 indicating that a node has been influenced, e.g., is evacuating), and a rule for a node to change state from 0 to 1. For instance, in a τ-threshold model, a node switches from state 0 to state 1 if at least a τ-fraction of its neighbors are in state 1.
All prior models capture only the first effect above, i.e., as the number of affected neighbors increases, a node is more likely to switch to state 1. Here, we propose a new threshold model, referred to as 2mode-threshold, which inhibits a transition from state 0 to 1 if a sufficiently large fraction of a family's neighborhood is in state 1, and demonstrate its use in a large scale study. Our results are summarized below.

1. Dynamics of the 2mode-threshold model (results in Sects. 2 and 3). We introduce and formalize evacuation decision making as a graph dynamical system (GDS) [21] using 2mode-threshold functions at nodes. We study theoretically the dynamics of 2mode-threshold in different networks, and show significant differences from the standard threshold model that has no drop off. Specifically, we find that starting at a small set of nodes in state 1, the diffusion process does not go beyond a constant fraction of the network. System configurations in which more nodes are 1's (e.g., the all 1's vector of node states) are also fixed points, but our results imply that one cannot reach such fixed points with lots of 1's from most initial configurations that have a small number of 1's.

2. Agent based simulation and application (results in Sect. 4). We develop an agent-based modeling and simulation (ABMS) of the 2mode-threshold model on a realistic small world network in the region of Virginia Beach, VA. This region has a population of over 450,000, and households are geographically situated based on land-use data, with a real geo-location which invokes the concept of neighbors and long range connections [4]. We add edges between households based on the Kleinberg small world (KSW) model [16]. Our ABM enables us to capture heterogeneities in the modeling of the evacuation decision-making process. This includes not only heterogeneities in families, but also differences in (local) neighborhoods of families as represented in social networks. We use it to understand the evacuation rates in this region, and evaluate the effects of different initial conditions (e.g., the number of seeds, where seeds are families who are highly risk averse) on evacuation decision dynamics. For example, including the effects of looting can reduce evacuation rates by 50%.

Novelty and Implications. Models of type 2mode-threshold have not been studied before. Our ABM approach can help (i) understand how planners and managers can more effectively convince families that are in harm's way to evacuate; (ii) understand the effects of families' social networks on evacuation decisions [10,25,26]; and (iii) establish down-stream conditions after the evacuation decision has been made, to support additional types of analyses. For example, results from these studies can be used to forecast traffic congestion (spatially and temporally) during the exodus [19], and to determine places where shelters and triage centers should be established.

2 Evacuation Decision-Making Model

2.1 Motivation from Social Science

Our model is motivated by the analysis in [13] of a survey conducted in the counties affected by Hurricane Sandy in the northeastern United States, which is briefly summarized here. The goal of this survey was to assess factors driving evacuation decisions [20]. The survey was fairly large, covering over 1200 individuals with a response rate of 61.93%. A Binomial Logit model was applied to the survey data and tested for the factors associated with households' evacuation behaviors [13]. The results indicate that a respondent's employment status, consideration of neighbors' evacuation behavior, concerns about neighborhood criminal activities or looting, access to the internet in the household, age, and having flood insurance each play a significant role in a respondent's decision to evacuate during Hurricane Sandy. Noteworthy was the influence of neighbors' evacuation behaviors, and of concerns about looting and criminal behavior. Neighbors' evacuations had a statistically significant and positive effect on evacuation probability, but concerns about criminal and looting behavior had a significant negative effect, implying that if too many neighbors leave, then the remaining households are less likely to evacuate.

2.2 A Graph Dynamical Systems Framework

A graph dynamical system (GDS) is a powerful mathematical abstraction of agent-based models, and we use it here to develop a model of evacuation behavior, motivated by the survey analysis described above. A GDS $S$ describes the evolution of the states of a set of agents. Let $x^t \in \{0,1\}^n$ denote the vector of agent states at time $t$, with $x^t_v = 1$ indicating that agent $v$ has evacuated and $x^t_v = 0$ indicating that agent $v$ has not evacuated at time $t$. A GDS $S$ consists of two components: (1) an interaction network $G = (V, E)$, where $V$ represents the set of agents (in our case, the households which are deciding whether or not to evacuate), and $E$ represents a set of edges, with $e = \{u, v\} \in E$ if agents $u$ and $v$ can influence each other; and (2) a set $F = \{f_v : v \in V\}$ of local functions $f_v : \{0,1\}^{\deg(v)} \to \{0,1\}$, one for each node $v \in V$, which determine the state of node $v$ in terms of the states of $N(v)$, the set of neighbors of $v$. Given a vector $x^t$ describing the states of all agents at time $t$, the vector $x^{t+1}$ at the next time is obtained by updating each $x^{t+1}_v$ using its local function $f_v(\cdot)$. We say that a state vector $x^t$ is a fixed point of $S$ if the node states do not change, i.e., $x^{t+1} = x^t$.

The 2mode-Threshold Local Functions: Modeling Evacuation Behavior. The 2mode-threshold function $f_v(\cdot)$ is probabilistic and depends on the probability of evacuation, in order to capture the qualitative aspects of the results of [13]. It is shown in Fig. 1a and specifies the probability of evacuation $p_e$ for agent $v_i$ as a function of the fraction $\eta_1$ of neighbors of $v_i$ in state 1. We have $p_e = p_{e,\max}$ for $\eta_1 \in (\eta_{\min}, \eta_c]$, and $p_e = 0$ for $\eta_1 \in [0, \eta_{\min}]$ and for $\eta_1 > \eta_c$. In this paper, we primarily focus on $\eta_{\min} = 0$. Specifically, this captures the following effects: (i) peer (neighbor) influence can cause families to evacuate, and (ii) if too many of a family’s neighbors evacuate, there are not enough neighbors remaining behind to dissuade potential looters, so a family reduces its probability of evacuation. The first effect makes $p_e = p_{e,\max}$ for $\eta_1 > 0$, and the second effect results in $p_e$ dropping to zero once $\eta_1$ exceeds $\eta_c$.
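Written out as a piecewise function, the evacuation probability described above (and drawn in Fig. 1a) takes the following form. The notation $\eta_1(v, x^t)$ for the fraction of evacuated neighbors is introduced here only for illustration, and the second line assumes that evacuated households stay evacuated, which is how the simulations of Sect. 4 are described.

```latex
\[
\eta_1(v, x^t) = \frac{|\{u \in N(v) : x^t_u = 1\}|}{|N(v)|},
\qquad
p_e(\eta_1) =
\begin{cases}
p_{e,\max}, & \eta_{\min} < \eta_1 \le \eta_c,\\
0, & \eta_1 \le \eta_{\min} \ \text{or}\ \eta_1 > \eta_c,
\end{cases}
\]
\[
\Pr\bigl[x^{t+1}_v = 1 \mid x^t_v = 0\bigr] = p_e\bigl(\eta_1(v, x^t)\bigr),
\qquad
\Pr\bigl[x^{t+1}_v = 1 \mid x^t_v = 1\bigr] = 1 .
\]
```

Setting $\eta_c = 1$ removes the upper cutoff and recovers the regular probabilistic threshold behavior of Fig. 1b.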
Note that the special case where $p_e = p_{e,\max}$ for all $\eta_1 > \eta_{\min} = 0$ is a probabilistic variant of the $\eta_{\min}$-threshold function (e.g., [6]); we will sometimes refer to this as the “regular probabilistic threshold” model, and denote it by rp-threshold. This model is shown in Fig. 1b. These are models that can be assigned to any agent; in a GDS, an agent is a node that resides in a networked population.

Fig. 1. Dynamics models: probability $p_e$ of evacuation for a family versus the fraction $\eta_1$ of its neighbors in state 1 (i.e., evacuating). (a) The 2mode-threshold model: the evacuation probability is $p_e = 0$ for $\eta_1 = \eta_{\min} = 0$ and for $\eta_1 > \eta_c$. The maximum probability is $p_e = p_{e,\max}$ in the interval $(\eta_{\min}, \eta_c]$. (b) The rp-threshold model: this curve is similar to the previous curve, except that $p_e = p_{e,\max}$ for all $\eta_1 > \eta_{\min}$. This is a special case of 2mode-threshold, but is a variation of the regular probabilistic threshold model [6, 21, 24]. As an illustration, if an agent has 50% of its neighbors in state 1, then the model in (a) shows that $p_e = 0$, while (b) shows that $p_e = p_{e,\max} > 0$. An example with values for these parameters is given in the text.

Network Models. We describe the models for the contact network $G = (V, E)$, which is the other component of a GDS $S$. A node $v_i \in V$ represents a family, or a household. Edges represent interaction channels, for communication and observations. Edges are directed: a directed edge $(v_j, v_i) \in E$, with $v_i, v_j \in V$, means that family $v_j$ influences family $v_i$. We use the population model developed in [4] for representing the set $V$ of households. Edges are specified using the Kleinberg small world (KSW) network approach [16], and there are two types of edges: short range and long range. A short range edge $(v_j, v_i)$ represents either (i) a family $v_i$ speaking with (being influenced by) another family $v_j$ about evacuation decisions, or (ii) a family $v_i$ observing $v_j$’s home and inferring whether or not family $v_j$ has evacuated. A long-range edge represents a member of one family $v_i$ interacting with a member of family $v_j$ at work. Each edge has a label giving the distance between homes, computed from the (lon, lat) coordinates of each home. Thus, the KSW model has the following parameters: the node set $V$ and their attributes, the short-range distance $d_{sr}$ over which short-range edges are placed between nodes, and the number $q$ of long range

edges incident on each node $v_i$. For each node $v_i$, (i) short range edges $(v_j, v_i)$ are constructed for every $v_j$ with $d(v_j, v_i) \le d_{sr}$; and (ii) $q$ long range edges $(v_k, v_i)$ are placed at random, with probability proportional to $1/d(v_k, v_i)^{\alpha}$, for a parameter $\alpha$. Note that for each short range edge $(v_j, v_i)$, there is a corresponding edge $(v_i, v_j)$. See [16] for details.

Example. Figure 1a shows an example of the 2mode-threshold model with the parameters $p_{e,\max} = 0.2$ and $\eta_c = 0.4$. Figure 1b shows an rp-threshold model. The purpose of this example is to illustrate the dynamics of these models on a network of five agents. In Fig. 2, $x^1$ is the initial configuration with node 1 evacuated (in state 1), and nodes 2, 3, 4, and 5 not evacuated (in state 0). Nodes 2 and 3 have $\eta_1 = 1/3 < \eta_c = 0.4$, and so for both of them, the evacuation probability is $p_e = 0.2$. Nodes 4 and 5 have $\eta_1 = 0$, so $p_e = 0$ for them. Therefore, the probability that the state vector is $x^2$ at the next time step (see Fig. 2) is $p_{e,\max}(1 - p_{e,\max}) = 0.2 \cdot 0.8 = 0.16$, since only node 2 switches to 1. With respect to the configuration $x^2$, nodes 3, 4, and 5 have $\eta_1 = 2/3$, $1$, and $0$, respectively. Therefore, $p_e = 0$ for all these nodes, and $x^2$ is a fixed point of $S$ with the 2mode-threshold functions. However, for the regular probabilistic threshold model, with $\eta_{\min} < 0.3$, $x^2$ is not a fixed point, since nodes 3 and 4 both have $p_e = p_{e,\max}$ (since they have $\eta_1 > \eta_{\min}$). Therefore, in the regular probabilistic threshold model, the $x^2 \to x^3$ transition occurs with probability $p_{e,\max}^2 = 0.04$.

Fig. 2. An example showing the transitions in a GDS $S$ on a graph with five nodes and 2mode-threshold local functions, with parameters $p_{e,\max} = 0.2$ and $\eta_c = 0.4$. The figure shows a transition of the dynamics model from configuration $x^1$ to $x^2$, with shaded nodes indicating evacuation. The $x^1 \to x^2$ transition occurs with probability $p_{e,\max}(1 - p_{e,\max}) = 0.16$. For the above parameters, $x^2$ is a fixed point, and the node states do not change. However, if we had $\eta_c = 1$ (i.e., a regular probabilistic threshold), $x^2$ is not a fixed point, and there can be a transition to configuration $x^3$ with probability $p_{e,\max}^2 = 0.04$ (indicated as a dashed arrow).

Problems of Interest. We will refer to a GDS $S_{2m} = (G, F)$ in which the local functions are 2mode-threshold functions as a 2mode-threshold-GDS. Our objective in this paper is to study the following problems on an $S_{2m}$ system: (1) How do the dynamical properties of 2mode-threshold GDSs differ from those of $S$ with rp-threshold model functions? Do they have fixed points, and what are their characteristics? (2) How does the number of 1’s in the fixed point depend on the initial conditions and the model parameters, namely $p_{e,\max}$ and $\eta_c$? How can this be maximized? We provide solutions to these problems next.
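The five-node example can be checked by direct computation. The sketch below is not the authors' code: the edge list is one undirected graph inferred from the neighbor fractions quoted in the example (Fig. 2 itself is not reproduced here, so the real graph may differ), and the function names are hypothetical. Exact fractions are used so that the probabilities come out as 4/25 = 0.16 and 1/25 = 0.04.

```python
from fractions import Fraction

# Assumed edge set: one undirected graph consistent with the fractions in the
# example (nodes 2 and 3 see 1/3 evacuated neighbors in x1; nodes 3, 4, 5 see
# 2/3, 1 and 0 in x2). The actual graph of Fig. 2 may differ.
EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
NEIGHBORS = {v: set() for v in range(1, 6)}
for u, v in EDGES:
    NEIGHBORS[u].add(v)
    NEIGHBORS[v].add(u)


def evac_prob(eta1, p_max, eta_c, eta_min=Fraction(0)):
    """2mode-threshold probability; eta_c = 1 gives the rp-threshold model."""
    return p_max if eta_min < eta1 <= eta_c else Fraction(0)


def transition_prob(x, y, p_max, eta_c):
    """Probability of moving from configuration x to y in one synchronous step.

    x and y are sets of evacuated nodes; evacuated nodes are assumed to stay
    evacuated.
    """
    if not x <= y:
        return Fraction(0)
    prob = Fraction(1)
    for v in NEIGHBORS:
        if v in x:
            continue
        eta1 = Fraction(len(NEIGHBORS[v] & x), len(NEIGHBORS[v]))
        p = evac_prob(eta1, p_max, eta_c)
        prob *= p if v in y else 1 - p
    return prob


P_MAX, ETA_C = Fraction(1, 5), Fraction(2, 5)  # p_e,max = 0.2, eta_c = 0.4
x1, x2, x3 = {1}, {1, 2}, {1, 2, 3, 4}

print(transition_prob(x1, x2, P_MAX, ETA_C))        # 4/25
print(transition_prob(x2, x2, P_MAX, ETA_C))        # 1, so x2 is a fixed point
print(transition_prob(x2, x3, P_MAX, Fraction(1)))  # 1/25 under rp-threshold
```

Passing `Fraction(1)` as `eta_c` is what turns the same function into the rp-threshold model, mirroring the remark that rp-threshold is a special case of 2mode-threshold.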

3 Analyzing Dynamical Properties in Different Network Models

It can be shown that any $S_{2m}$ converges to a fixed point in at most $n/p_{e,\max}$ steps. $S_{2m}$ systems have significantly lower levels of diffusion (i.e., number of nodes ending up in state 1) compared to the rp-threshold model, as we discuss below. Many details are omitted for space reasons.

Lemma 1. Consider an $S_{2m}$ with $G = K_n$ being a complete graph on $n$ nodes. Starting at a configuration $x^0$ with a single node in state 1, $S_{2m}$ converges to a fixed point with at most $(p_{e,\max} + \eta_c)n$ nodes in state 1, in expectation. In contrast, in a regular probabilistic threshold system on $K_n$ with $\eta_{\min} = 0$, the system converges to the all 1’s vector as a fixed point.

Proof. Consider a state vector $x^t$ with $k$ nodes in state 1. Consider any node $v$ with $x^t_v = 0$. If $k \le \eta_c n$, then $\Pr[\text{node } v \text{ switches to } 1] = p_{e,\max}$. Therefore, the expected number of nodes which switch to 1 is $p_{e,\max}(n - k) \le n p_{e,\max}$. If $k > \eta_c n$, for every node in state 0, the probability of switching to 1 is $p_e = 0$. Therefore, the expected number of 1’s in a fixed point is at most $n p_{e,\max} + n \eta_c$.


On the other hand, in a regular probabilistic threshold model, the system does not converge until each node in state 0 switches to 1 (since $p_e = p_{e,\max}$ for all $\eta_1 > 0$).

We observe below that starting at an initial configuration with a single 1, $S_{2m}$ converges to a fixed point with at most a constant fraction of nodes in state 1. Note, however, that configurations with more than that many 1’s, e.g., the all 1’s vector, are also fixed points. The result below implies that those fixed points will not be reached from an initial configuration with a few 1’s.

Lemma 2. Consider an $S_{2m}$ on a $G(n, p)$ graph with $p\eta_c \ge \frac{6}{\epsilon^2}\frac{\log n}{n}$, for any $\epsilon \in (0, 1)$. Starting at a configuration $x^0$ with a single node in state 1, $S_{2m}$ converges to a fixed point with at most $(1 + 2\epsilon)(\eta_c + p_{e,\max})n$ nodes in state 1, in expectation. In contrast, in a regular probabilistic threshold system on $K_n$ with $\eta_{\min} = 0$, the system converges to the all 1’s vector as a fixed point.

Proof. (Sketch) Let $\deg(v)$ denote the degree of $v$. For a subset $S$, let $\deg_S(v)$ denote the degree of $v$ with respect to $S$, i.e., the number of neighbors of $v$ in $S$. For any node $v$, we have $E[\deg(v)] = np$. By the Chernoff bound [9], it follows that $\Pr[\deg(v) > (1 + \epsilon)np] \le e^{-\epsilon^2 np/3} \le 1/n^2$. Consider a set $S$ of size $\frac{1+\epsilon}{1-\epsilon}\eta_c n$. For $v \in S$, $E[\deg_S(v)] = |S|p$, and so $\Pr[\deg_S(v) < (1 - \epsilon)|S|p] \le e^{-\epsilon^2 |S|p/2} \le 1/n^2$. For $|S| \ge \frac{1+\epsilon}{1-\epsilon}\eta_c n$, we have $(1 - \epsilon)|S|p \ge (1 + \epsilon)\eta_c np$. Putting these together, with probability at least $1 - 2/n$, we have $\deg(v) \le (1 + \epsilon)np$ and $\deg_S(v) \ge (1 + \epsilon)\eta_c np \ge \eta_c \deg(v)$, for all nodes $v$. Therefore, if $S_{2m}$ reaches a configuration with nodes in set $S$ of size $\frac{1+\epsilon}{1-\epsilon}\eta_c n < (1 + 2\epsilon)\eta_c n$, with probability $1 - 2/n$, $S$ is a fixed point. With probability $\le 2/n$, $S$ is not a fixed point, and the process converges to a fixed point with at most $n$ 1’s, so that the expected number of 1’s in the fixed point is at most $|S| + 2 \le (1 + 2\epsilon)\eta_c n$. On the other hand, consider the last configuration $S'$ which has size $|S'| < (1 + 2\epsilon)\eta_c n$. Then, in expectation, at most $p_{e,\max} n$ additional nodes switch to state 1, after which point the configuration has more than $(1 + \epsilon)\eta_c n$ 1’s. Therefore, the expected number of 1’s in the fixed point is at most $(1 + 2\epsilon)(\eta_c + p_{e,\max})n$.
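The complete-graph bound of Lemma 1 is easy to probe numerically. The sketch below is not from the paper: it simulates the 2mode-threshold dynamics on K_n by tracking only the number k of evacuated nodes (valid because every node in state 0 sees the same fraction k/(n - 1) of evacuated neighbors), and compares the average fixed-point size with the bound (p_e,max + eta_c) n. The function name, seed, and parameter values are illustrative assumptions.

```python
import random


def fixed_point_size_complete_graph(n, p_max, eta_c, rng):
    """Run 2mode-threshold dynamics on K_n from one seed; return final count."""
    k = 1  # number of nodes in state 1
    # A node in state 0 has n - 1 neighbors, k of which are in state 1, so it
    # can still switch only while k / (n - 1) <= eta_c.
    while k < n and k / (n - 1) <= eta_c:
        # Each of the n - k nodes in state 0 switches independently.
        k += sum(rng.random() < p_max for _ in range(n - k))
    return k


rng = random.Random(2019)
n, p_max, eta_c = 1000, 0.15, 0.2
sizes = [fixed_point_size_complete_graph(n, p_max, eta_c, rng) for _ in range(200)]
mean_size = sum(sizes) / len(sizes)
print(f"mean fixed-point size: {mean_size:.1f}, bound: {(p_max + eta_c) * n:.0f}")
```

Calling the same function with `eta_c = 1.0` reproduces the contrast in the lemma: the loop then only stops once every node is in state 1.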

4 Agent-Based Simulations and Results

Simulation Process. Inputs to the simulation are a social network (described below), a set of local functions that quantifies the evacuation decision making process of each node vi ∈ V (see Sect. 2), and a set of seed nodes whose state is 1 (i.e., these nodes are set to “evacuate” at the start of a simulation instance, at time t = 0). All other nodes at time t = 0 are in state 0 (the non-evacuating state). We vary a number of simulation input parameters, as discussed immediately below, across simulations. Each simulation instance or run consists of a particular set of seed nodes, and time is incremented in discrete timesteps, from t = 0 to tmax . Here, tmax = 10 days, to model the ten days leading up to hurricane arrival. At each timestep, nodes that are in state 0 may change to state 1, per the models in Sect. 2. At each 1 ≤ t ≤ tmax , the state of the system


at time t − 1 is used to compute the next state of each vi ∈ V (corresponding to time t) synchronously; that is, all vi update their states in parallel at each t. A simulation consists of 100 runs, where each run has a different seed set; the network and dynamics models are fixed in a simulation across runs. We present results below based on averaging the results of the 100 runs.

Social Networks. Table 1 provides the social networks (and selected properties) that are used in simulations of evacuation decision making. The network model of Sect. 2.2 was used to generate KSW networks for Virginia Beach, VA. Inputs for the model were n = 113967 families forming the node set V, with (lat, long) coordinates, dsr = 40 m, α = 2.5 (see [16]), and q = 0 to 16.

Simulation Parameters Studied. The input parameters varied across simulations are provided in Table 2.

Table 1. Kleinberg small world (KSW) networks [16] used in our experiments and their properties. The number n of nodes is 113967 for all graphs. The short range distance is dsr = 40 m, and the exponent α = 2.5 is used for computing the probabilities of selecting particular long-range nodes with which to form long-range edges with each node vi ∈ V. Column “No. LR Edges” (= q) gives the number of long-range edges incoming to each node vi. There are five graph instances for every row. Average degree is dave and maximum degree is dmax, for in-degree and out-degree.

Network Class | No. LR Edges | Avg. In-Deg. | Max. In-Deg. | Avg. Out-Deg. | Max. Out-Deg.
KSW0 | 0 | 10.11 | 380 | 10.11 | 380
KSW2 | 2 | 11.71 | 382 | 11.71 | 381
KSW4 | 4 | 13.70 | 384 | 13.70 | 381
KSW8 | 8 | 17.70 | 388 | 17.70 | 382
KSW16 | 16 | 25.70 | 396 | 25.70 | 383
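The KSW construction of Sect. 2.2 with the parameters of Table 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' generator: coordinates are taken as planar positions in meters rather than (lat, long) pairs, all homes are assumed to have distinct positions, the q long-range sources of each node are drawn without replacement, and the brute-force O(n²) distance computation is only suitable for small inputs. The function name `build_ksw` is hypothetical.

```python
import math
import random


def build_ksw(coords, d_sr, q, alpha, seed=0):
    """Return a set of directed edges (source, target) for a KSW-style network.

    coords: list of (x, y) positions in meters (an assumed planar projection).
    """
    rng = random.Random(seed)
    n = len(coords)
    edges = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        dists = [math.dist(coords[j], coords[i]) for j in others]
        # Short-range edges: every node within distance d_sr influences i.
        for j, d in zip(others, dists):
            if d <= d_sr:
                edges.add((j, i))
        # Long-range edges: q sources drawn with probability ~ 1 / d^alpha.
        # A draw that coincides with a short-range edge merges with it.
        pool = list(others)
        weights = [d ** -alpha for d in dists]
        for _ in range(min(q, len(pool))):
            k = rng.choices(range(len(pool)), weights=weights)[0]
            edges.add((pool.pop(k), i))
            weights.pop(k)
    return edges


# Toy usage: a 3 x 3 grid of homes spaced 30 m apart, with the Table 1 values
# d_sr = 40 m and alpha = 2.5.
grid = [(30.0 * x, 30.0 * y) for x in range(3) for y in range(3)]
print(len(build_ksw(grid, d_sr=40.0, q=0, alpha=2.5)))  # short-range edges only
print(len(build_ksw(grid, d_sr=40.0, q=2, alpha=2.5)))
```

Because the distance test is symmetric, every short-range edge automatically appears in both directions, matching the remark in Sect. 2.2.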

Table 2. Summary of the parameters and their values used in the simulations.

Parameter | Description
Networks | Networks in Table 1. We vary q per the table, from 0 to 16.
Num. random seeds, ns | Number of seed nodes specified per run (chosen uniformly at random). Values are 50, 100, 200, 300, 400, and 500.
Threshold model | The 2mode-threshold model of Fig. 1a and the rp-threshold (i.e., classic threshold) model of Fig. 1b, in Sect. 2.
Threshold range, ηc | The range in relative degree over which nodes can change to state 1. Discrete values are 0.2 and 1.0. Note that ηc = 1 corresponds to the classic stochastic threshold model (Fig. 1b), whereas smaller values of ηc correspond to the 2mode-threshold model (Fig. 1a).
Maximum probability, pe,max | The maximum probability of evacuation pe,max of Fig. 1. Discrete values are 0.05, 0.10, and 0.15.
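The simulation procedure described in this section (random seeds at t = 0, synchronous updates for tmax = 10 steps, averaging over runs) can be sketched as follows. This is a minimal illustration rather than the authors' simulator: `in_nbrs` is an assumed input mapping each node to the nodes that influence it, nodes without in-neighbors are treated as having no evacuated neighbors, and the ring network in the usage lines is a toy stand-in for the KSW networks of Table 1.

```python
import random


def evac_prob(eta1, p_max, eta_c, eta_min=0.0):
    """2mode-threshold evacuation probability; eta_c = 1 gives rp-threshold."""
    return p_max if eta_min < eta1 <= eta_c else 0.0


def run_once(in_nbrs, n_seeds, p_max, eta_c, t_max, rng):
    """One run: returns the fraction evacuated at t = 0, 1, ..., t_max."""
    nodes = list(in_nbrs)
    state = {v: 0 for v in nodes}
    for v in rng.sample(nodes, n_seeds):  # seeds chosen uniformly at random
        state[v] = 1
    curve = [sum(state.values()) / len(nodes)]
    for _ in range(t_max):
        nxt = dict(state)  # synchronous update: read old state, write new
        for v in nodes:
            if state[v] == 0 and in_nbrs[v]:
                eta1 = sum(state[u] for u in in_nbrs[v]) / len(in_nbrs[v])
                if rng.random() < evac_prob(eta1, p_max, eta_c):
                    nxt[v] = 1
        state = nxt
        curve.append(sum(state.values()) / len(nodes))
    return curve


def simulate(in_nbrs, n_seeds, p_max, eta_c, runs=100, t_max=10, seed=0):
    """Average the per-timestep evacuated fraction over independent runs."""
    rng = random.Random(seed)
    curves = [run_once(in_nbrs, n_seeds, p_max, eta_c, t_max, rng)
              for _ in range(runs)]
    return [sum(c[t] for c in curves) / runs for t in range(t_max + 1)]


# Toy usage: 200 households on a ring, each influenced by its 4 nearest nodes.
ring = {i: [(i + d) % 200 for d in (-2, -1, 1, 2)] for i in range(200)}
print(simulate(ring, n_seeds=10, p_max=0.15, eta_c=0.2)[-1])
print(simulate(ring, n_seeds=10, p_max=0.15, eta_c=1.0)[-1])
```

Keeping the threshold function separate from the update loop makes it straightforward to switch between the two models of Table 2 by changing only `eta_c`.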

Basic Results and the Effects of Seeding. Figure 3b provides average fraction of population deciding to evacuate (Frac. DE) as a function of time for one


instance of the KSW2 category of networks. We use the 2mode-threshold model with pe,max = 0.15 and ηc = 0.2 (see Fig. 1a). A simulation uses a fixed value of number ns of random seed nodes per run, but the set of nodes differs in each run (see legend). Other simulation parameters are in the figure. Error bars indicate the variance in results across 100 runs (i.e., simulation instances). The variance is very small (the bars cannot be seen in the plots, and are barely visible even under magnified conditions). Hence we say no more about the variance in output. As number ns of random seeds increases from 50 to 500, the fraction deciding to evacuate fde increases from about 0.02 to 0.1.

Fig. 3. Simulation results of fraction of population deciding to evacuate (Frac. DE) versus simulation time. All results use the 2mode-threshold model of Fig. 1a, pe,max = 0.15, ηc = 0.2, and ns (number of random seeds) varies from 50 to 500 (see legend). Error bars denote variance. (The variance is very small.) (a) Results for one graph instance of network class KSW0 (i.e., q = 0 long range edges per node). (b) Results for one graph instance of network class KSW2 (i.e., q = 2 long range edges per node). (c) Results for one graph instance of network class KSW16 (i.e., q = 16 long range edges per node).

Effect of Graph Structure: Long Range Edges. The effect of the number q of long range edges is shown across the three plots in Fig. 3 for the 2mode-threshold model. For q = 0 (i.e., no long-range edges), the fraction of the population evacuating (Frac. DE) is fde ≈ 0. As q increases to 2 and then to 16 long-range edges per node, fde increases markedly. In particular, Fig. 3c shows that the spread of evacuation decisions has an upper bound in the 2mode-threshold model: once too many families have evacuated, the remaining families do not evacuate over concerns of looting and crime. This effect of greater contagion spreading as q increases is the “weak link” phenomenon [12], where long-range edges can cause remote nodes to change their state to 1 (i.e., evacuating), thus moving a “contagion” into a different region of the graph. Note that the speed with which the maximum of fde = 0.32 is attained increases with ns.

Effect of Dynamics Model: Maximum Evacuation Probability pe,max. Figure 4 shows the effect of pe,max in the 2mode-threshold model. As pe,max increases from 0.05 (Fig. 4a) to 0.10 (Fig. 4b) to 0.15 (Fig. 4c), the fraction of population evacuating keeps increasing over time for the smallest pe,max, almost plateaus for all ns when pe,max = 0.10, and plateaus more quickly for the largest pe,max. The values of pe,max were selected based on the survey results [13] mentioned in Sect. 2.1.

Fig. 4. Simulation results of fraction of population deciding to evacuate (Frac. DE) versus simulation time. All results use the 2mode-threshold model of Fig. 1a with ηc = 0.2, and ns (number of random seeds) varies from 50 to 500, for one instance of the KSW16 graph class, i.e., q = 16 long range edges per node (results are similar for other graph instances). (a) Results for pe,max = 0.05. (b) Results for pe,max = 0.10. (c) Results for pe,max = 0.15; this panel is the same as Fig. 3c, reproduced for completeness.

Fig. 5. Simulation results of fraction of population deciding to evacuate (Frac. DE) versus simulation time. All results use the rp-threshold model of Fig. 1b, where looting and crime are not concerns, and ns (number of random seeds) varies from 50 to 500, for one instance of the KSW16 graph class, i.e., q = 16 long range edges per node (results are similar for other graph instances). (a) Results for pe,max = 0.05. (b) Results for pe,max = 0.10. (c) Results for pe,max = 0.15. These results can be compared with the corresponding plots from Fig. 4 for the 2mode-threshold model.

Effect of Dynamics Model: Range of Relative Threshold for Transition to State 1. We compare results from the 2mode-threshold model (Fig. 4), with various values of pe,max and ηc = 0.2, against the rp-threshold model with the same pe,max values and ηc = 1.0 (Fig. 5). The corresponding plots, left to right in each figure, can be compared. As pe,max increases, the discrepancy between the two models increases: concern over looting dampens evacuation in the 2mode-threshold model. For pe,max = 0.15, the rp-threshold model results in Fig. 5c reach fde > 0.6, while the corresponding results for the 2mode-threshold model in Fig. 4c are only roughly one-half of the values of fde in Fig. 5c. Hence, the 2mode-threshold model can produce a large reduction (dampening) in the fraction of families evacuating, and ignoring the influence of looting and crime can cause a large overprediction of family evacuations.

5 Related Work

Many studies have identified factors that affect evacuation decision making. These include social networks and peer influence [18,22], risk perceptions, evacuation notices, storm characteristics [2,3,8], and household characteristics such as nationality, proximity to the hurricane path, pets, disabled family members, living in a mobile home, and access to a vehicle [11,25]. Other studies use social networks and relative threshold models to model evacuation behavior. A relative threshold [6,24] θi for agent vi is the minimum fraction of distance-1 neighbors in G(V, E) that must be in state 1 in order for vi to change from state 0 to state 1. Several studies [14,25,26] assign thresholds to agents in agent-based models (ABMs) of hurricane evacuation. Stylized networks of 2000 nodes are used in [14] to study analytical and ABM solutions to evacuation. In [25], 12,892 families are included in a model of a 1995 hurricane for which 75% of households evacuated. The authors include three demographic factors in their evacuation model, in addition to the peer influence that is captured by a threshold model. Small world and random regular stylized networks are used for social networks. Simulations of hurricane evacuation decision-making in the Florida Keys are presented in [26]. The simulations cover 24 hours, and the actual evacuation rate was about 53% of families. The social network is also a small-world network with geospatial home locations, which is similar to our network construction method. In all of these studies, the more neighbors of a family vi evacuate, the more likely it is that vi will evacuate. Our threshold model differs: in our model, if too many neighbors evacuate, then vi will not evacuate because of concerns over crime and looting.

6 Summary and Conclusions

We study evacuation decision-making as a graph dynamical system using 2mode-threshold functions for nodes. This work is motivated by the results of a survey collected during Hurricane Sandy, which show that concerns about crime motivate families to stay in their homes. We study the dynamics of 2mode-threshold systems in different networks, and show significant differences from the standard threshold model. Results obtained from this work can help determine the size and characteristics of the non-evacuee population, which city planners can use for contingency planning.


Acknowledgment. We thank the anonymous reviewers for their insights. This work has been partially supported by the following grants: NSF CRISP 2.0 Grant 1832587, DTRA CNIMS (Contract HDTRA1-11-D-0016-0001), NSF DIBBS Grant ACI-1443054, NSF EAGER Grant CMMI-1745207, and NSF BIG DATA Grant IIS1633028.

References

1. Aral, S., Nicolaides, C.: Exercise contagion in a global social network. Nat. Commun. 8, 14753 (2017)
2. Baker, E.J.: Evacuation behavior in hurricanes. Int. J. Mass Emergencies Disasters 9(2), 287–310 (1991)
3. Baker, E.J.: Public responses to hurricane probability forecasts. Prof. Geogr. 47(2), 137–147 (1995)
4. Barrett, C.L., Beckman, R.J., et al.: Generation and analysis of large synthetic social contact networks. In: Winter Simulation Conference, pp. 1003–1014 (2009)
5. Beckman, R., Kuhlman, C., et al.: Modeling the spread of smoking in adolescent social networks. In: Proceedings of the Fall Research Conference of the Association for Public Policy Analysis and Management. Citeseer (2011)
6. Centola, D., Macy, M.: Complex contagions and the weakness of long ties. Am. J. Sociol. 113(3), 702–734 (2007)
7. Chen, J., Lewis, B., et al.: Individual and collective behavior in public health epidemiology. In: Handbook of Statistics, vol. 36, pp. 329–365. Elsevier (2017)
8. Dash, N., Gladwin, H.: Evacuation decision making and behavioral responses: individual and household. Nat. Hazards Rev. 8(3), 69–77 (2007)
9. Dubhashi, D.P., Panconesi, A.: Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, Cambridge (2009)
10. Ferris, T., et al.: Studying the usage of social media and mobile technology during extreme events and their implications for evacuation decisions: a case study of Hurricane Sandy. Int. J. Mass Emergencies Disasters 34(2), 204–230 (2016)
11. Fu, H., Wilmot, C.G.: Sequential logit dynamic travel demand model for hurricane evacuation. Transp. Res. Part B 45, 19–26 (2004)
12. Granovetter, M.: The strength of weak ties. Am. J. Sociol. 78(6), 1360–1380 (1973)
13. Halim, N., Mozumder, P.: Factors influencing evacuation behavior during Hurricane Sandy. Risk Anal. (To be submitted)
14. Hasan, S., Ukkusuri, S.V.: A threshold model of social contagion process for evacuation decision making. Transp. Res. Part B 45, 1590–1605 (2011)
15. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through a social network. In: Proceedings of ACM KDD, pp. 137–146 (2003)
16. Kleinberg, J.: The small-world phenomenon: an algorithmic perspective. Technical report 99-1776 (1999)
17. Kumar, H.: Cyclone Fani hits India: storm lashes coast with hurricane strength. New York Times, May 2019
18. Lindell, M.K., Perry, R.W.: Warning mechanisms in emergency response systems. Int. J. Mass Emergencies Disasters 5(2), 137–153 (2005)
19. Madireddy, M., Tirupatikumara, S., et al.: Leveraging social networks for efficient hurricane evacuation. Transp. Res. Ser. B: Methodol. 77, 199–212 (2015)
20. Meng, S., Mozumder, P.: Hurricane Sandy: damages, disruptions and pathways to recovery. Risk Anal. (Under review)
21. Mortveit, H., Reidys, C.: An Introduction to Sequential Dynamical Systems. Springer, Berlin (2007)
22. Riad, J.K., Norris, F.H., Ruback, R.B.: Predicting evacuation in two major disasters: risk perception, social influence, and access to resources. J. Appl. Soc. Psychol. 20(5), 918–934 (1999)
23. Sengupta, S.: Extreme weather displaced a record 7 million in first half of 2019. New York Times, September 2019
24. Watts, D.: A simple model of global cascades on random networks. PNAS 99, 5766–5771 (2002)
25. Widener, M.J., Horner, M.W., et al.: Simulating the effects of social networks on a population’s hurricane evacuation participation. J. Geogr. Syst. 15, 193–209 (2013)
26. Yang, Y., Mao, L., Metcalf, S.S.: Diffusion of hurricane evacuation behavior through a home-workplace social network: a spatially explicit agent-based simulation model. Comput. Environ. Urban Syst. 74, 13–22 (2019)

Spectral Evolution of Twitter Mention Networks

Miguel Romero, Camilo Rocha, and Jorge Finke

Pontificia Universidad Javeriana Cali, Cali, Colombia
{miguel.romero,camilo.rocha,jfinke}@javerianacali.edu.co

Abstract. This paper applies the spectral evolution model presented in [5] to networks of mentions between Twitter users who identified messages with the most popular political hashtags in Colombia (during the period which concludes the disarmament of the Revolutionary Armed Forces of Colombia). The model characterizes the dynamics of each mention network (i.e., how new edges are established) in terms of the eigen decomposition of its adjacency matrix. It assumes that as new edges are established the eigenvalues change, while the eigenvectors remain constant. The goal of our work is to evaluate various link prediction methods that underlie the spectral evolution model. In particular, we consider prediction methods based on graph kernels and a learning algorithm that tries to estimate the trajectories of the spectrum. Our results show that the learning algorithm tends to outperform the kernel methods at predicting the formation of new edges.

Keywords: Spectral evolution model · Twitter mention networks · Eigen decomposition · Graph kernels

1 Introduction

Social networks have become increasingly relevant for understanding the political issues of a country. On such platforms, users share perceptions and opinions on government and public affairs, creating political conversations that often unveil specific patterns of interaction (e.g., the degree of polarization on a current issue). While some studies focus on identifying which profiles play a key role in shaping user-user interactions [9,10], other studies focus on how the user terms and conditions of social networks influence broad political decisions [3,6]. Not surprisingly, analyzing the patterns that arise from online conversations on social networks has received wide attention [7,8]. Understanding the broad dynamics of user interactions is an important step to evaluate both the formation and political ramifications of stationary patterns. More specifically, characterizing the evolution of user interactions requires the development of models that predict how new edges are established. For example, predicting the formation of new edges is useful to identify whether an influential user retains her status over time or whether a political polarization reflects a dynamic process or a stationary state [2].

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 532–542, 2020. https://doi.org/10.1007/978-3-030-36687-2_44


This paper uses the spectral evolution model presented in [5] to capture the dynamics of user interactions and evaluate which link prediction method best estimates the formation of new edges over time. The spectral evolution model considers that the growth of a network can be captured by its eigen decomposition, under the assumption that its eigenvectors remain constant. If this condition is satisfied, the estimation of the formation of new edges can be expressed as a transformation of the spectrum, either through the application of real functions (using graph kernels) or through extrapolation methods (using learning algorithms that estimate the spectrum trajectories) [4].

The main contribution of this paper is to apply the spectral evolution model to networks of mentions between Twitter users who identified messages with the most popular political hashtags H. Vertices represent users, and there exists an edge between two users if one user mentions the other using a hashtag h ∈ H. We select the most popular hashtags related to political affairs in Colombia between August 2017 and August 2018, the period which concludes the disarmament of the Revolutionary Armed Forces of Colombia (Farc) and marks the end of the armed conflict. Different prediction methods are compared to identify which one best describes the evolution of each mention network.

The remainder of the paper is organized as follows. Section 2 describes the networks used for our analysis. Section 3 presents the spectral evolution model and verifies that the model can be applied to the mention networks. Section 3 also overviews the different link prediction methods that underlie the model. Section 4 presents the results of applying the spectral evolution model with various link prediction methods. Section 5 draws some conclusions and future research directions.
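In symbols, the assumption and the two families of prediction methods can be summarized as below. This is a sketch based on the description above: the time index $t$, the function $f$, and the exponential kernel with parameter $\alpha$ are illustrative choices and are not details confirmed in this introduction.

```latex
\[
A(t) = U \Lambda(t) U^{T}, \qquad U \text{ (approximately) constant in } t,
\]
\[
\hat{A} = f(A) = U f(\Lambda) U^{T}, \qquad
\text{e.g. } f(\lambda) = e^{\alpha \lambda}
\ \Rightarrow\ \hat{A} = e^{\alpha A} = \sum_{k \ge 0} \frac{\alpha^{k}}{k!} A^{k}.
\]
```

Under this reading, a kernel method fixes the form of $f$ and fits its parameters, whereas an extrapolation method estimates each eigenvalue $\lambda_i(t)$ at a future time from its past trajectory; in both cases candidate edges would be scored by the entries of $\hat{A}$.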

2 Data Description

The dataset consists of 31 mention networks between Twitter users who defined their profile location as Colombia. These networks capture conversations around a set of hashtags H related to popular political topics between August 2017 and August 2018. Users are represented by the set of vertices V . The set of edges is denoted by E; there exists an edge {i, j} ∈ V × V between users i and j, if user i identifies a message with a political hashtag in H (e.g., #safeelections) and mentions user j (via @username). The mention network G = (V, E) is represented as a weighted multi-graph without self-loops, which means that it is possible to have multiple edges between two users. Our analysis is based on the largest connected component of G, denoted by Gc = (Vc , Ec ). A network is built for each hashtag h ∈ H. Table 1 shows a description of the hashtags and the resulting networks, including the number of vertices and edges (|V | and |E|) for the whole network G, the number of vertices and edges (|Vc | and |Ec |) for its largest component Gc , the community modularity (Q) of Gc , and the number of communities (m) of Gc .
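The construction just described can be sketched as follows. The input format, a list of (author, mentioned user, hashtag) records, and both function names are assumptions made for illustration; the paper does not describe its data pipeline at this level. Edge multiplicities are stored as weights on ordered pairs, and the largest connected component is found by ignoring direction.

```python
from collections import Counter, defaultdict


def build_mention_network(tweets, hashtag):
    """Return edge multiplicities {(i, j): count} for one hashtag.

    tweets: iterable of (author, mentioned_user, hashtag) records, an assumed
    input format. Self-mentions are dropped (no self-loops).
    """
    weights = Counter()
    for author, mentioned, tag in tweets:
        if tag == hashtag and author != mentioned:
            weights[(author, mentioned)] += 1
    return weights


def largest_component(weights):
    """Vertices of the largest connected component, ignoring edge direction."""
    adj = defaultdict(set)
    for i, j in weights:
        adj[i].add(j)
        adj[j].add(i)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:  # depth-first search from an unvisited vertex
            u = stack.pop()
            for w in adj[u] - comp:
                comp.add(w)
                stack.append(w)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


# Toy usage with made-up records.
sample = [("ana", "luis", "eleccionesseguras"),
          ("ana", "luis", "eleccionesseguras"),
          ("luis", "sara", "eleccionesseguras"),
          ("juan", "eva", "eleccionesseguras")]
net = build_mention_network(sample, "eleccionesseguras")
print(net[("ana", "luis")], sorted(largest_component(net)))
```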


Table 1. Mention networks with political hashtags. English translations for some popular political hashtags appear in parentheses. Columns |V| and |E| refer to the whole network G; columns |Vc|, |Ec|, Q, and m refer to its largest component Gc.

No. | Hashtag in H | |V| | |E| | |Vc| | |Ec| | Q | m
0 | abortolegalya (legal abortion now) | 2235 | 2202 | 1282 | 1538 | 0.89 | 30
1 | alianzasporlaseguridad (security alliance) | 176 | 1074 | 150 | 351 | 0.34 | 7
2 | asiconstruimospaz (how we build peace) | 2514 | 14055 | 2405 | 6950 | 0.56 | 16
3 | colombialibredefracking (ban fracking) | 1606 | 3483 | 1476 | 3127 | 0.62 | 19
4 | colombialibredeminas (ban mining) | 707 | 2685 | 655 | 1421 | 0.51 | 14
5 | dialogosmetropolitanos (city dialogues) | 959 | 18340 | 932 | 4134 | 0.34 | 10
6 | edutransforma (education transforms) | 166 | 1296 | 161 | 404 | 0.40 | 9
7 | eleccionesseguras (safe elections) | 3035 | 17922 | 2634 | 7969 | 0.51 | 20
8 | elquedigauribe (whoever Uribe says) | 2375 | 6933 | 2052 | 5272 | 0.65 | 20
9 | frutosdelapaz (fruits of peace) | 1671 | 6960 | 1479 | 3468 | 0.58 | 18
10 | garantiasparatodos (assurances for all) | 388 | 814 | 340 | 563 | 0.55 | 10
11 | generosinideologia (no gender ideology) | 639 | 914 | 615 | 805 | 0.63 | 12
12 | hidroituangoescololombia | 1028 | 3362 | 883 | 2252 | 0.68 | 15
13 | horajudicialur (judicial hour) | 2250 | 23647 | 2187 | 6756 | 0.42 | 14
14 | lafauriecontralor (comptroller Lafaurie) | 2154 | 7082 | 1999 | 5309 | 0.59 | 14
15 | lanochesantrich | 1518 | 6946 | 1444 | 3567 | 0.45 | 13
16 | lapazavanza (peace advances) | 2949 | 8288 | 2775 | 6569 | 0.70 | 18
17 | libertadreligiosa (religious liberty) | 1584 | 13443 | 1395 | 6856 | 0.38 | 15
18 | manifestacionpacifica | 211 | 274 | 112 | 151 | 0.69 | 9
19 | plandemocracia2018 (democracy plan) | 3090 | 20955 | 2962 | 7996 | 0.58 | 22
20 | plenariacm (plenary) | 1504 | 19866 | 1460 | 4782 | 0.41 | 15
21 | proyectoituango | 1214 | 3086 | 1186 | 1891 | 0.53 | 44
22 | reformapolitica (political reform) | 2714 | 8385 | 2608 | 5928 | 0.66 | 18
23 | rendiciondecuentas (accountability) | 5103 | 25479 | 4401 | 10308 | 0.84 | 33
24 | rendiciondecuentas2017 | 1711 | 12441 | 998 | 2933 | 0.51 | 16
25 | resocializaciondigna | 503 | 4054 | 496 | 1171 | 0.46 | 8
26 | salariominimo (minimum wage) | 2494 | 7041 | 2079 | 5016 | 0.71 | 22
27 | semanaporlapaz (week of peace) | 1988 | 8103 | 1732 | 4860 | 0.69 | 25
28 | serlidersocialnoesdelito | 530 | 861 | 439 | 697 | 0.67 | 15
29 | vocesdelareconciliacion (reconciliation) | 161 | 1500 | 158 | 405 | 0.34 | 7
30 | votacionesseguras (safe voting) | 2748 | 13307 | 2439 | 5338 | 0.66 | 24

The modularity and number of communities shown in Table 1 are computed with the multilevel community detection algorithm [1]. Note that Q > 0.3 for all networks in the dataset, i.e., community structure can be observed for all mention networks.

3 Spectral Evolution Model

Let $A$ denote the adjacency matrix of $G_c$. Furthermore, let $A = U \Lambda U^T$ denote the eigen decomposition of $A$, where $\Lambda$ represents the spectrum of $G_c$. The spectral evolution model characterizes the dynamics of $G_c$ (i.e., how new edges are created over time) in terms of the evolution of the spectrum of the network, assuming that its eigenvectors in $U$ remain unchanged [4,5]. In other words, the model assumes that the dynamics of the network involve only small changes in the eigenvectors.

3.1 Spectral Evolution Model Verification

To apply the spectral evolution model, we need to verify its assumptions on the evolution of the spectrum and of the eigenvectors. Every network $G_c$ has a timestamp associated with each edge, representing the time at which the edge was created.

Spectral Evolution. For a given network, the set of edges is split into 40 bins based on their timestamps. Figure 1 illustrates the top 8% of the largest eigenvalues (by absolute value) for two mention networks, namely, #educationtransforms and #howwebuildpeace. In both cases, the eigenvalues grow irregularly, that is, some eigenvalues grow at a higher rate than others. Most of the networks in the dataset show this irregular behavior in spectrum evolution.

Fig. 1. Spectral evolution for mention networks (a) #EduTransforma (#educationtransforms, left) and (b) #AsiConstruimosPaz (#howwebuildpeace, right).
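The binning step behind Fig. 1 can be sketched as follows. This is only a minimal illustration, assuming bins that hold (nearly) equal numbers of time-ordered edges; the text does not say whether the 40 bins are equal in edge count or in time span. The function name `cumulative_snapshots` and the toy edge list are hypothetical. Each snapshot accumulates all edges created up to the end of its bin, and the spectrum of each snapshot would then be obtained with any symmetric eigensolver.

```python
from collections import Counter

def cumulative_snapshots(timed_edges, n_bins=40):
    """Split timestamped edges (i, j, t) into n_bins time-ordered bins of
    (nearly) equal size and return the cumulative edge weights after each bin.
    Equal-count binning is an assumption, not the authors' stated choice."""
    ordered = sorted(timed_edges, key=lambda e: e[2])  # oldest edge first
    snapshots, weights = [], Counter()
    for b in range(n_bins):
        lo = b * len(ordered) // n_bins
        hi = (b + 1) * len(ordered) // n_bins
        for i, j, _ in ordered[lo:hi]:
            weights[frozenset((i, j))] += 1  # undirected; repeated mentions add weight
        snapshots.append(dict(weights))
    return snapshots

# Toy mention list: (user, mentioned user, timestamp).
edges = [("a", "b", 3), ("b", "c", 1), ("a", "b", 2), ("c", "d", 4)]
snaps = cumulative_snapshots(edges, n_bins=2)
```

With two bins, the first snapshot holds the two oldest edges and the second holds all four, so the weight of the pair {a, b} grows from 1 to 2.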

Eigenvector Evolution. At time t, consider the adjacency matrix A(t), with 1 ≤ t ≤ T. The eigenvectors corresponding to the top 8% of the largest eigenvalues (by absolute value) at time t are compared to the eigenvectors at time T = 40. In particular, cosine similarity is used to compare the eigenvectors $U_{(T)i}$ and $U_{(t)i}$, for each latent dimension $i$. Figure 2 shows that some eigenvectors have a similarity close to one during the entire evolution of the network. These are the eigenvectors associated with the largest eigenvalues. Note also that at some time instants the similarity for some eigenvectors drops to zero, which can be explained by eigenvectors swapping positions in the eigen decomposition. To identify such changes, we verify the stability of the largest eigenvectors.


Fig. 2. Eigenvector evolution for mention networks #educationtransforms (left) and #howwebuildpeace (right).

Eigenvector Stability. For a given network $G_c$, let $t_a$ and $t_b$ be the times when 75% and 100% of all edges have been created. The eigen decompositions of the corresponding adjacency matrices are given by $A_a = U_a \Lambda_a U_a^T$ and $A_b = U_b \Lambda_b U_b^T$. Similarity values are computed for every pair of eigenvectors $(i, j)$ using

\[ \mathrm{sim}_{ij}(t_a, t_b) = \left| U_{(a)i}^T \cdot U_{(b)j} \right|. \]

The resulting values are plotted as a heatmap, where white cells represent a value of zero and black cells a value of one. The more the heatmap approximates a diagonal matrix, the fewer eigenvector permutations there are, i.e., the eigenvectors are preserved over time. Figure 3a shows sub-squares with intermediate values (between zero and one) for the #democracyplan2018 network. These sub-squares result from an exchange in the position of eigenvectors whose eigenvalues are close in magnitude.

Fig. 3. (a) Eigenvector stability and (b) spectral diagonality test for the #democracyplan2018 network.
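The eigenvector stability comparison reduces to a matrix of absolute inner products between unit eigenvectors. The sketch below is a toy illustration only: the helper name `similarity_matrix` is hypothetical, and the eigenvectors are hand-written rather than computed from a mention network. It shows how a swap of two eigenvectors appears as off-diagonal ones in the similarity matrix.

```python
import math

def similarity_matrix(U_a, U_b):
    """sim[i][j] = |u_i . v_j| for unit eigenvectors, one list per eigenvector."""
    return [[abs(sum(x * y for x, y in zip(u, v))) for v in U_b] for u in U_a]

r = 1 / math.sqrt(2)
U_a = [[r, r], [r, -r]]      # eigenvectors at time t_a
U_b = [[r, -r], [-r, -r]]    # same directions at t_b, swapped and sign-flipped
sim = similarity_matrix(U_a, U_b)
# sim is (numerically) [[0, 1], [1, 0]]: the heatmap would be anti-diagonal.
```

Taking the absolute value makes the measure insensitive to the arbitrary sign of each eigenvector, which is why the sign flip in `U_b` does not lower the similarity.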

Spectral Diagonality Test. As in the eigenvector stability test, consider the eigen decomposition of the adjacency matrix of $G_c$ at time $t_a$, $A_a = U_a \Lambda_a U_a^T$. At time $t_b > t_a$ the adjacency matrix is expected to become $A_b = U_a (\Lambda_a + \Delta) U_a^T$, where $\Delta$ is a diagonal matrix if the growth of the network is spectral. Using least squares, the matrix $\Delta$ can be derived as $\Delta = U_a^T (A_b - A_a) U_a$. If $\Delta$ is diagonal, then the growth between $t_a$ and $t_b$ is spectral. We find that the matrix $\Delta$ is almost diagonal for all mention networks. Figure 3b, for example, shows the diagonality test for the #democracyplan2018 network.

3.2 Growth Models

The previous sections have verified that the assumptions underlying the spectral evolution model seem to hold to some extent. Broadly speaking, eigenvalues grow while eigenvectors remain fairly constant over time. Next, we consider network growth as a spectral transformation, i.e., in terms of the eigen decomposition of the adjacency matrix. Let $K(A)$ be a kernel of an adjacency matrix $A$, whose eigen decomposition is $A = U \Lambda U^T$. Graph kernels assume that there exists a real function $f(\lambda)$ that describes the growth of the spectrum. In particular, $K(A)$ can be written as $K(A) = U F(\Lambda) U^T$, for some function $F(\Lambda)$ that applies a real function $f(\lambda)$ to the eigenvalues of $A$. We use the triangle closing kernel, the exponential kernel, and the Neumann kernel.

Triangle Closing Kernel. The triangle closing kernel is expressed as $A^2 = U \Lambda U^T U \Lambda U^T = U \Lambda^2 U^T$, since $U^T U = I$. This spectral transformation replaces the eigenvalues of $A$ by their squared values. The real function associated with the triangle closing kernel is $f(\lambda) = \lambda^2$.

Exponential Kernel. The exponential of the adjacency matrix $A$ is called the exponential kernel. This kernel denotes the sum of every path between two vertices weighted by the inverse factorial of its length. It is expressed as

\[ \exp(\alpha A) = \sum_{k=0}^{\infty} \alpha^k \frac{1}{k!} A^k, \]

where $\alpha$ is a constant used to balance the weight of short and long paths. The real function associated with the exponential kernel is $f(\lambda) = e^{\alpha\lambda}$.
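Both kernels above act only on the eigenvalues, so they can be sketched with a single helper that evaluates U f(Λ) U^T. The example is a toy illustration, not the authors' implementation: the helper name `spectral_kernel`, the two-vertex graph, and the value of `alpha` are assumptions made for the sketch.

```python
import math

def spectral_kernel(U, lam, f):
    """Return K = U f(Lambda) U^T, with U given as one list per eigenvector."""
    n = len(U[0])
    return [[sum(f(ev) * u[r] * u[c] for u, ev in zip(U, lam)) for c in range(n)]
            for r in range(n)]

# Toy graph: a single edge between two vertices, A = [[0, 1], [1, 0]].
s = 1 / math.sqrt(2)
U = [[s, s], [s, -s]]    # eigenvectors of A
lam = [1.0, -1.0]        # matching eigenvalues
alpha = 0.5              # hypothetical kernel parameter

triangle = spectral_kernel(U, lam, lambda x: x ** 2)                   # A^2
exponential = spectral_kernel(U, lam, lambda x: math.exp(alpha * x))   # exp(alpha A)
```

For this graph A^2 is the identity, and exp(αA) has cosh(α) on the diagonal and sinh(α) off the diagonal, which gives a quick sanity check. Any other spectral function f can be passed to the same helper.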

Neumann Kernel. The Neumann kernel is expressed as

\[ (I - \alpha A)^{-1} = \sum_{k=0}^{\infty} \alpha^k A^k, \]

where $\alpha^{-1} > |\lambda_1|$ and $\lambda_1$ is the largest eigenvalue of $A$. Its real function is given by $f(\lambda) = 1/(1 - \alpha\lambda)$.

Spectral Extrapolation. As noted above, graph kernels assume that there exists a real function $f(\lambda)$ that describes the growth of the spectrum. However, when the evolution of the spectrum is irregular, as in Fig. 1a, it is not possible to find a simple function that describes network growth. The spectral extrapolation method is a generalization of the graph kernels, which extrapolates each eigenvalue under the assumption that the network follows the spectral evolution model [4]. More specifically, given a network with a timestamped set of edges, the set is split into three subsets named training, target, and test sets. Consider two time instants $t_a$ and $t_b$. Let $A_a$ represent the adjacency matrix of the network at time $t_a$ and $A_a + A_b$ the adjacency matrix at time $t_b$. The eigen decompositions of the network at the two time instants are given by $A_a = U_a \Lambda_a U_a^T$ and $A_a + A_b = U_b \Lambda_b U_b^T$. Next, let $(\lambda_b)_j$ be the $j$-th eigenvalue at time $t_b$. Its previous value at time $t_a$ is estimated as a diagonalization of $A_a$ by $U_b$ as follows:

\[ (\hat{\lambda}_a)_j = \Big( \sum_i (U_a)_i^T (U_b)_j \Big)^{-1} \sum_i (U_a)_i^T (U_b)_j \, (\lambda_a)_i, \]

where $(U_a)_i$ and $(\lambda_a)_i$ are the eigenvectors and eigenvalues of $A_a$, respectively. A linear extrapolation is now performed to predict the eigenvalues $(\hat{\lambda}_c)_j$ at a future time $t_c$,

\[ (\hat{\lambda}_c)_j = 2 (\lambda_b)_j - (\hat{\lambda}_a)_j. \]

The predicted matrix $\hat{\Lambda}_c$ is used to compute the predicted edge weights $\hat{A}_c = U_b \hat{\Lambda}_c U_b^T$.
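The extrapolation step can be sketched in a few lines. The sketch assumes that the weights in the estimate of the earlier eigenvalue are the inner products between the eigenvectors at time t_a and eigenvector j at time t_b, as the phrase "diagonalization of A_a by U_b" suggests. The function name `extrapolate_spectrum` and the toy inputs are hypothetical.

```python
def extrapolate_spectrum(U_a, lam_a, U_b, lam_b):
    """Linear extrapolation of eigenvalues: lam_c[j] = 2 * lam_b[j] - lam_a_hat[j].

    U_a and U_b hold one list per eigenvector. lam_a_hat[j] is the average of
    lam_a weighted by the inner products between the eigenvectors at t_a and
    eigenvector j at t_b (an assumed reading of the estimate in the text).
    """
    lam_c = []
    for v, lb in zip(U_b, lam_b):
        w = [sum(x * y for x, y in zip(u, v)) for u in U_a]
        # Assumes the weights do not sum to zero.
        lam_a_hat = sum(wi * la for wi, la in zip(w, lam_a)) / sum(w)
        lam_c.append(2 * lb - lam_a_hat)
    return lam_c

# Toy case: the eigenvectors do not change, only the eigenvalues grow.
U = [[1.0, 0.0], [0.0, 1.0]]
predicted = extrapolate_spectrum(U, [2.0, 1.0], U, [3.0, 1.5])
```

When the eigenvectors are unchanged, the estimate of each earlier eigenvalue is the earlier eigenvalue itself, so the prediction simply continues the observed growth: 2 to 3 becomes 4, and 1 to 1.5 becomes 2.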

4 Case Study: Twitter Conversations

This section presents the results of applying the proposed kernels (namely, triangle closing, exponential, and Neumann kernels) and the extrapolation method to predict the creation of new edges across the mention networks described in Sect. 2. Curve-fitting methods are applied to find the parameters α of the exponential and Neumann kernels.

Fig. 4. Performance of the prediction of the methods (ext, tri, exp, neu) for each hashtag, evaluated based on two metrics, RMSE and R2.

To evaluate the performance of the methods, we compute the root mean square error (RMSE) and R2. Figure 4 summarizes the results for both metrics. Note that the performance of the models appears to be very similar for most mention networks. In Sect. 3, we verified that the growth of the eigenvalues is irregular for most networks. It is therefore to some extent expected that the extrapolation method outperforms the graph kernels. Next, we borrow the structural similarity index method (SSIM) from the field of image processing to measure the similarity between the actual and the estimated adjacency matrices. (SSIM is widely applied in image processing to compare two images based on the idea that pixels have strong inter-dependencies when they are spatially close [11].) Unlike techniques such as RMSE, which rely on the estimation of point-to-point absolute errors, SSIM compares the structure of the two matrices.
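The three metrics can be sketched on flattened matrix entries as follows. This is a simplified illustration, not the evaluation code behind the figures: `global_ssim` applies the SSIM formula of [11] once over all entries, whereas the usual image-processing variant averages it over sliding windows. The function names, the `data_range` default, and the toy values are assumptions.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM formula evaluated on one global window (a simplification)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

actual = [0.0, 1.0, 1.0, 0.0]       # flattened adjacency entries (toy values)
predicted = [0.1, 0.9, 0.8, 0.0]
scores = (rmse(actual, predicted), r_squared(actual, predicted),
          global_ssim(actual, predicted))
```

RMSE and R2 only aggregate entry-wise errors, while the SSIM expression also compares means, variances, and the covariance of the two matrices, which is the structural part of the comparison.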

Fig. 5. Performance of the prediction of the methods (ext, tri, exp, neu) for each hashtag, evaluated based on the SSIM method.

The results are shown in Table 2 and Fig. 5. Figure 5 summarizes the performance of all methods using SSIM. In general, the extrapolation method tends to outperform the other methods. Specifically, for 28 out of 31 networks (90% of the total), the extrapolation method provides a distinct, if sometimes slight, improvement. The Neumann kernel and the triangle closing kernel combined provide better estimates for only 3 networks. Whenever the spectral extrapolation method outperforms the graph kernels, the better predictions seem to be explained by the method being able to account for the irregular evolution of the eigenvalues. In general, note that the networks considered are large enough that only a small number of eigenvalues and eigenvectors can be computed.

Table 2. Spectral evolution model performance analysis with SSIM.

 #  Hashtag                   extrapol.  A^2   exp(αA)  (I − αA)^−1  Best kernel or method
 0  abortolegalya             0.97       0.98  0.97     0.97         A^2
 1  alianzasporlaseguridad    0.80       0.73  0.74     0.63         extrapol.
 2  asiconstruimospaz         0.98       0.96  0.96     0.95         extrapol.
 3  colombialibredefracking   0.97       0.96  0.96     0.96         extrapol.
 4  colombialibredeminas      0.95       0.89  0.90     0.92         extrapol.
 5  dialogosmetropolitanos    0.93       0.86  0.66     0.80         extrapol.
 6  edutransforma             0.74       0.67  0.70     0.65         extrapol.
 7  eleccionesseguras         0.98       0.96  0.96     0.95         extrapol.
 8  elquedigauribe            0.98       0.96  0.97     0.96         extrapol.
 9  frutosdelapaz             0.98       0.95  0.96     0.95         extrapol.
10  garantiasparatodos        0.93       0.90  0.90     0.88         extrapol.
11  generosinideologia        0.99       0.95  0.96     0.97         extrapol.
12  hidroituangoescolombia    0.95       0.93  0.93     0.93         extrapol.
13  horajudicialur            0.98       0.94  0.90     0.93         extrapol.
14  lafauriecontralor         0.98       0.96  0.96     0.96         extrapol.
15  lanochesantrich           0.98       0.94  0.94     0.94         extrapol.
16  lapazavanza               0.98       0.97  0.97     0.97         extrapol.
17  libertadreligiosa         0.95       0.89  0.89     0.92         extrapol.
18  manifestacionpacifica     0.83       0.81  0.81     0.88         (I − αA)^−1
19  plandemocracia2018        0.98       0.96  0.96     0.97         extrapol.
20  plenariacm                0.97       0.92  0.83     0.90         extrapol.
21  proyectoituango           0.99       0.96  0.96     0.96         extrapol.
22  reformapolitica           0.99       0.97  0.97     0.98         extrapol.
23  rendiciondecuentas        0.99       0.98  0.97     0.98         extrapol.
24  rendiciondecuentas2017    0.97       0.89  0.93     0.89         extrapol.
25  resocializaciondigna      0.95       0.87  0.87     0.81         extrapol.
26  salariominimo             0.99       0.97  0.97     0.98         extrapol.
27  semanaporlapaz            0.96       0.95  0.94     0.95         extrapol.
28  serlidersocialnoesdelito  0.91       0.91  0.90     0.92         (I − αA)^−1
29  vocesdelareconciliacion   0.84       0.74  0.72     0.62         extrapol.
30  votacionesseguras         0.98       0.96  0.96     0.96         extrapol.

5 Conclusions

This paper applies the spectral evolution model to 31 Twitter mention networks. This model characterizes the evolution of each network in terms of the eigen decomposition of its adjacency matrix. It has been verified that Twitter mention networks follow the spectral evolution model. For most networks, the eigenvectors remain approximately constant, while the spectra of the mention networks grow irregularly. Their evolution can be predicted with the help of different growth models. Our results show that the extrapolation method outperforms the kernel methods, mainly due to the irregular evolution of the spectra. Developing more refined models that use learning to predict the evolution of the spectra of graphs remains an important direction for future research.

Acknowledgements. This work was funded by the OMICAS program: Optimización Multiescala In-silico de Cultivos Agrícolas Sostenibles (Infraestructura y Validación en Arroz y Caña de Azúcar), sponsored within the Colombian Scientific Ecosystem by the World Bank, Colciencias, Icetex, the Colombian Ministry of Education, and the Colombian Ministry of Industry and Tourism, under GRANT ID: FP44842-217-2018.

References

1. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), P10008 (2008)
2. DiMaggio, P., Evans, J., Bryson, B.: Have Americans' social attitudes become more polarized? Am. J. Sociol. 102(3), 690–755 (1996)
3. Gustafsson, N.: The subtle nature of Facebook politics: Swedish social network site users and political participation. New Media Soc. 14(7), 1111–1127 (2012)
4. Kunegis, J., Fay, D., Bauckhage, C.: Network growth and the spectral evolution model. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, p. 739. ACM Press (2010)
5. Kunegis, J., Fay, D., Bauckhage, C.: Spectral evolution in dynamic networks. Knowl. Inf. Syst. 37(1), 1–36 (2013)
6. Loader, B.D., Mercea, D.: Networking democracy? Social media innovations and participatory politics. Inf. Commun. Soc. 14(6), 757–769 (2011)
7. McClurg, S.D.: Social networks and political participation: the role of social interaction in explaining political participation. Polit. Res. Q. 56(4), 449–464 (2003)
8. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
9. Noveck, B.S.: Five hacks for digital democracy. Nature 544(7650), 287–289 (2017)
10. Persily, N.: Can democracy survive the Internet? J. Democracy 28(2), 63–76 (2017)
11. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)

Network Models

Minimum Entropy Stochastic Block Models Neglect Edge Distribution Heterogeneity

Louis Duvivier1(B), Céline Robardet1, and Rémy Cazabet2

1 Univ Lyon, INSA Lyon, CNRS, LIRIS UMR5205, 69621 Lyon, France {louis.duvivier,celine.robardet}@insa-lyon.fr
2 Univ Lyon, Université Lyon 1, CNRS, LIRIS UMR5205, 69622 Lyon, France [email protected]

Abstract. The statistical inference of stochastic block models has emerged as a mathematically principled method for identifying communities inside networks. Its objective is to find the node partition and the block-to-block adjacency matrix of maximum likelihood, i.e. the one which has most probably generated the observed network. In practice, in the so-called microcanonical ensemble, it is frequently assumed that when comparing two models which have the same number and sizes of communities, the best one is the one of minimum entropy, i.e. the one which can generate the fewest different networks. In this paper, we show that there are situations in which the minimum entropy model does not identify the most significant communities in terms of edge distribution, even though it generates the observed graph with a higher probability.

Keywords: Network · Community detection · Stochastic block model · Statistical inference · Entropy

Since the seminal paper by Girvan and Newman [1], a lot of work has been devoted to finding community structure in networks [2]. The objective is to exploit the heterogeneity of connections in a graph to partition its nodes into groups and obtain a coarser description, which may be simpler to analyze. Yet, the absence of a universally accepted formal definition of what a community is has favored the development of diverse methods to partition the nodes of a graph, such as the famous modularity function [3], and the statistical inference of a stochastic block model [4]. This second method relies on the hypothesis that there exists an original partition of the nodes, and that the graph under study was generated by picking edges at random with a probability that depends only on the communities to which its extremities belong. The idea is then to infer the original node partition based on the observed edge distribution in the graph. This method has two main advantages with respect to modularity maximization: first, it is able to detect non-assortative connectivity patterns, i.e. groups of nodes that are not necessarily characterized by an internal density higher than the external density, and second, it can be performed in a statistically significant way, while it has been shown that modularity may detect communities even in random graphs [5]. In particular, a Bayesian stochastic blockmodeling approach has been developed in [6], which finds the most likely original partition for a SBM with respect to a graph by maximizing simultaneously the probability to choose this partition and the probability to generate this graph, given the partition. To perform the second maximization, this method assumes that all graphs are generated with the same probability and it thus searches for a partition of minimal entropy, in the sense that the cardinality of its microcanonical ensemble (i.e. the number of graphs the corresponding SBM can theoretically generate [7]) is minimal, which is equivalent to maximizing its likelihood [8].

In this paper, we show that even when the number and the sizes of the communities are fixed, the node partition which corresponds to the sharpest communities is not always the one with the lowest entropy. We then demonstrate that when community sizes and edge distribution are heterogeneous enough, a node partition which places small communities where there are the most edges will always have a lower entropy. Finally, we illustrate how this issue implies that such heterogeneous stochastic block models cannot be identified correctly by this model selection method and discuss the relevance of assuming an equal probability for all graphs in this context.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 545–555, 2020. https://doi.org/10.1007/978-3-030-36687-2_45

1 Entropy Based Stochastic Block Model Selection

The stochastic block model is a generative model for random graphs. It takes as parameters a set of nodes $V = [1; n]$ partitioned in $p$ blocks (or communities) $C = (c_i)_{i \in [1;p]}$ and a block-to-block adjacency matrix $M$ whose entries correspond to the number of edges between two blocks. The corresponding set of generable graphs $G = (V, E)$ with weight matrix $W$ is defined as:

\[ \Omega_{C,M} = \Big\{ G \;\Big|\; \forall c_1, c_2 \in C, \sum_{i \in c_1, j \in c_2} W_{(i,j)} = M_{(c_1,c_2)} \Big\} \]

It is called the microcanonical ensemble (a vocabulary borrowed from statistical physics [7]) and it can be refined to impose that all graphs are simple or undirected (in which case $M$ must be symmetric), and to allow self-loops or not. In this paper we will consider multigraphs with self-loops, because they allow for simpler computations. Generating a graph with the stochastic block model associated with $C, M$ amounts to drawing $G \in \Omega_{C,M}$ at random. The probability distribution $P[G|C, M]$ on this ensemble is defined as the one which maximizes Shannon's entropy

\[ S = - \sum_{G \in \Omega_{C,M}} P[G|C, M] \times \ln(P[G|C, M]) \]

In the absence of other restrictions, the maximum entropy distribution is the flat one:

\[ P[G|C, M] = \frac{1}{|\Omega_{C,M}|} \]

whose entropy equals $S = \ln(|\Omega_{C,M}|)$. It has been computed for different SBM flavours in [8]. It measures the number of different graphs a SBM can generate with a given set of parameters. The lower it is, the higher the probability to generate any specific graph $G$.

On the other hand, a given graph $G = (V, E)$, with a weight matrix $W$, may have been generated by many different stochastic block models. For any partition $C = (c_i)_{i \in [1;p]}$ of $V$, there exists one and only one matrix $M$ such that $G \in \Omega_{C,M}$, and it is defined as:

\[ \forall c_1, c_2 \in C, \quad M_{(c_1,c_2)} = \sum_{i \in c_1, j \in c_2} W_{(i,j)} \]

Therefore, when there is no ambiguity about the graph $G$, we will refer interchangeably to a partition and the associated SBM in the following. The objective of stochastic block model inference is to find the partition $C$ that best describes $G$. To do so, Bayesian inference relies on Bayes' theorem, which states that:

\[ P[C, M|G] = \frac{P[G|C, M] \times P[C, M]}{P[G]} \tag{1} \]

As $P[G]$ is the same whatever $C$, it is sufficient to maximize $P[G|C, M] \times P[C, M]$. The naive approach, which consists in using a maximum-entropy uniform prior distribution for $P[C, M]$, simplifies the computation to maximizing directly $P[G|C]$ (the so-called likelihood function), but it will always lead to the trivial partition $\forall i \in V, c_i = \{i\}$, which is of no use because the corresponding SBM reproduces $G$ exactly: $M = W$ and $P[G|C] = 1$. To overcome this overfitting problem, another prior distribution was proposed in [9], which assigns lower probabilities to the partitions with many communities. Yet, when comparing two models $C_1, M_1$ and $C_2, M_2$ with equal probability, the one which is chosen is still the one minimizing $|\Omega_{C,M}|$ or equivalently the entropy $S = \ln(|\Omega_{C,M}|)$, as the logarithm is a monotonic function.
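The unique block-to-block matrix M associated with a partition is just a blockwise sum of the weight matrix. A minimal sketch, with a hypothetical helper name `block_matrix` and a toy weight matrix:

```python
def block_matrix(weights, partition):
    """M[a][b] = sum of W[i][j] over i in block a and j in block b."""
    return [[sum(weights[i][j] for i in a for j in b) for b in partition]
            for a in partition]

# Toy multigraph on three nodes (entry = number of edges, diagonal = self-loops).
W = [[0, 2, 1],
     [2, 0, 0],
     [1, 0, 3]]
M = block_matrix(W, [[0, 1], [2]])              # two blocks: {0, 1} and {2}
M_trivial = block_matrix(W, [[0], [1], [2]])    # one node per block
```

With one node per block the helper returns W itself, which is the overfitting case described above: the trivial partition reproduces the graph exactly.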

2 The Issue with Heavily Populated Graph Regions

In this paper, we focus on the consequence of minimizing the entropy to discriminate between node partitions. To do so, we need to work on a domain of partitions on which the prior distribution is uniform. As suggested by [9], we restrict ourselves to finding the best partition when the number $p$ and the sizes $(s_i)_{i \in [1;p]}$ of communities are fixed because in this case, both $P[C]$ and $P[M|C]$ are constant. This is a problem of node classification, and in this situation the maximization of Eq. 1 boils down to minimizing the entropy of $\Omega_{C,M}$, which can be written as:

\[ S = \sum_{i,j \in [1;p]} \ln \binom{s_i s_j + M_{(i,j)} - 1}{M_{(i,j)}} \]

as shown in [8]. Yet, even within this restricted domain ($p$ and $(s_i)_i$ are fixed), the lowest entropy partition for a given graph $G$ is not always the one which corresponds to the sharpest communities. To illustrate this phenomenon, let's consider the stochastic block models whose matrices $M$ are shown on Fig. 1, and a multigraph $G \in \Omega_{SBM_1} \cap \Omega_{SBM_2}$.

– SBM1 corresponds to $C_1 = \{c_{a1}: \{0, 1, 2, 3, 4, 5\}, c_{b1}: \{6, 7, 8\}, c_{c1}: \{9, 10, 11\}\}$
– SBM2 corresponds to $C_2 = \{c_{a2}: \{0, 1, 2\}, c_{b2}: \{3, 4, 5\}, c_{c2}: \{6, 7, 8, 9, 10, 11\}\}$.

As $G \in \Omega_{SBM_1} \cap \Omega_{SBM_2}$, it could have been generated using SBM1 or SBM2. Yet, the point of inferring a stochastic block model to understand the structure of a graph is that it is supposed to identify groups of nodes (blocks) such that the edge distribution between any two of them is homogeneous and characterized by a specific density. From this point of view $C_1$ seems a better partition than $C_2$:

– The density of edges inside and between $c_{a2}$ and $c_{b2}$ is the same (10), so there is no justification for dividing $c_{a1}$ in two.
– On the other hand, $c_{b1}$ and $c_{c1}$ have an internal density of 1 and there is no edge between them, so it is logical to separate them rather than merge them into $c_{c2}$.

Yet, if we compute the entropy of SBM1 and SBM2:

\[ S_1 = \ln\binom{395}{360} + 2 \times \ln\binom{17}{9} = 136 \]

\[ S_2 = \ln\binom{53}{18} + 4 \times \ln\binom{98}{90} = 135 \]

The entropy of SBM2 is lower and thus partition $C_2$ will be the one selected. Of course, as $|\Omega_{SBM_2}| < |\Omega_{SBM_1}|$, the probability to generate $G$ with SBM2 is higher than the probability to generate it with SBM1. But this increased probability is not due to a better identification of the edge distribution heterogeneity; it is a mechanical effect of imposing smaller communities in the groups of nodes which contain the most edges, even if their distribution is homogeneous. Doing so reduces the number of possible positions for each edge and thus the number of different graphs the model can generate.
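The two entropy values can be checked numerically with the entropy formula of this section. In the sketch below, the block matrices `M1` and `M2` are inferred from the densities and binomial coefficients quoted in the text, since the matrices of Fig. 1 are not reproduced here; they and the helper name `sbm_entropy` should be read as assumptions.

```python
import math

def sbm_entropy(sizes, M):
    """S = sum over ordered block pairs (i, j) of ln C(s_i s_j + M_ij - 1, M_ij)."""
    return sum(math.log(math.comb(si * sj + M[i][j] - 1, M[i][j]))
               for i, si in enumerate(sizes) for j, sj in enumerate(sizes))

# SBM1: blocks of sizes 6, 3, 3; density 10 inside the big block, 1 inside the small ones.
M1 = [[360, 0, 0],
      [0,   9, 0],
      [0,   0, 9]]
# SBM2: blocks of sizes 3, 3, 6; the big block is split, the small ones are merged.
M2 = [[90, 90,  0],
      [90, 90,  0],
      [0,   0, 18]]

S1 = sbm_entropy((6, 3, 3), M1)
S2 = sbm_entropy((3, 3, 6), M2)
```

Rounded to the nearest integer, this gives 136 for SBM1 and 135 for SBM2, so the partition that splits the homogeneous dense block is indeed the one with the lower entropy.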

Fig. 1. Block-to-block adjacency matrices of two overlapping stochastic block models. Even though the communities of SBM1 are better defined, SBM2 can generate fewer different graphs and thus generates them with higher probability.

Fig. 2. Block-to-block adjacency matrices of two overlapping stochastic block models with lower densities. Once again, even though SBM3 has better defined communities, SBM4 is a more likely model for graphs $G \in \Omega_{SBM_3} \cap \Omega_{SBM_4}$.

This problem can also occur with smaller densities, as illustrated by the stochastic block models whose block-to-block adjacency matrices are shown on Fig. 2. SBM3, defined as one community of 128 nodes and density 0.6 and 32 communities of 4 nodes and density 0.4, has an entropy of 17851. SBM4, which merges all small communities into one big community and splits the big one into 32 small ones, has an entropy of 16403.

3 The Density Threshold

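More generally, let's consider a SBM $(C_1, M_1)$ with one big community of size $s$, containing $c \times m_0$ edges, and $q$ small communities of size $\frac{s}{q}$ containing $(m_i)_{i \in [1;q]}$ edges each, as illustrated on Fig. 3. Its entropy is equal to:

\[ S_1(c) = \ln\binom{s^2 + c \times m_0 - 1}{c \times m_0} + \sum_{i=1}^{q} \ln\binom{\frac{s^2}{q^2} + m_i - 1}{m_i} \]

On the other hand, the entropy of the SBM $(C_2, M_2)$ which splits the big community into $q$ small ones of size $\frac{s}{q}$ and merges the $q$ small communities into one big community is:

\[ S_2(c) = \ln\binom{s^2 + \sum_{i=1}^{q} m_i - 1}{\sum_{i=1}^{q} m_i} + q^2 \ln\binom{\frac{s^2}{q^2} + \frac{c \times m_0}{q^2} - 1}{\frac{c \times m_0}{q^2}} \]

Fig. 3. Theoretical pair of stochastic block models. The right-side partition splits the big community in q = 3 small ones and merges the small communities in one big.

So, with $C_1 = \sum_{i=1}^{q} \ln\binom{\frac{s^2}{q^2} + m_i - 1}{m_i}$ and $C_2 = \ln\binom{s^2 + \sum_{i=1}^{q} m_i - 1}{\sum_{i=1}^{q} m_i}$, which are constants with respect to $c$:

\begin{align*}
S_1(c) - S_2(c) &= \ln\binom{s^2 + c \times m_0 - 1}{c \times m_0} - q^2 \ln\binom{\frac{s^2}{q^2} + \frac{c \times m_0}{q^2} - 1}{\frac{c \times m_0}{q^2}} + C_1 - C_2 \\
&= \ln\left[\prod_{k=1}^{c \times m_0} \frac{k + s^2 - 1}{k}\right] - \ln\left[\left(\prod_{k=1}^{\frac{c \times m_0}{q^2}} \frac{k + \frac{s^2}{q^2} - 1}{k}\right)^{q^2}\right] + C_1 - C_2 \\
&= \ln\left[\prod_{k=1}^{\frac{c \times m_0}{q^2}} \frac{\prod_{i=0}^{q^2 - 1} \left(k + s^2 - 1 + i \times \frac{c \times m_0}{q^2}\right)}{\left(k + \frac{s^2}{q^2} - 1\right)^{q^2}}\right] + C_1 - C_2 \\
&> \ln\left[\prod_{k=1}^{\frac{c \times m_0}{q^2}} \left(\frac{k + s^2 - 1}{k + \frac{s^2}{q^2} - 1}\right)^{q^2}\right] + C_1 - C_2 \\
&> q^2 \sum_{k=1}^{\frac{c \times m_0}{q^2}} \ln\left(1 + \frac{(q^2 - 1)s^2}{q^2 k + s^2 - q^2}\right) + C_1 - C_2 \tag{2}
\end{align*}

Now, as

\[ \ln\left(1 + \frac{(q^2 - 1)s^2}{q^2 k + s^2 - q^2}\right) \underset{k \to \infty}{\sim} \frac{(q^2 - 1)s^2}{q^2 k + s^2 - q^2} \]

and

\[ \sum_{k=1}^{\frac{c \times m_0}{q^2}} \frac{(q^2 - 1)s^2}{q^2 k + s^2 - q^2} \underset{c \to \infty}{\longrightarrow} \infty \]

we have that

\[ q^2 \sum_{k=1}^{\frac{c \times m_0}{q^2}} \ln\left(1 + \frac{(q^2 - 1)s^2}{q^2 k + s^2 - q^2}\right) \underset{c \to \infty}{\longrightarrow} \infty \tag{3} \]

and thus, by injecting Eq. 3 inside Eq. 2, $\exists c, \forall c' > c, S_2(c') < S_1(c')$. This means that for any such pair of stochastic block models, there exists some density threshold for the big community in $C_1$ above which $(C_2, M_2)$ will be identified as the most likely model for all graphs $G \in \Omega_{(C_1,M_1)} \cap \Omega_{(C_2,M_2)}$.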
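The existence of the threshold can be illustrated numerically by evaluating S1(c) − S2(c) for growing c. The parameter values below (s = 12, q = 3, m0 = 9, four edges in each small community) are arbitrary toy choices made for this sketch, with m0 divisible by q² so that every sub-block receives a whole number of edges; the helper names are hypothetical.

```python
import math

def log_binom(n, k):
    """ln C(n, k) computed with log-gamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def entropy_gap(c, s=12, q=3, m0=9, m_small=(4, 4, 4)):
    """S1(c) - S2(c) for the pair of models of this section (toy parameters)."""
    cells_big, cells_small = s * s, (s // q) ** 2
    s1 = log_binom(cells_big + c * m0 - 1, c * m0) + sum(
        log_binom(cells_small + m - 1, m) for m in m_small)
    per_block = c * m0 // (q * q)   # edges in each of the q^2 sub-blocks after the split
    s2 = (log_binom(cells_big + sum(m_small) - 1, sum(m_small))
          + q * q * log_binom(cells_small + per_block - 1, per_block))
    return s1 - s2

# Smallest multiplier c for which the split-and-merge model has the lower entropy.
threshold = next(c for c in range(1, 10_000) if entropy_gap(c) > 0)
```

For small c the gap is negative, so the original partition has the lower entropy; once c passes `threshold` the gap becomes positive and the model that splits the big community is preferred.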

4 Consequences on Model Selection

In practice, this phenomenon implies that a model selection technique based on the minimization of entropy will not be able to correctly identify some SBMs when they are used as generative models for synthetic graphs. To illustrate this, we generate graphs and try to recover the original partition. The experiment is conducted on two series of stochastic block models, one with relatively large communities and another one with smaller but more sharply defined communities:

– SBM7(d) is made of 5 blocks (1 of 40 nodes, and 4 of 10 nodes). Its density matrix D is given on Fig. 4 (left) (one can deduce the block adjacency matrix by $M_{(c_i,c_j)} = |c_i||c_j| \times D_{(c_i,c_j)}$).
– SBM8(d) is made of 11 blocks (1 of 100 nodes, and 10 of 10 nodes). The internal density of the big community is d, it is 0.15 for the small ones and 0.01 between communities.

For each of those two models, and for various internal densities d of the largest community, we generate 1000 random graphs. For each of these graphs, we compute the entropy of the original partition (correct partition) and the entropy of the partition obtained by inverting the big community with the small ones (incorrect partition). Then, we compute the percentage of graphs for which the correct partition has a lower entropy than the incorrect one and plot it against the density d. Results are shown on Figs. 4 and 5.
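One trial of this experiment for SBM8(d) can be sketched as follows. Several choices here are assumptions rather than the authors' procedure: the graph is treated as a directed multigraph with self-loops, each block pair receives exactly its expected number of edges, each edge is placed uniformly at random inside its block pair, and the inverted partition cuts the big community into consecutive groups of 10 nodes. All function names are hypothetical.

```python
import math
import random

def log_binom(n, k):
    """ln C(n, k) via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def entropy(weights, blocks):
    """Sum over ordered block pairs of ln C(s_a s_b + M_ab - 1, M_ab)."""
    total = 0.0
    for a in blocks:
        for b in blocks:
            m = sum(weights[i][j] for i in a for j in b)
            total += log_binom(len(a) * len(b) + m - 1, m)
    return total

def sample_sbm8(d, rng):
    """Draw one multigraph: 1 block of 100 nodes (density d), 10 blocks of
    10 nodes (density 0.15), density 0.01 between distinct blocks."""
    big = list(range(100))
    small = [list(range(100 + 10 * k, 110 + 10 * k)) for k in range(10)]
    blocks = [big] + small
    weights = [[0] * 200 for _ in range(200)]
    for x, a in enumerate(blocks):
        for y, b in enumerate(blocks):
            density = (d if x == 0 else 0.15) if x == y else 0.01
            for _ in range(round(len(a) * len(b) * density)):
                weights[rng.choice(a)][rng.choice(b)] += 1
    return weights, blocks

def inverted(blocks):
    """Split the big block into groups as large as the small blocks, merge the small ones."""
    big, small = blocks[0], blocks[1:]
    step = len(big) // len(small)
    split = [big[k:k + step] for k in range(0, len(big), step)]
    return split + [[v for block in small for v in block]]

rng = random.Random(0)
w_low, blocks = sample_sbm8(0.05, rng)
w_high, _ = sample_sbm8(0.8, rng)
correct_wins_low = entropy(w_low, blocks) < entropy(w_low, inverted(blocks))
correct_wins_high = entropy(w_high, blocks) < entropy(w_high, inverted(blocks))
```

Repeating the draw many times for a grid of densities d and counting how often the correct partition has the lower entropy would give a curve of the kind described in the text; under the assumptions above, the correct partition wins at d = 0.05 and loses at d = 0.8.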

Fig. 4. Block-to-block adjacency matrix of SBM7 (d) (left) and percentage of graphs generated using SBM7 (d) for which the original partition has a lower entropy than the inverted one against the density d of the big community (right).

We observe that as soon as d reaches a given density threshold (about 0.08 for SBM7 (d) and 0.18 for SBM8 (d)), the percentage of correct match drops quickly to 0. As d rises over 0.25, the correct partition is never the one selected. It should be highlighted that in these experiments we only compared two partitions among the Bn possible, so the percentage of correct match is actually an upper bound on the percentage of graphs for which the correct partition is identified. This means that if SBM7 (d) or SBM8 (d) are used as generative models for random graphs, with d > 0.25, and one wants to use bayesian inference for determining the original partition, it will almost never return the correct one. What is more,

Minimum Entropy Stochastic Block Models

553

Fig. 5. Percentage of graphs generated using SBM8 (d) for which the original partition has a lower entropy than the inverted one against the density d of the big community.

the results of Sect. 3 show that this will occur for any SBM of the form described in Fig. 3, as soon as the big community contains enough edges.
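The upper-bound remark made above for the two-partition comparison can be written out explicitly. This is only a sketch, and the symbols are notation introduced here rather than the authors': H(C) denotes the entropy associated with a partition C of a generated graph, C* the original partition, C' the inverted one, and Ĉ the partition of minimum entropy among all candidates, assumed unique.

```latex
\hat{C} = C^{*} \;\Longrightarrow\; H(C^{*}) < H(C')
\qquad\text{hence}\qquad
\Pr\big[\hat{C} = C^{*}\big] \;\le\; \Pr\big[H(C^{*}) < H(C')\big].
```

The right-hand side corresponds to the quantity plotted against d in Figs. 4 and 5, so a value close to 0 there implies that entropy minimization over all partitions recovers the original partition for almost none of the generated graphs.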

5 Discussion

We have seen in Sect. 1 that model selection techniques that rely on the maximization of the likelihood function to find the best node partition given an observed graph boil down to the minimization of the entropy of the corresponding ensemble of generable graphs in the microcanonical framework. Even in the case of Bayesian inference, when a non-uniform prior distribution is defined on the set of possible partitions, entropy remains the criterion of choice between equiprobable partitions. Yet, as shown in Sects. 2 and 3, entropy behaves counterintuitively when a large part of the edges are concentrated inside one big community. In this situation, a partition that splits this community into small ones will have a lower entropy, even though the edge density is homogeneous. Furthermore, this happens even when the number and sizes of communities are known. Practically, as explained in Sect. 4, this phenomenon implies that stochastic block models of this form cannot be recovered using model selection techniques based on the mere minimization of the cardinality of the associated microcanonical ensemble. Let us stress that, contrary to the resolution limit described in [10] or [11], the problem is not about being able or not to detect small communities with


no prior knowledge about the graph; it occurs even though the number and sizes of communities are known. It is also different from the phase transition issue that has been investigated in [12–15] for community detection or recovery, because it happens even when communities are dense and perfectly separated. Entropy minimization fails to classify the nodes correctly between communities because it only aims at identifying the SBM that can generate the lowest number of different graphs. A model which enforces more constraints on edge positions will necessarily perform better from this point of view, but this is a form of overfitting, in the sense that the additional constraints on edge placement are not justified by a heterogeneity in the observed edge distribution.

The results presented in this paper were obtained for a particular class of stochastic block models. First of all, they were obtained for the multigraph flavour of stochastic block models. As the node classification issue also occurs for densities below 1, they can probably be extended to simple graphs, but this would need to be checked, as would the case of degree-corrected stochastic block models. Furthermore, the reason why the log-likelihood of a stochastic block model (C, M) for a graph G is equal to the entropy of Ω_{C,M} is that we consider the microcanonical ensemble, in which all graphs have an equal probability of being generated. It would be interesting to check whether similar results can be obtained when computing P[G|C, M] in the canonical ensemble [8]. Finally, we assumed that for a graph G and two partitions C1 and C2 with the same number and sizes of blocks, the associated block-to-block adjacency matrices M1 and M2 have the same probability of being generated, and this assumption too could be questioned.

Yet, within this specific class of SBM, our results illustrate a fundamental issue with the stochastic block model statistical inference process.
Since the random variable whose distribution we are trying to infer is the whole graph itself, we are performing statistical inference on a single observation. This is why frequentist inference is impossible, but Bayesian inference also has strong limitations in this context. In particular, the only tool to counterbalance the observation and avoid overfitting is to specify the kind of communities we are looking for through the prior distribution. If it is agnostic about the distribution of edge densities among these communities, the mere minimization of the entropy of the posterior distribution fails to identify the heterogeneity in the edge distribution. Besides further refining the prior distribution, another approach could be to consider a graph as the aggregated result of a series of edge placements. If the considered random variable is the position of an edge, a single graph observation contains information about many of its realizations, which reduces the risk of overfitting.

Acknowledgments. This work was supported by the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR-16-IDEX0005, and of the project ANR-18-CE23-0004 (BITUNAM) of the French National Research Agency (ANR).


References
1. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Nat. Acad. Sci. 99(12), 7821–7826 (2002)
2. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. 659, 1–44 (2016)
3. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
4. Hastings, M.B.: Community detection as an inference problem. Phys. Rev. E 74(3), 035102 (2006)
5. Guimera, R., Sales-Pardo, M., Amaral, L.A.N.: Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70(2), 025101 (2004)
6. Peixoto, T.P.: Nonparametric Bayesian inference of the microcanonical stochastic block model. Phys. Rev. E 95(1), 012317 (2017)
7. Cimini, G., Squartini, T., Saracco, F., Garlaschelli, D., Gabrielli, A., Caldarelli, G.: The statistical physics of real-world networks. Nat. Rev. Phys. 1(1), 58 (2019)
8. Peixoto, T.P.: Entropy of stochastic blockmodel ensembles. Phys. Rev. E 85(5), 056122 (2012)
9. Peixoto, T.P.: Bayesian stochastic blockmodeling. arXiv preprint. http://arxiv.org/abs/1705.10225 (2017)
10. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Nat. Acad. Sci. 104(1), 36–41 (2007)
11. Peixoto, T.P.: Parsimonious module inference in large networks. Phys. Rev. Lett. 110(14), 148701 (2013)
12. Decelle, A., Krzakala, F., Moore, C., Zdeborová, L.: Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107(6), 065701 (2011)
13. Decelle, A., Krzakala, F., Moore, C., Zdeborová, L.: Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84(6), 066106 (2011)
14. Dandan, H., Ronhovde, P., Nussinov, Z.: Phase transitions in random Potts systems and the community detection problem: spin-glass type and dynamic perspectives. Philos. Mag. 92(4), 406–445 (2012)
15. Abbe, E., Sandon, C.: Community detection in general stochastic block models: fundamental limits and efficient algorithms for recovery. In: 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 670–688. IEEE (2015)

Three-Parameter Kinetics of Self-organized Criticality on Twitter

Victor Dmitriev1, Andrey Dmitriev1(✉), Svetlana Maltseva1, and Stepan Balybin2

1 National Research University Higher School of Economics, 101000 Moscow, Russia
[email protected]
2 Department of Physics, M.V. Lomonosov Moscow State University, 119991 Moscow, Russia

Abstract. A kinetic model is proposed to describe self-organized criticality on Twitter. The model is based on a fractional three-parameter self-organization scheme with stochastic sources. It is shown that the adiabatic regime of self-organization to the critical state is determined by the coordinated action of a relatively small number of network users. The model describes the subcritical, self-organized critical, and supercritical states of Twitter.

Keywords: Self-organized criticality · Social networks · Langevin equation

© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 556–565, 2020. https://doi.org/10.1007/978-3-030-36687-2_46

1 Introduction

Critical phenomena in complex networks have been considered in many papers (e.g., see the review [1] and references therein). In network science, critical phenomena are commonly understood as significant changes in the integral parameters of the network structure under the influence of external factors [1]. In the thermodynamic theory of irreversible processes, significant structural rearrangements occur when an external parameter reaches a certain critical value, and they have the character of a kinetic phase transition [2]. The critical point is reached as a result of fine tuning of the external parameters of the system. In a certain sense, such critical phenomena are not robust.

At the end of the 1980s, Bak, Tang and Wiesenfeld [3, 4] found that there are complex systems with a large number of degrees of freedom that enter a critical regime as a result of their internal evolutionary trends. The critical state of such systems does not require fine tuning of external control parameters and may occur spontaneously. Thus, the theory of self-organized criticality (SOC) was proposed. Since its emergence, the SOC model has been applied to describe critical phenomena in systems regardless of their nature (e.g., see the review [5] and references therein). The description of critical phenomena in social networks is no exception (e.g., see the works [6–9]).

The motivation of our investigation is the following. There are a number of studies (e.g., see the works [7, 9–17]) in which it is established that the observed flows of microposts generated by microblogging social networks (e.g., Twitter) are characterized


by avalanche-like behavior. The time series of microposts ($\eta_t$) describing such streams have a power-law probability distribution:

$$p(\eta) \propto \eta^{-\alpha} \qquad (1)$$

where $\alpha \in (2, 3)$. Despite this, there are no studies on the construction and analysis of macroscopic kinetic models that explain the emergence and spread of avalanches of microposts on Twitter.
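To make the power law of Eq. (1) concrete, the following minimal sketch draws synthetic values from a continuous power law by inverse-transform sampling and recovers the exponent with the standard maximum-likelihood estimator. It is an illustration only: the lower cutoff `x_min`, the sample size, and the exponent 2.5 are assumptions chosen for the example, and real micropost counts are discrete, so the continuous form is an approximation.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Draw n samples from p(x) proportional to x**(-alpha) for x >= x_min."""
    # Inverse transform: P(X > x) = (x / x_min) ** (-(alpha - 1)).
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]


def estimate_alpha(samples, x_min):
    """Maximum-likelihood estimate of the exponent of a continuous power law."""
    tail = [x for x in samples if x >= x_min]
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


rng = random.Random(0)  # fixed seed so the example is reproducible
samples = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = estimate_alpha(samples, x_min=1.0)
```

For a sample of this size the estimate should lie close to the exponent used for sampling, which falls in the interval (2, 3) quoted above.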

2 One of the Possible Mechanisms of Twitter Self-organizing Transition in a Critical State

Let $N$ be the total number of Twitter users, and let $S \ll N$ be the number of users who follow a certain strategy. Let us call them strategically oriented users (SOUs). The remaining $N - S$ users do not follow a single coherent strategy and, in this sense, are randomly oriented users (ROUs). Suppose that at each moment in time one SOU joins Twitter, i.e., the social network is an open system. These users act in concert, trying to form microposts in the network that are relevant to a certain topic. Gradually, a subnetwork of SOUs is formed in the social network. ROUs who are subscribers of SOUs are also embedded in the emerging hierarchical network structure. As a result, local and predictable micropost flows are formed on Twitter, corresponding to the topic defined by the SOUs. Such behavior of the social network is simple, since the individual local flows of microposts are not interconnected. The hierarchical system formed in the social network by SOUs and ROUs is still not able to produce an avalanche of microposts.

Over time, the number of SOUs reaches a critical value $S_c$. In this state, the network can no longer be pumped by these users. In order to maintain a steady state, all network users, including ROUs, must follow a certain coordinated strategy in the distribution of microposts. Therefore, in a stationary system of users, global avalanches of microposts arise and spread in the network. This is the SOC state of the social network, formed by the action of a number of strategically oriented users that is small compared with the total number of users. Instead of local flows of microposts, a global avalanche of microposts occurs, which is characteristic of the critical state of the network. The behavior of global avalanches spreading in the self-organized critical network cannot be predicted from the behavior of individual users.
In this case, the social network has the property of emergence.

Let $S_c$ be the number of SOUs in the stationary (critical) state of the social network. In relation to the critical state, three qualitatively different states of Twitter can be distinguished:
• $S < S_c$ is the subcritical (SubC) network state;
• $S = S_c$ is the SOC network state;
• $S > S_c$ is the supercritical (SupC) network state.
The SubC state is characterized by a small number of avalanches of microposts, which can be almost neglected. In the SOC state, the avalanche size of microposts grows.


The appearance of such avalanches of microposts follows the power-law probability distribution (see Eq. (1)). In the SupC state, the number of SOUs and, accordingly, the avalanche sizes of microposts continue to grow. This growth is unstable. In response to a further increase in the number of microposts generated by SOUs entering the network, the number of “extra” microposts in the social network increases, reducing $S$ to the critical level.

The value $S_c$ separates the chaotic and the ordered states of the network. Indeed, the almost zero flow of microposts that occurs when $S < S_c$ can be considered the result of many randomly directed flows of microposts that balance one another. When $S > S_c$, disorder gives way to order, which is expressed in the appearance of a dedicated flow direction (avalanche) of microposts; as a result, the flow becomes significant at the macro level. Both of these states correspond to non-catastrophic behavior of the social network, since in these states the network is resistant to small perturbations. In the chaotic state, small perturbations fade out quickly in time and space, and in the ordered state, perturbations can no longer have a noticeable effect on the avalanche size of microposts. In the critical state, in which a single added SOU can cause an avalanche of microposts of any size, catastrophes are possible.

As a result of self-organization into a critical state, a social network acquires properties that its elements did not have, demonstrating complex emergent behavior. At the same time, it is important that the self-organizing nature of emergent properties ensures their robustness. The SOC state is robust with respect to possible changes in the social network. For example, if the nature of interactions between users changes, the social network temporarily deviates from the existing critical state, but after a while the critical state is restored in a slightly different form. The hierarchical network structure will change, but its dynamics will remain critical. Whenever Twitter is diverted from the SOC state, the social network invariably returns to it.

3 The Formalism

It is known (e.g., see works [18–20]) that the concept of self-organization is a generalization of the physical concept of critical phenomena, such as phase transitions. Therefore, the phenomenological theory that we propose is a generalization of the theory of thermodynamic transformations for open systems. Twitter's self-organization is possible because the network is open (there are constant incoming and outgoing flows of users), macroscopic (it includes a large number of users), and dissipative (there are losses in the flows of microposts and associated information). Based on the synergetic principle of subordination, it can be argued that Twitter's self-organization into a critical state is completely determined by the suppression of the behavior of an infinite number of microscopic degrees of freedom by a small number of macroscopic degrees of freedom. As a result, the collective behavior of the users of the social network is defined by several parameters or degrees of freedom: an order parameter $\eta_t$, which is the number of microposts relevant to a certain topic that are sent by SOUs and, unwittingly following their strategies, by ROUs; a conjugate field $h_t$, which is the information associated with microposts distributed in the network; a control parameter


$S_t$, which is the number of SOUs in the network. On the other hand, in Twitter's self-organization as a non-equilibrium system, the dissipation of flows of microposts in the network should play a crucial role, since it ensures the transition of the network to the stationary state. In the process of self-organization into a critical state of the network, all three degrees of freedom play an equal role, and the description of the process requires a self-consistent view of their evolution. The restriction to three degrees of freedom is also determined by the Ruelle–Takens theorem, according to which a nontrivial picture of self-organization is observed if the number of selected degrees of freedom is at least three.

Kinetic equations and a detailed physical substantiation of the relations between their parameters are given in our paper [16]. The construction of the three-parameter self-organization scheme was based on the analogy between the mechanisms of functioning of a single-mode laser and the microblogging social network. The study of possible modifications of the equations leading to models that are capable of describing critical phenomena on Twitter, in particular the SOC or the SupC states, is outside the scope of this paper. These equations in dimensionless quantities have the following form:

$$
\begin{cases}
\dot{\eta}_t = -\eta_t^{\varepsilon} + h_t + \sqrt{I_\eta}\,\xi_t \\
\dfrac{\tau_h}{\tau_\eta}\,\dot{h}_t = -h_t + \eta_t^{\varepsilon} S_t + \sqrt{I_h}\,\xi_t \\
\dfrac{\tau_S}{\tau_\eta}\,\dot{S}_t = (S_0 - S_t) - \eta_t^{\varepsilon} h_t + \sqrt{I_S}\,\xi_t
\end{cases}
$$
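A system of this form can be explored numerically. The following minimal sketch integrates it with the Euler–Maruyama scheme. It is not the authors' implementation, and several choices are assumptions made for the example: the names `r_h` and `r_S` stand for the ratios of relaxation times τ_h/τ_η and τ_S/τ_η, the three noise sources are treated as independent Gaussian white noises, the fractional power is evaluated on max(η, 0) so that it stays real, and all parameter values are arbitrary.

```python
import math
import random


def simulate(eps=1.0, S0=2.0, r_h=1.0, r_S=1.0, I=(0.0, 0.0, 0.0),
             state=(0.5, 0.5, 2.0), dt=0.01, steps=5000, seed=0):
    """Euler-Maruyama integration of the three-parameter system.

    r_h and r_S are the ratios tau_h/tau_eta and tau_S/tau_eta (assumed names).
    I holds the noise intensities (I_eta, I_h, I_S).
    """
    rng = random.Random(seed)
    eta, h, S = state
    I_eta, I_h, I_S = I
    for _ in range(steps):
        # Assumption: clip eta at zero so a fractional exponent stays real.
        p = max(eta, 0.0) ** eps
        # Assumption: independent standard Gaussian increments per equation.
        n_eta, n_h, n_S = (rng.gauss(0.0, 1.0) for _ in range(3))
        d_eta = (-p + h) * dt + math.sqrt(I_eta * dt) * n_eta
        d_h = ((-h + p * S) * dt + math.sqrt(I_h * dt) * n_h) / r_h
        d_S = ((S0 - S - p * h) * dt + math.sqrt(I_S * dt) * n_S) / r_S
        eta, h, S = eta + d_eta, h + d_h, S + d_S
    return eta, h, S


# Noise-free run with eps = 1 and S0 = 2.
eta, h, S = simulate()
```

In the noise-free case with ε = 1, setting the three time derivatives to zero gives, for nonzero η, the stationary values h = η, S = 1 and η² = S0 − 1, which provides a simple check of the integration for the default parameters.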