This book highlights cutting-edge research in the field of network science, offering scientists, researchers, students,
English Pages XXVII, 979 [992] Year 2020
Table of contents :
Front Matter ....Pages i-xxvii
Front Matter ....Pages 1-1
LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction (Emmanouil Krasanakis, Symeon Papadopoulos, Yiannis Kompatsiaris)....Pages 3-14
A Gradient Estimate for PageRank (Paul Horn, Lauren M. Nelsen)....Pages 15-26
A Persistent Homology Perspective to the Link Prediction Problem (Sumit Bhatia, Bapi Chatterjee, Deepak Nathani, Manohar Kaul)....Pages 27-39
The Role of Network Size for the Robustness of Centrality Measures (Christoph Martin, Peter Niemeyer)....Pages 40-51
Novel Edge and Density Metrics for Link Cohesion (Cetin Savkli, Catherine Schwartz, Amanda Galante, Jonathan Cohen)....Pages 52-63
Facility Location Problem on Network Based on Group Centrality Measure Considering Cooperation and Competition (Takayasu Fushimi, Seiya Okubo, Kazumi Saito)....Pages 64-76
Finding Dominant Nodes Using Graphlets (David Aparício, Pedro Ribeiro, Fernando Silva, Jorge Silva)....Pages 77-89
Sampling on Networks: Estimating Eigenvector Centrality on Incomplete Networks (Nicolò Ruggeri, Caterina De Bacco)....Pages 90-101
Front Matter ....Pages 103-103
Repel Communities and Multipartite Networks (Jerry Scripps, Christian Trefftz, Greg Wolffe, Roger Ferguson, Xiang Cao)....Pages 105-115
The Densest k Subgraph Problem in b-Outerplanar Graphs (Sean Gonzales, Theresa Migler)....Pages 116-127
Spread Sampling and Its Applications on Graphs (Yu Wang, Bortik Bandyopadhyay, Vedang Patel, Aniket Chakrabarti, David Sivakoff, Srinivasan Parthasarathy)....Pages 128-140
Eva: Attribute-Aware Network Segmentation (Salvatore Citraro, Giulio Rossetti)....Pages 141-151
Exorcising the Demon: Angel, Efficient Node-Centric Community Discovery (Giulio Rossetti)....Pages 152-163
Metrics Matter in Community Detection (Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, David W. Matula)....Pages 164-175
An Exact No Free Lunch Theorem for Community Detection (Arya D. McCarthy, Tongfei Chen, Seth Ebner)....Pages 176-187
Impact of Network Topology on Efficiency of Proximity Measures for Community Detection (Rinat Aynulin)....Pages 188-197
Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks (Mário Cordeiro, Rui Portocarrero Sarmento, Pavel Brazdil, Masahiro Kimura, João Gama)....Pages 198-210
Change Point Detection in a Dynamic Stochastic Blockmodel (Peter Wills, François G. Meyer)....Pages 211-222
A General Method for Detecting Community Structures in Complex Networks (Vesa Kuikka)....Pages 223-237
A New Metric for Package Cohesion Measurement Based on Complex Network (Yanran Mi, Yanxi Zhou, Liangyu Chen)....Pages 238-249
A Generalized Framework for Detecting Social Network Communities by the Scanning Method (Tai-Chi Wang, Frederick Kin Hing Phoa)....Pages 250-261
Comparing the Community Structure Identified by Overlapping Methods (Vinícius da F. Vieira, Carolina R. Xavier, Alexandre G. Evsukoff)....Pages 262-273
Semantic Frame Induction as a Community Detection Problem (Eugénio Ribeiro, Andreia Sofia Teixeira, Ricardo Ribeiro, David Martins de Matos)....Pages 274-285
A New Measure of Modularity in Hypergraphs: Theoretical Insights and Implications for Effective Clustering (Tarun Kumar, Sankaran Vaidyanathan, Harini Ananthapadmanabhan, Srinivasan Parthasarathy, Balaraman Ravindran)....Pages 286-297
Front Matter ....Pages 299-299
Crying “Wolf” in a Network Structure: The Influence of Node-Generated Signals (Tomer Tuchner, Gail Gilboa-Freedman)....Pages 301-312
Vaccination Strategies on a Robust Contact Network (Christopher Siu, Theresa Migler)....Pages 313-324
Total Positive Influence Domination on Weighted Networks (Danica Vukadinović Greetham, Nathaniel Charlton, Anush Poghosyan)....Pages 325-336
Modelling Spatial Information Diffusion (Zhuo Chen, Xinyue Ye)....Pages 337-348
Rejection-Based Simulation of Non-Markovian Agents on Complex Networks (Gerrit Großmann, Luca Bortolussi, Verena Wolf)....Pages 349-361
Community-Aware Content Diffusion: Embeddednes and Permeability (Letizia Milli, Giulio Rossetti)....Pages 362-371
Can WhatsApp Counter Misinformation by Limiting Message Forwarding? (Philipe de Freitas Melo, Carolina Coimbra Vieira, Kiran Garimella, Pedro O. S. Vaz de Melo, Fabrício Benevenuto)....Pages 372-384
Modeling Airport Congestion Contagion by SIS Epidemic Spreading on Airline Networks (Klemens Köstler, Rommy Gobardhan, Alberto Ceria, Huijuan Wang)....Pages 385-398
A Population Dynamics Approach to Viral Marketing (Pedro C. Souto, Luísa V. Silva, Diego Costa Pinto, Francisco C. Santos)....Pages 399-411
Integrating Environmental Temperature Conditions into the SIR Model for Vector-Borne Diseases (Md Arquam, Anurag Singh, Hocine Cherifi)....Pages 412-424
Opinion Diffusion in Competitive Environments: Relating Coverage and Speed of Diffusion (Valeria Fionda, Gianluigi Greco)....Pages 425-435
Beyond Fact-Checking: Network Analysis Tools for Monitoring Disinformation in Social Media (Stefano Guarino, Noemi Trino, Alessandro Chessa, Gianni Riotta)....Pages 436-447
Suppressing Information Diffusion via Link Blocking in Temporal Networks (Xiu-Xiu Zhan, Alan Hanjalic, Huijuan Wang)....Pages 448-458
Using Connected Accounts to Enhance Information Spread in Social Networks (Alon Sela, Orit Cohen-Milo, Eugene Kagan, Moti Zwilling, Irad Ben-Gal)....Pages 459-468
Designing Robust Interventions to Control Epidemic Outbreaks (Prathyush Sambaturu, Anil Vullikanti)....Pages 469-480
Front Matter ....Pages 481-481
The Impact of Network Degree Correlation on Parrondo’s Paradox (Ye Ye, Xiao-Rong Hang, Lin Liu, Lu Wang, Nenggang Xie)....Pages 483-494
Analysis of Diversity and Dynamics in Coevolution of Cooperation in Social Networking Services (Yutaro Miura, Fujio Toriumi, Toshiharu Sugawara)....Pages 495-506
Shannon Entropy in Time–Varying Clique Networks (Marcelo do Vale Cunha, Carlos César Ribeiro Santos, Marcelo Albano Moret, Hernane Borges de Barros Pereira)....Pages 507-518
Two-Mode Threshold Graph Dynamical Systems for Modeling Evacuation Decision-Making During Disaster Events (Nafisa Halim, Chris J. Kuhlman, Achla Marathe, Pallab Mozumder, Anil Vullikanti)....Pages 519-531
Spectral Evolution of Twitter Mention Networks (Miguel Romero, Camilo Rocha, Jorge Finke)....Pages 532-542
Front Matter ....Pages 543-543
Minimum Entropy Stochastic Block Models Neglect Edge Distribution Heterogeneity (Louis Duvivier, Céline Robardet, Rémy Cazabet)....Pages 545-555
Three-Parameter Kinetics of Self-organized Criticality on Twitter (Victor Dmitriev, Andrey Dmitriev, Svetlana Maltseva, Stepan Balybin)....Pages 556-565
Multi-parameters Model Selection for Network Inference (Veronica Tozzo, Annalisa Barla)....Pages 566-577
Scott: A Method for Representing Graphs as Rooted Trees for Graph Canonization (Nicolas Bloyet, Pierre-François Marteau, Emmanuel Frénod)....Pages 578-590
Cliques in High-Dimensional Random Geometric Graphs (Konstantin Avrachenkov, Andrei Bobu)....Pages 591-600
Universal Boolean Logic in Cascading Networks (Galen Wilkerson, Sotiris Moschoyiannis)....Pages 601-611
Fitness-Weighted Preferential Attachment with Varying Number of New Connections (Juan Romero, Jorge Finke, Andrés Salazar)....Pages 612-620
Rigid Graph Alignment (Vikram Ravindra, Huda Nassar, David F. Gleich, Ananth Grama)....Pages 621-632
Detecting Hotspots on Networks (Juan Campos, Jorge Finke)....Pages 633-644
Front Matter ....Pages 645-645
A Transparent Referendum Protocol with Immutable Proceedings and Verifiable Outcome for Trustless Networks (Maximilian Schiedermeier, Omar Hasan, Lionel Brunie, Tobias Mayer, Harald Kosch)....Pages 647-658
Utilizing Complex Networks for Event Detection in Heterogeneous High-Volume News Streams (Iraklis Moutidis, Hywel T. P. Williams)....Pages 659-672
Drawing Networks of Political Leaders: Global Affairs in The Economist’s KAL’s Cartoons (Nikita Golubev, Alina V. Vladimirova)....Pages 673-681
Shielding and Shadowing: A Tale of Two Strategies for Opinion Control in the Voting Dynamics (Guillermo Romero Moreno, Long Tran-Thanh, Markus Brede)....Pages 682-693
Front Matter ....Pages 695-695
Stable and Uniform Resource Allocation Strategies for Network Processes Using Vertex Energy Gradients (Mikołaj Morzy, Tomi Wójtowicz)....Pages 697-708
Cascading Failures in Weighted Networks with the Harmonic Closeness (Yucheng Hao, Limin Jia, Yanhui Wang)....Pages 709-720
Learning to Control Random Boolean Networks: A Deep Reinforcement Learning Approach (Georgios Papagiannis, Sotiris Moschoyiannis)....Pages 721-734
Comparative Network Robustness Evaluation of Link Attacks (Clara Pizzuti, Annalisa Socievole, Piet Van Mieghem)....Pages 735-746
MAC: Multilevel Autonomous Clustering for Topologically Distributed Anomaly Detection (M. A. Partha, C. V. Ponce)....Pages 747-760
Network Strengthening Against Malicious Attacks (Qingnan Rong, Jun Zhang, Xiaoqian Sun, Sebastian Wandelt)....Pages 761-772
Identifying Vulnerable Nodes to Cascading Failures: Optimization-Based Approach (Richard J. La)....Pages 773-782
Ensemble Approach for Generalized Network Dismantling (Xiao-Long Ren, Nino Antulov-Fantulin)....Pages 783-793
Front Matter ....Pages 795-795
A Simple Approach to Attributed Graph Embedding via Enhanced Autoencoder (Nasrullah Sheikh, Zekarias T. Kefato, Alberto Montresor)....Pages 797-809
Matching Node Embeddings Using Valid Assignment Kernels (Changmin Wu, Giannis Nikolentzos, Michalis Vazirgiannis)....Pages 810-821
Short Text Tagging Using Nested Stochastic Block Model: A Yelp Case Study (John Bowllan, Kailey Cozart, Seyed Mohammad Mahdi Seyednezhad, Anthony Smith, Ronaldo Menezes)....Pages 822-833
Domain-Invariant Latent Representation Discovers Roles (Shumpei Kikuta, Fujio Toriumi, Mao Nishiguchi, Tomoki Fukuma, Takanori Nishida, Shohei Usui)....Pages 834-844
Inductive Representation Learning on Feature Rich Complex Networks for Churn Prediction in Telco (María Óskarsdóttir, Sander Cornette, Floris Deseure, Bart Baesens)....Pages 845-853
On Inferring Monthly Expenses of Social Media Users: Towards Data and Approaches (Danila Vaganov, Alexander Kalinin, Klavdiya Bochenina)....Pages 854-865
Evaluating the Community Structures from Network Images Using Neural Networks (Md. Khaledur Rahman, Ariful Azad)....Pages 866-878
Gumbel-Softmax Optimization: A Simple General Framework for Combinatorial Optimization Problems on Graphs (Jing Liu, Fei Gao, Jiang Zhang)....Pages 879-890
TemporalNode2vec: Temporal Node Embedding in Temporal Networks (Mounir Haddad, Cécile Bothorel, Philippe Lenca, Dominique Bedart)....Pages 891-902
Deep Reinforcement Learning for Task-Driven Discovery of Incomplete Networks (Peter Morales, Rajmonda Sulo Caceres, Tina Eliassi-Rad)....Pages 903-914
Evaluating Network Embedding Models for Machine Learning Tasks (Ikenna Oluigbo, Mohammed Haddad, Hamida Seba)....Pages 915-927
A BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media (Marzieh Mozafari, Reza Farahbakhsh, Noël Crespi)....Pages 928-940
Front Matter ....Pages 941-941
A Simple Differential Geometry for Networks and Its Generalizations (Emil Saucan, Areejit Samal, Jürgen Jost)....Pages 943-954
Characterizing Distances of Networks on the Tensor Manifold (Bipul Islam, Ji Liu, Romeil Sandhu)....Pages 955-964
Eigenvalues and Spectral Dimension of Random Geometric Graphs in Thermodynamic Regime (Konstantin Avrachenkov, Laura Cottatellucci, Mounia Hamidouche)....Pages 965-975
Back Matter ....Pages 977-979
Studies in Computational Intelligence 881
Hocine Cherifi · Sabrina Gaito · José Fernando Mendes · Esteban Moro · Luis Mateus Rocha Editors
Complex Networks and Their Applications VIII Volume 1 Proceedings of the Eighth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2019
Studies in Computational Intelligence Volume 881
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution, which enable both wide and rapid dissemination of research output. The books of this series are submitted to indexing to Web of Science, EI-Compendex, DBLP, SCOPUS, Google Scholar and SpringerLink.
More information about this series at http://www.springer.com/series/7092
Editors
Hocine Cherifi, University of Burgundy, Dijon Cedex, France
Sabrina Gaito, Università degli Studi di Milano, Milan, Italy
José Fernando Mendes, University of Aveiro, Aveiro, Portugal
Esteban Moro, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
Luis Mateus Rocha, Indiana University, Bloomington, IN, USA
ISSN 1860-949X ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-36686-5 ISBN 978-3-030-36687-2 (eBook)
https://doi.org/10.1007/978-3-030-36687-2
© Springer Nature Switzerland AG 2020
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The International Conference on Complex Networks and Their Applications was initiated in 2011. Since then, it has grown to become one of the major international events in network science. The aim is to support the rise of the scientific community that studies the world through the lens of networks. Every year, it brings together researchers from a wide variety of scientific backgrounds, ranging from finance and economics, medicine and neuroscience, biology and earth sciences, sociology and politics, to computer science and physics, and many others, in order to review the current state of the field and formulate new directions. The scientific topics are equally varied: network theory, network models, network geometry, community structure, network analysis and measures, link analysis and ranking, resilience and control, machine learning and networks, dynamics on/of networks, diffusion, and epidemics. Current applications include social and urban networks, human behavior, urban systems, mobility, and quantifying success. The great diversity of the participants allows for cross-fertilization between fundamental issues and innovative applications. The papers selected for the volumes of proceedings from the eighth edition, hosted by the Calouste Gulbenkian Foundation in Lisbon (Portugal) from December 10 to December 12, 2019, clearly reflect the multiple aspects of complex network issues as well as the high quality of the contributions. This edition attracted numerous authors from all over the world, with 470 submissions from 58 countries. All the submissions were peer-reviewed by at least three independent reviewers from our strong International Program Committee in order to ensure the high quality of the contributed material as well as adherence to conference topics. After the review process, 161 papers were selected to be included in the proceedings.
The success of this edition is owed first to the authors, who provided high-quality papers. It is also owed to our keynote speakers, whose fascinating plenary lectures provided outstanding coverage of the broad field of complex networks:
• Lada Adamic (Facebook Inc.): “Reflections of social networks”
• Reka Albert (Pennsylvania State University, USA): “Network-based dynamic modeling of biological systems: toward understanding and control”
• Ulrik Brandes (ETH Zürich, Switzerland): “On a positional approach to network science”
• Jari Saramäki (Aalto University, Finland): “Temporal networks: past, present, future”
• Stefan Thurner (Medical University of Vienna, Austria): “How to eliminate systemic risk from financial multilayer networks”
• Michalis Vazirgiannis (LIX, École Polytechnique, France): “Machine learning for graphs based on kernels”
Prior to the conference, for the traditional tutorial sessions, Maria Ángeles Serrano (Universitat de Barcelona, Spain) and Diego Saez-Trumper (Wikimedia Foundation) delivered insightful talks on “Mapping networks in latent geometry: models and applications” and “Wikimedia public (research) resources,” respectively.
We sincerely thank our advisory board members for inspiring the essence of the conference: Jon Crowcroft (University of Cambridge), Raissa D’Souza (University of California, Davis, USA), Eugene Stanley (Boston University, USA), and Ben Y. Zhao (University of Chicago, USA).
We record our thanks to our fellow members of the Organizing Committee: Luca Maria Aiello (Nokia Bell Labs, UK) and Rosa Maria Benito (Universidad Politecnica de Madrid, Spain), our satellite chairs; Nuno Araujo (University of Lisbon, Portugal), Huijuan Wang (TU Delft, the Netherlands), and Taha Yasseri (University of Oxford, UK) for chairing the lightning sessions; Gitajanli Yadav (University of Cambridge, UK), Jinhu Lü (Chinese Ac. Science, Beijing, China), and Maria Clara Gracio (University of Evora, Portugal) for managing the poster sessions; and Bruno Gonçalves (NYU, USA), our tutorial chair.
We extend our thanks to Carlos Gershenson (Universidad Nacional Autónoma de México, Mexico), Michael Schaub (MIT, USA), Leto Peel (Université Catholique de Louvain, Belgium), and Feng Xia (Dalian University of Technology, China), the publicity chairs, for advertising the conference in America, Asia, and Europe and thereby encouraging participation. We would also like to acknowledge Roberto Interdonato (CIRAD UMR TETIS, Montpellier, France) and Andreia Sofia Teixeira (University of Lisbon, Portugal), our sponsor chair and social media chair, respectively. Our thanks also go to Chantal Cherifi (University of Lyon, France), publication chair, and to the Milan team (University of Milan, Italy), Matteo Zignani, the Web chair, and Christian Quadri, the submission chair, for the tremendous work they have done in maintaining the Web site and the submission system. We would also like to record our appreciation for the work of the Local Committee chair, Manuel Marques-Pita, and members Pedro Souto, Flávio L. Pinheiro, Rion Bratting Correia, Lília Perfeito, Sara Mesquita, Sofia Pinto, Simone Lackner, and João Franco, whose work contributed to the success of this edition.
Our deep thanks go to the members of Instituto Gulbenkian de Ciência, Rita Caré, Regina Fernandes, and Greta Martins, as well as to Paulo Madruga from Fundação Calouste Gulbenkian, for their precious support and dedication. We are also indebted to our partners, Alessandro Fellegara and Alessandro Egro (Tribe Communication), for their passion and patience in designing the visual identity of the conference. We would like to express our gratitude to the publishers involved in sponsoring the conference: Frontiers and Springer Nature. Our deepest appreciation goes to all those who have helped us make this meeting a success. Sincere thanks to the contributors; the success of the technical program would not be possible without their creativity. Finally, we would like to express our most sincere thanks to the Program Committee members for their huge efforts in producing 1453 high-quality reviews in a very limited time. These volumes make the most advanced contribution of the international community to the research issues surrounding the fascinating world of complex networks. We hope that you enjoy the papers as much as we enjoyed organizing the conference and putting this collection of papers together.
Hocine Cherifi
Sabrina Gaito
José Fernando Mendes
Esteban Moro
Luis Mateus Rocha
Organization and Committees
General Chairs
Hocine Cherifi, University of Burgundy, France
José Fernando Mendes, University of Aveiro, Portugal
Luis Mateus Rocha, Indiana University Bloomington, USA
Advisory Board
Jon Crowcroft, University of Cambridge, UK
Raissa D’Souza, University of California, Davis, USA
Eugene Stanley, Boston University, USA
Ben Y. Zhao, University of Chicago, USA
Program Chairs
Sabrina Gaito, University of Milan, Italy
Esteban Moro, Universidad Carlos III de Madrid, Spain
Program Co-chairs
Joana Gonçalves-Sá, Universidade NOVA de Lisboa, Portugal
Francisco Santos, University of Lisbon, Portugal
Satellite Chairs
Luca Maria Aiello, Nokia Bell Labs, UK
Rosa M. Benito, Universidad Politecnica de Madrid, Spain
Lightning Chairs
Nuno Araujo, University of Lisbon, Portugal
Huijuan Wang, TU Delft, the Netherlands
Taha Yasseri, University of Oxford, UK
Poster Chairs
Gitajanli Yadav, University of Cambridge, UK
Jinhu Lü, Chinese Ac. Science, Beijing, China
Maria Clara Gracio, University of Evora, Portugal
Publicity Chairs
Carlos Gershenson, UNA de Mexico, Mexico
Leto Peel, UCLouvain, Belgium
Michael Schaub, MIT, USA
Feng Xia, Dalian University of Technology, China
Tutorial Chair
Bruno Gonçalves, NYU, USA
Sponsor Chair
Roberto Interdonato, CIRAD - UMR TETIS, France
Social Media Chair
Andreia Sofia Teixeira, University of Lisbon, Portugal
Local Committee Chair
Manuel Marques-Pita, University Lusófona, Portugal
Local Committee
Rion Bratting Correia, Instituto Gulbenkian de Ciência, Portugal
João Franco, Nova SBE, Portugal
Simone Lackner, Nova SBE, Portugal
Sara Mesquita, Nova SBE, Portugal
Lília Perfeito, Nova SBE, Portugal
Flávio L. Pinheiro, NOVA IMS, Portugal
Sofia Pinto, Nova SBE, Portugal
Pedro Souto, University of Lisbon, Portugal
Andreia Sofia Teixeira, University of Lisbon, Portugal
Publication Chair
Chantal Cherifi, University of Lyon, France
Submission Chair
Christian Quadri, University of Milan, Italy
Web Chair
Matteo Zignani, University of Milan, Italy
Program Committee
Aguirre Jacobo, Centro Nacional de Biotecnología (CNB-CSIC), Spain
Ahmed Nesreen, Intel, USA
Aida Masaki, Tokyo Metropolitan University, Japan
Aiello Luca Maria, Nokia Bell Labs, UK
Aiello Marco, University of Stuttgart, Germany
Aktas Mehmet, University of Central Oklahoma, USA
Akutsu Tatsuya, Kyoto University, Japan
Albert Reka, Pennsylvania State University, USA
Allard Antoine, Laval University, Canada
Aloric Aleksandra, Institute of Physics Belgrade, Serbia
Altafini Claudio, Linköping University, Sweden
Alvarez-Zuzek Lucila G., IFIMAR-UNMdP, Argentina
Alves Luiz G. A., Northwestern University, USA
Amblard Fred, IRIT - University Toulouse 1 Capitole, France
An Chuankai, Dartmouth College, USA
Angione Claudio, Teesside University, UK
Angulo Marco Tulio, UNAM, Mexico
Antonioni Alberto, Carlos III University of Madrid, Spain
Antulov-Fantulin Nino, ETH Zurich, Switzerland
Araujo Nuno, Universidade de Lisboa, Portugal
Arcaute Elsa, University College London, UK
Aref Samin, MPI for Demographic Research, Germany
Arenas Alex, URV, Spain
Ares Saul, Centro Nacional de Biotecnología (CNB-CSIC), Spain
Argyrakis Panos, Aristotle University of Thessaloniki, Greece
Aste Tomaso, University College London, UK
Atzmueller Martin, Tilburg University, the Netherlands
Avrachenkov Konstantin, Inria, France
Baggio Rodolfo, Bocconi University, Italy
Banisch Sven, MPI for Mathematics in the Sciences, Germany
Barnett George, University of California, Davis, USA
Barucca Paolo, University College London, UK
Basov Nikita, St. Petersburg State University, Russia
Baxter Gareth, University of Aveiro, Portugal
Beguerisse Diaz Mariano, Spotify Limited, UK
Benczur Andras A., ICSC, Hungarian Academy of Sciences, Hungary
Benito Rosa M., Universidad Politécnica de Madrid, Spain
Bianconi Ginestra, Queen Mary University of London, UK
Biham Ofer, The Hebrew University of Jerusalem, Israel
Boguna Marian, University of Barcelona, Spain
Bonato Anthony, Ryerson University, Canada
Bongiorno Christian, Università degli Studi di Palermo, Italy
Borg Anton, Blekinge Institute of Technology, Sweden
Borge-Holthoefer Javier, Internet Interdisciplinary Institute IN3 UOC, Spain
Borgnat Pierre, CNRS, Laboratoire de Physique ENS de Lyon, France
Bornholdt Stefan, Universität Bremen, Germany
Bovet Alexandre, Université Catholique de Louvain-la-Neuve, Belgium
Braha Dan, NECSI, USA
Brandes Ulrik, ETH Zurich, Switzerland
Brede Markus, University of Southampton, UK
Bressan Marco, Sapienza University of Rome, Italy
Brockmann Dirk, Humboldt University of Berlin, Germany
Bródka Piotr, Wroclaw University of Science and Technology, Poland
Burioni Raffaella, Università di Parma, Italy
Campana Paolo, University of Cambridge, UK
Cannistraci Carlo Vittorio, TU Dresden, Germany
Carchiolo Vincenza, Università di Catania, Italy
Cardillo Alessio, Universitat Rovira i Virgili, Spain
Casiraghi Giona, ETH Zurich, Switzerland
Cattuto Ciro, ISI Foundation, Italy
Cazabet Remy, Université Lyon 1, CNRS, LIRIS, France
Chakraborty Abhijit, University of Hyogo, Japan
Chakraborty Tanmoy, IIIT Delhi, India
Chavalarias David, CNRS, CAMS/ISCPIF, France
Chawla Nitesh V., University of Notre Dame, USA
Chen Kwang-Cheng, University of South Florida, USA
Cheng Xueqi, Institute of Computing Technology, China
Cherifi Hocine, University of Burgundy, France
Cherifi Chantal, Lyon 2 University, France
Chin Peter, Boston University, USA
Chung Fu Lai, The Hong Kong Polytechnic University, Hong Kong
Cinelli Matteo, University of Rome “Tor Vergata”, Italy
Clegg Richard, Queen Mary University of London, UK
Cohen Reuven, Bar-Ilan University, Israel
Coscia Michele, IT University of Copenhagen, Denmark
Costa Luciano, Universidade de Sao Paulo, Brazil
Criado Regino, Universidad Rey Juan Carlos, Spain
Cucuringu Mihai, University of Oxford and The Alan Turing Institute, UK
Darwish Kareem, Qatar Computing Research Institute, Qatar
Dasgupta Bhaskar, University of Illinois at Chicago, USA
Davidsen Joern, University of Calgary, Canada
De Bie Tijl, Ghent University, Belgium
De Meo Pasquale, Vrije Universiteit Amsterdam, the Netherlands
De Vico Fallani Fabrizio, Inria - ICM, France
Del Genio Charo I., Coventry University, UK
Delellis Pietro, University of Naples Federico II, Italy
Delvenne Jean-Charles, University of Louvain, Belgium
Deng Yong, Xi’an Jiaotong University, China
Devezas José, INESC TEC and DEI-FEUP, Portugal
Di Muro Matías, Universidad Nacional de Mar del Plata, Argentina
Diesner Jana, University of Illinois at Urbana-Champaign, USA
Douw Linda, Amsterdam UMC, the Netherlands
Duch Jordi, Universitat Rovira i Virgili, Spain
Eismann Kathrin, University of Bamberg, Germany
El Hassouni Mohammed, Mohammed V University, Morocco
Emmerich Michael T. M., Leiden University, the Netherlands
Emmert-Streib Frank, Tampere University of Technology, Finland
Ercal Gunes, SIUE, USA
Faccin Mauro, Université Catholique de Louvain, Belgium
Fagiolo Giorgio, Sant’Anna School of Advanced Studies, Italy
Flammini Alessandro, Indiana University Bloomington, USA
Foerster Manuel, University of Hamburg, Germany
Frasca Mattia, University of Catania, Italy
Fu Xiaoming, University of Gottingen, Germany
Furno Angelo, Université de Lyon, France
Gaito Sabrina, University of Milan, Italy
Gallos Lazaros, Rutgers University, USA
Galán José Manuel, Universidad de Burgos, Spain
Gama Joao, University of Porto, Portugal
Gandica Yerali, Université Catholique de Louvain, Belgium
Gao Jianxi, Rensselaer Polytechnic Institute, USA
Garcia David, Complexity Science Hub Vienna, Austria
Gates Alexander, Indiana University Bloomington, USA
Gauthier Vincent, Institut Mines-Telecom, CNRS SAMOVAR, France
Gera Ralucca, Naval Postgraduate School, USA
Giordano Silvia, SUPSI, Switzerland
Giugno Rosalba, University of Verona, Italy
Gleeson James, University of Limerick, Ireland
Godoy Antonia, Rovira i Virgili University, Spain
Goh Kwang-Il, Korea University, South Korea
Gomez-Gardenes Jesus, Universidad de Zaragoza, Spain
Gonçalves Bruno, New York University, USA
Gonçalves-Sá Joana, Nova School of Business and Economics, Portugal
Grabowicz Przemyslaw, MPI for Software Systems, Germany
Grujic Jelena, Vrije Universiteit Brussel, Belgium
Guillaume Jean-Loup, Université de la Rochelle, France
Gunes Mehmet, University of Nevada, Reno, USA
Guney Emre, Pompeu Fabra University, Spain
Guo Weisi, University of Warwick, UK
Gómez Sergio, Universitat Rovira i Virgili, Spain
Ha Meesoon, Chosun University, South Korea
Hackl Jürgen, ETH Zurich, Switzerland
Hagberg Aric, Los Alamos National Laboratory, USA
Hancock Edwin, University of York, UK
Hankin Chris, Imperial College London, UK
Hayashi Yukio, Japan Advanced Inst. of Science and Technology, Japan
Heinimann Hans R., ETH Zurich, Switzerland
Helic Denis, Graz University of Technology, Austria
Hens Chittaranjan, CSIR-Indian Institute of Chemical Biology, India
Hernandez Laura, Université de Cergy-Pontoise, France
Heydari Babak, Northeastern University, USA
Hoevel Philipp, University College Cork, Ireland
Holme Petter, Tokyo Institute of Technology, Japan
Hong Seok-Hee, University of Sydney, Australia
Hoppe Ulrich, University Duisburg-Essen, Germany
Hu Yanqing, Sun Yat-sen University, China
Huang Junming, Princeton University, USA
Hébert-Dufresne Laurent, University of Vermont, USA
Iannelli Flavio, Humboldt University of Berlin, Germany
Ikeda Yuichi, Kyoto University, Japan
Interdonato Roberto, CIRAD - UMR TETIS, France
Iori Giulia, City, University of London, UK
Iorio Francesco, Wellcome Sanger Institute, UK
Iosifidis George, Trinity College Dublin, Ireland
Iovanella Antonio, University of Rome “Tor Vergata”, Italy
Ivanov Plamen, Boston University, USA
Iñiguez Gerardo, Central European University, Hungary
Jalan Sarika, IIT Indore, India
Jalili Mahdi, RMIT University, Australia
Jankowski Jaroslaw, West Pomeranian University of Technology, Poland
Javarone Marco Alberto, Coventry University, UK
Jeong Hawoong, KAIST, South Korea
Jia Tao, Southwest University, Chongqing, China
Jin Di, Tianjin University, China
Jo Hang-Hyun, Asia Pacific Center for Theoretical Physics, South Korea
Jouve Bertrand, CNRS, France
Jędrzejewski Arkadiusz, Wrocław University of Science and Technology, Poland
Kaltenbrunner Andreas, NTENT, Spain
Kanawati Rushed, Université Paris 13, France
Karsai Márton, ENS de Lyon, France
Kaya Mehmet, Firat University, Turkey
Kelen Domokos, Hungarian Academy of Sciences, Hungary
Kenett Yoed, University of Pennsylvania, USA
Kenett Dror, Johns Hopkins University, USA
Kertesz Janos, Central European University, Hungary
Keuschnigg Marc, Linköping University, Sweden
Khansari Mohammad, University of Tehran, Iran
Kheddouci Hamamache, University Claude Bernard Lyon 1, France
Kim Hyoungshick, Sungkyunkwan University, South Korea
Kitsak Maksim, Northeastern University, USA
Kivela Mikko, Aalto University, Finland
Klemm Konstantin, IFISC (CSIC-UIB), Spain
Klimek Peter, Medical University of Vienna, Austria
Kong Xiangjie, Dalian University of Technology, China
Koponen Ismo, University of Helsinki, Finland
Korhonen Onerva, Université de Lille, France
Kutner Ryszard, University of Warsaw, Poland
Lambiotte Renaud, University of Oxford, UK
Largeron Christine, Université de Lyon, France
Larson Jennifer, New York University, USA
Lawniczak Anna T., University of Guelph, Canada
Leclercq Eric, University of Burgundy, France
Lee Deok-Sun, Inha University, South Korea
Lehmann Sune Leifeld Philip Lerner Juergen Lillo Fabrizio Livan Giacomo Longheu Alessandro Lu Linyuan Lu Meilian Lui John C. S. Maccari Leonardo Magnani Matteo Malliaros Fragkiskos Mangioni Giuseppe Marathe Madhav Mariani Manuel Sebastian Marik Radek Marino Andrea Marques Antonio MarquesPita Manuel Martin Christoph Masoller Cristina Mastrandrea Rossana Masuda Naoki Matta John Mccarthy Arya Medo Matúš Menche Jörg Mendes Jose Fernando Menezes Ronaldo MeyerBaese Anke Michalski Radosław Milli Letizia Mitra Bivas Mitrovic Marija Mizera Andrzej Mokryn Osnat Molontay Roland Mondragon Raul Mongiovì Misael
Technical University of Denmark, Denmark University of Essex, UK University of Konstanz, Germany University of Bologna, Italy University College London, UK University of Catania, Italy University of Fribourg, Switzerland Beijing University of Posts and Telecommunications, China The Chinese University of Hong Kong, Hong Kong University of Venice, Italy Uppsala University, Sweden University of ParisSaclay, France University of Catania, Italy University of Virginia, USA University of Zurich, Switzerland Czech Technical University, Czechia University of Florence, Italy King Juan Carlos University, Spain Universidade Lusofona, Portugal Leuphana University of Lüneburg, Germany Universitat Politècnica de Catalunya, Spain IMT Institute of Advanced Studies, Italy University at Buffalo, State University of New York, USA SIUE, USA Johns Hopkins University, USA University of Electronic Science and Technology of China, China Austrian Academy of Sciences, Austria University of Aveiro, Portugal University of Exeter, UK FSU, USA Wrocław University of Science and Technology, Poland University of Pisa, Italy Indian Institute of Technology Kharagpur, India Institute of Physics Belgrade, Serbia University of Luxembourg, Luxembourg University of Haifa, Israel Budapest University of Technology and Economics, Hungary Queen Mary University of London, UK Consiglio Nazionale delle Ricerche, Italy
Moro Esteban Moschoyiannis Sotiris Moses Elisha Mozetič Igor Murata Tsuyoshi Muscoloni Alessandro Mäs Michael Neal Zachary NourEddin El Faouzi Oliveira Marcos Omelchenko Iryna Omicini Andrea Palla Gergely Panzarasa Pietro Papadopoulos Fragkiskos Papadopoulos Symeon Papandrea Michela Park Han Woo Park Juyong Park Noseong Passarella Andrea Peel Leto Peixoto Tiago Perc Matjaz Petri Giovanni Pfeffer Juergen Piccardi Carlo Pizzuti Clara Poledna Sebastian Poletto Chiara Pralat Pawel Preciado Victor Przulj Natasa Qu Zehui Quadri Christian Quaggiotto Marco Radicchi Filippo Ramasco Jose J. ReedTsochas Felix Renoust Benjamin Ribeiro Pedro Riccaboni Massimo Ricci Laura Rizzo Alessandro
Universidad Carlos III de Madrid, Spain University of Surrey, UK Weizmann Institute of Science, Israel Jozef Stefan Institute, Slovenia Tokyo Institute of Technology, Japan TU Dresden, Germany ETH Zurich, the Netherlands Michigan State University, USA IFSTTAR, France Leibniz Institute for the Social Sciences, USA TU Berlin, Germany Università di Bologna, Italy HAS, Hungary Queen Mary University of London, UK Cyprus University of Technology, Cyprus Information Technologies Institute, Greece SUPSI, Switzerland Yeungnam University, South Korea KAIST, South Korea George Mason University, USA IITCNR, Italy Université Catholique de Louvain, Belgium University of Bath, Germany University of Maribor, Slovenia ISI Foundation, Italy Technical University of Munich, Germany Politecnico di Milano, Italy CNRICAR, Italy IIASA and Complexity Science Hub Vienna, Austria Sorbonne Université, France Ryerson University, Canada University of Pennsylvania, USA UCL, UK Southwest University, China University of Milan, Italy ISI Foundation, Italy Northwestern University, USA IFISC (CSICUIB), Spain University of Oxford, UK Osaka University, Japan University of Porto, Portugal IMT Institute for Advanced Studies, Italy University of Pisa, Italy Politecnico di Torino, Italy
Rocha Luis M. Rocha Luis E. C. Rodrigues Francisco Rosas Fernando Rossetti Giulio Rossi Luca Roth Camille Roukny Tarik Saberi Meead Safari Ali Saniee Iraj Santos Francisco C. Saramäki Jari Sayama Hiroki Scala Antonio Schaub Michael Schich Maximilian Schifanella Rossano Schoenebeck Grant Schweitzer Frank Segarra Santiago Sharma Aneesh Sharma Rajesh Sienkiewicz Julian Singh Anurag Skardal Per Sebastian Small Michael Smolyarenko Igor Smoreda Zbigniew Snijders Tom Socievole Annalisa Sole Albert Song Lipeng Stella Massimo Sullivan Blair D. Sun Xiaoqian Sundsøy Pål Szymanski Boleslaw Tadic Bosiljka Tagarelli Andrea Tajoli Lucia Takemoto Kazuhiro Takes Frank Tang Jiliang
Indiana University Bloomington, USA Ghent University, Belgium University of São Paulo, Brazil Imperial College London, UK ISTICNR, Italy IT University of Copenhagen, Denmark CNRS, Germany Massachusetts Institute of Technology, USA UNSW, Australia FriedrichAlexanderUniversität, Germany Bell Labs, AlcatelLucent, USA Universidade de Lisboa, Portugal Aalto University, Finland Binghamton University, USA Institute for Complex Systems/CNR, Italy Massachusetts Institute of Technology, USA The University of Texas at Dallas, USA University of Turin, Italy University of Michigan, USA ETH Zurich, Switzerland Rice University, USA Google, USA University of Tartu, Estonia Warsaw University of Technology, Poland NIT Delhi, India Trinity College Dublin, Ireland The University of Western Australia, Australia Brunel University, UK Orange Labs, France University of Groningen, the Netherlands CNR and ICAR, Italy Universitat Rovira i Virgili, Spain North University of China, China Institute for Complex Systems Simulation, UK University of Utah, USA Beihang University, China NBIM, Norway Rensselaer Polytechnic Institute, USA Jozef Stefan Institute, Slovenia University of Calabria, Italy Politecnico di Milano, Italy Kyushu Institute of Technology, Japan Leiden University and University of Amsterdam, the Netherlands Michigan State University, USA
Tarissan Fabien Tessone Claudio Juan Thai My Théberge François Tizzoni Michele Togni Olivier Traag Vincent Antonio Trajkovic Ljiljana Treur Jan Tupikina Liubov Török Janos Uzzo Stephen Valdez Lucas D. Valverde Sergi Van Der Hoorn Pim Van Der Leij Marco Van Mieghem Piet Van Veen Dirk Vazirgiannis Michalis Vedres Balazs Vermeer Wouter Vestergaard Christian Lyngby Vodenska Irena Wachs Johannes Wang Xiaofan Wang Lei Wang Huijuan Wen Guanghui Wilfong Gordon Wilinski Mateusz Wilson Richard Wit Ernst Wu Bin Wu Jinshan Xia Feng Xia Haoxiang Xu Xiaoke Yagan Osman Yan Gang Yan Xiaoran Zhang Qingpeng
CNRS  ENS ParisSaclay (ISP), France Universität Zürich, Switzerland University of Florida, USA Tutte Institute for Mathematics and Computing, Canada ISI Foundation, Italy University of Burgundy, France Leiden University, the Netherlands Simon Fraser University, Canada Vrije Universiteit Amsterdam, the Netherlands Ecole Polytechnique, France Budapest University of Technology and Economics, Hungary New York Hall of Science, USA FAMAFUNC, Argentina Institute of Evolutionary Biology (CSICUPF), Spain Northeastern University, USA University of Amsterdam, the Netherlands Delft University of Technology, the Netherlands ETH Zurich, SingaporeETH Centre, Switzerland AUEB, Greece CEU, Hungary Northwestern University, USA CNRS and Institut Pasteur, France Boston University, USA Central European University, Hungary Shanghai Jiao Tong University, China Beihang University, China Delft University of Technology, the Netherlands Southeast University, China Bell Labs, USA Scuola Normale Superiore di Pisa, Italy University of York, UK University of Groningen, the Netherlands Beijing University of Posts and Telecommunications, China Beijing Normal University, China Dalian University of Technology, China Dalian University of Technology, China Dalian Minzu University, China CyLabCMU, USA Tongji University, China Indiana University Bloomington, USA City University of Hong Kong, USA
Zhang ZiKe Zhao Junfei Zhong Fay Zignani Matteo Zimeo Eugenio Zino Lorenzo Zippo Antonio Zlatic Vinko Zubiaga Arkaitz
Hangzhou Normal University, China Columbia University, USA CSUEB, USA Università degli Studi di Milano, Italy University of Sannio, Italy Politecnico di Torino, Italy Consiglio Nazionale delle Ricerche, Italy Sapienza University of Rome, Italy Queen Mary University of London, UK
Contents
Link Analysis and Ranking
LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction . . . 3
Emmanouil Krasanakis, Symeon Papadopoulos, and Yiannis Kompatsiaris
A Gradient Estimate for PageRank . . . 15
Paul Horn and Lauren M. Nelsen
A Persistent Homology Perspective to the Link Prediction Problem . . . 27
Sumit Bhatia, Bapi Chatterjee, Deepak Nathani, and Manohar Kaul
The Role of Network Size for the Robustness of Centrality Measures . . . 40
Christoph Martin and Peter Niemeyer
Novel Edge and Density Metrics for Link Cohesion . . . 52
Cetin Savkli, Catherine Schwartz, Amanda Galante, and Jonathan Cohen
Facility Location Problem on Network Based on Group Centrality Measure Considering Cooperation and Competition . . . 64
Takayasu Fushimi, Seiya Okubo, and Kazumi Saito
Finding Dominant Nodes Using Graphlets . . . 77
David Aparício, Pedro Ribeiro, Fernando Silva, and Jorge Silva
Sampling on Networks: Estimating Eigenvector Centrality on Incomplete Networks . . . 90
Nicolò Ruggeri and Caterina De Bacco
Community Structure Repel Communities and Multipartite Networks . . . . . . . . . . . . . . . . . . . 105 Jerry Scripps, Christian Trefftz, Greg Wolffe, Roger Ferguson, and Xiang Cao
The Densest k Subgraph Problem in bOuterplanar Graphs . . . . . . . . . 116 Sean Gonzales and Theresa Migler Spread Sampling and Its Applications on Graphs . . . . . . . . . . . . . . . . . 128 Yu Wang, Bortik Bandyopadhyay, Vedang Patel, Aniket Chakrabarti, David Sivakoff, and Srinivasan Parthasarathy EVa: AttributeAware Network Segmentation . . . . . . . . . . . . . . . . . . . . 141 Salvatore Citraro and Giulio Rossetti Exorcising the Demon: Angel, Efﬁcient NodeCentric Community Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 Giulio Rossetti Metrics Matter in Community Detection . . . . . . . . . . . . . . . . . . . . . . . . 164 Arya D. McCarthy, Tongfei Chen, Rachel Rudinger, and David W. Matula An Exact No Free Lunch Theorem for Community Detection . . . . . . . . 176 Arya D. McCarthy, Tongfei Chen, and Seth Ebner Impact of Network Topology on Efﬁciency of Proximity Measures for Community Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 Rinat Aynulin Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198 Mário Cordeiro, Rui Portocarrero Sarmento, Pavel Brazdil, Masahiro Kimura, and João Gama Change Point Detection in a Dynamic Stochastic Blockmodel . . . . . . . . 211 Peter Wills and François G. Meyer A General Method for Detecting Community Structures in Complex Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223 Vesa Kuikka A New Metric for Package Cohesion Measurement Based on Complex Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238 Yanran Mi, Yanxi Zhou, and Liangyu Chen A Generalized Framework for Detecting Social Network Communities by the Scanning Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
250 TaiChi Wang and Frederick Kin Hing Phoa Comparing the Community Structure Identiﬁed by Overlapping Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262 Vinícius da F. Vieira, Carolina R. Xavier, and Alexandre G. Evsukoff Semantic Frame Induction as a Community Detection Problem . . . . . . 274 Eugénio Ribeiro, Andreia Soﬁa Teixeira, Ricardo Ribeiro, and David Martins de Matos
A New Measure of Modularity in Hypergraphs: Theoretical Insights and Implications for Effective Clustering . . . . . . . . . . . . . . . . . . . . . . . . 286 Tarun Kumar, Sankaran Vaidyanathan, Harini Ananthapadmanabhan, Srinivasan Parthasarathy, and Balaraman Ravindran Diffusion and Epidemics Crying “Wolf” in a Network Structure: The Inﬂuence of NodeGenerated Signals . . . . . . . . . . . . . . . . . . . . . . . . 301 Tomer Tuchner and Gail GilboaFreedman Vaccination Strategies on a Robust Contact Network . . . . . . . . . . . . . . 313 Christopher Siu and Theresa Migler Total Positive Inﬂuence Domination on Weighted Networks . . . . . . . . . 325 Danica Vukadinović Greetham, Nathaniel Charlton, and Anush Poghosyan Modelling Spatial Information Diffusion . . . . . . . . . . . . . . . . . . . . . . . . 337 Zhuo Chen and Xinyue Ye RejectionBased Simulation of NonMarkovian Agents on Complex Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349 Gerrit Großmann, Luca Bortolussi, and Verena Wolf CommunityAware Content Diffusion: Embeddednes and Permeability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362 Letizia Milli and Giulio Rossetti Can WhatsApp Counter Misinformation by Limiting Message Forwarding? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372 Philipe de Freitas Melo, Carolina Coimbra Vieira, Kiran Garimella, Pedro O. S. Vaz de Melo, and Fabrício Benevenuto Modeling Airport Congestion Contagion by SIS Epidemic Spreading on Airline Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385 Klemens Köstler, Rommy Gobardhan, Alberto Ceria, and Huijuan Wang A Population Dynamics Approach to Viral Marketing . . . . . . . . . . . . . . 399 Pedro C. Souto, Luísa V. Silva, Diego Costa Pinto, and Francisco C. Santos Integrating Environmental Temperature Conditions into the SIR Model for VectorBorne Diseases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
412 Md Arquam, Anurag Singh, and Hocine Cheriﬁ Opinion Diffusion in Competitive Environments: Relating Coverage and Speed of Diffusion . . . . . . . . . . . . . . . . . . . . . . . 425 Valeria Fionda and Gianluigi Greco
Beyond FactChecking: Network Analysis Tools for Monitoring Disinformation in Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436 Stefano Guarino, Noemi Trino, Alessandro Chessa, and Gianni Riotta Suppressing Information Diffusion via Link Blocking in Temporal Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448 XiuXiu Zhan, Alan Hanjalic, and Huijuan Wang Using Connected Accounts to Enhance Information Spread in Social Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459 Alon Sela, Orit CohenMilo, Eugene Kagan, Moti Zwilling, and Irad BenGal Designing Robust Interventions to Control Epidemic Outbreaks . . . . . . 469 Prathyush Sambaturu and Anil Vullikanti Dynamics on/of Networks The Impact of Network Degree Correlation on Parrondo’s Paradox . . . 483 Ye Ye, XiaoRong Hang, Lin Liu, Lu Wang, and Nenggang Xie Analysis of Diversity and Dynamics in Coevolution of Cooperation in Social Networking Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495 Yutaro Miura, Fujio Toriumi, and Toshiharu Sugawara Shannon Entropy in Time–Varying Clique Networks . . . . . . . . . . . . . . 507 Marcelo do Vale Cunha, Carlos César Ribeiro Santos, Marcelo Albano Moret, and Hernane Borges de Barros Pereira TwoMode Threshold Graph Dynamical Systems for Modeling Evacuation DecisionMaking During Disaster Events . . . . . . . . . . . . . . . 519 Naﬁsa Halim, Chris J. Kuhlman, Achla Marathe, Pallab Mozumder, and Anil Vullikanti Spectral Evolution of Twitter Mention Networks . . . . . . . . . . . . . . . . . . 532 Miguel Romero, Camilo Rocha, and Jorge Finke Network Models Minimum Entropy Stochastic Block Models Neglect Edge Distribution Heterogeneity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545 Louis Duvivier, Céline Robardet, and Rémy Cazabet ThreeParameter Kinetics of Selforganized Criticality on Twitter . . . . . 
556 Victor Dmitriev, Andrey Dmitriev, Svetlana Maltseva, and Stepan Balybin Multiparameters Model Selection for Network Inference . . . . . . . . . . . 566 Veronica Tozzo and Annalisa Barla
Scott: A Method for Representing Graphs as Rooted Trees for Graph Canonization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578 Nicolas Bloyet, PierreFrançois Marteau, and Emmanuel Frénod Cliques in HighDimensional Random Geometric Graphs . . . . . . . . . . . 591 Konstantin Avrachenkov and Andrei Bobu Universal Boolean Logic in Cascading Networks . . . . . . . . . . . . . . . . . . 601 Galen Wilkerson and Sotiris Moschoyiannis FitnessWeighted Preferential Attachment with Varying Number of New Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612 Juan Romero, Jorge Finke, and Andrés Salazar Rigid Graph Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621 Vikram Ravindra, Huda Nassar, David F. Gleich, and Ananth Grama Detecting Hotspots on Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633 Juan Campos and Jorge Finke Political Networks A Transparent Referendum Protocol with Immutable Proceedings and Veriﬁable Outcome for Trustless Networks . . . . . . . . . . . . . . . . . . . 647 Maximilian Schiedermeier, Omar Hasan, Lionel Brunie, Tobias Mayer, and Harald Kosch Utilizing Complex Networks for Event Detection in Heterogeneous HighVolume News Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659 Iraklis Moutidis and Hywel T. P. Williams Drawing Networks of Political Leaders: Global Affairs in The Economist’s KAL’s Cartoons . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673 Nikita Golubev and Alina V. Vladimirova Shielding and Shadowing: A Tale of Two Strategies for Opinion Control in the Voting Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682 Guillermo Romero Moreno, Long TranThanh, and Markus Brede Resilience and Control Stable and Uniform Resource Allocation Strategies for Network Processes Using Vertex Energy Gradients . . . . . . . . . . . . . . . . . . . . . . . 
697 Mikołaj Morzy and Tomi Wójtowicz Cascading Failures in Weighted Networks with the Harmonic Closeness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 709 Yucheng Hao, Limin Jia, and Yanhui Wang
Learning to Control Random Boolean Networks: A Deep Reinforcement Learning Approach . . . . . . . . . . . . . . . . . . . . . . 721 Georgios Papagiannis and Sotiris Moschoyiannis Comparative Network Robustness Evaluation of Link Attacks . . . . . . . 735 Clara Pizzuti, Annalisa Socievole, and Piet Van Mieghem MAC: Multilevel Autonomous Clustering for Topologically Distributed Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747 M. A. Partha and C. V. Ponce Network Strengthening Against Malicious Attacks . . . . . . . . . . . . . . . . . 761 Qingnan Rong, Jun Zhang, Xiaoqian Sun, and Sebastian Wandelt Identifying Vulnerable Nodes to Cascading Failures: OptimizationBased Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 773 Richard J. La Ensemble Approach for Generalized Network Dismantling . . . . . . . . . . 783 XiaoLong Ren and Nino AntulovFantulin Machine Learning and Networks A Simple Approach to Attributed Graph Embedding via Enhanced Autoencoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797 Nasrullah Sheikh, Zekarias T. Kefato, and Alberto Montresor Matching Node Embeddings Using Valid Assignment Kernels . . . . . . . . 810 Changmin Wu, Giannis Nikolentzos, and Michalis Vazirgiannis Short Text Tagging Using Nested Stochastic Block Model: A Yelp Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822 John Bowllan, Kailey Cozart, Seyed Mohammad Mahdi Seyednezhad, Anthony Smith, and Ronaldo Menezes DomainInvariant Latent Representation Discovers Roles . . . . . . . . . . . 834 Shumpei Kikuta, Fujio Toriumi, Mao Nishiguchi, Tomoki Fukuma, Takanori Nishida, and Shohei Usui Inductive Representation Learning on Feature Rich Complex Networks for Churn Prediction in Telco . . . . . . . . . . . . . . . . . . . . . . . . 
845 María Óskarsdóttir, Sander Cornette, Floris Deseure, and Bart Baesens On Inferring Monthly Expenses of Social Media Users: Towards Data and Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854 Danila Vaganov, Alexander Kalinin, and Klavdiya Bochenina Evaluating the Community Structures from Network Images Using Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 866 Md. Khaledur Rahman and Ariful Azad
GumbelSoftmax Optimization: A Simple General Framework for Combinatorial Optimization Problems on Graphs . . . . . . . . . . . . . . 879 Jing Liu, Fei Gao, and Jiang Zhang TemporalNode2vec: Temporal Node Embedding in Temporal Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 891 Mounir Haddad, Cécile Bothorel, Philippe Lenca, and Dominique Bedart Deep Reinforcement Learning for TaskDriven Discovery of Incomplete Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903 Peter Morales, Rajmonda Sulo Caceres, and Tina EliassiRad Evaluating Network Embedding Models for Machine Learning Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 915 Ikenna Oluigbo, Mohammed Haddad, and Hamida Seba A BERTBased Transfer Learning Approach for Hate Speech Detection in Online Social Media . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 928 Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi Network Geometry A Simple Differential Geometry for Networks and Its Generalizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 943 Emil Saucan, Areejit Samal, and Jürgen Jost Characterizing Distances of Networks on the Tensor Manifold . . . . . . . 955 Bipul Islam, Ji Liu, and Romeil Sandhu Eigenvalues and Spectral Dimension of Random Geometric Graphs in Thermodynamic Regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965 Konstantin Avrachenkov, Laura Cottatellucci, and Mounia Hamidouche Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 977
Link Analysis and Ranking
LinkAUC: Unsupervised Evaluation of Multiple Network Node Ranks Using Link Prediction
Emmanouil Krasanakis, Symeon Papadopoulos, and Yiannis Kompatsiaris
CERTH-ITI, Thessaloniki, Greece
{maniospas,papadop,ikom}@iti.gr
Abstract. An emerging problem in network analysis is ranking network nodes based on their relevance to metadata groups that share attributes of interest, for example in the context of recommender systems or node discovery services. For this task, it is important to evaluate ranking algorithms and parameters and select the ones most suited to each network. Unfortunately, large real-world networks often comprise sparsely labelled nodes that hinder supervised evaluation, whereas unsupervised measures of community quality, such as density and conductance, favor structural characteristics that may not be indicative of metadata group quality. In this work, we introduce LinkAUC, a new unsupervised approach that evaluates network node ranks of multiple metadata groups by measuring how well they predict network edges. We explain that this accounts for relation knowledge encapsulated in known members of metadata groups and show that it enriches density-based evaluation. Experiments on one synthetic and two real-world networks indicate that LinkAUC agrees with AUC and NDCG for comparing ranking algorithms more than other unsupervised measures.
1 Introduction
It is well known that network nodes can be organized into communities [1–4] identified through either ground truth structural characteristics or shared node attributes [5–7]. A common task in network analysis is to rank all network nodes based on their relevance to such communities, especially those of the second type [8,9], which are commonly referred to as metadata groups. Ranking nodes is particularly important in large social networks, where metadata group boundaries can be vague [10,11]. Node ranks can also be used by recommender systems that combine them with other characteristics, in which case it is important that the ranks be of high quality across the whole network. Some of the most well-known algorithms that discover communities with only a few known members also rely on ranking mechanisms and work by thresholding their outcome [12,13].
Node ranks for metadata groups are a form of recommendation and their quality is usually (e.g. in [14]) evaluated with well-known recommender system measures [15–17], such as AUC and NDCG. Since calculating these measures requires knowledge of node labels, the efficacy of ranking algorithms needs to be demonstrated on labeled networks, such as those of the SNAP repository¹. However, different algorithms and parameters are more suited to different networks, for example based on how well their assumptions match structural or metadata characteristics. At the same time, large real-world networks are often sparsely labeled, prohibiting supervised evaluation. In such cases, there is a need to evaluate ranking algorithms on the network at hand using unsupervised procedures.
A first take on unsupervised evaluation would be to generalize traditional structural community measures, such as density [18], modularity [19] and conductance [20], to support ranks. However, these measures are designed with structural ground truth communities in mind and often fail to assess hierarchical dependencies or other mesoscale (instead of local) features [6,11,21] that may characterize metadata groups. To circumvent this problem, we propose utilizing the network's structure and the existence of multiple metadata groups; under the assumption that network edges are influenced by node metadata similarity [6], a phenomenon known as homophily in social networks [22], we assess the quality of ranks for multiple metadata groups based on their ability to predict network edges. We show that this practice enriches density-based evaluation and that it agrees with supervised measures better than other unsupervised ones.
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 3–14, 2020. https://doi.org/10.1007/978-3-030-36687-2_1
2 LinkAUC
The main idea behind our approach is that, if there is little information to help evaluate node ranks, we can evaluate other related structural characteristics instead. To this end, we propose using node rank distributions across metadata groups to derive link ranks between nodes. Link ranks can in turn be evaluated through their ability to predict the network's edges. An overview of the proposed scheme is shown in Fig. 1. In this section, we first justify why we expect node rank quality to follow link rank quality (Subsect. 2.1) and formally describe the evaluation process of the latter using AUC (Subsect. 2.2). We then show that link rank quality enriches density-based evaluation (Subsect. 2.3).
Fig. 1. Proposed scheme for evaluating ranking algorithms. Lighter colored nodes have higher ranks and lighter colored edges have lower link ranks.
¹ https://snap.stanford.edu/data/
2.1 Link Ranks
Let $r_i$ be vectors whose elements $r_{ij}$ estimate the relevance of network nodes $j$ to metadata groups $i = 1, \dots, n$. Motivated by latent factor models for link prediction [23] and collaborative filtering [24], we consider $R = [r_1 \dots r_n]$ a matrix factorization of the network. Its rows $R_j = [r_{1j} \dots r_{nj}]$ represent the distribution of ranks of network nodes $j$ across metadata groups. Following the principles of previous link prediction works [25,26], if network construction is influenced predominantly by structure-based and metadata-based characteristics, this factorization can help predict network edges by linking nodes with similar rank distributions. We calculate the similarities of rank distributions between nodes $j, k$ using the dot product² as $\hat{M}_{jk} = R_j \cdot R_k$. These form a matrix of link ranks:
$$\hat{M} = R R^T \tag{1}$$
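As a minimal illustration of Eq. (1), the sketch below builds link ranks from a small rank matrix with NumPy. The rank values and the helper name `link_ranks` are illustrative assumptions and do not come from the paper; rows of `R` are nodes and columns are metadata groups.

```python
import numpy as np

def link_ranks(R):
    """Return the link rank matrix R R^T for a (nodes x groups) rank matrix R."""
    R = np.asarray(R, dtype=float)
    return R @ R.T

# Hypothetical ranks of 4 nodes for 2 metadata groups.
R = np.array([
    [0.9, 0.0],
    [0.8, 0.1],
    [0.1, 0.7],
    [0.0, 0.9],
])
M_hat = link_ranks(R)

# Entry (j, k) is the dot product of the rank distributions of nodes j and k.
print(M_hat.round(2))
```

In this toy setting, nodes 0 and 1 share a dominant group, so their link rank (0.72) exceeds the link rank between nodes 0 and 2 (0.09).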
Accurate link prediction using link ranks implies good metadata group representations. To empirically understand this claim, let us consider ranking algorithms that can be expressed as network filters $f(M) = \sum_{n=0}^{\infty} a_n M^n$ [27] of the network's adjacency matrix $M$, where $a_n$ are the weights placed on random walks of length $n$. For example, personalized PageRank and Heat Kernels arise from exponentially degrading weights and the Taylor expansion coefficients of an exponential function respectively. If applied on query vectors $q_i$, where $q_{ij}$ are proportional to probabilities that nodes $j$ belong to metadata groups $i$, network filters produce ranks $r_i = f(M) q_i$ of how much nodes pertain to the metadata groups. Organizing multiple queries into a matrix $Q = [q_1 \dots q_n]$:
$$R = f(M) Q \Rightarrow \hat{M} = f(M) Q Q^T f^T(M) \tag{2}$$
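The filter view of Eq. (2) can be sketched numerically as follows. Several choices here are assumptions made only for illustration: the series is truncated at 50 terms, the adjacency matrix is column-normalized so that the series converges, and the weights $(1-\alpha)\alpha^n$ and $e^{-t} t^n / n!$ are common parametrizations of personalized PageRank and heat kernels rather than values taken from the paper. All function names are hypothetical.

```python
import numpy as np
from math import exp, factorial

def network_filter(W, weights):
    """Truncated network filter f(W) = sum_n weights[n] * W^n."""
    result = np.zeros_like(W, dtype=float)
    power = np.eye(W.shape[0])
    for a_n in weights:
        result += a_n * power
        power = power @ W
    return result

def pagerank_weights(alpha=0.85, terms=50):
    """Exponentially degrading weights (personalized PageRank)."""
    return [(1 - alpha) * alpha**n for n in range(terms)]

def heat_kernel_weights(t=3.0, terms=50):
    """Taylor coefficients of an exponential function (heat kernel)."""
    return [exp(-t) * t**n / factorial(n) for n in range(terms)]

# Path graph 0-1-2-3 with one seed (query) node per metadata group.
M = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = M / M.sum(axis=0, keepdims=True)  # column normalization (assumption)
Q = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)

F = network_filter(W, pagerank_weights())
R = F @ Q                      # node ranks, one column per group
M_hat = F @ Q @ Q.T @ F.T      # link ranks as in Eq. (2)
```

Both routes to the link ranks coincide: `M_hat` equals `R @ R.T`, so Eq. (2) is Eq. (1) with the ranks written out as filtered queries.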
Equation (2) is a quadratic form of $f(M)$ around the kernel $Q Q^T$ and, as such, propagates link ranks between queries to the rest of the link candidates. Therefore, if queries adequately predict the links between involved query nodes and link ranks can predict the network's edges, then the algorithm with filter $f(M)$ is a good rank propagation mechanism. At best, queries form an orthonormal basis of ranks $Q Q^T = I$ and this process can express any symmetric link prediction filter [25,26,28] by decomposing it to $f(M) f^T(M)$.
2.2 Link Rank Evaluation Using AUC
When evaluating link ranks, it is often desirable to exclude certain links, such as withheld test edges or those absent due to systemic reasons (e.g. users may not be allowed to befriend themselves in social networks). To model this, we devise the notion of a network group that uses a binary matrix $\mathcal{M}$ to remove non-comparable links of the network's adjacency matrix $M$ by projecting the latter to $M \odot \mathcal{M}$, where $\odot$ is the Hadamard product performing element-wise multiplication. For example, network groups of zero-diagonal networks correspond to $\mathcal{M} = \mathbf{1} - I$, where $\mathbf{1}$ are matrices of ones and $I$ identity matrices.
² Cosine similarity would arise by a fixed-flow assumption of the ranking algorithm that performs row-wise normalization of $R$ before the dot product.
To help evaluate link ranks $\hat{M}$ against a network with adjacency matrix $M$ within a network group $\mathcal{M}$, we introduce a transformation $\mathrm{vec}_{\mathcal{M}}(M)$ that creates a vector containing all elements of $M$ for which $\mathcal{M} \neq 0$ in a predetermined order. Then, link ranks can be evaluated by comparing $\mathrm{vec}_{\mathcal{M}}(\hat{M})$ with $\mathrm{vec}_{\mathcal{M}}(M)$.
A robust measure that compares operating characteristic trade-offs at different decision thresholds is the Area Under Curve (AUC) [29], which has been previously used to evaluate link ranks [26]. When network edges are not weighted, if $TPR(\theta)$ and $FPR(\theta)$ are the true positive and false positive rates of a decision threshold $\theta$ on $\mathrm{vec}_{\mathcal{M}}(\hat{M})$ predicting $\mathrm{vec}_{\mathcal{M}}(M)$, the AUC of link ranks becomes:
$$LinkAUC = \int_{-\infty}^{\infty} TPR(\theta)\, FPR'(\theta)\, d\theta \tag{3}$$
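A minimal sketch of this evaluation is given below, assuming unweighted edges. Instead of integrating over thresholds, it uses the equivalent rank-statistic form of AUC: the fraction of (edge, non-edge) pairs in which the edge receives the higher link rank, with ties counted as half. The toy graph, the two rank matrices and the function names `vec` and `link_auc` are illustrative assumptions.

```python
import numpy as np

def vec(A, group):
    """Elements of A where the network group matrix is nonzero (row-major order)."""
    return np.asarray(A, dtype=float)[np.asarray(group) != 0]

def link_auc(M, M_hat, group):
    """AUC of link ranks M_hat predicting unweighted edges M inside a network group."""
    labels = vec(M, group) > 0
    scores = vec(M_hat, group)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one edge and one non-edge in the group")
    # Fraction of (edge, non-edge) pairs ranked correctly, ties counted as half.
    # Quadratic in memory, which is acceptable for toy graphs only.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
M = np.zeros((6, 6))
for a, b in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    M[a, b] = M[b, a] = 1
group = 1 - np.eye(6)  # zero-diagonal network group: self-links are ignored

R_good = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)  # ranks follow the triangles
R_bad = np.array([[1, 0], [0, 1]] * 3, dtype=float)          # ranks alternate between nodes

auc_good = link_auc(M, R_good @ R_good.T, group)
auc_bad = link_auc(M, R_bad @ R_bad.T, group)
```

Ranks that follow the two triangles score 13/14, while ranks that alternate between nodes score 11/28, so the measure prefers the ranking that matches the edge structure. For large networks, a sorting-based AUC routine would be a more practical choice than the pairwise comparison used here.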
LinkAUC evaluates whether actual linkage is assigned higher ranks across the network [30] without being affected by edge sparsity. These properties make LinkAUC preferable to precision-based evaluation of link ranks, which assesses the correctness of only a fixed number of top predictions [26].
2.3 Relation to Rank Density
The density of a network is defined as the portion of edges compared to the maximum number of possible ones [31,32]. Using the notion of volume $vol(M)$ to annotate the number of edges in a network with adjacency matrix $M$, the density of its projection inside the network group $\mathcal{M}$ becomes $D_{\mathcal{M}}(M) = \frac{vol(M \odot \mathcal{M})}{vol(\mathcal{M})}$. We similarly define rank density by substituting the volume with the expected volume $vol(M, r)$ of the fuzzy set of subgraphs arising from ranks being proportional to the probabilities that nodes are involved in links:

$$vol(M, r) = E_{v \sim r}\left[v^T M v\right] = \frac{r^T M r}{\|r\|_1^2} \;\Rightarrow\; D_{\mathcal{M}}(M, r) = \frac{vol(M \odot \mathcal{M}, r)}{vol(\mathcal{M}, r)} = \frac{r^T (M \odot \mathcal{M}) r}{r^T \mathcal{M} r} \quad (4)$$

where $\|\cdot\|_1$ is the L1 norm, calculated as the sum of vector elements, and $v$ are binary vectors of vertices sampled with probabilities $r$. We first examine the qualitative relation between link ranks and rank density for a single metadata group $R = r_1$. Annotating as $m \geq \theta$ the vectors arising from binary thresholding on the elements of $m = \frac{vec_{\mathcal{M}}(\hat{M})}{\|vec_{\mathcal{M}}(\hat{M})\|_1}$ and selecting thresholds $\theta[k]$ that determine the top-$k$ link ranks up to all $K$ link candidates ($\theta[K] = 0$):

$$m = \sum_{k=1}^{K-1} (m \geq \theta[k])(\theta[k] - \theta[k+1]) \;\Rightarrow\; D_{\mathcal{M}}(M, r_1) = \frac{vec_{\mathcal{M}}^T(\hat{M})\, vec_{\mathcal{M}}(M)}{\|vec_{\mathcal{M}}(\hat{M})\|_1} = \int_{-\infty}^{\infty} TP(\theta)\, P(\theta)\, d\theta$$
where $TP$ and $P$ denote the number of true positive and positive thresholded link ranks respectively. At worst, every new positive link after a certain point would be a false positive. Using big-O notation, this can be written as $\frac{\partial FPR(\theta)}{\partial P(\theta)} \in O(1)$ and hence:

$$LinkAUC \in O\big(D_{\mathcal{M}}(M, r_1)\big) \quad (5)$$

We next consider the case where discovered ranks form non-overlapping metadata groups, i.e. each node has non-zero rank only for one group. This may happen when query propagation stops before it reaches other metadata groups. Annotating $\hat{M}_i = r_i r_i^T$, for non-overlapping ranks $r_i \cdot r_j = 0$ for $i \neq j$, we rewrite (1) as $\hat{M} = \sum_i \hat{M}_i \Rightarrow vec_{\mathcal{M}}(\hat{M}) = \sum_i vec_{\mathcal{M}}(\hat{M}_i)$ and obtain, similarly to before:

$$LinkAUC \in O\Big(\sum_i D_{\mathcal{M}}(M, r_i)\, vol(\mathcal{M}, r_i)\, \|r_i\|_1^2\Big)$$

This averages group densities and weights them by $vol(\mathcal{M}, r_i)\|r_i\|_1^2$. Hence, when metadata groups are non-overlapping, high LinkAUC indicates high rank density. Finally, for overlapping metadata groups, LinkAUC involves inter-group links in its evaluation. Since averaging density-based evaluations across groups ignores these links, LinkAUC can be considered an enrichment of rank density in the sense that it bounds it when metadata groups do not overlap but accounts for more information when they do.
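The rank density of Eq. (4) reduces to a ratio of two quadratic forms, which the following minimal sketch evaluates directly. The helper names `quadratic_form` and `rank_density`, as well as the dense list-of-lists representation, are assumptions chosen for this illustration.

```python
def quadratic_form(r, matrix):
    """Compute r^T * matrix * r for a square list-of-lists matrix."""
    n = len(r)
    return sum(r[i] * matrix[i][j] * r[j] for i in range(n) for j in range(n))


def rank_density(adjacency, group, r):
    """Rank density r^T (M ⊙ group) r / (r^T group r), following Eq. (4)."""
    n = len(r)
    # Hadamard product of the adjacency matrix with the network-group mask.
    masked = [[adjacency[i][j] * group[i][j] for j in range(n)] for i in range(n)]
    return quadratic_form(r, masked) / quadratic_form(r, group)


# Path graph 0-1-2 with a zero-diagonal network group (toy data).
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
no_self_loops = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
print(rank_density(path, no_self_loops, [1, 1, 1]))  # whole path
print(rank_density(path, no_self_loops, [1, 1, 0]))  # only the edge 0-1
```

With binary ranks the value coincides with the ordinary density of the selected subgraph, which is a convenient sanity check before moving to fractional ranks.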
3 Experiments
To assess the merit of evaluating node ranks using LinkAUC, we devise a series of experiments where we test a number of different algorithms on several ranking tasks of varying degrees of difficulty across labeled networks. We use the ranks produced by these experiments to compare various unsupervised measures with supervised ones; the latter form the ground truth that unsupervised measures need to reproduce, but would not be computable if node labels were sparse or missing. For every network, we start with known binary vectors $c_i$, whose elements $c_{ij}$ show whether nodes $j$ are members of metadata groups $i$. We use a uniform sampling process $U$ to withhold a small set of evaluation nodes $eval_i \sim U(c_i, 1\%)$ and edges $(eval_i \times eval_i) \odot M$ that, by merit of their small number, do not significantly affect the ranking algorithm outcomes. We also procure varied-length query vectors $q_i \sim U(c_i - eval_i, f)$ that serve as inputs to the ranking algorithms, where their relative size compared to the group is selected amongst $f \in \{0.1\%, 1\%, 10\%\}$. Depending on whether query nodes are adequately many or too few, we expect algorithms to encounter low and high difficulty respectively.
3.1 Networks
Experiments are conducted on three networks: a synthetic one constructed through a stochastic block model [33] and two real-world ones often used to evaluate metadata group detection, the Amazon co-purchasing [34] and the DBLP author co-authorship networks. These networks were selected on merit of being fully labeled, hence enabling supervised evaluation to serve as ground truth. They also comprise multiple metadata groups and unweighted edges needed for LinkAUC. The stochastic block model is a popular method to construct networks of known communities [35,36], where the probability of two nodes being linked is determined by which communities they belong to. Our synthetic network uses the randomly generated 5 × 5 block probability matrix of Fig. 2 with blocks of 2K-5K nodes. The Amazon network comprises links between frequently co-purchased products³ that form communities based on their type (e.g. Book, CD, DVD, Video). We use the 2011 version of the DBLP dataset⁴, which comprises 1.6M papers from the DBLP database, from which we extracted an author network based on co-authorship relations. In this network, authors form overlapping metadata groups based on academic venues (journals, conferences) they have published in. To experiment with smaller portions of query nodes and limit the running time of experiments, we select only the metadata groups with ≥5K nodes for the real-world networks. A summary of these is presented in Table 1.
Table 1. Networks and the number of metadata groups used in experiments.
Network       Nodes   Edges   Groups
Synthetic     15K     0.4M    5
Amazon [34]   0.5M    1.8M    4
DBLP [37]     1.0M    11.3M   52
Fig. 2. Stochastic block model used to create the synthetic network.
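To make the construction of the synthetic network concrete, the sketch below samples an undirected stochastic block model in which each pair of nodes is linked with a probability that depends only on the blocks of its endpoints. The block sizes and the probability matrix are hypothetical placeholders; the actual 5 × 5 matrix of Fig. 2 is not reproduced in the text, and the function name `sample_sbm` is likewise an assumption.

```python
import random


def sample_sbm(block_sizes, block_probs, seed=0):
    """Sample an undirected stochastic block model.

    block_sizes[b] is the number of nodes in block b and block_probs[a][b]
    the probability of an edge between a node of block a and one of block b.
    Returns the block label of every node and the list of sampled edges.
    """
    rng = random.Random(seed)
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if rng.random() < block_probs[labels[i]][labels[j]]:
                edges.append((i, j))
    return labels, edges


# Hypothetical two-block example with dense blocks and sparse links between them.
labels, edges = sample_sbm([30, 20], [[0.5, 0.02], [0.02, 0.4]], seed=1)
within = sum(1 for i, j in edges if labels[i] == labels[j])
print(len(edges), within)
```

Testing every pair costs quadratic time, which is acceptable for small illustrations; for blocks of thousands of nodes a sparse sampling scheme would be preferable.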
3.2 Ranking Algorithms
We use both heuristic and established algorithms to rank the relation of network nodes to metadata groups. Our goal is not to select the best algorithm but to obtain ranks with many different methods and then use these ranks to compute the evaluation measures to be compared. The considered algorithms are:
PPR [12,38]. Personalized PageRank with symmetric matrix Laplacian normalization arising from a random walk with restart strategy. It iterates $r_i \leftarrow a D^{-1/2} M D^{-1/2} r_i + (1 - a) q_i$, where $D$ is the diagonal matrix of node degrees. Throughout our experiments, we select the well-performing parameter $a = 0.99$.
PPR+Inflation [13]. Adds all neighbors of the original query nodes to the query to further spread PPR.
PPR+Oversampling [39]. Adds nodes with high PPR ranks to the query vector before rerunning the algorithm.
HK [40]. Heat Kernel ranks obtained through an exponential degradation filter $r_i = e^{-t} \sum_{k=0}^{N} \frac{t^k}{k!} (D^{-1/2} M D^{-1/2})^k q_i$. This places higher weights on shorter paths instead of uniformly spreading them across longer random walks. Hence, it discovers denser local structures at the cost of not spreading ranks too much. We selected $t = 5$ and stopped iterations when $(D^{-1/2} M D^{-1/2})^k q_i$ converged.
HPR. A heuristic adaptation of PPR that borrows assumptions of heat kernels to place emphasis on short random walks: $r_i \leftarrow \frac{t}{k} a (D^{-1/2} M D^{-1/2} - I) r_i + (1 - a) q_i$, where $k$ is the current iteration, $t = 5$ and $a = 0.99$.
3 https://snap.stanford.edu/data/amazonmeta.html.
4 DBLP-Citation-network V4 from https://aminer.org/citation.
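The PPR iteration and the HK series described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: matrices are dense lists of lists, the function names are made up for the example, PPR stops when successive iterates differ by less than a tolerance, and the HK series is truncated once its terms become negligible (a stopping rule that may differ from the one used in the experiments).

```python
import math


def normalize(adj):
    """Symmetric normalization D^{-1/2} M D^{-1/2} of an adjacency matrix."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def ppr(adj, q, a=0.99, tol=1e-9, max_iter=10000):
    """Iterate r <- a * S r + (1 - a) * q with S the normalized adjacency."""
    S = normalize(adj)
    r = list(q)
    for _ in range(max_iter):
        new = [a * s + (1 - a) * qi for s, qi in zip(matvec(S, r), q)]
        if max(abs(x - y) for x, y in zip(new, r)) < tol:
            return new
        r = new
    return r


def heat_kernel(adj, q, t=5.0, tol=1e-12, max_terms=200):
    """Truncated series e^{-t} * sum_k (t^k / k!) S^k q."""
    S = normalize(adj)
    term = list(q)   # holds (t^k / k!) S^k q, starting at k = 0
    total = list(q)
    for k in range(1, max_terms):
        term = [t / k * x for x in matvec(S, term)]
        total = [x + y for x, y in zip(total, term)]
        if max(abs(x) for x in term) < tol:
            break
    return [math.exp(-t) * x for x in total]


# Toy network with two disconnected edges; the query is node 0.
adj = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
print(ppr(adj, [1, 0, 0, 0], a=0.5))
print(heat_kernel(adj, [1, 0, 0, 0]))
```

In both cases the ranks stay inside the component of the query node, which is the behavior that later produces non-overlapping metadata groups when propagation does not reach other groups.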
3.3 Measures
The following measures are calculated for the node ranks of metadata groups produced in each experiment. Recall that, when network labels are sparse, supervised measures that serve as the ground truth of evaluation may be inapplicable. Unsupervised measures other than LinkAUC are computed on the training edges, as the sparsity of withheld group members $eval_i$ does not allow meaningful structural scores. LinkAUC, on the other hand, is applicable regardless of the evaluation edge set's sparsity. To avoid data overlap between rank calculation and evaluation, which could overestimate the latter, supervised measures and LinkAUC use only the test group members and edges.
Unsupervised Measures
Conductance: Compares the probability of a random walk to move outside a community vs. to return to it [41]. Using the same probabilistic formulation as for rank density we define rank conductance:

$$\phi_{\mathcal{M}}(M, r) = \frac{r^T (M \odot \mathcal{M})(C - r)}{r^T (M \odot \mathcal{M}) r} \quad (6)$$

where $C = 1$ is a max-probability parameter. (Comparisons are preserved for any value.) Lower conductance indicates better community separation.
Gap Conductance: Conductance of binarily cutting the network on the maximal percentage gap between $\frac{r_{ij}}{degree(j)}$ for each community $i$ [42,43]. We use this as an alternative to sweeping strategies [12,13], which took too long to run.
Density: The rank-based extension of density in (4).
LinkAUC: AUC of link ranks calculated through (1), where columns are divided with their maximal value and then each node's row representation is L2-normalized, making link ranks represent cosine similarity between edge nodes. This is our proposed unsupervised measure.
Supervised Measures (Ground Truth)
NodeAUC: AUC of node ranks, averaged across metadata groups $i$.
NDCG: Normalized discounted cumulative gain across all network nodes. For this non-parametric statistic, ranks derive ordinalities $ord[j]$ for nodes $j$ (i.e. the highest ranked node is assigned $ord[j] = 1$). For each metadata group $i$, assigning to nodes $j$ relevance scores of 1 if they belong to it and 0 otherwise:

$$NDCG_i = \frac{\sum_{j: j \in eval_i} 1/\log_2(ord[j] + 1)}{\sum_{c=1}^{|eval_i|} 1/\log_2(c + 1)} \quad (7)$$

NDCG is usually used to evaluate whether a fixed top-k nodes are relevant to the metadata group. However, we are interested in evaluating the relevant nodes of the whole network and hence we make this measure span all nodes. This makes it similar to AUC in that values closer to 1 indicate that metadata group members are ranked as more relevant to the group compared to non-group members. Its main difference is that more emphasis is placed on the top discoveries.
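Two of the closed-form measures of this section, rank conductance (6) and the whole-network NDCG (7), can be sketched as follows. The function names, the dense matrices, and the tie handling (ties keep their original node order) are assumptions made for the example; the max-probability parameter is taken as $C = 1$ as in the text.

```python
import math


def rank_conductance(adjacency, group, r, C=1.0):
    """r^T (M ⊙ group)(C - r) / (r^T (M ⊙ group) r), following Eq. (6)."""
    n = len(r)
    masked = [[adjacency[i][j] * group[i][j] for j in range(n)] for i in range(n)]
    outside = sum(r[i] * masked[i][j] * (C - r[j]) for i in range(n) for j in range(n))
    inside = sum(r[i] * masked[i][j] * r[j] for i in range(n) for j in range(n))
    return outside / inside


def ndcg(ranks, eval_nodes):
    """NDCG over all nodes with binary relevance, following Eq. (7)."""
    # ord[j] = 1 for the highest ranked node, 2 for the next one, and so on.
    order = sorted(range(len(ranks)), key=lambda j: -ranks[j])
    ordinal = {node: position + 1 for position, node in enumerate(order)}
    gain = sum(1.0 / math.log2(ordinal[j] + 1) for j in eval_nodes)
    ideal = sum(1.0 / math.log2(c + 1) for c in range(1, len(eval_nodes) + 1))
    return gain / ideal


# Path graph 0-1-2-3 with a zero-diagonal network group (toy data).
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
mask = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(rank_conductance(path, mask, [1, 1, 0, 0]))
print(ndcg([0.9, 0.1, 0.8, 0.2], [0, 2]))
print(ndcg([0.9, 0.1, 0.8, 0.2], [1, 3]))
```

The NDCG equals 1 exactly when the evaluation nodes occupy the top positions of the ranking, and drops below 1 as they move down, with the largest penalty for misplacing the top discoveries.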
3.4 Results
In Fig. 3 we present the outcome of evaluating different algorithms on the various experiment setups, i.e. tuples of networks, seed node fractions and ranking algorithms. Each point corresponds to a different pair of unsupervised (vertical axes) and supervised (horizontal axes) measures calculated for a different experiment setup (i.e. combination of seed node sizes and ranking algorithms) and is obtained by averaging the measures across 5 repetitions of the setup. Unsupervised measures are considered to yield descriptive evaluations when they correlate to supervised ones for the same network (each network is involved in 15 experiment setups arising from the combination of $|f| = 3$ different seed node sizes with one of the 5 different ranking algorithms). We can see that LinkAUC is the unsupervised measure whose behavior most closely resembles that of the supervised ones. In particular, Table 2 shows that LinkAUC has a strong positive correlation with NodeAUC and a positive correlation with NDCG for all three networks, outperforming the other metrics in all but one experiment. To make sure that these findings cannot be attributed to non-linear relations with other measures, we confirm them using both Pearson and Spearman correlation, where the latter is a non-parametric metric that compares the ordinality of measure outcomes. The slightly weaker correlation of LinkAUC with NDCG can be attributed to the latter's tendency to place more emphasis on the top predictions, which makes it overstate rank quality compared to AUC when the rest of the ranks are inaccurate. Looking at the other unsupervised measures, fuzzy definitions of conductance and density sometimes degrade for higher NodeAUC values. This can be attributed to these metrics measuring local-scale features, which are not always a good indication of the quality of larger metadata groups. It must be noted that gap conductance also exhibits strong correlation with the supervised measures on the real-world networks. However, especially for the synthetic network, it frequently assumes a value of 1 that reflects its inability to discover clear-cut boundaries. This casts doubt on the validity of using it for evaluating ranks in new networks, since similar structural deficiencies can render it uninformative.
[Figure 3: scatter plots of the unsupervised measures Conductance, Gap Conductance, Density, and LinkAUC (vertical axes) against the supervised measures NodeAUC and NDCG (horizontal axes), with separate point series for the Synth, Amazon, and DBLP networks.]
Fig. 3. Scatter plots and least square lines of unsupervised vs. supervised measures. Each point corresponds to a diﬀerent experiment setup.
Table 2. Correlations between unsupervised and supervised measures. The strongest correlations for each dataset are bolded.
Pearson Correlation | Spearman Correlation
Synth Amazon DBLP | Synth Amazon DBLP
With NodeAUC
Conductance
−27%
14%
25%
1%
Gap Cond/nce −28%
−67% −70% −23%
Density
58%
−22% −59%
LinkAUC
84%
92%
55%
84%
85%
40%
1%
24%
5%
−72% −88% 6% −45% 95%
90%
−3%
−9%
With NDCG Conductance
−23%
−26%
Gap Cond/nce −19%
−69% −71% −19%
−68% −85%
Density
38%
−63% −74%
−21% −75%
LinkAUC
56%
65%
84%
45% 71%
85%
88%
4 Conclusions and Future Work
In this work we proposed a new unsupervised procedure that evaluates node ranks of multiple metadata groups based on how well they predict network edges. We explained the intuitive motivation behind this approach and experimentally showed that it closely follows supervised rank evaluation across a number of different experiments, many of which are inadequately evaluated by other unsupervised community quality measures. Based on our findings, our approach can be a better alternative to existing rank evaluation strategies in unlabeled networks whose metadata propagation mechanisms are unknown. This indicates that network structure and awareness of multiple metadata groups are two promising types of ground truth that can help evaluate metadata group ranks. In the future, we are interested in performing experiments across more networks and comparing our approach with additional unsupervised measures.
Acknowledgements. This work was partially funded by the European Commission under contract numbers H2020-761634 FuturePulse and H2020-825585 HELIOS.
References
1. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. 659, 1–44 (2016)
2. Leskovec, J., Lang, K.J., Mahoney, M.: Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th International Conference on World Wide Web, pp. 631–640. ACM (2010)
3. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput. Surv. (CSUR) 45(4), 43 (2013)
4. Papadopoulos, S., Kompatsiaris, Y., Vakali, A., Spyridonos, P.: Community detection in social media. Data Min. Knowl. Discov. 24(3), 515–554 (2012)
5. Hric, D., Darst, R.K., Fortunato, S.: Community detection in networks: structural communities versus ground truth. Phys. Rev. E 90(6), 062805 (2014)
6. Hric, D., Peixoto, T.P., Fortunato, S.: Network structure, metadata, and the prediction of missing nodes and annotations. Phys. Rev. X 6(3), 031038 (2016)
7. Peel, L., Larremore, D.B., Clauset, A.: The ground truth about metadata and community detection in networks. Sci. Adv. 3(5), e1602548 (2017)
8. Perer, A., Shneiderman, B.: Balancing systematic and flexible exploration of social networks. IEEE Trans. Visual Comput. Graphics 12(5), 693–700 (2006)
9. De Domenico, M., Solé-Ribalta, A., Omodei, E., Gómez, S., Arenas, A.: Ranking in interconnected multilayer networks reveals versatile nodes. Nat. Commun. 6, 6868 (2015)
10. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: natural cluster sizes and the absence of large well-defined clusters. Internet Math. 6(1), 29–123 (2009)
11. Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 11(3), 033015 (2009)
12. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 475–486. IEEE (2006)
13. Whang, J.J., Gleich, D.F., Dhillon, I.S.: Overlapping community detection using neighborhood-inflated seed expansion. IEEE Trans. Knowl. Data Eng. 28(5), 1272–1284 (2016)
14. Hsu, C.C., Lai, Y.A., Chen, W.H., Feng, M.H., Lin, S.D.: Unsupervised ranking using graph structures and node attributes. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 771–779. ACM (2017)
15. Shani, G., Gunawardana, A.: Evaluating recommendation systems. In: Recommender Systems Handbook, pp. 257–297. Springer (2011)
16. Wang, Y., Wang, L., Li, Y., He, D., Chen, W., Liu, T.Y.: A theoretical analysis of NDCG ranking measures. In: Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013), vol. 8, p. 6 (2013)
17. Isinkaye, F., Folajimi, Y., Ojokoh, B.: Recommendation systems: principles, methods and evaluation. Egypt. Inform. J. 16(3), 261–273 (2015)
18. Kowalik, Ł.: Approximation scheme for lowest outdegree orientation and graph density measures. In: International Symposium on Algorithms and Computation, pp. 557–566. Springer (2006)
19. Newman, M.E.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)
20. Chalupa, D.: A memetic algorithm for the minimum conductance graph partitioning problem. arXiv preprint arXiv:1704.02854 (2017)
21. Jeub, L.G., Balachandran, P., Porter, M.A., Mucha, P.J., Mahoney, M.W.: Think locally, act locally: detection of small, medium-sized, and large communities in large networks. Phys. Rev. E 91(1), 012821 (2015)
22. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Annu. Rev. Sociol. 27(1), 415–444 (2001)
23. Duan, L., Ma, S., Aggarwal, C., Ma, T., Huai, J.: An ensemble approach to link prediction. IEEE Trans. Knowl. Data Eng. 29(11), 2402–2416 (2017)
24. Koren, Y., Bell, R.: Advances in collaborative filtering. In: Recommender Systems Handbook, pp. 77–118. Springer (2015)
25. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
26. Lü, L., Zhou, T.: Link prediction in complex networks: a survey. Phys. A 390(6), 1150–1170 (2011)
27. Ortega, A., Frossard, P., Kovačević, J., Moura, J.M., Vandergheynst, P.: Graph signal processing: overview, challenges, and applications. Proc. IEEE 106(5), 808–828 (2018)
28. Martínez, V., Berzal, F., Cubero, J.C.: A survey of link prediction in complex networks. ACM Comput. Surv. (CSUR) 49(4), 69 (2017)
29. Hanley, J.A., McNeil, B.J.: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1), 29–36 (1982)
30. Mason, S.J., Graham, N.E.: Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation. Q. J. R. Meteorol. Soc. 128(584), 2145–2166 (2002)
31. Schaeffer, S.E.: Graph clustering. Comput. Sci. Rev. 1(1), 27–64 (2007)
32. Görke, R., Kappes, A., Wagner, D.: Experiments on density-constrained graph clustering. J. Exp. Algorithmics (JEA) 19, 3–3 (2015)
33. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: first steps. Soc. Netw. 5(2), 109–137 (1983)
34. Leskovec, J., Adamic, L.A., Huberman, B.A.: The dynamics of viral marketing. ACM Trans. Web (TWEB) 1(1), 5 (2007)
35. Rohe, K., Chatterjee, S., Yu, B., et al.: Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Stat. 39(4), 1878–1915 (2011)
36. Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 62(1), 471–487 (2016)
37. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998. ACM (2008)
38. Lofgren, P., Banerjee, S., Goel, A.: Personalized pagerank estimation and search: a bidirectional approach. In: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pp. 163–172. ACM (2016)
39. Krasanakis, E., Schinas, E., Papadopoulos, S., Kompatsiaris, Y., Symeonidis, A.: Boosted seed oversampling for local community ranking. Inf. Process. Manag. 102053 (2019, in press). https://service.elsevier.com/app/answers/detail/a_id/11241/supporthub/scopus/
40. Kloster, K., Gleich, D.F.: Heat kernel based community detection. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1386–1395. ACM (2014)
41. Andersen, R., Chung, F., Lang, K.: Local partitioning for directed graphs using pagerank. Internet Math. 5(1–2), 3–22 (2008)
42. Borgs, C., Chayes, J., Mahdian, M., Saberi, A.: Exploring the community structure of newsgroups. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 783–787. ACM (2004)
43. Gleich, D., Kloster, K.: Seeded pagerank solution paths. Eur. J. Appl. Math. 27(6), 812–845 (2016)
A Gradient Estimate for PageRank
Paul Horn¹ and Lauren M. Nelsen²(B)
1 University of Denver, Denver, CO 80208, USA
[email protected]
https://cs.du.edu/~paulhorn
2 University of Indianapolis, Indianapolis, IN 46227, USA
[email protected]
https://sites.google.com/view/laurennelsen
Abstract. Personalized PageRank has found many uses in not only the ranking of webpages, but also algorithmic design, due to its ability to capture certain geometric properties of networks. In this paper, we study the diffusion of PageRank: how varying the jumping (or teleportation) constant affects PageRank values. To this end, we prove a gradient estimate for PageRank, akin to the Li-Yau inequality for positive solutions to the heat equation (for manifolds, with later versions adapted to graphs).
Keywords: PageRank · Discrete curvature · Random walks · Gradient estimate
1 Introduction/Background
Personalized PageRank, developed by Brin and Page [3], ranks the importance of webpages 'near' a seed. PageRank can be thought of in a variety of ways, but one of the most important points of view of PageRank is that it is the distribution of a random walk allowed to diffuse for a geometrically distributed number of steps. A key parameter in PageRank, then, is the 'jumping' or 'teleportation' constant which controls the expected length of the involved random walks. As the jumping constant controls the length, it controls locality, that is, how far from the seed the random walk is (likely) willing to stray. When the jumping constant is small, the involved walks are (on average) short, and the mass of the distribution will remain concentrated near the seed. As the jumping constant increases, the involved walk will (likely) be much longer. This allows the random walk to mix, and the involved distribution tends towards the stationary distribution of the random walk.
As the PageRank of individual vertices (for a fixed jumping constant) can be thought of as a measure of importance to the seed, then as the jumping constant increases this importance diffuses. In this paper, we are interested in how this importance diffuses as the jumping constant increases. This diffusion is related to the network's geometry; in particular, the importance can get 'caught' by small cuts. This partially accounts for PageRank's importance in web search but has other uses as well. For instance, Andersen, Chung and Lang use PageRank to implement local graph partitioning algorithms in [1].
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 15–26, 2020. https://doi.org/10.1007/9783030366872_2
This paper seeks to understand the diffusion of influence (as the jumping constant changes) in analogy to the diffusion of heat. The study of solutions to the heat equation $\Delta u = \frac{\partial}{\partial t} u$ on both graphs and manifolds has a long history, motivated by its close ties to geometric properties of graphs. On graphs, the relationship between heat flow and PageRank has been exploited several times. For instance, Chung [4] introduced the notion of heat kernel PageRank and used it to improve the algorithm of Andersen, Chung and Lang for graph partitioning. A particularly useful way of understanding positive solutions to the heat equation is through curvature lower bounds, which can be used to prove 'gradient estimates', which bound how heat diffuses locally in space and time and which can be integrated to obtain Harnack inequalities. Most classical of these is the Li-Yau inequality [13], which (in its simplest form) states that if $u$ is a positive solution on a non-negatively curved $n$-dimensional compact manifold, then $u$ satisfies

$$\frac{|\nabla u|^2}{u^2} - \frac{u_t}{u} \leq \frac{n}{2t}. \quad (1)$$

In the graph setting, Bauer et al. proved a gradient estimate for the heat kernel on graphs in [2]. In this paper we aim to prove a similar inequality for PageRank. Our gradient estimate, which is formally stated as Theorem 1 below, is proved using the exponential curvature dimension inequality CDE, introduced by Bauer et al. We mention that, in some ways, our inequality is more closely related to another inequality of Hamilton [9] which bounds merely $\frac{|\nabla u|^2}{u^2}$, and was established for graphs by Horn in [10]. Other related works establish gradient estimates for eigenfunctions for the Laplace matrix; these include [6].
This paper is organized as follows: In the next section we introduce definitions for both PageRank and the graph curvature notions used.
We further establish a useful 'time parameterization' for PageRank, which allows us to think of increasing the jumping constant as increasing a time parameter, and makes our statements and proofs cleaner. In Sect. 3 we prove a gradient estimate for PageRank. In Sect. 4 we use this gradient estimate to prove a Harnack-type inequality that allows us to compare PageRank at two vertices in a graph.
2 Preliminaries
2.1 Spectral Graph Theory and Graph Laplacians
Spectral graph theory involves associating a matrix (or operator) with a graph and investigating how eigenvalues of the associated matrix reflect graph properties. The most familiar such matrix is the adjacency matrix $A$, whose rows and columns are indexed with vertices and

$$a_{ij} = \begin{cases} 1 & \text{if } v_i \sim v_j \\ 0 & \text{else.} \end{cases}$$
In this work, the principal matrix that we will consider is the normalized Laplace operator $\Delta = D^{-1}A - I$, where $D$ is the diagonal matrix of vertex degrees and $D^{-1}A$ is the transition probability matrix for a simple random walk. As a quick observation, note that $\Delta$ is non-positive semidefinite. This is contrary to usual sign conventions in graph theory, but is the proper sign convention for the Laplace-Beltrami operator in Riemannian manifolds, the analogy to which we emphasize in this paper. Also note that this matrix is (up to sign) the unsymmetrized version of the normalized Laplacian popularized by Chung (see [7]), $L = (D^{-1/2} A D^{-1/2}) - I$.
2.2 PageRank
(Personalized) PageRank was introduced as a ranking mechanism [12], to rank the importance of webpages with respect to a seed. To define personalized PageRank, we introduce the following operator, which we call the PageRank operator. This operator, $P(\alpha)$, is defined as follows:

$$P(\alpha) = (1 - \alpha) \sum_{k=0}^{\infty} \alpha^k W^k,$$

where $W = D^{-1}A$ is the transition probability matrix for a simple random walk. Here the parameter $\alpha$ is known as the jumping or teleportation constant. For a finite $n$-vertex graph, $P(\alpha)$ is a square matrix; the personalized PageRank vector of a vector $u : V \to \mathbb{R}$ is

$$u^T P(\alpha) = (1 - \alpha) \sum_{k=0}^{\infty} \alpha^k u^T W^k.$$
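As a small numerical illustration, the sketch below approximates the personalized PageRank vector $u^T P(\alpha)$ by truncating the series. It assumes the series starts at $k = 0$, so that $P(\alpha) = (1 - \alpha)(I - \alpha W)^{-1}$, the closed form used later in the proof of Proposition 1; it also assumes a graph without isolated vertices. The function names and the toy graph are hypothetical.

```python
def transition_matrix(adj):
    """W = D^{-1} A for a graph without isolated vertices."""
    return [[a / sum(row) for a in row] for row in adj]


def vecmat(x, W):
    """Row vector times matrix: (x^T W)_j = sum_i x_i W_ij."""
    n = len(W)
    return [sum(x[i] * W[i][j] for i in range(n)) for j in range(n)]


def personalized_pagerank(adj, u, alpha, terms=500):
    """Truncated series (1 - alpha) * sum_{k < terms} alpha^k u^T W^k."""
    W = transition_matrix(adj)
    walk = list(u)            # u^T W^k, starting with k = 0
    weight = 1.0 - alpha      # (1 - alpha) * alpha^k
    total = [0.0] * len(u)
    for _ in range(terms):
        total = [s + weight * x for s, x in zip(total, walk)]
        walk = vecmat(walk, W)
        weight *= alpha
    return total


# Path graph 0-1-2 seeded at vertex 0 (toy data).
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
seed = [1.0, 0.0, 0.0]
print(personalized_pagerank(path, seed, alpha=0.5))
```

Because $W$ is row-stochastic, the entries of the resulting vector sum to the total mass of the seed (up to truncation error), and the vector satisfies the fixed-point relation $p = (1 - \alpha) u^T + \alpha\, p\, W$.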
It has been noticed ([4,5]) that PageRank has many similarities to the heat kernel $e^{t\Delta}$. Chung defined the notion of 'Heat Kernel PageRank' to exploit these similarities. In this work, we take inspiration in the opposite direction: we are interested in understanding the action of the PageRank operator in analogy to solutions of the heat equation. In order to emphasize our point of view, we note that graph theorists view the heat kernel operator in two different ways: For a vector $u : V \to \mathbb{R}$, studying the evolution of $u^T e^{t\Delta}$ as $t \to \infty$ is really studying the evolution of the continuous time random walk, while studying the evolution of $e^{t\Delta} u$ as $t \to \infty$ is studying the solutions to the heat equation $\Delta u = u_t$. The differing behavior of these two evolutions comes from the fact that (for irregular graphs) the left and right eigenvectors of $\Delta = W - I$ are different: the left Perron-Frobenius eigenvector of $\Delta$ is proportional to the degrees of a graph (as it captures the stationary distribution of the random walk) while the right Perron-Frobenius eigenvector is the constant vector. In particular, as $t \to \infty$ the vector $e^{t\Delta} u$ tends to a constant. Physically, this represents the 'heat' on a graph evening out, and this regularization (and the rate of regularization) is related to a number of geometric features of a graph.
A similar feature holds for PageRank. As $\alpha \to 1$, $u^T P(\alpha)$ tends to a vector proportional to degrees, but $P(\alpha) u$ regularizes. In this paper we study this regularization. Although we do not study the PageRank vector explicitly, we note that the left and right action of the PageRank operator are closely related. For an undirected graph $u^T P(\alpha) = (P(\alpha)^T u)^T = (D P(\alpha) D^{-1} u)^T$, so that the regularization of $D^{-1} u$ can be translated into information on the 'mixing' of the personalized PageRank vector seeded at $u$.
To complete the analogy between $P(\alpha) u$ and $e^{t\Delta} u$, it is helpful to come up with a time parameterization $t = t(\alpha)$ so we can view the regularization as a function of 'time', in analogy to the heat equation. To do this in the best way, it is useful to think of $\alpha = \alpha(t)$ and compute $\frac{\partial}{\partial t} P_\alpha$.
Proposition 1.

$$\frac{\partial}{\partial t} P_\alpha = \frac{\alpha'}{(1 - \alpha)^2} \Delta P_\alpha^2,$$

where $\Delta P_\alpha^2 = \Delta P_\alpha (P_\alpha)$.
Proof. Notice that the chain rule and algebra reveal

$$\frac{\partial}{\partial t} P_\alpha = \frac{\partial}{\partial t} (1 - \alpha)(I - \alpha W)^{-1} = \alpha' \left( (\alpha W - I)(I - \alpha W)^{-2} + (1 - \alpha) W (I - \alpha W)^{-2} \right) = \frac{\alpha'}{(1 - \alpha)^2} \Delta P_\alpha^2.$$
This is remarkably close to the heat equation if $\alpha'(t) = (1 - \alpha)^2$; solving this separable differential equation yields that $\alpha = \alpha(t) = 1 - \frac{1}{t + C}$. Since we desire a parameterization so that $\alpha(0) = 0$ and $\alpha \to 1$ as $t \to \infty$, this gives us that $C = 1$, whence we obtain:

$$\alpha(t) = 1 - \frac{1}{t + 1} \quad (2)$$

$$t = \frac{\alpha}{1 - \alpha} \quad (3)$$
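For completeness, the separation of variables behind Eqs. (2) and (3) can be written out as the short worked derivation below; it only spells out the intermediate steps and introduces no new assumptions.

```latex
\begin{align*}
\alpha'(t) = (1-\alpha)^2
  &\;\Longrightarrow\; \int \frac{d\alpha}{(1-\alpha)^2} = \int dt
   \;\Longrightarrow\; \frac{1}{1-\alpha} = t + C
   \;\Longrightarrow\; \alpha(t) = 1 - \frac{1}{t+C}, \\
\alpha(0) = 0
  &\;\Longrightarrow\; C = 1, \qquad
   t = \frac{1}{1-\alpha} - 1 = \frac{\alpha}{1-\alpha}.
\end{align*}
```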
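Given the time parameterization in Eq. 2, we get the following Corollary to Proposition 1.
Corollary 2.

$$\frac{\partial}{\partial t} P_\alpha = \Delta P_\alpha^2,$$

where $\Delta P_\alpha^2 = \Delta P_\alpha (P_\alpha)$.
Proof. From Proposition 1 and our choice of parameterization, we see that

$$\frac{\partial}{\partial t} P_\alpha = \frac{\alpha'}{(1 - \alpha)^2} \Delta P_\alpha^2 = \frac{\frac{1}{(t+1)^2}}{\left(\frac{1}{t+1}\right)^2} \Delta P_\alpha^2 = \Delta P_\alpha^2.$$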
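Fix a vector $u : V \to \mathbb{R}$. From now on, we let

$$f = P_\alpha u. \quad (4)$$

Lemma 1. For $f = P_\alpha u$ and $t = \frac{\alpha}{1 - \alpha}$, we have that $\Delta f = \frac{f - u}{t}$.
Proof. We know that $W = D^{-1}A$ and $\Delta = W - I$, so

$$\Delta P_\alpha = (W - I)(1 - \alpha)(I - \alpha W)^{-1} = -\frac{1}{\alpha}(I - \alpha W)(1 - \alpha)(I - \alpha W)^{-1} + \frac{1 - \alpha}{\alpha} \cdot (1 - \alpha)(I - \alpha W)^{-1} = \frac{1 - \alpha}{\alpha}(P_\alpha - I).$$

Hence

$$\Delta f = \Delta P_\alpha u = \frac{1 - \alpha}{\alpha}(P_\alpha - I) u = \frac{f - u}{t}.$$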
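Lemma 1 is easy to check numerically. The sketch below computes $f = P_\alpha u$ from a truncated series (assumed to start at $k = 0$, matching $P_\alpha = (1 - \alpha)(I - \alpha W)^{-1}$), applies $\Delta = W - I$, and compares the result with $(f - u)/t$ for $t = \alpha/(1 - \alpha)$. The helper names and the small test graph are assumptions made for this example.

```python
def transition_matrix(adj):
    """W = D^{-1} A for a graph without isolated vertices."""
    return [[a / sum(row) for a in row] for row in adj]


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def pagerank_apply(adj, u, alpha, terms=500):
    """Right action f = P_alpha u via the truncated series."""
    W = transition_matrix(adj)
    term = list(u)          # W^k u, starting with k = 0
    weight = 1.0 - alpha    # (1 - alpha) * alpha^k
    total = [0.0] * len(u)
    for _ in range(terms):
        total = [s + weight * x for s, x in zip(total, term)]
        term = matvec(W, term)
        weight *= alpha
    return total


def lemma1_gap(adj, u, alpha):
    """Largest entrywise difference between Delta f and (f - u) / t."""
    t = alpha / (1.0 - alpha)
    f = pagerank_apply(adj, u, alpha)
    Wf = matvec(transition_matrix(adj), f)
    laplacian_f = [wf - fi for wf, fi in zip(Wf, f)]   # (W - I) f
    rhs = [(fi - ui) / t for fi, ui in zip(f, u)]
    return max(abs(a - b) for a, b in zip(laplacian_f, rhs))


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2 (toy data).
graph = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]]
print(lemma1_gap(graph, [1.0, 0.0, 0.0, 0.0], alpha=0.6))
```

The printed gap should be at the level of floating-point round-off, since the identity follows directly from $(I - \alpha W) f = (1 - \alpha) u$.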
2.3 Graph Curvature
In this paper we study the regularization of $P(\alpha) u$ for an initial seed $u$ as $\alpha \to 1$. On one hand, the information about this regularization is contained in the spectral decomposition of the random walk matrix $W$. The eigenvalues of $P(\alpha)$ are determined by the eigenvalues of $W$: indeed, if $\lambda$ is an eigenvalue of $W$, then $\frac{1 - \alpha}{1 - \alpha\lambda}$ is an eigenvalue of $P_\alpha$. One may observe, then, that as $\alpha \to 1$ all eigenvalues of $P_\alpha$ tend to zero except for the one arising from the eigenvalue 1 of $W$, and this is what causes the regularization. Thus the difference between $P_\alpha u$ and the constant vector can be bounded in terms of (say) the infinity norms of eigenvectors of $P_\alpha$ and $\alpha$ itself.
On the other hand, curvature lower bounds (in graphs and manifolds) have proven to be important ways to understand the local evolution of solutions to the heat equation. As we have already noted important similarities between heat solutions and PageRank, we seek similar understanding in the present case. Curvature, for graphs and manifolds, gives a way of understanding the local geometry of the object. A manifold (or graph) satisfying a curvature lower bound at every point has a locally constrained geometry which allows a local understanding of heat flow, through which a 'gradient estimate' can be proved. These gradient estimates can then be 'integrated' over space-time to yield Harnack inequalities which compare the 'heat' of different points at different times.
While a direct analogue of the Ricci curvature is not defined in a graph setting, a number of graph-theoretical analogues have been developed recently in an attempt to apply geometrical ideas in the graph setting. In the context of proving gradient estimates of heat solutions, a new notion of curvature known as the exponential curvature dimension inequality was introduced in [2]. In order to discuss the exponential curvature dimension inequality, we first need to introduce some notation. The Laplace operator, $\Delta$, on a graph $G$ is defined at a vertex $x$ by
$$\Delta f(x) = \sum_{y \sim x} (f(y) - f(x)).$$
Definition 3. The gradient form $\Gamma$ is defined by
$$2\Gamma(f, g)(x) = (\Delta(f \cdot g) - f \cdot \Delta(g) - \Delta(f) \cdot g)(x) = \sum_{y \sim x} (f(y) - f(x))(g(y) - g(x)),$$
and we write $\Gamma(f) = \Gamma(f, f)$.
In general, there is no "chain rule" that holds for the Laplacian on graphs. However, the following formula does hold for the Laplacian and will be useful to us:
$$\Delta f = 2\sqrt{f}\,\Delta\sqrt{f} + 2\Gamma(\sqrt{f}). \tag{5}$$
We define an iterated gradient form, $\Gamma_2$, that will be of use to us for the notion of graph curvature that we are using.
Definition 4. The gradient form $\Gamma_2$ is defined by
$$2\Gamma_2(f, g) = \Delta\Gamma(f, g) - \Gamma(f, \Delta g) - \Gamma(\Delta f, g),$$
and we write $\Gamma_2(f) = \Gamma_2(f, f)$.
At the heart of the exponential curvature dimension inequality is an idea that had been used previously based on the Bochner formula. The Bochner formula reveals a connection between solutions to the heat equation and the curvature of a manifold. Bochner's formula tells us that if $M$ is a Riemannian manifold and $f$ is in $C^\infty(M)$, then
$$\frac{1}{2}\Delta|\nabla f|^2 = \langle \nabla f, \nabla \Delta f \rangle + \|\mathrm{Hess} f\|_2^2 + \mathrm{Ric}(\nabla f, \nabla f).$$
The Bochner formula implies that for an $n$-dimensional manifold with Ricci curvature at least $K$, we have
$$\frac{1}{2}\Delta|\nabla f|^2 \geq \langle \nabla f, \nabla \Delta f \rangle + \frac{1}{n}(\Delta f)^2 + K|\nabla f|^2. \tag{6}$$
An important insight of Bakry and Émery was that an object satisfying an inequality like (6) could be used as a definition of a curvature lower bound even
when curvature could not be directly defined. Such an inequality became known as a curvature dimension inequality, or the CD inequality. Bauer et al. introduced a modification of the CD inequality that defines a new notion of curvature on graphs that we will use here [2], the exponential curvature dimension inequality.
Definition 5. A graph is said to satisfy the exponential curvature dimension inequality CDE(n, K) if, for all positive $f : V \to \mathbb{R}$ and at all vertices $x \in V(G)$ satisfying $(\Delta f)(x) < 0$,
$$\Delta\Gamma(f) - 2\Gamma\left(f, \frac{\Delta (f^2)}{2f}\right) \geq \frac{2}{n}(\Delta f)^2 + 2K\Gamma(f), \tag{7}$$
where the inequality in (7) is taken pointwise.
While the inequality (7) may seem somewhat unwieldy, it arises, as shown in [2], from 'baking in' the chain rule and is actually equivalent to the standard curvature dimension inequality (6) in the setting of diffusive semigroups (where the Laplace operator satisfies the chain rule). Additionally, in [2], it is shown that some graphs, including the Ricci-flat graphs of Chung and Yau, satisfy CDE(n, 0) (and hence are non-negatively curved for this curvature notion), and some general curvature lower bounds for graphs are given. An important observation is that this notion of curvature only requires looking at the second neighborhood of a vertex, and hence this kind of curvature is truly a local property (so a curvature lower bound can be certified by only inspecting second neighborhoods of vertices).
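The operators of this section are simple to compute directly. The sketch below, with a hypothetical four-vertex graph and arbitrary test functions, implements the Laplace operator and the gradient form $\Gamma$ as defined above and checks numerically that the two expressions in Definition 3 agree and that Eq. (5) holds. The function names `laplacian` and `gamma` are illustrative choices, not taken from the paper.

```python
import math


def laplacian(adj, f):
    """Delta f(x) = sum over neighbors y of x of (f(y) - f(x))."""
    return {x: sum(f[y] - f[x] for y in adj[x]) for x in adj}


def gamma(adj, f, g):
    """Gamma(f, g)(x) = (1/2) * sum over y ~ x of (f(y) - f(x)) * (g(y) - g(x))."""
    return {x: 0.5 * sum((f[y] - f[x]) * (g[y] - g[x]) for y in adj[x]) for x in adj}


adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # hypothetical example graph
f = {0: 4.0, 1: 1.0, 2: 9.0, 3: 2.25}                # an arbitrary positive function
g = {0: 0.5, 1: 2.0, 2: 1.0, 3: 3.0}                 # a second arbitrary function

# Definition 3: 2 Gamma(f, g) = Delta(f g) - f Delta(g) - Delta(f) g.
fg = {x: f[x] * g[x] for x in adj}
lap_f, lap_g, lap_fg = laplacian(adj, f), laplacian(adj, g), laplacian(adj, fg)
gam = gamma(adj, f, g)
def3_gap = max(
    abs(2 * gam[x] - (lap_fg[x] - f[x] * lap_g[x] - lap_f[x] * g[x])) for x in adj
)

# Eq. (5): Delta f = 2 sqrt(f) Delta sqrt(f) + 2 Gamma(sqrt(f)).
s = {x: math.sqrt(f[x]) for x in adj}
lap_s, gam_s = laplacian(adj, s), gamma(adj, s, s)
eq5_gap = max(abs(lap_f[x] - (2 * s[x] * lap_s[x] + 2 * gam_s[x])) for x in adj)
```

Both gaps should vanish up to round-off; Eq. (5) is just Definition 3 applied with both arguments equal to $\sqrt{f}$.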
3 Gradient Estimate for PageRank
Our main result will make use of the following lemma, adapted from a lemma in [2].
Lemma 2 ([2]). Let $G(V, E)$ be a (finite or infinite) graph, and let $f, H : V \times \{t^*\} \to \mathbb{R}$ be functions. If $f \geq 0$ and $H$ has a local maximum at $(x^*, t^*) \in V \times \{t^*\}$, then
$$\Delta(fH)(x^*, t^*) \leq (\Delta f)H(x^*, t^*).$$
Our goal is to show that
$$\frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}} \leq \frac{C(t)}{t}$$
for some function $C(t)$. However, $\frac{C(t)}{t}$ is badly behaved as $t \to 0$. The way that we handle this is by showing that $H := t \cdot \frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}} \leq C(t)$. If $H$ is a function from $V \times [0, \infty) \to \mathbb{R}$, then instead consider $H$ as a function from $V \times [0, T] \to \mathbb{R}$ for some $T > 0$. Then, by compactness, there is a point $(x^*, t^*)$ in $V \times [0, T]$ at which $H(x, t)$ is maximized. At this maximum, we know that $\Delta H \leq 0$ and $\frac{\partial}{\partial t} H \geq 0$. Since $L = \Delta - \frac{\partial}{\partial t}$, this implies that at the maximum point, $LH \leq 0$. Using the CDE inequality, along with some other lemmas and an identity, we are able to relate $H$ to its own square. This allows us to find an upper bound for $H$, and thus for $\frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}}$. Our situation is a little easier, because we consider a fixed $t$. A simple computation shows the following:
Lemma 3. Let $G$ be a graph, and suppose $0 \leq f(x) \leq M$ for all $x \in V(G)$ and $t \in [0, \infty)$, and let $H = \frac{t\,\Gamma(\sqrt{f})}{\sqrt{f \cdot M}}$. Then
$$\Delta\sqrt{f} = \frac{f - u}{2t\sqrt{f}} - \frac{\sqrt{M}\,H}{t}.$$
This identity plays a similar role to the identity $\Delta \log u = \frac{\Delta u}{u} - \frac{|\nabla u|^2}{u^2}$ that is key in the Li-Yau inequality on manifolds, and the identity $\frac{\Delta\sqrt{u}}{\sqrt{u}} = \frac{\Delta u}{2u} - \frac{|\nabla\sqrt{u}|^2}{u}$, which is behind the Li-Yau inequality for graphs. Lemma 3 is similar to these other identities, and the CDE inequality allows us to exploit this relationship.
Theorem 1. Let $G$ be a graph satisfying CDE(n, 0). Suppose $0 \leq f \leq M$ for all $x \in V(G)$ and $t \in (0, \infty)$. Then
$$\frac{\Gamma(\sqrt{f})}{\sqrt{f \cdot M}} \leq \frac{n+4}{n+2} \cdot \frac{1}{t} + 2\sqrt{\frac{n}{n+2}} \cdot \frac{1}{\sqrt{t}}.$$
Note that this theorem actually is more akin to a 'Hamilton-type' gradient estimate, as it is an estimate in space only (and not time). Due to space constraints, the full proof of Theorem 1 is deferred to the full version of the paper. It proceeds by the maximum principle, similarly to the proof of the Li-Yau inequality in [2], but requires additional care in handling some terms since the heat equation is not specified; these give rise to its form. For the convenience of the reader, we sketch the main ideas of the proof in an appendix.
4 Harnack-Type Inequality
We can use Theorem 1 to prove a result comparing PageRank at two vertices in a graph depending on the distance between them. This result is similar to a Harnack inequality. The classical form of a Harnack inequality is the following.
Proposition 6 ([2]). Suppose $G$ is a graph satisfying CDE(n, 0). Let $T_1 < T_2$ be real numbers, let $d(x, y)$ denote the distance between $x, y \in V(G)$, and let $D = \max_{v \in V(G)} \deg(v)$. If $u$ is a positive solution to the heat equation on $G$, then
$$u(x, T_1) \leq u(y, T_2) \left(\frac{T_2}{T_1}\right)^{n} \exp\left(\frac{4D\,d(x, y)^2}{T_2 - T_1}\right).$$
This result allows one to compare heat at different points and different times. This can make it possible to deduce geometric information about the graph, such as bottlenecking. Delmotte [8] showed that Harnack inequalities do not only allow us to compare heat at different points in space and time; they also have geometric consequences, such as volume doubling and satisfying the Poincaré inequality. Horn, Lin, Liu, and Yau [11] completed the work of Delmotte by proving that even more geometric information can be obtained from Harnack inequalities.
Using Theorem 1, we are able to relate PageRank at different vertices, but our result is not quite of the right form to be a Harnack inequality. In Theorem 1, an ideal conclusion would be to have an $f$ instead of $\sqrt{f \cdot M}$ in the denominator. Since we do not, this makes proving a "Harnack-type" inequality, directly comparing the two values in terms of themselves and their distance, more difficult. (A somewhat similar technique is used by Horn in [10] on the heat equation, but in the case of [10] the gradient estimate is scaled better, yielding stronger results.) To prove our Harnack-type inequality, we will use a lemma comparing PageRank at adjacent vertices. From now on, we will consider $t$ fixed and write $f(x)$ instead of $f(x, t)$. If a vertex, $w$, is adjacent to a vertex, $z$, then we want to lower bound $f(z)$ by a function only involving $f(w)$. The trick to this is to rewrite $\sqrt{f(w)}/\sqrt{f(z)}$ so that we can use Theorem 1 in order to get rid of the '$\sqrt{f(z)}$' in the denominator.
Lemma 4. Let $D = \max_{v \in V(G)} \deg(v)$. If $w \sim z$, then
$$\frac{\sqrt{f(w)}}{\sqrt{f(z)}} \leq \frac{2CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)}} + 2.$$
Proof. If $\sqrt{f(z)} \geq \frac{1}{2}\sqrt{f(w)}$, then
$$\frac{\sqrt{f(w)}}{\sqrt{f(z)}} \leq 2 \leq \frac{2CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)}} + 2.$$
If $\sqrt{f(z)} < \frac{1}{2}\sqrt{f(w)}$, then
$$\frac{\sqrt{f(w)}}{\sqrt{f(z)}} = \frac{\sqrt{f(w)} - \sqrt{f(z)} + \sqrt{f(z)}}{\sqrt{f(z)}} = \frac{\sqrt{f(w)} - \sqrt{f(z)}}{\sqrt{f(z)}} + 1 = \frac{D\left(\sqrt{f(w)} - \sqrt{f(z)}\right)^2}{D\sqrt{f(z)}\left(\sqrt{f(w)} - \sqrt{f(z)}\right)} + 1. \tag{8}$$
Now applying the gradient estimate (Theorem 1) yields
$$(8) \leq \frac{CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)} - \sqrt{f(z)}} + 1 \leq \frac{CD\sqrt{M}}{t} \cdot \frac{2}{\sqrt{f(w)}} + 1 \quad \text{since } \sqrt{f(z)} < \tfrac{1}{2}\sqrt{f(w)}$$
$$\leq \frac{2CD\sqrt{M}}{t} \cdot \frac{1}{\sqrt{f(w)}} + 2.$$
We note that this can be carefully iterated to compare PageRank of vertices at a given distance. The proof, however (and even its rather complicated statement), is deferred to the full journal version of the paper due to space considerations.
5 Conclusions, Applications, and Future Work
In this paper we investigated PageRank as a diffusion, using recently developed notions of discrete curvature. These results, while theoretical (and in some cases not as strong as would be desired due to the dependence on the maximum value 'M' in the gradient estimate), show that curvature aspects of graphs can be used to understand relative importance in networks, at least when ranking is based on random-walk-based diffusions. Regarding these points, we highlight the following:
– Curvature is a local property, based only on second neighborhood conditions. An upshot of this is that it can be certified quickly. While the work here focuses on situations where the entire graph is non-negatively curved for simplicity, work in [2,10] shows that these methods can be used when only parts of the graph satisfy such a condition by using cutoff functions. In principle these yield algorithms that are linear, either in the size of the graph or even in a considered portion of the graph, verifying curvature conditions and elucidating PageRank's diffusion in bounded degree graphs.
– The influence of the jumping constant on PageRank has been important for certain algorithms (such as in [1]), but was originally picked rather arbitrarily (see, e.g., [14]). A more rigorous study of this phenomenon seems important for the analysis of complex networks, and this paper should be seen as part of this thrust.
– There are several interesting areas for improvement here: The non-ideal scaling in Theorem 1 leads to a weaker than ideal result in Lemma 4. While Lemma 4 seems a reasonable result, when iterated it quickly loses power (unlike the Harnack inequality from a 'properly scaled' gradient estimate like in Proposition 6). While a 'properly scaled' Theorem 1 may not even be true, we suspect the scaling can be improved. An interesting question is whether a true 'Hamilton-type' gradient estimate is true: Is $\Gamma(\sqrt{f})/f \leq C \log(M/f)\, t^{-1}$? Note that the addition of the logarithmic term damages a Harnack inequality, but the results obtainable from this are far better than we obtain. A version including the time derivative term is also desirable.
Acknowledgments. Horn's work was partially supported by Simons Collaboration grant #525309.
Appendix
The proof of Theorem 1 includes some rather lengthy computations, and is deferred to the full paper. For the benefit of readers, however, we have included a sketch here which highlights the initial part of the proof, where one relates the quantity to be bounded with its own square using CDE.
Proof. (Proof sketch for Theorem 1). Let $H = \frac{t\,\Gamma(\sqrt{f})}{\sqrt{f \cdot M}}$. Fix $t > 0$. Let $(x^*, t)$ be a point in $V \times \{t\}$ such that $H(x, t)$ is maximized. We desire to bound $H(x^*, t)$.
Our goal, then, is to apply the CDE inequality to $\Delta(\sqrt{f}H)$. In order to do this, we must ensure that $\Delta\sqrt{f} < 0$, but a computation shows that if $\Delta\sqrt{f} \geq 0$, then $H \leq \frac{1}{2}$, so this is allowable. Following this, one computes by bounding the arising $\Delta\Gamma(\sqrt{f})$ by CDE. One bounds the ensuing terms; clearly
$$\frac{2}{t}\Gamma(\sqrt{f}) - \frac{2}{t}\Gamma\left(\sqrt{f}, \frac{u}{\sqrt{f}}\right) \geq -\frac{2}{t}\Gamma\left(\sqrt{f}, \frac{u}{\sqrt{f}}\right),$$
and by Lemma 3, $\Delta\sqrt{f} = \frac{f - u}{2t\sqrt{f}} - \frac{\sqrt{M}H}{t}$. Then one bounds:
$$\Delta(\sqrt{f}H) \geq \frac{2}{\sqrt{M}\,nt}\left(\frac{(f-u)^2}{4f} - \frac{(f-u)\sqrt{M}H}{\sqrt{f}} + MH^2\right) + \frac{2\Gamma(\sqrt{f})}{\sqrt{M}} - \frac{2}{\sqrt{M}}\Gamma\left(\sqrt{f}, \frac{u}{\sqrt{f}}\right)$$
$$\geq \frac{2\left(MH^2 - \sqrt{f}\sqrt{M}H\right)}{\sqrt{M}\,nt} - \frac{1}{\sqrt{M}}\sum_{y \sim x}\left[u(y)\left(1 - \frac{\sqrt{f(x)}}{\sqrt{f(y)}}\right) + u(x)\left(1 - \frac{\sqrt{f(y)}}{\sqrt{f(x)}}\right)\right]$$
$$\geq \frac{2\left(MH^2 - \sqrt{f}\sqrt{M}H\right)}{\sqrt{M}\,nt} - \sqrt{M}.$$
Now one proceeds carefully, noting that we have related $H$ and its square and thus, in principle at least, have recorded an upper bound for $H$. Now we continue to compute to recover the result.
Remark: In a typical application of the maximum principle, one maximizes over $[0, T]$ and then uses information from the time derivative. Here, we don't do this. This is important because one obtains an inequality of the form
$$H^2 \leq C_1 \cdot H + C_2 \cdot t.$$
Because of the dependence of this inequality on the time where the maximum occurs, if the $t^*$ maximizing the function over all $[0, T]$ is considered, then the result will depend on $t^*$, giving a bound like $H \leq \frac{\sqrt{2C_2 t^*}}{t}$. However, since we are able to do the computation at $t$, this problem does not arise.
References
1. Andersen, R., Chung, F., Lang, K.: Local partitioning for directed graphs using PageRank. Internet Math. 5(1–2), 3–22 (2008)
2. Bauer, F., Horn, P., Lin, Y., Lippner, G., Mangoubi, D., Yau, S.T.: Li-Yau inequality on graphs. J. Differ. Geom. 99(3), 359–405 (2015)
3. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1), 107–117 (1998). Proceedings of the Seventh International World Wide Web Conference
4. Chung, F.: The heat kernel as the pagerank of a graph. Proc. Natl. Acad. Sci. 104(50), 19735–19740 (2007)
5. Chung, F.: PageRank as a discrete Green's function. In: Geometry and Analysis, no. 1, volume 17 of Advanced Lectures in Mathematics (ALM), pp. 285–302. International Press, Somerville (2011)
6. Chung, F., Lin, Y., Yau, S.T.: Harnack inequalities for graphs with non-negative Ricci curvature. J. Math. Anal. Appl. 415(1), 25–32 (2014)
7. Chung, F.R.K.: Spectral Graph Theory, volume 92 of CBMS Regional Conference Series in Mathematics. Published for the Conference Board of the Mathematical Sciences, Washington, DC; by the American Mathematical Society, Providence (1997)
8. Delmotte, T.: Parabolic Harnack inequality and estimates of Markov chains on graphs. Rev. Mat. Iberoamericana 15(1), 181–232 (1999)
9. Hamilton, R.S.: A matrix Harnack estimate for the heat equation. Comm. Anal. Geom. 1(1), 113–126 (1993)
10. Horn, P.: A spacial gradient estimate for solutions to the heat equation on graphs. SIAM J. Discrete Math. 33(2), 958–975 (2019)
11. Horn, P., Lin, Y., Liu, S., Yau, S.T.: Volume doubling, Poincaré inequality and Gaussian heat kernel estimate for non-negatively curved graphs. J. Reine Angew. Math. (to appear)
12. Jeh, G., Widom, J.: Scaling personalized web search. In: Proceedings of the 12th World Wide Web Conference (WWW), pp. 271–279 (2003)
13. Li, P., Yau, S.T.: On the parabolic kernel of the Schrödinger operator. Acta Math. 156(3–4), 153–201 (1986)
14. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. In: Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, pp. 161–172 (1998)
A Persistent Homology Perspective to the Link Prediction Problem
Sumit Bhatia¹, Bapi Chatterjee², Deepak Nathani³, and Manohar Kaul³
1 IBM Research AI, New Delhi, India [email protected]
2 Institute of Science and Technology, Klosterneuburg, Austria [email protected]
3 Indian Institute of Technology, Hyderabad, Sangareddy, India {me15btech11009,mkaul}@iith.ac.in
Abstract. Persistent homology is a powerful tool in Topological Data Analysis (TDA) to capture topological properties of data succinctly at different spatial resolutions. For graphical data, the shape and structure of the neighborhood of individual data items (nodes) is an essential means of characterizing their properties. We propose the use of persistent homology methods to capture structural and topological properties of graphs and use them to address the problem of link prediction. We achieve encouraging results on nine different real-world datasets that attest to the potential of persistent-homology-based methods for network analysis.
1 Introduction
A graph structure representing pairwise relations or interactions among individuals or entities recurs in diverse real-world applications such as social and professional networks, biological phenomena such as protein-protein interactions [10], and citation and collaboration networks [4]. In all these applications, understanding how the network evolves and the ability to predict the formation of new, hitherto non-existent links is extremely useful and has crucial applications such as predicting target genes for cancer research [31], social network analysis, and recommendation systems.
The Link Prediction Problem: Let $U$ denote the set of all possible edges in graph $G = (V, E)$ with $V$ as the vertex set and $E$ as the edge set, and let $n = |V|$. If $G$ is undirected, $|U| = C(n, 2) = n(n-1)/2$, whereas, if $G$ is directed, $|U| = 2 \times C(n, 2) = n(n-1)$. The set $U - E$ is called the set of potential links. Often, in real-world settings, only a small subset of links $u \subset U$ will materialize in the future, with $|u| \ll |U|$. For example, in a typical social network that has hundreds of millions of users (nodes), each user may be friends (form an edge) with only a few hundred users. Given $G = (V, E)$, the task of identifying the edges $e \in u$ is challenging and requires understanding and modelling the differences between the sets $u$ and $U - u$.
B. Chatterjee: Supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 754411 (ISTPlus).
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 27–39, 2020. https://doi.org/10.1007/9783030366872_3
Why Persistent Homology for Link Prediction?: Persistent homology (PH) [11,12] is an algebraic tool for describing the structural features of a topological space at different spatial resolutions. By embedding a high-dimensional dataset in a topological space, PH allows us to extract and study crucial information about the structure and shape of the dataset in a succinct manner. Since understanding the evolution and formation of edges in networks involves analyzing the structure and shape of the underlying networks, we posit that PH offers a theoretically sound framework to study such topological properties of networks. As an emerging technique in data mining, PH has been successfully applied in various applications such as text analysis [42], image analysis [8], temporal network analysis [16,33], and network classification [7].
Persistence Diagram: A popular tool from the realm of PH is the persistence diagram (PD). The homology of a point set $X$ roughly characterizes it in terms of shape-features like connected components, tunnels and voids. Given a graph $G = (V, E)$, mapping the nodes $v_i \in V$ to the points $\{x_i\}_{1 \leq i \leq n} \in X$, the homology of $X$ exhibits $G$ in terms of the shape-features formed by its nodes and edges. However, these features depend a lot on the resolution or the scale at which they are studied, and it is crucial to study them across a spectrum of spatial resolutions. The features that persist across resolutions constitute its persistent homology (PH), represented by its PDs. A PD is depicted as a set of points in a two-dimensional space whose coordinates correspond to the resolutions at which the topological features are "born" and subsequently "die".
Differences between the PHs of two graphs (or subgraphs) can be captured by a dissimilarity measure such as the Wasserstein or Bottleneck [12, Chapter VIII] distance between their corresponding PDs. Using such dissimilarity measures between PDs, we understand how an adaptive-sized extended neighborhood of query nodes changes with regard to its PH when an edge is added to (or removed from) the graph.
Our Contributions: We describe a novel approach for predicting links in networks by utilizing the Persistence Diagrams of different neighborhood subgraphs around the query nodes. Specifically, we characterize the existence of a potential link between a pair of query nodes in terms of a dissimilarity measure between a number of specially constructed neighborhoods. We first present the necessary mathematical notions to describe our method: the PD of a graph and the distance measures between PDs (Sect. 2). We then argue and explain that for a pair of nodes, the PDs of the subgraph induced by their extended neighborhood should not change much by addition or removal of a naturally existing edge. We also provide a theoretical insight into the working of our approach (Sect. 3). We describe and discuss the experiments conducted using nine different real-world network datasets that provide strong empirical evidence for the potential of application of PH for link prediction, and network analysis in general. Our proposed approach achieves robust performance across all the datasets
when compared with six commonly used baseline methods for link prediction (Sect. 4).
Overview of and Comparison with Related Work: Most methods for link prediction utilize the structural properties of the underlying network to predict the formation of new edges. Some of the most frequently used methods [1,29] utilize the intuition that the likelihood of a link between two nodes is high if they share many common neighbors. Despite being widely adopted due to their intuitive nature and ease of computation, such methods are limited to the second-order neighborhood of the source node and ignore the global structural information about the underlying network. On the other hand, studying the shape features of the graph at varying resolutions enables us to capture the global structure information. Other approaches that consider global information for link prediction include measures based on an ensemble of all paths (such as the Katz score [21]), measures derived from conducting random walks over the graph [2,18], and learning continuous vector representations of nodes in the graph such that the nodes sharing similar structural properties are mapped close to each other in the latent space (e.g., DeepWalk [34], LINE [38], node2vec [15], struc2vec [35]). Ensemble methods that complement the network information with external information such as text documents have also been proposed [6]. In contrast to these methods, which need to explore the entire graph to capture global information, our approach is adaptive: we only study the combined neighborhood, whose size varies depending on the sparsity of the graph. Thus, we can also avoid the large cost of exploring the entire graph.
2 Persistent Homology of a Graph
For a self-contained exposition, we briefly present the definitions of the main concepts used in this work. For a detailed description of the concepts used, an interested reader may refer to advanced textbooks on computational topology, such as the one by Edelsbrunner and Harer [12]. A quick yet sufficient introduction to some more basic concepts can also be found in the extended preprint of this paper [5].
Persistence Diagram: Let $\Delta$ be a finite abstract simplicial complex and $\{\Gamma_i\}_{i \in I}$ s.t. $\emptyset = \Gamma_0 \subseteq \Gamma_1 \subseteq \Gamma_2 \subseteq \dots \subseteq \Gamma_p = \Delta$ be a filtration of $\Delta$. For a pair $i, j$ s.t. $0 \leq i \leq j \leq p$, this inclusion relation among the $\Gamma_i$ induces a homomorphism on the simplicial homology group of each dimension $n \in \mathbb{Z}$ given by $f_n^{i,j} : H_n(\Gamma_i) \to H_n(\Gamma_j)$. The $n$th persistent homology (PH) group is the image of the homomorphism $f_n^{i,j}$, given by $\mathrm{Im}(f_n^{i,j})$. In turn, the $n$th persistent Betti number is defined as the rank of $\mathrm{Im}(f_n^{i,j})$, given by $\beta_n^{i,j} = \mathrm{rank}(\mathrm{Im}(f_n^{i,j}))$. The $n$th persistent Betti number counts how many homology classes of dimension $n$ survive a passage from $\Gamma_i$ to $\Gamma_j$. We say that a homology class $\alpha \in H_n(\Gamma_i)$ is born at resolution $i$ if it did not come from a previous subcomplex: $\alpha \notin \mathrm{Im}(f_n^{i-1,i})$. Similarly, we say that a homology class dies at resolution $j$ if it does not belong to the subcomplex $\Gamma_j$ and belonged to previous subcomplexes. A persistence diagram (PD) is a plot of the points $(i, j)$ corresponding to the birth and death resolutions, respectively, for each of the homology classes. Because a homology class cannot die before it is born, every point $(i, j)$ lies above the diagonal $x = y$. If a homology class does not die after its birth, we draw a vertical line starting from the diagonal in correspondence to its birth. For practical purposes, we take a persistence threshold $\tau$, and assume that every homology class dies at the resolution $\tau$. A typical PD is shown in Fig. 1.
Fig. 1. PD (axes: Birth, Death)
Distance Between PDs: Let $P_1$ and $P_2$ be two PDs. Let $\eta$ be a bijection between the points in the two diagrams. We define the following two distance measures:
(a) Wasserstein-q distance:
$$W_q(P_1, P_2) = \left(\inf_{\eta : P_1 \to P_2} \sum_{p \in P_1} \|p - \eta(p)\|_\infty^q\right)^{1/q} \tag{1}$$
(b) Bottleneck distance:
$$W_\infty(P_1, P_2) = \inf_{\eta : P_1 \to P_2} \sup_{p \in P_1} \|p - \eta(p)\|_\infty \tag{2}$$
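For very small diagrams, both distances can be evaluated by brute force over all bijections. The following sketch is illustrative only and makes one assumption that the text does not spell out: following the usual convention, each diagram is padded with diagonal slots so that a bijection always exists, and a point matched to the diagonal costs half its persistence (its L-infinity distance to the diagonal). The function names are hypothetical, and the factorial running time makes this unsuitable for anything beyond toy inputs.

```python
from itertools import permutations


def _augment(P1, P2):
    """Pad each diagram with one diagonal slot per point of the other diagram."""
    A = [(b, d, False) for b, d in P1] + [(0.0, 0.0, True)] * len(P2)
    B = [(b, d, False) for b, d in P2] + [(0.0, 0.0, True)] * len(P1)
    return A, B


def _cost(p, q):
    """L-infinity matching cost; a diagonal slot means 'match to the diagonal'."""
    (pb, pd, p_diag), (qb, qd, q_diag) = p, q
    if p_diag and q_diag:
        return 0.0
    if p_diag:
        return (qd - qb) / 2.0
    if q_diag:
        return (pd - pb) / 2.0
    return max(abs(pb - qb), abs(pd - qd))


def diagram_distances(P1, P2, q=2):
    """Return (Wasserstein-q, Bottleneck) distances between two small diagrams.

    P1 and P2 are lists of (birth, death) pairs. Every bijection between the
    padded diagrams is enumerated, so this is only feasible for a few points.
    """
    A, B = _augment(P1, P2)
    best_sum, best_max = float("inf"), float("inf")
    for perm in permutations(range(len(B))):
        costs = [_cost(A[i], B[j]) for i, j in enumerate(perm)]
        best_sum = min(best_sum, sum(c ** q for c in costs))
        best_max = min(best_max, max(costs, default=0.0))
    return best_sum ** (1.0 / q), best_max
```

A call such as `diagram_distances([(0.0, 2.0)], [(0.0, 3.0)])` compares two single-point diagrams; production code would instead rely on an optimized matching solver from a TDA library.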
The Wasserstein-q distance is sensitive to small differences in the PDs, whereas the Bottleneck distance captures relatively large differences.
Rips Complex: A Vietoris-Rips complex, also called a Rips complex, is an abstract simplicial complex defined over a finite set of points $X = \{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$ in a metric space $(\mathcal{X}, d)$. Given $X$ and a real number $r > 0$, a Rips complex $R(X, r)$ is formed by connecting the points for which the balls of radius $r/2$ centered at them intersect. In the context of the same point set, we use $R_r$ to denote $R(X, r)$. A 1-simplex is formed by connecting two such points and corresponds to an edge. A 2-simplex is formed by 3 such points and corresponds to a triangular face.
Rips Filtration: Given a set of points $X = \{x_i\}_{i=1}^{n} \subseteq \mathcal{X}$, let $0 = r_0 \leq r_1 \leq r_2 \leq \dots \leq r_m$ denote a finite sequence of increasing real numbers, which we use to construct Rips complexes $\{R_{r_i}\}_{i=1}^{m}$ as defined above. Clearly, by construction of Rips complexes, the sequence $\{R_{r_i}\}_{i=1}^{m}$ is nested and thus provides a filtration of $R_{r_m}$:
$$\emptyset = R_{r_0} \subseteq R_{r_1} \subseteq R_{r_2} \subseteq \dots \subseteq R_{r_m}.$$
Deriving the PH groups via homomorphism over a Rips filtration, we obtain a PD associated with the point set $X$. Please note that to compute the Rips filtration associated with a point set $X$ we need only the relative pairwise distances between the points $x_i \in X$. Essentially, we need a symmetric distance matrix $D = \{d(x_i, x_j)\}_{i,j=1}^{n}$ to compute the PD of $X$. Next, we will use this method to compute the PD of a graph.
Remark: Without going into details, we would like to mention that there are many choices for filtrations and distance metrics available when applying PH
to a graph¹; however, for this application, computational simplicity and well-developed software that could scale to real-world datasets were the main factors for us to decide on the Rips filtration with the shortest-path metric.
2.1 Persistence Diagram of a Graph
Consider a graph $G = (V, E)$, where $V = \{v_i\}_{i=1}^{n}$ is the node set and $E = \{e_i\}_{i=1}^{m}$ is the edge set. We associate a positive weight $w_{e_i} \in \mathbb{R}$, $w_{e_i} > 0$, with each of the elements $e_i \in E$. For an unweighted graph, $w_{e_i} = 1$ for all $e_i \in E$. If two nodes are not connected by an edge, we take the (virtual) edge-weight between them as $\infty$, which for practical purposes is taken as a large positive real number $M \in \mathbb{R}$. The shortest-path distance $D_{sp}(v_i, v_j)$ between the nodes $v_i, v_j \in V$ is defined as the minimum, over all paths starting at $v_i$ and terminating at $v_j$, of the sum of the weights of the edges on the path.
Now consider the metric space $(\mathcal{X}, d)$ equipped with a metric $d$. Let $X = \{x_i\}_{i=1}^{n}$ be a set of points in $(\mathcal{X}, d)$ such that the points in $X$ correspond to the nodes in $V = \{v_i\}_{i=1}^{n}$. In an undirected graph, where the shortest-path distance $D_{sp}$ between any two nodes is symmetric, it makes a natural choice for a metric. We can verify that $D_{sp}$ satisfies all the properties of a metric: for arbitrary $v_i, v_j, v_k \in V$, (a) $D_{sp}(v_i, v_j) \geq 0$, (b) $D_{sp}(v_i, v_j) = 0 \iff v_i = v_j$, (c) $D_{sp}(v_i, v_j) = D_{sp}(v_j, v_i)$ and (d) $D_{sp}(v_i, v_j) + D_{sp}(v_j, v_k) \geq D_{sp}(v_i, v_k)$. Therefore, for points $x_i, x_j \in X$, which correspond to $v_i, v_j \in V$, we take the metric as $d(x_i, x_j) = D_{sp}(v_i, v_j)$.
For a directed graph, the shortest-path distance between two nodes is not symmetric. In this case, $d(x_i, x_j) = D_{sp}(v_i, v_j)$ provides a quasi-metric: it satisfies (a), (b) and (d) as described above. From a quasi-metric $d(x_i, x_j)$, we derive a metric as follows: $f_a(x_i, x_j) = a \times d(x_i, x_j) + (1 - a) \times d(x_j, x_i)$, where $a \in [0, 1/2]$ [40]. For $a = \frac{1}{2}$, $f_a(x_i, x_j)$ is the average of the two directed distances. In this work, for a metric space representation of a directed graph, we take $d(x_j, x_i) = \frac{D_{sp}(v_i, v_j) + D_{sp}(v_j, v_i)}{2}$, where $x_i, x_j \in X$ correspond to $v_i, v_j \in V$.
Computing the all-pair-shortest-path (APSP) in an undirected graph [19] gives a symmetric distance matrix $D = \{d_{ij}\}_{i,j=1}^{n}$. For a directed graph, the distance matrix is not symmetric; therefore, to impose a metric structure we apply the aforementioned method: $d_{ij} = d_{ji} = \frac{d_{ij} + d_{ji}}{2}$. With that, we have a complete pipeline to compare the shape-features of two graphs (or subgraphs) using PH.
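The pipeline of this section can be sketched end to end for small graphs. The code below is a minimal illustration under several assumptions that are not from the paper: graphs are unweighted (so breadth-first search replaces a general shortest-path routine), `big` stands in for the large constant $M$, only 0-dimensional classes are tracked, and an edge between two points is taken to enter the Rips filtration at the scale equal to their distance. All function names are hypothetical; it is not a substitute for an optimized persistence library.

```python
from collections import deque


def distance_matrix(adj, big=1e6, directed=False):
    """All-pairs shortest-path distances of an unweighted graph via BFS.

    Unreachable pairs get the large stand-in value `big`. For directed graphs
    the two directed distances are averaged to obtain a symmetric matrix.
    """
    nodes = sorted(adj)
    D = {}
    for s in nodes:
        dist = {s: 0.0}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1.0
                    queue.append(y)
        for t in nodes:
            D[s, t] = dist.get(t, big)
    if directed:
        D = {(s, t): 0.5 * (D[s, t] + D[t, s]) for s in nodes for t in nodes}
    return nodes, D


def zero_dim_pd(nodes, D, tau):
    """0-dimensional persistence diagram of the Rips filtration on (nodes, D).

    Every component is born at scale 0; two components merge at the scale of
    the shortest connecting distance (Kruskal with union-find). Deaths are
    capped at the persistence threshold tau.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    edges = sorted((D[s, t], s, t) for i, s in enumerate(nodes) for t in nodes[i + 1:])
    diagram = []
    for d, s, t in edges:
        rs, rt = find(s), find(t)
        if rs != rt:
            parent[rs] = rt
            diagram.append((0.0, min(d, tau)))
    diagram.append((0.0, tau))   # the surviving component, cut off at tau
    return diagram
```

For a path on three nodes, `zero_dim_pd(*distance_matrix({0: [1], 1: [0, 2], 2: [1]}), tau=5.0)` yields two classes that die at scale 1 and one that persists up to the threshold.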
3 Link Prediction via Persistent Homology
1 https://topology.ima.umn.edu/node/53.
Fig. 2. Combined-neighborhood of u and v (each shown with its k-nbd), when they have (a) no edges connecting (b) multiple edges connecting.
Having discussed the background to compute the quantitative differences between a pair of subgraphs with respect to their shape-features, we describe
how to use that to understand and predict the existence of a potential link. First, we summarize the entire pipeline of computing the PD for a graph G.
PD Computation: To start with, we compute the all-pair-shortest-path distance matrix D using Johnson's algorithm [19]. In case G is directed, D is made symmetric as described in Sect. 2.1. Thereafter, D and a persistence-threshold τ are used to compute the PD of G. Efficient implementations for PD computations, such as the one by Bauer [3], could be used for this purpose.
Now consider the cases of the combined-neighborhood of nodes u and v as shown in Fig. 2. We consider two scenarios with respect to reasonably extended neighborhoods of the two nodes, as shown in Fig. 2(a) and (b). A potential link is shown by the dotted curve. Essentially, a case of predicting a link between an arbitrary pair of nodes lies on the spectrum of scenarios starting at the one shown in Fig. 2(a) and stretching towards the ones similar to Fig. 2(b). As we explained before, the existence of a possible link has higher chances as we move away from the case of Fig. 2(a) on this spectrum. With that observation, we explore and understand how the difference in shape-measures, as provided by the distances between the PDs of a number of subgraphs induced by the combined-neighborhood of u and v, varies when we examine the cases of arbitrary pairs of nodes. This is presented in Algorithm 1.
Given a graph $G = (\{v_i\}_{i=1}^{n}, \{e_k\}_{k=1}^{m})$, for a $k \leq n$, first, we compute the subgraph of G induced by the i-hop neighbors of u and v, where 1 ≤ i ≤ k; see lines 2 and 3. Thereafter, we compute the subgraph induced by an i-hop combined-neighborhood of the two nodes, where 1 ≤ i ≤ r; see line 4. The radii of the individual and combined neighborhoods, k and r, respectively, are chosen such that there could be a positive probability of covering of the combined-neighborhood by the union of the two individual neighborhoods, and therefore, k ≤ 2r. From this subgraph, we induce two subgraphs corresponding to the existence and non-existence of a link between the query nodes; see lines 5 and 6. Following our intuition, a missing link in a complete graph has high chances of existence; therefore, we also construct a complete graph over the nodes of the combined neighborhood, line 7. Having collected these subgraphs, we compute their PDs as described previously. In the PDs, we have considered only the 0th PH groups. This is because the cycles in a graph, which correspond to its 1st PH group, are never destroyed as
Algorithm 1. Bottleneck and Wasserstein-2 dist. computation.
Input: Graph G, Nodes u, v, Neighborhood radius k, Combined-neighborhood radius r, persistence-threshold τ, a boolean isD to indicate if directed.
1: Algorithm GetDist(G, u, v, k, r, τ, isD)
2:   N_u^k ← GetNbrs(u, k);  ▷ Induced subgraph over i-hop neighbors of u, where 1 ≤ i ≤ k.
3:   N_v^k ← GetNbrs(v, k);
4:   N_{u,v}^r ← GetCombinedNbrs(u, v, r);  ▷ Induced subgraph over i-neighbors of u or v or both, where 1 ≤ i ≤ r.
5:   N_{u,v}^{r+} ← N_{u,v}^r ∪ (u, v);  ▷ Induced subgraph N_{u,v}^r augmented with the edge (u, v).
6:   N_{u,v}^{r−} ← N_{u,v}^r − (u, v);  ▷ Induced subgraph N_{u,v}^r without the edge (u, v).
7:   C(N_{u,v}^r) ← MakeComplete(N_{u,v}^r);  ▷ The complete graph over the nodes of N_{u,v}^r.
8:   P_u ← PD(N_u^k, τ, isD);  ▷ Persistence diagram of the subgraph induced by N_u^k.
9:   P_v ← PD(N_v^k, τ, isD); P^+ ← PD(N_{u,v}^{r+}, τ, isD);
10:  P^− ← PD(N_{u,v}^{r−}, τ, isD); P^c ← PD(C(N_{u,v}^r), τ, isD);
11:  d_1 ← W2Dis(P^+, P^−); d_2 ← W2Dis(P^+, P^c);
12:  d_3 ← W2Dis(P^+, P_u); d_4 ← W2Dis(P^+, P_v);
13:  ▷ Wasserstein-2 distances between the Ps
14:  d_5 ← BDis(P^+, P^−); d_6 ← BDis(P^+, P^c);
15:  d_7 ← BDis(P^+, P_u); d_8 ← BDis(P^+, P_v);
16:  ▷ Bottleneck distances between the Ps
17:  d̃ ← {d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8};  ▷ A vector of the eight distances.
18:  Output d̃;
19: end Algorithm
there are no 2faces. Thus, for our purpose, distances between the 1dimensional PDs of the subgraphs would not help much. In the subsequent discussion, by the topological features we shall mean the 0th dimensional features i.e. the number of connected components. We compute the Wasserstein2: d1 , d2 , d3 and d4 , and the Bottleneck: d5 , d6 , d7 and d8 distances between the PDs, as shown in the lines 11 to 15. They signify how much the induced subgraphs are dissimilar with respect to their shapefeatures. We use di s, 1 ≤ i ≤ 8, in our experiments to perform linkprediction as a ranking task (Sect. 4). Computational Cost: To implement Algorithm 1, we leveraged parallelization as much as possible. For example, for shortestpath computation, we use a simple sharedmemory threadbased parallelization of applying Dijkstra’s algorithm, ˜ which runs in O(V 2 ) (assuming V  > E), for each of the nodes, and thus ˜ pay roughly O(V 3 /p), where p is the number of threads, and store the APSP matrix in a database. The neighborhood and combined neighborhood computation steps are linear in the maximum degree, thus O(V ). The PD computation is performed by reduction of the APSP matrix to cost O(V 3 ) arithmetic operations. Wq and W∞ distance computation steps are linear in the size of PDs.
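The PD computation step can be illustrated for the 0th PH group with a minimal sketch. It assumes an unweighted, undirected graph stored as an adjacency dictionary, uses plain BFS instead of Johnson's algorithm, and uses a union-find merge over sorted pairwise distances instead of Ripser; the function names `apsp_unweighted` and `h0_persistence` are hypothetical and are not part of the authors' implementation.

```python
from collections import deque


def apsp_unweighted(adj):
    """All-pairs shortest-path distances via one BFS per node (unweighted graph)."""
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in d:
                    d[y] = d[x] + 1
                    queue.append(y)
        dist[source] = d
    return dist


def h0_persistence(adj, tau):
    """0-dimensional persistence pairs (birth, death) of the Rips filtration
    on the shortest-path metric, truncated at the persistence threshold tau."""
    dist = apsp_unweighted(adj)
    nodes = list(adj)
    parent = {x: x for x in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Candidate merges: all reachable node pairs within distance tau, by distance.
    pairs = sorted(
        (dist[a][b], i, j)
        for i, a in enumerate(nodes)
        for j, b in enumerate(nodes)
        if i < j and b in dist[a] and dist[a][b] <= tau
    )
    diagram = []
    for d, i, j in pairs:
        root_a, root_b = find(nodes[i]), find(nodes[j])
        if root_a != root_b:
            parent[root_a] = root_b
            diagram.append((0, d))  # one component dies at scale d
    # Components that never merge persist forever.
    survivors = len(nodes) - len(diagram)
    diagram.extend([(0, float("inf"))] * survivors)
    return diagram
```

Every node is born at scale 0, and a component dies at the distance at which it is merged into another one, so the finite death times are exactly the edge weights of a minimum spanning forest of the thresholded distance matrix.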
Effectively, Algorithm 1 costs O(|V|^3). Next, we sketch a theoretical justification of our approach.

3.1 Why This Algorithm Works?
While the commonly used link-prediction heuristics [1,23,29,39] have been empirically validated, to the best of our knowledge, only a limited number of works [9,36] have explored why such methods should work. McPherson et al. [28] suggest that networks of real-life interactions stem from homophily. Hoff et al. [17] introduced a statistical model for such networks, which was extended by Sarkar et al. [36]. Essentially, all these models represent a graph node by a point in a latent d-dimensional Euclidean space and suggest that the probability of the existence of a link between two query nodes u and v can be defined in terms of a parameterized logistic function of the distance between the corresponding points as follows [36]:

P(u ∼ v | d_uv) = \frac{1}{1 + e^{α(d_uv − r)}}    (3)

where u ∼ v denotes the existence of a link between the nodes u and v, and α and r are parameters controlling the sharpness of the function and the sociability of the nodes, respectively. Thus, a smaller distance d_uv in the latent space implies a higher probability of a link between u and v. Under the constraints of space, we now explain how decreasing the distances d1 to d8 in Algorithm 1 corresponds to decreasing d_uv in Eq. (3).
First, note that the distances d_i, 1 ≤ i ≤ 8, are essentially based on the optimal matchings between the PDs and behave very differently from Euclidean metrics. See Eqs. (1) and (2): the higher the value of η(p) for each p ∈ P1, the lower are W_q(P1, P2) and W_∞(P1, P2). η(p), as a bijection, represents matchings between the PDs P1 and P2. Thus, lower values of the d_i reflect higher matchings between the PDs, indicating that the compared subgraphs have more similar topological features. An attentive reader would also have noticed that the PDs that we compare to generate the d_i correspond to the simplicial subcomplexes over subsets of the same dataset obtained by the embedding of a graph in a metric space. It is easy to observe that these subsets overlap by virtue of the construction of the subgraphs induced by the combined neighborhoods of the query nodes. In this setting, a higher matching in the PDs indicates highly similar topological features, and these similar features are over the common subset of the subgraphs. Now, we discuss the individual subgraph comparison summaries captured by the d_i:
(a) d1 and d5: smaller values of d1 and d5 indicate that augmenting a possible edge between the query nodes does not change the topological features of the subgraph induced by the combined neighborhood.
(b) d3 and d7: their smaller values imply that the combined neighborhood itself is not much different from the neighborhood of the first node in terms of the topological features.
(c) d4 and d8: same as (b) for the second node.
(d) d2 and d6: smaller values of d2 and d6 indicate that the subgraph induced by the combined neighborhood is closer to a complete graph in terms of topological features.
Let n_l(u, v) denote the number of paths of length l between the nodes u and v. From the above summary, in general terms, it can be inferred that the smaller the values of the d_i, 1 ≤ i ≤ 8, (a) the closer the combined neighborhood lies to the structure shown in Fig. 2(b) on the spectrum of scenarios mentioned in Sect. 3. For example, smaller d2 and d6 would indicate that the combined neighborhood is similar to a complete graph, in which the likelihood of completion of a missing link is very high, and (b) because of the higher overlap of the neighborhood subgraphs, n_l(u, v) is nonzero for an increasing number of small path lengths l. In our method, the metric space embedding of the graph translates it into a point cloud in Euclidean space where, even though the points are at non-deterministic positions, the distances between them are deterministic. Essentially, it aligns with the deterministic model (see Sects. 3 and 4 of [36]), with (a) identical radii for unweighted graphs and (b) non-identical radii for weighted graphs. Thus, in the spirit of the discussion in Sect. 5 on the bounds over d_uv, Lemma 5.7, and Theorem 5.8 in the paper by Sarkar et al. [36], and inferring from point (b) in the previous paragraph, P(u ∼ v | d_uv) increases with a decrease in the values of the d_i, 1 ≤ i ≤ 8.
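The latent-space link probability of Eq. (3) can be evaluated numerically as in the following small sketch. The function name `link_probability` and the parameter values in the usage note are illustrative assumptions only; they are not taken from the paper or from [36].

```python
import math


def link_probability(d_uv, alpha, r):
    """Logistic link probability of Eq. (3): 1 / (1 + exp(alpha * (d_uv - r)))."""
    return 1.0 / (1.0 + math.exp(alpha * (d_uv - r)))
```

For instance, with the hypothetical values alpha = 2 and r = 1, a pair at distance d_uv = r has probability exactly 0.5, and the probability grows as d_uv shrinks below r, which is the monotonic behavior the argument above relies on.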
4 Experiments

4.1 Experimental Protocol
Datasets: Table 1 lists the nine publicly available datasets that were used for evaluating our proposed approach. The datasets selected are from different domains and widely used in the study of complex networks.

Table 1. Different datasets used in experiments
Dataset        # nodes   # edges   N/w type
DC [32]        112       425       Word co-occurrence n/w
ATC (a)        1226      2615      Air traffic n/w
Cora [26]      2708      5429      Citation n/w
Euroad [37]    1174      1417      Road n/w
Figeyes [14]   2239      6452      Protein interaction n/w
Yeast [10]     1870      2277      Protein interaction n/w
Power [41]     4941      6594      Power grid n/w
arXiv [24]     5242      14496     Collaboration n/w
Twitter [27]   23370     33101     Social n/w
(a) http://research.mssm.edu/maayan/datasets/qualitative_networks.shtml

Baselines: We compare the performance of our approach with six frequently used methods for link prediction. We consider Common Neighbors (CN), Adamic-Adar (AA) [1] and Milne-Witten (MW) [29] as representative local methods. We chose Preferential Attachment (PA), node2vec (N2V) [15], and struc2vec (S2V) [35] as representative global methods.
Implementation: We implemented our approach in C++ using the Ripser library [3] for computing PDs. We used the publicly available code (footnote 2) of Kerber et al. [22] to compute the W_2 and W_∞ distances. We fixed the persistence threshold τ = 4. It was empirically found that beyond τ = 4 the PD did not change. The neighborhood and combined neighborhood radii k and r are taken as L/4 and L/2, respectively, where L is the shortest path distance between the two query nodes. This selection of k and r is adaptable to the position of the query nodes and ensures that there is a reasonable intersection of their neighborhoods. Empirically, we found that increasing these values did not change the distances d_i but only increased the computation time. We implemented the baselines AA, MW, CN, and PA in C++ and used author-provided source code for node2vec and struc2vec. For all the datasets, we removed 5% of the edges, making sure that the residual graph remains connected. We then compare the performance of the different methods in recovering the removed edges using information from the residual graph (Sect. 4.2). All the datasets and our source code are available for download (footnote 3).
Footnote 2: https://bitbucket.org/grey_narn/hera/src/master/

4.2 Results and Discussions
Traditionally, the problem of link prediction has been addressed as a ranking problem where, given a source node, a ranked list of target nodes is produced, ordered by the likelihood of a potential link being formed between the source and the target nodes [20,25]. The baselines CN, AA, MW, and PA, by definition, output a score between the source and target node that can be used as the ranking function. The other two baselines, N2V and S2V, learn continuous vector representations for each node in the graph. A typical way to rank target nodes given a source node is to rank them based on their distance from the source node [30]. Hence, for these methods, given a source node, we produce a ranked list of all the other nodes in the graph ordered by the Euclidean distance between the source and target node vectors. Given a pair of source and target nodes, our proposed approach produces eight different distance values (Algorithm 1) capturing different topological properties. In order to produce a ranked list that combines the properties captured by the different distance functions, we use the rank product metric [13] to combine the ranked lists produced by the individual distance functions and obtain the final ranking of target nodes with respect to a given source node. For a node i, the rank product is computed as rp_i = (\prod_{j=1}^{m} r_{ij})^{1/m}, where r_{ij} is the rank of node i in the j-th ranked list and m is the number of ranked lists. Table 2 summarizes the results achieved by the six baselines and our proposed approach (PH). We report Hit Rate@N (for N ∈ {10, 50, 100}), i.e., the proportion of edges for which the correct target node was ranked in the top N positions. Observe that our approach outperforms the baselines in most cases, and is a close second in others. Also note that while the methods based on the immediate neighborhood achieve the best values for five out of nine datasets in terms of Hits@10, the methods that utilize global network information generally outperform the local methods at higher ranks.
This is expected as the local methods work in a small, though highly relevant, search space of nodes in the immediate neighborhood of the query nodes. Thus, they are able to predict the links for the few test cases that lie in this small search space. However, they fail for hard test cases that lie outside this search space. For instance, in the Euroad dataset, only 6 out of 70 test cases lie in the first-order neighborhood of the query nodes, resulting in poor performance of local methods. On the other hand, the global methods (N2V, S2V, PH) perform better at higher ranks as they are not limited to this small search space. The robust performance achieved by the proposed approach, for all the datasets and at different ranks, is commendable given that the proposed approach uses only eight features (distance functions comparing the topological properties) that can be computed with relative ease compared to the computationally expensive learning of vector representations (as is the case with node2vec and struc2vec). Further, unlike the CN, AA, MW, and PA baselines, which are also easy to compute, the proposed approach is built upon solid theoretical foundations and is not limited to the immediate neighborhood of the query nodes.
Footnote 3: https://github.com/sumitresearch/persistenthomologylinkprediction

Table 2. Performance of different methods on nine different datasets for the link prediction task. Hits at ranks 10, 50, and 100 are reported. For each dataset, the method achieving the highest hits at a given rank is marked with an asterisk.

Hits @ 10
Dataset   CN     AA     MW     PA     S2V    N2V    PH
DC        .190   .285   .142   .333*  .142   .095   .000
ATC       .100*  .061   .053   .023   .038   .061   .077
Cora      .180   .080   .074   .016   .028   .048   .332*
Euroad    .085   .085   .085   .014   .000   .100   .185*
Figeyes   .000   .006   .000   .000   .006   .012*  .003
Yeast     .212   .247*  .159   .008   .017   .150   .183
Power     .227   .209   .182   .000   .015   .246   .267*
Arxiv     .580   .587*  .135   .015   .122   .480   .237
Twitter   .055*  .046   .047   .000   .003   .000   .003

Hits @ 50
Dataset   CN     AA     MW     PA     S2V    N2V    PH
DC        .571   .666   .619   .571   .476   .476   .761*
ATC       .161   .076   .092   .130   .138   .238   .263*
Cora      .232   .080   .074   .038   .052   .118   .338*
Euroad    .085   .085   .085   .028   .114   .557   .600*
Figeyes   .012   .006   .018   .003   .018   .024   .027*
Yeast     .256   .292   .283   .079   .053   .292   .339*
Power     .255   .255   .255   .009   .039   .574   .595*
Arxiv     .849   .874*  .526   .070   .219   .823   .723
Twitter   .085   .053   .161*  .002   .010   .001   .117

Hits @ 100
Dataset   CN     AA     MW     PA     S2V    N2V    PH
DC        .714   .714   .714   1.00*  .952   .952   1.00*
ATC       .161   .076   .092   .215   .184   .384*  .372
Cora      .252   .080   .074   .040   .072   .144   .338*
Euroad    .085   .085   .085   .071   .214   .742*  .728
Figeyes   .015   .006   .024   .015   .043   .043   .046*
Yeast     .256   .292   .292   .159   .106   .362   .385*
Power     .255   .255   .255   .030   .072   .680   .747*
Arxiv     .904   .918*  .709   .114   .238   .897   .865
Twitter   .087   .053   .236   .011   .015   .001   .276*
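The rank-product combination and the Hit Rate@N measure used in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' C++ code: the helper names `ranks_from_distances`, `rank_product`, and `hit_rate_at_n` are hypothetical, ties in distance are broken by node label (an arbitrary choice), and each test case is assumed to be given as a ranked target list plus the true target.

```python
import math


def ranks_from_distances(distances):
    """Map node -> 1-based rank, where a smaller distance gives a better rank.
    Ties are broken by node label (an arbitrary, assumed convention)."""
    ordered = sorted(distances, key=lambda node: (distances[node], str(node)))
    return {node: pos for pos, node in enumerate(ordered, start=1)}


def rank_product(rank_lists):
    """Geometric mean of a node's ranks over m ranked lists:
    rp_i = (prod_j r_ij) ** (1 / m)."""
    m = len(rank_lists)
    return {
        node: math.prod(ranks[node] for ranks in rank_lists) ** (1.0 / m)
        for node in rank_lists[0]
    }


def hit_rate_at_n(test_cases, n):
    """Fraction of test cases whose true target appears in the top-n ranked targets.
    Each test case is a pair (ranked_targets, true_target)."""
    hits = sum(1 for ranked, truth in test_cases if truth in ranked[:n])
    return hits / len(test_cases)
```

In the setting of Algorithm 1, one ranked list would be built per distance d1 to d8 (so m = 8), and the target nodes would then be ordered by ascending rank product before computing Hit Rate@N.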
5 Conclusions and Future Work
We proposed an approach inspired by persistent homology to model link formation in graphs and used it to predict missing links. Our approach achieved robust and stable performance across nine datasets, outperforming many frequently used baseline methods despite being relatively simple and computationally less expensive. Given that the topological features succinctly capture information about the shape and structure of the network and can be computed without the need for extensive training, it will be worth exploring how these features can be combined with other techniques for network analysis.
References
1. Adamic, L.A., Adar, E.: Friends and neighbors on the web. Soc. Netw. 25(3), 211–230 (2003)
2. Backstrom, L., Leskovec, J.: Supervised random walks: predicting and recommending links in social networks. In: WSDM 2011 (2011)
3. Bauer, U.: Ripser (2018). https://github.com/Ripser/ripser
4. Bhatia, S., Caragea, C., Chen, H.H., Wu, J., Treeratpituk, P., Wu, Z., Khabsa, M., Mitra, P., Giles, C.L.: Specialized research datasets in the citeseerx digital library. D-Lib Mag. 18(7/8) (2012)
5. Bhatia, S., Chatterjee, B., Nathani, D., Kaul, M.: Understanding and predicting links in graphs: a persistent homology perspective. arXiv preprint arXiv:1811.04049 (2018)
6. Bhatia, S., Vishwakarma, H.: Know thy neighbors, and more!: studying the role of context in entity recommendation. In: Hypertext (HT), pp. 87–95 (2018)
7. Carstens, C.J., Horadam, K.J.: Persistent homology of collaboration networks. Math. Probl. Eng. 2013, 7 (2013)
8. Chung, M.K., Bubenik, P., Kim, P.T.: Persistence diagrams of cortical surface data. In: International Conference on Information Processing in Medical Imaging, pp. 386–397 (2009)
9. Cohen, S., Zohar, A.: An axiomatic approach to link prediction. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)
10. Coulomb, S., Bauer, M., Bernard, D., Marsolier-Kergoat, M.C.: Gene essentiality and the topology of protein interaction networks. Proc. R. Soc. B: Biol. Sci. 272(1573), 1721–1725 (2005)
11. Edelsbrunner, H., Harer, J.: Persistent homology - a survey. Contemp. Math. 453, 257–282 (2008)
12. Edelsbrunner, H., Harer, J.: Computational Topology - An Introduction. American Mathematical Society, Providence (2010)
13. Eisinga, R., Breitling, R., Heskes, T.: The exact probability distribution of the rank product statistics for replicated experiments. FEBS Lett. 587(6), 677–682 (2013)
14. Ewing, R.M., Chu, P., Elisma, F., Li, H., Taylor, P., Climie, S., McBroom-Cerajewski, L., Robinson, M.D., O'Connor, L., Li, M., et al.: Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol. 3(1), 89 (2007)
15. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD, pp. 855–864 (2016)
16. Hajij, M., Wang, B., Scheidegger, C., Rosen, P.: Visual detection of structural changes in time-varying graphs using persistent homology. In: PacificVis, pp. 125–134. IEEE (2018)
17. Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. J. Am. Stat. Assoc. 97(460), 1090–1098 (2002)
18. Jeh, G., Widom, J.: SimRank: a measure of structural-context similarity, pp. 538–543. ACM (2002)
19. Johnson, D.B.: Efficient algorithms for shortest paths in sparse networks. J. ACM (JACM) 24(1), 1–13 (1977)
20. Kataria, S., Mitra, P., Bhatia, S.: Utilizing context in generative Bayesian models for linked corpus. In: AAAI, vol. 10, p. 1 (2010)
21. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43 (1953)
22. Kerber, M., Morozov, D., Nigmetov, A.: Geometry helps to compare persistence diagrams. In: 2016 Proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 103–112. SIAM (2016)
23. Leskovec, J., Backstrom, L., Kumar, R., Tomkins, A.: Microscopic evolution of social networks. In: KDD, pp. 462–470 (2008)
24. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graph evolution: densification and shrinking diameters. ACM Trans. Knowl. Discov. Data 1(1) (2007)
25. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
26. Lu, Q., Getoor, L.: Link-based classification. In: Fawcett, T., Mishra, N. (eds.) ICML, pp. 496–503. AAAI Press (2003). http://www.aaai.org/Library/ICML/2003/icml03066.php
27. McAuley, J., Leskovec, J.: Learning to discover social circles in ego networks. In: NIPS, pp. 548–556 (2012)
28. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Annu. Rev. Sociol. 27(1), 415–444 (2001)
29. Milne, D., Witten, I.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp. 25–30 (2008)
30. Misra, V., Bhatia, S.: Bernoulli embeddings for graphs. In: AAAI, pp. 3812–3819 (2018)
31. Nagarajan, M., et al.: Predicting future scientific discoveries based on a networked analysis of the past literature. In: KDD, pp. 2019–2028. ACM (2015)
32. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74(3), 036104 (2006)
33. Pal, S., Moore, T.J., Ramanathan, R., Swami, A.: Comparative topological signatures of growing collaboration networks. In: Workshop on Complex Networks CompleNet, pp. 201–209. Springer (2017)
34. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: KDD, pp. 701–710 (2014)
35. Ribeiro, L.F., Saverese, P.H., Figueiredo, D.R.: struc2vec: learning node representations from structural identity. In: KDD, pp. 385–394 (2017)
36. Sarkar, P., Chakrabarti, D., Moore, A.W.: Theoretical justification of popular link prediction heuristics. In: IJCAI (2011)
37. Šubelj, L., Bajec, M.: Robust network community detection using balanced propagation. Eur. Phys. J. B 81(3), 353–362 (2011)
38. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: Large-scale information network embedding. In: WWW, pp. 1067–1077 (2015)
39. Tang, L., Liu, H.: Relational learning via latent social dimensions. In: KDD, pp. 817–826 (2009)
40. Turner, K.: Generalizations of the rips filtration for quasi-metric spaces with persistent homology stability results. arXiv preprint arXiv:1608.00365 (2016)
41. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393(6684), 440 (1998)
42. Zhu, X.: Persistent homology: an introduction and a new text representation for natural language processing. In: IJCAI (2013)
The Role of Network Size for the Robustness of Centrality Measures
Christoph Martin and Peter Niemeyer
Institute of Information Systems, Leuphana University of Lüneburg, 21335 Lüneburg, Germany
{cmartin,niemeyer}@uni.leuphana.de
Abstract. Measurement errors are omnipresent in network data. Studies have shown that these errors have a severe impact on the robustness of centrality measures. It has been observed that the robustness mainly depends on the network structure, the centrality measure, and the type of error. Previous findings regarding the influence of network size on robustness are, however, inconclusive. Based on twenty-four empirical networks, we investigate the relationship between global network measures, especially network size and average degree, and the robustness of the degree, eigenvector centrality, and PageRank. We demonstrate that, in the vast majority of cases, networks with a higher average degree are more robust. For random graphs, we observe that the robustness of Erdős–Rényi (ER) networks decreases with an increasing average degree, whereas with Barabási–Albert networks, the opposite effect occurs: with an increasing average degree, the robustness also increases. As a first step into an analytical discussion, we prove that for ER networks of different size but with the same average degree, the robustness of the degree centrality remains stable.
Keywords: Centrality · Robustness · Measurement error · Missing data · Noisy data · Sampling

1 Introduction
Networks are used to model various real-world phenomena. Typical use cases are (online) social networks, web graphs, protein-protein interaction networks, infrastructure networks, and many more [23]. Networks model the pairwise relationship of objects, which makes them sensitive to errors in the data underlying the network. The reasons for such errors are manifold. When collecting data for a social network, for example, actors may be missing on the day of the survey or the number of friends that can be nominated may be limited by the survey questionnaire [30]. The collection of protein-protein interaction data is, depending on the method used, inevitably associated with uncertainty, which is consequently also part of the network constructed from this data [6]. When creating co-authorship or citation networks, authors or papers can be included multiple times or not at all, for example, due to incorrect spelling [8,26]. All these errors affect the outcome of network analysis methods and thus the conclusions that depend on these methods [17,20].
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 40–51, 2020. https://doi.org/10.1007/978-3-030-36687-2_4
In the field of network analysis, centrality measures are commonly used to analyze individual nodes. These measures map a real number to every node in the network, which can be used to rank the nodes by their "importance"; the definition of importance here depends on the domain and the specific research question. It is well known that errors in the network data can have a severe impact on the reliability of centrality measures. For example, the best-ranked actor in the erroneous network might not actually be the best-ranked actor in the clean network. We measure this impact using the concept of robustness of centrality measures, which is the rank correlation between the centrality values in the clean and the erroneous network [13,15,21,31]. The effects of errors on the robustness of centrality measures depend on several variables, e.g., the type of centrality measure, the type and extent of the error, the network structure, and how we measure the robustness [9,27].
In this article, we study the robustness of centrality measures in larger networks. We are especially interested in whether global network measures can explain the robustness. Existing studies are inconclusive about this, especially about the relationship between network size and robustness. No relationship between size and robustness is noticeable in the empirical part of [24]. In [5] and [3], the authors observed that larger network size could be related to both higher and lower robustness, depending on the network structure. In [31], the smaller network is usually more robust than the larger one. In contrast, [27] noticed that larger networks are frequently more robust. Moreover, existing studies have mostly been concerned with smaller networks (approx. less than 1000 nodes). For a comprehensive review of the existing work on the robustness of centrality measures, we refer to [28].
To examine these contrary observations in greater detail, we proceed as follows in this paper: First, we investigate the robustness of centrality measures in 24 empirical networks coming from diverse domains. We focus on degree, eigenvector centrality, and PageRank. Both the eigenvector centrality and the PageRank are feedback measures and fast to calculate [16]. However, they have rarely been considered simultaneously in previous studies. Since PageRank can be very stable in scale-free networks [10], a comparison with the eigenvector centrality is therefore interesting. We hardly observe any association between network size and robustness, but a high correlation between average degree and robustness. This observation holds for all considered centrality measures and error types that involve removing nodes or edges. We further investigate the effect of network size on the robustness using the Erdős–Rényi (ER) and the Barabási–Albert (BA) random graph models. For both models, we observe that the robustness is independent of network size if the average degree remains constant. If the average degree increases, then centrality measures in BA graphs become more robust, in contrast to ER graphs.
As a ﬁrst step into an analytical discussion, we prove that for ER networks of diﬀerent size but with the same average degree, the robustness of the degree centrality remains stable. As a consequence, there exist robust and nonrobust networks of varying sizes, at least w.r.t. the degree centrality.
2 Methods
A graph G(V, E) consists of a vertex set V and an edge set E, E ⊆ \binom{V(G)}{2}. We denote the number of vertices in G by N = |V(G)| and the number of edges by M = |E(G)|. All graphs considered in this paper are undirected, unweighted, and simple, i.e., they contain neither loops nor multiple edges. The adjacency matrix of a graph is denoted by A, where A_{i,j} = 1 if there is an edge between vertices v_i and v_j (i.e., {v_i, v_j} ∈ E(G)) and 0 otherwise. The neighborhood of a node u is Γ(u) = {v : {u, v} ∈ E(G)}. It is the set of nodes that are connected to u. The degree is the number of connections that a node has, degree(u) = |Γ(u)|. The degree of an edge is the sum of the degree values of the source and target node, degree({u, v}) = degree(u) + degree(v). The average degree of a graph G is defined as degree(G) = \frac{1}{N} \sum_{u} degree(u). Density is the ratio of the number of existing edges to the maximum possible number of edges in a graph, density(G) = \frac{2M}{N(N − 1)}. The transitivity of a graph is defined as transitivity(G) = (number of closed paths of length two) / (number of paths of length two) [23].
In contrast to global measures, centrality measures map a real number to every node in the graph. These values are, depending on the context, often interpreted as a proxy for the "importance" of the node and, thus, used to rank the nodes. We denote the centrality value for a specific node u in a graph G w.r.t. a centrality measure c by c_G(u). If the context permits, we do not explicitly mention the graph. The vector of centrality values for all nodes in G is defined as c(G) = (c_G(v_1), ..., c_G(v_N)). The most straightforward centrality measure is the degree centrality, which was already discussed above, degree(u) = |Γ(u)|. The eigenvector centrality of a node u is proportional to the sum of the centrality values of its neighbors: evc(u) = \frac{1}{λ} \sum_{v ∈ Γ(u)} evc(v), where λ is the largest eigenvalue of A [2]. The PageRank is defined as PageRank(u) = d \sum_{v ∈ Γ(u)} \frac{PageRank(v)}{degree(v)} + (1 − d) with d as damping factor (in our case 0.85) [4].
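The global measures and the PageRank recursion defined above can be sketched in plain Python as follows. This is an illustrative sketch, not the NetworkX routines used in the study: graphs are assumed to be given as a dictionary mapping each node to the set of its neighbors, the PageRank is the unnormalized fixed point of the formula exactly as written (so the values sum to N rather than 1), and the iteration count of 100 is an arbitrary assumption.

```python
def average_degree(adj):
    """Average degree: (1 / N) * sum of node degrees."""
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)


def density(adj):
    """Density: 2M / (N (N - 1)) for a simple undirected graph."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    return 2 * m / (n * (n - 1))


def pagerank(adj, d=0.85, iterations=100):
    """Fixed-point iteration of
    PageRank(u) = d * sum_{v in Gamma(u)} PageRank(v) / degree(v) + (1 - d)."""
    pr = {u: 1.0 for u in adj}
    for _ in range(iterations):
        pr = {
            u: d * sum(pr[v] / len(adj[v]) for v in adj[u]) + (1 - d)
            for u in adj
        }
    return pr
```

On a triangle every node keeps the value 1, while on a star graph the center receives a clearly larger value than the leaves, which matches the reading of PageRank as a feedback measure.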
Error Mechanisms. When collecting data, external factors and the selection of the sampling method can lead to inaccurate network data. We use four procedures to simulate the impact of errors on information about nodes and edges. We call these procedures error mechanisms. They simulate an error that aﬀects the nodes or edges of a network. Their inputs are a graph G and a parameter α which controls the intensity of the error. The procedure returns one network from the set of all possible erroneous versions of the graph G. For a more detailed discussion of the error mechanisms, see [21]. In this study, we use the following error mechanisms:
add edges (e+): αN edges are added to the graph. The new edges are chosen uniformly at random from the \binom{N}{2} − M possible edges.
remove edges unif. (e−): αN edges are removed from the graph. The edges are chosen uniformly at random from E(G).
remove edges degree (e−(p)): αN edges are removed as well. The edges are, however, chosen with probability proportional to the edge degree, i.e., P({u, v}) = \frac{degree({u, v})}{\sum_{e ∈ E(G)} degree(e)}.
remove nodes (n−): αN nodes are removed from the graph. The nodes are chosen uniformly at random from V(G).
Robustness of Centrality Measures. To quantify the impact of errors in data collection on centrality measures, we use the concept of robustness, which measures how the ranking of nodes induced by the centrality measure changes. In the same way as [15,31], we use Kendall's tau ("tau-b") rank correlation coefficient [14]. For two graphs on the same vertex set, G and H, and a centrality measure c, the robustness is defined as follows:

τ_c(G, H) = \frac{n_c − n_d}{\sqrt{(n_c + n_d + n_t)(n_c + n_d + n_t')}}    (1)

The numbers of concordant and discordant pairs w.r.t. c(G) and c(H) are n_c and n_d, respectively. A pair of nodes u, v is concordant if (c_G(v) − c_G(u)) · (c_H(v) − c_H(u)) > 0 and discordant if (c_G(v) − c_G(u)) · (c_H(v) − c_H(u)) < 0. Ties in c(G) and c(H) (i.e., c_G(v) − c_G(u) = 0 or c_H(v) − c_H(u) = 0, respectively) are denoted by n_t and n_t'. If G and H are not on the same vertex set then, similar to [31], we only consider nodes that exist in both graphs.
Random Graph Models. The Erdős–Rényi random graph model has two parameters: the number of nodes n and the edge probability p. Since all node pairs are connected with the same probability p, the degree distribution of the nodes in this model follows a binomial distribution [7]. In contrast, the Barabási–Albert model is based on the idea of preferential attachment. Consequently, the probability that a new node will connect to an existing node depends on the degree of the existing node. This model also has two parameters. In addition to the number of nodes n, the parameter m specifies the number of connections that a new node makes to existing nodes. Due to this generation process, the degree distribution of the nodes in graphs generated by this model follows a power-law distribution [1] (footnote 1).
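The robustness of Eq. (1) and one of the error mechanisms (uniform edge removal, e−) can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, edges are assumed to be tuples of comparable node labels, and n_t and n_t' are taken to count pairs tied in only one of the two rankings, which is the usual tau-b convention and an assumption about the intended reading of Eq. (1).

```python
import random
from itertools import combinations
from math import sqrt


def remove_edges_uniform(edges, n_remove, rng):
    """Error mechanism e-: drop n_remove edges chosen uniformly at random."""
    removed = set(rng.sample(sorted(edges), n_remove))
    return [edge for edge in edges if edge not in removed]


def degree_centrality(nodes, edges):
    """Degree of every node, given an undirected edge list."""
    degree = {u: 0 for u in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def kendall_tau_b(c_g, c_h):
    """Kendall's tau-b between two centrality dictionaries,
    restricted to nodes present in both graphs."""
    nodes = sorted(set(c_g) & set(c_h))
    n_c = n_d = ties_g = ties_h = 0
    for u, v in combinations(nodes, 2):
        diff_g = c_g[v] - c_g[u]
        diff_h = c_h[v] - c_h[u]
        if diff_g * diff_h > 0:
            n_c += 1            # concordant pair
        elif diff_g * diff_h < 0:
            n_d += 1            # discordant pair
        elif diff_g == 0 and diff_h != 0:
            ties_g += 1         # tied only in c(G)
        elif diff_h == 0 and diff_g != 0:
            ties_h += 1         # tied only in c(H)
        # pairs tied in both rankings contribute to neither term
    denominator = sqrt((n_c + n_d + ties_g) * (n_c + n_d + ties_h))
    return (n_c - n_d) / denominator if denominator else float("nan")
```

A single simulation step would then remove edges from G to obtain H, compute the degree centrality on both graphs, and report `kendall_tau_b` of the two dictionaries; the value is undefined (NaN here) when one of the rankings is constant.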
3 Experiments with Empirical Networks
In this section, we investigate the relationship between the robustness of centrality measures in empirical networks and global network measures. We consider the following centrality measures: degree, eigenvector centrality, and PageRank. Regarding the global measures, we focus on the network size, average degree, density, and transitivity. For our empirical study, we use the following undirected and unweighted networks available through the Koblenz Network Collection [18] (24 networks at the time of the beginning of this study): zachary (34 nodes, 78 edges), dolphins (62, 159), jazz (198, 2,742), pdzbase (212, 244), vidal (3,133, 6,726), facebook (4,039, 88,234), powergrid (4,941, 6,594), CAGrQc (5,242, 14,496), reactome (6,327, 147,547), CAHepTh (9,877, 25,998), pgp (10,680, 24,316), CAHepPh (12,008, 118,521), CAAstroPh (18,772, 198,110), CACondMat (23,133, 93,497), deezerRO (41,773, 125,826), deezerHU (47,538, 222,887), deezerHR (54,573, 498,202), brightkite (58,228, 214,078), livemocha (104,103, 2,193,083), petstercat (149,700, 5,449,275), douban (154,908, 327,162), gowalla (196,591, 950,327), dblp (317,080, 1,049,866), and petsterdog (426,820, 8,546,581). As part of the data preprocessing, we have removed any existing loops. If a network consists of several components, we only consider the largest connected component.
Footnote 1: We use NetworkX (version 2.2, [12]) to generate random graphs and calculate centrality measures.

Table 1. Robustness of Konect networks, aggregated over all networks.
              Error mechanism
Centrality    e+            e−            e−(p)         n−
              Mean   Std    Mean   Std    Mean   Std    Mean   Std
Degree        0.85   0.06   0.89   0.05   0.91   0.05   0.90   0.05
Eigenvector   0.75   0.18   0.81   0.11   0.73   0.15   0.79   0.15
PageRank      0.73   0.07   0.82   0.07   0.86   0.07   0.83   0.07
To analyze the effects of different errors on the robustness of centrality measures in the empirical networks, we use a simulation-based experimental procedure. An iteration of the experiment is performed as follows: Starting from a graph G (one of the 24 networks described above), we apply the error mechanism with intensity α. The resulting modified graph is called H. Finally, we calculate the robustness of the centrality measure c: τ_c(G, H) (as defined in Sect. 2). We repeat this procedure 100 times for each network and for all combinations of centrality measure (degree, eigenvector centrality, PageRank), error mechanism (adding edges, removing edges (uniformly and proportionally), and removing nodes), and error level (α ∈ {0.1, 0.2, ..., 0.5}).

Similar to previous studies in this area (as discussed in Sect. 1), we observe that the robustness declines with an increasing level of error. In the following, we therefore focus on an error level of α = 0.2: the results for the other error levels yield the same conclusions regarding the role of the network structure, and the impact of the error level is not our main objective. When observing the average across all networks (Table 1), degree centrality is always the most robust. For the removal error mechanisms (e−, e−(p), and n−), the PageRank
is more robust than the eigenvector centrality. In the case of additional edges, the opposite eﬀect can be observed. Regarding the standard deviation, the ranking is constant across all error types, degree centrality varies least, followed by PageRank. The robustness of the eigenvector centrality ﬂuctuates most; sometimes the standard deviation is two to three times as large as for the degree centrality and the PageRank. Concerning the eﬀect of the type of measurement error on the robustness, degree centrality and PageRank behave similarly. The absence of edges proportional to the edge degree has the weakest eﬀect, spurious edges the most substantial. For eigenvector centrality, on the other hand, the ﬁrst error type has the strongest inﬂuence on the robustness. As we look at the relationship between robustness and global network measures, we notice that there are both: large networks that are very vulnerable to errors (e.g., douban) and small networks that are very robust (e.g., Jazz). In the following, we will discuss the relationship between global network measures and robustness in more detail. Table 2 lists the Kendall rank correlation between the average robustness and the respective values for the global network measures. For all removal error types, the robustness tends to be higher with an increasing average degree. We observe almost perfect correlation for cases where edges or nodes are missing uniformly at random and still high correlation values when edges are missing proportionally. For the degree centrality, the correlation is also high in the case of spurious edges. For PageRank and eigenvector centrality, this is not the case. While for the transitivity a moderate correlation with the robustness can still be observed, the number of nodes, as well as the density, are, in most cases, uncorrelated with the robustness. 
This observation is rather unexpected since growing graphs often show “densiﬁcation”, which means the average degree grows with the number of nodes [19]. Figure 2 shows the behavior of robustness for three groups in each panel in an exemplary fashion. The ﬁrst panel shows the observation for PageRank and add edges. This behavior is typical for all centrality measures under the inﬂuence of additional edges, there is no obvious pattern. The middle panel shows the combination of PageRank and missing edges uniform. In this case, the relationship between average degree and robustness is most prominent. Robustness is high when the average degree is also high. The variance of robustness is also low. This Table 2. Empirical networks: rank correlation between global measures and the average robustness. Error mechanism Centrality Degree e+
e−
Eigenvector e−(p) n−
e+
e−
Avg degree
0.77 0.97 0.63
0.98
0.27 0.92
Density
0.18 0.01 0.01
0.03 −0.38 0.02
Nodes
0.02 0.31 0.16
0.29
Transitivity
0.26 0.23 0.58
0.23 −0.63 0.27
PageRank e−(p) n− 0.52
e+
e−
e−(p) n−
0.93
0.43 0.96 0.72
0.95
0.30 −0.11
0.26 0.07 0.00
0.15
0.43 −0.18 0.26 0.22
0.18
0.04
0.23
0.52 0.26 −0.14 0.50
0.22 0.16 0.49
behavior occurs for PageRank and degree for all cases of missing edges (uniform and proportional) and missing nodes. The robustness of the eigenvector centrality in case of missing nodes is depicted in the last panel. There is a recognizable association, but in this case, the variance is higher than in most other cases.
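One iteration of the experimental procedure of this section can be sketched as follows. This is a minimal illustration in plain Python rather than the NetworkX setup used in the study: it covers only degree centrality and the uniform edge-removal mechanism e−, and it assumes a tau-b style rank correlation as the robustness measure. All function names are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def degree_centrality(nodes, edges):
    """Degree of every node for an undirected edge list."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def remove_edges_uniform(edges, alpha, rng):
    """Error mechanism e-: keep a uniformly random (1 - alpha) share of the edges."""
    edges = list(edges)
    n_remove = round(alpha * len(edges))
    return rng.sample(edges, len(edges) - n_remove)


def tau_b(x, y):
    """Kendall tau-b on the shared keys of two dicts (assumed robustness measure)."""
    keys = list(set(x) & set(y))
    n_c = n_d = t_x = t_y = 0
    for u, v in combinations(keys, 2):
        dx, dy = x[v] - x[u], y[v] - y[u]
        if dx * dy > 0:
            n_c += 1
        elif dx * dy < 0:
            n_d += 1
        elif dx == 0 and dy != 0:
            t_x += 1
        elif dy == 0 and dx != 0:
            t_y += 1
    denom = sqrt((n_c + n_d + t_x) * (n_c + n_d + t_y))
    return (n_c - n_d) / denom if denom else float("nan")


def mean_robustness(nodes, edges, alpha=0.2, runs=100, seed=0):
    """Average tau_c(G, H) of degree centrality over repeated erroneous copies H."""
    rng = random.Random(seed)
    c_g = degree_centrality(nodes, edges)
    taus = []
    for _ in range(runs):
        h_edges = remove_edges_uniform(edges, alpha, rng)
        taus.append(tau_b(c_g, degree_centrality(nodes, h_edges)))
    return sum(taus) / runs
```

The other error mechanisms (spurious edges, proportional edge removal, node removal) would replace `remove_edges_uniform`; for node removal, the comparison is automatically restricted to the surviving nodes because `tau_b` only uses shared keys.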
4 Experiments with Random Graphs
In Sect. 3, we examined the robustness of 24 empirical networks and observed that there exist small and robust as well as large and sensitive networks regarding the robustness of centrality measures. We observed that there is little association between network size and robustness. We found, however, that in many cases, the higher the average degree of the network, the higher the robustness. In this section, we use the ER and the BA model to control the average degree and to measure the eﬀects of its change on robustness. For this purpose, we choose two diﬀerent perspectives. First, we keep the average degree constant and change the size of the network. Then we control the average degree while keeping the network size constant. The experimental setup is similar to that of Sect. 3. Instead of using empirical networks, however, we generate ER and BA graphs with an average degree of 10 and a network size n ∈ {100, 500, 1000, 1500, . . . , 10000}, which we call G. For the second part of the experiments, we ﬁx the network size at n = 1000 and select the parameters p and m such that we obtain graphs with an average degree between 4 and 100. Then we apply the error mechanism with intensity α to G, which results in the erroneous network H and calculate the robustness of the centrality measure c: τc (G, H). For both parts of the experiment, we repeat this procedure 100 times for the two random graph models and the varying values for the network size and the average degree for all combinations of centrality measure (degree, eigenvector centrality, PageRank), the four error mechanisms, and error level (α ∈ {0.1, 0.2, . . . , 0.5}). The results for the ER graphs are homogeneous. For all centrality measures and error mechanisms, we observe the same behavior: robustness does not change with increasing network size. However, the variance decreases with increasing network size. It decreases sharply with the ﬁrst increases in network size. 
Above 2000 nodes, the change is hardly visible. For the BA graphs, we observe, with two exceptions, the same behavior as for the ER graphs. Figure 1 shows the three diﬀerent characteristics (all results in this ﬁgure are for the eigenvector centrality). The middle panel in Fig. 1 represents the robustness behavior in BA graphs in almost all cases. The robustness is independent of the network size. Only the variance decreases with increasing size, whereas the variance is relatively small already. The outer panels show the two exceptional cases. The absence of edges proportional to the degree of edge (left panel) reduces the robustness of the eigenvector centrality with increasing network size. If nodes are missing (right panel), the robustness is, as in most other cases, independent of the network size, but the variance is much larger and hardly declines with increasing size.
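The parameter choice for the first part of the experiment (constant average degree, varying size) can be sketched as follows. The sketch relies on two standard facts: the expected degree in an ER graph is p(n − 1), and each new node in a BA graph adds m edges, so the average degree approaches 2m for large n. The helper names and the simple G(n, p) sampler are hypothetical stand-ins for the NetworkX generators used in the study.

```python
import random


def er_edge_probability(n, avg_degree):
    """ER model: expected degree is p * (n - 1), so p = avg_degree / (n - 1)."""
    return avg_degree / (n - 1)


def ba_new_edges(avg_degree):
    """BA model: each new node adds m edges, so the average degree approaches 2m."""
    return max(1, round(avg_degree / 2))


def sample_er_edges(n, p, rng):
    """Plain G(n, p) sampler: include each node pair independently with probability p."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]


rng = random.Random(42)
for n in (100, 500, 1000):
    p = er_edge_probability(n, 10)
    edges = sample_er_edges(n, p, rng)
    # realized average degree 2|E| / n, which should be close to 10
    print(n, round(p, 4), 2 * len(edges) / n)
```

For the second part of the experiment, the same helpers would be called with a fixed n = 1000 and target average degrees between 4 and 100.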
Fig. 1. Results for the robustness of centrality measures in BA graphs. Here, the network size increases while the average degree remains constant; the error level is 0.2. For better readability, only network sizes up to 5000 are shown; for larger values, hardly any changes occur.
Fig. 2. Illustrative examples for the three different behaviors of the robustness of empirical networks. The networks are sorted by their average degree (ascending). The median robustness is indicated in each box; whiskers are 1.5 times the interquartile range.
The results for the second part of the experiment are listed in Table 3; the values are the rank correlation between the average degree and the robustness in ER and BA graphs. For the results concerning the ER graphs, the pattern consists of two parts, independent of the type of error. The robustness of the eigenvector centrality and the PageRank is, except for the smallest initial increases, constant and thus independent of the increase of the average degree. With degree centrality, on the other hand, robustness decreases with an increasing average degree. The decreases occur especially during the initial increases of the average degree (approx. the range between 4 and 25). Here the robustness decreases by 0.1. While degree centrality shows a strong negative correlation, in most other cases, no or weak negative correlation can be observed.

The results for the BA graphs show a consistent pattern. In all cases, regardless of centrality measure and error type, higher average degree is accompanied by higher robustness. There is a very high, positive rank correlation between the average degree and the associated robustness. The increases in robustness associated with the increase in the average degree are particularly strong for initial increases; further increases still have a positive effect on robustness, but this effect diminishes. The only exception to this is the eigenvector centrality, which resembles a linear relationship. The variance is slightly higher for the error type missing nodes than for the other error types. In the case of eigenvector centrality, the variance is much higher.

Table 3. Rank correlations between the average degree and the robustness for BA and ER graphs under the influence of different error mechanisms with an error level of 0.2.

Centrality     BA graphs                       ER graphs
               e+     e−     e−(p)   n−        e+      e−      e−(p)   n−
Degree         0.91   0.92   0.90    0.87      −0.66   −0.64   −0.64   −0.51
Eigenvector    0.82   0.93   0.95    0.69      −0.07    0.02    0.04    0.15
PageRank       0.93   0.94   0.93    0.91      −0.33   −0.25   −0.41   −0.02

In the previous experiments, we observed that the robustness of degree centrality is independent of the size of the network. We will now take a more detailed look at this scenario. We focus on the degree centrality, as this is the most feasible for an analytical perspective [22,25,29]. We will show that for ER graphs and sufficiently large network size, the robustness is independent of the network size if the average degree remains constant.

To analyze the degree centrality in more detail, we use the following terms: G is the unmodified graph and H is the erroneous graph (a "modified" version of G, i.e., H is on the same vertex set as G or on a subset of that vertex set). The error level is denoted by α (i.e., the fraction of edges deleted). Additionally, let v_1, v_2 be two nodes drawn randomly from V(H). Then, D_i is the random variable for the degree of node v_i in G and X_i is the random variable for the degree decrease of node v_i (i.e., the difference of the degree of node v_i in G and in H). On this basis, we define P(D_1 = d_1, X_1 = x_1, D_2 = d_2, X_2 = x_2) as the joint probability that specific values for d_1, x_1, d_2, x_2 occur together. We abbreviate this by P(d_1, x_1, d_2, x_2).

We demonstrate how to use P(d_1, x_1, d_2, x_2) to calculate the robustness. Summing P(d_1, x_1, d_2, x_2) over the quadruples that correspond to concordant (discordant) pairs of nodes, we can calculate the probability for (v_1, v_2) to be concordant (discordant). For the case of missing edges, the probability for (v_1, v_2) to be concordant is

\[ P_c = \sum_{\substack{d_1 < d_2 \\ d_1 - x_1 < d_2 - x_2}} P(d_1, x_1, d_2, x_2) + \sum_{\substack{d_1 > d_2 \\ d_1 - x_1 > d_2 - x_2}} P(d_1, x_1, d_2, x_2). \tag{2} \]

Analogously, the probability for (v_1, v_2) to be discordant is

\[ P_d = \sum_{\substack{d_1 < d_2 \\ d_1 - x_1 > d_2 - x_2}} P(d_1, x_1, d_2, x_2) + \sum_{\substack{d_1 > d_2 \\ d_1 - x_1 < d_2 - x_2}} P(d_1, x_1, d_2, x_2). \tag{3} \]

For details, see Baker [2]. An example is given in Fig. 4.
Slice Construction. The dynamic programming solution follows a divide-and-conquer paradigm: we divide the graph into so-called slices and calculate the tables for each slice before merging them together. Each vertex in each tree (and hence each interior face and exterior edge in each component of the graph) corresponds to a particular slice. The idea is to first define left and right boundaries for each tree vertex, and then define its slice as the induced subgraph of all vertices and edges that lie between the boundaries. The analogy is a slice of pie: one first decides where the two cut lines will be (the left and right boundaries), and the slice is the pie that lies between the cuts. The full construction of slices is given in the full version of this paper. In Fig. 4, each tree vertex has its left boundary vertices to the left and its right boundary vertices to the right (note that the right boundary vertices of a tree vertex are the same as the left boundary vertices of the next tree vertex).
Fig. 4. Trees with slice boundaries
Dynamic Program. In this section, we detail the dynamic program that solves the densest k subgraph problem for b-outerplanar graphs. The dynamic program is given by the procedures table, adjust, merge, extend, and contract. Note that adjust, merge, extend, and contract are original to this paper, whereas the table procedure is given by Baker [2]. The pseudocode for these procedures can be found in the extended version of the paper.

The program constructs a table for each slice. The table for a level i slice consists of 2^{2i} entries: one entry for each subset of the boundary vertices (the left and right boundaries for a level i slice contain exactly i vertices each, for a total of 2i boundary vertices). An entry contains a number for each value of k′ = 0, ..., k, where this number is the maximum number of edges over all subgraphs of the slice of exactly k′ vertices that contain the corresponding subset of the boundary vertices.

The main procedure of the program is table (see the extended version), which takes as input a tree vertex v = (x, y). This procedure contains four conditional branches. The first branch handles the case when v represents a face f that does not enclose a level i + 1 component. In this case, the procedure makes a recursive call to table on each child of v and merges the resulting tables together. The second conditional branch handles the case when v represents a face f that encloses a level i + 1 component C. In this case, the procedure makes a recursive call to table on the tree vertex that represents C. The resulting table
is then passed to the contract function, which turns a level i + 1 table into a level i table by removing the level i + 1 boundary vertices from the table, and the contracted table is then returned. The third conditional branch handles the case when v is a level 1 leaf. In this case, the procedure returns a template table that works for all level 1 leaf vertices, since any level 1 leaf represents a level 1 exterior edge of the graph. The fourth conditional branch is slightly more complicated. This branch handles the case when v is a level i > 1 leaf vertex. The idea is to break up the slice of v into subslices, compute the table for an initial subslice, and extend this table by merging the tables for the subslices clockwise and counterclockwise from the initial subslice. These subslices have their own respective subboundaries. In the case that the slice of v is simply a line of vertices, no subslices can be created, so the procedure will return the table for the whole slice by passing the vertex to the create procedure, which eﬀectively applies brute force to create the table for the subslice determined by the second parameter p (this is explained in more detail below). In the case that the slice of v is not just a line, we can create tables for subslices. Since we are dealing with a planar graph, there exists a level i − 1 vertex zp such that all level i − 1 vertices other than zp that are adjacent to x are clockwise from zp , and all level i − 1 vertices other than zp that are adjacent to y are counterclockwise from zp . Here, zp is the only level i − 1 vertex in slice(v) that can be adjacent to both x and y (although it might not be adjacent to either). 
So, we construct an initial level i table for the subslice corresponding to zp using create, and then we make as many calls as necessary to the merge and extend procedures (described below) to extend the table on one side with subslices constructed from the vertices adjacent to x, and on the other side with subslices constructed from the vertices adjacent to y.
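The table layout described above, together with the brute-force idea behind create, can be sketched as follows. This is an illustration rather than the authors' procedure: it assumes that an entry for a boundary subset A ranges over subgraphs whose boundary vertices are exactly A, it stores undefined entries simply by omitting them, and the function name and toy graph are hypothetical.

```python
from itertools import combinations


def brute_force_table(vertices, edges, boundary, k):
    """Return {(A, k'): max edges} for every boundary subset A and k' = 0..k.

    Entries with no feasible subgraph are omitted (treated as undefined).
    """
    boundary = frozenset(boundary)
    interior = [v for v in vertices if v not in boundary]
    table = {}
    for r in range(len(boundary) + 1):
        for a in combinations(sorted(boundary), r):
            a_set = frozenset(a)
            for k_prime in range(len(a_set), k + 1):
                best = None
                # choose the remaining k' - |A| vertices from the interior
                for extra in combinations(interior, k_prime - len(a_set)):
                    chosen = a_set | set(extra)
                    m = sum(1 for u, v in edges if u in chosen and v in chosen)
                    best = m if best is None else max(best, m)
                if best is not None:
                    table[(a_set, k_prime)] = best
    return table


# Toy slice: triangle a-b-c with a pendant vertex d; boundary vertices a and d.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
table = brute_force_table(vertices, edges, {"a", "d"}, k=3)
print(table[(frozenset({"a"}), 3)])       # triangle a, b, c
print(table[(frozenset({"a", "d"}), 3)])  # best choice adds c
```

With two boundary vertices the table has 2^2 = 4 boundary subsets, matching the 2^{2i} count for i = 1. Brute force is exponential in the slice size, which is why it is only applied to the small initial subslices.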
Fig. 5. The slice for (c, d) is split into subslices for computing tables.
For example, Fig. 5 shows how we split slice((c, d)) into subslices. The algorithm ﬁnds that E is a level 1 vertex such that the level 1 vertices adjacent to c are clockwise from E and the level 1 vertices adjacent to d are counterclockwise to E (of which there are none). A call to create constructs the table for the initial subslice with subboundaries c, E and d, E. We then merge the other subslices in
a clockwise fashion, first merging the table for the subslice with subboundaries c, C and c, E, and then merging the table for the subslice with subboundaries c, C and c, B.

The adjust procedure (see the extended version for pseudocode) takes as input a table T, which represents a slice with left boundary L and right boundary R. Let x and y be the highest level boundary vertices in L and R, respectively. This procedure checks if x ≠ y, and if so, adds 1 to the table entry where x and y are both included. Unlike in Baker's original procedure, the case when x = y is not handled in this procedure and is instead handled in merge.

The merge procedure, given in Fig. 6 in Sect. A of the Appendix, takes as input two tables T1 and T2 such that the right boundary of the slice that T1 represents is the same as the left boundary of the slice that T2 represents. Let the left and right boundaries of the slice that T1 represents be L and M, and let the left and right boundaries of the slice that T2 represents be M and R. The resulting table T returned by merge represents the slice with left and right boundaries L and R. This procedure constructs T by creating an entry for each subset A of L ∪ R. As stated earlier, each entry contains a number for each value of k′, where k′ ranges from 0 to the number of vertices in the union of the two slices represented by T1 and T2, and where the number is the maximum number of edges over all subgraphs of exactly k′ vertices containing A. Each of these individual subgraphs contains a different subset B of M.

The contract procedure (see the extended version for pseudocode) changes a level i + 1 table T into a level i table T′. Here, T represents the slice for (z, z), where (z, z) is the root of a tree corresponding to a level i + 1 component C contained in a level i face f, and T′ is the table for the slice of vertex(f). Let S = slice((z, z)) and let S′ = slice(vertex(f)). Let the left and right boundaries of S′ be L and R, respectively. Then the left and right boundaries of S are of the form z, L and z, R, respectively. For each subset of L ∪ R and for each value of k′, T contains two numbers: one that includes z and one that does not include z. So, for each subset A of L ∪ R and each value of k′, we set T′(A, k′) equal to the larger of these two numbers.

The create procedure takes as input a leaf vertex v = (x, y) in a tree that corresponds to a level i + 1 component enclosed by a face f, and a number p ≤ t + 1, where the children of vertex(f) are u_1, u_2, ..., u_t. This procedure simply applies brute force to create the table for the subgraph containing the edge (x, y), the subgraph induced by the left boundary of u_p if p ≤ t or the right boundary of u_{p−1} if p = t + 1, and any edges from x or y to the level i vertex of this boundary.

Lastly, the extend procedure, given in Fig. 7 in Sect. A of the Appendix, takes as input a level i + 1 vertex z and a table T representing a level i slice, and produces a table T′ for a level i + 1 slice. Let L and R be the boundaries for the level i slice represented by T. The boundaries for the new slice will be L ∪ {z} and R ∪ {z}. For each subset A of L ∪ R and each value of k′, the new table T′ contains two values: one that includes z and one that does not include z. The entries T′(A, k′) which do not include z can simply be set to their original values in T. For the entries T′(A ∪ {z}, k′) that do include z, we first check
that T(A, k′ − 1) is not undefined. If this is the case, we set T′(A ∪ {z}, k′) to T(A, k′ − 1) plus the number of edges between z and every vertex in A.

We claim that calling the above algorithm on the root of the level 1 tree results in a correct table for the slice of the root and that this slice is actually the entire graph. Since the level 1 root is of the form (x, x), its left and right boundaries are both equal to x. Thus, the table for this root has exactly 4 rows, and two of these rows are invalid since they attempt to include one copy of x and not the other (which is nonsensical). The two remaining rows have numbers corresponding to the maximum number of edges in subgraphs of size exactly k′ for k′ = 0, ..., k. Taking the maximum between the two numbers corresponding to k′ = k gives us the size of the densest k subgraph. A proof of correctness and analysis of the O(k^2 8^b n) running time can be found in the extended version.
4 Polynomial Time Approximation Scheme and Future Work
When searching for a polynomial time approximation scheme (PTAS) for planar graph problems, one often attempts to use Baker's technique. For this technique, we assume that we have a dynamic programming solution to the given problem in b-outerplanar graphs. The technique works as follows: Given a planar graph G and a positive number ε, let b = 1/ε. Perform a breadth-first search on G to obtain a BFS tree T, and number the levels of T starting from the root, which is level 0. For each i = 0, 1, ..., b − 1, let G_i be the subgraph of G induced by the vertices on the levels of T that are not congruent to i modulo b. G_i is likely disconnected. Let the connected components of G_i be G_{i,0}, G_{i,1}, .... Since each G_{i,j} is (b − 1)-outerplanar by construction, we may run the given dynamic program on each G_{i,j} and combine the solutions over all j to obtain a solution S_i for the graph G_i. We then take the maximum S_i, denoted S, as our approximate solution.

We hypothesize that this technique will not work for the densest k subgraph problem on planar graphs. The reason is that, with a potentially large number of disconnected components, the approximate solution cannot be guaranteed to be within the bound given by ε. Suppose we have an approximate solution S for the densest k subgraph problem on some planar graph G. Note that S is the exact solution for the graph G_i for some i, meaning S does not account for any vertices on the levels of T which are congruent to i modulo b. While S could still be very dense, it is possible that most (if not all) of the edges in G are between vertices in different levels of T. This allows for the possibility that no matter which S_i is chosen as the maximum, each graph G_i is missing too many edges to closely approximate the optimal solution. For future work, it would be of great interest to prove that such a construction is impossible using Baker's technique.

Acknowledgements. We would like to express our sincere thanks to Samuel Chase for his collaboration on our initial explorations of finding a PTAS for the densest k subgraph problem. We would like to thank our reviewer who pointed us to previous work on this problem [3].
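The level decomposition used by Baker's technique in Sect. 4 can be sketched as follows. The sketch assumes, as in Baker's original technique, that for each i the BFS levels congruent to i modulo b are removed and the remaining vertices are split into connected components; it also assumes a connected input graph given as an adjacency dictionary. The function names are hypothetical, and the dynamic program that would run on each component is not included.

```python
from collections import deque


def bfs_levels(adj, root):
    """Map every vertex reachable from root to its BFS level (root is level 0)."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def baker_pieces(adj, root, b):
    """For each i in 0..b-1, drop the levels congruent to i mod b and
    return the connected components of the remaining vertices."""
    level = bfs_levels(adj, root)
    pieces = []
    for i in range(b):
        kept = {v for v, lv in level.items() if lv % b != i}
        components, seen = [], set()
        for start in kept:
            if start in seen:
                continue
            component, stack = set(), [start]
            seen.add(start)
            while stack:
                u = stack.pop()
                component.add(u)
                for v in adj[u]:
                    if v in kept and v not in seen:
                        seen.add(v)
                        stack.append(v)
            components.append(component)
        pieces.append(components)
    return pieces


# Path 0-1-2-3-4-5 rooted at 0, so the BFS level of a vertex equals its label.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
for i, components in enumerate(baker_pieces(path, 0, 3)):
    print(i, sorted(sorted(c) for c in components))
```

Every component spans at most b − 1 consecutive levels, and every edge that touches a removed level is lost, which is the effect the hypothesis above is concerned with.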
A Pseudocode Selections
procedure merge(T1, T2)
    let T be an initially empty table;
    let L and M be the left and right boundaries of the slice that T1 represents;
    let M and R be the left and right boundaries of the slice that T2 represents;
    for each subset A of L ∪ R do
        for each k′ = 0, ..., max_{T1} k′ + max_{T2} k′ − |M| do
            let V be an initially empty list;
            for each subset B of M do
                let n = 0;
                let x and y be the top level vertices in L and R, respectively;
                if x = y and x and y are both in A then
                    let n = 1;
                for each k1, k2 satisfying k1 + k2 − |B| − n = k′ do
                    let m be the number of edges between vertices in B;
                    let v = T1((A ∩ L) ∪ B, k1) + T2((A ∩ R) ∪ B, k2) − m;
                    if v is not undefined then
                        add v to V;
            if V is not empty then
                let T(A, k′) = max_{v ∈ V} v;
            else
                let T(A, k′) be undefined;
    return T;

Fig. 6. The merge procedure.

procedure extend(z, T)
    let T′ be a table that is initialized with every entry in T;
    let L and R be the boundaries for the slice represented by T;
    for each subset A of L ∪ R do
        for each k′ = 0, ..., max_T k′ do
            if T(A, k′ − 1) is not undefined then
                let m be the number of edges between z and every vertex in A;
                let T′(A ∪ {z}, k′) = T(A, k′ − 1) + m;
            else
                let T′(A ∪ {z}, k′) be undefined;
    return T′;

Fig. 7. The extend procedure.
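The extend procedure of Fig. 7 translates almost directly into code. The following sketch makes some representation choices that are not part of the paper: a table is a dictionary keyed by (frozenset A, k′), undefined entries are simply absent, and the neighbours of z are passed in as a set. Iterating over the defined entries of T is equivalent to the pseudocode's test that T(A, k′ − 1) is not undefined.

```python
def extend(z, table, neighbours_of_z):
    """Return T' for the slice whose boundaries additionally contain z."""
    max_k = max(k for _, k in table)            # largest k' stored in T
    new_table = dict(table)                     # entries without z keep their values
    for (a_set, k_prev), value in table.items():
        k_new = k_prev + 1
        if k_new > max_k:
            continue                            # the pseudocode stops at max_T k'
        m = len(a_set & neighbours_of_z)        # edges between z and vertices in A
        new_table[(a_set | {z}, k_new)] = value + m
    return new_table


# Table of a single edge a-b whose boundary vertices are a and b.
fs = frozenset
T = {(fs(), 0): 0, (fs({"a"}), 1): 0, (fs({"b"}), 1): 0, (fs({"a", "b"}), 2): 1}
T_new = extend("z", T, {"a"})                   # z is adjacent to a only
print(T_new[(fs({"a", "z"}), 2)])
print(T_new[(fs({"b", "z"}), 2)])
```

Entries such as ({a, b, z}, 3) stay undefined here because the loop in Fig. 7 only runs up to the largest k′ already present in T.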
References

1. Angel, A., Sarkas, N., Koudas, N., Srivastava, D.: Dense subgraph maintenance under streaming edge weight updates for real-time story identification. Proc. VLDB Endow. 5(6), 574–585 (2012)
2. Baker, B.S.: Approximation algorithms for NP-complete problems on planar graphs. J. ACM 41, 153–180 (1994)
3. Bourgeois, N., Giannakos, A., Lucarelli, G., Milis, I., Paschos, V.T.: Exact and approximation algorithms for densest k-subgraph. In: WALCOM: Algorithms and Computation, pp. 114–125. Springer, Heidelberg (2013)
4. Buehrer, G., Chellapilla, K.: A scalable pattern mining approach to web graph compression with communities. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM '08, pp. 95–106. ACM, New York, NY, USA (2008)
5. Chen, J., Saad, Y.: Dense subgraph extraction with application to community detection. IEEE Trans. Knowl. Data Eng. 24(7), 1216–1230 (2010)
6. Corneil, D.G., Perl, Y.: Clustering and domination in perfect graphs. Discret. Appl. Math. 9(1), 27–39 (1984)
7. Du, X., Jin, R., Ding, L., Lee, V.E., Thornton Jr, J.H.: Migration motif: a spatial-temporal pattern mining approach for financial markets. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pp. 1135–1144. ACM, New York, NY, USA (2009)
8. Feige, U., Kortsarz, G., Peleg, D.: The dense k-subgraph problem. Algorithmica 29, 2001 (1999)
9. Fratkin, E., Naughton, B.T., Brutlag, D.L., Batzoglou, S.: MotifCut: regulatory motifs finding with maximum density subgraphs. Bioinformatics 22, 150–157 (2006)
10. Gibson, D., Kleinberg, J., Raghavan, P.: Inferring web communities from link topology. In: Proceedings of the Ninth ACM Conference on Hypertext and Hypermedia: Links, Objects, Time and Space—Structure in Hypermedia Systems, HYPERTEXT '98, pp. 225–234. ACM, New York, NY, USA (1998)
11. Gibson, D., Kumar, R., Tomkins, A.: Discovering large dense subgraphs in massive graphs. In: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB '05, pp. 721–732. VLDB Endowment (2005)
12. Goldberg, A.V.: Finding a maximum density subgraph. Technical report, University of California at Berkeley, Berkeley, CA, USA (1984)
13. Mark Keil, J., Brecht, T.B.: The complexity of clustering in planar graphs. J. Comb. Math. Comb. Comput. 9, 155–159 (1991)
14. Langston, M.A., Lin, L., Peng, X., Baldwin, N.E., Symons, C.T., Zhang, B., Snoddy, J.R.: A combinatorial approach to the analysis of differential gene expression data: the use of graph algorithms for disease prediction and screening. In: Methods of Microarray Data Analysis IV, pp. 223–238. Springer (2005)
15. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133 (2004)
Spread Sampling and Its Applications on Graphs

Yu Wang1(B), Bortik Bandyopadhyay1, Vedang Patel1, Aniket Chakrabarti2, David Sivakoff1, and Srinivasan Parthasarathy1

1 The Ohio State University, Columbus, OH 43210, USA
2 Microsoft Inc., Hyderabad, India
Abstract. Efficiently finding small samples with high diversity from large graphs has many practical applications, such as community detection and online surveys. This paper proposes a novel scalable node sampling algorithm for large graphs that achieves better spread, or diversity, across the communities intrinsic to the graph without requiring any costly preprocessing steps. The proposed method leverages a simple iterative sampling technique controlled by two parameters: the infection rate, which controls the dynamics of the procedure, and the removal threshold, which affects the end-of-procedure sample size. We demonstrate that our method achieves very high community diversity with an extremely low sampling budget on both synthetic and real-world graphs, with either balanced or imbalanced communities. Additionally, we apply the proposed technique to treatment assignment driven by a very low sampling budget (only 2%) in a network A/B testing scenario, and demonstrate competitive performance relative to the baseline on both synthetic and real-world graphs.
Keywords: Graph sampling · Social network analysis

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 128–140, 2020. https://doi.org/10.1007/978-3-030-36687-2_11

1 Introduction

Networks are a powerful tool to represent relational data in various domains: an email network in a corporation, a cosponsorship network in Congress, a coauthorship network in academia, etc. Given the ubiquity of the Internet, we can collect relational data at an immense scale (Facebook, Twitter, etc.). Such a huge amount of data restrains us from conducting complicated analysis: PageRank [28] computation has time complexity O(|V|^3); community detection using the Girvan–Newman [8] method takes O(|E|^2 |V|) time; comparing the similarity between two large networks (each with ∼70 million edges) using a state-of-the-art method takes 10 min [19]. Sampling is often touted as a means to combat the inherent complexity of analyzing large networks [20]. Network sampling is broadly classified into edge and node sampling strategies. Edge sampling seeks to sample pairs of nodes (dyads) from a network, and
one of its applications is to infer the network topological structure [36]. Node sampling seeks to sample nodes from a network, and one of its applications is to infer the distribution of network statistics (node degree, node label, etc.). Existing node sampling techniques include selectionbased sampling (uniform [20]) and chainbased sampling (forest ﬁre [20], random walk [4,9]). The idea of sampling subjects from diﬀerent groups is called stratified sampling [22,24]. A typical design for stratiﬁed sampling is ﬁrst to divide the population into diﬀerent strata (groups) using some population characteristics and then sample individuals from each stratum. Social networks are well known to have the community structure [7]. Nodes within a community are more similar to each other than nodes across communities, and individuals in a social network tend to interact more frequently with others from the same community [25]. In this graph/network context, the doctrinal stratiﬁedsampling ﬁrst detects communities, and then samples from each community. Community discovery, based on topological and (or) attributebased graph characteristics, is a very timeconsuming procedure even with stateoftheart implementation [3]. Chainbased sampling methods sample connected subgraphs and hence are more likely to be stuck in one community or a few nearby (in terms of topological distance) communities, resulting in less community diversity of the sampled nodes. Uniform node selection sampling has better community coverage than chainbased sampling, but it tends to undersample small communities when community sizes are imbalanced. Also, from an endtoend application’s performance perspective, to continue beneﬁting from sampling approaches, the sampling budget (i.e., number of nodes to be sampled) has to be kept as low as possible, which reduces the chance of high community coverage under practical settings on largescale graphs. 
We propose a new graph sampling method, spread sampling, that can achieve better community coverage than existing algorithms at low sampling budgets even in graphs with imbalanced communities, resulting in a sampled node set that is more representative in terms of community diversity than those of other methods. Under an appropriate user-chosen parameter configuration, the proposed method penalizes sampling neighboring nodes, cliques, or near-clique structures, and hence allows for better overall community coverage of the network without any of the costly preprocessing steps required by a typical stratified sampling approach. We demonstrate its applications to community detection seeding and network A/B testing.
2 Related Work
Network Sampling: The work by Handcock and Gile [12] systematically studies node sampling on social networks. It proposes the concepts of model-based sampling and design-based sampling. Our proposed method is a design-based sampling method. Node selection sampling (a kind of uniform node sampling) is studied by Leskovec and Faloutsos [20]. This method is easy to implement but gives only a sparse approximation of several network statistics.
Traditional chain-based sampling methods are biased towards high-degree hub nodes. Two strategies, in-sample correction and post-sample correction, have been proposed to address this issue. In-sample correction modifies the random walk such that the equilibrium distribution is uniform [9,26,34]. Post-sample correction uses estimators that account for the bias incurred by the sampling method [11,13]. Maiya and Berger-Wolf [23,24] propose a crawling-based approach that samples a connected subgraph with a good community structure (high precision). Our work differs in that we seek to identify samples spread across communities (high recall). We empirically show that spread sampling has better community coverage than all baselines.

Community Detection: We choose community detection to test the efficacy and efficiency of spread sampling. Modularity maximization [3,8] and n-cut maximization [14] based approaches were surveyed by Fortunato [7]. The personalized PageRank based approach [1] is a variant of the n-cut maximization method. Kloumann and Kleinberg [16] compared several seed-expansion-based community detection methods and concluded that personalized PageRank [1] works best. They find that seeding by uniform sampling results in better recall than high-degree sampling in community detection. We show that seeding by spread sampling achieves an even better result than uniform sampling.

Network A/B Testing: Spread sampling can also be applied to network A/B testing, a statistical experiment widely used in modern social network settings to determine the effect of a treatment. The experimenters apply a treatment, such as introducing a new feature of an online service, to a subset of customers [17,18], while keeping the remaining users without that feature in a control group, and then measure a quantity of interest for each user to compute the overall effect of the newly introduced feature, usually defined as the Average Treatment Effect (ATE) by Gui et al. [10].
While classical A/B testing experiments [31] assume independence of user behavior (the SUTVA assumption), this is invalid in the context of social networks due to direct interaction edges between users [10]. Broadly speaking, the two popular sampling approaches proposed to tackle this ATE estimation problem for social networks under interference effects are node-based [2,15] and cluster-based [10,27,33] strategies. More recently, Saveski et al. [32] proposed a hybrid strategy combining the node-based and cluster-based randomization approaches. One practical problem is the cost of running the experiments in production deployments, which increases with the size of the treatment group; hence an effective node sampling strategy is required that achieves competitive ATE estimation error even at a low treatment (sampling) budget.
3 Methodology

3.1 Designing Spread Sampling
We wish to obtain a sample that spreads out over the graph with a limited budget. Intuitively, a spread-out sample has nodes with very few of their neighbors in the sampled set. Hence we design an iterative sampling approach alternating between two steps: (1) sample uniformly from the candidate nodes; (2) remove nodes neighboring the existing sample. The first step leads to a near-uniform sample, while the second step spreads out the sampled nodes. The algorithm, described in Algorithm 1, is an iterative sampling method. During the sampling process, three sets are maintained: the sample set, the removal set, and the candidate set. The sample set contains all the sampled nodes; the removal set contains nodes that have at least a threshold number of sampled neighbors; all other nodes are in the candidate set. Both the sample set and the removal set expand monotonically as the iteration proceeds, while the candidate set shrinks. The rationale of the removal set is that if a node has enough neighbors in the sample, we should not sample it, since our goal is to "spread out" the sample with a limited budget.

Algorithm 1. Spread Sampling
Input: infection rate q, removal threshold k, a connected undirected graph G, target sample size;
Output: a set S of sampled nodes.
1: Initialize candidate set C = G (all nodes), S = ∅, R = ∅;
2: while C is not empty and |S| is smaller than the target size do
3:   for each node u in C, sample it with probability q, and add the sampled nodes to S;
4:   C = C − S; {remove sampled nodes from the candidate set}
     {Below: if a candidate node has at least k sampled neighbors, remove it from the candidate set}
5:   B_k = {v ∈ C : |N(v) ∩ S| ≥ k};
6:   R = R ∪ B_k; {R is the removal set}
7:   C = C − R;
8: end while
9: return S
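The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is assumed to be a dictionary `adj` mapping each node to the set of its neighbors, node identifiers are assumed to be sortable, and the trimming step that keeps the sample within budget is an addition that Algorithm 1 does not specify. The function name `spread_sample` and its parameters are hypothetical.

```python
import random


def spread_sample(adj, q, k, target_size, seed=None):
    """Sketch of Algorithm 1 (spread sampling).

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    q: per-round probability of sampling each candidate, 0 < q <= 1.
    k: removal threshold on the number of already-sampled neighbors.
    """
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    rng = random.Random(seed)
    candidates = set(adj)            # candidate set C
    sample, removed = set(), set()   # sample set S and removal set R
    while candidates and len(sample) < target_size:
        # Line 3: sample every candidate independently with probability q.
        picked = [u for u in sorted(candidates) if rng.random() < q]
        # Assumption (not part of Algorithm 1): never exceed the budget.
        room = target_size - len(sample)
        if len(picked) > room:
            picked = rng.sample(picked, room)
        sample.update(picked)
        candidates.difference_update(picked)  # line 4: C = C - S
        # Lines 5-7: drop candidates with at least k sampled neighbors.
        blocked = {v for v in candidates if len(adj[v] & sample) >= k}
        removed |= blocked
        candidates -= blocked
    return sample
```

With k larger than the maximum degree no node is ever removed, so the sketch reduces to repeated uniform selection; with k = 1 every neighbor of a sampled node leaves the candidate set after each round.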
The sampling algorithm has two parameters: a single-step infection rate and a removal threshold. A low removal threshold removes nodes more aggressively, and hence is more likely to sample non-adjacent nodes; an extremely high removal threshold removes no node and hence achieves uniform sampling. A low single-step infection rate also favors non-adjacent nodes; an extremely high rate reduces the chance of removal and hence favors uniform sampling.

3.2 High Community Diversity
We use the community coverage ratio to quantify how well spread out the sampled nodes are. We define the community coverage ratio¹ of a sample set S as the fraction of communities represented by the sampled nodes in S.

¹ By design, our method achieves 100% expansion quality, a ratio of the neighborhood size of the sample to the number of unsampled nodes, as defined in [24], when the infection rate is exactly one node and the removal threshold is one.

It can be formulated as:

$$\mathrm{CoverageRatio}(S) = \frac{\left|\bigcup_{i \in S} \{c_i\}\right|}{|C|},$$
where S is the sample set, $c_i$ is the community of node i, and C is the set of all the communities. We compare against 5 baseline sampling methods: uniform sampling, expanding snowball (XSN [24]), degree-inverse sampling, Louvain [3] + stratification sampling, and METIS [14] + stratification sampling. XSN is a greedy sampling method that samples a community-diversified connected component, and it performs very well on graphs with balanced communities. We compare against degree-inverse sampling because our method has a degree-inverse property for small removal thresholds. We show that merely sampling nodes with degree-inverse probability cannot achieve as high a community coverage as our method. For stratification sampling, we first run state-of-the-art community detection methods, then sample the nodes with probability proportional to the sizes of the detected communities.² Random-walk-based sampling methods [20], including multi-walkers [29], have inferior community coverage, and hence we do not include those results. We report results on an SBM graph with imbalanced communities at various sampling budgets. All algorithms achieve high community coverage as the sampling budget increases. Spread sampling (SS) achieves the best community coverage ratio among all methods at minimal sampling budgets (Fig. 1). More results can be found in Sect. 3.3.3 of [35].
Fig. 1. Community coverage ratio on SBM10k imbalanced. At very small sampling budgets (5%, 10%), spread sampling (SS) covers significantly more communities than the baselines.
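The coverage ratio defined above is straightforward to compute once ground-truth memberships are available. The sketch below assumes non-overlapping communities stored in a dictionary `community_of` from node to community label; both the function name and that data layout are illustrative choices, not taken from the paper.

```python
def coverage_ratio(sample, community_of):
    """Fraction of ground-truth communities hit by at least one sampled node."""
    all_communities = set(community_of.values())        # the set C
    covered = {community_of[i] for i in sample}         # communities c_i, i in S
    return len(covered) / len(all_communities)
```

For overlapping communities, one would presumably take the union of all communities each sampled node belongs to; that variant is not shown here.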
3.3 Complexity Analysis
According to Algorithm 1, there are $|C_t|$ candidate nodes in the while-loop at the t-th iteration, with $|C_{t+1}| = (1 - q)|C_t|$ for infection rate q. The bottleneck is the candidate-node update (line 4 onward): for each candidate node, we need to scan all its neighbors to determine the validity of its candidacy. This procedure incurs $|C_t|\bar{d}$ queries per iteration, where $\bar{d} \equiv 2|E|/|V|$ is the average degree. Hence, the overall time complexity is

$$O\Big(\sum_{|C_t|=0}^{|C_t|=n} |C_t|\,\bar{d}\Big) \overset{\text{geometric sum w/ ratio } 1-q}{=} O(|C_0|\,\bar{d}) = O(n\bar{d}) = O(|E|) \overset{\text{sparse [5], } |E| = O(|V|)}{=} O(|V|).$$

We do not store edge information; all the edge information is retrieved from the graph. Hence the space complexity is $O(|V|)$.

² Evaluation is always performed on the ground truth communities.
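The geometric-sum step can be spelled out as a short worked derivation. It assumes that q is a constant fraction in (0, 1] and treats the recurrence $|C_{t+1}| = (1-q)|C_t|$ as exact; removals only shrink the candidate set further, so the result is an upper bound:

```latex
\sum_{t \ge 0} |C_t|\,\bar{d}
  = |C_0|\,\bar{d} \sum_{t \ge 0} (1-q)^t
  = \frac{|C_0|\,\bar{d}}{q}
  = O(|C_0|\,\bar{d}),
\qquad
|C_0|\,\bar{d} = n \cdot \frac{2|E|}{|V|} = 2|E| \quad \text{since } |C_0| = n = |V|.
```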
3.4 Impact of Sampling Parameters
The spread sampling method has two parameters: the single-step infection rate q, controlling the sampling dynamics, and the removal threshold k, controlling the end-of-procedure status. We have analyzed the impact of these parameters on community coverage. Intuitively, a low infection rate and a low removal threshold incur more neighborhood removal, and hence result in better community coverage. However, due to space constraints, we refer the reader to [35] for a detailed theoretical and empirical evaluation of the community coverage and sampling probability of our proposed technique.
4 Applications of Spread Sampling
We run experiments on both synthetic and real-world graphs, listed in Table 1 (|C| is the number of communities, and CC stands for clustering coefficient).

Table 1. Graphs used in experiments

Graph                        |V|     |E|     |C|     CC
ER (1k, 0.4)                 1k      100k    N/A     0.3992
ER (10k, 0.001)              10k     50k     N/A     0.0011
PL (1k, 4, 0.01)             1k      4k      N/A     0.0343
PL (1k, 4, 0.8)              1k      4k      N/A     0.3720
SBM10k balanced              10k     594k    100     0.0165
SBM10k imbalanced            10k     103k    500     0.1265
BTER syn (fit power-law)     1k      33k     10      0.2696
BTER fb (fit FB below)       4k      86k     250     0.5964
Email [21]                   1k      25k     42      0.3994
FB [21]                      4k      88k     N/A     0.6055
DBLP [21]                    317k    1M      13k     0.6324
AMAZON [21]                  335k    926k    75k     0.3967
LiveJournal (lj) [21]        4M      35M     288k    0.2843
Youtube [21]                 1.1M    3M      8k      0.0808
4.1 Overlapping Community Detection Seeding
Personalized PageRank (PPR) [1] is a well-established method for local graph partitioning: identifying a community (a densely connected subgraph [30,37]) from a small seed set. An extensive comparison by Kloumann and Kleinberg [16] shows that PPR has the best performance among all local partitioning methods. They focus on the comparison of different methods and claim that uniform sampling is better than high-degree node sampling. Our spread sampling, with different parameter settings, can be specialized to uniform sampling ($k > d_{max}$) or low-degree node sampling (small k). We run PPR community detection following the same procedure and on the same datasets (plus one more, LiveJournal) as in Sect. 2 of [16] and compare various seeding methods: SS with small k (low-degree heuristic), SS with large k, uniform sampling, and the high-degree heuristic.³ For each ground-truth community, we test with sampling budgets of 5% and 10%, and get consistent results in terms of the ranking of seeding methods. PPR assigns each node a conductance score in [0, 1]. For each to-be-detected community, we sort all the nodes in descending order of the score and take the top-|C| nodes as the community members, where |C| is the size of the ground-truth community. For SS, we fix q to be exactly one node per step and vary the removal threshold k from 1 to 50. Note that a large k degenerates to uniform sampling. Our key finding is that the low-degree heuristic works best in overlapping community detection; the recall of each seeding method is reported in Table 2.⁴ Experiments show that the low-degree heuristic works best on graphs with overlapping communities: high-degree nodes belong to several communities while low-degree nodes are "core members" of a community [30,39]. Expansion from a high-degree node results in a sample from different communities, and hence the recall is low; expansion from "core members" is more likely to return nodes within the same community and hence yields a better recall.
The recall on the youtube graph is very low, which is probably due to its low clustering coefficient.

Table 2. Community detection recall with varying seed strategies

Data       k = 1           k = 10          Uniform         Max Deg.
amazon     .5768 ± .038    .5808 ± .040    .5617 ± .042    .5612 ± .043
dblp       .2512 ± .004    .2396 ± .003    .2383 ± .004    .1479 ± .001
lj         .1328 ± .002    .1313 ± .003    .1311 ± .002    .1123 ± .001
youtube    .0227 ± .004    .0200 ± .003    .0138 ± .002    .0108 ± .001

³ The neighborhood inflation method [38] is also a highly cited work. We did not compare against it since, according to [16], PPR performs better than the neighborhood inflation method.

⁴ $\text{Precision} \equiv \dfrac{\text{True Positive}}{|C_{\text{detect}}|} \overset{\text{exp. setup}}{=} \dfrac{\text{TP}}{|C_{\text{ground truth}}|} \equiv \text{recall}$.
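The identity between precision and recall in this setup follows from always returning exactly as many nodes as the ground-truth community contains. The sketch below illustrates that evaluation step only; it assumes the PPR scores are already available in a dictionary `scores` with at least as many nodes as the ground-truth community, and the function name `top_c_precision_recall` is hypothetical.

```python
def top_c_precision_recall(scores, truth):
    """Return (precision, recall) when the |truth| best-scored nodes are detected."""
    size = len(truth)
    # Rank nodes by score, highest first, and keep as many as the true community has.
    ranked = sorted(scores, key=scores.get, reverse=True)
    detected = set(ranked[:size])
    true_positive = len(detected & truth)
    precision = true_positive / len(detected)
    recall = true_positive / len(truth)
    return precision, recall
```

Because `len(detected)` equals `len(truth)` by construction, the two denominators coincide and the two values are always equal.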
4.2 Network A/B Testing
There are several reasons to consider spread sampling as a treatment assignment, i.e., node sampling, strategy for network A/B testing. First, spread sampling, intuitively, attempts to reduce the homophily effect [2] and potential information leak [10]. Second, it only requires knowledge of the immediate neighborhood of specific nodes in the seed set. Third, the spread sampling method samples a set with a higher population diversity in terms of community representation (in comparison to alternatives) at a lower sampling budget, while also being relatively cheap to compute: there is no need for costly preprocessing steps, such as clustering the graph and subsequent assignment, to achieve this objective. A lower sampling budget is particularly useful when multiple alternatives are being tested (e.g., alternative advertisement variations or alternative features), which is usually the case. Large sampling budgets are an imposition on testers and can lead to tester fatigue.

Dataset and Simulation: Our goal is to assign treatment using node sampling strategies, viz. uniform sampling (US) and spread sampling (SS), and then simulate user response with the Linear Probit model [6,10] to compute the average treatment effect (ATE). The simulation steps are summarized in Algorithm 2. We run 1000 simulations (as in Sect. 5.1.1 of [10]) with each of the node sampling strategies on four small networks: PL(1k,4,0.01), PL(1k,4,0.8), FB and BTER (Table 1), of which BTER has community structure. The Linear Probit model for response simulation is proposed by [10]:

$$Y^*_{i,t} = \lambda_0 + \lambda_1 Z_i + \lambda_2 \frac{A_{\cdot i}^{T} Y_{t-1}}{D_{ii}} + U_{i,t}$$

$$Y_{i,t} = I(Y^*_{i,t} > 0)$$
where Z is the treatment assignment vector, A is the adjacency matrix, D is the diagonal matrix of node degrees, and U is a user-dependent stochastic component. We fix the baseline parameter λ0 = −1.5, vary the treatment effect parameter λ1 ∈ {0.25, 0.5, 0.75, 1.0} and the network effect parameter λ2 ∈ {0, 0.1, 0.25, 0.5, 0.75} as in [10], and keep a very low fixed sampling budget of only 2%.

Estimator Choice: We use existing state-of-the-art estimators, viz. SUTVA, designed assuming no interference effect, and LMI, designed to work well in the presence of interference effects.
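The probit response model above can be simulated in a few lines of NumPy. The sketch below is an illustration under several assumptions that the text does not fix: the stochastic component U is drawn as standard normal noise, the initial response vector Y_0 is all zeros, and isolated nodes are given a divisor of 1 to avoid division by zero. The function name `simulate_probit` and its signature are hypothetical; the default of three time steps follows the t = 3 used in Algorithm 2.

```python
import numpy as np


def simulate_probit(A, Z, lam0, lam1, lam2, steps=3, seed=0):
    """Simulate binary responses Y_{i,t} for a dense adjacency matrix A."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    degree = A.sum(axis=0)                      # D_ii as column sums of A
    degree = np.where(degree > 0, degree, 1.0)  # assumption: guard isolated nodes
    Y = np.zeros(n)                             # assumption: Y_0 = 0
    for _ in range(steps):
        U = rng.standard_normal(n)              # assumption: U_{i,t} ~ N(0, 1)
        peer = (A.T @ Y) / degree               # A_{.i}^T Y_{t-1} / D_ii
        Y_star = lam0 + lam1 * Z + lam2 * peer + U
        Y = (Y_star > 0).astype(float)          # Y_{i,t} = I(Y*_{i,t} > 0)
    return Y
```

Calling the function once with Z set to all ones and once with Z set to all zeros, then differencing the mean responses, would give a ground-truth ATE in the spirit of steps 1 to 3 of Algorithm 2.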
Algorithm 2. Simulation Steps
Input: λ0, λ1, λ2, ρ, sampling flag
1: Generate treatment response $Y_{i,t}$ for all nodes i using $Z_i = 1$.
2: Generate control response $Y_{i,t}$ for all nodes i using $Z_i = 0$.
3: Compute the ground truth ATE ($ATE_{GT}$) using the above responses.
4: for (l = 1; l <= 1000; l++) do
5:   Set seed(l) to fix $U_{i,t}$ generation.
6:   Select ρ% of the nodes using SRSWOR or SS based on the sampling flag and consider those nodes for treatment.
7:   Use the above treatment assignment to construct the experiment assignment vector of all nodes ($Z_{exp}$) by setting those entries to 1 and keeping 0 for the others.
8:   Compute the fractional neighborhood exposure vector.
9:   Use $Z_{exp}$ to generate the binary response $Y^*_{i,t}$ for all nodes by simulating the Probit Model [6,10] with t = 3.
10:  Apply the estimator of choice to compute the empirical ATE.
11: end for
12: Use $ATE_{GT}$ and $\widehat{ATE}$ (computed by averaging the 1000 simulation estimates) to compute the Absolute Error (relative error).
SUTVA for Node Sampling [10]:

$$\delta = \Big[\frac{1}{N_1}\sum_{\{i:\, z_i = 1\}} Y_i(Z = z_i)\Big] - \Big[\frac{1}{N_0}\sum_{\{i:\, z_i = 0\}} Y_i(Z = z_i)\Big]$$

$N_1$ is the number of nodes in the treatment group and $N_0$ is the number of nodes in the control group. $Y_i(Z = 1)$ is the response of node i in treatment and $Y_i(Z = 0)$ is the response of node i in control.

Fraction Neighborhood - Linear Model I [10]:

$$g(Z_i, \sigma_i) = \alpha + \beta Z_i + \gamma \sigma_i$$

$$\hat{\delta}_{LI} = \hat{\beta} + \hat{\gamma}$$

$Z_i = 1$ indicates that node i is in the treatment group, and 0 indicates that it is in the control group. $\sigma_i$ is the fraction of i's neighbors in the treatment group. β is the treatment effect parameter, and γ is the parameter that captures the network effect. Linear regression is used to estimate $\hat{\alpha}$, $\hat{\beta}$, and $\hat{\gamma}$, which are then used to compute $\hat{\delta}_{LI}$ [10]. Note that the application of the fractional LMI model to node sampling is novel to this work (previous efforts only examined the performance of this idea on a cluster-sampling-based assignment strategy at a high budget). We did not compare against [10] since that work requires half of the users to be sampled, while in our case we focus on ATE estimation at a minimal sampling budget (2%). We show that as the network effect, i.e., the interaction among users, increases, the advantage of the SS method over uniform sampling grows.
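Both estimators above reduce to a few array operations. The following sketch assumes a dense symmetric adjacency matrix `A`, a 0/1 assignment vector `Z`, and an observed response vector `Y`, all as NumPy arrays; the function names are hypothetical, and ordinary least squares via `numpy.linalg.lstsq` stands in for whatever regression routine the authors used.

```python
import numpy as np


def sutva_estimate(Y, Z):
    """Difference in mean response between treated and control nodes."""
    treated = Z == 1
    return Y[treated].mean() - Y[~treated].mean()


def lmi_estimate(Y, Z, A):
    """Fit Y ~ alpha + beta * Z + gamma * sigma and return beta_hat + gamma_hat."""
    degree = A.sum(axis=1)
    degree = np.where(degree > 0, degree, 1.0)   # assumption: guard isolated nodes
    sigma = (A @ Z) / degree                     # fraction of treated neighbors
    X = np.column_stack([np.ones(len(Z)), Z, sigma])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    alpha_hat, beta_hat, gamma_hat = coef
    return beta_hat + gamma_hat
```

The vector `sigma` is the fractional neighborhood exposure computed in step 8 of Algorithm 2. The regression needs enough variation in both `Z` and `sigma` to be identifiable, which at a 2% budget is worth checking before trusting the fit.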
Fig. 2. ATE estimation bias comparison for the BTER synthetic graph with balanced communities. Although SS does not have an advantage over US when the network effect is weak, the advantage of SS becomes more and more clear as the network effect increases. Spread sampling samples users far apart and hence reduces the homophily bias.
Results: Table 3 shows that spread sampling, together with the LMI estimator previously used with cluster-based assignment (SS+LMI), significantly outperforms uniform sampling on multiple datasets. Figure 2 shows that although SS does not have an advantage over US when the network effect is weak, the advantage of SS becomes more and more evident as the network effect increases. This is not hard to explain: strong network effects mean users have strong interactions with each other and hence exhibit a homophily effect [2]. Spread sampling samples users far apart and hence reduces the homophily bias.

Table 3. ATE estimation with λ0 = −1.5, λ1 = 1.0, λ2 = 0.75 (strong network effect), and a sampling budget of 2%; SS uses q = one node and k = 1.

Dataset             GT      US SUTVA         SS SUTVA         SS LMI
PL (1k, 4, 0.8)     0.35    0.26 (26.89%)    0.26 (25.90%)    0.29 (18.46%)
PL (1k, 4, 0.01)    0.35    0.26 (26.79%)    0.26 (27.93%)    0.29 (19.44%)
FB                  0.34    0.26 (24.61%)    0.25 (25.53%)    0.29 (15.78%)
BTER syn            0.36    0.26 (28.23%)    0.25 (29.41%)    0.28 (20.67%)
5 Conclusions
We propose a simple yet elegant procedure, spread sampling (SS), for sampling nodes within a graph. We show that spread sampling tries to sample nodes from all regions of the graph, thereby improving community coverage over existing baselines, especially on networks with imbalanced communities. We apply SS to two practical applications, viz. community detection and network A/B testing. Seeding PPR-based community detection with SS leads to higher recall than existing heuristics. SS-based network A/B testing outperforms competitive strawman solutions on a range of graph models, particularly in the presence of moderate to high network interference effects.

Acknowledgments. This paper is funded by NSF grants DMS1418265, IIS1550302, and IIS1629548.
References
1. Andersen, R., Chung, F., Lang, K.: Local graph partitioning using pagerank vectors. In: FOCS 2006, pp. 475–486 (2006)
2. Backstrom, L., Kleinberg, J.: Network bucket testing. In: Proceedings of the 20th International Conference on World Wide Web, pp. 615–624. ACM (2011)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), P10008 (2008)
4. Chiericetti, F., Dasgupta, A., Kumar, R., Lattanzi, S., Sarlós, T.: On sampling nodes in a network. In: Proceedings of the 25th International Conference on World Wide Web, pp. 471–481. International World Wide Web Conferences Steering Committee (2016)
5. Chung, F.: Graph theory in the information age. Not. AMS 57(6), 726–732 (2010)
6. Karrer, B., Eckles, D., Ugander, J.: Design and analysis of experiments in networks: reducing bias from interference (2014)
7. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
8. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
9. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: a case study of unbiased sampling of OSNs. In: 2010 Proceedings IEEE Infocom, pp. 1–9. IEEE (2010)
10. Gui, H., Xu, Y., Bhasin, A., Han, J.: Network A/B testing: from sampling to estimation. In: Proceedings of the 24th International Conference on World Wide Web, pp. 399–409. ACM (2015)
11. Hand, D.J.: Statistical analysis of network data: methods and models by Eric D. Kolaczyk. Int. Stat. Rev. 78(1), 135–135 (2010)
12. Handcock, M.S., Gile, K.J.: Modeling social networks from sampled data. Ann. Appl. Stat. 4(1), 5 (2010)
13. Hansen, M.H., Hurwitz, W.N.: On the theory of sampling from finite populations. Ann. Math. Stat. 14(4), 333–362 (1943)
14. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)
15. Katzir, L., Liberty, E., Somekh, O.: Framework and algorithms for network bucket testing. In: Proceedings of the 21st International Conference on World Wide Web, WWW 2012, pp. 1029–1036. ACM, New York (2012)
16. Kloumann, I.M., Kleinberg, J.M.: Community membership identification from small seed sets. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1366–1375. ACM (2014)
17. Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., Pohlmann, N.: Online controlled experiments at large scale. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pp. 1168–1176. ACM, New York (2013)
18. Kohavi, R., Deng, A., Longbotham, R., Xu, Y.: Seven rules of thumb for web site experimenters. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pp. 1857–1866. ACM, New York (2014)
19. Koutra, D., Shah, N., Vogelstein, J.T., Gallagher, B., Faloutsos, C.: DeltaCon: principled massive-graph similarity function with attribution. ACM Trans. Knowl. Discov. Data (TKDD) 10(3), 28 (2016)
20. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 631–636. ACM (2006)
21. Leskovec, J., Sosič, R.: SNAP: a general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8(1), 1 (2016)
22. Lohr, S.: Sampling: Design and Analysis. Nelson Education, Toronto (2009)
23. Maiya, A.S., Berger-Wolf, T.Y.: Expansion and search in networks. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 239–248. ACM (2010)
24. Maiya, A.S., Berger-Wolf, T.Y.: Sampling community structure. In: Proceedings of the 19th International Conference on World Wide Web, pp. 701–710. ACM (2010)
25. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
26. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6), 1087–1092 (1953)
27. Middleton, J.A., Aronow, P.M.: Unbiased estimation of the average treatment effect in cluster-randomized experiments
28. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical report, Stanford InfoLab (1999)
29. Ribeiro, B., Towsley, D.: Estimating and sampling graphs with multidimensional random walks. In: Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, pp. 390–403. ACM (2010)
30. Ruan, Y., Fuhry, D., Liang, J., Wang, Y., Parthasarathy, S.: Community discovery: simple and scalable approaches. In: User Community Discovery, pp. 23–54. Springer (2015)
31. Rubin, D.B.: Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66(5), 688 (1974)
32. Saveski, M., Pouget-Abadie, J., Saint-Jacques, G., Duan, W., Ghosh, S., Xu, Y., Airoldi, E.M.: Detecting network effects: randomizing over randomized experiments. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1027–1035. ACM (2017)
33. Ugander, J., Karrer, B., Backstrom, L., Kleinberg, J.: Graph cluster randomization: network exposure to multiple universes. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pp. 329–337. ACM, New York (2013)
34. Wang, D., Li, Z., Xie, G.: Towards unbiased sampling of online social networks. In: 2011 IEEE International Conference on Communications (ICC), pp. 1–5. IEEE (2011)
35. Wang, Y.: Revisiting network sampling. Ph.D. thesis, The Ohio State University (2019)
36. Wang, Y., Chakrabarti, A., Sivakoff, D., Parthasarathy, S.: Fast change point detection on dynamic social networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2992–2998. AAAI Press (2017)
37. Wang, Y., Chakrabarti, A., Sivakoff, D., Parthasarathy, S.: Hierarchical change point detection on dynamic networks. In: Proceedings of the 2017 ACM on Web Science Conference, pp. 171–179. ACM (2017)
38. Whang, J.J., Gleich, D.F., Dhillon, I.S.: Overlapping community detection using seed set expansion. In: CIKM, pp. 2099–2108. ACM (2013)
39. Yang, J., Leskovec, J.: Structure and overlaps of communities in networks. arXiv preprint arXiv:1205.6228 (2012)
Eva: Attribute-Aware Network Segmentation
Salvatore Citraro and Giulio Rossetti(B)
KDD Lab, ISTI-CNR, Pisa, Italy
[email protected], [email protected]
Abstract. Identifying topologically well-defined communities that are also homogeneous w.r.t. attributes carried by the nodes that compose them is a challenging social network analysis task. We address this problem by introducing Eva, a bottom-up, low-complexity algorithm designed to identify hidden mesoscale network topologies by optimizing structural and attribute-homophilic clustering criteria. We evaluate the proposed approach on heterogeneous real-world labeled network datasets, such as co-citation, linguistic, and social networks, and compare it with state-of-the-art community discovery competitors. Experimental results underline that Eva ensures that network nodes are grouped into communities according to their attribute similarity without considerably degrading partition modularity, both in single and multi node-attribute scenarios.
1 Introduction
Among the most frequent data mining tasks, segmentation requires partitioning a given population into internally homogeneous clusters so as to better identify different cohorts of individuals sharing a common set of features. Classical approaches [1] model this problem on relational data, with each individual (data point) described by a structured list of attributes. Indeed, in several scenarios, this modeling choice represents an excellent proxy to address context-dependent questions (e.g., segmenting retail customers or music listeners by their adoption behaviors). However, such methodologies by themselves are not able to answer a natural, yet non-trivial, question: what does it mean to segment a population for which the social structure is known in advance? A first way of addressing such an issue can be identified in the complex network counterpart to the data mining clustering problem, Community Discovery.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 141–151, 2020. https://doi.org/10.1007/978-3-030-36687-2_12

Node clustering, also known as community discovery, is one of the most productive subfields of the complex network analysis playground. Many algorithms have been proposed so far to efficiently and effectively partition graphs into connected clusters, often maximizing specifically tailored quality functions. One of the reasons this task is considered among the most challenging and intriguing ones is its ill-posedness: there does not exist a single, universally shared definition of what a community should look like. Every algorithm, every study, defines node partitions by focusing on specific topological aspects (internal density, separation, ...), thus leading to the possibility of identifying different, even conflicting, clusters on top of the same topology. Generalizing, we can define the community discovery problem using a meta definition such as the following:

Definition 1 (Community Discovery (CD)). Given a network G, a community $c = \{v_1, v_2, \ldots, v_n\}$ is a set of distinct nodes of G. The community discovery problem aims to identify the set C of all the communities in G.

Classical approaches to the CD problem focus on identifying a topologically accurate segmentation of nodes. Usually, the identified clusters (either crisp or overlapping, producing complete or partial node coverage) are driven only by the distribution of edges across network nodes. Such a constraint, in some scenarios, is not enough. Nodes, the proxies for the individuals we want to segment, are carriers of semantic information (e.g., age, gender, location, spoken language, ...). However, segmenting individuals by only considering their social ties might produce well-defined, densely connected cohorts whose homogeneity w.r.t. the semantic information is not guaranteed. Usually, when used to segment a population embedded into a social context, CD approaches are applied assuming an intrinsic social homophily of individuals, often summarized with the motto "birds of a feather flock together". Indeed, such a correlation might exist in some scenarios; however, it is not always given, and its strength could be negligible. To address this issue, in this work, we approach a specific instance of the CD problem, namely Labeled Community Discovery:

Definition 2 (Labeled Community Discovery (LCD)). Let G = (V, E, A) be a labeled graph where V is the set of vertices, E the set of edges, and A a set of categorical attributes such that A(v), with v ∈ V, identifies the set of labels associated to v.
The labeled community discovery problem aims to find a node partition $C = \{c_1, \ldots, c_n\}$ of G that maximizes both topological clustering criteria and label homophily within each community.

LCD focuses on obtaining topologically well-defined partitions (as in CD) that also result in homogeneous labeled communities. An example of a context in which an LCD approach could be helpful is the identification, and impact evaluation, of echo chambers in online social networks, a task that cannot be easily addressed relying only on standard CD methodologies. In this work, we introduce a novel LCD algorithm, Eva (Louvain Extended to Vertex Attributes), tailored to extract label-homogeneous communities from a complex network. Our approach is configured as a multi-criteria optimization one and extends a classical hierarchical algorithmic schema used by state-of-the-art CD methodologies.

The paper is organized as follows. In Sect. 2 we introduce Eva, discussing its rationale and its computational complexity. In Sect. 3 we evaluate the proposed method on real-world datasets, comparing its results with state-of-the-art competitors. Finally, in Sect. 4 the literature relevant to our work is discussed, and Sect. 5 concludes the paper.
2 The Eva Algorithm
In this section, we present our solution to the LCD problem: Eva¹. Eva is designed as a multi-objective optimization approach. It adopts a greedy modularity optimization strategy, inherited from the Louvain algorithm [2], pairing it with the evaluation of intra-community label homophily. Eva's main goal is maximizing the intra-community label homophily while assuring high partition modularity. In the following, we detail the algorithm rationale and study its complexity. Eva is designed to handle networks whose nodes possess one or more labels having categorical values.

Algorithm Rationale. The algorithmic schema of Eva is borrowed from the Louvain one: a bottom-up, hierarchical approach designed to optimize a well-known community fitness function called modularity.

Definition 3 (Modularity). Modularity is a quality score that measures the strength of the division of a network into modules. It takes values in [−1, 1] and, intuitively, measures the fraction of the edges that fall within the given partition minus the expected fraction if they were distributed following a null model. Formally:

$$Q = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \delta(c_v, c_w) \tag{1}$$

where m is the number of graph edges, A_{vw} is the entry of the adjacency matrix for v, w ∈ V, k_v and k_w are the degrees of v and w, and δ(c_v, c_w) is an indicator function taking value 1 iff v and w belong to the same cluster, 0 otherwise.

Eva leverages the modularity score to incrementally update community memberships. Unlike in Louvain, such an update is weighted in terms of another fitness function tailored to capture the overall label dispersion within communities, namely purity.

Definition 4 (Purity). Given a community c ∈ C, its purity is the product of the frequencies of the most frequent labels carried by its nodes. Formally:

$$P_c = \prod_{a \in A} \frac{\max\left(\sum_{v \in c} a(v)\right)}{|c|} \tag{2}$$

where A is the label set, a ∈ A is a label, and a(v) is an indicator function that takes value 1 iff a ∈ A(v). The purity of a partition is then the average of the purities of the communities that compose it:

$$P = \frac{1}{|C|} \sum_{c \in C} P_c \tag{3}$$

Purity assumes values in [0, 1] and it is maximized when all the nodes belonging to the same community share the same attribute profile.

¹ Python code available at: https://github.com/GiulioRossetti/EVA.
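As an illustration of Eqs. 2 and 3, the following minimal sketch computes the purity of a community and of a partition. It is not taken from the Eva code base: the function names, the `labels` dictionary layout, and the reading of Eq. 2 as "for each attribute, take the relative frequency of its most common value, then multiply" are assumptions made for this example, and every node is assumed to carry a value for every attribute.

```python
from collections import Counter


def community_purity(community, labels):
    """Purity of one community under the reading of Eq. 2 assumed here.

    community: iterable of node ids
    labels:    dict node -> dict attribute -> categorical value
               (hypothetical layout; every node has every attribute)
    """
    nodes = list(community)
    if not nodes:
        raise ValueError("community must not be empty")
    purity = 1.0
    for attr in labels[nodes[0]]:
        # Relative frequency of the most common value of this attribute.
        counts = Counter(labels[v][attr] for v in nodes)
        purity *= counts.most_common(1)[0][1] / len(nodes)
    return purity


def partition_purity(partition, labels):
    """Average of the community purities (Eq. 3)."""
    return sum(community_purity(c, labels) for c in partition) / len(partition)


labels = {
    1: {"lang": "it", "gender": "f"},
    2: {"lang": "it", "gender": "m"},
    3: {"lang": "en", "gender": "f"},
    4: {"lang": "en", "gender": "f"},
}
partition = [{1, 2}, {3, 4}]
print(community_purity({1, 2}, labels))     # 0.5: same language, mixed gender
print(partition_purity(partition, labels))  # 0.75
```

The community {3, 4} is fully homogeneous and reaches the maximum purity of 1, while {1, 2} is penalized by the attribute on which its nodes disagree.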
S. Citraro and G. Rossetti
Algorithm 1. EVA
 1: function EVA(G, α)
 2:   C ← Initialize(G)
 3:   Z ← αP + (1 − α)Q
 4:   Zprev ← −∞
 5:   while Z > Zprev do
 6:     C ← MoveNodes(G, C, α)
 7:     G ← Aggregate(G, C)
 8:     Zprev ← Z
 9:     Z ← αP + (1 − α)Q
10:   return C
The primary assumption underlying the purity definition is that node labels can be considered as independent and identically distributed random variables: in such a scenario, considering the product of maximal frequency labels is equivalent to computing the probability that a randomly selected node in the given community has exactly that specific label profile. Eva takes into account both modularity and purity while incrementally identifying a network partition. To do so, it combines them linearly, thus implicitly optimizing the following score:

$$Z = \alpha P + (1 - \alpha) Q \tag{4}$$

where α is a trade-off parameter that allows tuning the importance of each component, to better adapt the algorithm results to the analyst's needs.

Eva's pseudocode is shown in Algorithm 1. Our approach takes as input a labeled graph G and a trade-off value α, and returns a partition C. As a first step (line 2), Eva assigns each node to a singleton community and computes the initial quality Z as a function of both modularity and purity. After the initialization step, the algorithm main loop is executed (lines 5–9). Eva's computation, like Louvain's, can be broken into two main components: (i) greedy identification of the community merging move that produces the optimal increase of the partition quality (line 6), and (ii) network reconstruction (line 7).

Algorithm 2 details the procedure applied to identify the best move among the possible ones. Eva's inner loop cycles over the graph nodes and, for each of them, evaluates the gain in terms of modularity and purity while moving a single neighboring node to its community (lines 8–14). For each pair (v, w) the local gain produced by the move is computed: Eva compares this value with the best gain identified so far and, if needed, updates the latter to keep track of the newly identified optimum. In case of ties, the move that results in a higher increase of the community size is preferred (lines 15–18). This procedure is repeated until no more moves are possible (line 19). As a result of Algorithm 2, the original allocation of nodes to communities is updated. After this step, the Aggregate function (Algorithm 1, line 7) hierarchically updates the original graph G, transforming its communities into nodes, thus allowing the algorithm main loop to be repeated until there are no moves able to increase the partition quality (Algorithm 1, lines 8–9).
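The combined score of Eq. 4 can be illustrated with a small, self-contained sketch. The code below is not the authors' implementation: it assumes an undirected, unweighted graph without self-loops given as an edge list, computes modularity in the equivalent community-wise form Q = Σ_c [L_c/m − (d_c/2m)²] (L_c edges inside community c, d_c total degree of c), and the names `modularity` and `combined_score` are hypothetical.

```python
def modularity(edges, membership):
    """Modularity Q of a partition (Eq. 1, community-wise form).

    edges:      list of (u, v) pairs, undirected, no self-loops (assumed)
    membership: dict node -> community id
    """
    m = len(edges)
    inside = {}   # community id -> number of internal edges
    total = {}    # community id -> sum of the degrees of its nodes
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        total[cu] = total.get(cu, 0) + 1
        total[cv] = total.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (total[c] / (2 * m)) ** 2 for c in total)


def combined_score(purity, q, alpha):
    """Z = alpha * P + (1 - alpha) * Q (Eq. 4)."""
    return alpha * purity + (1 - alpha) * q


# Two triangles joined by a single edge, split into their natural communities.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, membership)
# The purity value is supplied by hand here (1.0 = perfectly homogeneous labels).
z = combined_score(purity=1.0, q=q, alpha=0.5)
print(round(q, 4), round(z, 4))
```

With α close to 0 the score reduces to plain modularity, while α close to 1 makes label homogeneity dominate, which is the trade-off the parameter is meant to expose.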
Algorithm 2. EVA - MoveNodes
 1: function MoveNodes(G, C, α)
 2:   Cbest ← C
 3:   repeat
 4:     for all v ∈ V(G) do
 5:       s ← P[v]
 6:       sbest ← size
 7:       gbest ← −∞
 8:       for all u ∈ Γ(v) do
 9:         Cnew ← C
10:         Cnew[v] ← C[u]
11:         snew ← |Cnew[v]|
12:         qgain ← Q_Cnew − Q_C
13:         pgain ← P_Cnew − P_C
14:         g ← α·pgain + (1 − α)·qgain
15:         if g > gbest or (g == gbest and snew > s) then
16:           gbest ← g
17:           sbest ← snew
18:           Cbest ← Cnew
19:   until C == Cbest
20:   return Cbest
Eva Complexity. Being a Louvain extension, Eva shares the same time complexity, namely O(|V| log |V|). Regarding space consumption, the increase w.r.t. Louvain is related only to the data structures used for storing node labels. Considering k labels, the space required to associate them to each node in G is O(k|V|): assuming k 0.4 rapidly reach saturation. Table 3 compares modularity and purity for four different instantiations of Eva on the Amh dataset while varying the number of node attributes from 1 to 5. We can observe that both quality functions are stable w.r.t. the number of attributes and that α = 0.8 offers a viable compromise to our aim.

Table 3. Multi-attribute: modularity and purity comparison of Eva over Amh

          Modularity                        Purity
          Amh1  Amh2  Amh3  Amh4  Amh5      Amh1  Amh2  Amh3  Amh4  Amh5
Eva0.1    .43   .43   .43   .43   .43       .49   .13   .09   .04   .03
Eva0.5    .43   .43   .43   .43   .43       .49   .73   .77   .79   .80
Eva0.8    .43   .42   .42   .42   .43       .95   .93   .95   .94   .96
Eva0.9    .42   .36   .38   .40   .40       .97   .95   .97   .95   .98
4 Related Work
In this section, a brief overview of previous studies addressing LCD is presented. As previously discussed, classic CD algorithms deal only with the topological information since their clustering schemes are established by optimizing structural quality functions. In this scenario, LCD is a challenging and more sophisticated task, aiming to balance the weight of topological and attribute-related information expressed by data-enriched networks to extract coherent and well-defined communities. At the moment, an emerging LCD algorithm classification proposal [13] organizes the existing algorithms in three families on the basis of the different methodological principles they leverage: (i) topological-based LCD, where the attribute information is used to complement the topological one that guides the partition identification; (ii) attribute-based LCD, where topology is used as a refinement for partitions identified leveraging the information offered by node attributes; (iii) hybrid LCD approaches, where the two types of information are exploited in a complementary way to obtain the final partition. Examples of topological-based LCD are of three types: those that weight the edges taking account of the attribute information [14], those that use a label-augmented graph [15], and those that extend a topological quality function in order to also consider the attribute information [12,16]. All three methodologies share the idea that the attribute information should be attached to the topological one, while, in attribute-based LCD, attributes are merged with the structural information into a similarity function between vertices [12,17]. Finally, examples of hybrid LCD approaches are those that use an ensemble method to combine the found partitions [18] and those that use probabilistic models treating vertex attributes as hidden variables [19].
5 Conclusion
In this paper, we introduced Eva, a scalable algorithmic approach to address the LCD problem that optimizes the topological quality of the communities alongside attribute homophily. Experimental results highlight how the proposed method outperforms state-of-the-art CD and LCD competitors in terms of community purity and modularity, making it possible to identify high-quality results even in multi-attribute scenarios. As future work, we plan to generalize the Eva methodology, allowing the selection of alternative quality functions, both topological (e.g., conductance rather than modularity) and attribute-related (e.g., adopting assumptions for the purity computation other than the independence of the vertex attributes). Moreover, we plan to integrate our approach within the CDlib project [20] and to extend it to support numeric node attributes.

Acknowledgment. This work is partially supported by the European Community’s H2020 Program under the funding scheme “INFRAIA-1-2014-2015: Research Infrastructures” grant agreement 654024, http://www.sobigdata.eu, “SoBigData”.
References

1. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, vol. 1, pp. 281–297 (1967)
2. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008)
3. McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the construction of internet portals with machine learning. Inf. Retrieval 3, 127–163 (2000)
4. Leskovec, J., Mcauley, J.J.: Learning to discover social circles in ego networks. In: Advances in Neural Information Processing Systems, pp. 539–547 (2012)
5. Neville, J., Jensen, D., Friedland, L., Hay, M.: Learning relational probability trees. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2003, pp. 625–630. ACM (2003)
6. Trask, A., Michalak, P., Liu, J.: sense2vec - a fast and accurate method for word sense disambiguation in neural word embeddings. CoRR, vol. abs/1511.06388 (2015)
7. Traud, A.L., Mucha, P.J., Porter, M.A.: Social structure of Facebook networks. CoRR, vol. abs/1102.2166 (2011)
8. Traag, V.A., Waltman, L., van Eck, N.J.: From Louvain to Leiden: guaranteeing well-connected communities. CoRR, vol. abs/1810.08473 (2018)
9. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. 104(1), 36–41 (2007)
10. Rosvall, M., Bergstrom, C.T.: Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. 105(4), 1118–1123 (2008)
11. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007)
12. Dang, T.A., Viennet, E.: Community detection based on structural and attribute similarities. In: International Conference on Digital Society (ICDS) (2012)
13. Falih, I., Grozavu, N., Kanawati, R., Bennani, Y.: Community detection in attributed network. In: Companion Proceedings of The Web Conference, pp. 1299–1306 (2018)
14. Neville, J., Adler, M., Jensen, D.: Clustering relational data using attribute and link information. In: 18th International Joint Conference on Artificial Intelligence, pp. 9–15 (2003)
15. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute similarities. Proc. VLDB Endow. 2, 718–729 (2009)
16. Combe, D., Largeron, C., Géry, M., Egyed-Zsigmond, E.: I-Louvain: an attributed graph clustering method. In: Advances in Intelligent Data Analysis XIV, pp. 181–192. Springer, Cham (2015)
17. Falih, I., Grozavu, N., Kanawati, R., Bennani, Y.: ANCA: attributed network clustering algorithm. In: Complex Networks and Their Applications, vol. VI, pp. 241–252. Springer, Cham (2018)
18. Elhadi, H., Agam, G.: Structure and attributes community detection: comparative analysis of composite, ensemble and selection methods. In: Proceedings of the 7th Workshop on Social Network Mining and Analysis, SNA-KDD 2013, pp. 10:1–10:7. ACM (2013)
19. Yang, J., McAuley, J., Leskovec, J.: Community detection in networks with node attributes. In: 2013 IEEE 13th International Conference on Data Mining, pp. 1151–1156, December 2013
20. Rossetti, G., Milli, L., Cazabet, R.: CDLIB: a python library to extract, compare and evaluate communities from complex networks. Appl. Netw. Sci. 4(1), 52 (2019)
Exorcising the Demon: Angel, Efficient Node-Centric Community Discovery

Giulio Rossetti

KDD Lab, ISTI-CNR, Pisa, Italy
[email protected]
Abstract. Community discovery is one of the most challenging tasks in social network analysis. During the last decades, several algorithms have been proposed with the aim of identifying communities in complex networks, each one searching for mesoscale topologies having different and peculiar characteristics. Among such vast literature, an interesting family of Community Discovery algorithms, designed for the analysis of social network data, is represented by overlapping, node-centric approaches. In this work, following this line of research, we propose Angel, an algorithm that aims to lower the computational complexity of previous solutions while ensuring the identification of high-quality overlapping partitions. We compare Angel, both on synthetic and real-world datasets, against state-of-the-art community discovery algorithms designed for the same community definition. Our experiments underline the effectiveness and efficiency of the proposed methodology, confirmed by its ability to consistently outperform the identified competitors.

Keywords: Complex network analysis · Community discovery

1 Introduction
Community discovery (henceforth CD), the task of decomposing a complex network topology into meaningful node clusters, is allegedly the oldest and most discussed problem in complex network analysis [3,6]. One of the main reasons behind the attention this task has received during the last decades lies in its intrinsic complexity, strongly tied to its overall ill-posedness. Indeed, one of the few universally accepted axioms characterizing this research field regards the impossibility of providing a single shared definition of what a community should look like. Usually, every CD approach is designed to provide a different point of view on how to partition a graph: in this scenario, the solutions proposed by different authors were often proven to perform well when specific assumptions can be made on the analyzed topology. Nonetheless, decomposing a complex structure into a set of meaningful components represents per se a step required by several analytical tasks, a need that has transformed what is usually considered a weakness of the problem definition, the existence of multiple partition criteria, into one of its major strengths. Such peculiarity has led to the definition of several “meta” community definitions, often tied to specific analytical needs. Classic works intuitively describe communities as sets of nodes closer among them than with the rest of the network, while others only define such topologies as dense network subgraphs. A general, high-level formulation of the Community Discovery problem definition is the following:

Definition 1 (Community Discovery (CD)). Given a network G, a community C is a set of distinct nodes: C = {v_1, v_2, ..., v_n}. The community discovery problem aims to identify the set C of all the communities in G.

In this work, we introduce a CD algorithm, Angel, tailored to extract overlapping communities from a complex network. Our approach is primarily designed for social network analysis and belongs to a well-known sub-family of Community Discovery approaches often identified by the keywords bottom-up and node-centric [18]. Angel aims to provide a fast way to compute reliable overlapping network partitions. The proposed approach focuses on lowering the computational complexity of existing methods, proposing scalable sequential (although easily parallelizable) solutions to a very demanding task: overlapping network decomposition.

The paper is organized as follows. In Sect. 2 we introduce Angel. There we discuss its rationale, the properties it holds, as well as its computational complexity. In Sect. 3 we evaluate the proposed method on both synthetic and real-world datasets for which ground truth communities are known in advance. To better discuss the resemblance of Angel partitions to ground truth ones, as well as its execution times, we compare the proposed method with state-of-the-art competitors sharing the same rationale. Finally, in Sect. 4 the literature relevant to our work is discussed and Sect. 5 concludes the paper.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 152–163, 2020. https://doi.org/10.1007/978-3-030-36687-2_13
2 Angel
In this section, we present our bottom-up solution to the community discovery problem: Angel¹. Our approach, as we will discuss, follows a well-known pattern composed of two phases: (i) construction of local communities moving from ego-network structures, and (ii) definition of mesoscale topologies by aggregating the identified local-scale ones. Since Angel's main goal is reducing the computational complexity of previous node-centric approaches, we will detail the merging strategy it implements to build up the final community partition and, finally, we will discuss its properties and study its complexity.

Algorithm Rationale. The algorithmic schema of Angel is borrowed from the Demon [4] one, an approach whose main goal was to identify local communities capturing individual nodes' perspectives on their neighborhoods and to use them to build mesoscale ones. Angel takes as input a graph G, a merging threshold φ and an empty set of communities C. The main loop of the algorithm cycles over each node, so as to generate all the possible points of view of the network structure (Step #1 in Algorithm 1). To do so, for each node v, it applies the EgoMinusEgo(v, G) operation (Step #2 in Algorithm 1) as defined in [4]. This function extracts the ego-network centered in the node v (i.e., the graph induced on G and built upon v and its first-order neighbors), then removes v from it, obtaining a novel, filtered graph substructure. Angel removes v since, by definition, it is directly linked to all nodes in its ego-network, and these connections would introduce noise in the identification of local communities: a single node connecting the entire subgraph makes all nodes very close, even if they are not in the same local community.

Once the ego-minus-ego graph is obtained, Angel computes the local communities it contains (Step #3 in Algorithm 1). The algorithm performs this step by using a community discovery algorithm borrowed from the literature: Label Propagation (LP) [13]. This choice, as in [4], is justified by the fact that: (i) LP has low algorithmic complexity (∼ O(N), with N the number of nodes), and (ii) it returns results of a quality comparable to more complex algorithms [3]. Reason (i) is particularly important since Step #3 of Angel needs to be performed once for every node of the network, thus making it unacceptable to spend superlinear time on each node. Notice that, instead of LP, any other community discovery algorithm (either overlapping or not) can be used, with an impact on both the algorithmic complexity and the partition quality. Given the linear complexity (in the number of nodes of the extracted ego-minus-ego graph) of Step #3, we refer to this as the inner loop for finding the local communities. Due to the importance of LP for our approach, and to shed light on how it works, we briefly describe its classical formulation [13].

ALGORITHM 1. Angel
Input: G : (V, E), the graph; φ, the merging threshold.
Output: C, a set of overlapping communities.
 1: for v ∈ V do                              // Step #1
 2:   e ← EgoMinusEgo(v, G);                  // Step #2
 3:   C(v) ← LabelPropagation(e);             // Step #3
 4:   C ← C ∪ C(v)
 5: ncoms = |C|
 6: acoms = 0
 7: while ncoms != acoms do                   // Step #4
 8:   acoms = ncoms
 9:   C ← DecreasingSizeSorting(C);           // Step #5
10:   for c ∈ C do
11:     C ← PrecisionMerge(c, C, φ);          // Step #6
12:   ncoms = |C|
13: return C

¹ Code available at: https://github.com/GiulioRossetti/ANGEL.
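The ego-minus-ego extraction described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code of the Angel repository: the graph is assumed to be an undirected adjacency dictionary mapping each node to the set of its neighbors, and the function name `ego_minus_ego` is hypothetical.

```python
def ego_minus_ego(adj, v):
    """Subgraph induced by the first-order neighbors of v, with v removed.

    adj: dict node -> set of neighbors (undirected graph, assumed layout)
    Returns an adjacency dict restricted to the neighbors of v.
    """
    neighbours = set(adj[v]) - {v}
    # Keep, for every neighbor u, only the edges towards other neighbors of v.
    return {u: (set(adj[u]) & neighbours) - {u} for u in neighbours}


# Node 0 is connected to 1, 2 and 3; only 1 and 2 are connected to each other.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(ego_minus_ego(adj, 0))
```

Once the ego 0 is removed, node 3 becomes isolated while nodes 1 and 2 remain linked, so the two groups are no longer artificially glued together by the ego itself.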
Suppose that a node v has neighbors v_1, v_2, ..., v_k and that each one of them carries a label denoting the community that it belongs to: then, at each iteration the label of v is updated to the majority label of its neighbors. As the labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label. At the end of the propagation process, nodes with the same labels are grouped as one community. In case of bow-tie situations (i.e., a node having an equal maximum number of neighbors in two or more communities) the classic definition of the LP algorithm randomly selects a single label for the contended node. Angel, conversely, handles this situation by allowing soft community memberships, thus producing deterministic local partitions. The result of Steps #1–3 of Algorithm 1 is a set of local communities C(v), according to the perspective of a specific node v of the network. Unlike Demon, Angel does not reintroduce the ego in each local community, so as to reduce the noisy effects hubs play during the merging step.

Local communities are likely to be an incomplete view of the real community structure of G. Thus, the result of Angel needs further processing: namely, merging each local community with the ones already present in C. Once the outer loop on the network nodes is completed, Angel leverages the PrecisionMerge function to compact the community set C so as to avoid the presence of fully contained communities in it. This function (Step #6, detailed in Algorithm 2) implements a deterministic merging strategy and is applied iteratively until reaching convergence (Step #4), i.e., until the communities in C cannot be merged further. To assure that all the possible community merges are performed at each iteration, C is ordered from the smallest community to the biggest (Algorithm 1, Step #5).

This merging step is crucial since it needs to be repeated for each of the local communities. In Demon this operation requires the computation, for each pair of communities (x, y), x ∈ C(v) and y ∈ C, of an overlap measure (i.e., the Jaccard index) and the evaluation of whether its value exceeds a user-defined threshold. This approach, although valid, has a major drawback: given a community x ∈ C(v) it requires O(|C|) evaluations to identify its best match among its peers. Indeed, this kind of strategy represents a costly bottleneck, requiring an overall O(|C|²) complexity when applied to all the identified local communities.

ALGORITHM 2. PrecisionMerge
Input: x, a community; C, a set of overlapping communities; φ, the merging threshold.
Output: C, a set of overlapping communities.
1: com_to_freq ← community_frequency(x);      // Step #A
2: for com, freq ∈ com_to_freq do
3:   if freq / |x| ≥ φ then                   // Step #B
4:     C = C − {x, com}
5:     C = C ∪ {x ∪ com}
6: return C
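The classical label propagation update described above can be sketched as follows. This is a simplified illustration and not Angel's variant: ties are broken deterministically by keeping the current label when possible and otherwise taking the smallest tied label, which is an assumption introduced here so that the example is reproducible. The classical algorithm picks at random, and Angel instead allows soft memberships. Node ids are assumed to be sortable, and the function name is hypothetical.

```python
from collections import Counter


def label_propagation(adj, max_iter=100):
    """Asynchronous label propagation on an undirected adjacency dict.

    adj: dict node -> set of neighbors. Returns a list of node sets.
    Tie-breaking is deterministic (an assumption of this sketch).
    """
    labels = {v: v for v in adj}          # every node starts in its own community
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue                  # isolated nodes keep their label
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            best = {lab for lab, c in counts.items() if c == top}
            if labels[v] not in best:     # adopt a majority label of the neighbors
                labels[v] = min(best)
                changed = True
        if not changed:
            break                         # every node agrees with its neighborhood
    communities = {}
    for v, lab in labels.items():
        communities.setdefault(lab, set()).add(v)
    return list(communities.values())


# Two disjoint triangles: each one converges to a single shared label.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
print(label_propagation(adj))
```

The stopping rule mirrors the usual one: the loop ends when every node already carries a label held by the largest number of its neighbors.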
Angel aims to drastically reduce such computational complexity by performing the matches leveraging a greedy strategy. To do so, it proceeds in the following way:

(i) Angel assumes that each node carries, as additional information, the identifiers of all the communities in C it already belongs to;
(ii) in Step #A (Algorithm 2), for each local community x the frequency of the community identifiers associated with its nodes is computed;
(iii) in Step #B, for each pair (community_id, frequency) its Precision w.r.t. x is computed, namely the percentage of nodes in x that also belong to community_id;
(iv) iff the precision ratio is greater than or equal to a given threshold φ, the local community x is merged with community_id: their union is added to C and the original communities are removed from the same set.

Operating in this way avoids the time-expensive computation of community intersections required by Jaccard-like measures, since all the containment testing can be done in place.

Angel Properties. The proposed approach possesses two nice properties: it produces a deterministic output (once the network G and the threshold φ are fixed), and it allows for a parallel implementation.

Property 1 (Determinism). There exists a unique C = Angel(G, φ) for any given G and φ, regardless of the order of visit of the nodes in G.

To prove the determinism of Angel it is necessary to break its execution into two well-defined steps: (i) local community extraction and (ii) merging of local communities.

(i) Local communities: Label Propagation identifies communities by applying a greedy strategy. In its classical formulation [13] it does not assure convergence to a stable partition due to the so-called “label ping-pong problem” (i.e., an instability scenario primarily due to bow-tie configurations). However, as already discussed, Angel addresses this problem by relaxing the single-label constraint on nodes, thus allowing for the identification of a stable configuration of overlapping local communities.
(ii) Merging: this step operates on a well-determined set of local communities on which the PrecisionMerge procedure is applied iteratively. Since we explicitly impose the community visit ordering, the determinism of the solution is given by construction.

Property 2 (Compositionality). Angel is easily parallelizable since the local community extraction can be applied locally on well-defined subgraphs (i.e., ego-minus-ego networks).
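The precision test of the merging procedure (Steps #A and #B) can be sketched as follows. This is a simplified, incremental variant written for illustration and not the reference implementation: it assumes that the local community x is not yet stored in C, keeps C as a dictionary from community id to node set, and maintains a node-to-communities index so that the frequency count never intersects whole communities. All names are hypothetical.

```python
from collections import Counter


def precision_merge(x, communities, node_to_coms, phi):
    """Merge local community x into every stored community it is precise for.

    x:            set of nodes (local community, assumed non-empty)
    communities:  dict community id (int) -> set of nodes, updated in place
    node_to_coms: dict node -> set of community ids, updated in place
    phi:          merging threshold on the precision freq / |x|
    """
    # Step #A: how many nodes of x already belong to each stored community.
    freq = Counter(cid for v in x for cid in node_to_coms.get(v, ()))
    merged = False
    for cid, count in freq.items():
        # Step #B: precision of x w.r.t. community cid.
        if count / len(x) >= phi:
            communities[cid] |= x
            for v in x:
                node_to_coms.setdefault(v, set()).add(cid)
            merged = True
    if not merged:
        # No sufficiently precise match: x is stored as a new community.
        new_id = max(communities, default=-1) + 1
        communities[new_id] = set(x)
        for v in x:
            node_to_coms.setdefault(v, set()).add(new_id)
    return communities


communities = {0: {1, 2, 3, 4}}
node_to_coms = {1: {0}, 2: {0}, 3: {0}, 4: {0}}
precision_merge({3, 4, 5}, communities, node_to_coms, phi=0.6)  # 2/3 >= 0.6: merged
precision_merge({5, 8, 9}, communities, node_to_coms, phi=0.6)  # 1/3 < 0.6: new community
print(communities)
```

The cost of one call is proportional to the number of memberships of the nodes in x, rather than to the number of communities in C, which is the saving the in-place containment test is meant to provide.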
Given a graph G = (V, E), it is possible to instantiate Angel's local community extraction simultaneously on all the nodes u ∈ V and then apply PrecisionMerge recursively in order to reduce and compact the final overlapping partition:

$$\text{Angel}(G, \phi) = \text{PMerge}\left( \bigcup_{u \in V} LP(EME(u)) \right) \tag{1}$$

The underlying idea is to operate community merging only when all the local communities are already identified (i.e., LabelPropagation is applied to the ego-minus-ego networks of all the nodes u ∈ V, LP(EME(u)) in Eq. 1, as shown in Fig. 1). Moreover, this parallelization schema is assured to produce the same network partition obtained by the original sequential approach, due to the determinism property.

Angel Complexity. To evaluate the time complexity we proceed by decomposing Angel into its main components. Given the pseudocode description provided in Algorithm 1, we can divide our approach into the following sub-procedures:
Fig. 1. Angel parallelization schema. The graph G is decomposed into |V| ego-minus-ego networks by a dispatcher D and distributed to n workers {LP_0, ..., LP_n} that extract local communities from them. At the end of this parallel process, a collector C iteratively applies PrecisionMerge until obtaining the final overlapping partition.
(i) Outer loop (lines 1–4): the algorithm cycles over the network nodes to extract the ego-minus-ego networks and identify local communities. This main loop has thus complexity O(|V|).
(ii) Local communities extraction: the Label Propagation algorithm has complexity O(n + m) [13], where n is the number of nodes and m is the number of edges of the ego-minus-ego network. Let us assume that we are working with a scale-free network, whose degree distribution is $p_k = k^{-\alpha}$: in this scenario the majority of the identified ego-minus-ego networks are composed by n

– Communicability (Comm): $K = \sum_{n=0}^{\infty} \frac{\alpha^n A^n}{n!} = \exp(\alpha A)$, α > 0 [8,9]
– Forest: $K = \sum_{n=0}^{\infty} \alpha^n (-L)^n = (I + \alpha L)^{-1}$, α > 0 [3]
Impact of Network Topology on Measures Eﬃciency
– Heat: $K = \sum_{n=0}^{\infty} \frac{\alpha^n (-L)^n}{n!} = \exp(-\alpha L)$, α > 0 [14]
– PageRank (PR): $K = (I - \alpha P)^{-1}$, 0 < α < 1 [18]

It is important to note that, because of the kernel definitions, the eigenvectors are the same for the Walk and Communicability measures, and for the Forest and Heat measures [1]. Consequently, Spectral clustering will lead to the same partitions for these 2 pairs, and we will use only 3 measures (i.e., Walk, Forest, and PageRank) instead of 5 when discussing results for the Spectral method.

3.4 Network Generation
To generate networks with different topologies, we use the LFR model introduced by Lancichinetti et al. in [15]. Generated networks share a number of features that real networks have, e.g., power law degree and community size distributions. By changing the model input parameters, one can get networks varying in size, average degree, power law exponents for the degree and community size distributions, minimum and maximum size of clusters, and cluster quality, i.e., the fraction of inter-community edges. Together, this allows one to get graphs with completely different structures. We discuss the chosen input parameters and the reasons for this choice in Sect. 4.

3.5 Clustering Quality Evaluation
For clustering quality evaluation, the Adjusted Rand Index (ARI) introduced in [10] is used. Reasons for using this quality index are provided in [16]. ARI plays an important role in the study, so we will briefly explain it. Initially, the Rand Index was introduced in [20]. If X and Y are two different partitions (clusterings) of n elements, let a be the number of pairs of elements that are in the same clusters in X and Y, and b the number of pairs of elements that are in different clusters in X and Y. Then the Rand Index equals $\frac{a+b}{\binom{n}{2}}$. The idea here is simple: it is the number of agreements between two partitions divided by the total number of pairs. Unfortunately, the Rand Index has a drawback: the expected value of the Rand Index is not zero for random partitions. So, it should be corrected, and ARI is the corrected version of the Rand Index:

$$\text{ARI} = \frac{\text{Index} - \text{ExpectedIndex}}{\text{MaxIndex} - \text{ExpectedIndex}}$$

For ARI, 1 refers to perfect matching, while 0 is characteristic of random labeling.
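As a worked illustration of the two indices, the sketch below computes the Rand Index by direct pair counting and the ARI through the usual contingency-table form (pair counts within cells, rows and columns), which is a common way to instantiate Index, ExpectedIndex and MaxIndex. It is a minimal example, not the evaluation code used in the study; partitions are assumed to be given as equal-length lists of cluster labels, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(x, y):
    """(a + b) / C(n, 2): fraction of element pairs on which x and y agree."""
    pairs = list(combinations(range(len(x)), 2))
    agree = sum((x[i] == x[j]) == (y[i] == y[j]) for i, j in pairs)
    return agree / len(pairs)


def adjusted_rand_index(x, y):
    """ARI = (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex)."""
    n = len(x)
    index = sum(comb(c, 2) for c in Counter(zip(x, y)).values())
    rows = sum(comb(c, 2) for c in Counter(x).values())
    cols = sum(comb(c, 2) for c in Counter(y).values())
    expected = rows * cols / comb(n, 2)
    max_index = (rows + cols) / 2
    if max_index == expected:
        return 1.0                      # degenerate case: trivial partitions
    return (index - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
same = [1, 1, 0, 0]        # identical clustering, labels renamed
crossed = [0, 1, 0, 1]     # splits every true cluster in half

print(rand_index(truth, same), adjusted_rand_index(truth, same))
print(rand_index(truth, crossed), adjusted_rand_index(truth, crossed))
```

The crossed partition still agrees with the ground truth on a third of the pairs, so its Rand Index is positive, while its ARI is negative, which illustrates why the correction for chance is needed.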
4 Experimental Methodology
In this study, measures are tested in experiments with networks generated using the LFR model. To obtain diﬀerent topologies, the following 6 input parameters of the LFR model are varied:
R. Aynulin
– Network size (n). Obviously, real networks are different in size. Due to computational limits, we cannot generate really big networks, so graphs are generated with the following numbers of nodes: 100, 300, 500, 1000, 2000, 3000.
– Average degree (m). Varying the average degree from 2 to 15 with step 1 allows obtaining networks from very sparse (for which the clustering quality will be close to 0, regardless of other parameters) to pretty dense (with the clustering quality close to 1 for networks with good community structure).
– Power law exponent for the degree distribution (τ1). The power law exponent is usually considered to be in the range from 2 to 3 [15,17]. We use the following values of τ1: 2.0, 2.2, 2.4, 2.5, 2.6, 2.8, 3.0.
– Power law exponent for the community size distribution (τ2). Like the degree distribution, the community size distribution was also reported to follow the power law, with the typical limits 1 < τ2 < 2 [15]. Networks generated in this study have the power law exponent for the community size distribution varying from 1 to 2 with step 0.25.
– Minimum and maximum community size (cmin and cmax). Changing the limits for the community size, we can get networks with a lot of small communities, few big communities, and intermediate stages between them. As a baseline, for n = 300, the following limits are used: [20, 50], [50, 80], [80, 140], [140, 185]; if the network size is different, then the limits are scaled accordingly.
– Fraction of inter-community edges (μ). This parameter allows changing the quality of communities. We vary μ in the range from 0.1 to 0.6 with step 0.1.

Graphs are generated in the following way: the basic configuration is n = 300, m = 5, τ1 = 2.5, τ2 = 1.5, cmin = 80, cmax = 140, μ = 0.2. Then, one of the parameters varies from the basic configuration within the limits described above. After generation, networks are clustered with each of the measures listed in Sect. 3.3 using the Ward and Spectral methods.
Each of the measures depends on a parameter. Therefore, we also search for the optimal parameter value, and the reported results give the clustering quality for that optimal value. As already noted, the quality of the produced partitions is evaluated using ARI. To obtain a stable result, 100 graphs are generated for each combination of parameters, and the quality index is averaged over them.
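For reference, the Adjusted Rand Index of Hubert and Arabie [10] and its averaging over repeated runs can be sketched as below. This is a plain-Python illustration, not the code used in the experiments; the function names adjusted_rand_index and mean_ari are hypothetical, and returning 1.0 in the degenerate case (both partitions trivially identical) is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two partitions given as equal-length label sequences."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    total = comb(n, 2)
    if total == 0:
        return 1.0
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both partitions are all-singletons or a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def mean_ari(runs):
    """Average ARI over (true_labels, predicted_labels) pairs, e.g. 100 generated graphs."""
    scores = [adjusted_rand_index(t, p) for t, p in runs]
    return sum(scores) / len(scores)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to relabeling
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # worse than chance
print(same, crossed)
```

The index equals 1 for identical partitions and is close to 0 for independent ones, which is why averaged values near 0 indicate that the community structure was not recovered.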
5 Results
In this section, we discuss the results of the experiments described above. Figure 1 presents results for the Spectral method. As noted in Sect. 3.3, it is meaningful here to analyze results for only 3 of the 5 measures under study. The basic set of parameters is marked by a circle on each of the graphs. The x-axis shows the values of the varying parameter, while the average ARI is plotted on the y-axis. When the community size limits are changed, the average number of clusters for the generated networks is plotted on the x-axis.
Impact of Network Topology on Measures Eﬃciency
Fig. 1. Results for the Spectral method, point n = 300, m = 5, τ1 = 2.5, τ2 = 1.5, cmin = 80, cmax = 140, μ = 0.2 is marked
Both the clustering algorithm and the proximity measure may depend on the network topology. Features common to the plots for different proximity measures show how the algorithm depends on the topology, while deviations from the general picture show the dependence of the proximity measure on the network topology.
As can be seen, for the Spectral algorithm, all proximity measures behave similarly when the topology changes. Their ranking by the quality index is also largely preserved. So we can conclude that, when using the Spectral method, the relative clustering quality of each measure is almost independent of the network topology, and the top-performing measure is Walk for most topologies.

Let us now look at the common features of the plots for different measures and analyze how the clustering quality depends on the network structure for the Spectral algorithm itself. According to Fig. 1a, the Spectral method copes well with an increase in network size. This is not generally true for all community detection algorithms [19]. There is a rapid increase in the clustering quality when the average degree increases (Fig. 1b). This can be explained by the definition of a community on which the Spectral method is based. Like most clustering algorithms, this method looks for groups of nodes which are densely connected, and this is hard to do when there are almost no edges in the network.

Figure 1c reveals an interesting relationship between the quality and the power law exponent for the degree distribution. For example, there are several local maxima and minima, and the local minimum at τ1 = 2.2 is followed by a peak at τ1 = 2.4. So far, there is no explanation for this behavior, and it can be explored in more depth in future studies. In Fig. 1d, we can see that the quality is almost independent of τ2. According to Fig. 1e, the clustering quality is better when there are many small communities than when there are few big communities. Finally, in Fig. 1f, one can see an expected steep decline in the quality when the fraction of inter-cluster edges increases.

Results for the Ward algorithm are presented in Fig. 2. This algorithm is more sensitive to the choice of the proximity measure. This is most noticeable in Fig. 2c, where we vary the power law exponent for the degree distribution. However, we can still find measures which perform well for most topologies (Walk and Communicability), and the ranking of measures by quality generally remains the same. So, in general, the superiority of one measure over another is a fundamental property which does not depend on the network topology.

However, there are a few exceptions. For example, PageRank outperforms all the other measures when there are many small clusters (Fig. 2e). An interesting relation can be seen in Fig. 2f. According to it, the Forest and Heat measures are slightly worse than the others when there is a clear cluster structure and μ = 0.1. But as soon as the cluster structure becomes slightly less distinct and the fraction of inter-cluster edges increases to 0.2, their quality drops rapidly to zero. So, when using the Ward algorithm, Forest and Heat can detect clusters only if the community structure is distinct and there are almost no edges between clusters.

Let us now analyze the features common to all the measures, by which we can assess the impact of network topology on the Ward algorithm itself.
Fig. 2. Results for the Ward method, point n = 300, m = 5, τ1 = 2.5, τ2 = 1.5, cmin = 80, cmax = 140, μ = 0.2 is marked
Due to computational limits, we used networks with n ≤ 1000 for clustering with the Ward method. However, even for this range of network sizes, Fig. 2a shows that the performance of the Ward method degrades when the network size increases.
Similarly to the Spectral method, the quality of clustering increases with the average degree (Fig. 2b) and decreases as μ increases (Fig. 2f). Also, according to Fig. 2e, many small clusters are better than few big clusters for the Ward method. The explanation for these properties is the same as for the Spectral method. Figure 2d shows that the power law exponent for the community size distribution still does not substantially affect the efficiency of community detection, although there are more fluctuations than for the Spectral method. According to Fig. 2c, the relation between the efficiency and τ1 is fuzzy, and it is hard to detect any properties common to all the measures. We can also draw some conclusions about the comparative efficiency of the Ward and the Spectral algorithms. According to the results of the experiments, the Spectral method outperforms the Ward method in most cases.
6 Conclusion
In this paper, we studied how the network topology affects the quality of community detection for the graph measures Walk, Communicability, Forest, Heat, and PageRank. A variety of network topologies were generated using the LFR model, and the resulting graphs were clustered using the Ward and the Spectral methods in combination with each of the above measures. As a result, we found that the efficiency of proximity measures depends on the network topology to some extent. However, this dependence is not critical, and measures which are efficient for most topologies can be found. For the Spectral method, the most efficient measure is Walk. When the Ward method is used, the Walk and the Communicability measures outperform the others in most cases. We have also found some features common to all the measures. From these common features, we can conclude how the algorithms themselves depend on the network topology. For example, both the Ward and the Spectral methods prefer small clusters to big clusters.
References
1. Avrachenkov, K., Chebotarev, P., Rubanov, D.: Kernels on graphs as proximity measures. In: International Workshop on Algorithms and Models for the WebGraph. LNCS, vol. 10519, pp. 27–41. Springer (2017)
2. Aynulin, R.: Efficiency of transformations of proximity measures for graph clustering. In: International Workshop on Algorithms and Models for the WebGraph. LNCS, vol. 11631, pp. 16–29. Springer (2019)
3. Chebotarev, P.Y., Shamis, E.: On the proximity measure for graph vertices provided by the inverse Laplacian characteristic matrix. In: 5th Conference of the International Linear Algebra Society, Georgia State University, Atlanta, pp. 30–31 (1995)
4. Chebotarev, P.: The walk distances in graphs. Discrete Appl. Math. 160(10–11), 1484–1500 (2012)
5. Costa, L.d.F., Oliveira Jr., O.N., Travieso, G., Rodrigues, F.A., Villas Boas, P.R., Antiqueira, L., Viana, M.P., Correa Rocha, L.E.: Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Adv. Phys. 60(3), 329–412 (2011)
6. Deza, M.M., Deza, E.: Encyclopedia of Distances. Springer, Berlin (2016)
7. Emmons, S., Kobourov, S., Gallant, M., Börner, K.: Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS One 11(7), e0159161 (2016)
8. Estrada, E.: The communicability distance in graphs. Linear Algebra Appl. 436(11), 4317–4328 (2012)
9. Fouss, F., Yen, L., Pirotte, A., Saerens, M.: An experimental investigation of graph kernels on a collaborative recommendation task. In: Sixth International Conference on Data Mining (ICDM 2006), pp. 863–868. IEEE (2006)
10. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
11. Ivashkin, V., Chebotarev, P.: Do logarithmic proximity measures outperform plain ones in graph clustering? In: International Conference on Network Analysis. PROMS, vol. 197, pp. 87–105. Springer (2016)
12. Jeub, L.G., Balachandran, P., Porter, M.A., Mucha, P.J., Mahoney, M.W.: Think locally, act locally: detection of small, medium-sized, and large communities in large networks. Phys. Rev. E 91(1), 012821 (2015)
13. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43 (1953)
14. Kondor, R., Lafferty, J.: Diffusion kernels on graphs and other discrete input spaces. In: International Conference on Machine Learning, pp. 315–322 (2002)
15. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78(4), 046110 (2008)
16. Milligan, G.W., Cooper, M.C.: A study of the comparability of external criteria for hierarchical cluster analysis. Multivar. Behav. Res. 21(4), 441–458 (1986)
17. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003)
18. Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab (1999)
19. Pasta, M.Q., Zaidi, F.: Topology of complex networks and performance limitations of community detection algorithms. IEEE Access 5, 10901–10914 (2017)
20. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
21. Schenker, A., Last, M., Bunke, H., Kandel, A.: Comparison of distance measures for graph-based clustering of documents. In: International Workshop on Graph-Based Representations in Pattern Recognition. LNCS, vol. 2726, pp. 202–213. Springer (2003)
22. Sommer, F., Fouss, F., Saerens, M.: Comparison of graph node distances on clustering tasks. In: International Conference on Artificial Neural Networks. LNCS, vol. 9886, pp. 192–201. Springer (2016)
23. Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
24. Ward Jr., J.H.: Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58(301), 236–244 (1963)
25. Yen, L., Vanvyve, D., Wouters, F., Fouss, F., Verleysen, M., Saerens, M.: Clustering using a random walk based distance measure. In: ESANN, pp. 317–324 (2005)
Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks
Mário Cordeiro1,2, Rui Portocarrero Sarmento2, Pavel Brazdil2, Masahiro Kimura3, and João Gama2
1 Faculty of Engineering, University of Porto, Porto, Portugal [email protected]
2 Laboratory of Artificial Intelligence and Decision Support, Porto, Portugal
3 Faculty of Science and Technology, Ryukoku University, Kyoto, Japan

Abstract. Discovering communities in a network is a fundamental and important problem in complex networks. Finding the most influential actors among their peers is a major task. On the one hand, studies on community detection ignore the influence of actors and communities; on the other hand, ignoring the hierarchy and community structure of the network neglects actor or community influence. We bridge this gap by combining a dynamic community detection method with a dynamic centrality measure. The proposed enhanced dynamic hierarchical community detection method computes centrality for nodes and aggregated communities and selects each community's representative leader using the ranked centrality of every node belonging to the community. This method is then able to unveil, track, and measure the importance of main actors and of intra- and inter-community structural hierarchies based on a centrality measure. The empirical analysis, performed using two temporal networks, shows that the method is able to find and track community leaders in evolving networks.

Keywords: Community detection · Dynamic networks · Community leaders · Centrality measures

1 Introduction
Typical tasks of Social Network Analysis (SNA) involve: the identification of the most influential, prestigious and central actors, using statistical measures; the identification of hubs and authorities, using link analysis algorithms; the discovery of communities, using community detection techniques; the visualization of the interactions between actors; or the spreading of information. These tasks are instrumental in the process of extracting knowledge from networks and consequently in the process of problem-solving with network data. In particular, centrality measures help us to identify relevant nodes and quantify the notion of importance of an actor in the network. Recently, researchers have invested a lot of effort in the development of efficient algorithms able to compute centrality measures of nodes in very large evolving networks.

Community Detection is also a key task to unveil and understand the underlying structure of complex networks. Because community structures are very common in complex networks, detecting and identifying these communities is key to understanding hidden features of a network. Detecting communities summarizes interactions between members and gives a deeper understanding of interesting characteristics shared between members of the same community. Social network communities usually have one leader, who in many cases represents the most influential, prestigious or central actor of the whole community. Identifying these actors should not be neglected.

© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 198–210, 2020. https://doi.org/10.1007/9783030366872_17
Fig. 1. Zachary karate club community detection. Node 33 represents the “John A” group and Node 0 the “Mr. Hi” group: Classical (a) vs Hierarchical (b) → (c) → (d).
Fig. 2. Jemaah Islamiyah Bali bombings cell. Strategy (Samudra), logistics commander (Idris) and team’s gofer (Imron): Classical (a) vs Hierarchical (b) → (c).
Currently, the majority of the methods and approaches to the above-mentioned tasks present some limitations. Quantifying the importance of actors in social networks using centrality measures based only on the local or global connectivity of the nodes is considered to be inappropriate, the primary cause being the inattention to the hierarchy and community structure inherent in all human social networks. On the other hand, even the community detection methods that provide a hierarchy and community structure of a network do not give any insight into what these individual communities represent in the overall network, nor any information about community influence (as shown in Figs. 1¹ and 2²). In brief, community detection methods do not answer essential questions such as: What is the importance of each community among all the other identified communities? What are the major representative nodes of these densely connected sets of actors? Moreover, what is the underlying structure of the community? In short, who are the most influential actors in the community, the so-called community leaders?

To answer these questions, we propose, to the best of our knowledge, the first method which combines a dynamic Laplacian centrality [7] with a dynamic hierarchical community detection method [6]. The objective is to identify, rank and track communities and their leaders over time in evolving networks. The main contribution of the proposed method is threefold: first, the new method identifies and ranks communities according to their influence in the overall network; second, it ranks the most influential actors in each one of the communities to select a community leader (the most representative node of the community); and finally, it provides an in-depth community hierarchy and structure of the individual communities that form the full network.

It is an important task to extract the hierarchical structure of communities and their influence in terms of leadership (i.e., the Leadership Hierarchy) at each timestep in a large evolving network. However, this is challenging since a large amount of computation is required in general. This paper presents a promising solution. Moreover, the method supports both an incremental-only and a fully dynamic setting to perform community detection using local modularity optimization and to compute the Laplacian centrality for nodes and individual communities in evolving networks. By incremental only we mean networks in which only new nodes and edges are added in subsequent snapshots; by fully dynamic we mean support for both addition and removal of nodes and edges.
¹ Figure 1 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Intro.html.
² Figure 2 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Intro2.html.

2 Related Methods and Techniques

Cordeiro et al. [8] review the state of the art in selected aspects of evolving social networks and present open research challenges related to Online Social Networks. In the present work, the focus of research is on maintenance methods, in which it is desirable to maintain the results of the data mining process continuously over time [1]. More specifically, we are interested in identifying, ranking and tracking communities and their leaders continuously over time. To accomplish this, two different kinds of methods are required: the detection and identification of community structures, which represent occurrences of groups of nodes in the network that are more densely connected internally than with the rest of the network; and actor-level or node-level statistical measures, such as centrality measures and leadership, to determine the importance of an actor or node within the network, i.e., to reveal the individuals in which the most important relationships are concentrated and give an idea about their social power among their peers.

Community Detection Methods: Fortunato [10], in a comprehensive survey devoted to the methods and techniques of finding communities in static networks, classifies hierarchical clustering methods into two types: divisive algorithms, such as Girvan and Newman; and agglomerative algorithms, such as [3], in which a greedy modularity optimization is used. Concerning dynamic networks, Aggarwal and Subbian [1] propose a division into methods for slowly evolving networks and methods for streaming networks. Methods for slowly evolving networks are: [3] when used in batch snapshots; [33], a modularity-based method employing the principles of Palla et al. [26] on the events in the life of communities (growth, contraction, merging, splitting, birth and death); QCA [24], a fast and adaptive algorithm based on [3] and preferable neighboring groups; AFOCS [23], a modified QCA that allows the detection of overlapping communities; label propagation techniques, more specifically speaker-listener label propagation (SLPA) methods such as LabelRank, GANXiSw and LabelRankT, all with good performance in overlapping community detection [36]; and Cordeiro et al. [6], a modularity-based fully dynamic community detection algorithm in which dynamically added and removed nodes and edges only affect their related communities. By reusing the community structure obtained in previous iterations, the local modularity optimization step operates on smaller networks where only affected communities are disbanded to their origin.
Streaming networks methods include: [35], which uses a local weighted-edge-based pattern (LWEP) as a summary to cluster weighted graph streams and perform dynamic community detection; almost linear time simple label propagation algorithms [17,29]; spectral-based efficient memory-limited streaming clustering algorithms [38]; random and adaptive sampling methods [38]; SONIC, a find-and-merge type of overlapping community detection [31]; and SCoDA, a linear streaming algorithm for very large networks [11]. By combining consensus clustering [16] in a self-consistent way with any of the above methods, the stability and the accuracy of the resulting partitions are enhanced considerably.

Centrality Measures: There is no consensual best centrality measure for graphs; several measures are accepted and give good centrality values that are valuable in different scenarios. In [8] it is shown that most of the commonly used centrality measures for static networks have versions for evolving networks. Commonly used measures are the following. Betweenness Centrality B(v) provides an indicator of the extent to which a node lies in between other nodes in the network. Popular implementations are the Brandes [4] algorithm for static networks and Nasre et al. [22] for incremental networks; Kas et al. [12] adapted the classical Betweenness algorithm for evolving graphs. Closeness Centrality C(v) quantifies the reachability of every network node from a given starting node. It gives an overall indicator of the actor's position in the complete network by measuring, on average, how long it takes to reach any of the other network nodes. Two methods proposed in the literature use incremental updates of the Closeness Centrality measure in evolving networks: Kas et al. [13] and Sariyuce et al. [30]. Eigenvector Centrality E(v) assumes that the status of a node is recursively defined by the status of its first-degree connections, i.e., the nodes that are directly connected to a particular node [25]. It is often implemented for static networks using Katz centrality and Google’s PageRank. Evolving network variants are [2,9] and [15]. Laplacian Centrality L(v) makes it possible to include intermediate surrounding information of a node or vertex when computing its centrality measure. The Laplacian centrality of a given vertex v is described as a function of the number of 2-walks in the network in which v takes part. The known fact that Laplacian Centrality is a local measure [27,28] motivated the incremental versions of [27] and [7], whose efficiency is improved by computing single node centralities only for the nodes affected by the removal or addition of edges in network snapshots.

Influential Communities and Community Leaders: In analogy to actor-level or node-level statistical measures that determine the importance of an actor or node within the whole network, the influence of a community measures its influence on the overall network. This topic, ignored by the traditional community detection algorithms, has been introduced in the field very recently [20]. Commonly used approaches are: an efficient search method for the top-r k-influential communities, proposed by [20] (r is the number of communities, community nodes with degree at least k); the maximal kr-cliques community [18], which uses heuristics based on common graph structures like cliques; the k-influential community, based on the concept of k-core to capture the influence of a community, in [21]; and the skyline community, combining the concepts of k-core and a multi-valued network [19]. Ignored by the existing community detection algorithms, the role of community leaders arose in recent research. This problem is addressed by leader-follower model-based methods, which identify community structures under the premise that a community is a set of follower nodes around a leader: Top Leaders [14], Licod [37] and [32]. Recently, Sun et al. [34] proposed an agglomerative clustering method which measures the leadership of each node and lets each one adhere to its local leader, forming dependence trees. This leader-aware community detection algorithm finds community structures as well as the leaders of each community. Other proposals include the method of [39], which identifies influential nodes with community structure. The method uses the information transfer probability between any pair of nodes and the k-medoid clustering algorithm. In [40], a centrality index (the community-based centrality) was introduced to identify influential spreaders in a network based on the community structure of the network. The index considers both the number and the sizes of the communities that are directly linked by a node. None of the above influential community methods nor the community leader methods are suitable for evolving networks. All of them were devised and designed for handling static networks.
Fig. 3. Calculated node centralities with edge {(4, 6)} added: (a) Batch, (b) Incremental.

3 Identifying, Ranking and Tracking Community Leaders

To answer the questions raised before, this work proposes a new method to solve the problem of identifying, ranking and tracking community leaders in evolving networks. The method is a combination of an agglomerative hierarchical community detection algorithm [6] with an efficient centrality measure [7]. Both were designed to handle evolving networks, with proven efficiency in large networks. With this new method, communities can be detected and tracked over time very efficiently. In parallel, for each community, a leader is chosen to represent the community according to its centrality among its community peers. The main purpose is to identify and track communities over time and to establish intracommunity and intercommunity leadership hierarchies.

3.1 Dynamic Community Detection

The dynamic community detection algorithm proposed by Cordeiro et al. [6] shares the greedy optimization method of the static version by Blondel et al. [3], i.e., it attempts to optimize the modularity of a partition of the network in successive iterations. Communities are calculated by maximizing the objective function in a two-step optimization in each one of the iterations. In the first step (step 1), small communities are formed by optimizing the modularity locally. Only local changes in communities are allowed in this step. In the following step (step 2), nodes belonging to the same community are aggregated into a single node that represents a community in a new aggregated network of communities. These steps are repeated iteratively until no increase in modularity is possible, producing a hierarchy of communities. This algorithm is a modification of the original Louvain method. We dynamically add and remove nodes and edges that only affect their related communities. The algorithm handles the 4 different types of addition of nodes/edges and the 4 corresponding types of removal of edges/nodes, and thereby in every iteration leaves unchanged all the communities that were not affected by modifications to the network. Efficiency is achieved by reusing the community structure obtained in previous iterations, so the local modularity optimization step operates on smaller networks where only affected communities are disbanded to their origin.

3.2 Dynamic Laplacian Centrality
As stated before, the Laplacian Centrality metric is not a global measure, i.e., it is a function of the local degree plus the degrees of the neighbors (with different weights for each). In [27,28] it is shown that the local degree and the degrees of the 1st-order neighbors are all that is needed to calculate the metric for unweighted networks. Note that a similar approximation can be done for weighted networks. For evolving networks, the algorithm proposed by Cordeiro et al. [7] performs selective Laplacian Centrality calculations only for the nodes affected by the addition and removal of edges in the snapshot (i.e., it reuses the Laplacian Centrality information of the previous snapshot). To demonstrate this property, the toy network example in Fig. 3 is used to show the locality of the Laplacian Centrality. Dark grey nodes are those affected by the addition of edges. The incremental version shows in light grey the nodes whose centralities need to be recalculated due to their neighborhood with affected nodes. In comparison to batch, only 4 out of 7 nodes required centralities to be computed in the incremental version.

Fig. 4. Community leaders algorithm steps: (a) original network, (b) initial communities, (c) step 1 of 1st iteration, (d) step 2 of 1st iteration, (e) step 1 of 2nd iteration, (f) step 2 of 2nd iteration. Cx represents the community of node x; ly represents the Laplacian centrality of node y; lCz the Laplacian centrality of community Cz. Steps and iterations follow the same flow defined by Cordeiro et al. [6]. [Node centralities shown in the figure: l1 = 16, l2 = 16, l3 = 26, l4 = 16, l5 = 26, l6 = 16, l7 = 14.]

3.3 Combining Centrality and Dynamic Community Detection
Our proposed method has a twofold enhancement:

Representative Nodes Chosen W.r.t. Centrality: The Blondel et al. [3] algorithm is nondeterministic and produces a solution for the communities that is very unstable between separate runs on the same snapshot. This issue is attenuated in [6], which produces more stable communities by using the concept of local modularity. Nevertheless, it still has one of the major drawbacks of the method: community representative nodes are randomly chosen and have no significance among the other community node peers. This means that, apart from their hierarchical structure, higher-level aggregated networks are meaningless. In short, topmost nodes are not the representatives of the community and often change from iteration to iteration.

Intracommunity and Intercommunity Centralities: Both the Blondel et al. [3] and the Cordeiro et al. [6] methods were designed primarily to perform efficient community detection on large networks. Neither of them uses node-level or actor-level measures to quantify the importance of a node or community over its peers. By combining the dynamic community detection algorithm [6] with the dynamic Laplacian Centrality [7] method, we enable the disclosure of both intercommunity and intracommunity centralities. By intercommunity centrality we mean finding the most central or influential community or communities in the network. By intracommunity centralities we mean finding the most central or influential nodes belonging to a community.

Fig. 5. Community leadership hierarchies: (a) original network, (b) step 1 of 2nd iteration.

Using the toy network example of Fig. 4 and the pseudocode of Algorithm 1, the complete method is now described. Immediately after the initial partition (Line 6), Laplacian centralities are computed for every node in the network (Line 7, Fig. 4b). Then, while the maximum modularity is not reached, or while edges/nodes are added to or removed from the network (Line 10), a new community detection step (step 1) is performed (Line 11, Fig. 4c). Representative nodes for each community are chosen as the nodes with the highest Laplacian Centrality in the community (Line 12, Fig. 4c). In the following step (step 2), the newly generated aggregated network of communities (Line 16) already includes the Laplacian Centralities computed for nodes that changed community (Line 17, Fig. 4d) and for nodes affected by the addition or removal of edges (Line 21 and Line 29, respectively). A new 2-step iteration is performed and shown in Fig. 4e and f. The method also provides ways to perform a hierarchical analysis of the importance of communities. Detailed observation of Fig. 4f shows that the network has two major communities (C5 and C3). The non-normalized Laplacian Centrality values obtained reveal C5 as the most important community in the network (lC5 = 184 vs lC3 = 119). Moving backwards in the hierarchy, it can also be observed that C5 is composed of two submodules or subcommunities: C5 and C4. C3 is composed of 3 nodes ({1, 2, 3}). Centrality ranks for isolated communities can also be observed in Fig. 4c: node 3 is the most central within community C3. By assuming centrality as a measure of community leadership, we can visualize the network of Fig. 4 in a hierarchical way in Fig. 5b.
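The node-level part of this walk-through can be reproduced with a short Python sketch. For an unweighted graph, the drop in Laplacian energy caused by removing a node v reduces to d_v² + d_v + 2·Σ d_u over the neighbors u of v, which uses only local degrees. The edge list below is an assumption: it is not given in the text and was chosen because it reproduces the node centralities printed in Fig. 4. The function names and the tie-breaking rule are likewise hypothetical, and this is not the authors' implementation.

```python
from collections import defaultdict

# Hypothetical edge list (NOT given in the text): chosen so that the computed
# values match the node centralities l1..l7 printed in Fig. 4.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 5), (4, 5), (5, 6), (4, 7), (6, 7)]


def build_adjacency(edges):
    """Undirected, unweighted adjacency sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def laplacian_centrality(adj, v):
    """Non-normalized Laplacian centrality of v in an unweighted graph:
    d_v^2 + d_v + 2 * (sum of the degrees of v's neighbors)."""
    d = len(adj[v])
    return d * d + d + 2 * sum(len(adj[u]) for u in adj[v])


def community_leaders(adj, communities):
    """Pick, per community, the member with the highest centrality.
    Ties are broken by the smallest node id (an arbitrary choice made here)."""
    return {
        name: max(members, key=lambda v: (laplacian_centrality(adj, v), -v))
        for name, members in communities.items()
    }


adj = build_adjacency(EDGES)
centrality = {v: laplacian_centrality(adj, v) for v in sorted(adj)}

# Top-level communities as described for Fig. 4f: C3 = {1, 2, 3}, C5 = the rest.
communities = {"C3": [1, 2, 3], "C5": [4, 5, 6, 7]}
leaders = community_leaders(adj, communities)

print(centrality)  # {1: 16, 2: 16, 3: 26, 4: 16, 5: 26, 6: 16, 7: 14}
print(leaders)     # {'C3': 3, 'C5': 5}
```

Under these assumptions node 3 leads C3 and node 5 leads C5, which agrees with the community labels used in Fig. 4. The community-level values (lC3, lC5) depend on the weighted aggregated network and are not reproduced here.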
4 Results
Figures 1 and 2 show the effectiveness of the method in identifying and ranking community leaders in static networks. For evolving social networks, a visual empirical
Fig. 6. Leadership hierarchies for the Zachary karate club dataset: (a) increments t = 0 .. t = 3; (b) community leadership analysis.
Fig. 7. Leadership hierarchies for the Jure Leskovec and Andrews Ng temporal collaboration network: (a) increments t = 0 .. t = 3; (b) community leadership analysis.
analysis was performed in an incremental network setting using two distinct datasets: Fig. 6 shows the Zachary karate club dataset divided into 4 snapshots containing an equal number of randomly chosen edges; Fig. 7 shows the temporal collaboration network of Jure Leskovec and Andrews Ng [5]. The latter dataset partitions the 20-year co-authorship of both authors into four 5-year intervals. In these result figures, the left side shows the graphs obtained by applying the method directly to each of the increments (from t = 0 to t = 3). The vertical stack of graphs represents the levels of the hierarchical community detection algorithm. On the right side, the hierarchy of each aggregated
Figure 6 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Karate.html.
Figure 7 online here: https://mmfcordeiro.github.io/LaplaceLouvainResults/Jure.html.
Algorithm 1. Dynamic Community Leaders Algorithm
 1: V ← {u1, u2, .., uv}, E ← {(i1, j1), (i2, j2), .., (ie, je)}
 2: A ← array{(i1, j1), .., (im, jm)}
 3: R ← array{(i1, j1), .., (in, jn)}
 4: procedure Main(G ← (V, E), A, R)
 5:   Cll ← {C1, C2, .., Cn}, Cul ← {}, Caux ← Cll
 6:   InitPartition(Caux)
 7:   Lcent ← LapCent(Cll)    ▷ initial centralities
 8:   mod ← Modularity(Caux), old_mod ← 0
 9:   m ← 1, n ← 1
10:   while (mod ≥ old_mod ∨ m ≤ A ∨ v ≤ R) do
11:     Caux ← OneLevel(Caux)
12:     Caux ← RepresentativeNodesByMaxCentrality(Caux, Lcent)
13:     n, c ← CommunityChangedNodes(Cll, Caux)
14:     Cll ← UpdateCommunities(Cll, n, c)
15:     old_mod ← mod, mod ← Modularity(Cll)
16:     Cul ← PartitionToGraph(Cll)
17:     Lcent ← LapCentOnNodes(Cul, n, c)    ▷ centralities in new network
18:     if m ≤ A then
19:       src, dest ← A[m]
20:       anodes ← AffectedByAddition(src, dest, Cll)
21:       Lcent ← LapCentOnNodes(Cll, anodes)    ▷ affected by addition
22:       Cll ← AddEdge(src, dest, Cll)
23:       Cll ← DisbandCommunities(Cll, anodes)
24:       Cul ← SyncCommunities(Cll, Cul, anodes)
25:     end if
26:     if n ≤ R then
27:       src, dest ← R[n]
28:       anodes ← AffectedByRemoval(src, dest, Cll)
29:       Lcent ← LapCentOnNodes(Cll, anodes)    ▷ affected by removal
30:       Cll ← RemoveEdge(src, dest, Cll)
31:       Cll ← DisbandCommunities(Cll, anodes)
32:       Cul ← SyncCommunities(Cll, Cul, anodes)
33:     end if
34:     Caux ← Cul, m ← m + 1, n ← n + 1
35:   end while
36: end procedure
community is presented in isolation for the last increment (t = 3). For readability, the size of each node reflects the normalized centrality of the node or aggregated community, newly added edges are drawn dashed, and colours are kept consistent across plots for nodes belonging to the same community. In Fig. 6a, at the first increment (t = 0), nodes 0 (C0) and 33 (C33) are visibly important nodes in the network. In Fig. 6a (t = 0), Layer 4, they represent the two highest-centrality communities in the network. In Fig. 6a (t = 1), node 33 (C33) loses its importance to node 30 (C30), and regains it in Fig. 6a (t = 2). The final community hierarchies in Fig. 6b show that the algorithm using locality modularity maximization found four final communities instead of the expected two (the “John A” group and the “Mr. Hi” group), although the centrality values of C31 and C6 are very low. In Fig. 7a, in the first two increments (t = 0 and t = 1), which represent the first 10 years of collaboration, the Andrews Ng node (CAndrewsNg) is the most important, with a few other small and less important communities around it. Note that Jure Leskovec led an initial community in Fig. 7a (t = 1). In Fig. 7a (t = 3), Jure Leskovec emerges as the most important community, followed by Andrews Ng. An interesting observation is that both communities are separated by the Christopher Potts community, known for being the only author
with collaborations with both. Figure 7b shows the four most important community hierarchies and their respective normalized centralities: CJureLeskovec with 0.775, CAndrewsNg with 0.643, and two Andrews Ng related communities: CQuocLe with 0.083 and CAdamCoates with 0.053.
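The incremental setting used for the karate club experiment (edges split into snapshots of equal size, chosen at random) can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the snapshots are taken to be cumulative (each increment adds its batch of edges to the previous graph), and when the number of edges is not divisible by the number of snapshots the first batches receive one extra edge.

```python
import random


def cumulative_snapshots(edges, k, seed=0):
    """Shuffle the edges, cut them into k near-equal batches, and return the
    cumulative edge set visible at each increment t = 0 .. k-1."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    base, extra = divmod(len(shuffled), k)
    snapshots, seen, start = [], set(), 0
    for t in range(k):
        # The first `extra` batches absorb the remainder, one edge each.
        size = base + (1 if t < extra else 0)
        seen |= set(shuffled[start:start + size])
        snapshots.append(set(seen))
        start += size
    return snapshots


# Toy usage: 8 edges of a path graph split into 4 increments.
edges = [(i, i + 1) for i in range(8)]
snaps = cumulative_snapshots(edges, 4)
```

Each element of `snaps` could then be passed to the community detection step as the graph observed at that increment.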
5 Conclusions
In evolving networks, there is a clear gap between traditional community detection methods and the task of ranking and tracking community leaders. The proposed method aims to bridge this gap by combining two techniques with proven results in their respective domains: a dynamic hierarchical community detection method and a dynamic Laplacian centrality method. By applying the proposed method to a human social network and a temporal collaboration network, the empirical analysis and the results obtained show that the method is innovative and promising with respect to future research. In both cases, the method was able to unveil, track, and measure the importance of the main actors and of the intra- and inter-community structural hierarchies using a centrality measure.
References
1. Aggarwal, C., Subbian, K.: Evolutionary network analysis: a survey. ACM Comput. Surv. (CSUR) 47(1), 1–36 (2014)
2. Bahmani, B., Chowdhury, A., Goel, A.: Fast incremental and personalized PageRank. Proc. VLDB Endow. 4(3), 173–184 (2010)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theor. Exp. 2008(10), P10008 (2008)
4. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25, 163–177 (2001)
5. Chen, P.Y., Hero, A.O.: Multilayer spectral graph clustering via convex layer aggregation: theory and algorithms. IEEE Trans. Sig. Inf. Process. Netw. 3, 553–567 (2017)
6. Cordeiro, M., Sarmento, R., Gama, J.: Dynamic community detection in evolving networks using locality modularity optimization. Soc. Netw. Anal. Min. 6(1), 15 (2016)
7. Cordeiro, M., Sarmento, R.P., Brazdil, P., Gama, J.: Dynamic laplace: efficient centrality measure for weighted or unweighted evolving networks. CoRR abs/1808.02960 (2018)
8. Cordeiro, M., Sarmento, R.P., Brazdil, P., Gama, J.: Evolving networks and social network analysis methods and techniques. In: Višňovský, J., Radošinská, J. (eds.) Social Media and Journalism, chap. 7. IntechOpen, Rijeka (2018)
9. Desikan, P., Pathak, N., Srivastava, J., Kumar, V.: Incremental page rank computation on evolving graphs. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, WWW 2005, pp. 1094–1095. ACM, New York (2005)
10. Fortunato, S.: Community detection in graphs, June 2009
11. Hollocou, A., Maudet, J., Bonald, T., Lelarge, M.: A linear streaming algorithm for community detection in very large networks. CoRR abs/1703.02955 (2017)
12. Kas, M., Wachs, M., Carley, K.M., Carley, L.R.: Incremental algorithm for updating betweenness centrality in dynamically growing networks. In: 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013), pp. 33–40, August 2013
13. Kas, M., Carley, K.M., Carley, L.R.: Incremental closeness centrality for dynamically changing social networks. In: Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2013, pp. 1250–1258. ACM, New York (2013)
14. Khorasgani, R.R., Chen, J., Zaiane, O.R.: Top leaders community detection approach in information networks. In: Proceedings of the 4th Workshop on Social Network Mining and Analysis (2010)
15. Kim, K.S., Choi, Y.S.: Incremental iteration method for fast PageRank computation. In: Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, IMCOM 2015, pp. 80:1–80:5. ACM, New York (2015)
16. Lancichinetti, A., Fortunato, S.: Consensus clustering in complex networks. Sci. Rep. 2, 336 (2012)
17. Leung, I.X., Hui, P., Liò, P., Crowcroft, J.: Towards real-time community detection in large networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 79, 066107 (2009)
18. Li, J., Wang, X., Deng, K., Yang, X., Sellis, T., Yu, J.X.: Most influential community search over large social networks. In: Proceedings - International Conference on Data Engineering (2017)
19. Li, R.H., Qin, L., Ye, F., Yu, J.X., Xiaokui, X., Xiao, N., Zheng, Z.: Skyline community search in multi-valued networks. In: Proceedings of the ACM SIGMOD International Conference on Management of Data (2018)
20. Li, R.H., Qin, L., Yu, J.X., Mao, R.: Influential community search in large networks. Proc. VLDB Endowment 8, 509–520 (2015)
21. Li, R.H., Qin, L., Yu, J.X., Mao, R.: Finding influential communities in massive networks. VLDB J. 26, 751–776 (2017)
22. Nasre, M., Pontecorvi, M., Ramachandran, V.: Betweenness centrality - incremental and faster. CoRR abs/1311.2147 (2013)
23. Nguyen, N.P., Dinh, T.N., Tokala, S., Thai, M.T.: Overlapping communities in dynamic networks: their detection and mobile applications. In: Proceedings of the Annual International Conference on Mobile Computing and Networking, MOBICOM (2011)
24. Nguyen, N.P., Dinh, T.N., Xuan, Y., Thai, M.T.: Adaptive algorithms for detecting community structure in dynamic social networks. In: INFOCOM, pp. 2282–2290. IEEE (2011)
25. Oliveira, M.D.B., Gama, J.: An overview of social network analysis. Wiley Interdisc. Rev. Data Min. Knowl. Discov. 2(2), 99–115 (2012)
26. Palla, G., Barabási, A.L., Vicsek, T.: Quantifying social group evolution. Nature 446(7136), 664–667 (2007)
27. Qi, X., Duval, R.D., Christensen, K., Fuller, E., Spahiu, A., Wu, Q., Wu, Y., Tang, W., Zhang, C.: Terrorist networks, network energy and node removal: a new measure of centrality based on laplacian energy. Soc. Netw. 02(01), 19–31 (2013)
28. Qi, X., Fuller, E., Wu, Q., Wu, Y., Zhang, C.Q.: Laplacian centrality: a new centrality measure for weighted networks. Inf. Sci. 194, 240–253 (2012)
29. Raghavan, U.N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 76, 036106 (2007)
30. Sarıyüce, A.E., Kaya, K., Saule, E., Çatalyürek, Ü.V.: Incremental algorithms for closeness centrality. In: Proceedings - 2013 IEEE International Conference on Big Data, Big Data 2013, pp. 487–492 (2013)
31. Sarıyüce, A.E., Gedik, B., Jacques-Silva, G., Wu, K.L., Çatalyürek, Ü.V.: SONIC: streaming overlapping community detection. Data Min. Knowl. Discov. 30, 819–847 (2016)
32. Shah, D., Zaman, T.: Community detection in networks: the leader-follower algorithm. Sort 1050, 2 (2010)
33. Shang, J., Liu, L., Xie, F., Chen, Z., Miao, J., Fang, X., Wu, C.: A real-time detecting algorithm for tracking community structure of dynamic networks. In: 2012 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining Workshops, SNAKDD, vol. 12 (2012)
34. Sun, H., Du, H., Huang, J., Li, Y., Sun, Z., He, L., Jia, X., Zhao, Z.: Leader-aware community detection in complex networks. Knowl. Inf. Syst. 1–30 (2019)
35. Wang, C.D., Lai, J.H., Yu, P.S.: Dynamic community detection in weighted graph streams. In: Proceedings of the 2013 SIAM International Conference on Data Mining, SDM 2013 (2013)
36. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput. Surv. 45, 43 (2013)
37. Yakoubi, Z., Kanawati, R.: LICOD: a leader-driven algorithm for community detection in complex networks. Vietnam J. Comput. Sci. 1, 241–256 (2014)
38. Yun, S.Y., Lelarge, M., Proutiere, A.: Streaming, memory limited algorithms for community detection. In: Advances in Neural Information Processing Systems (2014)
39. Zhang, X., Zhu, J., Wang, Q., Zhao, H.: Identifying influential nodes in complex networks with community structure. Knowl.-Based Syst. 42, 74–84 (2013)
40. Zhao, Z., Wang, X., Zhang, W., Zhu, Z.: A community-based approach to identifying influential spreaders. Entropy 17, 2228–2252 (2015)
Change Point Detection in a Dynamic Stochastic Blockmodel
Peter Wills and François G. Meyer
Department of Applied Mathematics, University of Colorado at Boulder, Boulder, CO 80309, USA
[email protected]
Abstract. We study a change point detection scenario for a dynamic community graph model, which is formed by adding new vertices and randomly attaching them to the existing nodes. The goal of this work is to design a test statistic to detect the merging of communities without solving the problem of identifying the communities. We propose a test that can ascertain when the connectivity between the balanced communities is changing. In addition to the theoretical analysis of the test statistic, we perform Monte Carlo simulations of the dynamic stochastic blockmodel to demonstrate that our test can detect changes in graph topology, and we study a dynamic social-contact graph.
Keywords: Change detection · Dynamic stochastic blockmodel · Graph distance
1 Introduction
Some of the most well-known empirical network datasets reflect social connective structure between individuals, often in online social network platforms such as Facebook and Twitter. These networks exhibit structural features such as communities and highly connected vertices, and can undergo significant structural changes as they evolve in time. Examples of such structural changes include the merging of communities, or the emergence of a single user as a connective hub between disparate regions of the graph. The main contribution of this work is a rigorous analysis of a dynamic community graph model, which we call the dynamic stochastic blockmodel. Models of dynamic community networks have recently been proposed. The simplest incarnation of such models, the dynamic stochastic blockmodel, is the subject of our study. This model is formed by adding new vertices, and randomly attaching them to the existing nodes. We circumvent the problem of decomposing each graph into communities, and propose instead a test that can ascertain when the connectivity between the balanced communities is changing. Because the evolution of the graph is stochastic, one expects random fluctuations of the graph topology.
© Springer Nature Switzerland AG 2020
H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 211–222, 2020.
https://doi.org/10.1007/978-3-030-36687-2_18
We propose a hypothesis test to detect the abnormal growth of the balanced stochastic blockmodel. The stochastic blockmodel represents the quintessential exemplar of a network with community structure. In fact, it is shown in [28] that any sufficiently large graph behaves approximately like a stochastic blockmodel. This model is also amenable to a rigorous mathematical analysis, and is indeed at the cutting edge of rigorous probabilistic analysis of random graphs [1].
2 Graph Models
We recall the definition of the two-community stochastic blockmodel [1].
Definition 1. Let n ∈ N, and let p, q ∈ [0, 1]. We denote by SBM(n, p, q) the probability space formed by the graphs defined on the set of vertices [n], constructed as follows. We split the vertices [n] into two communities C1 and C2, formed by the odd and the even integers in [n], respectively. We denote by n1 = ⌊(n + 1)/2⌋ and n2 = ⌊n/2⌋ the sizes of C1 and C2, respectively. Edges within each community are drawn randomly from independent Bernoulli random variables with probability p. Edges between communities are drawn randomly from independent Bernoulli random variables with probability q.
2.1 The Dynamic Stochastic Blockmodel
Several dynamic stochastic blockmodels have been proposed in recent years (e.g., [10,11,18,22,27,29,31], and references therein). Existing dynamic stochastic blockmodels assume that the number of nodes is fixed, and that community membership is random. Some authors propose a Markovian model for the community membership [30,31], while others assume that the sequence of graphs consists of independent realizations in time [5]. Our work is more similar to that of [4], where the authors study changes in the dynamics of a preferential attachment model, the size of which grows as a function of time. Similarly, we investigate a growing stochastic blockmodel, and we are interested in the regime of large graphs (n → ∞), where the probabilities of connection, within each community, pn, and across communities, qn, go to zero as the size of the graph, n, goes to infinity. In order to guarantee that at each time n we study the growth of a graph ∼ SBM(n, pn, qn), we cannot simply assume that the graphs G1 = (V1, E1), . . . , Gn = (Vn, En) form a sequence of nested subgraphs, where we would have V1 ⊂ · · · ⊂ Vn and E1 ⊂ · · · ⊂ En. Instead, our study focuses on the transition between a random realization (Vn, En) ∼ SBM(n, pn, qn) and the graph formed by adding a new node n + 1 and random edges to (Vn, En). Formally, the dynamic stochastic blockmodel is defined recursively (see Table 1). G1 is formed by a single vertex. We assume that we have constructed G1, . . . , Gn and we proceed with the construction of Gn+1. First, we replace Gn by a graph Hn ∼ SBM(n, pn, qn), and we consider the graph formed by adding a new node n + 1 (assigned to either C1 or C2 according to the parity of n), and we define
Fig. 1. Left: the stochastic blockmodel Hn = (Vn , En ) is comprised of two communities C1 (red) and C2 (blue). A new vertex (green) is added to Vn , and random edges are created between n + 1 and vertices in Vn . This leads to a new set of edges, En+1 , and the corresponding new graph Gn+1 deﬁned by (2)
$$V_{n+1} := V_n \cup \{n+1\}. \tag{1}$$
Random edges are then assigned from n + 1 to each vertex in the same community with probability pn and to each vertex of the opposite community with probability qn. This leads to a new set of edges, En+1, and the corresponding graph (see Fig. 1),
$$G_{n+1} := (V_{n+1}, E_{n+1}). \tag{2}$$
We note that Hn is different from Gn; indeed, Gn was created by adding a node and some edges to the graph Hn−1 ∼ SBM(n − 1, pn−1, qn−1), whereas Hn is a realization of SBM(n, pn, qn). Table 1 summarizes the construction of the sequence G1, G2, . . .
Table 1. Row n depicts the construction of Gn+1 as a function of the random seed graph Hn. The distance (last column) is always defined with respect to the seed graph Hn on the vertices 1, . . . , n that led to the construction of Gn+1.
Time index n | Probabilities of connection | Hn = seed graph at time n | Growth sequence to generate Gn+1 | Definition of the graph distance
0 |  | ∅ | ∅ → G1 | 0
1 | {p1, q1} | H1 ∼ SBM(1, p1, q1) | H1 → G2 | drp(H1, G2)
2 | {p2, q2} | H2 ∼ SBM(2, p2, q2) | H2 → G3 | drp(H2, G3)
⋮ | ⋮ | ⋮ | ⋮ | ⋮
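One row of Table 1 can be simulated with a short sketch: draw a seed graph Hn from SBM(n, p, q) as in Definition 1, then attach the new vertex n + 1 at random to obtain Gn+1. The function names, the representation of a graph as a set of (u, v) pairs with u < v, and the use of Python's `random` module are illustrative assumptions, not part of the original construction.

```python
import random


def sample_sbm(n, p, q, rng):
    """Draw an edge set from SBM(n, p, q) on vertices 1..n.

    Odd vertices form C1 and even vertices form C2, as in Definition 1.
    """
    edges = set()
    for u in range(1, n + 1):
        for v in range(u + 1, n + 1):
            same_community = (u % 2) == (v % 2)
            if rng.random() < (p if same_community else q):
                edges.add((u, v))
    return edges


def grow(n, seed_edges, p, q, rng):
    """Add vertex n + 1 to a seed graph on 1..n and attach it at random.

    The new vertex joins C1 or C2 according to its parity; it links to each
    vertex of its own community with probability p and to each vertex of the
    other community with probability q.
    """
    new = n + 1
    grown = set(seed_edges)
    for u in range(1, n + 1):
        same_community = (u % 2) == (new % 2)
        if rng.random() < (p if same_community else q):
            grown.add((u, new))
    return grown


rng = random.Random(0)
# Degenerate probabilities make the outcome deterministic and easy to check.
h4 = sample_sbm(4, 1.0, 0.0, rng)      # two disjoint within-community edges
g5 = grow(4, h4, 1.0, 0.0, rng)        # vertex 5 (odd) attaches to 1 and 3
```

Repeating `sample_sbm` followed by `grow` for n = 1, 2, ... reproduces the recursion of Table 1, in which the seed Hn is redrawn at every step rather than inherited from Gn.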
We conclude this section with the definition of the expected degrees and the number of across-community edges, kn.
Definition 2 (Degrees and number of across-community edges). Let G ∼ SBM(n, p, q). We denote by dn1 = p n1 the expected degree within community C1, and by dn2 = p n2 the expected degree within community C2. We denote by kn the binomial random variable that counts the number of cross-community edges between C1 and C2. Because asymptotically n1 ∼ n2, we ignore the dependency of the expected degree on the specific community when computing asymptotic behaviors for large n. More precisely, we loosely write 1/dn when either 1/dn1 or 1/dn2 could be used.
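As a short worked step (added here for convenience, using only Definition 1): each of the n1 n2 possible cross-community pairs is an independent Bernoulli trial with success probability q, so the first two moments of kn follow directly.

```latex
k_n \sim \mathrm{Binomial}(n_1 n_2,\; q), \qquad
\mathbb{E}[k_n] = q\, n_1 n_2, \qquad
\operatorname{Var}(k_n) = q(1-q)\, n_1 n_2 .
```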
3 The Resistance Perturbation Distance
In order to study the dynamic evolution of the graph sequence, we focus on changes between two successive time steps n and n + 1. These changes are formulated in terms of changes in connectivity between Gn+1 and the seed graph Hn (see Table 1). To construct the statistic that can detect the merging of communities without identifying the communities, we use the resistance perturbation distance [17]. This graph distance can be tuned to quantify configurational changes that occur on a graph at different scales: from the local scale formed by the local neighbors of each vertex, to the largest scale that quantifies the connections between clusters, or communities [17] (see [2,6] for recent surveys on graph distances, and [19] for a distance similar to the resistance perturbation distance).
The Effective Resistance. For the sake of completeness, we review the concept of effective resistance (e.g., [7–9,12]). Given a graph G = (V, E), we transform G into a resistor network by replacing each edge e by a resistor with conductance we (i.e., with resistance 1/we). The effective resistance between two vertices u and v in V is defined as the voltage applied between u and v that is required to maintain a unit current through the terminals formed by u and v. To simplify the discussion, we will only consider graphs that are connected with high probability. All the results can be extended to disconnected graphs as explained in [17].
Definition 3 (The resistance perturbation distance). Let G(1) = (V, E(1)) and G(2) = (V, E(2)) be two graphs defined on the same vertex set V. Let R(1) and R(2) denote the effective resistances of G(1) and G(2), respectively. We define the resistance-perturbation distance to be
$$d_{rp}\left(G^{(1)}, G^{(2)}\right) := \sum_{u \in V} \sum_{v \in V,\, v \neq u} \left| R^{(1)}_{uv} - R^{(2)}_{uv} \right|. \tag{3}$$
The resistance-perturbation distance cannot be used to compare graphs defined on different vertex sets, V(1) and V(2). If V(1) and V(2) share many nodes, then we can compute the restriction of the perturbation distance on the intersection V(1) ∩ V(2). In the following we compare two graphs Hn and Gn+1 that share all the nodes but one newly added node. We therefore extend the definition of the perturbation distance as follows.
Definition 4 (Extension of the resistance perturbation distance). Let Hn = (Vn, En) ∼ SBM(n, pn, qn), and let Gn+1 = (Vn+1, En+1) be defined by (2). We define the resistance-perturbation distance between Hn and Gn+1 as follows:
$$d_{rp}(H_n, G_{n+1}) := \sum_{u \in V_n} \sum_{v \in V_n,\, v \neq u} \left| R^{(1)}_{uv} - R^{(2)}_{uv} \right|, \tag{4}$$
where R(1) and R(2) denote the effective resistances in Hn and Gn+1, respectively.
Because n + 1 did not exist at time n, it is not meaningful to compute the effective resistance between v and n + 1 in Hn, for v ∈ Vn. Therefore we only compute the effective resistances for u, v ∈ Vn in (4). In the remainder of the paper, we use the notation drp to denote the extended resistance perturbation distance defined by (4).
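Definitions 3 and 4 can be made concrete with a small, dependency-free sketch. It assumes unweighted, connected graphs with vertices labelled 0..n−1 (the new vertex is labelled n), and it obtains effective resistances by grounding one vertex and inverting the reduced Laplacian with Gauss-Jordan elimination, which is a standard route but not necessarily the one used by the authors. The function names are hypothetical. Note that the resistances of Gn+1 are computed on the full graph, including paths through the new vertex, and only the sum in (4) is restricted to Vn.

```python
def effective_resistances(n, edges):
    """Return the matrix R[u][v] of effective resistances of a connected,
    unweighted graph on vertices 0..n-1."""
    # Graph Laplacian L = D - A.
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    # Ground the last vertex and invert the reduced (n-1) x (n-1) Laplacian.
    m = n - 1
    aug = [lap[i][:m] + [1.0 if i == j else 0.0 for j in range(m)]
           for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("graph is not connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(m):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    inv = [row[m:] for row in aug]

    def g(i, j):
        # Entries involving the grounded vertex are zero.
        return inv[i][j] if i < m and j < m else 0.0

    return [[g(u, u) + g(v, v) - 2.0 * g(u, v) for v in range(n)]
            for u in range(n)]


def rp_distance(n, edges_h, edges_g):
    """Extended distance (4): H lives on 0..n-1, G on 0..n (vertex n is new)."""
    r_h = effective_resistances(n, edges_h)
    r_g = effective_resistances(n + 1, edges_g)
    return sum(abs(r_h[u][v] - r_g[u][v])
               for u in range(n) for v in range(n) if u != v)


# Toy example: H is the path 0-1-2; G adds vertex 3 linked to 0 and 2,
# which closes a 4-cycle and lowers every resistance between old vertices.
path = [(0, 1), (1, 2)]
cycle = path + [(0, 3), (2, 3)]
d = rp_distance(3, path, cycle)
```

For graphs of realistic size a sparse linear solver or a pseudo-inverse of the Laplacian would be preferable; the dense elimination above is only meant to make the definition explicit.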
4 Main Result
Figure 1 illustrates the statement of the problem. As a new vertex (shown in green) is added to the graph Hn, the connectivity between the communities can increase, if edges are added between C1 and C2, or the communities can remain separated if no across-community edges are created. If the addition of the new vertex promotes the merging of C1 and C2, then we consider the new graph Gn+1 to be structurally different from Gn; otherwise Gn+1 remains structurally the same as Gn (see Fig. 1). As explained in Theorem 1, the resistance perturbation distance between time n and n + 1 (see Table 1), measured by drp(Hn, Gn+1) (defined by (4)), is able to distinguish between connectivity changes within a community and changes across communities.
Theorem 1 (The statistic under the null and alternative hypotheses). Let Hn = (Vn, En) ∼ SBM(n, pn, qn) with pn = ω(log n/n), and pn/n < qn < (pn/n)^{3/4}. Let Gn+1 be the graph generated according to the dynamic stochastic blockmodel described by (2) and Table 1. To test the hypothesis kn+1 = kn (null hypothesis) versus kn+1 > kn (alternative hypothesis) we use the statistic Zn defined by
$$Z_n := \frac{p_n}{4}\, \mathbb{E}\left[ d_{rp}(H_n, G_{n+1}) \right] - 1. \tag{5}$$
The expected value of the statistic E[Zn] is given by
$$\mathbb{E}[Z_n] = \begin{cases} O\left(\dfrac{1}{\sqrt{d_n}}\right), & \text{conditioned on } k_{n+1} = k_n \text{ (null)}, \\[2ex] \dfrac{2p_n}{n^2 q_n^2} + O\left(\dfrac{1}{\sqrt{d_n}}\right), & \text{conditioned on } k_{n+1} > k_n \text{ (alternative)}. \end{cases} \tag{6}$$
The theoretical analysis of the dynamic stochastic blockmodel SBM(n, pn, qn), provided by Theorem 1, reveals that if one could ignore the within-community random connectivity changes, which have size O(1/√dn), then one should always be able to detect the addition of across-community edges using the global metric provided by the test statistic Zn. The condition qn < (pn/n)^{3/4} therefore guarantees that within-community connectivity changes do not obfuscate across-community connectivity changes triggered by the increase in across-community edges. Without loss of generality, we consider that the new node n + 1 is added to C2 (C1 and C2 play symmetric roles). The main result relies on the following two ingredients.
1. The community C2 is approximately an Erdős–Rényi graph SBM(n2, pn), wherein the effective resistance Ruv concentrates around 2/(n2 pn) [23];
2. the effective resistance between u ∈ C1 and v ∈ C2 depends mostly on the bottleneck formed by the kn across-community edges, Ruv ≈ 1/kn [14,15], and the number of across-community edges, kn, concentrates around qn n1 n2.
Under the null hypothesis, about n2 pn nodes in C2 will become incident to the new edges created by the addition of node n + 1. For each of these nodes, the new degree du becomes du + 1 w.h.p., and therefore R′uv = Ruv − 1/du², for all v ∈ C2. By symmetry, R′uv = Ruv − 1/dv², for all u ∈ C2, and the total perturbation for u ∈ C2, v ∈ C2 is ≈ 2 n2 · n2 pn/dn² = 2/pn. We derive the same estimate for the perturbation Ruv − R′uv for u ∈ C1, v ∈ C2 or u ∈ C2, v ∈ C1. We conclude that drp ≈ 4/pn under the null hypothesis. Under the alternative hypothesis kn+1 = kn + 1 w.h.p., and thus ΔRuv ≈ −1/kn² ≈ −1/(n1 n2 qn)². This perturbation affects every pair of nodes (u, v) where u ∈ C1 and v ∈ C2, therefore drp ≈ 2/(n1 n2 qn²) = 8/(n qn)². There is an additional term of order O(1/pn) that accounts for the changes in effective resistance within C2 (the community wherein node n + 1 is added). In order to estimate the noise term, O(1/√dn), we need to construct estimates of the effective resistance that are more precise than those that can be found elsewhere (e.g., [23], but see [21] for estimates similar to ours, obtained with different techniques). The full rigorous proof of Theorem 1 is provided in the supplementary material [25]; we give the key steps in the following.
Proof (Proof of Theorem 1). The proof proceeds in two steps: we first analyze the null hypothesis, and then the alternative hypothesis. Due to space limitation
we only present the alternative case, kn+1 > kn. We have
$$\mathbb{E}\left(d_{rp}(H_n, G_{n+1})\right) = \sum_{u \in C_1} \sum_{v \in C_1} \mathbb{E}\left(R_{uv} - R'_{uv}\right) + \sum_{u \in C_2} \sum_{v \in C_2} \mathbb{E}\left(R_{uv} - R'_{uv}\right) + \sum_{u \in C_1} \sum_{v \in C_2} \mathbb{E}\left(R_{uv} - R'_{uv}\right) + \sum_{u \in C_2} \sum_{v \in C_1} \mathbb{E}\left(R_{uv} - R'_{uv}\right). \tag{7}$$
The first sum in (7) is equal to
$$\sum_{u \in C_1} \sum_{v \in C_1} \mathbb{E}\left(R_{uv} - R'_{uv}\right) = \frac{2n_1}{d_n^2} \left( 1 + O\left( \frac{1}{\sqrt{d_n}} \right) \right). \tag{8}$$
Similarly, we have
$$\sum_{u \in C_2} \sum_{v \in C_2} \mathbb{E}\left(R_{uv} - R'_{uv}\right) = n_2 (n_2 - 1)\, 2 p_n \left[ \left( \frac{1}{d_n} \right)^{2} + O\left( \left( \frac{1}{d_n} \right)^{5/2} \right) \right]. \tag{9}$$
The estimates (8) and (9), which quantify the connectivity within both communities, are oblivious to the increase in the across-community connectivity (kn+1 > kn). We need the third and fourth terms in (7), which are significantly affected by the increase in across-community edges, to detect a change in the effective resistance. Indeed, we have
$$\sum_{u \in C_1} \sum_{v \in C_2} \mathbb{E}\left| R_{uv} - R^{k+1}_{uv} \right| = \frac{1}{p_n} \left[ 1 + \frac{p_n}{n_1 n_2 q_n^2} + O\left( \frac{1}{\sqrt{d_n}} \right) \right]. \tag{10}$$
The symmetric case where u ∈ C2 and v ∈ C1 leads to the same exact expression. Finally, we can assemble the expected resistance perturbation distance by combining the terms (8), (9), (10), and we obtain the advertised result. We conclude the proof with the condition on qn that guarantees that Zn can detect the alternative hypothesis. As soon as qn < (pn/n)^{3/4}, the Zn statistic under the alternative hypothesis is larger than the noise term O(1/√dn). The theoretical condition on qn will be confirmed experimentally in the next section. The proofs of (8), (9), and (10) are rather technical and are provided in the supplementary material [25].
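The role of the condition qn < (pn/n)^{3/4} can be checked with a few lines of arithmetic. The sketch below compares the leading term 2pn/(n² qn²) of E[Zn] under the alternative with the scale 1/√dn of the noise term, taking dn ≈ pn n/2 (Definition 2 with n1 ≈ n/2). The function names are hypothetical, the constants hidden in the O(·) terms are ignored, and the values n = 2,048 and pn = log²(n)/n are borrowed from the synthetic experiments, with the natural logarithm assumed.

```python
import math


def alt_signal(n, p, q):
    """Leading term of E[Z_n] under the alternative: 2 p_n / (n^2 q_n^2)."""
    return 2.0 * p / (n * n * q * q)


def noise_scale(n, p):
    """Scale of the noise term, 1 / sqrt(d_n), with d_n taken as p_n * n / 2."""
    return 1.0 / math.sqrt(p * n / 2.0)


def q_threshold(n, p):
    """Upper bound (p_n / n)^(3/4) on q_n from Theorem 1."""
    return (p / n) ** 0.75


n = 2048
p = math.log(n) ** 2 / n            # p_n = log^2(n) / n (natural log assumed)
q_star = q_threshold(n, p)
for q in (q_star / 10, q_star, 10 * q_star):
    ratio = alt_signal(n, p, q) / noise_scale(n, p)
    print(f"q / q* = {q / q_star:5.1f}   signal / noise scale = {ratio:.3g}")
```

At the threshold the two quantities are of the same order (their ratio is a constant independent of n and pn), the signal dominates below it, and it is buried in the noise above it, which is the regime change described in the proof.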
5 Experiments
Synthetic Experiments. Figure 2 shows numerical evidence supporting Theorem 1. The experiment involves Monte Carlo simulations of the dynamic stochastic blockmodel for 64 random realizations for each qn. The empirical distribution of Zn is computed under the null hypothesis (green line) and the alternative hypothesis (red line). The theoretical estimate given by (6) under the alternative hypothesis is also displayed (blue line). The size of the graph is n = 2,048,
Fig. 2. Statistic Zn defined by (5) computed under the null hypothesis (green line) and the alternative hypothesis (red line) for several values of the inverse across-community edge density, 2pn/(n² qn²). The theoretical estimate of Zn under the alternative hypothesis, given by (6), is displayed as a blue line.
the density of edges is pn = log²(n)/n, and the across-community edge density ranges from qmax = √2 (pn/n)^{3/4} down to qmin = qmax/100. For each value of qn, we display the statistic Zn as a function of the inverse across-community edge density, 2pn/(n² qn²). As the inverse density of across-community edges increases, the statistic Zn can more easily detect the alternative hypothesis. The theoretical analysis, provided by (6), is confirmed: as qn becomes larger than O((pn/n)^{3/4}), the statistic Zn computed under the null and alternative hypotheses merge. The across-community edge density qn becomes too large for the global statistic Zn to “sense” perturbations triggered by connectivity changes between the communities. The expected value E[Zn] = 2pn/(n qn)² under the alternative hypothesis becomes smaller than the noise term O(1/√dn), and the test statistic Zn fails to detect the alternative hypothesis.
Analysis of a Primary School Face to Face Contact. In this section we provide an experimental extension of Theorem 1, wherein there are 10 communities, but the number of nodes, N, is fixed. The across-community and within-community edge densities are rapidly fluctuating as a function of time n. Our goal is to experimentally validate the ability of the resistance-perturbation distance to detect significant structural changes between the communities, while remaining impervious to random changes within each community. The data are part of a study [20] where RFID tags were used to record face-to-face contact between students in a primary school. Events punctuate the school day of the children, and lead to fundamental topological changes in the
Fig. 3. Left to right: snapshots of the face-to-face contact network at 9:00 a.m., 10:20 a.m., 12:45 p.m., and 2:03 p.m.
contact network (see Fig. 3). The school is composed of ten classes: each of the five grades (1 to 5) is divided into two classes (see Fig. 3). Each class forms a community of connected students; classes are weakly connected. During the school day, events such as lunch periods (12:00–1:00 p.m. and 1:00–2:00 p.m.) and recess (10:30–11:00 a.m. and 3:30–4:00 p.m.) trigger significant increases in the number of links between the communities, and disrupt the community structure (see Fig. 3). The construction of the dynamic graphs proceeds as follows. We divide the school day into N = 150 time intervals of Δt ≈ 200 s. We denote by ti = 0, Δt, . . . , (N − 1)Δt the corresponding temporal grid. For each ti we construct an undirected unweighted graph Gti, where the n = 232 nodes correspond to the 232 students in the 10 classes, and an edge is present between two students if they were in contact (according to the RFID tags) during the time interval [ti−1, ti).
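The construction of the snapshot graphs can be sketched as a simple binning of time-stamped contacts. The event format (timestamp, student a, student b), the function name, and the indexing convention (snapshot i collects the contacts whose timestamp falls in [t0 + iΔt, t0 + (i + 1)Δt)) are assumptions made for illustration; the actual RFID data format of [20] may differ.

```python
def contact_snapshots(events, t0, dt, num_intervals):
    """Bin time-stamped contacts into undirected, unweighted snapshot graphs.

    events is an iterable of (timestamp, a, b) triples. Snapshot i holds one
    edge per pair of students that were in contact at least once during
    [t0 + i*dt, t0 + (i+1)*dt).
    """
    snapshots = [set() for _ in range(num_intervals)]
    for t, a, b in events:
        idx = int((t - t0) // dt)
        if 0 <= idx < num_intervals and a != b:
            # Store each undirected edge once, with the smaller label first.
            snapshots[idx].add((min(a, b), max(a, b)))
    return snapshots


# Toy usage: four contacts, 200 s intervals, three snapshots.
events = [(0, 1, 2), (199, 2, 1), (200, 3, 4), (450, 5, 6)]
snaps = contact_snapshots(events, t0=0, dt=200, num_intervals=3)
```

Repeated contacts between the same pair within an interval collapse to a single edge, which matches the unweighted graphs described above.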
Fig. 4. Primary school data set: resistance perturbation distance drp, edit distance DE, and DeltaCon distance DDC.
Fig. 5. Primary school data set: NetSimile DNS and edit distance DE (left); combinatorial Laplacian DL, normalized Laplacian DL, and adjacency DA (right).
The purpose of the analysis is to assess whether distances can detect changes in the topology coupled with the hidden events that control the network connectivity. We also want to verify whether distances are robust against random changes within each classroom that do not affect the communication between the classes. We compare the resistance perturbation distance $d_{rp}$ to the following distances: (1) DeltaCon distance $D_{DC}$ [13], (2) NetSimile distance $D_{NS}$ [3], (3) edit distance $D_E$, (4) three spectral distances: combinatorial Laplacian $D_L$, normalized Laplacian $D_{\mathcal{L}}$, and adjacency $D_A$. The spectral distance between graphs $G$ and $G'$ is the $\ell_2$ norm of the difference between the two spectra, $\{\lambda_i\}$ and $\{\lambda'_i\}$, of the corresponding matrices [26]. For each distance measure $d$, we define a normalized distance contrast

$$\widehat{D}(t_i) = d(G_{t_{i-1}}, G_{t_i}) / \overline{D},$$

where $\overline{D} = N^{-1} \sum_i d(G_{t_{i-1}}, G_{t_i})$. All experiments were conducted using the NetComp library, which can be found on GitHub at [24].

Figure 4 displays the normalized temporal differences for the resistance distance, edit distance, and DeltaCon distance. The stochastic variability in the connectivity appreciably influences the high-frequency (fine-scale) eigenvalues; spectral distances, which are computed using all the eigenvalues, lead to very noisy estimates of the temporal differences (see Fig. 5). NetSimile is also significantly affected by these random fluctuations. The volume of the dynamic network changes rapidly, and the edit distance can reliably monitor these large-scale changes. However, it entirely misses the significant events that disrupt the graph topology: onset and end of morning recess, onset of first lunch, end of second lunch (see Fig. 4). The resistance distance can detect subtle topological changes that are coupled to latent events that dynamically modify the networks, while remaining impervious to random local changes, which do not affect the large-scale connectivity structure (see Fig. 4).
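As a small illustration of the normalized distance contrast, the sketch below computes it for the edit distance on a hypothetical sequence of snapshots. It does not use the NetComp library; the function names are made up for this example, the edit distance is assumed to be the plain count of inserted and deleted edges, and the mean is taken over the consecutive pairs that are available.

```python
def edit_distance(edges_a, edges_b):
    """Number of edges present in exactly one of two graphs on the same node set."""
    return len(edges_a ^ edges_b)


def normalized_contrast(distances):
    """Divide each consecutive-snapshot distance by the mean over the sequence."""
    mean = sum(distances) / len(distances)
    if mean == 0:
        raise ValueError("all distances are zero; the contrast is undefined")
    return [d / mean for d in distances]


# Hypothetical snapshots (assumption), each given as a set of undirected edges.
g = [
    {frozenset((1, 2)), frozenset((2, 3))},
    {frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))},
    {frozenset((1, 4))},
]
raw = [edit_distance(g[i - 1], g[i]) for i in range(1, len(g))]
contrast = normalized_contrast(raw)
```

By construction the contrast has mean one, so values well above one flag time steps where the chosen distance changes more than usual.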
6 Discussion
We note that the condition on $q_n$ in Theorem 1 guarantees that the communities could be recovered using other techniques (e.g., spectral clustering). Our global approach, which does not require the detection of the communities, can be computed efficiently (at a cost that is comparable to fast spectral clustering algorithms). Indeed, we have developed in [17] fast (linear in the number of edges) randomized algorithms that can quickly compute an approximation to the $d_{rp}$ distance (see [16] for the publicly available codes). In the context of streaming graphs, we described in [17] algorithms to compute fast updates of the $d_{rp}$ distance when a small number of edges are added or deleted. We are currently exploring several extensions of the current model. The scenario of the primary school dataset, wherein the graph size is fixed and a latent process controls the addition and deletion of edges, is an important extension of the current model.

Acknowledgements. F.G.M. was supported by the National Science Foundation (CCF/CIF 1815971), and by a Jean d'Alembert Fellowship.
References
1. Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theor. 62(1), 471–487 (2016)
2. Akoglu, L., Tong, H., Koutra, D.: Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov. 29(3), 626–688 (2015). https://doi.org/10.1007/s10618-014-0365-y
3. Berlingerio, M., Koutra, D., Eliassi-Rad, T., Faloutsos, C.: NetSimile: a scalable approach to size-independent network similarity. CoRR abs/1209.2684 (2012). http://dblp.uni-trier.de/db/journals/corr/corr1209.html#abs-1209-2684
4. Bhamidi, S., Jin, J., Nobel, A., et al.: Change point detection in network models: preferential attachment and long range dependence. Ann. Appl. Probab. 28(1), 35–78 (2018)
5. Bhattacharjee, M., Banerjee, M., Michailidis, G.: Change point estimation in a dynamic stochastic block model. arXiv preprint arXiv:1812.03090 (2018)
6. Donnat, C., Holmes, S., et al.: Tracking network dynamics: a survey using graph distances. Ann. Appl. Stat. 12(2), 971–1012 (2018)
7. Doyle, P., Snell, J.: Random walks and electric networks. AMC 10, 12 (1984)
8. Ellens, W., Spieksma, F., Mieghem, P.V., Jamakovic, A., Kooij, R.: Effective graph resistance. Linear Algebra Appl. 435(10), 2491–2506 (2011). http://www.sciencedirect.com/science/article/pii/S0024379511001443
9. Ghosh, A., Boyd, S., Saberi, A.: Minimizing effective resistance of a graph. SIAM Rev. 50(1), 37–66 (2008)
10. Ho, Q., Song, L., Xing, E.P.: Evolving cluster mixed-membership blockmodel for time-varying networks. J. Mach. Learn. Res. 15, 342–350 (2015)
11. Kim, B., Lee, K.H., Xue, L., Niu, X., et al.: A review of dynamic network models with latent variables. Stat. Surv. 12, 105–135 (2018)
12. Klein, D., Randić, M.: Resistance distance. J. Math. Chem. 12(1), 81–95 (1993)
13. Koutra, D., Shah, N., Vogelstein, J.T., Gallagher, B., Faloutsos, C.: DELTACON: principled massive-graph similarity function with attribution. ACM Trans. Knowl. Discov. Data (TKDD) 10(3), 28 (2016)
14. Levin, D.A., Peres, Y., Wilmer, E.L.: Markov Chains and Mixing Times. American Mathematical Soc. (2009)
15. Lyons, R., Peres, Y.: Probability on trees and networks (2005). http://mypage.iu.edu/~rdlyons/
16. Monnig, N.D.: The ResistancePerturbationDistance. https://github.com/natemonnig/ResistancePerturbationDistance (2016)
17. Monnig, N.D., Meyer, F.G.: The resistance perturbation distance: a metric for the analysis of dynamic networks. Discrete Appl. Math. 236, 347–386 (2018). http://www.sciencedirect.com/science/article/pii/S0166218X17304626
18. Peel, L., Clauset, A.: Detecting change points in the large-scale structure of evolving networks. In: AAAI, pp. 2914–2920 (2015)
19. Sricharan, K., Das, K.: Localizing anomalous changes in time-evolving graphs. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1347–1358. ACM (2014)
20. Stehlé, J., Voirin, N., Barrat, A., Cattuto, C., Isella, L., Pinton, J.F., Quaggiotto, M., Van den Broeck, W., Régis, C., Lina, B., et al.: High-resolution measurements of face-to-face contact patterns in a primary school. PLoS One 6(8), e23176 (2011)
21. Sylvester, J.A.: Random walk hitting times and effective resistance in sparsely connected Erdős–Rényi random graphs. arXiv preprint arXiv:1612.00731 (2016)
22. Tang, X., Yang, C.C.: Detecting social media hidden communities using dynamic stochastic blockmodel with temporal Dirichlet process. ACM Trans. Intell. Syst. Technol. (TIST) 5(2), 36 (2014)
23. Von Luxburg, U., Radl, A., Hein, M.: Hitting and commute times in large random neighborhood graphs. J. Mach. Learn. Res. 15(1), 1751–1798 (2014)
24. Wills, P.: The NetComp Python library (2019). https://www.github.com/peterewills/netcomp
25. Wills, P., Meyer, F.G.: Change point detection in a dynamic stochastic block model (2019). https://ecee.colorado.edu/~fmeyer/pub/WillsMeyer2019.pdf
26. Wills, P., Meyer, F.G.: Metrics for graph comparison: a practitioner's guide. arXiv preprint arXiv:1904.07414 (2019)
27. Wilson, J.D., Stevens, N.T., Woodall, W.H.: Modeling and estimating change in temporal networks via a dynamic degree corrected stochastic block model. arXiv preprint arXiv:1605.04049 (2016)
28. Wolfe, P.J., Olhede, S.C.: Nonparametric graphon estimation. arXiv preprint arXiv:1309.5936 (2013)
29. Xing, E.P., Fu, W., Song, L.: A state-space mixed membership blockmodel for dynamic network tomography. Ann. Appl. Stat. 4(2), 535–566 (2010)
30. Xu, K.: Stochastic block transition models for dynamic networks. In: Artificial Intelligence and Statistics, pp. 1079–1087 (2015)
31. Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R.: Detecting communities and their evolutions in dynamic social networks - a Bayesian approach. Mach. Learn. 82(2), 157–189 (2011)
A General Method for Detecting Community Structures in Complex Networks
Vesa Kuikka
Finnish Defence Research Agency, Tykkikentäntie 1, PO Box 10, 11311 Riihimäki, Finland
Abstract. We present a general method for detecting communities and their substructures in a complex network. The novelty of the method is to separate the network model and the community detection model. Network connectivity and influence spreading models are used as examples for network models. Depending on the network model, different communities and substructures can be found. We illustrate the results with two empirical network topologies. In these cases the strongest detected communities are very similar for the two network models. We use a community detection method that is based on searching local maxima of an influence measure describing interactions between nodes in a network.

Keywords: Complex networks · Community detection · Influence spreading model · Network connectivity · Community influence measure
© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 223–237, 2020. https://doi.org/10.1007/978-3-030-36687-2_19

1 Introduction

Methods for detecting communities in social, biological and technological networks have been studied extensively in the literature, yet no commonly accepted definition of a community exists. Different mathematical methods and algorithms have been presented for detecting communities in complex network topologies [3, 9, 12–14]. Modularity maximization and spectral graph partitioning are two examples in the wide context of community detection methods [4, 6, 7, 10, 12]. Modularity measures the strength of division of a network into modules. One definition of a community is a locally dense connected subgraph in a network [2]. Modularity has been defined as the fraction of links falling within the given groups minus the expected fraction if links were distributed at random. In order to compute the numerical value of modularity, each link is cut into two halves, called stubs. The expected number of links is computed by rewiring each stub randomly with any other stub in the network, except itself, but allowing self-loops when a stub is rewired to another stub from the same node. Mathematically, modularity can be expressed as

$$M = \frac{1}{2m} \sum_{vw} \left[ A_{vw} - \frac{k_v k_w}{2m} \right] \frac{s_v s_w + 1}{2}. \tag{1}$$
In Eq. (1), $v$ and $w$ are nodes in the network, $2m$ is the number of stubs in the network, $k_v$ is the degree of node $v$, $A_{vw} = 1$ means that there is a link between nodes $v$ and $w$, and $A_{vw} = 0$ means that there is no link between the two nodes. Matrix $A$ is called the adjacency matrix. Membership variable $s_v$ indicates the community of node $v$: $s_v = 1$ if node $v$ belongs to community 1 and $s_v = -1$ if node $v$ belongs to community 2. Equation (1) holds for partitioning into two modules but it can be generalized for partitioning into a desired number of modules. Modularity suffers from a resolution limit and it is unable to detect small communities [2, 12]. In matrix terms Eq. (1) is

$$M = \frac{1}{4m} s^T B s,$$

where $B_{vw} = A_{vw} - \frac{k_v k_w}{2m}$ is called the modularity matrix. The equation for $M$ is similar in form to an expression used in spectral partitioning of graphs for the cut size of a network in terms of the graph Laplacian. This similarity can be used for deriving a spectral algorithm for community detection. The eigenvector corresponding to the largest eigenvalue of the modularity matrix assigns nodes to communities according to the signs of the vector elements. The classical graph partitioning is the problem of dividing the nodes of a network into a given number of non-overlapping groups of given sizes such that the number of links between groups is minimized [12].

The Louvain algorithm and Infomap are two fast algorithms for community detection that have been briefly described in [2]. These algorithms have gained popularity because of their suitability for identifying communities in very large networks. Both algorithms optimize a quality function. For the Louvain algorithm the quality function is modularity and for Infomap an entropy-based measure. In the Louvain algorithm modularity is optimized by local changes of the modularity measure and communities are obtained by aggregating the modules to build larger communities. Infomap compresses the information about a random walker exploring the graph [2].

In this paper, we take an alternative approach, where instead of running community detection based on the adjacency matrix $A$ of a graph, an influence matrix $C$ is first constructed that contains information about a social influence process over the paths of a graph [8]. An element $C_{vw}$ of the social influence matrix accounts for interactions over all the paths from source node $v$ to target node $w$. In order to study local interactions in the network, a maximum path length $L_{\max}$ can be used in the algorithm. In addition, we account for the fact that communities can mean different things depending on the processes that are supposed to operate on a network.
This is demonstrated by substituting the social influence matrix with the network connectivity matrix known from classical communication theory [1]. In the community detection algorithm of this paper, the quality function is a sum of elements of matrix C: the sum includes the elements for node pairs that both lie in the community and the elements for node pairs that both lie outside it. This measure differs from the modularity M of Eq. (1). One idea for future studies is to use the measure M where the adjacency matrix A is substituted with the influence matrix C in a standard framework of community detection.
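For readers who want to check Eq. (1) numerically, the following sketch evaluates the two-way modularity both from the sum in Eq. (1) and from the matrix form $s^T B s / (4m)$. The function names and the toy graph (two triangles joined by one link) are hypothetical and chosen only for illustration.

```python
def modularity_two_way(adj, s):
    """Evaluate Eq. (1) for membership values s[v] in {+1, -1}."""
    n = len(adj)
    k = [sum(row) for row in adj]      # node degrees k_v
    two_m = sum(k)                     # number of stubs, 2m
    total = 0.0
    for v in range(n):
        for w in range(n):
            b_vw = adj[v][w] - k[v] * k[w] / two_m   # modularity matrix entry
            total += b_vw * (s[v] * s[w] + 1) / 2
    return total / two_m


def modularity_matrix_form(adj, s):
    """Evaluate s^T B s / (4m), the matrix form of Eq. (1)."""
    n = len(adj)
    k = [sum(row) for row in adj]
    two_m = sum(k)
    quad = sum(
        (adj[v][w] - k[v] * k[w] / two_m) * s[v] * s[w]
        for v in range(n)
        for w in range(n)
    )
    return quad / (2 * two_m)          # 4m == 2 * (2m)


# Hypothetical example (assumption): two triangles joined by the link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for v, w in edges:
    adj[v][w] = adj[w][v] = 1
s = [1, 1, 1, -1, -1, -1]              # one triangle per community
m_sum = modularity_two_way(adj, s)
m_matrix = modularity_matrix_form(adj, s)
```

The two values agree because the entries of the modularity matrix sum to zero, so the constant term $+1$ in Eq. (1) contributes nothing.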
2 Community Detection Method

Typically, a node has higher influence on neighbouring nodes compared to nodes that are far away in the network topology. Influence increases with the number of alternative paths between nodes. Most community detection methods calculate only the local influence among nodes in a network structure. More accurate results can be obtained when longer path lengths are included in the model and computations. In order to balance between the increasing number of alternative paths and the distance between a source node and a target node, weighting factors can be used to describe probabilities of influence via links between two nodes in the network. Weighting factors for links or nodes, or both, together with the network topology are the main input data for network models. In dynamic models, the spreading time distribution or the time dependency of node and link attribute values describes the influence spreading or the changing network structure. In static connectivity models the probability of a functioning connection between a source node and a target node plays the role of a weighting factor.

The novelty of the community detection method of this paper is to separate the network model and the community detection model. This means that the same community detection algorithm can be used with different network models. We present two examples of network models: a network connectivity model [1] and an influence spreading model [8]. In both cases, we use the same community detection algorithm. In these two examples, interactions between nodes are described as spreading probabilities or probabilities of functioning connections. Technically, interactions between the pairs of nodes in a network are expressed in an $N \times N$-dimensional community influence matrix $C$, where $N$ is the number of nodes in the network. The method detects several kinds of structures in topological complex networks: non-overlapping, overlapping and hierarchical community structures [8]. As a special case, communities consisting of two or more distinct subcommunities that have no direct contact can be discovered.

The method is based on searching local maxima of a community influence measure computed from the elements of the community influence matrix $C_{s,t}$, $s, t = 1, \ldots, N$. Our basic model has the following form for the community influence measure:

$$P = \sum_{s,t \in V} C_{s,t} + \sum_{s,t \in \bar{V}} C_{s,t}. \tag{2}$$
In Eq. (2) the first summation is over the pairs of nodes in a subset $V$ of nodes in the network and the second summation is over the pairs of nodes in the remaining subset $\bar{V}$. Cross terms are ignored in this version of the model because they describe interactions between the two subsets and are not directly involved in the internal cohesion of the two subsets. The simplest method to search local maxima of the community influence measure is to start from a random division of the network and move one node at a time from one side to the other. If the numerical value of $P$ increases, keep the move; otherwise return the node to its original side. In either case, continue with the next node. This procedure is continued until
moving any of the nodes in the original network does not increase the value of the community influence measure [8]. The first cut will crucially impact the set of local maxima that can be found. More local maxima are searched by starting from a new random division of the network. This is repeated until no more local maxima are found in a reasonable number of trials. Finally, an understanding of the landscape of local maxima can be achieved. In particular, this part of the method is proposed for analysing and identifying substructures in larger and less well-studied networks. In this way, community detection methods are tools for understanding network data in general.

Several methods for speeding up the computing process exist. For example, instead of a random starting division, intersections and subtractions of previously found solutions or some pre-existing information about the closeness of nodes can be used. Community influence matrix C and the topology of the network are possible sources of the closeness information. The current version of the computer program for maximizing Eq. (2) scales up to about 40,000 nodes on a personal computer because the matrix C is kept in the working memory. The optimization of memory and processing capabilities is not directly related to the community detection problem but to the optimization algorithm of the quality function. Scaling of the network influence spreading model is more dependent on processing power than memory size [8]. Examples of computing times for social media networks have been provided in [8]: Facebook (4,039 nodes and 88,234 links) 4 min, Twitter (81,306 nodes and 1,768,149 links) 5 h, and Google+ (107,614 nodes and 13,673,453 links) 4 days. Efficient algorithms have been developed for computing network connectivity. Scaling of the network connectivity model has been discussed more in the literature [2].

The method assumes that the network is divided into two communities.
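The one-node-at-a-time search for local maxima of Eq. (2) can be sketched as follows. This is a minimal illustration and not the program used in [8]: the function names, the optional `start` argument, and the toy influence matrix are assumptions, and the sums are taken here over all ordered pairs including $s = t$ (for Eq. (2) the diagonal only adds a constant). The measure is recomputed from scratch after every move, which is simple but slow; an incremental update would be preferable for large networks.

```python
import random


def influence_measure(C, in_v):
    """Eq. (2): sum of C[s][t] over pairs inside V plus pairs inside its complement."""
    n = len(C)
    return sum(
        C[s][t] for s in range(n) for t in range(n) if in_v[s] == in_v[t]
    )


def local_search(C, start=None, seed=None):
    """Move one node at a time while the community influence measure increases."""
    rng = random.Random(seed)
    n = len(C)
    if start is not None:
        in_v = list(start)
    else:
        in_v = [rng.random() < 0.5 for _ in range(n)]   # random initial division
    best = influence_measure(C, in_v)
    improved = True
    while improved:
        improved = False
        for node in range(n):
            in_v[node] = not in_v[node]                 # move node to the other side
            value = influence_measure(C, in_v)
            if value > best:
                best, improved = value, True            # keep the move
            else:
                in_v[node] = not in_v[node]             # return the node
    return in_v, best


# Hypothetical influence matrix (assumption): strong ties inside two groups of
# three nodes, weak ties across the groups, zero diagonal.
groups = [0, 0, 0, 1, 1, 1]
C = [
    [0.0 if s == t else (0.5 if groups[s] == groups[t] else 0.01) for t in range(6)]
    for s in range(6)
]
division, value = local_search(C, start=[True, True, True, False, False, False])
```

Calling `local_search(C, seed=k)` for several seeds `k` and collecting the distinct divisions corresponds to the repeated random restarts described in the text. Since the entries of C are non-negative, the division with all nodes on one side is always among the maxima, so the interesting solutions are the other local maxima.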
However, the model provides many solutions for the local maxima of Eq. (2). Communities with high rankings according to the value of $P$ in Eq. (2) are candidates for the split of the original community in real life. Note that this may not be the most probable solution of the community formation process. Later, in Sect. 4, we will present results for both of the community measures: the strength $P$ of the split into two communities in Eq. (2) and the statistical measure describing the probability of community formation.

In practice, real-world social networks separate into two parts, although it is possible that more subcommunities appear at a later point in time. However, this kind of community formation is a special case of the basic model. If the original network is first divided into $A$ and $B$, and later $B$ is divided into $B_1$ and $B_2$, usually the divisions $(A \cup B_1) \cup B_2$ and $(A \cup B_2) \cup B_1$ are also local maxima of Eq. (2).

To be precise, interactions between nodes in one community also include interactions mediated via paths through other communities. In the model, these paths are included whenever the source node and the target node are inside one community. We assume that the same community influence matrix is valid, that it does not depend on the communities, and that there is no need to recalculate the matrix between iterations. If communities have self-intensifying properties or a complex dependency on other communities, the community influence matrix should be recalculated during the optimization of Eq. (2).

Later, we will also present results for a modification of Eq. (2). We define the modified community influence measure as
$$P = \frac{N^2}{N_V^2 + N_{\bar{V}}^2} \left( \sum_{s,t \in V} C_{s,t} + \mu \sum_{s,t \in \bar{V}} C_{s,t} \right). \tag{3}$$
The number of nodes in the original network and in the two factions are denoted by $N$, $N_V$, and $N_{\bar{V}}$. The factor before the summations balances the fact that Eq. (2) has fewer summands when the two divisions $V$ and $\bar{V}$ are of unequal sizes. The phenomenological parameter denoted by $\mu$ inside the brackets is used to model possible asymmetry between $V$ and $\bar{V}$. Stronger interactions may exist in $V$ than in $\bar{V}$ when nodes in community $V$ are connected around a common ideology and nodes in community $\bar{V}$ are left outside without a similar connecting factor. As another application, modifications of Eq. (2) can approximate the above-mentioned recalculation of the community influence matrix. Equation (3) will be applied only to a human social network because the interpretation of a common opinion or belief may not be justified for animal social networks.

In this paper, the form of Eq. (3) is experimental because it superimposes a model on another model already captured by the influence matrix $C$. It would be better to include such a process in the definition of $C$ and avoid the use of a free parameter. This can be achieved directly by using empirical or evaluated node and link activity values (weights). If this kind of data is available, effects that are modelled with parameter $\mu$ can be included in the elements of matrix $C$. In the same fashion, social pressure against the other community could be included in matrix $C$ for describing how a community is trying to convince outsiders to join it.

In this paper, we demonstrate principles of the method and its parameters with two small well-known empirical networks. In these cases, solutions agree with the ground truth divisions and many other methods, with the same documented exceptions [11, 14]. Essential aspects of the method are that the network model and the community detection model can be separated and the combined method can be used to investigate complex network substructures. One or both of the submodels can be substituted with other models.
In fact, a new idea would be to focus more on the effect of different network or influence models in a standard framework of community detection. Community detection of empirical examples with different network processes can yield new results if the process described by the influence matrix C is adjusted to the respective empirical case.
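The modified measure of Eq. (3) can be written down directly, as in the sketch below. The function name and the toy matrix are hypothetical. As an assumption, the sums run over all ordered pairs of nodes on the same side, including $s = t$; under that reading the number of summands is $N_V^2 + N_{\bar{V}}^2$, which is what the prefactor compensates for.

```python
def modified_influence_measure(C, in_v, mu=1.0):
    """Eq. (3): size-balanced measure with asymmetry parameter mu.

    in_v[s] is True when node s belongs to V and False when it belongs
    to the complement of V.
    """
    n = len(C)
    n_v = sum(1 for flag in in_v if flag)
    n_vbar = n - n_v
    inside_v = sum(
        C[s][t] for s in range(n) for t in range(n) if in_v[s] and in_v[t]
    )
    inside_vbar = sum(
        C[s][t] for s in range(n) for t in range(n) if not in_v[s] and not in_v[t]
    )
    scale = n ** 2 / (n_v ** 2 + n_vbar ** 2)   # balances unequal faction sizes
    return scale * (inside_v + mu * inside_vbar)


# Hypothetical constant influence matrix (assumption), diagonal included.
C_const = [[0.1] * 6 for _ in range(6)]
balanced = modified_influence_measure(C_const, [True] * 3 + [False] * 3)
unbalanced = modified_influence_measure(C_const, [True] * 5 + [False])
```

With a constant matrix and $\mu = 1$ both divisions give the same value, $N^2 \times 0.1 = 3.6$, which illustrates the balancing role of the prefactor; a value of $\mu$ other than one then weights the cohesion of $\bar{V}$ differently from that of $V$.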
3 Example Network Models

Two different network models are used to demonstrate the generality of the community detection method: the classical network connectivity model [1] and an influence spreading model proposed in [7, 8]. We present briefly the main features of the two models. The influence spreading model is designed for describing complex social interactions in a social network structure. These interactions propagate via connections, or paths, between people. We assume that the information content may change and the
social influence is developing during the spreading process. We allow repeated attempts of social influence from one source node to target nodes via all alternative paths. This also includes loops (but not self-loops), where one node can occur several times on a path. The spreading probabilities between all pairs of the nodes in a network can be calculated from the node and link weighting values between neighbouring nodes, and the temporal spreading distribution as a function of the number of links between the source node and the target node. Node and link weighting factors describe probabilities of forwarding information to neighbouring nodes in the network [8]. Node and link weighting factors along a path are taken into account by the factor $W_L$, in which the relevant weighting values are multiplied. The following equation shows how two paths from a source node to a target node with path lengths $L_1$ and $L_2$ and a common path length of $L_3$ are combined:

$$P_{i,\min(L_1,L_2)}(T) = W_{L_1} D_{L_1}(T) + P_{i-1,L_2}(T) - \frac{W_{L_1} D_{L_1}(T)\, P_{i-1,L_2}(T)}{W_{L_3} D_{L_3}},$$

$$i = 1, \ldots, N_L - 1 \quad (L_1, L_2 \le L_{\max}), \qquad P_{0,L_2} = W_{L_2} D_{L_2}(T).$$

The quantity $P_{i-1,L_2}(T)$ is the intermediate result at step $i$ during iterations, and $N_L$ is the number of different paths between the two nodes. The temporal distribution at time $T$ of the spreading process [9] is denoted by $D_L(T)$. The maximum path length of calculations is $L_{\max}$. Finally, all $N_L$ paths go into the result $P_{N_L-1}(T)$ for the probability of influence propagation from source node $s$ to target node $t$. We denoted this quantity by $C_{s,t}$ in Eq. (2). A more detailed algorithm of the model has been presented in [8].

The network connectivity model is designed for describing the reliability of communication networks [1]. If the reliability values between any neighbouring pairs of nodes in the network are known, reliability values between any pairs of nodes in the network can be computed. Reliability is identified with the probability of an operational connection in a time unit. From the general reliability theory [1] the reliability of a network $V$ is

$$r(V) = \sum_{S \in O} \prod_{e \notin S} (1 - p_e) \prod_{e \in S} p_e,$$

where $S$ is a set of links for which the network is connected, and $O$ is the set of all connected states of the network. Links are denoted by $e$ and the probability of an operational link is denoted by $p_e$. If the probabilities $p_e$ are equal,

$$r(V) = 1 - \sum_{h=1}^{N_L} \sum_{s=0}^{h} (-1)^{h-s} \binom{N_L - s}{N_L - h} H_s\, p_e^h,$$

where $H_s$ is the sum of indicator functions when the number of broken links is $s$. The above equations are polynomials of the order of the number of links $N_L$ in the network.
In this form, the equations describe reliability of the entire network. In our case we apply the results for pairs of nodes by only taking the relevant terms in the summations. The influence spreading model and the network connectivity model have been designed for different application areas. In this study we use these models to demonstrate a community detection method with different network models. In the network connectivity model, connectivity is required in both directions between two nodes, and consequently connectivity is a symmetric property. The influence spreading model is less restrictive in this respect.
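To make the two models of this section concrete, the sketch below evaluates both on a four-link toy network. It is an illustration under assumptions, not the algorithms of [1] or [8]: the function names are hypothetical, the connectivity between a pair of nodes is obtained by brute-force enumeration of all link states (exponential in the number of links, so only usable for very small networks), and the two-path combination assumes $W_L = w^L$ with unit node weights and $D_L(T) = 1$, i.e. the limit $T \to \infty$.

```python
from itertools import product


def connected(working, source, target):
    """True if target can be reached from source over the working (undirected) links."""
    reached, frontier = {source}, [source]
    while frontier:
        node = frontier.pop()
        for u, v in working:
            for a, b in ((u, v), (v, u)):
                if a == node and b not in reached:
                    reached.add(b)
                    frontier.append(b)
    return target in reached


def pair_connectivity(edges, p, source, target):
    """Probability that source and target are connected when every link is
    operational independently with probability p (sum over all link states)."""
    total = 0.0
    for state in product((False, True), repeat=len(edges)):
        working = [e for e, up in zip(edges, state) if up]
        if connected(working, source, target):
            n_up = len(working)
            total += p ** n_up * (1 - p) ** (len(edges) - n_up)
    return total


def combine_two_paths(w, l1, l2, l3):
    """Combine two paths of lengths l1 and l2 that share l3 links,
    assuming W_L = w**L and D_L(T) = 1."""
    return w ** l1 + w ** l2 - w ** l1 * w ** l2 / w ** l3


# Hypothetical network (assumption): the paths 0-1-3 and 0-1-2-3 share link (0, 1).
edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
exact = pair_connectivity(edges, 0.5, 0, 3)
combined = combine_two_paths(0.5, l1=2, l2=3, l3=1)
```

On this toy network there are only two simple paths from node 0 to node 3, so the enumeration and the two-path combination give the same number. In general the two models differ, for example because the influence spreading model also allows loops.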
4 Detected Communities and Their Substructures

We use two empirical networks to demonstrate the method and to compare results between two network models: network connectivity and network spreading models. Zachary’s karate club and Lusseau’s dolphin networks have been used as empirical example networks in several studies in the literature [5, 11, 14]. We do not expect exactly the same results from the two network models because of their different applications, definitions and parameterizations. However, in the cases of the two example network topologies, the most important communities and their subcommunities are surprisingly close to each other. More differences appear in weaker communities and in substructures. The fact that the same community detection algorithm provides reasonable results for different network models suggests that the method is generally valid for community detection in various applications. On the other hand, features of the community detection method are useful because substructures are also uncovered. The community detection method is not limited to particular network models: directed, weighted, time-dependent, and layered network models can be used.

The network connectivity model describes connectivity between pairs of nodes in the network. This is calculated by considering all possible paths between a source node and a target node. We use the same parameter value for describing an operational link between two neighbouring nodes and utilize the second formula of $r(V)$ in Sect. 3. However, in social networks weighting factors are used for describing probabilities of social influence. Although we apply the network model originally designed for physical communication network modelling, we use low values for the parameter value as in the modelling of influence spreading.
The main focus of this paper is on the methodology, and this is why simple real-world social networks, Zachary’s karate club and Lusseau’s bottlenose dolphin network, are used to demonstrate the method. In addition, we document very detailed results provided by the model to demonstrate the granularity and different aspects of the model. However, these results are not analysed in detail because such low-level empirical information is not available. Usually the model predicts the strongest communities accurately but weaker structures are more sensitive to network models and parameter values.
4.1 Zachary’s Karate Club
Wayne W. Zachary observed 34 members of a karate club over a period of two years [14]. During the study a disagreement developed between the administrator of the club and the club’s instructor. The instructor started a new club, taking 16 members of the original club with him. Figure 1 shows the karate club social network, where lines 5 and 14 indicate the two factions after the split of the club, with the exception of node 9, who joined the other club. The instructor is node 1 and the administrator is node 34. Zachary’s karate club and Lusseau’s dolphin networks are social networks where low link weights better describe the probability of social influence. On the other hand, with high parameter values only one community, containing all the nodes of the network, is detected. This is not an interesting case in our study. Connectivity and influence spreading probabilities describe different phenomena but they can have some common interpretation in social networks.
Fig. 1. Zachary’s karate club network with divisions indicating detected communities. Divisions correspond to lines in Tables 3 and 4 (these can be compared with Tables 1 and 2).
Next, we present results for Zachary’s karate club from the two network models. In Table 1, columns ‘A0.05’ and ‘A0.1’ are from the network connectivity model and the other five columns are from the network spreading model. Nine different solutions for communities are detected. These are lines 1–9 in Tables 1 and 2. The numerical values of the community influence measure of Eq. (2) from the two network models for the nine detected divisions are shown in the left part of Table 1. The corresponding values of statistical community measures are shown in the middle part of the table. The statistical values are probabilities to split into the two communities. These results are simulated by starting from random initial configurations. The
Table 1. The values of the community influence measure of Eq. (2) from the two network models for the nine detected divisions are shown in the left part of the table. The corresponding values of the statistical community measures are shown in the middle part of the table. Columns ‘A0.05’ and ‘A0.1’ show the results from the network connectivity model with connectivity probabilities $p = 0.05$ and $p = 0.1$ between neighbouring nodes. Columns ‘P0.05’ and ‘P0.1’ show the results from the influence spreading model with influence spreading probabilities $w_L = 0.05$ and $w_L = 0.1$. The next column ‘PT0.1’ shows the results during the spreading process at time $T = 0.1$ ($w_L = 1.0$) (all the other columns show results for time approaching infinity, $T \to \infty$). Column ‘L0.1’ shows the results with the limited path length $L_{\max} = 2$ ($w_L = 0.1$). Column ‘VL0.05’ shows the results with the limited number of visits $V = 1$ on a node during the influence spreading process ($w_L = 0.05$, $L_{\max} = 2$). The right part of the table shows aggregated data from the right part of Table 3.
1 2 3 4 5 6 7 8 9
A0.05 A0.1 10.35 9.18 24.90 8.69 8.72 23.58 8.58 8.03 21.46 7.94 24.78
P0.05 P0.1 PT0.1 L0.1 VL0.05 A0.05 A0.1 P0.05 P0.1 10.62 19.24 23.89 9.72 1 10.7 % 10.3 % 9.42 17.20 21.29 8.76 2 13.1 % 10.2 % 13.2 % 8.91 16.23 20.04 8.25 3 4.6 % 5.1 % 8.94 16.33 20.12 8.32 4 2.6 % 0.4 % 2.8 % 8.79 16.04 19.70 8.16 5 1.6 % 1.6 % 8.22 15.03 18.34 7.67 6 0.4 % 0.1 % 0.5 % 8.12 14.84 18.10 7.56 7 0.4 % 0.4 % 8 0.1 % 29.27 9 9.0 %
PT0.1 10.8 % 14.3 % 3.2 % 5.2 % 1.8 % 0.5 % 0.4 %
L0.1 10.7 % 15.0 % 2.0 % 5.3 % 1.7 % 0.4 % 0.3 %
VL0.05 11.1 % 14.9 % 3.9 % 4.7 % 2.1 % 0.5 % 0.4 %
1 2 3 4 5 6 7 8 9
A0.05 11.6 % 7.8 % 2.7 % 5.5 % 0.9 % 0.5 % 0.4 % 0.0 % 7.2 %
A0.1
P0.05 P0.1 PT0.1 L0.1 VL0.05 11.3 % 11.4 % 10.7 % 11.9 % 6.0 % 3.1 % 5.0 % 2.8 % 3.1 % 3.1 % 5.0 % 5.2 % 5.2 % 6.8 % 0.6 % 15.1 % 15.7 % 16.6 % 17.0 % 0.9 % 1.0 % 1.0 % 1.2 % 1.1 % 1.1 % 1.1 % 1.2 % 0.5 % 0.6 % 0.5 % 0.6 % 0.1 % 5.3 %
first division in line 1 has the highest community influence measure of Eq. (2) for ‘A0.05’ in the network connectivity model and for the four influence spreading model calculations with different model parameters. Table 2 shows the nodes included in the communities. For example, the first line indicates that nodes {5, 6, 7, 11, 17} and {1, 2, 3, 4, 8, 9, 10, 12, …, 16, 18, …, 34} are the members of the two detected communities. The last two columns show that the numbers of nodes in the communities are 5 and 29.
Table 2. Nodes in communities corresponding to lines in Table 1. Each node string has one digit per node, for nodes 1–34 from left to right; for example, line 1 means that nodes 5, 6, 7, 11, and 17 are members of the first community. Communities detected in runs correspond to columns in Table 1 (for example, 1010111 means that the division in line 1 is found in runs ‘A0.05’, ‘P0.05’, ‘PT0.1’, ‘L0.1’, and ‘VL0.05’). The last two columns show the number of nodes in the two factions of the network.
Line  Nodes                               Found in runs  N1  N2
1     0000111000100000100000000000000000  1010111        5   29
2     1111111100111100110101000000000000  1110111        16  18
3     0000000000000011001010110010010011  1010111        10  24
4     0000000011000011001010110011011011  1110111        14  20
5     1101000100011100010101000000000000  1010111        10  24
6     1111000100011100010101001100100100  1110111        15  19
7     0000111000100011101010110010010011  1010111        15  19
8     1101111100111000110001000000000000  0100000        13  21
9     1101111100111100110101000000000000  0001000        15  19
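The encoding used in Table 2 can be illustrated with a minimal sketch. The function names `decode_division` and `decode_runs` are hypothetical helpers introduced here for illustration; they only assume that each digit of the node string refers to one node (numbered from 1) and that the run flags follow the column order of Table 1.

```python
# Column order of the runs, as listed in the caption of Table 1.
RUNS = ["A0.05", "A0.1", "P0.05", "P0.1", "PT0.1", "L0.1", "VL0.05"]


def decode_division(bits):
    """Return the two factions encoded by a 0/1 string (node ids start at 1)."""
    first = [node for node, bit in enumerate(bits, start=1) if bit == "1"]
    second = [node for node, bit in enumerate(bits, start=1) if bit == "0"]
    return first, second


def decode_runs(flags):
    """Return the names of the runs whose flag digit is 1."""
    return [run for run, flag in zip(RUNS, flags) if flag == "1"]


# Line 1 of Table 2.
first, second = decode_division("0000111000100000100000000000000000")
runs = decode_runs("1010111")
print(first, len(first), len(second))
print(runs)
```

Applied to line 1, the sketch recovers the five-node community and the five runs named in the caption.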
V. Kuikka
Table 3. Values of the community influence measure of Eq. (3) with the parameter value of l = 0.95 (left part of the table). The corresponding values of the statistical community measures are shown in the right part of the table. The results can be compared with Table 1, where Eq. (2) is used as the quality function.
Left part: community influence measure of Eq. (3)
Line  A0.05  A0.1  P0.05  P0.1  PT0.1  L0.1  VL0.05
1     17.8   -     18.3   -     33.2   41.1  16.8
2     17.9   48.6  -      -     -      -     -
3     16.7   45.1  17.1   -     31.2   37.8  15.9
4     14.9   -     15.3   -     27.9   34.5  14.2
5     17.7   48.1  18.2   56.5  33.2   41.2  16.9
6     17.0   45.7  17.4   -     31.7   39.1  16.2
7     18.0   -     18.4   -     33.4   41.4  16.9
8     16.6   -     17.0   -     31.1   38.1  15.8
9     16.0   -     16.3   -     29.9   36.7  15.2
10    15.2   -     15.8   -     28.4   35.2  14.5
11    15.5   -     15.9   -     29.1   35.5  14.8
12    15.5   -     15.8   -     28.9   35.0  14.6
13    15.3   -     -      -     -      -     -
14    -      -     18.4   -     33.6   41.5  17.1
15    -      -     17.3   -     31.6   -     -
16    -      -     15.9   -     29.1   35.4  14.9
17    -      -     -      55.9  -      -     -
Right part: statistical community measure
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
A0.05 5.7 % 7.2 % 4.0 % 2.7 % 7.8 % 1.5 % 5.9 % 0.9 % 1.1 % 0.2 % 0.3 % 0.2 % 0.2 %
A0.1
P0.05 P0.1 5.5 %
5.3 % 0.4 % 7.3 % 4.1 % 6.0 % 3.1 % 5.0 % 0.2 % 7.8 % 0.9 % 5.8 % 0.9 % 0.6 % 1.1 % 0.1 % 0.4 % 0.1 % 0.3 %
PT0.1 L0.1 VL0.05 5.5 % 5.3 % 6.0 % 7.5 % 4.1 % 2.8 % 8.2 % 1.1 % 5.9 % 1.0 % 0.8 % 1.1 % 0.1 %
8.0 % 4.0 % 3.1 % 8.6 % 1.17 % 5.45 % 1.05 % 0.2 % 1.06 % 0.05 %
8.0 % 4.5 % 3.1 % 9.0 % 2.3 % 6.0 % 1.2 % 0.3 % 1.2 % 0.1 %
0.49 % 0.46 % 0.5 % 0.12 % 0.28 % 0.08 % 0.13 % 0.1 %
Table 4. Nodes in communities corresponding to lines in Table 3. Communities detected in runs correspond to columns in Table 3. The last two columns show the number of nodes in the two factions of the network.
Line  Nodes                               Found in runs  N1  N2
1     0000111000100000100000000000000000  1010111        5   29
2     0010000011000011001010111111111111  1100000        19  15
3     0000000011000011001010110011011011  1110111        14  20
4     1111111111111100110101001101101100  1010111        24  10
5     1111111100111100110101000000000000  1111111        16  18
6     1111111100111100110101001100100100  1110111        20  14
7     1111000111011111011111111111111111  1010111        29  5
8     0010111011100011101010111111111111  1010111        24  10
9     1111000100011100010101000000000000  1010111        11  23
10    1111000100011100010101001100100100  1010111        15  19
11    0000111011100011101010110011011011  1010111        19  15
12    1111000111011100010101001101101100  1010111        19  15
13    0000111000100011101010110010010011  1000000        15  19
14    0000000011000011001010111111111111  0010111        18  16
15    1111111101111100110101001101100100  0010100        22  12
16    1111000101011100010101001101100100  0010111        17  17
17    1101111100111000110001000000000000  0001000        13  21
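Several lines of Table 4 are mirror images of each other, with the two factions interchanged, and such pairs describe the same division as a single line of Table 2. The following sketch, under the assumption that a division is identified by its 0/1 node string, shows how mirrored lines could be grouped and their statistical values summed. The helpers `complement` and `canonical` are hypothetical; the two percentages are the values quoted in the text for run ‘A0.05’.

```python
def complement(bits):
    """Interchange the two factions of a division."""
    return "".join("1" if bit == "0" else "0" for bit in bits)


def canonical(bits):
    """Return one shared key for a division and its mirror image."""
    return min(bits, complement(bits))


# Lines 1 and 7 of Table 4.
table4_line1 = "0000111000100000100000000000000000"
table4_line7 = "1111000111011111011111111111111111"

# Statistical values (in percent) for run 'A0.05', as quoted in the text.
stat = {table4_line1: 5.7, table4_line7: 5.9}

aggregated = {}
for bits, value in stat.items():
    key = canonical(bits)
    aggregated[key] = aggregated.get(key, 0.0) + value

print(complement(table4_line1) == table4_line7)
print({key: round(value, 1) for key, value in aggregated.items()})
```

Both lines collapse to one key, so their values add up to a single aggregated entry for the division into 5 and 29 nodes.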
Note that runs ‘A0.1’ and ‘P0.1’ have not found the first community, as can also be seen in Table 2 from the zeros in the second and fourth positions of ‘1010111’. Fewer communities are found with higher weights, and it is even possible that the strongest community is not found or that a new combination of nodes emerges. Comparing lines 2 and 9 in Table 2 reveals that the only difference is node 3 moving to the larger faction. This configuration is the only one detected with the higher influence parameter value of wL = 0.1 in the influence spreading model.
Three columns show calculations from the influence spreading model with three different parameters: ‘PT0.1’ with the time of spreading T = 0.1, ‘L0.1’ with the maximum spreading path length Lmax = 2, and ‘VL0.05’ with the limited number of visits to one node during the influence spreading, V = 1 (with Lmax = 2). These results agree with the basic calculations of ‘A0.05’ and ‘P0.05’. It is possible that these results are different in more complex network topologies or network configurations.
Tables 3 and 4 show the results of the modified community influence measure of Eq. (3). The division into 20 and 14 nodes in lines 3 and 6 has high probabilities of community formation for low weighting values (with the exception of ‘A0.05’), but the division into 18 and 16 nodes has high numerical values of Eq. (3) in lines 5 and 14. The division into 16 and 18 nodes is also predicted by Eq. (2) with a high community formation probability, but the division into 5 and 29 nodes has the highest numerical value of P in Eq. (2) in most cases. As a general observation, several differences in details exist between the results for the network connectivity model and the influence spreading model.
Tables 3 and 4 have more lines than Tables 1 and 2 because a division and the same division with its two factions interchanged are now two different solutions of the optimization problem in Eq. (3). For example, lines 1 and 7 in Tables 3 and 4 correspond to line 1 in Tables 1 and 2: the value of the statistical community measure for the division into 5 and 29 nodes (or 29 and 5) is 5.7% + 5.9% = 11.6%. The aggregated data from Table 3 is collected in the right part of Table 1. The numerical value of 10.7% in the middle part of Table 1 is different because it is computed from Eq. (2) instead of Eq. (3). The correct way is to compare rankings between columns in Table 1. The aggregated data illustrate that the division into 14 and 20 nodes (lines 3 and 6 in Fig. 1) is stronger than in the basic model of Eq. (2).
4.2 Lusseau’s Bottlenose Dolphin Network
A population of 62 bottlenose dolphins was observed over a period of seven years [11]. A temporary disappearance of dolphin SN100 led to the fission of the dolphin community into two factions. The dolphin social network is found to be similar to a human social network in some respects, but assortative mixing by degree is not
observed within the community [5, 11]. Assortative mixing measures bias in favour of interactions between nodes with similar characteristics.
Fig. 2. Dolphins’ social network with some results from Tables 5 and 6.
Line 1 in Fig. 2 shows the split observed in real life, with the one exception of dolphin SN89. Only a few representative divisions are shown in Fig. 2. Table 5 shows the list of different structures detected by the basic community detection method of Eq. (2). In the upper part of the table, numerical values of the community detection measure of Eq. (2) are shown. In the lower part of the table, some representative values of the statistical community detection measure are shown. Nodes included in the communities of Fig. 2 are documented in Table 6. The complete list of detected communities is presented to illustrate the model. We discuss only the main results because weaker solutions may not have any interpretation in real life. However, weaker communities may also be potential starting points for future developments in the community structure. The division indicated by line 1 has the highest ranking in most cases according to both the community detection measure of Eq. (2) and the statistical measure describing the probability of community formation. The community indicated by line 2 is documented in the literature and agrees with other research [5, 11]. Line 4 in Tables 5 and 6 has the
community of 15 nodes {6, 7, 10, 14, 18, 23, 32, 33, 40, 42, 49, 55, 57, 58, 61}. This can be an indication of the mediating roles of the six nodes {2, 8, 20, 26, 27, 28}. The names of these six dolphins are also documented in Fig. 2.
Table 5. Values of the community influence measure of Eq. (2) for the dolphin network of Fig. 2. Notations are the same as in Tables 1 and 3. Values of the statistical measures are shown only for the most important lines in the lower part of the table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
A0.05 A0.075 A0.1 A0.15 A0.17 A0.19 P0.01 P0.05 P0.1 T0.2 L0.17 VL0.01 21.13 37.99 62.43 3.25 21.57 75.02 6.16 98.93 3.23 19.10 3.02 19.42 5.73 87.84 3.01 17.81 30.93 2.84 18.12 5.40 81.37 2.83 20.79 37.37 3.20 21.24 6.08 97.20 3.19 18.19 31.69 2.89 18.50 5.48 83.34 2.88 18.16 31.57 2.89 18.45 5.48 2.88 17.97 31.23 49.59 107.91 2.86 18.28 57.90 5.44 82.12 2.86 17.72 30.83 48.92 106.34 140.66 2.82 18.02 57.12 5.36 81.22 2.81 17.53 17.82 17.72 30.80 48.92 2.82 18.02 5.36 80.93 2.81 17.53 17.82 0.00 17.70 30.71 48.75 2.82 17.99 5.36 2.81 17.54 2.78 17.87 5.28 2.77 17.76 31.03 2.82 18.06 5.36 81.23 2.81 18.00 2.85 5.40 2.84 17.38 16.00 16.59 61.63 145.07 198.34 74.24 145.44 197.78 268.01 3.18 21.02 6.03 3.17 2.80 5.32 2.79 2.82 5.35 2.81 57.16 A0.05
A0.075
A0.1
A0.15
A0.17
A0.19
P0.01
P0.05
P0.1
T0.2
L0.17
VL0.01
1 30.55 % 32.93 % 35.07 % 29.62 % 30.81 % 36.01 % 29.60 % 8.72 % 29.40 % 2 7.59 % 8.29 % 7.23 % 8.28 % 32.54 % 8.34 % 7 1.11 % 0.81 % 9.51 % 33.70 % 1.68 % 1.54 % 8.57 % 1.68 % 0.82 % 1.69 % 8 0.95 % 0.61 % 0.44 % 7.25 % 30.85 % 1.29 % 1.10 % 0.29 % 1.28 % 0.85 % 1.25 % 21 0.002 % 27.66 %
The division indicated by line 8 is shown in Fig. 2 because it has a high ranking according to the statistical measure with high weighting values in the network connection model but lower ranking in the influence spreading model. This is only one example showing that the network model is important and different network models provide different results.
Table 6. Nodes in communities corresponding to lines in Table 5. Communities detected in runs correspond to columns in Table 5. The last two columns show the number of nodes in the two factions of the network.
Line  Found in runs  N1  N2
1     1110001110111  21  41
2     1000001100111  12  50
3     1100001100111  27  35
4     1100001100111  15  47
5     1100001100111  29  33
6     1100001100101  27  35
7     1111001110111  23  39
8     1111101110111  24  38
9     1000000100000  31  31
10    1110001100111  22  40
11    1000000100000  31  31
12    1110001100101  22  40
13    1000001100101  25  37
14    1100001100111  26  36
15    1000001000101  30  32
16    1000000000000  28  34
17    1000000000000  30  32
18    1000000000000  23  39
19    0011100010000  14  48
20    0001000000000  18  44
21    0000110000000  17  45
22    0000001100101  11  51
23    0000001000101  30  32
24    0000001000101  24  38
25    0000000010000  23  39
5 Conclusions
We propose a general community detection method for analysing social, biological and technological network structures. The main result of this study is to separate the network model from the community detection method. Different network models can be used to provide input for the community detection algorithm. We demonstrate this with the classical network connectivity model and a recent influence spreading model. Two real-world social networks, Zachary’s karate club and Lusseau’s dolphin network, are used to illustrate the method. Communities and substructures are local maxima of a quality function that we call the community influence measure. A set of local maxima of the community influence measure is searched by repeating the procedure several times, starting each time from a random initial split of the original network. Weak interactions among network nodes produce more solutions than strongly connected networks. This is a way of obtaining an understanding of the landscape of local maxima and identifying substructures in networks.
The second main result of this paper is to present two different approaches for ranking the detected communities and subcommunities in the influence spreading model. This is an important question because the method provides many solutions, and the strongest communities are candidates for real-life communities and their substructures. The first alternative is the optimal numerical value of the community influence measure and the second alternative is the statistical quantity measuring the
probability of forming a community. The latter measure favours larger communities that may have lower values of the community influence measure but a higher probability of forming a community given that the initial state is random. Interestingly, Zachary’s karate club has split into two divisions according to the highest value of the statistical probability measure. Both measures correctly predict the real-life split of the dolphin social network (except dolphin SN89) for weakly interacting connections. The network connectivity model predicts different high-rank divisions for more strongly interacting dolphins with both community ranking measures. We conclude that the network model and different processes on networks can have a significant impact on communities and their substructures. Community detection methods can provide useful tools for analysing empirical examples of different network processes. Existing community detection algorithms identify communities based on the adjacency matrix. A standard framework of community detection with an influence matrix, instead of the adjacency matrix, could be used to study the effects of different processes on networks.
References
1. Ball, M.O., Colbourn, C.J., Provan, J.S.: Network reliability. In: Handbooks in Operations Research and Management Science, vol. 7, pp. 673–762 (1995)
2. Barabási, A.L.: Network Science. Cambridge University Press, Cambridge (2016)
3. Coscia, M., Giannotti, F., Pedreschi, D.: A classification for community discovery methods in complex networks. Stat. Anal. Data Min. 4(5), 512–546 (2011)
4. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. 659(11), 1–44 (2016)
5. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. U.S.A. 99(12), 7821–7826 (2002)
6. Karrer, B., Newman, M.E.J.: Stochastic blockmodels and community structure in networks. Phys. Rev. E 83(1), 016107 (2011)
7. Kuikka, V.: Influence spreading model used to community detection in social networks. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds.) Complex Networks & Their Applications VI. COMPLEX NETWORKS 2017. Studies in Computational Intelligence, vol. 689, pp. 202–215. Springer, Cham (2018)
8. Kuikka, V.: Influence spreading model used to analyse social networks and detect sub-communities. Comput. Soc. Netw. 5, 12 (2018). https://doi.org/10.1186/s40649-018-0060-z
9. Lancichinetti, A., Fortunato, S.: Community detection algorithms: a comparative analysis. Phys. Rev. E 80, 056117 (2009)
10. Lancichinetti, A., Fortunato, S., Kertész, J.: Detecting the overlapping and hierarchical community structure in complex networks. New J. Phys. 11, 033015 (2009)
11. Lusseau, D., Newman, M.E.J.: Identifying the role that animals play in their social networks. Proc. R. Soc. London Ser. B 271, S477 (2004)
12. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford (2010)
13. Yang, Z., Algesheimer, R., Tessone, C.J.: A comparative analysis of community detection algorithms on artificial networks. Sci. Rep. 6, 30750 (2016). https://doi.org/10.1038/srep30750
14. Zachary, W.W.: An information flow model for conflict and fission in small groups. J. Anthropol. Res. 33, 452–473 (1977)
A New Metric for Package Cohesion Measurement Based on Complex Network
Yanran Mi, Yanxi Zhou, and Liangyu Chen(B)
Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China
[email protected]
Abstract. With software evolution and code expansion, software structure becomes more and more complex. Refactoring can be used to improve the structural design and decrease the complexity of software. In this paper, we propose a cohesion metric that can be used for package refactoring. It considers not only the intra-package and inter-package dependencies, but also the backward inter-package dependencies. Theoretical verification and empirical verification on multiple open source software systems show that our metric effectively measures software structure.
Keywords: Software dependency network · Software refactoring · Package cohesion measurement · Software metric
1 Introduction
Software is not stationary, but usually evolves gradually. To meet new requirements from our dynamic world, software functionality needs to be updated and extended with new code. The amount of code grows during software evolution, and the code structure becomes more and more complicated. Therefore, it is easy to deviate from the original design rules, which degrades software quality and comprehensibility and finally creates “technical debt” [1]. To address this problem, more researchers in software engineering are using complex network methods to analyze the structural characteristics of software. Based on the combination of complex networks and software engineering, different software systems can be investigated from a macroscopic perspective, such as the Linux kernel [2], open source systems [3], and so on. Faced with increasing software complexity, the software structure urgently needs adjustment without function degradation. Refactoring can improve software design and increase the maintainability and usability of software [4]. Simple refactoring by moving code manually is time-consuming and yields only modest results. Recently, there has been a lot of research on cohesion metrics as guidelines for automatic refactoring. Current cohesion metrics tend to focus on the class cohesion level. Chidamber and Kemerer defined a set of CK metrics, in
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 238–249, 2020. https://doi.org/10.1007/978-3-030-36687-2_20
Metric for Package Cohesion Measurement
which LCOM (Lack of Cohesion in Methods) is used to measure cohesion [5]. Harrison et al. proposed a set of MOOD metrics, which include CF (Coupling Factor) [6]. This metric measures cohesion indirectly by measuring coupling: it is the sum of the dependencies between all classes divided by the sum of all possible dependencies between classes. Bieman et al. proposed a method based on the number of instance variables shared by methods [7]; they define TCC (Tight Class Cohesion) and LCC (Loose Class Cohesion). Briand et al. defined a graph of cohesive relationships and a set of measurement tools [8]. Briand et al. also defined rigorous and understandable criteria for judging whether a cohesion metric is reasonable [9]. Counsell et al. tested a series of C++ systems with a measure based on Hamming distance, and showed that cohesion and coupling are intrinsically linked [10]. Badri et al. compared the metrics on a series of Java-based systems and found that the lower the degree of cohesion, the higher the coupling [11]. Therefore, it is worthwhile to add external dependencies to cohesion metrics.
Different from the above-mentioned class cohesion measurement methods, some class-based package cohesion measurement methods have been proposed in recent years. Misic proposed a cohesion metric and concluded that relying solely on the internal relationships of a package is not sufficient to determine cohesion [12]. Abdeen et al. presented a cohesion metric with cyclic dependencies [13]. Gupta et al. presented a cohesion metric that considers the hierarchical relationship [14].
This paper presents an improved software cohesion metric based on complex networks. Compared with previous work, the new metric takes into account the overall dependencies between classes and also considers the backwards dependencies of classes. We first prove that it meets the four principles of cohesion metrics proposed by Briand [9]. Based on this metric, we provide a refactoring algorithm that adapts package-class relations for better cohesion. Finally, through experiments on multiple open source systems, we verify the validity of our metric and the efficiency of the refactoring algorithm.
The remainder of this paper is organized as follows. In Sect. 2, we present the fundamental notations. In Sect. 3, we describe the new cohesion metric and its refactoring algorithm. In Sect. 4, we use experiments to verify the validity and efficiency of our metric. Section 5 concludes our work.
2 Preliminary
2.1 Basis of Attributes of Community
Definition 1. Let $G = (V, E, C)$ be a network, where $V$ denotes the set of vertices and $E$ denotes the set of edges. $C = \{C_1, C_2, \dots, C_k\}$ is the set of communities, where each vertex belongs to exactly one community and $\forall i \neq j$, $C_i \cap C_j = \emptyset$. Let $A = (a_{ij})$ be the adjacency matrix of $G$. Note that for networks generated from software, the community information is given by the package-class structure.
Y. Mi et al.
Definition 2. The sum of all the edges of the network is
$$M = \frac{1}{2} \sum_{i,j} a_{ij}. \tag{1}$$
Definition 3. The sum of intra-edges of all communities is
$$Q_{real} = \frac{1}{2} \sum_{i,j} a_{ij}\, \delta(C_i, C_j), \tag{2}$$
where $C_i$ and $C_j$ are the communities that vertices $i$ and $j$ belong to, respectively. If they belong to the same community, $\delta$ is 1, otherwise $\delta$ is 0.
Definition 4. The sum of intra-edges of all the communities for the null model of the same scale as the real network is
$$Q_{null} = \frac{1}{2} \sum_{i,j} p_{ij}\, \delta(C_i, C_j), \tag{3}$$
where $p_{ij}$ is the expected number of edges between vertices $i$ and $j$.
Definition 5. For community $k$, the sum of all the intra-edges is
$$E_{I_k} = \sum_{i,j} a_{ij}\, \delta(C_i, C_j, k), \tag{4}$$
where $C_i$, $C_j$ are the communities that vertices $i$ and $j$ belong to. If $C_i = C_j = C_k$, $\delta$ is 1, otherwise $\delta$ is 0.
Definition 6. For community $k$, the sum of external dependencies is
$$E_{X_k} = \sum_{i,j} a_{ij}\, \delta(C_i, C_j, k), \tag{5}$$
where $C_i$, $C_j$ are the communities that vertices $i$ and $j$ belong to. If $C_i = C_k$ and $C_j \neq C_k$, $\delta$ is 1, otherwise $\delta$ is 0.
Definition 7. For community $k$, the sum of external backwards dependencies is
$$E_{XB_k} = \sum_{i,j} a_{ij}\, \delta(C_i, C_j, k), \tag{6}$$
where $C_i$, $C_j$ are the communities that vertices $i$ and $j$ belong to. If $C_i \neq C_k$ and $C_j = C_k$, $\delta$ is 1, otherwise $\delta$ is 0.
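As a rough numerical illustration of Definitions 2 to 4, the sketch below evaluates $M$, $Q_{real}$ and $Q_{null}$ on a small undirected toy graph (two triangles joined by one edge). The definitions leave $p_{ij}$ open; the sketch assumes the common configuration-model choice $p_{ij} = k_i k_j / (2M)$, which is an assumption made here and not something stated in the text. The function name `edge_sums` is hypothetical.

```python
def edge_sums(a, community):
    """Return (M, Q_real, Q_null) for a symmetric 0/1 adjacency matrix a."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    m = sum(a[i][j] for i, j in pairs) / 2
    # Pairs of vertices that share a community, i.e. delta(C_i, C_j) = 1.
    same = [(i, j) for i, j in pairs if community[i] == community[j]]
    q_real = sum(a[i][j] for i, j in same) / 2
    # Assumption: configuration-model expectation p_ij = k_i * k_j / (2M).
    degree = [sum(row) for row in a]
    q_null = sum(degree[i] * degree[j] / (2 * m) for i, j in same) / 2
    return m, q_real, q_null


# Toy graph: triangle 0-1-2 and triangle 3-4-5, joined by the edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
a = [[0] * 6 for _ in range(6)]
for i, j in edges:
    a[i][j] = a[j][i] = 1
community = [0, 0, 0, 1, 1, 1]

m, q_real, q_null = edge_sums(a, community)
print(m, q_real, round(q_null, 6))
```

With each triangle as one community, six of the seven edges are intra-community edges, and the null-model term is smaller than $Q_{real}$.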
2.2 Class Dependency Graph
We take a software system developed in Java as an example to illustrate the construction process of a software network.
Definition 8. A class dependency graph (CDG) [15] is a directed graph $G_c = (V_c, E_c, C)$. $V_c$ is the set of vertices (classes), $E_c$ is the set of edges, and $C$ is the set of communities. Every package is mapped to a community of the network. In the directed network, there is an edge between two classes $v_i$ and $v_j$ if and only if at least one of the following dependencies holds between them:
• R1 (Inheritance and implementation): $v_j$ extends or implements $v_i$;
• R2 (Aggregation): $v_j$ is the data type of a member variable in $v_i$;
• R3 (Parameter): $v_j$ is the data type of a parameter, return value, or declared exception of a member function in $v_i$;
• R4 (Signature): $v_j$ is the type of a local variable in $v_i$;
• R5 (Invocation): $v_j$ is invoked inside a member function of $v_i$.
We assume that the weights of the above five dependencies are the same, so the dependency weight between two classes is the sum of all the dependencies between them. After defining the dependency rules, from the Java source code shown in Fig. 1 we can build the CDG shown in Fig. 2. Compared with existing coarse-grained software networks, our CDG, based on five dependencies, represents the software structure features more intuitively and clearly.
Fig. 1. Example of Java classes
In Fig. 1, classes A, B, and C belong to package1, class D, E, and F belong to package2, and classes H, I, and J belong to package3. Obviously, there are three dependencies between classes: D depends on A, F depends on D, and I (double) depends on E. According to the above ﬁve dependencies, we get the CDG in Fig. 2.
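The construction of the example CDG can be sketched as follows. This is only an illustration under stated assumptions: an edge is stored from the depending class to the class it depends on, every dependency contributes weight 1, and the rule labels attached to each dependency are hypothetical, since the text names the dependencies but not the rules that produce them. The helper `build_cdg` is likewise a name introduced here.

```python
from collections import defaultdict

# Package membership from the example (packages act as communities).
package_of = {
    "A": "package1", "B": "package1", "C": "package1",
    "D": "package2", "E": "package2", "F": "package2",
    "H": "package3", "I": "package3", "J": "package3",
}

# (depending class, class depended on, rule). The rule labels are hypothetical.
dependencies = [
    ("D", "A", "R1"),
    ("F", "D", "R2"),
    ("I", "E", "R3"),  # I depends on E twice, so the edge gets weight 2
    ("I", "E", "R5"),
]


def build_cdg(dependencies):
    """Sum equally weighted dependencies into directed edge weights."""
    weights = defaultdict(int)
    for source, target, _rule in dependencies:
        weights[(source, target)] += 1
    return dict(weights)


cdg = build_cdg(dependencies)
inter_package = {
    edge: w for edge, w in cdg.items()
    if package_of[edge[0]] != package_of[edge[1]]
}
print(cdg)
print(inter_package)
```

Only the edge from F to D stays inside a package; the other two edges cross package boundaries.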
3 Cohesion Metrics Based on Complex Network
3.1 Cohesion Metrics
As everyone knows, high cohesion is an important goal in software design, since it has a great impact on software maintainability and reusability. However, manual
Fig. 2. Example of a class dependency graph
evaluation for cohesion is time-consuming and labor-intensive. Therefore, it is necessary to propose standard cohesion metrics to replace manual evaluation and to support automated code refactoring. Newman et al. [16] proposed a modularity metric,
$$Q = \frac{Q_{real} - Q_{null}}{M}, \tag{7}$$
where $M$ is the number of edges in the network. Since the number of intra-package dependencies in the software may be less than the number expected in the null model, this modularity can be negative and does not meet the nonnegativity principle proposed by Briand [9]. Abdeen [10] proposed a cohesion metric of a package,
$$Q = \frac{E_I}{E_I + E_X}, \tag{8}$$
where $E_I$ represents the number of edges of internal dependencies in a package, and $E_X$ is the number of edges of external dependencies between packages. For a package, this metric considers not only the internal dependencies but also the external dependencies; however, it omits the backwards dependencies. From the perspective of software quality, excessive inter-package calls brought by the backwards dependencies have a higher probability of affecting overall package reusability. The modularity in complex networks is consistent with the software design principle of high cohesion and low coupling. When we simply apply it as a cohesion metric, however, it does not satisfy the cohesion measurement principles proposed by Briand [9]. This is because the null model underlying modularity in complex networks is a random network, whereas the corresponding reference network for software should satisfy the condition that there is no intra-package dependency. Therefore, considering the backwards dependencies, we define the software network package cohesion metric $Q_c$,
$$Q_c = \frac{E_I}{E_I + E_X + \alpha E_{XB}}. \tag{9}$$
When $E_I + E_X + \alpha E_{XB} = 0$, $Q_c$ is defined as 0 to avoid a zero denominator. $E_I$ is the number of edges inside the community/package;
Algorithm 1. Package cohesion calculation algorithm
Input: adjacency matrix $A$; communities $C$; vertex number $N$; and the package $p_1$ to be calculated.
Output: the cohesion of package $p_1$.
Set $E_I$, $E_X$, $E_{XB}$ to 0
for $i = 0$ to $N - 1$ with $C_i = p_1$ do
    for $j = 0$ to $N - 1$ do
        if $C_j = p_1$ then
            $E_I \mathrel{+}= w_{ij}$
        else
            $E_X \mathrel{+}= w_{ij}$
    for $j = 0$ to $N - 1$ do
        if $C_j \neq p_1$ then
            $E_{XB} \mathrel{+}= w_{ji}$
Calculate $Q_c$ by formula (9) and return the result.
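The package cohesion of formula (9) can be sketched in a few lines. The version below is not the adjacency-matrix procedure of Algorithm 1: as a simplification it iterates over a dictionary of weighted directed edges (a hypothetical data layout), with each edge pointing from the depending class to the class it depends on. The data are the three dependencies of the Fig. 1 example, and $\alpha = 0.5$ as chosen in the paper.

```python
def package_cohesion(weights, package_of, package, alpha=0.5):
    """Q_c = E_I / (E_I + E_X + alpha * E_XB), or 0 if the denominator is 0."""
    e_i = e_x = e_xb = 0
    for (source, target), w in weights.items():
        source_in = package_of[source] == package
        target_in = package_of[target] == package
        if source_in and target_in:
            e_i += w    # dependency inside the package
        elif source_in:
            e_x += w    # dependency leaving the package
        elif target_in:
            e_xb += w   # backwards dependency entering the package
    denominator = e_i + e_x + alpha * e_xb
    return e_i / denominator if denominator > 0 else 0.0


package_of = {
    "A": "package1", "B": "package1", "C": "package1",
    "D": "package2", "E": "package2", "F": "package2",
    "H": "package3", "I": "package3", "J": "package3",
}
# D depends on A, F depends on D, I depends on E twice.
weights = {("D", "A"): 1, ("F", "D"): 1, ("I", "E"): 2}

cohesions = {
    p: package_cohesion(weights, package_of, p)
    for p in sorted(set(package_of.values()))
}
average = sum(cohesions.values()) / len(cohesions)
print(cohesions, average)
```

For package2 this gives $E_I = 1$, $E_X = 1$ and $E_{XB} = 2$, hence $Q_c = 1/(1 + 1 + 0.5 \cdot 2) = 1/3$; the other two packages have no internal edges and score 0.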
$E_X$ is the sum of the weights of the edges sent from all classes in this package to classes in other packages; $E_{XB}$ is the sum of the weights of the edges sent from all classes outside this package to classes inside this package. $\alpha$ is an empirical scale factor. Since the influence of backwards dependencies on a class is significantly smaller than that of its own dependencies, in this paper we tentatively let $\alpha = 0.5$ according to experience. Algorithm 1 calculates the package cohesion.
We now turn to the complexity analysis. In Algorithm 1, let $N_{p_1}$ be the number of classes of $p_1$; the algorithm executes a nested loop. The outer loop runs $N_{p_1}$ times, and the inner loops run over all nodes (classes), $2N$ times in total. Therefore, the complexity of Algorithm 1 is $O(N_{p_1} N)$. When performing Algorithm 1 on all packages, the total complexity is $O((N_{p_1} + N_{p_2} + \cdots + N_{p_x}) \cdot N)$. Since $N_{p_1} + N_{p_2} + \cdots + N_{p_x} = N$, the total complexity is $O(N^2)$. It is worth noting that the total software cohesion is the average of the cohesions of all packages.
3.2 Theoretical Verification of Our Cohesion Metric
We theoretically verify whether our proposed cohesion metric is reasonable. Briand proposed four verification principles for cohesion metrics [9]. Here, we use these four principles to prove the validity of our metric. Proposition 1: Formula (9) satisfies the four verification principles proposed by Briand. Proof: (1) Nonnegativity. In formula (9), $E_I$, $E_X$, and $E_{XB}$ are nonnegative, so $Q_c$ is also nonnegative.
(2) Maximum and minimum. The maximum of $Q_c$ is attained when $E_X + \alpha E_{XB} = 0$ and $E_I > 0$, in which case $Q_c = \frac{E_I}{E_I} = 1$. The minimum of $Q_c$ is attained when $E_I = 0$ and the denominator is positive, in which case $Q_c = 0$.
(3) Monotonicity. When some intra-package edges are added, we denote by $E_{I_n}$ the new number of intra-package edges and by $Q_{cn}$ the new package cohesion. Then
$$Q_{cn} - Q_c = \frac{(E_X + \alpha E_{XB})(E_{I_n} - E_I)}{(E_{I_n} + E_X + \alpha E_{XB})(E_I + E_X + \alpha E_{XB})}.$$
We can see that when $E_{I_n} > E_I$, $Q_{cn} - Q_c \geq 0$, with strict inequality whenever $E_X + \alpha E_{XB} > 0$. Thus the metric satisfies monotonicity.
(4) Cohesive modules. Assume two packages $a$ and $b$, where the classes in package $a$ have no dependencies or backwards dependencies on the classes in package $b$. The cohesions of packages $a$ and $b$ are
$$Q_{ca} = \frac{E_{I_a}}{E_{I_a} + E_{X_a} + \alpha E_{XB_a}}, \qquad Q_{cb} = \frac{E_{I_b}}{E_{I_b} + E_{X_b} + \alpha E_{XB_b}}.$$
Then we merge $a$ and $b$ into a new package $c$. The cohesion of $c$ is
$$Q_{cc} = \frac{E_{I_a} + E_{I_b}}{E_{I_a} + E_{X_a} + \alpha E_{XB_a} + E_{I_b} + E_{X_b} + \alpha E_{XB_b}}.$$
We use $\frac{N_a}{D_a}$ to denote
$$Q_{ca} - Q_{cc} = \frac{E_{I_a}(E_{X_b} + \alpha E_{XB_b}) - E_{I_b}(E_{X_a} + \alpha E_{XB_a})}{(E_{I_a} + E_{X_a} + \alpha E_{XB_a} + E_{I_b} + E_{X_b} + \alpha E_{XB_b})(E_{I_a} + E_{X_a} + \alpha E_{XB_a})},$$
and $\frac{N_b}{D_b}$ to denote
$$Q_{cb} - Q_{cc} = \frac{E_{I_b}(E_{X_a} + \alpha E_{XB_a}) - E_{I_a}(E_{X_b} + \alpha E_{XB_b})}{(E_{I_a} + E_{X_a} + \alpha E_{XB_a} + E_{I_b} + E_{X_b} + \alpha E_{XB_b})(E_{I_b} + E_{X_b} + \alpha E_{XB_b})}.$$
Obviously, $D_a, D_b > 0$ and $N_a = -N_b$; therefore $Q_{ca} - Q_{cc} \geq 0$ or $Q_{cb} - Q_{cc} \geq 0$ holds. This means $Q_{cc} \leq \max\{Q_{ca}, Q_{cb}\}$: the cohesion of the merged package is not larger than the larger of the cohesions of the two original packages. In summary, we have proved that our cohesion metric satisfies Briand’s four principles.
3.3 Refactoring Algorithm
According to our cohesion metric, we propose a refactoring algorithm to optimize software structures. We build on the idea of the well-known CNM community detection algorithm. The original CNM algorithm treats each node as a single community, and then merges communities iteratively to increase modularity until the modularity no longer increases. However, it does not fit software networks, since software natively has a package structure in its code. So we propose a greedy
algorithm based on the original package structure. Each time, we consider moving a class to a package whose classes have dependencies with that class, calculate the average package cohesion, and pick the maximum. We repeat this process and finish refactoring when all classes are visited.
Algorithm 2. Refactoring algorithm based on our cohesion metric
Input: adjacency matrix $A$; communities $C$; vertex number $N$.
Output: $Q_{start}$, $Q_{refactored}$ and a set of refactoring suggestions.
1: Calculate the original average package cohesion $Q_{start}$ according to Algorithm 1.
2: while there is an unvisited class do
3:     Select an unvisited class $A$, and set $Q_{max} = 0$.
4:     for each package (the traversing package) do
5:         Set $D_n$ = the number of dependencies of $A$ on the traversing package.
6:         Set $D_o$ = the number of dependencies of $A$ on the original package.
7:         Set $D_{nb}$ = the corresponding backwards dependencies on the traversing package.
8:         Set $D_{ob}$ = the corresponding backwards dependencies on the original package.
9:         if ($D_n - D_o > threshold_1$ || $D_{nb} - D_{ob} > threshold_2$ || ($D_n \geq 1$ && $D_{nb} \geq 1$)) then
10:            Calculate the average cohesion $Q_o$ of the original package and the traversing package.
11:            Move $A$ to the traversing package and update $C$.
12:            Recalculate the average cohesion $Q_r$ of the original package and the traversing package.
13:            if $Q_r > Q_{max}$ then
14:                $Q_{max} = Q_r$, mark $p_{max}$ as the traversing package.
15:            Move $A$ back to the original package and update $C$.
16:    Set $A$ as visited.
17:    if $Q_{max} > 0$ then
18:        Move $A$ to package $p_{max}$ and update $C$.
19:        Set every class that depends on $A$ as unvisited.
20: Calculate the average package cohesion $Q_{refactored}$ and return.
Note 1: The thresholds are tuned empirically. When the number of classes is large, the thresholds are relatively large, and vice versa. In our experiments, threshold1 is 2 and threshold2 is 3. Algorithm 2 terminates when no class movement increases the software package cohesion. During class movement, only the cohesions of the source and destination packages change, so we consider the cohesion change for these two packages only. The main time-consuming part of Algorithm 2 is the while-loop, from the 2nd line to the 20th line. Let $N$ be the number of classes and $N_p$ the number of packages. For each package, we use Algorithm 1 to calculate its cohesion, so the complexity of the for-loop at the 4th line is $O(N^2 N_p)$.
Therefore, the total complexity of Algorithm 2 is $O(N^3 N_p)$. Since this is a typical greedy algorithm, the result may be only a local optimum. However, the average package cohesion increases monotonically during the moving process, so the correctness of the refactoring algorithm can be guaranteed.
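The greedy loop of Algorithm 2 can be sketched in Python as follows. This is only an illustration under several assumptions that are not taken from the paper: `package_cohesion` is a hypothetical stand-in for Algorithm 1 (the share of a package's dependencies that stay inside it), the three tests on line 9 are assumed to be joined by logical OR, a move is accepted only when it raises the average cohesion of the two affected packages (in line with the termination rule in Note 1), and `max_steps` is an added safety guard.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (class, class it depends on)


def package_cohesion(members: Set[str], deps: Set[Edge]) -> float:
    """Hypothetical stand-in for Algorithm 1: the share of dependencies
    touching a package that stay inside it."""
    touching = [(a, b) for a, b in deps if a in members or b in members]
    if not touching:
        return 0.0
    inside = sum(1 for a, b in touching if a in members and b in members)
    return inside / len(touching)


def refactor(pkg_of: Dict[str, str], deps: Set[Edge],
             t1: int = 2, t2: int = 3, max_steps: int = 10_000):
    """Greedy class moves; returns the new class-to-package map and the moves."""
    pkg_of = dict(pkg_of)  # work on a copy
    pkgs: Dict[str, Set[str]] = {}
    for cls, pkg in pkg_of.items():
        pkgs.setdefault(pkg, set()).add(cls)

    def pair_cohesion(p: str, q: str) -> float:
        # Only the source and destination packages change, so average those two.
        return (package_cohesion(pkgs[p], deps) + package_cohesion(pkgs[q], deps)) / 2

    unvisited = set(pkg_of)
    moves: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):  # guard against endless revisiting
        if not unvisited:
            break
        cls = min(unvisited)  # deterministic choice of the next class
        unvisited.discard(cls)
        home = pkg_of[cls]
        out = {b for a, b in deps if a == cls}  # classes that cls depends on
        inc = {a for a, b in deps if b == cls}  # classes that depend on cls
        d_o, d_ob = len(out & pkgs[home]), len(inc & pkgs[home])
        best_gain, best_pkg = 0.0, None
        for pkg in sorted(pkgs):
            if pkg == home:
                continue
            d_n, d_nb = len(out & pkgs[pkg]), len(inc & pkgs[pkg])
            # Assumption: the three tests on line 9 are joined by logical OR.
            if not (d_n - d_o > t1 or d_nb - d_ob > t2 or (d_n >= 1 and d_nb >= 1)):
                continue
            q_o = pair_cohesion(home, pkg)
            pkgs[home].remove(cls); pkgs[pkg].add(cls)  # tentative move
            q_r = pair_cohesion(home, pkg)
            pkgs[pkg].remove(cls); pkgs[home].add(cls)  # move back
            if q_r - q_o > best_gain:
                best_gain, best_pkg = q_r - q_o, pkg
        if best_pkg is not None:
            pkgs[home].remove(cls); pkgs[best_pkg].add(cls)
            pkg_of[cls] = best_pkg
            moves.append((cls, home, best_pkg))
            unvisited |= inc & set(pkg_of)  # dependants of cls are revisited
    return pkg_of, moves


# Toy system: class "x" sits in package A but depends only on classes of B.
toy_pkg_of = {"a1": "A", "a2": "A", "x": "A", "b1": "B", "b2": "B", "b3": "B"}
toy_deps = {("a1", "a2"), ("a2", "a1"), ("b1", "b2"), ("b2", "b3"), ("b3", "b1"),
            ("x", "b1"), ("x", "b2"), ("x", "b3")}
new_pkg_of, suggestions = refactor(toy_pkg_of, toy_deps)
```

In the toy system the only suggestion is to move `x` from package A to package B; the stand-in cohesion function would have to be replaced by the paper's Algorithm 1 to reproduce the reported results.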
4 Experiment and Analysis

4.1 Refactoring and Analysis
Our experiment environment is a computer with an i5-8400 CPU, 16 GB of DDR4 memory, and Windows 10. We selected 10 open-source software systems for refactoring verification. The software statistics and cohesion results are listed in Table 1. The last two columns show that the cohesion improves in every case after refactoring by Algorithm 2.

Table 1. Refactoring results of multiple Java software systems

Name                   PN  CN   EN    NP  RCN  WFR  CB     CA
Ant 1.9.9              58  998  5169  2   671  51   0.223  0.251
Cglibnodep 3.2.6       9   202  914   1   151  8    0.274  0.327
Emma 2.0.5313          11  143  574   0   143  17   0.327  0.386
Hsqldb 2.4.0           21  553  4519  2   311  27   0.288  0.320
Jaxen 1.1.6            16  204  947   0   204  18   0.315  0.447
Jgroups 4.0.10         31  859  4454  2   460  60   0.230  0.292
Ormlite 5.0            11  176  938   0   176  23   0.263  0.353
PDFRenderer 1.0.5      9   154  548   0   154  14   0.357  0.396
RabbitMQ Client 5.0.0  7   402  1661  1   189  4    0.492  0.509
Tomcat 9.0.1           42  663  1887  2   522  21   0.413  0.451

PN: Package number; CN: Class number; EN: Edge number; NP: Neglected packages; RCN: Rest class number; WFR: Waiting for refactoring; CB: Cohesion before; CA: Cohesion after
We also compare our metric with Newman modularity and Abdeen package cohesion. Table 2 shows the cohesion difference under the three methods. In most cases, the cohesion differences under all three methods are positive; only the Newman modularity of RabbitMQ is slightly reduced, by 0.001, and the Abdeen cohesion of Jgroups is reduced by 0.003. We calculate the correlation of our metric with Newman modularity and with Abdeen cohesion. The Pearson correlation coefficient between our metric and Newman modularity is 0.840, with a significance level of 0.001. The Pearson correlation coefficient between our metric and the Abdeen cohesion metric is 0.720, with a significance level of 0.009. Both correlations are strong and significant, which shows that our proposed cohesion metric is reasonable. The strong correlation with Newman modularity also indicates that, in some cases, the proposed metric can be used in place of Newman modularity in software measurement. Our metric thus builds on the work of Newman and Abdeen and shows good rationality and stability.
Table 2. The impact of refactoring on other metrics

Name             NM (Before/After/Diff)  ACM (Before/After/Diff)  Ours (Before/After/Diff)
Ant              0.250/0.298/+0.048      0.260/0.298/+0.038       0.223/0.251/+0.029
Cglibnodep       0.291/0.356/+0.065      0.322/0.396/+0.074       0.270/0.327/+0.058
Emma             0.348/0.488/+0.140      0.489/0.493/+0.004       0.327/0.386/+0.058
Hsqldb           0.235/0.255/+0.020      0.376/0.402/+0.026       0.288/0.320/+0.032
Jaxen            0.303/0.505/+0.202      0.384/0.506/+0.122       0.315/0.447/+0.133
Jgroups          0.228/0.259/+0.031      0.399/0.397/−0.003       0.230/0.292/+0.061
Ormlite          0.265/0.341/+0.076      0.395/0.449/+0.055       0.262/0.353/+0.092
PDF renderer     0.292/0.312/+0.020      0.422/0.443/+0.021       0.357/0.396/+0.040
RabbitMQ Client  0.287/0.286/−0.001      0.744/0.756/+0.012       0.492/0.509/+0.018
Tomcat           0.579/0.613/+0.033      0.536/0.579/+0.043       0.414/0.451/+0.037

NM: Newman's modularity; ACM: Abdeen cohesion; Ours: Our cohesion
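The correlation analysis can be illustrated with a short Python sketch. The text does not state which columns were correlated; the sketch assumes the Diff columns of Table 2 were used, an assumption that appears to give coefficients close to the reported 0.840 and 0.720. The function name `pearson` is illustrative.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Diff columns of Table 2, in row order (Ant ... Tomcat).
diff_ours = [0.029, 0.058, 0.058, 0.032, 0.133, 0.061, 0.092, 0.040, 0.018, 0.037]
diff_nm = [0.048, 0.065, 0.140, 0.020, 0.202, 0.031, 0.076, 0.020, -0.001, 0.033]
diff_acm = [0.038, 0.074, 0.004, 0.026, 0.122, -0.003, 0.055, 0.021, 0.012, 0.043]

r_nm = pearson(diff_ours, diff_nm)    # our metric vs. Newman modularity
r_acm = pearson(diff_ours, diff_acm)  # our metric vs. Abdeen cohesion
```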
4.2 Randomly Disturb and Recover
In this section, we conduct another experiment with disturbing and recovering steps to verify the validity and efficiency of our metric. For a software system, we first deem it a “PERFECT” system with a good configuration of classes and packages. Then we randomly disturb one package, that is, a certain proportion of the classes of a package are randomly put into other packages. In the verification step, we run the refactoring algorithm on the disturbed software and check whether it can find the disturbed classes and place them correctly back into their original package. The correct rate of recovering is calculated as $P = N_{recovered} / N_{disturbed}$, where $N_{recovered}$ represents the number of disturbed classes recovered by the algorithm, and $N_{disturbed}$ represents the total number of disturbed classes. In our experiments, we apply the disturb-recover process to 10 open-source Java software systems. For each system, we repeat the process 100 times. Finally, we compare our method with Pan's method [17].

Table 3. Results of disturbing and recovering

Name             NDC  NRC  LRP(%)  URP(%)  RP(%)  TC(s)  PRP(%)  PTC(s)
Ant              35   28   77.7    82.9    80.0   1802   −−      −−
Cglibnodep       8    6    66.3    82.5    74.0   12     79.0    169
Emma             11   10   84.5    94.5    88.7   7      82.2    154
Hsqldb           17   14   76.5    84.1    80.9   268    84.5    5962
Jaxen            14   9    60.0    75.0    69.5   24     72.9    574
Jgroups          32   23   70.0    78.4    74.0   574    −−      −−
Ormlite          13   9    63.1    73.8    67.2   16     79.5    213
PDF renderer     11   9    73.6    88.2    80.4   7      85.1    115
RabbitMQ Client  14   12   77.1    87.1    81.2   32     85.3    650
Tomcat           30   26   82.3    87.3    84.9   349    85.0    5795

NDC: Number of disturbed classes; NRC: Number of recovered classes; LRP: Minimal percentage of recovering; URP: Maximal percentage of recovering; RP: Recovering percentage; TC: Average time consumption; PRP: Pan's recovering percentage; PTC: Pan's average time consumption; −− means no result in 2 h
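The disturb step and the recovery rate $P$ can be sketched as follows. The helper names `disturb` and `recovery_rate` and the dictionary representation of the class-to-package assignment are illustrative assumptions, not the authors' implementation; the refactoring step itself is left out.

```python
import random
from typing import Dict, List, Tuple


def disturb(pkg_of: Dict[str, str], package: str, fraction: float,
            rng: random.Random) -> Tuple[Dict[str, str], List[str]]:
    """Move a fraction of the classes of `package` into other, randomly chosen packages."""
    members = sorted(c for c, p in pkg_of.items() if p == package)
    others = sorted({p for p in pkg_of.values() if p != package})
    moved = rng.sample(members, max(1, round(fraction * len(members))))
    disturbed = dict(pkg_of)
    for cls in moved:
        disturbed[cls] = rng.choice(others)
    return disturbed, moved


def recovery_rate(original: Dict[str, str], recovered: Dict[str, str],
                  moved: List[str]) -> float:
    """P = N_recovered / N_disturbed over the classes that were disturbed."""
    n_recovered = sum(recovered[c] == original[c] for c in moved)
    return n_recovered / len(moved)


# Toy example: disturb half of package A, then pretend one class was recovered.
original = {"a1": "A", "a2": "A", "a3": "A", "a4": "A", "b1": "B", "b2": "B"}
disturbed, moved = disturb(original, "A", 0.5, random.Random(1))
partly_recovered = dict(disturbed)
partly_recovered[moved[0]] = "A"
rate = recovery_rate(original, partly_recovered, moved)
```

Repeating the disturb and recover steps many times and averaging `rate` would correspond to the RP column, while the minimum and maximum over the repetitions would correspond to LRP and URP.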
Table 3 shows that the refactoring algorithm has a good probability of recovering the disturbed classes. In addition, the fluctuation of the recovering rate is small, indicating that the algorithm is stable. This supports the claim that our proposed cohesion metric can effectively measure and reflect the software structure. We also see that our method and Pan's method perform similarly in recovering percentage, but our method takes far less time.
5 Conclusion
Nowadays, most research focuses on cohesion metrics at the level of classes and methods, and few researchers combine complex networks with software refactoring. In this paper, we take advantage of complex network methods and incorporate backwards dependencies into class dependency relations. A new metric based on complex networks and its refactoring algorithm are proposed. After theoretical verification and software verification, the metric and its refactoring algorithm are shown to measure the software structure correctly and effectively.
References
1. Tom, E., Aurum, A., Vidgen, R.: An exploration of technical debt. J. Syst. Softw. 86(6), 1498–1516 (2013)
2. Wang, L., Yu, P., Wang, Z., Yang, C., Ye, Q.: On the evolution of Linux kernels: a complex network perspective. J. Softw. Evol. Process 25(5), 439–458 (2013)
3. Myers, C.R.: Software systems as complex networks: structure, function, and evolvability of software collaboration graphs. Phys. Rev. E 68(4), 046116 (2003)
4. Fowler, M.: Refactoring: improving the design of existing code. In: 11th European Conference. Jyväskylä, Finland (1997)
5. Chidamber, S.R., Kemerer, C.F.: A metrics suite for object oriented design. IEEE Trans. Softw. Eng. 20(6), 476–493 (1994)
6. Harrison, R., Counsell, S.J., Nithi, R.V.: An evaluation of the MOOD set of object-oriented software metrics. IEEE Trans. Softw. Eng. 24(6), 491–496 (1998)
7. Bieman, J.M., Kang, B.K.: Cohesion and reuse in an object-oriented system. In: ACM SIGSOFT Software Engineering Notes, vol. 20, no. SI, pp. 259–262 (1995)
8. Briand, L.C., Morasca, S., Basili, V.R.: Defining and validating measures for object-based high-level design. IEEE Trans. Softw. Eng. 25(5), 722–743 (1999)
9. Briand, L.C., Morasca, S., Basili, V.R.: Property-based software engineering measurement. IEEE Trans. Softw. Eng. 22(1), 68–86 (1996)
10. Counsell, S., Mendes, E., Swift, S.: Comprehension of object-oriented software cohesion: the empirical quagmire. In: Proceedings 10th International Workshop on Program Comprehension, pp. 33–42. IEEE (2002)
11. Badri, L., Badri, M., Toure, F.: Exploring empirically the relationship between lack of cohesion and testability in object-oriented systems. In: International Conference on Advanced Software Engineering and Its Applications, pp. 78–92. Springer (2010)
12. Misic, V.B.: Cohesion is structural, coherence is functional: different views, different measures. In: Proceedings of Seventh International Conference on Software Metrics Symposium, METRICS 2001, pp. 135–144. IEEE (2001)
13. Abdeen, H., Ducasse, S., Sahraoui, H., Alloui, I.: Automatic package coupling and cycle minimization. In: 2009 16th Working Conference on Reverse Engineering, WCRE 2009, pp. 103–112. IEEE (2009)
14. Gupta, V., Chhabra, J.K.: Package level cohesion measurement in object-oriented software. J. Braz. Comput. Soc. 18(3), 251–266 (2012)
15. Shen, P., Chen, L.: Complex network analysis in Java application systems. J. East Chin. Normal Univ. 38–51 (2017)
16. Newman, M.E., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
17. Pan, W., Li, B., Jiang, B., Liu, K.: Recode: software package refactoring via community detection in bipartite software networks. Adv. Complex Syst. 17(07n08), 1450006 (2014)
A Generalized Framework for Detecting Social Network Communities by the Scanning Method

Tai-Chi Wang1 and Frederick Kin Hing Phoa2(B)

1 National Center for High-Performance Computing, Hsinchu, Taiwan
2 Academia Sinica, Taipei, Taiwan
[email protected]
http://phoa.stat.sinica.edu.tw:10080
Abstract. With the popularity of social media, recognizing and analyzing social network patterns have become important issues. A society offers a wide variety of possible communities, such as schools, families, firms and many others. The study and detection of these communities have been popular among business and social science researchers. Under the Poisson random graph assumption, the scan statistics have been verified as a useful tool to determine the statistical significance of both structure and attribute clusters in networks. However, the Poisson random graph assumption may not be fulfilled in all networks. In this paper, we first generalize the scan statistics by considering the individual diversity of each edge. Then we construct the random connection probability model and the logit model, and demonstrate the effectiveness of the generalized method. Simulation studies show that the generalized method has better detection when compared to the existing methods.

Keywords: Network analysis · Graphical models · Empirical likelihood methods

1 Introduction
The growth of big data and the popularity of social media have driven a research migration toward social network structural recognition and analysis. One of the most essential features of social networks is the community structure; communities are defined as groups of vertices that share common properties or play similar roles in a graph or network [6]. We can understand part of how a society functions by understanding these communities. The study of these social communities is of interest in many fields, such as e-commerce [17,23]. When the boundaries of communities are defined, we are able to classify the nodes of a network into different communities [3,8]. Therefore, community detection methodologies have drawn much attention from researchers in different fields.

Community detection methods are generally designed by comparing the similarity within groups and analyzing the difference between the inside and the outside of the groups. A popular method is the modularity-based method [20], which uses a modularity measure to evaluate the similarities and connections of groups. Several extended methods [9,15,19] have been developed from this criterion. However, modularity optimization suffers from a severe limitation: it fails to find communities that are smaller than a given scale, even when they are very pronounced and easy to observe [7]. Furthermore, the modularity methods do not provide tests of the statistical significance of the detected communities. This has led to rapid progress in developing statistical tests for the significance of the detected communities; see [10,11,14,22,27,28] and many others for details. These existing methods take the difference between a selected group and its unselected counterpart into account, but most of them assume that the nodes of the selected groups follow the same distribution. In practice, many studies have pointed out that networks follow a scale-free model [2] or an exponential random graph [24]. We thus generalize the scan method for community detection by considering the likelihood of all edges under different types of baseline network models.

In this study, we provide a generalized framework of the scan statistic that considers the individual difference of each edge instead of assuming that the number of edges in a subset follows a homogeneous Poisson distribution. If the probabilities of the edges can be properly addressed, we can construct the likelihood of a network, and the scan statistic can then be derived by appropriate formulations. We briefly review the standard framework of the scan statistic in Sect. 2 and introduce its generalized framework in Sect. 3. Simulations are performed in Sect. 4 to verify the detection performance of the proposed method. Finally, a brief conclusion and discussion are given at the end.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 250–261, 2020. https://doi.org/10.1007/9783030366872_21
2 The Standard Framework of Scan Statistics
The scan statistic is a useful tool for finding cluster patterns in many kinds of data, in both the time domain [18] and the spatial domain [13]. [27] first applied this statistic to social networks, and [28] extended the method to consider both attribute and structure clusters. The basic idea of the scan statistic is to divide the studied region or network into a selected part within a scanning window and an unselected part outside it; this provides a systematic mechanism for detecting clusters or communities.

Recall the standard framework of the scan statistic in [28]. Let $G = (V_G, E_G)$ be an undirected graph with vertex set $V_G = \{v_1, \ldots, v_{|V_G|}\}$ and edge set $E_G$, and let the degrees of the vertices be $k = \{k_1, \ldots, k_{|V_G|}\}$. Moreover, we define the total number of edges as $|E_G| = \sum_{i=1}^{|V_G|} k_i / 2$. Then, under the random graph assumption with degree vector $k$, the expected number of edges connecting the node pair $(v_i, v_j)$ is $e_{ij} = k_i k_j / (2|E_G|)$ for $i \neq j$, and $e_{ii} = k_i^2 / (4|E_G|)$. Under the Poisson random graph [5], a scan statistic is used to evaluate the expected numbers of edges between the selected subgraph and its unselected counterpart. Suppose a subgraph $Z$ is selected based on a scanning window. The corresponding $|E_Z|$ and expected $\mu(Z)$ are defined as $k_{V_Z}/2$ and $k_{V_Z}^2 / (4|E_G|)$, respectively. One can apply a likelihood ratio test to evaluate the statistical significance of the selected subgraph when compared to the Poisson random graph. Thus, the likelihood ratio statistic of a selected subgraph $Z$ is

$$LR(Z) = \frac{L_Z}{L_0} = \begin{cases} \left(\dfrac{|E_Z|}{\mu(Z)}\right)^{|E_Z|} \left(\dfrac{|E_G| - |E_Z|}{\mu(G) - \mu(Z)}\right)^{|E_G| - |E_Z|} & \text{if } \hat{\alpha} > \hat{\beta}, \\ 1 & \text{otherwise,} \end{cases} \tag{1}$$

where $\hat{\alpha} = |E_Z| / \mu(Z)$ and $\hat{\beta} = (|E_G| - |E_Z|) / (\mu(G) - \mu(Z))$ are the corresponding maximum likelihood estimates (MLEs) for the selected subgraph $Z$ and the unselected counterpart $Z^c$ under the Poisson random graph model. By scanning the whole region, the test statistic is the one with the maximum logarithmic likelihood ratio, that is,

$$\lambda(\hat{Z}) = \max_{Z} \ln LR(Z). \tag{2}$$
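As an illustration of Eqs. (1) and (2), the following Python sketch evaluates the logarithmic likelihood ratio of candidate subgraphs and keeps the maximum. The edge counts and expectations are taken as inputs, and the candidate list is a hypothetical placeholder for whatever scanning windows are used; the sketch also assumes $Z$ is a proper subgraph, so that $\mu(G) - \mu(Z) > 0$.

```python
from math import log
from typing import Iterable, Tuple


def log_lr(e_z: float, mu_z: float, e_g: float, mu_g: float) -> float:
    """ln LR(Z) of Eq. (1); equals 0 when alpha_hat <= beta_hat."""
    alpha = e_z / mu_z
    beta = (e_g - e_z) / (mu_g - mu_z)
    if alpha <= beta:
        return 0.0
    value = e_z * log(alpha)
    if e_g > e_z:  # skip the 0 * log(0) term when no edges lie outside Z
        value += (e_g - e_z) * log(beta)
    return value


def scan(candidates: Iterable[Tuple[str, float, float]],
         e_g: float, mu_g: float) -> Tuple[str, float]:
    """Eq. (2): return the candidate (label, ln LR) with the largest ln LR."""
    scored = [(label, log_lr(e_z, mu_z, e_g, mu_g)) for label, e_z, mu_z in candidates]
    return max(scored, key=lambda item: item[1])


# Hypothetical windows given as (label, |E_Z|, mu(Z)) in a graph with 20 edges.
windows = [("Z1", 10.0, 5.0), ("Z2", 2.0, 5.0), ("Z3", 6.0, 4.0)]
best_label, best_value = scan(windows, e_g=20.0, mu_g=20.0)
```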
The subgraph $\hat{Z}$ with the maximum $LR(\cdot)$ is identified as a community if the null hypothesis is rejected. Because of the large set of candidate subgraphs, the scanning method often suffers from the multiple testing problem. Monte Carlo testing is one of the best solutions to this problem [13] when developing a testing procedure. [13] suggested generating the Monte Carlo samples for an attribute of interest in the spatial cluster detection problem by randomly permuting the observations across nodes. However, randomly reassigning the edges of a network might not be an appropriate way to adapt this idea to networks. First, because the nodes have different degrees, it is inappropriate to assume that all of them have equal probabilities of being connected. Second, a subgraph with few nodes is likely to admit only a few different combinations of edges. For example, a star graph has only one corresponding graph expression with the fixed degree sequence (Fig. 1(a)). Therefore, we adopt the method provided in [28], which constructs a Monte Carlo graph under the null hypothesis. For example, under the random graph assumption, the expected connection probability of each edge is $e_{ij} = k_i k_j / (2|E_G|)$. This idea is demonstrated in Fig. 1, where the original network is a star graph with 5 nodes (Fig. 1(a)). Only node 1 has degree 4, and the other nodes have degree 1. From the degrees, we can evaluate the connection probability of each edge (Fig. 1(b)) and then generate random graphs based on these connection probabilities (Fig. 1(c)).

This study focuses on determining the community effect by means of the likelihood ratio test. We suggest using the likelihood-based pseudo R-squared [16].
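The Monte Carlo graph construction for the star graph can be sketched as follows. The function names are illustrative, nodes are indexed from 0 (so index 0 is the hub called node 1 in the text), and each edge is assumed to be drawn independently with probability $e_{ij}$.

```python
import random
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]


def connection_probabilities(degrees: List[int]) -> Dict[Pair, float]:
    """e_ij = k_i * k_j / (2|E_G|) for every node pair i < j."""
    two_m = sum(degrees)  # the degrees sum to 2|E_G|
    return {(i, j): degrees[i] * degrees[j] / two_m
            for i, j in combinations(range(len(degrees)), 2)}


def sample_graph(probs: Dict[Pair, float], rng: random.Random) -> Set[Pair]:
    """Draw one Monte Carlo graph: each pair is connected independently."""
    return {pair for pair, p in probs.items() if rng.random() < min(p, 1.0)}


# Star graph with 5 nodes: the hub has degree 4, every leaf has degree 1.
star_probs = connection_probabilities([4, 1, 1, 1, 1])
mc_graph = sample_graph(star_probs, random.Random(0))
```

For the star graph this gives a probability of 0.5 for each hub-leaf pair and 0.125 for each leaf-leaf pair, which are the two values shown on the edges of Fig. 1(b).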
The pseudo R-squared, which plays a role similar to that of $R^2$ in ordinary linear regression, is used to assess the goodness of fit in logistic regression analysis and other likelihood-based models. Based on the deviance measure $D$, the likelihood-based pseudo R-squared, $R_L^2$, is expressed as

$$R_L^2 = \frac{D_{null} - D_{fitted}}{D_{null}}, \tag{3}$$

where $D_{null} = -2 \log(\text{likelihood of the null model})$ and $D_{fitted} = -2 \log(\text{likelihood of the fitted model})$. Maximizing $R_L^2$ is equivalent to minimizing $D_{fitted}$ and maximizing the likelihood of the fitted model. No matter what model we use to fit or test the communities, this criterion can be used to verify the effects of communities in social networks.

Fig. 1. Example of Monte Carlo graph: (a) star graph; (b) complete graph; (c) random graph.

The same testing algorithm is used to evaluate the newly generated data and to determine the Monte Carlo p-value. Suppose a simulation with a large number of iterations, such as 99 or 999, is executed. The Monte Carlo p-value with $R$ runs is computed as

$$p = \frac{\#\{R_L^2(MC_r) \geq R_L^2(obs)\} + 1}{R + 1}, \tag{4}$$

where $R_L^2(obs)$ is the $R_L^2$ for the observed data, and $R_L^2(MC_r)$ is the $R_L^2$ for the $r$th Monte Carlo data set. The p-value estimates the probability of finding values at least as extreme as the observed value. If the p-value is smaller than a prespecified criterion (e.g., 0.05), the community is declared statistically significant. That is to say, the observed value is unlikely to occur under the null hypothesis.
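Equations (3) and (4) translate directly into code. The sketch below assumes the deviances and the simulated $R_L^2$ values have already been obtained by fitting the chosen model; the function names are illustrative.

```python
from typing import Sequence


def pseudo_r2(d_null: float, d_fitted: float) -> float:
    """Likelihood-based pseudo R-squared of Eq. (3)."""
    return (d_null - d_fitted) / d_null


def monte_carlo_p_value(observed: float, simulated: Sequence[float]) -> float:
    """Monte Carlo p-value of Eq. (4) from R simulated statistics."""
    runs = len(simulated)
    at_least_as_large = sum(1 for value in simulated if value >= observed)
    return (at_least_as_large + 1) / (runs + 1)


# Hypothetical numbers: 99 simulated statistics, 4 of which reach the observed one.
r2_observed = pseudo_r2(d_null=200.0, d_fitted=150.0)
r2_simulated = [0.30, 0.28, 0.26, 0.25] + [0.05] * 95
p_value = monte_carlo_p_value(r2_observed, r2_simulated)
```

With $R = 99$ runs the smallest attainable p-value is $1/100$, which is why run counts such as 99 or 999 are convenient.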
3 The Generalized Likelihood-Based Framework
In [28], the methods are constructed under the Poisson random graph assumption. By assuming a homogeneous likelihood within the subgraphs, the test statistics are used to evaluate the likelihoods between different subgraphs. However, the homogeneity assumption may not always hold in practice, so we have to consider the diversity inside the subgraphs. In this study, we take the individual difference of each edge into account, which means that we assume each edge has its own likelihood/probability depending on the nature of the network. Following this idea, suppose the network of interest is $G$. The joint likelihood of all edges in the network is expressed as

$$L(G) = \prod_{(i,j) \in E_G} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}, \tag{5}$$

where $(i, j)$ is the edge of the node pair $v_i$ and $v_j$, $E_G$ is the set of edges in $G$, $p_{ij}$ is the probability that edge $(i, j)$ exists, and $y_{ij} = 1$ if the edge $(i, j)$ occurs in $G$ and 0 otherwise.
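A minimal sketch of the joint likelihood in Eq. (5), computed on the log scale to avoid numerical underflow. It assumes the product runs over all candidate node pairs, that every $p_{ij}$ lies strictly between 0 and 1, and that the observed edges are given as a set of pairs; these representation choices are illustrative.

```python
from math import log
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]


def log_likelihood(probs: Dict[Pair, float], edges: Set[Pair]) -> float:
    """log L(G) = sum over pairs of y_ij*log(p_ij) + (1 - y_ij)*log(1 - p_ij)."""
    total = 0.0
    for pair, p in probs.items():
        total += log(p) if pair in edges else log(1.0 - p)
    return total


def deviance(probs: Dict[Pair, float], edges: Set[Pair]) -> float:
    """D = -2 log L(G)."""
    return -2.0 * log_likelihood(probs, edges)


# Three node pairs; two of them are observed as edges.
pair_probs = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.125}
observed_edges = {(0, 1), (0, 2)}
ll = log_likelihood(pair_probs, observed_edges)
```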
A community is defined as a group of nodes that have more connections inside the group than to nodes outside the group. To verify the existence of a community, we select a subgraph $Z$ from $G$ based on the scan statistic described in Sect. 2. The original network $G$ is divided into two parts: the selected subgraph $Z$ and its unselected counterpart $Z^c$. The joint likelihood based on this separation is

$$L(Z, Z^c) = \prod_{(i,j) \in E(Z)} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}} \prod_{(i,j) \in E(Z^c)} p_{ij}^{y_{ij}} (1 - p_{ij})^{1 - y_{ij}}, \tag{6}$$

where $E(Z)$ and $E(Z^c)$ are the sets of edges in $Z$ and $Z^c$, respectively, and the other notation is the same as in Eq. (5). Based on this likelihood, we determine the difference between $Z$ and $Z^c$ by specifying different forms of the probability $p_{ij}$. If the connection probability of each edge is properly specified, communities can be verified in this likelihood-based manner. We now demonstrate how to apply this notion to two different network construction regimes: the random connection probability model (Sect. 3.1) and the logit model (Sect. 3.2).

3.1 Random Connection Probability Model
We introduce the connection probabilities of some random network models in this section. Suppose a random graph has a given degree sequence $k = (k_1, \ldots, k_n)$ for a vector of nodes $v = (v_1, \ldots, v_n)$. We let $p_{ij}$ denote the probability of having an edge between $v_i$ and $v_j$, which is proportional to the product $k_i k_j$. Specifically, $p_{ij}$ is expressed as $p_{ij} = (k_i k_j) / (2|E_G|)$ for $i \neq j$. Since there is no unified terminology for this probability, we call it the random connection probability (RCP) here. We consider RCP as the baseline model and apply it to the generalized framework. Suppose a subgraph $Z$ is selected and all edges are independent but governed by the subgraph they belong to. To capture the difference between the selected subgraph and its unselected counterpart, we introduce a parameter $\gamma$ for the connection probabilities within $Z$ and another parameter $\eta$ for the connection probabilities within $Z^c$. The parameters $\gamma$ and $\eta$ can be treated as the strengths of connection in $Z$ and $Z^c$, respectively. Then Eq. (6) becomes

$$L(Z, Z^c) = \prod_{(i,j) \in Z} (\gamma p_{ij})^{y_{ij}} (1 - \gamma p_{ij})^{1 - y_{ij}} \prod_{(i,j) \in Z^c} (\eta p_{ij})^{y_{ij}} (1 - \eta p_{ij})^{1 - y_{ij}}. \tag{7}$$

As mentioned for Eq. (3), maximizing $R_L^2$ is equivalent to maximizing the joint likelihood. To obtain $D_{fitted}$, we search for the maximum likelihood estimates of $\gamma$ and $\eta$. Under the independence assumption, these two parameters are estimated separately. Partially differentiating the logarithm of Eq. (7) with respect to $\gamma$ and $\eta$ gives

$$\frac{\partial \log(L(Z))}{\partial \gamma} = \sum_{(i,j) \in Z} \frac{y_{ij} - \gamma p_{ij}}{\gamma (1 - \gamma p_{ij})} \quad \text{and} \quad \frac{\partial \log(L(Z^c))}{\partial \eta} = \sum_{(i,j) \in Z^c} \frac{y_{ij} - \eta p_{ij}}{\eta (1 - \eta p_{ij})}. \tag{8}$$
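Setting the derivative in Eq. (8) to zero generally has no closed-form solution, so a numerical root search is one option. The sketch below uses plain bisection on the score function for $\gamma$ (the same routine applies to $\eta$); the function names, the tolerance and the handling of boundary cases are assumptions made for illustration, and the constraint $\hat{\gamma} \geq \hat{\eta}$ is not enforced here.

```python
from typing import List, Tuple

Obs = Tuple[float, int]  # (baseline probability p_ij, indicator y_ij)


def score(gamma: float, pairs: List[Obs]) -> float:
    """Derivative of the log-likelihood with respect to gamma, as in Eq. (8)."""
    return sum((y - gamma * p) / (gamma * (1.0 - gamma * p)) for p, y in pairs)


def mle_strength(pairs: List[Obs], tol: float = 1e-10) -> float:
    """Root of the score by bisection, keeping gamma * p_ij below 1 for all pairs."""
    if not any(y for _, y in pairs):
        return 0.0  # no observed edge: the likelihood is maximized at the boundary
    lo = 1e-12
    hi = (1.0 - 1e-9) / max(p for p, _ in pairs)
    if score(hi, pairs) >= 0.0:
        return hi  # the likelihood is still increasing at the upper bound
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid, pairs) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# With a common baseline probability the estimate is (observed edge rate) / p.
homogeneous = [(0.2, 1), (0.2, 1), (0.2, 0), (0.2, 0)]
gamma_hat = mle_strength(homogeneous)
```

Bisection is sufficient here because each term of the log-likelihood is concave in $\gamma$, so the score is decreasing and has at most one root in the admissible interval.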
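The maximum likelihood estimates (MLEs) $\hat{\gamma}$ and $\hat{\eta}$ are obtained by setting the derivatives in Eq. (8) to zero, so $D_{fitted}$ with $\hat{\gamma}$ and $\hat{\eta}$ is

$$\begin{aligned} D_{fitted}(Z, Z^c) &= -2 \log L(Z, Z^c) \\ &= \sum_{(i,j) \in Z} \left( y_{ij} \log\frac{\hat{\gamma} p_{ij}}{1 - \hat{\gamma} p_{ij}} + \log(1 - \hat{\gamma} p_{ij}) \right) \\ &\quad + \sum_{(i,j) \in Z^c} \left( y_{ij} \log\frac{\hat{\eta} p_{ij}}{1 - \hat{\eta} p_{ij}} + \log(1 - \hat{\eta} p_{ij}) \right) \\ &\quad - \sum_{(i,j) \in G} \left( y_{ij} \log\frac{p_{ij}}{1 - p_{ij}} + \log(1 - p_{ij}) \right). \end{aligned} \tag{9}$$

We restrict $\hat{\gamma} \geq \hat{\eta}$ and $0 \leq \hat{p}_{ij} \leq 1$ for all $(i, j)$, where $\hat{p}_{ij} = \hat{\gamma} p_{ij}$ for $(i, j) \in Z$ and $\hat{p}_{ij} = \hat{\eta} p_{ij}$ for $(i, j) \in Z^c$, since communities are defined to have higher connection probabilities. On the other hand, $D_{null}$ is determined by the null hypothesis; in other words, there is no community effect, with $p_{ij} = k_i k_j / (2|E_G|)$ and $\gamma = \eta = 1$. Then $D_{null}$ is obtained from the likelihood in Eq. (5).

3.2 Logit Model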
The exponential random graph is another useful regime for constructing a network. Under this random graph, the connection probability follows an exponential function [24], that is,

$$\Pr(Y = y) = \frac{1}{\kappa} \exp\Big\{ \sum_{A} \xi_A g_A(y) \Big\}, \tag{10}$$

where $A$ ranges over the configurations of $G$, $\xi_A$ is the parameter of configuration $A$, and $g_A(y) = 1$ if the configuration is observed in the network $y$. In this study, we consider a simple case of the exponential random graph model, where the connection probability of edge $(i, j)$ is related to the configuration of nodes $i$ and $j$. It is equivalent to the logit model described in [25], and the model is expressed as

$$\text{logit}(p(y_{ij} = 1)) = \alpha_i + \alpha_j, \tag{11}$$

where $\alpha_i$ and $\alpha_j$ are the effects of these two nodes. We can estimate $\alpha$ with this logit model from the observed network. Denote the parameter estimates by $\hat{\alpha}_0$; the corresponding deviance is then $D_{null} = -2 \log L(\hat{\alpha}_0)$.

When the community effect is considered, we follow a similar idea to the scanning method. Suppose a subgraph $Z$ is selected. We add a community effect to Eq. (11), which can be expressed as

$$\text{logit}(p(y_{ij} = 1 \mid Z, Z^c)) = \alpha_i + \alpha_j + \delta \times I_{ij}, \tag{12}$$

where $I_{ij} = 1$ when $(i, j) \in Z$ and 0 otherwise, and $\delta$ is called the community effect. We can also estimate $\delta$ from the observed network and the selected subgraph.
Denote the parameter estimates by $\{\hat{\alpha}_z, \hat{\delta}\}$; the corresponding deviance is then $D_{fitted} = -2 \log L(\hat{\alpha}_z, \hat{\delta})$. In general, we suggest estimating the logit model with a sparse matrix; the computing time for estimation is approximately 5 times shorter than with indicator functions.
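To make Eqs. (11) and (12) concrete, the sketch below evaluates the edge probabilities and the deviance for given parameter values. It does not estimate $\alpha$ or $\delta$ (that would require fitting a logistic regression), and it assumes that $(i, j) \in Z$ means both endpoints belong to the selected node set; the function names are illustrative.

```python
from math import exp, log
from typing import AbstractSet, Sequence, Tuple

Pair = Tuple[int, int]


def edge_probability(alpha_i: float, alpha_j: float,
                     delta: float = 0.0, in_z: bool = False) -> float:
    """Inverse logit of alpha_i + alpha_j + delta * I_ij, as in Eq. (12)."""
    linear = alpha_i + alpha_j + (delta if in_z else 0.0)
    return 1.0 / (1.0 + exp(-linear))


def logit_deviance(alpha: Sequence[float], edges: AbstractSet[Pair],
                   delta: float = 0.0, z: AbstractSet[int] = frozenset()) -> float:
    """-2 log-likelihood over all node pairs i < j for given parameters."""
    n = len(alpha)
    log_lik = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = edge_probability(alpha[i], alpha[j], delta, i in z and j in z)
            log_lik += log(p) if (i, j) in edges else log(1.0 - p)
    return -2.0 * log_lik


# Three nodes with zero node effects and a single observed edge inside Z = {0, 1}.
d_null = logit_deviance([0.0, 0.0, 0.0], {(0, 1)})
d_fitted = logit_deviance([0.0, 0.0, 0.0], {(0, 1)}, delta=1.0, z={0, 1})
r2 = (d_null - d_fitted) / d_null
```

A positive community effect on a pair that is actually connected lowers the deviance, which is what raises the pseudo R-squared of Eq. (3).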
4 Simulation Study and Comparison
We compare the proposed models with the original Poisson model on simulated data sets in terms of the type I error, detection power, and detection accuracy. To verify the improvement in detection accuracy, the proposed methods are compared with the scan statistic and the traditional Poisson model.

4.1 Type I Error

In the simulation for checking the type I error, we fix the number of nodes in the synthetic networks at 100. The edges among these nodes follow a Bernoulli distribution with four different connection probabilities, $p_0 = 1/5$, $1/10$, $1/15$, and $1/20$, so the expected degree of each node is approximately 20, 10, 6.6, and 5, respectively. A thousand runs are executed in each simulation case, and the significance level is set to 0.05. Table 1 shows that the type I error is acceptable except when $p_0 = 1/20$. In addition, the results differ between the RCP model and the logit model. The type I error of the RCP model is more consistent (around 0.06), though the values are a little higher than the nominal significance level (0.05). On the other hand, the type I error of the logit model varies more, especially in the case of $p_0 = 1/20$. Thus, if one is interested in community detection where the type I error has to be controlled, we suggest using the RCP model for its consistency.

Table 1. Type I error

Connection probability  1/5    1/10   1/15   1/20
RCP model               0.059  0.054  0.058  0.068
Logit model             0.050  0.033  0.095  0.129

4.2 Testing Power
We use a setting similar to that of the previous subsection in the following simulation. According to the results reported in [28], the community size and the connection probability of the community play important roles in community detection. Therefore, we construct simulation cases in which there are $K$ community nodes and $100 - K$ usual nodes in the studied network. The connection probability of the usual nodes is 1/20, and that of the community nodes is set to 1/4, 1/2, 3/4, and 1. For each combination of community size and connection probability, 100 simulations are executed.

Table 2 shows the testing powers for different community sizes and connection probabilities. When the community size gets larger and/or the connection probability gets higher, the power gets higher. However, the detection power is not good enough when the community size is small. Even when the connection probability is one, so that the community is a clique of size 5, the testing powers are only 0.68 and 0.42 for the RCP model and the logit model, respectively. On the other hand, the testing power is good when the community size is large. Taking the community size of 20 as an example, the detection power of the logit model is already perfect when the connection probability is only 1/2. On balance, the logit model performs better than the RCP model. A possible explanation is that the logit model captures more individual differences among nodes, while the RCP model is restricted by the random assumption: the baseline probability is already higher when the community size is large, which leaves little room to find a significant effect in the estimation of $\gamma$ in Eq. (9).

Table 2. Testing power

            RCP model (community size)  Logit model (community size)
            5     10    15    20        5     10    15    20
p_c = 1/4   0.10  0.04  0.16  0.08      0.18  0.09  0.25  0.29
p_c = 1/2   0.10  0.28  0.82  0.88      0.13  0.43  0.88  1.00
p_c = 3/4   0.21  0.94  1.00  1.00      0.19  0.95  1.00  1.00
p_c = 1     0.68  1.00  1.00  1.00      0.42  1.00  1.00  1.00

4.3 Comparisons of Detection Accuracy
In this section, the accuracy of both the RCP and logit models is checked and compared with that of the traditional Poisson model. Since the accuracy of community members is considered, some clearly defined accuracy measures other than the testing power are also used for evaluation. We first define the terms true positive, false positive, true negative, and false negative. True positives (TP) are the true community nodes that are correctly detected as a community; false positives (FP) are the usual nodes that are incorrectly detected as a community; true negatives (TN) are the usual nodes that are not identified as a community; false negatives (FN) are the true community nodes that are not identified as a community. In addition, since only one ground-truth community is set in the simulation study, we use some common criteria to evaluate our detection results. To check whether the method can identify community nodes, the recall ($r$), which is defined as TP/(TP+FN), is used to measure the proportion of identified community nodes among all true community nodes. The precision ($p$), which is defined as TP/(TP+FP), is used to measure the proportion of true community nodes among the identified community nodes. The F1 score and the Jaccard similarity are two common criteria for evaluating community detection and are defined respectively as

$$F_1 = 2\,\frac{p\,r}{p + r}, \quad \text{and} \quad J(C^*, C) = \frac{|C^* \cap C|}{|C^* \cup C|},$$

where $C^*$ is the detected community and $C$ is the ground-truth community. In addition, we adopt the modularity proposed by [19] to measure similarity among the detection groups. The modularity is defined as

$$Q = \frac{1}{2|E_G|} \sum_{ij} \left( e_{ij} - \frac{k_i k_j}{2|E_G|} \right) s_i s_j = \frac{1}{2|E_G|} s^T B s,$$

where $e_{ij}$ is 1 if $v_i$ and $v_j$ are adjacent and 0 otherwise, and $s$ is a $\pm 1$ vector in which 1 indicates that an element belongs to the target group and $-1$ indicates that it does not.

Fig. 2. Comparisons among three models (Poisson, RCP, Logit) for community size S = 20, plotted against the connection probability: (a) Precision; (b) Recall; (c) F1 score; (d) Jaccard; (e) Modularity.
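The accuracy measures defined above can be sketched in Python as follows. Node sets are plain Python sets, the graph is given as an undirected edge list over nodes `0..n-1`, and the modularity uses the normalization written in the text; the function names are illustrative.

```python
from typing import Dict, List, Set, Tuple


def detection_scores(detected: Set[int], truth: Set[int]) -> Dict[str, float]:
    """Precision, recall, F1 and Jaccard similarity of a detected community."""
    tp = len(detected & truth)
    fp = len(detected - truth)
    fn = len(truth - detected)
    precision = tp / (tp + fp) if detected else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    union = detected | truth
    jaccard = tp / len(union) if union else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}


def modularity(edges: List[Tuple[int, int]], n: int, group: Set[int]) -> float:
    """Q = (1 / (2|E_G|)) * sum_ij (e_ij - k_i k_j / (2|E_G|)) s_i s_j."""
    m = len(edges)
    degree = [0] * n
    adjacent = set()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
        adjacent.add((i, j))
        adjacent.add((j, i))
    s = [1 if v in group else -1 for v in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            e_ij = 1.0 if (i, j) in adjacent else 0.0
            total += (e_ij - degree[i] * degree[j] / (2 * m)) * s[i] * s[j]
    return total / (2 * m)


# Detected community {1, 2, 3, 4} against the ground truth {3, ..., 8}.
scores = detection_scores({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
# Two triangles joined by one edge, with the first triangle as the target group.
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q = modularity(two_triangles, 6, {0, 1, 2})
```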
We only consider the most significant community, so the testing power is excluded here. To keep the comparison concise, we only present the results for the community of size 20; please refer to the Appendix for the detailed values of the other cases. The comparison measures are shown in Fig. 2. All measures show similar behavior, and the improvement of our proposed methods is clearly visible. In general, the detection results can be separated into two parts: small connection probabilities (0.25 and 0.5) and large connection probabilities (0.75 and 1). For large connection probabilities, there is little difference among the three models, and the logit model is the best. When the connection probability is small, the generalized methods are clearly better than the Poisson model, with the logit model about twice as accurate as the Poisson model.
5 Discussion and Conclusion
In this study, we propose a generalized framework of the scan statistic and suggest the pseudo-$R^2$ measure as a testing criterion for identifying communities in social networks. By accounting for the heterogeneity of the probability of each edge, the proposed method can be applied flexibly to different random models. We provide the RCP model and the logit model to demonstrate the effectiveness of this generalized method. Both models have acceptable type I errors and detection powers, and both improve the accuracy of detecting community members. In addition, two empirical examples show that the proposed methods are practical for real data.

Although our new models are more accurate than the Poisson model, our method cannot fully replace the original one because of its computing load. Since the provided models, especially the logit model, contain parameters without closed forms, numerical procedures are used to find the estimates. The number of parameters (the number of nodes) is huge for a large network, and a computer with current specifications usually cannot afford to run such an estimation procedure. In addition, the scan statistic relies on Monte Carlo testing to obtain the Monte Carlo p-value. It takes a long time to generate the synthetic data and to execute the $R$ Monte Carlo replications. In the simulation study (a network with 100 nodes), a complete testing procedure with 99 runs takes about 5 min for the RCP model and around half an hour for the logit model on our personal computer (Intel Core i7-4770 CPU, 3.40 GHz). Although the scanning method appears time-consuming, it can be executed via parallel computing by distributing the calculation of the test statistics of independent scanning windows according to its systematic search regime.
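The cost of the Monte Carlo testing discussed above comes from recomputing the test statistic on every simulated data set. The sketch below illustrates the idea only: it assumes the common rank-based estimate, (1 + number of simulated statistics at least as large as the observed one) / (R + 1), and both the function names and the toy null statistic are hypothetical rather than the scan statistic of the paper.

```python
import random


def monte_carlo_p_value(observed_stat, simulated_stats):
    """Rank-based Monte Carlo p-value from R simulated null statistics."""
    r = len(simulated_stats)
    at_least_as_large = sum(1 for s in simulated_stats if s >= observed_stat)
    return (1 + at_least_as_large) / (r + 1)


def simulate_null_statistic(rng, n_pairs=50, p=0.3):
    """Toy null statistic (assumption): number of edges among n_pairs
    node pairs, each connected independently with probability p."""
    return sum(1 for _ in range(n_pairs) if rng.random() < p)


rng = random.Random(0)  # fixed seed so the run is repeatable
simulated = [simulate_null_statistic(rng) for _ in range(99)]  # R = 99 runs
p_value = monte_carlo_p_value(observed_stat=25, simulated_stats=simulated)
```

Because the R replications are independent of each other, they are a natural unit of work to distribute when the procedure is parallelized.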
On the other hand, instead of using the Monte Carlo procedure, a possible solution is to apply the false discovery rate (FDR) [1] to account for the type I errors in multiple testing. Without the Monte Carlo procedure, we can save much time when executing the scanning method. We also look forward to reducing the computing time by accelerating the computing algorithm.

Another restriction of the scan statistic is the shape of the scanning window, which determines the range of a community. We use circular windows to generate candidate subsets for detecting communities, but the circular window is not the only choice. Other window shapes may also be applied to detecting communities in social networks. Some metaheuristic optimization methodologies are also considered good methods for finding the best communities [15]. The selection of the radius is another problem. In real data, we do not know the true extent of the communities. We select 60% as the maximum size of a community in the empirical studies, but in most real cases one should consult field experts about the maximum community size for a better result.

The generalized framework is flexible for testing communities. The two models provided in this study are just two simple forms. If the connection probabilities between nodes can be estimated in advance, the framework is easy to apply to constructing the scan statistic. However, the probabilities of edges are very difficult to construct. More effort is needed to construct the probabilities of edges and to make this method more flexible. Moreover, the approach of statistically evaluating community detection is not restricted to the scanning method proposed in this project. It could be applied to several state-of-the-art methods in community detection, including at least the Louvain algorithm and methods based on stochastic block modeling.

Acknowledgement. This work was partially supported by the Ministry of Science and Technology (Taiwan) Grant Numbers 1072118M001011MY3 and 1082321B001016.
References
1. Benjamini, Y., Hochberg, Y.: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc.: Ser. B 57(1), 289–300 (1995)
2. Catanzaro, M., Boguñá, M., Pastor-Satorras, R.: Generation of uncorrelated random scale-free networks. Phys. Rev. E 71(2), 027103 (2005)
3. Csermely, P.: Creative elements: network-based predictions of active centres in proteins and cellular and social networks. Trends Biochem. Sci. 33(12), 569–576 (2008)
4. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, Cambridge (2010)
5. Erdős, P., Rényi, A.: On random graphs. Publ. Math. Debr. 6, 290–297 (1959)
6. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3), 75–174 (2010)
7. Fortunato, S., Barthélemy, M.: Resolution limit in community detection. Proc. Natl. Acad. Sci. 104(1), 36–41 (2007)
8. Freeman, L.C.: A set of measures of centrality based on betweenness. Sociometry 40(1), 35–41 (1977)
9. Guimera, R., Sales-Pardo, M., Amaral, L.A.N.: Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70(2), 025101 (2004)
10. Handcock, M.S., Raftery, A.E., Tantrum, J.M.: Model-based clustering for social networks. J. R. Stat. Soc.: Ser. A 170(2), 301–354 (2007)
11. Heard, N.A., Weston, D.J., Platanioti, K., Hand, D.J.: Bayesian anomaly detection methods for social networks. Ann. Appl. Stat. 4(2), 645–662 (2010)
12. Javadi, S.H.S., Khadivi, S., Shiri, M.E., Xu, J.: An ant colony optimization method to detect communities in social networks. In: Proceedings of Advances in Social Networks Analysis and Mining, pp. 200–203 (2014)
13. Kulldorff, M.: A spatial scan statistic. Commun. Stat. Theory Methods 26(6), 1481–1496 (1997)
14. Lancichinetti, A., Radicchi, F., Ramasco, J.: Statistical significance of communities in networks. Phys. Rev. E 81(4), 046110 (2010)
15. Liu, J., Liu, T.: Detecting community structure in complex networks using simulated annealing with k-means algorithms. Phys. A 389(11), 2300–2309 (2010)
16. Magee, L.: R² measures based on Wald and likelihood ratio joint significance tests. Am. Stat. 44(3), 250–253 (1990)
17. Moody, J., White, D.R.: Structural cohesion and embeddedness: a hierarchical concept of social groups. Am. Sociol. Rev. 68(1), 103–127 (2003)
18. Naus, J.I.: Approximations for distributions of scan statistics. J. Am. Stat. Assoc. 77(377), 177–183 (1982)
19. Newman, M.E.J.: Fast algorithm for detecting community structure in networks. Phys. Rev. E 69(6), 066133 (2004)
20. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
21. Patil, G.P., Taillie, C.: Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environ. Ecol. Stat. 11(2), 183–197 (2004)
22. Perry, M.B., Michaelson, G.V., Ballard, M.A.: On the statistical detection of clusters in undirected networks. Comput. Stat. Data Anal. 68, 170–189 (2013)
23. Reddy, P.K., Kitsuregawa, M., Sreekanth, P., Rao, S.S.: A graph based approach to extract a neighborhood customer community for collaborative filtering. Databases Netw. Inf. Syst. 2544, 188–200 (2002)
24. Robins, G., Pattison, P., Kalish, Y., Lusher, D.: An introduction to exponential random graph (p*) models for social networks. Soc. Netw. 29(2), 173–191 (2007)
25. Strauss, D., Ikeda, M.: Pseudolikelihood estimation for social networks. J. Am. Stat. Assoc. 85(409), 204–212 (1990)
26. Tango, T., Takahashi, K.: A flexibly shaped spatial scan statistic for detecting clusters. Int. J. Health Geogr. 4(1), 11 (2005)
27. Wang, B., Phillips, J.M., Schreiber, R., Wilkinson, D.M., Mishra, N., Tarjan, R.: Spatial scan statistics for graph clustering. In: Proceedings of the 2008 SIAM International Conference on Data Mining, pp. 727–738 (2008)
28. Wang, T.C., Phoa, F.K.H.: A scanning method for detecting clustering pattern of both attribute and structure in social networks. Phys. A 445, 295–309 (2016)
29. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393(6684), 440–442 (1998)
Comparing the Community Structure Identified by Overlapping Methods

Vinícius da F. Vieira 1(B), Carolina R. Xavier 1, and Alexandre G. Evsukoff 2

1 Federal University of São João del-Rei, São João del-Rei, Brazil
[email protected]
2 COPPE, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil

Abstract. Community detection is one of the most important tasks in network analysis. Recently, an increasing number of researchers have investigated networks in which nodes participate concomitantly in more than one community. This work presents a comparative study of five state-of-the-art methods for overlapping community detection from the perspective of the structural properties of the communities they identify. Experiments with benchmark and ground-truth networks show that, although the methods are able to identify modular communities, they often miss many structural properties of the communities, such as the number of nodes in the overlapping region and the membership of the nodes.

Keywords: Overlapping community detection · Structural properties · Comparison of methods

1 Introduction
One of the most important topological properties of complex networks is the organization of nodes into communities: a division of the nodes into groups with dense internal connections and sparse external connections. The discovery and investigation of communities in real-world networks can reveal key functional and structural aspects in many contexts. Traditional methods for community detection divide a network into groups as a partition problem, i.e., a node must belong to one and only one community, and several measures have been proposed in the literature to assess the quality of communities in networks. However, it is intuitive that, in real-world networks, elements can participate concomitantly in several communities. For example, a person can maintain relationships with family members, coworkers and the members of a sports club, and a protein can interact with other proteins in many different metabolic reactions. Accordingly, several recent works have sought to understand the properties and characteristics of overlapping communities in networks, in order to explore this natural aspect of real-world phenomena [7,11,12].

An increasing number of authors seek to develop methods for detecting the overlapping community structure of networks. Some authors consider a specific criterion to characterize a good division of a network into communities and define methods to optimize it, such as Lancichinetti et al. [5], who define a fitness function for communities and propose a heuristic to optimize this quantity. The works of Nicosia et al. [6] and Shen et al. [10] also aim to detect overlapping communities by considering a variation of Newman's modularity and propose optimization methods for community structure. Other authors propose methods from different perspectives. Palla et al. [7] present an algorithm for community detection based on the identification of adjacent k-cliques (CFinder), allowing nodes to participate in several communities at the same time. Yang and Leskovec [12] present a model for the representation of overlapping community structure and propose a method to fit the community structure to the model as an integer optimization problem.

Some works in the literature present reviews and comparative studies of overlapping community detection methods and can be used as good references [1,4,11]. Xie et al. [11] and Amelio et al. [1] present a wide description of methods for the identification of overlapping communities. Additionally, Xie et al. compare the methods regarding their ability to identify the nodes in the overlapping region. Hric et al. [4] investigate whether the communities identified by the methods agree with ground-truth communities.

The present work also performs a comparative analysis of some state-of-the-art methods for overlapping community detection in complex networks. However, unlike the works of Xie et al. and Hric et al., this work does not focus on quality measures and hit rates of the methods. From a different perspective, the experiments conducted in this work investigate the structural properties of the communities identified by the studied methods. This approach was previously followed by Hric et al. [4], in a work where the authors test a basic hypothesis of community detection methods: that the topological organization of a network alone is able to reveal its communities. The authors conducted experiments to assess whether community detection methods are able to correctly classify the ground-truth communities (which they call metadata networks), disregarding objective measures, and verified that, in most cases, there is a substantial difference between identified and metadata groups.

In this work, community detection methods are evaluated with benchmark networks, frequently used to evaluate community detection algorithms, and with networks with ground truth. Unlike the work of Hric et al. [4], where the authors evaluate the methods by their ability to identify the correct communities, this work focuses on the structural properties of the communities found by the methods, in order to investigate the occurrence of patterns in the topology of the extracted communities and their similarity to ground-truth communities. Quality measures of the overlapping community structure identified by the methods are also presented to better illustrate the analysis. The investigation allows us to verify that the methods are able to extract highly modular communities from small and medium-sized networks, confirming the results obtained by other authors in the literature. It is also observed that most state-of-the-art methods are unable to identify the correct groups for nodes in networks with ground truth. Moreover, the identified communities are often very distinct from those observed for the ground-truth networks. The investigation of ground-truth networks combined with the benchmark networks also suggests that the structural properties of the communities obtained by the community detection methods are more related to the method itself and its way of operation than to the topological organization of the networks.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 262–273, 2020. https://doi.org/10.1007/9783030366872_22
2 Methods for Overlapping Community Detection
The problem of overlapping community detection can be stated as follows. Consider a network $G(V, E)$, where $V$ represents the set of nodes and $E$ represents the set of edges, such that $n = |V|$ and $m = |E|$. For the sake of simplicity, the edges can be considered unweighted and undirected. $G$ can be represented by an adjacency matrix $A$. An overlapping community structure $C$ can be defined as a cover of $V$ in $n_c$ communities, $C = \{C_i, i = 1 \ldots n_c\}$, and a node $u$ can participate in one or more communities $C_i$ with a belonging factor $\alpha_{u,C_i}$ of node $u$ in community $C_i$ such that $0 \le \alpha_{u,C_i} \le 1$, $\forall u \in V$, $\forall C_i \in C$, and $\sum_{i=1}^{n_c} \alpha_{u,C_i} = 1$, $\forall u \in V$. As in the partition problem, the sense of community becomes more evident as the difference between intra-community and inter-community edges increases, and there is no universal definition for a particular division of the network into communities, overlapping or not.

An increasing number of works aim to understand and detect the organization of nodes into communities without limiting the number of groups in which a node can take part. To define a method for detecting overlapping communities (CFinder), Palla et al. [7] observe that typical communities are basically the union of complete subgraphs of size k and define a community as a union of adjacent k-cliques. In this sense, a single node can belong to several communities and an overlapping structure can be identified. A game theory perspective on the problem of overlapping communities is considered by Zhou et al. [14], who propose a function to assess the quality of the union of hierarchical groups and a greedy algorithm to identify overlapping communities. Variations of the label propagation strategy [8] are at the core of many state-of-the-art works on overlapping community detection. Xie et al. [11] propose a method based on a variation of label propagation (SLPA) in which each node can assume the role of a listener or a speaker and receive or propagate a label retained in a memory attached to it. The method of Coscia et al. [2] (Demon) is also based on a variation of label propagation, but it considers an ego network for each node and evaluates the labels of each group shared by the node. Gregory [3] describes a dynamic label propagation method for overlapping community detection (COPRA) in which the belonging factor of each node to each community is locally updated by considering the belonging factors of its neighbors, and a parameter is defined to control the maximum number of communities in which a node can participate. Yang and Leskovec [12] propose a Community-Affiliation Model that considers denser connections in the overlapping regions and fit it to networks.
In another work, the same authors propose a similar approach [13], combining a stochastic model with a non-negative matrix factorization method to detect communities in huge real-world networks (Bigclam). The number of communities to be considered by the method for a network is estimated by forcing the model to use the minimal number of communities while still accurately modeling the network. Some works propose variations of the widely adopted Newman's modularity in order to adapt it to the assessment of overlapping communities [6,10]. This work considers the adaptation of the modularity proposed by Shen et al. [10], which relaxes the constraint that a node can belong to only one community and defines a more flexible factor $\alpha_{u,C_i}$ that denotes how much a node $u$ belongs to the community $C_i$:
\[ Q_{ov} = \frac{1}{2m} \sum_{C_i \in C} \sum_{uv} \left[ A_{uv} - \frac{k_u k_v}{2m} \right] \alpha_{u,C_i} \alpha_{v,C_i}. \tag{1} \]
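To make Eq. 1 concrete, the following sketch evaluates $Q_{ov}$ for a cover of a small graph. It is an illustration under assumptions, not the implementation used in this work: the function names and the toy graph are hypothetical, edges are assumed to be unique, undirected and free of self-loops, and the belonging factor of a node is taken to be uniform over its communities (1 divided by the number of communities the node belongs to), which is one simple choice that satisfies the constraints on $\alpha_{u,C_i}$.

```python
from collections import defaultdict


def uniform_belonging(cover, nodes):
    """alpha[u][i] = 1 / (number of communities containing u) (assumption)."""
    membership = defaultdict(list)
    for i, community in enumerate(cover):
        for u in set(community):
            membership[u].append(i)
    return {u: {i: 1.0 / len(membership[u]) for i in membership[u]} for u in nodes}


def overlapping_modularity(edges, cover, alpha=None):
    """Q_ov = 1/(2m) * sum_Ci sum_uv [A_uv - k_u k_v / (2m)] alpha_u,Ci alpha_v,Ci."""
    nodes = sorted({u for edge in edges for u in edge})
    neighbors = {u: set() for u in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    two_m = 2 * len(edges)
    if alpha is None:
        alpha = uniform_belonging(cover, nodes)
    q = 0.0
    for i, community in enumerate(cover):
        members = [u for u in set(community) if u in neighbors]
        for u in members:
            for v in members:
                a_uv = 1.0 if v in neighbors[u] else 0.0
                expected = len(neighbors[u]) * len(neighbors[v]) / two_m
                q += (a_uv - expected) * alpha[u].get(i, 0.0) * alpha[v].get(i, 0.0)
    return q / two_m


# Toy graph (hypothetical): two triangles joined by the edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_partition = overlapping_modularity(toy_edges, [{0, 1, 2}, {3, 4, 5}])
q_overlap = overlapping_modularity(toy_edges, [{0, 1, 2, 3}, {2, 3, 4, 5}])
```

When the cover is a partition, every belonging factor equals 1 and the value reduces to the usual modularity of that partition.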
Before presenting and discussing the experiments performed in this work, it is worth mentioning that most methods in the literature for identifying overlapping community structure are highly dependent on parameter adjustments, which may pose a concern in their use. The methods of Lancichinetti et al. [5] and Rhouma et al. [9] require the definition of resolution parameters and mixing parameters, which control the size of the resulting communities. The community structure found by the method of Lancichinetti et al. also depends on the correct definition of the parameter μ in the fitness function. The method of Shen et al. [10] is based on the identification of cliques, and the definition of a proper number of cliques is crucial to the detection of meaningful communities, which the authors perform as an exploratory analysis. In CFinder [7], the parameter k, which specifies the size of the k-cliques, must be set. COPRA [3] requires the user to set the maximum number of communities in which a node can participate. The graph model proposed by Yang and Leskovec for overlapping community structure [12] also depends on the definition of a parameter that controls the probability that two nodes are connected even if they are not in the same community. Furthermore, their methods for overlapping community detection [12,13] depend on the definition of the number of communities in the network, which is calculated by solving another optimization problem in the original works. The number of communities is also a parameter required by the genetic algorithm for overlapping community detection of Shen et al., where it represents the size of a chromosome in the population. Moreover, that method requires the setting of many parameters (number of generations, population size, stop criterion, crossover rate and mutation rate), like any genetic algorithm. A different kind of parametrization is required by the method of Nicosia et al., which demands the definition of an association function for the belonging coefficients of two nodes to a community. In this work, we adopted the defaults defined by the authors or the set of parameters suggested by them, as discussed in the next section.
3 Experiments and Discussion
The main purpose of this work is to investigate some of the most important methods for overlapping community detection found in the literature from the perspective of the structural properties of the communities found. For this, two sets of networks were considered: a set of benchmark networks, frequently explored for the evaluation of community detection methods, and a set of sampled ground-truth networks. A sampling strategy was adopted for the ground-truth networks because the methods cannot handle their complete versions, with millions of nodes. We adopt a variation of the sampling strategy proposed by Yang and Leskovec [12] for ground-truth networks. The original strategy randomly chooses a seed node u from the network and identifies all the communities in which u participates. Then, it collects all the nodes from these communities and determines the subgraph induced by these nodes. In this work, we do not select the seed node randomly. Instead, we first identify the node that participates in the most communities and define it as the seed node. By doing this, we expect to obtain a sample network with denser overlapping regions, closer to a real-world scenario.

Although it would be interesting to investigate variations of the parameters, delimiting the scope of this work required defining a single, arbitrary set of parameters for the considered methods. In an attempt to get the best possible results, the methods were parametrized as suggested by their authors, as follows: CFinder [7] (size of k-cliques k = 4), Bigclam [13] (minimum number of communities mc = 5, maximum number of communities xc = 100, number of trials nc = 10), Demon [2] (merging threshold = 0.3, minimum community size mc = 3), COPRA [3] (maximum memberships of nodes v = 8) and SLPA [11] (propagation threshold r = 0.45).
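The seed-based sampling described earlier in this section can be sketched as follows. This is a minimal illustration under assumptions: the function name and the toy data are hypothetical, the community list is assumed to be non-empty, and ties between nodes with the same number of memberships are broken by the smallest node identifier, a detail the text does not specify.

```python
from collections import Counter


def sample_from_max_membership_seed(edges, communities):
    """Pick the node with the most memberships as seed, keep the nodes of
    every community containing it, and return the induced subgraph."""
    membership = Counter(u for c in communities for u in set(c))
    # Seed: most memberships; ties broken by smallest node id (assumption).
    seed = min(membership, key=lambda u: (-membership[u], u))
    seed_communities = [set(c) for c in communities if seed in c]
    sampled_nodes = set().union(*seed_communities)
    # Induced subgraph: keep only edges with both endpoints in the sample.
    sampled_edges = [(u, v) for u, v in edges
                     if u in sampled_nodes and v in sampled_nodes]
    return seed, sampled_nodes, sampled_edges, seed_communities


# Toy data (hypothetical): node 3 belongs to three communities.
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)]
toy_communities = [{1, 2, 3}, {3, 4}, {5, 6}, {3, 7}]
seed, nodes, sub_edges, kept = sample_from_max_membership_seed(
    toy_edges, toy_communities)
```

Replacing the `min(...)` line with a random choice among all nodes would give the original, randomly seeded strategy.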
The set of networks investigated is composed of five benchmark networks: CAGrQc (see footnote 1) (n = 5242, m = 14496), CAHepPh (see footnote 1) (n = 12008, m = 118521), CitHepTh (see footnote 1) (n = 27770, m = 352324), Email (see footnote 2) (n = 1133, m = 5451) and Keys (see footnote 2) (n = 10680, m = 24319); and four sampled ground-truth networks: Amazon (see footnote 1) (n = 10308, m = 21575), DBLP (see footnote 1) (n = 7515, m = 31784), Live Journal (see footnote 1) (n = 5843, m = 12543) and Youtube (see footnote 1) (n = 6792, m = 28177).

3.1 Quality Measures
Table 1 shows results for objective measures of the communities extracted by the explored methods. The overlapping modularity $Q_{ov}$ implements the extension proposed by Shen et al. [10] (Eq. 1). The Overlapping Normalized Mutual Information (ONMI) [5] is presented only for networks with ground-truth information. The execution time for each method and each network is also presented (see footnote 3).

Table 1. Execution time (in seconds), overlapping modularity $Q_{ov}$ and Overlapping Normalized Mutual Information (ONMI), when applicable, for the studied networks. GT denotes the ground truth; a dash marks a value that is not applicable or not available.

| Network | GT $Q_{ov}$ | CFinder Time | CFinder $Q_{ov}$ | CFinder ONMI | Bigclam Time | Bigclam $Q_{ov}$ | Bigclam ONMI | Demon Time | Demon $Q_{ov}$ | Demon ONMI | COPRA Time | COPRA $Q_{ov}$ | COPRA ONMI | SLPA Time | SLPA $Q_{ov}$ | SLPA ONMI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Amazon | 0.40 | 0.65 | 0.33 | 0.04 | 0.31 | 0.81 | 0.11 | 2.60 | 0.51 | 0.05 | 14.82 | 0.53 | 0.25 | 2.50 | 0.80 | 0.05 |
| DBLP | 0.57 | 2.43 | 0.38 | 0.04 | 1.11 | 0.43 | 0.03 | 6.54 | 0.20 | 0.04 | 2.00 | 0.50 | 0.25 | 1.77 | 0.52 | 0.07 |
| LJ | 0.59 | 0.35 | 0.15 | 0.00 | 0.58 | 0.42 | 0.01 | 1.03 | 0.21 | 0.05 | 0.88 | 0.52 | 0.12 | 1.06 | 0.58 | 0.04 |
| Youtube | 0.30 | 16.32 | 0.02 | 0.00 | 3.77 | 0.06 | 0.00 | 2.76 | 0.03 | 0.03 | 0.72 | 0.36 | 0.16 | 1.07 | 0.01 | 0.00 |
| CAGrQc | - | 0.71 | 0.46 | - | 0.29 | 0.61 | - | 2.13 | 0.41 | - | 0.87 | 0.47 | - | 1.14 | 0.66 | - |
| CAHepPh | - | - | - | - | 5.03 | 0.35 | - | 74.65 | 0.14 | - | 2.37 | 0.16 | - | 4.31 | 0.25 | - |
| CitHepTh | - | - | - | - | 35.37 | 0.16 | - | 86.18 | 0.04 | - | 9.67 | 0.01 | - | 15.02 | 0.14 | - |
| Email | - | 0.27 | 0.27 | - | 0.69 | 0.18 | - | 0.54 | 0.06 | - | 0.22 | 0.51 | - | 0.24 | 0.11 | - |
| Keys | - | 2111.34 | 0.38 | - | 0.93 | 0.63 | - | 2.98 | 0.31 | - | 1.64 | 0.68 | - | 2.50 | 0.71 | - |

First, Table 1 shows that, for most ground-truth networks, the methods executed in a reasonable time, allowing them to be used in small and medium-sized real-world applications. For some benchmark networks, CFinder and Demon presented a high execution time, making them prohibitive for real-time situations. Bigclam and SLPA, especially, presented consistent results, processing almost all the networks in a short time. For slightly larger networks, the methods start to show a high execution time and some considerations must be made. All the methods showed a high execution time for CitHepTh (with ∼27k nodes and ∼352k edges). It is important to highlight that CFinder was unable to run on CitHepTh and CAHepPh.

Since the main purpose of this work is neither to compare objective measures of the methods nor to produce a ranking of methods, some observations regarding quality measures are made with the sole purpose of better understanding the structural properties of the identified communities. Considering the overlapping modularity $Q_{ov}$, COPRA, Bigclam and SLPA were able to identify modular communities in almost all networks. The exception is the Youtube network, for which only COPRA was able to identify a relevant community structure. It is interesting to notice that, for the Amazon network, Bigclam, Demon, COPRA and SLPA identified communities with a much more modular structure than the ground truth, especially SLPA, which obtained $Q_{ov} = 0.80$, while the ground-truth modularity is $Q_{ov} = 0.40$. The ONMI values in Table 1 show that the methods were unable to identify the communities correctly, although the results obtained by COPRA stand out from those of the other methods. Some hypotheses to explain why the methods mostly show a low ONMI while presenting a high $Q_{ov}$ are discussed in the next sections.

Footnote 1. Downloaded from: http://snap.stanford.edu/data/.
Footnote 2. Downloaded from: http://www-personal.umich.edu/~mejn/netdata/.
Footnote 3. The computational environment consists of an Intel Core i9-9900K processor with 32 GB RAM running Ubuntu 18.04.
[Figure 1: nine panels, one per network: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) CitHepTh, (h) Email, (i) Keys.]
Fig. 1. Complementary Cumulative Distribution Function of node memberships to the communities obtained by the studied methods.

3.2 Node Membership
Considering the features of the community detection methods and the results for objective quality measures previously presented, it is possible to investigate other aspects of the structural properties of the networks. Figure 1 shows the Complementary Cumulative Distribution Function (CCDF) of node memberships, i.e., the number of communities in which each node participates, for all the identified communities and the ground-truth communities of each network. Figure 1 clearly shows that, except for COPRA and SLPA, the considered methods do not necessarily assign a community to each node: some nodes can be left with no community and, thus, the probability that a node belongs to at least one community can be less than one. This behavior particularly stands out for CFinder, which severely underestimates the number of nodes that belong to one community. It is also possible to observe that the CCDFs for the ground-truth communities show an almost linear behavior (in log-log scale), which was not captured by any of the studied methods. It is also interesting to notice that, despite differences in scale, some methods, like SLPA, Demon and Bigclam, show very similar behavior for most networks, suggesting that the resulting community structure depends less on the relations existing in the network than on the strategy of the algorithm. It is important to note that most methods depend on parameters, which may strongly affect the community structure, and a deeper investigation should be performed for each method in order to draw more conclusive observations.

3.3 Community Size
[Figure 2: nine panels, one per network: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) CitHepTh, (h) Email, (i) Keys.]
Fig. 2. Complementary Cumulative Distribution Function of the sizes of communities obtained by the studied methods.

The CCDF of the sizes of the communities identified by the methods was also considered, and the results are presented in Fig. 2. As for the membership distribution, Fig. 2 shows that, despite differences in scale, the ground-truth communities show an almost linear behavior (in log-log scale) over a wide range of the distribution, which was not captured by any of the methods. First, it is worth noting that the communities obtained by COPRA are very similar in size to the ground truth. However, this is atypical and such similarity cannot be observed in the other scenarios. It is interesting to notice that the shapes of the curves observed for Demon and, more clearly, for Bigclam are very consistent across most networks, though very different from the ground truth. This suggests that the structural properties of the communities may be highly affected by the mechanisms used by the methods and not by the network itself, which must be better understood with further investigation of the methods. Bigclam, for instance, requires the specification of the maximum number of communities, which naturally impacts the size of the communities. This can underestimate the number of large communities, as observed for three of the ground-truth networks (Amazon, DBLP and Live Journal). A deeper investigation of the properties of CAHepPh and CitHepTh must also be conducted in order to understand the similarity of the curves obtained by the methods in these networks. We can also notice that large communities are likely to occur in these networks and that the probability of occurrence of large communities identified by the studied methods is very low. If we take into consideration the results presented in Table 1, it is noteworthy that the $Q_{ov}$ of the community structure identified for these networks is very high for most methods. On the other hand, the $Q_{ov}$ observed for Youtube is very low for most methods. This fact must be further investigated, but some hypotheses can be raised. Possibly, the methods are mistakenly disregarding larger communities because they are prone to identify highly modular structures, even if these are very small compared to real-world scenarios. This way of operation may be causing the methods to fail to discover the right communities, which is reflected in the low ONMI observed in Table 1.
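The distributions discussed in Sects. 3.2 and 3.3 can be computed with a few lines of code. The sketch below is only illustrative: the function names and the toy cover are hypothetical, and the CCDF is taken as P(X ≥ x), which matches the reading above that the probability of belonging to at least one community can be less than one.

```python
from collections import Counter


def ccdf(values):
    """Return sorted (x, P(X >= x)) pairs for a list of numbers."""
    n = len(values)
    if n == 0:
        return []
    counts = Counter(values)
    points = []
    remaining = n  # number of observations that are >= the current x
    for x in sorted(counts):
        points.append((x, remaining / n))
        remaining -= counts[x]
    return points


def node_memberships(cover, nodes):
    """Number of communities each node belongs to (0 if unassigned)."""
    counts = Counter(u for community in cover for u in set(community))
    return [counts.get(u, 0) for u in nodes]


def community_sizes(cover):
    """Number of distinct nodes in each community."""
    return [len(set(community)) for community in cover]


# Toy cover (hypothetical): node 6 is left without any community.
toy_nodes = [1, 2, 3, 4, 5, 6]
toy_cover = [{1, 2, 3}, {3, 4}, {3, 4, 5}]
membership_ccdf = ccdf(node_memberships(toy_cover, toy_nodes))
size_ccdf = ccdf(community_sizes(toy_cover))
```

In this toy example the membership CCDF at x = 1 is 5/6, reflecting the one node that no community covers; plotting the pairs on log-log axes gives curves comparable in form to those in Figs. 1 and 2.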
[Figure 3: nine panels, one per network: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) CitHepTh, (h) Email, (i) Keys.]
Fig. 3. Complementary Cumulative Distribution Function of the sizes of the overlapping regions obtained by the studied methods.

3.4 Overlapping Size
The size of the overlapping region, i.e., the number of nodes in each region at the intersection of communities, was also investigated, and the CCDFs for these quantities are presented in Fig. 3. The methods do not show a clear and constant pattern for the distribution of overlapping size, and the curves observed for a given method are very distinct from one network to another. It is worth noting that, for most methods, the occurrence of overlapping regions is very unlikely. The exception is Bigclam, which identifies communities with relevant overlapping regions, especially in the ground-truth networks. Still considering the networks with ground truth, the number of small overlapping regions is very high. Moreover, for three ground-truth networks (Amazon, DBLP and Live Journal), the methods underestimate the sizes of overlapping regions over the whole range of the distribution. In the communities identified by COPRA, SLPA and CFinder, the occurrence of nodes in the overlapping region is extremely rare. Therefore, we can argue that these methods treat community detection as a partition problem for a large number of nodes. CFinder and, especially, SLPA also identify very small overlapping regions for almost all networks. Together with the results from Table 1, this may suggest that the methods fail to identify the overlapping nodes, concentrating the hits on the nodes that belong to only one community. Xie et al. [11] investigate the ability of overlapping community methods to identify overlapping nodes on synthetic networks and verify that, in that scenario, the methods were able to classify them. A further investigation must be performed to validate or refute the hypothesis that the methods underestimate the size of overlapping regions due to a misclassification of overlapping nodes in the context explored in the present work.
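The overlapping sizes analyzed above can be extracted from a cover as sketched below. The text does not spell out how an overlapping region is delimited, so this example assumes that a region is the non-empty intersection of a pair of communities; the function names and the toy cover are hypothetical.

```python
from itertools import combinations


def overlap_sizes(cover):
    """Sizes of all non-empty pairwise intersections between communities
    (assumed definition of an overlapping region)."""
    sets = [set(community) for community in cover]
    sizes = []
    for a, b in combinations(sets, 2):
        shared = len(a & b)
        if shared > 0:
            sizes.append(shared)
    return sizes


def overlapping_nodes(cover):
    """Nodes that belong to more than one community."""
    seen, repeated = set(), set()
    for community in cover:
        members = set(community)
        repeated |= members & seen  # already seen in an earlier community
        seen |= members
    return repeated


# Toy cover (hypothetical): two overlapping regions, one isolated community.
toy_cover = [{1, 2, 3, 4}, {3, 4, 5}, {5, 6}, {7, 8}]
sizes = overlap_sizes(toy_cover)
shared_nodes = overlapping_nodes(toy_cover)
```

A cover that behaves like a partition returns an empty list of sizes, which corresponds to the case where overlapping regions are extremely rare.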
[Figure 4 panels: (a) Amazon, (b) DBLP, (c) LJ, (d) Youtube, (e) CAGrQc, (f) CAHepPh, (g) Email, (h) Keys]
Fig. 4. Probability of occurrence of an edge between two nodes in respect to the number of communities shared by them.
3.5 Edge Probability on the Overlapping Region
According to Yang and Leskovec [12], most methods found in the literature tend to identify communities with fewer edges in the overlapping region than in the non-overlapping region. However, the authors investigate a set of ground-truth communities and show empirically that the region of the network where communities overlap tends to be denser than the rest of the network. That is, the more communities a node shares with another, the higher the probability that an edge exists between them. In this work, we investigate whether the findings of Yang and Leskovec are observed in the ground-truth networks and whether the investigated methods are able to capture this behavior even for the benchmark networks. Figure 4 shows the probability of an edge between two nodes as a function of the number of communities they share. First, it is important to mention that this experiment is very memory consuming and could not be performed for CitHepTh. From Fig. 4 it is possible to notice that there is, in fact, a positive correlation between the number of shared communities and the edge probability in ground-truth networks. This behavior is captured by Bigclam, although the edge probability decreases after a certain number of shared communities in almost all networks. Most other methods fail to capture dense overlapping regions, as stated by Yang and Leskovec [12]. The exception is Demon, which identifies dense overlapping regions but largely overestimates the edge probability. For the other methods, it is difficult to make more conclusive observations. However, when the other results are considered, a clearer view of the results presented in Fig. 4 is possible. For instance, in SLPA, COPRA and CFinder, large overlapping regions and nodes that belong to several communities are rare. Thus, it is expected that the edge probability in the overlapping region does not show a clear pattern across the different networks.
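The measurement behind Fig. 4 can be sketched as follows. This is a minimal illustration, assuming an undirected graph given as an edge list and a cover given as a list of node sets; the function name and the toy data are hypothetical, and the brute-force enumeration of all node pairs is only practical for small graphs.

```python
from collections import defaultdict
from itertools import combinations


def edge_probability_by_shared(nodes, edges, communities):
    """Return {k: P(edge | the two nodes share exactly k communities)}."""
    membership = defaultdict(set)
    for idx, community in enumerate(communities):
        for node in community:
            membership[node].add(idx)

    edge_set = {frozenset(e) for e in edges}
    pairs = defaultdict(int)   # number of node pairs sharing k communities
    linked = defaultdict(int)  # number of those pairs joined by an edge
    for u, v in combinations(nodes, 2):
        k = len(membership[u] & membership[v])
        pairs[k] += 1
        if frozenset((u, v)) in edge_set:
            linked[k] += 1
    return {k: linked[k] / pairs[k] for k in sorted(pairs)}


# Toy example (assumption): a path 1-2-3-4 covered by three communities.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4)]
communities = [{1, 2, 3}, {2, 3, 4}, {2, 3}]
print(edge_probability_by_shared(nodes, edges, communities))
# {0: 0.0, 1: 0.5, 3: 1.0}
```

The loop visits every pair of nodes, so its cost grows quadratically with the network size, which is consistent with the observation that the experiment is demanding on large networks.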
4 Conclusions and Future Directions
This work presents a comparative analysis of five state-of-the-art methods for overlapping community detection in complex networks, considering a set of four ground-truth networks and five benchmark networks. Unlike other works, where community detection methods are compared using objective measures, in this work the investigation is performed from the perspective of the community structure identified by the methods. Different ways to characterize the community structure and the organization of communities were tested and analyzed in combination with objective quality measures for the network cover: an extension of Newman's modularity and an adaptation of NMI for overlapping communities. The methods were able to identify modular community structures, resulting in large values of Qov; however, they were unable to estimate the correct community cover, as evidenced by the low values obtained for ONMI. The analysis of the community structure, especially when we consider the size of the overlapping region and the number of memberships of the nodes, allows us to raise some hypotheses to explain how the methods can present low values for ONMI while finding modular communities. The overlapping community methods, particularly those based on label propagation, tend to find very modular small communities while disregarding the nodes in the overlapping region. Naturally, the results obtained by the analysis conducted in this work may not be observed in other networks, and no general conclusion can be drawn from them. Deeper investigation must be performed to better understand the communities identified by the methods and relate them to the mechanisms of the methods. The same can be stated about the networks: further investigation must be conducted, first to describe in detail the properties of the communities and then to relate them to the ground-truth communities in different contexts.
In order to better understand the methods for community detection, it would also be interesting to evaluate whether there is some agreement between the communities identified by the methods. Nevertheless, from the experiments performed in this work it can be argued that, although very convenient when assessing the quality of community detection algorithms, objective measures can miss important aspects of the community structure in real-world networks. Mainly because objective measures for overlapping communities are extensions of measures for non-overlapping communities, especially the Qov and ONMI considered in this work, they do not reflect the behavior of the communities in the overlapping region. The results observed in the experiments are very consistent with other works in the literature [4,13] and suggest that other aspects, besides objective functions, must be considered when designing community detection methods. Moreover, although some remarks can be made in order to better understand overlapping community detection methods, this work brings more questions than answers regarding the community detection problem in real-world applications, and we expect that it can serve as a basis for further studies in this direction.

Acknowledgement. The authors would like to thank the Brazilian research funding agencies CNPq and Capes for their support of this work.
References

1. Amelio, A., Pizzuti, C.: Overlapping community discovery methods: a survey. CoRR 1411.3935 (2014)
2. Coscia, M., Rossetti, G., Giannotti, F., Pedreschi, D.: Uncovering hierarchical and overlapping communities with a local-first approach. ACM Trans. Knowl. Discov. Data 9(1), 6:1–6:27 (2014)
3. Gregory, S.: Finding overlapping communities in networks by label propagation. New J. Phys. 12(10), 103018 (2010)
4. Hric, D., Darst, R.K., Fortunato, S.: Community detection in networks: structural communities versus ground truth. Phys. Rev. E 90, 062805 (2014)
5. Lancichinetti, A., Fortunato, S., Kertesz, J.: Detecting the overlapping and hierarchical community structure of complex networks. New J. Phys. 11, 033015 (2009)
6. Nicosia, V., Mangioni, G., Carchiolo, V., Malgeri, M.: Extending the definition of modularity to directed graphs with overlapping communities. J. Stat. Mech: Theory Exp. 2009(03), P03024 (2009)
7. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043), 814 (2005)
8. Raghavan, N., Albert, R., Kumara, S.: Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E 76, 036106 (2007)
9. Rhouma, D., Romdhane, L.B.: An efficient algorithm for community mining with overlap in social networks. Expert Syst. Appl. 41(9), 4309–4321 (2014)
10. Shen, H.W.: Detecting the overlapping and hierarchical community structure in networks, pp. 19–44. Springer, Heidelberg (2013)
11. Xie, J., Kelley, S., Szymanski, B.K.: Overlapping community detection in networks: the state of the art and comparative study. CoRR 1110.5813 (2011)
12. Yang, J., Leskovec, J.: Community-affiliation graph model for overlapping network community detection. In: Proceedings of the 2012 IEEE 12th International Conference on Data Mining, ICDM 2012, pp. 1170–1175. IEEE Computer Society, Washington, DC (2012)
13. Yang, J., Leskovec, J.: Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, pp. 587–596. ACM, New York (2013)
14. Zhou, L., Lü, K., Yang, P., Wang, L., Kong, B.: An approach for overlapping and hierarchical community detection in social networks based on coalition formation game theory. Expert Syst. Appl. 42(24), 9634–9646 (2015)
Semantic Frame Induction as a Community Detection Problem

Eugénio Ribeiro1,2,5(B), Andreia Sofia Teixeira1,3,4, Ricardo Ribeiro1,5, and David Martins de Matos1,2

1 INESC-ID, Lisbon, Portugal
[email protected]
2 Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal
3 Center for Social and Biomedical Complexity, School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN, USA
4 Indiana University Network Science Institute (IUNI), Indiana University, Bloomington, IN, USA
5 Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal
Abstract. Resources such as FrameNet provide semantic information that is important for multiple tasks. However, they are expensive to build and, consequently, are unavailable for many languages and domains. Thus, approaches able to induce semantic frames in an unsupervised manner are highly valuable. In this paper we approach that task from a network perspective as a community detection problem that targets the identification of groups of verb instances that evoke the same semantic frame. To do so, we apply a graph-clustering algorithm to a graph with contextualized representations of verb instances as nodes connected by an edge if the distance between them is below a threshold that defines the granularity of the induced frames. By applying this approach to the benchmark dataset defined in the context of the SemEval shared task we outperformed all the previous approaches to the task.

Keywords: Semantic frames · Contextualized representations · Community detection · Graph clustering

1 Introduction
A word may have different senses depending on the context in which it appears. Thus, in order to understand its meaning, we must analyze that context and identify the semantic frame that is being evoked [12]. Consequently, sets of frame definitions and annotated datasets that map text into the semantic frames it evokes are important resources for multiple Natural Language Processing (NLP) tasks [1,10,23]. Among such resources, the most prominent is FrameNet [5], providing a set of more than 1,200 generic semantic frames, as well as over 200,000 annotated sentences in English. However, this kind of resource is expensive and time-consuming to build, since both the definition of the frames and the annotation of sentences require expertise in the underlying knowledge. Furthermore, it is difficult to decide both the granularity and the domains to consider while defining the frames. Thus, such resources only exist for a small number of languages [8] and even English lacks domain-specific resources in multiple domains.

An approach to alleviate the effort in the process of building semantic frame resources is to induce the frames evoked by a collection of documents using unsupervised approaches. However, most research on this subject focused on arguments and the induction of their semantic roles [16,25,26] or on the induction of semantic frames from verbs with two arguments [18,28]. To address this issue and define a benchmark for future research, a shared task was proposed in the context of SemEval 2019 [21]. This task focused on the unsupervised induction of FrameNet-like frames through the grouping of verbs and their arguments according to the requirements of three different subtasks. The first of those subtasks focused on clustering instances of verbs according to the semantic frame they evoke, while the others focused on clustering the arguments of those verbs, both according to the frame-specific slots they fill and their semantic role.

In this paper we approach the first subtask from a network perspective. First, we generate a network in which the nodes correspond to contextualized representations of each verb instance. Then, we create edges between two nodes if the distance between them is lower than a certain threshold which controls the granularity of the induced frames. Finally, we apply a graph-clustering approach to identify communities of nodes that evoke the same frame.

In the remainder of the paper, we start by providing an overview of previous approaches to the task, in Sect. 2. Then, in Sect. 3, we describe our induction approach. Section 4 describes our experimental setup. The results of our experiments are presented and discussed in Sect. 5. Finally, Sect. 6 summarizes the conclusions of our work and provides pointers for future work.

© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 274–285, 2020. https://doi.org/10.1007/9783030366872_23
2 Related Work
Before the shared task in the context of SemEval 2019, there were already some approaches to unsupervised semantic frame induction. For instance, LDA-Frames [18] relied on topic modeling and, more specifically, on Latent Dirichlet Allocation (LDA) [7], to jointly induce semantic frames and their frame-specific semantic roles. On the other hand, Ustalov et al. [28] approached the induction of frames through the triclustering of Subject-Verb-Object (SVO) triples using the Watset fuzzy graph-clustering algorithm [27], which induces word-sense information in the graph before clustering. However, although these approaches are able to induce semantic frames, they can only be applied to verb instances with certain characteristics, such as a fixed number of arguments.

Since we are approaching one of the subtasks defined in the context of SemEval 2019's Task 2, the most important approaches to describe in this section are those which competed in that subtask. Arefyev et al. [3] achieved the highest performance in the competition using a two-step agglomerative clustering approach. First, it generates a small set of large clusters containing instances of verbs which have at least one sense that evokes the same frame. Then, the verb instances of each cluster are clustered again to distinguish the different frames that are evoked according to the different senses. In both steps, the generation of the representations of the instances relies on BERT [11]. Nonetheless, while the first step relies on the contextualized representation given by an empirically selected layer of the model, the second step uses BERT as a language model to generate possible context words that provide cues for the sense of the verb instance. To do so, multiple Hearst-like patterns [15] are applied to the sentence in which the verb instance occurs and the context words correspond to those generated to fill the slots in the patterns. The representation of the instance is then given by a tf-idf-weighted average of the representations of the most probable context words. The number of clusters in the first step was obtained by performing grid search while clustering the development and test data together. The selected value corresponds to that which led to maximum performance on the development data. In the second step, clusters with fewer than 20 instances or containing specific undisclosed verbs were left intact. In the remainder, the number of clusters was selected to maximize the silhouette score.

Anwar et al. [2] used a simpler approach based on the agglomerative clustering of contextualized representations of the verb instances. The number of clusters was defined empirically. In the system submitted for participation in the competition, the contextualized representations were obtained by concatenating the context-free representation of the verb instance obtained using Word2Vec [19] with the tf-idf-weighted average of the representations of the remaining words in the sentence. However, in a post-evaluation experiment, better results were achieved using the mean of contextualized representations generated by ELMo [20].

Finally, Ribeiro et al. [22] also relied on contextualized representations of the verb instances, but used a graph-based approach. They experimented with both the sum of the representations generated by ELMo [20] and those generated by the last layer of the BERT model [11]. Better results were achieved with the former. The contextualized representations are used as the nodes in a graph and connected by a distance-weighted edge if the cosine distance between them is below a threshold based on a function of the mean and standard deviation of the pairwise distances between the nodes. Finally, the Chinese Whispers [6] algorithm is applied to the graph to identify communities of nodes that evoke the same frame. Although the high performance achieved on the development data did not generalize to the test data, this simple approach has the potential to achieve higher results with some modifications. Thus, the work described in this paper is based on this approach.
3 Semantic Frame Induction Approach
In general, our approach, summarized in Algorithm 1, is very similar to the one used by Ribeiro et al. [22] in the context of the SemEval shared task. It starts by generating a contextualized representation of each verb instance. These representations are then used as the nodes in a network or graph in which each pair of nodes is connected through an edge if the distance between them is below a certain threshold. Finally, the Chinese Whispers algorithm is applied to the graph to identify communities of verb instances that evoke the same frame. However, it has some key modifications that improve its performance.

Algorithm 1. Frame Induction Approach
Input: S // The set of sentences
Input: T // The set of head tokens to cluster
Input: Embed // The approach for generating contextualized representations
Input: d // The neighboring threshold
Output: C // The set of clusters
1: V ← {Embed(S_t, t) : t ∈ T}
2: D ← {1 − cos(θ_{v,v'}) : (v, v') ∈ V^2, v ≠ v'} // θ_{v,v'} is the angle between v and v'
3: W ← {1 − D_{v,v'} : (v, v') ∈ V^2, v ≠ v'} // The weights of the edges
4: E ← {(v, v', W_{v,v'}) : (v, v') ∈ V^2, v ≠ v', D_{v,v'} < d}
5: G ← (V, E)
6: C ← ChineseWhispers(G)
7: return C
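Steps 2 to 5 of Algorithm 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the graph is stored as a dictionary of neighbor dictionaries, all vectors are non-zero, and the function names and toy vectors are hypothetical rather than taken from the implementation used in the experiments.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cos(angle between u and v); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_graph(vectors, threshold):
    """Return {node: {neighbor: weight}} with similarity-weighted edges.

    Two nodes are connected when their cosine distance is below the
    threshold (step 4); the edge weight is the cosine similarity (step 3).
    """
    graph = {i: {} for i in range(len(vectors))}
    for i, j in combinations(range(len(vectors)), 2):
        dist = cosine_distance(vectors[i], vectors[j])
        if dist < threshold:
            weight = 1.0 - dist
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


# Toy vectors (assumption): nodes 0 and 1 point in nearly the same direction.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
graph = build_graph(vectors, threshold=0.47)
print(graph)
```

With these toy vectors only nodes 0 and 1 become neighbors, while node 2 stays isolated, which mirrors how the threshold controls how many instances are grouped together.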
Starting with the representation of verb instances, the use of contextualized word representations in all of the approaches that competed in the SemEval shared task shows their importance for distinguishing different word senses, which evoke different frames. Ribeiro et al. [22] experimented with representations generated by both ELMo and BERT and achieved better results using the former. Furthermore, in their experiments, Arefyev et al. [3] noticed that BERT tends to generate representations of the different forms of the same lexeme that are distant in terms of the typically used Euclidean and cosine distances. They tried to identify a distance metric that was appropriate for correlating such representations, but were unsuccessful. Thus, although it is not the current state-of-the-art approach for generating contextualized word representations, we rely on ELMo in our approach. The generated representations include a context-free representation and context information at two levels. According to the experiments performed by the authors of ELMo, the first level is typically related to the syntactic context, while the second is typically related to the semantic context. In addition to the combination of all information, we also explore the use of each level independently. This way, we are able to assess which information is actually important for the task. To generate the contextualized representation of multi-word verb instances, we use a dependency parser to identify the head word and use the corresponding representation, since it contains information from the other words.

In our approach, the contextualized representations of the verb instances are used as the nodes of a graph. To generate the edges, the first step is to calculate the pairwise distance between those representations. We use the cosine distance since it is bounded and the magnitude of word vectors is typically related to the number of occurrences. Thus, the angle between the vectors is a better indicator of similarity. Furthermore, the Euclidean distance has issues in spaces with high dimensionality. Still, we performed preliminary experiments to confirm that using the cosine distance leads to better results than the Euclidean distance.

Each pair of nodes in the graph is connected through an edge if the distance between them is below a certain threshold. The definition of this threshold is particularly important, since it controls the granularity of the induced frames. Having control over this granularity is important, since it allows us to induce more specific or more abstract frames, both of which are relevant in different scenarios. Furthermore, this control allows us to define granularity in a small set of instances and then induce frames with a similar granularity in a different set. The latter was the main issue of Ribeiro et al.'s [22] approach at SemEval, whose performance on the development set did not generalize to the test set. That happened because the threshold was selected using a function of the statistics of the distribution of pairwise distances, which vary according to the contexts covered by the datasets and the number of instances. Consequently, applying the same function on the development and test sets led to the generation of frames with different granularity. We fix this issue by defining the threshold through grid search on the development set and then using the same fixed threshold across sets.

Another difference of our approach is the weighting of the edges. While Ribeiro et al. [22] attributed a weight corresponding to the distance between the nodes, we weight the edges using the cosine similarity. This is more appropriate, since the Chinese Whispers [6] algorithm that we use to identify the communities of nodes that evoke the same frame attributes more importance to edges with higher weight.
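The threshold selection by grid search described above can be sketched as a simple loop over candidate values. The grid step of 0.01, the helper names, and the toy scoring function are assumptions made for illustration; in practice the scoring function would cluster the development set at threshold d and return the chosen evaluation metric.

```python
def select_threshold(candidates, evaluate):
    """Return the candidate with the highest score, plus all the scores."""
    scores = {d: evaluate(d) for d in candidates}
    best = max(scores, key=scores.get)
    return best, scores


def toy_dev_score(d):
    # Hypothetical stand-in for "cluster the development set with threshold d
    # and score the result"; the peak is placed at 0.47 purely for illustration.
    return 1.0 - abs(d - 0.47)


# Assumed grid: thresholds from 0.00 to 1.00 in steps of 0.01.
candidates = [round(0.01 * k, 2) for k in range(101)]
best_d, dev_scores = select_threshold(candidates, toy_dev_score)
print(best_d)

# The selected value is then reused unchanged on any other set, instead of
# being recomputed from that set's own distance statistics.
fixed_threshold_for_test = best_d
```

Keeping the threshold fixed across sets is the point of the design: a value derived from the statistics of each set would change the granularity from one set to the next.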
Chinese Whispers is a simple but effective graph-clustering algorithm based on the idea that nodes that broadcast the same message to their neighbors should be aggregated. It starts by attributing each node to a different cluster. Then, in each iteration, the nodes are processed in random order and are attributed to the cluster with the highest sum of edge weights in their neighborhood. This process is repeated until there are no changes or the maximum number of iterations is reached. Chinese Whispers is appropriate for this task since it identifies the number of clusters on its own, is able to handle clusters of different sizes, and scales well to large graphs. Furthermore, it typically outperforms other clustering approaches on NLP tasks.
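The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the graph format, the function name, and the tie-breaking rule (keep the current label when it is among the best, otherwise pick one of the best at random) are assumptions.

```python
import random
from collections import defaultdict


def chinese_whispers(graph, max_iterations=20, seed=None):
    """Cluster a weighted graph given as {node: {neighbor: weight}}."""
    rng = random.Random(seed)
    labels = {node: node for node in graph}  # one cluster per node at the start
    nodes = list(graph)
    for _ in range(max_iterations):
        rng.shuffle(nodes)  # random processing order in every iteration
        changed = False
        for node in nodes:
            if not graph[node]:
                continue  # isolated nodes keep their own cluster
            # Sum the edge weights per cluster label in the neighborhood.
            totals = defaultdict(float)
            for neighbor, weight in graph[node].items():
                totals[labels[neighbor]] += weight
            best = max(totals.values())
            winners = [label for label, total in totals.items() if total == best]
            if labels[node] in winners:
                continue  # assumption: keep the current label on ties
            labels[node] = rng.choice(winners)
            changed = True
        if not changed:
            break  # stop early once an iteration changes nothing
    return labels


# Toy graph (assumption): a triangle, a pair, and an isolated node.
toy = {
    0: {1: 0.9, 2: 0.8},
    1: {0: 0.9, 2: 0.85},
    2: {0: 0.8, 1: 0.85},
    3: {4: 0.9},
    4: {3: 0.9},
    5: {},
}
clusters = chinese_whispers(toy, seed=0)
print(len(set(clusters.values())))  # 3
```

The number of clusters is never passed in; it emerges from the connectivity of the graph, which is why the neighboring threshold ends up controlling the granularity of the result.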
4 Experimental Setup
In this section we describe our experimental setup in terms of data, evaluation approach, and implementation details.

4.1 Dataset
In our experiments, we used the same dataset used in the context of SemEval 2019's Task 2. This dataset consists of sentences extracted from the Penn Treebank 3.0 [17] and annotated with FrameNet frames. Since we are focusing on clustering verb instances into semantic frame heads, we are not interested in the annotations of the arguments. The development set consists of 600 verb instances extracted from 588 sentences and annotated with 41 different frames. The test set consists of 4,620 verb instances extracted from 3,346 sentences and annotated with 149 different frames. Additionally, all the sentences are annotated with morphosyntactic information in the CoNLL-U format [9].

4.2 Evaluation Approach
For direct comparison with the approaches that competed in SemEval's task, we evaluate our approach using the same metrics used on the task: Purity F1, which is the harmonic mean of purity and inverse purity [24], and BCubed F1, which is the harmonic mean of BCubed Precision and BCubed Recall [4]. While the former focuses on the quality of each cluster independently, the latter focuses on the distribution of instances of the same category across the clusters. Additionally, we report the number of induced clusters. Since the Chinese Whispers algorithm is not deterministic, the values we report for these metrics refer to the mean and standard deviation over 30 runs. Since we are approaching the problem from a network-based perspective, we also report the number of edges, the diameter, and the clustering coefficient of the network corresponding to the neighboring threshold with the highest performance in each scenario. In addition to the approaches that competed in SemEval's task, we also compare the performance of our approach with a baseline that consists of generating one cluster per verb.

4.3 Implementation Details
To obtain the contextualized representation of the verb instances we used the ELMo model provided by the AllenNLP package [13] to generate the contextualized embeddings for every sentence in the dataset and then selected the representations of the head token of each instance. The representation of each verb instance is then given by three vectors of dimensionality 1,024, corresponding to the context-free representation of the head token and the two levels of context information. We experimented both with each vector independently and with their combination. To combine the vectors we used their sum, since it represents the variation of the context-free representation according to the context. To apply the Chinese Whispers algorithm, we relied on Ustalov's [29] implementation in Python, which requires the graph to be built using the NetworkX package [14]. We did not use weight regularization and performed a maximum of 20 iterations. Finally, to obtain the syntactic dependencies used to determine the head token of multi-word verbs, we used the annotations provided with the dataset, which were obtained automatically using a dependency parser.
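The two evaluation metrics described in Sect. 4.2 can be sketched as follows. This is a minimal illustration based on the standard definitions of purity, inverse purity, and BCubed precision and recall, not the official scorer of the shared task; the function names and the toy labels are hypothetical, and the scores are returned as fractions rather than percentages.

```python
from collections import Counter


def _harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


def purity_f1(gold, pred):
    """Harmonic mean of purity and inverse purity."""
    n = len(gold)
    joint = Counter(zip(pred, gold))  # (cluster, class) -> number of items
    best_per_cluster = Counter()
    best_per_class = Counter()
    for (cluster, label), count in joint.items():
        best_per_cluster[cluster] = max(best_per_cluster[cluster], count)
        best_per_class[label] = max(best_per_class[label], count)
    purity = sum(best_per_cluster.values()) / n
    inverse_purity = sum(best_per_class.values()) / n
    return _harmonic_mean(purity, inverse_purity)


def bcubed_f1(gold, pred):
    """Harmonic mean of item-averaged BCubed precision and recall."""
    n = len(gold)
    joint = Counter(zip(pred, gold))
    cluster_sizes = Counter(pred)
    class_sizes = Counter(gold)
    # For each item: how many items share both its cluster and its class,
    # relative to the cluster size (precision) or the class size (recall).
    precision = sum(joint[(c, g)] / cluster_sizes[c] for c, g in zip(pred, gold)) / n
    recall = sum(joint[(c, g)] / class_sizes[g] for c, g in zip(pred, gold)) / n
    return _harmonic_mean(precision, recall)


# Toy labels (assumption): one instance of frame "A" lands in the wrong cluster.
gold = ["A", "A", "A", "B", "B"]
pred = [0, 0, 1, 1, 1]
print(purity_f1(gold, pred))
print(bcubed_f1(gold, pred))
```

In the toy example a single misplaced instance lowers BCubed F1 more than Purity F1, because BCubed penalizes every item that shares a cluster with the misplaced one.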
5 Results and Discussion
Before starting the discussion, it is important to make some remarks regarding the presentation of the results. First, although the cosine distance varies in the interval [0, 2], for readability, we only plot the results in the interval [0, 1], since for neighboring thresholds above that value the verb instances are always grouped into a single cluster. Furthermore, we do not include the value of the graph diameter in our tables, since the graph corresponding to the threshold that leads to the highest performance in each scenario is never connected. Thus, the diameter is always infinite.
Fig. 1. Results on the development data using the different levels of ELMo representations. The x axis refers to the neighboring threshold used to create the edges.
Table 1. Results on the development data using the different levels of ELMo representations. d refers to the neighboring threshold. CC refers to the clustering coefficient.

Representation       d      Edges    CC     Clusters        Purity F1       BCubed F1
Context-Free         0.57   39,441   0.97   22.03 ± 0.31    95.57 ± 0.58    93.35 ± 0.71
Syntactic Context    0.41   27,201   0.80   30.93 ± 0.25    94.32 ± 0.16    91.65 ± 0.20
Semantic Context     0.53   20,660   0.67   24.47 ± 0.52    88.64 ± 0.33    81.92 ± 0.54
-------------------------------------------------------------------------------------
Free + Syntactic     0.47   37,913   0.94   22.97 ± 0.41    95.83 ± 0.28    93.66 ± 0.32
All                  0.45   21,448   0.73   34.73 ± 0.51    93.93 ± 0.30    91.04 ± 0.59
Starting with the information provided by the multiple levels included in ELMo representations, in Fig. 1 and the first block of Table 1, we can see that, independently, the context-free representation is the most informative of the three and the most robust to changes in the threshold, with a wide interval with reduced decrease in performance around the threshold with the highest performance. The initial drop in the number of clusters is due to its lack of context information, which makes all the instances of the same verb become connected as soon as the threshold is higher than zero.
The lower performance of the levels that provide context information on their own was expected, since they represent changes in the word sense of the verb according to the context, but lack information regarding the verb itself. Surprisingly, the level that typically captures the semantic context leads to worse performance than that which captures syntactic context and even harms performance in combination with the other levels. However, this can be explained by the fact that the ELMo model was trained for a speciﬁc task and, consequently, the semantic context is overﬁt to that task. On the other hand, the syntactic context is more generic and, since the sense of a verb can be related to the syntactic tree in which it occurs, it provides important information for the task.
Fig. 2. Results on the development data according to the weighting of the edges. The x axis refers to the neighboring threshold used to create the edges.
As shown in the second block of Table 1, the highest performance is achieved when using the combination of the context-free representation and the syntactic context. Still, the average increase in BCubed F1 in relation to using the context-free representation on its own is just 0.33 percentage points, which suggests that the context information is only able to disambiguate a small number of specific cases. However, the threshold that leads to the highest performance in the combination is lower. This means that the graph has fewer edges and, consequently, is less connected. Still, the number of clusters, around 23, is nearly half of the number of frames in the gold standard, 41, which means that the graph should be even less connected. Since the performance decreases for lower thresholds, this suggests that either the representations or the distance metric are unable to capture all the information required to group the instances in FrameNet-like frames.

Table 2. Results on the development data according to the weighting of the edges. d refers to the neighboring threshold. CC refers to the clustering coefficient.

Weighting     d      Edges    CC     Clusters        Purity F1       BCubed F1
Weighted      0.47   37,913   0.94   22.97 ± 0.41    95.83 ± 0.28    93.66 ± 0.32
Unweighted    0.46   37,415   0.93   22.90 ± 0.30    95.77 ± 0.16    93.56 ± 0.31
Regarding the weighting of the edges, the results in Table 2 show that the difference in average top performance is just 0.06 and 0.10 percentage points in terms of Purity F1 and BCubed F1, respectively. This suggests that the presence of the edges is more important for the approach than their weight. Still, in Fig. 2 we can see that using weighted edges increases the robustness of the approach to changes in the neighboring threshold.
Fig. 3. Results on the test data. The x axis refers to the neighboring threshold used to create the edges.
Table 3. Results on the test data. d refers to the neighboring threshold. CC refers to the clustering coefficient.

Threshold         d      Edges     CC     Clusters         Purity F1       BCubed F1
Dev. Threshold    0.47   347,202   0.91   196.63 ± 1.68    79.97 ± 0.21    73.07 ± 0.25
Best Threshold    0.49   364,829   0.91   186.33 ± 0.98    80.26 ± 0.17    73.43 ± 0.19
Figure 3 shows the results achieved when applying the same approach to the test data. Although the performance is lower, we can observe patterns similar to those observed on the development data. The only difference is that there is a more pronounced performance drop immediately after the threshold that leads to the highest performance. Nonetheless, as shown in Table 3, the threshold selected on development data, 0.47, is slightly lower than, but very close to, the best threshold on test data, 0.49. This shows that our grid-search approach to defining the threshold generalizes well. Still, the average performance loss in relation to using the best threshold is 0.29 and 0.36 percentage points in terms of Purity F1 and BCubed F1, respectively. It is interesting to observe that, contrary to what happened on development data, the approach overestimates the number of clusters. However, this can be explained by the fact that the test data includes more instances of different verbs that evoke the same frame. Once again, this suggests that either the representations or the distance metric are unable to capture all the required information.
Semantic Frame Induction
Table 4. Comparison with previous approaches in terms of performance on the test data.

                                 Purity F1   BCubed F1
Baseline                         73.78       65.35
Ribeiro et al. [22]              75.25       65.32
Anwar et al. [2]                 76.68       68.10
Arefyev et al. [3]               78.15       70.70
Our Approach (Dev. Threshold)    79.97       73.07
Finally, Table 4 compares the results of our approach with those of the systems that competed in the SemEval shared task. First of all, it is worth noting that while Ribeiro et al.'s [22] approach, on which ours is based, performed worse than the one-frame-per-verb baseline, ours surpasses it by 4.37 percentage points in terms of Purity F1 and 7.72 percentage points in terms of BCubed F1. This shows the importance of discarding the semantic context provided in the ELMo representations and, most importantly, of identifying a neighboring threshold that allows the approach to generalize. Furthermore, our approach also outperforms the more complex approach by Arefyev et al. [3] by 2.37 percentage points in terms of BCubed F1. Consequently, it achieves the current state-of-the-art performance on the task.
6
Conclusions
In this paper we have approached semantic frame induction as a community detection problem by applying the Chinese Whispers graph-clustering algorithm to a network with contextualized representations of verb instances as nodes, connected by an edge if the cosine distance between them is below a threshold that defines the granularity of the induced frames. We have shown that the best performance is achieved when using verb instance representations given by the combination of the context-free and syntactical context levels of ELMo representations. The semantic context level impairs the performance since it is overfit to the task on which the model was trained. We have also observed that weighting the edges with the cosine similarity between the nodes improves the robustness to changes in the neighboring threshold.

We have performed our experiments on the benchmark dataset defined in the context of SemEval 2019's Task 2, which allows us to compare our results with those of previous approaches. In this context, the most important step is to identify the threshold that defines the correct granularity according to the gold standard annotations. We did so by performing grid search on the development data and used the same fixed threshold on the test data. This way, we solved the main issue of the approach on which ours was based, which was its lack of generalization ability. In fact, the difference between the best threshold on the development set and that which would lead to the best performance on the test set was just 0.02. Using this approach we were able to outperform the more complex approach that won the SemEval shared task by 2.37 percentage points in terms of BCubed F1. Thus, it achieves the current state-of-the-art performance on the task.

Although we were able to outperform all the previous approaches on the task, the 73.07 BCubed F1 score achieved on the test data shows that the approach is not able to capture all the information required to induce FrameNet-like frames and that there is still room for improvement. Thus, as future work, we intend to assess the cases that our approach fails to cluster to check whether a different clustering approach or additional features are required, or an adaptation of the contextualized representations is enough. Regarding the latter, it would be interesting to assess whether fine-tuning the ELMo representations to the task would make the semantic context level provide relevant information. Finally, since this approach achieves state-of-the-art performance when inducing semantic frames from verb instances, we intend to assess whether it is also appropriate to induce the semantic roles and the frame-specific slots filled by the arguments of the verbs.

Acknowledgements. This work was supported by Portuguese national funds through Fundação para a Ciência e a Tecnologia (FCT), with reference UID/CEC/50021/2019, and PT2020, project number 39703 (AppRecommender).
References
1. Aharon, R.B., Szpektor, I., Dagan, I.: Generating entailment rules from FrameNet. In: ACL, vol. 2, pp. 241–246 (2010)
2. Anwar, S., Ustalov, D., Arefyev, N., Ponzetto, S.P., Biemann, C., Panchenko, A.: HHMM at SemEval-2019 Task 2: unsupervised frame induction using contextualized word embeddings. In: SemEval, pp. 125–129 (2019)
3. Arefyev, N., Sheludko, B., Davletov, A., Kharchev, D., Nevidomsky, A., Panchenko, A.: Neural GRANNy at SemEval-2019 Task 2: a combined approach for better modeling of semantic relationships in semantic frame induction. In: SemEval, pp. 31–38 (2019)
4. Bagga, A., Baldwin, B.: Algorithms for scoring coreference chains. In: LREC, pp. 563–566 (1998)
5. Baker, C.F., Fillmore, C.J., Lowe, J.B.: The Berkeley FrameNet project. In: ACL/COLING, vol. 1, pp. 86–90 (1998)
6. Biemann, C.: Chinese whispers: an efficient graph clustering algorithm and its application to natural language processing problems. In: Workshop on Graph-based Methods for Natural Language Processing, pp. 73–80 (2006)
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
8. Boas, H.C. (ed.): Multilingual FrameNets in Computational Lexicography: Methods and Applications. Mouton de Gruyter, Berlin (2009)
9. Buchholz, S., Marsi, E.: CoNLL-X shared task on multilingual dependency parsing. In: CoNLL, pp. 149–164 (2006)
10. Das, D., Chen, D., Martins, A.F.T., Schneider, N., Smith, N.A.: Frame-semantic parsing. Comput. Linguist. 40(1), 9–56 (2014)
11. Devlin, J., Chang, M.W., Kenton, L., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, vol. 1, pp. 4171–4186 (2019)
12. Fillmore, C.J.: Frame semantics and the nature of language. In: Annals of the New York Academy of Sciences (Origins and Evolution of Language and Speech), vol. 280, pp. 20–32 (1976)
13. Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N.F., Peters, M., Schmitz, M., Zettlemoyer, L.S.: AllenNLP: a deep semantic natural language processing platform. CoRR abs/1803.07640 (2017). http://arxiv.org/abs/1803.07640
14. Hagberg, A., Schult, D., Swart, P.: NetworkX. GitHub (2004). https://networkx.github.io/
15. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: COLING, vol. 2, pp. 539–545 (1992)
16. Lang, J., Lapata, M.: Similarity-driven semantic role induction via graph partitioning. Comput. Linguist. 40(3), 633–670 (2014)
17. Marcus, M., Santorini, B., Marcinkiewicz, M.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 330–331 (1993)
18. Materna, J.: LDA-frames: an unsupervised approach to generating semantic frames. In: CICLing, pp. 376–387 (2012)
19. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
20. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: NAACL-HLT, vol. 1, pp. 2227–2237 (2018)
21. QasemiZadeh, B., Petruck, M.R.L., Stodden, R., Kallmeyer, L., Candito, M.: SemEval-2019 Task 2: unsupervised lexical frame induction. In: SemEval, pp. 16–30 (2019)
22. Ribeiro, E., Mendonça, V., Ribeiro, R., Martins de Matos, D., Sardinha, A., Santos, A.L., Coheur, L.: L2F/INESC-ID at SemEval-2019 Task 2: unsupervised lexical semantic frame induction using contextualized word representations. In: SemEval, pp. 130–136 (2019)
23. Shen, D., Lapata, M.: Using semantic roles to improve question answering. In: EMNLP-CoNLL, pp. 12–21 (2007)
24. Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining (2000)
25. Titov, I., Khoddam, E.: Unsupervised induction of semantic roles within a reconstruction-error minimization framework. In: NAACL-HLT, vol. 1, pp. 1–10 (2015)
26. Titov, I., Klementiev, A.: A Bayesian approach to unsupervised semantic role induction. In: EACL, vol. 1, pp. 12–22 (2012)
27. Ustalov, D., Panchenko, A., Biemann, C.: Watset: automatic induction of synsets from a graph of synonyms. In: ACL, vol. 1, pp. 1579–1590 (2017)
28. Ustalov, D., Panchenko, A., Kutuzov, A., Biemann, C., Ponzetto, S.P.: Unsupervised semantic frame induction using triclustering. In: ACL, vol. 2, pp. 55–62 (2018)
29. Ustalov, D., et al.: Chinese Whispers for Python. GitHub (2018). https://github.com/nlpub/chinese-whispers-python/
A New Measure of Modularity in Hypergraphs: Theoretical Insights and Implications for Effective Clustering
Tarun Kumar1,2(B), Sankaran Vaidyanathan3, Harini Ananthapadmanabhan2, Srinivasan Parthasarathy4, and Balaraman Ravindran1,2
1 Robert Bosch Centre for Data Science and AI (RBCDSAI), Chennai, India
2 Department of Computer Science and Engineering, IIT Madras, Chennai, India
[email protected]
3 College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, USA
4 Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
Abstract. Many real-world systems consist of entities that exhibit complex group interactions rather than simple pairwise relationships; such multi-way relations are more suitably modeled using hypergraphs. In this work, we generalize the framework of modularity maximization, commonly used for community detection on graphs, for the hypergraph clustering problem. We introduce a hypergraph null model that can be shown to correspond exactly to the configuration model for undirected graphs. We then derive an adjacency matrix reduction that preserves the hypergraph node degree sequence, for use with this null model. The resultant modularity function can be maximized using the Louvain method, a popular fast algorithm known to work well in practice for graphs. We additionally propose an iterative refinement over this clustering that exploits higher-order information within the hypergraph, seeking to encourage balanced hyperedge cuts. We demonstrate the efficacy of our methods on several real-world datasets.
1
Introduction
While most approaches for learning clusters on graphs assume pairwise (or dyadic) relationships between entities, many entities in real-world network systems engage in more complex, multi-way (super-dyadic) relations. Hypergraphs provide a natural representation for such super-dyadic relations; for example, in a co-citation network, a hyperedge could represent a group of co-cited papers. Indeed, learning on hypergraphs has been gaining recent traction [7,21,25,26]. Analogous to the graph clustering task, hypergraph clustering seeks to find dense connected components within a hypergraph [19]. This has been applied to varied problems such as VLSI placement [11], image segmentation [13], and modeling eco-biological systems [6], among others.

T. Kumar and S. Vaidyanathan: Equal contribution.
S. Vaidyanathan: Work done while the author was at IIT Madras.
H. Ananthapadmanabhan: Currently at Google, Bangalore, India.
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 286–297, 2020. https://doi.org/10.1007/978-3-030-36687-2_24

A few previous works on hypergraph clustering [1,14,15,20,23] have limited their focus to k-uniform hypergraphs, where all hyperedges have the same fixed size. [27] extends the Spectral Clustering framework to general hypergraphs by proposing a suitable hypergraph Laplacian, which implicitly defines a reduction of the hypergraph to a graph [2,16]. Modularity maximization [17] is an alternative methodology for clustering on graphs, which additionally provides a useful metric for measuring cluster quality in the modularity function and returns the number of clusters automatically. In practice, a fast and scalable greedy optimization algorithm known as the Louvain method [3] is commonly used.

However, extending the modularity function to hypergraphs is not straightforward. One approach would be to reduce a hypergraph to a simple graph using a clique expansion and then employ a standard modularity-based solution. Such an approach would lose critical information encoded within the super-dyadic hyperedge structure. A clique expansion would also not preserve the hypergraph's node degrees, which are required for the null model that modularity maximization methods are based on. Encoding the hyperedge-centric information present within the hypergraph is key to the development of an appropriate modularity-based framework for clustering.

Additionally, when viewing the clustering problem via a minimization function (analogous to minimizing the cut), there are multiple ways to cut a hyperedge. Based on the proportion and assignments of nodes on different sides of the cut, the clustering will change. One way of incorporating information based on properties of hyperedges or their vertices is to introduce hyperedge weights based on a metric or function of the data.
Building on this idea, we make the following contributions in this work:
– We define a null model on hypergraphs, and prove its equivalence to the configuration model [18] for undirected graphs. We derive a node-degree-preserving graph reduction to satisfy this null model. Subsequently, we define a modularity function using the above which can be maximized using the Louvain method.
– We propose an iterative hyperedge reweighting procedure that leverages information from the hypergraph structure and the balance of hyperedge cuts.
– We empirically evaluate the resultant algorithm, titled Iteratively Reweighted Modularity Maximization (IRMM), on several real-world datasets and demonstrate its efficacy and efficiency over competitive baselines.
2
Background
2.1
Hypergraphs
Let $G = (V, E, w)$ be a hypergraph, with vertex set $V$ and hyperedge set $E$. Each hyperedge $e$ can be associated with a positive weight $w(e)$. The degree of a vertex $v$ is denoted as $d(v) = \sum_{e \in E, v \in e} w(e)$. The degree of a hyperedge $e$ is the number of nodes it contains, denoted by $\delta(e) = |e|$. We denote the number of vertices as $n = |V|$ and the number of edges as $m = |E|$. The incidence matrix $H$ is given by $h(v, e) = 1$ if vertex $v$ is in hyperedge $e$, and 0 otherwise. $W$ is the hyperedge weight matrix and $D_e$ is the edge degree matrix; these are diagonal matrices of size $m \times m$. $D_v$ is the vertex degree matrix, of size $n \times n$.

Clique Reduction: For any hypergraph, one can compute its clique reduction [9] by replacing each hyperedge by a clique formed from its node set. The adjacency matrix for the clique reduction of a hypergraph with incidence matrix $H$ is $A = H W H^T$. $D_v$ may be subtracted from this matrix to remove self-loops.

2.2
Modularity
Modularity [17] is a metric of clustering quality that measures whether the number of within-cluster edges is greater than its expected value. It is defined as:

$$Q = \frac{1}{2m} \sum_{ij} [A_{ij} - P_{ij}] \, \delta(g_i, g_j) \qquad (1)$$

where $g_i$ is the cluster containing node $i$, $\delta(g_i, g_j)$ equals 1 if $g_i = g_j$ and 0 otherwise, and $P_{ij}$ denotes the expected number of edges between nodes $i$ and $j$. The configuration model [18] used for graphs produces random graphs with a fixed degree sequence, by drawing random edges such that the node degrees are preserved. For two nodes $i$ and $j$, with degrees $k_i$ and $k_j$ respectively, we have:

$$P_{ij} = \frac{k_i k_j}{\sum_{j \in V} k_j}$$

3
Hypergraph Modularity
We propose a simple but novel node-degree-preserving null model for hypergraphs. Analogous to the configuration model for graphs, the sampling probability for a node is proportional to the number (or in the weighted case, the total weight) of hyperedges it participates in. Specifically, we have:

$$P^{hyp}_{ij} = \frac{d(i) \times d(j)}{\sum_{v \in V} d(v)} \qquad (2)$$

The above null model preserves the node degree sequence of the original hypergraph. When using this null model to define modularity, we get the expected number of hyperedges that two nodes $i$ and $j$ participate in. However, when taking the clique reduction, the degree of a node in the corresponding graph is not the same as its degree in the original hypergraph, as verified below.

Lemma 1. For the clique reduction of a hypergraph with incidence matrix $H$, the degree of a node $i$ in the reduced graph is given by:

$$k_i = \sum_{e \in E} H(i, e) \, w(e) \, (\delta(e) - 1)$$
Proof. The adjacency matrix of the reduced graph is given by $A^{clique} = H W H^T$, with entries

$$(H W H^T)_{ij} = \sum_{e \in E} H(i, e) \, w(e) \, H(j, e)$$

Note that we do not have to consider self-loops, since they are not cut during the modularity maximization process. This is done by explicitly setting $A_{ii} = 0$ for all $i$. We can write the degree of a node $i$ in the reduced graph as:

$$k_i = \sum_{j} A_{ij} = \sum_{j: j \neq i} \sum_{e \in E} H(i, e) \, w(e) \, H(j, e)$$

$$= \sum_{e \in E} H(i, e) \, w(e) \sum_{j: j \neq i} H(j, e)$$

$$= \sum_{e \in E} H(i, e) \, w(e) \, (\delta(e) - 1)$$
As shown above, the node degree is overcounted by a factor of $(\delta(e) - 1)$ for each hyperedge $e$. We can hence correct it by scaling down each $w(e)$ by a factor of $(\delta(e) - 1)$. This leads to the following corrected adjacency matrix:

$$A^{hyp} = H W (D_e - I)^{-1} H^T \qquad (3)$$

Proposition 1. For the reduction of a hypergraph given by the adjacency matrix $A = H W (D_e - I)^{-1} H^T$, the degree of a node $i$ in the reduced graph (denoted $k_i$) is equal to its hypergraph node degree $d(i)$.

Proof. We have

$$(H W (D_e - I)^{-1} H^T)_{ij} = \sum_{e \in E} \frac{H(i, e) \, w(e) \, H(j, e)}{\delta(e) - 1}$$

Note again that we do not have to consider self-loops, since they are not cut during the modularity maximization process (explicitly setting $A_{ii} = 0$ for all $i$). We can rewrite the degree of a node $i$ in the reduced graph as

$$k_i = \sum_{j} A_{ij} = \sum_{e \in E} \frac{H(i, e) \, w(e)}{\delta(e) - 1} \sum_{j: j \neq i} H(j, e) = \sum_{e \in E} H(i, e) \, w(e) = d(i)$$
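Lemma 1 and Proposition 1 can be checked numerically on a small example. The sketch below is illustrative only: the function names and the toy hypergraph are hypothetical, hyperedges are assumed to be lists of distinct node indices, and singleton hyperedges are skipped because $\delta(e) - 1 = 0$ for them.

```python
def reduce_hypergraph(n, hyperedges, weights, degree_preserving=True):
    """Reduce a hypergraph to a weighted graph adjacency matrix.

    degree_preserving=False gives the clique reduction H W H^T;
    degree_preserving=True gives H W (D_e - I)^{-1} H^T.
    The diagonal is left at zero in both cases (no self-loops).
    """
    A = [[0.0] * n for _ in range(n)]
    for nodes, w in zip(hyperedges, weights):
        size = len(nodes)  # delta(e)
        if size < 2:
            continue  # assumption: singleton hyperedges contribute nothing
        scale = w / (size - 1) if degree_preserving else w
        for i in nodes:
            for j in nodes:
                if i != j:
                    A[i][j] += scale
    return A


def hypergraph_degrees(n, hyperedges, weights):
    """d(v): total weight of the hyperedges that contain v."""
    d = [0.0] * n
    for nodes, w in zip(hyperedges, weights):
        for v in nodes:
            d[v] += w
    return d


# Hypothetical toy hypergraph with 5 nodes and 3 weighted hyperedges.
toy_hyperedges = [[0, 1, 2], [1, 2, 3, 4], [0, 4]]
toy_weights = [1.0, 2.0, 1.0]

d = hypergraph_degrees(5, toy_hyperedges, toy_weights)
k_clique = [sum(row) for row in reduce_hypergraph(5, toy_hyperedges, toy_weights, False)]
k_hyp = [sum(row) for row in reduce_hypergraph(5, toy_hyperedges, toy_weights, True)]
```

On this example the row sums of the clique reduction exceed the hypergraph degrees, as Lemma 1 predicts, while the row sums of the corrected reduction match them up to floating-point error.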
We can use this node-degree-preserving reduction, with the diagonals zeroed out, to implement the null model from Eq. 2. As in Eq. 1, we can obtain an expression for the hypergraph modularity, which can be maximized using a Louvain-style algorithm.

$$Q^{hyp} = \frac{1}{2m} \sum_{ij} [A^{hyp}_{ij} - P^{hyp}_{ij}] \, \delta(g_i, g_j) \qquad (4)$$
As with any weighted graph, the range of this function is $[-1, 1]$. We would get $Q^{hyp} = -1$ when no pair of nodes in a hyperedge belong to the same cluster, and $Q^{hyp} = 1$ when any two nodes that are part of the same hyperedge are always part of the same cluster. $Q^{hyp} = 0$ when, for any pair of nodes $i$ and $j$, the number of hyperedges that contain both $i$ and $j$ is equal to the number of randomly wired hyperedges containing both $i$ and $j$, given by the null model.
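As a rough illustration of Eqs. 2 to 4, the following sketch evaluates the hypergraph modularity of a given hard clustering. It is not the authors' implementation: the function name is hypothetical, every hyperedge is assumed to contain at least two distinct nodes, and the $1/(2m)$ factor is interpreted as one over the total degree $\sum_v d(v)$, as is usual for weighted graph modularity.

```python
def hypergraph_modularity(n, hyperedges, weights, labels):
    """Q_hyp of a hard clustering, following Eqs. 2-4.

    Assumptions: every hyperedge has at least two distinct nodes, and the
    1/(2m) factor is taken as 1 / sum_v d(v).
    """
    A = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for nodes, w in zip(hyperedges, weights):
        scale = w / (len(nodes) - 1)  # degree-preserving reduction (Eq. 3)
        for i in nodes:
            d[i] += w
            for j in nodes:
                if i != j:
                    A[i][j] += scale
    total = sum(d)
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed weight minus the null-model expectation (Eq. 2).
                q += A[i][j] - d[i] * d[j] / total
    return q / total


# Two disjoint hyperedges of three nodes each (hypothetical toy data).
toy_hyperedges = [[0, 1, 2], [3, 4, 5]]
toy_weights = [1.0, 1.0]
q_split = hypergraph_modularity(6, toy_hyperedges, toy_weights, [0, 0, 0, 1, 1, 1])
q_single = hypergraph_modularity(6, toy_hyperedges, toy_weights, [0] * 6)
```

Under these assumptions, placing each hyperedge in its own cluster scores higher than merging everything into a single cluster, whose modularity is zero.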
4
Iterative Hyperedge Reweighting
When improving clustering, we look at minimizing the number of between-cluster edges that get cut, which for a hypergraph is given by the total volume of the hyperedge cut. We first consider the two-cluster case as in [27], where the set $V$ is partitioned into two clusters, $S$ and $S^c$. For a given hyperedge $e$, the volume of the cut is proportional to $|e \cap S| \, |e \cap S^c|$, which gives the number of cut sub-edges when the hyperedge is reduced to a clique. This is minimized when all vertices of $e$ go into one partition. A min-cut algorithm would hence favour cuts that are as unbalanced as possible [10].

For a given hyperedge, if there were a larger portion of its vertices in one cluster and a smaller portion in the other, it is likely that the smaller group of vertices are actually similar to the rest and should be pulled into the larger cluster. Similarly, the vertices in a hyperedge that is cut equally between clusters are equally likely to lie in either of the clusters. We would hence want unbalanced hyperedges to be retained in clusters and the more balanced hyperedges to be cut. This can be done by increasing the weights of hyperedges that get unbalanced cuts, and decreasing the weights of hyperedges that get more balanced cuts. Considering the case where a hyperedge is partitioned into two, for a cut hyperedge with $k_1$ and $k_2$ nodes in each partition ($k_1, k_2 \neq 0$), we have the following equation that operationalizes the aforementioned scheme:

$$t = \left( \frac{1}{k_1} + \frac{1}{k_2} \right) \times \delta(e) \qquad (5)$$

For two partitions, $\delta(e) = k_1 + k_2$. In Fig. 1, note that $t$ is minimized when $k_1 = k_2 = \delta(e)/2$, which gives $t = 4$. We can then generalize Eq. 5 to $c$ partitions as follows:

$$w'(e) = \frac{1}{m} \sum_{i=1}^{c} \frac{1}{k_i + 1} [\delta(e) + c] \qquad (6)$$

Here, both the $+1$ and $+c$ terms are added for smoothing, to account for cases where any of the $k_i$'s are zero. We divide by $m$ to normalize the weights (Fig. 1). Let $w_t(e)$ be the weight of hyperedge $e$ in the $t$-th iteration, and $w'(e)$ be the weight computed at a given iteration; then the weight update rule can be written as:

$$w_{t+1}(e) = \alpha w_t(e) + (1 - \alpha) w'(e) \qquad (7)$$
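The reweighting rule of Eqs. 5 to 7 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, $m$ is assumed to be the number of hyperedges, $c$ is assumed to be the total number of clusters (so that $k_i$ can be zero), and labels are assumed to be integer cluster ids in range(num_clusters).

```python
from collections import Counter


def balance_score(k1, k2):
    """Eq. 5: t = (1/k1 + 1/k2) * delta(e) for a two-way cut, k1, k2 > 0."""
    return (1.0 / k1 + 1.0 / k2) * (k1 + k2)


def reweight(hyperedges, weights, labels, num_clusters, alpha=0.5):
    """One reweighting pass over all hyperedges (Eqs. 6 and 7).

    Assumptions: m is the number of hyperedges, and labels[v] is an
    integer cluster id in range(num_clusters) for every node v.
    """
    m = len(hyperedges)
    updated = []
    for nodes, w_old in zip(hyperedges, weights):
        counts = Counter(labels[v] for v in nodes)
        # k_i is zero for clusters that the hyperedge does not touch.
        inv_sum = sum(1.0 / (counts.get(c, 0) + 1) for c in range(num_clusters))
        w_new = inv_sum * (len(nodes) + num_clusters) / m      # Eq. 6
        updated.append(alpha * w_old + (1.0 - alpha) * w_new)  # Eq. 7
    return updated


# Hypothetical toy input: one hyperedge cut 3:1, one hyperedge left uncut.
toy_hyperedges = [[0, 1, 2, 3], [0, 1]]
toy_labels = [0, 0, 0, 1]
new_weights = reweight(toy_hyperedges, [1.0, 1.0], toy_labels, num_clusters=2)
```

Evaluating balance_score for a 2/18 split and a 10/10 split reproduces the two values of $t$ shown in Fig. 1, with the balanced split receiving the smaller score.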
Fig. 1. Reweighting for different hyperedge cuts. Panel values: $t = (1/2 + 1/18) \times 20 = 11.111$ and $t = (1/10 + 1/10) \times 20 = 4$.
4.1
A Simple Example
The following example illustrates the effect of a single iteration of hyperedge reweighting. As seen in Fig. 2, the initial clustering of this hypergraph resulted in two highly unbalanced cuts. Cut 1 and Cut 2 each split hyperedge h2 in a 1:4 ratio, and also each split hyperedge h3 in a 1:2 ratio. Cut 3 splits hyperedge h1 in a 2:3 ratio. After reweighting, the 1:4 splits are removed. This reduces the number of cuts from 3 to just 1, leaving two neat clusters. The single nodes in h1 and h3, initially assigned to another cluster, have been pulled back into their respective (larger) clusters. This captures the intended behaviour of the reweighting scheme described earlier.
Fig. 2. Effect of iterative reweighting: (a) before reweighting, (b) after reweighting.
5
Evaluation on Ground Truth
We use the average F1 measure [24] and Rand Index scores to evaluate clustering performance on real-world data with ground truth class labels. Average F1 scores are obtained by computing the F1 score of the best matching ground-truth class to the observed cluster, and the F1 score of the best matching observed cluster to the ground truth class, then averaging the two scores. The proposed methods are shown in the results table as Hypergraph-Louvain and IRMM.

Number of Clusters Returned: While implementing the modularity function defined in Eq. 4, we use the Louvain method to find clusters by maximizing the hypergraph modularity. By default, the algorithm returns the number of clusters. To return a fixed number of clusters c, we used hierarchical clustering with the average linkage criterion as a post-processing step.

Settings for IRMM: We tuned the hyperparameter α over the set of values 0.1, 0.2, ..., 0.9. We found no considerable difference in the resultant F1 scores, and only minimal difference in the rate of convergence, over a wide range of values. As α is a scalar coefficient in a moving average, it did not result in a large difference in resultant weight values when set in this useful range. Hence, for the experiments, we chose to leave it at α = 0.5. For the iterations, we set the stopping threshold for the weights at 0.01.

5.1
Compared Methods
Clique Reductions: We took the clique reduction of the hypergraph and ran the graph versions of Spectral Clustering and the Louvain method. These are referred to as Clique-Spectral and Clique-Louvain respectively.

Hypergraph Spectral Clustering: Here, the top c eigenvectors of the Hypergraph Laplacian (as defined in [27]) were found, and then clustered using bisecting k-means. We refer to this method as Hypergraph-Spectral.

hMETIS [12] and PaToH [5]: These are hypergraph partitioning algorithms that are commonly used. We used the codes provided by the respective authors.

5.2
Datasets
For all datasets, we took the single largest connected component of the hypergraph. In each case, the class labels were taken as ground truth clusters. More statistics on the datasets are given in Table 1.

Table 1. Dataset description

Dataset           # nodes   # hyperedges   Avg. hyperedge degree   Avg. node degree   # classes
TwitterFootball   234       3587           15.491                  237.474            20
Cora              2708      2222           3.443                   2.825              7
Citeseer          3264      3702           27.988                  31.745             6
MovieLens         3893      4677           79.875                  95.961             2
Arnetminer        21375     38446          4.686                   8.429              10
MovieLens [4]: We used the director relation to define hyperedges and build a co-director hypergraph, where the nodes represent movies. A group of nodes is connected by a hyperedge if they were directed by the same individual.

Cora and Citeseer: In these datasets, nodes represent papers. The nodes are connected by a hyperedge if they share the same set of words [22].

TwitterFootball: This is taken from one view of the Twitter dataset [8]. It represents members of 20 different football clubs of the English Premier League. Here, the nodes are the players, and hyperedges are formed based on whether they are co-listed.

Arnetminer: In this significantly larger co-citation network, the nodes represent papers and hyperedges are formed between co-cited papers. We used the nodes from the CS discipline, and its 10 sub-disciplines were treated as clusters.

5.3
Experiments
We compare the average F1 and Rand Index (RI) scores for the different datasets on all the given methods. For Louvain methods, the number of clusters was returned by the algorithm in an unsupervised manner and the same number was used for spectral methods. The results are given in Table 2. We also ran the same experiments with the number of clusters set to the number of ground truth classes, using the post-processing methodology described earlier (Table 3).

5.4
Results
In both experiment settings, IRMM shows the best average F1 scores on all datasets, and the best Rand Index scores on all but one dataset. Additionally, both hypergraph modularity maximization methods show competitive performance with respect to the baselines. Figure 3 shows the results for varying number of clusters. Clique-Louvain sometimes returned a lower number of clusters than IRMM, and hence these curves are shorter in some of the plots. On TwitterFootball, IRMM returns fewer clusters than the number of ground truth classes, and hence the corresponding entry in Table 3 is left blank. On some datasets, the best performance is achieved when the number of clusters returned by the Louvain method is used (e.g., Citeseer, Cora), and on others when the ground truth number of classes is used (e.g., Arnetminer, MovieLens). This could be based on the structure of clusters in the network and its relationship to the ground truth classes. A class could have comprised multiple smaller clusters, which were detected by the Louvain algorithm. In other cases, the cluster structure could have corresponded better to the class labels. It is evident that hypergraph-based methods outperform the respective clique reduction methods on all datasets and in both experiment settings. We infer that the super-dyadic relational information captured by the hypergraph has a positive impact on the clustering performance.
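For reference, the two evaluation scores used throughout this section can be computed along the following lines. This is a sketch under stated assumptions, not the evaluation code used for the tables: the function names are hypothetical, clusters and classes are given as non-empty lists of node sets, and the best-match F1 scores are averaged over clusters and over classes before the two directions are averaged.

```python
def f1(a, b):
    """F1 score between two node sets."""
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(b)
    recall = overlap / len(a)
    return 2 * precision * recall / (precision + recall)


def average_f1(truth, found):
    """Symmetric average F1 between two lists of node sets."""
    # Best-matching ground truth class for every observed cluster.
    best_for_found = sum(max(f1(t, c) for t in truth) for c in found) / len(found)
    # Best-matching observed cluster for every ground truth class.
    best_for_truth = sum(max(f1(t, c) for c in found) for t in truth) / len(truth)
    return 0.5 * (best_for_found + best_for_truth)


def rand_index(labels_a, labels_b):
    """Fraction of node pairs on which two labelings agree (needs >= 2 nodes)."""
    n = len(labels_a)
    agree = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
    return agree / (n * (n - 1) / 2)


# Hypothetical toy clustering of six nodes.
toy_truth = [{0, 1, 2}, {3, 4, 5}]
toy_found = [{0, 1}, {2, 3, 4, 5}]
toy_f1 = average_f1(toy_truth, toy_found)
toy_ri = rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```

Both scores equal 1 when the observed clustering coincides with the ground truth.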
Table 2. Average F1 and Rand Index scores; no. of clusters returned by Louvain method

                      Citeseer          Cora              MovieLens         TwitterFootball   Arnetminer
                      F1      RI        F1      RI        F1      RI        F1      RI        F1      RI
hMETIS                0.1087  0.6504    0.1075  0.7592    0.1291  0.4970    0.3197  0.7639    0.0871  0.0416
PaToH                 0.0532  0.6612    0.1171  0.6919    0.1104  0.4987    0.1132  0.7553    0.0729  0.0052
Clique Spectral       0.1852  0.7164    0.1291  0.2478    0.1097  0.4806    0.4496  0.7486    0.0629  0.0610
Hypergraph Spectral   0.2774  0.8210    0.2517  0.5743    0.118   0.4977    0.5055  0.9016    0.0938  0.0628
Clique Louvain        0.1479  0.7361    0.2725  0.7096    0.1392  0.4898    0.2238  0.6337    0.1378  0.0384
Hypergraph Louvain    0.2782  0.7899    0.3248  0.8238    0.1447  0.4988    0.5461  0.9056    0.1730  0.0821
IRMM                  0.4019  0.7986    0.3709  0.8646    0.1963  0.5091    0.5924  0.9448    0.1768  0.0967
Table 3. Avg. F1 and Rand Index scores; no. of clusters set to no. of ground truth classes

                      Citeseer          Cora              MovieLens         TwitterFootball   Arnetminer
                      F1      RI        F1      RI        F1      RI        F1      RI        F1      RI
hMETIS                0.1451  0.6891    0.2611  0.7853    0.4445  0.5028    0.3702  0.7697    0.3267  0.3116
PaToH                 0.0710  0.7312    0.1799  0.7208    0.3239  0.4984    0.1036  0.7618    0.2756  0.182
Clique Spectral       0.2917  0.7369    0.2305  0.3117    0.2824  0.4812    0.4345  0.7765    0.387   0.3762
Hypergraph Spectral   0.3614  0.8267    0.2672  0.5845    0.3057  0.5006    0.5377  0.9112    0.4263  0.3851
Clique Louvain        0.1479  0.7361    0.2725  0.7096    0.2874  0.4982    0.2238  0.6337    0.4587  0.4198
Hypergraph Louvain    0.3491  0.8197    0.3314  0.8441    0.3411  0.5119    0.5461  0.9056    0.4948  0.5359
IRMM                  0.4410  0.8245    0.3966  0.889     0.4445  0.5347    –       –         0.5299  0.5506

5.5
Effect of Reweighting on Hyperedge Cuts
The plots in Fig. 4 illustrate the effect of hyperedge reweighting over iterations. We found the relative size of the largest partition of each hyperedge, and binned them in intervals of relative size = 0.1. The plot shows the fraction of hyperedges that fall in each bin over each iteration.

$$\text{relative size}(e) = \max_i \frac{\text{number of nodes in cluster } i}{\text{number of nodes in the hyperedge } e}$$

We refer to a hyperedge as fragmented if the relative size of its largest partition is lower than a threshold (here set at 0.3), and as dominated if the relative size of its largest partition is higher than the threshold. The fragmented edges are likely to be balanced, since the largest cluster size is low. On the smaller TwitterFootball dataset, which has a greater number of ground truth classes, we see that the number of dominated edges decreases and the number of fragmented edges increases. This is as expected; the increase in fragmented edges is likely to correspond to more balanced cuts. A similar trend is reflected in the larger Cora dataset.
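The relative size statistic and the fragmented/dominated split can be sketched as below. The function names are hypothetical, the numerator is interpreted as the number of the hyperedge's own nodes that fall in cluster $i$, and a value exactly at the threshold is assumed to count as dominated, since the text does not specify that case.

```python
from collections import Counter


def relative_size(nodes, labels):
    """Share of a hyperedge's nodes that fall in its largest cluster."""
    counts = Counter(labels[v] for v in nodes)
    return max(counts.values()) / len(nodes)


def classify(nodes, labels, threshold=0.3):
    """Label a hyperedge as 'fragmented' or 'dominated'."""
    # Assumption: a value exactly at the threshold counts as dominated.
    return "fragmented" if relative_size(nodes, labels) < threshold else "dominated"


# Hypothetical toy cases: a 3:1 split and a hyperedge spread over 4 clusters.
dominated_case = classify([0, 1, 2, 3], labels=[0, 0, 0, 1])
fragmented_case = classify([0, 1, 2, 3], labels=[0, 1, 2, 3])
```

Binning relative_size over all hyperedges after each iteration would give the kind of distribution plotted in Fig. 4.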
Fig. 3. Symmetric F1 scores for varying number of clusters: (a) Citeseer, (b) Cora, (c) Arnetminer.
Fig. 4. Reweighting for different hyperedge cuts: (a) TwitterFootball, (b) Cora.
6
Conclusion
In this work, we have considered the problem of modularity maximization on hypergraphs. In presenting a modularity function for hypergraphs, we derived a node-degree-preserving graph reduction and a hypergraph null model. To refine the clustering further, we proposed a hyperedge reweighting procedure that balances the cuts induced by the clustering method. Empirical evaluations on real-world data illustrated the performance of our resultant method, entitled Iteratively Reweighted Modularity Maximization (IRMM). We leave the exploration of additional constraints and hyperedge-centric information in the clustering framework for future work.

Acknowledgements. This work was partially supported by Intel research grant RB/1819/CSE/002/INTI/BRAV to BR.
References 1. Agarwal, S., Lim, J., ZelnikManor, L., Perona, P., Kriegman, D., Belongie, S.: Beyond pairwise clustering. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 838–845, June 2005 2. Agarwal, S., Branson, K., Belongie, S.: Higher order learning with graphs. In: ICML 2006: Proceedings of the 23rd International Conference on Machine Learning, pp. 17–24 (2006) 3. Blondel, V.D., loup G., J., L., R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. (10), P10008 (2008) 4. Cantador, I., Brusilovsky, P., Kuﬂik, T.: 2nd workshop on information heterogeneity and fusion in recommender systems (hetrec 2011). In: Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011. ACM, New York (2011) ¨ Aykanat, C.: PaToH (partitioning tool for hypergraphs), pp. 1479– 5. C ¸ ataly¨ urek, U., 1487. Springer, Boston (2011). https://doi.org/10.1007/9780387097664 93 6. Estrada, E., RodriguezVelazquez, J.A.: Complex networks as hypergraphs. arXiv preprint physics/0505137 (2005) 7. Feng, F., He, X., Liu, Y., Nie, L., Chua, T.S.: Learning on partialorder hypergraphs. In: Proceedings of the 2018 World Wide Web Conference, WWW 2018, pp. 1523–1532. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2018). https://doi.org/10.1145/ 3178876.3186064 8. Greene, D., Sheridan, G., Smyth, B., Cunningham, P.: Aggregating content and network information to curate Twitter user lists. In: Proceedings of the 4th ACM RecSys Workshop on Recommender Systems and the Social Web, RSWeb 2012, pp. 29–36. ACM, New York (2012). https://doi.org/10.1145/2365934.2365941 9. Hadley, S.W., Mark, B.L., Vannelli, A.: An eﬃcient eigenvector approach for ﬁnding netlist partitions. IEEE Trans. Comput.Aided Design Integr. Circ. Syst. 11(7), 885–892 (1992) 10. 
Hein, M., Setzer, S., Jost, L., Rangapuram, S.S.: The total variation on hypergraphs  learning on hypergraphs revisited. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS 2013, vol. 2, pp. 2427–2435. Curran Associates Inc., USA (2013). http://dl.acm.org/citation. cfm?id=2999792.2999883 11. Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998). https://doi.org/ 10.1137/S1064827595287997 12. Karypis, G., Kumar, V.: Multilevel kway hypergraph partitioning. VLSI Design 11(3), 285–300 (2000) 13. Kim, S., Nowozin, S., Kohli, P., Yoo, C.D.: Higherorder correlation clustering for image segmentation. In: Advances in Neural Information Processing Systems, pp. 1530–1538 (2011)
A New Measure of Modularity in Hypergraphs
297
14. Leordeanu, M., Sminchisescu, C.: Eﬃcient hypergraph clustering. In: Proceedings of the 15th International Conference on Artiﬁcial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 22, pp. 676–684. PMLR (2012). http:// proceedings.mlr.press/v22/leordeanu12.html 15. Liu, H., Latecki, L.J., Yan, S.: Robust clustering as ensembles of aﬃnity relations. In: Advances in Neural Information Processing Systems (2010) 16. Louis, A.: Hypergraph Markov operators, eigenvalues and approximation algorithms. In: Proceedings of the Fortyseventh Annual ACM Symposium on Theory of Computing, STOC 2015, pp. 713–722. ACM, New York (2015) 17. Newman, M.E.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006) 18. Newman, M.E.: Networks: An Introduction. Oxford University Press Inc., New York (2010) 19. Papa, D.A., Markov, I.L.: Hypergraph partitioning and clustering. In: In Approximation Algorithms and Metaheuristics. Citeseer (2007) 20. Rot´ a Bulo, S., Pelillo, M.: A gametheoretic approach to hypergraph clustering. IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1312–1327 (2013) 21. Saito, S., Mandic, D., Suzuki, H.: Hypergraph plaplacian: a diﬀerential geometry view. In: AAAI Conference on Artiﬁcial Intelligence (2018) 22. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., EliassiRad, T.: Collective classiﬁcation in network data. AI Mag. 29(3), 93 (2008) 23. Shashua, A., Zass, R., Hazan, T.: Multiway clustering using supersymmetric nonnegative tensor factorization. In: Proceedings of the 9th European Conference on Computer Vision, ECCV 2006, vol. IV, pp. 595–608. Springer, Heidelberg (2006). https://doi.org/10.1007/11744085 46 24. Yang, J., Leskovec, J.: Deﬁning and evaluating network communities based on groundtruth. In: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, MDS 2012, pp. 3:1–3:8. ACM, New York (2012). https://doi.org/10. 1145/2350190.2350193 25. 
Zhang, M., Cui, Z., Jiang, S., Chen, Y.: Beyond link prediction: predicting hyperlinks in adjacency space. In: AAAI Conference on Artiﬁcial Intelligence (2018) 26. Zhao, X., Wang, N., Shi, H., Wan, H., Huang, J., Gao, Y.: Hypergraph learning with cost interval optimization. In: AAAI Conference on Artiﬁcial Intelligence (2018) 27. Zhou, D., Huang, J., Sch¨ olkopf, B.: Learning with hypergraphs: clustering, classiﬁcation, and embedding. In: Advances in Neural Information Processing Systems, pp. 1601–1608 (2007)
Diffusion and Epidemics
Crying “Wolf” in a Network Structure: The Influence of Node-Generated Signals
Tomer Tuchner1 and Gail Gilboa-Freedman2
1 Efi Arazi School of Computer Science, IDC Herzliya, P.O. Box 167, 4610101 Herzliya, Israel
2 Adelson School of Entrepreneurship, IDC Herzliya, P.O. Box 167, 4610101 Herzliya, Israel
Abstract. Research into rumor spreading in a social network has largely assumed that information may only originate from an external content provider. Today, individual nodes may also be content providers. The propagation of a signal generated by a node in the network may contribute to or diminish the efforts of information diffusion, as signals become an imprecise indication of a node’s knowledge. We present a model that incorporates node-generated information into the well-studied area of modeling rumor spread in a network. We capture this by a stochastic information transmission mechanism at each node, with a positive probability of spreading the rumor without holding its value. Simulations are performed using synthetic Watts-Strogatz networks, along with a real-world Facebook sample graph. Using decision trees as a descriptive tool, we examine the effects of the rate at which internal non-informed nodes generate information on the properties of the rumor spread process. Our main results are as follows. Increasing the rate of information generated by non-informed nodes may have a monotonic or non-monotonic influence on the rumor spread time, depending on whether the network is sparse or not. We also find that a strategy of increasing external communication in order to gain a higher pureness level tends to be effective only for a medium range of this generation rate and only in sparse networks.

Keywords: Rumor spread · Advertising · Word of mouth · Social networks · Decision trees · Predictive models

1 Introduction
With the spread of new social media technologies, which guide more and more of our access to news [9], our responses to everything from natural disasters [10,22] to terrorist attacks [28] are increasingly disrupted by the spread of unreliable rumors online. The source of these unreliable rumors is often internal, with claims potentially generated by a single non-informed user (for example via a Twitter post) and presented as news. Recipients of this information may then knowingly or unknowingly spread signals that are false.

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 301–312, 2020. https://doi.org/10.1007/9783030366872_25

To examine rumor spread behavior in a network, and how it is affected by unreliable information, recall the fable of the boy who cried “wolf”. The boy amuses himself by crying “wolf” to see the panic he causes in the community, but consequently fails to get assistance when a real threat appears. The current study extends the “wolf” story and investigates the influence of “wolf” cries in a network structure, which are equivalent to generating information that does not originate from an external source. This information flow results in suspicion towards received data.

Unlike existing models that describe the transmission of unreliable information [18], our model assumes not only that nodes transfer unreliable information, but also that information may be generated by nodes which have not received any information regarding the rumor from an external source. The model assumes that curiosity arises about the value of some variable (for example, how many casualties there were in an earthquake), and that there is some external trustworthy source of information (for example, a news channel) that spreads the real value of the variable (for example, that 23 people were killed in the earthquake). It also assumes that there is an internal diffusion of the information inside the network, from one person to another, and that the internal diffusion can be of the real value (23) or of false data (other values, invented by non-informed individuals).

The fact that there is an internal diffusion of data, generated by nodes that have not yet received any information about the rumor, has two potentially opposing impacts on the rate of propagation. On the one hand, spontaneous generation of information in the social network may increase the aggregate growth of the informed population.
On the other hand, it results in suspicion towards information, which may slow down the data spread rate.

The spread of the rumor involves a large number of actions taken by a large number of entities which interact with each other, generating aggregated patterns which are hard to predict and often impossible to treat analytically [30]. For this reason, we take a numerical approach, running simulations of rumor spread processes using combinations of the model parameters. Given that some of the model parameters have a nonlinear effect, we use decision trees to analyze the simulation results [23]. We highlight this approach as potentially valuable for other numerical studies on complex networks that depend on a large number of variables, with complex relationships between variables and research targets.

We examine the rate at which internal non-informed nodes generate information, and reach two results concerning the effects of this rate on: (1) the rumor spread time; and (2) the level of pureness for informed nodes, i.e. nodes receiving information originating from an external source. Concerning the first effect, we examine when a high rate of faked signals (i.e., signals generated by a node that does not hold a value of the rumor) can harm the rumor diffusion in terms of spread time. We show that the answer is tightly bound to whether the network is sparse or not. Concerning the second effect, we find that a strategy of increasing external communication in order to gain a higher pureness level tends to be effective for a medium range of this generation rate, and only in sparse networks.
2 Related Work
In today’s “post-truth” age [14], there is increasing interest in the concept of “fake news” – i.e., false stories – in the scientific literature, specifically their epidemiology [17], detection [3], and impact. This phenomenon has the potential to influence attitudes toward journalistic objectivity [21], and may impose real costs on society and politics [1]. From a corporate point of view, false stories have the potential to damage a firm’s or brand’s image and propel firms into financial disaster [11]. Of course, false stories have always existed – but the ability of social media platforms to spread such narratives rapidly and aggressively gives the question new importance. One recent study focusing on the social network Twitter found that fake news “diffused significantly farther, faster, deeper, and more broadly than the truth in all categories of information” [29].

Reliability in general, and rumor reliability in particular, is a central concept in theories of decision-making [25,27,33], cooperation [6], communication [26], viral marketing [13,16], and markets [2]. It is a vast research topic spanning multiple disciplines. The study of reliability commonly draws on network models, which often use the term reliability for the probability that the proportion of informed individuals exceeds a certain value. Some network models associate reliability with social cohesion [31]. Modern network models define reliability as the probability of data transmission from one element to another [7], which is the probability that a given node will be informed.

Our model follows previous work in probability theory on interacting particle systems [20]. We formulate the simplest extension of an independent cascade model, a type of model that has been investigated in the context of marketing and word-of-mouth processes [8,16]. The dynamic at the node level follows a predefined scheme of response probabilities and is a function of the state of the nodes with which it interacts, as in [12].
Our contribution is in considering a richer dynamic for these interactions, specifically the possibility of activation by non-informed nodes. We simulate the rumor spread process on a Facebook graph sample taken from the SNAP project [19], on a random graph, and on a series of synthetic Watts-Strogatz networks [32]. Like other studies in the literature on rumor spreading online [4], we also consider a mid-sized network (500 nodes). The intuition behind this is that opinions and rumors often spread within a particular online community which is not that large – for example, people contributing to an online forum or people tweeting and retweeting some hashtag.

We analyze the simulation results by organizing them in a decision tree for each phenomenon of the rumor spread process we aim to understand. Decision trees are widely used in machine learning [15] to predict a target value (a class) from some input features. To build the decision tree, a modeler uses a data set that includes a list of measurable properties (features), one of which is the target. Decision trees are among the most popular predictive models (see [24] for a survey). They are also used as descriptive tools [5].
3 Useful Definitions
Below are definitions of several terms used in the article. This section is meant to help the reader understand the model description, which contains some new technical terms. The reader may skip to Sect. 4 and return to this section when a term requires further review.

1. External Communication – Transmission of information from an external advertiser (source) to any node in the graph.
2. Internal Communication – Transmission of information between neighboring nodes in the graph.
3. Informed/Non-informed Node – At each iteration, each node can be in one of two states: Informed or Non-informed. Informed nodes are nodes that hold information about the rumor. Non-informed nodes are nodes that do not hold such information.
4. Pure/Non-pure Node – Each Informed node is in either a Pure or Non-pure state, and thus there are Pure Informed Nodes and Non-pure Informed Nodes. The pureness state of a node is determined when it is activated (becomes Informed). A Non-informed node that has been activated by an external source, or by a Pure node, is Pure. Otherwise, it is Non-pure. (This is a recursive definition.)
5. Faked/Held Signal – Produced when a Non-informed/Informed node (respectively) transmits information to its neighbors.
6. Reliable/Unreliable Signal – Each signal may be Reliable or Unreliable. Faked signals are unreliable. Held signals are reliable/unreliable when held by pure/non-pure nodes respectively.
7. Suspicion Factor – The probability that a Non-informed node which receives information from a neighboring node chooses to accept that information and become Informed (rather than to reject it and stay Non-informed).
4 Model
We introduce a model for rumor spread over a network. The network is represented by a graph with a finite set of nodes and a set of undirected edges. We define three states of nodes: Non-informed (a node that does not hold information; see def. 3), Pure informed (a node that holds information originating from a reliable source; see def. 4) and Non-pure informed (a node that holds information originating from an unreliable source; see def. 4). The novelty of our model is that non-informed nodes have a positive probability of generating and spreading information on each iteration.

On each iteration, every Non-informed node spreads information to its neighbors with probability Q (the rate of faked signals, which are always unreliable; see def. 5 and 6), whereas Informed nodes spread information to their neighbors with probability P (the rate of held signals, which are either reliable or not; see def. 5). Every Non-informed node can be activated (become Informed) either by an external source (see def. 1), with probability α, separately for each node, or by internal communication with its neighbors (see def. 2), where the probability of adopting the information from its neighbors is multiplied by the suspicion factor P/(P + Q) (see def. 7). Nodes that become Informed cannot be deactivated (return to being Non-informed) in later stages of the process.

The intuition behind the suspicion factor is that the possibility of signals being generated by non-informed nodes may keep nodes from adopting every rumor they receive. Without any prior knowledge of whether a neighbor is informed or not, it is intuitive for a node to consider the likelihood that a signal was generated by an informed node, and to become informed with a probability that increases with this likelihood. The process is fully described in Algorithm 1.

Regarding the question of whether a node becomes pure or non-pure following an activation, we have to consider the possibility that in a single iteration it may choose to adopt received signals from several sources at once. For consistency of the pureness concept, we define that if a node chooses to adopt some unreliable signal in some iteration, then it becomes non-pure; in other words, non-pure activation is dominant. In Table 1 we show a numerical example for the signal distribution depending on a node’s state.
Every row in the matrix represents the state of a node, and every column represents the conditional probability of the signal it may transmit, given its state. In the given example, P = 0.8 and Q = 0.6. Please note two important points: (1) a non-informed node spreads unreliable signals (since it does not hold the advertised value); and (2) an informed node spreads reliable or unreliable signals in agreement with its state, pure or non-pure (that is, according to whether the value it holds was sourced externally or not).

Table 1. Numerical example for the distribution of the signal transmitted by a node in a specific iteration, depending on the state of the node. In this example, P = 0.8 and Q = 0.6.

State                Unreliable signal   Silence   Reliable signal
Non-informed         0.6                 0.4       0
Pure informed        0                   0.2       0.8
Non-pure informed    0.8                 0.2       0
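The rows of Table 1 follow directly from P and Q, and the suspicion factor from def. 7 can be computed alongside them. The sketch below is only an illustration of that arithmetic; the function names and the string labels for the states are assumptions made here, not part of the model description.

```python
def signal_distribution(state, P, Q):
    """Return (unreliable, silence, reliable) probabilities for one node.

    `state` is one of the illustrative labels "non-informed", "pure",
    "non-pure"; P is the held-signal rate and Q the faked-signal rate.
    """
    if state == "non-informed":
        return (Q, 1 - Q, 0.0)      # faked signals are always unreliable
    if state == "pure":
        return (0.0, 1 - P, P)      # held signal of a pure node is reliable
    if state == "non-pure":
        return (P, 1 - P, 0.0)      # held signal of a non-pure node is unreliable
    raise ValueError(f"unknown state: {state!r}")


def suspicion_factor(P, Q):
    """Probability of accepting a neighbor's signal, P / (P + Q)."""
    return P / (P + Q)


if __name__ == "__main__":
    for s in ("non-informed", "pure", "non-pure"):
        print(s, signal_distribution(s, P=0.8, Q=0.6))
    print("suspicion factor:", suspicion_factor(0.8, 0.6))
```

With P = 0.8 and Q = 0.6 the three tuples reproduce the rows of Table 1 up to floating-point rounding, and the suspicion factor is 0.8/1.4, roughly 0.57.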
Algorithm 1. Rumor spread process with faked signals

Input:
  – P, Q, α: held signal, faked signal and advertising rates
  – Network G with N nodes and adjacency matrix A
Output:
  – State: array with the final state of each node (0 for non-informed, 1 for pure informed, 2 for non-pure informed)
  – Time: array with the activation time of each node

ρ = P/(P + Q)                                  // Suspicion factor
State = Time = zeros[1, N]
i = 0, informed = 0
while informed < 0.95 · N do
    i = i + 1
    // Randomizing transmission for this iteration
    Transmit = zeros[1, N]
    for n = 1 to N do
        if State[n] > 0 then                   // Informed
            Transmit[n] = 1 w.p. P
        else                                   // Non-informed
            Transmit[n] = 1 w.p. Q
        end if
    end for
    // Calculating new states
    NewState = State                           // Informed nodes keep their state
    for n = 1 to N do
        if State[n] = 0 then                   // Non-informed
            ActivatedByPure = 0
            for j = 1 to N do
                if AND(Transmit[j] = 1, State[j] = 1, A[n, j] = 1) then
                    // Transmitting neighbor is pure
                    ActivatedByPure = 1 w.p. ρ     // Adopted with suspicion
                end if
            end for
            if ActivatedByPure = 0 then
                ActivatedByPure = 1 w.p. α         // Activated by external
            end if
            if ActivatedByPure = 1 then
                NewState[n] = 1                    // Pure
                Time[n] = i
            end if
            ActivatedByNonpure = 0
            for j = 1 to N do
                if AND(Transmit[j] = 1, State[j] ≠ 1, A[n, j] = 1) then
                    // Transmitting neighbor is non-pure or non-informed
                    ActivatedByNonpure = 1 w.p. ρ  // Adopted with suspicion
                end if
            end for
            if ActivatedByNonpure = 1 then         // Non-pure activation is dominant
                NewState[n] = 2                    // Non-pure
                Time[n] = i
            end if
            if NewState[n] > 0 then                // Activated
                informed = informed + 1
            end if
        end if
    end for
    State = NewState
end while
Return State, Time
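Algorithm 1 can be sketched in Python as follows. This is one possible reading, not the authors' Matlab code, and it rests on several assumptions: the network is given as adjacency lists rather than a matrix; the repeated "w.p. ρ" trials over transmitting neighbors are combined as a logical OR; informed nodes keep their state from one iteration to the next; and a `max_iter` guard, which is not in the algorithm, is added. The `ring_lattice` helper is hypothetical and only serves the usage example.

```python
import random


def ring_lattice(n, K):
    """Hypothetical helper: ring in which each node links to K neighbors per side."""
    return [[(v + d) % n for d in range(-K, K + 1) if d != 0] for v in range(n)]


def simulate_rumor(adj, P, Q, alpha, target=0.95, seed=None, max_iter=100_000):
    """Run the rumor spread process; return (state, time) lists.

    state[v] is 0 (non-informed), 1 (pure informed) or 2 (non-pure informed);
    time[v] is the activation iteration (0 if never activated).
    Requires P + Q > 0.
    """
    rng = random.Random(seed)
    n = len(adj)
    rho = P / (P + Q)                  # suspicion factor
    state = [0] * n
    time = [0] * n
    informed, it = 0, 0
    send_prob = (Q, P, P)              # transmission probability indexed by state
    while informed < target * n and it < max_iter:
        it += 1
        transmit = [rng.random() < send_prob[s] for s in state]
        new_state = state[:]           # informed nodes are never deactivated
        for v in range(n):
            if state[v] != 0:
                continue
            senders = [u for u in adj[v] if transmit[u]]
            # A signal from each transmitting neighbor is adopted w.p. rho.
            by_pure = any(rng.random() < rho for u in senders if state[u] == 1)
            if not by_pure:
                by_pure = rng.random() < alpha      # external source
            by_nonpure = any(rng.random() < rho for u in senders if state[u] != 1)
            if by_nonpure:
                new_state[v] = 2       # non-pure activation is dominant
            elif by_pure:
                new_state[v] = 1
            if new_state[v]:
                time[v] = it
                informed += 1
        state = new_state
    return state, time


if __name__ == "__main__":
    state, time = simulate_rumor(ring_lattice(500, 3), P=0.05, Q=0.02, alpha=0.005, seed=0)
    print("iterations to reach 95%:", max(time))
    print("pure share among informed:", state.count(1) / sum(s > 0 for s in state))
```

A quick sanity check on this reading: with Q = 0 no faked signal is ever produced, so every informed node must end up pure.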
5 Example: Facebook Sample Graph
We start by simulating our model on a Facebook graph sample, and on an Erdős–Rényi random graph with the same number of nodes (4039) and the same average degree (approx. 44). The Facebook graph sample was taken from the SNAP project [19]. The random graph was generated with a computer program in Matlab. Figure 1 displays the number of iterations needed to reach 95% of informed nodes against Q, where P and α are fixed at 0.01, for the two networks (i.e., the Facebook sample and the random setup).

Before we discuss whether Q strengthens or harms rumor spread in different settings, we wish to examine whether it actually has a two-sided effect on the spread process. As we see in this example of the Facebook network, the spread of the rumor speeds up as Q increases, but only at the far left of the graph, where Q is low. Beyond a certain point, increasing Q further slows the spread of the rumor again. Figure 1 demonstrates that the ability to speed up the spread of the rumor by increasing the rate of faked signals may or may not exist, depending on network topology.

Fig. 1. Extent of rumor spread as a function of the probability of faked signals. The Y-axis is the number of iterations needed to reach 95% of the nodes being informed, and the X-axis is Q, where P and α are fixed at 0.01. Two curves are shown: Facebook and Random.
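A random graph matched to the Facebook sample in node count and average degree can be generated along the following lines. The authors used Matlab; this Python sketch is an illustration only, and it assumes the standard G(n, p) construction with p = mean degree / (n − 1). The function name is hypothetical.

```python
import random


def erdos_renyi(n, mean_degree, seed=None):
    """Return adjacency lists of a G(n, p) graph with the given expected mean degree."""
    rng = random.Random(seed)
    p = mean_degree / (n - 1)          # each of the n-1 possible neighbors w.p. p
    adj = [[] for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):      # visit every unordered pair once
            if rng.random() < p:
                adj[u].append(v)
                adj[v].append(u)
    return adj


if __name__ == "__main__":
    # Smaller than the 4039-node sample so the pure-Python loop stays fast.
    g = erdos_renyi(1000, 44, seed=0)
    print("average degree:", sum(len(a) for a in g) / len(g))
```

For the setting in the text (4039 nodes, average degree about 44) this relation gives p of roughly 0.011.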
6 Methods
We simulate the rumor spread process as described by the model for a variety of model parameters and network structures. We then analyze the simulation results using decision trees, which classify the simulations by their properties.

6.1 Preparation of Networks
We wrote a Matlab program for generating synthetic networks using an algorithm from Watts and Strogatz [32]. These networks have 500 nodes and vary in the parameters K and β, which describe respectively the ratio between edges and nodes in the graph, and how close the graph is to a random network, where β = 1 implies a random network and β < 1 yields a graph with small-world properties (closer to a social network).

6.2 Numerical Simulations
Combinations of model parameters (α, P, Q) and network parameters (K, β) were considered in a full factorial design experiment. Consistent with previous literature [8], we set the advertising rate α lower than P (the probability that an Informed node generates a signal). We also set Q, the probability of a faked signal (generated by a Non-informed node), to be lower than P, under the reasonable assumption that being informed about a rumor should not reduce the probability of spreading it. Each of the five input parameters was manipulated to produce a variety of spread process simulations. For each set of parameters we ran 20 simulations and calculated their mean results. In total we examined 6,720 sets of parameters (excluding combinations with α = Q = 0), in each case running the rumor spread process until 95% of the network was informed. The parameter ranges were set as follows:

1. K – Half the average degree: 3, 10, 15, 20
2. β – Randomness of the graph: 0.5, 0.8, 1
3. α – Advertising rate: 0–0.01 (steps of 0.00125)
4. P – Held signal rate: 0.01–0.07 (steps of 0.01)
5. Q – Faked signal rate: 0–P (9 values, constant step)

6.3 Analysis: Decision Trees
After generating all outcomes over the set of simulation parameters, we analyzed the influence of the parameters on the behavior of the rumor spread. For this purpose, we took the decision tree approach to elicit relevant observations. The results served as input data for training a decision tree, i.e., a set of rules organized in a hierarchical structure that can serve as a predictive model for the relevant measure. This tool can reveal observations that are non-intuitive and that therefore would not likely be advanced and tested by a “human” learning approach. We wrote a computer program in Python to build the trees. The observations obtained from the decision trees are rigorous in the sense that each one corresponds to a node or leaf in the tree, which represents a hypothesis with high significance according to traditional methods.
Each decision tree specifies the properties of its nodes. Specifically, for each leaf (a node with no outgoing arrow), the tree specifies: the number of samples that were sorted into this leaf; their distribution over the categories of the target, i.e. how many samples fall in each category; and the prediction assigned to this leaf. For each internal node (not a leaf), the tree also specifies the splitting criterion. We examined the classifications of the simulation results in the trees, and derived our observations based on values in the same class. We are most interested in classes that are homogeneous in terms of the target values of the simulations that fall into them.
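The data set on which these trees are trained comes from the full factorial design of Sect. 6.2, which can be enumerated in a few lines as a check on the stated count of 6,720 parameter sets. The sketch assumes that the nine Q values are equally spaced from 0 to P inclusive, which is one reading of "9 values, constant step"; the variable and function names are illustrative.

```python
from itertools import product

K_VALUES = [3, 10, 15, 20]                          # half the average degree
BETA_VALUES = [0.5, 0.8, 1.0]                       # randomness of the graph
ALPHA_VALUES = [i * 0.00125 for i in range(9)]      # advertising rate, 0 to 0.01
P_VALUES = [i * 0.01 for i in range(1, 8)]          # held signal rate, 0.01 to 0.07
Q_STEPS = 9                                         # faked signal rate, 0 to P


def parameter_grid():
    """Enumerate (K, beta, alpha, P, Q) tuples, skipping alpha = Q = 0."""
    grid = []
    for K, beta, alpha, P in product(K_VALUES, BETA_VALUES, ALPHA_VALUES, P_VALUES):
        for step in range(Q_STEPS):
            Q = P * step / (Q_STEPS - 1)
            if alpha == 0 and Q == 0:
                continue    # no source of information at all, the rumor never starts
            grid.append((K, beta, alpha, P, Q))
    return grid


if __name__ == "__main__":
    print(len(parameter_grid()))
```

The full product has 4 · 3 · 9 · 7 · 9 = 6,804 combinations; removing the 4 · 3 · 7 = 84 combinations with α = Q = 0 leaves 6,720, in agreement with the text.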
7 Results
First, we examine the two-sided impact of information generated by non-informed nodes on the rumor spread time. Increasing the rate at which such information is generated increases both the transmission rate and suspicion. When there exists a value of Q below which increasing Q accelerates the rumor spread and above which it slows the spread down, we say that the network exhibits a turning point. We saw an example of this in Fig. 1 for the Facebook graph. If a turning point exists, then it is not certain that increasing internal communication of faked signals is beneficial for achieving a faster rumor spread. We obtain a simple decision tree (see Fig. 2), showing that for a sparse network (k = 3), it is not always beneficial to increase internal communication (the probability of a turning point is 0.95, while over all simulations it is 0.37), and for a highly connected dense network (k > 3), it is beneficial to increase internal communication (the probability of no turning point is 0.82, while over all simulations it is 0.63). Intuitively, when the graph is highly connected, signals are spread to many nodes in each iteration, overcoming suspicion.
[...] for a medium range of Q (Q > 0.002, Q ≤ 0.021), if the graph is sparse (k = 3), we can increase the advertisement rate (α ≥ 0.00375) in order to obtain high pureness and avoid low pureness. When α ≤ 0.0025 the probability of low pureness is 0.96, but if α ≥ 0.00375 the probability of low pureness is 0.33. For high or low values of Q, increasing the advertisement rate has little effect on pureness: there are either too many unreliable signals for the pureness to be high, or too few for it to be low.
[...]

$$\frac{dI_v(t)}{dt} > 0, \qquad \beta_{hv}\, b(T)\, S_v(t)\, I_{hk}(t) - \mu_v I_v(t) > 0, \qquad I_v(t) < \frac{\beta_{hv}\, b(T)\, S_v(t)\, I_{hk}(t)}{\mu_v} \tag{7}$$
At the outset of an epidemic, $S_v(t) \approx 1$. Death is treated as an instantaneous process, so $\mu_v = 1$ and therefore $\beta_{hv}\, b(T)\, I_{hk}(t) > I_v(t)$. This quantity should be greater than one. The basic reproduction number of the vector is given by $R_{0v} = \beta_{hv}\, b(T)\, I_{hk}(t)$. Substituting this value of $I_v(t)$ into Eqs. 2 and 3, the host equations can be written:

$$\frac{dS_{hk}(t)}{dt} = -\beta_h k\, S_{hk}(t) I_{hk}(t) - \beta_{vh}\, b(T)\, S_{hk}(t) \{\beta_{hv}\, b(T)\, I_{hk}(t)\} \tag{8}$$

$$\frac{dI_{hk}(t)}{dt} = \beta_h k\, S_{hk}(t) I_{hk}(t) + \beta_{vh}\, b(T)\, S_{hk}(t) \{\beta_{hv}\, b(T)\, I_{hk}(t)\} - I_{hk}(t) \tag{9}$$

$$\frac{dR_{hk}(t)}{dt} = I_{hk}(t) \tag{10}$$

Generally, a healthy host node is infected, and this infected node is then converted into a recovered node. So we can say that $S_{hk}(t)$ is converted into $R_{hk}(t)$. Therefore, from Eqs. 8 and 10,

$$\frac{dS_{hk}(t)}{dR_{hk}(t)} = \frac{-\left(\beta_h k\, S_{hk}(t) + \beta_{vh}\, b(T)\, S_{hk}(t)\, \beta_{hv}\, b(T)\right) I_{hk}(t)}{I_{hk}(t)} \tag{11}$$
Equation 11 gives the rate of change of susceptible nodes with respect to recovered nodes. Integrating both sides of Eq. 11:

$$S_{hk}(t) = e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv})\, R_{hk}(t)} \tag{12}$$
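The integration step can be spelled out as follows. The shorthand $c$ is introduced here only for readability, and the initial conditions $S_{hk}(0) = 1$ and $R_{hk}(0) = 0$ are assumptions consistent with the start of an epidemic; they are not restated in this part of the text.

```latex
\begin{align*}
\frac{dS_{hk}}{dR_{hk}} &= -c\, S_{hk},
  \qquad c = \beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv} \\
\int_{S_{hk}(0)}^{S_{hk}(t)} \frac{dS}{S} &= -c \int_{R_{hk}(0)}^{R_{hk}(t)} dR \\
\ln S_{hk}(t) - \ln S_{hk}(0) &= -c \left( R_{hk}(t) - R_{hk}(0) \right) \\
S_{hk}(t) &= e^{-c\, R_{hk}(t)}
  \qquad \text{for } S_{hk}(0) = 1,\ R_{hk}(0) = 0 .
\end{align*}
```

The first line is Eq. 11 after cancelling $I_{hk}(t)$ and factoring out $S_{hk}(t)$.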
The negative exponent in Eq. 12 shows that the number of susceptible nodes decreases as they are converted into recovered nodes. The epidemic reaches a steady state as $t \to \infty$, hence $I_{hk}(\infty) = 0$. Therefore, by the normalization condition, the steady state satisfies

$$S_{hk}(\infty) = e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv})\, R_{hk}(\infty)} \tag{13}$$

$$R_{hk}(\infty) = 1 - e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv})\, R_{hk}(\infty)} \tag{14}$$

Now let $f(R_{hk}(\infty)) = 1 - e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv})\, R_{hk}(\infty)}$, a strictly increasing function of $R_{hk}(\infty)$. Setting $R_{hk}(\infty) = 0$ gives the trivial solution of Eq. 14, in which no host is ever infected; it corresponds to the disease-free state.
We now look for a nontrivial solution, which lies between 0 and 1. For this, the following condition must be satisfied:

$$\left. \frac{df(R_{hk}(\infty))}{dR_{hk}(\infty)} \right|_{R_{hk}(\infty)=0} > 1$$

$$\left. (\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv})\, e^{-(\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv})\, R_{hk}(\infty)} \right|_{R_{hk}(\infty)=0} > 1$$

$$\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv} > 1$$

Hence the basic reproduction number $R_{0h}$ must satisfy $\beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv} > 1$ for the epidemic to spread in the host population. Therefore, $R_{0h} = \beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}$, where $\beta_h$, $k$, $\beta_{vh}$ and $\beta_{hv}$ are constants. Hence, $R_{0h}$ depends on temperature only through the square of the biting rate, $b(T)^2$, and increases linearly with it. This basic reproduction number is also called the critical threshold of disease spreading.
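The threshold behaviour of Eq. 14 can be checked numerically. The sketch below computes $R_{0h}$ from its definition and solves the fixed-point equation $R = 1 - e^{-R_{0h} R}$ by simple iteration. The function names, the iteration scheme and the parameter values in the usage lines are illustrative assumptions, not the authors' procedure or results; the biting rate is passed as a plain number because the form of $b(T)$ is not given in this part of the text.

```python
import math


def host_reproduction_number(beta_h, k, beta_vh, beta_hv, b):
    """R0h = beta_h * k + beta_vh * b^2 * beta_hv, with b the biting rate b(T)."""
    return beta_h * k + beta_vh * b ** 2 * beta_hv


def final_recovered_fraction(r0, tol=1e-12, max_iter=10_000):
    """Largest solution in [0, 1] of R = 1 - exp(-r0 * R), by fixed-point iteration."""
    r = 1.0                     # start from the top so the largest root is reached
    for _ in range(max_iter):
        r_next = 1.0 - math.exp(-r0 * r)
        if abs(r_next - r) < tol:
            return r_next
        r = r_next
    return r


if __name__ == "__main__":
    # Illustrative values only.
    r0 = host_reproduction_number(beta_h=0.5, k=4, beta_vh=0.4, beta_hv=0.6, b=0.5)
    print("R0h:", r0)
    print("final recovered fraction:", final_recovered_fraction(r0))
    print("below threshold:", final_recovered_fraction(0.5))
```

When $R_{0h} \le 1$ the iteration collapses to the trivial solution 0, and when $R_{0h} > 1$ it settles on the nontrivial root, which is the threshold condition derived above.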
5 Simulation of the Model and Results Analysis
In this section, we first report the simulation setup, and then we discuss the results of the simulation performed using the temperature-dependent SIR model with a homogeneous network (Watts–Strogatz model) as the underlying contact network. The parameter values used for the simulations are listed in Table 2. These values have been chosen according to a literature review.

Table 2. Parameter values used in the simulation

Name of parameter                                     Value
Host contact network size                             2000
Connectivity probability for Watts–Strogatz model     0.2
Number of neighbours of each node                     300
Vector population size                                100000
Spreading rate from host to host (βh)                 0.6
Recovery rate (μh)                                    1
Death rate of vector (μv)                             1
Spreading rate from vector to host (βvh)              0.4
Spreading rate from host to vector (βhv)              0.6
Biting rate of vector (b0) at T0                      0.4
T0                                                    25 °C
Range of temperature (T)                              [4.3, 37] °C
We focus on the effect of temperature on the dynamics of epidemics in the host contact network as well as in the vector population. In the simulation, if the temperature is in the range T < 0 °C or T > 37 °C, the vector biting rate is zero. In other words, outside these limit temperatures, no vectors are present. Within that range of temperature the critical threshold is given by:

$$R_{0h} = \beta_h k + \beta_{vh}\, b(T)^2 \beta_{hv}$$

The epidemic spreading with the modified SIR model on a homogeneous network (Watts–Strogatz model) as the underlying topology is shown in Fig. 4. We took temperature values T ranging from 4.3 °C to 37 °C (the temperature of Delhi NCR in 2018) to analyze the effect of temperature on the infection process. The evolution of the epidemic for the host population is reported in Fig. 4(a). Similar results for the vector population are shown in Fig. 4(b). These figures show that the infection increases with time until the optimum temperature is reached. After that, the infection starts decreasing. The time span of the epidemic depends upon the existence of the vector population, as shown in Fig. 4(b).
Fig. 4. Epidemic spreading in host and vector population
One can observe that the vector population vanishes quickly due to its short life span, but it increases the epidemic threshold, as shown in Fig. 5. Figure 5(a) and (b) illustrate the evolution of the epidemic threshold in the vector population as the temperature varies. Figure 5(c) and (d) present the variation of the epidemic threshold in the host population with a change in temperature. The infection threshold varies from 0.8 to more than 0.9 in the host population, while it does not change much in the vector population because the life span of the vector is very short. These results corroborate research reported in the literature showing that the transmission probability from host to vector is greater than the transmission probability from vector to host.
Fig. 5. Effect of temperature on infection threshold in SIR considering a homogeneous contact network
Fig. 6. Biting of vector population
We also analyse the effect of temperature on the biting rate, which depends upon the total population of vectors. As the temperature increases from 25 °C, mosquitoes start biting, and continue until the maximum temperature. Once the temperature reaches 37 °C, biting ceases as the vector population vanishes. The biting rate is plotted in Fig. 6. Figure 6(a) shows that biting is maximum in the middle of the spreading process, while Fig. 6(b) shows that biting increases with temperature up to the ambient temperature. After that, the vector population starts dying. Finally, after reaching the maximum temperature, the vector population is eliminated. Figure 7 shows the effect of temperature on infection spreading in the vector as well as the host population. An infected vector can cause infection in multiple hosts. The vector population is much larger than the host population. Therefore, infection in the host population increases more than in the vector population.
Fig. 7. Effect of temperature on infection spreading in vector and host populations
6 Conclusion and Future Work
In this work, we propose and investigate a modified SIR model which integrates the effect of temperature on the spreading of vector-borne diseases. Here, we consider two types of populations: (1) the host population with the three states of the SIR model and (2) the vector population with the two states of the SI model. Favourable temperature increases the disease spreading from vector to host and, by a cascading effect, within the host population. We show that the threshold of the spreading rate of the disease is proportional to the square of the biting rate b(T), which is defined as a function of temperature. Simulations are performed with the proposed modified SIR model on a homogeneous contact network. They show that temperature increases the critical threshold value of the spreading rate. Additionally, if the temperature increases above 37 °C, the epidemic dies out due to the extinction of the vector population. Results on real disease data are plotted and show a similar infection pattern in the host population. We plan to develop this work in various future directions. An important extension is to include the effect of humidity in our future studies, as most diseases spread after the rainy season, especially in India. Furthermore, more realistic scenarios need to be considered concerning the host contact network topology, such as scale-free networks, modular networks and dynamic networks [17–19]. The movement of the population may also be considered.
References
1. World Health Organization et al.: Global strategy for dengue prevention and control 2012–2020 (2012)
2. Anderson, R.M., May, R.M., Anderson, B.: Infectious Diseases of Humans: Dynamics and Control, vol. 28. Wiley Online Library, Hoboken (1992)
3. Esteva, L., Vargas, C.: Analysis of a dengue disease transmission model. Math. Biosci. 150(2), 131–151 (1998)
4. de Pinho, S.T.R., Ferreira, C.P., Esteva, L., Barreto, F.R., Morato e Silva, V.C., Teixeira, M.G.L.: Modelling the dynamics of dengue real epidemics. Philos. Trans. Roy. Soc. A: Math. Phys. Eng. Sci. 368(1933), 5679–5693 (2010)
5. Focks, D.A., Daniels, E., Haile, D.G., Keesling, J.E.: A simulation model of the epidemiology of urban dengue fever: literature analysis, model development, preliminary validation, and samples of simulation results. Am. J. Trop. Med. Hygiene 53(5), 489–506 (1995)
6. Waikhom, P., Jain, R., Tegar, S.: Sensitivity and stability analysis of a delayed stochastic epidemic model with temperature gradients. Model. Earth Syst. Environ. 2(1), 49 (2016)
7. Liu-Helmersson, J., Stenlund, H., Wilder-Smith, A., Rocklöv, J.: Vectorial capacity of Aedes aegypti: effects of temperature and implications for global dengue epidemic potential. PLoS One 9(3), e89783 (2014)
8. Polwiang, S.: The seasonal reproduction number of dengue fever: impacts of climate on transmission. PeerJ 3, e1069 (2015)
9. Wang, W., Mulone, G.: Threshold of disease transmission in a patch environment. J. Math. Anal. Appl. 285(1), 321–335 (2003)
10. Auger, P., Kouokam, E., Sallet, G., Tchuente, M., Tsanou, B.: The Ross-Macdonald model in a patchy environment. Math. Biosci. 216(2), 123–131 (2008)
11. Nekovee, M., Moreno, Y., Bianconi, G., Marsili, M.: Theory of rumour spreading in complex social networks. Phys. A 374(1), 457–470 (2007)
12. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47 (2002)
13. Pastor-Satorras, R., Castellano, C., Van Mieghem, P., Vespignani, A.: Epidemic processes in complex networks. Rev. Mod. Phys. 87(3), 925 (2015)
14. Vespignani, A.: Modelling dynamical processes in complex socio-technical systems. Nat. Phys. 8(1), 32 (2012)
15. Moreno, Y., Pastor-Satorras, R., Vespignani, A.: Epidemic outbreaks in complex heterogeneous networks. Eur. Phys. J. B-Condens. Matter Complex Syst. 26(4), 521–529 (2002)
16. Li, X., Wang, X.: Controlling the spreading in small-world evolving networks: stability, oscillation, and topology. IEEE Trans. Autom. Control 51(3), 534–540 (2006)
17. Orman, K., Labatut, V., Cherifi, H.: An empirical study of the relation between community structure and transitivity. In: Menezes R., Evsukoff A., González M. (eds.) Complex Networks. Studies in Computational Intelligence, vol. 424, pp. 99–110 (2013)
18. Gupta, N., Singh, A., Cherifi, H.: Centrality measures for networks with community structure. Phys. A 452, 46–59 (2016)
19. Ghalmane, Z., El Hassouni, M., Cherifi, C., Cherifi, H.: Centrality in modular networks. EPJ Data Sci. 8(1), 1–27 (2019)
Opinion Diffusion in Competitive Environments: Relating Coverage and Speed of Diffusion
Valeria Fionda(✉) and Gianluigi Greco
Department of Mathematics and Computer Science, University of Calabria, Rende, Italy
{fionda,greco}@mat.unical.it
Abstract. The paper analyzes how two opinions/products/innovations diffuse in a network according to non-progressive dynamics. We show that the final configuration of the network strongly depends on their relative speed of diffusion. In particular, we characterize how the number of agents that will eventually adopt an opinion (at the end of the diffusion process) is related to the speed of propagation of that opinion. Moreover, we study how the minimum speed of propagation required to converge to consensus on a given opinion is related to the percentage of agents that initially act as seeds for that opinion. Our results complement earlier works in the literature on competitive opinion diffusion by drawing a clear picture of the relationships between coverage and speed of diffusion.
Keywords: Competitive opinion diffusion · Linear threshold models
1 Introduction
The mechanisms according to which opinions form and diffuse over social networks have attracted much research in recent years. A number of models have been proposed, and their properties have been studied both theoretically and experimentally (see, e.g., [11]). Abstracting from their specific technical differences, these models can be classified in two broad categories, non-progressive and progressive ones (cf. [19]), with the difference between them being whether or not an agent/individual/node that has been influenced to adopt some opinion can later reconsider her decision. Most of the literature focuses on the latter kind of models, which are indeed reminiscent of very influential earlier studies in economics [27] and sociology [16,17]. However, there are scenarios where the progressive behaviour is unrealistic [4,6,8,10]. This typically happens when social environments host opinions that compete with each other and when agents can oscillate in adopting one of them, being subject to the social pressure of their neighbors (see, e.g., [12–14]). Practical applications of non-progressive models have been pointed out in the context of the diffusion of competing (product) innovations, in the usage of mobile apps, and for analyzing the cycles of opinions that are in fashion [22]. In these applications, we typically select an initial set $S_0$ of seeds to propagate an opinion, say b(lack); but the diffusion process of b competes with the cascade of influence of another opinion, say w(hite).
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 425–435, 2020. https://doi.org/10.1007/9783030366872_35
In this paper we contribute to shedding light on the theoretical and practical behaviour of such non-progressive models for opinion diffusion. Our analysis starts from the observation that the outcome of these models, i.e., the final configuration of the social network, crucially depends on the order in which the various agents change their mind. In fact, the setting has a kind of game-theoretic flavour and strategic aspects naturally emerge (see, e.g., [28]). Here we depart from the strategic analysis and focus on an aspect that has received considerably less attention in the literature: we answer questions that relate the speed of propagation of some given opinion to the overall number of agents that will eventually adopt that opinion. As an example, assume that b propagates at the same speed as w, that is, if we consider two updates of opinions in the social network, then (on average) we observe one agent adopting opinion b and one agent adopting opinion w; and assume that a seed composed of 5% of the agents that initially hold opinion b is able to spread that opinion to 40% of the agents. Then, we may ask: What happens if b propagates in the network two times faster than w? Is it the case that the final coverage will be about 80%? More generally, for a given set $S_0$ of initial seeds that hold opinion b, in the paper we study how the overall number of agents that will eventually adopt that opinion varies as a function of the relative speed of propagation of b compared to that of w. Moreover, we study how the minimum cardinality of a seed from which consensus on b can be eventually achieved varies (again) as a function of the relative speed of b.
Our analysis includes a thorough experimental campaign which we conducted on competitive environments built over synthetic and real social networks. On these environments, we essentially considered a (deterministic) linear threshold [19] model of opinion diffusion and we provide results suggesting that the rate of adoption of an opinion dramatically impacts (i.e., much more than one would naturally expect) its capacity of spreading over the whole network.
2 A Formal Framework for Competitive Diffusion
In this section we introduce our formal framework for opinion diffusion. In particular, we define notions that are meant to formalize the concepts of speed and coverage of diffusion and we illustrate how these concepts are related to each other. In the exposition we consider specific kinds of networks, which are unlikely to occur in real-world instances. This has been done with the aim of providing sharp theoretical bounds on these relationships. Bounds that hold in practice are instead singled out in Sect. 3.
Fig. 1. Illustrations for the results in Sect. 2.
Preliminaries. Let $G = (N, E)$ be a social network, that is, an undirected graph encoding the interactions of a set $N$ of agents. Throughout the paper, we consider a setting where two opinions/products/innovations, say $b$ and $w$, compete for diffusing over $G$. We adopt a linear threshold model of diffusion [19], assuming that the thresholds are known a priori (rather than being initially selected at random). Indeed, thresholds can be learned over some available data by means of mining techniques [15,18] or, in many cases, we might just want to analyze and reason about scenarios where agents are characterized by some specific behaviour. Here, we focus on majority agents, which have attracted much research as a prototypical behaviour in opinion diffusion (e.g., [1,7,9,20,24,25]). Formally, for each agent $x \in N$, the set $\{y \mid \{x, y\} \in E\}$ of her neighbors is denoted by $\delta(x)$, and her associated threshold is denoted by $\sigma(x)$, with $1 \le \sigma(x) \le |\delta(x)|$. Then, we say that $x$ is a majority agent if $\sigma(x) = |\delta(x)|/2$. To keep notation simple, a configuration for $G$ is just defined as the set $S \subseteq N$ of all agents that hold opinion $b$, so that agents in $N \setminus S$ hold opinion $w$. An agent $x \in S$ (resp., $x \in N \setminus S$) is stable with respect to that configuration if $|\delta(x) \cap S| \ge \sigma(x)$ (resp., $|\delta(x) \cap (N \setminus S)| \ge \sigma(x)$). A configuration $S$ is stable if all agents in $N$ are stable. A dynamic for $G$ is a sequence of configurations $\pi = S_0, \dots, S_k$ such that $S_k$ is stable and, for each $h \in \{1, \dots, k\}$, $S_h$ is obtained from $S_{h-1}$ by flipping the opinion of an agent that is not stable. Note that, at each time step, we assume that precisely one agent is selected to change her opinion. Indeed, this assumption is appropriate when analyzing scenarios involving large social networks (cf. [7]).
Speed and Coverage. Let $\pi = S_0, \dots, S_k$ be a dynamic for $G = (N, E)$. Let $I^b(\pi, G) \subseteq \{1, \dots, k\}$ be the set of all time steps $i$ such that $S_i$ is obtained from $S_{i-1}$ by swapping to $b$ the opinion of some agent in $G$. Moreover, let $I^{b+w}(\pi, G) \subseteq \{1, \dots, k\}$ be the set of all time steps $i$ such that in $S_{i-1}$ there are at least two agents, one with opinion $b$ and another with opinion $w$, that are not stable. In words, $I^b(\pi, G)$ captures the time steps where some agent adopted opinion $b$, whereas $I^{b+w}(\pi, G)$ captures the time steps where both $b$ and $w$ can potentially propagate. Then, we define the relative speed of $b$ in $\pi$ as the ratio $rs_b(\pi, G) = |I^b(\pi, G) \cap I^{b+w}(\pi, G)| / |I^{b+w}(\pi, G)|$. That is, $rs_b(\pi, G)$ measures the fraction of the time steps in which $b$ has been propagated over all time steps in which $w$ might have been propagated too.
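The definitions of stability and relative speed can be made concrete with a minimal sketch. It replays a dynamic given as an ordered list of flipped agents and returns the final configuration together with the relative speed of b. The function names are illustrative, and two choices are assumptions of the sketch rather than part of the framework: the majority threshold is rounded up to an integer, and the relative speed is reported as `None` when no step is contested (the ratio would be 0/0).

```python
from math import ceil


def adjacency(edges):
    """Build the neighbor map delta(x) from an iterable of undirected edges."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj


def unstable(adj, S):
    """Return (unstable agents holding b, unstable agents holding w)."""
    ub, uw = set(), set()
    for x, nbrs in adj.items():
        # Assumption: majority threshold rounded up to an integer.
        sigma = ceil(len(nbrs) / 2)
        agreeing = len(nbrs & S) if x in S else len(nbrs - S)
        if agreeing < sigma:
            (ub if x in S else uw).add(x)
    return ub, uw


def replay(adj, S0, flips):
    """Replay a dynamic and return (final configuration, relative speed of b)."""
    S = set(S0)
    contested = b_wins = 0
    for x in flips:
        ub, uw = unstable(adj, S)
        if x not in ub and x not in uw:
            raise ValueError(f"agent {x} is stable and cannot flip")
        both = bool(ub) and bool(uw)   # this step belongs to I^{b+w}
        contested += both
        if x in uw:                    # x switches from w to b: step in I^b
            S.add(x)
            b_wins += both
        else:                          # x switches from b to w
            S.remove(x)
    if any(unstable(adj, S)):
        raise ValueError("the last configuration of a dynamic must be stable")
    return S, (b_wins / contested if contested else None)


path = adjacency([(1, 2), (2, 3)])
print(replay(path, {2}, [2]))      # w wins the only contested step
print(replay(path, {2}, [1, 3]))   # b wins the only contested step
```

On this three-node path with the middle agent as the only seed, the two dynamics end in opposite consensus states, which already hints at how strongly the outcome depends on the order of updates.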
For each rational number $\alpha$, with $0 \le \alpha \le 1$, let $\Pi_\alpha^b(S_0, G)$ be the set of all dynamics $\pi$ starting at the configuration $S_0$ and such that $rs_b(\pi, G) \le \alpha$. Moreover, let $cov_\alpha^b(S_0, G)$ be the maximum coverage of $b$ from $S_0$, that is, the maximum number of agents that hold opinion $b$ at the end of any dynamic $\pi \in \Pi_\alpha^b(S_0, G)$. Hence, $cov_\alpha^b(S_0, G)$ measures the capacity of spreading opinion $b$ by considering dynamics for which the relative speed of $b$ is at most $\alpha$. We next relate the coverage $cov_\alpha^b(S_0, G)$ to the relative speed $\alpha$, in particular by showing that the impact of the speed on the coverage of the diffusion might be dramatic.
Theorem 1. Let $S_0$ be a configuration for a social network $G = (N, E)$. Then, for each pair $0 \le \alpha \le \alpha' \le 1$, it holds that $cov_\alpha^b(S_0, G) \le cov_{\alpha'}^b(S_0, G)$. Moreover, there is a class of social networks $\{G_n = (\{1, \dots, n\}, E_n)\}_{n>3}$ such that $cov_0^b(\{1, 2\}, G_n) = 0$ and $cov_1^b(\{1, 2\}, G_n) = n$.
Proof (Sketch). The monotonic behaviour of $cov$ w.r.t. the relative speed $\alpha$ is easily seen to hold. Concerning the second part of the statement, consider the graph $G_n$ shown in Fig. 1(a) and any dynamic starting with the configuration $\{1, 2\}$. When we focus on dynamics in $\Pi_0^b(\{1, 2\}, G_n)$, for $n > 3$, the evolution of the network is entirely deterministic: agent 1 and agent 2 will change their opinion to $w$ in this order. That is, $cov_0^b(\{1, 2\}, G_n) = 0$. On the other hand, within the dynamics in $\Pi_1^b(\{1, 2\}, G_n)$, we can choose the one propagating $b$ to agent 3, then to agent 4, and so on, until all agents in $G_n$ are covered. That is, $cov_1^b(\{1, 2\}, G_n) = n$.
Speed and Consensus. The second parameter we are interested in analyzing as a function of the relative speed is the number of seeds that are necessary to converge to consensus (i.e., to a scenario where all agents hold, w.l.o.g., opinion $b$). To this end, let $\gamma_\alpha^b(G)$ be the minimum cardinality over all the sets $S_0$ such that $cov_\alpha^b(S_0, G) = |N|$. Again, a dramatic gap emerges between the values of $\gamma_\alpha^b$ at the opposite boundaries of $\alpha$.
Theorem 2. Let $S_0$ be a configuration for a social network $G = (N, E)$. Then, for each pair $0 \le \alpha \le \alpha' \le 1$, it holds that $\gamma_\alpha^b(G) \ge \gamma_{\alpha'}^b(G)$. Moreover, there is a class of social networks $\{G_n = (\{1, \dots, n\}, E_n)\}_{n>3}$ such that $\gamma_0^b(G_n) = n/2$ and $\gamma_1^b(G_n) = 2$.
Proof (Sketch). The anti-monotonic behaviour of $\gamma_\alpha^b$ is easily seen to hold. Then, consider again the graph $G_n$ in Fig. 1(a), recalling from the proof of Theorem 1 that $\gamma_1^b(G_n) \le 2$. In fact, there is no way that a seed with one node only can propagate opinion $b$ to all remaining agents; hence, $\gamma_1^b(G_n) = 2$ actually holds. To conclude, note that when considering $\alpha = 0$, the only way to forbid the propagation of opinion $w$ is precisely to include in the initial configuration agent 1 and half of the remaining agents. Hence, $\gamma_0^b(G_n) = n/2$.
In fact, it is known that $\gamma_1^b(G) \le |N|/2$ always holds [3]. Therefore, the above result might tempt the reader to believe that the $|N|/2$ bound holds independently of the speed of propagation. We next show that this is not the case.
Theorem 3. There is a class of social networks $\{G_n = (\{1, \dots, 5n\}, E_n)\}_{n \ge 1}$ such that $\gamma_0^b(G_n) = 4n$.
Proof. Consider the graph $G_n$ shown in Fig. 1(b), which consists of a gadget over 5 agents cloned $n$ times, and the configuration depicted there where $4n$ agents initially hold opinion $b$. It is immediate to check that from this configuration consensus on $b$ can be eventually achieved even with dynamics for which $\alpha = 0$. Now, the crucial observation is that it is not possible to end up with a consensus on $b$ if, at any time step, two agents that are connected in $G_n$ both hold opinion $w$. To avoid this obstruction, it can be checked that in every gadget at least four agents must initially hold opinion $b$.
3 Experimental Evaluation
Our experimental campaign involved both synthetic and real networks. Networks can be characterized by the following parameters:
• Degree distribution $p(k)$: the probability distribution of node degrees over the whole network (i.e., the probability that a randomly chosen node has degree $k$);
• Average node degree $\langle k \rangle$;
• Neighbor degree distribution $q(k)$: the probability that a randomly chosen edge is connected to a node of degree $k$;
• Joint degree distribution $e(k_1, k_2)$: the probability that the two endpoints of a randomly chosen edge have degrees $k_1$ and $k_2$, respectively;
• Assortativity or degree correlation $r_{kk}$: the Pearson correlation between degrees of connected nodes. In assortative networks ($r_{kk} > 0$) nodes are connected to nodes having similar degree, and in disassortative networks ($r_{kk} < 0$) they link to nodes having dissimilar degree.
Note that two networks having the same degree distribution can differ in their assortativity.
The initial configuration $S_0$ of the network can be specified by fixing the percentage of agents that initially hold opinion $b$ and the joint probability distribution $P(o, k)$, that is, the probability that a node of degree $k$ has opinion $o \in \{b, w\}$. This joint probability can be used to compute $\rho_{k,o}$, the correlation between node degrees and opinions. In general, randomly picking the agents having opinion $b$ in the network leads to a configuration where $\rho_{k,o} \approx 0$. Following the approach used in [21], we changed the opinion degree correlation by swapping the opinions of agents. In particular, given two agents $n$ and $n'$ having opinions $b$ and $w$ respectively, swapping their opinions leads to an increase of $\rho_{k,o}$ if the degree of $n$ is lower than that of $n'$. Thus, to increase the value of the opinion degree correlation, we iteratively picked a node $n$ with opinion $b$ and a node $n'$ with opinion $w$, swapped their opinions only if the degree of $n'$ was bigger than the degree of $n$, and repeated until the desired correlation value was reached or no more swaps were possible.
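The swapping procedure just described can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and encoding opinion b as 1 and opinion w as 0 before taking the Pearson correlation with the degree is an assumption of the sketch.

```python
import random
from statistics import mean, pstdev


def opinion_degree_correlation(degree, S):
    """Pearson correlation between node degree and the indicator 'holds b'."""
    nodes = list(degree)
    ks = [degree[v] for v in nodes]
    os_ = [1.0 if v in S else 0.0 for v in nodes]   # assumption: b -> 1, w -> 0
    sk, so = pstdev(ks), pstdev(os_)
    if sk == 0 or so == 0:
        return 0.0   # correlation undefined; reported as 0 in this sketch
    mk, mo = mean(ks), mean(os_)
    cov = mean((k - mk) * (o - mo) for k, o in zip(ks, os_))
    return cov / (sk * so)


def raise_correlation(degree, S, target, rng, max_tries=100_000):
    """Swap opinions between b-holders and w-holders until target is reached."""
    S = set(S)
    for _ in range(max_tries):
        if opinion_degree_correlation(degree, S) >= target:
            break
        b_nodes = sorted(S)
        w_nodes = sorted(set(degree) - S)
        if not b_nodes or not w_nodes:
            break
        # No useful swap is left once every b-holder has degree at least
        # as large as every w-holder.
        if max(degree[v] for v in w_nodes) <= min(degree[v] for v in b_nodes):
            break
        n, n_prime = rng.choice(b_nodes), rng.choice(w_nodes)
        if degree[n_prime] > degree[n]:
            S.remove(n)        # n gives up opinion b ...
            S.add(n_prime)     # ... and the higher-degree node n' takes it
    return S


degree = {v: v + 1 for v in range(10)}   # toy degree sequence
seeds = raise_correlation(degree, {0, 1, 2}, target=0.5, rng=random.Random(0))
print(sorted(seeds), round(opinion_degree_correlation(degree, seeds), 3))
```

The number of agents holding b is preserved by construction, so the seed percentage and the correlation can be tuned independently.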
For each input network we performed two types of experiments. The first experiment was meant to empirically validate Theorem 1: we fixed the number of seeds (agents initially having opinion $b$) and analyzed the coverage of opinion $b$ for different propagation speeds $0 \le \alpha \le 1$. The results of the first experiment are reported in Sect. 3.1. The second experiment was meant to empirically validate Theorem 2: we analyzed how the coverage of the network varies when the initial number of seeds is increased for a fixed propagation speed. The results of the second experiment are discussed in Sect. 3.2. All results are reported as the average over 30 runs.
Synthetic Networks. We used two types of synthetic networks: scale-free networks and Erdős–Rényi networks. We created both types of networks by using the generator implemented in the SNAP library (https://snap.stanford.edu/data/). In particular, we generated scale-free networks with a specified degree sequence by specifying the number of vertices and the exponent $\lambda$ of the power-law distribution $p(k) \sim k^{-\lambda}$, where $p(k)$ indicates the fraction of nodes that have degree $k$. We generated three undirected scale-free networks with 10000 nodes having power-law distributions $k^{-2.1}$, $k^{-2.4}$, and $k^{-3.1}$, respectively. From each of these initial networks we obtained three networks by changing their assortativity. To generate Erdős–Rényi networks we specified the number of nodes and edges in order to obtain a specific average node degree. In particular, we generated an undirected Erdős–Rényi network having 10000 nodes and 12500 edges, thus with average node degree $\langle k \rangle = 2.5$. From this initial network, by using the rewiring procedure, we obtained three networks having assortativity equal to −0.5, 0, and 0.5, respectively. Starting from each generated network we used Newman's edge rewiring procedure [23] to change its assortativity. This is an iterative procedure that, at each iteration, randomly chooses two disjoint edges and swaps their endpoints if doing so changes the degree correlation. The procedure stops when the desired degree assortativity is achieved.
Real Networks. We considered a benchmark consisting of 14 graph datasets, whose main features are summarized in Table 1, which reports: the name of the dataset, the number of nodes, the number of edges, the assortativity coefficient, the average degree, the maximum degree, and the coefficient of the power-law distribution that best approximates the degree distribution of the network, computed according to the Bhattacharyya distance [5]. The datasets fbArt, fbAth, fbCom, fbGov, fbNS, fbPol, fbPF, and fbTvS have been extracted from the Facebook (fb) dataset by considering respectively artists', athletes', companies', government's, news sites', politicians', public figures' and TV shows' pages only. The datasets dzHR, dzHU and dzRO have been extracted from the Deezer dataset by considering the friendship networks of users in Croatia, Hungary and Romania, respectively.
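The edge rewiring step used for the synthetic networks can be illustrated with a short degree-preserving sketch. The acceptance rule used here (keep a swap only if it moves $r_{kk}$ closer to the target) is an assumption of this sketch, one plausible reading of "if it changes their degree correlation"; the function names and the tolerance `tol` are likewise illustrative.

```python
import random


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def assortativity(edges):
    """Pearson correlation between the degrees at the two ends of an edge."""
    deg = degrees(edges)
    # Each undirected edge is counted in both directions, so both ends
    # share the same mean and variance.
    xs = [deg[u] for u, v in edges] + [deg[v] for u, v in edges]
    ys = [deg[v] for u, v in edges] + [deg[u] for u, v in edges]
    m = len(xs)
    mx = sum(xs) / m
    var = sum((x - mx) ** 2 for x in xs) / m
    if var == 0:
        return 0.0   # regular graph: undefined, reported as 0 in this sketch
    return sum((x - mx) * (y - mx) for x, y in zip(xs, ys)) / m / var


def rewire(edges, target, rng, tol=0.01, max_iter=5000):
    """Swap endpoints of random disjoint edge pairs to approach the target."""
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    r = assortativity(edges)
    for _ in range(max_iter):
        if abs(r - target) <= tol:
            break
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue   # the two edges must be disjoint
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue   # the swap would create a duplicate edge
        trial = list(edges)
        trial[i], trial[j] = (a, d), (c, b)
        r_new = assortativity(trial)
        if abs(r_new - target) < abs(r - target):   # assumed acceptance rule
            present -= {frozenset((a, b)), frozenset((c, d))}
            present |= {frozenset((a, d)), frozenset((c, b))}
            edges, r = trial, r_new
    return edges, r


ring_with_hub = [(i, (i + 1) % 30) for i in range(30)] + [(0, j) for j in range(2, 15)]
new_edges, r = rewire(ring_with_hub, target=0.3, rng=random.Random(1))
print(round(assortativity(ring_with_hub), 3), round(r, 3))
```

Because every accepted swap replaces (a, b), (c, d) with (a, d), (c, b), each node keeps its degree, which is why networks with the same degree distribution but different assortativity can be produced.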
Table 1. Real networks characteristics.
Network       Number of nodes   Number of edges   r_kk      ⟨k⟩      k*     λ
fb [26]       134873            1380293            0.0740   20.462   1469   1.3
deezer [26]   143884            846915             0.3320   11.772   420    1.3
dblp [29]     317080            1049866            0.2665   6.622    343    1.5
fbArt         50521             819306            −0.019    32.43    1469   1.2
fbAth         13868             86858             −0.027    12.52    468    1.3
fbCom         14120             52310              0.014    7.40     215    1.4
fbGov         7058              89455              0.029    25.34    697    1.2
fbNS          27930             206259             0.022    14.76    678    1.3
fbPol         5908              41729              0.018    14.12    323    1.3
fbPF          11573             67114              0.202    11.59    326    1.4
fbTvS         3895              17262              0.561    8.86     126    1.3
dzHR          54573             498202             0.197    18.26    420    1.2
dzHU          47538             222887             0.207    9.38     112    1.2
dzRO          41773             125826             0.114    6.02     112    1.4
3.1 Speed and Coverage
The scope of the first experiment was to empirically validate Theorem 1. To this aim, for each input network we fixed the number of seeds to 10% of the nodes and varied the propagation speed $\alpha$ of opinion $b$. Results are reported in Fig. 2 for synthetic networks and Fig. 3 for real networks. In all charts the diffusion speed $\alpha$ is reported on the x-axis, while the percentage of nodes having opinion $b$ after propagation is reported on the y-axis. In Fig. 2 each row reports the results obtained on a particular type of network by considering three different values of assortativity. The series correspond to various values of the opinion degree correlation of the initial configuration. Figure 3 reports in the left chart the results obtained on the real networks by considering an initial configuration having opinion degree correlation $\rho_{k,o} = 0.5$, while in the right chart an initial opinion degree correlation of $\rho_{k,o} = 0.3$ has been considered. In both charts the series correspond to networks. The results of this experiment are consistent with Theorem 1 on both synthetic and real networks since, in general, larger values of $\alpha$ correspond to a larger coverage percentage after propagation. In particular, we noticed that for synthetic networks we generally obtained a larger variation of the coverage percentage for lower values of assortativity $r_{kk}$ (see Fig. 2). As for scale-free networks, we noticed that, in general, greater values of the power-law exponent $\lambda$ correspond to a more regular increase of the coverage percentage with respect to $\alpha$ (i.e., lower values of $\lambda$ require larger values of $\alpha$ to obtain a significant increase in the coverage percentage). Furthermore, for Erdős–Rényi networks and for scale-free networks with a high value of $\lambda$, when small values of $\alpha$ are considered, the coverage after propagation is higher for lower values of the opinion degree correlation $\rho_{k,o}$ (see Fig. 2). Finally, by analyzing Fig. 3, it can be noticed that larger values of $\rho_{k,o}$ correspond to both larger coverage values at the end of propagation and a more regular increase of the coverage percentage with respect to $\alpha$. Consider, for example, the fbNS dataset (green line in Fig. 3): the coverage percentage of 25% is reached for $\alpha = 0.8$ if $\rho_{k,o} = 0.5$ and for $\alpha \approx 0.95$ if $\rho_{k,o} = 0.3$.
Fig. 2. Coverage percentage of b nodes after propagation on the synthetic networks.
Fig. 3. Coverage percentage of b nodes after propagation on the real networks.
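An experiment of this kind can be mimicked with a small simulation. The paper does not spell out how a target speed α is enforced during a run, so the scheduler below is an assumption of this sketch: whenever both opinions could propagate, b is propagated with probability α and w otherwise. The majority threshold is again rounded up, and all names are illustrative. With that threshold every flip strictly increases the number of edges whose endpoints agree, so each run terminates.

```python
import random
from math import ceil


def adjacency(edges):
    """Build the neighbor map from an iterable of undirected edges."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj


def unstable(adj, S):
    """Return (unstable agents holding b, unstable agents holding w)."""
    ub, uw = set(), set()
    for x, nbrs in adj.items():
        agreeing = len(nbrs & S) if x in S else len(nbrs - S)
        if agreeing < ceil(len(nbrs) / 2):   # assumption: threshold rounded up
            (ub if x in S else uw).add(x)
    return ub, uw


def run(adj, S0, alpha, rng):
    """Run one dynamic until a stable configuration is reached."""
    S = set(S0)
    while True:
        ub, uw = unstable(adj, S)
        if not ub and not uw:
            return S
        # Contested step: propagate b with probability alpha (assumed rule).
        if uw and (not ub or rng.random() < alpha):
            S.add(rng.choice(sorted(uw)))      # a w-holder adopts b
        else:
            S.remove(rng.choice(sorted(ub)))   # a b-holder adopts w


def coverage(adj, S0, alpha, runs=30, seed=0):
    """Average final fraction of agents holding b over several runs."""
    rng = random.Random(seed)
    total = sum(len(run(adj, S0, alpha, rng)) for _ in range(runs))
    return total / (runs * len(adj))


path = adjacency([(1, 2), (2, 3)])
for alpha in (0.0, 0.5, 1.0):
    print(alpha, coverage(path, {2}, alpha))
```

Averaging over 30 runs follows the setup reported in Sect. 3; the toy path graph is only there to make the sketch runnable.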
Fig. 4. Coverage percentage of b nodes after propagation on the real networks for different initial configurations.
3.2 Speed and Consensus
In the second experiment we wanted to empirically validate Theorem 2. To this aim, for each real network we varied the number of seeds from 5% to 25% of the nodes of the network and analyzed the number of nodes having opinion $b$ at the end of the propagation for different diffusion speeds $\alpha$. In this experiment we always considered an initial configuration having opinion degree correlation $\rho_{k,o}$ equal to 0.5. Results are reported in Fig. 4. In all charts the diffusion speed $\alpha$ is reported on the x-axis, while the percentage of nodes having opinion $b$ after propagation is reported on the y-axis. Each chart in the figure refers to a different network, and the series correspond to different initial percentages of nodes having opinion $b$. The results are consistent with Theorem 2 on all networks: a given coverage percentage is reached for smaller values of $\alpha$ when the initial number of seeds $|S|$ is increased. For example, looking at the Facebook network (top left chart in Fig. 4), the final coverage percentage of 40% is reached for $\alpha = 1$ if $|S| = 5\%\,|N|$, for $\alpha = 0.7$ if $|S| = 15\%\,|N|$, and for $\alpha = 0$ if $|S| = 25\%\,|N|$. It can also be noted that larger values of assortativity correspond in general to a slower increase in the coverage percentage with respect to the diffusion speed.
4 Conclusion
We have studied the relationship between speed and coverage of diffusion in a competitive environment, by considering two orthogonal perspectives. On the one hand, after having fixed some initial configuration (set of seeds), we have analyzed how the speed of an opinion impacts the number of agents that will eventually adopt that opinion. On the other hand, we have analyzed how the minimum number of seeds required to reach consensus varies depending on the speed of propagation of that opinion. The model we have adopted is a natural variant of the linear threshold model, and we have focused on the well-studied setting of majority agents [1,2,7,9,20,24,25]. Hence, the most natural avenue for further research is to conduct an analysis similar to the one discussed in this paper for other settings characterized by different thresholds.
References
1. Auletta, V., Caragiannis, I., Ferraioli, D., Galdi, C., Persiano, G.: Minority becomes majority in social networks. In: Proceedings of WINE 2015, pp. 74–88 (2015)
2. Auletta, V., Ferraioli, D., Fionda, V., Greco, G.: Maximizing the spread of an opinion when tertium datur est. In: Proceedings of AAMAS 2019, pp. 1207–1215 (2019)
3. Auletta, V., Ferraioli, D., Greco, G.: Reasoning about consensus when opinions diffuse through majority dynamics. In: Proceedings of IJCAI 2018, pp. 49–55 (2018)
4. Bharathi, S., Kempe, D., Salek, M.: Competitive influence maximization in social networks. In: Proceedings of WINE 2007, pp. 306–311 (2007)
5. Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35, 99–109 (1943)
6. Borodin, A., Filmus, Y., Oren, J.: Threshold models for competitive influence in social networks. In: Saberi, A. (ed.) Proceedings of WINE 2010, pp. 539–550 (2010)
7. Bredereck, R., Elkind, E.: Manipulating opinion diffusion in social networks. In: Proceedings of IJCAI 2017, pp. 894–900 (2017)
8. Budak, C., Agrawal, D., El Abbadi, A.: Limiting the spread of misinformation in social networks. In: Proceedings of WWW 2011, pp. 665–674 (2011)
9. Chen, N.: On the approximability of influence in social networks. SIAM J. Discrete Math. 23(3), 1400–1415 (2009)
10. Chen, W., Collins, A., Cummings, R., Ke, T., Liu, Z., Rincon, D., Sun, X., Wei, W., Wang, Y., Yuan, Y.: Influence maximization in social networks when negative opinions may emerge and propagate. In: Proceedings of SDM 2011, pp. 379–390 (2011)
11. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press, Cambridge (2010)
12. Fazli, M.A., Ghodsi, M., Habibi, J., Jalaly, P., Mirrokni, V., Sadeghian, S.: On non-progressive spread of influence through social networks. Theoret. Comput. Sci. 550, 36–50 (2014)
13. Frischknecht, S., Keller, B., Wattenhofer, R.: Convergence in (social) influence networks. In: Afek, Y. (ed.) Proceedings of DISC 2013, pp. 433–446 (2013)
14. Goles, E., Olivos, J.: Periodic behaviour of generalized threshold functions. Discrete Math. 30(2), 187–189 (1980)
15. Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in social networks. In: Proceedings of WSDM 2010, pp. 241–250 (2010)
16. Granovetter, M.: The strength of weak ties. Am. J. Sociol. 78(6), 1360–1380 (1973)
17. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6), 1420–1443 (1978)
18. Gursoy, F., Gunnec, D.: Influence maximization in social networks under deterministic linear threshold model. Knowl.-Based Syst. 161, 111–123 (2018)
19. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. Theory Comput. 11(4), 105–147 (2015)
20. Khoshkhah, K., Soltani, H., Zaker, M.: On dynamic monopolies of graphs: the average and strict majority thresholds. Discrete Optim. 9(2), 77–83 (2012)
21. Lerman, K., Yan, X., Wu, X.Z.: The “majority illusion” in social networks. PLoS One 11(2), e0147617+ (2016)
22. Lou, V.Y., Bhagat, S., Lakshmanan, L.V.S., Vaswani, S.: Modeling non-progressive phenomena for influence propagation. In: Proceedings of COSN 2014, pp. 131–138 (2014)
23. Newman, M.: Assortative mixing in networks. Phys. Rev. Lett. 89(20), 208701+ (2002)
24. Peleg, D.: Size bounds for dynamic monopolies. Discrete Appl. Math. 86(2), 263–273 (1998)
25. Peleg, D.: Local majorities, coalitions and monopolies in graphs: a review. Theoret. Comput. Sci. 282(2), 231–257 (2002)
26. Rozemberczki, B., Davies, R., Sarkar, R., Sutton, C.: GEMSEC: graph embedding with self clustering. CoRR (2018)
27. Schelling, T.C.: Micromotives and Macrobehavior. W. W. Norton & Company, New York (1978)
28. Tzoumas, V., Amanatidis, C., Markakis, E.: A game-theoretic analysis of a competitive diffusion process over social networks. In: Proceedings of WINE 2012, pp. 1–14 (2012)
29. Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. In: Proceedings of ICDM12, pp. 745–754 (2012)
Beyond Fact-Checking: Network Analysis Tools for Monitoring Disinformation in Social Media
Stefano Guarino1,2(✉), Noemi Trino2, Alessandro Chessa2,3, and Gianni Riotta2
1 Institute for Applied Computing, National Research Council, Rome, Italy
[email protected]
2 Data Lab, Luiss “Guido Carli” University, Rome, Italy
3 Linkalab, Cagliari, Italy
Abstract. Operated by the H2020 SOMA Project, the recently established Social Observatory for Disinformation and Social Media Analysis supports researchers, journalists and fact-checkers in their quest for quality information. At the core of the Observatory lies the DisInfoNet Toolbox, designed to help a wide spectrum of users understand the dynamics of (fake) news dissemination in social networks. DisInfoNet combines text mining and classification with graph analysis and visualization to offer a comprehensive and user-friendly suite. To demonstrate the potential of our Toolbox, we consider a Twitter dataset of more than 1.3M tweets focused on the Italian 2016 constitutional referendum and use DisInfoNet to: (i) track relevant news stories and reconstruct their prevalence over time and space; (ii) detect central debating communities and capture their distinctive polarization/narrative; (iii) identify influencers both globally and in specific “disinformation networks”.
Keywords: Social network analysis · Disinformation · Classification
1 Introduction
“SOMA – Social Observatory for Disinformation and Social Media Analysis” is an H2020 Project aimed at supporting, coordinating and guiding the efforts of researchers, fact-checkers and journalists in countering online and social disinformation, so as to safeguard a fair political debate and a responsible, shared set of information for our citizens. At the core of the Observatory is a web-based collaborative platform for the verification of digital (user-generated) content and the analysis of its prevalence in the social debate, based on a special instance of (SOMA partner) ATC’s Truly Media1. In this paper, we present the first prototype of the DisInfoNet Toolbox, designed to support the users of the SOMA verification platform in understanding the dynamics of (fake) news dissemination in social media and in tracking down the origin and the broadcasters of false information. We overview current features, preview future extensions, and report on the insights provided by our tools in the analysis of a Twitter dataset.
1 https://www.truly.media/.
© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 436–447, 2020. https://doi.org/10.1007/9783030366872_36
Data collected on social media is paramount for understanding disinformation disorders [7], as it is instrumental to: (i) quantitative analyses of the diffusion of unreliable news stories [1]; (ii) comprehending the relevance of disinformation in the social debate, possibly incorporating thematic, polarity or sentiment classification [34]; (iii) unveiling the structure of social ties and their impact on (dis)information flows [3]. DisInfoNet was designed to support all of the above and more: it tracks specific news pieces in the data and visualizes their prevalence over time/space, classifies content in a semi-automatic fashion (relying on the clustering of a keyword/hashtag co-occurrence graph), and extracts, analyzes and visualizes social interaction graphs, embedding community detection and user classification. Additional features will soon enrich the Toolbox, such as a user-friendly interface for the Structural Topic Model [29], supporting sentiment analysis both globally and at topic level [16].
To demonstrate the potential of DisInfoNet, we also present an analysis of a dataset of over 1.3M Italian tweets dating back to November 2016 and focused on the constitutional referendum held on December 4, 2016. The significant diffusion of fake news during the political campaign before the vote, together with the dichotomous structure of referendums fostering user polarization, makes this dataset especially fit for purpose. Additionally, the distance in time from such a crucial political event makes it easier to treat sensitive issues like disinformation while avoiding the risk of recentism in analyzing social phenomena.
We found evidence of a few relevant false stories in our dataset and, by relating polarization and network analysis, we were able to gain a better understanding of how they were produced, propagated and countered, and of the role of renowned authoritative accounts as well as outsiders and bots in driving the production and sharing of news stories. From a purely quantitative point of view, it is worth noting that our findings diverge significantly from what was observed by (SOMA partner) Pagella Politica at the time [26], underlining once more that Twitter and Facebook provide very different perspectives on society and that further support from social media platforms is paramount for the research community.
2 Related Work
As reported by a recent Science Policy Forum article [21], stemming the viral diffusion of fake news and characterizing disinformation networks largely remain open problems. Besides the technical setbacks, the existence of the so-called “continued influence effect of misinformation” is widely acknowledged among socio-political scholars [31], thus questioning the intrinsic potential of debunking in countering the proliferation of fake news. Yet, the body of research work on fake news detection and (semi-)automatic debunking is vast and heterogeneous, relying on linguistics [22], deep syntax analysis [14], knowledge networks [11], or data mining [30]. Attempts at designing an end-to-end fact-checking system exist [19], but are mostly limited to detecting and evaluating strictly factual claims. Even supporting professional fact-checkers by automating stance detection is problematic, because relatedness is far easier to capture than agreement/disagreement [18]. Approaches specifically conceived for measuring the credibility of social media rumours appear to benefit from the combined effectiveness of analyzing textual features, classifying users’ posting and reposting behaviors, examining external citation patterns, and comparing same-topic messages [5,10,35]. Unfortunately, this is well beyond what the social media analytics and editorial fact-checking tools on the market permit. In this context, DisInfoNet was designed to help researchers, journalists and fact-checkers characterize the prevalence and dynamics of disinformation on social media.
Recent work confirmed the general perception that, on average, fake news gets diffused farther, faster, deeper and more broadly than true news [1,34]. The prevalence of false information is often deemed to be caused by the presence of “fake” and automated profiles, usually called bots [6]. The role of bots in disinformation campaigns is however far from being sorted out: although bots seem to be mainly responsible for fake news production and are used to boost the perceived authority of successful (human) sources of disinformation [3], they have been found to accelerate the spread of true and false news at the same rate [34]. Models for explaining the success of false information without a direct reference to bots have also been recently proposed, either based on information overload vs. limited attention [28], or on information theory and (adversarial) noise decoding [8]. Finally, investigating the relation between polarization and information spreading has been shown to be instrumental for both uncovering the role of disinformation in a country’s political life [7] and predicting potential targets for hoaxes and fake news [33].
3 The Toolbox
DisInfoNet is a Python library built on top of well-known packages (e.g., igraph, scikit-learn, NumPy, Gensim), soon to be available under the GPL on GitLab2. It provides modules for managing archives, processing and classifying text, building and analyzing graphs, and more. It is memory-efficient to support large datasets and, although a few functions are optimized for Twitter data, generally flexible. At the same time, DisInfoNet implements a pipeline designed to enable journalists and fact-checkers with no coding expertise to assess the prevalence of disinformation in social media data. This pipeline, depicted in Fig. 1, consists of three main tools which may be controlled by a single configuration file (soon to be replaced by a user-friendly dashboard embedded in the SOMA platform). One of DisInfoNet’s main features is the ability to extract and examine both keyword co-occurrence graphs and user interaction graphs induced by a specific set of themes of interest, thus providing valuable insights into the contents and the actors of the social debate around disinformation stories.
2 Please contact the authors if you wish to be notified when the code is released.
Fig. 1. DisInfoNet’s main pipeline.
The first tool of DisInfoNet’s pipeline is the Subject Finder. It filters a dataset and returns information about the prevalence of themes or news pieces of interest. It uses keyword-based queries (migration to document similarity is in progress) to extract (parsed) records into a CSV file. For instance, for Twitter data it returns tweets with covariates such as author, timestamp, geolocalization, retweet count, hashtags and mentions. It also plots the temporal and spatial distribution of all records and of the query-matching records.
The Classifier partitions records into classes based on a semi-automatic “self-training” process. By building and clustering a keyword co-occurrence graph (which the user may prune of central yet generic and/or out-of-context keywords, detrimental to clustering), it presents the user with an excerpt of the keywords associated with the obtained classes. Significantly, this means using far more keywords than any fully manual approach would permit, without sacrificing accuracy, and possibly discovering previously unknown and highly informative keywords. The user can select and label the classes of interest, which are used to automatically extract a training set. The Classifier then selects the best performing model among a few alternatives (currently, Logistic Regression and Gradient Boosting Classifier, with 10-fold cross-validation) and predicts a label for all records. When only two classes are used (e.g., republican vs. democratic, right- vs. left-wing, pro vs. against, discussing theme A vs. theme B), the obtained classification may also be extended to users (e.g., authors) by averaging over the classification of all records associated with a specific user.
Finally, the Graph Analyzer incorporates functions for graph mining and visualization. It first extracts a directed user interaction graph, wherein two users (e.g., authors) are connected based on how often they interact (e.g., cite each other). It then computes a set of global and local metrics, including: distances, eccentricity, radius and diameter; clustering coefficient; degree and assortativity; PageRank, closeness and betweenness centrality [24].
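As a rough illustration of this first Graph Analyzer step, the sketch below builds a weighted directed interaction graph from a list of retweet records and reads off in- and out-degrees. It uses only the Python standard library rather than igraph. The record layout, the function names and the edge direction (from the retweeting user to the retweeted author) are assumptions made for this example, not DisInfoNet's actual interface.

```python
from collections import Counter

def build_interaction_graph(interactions):
    """Return {(source, target): weight} for a directed interaction graph.

    `interactions` is an iterable of (source_user, target_user) pairs, e.g.
    (retweeter, retweeted_author). The direction is an assumption of this sketch.
    The weight counts how often the same ordered pair interacts.
    """
    weights = Counter()
    for source, target in interactions:
        if source != target:  # ignore self-interactions
            weights[(source, target)] += 1
    return dict(weights)

def degrees(edges):
    """Unweighted in-degree and out-degree of every vertex."""
    in_deg, out_deg = Counter(), Counter()
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    vertices = set(in_deg) | set(out_deg)
    return (
        {v: in_deg.get(v, 0) for v in vertices},
        {v: out_deg.get(v, 0) for v in vertices},
    )

# Hypothetical toy records, not taken from the dataset.
retweets = [("ann", "bob"), ("cat", "bob"), ("ann", "bob"), ("bob", "dan")]
edges = build_interaction_graph(retweets)
in_deg, out_deg = degrees(edges)
max_in_hub = max(in_deg, key=in_deg.get)
```

With this orientation, a large in-degree marks an account that many distinct users retweet, which is one plausible reading of the in-hubs reported later in Table 1.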
The Graph Analyzer also partitions the graph into communities, relying on the well-known Louvain [4] or Leading Eigenvector [25] algorithms, and applies the Guimerà-Amaral cartography [17], based on discerning inter- and intra-community connections. This results in a number of tables and plots.
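The cartography mentioned above assigns each node two numbers: the within-community degree z-score, which measures how well connected the node is inside its own community, and the participation coefficient P = 1 - sum_s (k_s / k)^2, which measures how evenly its k links are spread over communities s. The following is a minimal standard-library sketch of these two quantities. It assumes an undirected view of the graph, a precomputed community assignment and the population standard deviation; how DisInfoNet handles direction and weights is not stated in the text.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cartography(adjacency, community):
    """Return {node: (z, P)} for an undirected graph.

    adjacency: {node: set of neighbours}
    community: {node: community label}, e.g. from a prior Louvain run
    """
    # Number of links each node has towards every community.
    links_to = {}
    for node, neighbours in adjacency.items():
        counts = defaultdict(int)
        for other in neighbours:
            counts[community[other]] += 1
        links_to[node] = counts

    # Internal degrees grouped by community, used for the z-score.
    internal = defaultdict(list)
    for node in adjacency:
        own = community[node]
        internal[own].append(links_to[node].get(own, 0))

    roles = {}
    for node, neighbours in adjacency.items():
        own = community[node]
        k = len(neighbours)
        mu, sigma = mean(internal[own]), pstdev(internal[own])
        z = (links_to[node].get(own, 0) - mu) / sigma if sigma > 0 else 0.0
        p = 1.0 - sum((c / k) ** 2 for c in links_to[node].values()) if k else 0.0
        roles[node] = (z, p)
    return roles

# Hypothetical toy graph: two triangles joined by the single edge c-d.
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
community = {"a": "A", "b": "A", "c": "A", "d": "B", "e": "B", "f": "B"}
roles = cartography(adjacency, community)
```

In the toy graph, node c has two of its three links inside its own community and one outside, so its participation coefficient is 1 - (4/9 + 1/9) = 4/9, while node a has all links inside and a coefficient of 0.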
4 Politics and Information in 2016 Italy
The 2013 election imposed an unprecedented tripolar equilibrium on the Italian political scene, with the 5 Stars Movement (5SM) breaking the traditional left-right framework, and the rise of the populist right party Northern League (NL). In 2016, the Italian government led by the center-left Democratic Party (PD) promoted a constitutional reform which led to a referendum, held on December 4, 2016. Both the 5SM and the NL opposed the referendum, making the NO faction a composite front supported by a wide spectrum of formations with alternative yet sometimes overlapping political justifications. In this framework, populist movements showed an extraordinary ability in setting the agenda, by imposing carefully selected instrumental news-frames and narratives that found the perfect breeding ground in Italy, the country of political disaffection par excellence [12]. New media, in particular, offered an unprecedented opportunity: to maintain a critical, even conspiratorial, attitude towards the establishment-dominated media, while enhancing the role of alternative/social media as strategic resources for community-building and alternative agenda setting [2]. In these contexts, Twitter plays a strategic role for newly born political parties, which, through the activation of the two-way street mediatization, may incorporate their proposals into conventional media [9]. The dichotomous structure of the referendum was however instrumental to both sides for aligning the various issues along a pro-/anti-status quo spectrum. The final victory of the NO caused Renzi’s resignation as Head of Government and paved the way for the definitive affirmation of the 5SM and the NL, who in 2018 joined forces in forming a so-called “government of change”.
4.1 Disinformation Stories
In order to identify relevant disinformation themes in the political campaign, we relied on the activity of fact-checking and news agencies, who reported lists of fake news that went viral during the referendum campaign. Mostly based on the work by fact-checking web portal Bufale.net [23], online newspaper Il Post [27], and SOMA partner and political fact-checking agency Pagella Politica [26], we were able to identify the twelve main pieces of disinformation related to the referendum. To widen the scope of the analysis, we considered stories and speculations that reflect information disorders in a broader sense, from rumors, hearsay, clickbait items and unintentionally propagated misinformation, to conspiracy theories and organized propaganda, often used by the two sides to accuse one another. We then classified these disinformation stories into four categories: (i) the QUOTE category includes entirely fabricated quotes of public figures endorsing one or the other faction or defaming voters of the other side; (ii) the CONSQ group of news contains manipulated interpretations of genuine information about the (potential) consequences of the reform; (iii) the PROPG category includes news inserted in a typical populist frame, opposing people vs. the élite; (iv) finally, the FRAUD category concerns the integrity of the electoral process, with stories about unauthorized access to voting machines and altered voting results. Due to page restrictions, in this paper we only study disinformation at this category level, deferring a detailed analysis at news-story level to future work. Significantly, this type of category-based approach is fully supported by DisInfoNet and easily available through the configuration file.
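To make the category-level approach concrete, the sketch below tags records with categories through keyword-based queries and writes the matches to CSV, in the spirit of the Subject Finder. Everything specific here is hypothetical: the keyword sets are placeholders (the actual queries are not given in the text), and the query semantics (all keywords of at least one set must occur), the field names and the function names are assumptions of this example rather than DisInfoNet's configuration format.

```python
import csv
import io

# Hypothetical placeholder queries; the keyword lists actually used are not
# given in the text. Each category maps to alternative keyword sets: a record
# matches a category if all keywords of at least one set occur in its text.
CATEGORY_QUERIES = {
    "QUOTE": [{"fake", "quote"}],
    "FRAUD": [{"ballot", "rigged"}, {"vote", "tampering"}],
}

def tokenize(text):
    """Lower-case the text and strip surrounding punctuation from each token."""
    return {token.strip(".,;:!?#\"'()").lower() for token in text.split()}

def categories_for(text, queries=CATEGORY_QUERIES):
    """Return the sorted list of categories whose queries match the text."""
    tokens = tokenize(text)
    return sorted(
        category
        for category, alternatives in queries.items()
        if any(keywords <= tokens for keywords in alternatives)
    )

def find_subjects(records, queries=CATEGORY_QUERIES):
    """Keep only matching records, adding a 'categories' field to each."""
    matched = []
    for record in records:
        found = categories_for(record["text"], queries)
        if found:
            matched.append({**record, "categories": "|".join(found)})
    return matched

def to_csv(rows, fields=("author", "timestamp", "categories", "text")):
    """Serialize the matched records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical toy records.
records = [
    {"author": "u1", "timestamp": "2016-11-20T10:00:00", "text": "Another fake quote is going around!"},
    {"author": "u2", "timestamp": "2016-11-20T11:00:00", "text": "Polls open at 7 am"},
    {"author": "u3", "timestamp": "2016-11-21T09:30:00", "text": "They say the ballot was rigged."},
]
matched = find_subjects(records)
csv_text = to_csv(matched)
```

Working at category level means that several story-specific queries can share one label, so later steps can compare the four classes without tracking individual stories.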
5 Findings
In this section, we demonstrate the potential of DisInfoNet by analyzing a dataset of more than 1.3M tweets to shed light on the dynamics of social disinformation as Italy approached the referendum.
5.1 Disinformation Prevalence
With each of the selected news stories represented by a suitable keyword-based query, we ran the Subject Finder to identify our set of disinformation tweets, have them labelled with categories, and obtain the plots in Fig. 2 showing their temporal and geographical distribution. In Fig. 2a we see the one-day rolling mean of the four classes across November 2016, compared with the overall trend. The presence of disinformation in the dataset is limited, yet non-negligible: except for QUOTE tweets, each of the other three classes accounts for roughly 0.3–0.6% of the records (see Table 1). The volume of discussion about fake/distorted news stories does not simply increase as the referendum approaches, as the general discussion does; rather, different stories have different spikes, possibly related to events (e.g., a politician giving an interview) or to the activity of some influencer. Regarding the geography of the debate, we found that only 29716 tweets, that is, 2.21% of the whole dataset, were geotagged, and this percentage is even lower (≈1%) among disinformation tweets (see Table 1 for details), possibly because users involved in this type of discussion are more concerned about privacy than the average. The map, reported in Fig. 2b, shows some activity in Great Britain and the Benelux area, but disinformation topics appear to be substantially absent outside Italy.
5.2 Polarization and Disinformation
The Classifier can now be used to gain a better understanding of the relation between polarization and disinformation in our dataset. During the semi-automatic self-training process, we pruned a few central but out-of-context hashtags (e.g., “#photo” and “#trendingtopic”) and let the Classifier run Louvain’s algorithm and plot the hashtag graph. This graph, reported in Fig. 3a, shows that: (i) hashtags used by the NO and YES supporters are strongly clustered; (ii) “neutral” hashtags (such as those used by international reporters) also cluster together; (iii) a few hashtags are surprisingly highly ranked, such as “#ottoemezzo”, a popular and supposedly impartial political talk show that is central in the NO cluster, thus confirming regular patterns of behavior in the “second-screen” use of social network sites to comment on television programs [32]. In particular, it is easy to identify two large clusters of hashtags clearly characterizing the two sides: the YES cluster is dominated by the hashtags “#bastaunsì” (“a yes is enough”) and “#iovotosi” (“I vote yes”), whereas the NO cluster is dominated by “#iovotono” (“I vote no”), “#iodicono” (“I say no”) and “#renziacasa” (“Renzi go home”). In this perspective, both communities show clear segregation and high levels of clustering by political alignment, thus confirming the hypothesis of social-media platforms as echo chambers, with political exchanges exhibiting “a highly partisan community structure with two homogeneous clusters of users who tend to share the same political identity” [12].
Fig. 2. The temporal and spatial distribution of disinformation tweets. (a) Temporal distribution by class. (b) Spatial distribution by class.
By interacting with the Classifier, we selected the aforementioned YES and NO clusters as the sets of hashtags to be used for building a training set. Labelling works as follows: −1 (NO) if the tweet only contains hashtags from the NO cluster; +1 (YES) if the tweet only contains hashtags from the YES cluster; 0 (UNK) if the tweet contains a mix of hashtags from the two clusters. Significantly, we also obtained a continuous score in [−1, 1] for each user, as the average score of the user’s tweets. When run after the Subject Finder, the Classifier also plots a histogram that helps relate classification and disinformation, reported in Fig. 3b. We immediately see that UNK tweets are substantially negligible, while NO tweets are almost 1.5× more frequent than YES tweets, supporting the widespread belief that the NO front was significantly more active than its counterpart in the social debate.
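The labelling rule and the per-user score described above can be sketched in a few lines. The hashtag sets below contain only the cluster hashtags quoted in the text, not the full clusters. The handling of hashtags outside both clusters (ignored) and of tweets with no cluster hashtag (left unlabelled, returned as `None`) is an assumption of this sketch, as are the function names.

```python
from collections import defaultdict

NO, YES, UNK = -1, 1, 0

# Only the hashtags quoted in the text; the real clusters are larger.
YES_TAGS = {"#bastaunsì", "#iovotosi"}
NO_TAGS = {"#iovotono", "#iodicono", "#renziacasa"}

def label_tweet(hashtags, yes_tags=YES_TAGS, no_tags=NO_TAGS):
    """Return -1 (NO), +1 (YES), 0 (UNK) or None if no cluster hashtag occurs."""
    tags = {h.lower() for h in hashtags}
    has_yes, has_no = bool(tags & yes_tags), bool(tags & no_tags)
    if has_yes and has_no:
        return UNK
    if has_yes:
        return YES
    if has_no:
        return NO
    return None  # assumption: such tweets are left out of the training set

def user_scores(tweets):
    """Average the labels of each user's labelled tweets into a score in [-1, 1]."""
    labels = defaultdict(list)
    for author, hashtags in tweets:
        label = label_tweet(hashtags)
        if label is not None:
            labels[author].append(label)
    return {author: sum(values) / len(values) for author, values in labels.items()}

# Hypothetical toy tweets as (author, hashtags) pairs.
tweets = [
    ("ann", ["#iovotono"]),
    ("ann", ["#renziacasa", "#ottoemezzo"]),
    ("bob", ["#iovotosi"]),
    ("bob", ["#iovotosi", "#iovotono"]),
    ("cat", ["#photo"]),
]
scores = user_scores(tweets)
```

In the pipeline described in the text, the user score averages the classification of all of a user's records, including labels predicted by the trained model; the sketch averages only the hashtag-derived labels to stay self-contained.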
Disinformation news stories mostly follow the general trend, but: (i) topics of the QUOTE and PROPG classes, which gather attack vectors frequently used by the populist parties, are especially popular among NO supporters (hence, debunking efforts are invisible); (ii) on the other hand, YES supporters are more active than average in the CONSQ topics, probably due to the concurrent attempts at promoting the referendum and at tackling the fears of potential NO voters.
Fig. 3. The hashtag graph and the classification results. (a) The hashtag graph, with clusters highlighted, vertex size by PageRank. (b) The polarization of tweets, in total and for the four disinformation classes.
5.3 Interaction Graphs and Disinformation
Finally, we used the Graph Analyzer to better understand the dynamics of disinformation networks in our dataset. Due to page restrictions, in the following we only focus on retweets and on the CONSQ and PROPG disinformation classes, leaving a more detailed analysis to future work. Among the three supported types of interaction, in fact, retweeting is the simplest endorsement tool [20], commonly used for promoting ideas and campaigns and for community building, possibly relying on semi-automatic accounts. On the other hand, the CONSQ and PROPG classes appeared to be the most informative, for both their different polarity distribution and their almost non-intersecting sets of influencers.
First of all, we obtained a number of macroscopic descriptors that yield insights into the structural similarities and differences of the two graphs, reported in Table 1. The CONSQ and PROPG graphs are similar in size (2755 vertices and 3786 edges vs. 2126 and 2886) and have similarly sized in- and out-hubs (628 and 16 vs. 653 and 18), but the diameter of the CONSQ graph is significantly smaller (12 vs. 30) despite it having a larger average distance (2.73 vs. 1.64). These numbers suggest that PROPG disinformation stories travelled less on average, but were sporadically able to reach very peripheral users. Additionally, we see that the clustering coefficient of the two graphs is almost identical and rather small (≈0.004), more than one order of magnitude smaller than the clustering coefficient of the whole graph. This suggests that these disinformation networks may not be “self-organizing” and their structure might be governed by artificial diffusion patterns.
Table 1. Dataset overview. The columns from Vertices to Avg. dist. refer to the retweet graph.
          Tweets    Geotags (%)     Vertices  Edges    Max in-degree  Max out-degree  Clustering  Diam.  Avg. dist.
Dataset   1344216   29716 (2.21%)   72574     451423   4813           1541            0.0483      149    4.81044
CONSQ     7909      71 (0.90%)      2755      3786     628            16              0.0039      12     2.72581
PROPG     4345      47 (1.08%)      2126      2886     653            18              0.00385     30     1.63941
FRAUD     5362      69 (1.29%)      2195      3452     692            13              0.00321     8      2.45673
QUOTE     57        1 (1.75%)       9         8        8              1               0.0         1      1.0
For a closer analysis, Fig. 4 shows, for both classes, the network composed of the top 500 users by PageRank. In these plots, users are colored by their polarity and edges take the average color of the connected vertices. The size of a vertex is proportional to its PageRank, whereas the width of an edge is proportional to its weight, i.e., the number of interactions between the two users. These plots highlight a number of interesting aspects. First of all, the NO front appears to be generally dominant, with relevant YES actors only emerging in the debate on the alleged consequences of the referendum. Also, there seems to be limited interaction between YES and NO supporters, as can be seen from the fact that edges almost always link vertices of similar or even identical color. Among the leaders of the NO front, we find well-known public figures (e.g., politicians Renato Brunetta and Fabio Massimo Castaldo in the PROPG graph) along with accounts not associated with any publicly known individual. In most cases, these are militants of the NO front, sometimes having multiple aliases, whose activity is characterized by a high number of retweets and mentions of well-known actors belonging to the same community (e.g., Antonio Bordin, Claudio Degl’Innocenti, Angelo Sisca, Liberati Linda). Additional insights can be gained by using Truthnest3, a tool developed by SOMA partner ATC, which reports analytics on the usage patterns of a specified account summarized into a bot-likelihood score. One of the most influential nodes of the PROPG graph, @INarratore, turned out to have a suspiciously high 60% bot-score, as well as only 1% of original tweets and a considerable number of “suspicious followers”. In the same graph, @dukana2 has a 50% bot-score, while the account @advalita has been suspended from Twitter.
In the CONSQ graph, the most central user is @ClaudioDeglinn2, characterized by a relatively low 10% bot-score, but apparently in control of at least 7 other aliases and strongly connected with other amplification accounts. Two of these “amplifiers” are especially noteworthy: @IPredicatore, having a 40% bot-score, and @PatriotaIl, having a 30% bot-score, mentioning @ClaudioDeglinn2 in more than 20% of his tweets, and producing only 3% original tweets. Altogether, we seem to have found indicators of coordinated efforts to avoid bot detection tools while reaching peripheral users and expanding the network.
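The "top users by PageRank" view used in this section can be approximated with a short power-iteration sketch. This is a generic, unweighted PageRank written with the standard library only; the damping factor of 0.85, the uniform treatment of dangling nodes, the edge orientation and the function names are assumptions of this example, and the text does not say which settings the Graph Analyzer uses.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed edge list (unweighted)."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in out_links[v]:
                new_rank[target] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

def top_k_subgraph(edges, rank, k):
    """Return the k highest-ranked nodes and the edges among them."""
    top = set(sorted(rank, key=rank.get, reverse=True)[:k])
    return top, [(s, t) for s, t in edges if s in top and t in top]

# Hypothetical toy retweet graph, oriented from retweeter to retweeted author.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("c", "a")]
rank = pagerank(edges)
top, sub_edges = top_k_subgraph(edges, rank, k=2)
```

Restricting the plot to the induced subgraph of the top-ranked users, as done for Fig. 4 with k = 500, keeps the picture readable while preserving the interactions among the most central accounts.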
3 https://app.truthnest.com/.
Fig. 4. The top 500 users by PageRank. Color is by polarity, size by PageRank. (a) The PROPG graph. (b) The CONSQ graph.
6 Conclusion
In this paper, we publicly presented, to both the scientific and the fact-checking community, an integrated toolbox for monitoring social disinformation, conceived as part of the H2020 Social Observatory for Disinformation and Social Media Analysis. Our DisInfoNet Toolbox builds on well-established techniques for text and graph mining to provide a wide spectrum of users with instruments for quantifying the prevalence of disinformation and understanding its dynamics of diffusion on social media. We presented a case study focused on the 2016 Italian constitutional referendum, wherein the natural bipolar political structure of the debate helps in reducing one of the most frequent problems in opinion detection on social media, related to the identification of all possible political orientations (associated with communities). Following the literature [12,15], we resorted to retweets in order to analyze accounts and their interactions according to their possible political orientation. The combined analysis of political communities and network clustering and centrality shows how the referendum caused a clear segregation by political alignment [13], pointing to the existence of different echo chambers. From a thematic point of view, news stories related to conspiracy theories and distrust of the political élite were especially popular and traveled deeper than any other category of disinformation. We found evidence of a correlation between users’ polarization and participation in disinformation campaigns, and by highlighting the primary actors of disinformation production and propagation we could manually tell apart public figures, activists and potential bots.
Our DisInfoNet Toolbox will soon be available online and will be extended in the near future. We believe that the state-of-the-art techniques for classification and network analysis embedded in the Toolbox will pave the way for future research in the area, crucial to the preservation of our public conversation and the future of our democracies.
References
1. Allcott, H., Gentzkow, M.: Social media and fake news in the 2016 election. J. Econ. Perspect. 31(2), 211–36 (2017)
2. Alonso-Muñoz, L., Casero-Ripollés, A.: Communication of European populist leaders on Twitter: agenda setting and the ‘more is less’ effect. Prof. Inform. 27(6), 1193–1202 (2018)
3. Bessi, A., Ferrara, E.: Social bots distort the 2016 US presidential election online discussion (2016)
4. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008)
5. Boididou, C., Middleton, S.E., Jin, Z., Papadopoulos, S., Dang-Nguyen, D.T., Boato, G., Kompatsiaris, Y.: Verifying information with multimedia content on twitter. Multimed. Tools Appl. 77(12), 15545–15571 (2018)
6. Boshmaf, Y., Muslukhov, I., Beznosov, K., Ripeanu, M.: Design and analysis of a social botnet. Comput. Netw. 57(2), 556–578 (2013)
7. Bovet, A., Makse, H.A.: Influence of fake news in Twitter during the 2016 US presidential election. Nat. Commun. 10(1), 7 (2019)
8. Brody, D.C., Meier, D.M.: How to model fake news. arXiv preprint arXiv:1809.00964 (2018)
9. Casero-Ripollés, A., Feenstra, R.A., Tormey, S.: Old and new media logics in an electoral campaign: the case of Podemos and the two-way street mediatization of politics. Int. J. Press/Polit. 21(3), 378–397 (2016)
10. Castillo, C., Mendoza, M., Poblete, B.: Information credibility on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, pp. 675–684. ACM (2011)
11. Ciampaglia, G.L., Shiralkar, P., Rocha, L.M., Bollen, J., Menczer, F., Flammini, A.: Computational fact checking from knowledge networks. PLoS One 10(6), e0128193 (2015)
12. Conover, M., Ratkiewicz, J., Francisco, M.R., Gonçalves, B., Menczer, F., Flammini, A.: Political polarization on Twitter. In: ICWSM, vol. 133, pp. 89–96 (2011)
13. Conover, M.D., Gonçalves, B., Flammini, A., Menczer, F.: Partisan asymmetries in online political activity. EPJ Data Sci. 1(1), 6 (2012)
14. Feng, V.W., Hirst, G.: Detecting deceptive opinions with profile compatibility. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 338–346 (2013)
15. Garimella, K., Weber, I.: A long-term analysis of polarization on Twitter. CoRR abs/1703.02769 (2017)
16. Guarino, S., Santoro, M.: Multi-word structural topic modelling of ToR drug marketplaces. In: 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pp. 269–273. IEEE (2018)
17. Guimerà, R., Nunes Amaral, L.: Functional cartography of complex metabolic networks. Nature 433, 895 (2005)
18. Hanselowski, A., PVS, A., Schiller, B., Caspelherr, F., Chaudhuri, D., Meyer, C.M., Gurevych, I.: A retrospective analysis of the fake news challenge stance detection task. arXiv:1806.05180 (2018)
19. Hassan, N., Zhang, G., Arslan, F., Caraballo, J., Jimenez, D., Gawsane, S., Hasan, S., Joseph, M., Kulkarni, A., Nayak, A.K., et al.: ClaimBuster: the first-ever end-to-end fact-checking system. Proc. VLDB Endow. 10(12), 1945–1948 (2017)
20. Kantrowitz, A.: The man who built the retweet: “we handed a loaded weapon to 4-year-olds” (2019). www.buzzfeednews.com/article/alexkantrowitz/howtheretweetruinedtheinternet. Accessed 05 Aug 2019
21. Lazer, D.M., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F., Metzger, M.J., Nyhan, B., Pennycook, G., Rothschild, D., et al.: The science of fake news. Science 359(6380), 1094–1096 (2018)
22. Markowitz, D.M., Hancock, J.T.: Linguistic traces of a scientific fraud: the case of Diederik Stapel. PLoS One 9(8), e105937 (2014)
23. Mastinu, L.: TOP 10 Bufale e disinformazione sul Referendum (2016). www.bufale.net/top10bufaleedisinformazionesulreferendum/. Accessed 05 Jul 2019
24. Newman, M., Barabasi, A.L., Watts, D.J. (eds.): The Structure and Dynamics of Networks. Princeton University Press, Princeton (2006)
25. Newman, M.E.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74(3), 036104 (2006)
26. Politica, R.P.: La notizia più condivisa sul referendum? È una bufala (2016). https://pagellapolitica.it/blog/show/148/lanotiziapi%C3%B9condivisasulreferendum%C3%A8unabufala. Accessed 05 Jul 2019
27. Post, R.I.: Nove bufale sul referendum (2016). www.ilpost.it/2016/12/02/bufalereferendum/. Accessed 05 Jul 2019
28. Qiu, X., Oliveira, D.F., Shirazi, A.S., Flammini, A., Menczer, F.: Limited individual attention and online virality of low-quality information. Nat. Hum. Behav. 1(7), 0132 (2017)
29. Roberts, M.E., Stewart, B.M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S.K., Albertson, B., Rand, D.G.: Structural topic models for open-ended survey responses. Am. J. Polit. Sci. 58(4), 1064–1082 (2014)
30. Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on social media: a data mining perspective. ACM SIGKDD Explor. Newsl. 19(1), 22–36 (2017)
31. Skurnik, I., Yoon, C., Park, D.C., Schwarz, N.: How warnings about false claims become recommendations. J. Consum. Res. 31(4), 713–724 (2005)
32. Trilling, D.: Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Soc. Sci. Comput. Rev. 33(3), 259–276 (2015)
33. Vicario, M.D., Quattrociocchi, W., Scala, A., Zollo, F.: Polarization and fake news: early warning of potential misinformation targets. ACM Trans. Web (TWEB) 13(2), 10 (2019)
34. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)
35. Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., Procter, R.: Detection and resolution of rumours in social media: a survey. ACM Comput. Surv. (CSUR) 51(2), 32 (2018)
Suppressing Information Diffusion via Link Blocking in Temporal Networks
Xiu-Xiu Zhan, Alan Hanjalic, and Huijuan Wang(B)
Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands [email protected]
Abstract. In this paper, we explore how to effectively suppress the diffusion of (mis)information via blocking/removing the temporal contacts between selected node pairs. Information diffusion can be modelled as, e.g., an SI (Susceptible-Infected) spreading process on a temporal social network: an infected (information-possessing) node spreads the information to a susceptible node whenever a contact happens between the two nodes. Specifically, the link (node pair) blocking intervention is introduced for a given period and for a given number of links, limited by the intervention cost. We address the question: which links should be blocked in order to minimize the average prevalence over time? We propose a class of link properties (centrality metrics) based on the information diffusion backbone [19], which characterizes the contacts that actually appear in diffusion trajectories. Centrality metrics of the integrated static network have also been considered. For each centrality metric, links with the highest values are blocked for the given period. Empirical results on eight temporal network datasets show that the diffusion-backbone-based centrality methods outperform the other metrics, whereas the betweenness of the static network performs reasonably well, especially when the prevalence grows slowly over time.
Keywords: Link blocking · Link centrality · Information diffusion backbone · Temporal network · SI spreading
1 Introduction
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 448–458, 2020. https://doi.org/10.1007/9783030366872_37

The development of sensor technology and electronic communication services provides us with access to rich human interaction data, including proximity data such as human face-to-face contacts and electronic communication data such as email exchanges, message exchanges and phone calls [6,14,18]. The recorded human interactions can be represented as temporal networks, in which each interaction is represented as a contact at a given time step between two nodes. The availability of such social temporal networks inspires us to explore how to suppress the diffusion of (mis)information that unfolds on them. One possible intervention is to block the links (i.e., remove contacts between node pairs), but
only for a given period and for given node pairs, limited by the intervention cost. In this work, we address the question: which links should we block for a given period in order to minimize the prevalence averaged over time, i.e., to prevent or delay the diffusion on temporal networks?

Progress has been made recently in understanding, e.g., which temporal topological properties (temporal centrality metrics) a node should have to be selected as the seed node that starts the information diffusion in order to maximize the final prevalence [3,5,8,13,15,16], and which temporal topological properties make a link appear more frequently in a diffusion trajectory [19]. These works explored in general the relation between a node's or link's topological properties and its role in a dynamic process on a temporal network. Our question of which links should be blocked to suppress information diffusion reveals the role of a link within a given period in a diffusion process in relation to the link's temporal topological properties.

As a starting point, we consider the Susceptible-Infected (SI) model as the information diffusion process. A seed node possesses the information (is infected) at time $t = 0$, whereas all the other nodes are susceptible. An infected node spreads the information to a susceptible node whenever a contact happens between the two nodes. Given a temporal network within the observation time window $[0, T]$, we would like to choose a given number of links to block within a period $[t_s, t_e]$ in order to suppress the diffusion. We propose a comprehensive set of link centrality metrics that characterize diverse temporal topological properties. Each centrality metric is used to rank the links, and we remove the links with the highest centrality values for the period $[t_s, t_e]$. One group of centrality metrics is based on the information diffusion backbone [19], which characterizes how the contacts appear in a diffusion trajectory and thus contribute to the diffusion process.
Centrality metrics of the integrated static network, where two nodes are connected if they have at least one contact, are also considered. We also propose the temporal link gravity, generalized from the static node gravity model [9]. We run the SI spreading on the original temporal network as well as on the temporal network after link blocking. Their difference in prevalence accumulated over time is used to evaluate the performance of the link blocking strategies/metrics. Our experiments on eight real-world temporal networks show that the diffusion-backbone-based metrics and the betweenness of the static integrated network evidently outperform the rest. The backbone-based metrics perform better when the prevalence increases fast over time, whereas the betweenness of the static network performs better when it increases slowly. This observation holds for diverse choices of the blocking period $[t_s, t_e]$ and of the number of links to block. Our finding points out that both temporal and static centrality metrics, with different computational complexities, are crucial in identifying links' role in a dynamic process.

The rest of the paper is organized as follows. We propose the methodology in Sect. 2. In Sect. 2.1, the representation of a temporal network is introduced. In Sect. 2.2, the construction of the diffusion backbone is illustrated. Afterwards, we propose the link centrality metrics in Sect. 2.3. In Sect. 2.4, the link blocking procedure and the performance evaluation method are given. We further describe
the temporal empirical networks used in this study in Sect. 3. The results of the link blocking strategies on these networks are analyzed in Sect. 4. We conclude the paper in Sect. 5.
2 Methods

2.1 Representation of Temporal Networks
A temporal network within a given time window $[0, T]$ is represented as $G = (\mathcal{N}, \mathcal{L})$, where $\mathcal{N}$ denotes the node set and the number of nodes is $N = |\mathcal{N}|$. The contact set $\mathcal{L} = \{l(j, k, t),\ t \in [0, T],\ j, k \in \mathcal{N}\}$ contains the element $l(j, k, t)$ representing that a contact between nodes $j$ and $k$ occurs at time step $t$. The integrated weighted network of $G$ is denoted by $G_W = (\mathcal{N}, \mathcal{L}_W)$. The weight $w_{jk}$ of link $l(j, k)$ counts the number of contacts between node $j$ and node $k$.

2.2 Information Diffusion Backbone
The information diffusion backbone was proposed to characterize how node pairs appear in a diffusion trajectory and thus contribute to the actual diffusion process [19]. To illustrate our method, we construct the backbone for the SI model with infection probability $\beta = 1$, which means that an infected node infects a susceptible node with certainty whenever the two nodes have a contact. The backbone can also be constructed for the SI model with any infection probability $\beta \in [0, 1]$. We first record the spreading tree $\mathcal{T}_i$ of each node $i$ by setting $i$ as the seed of the SI spreading process starting at $t = 0$. The spreading tree $\mathcal{T}_i$ is the union of the contacts through which the information propagates. The diffusion backbone $G_B$ is defined as the union of all the spreading trees, i.e.,

$$G_B = (\mathcal{N}, \mathcal{L}_B) = \bigcup_{i=1}^{N} \mathcal{T}_i.$$
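As a rough illustration of the construction above, the following Python sketch records one spreading tree per seed for $\beta = 1$ and counts how often each node pair appears across all trees. It is a minimal sketch under stated assumptions, not the authors' implementation: contacts are given as hypothetical `(j, k, t)` tuples, a node infected at step $t$ only starts spreading at later steps, and when several infected neighbours contact a susceptible node in the same step the first one in input order is recorded. The function names and the optional `window` argument are introduced here for illustration only.

```python
from collections import defaultdict


def spreading_tree(contacts, seed):
    """Contacts through which an SI process (beta = 1) seeded at `seed` spreads.

    `contacts` is an iterable of undirected (j, k, t) tuples.
    Returns a list of directed (source, target, t) tuples.
    """
    by_time = defaultdict(list)
    for j, k, t in contacts:
        by_time[t].append((j, k))

    infected = {seed}
    tree = []
    for t in sorted(by_time):
        newly = {}  # target -> contact that infects it at step t
        for j, k in by_time[t]:
            for src, dst in ((j, k), (k, j)):
                # Assumption: only nodes infected before step t can spread at t.
                if src in infected and dst not in infected and dst not in newly:
                    newly[dst] = (src, dst, t)
        infected.update(newly)
        tree.extend(newly.values())
    return tree


def backbone_weights(contacts, nodes, window=None):
    """Weight of each node pair in the union of all spreading trees.

    If `window = (t_s, t_e)` is given, only tree contacts with
    t_s <= t <= t_e are counted (a time-confined variant).
    """
    weights = defaultdict(int)
    for seed in nodes:
        for src, dst, t in spreading_tree(contacts, seed):
            if window is None or window[0] <= t <= window[1]:
                weights[frozenset((src, dst))] += 1
    return dict(weights)


# Hypothetical toy network with three nodes and three contacts.
contacts = [(1, 2, 1), (2, 3, 2), (1, 3, 3)]
weights = backbone_weights(contacts, nodes=[1, 2, 3])
```

In this toy example the pair (2, 3) carries the information in all three trees, so it receives the largest weight, while the late contact (1, 3) is only used when node 3 is the seed.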
We use $\mathcal{N}$ and $\mathcal{L}_B$ to represent the node set and the link set, respectively. Each link $l(j, k)$ in $\mathcal{L}_B$ is associated with a weight $w^B_{jk}$, counting the number of contacts between $j$ and $k$ that appear in the diffusion trees/trajectories initiated from every node. An example of how we construct the diffusion backbone $G_B$ is given in Fig. 1(a–c).

2.3 Link Centrality Metrics
We first propose three backbone-based link centrality metrics:

• Backbone Weight. The backbone weight $w^B_{jk}$ of a link $l(j, k)$ counts how many times the link or its contacts appear in the spreading trees (trajectories) initiated from every node.
• Time-confined Backbone Weight $[t_s, t_e]$. Furthermore, we define the time-confined information diffusion backbone $G_{B^*}$, which generalizes our previous backbone definition. The backbone $G_{B^*}$ confined within a time window $[t_s, t_e]$
Fig. 1. (a) A temporal network $G$ with $N = 5$ nodes and $T = 8$ time steps. (b) Spreading trees rooted at every seed node. The time step on each link denotes the time of the contact through which information diffuses. (c) The diffusion backbone $G_B$. (d) The diffusion backbone $G_{B^*}$ confined within $t_s = 2$, $t_e = 5$. When we consider only the contacts that appear in the time window $[t_s, t_e] = [2, 5]$, the value on each link shows the link weight in $G_{B^*}$.
is the union of all the spreading trees, but only of the contacts that occur within $[t_s, t_e]$. Hence, two nodes in $G_{B^*}$ are connected if at least one contact between them within $[t_s, t_e]$ appears in a diffusion tree rooted at any node. The weight $w^{B^*}_{jk}$ of link $l(j, k)$ in $G_{B^*}$ equals the number of times that contacts between $j$ and $k$ within $[t_s, t_e]$ appear in the spreading trees rooted at every node. The link weight in $G_{B^*}$ characterizes how frequently a link, within $[t_s, t_e]$, contributes to the information diffusion. An example of the time-confined backbone construction is given in Fig. 1(d), where $t_s = 2$, $t_e = 5$. Take link $l(2, 4)$ as an example. It appears in the spreading trees twice, both at time step $t_1$, which is outside the range $[t_s = 2, t_e = 5]$. Therefore, $w^{B^*}_{24} = 0$. Link $l(2, 3)$ appears at time steps $t_8, t_3, t_3, t_3, t_3$ in all the spreading trees; only the time step $t_8$ is outside the range $[2, 5]$. Hence, $w^{B^*}_{23} = 4$.
• Backbone Betweenness. The backbone betweenness is defined to measure the link influence in disseminating global information. Given a spreading tree $\mathcal{T}_i$, the number of descendant nodes of link $l(j, k)$ is denoted as $B^i_{jk}$. We define the backbone betweenness $B_{jk}$ of link $l(j, k)$ as the average number of descendant nodes over all the spreading trees, i.e., $B_{jk} = \frac{1}{N} \sum_{i \in \mathcal{N}} B^i_{jk}$.

We also consider the following centrality metrics derived from the integrated weighted network. Only the links in the integrated network are candidates for blocking, and all the following metrics are zero for a node pair that is not connected in the integrated network.

• Degree Product. The degree product of a link $l(j, k)$ is the product of the degrees of its two end nodes in $G_W$, i.e., $d_j \cdot d_k$.
• Strength Product. The node strength of a node $j$ in $G_W$ is defined as $s_j = \sum_{k \in \Gamma_j} w_{jk}$, where $\Gamma_j$ is the neighbor set of node $j$. Hence, the strength of a node equals the total weight of all the links incident to this node. We define the strength product of a link $l(j, k)$ as $s_j \cdot s_k$.
• Static Betweenness. The static betweenness centrality of a link is the number of shortest paths between all node pairs that pass through the link. To compute the shortest paths, we define the distance of each link in the integrated network $G_W$ to be inversely proportional to its link weight in $G_W$. This choice follows the assumption that links with a higher weight in $G_W$ can spread information faster [12].
• Link Weight. The link weight $w_{jk}$ of a link $l(j, k)$ in $G_W$ is the total number of contacts between nodes $j$ and $k$ in the temporal network $G$ within the observation window $[0, T]$.
• Time-confined Link Weight $[t_s, t_e]$ refers to the number of contacts between the two end nodes that occur in $[t_s, t_e]$.
• Temporal Link Gravity. The link gravity between nodes $j$ and $k$ has been defined by regarding the node degree as the mass and the length $H_{jk}$ of the shortest path between $j$ and $k$ on the static network $G_W$ as the distance. The static gravity of node $j$ can be further defined as $\sum_{k \neq j} \frac{d_j d_k}{H_{jk}^2}$. The static node gravity has been used to select the seed node of an information diffusion process in order to maximize the prevalence [9], motivated by the fact that it contains both the neighborhood and the path information of a node. We generalize the gravity definition to temporal networks. The temporal link gravity of $l(j, k)$ is defined as $\frac{1}{2}\left(\frac{d_j d_k}{Q_{jk}^2} + \frac{d_j d_k}{Q_{kj}^2}\right)$, where $Q_{jk}$ is the number of links of the shortest path from $j$ to $k$ in all the directed spreading trees (see Fig. 1(b)). Specifically, the shortest directed path from $j$ to $k$ is computed in each spreading tree rooted at one seed node. We consider the shortest among these $N$ shortest directed paths, and its length (number of links) is $Q_{jk}$.

2.4 Link Blocking and Evaluation
We illustrate the link blocking procedure and the evaluation method used to measure the effectiveness of link blocking strategies. Given a temporal network, we specify the time window in which links are blocked as $[t_s, t_e]$. For each time window $[t_s, t_e]$, we count the number of node pairs $L^*_W(t_s, t_e)$ that have at least one contact within $[t_s, t_e]$ and block 5%, 10%, 20%, 40%, 60%, 80% and 100% of these $L^*_W(t_s, t_e)$ links, respectively, using each centrality metric. The number of links to be blocked is further expressed as the fraction $f$ of the number of links in the integrated network. For each centrality metric, we block the given fraction $f$ of links that have the highest values for the given period $[t_s, t_e]$, i.e., we remove all the contacts within $[t_s, t_e]$ associated with the selected links. We run the SI spreading model, setting each node as the seed node, on the original temporal network as well as on the temporal network after the link blocking. The average prevalence is the average over all possible seed nodes. The average prevalence of the SI diffusion at any time $t$ when the selected fraction
$f$ of links are blocked within $[t_s, t_e]$ and when no links are blocked is denoted as $\rho_f(t)$ and $\rho_o(t)$, respectively, where $t \in \{0, 1, \ldots, T\}$. The effectiveness of each centrality metric is evaluated by

$$\rho_D(f) = \frac{\sum_{t=1}^{T} \left(\rho_o(t) - \rho_f(t)\right)}{\sum_{t=1}^{T} \rho_o(t)}, \qquad (1)$$

which corresponds to the area below the original prevalence curve $\rho_o(t)$ and above the prevalence curve $\rho_f(t)$ with link blocking, normalized by the area under $\rho_o(t)$ (shown in Fig. 2(b)). A larger $\rho_D(f)$ implies a more effective link blocking strategy in suppressing the SI spreading.
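The blocking and evaluation procedure of this section can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: contacts are hypothetical `(j, k, t)` tuples with integer time steps in $[0, T]$, the time-confined link weight serves as the example metric, the blocked fraction is taken relative to the node pairs with at least one contact in the window, and rounding up as well as the tie-breaking rule are assumptions. All function names are introduced here for illustration.

```python
import math
from collections import Counter


def time_confined_link_weight(contacts, t_s, t_e):
    """Example metric: number of contacts per node pair inside [t_s, t_e]."""
    return Counter(frozenset((j, k)) for j, k, t in contacts if t_s <= t <= t_e)


def block_links(contacts, scores, t_s, t_e, fraction):
    """Remove, within [t_s, t_e], all contacts of the top-scoring links."""
    candidates = {frozenset((j, k)) for j, k, t in contacts if t_s <= t <= t_e}
    n_block = math.ceil(fraction * len(candidates))  # rounding up is an assumption
    # Highest score first; ties are broken by the sorted node labels.
    ranked = sorted(candidates, key=lambda link: (-scores.get(link, 0), sorted(link)))
    blocked = set(ranked[:n_block])
    return [(j, k, t) for j, k, t in contacts
            if not (t_s <= t <= t_e and frozenset((j, k)) in blocked)]


def si_prevalence(contacts, nodes, T):
    """Average SI prevalence (beta = 1) at t = 0..T, averaged over all seeds."""
    nodes = list(nodes)
    ordered = sorted(contacts, key=lambda c: c[2])
    curve = [0.0] * (T + 1)
    for seed in nodes:
        infected = {seed}
        idx = 0
        for t in range(T + 1):
            newly = set()
            # Contacts of step t are evaluated against the state before step t.
            while idx < len(ordered) and ordered[idx][2] <= t:
                j, k, _ = ordered[idx]
                if j in infected and k not in infected:
                    newly.add(k)
                elif k in infected and j not in infected:
                    newly.add(j)
                idx += 1
            infected |= newly
            curve[t] += len(infected) / len(nodes)
    return [value / len(nodes) for value in curve]


def rho_D(rho_o, rho_f):
    """Eq. (1): relative reduction of the prevalence accumulated over t = 1..T."""
    return sum(o - f for o, f in zip(rho_o[1:], rho_f[1:])) / sum(rho_o[1:])


# Hypothetical toy example: block every link that is active at t = 2.
contacts = [(1, 2, 1), (2, 3, 2)]
nodes = [1, 2, 3]
scores = time_confined_link_weight(contacts, 2, 2)
blocked_contacts = block_links(contacts, scores, 2, 2, fraction=1.0)
effect = rho_D(si_prevalence(contacts, nodes, 2),
               si_prevalence(blocked_contacts, nodes, 2))
```

Any of the centrality metrics of Sect. 2.3 could be passed as `scores`, as long as it maps a node pair (here a `frozenset`) to a number.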
3 Data Description
In this paper, we use eight temporal network datasets to investigate the link blocking problem in temporal networks. The datasets can be classified into two categories according to the contact type, i.e., proximity (Haggle [1], HighSchool 2012 (HS2012) [4], HighSchool2013 (HS2013) [10], Reality Mining (RM) [2], Hypertext 2009 (HT2009) [7], Primary School (PS) [17] and Infectious [7]) and electronic communication (Manufacturing Email (ME) [11]). The detailed topological features of these datasets are shown in Table 1, including the number of nodes, time steps and contacts, as well as the number of links, link density, average degree and average link weight in $G_W$. On each temporal network, we run the SI spreading process starting at every node as the seed. The average prevalence $\rho$ over time for each dataset is shown in Fig. 2(a), where the time step is normalized by the time span $T$ of the observation time window. The spreading speed, i.e., how fast the prevalence grows over time, differs considerably across networks. Two networks (Haggle and Infectious) show a slow and relatively linear increase in prevalence over time, due to the low link density of these two networks (Table 1). In the other networks, however, the prevalence increases dramatically at the early stage of the spreading process and converges to about 100%.
4 Empirical Results
In this section, we evaluate the effectiveness of using the aforementioned centrality metrics to select the links to be blocked within $[t_s, t_e]$. We consider diverse time windows $[t_s, t_e]$, as listed in Table 2. The intervention may be introduced at different diffusion phases. Hence, $t_s \in \{T_{10\%I}, T_{20\%I}, T_{30\%I}, T_{40\%I}, T_{50\%I}\}$, where $T_{10\%I}$ is the time when the average prevalence without blocking reaches $\rho = 10\%$ (see Fig. 2(a)). The duration of each time window is set to the time it takes the average prevalence to increase by the last 10% before $t_s$. If $t_s = T_{20\%I}$, the duration of the time window is $t_e - t_s = T_{20\%I} - T_{10\%I}$. If $t_s = T_{10\%I}$, the duration of the time window is $t_e - t_s = T_{10\%I} - T_{0\%I} = T_{10\%I}$. The number of links to block has also been chosen systematically. We take $[t_s = T_{10\%I}, t_e = 2T_{10\%I}]$ as an
Table 1. Basic properties of the empirical networks: the number of nodes ($N$), the original length of the observation time window ($T$, in number of steps), the total number of contacts ($|\mathcal{L}|$), the number of links ($|\mathcal{L}_W|$), the link density, the average node degree ($\langle d \rangle$) and the average link weight ($\langle w \rangle$) in $G_W$.

Network      N     T        |L|         |L_W|   Link density   <d>     <w>
Haggle       274   15,662   28,244      2,124   0.0568         15.50   13.30
HS2012       180   11,273   45,047      2,220   0.1378         24.67   20.29
HS2013       327   7,375    188,508     5,818   0.1092         35.58   32.40
HT2009       113   5,246    20,818      2,196   0.3470         38.87   9.48
Infectious   410   1,392    17,298      2,765   0.0330         13.49   6.26
ME           167   57,791   82,876      3,250   0.2345         38.92   25.50
PS           242   3,100    125,773     8,317   0.2852         68.74   15.12
RM           96    33,452   1,086,404   2,539   0.5568         52.90   427.89
Fig. 2. (a) Evolution of the average prevalence $\rho$ of the SI model ($\beta = 1$) for the eight empirical datasets. (b) An example of the area difference between the original spreading curve ($\rho_o$) and the curve ($\rho_f$) after blocking a fraction $f$ of links.
example to illustrate our findings. Figure 3 shows the effectiveness of each centrality metric as a function of $f$, the number of blocked links normalized by the number of links in the integrated network. The random selection of links from those that have at least one contact within $[t_s, t_e]$ is used as a baseline, in which each point is the average over 100 realizations. We find that four link centrality metrics always outperform the random selection: static betweenness, backbone weight, time-confined backbone weight $[t_s, t_e]$ and backbone betweenness. In Haggle and Infectious, the best performance comes from static betweenness, whereas the time-confined backbone weight $[t_s, t_e]$ outperforms the other metrics in the other six networks. Figure 2(a) shows that the prevalence grows slowly over time in Haggle and Infectious. Hence, the static betweenness seems to be a suitable link blocking strategy for networks with a slow spreading speed. However, for networks where information propagates fast, the
Table 2. The time windows $[t_s, t_e]$ chosen for link blocking, based on the average prevalence $\rho$ when $\beta = 1$. For instance, $T_{10\%I}$ represents the time when the prevalence reaches $\rho = 0.1$.

Network      [T_10%I, 2T_10%I]   [T_20%I, 2T_20%I − T_10%I]   [T_30%I, 2T_30%I − T_20%I]
Haggle       [3293, 6586]        [8416, 13539]                [9523, 10630]
HS2012       [403, 806]          [675, 947]                   [925, 1175]
HS2013       [50, 100]           [113, 176]                   [195, 277]
HT2009       [332, 664]          [377, 422]                   [439, 501]
Infectious   [410, 820]          [553, 696]                   [751, 949]
ME           [168, 336]          [285, 402]                   [461, 637]
PS           [136, 272]          [276, 416]                   [287, 298]
RM           [5, 10]             [34, 63]                     [111, 188]

Network      [T_40%I, 2T_40%I − T_30%I]   [T_50%I, 2T_50%I − T_40%I]
Haggle       [12440, 15357]               [12668, 12896]
HS2012       [1043, 1161]                 [1109, 1175]
HS2013       [236, 277]                   [369, 502]
HT2009       [568, 697]                   [790, 1012]
Infectious   [955, 1159]                  [1062, 1169]
ME           [731, 1001]                  [1387, 2043]
PS           [323, 359]                   [347, 371]
RM           [133, 155]                   [257, 381]
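The windows in Table 2 follow from the prevalence curve by simple arithmetic, which the short Python sketch below makes explicit. It is only an illustration: the helper names are hypothetical, the toy curve is invented, and reading "the time when the prevalence reaches a level" as the first time step at or above that level is an assumption.

```python
def first_time_reaching(rho, level):
    """Smallest time step t with rho[t] >= level, or None if never reached."""
    for t, value in enumerate(rho):
        if value >= level:
            return t
    return None


def blocking_window(t_prev, t_start):
    """Window [t_s, t_e] with t_s = t_start and duration t_start - t_prev."""
    return t_start, 2 * t_start - t_prev


# Toy prevalence curve (hypothetical values, not taken from the paper).
rho = [0.0, 0.05, 0.12, 0.18, 0.25, 0.40]
t10 = first_time_reaching(rho, 0.10)
t20 = first_time_reaching(rho, 0.20)
first_window = blocking_window(0, t10)     # T_0%I is taken to be 0
second_window = blocking_window(t10, t20)

# Consistency check against the Haggle row of Table 2.
haggle_second = blocking_window(3293, 8416)
```

For the Haggle values $T_{10\%I} = 3293$ and $T_{20\%I} = 8416$, the second window evaluates to $[8416, 13539]$, in agreement with the table.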
Fig. 3. The effectiveness $\rho_D(f)$ of each centrality metric in selecting the links to block within the time window $[T_{10\%I}, 2T_{10\%I}]$. The points on each curve correspond to blocking 5%, 10%, 20%, 40%, 60%, 80% and 100% of the $L^*_W(t_s = T_{10\%I}, t_e = 2T_{10\%I})$ links, respectively. The x-axis value $f$ is the number of blocked links normalized by the number of links in the integrated network.
Fig. 4. Average link blocking performance of each centrality metric over different numbers of blocked links, within different time windows and in different networks. The x-axis shows the time windows. For simplicity, we only show the starting time $t_s$ of each time window; the ending time of each window can be found in Table 2.
time-confined backbone weight $[t_s, t_e]$ is a good indicator for selecting the links to block. Furthermore, we find that the time-confined link weight $[t_s, t_e]$ outperforms the link weight, and the time-confined backbone weight $[t_s, t_e]$ outperforms the backbone weight. This implies that considering the temporal topological features of a link within the blocking time window is crucial for the link selection.

For a given time window $[t_s, t_e]$, we define the average performance of a centrality metric as the area under $\rho_D(f)$ over the whole range of $f$. The average performance is further normalized by the maximal average performance among all the centrality metrics for the given $[t_s, t_e]$. This average performance over diverse numbers of blocked links allows us to evaluate whether the performance of these centrality metrics is stable when the time window varies. Figure 4 verifies that our findings within $[t_s = T_{10\%I}, t_e = 2T_{10\%I}]$ from Fig. 3 generalize to the other time windows.
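The normalized average performance described above can be illustrated with a few lines of Python. This is a sketch under assumptions: the area under $\rho_D(f)$ is approximated with the trapezoidal rule over the sampled values of $f$ (the text does not state the integration rule), and the metric names and numbers in the example are purely hypothetical.

```python
def area_under(fs, values):
    """Trapezoidal area under a curve sampled at increasing values of f."""
    return sum((f1 - f0) * (v0 + v1) / 2
               for f0, f1, v0, v1 in zip(fs, fs[1:], values, values[1:]))


def normalized_average_performance(fs, rho_D_by_metric):
    """Area under rho_D(f) per metric, divided by the best area among metrics."""
    areas = {metric: area_under(fs, values)
             for metric, values in rho_D_by_metric.items()}
    best = max(areas.values())
    if best == 0:
        return {metric: 0.0 for metric in areas}
    return {metric: area / best for metric, area in areas.items()}


# Hypothetical rho_D(f) samples for two metrics in one time window.
fs = [0.0, 0.5, 1.0]
curves = {"metric_a": [0.0, 0.4, 0.6], "metric_b": [0.0, 0.2, 0.6]}
performance = normalized_average_performance(fs, curves)
```

By construction, the best metric in a given window obtains the value 1, so values from different windows and networks can be placed on a common scale.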
5 Conclusion
In this paper, we investigate how different link blocking strategies can suppress the information diffusion process on temporal networks. The spreading process is modeled by the SI model with infection probability $\beta = 1$. We propose diverse classes of link centrality metrics to capture different temporal topological properties of links, including the information-diffusion-backbone-based metrics and the static link centrality metrics. According to each metric, we select a given number of links that have the highest centrality values and block them for the given period $[t_s, t_e]$. The effect of such link blocking is evaluated via the extent to which the prevalence is suppressed over time. The empirical results from eight temporal network datasets show that four metrics outperform the random link selection, namely backbone weight, time-confined backbone weight $[t_s, t_e]$, backbone betweenness and static betweenness. An interesting finding is that the backbone-based metrics, especially the time-confined backbone weight $[t_s, t_e]$, perform well in networks where information becomes prevalent fast, whereas the static betweenness performs best in networks where information propagates slowly. These observations hold for different choices of the time window and of the number of links to be blocked. Our findings point out the importance of both temporal and static centrality metrics in determining links' role in a diffusion process. Moreover, the time-confined metrics, which explicitly explore the property/role of the contacts that occur within the time window in the global diffusion process, seem promising in identifying the links to block.

In this work, we select links based on centrality metrics that are derived from the temporal network information over the whole observation window $[0, T]$. Our study thus unravels the relation between links' or contacts' temporal topological properties and their role in a diffusion process.
A more challenging question is how to identify the links to block based only on the temporal network information observed so far, i.e., within $[0, t_s]$.

Acknowledgements. This work has been partially supported by the China Scholarship Council (CSC).
References

1. Chaintreau, A., Hui, P., Crowcroft, J., Diot, C., Gass, R., Scott, J.: Impact of human mobility on opportunistic forwarding algorithms. IEEE Trans. Mob. Comput. 6, 606–620 (2007)
2. Eagle, N., Pentland, A.S.: Reality mining: sensing complex social systems. Pers. Ubiquit. Comput. 10(4), 255–268 (2006)
3. Estrada, E.: Communicability in temporal networks. Phys. Rev. E 88(4), 042811 (2013)
4. Fournet, J., Barrat, A.: Contact patterns among high school students. PLoS One 9(9), e107878 (2014)
5. Grindrod, P., Parsons, M.C., Higham, D.J., Estrada, E.: Communicability across evolving networks. Phys. Rev. E 83(4), 046120 (2011)
6. Holme, P.: Modern temporal network theory: a colloquium. Eur. Phys. J. B 88(9), 234 (2015)
7. Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J.F., Van den Broeck, W.: What's in a crowd? Analysis of face-to-face behavioral networks. J. Theor. Biol. 271(1), 166–180 (2011)
8. Li, C., Li, Q., Van Mieghem, P., Stanley, H.E., Wang, H.: Correlation between centrality metrics and their application to the opinion model. Eur. Phys. J. B 88(3), 65 (2015)
9. Li, Z., Ren, T., Ma, X., Liu, S., Zhang, Y., Zhou, T.: Identifying influential spreaders by gravity model. Sci. Rep. 9(1), 8387 (2019)
10. Mastrandrea, R., Fournet, J., Barrat, A.: Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS One 10(9), e0136497 (2015)
11. Michalski, R., Palus, S., Kazienko, P.: Matching organizational structure and social network extracted from email communication. In: International Conference on Business Information Systems, pp. 197–206. Springer (2011)
12. Newman, M.E.: Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys. Rev. E 64(1), 016132 (2001)
13. Pastor-Satorras, R., Castellano, C., Van Mieghem, P., Vespignani, A.: Epidemic processes in complex networks. Rev. Mod. Phys. 87(3), 925 (2015)
14. Peters, L.J., Cai, J.J., Wang, H.: Characterizing temporal bipartite networks - sequential- versus cross-tasking. In: International Conference on Complex Networks and their Applications, pp. 28–39. Springer (2018)
15. Qu, C., Zhan, X., Wang, G., Wu, J., Zhang, Z.K.: Temporal information gathering process for node ranking in time-varying networks. Chaos: Interdisc. J. Nonlinear Sci. 29(3), 033116 (2019)
16. Rocha, L.E., Masuda, N.: Random walk centrality for temporal networks. New J. Phys. 16(6), 063023 (2014)
17. Stehlé, J., Voirin, N., Barrat, A., Cattuto, C., Isella, L., Pinton, J.F., Quaggiotto, M., Van den Broeck, W., Régis, C., Lina, B., et al.: High-resolution measurements of face-to-face contact patterns in a primary school. PLoS One 6(8), e23176 (2011)
18. Takaguchi, T., Sato, N., Yano, K., Masuda, N.: Importance of individual events in temporal networks. New J. Phys. 14(9), 093003 (2012)
19. Zhan, X.X., Hanjalic, A., Wang, H.: Information diffusion backbones in temporal networks. Sci. Rep. 9(1), 6798 (2019)
Using Connected Accounts to Enhance Information Spread in Social Networks

Alon Sela (1,2,3,6), Orit Cohen-Milo (4,5), Eugene Kagan (1,2), Moti Zwilling (6,7), and Irad Ben-Gal (2)

1 Industrial Engineering Department, Ariel University, Ariel 40700, Israel
2 Industrial Engineering Department, Tel Aviv University, Tel Aviv 39040, Israel
3 Physics Department, Bar Ilan University, Tel Aviv 5290002, Israel
4 Economics Department, Hebrew University, Jerusalem 9190501, Israel
5 Economics Department, Ben-Gurion University, Tel Aviv 8410501, Israel
6 Ariel Cyber Innovation Center (ACIC), Ariel University, Ariel 40700, Israel
7 Business and Management Department, Ariel University, Ariel 40700, Israel
Abstract. In this article, a new operation mode of social bots is presented. It consists of creating social bots in dense, highly connected substructures of the network, named Spreading Groups. Spreading Groups are groups of bots and human-managed accounts that operate in social networks. They are often used to bias the natural opinion spread and to promote and over-represent an agenda. These bot accounts are mixed with regular users while repeatedly echoing their agenda, disguised as real humans who simply deliver their own personal thoughts. This mixture makes the bots more difficult to detect and more influential. We show that if these connected substructures repeatedly echo a message within their group, such an operation mode spreads messages more efficiently than a random set of unconnected bots of a similar size. In particular, groups of bots were found to be as influential as groups of similar size constructed from the most influential users (e.g., those with the highest eigenvector centrality) in the social network. They were also found to be, on average, twice as influential as groups of random bots of similar size.

Keywords: Social networks · Information spread · Spreading Groups · Bot
1 Introduction

© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 459–468, 2020. https://doi.org/10.1007/9783030366872_38

Spreading Groups are groups of bots (automatic software agents) that operate in social networks in order to bias the natural opinion spread and over-represent an agenda. These bot accounts are mixed with regular users (i.e., human beings) while repeatedly echoing their agenda, disguised as accounts of real humans who simply deliver their own personal thoughts. Through this method, bots and groups of bots amplify the agenda of their creator and influence the opinions of real users in order to spread a defined ideology [1, 2]. The cyber spread of information can trigger and deeply influence political, economic and social changes [1, 3, 4]. Ideological political struggles, such as the
Boycott, Divestment, Sanctions (BDS) Movement [5–7], the Arab Spring [8, 9], civil ideological spread [10], and the Russian effort to spread harmful fake information [11], are all examples of political players who fiercely use the social media arena as their battlefield. The importance of social networks is based on their ability to quickly spread all types of views with few censoring limitations. Social network websites can easily spread information to different remote populations at an unprecedented speed. The ability of social networks to efficiently spread information creates a counter-effort to manipulate this spread. One such manipulation is the inflation of an agenda or a message by fake accounts and social bots. Social bots are defined as software agents that mimic human activity in a social network in an endless effort to spread a defined agenda. Many studies have found that social bots operate in large numbers inside each of the many existing social network platforms [12–16]. Social bots first collect followers; then, after a dormant phase, they begin to spread their agenda within their crowd.

There are currently several methods to detect individual bots. These methods are often based on machine learning classification algorithms that find differences in account features between bots and humans. While some studies claim that these algorithms can detect up to approximately 95% of bots [17], there is a growing consensus that between 5% [18] and 15% [19] of accounts are bots. Human classification, for example, can only detect about 20% of the bots [20]. These two numbers contradict each other to some degree. If algorithms can detect as much as 95% of bot accounts, one would expect the number of bots to decrease with time (assuming Twitter and other social platforms delete these accounts), or at least that only 5% of accounts would be considered bots.
Since this does not happen, and since bots have remained a great concern for some years, it might be concluded that many bots are not detectable through these artificial intelligence algorithms. The model proposed in this study, and its supporting data, might explain the above inconsistency. While Twitter and similar platforms find individual bots, it is likely that groups of bots can stay "under the radar" for longer periods, or even stay undetected permanently. This is because each bot can operate in a different period and thus be considered human. Furthermore, the model, as well as the data, shows that it is worthwhile for bot creators to connect these bots. As shown, the creation of connected structures of bots increases the spread of a message among human accounts by 3–28 times, depending on the initial conditions. We compared the spreading rates of Spreading Groups not only to those of random users, but also to those of groups of the most influential users in the network. We found that the connected spreading groups of bots reach a spreading rate that ranges between 75% and 108% (depending on the initial conditions) of that of a group of similar size constructed from the most influential spreaders in the network (those with the highest PageRank or Eigenvector Centrality). As will be shown in the results section, these results can easily be explained and are highly applicable, since the creation of interconnected bot accounts is an easy, realistic and technically feasible task.
The current work is an extension of our previous work, in which we proposed the operation mode of Spreading Groups of bots [2]. While the model in the previous work assumed only one seeding period before the natural spread through human accounts, in this work we further examine the previous models and allow information to be spread over longer durations. This assumption is related to the work of the Debot project [21], which detected the activity of bots through temporal correlations between their messages. Accounts that have a high temporal correlation of activity, and/or also spread related topics, are likely to be classified as bots. Thus, seeding messages in different periods of time breaks this temporal correlation and makes the detection much harder. Since different accounts from the spreading group are active in each cascade, accounts in a spreading group can stay undetected for longer durations. While connected bot structures that repeatedly echo messages within the group and do not operate in one single peak have a slower spread rate, their influence is still expected to be much stronger, mainly because they remain undetected for longer durations and thus collect more followers.
2 Model
The Spreading Group model consists of two stages. In the first stage, the spreading group is created within the network. In the second stage, different spreading strategies are inspected, in which seeds are allocated either all at once before the natural infection stage or gradually together with the natural infection. Note that we refer to the infections of non-bot accounts as natural infections. Similarly, intended infections by bots are referred to as seedings.
2.1 Creation of Spreading Groups in the Network
In order to create the required structure, we first construct a network of n/2 nodes through a Preferential Attachment process [22]. Then, we select a set of r% of these nodes and label them as the “Spreading Group”. We then apply the following link-adding algorithm d times to the spreading group. Last, we continue growing the remaining n/2 nodes by preferential attachment. The meta code to enrich the group is the following.
CONSTRUCTION OF SPREADING GROUP:
For each node in SpreadingGroup:
    Choose randomly node_1
    Choose randomly node_2
    if (node_1 != node_2) && no link (node_1, node_2):
        createlink (node_1, node_2)
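The construction above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two-node starting graph, the attachment count `m`, the use of `r` as a fraction rather than a percentage, and the choice to draw node_1 and node_2 from the Spreading Group itself are all hypothetical.

```python
import random

def grow_preferential(adj, n_target, m, rng):
    """Grow graph `adj` (node -> set of neighbours) until it has n_target nodes.

    Each new node attaches to up to m distinct existing nodes, chosen with
    probability proportional to their current degree.
    """
    # Every node appears in `pool` once per link end, so a uniform draw
    # from the pool is a degree-proportional draw over nodes.
    pool = [u for u, nbrs in adj.items() for _ in nbrs]
    while len(adj) < n_target:
        new = len(adj)  # nodes are labelled 0, 1, 2, ...
        targets = set()
        while len(targets) < min(m, len(adj)):
            targets.add(rng.choice(pool))
        adj[new] = set()
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            pool.extend((new, t))
    return adj

def enrich_group(adj, group, d, rng):
    """Repeat the meta code d times: one random link attempt per group node."""
    members = sorted(group)
    for _ in range(d):
        for _node in members:
            node_1 = rng.choice(members)  # assumption: drawn from the group
            node_2 = rng.choice(members)
            if node_1 != node_2 and node_2 not in adj[node_1]:
                adj[node_1].add(node_2)
                adj[node_2].add(node_1)

def build_network(n, r, d, m=2, seed=0):
    """n/2 nodes by preferential attachment, densify a group, grow to n."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}                       # assumed starting graph
    grow_preferential(adj, n // 2, m, rng)
    group_size = max(2, round(r * len(adj)))     # r given as a fraction
    group = set(rng.sample(sorted(adj), group_size))
    enrich_group(adj, group, d, rng)
    grow_preferential(adj, n, m, rng)            # remaining n/2 nodes
    return adj, group
```

For example, `build_network(n=200, r=0.1, d=3)` returns an adjacency dictionary with 200 nodes and a 10-node Spreading Group selected among the first 100 nodes.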
A. Sela et al.
This process forms a network that is basically a preferential attachment network but also contains a denser structure formed by the selected r% of nodes, i.e., the Spreading Group.
2.2 Four Seeding Strategies
We defined four main seeding strategies, and for each of them we inspected the outcome of gradual vs. instant seeding. The four strategies were:
1. Random Seeding – In this strategy, S seeds are randomly selected from the entire network and are seeded; i.e., their status changes to “infected”.
2. Group Seeding – In this strategy, S seeds are randomly selected solely from the Spreading Group and are seeded; i.e., their status is changed to “infected”.
3. PageRank Seeding – In this strategy, the nodes are ordered according to their PageRank scores and the S nodes with the highest scores are seeded.
4. Eigenvector Centrality Seeding – In this strategy, the nodes are ordered according to their Eigenvector Centrality scores and the S nodes with the highest scores are seeded.
Illustrations of these four seeding strategies are presented in Fig. 1, where the seeds are represented by a red target sign and the Spreading Group is represented by orange links (see A, B). In the lower panels, the positions of the seeded nodes in the plot were changed to make their relatively high ranks visible.
Fig. 1. Four seeding strategies of the Spreading Group model: random seeds – A (upper left), Spreading Group – B (upper right). Highest Eigenvector centrality seeds – C (lower left). Highest PageRank centrality seeds – D (lower right).
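The four strategies can be illustrated with a short, dependency-free sketch. It assumes an undirected graph stored as a dictionary of neighbour sets; the function names, the damping factor 0.85, the fixed number of power iterations, and the tie-breaking by node label are illustrative assumptions rather than details taken from the paper.

```python
import random

def pagerank(adj, damping=0.85, iters=100):
    """PageRank by power iteration on an undirected graph."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in adj}
        for v, nbrs in adj.items():
            if nbrs:
                share = damping * rank[v] / len(nbrs)
                for u in nbrs:
                    new[u] += share
            else:
                # An isolated node spreads its rank uniformly.
                for u in adj:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

def eigenvector_centrality(adj, iters=100):
    """Eigenvector centrality by power iteration.

    Iterating with A + I instead of A keeps the ranking of the leading
    eigenvector and avoids oscillation on bipartite graphs.
    """
    score = {v: 1.0 for v in adj}
    for _ in range(iters):
        new = {v: score[v] + sum(score[u] for u in adj[v]) for v in adj}
        norm = max(new.values())
        score = {v: x / norm for v, x in new.items()}
    return score

def select_seeds(adj, group, S, strategy, rng=None):
    """Return S seed nodes according to one of the four strategies."""
    rng = rng or random.Random(0)
    if strategy == "random":
        return rng.sample(sorted(adj), S)
    if strategy == "group":
        return rng.sample(sorted(group), S)
    if strategy == "pagerank":
        score = pagerank(adj)
    elif strategy == "eigenvector":
        score = eigenvector_centrality(adj)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Highest score first; ties broken by node label.
    return sorted(adj, key=lambda v: (-score[v], v))[:S]
```

On a star graph both centrality-based strategies pick the hub first, whereas group seeding is restricted to the labelled group regardless of rank.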
2.3 Temporal Examination of Seeding and Spread
While many works inspect the spread of information as a two-phase process, in which an action of seeding is followed by a process of infections, our previous works have shown the high importance of timing [23, 24, 26]. Timing acts in spreading processes as a double-edged sword. On the one hand, as one extends the time of the seeding process, potentially infected nodes forget, and as a result, the probability that a message is accepted through natural infection is reduced. On the other hand, as time passes, more nodes become infected, and thus a well-planned seeding action can boost the spread [23, 25]. Overall, we now believe that a process in which seeding occurs only at the initial stages of the spread, with no planning or examination of the infection cascades, can reach more nodes than processes in which the seeds are allocated gradually. Nevertheless, if the gradual seed allocations are well planned, we have shown that a 10%–20% improvement can be obtained by correct scheduling of seeds [23, 25]. As a consequence of these results, we also inspected seeding through a multi-step process. In such a case, we divided the S seeds into seeding batches such that $S = s \cdot T_s$, where $s$ denotes the seeds used at each period and $T_s \in \{t = 1, 2, 3, \ldots\}$ denotes the spreading periods.
2.4 Modeling the Retention Loss
An important part of the research on information spread focuses on the spread of news. As the word news implies, an item needs to be new in order to capture users' attention and be spread forward. In our model, we assumed that the human retention loss function follows an exponential decay. This assumption is based upon studies on memory by Ebbinghaus [25] and his followers, which are now over a century old. The probability of infection is initially set to $p_0$; then, at each step, for each infected node, this probability is reduced according to the exponential decay
$$p = p_0 e^{-ct} \tag{1}$$
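As a small worked illustration of Eq. (1), the sketch below evaluates the decaying infection probability and counts how many periods a node remains infectious before the probability drops below a lower bound. The function names and the sample values of p0, c and the bound are assumptions chosen only for the example.

```python
import math

def infection_probability(p0, c, t):
    """Eq. (1): probability of infection t steps after a node was infected."""
    return p0 * math.exp(-c * t)

def infectious_periods(p0, c, lower_bound):
    """Count the steps t = 0, 1, 2, ... for which p(t) >= lower_bound."""
    if c <= 0 or lower_bound <= 0:
        raise ValueError("c and lower_bound must be positive to terminate")
    t = 0
    while infection_probability(p0, c, t) >= lower_bound:
        t += 1
    return t
```

With p0 = 0.2, c = 0.5 and a lower bound of 0.05, the probability stays above the bound for t = 0, 1, 2 and falls below it at t = 3, so the node is infectious for three periods.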
Since the probability of infection p decays indefinitely without reaching zero, the model's runs are terminated as follows: when p falls below a lower bound, p < LB, the node's status changes to noninfectious. In addition, if no nodes are infected for more than 10 periods, the model terminates. These stopping criteria allow the simulation runs to terminate within reasonable run times.
2.5 Simulation Scheme
To conclude, the study monitors the spread of the different spreading groups while inspecting the different runtime parameters through an external layer. The different seeding strategies are generated according to the following scheme.
SIMULATION SCHEME:
Create a network of n/2 nodes   # By preferential attachment
Select r% of nodes -> {SG}   # Usually 0.001 […]

[…] d(B)(T) > d(A+B)(T). The paradox does not occur. However, when r = −0.9 (disassortative network), the average population gain d(T) of Game B is −0.11, and the average population gain d(T) of the randomized Game A + B is 0.18, so that d(A+B)(T) > 0 > d(B)(T). Thus, the strong paradox occurs. Based on these findings, the mechanism of the paradox is analyzed.
Fig. 2. Changes in average population gain over time
Fig. 3. Changes in proportion of favorable branch (Branch 1) over time
(1) Strong paradox in disassortative networks
(1) When Game B is played individually, the average population gain is negative. The reasons are as follows. At the group level, the system environment is benign at the beginning of the game: for Game B, equal capital makes every node play Branch 1, whose winning probability is p1 = 0.695, so it is the favorable branch. As small-degree nodes outnumber large-degree nodes, the opportunity for capital growth is greater for small-degree nodes than for large-degree nodes. The capital growth of small-degree nodes increases their chance of playing Branch 2 (the unfavorable branch, with p2 = 0.28) in the follow-up game; that is, it becomes more likely that the capital of a small-degree node exceeds the average capital of its neighbors. As a result, the capital gradually changes from growth to decline as the game progresses. In disassortative networks, the neighbors of large-degree nodes are mostly small-degree nodes, so the decrease in capital of small-degree nodes further increases the chances of large-degree nodes playing Branch 2; that is, it increases the chances that the capital of large-degree nodes exceeds the average capital of their neighbors. These trends for small-degree and large-degree nodes make the probability of playing Branch 1 (the favorable branch) decrease continuously until it is lower than the fair-game probability of 53.01%. (In a fair game, the mathematical expectation is E = 0, so the probability f of playing Branch 1 follows from the equation $E = f\,[1 \cdot p_1 + (-1)(1 - p_1)] + (1 - f)\,[1 \cdot p_2 + (-1)(1 - p_2)] = 0$.) Then the probability of playing Branch 1 […] Thus the average gain of the whole population decreases gradually. From
The Impact of Network Degree Correlation on Parrondo’s Paradox
Fig. 2, we can also notice that the average population gain first increases and then decreases gradually after the number of games reaches about $2 \times 10^4$. Furthermore, the average population gain decreases until it is negative. Figure 4(a) shows the relationship between the degree of nodes and the average gain of subgroups in a disassortative network. There is no obvious relationship between the node degree and the average gain of subgroups when Game B is played alone: the average gains of subgroups with different degrees are evenly distributed and concentrated in a narrow interval. The reasons are as follows. Firstly, in the disassortative network, the neighbors of large-degree nodes are mainly small-degree nodes, and the neighbors of small-degree nodes are mainly large-degree nodes. Secondly, when Branch 1 is the favorable branch, the capital of a node and that of its neighbors maintain a “synchronization of increase and decrease” (when the average capital of the neighbors is greater than or equal to the capital of the node, the node plays the favorable Branch 1 and its capital increases; when the capital of the node is greater than the average capital of the neighbors, the node plays the unfavorable Branch 2 and its capital decreases). Therefore, when Game B is played alone, the gains of large-degree and small-degree nodes rise and fall together to a certain extent, so there is no obvious relationship between node degree and the average gains of subgroups.
(2) When the randomized Game A + B is played, there is a positive correlation between the average gains of subgroups and node degree. The average gains of subgroups with degrees less than 38 are negative, while those of the other subgroups are positive. This result is due to the “agitating” role of Game A. Half of the time in the randomized Game A + B is used to play the zero-sum Game A, and the game relation is set to cooperation. In the cooperative pattern, the subject pays one unit to the object for free. In Game A, every node has the same probability of being selected as the subject, but large-degree nodes, which have more neighbors, have a greater probability of being selected as the object. Thus, when Game A is played, nodes with larger degrees have larger capital, and their capital increases more markedly. Capital therefore flows toward nodes with larger degrees. Consequently, there is a positive correlation between the average gains of subgroups and node degree, and a larger degree yields a higher average subgroup gain. Simultaneously, in the disassortative network, the neighbors of small-degree nodes are mainly large-degree nodes, so the large-degree nodes with large capital increase the chances of small-degree nodes playing the favorable branch of Game B. At the group level, the agitation of Game A increases the chances of nodes playing the favorable branch of Game B (Figure 3 shows that this probability increases from 50.20% when Game B is played individually to 55.45% when the randomized Game A + B is played). This makes the result of the randomized Game A + B positive (as shown in Fig. 2). Therefore, the ratcheting mechanism of Game B (there are some asymmetric branches in its structure) and the agitation effect of Game A are the keys to producing Parrondo's paradox. Besides, the disassortative network is conducive to the development of the ratcheting mechanism.
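The fair-game threshold of 53.01% used in this analysis can be checked numerically by solving E = 0 for f, which gives f = (1 − 2·p2) / (2·(p1 − p2)). The sketch below is only a verification aid; the function names are hypothetical.

```python
def expected_gain(f, p1, p2):
    """Expected gain per play when Branch 1 is played with probability f.

    A win is worth +1 and a loss -1, as in the expectation E in the text.
    """
    return f * (p1 - (1 - p1)) + (1 - f) * (p2 - (1 - p2))

def fair_branch1_probability(p1, p2):
    """Probability f of playing Branch 1 for which the expected gain is 0."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ for a unique solution")
    return (1 - 2 * p2) / (2 * (p1 - p2))

f_fair = fair_branch1_probability(0.695, 0.28)
print(f"{100 * f_fair:.2f}%")
```

For p1 = 0.695 and p2 = 0.28 this evaluates to 0.44 / 0.83, i.e. about 53.01%, in agreement with the value quoted in the text.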
Y. Ye et al.
Fig. 4. Changes in average subpopulation gain over node degree: (a) in disassortative network; (b) in assortative network
(2) No paradox in the assortative network
Figure 4(b) shows the relationship between the node degree and the average gain of subgroups in the assortative network. It can be seen that: (1) When Game B is played alone, there is no obvious relationship between node degree and the average gain of subgroups. The average gains of subgroups with different degrees are distributed evenly and concentrated in a narrow interval, and the average gain of the population is negative. The reason is similar to that for Game B in the disassortative network: the synchronization of increase and decrease between the capital of nodes and that of their neighbors. The average gain of the population is negative (shown in Fig. 2), which is related to the parameters (p1 = 0.695 and p2 = 0.28). Under this set of parameters, as shown in Fig. 3, when the system is stable, the probability of playing the favorable branch of Game B is 50.22%, which is lower than that in the fair game (53.01%). (2) When the randomized Game A + B is played, the relationship between the average gain of subgroups and node degree can be divided into three sections: for subgroups of large-degree nodes (degree greater than 50) and small-degree nodes (degree less than 25), the average gains are positively correlated with node degree, while for subgroups of medium-degree nodes (degree between 25 and 50), the average gains are not significantly correlated with node degree. The reasons for this difference are as follows. In Game A with the cooperative pattern, a node with a larger degree has larger capital, and its capital increases more markedly. Capital therefore flows to the large-degree nodes, and a positive correlation between degree and subgroup gain arises. The “synchronization of increase and decrease” of Game B leads to the convergence of average gains among subgroups with different degrees. Simultaneously, because node degrees in random networks are normally distributed (i.e., there are few large-degree and small-degree nodes and many medium-degree nodes), the average gains of subgroups with different degrees gradually move towards the middle, eventually producing the result shown in Fig. 4(b). Moreover, because the average gain of the medium-degree subpopulation, which accounts for a large proportion of nodes, is negative, the average gain of the population is negative (as shown in Fig. 2). At the group level, the agitation effect of Game A reduces the opportunity for nodes to play the favorable branch of Game B (Fig. 3 shows that the
probability decreases from 50.22% when Game B is played individually to 49.10% when the randomized Game A + B is played), so the average population gain of the randomized Game A + B is lower than that of Game B played alone. Thus, no paradox occurs.
4.2.2 The Competitive Pattern: p1 = 0.175, p2 = 0.85
For this set of parameters, when r = 0.9 (assortative network), the average population gain d(T) of Game B is 0.59, and the average population gain d(T) of the randomized Game A + B is 0.34, so that d(B)(T) > d(A+B)(T) > 0. The paradox does not occur. However, when r = −0.9 (disassortative network), the average population gain d(T) of Game B is −0.094, and the average population gain d(T) of the randomized Game A + B is 0.32, so that d(A+B)(T) > 0 > d(B)(T). Thus, the strong paradox occurs. Based on these findings, the mechanism of the paradox is analyzed. Figure 5 shows the change of the average population gain over time in the disassortative network. Figure 6 shows the change of the proportion of the favorable branch (Branch 2) over time in the disassortative network.
Fig. 5. Changes in average population gain over time
Fig. 6. Changes in proportion of favorable branch (Branch 2) over time
(1) Strong paradox in disassortative networks
Figure 7(a) shows the relationship between the node degree and the average gain of subgroups in a disassortative network. There is a positive relationship between the average gain of subgroups and the node degree when Game B is played individually: a larger node degree yields a greater average subgroup gain. The average gains of subgroups whose degrees are less than 40 are negative, while the average gains of the other subgroups are positive. The reason for this result is as follows. For p1 = 0.175 and p2 = 0.85, when the capital of a node is not greater than the average capital of its neighbors, the neighboring environment of the node is unfavorable (because the probability of winning in Branch 1 of Game B is p1 = 0.175); when the capital of a node is greater than the average capital of its neighbors, the neighboring environment of the node is favorable (because the probability of winning in Branch 2 of Game B is p2 = 0.85). At the beginning, all nodes have the same capital. Thus, each node plays the unfavorable branch (Branch 1) and its capital decreases, while the large-degree node has
more neighbors, so its neighborhood environment improves quickly; that is, the probability of the large-degree node playing the favorable branch increases quickly. In the disassortative network (r = −0.9), the neighbors of small-degree nodes are mainly large-degree nodes. The capital of large-degree nodes is large, which increases the probability of small-degree nodes playing the unfavorable branch and thus reduces the capital of small-degree nodes. This capital reduction further improves the neighborhood environment of the large-degree nodes connected to them (it lowers the average capital of the neighbors around large-degree nodes). During the game, the favorable neighboring environment for large-degree nodes and the unfavorable neighboring environment for small-degree nodes are continuously strengthened, which ultimately makes the average gains of larger-degree subgroups larger and those of smaller-degree subgroups smaller. At the group level, Fig. 6 shows that the probability of playing the favorable branch when Game B is played individually (47.58%) is lower than that in the fair Game B (48.15%). This makes the average population gain of Game B negative. (2) When the randomized Game A + B is played, there is no obvious correlation between the average gains of subgroups and node degree. The average gain of the population is positive. This result is due to the “agitating” role of Game A. Half of the time in the randomized Game A + B is used to play the zero-sum Game A, and the game relation is set to competition (the probability of winning is 0.5 for both subject and object). Because the probability of winning or losing is the same in the zero-sum game between large-degree and small-degree nodes, small-degree nodes have a chance to increase their capital, which disrupts the self-reinforcing process in which large-degree nodes form a favorable environment and small-degree nodes form an unfavorable one. At the group level, the agitation of Game A increases the chances of nodes playing the favorable branch of Game B (Fig. 6 shows that this probability increases from 47.58% when Game B is played individually to 49.86% when the randomized Game A + B is played). This makes the result of the randomized Game A + B positive. Therefore, the ratcheting mechanism of Game B (there are some asymmetric branches in its structure) and the
Fig. 7. Changes in average subpopulation gain over node degree
agitation effect of Game A are the keys to producing Parrondo's paradox. Besides, the disassortative network is conducive to the development of the ratcheting mechanism.
(2) No paradox in the assortative network
Figure 7(b) shows the relationship between the node degree and the average gain of subgroups in the assortative network. It can be found that there is no obvious relationship between node degree and average subgroup gain. The reasons are as follows. Firstly, in the assortative network, the neighbors of each node are mainly nodes with similar degrees. Secondly, when Branch 2 is the favorable branch, the capital of a node and that of its neighbors maintain the characteristic of “opposite increase and decrease” (when the average capital of the neighbors is larger than the capital of the node, the node plays the unfavorable branch and its capital decreases; when the average capital of the neighbors is smaller than the capital of the node, the node plays the favorable Branch 2 and its capital increases). Therefore, when Game B is played alone, the environment is the same for individuals with different degrees, there are no significant differences among the average subgroup gains for different degrees, and hence there is no obvious relationship between node degree and average subgroup gain. The average gain of the population is positive due to the parameters p1 = 0.175 and p2 = 0.85: under these parameters, the probability of playing the favorable Branch 2 of Game B (49.71%) is higher than that in the fair game (48.15%). For the randomized Game A + B, the competitive mode is used when Game A is played, and the probability of winning is 0.5 for both the subject and the object, which leads to more uniform gains among subgroups with different degrees. It can also be seen from the graph that the fluctuation is smaller. At the group level, the agitation of Game A increases the chances of nodes playing the favorable branch of Game B (the calculation results show that this probability increases from 49.71% when Game B is played individually to 49.91% when the randomized Game A + B is played). However, half of the time in the randomized Game A + B is used to play the zero-sum Game A, so the average population gain of the randomized Game A + B is less than that of Game B played individually, and there is no paradox.
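The branch rule of Game B and the two modes of Game A discussed in this section can be sketched as single-step update functions. This is an illustrative reading of the text, not the authors' code: the function names, the uniform choice of the object among the subject's neighbors, and the one-unit stake in the competitive mode are assumptions.

```python
import random

def play_game_b(capital, adj, node, p1, p2, rng):
    """One play of Game B by `node`; returns the branch that was played.

    Branch 1 (win probability p1) is played when the node's capital does not
    exceed the average capital of its neighbours, Branch 2 (p2) otherwise.
    A win adds one unit of capital and a loss removes one.
    """
    neighbours = adj[node]
    mean_capital = sum(capital[u] for u in neighbours) / len(neighbours)
    branch = 1 if capital[node] <= mean_capital else 2
    p_win = p1 if branch == 1 else p2
    capital[node] += 1 if rng.random() < p_win else -1
    return branch

def play_game_a(capital, adj, subject, mode, rng):
    """One play of the zero-sum Game A between a subject and a neighbour."""
    obj = rng.choice(sorted(adj[subject]))   # assumed: uniform over neighbours
    if mode == "cooperation":
        # The subject pays one unit to the object for free.
        capital[subject] -= 1
        capital[obj] += 1
    elif mode == "competition":
        # Each side wins with probability 0.5; the loser pays one unit.
        if rng.random() < 0.5:
            capital[subject] += 1
            capital[obj] -= 1
        else:
            capital[subject] -= 1
            capital[obj] += 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return obj
```

Under these assumptions, a randomized Game A + B step would call one of the two functions with probability 0.5 each; because Game A only moves capital between two nodes, the total capital changes only through Game B.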
5 Conclusions
(1) In this paper, a multi-agent Parrondo's model based on complex networks is established. Two different interactive behavior modes, competition and cooperation, are adopted in Game A. Furthermore, the gradual change of the parameter space of Parrondo's paradox from the assortative random network to the disassortative random network under the different behavioral modes is analyzed, as is the relationship between the parameter space and the degree correlation of networks. The simulation results show that: (1) different behavioral modes affect the parameter space in which the paradox occurs; (2) under the same behavior mode, a smaller degree correlation coefficient yields a larger parameter space in which the paradox occurs. (2) For the competitive and cooperative behaviors, one set of probability parameters is adopted for each to analyze the micro-causes of the strong paradox in
disassortative random networks in detail. Furthermore, the interaction mechanisms among the asymmetric structure of Game B, the “agitation” effect of Game A under different behavior modes, and the network topology are demonstrated.
Acknowledgments. This project was supported by the National Natural Science Foundation of China (Grant No. 11705002) and by the Ministry of Education, Humanities and Social Sciences research projects (15YJCZH210; 19YJAZH098; 18YJCZH102).
References
1. Harmer, G.P., Abbott, D.: Losing strategies can win by Parrondo's paradox. Nature 402(6764), 864–870 (1999)
2. Harmer, G.P., Abbott, D., Taylor, P.G., Parrondo, J.M.R.: Brownian ratchets and Parrondo's games. Chaos 11(3), 705–714 (2001)
3. Parrondo, J.M.R., Harmer, G.P., Abbott, D.: New paradoxical games based on Brownian ratchets. Phys. Rev. Lett. 85(24), 5226–5229 (2000)
4. Shu, J.J., Wang, Q.W.: Beyond Parrondo's paradox. Sci. Rep. 4(4244), 1–9 (2014)
5. Toral, R.: Capital redistribution brings wealth by Parrondo's paradox. Fluct. Noise Lett. 2(3), 305–311 (2002)
6. Ye, Y., Xie, N.G., Wang, L.G., Meng, R., Cen, Y.W.: Study of biotic evolutionary mechanisms based on the multi-agent Parrondo's games. Fluct. Noise Lett. 11(2), 352–364 (2012)
7. Mihailović, Z., Rajković, M.: Cooperative Parrondo's games on a two-dimensional lattice. Phys. A 365, 244–251 (2006)
8. Ye, Y., Xie, N.G., Wang, L.G., Wang, L., Cen, Y.W.: Cooperation and competition in history-dependent Parrondo's game on networks. Fluct. Noise Lett. 10(3), 323–336 (2011)
9. Wang, L.G., Xie, N.G., Xu, G., Wang, C., Chen, Y., Ye, Y.: Game-model research on coopetition behavior of Parrondo's paradox based on network. Fluct. Noise Lett. 10(1), 77–91 (2011)
10. Meyer, D.A., Blumer, H.: Quantum Parrondo games: biased and unbiased. Fluct. Noise Lett. 2(04), 257–262 (2002)
11. Miszczak, J.A., Pawela, L., Sladkowski, J.: General model for an entanglement-enhanced composed quantum game on a two-dimensional lattice. Fluct. Noise Lett. 13(02), 1450012 (2014)
12. Ye, Y., Cheong, K.H., Cen, Y.W., Xie, N.G.: Effects of behavioral patterns and network topology structures on Parrondo's paradox. Sci. Rep. 6, 37028 (2016)
13. Ye, Y., Wang, L., Xie, N.G.: Parrondo's games based on complex networks and the paradoxical effect. PLoS ONE 8(7), e67924 (2013)
14. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett. 89(20), 208701 (2002)
15. Xulvi-Brunet, R., Sokolov, I.M.: Reshuffling scale-free networks: from random to assortative. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 70(6 Pt 2), 066102 (2004)
16. Klemm, K., Eguíluz, V.M.: Growing scale-free networks with small-world behavior. Phys. Rev. E 65(5), 057102 (2002)
Analysis of Diversity and Dynamics in Coevolution of Cooperation in Social Networking Services
Yutaro Miura¹, Fujio Toriumi², and Toshiharu Sugawara¹
¹ Waseda University, Tokyo 169-8555, Japan
² The University of Tokyo, Tokyo 113-8656, Japan
Abstract. How users of social networking services (SNSs) dynamically identify their own reasonable strategies was investigated by applying a coevolutionary algorithm to an agent-based game-theoretic model of SNSs. We often use SNSs such as Twitter, Facebook, and Instagram, but we can also free-ride without providing any content, because providing information incurs costs to us. Numerous studies on evolutionary network analysis have been conducted to investigate why people continue to post articles. In these studies, genetic algorithms (GAs) have often been used to find reasonable strategies for SNS users. Although the evolved strategies in these studies are usually common to all users, the appropriate strategies must be diverse because they are used in various circumstances. In this paper, we use a coevolutionary algorithm, the multiple-world GA (MWGA), to analyze the various strategies that individual agents acquire through coevolution with their neighboring agents. We also present the fitness value we obtained, which was higher than those obtained using the conventional GA. Finally, we show that the MWGA enables us to observe the dynamic processes of coevolution, i.e., why agents reach their own strategies in different circumstances. This analysis helps in understanding various users' behaviors through mutual interactions with neighboring users.
Keywords: Social networking services · Public goods game · Coevolutionary dynamics · Complex networks
1 Introduction
Social networking services (SNSs), such as Twitter, Facebook, Instagram and LinkedIn, have become an indispensable part of people's lives. They are virtual places for people's communications in groups of close friends, communities, companies, and organizations, and are utilized for various activities such as advertisement, marketing, and political campaigns as well as private and local communication. SNSs are maintained by the massive content generated by users' voluntary participation; therefore, an SNS will disappear if too little content is posted. Conversely, users can become free riders (or lurkers) who only read content without posting articles, because posting imposes some costs on users. Understanding why some users voluntarily post so much content while others behave as free riders is a crucial issue and helps keep SNSs thriving.
T. Sugawara: This work was partly supported by KAKENHI (17KT0044, 19H02376, 18H03498).
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 495–506, 2020. https://doi.org/10.1007/9783030366872_41
Many studies have analyzed the factors that lead users to become free riders on SNSs [5,7,10]. Sun et al. [10] clarified the factors that make users stop generating content in online communities using a motivation model that determines online behavior. The researchers insisted that the factors causing free riding are, for example, low-quality messages, low response rates, and long response delays; these are inevitable, given the different levels and types of participants. They also claimed that free riders could be encouraged to participate by introducing external stimuli (rewards) and new norms. Some studies analyzed the network features of failed SNSs [5,7].
A number of studies tried to identify incentives for voluntary participation on SNSs using game-theoretic simulation models. For example, Toriumi et al. [11] proposed the meta-rewards game to model users' behaviors on SNSs using evolutionary game theory. The meta-rewards game is an extended model of Axelrod's meta-norms game [2], which is a kind of public-goods game and is used to see what could prompt cooperation in social dilemma situations. Their experimental results indicated that cooperation on SNSs emerged by introducing the rewards given for comment returns (i.e., meta rewards). Hirahara et al. [6] proposed an SNS-norms game by adding the structural features of SNSs to the meta-rewards game; for example, users who respond to comments on articles are likely to be users who originally posted the articles. Then, they conducted agent-based simulations on artificial complex networks and a Facebook ego network to identify the optimal behavior in the game.
Although the evolutionary algorithms used in network analysis, including these studies [6,11], have attempted to search for a common optimal or better strategy for the entire network, two issues must be considered. First, applying genetic algorithms (GAs) to find an optimal solution (strategy) that is common to all agents does not fit actual SNSs; for example, the strategy learned by a hub agent (such as a celebrity) with many followers is not necessarily advantageous for general (non-hub) agents. Of course, all users may be homogeneous in the sense that they attempt to increase the rewards they receive, but they are in diverse circumstances because their numbers of followers/friends are quite different; thus, they have their own behavioral strategies in an SNS, which emerge along with the strategies of surrounding users. Second, the learned strategy is only the final result of long-term interactions, and it ignores the process of learning; i.e., the strategies learned by neighboring agents affect each other, and thus the process of learning must be more complicated but worth understanding.
This discussion motivated us to introduce coevolution, a phenomenon in which different species affect each other and evolve together, into evolutionary network analysis based on game theory [3]. Along this line, Miura et al. [9] proposed the multiple-world genetic algorithm (MWGA), a coevolutionary genetic algorithm for (complex) networks that maintains the diversity of nodes. In the MWGA, the network including all nodes (users) is duplicated into several networks, in which nodes in the same position have different strategies and all users interact with diverse neighbors in different network worlds. Then, all agents simultaneously learn their own reasonable strategies by considering the strategies examined in all duplicated networks. In this study, we introduced the MWGA into the agent-based simulation using the SNS-norms game and analyzed the diverse strategies in various circumstances, addressing the first issue. We also found that the MWGA makes it possible to observe the course of strategy learning in accordance with the strategies learned by neighboring agents, so we analyzed the process of simultaneous learning, addressing the second issue, and identified why agents arrived at their converged strategies.
2 Modeling Social Networking Services
2.1 Agents
We briefly describe the model, including the agents and behaviors of SNSs, based on the networked evolutionary game. This model is identical to that proposed by Hirahara et al. [6]. Let G = (V, E) be a graph representing the underlying network, where V = {v1, . . . , vn} is the set of nodes that correspond to agents representing users on an SNS and E is the set of links between agents. The links in E represent the interaction structure among agents. An agent has two behavioral strategies, cooperation and defection. Cooperation involves contributing to the SNS, i.e., posting articles and comments, and defection involves free-riding, i.e., obtaining benefits by just reading posted articles and comments. Agents have two learning parameters that specify their behavioral strategies: the probability Bi of posting a new article and the probability Li of posting a comment on a posted article or on a posted comment. Bi and Li are each encoded by a three-bit gene (in total, a 6-bit gene) for the MWGA; thus, each takes one of the discrete values 0/7, 1/7, . . . , 7/7.
2.2 SNS-Norms Game
A round of the SNS-norms game proceeds as follows (see also Fig. 1). For agent vi ∈ V, the initial values of Bi and Li are randomly determined before the game starts. In the t-th round, a value 0 ≤ Si,t ≤ 1, which represents the amount of fun or interest in the content of the article that vi intends to post, is randomly selected for vi. If Si,t ≥ 1 − Bi, vi posts an article and receives a negative reward as cost F < 0. A neighboring agent vj that reads the article receives M > 0 as a reward. After that, vj posts a comment on the read article with probability Lj and receives cost C < 0. Then, vi gets a reward R > 0 because it received the comment on the article it posted. When vi receives a comment from vj, vi gives a comment return with probability Li. In that case, vi receives a negative reward C′ < 0 and vj receives a reward R′ > 0 as a meta-reward. Table 1 shows the agent's total rewards for each action, where Nc(a) is the number of comments on article a posted by neighboring agents and Ncc(a) is the number of comment returns given on the received comments. Nm(c) is the number of comment returns on comment c, which is 0 or 1 in the SNS-norms game. Note that the reward for posting an article follows the Weber–Fechner law [4]; it increases logarithmically with the number of comments received [8]. The SNS-norms game is an evolutionary game aimed at finding the agents' reasonable strategies, specified by Bi and Li, that gain higher rewards through interactions.

Fig. 1. SNS-norms game.

Fig. 2. Conceptual structure of multiple-world GA.

Table 1. Sum of rewards (costs) for each action.

| Type of action | Cooperate | Defect |
|---|---|---|
| Article (post article a) | F + R × log_e(Nc(a) + 1) + C′ × Ncc(a) | 0 |
| Comment (post comment c) | M + C + R′ × log_e(Nm(c) + 1) | M |
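As a rough illustration of Table 1, the following Python sketch computes the total reward of an article poster and of a reader, using the parameter values listed in Table 2. The function and constant names are hypothetical and do not come from the original study; `C_RET` and `R_RET` are assumed labels for the cost and reward attached to a comment return.

```python
import math

# Values from Table 2. C_RET and R_RET are hypothetical names for the
# cost and reward of a comment return (the meta-reward step).
F, M = -3.0, 1.0
C, C_RET = -2.0, -2.0
R, R_RET = 9.0, 9.0


def article_reward(n_comments, n_comment_returns):
    """Total reward of an agent that posted one article (Table 1, row 1)."""
    return F + R * math.log(n_comments + 1) + C_RET * n_comment_returns


def comment_reward(posted_comment, got_comment_return=False):
    """Total reward of an agent that read an article (Table 1, row 2)."""
    if not posted_comment:
        return M  # defect: read the article without commenting
    n_returns = 1 if got_comment_return else 0  # N_m(c) is 0 or 1
    return M + C + R_RET * math.log(n_returns + 1)


# An article that receives 3 comments, one of which is answered:
print(article_reward(3, 1))
# A reader who comments and receives a comment return:
print(comment_reward(True, True))
```

Because the reward grows with the logarithm of the number of comments, each additional comment adds less than the previous one, which is the diminishing effect described in the text.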
2.3 Structure of Network
Real-world networks observed in human society have three complex network properties: a high clustering coefficient, the small-world property, and the scale-free property [1]. We generate an artificial agent network following the connecting nearest neighbor model (CNN model) [12], which has all three properties. The characteristics of networks generated by the CNN model (CNN networks) are determined by the parameter u, the probability of turning a potential edge into a real one, where the set of potential edges is defined as {(vi, vj) ∉ E | vi, vj ∈ V and ∃vk ∈ V s.t. (vi, vk), (vj, vk) ∈ E}.
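A minimal sketch of how a CNN network could be grown is given below. It assumes the usual formulation of the model in [12], in which, at each step, a randomly chosen potential edge becomes a real edge with probability u and, otherwise, a new node is attached to a randomly chosen existing node. The seed graph (a single edge), the function name `cnn_network`, and the bookkeeping that keeps the potential edges equal to the pairs of unconnected nodes with a common neighbor are assumptions made for illustration.

```python
import random


def cnn_network(n_nodes, u, seed=0):
    """Grow an undirected network with a connecting-nearest-neighbor rule."""
    if not 0 <= u < 1:
        raise ValueError("u must satisfy 0 <= u < 1 so that nodes keep arriving")
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}   # seed graph: one edge (assumption)
    potential = set()        # potential edges stored as (small, large) pairs

    def add_edge(a, b):
        adj[a].add(b)
        adj[b].add(a)
        potential.discard((min(a, b), max(a, b)))
        # Each neighbor k of one endpoint now shares a neighbor with the
        # other endpoint, so (endpoint, k) becomes a potential edge.
        for x, y in ((a, b), (b, a)):
            for k in adj[y]:
                if k != x and k not in adj[x]:
                    potential.add((min(x, k), max(x, k)))

    while len(adj) < n_nodes:
        if rng.random() < u:
            if potential:  # turn one random potential edge into a real edge
                add_edge(*rng.choice(sorted(potential)))
        else:              # add a new node linked to a random existing node
            new = len(adj)
            adj[new] = set()
            add_edge(new, rng.randrange(new))
    return adj, potential


adj, potential = cnn_network(50, 0.9)
print(len(adj), sum(len(nbrs) for nbrs in adj.values()) // 2)
```

This simple version re-sorts the potential edges at every step, so it favors readability over speed; a faster variant would keep the potential edges in an indexable structure.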
3 Coevolutionary Game
We utilized the MWGA [9] as the coevolutionary computation to find diverse reasonable strategies for individual agents. We briefly explain it here. Existing studies use genetic algorithms in which agents inherit the better strategies of neighboring agents for the next generation, under the assumption that neighbors' excellent strategies are worth mimicking. However, in actual social networks that have the three properties of complex networks, agents are in diverse surroundings; thus, their strategies must also differ.
3.1 Multiple-World GA
In the MWGA, we make W copies of the master network $G = (V, E)$, each of which is denoted by $G^l = (V^l, E^l)$ for $1 \le l \le W$, where W is a positive integer called the multiple-world number. We represent the set of agents in the l-th world as $V^l = \{v_1^l, \ldots, v_n^l\}$ ($\cong V$). An example of a multiple-world GA structure is illustrated in Fig. 2. The set of copied agents of $v_i \in V$ is denoted by $A_i = \{v_i^1, \ldots, v_i^W\}$, and the agents in $A_i$ occupy the same position in all worlds. Initial genes are randomly given to all agents. Therefore, the agents in $A_i$ have different genes and behave differently, even though they are copies of a single agent $v_i$. For a neighboring agent $v_j$ of $v_i$, the agents in $A_j$ also have different strategies, so the agents in $A_i$ experience the game differently and receive different rewards as a result. In the following experiments on the SNS-norms game, we assumed that all agents had four chances to post articles during a generation and then simultaneously entered the coevolution phase consisting of three operators: (parent) selection, crossover, and mutation. Then, with the new genes, all agents entered the next generation and repeated this process a certain number of times.
3.2 Genetic Operations
In the parent selection phase, agent $v_i^l \in \bigcup_{l=1}^{W} V^l$ selects two agents as parents from $A_i$ by following the probability distribution $\{P_i^l\}_{l=1}^{W}$,

$$P_i^l = \frac{\left(f(v_i^l) - f_{\min}\right)^2}{\sum_{v \in A_i} \left(f(v) - f_{\min}\right)^2}, \tag{1}$$

where $f(v_i^l)$ is the fitness function, the value of which is the sum of the rewards received during the SNS-norms games in the current generation (see Table 1), and $f_{\min} = \min_{v \in A_i} f(v)$. The genes for the next generation are generated from the selected parents by applying uniform crossover and flip-bit mutation with a probability of 0.005 for each bit. For example, if W = 30, the gene of approximately one agent in $A_i$ mutates in every generation (because 30 × 6 × 0.005 = 0.9). Then, the agent with the generated gene is placed as $v_i^l$ in the next generation.
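The genetic operations can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the bit order inside the 6-bit gene, sampling the two parents with replacement, and the uniform fallback used when all copies have the same fitness are assumptions, and all function names are hypothetical.

```python
import random


def decode(gene):
    """Map a 6-bit gene to (B, L), each in {0/7, 1/7, ..., 7/7}.

    Assumption: the first three bits encode B and the last three encode L.
    """
    b = int("".join(map(str, gene[:3])), 2) / 7
    l = int("".join(map(str, gene[3:])), 2) / 7
    return b, l


def selection_probs(fitness):
    """Eq. (1): squared fitness above the minimum within A_i, normalized."""
    f_min = min(fitness)
    weights = [(f - f_min) ** 2 for f in fitness]
    total = sum(weights)
    if total == 0:  # all copies equally fit: uniform choice (assumption)
        return [1 / len(fitness)] * len(fitness)
    return [w / total for w in weights]


def next_gene(genes, fitness, rng, mutation_rate=0.005):
    """Create one child gene from the W copies of an agent."""
    probs = selection_probs(fitness)
    parent1, parent2 = rng.choices(genes, weights=probs, k=2)
    # Uniform crossover: each bit comes from either parent with equal chance.
    child = [rng.choice(pair) for pair in zip(parent1, parent2)]
    # Flip-bit mutation, applied independently to every bit.
    return [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]


rng = random.Random(0)
genes = [[rng.randint(0, 1) for _ in range(6)] for _ in range(30)]  # W = 30
fitness = [rng.uniform(0, 100) for _ in genes]
print(decode(next_gene(genes, fitness, rng)))
print(30 * 6 * 0.005)  # expected number of flipped bits per generation in A_i
```

Note that Eq. (1) gives the least fit copy a selection probability of zero, so the weakest strategy in $A_i$ is never chosen as a parent.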
Table 2. Parameter values in experiments.

| Parameter | Description | Value |
|---|---|---|
| W | Multiple-world number | 30 |
| | Mutation rate | 0.005 |
| F | Cost of posting article | −3.0 |
| M | Reward for reading article | 1.0 |
| C and C′ | Costs of comment and comment return | −2.0 |
| R and R′ | Rewards for receiving comment and comment return | 9.0 |

Table 3. Parameters and characteristics of the CNN networks.

| Parameter | Description | Value |
|---|---|---|
| N | Number of agents | 1000 |
| u | Probability of changing a potential edge to an edge | 0.9 |
| | Average degree (average number of friends) | 19.8 |
| | Average clustering coefficient | 0.468 |
| | Average characteristic path length | 3.31 |
| | Average power-law exponent | −1.128 |

4 Experiments and Discussion
We simulated SNS-norms games on a CNN network and investigated how agents individually learn (coevolve) their strategies (the probability B of posting an article and the probability L of posting a comment) through interactions with their neighboring agents. Table 2 shows the parameters used in the experiments. The rewards and costs were defined on the basis of the experiments done by Axelrod [2]. The parameters of the CNN networks are listed in Table 3. These parameters also follow existing studies [6,11] so that the effect of the MWGA can be assessed appropriately. We ran the experiments using the MWGA for 2000 generations. The values of Bi and Li for agent vi were defined as the averages over the W agents in Ai.
4.1 Comparison of Learned Strategies and Fitness Values
In our experiment, we analyzed the effects of evolutionary computation using the MWGA on strategy learning in the SNS-norms game and compared them with those of the conventional GA. Figure 3 shows the transition of the average probabilities of posting an article (B) and a comment (L) over all agents in the SNS-norms game. Figure 4 shows the transition of the average fitness value of all agents. Figure 3 indicates two different results; i.e., B and L converged to around 0.5 and 0.4 when using the MWGA, while they converged to 0.05 and 1.0 when using the conventional GA. As shown in Fig. 4, the average fitness value converged to approximately 10 using the conventional GA but to approximately 80 when using the MWGA. The differences in these results suggest that the strategies that evolved with the MWGA yielded much higher fitness values. To confirm that agents gained higher fitness values with the MWGA, we calculated the improvement in the fitness values of individual agents, i.e., the fitness value with the MWGA minus that with the conventional GA. The results are plotted in Fig. 5, where the vertical axis shows the average of this difference between the 1500th and 2000th generations and the horizontal axis shows the degrees of the agents. Figure 5 indicates that almost all the agents (more than 99%), especially those with higher degrees, gained a higher fitness value. Only a few agents lost fitness value, which we attribute to mutation. These results indicate that the agents could find their own better strategies.

Fig. 3. Probability of posting.

Fig. 4. Fitness value.

Fig. 5. Improvement in fitness value.
4.2 Distribution of Strategies and Dynamics of Learning Process
Next, we investigated the distribution of the strategies learned by agents and their learning dynamics in the SNS-norms game. First, we plotted the tuple of Bi, Li, and the agent's degree (which corresponds to the number of friends) in Fig. 6(a) (when using the conventional GA) and in Fig. 6(b) (when using the MWGA).
Fig. 6. Distribution of Bi and Li: (a) conventional GA; (b) multiple-world GA.
Figure 6(a) shows that agents learned similar strategies regardless of their degrees when using the conventional GA. However, agents learned various strategies when using the MWGA. For example, agents with higher degrees were likely to post articles but not comments (article writers). Normal agents (with relatively low degrees) could be divided into three groups: a group in which agents actively post articles and comments, one in which agents post articles but few comments, and one in which free riders have Bi and Li of almost zero. We think that these results indicate that the agents could learn diverse strategies depending on their surroundings.

We then focused on the dynamics of strategy learning by agents with the MWGA. Figure 7 shows the distribution of B and L at selected generations of coevolution in the SNS-norms game. In the 0th generation, because each agent was given a strategy at random, the average article posting rate B and average comment posting rate L followed a normal distribution centered at Bi = 0.5 and Li = 0.5 (Fig. 7(a)). In the 10th generation (Fig. 7(b)), we can see that all agents learned a high comment posting rate. Moreover, the agents with high degrees (hub agents) learned low article posting rates, while the agents with low degrees (non-hub agents) learned higher values of both posting rates, B and L. Then, agents with low degrees gradually learned that a lower L was a better strategy for gaining high fitness values. After that, the agents with both high and low degrees gradually turned to the strategy of posting articles, so their positions in the graph gradually moved to the right corner (i.e., L decreased). Figure 7(c) shows the middle of this movement process. At the same time, a few agents with low degrees became free riders with B and L of almost 0.0. We found that these free riders were mainly connected with high-degree (hub) agents because the hub agents became likely to post articles but hardly ever posted comments, and thus, such free riders could get sufficient rewards only by reading articles from the hub agents. After the aforementioned transition, the hub agents who posted only comments could not receive comment returns, so they stopped posting comments. However, this just resulted in lower rewards, and they began to find that posting articles could gain more rewards. Thus, hub agents moved from the left corner in Fig. 7(c) to the right corner shown in Fig. 7(d), which is the distribution in the 500th generation and is almost identical to the final state shown in Fig. 6(b). Therefore, hub agents ended up with high B and low L, meaning they became article writers and hardly posted comments. Along with this dynamic change in agents' behaviors, the number of free riders increased because the free riders could read many articles without doing anything once the hub agents posted articles.

Fig. 7. Dynamics in coevolved strategies: (a) 0th generation; (b) 10th generation; (c) 100th generation; (d) 500th generation.

4.3 Discussion
Our experimental results indicate that the diverse strategies of individual agents and the dynamic coevolution process during strategy learning can be observed by simulating the SNS-norms game with the MWGA. In this section, we discuss why agents in various circumstances adopted their own strategies. First, we focus on the agents with high degrees (hub agents). When using the conventional GA, the hub agents behaved in the end as comment writers to gain higher fitness, as shown in Fig. 6(a). Hub agents could generally gain rewards more easily because they had many friends who might post articles and comments even though the friends' B and L were low. However, this situation does not necessarily apply to non-hub agents, although they are likely to learn from hub agents when using the conventional GA because hub agents usually receive more rewards (and thus have higher fitness values). This results in a low article posting rate B and in agents not gaining many rewards. In contrast, Fig. 6(b) shows that hub agents behave as article writers with the MWGA, which is quite different from the result with the conventional GA. They could gain enough rewards by receiving many comments because many of their neighboring agents have high L. If we consider actual SNSs in the real world, celebrities and popular users post articles to get feedback and agreement but hardly ever respond to posts from normal users. We believe that this result is in good agreement with actual activities in an SNS.

Second, we tried to understand why some agents became free riders in the end. For this purpose, we focused on a certain free rider vf ∈ V and its neighbors and investigated how their strategies changed across generations. Figure 8 shows the transition of B, L, and the fitness values of the agent of interest vf (Fig. 8(a)) and its neighbors. Note that agent vf had five neighbors: one non-hub agent that, apart from vf, was connected only with hub agents (in this sense, the non-hub agent and vf were in similar circumstances), and the remaining neighbors were hub agents with an average degree of 150. We only show two hub agents (Fig. 8(b)–(c)) and one non-hub agent (Fig. 8(d)) because all hub agents connected to vf behaved in the same way. Looking at the hub agents (Fig. 8(b)–(c)), their fitness values gradually declined (but were still higher than those of non-hub agents), although their values of B and L were almost unchanged until the 130th generation. This is because their neighboring non-hub agents, including the agent of interest (Fig. 8(a)), learned low L (posted fewer comments) and high B to gain high fitness. After that, the hub agents changed strategies to behave as article writers, and vf could receive only a few comments even if it posted articles. Thus, vf gradually stopped posting articles. Then, it found that it could receive sufficient rewards just by reading articles from its neighboring hub agents without posting articles or comments. Therefore, we can say that normal non-hub agents connected mainly with hub agents are likely to become free riders. Applied to an actual SNS, this phenomenon suggests that users who tend to follow only hub accounts that just post articles, such as news broadcasters and celebrities, easily become free riders who do not post any content. However, Fig. 6(b) also shows that many active non-hub agents posted both articles and comments. We found that these agents were connected not only with hub agents who posted articles but also with quite a few non-hub friends; they behaved actively to maintain the activity among these friends. These phenomena seem to match the behavior in actual SNSs, but such realistic behaviors and diverse strategies could not evolve when using the conventional GA.

In Fig. 4, we can see that the fitness value fluctuated in the process of learning by the MWGA; it increased until the 20th generation, decreased to around 85 by the 100th generation, increased to 105 by the 200th generation, and then gradually decreased to around 80. In very early generations, all agents tried to post articles and comments. However, because all the agents were rational and looked for the strategies that brought more rewards, we think that a dilemma situation appeared, i.e., these rational behaviors decreased the fitness values of neighboring agents, who also looked for strategies to gain more rewards. This seems to be the reason for the first decrease. Then, after the 100th generation, the hub agents gradually changed their behaviors to become article writers. This resulted in the temporary increase around the 200th generation. Next, the dilemma situation appeared again, the total amount of rewards decreased, and the number of free riders increased. This explanation is also supported by the data in Figs. 3 and 8.

Fig. 8. Strategy and fitness dynamics of the focused agent and its neighbors: (a) agent of interest vf; (b) hub agent A; (c) hub agent B; (d) non-hub agent C.
5 Conclusion
We investigated how users of social networking services (SNSs) dynamically identify their own reasonable strategies using the SNS-norms game, which is a game-theoretic model of SNSs, with the MWGA, a coevolutionary algorithm with which diverse strategies emerge depending on the circumstances of agents. Through comparison experiments with existing studies that use the conventional GA, we found that the strategies of agents learned with the MWGA had high fitness values. In addition, we could observe the dynamic evolution process of individual agents. This feature of the MWGA is quite helpful for understanding the phenomena occurring in SNSs and the reasons behind them. Such phenomena and reasons are quite complicated because agent strategies are mutually influenced by the strategies selected by neighboring agents. On the basis of our experimental results using a simulation with the MWGA, we could reproduce a plausible model of dynamic behaviors that can explain well the process of behavior selection in actual SNSs. We plan to clarify which network characteristics, including those of the neighboring agents, determine the agents' strategies in the coevolutionary SNS model. The findings can be applied to friend recommendation systems on SNSs to increase the activity level of free riders.
References
1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002). https://doi.org/10.1103/RevModPhys.74.47
2. Axelrod, R.: An evolutionary approach to norms. Am. Polit. Sci. Rev. 80(4), 1095–1111 (1986)
3. Ebel, H., Bornholdt, S.: Coevolutionary games on networks. Phys. Rev. E 66(5), 056118 (2002)
4. Fechner, G.T., Howes, D.H., Boring, E.G.: Elements of Psychophysics, vol. 1. Holt, Rinehart and Winston, New York (1966)
5. Garcia, D., Mavrodiev, P., Schweitzer, F.: Social resilience in online communities: the autopsy of Friendster. In: Proceedings of the First ACM Conference on Online Social Networks, COSN 2013, pp. 39–50. ACM, New York (2013). https://doi.org/10.1145/2512938.2512946
6. Hirahara, Y., Toriumi, F., Sugawara, T.: Evolution of cooperation in SNS-norms game on complex networks and real social networks. In: International Conference on Social Informatics, pp. 112–120. Springer (2014)
7. Lőrincz, L., Koltai, J., Győr, A.F., Takács, K.: Collapse of an online social network: burning social capital to create it? Soc. Netw. 57, 43–53 (2019)
8. Miura, Y., Toriumi, F., Sugawara, T.: Evolutionary learning model of social networking services with diminishing marginal utility. In: Companion Proceedings of the The Web Conference 2018, WWW 2018, pp. 1323–1329. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland (2018). https://doi.org/10.1145/3184558.3191573
9. Miura, Y., Toriumi, F., Sugawara, T.: Multiple world genetic algorithm to analyze individually advantageous behaviors in complex networks. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 297–298. GECCO 2019. ACM, New York (2019). https://doi.org/10.1145/3319619.3321989
10. Sun, N., Rau, P.P.L., Ma, L.: Understanding lurkers in online communities: a literature review. Comput. Hum. Behav. 38, 110–117 (2014)
11. Toriumi, F., Yamamoto, H., Okada, I.: Why do people use social media? Agent-based simulation and population dynamics analysis of the evolution of cooperation in social media. In: Proceedings of the 2012 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 02, pp. 43–50. IEEE Computer Society (2012)
12. Vázquez, A.: Growing network with local rules: preferential attachment, clustering hierarchy, and degree correlations. Phys. Rev. E 67(5), 056104 (2003)
Shannon Entropy in Time–Varying Clique Networks

Marcelo do Vale Cunha1,2(✉), Carlos César Ribeiro Santos1, Marcelo Albano Moret1,3, and Hernane Borges de Barros Pereira1,3

1 Centro Universitário SENAI CIMATEC, Salvador, BA 41650-010, Brazil
2 Instituto Federal da Bahia, Barreiras, BA 47808-006, Brazil
3 Universidade do Estado da Bahia, Salvador, BA 41150-000, Brazil
Abstract. Recent works have applied information theory to complex networks. Studies often discuss the entropy of the degree distribution of a network. However, there is no specific work on entropy in clique networks. In this regard, this work proposes a method to calculate the entropy of a clique network, as well as its theoretical maximum and minimum values. The entropies are calculated for a dataset of semantic networks of titles of scientific papers from the journals Nature and Science over approximately a decade. The journals are modeled as time–varying graphs, and each system is analyzed through a sliding time window. The results show the entropy values of vertices and edges in each window arranged as time series, and they also suggest the moments at which vocabulary diversification is higher or lower and when this diversity brings the studied journals closer together or moves them apart. In this way, this report contributes to studies on clique networks and on the diffusion of human knowledge in journals of high scientific impact.

Keywords: Networks of cliques · Shannon entropy · Semantic networks · Social network analysis · Time–varying graphs
1 Introduction

The mathematical formalism of information as an entropy measure was first introduced by Claude Shannon in 1948. According to Shannon's theory, information theory makes it possible to investigate and compare systems through random variables inherent in the composition of these systems or in their properties [1]. As a consequence, the theory reaches several areas, such as biology, economics, and confined quantum systems, among others [2–4]. It may also provide a methodological link between different areas [5], including statistical and thermodynamic physics, in which several recent works have shown the importance of information entropy [6, 7]. Recently, many authors have used these concepts to measure the information contained in the degree or geodesic distance distributions of real networks or of classical network models in order to differentiate these systems by the heterogeneity of their links [8–10].

© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 507–518, 2020. https://doi.org/10.1007/978-3-030-36687-2_42
The use of time is very important in the analysis of systems whose elements connect to each other. In 2012, [11, 12] formalized diverse concepts and metrics used in time–varying networks, creating the concept of a time–varying graph (TVG). For a more comprehensive approach, [13] presented suggestions for specific algorithms and metrics for the many applications that require this model.

Clique networks are widely applicable and fit the modeling of various social systems, e.g., movie actor networks [14], co-authorship networks [15], concept networks [16], and semantic networks [17–20]. The study of clique networks formed by words contributes mainly to the study of the organization of human language. In this context, knowledge representation systems such as scientific journals can be studied through the semantic networks of titles of scientific papers (STNs). STNs have gained a prominent role in research aimed at understanding the behavior and structure of this system, which consists of words that summarize the main contributions of works published in important scientific journals [18–20].

Despite the growing interest of various areas in Shannon entropy, no studies using this measure in clique networks were found. Therefore, this work proposes a method that calculates vertex and edge entropy in clique networks, together with the maximum and minimum limits of the entropy values, according to the initial conditions of the clique networks studied here. The dataset used in this report comes from the STNs of the journals Nature and Science from 1999 to 2008. To extract more accurate information from the system, the associated network was constructed as a TVG. The results seek correlations between the two journals in a given period and compare different times of the same journal, contributing to the study of the scientific dissemination of important reports over the studied decade.
2 Background
2.1 Information Entropy
Information theory has evolved in recent decades and has been applied in different fields, such as telecommunications, computer science, physics, genetics, ecology, and in the discussion of the fundamental process of scientific observation [21]. The mathematical concept of information, developed by Claude Shannon, considers that the information contained in a message is associated with the number of possible values or states that this message may have [1]. Thus, if the system has only one possible state (e.g., the degree of the vertices in a regular network), no information is obtained upon inspection. The more possible states a system has, the more information it contains; in other words, more is learned by discovering its real state. Entropy is the expected value of the uncertainty of a random variable X (a system state) with respect to a probability distribution, Eq. 1:

$$H(X) = -k \sum_i p_i \log p_i. \tag{1}$$

In Eq. 1, X is the random variable, $p_i$ is the probability of state i of this variable (with $\sum_i p_i = 1$), and k is a positive constant that sets the unit of measure; taking the logarithm in base 2 (equivalently, $k = 1/\ln 2$ with natural logarithms) gives the entropy value in bits. Hereafter, this convention is used. Each calculated entropy value has an associated maximum and minimum value. When these limits are known, they help to evaluate how much the real value deviates from these idealized situations. For a probability distribution over the states of a random variable X, the minimum entropy occurs when the uncertainty is minimal. For example, when there is only one possible state for X, we are 100% sure about this state, so $H(X) = 0$. On the other hand, the maximum entropy occurs when all N possible states of the variable have equal probability, i.e., $p_i = 1/N$ and $H(X) = -\sum_{i=1}^{N} \frac{1}{N} \log_2 \frac{1}{N} = \log_2 N$. Thus, the entropy value of a random variable X with N possible states lies within these limits, as shown in Eq. 2:

$$0 \le H(X) \le \log_2 N \ \text{bits}. \tag{2}$$
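As an illustration of Eqs. 1 and 2, the short Python sketch below estimates the entropy in bits from a list of observed states (for example, vertex degrees) and compares it with the upper limit for the number of distinct states. The function name `shannon_entropy` and the example data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Entropy in bits of the empirical distribution of the observed states."""
    counts = Counter(observations)
    total = sum(counts.values())
    probabilities = [c / total for c in counts.values()]
    # p * log2(1/p) is the same as -p * log2(p).
    return sum(p * math.log2(1 / p) for p in probabilities)


regular = [3, 3, 3, 3]   # one possible state, as in a regular network
uniform = [1, 2, 3, 4]   # four equally likely states
mixed = [1, 1, 2, 3]     # three states with unequal probabilities

for sample in (regular, uniform, mixed):
    n_states = len(set(sample))
    print(shannon_entropy(sample), "<=", math.log2(n_states))
```

The first sample reaches the lower limit of Eq. 2, the second reaches the upper limit $\log_2 N$ with N = 4, and the third falls strictly between 0 and $\log_2 3$.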
2.2 Time–Varying Graphs
Real networks are strongly influenced by the dynamics of their vertices (entering and leaving the network) and by changes in the connections between them. Thus, to study systems of this type properly, it is necessary to consider the temporal elements in their sets of vertices and edges. Among the various ways to study the effects of time on a network, one model is particularly interesting: time–varying graphs (TVGs). Following the formalization in [11, 12], a TVG can be understood as a static graph $G = (V, E)$ plus temporal parameters (functions or sets): $\sigma$ (latency function), $\gamma$ (presence function), and $\Gamma$ (lifetime). Thus, a TVG is the quintuple shown in Eq. 3:

$$G = (V, E, \gamma, \sigma, \Gamma). \tag{3}$$

In Eq. 3, $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices and $E = \{e_1, e_2, \ldots, e_m\}$ is the set of edges of the system, where $e_k = (v_i, v_j)$ (with $i \ne j$ and $i, j = 1, 2, \ldots, n-1, n$). For these sets, $n = |V|$ and $m = |E|$. The temporal parameters are as follows. $\Gamma \subset \mathbb{Z}$, with $\Gamma = \{t_1, t_2, t_3, \ldots, t, \ldots, t_{T-1}, t_T\}$, represents the system lifetime, which is discrete in time. Each element of $\Gamma$ represents a date or time instant. The interval between the extreme dates is the total time $T = t_T - t_1 + 1$. The smallest variation between dates (two consecutive instants) represents the time unit of $\Gamma$. $\gamma: E \times \Gamma \to \{0, 1\}$ is the presence function, which indicates whether a given edge exists at a given time $t \in \Gamma$. $\sigma$ is the latency function, which represents the time required to form an edge.
2.3 Clique Networks
A clique is a maximal complete graph or subgraph, i.e., one in which all vertices are connected to each other. A clique network is thus a graph formed by the union of cliques through the processes of overlapping and juxtaposition, in which cliques share common edges and common vertices, respectively [22]. Figure 1 shows the process of forming these networks.
Fig. 1. (a) Cliques in the original configuration; (b) cliques joined through common vertices, forming a clique network. In juxtaposition, cliques are joined by only one vertex, while in overlap the union is made through at least two vertices and an edge.
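The formation process of Fig. 1 can be sketched in Python as the union of the edges of each clique. The function name `clique_network` and the example words are hypothetical; each inner list stands for one clique, such as the words of one title.

```python
from itertools import combinations


def clique_network(cliques):
    """Union of cliques: returns (vertices, edges), edges as frozensets."""
    vertices, edges = set(), set()
    for clique in cliques:
        members = set(clique)
        vertices |= members
        # Inside a clique, every pair of vertices is connected.
        edges |= {frozenset(pair) for pair in combinations(sorted(members), 2)}
    return vertices, edges


cliques = [
    ["entropy", "network", "clique"],
    ["clique", "graph"],                 # juxtaposition: shares one vertex
    ["entropy", "network", "semantic"],  # overlap: shares two vertices and an edge
]
vertices, edges = clique_network(cliques)
print(len(vertices), len(edges))
```

Storing edges in a set means that an edge shared by two overlapping cliques is counted only once in the resulting network.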
Several works investigate systems that fit the clique network model, ranging from social [15] to biological [23] systems, along with theoretical works on these networks such as [24] and [22], which proposed a set of indices to capture the properties of clique networks and a method to characterize the small-world phenomenon in this type of network. Semantic clique networks, in which the words that make up a sentence of a text, a university course syllabus, a set of keywords, or the title of an article are the vertices of a clique, are increasingly being studied. For example, [16] analyzes the structure of meaningful concepts in written discourse, whereas [17, 25] used semantic clique networks to analyze the relationships between words that emerge in oral discourse. Others have proposed important methodologies for studying semantic networks of titles of scientific papers (STNs): [18, 19] studied the topological structure of STNs as a method to analyze the efficiency of information diffusion, and [26] used STNs to compare the titles of journal articles on mathematics education in English and Portuguese. The work in [27] considered a time–varying STN and observed an effect on the network memory.
3 Materials and Method
3.1 Dataset
The dataset is composed of the titles of the articles published in the journals Nature and Science from 1999 to 2008 [18]. These journals were chosen because of their high impact factors and their similar publication frequency in the collected period. Table 1 shows some information about the collected data. We assume that the system to be analyzed contains only lexical words; grammatical words (e.g., prepositions, articles, and pronouns) are not considered elements of the system because they have no intrinsic meaning.
Shannon Entropy in Time–Varying Clique Networks
Table 1. Data on Nature and Science journals from 1999 to 2008.

Data information                               Nature    Science
Publication frequency                          Weekly    Weekly
Number of articles published (1999 to 2008)    11798     3490
Number of weeks (1999 to 2008)                 512       514
The words of these titles were processed according to the treatment rules proposed in [18]. After treatment, the data were organized so that each journal has a set of text files, where each file contains the treated titles corresponding to one week of publication (one issue of the journal). Clique networks are then built in which each vertex represents a treated word and the edges connect words that belong to the same title; the final network is called a semantic network of titles of scientific papers (STN).

3.2 Construction of Time-Varying Semantic Networks of Titles of Scientific Papers
The STN is built for different periods. In other words, the time-varying semantic network of titles of scientific papers (TVSNT) takes into account the temporal information contained in its titles when the network is constructed. We use the same parameters as for a TVG: the set of vertices V is represented by the treated words of each STN in the collected period; the set of edges E is formed by the pairs of words that belong to the same title. For the time parameters, the elements of the set C represent the collection time of the titles, given in weeks, since this is the minimum publication period of the journals (see footnote 1). The presence function c indicates whether two words occur in the same title at least once at a given instant. In this work, we do not use the latency function r.

3.3 Sliding Window
The network can be observed at one or more consecutive times of the set C. To broaden the possibilities of network analysis, this work proposes the use of the sliding window function $w_{\tau,s}$, where $\tau$ is the size of the time window and $s$ is the step taken by the window in time. Assuming that the values of $\tau$ and $s$ are constant and chosen by the researcher, the set of windows that fit into a TVG is $\{\tau_1, \tau_2, \ldots, \tau_k, \ldots, \tau_{K-1}, \tau_K\}$. In this set, the time distance between two consecutive windows is $s = t_k - t_{k-1}$, where $t_k = t \in C$ is the first instant or date of the window $\tau_k$. Thus, $K \in \mathbb{N} \mid K = \frac{T - \tau}{s}$ and $t + \tau \le T$. Therefore, the TVG can be written as a set of static graphs, Eq. 4:

$$G = \{G_{\tau_k}\} = \{G_{\tau_1}, G_{\tau_2}, \ldots, G_{\tau_k}, \ldots, G_{\tau_{K-1}}, G_{\tau_K}\}, \qquad \tau_k = [\,t + (k-1)s,\; t + (k-1)s + \tau - 1\,]. \tag{4}$$

Footnote 1: For Nature $T = 514$ weeks and for Science $T = 512$ weeks.
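As a rough illustration of Eq. 4, the windows can be enumerated in a few lines of Python. This is only a sketch: the function name `sliding_windows`, the choice of week 1 as the first instant, and the helper `titles_in_window` with its `weekly_titles` dictionary are assumptions made here for illustration and do not come from the original method.

```python
def sliding_windows(T, tau, s, t_first=1):
    """Enumerate windows tau_k = [t, t + tau - 1] with step s.

    Assumes weeks are numbered t_first, t_first + 1, ..., T and keeps a
    window only while t + tau <= T, the condition stated in the text.
    """
    windows = []
    t = t_first
    while t + tau <= T:
        windows.append((t, t + tau - 1))
        t += s
    return windows


def titles_in_window(weekly_titles, window):
    """Collect the titles of all weeks inside a window (hypothetical helper).

    weekly_titles maps a week number to a list of titles, each title being
    a list of treated words.
    """
    first, last = window
    return [title
            for week in range(first, last + 1)
            for title in weekly_titles.get(week, [])]


windows = sliding_windows(T=512, tau=8, s=1)
print(len(windows), windows[0], windows[-1])  # 504 (1, 8) (504, 511)
```

With $T = 512$, $\tau = 8$ and $s = 1$ the sketch yields 504 windows, which agrees with $K = (T - \tau)/s$ under the stated numbering assumption.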
M. do Vale Cunha et al.

3.4 Metrics Used
For each clique network in a window, the following properties were used [22]:
• $n_q$: number of cliques in the initial configuration, which is the number of titles;
• $n$: number of vertices in the final configuration, which is the number of different words in the network;
• $m$: number of edges in the final configuration, which is the number of distinct word pairs in the title network;
• $m_0$: number of edges in the initial configuration, which is the number of word pairs in the titles;
• $n_0$: number of vertices in the initial configuration, i.e., the total number of words, $n_0 \ge n$;
• $\#(v_i)$: frequency of vertex $i$ in the initial configuration, which is the number of titles containing vertex $i$, $1 \le \#(v_i) \le n_q$;
• $\#(v_i, v_j)$: frequency of edge $ij$ in the initial configuration of the clique network, which is the number of titles containing both words $i$ and $j$, $1 \le \#(v_i, v_j) \le n_q$, with $i, j \in \{1, 2, \ldots, n-1, n\}$, $i \ne j$ and $ij = ji$;
• $q_{\max}$: number of vertices of the largest clique in the initial configuration, which is the largest title size $(1 \le q_{\max} \le n)$;
• $q_{\min}$: number of vertices of the smallest clique in the initial configuration, which is the smallest title size $(1 \le q_{\min} \le n)$;
• $q_i$: number of vertices of clique $i$ in the initial configuration, which is the size of title $i$ $(1 \le i \le n)$.

3.5 Information Entropy in Titles Networks
In our study, two random variables from the process of network formation are considered: the vertex and the edge. The probabilities of occurrence of a vertex $i$ and of an edge $ij$ in the network in each TVG window $\tau_k$ are calculated according to Eqs. 5 and 6. For simplicity we write $\tau_k = t$; note that $t$ then denotes the window number and should not be confused with a date. Here $n_0^{t}$ and $m_0^{t}$ are the values of $n_0$ and $m_0$ in window $t$.

$$p_i(t) = \frac{\#(v_i)}{n_0^{t}}, \tag{5}$$

$$p_{ij}(t) = \frac{\#(v_i, v_j)}{m_0^{t}}. \tag{6}$$
Equations 7 and 8 give the Shannon entropy of these distributions, where $H_v(t)$ and $H_e(t)$ are the entropies of vertices and edges, respectively, at a given time $t$:

$$H_v(t) = -\sum_{i=1}^{n} p_i(t) \log_2 p_i(t), \tag{7}$$

$$H_e(t) = -\sum_{ij=1}^{n} p_{ij}(t) \log_2 p_{ij}(t). \tag{8}$$
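Equations 5 to 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `entropies` and the two toy titles are hypothetical, each title is taken to be a list of already treated words, and a word repeated inside one title is counted once because a clique cannot contain the same vertex twice.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropies(titles):
    """Return (H_v, H_e) in bits for one window.

    titles: list of titles, each a list of treated words (one clique each).
    """
    vertex_freq = Counter()  # #(v_i): number of titles containing word i
    edge_freq = Counter()    # #(v_i, v_j): number of titles containing both words
    for title in titles:
        words = sorted(set(title))
        vertex_freq.update(words)
        # Sorting makes (a, b) and (b, a) the same key, so ij == ji.
        edge_freq.update(combinations(words, 2))
    n0 = sum(vertex_freq.values())  # vertices in the initial configuration
    m0 = sum(edge_freq.values())    # edges in the initial configuration
    h_v = -sum(f / n0 * log2(f / n0) for f in vertex_freq.values())
    h_e = -sum(f / m0 * log2(f / m0) for f in edge_freq.values())
    return h_v, h_e


# Two made-up titles, used only to exercise the function.
toy_titles = [["network", "entropy", "clique"],
              ["network", "entropy", "time"]]
h_v, h_e = entropies(toy_titles)
print(f"{h_v:.3f} {h_e:.3f}")  # 1.918 2.252
```

In the toy example the pair (entropy, network) occurs in both titles, so it receives probability 2/6 while the other four pairs receive 1/6 each.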
3.6 Limits for Entropy Value
The limits shown in Eq. 2 (Sect. 2.1) do not necessarily hold for the entropy associated with the construction of clique networks. In this section, the extremes are recalculated based on boundary conditions (or initial conditions) for the formation of a clique network. The following conditions were used for the journals studied (see footnote 2): the number of cliques in the initial configuration $n_q$; the size of the largest clique $q_{\max}$; the size of the smallest clique $q_{\min}$; the number of vertices $n$; and the number of vertices in the initial configuration $n_0$.

Minimum Entropy. The minimum entropy value is associated with the maximum certainty of the variable. Two factors contribute strongly to this: (i) a minimum number of possible states for the variable and (ii) greater repetition of one or a few of its possible states. To suit the boundary conditions, the $n$ vertices are distributed among the $n_q$ cliques without repeating any vertex, with the number of vertices per clique $q_i$ neither exceeding the maximum value nor falling below the minimum value, i.e., $q_{\min} \le q_i \le q_{\max}$. Figure 2 shows this scheme, named configuration 1. In it, there are $x$ cliques of size $q$ and $y$ cliques of size $q + 1$, so that $x + y = n_q$ and $xq + y(q + 1) = n$, which gives $q = \lfloor n / n_q \rfloor$ and $y = n - q\,n_q$.
Fig. 2. Example of a scheme based on configuration 1 and configuration 2 using real data from the window $t = 202$ of the TVG of the Science titles network shown in this paper. In this window, there are $n_q = 160$ titles with $n = 761$ different words from a total of $n_0 = 968$ words. Configuration 1 minimizes the number of edges in the network and consequently its edge entropy, $H_e^{\min} = 10.49$ bits, and maximizes the vertex entropy, $H_v^{\max} = 9.57$ bits. Configuration 2 minimizes the vertex entropy. For $t = 202$, TVG of Science, $H_v^{\min} = 8.34$ bits.
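The numbers quoted for configuration 1 can be checked with a short Python sketch. It assumes, as a reading of the text rather than a statement from it, that the cliques of configuration 1 are disjoint, so that every edge and every vertex occurs exactly once and both entropies reduce to $\log_2$ of the number of distinct states. The function name `configuration_1` is hypothetical.

```python
from math import comb, log2


def configuration_1(n, n_q):
    """Split n distinct vertices into n_q disjoint cliques of size q or q + 1."""
    q = n // n_q                  # q = floor(n / n_q)
    y = n - q * n_q               # number of cliques of size q + 1
    x = n_q - y                   # number of cliques of size q
    # A clique with k vertices has C(k, 2) edges.
    m = x * comb(q, 2) + y * comb(q + 1, 2)
    return q, x, y, m


# Window t = 202 of the Science TVG, values taken from the caption of Fig. 2.
n, n_q = 761, 160
q, x, y, m = configuration_1(n, n_q)
h_e_min = log2(m)   # uniform distribution over m distinct edges (assumption)
h_v_max = log2(n)   # uniform distribution over n distinct vertices
print(q, x, y, m)                        # 4 39 121 1444
print(f"{h_e_min:.3f} {h_v_max:.3f}")    # 10.496 9.572
```

Under these assumptions the sketch gives about 10.496 bits and 9.572 bits, close to the values of 10.49 and 9.57 bits reported in Fig. 2.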
Footnote 2: Depending on the investigated system, it may not be necessary to use all of these conditions, and other conditions may be added or may replace the existing ones.
This configuration generates the lowest entropy for the edges of the network because it guarantees the smallest number of edges. In general, the repetition of a state of a variable also contributes to reducing its entropy. In clique networks, however, this does not happen for edges, because the repetition of an edge implies that it exists in more than one clique. The two vertices that compose it are then forced to be connected to all the other vertices of each of those cliques, which causes a considerable increase in the number of edges, that is, in the number of possible states, and consequently an increase in entropy.

For the minimum vertex entropy, we start from configuration 1 and add the remaining $n_0 - n$ repeated vertices one by one, giving the maximum repetition to the first vertices added. Thus, if $n_0 - n > n_q - 1$, there will be a vertex present in all $n_q$ cliques. If $(n_0 - n) - (n_q - 1) > n_q - 1$, the process continues with another vertex that will be in every clique, or in as many cliques as possible. For each vertex added to all cliques, the value $n_q - 1$ is subtracted from the number of vertices that have not yet been added, until this subtraction results in a number $n' \le n_q - 1$; the last vertex is then added, clique by clique, to $n'$ cliques. This configuration, called configuration 2, increases the probability of some vertices so as to reduce the entropy to the smallest possible value while respecting the initial conditions of the problem.

Maximum Entropy. For maximum edge entropy, the number of edges should be increased as much as possible while avoiding their repetition. For this purpose, the vertices are distributed according to the initial conditions so that there is one group of $x$ cliques of size $q_{\max}$ and another group of $y$ cliques of size $q_{\min}$, possibly with one clique of size $q_D$ such that $q_{\min} < q_D < q_{\max}$, as shown in Fig. 3; this is called initial configuration 3.
After that, the remaining $n_0 - n$ repeated vertices are added one by one to the cliques, with maximum repetition per vertex for the first vertices (final configuration 3), as shown in Fig. 3. This procedure increases the number of maximum cliques, which increases the number of distinct edges and consequently their entropy. For the maximum vertex entropy, configuration 1 already gives the largest possible entropy, since all vertices appear without repetition; the maximum vertex entropy is therefore given by the well-known equation $H_v^{\max} = \log_2(n)$. For the dataset of this work, $n \ge n_q$ in every window. For larger time windows, it might happen that $n < n_q$. In this case, some adjustments are required for the calculation of the limits; for example, in configuration 1, $q = 0$, $q + 1 = 1$, $y = n$ and $x = n_q - n$. This contradicts the boundary conditions, since $q = 0 < q_{\min}$. Thus, some of the repeated vertices ($n_0 - n$) will need to be distributed among cliques so that each one has $q = q_{\min}$ vertices. Due to the page limit, this application will be developed in a subsequent article.
Fig. 3. Example of a scheme designed for configuration 3 using real data from the window $t = 188$ of the TVG of the Science title network shown in this paper. In this window, at the beginning of the configuration, there are $n_q = 146$ titles with $n = 728$ different words from a total of $n_0 = 968$ words. Repeated vertices are added until the vertices total $n = n_0 = 898$; this is the configuration that maximizes the number of edges in the network, $H_e^{\max} = 11.50$ bits.
4 Results and Discussion

Figure 4 shows the values of vertex and edge entropy, as well as their respective maximum and minimum values, over time for the two journals. Figure 5 shows the entropies normalized by their extremes, $H' = \frac{H - H_{\min}}{H_{\max} - H_{\min}}$, for the vertices and edges of the two journals over time. The graphs show some interesting results. The moments where the entropy moves away from the maximum may indicate trends in the journal's vocabulary at the time. The vertex entropy values are higher and vary much less than the edge entropy values. Moreover, in various intervals $H_v$ and $H_e$ have opposite growth trends. Increasing $H_e$ implies the generation of new edges, which is possible through the addition of repeated vertices to the cliques, and this in turn makes $H_v$ decrease. In addition, in some of the studied periods the edge entropies of the two journals move in opposite directions: one journal reaches a high entropy value while the other shows a low one. Although it is not necessarily a maximum value, $H_e = \log_2 m$ is plotted to show how similar and strongly correlated the real edge entropy is with this value. This indicates that the windows contain clique networks with little edge overlap, but with potential for more.
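The normalization used for Fig. 5 is a plain min-max rescaling, sketched below in Python. The function name `normalized_entropy` is hypothetical, and the observed value of 9.50 bits is a made-up number chosen only to show the arithmetic; the limits are the vertex entropy extremes quoted for window $t = 202$ in Fig. 2.

```python
def normalized_entropy(h, h_min, h_max):
    """Rescale an entropy to [0, 1] using its recalculated extremes."""
    if h_max == h_min:
        raise ValueError("h_max and h_min must differ")
    return (h - h_min) / (h_max - h_min)


# Limits from Fig. 2 (Science, t = 202); h = 9.50 is a hypothetical observation.
h_norm = normalized_entropy(h=9.50, h_min=8.34, h_max=9.57)
print(f"{h_norm:.3f}")  # 0.943
```

A value near 1 means the observed entropy sits close to its maximum for that window, and a value near 0 means it sits close to its minimum.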
Fig. 4. Vertex entropy ($H_v$) and edge entropy ($H_e$) for the journal TVGs, with sliding window $w_{8,1}$. Windows $\tau = [t, t + 8]$ from $t_1$ = January 5, 1999 for Science and $t_1$ = January 7, 1999 for Nature.
Fig. 5. Entropies normalized by their maximum and minimum extremes for two journals.
Although entropy measures are sensitive to sample size, we use here the entire dataset from the collected period. This allows a proper comparison of the two journals, even when their entropy values are close. It is worth noting that the real vertex entropy satisfies $H_v \cong \log_2 n$ in every time window of both journals. For the edge entropy, there are periods in which the values deviate from the corresponding maximum.
The entropy values calculated here do not require the use of a null model (i.e., a random network) for comparison. The process of constructing configurations 1, 2 and 3 is already randomized. It is also important to emphasize that a network of cliques has high clustering, which means that there is no corresponding random network, since in random networks the clustering coefficient tends to zero ($C \to 0$) [28].
5 Conclusions

The results showed a strong correlation between the entropy values and their respective maximum values, especially for the vertex entropy. The graphs also show that the journals have a greater diversity of words than of word pairs. In other words, given a journal's vocabulary in a window, there are many more possibilities for forming new word pairs than for repeating existing ones in the titles. The method of constructing semantic clique networks is consistent with previous works regarding the vocabulary diversity of high-impact scientific journals. The study of vertex and edge entropy in clique networks can be combined with the study of the emergence of communities in these networks, as well as with correlations with other indicators specific to this type of network (e.g., fidelity incidence [17], reference diameter and fragmentation [22], among others).

Acknowledgment. This paper is financially supported by the Rectory of Research and Innovation of the Federal Institute of Bahia (PRPGI-IFBA) and the Senai Cimatec-BA University Center, from its preparation to its presentation at Complex Networks 2019.
References

1. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
2. Mousavian, Z., Kavousi, K., Masoudi-Nejad, A.: Information theory in systems biology. Part I: gene regulatory and metabolic networks. In: Seminars in Cell & Developmental Biology, vol. 51, pp. 3–13. Academic Press (2016)
3. Mishra, S., Ayyub, B.M.: Shannon entropy for quantifying uncertainty and risk in economic disparity. Risk Anal. 39(10), 2160–2181 (2019)
4. Nascimento, W.S., Prudente, F.V.: Shannon entropy: a study of confined hydrogenic-like atoms. Chem. Phys. Lett. 691, 401–407 (2018)
5. Zenil, H., Kiani, N.A., Tegnér, J.: Methods of information theory and algorithmic complexity for network biology. In: Seminars in Cell & Developmental Biology, vol. 51, pp. 32–43. Academic Press (2016)
6. Zurek, W.H.: Complexity, vol. 8. Entropy and the Physics of Information. CRC Press, Boca Raton (2018)
7. Gao, X., Gallicchio, E., Roitberg, A.E.: The generalized Boltzmann distribution is the only distribution in which the Gibbs-Shannon entropy equals the thermodynamic entropy. J. Chem. Phys. 151(3), 034113 (2019)
8. Solé, R.V., Valverde, S.: Information theory of complex networks: on evolution and architectural constraints. In: Complex Networks, pp. 189–207. Springer, Berlin (2004)
9. Ji, L., Bing-Hong, W., Wen-Xu, W., Tao, Z.: Network entropy based on topology configuration and its computation to random networks. Chin. Phys. Lett. 25(11), 4177 (2008)
10. Viol, A., Palhano-Fontes, F., Onias, H., de Araujo, D.B., Hövel, P., Viswanathan, G.M.: Characterizing complex networks using entropy-degree diagrams: unveiling changes in functional brain connectivity induced by Ayahuasca. Entropy 21(2), 128 (2019)
11. Nicosia, V., Tang, J., Musolesi, M., Russo, G., Mascolo, C., Latora, V.: Components in time-varying graphs. Chaos: Interdisc. J. Nonlinear Sci. 22(2), 023101 (2012)
12. Casteigts, A., Flocchini, P., Quattrociocchi, W., Santoro, N.: Time-varying graphs and dynamic networks. Int. J. Parallel Emergent Distrib. Syst. 27(5), 387–408 (2012)
13. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519(3), 97–125 (2012)
14. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
15. Newman, M.E.: Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64(1), 016131 (2001)
16. Caldeira, S.M., Lobao, T.P., Andrade, R.F.S., Neme, A., Miranda, J.V.: The network of concepts in written texts. Eur. Phys. J. B-Condens. Matter Complex Syst. 49(4), 523–529 (2006)
17. Teixeira, G.M., Aguiar, M.S.F., Carvalho, C.F., Dantas, D.R., Cunha, M.V., Morais, J.H.M., Pereira, H.B.B., Miranda, J.G.V.: Complex semantic networks. Int. J. Modern Phys. C 21(03), 333–347 (2010)
18. Pereira, H.D.B., Fadigas, I.S., Senna, V., Moret, M.A.: Semantic networks based on titles of scientific papers. Phys. A: Stat. Mech. Appl. 390(6), 1192–1197 (2011)
19. Pereira, H.B.B., Fadigas, I.S., Monteiro, R.L.S., Cordeiro, A.J.A., Moret, M.A.: Density: a measure of the diversity of concepts addressed in semantic networks. Phys. A: Stat. Mech. Appl. 441, 81–84 (2016)
20. Grilo, M., Fadigas, I.S., Miranda, J.G.V., Cunha, M.V., Monteiro, R.L.S., Pereira, H.B.B.: Robustness in semantic networks based on cliques. Phys. A: Stat. Mech. Appl. 472, 94–102 (2017)
21. Brillouin, L.: Science and Information Theory. Courier Corporation, Chelmsford (2013)
22. Fadigas, I.D.S., Pereira, H.B.D.B.: A network approach based on cliques. Phys. A: Stat. Mech. Appl. 392(10), 2576–2587 (2013)
23. Adamcsek, B., Palla, G., Farkas, I.J., Derényi, I., Vicsek, T.: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22(8), 1021–1023 (2006)
24. Derényi, I., Palla, G., Vicsek, T.: Clique percolation in random networks. Phys. Rev. Lett. 94(16), 160202 (2005)
25. Lima-Neto, J.L.A., Cunha, M., Pereira, H.B.B.: Redes semânticas de discursos orais de membros de grupos de ajuda mútua. Obra Digit.: J. Commun. Technol. 14, 51–66 (2018)
26. Henrique, T., de Sousa Fadigas, I., Rosa, M.G., de Barros Pereira, H.B.: Mathematics education semantic networks. Soc. Netw. Anal. Mining 4(1), 200 (2014)
27. Cunha, M.V., Rosa, M.G., Fadigas, I.S., Miranda, J.G.V., Pereira, H.B.B.: Redes de títulos de artigos científicos variáveis no tempo. In: Anais do II Brazilian Workshop on Social Network Analysis and Mining, CSBC 2013, Maceió-AL, pp. 194–205 (2013)
28. Watts, D.J., Strogatz, S.H.: Collective dynamics of 'small-world' networks. Nature 393(4), 440–442 (1998)
Two-Mode Threshold Graph Dynamical Systems for Modeling Evacuation Decision-Making During Disaster Events

Nafisa Halim (1), Chris J. Kuhlman (2, corresponding author), Achla Marathe (2), Pallab Mozumder (3), and Anil Vullikanti (2)

(1) Boston University, Boston, MA 02218, USA, [email protected]
(2) University of Virginia, Charlottesville, VA 22904, USA, {cjk8gx,achla,vsakumar}@virginia.edu
(3) Florida International University, Miami, FL 33199, USA, [email protected]
Abstract. Recent results from social science have indicated that neighborhood effects have an important role in an evacuation decision by a family. Neighbors evacuating can motivate a family to evacuate. On the other hand, if a lot of neighbors evacuate, then the likelihood of an individual or family deciding to evacuate decreases, for fear of looting. Such behavior cannot be captured using standard models of contagion spread on networks, e.g., threshold models. Here, we propose a new graph dynamical system model, 2mode-threshold, which captures such behaviors. We study the dynamical properties of 2mode-threshold in different networks, and find significant differences from a standard threshold model. We demonstrate the utility of our model through agent-based simulations on small-world networks of Virginia Beach, VA. We use it to understand evacuation rates in this region, and to evaluate the effects of the model and of different initial conditions on evacuation decision dynamics.
1 Introduction
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 519–531, 2020. https://doi.org/10.1007/978-3-030-36687-2_43

Background. Extreme weather events displaced 7 million people from their homes in just the first six months of 2019 [23]. With the rise in global warming, the frequency of these events is increasing and they are also becoming more damaging. In 2017–2018 alone, there were 24 major events. In 2017, there was a total of 16 weather events that together cost over $306 billion, according to NOAA. In 2018, there were eight hurricanes, of which two were category 3 or higher and caused more than $50 billion in damages.

Motivation. Timely evacuation is the only action that can reduce risk in many of these events. Although more people are exposed to these weather events, technological improvements in weather prediction, early warning systems, emergency management, and information sharing through social media have helped keep the number of fatalities fairly low. During Hurricane Fani [17], a record 3.4 million people were evacuated in India and Bangladesh and fewer than 100 fatalities were recorded [23]. However, in many disaster events, e.g., Hurricane Sandy, the fraction of people who evacuated has been much lower than what local governments would like. The decision to evacuate or not is a very complex one and depends on a large number of social, demographic, familial, and psychological factors, including forecasts, warnings, and risk perceptions [13,14,19,25,26]. Two specific factors have been shown to have an important effect on evacuation decisions. First, peer effects, i.e., whether neighbors and others in the community have evacuated, are important. Up to a point, this has a positive impact on the evacuation probability of a household, i.e., as more neighbors evacuate, a household becomes more likely to evacuate. Second, concerns about property, e.g., due to looting, if a lot of people have already left, counteract the first effect. Therefore, this has a negative impact on the evacuation probability. An important public policy goal in disaster planning and response is to increase the evacuation rates in an affected region, and understanding how this happens is crucial.

Summary of Results. There is a lot of work on modeling peer effects, e.g., the spread of diseases, information, fads and other contagions [1,5,7]. A number of models have been proposed, such as independent cascade [15] and different types of threshold models (e.g., [6,24]). These are defined on a network, with each node in state 0 or 1 (0 indicating non-evacuating, 1 indicating that a node has been influenced, e.g., is evacuating), and a rule for a node to change state from 0 to 1. For instance, in a τ-threshold model, a node switches from state 0 to state 1 if at least a τ fraction of its neighbors are in state 1.
All prior models capture only the first effect above, i.e., as the number of affected neighbors increases, a node is more likely to switch to state 1. Here, we propose a new threshold model, referred to as 2mode-threshold, which inhibits a transition from state 0 to 1 if a sufficiently large fraction of a family's neighborhood is in state 1, and demonstrate its use in a large-scale study. Our results are summarized below.

1. Dynamics of the 2mode-threshold model (results in Sects. 2 and 3). We introduce and formalize evacuation decision-making as a graph dynamical system (GDS) [21] using 2mode-threshold functions at nodes. We study theoretically the dynamics of 2mode-threshold in different networks, and show significant differences from the standard threshold model that has no drop-off. Specifically, we find that starting at a small set of nodes in state 1, the diffusion process does not go beyond a constant fraction of the network. System configurations in which more nodes are 1's (e.g., the all-1's vector of node states) are also fixed points, but our results imply that one cannot reach such fixed points with many 1's from most initial configurations that have a small number of 1's.

2. Agent-based simulation and application (results in Sect. 4). We develop an agent-based modeling and simulation (ABMS) of the 2mode-threshold model on a realistic small-world network in the region of Virginia Beach, VA. This region has a population of over 450,000, and households are geographically situated based on land-use data, with a real geolocation, which gives meaning to the concepts of neighbors and long-range connections [4]. We add edges between households based on the Kleinberg small-world (KSW) model [16]. Our ABM enables us to capture heterogeneities in the modeling of the evacuation decision-making process. This includes not only heterogeneities in families, but also differences in the (local) neighborhoods of families as represented in social networks. We use it to understand the evacuation rates in this region, and to evaluate the effects of different initial conditions (e.g., the number of seeds, where seeds are families who are highly risk averse) on evacuation decision dynamics. For example, including the effects of looting can reduce evacuation rates by 50%.

Novelty and Implications. Models of type 2mode-threshold have not been studied before. Our ABM approach can help (i) understand how planners and managers can more effectively convince families that are in harm's way to evacuate; (ii) understand the effects of families' social networks on evacuation decisions [10,25,26]; and (iii) establish downstream conditions after the evacuation decision has been made, to support additional types of analyses. For example, results from these studies can be used to forecast traffic congestion (spatially and temporally) during the exodus [19], and to determine places where shelters and triage centers should be established.
2 Evacuation Decision-Making Model

2.1 Motivation from Social Science
Our model is motivated by the analysis in [13] of a survey conducted in the counties affected by Hurricane Sandy in the northeastern United States, which is briefly summarized here. The goal of this survey was to assess the factors driving evacuation decisions [20]. The survey was fairly large in scale, with over 1200 individuals and a response rate of 61.93%. A binomial logit model was applied to the survey data to test for the factors associated with households' evacuation behaviors [13]. The results indicate that a respondent's employment status, consideration of neighbors' evacuation behavior, concerns about neighborhood criminal activities or looting, access to the internet in the household, age, and having flood insurance each played a significant role in the respondent's decision to evacuate during Hurricane Sandy. Noteworthy was the influence of neighbors' evacuation behaviors and of concerns about looting and criminal behavior. Neighbors' evacuations had a statistically significant and positive effect on evacuation probability, but concerns about criminal and looting behavior had a significant negative effect, implying that if too many neighbors leave, then the remaining households are less likely to evacuate.

2.2 A Graph Dynamical Systems Framework
A graph dynamical system (GDS) is a powerful mathematical abstraction of agent-based models, and we use it here to develop a model of evacuation behavior, motivated by the survey analysis described above. A GDS S describes the evolution of the states of a set of agents. Let $x^t \in \{0, 1\}^n$ denote the vector of agent states at time $t$, with $x^t_v = 1$ indicating that agent $v$ has evacuated and $x^t_v = 0$ indicating that agent $v$ has not evacuated at time $t$. A GDS S consists of two components: (1) an interaction network $G = (V, E)$, where $V$ represents the set of agents (in our case, the households that are deciding whether or not to evacuate), and $E$ represents a set of edges, with $e = \{u, v\} \in E$ if agents $u$ and $v$ can influence each other; and (2) a set $F = \{f_v : v \in V\}$ of local functions $f_v : \{0, 1\}^{\deg(v)} \to \{0, 1\}$, one for each node $v \in V$, which determine the state of node $v$ in terms of the states of $N(v)$, the set of neighbors of $v$. Given a vector $x^t$ describing the states of all agents at time $t$, the vector $x^{t+1}$ at the next time is obtained by updating each $x^{t+1}_v$ using its local function $f_v(\cdot)$. We say that a state vector $x^t$ is a fixed point of S if the node states do not change, i.e., $x^{t+1} = x^t$.

The 2mode-Threshold Local Functions: Modeling Evacuation Behavior. The 2mode-threshold function $f_v(\cdot)$ is probabilistic and depends on the probability of evacuation, in order to capture the qualitative aspects of the results of [13]. This is shown in Fig. 1a, which specifies the probability of evacuation $p_e$ for agent $v_i$ as a function of the fraction $\eta_1$ of neighbors of $v_i$ in state 1. We have $p_e = p_{e,\max}$ for $\eta_1 \in (\eta_{\min}, \eta_c]$, and $p_e = 0$ for $\eta_1 \in [0, \eta_{\min}]$ and $\eta_1 > \eta_c$. In this paper, we primarily focus on $\eta_{\min} = 0$. Specifically, this captures the following effects: (i) peer (neighbor) influence can cause families to evacuate and (ii) if too many of a family's neighbors evacuate, there are not enough neighbors remaining behind to dissuade potential looters, so a family reduces its probability of evacuation. The first effect makes $p_e = p_{e,\max}$ for $\eta_1 > 0$, and the second effect results in $p_e$ dropping to zero once $\eta_1$ exceeds $\eta_c$.
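The local function and one update of the GDS can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `evacuation_probability` and `step` are hypothetical, the update is taken to be synchronous, and a node in state 1 is assumed never to return to state 0.

```python
import random


def evacuation_probability(eta_1, p_e_max, eta_c, eta_min=0.0):
    """2mode-threshold curve: p_e_max on (eta_min, eta_c], zero elsewhere."""
    return p_e_max if eta_min < eta_1 <= eta_c else 0.0


def step(states, in_neighbors, p_e_max, eta_c, eta_min=0.0, rng=random):
    """One synchronous update x^t -> x^{t+1}.

    states maps node -> 0 or 1; in_neighbors maps node -> list of the nodes
    that influence it. Assumption: state 1 (evacuated) is never left.
    """
    new_states = dict(states)
    for v, state in states.items():
        if state == 1:
            continue
        nbrs = in_neighbors.get(v, [])
        # Fraction of v's neighbors that are in state 1 at time t.
        eta_1 = sum(states[u] for u in nbrs) / len(nbrs) if nbrs else 0.0
        p_e = evacuation_probability(eta_1, p_e_max, eta_c, eta_min)
        if rng.random() < p_e:
            new_states[v] = 1
    return new_states
```

Setting `eta_c=1.0` removes the drop-off, so the same function then behaves like a probabilistic threshold rule without the looting effect.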
Note that the special case where pe = pe,max for η1 > ηmin = 0 is a probabilistic variant of the ηmin threshold function (e.g., [6]); we will sometimes refer to this as the “regular probabilistic threshold” model, and denote them by rpthreshold. This model is shown in Fig. 1b. These are models that can be assigned to any agent; in GDS, an agent is a node that resides in a networked population. Network Models. We describe the models for the contact network G = (V, E), which is the other component of a GDS S. A node vi ∈ V represents a family, or a household. Edges represent interaction channels, for communication and observations. Edges are directed : a directed edge (vj , vi ) ∈ E, with vi , vj ∈ V , means that family vj influences family vi . We use the population model developed in [4] for representing the set V of households. Edges are speciﬁed using the Kleinberg small world (KSW) network approach [16], and there are two types of edges: short range and long range. Short range edges (vj , vi ) represent either (i) a family vi speaks with (is inﬂuenced by) another family vj about evacuation decisions, or (ii) a family vi observes vj ’s home and infers whether or not a family vj has evacuated. A longrange edge represents a member of one family vi interacting with a member of family vj at work. Each edge has a label of distance between homes, using (lon, lat) coordinates of each home. Thus, the KSW model has the following parameters: the node set V and their attributes, the shortrange distance dsr over which shortrange edges are placed between nodes, and the number q of long range
Evacuation DecisionMaking
(a)
523
(b)
Fig. 1. Dynamics models—probability of evacuation curve—for probability pe of evacuation for a family versus the fraction η1 of its neighbors in state 1 (i.e., evacuating). (a) The 2modethreshold model: the evacuation probability is pe = 0 for η1 = ηmin = 0 and for η1 > ηc . The maximum probability is pe = pe,max in the interval (ηmin , ηc ]. (b) The rpthreshold model: this curve is similar to the previous curve, except that pe = pe,max for η1 > ηmin . This is a special case of 2modethreshold, but is a variation of the regular probabilistic threshold model [6, 21, 24]. As an illustration, if an agent has 50% of its neighbors in state 1, then the model in (a) shows that pe = 0, while (b) shows that pe = pe,max > 0. An example with values for these parameters is given in the text.
edges incident on each node vi . For each node vi , (i) short range edges (vj , vi ) are constructed, where d(vj , vi ) ≤ dsr ; and (ii) q long range edges (vk , vi ) are placed at random, with probability proportional to 1/d(vk , vi )α , for a parameter α. Note that for each short range edge (vj , vi ), there is a corresponding edge (vi , vj ). See [16] for details. Example. Figure 1a shows an example of the 2modethreshold model with the parameters pe,max = 0.2, and ηc = 0.4. Figure 1b shows a rpthreshold model. The purpose of this example is to illustrate the dynamics of these models on a network of ﬁve agents. In Fig. 2, x1 is the initial conﬁguration with node 1 evacuated (in state 1), and nodes 2, 3, 4, and 5 not evacuated (in state 0). Nodes 2 and 3 have η1 = 1/3 < ηc = 0.4, and so for both of them, the evacuation probability is pe = 0.2. Nodes 4 and 5 have η1 = 0, so pe = 0 for them. Therefore, the probability that the state vector is x2 at the next time step (see Fig. 2) is pe,max (1−pe,max ) = 0.2·0.8 = 0.16, since only node 2 switches to 1. With respect to the conﬁguration x2 , nodes 3, 4, and 5 have η1 = 23 , 1 and 0, respectively. Therefore, pe = 0 for all these nodes, and x2 is a ﬁxed point of the S with the 2modethreshold functions. However, for the regular probabilistic threshold model, with ηmin < 0.3, x2 is not a ﬁxed point, since nodes 3 and 4 both have pe = pe,max (since they have η1 > ηmin ). Therefore, in the regular probabilistic threshold model, the x2 → x3 transition occurs with probability p2e,max = 0.04. Problems of Interest. We will refer to a GDS system S2m = (G, F) in which the local functions are 2modethreshold functions as a 2modethreshold
Fig. 2. An example showing the transitions in a S on a graph with five nodes, and 2modethreshold local functions, with parameters pe,max = 0.2 and ηc = 0.4. The figure shows a transition of the dynamics model from configuration x1 to x2, with shaded nodes indicating evacuation. The x1 → x2 transition occurs with probability pe,max (1 − pe,max) = 0.16. For the above parameters, x2 is a fixed point, and the node states do not change. However, if we had ηc = 1 (i.e., this is a regular probabilistic threshold), x2 is not a fixed point, and there can be a transition to configuration x3 with probability (pe,max)^2 = 0.04 (indicated as a dashed arrow).
GDS. Our objective in this paper is to study the following problems on a S2m system: (1) How do the dynamical properties of 2modethreshold GDS systems differ from those of S with rpthreshold model functions? Do they have fixed points, and what are their characteristics? (2) How does the number of 1’s in the fixed point depend on the initial conditions and on the model parameters, namely pe,max and ηc? How can this number be maximized? We provide solutions to these problems next.
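The five-agent example can be checked numerically. The sketch below is only an illustration: the adjacency lists are a hypothetical graph chosen so that the neighbor fractions match those quoted in the example (the actual graph of Fig. 2 is not reproduced here), the function names are ours, and state 1 is assumed to be absorbing.

```python
def pe_2mode(eta1, pe_max, eta_c, eta_min=0.0):
    # Curve of Fig. 1a: pe_max on the interval (eta_min, eta_c], zero elsewhere.
    return pe_max if eta_min < eta1 <= eta_c else 0.0

def pe_rp(eta1, pe_max, eta_min=0.0):
    # Curve of Fig. 1b: pe_max for every eta1 above eta_min.
    return pe_max if eta1 > eta_min else 0.0

# Hypothetical adjacency lists (assumption): they reproduce eta1 = 1/3 for
# nodes 2 and 3 in x1, and eta1 = 2/3, 1, 0 for nodes 3, 4, 5 in x2.
NEIGHBORS = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2, 5], 4: [2], 5: [3]}

def frac_in_state1(v, state):
    # Fraction of the neighbors of v that are in state 1.
    nbrs = NEIGHBORS[v]
    return sum(state[u] for u in nbrs) / len(nbrs)

def transition_prob(x, y, pe):
    # Probability of moving from configuration x to y in one synchronous step,
    # assuming nodes in state 1 never return to state 0.
    prob = 1.0
    for v in x:
        if x[v] == 1:
            if y[v] != 1:
                return 0.0
            continue
        p = pe(frac_in_state1(v, x))
        prob *= p if y[v] == 1 else 1.0 - p
    return prob

x1 = {1: 1, 2: 0, 3: 0, 4: 0, 5: 0}
x2 = {1: 1, 2: 1, 3: 0, 4: 0, 5: 0}
x3 = {1: 1, 2: 1, 3: 1, 4: 1, 5: 0}

two_mode = lambda eta: pe_2mode(eta, pe_max=0.2, eta_c=0.4)
rp = lambda eta: pe_rp(eta, pe_max=0.2)

p_x1_x2 = transition_prob(x1, x2, two_mode)      # 0.2 * 0.8
p_x2_stays = transition_prob(x2, x2, two_mode)   # x2 is a fixed point
p_x2_x3_rp = transition_prob(x2, x3, rp)         # 0.2 * 0.2
print(p_x1_x2, p_x2_stays, p_x2_x3_rp)
```

Under these assumptions the three printed values agree with the probabilities 0.16, 1 and 0.04 discussed in the example, up to floating-point rounding.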
3
Analyzing Dynamical Properties in Diﬀerent Network Models
It can be shown that any S2m converges to a fixed point in at most n/pe,max steps. S2m systems have significantly lower levels of diffusion (i.e., number of nodes ending up in state 1) than the rpthreshold model, as we discuss below. Many details are omitted for space reasons.
Lemma 1. Consider a S2m with G = Kn being a complete graph on n nodes. Starting at a configuration x0 with a single node in state 1, S2m converges to a fixed point with at most (pe,max + ηc)n nodes in state 1, in expectation. In contrast, in a regular probabilistic threshold system on Kn with ηmin = 0, the system converges to the all 1’s vector as a fixed point.
Proof. Consider a state vector x^t with k nodes in state 1. Consider any node v with x^t_v = 0. If k ≤ ηc n, then Pr[node v switches to 1] = pe,max. Therefore, the expected number of nodes which switch to 1 is pe,max (n − k) ≤ n pe,max. If k > ηc n, for every node in state 0, the probability of switching to 1 is pe = 0. Therefore, the expected number of 1’s in a fixed point is at most n pe,max + n ηc.
On the other hand, in a regular probabilistic threshold model, the system does not converge till each node in state 0 switches to 1 (since pe = pe,max for all η1 > 0). We observe below that starting at an initial configuration with a single 1, S2m converges to a fixed point with at most a constant fraction of nodes in state 1. Note, however, that configurations with more than that many 1’s, e.g., the all 1’s vector, are also fixed points. The result below implies that those fixed points will not be reached from an initial configuration with a few 1’s.
Lemma 2. Consider a S2m on a G(n, p) graph with p ηc ≥ (6/ε^2)(log n / n), for any ε ∈ (0, 1). Starting at a configuration x0 with a single node in state 1, S2m converges to a fixed point with at most (1 + 2ε)(ηc + pe,max)n nodes in state 1, in expectation. In contrast, in a regular probabilistic threshold system on Kn with ηmin = 0, the system converges to the all 1’s vector as a fixed point.
Proof. (Sketch) Let deg(v) denote the degree of v. For a subset S, let degS(v) denote the degree of v with respect to S, i.e., the number of neighbors of v in S. For any node v, we have E[deg(v)] = np. By the Chernoff bound [9], it follows that Pr[deg(v) > (1 + ε)np] ≤ e^(−ε^2 np/3) ≤ 1/n^2. Consider a set S of size ((1 + ε)/(1 − ε)) ηc n. For v ∈ S, E[degS(v)] = |S|p, and so Pr[degS(v) < (1 − ε)|S|p] ≤ e^(−ε^2 |S|p/2) ≤ 1/n^2. For |S| ≥ ((1 + ε)/(1 − ε)) ηc n, we have (1 − ε)|S|p ≥ (1 + ε)ηc np. Putting these together, with probability at least 1 − 2/n, we have deg(v) ≤ (1 + ε)np and degS(v) ≥ (1 + ε)ηc np ≥ ηc deg(v), for all nodes v. Therefore, if S2m reaches a configuration with nodes in set S of size ((1 + ε)/(1 − ε)) ηc n < (1 + 2ε)ηc n, with probability 1 − 2/n, S is a fixed point. With probability ≤ 2/n, S is not a fixed point, and the process converges to a fixed point with at most n 1’s, so that the expected number of 1’s in the fixed point is at most |S| + 2 ≤ (1 + 2ε)ηc n. On the other hand, consider the last configuration S which has size |S| < (1 + 2ε)ηc n. Then, in expectation, at most pe,max n additional nodes switch to state 1, after which point the configuration has more than (1 + ε)ηc n 1’s. Therefore, the expected number of 1’s in the fixed point is at most (1 + 2ε)(ηc + pe,max)n.
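The bound of Lemma 1 can be illustrated with a small Monte Carlo experiment on the complete graph. The following is a minimal sketch under stated assumptions, not the simulator used in this paper: the function name, the values n = 200 and 200 runs, and the random seed are chosen for illustration only. Because every state-0 node of Kn sees the same fraction of neighbors in state 1, it suffices to track the number k of nodes in state 1.

```python
import random

def simulate_complete_graph(n, pe_max, eta_c, rng, two_mode=True):
    """Run the dynamics on K_n from a single seed node and return the number
    of nodes in state 1 once no further change is possible."""
    k = 1  # number of nodes currently in state 1
    while k < n:
        eta1 = k / (n - 1)  # fraction seen by every state-0 node in K_n
        if two_mode and eta1 > eta_c:
            break  # 2modethreshold: all remaining nodes have pe = 0
        # Synchronous update: each state-0 node switches with probability pe_max.
        k += sum(rng.random() < pe_max for _ in range(n - k))
    return k

n, pe_max, eta_c = 200, 0.15, 0.2   # illustrative values (assumption)
rng = random.Random(0)
finals = [simulate_complete_graph(n, pe_max, eta_c, rng) for _ in range(200)]
mean_final = sum(finals) / len(finals)
bound = (pe_max + eta_c) * n

rp_final = simulate_complete_graph(n, pe_max, eta_c, random.Random(1), two_mode=False)
print(mean_final, bound, rp_final)
```

In this sketch the average over runs should stay below the bound (pe,max + ηc)n, while the run with `two_mode=False` (the rpthreshold curve with ηmin = 0) continues until all n nodes are in state 1.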
4
AgentBased Simulations and Results
Simulation Process. Inputs to the simulation are a social network (described below), a set of local functions that quantiﬁes the evacuation decision making process of each node vi ∈ V (see Sect. 2), and a set of seed nodes whose state is 1 (i.e., these nodes are set to “evacuate” at the start of a simulation instance, at time t = 0). All other nodes at time t = 0 are in state 0 (the nonevacuating state). We vary a number of simulation input parameters, as discussed immediately below, across simulations. Each simulation instance or run consists of a particular set of seed nodes, and time is incremented in discrete timesteps, from t = 0 to tmax . Here, tmax = 10 days, to model the ten days leading up to hurricane arrival. At each timestep, nodes that are in state 0 may change to state 1, per the models in Sect. 2. At each 1 ≤ t ≤ tmax , the state of the system
at time t − 1 is used to compute the next state of each vi ∈ V (corresponding to time t) synchronously; that is, all vi update their states in parallel at each t. A simulation consists of 100 runs, where each run has a different seed set; the network and dynamics models are fixed in a simulation across runs. We present results below based on averaging the results of the 100 runs.
Social Networks. Table 1 provides the social networks (and selected properties) that are used in simulations of evacuation decision making. The network model of Sect. 2.2 was used to generate KSW networks for Virginia Beach, VA. Inputs for the model were n = 113967 families forming the node set V, with (lat, long) coordinates, dsr = 40 m, α = 2.5 (see [16]), and q = 0 to 16.
Simulation Parameters Studied. The input parameters varied across simulations are provided in Table 2.
Table 1. Kleinberg small world (KSW) networks [16] used in our experiments and their properties. The number n of nodes is 113967 for all graphs. The short range distance dsr = 40 m and the exponent α = 2.5 is for computing the probabilities of selecting particular long-range nodes with which to form long-range edges with each node vi ∈ V. Column “No. LR Edges” (= q) means number of long-range edges incoming to each node vi. There are five graph instances for every row. Average degree is dave and maximum degree is dmax, for in-degree and out-degree.
Network Class   No. LR Edges   Avg. InDeg.   Max. InDeg.   Avg. OutDeg.   Max. OutDeg.
KSW0            0              10.11         380           10.11          380
KSW2            2              11.71         382           11.71          381
KSW4            4              13.70         384           13.70          381
KSW8            8              17.70         388           17.70          382
KSW16           16             25.70         396           25.70          383
Table 2. Summary of the parameters and their values used in the simulations.
Parameter                        Description
Networks                         Networks in Table 1. We vary q per the table, from 0 to 16.
Num. random seeds, ns            Number of seed nodes specified per run (chosen uniformly at random). Values are 50, 100, 200, 300, 400, and 500.
Threshold model                  The 2modethreshold model of Fig. 1a and the rpthreshold (i.e., classic) threshold model of Fig. 1b, in Sect. 2.
Threshold range, ηc              The range in relative degree over which nodes can change to state 1. Discrete values are 0.2 and 1.0. Note that ηc = 1 corresponds to the classic stochastic threshold model (Fig. 1b), whereas smaller values of ηc correspond to the 2modethreshold model (Fig. 1a).
Maximum probability, pe,max      The maximum probability of evacuation pe,max of Fig. 1. Discrete values are 0.05, 0.10, and 0.15.
Basic Results and the Eﬀects of Seeding. Figure 3b provides average fraction of population deciding to evacuate (Frac. DE) as a function of time for one
instance of the KSW2 category of networks. We use the 2modethreshold model with pe,max = 0.15 and ηc = 0.2 (see Fig. 1a). A simulation uses a ﬁxed value of number ns of random seed nodes per run, but the set of nodes diﬀers in each run (see legend). Other simulation parameters are in the ﬁgure. Error bars indicate the variance in results across 100 runs (i.e., simulation instances). The variance is very small (the bars cannot be seen in the plots, and are barely visible even under magniﬁed conditions). Hence we say no more about the variance in output. As number ns of random seeds increases from 50 to 500, the fraction deciding to evacuate fde increases from about 0.02 to 0.1.
Fig. 3. Simulation results of fraction of population deciding to evacuate (Frac. DE) versus simulation time. All results use the 2modethreshold model of Fig. 1a, pe,max = 0.15, ηc = 0.2, and ns (numbers of random seeds) varies from 50 to 500 (see legend). Error bars denote variance. (The variance is very small.) (a) Results for one graph instance of network class KSW0 (i.e., q = 0 long range edges per node). (b) Results for one graph instance of network class KSW2 (i.e., q = 2 long range edges per node). (c) Results for one graph instance of network class KSW16 (i.e., q = 16 long range edges per node).
Effect of Graph Structure: Long Range Edges. The effect of the number q of long range edges is shown across the three plots in Fig. 3 for the 2modethreshold model. For q = 0 (i.e., no long-range edges), the fraction of the population evacuating (Frac. DE) is fde ≈ 0. As q increases to 2 and then 16 long-range edges per node, fde increases markedly. In particular, Fig. 3c shows how the spread of evacuation decisions has an upper bound in the 2modethreshold model: once too many families have evacuated, the remaining families do not evacuate over concerns of looting and crime. This effect of greater contagion spreading as q increases is the “weak link” phenomenon [12], where long-range edges can cause remote nodes to change their state to 1 (i.e., evacuating), thus moving a “contagion” into a different region of the graph. Note that the speed with which the maximum of fde = 0.32 is attained increases with ns.
Effect of Dynamics Model: Maximum Evacuation Probability pe,max. Figure 4 shows the effect of the maximum evacuation probability pe,max of the 2modethreshold model. As pe,max increases from 0.05 (Fig. 4a) to 0.10 (Fig. 4b) to 0.15 (Fig. 4c), the fraction of population evacuating increases at smaller pe,max, almost plateaus for all ns
when pe,max = 0.1, and reaches the plateau faster for the largest pe,max. The values of pe,max were selected based on the survey results [13] mentioned in Sect. 2.1.
Fig. 4. Simulation results of fraction of population deciding to evacuate (Frac. DE) versus simulation time. All results use the 2modethreshold model of Fig. 1a with ηc = 0.2, and ns (numbers of random seeds) varies from 50 to 500, for one instance of the KSW16 graph class, i.e., q = 16 long range edges per node (similar results for other graph instances). (a) Results for pe,max = 0.05. (b) Results for pe,max = 0.10. (c) Results for pe,max = 0.15, is the same as Fig. 3c, reproduced for completeness.
Fig. 5. Simulation results of fraction of population deciding to evacuate (Frac. DE) versus simulation time. All results use the rpthreshold model of Fig. 1b where looting and crime are not concerns, and ns (numbers of random seeds) varies from 50 to 500, for one instance of the KSW16 graph class, i.e., q = 16 long range edges per node (similar results for other graph instances). (a) Results for pe,max = 0.05. (b) Results for pe,max = 0.10. (c) Results for pe,max = 0.15. These results can be compared with corresponding plots from Fig. 4 for the 2modethreshold model.
Eﬀect of Dynamics Model: Range of Relative Threshold for Transition to State 1. We compare results from the 2modethreshold (Fig. 4), with various values for pe,max and ηc = 0.2, against the rpthreshold model, with the same pe,max values, where ηc = 1.0 (Fig. 5). The corresponding plots, left to right in each ﬁgure, can be compared. As pe,max increases, the discrepancy between the two models increases: concern over looting dampens evacuation in
the 2modethreshold model. For pe,max = 0.15, the rpthreshold model results in Fig. 5c reach fde > 0.6, while the corresponding results for the 2modethreshold model in Fig. 4c are only roughly one-half of the values of fde in Fig. 5c. Hence, the 2modethreshold model can produce a large difference (dampening) in the fraction of families evacuating. Therefore, ignoring the influence of looting and crime can cause a large overprediction of family evacuations.
5
Related Work
Many studies have identified factors that affect evacuation decision making. These include social networks and peer influence [18,22], risk perceptions, evacuation notices, storm characteristics [2,3,8] and household demographics such as nationality, proximity to hurricane path, pets, disabled family members, mobile home, access to a vehicle, etc. [11,25]. Other studies use social networks and relative threshold models to model evacuation behavior. A relative threshold [6,24] θi for agent vi is the minimum fraction of distance-1 neighbors in G(V, E) that must be in state 1 in order for vi to change from state 0 to state 1. Several studies [14,25,26] assign thresholds to agents in agent-based models (ABMs) of hurricane evacuation modeling. Stylized networks of 2000 nodes are used in [14] to study analytical and ABM solutions to evacuation. In [25], 12,892 families are included in a model of a 1995 hurricane for which 75% of households evacuated. They include three demographic factors in their evacuation model, in addition to the peer influence that is captured by a threshold model. Small world and random regular stylized networks are used for social networks. Simulations of hurricane evacuation decision-making in the Florida Keys are presented in [26]. The simulations cover 24 hours, where the actual evacuation rate was about 53% of families. The social network is also a small-world network, with geospatial home locations, which is similar to our network construction method. In all of these studies, the more neighbors of a family vi evacuate, the more likely it is that vi will evacuate. Our threshold model differs: in our model, if too many neighbors evacuate, then vi will not evacuate because of concerns over crime and looting.
6
Summary and Conclusions
We study evacuation decision-making as a graph dynamical system using 2modethreshold functions for nodes. This work is motivated by the results of a survey collected during Hurricane Sandy, which shows that concerns about crime motivate families to stay in their homes. We study the dynamics of 2modethreshold in different networks, and show significant differences from the standard threshold model. Results obtained from this work can help determine the size and characteristics of the non-evacuee population, which city planners can use for contingency planning.
Acknowledgment. We thank the anonymous reviewers for their insights. This work has been partially supported by the following grants: NSF CRISP 2.0 Grant 1832587, DTRA CNIMS (Contract HDTRA111D00160001), NSF DIBBS Grant ACI1443054, NSF EAGER Grant CMMI1745207, and NSF BIG DATA Grant IIS1633028.
References
1. Aral, S., Nicolaides, C.: Exercise contagion in a global social network. Nat. Commun. 8, 14753 (2017)
2. Baker, E.J.: Evacuation behavior in hurricanes. Int. J. Mass Emergencies Disasters 9(2), 287–310 (1991)
3. Baker, E.J.: Public responses to hurricane probability forecasts. Prof. Geogr. 47(2), 137–147 (1995)
4. Barrett, C.L., Beckman, R.J., et al.: Generation and analysis of large synthetic social contact networks. In: Winter Simulation Conference, pp. 1003–1014 (2009)
5. Beckman, R., Kuhlman, C., et al.: Modeling the spread of smoking in adolescent social networks. In: Proceedings of the Fall Research Conference of the Association for Public Policy Analysis and Management. Citeseer (2011)
6. Centola, D., Macy, M.: Complex contagions and the weakness of long ties. Am. J. Sociol. 113(3), 702–734 (2007)
7. Chen, J., Lewis, B., et al.: Individual and collective behavior in public health epidemiology. In: Handbook of Statistics, vol. 36, pp. 329–365. Elsevier (2017)
8. Dash, N., Gladwin, H.: Evacuation decision making and behavioral responses: individual and household. Nat. Hazards Rev. 8(3), 69–77 (2007)
9. Dubhashi, D.P., Panconesi, A.: Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, Cambridge (2009)
10. Ferris, T., et al.: Studying the usage of social media and mobile technology during extreme events and their implications for evacuation decisions: a case study of Hurricane Sandy. Int. J. Mass Emerg. Dis. 34(2), 204–230 (2016)
11. Fu, H., Wilmot, C.G.: Sequential logit dynamic travel demand model for hurricane evacuation. Transp. Res. Part B 45, 19–26 (2004)
12. Granovetter, M.: The strength of weak ties. Am. J. Sociol. 78(6), 1360–1380 (1973)
13. Halim, N., Mozumder, P.: Factors influencing evacuation behavior during Hurricane Sandy. Risk Anal. (To be submitted)
14. Hasan, S., Ukkusuri, S.V.: A threshold model of social contagion process for evacuation decision making. Transp. Res. Part B 45, 1590–1605 (2011)
15. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of influence through a social network. In: Proceedings of ACM KDD, pp. 137–146 (2003)
16. Kleinberg, J.: The small-world phenomenon: an algorithmic perspective. Technical report 99-1776 (1999)
17. Kumar, H.: Cyclone Fani hits India: storm lashes coast with hurricane strength. New York Times, May 2019
18. Lindell, M.K., Perry, R.W.: Warning mechanisms in emergency response systems. Int. J. Mass Emergencies Disasters 5(2), 137–153 (2005)
19. Madireddy, M., Tirupatikumara, S., et al.: Leveraging social networks for efficient hurricane evacuation. Transp. Res. Ser. B: Methodol. 77, 199–212 (2015)
20. Meng, S., Mozumder, P.: Hurricane Sandy: damages, disruptions and pathways to recovery. Risk Anal. (Under review)
21. Mortveit, H., Reidys, C.: An Introduction to Sequential Dynamical Systems. Springer, Berlin (2007)
22. Riad, J.K., Norris, F.H., Ruback, R.B.: Predicting evacuation in two major disasters: risk perception, social influence, and access to resources. J. Appl. Soc. Psychol. 20(5), 918–934 (1999)
23. Sengupta, S.: Extreme weather displaced a record 7 million in first half of 2019. New York Times, September 2019
24. Watts, D.: A simple model of global cascades on random networks. PNAS 99, 5766–5771 (2002)
25. Widener, M.J., Horner, M.W., et al.: Simulating the effects of social networks on a population’s hurricane evacuation participation. J. Geogr. Syst. 15, 193–209 (2013)
26. Yang, Y., Mao, L., Metcalf, S.S.: Diffusion of hurricane evacuation behavior through a home-workplace social network: a spatially explicit agent-based simulation model. Comput. Environ. Urban Syst. 74, 13–22 (2019)
Spectral Evolution of Twitter Mention Networks
Miguel Romero, Camilo Rocha, and Jorge Finke
Pontificia Universidad Javeriana Cali, Cali, Colombia
{miguel.romero,camilo.rocha,jfinke}@javerianacali.edu.co
Abstract. This paper applies the spectral evolution model presented in [5] to networks of mentions between Twitter users who identified messages with the most popular political hashtags in Colombia (during the period which concludes the disarmament of the Revolutionary Armed Forces of Colombia). The model characterizes the dynamics of each mention network (i.e., how new edges are established) in terms of the eigen decomposition of its adjacency matrix. It assumes that as new edges are established the eigenvalues change, while the eigenvectors remain constant. The goal of our work is to evaluate various link prediction methods that underlie the spectral evolution model. In particular, we consider prediction methods based on graph kernels and a learning algorithm that tries to estimate the trajectories of the spectrum. Our results show that the learning algorithm tends to outperform the kernel methods at predicting the formation of new edges.
Keywords: Spectral evolution model · Twitter mention networks · Eigen decomposition · Graph kernels
1
Introduction
Social networks have become increasingly relevant for understanding the political issues of a country. On such platforms, users share perceptions and opinions on government and public affairs, creating political conversations that often unveil specific patterns of interaction (e.g., the degree of polarization on a current issue). While some studies focus on identifying which profiles play a key role in shaping user-user interactions [9,10], other studies focus on how the user terms and conditions of social networks influence broad political decisions [3,6]. Not surprisingly, analyzing the patterns that arise from online conversations on social networks has received wide attention [7,8]. Understanding the broad dynamics of user interactions is an important step to evaluate both the formation and political ramifications of stationary patterns. More specifically, characterizing the evolution of user interactions requires the development of models that predict how new edges are established. For example, predicting the formation of new edges is useful to identify whether an influential user retains her status over time or whether a political polarization reflects a dynamic process or a stationary state [2].
© Springer Nature Switzerland AG 2020
H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 532–542, 2020. https://doi.org/10.1007/9783030366872_44
Spectral Evolution Models of Twitter Mention Networks
This paper uses the spectral evolution model presented in [5] to capture the dynamics of user interactions and evaluate which link prediction method best estimates the formation of new edges over time. The spectral evolution model considers that the growth of a network can be captured by its eigen decomposition, under the assumption that its eigenvectors remain constant. If this condition is satisfied, the estimation of the formation of new edges can be expressed as a transformation of the spectrum through the application of real functions (using graph kernels) or through extrapolation methods (using learning algorithms that estimate the spectrum trajectories) [4]. The main contribution of this paper is to apply the spectral evolution model to networks of mentions between Twitter users who identified messages with the most popular political hashtags H. Vertices represent users, and there exists an edge between two users if one user mentions the other using a hashtag h ∈ H. We select the most popular hashtags related to political affairs in Colombia between August 2017 and August 2018, the period which concludes the disarmament of the Revolutionary Armed Forces of Colombia (Farc) and marks the end of the armed conflict. Different prediction methods are compared to identify which one best describes the evolution of each mention network. The remainder of the paper is organized as follows. Section 2 describes the networks used for our analysis. Section 3 presents the spectral evolution model and verifies that the model can be applied to the mention networks. Section 3 also overviews the different link prediction methods that underlie the model. Section 4 presents the results of applying the spectral evolution model with various link prediction methods. Section 5 draws some conclusions and future research directions.
2
Data Description
The dataset consists of 31 mention networks between Twitter users who defined their profile location as Colombia. These networks capture conversations around a set of hashtags H related to popular political topics between August 2017 and August 2018. Users are represented by the set of vertices V. The set of edges is denoted by E; there exists an edge {i, j} ∈ V × V between users i and j, if user i identifies a message with a political hashtag in H (e.g., #safeelections) and mentions user j (via @username). The mention network G = (V, E) is represented as a weighted multigraph without self-loops, which means that it is possible to have multiple edges between two users. Our analysis is based on the largest connected component of G, denoted by Gc = (Vc, Ec). A network is built for each hashtag h ∈ H. Table 1 shows a description of the hashtags and the resulting networks, including the number of vertices and edges (|V| and |E|) for the whole network G, the number of vertices and edges (|Vc| and |Ec|) for its largest component Gc, the community modularity (Q) of Gc, and the number of communities (m) of Gc.
M. Romero et al.
Table 1. Mention networks with political hashtags. English translations for some popular political hashtags appear in parentheses. Columns |V| and |E| refer to the whole network G; columns |Vc|, |Ec|, Q and m refer to its largest component Gc.
 #   Set of hashtags H                             |V|    |E|     |Vc|   |Ec|    Q     m
 0   abortolegalya (legal abortion now)            2235   2202    1282   1538    0.89  30
 1   alianzasporlaseguridad (security alliance)    176    1074    150    351     0.34  7
 2   asiconstruimospaz (how we build peace)        2514   14055   2405   6950    0.56  16
 3   colombialibredefracking (ban fracking)        1606   3483    1476   3127    0.62  19
 4   colombialibredeminas (ban mining)             707    2685    655    1421    0.51  14
 5   dialogosmetropolitanos (city dialogues)       959    18340   932    4134    0.34  10
 6   edutransforma (education transforms)          166    1296    161    404     0.40  9
 7   eleccionesseguras (safe elections)            3035   17922   2634   7969    0.51  20
 8   elquedigauribe (whoever Uribe says)           2375   6933    2052   5272    0.65  20
 9   frutosdelapaz (fruits of peace)               1671   6960    1479   3468    0.58  18
10   garantiasparatodos (assurances for all)       388    814     340    563     0.55  10
11   generosinideologia (no gender ideology)       639    914     615    805     0.63  12
12   hidroituangoescololombia                      1028   3362    883    2252    0.68  15
13   horajudicialur (judicial hour)                2250   23647   2187   6756    0.42  14
14   lafauriecontralor (comptroller Lafaurie)      2154   7082    1999   5309    0.59  14
15   lanochesantrich                               1518   6946    1444   3567    0.45  13
16   lapazavanza (peace advances)                  2949   8288    2775   6569    0.70  18
17   libertadreligiosa (religious liberty)         1584   13443   1395   6856    0.38  15
18   manifestacionpacifica                         211    274     112    151     0.69  9
19   plandemocracia2018 (democracy plan)           3090   20955   2962   7996    0.58  22
20   plenariacm (plenary)                          1504   19866   1460   4782    0.41  15
21   proyectoituango                               1214   3086    1186   1891    0.53  44
22   reformapolitica (political reform)            2714   8385    2608   5928    0.66  18
23   rendiciondecuentas (accountability)           5103   25479   4401   10308   0.84  33
24   rendiciondecuentas2017                        1711   12441   998    2933    0.51  16
25   resocializaciondigna                          503    4054    496    1171    0.46  8
26   salariominimo (minimum wage)                  2494   7041    2079   5016    0.71  22
27   semanaporlapaz (week of peace)                1988   8103    1732   4860    0.69  25
28   serlidersocialnoesdelito                      530    861     439    697     0.67  15
29   vocesdelareconciliacion (reconciliation)      161    1500    158    405     0.34  7
30   votacionesseguras (safe voting)               2748   13307   2439   5338    0.66  24
The modularity and number of communities shown in Table 1 are computed with the multilevel community detection algorithm [1]. Note that Q > 0.3 for all networks in the dataset, i.e., community structure can be observed for all mention networks.
3
Spectral Evolution Model
Let A denote the adjacency matrix of Gc. Furthermore, let A = U Λ U^T denote the eigen decomposition of A, where Λ represents the spectrum of Gc. The
spectral evolution model characterizes the dynamics of Gc (i.e., how new edges are created over time) in terms of the evolution of the spectrum of the network, assuming that its eigenvectors in U remain unchanged [4,5]. In other words, the model assumes that the dynamics of the network involve only small changes in the eigenvectors.
3.1
Spectral Evolution Model Verification
To apply the spectral evolution model, we need to verify the assumption on the evolution of the spectrum and eigenvectors. Every network Gc has a timestamp associated with each edge, representing the time at which the edge was created.
Spectral Evolution. For a given network, the set of edges is split into 40 bins based on their timestamps. Figure 1 illustrates the top 8% of the largest eigenvalues (by absolute value) for two mention networks, namely, #educationtransforms and #howwebuildpeace. In both cases, the eigenvalues grow irregularly, that is, some eigenvalues grow at a higher rate than others. Most of the networks in the dataset show this irregular behavior in spectrum evolution.
(a) #EduTransforma
(b) #AsiConstruimosPaz
Fig. 1. Spectral evolution for mention networks #educationtransforms (left) and #howwebuildpeace (right).
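The binning procedure behind Fig. 1 can be sketched as follows. This is an illustrative sketch that assumes NumPy is available; the function name, the `(i, j, t)` edge format, and the `top` parameter (an absolute count rather than the 8% share used in the text) are assumptions made for the example, and repeated edges simply add weight to the corresponding entry of the adjacency matrix.

```python
import numpy as np

def spectral_evolution(edges, n, bins=40, top=8):
    """edges: list of (i, j, t) with vertex ids in 0..n-1 and timestamp t.
    Returns an array of shape (bins, top) holding, after each bin, the
    eigenvalues of the cumulative adjacency matrix with largest magnitude."""
    edges = sorted(edges, key=lambda e: e[2])
    chunks = np.array_split(np.arange(len(edges)), bins)
    A = np.zeros((n, n))
    history = []
    for chunk in chunks:
        for idx in chunk:
            i, j, _ = edges[idx]
            if i == j:
                continue  # the mention networks have no self-loops
            A[i, j] += 1.0
            A[j, i] += 1.0
        lam = np.linalg.eigvalsh(A)            # A is symmetric
        order = np.argsort(-np.abs(lam))[:top]  # largest by absolute value
        history.append(lam[order])
    return np.array(history)

# Toy usage: a path on four vertices whose edges arrive one at a time.
toy_edges = [(0, 1, 0.0), (1, 2, 1.0), (2, 3, 2.0)]
hist = spectral_evolution(toy_edges, n=4, bins=3, top=2)
print(np.abs(hist))
```

Plotting each column of the returned array against the bin index gives curves of the kind shown in Fig. 1.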
Eigenvector Evolution. At time t, consider the adjacency matrix A(t), with 1 ≤ t ≤ T. The eigenvectors corresponding to the top 8% of the largest eigenvalues (by absolute value) at time t are compared to the eigenvectors at time T = 40. In particular, the cosine similarity is used to compare the eigenvectors U(T)i and U(t)i, for each latent dimension i. Figure 2 shows that some eigenvectors have a similarity close to one during the entire evolution of the network. These are the eigenvectors associated with the largest eigenvalues. Note also that at some time instants the similarity for some eigenvectors drops to zero, which can be explained by eigenvectors swapping positions in the eigen decomposition. To identify such changes we verify the stability of the largest eigenvectors.
Fig. 2. Eigenvector evolution for mention networks #educationtransforms (left) and #howwebuildpeace (right).
Eigenvector Stability. For a given network Gc, let ta and tb be the times when 75% and 100% of all edges have been created. The eigen decompositions of the adjacency matrices are given by A_a = U_a Λ_a U_a^T and A_b = U_b Λ_b U_b^T. Similarity values are computed for every pair of eigenvectors (i, j) using: sim_ij(ta, tb) = (U_a)_i^T · (U_b)_j. The resulting values are plotted as a heatmap, where white cells represent a value of zero and black cells a value of one. The more the heatmap approximates a diagonal matrix, the fewer eigenvector permutations there are, i.e., the eigenvectors are preserved over time. Figure 3a shows subsquares with intermediate values (between zero and one) for the #democracyplan2018 network. These subsquares result from an exchange in the location of eigenvectors that have eigenvalues that are close in magnitude.
(a) Eigenvector stability
(b) Spectral diagonality test
Fig. 3. Eigenvector stability and spectral diagonality test for the #democracyplan2018 network.
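The two checks summarized in Fig. 3 can be sketched with a few lines of NumPy. The code below is a sketch under stated assumptions, not the authors' implementation: the function names are ours, absolute values of the dot products are used because eigenvector signs are arbitrary, and Δ is computed as U_a^T (A_b − A_a) U_a, which follows from A_b = U_a (Λ_a + Δ) U_a^T when the columns of U_a are orthonormal. The "diagonal share" is a hypothetical summary statistic introduced only for this example.

```python
import numpy as np

def top_eigenpairs(A, k):
    # k eigenpairs of a symmetric matrix with largest |eigenvalue|.
    lam, U = np.linalg.eigh(A)
    order = np.argsort(-np.abs(lam))[:k]
    return lam[order], U[:, order]

def eigenvector_similarity(A_a, A_b, k):
    # Entry (i, j) is |(U_a)_i^T (U_b)_j|; a near-identity matrix means the
    # dominant eigenvectors are preserved between the two snapshots.
    _, U_a = top_eigenpairs(A_a, k)
    _, U_b = top_eigenpairs(A_b, k)
    return np.abs(U_a.T @ U_b)

def diagonality(A_a, A_b, k):
    # Least-squares estimate of Delta and the share of its squared mass
    # that lies on the diagonal (1.0 means purely spectral growth).
    _, U_a = top_eigenpairs(A_a, k)
    delta = U_a.T @ (A_b - A_a) @ U_a
    total = np.sum(delta ** 2)
    share = np.sum(np.diag(delta) ** 2) / total if total > 0 else 1.0
    return delta, share

# Synthetic example of purely spectral growth: same eigenvectors, larger eigenvalues.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
A_a = U @ np.diag([5.0, 4.0, 3.0, 2.0, 1.0]) @ U.T
A_b = U @ np.diag([10.0, 7.0, 5.0, 3.0, 1.2]) @ U.T

sim = eigenvector_similarity(A_a, A_b, k=3)
delta, share = diagonality(A_a, A_b, k=3)
print(np.round(sim, 3), round(share, 6))
```

For real mention networks, `A_a` and `A_b` would be the adjacency matrices built from the first 75% and from all of the timestamped edges, and `sim` would be rendered as the heatmap.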
Spectral Diagonality Test. As for the eigenvector stability test, consider the eigen decomposition of the adjacency matrix of Gc at time ta, A_a = U_a Λ_a U_a^T. At time tb > ta the adjacency matrix is expected to become A_b = U_a (Λ_a + Δ) U_a^T, where Δ is a diagonal matrix when the growth of the network is spectral. Using least squares, the matrix Δ can be derived as Δ = U_a^T (A_b − A_a) U_a. If Δ is diagonal, then the growth between ta and tb is spectral. We find that the matrix Δ is almost diagonal for all mention networks. Figure 3b, for example, shows the diagonality test for the #democracyplan2018 network.
3.2
Growth Models
Previous sections have verified that the assumptions underlying the spectral evolution model seem to hold to some extent. Broadly speaking, eigenvalues grow while eigenvectors remain fairly constant over time. Next, we consider network growth as a spectral transformation, i.e., in terms of the eigen decomposition of the adjacency matrix. Let K(A) be a kernel of an adjacency matrix A, whose eigen decomposition is A = U Λ U^T. Graph kernels assume that there exists a real function f(λ) that describes the growth of the spectrum. In particular, K(A) can be written as K(A) = U F(Λ) U^T, for some function F(Λ) that applies a real function f(λ) to the eigenvalues of A. We use the triangle closing kernel, the exponential kernel, and the Neumann growth kernel.
Triangle Closing Kernel. The triangle closing kernel is expressed as A^2 = U Λ U^T U Λ U^T = U Λ^2 U^T, since U^T U = I. This spectral transformation replaces the eigenvalues of A by their squared values. The real function associated with the triangle closing kernel is f(λ) = λ^2.
Exponential Kernel. The exponential of the adjacency matrix A is called the exponential kernel. This kernel denotes the sum of every path between two vertices weighted by the inverse factorial of its length. It is expressed as
$$\exp(\alpha A) = \sum_{k=0}^{\infty} \frac{\alpha^k}{k!} A^k,$$
where α is a constant used to balance the weight of short and long paths. The real function associated with the exponential kernel is f(λ) = e^{αλ}.
M. Romero et al.
Neumann Kernel. The Neumann kernel is expressed as

$$(I - \alpha A)^{-1} = \sum_{k=0}^{\infty} \alpha^k A^k,$$

where $\alpha^{-1} > |\lambda_1|$ and λ_1 is the largest eigenvalue of A. Its real function is given by $f(\lambda) = 1/(1 - \alpha \lambda)$.

Spectral Extrapolation. As noted above, graph kernels assume that there exists a real function f(λ) that describes the growth of the spectrum. However, when the evolution of the spectrum is irregular, as in Fig. 1a, it is not possible to find a simple function that describes network growth. The spectral extrapolation method is a generalization of the graph kernels, which extrapolates each eigenvalue under the assumption that the network follows the spectral evolution model [4]. More specifically, given a network with a timestamped set of edges, the set is split into three subsets named training, target and test sets. Consider two time instants t_a and t_b. Let A_a represent the adjacency matrix of the network at time t_a and A_a + A_b the adjacency matrix at time t_b. The eigendecompositions of the network at the two time instants are given by $A_a = U_a \Lambda_a U_a^T$ and $A_a + A_b = U_b \Lambda_b U_b^T$. Next, let (λ_b)_j be the j-th eigenvalue at time t_b. Its previous value at time t_a is estimated as a diagonalization of A_a by U_b as follows:

$$(\hat{\lambda}_a)_j = \Big( \sum_i (U_a)_i^T (U_b)_j \Big)^{-1} \sum_i (U_a)_i^T (U_b)_j \, (\lambda_a)_i,$$

where (U_a)_i and (λ_a)_i are the eigenvectors and eigenvalues of A_a, respectively, and (U_b)_j is the j-th eigenvector at time t_b. A linear extrapolation is now performed to predict the eigenvalues $(\hat{\lambda}_c)_j$ at a future time t_c,

$$(\hat{\lambda}_c)_j = 2 (\lambda_b)_j - (\hat{\lambda}_a)_j.$$

The predicted matrix $\hat{\Lambda}_c$ is used to compute the predicted edge weights $\hat{A}_c = U_b \hat{\Lambda}_c U_b^T$.
4 Case Study: Twitter Conversations
This section presents the results of applying the proposed kernels (namely, triangle closing, exponential, and Neumann kernels) and the extrapolation method to predict the creation of new edges across the mention networks described in Sect. 2. Curve-fitting methods are applied to find the parameters α of the exponential and Neumann kernels.
[Figure 4: two panels, RMSE and R2, plotted per hashtag for the four methods (ext, tri, exp, neu).]
Fig. 4. Prediction performance of the methods, evaluated with two metrics, RMSE and R2.
To evaluate the performance of the methods, we compute the root mean square error (RMSE) and R2. Figure 4 summarizes the results for both metrics. Note that the performance of the models appears to be very similar for most mention networks. In Sect. 3, we verified that the growth of the eigenvalues is irregular for most networks. It is therefore to some extent expected that the extrapolation method outperforms the graph kernels. Next, we borrow the structural similarity index method (SSIM) from the field of image processing to measure the similarity between the actual and the estimated adjacency matrices. (SSIM is widely applied in the
field of image processing to compare the similarity between two images based on the idea that pixels have strong interdependencies when they are spatially close [11].) Unlike techniques such as RMSE, which rely on the estimation of point-to-point absolute errors, SSIM takes these local interdependencies into account.
[Figure 5: SSIM plotted per hashtag for the four methods (ext, tri, exp, neu).]
Fig. 5. Prediction performance of the methods, evaluated with the SSIM method.
The results are shown in Table 2 and Fig. 5. Figure 5 summarizes the performance of all methods using SSIM. In general, the extrapolation method tends to outperform the other methods. Specifically, for 28 out of 31 networks (about 90% of the total), the extrapolation method provides a distinct, if sometimes slight, improvement. The Neumann kernel and the triangle closing kernel combined provide better estimates for only 3 networks. Whenever the spectral extrapolation method outperforms the graph kernels, the better prediction seems to be explained by the method being able to account for the irregular evolution of the eigenvalues. Note also that the networks considered are large enough that only a small number of eigenvalues and eigenvectors can be computed.
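The tally quoted above can be re-derived from the SSIM values reported in Table 2. The sketch below transcribes those values and counts, for each method, the networks on which it scores highest; the variable and function names are illustrative, and the rounding to one decimal is an assumption about how the percentage was reported.

```python
# SSIM values transcribed from Table 2:
# (hashtag, extrapol, A^2, exp(alpha A), (I - alpha A)^-1).
ROWS = [
    ("abortolegalya", 0.97, 0.98, 0.97, 0.97),
    ("alianzasporlaseguridad", 0.80, 0.73, 0.74, 0.63),
    ("asiconstruimospaz", 0.98, 0.96, 0.96, 0.95),
    ("colombialibredefracking", 0.97, 0.96, 0.96, 0.96),
    ("colombialibredeminas", 0.95, 0.89, 0.90, 0.92),
    ("dialogosmetropolitanos", 0.93, 0.86, 0.66, 0.80),
    ("edutransforma", 0.74, 0.67, 0.70, 0.65),
    ("eleccionesseguras", 0.98, 0.96, 0.96, 0.95),
    ("elquedigauribe", 0.98, 0.96, 0.97, 0.96),
    ("frutosdelapaz", 0.98, 0.95, 0.96, 0.95),
    ("garantiasparatodos", 0.93, 0.90, 0.90, 0.88),
    ("generosinideologia", 0.99, 0.95, 0.96, 0.97),
    ("hidroituangoescolombia", 0.95, 0.93, 0.93, 0.93),
    ("horajudicialur", 0.98, 0.94, 0.90, 0.93),
    ("lafauriecontralor", 0.98, 0.96, 0.96, 0.96),
    ("lanochesantrich", 0.98, 0.94, 0.94, 0.94),
    ("lapazavanza", 0.98, 0.97, 0.97, 0.97),
    ("libertadreligiosa", 0.95, 0.89, 0.89, 0.92),
    ("manifestacionpacifica", 0.83, 0.81, 0.81, 0.88),
    ("plandemocracia2018", 0.98, 0.96, 0.96, 0.97),
    ("plenariacm", 0.97, 0.92, 0.83, 0.90),
    ("proyectoituango", 0.99, 0.96, 0.96, 0.96),
    ("reformapolitica", 0.99, 0.97, 0.97, 0.98),
    ("rendiciondecuentas", 0.99, 0.98, 0.97, 0.98),
    ("rendiciondecuentas2017", 0.97, 0.89, 0.93, 0.89),
    ("resocializaciondigna", 0.95, 0.87, 0.87, 0.81),
    ("salariominimo", 0.99, 0.97, 0.97, 0.98),
    ("semanaporlapaz", 0.96, 0.95, 0.94, 0.95),
    ("serlidersocialnoesdelito", 0.91, 0.91, 0.90, 0.92),
    ("vocesdelareconciliacion", 0.84, 0.74, 0.72, 0.62),
    ("votacionesseguras", 0.98, 0.96, 0.96, 0.96),
]
METHODS = ("extrapol", "A^2", "exp", "neumann")

def best_method(row):
    """Name of the method with the highest SSIM on this network."""
    scores = row[1:]
    return METHODS[scores.index(max(scores))]

wins = {method: 0 for method in METHODS}
for row in ROWS:
    wins[best_method(row)] += 1

share = 100 * wins["extrapol"] / len(ROWS)
print(wins, f"{share:.1f}%")
```

With these values the extrapolation method comes out best on 28 of the 31 networks, the triangle closing kernel on one and the Neumann kernel on two.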
Table 2. Spectral evolution model performance analysis with SSIM.

#   Hashtag                   extrapol  A^2   exp(αA)  (I − αA)^−1  Best kernel or method
0   abortolegalya             0.97      0.98  0.97     0.97         A^2
1   alianzasporlaseguridad    0.80      0.73  0.74     0.63         extrapol.
2   asiconstruimospaz         0.98      0.96  0.96     0.95         extrapol.
3   colombialibredefracking   0.97      0.96  0.96     0.96         extrapol.
4   colombialibredeminas      0.95      0.89  0.90     0.92         extrapol.
5   dialogosmetropolitanos    0.93      0.86  0.66     0.80         extrapol.
6   edutransforma             0.74      0.67  0.70     0.65         extrapol.
7   eleccionesseguras         0.98      0.96  0.96     0.95         extrapol.
8   elquedigauribe            0.98      0.96  0.97     0.96         extrapol.
9   frutosdelapaz             0.98      0.95  0.96     0.95         extrapol.
10  garantiasparatodos        0.93      0.90  0.90     0.88         extrapol.
11  generosinideologia        0.99      0.95  0.96     0.97         extrapol.
12  hidroituangoescolombia    0.95      0.93  0.93     0.93         extrapol.
13  horajudicialur            0.98      0.94  0.90     0.93         extrapol.
14  lafauriecontralor         0.98      0.96  0.96     0.96         extrapol.
15  lanochesantrich           0.98      0.94  0.94     0.94         extrapol.
16  lapazavanza               0.98      0.97  0.97     0.97         extrapol.
17  libertadreligiosa         0.95      0.89  0.89     0.92         extrapol.
18  manifestacionpacifica     0.83      0.81  0.81     0.88         (I − αA)^−1
19  plandemocracia2018        0.98      0.96  0.96     0.97         extrapol.
20  plenariacm                0.97      0.92  0.83     0.90         extrapol.
21  proyectoituango           0.99      0.96  0.96     0.96         extrapol.
22  reformapolitica           0.99      0.97  0.97     0.98         extrapol.
23  rendiciondecuentas        0.99      0.98  0.97     0.98         extrapol.
24  rendiciondecuentas2017    0.97      0.89  0.93     0.89         extrapol.
25  resocializaciondigna      0.95      0.87  0.87     0.81         extrapol.
26  salariominimo             0.99      0.97  0.97     0.98         extrapol.
27  semanaporlapaz            0.96      0.95  0.94     0.95         extrapol.
28  serlidersocialnoesdelito  0.91      0.91  0.90     0.92         (I − αA)^−1
29  vocesdelareconciliacion   0.84      0.74  0.72     0.62         extrapol.
30  votacionesseguras         0.98      0.96  0.96     0.96         extrapol.

5 Conclusions
This paper applies the spectral evolution model to 31 Twitter mention networks. This model characterizes the evolution of each network in terms of the eigen decomposition of its adjacency matrix. It has been veriﬁed that Twitter mention networks follow the spectral evolution model. For most networks, the eigenvectors remain approximately constant, while the spectra of the mention networks grow irregularly. Their evolution can be predicted with the help
of different growth models. Our results show that the extrapolation method outperforms the kernel methods, mainly due to the irregular evolution of the spectra. Developing more refined models that use learning to predict the evolution of the spectra of graphs remains an important direction for future research.

Acknowledgements. This work was funded by the OMICAS program: Optimización Multiescala In-silico de Cultivos Agrícolas Sostenibles (Infraestructura y Validación en Arroz y Caña de Azúcar), sponsored within the Colombian Scientific Ecosystem by the World Bank, Colciencias, Icetex, the Colombian Ministry of Education, and the Colombian Ministry of Industry and Tourism, under GRANT ID: FP448422172018.
References
1. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), P10008 (2008)
2. DiMaggio, P., Evans, J., Bryson, B.: Have Americans' social attitudes become more polarized? Am. J. Sociol. 102(3), 690–755 (1996)
3. Gustafsson, N.: The subtle nature of Facebook politics: Swedish social network site users and political participation. New Media Soc. 14(7), 1111–1127 (2012)
4. Kunegis, J., Fay, D., Bauckhage, C.: Network growth and the spectral evolution model. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, p. 739. ACM Press (2010)
5. Kunegis, J., Fay, D., Bauckhage, C.: Spectral evolution in dynamic networks. Knowl. Inf. Syst. 37(1), 1–36 (2013)
6. Loader, B.D., Mercea, D.: Networking democracy?: Social media innovations and participatory politics. Inf. Commun. Soc. 14(6), 757–769 (2011)
7. McClurg, S.D.: Social networks and political participation: the role of social interaction in explaining political participation. Polit. Res. Q. 56(4), 449–464 (2003)
8. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
9. Noveck, B.S.: Five hacks for digital democracy. Nature 544(7650), 287–289 (2017)
10. Persily, N.: Can democracy survive the Internet? J. Democracy 28(2), 63–76 (2017)
11. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
Network Models
Minimum Entropy Stochastic Block Models Neglect Edge Distribution Heterogeneity
Louis Duvivier1(B), Céline Robardet1, and Rémy Cazabet2
1 Univ Lyon, INSA Lyon, CNRS, LIRIS UMR5205, 69621 Lyon, France {louis.duvivier,celine.robardet}@insalyon.fr
2 Univ Lyon, Université Lyon 1, CNRS, LIRIS UMR5205, 69622 Lyon, France [email protected]
Abstract. The statistical inference of stochastic block models has emerged as a mathematically principled method for identifying communities inside networks. Its objective is to find the node partition and the block-to-block adjacency matrix of maximum likelihood, i.e. the one which has most probably generated the observed network. In practice, in the so-called microcanonical ensemble, it is frequently assumed that when comparing two models which have the same number and sizes of communities, the best one is the one of minimum entropy, i.e. the one which can generate the fewest different networks. In this paper, we show that there are situations in which the minimum entropy model does not identify the most significant communities in terms of edge distribution, even though it generates the observed graph with a higher probability.
Keywords: Network · Community detection · Stochastic block model · Statistical inference · Entropy
© Springer Nature Switzerland AG 2020. H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 545–555, 2020. https://doi.org/10.1007/9783030366872_45

Since the seminal paper by Girvan and Newman [1], a lot of work has been devoted to finding community structure in networks [2]. The objective is to exploit the heterogeneity of connections in a graph to partition its nodes into groups and obtain a coarser description, which may be simpler to analyze. Yet, the absence of a universally accepted formal definition of what a community is has favored the development of diverse methods to partition the nodes of a graph, such as the maximization of the famous modularity function [3] and the statistical inference of a stochastic block model [4]. This second method relies on the hypothesis that there exists an original partition of the nodes, and that the graph under study was generated by picking edges at random with a probability that depends only on the communities to which their endpoints belong. The idea is then to infer the original node partition from the observed edge distribution in the graph. This method has two main advantages with respect to modularity maximization: first, it is able to detect non-assortative connectivity patterns, i.e. groups of nodes that are not necessarily characterized by an internal density higher than the external density, and second, it can be performed in a statistically significant way, while it has been shown that modularity may detect communities even in random graphs [5]. In particular, a Bayesian stochastic blockmodeling approach has been developed in [6], which finds the most likely original partition for an SBM with respect to a graph by simultaneously maximizing the probability to choose this partition and the probability to generate this graph, given the partition. To perform the second maximization, this method assumes that all graphs are generated with the same probability, and it thus searches for a partition of minimal entropy, in the sense that the cardinality of its microcanonical ensemble (i.e. the number of graphs the corresponding SBM can theoretically generate [7]) is minimal, which is equivalent to maximizing its likelihood [8].

In this paper, we show that even when the number and the sizes of the communities are fixed, the node partition which corresponds to the sharpest communities is not always the one with the lowest entropy. We then demonstrate that when community sizes and edge distribution are heterogeneous enough, a node partition which places small communities where there are the most edges will always have a lower entropy. Finally, we illustrate how this issue implies that such heterogeneous stochastic block models cannot be identified correctly by this model selection method, and we discuss the relevance of assuming an equal probability for all graphs in this context.
1 Entropy Based Stochastic Block Model Selection
The stochastic block model is a generative model for random graphs. It takes as parameters a set of nodes V = [1; n] partitioned into p blocks (or communities) $C = (c_i)_{i \in [1;p]}$ and a block-to-block adjacency matrix M whose entries correspond to the number of edges between two blocks. The corresponding set of generable graphs G = (V, E) with weight matrix W is defined as:

$$\Omega_{C,M} = \Big\{ G \;\Big|\; \forall c_1, c_2 \in C,\ \sum_{i \in c_1,\, j \in c_2} W_{(i,j)} = M_{(c_1,c_2)} \Big\}$$

It is called the microcanonical ensemble (a term borrowed from statistical physics [7]) and it can be refined to impose that all graphs are simple or undirected (in which case M must be symmetric), and to allow or forbid self loops. In this paper we will consider multigraphs with self loops, because they allow for simpler computations. Generating a graph with the stochastic block model associated with C, M amounts to drawing at random $G \in \Omega_{C,M}$. The probability distribution P[G|C, M] on this ensemble is defined as the one which maximizes Shannon's entropy

$$S = -\sum_{G \in \Omega_{C,M}} P[G|C,M] \times \ln(P[G|C,M])$$

In the absence of other restrictions, the maximum entropy distribution is the flat one:

$$P[G|C,M] = \frac{1}{|\Omega_{C,M}|}$$

whose entropy equals $S = \ln(|\Omega_{C,M}|)$. It has been computed for different SBM flavours in [8]. It measures the number of different graphs an SBM can generate with a given set of parameters. The lower it is, the higher the probability to generate any specific graph G. On the other hand, a given graph G = (V, E) with weight matrix W may have been generated by many different stochastic block models. For any partition $C = (c_i)_{i \in [1;p]}$ of V, there exists one and only one matrix M such that $G \in \Omega_{C,M}$, and it is defined as:

$$\forall c_1, c_2 \in C,\quad M_{(c_1,c_2)} = \sum_{i \in c_1,\, j \in c_2} W_{(i,j)}$$
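The last definition translates directly into code. The following sketch computes the block-to-block matrix M from a weight matrix W and a partition; the function name `block_matrix`, the 0-based node labels and the small example graph are illustrative assumptions rather than anything specified by the authors.

```python
def block_matrix(W, partition):
    """Sum the weights W[i][j] over every ordered pair of blocks of the partition."""
    p = len(partition)
    M = [[0] * p for _ in range(p)]
    for a, block_a in enumerate(partition):
        for b, block_b in enumerate(partition):
            M[a][b] = sum(W[i][j] for i in block_a for j in block_b)
    return M

# Hypothetical directed multigraph on 4 nodes; W[i][j] counts the edges from i to j.
W = [
    [0, 2, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 3],
    [0, 0, 0, 0],
]

two_blocks = block_matrix(W, [[0, 1], [2, 3]])
singletons = block_matrix(W, [[0], [1], [2], [3]])
print(two_blocks)
print(singletons == W)
```

With one block per node the matrix M coincides with W, so the corresponding model can only reproduce G itself.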
Therefore, when there is no ambiguity about the graph G, we will refer interchangeably to a partition and to the associated SBM in the following. The objective of stochastic block model inference is to find the partition C that best describes G. To do so, Bayesian inference relies on Bayes' theorem, which states that:

$$P[C, M | G] = \frac{P[G|C,M] \times P[C,M]}{P[G]} \tag{1}$$

As P[G] is the same whatever C, it is sufficient to maximize P[G|C, M] × P[C, M]. The naive approach, which consists in using a maximum-entropy uniform prior distribution for P[C, M], reduces the computation to directly maximizing P[G|C] (the so-called likelihood function), but it will always lead to the trivial partition ∀i ∈ V, c_i = {i}, which is of no use because the corresponding SBM reproduces G exactly: M = W and P[G|C] = 1. To overcome this overfitting problem, another prior distribution was proposed in [9], which assigns lower probabilities to the partitions with many communities. Yet, when comparing two models C_1, M_1 and C_2, M_2 with equal prior probability, the one which is chosen is still the one minimizing $|\Omega_{C,M}|$ or, equivalently, the entropy $S = \ln(|\Omega_{C,M}|)$, as the logarithm is a monotonic function.
2 The Issue with Heavily Populated Graph Regions
In this paper, we focus on the consequence of minimizing the entropy to discriminate between node partitions. To do so, we need to work on a domain of partitions on which the prior distribution is uniform. As suggested by [9], we restrict ourselves to finding the best partition when the number p and the sizes $(s_i)_{i \in [1;p]}$ of communities are fixed because, in this case, both P[C] and P[M|C] are constant. This is a problem of node classification, and in this situation the maximization of Eq. 1 boils down to minimizing the entropy of $\Omega_{C,M}$, which can be written as:

$$S = \sum_{i,j \in [1;p]} \ln \binom{s_i s_j + M_{(i,j)} - 1}{M_{(i,j)}}$$
as shown in [8]. Yet, even within this restricted domain (p and $(s_i)_i$ are fixed), the lowest-entropy partition for a given graph G is not always the one which corresponds to the sharpest communities. To illustrate this phenomenon, let us consider the stochastic block models whose matrices M are shown in Fig. 1, and a multigraph $G \in \Omega_{SBM_1} \cap \Omega_{SBM_2}$.
– SBM1 corresponds to $C_1 = \{c_1^a: \{0, 1, 2, 3, 4, 5\},\ c_1^b: \{6, 7, 8\},\ c_1^c: \{9, 10, 11\}\}$
– SBM2 corresponds to $C_2 = \{c_2^a: \{0, 1, 2\},\ c_2^b: \{3, 4, 5\},\ c_2^c: \{6, 7, 8, 9, 10, 11\}\}$
As $G \in \Omega_{SBM_1} \cap \Omega_{SBM_2}$, it could have been generated using SBM1 or SBM2. Yet, the point of inferring a stochastic block model to understand the structure of a graph is that it is supposed to identify groups of nodes (blocks) such that the edge distribution between any two of them is homogeneous and characterized by a specific density. From this point of view, C_1 seems a better partition than C_2:
– The density of edges inside and between $c_2^a$ and $c_2^b$ is the same (10), so there is no justification for dividing $c_1^a$ in two.
– On the other hand, $c_1^b$ and $c_1^c$ have an internal density of 1 and there is no edge between them, so it is logical to separate them rather than merge them into $c_2^c$.
Yet, if we compute the entropy of SBM1 and SBM2:

$$S_1 = \ln \binom{395}{360} + 2 \times \ln \binom{17}{9} = 136$$

$$S_2 = \ln \binom{53}{18} + 4 \times \ln \binom{98}{90} = 135$$

The entropy of SBM2 is lower, and thus partition C_2 will be the one selected. Of course, as $|\Omega_{SBM_2}| < |\Omega_{SBM_1}|$, the probability to generate G with SBM2 is higher than the probability to generate it with SBM1. But this increased probability is not due to a better identification of the edge distribution heterogeneity; it is a mechanical effect of imposing smaller communities on the groups of nodes which contain the most edges, even if their distribution is homogeneous. Doing so reduces the number of possible positions for each edge and thus the number of different graphs the model can generate.
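The two entropies can be re-derived numerically from the binomial form of S. In the sketch below, each block is given as a pair (number of node pairs s_i × s_j, number of edges M_(i,j)), read off the binomial coefficients in the expressions for S1 and S2; blocks without edges contribute ln 1 = 0 and are omitted. The helper names are illustrative.

```python
from math import lgamma

def ln_binom(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def entropy(blocks):
    """Sum of ln C(cells + edges - 1, edges) over (cells, edges) pairs."""
    return sum(ln_binom(cells + edges - 1, edges) for cells, edges in blocks)

# SBM1: one block of 6 x 6 node pairs with 360 edges, two blocks of 3 x 3 with 9 edges.
sbm1 = [(36, 360), (9, 9), (9, 9)]
# SBM2: one block of 6 x 6 node pairs with 18 edges, four blocks of 3 x 3 with 90 edges.
sbm2 = [(36, 18)] + [(9, 90)] * 4

S1, S2 = entropy(sbm1), entropy(sbm2)
print(round(S1, 2), round(S2, 2))
```

Rounded to the nearest integer, the two values match the 136 and 135 given for S1 and S2, so the gap that decides the selection is under one unit of entropy.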
Fig. 1. Block-to-block adjacency matrices of two overlapping stochastic block models. Even though the communities of SBM1 are better defined, SBM2 can generate fewer different graphs and thus generates them with higher probability.
Fig. 2. Block-to-block adjacency matrices of two overlapping stochastic block models with lower densities. Once again, even though SBM3 has better defined communities, SBM4 is more likely a model for graphs $G \in \Omega_{SBM_3} \cap \Omega_{SBM_4}$.
This problem can also occur with smaller densities, as illustrated by the stochastic block models whose block-to-block adjacency matrices are shown in Fig. 2. SBM3, defined as one community of 128 nodes and density 0.6 and 32 communities of 4 nodes and density 0.4, has an entropy of 17851. SBM4, which merges all small communities into one big one and splits the big one into 32 small ones, has an entropy of 16403.
3 The Density Threshold
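More generally, let us consider an SBM (C_1, M_1) with one big community of size s, containing c × m_0 edges, and q small communities of size s/q containing $(m_i)_{i \in [1;q]}$ edges each, as illustrated in Fig. 3. Its entropy is equal to:

$$S_1(c) = \ln \binom{s^2 + c \times m_0 - 1}{c \times m_0} + \sum_{i=1}^{q} \ln \binom{\frac{s^2}{q^2} + m_i - 1}{m_i}$$

On the other hand, the entropy of the SBM (C_2, M_2), which splits the big community into q small ones of size s/q and merges the q small communities into one big one, is:

$$S_2(c) = \ln \binom{s^2 + \sum_{i=1}^{q} m_i - 1}{\sum_{i=1}^{q} m_i} + q^2 \ln \binom{\frac{s^2 + c \times m_0}{q^2} - 1}{\frac{c \times m_0}{q^2}}$$

Fig. 3. Theoretical pair of stochastic block models. The right-side partition splits the big community into q = 3 small ones and merges the small communities into one big one.

So, with $C_1 = \sum_{i=1}^{q} \ln \binom{\frac{s^2}{q^2} + m_i - 1}{m_i}$ and $C_2 = \ln \binom{s^2 + \sum_{i=1}^{q} m_i - 1}{\sum_{i=1}^{q} m_i}$, which are constants with respect to c (and are not to be confused with the partitions of the same name):

$$\begin{aligned}
S_1(c) - S_2(c) &= \ln \binom{s^2 + c \times m_0 - 1}{c \times m_0} - q^2 \ln \binom{\frac{s^2 + c \times m_0}{q^2} - 1}{\frac{c \times m_0}{q^2}} + C_1 - C_2 \\
&= \ln \left[ \prod_{k=1}^{c \times m_0} \frac{k + s^2 - 1}{k} \right] - \ln \left[ \left( \prod_{k=1}^{\frac{c \times m_0}{q^2}} \frac{k + \frac{s^2}{q^2} - 1}{k} \right)^{q^2} \right] + C_1 - C_2 \\
&= \ln \left[ \prod_{k=1}^{\frac{c \times m_0}{q^2}} \frac{\prod_{i=0}^{q^2 - 1} \left( k + s^2 - 1 + i \times \frac{c \times m_0}{q^2} \right)}{\left( k + \frac{s^2}{q^2} - 1 \right)^{q^2}} \right] + C_1 - C_2 \\
&> \ln \left[ \prod_{k=1}^{\frac{c \times m_0}{q^2}} \left( \frac{k + s^2 - 1}{k + \frac{s^2}{q^2} - 1} \right)^{q^2} \right] + C_1 - C_2 \\
&= q^2 \sum_{k=1}^{\frac{c \times m_0}{q^2}} \ln \left( 1 + \frac{(q^2 - 1) s^2}{q^2 k + s^2 - q^2} \right) + C_1 - C_2
\end{aligned} \tag{2}$$

Now, as

$$\ln \left( 1 + \frac{(q^2 - 1) s^2}{q^2 k + s^2 - q^2} \right) \underset{k \to \infty}{\sim} \frac{(q^2 - 1) s^2}{q^2 k + s^2 - q^2}$$

and

$$\sum_{k=1}^{\frac{c \times m_0}{q^2}} \frac{(q^2 - 1) s^2}{q^2 k + s^2 - q^2} \underset{c \to \infty}{\longrightarrow} \infty$$

we have that

$$q^2 \sum_{k=1}^{\frac{c \times m_0}{q^2}} \ln \left( 1 + \frac{(q^2 - 1) s^2}{q^2 k + s^2 - q^2} \right) \underset{c \to \infty}{\longrightarrow} \infty \tag{3}$$

and thus, by injecting Eq. 3 into Eq. 2, $\exists c_0, \forall c > c_0,\ S_2(c) < S_1(c)$. This means that for any such pair of stochastic block models, there exists some density threshold for the big community in C_1 above which (C_2, M_2) will be identified as the most likely model for all graphs $G \in \Omega_{(C_1,M_1)} \cap \Omega_{(C_2,M_2)}$.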
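The existence of such a threshold can be observed numerically by evaluating S1(c) and S2(c) directly from their binomial expressions. The sketch below does so for one hypothetical configuration (s = 12, q = 3, m_0 = 9, and m_i = 16 for every small community); these values are chosen only so that c × m_0 / q² is an integer and are not taken from the paper.

```python
from math import lgamma

def ln_binom(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# Hypothetical configuration: one community of 12 nodes and 3 communities of 4 nodes.
S, Q, M0 = 12, 3, 9
M_SMALL = [16, 16, 16]  # hypothetical edge counts m_i of the small communities

def s1(c):
    """Entropy of (C1, M1): big community kept whole, small ones kept apart."""
    cells_small = (S // Q) ** 2
    return ln_binom(S * S + c * M0 - 1, c * M0) + sum(
        ln_binom(cells_small + m - 1, m) for m in M_SMALL)

def s2(c):
    """Entropy of (C2, M2): big community split into q parts, small ones merged."""
    cells_small = (S // Q) ** 2
    merged = sum(M_SMALL)
    per_block = c * M0 // Q ** 2  # c*m0/q^2 edges in each of the q^2 sub-blocks
    return ln_binom(S * S + merged - 1, merged) + Q ** 2 * ln_binom(
        cells_small + per_block - 1, per_block)

# Smallest multiplier c for which the inverted model has the lower entropy.
threshold = next(c for c in range(1, 5001) if s2(c) < s1(c))
print(threshold, s1(1) - s2(1), s1(5000) - s2(5000))
```

For this configuration the difference S1(c) − S2(c) is negative for small c and positive for large c, in line with the limit argument above; the exact crossing point depends on the chosen sizes and edge counts.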
4 Consequences on Model Selection
In practice, this phenomenon implies that a model selection technique based on the minimization of entropy will not be able to correctly identify some SBMs when they are used as generative models for synthetic graphs. To illustrate this, we generate graphs and try to recover the original partition. The experiment is conducted on two series of stochastic block models, one with relatively large communities and another one with smaller but more sharply defined communities:
– SBM7(d) is made of 5 blocks (1 of 40 nodes, and 4 of 10 nodes). Its density matrix D is given in Fig. 4 (left); one can deduce the block adjacency matrix by $M_{(c_i,c_j)} = |c_i||c_j| \times D_{(c_i,c_j)}$.
– SBM8(d) is made of 11 blocks (1 of 100 nodes, and 10 of 10 nodes). The internal density of the big community is d; it is 0.15 for the small ones and 0.01 between communities.
For each of those two models, and for various internal densities d of the largest community, we generate 1000 random graphs. For each of these graphs, we compute the entropy of the original partition (correct partition) and the entropy of the partition obtained by inverting the big community with the small ones (incorrect partition). Then, we compute the percentage of graphs for which the correct partition has a lower entropy than the incorrect one and plot it against the density d. Results are shown in Figs. 4 and 5.
Fig. 4. Block-to-block adjacency matrix of SBM7(d) (left) and percentage of graphs generated using SBM7(d) for which the original partition has a lower entropy than the inverted one against the density d of the big community (right).
Fig. 5. Percentage of graphs generated using SBM8(d) for which the original partition has a lower entropy than the inverted one against the density d of the big community.

We observe that as soon as d reaches a given density threshold (about 0.08 for SBM7(d) and 0.18 for SBM8(d)), the percentage of correct matches drops quickly to 0. As d rises over 0.25, the correct partition is never the one selected. It should be highlighted that in these experiments we only compared two partitions among the B_n possible ones, so the percentage of correct matches is actually an upper bound on the percentage of graphs for which the correct partition is identified. This means that if SBM7(d) or SBM8(d) is used as a generative model for random graphs, with d > 0.25, and one wants to use Bayesian inference to determine the original partition, it will almost never return the correct one. What is more, the results of Sect. 3 show that this will occur for any SBM of the form described in Fig. 3, as soon as the big community contains enough edges.
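A reduced version of the SBM8(d) experiment can be sketched with the standard library only. The paper does not state how graphs were sampled, so the code below makes several assumptions: edges of each block are placed uniformly over the multigraph ensemble (stars and bars), block pairs are ordered as in the entropy formula of Sect. 2, the inverted partition cuts the big community into ten groups of ten consecutive nodes, and only 20 graphs are drawn per density instead of 1000. All function names are illustrative.

```python
import random
from math import lgamma

def ln_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def block_entropy(cells, edges):
    """ln of the number of ways to place `edges` multi-edges on `cells` node pairs."""
    return ln_binom(cells + edges - 1, edges)

def sample_cells(cells, edges, rng):
    """Uniformly sample a multiset of `edges` node pairs (stars and bars)."""
    stars = sorted(rng.sample(range(cells + edges - 1), edges))
    return [pos - rank for rank, pos in enumerate(stars)]

BIG, SMALL, N_SMALL = 100, 10, 10   # block sizes of SBM8(d)
GROUP = BIG // N_SMALL              # inverted partition: 10 groups of 10 big nodes

def correct_entropy(d):
    m_big = round(BIG * BIG * d)
    return (block_entropy(BIG * BIG, m_big)
            + N_SMALL * block_entropy(SMALL * SMALL, 15)       # 0.15 * 100 edges
            + 2 * N_SMALL * block_entropy(BIG * SMALL, 10)     # 0.01 * 1000 edges
            + N_SMALL * (N_SMALL - 1) * block_entropy(SMALL * SMALL, 1))

def inverted_entropy(d, rng):
    m_big = round(BIG * BIG * d)
    # Edges inside the big community, re-binned into 10 x 10 sub-blocks.
    inside = [[0] * N_SMALL for _ in range(N_SMALL)]
    for cell in sample_cells(BIG * BIG, m_big, rng):
        u, v = divmod(cell, BIG)
        inside[u // GROUP][v // GROUP] += 1
    total = sum(block_entropy(GROUP * GROUP, k) for row in inside for k in row)
    # Edges between the big community and the small ones, in both directions.
    for _direction in range(2):
        per_group = [0] * N_SMALL
        for _small in range(N_SMALL):
            for cell in sample_cells(BIG * SMALL, 10, rng):
                per_group[(cell // SMALL) // GROUP] += 1
        total += sum(block_entropy(GROUP * N_SMALL * SMALL, k) for k in per_group)
    # All edges among the small communities fall into one merged block.
    merged_edges = N_SMALL * 15 + N_SMALL * (N_SMALL - 1) * 1
    return total + block_entropy((N_SMALL * SMALL) ** 2, merged_edges)

def fraction_correct(d, trials=20, seed=0):
    """Share of sampled graphs for which the original partition has the lower entropy."""
    rng = random.Random(seed)
    s_correct = correct_entropy(d)
    return sum(s_correct < inverted_entropy(d, rng) for _ in range(trials)) / trials

for d in (0.05, 0.5):
    print(d, fraction_correct(d))
```

Under these assumptions the original partition wins at a density well below the reported threshold and loses at one well above it. Reproducing the full curve of Fig. 5 would require scanning d and using the authors' actual sampling procedure, which is not described here.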
5 Discussion
We have seen in Sect. 1 that model selection techniques that rely on the maximization of the likelihood function to find the best node partition given an observed graph boil down to the minimization of the entropy of the corresponding ensemble of generable graphs in the microcanonical framework. Even in the case of Bayesian inference, when a non-uniform prior distribution is defined on the set of possible partitions, entropy remains the criterion of choice between equiprobable partitions. Yet, as shown in Sects. 2 and 3, entropy behaves counterintuitively when a large part of the edges is concentrated inside one big community. In this situation, a partition that splits this community into small ones will have a lower entropy, even though the edge density is homogeneous. Furthermore, this happens even when the number and sizes of communities are known. Practically, as explained in Sect. 4, this phenomenon implies that stochastic block models of this form cannot be recovered using model selection techniques based on the mere minimization of the cardinality of the associated microcanonical ensemble. Let us stress that, contrary to the resolution limit described in [10] or [11], the problem is not about being able or not to detect small communities with
no prior knowledge about the graph: it occurs even though the number and sizes of communities are known. It is also different from the phase transition issue that has been investigated in [12–15] for community detection or recovery, because it happens even when communities are dense and perfectly separated. Entropy minimization fails to classify the nodes correctly between communities because it only aims at identifying the SBM that can generate the lowest number of different graphs. A model which enforces more constraints on edge positions will necessarily perform better from this point of view, but this is a form of overfitting, in the sense that the additional constraints on edge placement are not justified by a heterogeneity in the observed edge distribution. The results presented in this paper were obtained for a particular class of stochastic block models. First of all, they were obtained for the multigraph flavour of stochastic block models. As the node classification issue also occurs for densities below 1, they can probably be extended to simple graphs, but this would need to be checked, as would the case of degree-corrected stochastic block models. Furthermore, the reason why the log-likelihood of a stochastic block model C, M for a graph G is equal, up to its sign, to the entropy of $\Omega_{C,M}$ is that we consider the microcanonical ensemble, in which all graphs have an equal probability to be generated. It would be interesting to check whether similar results can be obtained when computing P[G|C, M] in the canonical ensemble [8]. Finally, we assumed that for a graph G and two partitions C_1 and C_2 with the same number and sizes of blocks, the associated block-to-block adjacency matrices M_1 and M_2 have the same probability to be generated, and this assumption too could be questioned. Yet, within this specific class of SBM, our results illustrate a fundamental issue with the stochastic block model statistical inference process.
Since the random variable whose distribution we are trying to infer is the whole graph itself, we are performing statistical inference on a single observation. This is why frequentist inference is impossible, but Bayesian inference also has strong limitations in this context. In particular, the only tool to counterbalance the observation and avoid overfitting is to specify the kind of communities we are looking for through the prior distribution. If it is agnostic about the distribution of edge densities among these communities, the mere minimization of the entropy of the posterior distribution fails to identify the heterogeneity in the edge distribution. Besides refining the prior distribution even further, another approach could be to consider a graph as the aggregated result of a series of edge placements. If the considered random variable is the position of an edge, a single graph observation contains information about many of its realizations, which reduces the risk of overfitting.

Acknowledgments. This work was supported by the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR16IDEX0005, and of the project ANR18CE230004 (BITUNAM) of the French National Research Agency (ANR).
References
1. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Nat. Acad. Sci. 99(12), 7821–7826 (2002)
2. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. 659, 1–44 (2016)
3. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
4. Hastings, M.B.: Community detection as an inference problem. Phys. Rev. E 74(3), 035102 (2006)
5. Guimera, R., Sales-Pardo, M., Amaral, L.A.N.: Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70(2), 025101 (2004)
6. Peixoto, T.P.: Nonparametric Bayesian inference of the microcanonical stochastic block model. Phys. Rev. E 95(1), 012317 (2017)
7. Cimini, G., Squartini, T., Saracco, F., Garlaschelli, D., Gabrielli, A., Caldarelli, G.: The statistical physics of real-world networks. Nat. Rev. Phys. 1(1), 58 (2019)
8. Peixoto, T.P.: Entropy of stochastic blockmodel ensembles. Phys. Rev. E 85(5), 056122 (2012)
9. Peixoto, T.P.: Bayesian stochastic blockmodeling. arXiv preprint. http://arxiv.org/abs/1705.10225 (2017)
10. Fortunato, S., Barthelemy, M.: Resolution limit in community detection. Proc. Nat. Acad. Sci. 104(1), 36–41 (2007)
11. Peixoto, T.P.: Parsimonious module inference in large networks. Phys. Rev. Lett. 110(14), 148701 (2013)
12. Decelle, A., Krzakala, F., Moore, C., Zdeborová, L.: Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107(6), 065701 (2011)
13. Decelle, A., Krzakala, F., Moore, C., Zdeborová, L.: Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84(6), 066106 (2011)
14. Dandan, H., Ronhovde, P., Nussinov, Z.: Phase transitions in random Potts systems and the community detection problem: spin-glass type and dynamic perspectives. Philos. Mag. 92(4), 406–445 (2012)
15. Abbe, E., Sandon, C.: Community detection in general stochastic block models: fundamental limits and efficient algorithms for recovery. In: 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pp. 670–688. IEEE (2015)
Three-Parameter Kinetics of Self-organized Criticality on Twitter
Victor Dmitriev1, Andrey Dmitriev1(&), Svetlana Maltseva1, and Stepan Balybin2
1 National Research University Higher School of Economics, 101000 Moscow, Russia [email protected]
2 Department of Physics, M.V. Lomonosov Moscow State University, 119991 Moscow, Russia
Abstract. A kinetic model is proposed to describe self-organized criticality on Twitter. The model is based on a fractional three-parameter self-organization scheme with stochastic sources. It is shown that the adiabatic regime of self-organization to the critical state is determined by the coordinated action of a relatively small number of network users. The model describes the subcritical, self-organized critical, and supercritical states of Twitter.
Keywords: Self-organized criticality · Social networks · Langevin equation
© Springer Nature Switzerland AG 2020 H. Cherifi et al. (Eds.): COMPLEX NETWORKS 2019, SCI 881, pp. 556–565, 2020. https://doi.org/10.1007/9783030366872_46
1 Introduction
Critical phenomena in complex networks have been considered in many papers (e.g., see the review [1] and references therein). In network science, critical phenomena are commonly understood as significant changes in the integral parameters of the network structure under the influence of external factors [1]. In the thermodynamic theory of irreversible processes, significant structural rearrangements occur when an external parameter reaches a certain critical value, and they have the character of a kinetic phase transition [2]. The critical point is reached as a result of fine tuning of the system's external parameters. In a certain sense, such critical phenomena are not robust.
At the end of the 1980s, Bak, Tang and Wiesenfeld [3, 4] found that there are complex systems with a large number of degrees of freedom that enter a critical regime as a result of their own internal evolutionary trends. The critical state of such systems does not require fine tuning of external control parameters and may occur spontaneously. Thus, the theory of self-organized criticality (SOC) was proposed. Since its emergence, the SOC model has been applied to describe critical phenomena in systems regardless of their nature (e.g., see the review [5] and references therein). Critical phenomena in social networks are no exception (e.g., see the works [6–9]).
The motivation of our investigation is the following. A number of studies (e.g., see the works [7, 9–17]) have established that the observed flows of microposts generated by microblogging social networks (e.g., Twitter) are characterized by avalanche-like behavior. The time series of microposts ($\eta_t$) depicting such streams have a power-law probability distribution:
\[ p(\eta) \propto \eta^{-\alpha} \tag{1} \]
where $\alpha \in (2, 3)$. Despite this, there are no studies on the construction and analysis of macroscopic kinetic models that explain the emergence and spread of avalanches of microposts on Twitter.
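As a rough illustration of the power-law form in Eq. (1), the following minimal sketch draws synthetic samples from a continuous power law and recovers the exponent with the standard maximum-likelihood estimator. It is not the procedure used in the cited studies: the function names, the cutoff `x_min`, the sample size, and the choice of exponent 2.5 are all assumptions made for this example, and real micropost counts are discrete, so a continuous model is only an approximation.

```python
import math
import random


def sample_power_law(alpha, x_min, n, rng):
    """Draw n samples from p(x) ~ x**(-alpha), x >= x_min, by inverse transform."""
    exponent = -1.0 / (alpha - 1.0)
    # 1 - rng.random() lies in (0, 1], so the power below is always finite.
    return [x_min * (1.0 - rng.random()) ** exponent for _ in range(n)]


def estimate_alpha(samples, x_min):
    """Maximum-likelihood estimate of alpha for a continuous power law."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical setup: exponent 2.5 lies inside the interval (2, 3) quoted above.
rng = random.Random(0)
samples = sample_power_law(alpha=2.5, x_min=1.0, n=20000, rng=rng)
alpha_hat = estimate_alpha(samples, x_min=1.0)
print(f"estimated alpha = {alpha_hat:.2f}")
```

With empirical data, the lower cutoff `x_min` would itself have to be chosen or estimated, which this sketch does not attempt.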
2 One of the Possible Mechanisms of Twitter's Self-organizing Transition to a Critical State
Let $N$ be the total number of Twitter users, and let $S \ll N$ be the number of users who follow a certain strategy. We call them strategically oriented users (SOUs). The remaining $N - S$ users do not follow a single coherent strategy and, in this sense, are randomly oriented users (ROUs). Suppose that at each moment in time one SOU enters Twitter, i.e., the social network is an open system. These users act in concert, trying to form microposts in the network that are relevant to a certain topic. Gradually, a subnetwork of SOUs is formed in the social network. ROUs who subscribe to SOUs are also embedded in the emerging hierarchical network structure. As a result, local and predictable micropost flows are formed on Twitter, corresponding to the topic defined by the SOUs. Such behavior of the social network is simple, since the individual local flows of microposts are not interconnected. The hierarchical structure formed by SOUs and ROUs is still not able to produce an avalanche of microposts.
Over time, the number of SOUs reaches a critical value $S_c$. In this state, the network can no longer be pumped by these users. To maintain a steady state, all network users, including ROUs, must follow a certain coordinated strategy in the distribution of microposts. Therefore, in a stationary system of users, global avalanches of microposts arise and spread through the network. This is the SOC state of the social network, formed by the action of a number of SOUs that is small compared with the total number of users. Instead of local flows of microposts, a global avalanche of microposts occurs, which is characteristic of the critical state of the network. The behavior of global avalanches spreading in the self-organized critical network cannot be predicted from the behavior of individual users. In this case, the social network has the property of emergence.
Let $S_c$ be the number of SOUs in the stationary (critical) state of the social network. Relative to the critical state, three qualitatively different states of Twitter can be distinguished:
• $S < S_c$ is the subcritical (SubC) network state;
• $S = S_c$ is the SOC network state;
• $S > S_c$ is the supercritical (SupC) network state.
The SubC state is characterized by a small number of avalanches of microposts, which can be almost neglected. In the SOC state, the avalanche size of microposts grows.
The appearance of such avalanches of microposts follows the power-law probability distribution (see Eq. (1)). In the SupC state, the number of SOUs and, accordingly, the avalanche sizes of microposts continue to grow. This growth is unstable. In response to a further increase in the number of microposts generated by SOUs entering the network, the number of “extra” microposts in the social network increases, reducing $S$ to the critical level.
The value $S_c$ separates the chaotic and the ordered states of the network. Indeed, the almost zero flow of microposts that occurs when $S < S_c$ can be considered the result of many randomly directed flows of microposts that are mutually balanced. When $S > S_c$, disorder gives way to order, which is expressed in the appearance of a dedicated flow direction (avalanche) of microposts that becomes significant at the macro level. Both of these states correspond to noncatastrophic behavior of the social network, since in these states the network is resistant to small perturbations. In the chaotic state, small perturbations fade out quickly in time and space, and in the ordered state, perturbations can no longer have a noticeable effect on the avalanche size of microposts. In the critical state, in which a single added SOU can cause an avalanche of microposts of any size, catastrophes are possible.
As a result of self-organization into a critical state, a social network acquires properties that its elements did not have, demonstrating complex emergent behavior. At the same time, it is important that the self-organizing nature of emergent properties ensures their robustness. The SOC state is robust to possible changes in the social network. For example, if the nature of interactions between users changes, the social network temporarily deviates from the existing critical state, but after a while the critical state is restored in a slightly different form. The hierarchical network structure will change, but its dynamics will remain critical. Whenever Twitter is diverted from the SOC state, the social network invariably returns to it.
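The three regimes distinguished in this section can be summarized in a small sketch that compares the number of SOUs, $S$, with the critical value $S_c$. The function name, the tolerance argument `tol`, and the numerical values are hypothetical and serve only to illustrate the classification; the text itself defines the SOC state by the exact equality $S = S_c$.

```python
def network_state(s, s_c, tol=0.0):
    """Classify the network state by comparing S with the critical value S_c.

    tol is an assumed tolerance (not part of the text) for treating S as
    equal to S_c when working with noisy estimates.
    """
    if abs(s - s_c) <= tol:
        return "SOC"  # self-organized critical state, S = S_c
    return "SubC" if s < s_c else "SupC"  # subcritical or supercritical


# Hypothetical values: a critical number of 100 SOUs.
for s in (50, 100, 150):
    print(s, network_state(s, s_c=100))
```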
3 The Formalism
It is known (e.g., see works [18–20]) that the concept of self-organization is a generalization of the physical concept of critical phenomena, such as phase transitions. Therefore, the phenomenological theory that we propose is a generalization of the theory of thermodynamic transformations for open systems. Twitter's self-organization is possible because the network is open (there are constant incoming and outgoing flows of its users), macroscopic (it includes a large number of users), and dissipative (there are losses in the flows of microposts and associated information).
Based on the synergetic principle of subordination, it can be argued that Twitter's self-organization into a critical state is completely determined by the suppression of the behavior of an infinite number of microscopic degrees of freedom by a small number of macroscopic degrees of freedom. As a result, the collective behavior of the users of the social network is defined by several parameters, or degrees of freedom: an order parameter $\eta_t$, which is the number of microposts relevant to a certain topic that are sent by SOUs and by ROUs unwittingly following their strategies; a conjugate field $h_t$, which is the information associated with the microposts distributed in the network; and a control parameter $S_t$, which is the number of SOUs in the network. On the other hand, in the self-organization of Twitter as a nonequilibrium system, a crucial role should be played by the dissipation of the flows of microposts in the network, which ensures the transition of the network to the stationary state.
In the process of self-organization into a critical state, all three degrees of freedom play an equal role, and the description of the process requires a self-consistent view of their evolution. The restriction to three degrees of freedom is also determined by the Ruelle–Takens theorem, according to which a nontrivial picture of self-organization is observed if the number of selected degrees of freedom is at least three.
The kinetic equations and a detailed physical substantiation of the relations between their parameters are given in our paper [16]. The construction of the three-parameter self-organization scheme was based on the analogy between the mechanisms of functioning of a single-mode laser and of the microblogging social network. The study of possible modifications of the equations leading to models capable of describing critical phenomena on Twitter, in particular the SOC or the SupC states, is outside the scope of this paper. In dimensionless quantities, the kinetic equations have the following form:
\[
\begin{cases}
\dot{\eta}_t = -\eta_t^{\varepsilon} + h_t + \sqrt{I_\eta}\,\xi_t \\
\dfrac{\tau_h}{\tau_\eta}\,\dot{h}_t = -h_t + \eta_t^{\varepsilon} S_t + \sqrt{I_h}\,\xi_t \\
\dfrac{\tau_S}{\tau_\eta}\,\dot{S}_t = (S_0 - S_t) - \eta_t^{\varepsilon} h_t + \sqrt{I_S}\,\xi_t
\end{cases}
\]