X
ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
PREFACE

ACKNOWLEDGEMENTS
NICOLE SAMAY
SARAH MORRISON

INDEX
Introduction
How to Use This Book  1
Acknowledgements  2
Homework
Bibliography
Figure x.0
Networks of Dispossession
In 2013 a popular protest movement shook Turkey, prompting thousands of activists and protesters to decamp at Gezi Park. The protests were accompanied by online campaigns, using Twitter or the WWW to mobilize supporters. A central component of this campaign was Networks of Dispossession, generated by a coalition of artists, lawyers, activists and journalists that mapped the complex financial relationships behind Istanbul's political and business elite. First exhibited at the Istanbul Biennial in 2013, the map reproduced here shows “dispossession” projects as black circles. The size of each circle represents the monetary value of the project. Corporations and media outlets, shown in blue, are directly linked to their projects. Work-related crimes are noted in red and supporters of Turkey's Olympic bid are shown in purple, while the sponsors of the Istanbul Biennial are in turquoise. The map was developed by Yaşar Adanalı, Burak Arıkan, Özgül Şen, Zeyno Üstün, Özlem Zıngıl and anonymous participants using the Graph Commons (http://graphcommons.com/).
This book is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V26, 05.09.2014
SECTION x.1
HOW TO USE THIS BOOK
The perspective offered by networks is indispensable for those who wish to understand today’s interlinked world. This textbook is the best avenue I found to share this perspective, offering anyone the opportunity to become a bit of a network scientist. Many of the choices I made in selecting the topics and in presenting the material were guided by the desire to offer a quantitative yet easy-to-follow introduction to the field. At the same time I tried to pass on the many insights networks offer about the complex systems that surround us. To resolve these often conflicting desires, I paired the technical advances with historical notes and boxes that illustrate the roots and the applications of the key discoveries. This preface has two purposes. On the one hand, by describing the class that motivated this text, it offers some practical tips on how to best use the textbook. Equally important, it acknowledges the long list of individuals who helped move this textbook forward.
ONLINE COMPENDIUM
Network science is rich in content and knowledge that is best appreciated online. Therefore, throughout the chapters we encounter numerous ONLINE RESOURCES that point to pertinent online material: videos, software, interactive tools, datasets and data sources. These resources are available on the http://barabasi.com/NetworkScienceBook website. The website also contains the PowerPoint slides that I used to teach network science, mirroring the content of this textbook. Anyone teaching networks should feel free to use these slides and modify them as they see fit to offer the best classroom experience. There is no need to ask the author for permission to use these slides in educational settings.

Given the empirical roots of network science, the book has a strong emphasis on the analysis of real networks. We have therefore assembled ten network maps that are frequently used in the literature to test various network characteristics. They were chosen to represent the diversity of the networks explored in network science, describing social, biological, technological and informational systems. The Online Compendium offers access to these datasets, which are used throughout the book to illustrate the tools of network science.

Finally, for those teaching the book in different languages, the website also mirrors the ongoing translation projects.

Online Resource x.1
barabasi.com/NetworkScienceBook
The website offers online access to the textbook, the videos, software and interactive tools mentioned in the Online Resources in the chapters, the slides I use to teach network science and the datasets analyzed in the book.

TEACHING NETWORK SCIENCE
I have taught network science in two different settings. The first is a full semester class that attracts graduate and advanced undergraduate students with physics, computer science and engineering backgrounds. The second is a three-week, two-credit class for students with economics and social science backgrounds. The textbook builds on both teaching experiences: In the full semester class I cover the full text, integrating into the lectures the proofs and derivations contained in the Advanced Topics. In the shorter class I only cover the content of the main sections, omitting the Advanced Topics and the chapter on degree correlations. In both settings the key components of the class are the assignments and the research project described next.

Homework Problems
For the longer class we assign as homework a subset of the problems listed at the end of each chapter, testing the technical proficiency of the students with the material and their problem solving ability. Two rounds of homework cover the material as we progress with the class.

Wiki Assignment
We ask each student to select a concept or a term related to network science and write a Wikipedia page on it (Figure x.1). What makes this assignment somewhat challenging is that the topic must not be already covered by Wikipedia, yet must be sufficiently notable to be covered. The Wiki assignment tests the students' ability to synthesize and distill material in an easy-to-understand encyclopedic style, potentially turning them into regular Wikipedia contributors. At the same time the assignment enriches Wikipedia with network science content, offering a service to the whole community. Those teaching network science in other languages should consider contributing to Wikipedia in their native language.

WIKI ASSIGNMENT
1. Select a keyword related to network science and check that it is not already covered in Wikipedia. “Related” is defined widely: you can select a technical concept (degree distribution), a network-related concept (terrorist networks), an application of networks (networks in finance), you can write about a network scientist, or anything else that you can convincingly relate to networks.
2. You are not expected to generate original material. Instead you need to identify 2-5 sources on the subject (research papers, books, etc.) and write a succinct, self-contained encyclopedic style summary with references, graphs, tables, images, photos, as required to best cover the material. Observe Wikipedia’s copyright and notability guidelines.
3. Upload your page on Wikipedia and send us the link. You will need to sign up for an account in Wikipedia, as anonymous editors cannot add new pages. Make sure that the page is not deleted by the Wikipedia administrators, which happens when the concept is not well documented or referenced, or is not written in an encyclopedic style.
4. The grade will reflect how understandable, pertinent, self-contained and accurate the content of your page is.
Figure x.1
Wikipedia Assignment Guidelines

Social Network Analysis
As a warmup to network analysis, students are asked to analyze the social network of the class. This requires a bit of preparation and the help of a teaching assistant. In the very first class the instructor hands out the class list and asks everyone to check that they are on the list or add their name if they are missing. The teaching assistant takes the final list, and during the class prints an accurate class list for each student. At the end of the class each student is asked to mark everyone they knew before coming to the class. To help students match the faces with the names, each student is asked to briefly introduce themselves, also offering a chance for the instructor to learn more about the students in
the class. These lists are then compiled to generate a social network of the class, enriching the nodes with gender and the name of the program the students are engaged in. The anonymized version of this network is returned to the class halfway through the course, the assignment being to analyze its properties using the network science tools the students acquired up to that point. This allows them to explore a relatively small network that they are invested in and understand. The assignment offers a preparation for the more extensive network analysis they will perform for their final research project. This homework is assigned after the hands-on class on software, so that the students are already familiar with the online tools available for network analysis.

PRELIMINARY PROJECT
Present 5 slides in no more than five minutes:
• Introduce your network, discussing its nodes and links.
• Tell us how you will collect the data and estimate the size of the network (N, L). Make sure that N > 100.
• Tell us what questions you are planning to ask. We understand that they may change as you advance with your project and the class.
• Tell us why you care about your network.
Figure x.2
Preliminary Project Guidelines

Final Research Project
The final project is the most rewarding part of the class, offering the students the opportunity to combine and utilize all the knowledge they acquired. Students are asked to select a network of interest to them, map it out and analyze it. Some procedural details enrich this assignment:
(a) The project is carried out in pairs. If the class composition allows, the students are asked to form professionally heterogeneous pairs: undergraduate students are asked to pair up with graduate students, or students from different programs are asked to work together, like a physics student with a biology student. This forces the students to collaborate outside their expertise level and comfort zone, a common ingredient of interdisciplinary research. The instructor does not do the pairing; students are encouraged to find their own partners.
(b) A few weeks into the course one class is devoted to preliminary project presentations. Each group is asked to offer a five-minute presentation with no more than five slides, offering a preview of the dataset they selected (Figure x.2). Students are advised to collect their own data: simply downloading a dataset already prepared for network analysis is not acceptable. Indeed, one of the goals of the project is to experience the choices and compromises one must make in network mapping. Manual mapping is allowed, like looking up the ingredients of recipes in a cookbook or the interaction of characters in a novel or a historical text. Digital mapping is encouraged, like scraping data from a website or a database that is not explicitly organized as a network map, but the students must reinterpret and clean the data to make it amenable for network analysis. For example, one can systematically scrape data from Wikipedia to identify relationships between writers, scientists or concepts.
(c) It is important to always emphasize that the purpose of the final project is to test a student's ability to analyze a network. Consequently students must stay focused on exploring the network aspect of the data, and avoid being carried away by other tempting questions their dataset poses that would take them away from this goal.
(d) The course ends with the final project presentations. Depending on the size of the class, we devote one or two classes to this (Figure x.3).

The choice of the Wikipedia keywords, the partner selection for the research project, and the choice of the topic for the final project require repeated feedback from the instructor, making sure that all students are on track. To achieve this the last ten minutes of each class are devoted to asking everyone: Have you chosen a network that you wish to analyze? What are your nodes and your links? Do you know how to get the data? Do you have a partner for your final project? What is your Wiki word? Did you check if it is already covered by Wikipedia? Did you collect literature pertaining to it? The answers range from "Not yet" to firm or vague ideas the students are entertaining. Providing public feedback about the appropriateness and the feasibility of their plans helps those who are behind to crystallize their ideas, and to identify potential partners with common interests. In a few classes typically everyone finds a partner, identifies a research project and a Wikipedia keyword, at which point this end-of-class ritual ends.

Software
We devote one class to various network analysis and visualization software, like Gephi, Cytoscape, or NetworkX. In the longer class we devote another one to other numerical procedures, like fitting, log-binning or network visualization. We ask students to bring their laptops to these classes, so that they can try out these tools immediately.

Movie Night
We devote one night, typically outside the class time, to a movie night, where we screen the documentary Connected by Annamaria Talas. The one-hour documentary features many contributors to network science, and offers a compelling narrative of the field's importance. Movie Night is advertised university wide, offering a chance to reach out to a wider community.

Guest Speakers
In the full semester class we invite researchers from the area to give research seminars about their work pertaining to networks. This offers the students a sense of what cutting edge research looks like in this area. This is typically (but not always) done towards the end of the class, by which point most theoretical tools are covered and the students are focusing on their final project. Such talks, advertised and open to the local research community, often inspire additional perspectives and ideas for the final project. To aid the planning of the class, Figure x.4 offers the schedule of the full semester class I co-taught before this book went to print.

FINAL PROJECT
Each group has 10 minutes to present their final project. Time limit is strictly enforced.
On the first slide, give your title, name and program.
Tell us about your data and the data collection method. Show an entry of the data source to offer a sense of where you started from.
Measure: N, L, and their time dependence if you have a time dependent network; degree distribution, average path length, clustering coefficient, C(k), the weight distribution P(w) if you have a weighted network. Visualize communities; discuss network robustness and spreading, degree correlations, whichever is appropriate for your project.
It is not sufficient to simply measure things; you need to discuss the insights you gained, always asking:
• What was your expectation?
• What is the proper random reference?
• How do the results compare to your expectation?
• What did you learn from each quantity?
Grading criteria:
• Use of network tools (completeness/correctness);
• Ability to extract information/insights from your data using the network tools;
• Overall quality of the project/presentation.
No need to write a report; email us the presentation as a pdf file.
Figure x.3
Final Project Guidelines
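The basic measurements requested in the final project guidelines (N, L, degree distribution, average path length, clustering coefficient) can be sketched in a few lines of plain Python. This is a minimal illustration, not the procedure used in the class: the function names and the five-node toy edge list are hypothetical, and the network is assumed to be undirected, unweighted and free of self-loops. In practice a package such as NetworkX would do the same job on real data.

```python
from collections import Counter, deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Return {k: p_k}, the fraction of nodes that have degree k."""
    counts = Counter(len(neighbors) for neighbors in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}


def bfs_distances(adj, source):
    """Shortest-path lengths (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all connected pairs of distinct nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def average_clustering(adj):
    """Average local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


# Hypothetical toy network: a triangle (0, 1, 2) with a two-link tail (2-3-4).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
adj = build_adjacency(edges)
N = len(adj)
L = sum(len(neighbors) for neighbors in adj.values()) // 2
print("N =", N, "L =", L)
print("degree distribution:", degree_distribution(adj))
print("average path length:", average_path_length(adj))
print("average clustering:", average_clustering(adj))
```

Comparing each of these numbers with the value expected for a suitable random reference is what turns the measurement into an insight, as the guidelines request.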
GRADE DISTRIBUTION
(1) Assignment 1 (Homework 1): 15%
(2) Assignment 2 (Homework 2): 15%
(3) Assignment 3 (Class Network): 15%
(4) Assignment 4 (Wikipedia): 15%
(5) Preliminary Project Presentation: No grading, only feedback.
(6) Final Project: 40%
Figure x.4
Grading
The grading system used in the one-semester class.

COMPLEX NETWORKS: SYLLABUS
Week 1
• Class 1 Ch. 1: Introduction
• Class 2 Ch. 2: Graph Theory
Week 2
• Class 1 Ch. 3: Random Networks
• Class 2 Ch. 3: Random Networks
Week 3
• Class 1 Ch. 4: The Scale-Free Property
• Class 2 Ch. 4: The Scale-Free Property
Hand out Assignment 1 (Problems for Chapters 1-5)
Week 4
• Class 1 Ch. 5: The Barabási-Albert model
• Class 2 Ch. 5: The Barabási-Albert model
Week 5
• Class 1 Preliminary Project Presentations
• Class 2 Hands-on Class: Graph representation, binning, fitting
Week 6
• Class 1 Hands-on Class: Gephi and Python
Collect Assignment 1; Hand out Assignment 2: Class Network Analysis
• Class 2 Guest Speaker
Week 7
• Class 1 Ch. 6: Evolving Networks
• Class 2 Ch. 6: Evolving Networks
Week 8
• Class 1 Guest Speaker
Collect Assignment 2
• Class 2 Ch. 7: Degree Correlations
Hand out Assignment 3 (Problems for Chapters 6-10)
Week 9
• Class 1 Ch. 8: Network Robustness
Hand out Assignment 4: Wikipedia Page
• Class 2 Ch. 8: Network Robustness
Week 10
• Class 1 Ch. 9: Communities
• Class 2 Ch. 9: Communities
Movie Night: Connected, by Annamaria Talas
Week 11
• Class 1 Ch. 10: Spreading Phenomena
• Class 2 Ch. 10: Spreading Phenomena
Week 12
• Class 1 Guest Speaker
• Class 2 Ch. 10: Spreading Phenomena
Collect Assignment 4
Week 13
• Class 1 Guest Speaker
• Class 2 Open-Door class (Research Project Discussions)
Collect Assignment 3
Week 14
• Exam Week
Final Project Presentations (10 min per group)
Figure x.5
The Syllabus
The week-by-week schedule of the four-credit network science class that meets twice a week.
SECTION x.2
ACKNOWLEDGEMENTS
Writing a book, any book, is an exercise in lonely endurance. This project was no different, dominating all my free time between 2011 and 2015. It was mostly time spent alone, working in one of the many coffeehouses I frequent in Boston and Budapest, or wherever in the world the morning found me. Despite this the book is far from being a lonely achievement: During these four years a number of individuals have donated their time and expertise to help move forward the project, offering me the opportunity to discuss the subject with colleagues, friends and lab members. I also shared the chapters on the internet for everyone to use, receiving valuable feedback from many individuals. In this section I wish to acknowledge the professional network that stepped in to help at various stages of this long journey.
Figure x.6
The Math Team
Márton Pósfai was responsible for the calculations, simulations and measurements in the textbook.
FORMULAS, GRAPHS, SIMULATIONS
A textbook must ensure that everything works as promised: that one can derive the key formulas, and that the measures described in the text, when applied to real data, work as the theory predicts. There is only one way to achieve this: One must check and repeat each calculation, measurement and simulation. This was a heroic job, most of it done by Márton Pósfai, who joined the project when he was a visiting student in my lab in Boston and stayed with it throughout his PhD work in Budapest, Hungary. He checked all derivations, helped re-derive key formulas where needed, performed all the simulations and measurements and prepared the book’s figures and tables. Many figures and tables amounted to small research projects, their outcome forcing us to de-emphasize some quantities because they did not work as promised, or helping us appreciate and understand the importance of others. His deep understanding of the network science literature and his careful work offered many subtle insights that enriched the book. There is no way I could have achieved this depth and reliability were it not for Márton’s tireless dedication to the project.
THE DESIGN
The ambition to create a book that had a clear aesthetic and visual appeal was planted by Mauro Martino, a data visualization expert in my lab. He created the first face of the chapters and many visual elements designed by him stayed with us until the end. After Mauro moved on to lead a team of designers at IBM Research, Gabriele Musella took over the design. He standardized the color palette and designed the basic elements of the infographics appearing throughout the book, also redrawing most images. He worked with us until the fall of 2014, when he too had to return to London to take up his dream job. At that time the design was taken over by Nicole Samay, who tirelessly and gently retouched the whole book as we neared the finish line. The website for the book was designed by Kim Albrecht, who currently collaborates with Mauro to design the online experience that trails the book.

An important component of the visual design is the set of images included at the beginning of each chapter, illustrating the interplay between networks and art. In selecting these images I have benefited from advice and discussions with several artists and designers, academics and practicing artists alike. Many thanks to Isabel Meirelles and Dietmar Offenhuber from the Art and Design Department at Northeastern, Mathew Ritchie from Columbia University, and Meredith Tromble from the San Francisco Art Institute, for helping me navigate the boundaries of art, data and network science.

Figure x.5
The Design Team
Mauro Martino, Gabriele Musella and Nicole Samay have developed the look and feel of the chapters and the figures, offering the book an elegant and consistent style.
THE DAILY DRILL: TYPING, EDITING
I remain an old-fashioned writer, who writes with a pencil rather than a computer. I am lost, therefore, without editors and typists, who integrate my handwritten notes, corrections and recommendations into each chapter. Sabrina Rabello and Galen Wilkerson have helped get this project started. Yet, the bulk of editing fell on the shoulders of three individuals. Payam Parsinejad worked with me during the first year of the project. After he had to refocus on his research, Amal Al-Husseini, a former student from my network science class, joined us, and stayed until the very end. Equally defining was the help of Sarah Morrison, my former assistant, who joined the project after she moved to Lucca, Italy. Her timely and accurate editing was essential to finish the book.

Figure x.6
The Editorial Team
Payam Parsinejad, Amal Al-Husseini and Sarah Morrison have worked daily on the book, editing and correcting it.

Each chapter, before it was released on our webpage, has undergone a final check by Philipp Hoevel, who joined the project while visiting my lab, and continued to work with us even after he returned to Berlin to run his own lab. Philipp methodically reviewed everything, from the science to notations, becoming our first reader and final filter. Brett Common has worked tirelessly to secure all the permissions for the visual materials used throughout the textbook. This was a major project on its own, whose magnitude and difficulty were hard to anticipate.

Figure x.7
Accuracy and Rights
Philipp Hoevel acted as our first reader and last editor. The rights were obtained and managed by Brett Common.

HOMEWORK
The homework problems at the end of each chapter were conceived and curated by Roberta Sinatra. As a research faculty member affiliated with my lab, Roberta co-taught the network science class with me in the fall of 2014, also helping catch and correct many typos and misunderstandings that surfaced while teaching the material.
SCIENCE INPUT
Throughout the project I have received comments, recommendations, advice, clarifications, and key materials from numerous scientists and students. It is impossible to recall them all, but I will try. Chaoming Song helped estimate the degree exponent of scale-free networks and helped me uncover the literature pertaining to cascading failures. The mathematician Endre Csóka helped clarify the subtle details of the Bollobás model. I have benefited from a great discussion with Raissa D’Souza on optimization models, with Ginestra Bianconi on the fitness model, and with Erzsébet Ravasz Reagan on the Ravasz algorithm. Alex Vespignani was a great resource on spreading processes and degree correlations. Marian Boguña snapped the picture of the Karate Club Trophy. Huawei Shen calculated the future citations of research papers. Gergely Palla and Tamás Vicsek helped me understand the CFinder algorithm and Martin Rosvall pointed us to some key material on the InfoMap algorithm. Gergely Palla, Sune Lehmann and Santo Fortunato offered critical comments on the community detection chapter. Yong-Yeol Ahn helped me develop the early version of the material on spreading phenomena. Ramis Movassagh, Hiroki Sayama and Sid Redner have provided careful feedback on several chapters, and Kate Coronges has helped improve the clarity of the first four chapters.

Figure x.8
Homework
Roberta Sinatra has conceived and compiled the homework after each chapter in the textbook.
PUBLISHING
Simon Capelin, my longtime editor at Cambridge University Press, has been encouraging this project even before I was ready to write it. He also had the patience to see the book to its completion, through many missed deadlines. Róisín Munnelly has helped move the book through production within Cambridge.

INSTITUTIONS
This book would not have been possible had several institutions not offered inspiring environments and a supporting infrastructure. First and foremost I need to thank the leadership of Northeastern University, from its President, Joseph Aoun, and its Provost, Steve Director, to my deans, Murray Gibson and Larry Finkelstein, and my department chair, Paul Champion, who were true champions of network science, turning it into a major cross-disciplinary topic within Northeastern. Their relentless support has led to the hiring of several superb faculty focusing on networks, spanning all domains of inquiry, from physics and mathematics to social, political, computer and health sciences, turning Northeastern into the leading institution in this area. They have also urged and supported the creation of a network science PhD program and helped found the Network Science Institute led by Alessandro Vespignani.

My appointment at Harvard Medical School, through the Network Medicine Division at Brigham and Women's Hospital and the Center for Cancer Systems Biology at the Dana-Farber Cancer Institute, offered a window on the applications of network science in cell biology and medicine. Many thanks to Marc Vidal from DFCI and Joe Loscalzo from Brigham, who, as colleagues and mentors, have defined my work in this area, an experience that found its way into this book as well.

My visiting appointment at Central European University, and the network science class I teach there in the summer, have exposed me to a student body with economics and social science backgrounds, an experience that has shaped this textbook. Balázs Vedres had the vision to bring network science to CEU, George Soros convinced me to get involved with the university and President John Shattuck and Provosts Farkas Katalin and Liviu Matei, with their relentless support, have smoothed the path toward CEU's superb program in this area, giving birth to CEU's PhD program in network science.

Finally, thanks to the place where it all began: When I was a young assistant professor, the University of Notre Dame offered me the support and the serene environment to think about something different. And big thanks to Suzanne Aleva, who followed my lab from Notre Dame to Northeastern, and worked tirelessly for over a decade to foster an environment where I can focus, uninterrupted, on science.
1
ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
INTRODUCTION

ACKNOWLEDGEMENTS
MÁRTON PÓSFAI
GABRIELE MUSELLA
MAURO MARTINO
ROBERTA SINATRA
PHILIPP HOEVEL
SARAH MORRISON
AMAL HUSSEINI

INDEX
Vulnerability Due to Interconnectivity  1
Networks at the Heart of Complex Systems  2
Two Forces Helped the Emergence of Network Science  3
The Characteristics of Network Science  4
Societal Impact  5
Scientific Impact  6
Summary  7
Homework  8
Bibliography  9

Figure 1.0 (front cover)
Mark Lombardi: Global International Airway and Indian Spring State Bank
Mark Lombardi (1951-2000) was an American artist who documented “the uses and abuses of power.” His work was preceded by careful research, resulting in thousands of index cards, whose number began to overwhelm his ability to deal with them. Hence Lombardi began assembling them into hand-drawn diagrams, intended to focus his work. Eventually these diagrams became a form of art on their own [1]. The image shows one such drawing, created between 1977 and 1983 in colored pencil and graphite on paper.
This work is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V26, 03.09.2014
SECTION 1.1
VULNERABILITY DUE TO INTERCONNECTIVITY
At first glance the two satellite images of Figure 1.1 are indistinguishable, showing lights shining brightly in highly populated areas and dark spaces that mark vast uninhabited forests and oceans. Yet, upon closer inspection we notice differences: Toronto, Detroit, Cleveland, Columbus and Long Island, bright and shining in (a), have gone dark in (b). This is not a doctored shot from the next Armageddon movie but represents a real image of the US Northeast on August 14, 2003, before and after the blackout that left without power an estimated 45 million people in eight US states and another 10 million in Ontario.

Figure 1.1
2003 North American Blackout
(a) Satellite image of the Northeast United States on August 13th, 2003, at 9:29pm (EDT), 20 hours before the 2003 blackout.
(b) The same as above, but 5 hours after the blackout.

The 2003 blackout is a typical example of a cascading failure. When a network acts as a transportation system, a local failure shifts loads to other nodes. If the extra load is negligible, the system can seamlessly absorb it, and the failure goes unnoticed. If, however, the extra load is too much for the neighboring nodes, they too will tip and redistribute the load to their neighbors. In no time, we are faced with a cascading event, whose magnitude depends on the position and the capacity of the nodes that failed initially.

Cascading failures have been observed in many complex systems. They take place on the Internet, when traffic is rerouted to bypass malfunctioning routers. This routine operation can occasionally create denial-of-service attacks, which make fully functional routers unavailable by overwhelming them with traffic. We witness cascading events in financial systems, like in 1997, when the International Monetary Fund pressured the central banks of several Pacific nations to limit their credit, which defaulted multiple corporations, eventually resulting in stock market crashes worldwide. The 2009-2011 financial meltdown is often seen as a classic example of a cascading failure, the US credit crisis paralyzing the economy of the globe, leaving behind scores of failed banks, corporations, and even bankrupt states. Cascading failures can also be induced artificially. An example is the worldwide effort to dry up the money supply of terrorist organizations, aimed at crippling their ability to function. Similarly, cancer researchers aim to induce cascading failures in our cells to kill cancer cells.
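The load-redistribution mechanism described above can be illustrated with a toy simulation. The sketch below is a deliberately simplified model and not a description of how a real power grid behaves: the function `simulate_cascade`, the rule that a failed node splits its load equally among its surviving neighbors, and the four-node chain used as an example are all assumptions made for illustration.

```python
from collections import deque


def simulate_cascade(adj, load, capacity, initial_failure):
    """Return the set of nodes that have failed once the cascade stops.

    Assumed toy rule: a failed node hands its load, in equal shares, to the
    neighbors that are still working; a neighbor fails as soon as its load
    exceeds its capacity.
    """
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        node = queue.popleft()
        alive = [v for v in adj[node] if v not in failed]
        if not alive:
            continue  # nobody left to absorb this node's load
        share = load[node] / len(alive)
        load[node] = 0.0
        for v in alive:
            load[v] += share
        for v in alive:
            if load[v] > capacity[v]:
                failed.add(v)
                queue.append(v)
    return failed


# Hypothetical example: four nodes in a chain a - b - c - d, unit load each.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
load = {node: 1.0 for node in adj}

generous = {node: 3.0 for node in adj}  # ample spare capacity
tight = {node: 1.5 for node in adj}     # little spare capacity

print(sorted(simulate_cascade(adj, load, generous, "a")))
print(sorted(simulate_cascade(adj, load, tight, "a")))
```

With ample capacity the failure of node a is absorbed by its neighbor and stays local. With tight capacity each failure overloads the next node in the chain, so the same initial failure takes down the whole network, which is the qualitative behavior the text attributes to cascading events.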
The Northeast blackout illustrates several important themes of this book: First, to avoid damaging cascades, we must understand the structure of the network on which the cascade propagates. Second, we must be able to model the dynamical processes taking place on these networks, like the flow of electricity. Finally, we need to uncover how the interplay between the network structure and dynamics affects the robustness of the whole system. Although cascading failures may appear random and unpredictable, they follow reproducible laws that can be quantified and even predicted using the tools of network science.

The blackout also illustrates a bigger theme: vulnerability due to interconnectivity. Indeed, in the early years of electric power each city had its own generators and electric network. Electricity cannot be stored, however: Once produced, electricity must be immediately consumed. It made economic sense, therefore, to link neighboring cities up, allowing them to share the extra production and borrow electricity if needed. We owe the low price of electricity today to the power grid, the network that emerged through these pairwise connections, linking all producers and consumers into a single network. It allows cheaply produced power to be instantly transported anywhere. Electricity hence offers a wonderful example of the huge positive impact networks have on our life.

Being part of a network has its catch, however: local failures, like the breaking of a fuse somewhere in Ohio, may not stay local any longer. Their impact can travel along the network’s links and affect other nodes, consumers and individuals apparently removed from the original problem. In general, interconnectivity induces a remarkable non-locality: It allows information, memes, business practices, power, energy, and viruses to spread on their respective social or technological networks, reaching us, no matter our distance from the source. Hence networks carry both benefits and vulnerabilities.
Uncovering the factors that can enhance the spread of traits deemed positive, and limit others that make networks weak or vulnerable, is one of the goals of this book.
SECTION 1.2
NETWORKS AT THE HEART OF COMPLEX SYSTEMS
“I think the next century will be the century of complexity.”
Stephen Hawking

BOX 1.1
COMPLEX
[adj., v. kuhm-pleks, kom-pleks; n. kom-pleks]
1) composed of many interconnected parts; compound; composite: a complex highway system
2) characterized by a very complicated or involved arrangement of parts, units, etc.: complex machinery
3) so complicated or intricate as to be hard to understand or deal with: a complex problem
Source: Dictionary.com

We are surrounded by systems that are hopelessly complicated. Consider for example the society that requires cooperation between billions of individuals, or communications infrastructures that integrate billions of cell phones with computers and satellites. Our ability to reason and comprehend our world requires the coherent activity of billions of neurons in our brain. Our biological existence is rooted in seamless interactions between thousands of genes and metabolites within our cells.
These systems are collectively called complex systems, capturing the fact that it is difficult to derive their collective behavior from a knowledge of the system’s components. Given the important role complex systems play in our daily life, in science and in economy, their understanding, mathematical description, prediction, and eventually control is one of the major intellectual and scientific challenges of the 21st century.
The emergence of network science at the dawn of the 21st century is a vivid demonstration that science can live up to this challenge. Indeed, behind each complex system there is an intricate network that encodes the interactions between the system’s components:
(a) The network encoding the interactions between genes, proteins, and metabolites integrates these components into live cells. The very existence of this cellular network is a prerequisite of life.
(b) The wiring diagram capturing the connections between neurons, called the neural network, holds the key to our understanding of how the brain functions and to our consciousness.
(c) The sum of all professional, friendship, and family ties, often called the social network, is the fabric of the society and determines the spread of knowledge, behavior and resources.
(d) Communication networks, describing which communication devices interact with each other, through wired internet connections or wireless links, are at the heart of the modern communication system.
(e) The power grid, a network of generators and transmission lines, supplies virtually all modern technology with energy.
(f) Trade networks maintain our ability to exchange goods and services, being responsible for the material prosperity that the world has enjoyed since WWII (Figure 1.2).

Figure 1.2
Subtle Networks Behind the Economy
A credit card selected as the 99th object in The History of the World in 100 Objects exhibit by the British Museum. This card is a vivid demonstration of the highly interconnected nature of the modern economy, relying on subtle economic and social connections that normally go unnoticed. The card was issued in the United Arab Emirates in 2009 by the Hong Kong and Shanghai Banking Corporation, known as HSBC, a London-based bank. The card functions through protocols provided by VISA, a USA-based credit association. Yet, the card adheres to Islamic banking principles, which operate in accordance with Fiqh al-Muamalat (Islamic rules of transactions), most notably eliminating interest or riba. The card is not limited to Muslims in the United Arab Emirates, but is offered in non-Muslim countries as well, to anyone who agrees with its strict ethical guidelines.

Networks are also at the heart of some of the most revolutionary technologies of the 21st century, empowering everything from Google to Facebook, CISCO, and Twitter. In the end, networks permeate science, technology, business and nature to a much higher degree than may be evident upon a casual inspection. Consequently, we will never understand complex systems unless we develop a deep understanding of the networks behind them.
The exploding interest in network science during the first decade of the 21st century is rooted in the discovery that despite the obvious diversity of complex systems, the structure and the evolution of the networks behind each system is driven by a common set of fundamental laws and principles. Therefore, notwithstanding the amazing differences in form, size, nature, age, and scope of real networks, most networks are driven by common organizing principles. Once we disregard the nature of the components and the precise nature of the interactions between them, the obtained networks are more similar than different from each other.
In the following sections we discuss the forces that have led to the emergence of this new research field and its impact on science, technology, and society.
SECTION 1.3
TWO FORCES THAT HELPED NETWORK SCIENCE
Network science is a new discipline. One may debate its precise beginning, but by all accounts the field has emerged as a separate discipline only in the 21st century.
Why didn’t we have network science two hundred years earlier? After all many of the networks that the field explores are by no means new: metabolic networks date back to the origins of life, with a history of four billion years, and the social network is as old as humanity. Furthermore, many disciplines, from biochemistry to sociology and brain science, have been dealing with their own networks for decades. Graph theory, a prolific subfield of mathematics, has explored graphs since 1735. Is there a reason, therefore, to call network science the science of the 21st century?
Something special happened at the dawn of the 21st century that transcended individual research fields and catalyzed the emergence of a new discipline (Figure 1.3). To understand why this happened now and not two hundred years earlier, we need to discuss the two forces that have contributed to the emergence of network science.

Figure 1.3
The Emergence of Network Science
[Plot: yearly citations (vertical axis, 0 to 600) versus year (1960 to 2008) for two papers, Erdős-Rényi (1959) and Granovetter (1973).]
While the study of networks has a long history, with roots in graph theory and sociology, the modern chapter of network science emerged only during the first decade of the 21st century. The explosive interest in networks is well documented by the citation pattern of two classic papers, the 1959 paper by Paul Erdős and Alfréd Rényi that marks the beginning of the study of random networks in graph theory [2] and the 1973 paper by Mark Granovetter, the most cited social network paper [3]. The figure shows the yearly citations each paper acquired since their publication. Both papers were highly regarded within their discipline, but had only limited impact outside their field. The explosive growth of citations to these papers in the 21st century is a consequence of the emergence of network science, drawing a new, interdisciplinary attention to these classic publications.

THE EMERGENCE OF NETWORK MAPS
To describe the detailed behavior of a system consisting of hundreds to billions of interacting components, we need a map of the system’s wiring diagram. In a social system this would require an accurate list of your friends, your friends’ friends, and so on. In the WWW this map tells us which webpages link to each other. In the cell the map corresponds to a detailed list of binding interactions and chemical reactions involving genes, proteins, and metabolites. In the past, we lacked the tools to map these networks. It was equally difficult to keep track of the huge amount of data behind them. The Internet revolution, offering effective and fast data sharing methods and cheap digital storage, fundamentally changed our ability to collect, assemble, share, and analyze data pertaining to real networks.
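As a minimal sketch of what such a map can look like in practice, the Python snippet below stores a tiny, made-up piece of the WWW as a list of links and turns it into an adjacency list, one common way to hold a wiring diagram in memory. The page names and the helper `build_adjacency` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_adjacency(links):
    """Turn a list of (source, target) links into an adjacency list."""
    adjacency = defaultdict(set)
    for source, target in links:
        adjacency[source].add(target)
        adjacency[target]  # touch the key so pages without outgoing links appear too
    return {node: sorted(neighbors) for node, neighbors in adjacency.items()}

# Hypothetical toy map: each pair means "the first page links to the second".
links = [
    ("home.html", "about.html"),
    ("home.html", "news.html"),
    ("news.html", "home.html"),
]
web_map = build_adjacency(links)
```

The same structure would serve for a social or cellular map; only the meaning of the nodes and links changes.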
Thanks to these technological advances, at the turn of the millennium we witnessed an explosion of map making (BOX 1.2). Examples range from the CAIDA or DIMES projects that offered the first large-scale maps of the Internet, to the hundreds of millions of dollars spent by biologists to experimentally map out protein-protein interactions in human cells, to the efforts made by social network companies, like Facebook, Twitter, or LinkedIn, to develop accurate depositories of our friendships and professional ties, and to the Connectome project of the US National Institutes of Health that aims to systematically trace the neural connections in mammalian brains. The sudden availability of these maps at the end of the 20th century has catalyzed the emergence of network science.
THE UNIVERSALITY OF NETWORK CHARACTERISTICS
It is easy to list the differences between the various networks we encounter in nature or society: the nodes of the metabolic network are tiny molecules and the links are chemical reactions governed by the laws of chemistry and quantum mechanics; the nodes of the WWW are web documents and the links are URLs guaranteed by computer algorithms; the nodes of the social network are individuals and the links represent family, professional, friendship, and acquaintance ties. The processes that generated these networks also differ greatly: metabolic networks were shaped by billions of years of evolution; the WWW is built by the collective actions of millions of individuals and organizations; social networks are shaped by social norms whose roots go back thousands of years. Given this diversity in size, nature, scope, history, and evolution, one would not be surprised if the networks behind these systems differed greatly.
A key discovery of network science is that the architecture of networks emerging in various domains of science, nature, and technology is similar across domains, a consequence of being governed by the same organizing principles. Consequently we can use a common set of mathematical tools to explore these systems. This universality is one of the guiding principles of this book: we will not only seek to uncover specific network properties, but each time we ask how widely they apply. We will also aim to understand their origins, uncovering the laws that shape network evolution and their consequences on network behavior.
In summary, while many disciplines have made important contributions to network science, the emergence of a new field was partly made possible by data availability, offering accurate maps of networks encountered in different disciplines. These diverse maps allowed network scientists to identify the universal properties of various network characteristics.
This universality offers the foundation of the new discipline of network science.
BOX 1.2 THE ORIGINS OF NETWORK MAPS
Only a few of the maps studied today by network scientists were generated with the purpose of studying networks. Most are the byproduct of other projects and morphed into maps only in the hands of network scientists.
(a) The chemical reactions in a cell were discovered one by one over a 150-year period by biochemists. In the 1990s they were collected in central databases, offering the first chance to assemble the biochemical networks within a cell.
(b) The lists of actors that play in each movie were traditionally scattered in newspapers, books and encyclopedias. With the advent of the Internet, these data were assembled into central databases, like imdb.com, feeding the curiosity of movie aficionados. The database allowed network scientists to reconstruct the affiliation network behind Hollywood.
(c) The lists of authors of millions of research papers were traditionally scattered in the tables of contents of thousands of journals. Recently Web of Science, Google Scholar, and other services have assembled them into comprehensive databases, allowing network scientists to reconstruct accurate maps of scientific collaboration networks.
Much of the early history of network science relied on the investigators’ ingenuity to recognize and extract networks from preexisting databases. Network science changed that: Today well-funded research collaborations focus on map making, capturing accurate wiring diagrams of biological, communication and social systems.
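To illustrate how a network can be extracted from a database that was never meant to be one, the sketch below links two actors whenever they appear in the same movie. It is a simplified illustration, not the procedure used by any particular study; the cast lists and the function name `coappearance_network` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def coappearance_network(casts):
    """Link two actors whenever they share a movie.

    Returns a dict mapping each (actor, actor) pair, in alphabetical
    order, to the number of movies the two actors share.
    """
    weights = defaultdict(int)
    for cast in casts.values():
        # every unordered pair of distinct actors in one cast becomes a link
        for first, second in combinations(sorted(set(cast)), 2):
            weights[(first, second)] += 1
    return dict(weights)

# Hypothetical cast lists, not taken from any real database.
casts = {
    "Movie 1": ["Actor A", "Actor B", "Actor C"],
    "Movie 2": ["Actor B", "Actor C"],
    "Movie 3": ["Actor C", "Actor D"],
}
actor_links = coappearance_network(casts)
```

The same pairing step, applied to author lists instead of cast lists, would yield a scientific collaboration network.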
SECTION 1.4
THE CHARACTERISTICS OF NETWORK SCIENCE
Network science is defined not only by its subject matter, but also by its methodology. In this section we discuss the key characteristics of the approach network science adopted to understand complex systems.
INTERDISCIPLINARY NATURE
Network science offers a language through which different disciplines can seamlessly interact with each other. Indeed, cell biologists, brain scientists (Figure 1.4) and computer scientists alike are faced with the task of characterizing the wiring diagram behind their system, extracting information from incomplete and noisy datasets, and understanding their systems’ robustness to failures or attacks.
To be sure, each discipline brings a different set of goals, technical details and challenges, which are important on their own. Yet, the common nature of many issues these fields struggle with has led to a cross-disciplinary fertilization of tools and ideas. For example, the concept of betweenness centrality that emerged in the social network literature in the 1970s today plays a key role in identifying high-traffic nodes on the Internet. Similarly, algorithms developed by computer scientists for graph partitioning have found novel applications in identifying disease modules in medicine or detecting communities within large social networks.

Figure 1.4
Mapping the Brain
An exploding application area for network science is brain research. The wiring diagram of a complete nervous system has long been available for C. elegans, a small roundworm, but neuronal connectivity data for larger animals has been missing until recently. That is changing thanks to major efforts by the scientific community to develop technologies that can map out the brain’s wiring diagram. The image shows the cover of the April 10, 2014 issue of Nature, reporting an extensive map of the laboratory mouse [4] generated by researchers at the Allen Institute in Seattle.
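To make the notion of betweenness centrality mentioned above more concrete, here is a minimal sketch that counts, for each node, how many shortest paths between other pairs of nodes pass through it. It assumes an undirected, unweighted network and uses Brandes' accumulation scheme, which is our choice for the illustration rather than something prescribed by the text; the five-node example network is hypothetical.

```python
from collections import deque

def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected, unweighted network.

    adjacency maps each node to a list of its neighbors (symmetric).
    One breadth-first search is run from every node (Brandes' scheme).
    """
    betweenness = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []  # nodes in the order the search reaches them
        predecessors = {node: [] for node in adjacency}
        num_paths = {node: 0 for node in adjacency}  # shortest paths from source
        distance = {node: -1 for node in adjacency}
        num_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first visit
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)
        # walk back from the farthest nodes, accumulating path fractions
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += num_paths[v] / num_paths[w] * (1 + dependency[w])
            if w != source:
                betweenness[w] += dependency[w]
    # every pair of endpoints was counted twice, once from each side
    return {node: value / 2 for node, value in betweenness.items()}

# Hypothetical network: A - B - C, with D and E both attached to C.
network = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D", "E"],
    "D": ["C"],
    "E": ["C"],
}
scores = betweenness_centrality(network)
```

In this toy network node C carries the most shortest paths, which is the sense in which a high-betweenness node is a high-traffic node.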
EMPIRICAL, DATA DRIVEN NATURE
Several key concepts of network science have their roots in graph theory, a fertile field of mathematics. What distinguishes network science from graph theory is its empirical nature, i.e. its focus on data, function and utility. As we will see in the coming chapters, in network science we are never satisfied with developing abstract mathematical tools to describe a certain network property. Each tool we develop is tested on real data and its value is judged by the insights it offers about a system’s properties and behavior.
QUANTITATIVE AND MATHEMATICAL NATURE
To contribute to the development of network science and to properly use its tools, it is essential to master the mathematical formalism behind it. Network science borrowed the formalism to deal with graphs from graph theory and the conceptual framework to deal with randomness and seek universal organizing principles from statistical physics. Lately, the field is benefiting from concepts borrowed from engineering, like control and information theory, allowing us to understand the control principles of networks, and from statistics, helping us extract information from incomplete and noisy datasets.
The development of network analysis software has made the tools of network science available to a wider community, even those who may not be familiar with the intellectual foundations and the full mathematical depths of the discipline. Yet, to further the field and to efficiently use its tools, we need to master its theoretical formalism.
COMPUTATIONAL NATURE
Given the size of many of the networks of practical interest, and the exceptional amount of auxiliary data behind them, network scientists are regularly confronted by a series of formidable computational challenges. Hence, the field has a strong computational character, actively borrowing from algorithms, database management and data mining. A series of software tools are available to address these computational problems, enabling practitioners with diverse computational skills to analyze the networks of interest to them.
In summary, a mastery of network science requires familiarity with each of these aspects of the field. It is their combination that offers the multifaceted tools and perspectives necessary to understand the properties of real networks.
SECTION 1.5
SOCIETAL IMPACT
The impact of a new research field is measured both by its intellectual achievements and by its societal impact, indicated by the reach and the potential of its applications. While network science is a young field, its impact is everywhere.
ECONOMIC IMPACT: FROM WEB SEARCH TO SOCIAL NETWORKING
The most successful companies of the 21st century, from Google to Facebook, Twitter, LinkedIn, Cisco, Apple and Akamai, base their technology and business model on networks. Indeed, Google not only runs the biggest network mapping operation that humanity has ever built, generating a comprehensive and constantly updated map of the WWW, but its search technology is deeply interlinked with the network characteristics of the Web.
Networks have gained particular popularity with the emergence of Facebook, the company with the ambition to map out the social network of the whole planet. Facebook was not the first social networking site and it is likely not the last either: An impressive ecosystem of social networking tools, from Twitter to LinkedIn, is fighting for the attention of millions of users. Algorithms conceived by network scientists fuel these sites, aiding everything from friend recommendation to advertising.
HEALTH: FROM DRUG DESIGN TO METABOLIC ENGINEERING
Completed in 2001, the human genome project offered the first comprehensive list of all human genes [5, 6]. Yet, to fully understand how our cells function, and the origin of disease, a full list of genes is not sufficient: We also need an accurate map of how genes, proteins, metabolites and other cellular components interact with each other. Indeed, most cellular processes, from food processing to sensing changes in the environment, rely on molecular networks. The breakdown of these networks is responsible for human diseases.
The increasing awareness of the importance of molecular networks has led to the emergence of network biology, a new subfield of biology that aims to understand the behavior of cellular networks. A parallel movement within medicine, called network medicine, aims to uncover the role of networks in human disease (Figure 1.5). The importance of these advances is illustrated by the fact that Harvard University in 2012 started the Division of Network Medicine, which employs researchers and medical doctors who apply network-based ideas towards understanding human disease.
Networks play a particularly important role in drug development. The ultimate goal of network pharmacology [7] is to develop drugs that can cure diseases without significant side effects. This goal is pursued at many levels, from millions of dollars invested to map out cellular networks, to the development of tools and databases to store, curate, and analyze patient and genetic data.
Several new companies take advantage of the opportunities offered by networks for health and medicine. For example GeneGo collects maps of cellular interactions from the scientific literature and Genomatica uses the predictive power behind metabolic networks to identify drug targets in bacteria and humans. Recently major pharmaceutical companies, like Johnson & Johnson, have made significant investments in network medicine, seeing it as the path towards future drugs.
Figure 1.5
Network Biology and Medicine
The covers of two issues of Nature Reviews Genetics, the leading review journal in genetics. The journal has devoted exceptional attention to the impact of networks: the 2004 cover focuses on network biology [8] (top), the 2011 cover discusses network medicine [9] (bottom).

SECURITY: FIGHTING TERRORISM
Terrorism is a malady of the 21st century, requiring significant resources to combat it worldwide. Network thinking is increasingly present in the arsenal of various law enforcement agencies in charge of responding to terrorist activities. It is used to disrupt the financial network of terrorist organizations and to map adversarial networks, helping to uncover the role of their members and their capabilities. While much of the work in this area is classified, several well documented case studies have been made public. Examples include the use of social networks to find Saddam Hussein [10] or those responsible for the March 11, 2004 Madrid train bombings through the examination of the mobile call network. Network concepts have impacted military doctrine as well, leading to the concept of network-centric warfare, aimed at fighting low-intensity conflicts against terrorist and criminal networks that employ decentralized flexible network organization [11] (Figure 1.6).
Given the numerous potential military applications, it is perhaps not surprising that one of the first academic programs in network science was started at West Point, the US Military Academy. Furthermore, starting in 2009 the Army Research Lab devoted over $300 million to support network science centers across the US.
The knowledge and the capabilities offered by networks can also be abused. Such misuses were well illustrated by the indiscriminate network mapping operation by the National Security Agency [12]. Under the pretext of stopping future terrorist attacks, NSA monitored the communications of hundreds of millions of individuals, from the US and abroad, rebuilding their social network. With that, network scientists have awoken to a new social responsibility: to ensure the ethical use of our tools and knowledge.

Figure 1.6
The Network Behind a Military Engagement
This diagram was designed during the Afghan war in 2012 to portray the American operational plans in Afghanistan. While it has been ridiculed in the press for displaying too much complexity and detail in one chart, it vividly illustrates the interconnected nature of a modern military engagement. Today this example is studied by officers and military students to demonstrate the power and utility of network models for decision-making and operational coordination. Indeed, the job of military generals is not limited to ensuring the necessary military capacities, but must also factor in the beliefs and the living conditions of the local population or the impact of the narcotics trade that finances the operations of the insurgents. Image from New York Times.
EPIDEMICS: FROM FORECASTING TO HALTING DEADLY VIRUSES
While the H1N1 pandemic was not as devastating as it was feared at the beginning of the outbreak in 2009, it gained a special role in the history of epidemics: It was the first pandemic whose course and time evolution was accurately predicted months before the pandemic reached its peak (Online Resource 1.1) [13]. This was possible thanks to fundamental advances in understanding the role of transportation networks in the spread of viruses.
Before 2000 epidemic modeling was dominated by compartment-based models, assuming that everyone can infect everyone else in the same socio-physical compartment. The emergence of a network-based framework has brought a fundamental change, offering a new level of predictability. Today epidemic prediction is one of the most active applications of network science [13, 14], being used to foresee the spread of influenza or to contain Ebola. It is also the source of several fundamental results covered in this book, allowing us to model and predict the spread of biological, digital and social viruses (memes).
The impact of these advances is felt beyond epidemiology. Indeed, in January 2010 network science tools predicted the conditions necessary for the emergence of viruses spreading through mobile phones [15]. The first major mobile epidemic outbreak, which started in the fall of 2010 in China, infecting over 300,000 phones each day, closely followed the predicted scenario.

Online Resource 1.1
Predicting the H1N1 Epidemic
The predicted spread of the H1N1 epidemic during 2009, representing the first successful real-time prediction of a pandemic [13]. The project, relying on data describing the structure and the dynamics of the worldwide transportation network, foresaw that H1N1 would peak in October 2009, in contrast with the expected January-February peak of influenza. This meant that the vaccines timed for November 2009 were too late, eventually having little impact on the outcome of the epidemic. The success of this project shows the power of network science in facilitating advances in areas of key importance for humanity.
Video courtesy of Alessandro Vespignani.
NEUROSCIENCE: MAPPING THE BRAIN
The human brain, consisting of hundreds of billions of interlinked neurons, is one of the least understood networks from the perspective of network science. The reason is simple: We lack maps telling us which neurons are linked together. The only fully mapped brain available for research is that of the C. elegans worm, consisting of only 302 neurons. Detailed maps of mammalian brains could lead to a revolution in brain science, allowing the understanding and curing of numerous neurological and brain diseases. With that, brain research could become one of the most prolific application areas of network science [16]. Driven by the potential transformative impact of such maps, in 2010 the National Institutes of Health in the U.S. initiated the Connectome project, aimed at developing technologies that could provide accurate neuron-level maps of mammalian brains (Figure 1.4).
MANAGEMENT: UNCOVERING THE INTERNAL STRUCTURE OF AN ORGANIZATION
While management tends to rely on the official chain of command, it is increasingly evident that the informal network, capturing who really communicates with whom, plays the most important role in the success of an organization. Accurate maps of such organizational networks can expose the potential lack of interactions between key units, help identify individuals who play an important role in bringing different departments and products together, and help higher management diagnose diverse organizational issues. Furthermore, there is increasing evidence in the management literature that the productivity of an employee is determined by his/her position in this informal organizational network [17].
Therefore, numerous companies, like Maven 7, Activate Networks or Orgnet, offer tools and methodologies to map out the true structure of an organization. These companies offer a host of services, from identifying opinion leaders to reducing employee churn, optimizing knowledge and product diffusion and designing teams with the diversity, size and expertise to be the most effective for specific tasks (Figure 1.7). Established firms, from IBM to SAP, have added social networking capabilities to their business. Overall, network science tools are indispensable in management and business, enhancing productivity and boosting innovation within an organization.
Figure 1.7
Mapping Organizations
(a) Employees of a Hungarian company with three main locations (purple, yellow and blue). The management realized that information reaching the workers about the intentions of the higher management often had nothing to do with their real plans. Seeking to enhance information flow within the company, they turned to Maven 7, a company that applies network science in organizational settings.
(b) Maven 7 developed an online platform to ask each employee to whom they turn for advice when it comes to decisions impacting the company. This platform provided the map shown in (b), where two individuals are connected if one nominated the other as his/her source of information on organizational and professional issues. The map identifies several highly influential individuals, appearing as large hubs.
(c) The position of the leadership within the company’s informal network, nodes being colored based on their rank within the company. Note that none of the directors, shown in red, are hubs. Nor are the top managers, shown in blue. The hubs come from lower ranks: they are managers, group leaders and associates. The biggest hub, hence the most influential individual, is an ordinary employee, appearing as a gray node in the center.
(d) The links of the largest hub (red) and those two links away from this hub (orange) demonstrate that a significant fraction of employees are at most two links from this hub. But who is this hub? He is the employee in charge of safety and environmental issues. Hence he regularly visits each location and talks with the employees. He is connected to everyone except the top management. With little knowledge of the true intentions of the management, he passes on information that he collects along his trail, effectively running a gossip center.
Should they fire or promote the biggest hub? What is the best solution to this problem?
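The observation in panel (d), that a sizable fraction of employees sit within two links of the biggest hub, can be checked on any such map with a breadth-first search. The sketch below is a minimal illustration on a made-up advice network; the names, the links, and the helper `nodes_within` are hypothetical and do not reproduce the company's data.

```python
from collections import deque

def nodes_within(adjacency, start, max_links):
    """Return every node reachable from start in at most max_links steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_links:
            continue  # do not expand beyond the allowed number of links
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return set(distance) - {start}

# Hypothetical advice network: a link means one person named the other as a source.
advice = {
    "hub": ["ann", "bob"],
    "ann": ["hub", "cat"],
    "bob": ["hub"],
    "cat": ["ann", "dan"],
    "dan": ["cat", "eve"],
    "eve": ["dan"],
}
reach = nodes_within(advice, "hub", 2)
fraction = len(reach) / (len(advice) - 1)  # share of the other employees reached
```

In this toy example three of the five other employees lie within two links of the hub.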
SECTION 1.6
SCIENTIFIC IMPACT
Nowhere is the impact of network science more evident than in the scientific community. The most prominent scientific journals, from Nature to Science, Cell and PNAS, have devoted reviews and editorials addressing the impact of networks on various topics, from biology to social sciences. For example, Science has published a special issue on networks, marking the tenyear anniversary of the discovery of scalefree networks [18] (Figure 1.8). During the past decade each year about a dozen international conferences, workshops, summer and winter schools have focused on network science. A highly successful network science conference series, called NetSci, attracts the field’s practitioners since 2005. Several generalinterest books have made bestseller lists in many countries, bringing network science to the general public. Most major universities offer network science courses, attracting a diverse student body, and in 2014 Northeastern University in Boston and the Central European University in Budapest have launched PhD programs in network science. The see the impact of networks on the scientific community it is useful to inspect the citation patterns of the most cited papers in the area of com
Figure 1.8
Complex Systems and Networks Special issue of Science magazine devoted to networks, published on July 24, 2009, on the 10th anniversary of the 1999 discovery of scalefree networks [18].
plex systems. Each of these papers are citation classics, reporting classic discoveries like the butterfly effect, renormalisation group, spin glasses, fractals and neural networks, and cumulatively amassing anywhere between 2,000 and 5,000 citations. To see how the interest in network science compares to the impact of these foundational papers in Figure 1.9 we compare their citation patterns to the citations of the two most cited network science papers: the 1998 paper on smallworld phenomena [19] and the 1999 Science paper reporting the discovery of scalefree networks [18]. As one can see, the rapid rise of yearly citations to these two papers is without precedent in the area of complex systems. Several other metrics indicate that network science is impacting in a defining manner numerous disciplines. For example, in several research fields network papers became the most cited papers in their leading journals: INTRODUCTION
Figure 1.9
Complexity and Network Science
The scientific impact of network science, as seen through citation patterns, compared to the citations of the most cited papers in complexity. The study of complex systems in the 60s and 70s was dominated by Edward Lorenz's 1963 classic work on chaos [20], Kenneth G. Wilson's renormalization group [21], and Samuel F. Edwards and Philip W. Anderson's work on spin glasses [22]. In the 1980s the community shifted its focus to pattern formation, following Benoit Mandelbrot's book on fractals [23] and Thomas Witten and Len Sander's introduction of the diffusion-limited aggregation model [24]. Equally influential were John Hopfield's paper on neural networks [25] and Per Bak, Chao Tang and Kurt Wiesenfeld's work on self-organized criticality [26]. These papers continue to define our understanding of complex systems. The figure compares the yearly citations of these landmark papers with the citations of the two most cited papers in network science: the paper by Watts and Strogatz on small-world networks [19] and the paper by Barabási and Albert reporting the discovery of scale-free networks [18].
[Plot: yearly citations (vertical axis, 0 to 1,000) versus year (horizontal axis, 1960 to 2010). Curves: Chaos: Lorenz (1963); Spin Glasses: Edwards-Anderson (1975); Renormalization: Wilson (1975); Neural Networks: Hopfield (1982); Fractals: Mandelbrot (1982); Networks: Watts-Strogatz (1998); Networks: Barabási-Albert (1999).]
(a) The 1998 paper by Watts and Strogatz in Nature on small-world phenomena [19] and the 1999 paper by Barabási and Albert in Science on scale-free networks [18] were identified by Thomson Reuters as being among the top ten most cited papers in physical sciences during the decade after their publication. Currently (2011) the Watts-Strogatz paper is the second most cited of all papers published in Nature in 1998 and the Barabási-Albert paper is the most cited paper among all papers published in Science in 1999.

(b) Four years after its publication the SIAM Review paper by Mark Newman on network science became the most cited paper of any journal published by the Society for Industrial and Applied Mathematics [27].

(c) Reviews of Modern Physics, published since 1929, is the physics journal with the highest impact factor. Until 2012 the most cited paper of the journal was the classic 1943 review by Nobel Prize winner Subrahmanyan Chandrasekhar, entitled Stochastic Problems in Physics and Astronomy [28]. During the 70 years since its publication, the paper gathered over 5,000 citations. Yet, in 2012 it was overtaken by the first review of network science, published in 2002 and entitled Statistical Mechanics of Complex Networks [29].

(d) The paper reporting the discovery that in scale-free networks the epidemic threshold vanishes, by Pastor-Satorras and Vespignani [30], is the most cited paper among the papers published in 2001 by Physical Review Letters, a distinction it shares with a paper on quantum computing.

(e) The paper by Michelle Girvan and Mark Newman on community discovery in networks [31] is the most cited paper published in 2002 by Proceedings of the National Academy of Sciences.

(f) The 2004 review entitled Network Biology [8] is the second most cited paper in the history of Nature Reviews Genetics, the top review journal in genetics.

Prompted by this extraordinary enthusiasm within the scientific community, network science was examined by the National Research Council (NRC), the arm of the US National Academies in charge of offering policy recommendations to the US government. The NRC assembled two panels, resulting in recommendations summarized in two NRC reports [32, 33] that define the field of network science (Figure 1.10). These reports not only documented the emergence of a new research field, but also highlighted the field's role for science, national competitiveness and security. Following these reports, the National Science Foundation (NSF) in the US established a network science directorate and several Network Science Centers were funded at US universities by the Army Research Labs.

Network science has excited the public as well. This was fueled by the success of several general audience books, like Linked, Nexus, Six Degrees and Connected (Figure 1.11). Connected, an award-winning documentary by Australian filmmaker Annamaria Talas, has brought the field to our TV screens, being broadcast all over the world and winning several prestigious prizes (Online Resource 1.2). Networks have inspired artists as well, leading to a wide range of network-related art projects and an annual symposium series that brings together artists and network scientists [38]. Fueled by successful movies like The Social Network or Six Degrees of Separation, and a series of science fiction novels and short stories exploiting the network paradigm, today networks are deeply ingrained in popular culture.
Figure 1.10
National Research Council
Two National Research Council reports on network science have documented the emergence of the new discipline and highlighted its long-term impact on research and national competitiveness [32, 33]. They have recommended dedicated support for the field, prompting the establishment of network science centers at US universities and a network science program within NSF.

Online Resource 1.2
Connected
The trailer of the award-winning documentary entitled Connected, directed by Annamaria Talas, offering an introduction to network science. It features the actor Kevin Bacon and several well-known network scientists.

Figure 1.11
Wide Impact
Four widely read books, translated into over twenty languages, have brought network science to the general public [34, 35, 36, 37].
SECTION 1.7
SUMMARY
Figure 1.12
The Rise of Networks
The frequency of use of the words evolution, quantum, and networks in books since 1880. The plot indicates the exploding societal awareness of networks in the last decades of the 20th century, laying the ground for the emergence of network science. The plots were generated by Google's Ngram platform, calculating the fraction of books published in a year that mention evolution, quantum or networks.
[Plot: usage frequency (vertical axis, 0% to 0.008%) versus year (horizontal axis, 1800 to 2000) for the words NETWORK, QUANTUM and EVOLUTION.]
While the emergence of network science may appear to have been a rather sudden phenomenon (Figures 1.3 & 1.9), the field was responding to a wider social awareness of the role and importance of networks. This is illustrated in Figure 1.12, which shows the usage frequency of words that capture two important scientific revolutions of the past two centuries: evolution, the most common term referring to Darwin's theory of evolution, and quantum, the most frequently used term when one refers to quantum mechanics. As expected, the use of evolution increases after the 1859 publication of Darwin's On the Origin of Species. The word quantum, first used in 1902, remained virtually absent until the 1920s, when quantum mechanics gained acceptance among physicists and reached public consciousness. The figure compares these words with the usage of network, which enjoyed a spectacular increase following the 1980s, surpassing both evolution and quantum. While the term network has many uses (as do evolution and quantum), its dramatic rise captures the increasing societal awareness of networks.

There is something common between the advances facilitated by evolutionary theory, quantum mechanics and network science: They are not only important scientific fields with their own intellectual core and body of knowledge, but they are also enabling platforms. Indeed, the current revolution in genetics is built on evolutionary theory, and quantum mechanics offers a platform for a wide range of advances in contemporary science, from chemistry to electronics. In a similar fashion, network science is an enabling platform, offering novel tools and perspectives for a wide range of scientific problems, from social networking to drug design.

Given the exceptional impact networks have both in science and in society, we must master the tools to study and quantify them. The rest of this book is devoted to this worthy subject.
SECTION 1.8
HOMEWORK
1.1. Networks Everywhere
List three different real networks and state the nodes and links for each of them.

1.2. Your Interest
Tell us of the network you are personally most interested in. Address the following questions:
(a) What are its nodes and links?
(b) How large is it?
(c) Can it be mapped out?
(d) Why do you care about it?

1.3. Impact
In your view, what would be the area where network science could have the biggest impact in the next decade? Explain your answer.
SECTION 1.9
BIBLIOGRAPHY
[1] J. Richards and R. Hobbs. Mark Lombardi: Global Networks. Independent Curators International, New York, 2003.
[2] P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 6: 290, 1959.
[3] M. S. Granovetter. The strength of weak ties. American Journal of Sociology, 78: 1360, 1973.
[4] S. W. Oh et al. A mesoscale connectome of the mouse brain. Nature, 508: 207-214, 2014.
[5] International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 409: 6822, 2001.
[6] J. C. Venter et al. The Sequence of the Human Genome. Science, 291: 1304, 2001.
[7] A. L. Hopkins. Network Pharmacology. Nature Biotechnology, 25: 1110-1111, 2007.
[8] Z. N. Oltvai and A.-L. Barabási. Network Biology: Understanding the cell's functional organization. Nature Reviews Genetics, 5: 101, 2004.
[9] N. Gulbahce, A.-L. Barabási, and J. Loscalzo. Network medicine: A network-based approach to human disease. Nature Reviews Genetics, 12: 56, 2011.
[10] C. Wilson. Searching for Saddam: A five-part series on how the US military used social networking to capture the Iraqi dictator. 2010. www.slate.com/id/2245228/.
[11] J. Arquilla and D. Ronfeldt. Networks and Netwars: The Future of Terror, Crime, and Militancy. RAND: Santa Monica, CA, 2001.
[12] A.-L. Barabási. Scientists must spearhead ethical use of big data. Politico.com, September 30, 2013.
[13] D. Balcan, H. Hu, B. Goncalves, P. Bajardi, C. Poletto, J. J. Ramasco, D. Paolotti, N. Perra, M. Tizzoni, W. Van den Broeck, V. Colizza, and A. Vespignani. Seasonal transmission potential and activity peaks of the new influenza A(H1N1): a Monte Carlo likelihood analysis based on human mobility. BMC Medicine, 7: 45, 2009.
[14] L. Hufnagel, D. Brockmann, and T. Geisel. Forecast and control of epidemics in a globalized world. PNAS, 101: 15124, 2004.
[15] P. Wang, M. Gonzalez, C. A. Hidalgo, and A.-L. Barabási. Understanding the spreading patterns of mobile phone viruses. Science, 324: 1071, 2009.
[16] O. Sporns, G. Tononi, and R. Kötter. The Human Connectome: A Structural Description of the Human Brain. PLoS Computational Biology, 1: 4, 2005.
[17] L. Wu, B. N. Waber, S. Aral, E. Brynjolfsson, and A. Pentland. Mining Face-to-Face Interaction Networks using Sociometric Badges: Predicting Productivity in an IT Configuration Task. Proceedings of the International Conference on Information Systems, Paris, France, December 14-17, 2008.
[18] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286: 509, 1999.
[19] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393: 440, 1998.
[20] E. N. Lorenz. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, 20: 130, 1963.
[21] K. G. Wilson. The renormalization group: Critical phenomena and the Kondo problem. Reviews of Modern Physics, 47: 773, 1975.
[22] S. F. Edwards and P. W. Anderson. Theory of Spin Glasses. Journal of Physics F, 5: 965, 1975.
[23] B. B. Mandelbrot. The Fractal Geometry of Nature. W.H. Freeman and Company, 1982.
[24] T. Witten, Jr. and L. M. Sander. Diffusion-Limited Aggregation, a Kinetic Critical Phenomenon. Physical Review Letters, 47: 1400, 1981.
[25] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. PNAS, 79: 2554, 1982.
[26] P. Bak, C. Tang, and K. Wiesenfeld. Self-organized criticality: an explanation of 1/f noise. Physical Review Letters, 59: 4, 1987.
[27] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45: 167, 2003.
[28] S. Chandrasekhar. Stochastic Problems in Physics and Astronomy. Reviews of Modern Physics, 15: 1, 1943.
[29] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74: 47, 2002.
[30] R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks. Physical Review Letters, 86: 3200, 2001.
[31] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. PNAS, 99: 7821, 2002.
[32] National Research Council. Network Science. Washington, DC: The National Academies Press, 2005.
[33] National Research Council. Strategy for an Army Center for Network Science, Technology, and Experimentation. Washington, DC: The National Academies Press, 2007.
[34] A.-L. Barabási. Linked: The New Science of Networks. Perseus Books Group, 2002.
[35] M. Buchanan. Nexus: Small Worlds and the Groundbreaking Science of Networks. Norton, 2003.
[36] D. Watts. Six Degrees: The Science of a Connected Age. Norton, 2004.
[37] N. Christakis and J. Fowler. Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives. Back Bay Books, 2011.
[38] M. Schich, R. Malina, and I. Meirelles (Editors). Arts, Humanities, and Complex Networks [Kindle Edition], 2012.
2
ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
GRAPH THEORY

ACKNOWLEDGEMENTS
MÁRTON PÓSFAI
GABRIELE MUSELLA
MAURO MARTINO
ROBERTA SINATRA
PHILIPP HOEVEL
SARAH MORRISON
AMAL HUSSEINI

INDEX
The Bridges of Königsberg  1
Networks and Graphs  2
Degree, Average Degree and Degree Distribution  3
Adjacency Matrix  4
Real Networks are Sparse  5
Weighted Networks  6
Bipartite Networks  7
Paths and Distances  8
Connectedness  9
Clustering Coefficient  10
Summary  11
Homework  12
ADVANCED TOPIC 2.A Global Clustering Coefficient  13
Bibliography  14
Figure 2.0 (front cover)
Human Disease Network
The Human Disease Network, whose nodes are diseases, connected if they have a common genetic origin. Published as a supplement of the Proceedings of the National Academy of Sciences [1], the map was created to illustrate the genetic interconnectedness of apparently distinct diseases. With time it crossed disciplinary boundaries, taking on a life of its own. The New York Times created an interactive version of the map and the London-based Serpentine Gallery, one of the top contemporary art galleries in the world, has exhibited it as part of its focus on networks and maps [2]. It is also featured in numerous books on design and maps [3, 4, 5].

This work is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V27, 05.09.2014
SECTION 2.1
THE BRIDGES OF KÖNIGSBERG
Few research fields can trace their birth to a single moment and place in history. Graph theory, the mathematical scaffold behind network science, can. Its roots go back to 1735 in Königsberg, the capital of Eastern Prussia, a thriving merchant city of its time. The trade supported by its busy fleet of ships allowed city officials to build seven bridges across the river Pregel that surrounded the town. Five of these connected the elegant island Kneiphof, caught between the two branches of the Pregel, to the mainland. The remaining two crossed the two branches of the river (Figure 2.1). This peculiar arrangement gave birth to a contemporary puzzle: Can one walk across all seven bridges and never cross the same one twice? Despite many attempts, no one could find such a path. The problem remained unsolved until 1735, when Leonhard Euler, a Swiss-born mathematician, offered a rigorous mathematical proof that such a path does not exist [6, 7].

Euler represented each of the four land areas separated by the river with letters A, B, C, and D (Figure 2.1). Next he connected with lines each pair of land areas that had a bridge between them. He thus built a graph, whose nodes were pieces of land and whose links were the bridges. Then Euler made a simple observation: if there is a path crossing all bridges, but never the same bridge twice, then nodes with an odd number of links must be either the starting or the end point of this path. Indeed, if you arrive at a node with an odd number of links, you may find yourself having no unused link for you to leave it.

A walking path that goes through all bridges can have only one starting and one end point. Thus such a path cannot exist on a graph that has more than two nodes with an odd number of links. The Königsberg graph had four nodes with an odd number of links, A, B, C, and D, so no path could satisfy the problem.

Euler's proof was the first time someone solved a mathematical problem using a graph. For us the proof has two important messages: The first is that some problems become simpler and more tractable if they are represented as a graph. The second is that the existence of the path does not depend on our ingenuity to find it. Rather, it is a property of the graph. Indeed, given the structure of the Königsberg graph, no matter how smart we are, we will never find the desired path. In other words, networks have properties encoded in their structure that limit or enhance their behavior.

To understand the many ways networks can affect the properties of a system, we need to become familiar with graph theory, a branch of mathematics that grew out of Euler's proof. In this chapter we learn how to represent a network as a graph and introduce the elementary characteristics of networks, from degrees to degree distributions, from paths to distances, and learn to distinguish weighted, directed and bipartite networks. We will introduce a graph-theoretic formalism and language that will be used throughout this book.

Figure 2.1
The Bridges of Königsberg
(a) A contemporary map of Königsberg (now Kaliningrad, Russia) during Euler's time.
(b) A schematic illustration of Königsberg's four land pieces and the seven bridges across them.
(c) Euler constructed a graph that has four nodes (A, B, C, D), each corresponding to a patch of land, and seven links, each corresponding to a bridge. He then showed that there is no continuous path that would cross the seven bridges while never crossing the same bridge twice. The people of Königsberg gave up their fruitless search and in 1875 built a new bridge between B and C, increasing the number of links of these two nodes to four. Now only two nodes were left with an odd number of links. Consequently we should be able to find the desired path. Can you find one yourself?

Online Resource 2.1
The Bridges of Königsberg
Watch a short video introducing the Königsberg problem and Euler's solution.
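Euler's counting argument is easy to try out numerically. The following is a minimal Python sketch, not part of the original text: the helper names are hypothetical, and the bridge list is an assumption that follows the usual reading of the Königsberg map (A is the island Kneiphof, reached by five bridges, while B, C and D each have three). It only tests the necessary condition discussed above, namely that at most two nodes may have an odd number of links.

```python
from collections import Counter

def odd_degree_nodes(bridges):
    """Return the sorted list of nodes that have an odd number of links."""
    degree = Counter()
    for u, v in bridges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, k in degree.items() if k % 2 == 1)

def may_have_euler_path(bridges):
    """Euler's condition: at most two nodes with an odd number of links.

    This sketch assumes the graph is connected, which holds here.
    """
    return len(odd_degree_nodes(bridges)) <= 2

# Assumed layout of the seven bridges (parallel bridges appear twice).
konigsberg = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]

# The same graph with the extra B-C bridge built in 1875.
konigsberg_1875 = konigsberg + [("B", "C")]

print(odd_degree_nodes(konigsberg), may_have_euler_path(konigsberg))
print(odd_degree_nodes(konigsberg_1875), may_have_euler_path(konigsberg_1875))
```

Under these assumptions the original graph has four odd nodes, so no path exists, while the graph with the additional bridge has only two, which are then the start and end points of any valid walk.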
SECTION 2.2
NETWORKS AND GRAPHS
If we want to understand a complex system, we first need to know how its components interact with each other. In other words we need a map of its wiring diagram. A network is a catalog of a system's components, often called nodes or vertices, and the direct interactions between them, called links or edges (BOX 2.1). This network representation offers a common language to study systems that may differ greatly in nature, appearance, or scope. Indeed, as shown in Figure 2.2, three rather different systems have exactly the same network representation.

Figure 2.2 introduces two basic network parameters:

Number of nodes, or N, represents the number of components in the system. We will often call N the size of the network. To distinguish the nodes, we label them with i = 1, 2, ..., N.

Number of links, which we denote with L, represents the total number of interactions between the nodes. Links are rarely labeled, as they can be identified through the nodes they connect. For example, the (2, 4) link connects nodes 2 and 4.

The networks shown in Figure 2.2 have N = 4 and L = 4.

The links of a network can be directed or undirected. Some systems have directed links, like the WWW, whose uniform resource locators (URL) point from one web document to the other, or phone calls, where one person calls the other. Other systems have undirected links, like romantic ties: if I date Janet, Janet also dates me, or like transmission lines on the power grid, on which the electric current can flow in both directions.

A network is called directed (or digraph) if all of its links are directed; it is called undirected if all of its links are undirected. Some networks simultaneously have directed and undirected links. For example in the metabolic network some reactions are reversible (i.e., bidirectional or undirected) and others are irreversible, taking place in only one direction (directed).

Figure 2.2
Different Networks, Same Graph
The figure shows a small subset of (a) the Internet, where routers (specialized computers) are connected to each other; (b) the Hollywood actor network, where two actors are connected if they played in the same movie; (c) a protein-protein interaction network, where two proteins are connected if there is experimental evidence that they can bind to each other in the cell. While the nature of the nodes and the links differs, these networks have the same graph representation, consisting of N = 4 nodes and L = 4 links, shown in (d).
BOX 2.1
NETWORKS OR GRAPHS?
In the scientific literature the terms network and graph are used interchangeably:

Network Science    Graph Theory
Network            Graph
Node               Vertex
Link               Edge

Yet, there is a subtle distinction between the two terminologies: the {network, node, link} combination often refers to real systems: The WWW is a network of web documents linked by URLs; society is a network of individuals linked by family, friendship or professional ties; the metabolic network is the sum of all chemical reactions that take place in a cell. In contrast, we use the terms {graph, vertex, edge} when we discuss the mathematical representation of these networks: We talk about the web graph, the social graph (a term made popular by Facebook), or the metabolic graph. Yet, this distinction is rarely made, so these two terminologies are often synonyms of each other.

The choices we make when we represent a system as a network will determine our ability to use network science successfully to solve a particular problem. For example, the way we define the links between two individuals dictates the nature of the questions we can explore:

(a) By connecting individuals that regularly interact with each other in the context of their work, we obtain the organizational or professional network, which plays a key role in the success of a company or an institution, and is of major interest to organizational research (Figure 1.7).

(b) By linking friends to each other, we obtain the friendship network, which plays an important role in the spread of ideas, products and habits and is of major interest to sociology, marketing and health sciences.

(c) By connecting individuals that have an intimate relationship, we obtain the sexual network, of key importance for the spread of sexually transmitted diseases, like AIDS, and of major interest for epidemiology.

(d) By using phone and email records to connect individuals that call or email each other, we obtain the acquaintance network, capturing a mixture of professional, friendship or intimate links, of importance to communications and marketing.

While many links in these four networks overlap (some coworkers may be friends or may have an intimate relationship), these networks have different uses and purposes.

We can also build networks that may be valid from a graph theoretic perspective, but may have little practical utility. For example, if we link all individuals with the same first name, Johns with Johns and Marys with Marys, we do obtain a well-defined graph, whose properties can be analyzed with the tools of network science. Its utility is questionable, however. Hence in order to apply network theory to a system, careful considerations must precede our choice of nodes and links, ensuring their significance to the problem we wish to explore.

Throughout this book we will use ten networks to illustrate the tools of network science. These reference networks, listed in Table 2.1, span social systems (mobile call graph or email network), collaboration and affiliation networks (science collaboration network, Hollywood actor network), information systems (WWW), technological and infrastructural systems (Internet and power grid), biological systems (protein interaction and metabolic network), and reference networks (citations). They differ widely in their sizes, from as few as N = 1,039 nodes in the E. coli metabolism, to almost half a million nodes in the citation network. They cover several areas where networks are actively applied, representing 'canonical' datasets frequently
used by researchers to illustrate key network properties. As we indicate in Table 2.1, some of them are directed, others are undirected. In the coming chapters we will discuss in detail the nature and the characteristics of each of these datasets, turning them into the guinea pigs of our journey to understand complex networks.

NETWORK                NODES                       LINKS                  DIRECTED/UNDIRECTED   N         L            ⟨k⟩
Internet               Routers                     Internet connections   Undirected            192,244   609,066      6.34
WWW                    Webpages                    Links                  Directed              325,729   1,497,134    4.60
Power Grid             Power plants, transformers  Cables                 Undirected            4,941     6,594        2.67
Mobile Phone Calls     Subscribers                 Calls                  Directed              36,595    91,826       2.51
Email                  Email addresses             Emails                 Directed              57,194    103,731      1.81
Science Collaboration  Scientists                  Coauthorship           Undirected            23,133    93,439       8.08
Actor Network          Actors                      Coacting               Undirected            702,388   29,397,908   83.71
Citation Network       Paper                       Citations              Directed              449,673   4,689,479    10.43
E. Coli Metabolism     Metabolites                 Chemical reactions     Directed              1,039     5,802        5.58
Protein Interactions   Proteins                    Binding interactions   Undirected            2,018     2,930        2.90

Table 2.1
Canonical Network Maps
The basic characteristics of ten networks used throughout this book to illustrate the tools of network science. The table lists the nature of their nodes and links, indicating if links are directed or undirected, the number of nodes (N) and links (L), and the average degree for each network. For directed networks the average degree shown is the average in- or out-degree, $\langle k^{in} \rangle = \langle k^{out} \rangle$ (see Equation (2.5)).
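The last column of Table 2.1 can be recomputed from N and L alone. The sketch below is an illustration rather than the procedure used to build the table: it assumes that the average degree is 2L/N for undirected networks and L/N (the average in- or out-degree) for directed ones, as the caption indicates, and the function name `average_degree` is hypothetical.

```python
# (network, is_directed, N, L), copied from Table 2.1
TABLE_2_1 = [
    ("Internet", False, 192_244, 609_066),
    ("WWW", True, 325_729, 1_497_134),
    ("Power Grid", False, 4_941, 6_594),
    ("Mobile Phone Calls", True, 36_595, 91_826),
    ("Email", True, 57_194, 103_731),
    ("Science Collaboration", False, 23_133, 93_439),
    ("Actor Network", False, 702_388, 29_397_908),
    ("Citation Network", True, 449_673, 4_689_479),
    ("E. Coli Metabolism", True, 1_039, 5_802),
    ("Protein Interactions", False, 2_018, 2_930),
]

def average_degree(n_nodes, n_links, directed):
    """Average degree: L/N for directed networks, 2L/N for undirected ones."""
    if directed:
        return n_links / n_nodes
    return 2 * n_links / n_nodes

for name, directed, n_nodes, n_links in TABLE_2_1:
    print(f"{name:22s} {average_degree(n_nodes, n_links, directed):6.2f}")
```

Rounded to two decimals, the printed values agree with the average degrees listed in the table.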
SECTION 2.3
DEGREE, AVERAGE DEGREE, AND DEGREE DISTRIBUTION
A key property of each node is its degree, representing the number of links it has to other nodes. The degree can represent the number of mobile phone contacts an individual has in the call graph (i.e. the number of different individuals the person has talked to), or the number of citations a research paper gets in the citation network.

BOX 2.2
BRIEF STATISTICS REVIEW
Four key quantities characterize a sample of N values $x_1, \ldots, x_N$:

Average (mean):
$$ \langle x \rangle = \frac{x_1 + x_2 + \ldots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i $$

The nth moment:
$$ \langle x^n \rangle = \frac{x_1^n + x_2^n + \ldots + x_N^n}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i^n $$

Standard deviation:
$$ \sigma_x = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \langle x \rangle \right)^2} $$

Distribution of x:
$$ p_x = \frac{1}{N} \sum_{i} \delta_{x, x_i} , $$
where $p_x$ follows the normalization $\sum_x p_x = 1$.

DEGREE
We denote with $k_i$ the degree of the ith node in the network. For example, for the undirected networks shown in Figure 2.2 we have $k_1 = 2$, $k_2 = 3$, $k_3 = 2$, $k_4 = 1$. In an undirected network the total number of links, L, can be expressed as the sum of the node degrees:

$$ L = \frac{1}{2} \sum_{i=1}^{N} k_i . \tag{2.1} $$

Here the 1/2 factor corrects for the fact that in the sum (2.1) each link is counted twice. For example, the link connecting the nodes 2 and 4 in Figure 2.2 will be counted once in the degree of node 2 and once in the degree of node 4.

AVERAGE DEGREE
An important property of a network is its average degree (BOX 2.2), which for an undirected network is

$$ \langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2L}{N} . \tag{2.2} $$

In directed networks we distinguish between incoming degree, $k_i^{in}$, representing the number of links that point to node i, and outgoing degree, $k_i^{out}$, representing the number of links that point from node i to other nodes. Finally, a node's total degree, $k_i$, is given by

$$ k_i = k_i^{in} + k_i^{out} . \tag{2.3} $$

For example, on the WWW the number of pages a given document points to represents its outgoing degree, $k^{out}$, and the number of documents that point to it represents its incoming degree, $k^{in}$. The total number of links in a directed network is

$$ L = \sum_{i=1}^{N} k_i^{in} = \sum_{i=1}^{N} k_i^{out} . \tag{2.4} $$

The 1/2 factor seen in (2.1) is now absent, as for directed networks the two sums in (2.4) separately count the outgoing and the incoming degrees. The average degree of a directed network is

$$ \langle k^{in} \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i^{in} = \langle k^{out} \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i^{out} = \frac{L}{N} . \tag{2.5} $$
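Equations (2.1) to (2.5) can be checked on a small example. The sketch below is illustrative only: it assumes the four-node graph of Figure 2.2 is given by the link list {(1, 2), (1, 3), (2, 3), (2, 4)}, which reproduces the degrees quoted above, and for the directed case it reads each pair as (source, target), an orientation chosen purely for illustration. The function names are hypothetical.

```python
from collections import Counter

N = 4
links = [(1, 2), (1, 3), (2, 3), (2, 4)]

def degrees(undirected_links):
    """Degree k_i of every node: each link adds one to both of its end nodes."""
    k = Counter()
    for i, j in undirected_links:
        k[i] += 1
        k[j] += 1
    return dict(k)

def in_out_degrees(directed_links):
    """In- and out-degrees, reading each pair as (source, target)."""
    k_in, k_out = Counter(), Counter()
    for source, target in directed_links:
        k_out[source] += 1
        k_in[target] += 1
    return dict(k_in), dict(k_out)

k = degrees(links)
L = sum(k.values()) // 2           # Eq. (2.1): each link is counted twice
avg_k = sum(k.values()) / N        # Eq. (2.2): equals 2L / N

k_in, k_out = in_out_degrees(links)
L_directed = sum(k_in.values())    # Eq. (2.4): no factor 1/2
avg_k_in = L_directed / N          # Eq. (2.5)

print(k, L, avg_k)
print(k_in, k_out, L_directed, avg_k_in)
```

In the undirected reading the degree sum is 8, twice the number of links, while in the directed reading the in-degrees and the out-degrees each add up to L = 4.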
DEGREE DISTRIBUTION
The degree distribution, $p_k$, provides the probability that a randomly selected node in the network has degree k. Since $p_k$ is a probability, it must be normalized, i.e.

$$ \sum_{k=0}^{\infty} p_k = 1 . \tag{2.6} $$

For a network with N nodes the degree distribution is the normalized histogram (Figure 2.3) given by

$$ p_k = \frac{N_k}{N} , \tag{2.7} $$

where $N_k$ is the number of degree-k nodes. Hence the number of degree-k nodes can be obtained from the degree distribution as $N_k = N p_k$.

The degree distribution has assumed a central role in network theory following the discovery of scale-free networks [8]. One reason is that the calculation of most network properties requires us to know $p_k$. For example, the average degree of a network can be written as

$$ \langle k \rangle = \sum_{k=0}^{\infty} k \, p_k . \tag{2.8} $$

The other reason is that the precise functional form of $p_k$ determines many network phenomena, from network robustness to the spread of viruses.

Figure 2.3
Degree Distribution
The degree distribution of a network is provided by the ratio (2.7).
(a) For the network in (a) with N = 4 the degree distribution is shown in (b).
(b) We have $p_1 = 1/4$ (one of the four nodes has degree $k_1 = 1$), $p_2 = 1/2$ (two nodes have $k_3 = k_4 = 2$), and $p_3 = 1/4$ (as $k_2 = 3$). As we lack nodes with degree k > 3, $p_k = 0$ for any k > 3.
(c) A one-dimensional lattice for which each node has the same degree k = 2.
(d) The degree distribution of (c) is a Kronecker delta function, $p_k = \delta(k - 2)$.
[Panels (b) and (d) plot $p_k$ (vertical axis, 0 to 1) against k (horizontal axis, 0 to 4).]
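The ratio (2.7) amounts to a normalized histogram of the node degrees. As a small illustration, the sketch below computes $p_k$ for the degree sequence quoted in the caption of Figure 2.3 ($k_1 = 1$, $k_2 = 3$, $k_3 = k_4 = 2$) and checks the normalization (2.6) and the average degree (2.8). The function name `degree_distribution` is hypothetical, and exact fractions are used only to avoid rounding.

```python
from collections import Counter
from fractions import Fraction

def degree_distribution(degree_sequence):
    """Return {k: p_k} with p_k = N_k / N, as in Eq. (2.7)."""
    n_nodes = len(degree_sequence)
    counts = Counter(degree_sequence)      # N_k for every observed k
    return {k: Fraction(n_k, n_nodes) for k, n_k in sorted(counts.items())}

# Degrees k_1, k_2, k_3, k_4 of the four-node network in Figure 2.3a.
degree_sequence = [1, 3, 2, 2]

p = degree_distribution(degree_sequence)
normalization = sum(p.values())                   # Eq. (2.6)
average_degree = sum(k * p_k for k, p_k in p.items())   # Eq. (2.8)

for k, p_k in p.items():
    print(k, p_k)
print(normalization, average_degree)
```

Degrees that do not occur in the network are simply absent from the dictionary, which corresponds to $p_k = 0$.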
Figure 2.4
Degree Distribution of a Real Network
In real networks the node degrees can vary widely.
(a) A layout of the protein interaction network of yeast (Table 2.1). Each node corresponds to a yeast protein and links correspond to experimentally detected binding interactions. Note that the proteins shown on the bottom have self-loops, hence for them k = 2.
(b) The degree distribution of the protein interaction network shown in (a). The observed degrees vary between k = 0 (isolated nodes) and k = 92, which is the degree of the most connected node, called a hub. There are also wide differences in the number of nodes with different degrees: Almost half of the nodes have degree one (i.e. $p_1 = 0.48$), while we have only one copy of the biggest node (i.e. $p_{92} = 1/N = 0.0005$).
(c) The degree distribution is often shown on a log-log plot, in which we either plot log $p_k$ as a function of log k or, as we do in (c), use logarithmic axes. The advantages of this representation are discussed in Chapter 4.
[Panels (b) and (c) plot $p_k$ against k.]
SECTION 2.4
ADJACENCY MATRIX
A complete description of a network requires us to keep track of its links. The simplest way to achieve this is to provide a complete list of the links. For example, the network of Figure 2.2 is uniquely described by listing its four links: {(1, 2), (1, 3), (2, 3), (2, 4)}. For mathematical purposes we often represent a network through its adjacency matrix. The adjacency matrix of a directed network of N nodes has N rows and N columns, its elements being:

Aij = 1 if there is a link pointing from node j to node i
Aij = 0 if nodes i and j are not connected to each other

The adjacency matrix of an undirected network has two entries for each link, e.g. link (1, 2) is represented as A12 = 1 and A21 = 1. Hence, the adjacency matrix of an undirected network is symmetric, Aij = Aji (Figure 2.5b).

The degree ki of node i can be directly obtained from the elements of the adjacency matrix. For undirected networks a node's degree is a sum over either the rows or the columns of the matrix, i.e.

$$k_i = \sum_{j=1}^{N} A_{ij} = \sum_{j=1}^{N} A_{ji}.$$ (2.9)
For directed networks the sums over the adjacency matrix's rows and columns provide the incoming and outgoing degrees, respectively

$$k_i^{in} = \sum_{j=1}^{N} A_{ij}, \qquad k_i^{out} = \sum_{j=1}^{N} A_{ji}.$$ (2.10)
Given that in an undirected network the number of outgoing links equals the number of incoming links, we have

$$2L = \sum_{i=1}^{N} k_i^{in} = \sum_{i=1}^{N} k_i^{out} = \sum_{i,j}^{N} A_{ij}.$$ (2.11)
The number of nonzero elements of the adjacency matrix is 2L, or twice the number of links. Indeed, an undirected link connecting nodes i and j appears in two entries: Aij = 1, a link pointing from node j to node i, and Aji = 1, a link pointing from i to j (Figure 2.5b).
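As a minimal sketch of how degrees follow from the adjacency matrix, the Python snippet below builds the matrix of the Figure 2.2 network from its link list and recovers the degrees, the number of links and the average degree. The variable names (links, row_degrees, column_degrees) are ours, chosen for illustration, and the plain list-of-lists matrix is an assumption made to avoid external libraries.

```python
# Link list of the undirected network of Figure 2.2 (node labels 1..4).
links = [(1, 2), (1, 3), (2, 3), (2, 4)]
N = 4

# Each undirected link fills two entries, so the matrix is symmetric.
A = [[0] * N for _ in range(N)]
for i, j in links:
    A[i - 1][j - 1] = 1
    A[j - 1][i - 1] = 1

# Degree of node i as a row sum and as a column sum.
row_degrees = [sum(A[i][j] for j in range(N)) for i in range(N)]
column_degrees = [sum(A[j][i] for j in range(N)) for i in range(N)]

# The nonzero entries count every link twice.
nonzero_entries = sum(sum(row) for row in A)
L = nonzero_entries // 2
average_degree = 2 * L / N

print(row_degrees)      # [2, 3, 2, 1]
print(column_degrees)   # [2, 3, 2, 1]
print(nonzero_entries)  # 8
print(L, average_degree)  # 4 2.0
```

For an undirected network the row and column sums coincide; for a directed network they would differ and give the incoming and outgoing degrees.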
Figure 2.5 The Adjacency Matrix

(a) The labeling of the elements of the adjacency matrix:

$$A_{ij} = \begin{pmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ A_{21} & A_{22} & A_{23} & A_{24} \\ A_{31} & A_{32} & A_{33} & A_{34} \\ A_{41} & A_{42} & A_{43} & A_{44} \end{pmatrix}$$

(b) The adjacency matrix of an undirected network:

$$A_{ij} = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

The figure shows that the degree of a node (in this case node 2) can be expressed as the sum over the appropriate column or row of the adjacency matrix, $k_2 = \sum_{j=1}^{4} A_{2j} = \sum_{i=1}^{4} A_{i2} = 3$. It also shows a few basic network characteristics, like the total number of links, L, and average degree, 〈k〉, expressed in terms of the elements of the adjacency matrix: $A_{ij} = A_{ji}$, $A_{ii} = 0$, $L = \frac{1}{2} \sum_{i,j=1}^{N} A_{ij}$, $\langle k \rangle = 2L/N$.

(c) The same as in (b) but for a directed network: $k_2^{in} = \sum_{j=1}^{4} A_{2j} = 2$, $k_2^{out} = \sum_{i=1}^{4} A_{i2} = 1$, $A_{ij} \neq A_{ji}$, $A_{ii} = 0$, $L = \sum_{i,j=1}^{N} A_{ij}$, $\langle k^{in} \rangle = \langle k^{out} \rangle = L/N$.
SECTION 2.5
REAL NETWORKS ARE SPARSE
In real networks the number of nodes (N) and links (L) can vary widely. For example, the neural network of the worm C. elegans, the only fully mapped nervous system of a living organism, has N = 302 neurons (nodes). In contrast, the human brain is estimated to have about a hundred billion ($N \approx 10^{11}$) neurons. The genetic network of a human cell has about 20,000 genes as nodes; the social network consists of seven billion individuals ($N \approx 7 \times 10^{9}$); and the WWW is estimated to have over a trillion web documents ($N > 10^{12}$).

These wide differences in size are noticeable in Table 2.1, which lists N and L for several network maps. Some of these maps offer a complete wiring diagram of the system they describe (like the actor network or the E. coli metabolism), while others are only samples, representing a subset of the full network (like the WWW or the mobile call graph).

Table 2.1 indicates that the number of links also varies widely. In a network of N nodes the number of links can change between L = 0 and Lmax, where

$$L_{max} = \binom{N}{2} = \frac{N(N-1)}{2}$$ (2.12)

is the total number of links present in a complete graph of size N (Figure 2.6). In a complete graph each node is connected to every other node.

Figure 2.6 Complete Graph
A complete graph with N = 16 nodes and Lmax = 120 links, as predicted by (2.12). The adjacency matrix of a complete graph is Aij = 1 for all i ≠ j and Aii = 0. The average degree of a complete graph is 〈k〉 = N − 1. A complete graph is often called a clique, a term frequently used in community identification, a problem discussed in CHAPTER 9.

In real networks L is much smaller than Lmax, reflecting the fact that most real networks are sparse. We call a network sparse if $L \ll L_{max}$.
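To make the gap between L and Lmax concrete, the short Python sketch below evaluates Eq. (2.12) for the complete graph of Figure 2.6 and for the yeast protein interaction network, whose size (N = 2,018, L = 2,930) is quoted in the chapter summary. The function name max_links and the ratio called density are our own illustrative choices, not notation used in the text.

```python
def max_links(n_nodes):
    """Number of links in a complete graph with n_nodes nodes, Eq. (2.12)."""
    return n_nodes * (n_nodes - 1) // 2

# Complete graph of Figure 2.6.
print(max_links(16))  # 120

# Yeast protein interaction network (N and L as quoted in the chapter summary).
N, L = 2018, 2930
l_max = max_links(N)
density = L / l_max   # fraction of all possible links that are present

print(l_max)  # 2035153
print(f"{density:.4f}")  # 0.0014
```

Only about 0.14% of the possible links are present in this example, which is what sparseness means in practice.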
Finally, one can also define multipartite networks, like the tripartite recipe-ingredient-compound network shown in Figure 2.11.

Online Resource 2.2 Human Disease Network
Download the high resolution version of the Human Disease Network [1], or explore it using the online interface built by the New York Times.
Figure 2.10 Human Disease Network
(a) One projection of the diseasome is the gene network, whose nodes are genes, and where two genes are connected if they are associated with the same disease.
(b) The Human Disease Network (or diseasome) is a bipartite network, whose nodes are diseases (U) and genes (V). A disease is connected to a gene if mutations in that gene are known to affect the particular disease [4].
(c) The second projection is the disease network, whose nodes are diseases. Two diseases are connected if the same genes are associated with them, indicating that the two diseases have a common genetic origin. Panels (a)-(c) show a subset of the diseasome, focusing on cancers.
(d) The full diseasome, connecting 1,283 disorders via 1,777 shared disease genes. After [1]. See Online Resource 2.2 for the detailed map.
[Panel titles: Human Disease Network; Diseasome (Disease Phenome, Disease Genome); Disease Gene Network]
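The two projections of a bipartite network described for the Human Disease Network can be sketched in a few lines of Python. The disease and gene labels below are hypothetical placeholders, not actual disease-gene associations, and the function names are ours; the sketch only illustrates the projection rule (two nodes of one set are linked if they share a neighbor in the other set).

```python
from itertools import combinations

# Hypothetical bipartite network: each disease (set U) lists its genes (set V).
disease_genes = {
    "disease_A": {"gene_1", "gene_2"},
    "disease_B": {"gene_2", "gene_3"},
    "disease_C": {"gene_4"},
}

def project_onto_diseases(disease_genes):
    """Link two diseases if at least one gene is associated with both."""
    links = set()
    for d1, d2 in combinations(sorted(disease_genes), 2):
        if disease_genes[d1] & disease_genes[d2]:
            links.add((d1, d2))
    return links

def project_onto_genes(disease_genes):
    """Link two genes if they are associated with the same disease."""
    links = set()
    for genes in disease_genes.values():
        for g1, g2 in combinations(sorted(genes), 2):
            links.add((g1, g2))
    return links

print(sorted(project_onto_diseases(disease_genes)))
# [('disease_A', 'disease_B')]
print(sorted(project_onto_genes(disease_genes)))
# [('gene_1', 'gene_2'), ('gene_2', 'gene_3')]
```

Note that disease_C and gene_4 remain isolated in both projections, since they share no neighbor with any other node of their own set.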
Figure 2.11 Tripartite Network
(a) The construction of the tripartite recipe-ingredient-compound network, in which one set of nodes are recipes, like Chicken Marsala; the second set corresponds to the ingredients each recipe has (like flour, sage, chicken, wine, and butter for Chicken Marsala); the third set captures the flavor compounds, or chemicals that contribute to the taste of each ingredient.
(b) The ingredient or the flavor network represents a projection of the tripartite network. Each node denotes an ingredient; the node color indicates the food category and the node size indicates the ingredient's prevalence in recipes. Two ingredients are connected if they share a significant number of flavor compounds. Link thickness represents the number of shared compounds.
After [11].
[Panel (a) columns: Recipes, Ingredients, Compounds. Panel (b) legend: Categories (Fruits, Dairy, Spices, Alcoholic beverages, Nuts and seeds, Seafoods, Meats, Herbs, Plant derivatives, Vegetables, Flowers, Animal products, Plants, Cereal); Prevalence; Shared compounds.]
SECTION 2.8
PATHS AND DISTANCES
Physical distance plays a key role in determining the interactions between the components of physical systems. For example, the distance between two atoms in a crystal or between two galaxies in the universe determines the forces that act between them.

In networks distance is a challenging concept. Indeed, what is the distance between two webpages, or between two individuals who do not know each other? The physical distance is not relevant here: two webpages could be sitting on computers on opposite sides of the globe, yet have a link to each other. At the same time two individuals who live in the same building may not know each other.

In networks physical distance is replaced by path length. A path is a route that runs along the links of the network. A path's length represents the number of links the path contains (Figure 2.12a). Note that some texts require that each node a path visits is distinct.

In network science paths play a central role. Next we discuss some of their most important properties, many more being summarized in Figure 2.13.

SHORTEST PATH
The shortest path between nodes i and j is the path with the fewest links (Figure 2.12b). The length of the shortest path is often called the distance between nodes i and j, and is denoted by dij, or simply d. We can have multiple shortest paths of the same length d between a pair of nodes (Figure 2.12b). The shortest path never contains loops or intersects itself.

In an undirected network dij = dji, i.e. the distance between node i and j is the same as the distance between node j and i. In a directed network often dij ≠ dji. Furthermore, in a directed network the existence of a path from node i to node j does not guarantee the existence of a path from j to i.

In real networks we often need to determine the distance between two nodes. For a small network, like the one shown in Figure 2.12, this is an easy task. For a network with millions of nodes finding the shortest path between two nodes can be rather time consuming. The length of the shortest path and the number of such paths can be formally obtained from the adjacency matrix (BOX 2.4). In practice we use the breadth-first search (BFS) algorithm discussed in BOX 2.5 for this purpose.

Figure 2.12 Paths
(a) A path between nodes $i_0$ and $i_n$ is an ordered list of n links $P = \{(i_0, i_1), (i_1, i_2), (i_2, i_3), \ldots, (i_{n-1}, i_n)\}$. The length of this path is n. The path shown in orange in (a) follows the route 1→2→5→7→4→6, hence its length is n = 5.
(b) The shortest paths between nodes 1 and 7, or the distance d17, correspond to the path with the fewest links that connect nodes 1 to 7. There can be multiple paths of the same length, as illustrated by the two paths shown in orange and grey. The network diameter is the largest distance in the network, being dmax = 3 here.

FIG. 2.13 PATHOLOGY

Path
A sequence of nodes such that each node is connected to the next node along the path by a link. Each path consists of n + 1 nodes and n links. The length of a path is the number of its links, counting multiple links multiple times. For example, the orange line 1 → 2 → 5 → 4 → 3 covers a path of length four.

Shortest Path (Geodesic Path, d)
The path with the shortest distance d between two nodes. We also call d the distance between two nodes. Note that the shortest path does not need to be unique: between nodes 1 and 4 we have two shortest paths, 1 → 2 → 3 → 4 (blue) and 1 → 2 → 5 → 4 (orange), having the same length d1,4 = 3.

Diameter (dmax)
The longest shortest path in a graph, or the distance between the two furthest nodes. In the graph shown here the diameter is between nodes 1 and 4, hence dmax = 3.

Average Path Length (〈d〉)
The average of the shortest paths between all pairs of nodes. For the graph shown in the figure we have 〈d〉 = 1.6, calculated as

$$\langle d \rangle = (d_{1\to2} + d_{1\to3} + d_{1\to4} + d_{1\to5} + d_{2\to3} + d_{2\to4} + d_{2\to5} + d_{3\to4} + d_{3\to5} + d_{4\to5})/10 = 1.6$$

Cycle
A path with the same start and end node. In the graph shown in the figure we have only one cycle, as shown by the orange line.

Eulerian Path
A path that traverses each link exactly once. The image shows two such Eulerian paths, one in orange and the other in blue.

Hamiltonian Path
A path that visits each node exactly once. We show two Hamiltonian paths in orange and in blue.

BOX 2.4
NUMBER OF SHORTEST PATHS BETWEEN TWO NODES

The number of shortest paths, Nij, and the distance dij between nodes i and j can be calculated directly from the adjacency matrix Aij.

dij = 1: If there is a direct link between i and j, then Aij = 1 (Aij = 0 otherwise).

dij = 2: If there is a path of length two between i and j, then Aik Akj = 1 (Aik Akj = 0 otherwise). The number of dij = 2 paths between i and j is

$$N_{ij}^{(2)} = \sum_{k=1}^{N} A_{ik} A_{kj} = [A^2]_{ij}$$

where [...]ij denotes the (ij)th element of a matrix.

dij = d: If there is a path of length d between i and j, then Aik ... Alj = 1 (Aik ... Alj = 0 otherwise). The number of paths of length d between i and j is

$$N_{ij}^{(d)} = [A^d]_{ij}.$$

These equations hold for directed and undirected networks. The distance between nodes i and j is the smallest d for which $N_{ij}^{(d)} > 0$. Despite the elegance of this approach, faced with a large network, it is more efficient to use the breadth-first search algorithm described in BOX 2.5.
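The matrix-power recipe of BOX 2.4 can be sketched as follows. This is a minimal illustration in plain Python, not an efficient implementation: the helper names mat_mul and distance_and_path_count are ours, and the five-node link list is an assumption, inferred from the paths named in Figure 2.13 (1→2→3→4 and 1→2→5→4) rather than given explicitly in the text.

```python
def mat_mul(X, Y):
    """Product of two square matrices stored as lists of lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def distance_and_path_count(A, i, j):
    """Return (d_ij, N_ij) for i != j: the smallest d with [A^d]_ij > 0
    and the value of that entry. Returns (None, 0) if no path exists."""
    n = len(A)
    power = [row[:] for row in A]      # holds A^d, starting with d = 1
    for d in range(1, n):              # a shortest path has at most n - 1 links
        if power[i][j] > 0:
            return d, power[i][j]
        power = mat_mul(power, A)
    return None, 0

# Assumed undirected link list for the five-node graph of Figure 2.13.
links = [(1, 2), (2, 3), (3, 4), (2, 5), (4, 5)]
N = 5
A = [[0] * N for _ in range(N)]
for a, b in links:
    A[a - 1][b - 1] = 1
    A[b - 1][a - 1] = 1

# Nodes 1 and 4 (indices 0 and 3): two shortest paths of length three.
print(distance_and_path_count(A, 0, 3))  # (3, 2)
```

Each matrix product costs on the order of N³ operations in this naive form, which is why the approach is impractical for large networks.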
NETWORK DIAMETER
The diameter of a network, denoted by dmax, is the maximum shortest path in the network. In other words, it is the largest distance recorded between any pair of nodes. One can verify that the diameter of the network shown in Figure 2.13 is dmax = 3. For larger networks the diameter can be determined using the BFS algorithm described in BOX 2.5.

AVERAGE PATH LENGTH
The average path length, denoted by 〈d〉, is the average distance between all pairs of nodes in the network. For a directed network of N nodes, 〈d〉 is

$$\langle d \rangle = \frac{1}{N(N-1)} \sum_{\substack{i,j=1,N \\ i \neq j}} d_{i,j}.$$ (2.14)
Note that (2.14) is measured only for node pairs that are in the same component (SECTION 2.9). We can use the BFS algorithm to determine the average path length for a large network. For this we first determine the distances between the first node and all other nodes in the network using the algorithm described in BOX 2.5. We then determine the distances between the second node and all other nodes but the first one (if the network is undirected). We then repeat this procedure for all nodes. The sum
BOX 2.5
BREADTH-FIRST SEARCH (BFS) ALGORITHM

BFS is a frequently used algorithm in network science. Similar to throwing a pebble in a pond and watching the ripples spread from it, BFS starts from a node and labels its neighbors, then the neighbors' neighbors, until it reaches the target node. The number of “ripples” needed to reach the target provides the distance.

The identification of the shortest path between node i and j follows the following steps (Figure 2.14):

1. Start at node i, that we label with “0”.
2. Find the nodes directly linked to i. Label them distance “1” and put them in a queue.
3. Take the first node, labeled n, out of the queue (n = 1 in the first step). Find the unlabeled nodes adjacent to it in the graph. Label them with n + 1 and put them in the queue.
4. Repeat step 3 until you find the target node j or there are no more nodes in the queue.
5. The distance between i and j is the label of j. If j does not have a label, then dij = ∞.

The computational complexity of the BFS algorithm, representing the approximate number of steps the computer needs to find dij on a network of N nodes and L links, is O(N + L). It is linear in N and L as each node needs to be entered and removed from the queue at most once, and each link has to be tested only once.

Figure 2.14 Applying the BFS Algorithm
(a) Starting from the orange node, labeled “0”, we identify all its neighbors, labeling them “1”.
(b)-(d) Next we label “2” the unlabeled neighbors of all nodes labeled “1”, and so on, in each iteration increasing the label number, until no node is left unlabeled. The length of the shortest path or the distance d0i between node 0 and any other node i in the network is given by the label of node i. For example, the distance between node 0 and the leftmost node is d = 3.
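The steps of BOX 2.5 can be sketched in Python as shown below. This is one possible reading, under stated assumptions: instead of stopping at a target node, the function labels every node reachable from the source, which also lets us obtain the diameter and the average path length by repeating the search from each node. The function name bfs_distances and the five-node link list (inferred from the paths named in Figure 2.13) are our own illustrative choices.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Label every node reachable from source with its distance from it.
    Nodes missing from the returned dict are unreachable (d = infinity)."""
    distance = {source: 0}            # step 1: the source gets label 0
    queue = deque([source])
    while queue:                      # steps 3-4: process the queue in order
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in distance:   # only unlabeled nodes receive a label
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance

# Assumed undirected link list for the five-node graph of Figure 2.13.
links = [(1, 2), (2, 3), (3, 4), (2, 5), (4, 5)]
neighbors = {n: [] for n in range(1, 6)}
for a, b in links:
    neighbors[a].append(b)
    neighbors[b].append(a)

# One BFS per node gives all pairwise distances.
all_distances = {s: bfs_distances(neighbors, s) for s in neighbors}
pair_distances = [all_distances[i][j]
                  for i in neighbors for j in neighbors if i < j]

diameter = max(pair_distances)
average_path_length = sum(pair_distances) / len(pair_distances)

print(all_distances[1][4])   # 3
print(diameter)              # 3
print(average_path_length)   # 1.6
```

Using a double-ended queue keeps both the insertion and the removal of a node at constant cost, consistent with the O(N + L) complexity quoted in the box.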
SECTION 2.9
CONNECTEDNESS
A phone would be of limited use as a communication device if we could not call any valid phone number; email would be rather useless if we could send emails to only certain email addresses, and not to others. From a network perspective this means that the network behind the phone or the Internet must be capable of establishing a path between any two nodes. This is in fact the key utility of most networks: they ensure connectedness. In this section we discuss the graph-theoretic formulation of connectedness.

In an undirected network nodes i and j are connected if there is a path between them. They are disconnected if such a path does not exist, in which case we have dij = ∞. This is illustrated in Figure 2.15a, which shows a network consisting of two disconnected clusters. While there are paths between any two nodes on the same cluster (for example nodes 4 and 6), there are no paths between nodes that belong to different clusters (nodes 1 and 6).

A network is connected if all pairs of nodes in the network are connected. A network is disconnected if there is at least one pair with dij = ∞. Clearly the network shown in Figure 2.15a is disconnected, and we call its two subnetworks components or clusters. A component is a subset of nodes in a network, so that there is a path between any two nodes that belong to the component, but one cannot add any more nodes to it that would have the same property.

If a network consists of two components, a properly placed single link can connect them, making the network connected (Figure 2.15b). Such a link is called a bridge. In general a bridge is any link that, if cut, disconnects the network.

While for a small network visual inspection can help us decide if it is connected or disconnected, for a network consisting of millions of nodes connectedness is a challenging question. Mathematical and algorithmic tools can help us identify the connected components of a graph. For example, for a disconnected network the adjacency matrix can be rearranged into a block diagonal form, such that all nonzero elements in the matrix are contained in square blocks along the matrix's diagonal and all other elements are zero (Figure 2.15a). Each square block corresponds to a component. We can use the tools of linear algebra to decide if the adjacency matrix is block diagonal, helping us to identify the connected components. In practice, for large networks the components are more efficiently identified using the BFS algorithm (BOX 2.6).
Figure 2.15 Connected and Disconnected Networks
(a) A small network consisting of two disconnected components. Indeed, there is a path between any pair of nodes in the (1,2,3) component, as well as in the (4,5,6,7) component. However, there are no paths between nodes that belong to different components. The right panel shows the adjacency matrix of the network. If the network has disconnected components, the adjacency matrix can be rearranged into a block diagonal form, such that all nonzero elements of the matrix are contained in square blocks along the diagonal of the matrix and all other elements are zero.
(b) The addition of a single link, called a bridge, shown in grey, turns a disconnected network into a single connected component. Now there is a path between every pair of nodes in the network. Consequently the adjacency matrix cannot be written in a block diagonal form.
BOX 2.6
FINDING THE CONNECTED COMPONENTS OF A NETWORK

1. Start from a randomly chosen node i and perform a BFS (BOX 2.5). Label all nodes reached this way with n = 1.
2. If the total number of labeled nodes equals N, then the network is connected. If the number of labeled nodes is smaller than N, the network consists of several components. To identify them, proceed to step 3.
3. Increase the label n → n + 1. Choose an unmarked node j, label it with n. Use BFS to find all nodes reachable from j, label them all with n. Return to step 2.
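The component-labeling procedure of BOX 2.6 can be sketched as below. Two assumptions are worth flagging: the starting nodes are taken in a fixed order rather than at random (the resulting components are the same), and the links inside each component are hypothetical, since only the component membership, (1,2,3) and (4,5,6,7), is taken from Figure 2.15a. The function names and the bridge (3, 4) are likewise ours.

```python
from collections import deque

def build_neighbors(nodes, links):
    """Neighbor sets of an undirected network given as a link list."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def label_components(neighbors):
    """Assign the same integer label to all nodes of one component."""
    label = {}
    n = 0
    for start in neighbors:
        if start in label:            # already reached by an earlier BFS
            continue
        n += 1                        # step 3: new label for a new component
        label[start] = n
        queue = deque([start])
        while queue:                  # BFS from the unmarked node
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in label:
                    label[nxt] = n
                    queue.append(nxt)
    return label

nodes = range(1, 8)
# Hypothetical links consistent with the two components of Figure 2.15a.
links = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 7), (6, 7)]

labels = label_components(build_neighbors(nodes, links))
print(sorted(labels.items()))
# [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2)]

# Adding a single bridge, here the assumed link (3, 4), connects the network.
bridged = label_components(build_neighbors(nodes, links + [(3, 4)]))
print(max(bridged.values()))  # 1
```

Because every node and every link is visited once across all searches, the total cost remains linear in N and L.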
SECTION 2.10
CLUSTERING COEFFICIENT
The clustering coefficient captures the degree to which the neighbors of a given node link to each other. For a node i with degree ki the local clustering coefficient is defined as [12]

$$C_i = \frac{2 L_i}{k_i (k_i - 1)}$$ (2.15)

where Li represents the number of links between the ki neighbors of node i. Note that Ci is between 0 and 1 (Figure 2.16a):

• Ci = 0 if none of the neighbors of node i link to each other.
• Ci = 1 if the neighbors of node i form a complete graph, i.e. they all link to each other.
• Ci is the probability that two neighbors of a node link to each other. Consequently C = 0.5 implies that there is a 50% chance that two neighbors of a node are linked.

In summary Ci measures the network's local link density: the more densely interconnected the neighborhood of node i, the higher is its local clustering coefficient.

The degree of clustering of a whole network is captured by the average clustering coefficient, 〈C〉, representing the average of Ci over all nodes i = 1, ..., N [12],

$$\langle C \rangle = \frac{1}{N} \sum_{i=1}^{N} C_i.$$ (2.16)

In line with the probabilistic interpretation 〈C〉 is the probability that two neighbors of a randomly selected node link to each other.

While (2.16) is defined for undirected networks, the clustering coefficient can be generalized to directed and weighted [13, 14, 15, 16] networks as well. In the network literature we may encounter the global clustering coefficient as well, discussed in ADVANCED TOPICS 2.A.

Figure 2.16 Clustering Coefficient
(a) The local clustering coefficient, Ci, of the central node with degree ki = 4 for three different configurations of its neighborhood. The local clustering coefficient measures the local density of links in a node's vicinity.
(b) A small network, with the local clustering coefficient of each node shown next to it. We also list the network's average clustering coefficient 〈C〉, according to (2.16), and its global clustering coefficient CΔ, defined in SECTION 2.12, Eq. (2.17). Note that for nodes with degrees ki = 0, 1, the clustering coefficient is zero.
[Panel (a) labels: Ci = 0, Ci = 1/2, Ci = 1. Panel (b) labels: node values 0, 0, 0, 1/6, 1/3, 2/3, 1; 〈C〉 = 13/42 ≈ 0.310; CΔ = 3/8 = 0.375]
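A short Python sketch may help to see Eqs. (2.15) and (2.16) at work. It evaluates the local clustering coefficient of a central node with four neighbors for three neighborhoods, in the spirit of Figure 2.16a. The exact placement of the links among the neighbors is an assumption (the figure is not reproduced here), and the function names are ours.

```python
from itertools import combinations

def build_neighbors(n_nodes, links):
    """Neighbor sets of an undirected network with nodes 0..n_nodes-1."""
    neighbors = {n: set() for n in range(n_nodes)}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def local_clustering(neighbors, i):
    """C_i = 2 L_i / (k_i (k_i - 1)); zero for nodes with degree 0 or 1."""
    k = len(neighbors[i])
    if k < 2:
        return 0.0
    # L_i: number of links among the neighbors of node i.
    links_between = sum(1 for a, b in combinations(neighbors[i], 2)
                        if b in neighbors[a])
    return 2 * links_between / (k * (k - 1))

def average_clustering(neighbors):
    """Average of C_i over all nodes, Eq. (2.16)."""
    return sum(local_clustering(neighbors, i) for i in neighbors) / len(neighbors)

# Central node 0 linked to neighbors 1..4.
spokes = [(0, 1), (0, 2), (0, 3), (0, 4)]
no_links = build_neighbors(5, spokes)
three_links = build_neighbors(5, spokes + [(1, 2), (2, 3), (3, 4)])
all_links = build_neighbors(5, spokes + list(combinations([1, 2, 3, 4], 2)))

print(local_clustering(no_links, 0))     # 0.0
print(local_clustering(three_links, 0))  # 0.5
print(local_clustering(all_links, 0))    # 1.0
print(round(average_clustering(three_links), 4))  # 0.7667
```

In the middle configuration three of the six possible links among the four neighbors are present, giving Ci = 2·3/(4·3) = 1/2.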
SECTION 2.11
SUMMARY
The crash course offered in this chapter introduced some of the basic graph theoretical concepts and tools used in network science. The set of elementary network characteristics, summarized in Figure 2.17, offers a formal language through which we can explore networks.

Many of the networks we study in network science consist of thousands or even millions of nodes and links (Table 2.1). To explore them, we need to go beyond the small graphs shown in Figure 2.17. A glimpse of what we are about to encounter is offered by the protein-protein interaction network of yeast (Figure 2.4a). The network is too complex to understand its properties through a visual inspection of its wiring diagram. We therefore need to turn to the tools of network science to characterize its topology.

Let us use the measures we introduced so far to explore some basic characteristics of this network. The undirected network, shown in Figure 2.4a, has N = 2,018 proteins as nodes and L = 2,930 binding interactions as links. Hence its average degree, according to (2.2), is 〈k〉 = 2.90, suggesting that a typical protein interacts with approximately two to three other proteins. Yet, this number is somewhat misleading. Indeed, the degree distribution pk shown in Figure 2.4b,c indicates that the vast majority of nodes have only a few links. To be precise, in this network 69% of nodes have fewer than three links, i.e. for these k < 〈k〉. These numerous nodes with few links coexist with a few highly connected nodes, or hubs, the largest having as many as 92 links. Such wide differences in node degrees are a consequence of the network's scale-free property, discussed in CHAPTER 4. We will see that the shape of the degree distribution determines a wide range of network properties, from the network's robustness to the spread of viruses.

The breadth-first search algorithm (BOX 2.5) helps us determine the network's diameter, finding dmax = 14. We might be tempted to expect wide variations in d, as some nodes are close to each other, while others may be quite far. The distance distribution (Figure 2.18a) indicates otherwise: pd has a prominent peak between 5 and 6, telling us that most distances are rather short, being in the vicinity of 〈d〉 = 5.61. Also, pd decays fast for large d, suggesting that large distances are absent. Indeed, the variance of the distances is σd = 1.64, indicating that most path lengths are in the close vicinity of 〈d〉. These are manifestations of the small world property discussed in CHAPTER 3.

The breadth-first search algorithm also tells us that the protein interaction network is not connected, but consists of 185 components, shown as isolated clusters and nodes in Figure 2.4a. The largest, called the giant component, contains 1,647 of the 2,018 nodes; all other components are tiny. As we will see in the coming chapters, such fragmentation is common in real networks.

The average clustering coefficient of the protein interaction network is 〈C〉 = 0.12, which, as we will come to appreciate in the coming chapters, indicates a significant degree of local clustering. A further caveat is provided by the dependence of the clustering coefficient on the node's degree, or the C(k) function (Figure 2.18b). The fact that C(k) decreases for large k indicates that the local clustering coefficient of the small nodes is significantly higher than the local clustering coefficient of the hubs. Hence the small degree nodes are located in dense local network neighborhoods, while the neighborhood of the hubs is much sparser. This is a consequence of hierarchy, a network property discussed in CHAPTER 9.

Finally, a visual inspection reveals an interesting pattern: hubs have a tendency to connect to small nodes, giving the network a hub and spoke character (Figure 2.4a). This is a consequence of degree correlations, discussed in CHAPTER 7. Such correlations influence a number of network based processes, from spreading phenomena to the number of driver nodes needed to control a network.

Taken together, Figures 2.4 and 2.18 illustrate that the quantities we introduced in this chapter can help us diagnose several key properties of real networks. The purpose of the coming chapters is to study systematically these network characteristics and understand what they tell us about a particular complex system.

Figure 2.18 Characterizing a Real Network
The protein-protein interaction (PPI) network of yeast is frequently studied by biologists and network scientists. The detailed wiring diagram of the network is shown in Figure 2.4a. The figure indicates that the network, consisting of N = 2,018 nodes and L = 2,930 links, has a large component that connects 81% of the proteins. We also have several smaller components and numerous isolated proteins that do not interact with any other node.
(a) The distance distribution, pd, for the PPI network, providing the probability that two randomly chosen nodes have a distance d between them (shortest path). The grey vertical line shows the average path length, which is 〈d〉 = 5.61.
(b) The dependence of the average local clustering coefficient on the node's degree, k. The C(k) function is obtained by averaging over the local clustering coefficient of all nodes with the same degree k.
[Axes: (a) pd versus d; (b) C(k) versus k]
FIG. 2.17 GRAPHOLOGY

In network science we often distinguish networks by some elementary property of the underlying graph. Here we summarize the most commonly encountered network types. We also list real systems that share the particular property. Note that many real networks combine several of these elementary network characteristics. For example the WWW is a directed multigraph with self-interactions; the mobile call network is directed and weighted, without self-loops.

[Figure panels (a)-(f): each network type is illustrated by a small example network and its adjacency matrix $A_{ij}$.]

Undirected Network
A network whose links do not have a defined direction. Here $A_{ij} = A_{ji}$, $A_{ii} = 0$, $L = \frac{1}{2}\sum_{i,j=1}^{N} A_{ij}$ and $\langle k \rangle = 2L/N$.
Examples: Internet, power grid, science collaboration networks.

Directed Network
A network whose links have selected directions. Here $A_{ij} \neq A_{ji}$, $L = \sum_{i,j=1}^{N} A_{ij}$ and $\langle k \rangle = L/N$.
Examples: WWW, mobile phone calls, citation network.

Weighted Network
A network whose links have a defined weight, strength or flow parameter. The elements of the adjacency matrix are $A_{ij} = w_{ij}$ if there is a link with weight $w_{ij}$ between nodes i and j. For unweighted (binary) networks, the adjacency matrix only indicates the presence ($A_{ij} = 1$) or the absence ($A_{ij} = 0$) of a link.
Examples: Mobile phone calls, email network.

Self-loops
In many networks nodes do not interact with themselves, so the diagonal elements of the adjacency matrix are zero, $A_{ii} = 0$, i = 1,..., N. In some systems self-interactions are allowed; in such networks, self-loops represent the fact that node i interacts with itself. In that case $\exists\, i,\ A_{ii} \neq 0$ and $L = \frac{1}{2}\sum_{i,j=1,\, i \neq j}^{N} A_{ij} + \sum_{i=1}^{N} A_{ii}$.
Examples: WWW, protein interactions.

Multigraph/Simple Graphs
In a multigraph nodes are permitted to have multiple links (or parallel links) between them. Hence $A_{ij}$ can be any positive integer. Networks that do not allow multiple links are called simple.
Multigraph Examples: Social networks, where we distinguish friendship, family and professional ties.

Complete Graph (Clique)
In a complete graph, or a clique, all nodes are connected to each other. Here $A_{i \neq j} = 1$, $A_{ii} = 0$, $L = L_{max} = N(N-1)/2$ and $\langle k \rangle = N - 1$.
Examples: Actors in the cast of the same movie, as they are all linked to each other in the actor network.
SECTION 2.12
HOMEWORK
2.1. Königsberg Problem
Which of the icons in Figure 2.19 can be drawn without raising your pencil from the paper, and without drawing any line more than once? Why?

Figure 2.19 Königsberg Problem
[Four icons, labeled (a) to (d).]

2.2. Matrix Formalism
Let A be the N x N adjacency matrix of an undirected unweighted network, without self-loops. Let 1 be a column vector of N elements, all equal to 1. In other words $\mathbf{1} = (1, 1, ..., 1)^T$, where the superscript T indicates the transpose operation. Use the matrix formalism (multiplicative constants, multiplication row by column, matrix operations like transpose and trace, etc, but avoid the sum symbol ∑) to write expressions for:
(a) The vector k whose elements are the degrees ki of all nodes i = 1, 2,..., N.
(b) The total number of links, L, in the network.
(c) The number of triangles T present in the network, where a triangle means three nodes, each connected by links to the other two (Hint: you can use the trace of a matrix).
(d) The vector knn whose element i is the sum of the degrees of node i's neighbors.
(e) The vector knnn whose element i is the sum of the degrees of node i's second neighbors.
2.3. Graph Representation
The adjacency matrix is a useful graph representation for many analytical calculations. However, when we need to store a network in a computer, we can save memory by storing the list of links in an L x 2 matrix, whose rows contain the starting and end point i and j of each link. Construct for the networks (a) and (b) in Figure 2.20:
(a) The corresponding adjacency matrices.
(b) The corresponding link lists.
(c) Determine the average clustering coefficient of the network shown in Figure 2.20a.
(d) If you switch the labels of nodes 5 and 6 in Figure 2.20a, how does that move change the adjacency matrix? And the link list?
(e) What kind of information can you not infer from the link list representation of the network that you can infer from the adjacency matrix?
(f) In the (a) network, how many paths (with possible repetition of nodes and links) of length 3 exist starting from node 1 and ending at node 3? And in (b)?
(g) With the help of a computer, count the number of cycles of length 4 in both networks.

Figure 2.20 Graph Representation
(a) Undirected graph of 6 nodes and 7 links. (b) Directed graph of 6 nodes and 8 directed links.

2.4. Degree, Clustering Coefficient and Components
(a) Consider an undirected network of size N in which each node has degree k = 1. Which condition does N have to satisfy? What is the degree distribution of this network? How many components does the network have?
(b) Consider now a network in which each node has degree k = 2 and clustering coefficient C = 1. What does the network look like? What condition does N satisfy in this case?

2.5. Bipartite Networks
Consider the bipartite network of Figure 2.21.
(a) Construct its adjacency matrix. Why is it a block-diagonal matrix?
(b) Construct the adjacency matrix of its two projections, on the purple and on the green nodes, respectively.
(c) Calculate the average degree of the purple nodes and the average degree of the green nodes in the bipartite network.
(d) Calculate the average degree in each of the two network projections. Is it surprising that the values are different from those obtained in point (c)?

Figure 2.21 Bipartite network
Bipartite network with 6 nodes in one set and 5 nodes in the other, connected by 10 links.

2.6. Bipartite Networks - General Considerations
Consider a bipartite network with N1 and N2 nodes in the two sets.
(a) What is the maximum number of links Lmax the network can have?
(b) How many links cannot occur compared to a non-bipartite network of size N = N1 + N2?
(c) If N1 ≪ N2, what can you say about the network density, that is the total number of links over the maximum number of links, Lmax?
(d) Find an expression connecting N1, N2 and the average degree for the two sets in the bipartite network, 〈k1〉 and 〈k2〉.
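Problem 2.3 contrasts the adjacency matrix with the link list. As a minimal sketch of how the two representations relate, the Python snippet below converts between them for a small undirected network. The four-node example network and the function names are hypothetical, chosen only for illustration; it is not one of the networks in Figure 2.20.

```python
from typing import List, Tuple


def adjacency_to_link_list(adj: List[List[int]]) -> List[Tuple[int, int]]:
    """Return the L x 2 link list of an undirected, unweighted network.

    Nodes are labeled 1..N. Each link is listed once, as (i, j) with i < j,
    by scanning only the part of the matrix above the diagonal.
    """
    n = len(adj)
    return [(i + 1, j + 1) for i in range(n) for j in range(i + 1, n) if adj[i][j]]


def link_list_to_adjacency(links: List[Tuple[int, int]], n: int) -> List[List[int]]:
    """Rebuild the symmetric N x N adjacency matrix from a link list.

    The number of nodes n must be supplied separately: a link list does not
    mention nodes that have no links.
    """
    adj = [[0] * n for _ in range(n)]
    for i, j in links:
        adj[i - 1][j - 1] = 1
        adj[j - 1][i - 1] = 1
    return adj


# Hypothetical example network with N = 4 nodes and L = 4 links.
example_adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
example_links = adjacency_to_link_list(example_adj)
print(example_links)
```

The matrix stores N x N entries while the link list stores 2L, which is the memory saving the problem statement refers to when the network is sparse.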
SECTION 2.13
ADVANCED TOPICS 2.A GLOBAL CLUSTERING COEFFICIENT
In the network literature we occasionally encounter the global clustering coefficient, which measures the total number of closed triangles in a network. Indeed, Li in (2.15) is the number of triangles that node i participates in, as each link between two neighbors of node i closes a triangle (Figure 2.17). Hence the degree of a network’s global clustering can also be captured by the global clustering coefficient, defined as

$$C_\Delta = \frac{3 \times \text{Number of Triangles}}{\text{Number of Connected Triplets}}, \tag{2.17}$$

where a connected triplet is an ordered set of three nodes ABC such that A connects to B and B connects to C. For example, an A, B, C triangle is made of three triplets, ABC, BCA and CAB. In contrast a chain of connected nodes A, B, C, in which B connects to A and C, but A does not link to C, forms a single open triplet ABC. The factor three in the numerator of (2.17) is due to the fact that each triangle is counted three times in the triplet count. The roots of the global clustering coefficient go back to the social network literature of the 1940s [17, 18], where CΔ is often called the ratio of transitive triplets.

Note that the average clustering coefficient 〈C〉 defined in (2.16) and the global clustering coefficient (2.17) are not equivalent. Indeed, take a network that is a double star, consisting of N nodes, where nodes 1 and 2 are joined to each other and to all other nodes, and there are no other links. Then the local clustering coefficient Ci is 1 for i ≥ 3 and 2/(N − 1) for i = 1, 2. It follows that the average clustering coefficient of the network is 〈C〉 = 1 − O(1/N), while the global clustering coefficient is CΔ ~ 2/N. In less extreme networks the two definitions will give more comparable values, but they still differ from each other [19]. For example, for the network in Figure 2.16b we have 〈C〉 = 0.31 and CΔ = 0.375.
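The double star example can be checked numerically. The sketch below is an illustration only: the function names are hypothetical and nodes are labeled from 0 rather than 1. It builds the double star and computes both coefficients, counting one connected triplet for each node and each pair of its neighbors, in line with the triplet definition above. With that counting the global coefficient of the double star comes out as exactly 3/N, which shares the 1/N scaling quoted in the text, while the average coefficient stays close to 1.

```python
from itertools import combinations


def double_star(n):
    """Adjacency sets of a double star: nodes 0 and 1 link to each other and to all others."""
    adj = {i: set() for i in range(n)}
    adj[0].add(1)
    adj[1].add(0)
    for i in range(2, n):
        for hub in (0, 1):
            adj[hub].add(i)
            adj[i].add(hub)
    return adj


def links_between_neighbors(adj, i):
    """Number of links among the neighbors of node i (triangles through i)."""
    return sum(1 for a, b in combinations(adj[i], 2) if b in adj[a])


def local_clustering(adj, i):
    k = len(adj[i])
    if k < 2:
        return 0.0
    return 2 * links_between_neighbors(adj, i) / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, i) for i in adj) / len(adj)


def global_clustering(adj):
    # Summing triangles through every node counts each triangle three times,
    # which supplies the factor 3 of the numerator.
    closed = sum(links_between_neighbors(adj, i) for i in adj)
    triplets = sum(len(adj[i]) * (len(adj[i]) - 1) // 2 for i in adj)
    return closed / triplets if triplets else 0.0


N = 100
star = double_star(N)
print(average_clustering(star), global_clustering(star))
```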
SECTION 2.14
BIBLIOGRAPHY
[1] K.-I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.-L. Barabási. The human disease network. PNAS, 104:8685–8690, 2007.
[2] H.U. Obrist. Mapping it out: An alternative atlas of contemporary cartographies. Thames and Hudson, London, 2014.
[3] I. Meirelles. Design for Information. Rockport, 2013.
[4] K. Börner. Atlas of Science: Visualizing What We Know. The MIT Press, 2010.
[5] L. B. Larsen. Networks: Documents of Contemporary Art. MIT Press, 2014.
[6] L. Euler. Solutio Problematis ad Geometriam Situs Pertinentis. Commentarii Academiae Scientiarum Imperialis Petropolitanae, 8:128–140, 1741.
[7] G. Alexanderson. Euler and Königsberg’s bridges: a historical view. Bulletin of the American Mathematical Society, 43:567, 2006.
[8] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
[9] G. Gilder. Metcalfe’s law and legacy. Forbes ASAP, 1993.
[10] B. Briscoe, A. Odlyzko, and B. Tilly. Metcalfe’s law is wrong. IEEE Spectrum, 43:34–39, 2006.
[11] Y.-Y. Ahn, S. E. Ahnert, J. P. Bagrow, and A.-L. Barabási. Flavor network and the principles of food pairing. Scientific Reports, 196, 2011.
[12] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393:440–442, 1998.
[13] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani. The architecture of complex weighted networks. PNAS, 101:3747–3752, 2004.
[14] J. P. Onnela, J. Saramäki, J. Kertész, and K. Kaski. Intensity and coherence of motifs in weighted complex networks. Physical Review E, 71:065103, 2005.
[15] B. Zhang and S. Horvath. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4:17, 2005.
[16] P. Holme, S. M. Park, J. B. Kim, and C. R. Edling. Korean university life in a network perspective: Dynamics of a large affiliation network. Physica A, 373:821–830, 2007.
[17] R. D. Luce and A. D. Perry. A method of matrix analysis of group structure. Psychometrika, 14:95–116, 1949.
[18] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
[19] B. Bollobás and O. M. Riordan. Mathematical results on scale-free random graphs, in Stefan Bornholdt, Hans Georg Schuster, Handbook of Graphs and Networks: From the Genome to the Internet (2003 Wiley-VCH Verlag GmbH & Co. KGaA).
3 ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
RANDOM NETWORKS

ACKNOWLEDGEMENTS
MÁRTON PÓSFAI, GABRIELE MUSELLA, MAURO MARTINO, ROBERTA SINATRA, SARAH MORRISON, AMAL HUSSEINI, PHILIPP HOEVEL

INDEX
1 Introduction
2 The Random Network Model
3 Number of Links
4 Degree Distribution
5 Real Networks are Not Poisson
6 The Evolution of a Random Network
7 Real Networks are Supercritical
8 Small Worlds
9 Clustering Coefficient
10 Summary: Real Networks are Not Random
11 Homework
12 ADVANCED TOPICS 3.A Deriving the Poisson Distribution
13 ADVANCED TOPICS 3.B Maximum and Minimum Degrees
14 ADVANCED TOPICS 3.C Giant Component
15 ADVANCED TOPICS 3.D Component Sizes
16 ADVANCED TOPICS 3.E Fully Connected Regime
17 ADVANCED TOPICS 3.F Phase Transitions
18 ADVANCED TOPICS 3.G Small World Corrections
19 Bibliography

Figure 3.0 (cover image)
Erdős Number
The Hungarian mathematician Pál Erdős authored hundreds of research papers, many of them in collaboration with other mathematicians. His relentless collaborative approach to mathematics inspired the Erdős Number, which works like this: Erdős’ Erdős number is 0. Erdős’ coauthors have Erdős number 1. Those who have written a paper with someone with Erdős number 1 have Erdős number 2, and so on. If there is no chain of coauthorships connecting someone to Erdős, then that person’s Erdős number is infinite. Many famous scientists have low Erdős numbers: Albert Einstein has Erdős Number 2 and Richard Feynman has 3. The image shows the collaborators of Pál Erdős, as drawn in 1970 by Ronald Graham, one of Erdős’ close collaborators. As Erdős’ fame rose, this image has achieved an iconic status.

This work is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V32, 08.09.2014
SECTION 3.1
INTRODUCTION
Imagine organizing a party for a hundred guests who initially do not know each other [1]. Offer them wine and cheese and you will soon see them chatting in groups of two to three. Now mention to Mary, one of your guests, that the red wine in the unlabeled dark green bottles is a rare vintage, much better than the one with the fancy red label. If she shares this information only with her acquaintances, your expensive wine appears to be safe, as she only had time to meet a few others so far. The guests will continue to mingle, however, creating subtle paths between individuals that may still be strangers to each other. For example, while John has not yet met Mary, they have both met Mike, so there is an invisible path from John to Mary through Mike. As time goes on, the guests will be increasingly interwoven by such elusive links. With that the secret of the unlabeled bottle will pass from Mary to Mike and from Mike to John, escaping into a rapidly expanding group.

To be sure, when all guests had gotten to know each other, everyone would be pouring the superior wine. But if each encounter took only ten minutes, meeting all ninety-nine others would take about sixteen hours. Thus, you could reasonably hope that a few drops of your fine wine would be left for you to enjoy once the guests are gone. Yet, you would be wrong. In this chapter we show you why. We will see that the party maps into a classic model in network science called the random network model. And random network theory tells us that we do not have to wait until all individuals get to know each other for our expensive wine to be in danger. Rather, soon after each person meets at least one other guest, an invisible network will emerge that will allow the information to reach all of them. Hence in no time everyone will be enjoying the better wine.

Figure 3.1 From a Cocktail Party to Random Networks
The emergence of an acquaintance network through random encounters at a cocktail party.
(a) Early: Early on the guests form isolated groups.
(b) Later: As individuals mingle, changing groups, an invisible network emerges that connects all of them into a single network.
SECTION 3.2
THE RANDOM NETWORK MODEL

Network science aims to build models that reproduce the properties of real networks. Most networks we encounter do not have the comforting regularity of a crystal lattice or the predictable radial architecture of a spider web. Rather, at first inspection they look as if they were spun randomly (Figure 2.4). Random network theory embraces this apparent randomness by constructing and characterizing networks that are truly random.

From a modeling perspective a network is a relatively simple object, consisting of only nodes and links. The real challenge, however, is to decide where to place the links between the nodes so that we reproduce the complexity of a real system. In this respect the philosophy behind a random network is simple: We assume that this goal is best achieved by placing the links randomly between the nodes. That takes us to the definition of a random network (BOX 3.1):

A random network consists of N nodes where each node pair is connected with probability p.

To construct a random network we follow these steps:
1) Start with N isolated nodes.
2) Select a node pair and generate a random number between 0 and 1. If the number is smaller than p, connect the selected node pair with a link, otherwise leave them disconnected.
3) Repeat step (2) for each of the N(N−1)/2 node pairs.

The network obtained after this procedure is called a random graph or a random network. Two mathematicians, Pál Erdős and Alfréd Rényi, have played an important role in understanding the properties of these networks. In their honor a random network is called the Erdős-Rényi network (BOX 3.2).

BOX 3.1 DEFINING RANDOM NETWORKS
There are two definitions of a random network:

G(N, L) Model
N labeled nodes are connected with L randomly placed links. Erdős and Rényi used this definition in their string of papers on random networks [2–9].

G(N, p) Model
Each pair of N labeled nodes is connected with probability p, a model introduced by Gilbert [10].

Hence, the G(N, p) model fixes the probability p that two nodes are connected and the G(N, L) model fixes the total number of links L. While in the G(N, L) model the average degree of a node is simply 〈k〉 = 2L/N, other network characteristics are easier to calculate in the G(N, p) model. Throughout this book we will explore the G(N, p) model, not only for the ease with which it allows us to calculate key network characteristics, but also because in real networks the number of links rarely stays fixed.
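The construction steps of the G(N, p) model can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `random_network` and the `seed` parameter are assumptions introduced here, and nodes are labeled 0..N−1. A link is placed when the random number falls below p, which is what makes each pair connect with probability p.

```python
import random
from itertools import combinations


def random_network(n, p, seed=None):
    """Generate one realization of G(N, p) and return it as a list of links.

    Step 1: the n nodes 0..n-1 start isolated (no links).
    Step 2: for a node pair, draw a uniform random number in [0, 1) and
            connect the pair if the number is smaller than p.
    Step 3: repeat step 2 for each of the n(n-1)/2 node pairs.
    """
    rng = random.Random(seed)  # seed is only for reproducibility of the sketch
    links = []
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            links.append((i, j))
    return links


# Two realizations with the same parameters generally differ.
first = random_network(12, 1 / 6, seed=1)
second = random_network(12, 1 / 6, seed=2)
print(len(first), len(second))
```

Looping over all pairs costs N(N−1)/2 random draws, which is fine for small N; for large sparse networks faster sampling schemes exist, but they are outside the scope of this sketch.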
RANDOM NETWORKS
4
BOX 3.2 RANDOM NETWORKS: A BRIEF HISTORY

Anatol Rapoport (1911–2007), a Russian immigrant to the United States, was the first to study random networks. Rapoport turned to mathematics after realizing that a successful career as a concert pianist would require a wealthy patron. He focused on mathematical biology at a time when mathematicians and biologists hardly spoke to each other. In a paper written with Ray Solomonoff in 1951 [11], Rapoport demonstrated that if we increase the average degree of a network, we observe an abrupt transition from disconnected nodes to a graph with a giant component.

The study of random networks reached prominence thanks to the fundamental work of Pál Erdős and Alfréd Rényi (Figure 3.2). In a sequence of eight papers published between 1959 and 1968 [2–9], they merged probability theory and combinatorics with graph theory, establishing random graph theory, a new branch of mathematics [2].

The random network model was independently introduced by Edgar Nelson Gilbert (1923–2013) [10] the same year Erdős and Rényi published their first paper on the subject. Yet, the impact of Erdős and Rényi’s work is so overwhelming that they are rightly considered the founders of random graph theory.

“A mathematician is a device for turning coffee into theorems”
Alfréd Rényi (a quote often attributed to Erdős)

Figure 3.2
(a) Pál Erdős (1913–1996)
Hungarian mathematician known for both his exceptional scientific output and eccentricity. Indeed, Erdős published more papers than any other mathematician in the history of mathematics. He coauthored papers with over five hundred mathematicians, inspiring the concept of Erdős number. His legendary personality and profound professional impact have inspired two biographies [12, 13] and a documentary [14] (Online Resource 3.1).
(b) Alfréd Rényi (1921–1970)
Hungarian mathematician with fundamental contributions to combinatorics, graph theory, and number theory. His impact goes beyond mathematics: The Rényi entropy is widely used in chaos theory and the random network theory he codeveloped is at the heart of network science. He is remembered through the hotbed of Hungarian mathematics, the Alfréd Rényi Institute of Mathematics in Budapest.

Online Resource 3.1
N is a Number: A Portrait of Paul Erdős
The 1993 biographical documentary of Pál Erdős, directed by George Paul Csicsery, offers a glimpse into Erdős' life and scientific impact [14].
SECTION 3.3
NUMBER OF LINKS
Each random network generated with the same parameters N, p looks slightly different (Figure 3.3). Not only does the detailed wiring diagram change between realizations, but so does the number of links L. It is useful, therefore, to determine how many links we expect for a particular realization of a random network with fixed N and p.

The probability that a random network has exactly L links is the product of three terms:
1) The probability that L of the attempts to connect the N(N−1)/2 pairs of nodes have resulted in a link, which is $p^L$.
2) The probability that the remaining N(N−1)/2 − L attempts have not resulted in a link, which is $(1-p)^{N(N-1)/2-L}$.
3) A combinatorial factor,

$$\binom{N(N-1)/2}{L}, \tag{3.0}$$

counting the number of different ways we can place L links among N(N−1)/2 node pairs.

We can therefore write the probability that a particular realization of a random network has exactly L links as

$$p_L = \binom{N(N-1)/2}{L} p^L (1-p)^{N(N-1)/2-L}. \tag{3.1}$$

As (3.1) is a binomial distribution (BOX 3.3), the expected number of links in a random graph is

$$\langle L \rangle = \sum_{L=0}^{N(N-1)/2} L\, p_L = p\,\frac{N(N-1)}{2}. \tag{3.2}$$

Hence 〈L〉 is the product of the probability p that two nodes are connected and the number of pairs we attempt to connect, which is Lmax = N(N−1)/2 (CHAPTER 2).

Using (3.2) we obtain the average degree of a random network

$$\langle k \rangle = \frac{2\langle L \rangle}{N} = p(N-1). \tag{3.3}$$

Hence 〈k〉 is the product of the probability p that two nodes are connected and (N−1), which is the maximum number of links a node can have in a network of size N.

In summary the number of links in a random network varies between realizations. Its expected value is determined by N and p. If we increase p a random network becomes denser: The average number of links increases linearly from 〈L〉 = 0 to Lmax and the average degree of a node increases from 〈k〉 = 0 to 〈k〉 = N−1.
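The expected number of links and the average degree can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names, the seed and the choice of 2,000 realizations are made up for this example, while N = 12 and p = 1/6 are the parameters of the top row of Figure 3.3. It evaluates the binomial form (3.1) exactly and compares its mean with p N(N−1)/2 and with the average over simulated realizations.

```python
import random
from math import comb


def link_count_probability(n, p, links):
    """Probability (3.1) that a G(N, p) realization has exactly `links` links."""
    pairs = comb(n, 2)  # N(N-1)/2 node pairs
    return comb(pairs, links) * p**links * (1 - p) ** (pairs - links)


def sample_link_count(n, p, rng):
    """Number of links in one realization: one biased coin flip per node pair."""
    return sum(1 for _ in range(comb(n, 2)) if rng.random() < p)


n, p = 12, 1 / 6
pairs = comb(n, 2)

expected_links = p * n * (n - 1) / 2   # equation (3.2)
expected_degree = p * (n - 1)          # equation (3.3)

# Mean of the exact distribution (3.1).
exact_mean = sum(L * link_count_probability(n, p, L) for L in range(pairs + 1))

# Mean over simulated realizations (illustrative sample size and seed).
rng = random.Random(42)
samples = [sample_link_count(n, p, rng) for _ in range(2000)]
simulated_mean = sum(samples) / len(samples)

print(expected_links, exact_mean, simulated_mean, expected_degree)
```

Individual realizations scatter around the expected value, which is consistent with the link counts L = 10, 10, 8 reported for the three networks in Figure 3.3.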
Figure 3.3 Random Networks are Truly Random
Top Row: Three realizations of a random network generated with the same parameters p=1/6 and N=12. Despite the identical parameters, the networks not only look different, but they have a different number of links as well (L=10, 10, 8).
Bottom Row: Three realizations of a random network with p=0.03 and N=100. Several nodes have degree k=0, shown as isolated nodes at the bottom.
BOX 3.3 BINOMIAL DISTRIBUTION: MEAN AND VARIANCE
If we toss a fair coin N times, tails and heads occur with the same probability p = 1/2. The binomial distribution provides the probability px that we obtain exactly x heads in a sequence of N throws. In general, the binomial distribution describes the number of successes in N independent experiments with two possible outcomes, in which the probability of one outcome is p, and of the other is 1−p.

The binomial distribution has the form

$$p_x = \binom{N}{x} p^x (1-p)^{N-x}.$$

The mean of the distribution (first moment) is

$$\langle x \rangle = \sum_{x=0}^{N} x\, p_x = Np. \tag{3.4}$$

Its second moment is

$$\langle x^2 \rangle = \sum_{x=0}^{N} x^2 p_x = p(1-p)N + p^2 N^2, \tag{3.5}$$

providing its standard deviation as

$$\sigma_x = \left( \langle x^2 \rangle - \langle x \rangle^2 \right)^{1/2} = \left[ p(1-p)N \right]^{1/2}. \tag{3.6}$$

Equations (3.4) - (3.6) are used repeatedly as we characterize random networks.
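As a quick numerical check of (3.4) - (3.6), the following sketch sums the binomial distribution directly and compares the result with the closed forms. The function names and the values N = 20, p = 0.3 are arbitrary choices made for this illustration.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_moments(n, p):
    """Mean, second moment and standard deviation obtained by direct summation."""
    pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
    mean = sum(x * px for x, px in enumerate(pmf))
    second_moment = sum(x * x * px for x, px in enumerate(pmf))
    sigma = sqrt(second_moment - mean**2)
    return mean, second_moment, sigma


n_trials, p_success = 20, 0.3
mean, second_moment, sigma = binomial_moments(n_trials, p_success)

# Closed forms (3.4), (3.5) and (3.6) for comparison.
print(mean, n_trials * p_success)
print(second_moment, p_success * (1 - p_success) * n_trials + p_success**2 * n_trials**2)
print(sigma, sqrt(p_success * (1 - p_success) * n_trials))
```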
SECTION 3.4
DEGREE DISTRIBUTION
In a given realization of a random network some nodes gain numerous links, while others acquire only a few or no links (Figure 3.3). These differences are captured by the degree distribution, pk, which is the probability that a randomly chosen node has degree k. In this section we derive pk for a random network and discuss its properties.

BINOMIAL DISTRIBUTION
In a random network the probability that node i has exactly k links is the product of three terms [15]:
• The probability that k of its links are present, or $p^k$.
• The probability that the remaining (N−1−k) links are missing, or $(1-p)^{N-1-k}$.
• The number of ways we can select k links from N−1 potential links a node can have, or $\binom{N-1}{k}$.

Consequently the degree distribution of a random network follows the binomial distribution

$$p_k = \binom{N-1}{k} p^k (1-p)^{N-1-k}. \tag{3.7}$$

The shape of this distribution depends on the system size N and the probability p (Figure 3.4). The binomial distribution (BOX 3.3) allows us to calculate the network’s average degree 〈k〉, recovering (3.3), as well as its second moment 〈k²〉 and standard deviation σk (Figure 3.4).

Figure 3.4 Binomial vs. Poisson Degree Distribution
[Plot of pk versus k, with the peak position and the width of the distribution annotated for the BINOMIAL and the POISSON form.]
The exact form of the degree distribution of a random network is the binomial distribution (left half). For N ≫ 〈k〉 the binomial is well approximated by a Poisson distribution (right half). As both formulas describe the same distribution, they have identical properties, but they are expressed in terms of different parameters: The binomial distribution depends on p and N, while the Poisson distribution has only one parameter, 〈k〉. It is this simplicity that makes the Poisson form preferred in calculations.
POISSON DISTRIBUTION
Most real networks are sparse, meaning that for them 〈k〉 ≪ N (Table 2.1). In this limit the degree distribution (3.7) is well approximated by the Poisson distribution (ADVANCED TOPICS 3.A)

$$p_k = e^{-\langle k \rangle} \frac{\langle k \rangle^k}{k!}, \tag{3.8}$$

which is often called, together with (3.7), the degree distribution of a random network.

The binomial and the Poisson distribution describe the same quantity, hence they have similar properties (Figure 3.4):
• Both distributions have a peak around 〈k〉. If we increase p the network becomes denser, increasing 〈k〉 and moving the peak to the right.
• The width of the distribution (dispersion) is also controlled by p or 〈k〉. The denser the network, the wider is the distribution, hence the larger are the differences in the degrees.

When we use the Poisson form (3.8), we need to keep in mind that:
• The exact result for the degree distribution is the binomial form (3.7), thus (3.8) represents only an approximation to (3.7) valid in the 〈k〉 ≪ N limit. As most networks of practical importance are sparse, this condition is typically satisfied.
• The advantage of the Poisson form is that key network characteristics, like 〈k〉, 〈k²〉 and σk, have a much simpler form (Figure 3.4), depending on a single parameter, 〈k〉.
• The Poisson distribution in (3.8) does not explicitly depend on the number of nodes N. Therefore, (3.8) predicts that the degree distributions of networks of different sizes but the same average degree 〈k〉 are indistinguishable from each other (Figure 3.5).

In summary, while the Poisson distribution is only an approximation to the degree distribution of a random network, thanks to its analytical simplicity, it is the preferred form for pk. Hence throughout this book, unless noted otherwise, we will refer to the Poisson form (3.8) as the degree distribution of a random network. Its key feature is that its properties are independent of the network size and depend on a single parameter, the average degree 〈k〉.

Figure 3.5 Degree Distribution is Independent of the Network Size
[Plot of pk versus k for the binomial form with N = 10², 10³, 10⁴ and the Poisson form.]
The degree distribution of a random network with 〈k〉 = 50 and N = 10², 10³, 10⁴.
Small Networks: Binomial
For a small network (N = 10²) the degree distribution deviates significantly from the Poisson form (3.8), as the condition for the Poisson approximation, N ≫ 〈k〉, is not satisfied. Hence for small networks one needs to use the exact binomial form (3.7) (green line).
Large Networks: Poisson
For larger networks (N = 10³, 10⁴) the degree distribution becomes indistinguishable from the Poisson prediction (3.8), shown as a continuous grey line. Therefore for large N the degree distribution is independent of the network size. In the figure we averaged over 1,000 independently generated random networks to decrease the noise.
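How quickly the binomial form (3.7) approaches the Poisson form (3.8) can be illustrated numerically. The sketch below uses the parameters of Figure 3.5 (〈k〉 = 50 and N = 10², 10³, 10⁴); the function names are hypothetical, and evaluating both formulas through logarithms is a choice made here to avoid overflow for large N, not something prescribed by the text.

```python
from math import exp, lgamma, log


def binomial_degree_pmf(k, n, p):
    """Binomial degree distribution (3.7), evaluated in log space."""
    if k < 0 or k > n - 1:
        return 0.0
    # log of the binomial coefficient C(n-1, k), using lgamma(m + 1) = log(m!)
    log_comb = lgamma(n) - lgamma(k + 1) - lgamma(n - k)
    return exp(log_comb + k * log(p) + (n - 1 - k) * log(1 - p))


def poisson_degree_pmf(k, mean_k):
    """Poisson degree distribution (3.8), evaluated in log space."""
    return exp(-mean_k + k * log(mean_k) - lgamma(k + 1))


def largest_gap(n, mean_k):
    """Largest absolute difference between (3.7) and (3.8) over all degrees."""
    p = mean_k / (n - 1)  # from <k> = p(N-1), equation (3.3)
    return max(
        abs(binomial_degree_pmf(k, n, p) - poisson_degree_pmf(k, mean_k))
        for k in range(n)
    )


mean_degree = 50
gaps = {n: largest_gap(n, mean_degree) for n in (10**2, 10**3, 10**4)}
for n, gap in gaps.items():
    print(n, gap)
```

The gap shrinks as N grows at fixed 〈k〉, which is the numerical counterpart of the statement that for large N the degree distribution no longer depends on the network size.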
SECTION 3.5
REAL NETWORKS ARE NOT POISSON
As the degree of a node in a random network can vary between 0 and N−1, we must ask, how big are the differences between the node degrees in a particular realization of a random network? That is, can high degree nodes coexist with small degree nodes? We address these questions by estimating the degree of the most connected and of the least connected node in a random network.

Let us assume that the world’s social network is described by the random network model. This random society may not be as far fetched as it first sounds: There is significant randomness in whom we meet and whom we choose to become acquainted with. Sociologists estimate that a typical person knows about 1,000 individuals on a first name basis, prompting us to assume that 〈k〉 ≈ 1,000. Using the results obtained so far about random networks, we arrive at a number of intriguing conclusions about a random society of N ≃ 7 × 10⁹ individuals (ADVANCED TOPICS 3.B):
• The most connected individual (the largest degree node) in a random society is expected to have kmax = 1,185 acquaintances.
• The degree of the least connected individual is kmin = 816, not that different from kmax or 〈k〉.
• The dispersion of a random network is σk = 〈k〉^{1/2}, which for 〈k〉 = 1,000 is σk = 31.62. This means that the number of friends a typical individual has is in the 〈k〉 ± σk range, or between 968 and 1,032, a rather narrow window.

Taken together, in a random society all individuals are expected to have a comparable number of friends. Hence if people are randomly connected to each other, we lack outliers: There are no highly popular individuals, and no one is left behind, having only a few friends. This surprising conclusion is a consequence of an important property of random networks: in a large random network the degree of most nodes is in the narrow vicinity of 〈k〉 (BOX 3.4).
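The dispersion figures quoted for the random society follow from the Poisson property that the standard deviation equals the square root of the mean. A short sketch of the arithmetic, with variable and function names chosen only for this illustration:

```python
from math import sqrt


def poisson_dispersion(mean_k):
    """Standard deviation of a Poisson degree distribution with mean mean_k."""
    return sqrt(mean_k)


mean_k = 1000
sigma_k = poisson_dispersion(mean_k)

# Range <k> - sigma_k to <k> + sigma_k in which a typical degree falls.
low, high = mean_k - sigma_k, mean_k + sigma_k

# The relative width sigma_k / <k> = 1 / sqrt(<k>) shrinks as <k> grows.
relative_width = sigma_k / mean_k

print(round(sigma_k, 2), round(low), round(high), round(relative_width, 4))
```

The relative width of about 3% is what makes the window around 〈k〉 so narrow: the denser the random network, the smaller the relative differences between node degrees.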
This prediction blatantly conflicts with reality. Indeed, there is extensive evidence of individuals who have considerably more than 1,185 acquaintances. For example, US president Franklin Delano Roosevelt’s appointment book has about 22,000 names, individuals he met personally [16, 17]. Similarly, a study of the social network behind Facebook has documented numerous individuals with 5,000 Facebook friends, the maximum allowed by the social networking platform [18]. To understand the origin of these discrepancies we must compare the degree distribution of real and random networks.

In Figure 3.6 we show the degree distribution of three real networks, together with the corresponding Poisson fit. The figure documents systematic differences between the random network predictions and the real data:
• The Poisson form significantly underestimates the number of high degree nodes. For example, according to the random network model the maximum degree of the Internet is expected to be around 20. In contrast the data indicates the existence of routers with degrees close to 10^3.
• The spread in the degrees of real networks is much wider than expected in a random network. This difference is captured by the dispersion σk (Figure 3.4). If the Internet were to be random, we would expect σk = 2.52. The measurements indicate σinternet = 14.14, significantly higher than the random prediction. These differences are not limited to the networks shown in Figure 3.6, but all networks listed in Table 2.1 share this property.

In summary, the comparison with the real data indicates that the random network model does not capture the degree distribution of real networks. In a random network most nodes have comparable degrees, forbidding hubs. In contrast, in real networks we observe a significant number of highly connected nodes and there are large differences in node degrees. We will resolve these differences in CHAPTER 4.

BOX 3.4
WHY ARE HUBS MISSING?
To understand why hubs, nodes with a very large degree, are absent in random networks, we turn to the degree distribution (3.8). We first note that the 1/k! term in (3.8) significantly decreases the chances of observing large degree nodes. Indeed, the Stirling approximation

$$k! \sim \sqrt{2\pi k} \left(\frac{k}{e}\right)^k$$

allows us to rewrite (3.8) as

$$p_k = \frac{e^{-\langle k \rangle}}{\sqrt{2\pi k}} \left(\frac{e \langle k \rangle}{k}\right)^k. \tag{3.9}$$

For degrees k > e⟨k⟩ the term in the parenthesis is smaller than one, hence for large k both k-dependent terms in (3.9), i.e. 1/√k and (e⟨k⟩/k)^k, decrease rapidly with increasing k. Overall (3.9) predicts that in a random network the chance of observing a hub decreases faster than exponentially.
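The Stirling form (3.9) can be checked against the exact Poisson expression (3.8) with a few lines of code. The sketch below is only an illustration: the value ⟨k⟩ = 10 and the function names are our own choices, not taken from the text.

```python
import math

def poisson_exact(k, mean):
    """Eq. (3.8): p_k = exp(-<k>) <k>^k / k!, evaluated in log space."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def poisson_stirling(k, mean):
    """Eq. (3.9): p_k with k! replaced by sqrt(2 pi k) (k/e)^k."""
    return math.exp(-mean) / math.sqrt(2 * math.pi * k) * (math.e * mean / k) ** k

mean = 10  # illustrative average degree
for k in (20, 40, 80):
    exact = poisson_exact(k, mean)
    approx = poisson_stirling(k, mean)
    # Ratio of successive exact terms, p_k / p_(k-1) = <k>/k
    step_ratio = mean / k
    print(k, exact, approx, step_ratio)
```

For an exponential distribution the ratio of successive terms would be constant. Here it equals ⟨k⟩/k and keeps shrinking as k grows, which is what “faster than exponentially” means in practice.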
[Figure 3.6: three log-log panels showing pk versus k for (a) the Internet, (b) the science collaboration network and (c) the protein interaction network, with ⟨k⟩ marked in each panel.]

Figure 3.6
Degree Distribution of Real Networks
The degree distribution of the (a) Internet, (b) science collaboration network, and (c) protein interaction network (Table 2.1). The green line corresponds to the Poisson prediction, obtained by measuring ⟨k⟩ for the real network and then plotting (3.8). The significant deviation between the data and the Poisson fit indicates that the random network model underestimates the size and the frequency of the high degree nodes, as well as the number of low degree nodes. Instead the random network model predicts a larger number of nodes in the vicinity of ⟨k⟩ than seen in real networks.
SECTION 3.6
THE EVOLUTION OF A RANDOM NETWORK
The cocktail party we encountered at the beginning of this chapter captures a dynamical process: Starting with N isolated nodes, the links are added gradually through random encounters between the guests. This corresponds to a gradual increase of p, with striking consequences on the network topology (Online Resource 3.2). To quantify this process, we first inspect how the size of the largest connected cluster within the network, NG, varies with ⟨k⟩. Two extreme cases are easy to understand:
• For p = 0 we have ⟨k⟩ = 0, hence all nodes are isolated. Therefore the largest component has size NG = 1 and NG/N → 0 for large N.
• For p = 1 we have ⟨k⟩ = N−1, hence the network is a complete graph and all nodes belong to a single component. Therefore NG = N and NG/N = 1.

One would expect that the largest component grows gradually from NG = 1 to NG = N if ⟨k⟩ increases from 0 to N−1. Yet, as Figure 3.7a indicates, this is not the case: NG/N remains zero for small ⟨k⟩, indicating the lack of a large cluster. Once ⟨k⟩ exceeds a critical value, NG/N increases, signaling the rapid emergence of a large cluster that we call the giant component.

Online Resource 3.2
Evolution of a Random Network
A video showing the change in the structure of a random network with increasing p. It vividly illustrates the absence of a giant component for small p and its sudden emergence once p reaches a critical value.
Erdős and Rényi in their classical 1959 paper predicted that the condition for the emergence of the giant component is [2]

$$\langle k \rangle = 1. \tag{3.10}$$

In other words, we have a giant component if and only if each node has on average more than one link (ADVANCED TOPICS 3.C).

The fact that we need at least one link per node to observe a giant component is not unexpected. Indeed, for a giant component to exist, each of its nodes must be linked to at least one other node. It is somewhat counterintuitive, however, that one link is sufficient for its emergence.

We can express (3.10) in terms of p using (3.3), obtaining

$$p_c = \frac{1}{N-1} \approx \frac{1}{N}. \tag{3.11}$$
Therefore the larger a network, the smaller the p that is sufficient for the giant component.

The emergence of the giant component is only one of the transitions characterizing a random network as we change ⟨k⟩. We can distinguish four topologically distinct regimes (Figure 3.7a), each with its unique characteristics:

Subcritical Regime: 0 < ⟨k⟩ < 1 (p < 1/N, Figure 3.7b).
For ⟨k⟩ = 0 the network consists of N isolated nodes. Increasing ⟨k⟩ means that we are adding N⟨k⟩/2 = pN(N−1)/2 links to the network. Yet, given that ⟨k⟩ < 1, we have only a small number of links in this regime, hence we mainly observe tiny clusters (Figure 3.7b).

We can designate at any moment the largest cluster to be the giant component. Yet in this regime the relative size of the largest cluster, NG/N, remains zero. The reason is that for ⟨k⟩ < 1 the largest cluster is a tree with size NG ~ lnN, hence its size increases much more slowly than the size of the network. Therefore NG/N ≃ lnN/N → 0 in the N → ∞ limit.

In summary, in the subcritical regime the network consists of numerous tiny components, whose size follows the exponential distribution (3.35). Hence these components have comparable sizes, lacking a clear winner that we could designate as a giant component.

Critical Point: ⟨k⟩ = 1 (p = 1/N, Figure 3.7c).
The critical point separates the regime where there is not yet a giant component (⟨k⟩ < 1) from the regime where there is one (⟨k⟩ > 1). At this point the relative size of the largest component is still zero (Figure 3.7c). Indeed, the size of the largest component is NG ~ N^{2/3}. Consequently NG grows much more slowly than the network’s size, so its relative size decreases as NG/N ~ N^{−1/3} in the N → ∞ limit.

Note, however, that in absolute terms there is a significant jump in the size of the largest component at ⟨k⟩ = 1. For example, for a random network with N = 7 × 10^9 nodes, comparable to the globe’s social network, for ⟨k⟩ < 1 the largest cluster is of the order of NG ≃ lnN = ln(7 × 10^9) ≃ 22.7. In contrast at ⟨k⟩ = 1 we expect NG ~ N^{2/3} = (7 × 10^9)^{2/3} ≃ 3.7 × 10^6, a jump of about five orders of magnitude. Yet, both in the subcritical regime and at the critical point the largest component contains only a vanishing fraction of the total number of nodes in the network.

In summary, at the critical point most nodes are located in numerous small components, whose size distribution follows (3.36). The power law form indicates that components of rather different sizes coexist. These numerous small components are mainly trees, while the giant component may contain loops. Note that many properties of the network at the critical point resemble the properties of a physical system undergoing a phase transition (ADVANCED TOPICS 3.F).

Figure 3.7
Evolution of a Random Network
(a) The relative size of the giant component, NG/N, as a function of the average degree ⟨k⟩ in the Erdős-Rényi model. The figure illustrates the phase transition at ⟨k⟩ = 1, responsible for the emergence of a giant component with nonzero NG.
(b-e) A sample network and its properties in the four regimes that characterize a random network.
(b) Subcritical Regime (⟨k⟩ < 1)
• No giant component
• Cluster size distribution: ps ~ s^{−3/2} e^{−αs}
• Size of the largest cluster: NG ~ lnN
• The clusters are trees
(c) Critical Point (⟨k⟩ = 1)
• No giant component
• Cluster size distribution: ps ~ s^{−3/2}
• Size of the largest cluster: NG ~ N^{2/3}
• The clusters may contain loops
(d) Supercritical Regime (⟨k⟩ > 1)
• Single giant component
• Cluster size distribution: ps ~ s^{−3/2} e^{−αs}
• Size of the giant component: NG ~ (p − pc)N
• The small clusters are trees
• Giant component has loops
(e) Connected Regime (⟨k⟩ ≫ lnN)
• Single giant component
• No isolated nodes or clusters
• Size of the giant component: NG = N
• Giant component has loops

Supercritical Regime: ⟨k⟩ > 1 (p > 1/N, Figure 3.7d).
This regime has the most relevance to real systems, as for the first time we have a giant component that looks like a network. In the vicinity of the critical point the size of the giant component varies as

$$N_G / N \sim \langle k \rangle - 1, \tag{3.12}$$

or

$$N_G \sim (p - p_c) N, \tag{3.13}$$

where pc is given by (3.11). In other words, the giant component contains a finite fraction of the nodes. The further we move from the critical point, the larger the fraction of nodes that belong to it. Note that (3.12) is valid only in the vicinity of ⟨k⟩ = 1. For large ⟨k⟩ the dependence between NG and ⟨k⟩ is nonlinear (Figure 3.7a).
In summary in the supercritical regime numerous isolated components coexist with the giant component, their size distribution following (3.35). These small components are trees, while the giant component contains loops and cycles. The supercritical regime lasts until all nodes are absorbed by the giant component.
Connected Regime: ⟨k⟩ > lnN (p > lnN/N, Figure 3.7e).
For sufficiently large p the giant component absorbs all nodes and components, hence NG ≃ N. In the absence of isolated nodes the network becomes connected. The average degree at which this happens depends on N as (ADVANCED TOPICS 3.E)

$$\langle k \rangle = \ln N. \tag{3.14}$$
Note that when we enter the connected regime the network is still relatively sparse, as lnN/N → 0 for large N. The network turns into a complete graph only at ⟨k⟩ = N − 1.

In summary, the random network model predicts that the emergence of a network is not a smooth, gradual process: The isolated nodes and tiny components observed for small ⟨k⟩ collapse into a giant component through a phase transition (ADVANCED TOPICS 3.F). As we vary ⟨k⟩ we encounter four topologically distinct regimes (Figure 3.7).

The discussion offered above follows an empirical perspective, fruitful if we wish to compare a random network to real systems. A different perspective, with its own rich behavior, is offered by the mathematical literature (BOX 3.5).
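The sudden emergence of the giant component is easy to observe numerically. The following sketch generates one realization of a random network for several values of ⟨k⟩ and measures NG/N with a union-find structure. The network size N = 1,000, the seed and the function name are illustrative assumptions, and a single realization of a finite network only approximates the N → ∞ behavior discussed above.

```python
import random

def largest_component_fraction(n, avg_degree, seed=0):
    """Build one G(n, p) realization with p = <k>/(n-1) and return N_G / n."""
    rng = random.Random(seed)
    p = avg_degree / (n - 1)
    parent = list(range(n))  # union-find forest, one tree per component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Connect each node pair independently with probability p.
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_i] = root_j

    # Count the nodes attached to each root and keep the largest count.
    sizes = {}
    for node in range(n):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

for k in (0.5, 1.0, 2.0, 4.0):
    print(k, round(largest_component_fraction(1000, k), 3))
```

Below ⟨k⟩ = 1 the largest cluster should hold only a small fraction of the nodes, while well above the threshold it should hold most of them.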
BOX 3.5 NETWORK EVOLUTION IN GRAPH THEORY.
In the random graph literature it is often assumed that the connection probability p(N) scales as N^z, where z is a tunable parameter between −∞ and 0 [15]. In this language Erdős and Rényi discovered that as we vary z, key properties of random graphs appear quite suddenly.

We say that almost every graph has a given property Q if the probability of having Q approaches 1 as N → ∞. That is, for a given z either almost every graph has the property Q or almost no graph has it. For example, for z less than −3/2 almost all graphs contain only isolated nodes and pairs of nodes connected by a link. Once z exceeds −3/2, most networks will contain paths connecting three or more nodes (Figure 3.8).
[Figure 3.8: the axis of the exponent z in p ~ N^z, with the thresholds z = −∞, −2, −3/2, −4/3, −5/4, −1, −2/3 and −1/2 marked.]

Figure 3.8
Evolution of a Random Graph
The threshold probabilities at which different subgraphs appear in a random graph, as defined by the exponent z in the p(N) ~ N^z relationship. For z < −3/2 the graph consists of isolated nodes and edges. When z passes −3/2 trees of order 3 appear, while at z = −4/3 trees of order 4 appear. At z = −1 trees of all orders are present, together with cycles of all orders. Complete subgraphs of order 4 appear at z = −2/3, and as z increases further, complete subgraphs of larger and larger order emerge. After [19].
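The thresholds listed in Figure 3.8 follow a simple pattern. A standard result of the random graph literature, which is not derived in this text and is used here as an assumption, states that a (balanced) subgraph with n nodes and l links appears at p ~ N^{−n/l}. The sketch below applies this rule to trees, cycles and complete subgraphs; the function names are our own.

```python
from fractions import Fraction

def threshold_exponent(nodes, links):
    """Exponent z in p ~ N^z at which a subgraph with the given counts appears."""
    return Fraction(-nodes, links)

def tree_threshold(order):
    # A tree with `order` nodes has order - 1 links.
    return threshold_exponent(order, order - 1)

def cycle_threshold(order):
    # A cycle with `order` nodes has `order` links.
    return threshold_exponent(order, order)

def complete_subgraph_threshold(order):
    # A complete subgraph with `order` nodes has order (order - 1) / 2 links.
    return threshold_exponent(order, order * (order - 1) // 2)

print([str(tree_threshold(n)) for n in (2, 3, 4, 5)])           # ['-2', '-3/2', '-4/3', '-5/4']
print([str(cycle_threshold(n)) for n in (3, 4, 5)])             # ['-1', '-1', '-1']
print([str(complete_subgraph_threshold(n)) for n in (4, 5)])    # ['-2/3', '-1/2']
```

The tree thresholds approach −1 from below as the order grows, which is consistent with trees of all orders being present at z = −1.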
SECTION 3.7
REAL NETWORKS ARE SUPERCRITICAL
Two predictions of random network theory are of direct importance for real networks:
1) Once the average degree exceeds ⟨k⟩ = 1, a giant component should emerge that contains a finite fraction of all nodes. Hence only for ⟨k⟩ > 1 the nodes organize themselves into a recognizable network.
2) For ⟨k⟩ > lnN all components are absorbed by the giant component, resulting in a single connected network.

Table 3.1
Are Real Networks Connected?

NETWORK                  N          L             ⟨k⟩      lnN
Internet                 192,244    609,066       6.34     12.17
Power Grid               4,941      6,594         2.67     8.51
Science Collaboration    23,133     93,439        8.08     10.05
Actor Network            702,388    29,397,908    83.71    13.46
Protein Interactions     2,018      2,930         2.90     7.61

The number of nodes N and links L for the undirected networks of our reference network list of Table 2.1, shown together with ⟨k⟩ and lnN. A giant component is expected for ⟨k⟩ > 1 and all nodes should join the giant component for ⟨k⟩ > lnN. While for all networks ⟨k⟩ > 1, for most ⟨k⟩ is under the lnN threshold (see also Figure 3.9).

Do real networks satisfy the criteria for the existence of a giant component, i.e. ⟨k⟩ > 1? And will this giant component contain all nodes for ⟨k⟩ > lnN, or will we continue to see some disconnected nodes and components? To answer these questions we compare the structure of a real network for a given ⟨k⟩ with the theoretical predictions discussed above.

The measurements indicate that real networks extravagantly exceed the ⟨k⟩ = 1 threshold. Indeed, sociologists estimate that an average person has around 1,000 acquaintances; a typical neuron in the human brain has about 7,000 synapses; in our cells each molecule takes part in several chemical reactions. This conclusion is supported by Table 3.1, which lists the average degree of several undirected networks, in each case finding ⟨k⟩ > 1. Hence the average degree of real networks is well beyond the ⟨k⟩ = 1 threshold, implying that they all have a giant component. The same is true for the reference networks listed in Table 3.1.

Let us now turn to the second prediction, inspecting if we have a single component (i.e. if ⟨k⟩ > lnN), or if the network is fragmented into multiple components (i.e. if ⟨k⟩ < lnN). For social networks the transition between the supercritical and the fully connected regime should be at ⟨k⟩ > ln(7 × 10^9) ≈ 22.7. That is, if the average individual has more than two dozen acquaintances, then a random society must have a single component, leaving no individual disconnected. With ⟨k⟩ ≈ 1,000 this condition is clearly satisfied. Yet, according to Table 3.1 many real networks do not satisfy the fully connected criterion. Consequently, according to random network theory these networks should be fragmented into several disconnected components. This is a disconcerting prediction for the Internet, indicating that some routers should be disconnected from the giant component, being unable to communicate with other routers. It is equally problematic for the power grid, indicating that some consumers should not get power. These predictions are clearly at odds with reality.

In summary, we find that most real networks are in the supercritical regime (Figure 3.9). Therefore these networks are expected to have a giant component, which is in agreement with the observations. Yet, this giant component should coexist with many disconnected components, a prediction that fails for several real networks. Note that these predictions should be valid only if real networks are accurately described by the Erdős-Rényi model, i.e. if real networks are random. In the coming chapters, as we learn more about the structure of real networks, we will understand why real networks can stay connected despite failing the ⟨k⟩ > lnN criterion.
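The comparison behind Table 3.1 takes only a few lines to reproduce. The sketch below recomputes ⟨k⟩ = 2L/N and lnN from the tabulated N and L and assigns each network to a regime. The function name and the regime labels are our own, and the critical point ⟨k⟩ = 1 is folded into the supercritical branch for simplicity.

```python
import math

# (N, L) for the undirected reference networks.
# For Science Collaboration, L = 93,439 reproduces the tabulated <k> = 8.08.
networks = {
    "Internet": (192_244, 609_066),
    "Power Grid": (4_941, 6_594),
    "Science Collaboration": (23_133, 93_439),
    "Actor Network": (702_388, 29_397_908),
    "Protein Interactions": (2_018, 2_930),
}

def regime(n, links):
    """Classify an undirected network by comparing <k> with 1 and ln N."""
    avg_degree = 2 * links / n  # each link contributes to two degrees
    if avg_degree < 1:
        return "subcritical"
    if avg_degree <= math.log(n):
        return "supercritical"
    return "connected"

for name, (n, links) in networks.items():
    avg_degree = 2 * links / n
    print(f"{name:22s} <k> = {avg_degree:6.2f}  lnN = {math.log(n):5.2f}  {regime(n, links)}")
```

Only the actor network ends up in the connected regime. The remaining four are supercritical, so a random network with the same N and L would be expected to contain disconnected pieces.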
[Figure 3.9: the Internet, power grid, science collaboration, actor and yeast protein interaction networks placed along the ⟨k⟩ axis (ticks at 1 and 10), with the subcritical, supercritical and fully connected regimes marked for each.]

Figure 3.9
Most Real Networks are Supercritical
The four regimes predicted by random network theory, marking with a cross the location (⟨k⟩) of the undirected networks listed in Table 3.1. The diagram indicates that most networks are in the supercritical regime, hence they are expected to be broken into numerous isolated components. Only the actor network is in the connected regime, meaning that all nodes are part of a single giant component. Note that while the boundary between the subcritical and the supercritical regime is always at ⟨k⟩ = 1, the boundary between the supercritical and the connected regime is at lnN, which varies from system to system.
SECTION 3.8
SMALL WORLDS
The small world phenomenon, also known as six degrees of separation, has long fascinated the general public. It states that if you choose any two individuals anywhere on Earth, you will find a path of at most six acquaintances between them (Figure 3.10). The fact that individuals who live in the same city are only a few handshakes from each other is by no means surprising. The small world concept states, however, that even individuals who are on the opposite side of the globe can be connected to us via a few acquaintances.

In the language of network science the small world phenomenon implies that the distance between two randomly chosen nodes in a network is short. This statement raises two questions: What does short (or small) mean, i.e. short compared to what? How do we explain the existence of these short distances?

[Figure 3.10: a small acquaintance network highlighting the path from Sarah through Ralph and Jane to Peter.]

Figure 3.10
Six Degrees of Separation
According to six degrees of separation two individuals, anywhere in the world, can be connected through a chain of six or fewer acquaintances. This means that while Sarah does not know Peter, she knows Ralph, who knows Jane and who in turn knows Peter. Hence Sarah is three handshakes, or three degrees from Peter. In the language of network science six degrees, also called the small world property, means that the distance between any two nodes in a network is unexpectedly small.

Both questions are answered by a simple calculation. Consider a random network with average degree ⟨k⟩. A node in this network has on average:
• ⟨k⟩ nodes at distance one (d = 1).
• ⟨k⟩^2 nodes at distance two (d = 2).
• ⟨k⟩^3 nodes at distance three (d = 3).
• ...
• ⟨k⟩^d nodes at distance d.

For example, if ⟨k⟩ ≈ 1,000, which is the estimated number of acquaintances an individual has, we expect 10^6 individuals at distance two and about a billion, i.e. almost the whole earth’s population, at distance three from us.

To be precise, the expected number of nodes up to distance d from our starting node is

$$N(d) \approx 1 + \langle k \rangle + \langle k \rangle^2 + \dots + \langle k \rangle^d = \frac{\langle k \rangle^{d+1} - 1}{\langle k \rangle - 1}. \tag{3.15}$$
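The counting argument behind (3.15) can be sketched in a few lines. The code below compares the explicit sum with the closed form and then asks how many steps are needed before the count reaches the size of the network. The values ⟨k⟩ = 1,000 and N = 7 × 10^9 are those used in the text; the function names are illustrative.

```python
def nodes_within_sum(distance, avg_degree):
    """Left-hand side of (3.15): 1 + <k> + <k>^2 + ... + <k>^d."""
    return sum(avg_degree ** step for step in range(distance + 1))

def nodes_within_closed(distance, avg_degree):
    """Right-hand side of (3.15): (<k>^(d+1) - 1) / (<k> - 1)."""
    return (avg_degree ** (distance + 1) - 1) / (avg_degree - 1)

def steps_to_cover(n, avg_degree):
    """Smallest d for which the expected count N(d) reaches n."""
    distance = 0
    while nodes_within_sum(distance, avg_degree) < n:
        distance += 1
    return distance

print(nodes_within_sum(2, 1000))     # 1001001
print(nodes_within_sum(3, 1000))     # 1001001001
print(nodes_within_closed(3, 1000))  # 1001001001.0
print(steps_to_cover(7e9, 1000))     # 4
```

With a thousand acquaintances per person the count passes a billion at d = 3 and exceeds the world’s population at d = 4. The count grows exponentially with d, so only a handful of steps are needed.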
N(d) must not exceed the total number of nodes, N, in the network. Therefore the distances cannot take up arbitrary values. We can identify the maximum distance, dmax, or the network’s diameter by setting

$$N(d_{max}) \approx N. \tag{3.16}$$

Assuming that ⟨k⟩ ≫ 1, we can neglect the (−1) term in the numerator and the denominator of (3.15), obtaining

$$\langle k \rangle^{d_{max}} \approx N. \tag{3.17}$$

Therefore the diameter of a random network follows

$$d_{max} \approx \frac{\ln N}{\ln \langle k \rangle}, \tag{3.18}$$

which represents the mathematical formulation of the small world phenomenon. The key, however, is its interpretation:

• As derived, (3.18) predicts the scaling of the network diameter, dmax, with the size of the system, N. Yet, for most networks (3.18) offers a better approximation to the average distance between two randomly chosen nodes, ⟨d⟩, than to dmax (Table 3.2). This is because dmax is often dominated by a few extreme paths, while ⟨d⟩ is averaged over all node pairs, a process that suppresses the fluctuations. Hence typically the small world property is defined by

$$\langle d \rangle \approx \frac{\ln N}{\ln \langle k \rangle}, \tag{3.19}$$

describing the dependence of the average distance in a network on N and ⟨k⟩.
• In general lnN ≪ N, hence the dependence of ⟨d⟩ on lnN implies that the distances in a random network are orders of magnitude smaller than the size of the network. Consequently by small in the "small world phenomenon" we mean that the average path length or the diameter depends logarithmically on the system size. Hence, “small” means that ⟨d⟩ is proportional to lnN, rather than N or some power of N (Figure 3.11).
• The 1/ln⟨k⟩ term implies that the denser the network, the smaller is the distance between the nodes.
• In real networks there are systematic corrections to (3.19), rooted in the fact that the number of nodes at distance d > ⟨d⟩ drops rapidly (ADVANCED TOPICS 3.F).

[Figure 3.11: (a) ⟨d⟩ versus N on a linear scale for a 1D lattice (⟨d⟩ ~ N), a 2D lattice (⟨d⟩ ~ N^{1/2}), a 3D lattice (⟨d⟩ ~ N^{1/3}) and a random network (⟨d⟩ ~ lnN); (b) the same curves shown as ln⟨d⟩ versus lnN.]

Figure 3.11
Why are Small Worlds Surprising?
Much of our intuition about distance is based on our experience with regular lattices, which do not display the small world property:
1D: For a one-dimensional lattice (a line of length N) the diameter and the average path length scale linearly with N: dmax ~ ⟨d⟩ ~ N.
2D: For a square lattice dmax ~ ⟨d⟩ ~ N^{1/2}.
3D: For a cubic lattice dmax ~ ⟨d⟩ ~ N^{1/3}.
In general, for a d-dimensional lattice dmax ~ ⟨d⟩ ~ N^{1/d}.
These polynomial dependences predict a much faster increase with N than (3.19), indicating that in lattices the path lengths are significantly longer than in a random network. For example, if the social network would form a square lattice (2D), where each individual knows only its neighbors, the average distance between two individuals would be roughly (7 × 10^9)^{1/2} = 83,666. Even if we correct for the fact that a person has about 1,000 acquaintances, not four, the average separation will be orders of magnitude larger than predicted by (3.19).
(a) The figure shows the predicted N-dependence of ⟨d⟩ for regular and random networks on a linear scale.
(b) The same as in (a), but shown on a log-log scale.

Let us illustrate the implications of (3.19) for social networks. Using N ≈ 7 × 10^9 and ⟨k⟩ ≈ 10^3, we obtain

$$\langle d \rangle \approx \frac{\ln (7 \times 10^9)}{\ln (10^3)} = 3.28. \tag{3.20}$$
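The estimate (3.20) and the lattice scalings quoted in the caption of Figure 3.11 can be reproduced as follows. This is a minimal sketch: the function names are our own, and the lattice values are order-of-magnitude scalings (N^{1/d} without prefactors), not exact path lengths.

```python
import math

def random_network_distance(n, avg_degree):
    """Eq. (3.19): <d> ~ ln N / ln <k> for a random network."""
    return math.log(n) / math.log(avg_degree)

def lattice_distance_scale(n, dimension):
    """Scaling <d> ~ N^(1/d) for a d-dimensional lattice, prefactor omitted."""
    return n ** (1 / dimension)

population = 7e9
print(round(random_network_distance(population, 1000), 2))  # 3.28
for dimension in (1, 2, 3):
    print(dimension, round(lattice_distance_scale(population, dimension)))
```

For the same population the two-dimensional lattice gives a separation of about 83,666 steps, compared with roughly three steps for the random network. This gap is the sense in which random networks are "small".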
Therefore, all individuals on Earth should be within three to four handshakes of each other [20]. The estimate (3.20) is probably closer to the real value than the frequently quoted six degrees (BOX 3.7).

Much of what we know about the small world property in random networks, including the result (3.19), is in a little known paper by Manfred Kochen and Ithiel de Sola Pool [20], in which they mathematically formulated the problem and discussed in depth its sociological implications. This paper inspired the well known Milgram experiment (BOX 3.7), which in turn inspired the six degrees of separation phrase.

While discovered in the context of social systems, the small world property applies beyond social networks (BOX 3.6). To demonstrate this in Table 3.2 we compare the prediction of (3.19) with the average path length ⟨d⟩ for several real networks, finding that despite the diversity of these systems and the significant differences between them in terms of N and ⟨k⟩, (3.19) offers a good approximation to the empirically observed ⟨d⟩.

In summary the small world property has not only ignited the public’s imagination (BOX 3.8), but plays an important role in network science as well. The small world phenomenon can be reasonably well understood in the context of the random network model: It is rooted in the fact that the number of nodes at distance d from a node increases exponentially with d. In the coming chapters we will see that in real networks we encounter systematic deviations from (3.19), forcing us to replace it with more accurate predictions. Yet the intuition offered by the random network model on the origin of the small world phenomenon remains valid.

BOX 3.6
19 DEGREES OF SEPARATION
How many clicks do we need to reach a randomly chosen document on the Web? The difficulty in addressing this question is rooted in the fact that we lack a complete map of the WWW: we only have access to small samples of the full map. We can start, however, by measuring the WWW’s average path length in samples of increasing sizes, a procedure called finite size scaling. The measurements indicate that the average path length of the WWW increases with the size of the network as [21]

$$\langle d \rangle \approx 0.35 + 0.89 \ln N.$$

In 1999 the WWW was estimated to have about 800 million documents [22], in which case the above equation predicts ⟨d⟩ ≈ 18.69. In other words in 1999 two randomly chosen documents were on average 19 clicks from each other, a result that became known as 19 degrees of separation. Subsequent measurements on a sample of 200 million documents found ⟨d⟩ ≈ 16 [23], in good agreement with the ⟨d⟩ ≈ 17 prediction. Currently the WWW is estimated to have about a trillion nodes (N ~ 10^12), in which case the formula predicts ⟨d⟩ ≈ 25. Hence ⟨d⟩ is not fixed but as the network grows, so does the distance between two documents.

The average path length of 25 is much larger than the proverbial six degrees (BOX 3.7). The difference is easy to understand: The WWW has smaller average degree and larger size than the social network. According to (3.19) both of these differences increase the Web’s diameter.

Table 3.2
Six Degrees of Separation

NETWORK                  N          L             ⟨k⟩      ⟨d⟩      dmax    lnN/ln⟨k⟩
Internet                 192,244    609,066       6.34     6.98     26      6.58
WWW                      325,729    1,497,134     4.60     11.27    93      8.31
Power Grid               4,941      6,594         2.67     18.99    46      8.66
Mobile Phone Calls       36,595     91,826        2.51     11.72    39      11.42
Email                    57,194     103,731       1.81     5.88     18      18.4
Science Collaboration    23,133     93,439        8.08     5.35     15      4.81
Actor Network            702,388    29,397,908    83.71    3.91     14      3.04
Citation Network         449,673    4,707,958     10.43    11.21    42      5.55
E. Coli Metabolism       1,039      5,802         5.58     2.98     8       4.04
Protein Interactions     2,018      2,930         2.90     5.61     14      7.14

The average distance ⟨d⟩ and the maximum distance dmax for the ten reference networks. The last column provides ⟨d⟩ predicted by (3.19), indicating that it offers a reasonable approximation to the measured ⟨d⟩. Yet, the agreement is not perfect: we will see in the next chapter that for many real networks (3.19) needs to be adjusted. For directed networks the average degree and the path lengths are measured along the direction of the links.
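The last column of Table 3.2 can be recomputed directly from the tabulated N and ⟨k⟩. The sketch below does so and prints the prediction of (3.19) next to the measured ⟨d⟩. The data are copied from the table; the dictionary layout and the function name are our own, and small differences in the second decimal are expected from rounding.

```python
import math

# name: (N, <k>, measured <d>, tabulated ln N / ln <k>), copied from Table 3.2
table_3_2 = {
    "Internet": (192_244, 6.34, 6.98, 6.58),
    "WWW": (325_729, 4.60, 11.27, 8.31),
    "Power Grid": (4_941, 2.67, 18.99, 8.66),
    "Mobile Phone Calls": (36_595, 2.51, 11.72, 11.42),
    "Email": (57_194, 1.81, 5.88, 18.4),
    "Science Collaboration": (23_133, 8.08, 5.35, 4.81),
    "Actor Network": (702_388, 83.71, 3.91, 3.04),
    "Citation Network": (449_673, 10.43, 11.21, 5.55),
    "E. Coli Metabolism": (1_039, 5.58, 2.98, 4.04),
    "Protein Interactions": (2_018, 2.90, 5.61, 7.14),
}

def predicted_distance(n, avg_degree):
    """Eq. (3.19): <d> ~ ln N / ln <k>."""
    return math.log(n) / math.log(avg_degree)

for name, (n, avg_degree, measured, tabulated) in table_3_2.items():
    predicted = predicted_distance(n, avg_degree)
    print(f"{name:22s} measured {measured:6.2f}   predicted {predicted:6.2f}")
```

The prediction lands within a factor of about two of the measurement for most networks, with the power grid and email networks as the clearest exceptions. This matches the caption’s remark that the agreement is reasonable but not perfect.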
BOX 3.7
SIX DEGREES: EXPERIMENTAL CONFIRMATION
The first empirical study of the small world phenomenon took place in 1967, when Stanley Milgram, building on the work of Pool and Kochen [20], designed an experiment to measure the distances in social networks [24, 25]. Milgram chose a stock broker in Boston and a divinity student in Sharon, Massachusetts as targets. He then randomly selected residents of Wichita and Omaha, sending them a letter containing a short summary of the study’s purpose, a photograph, the name, address and information about the target person. They were asked to forward the letter to a friend, relative or acquaintance who is most likely to know the target person.

Within a few days the first letter arrived, passing through only two links. Eventually 64 of the 296 letters made it back, some, however, requiring close to a dozen intermediates [25]. These completed chains allowed Milgram to determine the number of individuals required to get the letter to the target (Figure 3.12a). He found that the median number of intermediates was 5.2, a relatively small number that was remarkably close to Frigyes Karinthy’s 1929 insight (BOX 3.8).

Milgram lacked an accurate map of the full acquaintance network, hence his experiment could not detect the true distance between his study’s participants. Today Facebook has the most extensive social network map ever assembled. Using Facebook’s social graph of May 2011, consisting of 721 million active users and 68 billion symmetric friendship links, researchers found an average distance 4.74 between the users (Figure 3.12b). Therefore, the study detected only ‘four degrees of separation’ [18], closer to the prediction of (3.20) than to Milgram’s six degrees [24, 25].

“I asked a person of intelligence how many steps he thought it would take, and he said that it would require 100 intermediate persons, or more, to move from Nebraska to Sharon.”
Stanley Milgram, 1969

[Figure 3.12: (a) histogram of the number of chains versus the number of intermediaries for the N = 64 completed chains; (b) the distance distribution pd versus d for Facebook users, worldwide and within the USA.]

Figure 3.12
Six Degrees? From Milgram to Facebook
(a) In Milgram's experiment 64 of the 296 letters made it to the recipient. The figure shows the length distribution of the completed chains, indicating that some letters required only one intermediary, while others required as many as ten. The mean of the distribution was 5.2, indicating that on average six ‘handshakes’ were required to get a letter to its recipient. The playwright John Guare renamed this ‘six degrees of separation’ two decades later. After [25].
(b) The distance distribution, pd, for all pairs of Facebook users worldwide and within the US only. Using Facebook’s N and L, (3.19) predicts the average distance to be approximately 3.90, not far from the reported four degrees. After [18].
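The prediction quoted for Facebook follows from (3.19) once the average degree is computed from the number of users and friendship links. A minimal sketch, using the May 2011 figures given in the box and variable names of our own choosing:

```python
import math

n_users = 721e6   # active users in the May 2011 social graph
n_links = 68e9    # symmetric friendship links

# Each undirected link contributes to the degree of two users.
avg_degree = 2 * n_links / n_users
predicted_distance = math.log(n_users) / math.log(avg_degree)  # Eq. (3.19)

print(round(avg_degree, 1))          # 188.6
print(round(predicted_distance, 2))  # 3.89
```

The result, about 3.9, sits below the measured average distance of 4.74, in line with the systematic corrections to (3.19) mentioned in the main text.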
BOX 3.8
[Timeline: milestones in the study of small worlds, arranged by discovery and publication date, 1929 to 2011.]

1929: Frigyes Karinthy (1887-1938)
Hungarian writer, journalist and playwright, the first to describe the small world property. In his short story entitled ‘Láncszemek’ (Chains) he links a worker in Ford’s factory to himself [26, 27].
Karinthy, 1929: “The worker knows the manager in the shop, who knows Ford; Ford is on friendly terms with the general director of Hearst Publications, who last year became good friends with Árpád Pásztor, someone I not only know, but to the best of my knowledge a good friend of mine.”

1958 (published 20 years later, in 1978): Manfred Kochen (1928-1989), Ithiel de Sola Pool (1917-1984)
Scientific interest in small worlds started with a paper by political scientist Ithiel de Sola Pool and mathematician Manfred Kochen. Written in 1958 and published in 1978, their work addressed in mathematical detail the small world effect, predicting that most individuals can be connected via two to three acquaintances. Their paper inspired the experiments of Stanley Milgram.

1967: Stanley Milgram (1933-1984)
American social psychologist who carried out the first experiment testing the small world phenomenon (BOX 3.7).

1991: John Guare (b. 1938), six degrees of separation
The phrase ‘six degrees of separation’ was introduced by the playwright John Guare, who used it as the title of his Broadway play [28].
Guare, 1991: “Everybody on this planet is separated by only six other people. Six degrees of separation. Between us and everybody else on this planet. The president of the United States. A gondolier in Venice. It’s not just the big names. It’s anyone. A native in a rain forest. A Tierra del Fuegan. An Eskimo. I am bound to everyone on this planet by a trail of six people. It’s a profound thought. How every person is a new door, opening up into other worlds.”

1998: Duncan J. Watts (b. 1971), Steven Strogatz (b. 1959)
A new wave of interest in small worlds followed the study of Watts and Strogatz, finding that the small world property applies to natural and technological networks as well [29].

1999: 19 degrees of the WWW
Measurements on the WWW indicate that the separation between two randomly chosen documents is 19 [21] (BOX 3.6).

2011: four degrees of separation
The Facebook Data Team measures the average distance between its users, finding “4 degrees” (BOX 3.7).
SECTION 3.9
CLUSTERING COEFFICIENT
The degree of a node contains no information about the relationship between a node's neighbors. Do they all know each other, or are they perhaps isolated from each other? The answer is provided by the local clustering coefficient C_i, which measures the density of links in node i's immediate neighborhood: C_i = 0 means that there are no links between i's neighbors; C_i = 1 implies that all of i's neighbors link to each other (SECTION 2.10).

To calculate C_i for a node in a random network we need to estimate the expected number of links L_i between the node's k_i neighbors. In a random network the probability that two of i's neighbors link to each other is p. As there are k_i(k_i - 1)/2 possible links between the k_i neighbors of node i, the expected value of L_i is

$$\langle L_i \rangle = p \, \frac{k_i (k_i - 1)}{2}. \tag{3.20}$$

Thus the local clustering coefficient of a random network is

$$C_i = \frac{2 \langle L_i \rangle}{k_i (k_i - 1)} = p = \frac{\langle k \rangle}{N}. \tag{3.21}$$

Equation (3.21) makes two predictions:
(1) For fixed 〈k〉, the larger the network, the smaller is a node's clustering coefficient. Consequently a node's local clustering coefficient C_i is expected to decrease as 1/N. Note that the network's average clustering coefficient, 〈C〉, also follows (3.21).
(2) The local clustering coefficient of a node is independent of the node's degree.

To test the validity of (3.21) we plot 〈C〉/〈k〉 as a function of N for several undirected networks (Figure 3.13a). We find that 〈C〉/〈k〉 does not decrease as 1/N, but is largely independent of N, in violation of the prediction (3.21) and point (1) above. In Figure 3.13b-d we also show the dependence of C on the node's degree k_i for three real networks, finding that C(k) systematically decreases with the degree, again in violation of (3.21) and point (2).

In summary, we find that the random network model does not capture the clustering of real networks. Instead real networks have a much higher clustering coefficient than expected for a random network of similar N and L. An extension of the random network model proposed by Watts and Strogatz [29] addresses the coexistence of high 〈C〉 and the small world property (BOX 3.9). It fails to explain, however, why high-degree nodes have a smaller clustering coefficient than low-degree nodes. Models explaining the shape of C(k) are discussed in Chapter 9.
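The prediction (3.21) can be checked numerically. The following is a minimal Python sketch, not part of the original text: the helper names (random_graph, local_clustering, average_clustering), the seed and the network sizes are illustrative assumptions. It generates G(N, p) networks with fixed 〈k〉 = 10 and compares the measured average clustering coefficient with p.

```python
import random
from itertools import combinations


def random_graph(n, p, rng):
    """Return the adjacency sets of one G(N, p) realization."""
    adj = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def local_clustering(adj, i):
    """C_i = 2 L_i / (k_i (k_i - 1)); taken to be 0 when k_i < 2."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[i], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Average of C_i over all nodes."""
    return sum(local_clustering(adj, i) for i in range(len(adj))) / len(adj)


rng = random.Random(42)   # fixed seed, chosen arbitrarily
avg_degree = 10.0         # <k> kept fixed while N grows
results = {}
for n in (200, 400, 800):
    p = avg_degree / (n - 1)
    results[n] = average_clustering(random_graph(n, p, rng))
    print(f"N={n}  p={p:.4f}  measured <C>={results[n]:.4f}")
```

Under these assumptions the measured 〈C〉 should stay close to p and shrink as N grows, which is the behavior that the real networks of Figure 3.13a fail to show.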
Figure 3.13
Clustering in Real Networks
(a) Comparing the average clustering coefficient of real networks with the prediction (3.21) for random networks. The circles and their colors correspond to the networks of Table 3.2. Directed networks were made undirected to calculate 〈C〉 and 〈k〉. The green line corresponds to (3.21), predicting that for random networks the average clustering coefficient decreases as 1/N. In contrast, for real networks 〈C〉 appears to be independent of N.
(b)-(d) The dependence of the local clustering coefficient, C(k), on the node's degree for (b) the Internet, (c) the science collaboration network and (d) the protein interaction network. C(k) is measured by averaging the local clustering coefficient of all nodes with the same degree k. The green horizontal line corresponds to 〈C〉.
[Panels: (a) All Networks, 〈C〉/〈k〉 versus N; (b) Internet, (c) Science Collaboration, (d) Protein Interactions, C(k) versus k; logarithmic axes.]
BOX 3.9 WATTS-STROGATZ MODEL

Duncan Watts and Steven Strogatz proposed an extension of the random network model (Figure 3.14) motivated by two observations [29]:

(a) Small World Property
In real networks the average distance between two nodes depends logarithmically on N (3.18), rather than following a polynomial expected for regular lattices (Figure 3.11).

(b) High Clustering
The average clustering coefficient of real networks is much higher than expected for a random network of similar N and L (Figure 3.13a).

The Watts-Strogatz model (also called the small-world model) interpolates between a regular lattice, which has high clustering but lacks the small-world phenomenon, and a random network, which has low clustering, but displays the small-world property (Figure 3.14a-c). Numerical simulations indicate that for a range of rewiring parameters the model's average path length is low but the clustering coefficient is high, hence reproducing the coexistence of high clustering and small-world phenomena (Figure 3.14d).

Being an extension of the random network model, the Watts-Strogatz model predicts a Poisson-like bounded degree distribution. Consequently high degree nodes, like those seen in Figure 3.6, are absent from it. Furthermore it predicts a k-independent C(k), being unable to recover the k-dependence observed in Figures 3.13b-d.

As we show in the next chapters, understanding the coexistence of the small world property with high clustering must start from the network's correct degree distribution.

Figure 3.14
The Watts-Strogatz Model
(a) Regular (p = 0): We start from a ring of nodes, each node being connected to its immediate and next neighbors. Three of the six possible links among a node's four neighbors are present, hence initially each node has C = 1/2.
(b) Small-world: With probability p each link is rewired to a randomly chosen node. For small p the network maintains high clustering but the random long-range links can drastically decrease the distances between the nodes.
(c) Random (p = 1): For p = 1 all links have been rewired, so the network turns into a random network.
(d) The dependence of the average path length d(p) and clustering coefficient 〈C(p)〉 on the rewiring parameter p. Note that d(p) and 〈C(p)〉 have been normalized by d(0) and 〈C(0)〉 obtained for a regular lattice (i.e. for p = 0 in (a)). The rapid drop in d(p) signals the onset of the small-world phenomenon. During this drop, 〈C(p)〉 remains high. Hence in the range 0.001 [...]

[...] ln N condition, implying that they should be broken into isolated clusters (Table 3.1). Some networks are indeed fragmented, most are not.
Average Path Length
Random network theory predicts that the average path length follows (3.19), a prediction that offers a reasonable approximation for the observed path lengths. Hence the random network model can account for the emergence of small world phenomena.

Clustering Coefficient
In a random network the local clustering coefficient is independent of the node's degree and 〈C〉 depends on the system size as 1/N. In contrast, measurements indicate that for real networks C(k) decreases with the node degrees and is largely independent of the system size (Figure 3.13).

Taken together, it appears that the small world phenomenon is the only property reasonably explained by the random network model. All other network characteristics, from the degree distribution to the clustering coefficient, are significantly different in real networks. The extension of the Erdős-Rényi model proposed by Watts and Strogatz successfully predicts the coexistence of high C and low 〈d〉, but fails to explain the degree distribution and C(k). In fact, the more we learn about real networks, the more we will arrive at the startling conclusion that we do not know of any real network that is accurately described by the random network model.

This conclusion raises a legitimate question: If real networks are not random, why did we devote a full chapter to the random network model? The answer is simple: The model serves as an important reference as we proceed to explore the properties of real networks. Each time we observe some network property we will have to ask if it could have emerged by chance. For this we turn to the random network model as a guide: If the property is present in the model, it means that randomness can account for it. If the property is absent in random networks, it may represent some signature of order, requiring a deeper explanation. So, the random network model may be the wrong model for most real systems, but it remains quite relevant for network science (BOX 3.10).

AT A GLANCE: RANDOM NETWORKS
Definition: N nodes, where each node pair is connected with probability p.
Average Degree: $\langle k \rangle = p(N-1)$.
Average Number of Links: $\langle L \rangle = \frac{pN(N-1)}{2}$.
Degree Distribution:
Binomial Form: $p_k = \binom{N-1}{k} p^k (1-p)^{N-1-k}$.
Poisson Form: $p_k = e^{-\langle k \rangle} \frac{\langle k \rangle^k}{k!}$.
Giant Component (GC) (N_G):
〈k〉 < 1: $N_G \sim \ln N$
1 < 〈k〉 < ln N: $N_G \sim N^{2/3}$
〈k〉 > ln N: $N_G \sim (p - p_c)N$
Average Distance: $\langle d \rangle \propto \frac{\ln N}{\ln \langle k \rangle}$.
Clustering Coefficient: $\langle C \rangle = \frac{\langle k \rangle}{N}$.
REAL NETWORKS ARE NOT RANDOM
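The rewiring procedure of the Watts-Strogatz model (BOX 3.9) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names, the seed, the network size N = 200 and the choice to rewire only one end of each link are assumptions, and the average path length is taken over connected node pairs only, since rewiring can occasionally disconnect the network.

```python
import random
from collections import deque
from itertools import combinations


def ring_lattice(n, m):
    """Ring of n nodes, each linked to its m nearest neighbors on either side."""
    adj = [set() for _ in range(n)]
    for i in range(n):
        for step in range(1, m + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj


def watts_strogatz(n, m, p, rng):
    """Rewire the far end of each lattice link with probability p."""
    adj = ring_lattice(n, m)
    lattice_links = [(i, (i + step) % n) for i in range(n) for step in range(1, m + 1)]
    for i, j in lattice_links:
        if rng.random() < p:
            # New endpoint: any node that is not i and not already linked to i.
            candidates = [t for t in range(n) if t != i and t not in adj[i]]
            if candidates:
                t = rng.choice(candidates)
                adj[i].discard(j)
                adj[j].discard(i)
                adj[i].add(t)
                adj[t].add(i)
    return adj


def average_clustering(adj):
    """Average local clustering coefficient (C_i = 0 for nodes with k_i < 2)."""
    total = 0.0
    for nbrs in adj:
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
            total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean shortest path over all connected ordered node pairs (BFS from each node)."""
    total, pairs = 0, 0
    for source in range(len(adj)):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


rng = random.Random(7)  # arbitrary seed
results = {}
for p in (0.0, 0.01, 0.1, 1.0):
    net = watts_strogatz(200, 2, p, rng)
    results[p] = (average_clustering(net), average_path_length(net))

c0, d0 = results[0.0]
for p, (c, d) in results.items():
    print(f"p={p:<5}  C(p)/C(0)={c / c0:.3f}  d(p)/d(0)={d / d0:.3f}")
```

Printing the normalized quantities mirrors the presentation of Figure 3.14d: for intermediate p the path length should already have dropped while the clustering is still close to its lattice value.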
BOX 3.10 RANDOM NETWORKS AND NETWORK SCIENCE

The lack of agreement between random and real networks raises an important question: How could a theory survive so long given its poor agreement with reality? The answer is simple: Random network theory was never meant to serve as a model of real systems.

Erdős and Rényi write in their first paper [2] that random networks “may be interesting not only from a purely mathematical point of view. In fact, the evolution of graphs may be considered as a rather simplified model of the evolution of certain communication nets (railways, road or electric network systems, etc.) of a country or some unit.” Yet, in the string of eight papers authored by them on the subject [2-9], this is the only mention of the potential practical value of their approach. The subsequent development of random graphs was driven by the problem's inherent mathematical challenges, rather than its applications.

It is tempting to follow Thomas Kuhn and view network science as a paradigm change from random graphs to a theory of real networks [30]. In reality, there was no network paradigm before the end of the 1990s. This period is characterized by a lack of systematic attempts to compare the properties of real networks with graph theoretical models. The work of Erdős and Rényi gained prominence outside mathematics only after the emergence of network science (Figure 3.15).

Network theory does not lessen the contributions of Erdős and Rényi, but celebrates the unintended impact of their work. When we discuss the discrepancies between random and real networks, we do so mainly for pedagogical reasons: to offer a proper foundation on which we can understand the properties of real systems.

Figure 3.15
Network Science and Random Networks
While today we perceive the Erdős-Rényi model as the cornerstone of network theory, the model was hardly known outside a small subfield of mathematics. This is illustrated by the yearly citations of the first two papers by Erdős and Rényi, published in 1959 and 1960 [2,3]. For four decades after their publication the papers gathered fewer than 10 citations each year. The number of citations exploded after the first papers on scale-free networks [21, 31, 32] turned Erdős and Rényi’s work into the reference model of network theory.
[Plot: yearly citations (0 to 200) of Erdős-Rényi 1959 and Erdős-Rényi 1960, for the years 1960 to 2010.]
SECTION 3.11
HOMEWORK
3.1. Erdős-Rényi Networks
Consider an Erdős-Rényi network with N = 3,000 nodes, connected to each other with probability p = 10^-3.
(a) What is the expected number of links, 〈L〉?
(b) In which regime is the network?
(c) Calculate the probability p_c so that the network is at the critical point.
(d) Given the linking probability p = 10^-3, calculate the number of nodes N^cr so that the network has only one component.
(e) For the network in (d), calculate the average degree 〈k^cr〉 and the average distance between two randomly chosen nodes 〈d〉.
(f) Calculate the degree distribution p_k of this network (approximate with a Poisson degree distribution).

3.2. Generating Erdős-Rényi Networks
Relying on the G(N, p) model, generate with a computer three networks with N = 500 nodes and average degree (a) 〈k〉 = 0.8, (b) 〈k〉 = 1 and (c) 〈k〉 = 8. Visualize these networks.

3.3. Circle Network
Consider a network with N nodes placed on a circle, so that each node connects to m neighbors on either side (consequently each node has degree 2m). Figure 3.14(a) shows an example of such a network with m = 2 and N = 20. Calculate the average clustering coefficient 〈C〉 of this network and the average shortest path 〈d〉. For simplicity assume that N and m are chosen such that (N - 1)/2m is an integer. What happens to 〈C〉 if N ≫ 1? And what happens to 〈d〉?

3.4. Cayley Tree
A Cayley tree is a symmetric tree, constructed starting from a central node of degree k. Each node at distance d from the central node has degree k, until we reach the nodes at distance P that have degree one and are called leaves (see Figure 3.16 for a Cayley tree with k = 3 and P = 5).
(a) Calculate the number of nodes reachable in t steps from the central node.
(b) Calculate the degree distribution of the network.
(c) Calculate the diameter d_max.
(d) Find an expression for the diameter d_max in terms of the total number of nodes N.
(e) Does the network display the small-world property?

Figure 3.16
Cayley Tree
A Cayley tree with k = 3 and P = 5.

3.5. Snobbish Network
Consider a network of N red and N blue nodes. The probability that there is a link between nodes of identical color is p and the probability that there is a link between nodes of different color is q. A network is snobbish if p > q, capturing a tendency to connect to nodes of the same color. For q = 0 the network has at least two components, containing nodes with the same color.
(a) Calculate the average degree of the "blue" subnetwork made of only blue nodes, and the average degree in the full network.
(b) Determine the minimal p and q required to have, with high probability, just one component.
(c) Show that for large N even very snobbish networks (p ≫ q) display the small-world property.

3.6. Snobbish Social Networks
Consider the following variant of the model discussed above: We have a network of 2N nodes, consisting of an equal number of red and blue nodes, while an f fraction of the 2N nodes are purple. Blue and red nodes do not connect to each other (q = 0), while they connect with probability p to nodes of the same color. Purple nodes connect with the same probability p to both red and blue nodes.
(a) We call the red and blue communities interactive if a typical red node is just two steps away from a blue node and vice versa. Evaluate the fraction of purple nodes required for the communities to be interactive.
(b) Comment on the size of the purple community if the average degree of the blue (or red) nodes is 〈k〉 ≫ 1.
(c) What are the implications of this model for the structure of social (and other) networks?
SECTION 3.12
ADVANCED TOPICS 3.A DERIVING THE POISSON DISTRIBUTION
To derive the Poisson form of the degree distribution we start from the exact binomial distribution (3.7)

$$p_k = \binom{N-1}{k} p^k (1-p)^{N-1-k} \tag{3.22}$$

that characterizes a random graph. We rewrite the first term on the r.h.s. as

$$\binom{N-1}{k} = \frac{(N-1)(N-1-1)(N-1-2)\cdots(N-1-k+1)}{k!} \approx \frac{(N-1)^k}{k!}, \tag{3.23}$$

where in the last term we used that k ≪ N. The last term of (3.22) can be simplified as

$$\ln\left[(1-p)^{(N-1)-k}\right] = (N-1-k)\,\ln\left(1 - \frac{\langle k \rangle}{N-1}\right)$$

and using the series expansion

$$\ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} x^n = x - \frac{x^2}{2} + \frac{x^3}{3} - \ldots, \qquad -1 < x \le 1,$$

we obtain

$$\ln\left[(1-p)^{N-1-k}\right] \approx -(N-1-k)\,\frac{\langle k \rangle}{N-1} = -\langle k \rangle \left(1 - \frac{k}{N-1}\right) \approx -\langle k \rangle,$$

which is valid if N ≫ k. This represents the small degree approximation at the heart of this derivation. Therefore the last term of (3.22) becomes

$$(1-p)^{N-1-k} = e^{-\langle k \rangle}. \tag{3.24}$$

Combining (3.22), (3.23), and (3.24) we obtain the Poisson form of the degree distribution

$$p_k = \binom{N-1}{k} p^k (1-p)^{(N-1)-k} = \frac{(N-1)^k}{k!}\, p^k e^{-\langle k \rangle} = \frac{(N-1)^k}{k!} \left(\frac{\langle k \rangle}{N-1}\right)^k e^{-\langle k \rangle},$$

or

$$p_k = e^{-\langle k \rangle} \frac{\langle k \rangle^k}{k!}. \tag{3.25}$$
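The quality of the small-degree approximation behind (3.25) can be inspected numerically. The sketch below is an illustration, not part of the original derivation: the function names, the choice 〈k〉 = 5 and the three network sizes are assumptions. It evaluates the exact binomial form (3.22) and the Poisson form (3.25) and reports the largest pointwise difference over k = 0, ..., 19.

```python
import math


def binomial_pk(k, n_nodes, p):
    """Exact degree distribution (3.22) of a G(N, p) network."""
    return math.comb(n_nodes - 1, k) * p**k * (1 - p) ** (n_nodes - 1 - k)


def poisson_pk(k, avg_degree):
    """Poisson form (3.25)."""
    return math.exp(-avg_degree) * avg_degree**k / math.factorial(k)


avg_degree = 5.0
max_gap = {}
for n_nodes in (20, 100, 10_000):
    p = avg_degree / (n_nodes - 1)  # keeps <k> = p (N - 1) fixed
    max_gap[n_nodes] = max(
        abs(binomial_pk(k, n_nodes, p) - poisson_pk(k, avg_degree))
        for k in range(20)
    )
    print(f"N={n_nodes:>6}  largest |binomial - Poisson| = {max_gap[n_nodes]:.2e}")
```

The gap should shrink as N grows at fixed 〈k〉, consistent with the condition N ≫ k used in the derivation.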
SECTION 3.13
ADVANCED TOPICS 3.B MAXIMUM AND MINIMUM DEGREES
To determine the expected degree of the largest node in a random network, called the network’s upper natural cutoff, we define the degree k_max such that in a network of N nodes we have at most one node with degree higher than k_max. Mathematically this means that the area behind the Poisson distribution p_k for k > k_max should be approximately 1/N (Figure 3.17). Since the area is given by 1 - P(k_max), where P(k) is the cumulative degree distribution of p_k, the network’s largest node satisfies:

$$N \left[ 1 - P(k_{\max}) \right] \approx 1. \tag{3.26}$$

We write ≈ instead of =, because k_max is an integer, so in general the exact equation does not have a solution. For a Poisson distribution

$$1 - P(k_{\max}) = 1 - e^{-\langle k \rangle} \sum_{k=0}^{k_{\max}} \frac{\langle k \rangle^k}{k!} = e^{-\langle k \rangle} \sum_{k=k_{\max}+1}^{\infty} \frac{\langle k \rangle^k}{k!} \approx e^{-\langle k \rangle} \frac{\langle k \rangle^{k_{\max}+1}}{(k_{\max}+1)!}, \tag{3.27}$$

where in the last term we approximate the sum with its largest term.

For N = 10^9 and 〈k〉 = 1,000, roughly the size and the average degree of the globe’s social network, (3.26) and (3.27) predict k_max = 1,185, indicating that a random network lacks extremely popular individuals, or hubs.

We can use a similar argument to calculate the expected degree of the smallest node, k_min. By requiring that there should be at most one node with degree smaller than k_min we can write

$$N P(k_{\min} - 1) \approx 1. \tag{3.28}$$

For the Erdős-Rényi network we have

$$P(k_{\min} - 1) = e^{-\langle k \rangle} \sum_{k=0}^{k_{\min}-1} \frac{\langle k \rangle^k}{k!}. \tag{3.29}$$

Solving (3.28) with N = 10^9 and 〈k〉 = 1,000 we obtain k_min = 816.

Figure 3.17
Minimum and Maximum Degree
The estimated maximum degree of a network, k_max, is chosen so that there is at most one node whose degree is higher than k_max. This is often called the natural upper cutoff of a degree distribution. To calculate it, we need to set k_max such that the area under the degree distribution p_k for k > k_max equals 1/N, hence the total number of nodes expected in this region is exactly one. We follow a similar argument to determine the expected smallest degree, k_min.
[Plot: p_k versus k, with k_min and k_max marked; the area under the curve beyond each mark should be less than 1/N.]
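Equations (3.26)-(3.29) can be solved numerically. The following Python sketch is one possible way to do so, under stated assumptions: the function names are hypothetical, k_max is obtained with the largest-term approximation of (3.27), k_min with the exact sum (3.29), and both are defined through "≤ 1" conditions because the equations have no exact integer solution. Probabilities are evaluated in log space to avoid overflow for large 〈k〉.

```python
import math


def log_poisson(k, avg_degree):
    """Natural log of the Poisson probability p_k, computed stably."""
    return -avg_degree + k * math.log(avg_degree) - math.lgamma(k + 1)


def natural_cutoff(n_nodes, avg_degree):
    """Smallest k with N * p_(k+1) <= 1: (3.26) with the largest-term form of (3.27)."""
    k = int(avg_degree)
    while math.log(n_nodes) + log_poisson(k + 1, avg_degree) > 0:
        k += 1
    return k


def smallest_degree(n_nodes, avg_degree):
    """Largest k with N * P(k - 1) <= 1, using the exact cumulative sum (3.29)."""
    cumulative = 0.0  # P(k - 1): probability of a degree strictly below k
    k = 0
    while True:
        next_cumulative = cumulative + math.exp(log_poisson(k, avg_degree))
        if n_nodes * next_cumulative > 1:
            return k
        cumulative = next_cumulative
        k += 1


k_max = natural_cutoff(10**9, 1000.0)
k_min = smallest_degree(10**9, 1000.0)
print("k_max =", k_max, " k_min =", k_min)
```

For N = 10^9 and 〈k〉 = 1,000 the two values should fall close to the figures quoted in the text, within the rounding freedom left by the ≈ signs in (3.26) and (3.28).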
SECTION 3.14
ADVANCED TOPICS 3.C GIANT COMPONENT
In this section we introduce the argument, proposed independently by Solomonoff and Rapoport [11], and by Erdős and Rényi [2], for the emergence of the giant component at 〈k〉 = 1 [33].

Let us denote with u = 1 - N_G/N the fraction of nodes that are not in the giant component (GC), whose size we take to be N_G. If node i is part of the GC, it must link to another node j, which must also be part of the GC. Hence if i is not part of the GC, that could happen for two reasons:
• There is no link between i and j (probability for this is 1 - p).
• There is a link between i and j, but j is not part of the GC (probability for this is pu).

Therefore the total probability that i is not part of the GC via node j is 1 - p + pu. The probability that i is not linked to the GC via any other node is therefore (1 - p + pu)^(N-1), as there are N - 1 nodes that could serve as potential links to the GC for node i. As u is the fraction of nodes that do not belong to the GC, for any p and N the solution of the equation

$$u = (1 - p + pu)^{N-1} \tag{3.30}$$

provides the size of the giant component via N_G = N(1 - u). Using p = 〈k〉/(N - 1) and taking the logarithm of both sides, for 〈k〉 ≪ N we obtain

$$\ln u = (N-1) \ln\left[1 - \frac{\langle k \rangle}{N-1}(1-u)\right] \approx (N-1)\left[-\frac{\langle k \rangle}{N-1}(1-u)\right] = -\langle k \rangle (1-u), \tag{3.31}$$

where we used the series expansion for ln(1+x).

Taking an exponential of both sides leads to u = exp[-〈k〉(1 - u)]. If we denote with S the fraction of nodes in the giant component, S = N_G/N, then S = 1 - u and (3.31) results in

$$S = 1 - e^{-\langle k \rangle S}. \tag{3.32}$$

This equation provides the size of the giant component S as a function of 〈k〉 (Figure 3.18). While (3.32) looks simple, it does not have a closed solution. We can solve it graphically by plotting the right hand side of (3.32) as a function of S for various values of 〈k〉. To have a nonzero solution, the obtained curve must intersect with the dotted diagonal, representing the left hand side of (3.32). For small 〈k〉 the two curves intersect each other only at S = 0, indicating that for small 〈k〉 the size of the giant component is zero. Only when 〈k〉 exceeds a threshold value does a nonzero solution emerge.

To determine the value of 〈k〉 at which we start having a nonzero solution we take a derivative of (3.32), as the phase transition point is when the r.h.s. of (3.32) has the same derivative as the l.h.s. of (3.32), i.e. when

$$\frac{d}{dS}\left(1 - e^{-\langle k \rangle S}\right) = 1, \qquad \langle k \rangle e^{-\langle k \rangle S} = 1. \tag{3.33}$$

Setting S = 0, we obtain that the phase transition point is at 〈k〉 = 1 (see also ADVANCED TOPICS 3.F).

Figure 3.18
Graphical Solution
(a) The three purple curves correspond to y = 1 - exp[-〈k〉S] for 〈k〉 = 0.5, 1, 1.5. The green dashed diagonal corresponds to y = S, and the intersection of the dashed and purple curves provides the solution to (3.32). For 〈k〉 = 0.5 there is only one intersection at S = 0, indicating the absence of a giant component. The 〈k〉 = 1.5 curve has a solution at S = 0.583 (green vertical line). The 〈k〉 = 1 curve is precisely at the critical point, representing the separation between the regime where a nonzero solution for S exists and the regime where there is only the solution at S = 0.
(b) The size of the giant component as a function of 〈k〉 as predicted by (3.32). After [33].
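Besides the graphical construction of Figure 3.18, equation (3.32) can be solved by simple iteration. The sketch below is one such approach and not the method used in the text: the function name, the starting value S = 1, the tolerance and the iteration cap are illustrative assumptions. Starting from S = 1 steers the iteration toward the nonzero solution whenever one exists.

```python
import math


def giant_component_fraction(avg_degree, tol=1e-12, max_iter=100_000):
    """Solve S = 1 - exp(-<k> S), equation (3.32), by fixed-point iteration."""
    s = 1.0  # start from a fully connected guess
    for _ in range(max_iter):
        s_next = 1.0 - math.exp(-avg_degree * s)
        if abs(s_next - s) < tol:
            return s_next
        s = s_next
    return s  # not converged within max_iter (slow near <k> = 1)


for avg_degree in (0.5, 1.5, 2.0, 3.0):
    print(f"<k>={avg_degree}:  S={giant_component_fraction(avg_degree):.3f}")
```

For 〈k〉 = 0.5 the iteration collapses to S = 0, while for 〈k〉 = 1.5 it settles at the intersection S ≈ 0.583 marked in Figure 3.18a. Convergence becomes very slow as 〈k〉 approaches 1, which is why the sketch caps the number of iterations.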
SECTION 3.15
ADVANCED TOPICS 3.D COMPONENT SIZES

In Figure 3.7 we explored the size of the giant component, leaving an important question open: How many components do we expect for a given 〈k〉? What is their size distribution? The aim of this section is to discuss these topics.

Component Size Distribution
For a random network the probability that a randomly chosen node belongs to a component of size s (which is different from the giant component G) is [33]

$$p_s \sim \frac{(s \langle k \rangle)^{s-1}}{s!}\, e^{-\langle k \rangle s}. \tag{3.34}$$

Replacing 〈k〉^(s-1) with exp[(s - 1) ln〈k〉] and using the Stirling formula $s! \approx \sqrt{2\pi s}\,(s/e)^s$ for large s we obtain

$$p_s \sim s^{-3/2}\, e^{-(\langle k \rangle - 1)s + (s-1)\ln \langle k \rangle}. \tag{3.35}$$

Therefore the component size distribution has two contributions: a slowly decreasing power law term s^(-3/2) and a rapidly decreasing exponential term e^(-(〈k〉-1)s + (s-1)ln〈k〉). Given that the exponential term dominates for large s, (3.35) predicts that large components are prohibited. At the critical point, 〈k〉 = 1, all terms in the exponential cancel, hence p_s follows the power law

$$p_s \sim s^{-3/2}. \tag{3.36}$$

As a power law decreases relatively slowly, at the critical point we expect to observe clusters of widely different sizes, a property consistent with the behavior of a system during a phase transition (ADVANCED TOPICS 3.F). These predictions are supported by the numerical simulations shown in Figure 3.19.

Figure 3.19
Component Size Distribution
Component size distribution p_s in a random network, excluding the giant component.
(a)-(c) p_s for different 〈k〉 values and N, indicating that p_s converges for large N to the prediction (3.34).
(d) p_s for N = 10^4, shown for different 〈k〉. While for 〈k〉 < 1 and 〈k〉 > 1 the p_s distribution has an exponential form, right at the critical point 〈k〉 = 1 the distribution follows the power law (3.36). The continuous green lines correspond to (3.35). The first numerical study of the component size distribution in random networks was carried out in 1998 [34], preceding the exploding interest in complex networks.
[Panels: p_s versus s on logarithmic axes for 〈k〉 = 1/2, 1 and 3, with N = 10^2, 10^3, 10^4 and N = ∞.]
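The relation between (3.34), (3.35) and (3.36) can be made concrete with a short numerical sketch. It is an illustration only: the function names are hypothetical, the proportionality in (3.34) is read as an equality, and the prefactor 1/√(2π) that the ∼ sign in (3.35) leaves out is restored so that the two forms can be compared directly.

```python
import math


def log_ps(s, avg_degree):
    """log of e^{-<k>s} (<k>s)^{s-1} / s!, i.e. (3.34) read as an equality."""
    return -avg_degree * s + (s - 1) * math.log(avg_degree * s) - math.lgamma(s + 1)


def log_ps_stirling(s, avg_degree):
    """log of the large-s form (3.35), including the 1/sqrt(2*pi) prefactor."""
    return (-1.5 * math.log(s) - 0.5 * math.log(2 * math.pi)
            - (avg_degree - 1) * s + (s - 1) * math.log(avg_degree))


# Below the critical point the distribution is normalized and decays exponentially.
total_subcritical = sum(math.exp(log_ps(s, 0.5)) for s in range(1, 2001))
print("sum of p_s for <k>=0.5:", total_subcritical)

# At <k> = 1 the exponential factor cancels and p_s approaches s^(-3/2) / sqrt(2*pi).
critical_ratio = math.exp(log_ps(1000, 1.0) - log_ps_stirling(1000, 1.0))
print("exact / Stirling at s=1000, <k>=1:", critical_ratio)

for avg_degree in (0.5, 1.0, 3.0):
    row = [f"{math.exp(log_ps(s, avg_degree)):.2e}" for s in (1, 10, 100)]
    print(f"<k>={avg_degree}: p_1, p_10, p_100 =", row)
```

The printed rows show how quickly p_s falls off away from the critical point compared with the slow power-law decay at 〈k〉 = 1.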
Average Component Size
The calculations also indicate that the average component size (once again, excluding the giant component) follows [33]

$$\langle s \rangle = \frac{1}{1 - \langle k \rangle + \langle k \rangle N_G / N}. \tag{3.37}$$

For 〈k〉 < 1 we lack a giant component (N_G = 0), hence (3.37) becomes

$$\langle s \rangle = \frac{1}{1 - \langle k \rangle}, \tag{3.38}$$

which diverges when the average degree approaches the critical point 〈k〉 = 1. Therefore as we approach the critical point, the size of the clusters increases, signaling the emergence of the giant component at 〈k〉 = 1. Numerical simulations support these predictions for large N (Figure 3.20).

To determine the average component size for 〈k〉 > 1 using (3.37), we need to first calculate the size of the giant component. This can be done in a self-consistent manner, obtaining that the average cluster size decreases for 〈k〉 > 1, as most clusters are gradually absorbed by the giant component.

Note that (3.37) predicts the size of the component to which a randomly chosen node belongs. This is a biased measure, as the chance of belonging to a larger cluster is higher than the chance of belonging to a smaller one. The bias is linear in the cluster size s. If we correct for this bias, we obtain the average size of the small components that we would get if we were to inspect each cluster one by one and then measure their average size [33]

$$\langle s' \rangle = \frac{2}{2 - \langle k \rangle + \langle k \rangle N_G / N}. \tag{3.39}$$

Figure 3.20 offers numerical support for (3.39).

Figure 3.20
Average Component Size
(a) The average size 〈s〉 of a component to which a randomly chosen node belongs, as predicted by (3.37) (purple). The green curve shows the overall average size 〈s'〉 of a component as predicted by (3.39). (After [33]).
(b) The average cluster size in a random network. We choose a node and determine the size of the cluster it belongs to. This measure is biased, as each component of size s will be counted s times. The larger N becomes, the more closely the numerical data follows the prediction (3.37). As predicted, 〈s〉 diverges at the 〈k〉 = 1 critical point, supporting the existence of a phase transition (ADVANCED TOPICS 3.F).
(c) The average cluster size in a random network, where we corrected for the bias in (b) by selecting each component only once. The larger N becomes, the more closely the numerical data follows the prediction (3.39).
[Panels: average component size versus 〈k〉 for N = 10^2, 10^3, 10^4 and theory.]
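The difference between the biased average (3.37) and the corrected average (3.39) can be seen in a small simulation. The sketch below is illustrative and rests on several assumptions: it restricts itself to 〈k〉 < 1 so that no giant component has to be excluded, it uses the G(N, L) variant of the model with exactly L = 〈k〉N/2 links for speed, and the function names, seed and N = 20,000 are arbitrary choices.

```python
import random
from collections import Counter


def component_sizes(n_nodes, n_links, rng):
    """Component sizes of a random network with n_links distinct random links."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    placed = set()
    while len(placed) < n_links:
        i, j = rng.sample(range(n_nodes), 2)
        link = (min(i, j), max(i, j))
        if link in placed:
            continue
        placed.add(link)
        parent[find(i)] = find(j)  # merge the two components
    return list(Counter(find(x) for x in range(n_nodes)).values())


def measured_averages(sizes):
    """Return (<s>, <s'>): node-weighted and component-weighted average sizes."""
    n_nodes = sum(sizes)
    node_average = sum(s * s for s in sizes) / n_nodes  # size seen by a random node
    component_average = n_nodes / len(sizes)            # each component counted once
    return node_average, component_average


rng = random.Random(1)
n_nodes = 20_000
measured = {}
for avg_degree in (0.25, 0.5, 0.75):
    sizes = component_sizes(n_nodes, round(avg_degree * n_nodes / 2), rng)
    measured[avg_degree] = measured_averages(sizes)
    theory = (1 / (1 - avg_degree), 2 / (2 - avg_degree))  # (3.38) and (3.39) with N_G = 0
    print(avg_degree, "measured:", measured[avg_degree], "theory:", theory)
```

Because a component of size s is reached from s different nodes, the node-weighted average is always at least as large as the component-weighted one, which is the bias discussed above.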
SECTION 3.16
ADVANCED TOPICS 3.E FULLY CONNECTED REGIME
To determine the value of 〈k〉 at which most nodes become part of the giant component, we calculate the probability that a randomly selected node does not have a link to the giant component, which is (1 - p)^(N_G) ≈ (1 - p)^N, as in this regime N_G ≃ N. The expected number of such isolated nodes is

$$I_N = N(1-p)^N = N\left(1 - \frac{N \cdot p}{N}\right)^N \approx N e^{-Np}, \tag{3.40}$$

where we used $\left(1 - \frac{x}{n}\right)^n \approx e^{-x}$, an approximation valid for large n. If we make p sufficiently large, we arrive at the point where only one node is disconnected from the giant component. At this point I_N = 1, hence according to (3.40) p needs to satisfy $N e^{-Np} = 1$. Consequently, the value of p at which we are about to enter the fully connected regime is

$$p = \frac{\ln N}{N}, \tag{3.41}$$

which leads to (3.14) in terms of 〈k〉.
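The threshold (3.41) and the approximation used in (3.40) can be verified with a few lines of arithmetic. The sketch below is illustrative; the function name and the three network sizes are assumptions.

```python
import math


def expected_isolated(n_nodes, p):
    """Both sides of (3.40): the exact N (1 - p)^N and the approximation N e^{-Np}."""
    exact = n_nodes * (1 - p) ** n_nodes
    approx = n_nodes * math.exp(-n_nodes * p)
    return exact, approx


thresholds = {}
for n_nodes in (10**3, 10**6, 10**9):
    p_full = math.log(n_nodes) / n_nodes  # threshold (3.41)
    exact, approx = expected_isolated(n_nodes, p_full)
    thresholds[n_nodes] = (p_full, exact, approx)
    avg_degree = p_full * (n_nodes - 1)   # corresponding <k>, close to ln N
    print(f"N={n_nodes:.0e}  p={p_full:.3e}  <k>={avg_degree:.2f}  "
          f"I_N exact={exact:.4f}  approx={approx:.4f}")
```

At p = ln N / N the approximate expression gives exactly one isolated node by construction, and the exact expression approaches one as N grows, which illustrates why the approximation is only claimed for large N.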
SECTION 3.17
ADVANCED TOPICS 3.F PHASE TRANSITIONS
The emergence of the giant component at 〈k〉 = 1 in the random network model is reminiscent of a phase transition, a much studied phenomenon in physics and chemistry [35]. Consider two examples:

i. Water-Ice Transition (Figure 3.21a): At high temperatures the H2O molecules engage in a diffusive motion, forming small groups and then breaking apart to group up with other water molecules. If cooled, at 0˚C the molecules suddenly stop this diffusive dance, forming an ordered rigid ice crystal.

ii. Magnetism (Figure 3.21b): In ferromagnetic metals like iron at high temperatures the spins point in randomly chosen directions. Under some critical temperature T_c all atoms orient their spins in the same direction and the metal turns into a magnet.

The freezing of a liquid and the emergence of magnetization are examples of phase transitions, representing transitions from disorder to order. Indeed, relative to the perfect order of the crystalline ice, liquid water is rather disordered. Similarly, the randomly oriented spins in a ferromagnet take up the highly ordered common orientation under T_c.

Many properties of a system undergoing a phase transition are universal. This means that the same quantitative patterns are observed in a wide range of systems, from magma freezing into rock to a ceramic material turning into a superconductor. Furthermore, near the phase transition point, called the critical point, many quantities of interest follow power laws.

The phenomena observed near the critical point 〈k〉 = 1 in a random network are in many ways similar to a phase transition:

• The similarity between Figure 3.7a and the magnetization diagram of Figure 3.21b is not accidental: they both show a transition from disorder to order. In random networks this corresponds to the emergence of a giant component when 〈k〉 exceeds 〈k〉 = 1.

• As we approach the freezing point, ice crystals of widely different sizes are observed, and so are domains of atoms with spins pointing in the same direction. The size distribution of the ice crystals or magnetic domains follows a power law. Similarly, while for 〈k〉 < 1 and 〈k〉 > 1 the cluster sizes follow an exponential distribution, right at the phase transition point p_s follows the power law (3.36), indicating the coexistence of components of widely different sizes.

• At the critical point the average size of the ice crystals or of the magnetic domains diverges, assuring that the whole system turns into a single frozen ice crystal or that all spins point in the same direction. Similarly in a random network the average cluster size 〈s〉 diverges as we approach 〈k〉 = 1 (Figure 3.20).

Figure 3.21
Phase Transitions
(a) Water-Ice Phase Transition
The hydrogen bonds that hold the water molecules together (dotted lines) are weak, constantly breaking up and reforming, maintaining partially ordered local structures (left panel). The temperature-pressure phase diagram indicates (center panel) that by lowering the temperature, the water undergoes a phase transition, moving from a liquid (purple) to a frozen solid (green) phase. In the solid phase each water molecule binds rigidly to four other molecules, forming an ice lattice (right panel). After http://www.lbl.gov/ScienceArticles/Archive/sabl/2005/February/ watersolid.html.
(b) Magnetic Phase Transition
In ferromagnetic materials the magnetic moments of the individual atoms (spins) can point in two different directions. At high temperatures they choose their direction randomly (right panel). In this disordered state the system’s total magnetization (m = ∆M/N, where ∆M is the number of up spins minus the number of down spins) is zero. The phase diagram (middle panel) indicates that by lowering the temperature T, the system undergoes a phase transition at T = T_c, when a nonzero magnetization emerges. Lowering T further allows m to converge to one. In this ordered phase all spins point in the same direction (left panel).
[Panels: (a) pressure (atm) versus temperature phase diagram with solid, liquid and gas regions, marked at 1.0 atm, 0° C and 100° C; (b) magnetization m versus T, with the ordered phase below T_c and the disordered phase above it.]
SECTION 3.18
ADVANCED TOPICS 3.G SMALL WORLD CORRECTIONS
Equation (3.18) offers only an approximation to the network diameter, valid for very large N and small d. Indeed, as soon as ⟨k⟩^d approaches the system size N the ⟨k⟩^d scaling must break down, as we do not have enough nodes to continue the ⟨k⟩^d expansion. Such finite size effects result in corrections to (3.18). For a random network with average degree ⟨k⟩, the network diameter is better approximated by [36]

$$d_{\max} = \frac{\ln N}{\ln \langle k \rangle} + \frac{2 \ln N}{\ln\left[-W\left(\langle k \rangle \exp(-\langle k \rangle)\right)\right]}, \tag{3.42}$$

where the Lambert W-function W(z) is the principal inverse of f(z) = z exp(z). The first term on the r.h.s. is (3.18), while the second is the correction that depends on the average degree. The correction increases the diameter, accounting for the fact that when we approach the network's diameter the number of nodes must grow slower than ⟨k⟩. The magnitude of the correction becomes more obvious if we consider the various limits of (3.42). In the ⟨k⟩ → 1 limit we can calculate the Lambert W-function, finding for the diameter [36]

$$d_{\max} = 3\,\frac{\ln N}{\ln \langle k \rangle}. \tag{3.43}$$

Hence at the moment when the giant component emerges the network diameter is three times our prediction (3.18). This is due to the fact that at the critical point ⟨k⟩ = 1 the network has a tree-like structure, consisting of long chains with hardly any loops, a configuration that increases dmax. In the ⟨k⟩ → ∞ limit, corresponding to a very dense network, (3.42) becomes

$$d_{\max} = \frac{\ln N}{\ln \langle k \rangle} + \frac{2 \ln N}{\langle k \rangle} + \ln N\,\frac{\ln \langle k \rangle}{\langle k \rangle^{2}}. \tag{3.44}$$

Hence if ⟨k⟩ increases, the second and the third terms vanish and the solution (3.42) converges to the result (3.18).
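The convergence of the dense limit (3.44) to (3.18) can be checked numerically. The following is a minimal sketch, assuming a hypothetical network size N = 10^6 and a few illustrative values of ⟨k⟩; the function names are ours and not part of the text.

```python
import math

def diameter_basic(n, k_avg):
    """Leading-order estimate (3.18): ln N / ln<k>."""
    return math.log(n) / math.log(k_avg)

def diameter_dense(n, k_avg):
    """Dense-limit expansion (3.44): (3.18) plus two correction terms."""
    ln_n = math.log(n)
    return (ln_n / math.log(k_avg)
            + 2 * ln_n / k_avg
            + ln_n * math.log(k_avg) / k_avg ** 2)

n = 10 ** 6  # hypothetical network size, chosen only for illustration
for k_avg in (10, 100, 1000):
    base = diameter_basic(n, k_avg)
    dense = diameter_dense(n, k_avg)
    # The ratio approaches one as the average degree grows.
    print(k_avg, round(base, 3), round(dense, 3), round(dense / base, 3))
```

For the values chosen here the ratio of the two estimates shrinks toward one as ⟨k⟩ grows, in line with the statement that the correction terms vanish in dense networks.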
SECTION 3.19
BIBLIOGRAPHY
[1] A.-L. Barabási. Linked: The new science of networks. Plume Books, 2003.
[2] P. Erdős and A. Rényi. On random graphs, I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
[3] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5:17–61, 1960.
[4] P. Erdős and A. Rényi. On the evolution of random graphs. Bull. Inst. Internat. Statist., 38:343–347, 1961.
[5] P. Erdős and A. Rényi. On the strength of connectedness of a random graph. Acta Math. Acad. Sci. Hungary, 12:261–267, 1961.
[6] P. Erdős and A. Rényi. Asymmetric graphs. Acta Mathematica Acad. Sci. Hungarica, 14:295–315, 1963.
[7] P. Erdős and A. Rényi. On random matrices. Publ. Math. Inst. Hung. Acad. Sci., 8:455–461, 1966.
[8] P. Erdős and A. Rényi. On the existence of a factor of degree one of a connected random graph. Acta Math. Acad. Sci. Hungary, 17:359–368, 1966.
[9] P. Erdős and A. Rényi. On random matrices II. Studia Sci. Math. Hungary, 13:459–464, 1968.
[10] E. N. Gilbert. Random graphs. The Annals of Mathematical Statistics, 30:1141–1144, 1959.
[11] R. Solomonoff and A. Rapoport. Connectivity of random nets. Bulletin of Mathematical Biology, 13:107–117, 1951.
[12] P. Hoffman. The Man Who Loved Only Numbers: The Story of Paul Erdős and the Search for Mathematical Truth. Hyperion Books, 1998.
[13] B. Schechter. My Brain is Open: The Mathematical Journeys of Paul Erdős. Simon and Schuster, 1998.
[14] G. P. Csicsery. N is a Number: A Portrait of Paul Erdős, 1993.
[15] B. Bollobás. Random Graphs. Cambridge University Press, 2001.
[16] L. C. Freeman and C. R. Thompson. Estimating Acquaintanceship Volume, pp. 147–158, in The Small World, edited by Manfred Kochen (Ablex, Norwood, NJ), 1989.
[17] H. Rosenthal. Acquaintances and contacts of Franklin Roosevelt. Unpublished thesis. Massachusetts Institute of Technology, 1960.
[18] L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In ACM Web Science 2012: Conference Proceedings, pages 45–54. ACM Press, 2012.
[19] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, 2002.
[20] I. de Sola Pool and M. Kochen. Contacts and Influence. Social Networks, 1:5–51, 1978.
[21] H. Jeong, R. Albert and A.-L. Barabási. Internet: Diameter of the world-wide web. Nature, 401:130–131, 1999.
[22] S. Lawrence and C.L. Giles. Accessibility of information on the Web. Nature, 400:107, 1999.
[23] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309–320, 2000.
[24] S. Milgram. The Small World Problem. Psychology Today, 2:60–67, 1967.
[25] J. Travers and S. Milgram. An Experimental Study of the Small World Problem. Sociometry, 32:425–443, 1969.
[26] K. Frigyes, “Láncszemek,” in Minden másképpen van (Budapest: Atheneum Irodai es Nyomdai R.T. Kiadása, 1929), 85–90. English translation is available in [27].
[27] M. Newman, A.-L. Barabási, and D. J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006.
[28] J. Guare. Six degrees of separation. Dramatist Play Service, 1992.
[29] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393:409–10, 1998.
[30] T. S. Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, 1962.
[31] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
[32] A.-L. Barabási, R. Albert, and H. Jeong. Mean-field theory for scale-free random networks. Physica A, 272:173–187, 1999.
[33] M. Newman. Networks: An Introduction. Oxford University Press, 2010.
[34] K. Christensen, R. Donangelo, B. Koiller, and K. Sneppen. Evolution of Random Networks. Physical Review Letters, 81:2380–2383, 1998.
[35] H. E. Stanley. Introduction to Phase Transitions and Critical Phenomena. Oxford University Press, 1987.
[36] D. Fernholz and V. Ramachandran. The diameter of sparse random graphs. Random Structures and Algorithms, 31:482–516, 2007.
4 ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE THE SCALE-FREE PROPERTY
ACKNOWLEDGEMENTS
MÁRTON PÓSFAI GABRIELE MUSELLA MAURO MARTINO ROBERTA SINATRA
SARAH MORRISON AMAL HUSSEINI PHILIPP HOEVEL
INDEX
Introduction 1
Power Laws and Scale-Free Networks 2
Hubs 3
The Meaning of Scale-Free 4
Universality 5
Ultra-Small Property 6
The Role of the Degree Exponent 7
Generating Networks with Arbitrary Degree Distribution 8
Summary 9
Homework 10
ADVANCED TOPICS 4.A Power Laws 11
ADVANCED TOPICS 4.B Plotting Power-laws 12
ADVANCED TOPICS 4.C Estimating the Degree Exponent 13
Bibliography 14

Figure 4.0 (cover image)
“Art and Networks” by Tomás Saraceno
Tomás Saraceno creates art inspired by spider webs and neural networks. Trained as an architect, he deploys insights from engineering, physics, chemistry, aeronautics, and materials science, using networks as a source of inspiration and metaphor. The image shows his work displayed in the Miami Art Museum, an example of the artist's take on complex networks.

This book is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V53 09.09.2014
SECTION 4.1
INTRODUCTION
The World Wide Web is a network whose nodes are documents and the links are the uniform resource locators (URLs) that allow us to “surf” with a click from one web document to the other. With an estimated size of over one trillion documents (N ≈ 10^12), the Web is the largest network humanity has ever built. It exceeds in size even the human brain (N ≈ 10^11 neurons).
It is difficult to overstate the importance of the World Wide Web in our daily life. Similarly, we cannot exaggerate the role the WWW played in the development of network theory: it facilitated the discovery of a number of fundamental network characteristics and became a standard testbed for most network measures.
We can use software called a crawler to map out the Web's wiring diagram. A crawler can start from any web document, identifying the links (URLs) on it. Next it downloads the documents these links point to and identifies the links on these documents, and so on. This process iteratively returns a local map of the Web. Search engines like Google or Bing operate crawlers to find and index new documents and to maintain a detailed map of the WWW.

Online Resource 4.1 Zooming into the World Wide Web
Watch an online video that zooms into the WWW sample that has led to the discovery of the scale-free property [1]. This is the network featured in Table 2.1 and shown in Figure 4.1, whose characteristics are tested throughout this book.

The first map of the WWW obtained with the explicit goal of understanding the structure of the network behind it was generated by Hawoong Jeong at University of Notre Dame. He mapped out the nd.edu domain [1], consisting of about 300,000 documents and 1.5 million links (Online Resource 4.1). The purpose of the map was to compare the properties of the Web graph to the random network model. Indeed, in 1998 there were reasons to believe that the WWW could be well approximated by a random network. The content of each document reflects the personal and professional interests of its creator, from individuals to organizations. Given the diversity of these interests, the links on these documents might appear to point to randomly chosen documents.

A quick look at the map in Figure 4.1 supports this view: There appears to be considerable randomness behind the Web's wiring diagram. Yet, a closer inspection reveals some puzzling differences between this map and a random network. Indeed, in a random network highly connected nodes, or hubs, are effectively forbidden. In contrast, in Figure 4.1 numerous small-degree nodes coexist with a few hubs, nodes with an exceptionally large number of links.

In this chapter we show that hubs are not unique to the Web, but we encounter them in most real networks. They represent a signature of a deeper organizing principle that we call the scale-free property. We therefore explore the degree distribution of real networks, which allows us to uncover and characterize scale-free networks. The analytical and empirical results discussed here represent the foundations of the modeling efforts the rest of this book is based on. Indeed, we will come to see that no matter what network property we are interested in, from communities to spreading processes, it must be inspected in the light of the network's degree distribution.
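The crawling procedure described in this section is essentially a breadth-first traversal of the link structure. The sketch below illustrates the idea on a hypothetical in-memory toy web (the `TOY_WEB` dictionary and the `crawl` function are our own illustrative names); a real crawler would download and parse documents instead of looking them up in a dictionary.

```python
from collections import deque

# Hypothetical toy "web": each document maps to the documents it links to.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": [],
    "e.html": ["a.html"],  # never discovered from a.html: nothing links to it
}

def crawl(start, fetch_links):
    """Breadth-first crawl: follow links from `start`, recording every link seen."""
    visited = {start}
    queue = deque([start])
    edges = []
    while queue:
        doc = queue.popleft()
        for target in fetch_links(doc):
            edges.append((doc, target))   # one directed link of the map
            if target not in visited:     # schedule each document only once
                visited.add(target)
                queue.append(target)
    return visited, edges

docs, links = crawl("a.html", lambda d: TOY_WEB.get(d, []))
print(sorted(docs))
print(links)
```

Because the crawl only follows outgoing links, documents that nothing links to (here `e.html`) stay invisible, which is one reason a crawl returns a local map rather than the full Web.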
Figure 4.1 The Topology of the World Wide Web
Snapshots of the World Wide Web sample mapped out by Hawoong Jeong in 1998 [1]. The sequence of images shows an increasingly magnified local region of the network. The first panel displays all 325,729 nodes, offering a global view of the full dataset. Nodes with more than 50 links are shown in red and nodes with more than 500 links in purple. The close-ups reveal the presence of a few highly connected nodes, called hubs, that accompany scale-free networks. Courtesy of M. Martino.
SECTION 4.2
POWER LAWS AND SCALE-FREE NETWORKS
If the WWW were a random network, the degrees of the Web documents should follow a Poisson distribution. Yet, as Figure 4.2 indicates, the Poisson form offers a poor fit for the WWW's degree distribution. Instead, on a log-log scale the data points form an approximate straight line, suggesting that the degree distribution of the WWW is well approximated with

$$p_k \sim k^{-\gamma}. \tag{4.1}$$

Equation (4.1) is called a power law distribution and the exponent γ is its degree exponent (BOX 4.1). If we take a logarithm of (4.1), we obtain

$$\log p_k \sim -\gamma \log k. \tag{4.2}$$

If (4.1) holds, log pk is expected to depend linearly on log k, the slope of this line being −γ, the negative of the degree exponent (Figure 4.2).
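The linear relation (4.2) can be illustrated with a few lines of code. The sketch below builds an exact, noise-free power law with γ = 2.1 and recovers the slope of log pk versus log k by ordinary least squares. The helper `loglog_slope` is our own illustrative function; fitting real, noisy degree data requires more careful procedures than this straight-line fit.

```python
import math

def loglog_slope(ks, ps):
    """Least-squares slope of log(p) versus log(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

gamma = 2.1                      # in-degree exponent quoted for the WWW
ks = list(range(1, 1001))
ps = [k ** -gamma for k in ks]   # unnormalized power law, as in (4.1)
slope = loglog_slope(ks, ps)
print(slope)                     # close to -gamma
```

The normalization constant only shifts the line vertically on a log-log plot, which is why the unnormalized form suffices for reading off the slope.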
Figure 4.2 The Degree Distribution of the WWW
The incoming (a) and outgoing (b) degree distribution of the WWW sample mapped in the 1999 study of Albert et al. [1]. The degree distribution is shown on double logarithmic axes (log-log plot), in which a power law follows a straight line. The symbols correspond to the empirical data and the line corresponds to the power-law fit, with degree exponents γin = 2.1 and γout = 2.45. We also show as a green line the degree distribution predicted by a Poisson function with the average degree ⟨kin⟩ = ⟨kout⟩ = 4.60 of the WWW sample.
[Plots: (a) the in-degree distribution versus kin and (b) the out-degree distribution versus kout, both on log-log axes, with the slopes γin and γout indicated.]
The WWW is a directed network, hence each document is characterized by an out-degree kout, representing the number of links that point from the document to other documents, and an in-degree kin, representing the number of other documents that point to the selected document. We must therefore distinguish two degree distributions: the probability that a randomly chosen document points to kout web documents, or $p_{k_{out}}$, and the probability that a randomly chosen node has kin web documents pointing to it, or $p_{k_{in}}$. In the case of the WWW both $p_{k_{in}}$ and $p_{k_{out}}$ can be approximated by a power law

$$p_{k_{in}} \sim k^{-\gamma_{in}}, \tag{4.3}$$

$$p_{k_{out}} \sim k^{-\gamma_{out}}, \tag{4.4}$$

where γin and γout are the degree exponents for the in- and out-degrees, respectively (Figure 4.2). In general γin can differ from γout. For example, for the WWW sample of Figure 4.1 we have γin ≈ 2.1 and γout ≈ 2.45.
The empirical results shown in Figure 4.2 document the existence of a network whose degree distribution is quite different from the Poisson distribution characterizing random networks. We will call such networks scale-free, defined as [2]:

A scale-free network is a network whose degree distribution follows a power law.

As Figure 4.2 indicates, for the WWW the power law persists for almost four orders of magnitude, prompting us to call the Web graph a scale-free network. In this case the scale-free property applies to both in- and out-degrees. To better understand the scale-free property, we have to define the power-law distribution in more precise terms. Therefore next we discuss the discrete and the continuum formalisms used throughout this book.

Discrete Formalism
As node degrees are non-negative integers, k = 0, 1, 2, ..., the discrete formalism provides the probability pk that a node has exactly k links

$$p_k = C k^{-\gamma}. \tag{4.5}$$

The constant C is determined by the normalization condition

$$\sum_{k=1}^{\infty} p_k = 1. \tag{4.6}$$

Using (4.5) we obtain

$$C \sum_{k=1}^{\infty} k^{-\gamma} = 1,$$

hence

$$C = \frac{1}{\sum_{k=1}^{\infty} k^{-\gamma}} = \frac{1}{\zeta(\gamma)}, \tag{4.7}$$

where ζ(γ) is the Riemann-zeta function. Thus for k > 0 the discrete power-law distribution has the form

$$p_k = \frac{k^{-\gamma}}{\zeta(\gamma)}. \tag{4.8}$$
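To make the normalization (4.7) concrete, the sketch below approximates ζ(γ) with the Python standard library only, by summing the first terms directly and estimating the remainder with an integral. The `zeta` helper and the number of summed terms are our own choices for illustration, not a library routine.

```python
import math

def zeta(gamma, terms=100_000):
    """Approximate the Riemann zeta function for gamma > 1.

    Direct sum of the first `terms` terms plus an integral estimate of the tail.
    """
    if gamma <= 1:
        raise ValueError("the sum converges only for gamma > 1")
    head = sum(k ** -gamma for k in range(1, terms + 1))
    tail = (terms + 0.5) ** (1 - gamma) / (gamma - 1)
    return head + tail

# Sanity check against the known value zeta(2) = pi^2 / 6.
print(zeta(2.0), math.pi ** 2 / 6)

gamma = 2.1                 # in-degree exponent quoted for the WWW
C = 1 / zeta(gamma)         # normalization constant of (4.7)
p = {k: C * k ** -gamma for k in range(1, 6)}   # p_k from (4.8) for k = 1..5
print(C, p)
```

With γ = 2.1 most of the probability sits at k = 1, which already hints at the many small-degree nodes of a scale-free network.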
Note that (4.8) diverges at k = 0. If needed, we can separately specify p0, representing the fraction of nodes that have no links to other nodes. In that case the calculation of C in (4.7) needs to incorporate p0.

Continuum Formalism
In analytical calculations it is often convenient to assume that the degrees can have any positive real value. In this case we write the power-law degree distribution as

$$p(k) = C k^{-\gamma}. \tag{4.9}$$

Using the normalization condition

$$\int_{k_{\min}}^{\infty} p(k)\,dk = 1 \tag{4.10}$$

we obtain

$$C = \frac{1}{\int_{k_{\min}}^{\infty} k^{-\gamma}\,dk} = (\gamma - 1)\,k_{\min}^{\gamma - 1}. \tag{4.11}$$

Therefore in the continuum formalism the degree distribution has the form

$$p(k) = (\gamma - 1)\,k_{\min}^{\gamma - 1}\,k^{-\gamma}. \tag{4.12}$$

Here kmin is the smallest degree for which the power law (4.8) holds.

Note that pk encountered in the discrete formalism has a precise meaning: it is the probability that a randomly selected node has degree k. In contrast, only the integral of p(k) encountered in the continuum formalism has a physical interpretation:

$$\int_{k_1}^{k_2} p(k)\,dk \tag{4.13}$$

is the probability that a randomly chosen node has degree between k1 and k2.
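The continuum expressions can be verified numerically. The sketch below evaluates the integral (4.13) in closed form, using p(k) from (4.12), and compares it with a simple midpoint-rule integration. The function names, the choice kmin = 1 and the interval [1, 10] are assumptions made only for this illustration.

```python
import math

def p_continuum(k, gamma, k_min):
    """Continuum power-law density (4.12)."""
    return (gamma - 1) * k_min ** (gamma - 1) * k ** -gamma

def prob_between(k1, k2, gamma, k_min):
    """Closed form of the integral (4.13) for k_min <= k1 <= k2."""
    return k_min ** (gamma - 1) * (k1 ** (1 - gamma) - k2 ** (1 - gamma))

def prob_between_numeric(k1, k2, gamma, k_min, steps=100_000):
    """Midpoint-rule approximation of the same integral."""
    h = (k2 - k1) / steps
    return h * sum(p_continuum(k1 + (i + 0.5) * h, gamma, k_min)
                   for i in range(steps))

gamma, k_min = 2.1, 1.0
exact = prob_between(1.0, 10.0, gamma, k_min)
approx = prob_between_numeric(1.0, 10.0, gamma, k_min)
total = prob_between(k_min, math.inf, gamma, k_min)   # normalization (4.10)
print(exact, approx, total)
```

The last value confirms that the constant C of (4.11) normalizes p(k) over the whole range from kmin to infinity.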
In summary, networks whose degree distribution follows a power law are called scale-free networks. If a network is directed, the scale-free property applies separately to the in- and the out-degrees. To mathematically study the properties of scale-free networks, we can use either the discrete or the continuum formalism. The scale-free property is independent of the formalism we use.
BOX 4.1 THE 80/20 RULE AND THE TOP ONE PERCENT
Vilfredo Pareto, a 19th century economist, noticed that in Italy a few wealthy individuals earned most of the money, while the majority of the population earned rather small amounts. He connected this disparity to the observation that incomes follow a power law, representing the first known report of a power-law distribution [3]. His finding entered the popular literature as the 80/20 rule: Roughly 80 percent of money is earned by only 20 percent of the population.

The 80/20 rule emerges in many areas. For example in management it is often stated that 80 percent of profits are produced by only 20 percent of the employees. Similarly, 80 percent of decisions are made during 20 percent of meeting time. The 80/20 rule is present in networks as well: 80 percent of links on the Web point to only 15 percent of webpages; 80 percent of citations go to only 38 percent of scientists; 80 percent of links in Hollywood are connected to 30 percent of actors [4]. Most quantities following a power law distribution obey the 80/20 rule.

During the 2009 economic crisis power laws gained a new meaning: The Occupy Wall Street Movement drew attention to the fact that in the US 1% of the population earns a disproportionate 15% of the total US income. This 1% phenomenon, a signature of a profound income disparity, is again a consequence of the power-law nature of the income distribution.

Figure 4.3 Vilfredo Federico Damaso Pareto (1848 – 1923)
Italian economist, political scientist, and philosopher, who made important contributions to our understanding of income distribution and to the analysis of individual choices. A number of fundamental principles are named after him, like Pareto efficiency, Pareto distribution (another name for a power-law distribution), the Pareto principle (or 80/20 law).
SECTION 4.3
HUBS
The main difference between a random and a scale-free network comes in the tail of the degree distribution, representing the high-k region of pk. To illustrate this, in Figure 4.4 we compare a power law with a Poisson function. We find that:
• For small k the power law is above the Poisson function, indicating that a scale-free network has a large number of small degree nodes, most of which are absent in a random network.
• For k in the vicinity of ⟨k⟩ the Poisson distribution is above the power law, indicating that in a random network there is an excess of nodes with degree k ≈ ⟨k⟩.
• For large k the power law is again above the Poisson curve. The difference is particularly visible if we show pk on a log-log plot (Figure 4.4b), indicating that the probability of observing a high-degree node, or hub, is several orders of magnitude higher in a scale-free than in a random network.

Let us use the WWW to illustrate the magnitude of these differences. The probability to have a node with k = 100 is about p100 ≈ 10^−94 in a Poisson distribution while it is about p100 ≈ 4 × 10^−4 if pk follows a power law. Consequently, if the WWW were a random network with ⟨k⟩ = 4.6 and size N ≈ 10^12, we would expect

$$N_{k \ge 100} = 10^{12} \sum_{k=100}^{\infty} \frac{(4.6)^k}{k!}\, e^{-4.6} \simeq 10^{-82} \tag{4.14}$$

nodes with at least 100 links, or effectively none. In contrast, given the WWW's power law degree distribution, with γin = 2.1 we have N_{k≥100} = 4 × 10^9, i.e. about four billion nodes with degree k ≥ 100.
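The two estimates for the number of nodes with at least 100 links can be reproduced with the standard library. The sketch below sums the Poisson tail of (4.14) in log space to avoid overflow, and evaluates the tail of the discrete power law (4.8) with an integral estimate for the remainder of the sum. The function names and truncation limits are our own illustrative choices.

```python
import math

N = 1e12        # estimated size of the WWW
K_AVG = 4.6     # average degree of the WWW sample
GAMMA = 2.1     # in-degree exponent
K_CUT = 100     # degree threshold

def poisson_tail(lam, k_from, k_to=400):
    """P(k >= k_from) for a Poisson distribution, each term computed in log space."""
    return sum(math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
               for k in range(k_from, k_to))

def power_law_tail(gamma, k_from, terms=100_000):
    """P(k >= k_from) for the discrete power law (4.8)."""
    remainder = (terms + 0.5) ** (1 - gamma) / (gamma - 1)  # sum beyond `terms`
    tail = sum(k ** -gamma for k in range(k_from, terms + 1)) + remainder
    head = sum(k ** -gamma for k in range(1, k_from))
    return tail / (head + tail)       # head + tail approximates zeta(gamma)

n_random = N * poisson_tail(K_AVG, K_CUT)
n_scale_free = N * power_law_tail(GAMMA, K_CUT)
print(n_random)       # of order 1e-82
print(n_scale_free)   # a few billion
```

The contrast between the two printed numbers spans roughly ninety orders of magnitude, which is the quantitative content of the statement that hubs are effectively forbidden in a random network.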
Figure 4.4 Poisson vs. Power-law Distributions
(a) Comparing a Poisson function with a power-law function (γ = 2.1) on a linear plot. Both distributions have ⟨k⟩ = 11.
(b) The same curves as in (a), but shown on a log-log plot, allowing us to inspect the difference between the two functions in the high-k regime.
(c) A random network with ⟨k⟩ = 3 and N = 50, illustrating that most nodes have comparable degree k ≈ ⟨k⟩.
(d) A scale-free network with γ = 2.1 and ⟨k⟩ = 3, illustrating that numerous small-degree nodes coexist with a few highly connected hubs. The size of each node is proportional to its degree.
[Plots: pk versus k for the POISSON curve and the power law pk ~ k^−2.1, on linear axes in (a) and on log-log axes in (b).]
The Largest Hub
All real networks are finite. The size of the WWW is estimated to be N ≈ 10^12 nodes; the size of the social network is the Earth's population, about N ≈ 7 × 10^9. These numbers are huge, but finite. Other networks pale in comparison: The genetic network in a human cell has approximately 20,000 genes while the metabolic network of the E. coli bacterium has only about a thousand metabolites. This prompts us to ask: How does the network size affect the size of its hubs? To answer this we calculate the maximum degree, kmax, called the natural cutoff of the degree distribution pk. It represents the expected size of the largest hub in a network.

It is instructive to perform the calculation first for the exponential distribution

$$p(k) = C e^{-\lambda k}.$$

For a network with minimum degree kmin the normalization condition

$$\int_{k_{\min}}^{\infty} p(k)\,dk = 1 \tag{4.15}$$

provides $C = \lambda e^{\lambda k_{\min}}$. To calculate kmax we assume that in a network of N nodes we expect at most one node in the (kmax, ∞) regime (ADVANCED TOPICS 3.B). In other words the probability to observe a node whose degree exceeds kmax is 1/N:

$$\int_{k_{\max}}^{\infty} p(k)\,dk = \frac{1}{N}. \tag{4.16}$$
Equation (4.16) yields

$$k_{\max} = k_{\min} + \frac{\ln N}{\lambda}. \tag{4.17}$$

As ln N is a slow function of the system size, (4.17) tells us that the maximum degree will not be significantly different from kmin. For a Poisson degree distribution the calculation is a bit more involved, but the obtained dependence of kmax on N is even slower than the logarithmic dependence predicted by (4.17) (ADVANCED TOPICS 3.B).

For a scale-free network, according to (4.12) and (4.16), the natural cutoff follows

$$k_{\max} = k_{\min}\, N^{\frac{1}{\gamma - 1}}. \tag{4.18}$$

Hence the larger a network, the larger is the degree of its biggest hub. The polynomial dependence of kmax on N implies that in a large scale-free network there can be orders of magnitude differences in size between the smallest node, kmin, and the biggest hub, kmax (Figure 4.5).

Figure 4.5 Hubs are Large in Scale-free Networks
The estimated degree of the largest node (natural cutoff) in scale-free and random networks with the same average degree ⟨k⟩ = 3. For the scale-free network we chose γ = 2.5. For comparison, we also show the linear behavior, kmax ∼ N − 1, expected for a complete network. Overall, hubs in a scale-free network are several orders of magnitude larger than the biggest node in a random network with the same N and ⟨k⟩.
[Plot: kmax versus N on log-log axes, with curves labeled SCALE-FREE (kmax ~ N^(1/(γ−1))), RANDOM NETWORK (kmax ~ ln N) and (N − 1).]

To illustrate the difference in the maximum degree of an exponential and a scale-free network let us return to the WWW sample of Figure 4.1, consisting of N ≈ 3 × 10^5 nodes. As kmin = 1, if the degree distribution were to follow an exponential, (4.17) predicts that the maximum degree should be kmax ≈ 14 for λ = 1. In a scale-free network of similar size and γ = 2.1, (4.18) predicts kmax ≈ 95,000, a remarkable difference. Note that the largest in-degree of the WWW map of Figure 4.1 is 10,721, which is comparable to kmax predicted by a scale-free network. This reinforces our conclusion that in a random network hubs are effectively forbidden, while in scale-free networks they are naturally present.
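The two natural cutoffs quoted for the WWW sample follow directly from (4.17) and (4.18). The sketch below evaluates both formulas for N = 3 × 10^5, kmin = 1, λ = 1 and γ = 2.1, the values used in the text; only the function names are our own.

```python
import math

def kmax_exponential(n, k_min, lam):
    """Natural cutoff (4.17) for an exponential degree distribution."""
    return k_min + math.log(n) / lam

def kmax_scale_free(n, k_min, gamma):
    """Natural cutoff (4.18) for a power-law degree distribution."""
    return k_min * n ** (1 / (gamma - 1))

n = 3e5   # approximate size of the WWW sample of Figure 4.1
k_exp = kmax_exponential(n, k_min=1, lam=1.0)
k_sf = kmax_scale_free(n, k_min=1, gamma=2.1)
print(round(k_exp, 1))   # roughly 14
print(round(k_sf))       # roughly 95,000
```

Increasing `n` by several orders of magnitude changes `k_exp` only by a handful of links, while `k_sf` grows almost linearly with `n` for γ close to 2.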
In summary, the key difference between a random and a scale-free network is rooted in the different shape of the Poisson and of the power-law function: In a random network most nodes have comparable degrees and hence hubs are forbidden. Hubs are not only tolerated, but are expected in scale-free networks (Figure 4.6). Furthermore, the more nodes a scale-free network has, the larger are its hubs. Indeed, the size of the hubs grows polynomially with network size, hence they can grow quite large in scale-free networks. In contrast, in a random network the size of the largest node grows logarithmically or slower with N, implying that hubs will be tiny even in a very large random network.
Figure 4.6 Random vs. Scale-free Networks
(a) The degrees of a random network follow a Poisson distribution, rather similar to a bell curve. Therefore most nodes have comparable degrees and nodes with a large number of links are absent.
(b) A random network looks a bit like the national highway network in which nodes are cities and links are the major highways. There are no cities with hundreds of highways and no city is disconnected from the highway system.
(c) In a network with a power-law degree distribution most nodes have only a few links. These numerous small nodes are held together by a few highly connected hubs.
(d) A scale-free network looks like the air-traffic network, whose nodes are airports and links are the direct flights between them. Most airports are tiny, with only a few flights. Yet, we have a few very large airports, like Chicago or Los Angeles, that act as major hubs, connecting many smaller airports. Once hubs are present, they change the way we navigate the network. For example, if we travel from Boston to Los Angeles by car, we must drive through many cities. On the airplane network, however, we can reach most destinations via a single hub, like Chicago. After [4].
[Panels: (a) POISSON and (c) POWER LAW show the number of nodes with k links versus the number of links (k), annotated "Most nodes have the same number of links", "No highly connected nodes", "Many nodes with only a few links" and "A few hubs with large number of links"; the maps in (b) and (d) mark Chicago, Boston and Los Angeles.]
SECTION 4.4
THE MEANING OF SCALE-FREE
The term “scale-free” is rooted in a branch of statistical physics called the theory of phase transitions that extensively explored power laws in the 1960s and 1970s (ADVANCED TOPICS 3.F). To best understand the meaning of the scale-free term, we need to familiarize ourselves with the moments of the degree distribution. The nth moment of the degree distribution is defined as

$$\langle k^n \rangle = \sum_{k_{\min}}^{\infty} k^n p_k \approx \int_{k_{\min}}^{\infty} k^n p(k)\,dk. \tag{4.19}$$

The lower moments have an important interpretation:
• n=1: The first moment is the average degree, ⟨k⟩.
• n=2: The second moment, ⟨k²⟩, helps us calculate the variance σ² = ⟨k²⟩ − ⟨k⟩², measuring the spread in the degrees. Its square root, σ, is the standard deviation.
• n=3: The third moment, ⟨k³⟩, determines the skewness of a distribution, telling us how symmetric pk is around the average ⟨k⟩.

For a scale-free network the nth moment of the degree distribution is

$$\langle k^n \rangle = \int_{k_{\min}}^{k_{\max}} k^n p(k)\,dk = C\,\frac{k_{\max}^{\,n-\gamma+1} - k_{\min}^{\,n-\gamma+1}}{n-\gamma+1}. \tag{4.20}$$

While typically kmin is fixed, the degree of the largest hub, kmax, increases with the system size, following (4.18). Hence to understand the behavior of ⟨k^n⟩ we need to take the asymptotic limit kmax → ∞ in (4.20), probing the properties of very large networks. In this limit (4.20) predicts that the value of ⟨k^n⟩ depends on the interplay between n and γ:
• If n − γ + 1 ≤ 0 then the first term on the r.h.s. of (4.20), $k_{\max}^{\,n-\gamma+1}$, goes to zero as kmax increases. Therefore all moments that satisfy n ≤ γ − 1 are finite.
• If n − γ + 1 > 0 then ⟨k^n⟩ goes to infinity as kmax → ∞. Therefore all moments larger than γ − 1 diverge.
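The interplay between n and γ in (4.20) becomes tangible when the moments are evaluated for growing kmax. The sketch below uses the constant C of (4.11), which normalizes p(k) on [kmin, ∞) and is therefore only an approximation for finite kmax, and takes γ = 2.1 and kmin = 1 as illustrative assumptions. The marginal case n − γ + 1 = 0, where the integral of (4.20) gives a logarithm, is handled separately.

```python
import math

def moment(n, gamma, k_min, k_max):
    """n-th moment of the degree distribution from (4.20), with C from (4.11)."""
    c = (gamma - 1) * k_min ** (gamma - 1)
    a = n - gamma + 1
    if a == 0:
        # Marginal case: the integral of k^(-1) is a logarithm.
        return c * math.log(k_max / k_min)
    return c * (k_max ** a - k_min ** a) / a

gamma, k_min = 2.1, 1.0
for k_max in (1e2, 1e4, 1e6, 1e8, 1e10, 1e12):
    first = moment(1, gamma, k_min, k_max)    # saturates as k_max grows
    second = moment(2, gamma, k_min, k_max)   # keeps growing with k_max
    print(f"{k_max:.0e}  <k> = {first:.3f}  <k^2> = {second:.3e}")
```

For these parameters the first moment creeps toward a finite limit, while the second moment grows without bound, as expected for 2 < γ < 3.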
For many scale-free networks the degree exponent γ is between 2 and 3 (Table 4.1). Hence for these networks, in the N → ∞ limit the first moment ⟨k⟩ is finite, but the second and higher moments, ⟨k²⟩, ⟨k³⟩, go to infinity. This divergence helps us understand the origin of the “scale-free” term. Indeed, if the degrees follow a normal distribution, then the degree of a randomly chosen node is typically in the range

$$k = \langle k \rangle \pm \sigma_k. \tag{4.21}$$

Yet, the average degree ⟨k⟩ and the standard deviation σk have rather different magnitudes in random and in scale-free networks:
• Random Networks Have a Scale
For a random network with a Poisson degree distribution σk = ⟨k⟩^1/2, which is always smaller than ⟨k⟩. Hence the network's nodes have degrees in the range k = ⟨k⟩ ± ⟨k⟩^1/2. In other words nodes in a random network have comparable degrees and the average degree ⟨k⟩ serves as the “scale” of a random network.
• Scale-free Networks Lack a Scale
For a network with a power-law degree distribution with γ < 3 the first moment is finite but the second moment is infinite. The divergence of ⟨k²⟩ (and of σk) for large N indicates that the fluctuations around the average can be arbitrarily large. This means that when we randomly choose a node, we do not know what to expect: The selected node's degree could be tiny or arbitrarily large. Hence networks with γ < 3 do not have a meaningful internal scale, but are “scale-free” (Figure 4.7).

For example the average degree of the WWW sample is ⟨k⟩ = 4.60 (Table 4.1). Given that γ ≈ 2.1, the second moment diverges, which means that our expectation for the in-degree of a randomly chosen WWW document is k = 4.60 ± ∞ in the N → ∞ limit. That is, a randomly chosen web document could easily yield a document of degree one or two, as 74.02% of nodes have in-degree less than ⟨k⟩. Yet, it could also yield a node with hundreds of millions of links, like google.com or facebook.com.

Figure 4.7 Lack of an Internal Scale
For any exponentially bounded distribution, like a Poisson or a Gaussian, the degree of a randomly chosen node is in the vicinity of ⟨k⟩. Hence ⟨k⟩ serves as the network's scale. For a power law distribution the second moment can diverge, and the degree of a randomly chosen node can be significantly different from ⟨k⟩. Hence ⟨k⟩ does not serve as an intrinsic scale. As a network with a power law degree distribution lacks an intrinsic scale, we call it scale-free.
[Plot: pk versus k with ⟨k⟩ marked. Random Network: randomly chosen node k = ⟨k⟩ ± ⟨k⟩^1/2, scale ⟨k⟩. Scale-Free Network: randomly chosen node k = ⟨k⟩ ± ∞, scale: none.]

Strictly speaking ⟨k²⟩ diverges only in the N → ∞ limit. Yet, the divergence is relevant for finite networks as well. To illustrate this, Table 4.1 lists ⟨k²⟩ and Figure 4.8 shows the standard deviation σ = (⟨k²⟩ − ⟨k⟩²)^1/2 for ten real networks. For most of these networks σ is significantly larger than ⟨k⟩, documenting large variations in node degrees. For example, the degree of a randomly chosen node in the WWW sample is kin = 4.60 ± 39.05, where the standard deviation follows from ⟨kin²⟩ = 1546 in Table 4.1, indicating once again that the average is not informative.

In summary, the scale-free name captures the lack of an internal scale, a consequence of the fact that nodes with widely different degrees coexist in the same network. This feature distinguishes scale-free networks from lattices, in which all nodes have exactly the same degree (σ = 0), or from random networks, whose degrees vary in a narrow range (σ = ⟨k⟩^1/2). As we will see in the coming chapters, this divergence is the origin of some of the most intriguing properties of scale-free networks, from their robustness to random failures to the anomalous spread of viruses.
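| NETWORK | N | L | ⟨k⟩ | ⟨kin²⟩ | ⟨kout²⟩ | ⟨k²⟩ | γin | γout | γ |
|---|---|---|---|---|---|---|---|---|---|
| Internet | 192,244 | 609,066 | 6.34 | - | - | 240.1 | - | - | 3.42* |
| WWW | 325,729 | 1,497,134 | 4.60 | 1546.0 | 482.4 | - | 2.00 | 2.31 | - |
| Power Grid | 4,941 | 6,594 | 2.67 | - | - | 10.3 | - | - | Exp. |
| Mobile Phone Calls | 36,595 | 91,826 | 2.51 | 12.0 | 11.7 | - | 4.69* | 5.01* | - |
| Email | 57,194 | 103,731 | 1.81 | 94.7 | 1163.9 | - | 3.43* | 2.03* | - |
| Science Collaboration | 23,133 | 93,439 | 8.08 | - | - | 178.2 | - | - | 3.35* |
| Actor Network | 702,388 | 29,397,908 | 83.71 | - | - | 47,353.7 | - | - | 2.12* |
| Citation Network | 449,673 | 4,689,479 | 10.43 | 971.5 | 198.8 | - | 3.03** | 4.00* | - |
| E. Coli Metabolism | 1,039 | 5,802 | 5.58 | 535.7 | 396.7 | - | 2.43* | 2.90* | - |
| Protein Interactions | 2,018 | 2,930 | 2.90 | - | - | 32.3 | - | - | 2.89* |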
Table 4.1 Degree Fluctuations in Real Networks
The table shows the first moment ⟨k⟩ and the second moment ⟨k²⟩ (⟨kin²⟩ and ⟨kout²⟩ for directed networks) for ten reference networks. For directed networks we list ⟨k⟩ = ⟨kin⟩ = ⟨kout⟩. We also list the estimated degree exponent, γ, for each network, determined using the procedure discussed in ADVANCED TOPICS 4.C. The stars next to the reported values indicate the confidence of the fit to the degree distribution. That is, * means that the fit shows statistical confidence for a power-law (k^−γ), while ** marks statistical confidence for a fit (4.39) with an exponential cutoff. Note that the power grid is not scale-free. For this network a degree distribution of the form e^−λk offers a statistically significant fit, which is why we placed an “Exp.” in the last column.
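The standard deviations plotted in Figure 4.8 can be recomputed from the moments listed in Table 4.1 via σ = (⟨k²⟩ − ⟨k⟩²)^1/2. The sketch below does this for a few rows of the table and compares each σ with the random-network expectation ⟨k⟩^1/2. The dictionary layout and the selection of rows are our own choices for illustration.

```python
import math

# (<k>, second moment) pairs copied from Table 4.1; for directed
# networks the in- and out-degree moments are listed separately.
TABLE_4_1 = {
    "Internet": (6.34, 240.1),
    "WWW (in)": (4.60, 1546.0),
    "WWW (out)": (4.60, 482.4),
    "Power Grid": (2.67, 10.3),
    "Email (out)": (1.81, 1163.9),
    "Protein Interactions": (2.90, 32.3),
}

def sigma(k_avg, k2_avg):
    """Standard deviation of the degrees from the first two moments."""
    return math.sqrt(k2_avg - k_avg ** 2)

sigmas = {name: sigma(k, k2) for name, (k, k2) in TABLE_4_1.items()}
for name, (k, _) in TABLE_4_1.items():
    random_sigma = math.sqrt(k)   # value expected for a random network
    print(f"{name:22s} sigma = {sigmas[name]:7.2f}  <k>^1/2 = {random_sigma:5.2f}")
```

For the WWW in-degrees this gives σ ≈ 39, far above the ⟨k⟩^1/2 ≈ 2.1 that a random network with the same average degree would show.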
Figure 4.8 Standard Deviation is Large in Real Networks
[Plot: σ (vertical axis) versus ⟨k⟩ (horizontal axis), with the curve ⟨k⟩^{1/2} and symbols for WWW (in, out), Email (in, out), Citations (in, out), Metabolic (in, out), Internet, Science Collaboration, Protein, Phone Calls (in, out) and Power Grid.]
For a random network the standard deviation follows σ = ⟨k⟩^{1/2}, shown as a green dashed line on the figure. The symbols show σ for nine of the ten reference networks, calculated using the values shown in Table 4.1. The actor network has a very large ⟨k⟩ and σ, hence it is omitted for clarity. For each network σ is larger than the value expected for a random network with the same ⟨k⟩. The only exception is the power grid, which is not scale-free. While the phone call network is scale-free, it has a large γ, hence it is well approximated by a random network.
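The comparison in Figure 4.8 can be reproduced from the moments in Table 4.1. The sketch below assumes the standard relation σ = (⟨k²⟩ − ⟨k⟩²)^{1/2} and uses only the three undirected rows reproduced above; the function names are illustrative, not taken from the text.

```python
import math

# (<k>, <k^2>) pairs copied from the undirected rows of Table 4.1.
MOMENTS = {
    "Science Collaboration": (8.08, 178.2),
    "Actor Network": (83.71, 47353.7),
    "Protein Interactions": (2.90, 32.3),
}

def degree_sigma(k_mean, k2_mean):
    """Standard deviation of the degrees from the first two moments."""
    return math.sqrt(k2_mean - k_mean ** 2)

def random_network_sigma(k_mean):
    """Poisson prediction sigma = <k>^(1/2) for a random network."""
    return math.sqrt(k_mean)

for name, (k_mean, k2_mean) in MOMENTS.items():
    print(f"{name}: sigma = {degree_sigma(k_mean, k2_mean):.2f}, "
          f"random network = {random_network_sigma(k_mean):.2f}")
```

For each of these rows the measured σ exceeds the random-network value, which is the point the figure makes.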
SECTION 4.5
UNIVERSALITY
While the terms WWW and Internet are often used interchangeably in the media, they refer to different systems. The WWW is an information network, whose nodes are documents and links are URLs. In contrast the Internet is an infrastructural network, whose nodes are computers called routers and whose links correspond to physical connections, like copper and optical cables or wireless links.

This difference has important consequences: The cost of linking a Boston-based web page to a document residing on the same computer or to one on a Budapest-based computer is the same. In contrast, establishing a direct Internet link between routers in Boston and Budapest would require us to lay a cable between North America and Europe, which is prohibitively expensive. Despite these differences, the degree distribution of both networks is well approximated by a power law [1, 5, 6]. The signatures of the Internet’s scale-free nature are visible in Figure 4.9, showing that a few high-degree routers hold together a large number of routers with only a few links.

Figure 4.9 The Topology of the Internet
An iconic representation of the Internet topology at the beginning of the 21st century. The image was produced by CAIDA, an organization based at University of California in San Diego, devoted to collecting, analyzing, and visualizing Internet data. The map illustrates the Internet’s scale-free nature: A few highly connected hubs hold together numerous small nodes.

In the past decade many real networks of major scientific, technological and societal importance were found to display the scale-free property. This is illustrated in Figure 4.10, where we show the degree distribution of an infrastructural network (Internet), a biological network (protein interactions), a communication network (emails) and a network characterizing scientific communications (citations). For each network the degree distribution significantly deviates from a Poisson distribution, being better approximated with a power law.

The diversity of the systems that share the scale-free property is remarkable (BOX 4.2). Indeed, the WWW is a man-made network with a history of little more than two decades, while the protein interaction network is the product of four billion years of evolution. In some of these networks the nodes are molecules, in others they are computers. It is this diversity that prompts us to call the scale-free property a universal network characteristic.

From the perspective of a researcher, a crucial question is the following: How do we know if a network is scale-free? A quick look at the degree distribution will immediately reveal whether the network could be scale-free: In scale-free networks the degrees of the smallest and the largest nodes are widely different, often spanning several orders of magnitude. In contrast, these nodes have comparable degrees in a random network. As the value of the degree exponent plays an important role in predicting various network properties, we need tools to fit the p_k distribution
and to estimate γ. This prompts us to address several issues pertaining to plotting and fitting power laws:

Plotting the Degree Distribution
The degree distributions shown in this chapter are plotted on a double logarithmic scale, often called a log-log plot. The main reason is that when we have nodes with widely different degrees, a linear plot is unable to display them all. To obtain the clean-looking degree distributions shown throughout this book we use logarithmic binning, ensuring that each datapoint has a sufficient number of observations behind it. The practical tips for plotting a network’s degree distribution are discussed in ADVANCED TOPICS 4.B.

Measuring the Degree Exponent
A quick estimate of the degree exponent can be obtained by fitting a straight line to p_k on a log-log plot. Yet, this approach can be affected by systematic biases, resulting in an incorrect γ. The statistical tools available to estimate γ are discussed in ADVANCED TOPICS 4.C.

The Shape of p_k for Real Networks
Many degree distributions observed in real networks deviate from a pure power law. These deviations can be attributed to data incompleteness or data collection biases, but can also carry important information about processes that contribute to the emergence of a particular network. In ADVANCED TOPICS 4.B we discuss some of these deviations and in CHAPTER 6 we explore their origins.

In summary, since the 1999 discovery of the scale-free nature of the WWW, a large number of real networks of scientific and technological interest have been found to be scale-free, from biological to social and linguistic networks (BOX 4.2). This does not mean that all networks are scale-free. Indeed, many important networks, from the power grid to networks observed in materials science, do not display the scale-free property (BOX 4.3).
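The logarithmic binning mentioned under Plotting the Degree Distribution can be sketched as follows. This is one common way to do it, written under stated assumptions rather than the procedure of ADVANCED TOPICS 4.B: the bin edges grow by a constant factor, `bins_per_decade` is a hypothetical parameter, and each bin's count is divided by the number of integer degrees it covers so that the result is a density comparable to p_k.

```python
import math
from collections import Counter

def log_binned_pk(degrees, bins_per_decade=5):
    """Return (k_center, p_k) points using logarithmically spaced bins.

    Each bin covers [edge_i, edge_{i+1}); the density is the fraction of
    nodes in the bin divided by the number of integer degrees it contains.
    """
    positive = [k for k in degrees if k > 0]  # k = 0 cannot appear on a log axis
    n = len(positive)
    k_max = max(positive)
    ratio = 10 ** (1.0 / bins_per_decade)
    edges = [1.0]
    while edges[-1] <= k_max:
        edges.append(edges[-1] * ratio)
    counts = Counter(positive)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ks = range(math.ceil(lo), math.ceil(hi))  # integer degrees in [lo, hi)
        if len(ks) == 0:
            continue
        in_bin = sum(counts[k] for k in ks)
        if in_bin == 0:
            continue
        center = math.sqrt(ks[0] * ks[-1])  # geometric mean of the covered degrees
        points.append((center, in_bin / (n * len(ks))))
    return points
```

Plotting the returned points on log-log axes gives one datapoint per bin, so the sparsely populated high-degree region is averaged over progressively wider bins.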
Figure 4.10 Many Real Networks are Scale-free
[Four log-log panels of p_k versus k (k_in and k_out for directed networks): (a) Internet, (b) Protein Interactions, (c) Emails, (d) Citations.]
The degree distribution of four networks listed in Table 4.1.
(a) Internet at the router level.
(b) Protein-protein interaction network.
(c) Email network.
(d) Citation network.
In each panel the green dotted line shows the Poisson distribution with the same ⟨k⟩ as the real network, illustrating that the random network model cannot account for the observed p_k. For directed networks we show separately the incoming and outgoing degree distributions.
BOX 4.2
TIMELINE: SCALE-FREE NETWORKS

1965: Derek de Solla Price (1922-1983) discovers that citations follow a power-law distribution [7], a finding later attributed to the scale-free nature of the citation network [2].

1999: Michalis, Petros, and Christos Faloutsos discover the scale-free nature of the Internet [15]. Réka Albert, Hawoong Jeong, and Albert-László Barabási discover the power-law nature of the WWW [1] and introduce scale-free networks [2, 10].

Barabási and Albert, 1999: “we expect that the scale-invariant state observed in all systems for which detailed data has been available to us is a generic property of many complex networks, with applicability reaching far beyond the quoted examples.”

Networks marked along the timeline by publication date: Citations [7, 8], WWW [1, 2, 9, 10], Actors [2], Internet [5], Metabolic [11, 12], Phone calls [13], Proteins [14, 15], Coauthor. [16, 17], Sexual contacts [18], Linguistics [19], Elect. circuits [20], Software [21], Email [22], Energy landscape [23], Mobile calls [24], Twitter [25, 26], Facebook [27].

# of papers on “scale-free networks” (Google Scholar), plotted for 1998 to 2013: 0 (1998), 4 (1999), 23 (2000), 54 (2001), 145 (2002), 304 (2003), 559 (2004), with later annual counts reaching 2,560.
BOX 4.3 NOT ALL NETWORKS ARE SCALE-FREE
The ubiquity of the scale-free property does not mean that all real networks are scale-free. To the contrary, several important networks do not share this property:
• Networks appearing in materials science, describing the bonds between the atoms in crystalline or amorphous materials. In these networks each node has exactly the same degree, determined by chemistry (Figure 4.11).
• The neural network of the C. elegans worm [28].
• The power grid, consisting of generators and switches connected by transmission lines.

For the scale-free property to emerge the nodes need to have the capacity to link to an arbitrary number of other nodes. These links do not need to be concurrent: We do not constantly chat with each of our acquaintances and a protein in the cell does not simultaneously bind to each of its potential interaction partners. The scale-free property is absent in systems that limit the number of links a node can have, effectively restricting the maximum size of the hubs. Such limitations are common in materials (Figure 4.11), explaining why they cannot develop a scale-free topology.

Figure 4.11 The Material Network
A carbon atom can share only four electrons with other atoms, hence no matter how we arrange these atoms relative to each other, in the resulting network a node can never have more than four links. Hence, hubs are forbidden and the scale-free property cannot emerge. The figure shows several carbon allotropes, i.e. materials made of carbon that differ in the structure of the network the carbon atoms arrange themselves in. This different arrangement results in materials with widely different physical and electronic characteristics, like (a) diamond; (b) graphite; (c) lonsdaleite; (d) C60 (buckminsterfullerene); (e) C540 (a fullerene); (f) C70 (another fullerene); (g) amorphous carbon; (h) single-walled carbon nanotube.
SECTION 4.6
ULTRA-SMALL WORLD PROPERTY
The presence of hubs in scale-free networks raises an interesting question: Do hubs affect the small world property? Figure 4.4 suggests that they do: Airlines build hubs precisely to decrease the number of hops between two airports. The calculations support this expectation, finding that distances in a scale-free network are smaller than the distances observed in an equivalent random network.

The dependence of the average distance ⟨d⟩ on the system size N and the degree exponent γ is captured by the formula [29, 30]

$$
\langle d \rangle \sim
\begin{cases}
\text{const.} & \gamma = 2 \\
\ln \ln N & 2 < \gamma < 3 \\
\dfrac{\ln N}{\ln \ln N} & \gamma = 3 \\
\ln N & \gamma > 3
\end{cases}
\tag{4.22}
$$
Next we discuss the behavior of ⟨d⟩ in the four regimes predicted by (4.22), as summarized in Figure 4.12:

Anomalous Regime (γ = 2)
According to (4.18) for γ = 2 the degree of the biggest hub grows linearly with the system size, i.e. k_max ∼ N. This forces the network into a hub and spoke configuration in which all nodes are close to each other because they all connect to the same central hub. In this regime the average path length does not depend on N.

Ultra-Small World (2 < γ < 3)
Equation (4.22) predicts that in this regime the average distance increases as ln ln N, a significantly slower growth than the ln N derived for random networks. We call networks in this regime ultra-small, as the hubs radically reduce the path length [29]. They do so by linking to a large number of small-degree nodes, creating short distances between them.
To see the implication of the ultra-small world property consider again the world’s social network with N ≈ 7 × 10^9. If the society is described by a random network, the N-dependent term is ln N = 22.66. In contrast for a scale-free network with 2 < γ < 3 the N-dependent term is ln ln N = 3.12, indicating that the hubs radically shrink the distance between the nodes.
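The N-dependent terms quoted above follow directly from (4.22). The sketch below evaluates only that term; the function name is illustrative, and the prefactors that (4.22) leaves unspecified are deliberately not modeled.

```python
import math

def distance_scaling(n_nodes, gamma):
    """N-dependent term of <d> in (4.22); prefactors are not included."""
    if gamma < 2:
        raise ValueError("(4.22) covers gamma >= 2 only")
    ln_n = math.log(n_nodes)
    if gamma == 2:
        return 1.0                     # constant, independent of N
    if gamma < 3:
        return math.log(ln_n)          # ultra-small world
    if gamma == 3:
        return ln_n / math.log(ln_n)   # critical point
    return ln_n                        # small world, as in random networks

N_SOCIETY = 7e9
print(f"ln N    = {distance_scaling(N_SOCIETY, 3.5):.2f}")
print(f"ln ln N = {distance_scaling(N_SOCIETY, 2.5):.2f}")
```

Because these are scaling terms rather than distances, only their ratio across regimes or sizes is meaningful.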
Figure 4.12 Distances in Scale-free Networks
[Panel (a): ⟨d⟩ versus N on a logarithmic N axis, with curves for ln N (γ > 3 and random), ln N/ln ln N (γ = 3), ln ln N (2 < γ < 3) and the constant (γ = 2); dotted lines mark Human PPI, Internet (2011), Society and WWW. Panels (b)-(d): p_d versus d for N = 10^2, 10^4 and 10^6; legend: γ = 2.1, γ = 3.0, γ = 5.0, RN.]
(a) The scaling of the average path length in the four scaling regimes characterizing a scale-free network: constant (γ = 2), ln ln N (2 < γ < 3), ln N/ln ln N (γ = 3), ln N (γ > 3 and random networks). The dotted lines mark the approximate size of several real networks. Given their modest size, in biological networks, like the human protein-protein interaction network (PPI), the differences in the node-to-node distances are relatively small in the four regimes. The differences in ⟨d⟩ are quite significant for networks of the size of the social network or the WWW. For these the small-world formula significantly overestimates the real ⟨d⟩.
(b)-(d) Distance distribution for networks of size N = 10^2, 10^4, 10^6, illustrating that while for small networks (N = 10^2) the distance distributions are not too sensitive to γ, for large networks (N = 10^6) p_d and ⟨d⟩ change visibly with γ.
The networks were generated using the static model [32] with ⟨k⟩ = 3.

Critical Point (γ = 3)
This value is of particular theoretical interest, as the second moment of the degree distribution does not diverge any longer. We therefore call γ = 3 the critical point. At this critical point the ln N dependence encountered for random networks returns. Yet, the calculations indicate the presence of a double logarithmic correction ln ln N [29, 31], which shrinks the distances compared to a random network of similar size.

Small World (γ > 3)
In this regime ⟨k²⟩ is finite and the average distance follows the small world result derived for random networks. While hubs continue to be present, for γ > 3 they are not sufficiently large and numerous to have a significant impact on the distance between the nodes.

Taken together, (4.22) indicates that the more pronounced the hubs are, the more effectively they shrink the distances between nodes. This conclusion is supported by Figure 4.12a, which shows the scaling of the average path length for scale-free networks with different γ. The figure indicates that while for small N the distances in the four regimes are comparable, for large N we observe remarkable differences.

Further support is provided by the path length distribution for scale-free networks with different γ and N (Figure 4.12b-d). For N = 10^2 the path length distributions overlap, indicating that at this size differences in γ result in undetectable differences in the path length. For N = 10^6, however, the p_d observed for different γ are well separated. Figure 4.12d also shows that the larger the degree exponent, the larger are the distances between the nodes.

In summary the scale-free property has several effects on network distances:
• Shrinks the average path lengths. Therefore most scale-free networks of practical interest are not only “small”, but are “ultra-small”. This is a consequence of the hubs, which act as bridges between many small-degree nodes.
• Changes the dependence of ⟨d⟩ on the system size, as predicted by (4.22). The smaller γ is, the shorter are the distances between the nodes.
• Only for γ > 3 do we recover the ln N dependence, the signature of the small-world property characterizing random networks (Figure 4.12).
BOX 4.4 WE ARE ALWAYS CLOSE TO THE HUBS
Frigyes Karinthy in his 1929 short story [33] that first described the small world concept cautions that “it’s always easier to find someone who knows a famous or popular figure than some run-the-mill, insignificant person”. In other words, we are typically closer to hubs than to less connected nodes. This effect is particularly pronounced in scale-free networks (Figure 4.13).

The implications are obvious: There are always short paths linking us to famous individuals like well known scientists or the president of the United States, as they are hubs with an exceptional number of acquaintances. It also means that many of the shortest paths go through these hubs.

In contrast to this expectation, measurements aiming to replicate the six degrees concept in the online world find that individuals involved in chains that reached their target were less likely to send a message to a hub than individuals involved in incomplete chains [34]. The reason may be self-imposed: We perceive hubs as being busy, so we contact them only in real need. We therefore avoid them in online experiments of no perceived value to them.

Figure 4.13 Closing on the hubs
[Plot: ⟨d_target⟩ (vertical axis) versus k_target (horizontal axis, 10 to 100) for a random network and a scale-free network.]
The distance ⟨d_target⟩ of a node with degree k ≈ ⟨k⟩ to a target node with degree k_target in a random and a scale-free network. In scale-free networks we are closer to the hubs than in random networks. The figure also illustrates that in a random network the largest-degree nodes are considerably smaller and hence the path lengths are visibly longer than in a scale-free network. Both networks have ⟨k⟩ = 2 and N = 1,000 and for the scale-free network we choose γ = 2.5.
SECTION 4.7
THE ROLE OF THE DEGREE EXPONENT
Many properties of a scale-free network depend on the value of the degree exponent γ. A close inspection of Table 4.1 indicates that:

• γ varies from system to system, prompting us to explore how the properties of a network change with γ.
• For most real systems the degree exponent is above 2, making us wonder: Why don’t we see networks with γ < 2?

To address these questions next we discuss how the properties of a scale-free network change with γ (BOX 4.5).

Anomalous Regime (γ ≤ 2)
For γ < 2 the exponent 1/(γ − 1) in (4.18) is larger than one, hence the number of links connected to the largest hub grows faster than the size of the network. This means that for sufficiently large N the degree of the largest hub must exceed the total number of nodes in the network, hence it will run out of nodes to connect to. Similarly, for γ < 2 the average degree ⟨k⟩ diverges in the N → ∞ limit. These odd predictions are only two of the many anomalous features of scale-free networks in this regime. They are signatures of a deeper problem: Large scale-free networks with γ < 2 that lack multi-links cannot exist (BOX 4.6).

Scale-Free Regime (2 < γ < 3)
In this regime the first moment of the degree distribution is finite but the second and higher moments diverge as N → ∞. Consequently scale-free networks in this regime are ultra-small (SECTION 4.6). Equation (4.18) predicts that k_max grows with the size of the network with exponent 1/(γ − 1), which is smaller than one. Hence the market share of the largest hub, k_max/N, representing the fraction of nodes that connect to it, decreases as

$$
k_{max}/N \sim N^{-(\gamma-2)/(\gamma-1)}.
$$

As we will see in the coming chapters, many interesting features of scale-free networks, from their robustness to anomalous spreading phenomena, are linked to this regime.

Random Network Regime (γ > 3)
According to (4.20) for γ > 3 both the first and the second moments are finite. For all practical purposes the properties of a scale-free network in this regime are difficult to distinguish from the properties of a random network of similar size. For example (4.22) indicates that the average distance between the nodes converges to the small-world formula derived for random networks. The reason is that for large γ the degree distribution p_k decays sufficiently fast to make the hubs small and less numerous.

Note that scale-free networks with large γ are hard to distinguish from a random network. Indeed, to document the presence of a power-law degree distribution we ideally need 2-3 orders of magnitude of scaling, which means that k_max should be at least 10^2 - 10^3 times larger than k_min. By inverting (4.18) we can estimate the network size necessary to observe the desired scaling regime, finding

$$
N = \left( \frac{k_{max}}{k_{min}} \right)^{\gamma - 1}.
\tag{4.23}
$$

For example, if we wish to document the scale-free nature of a network with γ = 5 and require scaling that spans at least two orders of magnitude (e.g. k_min ∼ 1 and k_max ≃ 10^2), according to (4.23) the size of the network must exceed N > 10^8. There are very few network maps of this size. Therefore, there may be many networks with large degree exponent. Given, however, their limited size, it is difficult to obtain convincing evidence of their scale-free nature.

BOX 4.5 THE γ-DEPENDENT PROPERTIES OF SCALE-FREE NETWORKS
Anomalous regime (γ ≤ 2): ⟨k⟩ diverges; ⟨k²⟩ diverges; k_max grows faster than N; no large network can exist here. At γ = 2: k_max ∼ N and ⟨d⟩ ∼ const.
Scale-free regime (2 < γ < 3): ⟨k⟩ finite; ⟨k²⟩ diverges; ⟨d⟩ ∼ ln ln N (ultra-small world).
Critical point (γ = 3): ⟨d⟩ ∼ ln N / ln ln N.
Random regime (γ > 3): ⟨k⟩ finite; ⟨k²⟩ finite; ⟨d⟩ ∼ ln N / ln ⟨k⟩ (small world); indistinguishable from a random network.
[Chart: the reference networks of Table 4.1 are marked along the γ axis.]

In summary, we find that the behavior of scale-free networks is sensitive to the value of the degree exponent γ. Theoretically the most interesting regime is 2
2 are graphical, it is impossible to find graphical networks in the 0 < γ < 2 range. After [39].
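The size estimate of (4.23) and the shrinking market share of the largest hub can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the hub share assumes k_max = k_min N^{1/(γ−1)}, the form of (4.18) that (4.23) inverts.

```python
def required_network_size(k_max, k_min, gamma):
    """Network size N implied by (4.23) for a given scaling range."""
    return (k_max / k_min) ** (gamma - 1)

def largest_hub_share(n_nodes, gamma, k_min=1):
    """Fraction k_max / N, assuming k_max = k_min * N^(1/(gamma - 1))."""
    k_max = k_min * n_nodes ** (1.0 / (gamma - 1))
    return k_max / n_nodes

# Two orders of magnitude of scaling at gamma = 5, as in the example of the text.
print(required_network_size(k_max=100, k_min=1, gamma=5))

# The hub's share of the nodes shrinks as the network grows (here gamma = 2.5).
for n in (10**3, 10**6, 10**9):
    print(n, largest_hub_share(n, gamma=2.5))
```

The steep growth of the required N with γ is why the text notes that convincing evidence is hard to obtain for networks with a large degree exponent.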
SECTION 4.8
GENERATING NETWORKS WITH ARBITRARY DEGREE DISTRIBUTION
Networks generated by the Erdős-Rényi model have a Poisson degree distribution. The empirical results discussed in this chapter indicate, however, that the degree distribution of real networks significantly deviates from a Poisson form, raising an important question: How do we generate networks with an arbitrary p_k? In this section we discuss three frequently used algorithms designed for this purpose.

Configuration Model
The configuration model, described in Figure 4.15, helps us build a network with a predefined degree sequence. In the network generated by the model each node has a predefined degree k_i, but otherwise the network is wired randomly. Consequently the network is often called a random network with a predefined degree sequence. By repeatedly applying this procedure to the same degree sequence we can generate different networks with the same p_k (Figure 4.15b-d). There are a couple of caveats to consider:

• The probability to have a link between nodes of degree k_i and k_j is

$$
p_{ij} = \frac{k_i k_j}{2L - 1}.
\tag{4.24}
$$

Indeed, a stub starting from node i can connect to 2L − 1 other stubs. Of these, k_j are attached to node j. So the probability that a particular stub is connected to a stub of node j is k_j/(2L − 1). As node i has k_i stubs, it has k_i attempts to link to j, resulting in (4.24).

• The obtained network contains self-loops and multi-links, as there is nothing in the algorithm to forbid a node connecting to itself, or to generate multiple links between two nodes. We can choose to reject stub pairs that lead to these, but if we do so, we may not be able to complete the network. Rejecting self-loops or multi-links also means that not all possible matchings appear with equal probability. Hence (4.24) will not be valid, making analytical calculations difficult. Yet, the number of self-loops and multi-links remains negligible, as the number of choices to connect to increases with N, so typically we do not need to exclude them [42].

• The configuration model is frequently used in calculations, as (4.24) and its inherently random character help us analytically calculate numerous network measures.

Figure 4.15 The Configuration Model
The configuration model builds a network whose nodes have predefined degrees [40, 41]. The algorithm consists of the following steps:
(a) Degree Sequence
Assign a degree to each node, represented as stubs or half-links (in the example shown, k_1 = 3, k_2 = 2, k_3 = 2, k_4 = 1). The degree sequence is either generated analytically from a preselected p_k distribution (BOX 4.7), or it is extracted from the adjacency matrix of a real network. We must start from an even number of stubs, otherwise we are left with unpaired stubs.
(b, c, d) Network Assembly
Randomly select a stub pair and connect them. Then randomly choose another pair from the remaining 2L − 2 stubs and connect them. This procedure is repeated until all stubs are paired up. Depending on the order in which the stubs were chosen, we obtain different networks. Some networks include cycles (b), others self-loops (c) or multi-links (d). Yet, the expected number of self-loops and multi-links goes to zero in the N → ∞ limit.
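The stub-pairing procedure of the configuration model can be sketched in a few lines. This is a minimal illustration, not the implementation behind the figures: shuffling the stub list and pairing consecutive entries is one standard way to draw a uniformly random matching, and the function names are hypothetical.

```python
import random

def configuration_model(degrees, rng=None):
    """Pair stubs uniformly at random; returns a list of (i, j) links.

    Self-loops and multi-links are kept, as in the unmodified algorithm.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("the degree sequence must contain an even number of stubs")
    rng = rng or random.Random()
    # One list entry per stub: node i appears degrees[i] times.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # After shuffling, consecutive stubs form a uniformly random matching.
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

def link_probability(k_i, k_j, n_links):
    """Equation (4.24) for two nodes with degrees k_i and k_j."""
    return k_i * k_j / (2 * n_links - 1)

# The degree sequence {3, 2, 2, 1} of Figure 4.15a (nodes indexed from 0 here).
links = configuration_model([3, 2, 2, 1], rng=random.Random(1))
print(links)
print(link_probability(3, 2, n_links=4))
```

Running it with different seeds yields different wirings of the same degree sequence, some of which contain self-loops or multi-links, as panels (b) to (d) illustrate.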
BOX 4.7 GENERATING A DEGREE SEQUENCE WITH POWER-LAW DISTRIBUTION
The degree sequence of an undirected network is a sequence of node degrees. For example, the degree sequence of each of the networks shown in Figure 4.15a is {3, 2, 2, 1}. As Figure 4.15a illustrates, the degree sequence does not uniquely identify a graph, as there are multiple ways we can pair up the stubs.

To generate a degree sequence from a predefined degree distribution we start from an analytically predefined degree distribution, like p_k ∼ k^{-γ}, shown in Figure 4.16a. Our goal is to generate a degree sequence {k_1, k_2, ..., k_N} that follows the distribution p_k. We start by calculating the function

$$
D(k) = \sum_{k' \geq k} p_{k'},
\tag{4.25}
$$

shown in Figure 4.16b. D(k) is between 0 and 1, and the step size at any k equals p_k. To generate a sequence of N degrees following p_k, we generate N random numbers r_i, i = 1, ..., N, chosen uniformly from the (0, 1) interval. For each r_i we use the plot in (b) to assign a degree k_i. The obtained k_i = D^{-1}(r_i) set of numbers follows the desired p_k distribution. Note that the degree sequence assigned to a p_k is not unique: we can generate multiple sets of {k_1, ..., k_N} sequences compatible with the same p_k.

Figure 4.16 Generating a Degree Sequence
[Panel (a): p_k ∼ k^{-γ} versus k on log-log axes. Panel (b): D(k) versus k, showing how a random number r is mapped to k = D^{-1}(r).]
(a) The power law degree distribution of the degree sequence we wish to generate.
(b) The function (4.25), which allows us to assign degrees k to uniformly distributed random numbers r.
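The inversion of D(k) described in BOX 4.7 can be sketched as follows. Two choices here are assumptions not made in the text: the power law is truncated at a hypothetical cutoff `k_cut` so that it can be normalized numerically, and the lookup of D^{-1}(r) uses a binary search over the tabulated tail sums.

```python
import bisect
import random

def power_law_pk(gamma, k_min, k_cut):
    """Normalized p_k ~ k^(-gamma) on k_min..k_cut (the cutoff is an assumption)."""
    weights = [k ** (-gamma) for k in range(k_min, k_cut + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def degree_sequence(n_nodes, gamma, k_min=1, k_cut=1000, rng=None):
    """Draw N degrees by inverting D(k) = sum_{k' >= k} p_k' of (4.25)."""
    rng = rng or random.Random()
    pk = power_law_pk(gamma, k_min, k_cut)
    # ascending[j] = D(k_cut - j): tail sums stored in increasing order for bisect.
    ascending = []
    tail = 0.0
    for p in reversed(pk):
        tail += p
        ascending.append(tail)
    ascending[-1] = 1.0  # guard against round-off in the final sum
    degrees = []
    for _ in range(n_nodes):
        r = rng.random()
        # Smallest j with D(k_cut - j) >= r, i.e. the largest k with D(k) >= r.
        j = bisect.bisect_left(ascending, r)
        degrees.append(k_cut - j)
    return degrees

sequence = degree_sequence(1000, gamma=2.5, k_cut=100, rng=random.Random(0))
print(sorted(sequence, reverse=True)[:5])
```

A degree k is returned whenever D(k + 1) < r ≤ D(k), an interval of length p_k, which is the step-size argument made in the box. If the sequence is fed to the configuration model, the number of stubs must also be even; redrawing until the sum is even is one possible workaround, not something the text prescribes.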
Degree-Preserving Randomization
As we explore the properties of a real network, we often need to ask if a certain network property is predicted by its degree distribution alone, or if it represents some additional property not contained in p_k. To answer this question we need to generate networks that are wired randomly, but whose p_k is identical to the original network. This can be achieved through degree-preserving randomization [43] described in Figure 4.17b. The idea behind the algorithm is simple: We randomly select two links and swap them, if the swap does not lead to multi-links. The swap leaves the degree of each of the four involved nodes unchanged. Consequently, hubs stay hubs and small-degree nodes retain their small degree, but the wiring diagram of the generated network is randomized.

Note that degree-preserving randomization is different from full randomization, where we swap links without preserving the node degrees (Figure 4.17a). Full randomization turns any network into an Erdős-Rényi network with a Poisson degree distribution that is independent of the original p_k.
Figure 4.17 Degree Preserving Randomization
Two algorithms can generate a randomized version of a given network [43], with different outcomes.
(a) Full Randomization
This algorithm generates a random (Erdős-Rényi) network with the same N and L as the original network. We select randomly a source node (S1) and two target nodes, where the first target (T1) is linked directly to the source node and the second target (T2) is not. We rewire the S1-T1 link, turning it into an S1-T2 link. As a result the degree of the target nodes T1 and T2 changes. We perform this procedure once for each link in the network.
(b) Degree-Preserving Randomization
This algorithm generates a network in which each node has exactly the same degree as in the original network, but the network’s wiring diagram has been randomized. We select two source (S1, S2) and two target nodes (T1, T2), such that initially there is a link between S1 and T1, and a link between S2 and T2. We then swap the two links, creating an S1-T2 and an S2-T1 link. The swap leaves the degree of each node unchanged. We repeat this procedure until we rewire each link at least once.
Bottom Panels: Starting from a scale-free network (middle), full randomization eliminates the hubs and turns the network into a random network (left). In contrast, degree-preserving randomization leaves the hubs in place and the network remains scale-free (right).
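The link swap of degree-preserving randomization can be sketched as below. This is an illustration under stated assumptions rather than the procedure of [43]: it performs a fixed number of successful swaps (`n_swaps`, a hypothetical parameter) instead of tracking whether every link has been rewired, it also rejects swaps that would create self-loops, and it caps the number of attempts so that the loop always terminates.

```python
import random

def degree_preserving_randomization(links, n_swaps, rng=None):
    """Swap link endpoints while keeping every node degree fixed.

    links: list of (u, v) pairs of a simple undirected network.
    A swap is skipped if it would create a self-loop or a multi-link.
    """
    rng = rng or random.Random()
    links = [tuple(link) for link in links]
    present = {frozenset(link) for link in links}
    done = 0
    attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        a, b = rng.sample(range(len(links)), 2)
        s1, t1 = links[a]
        s2, t2 = links[b]
        if rng.random() < 0.5:  # pick one of the two possible pairings
            s2, t2 = t2, s2
        # Proposed new links: S1-T2 and S2-T1.
        if s1 == t2 or s2 == t1:
            continue
        new_a, new_b = frozenset((s1, t2)), frozenset((s2, t1))
        if new_a in present or new_b in present or new_a == new_b:
            continue
        present -= {frozenset(links[a]), frozenset(links[b])}
        present |= {new_a, new_b}
        links[a], links[b] = (s1, t2), (s2, t1)
        done += 1
    return links

ring = [(i, (i + 1) % 10) for i in range(10)]
print(degree_preserving_randomization(ring, n_swaps=20, rng=random.Random(3)))
```

Each accepted swap removes one link end from S1, T1, S2 and T2 and adds one back, so every degree is unchanged while the wiring is shuffled.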
Hidden Parameter Model
The configuration model generates self-loops and multi-links, features that are absent in many real networks. We can use the hidden parameter model (Figure 4.18) to generate networks with a predefined p_k but without multi-links and self-loops [44, 45, 46].

We start from N isolated nodes and assign each node i a hidden parameter η_i, chosen from a distribution ρ(η). The nature of the generated network depends on the selection of the {η_i} hidden parameter sequence. There are two ways to generate the appropriate hidden parameters:

• η_i can be a sequence of N random numbers chosen from a predefined ρ(η) distribution. The degree distribution of the obtained network is

$$
p_k = \int \frac{e^{-\eta} \eta^k}{k!} \rho(\eta) \, d\eta.
\tag{4.26}
$$

• η_i can come from a deterministic sequence {η_1, η_2, ..., η_N}. The degree distribution of the obtained network is

$$
p_k = \frac{1}{N} \sum_j \frac{e^{-\eta_j} \eta_j^k}{k!}.
\tag{4.27}
$$

The hidden parameter model offers a particularly simple method to generate a scale-free network. Indeed, using

$$
\eta_i = \frac{c}{i^{\alpha}}, \quad i = 1, \ldots, N
\tag{4.28}
$$

as the sequence of hidden parameters, according to (4.27) the obtained network will have the degree distribution

$$
p_k \sim k^{-(1 + 1/\alpha)}
\tag{4.29}
$$

for large k. Hence by choosing the appropriate α we can tune γ = 1 + 1/α. We can also use ⟨η⟩ to tune ⟨k⟩ as (4.26) and (4.27) imply that ⟨k⟩ = ⟨η⟩.

Figure 4.18 Hidden Parameter Model
(a) We start with N isolated nodes and assign to each node a hidden parameter η_i, which is either selected from a ρ(η) distribution or it is provided by a sequence {η_i}. We connect each node pair with probability

$$
p(\eta_i, \eta_j) = \frac{\eta_i \eta_j}{\langle \eta \rangle N}.
$$

The figure shows the probability to connect nodes (1,3) and (3,4): with η_1 = 1, η_2 = 1.5, η_3 = 2, η_4 = 0.5 and ⟨η⟩ = 1.25 we have p_{1,3} = 0.4 and p_{3,4} = 0.2.
(b, c) After connecting the nodes, we obtain the networks shown in (b) or (c), representing two independent realizations generated by the same hidden parameter sequence (a).
The expected number of links in the network generated by the model is

$$
L = \frac{1}{2} \sum_{i,j} \frac{\eta_i \eta_j}{\langle \eta \rangle N} = \frac{1}{2} \langle \eta \rangle N.
$$

Similar to the random network model, L will vary from network to network, following an exponentially bounded distribution. If we wish to control the average degree ⟨k⟩ we can add L links to the network one by one. The end points i and j of each link are then chosen randomly with a probability proportional to η_i and η_j. In this case we connect i and j only if they were not connected previously.

In summary, the configuration model, degree-preserving randomization and the hidden parameter model can generate networks with a predefined degree distribution and help us analytically calculate key network characteristics. We will turn to these algorithms each time we explore whether a certain network property is a consequence of the network’s degree distribution, or if it represents some emergent property (BOX 4.8). As we use these algorithms, we must be aware of their limitations:

• The algorithms do not tell us why a network has a certain degree distribution. Understanding the origin of the observed p_k will be the subject of CHAPTERS 6 and 7.
• Several important network characteristics, from clustering (CHAPTER 9) to degree correlations (CHAPTER 7), are lost during randomization.
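The hidden parameter model can be sketched directly from the connection probability of Figure 4.18 and the sequence (4.28). The function names are illustrative, and clipping the probability at 1 is an added safeguard for large η values that the text does not discuss.

```python
import random

def hidden_parameters(n_nodes, alpha, c=1.0):
    """Deterministic sequence eta_i = c / i^alpha of (4.28), i = 1..N."""
    return [c / i ** alpha for i in range(1, n_nodes + 1)]

def connection_probability(eta_i, eta_j, eta_mean, n_nodes):
    """p(eta_i, eta_j) = eta_i * eta_j / (<eta> * N), clipped at 1 (assumption)."""
    return min(1.0, eta_i * eta_j / (eta_mean * n_nodes))

def hidden_parameter_network(eta, rng=None):
    """Connect each node pair independently; returns a list of (i, j) links."""
    rng = rng or random.Random()
    n = len(eta)
    eta_mean = sum(eta) / n
    links = []
    for i in range(n):
        for j in range(i + 1, n):  # i < j only: no self-loops, no multi-links
            if rng.random() < connection_probability(eta[i], eta[j], eta_mean, n):
                links.append((i, j))
    return links

def degree_exponent(alpha):
    """gamma = 1 + 1/alpha, the large-k exponent stated after (4.29)."""
    return 1.0 + 1.0 / alpha

# The four-node example of Figure 4.18a (nodes indexed from 0 here).
eta = [1.0, 1.5, 2.0, 0.5]
print(connection_probability(eta[0], eta[2], sum(eta) / 4, 4))  # nodes (1,3)
print(hidden_parameter_network(eta, rng=random.Random(0)))
```

Because every pair is examined exactly once, the generated network is simple by construction, which is the advantage over the configuration model highlighted in the text. The double loop costs O(N²) steps, so this form is practical only for moderate N.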
BOX 4.8 TESTING THE SMALL-WORLD PROPERTY
In the literature the distances observed in a real network are often compared to the small-world formula (3.19). Yet, (3.19) was derived for random networks, while real networks do not have a Poisson degree distribution. If the network is scale-free, then (4.22) offers the appropriate formula. Yet, (4.22) provides only the scaling of the distance with N, and not its absolute value. Instead of fitting the average distance, we often ask: Are the distances observed in a real network comparable with the distances observed in a randomized network with the same degree distribution? Degree-preserving randomization helps answer this question. We illustrate the procedure on the protein interaction network.

(i) Original Network
We start by measuring the distance distribution pd of the original network, obtaining ⟨d⟩ = 5.61 (Figure 4.19).

(ii) Full Randomization
We generate a random network with the same N and L as the original network. The obtained pd visibly shifts to the right, providing ⟨d⟩ = 7.13, much larger than the original ⟨d⟩ = 5.61. It is tempting to conclude that the protein interaction network is affected by some unknown organizing principle that keeps the distances shorter. This would be a flawed conclusion, however, as the bulk of the difference is due to the fact that full randomization changed the degree distribution.

(iii) Degree-Preserving Randomization
As the original network is scale-free, the proper random reference should maintain the original degree distribution. Hence we determine pd after degree-preserving randomization, finding that it is comparable to the original pd.

In summary, a random network overestimates the distances between the nodes, as it is missing the hubs. The network obtained by degree-preserving randomization retains the hubs, so the distances of the randomized network are comparable to the original network. This example illustrates the importance of choosing the proper randomization procedure when exploring networks.

Figure 4.19 Randomizing Real Networks
[Plot of pd versus d, with three curves: original network, degree-preserving randomization, full randomization.]
The distance distribution pd between each node pair in the protein-protein interaction network (Table 4.1). The green line provides the path-length distribution obtained under full randomization, which turns the network into an Erdős-Rényi network, while keeping N and L unchanged (Figure 4.17). The light purple curve corresponds to pd of the network obtained after degree-preserving randomization, which keeps the degree of each node unchanged. We have: ⟨d⟩ = 5.61 ± 1.64 (original), ⟨d⟩ = 7.13 ± 1.62 (full randomization), ⟨d⟩ = 5.08 ± 1.34 (degree-preserving randomization).
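The comparison described in this box can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Figure 4.19: it assumes the network is given as a list of undirected links, it implements degree-preserving randomization through repeated double link swaps that are rejected whenever they would create a self-loop or a multi-link, and the function names are hypothetical.

```python
import random
from collections import deque


def degree_preserving_randomization(edges, n_swaps, seed=0):
    """Rewire a simple undirected network by double link swaps.

    Each accepted swap replaces links (a, b) and (c, d) with (a, d) and
    (c, b), so every node keeps its degree.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:          # pick one of the two possible pairings
            c, d = d, c
        if a == d or c == b:            # swap would create a self-loop
            continue
        if frozenset((a, d)) in present or frozenset((c, b)) in present:
            continue                    # swap would create a multi-link
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {frozenset((a, d)), frozenset((c, b))}
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return edges


def average_distance(edges):
    """Average shortest-path length over all connected node pairs (BFS)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")
```

Comparing `average_distance(edges)` with the value obtained after `degree_preserving_randomization(edges, n_swaps)` mirrors steps (i) and (iii); a swap count of several times the number of links is a common rule of thumb, stated here as an assumption rather than a prescription of the text.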
Hence, the networks generated by these algorithms are a bit like a photograph of a painting: at first look they appear to be the same as the original. Upon closer inspection we realize, however, that many details, from the texture of the canvas to the brush strokes, are lost.

The three algorithms discussed above raise the following question: How do we decide which one to use? Our choice depends on whether we start from a degree sequence {ki} or a degree distribution pk and whether we can tolerate self-loops and multi-links between two nodes. The decision tree involved in this choice is provided in Figure 4.20.

Figure 4.20 Choosing a Generative Algorithm
[Decision tree. NETWORK, exactly the same degree sequence: DEGREE-PRESERVING RANDOMIZATION. DEGREE DISTRIBUTION, simple pk: CONFIGURATION MODEL. DEGREE DISTRIBUTION, adjustable ⟨k⟩: HIDDEN PARAMETER MODEL.]
The choice of the appropriate generative algorithm depends on several factors. If we start from a real network or a known degree sequence, we can use degree-preserving randomization, which guarantees that the obtained networks are simple and have the degree sequence of the original network. The model allows us to forbid multi-links or self-loops, while maintaining the degree sequence of the original network.
If we wish to generate a network with a given predefined degree distribution pk, we have two options. If pk is known, the configuration model offers a convenient algorithm for network generation. For example, the model allows us to generate a network with a pure power-law degree distribution $p_k = C k^{-\gamma}$ for k ≥ kmin.
However, tuning the average degree ⟨k⟩ of a scale-free network within the configuration model is a tedious task, because the only available free parameter is kmin. Therefore, if we wish to alter ⟨k⟩, it is more convenient to use the hidden parameter model with parameter sequence (4.28). This way the tail of the degree distribution follows $\sim k^{-\gamma}$ and by changing the number of links L we can control ⟨k⟩.
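The fixed-L variant of the hidden parameter model described earlier (add L links one by one, choosing the end points with probability proportional to ηi and ηj, and skipping pairs that are already connected) can be sketched as follows. The function names are hypothetical, and the parameter sequence used in the usage note, ηi = 1/i^α, is an assumption made for illustration; consult (4.28) for the sequence the text actually prescribes.

```python
import random


def hidden_parameter_network(eta, n_links, seed=0):
    """Place n_links distinct links; end points are drawn with
    probability proportional to the hidden parameters eta."""
    n = len(eta)
    if n_links > n * (n - 1) // 2:
        raise ValueError("more links requested than node pairs available")
    rng = random.Random(seed)
    links = set()
    while len(links) < n_links:
        i, j = rng.choices(range(n), weights=eta, k=2)
        if i != j:                       # no self-loops
            links.add((min(i, j), max(i, j)))   # set membership blocks multi-links
    return links


def expected_links(eta):
    """Evaluate L = (1/2) * sum_{i,j} eta_i * eta_j / (<eta> * N) explicitly."""
    n = len(eta)
    mean_eta = sum(eta) / n
    return 0.5 * sum(a * b for a in eta for b in eta) / (mean_eta * n)
```

For example, `eta = [i ** -0.5 for i in range(1, 1001)]` followed by `hidden_parameter_network(eta, 2000)` yields a network with ⟨k⟩ = 2L/N = 4; changing `n_links` changes ⟨k⟩ without touching the hidden parameters. The explicit double sum in `expected_links` is only meant to confirm the closed form ⟨η⟩N/2 and is quadratic in N.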
SECTION 4.9
SUMMARY
The scale-free property has played an important role in the development of network science for two main reasons:
• Many networks of scientific and practical interest, from the WWW to the subcellular networks, are scale-free. This universality made the scale-free property an unavoidable issue in many disciplines.
• Once the hubs are present, they fundamentally change the system's behavior. The ultra-small property offers a first hint of their impact on a network's properties; we will encounter many more examples in the coming chapters.

As we continue to explore the consequences of the scale-free property, we must keep in mind that the power-law form (4.1) is rarely seen in this pure form in real systems. The reason is that a host of processes affect the topology of each network, which also influence the shape of the degree distribution. We will discuss these processes in the coming chapters. The diversity of these processes and the complexity of the resulting pk confuse those who approach these networks through the narrow perspective of the quality of fit to a pure power law. Instead the scale-free property tells us that we must distinguish two rather different classes of networks:

Exponentially Bounded Networks
We call a network exponentially bounded if its degree distribution decreases exponentially or faster for high k. As a consequence the spread of the degrees is small compared to the average degree, implying that we lack significant degree variations. Examples of pk in this class include the Poisson, Gaussian, or the simple exponential distribution (Table 4.2). Erdős-Rényi and Watts-Strogatz networks are the best known network models belonging to this class. Exponentially bounded networks lack outliers, consequently most nodes have comparable degrees. Real networks in this class include highway networks and the power grid.

Fat Tailed Networks
We call a network fat tailed if its degree distribution has a power-law tail in the high-k region. As a consequence the spread of the degrees is much larger than the average degree, resulting in considerable degree variations. Scale-free networks with a power-law degree distribution (4.1) offer the best known example of networks belonging to this class. Outliers, or exceptionally high-degree nodes, are not only allowed but are expected in these networks. Networks in this class include the WWW, the Internet, protein interaction networks, and most social and online networks.

While it would be desirable to statistically validate the precise form of the degree distribution, often it is sufficient to decide if a given network has an exponentially bounded or a fat tailed degree distribution (see ADVANCED TOPICS 4.A). If the degree distribution is exponentially bounded, the random network model offers a reasonable starting point to understand its topology. If the degree distribution is fat tailed, a scale-free network offers a better approximation. We will also see in the coming chapters that the key signature of the fat tailed behavior is the magnitude of ⟨k²⟩: If ⟨k²⟩ is large, systems behave like scale-free networks; if ⟨k²⟩ is small, being comparable to ⟨k⟩(⟨k⟩+1), systems are well approximated by random networks.

In summary, to understand the properties of real networks, it is often sufficient to remember that in scale-free networks a few highly connected hubs coexist with a large number of small nodes. The presence of these hubs plays an important role in the system's behavior. In this chapter [...] scale-free? The next chapter provides the answer.

BOX 4.9 AT A GLANCE: SCALE-FREE NETWORKS

DEGREE DISTRIBUTION
Discrete form: $p_k = \frac{k^{-\gamma}}{\zeta(\gamma)}$.
Continuous form: $p(k) = (\gamma - 1) k_{min}^{\gamma - 1} k^{-\gamma}$.

SIZE OF THE LARGEST HUB
$k_{max} = k_{min} N^{\frac{1}{\gamma - 1}}$.

MOMENTS OF pk for N → ∞
2 < γ ≤ 3: ⟨k⟩ finite, ⟨k²⟩ diverges.
γ > 3: ⟨k⟩ and ⟨k²⟩ finite.

DISTANCES
⟨d⟩ ~ const. for γ = 2
⟨d⟩ ~ ln ln N for 2 < γ < 3
⟨d⟩ ~ ln N / ln ln N for γ = 3
⟨d⟩ ~ ln N for γ > 3
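The two quantitative rules of thumb in this summary, comparing ⟨k²⟩ with the random-network value ⟨k⟩(⟨k⟩+1) and estimating the largest hub from kmax = kmin N^{1/(γ−1)}, are easy to evaluate on a degree sequence. The sketch below is only illustrative; the function names are hypothetical and no threshold for "large" is implied by the text.

```python
def degree_moments(degrees):
    """Return (<k>, <k^2>) for a list of node degrees."""
    n = len(degrees)
    k1 = sum(degrees) / n
    k2 = sum(k * k for k in degrees) / n
    return k1, k2


def second_moment_ratio(degrees):
    """<k^2> divided by <k>(<k> + 1), the value expected for a Poisson
    degree distribution with the same average degree."""
    k1, k2 = degree_moments(degrees)
    return k2 / (k1 * (k1 + 1))


def expected_largest_hub(k_min, n_nodes, gamma):
    """k_max = k_min * N**(1 / (gamma - 1))."""
    return k_min * n_nodes ** (1.0 / (gamma - 1.0))
```

A ratio close to 1 suggests that a random network is a reasonable reference, while a ratio far above 1 signals the degree heterogeneity characteristic of fat tailed networks.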
SECTION 4.10
HOMEWORK
4.1. Hubs
Calculate the expected maximum degree kmax for the undirected networks listed in Table 4.1.

4.2. Friendship Paradox
The degree distribution pk expresses the probability that a randomly selected node has k neighbors. However, if we randomly select a link, the probability that a node at one of its ends has degree k is qk = Akpk, where A is a normalization factor.
(a) Find the normalization factor A, assuming that the network has a power-law degree distribution with 2 < γ < 3, with minimum degree kmin and maximum degree kmax.
(b) In the configuration model qk is also the probability that a randomly chosen node has a neighbor with degree k. What is the average degree of the neighbors of a randomly chosen node?
(c) Calculate the average degree of the neighbors of a randomly chosen node in a network with $N = 10^4$, γ = 2.3, kmin = 1 and kmax = 1,000. Compare the result with the average degree of the network, ⟨k⟩.
(d) How can you explain the "paradox" of (c), that is, a node's friends have more friends than the node itself?

4.3. Generating Scale-Free Networks
Write a computer code to generate networks of size N with a power-law degree distribution with degree exponent γ. Refer to SECTION 4.8 for the procedure. Generate three networks with γ = 2.2 and with $N = 10^3$, $N = 10^4$ and $N = 10^5$ nodes, respectively. What is the percentage of multi-links and self-loops in each network? Generate more networks to plot this percentage as a function of N. Do the same for networks with γ = 3.

4.4. Mastering Distributions
Use software that includes a statistics package, like Matlab, Mathematica or Numpy in Python, to generate three synthetic datasets, each containing 10,000 integers that follow a power-law distribution with γ = 2.2, γ = 2.5 and γ = 3. Use kmin = 1. Apply the techniques described in ADVANCED TOPICS 4.C to fit the three distributions.
SECTION 4.11
ADVANCED TOPICS 4.A POWER LAWS
Power laws have a convoluted history in natural and social sciences, being interchangeably (and occasionally incorrectly) called fat-tailed, heavy-tailed, long-tailed, Pareto, or Bradford distributions. They also have a series of close relatives, like lognormal, Weibull, or Lévy distributions. In this section we discuss some of the most frequently encountered distributions in network science and their relationship to power laws.

Exponentially Bounded Distributions
Many quantities in nature, from the height of humans to the probability of being in a car accident, follow bounded distributions. A common property of these is that px decays either exponentially ($e^{-x}$), or faster than exponentially ($e^{-x^2/\sigma^2}$) for high x. Consequently the largest expected x is bounded by some upper value xmax that is not too different from ⟨x⟩. Indeed, the expected largest x obtained after we draw N numbers from a bounded px grows as xmax ∼ log N or slower. This means that outliers, representing unusually high x-values, are rare. They are so rare that they are effectively forbidden, meaning that they do not occur with any meaningful probability. Instead, most events drawn from a bounded distribution are in the vicinity of ⟨x⟩.

The high-x regime is called the tail of a distribution. Given the absence of numerous events in the tail, these distributions are also called thin tailed.

Analytically the simplest bounded distribution is the exponential distribution $e^{-\lambda x}$. Within network science the most frequently encountered bounded distribution is the Poisson distribution (or its parent, the binomial distribution), which describes the degree distribution of a random network. Outside network science the most frequently encountered member of this class is the normal (Gaussian) distribution (Table 4.2).

Fat Tailed Distributions
The terms fat tailed, heavy tailed, or long tailed refer to px whose decay at large x is slower than exponential. In these distributions we often encounter events characterized by very large x values, usually called outliers or rare events. The power-law distribution (4.1) represents the best known example of a fat tailed distribution. An instantly recognizable feature of a fat tailed distribution is that the magnitude of the events x drawn from it can span several orders of magnitude. Indeed, in these distributions the size of the largest event after N trials scales as $x_{max} \sim N^{\zeta}$, where ζ is determined by the exponent γ characterizing the tail of the px distribution. As $N^{\zeta}$ grows fast, rare events or outliers occur with a noticeable frequency, often dominating the properties of the system.
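The contrast between xmax ∼ log N and xmax ∼ N^ζ can be checked numerically. The sketch below draws N values from an exponential and from a continuous power-law distribution using inverse transform sampling and returns the largest one. The function names are hypothetical, and the choice of a continuous power law with lower cutoff x_min is an assumption made for simplicity; for that distribution the largest value is expected to grow roughly as N^{1/(γ−1)}, in line with the largest-hub formula of this chapter.

```python
import math
import random


def sample_exponential(lam, rng):
    """Inverse transform sample from p(x) = lam * exp(-lam * x), x >= 0."""
    return -math.log(1.0 - rng.random()) / lam


def sample_power_law(gamma, x_min, rng):
    """Inverse transform sample from p(x) ~ x**(-gamma), x >= x_min."""
    return x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0))


def largest_of(sampler, n_draws, rng):
    """Largest value observed among n_draws independent samples."""
    return max(sampler(rng) for _ in range(n_draws))
```

Calling `largest_of(lambda r: sample_power_law(2.5, 1.0, r), N, rng)` for increasing N, and doing the same with `sample_exponential`, shows the power-law maximum climbing by orders of magnitude while the exponential maximum barely moves.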
The relevance of fat tailed distributions to networks is provided by several factors:
• Many quantities occurring in network science, like degrees, link weights and betweenness centrality, follow a power-law distribution in both real and model networks.
• The power-law form is analytically predicted by appropriate network models (CHAPTER 5).

Crossover Distribution (Log-Normal, Stretched Exponential)
When an empirically observed distribution appears to be between a power law and exponential, crossover distributions are often used to fit the data. These distributions may be exponentially bounded (power law with exponential cutoff), or not bounded but decay faster than a power law (lognormal or stretched exponential). Next we discuss the properties of several frequently encountered crossover distributions.

Power law with exponential cutoff is often used to fit the degree distribution of real networks. Its density function has the form:

$$p(x) = C x^{-\gamma} e^{-\lambda x},$$   (4.30)

$$C = \frac{\lambda^{1-\gamma}}{\Gamma(1-\gamma, \lambda x_{min})},$$   (4.31)

where x > 0 and γ > 0 and Γ(s,y) denotes the upper incomplete gamma function. The analytical form (4.30) directly captures its crossover nature: it combines a power-law term, a key component of fat tailed distributions, with an exponential term, responsible for its exponentially bounded tail. To highlight its crossover characteristics we take the logarithm of (4.30),

$$\ln p(x) = \ln C - \gamma \ln x - \lambda x.$$   (4.32)

For x ≪ 1/λ the second term on the r.h.s. dominates, suggesting that the distribution follows a power law with exponent γ. Once x ≫ 1/λ, the λx term overcomes the ln x term, resulting in an exponential cutoff for high x.
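One way to see the crossover in (4.32) is to look at the local slope of ln p(x) on a log-log plot, d ln p / d ln x = −γ − λx, which stays near −γ for x ≪ 1/λ and steepens without bound once x ≫ 1/λ. The short sketch below evaluates this slope; it leaves out the constant ln C, which does not affect the slope, and the function names are hypothetical.

```python
import math


def log_density_without_constant(x, gamma, lam):
    """ln p(x) - ln C for the power law with exponential cutoff (4.30)."""
    return -gamma * math.log(x) - lam * x


def local_loglog_slope(x, gamma, lam):
    """Slope d ln p / d ln x of (4.32): -gamma - lam * x."""
    return -gamma - lam * x
```

With γ = 2.5 and λ = 0.01, the slope is about −2.5 at x = 1, about −3.5 at x = 1/λ = 100, and about −12.5 at x = 1,000.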
Stretched exponential (Weibull distribution) is formally similar to (4.30) except that there is a fractional power law in the exponential. Its name comes from the fact that its cumulative distribution function is one minus a stretched exponential function,

$$P(x) = 1 - e^{-(\lambda x)^{\beta}},$$   (4.32)

which leads to the density function

$$p(x) = P'(x) = C x^{\beta - 1} e^{-(\lambda x)^{\beta}},$$   (4.33)

$$C = \beta \lambda^{\beta}.$$   (4.34)

In most applications x varies between 0 and +∞. In (4.32) β is the stretching exponent, determining the properties of p(x):
• For β = 1 we recover a simple exponential function.
• If β is between 0 and 1, the graph of log p(x) versus x is "stretched", meaning that it spans several orders of magnitude in x. This is the regime where a stretched exponential is difficult to distinguish from a pure power law. The closer β is to 0, the more similar is p(x) to the power law $x^{-1}$.
• If β > 1 we have a "compressed" exponential function, meaning that x varies in a very narrow range.
• For β = 2 (4.33) reduces to the Rayleigh distribution.

As we will see in CHAPTERS 5 and 6, several network models predict a stretched exponential degree distribution.

A lognormal distribution (Galton or Gibrat distribution) emerges if ln x follows a normal distribution. Typically a variable follows a lognormal distribution if it is the product of many independent positive random numbers. We encounter lognormal distributions in finance, representing the compound return from a sequence of trades. The probability density function of a lognormal distribution is
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right].$$   (4.35)

Hence a lognormal is like a normal distribution except that its variable in the exponential term is not x, but ln x. To understand why a lognormal is occasionally used to fit a power-law distribution, we note that

$$\sigma^2 = \langle (\ln x)^2 \rangle - \langle \ln x \rangle^2$$   (4.36)

captures the typical variation of the order of magnitude of x. Therefore now ln x follows a normal distribution, which means that x can vary rather widely. Depending on the value of σ the lognormal distribution may resemble a power law for several orders of magnitude. This is also illustrated in Table 4.2, which shows that ⟨x²⟩ grows exponentially with σ, hence it can be very large.

In summary, in most areas where we encounter fat-tailed distributions, there is an ongoing debate asking which distribution offers the best fit to the data. Frequently encountered candidates include a power law, a stretched exponential, or a lognormal function. In many systems empirical data is not sufficient to distinguish these distributions. Hence as long as there is empirical data to be fitted, the debate surrounding the best fit will never die out. The debate is resolved by accurate mechanistic models, which analytically predict the expected degree distribution. We will see in the coming chapters that in the context of networks the models predict Poisson, simple exponential, stretched exponential, and power-law distributions. The remaining distributions in Table 4.2 are occasionally used to fit the degrees of some networks, despite the fact that we lack a theoretical basis for their relevance for networks.
NAME | px / p(x) | ⟨x⟩ | ⟨x²⟩
Poisson (discrete) | $e^{-\mu} \mu^{x} / x!$ | $\mu$ | $\mu(1+\mu)$
Exponential (discrete) | $(1 - e^{-\lambda}) e^{-\lambda x}$ | $1/(e^{\lambda} - 1)$ | $(e^{\lambda} + 1)/(e^{\lambda} - 1)^2$
Exponential (continuous) | $\lambda e^{-\lambda x}$ | $1/\lambda$ | $2/\lambda^2$
Power law (discrete) | $x^{-\alpha}/\zeta(\alpha)$ | $\zeta(\alpha-1)/\zeta(\alpha)$ if α > 2; ∞ if α ≤ 2 | $\zeta(\alpha-2)/\zeta(\alpha)$ if α > 3; ∞ if α ≤ 3
Power law (continuous) | $(\alpha-1) x_{min}^{\alpha-1} x^{-\alpha}$ | $x_{min}(\alpha-1)/(\alpha-2)$ if α > 2; ∞ if α ≤ 2 | $x_{min}^2(\alpha-1)/(\alpha-3)$ if α > 3; ∞ if α ≤ 3
Power law with cutoff (continuous) | $\frac{\lambda^{1-\alpha}}{\Gamma(1-\alpha)} x^{-\alpha} e^{-\lambda x}$ | $\lambda^{-1} \frac{\Gamma(2-\alpha)}{\Gamma(1-\alpha)}$ | $\lambda^{-2} \frac{\Gamma(3-\alpha)}{\Gamma(1-\alpha)}$
Stretched exponential (continuous) | $\beta \lambda^{\beta} x^{\beta-1} e^{-(\lambda x)^{\beta}}$ | $\lambda^{-1} \Gamma(1 + \beta^{-1})$ | $\lambda^{-2} \Gamma(1 + 2\beta^{-1})$
Lognormal (continuous) | $\frac{1}{x\sqrt{2\pi\sigma^2}} e^{-(\ln x - \mu)^2/(2\sigma^2)}$ | $e^{\mu + \sigma^2/2}$ | $e^{2(\mu + \sigma^2)}$
Normal (continuous) | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x - \mu)^2/(2\sigma^2)}$ | $\mu$ | $\mu^2 + \sigma^2$

Table 4.2 Distributions in Network Science
The table lists frequently encountered distributions in network science. For each distribution we show the density function px, including the appropriate normalization constant C such that

$$\int_{x_{min}}^{\infty} C f(x)\, dx = 1$$

for the continuous case or

$$\sum_{x = x_{min}}^{\infty} C f(x) = 1$$

for the discrete case. Given that ⟨x⟩ and ⟨x²⟩ play an important role in network theory, we show the analytical form of these two quantities for each distribution. As some of these distributions diverge at x = 0, for most of them ⟨x⟩ and ⟨x²⟩ are calculated assuming that there is a small cutoff xmin in the system. In networks xmin often corresponds to the smallest degree, kmin, or the smallest degree for which the appropriate distribution offers a good fit.
Figure 4.21 Distributions Visualized
[Seven panels, each showing pk versus k on a lin-lin plot and on a log-log plot: (a) Poisson, (b) Exponential, (c) Power Law, (d) Power Law with Exponential Cutoff, (e) Stretched Exponential, (f) Log-normal, (g) Gaussian.]
Linear and log-log plots for the most frequently encountered distributions in network science. For definitions see Table 4.2.
SECTION 4.12
ADVANCED TOPICS 4.B PLOTTING POWERLAWS
Plotting the degree distribution is an integral part of analyzing the properties of a network. The process starts with obtaining Nk, the number of nodes with degree k. This can be provided by direct measurement or by a model. From Nk we calculate pk = Nk/N. The question is how to plot pk to best extract its properties.

Use a Log-Log Plot
In a scale-free network numerous nodes with one or two links coexist with a few hubs, representing nodes with thousands or even millions of links. Using a linear k-axis compresses the numerous small degree nodes in the small-k region, rendering them invisible. Similarly, as there can be orders of magnitude differences in pk for k = 1 and for large k, if we plot pk on a linear vertical axis, its value for large k will appear to be zero (Figure 4.22a). The use of a log-log plot avoids these problems. We can either use logarithmic axes, with powers of 10 (used throughout this book, Figure 4.22b) or we can plot log pk as a function of log k (equally correct, but slightly harder to read). Note that points with pk = 0 or k = 0 are not shown on a log-log plot as log 0 = −∞.

Avoid Linear Binning
The most flawed method (yet frequently seen in the literature) is to simply plot pk = Nk/N on a log-log plot (Figure 4.22b). This is called linear binning, as each bin has the same size Δk = 1. For a scale-free network linear binning results in an instantly recognizable plateau at large k, consisting of numerous data points that form a horizontal line (Figure 4.22b). This plateau has a simple explanation: Typically we have only one copy of each high degree node, hence in the high-k region we either have Nk = 0 (no node with degree k) or Nk = 1 (a single node with degree k). Consequently linear binning will either provide pk = 0, not shown on a log-log plot, or pk = 1/N, which applies to all hubs, generating a plateau at pk = 1/N.
This plateau affects our ability to estimate the degree exponent γ. For example, if we attempt to fit a power law to the data shown in Figure 4.22b using linear binning, the obtained γ is quite different from the real value γ = 2.5. The reason is that under linear binning we have a large number of nodes in small k bins, allowing us to confidently fit pk in this regime. In the large-k bins we have too few nodes for a proper statistical estimate of pk. Instead the emerging plateau biases our fit. Yet, it is precisely this high-k regime that plays a key role in determining γ. Increasing the bin size will not solve this problem. It is therefore recommended to avoid linear binning for fat tailed distributions.

Figure 4.22 Plotting a Degree Distribution
[Four panels showing pk (or Pk) versus k: (a) LINEAR SCALE, (b) LINEAR BINNING, (c) LOG-BINNING, (d) CUMULATIVE.]
A degree distribution of the form $p_k \sim (k + k_0)^{-\gamma}$, with k0 = 10 and γ = 2.5, plotted using the four procedures described in the text:
(a) Linear Scale, Linear Binning. It is impossible to see the distribution on a lin-lin scale. This is the reason why we always use a log-log plot for scale-free networks.
(b) Log-Log Scale, Linear Binning. Now the tail of the distribution is visible but there is a plateau in the high-k regime, a consequence of linear binning.
(c) Log-Log Scale, Log-Binning. With log-binning the plateau disappears and the scaling extends into the high-k regime. For reference we show as light grey the data of (b) with linear binning.
(d) Log-Log Scale, Cumulative. The cumulative degree distribution shown on a log-log plot.

Use Logarithmic Binning
Logarithmic binning corrects the non-uniform sampling of linear binning. For log-binning we let the bin sizes increase with the degree, making sure that each bin has a comparable number of nodes. For example, we can choose the bin sizes to be multiples of 2, so that the first bin has size b0 = 1, containing all nodes with k = 1; the second has size b1 = 2, containing nodes with degrees k = 2, 3; the third bin has size b2 = 4, containing nodes with degrees k = 4, 5, 6, 7. By induction the nth bin has size $2^{n-1}$ and contains all nodes with degrees $k = 2^{n-1}, 2^{n-1}+1, ..., 2^{n}-1$. Note that the bin size can increase with arbitrary increments, $b_n = c^n$, where c > 1. The degree distribution is given by $p_{\langle k_n \rangle} = N_n / b_n$, where Nn is the number of nodes found in the bin n of size bn and ⟨kn⟩ is the average degree of the nodes in bin bn.

The logarithmically binned pk is shown in Figure 4.22c. Note that now the scaling extends into the high-k region, where linear binning produced only a plateau. Therefore logarithmic binning extracts useful information from the rare high degree nodes as well (BOX 4.10).

Use Cumulative Distribution
Another way to extract information from the tail of pk is to plot the complementary cumulative distribution

$$P_k = \sum_{q=k+1}^{\infty} p_q,$$   (4.37)

which again enhances the statistical significance of the high-degree region. If pk follows the power law (4.1), then the cumulative distribution scales as

$$P_k \sim k^{-\gamma + 1}.$$   (4.38)

The cumulative distribution again eliminates the plateau observed for linear binning and leads to an extended scaling region (Figure 4.22d), allowing for a more accurate estimate of the degree exponent.

In summary, plotting the degree distribution to extract its features requires special attention. Mastering the appropriate tools can help us better explore the properties of real networks (BOX 4.10).
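Logarithmic binning and the complementary cumulative distribution (4.37) can be sketched as follows. The function names are hypothetical. Two choices here are assumptions rather than prescriptions of the text: bins are indexed from n = 0, so bin n covers the degrees from base^n up to base^(n+1) − 1, and the count in each bin is divided by N as well as by the bin width, so that the result is directly comparable to pk = Nk/N.

```python
from collections import Counter


def log_binned_distribution(degrees, base=2):
    """Return [(mean degree in bin, N_n / (N * b_n)), ...] for bins
    [base**n, base**(n+1)); base is assumed to be an integer >= 2."""
    n_nodes = len(degrees)
    bins = {}
    for k in degrees:
        if k <= 0:
            continue                     # k = 0 cannot be shown on a log axis
        n = 0
        while base ** (n + 1) <= k:
            n += 1
        bins.setdefault(n, []).append(k)
    result = []
    for n in sorted(bins):
        members = bins[n]
        width = base ** (n + 1) - base ** n   # number of integer degrees in the bin
        mean_k = sum(members) / len(members)
        result.append((mean_k, len(members) / (n_nodes * width)))
    return result


def complementary_cumulative(degrees):
    """Return {k: P_k} with P_k = sum_{q > k} p_q, as in (4.37)."""
    n_nodes = len(degrees)
    counts = Counter(degrees)
    result = {}
    above = 0                            # nodes with degree strictly larger than k
    for k in sorted(counts, reverse=True):
        result[k] = above / n_nodes
        above += counts[k]
    return dict(sorted(result.items()))
```

Plotting either output on logarithmic axes avoids the plateau at pk = 1/N; note that the largest observed degree has Pk = 0 and therefore drops off a log-log plot.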
BOX 4.10 DEGREE DISTRIBUTION OF REAL NETWORKS
In real systems we rarely observe a degree distribution that follows a pure power law. Instead, for most real systems pk has the shape shown in Figure 4.23a, with some recurring features:
• Low-degree saturation is a common deviation from the power-law behavior. Its signature is a flattened pk for k < ksat. This indicates that we have fewer small degree nodes than expected for a pure power law. The origin of the saturation will be explained in CHAPTER 6.
• High-degree cutoff appears as a rapid drop in pk for k > kcut, indicating that we have fewer high-degree nodes than expected in a pure power law. This limits the size of the largest hub, making it smaller than predicted by (4.18). High-degree cutoffs emerge if there are inherent limitations in the number of links a node can have. For example, in social networks individuals have difficulty maintaining meaningful relationships with an exceptionally large number of acquaintances.

Given the widespread presence of such cutoffs the degree distribution is occasionally fitted to

$$p_k = a (k + k_{sat})^{-\gamma} \exp\left(-\frac{k}{k_{cut}}\right),$$   (4.39)

where ksat accounts for degree saturation, and the exponential term accounts for the high-k cutoff. To extract the full extent of the scaling we plot

$$\tilde{p}_k = p_k \exp\left(\frac{k}{k_{cut}}\right)$$   (4.40)

as a function of $\tilde{k} = k + k_{sat}$. According to (4.40) $\tilde{p} \sim \tilde{k}^{-\gamma}$, correcting for the two cutoffs, as seen in Figure 4.23b.

It is occasionally claimed that the presence of low-degree or high-degree cutoffs implies that the network is not scale-free. This is a misunderstanding of the scale-free property: Virtually all properties of scale-free networks are insensitive to the low-degree saturation. Only the high-degree cutoff affects the system's properties by limiting the divergence of the second moment, ⟨k²⟩. The presence of such cutoffs indicates the presence of additional phenomena that need to be understood.

Figure 4.23 Rescaling the Degree Distribution
[Two log-log panels: (a) pk versus k, with the LOW DEGREE SATURATION (ksat) and HIGH DEGREE CUTOFF (kcut) regions marked; (b) the rescaled $\tilde{p}_k$ versus k + ksat.]
(a) In real networks the degree distribution frequently deviates from a pure power law by showing a low degree saturation and high degree cutoff.
(b) By plotting the rescaled $\tilde{p}_k$ as a function of (k + ksat), as suggested by (4.40), the degree distribution follows a power law for all degrees.
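The rescaling suggested by (4.39) and (4.40) amounts to a one-line transformation of the measured distribution. The sketch below assumes that ksat and kcut are already known (for instance from a fit) and that pk is stored as a dictionary mapping each degree to its probability; the function name is hypothetical.

```python
import math


def rescale_degree_distribution(pk, k_sat, k_cut):
    """Map each (k, p_k) to (k + k_sat, p_k * exp(k / k_cut)), following (4.40).

    pk is a dict {degree: probability}; entries with p_k = 0 are skipped
    because they cannot be shown on a log-log plot.
    """
    return [(k + k_sat, p * math.exp(k / k_cut))
            for k, p in sorted(pk.items()) if p > 0]
```

If the data follow (4.39), the returned points fall on a straight line of slope −γ when plotted on logarithmic axes.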
SECTION 4.13
ADVANCED TOPICS 4.C ESTIMATING THE DEGREE EXPONENT
As the properties of scale-free networks depend on the degree exponent (SECTION 4.7), we need to determine the value of γ. We face several difficulties, however, when we try to fit a power law to real data. The most important is the fact that the scaling is rarely valid for the full range of the degree distribution. Rather we observe small and high degree cutoffs (BOX 4.10), denoted in this section with Kmin and Kmax, within which we have a clear scaling region. Note that Kmin and Kmax are different from kmin and kmax, the latter corresponding to the smallest and largest degrees in a network. They can be the same as ksat and kcut discussed in BOX 4.10. Here we focus on estimating the small degree cutoff Kmin, as the high degree cutoff can be determined in a similar fashion. The reader is advised to consult the discussion on systematic problems provided at the end of this section before implementing this procedure.

Online Resource 4.2 Fitting power-law
The algorithmic tools to perform the fitting procedure described in this section are available at http://tuvalu.santafe.edu/~aaronc/powerlaws/.

Fitting Procedure
As the degree distribution is typically provided as a list of positive integers kmin, ..., kmax, we aim to estimate γ from a discrete set of data points [47]. We use the citation network to illustrate the procedure. The network consists of N = 384,362 nodes, each node representing a research paper published between 1890 and 2009 in journals published by the American Physical Society. The network has L = 2,353,984 links, each representing a citation from a published research paper to some other publication in the dataset (outside citations are ignored). Note that this is not the citation dataset listed in Table 4.1. See [48] for an overall characterization of this data. The steps of the fitting process are [47]:

1. Choose a value of Kmin between kmin and kmax. Estimate the value of the degree exponent corresponding to this Kmin using

$$\gamma = 1 + N \left[ \sum_{i=1}^{N} \ln \frac{k_i}{K_{min} - \frac{1}{2}} \right]^{-1}.$$   (4.41)
2. With the obtained (γ, Kmin) parameter pair assume that the degree distribution has the form

$$p_k = \frac{1}{\zeta(\gamma, K_{min})} k^{-\gamma},$$   (4.42)

hence the associated cumulative distribution function (CDF) is

$$P_k = 1 - \frac{\zeta(\gamma, k)}{\zeta(\gamma, K_{min})}.$$   (4.43)

3. Use the Kolmogorov-Smirnov test to determine the maximum distance D between the CDF of the data S(k) and the fitted model provided by (4.43) with the selected (γ, Kmin) parameter pair,

$$D = \max_{k \geq K_{min}} \left| S(k) - P_k \right|.$$   (4.44)

Equation (4.44) identifies the degree for which the difference D between the empirical distribution S(k) and the fitted distribution (4.43) is the largest.

4. Repeat steps (1-3) by scanning the whole Kmin range from kmin to kmax. We aim to identify the Kmin value for which D provided by (4.44) is minimal. To illustrate the procedure, we plot D as a function of Kmin for the citation network (Figure 4.24b). The plot indicates that D is minimal for Kmin = 49, and the corresponding γ estimated by (4.41), representing the optimal fit, is γ = 2.79. The standard error for the obtained degree exponent is

$$\sigma_{\gamma} = \frac{1}{\sqrt{N \left[ \frac{\zeta''(\gamma, K_{min})}{\zeta(\gamma, K_{min})} - \left( \frac{\zeta'(\gamma, K_{min})}{\zeta(\gamma, K_{min})} \right)^2 \right]}},$$   (4.45)

which implies that the best fit is γ ± σγ. For the citation network we obtain σγ = 0.003, hence γ = 2.79(3).
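Steps 1 to 4 above can be sketched in plain Python. This is a hedged illustration, not the code distributed through Online Resource 4.2. It rests on several assumptions: the sum in (4.41) and the count N are restricted to nodes with ki ≥ Kmin; S(k) is taken as the fraction of those nodes with degree smaller than k, to match the convention of (4.43); and the Hurwitz zeta function ζ(γ, K) is approximated by a truncated sum with an Euler-Maclaurin tail correction. All function names are hypothetical.

```python
import math


def hurwitz_zeta(s, q, terms=1000):
    """Approximate zeta(s, q) = sum_{n>=0} (n + q)**(-s) for s > 1, q > 0."""
    head = sum((q + n) ** -s for n in range(terms))
    a = q + terms
    tail = a ** (1 - s) / (s - 1) + 0.5 * a ** -s + s * a ** (-s - 1) / 12
    return head + tail


def mle_exponent(degrees, k_min):
    """Estimate gamma with (4.41), using only degrees >= k_min."""
    tail = [k for k in degrees if k >= k_min]
    return 1 + len(tail) / sum(math.log(k / (k_min - 0.5)) for k in tail)


def ks_distance(degrees, k_min, gamma):
    """Maximum distance (4.44) between the data and the model CDF (4.43)."""
    tail = sorted(k for k in degrees if k >= k_min)
    n = len(tail)
    z_min = hurwitz_zeta(gamma, k_min)
    z_k = z_min                          # zeta(gamma, k), updated as k grows
    idx = 0                              # number of data points with degree < k
    d = 0.0
    for k in range(k_min, tail[-1] + 2):
        while idx < n and tail[idx] < k:
            idx += 1
        d = max(d, abs(idx / n - (1 - z_k / z_min)))
        z_k -= k ** -gamma               # zeta(gamma, k + 1) = zeta(gamma, k) - k**(-gamma)
    return d


def scan_k_min(degrees, candidates):
    """Return (D, K_min, gamma) for the candidate K_min with the smallest D."""
    best = None
    for k_min in candidates:
        if sum(1 for k in degrees if k >= k_min) < 2:
            continue
        gamma = mle_exponent(degrees, k_min)
        d = ks_distance(degrees, k_min, gamma)
        if best is None or d < best[0]:
            best = (d, k_min, gamma)
    return best
```

A call such as `scan_k_min(degrees, range(min(degrees), max(degrees)))` reproduces the scan of step 4; on large networks the candidate range is usually thinned out, since each candidate requires a pass over the tail of the data.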
Note that in order to estimate γ datasets smaller than N=50 should be 100
treated with caution. Goodnessoffit
0
Just because we obtained a (γ, Kmin) pair that represents an optimal fit to our dataset, does not mean that the power law itself is a good model for the studied distribution.We therefore need to use a goodnessoffit test,
0.000
D
0.010
Figure 4.24 Maximum Likelihood Estimation
(a) The degree distribution pk of the citation network, where the straight purple line represents the best fit based on the model (4.39).
which generates a pvalue that quantifies the plausibility of the power law hypothesis. The most often used procedure consists of the following steps: 1. Use the cumulative distribution (4.43) to estimate the KS distance be
(b) The values of KormogorovSmirnov test vs. Kmin for the citation network.
tween the real data and the best fit, that we denote by Dreal. This is step 3 above, taking the value of D for Kmin that offered the best fit
(c) p(Dsynthetic) for M=10,000 synthetic datasets, where the grey line corresponds to the Dreal value extracted for the citation network.
to the data. For the citation data we obtain Dreal = 0.01158 for Kmin= 49 (Figure 4.24c).
THE SCALEFREE PROPERTY
0.005
49
ESTIMATING THE DEGREE EXPONENT
2. Use (4.42) to generate a degree sequence of N degrees (i.e. the same number of random numbers as the number of nodes in the original dataset) and substitute the obtained degree sequence for the empirical data, determining Dsynthetic for this hypothetical degree sequence. Hence Dsynthetic represents the distance between a synthetically generated degree sequence, consistent with our degree distribution, and the real data.

3. The goal is to see if the obtained Dsynthetic is comparable to Dreal. For this we repeat step (2) M times (M ≫ 1), and each time we generate a new degree sequence and determine the corresponding Dsynthetic, eventually obtaining the p(Dsynthetic) distribution. Plot p(Dsynthetic) and show Dreal as a vertical bar (Figure 4.24c). If Dreal is within the p(Dsynthetic) distribution, it means that the distance between the model providing the best fit and the empirical data is comparable with the distance expected from random degree samples chosen from the best fit distribution. Hence the power law is a reasonable model for the data. If, however, Dreal falls outside the p(Dsynthetic) distribution, then the power law is not a good model; some other function is expected to describe the original pk better.

While the distribution shown in Figure 4.24c may in some cases be useful to illustrate the statistical significance of the fit, in general it is better to assign a p-value to the fit, given by

$$p = \int_{D}^{\infty} P(D_{\text{synthetic}})\, dD_{\text{synthetic}}. \qquad (4.46)$$
The closer p is to 1, the more likely that the difference between the empirical data and the model can be attributed to statistical fluctuations alone. If p is very small, the model is not a plausible fit to the data. Typically, the model is accepted if p > 1%. For the citation network we obtain p < 10^-4, indicating that a pure power law is not a suitable model for the original degree distribution. This outcome is somewhat surprising, as the power-law nature of citation data has been documented repeatedly since the 1960s [7, 8]. This failure indicates the limitation of blind fitting to a power law, without an analytical understanding of the underlying distribution.

Fitting Real Distributions

To correct the problem, we note that the fitting model (4.44) eliminates all the data points with k < Kmin. As the citation network is fat tailed, choosing Kmin = 49 forces us to discard over 96% of the data points. Yet, there is statistically useful information in the k < Kmin regime that is ignored by the previous fit. We must introduce an alternate model that resolves this problem.

As we discussed in BOX 4.10, the degree distribution of many real networks, like the citation network, does not follow a pure power law. It often has low-degree saturations and high-degree cutoffs, described by the form

$$p_k = \frac{(k + k_{\text{sat}})^{-\gamma}\, e^{-k/k_{\text{cut}}}}{\sum_{k'=1}^{\infty} (k' + k_{\text{sat}})^{-\gamma}\, e^{-k'/k_{\text{cut}}}}, \qquad (4.47)$$

and the associated CDF is

$$P_k = \frac{\sum_{k'=1}^{k} (k' + k_{\text{sat}})^{-\gamma}\, e^{-k'/k_{\text{cut}}}}{\sum_{k'=1}^{\infty} (k' + k_{\text{sat}})^{-\gamma}\, e^{-k'/k_{\text{cut}}}}, \qquad (4.48)$$
where ksat and kcut correspond to the low-k saturation and the large-k cutoff, respectively. The difference between our earlier procedure and (4.47) is that we now do not discard the points that deviate from a pure power law, but instead use a function that offers a better fit to the whole degree distribution, from kmin to kmax.

Our goal is to find the fitting parameters ksat, kcut, and γ of the model (4.47), which we achieve through the following steps (Figure 4.25):

1. Pick a value for ksat and kcut between Kmin and Kmax. Estimate the value of the degree exponent γ using the steepest descent method that maximizes the log-likelihood function

$$\log \mathcal{L}(\gamma \mid k_{\text{sat}}, k_{\text{cut}}) = \sum_{i=1}^{N} \log p(k_i \mid \gamma, k_{\text{sat}}, k_{\text{cut}}). \qquad (4.49)$$

That is, for fixed (ksat, kcut) we vary γ until we find the maximum of (4.49).

2. With the obtained γ(ksat, kcut) assume that the degree distribution has the form (4.47). Calculate the Kolmogorov-Smirnov parameter D between the cumulative degree distribution (CDF) of the original data and the fitted model provided by (4.47).

3. Change ksat and kcut, and repeat steps (1-2), scanning ksat from kmin = 0 to kmax and kcut from kmin = k0 to kmax. The goal is to identify the ksat and kcut values for which D is minimal. We illustrate this by plotting D as a function of ksat for several kcut values in Figure 4.25a for our citation network. The (ksat, kcut) for which D is minimal, and the corresponding γ, provided by (4.41), represent the optimal parameters of the fit. For our dataset the optimal fit is obtained for ksat = 12 and kcut = 5,691, providing the degree exponent γ = 3.028. We find that now D for the real data is within the generated p(D) distribution (Figure 4.25c), and the associated p-value is 69%.

Figure 4.25 Estimating the Scaling Parameters for Citation Networks
(a) The Kolmogorov-Smirnov parameter D vs. ksat for kcut = 3,000, 6,000, 9,000, respectively. The curve indicates that ksat = 12 corresponds to the minimal D. Inset: D vs. kcut for ksat = 12, indicating that kcut = 5,691 minimizes D.
(b) Degree distribution pk, where the line represents the best estimate from (a). Now the fit accurately captures the whole curve, not only its tail, as it did in Figure 4.24a.
(c) p(Dsynthetic) for M = 10,000 synthetic datasets. The grey line corresponds to the Dreal value from the citation network; the associated p-value is p = 0.69.

Systematic Fitting Issues

The procedure described above may offer the impression that determining the degree exponent is a cumbersome but straightforward process. In reality these fitting methods have some well known limitations:

1. A pure power law is an idealized distribution that emerges in its
form (4.1) only in simple models (CHAPTER 5). In reality, a whole range of processes contribute to the topology of real networks, affecting the precise shape of the degree distribution. These processes will be discussed in CHAPTER 6. If pk does not follow a pure power law, the methods described above, designed to fit a power law to the data, will inevitably fail to detect statistical significance. While this finding can mean that the network is not scale-free, it most often means that we have not yet gained a proper understanding of the precise form of the degree distribution. Hence we are fitting the wrong functional form of pk to the dataset.

2. The statistical tools used above to test the goodness-of-fit rely on the Kolmogorov-Smirnov criterion, which measures the maximum distance between the fitted model and the dataset. If almost all data points follow a perfect power law, but a single point for some reason deviates from the curve, we will lose the fit's statistical significance. In real systems there are numerous reasons for such local deviations that have little impact on the system's overall behavior. Yet, removing these "outliers" could be seen as data manipulation; if kept, however, one cannot detect the statistical significance of the power law fit.

A good example is provided by the actor network, whose degree distribution follows a power law for most degrees. There is, however, a prominent outlier at k = 1,287, thanks to the 1956 movie Around the World in Eighty Days. This is the only movie where imdb.com, the source of the actor network, lists all the normally uncredited extras in the cast. Hence the movie appears to have 1,288 actors. The second largest movie in the dataset has only 340 actors. Since each extra has links only to the 1,287 extras that played in the same movie, we have a local peak in pk at k = 1,287. Thanks to this peak, the degree distribution, fitted to a power law, fails to pass the Kolmogorov-Smirnov criterion. Indeed, as indicated in Table 4.4, neither the pure power law fit, nor a power law with high-degree cutoff offers a statistically significant fit. Yet, ultimately this single point does not alter the power law nature of the degree distribution.

4. As a result of the issues discussed above, the methodology described to fit a power law distribution often predicts a small scaling regime, forcing us to remove a huge fraction of the nodes (often as many as 99%, see Table 4.4) to obtain a statistically significant fit. Once plotted next to the original dataset, the obtained fit can at times be ridiculous, even if the method predicts statistical significance.

Table 4.3 Exponential Fitting

              λ       kmin   p-value   Percentage
Power Grid    0.517   4      0.91      12%

For the power grid a power law degree distribution does not offer a statistically significant fit. Indeed, we will encounter ample evidence that the underlying network is not scale-free. We used the fitting procedure described in this section to fit the exponential function e^(-λk) to the degree distribution of the power grid, obtaining a statistically significant fit. The table shows the obtained λ parameter, the kmin over which the fit is valid, the obtained p-value, and the percentage of data points included in the fit.

In summary, estimating the degree exponent is not yet an exact science. We continue to lack methods that would estimate the statistical significance in a manner that would be acceptable to a practitioner. The blind application of the tools described above often leads either to fits that obviously do not capture the trends in the data, or to a false rejection of the power-law hypothesis. An important improvement is our ability to derive the expected form of the degree distribution, a problem discussed in CHAPTER 6.
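The pure power-law fit described in this section (maximum-likelihood estimate of γ for a given Kmin, the Kolmogorov-Smirnov distance (4.44), and the scan over Kmin) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce the results above: the function names are hypothetical, the Hurwitz zeta function is approximated with a truncated sum plus an Euler-Maclaurin tail, the log-likelihood is taken to be that of the discrete power law, log L = -N log ζ(γ, Kmin) - γ Σ log ki, and the synthetic sampler truncates the distribution at an arbitrary k_max.

```python
import math
import random


def hurwitz_zeta(gamma, q, terms=60):
    """Approximate zeta(gamma, q) = sum_{n>=0} (n + q)**-gamma for gamma > 1.

    The first `terms` terms are summed directly; the remainder is estimated
    with the leading Euler-Maclaurin correction terms.
    """
    head = sum((q + n) ** -gamma for n in range(terms))
    a = q + terms
    tail = (a ** (1 - gamma) / (gamma - 1) + 0.5 * a ** -gamma
            + gamma * a ** (-gamma - 1) / 12)
    return head + tail


def fit_gamma(degrees, k_min, lo=1.05, hi=6.0, tol=1e-4):
    """Maximum-likelihood gamma for the degrees k >= k_min (assumed form)."""
    tail = [k for k in degrees if k >= k_min]
    n, log_sum = len(tail), sum(math.log(k) for k in tail)

    def loglik(g):
        return -n * math.log(hurwitz_zeta(g, k_min)) - g * log_sum

    # Golden-section search; this log-likelihood is concave in gamma.
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if loglik(c) > loglik(d):
            b = d
        else:
            a = c
    return (a + b) / 2


def ks_distance(degrees, gamma, k_min):
    """D = max_k |S(k) - P_k| with P_k = 1 - zeta(gamma, k) / zeta(gamma, k_min).

    P_k is the model probability of a degree below k, so S(k) is taken as the
    fraction of observed tail degrees strictly below k.
    """
    tail = sorted(k for k in degrees if k >= k_min)
    n, z0 = len(tail), hurwitz_zeta(gamma, k_min)
    # S(k) only changes right after an observed degree, so checking each
    # observed value k and k + 1 captures the maximum.
    points = sorted(set(tail) | {k + 1 for k in tail})
    d_max, below = 0.0, 0
    for k in points:
        while below < n and tail[below] < k:
            below += 1
        p_k = 1.0 - hurwitz_zeta(gamma, k) / z0
        d_max = max(d_max, abs(below / n - p_k))
    return d_max


def scan_k_min(degrees, candidates, min_tail=50):
    """Return (D, k_min, gamma) for the candidate k_min with the smallest D."""
    best = None
    for k_min in candidates:
        if sum(1 for k in degrees if k >= k_min) < min_tail:
            continue  # too few points left for a meaningful estimate
        gamma = fit_gamma(degrees, k_min)
        d = ks_distance(degrees, gamma, k_min)
        if best is None or d < best[0]:
            best = (d, k_min, gamma)
    return best


def sample_power_law(n, gamma, k_min=1, k_max=10_000, seed=0):
    """Draw n degrees with p_k proportional to k**-gamma on [k_min, k_max]."""
    rng = random.Random(seed)
    support = range(k_min, k_max + 1)
    weights = [k ** -gamma for k in support]
    return rng.choices(support, weights=weights, k=n)
```

For example, `scan_k_min(sample_power_law(5000, 2.5), range(1, 6))` returns the smallest distance D together with the Kmin and γ that produced it. The golden-section search stands in for whatever optimizer one prefers; any one-dimensional maximizer would serve.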
Table 4.4 Fitting Parameters for Real Networks

                              Pure power law, k^-γ on [Kmin, ∞)     (k + ksat)^-γ e^(-k/kcut)
NETWORK                       γ      Kmin   P-VALUE   PERCENT       γ      ksat   kcut   P-VALUE
INTERNET                      3.42   72     0.13      0.6%          3.55   8      8500   0.00
WWW (IN)                      2.00   1      0.00      100%          1.97   0      660    0.00
WWW (OUT)                     2.31   7      0.00      15%           2.82   8      8500   0.00
POWER GRID                    4.00   5      0.00      12%           8.56   19     14     0.00
MOBILE PHONE CALLS (IN)       4.69   9      0.34      2.6%          6.95   15     10     0.00
MOBILE PHONE CALLS (OUT)      5.01   11     0.77      1.7%          7.23   15     10     0.00
EMAIL-PRE (IN)                3.43   88     0.11      0.2%          2.27   0      8500   0.00
EMAIL-PRE (OUT)               2.03   3      0.00      1.2%          2.55   0      8500   0.00
SCIENCE COLLABORATION         3.35   25     0.0001    5.4%          1.50   17     12     0.00
ACTOR NETWORK                 2.12   54     0.00      33%           -      -      -      0.00
CITATION NETWORK (IN)         2.79   51     0.00      3.0%          3.03   12     5691   0.69
CITATION NETWORK (OUT)        4.00   19     0.00      14%           0.16   5      10     0.00
E. COLI METABOLISM (IN)       2.43   3      0.00      57%           3.85   19     12     0.00
E. COLI METABOLISM (OUT)      2.90   5      0.00      34%           2.56   15     10     0.00
YEAST PROTEIN INTERACTIONS    2.89   7      0.67      8.3%          2.95   2      90     0.52

The estimated degree exponents and the appropriate fit parameters for the reference networks studied in this book. We implement two fitting strategies: the first aims to fit a pure power law in the region (Kmin, ∞), and the second fits a power law with saturation and exponential cutoff to the whole dataset. For the first strategy the table shows the obtained γ exponent and Kmin for the fit with the best statistical significance, the p-value for the best fit and the percentage of the data included in the fit. For the second strategy we again show the exponent γ, the two fit parameters, ksat and kcut, and the p-value of the obtained fit. Note that p > 0.01 is considered to be statistically significant.
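The p-values reported in Table 4.4 come from the goodness-of-fit procedure of this section: generate M synthetic degree sequences from the fitted model, measure the KS distance of each, and take the fraction that is at least as large as Dreal, which is a Monte Carlo estimate of the integral (4.46). The sketch below illustrates the idea under simplifying assumptions that are not part of the original text: the fitted power law is truncated at a hypothetical k_max so that the sampler and the model CDF need only the standard library, both CDFs use the "at most k" convention, and each synthetic sample is compared with the fitted model without re-fitting.

```python
import random
from itertools import accumulate


def model_cdf(gamma, k_min, k_max):
    """cdf[j] = P(K <= k_min + j) for p_k proportional to k**-gamma on [k_min, k_max]."""
    weights = [k ** -gamma for k in range(k_min, k_max + 1)]
    total = sum(weights)
    return [c / total for c in accumulate(weights)]


def ks_distance(sample, cdf, k_min):
    """Largest gap between the empirical CDF of `sample` and the model CDF."""
    n = len(sample)
    counts = [0] * len(cdf)
    for k in sample:
        counts[k - k_min] += 1
    d_max, seen = 0.0, 0
    for j, c in enumerate(counts):
        seen += c
        d_max = max(d_max, abs(seen / n - cdf[j]))
    return d_max


def p_value(d_real, gamma, k_min, k_max, n, m=1000, seed=0):
    """Fraction of m synthetic samples of size n whose KS distance is >= d_real."""
    rng = random.Random(seed)
    cdf = model_cdf(gamma, k_min, k_max)
    support = range(k_min, k_max + 1)
    exceed = 0
    for _ in range(m):
        synthetic = rng.choices(support, cum_weights=cdf, k=n)
        if ks_distance(synthetic, cdf, k_min) >= d_real:
            exceed += 1
    return exceed / m
```

With this convention a p-value close to 1 means that most synthetic samples deviate from the model at least as much as the real data do, in line with the acceptance rule p > 0.01 used in the table.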
SECTION 4.14
BIBLIOGRAPHY
[1] H. Jeong, R. Albert, and A.-L. Barabási. Internet: Diameter of the world-wide web. Nature, 401:130-131, 1999.
[2] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509-512, 1999.
[3] V. Pareto. Cours d'Économie Politique: Nouvelle édition par G. H. Bousquet et G. Busino, Librairie Droz, Geneva, 299-345, 1964.
[4] A.-L. Barabási. Linked: The New Science of Networks. Plume, New York, 2002.
[5] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. Proceedings of SIGCOMM. Comput. Commun. Rev. 29:251-262, 1999.
[6] R. Pastor-Satorras and A. Vespignani. Evolution and Structure of the Internet: A Statistical Physics Approach. Cambridge University Press, Cambridge, 2004.
[7] D. J. De Solla Price. Networks of Scientific Papers. Science 149:510-515, 1965.
[8] S. Redner. How Popular is Your Paper? An Empirical Study of the Citation Distribution. Eur. Phys. J. B 4:131, 1998.
[9] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting Large-Scale Knowledge Bases from the Web. Proceedings of the 25th VLDB Conference, Edinburgh, Scotland, pp. 639-650, 1999.
[10] A.-L. Barabási, R. Albert, and H. Jeong. Mean-field theory of scale-free random networks. Physica A 272:173-187, 1999.
[11] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature 407:651-654, 2000.
[12] A. Wagner and D.A. Fell. The small world inside large metabolic networks. Proc. R. Soc. Lond. B 268:1803-1810, 2001.
[13] W. Aiello, F. Chung, and L.A. Lu. Random graph model for massive graphs. Proc. 32nd ACM Symp. Theor. Comp., 2000.
[14] H. Jeong, B. Tombor, S. P. Mason, A.-L. Barabási, and Z.N. Oltvai. Lethality and centrality in protein networks. Nature 411:41-42, 2001.
[15] A. Wagner. How the global structure of protein interaction networks evolves. Proc. R. Soc. Lond. B 270:457-466, 2003.
[16] M. E. J. Newman. The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. 98:404-409, 2001.
[17] A.-L. Barabási, H. Jeong, E. Ravasz, Z. Néda, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A 311:590-614, 2002.
[18] F. Liljeros, C.R. Edling, L.A.N. Amaral, H.E. Stanley, and Y. Aberg. The Web of Human Sexual Contacts. Nature 411:907-908, 2001.
[19] R. Ferrer i Cancho and R.V. Solé. The small world of human language. Proc. R. Soc. Lond. B 268:2261-2265, 2001.
[20] R. Ferrer i Cancho, C. Janssen, and R.V. Solé. Topology of technology graphs: Small world patterns in electronic circuits. Phys. Rev. E 64:046119, 2001.
[21] S. Valverde and R.V. Solé. Hierarchical Small Worlds in Software Architecture. arXiv:cond-mat/0307278, 2003.
[22] H. Ebel, L.I. Mielsch, and S. Bornholdt. Scale-free topology of e-mail networks. Phys. Rev. E 66:035103(R), 2002.
[23] J.P.K. Doye. Network Topology of a Potential Energy Landscape: A Static Scale-Free Network. Phys. Rev. Lett. 88:238701, 2002.
[24] J.-P. Onnela, J. Saramaki, J. Hyvonen, G. Szabó, D. Lazer, K. Kaski, J. Kertesz, and A.-L. Barabási. Structure and tie strengths in mobile communication networks. Proceedings of the National Academy of Sciences 104:7332-7336, 2007.
[25] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? Proceedings of the 19th International Conference on World Wide Web, 591-600, 2010.
[26] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi. Measuring user influence in Twitter: The million follower fallacy. Proceedings of International AAAI Conference on Weblogs and Social Media, 2010.
[27] J. Ugander, B. Karrer, L. Backstrom, and C. Marlow. The Anatomy of the Facebook Social Graph. arXiv:1111.4503, 2011.
[28] L.A.N. Amaral, A. Scala, M. Barthelemy, and H.E. Stanley. Classes of small-world networks. Proceedings of the National Academy of Sciences U.S.A. 97:11149-11152, 2000.
[29] R. Cohen and S. Havlin. Scale-free networks are ultrasmall. Phys. Rev. Lett. 90:058701, 2003.
[30] B. Bollobás and O. Riordan. The Diameter of a Scale-Free Random Graph. Combinatorica 24:5-34, 2004.
[31] R. Cohen and S. Havlin. Complex Networks: Structure, Robustness and Function. Cambridge University Press, Cambridge, 2010.
[32] K.-I. Goh, B. Kahng, and D. Kim. Universal behavior of load distribution in scale-free networks. Phys. Rev. Lett. 87:278701, 2001.
[33] F. Karinthy. Láncszemek, in Minden másképpen van. Budapest, Atheneum Irodai es Nyomdai R.T. Kiadása, 85-90, 1929. English translation in: M.E.J. Newman, A.-L. Barabási, and D. J. Watts. The Structure and Dynamics of Networks. Princeton University Press, Princeton, 2006.
[34] P.S. Dodds, R. Muhamad, and D.J. Watts. An experimental study of search in global social networks. Science 301:827-829, 2003.
[35] P. Erdős and T. Gallai. Graphs with given degrees of vertices. Matematikai Lapok 11:264-274, 1960.
[36] C.I. Del Genio, H. Kim, Z. Toroczkai, and K.E. Bassler. Efficient and exact sampling of simple graphs with given arbitrary degree sequence. PLoS ONE 5:e10012, 2010.
[37] V. Havel. A remark on the existence of finite graphs. Casopis Pest. Mat. 80:477-480, 1955.
[38] S. Hakimi. On the realizability of a set of integers as degrees of the vertices of a graph. SIAM J. Appl. Math. 10:496-506, 1962.
[39] C.I. Del Genio, T. Gross, and K.E. Bassler. All scale-free networks are sparse. Phys. Rev. Lett. 107:178701, 2011.
[40] B. Bollobás. A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. European J. Combin. 1:311-316, 1980.
[41] M. Molloy and B. A. Reed. Critical Point for Random Graphs with a Given Degree Sequence. Random Structures and Algorithms 6:161-180, 1995.
[42] M. Newman. Networks: An Introduction. Oxford University Press, Oxford, 2010.
[43] S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks. Science 296:910-913, 2002.
[44] G. Caldarelli, A. Capocci, P. De Los Rios, and M.A. Muñoz. Scale-Free Networks from Varying Vertex Intrinsic Fitness. Phys. Rev. Lett. 89:258702, 2002.
[45] B. Söderberg. General formalism for inhomogeneous random graphs. Phys. Rev. E 66:066121, 2002.
[46] M. Boguñá and R. Pastor-Satorras. Class of correlated random networks with hidden variables. Phys. Rev. E 68:036112, 2003.
[47] A. Clauset, C.R. Shalizi, and M.E.J. Newman. Power-law distributions in empirical data. SIAM Review 51:661-703, 2009.
[48] S. Redner. Citation statistics from 110 years of Physical Review. Physics Today 58:49, 2005.
5
ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
THE BARABÁSI-ALBERT MODEL
ACKNOWLEDGEMENTS
MÁRTON PÓSFAI GABRIELE MUSELLA MAURO MARTINO ROBERTA SINATRA
SARAH MORRISON AMAL HUSSEINI PHILIPP HOEVEL
INDEX
Introduction
Growth and Preferential Attachment
The Barabási-Albert Model
Degree Dynamics
Degree Distribution
The Absence of Growth or Preferential Attachment
Measuring Preferential Attachment
Non-linear Preferential Attachment
The Origins of Preferential Attachment
Diameter and Clustering Coefficient
Homework
Summary
ADVANCED TOPICS 5.A: Deriving the Degree Distribution
ADVANCED TOPICS 5.B: Non-linear Preferential Attachment
ADVANCED TOPICS 5.C: The Clustering Coefficient
Bibliography
Figure 5.0 (cover image)
Scale-free Sonata
Composed by Michael Edward Edgerton in 2003, 1 sonata for piano incorporates growth and preferential attachment to mimic the emergence of a scale-free network. The image shows the beginning of what Edgerton calls Hub #5. The relationship between the music and networks is explained by the composer: “6 hubs of different length and procedure were distributed over the 2nd and 3rd movements. Musically, the notion of an airport was utilized by diverting all traffic into a limited landing space, while the density of procedure and duration were varied considerably between the 6 differing occurrences.”
This book is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V48 19.09.2014
SECTION 5.1
INTRODUCTION
Hubs represent the most striking difference between a random and a scale-free network. On the World Wide Web, they are websites with an exceptional number of links, like google.com or facebook.com; in the metabolic network they are molecules like ATP or ADP, energy carriers involved in an exceptional number of chemical reactions. The very existence of these hubs and the related scale-free topology raises two fundamental questions:

• Why do systems as different as the WWW and the cell converge to a similar scale-free architecture?
• Why does the random network model of Erdős and Rényi fail to reproduce the hubs and the power laws observed in real networks?

Online Resource 5.1 Scale-free Sonata
Listen to a recording of Michael Edward Edgerton's 1 sonata for piano, music inspired by scale-free networks.

The first question is particularly puzzling given the fundamental differences in the nature, origin, and scope of the systems that display the scale-free property:

• The nodes of the cellular network are metabolites or proteins, while the nodes of the WWW are documents, representing information without a physical manifestation.
• The links within the cell are chemical reactions and binding interactions, while the links of the WWW are URLs, or small segments of computer code.
• The history of these two systems could not be more different: The cellular network is shaped by 4 billion years of evolution, while the WWW is less than three decades old.
• The purpose of the metabolic network is to produce the chemical components the cell needs to stay alive, while the purpose of the WWW is information access and delivery.

To understand why such different systems converge to a similar architecture we first need to understand the mechanism responsible for the emergence of the scale-free property. This is the main topic of this chapter. Given the diversity of the systems that display the scale-free property, the explanation must be simple and fundamental. The answers will change the way we model networks, forcing us to move from describing a network's topology to modeling the evolution of a complex system.
SECTION 5.2
GROWTH AND PREFERENTIAL ATTACHMENT
We start our journey by asking: Why are hubs and power laws absent in random networks? The answer emerged in 1999, highlighting two hidden assumptions of the Erdős-Rényi model that are violated in real networks [1]. Next we discuss these assumptions separately.

Networks Expand Through the Addition of New Nodes

The random network model assumes that we have a fixed number of nodes, N. Yet, in real networks the number of nodes continually grows thanks to the addition of new nodes. Consider a few examples:

• In 1991 the WWW had a single node, the first webpage built by Tim Berners-Lee, the creator of the Web. Today the Web has over a trillion (10^12) documents, an extraordinary number that was reached through the continuous addition of new documents by millions of individuals and institutions (Figure 5.1a).
• The collaboration and the citation networks continually expand through the publication of new research papers (Figure 5.1b).
• The actor network continues to expand through the release of new movies (Figure 5.1c).
• The protein interaction network may appear to be static, as we inherit our genes (and hence our proteins) from our parents. Yet, it is not: The number of genes grew from a few to the over 20,000 genes present in a human cell over four billion years.

Consequently, if we wish to model these networks, we cannot resort to a static model. Our modeling approach must instead acknowledge that networks are the product of a steady growth process.
Nodes Prefer to Link to the More Connected Nodes

The random network model assumes that we randomly choose the interaction partners of a node. Yet, in most real networks new nodes prefer to link to the more connected nodes, a process called preferential attachment (Figure 5.2). Consider a few examples:

• We are familiar with only a tiny fraction of the trillion or more documents available on the WWW. The nodes we know are not entirely random: We have all heard about Google and Facebook, but we rarely encounter the billions of less-prominent nodes that populate the Web. As our knowledge is biased towards the more popular Web documents, we are more likely to link to a high-degree node than to a node with only a few links.
• No scientist can attempt to read the more than a million scientific papers published each year. Yet, the more cited a paper is, the more likely it is that we hear about it and eventually read it. As we cite what we read, our citations are biased towards the more cited publications, representing the high-degree nodes of the citation network.
• The more movies an actor has played in, the more familiar a casting director is with her skills. Hence, the higher the degree of an actor in the actor network, the higher the chances that she will be considered for a new role.

In summary, the random network model differs from real networks in two important characteristics:

(A) Growth
Real networks are the result of a growth process that continuously increases N. In contrast, the random network model assumes that the number of nodes, N, is fixed.

(B) Preferential Attachment
In real networks new nodes tend to link to the more connected nodes. In contrast, nodes in random networks randomly choose their interaction partners.

There are many other differences between real and random networks, some of which will be discussed in the coming chapters. Yet, as we show next, these two, growth and preferential attachment, play a particularly important role in shaping a network's degree distribution.

Figure 5.1 The Growth of Networks
Networks are not static, but grow via the addition of new nodes:
(a) The evolution of the number of WWW hosts, documenting the Web's rapid growth. After http://www.isc.org/solutions/survey/history.
(b) The number of scientific papers published in Physical Review since the journal's founding. The increasing number of papers drives the growth of both the science collaboration network and the citation network shown in the figure.
(c) Number of movies listed in IMDB.com, driving the growth of the actor network.
Figure 5.2 Preferential Attachment: A Brief History
Preferential attachment has emerged independently in many disciplines, helping explain the presence of power laws characterising various systems. In the context of networks preferential attachment was introduced in 1999 to explain the scale-free property.

Milestones, by publication date:
• 1923, Pólya process: György Pólya (1887-1985). Preferential attachment made its first appearance in 1923 in the celebrated urn model of the Hungarian mathematician György Pólya [2]. Hence, in mathematics preferential attachment is often called a Pólya process.
• 1925, Yule process: George Udny Yule (1871-1951) used preferential attachment to explain the power-law distribution of the number of species per genus of flowering plants [3]. Hence, in statistics preferential attachment is often called a Yule process.
• 1931, Proportional growth: Robert Gibrat (1904-1980) proposed that the size and the growth rate of a firm are independent. Hence, larger firms grow faster [4]. Called proportional growth, this is a form of preferential attachment.
• 1941, Wealth distribution: George Kingsley Zipf (1902-1950) used preferential attachment to explain the fat-tailed distribution of wealth in society [5].
• 1955, Master equation: Herbert Alexander Simon (1916-2001) used preferential attachment to explain the fat-tailed nature of the distributions describing city sizes, word frequencies, or the number of papers published by scientists [6].
• 1968, Matthew effect: Robert Merton (1910-2003). In sociology preferential attachment is often called the Matthew effect, named by Merton [8] after a passage in the Gospel of Matthew: “For everyone who has will be given more, and he will have an abundance.”
• 1976, Cumulative advantage: Derek de Solla Price (1922-1983) used preferential attachment to explain the citation statistics of scientific publications, calling it cumulative advantage [7].
• 1999, Preferential attachment: Albert-László Barabási (1967) & Réka Albert (1972) introduce the term preferential attachment to explain the origin of scale-free networks [1].
SECTION 5.3
THE BARABÁSI-ALBERT MODEL
The recognition that growth and preferential attachment coexist in real networks has inspired a minimal model called the BarabásiAlbert model, which can generate scalefree networks [1]. Also known as the BA model or the scalefree model, it is defined as follows: We start with m0 nodes, the links between which are chosen arbitrarily, as long as each node has at least one link. The network develops following two steps (Figure 5.3): (A) Growth At each timestep we add a new node with m (≤ m0) links that connect
the new node to m nodes already in the network. (B) Preferential attachment
The probability Π(k) that a link of the new node connects to node i depends on the degree ki as
Π( ki ) =
ki
∑k j
.
Figure 5.3 Evolution of the BarabásiAlbert Model
The sequence of images shows nine subsequent steps of the BarabásiAlbert model. Empty circles mark the newly added node to the network, which decides where to connect its two links (m=2) using preferential attachment (5.1). After [9].
(5.1)
j
Preferential attachment is a probabilistic mechanism: A new node is free to connect to any node in the network, whether it is a hub or has a single link. Equation (5.1) implies, however, that if a new node has a choice
>
between a degreetwo and a degreefour node, it is twice as likely that it connects to the degreefour node. After t timesteps the BarabásiAlbert model generates a network with N = t + m0 nodes and m0 + mt links. As Figure 5.4 shows, the obtained network has a powerlaw degree distribution with degree exponent γ=3. A mathe
Online Resource 5.2 Emergence of a Scalefree Network
matically selfconsistent definition of the model is provided in BOX 5.1.
Watch a video that shows the growth of a scalefree network and the emergence of the hubs in the BarabásiAlbert model. Courtesy of Dashun Wang.
As Figure 5.3 and Online Resource 5.2 indicate, while most nodes in the network have only a few links, a few gradually turn into hubs. These hubs are the result of a rich-gets-richer phenomenon: Due to preferential attachment new nodes are more likely to connect to the more connected nodes than to the smaller nodes. Hence, the larger nodes will acquire links at the expense of the smaller nodes, eventually becoming hubs.

In summary, the Barabási-Albert model indicates that two simple mechanisms, growth and preferential attachment, are responsible for the emergence of scale-free networks. The origin of the power law and the associated hubs is a rich-gets-richer phenomenon induced by the coexistence of these two ingredients. To understand the model’s behavior and to quantify the emergence of the scale-free property, we need to become familiar with the model’s mathematical properties, which is the subject of the next section.

Figure 5.4 The Degree Distribution
The degree distribution of a network generated by the Barabási-Albert model. The figure shows pk for a single network of size N=100,000 and m=3. It shows both the linearly-binned (purple) and the log-binned version (green) of pk. The straight line is added to guide the eye and has slope γ=3, corresponding to the network’s predicted degree exponent.
[Figure 5.4: log-log plot of pk versus k, with a guide line labeled γ=3.]
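The growth and preferential attachment steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce Figure 5.4: the function name `barabasi_albert`, the choice of a complete graph on m0 = m + 1 nodes as the initial configuration, and the rule that a new node redraws until its m targets are distinct are all assumptions made here, since the text leaves these details open.

```python
import random
from collections import Counter


def barabasi_albert(n_steps, m, seed=0):
    """Grow a network by growth plus preferential attachment.

    Assumptions (not fixed by the text): the initial network is a
    complete graph on m0 = m + 1 nodes, and each new node redraws
    until it has m distinct targets, so no multi-links are created.
    """
    rng = random.Random(seed)
    m0 = m + 1
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    # Node i appears in `stubs` once per link end, so a uniform draw
    # from `stubs` picks node i with probability k_i / sum_j k_j (5.1).
    stubs = [node for edge in edges for node in edge]
    for new in range(m0, m0 + n_steps):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in sorted(targets):
            edges.append((new, target))
            stubs.extend((new, target))
    return edges


edges = barabasi_albert(n_steps=10_000, m=3)
degree = Counter(node for edge in edges for node in edge)
print("nodes:", len(degree), "links:", len(edges))
print("largest degrees:", sorted(degree.values(), reverse=True)[:5])
```

Keeping a list with one entry per link end avoids recomputing the normalization in (5.1) at every step. Rejecting repeated targets slightly distorts the attachment probabilities for small networks, which is one of the ambiguities BOX 5.1 addresses.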
THE BARABÁSIALBERT MODEL
BOX 5.1
THE MATHEMATICAL DEFINITION OF THE BARABÁSI-ALBERT MODEL
The definition of the Barabási-Albert model leaves many mathematical details open:
• It does not specify the precise initial configuration of the first m0 nodes.
• It does not specify whether the m links assigned to a new node are added one by one, or simultaneously. This leads to potential mathematical conflicts: If the links are truly independent, they could connect to the same node i, resulting in multi-links.

Bollobás and collaborators [10] proposed the Linearized Chord Diagram (LCD) to resolve these problems, making the model more amenable to mathematical approaches.

According to the LCD, for m=1 we build a graph G1(t) as follows (Figure 5.5):
(1) Start with G1(0), corresponding to an empty graph with no nodes.
(2) Given G1(t−1) generate G1(t) by adding the node vt and a single link between vt and vi, where vi is chosen with probability

$$p = \begin{cases} \dfrac{k_i}{2t-1}, & \text{if } 1 \le i \le t-1 \\ \dfrac{1}{2t-1}, & \text{if } i = t \end{cases} \qquad (5.2)$$

That is, we place a link from the new node vt to node vi with probability ki/(2t−1), where the new link already contributes to the degree of vt. Consequently node vt can also link to itself with probability 1/(2t−1), the second term in (5.2). Note also that the model permits self-loops and multi-links. Yet, their number becomes negligible in the t→∞ limit. For m > 1 we build Gm(t) by adding m links from the new node vt one by one, in each step allowing the outward half of the newly added link to contribute to the degrees.

Figure 5.5 The Linearized Chord Diagram (LCD)
The construction of the LCD, the version of the Barabási-Albert model amenable to exact mathematical calculations [10]. The figure shows the first four steps of the network's evolution for m=1:
G1(0): We start with an empty network.
G1(1): The first node can only link to itself, forming a self-loop. Self-loops are allowed, and so are multi-links for m>1.
G1(2): Node 2 can either connect to node 1 with probability 2/3, or to itself with probability 1/3. According to (5.2), half of the link that the new node 2 brings along is already counted as present. Consequently node 1 has degree k1=2 and node 2 has degree k2=1, the normalization constant being 3.
G1(3): Let us assume that the first of the two G1(2) network possibilities has materialized. When node 3 comes along, it has three choices: It can connect to node 2 with probability 1/5, to node 1 with probability 3/5 and to itself with probability 1/5.
[Figure 5.5: the graphs G1(0) to G1(3), with each possible new link labeled by its probability p.]
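The m=1 rule (5.2) can be sketched as follows. This is an illustrative implementation written for this passage, not code from [10]; the function name `lcd_graph` and the representation of the graph as a list of (new node, target) pairs are assumptions.

```python
import random


def lcd_graph(t_max, seed=0):
    """Build G1(t_max) with the m = 1 rule (5.2).

    Nodes are labeled 1..t_max. Returns the list of links and the list
    of link ends ("stubs"), in which node i appears k_i times.
    """
    rng = random.Random(seed)
    links, stubs = [], []
    for t in range(1, t_max + 1):
        # The outgoing half of the new link already counts toward k_t,
        # so the pool holds 2(t - 1) + 1 = 2t - 1 entries when we draw.
        stubs.append(t)
        target = rng.choice(stubs)  # node i with prob. k_i / (2t - 1)
        stubs.append(target)        # the incoming half of the new link
        links.append((t, target))
    return links, stubs


links, stubs = lcd_graph(10)
print(links)
```

Because the new node's own half-link is in the pool before the draw, the self-loop probability 1/(2t−1) emerges without a special case, and the first node necessarily links to itself.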
SECTION 5.3
DEGREE DYNAMICS
To understand the emergence of the scale-free property, we need to focus on the time evolution of the Barabási-Albert model. We begin by exploring the time-dependent degree of a single node [11]. In the model an existing node can increase its degree each time a new node enters the network. This new node will link to m of the N(t) nodes already present in the system. The probability that one of these links connects to node i is given by (5.1). Let us approximate the degree ki with a continuous real variable, representing its expectation value over many realizations of the growth process. The rate at which an existing node i acquires links as a result of new nodes connecting to it is

$$\frac{dk_i}{dt} = m\Pi(k_i) = m\frac{k_i}{\sum_{j=1}^{N-1} k_j}. \qquad (5.3)$$
The coefficient m describes that each new node arrives with m links. Hence, node i has m chances to be chosen. The sum in the denominator of (5.3) goes over all nodes in the network except the newly added node, thus

$$\sum_{j=1}^{N-1} k_j = 2mt - m. \qquad (5.4)$$
Therefore (5.3) becomes

$$\frac{dk_i}{dt} = \frac{k_i}{2t-1}. \qquad (5.5)$$

For large t the (−1) term can be neglected in the denominator, obtaining

$$\frac{dk_i}{k_i} = \frac{1}{2}\frac{dt}{t}. \qquad (5.6)$$

By integrating (5.6) and using the fact that ki(ti)=m, meaning that node i joins the network at time ti with m links, we obtain
$$k_i(t) = m\left(\frac{t}{t_i}\right)^{\beta}. \qquad (5.7)$$

We call β the dynamical exponent, which has the value

$$\beta = \frac{1}{2}.$$

Equation (5.7) offers a number of predictions:
• The degree of each node increases following a power-law with the same dynamical exponent β = 1/2 (Figure 5.6a). Hence all nodes follow the same dynamical law.
• The growth in the degrees is sublinear (i.e. β < 1). This is a consequence of the growing nature of the Barabási-Albert model: Each new node has more nodes to link to than the previous node. Hence, with time the existing nodes compete for links with an increasing pool of other nodes.
• The earlier node i was added, the higher is its degree ki(t). Hence, hubs are large because they arrived earlier, a phenomenon called first-mover advantage in marketing and business.
• The rate at which the node i acquires new links is given by the derivative of (5.7)

$$\frac{dk_i(t)}{dt} = \frac{m}{2}\frac{1}{\sqrt{t_i t}}, \qquad (5.8)$$

indicating that in each time step older nodes acquire more links (as they have smaller ti). Furthermore the rate at which a node acquires links decreases with time as t^(−1/2). Hence, fewer and fewer links go to a node.

In summary, the Barabási-Albert model captures the fact that in real networks nodes arrive one after the other, offering a dynamical description of a network’s evolution. This generates a competition for links during which the older nodes have an advantage over the younger ones, eventually turning into hubs.

BOX 5.2
TIME IN NETWORKS
As we compare the predictions of the network models with real data, we have to decide how to measure time in networks. Real networks evolve over rather different time scales:

World Wide Web
The first webpage was created in 1991. Given its trillion documents, the WWW added a node each millisecond (10^−3 sec).

Cell
The cell is the result of 4 billion years of evolution. With roughly 20,000 genes in a human cell, on average the cellular network added a node every 200,000 years (~10^13 sec).

Given these enormous time-scale differences it is impossible to use real time to compare the dynamics of different networks. Therefore, in network theory we use event time, advancing our time-step by one each time there is a change in the network topology.

For example, in the Barabási-Albert model the addition of each new node corresponds to a new time step, hence t=N. In other models time is also advanced by the arrival of a new link or the deletion of a node. If needed, we can establish a direct mapping between event time and the physical time.
Figure 5.6 Degree Dynamics
(a) The growth of the degrees of nodes added at time t = 1, 10, 10^2, 10^3, 10^4, 10^5 (continuous lines from left to right) in the Barabási-Albert model. Each node increases its degree following (5.7). Consequently at any moment the older nodes have higher degrees. The dotted line corresponds to the analytical prediction (5.7) with β = 1/2.
(b) Degree distribution of the network after adding N = 10^2, 10^4, and 10^6 nodes, i.e. at time t = 10^2, 10^4, and 10^6 (illustrated by arrows in (a)). The larger the network, the more obvious is the power-law nature of the degree distribution. Note that we used linear binning for pk to better observe the gradual emergence of the scale-free state.
[Figure 5.6: panel (a), SINGLE NETWORK, plots k versus t on log-log axes with a guide line t^β; panel (b) plots pk versus k.]
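The square-root growth law (5.7) can be checked numerically against the rate equation it was derived from. The sketch below integrates the large-t form dk/dt = k/(2t) of (5.5) with a plain Euler scheme and compares the result with the closed form; the function names, the step size `dt`, and the chosen values of m, ti and t are illustrative assumptions.

```python
import math


def integrate_degree(m, t_i, t_end, dt=0.01):
    """Euler integration of dk/dt = k / (2t), starting from k(t_i) = m."""
    k, t = float(m), float(t_i)
    n_steps = int(round((t_end - t_i) / dt))
    for _ in range(n_steps):
        k += dt * k / (2.0 * t)
        t += dt
    return k


def analytic_degree(m, t_i, t):
    """Closed form (5.7) with beta = 1/2."""
    return m * math.sqrt(t / t_i)


numeric = integrate_degree(m=3, t_i=10, t_end=1000)
exact = analytic_degree(m=3, t_i=10, t=1000)
print(f"numeric k = {numeric:.3f}, analytic k = {exact:.3f}")
```

Both values describe the continuum approximation only. The degree of a node in a single simulated network fluctuates around this expectation.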
EVOLVING NETWORKS
ADVANCED TOPICS 6.A SOLVING THE FITNESS MODEL

To determine the degree distribution in the large N limit, we first calculate the number of nodes with fitness η and with degree greater than k, i.e. those that satisfy kη(t) > k. Using (6.3) we find that this condition implies

$$t_0 < t\left(\frac{m}{k}\right)^{C/\eta}.$$

The cumulative degree distribution is therefore

$$P(k) = P(k_i \le k) = 1 - P(k_i > k) \approx 1 - \frac{t}{m_0 + t}\int \left(\frac{m}{k}\right)^{C/\eta}\rho(\eta)\,d\eta \approx 1 - \int \left(\frac{m}{k}\right)^{C/\eta}\rho(\eta)\,d\eta, \qquad (6.43)$$

where the last equation is valid asymptotically, for large t. The probability density function for the degree distribution is

$$p(k) = P'(k) = \int \frac{C}{\eta}\, m^{C/\eta}\, k^{-(C/\eta+1)}\rho(\eta)\,d\eta,$$

recovering (6.6).
SECTION 6.9
BIBLIOGRAPHY
[1] A.-L. Barabási. Linked: The New Science of Networks. Perseus, Boston, 2001.
[2] G. Bianconi and A.-L. Barabási. Competition and multiscaling in evolving networks. Europhysics Letters, 54:436–442, 2001.
[3] A.-L. Barabási, R. Albert, H. Jeong, and G. Bianconi. Power-law distribution of the world wide web. Science, 287:2115, 2000.
[4] P.L. Krapivsky and S. Redner. Statistics of changes in lead node in connectivity-driven networks. Phys. Rev. Lett., 89:258703, 2002.
[5] C. Godreche and J. M. Luck. On leaders and condensates in a growing network. J. Stat. Mech., P07031, 2010.
[6] J. H. Fowler, C. T. Dawes, and N. A. Christakis. Model of Genetic Variation in Human Social Networks. PNAS, 106:1720–1724, 2009.
[7] M. O. Jackson. Genetic influences on social network characteristics. PNAS, 106:1687–1688, 2009.
[8] S.A. Burt. Genes and popularity: Evidence of an evocative gene environment correlation. Psychol. Sci., 19:112–113, 2008.
[9] J. S. Kong, N. Sarshar, and V. P. Roychowdhury. Experience versus talent shapes the structure of the Web. PNAS, 105:13724–9, 2008.
[10] A.-L. Barabási, C. Song, and D. Wang. Handful of papers dominates citation. Nature, 491:40, 2012.
[11] D. Wang, C. Song, and A.-L. Barabási. Quantifying long-term scientific impact. Science, 342:127–131, 2013.
[12] M. Medo, G. Cimini, and S. Gualdi. Temporal effects in the growth of networks. Phys. Rev. Lett., 107:238701, 2011.
[13] C. Venter et al. The sequence of the human genome. Science, 291:1304–1351, 2001.
[14] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
[15] G. Bianconi and A.-L. Barabási. Bose-Einstein condensation in complex networks. Phys. Rev. Lett., 86:5632–5635, 2001.
[16] C. Borgs, J. Chayes, C. Daskalakis, and S. Roch. First to market is not everything: analysis of preferential attachment with fitness. STOC’07, San Diego, California, 2007.
[17] S. N. Dorogovtsev, J.F.F. Mendes, and A.N. Samukhin. Structure of growing networks with preferential linking. Phys. Rev. Lett., 85:4633, 2000.
[18] C. Godreche, H. Grandclaude, and J.M. Luck. Finite-time fluctuations in the degree statistics of growing networks. J. of Stat. Phys., 137:1117–1146, 2009.
[19] Y.H. Eom and S. Fortunato. Characterizing and Modeling Citation Dynamics. PLoS ONE, 6:e24926, 2011.
[20] A.-L. Barabási, H. Jeong, Z. Néda, E. Ravasz, A. Schubert, and T. Vicsek. Evolution of the social network of scientific collaborations. Physica A, 311:590–614, 2002.
[21] R. Albert and A.-L. Barabási. Topology of evolving networks: local events and universality. Phys. Rev. Lett., 85:5234–5237, 2000.
[22] G. Ghoshal, L. Chi, and A.-L. Barabási. Uncovering the role of elementary processes in network evolution. Scientific Reports, 3:1–8, 2013.
[23] J.H. Schön, Ch. Kloc, R.C. Haddon, and B. Batlogg. A superconducting field-effect switch. Science, 288:656–8, 2000.
[24] D. Agin. Junk Science: An Overdue Indictment of Government, Industry, and Faith Groups That Twist Science for Their Own Gain. Macmillan, New York, 2007.
[25] S. Saavedra, F. Reed-Tsochas, and B. Uzzi. Asymmetric disassembly and robustness in declining networks. PNAS, 105:16466–16471, 2008.
[26] F. Chung and L. Lu. Coupling online and offline analyses for random power-law graphs. Int. Math., 1:409–461, 2004.
[27] C. Cooper, A. Frieze, and J. Vera. Random deletion in a scale-free random graph process. Int. Math., 1:463–483, 2004.
[28] S. N. Dorogovtsev and J. Mendes. Scaling behavior of developing and decaying networks. Europhys. Lett., 52:33–39, 2000.
[29] C. Moore, G. Ghoshal, and M. E. J. Newman. Exact solutions for models of evolving networks with addition and deletion of nodes. Phys. Rev. E, 74:036121, 2006.
[30] H. Bauke, C. Moore, J. Rouquier, and D. Sherrington. Topological phase transition in a network model with preferential attachment and node removal. The European Physical Journal B, 83:519–524, 2011.
[31] M. Pascual and J. Dunne (eds). Ecological Networks: Linking Structure to Dynamics in Food Webs. Oxford Univ Press, Oxford, 2005.
[32] R. Sole and J. Bascompte. Self-Organization in Complex Ecosystems. Princeton University Press, Princeton, 2006.
[33] U. T. Srinivasan, J. A. Dunne, J. Harte, and N. D. Martinez. Response of complex food webs to realistic extinction sequences. Ecology, 88:671–682, 2007.
[34] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. ACM SIGCOMM Computer Communication Review, 29:251–262, 1999.
[35] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, and A. Tomkins. Graph structure in the web. Computer Networks, 33:309–320, 2000.
[36] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM TKDD07, ACM Transactions on Knowledge Discovery from Data, 1:1, 2007.
[37] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407:651–655, 2000.
[38] S. Dorogovtsev and J. Mendes. Effect of the accelerating growth of communications networks on their structure. Phys. Rev. E, 63:025101(R), 2001.
[39] M. J. Gagen and J. S. Mattick. Accelerating, hyperaccelerating, and decelerating networks. Phys. Rev. E, 72:016123, 2005.
[40] C. Cooper and P. Prałat. Scale-free graphs of increasing degree. Random Structures & Algorithms, 38:396–421, 2011.
[41] N. Deo and A. Cami. Preferential deletion in dynamic models of web-like networks. Inf. Proc. Lett., 102:156–162, 2007.
[42] S.N. Dorogovtsev and J.F.F. Mendes. Evolution of networks with aging of sites. Phys. Rev. E, 62:1842, 2000.
[43] L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley. Classes of small-world networks. Proc. National Academy of Sciences USA, 97:11149, 2000.
[44] K. Klemm and V. M. Eguiluz. Highly clustered scale-free networks. Phys. Rev. E, 65:036123, 2002.
[45] X. Zhu, R. Wang, and J.Y. Zhu. The effect of aging on network structure. Phys. Rev. E, 68:056121, 2003.
7
ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
DEGREE CORRELATIONS

ACKNOWLEDGEMENTS
MÁRTON PÓSFAI, GABRIELE MUSELLA, MAURO MARTINO, NICOLE SAMAY, ROBERTA SINATRA, SARAH MORRISON, AMAL HUSSEINI, PHILIPP HOEVEL

INDEX
Introduction
Assortativity and Disassortativity
Measuring Degree Correlations
Structural Cutoffs
Correlations in Real Networks
Generating Correlated Networks
The Impact of Degree Correlations
Summary
Homework
ADVANCED TOPICS 7.A Degree Correlation Coefficient
ADVANCED TOPICS 7.B Structural Cutoffs
Bibliography
Figure 7.0 (cover image) TheyRule.net by Josh On
Created by Josh On, a San Francisco-based designer, the interactive website TheyRule.net uses a network representation to illustrate the interlocking relationship of the US economic class. By mapping out the shared board membership of the most powerful U.S. companies, it reveals the influential role of a small number of individuals who sit on multiple boards. Since its release in 2001, the project has been viewed interchangeably as art or science.
This book is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V24 18.09.2014
SECTION 7.1
INTRODUCTION
Angelina Jolie and Brad Pitt, Ben Affleck and Jennifer Garner, Harrison Ford and Calista Flockhart, Michael Douglas and Catherine Zeta-Jones, Tom Cruise and Katie Holmes, Richard Gere and Cindy Crawford (Figure 7.1). An odd list, yet instantly recognizable to those immersed in the headline-driven world of celebrity couples. They are Hollywood stars who are or were married. Their weddings (and breakups) have drawn countless hours of media coverage and sold millions of gossip magazines. Thanks to them we take for granted that celebrities marry each other. We rarely pause to ask: Is this normal? In other words, what is the true chance that a celebrity marries another celebrity?
Figure 7.1 Hubs Dating Hubs
Celebrity couples, representing a highly visible proof that in social networks hubs tend to know, date and marry each other (Images from http://www.whosdatedwho.com).
Assuming that a celebrity could date anyone from a pool of about a hundred million (10^8) eligible individuals worldwide, the chance that their mate would be another celebrity from a generous list of 1,000 other celebrities is only 10^−5. Therefore, if dating were driven by random encounters, celebrities would never marry each other. Even if we do not care about the dating habits of celebrities, we must pause and explore what this phenomenon tells us about the structure of the social network. Celebrities, political leaders, and CEOs of major corporations tend to know an exceptionally large number of individuals and are known by even more. They are hubs. Hence celebrity dating (Figure 7.1) and joint board memberships (Figure 7.0) are manifestations of an interesting property of social networks: hubs tend to have ties to other hubs.

As obvious as this may sound, this property is not present in all networks. Consider for example the protein-interaction network of yeast, shown in Figure 7.2. A quick inspection of the network reveals its scale-free nature: numerous one- and two-degree proteins coexist with a few highly connected hubs. These hubs, however, tend to avoid linking to each other. They link instead to many small-degree nodes, generating a hub-and-spoke pattern. This is particularly obvious for the two hubs highlighted in Figure 7.2: they almost exclusively interact with small-degree proteins.
Figure 7.2 Hubs Avoiding Hubs
The protein interaction map of yeast. Each node corresponds to a protein and two proteins are linked if there is experimental evidence that they can bind to each other in the cell. We highlighted the two largest hubs, with degrees k = 56 and k′ = 13. They both connect to many small-degree nodes and avoid linking to each other. The network has N = 1,870 proteins and L = 2,277 links, representing one of the earliest protein interaction maps [1, 2]. Only the largest component is shown. Note that the protein interaction network of yeast in TABLE 4.1 represents a later map, hence it contains more nodes and links than the network shown in this figure. Node color corresponds to the essentiality of each protein: the removal of the red nodes kills the organism, hence they are called lethal or essential proteins. In contrast the organism can survive without one of its green nodes. After [3].
A brief calculation illustrates how unusual this pattern is. Let us assume that each node chooses randomly the nodes it connects to. Therefore the probability that nodes with degrees k and k′ link to each other is

$$p_{k,k'} = \frac{kk'}{2L}. \qquad (7.1)$$

Equation (7.1) tells us that hubs, by virtue of the many links they have, are much more likely to connect to each other than to small-degree nodes. Indeed, if k and k′ are large, so is pk,k′. Consequently, the likelihood that hubs with degrees k = 56 and k′ = 13 have a direct link between them is pk,k′ = 0.16, which is 400 times larger than p1,2 = 0.0004, the likelihood that a degree-two node links to a degree-one node. Yet, there are no direct links between the hubs in Figure 7.2, but we observe numerous direct links between small-degree nodes.

Instead of linking to each other, the hubs highlighted in Figure 7.2 almost exclusively connect to degree-one nodes. By itself this is not unexpected: We expect that a hub with degree k = 56 should link to N1 p1,56 ≈ 12 nodes with k = 1. The problem is that this hub connects to 46 degree-one neighbors, i.e. four times the expected number.

In summary, while in social networks hubs tend to “date” each other, in the protein interaction network the opposite is true: The hubs avoid linking to other hubs, connecting instead to many small-degree nodes. While it is dangerous to derive generic principles from two examples, the purpose of this chapter is to show that these patterns are manifestations of a general property of real networks: they exhibit a phenomenon called degree correlations. We discuss how to measure degree correlations and explore their impact on the network topology.
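The arithmetic behind these estimates is short enough to reproduce directly from (7.1). The sketch below uses the values quoted for the yeast map (L = 2,277, k = 56, k′ = 13); the function name `link_probability` is an illustrative choice.

```python
def link_probability(k, k_prime, n_links):
    """Chance-level probability (7.1) that degree-k and degree-k' nodes link."""
    return k * k_prime / (2 * n_links)


L = 2277  # number of links in the yeast map of Figure 7.2
p_hubs = link_probability(56, 13, L)
p_small = link_probability(1, 2, L)

print(f"p(56, 13) = {p_hubs:.2f}")
print(f"p(1, 2)   = {p_small:.4f}")
print(f"ratio     = {p_hubs / p_small:.0f}")
```

The unrounded ratio is 56 · 13 / 2 = 364. The factor of 400 quoted in the text follows from dividing the rounded values 0.16 and 0.0004.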
SECTION 7.2
ASSORTATIVITY AND DISASSORTATIVITY
Just by virtue of the many links they have, hubs are expected to link to each other. In some networks they do, in others they don’t. This is illustrated in Figure 7.3, which shows three networks with identical degree sequences but different topologies:
• Neutral Network
Figure 7.3b shows a network whose wiring is random. We call this network neutral, meaning that the number of links between the hubs coincides with what we expect by chance, as predicted by (7.1).
• Assortative Network
The network of Figure 7.3a has precisely the same degree sequence as the one in Figure 7.3b. Yet, the hubs in Figure 7.3a tend to link to each other and avoid linking to small-degree nodes. At the same time the small-degree nodes tend to connect to other small-degree nodes. Networks displaying such trends are assortative. An extreme manifestation of this pattern is a perfectly assortative network, in which each degree-k node connects only to other degree-k nodes (Figure 7.4).
• Disassortative Network
In Figure 7.3c the hubs avoid each other, linking instead to small-degree nodes. Consequently the network displays a hub-and-spoke character, making it disassortative.

In general a network displays degree correlations if the number of links between the high- and low-degree nodes is systematically different from what is expected by chance. In other words, the number of links between nodes of degrees k and k′ deviates from (7.1).
Figure 7.3 Degree Correlation Matrix
(a,b,c) Three networks that have precisely the same degree distribution (Poisson pk), but display different degree correlations. We show only the largest component and we highlight in orange the five highest degree nodes and the direct links between them.
(d,e,f) The degree correlation matrix eij for an assortative (d), a neutral (e) and a disassortative network (f) with Poisson degree distribution, N=1,000, and 〈k〉=10. The colors correspond to the probability that a randomly selected link connects nodes with degrees k1 and k2.
(a,d) Assortative Networks
For assortative networks eij is high along the main diagonal. This indicates that nodes of comparable degree tend to link to each other: small-degree nodes to small-degree nodes and hubs to hubs. Indeed, the network in (a) has numerous links between its hubs as well as between its small-degree nodes.
(b,e) Neutral Networks
In neutral networks nodes link to each other randomly. Hence the density of links is symmetric around the average degree, indicating the lack of correlations in the linking pattern.
(c,f) Disassortative Networks
In disassortative networks eij is higher along the secondary diagonal, indicating that hubs tend to connect to small-degree nodes and small-degree nodes to hubs. Consequently these networks have a hub-and-spoke character, as seen in (c).
[Figure 7.3: panels (a) assortative, (b) neutral and (c) disassortative networks; panels (d), (e) and (f) show eij as a heat map over k1 and k2, each ranging from 0 to 20, with a color scale from 0 to 0.02.]
The information about potential degree correlations is captured by the degree correlation matrix, eij, which is the probability of finding nodes with degrees i and j at the two ends of a randomly selected link. As eij is a probability, it is normalized, i.e.

$$\sum_{i,j} e_{ij} = 1. \qquad (7.2)$$

In (5.27) we derived the probability qk that there is a degree-k node at the end of the randomly selected link, obtaining

$$q_k = \frac{k p_k}{\langle k \rangle}. \qquad (7.3)$$

We can connect qk to eij via

$$\sum_j e_{ij} = q_i. \qquad (7.4)$$

In neutral networks, we expect

$$e_{ij} = q_i q_j. \qquad (7.5)$$

A network displays degree correlations if eij deviates from the random expectation (7.5). Note that (7.2) - (7.5) are valid for networks with an arbitrary degree distribution, hence they apply to both random and scale-free networks.

Figure 7.4 Perfect Assortativity
In a perfectly assortative network each node links only to nodes with the same degree. Hence ejk = δjk qk, where δjk is the Kronecker delta. In this case all non-diagonal elements of the ejk matrix are zero. The figure shows such a perfectly assortative network, consisting of complete k-cliques.

Given that eij encodes all information about potential degree correlations, we start with its visual inspection. Figures 7.3d,e,f show eij for an assortative, a neutral and a disassortative network. In a neutral network small- and high-degree nodes connect to each other randomly, hence eij lacks any trend (Figure 7.3e). In contrast, assortative networks show high correlations along the main diagonal, indicating that nodes predominantly connect to nodes with comparable degree. Therefore low-degree nodes tend to link to other low-degree nodes and hubs to hubs (Figure 7.3d). In disassortative networks eij displays the opposite trend: it has high correlations along the secondary diagonal. Therefore high-degree nodes tend to connect to low-degree nodes (Figure 7.3f).

In summary, information about degree correlations is carried by the degree correlation matrix eij. Yet, the study of degree correlations through the inspection of eij has numerous disadvantages:
• It is difficult to extract information from the visual inspection of a matrix.
• As a visual inspection does not reveal the magnitude of the correlations, it is difficult to compare networks with different correlations.
• ejk contains approximately k_max^2/2 independent variables, representing a huge amount of information that is difficult to model in analytical calculations and simulations.

We therefore need to develop a more compact way to detect degree correlations. This is the goal of the subsequent sections.
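For a concrete network, eij and qk can be tabulated directly from an edge list, which also lets us verify (7.2) and (7.4). The sketch below is an illustration under stated assumptions: the network is undirected and simple, it is given as a list of node pairs, and the function names are hypothetical. Each link is counted in both orientations so that eij is symmetric.

```python
from collections import Counter, defaultdict


def degree_correlation_matrix(edges):
    """Return e as a dict {(i, j): probability} and the node degrees."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    e = defaultdict(float)
    weight = 1.0 / (2 * len(edges))
    for u, v in edges:
        # Count both orientations of the link so that e[i, j] == e[j, i].
        e[(degree[u], degree[v])] += weight
        e[(degree[v], degree[u])] += weight
    return dict(e), degree


def end_of_link_distribution(degree):
    """q_k = k p_k / <k>, following (7.3)."""
    n = len(degree)
    mean_k = sum(degree.values()) / n
    counts = Counter(degree.values())
    return {k: k * c / n / mean_k for k, c in counts.items()}


# A star with three leaves: every link joins the degree-3 hub to a degree-1 leaf.
star = [(0, 1), (0, 2), (0, 3)]
e, degree = degree_correlation_matrix(star)
q = end_of_link_distribution(degree)
print("e =", e)
print("q =", q)
```

In the star all the weight of e sits on the off-diagonal entries (1, 3) and (3, 1), the signature of a disassortative network, whereas the neutral expectation (7.5) would assign weight qi qj = 0.25 to every combination.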
SECTION 7.3
MEASURING DEGREE CORRELATIONS
While eij contains the complete information about the degree correlations characterizing a particular network, it is difficult to interpret its content. In this section we introduce the degree correlation function, which offers a simpler way to quantify degree correlations.

Figure 7.5 Nearest Neighbor Degree: knn
To determine the degree correlation function knn(ki) we calculate the average degree of a node’s neighbors. The figure illustrates the calculation of knn(ki) for node i. As the degree of the node i is ki = 4, by averaging the degree of its neighbors j1, j2, j3 and j4, we obtain knn(4) = (4 + 3 + 3 + 1)/4 = 2.75.

Degree correlations capture the relationship between the degrees of nodes that link to each other. One way to quantify their magnitude is to measure for each node i the average degree of its neighbors (Figure 7.5)

$$k_{nn}(k_i) = \frac{1}{k_i}\sum_{j=1}^{N} A_{ij}k_j. \qquad (7.6)$$

The degree correlation function calculates (7.6) for all nodes with degree k [4, 5]

$$k_{nn}(k) = \sum_{k'} k'P(k'|k), \qquad (7.7)$$
where P(k′|k) is the conditional probability that following a link of a degree-k node we reach a degree-k′ node. Therefore knn(k) is the average degree of the neighbors of all degree-k nodes. To quantify degree correlations we inspect the dependence of knn(k) on k.
• Neutral Network
For a neutral network (7.3)-(7.5) predict

$$P(k'|k) = \frac{e_{kk'}}{\sum_{k'} e_{kk'}} = \frac{e_{kk'}}{q_k} = \frac{q_{k'}q_k}{q_k} = q_{k'}. \qquad (7.8)$$

This allows us to express knn(k) as

$$k_{nn}(k) = \sum_{k'} k'q_{k'} = \sum_{k'} k'\frac{k'p_{k'}}{\langle k\rangle} = \frac{\langle k^2\rangle}{\langle k\rangle}. \qquad (7.9)$$
Therefore, in a neutral network the average degree of a node’s neighbors is independent of the node’s degree k and depends only on the global network characteristics ⟨k⟩ and ⟨k^2⟩. So plotting knn(k) as a function of k should result in a horizontal line at ⟨k^2⟩/⟨k⟩, as observed for the power grid (Figure 7.6b). Equation (7.9) also captures an intriguing property of real networks: our friends are more popular than we are, a phenomenon called the friendship paradox (BOX 7.1).
• Assortative Network
In assortative networks hubs tend to connect to other hubs, hence the higher the degree k of a node, the higher the average degree of its nearest neighbors. Consequently for assortative networks knn(k) increases with k, as observed for scientific collaboration networks (Figure 7.6a).
• Disassortative Network
In disassortative networks hubs prefer to link to low-degree nodes. Consequently knn(k) decreases with k, as observed for the metabolic network (Figure 7.6c).

The behavior observed in Figure 7.6 prompts us to approximate the degree correlation function with [4]

$$k_{nn}(k) = ak^{\mu}. \qquad (7.10)$$

If the scaling (7.10) holds, then the nature of degree correlations is determined by the sign of the correlation exponent μ:
• Assortative Networks: μ > 0
A fit to knn(k) for the science collaboration network provides μ = 0.37 ± 0.11 (Figure 7.6a).
• Neutral Networks: μ = 0
According to (7.9) knn(k) is independent of k. Indeed, for the power grid we obtain μ = 0.04 ± 0.05, which is indistinguishable from zero (Figure 7.6b).
• Disassortative Networks: μ < 0
For the metabolic network we obtain μ = −0.76 ± 0.04 (Figure 7.6c).

In summary, the degree correlation function helps us capture the presence or absence of correlations in real networks. The knn(k) function also plays an important role in analytical calculations, allowing us to predict the impact of degree correlations on various network characteristics (SECTION 7.6). Yet, it is often convenient to use a single number to capture the magnitude of correlations present in a network. This can be achieved either through the correlation exponent μ defined in (7.10), or using the degree correlation coefficient introduced in BOX 7.2.

Figure 7.6 Degree Correlation Function
The degree correlation function knn(k) for three real networks. The panels show knn(k) on a log-log plot to test the validity of the scaling law (7.10).
(a) Collaboration Network
The increasing knn(k) with k indicates that the network is assortative.
(b) Power Grid
The horizontal knn(k) indicates the lack of degree correlations, in line with (7.9) for neutral networks.
(c) Metabolic Network
The decreasing knn(k) documents the network’s disassortative nature.
On each panel the horizontal line corresponds to the prediction (7.9) and the green dashed line is a fit to (7.10).
[Figure 7.6: log-log plots of knn(k) versus k for (a) SCIENTIFIC COLLABORATION (assortative, fit ~k^0.37), (b) POWER GRID (neutral, fit ~k^0.04) and (c) METABOLIC NETWORK (disassortative, fit ~k^−0.76), each with the random prediction shown as a horizontal line.]
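The degree correlation function and the exponent μ can be estimated from an edge list as sketched below. This is a minimal illustration rather than the fitting procedure behind Figure 7.6, which the text does not specify: the function names are hypothetical, the network is assumed to be undirected and simple, and μ is obtained from an unweighted least-squares line through log knn(k) versus log k.

```python
import math
from collections import Counter, defaultdict


def knn_by_degree(edges):
    """Average (7.6) over all nodes of the same degree, giving knn(k)."""
    neighbors = defaultdict(list)
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}
    sums, counts = defaultdict(float), Counter()
    for node, nbrs in neighbors.items():
        knn_node = sum(degree[n] for n in nbrs) / degree[node]  # (7.6)
        sums[degree[node]] += knn_node
        counts[degree[node]] += 1
    return {k: sums[k] / counts[k] for k in sums}


def correlation_exponent(knn):
    """Slope of log knn(k) versus log k, i.e. mu in (7.10).

    Requires at least two distinct degrees.
    """
    xs = [math.log(k) for k in knn]
    ys = [math.log(v) for v in knn.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


# A star with four leaves: the hub sees only degree-1 neighbors,
# the leaves see only the degree-4 hub.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
knn = knn_by_degree(star)
print("knn(k) =", knn)
print("mu =", correlation_exponent(knn))
```

For real data the fit is usually restricted to the range of k where knn(k) is well sampled, since high-degree bins contain few nodes and are noisy.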
BOX 7.1
FRIENDSHIP PARADOX
The friendship paradox makes a surprising statement: On average my friends are more popular than I am [6,7]. This claim is rooted in (7.9), telling us that the average degree of a node's neighbors is not simply ⟨k⟩, but depends on ⟨k²⟩ as well.

Consider a random network, for which ⟨k²⟩ = ⟨k⟩(1 + ⟨k⟩). According to (7.9) knn(k) = 1 + ⟨k⟩. Therefore the average degree of a node's neighbors is always higher than the average degree of a randomly chosen node, which is ⟨k⟩.

The gap between ⟨k⟩ and our friends' degree can be particularly large in scale-free networks, for which ⟨k²⟩/⟨k⟩ significantly exceeds ⟨k⟩ (Figure 4.8). Consider for example the actor network, for which ⟨k²⟩/⟨k⟩ = 565 (Table 4.1). In this network the average degree of a node's friends is hundreds of times the degree of the node itself.

The friendship paradox has a simple origin: We are more likely to be friends with hubs than with small-degree nodes, simply because hubs have more friends than the small nodes.

BOX 7.2
DEGREE CORRELATION COEFFICIENT
If we wish to characterize degree correlations using a single number, we can use either μ or the degree correlation coefficient. Proposed by Mark Newman [8,9], the degree correlation coefficient is defined as

$$r = \sum_{jk} \frac{jk\,(e_{jk} - q_j q_k)}{\sigma^2} \qquad (7.11)$$

with

$$\sigma^2 = \sum_k k^2 q_k - \Big[\sum_k k\, q_k\Big]^2 . \qquad (7.12)$$

Hence r is the Pearson correlation coefficient between the degrees found at the two ends of the same link. It varies between −1 ≤ r ≤ 1: For r > 0 the network is assortative, for r = 0 the network is neutral and for r < 0 the network is disassortative. For example, for the scientific collaboration network we obtain r = 0.13, in line with its assortative nature; for the protein interaction network r = −0.04, supporting its disassortative nature; and for the power grid we have r = 0.

The assumption behind the degree correlation coefficient is that knn(k) depends linearly on k with slope r. In contrast the correlation exponent μ assumes that knn(k) follows the power law (7.10). Naturally, both cannot be valid simultaneously. The analytical models of SECTION 7.6 offer some guidance, supporting the validity of (7.10). As we show in ADVANCED TOPICS 7.A, in general r correlates with μ.
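Since r is the Pearson correlation coefficient between the degrees at the two ends of a link, it can be estimated directly from an edge list without building e_jk and q_k explicitly. The sketch below takes that route; the function name and input format are assumptions made for illustration. It uses degrees rather than the quantities entering (7.11), which gives the same Pearson coefficient as long as the two differ only by a constant shift.

```python
import math
from collections import defaultdict


def degree_correlation_coefficient(edges):
    """Pearson correlation between the degrees at the two ends of each link.

    `edges` is a list of (u, v) pairs of an undirected graph. Every link is
    counted in both directions so that the measure is symmetric.
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]

    n = len(xs)
    mean = sum(xs) / n  # identical for xs and ys by symmetry
    variance = sum((x - mean) ** 2 for x in xs) / n
    if variance == 0:
        return math.nan  # all link ends have the same degree: r is undefined
    covariance = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    return covariance / variance


star = [(0, leaf) for leaf in range(1, 5)]   # hub linked only to leaves
path = [(0, 1), (1, 2), (2, 3)]              # short chain
r_star = degree_correlation_coefficient(star)
r_path = degree_correlation_coefficient(path)
```

Both toy graphs yield a negative r, in line with the sign convention of the box: high-degree nodes connect to low-degree ones.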
SECTION 7.4
STRUCTURAL CUTOFFS
Throughout this book we assumed that networks are simple, meaning that there is at most one link between two nodes (Figure 2.17). For example, in the email network we place a single link between two individuals that are in email contact, despite the fact that they may have exchanged multiple messages. Similarly, in the actor network we connect two actors with a single link if they acted in the same movie, independent of the number of joint movies. All datasets discussed in Table 4.1 are simple networks.

In simple networks there is a puzzling conflict between the scale-free property and degree correlations [10, 11]. Consider for example the scale-free network of Figure 7.7a, whose two largest hubs have degrees k = 55 and k' = 46. In a network with degree correlations e_kk' the expected number of links between k and k' is

$$E_{kk'} = e_{kk'} \langle k \rangle N . \qquad (7.13)$$

For a neutral network e_kk' is given by (7.5), which, using (7.3), predicts

$$E_{kk'} = \frac{k p_k \, k' p_{k'}}{\langle k \rangle} N = \frac{\frac{55}{300} \cdot \frac{46}{300}}{3} \, 300 = 2.8 . \qquad (7.14)$$
Therefore, given the size of these two hubs, they should be connected to each other by two to three links to comply with the network's neutral nature. Yet, in a simple network we can have only one link between them, causing a conflict between degree correlations and the scale-free property. The goal of this section is to understand the origin and the consequences of this conflict.

For small k and k' (7.14) predicts that Ekk' is also small, i.e. we expect less than one link between the two nodes. Only for nodes whose degree exceeds some threshold ks does (7.14) predict multiple links. As we show in ADVANCED TOPICS 7.B, ks, called the structural cutoff, scales as

$$k_s(N) \sim (\langle k \rangle N)^{1/2} . \qquad (7.15)$$

Figure 7.7
Structural Disassortativity
(a) A scale-free network with N = 300, L = 450, and γ = 2.2, generated by the configuration model (Figure 4.15). By forbidding self-loops and multi-links, we made the network simple. We highlight the two largest nodes in the network. As (7.14) predicts, to maintain the network's neutral nature, we need approximately three links between these two nodes. The fact that we do not allow multi-links (simple network representation) makes the network disassortative, a phenomenon called structural disassortativity.
(b) To illustrate the origins of structural correlations we start from a fixed degree sequence, shown as individual stubs on the left. Next we randomly connect the stubs (configuration model). In this case the expected number of links between the nodes with degree 8 and 7 is 8 × 7/28 = 2. Yet, if we do not allow multi-links, there can only be one link between these two nodes, making the network structurally disassortative.
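The arithmetic behind (7.14) and the scale set by (7.15) can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and because (7.15) is a scaling relation, the prefactor is simply assumed to be 1 here.

```python
import math


def expected_links_neutral(k, k_prime, n_k, n_k_prime, n_nodes, mean_degree):
    """Expected number of links between degree classes k and k', following (7.14).

    n_k and n_k_prime are the numbers of nodes with degree k and k'.
    """
    p_k = n_k / n_nodes
    p_k_prime = n_k_prime / n_nodes
    return k * p_k * k_prime * p_k_prime * n_nodes / mean_degree


def structural_cutoff(n_nodes, mean_degree):
    """Scale of the structural cutoff (7.15); the prefactor 1 is an assumption."""
    return math.sqrt(mean_degree * n_nodes)


# Network of Figure 7.7a: N = 300, L = 450, hence <k> = 2L/N = 3.
e_hubs = expected_links_neutral(55, 46, 1, 1, 300, 3.0)
k_s = structural_cutoff(300, 3.0)
print(round(e_hubs, 2), k_s)  # prints 2.81 30.0
```

Both hubs (k = 55 and k' = 46) lie above the estimated cutoff of about 30, which is consistent with (7.14) asking for more than one link between them.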
In other words, nodes whose degree exceeds (7.15) have Ekk' > 1, a conflict that, as we show below, gives rise to degree correlations.

To understand the consequences of the structural cutoff we must first ask if a network has nodes whose degree exceeds (7.15). For this we compare the structural cutoff, ks, with the natural cutoff, kmax, which is the expected largest degree in a network. According to (4.18), for a scale-free network

$$k_{max} \sim N^{\frac{1}{\gamma - 1}} .$$

Comparing kmax to ks allows us to distinguish two regimes:

• No Structural Cutoff
For random networks and scale-free networks with γ ≥ 3 the exponent of kmax is at most 1/2, hence kmax does not exceed ks. In other words the node size at which the structural cutoff turns on exceeds the size of the biggest hub. Consequently we have no nodes for which Ekk' > 1. For these networks we do not have a conflict between degree correlations and the simple network requirement.

• Structural Disassortativity
For scale-free networks with γ < 3 we have 1/(γ − 1) > 1/2, i.e. ks can be smaller than kmax. Consequently nodes whose degree is between ks and kmax can have Ekk' > 1, which a simple network cannot accommodate. In other words the network has fewer links between its hubs than (7.14) would predict. These networks will therefore become disassortative, a phenomenon we call structural disassortativity. This is illustrated in Figures 7.8a,b that show a simple scale-free network generated by the configuration model. The network shows disassortative scaling, despite the fact that we did not impose degree correlations during its construction.

We have two avenues to generate networks that are free of structural disassortativity:

(i) We can relax the simple network requirement, allowing multiple links between the nodes. The conflict disappears and the network will be neutral (Figures 7.8c,d).

(ii) If we insist on having a simple scale-free network that is neutral or assortative, we must remove all hubs with degrees larger than ks. This is illustrated in Figures 7.8e,f: a network that lacks nodes with k ≥ 100 is neutral.

Figure 7.8
Natural and Structural Cutoffs
The figure illustrates the tension between the scale-free property and degree correlations. We show the degree distribution pk (left panels) and the degree correlation function knn(k) (right panels), both as a function of k on log-log axes, for a scale-free network with N = 10,000 and γ = 2.5, generated by the configuration model (Figure 4.15).
(a,b) If we generate a scale-free network with the power-law degree distribution shown in (a), and we forbid self-loops and multi-links, the network displays structural disassortativity, as indicated by knn(k) in (b). In this case, we lack a sufficient number of links between the high-degree nodes to maintain the neutral nature of the network, hence for high k the knn(k) function must decay.
(c,d) We can eliminate structural disassortativity by relaxing the simple network requirement, i.e. allowing multiple links between two nodes. As shown in (c,d), in this case we obtain a neutral scale-free network.
(e,f) If we impose an upper cutoff by removing all nodes with k ≥ ks ≃ 100, as predicted by (7.15), the network becomes neutral, as seen in (f).

Finally, how can we decide whether the correlations observed in a particular network are a consequence of structural disassortativity, or are generated by some unknown process that leads to degree correlations? Degree-preserving randomization (Figure 4.17) helps us distinguish these two possibilities:

(i) Degree Preserving Randomization with Simple Links (R-S)
We apply degree-preserving randomization to the original network and at each step we make sure that we do not permit more than one link between a pair of nodes. On the algorithmic side this means that each rewiring that generates multi-links is discarded. If the real knn(k) and the randomized $k_{nn}^{R-S}(k)$ are indistinguishable, then the correlations observed in a real system are all structural, fully explained by the degree distribution. If the randomized $k_{nn}^{R-S}(k)$ does not show degree correlations while knn(k) does, there is some unknown process that generates the observed degree correlations.

(ii) Degree Preserving Randomization with Multiple Links (R-M)
For a self-consistency check it is sometimes useful to perform degree-preserving randomization that allows for multiple links between the nodes. On the algorithmic side this means that we allow each random rewiring, even if it leads to multi-links. This process eliminates all degree correlations.

We performed the randomizations discussed above for three real networks. As Figure 7.9a shows, the assortative nature of the scientific collaboration network disappears under both randomizations. This indicates that the assortative correlations of the collaboration network are not linked to its scale-free nature. In contrast, for the metabolic network the observed disassortativity remains unchanged under R-S (Figure 7.9c). Consequently the disassortativity of the metabolic network is structural, being induced by its degree distribution.

Figure 7.9
Randomization and Degree Correlations
To uncover the origin of the observed degree correlations, we must compare knn(k) (grey symbols) with $k_{nn}^{R-S}(k)$ and $k_{nn}^{R-M}(k)$ obtained after degree-preserving randomization. Each panel shows knn(k) versus k on log-log axes for the real network, R-S and R-M. Two degree-preserving randomizations are informative in this context:
Randomization with Simple Links (R-S): At each step of the randomization process we check that we do not have more than one link between any node pair.
Randomization with Multiple Links (R-M): We allow multi-links during the randomization process.
We performed these two randomizations for the networks of Figure 7.6. The R-M procedure always generates a neutral network, consequently $k_{nn}^{R-M}(k)$ is always horizontal. The true insight is obtained when we compare knn(k) with $k_{nn}^{R-S}(k)$, helping us to decide if the observed correlations are structural:
(a) Scientific Collaboration Network: The increasing knn(k) differs from the horizontal $k_{nn}^{R-S}(k)$, indicating that the network's assortativity is not structural. Consequently the assortativity is generated by some process that governs the network's evolution. This is not unexpected: structural effects can generate only disassortativity, not assortativity.
(b) Power Grid: The horizontal knn(k), $k_{nn}^{R-S}(k)$ and $k_{nn}^{R-M}(k)$ all support the lack of degree correlations (neutral network).
(c) Metabolic Network: As both knn(k) and $k_{nn}^{R-S}(k)$ decrease, we conclude that the network's disassortativity is induced by its scale-free property. Hence the observed degree correlations are structural.

In summary, the scale-free property can induce disassortativity in simple networks. Indeed, in neutral or assortative networks we expect multiple links between the hubs. If multiple links are forbidden (simple graph), the network will display disassortative tendencies. This conflict vanishes for scale-free networks with γ ≥ 3 and for random networks. It also vanishes if we allow multiple links between the nodes.
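The two degree-preserving randomizations of this section, R-S and R-M, can be sketched as repeated link swaps on an edge list. The code below is an illustration under stated assumptions rather than the procedure used for the figures: the function names are hypothetical, self-loops are rejected together with multi-links in the R-S variant, and the number of attempted swaps is left to the caller.

```python
import random
from collections import Counter


def degree_sequence(edges):
    """Degree of every node; a self-loop counts twice."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


def randomize_links(edges, n_attempts, simple=True, seed=None):
    """Degree-preserving randomization by repeated link swaps.

    simple=True  -> R-S: swaps creating self-loops or multi-links are discarded.
    simple=False -> R-M: every swap is accepted.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    multiplicity = Counter(frozenset(e) for e in edges)
    for _ in range(n_attempts):
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:
            c, d = d, c  # choose the orientation of the second link at random
        new_1, new_2 = (a, d), (c, b)
        if simple:
            if a == d or c == b:
                continue  # would create a self-loop
            if multiplicity[frozenset(new_1)] or multiplicity[frozenset(new_2)]:
                continue  # would create a multi-link
        for old in (edges[i], edges[j]):
            multiplicity[frozenset(old)] -= 1
        for new in (new_1, new_2):
            multiplicity[frozenset(new)] += 1
        edges[i], edges[j] = new_1, new_2
    return edges


ring_with_chords = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
rs = randomize_links(ring_with_chords, 200, simple=True, seed=1)
rm = randomize_links(ring_with_chords, 200, simple=False, seed=1)
```

Measuring knn(k) on many such randomized copies and comparing it with the knn(k) of the original network is the comparison described above: agreement with the R-S version points to structural correlations.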
SECTION 7.5
CORRELATIONS IN REAL NETWORKS
To understand the prevalence of degree correlations we need to inspect the correlations characterizing real networks. In Figure 7.10 we show the knn(k) function for the ten reference networks, observing several patterns:

• Power Grid
For the power grid knn(k) is flat and indistinguishable from its randomized version, indicating a lack of degree correlations (Figure 7.10a). Hence the power grid is neutral.

• Internet
For small degrees (k ≤ 30) knn(k) shows a clear assortative trend, an effect that levels off for high degrees (Figure 7.10b). The degree correlations vanish in the randomized version of the Internet map. Hence the Internet is assortative, but structural cutoffs eliminate the effect for high k.

• Social Networks
The three networks capturing social interactions, the mobile phone network, the science collaboration network and the actor network, all have an increasing knn(k), indicating that they are assortative (Figures 7.10c-e). Hence in these networks hubs tend to link to other hubs and low-degree nodes tend to link to low-degree nodes. The fact that the observed knn(k) differs from $k_{nn}^{R-S}(k)$ indicates that the assortative nature of social networks is not due to their scale-free degree distribution.

• Email Network
While the email network is often seen as a social network, its knn(k) decreases with k, documenting a clear disassortative behavior (Figure 7.10f). The randomized $k_{nn}^{R-S}(k)$ also decays, indicating that we are observing structural disassortativity, a consequence of the network's scale-free nature.

• Biological Networks
The protein interaction and the metabolic network both have a negative μ, suggesting that these networks are disassortative. Yet, the scaling of $k_{nn}^{R-S}(k)$ is indistinguishable from knn(k), indicating that we are observing structural disassortativity, rooted in the scale-free nature of these networks (Figure 7.10g,h).

• WWW
The decaying knn(k) implies disassortative correlations (Figure 7.10i). The randomized $k_{nn}^{R-S}(k)$ also decays, but not as rapidly as knn(k). Hence the disassortative nature of the WWW is not fully explained by its degree distribution.

• Citation Network
This network displays a puzzling behavior: for k ≤ 20 the degree correlation function knn(k) shows a clear assortative trend; for k > 20, however, we observe disassortative scaling (Figure 7.10j). Such mixed behavior can emerge in networks that display extreme assortativity (Figure 7.13b). This suggests that the citation network is strongly assortative, but its scale-free nature induces structural disassortativity, changing the slope of knn(k) for k ≫ ks.

In summary, Figure 7.10 indicates that to understand degree correlations, we must always compare knn(k) to the degree-randomized $k_{nn}^{R-S}(k)$. It also allows us to draw some interesting conclusions:

(i) Of the ten reference networks the power grid is the only truly neutral network. Hence most real networks display degree correlations.

(ii) All networks that display disassortative tendencies (email, protein, metabolic) do so thanks to their scale-free property. Hence, these are all structurally disassortative. Only the WWW shows disassortative correlations that are only partially explained by its degree distribution.

(iii) The degree correlations characterizing assortative networks are not explained by their degree distribution. Most social networks (mobile phone calls, scientific collaboration, actor network) are in this class and so are the Internet and the citation network.

A number of mechanisms have been proposed to explain the origin of the observed assortativity. For example, the tendency of individuals to form communities, the topic of CHAPTER 9, can induce assortative correlations [12]. Similarly, society has endless mechanisms, from professional committees to TV shows, to bring hubs together, enhancing the assortative nature of social and professional networks. Finally, homophily, a well documented social phenomenon [13], indicates that individuals tend to associate with other individuals of similar background and characteristics, hence individuals with comparable degree tend to know each other. This degree-homophily may be responsible for the celebrity marriages as well (Figure 7.1).
Figure 7.10
Randomization and Degree Correlations
Panels, each showing knn(k) versus k on log-log axes together with the fitted μ: (a) Power Grid, (b) Internet, (c) Mobile Phone Calls, (d) Scientific Collaboration, (e) Actor, (f) Email, (g) Protein, (h) Metabolic, (i) WWW, (j) Citation. Legend: Real Network (log-bin), Real Network (lin-bin), R-S.
The degree correlation function knn(k) for the ten reference networks (Table 4.1). The grey symbols show the knn(k) function using linear binning; purple circles represent the same data using log-binning (SECTION 4.11). The green dotted line corresponds to the best fit to (7.10) within the fitting interval marked by the arrows at the bottom. Orange squares represent $k_{nn}^{R-S}(k)$ obtained for 100 independent degree-preserving randomizations, while maintaining the simple character of these networks. Note that we made directed networks undirected when we measured knn(k). To fully characterize the correlations emerging in directed networks we must use the directed correlation function (BOX 7.3).
BOX 7.3
CORRELATIONS IN DIRECTED NETWORKS
The degree correlation function (7.7) is defined for undirected networks. To measure correlations in directed networks we must take into account that each node i is characterized by an incoming $k_i^{in}$ and an outgoing $k_i^{out}$ degree [14]. We therefore define four degree correlation functions, $k_{nn}^{\alpha,\beta}(k)$, where α and β refer to the in and out indices (Figures 7.11a-d). In Figure 7.11e we show $k_{nn}^{\alpha,\beta}(k)$ for citation networks, indicating a lack of in-out correlations and the presence of assortativity for small k for the other three correlations (in-in, out-in, out-out).

Figure 7.11
Correlations in Directed Networks
(a)-(d) The four possible correlations characterizing a directed network: (a) in-in, (b) in-out, (c) out-in, (d) out-out. We show in purple and green the (α, β) indices that define the appropriate correlation function [14]. For example, (a) describes the $k_{nn}^{in,in}(k)$ correlations between the in-degrees of two nodes connected by a link.
(e) The $k_{nn}^{\alpha,\beta}(k)$ correlation function for citation networks, a directed network, shown as a function of $k^{\beta}$ on log-log axes. For example $k_{nn}^{in,in}(k)$ is the average in-degree of the in-neighbors of nodes with in-degree $k^{in}$. These functions show a clear assortative tendency for three of the four functions up to degree k ≃ 100. The empty symbols capture the degree-randomized $k_{nn}^{\alpha,\beta}(k)$ for each degree correlation function (R-S randomization).
SECTION 7.6
GENERATING CORRELATED NETWORKS
To explore the impact of degree correlations on various network characteristics we must first understand the correlations characterizing the network models discussed thus far. It is equally important to develop algorithms that can generate networks with tunable correlations. As we show in this section, given the conflict between the scale-free property and degree correlations, this is not a trivial task.

DEGREE CORRELATIONS IN STATIC MODELS

Erdős-Rényi Model
The random network model is neutral by definition. As it lacks hubs, it does not develop structural correlations either. Hence for the Erdős-Rényi network knn(k) is given by (7.9), predicting μ = 0 for any ⟨k⟩ and N.

Configuration Model
The configuration model (Figure 4.15) is also neutral, independent of the choice of the degree distribution pk. This is because the model allows for both multi-links and self-loops. Consequently, any conflicts caused by the hubs are resolved by the multiple links between them. If, however, we force the network to be simple, then the generated network will develop structural disassortativity (Figure 7.8).

Hidden Parameter Model
In the model e_jk is proportional to the product of the randomly chosen hidden variables ηj and ηk (Figure 4.18). Consequently the network is technically uncorrelated. However, if we do not allow multi-links, for scale-free networks we again observe structural disassortativity. Analytical calculations indicate that in this case [18]

$$k_{nn}(k) \sim k^{-1} , \qquad (7.16)$$

i.e. the degree correlation function follows (7.10) with μ = −1.

Taken together, the static models explored so far generate either neutral networks, or networks characterized by structural disassortativity following (7.16).
DEGREE CORRELATIONS IN EVOLVING NETWORKS
To understand the emergence (or the absence) of degree correlations in growing networks, we start with the initial attractiveness model (SECTION 6.5), which includes as a special case the Barabási-Albert model.

Initial Attractiveness Model
Consider a growing network in which preferential attachment follows (6.23), i.e. Π(k) ∼ A + k, where A is the initial attractiveness. Depending on the value of A, we observe three distinct scaling regimes [15]:

(i) Disassortative Regime: γ < 3
If −m < A < 0 we have

$$k_{nn}(k) \simeq m^{1 - \frac{A}{m}} \, \frac{m + A}{2m + A} \, \zeta\!\left(\frac{2m}{2m + A}\right) N^{-\frac{A}{2m + A}} \, k^{\frac{A}{m}} . \qquad (7.17)$$

Hence the resulting network is disassortative, knn(k) decaying following the power-law [15, 16]

$$k_{nn}(k) \sim k^{-\frac{|A|}{m}} . \qquad (7.18)$$

(ii) Neutral Regime: γ = 3
If A = 0 the initial attractiveness model reduces to the Barabási-Albert model. In this case

$$k_{nn}(k) \simeq \frac{m}{2} \ln N . \qquad (7.19)$$

Consequently knn(k) is independent of k, hence the network is neutral.

(iii) Weak Assortativity: γ > 3
If A > 0 the calculations predict

$$k_{nn}(k) \approx (m + A) \ln\!\left(\frac{k}{m + A}\right) . \qquad (7.20)$$

As knn(k) increases logarithmically with k, the resulting network displays a weak assortative tendency, but does not follow (7.10).

In summary, (7.17) - (7.20) indicate that the initial attractiveness model generates rather complex degree correlations, from disassortativity to weak assortativity. Equation (7.19) also shows that the network generated by the Barabási-Albert model is neutral. Finally, (7.17) predicts a power-law k-dependence for knn(k), offering analytical support for the empirical scaling (7.10).

Bianconi-Barabási Model
With a uniform fitness distribution the Bianconi-Barabási model generates a disassortative network [5] (Figure 7.12). The fact that the randomized version of the network is also disassortative indicates that the model's disassortativity is structural. Note, however, that the real knn(k) and the randomized $k_{nn}^{R-S}(k)$ do not overlap, indicating that the disassortativity of the model is not fully explained by its scale-free nature.

Figure 7.12
Correlations in the Bianconi-Barabási Model
The degree correlation function knn(k) versus k (log-log axes) of the Bianconi-Barabási model for N = 10,000, m = 3 and uniform fitness distribution (SECTION 6.2). As the green dotted line, a fit to (7.10), indicates, the network is disassortative, consistent with μ ≃ −0.5. The orange symbols correspond to the randomized $k_{nn}^{R-S}(k)$. As $k_{nn}^{R-S}(k)$ also decreases, the bulk of the observed disassortativity is structural. But the difference between knn(k) and $k_{nn}^{R-S}(k)$ suggests that structural effects cannot fully account for the observed degree correlation.

TUNING DEGREE CORRELATIONS
Several algorithms can generate networks with desired degree correlations [8, 17, 18]. Next we discuss a simplified version of the algorithm proposed by Xulvi-Brunet and Sokolov that aims to generate maximally correlated networks with a predefined degree sequence [19, 20, 21]. It consists of the following steps (Figure 7.13a):

• Step 1: Link Selection
Choose at random two links. Label the four nodes at the end of these two links with a, b, c, and d such that their degrees are ordered as ka ≥ kb ≥ kc ≥ kd.

• Step 2: Rewiring
Break the selected links and rewire them to form new pairs. Depending on the desired degree correlations the rewiring is done in two ways:

• Step 2A: Assortative
By pairing the two highest degree nodes (a with b) and the two lowest degree nodes (c with d), we connect nodes with comparable degrees, enhancing the network's assortative nature.

• Step 2B: Disassortative
By pairing the highest and the lowest degree nodes (a with d and b with c), we connect nodes with different degrees, enhancing the network's disassortative nature.

By iterating these steps we gradually enhance the network's assortative (Step 2A) or disassortative (Step 2B) features. If we aim to generate a simple network (free of multi-links), after Step 2 we check whether the particular rewiring leads to multi-links. If it does, we reject it, returning to Step 1.

The correlations characterizing the networks generated by this algorithm converge to the maximal (assortative) or minimal (disassortative) value that we can reach for the given degree sequence (Figure 7.13b). The model has no difficulty creating disassortative correlations (Figures 7.13e,f). In the assortative limit simple networks display a mixed knn(k): assortative for small k and disassortative for high k (Figure 7.13b). This is a consequence of structural cutoffs: For scale-free networks the system is unable to sustain assortativity for high k. The observed behavior is reminiscent of the knn(k) function of citation networks (Figure 7.10j).

Figure 7.13
Xulvi-Brunet & Sokolov Algorithm
The algorithm generates networks with maximal degree correlations.
(a) The basic steps of the algorithm: Step 1, link selection (nodes a, b, c, d with ka ≥ kb ≥ kc ≥ kd); Step 2, rewiring (assortative or disassortative).
(b) knn(k) for networks generated by the algorithm (assortative, neutral, disassortative) for a scale-free network with N = 1,000, L = 2,500, γ = 3.0.
(c,d) A typical network configuration and the corresponding Aij matrix for the maximally assortative network generated by the algorithm, where the rows and columns of Aij were ordered according to increasing node degrees k.
(e,f) Same as in (c,d) for a maximally disassortative network.
The Aij matrices (d) and (f) capture the inner regularity of networks with maximal correlations, consisting of blocks of nodes that connect to nodes with similar degree in (d) and of blocks of nodes that connect to nodes with rather different degrees in (f).

The version of the Xulvi-Brunet & Sokolov algorithm introduced in Figure 7.13 generates maximally assortative or disassortative networks. We can tune the magnitude of the generated degree correlations if we use the algorithm discussed in Figure 7.14.
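The rewiring steps described above, together with the probability p that tunes the strength of the correlations, can be sketched in Python as follows. This is a simplified illustration, not the authors' reference implementation: the function names, the edge-list representation and the fixed number of attempted rewirings are assumptions, and in the simple-network variant self-loops are rejected along with multi-links.

```python
import random
from collections import Counter


def degree_product_sum(edges):
    """Sum of k_u * k_v over all links, a simple proxy for assortativity."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(degree[u] * degree[v] for u, v in edges)


def rewire_correlated(edges, n_attempts, assortative=True, p=1.0,
                      simple=True, seed=None):
    """Degree-preserving rewiring that enhances (dis)assortativity."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    present = Counter(frozenset(e) for e in edges)
    for _ in range(n_attempts):
        # Step 1: choose two links at random and collect their four end nodes.
        i, j = rng.sample(range(len(edges)), 2)
        nodes = list(edges[i] + edges[j])
        if rng.random() < p:
            # Step 2: order so that k_a >= k_b >= k_c >= k_d, pair deterministically.
            a, b, c, d = sorted(nodes, key=lambda n: degree[n], reverse=True)
            new = [(a, b), (c, d)] if assortative else [(a, d), (b, c)]
        else:
            # With probability 1 - p pair the four nodes at random.
            rng.shuffle(nodes)
            new = [(nodes[0], nodes[1]), (nodes[2], nodes[3])]
        old = [edges[i], edges[j]]
        for e in old:
            present[frozenset(e)] -= 1
        if simple and (
            new[0][0] == new[0][1] or new[1][0] == new[1][1]
            or frozenset(new[0]) == frozenset(new[1])
            or present[frozenset(new[0])] > 0
            or present[frozenset(new[1])] > 0
        ):
            for e in old:
                present[frozenset(e)] += 1  # reject and restore the old links
            continue
        for e in new:
            present[frozenset(e)] += 1
        edges[i], edges[j] = new
    return edges


toy = ([(0, i) for i in range(1, 7)] + [(i, i + 1) for i in range(1, 6)]
       + [(6, 7), (7, 8), (8, 9)])
more_assortative = rewire_correlated(toy, 500, assortative=True, seed=3)
more_disassortative = rewire_correlated(toy, 500, assortative=False, seed=3)
```

Because every accepted step with p = 1 pairs the four end nodes in the most (or least) degree-similar way, the degree product sum can only grow in the assortative mode and only shrink in the disassortative mode, while each node keeps its degree.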
In summary, static models, like the configuration or hidden parameter model, are neutral if we allow multi-links, and develop structural disassortativity if we force them to generate simple networks. To generate networks with tunable correlations, we can use for example the Xulvi-Brunet & Sokolov algorithm. An important result of this section is (7.16) and (7.18), offering the analytical form of the degree correlation function for the hidden parameter model and for a growing network, in both cases predicting a power-law k-dependence. These results offer analytical backing for the scaling hypothesis (7.10), indicating that both structural and dynamical effects can result in a degree correlation function that follows a power law.
Figure 7.14
Tuning Degree Correlations
We can use the Xulvi-Brunet & Sokolov algorithm to tune the magnitude of degree correlations.
(a) We execute the deterministic rewiring step with probability p, and with probability 1 − p we randomly pair the a, b, c, d nodes with each other. For p = 1 we are back to the algorithm of Figure 7.13, generating maximal degree correlations; for p < 1 the induced noise tunes the magnitude of the effect.
(b) Typical network configurations generated for p = 0.5.
(c) The knn(k) functions for various p values (p = 0.2, 0.4, 0.6, 0.8, 1.0), shown for the assortative and the disassortative case, for a network with N = 10,000, ⟨k⟩ = 1, and γ = 3.0.
Note that the correlation exponent μ depends on the fitting region, especially in the assortative case.
SECTION 7.7
THE IMPACT OF DEGREE CORRELATIONS
As we have seen in Figure 7.10, most real networks are characterized by some degree correlations. Social networks are assortative; biological networks display structural disassortativity. These correlations raise an important question: Why do we care? In other words, do degree correlations alter the properties of a network? And which network properties do they influence? This section addresses these important questions.

An important property of a random network is the emergence of a phase transition at ⟨k⟩ = 1, marking the appearance of the giant component (SECTION 3.6). Figure 7.15 shows the relative size of the giant component for networks with different degree correlations, documenting several patterns [8, 19, 20]:

Figure 7.15
Degree Correlations and the Phase Transition Point
Relative size S/N of the giant component for an Erdős-Rényi network of size N = 10,000 (green curve), which is then rewired using the Xulvi-Brunet & Sokolov algorithm with p = 0.5 to induce degree correlations (Figure 7.14). The figure indicates that as we move from assortative to disassortative networks, the phase transition point is delayed and the size of the giant component increases for large ⟨k⟩. Each point represents an average over 10 independent runs.
• Assortative Networks
For assortative networks the phase transition point moves to a lower ⟨k⟩, hence a giant component emerges for ⟨k⟩ < 1. The reason is that it is easier to start a giant component if the high-degree nodes seek out each other.
• Disassortative Networks
The phase transition is delayed in disassortative networks, as in these the hubs tend to connect to small-degree nodes. Consequently, disassortative networks have difficulty forming a giant component.
• Giant Component
For large ⟨k⟩ the giant component is smaller in assortative networks than in neutral or disassortative networks. Indeed, assortativity forces the hubs to link to each other, hence they fail to attract the numerous small-degree nodes to the giant component.

These changes in the size and the structure of the giant component have implications for the spread of diseases [22, 23, 24], the topic of CHAPTER 10. Indeed, as we have seen in Figure 7.10, social networks tend to be assortative. The high-degree nodes therefore form a giant component that acts as a “reservoir” for the disease, sustaining an epidemic even when on average the network is not sufficiently dense for the virus to persist.
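The neutral baseline of Figure 7.15 can be explored with a short simulation. The sketch below is an illustration under stated assumptions, not the code behind the figure: it places L = ⟨k⟩N/2 links uniformly at random, tracks components with a union-find structure, and reports the relative size S/N of the largest component. The names `giant_fraction` and `find` are hypothetical. Rewiring the links toward assortativity or disassortativity before measuring S/N would be needed to obtain the other two curves.

```python
import random

def giant_fraction(n, avg_k, rng=random):
    """Relative size S/N of the largest component of a random network."""
    parent = list(range(n))

    def find(x):
        # Find the component root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    target = int(avg_k * n / 2)          # number of links for the desired <k>
    links = set()
    while len(links) < target:
        u, v = rng.randrange(n), rng.randrange(n)
        if u != v:
            links.add((min(u, v), max(u, v)))   # no self-loops or multi-links
    for u, v in links:
        parent[find(u)] = find(v)        # merge the two components

    sizes = {}
    for x in range(n):
        root = find(x)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

# Scan the average degree across the transition at <k> = 1.
for avg_k in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(avg_k, round(giant_fraction(5000, avg_k, random.Random(0)), 3))
```

Below ⟨k⟩ = 1 the largest component should contain only a tiny fraction of the nodes, while well above it most nodes belong to the giant component.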
The altered giant component has implications for network robustness as well [25]. As we discuss in CHAPTER 8, the removal of a network's hubs fragments a network. In assortative networks hub removal does less damage because the hubs form a core group, hence many of them are redundant. Hub removal is more damaging in disassortative networks, as in these the hubs connect to many small-degree nodes, which fall off the network once a hub is deleted.

Figure 7.16
Degree Correlations and Path Lengths
Distance distribution pd for a random network with size N = 10,000 and ⟨k⟩ = 3, shown for assortative, neutral and disassortative rewiring. Correlations are induced using the Xulvi-Brunet & Sokolov algorithm with p = 0.5 (Figure 7.14). The plots show that as we move from disassortative to assortative networks, the average path length decreases, indicated by the gradual move of the peaks to the left. At the same time the diameter, dmax, grows. Each curve represents an average over 10 independent networks.

Let us mention a few additional consequences of degree correlations:
• Figure 7.16 shows the path-length distribution of a random network rewired to display different degree correlations. It indicates that in assortative networks the average path length is shorter than in neutral networks. The most dramatic difference is in the network diameter, dmax, which is significantly higher for assortative networks. Indeed, assortativity favors links between nodes with similar degree, resulting in long chains of k = 2 nodes, enhancing dmax (Figure 7.13c).
• Degree correlations influence a system's stability against stimuli and perturbations [26] as well as the synchronization of oscillators placed on a network [27, 28].
• Degree correlations have a fundamental impact on the vertex cover problem [29], a much-studied problem in graph theory that requires us to find the minimal set of nodes (cover) such that each link is connected to at least one node in the cover (BOX 7.4).
• Degree correlations impact our ability to control a network, altering the number of input signals one needs to achieve full control [30].

In summary, degree correlations are not only of academic interest, but they influence numerous network characteristics and have a discernible impact on many processes that take place on a network.
BOX 7.4 VERTEX COVER AND MUSEUM GUARDS
Imagine that you are the director of an open-air museum located in a large park. You wish to place guards on the crossroads to observe each path. Yet, to save cost you want to use as few guards as possible. How many guards do you need?

Let N be the number of crossroads and m < N the number of guards you can afford to hire. While there are $\binom{N}{m}$ ways of placing the m guards at N crossroads, most configurations leave some paths unsupervised [31].

The number of trials one needs to place the guards so that they cover all paths grows exponentially with N. Indeed, this is one of the six basic NP-complete problems, called the vertex cover problem. The vertex cover of a network is a set of nodes such that each link is connected to at least one node of the set (Figure 7.17). NP-completeness means that there is no known algorithm that can identify a minimal vertex cover substantially faster than an exhaustive search, i.e. checking each possible configuration individually. The number of nodes in the minimal vertex cover depends on the network topology, being affected by the degree distribution and degree correlations [29].

Figure 7.17
The Minimum Cover
Formally, a vertex cover of a network is a set C of nodes such that each link of the network connects to at least one node in C. A minimum vertex cover is a vertex cover of smallest possible size. The figure shows examples of minimum vertex covers in two small networks, where the set C is shown in purple. We can check that if we turn any of the purple nodes into green nodes, at least one link will not connect to a purple node.
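The exhaustive search mentioned in BOX 7.4 can be written down directly for small networks. The sketch below is only an illustration of the brute-force idea: the function name `minimum_vertex_cover` and the link-list input format are assumptions made here, and the running time grows exponentially with the number of nodes, so it is practical only for toy examples.

```python
from itertools import combinations

def minimum_vertex_cover(nodes, links):
    """Return one smallest set of nodes that touches every link."""
    nodes = list(nodes)
    for m in range(len(nodes) + 1):              # try m = 0, 1, 2, ... guards
        for candidate in combinations(nodes, m):
            cover = set(candidate)
            # A valid cover has at least one end of every link in it.
            if all(u in cover or v in cover for u, v in links):
                return cover

# Five crossroads joined by four paths in a row.
paths = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(minimum_vertex_cover(range(5), paths))
```

The loop over m tries ever larger guard counts and stops at the first one that works, so the returned cover is guaranteed to be of minimum size, although other covers of the same size may exist.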
SECTION 7.8
SUMMARY
Degree correlations were first discovered in 2001 in the context of the Internet by Romualdo Pastor-Satorras, Alexei Vázquez, and Alessandro Vespignani [4, 5], who also introduced the degree correlation function knn(k) and the scaling (7.10). A year later Kim Sneppen and Sergey Maslov used the full p(ki,kj), related to the eij matrix, to characterize the degree correlations of protein interaction networks [32]. In 2003 Mark Newman introduced the degree correlation coefficient [8, 9] together with the assortative, neutral, and disassortative distinction. These terms have their roots in social sciences [13]:

Assortative mating reflects the tendency of individuals to date or marry individuals that are similar to them. For example, low-income individuals marry low-income individuals and college graduates marry college graduates. Network theory uses assortativity in the same spirit, capturing the degree-based similarities between nodes: In assortative networks hubs tend to connect to other hubs and small-degree nodes to other small-degree nodes. In a network environment we can also encounter the traditional assortativity, when nodes of similar properties link to each other (Figure 7.18).

Disassortative mixing, when individuals link to individuals who are unlike them, is also common in some social and economic systems. Sexual networks are perhaps the best example, as most sexual relationships are between individuals of different gender. In economic settings trade typically takes place between individuals of different skills: the baker does not sell bread to other bakers, and the shoemaker rarely fixes other shoemakers' shoes.

Figure 7.18
Politics is Never Neutral
The network behind the US political blogosphere illustrates the presence of assortative mixing, as used in sociology, meaning that nodes of similar characteristics tend to link to each other. In the map each blue node corresponds to a liberal blog and each red node to a conservative blog. Blue links connect liberal blogs, red links connect conservative blogs, yellow links go from liberal to conservative, and purple from conservative to liberal. As the image indicates, very few blogs link across the political divide, demonstrating the strong assortativity of the political blogosphere. After [33].

Taken together, there are several reasons why we care about degree correlations in networks (BOX 7.5):
• Degree correlations are present in most real networks (SECTION 7.5).
• Once present, degree correlations change a network's behavior (SECTION 7.7).
• Degree correlations force us to move beyond the degree distribution, representing quantifiable patterns that govern the way nodes link to each other that are not captured by pk alone.

Despite the considerable effort devoted to characterizing degree correlations, our understanding of the phenomenon remains incomplete. For example, while in SECTION 7.6 we offered an algorithm to tune degree correlations, the problem is far from being fully resolved. Indeed, the most accurate description of a network's degree correlations is contained in the eij matrix. Generating networks with an arbitrary eij remains a difficult task.

Finally, in this chapter we focused on the knn(k) function, which captures two-point correlations. In principle higher order correlations are also present in some networks (BOX 7.6). The impact of such three- or four-point correlations remains to be understood.

BOX 7.5 AT A GLANCE: DEGREE CORRELATIONS

Degree Correlation Matrix eij
Neutral networks:
$$e_{ij} = q_i q_j = \frac{k_i p_{k_i} k_j p_{k_j}}{\langle k \rangle^2}$$

Degree Correlation Function
$$k_{nn}(k) = \sum_{k'} k' P(k'|k)$$
Neutral networks:
$$k_{nn}(k) = \frac{\langle k^2 \rangle}{\langle k \rangle}$$

Scaling Hypothesis
$$k_{nn}(k) \sim k^{\mu}$$
μ > 0: Assortative
μ = 0: Neutral
μ < 0: Disassortative

Degree Correlation Coefficient
$$r = \sum_{jk} \frac{jk(e_{jk} - q_j q_k)}{\sigma^2}$$
r > 0: Assortative
r = 0: Neutral
r < 0: Disassortative
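The two summary measures of this chapter, the degree correlation function knn(k) and the degree correlation coefficient r, can both be estimated directly from a list of links. The sketch below is a minimal illustration for undirected simple networks; the function name `degree_correlations` and the input format are assumptions made here. It treats r as the Pearson correlation between the degrees found at the two ends of a link, with each link counted once in each direction.

```python
from collections import defaultdict

def degree_correlations(links):
    """Return (knn, r) for an undirected simple network given as a link list."""
    degree = defaultdict(int)
    for u, v in links:
        degree[u] += 1
        degree[v] += 1

    # Each link contributes two (degree, neighbor degree) pairs, one per direction.
    pairs = [(degree[u], degree[v]) for u, v in links]
    pairs += [(kv, ku) for ku, kv in pairs]

    # knn(k): average degree found at the other end of links leaving degree-k nodes.
    total, count = defaultdict(float), defaultdict(int)
    for k, k_other in pairs:
        total[k] += k_other
        count[k] += 1
    knn = {k: total[k] / count[k] for k in total}

    # r: Pearson correlation between the degrees at the two ends of a link.
    n = len(pairs)
    mean = sum(k for k, _ in pairs) / n
    cov = sum((k - mean) * (j - mean) for k, j in pairs) / n
    var = sum((k - mean) ** 2 for k, _ in pairs) / n
    r = cov / var if var > 0 else float("nan")   # undefined if all degrees are equal
    return knn, r

# A small example: a triangle with a two-node tail attached.
knn, r = degree_correlations([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
print(knn, r)
```

A positive r or a knn(k) that increases with k signals assortativity, while a negative r or a decreasing knn(k) signals disassortativity. For very small networks such as the example both estimates are noisy and serve only to show the mechanics.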
BOX 7.6 TWO-POINT, THREE-POINT CORRELATIONS
The complete degree correlations characterizing a network are determined by the conditional probability P(k(1), k(2), ..., k(k) | k) that a node with degree k connects to nodes with degrees k(1), k(2), ..., k(k).

Two-point Correlations
The simplest of these is the two-point correlation discussed in this chapter, being the conditional probability P(k′|k) that a node with degree k is connected to a node with degree k′. For uncorrelated networks this conditional probability is independent of k, i.e. P(k′|k) = k′pk′/⟨k⟩ [18]. As the empirical evaluation of P(k′|k) in real networks is cumbersome, it is more practical to analyze the degree correlation function knn(k) defined in (7.7).

Three-point Correlations
Correlations involving three nodes are determined by P(k(1), k(2) | k). This conditional probability is connected to the clustering coefficient. Indeed, the average clustering coefficient C(k) [22, 23] can be formally written as the probability that a degree-k node is connected to nodes with degrees k(1) and k(2), and that those two are joined by a link, averaged over all the possible values of k(1) and k(2),

$$C(k) = \sum_{k^{(1)}, k^{(2)}} P(k^{(1)}, k^{(2)} | k)\, p^{k}_{k^{(1)}, k^{(2)}},$$

where $p^{k}_{k^{(1)}, k^{(2)}}$ is the probability that nodes with degrees k(1) and k(2) are connected, provided that they have a common neighbor with degree k [18]. For neutral networks C(k) is independent of k, following

$$C = \frac{(\langle k^2 \rangle - \langle k \rangle)^2}{\langle k \rangle^3 N}.$$
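The neutral-network expression for C at the end of BOX 7.6 depends only on the first two moments of the degree distribution, so it can be evaluated from a degree sequence alone. The helper below is a small numerical sketch; the name `neutral_clustering` and the example degree sequences are illustrative assumptions, not data from the text.

```python
def neutral_clustering(degrees):
    """Clustering coefficient expected for a neutral network with these degrees."""
    n = len(degrees)
    k1 = sum(degrees) / n                   # <k>
    k2 = sum(k * k for k in degrees) / n    # <k^2>
    return (k2 - k1) ** 2 / (k1 ** 3 * n)

# Same size, but the second sequence contains a few hubs.
print(neutral_clustering([3] * 1000))
print(neutral_clustering([2] * 990 + [100] * 10))
```

Because ⟨k²⟩ enters squared, a handful of hubs can raise the expected clustering considerably, which is worth keeping in mind when comparing a measured C with this baseline.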
SECTION 7.9
HOMEWORK
7.1. Detailed Balance for Degree Correlations
Express the joint probability ekk′, the conditional probability P(k′|k) and the probability qk, discussed in this chapter, in terms of the number of nodes N, the average degree ⟨k⟩, the number of nodes with degree k, Nk, and the number of links connecting nodes of degree k and k′, Ekk′ (note that Ekk′ is twice the number of links when k = k′). Based on these expressions, show that for any network we have

$$e_{kk'} = q_k P(k'|k).$$

7.2. Star Network
Consider a star network, where a single node is connected to N – 1 degree-one nodes. Assume that N ≫ 1.
(a) What is the degree distribution pk of this network?
(b) What is the probability qk that moving along a randomly chosen link we find at its end a node with degree k?
(c) Calculate the degree correlation coefficient r for this network. Use the expressions of ekk′ and P(k′|k) calculated in HOMEWORK 7.1.
(d) Is this network assortative or disassortative? Explain why.

7.3. Structural Cutoffs
Calculate the structural cutoff ks for the undirected networks listed in Table 4.1. Based on the plots in Figure 7.10, predict for each network whether ks is larger or smaller than the maximum expected degree kmax. Confirm your prediction by calculating kmax.

7.4. Degree Correlations in Erdős-Rényi Networks
Consider the Erdős-Rényi G(N,L) model of random networks, introduced in CHAPTER 3 (BOX 3.1 and SECTION 3.2), where N labeled nodes are connected with L randomly placed links. In this model, the probability that there is a link connecting nodes i and j depends on the existence of a link between nodes l and s.
(a) Write the probability that there is a link between i and j, eij, and the probability that there is a link between i and j conditional on the existence of a link between l and s.
(b) What is the ratio of these two probabilities for small networks? And for large networks?
(c) What do you obtain for the quantities discussed in (a) and (b) if you use the Erdős-Rényi G(N,p) model?
Based on the results found for (a)-(c), discuss the implications of using the G(N,L) model instead of the G(N,p) model for generating random networks with a small number of nodes.
SECTION 7.10
ADVANCED TOPICS 7.A DEGREE CORRELATION COEFFICIENT
In BOX 7.2 we defined the degree correlation coefficient r as an alternative measure of degree correlations [8, 9]. The use of a single number to characterize degree correlations is attractive, as it offers a way to compare the correlations observed in networks of different nature and size. Yet, to effectively use r we must be aware of its origin.

The hypothesis behind the correlation coefficient r implies that the knn(k) function can be approximated by the linear function

$$k_{nn}(k) \sim rk. \quad (7.21)$$

This is different from the scaling (7.10), which assumes a power law dependence on k. Equation (7.21) raises several issues:
• The initial attractiveness model predicts a power law (7.18) or a logarithmic k-dependence (7.20) for the degree correlation function. A similar power law is derived in (7.16) for the hidden parameter model. Consequently, r forces a linear fit to an inherently nonlinear function. This linear dependence is not supported by numerical simulations or analytical calculations. Indeed, as we show in Figure 7.19, (7.21) offers a poor fit to the data for both assortative and disassortative networks.
• As we have seen in Figure 7.10, the dependence of knn(k) on k is complex, often changing trends for large k owing to the structural cutoff. A linear fit ignores this inherent complexity.
• The maximally correlated model has a vanishing r for large N, despite the fact that the network maintains its degree correlations (BOX 7.7). This suggests that the degree correlation coefficient has difficulty detecting correlations characterizing large networks.

Table 7.1
Degree Correlations in Reference Networks
The table shows the estimated r and μ for the ten reference networks. Directed networks were made undirected to measure r and μ. Alternatively, we can use the directed correlation coefficient to characterize such directed networks (BOX 7.8).

NETWORK                  N          r        μ
Internet                 192,244    0.02     0.56
WWW                      325,729    0.05     1.11
Power Grid               4,941      0.003    0.0
Mobile Phone Calls       36,595     0.21     0.33
Email                    57,194     0.08     0.74
Science Collaboration    23,133     0.13     0.16
Actor Network            702,388    0.31     0.34
Citation Network         449,673    0.02     0.18
E. Coli Metabolism       1,039      0.25     0.76
Protein Interactions     2,018      0.04     0.1
Figure 7.19
Degree Correlation Function
The degree correlation function knn(k) for three real networks: (a) scientific collaboration (assortative), (b) power grid (neutral), (c) metabolic network (disassortative). The left panels show the cumulative function knn(k) on a log-log plot to test the validity of (7.10). The right panels show knn(k) on a lin-lin plot to test the validity of (7.21), i.e. the assumption that knn(k) depends linearly on k. This is the hypothesis behind the correlation coefficient r. The slope of the dotted line corresponds to the correlation coefficient r. As the lin-lin plots on the right illustrate, (7.21) offers a poor fit for both assortative and disassortative networks.
Relationship Between μ and r
On the positive side, r and μ are not independent of each other. To show this we calculated r and μ for the ten reference networks (TABLE 7.1). The results are plotted in Figure 7.20, indicating that μ and r correlate for positive r. Note, however, that this correlation breaks down for negative r. To understand the origin of this behavior, next we derive a direct relationship between μ and r. To be specific we assume the validity of (7.10) and determine the value of r for a network with correlation exponent μ.

Figure 7.20
Correlation Between r and μ
To illustrate the relationship between r and μ, we estimated μ by fitting the knn(k) function to (7.10), whether or not the power law scaling was statistically significant. The plot shows μ versus r for the reference networks.

We start by determining a from (7.10). We can write the second moment of the degree distribution as

$$\langle k^2 \rangle = \langle k_{nn}(k) k \rangle = \sum_k a k^{\mu+1} p_k = a \langle k^{\mu+1} \rangle,$$

which leads to

$$a = \frac{\langle k^2 \rangle}{\langle k^{\mu+1} \rangle}.$$

We now calculate r for a network with a given μ:

$$
\begin{aligned}
r &= \frac{\sum_k k a k^{\mu} q_k - \frac{\langle k^2 \rangle^2}{\langle k \rangle^2}}{\sigma_r^2}
   = \frac{\sum_k a k^{\mu+2} \frac{p_k}{\langle k \rangle} - \frac{\langle k^2 \rangle^2}{\langle k \rangle^2}}{\sigma_r^2}
   = \frac{\frac{\langle k^2 \rangle}{\langle k^{\mu+1} \rangle} \frac{\langle k^{\mu+2} \rangle}{\langle k \rangle} - \frac{\langle k^2 \rangle^2}{\langle k \rangle^2}}{\sigma_r^2} \\
  &= \frac{1}{\sigma_r^2} \frac{\langle k^2 \rangle}{\langle k \rangle} \left( \frac{\langle k^{\mu+2} \rangle}{\langle k^{\mu+1} \rangle} - \frac{\langle k^2 \rangle}{\langle k \rangle} \right). \quad (7.22)
\end{aligned}
$$

For μ = 0 the term in the last parenthesis vanishes, obtaining r = 0. Hence if μ = 0 (neutral network), the network will be neutral based on r as well. For k > 1 (7.22) suggests that for μ > 0 the parenthesis is positive, hence r > 0, and for μ < 0 the parenthesis is negative, hence r < 0. Therefore r and μ predict degree correlations of similar kind.

In summary, if the degree correlation function follows (7.10), then the sign of the degree correlation exponent μ will determine the sign of the coefficient r: μ < 0 implies r < 0, μ = 0 implies r = 0, and μ > 0 implies r > 0.

BOX 7.7 THE PROBLEM WITH LARGE NETWORKS
The Xulvi-Brunet & Sokolov algorithm helps us calculate the maximal (rmax) and the minimal (rmin) correlation coefficient for a scale-free network, obtaining [21] [...]
[...]

where Ekk′ is the number of links between nodes of degrees k and k′ for k ≠ k′ and twice the number of connecting links for k = k′, and

$$m_{kk'} = \min\{k N_k, k' N_{k'}, N_k N_{k'}\} \quad (7.25)$$

is the largest possible value of Ekk′. The origin of (7.25) is explained in Figure 7.22. Consequently, we can write rkk′ as

$$r_{kk'} = \frac{E_{kk'}}{m_{kk'}} = \frac{\langle k \rangle e_{kk'}}{\min\{kP(k), k'P(k'), NP(k)P(k')\}}.$$

As mkk′ is the maximum of Ekk′, we must have rkk′ ≤ 1 for any k and k′. [...] k > Npk′ and k′ > Npk, the effects of the restriction on the multiple links are felt, turning the expression for rkk′ into

$$r_{kk'} = \frac{\langle k \rangle e_{kk'}}{N p_k p_{k'}}. \quad (7.28)$$

Figure 7.22
Calculating mkk′
The maximum number of links one can have between two groups. The figure shows two groups of nodes, with degree k = 3 and k′ = 2. The total number of links between these two groups must not exceed kNk = 9, k′Nk′ = 8, or NkNk′ = 12. Hence mkk′ = min{kNk, k′Nk′, NkNk′} = 8.
For scale-free networks these conditions are fulfilled in the region k, k′ > (aN)^{1/(γ+1)}, where a is a constant that depends on pk. Note that this value is below the natural cutoff. Consequently this scaling provides a lower bound for the structural cutoff, in the sense that whenever the cutoff of the degree distribution falls below this limit, the condition rkk′ < 1 is always satisfied.

For neutral networks the joint distribution factorizes as

$$e_{kk'} = \frac{k k' p_k p_{k'}}{\langle k \rangle^2}. \quad (7.29)$$

Hence, the ratio (7.28) becomes

$$r_{kk'} = \frac{k k'}{\langle k \rangle N}. \quad (7.30)$$

Therefore, the structural cutoff needed to preserve the condition rkk′ ≤ 1 has the form [11, 34, 35, 36]

$$k_s(N) \sim (\langle k \rangle N)^{1/2}, \quad (7.31)$$

which is (7.15). Note that (7.31) is independent of the degree distribution of the underlying network. Consequently, for a scale-free network ks(N) is independent of the degree exponent γ.
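Equations (7.30) and (7.31) are simple enough to evaluate numerically. The sketch below is an illustration only: the function names `link_ratio` and `structural_cutoff` are hypothetical, and the proportionality in (7.31) is treated as an equality with prefactor 1, which is an assumption made for the example.

```python
import math

def link_ratio(k, k_prime, avg_k, n):
    """Ratio r_kk' of (7.30) for a neutral network."""
    return k * k_prime / (avg_k * n)

def structural_cutoff(avg_k, n):
    """Degree k_s at which r_kk' reaches 1 for k = k' (7.31), prefactor taken as 1."""
    return math.sqrt(avg_k * n)

# How the cutoff grows with network size at fixed average degree.
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(structural_cutoff(4, n), 1))
```

Pairs of nodes whose degrees both exceed the cutoff give a ratio above one, which is the signature of the conflict discussed in this section; the cutoff itself grows only as the square root of N.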
SECTION 7.12
BIBLIOGRAPHY
[1] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lockshon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields, J. M. Rothberg. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403: 623–627, 2000.
[2] I. Xenarios, D. W. Rice, L. Salwinski, M. K. Baron, E. M. Marcotte, D. Eisenberg. DIP: the database of interacting proteins. Nucleic Acids Res., 28: 289–29, 2000.
[3] H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai. Lethality and centrality in protein networks. Nature, 411: 41–42, 2001.
[4] R. Pastor-Satorras, A. Vázquez, and A. Vespignani. Dynamical and correlation properties of the Internet. Phys. Rev. Lett., 87: 258701, 2001.
[5] A. Vázquez, R. Pastor-Satorras, and A. Vespignani. Large-scale topological and dynamical properties of Internet. Phys. Rev. E, 65: 066130, 2002.
[6] S. L. Feld. Why your friends have more friends than you do. American Journal of Sociology, 96: 1464–1477, 1991.
[7] E. W. Zuckerman and J. T. Jost. What makes you think you're so popular? Self evaluation maintenance and the subjective side of the “friendship paradox”. Social Psychology Quarterly, 64: 207–223, 2001.
[8] M. E. J. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89: 208701, 2002.
[9] M. E. J. Newman. Mixing patterns in networks. Phys. Rev. E, 67: 026126, 2003.
[10] S. Maslov, K. Sneppen, and A. Zaliznyak. Detection of topological pattern in complex networks: Correlation profile of the Internet. Physica A, 333: 529–540, 2004.
[11] M. Boguñá, R. Pastor-Satorras, and A. Vespignani. Cutoffs and finite size effects in scale-free networks. Eur. Phys. J. B, 38: 205, 2004.
[12] M. E. J. Newman and J. Park. Why social networks are different from other types of networks. Phys. Rev. E, 68: 036122, 2003.
[13] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: homophily in social networks. Annual Review of Sociology, 27: 415–444, 2001.
[14] J. G. Foster, D. V. Foster, P. Grassberger, and M. Paczuski. Edge direction and the structure of networks. PNAS, 107: 10815, 2010.
[15] A. Barrat and R. Pastor-Satorras. Rate equation approach for correlations in growing network models. Phys. Rev. E, 71: 036127, 2005.
[16] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks. Adv. Phys., 51: 1079, 2002.
[17] J. Berg and M. Lässig. Correlated random networks. Phys. Rev. Lett., 89: 228701, 2002.
[18] M. Boguñá and R. Pastor-Satorras. Class of correlated random networks with hidden variables. Phys. Rev. E, 68: 036112, 2003.
[19] R. Xulvi-Brunet and I. M. Sokolov. Reshuffling scale-free networks: From random to assortative. Phys. Rev. E, 70: 066102, 2004.
[20] R. Xulvi-Brunet and I. M. Sokolov. Changing correlations in networks: assortativity and dissortativity. Acta Phys. Pol. B, 36: 1431, 2005.
[21] J. Menche, A. Valleriani, and R. Lipowsky. Asymptotic properties of degree-correlated scale-free networks. Phys. Rev. E, 81: 046103, 2010.
[22] V. M. Eguíluz and K. Klemm. Epidemic threshold in structured scale-free networks. Phys. Rev. Lett., 89: 108701, 2002.
[23] M. Boguñá and R. Pastor-Satorras. Epidemic spreading in correlated complex networks. Phys. Rev. E, 66: 047104, 2002.
[24] M. Boguñá, R. Pastor-Satorras, and A. Vespignani. Absence of epidemic threshold in scale-free networks with degree correlations. Phys. Rev. Lett., 90: 028701, 2003.
[25] A. Vázquez and Y. Moreno. Resilience to damage of graphs with degree correlations. Phys. Rev. E, 67: 015101R, 2003.
[26] S. J. Wang, A. C. Wu, Z. X. Wu, X. J. Xu, and Y. H. Wang. Response of degree-correlated scale-free networks to stimuli. Phys. Rev. E, 75: 046113, 2007.
[27] F. Sorrentino, M. Di Bernardo, G. Cuellar, and S. Boccaletti. Synchronization in weighted scale-free networks with degree–degree correlation. Physica D, 224: 123, 2006.
[28] M. Di Bernardo, F. Garofalo, and F. Sorrentino. Effects of degree correlation on the synchronization of networks of oscillators. Int. J. Bifurcation Chaos Appl. Sci. Eng., 17: 3499, 2007.
[29] A. Vázquez and M. Weigt. Computational complexity arising from degree correlations in networks. Phys. Rev. E, 67: 027101, 2003.
[30] M. Pósfai, Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási. Effect of correlations on network controllability. Scientific Reports, 3: 1067, 2013.
[31] M. Weigt and A. K. Hartmann. The number of guards needed by a museum: A phase transition in vertex covering of random graphs. Phys. Rev. Lett., 84: 6118, 2000.
[32] S. Maslov and K. Sneppen. Specificity and stability in topology of protein networks. Science, 296: 910–913, 2002.
[33] L. Adamic and N. Glance. The political blogosphere and the 2004 U.S. election: Divided they blog (2005).
[34] J. Park and M. E. J. Newman. The origin of degree correlations in the Internet and other networks. Phys. Rev. E, 66: 026112, 2003.
[35] F. Chung and L. Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6: 125, 2002.
[36] Z. Burda and Z. Krzywicki. Uncorrelated random networks. Phys. Rev. E, 67: 046118, 2003.
8 ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE NETWORK ROBUSTNESS
ACKNOWLEDGEMENTS
MÁRTON PÓSFAI
GABRIELE MUSELLA
NICOLE SAMAY
ROBERTA SINATRA
SARAH MORRISON
AMAL HUSSEINI
PHILIPP HOEVEL
INDEX
Introduction
Percolation Theory
Robustness of Scale-free Networks
Attack Tolerance
Cascading Failures
Modeling Cascading Failures
Building Robustness
Summary: Achilles' Heel
Homework
ADVANCED TOPICS 8.A Percolation in Scale-free Network
ADVANCED TOPICS 8.B Molloy-Reed Criteria
ADVANCED TOPICS 8.C Critical Threshold Under Random Failures
ADVANCED TOPICS 8.D Breakdown of a Finite Scale-free Network
ADVANCED TOPICS 8.E Attack and Error Tolerance of Real Networks
ADVANCED TOPICS 8.F Attack Threshold
ADVANCED TOPICS 8.G The Optimal Degree Distribution
Homework
Figure 8.0 (cover image)
Networks & Art: Facebook Users
Created by Paul Butler, a Toronto-based data scientist, during a Facebook internship in 2010, the image depicts the network connecting the users of the social network company. It highlights the links within and across continents. The presence of dense local links in the U.S., Europe and India is just as revealing as the lack of links in some areas, like China, where the site is banned, and Africa, reflecting a lack of Internet access.
This book is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V26, 05.09.2014
SECTION 8.1
INTRODUCTION
Errors and failures can corrupt all human designs: The failure of a component in your car's engine may force you to call for a tow truck, and a wiring error in your computer chip can make your computer useless. Many natural and social systems have, however, a remarkable ability to sustain their basic functions even when some of their components fail. Indeed, while there are countless protein misfolding errors and missed reactions in our cells, we rarely notice their consequences. Similarly, large organizations can function despite numerous absent employees. Understanding the origins of this robustness is important for many disciplines:
• Robustness is a central question in biology and medicine, helping us understand why some mutations lead to diseases and others do not.
• It is of concern for social scientists and economists, who explore the stability of human societies and institutions in the face of such disrupting forces as famine, war, and changes in social and economic order.
• It is a key issue for ecologists and environmental scientists, who seek to predict the failure of an ecosystem when faced with the disruptive effects of human activity.
• It is the ultimate goal in engineering, aiming to design communication systems, cars, or airplanes that can carry out their basic functions despite occasional component failures.

Networks play a key role in the robustness of biological, social and technological systems. Indeed, a cell's robustness is encoded in intricate regulatory, signaling and metabolic networks; a society's resilience cannot be divorced from the interwoven social, professional, and communication web behind it; an ecosystem's survivability cannot be understood without a careful analysis of the food web that sustains each species. Whenever nature seeks robustness, it resorts to networks.

Figure 8.1
Achilles' Heel of Complex Networks
The cover of the 27 July 2000 issue of Nature, highlighting the paper entitled Attack and error tolerance of complex networks that began the scientific exploration of network robustness [1].
The purpose of this chapter is to understand the role networks play in ensuring the robustness of a complex system. We show that the structure of the underlying network plays an essential role in a system's ability to survive random failures or deliberate attacks. We explore the role of networks in the emergence of cascading failures, a damaging phenomenon frequently encountered in real systems. Most important, we show that the laws governing the error and attack tolerance of complex networks and the emergence of cascading failures are universal. Hence uncovering them helps us understand the robustness of a wide range of complex systems.

Figure 8.2
Robust, Robustness
“Robust” comes from the Latin Quercus Robur, meaning oak, the symbol of strength and longevity in the ancient world. The tree in the figure stands near the Hungarian village Diósviszló and is documented at www.dendromania.hu, a site that catalogs Hungary's oldest and largest trees. Image courtesy of György Pósfai.
SECTION 8.2
PERCOLATION THEORY
The removal of a single node has only limited impact on a network’s
(a)
(b)
(c)
(d)
integrity (Figure 8.3a). The removal of several nodes, however, can break a network into several isolated components (Figure 8.3d). Obviously, the more nodes we remove, the higher are the chances that we damage a network, prompting us to ask: How many nodes do we have to delete to fragment a network into isolated components? For example, what fraction of Internet routers must break down so that the Internet turns into clusters of computers that are unable to communicate with each other? To answer these questions, we must first familiarize ourselves with the mathematical underpinnings of network robustness, offered by percolation theory. Percolation Percolation theory is a highly developed subfield of statistical physics and mathematics [2, 3, 4, 5]. A typical problem addressed by it is illustrated in Figure 8.4a,b, showing a square lattice, where we place pebbles with probability p at each intersection. Neighboring pebbles are considered connected, forming clusters of size two or more. Given that the position of each pebble is decided by chance, we ask:
Figure 8.3
The Impact of Node Removal The gradual fragmentation of a small network following the breakdown of its nodes. In each panel we remove a different node (highlighted with a green circle), together with its links. While the removal of the first node has only limited impact on the network’s integrity, the removal of the second node isolates two small clusters from the rest of the network. Finally, the removal of the third node fragments the network, breaking it into five noncommunicating clusters of sizes s = 2, 2, 2, 5, 6.
• What is the expected size of the largest cluster?
• What is the average cluster size?
Obviously, the higher p is, the larger the clusters are. A key prediction of percolation theory is that the cluster size does not change gradually with p. Rather, for a wide range of p the lattice is populated with numerous tiny clusters (Figure 8.4a). As p approaches a critical value pc, these small clusters grow and coalesce, leading to the emergence of a large cluster at pc. We call this the percolating cluster, as it reaches the end of the lattice. In other words, at pc we observe a phase transition from many small clusters to a percolating cluster that spans the whole lattice (Figure 8.4b). To quantify the nature of this phase transition, we focus on three quantities:
• Average Cluster Size: ⟨s⟩
According to percolation theory the average size of all finite clusters follows
$$\langle s \rangle \sim |p - p_c|^{-\gamma_p} \qquad (8.1)$$
In other words, the average cluster size diverges as we approach pc (Figure 8.4c).
• Order Parameter: P∞
The probability P∞ that a randomly chosen pebble belongs to the largest cluster follows
$$P_\infty \sim (p - p_c)^{\beta_p} \qquad (8.2)$$
Therefore as p decreases towards pc the probability that a pebble belongs to the largest cluster drops to zero (Figure 8.4d).
• Correlation Length: ξ
The mean distance between two pebbles that belong to the same cluster follows
$$\xi \sim |p - p_c|^{-\nu} \qquad (8.3)$$
Figure 8.4
Percolation
A classical problem in percolation theory explores the random placement with probability p of pebbles on a square lattice.
(a) For small p (here p = 0.1) most pebbles are isolated. In this case the largest cluster has only three nodes, highlighted in purple.
(b) For large p (here p = 0.7) most (but not all) pebbles belong to a single cluster, colored purple. This is called the percolating cluster, as it spans the whole lattice (see also Figure 8.6).
(c) The average cluster size, ⟨s⟩, as a function of p. As we approach pc from below, numerous small clusters coalesce and ⟨s⟩ diverges, following (8.1). The same divergence is observed above pc, where to calculate ⟨s⟩ we remove the percolating cluster from the average. The same exponent γp characterizes the divergence on both sides of the critical point.
(d) A schematic illustration of the p-dependence of the probability P∞ that a pebble belongs to the largest connected component. For p < pc all components are small, so P∞ is zero. Once p reaches pc a giant component emerges. Consequently beyond pc there is a finite probability that a node belongs to the largest component, as predicted by (8.2).
Therefore while for p < pc the distance between the pebbles in the same cluster is finite, at pc this distance diverges. This means that at pc the size of the largest cluster becomes infinite, allowing it to percolate the whole lattice.
The exponents γp, βp, and ν are called critical exponents, as they characterize the system’s behavior near the critical point pc. Percolation theory predicts that these exponents are universal, meaning that they are independent of the nature of the lattice or the precise value of pc. Therefore, whether we place the pebbles on a triangular or a hexagonal lattice, the behavior of ⟨s⟩, P∞, and ξ is characterized by the same γp, βp, and ν exponents. Consider the following examples to better understand this universality:
• The value of pc depends on the lattice type, hence it is not universal. For example, for a two-dimensional square lattice (Figure 8.4) we have pc ≈ 0.593, while for a two-dimensional triangular lattice pc = 1/2 (site percolation).
• The value of pc also changes with the lattice dimension: for a square lattice pc ≈ 0.593 (d = 2); for a simple cubic lattice (d = 3) pc ≈ 0.3116. Therefore in d = 3 we need to cover a smaller fraction of the nodes with pebbles to reach the percolation transition.
• In contrast with pc, the critical exponents do not depend on the lattice type, but only on the lattice dimension. In two dimensions, the case shown in Figure 8.4, we have γp = 43/18, βp = 5/36, and ν = 4/3, for any lattice. In three dimensions γp = 1.80, βp = 0.41, and ν = 0.88. For any d > 6 we have γp = 1, βp = 1, ν = 1/2, hence for large d the exponents are independent of d as well [2].
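The pebble experiment is easy to reproduce numerically. The following is a minimal sketch, not taken from the original text: it places pebbles on an L × L square lattice with probability p and returns the fraction of pebbles that belong to the largest cluster, a finite-size estimate of P∞. The function name, lattice size and random seed are illustrative assumptions.

```python
import random


def largest_cluster_fraction(L, p, seed=0):
    """Estimate P_inf for site percolation on an L x L square lattice.

    Each site holds a pebble with probability p. Pebbles on neighboring
    sites (up, down, left, right) belong to the same cluster.
    """
    rng = random.Random(seed)
    occupied = [[rng.random() < p for _ in range(L)] for _ in range(L)]
    n_pebbles = sum(map(sum, occupied))
    if n_pebbles == 0:
        return 0.0

    seen = [[False] * L for _ in range(L)]
    best = 0
    for i in range(L):
        for j in range(L):
            if not occupied[i][j] or seen[i][j]:
                continue
            # Depth-first search over one cluster, counting its pebbles.
            stack = [(i, j)]
            seen[i][j] = True
            size = 0
            while stack:
                x, y = stack.pop()
                size += 1
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    u, v = x + dx, y + dy
                    if 0 <= u < L and 0 <= v < L and occupied[u][v] and not seen[u][v]:
                        seen[u][v] = True
                        stack.append((u, v))
            best = max(best, size)
    return best / n_pebbles


if __name__ == "__main__":
    for p in (0.3, 0.55, 0.593, 0.65, 0.8):
        print(p, round(largest_cluster_fraction(100, p), 3))
```

On a finite lattice the transition is smeared out, so the estimate rises steeply around pc ≈ 0.593 rather than jumping from zero; larger L should sharpen the rise.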
Inverse Percolation Transition and Robustness
The phenomenon of primary interest in robustness is the impact of node failures on the integrity of a network. We can use percolation theory to describe this process.
Let us view a square lattice as a network whose nodes are the intersections (Figure 8.5). We randomly remove an f fraction of nodes, asking how their absence impacts the integrity of the lattice. If f is small, the missing nodes do little damage to the network. Increasing f, however, can isolate chunks of nodes from the giant component. Finally, for sufficiently large f the giant component breaks into tiny disconnected components (Figure 8.5).
This fragmentation process is not gradual, but it is characterized by a critical threshold fc: For any f < fc we continue to have a giant component. Once f exceeds fc, the giant component vanishes. This is illustrated by the f-dependence of P∞, representing the probability that a node is part of the giant component (Figure 8.5): P∞ is nonzero below fc, but it drops to zero as we approach fc. The critical exponents characterizing this breakdown, γp, βp, ν, are the same as those encountered in (8.1)-(8.3). Indeed, the two processes can be mapped into each other by choosing f = 1 − p.
What, however, if the underlying network is not as regular as a square lattice? As we will see in the coming sections, the answer depends on the precise network topology. Yet, for random networks the answer continues to be provided by percolation theory: Random networks under random node failures share the same scaling exponents as infinite-dimensional percolation. Hence the critical exponents for a random network are γp = 1, βp = 1 and ν = 1/2, corresponding to the d > 6 percolation exponents encountered earlier. The critical exponents for a scale-free network are provided in ADVANCED TOPICS 8.A.
In summary, the breakdown of a network under random node removal is not a gradual process. Rather, removing a small fraction of nodes has only limited impact on a network’s integrity. But once the fraction of removed nodes reaches a critical threshold, the network abruptly breaks into disconnected components. In other words, random node failures induce a phase transition from a connected to a fragmented network. We can use the tools of percolation theory to characterize this transition in both regular and random networks. For scale-free networks key aspects of the described phenomena change, however, as we discuss in the next section.
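As a worked illustration of the mapping f = 1 − p (a sketch added here, using only the square-lattice value pc ≈ 0.593 quoted earlier): removing a fraction f of the nodes leaves each intersection occupied with probability p = 1 − f, so the giant component survives as long as p > pc. The breakdown threshold of the square lattice is then

```latex
f_c = 1 - p_c \approx 1 - 0.593 = 0.407
```

Under this mapping a square lattice would lose its giant component once roughly 41% of its nodes are removed at random.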
Figure 8.5
Network Breakdown as Inverse Percolation
The consequences of node removal are accurately captured by the inverse of the percolation process discussed in Figure 8.4. We start from a square lattice, which we view as a network whose nodes are the intersections. We randomly select and remove an f fraction of nodes and measure the size of the largest component formed by the remaining nodes. This size is accurately captured by P∞, which is the probability that a randomly selected node belongs to the largest component. The plot shows P∞ as a function of f. The observed networks are shown in the bottom panels for f = 0.1, f = fc and f = 0.8. Under each panel we list the characteristics of the corresponding phases:
0 < f < fc: There is a giant component, with $P_\infty \sim |f - f_c|^{\beta}$.
f = fc: The giant component vanishes.
f > fc: The lattice breaks into many tiny components.
BOX 8.1
From Forest Fires to Percolation Theory
We can use the spread of a fire in a forest to illustrate the basic concepts of percolation theory. Let us assume that each pebble in Figure 8.4a,b is a tree and that the lattice describes a forest. If a tree catches fire, it ignites the neighboring trees; these, in turn, ignite their neighbors. The fire continues to spread until no burning tree has a nonburning neighbor. We must therefore ask: If we randomly ignite a tree, what fraction of the forest burns down? And how long does it take the fire to burn out?
The answer depends on the tree density, controlled by the parameter p. For small p the forest consists of many small islands of trees (p = 0.55, Figure 8.6a), hence igniting any tree will at most burn down one of these small islands. Consequently, the fire will die out quickly. For large p most trees belong to a single large cluster, hence the fire rapidly sweeps through the dense forest (p = 0.62, Figure 8.6c).
The simulations indicate that there is a critical pc at which it takes an extremely long time for the fire to end. This pc is the critical threshold of the percolation problem. Indeed, at p = pc the giant component just emerges through the union of many small clusters (Figure 8.6b). Hence the fire has to follow a long winding path to reach all trees in the loosely connected clusters, which can be rather time consuming.
Figure 8.6
Forest Fire
The emergence of the giant component as we change the occupation probability p. Each panel corresponds to a different p in the vicinity of pc, shown for a lattice of 250 × 250 sites. The largest cluster is colored black. For p < pc the largest cluster is tiny, as seen in (a). If this is a forest and the pebbles are trees, any fire can consume at most a small fraction of the trees, burning out quickly. Once p reaches pc ≈ 0.593, shown in (b), the largest cluster percolates the whole lattice and the fire can reach many trees, burning slowly through the forest. Increasing p beyond pc connects more pebbles (trees) to the largest component, as seen for p = 0.62 in (c). Hence, the fire can sweep through the forest, burning out quickly again.
SECTION 8.3
ROBUSTNESS OF SCALE-FREE NETWORKS
Percolation theory focuses mainly on regular lattices, whose nodes have identical degrees, or on random networks, whose nodes have comparable degrees. What happens, however, if the network is scale-free? How do the hubs affect the percolation transition?
To answer these questions, let us start from the router level map of the Internet and randomly select and remove nodes one by one. According to percolation theory once the fraction of removed nodes reaches a critical value fc, the Internet should fragment into many isolated subgraphs (Figure 8.5). The simulations indicate otherwise: The Internet refuses to break apart even under rather extensive node failures. Instead the size of the largest component decreases gradually, vanishing only in the vicinity of f = 1 (Figure 8.7a). This means that the network behind the Internet shows an unusual robustness to random node failures: we must remove all of its nodes to destroy its giant component. This conclusion disagrees with percolation on lattices, which predicts that a network must fall apart after the removal of a finite fraction of its nodes.
The behavior observed above is not unique to the Internet. To show this we repeated the above measurement for a scale-free network with degree exponent γ = 2.5, observing an identical pattern (Figure 8.7b): Under random node removal the giant component fails to collapse at some finite fc, but vanishes only gradually near f = 1 (Online Resource 8.1). This hints that the Internet's observed robustness is rooted in its scale-free topology. The goal of this section is to uncover and quantify the origin of this remarkable robustness.
Figure 8.7
Robustness of Scale-free Networks
Both panels plot the relative size of the giant component, P∞(f)/P∞(0), against the fraction f of removed nodes.
(a) Internet. The fraction of Internet routers that belong to the giant component after an f fraction of routers are randomly removed. The ratio P∞(f)/P∞(0) provides the relative size of the giant component. The simulations use the router level Internet topology of Table 4.1.
(b) Scale-free network. The fraction of nodes that belong to the giant component after an f fraction of nodes are removed from a scale-free network with γ = 2.5, N = 10,000 and kmin = 1.
The plots indicate that the Internet, and scale-free networks in general, do not fall apart after the removal of a finite fraction of nodes. We need to remove almost all nodes (i.e. fc = 1) to fragment these networks.
Molloy-Reed Criterion
To understand the origin of the anomalously high fc characterizing the Internet and scale-free networks, we calculate fc for a network with an arbitrary degree distribution. To do so we rely on a simple observation: For a network to have a giant component, most nodes that belong to it must be connected to at least two other nodes (Figure 8.8). This leads to the Molloy-Reed criterion (ADVANCED TOPICS 8.B), stating that a randomly wired network has a giant component if [6]
$$\kappa = \frac{\langle k^2 \rangle}{\langle k \rangle} > 2. \qquad (8.4)$$
Networks with κ < 2 lack a giant component, being fragmented into many disconnected components. The Molloy-Reed criterion (8.4) links the network’s integrity, as expressed by the presence or the absence of a giant component, to ⟨k⟩ and ⟨k²⟩. It is valid for any degree distribution pk.
Online Resource 8.1
Scale-free Network Under Node Failures
To illustrate the robustness of a scale-free network we start from the network we constructed in Online Resource 4.1, i.e. a scale-free network generated by the Barabási-Albert model. Next we randomly select and remove nodes one by one. As the movie illustrates, despite the fact that we remove a significant fraction of the nodes, the network refuses to break apart. Visualization by Dashun Wang.
To illustrate the predictive power of (8.4), let us apply it to a random network. As in this case ⟨k²⟩ = ⟨k⟩(1 + ⟨k⟩), a random network has a giant component if
$$\kappa = \frac{\langle k^2 \rangle}{\langle k \rangle} = \frac{\langle k \rangle (1 + \langle k \rangle)}{\langle k \rangle} = 1 + \langle k \rangle > 2 \qquad (8.5)$$
or
$$\langle k \rangle > 1. \qquad (8.6)$$
This prediction coincides with the necessary condition (3.10) for the existence of a giant component.
Critical Threshold
To understand the mathematical origin of the robustness observed in Figure 8.7, we ask at what threshold a scale-free network will lose its giant component. By applying the Molloy-Reed criterion to a network with an arbitrary degree distribution, we find that the critical threshold follows [7] (ADVANCED TOPICS 8.C)
$$f_c = 1 - \frac{1}{\frac{\langle k^2 \rangle}{\langle k \rangle} - 1}. \qquad (8.7)$$
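Both (8.4) and (8.7) depend only on the first two moments of the degree distribution, so they can be evaluated directly from a list of node degrees. The sketch below is an illustration added here, not the author's code. The function names are hypothetical, and the result is only meaningful under the assumption behind (8.4) and (8.7), namely that the network is randomly wired apart from its degree sequence.

```python
def degree_moments(degrees):
    """Return (<k>, <k^2>) for a list of node degrees."""
    n = len(degrees)
    k1 = sum(degrees) / n
    k2 = sum(k * k for k in degrees) / n
    return k1, k2


def molloy_reed_kappa(degrees):
    """kappa = <k^2>/<k>, the quantity compared with 2 in (8.4)."""
    k1, k2 = degree_moments(degrees)
    return k2 / k1


def critical_threshold(degrees):
    """Random-failure threshold f_c from (8.7).

    Returns 0.0 when kappa <= 2, i.e. when (8.4) says there is
    no giant component to destroy in the first place.
    """
    kappa = molloy_reed_kappa(degrees)
    if kappa <= 2:
        return 0.0
    return 1.0 - 1.0 / (kappa - 1.0)


if __name__ == "__main__":
    regular = [4] * 100                # every node has degree 4
    hub_heavy = [1] * 90 + [31] * 10   # same <k> = 4, but a few hubs
    print(critical_threshold(regular))
    print(critical_threshold(hub_heavy))
```

The two degree sequences in the usage example share ⟨k⟩ = 4, yet the hub-heavy one has a much larger ⟨k²⟩ and therefore a threshold much closer to one.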
Figure 8.8
Molloy-Reed Criterion
Each individual must hold the hand of two other individuals to form a chain. Similarly, to have a giant component in a network, on average each of its nodes should have at least two neighbors. The Molloy-Reed criterion (8.4) exploits this property, allowing us to calculate the critical point at which a network breaks apart. See ADVANCED TOPICS 8.B for the derivation.
The most remarkable prediction of (8.7) is that the critical threshold fc depends only on ⟨k⟩ and ⟨k²⟩, quantities that are uniquely determined by the degree distribution pk.
Let us illustrate the utility of (8.7) by calculating the breakdown threshold of a random network. Using ⟨k²⟩ = ⟨k⟩(⟨k⟩ + 1), we obtain (ADVANCED TOPICS 8.D)
$$f_c^{ER} = 1 - \frac{1}{\langle k \rangle}. \qquad (8.8)$$
Hence, the denser a random network is, the higher its fc, i.e. the more nodes we need to remove to break it apart. Furthermore (8.8) predicts that fc is always finite, hence a random network must break apart after the removal of a finite fraction of nodes.
Equation (8.7) helps us understand the roots of the enhanced robustness observed in Figure 8.7. Indeed, for scale-free networks with γ < 3 the second moment ⟨k²⟩ diverges in the N → ∞ limit. If we insert ⟨k²⟩ → ∞ into (8.7), we find that fc converges to fc = 1. This means that to fragment a scale-free network we must remove all of its nodes. In other words, the random removal of a finite fraction of its nodes does not break apart a large scale-free network.
Figure 8.9
Robustness and Degree Exponent
The probability that a node belongs to the giant component after the removal of an f fraction of nodes from a scale-free network with degree exponent γ, plotted as P∞(f)/P∞(0) against f for γ = 2.0, 3.0 and 4.0. For γ = 4 we observe a finite critical point fc ≃ 2/3, as predicted by (8.9). For γ < 3, however, fc → 1. The networks were generated with the configuration model using kmin = 2 and N = 10,000.
To better understand this result we express ⟨k⟩ and ⟨k²⟩ in terms of the parameters characterizing a scale-free network: the degree exponent γ and the minimal and maximal degrees, kmin and kmax, obtaining
$$f_c = \begin{cases} 1 - \dfrac{1}{\frac{\gamma-2}{3-\gamma}\, k_{min}^{\gamma-2}\, k_{max}^{3-\gamma} - 1} & 2 < \gamma < 3 \\[2ex] 1 - \dfrac{1}{\frac{\gamma-2}{\gamma-3}\, k_{min} - 1} & \gamma > 3 \end{cases} \qquad (8.9)$$
Equation (8.9) predicts that:
• For γ > 3 the critical threshold fc depends only on γ and kmin, hence fc is independent of the network size N. In this regime a scale-free network behaves like a random network: it falls apart once a finite fraction of its nodes are removed.
• For γ < 3 the maximum degree kmax diverges for large N, following (4.18). Therefore in the N → ∞ limit (8.9) predicts fc → 1. In other words, to fragment an infinite scale-free network we must remove all of its nodes.
Equations (8.6)-(8.9) are the key results of this chapter, predicting that scale-free networks can withstand an arbitrary level of random failures without breaking apart. The hubs are responsible for this remarkable robustness. Indeed, random node failures by definition are blind to degree, affecting with the same probability a small or a large degree node. Yet, in a scale-free network we have far more small degree nodes than hubs. Therefore, random node removal will predominantly remove one of the numerous small nodes, as the chance of randomly selecting one of the few large hubs is negligible. These small nodes contribute little to a network’s integrity, hence their removal does little damage.
Returning to the airport analogy of Figure 4.6, if we close a randomly selected airport, we will most likely shut down one of the numerous small airports. Its absence will be hardly noticed elsewhere in the world: you can still travel from New York to Tokyo, or from Los Angeles to Rio de Janeiro.
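The two regimes of (8.9) can be explored numerically. The sketch below is an added illustration with hypothetical function names. It assumes that the maximum degree referred to as (4.18) is the natural cutoff kmax = kmin N^(1/(γ−1)); that form is an assumption here and should be checked against (4.18) itself.

```python
def natural_cutoff(k_min, gamma, n):
    """Assumed form of (4.18): k_max = k_min * N**(1 / (gamma - 1))."""
    return k_min * n ** (1.0 / (gamma - 1.0))


def fc_scale_free(gamma, k_min, n):
    """Random-failure threshold of a scale-free network, following (8.9)."""
    if gamma > 3:
        # <k^2>/<k> is finite and independent of N.
        kappa = (gamma - 2.0) / (gamma - 3.0) * k_min
    elif 2 < gamma < 3:
        # <k^2>/<k> grows with the largest hub, hence with N.
        k_max = natural_cutoff(k_min, gamma, n)
        kappa = ((gamma - 2.0) / (3.0 - gamma)
                 * k_min ** (gamma - 2.0) * k_max ** (3.0 - gamma))
    else:
        raise ValueError("(8.9) covers 2 < gamma < 3 and gamma > 3 only")
    return 1.0 - 1.0 / (kappa - 1.0)


if __name__ == "__main__":
    print(fc_scale_free(4.0, 2, 10**4))       # finite threshold, 2/3
    for n in (10**4, 10**6, 10**8):
        print(n, fc_scale_free(2.5, 2, n))    # creeps towards 1 as N grows
```

For γ = 4 and kmin = 2 the sketch reproduces the fc = 2/3 quoted for Figure 8.9, while for γ = 2.5 the threshold approaches one as N increases without reaching it for any finite N.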
Robustness of Finite Networks
Equation (8.9) predicts that for a scale-free network fc converges to one only if kmax → ∞, which corresponds to the N → ∞ limit. While many networks of practical interest are very large, they are still finite, prompting us to ask if the observed anomaly is relevant for finite networks. To address this we insert (4.18) into (8.9), obtaining that fc depends on the network size N as (ADVANCED TOPICS 8.C)
$$f_c \approx 1 - \frac{C}{N^{\frac{3-\gamma}{\gamma-1}}}, \qquad (8.10)$$
where C collects all terms that do not depend on N. Equation (8.10) indicates that the larger a network, the closer its critical threshold is to fc = 1.
To see how close fc can get to the theoretical limit fc = 1, we calculate fc for the Internet. The router level map of the Internet has ⟨k²⟩/⟨k⟩ = 37.91 (Table 4.1). Inserting this ratio into (8.7) we obtain fc = 0.972. Therefore, we need to remove 97% of the routers to fragment the Internet into disconnected components. The probability that by chance 186,861 routers fail simultaneously, representing 97% of the N = 192,244 routers on the Internet, is effectively zero. This is the reason why the topology of the Internet is so robust to random failures.
In general a network displays enhanced robustness if its breakdown threshold deviates from the random network prediction (8.8), i.e. if
$$f_c > f_c^{ER}. \qquad (8.11)$$
Enhanced robustness has several ramifications:
• The inequality (8.11) is satisfied for most networks for which ⟨k²⟩ deviates from ⟨k⟩(⟨k⟩ + 1). According to Figure 4.8, for virtually all reference networks ⟨k²⟩ exceeds the random expectation. Hence the robustness predicted by (8.7) affects most networks of practical interest. This is illustrated in Table 8.1, which shows that for most reference networks (8.11) holds.
• Equation (8.7) predicts that the degree distribution of a network does not need to follow a strict power law to display enhanced robustness. All we need is a larger ⟨k²⟩ than expected for a random network of similar size.
• The scale-free property changes not only fc, but also the critical exponents γp, βp and ν in the vicinity of fc. Their dependence on the degree exponent γ is discussed in ADVANCED TOPICS 8.A.
• Enhanced robustness is not limited to node removal, but emerges under link removal as well (Figure 8.10).
Figure 8.10
Robustness and Link Removal
What happens if we randomly remove the links rather than the nodes? The calculations predict that the critical threshold fc is the same for random link and node removal [7, 8]. To illustrate this, we compare the impact of random node and link removal on a random network with ⟨k⟩ = 2, plotting P∞(f)/P∞(0) against f for both cases. The plot indicates that the network falls apart at the same critical threshold fc ≃ 0.5. The difference is in the shape of the two curves. Indeed, the removal of an f fraction of nodes leaves us with a smaller giant component than the removal of an f fraction of links. This is not unexpected: on average each node removes ⟨k⟩ links. Hence the removal of an f fraction of nodes is equivalent to the removal of an f⟨k⟩ fraction of links, which clearly does more damage than the removal of an f fraction of links.
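The Internet estimate quoted in this section follows from (8.7) by direct substitution. As a worked check (added here, using only the values ⟨k²⟩/⟨k⟩ = 37.91 and N = 192,244 given in the text):

```latex
f_c = 1 - \frac{1}{\langle k^2 \rangle / \langle k \rangle - 1}
    = 1 - \frac{1}{37.91 - 1}
    = 1 - \frac{1}{36.91} \approx 0.9729,
\qquad
f_c N \approx 0.972 \times 192{,}244 \approx 186{,}861 .
```

The value 0.972 used in the text corresponds to truncating 0.9729 to three decimals.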
In summary, in this section we encountered a fundamental property of real networks: their robustness to random failures. Equation (8.7) predicts that the breakdown threshold of a network depends on ⟨k⟩ and ⟨k²⟩, which in turn are uniquely determined by the network's degree distribution. Therefore random networks have a finite threshold, but for scale-free networks with γ < 3 the breakdown threshold converges to one. In other words, we need to remove all nodes to break a scale-free network apart, indicating that these networks show an extreme robustness to random failures. The origin of this extreme robustness is the large ⟨k²⟩ term. Given that for most real networks ⟨k²⟩ is larger than the random expectation, enhanced robustness is a generic property of many networks. This robustness is rooted in the fact that random failures affect mainly the numerous small nodes, which play only a limited role in maintaining a network’s integrity.
Table 8.1
Breakdown Thresholds Under Random Failures and Attacks
The table shows the estimated fc for random node failures (second column) and attacks (fourth column) for ten reference networks. The procedure for determining fc is described in ADVANCED TOPICS 8.E. The third column (randomized network) offers fc for a network whose N and L coincide with the original network, but whose nodes are connected randomly to each other (randomized network, fcER, determined by (8.8)). For most networks fc for random failures exceeds fcER for the corresponding randomized network, indicating that these networks display enhanced robustness, as they satisfy (8.11). Three networks lack this property: the power grid, a consequence of the fact that its degree distribution is exponential (Figure 8.31a), and the actor and the citation networks, which have a very high ⟨k⟩, diminishing the role of the high ⟨k²⟩ in (8.7).
NETWORK | RANDOM FAILURES (REAL NETWORK) | RANDOM FAILURES (RANDOMIZED NETWORK) | ATTACK (REAL NETWORK)
Internet
WWW
Power Grid
Mobile-Phone Call
Email
Science Collaboration
Actor Network
0.98
Citation Network
E. Coli Metabolism
Yeast Protein Interactions
SECTION 8.4
ATTACK TOLERANCE
The important role the hubs play in holding together a scale-free network motivates our next question: What if we do not remove the nodes randomly, but go after the hubs? That is, we first remove the highest degree node, followed by the node with the next highest degree and so on. The likelihood that nodes would break in this particular order under normal conditions is essentially zero. Instead this process mimics an attack on the network, as it assumes a detailed knowledge of the network topology, an ability to target the hubs, and a desire to deliberately cripple the network [1].
The removal of a single hub is unlikely to fragment a network, as the remaining hubs can still hold the network together. After the removal of a few hubs, however, large chunks of nodes start falling off (Online Resource 8.2). If the attack continues, it can rapidly break the network into tiny clusters.
The impact of hub removal is quite evident in the case of a scale-free network (Figure 8.11): the critical point, which is absent under random failures, reemerges under attacks. Not only does it reemerge, but it has a remarkably low value. Therefore the removal of a small fraction of the hubs is sufficient to break a scale-free network into tiny clusters. The goal of this section is to quantify this attack vulnerability.
Figure 8.11
Scale-free Network Under Attack
The probability that a node belongs to the largest connected component in a scale-free network under attack (purple) and under random failures (green), plotted as P∞(f)/P∞(0) against f. For an attack we remove the nodes in a decreasing order of their degree: we start with the biggest hub, followed by the next biggest and so on. In the case of failures the order in which we choose the nodes is random, independent of the node’s degree. The plot illustrates a scale-free network’s extreme fragility to attacks: fc is small, implying that the removal of only a few hubs can disintegrate the network. The initial network has degree exponent γ = 2.5, kmin = 2 and N = 10,000.
Critical Threshold Under Attack
An attack on a scale-free network has two consequences (Figure 8.11):
• The critical threshold fc is smaller than fc = 1, indicating that under attacks a scale-free network can be fragmented by the removal of a finite fraction of its hubs.
• The observed fc is remarkably low, indicating that we need to remove only a tiny fraction of the hubs to cripple the network.
To quantify this process we need to analytically calculate fc for a network under attack. To do this we rely on the fact that hub removal changes the network in two ways [9]:
• It changes the maximum degree of the network from kmax to k'max, as all nodes with degree larger than k'max have been removed.
• The degree distribution of the network changes from pk to p'k', as nodes connected to the removed hubs will lose links, altering the degrees of the remaining nodes.
By combining these two changes we can map the attack problem into the robustness problem discussed in the previous section. In other words, we can view an attack as random node removal from a network with adjusted k'max and p'k'. The calculations predict that the critical threshold fc for attacks on a scale-free network is the solution of the equation [9, 10] (ADVANCED TOPICS 8.F)
$$f_c^{\frac{2-\gamma}{1-\gamma}} = 2 + \frac{2-\gamma}{3-\gamma}\, k_{min} \left( f_c^{\frac{3-\gamma}{1-\gamma}} - 1 \right). \qquad (8.12)$$
Online Resource 8.2
Scale-free Networks Under Attack
During an attack we aim to inflict maximum damage on a network. We can do this by removing first the highest degree node, followed by the next highest degree, and so on. As the movie illustrates, it is sufficient to remove only a few hubs to break a scale-free network into disconnected components. Compare this with the network’s refusal to break apart under random node failures, shown in Online Resource 8.1. Visualization by Dashun Wang.
Figure 8.12 shows the numerical solution of (8.12) as a function of the degree exponent γ, allowing us to draw several conclusions:
• While fc for failures decreases monotonically with γ, fc for attacks can have a nonmonotonic behavior: it increases for small γ and decreases for large γ.
• fc for attacks is always smaller than fc for random failures.
• For large γ a scale-free network behaves like a random network. As a random network lacks hubs, the impact of an attack is similar to the impact of random node removal. Consequently the failure and the attack thresholds converge to each other for large γ. Indeed, if γ → ∞ then pk → δ(k − kmin), meaning that all nodes have the same degree kmin. Therefore random failures and targeted attacks become indistinguishable in the γ → ∞ limit, obtaining
$$f_c \to 1 - \frac{1}{k_{min} - 1}. \qquad (8.13)$$
• As Figure 8.13 shows, a random network has a finite percolation threshold under both random failures and attacks, as predicted by Figure 8.12 and (8.13) for large γ.
The airport analogy helps us understand the fragility of scale-free networks to attacks: The closing of two large airports, like Chicago’s O’Hare Airport or the Atlanta International Airport, for only a few hours would be headline news, altering travel throughout the U.S. Should some series of events lead to the simultaneous closure of the Atlanta, Chicago, Denver, and New York airports, the biggest hubs, air travel within the North American continent would come to a halt within hours.
Figure 8.12
Critical Threshold Under Attack
The dependence of the breakdown threshold, fc, on the degree exponent γ for scale-free networks with kmin = 2, 3. The curves are predicted by (8.12) for attacks (purple) and by (8.7) for random failures (green).
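Equation (8.12) has no closed-form solution for general γ, but it is straightforward to solve numerically. The sketch below is an added illustration, not the procedure used for Figure 8.12: it finds the root by bisection on the interval (0, 1). The function name, the restriction to kmin ≥ 2 and the tolerance are assumptions of this example.

```python
def attack_threshold(gamma, k_min, tol=1e-12):
    """Solve (8.12) for the attack threshold f_c by bisection.

    Assumes gamma > 2, gamma != 3 and k_min >= 2; under these assumptions
    the difference of the two sides of (8.12) changes sign once on (0, 1).
    """
    if gamma <= 2 or gamma == 3 or k_min < 2:
        raise ValueError("sketch requires gamma > 2, gamma != 3, k_min >= 2")

    a = (2.0 - gamma) / (1.0 - gamma)          # exponent on the left side
    b = (3.0 - gamma) / (1.0 - gamma)          # exponent on the right side
    c = (2.0 - gamma) / (3.0 - gamma) * k_min  # prefactor on the right side

    def g(f):
        # Left side minus right side of (8.12); f_c is where g vanishes.
        return f ** a - 2.0 - c * (f ** b - 1.0)

    lo, hi = 1e-12, 1.0  # g(lo) > 0 and g(hi) = -1 < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for gamma in (2.2, 2.5, 2.8, 3.5, 4.0, 6.0):
        print(gamma, round(attack_threshold(gamma, 2), 4))
```

For γ = 2.5 and kmin = 2 the equation reduces to a quadratic in fc^(1/3), whose root gives fc = (2 − √2)³ ≈ 0.20, which is a convenient check on the solver. For very large γ the result approaches the limit (8.13).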
In summary, while random node failures do not fragment a scale-free network, an attack that targets the hubs can easily destroy such a network. This fragility is bad news for the Internet, as it indicates that it is inherently vulnerable to deliberate attacks. It can be good news in medicine, as the vulnerability of bacteria to the removal of their hub proteins offers avenues to design drugs that kill unwanted bacteria.
Figure 8.13
Attacks and Failures in Random Networks
The fraction of nodes that belong to the giant component in a random network, plotted as P∞(f)/P∞(0) against f, if an f fraction of nodes are randomly removed (green) and in decreasing order of their degree (purple). Both curves indicate the existence of a finite threshold, in contrast with scale-free networks, for which fc → 1 under random failures. The simulations were performed for random networks with N = 10,000 and ⟨k⟩ = 3.
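The contrast between failures and attacks can be reproduced with a small simulation. The sketch below is an added illustration under stated assumptions: it grows a network with a simple preferential-attachment rule as a stand-in for a scale-free network (the figures in this section use other generators and degree exponents), ranks the hubs once by their initial degree, and compares the relative size of the largest component after random removal and after hub removal. All function names are hypothetical.

```python
import random
from collections import deque


def preferential_attachment(n, m, rng):
    """Grow a network where each new node links to m degree-biased targets."""
    adj = {i: set() for i in range(n)}
    stubs = []  # a node appears once per link end, so choice() is degree-biased
    for i in range(m + 1):  # seed: a small complete graph
        for j in range(i):
            adj[i].add(j)
            adj[j].add(i)
            stubs += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs += [new, t]
    return adj


def largest_component(adj, removed):
    """Size of the largest connected component once `removed` nodes are gone."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def relative_giant(adj, f, attack, rng):
    """P_inf(f)/P_inf(0) after removing a fraction f of the nodes."""
    nodes = list(adj)
    if attack:
        nodes.sort(key=lambda u: len(adj[u]), reverse=True)  # hubs first
    else:
        rng.shuffle(nodes)  # degree-blind failures
    removed = nodes[: int(f * len(nodes))]
    return largest_component(adj, removed) / largest_component(adj, [])


if __name__ == "__main__":
    net = preferential_attachment(2000, 2, random.Random(1))
    for f in (0.05, 0.1, 0.2, 0.4):
        print(f,
              round(relative_giant(net, f, False, random.Random(2)), 3),
              round(relative_giant(net, f, True, random.Random(2)), 3))
```

Ranking the hubs only once keeps the example short; recomputing degrees after every removal would be a harsher attack. Even with this simplification, hub removal should shrink the largest component much faster than random removal at the same f.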
BOX 8.2 PAUL BARAN AND THE INTERNET
In 1959 RAND, a Californian think tank, assigned Paul Baran, a young engineer at that time, to develop a communication system that could survive a Soviet nuclear attack. As a nuclear strike handicaps all equipment within the range of the detonation, Baran had to design a system whose users outside this range do not lose contact with one another. He described the communication network of his time as a “hierarchical structure of a set of stars connected in the form of a larger star,” offering an early description of what we call today a scale-free network [11]. He concluded that this topology is too centralized to be viable under attack. He also discarded the hub-and-spoke topology shown in Figure 8.14a, noting that the “centralized network is obviously vulnerable as destruction of a single central node destroys communication between the end stations.”
Baran decided that the ideal survivable architecture was a distributed mesh-like network (Figure 8.14c). This network is sufficiently redundant, so that even if some of its nodes fail, alternative paths can connect the remaining nodes. Baran’s ideas were ignored by the military, so when the Internet was born a decade later, it relied on distributed protocols that allowed each node to decide where to link. This decentralized philosophy paved the way to the emergence of a scale-free Internet, rather than the uniform mesh-like topology envisioned by Baran.
Figure 8.14 Baran’s Network
Panels: (a) Centralized, (b) Decentralized, (c) Distributed. Legend: link, station.
Possible configurations of communication networks, as envisioned by Paul Baran in 1959. After [11].
SECTION 8.5
CASCADING FAILURES
Throughout this chapter we assumed that each node failure is a random event, hence the nodes of a network fail independently of each other. In reality, the activity of each node depends on the activity of its neighboring nodes. Consequently the failure of a node can induce the failure of the nodes connected to it. Let us consider a few examples:
• Blackouts (Power Grid) After the failure of a node or a link the electric currents are instantaneously reorganized on the rest of the power grid. For example, on August 10, 1996, a hot day in Oregon, a line carrying 1,300 megawatts sagged close to a tree and snapped. Because electricity cannot be stored, the current it carried was automatically shifted to two lower voltage lines. As these were not designed to carry the excess current, they too failed. Seconds later the excess current led to the malfunction of thirteen generators, eventually causing a blackout in eleven U.S. states and two Canadian provinces [12].
• Denial of Service Attacks (Internet) If a router fails to transmit the packets it receives, the Internet protocols alert the neighboring routers to avoid the troubled equipment by rerouting the packets along alternative routes. Consequently a failed router increases traffic on other routers, potentially inducing a series of denial of service attacks throughout the Internet [13].
• Financial Crises Cascading failures are common in economic systems. For example, the drop in U.S. house prices in 2008 spread along the links of the financial network, inducing a cascade of failed banks, companies and even nations [14, 15, 16]. It eventually caused the worst global financial meltdown since the Great Depression of the 1930s.
Figure 8.15 Domino Effect
The domino effect is the fall of a series of dominos induced by the fall of the first domino. The term is often used to refer to a sequence of events induced by a local change that propagates through the whole system. Hence the domino effect represents perhaps the simplest illustration of cascading failures, the topic of this section.
While they cover different domains, these examples have several common characteristics. First, the initial failure had only limited impact on the network structure. Second, the initial failure did not stay localized, but spread along the links of the network, inducing additional failures. Eventually, multiple nodes lost their ability to carry out their normal functions. Consequently each of these systems experienced cascading failures, a dangerous phenomenon in most networks [17]. In this section we discuss the empirical patterns governing such cascading failures. The modeling of these events is the topic of the next section.
EMPIRICAL RESULTS
Cascading failures are well documented in the case of the power grid, information systems and tectonic motion, offering detailed statistics about their frequency and magnitude.
• Blackouts A blackout can be caused by power station failures, damage to electric transmission lines, a short circuit, and so on. When the operating limits of a component are exceeded, it is automatically disconnected to protect it. Such a failure redistributes the power previously carried by the failed component to other components, altering the power flow, the frequency, the voltage and the phase of the current, and the operation of the control, monitoring and alarm systems. These changes can in turn disconnect other components as well, starting an avalanche of failures. A frequently recorded measure of blackout size is the energy unserved. Figure 8.17a shows the probability distribution p(s) of energy unserved in all North American blackouts between 1984 and 1998. Electrical engineers approximate the obtained distribution with the power law [18],
$$p(s) \sim s^{-\alpha}, \tag{8.14}$$
where the avalanche exponent α is listed in Table 8.2 for several countries. The power-law nature of this distribution indicates that most blackouts are rather small, affecting only a few consumers. These coexist, however, with occasional major blackouts, when millions of consumers lose power (Figure 8.16).
Figure 8.16 Northeast Blackout of 2003
One of the largest blackouts in North America took place on August 14, 2003, just before 4:10 p.m. Its cause was a software bug in the alarm system at a control room of the First Energy Corporation in Ohio. Missing the alarm, the operators were unaware of the need to redistribute the power after an overloaded transmission line hit a tree. Consequently a normally manageable local failure began a cascading failure that shut down more than 508 generating units at 265 power plants, leaving an estimated 10 million people without electricity in Ontario and 45 million in eight U.S. states. The figure highlights the states affected by the August 14, 2003 blackout. For a satellite image of the blackout, see Figure 1.1.
• Information Cascades Modern communication systems, from email to Facebook or Twitter, facilitate the cascade-like spreading of information along the links of the social network. As the events pertaining to the spreading process often leave digital traces, these platforms allow researchers to detect the underlying cascades. The microblogging service Twitter has been studied extensively in this context. On Twitter the network of who follows whom can be reconstructed by crawling the service's follower graph. As users frequently share web content using URL shorteners, one can also track each spreading/sharing process. A study tracking 74 million such events over two months followed the diffusion of each URL from a particular seed node through its reposts until the end of a cascade (Figure 8.18). As Figure 8.17b indicates, the size distribution of the observed cascades follows the power law (8.14) with an avalanche exponent α ≈ 1.75 [19]. The power law indicates that the vast majority of posted URLs do not spread at all, a conclusion supported by the fact that the average cascade size is only ⟨s⟩ = 1.14. Yet, a small fraction of URLs are reposted thousands of times.
• Earthquakes Geological fault surfaces are irregular and sticky, preventing them from sliding smoothly against each other. Once a fault has locked, the continued relative motion of the tectonic plates accumulates an increasing amount of strain energy around the fault surface. When the stress becomes sufficient to break through the asperity, a sudden slide releases the stored energy, causing an earthquake. Earthquakes can also be induced by the natural rupture of geological faults, by volcanic activity, landslides, mine blasts and even nuclear tests. Each year around 500,000 earthquakes are detected with instrumentation. Only about 100,000 of these are sufficiently strong to be felt by humans. Seismologists approximate the distribution of earthquake amplitudes with the power law (8.14) with α ≈ 1.67 (Figure 8.17c) [20].
Earthquakes are rarely considered a manifestly network phenomenon, given the difficulty of mapping out the precise network of interdependencies that causes them. Yet, the resulting cascading failures bear many similarities to network-based cascading events, suggesting common mechanisms.
Figure 8.17 Cascade Size Distributions
Panels: (a) Power failures, p(s) versus s, the energy unserved (MWh); (b) Twitter cascades, p(s) versus s, the retweet number; (c) Earthquakes, P(s) versus log s, the earthquake magnitude, for shallow (0-70 km), intermediate (70-300 km) and deep (300-700 km) events.
(a) The distribution of energy loss for all North American blackouts between 1984 and 1998, as documented by the North American Electrical Reliability Council. The distribution is typically fitted to (8.14). The reported exponents for different countries are listed in Table 8.2. After [18].
(b) The distribution of cascade sizes on Twitter. While most tweets go unnoticed, a tiny fraction of tweets are shared thousands of times. Overall the retweet numbers are well approximated by (8.14) with α ≃ 1.75. After [19].
(c) The cumulative distribution of earthquake amplitudes recorded between 1977 and 2000. The dashed lines indicate the power law fit (8.14) used by seismologists to characterize the distribution. The earthquake magnitude shown on the horizontal axis is the logarithm of s, which is the amplitude of the observed seismic waves. After [20].
The power-law distribution (8.14) followed by blackouts, information cascades and earthquakes indicates that most cascading failures are relatively small. These small cascades capture the loss of electricity in a few houses, tweets of little interest to most users, or earthquakes so small that one needs sensitive instruments to detect them. Equation (8.14) predicts that these numerous small events coexist with a few exceptionally large events. Examples of such major cascades include the 2003 power outage in North America (Figure 8.16), the tweet Iran Election Crisis: 10 Incredible YouTube Videos http://bit.ly/vPDLo that was shared 1,399 times [21], or the January 2010 earthquake in Haiti, with over 200,000 victims. Interestingly, the avalanche exponents reported by electrical engineers, media researchers and seismologists are surprisingly close to each other, being between 1.6 and 2 (Table 8.2).
Cascading failures are documented in many other environments:
• The consequences of bad weather or mechanical failures can cascade through airline schedules, delaying multiple flights and stranding thousands of passengers (BOX 8.3) [22].
• The disappearance of a species can cascade through the food web of an ecosystem, inducing the extinction of numerous species and altering the habitat of others [23, 24, 25, 26].
• The shortage of a particular component can cripple supply chains. For example, the 2011 floods in Thailand resulted in a chronic shortage of car components that disrupted the production chain of more than 1,000 automotive factories worldwide. Therefore the damage was not limited to the flooded factories, but resulted in worldwide insurance claims reaching $20 billion [27].
Figure 8.18 Information Cascades
Examples of information cascades on Twitter. Nodes denote Twitter accounts, the top node corresponding to the account that first posted a certain shortened URL. The links correspond to those who retweeted it. These cascades capture the heterogeneity of information avalanches: most URLs are not retweeted at all, appearing as single nodes in the figure. Some, however, start major retweet avalanches, like the one seen in the bottom panel. After [19].
In summary, cascading effects are observed in systems of rather different nature. Their size distribution is well approximated by the power law (8.14), implying that most cascades are too small to be noticed; a few, however, are huge, having a global impact. The goal of the next section is to understand the origin of these phenomena and to build models that can reproduce their salient features.
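To make the meaning of the avalanche exponent in (8.14) concrete, the following minimal sketch draws synthetic cascade sizes from a continuous power law and recovers α with a standard maximum-likelihood estimator. The continuous approximation, the lower cutoff `s_min`, the sample size and the helper names are assumptions made for illustration; this is not the fitting procedure used in [18, 19, 20].

```python
import math
import random

def sample_power_law(alpha, s_min, n, rng):
    """Draw n samples from p(s) ~ s^(-alpha), s >= s_min, by inverse-transform sampling."""
    # 1 - rng.random() lies in (0, 1], so we never raise 0 to a negative power
    return [s_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def estimate_alpha(sizes, s_min):
    """Maximum-likelihood estimate of alpha for a continuous power law above s_min."""
    tail = [s for s in sizes if s >= s_min]
    return 1.0 + len(tail) / sum(math.log(s / s_min) for s in tail)

rng = random.Random(42)  # fixed seed so the run is reproducible
sizes = sample_power_law(alpha=1.75, s_min=1.0, n=100_000, rng=rng)
alpha_hat = estimate_alpha(sizes, s_min=1.0)
print(f"estimated alpha = {alpha_hat:.3f}")
```

With this many samples the estimate should land close to the input value of 1.75. On empirical data the choice of `s_min` matters, as the power law typically holds only in the tail of the distribution.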
Table 8.2 Avalanche Exponents in Real Systems
The reported avalanche exponents of the power law distribution (8.14) for energy loss in various countries [18], Twitter cascades [19] and earthquake sizes [20]. The third column indicates the nature of the measured cascade size s, corresponding to power or energy not served, the number of retweets generated by a typical tweet and the amplitude of the seismic wave.

SOURCE                        EXPONENT   CASCADE
Power grid (North America)
Power grid (Sweden)
Power grid (Norway)
Power grid (New Zealand)
Power grid (China)
Twitter Cascades              1.75       Retweets
Earthquakes                   1.67       Seismic Wave
BOX 8.3 CASCADING FLIGHT CONGESTIONS
Flight delays in the U.S. have an economic impact of over $40 billion per year [28], caused by the need for enhanced operations, passenger loss of time, decreased productivity and missed business and leisure opportunities. A flight delay is the time difference between the expected and actual departure/arrival times of a flight. Airline schedules include a buffer period between consecutive flights to accommodate short delays. When a delay exceeds this buffer, subsequent flights that use the same aircraft, crew or gate are also delayed. Consequently a delay can propagate in a cascade-like fashion through the airline network. While most flights in 2010 were on time, 37.5% arrived or departed late [22]. The delay distribution follows (8.14), implying that while most flights were delayed by just a few minutes, a few were hours behind schedule. These long delays induce correlated delay patterns, a signature of cascading congestions in the air transportation system (Figure 8.19).
Figure 8.19 Clusters of Congested Airports
U.S. aviation map showing congested airports as purple nodes and those with normal traffic as green nodes. The lines correspond to the direct flights between them on March 12, 2010. The clustering of the congested airports indicates that the delays are not independent of each other, but cascade through the airport network. After [22].
SECTION 8.6
MODELING CASCADING FAILURES
The emergence of a cascading event depends on many variables, from the structure of the network on which the cascade propagates, to the nature of the propagation process and the breakdown criteria of each individual component. The empirical results indicate that despite the diversity of these variables, the size distribution of the observed avalanches is universal, being independent of the particularities of the system. The purpose of this section is to understand the mechanisms governing cascading phenomena and to explain the power-law nature of the avalanche size distribution.
Numerous models have been proposed to capture the dynamics of cascading events [18, 29, 30, 31, 32, 33, 34, 35]. While these models differ in the degree of fidelity they employ to capture specific phenomena, they indicate that systems that develop cascades share three key ingredients:
(i) The system is characterized by some flow over a network, like the flow of electric current in the power grid or the flow of information in communication systems.
(ii) Each component has a local breakdown rule that determines when it contributes to a cascade, either by failing (power grid, earthquakes) or by choosing to pass on a piece of information (Twitter).
(iii) Each system has a mechanism to redistribute the traffic to other nodes upon the failure or the activation of a component.
Next, we discuss two models that predict the characteristics of cascading failures at different levels of abstraction.
FAILURE PROPAGATION MODEL
Introduced to model the spread of ideas and opinions [30], the failure propagation model is frequently used to describe cascading failures as well [35]. The model is defined as follows:
Consider a network with an arbitrary degree distribution, where each node contains an agent. An agent i can be in the state 0 (active or healthy) or 1 (inactive or failed), and is characterized by a breakdown threshold φi = φ for all i.
All agents are initially in the healthy state 0. At time t = 0 one agent switches to state 1, corresponding to an initial component failure or to the release of a new piece of information. In each subsequent time step we randomly pick an agent and update its state following a threshold rule:
• If the selected agent i is in state 0, it inspects the state of its ki neighbors. The agent i adopts state 1 (i.e. it also fails) if at least a φ fraction of its ki neighbors are in state 1, otherwise it retains its original state 0.
• If the selected agent i is in state 1, it does not change its state.
In other words, a healthy node i changes its state if at least a φ fraction of its neighbors have failed. Depending on the local network topology, an initial perturbation can die out immediately, failing to induce the failure of any other node. It can also lead to the failure of multiple nodes, as illustrated in Figure 8.20a,b. The simulations document three regimes with distinct avalanche characteristics (Figure 8.20c):
• Subcritical Regime If ⟨k⟩ is high, changing the state of a node is unlikely to move other nodes over their threshold, as the healthy nodes have many healthy neighbors. In this regime cascades die out quickly and their sizes follow an exponential distribution. Hence the system is unable to support large global cascades (blue symbols, Figure 8.20c,d).
• Supercritical Regime If ⟨k⟩ is small, flipping a single node can put several of its neighbors over the threshold, triggering a global cascade. In this regime perturbations induce major breakdowns (purple symbols, Figure 8.20c,d).
• Critical Regime At the boundary of the subcritical and supercritical regime the avalanches have widely different sizes. Numerical simulations indicate that in this regime the avalanche sizes s follow (8.14) (green and orange symbols, Figure 8.20d) with α = 3/2 if the underlying network is random.
Figure 8.20 Failure Propagation Model
Panels: (a,b) a five-node network (nodes A to E) with φ = 0.4 and the fraction f of failed neighbors marked at each node; (c) phase diagram in the (⟨k⟩, φ) plane with subcritical and supercritical regions; (d) p(s) versus s at the lower critical point, at the upper critical point, and in the subcritical and supercritical regimes.
(a,b) The development of a cascade in a small network in which each node has the same breakdown threshold φ = 0.4. Initially all nodes are in state 0, shown as green circles. After node A changes its state to 1 (purple), its neighbors B and E will have a fraction f = 1/2 > 0.4 of their neighbors in state 1. Consequently they also fail, changing their state to 1, as shown in (b). In the next time step C and D will also fail, as both have f > 0.4. Consequently the cascade sweeps the whole network, reaching a size s = 5. One can check that if we initially flip node B, it will not induce an avalanche.
(c) The phase diagram of the failure propagation model in terms of the threshold φ and the average degree ⟨k⟩ of the network on which the avalanche propagates. The continuous line encloses the region of the (⟨k⟩, φ) plane in which the cascades can propagate in a random graph.
(d) Cascade size distributions for N = 10,000 and φ = 0.18, ⟨k⟩ = 1.05 (green), ⟨k⟩ = 3.0 (purple), ⟨k⟩ = 5.76 (orange) and ⟨k⟩ = 10.0 (blue). At the lower critical point we observe a power law p(s) with exponent α = 3/2. In the supercritical regime we have only a few small avalanches, as most cascades are global. In the upper critical and subcritical regime we see only small avalanches. After [30].
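The threshold rule can be checked on the five-node example of Figure 8.20a,b with a few lines of code. The sketch below is illustrative only: the edge list is an assumption, chosen so that it reproduces the fractions f quoted in the caption, and the function name `run_cascade` is hypothetical. Instead of picking agents at random, it sweeps over all nodes until nothing changes. Because failed nodes never recover, the final set of failed nodes does not depend on the update order.

```python
def run_cascade(neighbors, seed, phi):
    """Return the set of failed nodes once no healthy node can still fail."""
    failed = {seed}
    changed = True
    while changed:
        changed = False
        for node, nbrs in neighbors.items():
            if node in failed:
                continue  # a node in state 1 never changes its state
            fraction = sum(n in failed for n in nbrs) / len(nbrs)
            if fraction >= phi:  # at least a phi fraction of neighbors failed
                failed.add(node)
                changed = True
    return failed

# Assumed edge list, consistent with the fractions f shown in Figure 8.20a,b
edges = [("A", "B"), ("A", "C"), ("A", "E"), ("B", "C"), ("C", "D"), ("D", "E")]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

size_from_a = len(run_cascade(neighbors, "A", phi=0.4))
size_from_b = len(run_cascade(neighbors, "B", phi=0.4))
print(size_from_a, size_from_b)  # prints "5 1"
```

Starting from A the cascade sweeps all five nodes, while starting from B it stops immediately, in line with the caption of Figure 8.20a,b.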
BRANCHING MODEL
Given the complexity of the failure propagation model, it is hard to analytically predict the scaling behavior of the obtained avalanches. To understand the power-law nature of p(s) and to calculate the avalanche exponent α, we turn to the branching model. This is the simplest model that still captures the basic features of a cascading event.
The model builds on the observation that each cascading failure follows a branching process. Indeed, let us call the node whose initial failure triggers the avalanche the root of the tree. The branches of the tree are the nodes whose failure was triggered by this initial failure. For example, in Figure 8.20a,b, the breakdown of node A starts the avalanche, hence A is the root of the tree. The failure of A leads to the failure of B and E, representing the two branches of the tree. Subsequently E induces the failure of D and B leads to the failure of C (Figure 8.21a).
The branching model captures the essential features of avalanche propagation (Figure 8.21). The model starts with a single active node. In the next time step each active node produces k offspring, where k is selected from a pk distribution. If a node selects k = 0, that branch dies out (Figure 8.21b). If it selects k > 0, it will have k new active sites. The size of an avalanche corresponds to the size of the tree when all active sites have died out (Figure 8.21c).
Figure 8.21 Branching Model
Panels: (a) branching tree with nodes A to E; (b) elementary branching step, each outcome with p = 0.5; (c) x(t) versus t, with s = tmax + 1 = 6; (d) Subcritical; (e) Supercritical; (f) Critical.
(a) The branching process mirroring the propagation of the failure shown in Figure 8.20a,b. The perturbation starts from node A, whose failure flips B and E, which in turn flip C and D, respectively.
(b) An elementary branching process. Each active link (green) can become inactive with probability p0 = 1/2 (top) or give birth to two new active links with probability p2 = 1/2 (bottom).
(c) To analytically calculate p(s) we map the branching process into a diffusion problem. For this we show the number of active sites, x(t), as a function of time t. A nonzero x(t) means that the avalanche persists. When x(t) becomes zero, we lose all active sites and the avalanche ends. In the example shown in the image this happens at t = 5, hence the size of the avalanche is tmax + 1 = 6. An exact mapping between the branching model and a one-dimensional random walk helps us calculate the avalanche exponent. Consider a branching process starting from a stub with one active end. When an active site becomes inactive, the number of active sites decreases by one, i.e. x → x − 1. When an active site branches, it creates two active sites, i.e. x → x + 1. This maps the avalanche size s to the time it takes for a walk that starts at x = 1 to reach x = 0 for the first time. This is a much studied process in random walk theory, predicting that the return time distribution follows a power law with exponent 3/2 [32]. For a branching process corresponding to a scale-free pk, the avalanche exponent depends on γ, as shown in Figure 8.22.
(d,e,f) Typical avalanches generated by the branching model in the subcritical (d), supercritical (e) and critical regime (f). The green node in each cascade marks the root of the tree, representing the first perturbation. In (d) and (f) we show multiple trees, while in (e) we show only one, as each tree (avalanche) grows indefinitely.
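The elementary branching process of Figure 8.21b is simple enough to simulate directly. The sketch below is a minimal illustration under stated assumptions: each active site either dies (k = 0) or produces two offspring (k = 2) with probability `p_branch`, so that ⟨k⟩ = 2 · `p_branch`, and the avalanche size is the total number of nodes in the tree. The cutoff `max_size`, the number of runs and all names are hypothetical choices, introduced only to keep the run short.

```python
import random
from collections import Counter

def avalanche_size(p_branch, rng, max_size=1_000):
    """Size of one tree: each site has 2 offspring with prob. p_branch, otherwise 0."""
    active, size = 1, 1  # start from a single active root
    while active > 0 and size < max_size:
        active -= 1  # process one active site (x -> x - 1)
        if rng.random() < p_branch:
            active += 2  # the site branches; net effect is x -> x + 1
            size += 2
    return size

rng = random.Random(0)  # fixed seed for reproducibility
n_runs = 20_000
counts = Counter(avalanche_size(0.5, rng) for _ in range(n_runs))  # critical: <k> = 1
frac_size_1 = counts[1] / n_runs
frac_size_3 = counts[3] / n_runs
print(frac_size_1, frac_size_3)
```

At the critical point half of the avalanches should stop at the root (probability p0 = 1/2) and about one in eight should reach size 3 (probability p2 · p0² = 1/8), while the remaining runs spread over a broad range of sizes. Changing `p_branch` below or above 0.5 moves the process into the subcritical or supercritical regime.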
The branching model predicts the same phases as those observed in the failure propagation model. The phases are now determined only by ⟨k⟩, hence by the pk distribution:
• Subcritical Regime: ⟨k⟩ < 1 For ⟨k⟩ < 1 on average each branch has fewer than one offspring. Consequently each tree will terminate quickly (Figure 8.21d). In this regime the avalanche sizes follow an exponential distribution.
• Supercritical Regime: ⟨k⟩ > 1 For ⟨k⟩ > 1 on average each branch has more than one offspring. Consequently the tree will continue to grow indefinitely (Figure 8.21e). Hence in this regime all avalanches are global.
• Critical Regime: ⟨k⟩ = 1 For ⟨k⟩ = 1 on average each branch has exactly one offspring. Consequently some trees are large and others die out shortly (Figure 8.21f). Numerical simulations indicate that in this regime the avalanche size distribution follows the power law (8.14).
The branching model can be solved analytically, allowing us to determine the avalanche size distribution for an arbitrary pk. If pk is exponentially bounded, e.g. it has an exponential tail, the calculations predict α = 3/2. If, however, pk is scale-free, then the avalanche exponent depends on the power-law exponent γ, following (Figure 8.22) [32, 33]
$$\alpha = \begin{cases} 3/2, & \gamma \geq 3 \\ \gamma/(\gamma - 1), & 2 < \gamma < 3 \end{cases} \tag{8.15}$$
Figure 8.22 The Avalanche Exponent
The dependence of the avalanche exponent α (vertical axis) on the degree exponent γ (horizontal axis) of the network on which the avalanche propagates, according to (8.15). The plot indicates that for 2 < γ < 3 the avalanche exponent depends on the degree exponent. Beyond γ = 3, however, the avalanches behave as if they were spreading on a random network, in which case we have α = 3/2.
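The regime classification and the prediction (8.15) translate into two short helper functions. This is a sketch for illustration: the function names and the dictionary representation of pk are assumptions, and the tolerance used to detect ⟨k⟩ = 1 is an arbitrary numerical choice.

```python
def avalanche_exponent(gamma):
    """Avalanche exponent alpha from (8.15) for a scale-free p_k with degree exponent gamma."""
    if gamma >= 3:
        return 1.5
    if gamma > 2:
        return gamma / (gamma - 1)
    raise ValueError("(8.15) is stated only for gamma > 2")

def regime(pk, tol=1e-12):
    """Classify a branching process by its mean offspring number <k> = sum_k k * p_k."""
    mean_k = sum(k * p for k, p in pk.items())
    if abs(mean_k - 1.0) < tol:
        return "critical"
    return "subcritical" if mean_k < 1 else "supercritical"

print(avalanche_exponent(2.5))       # gamma / (gamma - 1) branch of (8.15)
print(regime({0: 0.5, 2: 0.5}))      # the process of Figure 8.21b
```

Note that the two branches of (8.15) agree at γ = 3, where γ/(γ − 1) = 3/2, so α changes continuously with γ.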
SECTION 8.7
BUILDING ROBUSTNESS
Can we enhance a network’s robustness? In this section we show that the insights we gained about the factors that influence robustness allow us to design networks that can simultaneously resist random failures and attacks. We also discuss how to stop a cascading failure, allowing us to enhance a system’s dynamical robustness. Finally, we apply the developed tools to the power grid, linking its robustness to its reliability.
DESIGNING ROBUST NETWORKS
Designing networks that are simultaneously robust to attacks and random failures appears to be a conflicting desire [36, 37, 38, 39]. For example, the hub-and-spoke network of Figure 8.23a is robust to random failures, as only the failure of its central node can break the network into isolated components. Therefore, the probability that a random failure will fragment the network is 1/N, which is negligible for large N. At the same time this network is vulnerable to attacks, as the removal of a single node, its central hub, breaks the network into isolated nodes.
We can enhance this network’s attack tolerance by connecting its peripheral nodes (Figure 8.23b), so that the removal of the hub does not fragment the network. There is a price, however, for this enhanced robustness: it requires us to double the number of links. If we define the cost to build and maintain a network to be proportional to its average degree ⟨k⟩, the cost of the network of Figure 8.23b is 24/7, double the cost 12/7 of the network of Figure 8.23a. The increased cost prompts us to refine our question: Can we maximize the robustness of a network to both random failures and targeted attacks without changing the cost?
Figure 8.23 Enhancing Robustness
Panels: (a) hub-and-spoke network, ⟨k⟩ = 12/7; (b) reinforced network, ⟨k⟩ = 24/7; (c) fc versus γ for the random, targeted and total thresholds.
(a) A hub-and-spoke network is robust to random failures but has a low tolerance to an attack that removes its central hub.
(b) By connecting some of the small degree nodes, the reinforced network has a higher tolerance to targeted attacks. This increases the cost measured by ⟨k⟩, which is higher for the reinforced network.
(c) Random fc^rand, targeted fc^targ and total fc^tot percolation thresholds for scale-free networks as a function of the degree exponent γ for a network with kmin = 3.
A network’s robustness against random failures is captured by its percolation threshold fc, which is the fraction of the nodes we must remove for the network to fall apart. To enhance a network's robustness we must increase fc. According to (8.7) fc depends only on ⟨k⟩ and ⟨k²⟩. Consequently the degree distribution which maximizes fc needs to maximize ⟨k²⟩ if we wish to keep the cost ⟨k⟩ fixed. This is achieved by a bimodal distribution, corresponding to a network with only two kinds of nodes, with degrees kmin and kmax (Figure 8.23a,b).
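The argument that a bimodal distribution raises fc at fixed cost can be illustrated numerically. The sketch below assumes that (8.7) takes the form fc = 1 − 1/(⟨k²⟩/⟨k⟩ − 1); this form should be checked against the equation as it appears earlier in the chapter. The two degree distributions and the function names are hypothetical examples, chosen so that both have the same ⟨k⟩ = 4.

```python
def moments(pk):
    """Return (<k>, <k^2>) for a degree distribution given as {degree: probability}."""
    k1 = sum(k * p for k, p in pk.items())
    k2 = sum(k * k * p for k, p in pk.items())
    return k1, k2

def critical_threshold(pk):
    """f_c = 1 - 1 / (<k^2>/<k> - 1), the form assumed here for (8.7)."""
    k1, k2 = moments(pk)
    return 1.0 - 1.0 / (k2 / k1 - 1.0)

# Two hypothetical distributions with the same cost <k> = 4
uniform = {4: 1.0}              # every node has degree 4
bimodal = {3: 0.99, 103: 0.01}  # 1% hubs with degree 103, the rest have degree 3

for name, pk in [("uniform", uniform), ("bimodal", bimodal)]:
    k1, k2 = moments(pk)
    print(f"{name}: <k> = {k1:.2f}, <k^2> = {k2:.2f}, f_c = {critical_threshold(pk):.3f}")
```

Both networks have the same number of links per node, yet the bimodal one has a much larger ⟨k²⟩ and hence a higher threshold under random failures. This comparison says nothing about attacks, which is why the next step is to optimize the sum of the two thresholds.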
If we wish to simultaneously optimize the network topology against both random failures and attacks, we search for topologies that maximize the sum (Figure 8.23c)
$$f_c^{\mathrm{tot}} = f_c^{\mathrm{rand}} + f_c^{\mathrm{targ}}. \tag{8.16}$$
A combination of analytical arguments and numerical simulations indicates that this too is best achieved by the bimodal degree distribution [36, 37, 38, 39]
$$p_k = (1 - r)\,\delta(k - k_{\min}) + r\,\delta(k - k_{\max}), \tag{8.17}$$
describing a network in which an r fraction of nodes have degree kmax and the remaining (1 − r) fraction have degree kmin.
As we show in ADVANCED TOPICS 8.G, the maximum of fc^tot is obtained when r = 1/N, i.e. when there is a single node with degree kmax and the remaining nodes have degree kmin. In this case the value of kmax depends on the system size as
$$k_{\max} = A N^{2/3}. \tag{8.18}$$
In other words, a network that is robust to both random failures and attacks has a single hub with degree (8.18), and the rest of the nodes have the same degree kmin. This hub-and-spoke topology is obviously robust against random failures, as the chance of removing the central hub is 1/N, tiny for large N.
The obtained network may appear to be vulnerable to an attack that removes its hub, but it is not necessarily so. Indeed, the network’s giant component is held together both by the central hub and by the many nodes with degree kmin, which for kmin > 1 form a giant component on their own. Hence while the removal of the kmax hub causes a major one-time loss, the remaining low degree nodes are robust against subsequent targeted removal (Figure 8.24c).
Figure 8.24 Optimizing Attack and Failure Tolerance
Panels: (a) ⟨k⟩ = 2, (b) ⟨k⟩ = 3, (c) ⟨k⟩ = 5; each right panel shows P∞ versus f for attack and random failure.
The figure illustrates the optimal network topologies predicted by (8.16) and (8.17), consisting of a single hub of degree (8.18), while the rest of the nodes have the same degree kmin determined by ⟨k⟩. The left panels show the network topology for N = 300; the right panels show the failure/attack curves for N = 10,000.
(a) For small ⟨k⟩ the hub holds the network together. Once we remove this central hub the network breaks apart. Hence the attack and error curves are well separated, indicating that the network is robust to random failures but fragile to attacks.
(b) For larger ⟨k⟩ a giant component emerges that exists even without the central hub. Hence while the hub enhances the system’s robustness to random failures, it is no longer essential for the network. In this case both the attack threshold fc^targ and the error threshold fc^rand are large.
(c) For even larger ⟨k⟩ the error and the attack curves are indistinguishable, indicating that the network responds in the same way to attacks and random failures. In this case the network is well connected even without its central hub.
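To get a feel for the optimal topology, we can compute the hub degree (8.18) and the degree kmin that the remaining nodes need in order to keep the cost ⟨k⟩ fixed. With r = 1/N, the mean of (8.17) gives ⟨k⟩ = (1 − 1/N) kmin + kmax/N, which we solve for kmin. The sketch below is illustrative: the prefactor A is not specified in this section, so `A = 1.0` is a placeholder, the function name is hypothetical, and kmin is left as a real number rather than rounded to an integer degree.

```python
def optimal_bimodal_degrees(n_nodes, mean_degree, A=1.0):
    """Hub degree from (8.18) and the k_min that keeps <k> fixed when r = 1/N.

    A is a placeholder prefactor; its value is not given in this section.
    """
    k_max = A * n_nodes ** (2.0 / 3.0)
    r = 1.0 / n_nodes  # a single hub
    # <k> = (1 - r) * k_min + r * k_max, solved for k_min
    k_min = (mean_degree - r * k_max) / (1.0 - r)
    return k_max, k_min

k_max, k_min = optimal_bimodal_degrees(n_nodes=10_000, mean_degree=3.0)
print(f"k_max = {k_max:.1f}, k_min = {k_min:.3f}")
```

For N = 10,000 and A = 1 the hub links to only a few percent of the nodes, and kmin stays close to ⟨k⟩. Since kmax/N decreases as N^(−1/3), the share of the hub in the total cost shrinks as the network grows.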
BOX 8.4 HALTING CASCADING FAILURES
Can we avoid cascading failures? The first instinct is to reinforce the network by adding new links. The problem with reinforcement is that in most real systems the time needed to establish a new link is much larger than the timescale of a cascading failure. For example, thanks to regulatory, financial and legal barriers, building a new transmission line on the power grid can take up to two decades. In contrast, a cascading failure can sweep the power grid in a few seconds.

In a counterintuitive fashion, the impact of cascading failures can be reduced through selective node and link removal [40]. To do so we note that each cascading failure has two parts:

(i) Initial failure is the breakdown of the first node or link, representing the source of the subsequent cascade.
(ii) Propagation is when the initial failure induces the failure of additional nodes and starts cascading through the network.

Typically the time interval between (i) and (ii) is much shorter than the time scale over which the network could be reinforced. Yet, simulations indicate that the size of a cascade can be reduced if we intentionally remove additional nodes right after the initial failure (i), but before the failure could propagate. Even though the intentional removal of a node or a link causes further damage to the network, the removal of a well chosen component can suppress the cascade propagation [40]. Simulations indicate that to limit the size of the cascades we must remove nodes with small loads and links with large excess load in the vicinity of the initial failure. The mechanism is similar to the method used by firefighters, who set a controlled fire in the fireline to consume the fuel in the path of a wildfire.

A dramatic manifestation of this approach is provided by the Lazarus effect, the ability to revive a previously "dead" bacterium, i.e. one that is unable to grow and multiply. This can be achieved through the knockout of a few well selected genes (Figure 8.25) [41]. Therefore, in a counterintuitive fashion, controlled damage can be beneficial to a network.

Figure 8.25
Lazarus Effect
[Plot: BIOMASS FLUX (vertical axis) versus NUMBER OF GENES (horizontal axis).]
The growth rate of a bacterium is determined by its ability to generate biomass, the molecules it needs to build its cell wall, DNA and other cellular components. If some key genes are missing, the bacterium is unable to generate the necessary biomass. Unable to multiply, it will eventually die. Genes in whose absence the biomass flux is zero are called essential.
The plot shows the biomass flux for E. coli, a bacterium frequently studied by biologists. The original mutant is missing an essential gene, hence its biomass flux is zero, as shown on the vertical axis. Consequently, it cannot multiply. Yet, as the figure illustrates, by removing five additional genes we can turn on the biomass flux. Therefore, counterintuitively, we can revive a dead organism through the removal of further genes, a phenomenon called the Lazarus effect [41].
CASE STUDY: ESTIMATING ROBUSTNESS
The European power grid is an ensemble of more than twenty national power grids consisting of over 3,000 generators and substations (nodes) and 200,000 km of transmission lines (Figure 8.26a-d). The network's degree distribution can be approximated with (Figure 8.26e) [42, 43]

$$ p_k = \frac{e^{-k/\langle k \rangle}}{\langle k \rangle}, \tag{8.19} $$

indicating that its topology is characterized by a single parameter, ⟨k⟩. Such an exponential $p_k$ emerges in growing networks that lack preferential attachment (SECTION 5.5).
By knowing ⟨k⟩ for each national power grid, we can predict the respective network's critical threshold $f_c^{\mathrm{targ}}$ for attacks. As Figure 8.26f shows, for national power grids with ⟨k⟩ > 1.5 there is a reasonable agreement between the observed and the predicted $f_c^{\mathrm{targ}}$ (Group 1). However, for power grids with ⟨k⟩ < 1.5 (Group 2) the predicted $f_c^{\mathrm{targ}}$ underestimates the real $f_c^{\mathrm{targ}}$, indicating that these national networks are more robust to attacks than expected based on their degree distribution. As we show next, this enhanced robustness correlates with the reliability of the respective national networks.

To test the relationship between robustness and reliability, we use several quantities, collected and reported for each power failure: (1) energy not supplied; (2) total loss of power; (3) average interruption time, measured in minutes per year. The measurements indicate that Group 1 networks, for which the real and the theoretical $f_c^{\mathrm{targ}}$ agree, represent two thirds of the full network size and carry almost as much power and energy as the Group 2 networks. Yet, Group 1 accumulates more than five times the average interruption time, more than two times the recorded power losses and almost four times the undelivered energy compared to Group 2 [42]. Hence, the national power grids in Group 1 are significantly more fragile than the power grids in Group 2. This result offers direct evidence that networks that are topologically more robust are also more reliable. At the same time this finding is rather counterintuitive: One would expect the denser networks to be more robust. We find, however, that the sparser power grids display enhanced robustness.

In summary, a better understanding of the network topology is essential to improve the robustness of complex systems. We can enhance robustness by either designing network topologies that are simultaneously robust to both random failures and attacks, or by interventions that limit the spread of cascading failures. These results may suggest that we should redesign the topology of the Internet and the power grid to enhance their robustness [44]. Given the opportunity to do so, this could indeed be achieved. Yet, these infrastructural networks were built incrementally over decades, following the self-organized growth process described in the previous chapters. Given the enormous cost of each node and link, it is unlikely that we would ever be given a chance to rebuild them.
Figure 8.26
The Power Grid
(a) The power grid is a complex infrastructure consisting of (1) power generators, (2) switching units, (3) the high voltage transmission grid, (4) transformers, (5) low voltage lines, (6) consumers, like households or businesses. When we study the network behind the power grid, many of these details are ignored.
(b,c,d) The Italian power grid with the details of production and consumption. Once we strip these details from the network, we obtain the spatial network shown in (c). Once the spatial information is also removed, we arrive at the network (d), which is the typical object of study at the network level.
(e) The complementary cumulative degree distribution $P_k$ of the European power grid. The plot shows the data for the full network (UCTE) and separately for Italy, and the joint network of UK and Ireland, indicating that the national grids' $P_k$ also follows (8.19).
(f) The phase space ($f_c^{\mathrm{targ}}$, ⟨k⟩) of exponential uncorrelated networks under attack, where $f_c^{\mathrm{targ}}$ is the fraction of hubs we must remove to fragment the network. The continuous curve corresponds to the critical boundary for attacks, below which the network retains its giant component. The plot also shows the estimated $f_c^{\mathrm{targ}}$(⟨k⟩) for attacks for the thirty-three national power grids within EU, each shown as a separate circle. The plot indicates the presence of two classes of power grids. For countries with ⟨k⟩ > 1.5 (Group 1), the analytical prediction for $f_c^{\mathrm{targ}}$ agrees with the numerically observed values. For countries with ⟨k⟩ < 1.5 (Group 2) the analytical prediction underestimates the numerically observed values. Therefore, Group 2 national grids show enhanced robustness to attacks, meaning that they are more robust than expected for a random network with the same degree sequence. After [42].
SECTION 8.8
SUMMARY: ACHILLES' HEEL

The masterminds of the September 11, 2001 attacks did not choose their targets at random: the World Trade Center in New York, the Pentagon, and the White House (an intended target) in Washington DC are the hubs of America's economic, military, and political power [45]. Yet, while causing a human tragedy far greater than any other event America has experienced since the Vietnam war, the attacks failed to topple the network. They did offer, however, an excuse to start new wars, like the Iraq and the Afghan wars, triggering a series of cascading events whose impact was far more devastating than the 9/11 terrorist attacks themselves. Yet, all networks, ranging from the economic to the military and the political web, survived. Hence, we can view 9/11 as a tale of robustness and network resilience (BOX 8.5). The roots of this robustness were uncovered in this chapter: Real networks have a whole hierarchy of hubs. Taking out any one of them is not sufficient to topple the underlying network.

The remarkable robustness of real networks represents good news for most complex systems. Indeed, there are countless errors in our cells, from misfolding proteins to the late arrival of a transcription factor. Yet, the robustness of the underlying cellular network allows our cells to carry on their normal functions. Network robustness also explains why we rarely notice the effect of router errors on the Internet or why the disappearance of a species does not result in an immediate environmental catastrophe.

This topological robustness has its price, however: fragility against attacks. As we showed in this chapter, the simultaneous removal of several hubs will break any network. This is bad news for the Internet, as it allows crackers to design strategies that can harm this vital communication system. It is bad news for economic systems, as it indicates that hub removal can cripple the whole economy, as vividly illustrated by the 2009 financial meltdown. Yet, it is good news for drug design, as it suggests that an accurate map of cellular networks can help us develop drugs that can kill unwanted bacteria or cancer cells.

The message of this chapter is simple: Network topology, robustness, and fragility cannot be separated from one another. Rather, each complex system has its own Achilles' Heel: the networks behind them are simultaneously robust to random failures but vulnerable to attacks.

When considering robustness, we cannot ignore the fact that most systems have numerous controls and feedback loops that help them survive in the face of errors and failures. Internet protocols were designed to 'route around the trouble', guiding the traffic away from routers that malfunction; cells have numerous mechanisms to dismantle faulty proteins and to shut down malfunctioning genes. This chapter documented a new contribution to robustness: the structure of the underlying network offers a system an enhanced failure tolerance.

The robustness of scale-free networks prompts us to ask: Could this enhanced robustness be the reason why many real networks are scale-free? Perhaps real systems have developed a scale-free architecture to satisfy their need for robustness. If this hypothesis is correct we should be able to set robustness as an optimization criterion and obtain a scale-free network. Yet, as we showed in SECTION 8.7, a network with maximal robustness has a hub-and-spoke topology. Its degree distribution is bimodal, rather than a power law. This suggests that robustness is not the principle that drives the development of real networks. Rather, networks are scale-free thanks to growth and preferential attachment. It so happens that scale-free networks also have enhanced robustness. Yet, they are not the most robust networks we could design.

BOX 8.5 ROBUSTNESS, RESILIENCE, REDUNDANCY
Redundancy and resilience are concepts deeply linked to robustness. It is useful to clarify the differences between them.

Robustness
A system is robust if it can maintain its basic functions in the presence of internal and external errors. In a network context robustness refers to the system's ability to carry out its basic functions even when some of its nodes and links may be missing.

Resilience
A system is resilient if it can adapt to internal and external errors by changing its mode of operation, without losing its ability to function. Hence resilience is a dynamical property that requires a shift in the system's core activities.

Redundancy
Redundancy implies the presence of parallel components and functions that, if needed, can replace a missing component or function. Networks show considerable redundancy in their ability to navigate information between two nodes, thanks to the multiple independent paths between most node pairs.

Figure 8.27
From Percolation to Robustness: A Brief History
[Timeline spanning 1950 to 2010, with the years 1957, 1964, 2000 and 2001 marked. Image labels: PERCOLATION (p = 0.593, p = 0.62), SHLOMO HAVLIN. The entries, in the order they appear:]
- Mathematicians Simon Broadbent and John Hammersley introduce percolation and formalize many of its mathematical concepts [5]. The theory rose to prominence in the 1960s and 70s, finding applications from oil exploration to superconductivity.
- Paul Baran explores the vulnerability of communication networks to Soviet nuclear attacks, concluding that they are too centralized to be viable under attack. He proposes instead a mesh-like network architecture (BOX 8.2).
- Albert, Jeong and Barabási study the error and attack tolerance of complex networks, discovering their joint robustness to failures and fragility to attacks.
- Shlomo Havlin and his collaborators establish a formal link between network robustness and percolation theory, showing that the percolation threshold of a scale-free network is determined by the first two moments of the degree distribution.
The systematic study of network robustness started with a paper published in Nature (Figure 8.1) by Réka Albert, Hawoong Jeong and Albert-László Barabási [1], reporting the robustness of scale-free networks to random failures and their fragility to attacks. Yet, the analytical understanding of network robustness relies on percolation theory. In this context, particularly important were the contributions of Shlomo Havlin and collaborators, who established the formal link between robustness and percolation theory and showed that the percolation threshold of a scale-free network is determined by the moments of the degree distribution. A statistical physicist from Israel, Havlin has made multiple contributions to the study of networks, from discovering the self-similar nature of real networks [46] to exploring the robustness of layered networks [47].
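BOX 8.6 AT A GLANCE: NETWORK ROBUSTNESS

Molloy-Reed criterion: A giant component exists if
$$ \frac{\langle k^2 \rangle}{\langle k \rangle} > 2 $$

Random failures:
$$ f_c = 1 - \frac{1}{\dfrac{\langle k^2 \rangle}{\langle k \rangle} - 1} $$

Random network:
$$ f_c^{ER} = 1 - \frac{1}{\langle k \rangle} $$

Enhanced robustness:
$$ f_c > f_c^{ER} $$

Attacks:
$$ f_c^{\frac{2-\gamma}{1-\gamma}} = 2 + \frac{2-\gamma}{3-\gamma}\, k_{\min} \left( f_c^{\frac{3-\gamma}{1-\gamma}} - 1 \right) $$

Cascading failures:
$$ p(s) \sim s^{-\alpha}, \qquad \alpha = \begin{cases} 3/2 & \gamma > 3 \\ \dfrac{\gamma}{\gamma - 1} & 2 < \gamma < 3 \end{cases} $$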
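The attack threshold in the box is only given implicitly, so it has to be solved numerically. The following is a minimal sketch, not part of the original text, that finds a root of that equation by bisection. The function names, the bracket and the tolerance are illustrative choices, and the sketch assumes 2 < γ < 3, where the two sides of the equation are guaranteed to cross between f = 0 and f = 1.

```python
import math


def random_failure_threshold(k_mean, k2_mean):
    """Random-failure threshold f_c = 1 - 1 / (<k^2>/<k> - 1)."""
    return 1.0 - 1.0 / (k2_mean / k_mean - 1.0)


def attack_threshold(gamma, k_min, tol=1e-12):
    """Solve the attack equation of the box for f_c by bisection.

    Assumption of this sketch: 2 < gamma < 3.
    """
    if not 2.0 < gamma < 3.0:
        raise ValueError("this sketch assumes 2 < gamma < 3")
    a = (2.0 - gamma) / (1.0 - gamma)   # exponent on the left-hand side
    b = (3.0 - gamma) / (1.0 - gamma)   # exponent on the right-hand side
    c = (2.0 - gamma) / (3.0 - gamma)   # prefactor of k_min

    def g(f):
        # g(f) = LHS - RHS; a root of g is a solution for f_c
        return f ** a - 2.0 - c * k_min * (f ** b - 1.0)

    lo, hi = 1e-12, 1.0                 # g(lo) > 0 and g(1) = -1 < 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For γ = 2.5 and k_min = 2 the equation reduces to a quadratic in f^{1/3}, whose root in (0, 1) gives f_c = (2 − √2)³, which offers a convenient check of the numerical solution.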
SECTION 8.9
HOMEWORK
8.1. Random Failure: Beyond Scale-Free Networks
Calculate the critical threshold $f_c$ for networks with
(a) Power law with exponential cutoff.
(b) Lognormal distribution.
(c) Delta distribution (all nodes have the same degree).
Assume that the networks are uncorrelated and infinite. Refer to Table 4.2 for the functional form of the distribution and the corresponding first and second moments. Discuss the consequences of the obtained results for network robustness.

8.2. Critical Threshold in Correlated Networks
Generate three networks with $10^4$ nodes, that are assortative, disassortative and neutral and have a power-law degree distribution with degree exponent γ = 2.2. Use the Xulvi-Brunet & Sokolov algorithm described in SECTION 7.5 to generate the networks. With the help of a computer, study the robustness of the three networks against random failures, and compare their P∞(f)/P∞(0) ratio. Which network is the most robust? Can you explain why?

8.3. Failure of Real Networks
Determine the number of nodes that need to fail to break the networks listed in Table 4.1. Assume that each network is uncorrelated.

8.4. Conspiracy in Social Networks
In a Big Brother society, the thought police wants to follow a "divide and conquer" strategy by fragmenting the social network into isolated components. You belong to the resistance and want to foil their plans. There are rumours that the police wants to detain individuals that have many friends and individuals whose friends tend to know each other. The resistance puts you in charge of deciding which individuals to protect: those whose friendship circle is highly interconnected or those with many friends. To decide, you simulate two different attacks on your network, by removing (i) the nodes that have the highest clustering coefficient and (ii) the nodes that have the largest degree. Study the size of the giant component as a function of the fraction of removed nodes for the two attacks on the following networks:
(a) A network with $N = 10^4$ nodes generated with the configuration model (SECTION 4.8) and power-law degree distribution with γ = 2.5.
(b) A network with $N = 10^4$ nodes generated with the hierarchical model described in Figure 9.16 and ADVANCED TOPIC 9.B.
Which is the most sensitive topological information, clustering coefficient or degree, which, if protected, limits the damage best? Would it be better if all individuals' information (clustering coefficient, degree, etc.) could be kept secret? Why?

8.5. Avalanches in Networks
Generate a random network with the Erdős-Rényi G(N,p) model and a scale-free network with the configuration model, with $N = 10^3$ nodes and average degree ⟨k⟩ = 2. Assume that on each node there is a bucket which can hold as many sand grains as the node degree. Then simulate the following process:
(a) At each time step add a grain to a randomly chosen node i.
(b) If the number of grains at node i reaches or exceeds its bucket size, then it becomes unstable and all the grains at the node topple to the buckets of its adjacent nodes.
(c) If this toppling causes any of the adjacent nodes' buckets to be unstable, subsequent topplings follow on those nodes, until there is no unstable bucket left. We call this sequence of topplings an avalanche, its size s being equal to the number of nodes that turned unstable following an initial perturbation (adding one grain).
Repeat (a)-(c) $10^4$ times. Assume that at each time step a fraction $10^{-4}$ of sand grains is lost in the transfer, so that the network buckets do not become saturated with sand. Study the avalanche distribution P(s).
SECTION 8.10
ADVANCED TOPICS 8.A PERCOLATION IN SCALE-FREE NETWORKS
To understand how a scale-free network breaks apart as we approach the threshold (8.7), we need to determine the corresponding critical exponents $\gamma_p$, $\beta_p$ and ν. The calculations indicate that the scale-free property alters the value of these exponents, leading to systematic deviations from the exponents that characterize random networks (SECTION 8.2).

Let us start with the probability P∞ that a randomly selected node belongs to the giant component. According to (8.2) this follows a power law near $p_c$ (or $f_c$ in the case of node removal). The calculations predict that for a scale-free network the exponent $\beta_p$ depends on the degree exponent γ as [7, 48, 49, 50, 51]

$$ \beta_p = \begin{cases} \dfrac{1}{3-\gamma} & 2 < \gamma < 3 \\[2mm] \dfrac{1}{\gamma-3} & 3 < \gamma < 4 \\[2mm] 1 & \gamma > 4. \end{cases} \tag{8.20} $$

For random networks (recovered for γ > 4) we have $\beta_p$ = 1, while for most scale-free networks of practical interest $\beta_p$ > 1. Therefore, the giant component collapses faster in the vicinity of the critical point in a scale-free network than in a random network.

The exponent characterizing the average component size near $p_c$ follows [48]

$$ \gamma_p = \begin{cases} 1 & \gamma > 3 \\ -1 & 2 < \gamma < 3. \end{cases} \tag{8.21} $$

The negative $\gamma_p$ for γ < 3 may appear surprising. Note, however, that for γ < 3 we always have a giant component. Hence, the divergence (8.1) cannot be observed in this regime.
For a randomly connected network with arbitrary degree distribution the size distribution of the finite clusters follows [48, 50, 51]

$$ n_s \sim s^{-\tau} e^{-s/s^*}. \tag{8.22} $$

Here, $n_s$ is the number of clusters of size s and $s^*$ is the crossover cluster size. At criticality

$$ s^* \sim |p - p_c|^{-\sigma}. \tag{8.23} $$

The critical exponents are

$$ \tau = \begin{cases} \dfrac{5}{2} & \gamma > 4 \\[2mm] \dfrac{2\gamma - 3}{\gamma - 2} & 2 < \gamma < 4, \end{cases} \tag{8.24} $$

$$ \sigma = \begin{cases} \dfrac{3-\gamma}{\gamma-2} & 2 < \gamma < 3 \\[2mm] \dfrac{\gamma-3}{\gamma-2} & 3 < \gamma < 4 \\[2mm] \dfrac{1}{2} & \gamma > 4. \end{cases} \tag{8.25} $$

The random network values τ = 5/2 and σ = 1/2 are recovered for γ > 4.

In summary, the exponents describing the breakdown of a scale-free network depend on the degree exponent γ. This is true even in the range 3 < γ < 4, where the percolation transition occurs at a finite threshold $f_c$. The mean-field behavior predicted for percolation in infinite dimensions, capturing the response of a random network to random failures, is recovered only for γ > 4.
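The piecewise expressions (8.20), (8.21), (8.24) and (8.25) can be collected in a small lookup helper, which makes it easy to see how the exponents change with γ. This is an illustrative sketch only: the function name and the returned dictionary keys are assumptions of this example, and the boundary values γ = 3 and γ = 4, where the formulas switch branches, are deliberately rejected.

```python
def percolation_exponents(gamma):
    """Return beta_p, gamma_p, tau and sigma for a degree exponent gamma > 2."""
    if gamma <= 2.0:
        raise ValueError("the expressions are stated for gamma > 2")
    if gamma in (3.0, 4.0):
        raise ValueError("gamma = 3 and gamma = 4 are branch boundaries")

    if gamma > 4.0:
        # mean-field (random network) values
        beta_p, tau, sigma = 1.0, 2.5, 0.5
    else:
        tau = (2.0 * gamma - 3.0) / (gamma - 2.0)     # (8.24), 2 < gamma < 4
        if gamma > 3.0:
            beta_p = 1.0 / (gamma - 3.0)              # (8.20), 3 < gamma < 4
            sigma = (gamma - 3.0) / (gamma - 2.0)     # (8.25), 3 < gamma < 4
        else:
            beta_p = 1.0 / (3.0 - gamma)              # (8.20), 2 < gamma < 3
            sigma = (3.0 - gamma) / (gamma - 2.0)     # (8.25), 2 < gamma < 3

    gamma_p = 1.0 if gamma > 3.0 else -1.0            # (8.21)
    return {"beta_p": beta_p, "gamma_p": gamma_p, "tau": tau, "sigma": sigma}
```

For example, γ = 2.5 gives $\beta_p$ = 2, consistent with the statement that the giant component collapses faster in a scale-free network than in a random one.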
SECTION 8.11
ADVANCED TOPICS 8.B MOLLOY-REED CRITERION
The purpose of this section is to derive the Molloy-Reed criterion, which allows us to calculate the percolation threshold of an arbitrary network [6]. For a giant component to exist each node that belongs to it must be connected to at least two other nodes on average (Figure 8.8). Therefore, the average degree $k_i$ of a randomly chosen node i that is part of the giant component should be at least 2. Denote with $P(k_i \mid i \leftrightarrow j)$ the conditional probability that a node in a network with degree $k_i$ is connected to a node j that is part of the giant component. This conditional probability allows us to determine the expected degree of node i as [51]

$$ \langle k_i \mid i \leftrightarrow j \rangle = \sum_{k_i} k_i P(k_i \mid i \leftrightarrow j) = 2. \tag{8.26} $$

In other words, $\langle k_i \mid i \leftrightarrow j \rangle$ should equal or exceed two, the condition for node i to be part of the giant component. We can write the probability appearing in the sum (8.26) as

$$ P(k_i \mid i \leftrightarrow j) = \frac{P(k_i, i \leftrightarrow j)}{P(i \leftrightarrow j)} = \frac{P(i \leftrightarrow j \mid k_i)\, p(k_i)}{P(i \leftrightarrow j)}, \tag{8.27} $$

where we used Bayes' theorem in the last term. For a network with degree distribution $p_k$, in the absence of degree correlations, we can write

$$ P(i \leftrightarrow j) = \frac{2L}{N(N-1)} = \frac{\langle k \rangle}{N-1}, \qquad P(i \leftrightarrow j \mid k_i) = \frac{k_i}{N-1}, \tag{8.28} $$

which expresses the fact that we can choose between N − 1 nodes to link to, each with probability 1/(N − 1), and that we can try this $k_i$ times. We can now return to (8.26), obtaining

$$ \sum_{k_i} k_i P(k_i \mid i \leftrightarrow j) = \sum_{k_i} k_i \frac{P(i \leftrightarrow j \mid k_i)\, p(k_i)}{P(i \leftrightarrow j)} = \sum_{k_i} k_i \frac{k_i\, p(k_i)}{\langle k \rangle} = \frac{\sum_{k_i} k_i^2\, p(k_i)}{\langle k \rangle}. \tag{8.29} $$

With that we arrive at the Molloy-Reed criterion (8.4), providing the condition to have a giant component as

$$ \kappa = \frac{\langle k^2 \rangle}{\langle k \rangle} > 2. \tag{8.30} $$
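As a quick illustration of (8.30), the ratio κ can be evaluated directly from a degree sequence. The sketch below is not from the original text; the function names are illustrative, and the check is only meaningful under the assumption made in the derivation, namely an uncorrelated network with the given degrees.

```python
def molloy_reed_kappa(degrees):
    """Return kappa = <k^2> / <k> for a list of node degrees."""
    n = len(degrees)
    k_mean = sum(degrees) / n
    k2_mean = sum(k * k for k in degrees) / n
    return k2_mean / k_mean


def has_giant_component(degrees):
    """Molloy-Reed criterion (8.30): a giant component exists if kappa > 2.

    Assumes an uncorrelated random network with this degree sequence.
    """
    return molloy_reed_kappa(degrees) > 2.0
```

A sequence in which every node has degree 2 sits exactly at the boundary κ = 2, while a 3-regular sequence (κ = 3) satisfies the criterion.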
SECTION 8.12
ADVANCED TOPICS 8.C CRITICAL THRESHOLD UNDER RANDOM FAILURES
The purpose of this section is to derive (8.7), which provides the critical threshold for random node removal [7, 51]. The random removal of an f fraction of nodes has two consequences:
• It alters the degree of some nodes, as nodes that were previously connected to the removed nodes will lose some links [$k \to k' \le k$].
• Consequently, it changes the degree distribution, as the neighbors of the missing nodes will have an altered degree [$p_k \to p'_{k'}$].
To be specific, after we randomly remove an f fraction of nodes, a node with degree k becomes a node with degree k' with probability

$$ \binom{k}{k'} f^{k-k'} (1-f)^{k'}, \qquad k' \le k. \tag{8.31} $$

The first f-dependent term in (8.31) accounts for the fact that the selected node lost (k − k') links, each with probability f; the next term accounts for the fact that node removal leaves k' links untouched, each with probability (1 − f). The probability that we have a degree-k node in the original network is $p_k$; the probability that we have a node with degree k' in the new network is

$$ p'_{k'} = \sum_{k=k'}^{\infty} p_k \binom{k}{k'} f^{k-k'} (1-f)^{k'}. \tag{8.32} $$

Let us assume that we know ⟨k⟩ and ⟨k²⟩ for the original degree distribution $p_k$. Our goal is to calculate ⟨k'⟩ and ⟨k'²⟩ for the new degree distribution $p'_{k'}$, obtained after we randomly removed an f fraction of the nodes. For this we write
$$
\begin{aligned}
\langle k' \rangle_f &= \sum_{k'=0}^{\infty} k'\, p'_{k'} \\
&= \sum_{k'=0}^{\infty} k' \sum_{k=k'}^{\infty} p_k \left( \frac{k!}{k'!\,(k-k')!}\, f^{k-k'} (1-f)^{k'} \right) \\
&= \sum_{k'=0}^{\infty} \sum_{k=k'}^{\infty} p_k\, \frac{k(k-1)!}{(k'-1)!\,(k-k')!}\, f^{k-k'} (1-f)^{k'-1} (1-f).
\end{aligned}
\tag{8.33}
$$
The sum above is performed over the triangle shown in Figure 8.28. We can check that we are performing the same sum if we change the order of summation together with the limits of the sums as

$$ \sum_{k'=0}^{\infty} \sum_{k=k'}^{\infty} = \sum_{k=0}^{\infty} \sum_{k'=0}^{k}. \tag{8.34} $$

Figure 8.28
The Integration Domain
[Plot: the (k', k) plane with the region k = [k', ∞) highlighted.]
In (8.34) we change the integration order, i.e. the order of the two sums. We can do so because both sums are defined over the triangle shown in purple in the figure.
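Hence we obtain

$$
\begin{aligned}
\langle k' \rangle_f &= \sum_{k=0}^{\infty} \sum_{k'=0}^{k} p_k\, \frac{k(k-1)!}{(k'-1)!\,(k-k')!}\, f^{k-k'} (1-f)^{k'-1} (1-f) \\
&= \sum_{k=0}^{\infty} (1-f)\, k\, p_k \sum_{k'=0}^{k} \frac{(k-1)!}{(k'-1)!\,(k-k')!}\, f^{k-k'} (1-f)^{k'-1} \\
&= \sum_{k=0}^{\infty} (1-f)\, k\, p_k \sum_{k'=0}^{k} \binom{k-1}{k'-1} f^{k-k'} (1-f)^{k'-1} \\
&= \sum_{k=0}^{\infty} (1-f)\, k\, p_k \\
&= (1-f) \langle k \rangle.
\end{aligned}
\tag{8.35}
$$

In the fourth line we used the fact that the inner sum is a complete binomial expansion of $[f + (1-f)]^{k-1}$ and therefore equals one (the k' = 0 term does not contribute). This connects ⟨k'⟩ to the original ⟨k⟩ after the random removal of an f fraction of nodes. We perform a similar calculation for ⟨k'²⟩: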
$$ \langle k'^2 \rangle_f = \langle k'(k'-1) + k' \rangle_f = \langle k'(k'-1) \rangle_f + \langle k' \rangle_f = \sum_{k'=0}^{\infty} k'(k'-1)\, p'_{k'} + \langle k' \rangle_f. \tag{8.36} $$

Again, we change the order of the sums (Figure 8.28), obtaining

$$
\begin{aligned}
\langle k'(k'-1) \rangle_f &= \sum_{k'=0}^{\infty} k'(k'-1)\, p'_{k'} \\
&= \sum_{k'=0}^{\infty} k'(k'-1) \sum_{k=k'}^{\infty} p_k \binom{k}{k'} f^{k-k'} (1-f)^{k'} \\
&= \sum_{k=0}^{\infty} \sum_{k'=0}^{k} p_k\, \frac{k!\; k'(k'-1)}{k'!\,(k-k')!}\, f^{k-k'} (1-f)^{k'} \\
&= \sum_{k=0}^{\infty} \sum_{k'=0}^{k} p_k\, \frac{k!}{(k'-2)!\,(k-k')!}\, f^{k-k'} (1-f)^{k'-2} (1-f)^2 \\
&= \sum_{k=0}^{\infty} (1-f)^2\, k(k-1)\, p_k \sum_{k'=0}^{k} \frac{(k-2)!}{(k'-2)!\,(k-k')!}\, f^{k-k'} (1-f)^{k'-2} \\
&= \sum_{k=0}^{\infty} (1-f)^2\, k(k-1)\, p_k \sum_{k'=0}^{k} \binom{k-2}{k'-2} f^{k-k'} (1-f)^{k'-2} \\
&= \sum_{k=0}^{\infty} (1-f)^2\, k(k-1)\, p_k \\
&= (1-f)^2 \langle k(k-1) \rangle.
\end{aligned}
\tag{8.37}
$$

Hence we obtain
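$$
\begin{aligned}
\langle k'^2 \rangle_f &= \langle k'(k'-1) + k' \rangle_f \\
&= \langle k'(k'-1) \rangle_f + \langle k' \rangle_f \\
&= (1-f)^2 \langle k(k-1) \rangle + (1-f) \langle k \rangle \\
&= (1-f)^2 \left( \langle k^2 \rangle - \langle k \rangle \right) + (1-f) \langle k \rangle \\
&= (1-f)^2 \langle k^2 \rangle - (1-f)^2 \langle k \rangle + (1-f) \langle k \rangle \\
&= (1-f)^2 \langle k^2 \rangle + \left( -f^2 + 2f - 1 + 1 - f \right) \langle k \rangle \\
&= (1-f)^2 \langle k^2 \rangle + f(1-f) \langle k \rangle,
\end{aligned}
\tag{8.38}
$$

which connects ⟨k'²⟩ to the original ⟨k²⟩ after the random removal of an f fraction of nodes. Let us put the results (8.35) and (8.38) together: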
$$ \langle k' \rangle_f = (1-f) \langle k \rangle, \tag{8.39} $$

$$ \langle k'^2 \rangle_f = (1-f)^2 \langle k^2 \rangle + f(1-f) \langle k \rangle. \tag{8.40} $$

According to the Molloy-Reed criterion (8.4) the breakdown threshold is given by

$$ \kappa = \frac{\langle k'^2 \rangle_f}{\langle k' \rangle_f} = 2. \tag{8.41} $$

Inserting (8.39) and (8.40) into (8.41) we obtain our final result (8.7),

$$ f_c = 1 - \frac{1}{\dfrac{\langle k^2 \rangle}{\langle k \rangle} - 1}, \tag{8.42} $$

providing the breakdown threshold of networks with arbitrary $p_k$ under random node removal.
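The derivation can be checked numerically for any finite degree distribution. The sketch below, which is an illustration rather than part of the original text, applies the binomial thinning (8.32) exactly, compares the resulting moments with (8.39) and (8.40), and evaluates the threshold (8.42). The dictionary representation of $p_k$ and the function names are assumptions of this example.

```python
from math import comb


def thinned_distribution(p, f):
    """Apply (8.32): p maps degree k -> p_k; returns a dict k' -> p'_{k'}."""
    k_max = max(p)
    return {
        kp: sum(pk * comb(k, kp) * f ** (k - kp) * (1.0 - f) ** kp
                for k, pk in p.items() if k >= kp)
        for kp in range(k_max + 1)
    }


def moments(p):
    """Return (<k>, <k^2>) of a degree distribution given as a dict."""
    m1 = sum(k * pk for k, pk in p.items())
    m2 = sum(k * k * pk for k, pk in p.items())
    return m1, m2


def critical_threshold(p):
    """Random-failure threshold (8.42) for the degree distribution p."""
    m1, m2 = moments(p)
    return 1.0 - 1.0 / (m2 / m1 - 1.0)


# Usage: an arbitrary three-point distribution, chosen only for illustration
p = {1: 0.5, 2: 0.3, 5: 0.2}
f = 0.3
k1, k2 = moments(p)
k1_new, k2_new = moments(thinned_distribution(p, f))
print(k1_new, (1 - f) * k1)                          # both sides of (8.39)
print(k2_new, (1 - f) ** 2 * k2 + f * (1 - f) * k1)  # both sides of (8.40)
print(critical_threshold(p))                         # f_c from (8.42)
```

Evaluating the thinned distribution at f = f_c should give a ratio ⟨k'²⟩/⟨k'⟩ of 2, which is the breakdown condition (8.41).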
SECTION 8.13
ADVANCED TOPICS 8.D BREAKDOWN OF A FINITE SCALE-FREE NETWORK
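In this section we derive the dependence (8.10) of the breakdown threshold of a scale-free network on the network size N. We start by calculating the m-th moment of a power-law distribution

$$ \langle k^m \rangle = (\gamma - 1)\, k_{\min}^{\gamma-1} \int_{k_{\min}}^{k_{\max}} k^{m-\gamma}\, dk = \frac{(\gamma-1)}{(m-\gamma+1)}\, k_{\min}^{\gamma-1} \left[ k^{m-\gamma+1} \right]_{k_{\min}}^{k_{\max}}. \tag{8.43} $$

Using (4.18)

$$ k_{\max} = k_{\min} N^{\frac{1}{\gamma-1}} \tag{8.44} $$

we obtain

$$ \langle k^m \rangle = \frac{(\gamma-1)}{(m-\gamma+1)}\, k_{\min}^{\gamma-1} \left[ k_{\max}^{m-\gamma+1} - k_{\min}^{m-\gamma+1} \right]. \tag{8.45} $$

To calculate $f_c$ we need to determine the ratio

$$ \kappa = \frac{\langle k^2 \rangle}{\langle k \rangle} = \frac{(2-\gamma)}{(3-\gamma)}\, \frac{k_{\max}^{3-\gamma} - k_{\min}^{3-\gamma}}{k_{\max}^{2-\gamma} - k_{\min}^{2-\gamma}}, \tag{8.46} $$

which for large N (and hence for large $k_{\max}$) depends on γ as

$$ \kappa = \frac{\langle k^2 \rangle}{\langle k \rangle} = \left| \frac{2-\gamma}{3-\gamma} \right| \begin{cases} k_{\min} & \gamma > 3 \\ k_{\max}^{3-\gamma}\, k_{\min}^{\gamma-2} & 3 > \gamma > 2 \\ k_{\max} & 2 > \gamma > 1. \end{cases} \tag{8.47} $$

The breakdown threshold is given by (8.7)

$$ f_c = 1 - \frac{1}{\kappa - 1}, \tag{8.48} $$

where κ is given by (8.46). Inserting (8.44) into (8.47) and (8.48), we obtain

$$ f_c \approx 1 - \frac{C}{N^{\frac{3-\gamma}{\gamma-1}}}, \tag{8.49} $$

where C collects the terms that do not depend on N. This is (8.10).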
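To see how slowly $f_c$ approaches 1 with the network size, one can evaluate (8.44), (8.46) and (8.48) directly. The following sketch is illustrative only: the function names are assumptions of this example, and it uses the exact ratio (8.46) rather than the large-N approximation (8.47), so it requires γ ≠ 2 and γ ≠ 3.

```python
def kappa_finite(gamma, k_min, n):
    """kappa = <k^2>/<k> from (8.46), with the natural cutoff k_max of (8.44)."""
    k_max = k_min * n ** (1.0 / (gamma - 1.0))
    num = k_max ** (3.0 - gamma) - k_min ** (3.0 - gamma)
    den = k_max ** (2.0 - gamma) - k_min ** (2.0 - gamma)
    return (2.0 - gamma) / (3.0 - gamma) * num / den


def finite_size_threshold(gamma, k_min, n):
    """Breakdown threshold (8.48) of a finite scale-free network."""
    return 1.0 - 1.0 / (kappa_finite(gamma, k_min, n) - 1.0)


# Usage: gamma = 2.5 and k_min = 2 are example values, not taken from the text
for n in (10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6):
    print(n, finite_size_threshold(2.5, 2, n))
```

For γ = 2.5 the ratio (8.46) simplifies to κ = (k_max k_min)^{1/2}, so the threshold grows towards 1 as N increases, in line with (8.49).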
SECTION 8.14
ADVANCED TOPICS 8.E ATTACK AND ERROR TOLERANCE OF REAL NETWORKS
In this section we explore the attack and error curves for the ten reference networks discussed in Table 4.1 and Table 8.2. The corresponding curves are shown in Figure 8.29. Their inspection reveals several patterns, confirming the results discussed in this chapter:
• For all networks the error and attack curves separate, confirming the Achilles' Heel property (SECTION 8.8): Real networks are robust to random failures but are fragile to attacks.
• The separation between the error and attack curves depends on the average degree and the degree heterogeneity of each network. For example, for the citation and the actor networks $f_c$ for the attacks is in the vicinity of 0.5 and 0.75, respectively, rather large values. This is because these networks are rather dense, with ⟨k⟩ = 20.8 for the citation network and ⟨k⟩ = 83.7 for the actor network. Hence these networks can survive the removal of a very high fraction of their hubs.
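Error and attack curves of this kind can be produced with a few lines of code. The sketch below is a minimal illustration, not the procedure used for the figure: it represents the network as a dictionary of neighbor sets, ranks nodes by their initial degree for the attack (without recalculating degrees after each removal, a simplifying assumption), and recomputes the largest component from scratch after each removal, which is only practical for small networks. All function names are hypothetical.

```python
import random
from collections import deque


def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                      # breadth-first search from `start`
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def attack_order(adj):
    """Nodes sorted by decreasing initial degree (hubs first)."""
    return sorted(adj, key=lambda u: len(adj[u]), reverse=True)


def error_order(adj, seed=0):
    """Nodes in a random order, modelling random failures."""
    nodes = list(adj)
    random.Random(seed).shuffle(nodes)
    return nodes


def removal_curve(adj, order):
    """List of (f, P_inf(f) / P_inf(0)) as nodes are removed in `order`."""
    n = len(adj)
    base = largest_component_size(adj, set())
    return [(m / n, largest_component_size(adj, order[:m]) / base)
            for m in range(n + 1)]
```

Calling `removal_curve(adj, attack_order(adj))` and `removal_curve(adj, error_order(adj))` on the same network gives the two curves to compare. On a star graph, for instance, the attack curve collapses after the first removal, while removing a single leaf leaves the rest connected.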
Figure 8.29
Error and Attack Curves
[Ten panels, (a) to (j), each plotting P∞ (vertical axis) against f (horizontal axis). Panel titles: POWER GRID, INTERNET, MOBILE PHONE CALLS, EMAIL, SCIENTIFIC COLLABORATION, ACTOR, PROTEIN, METABOLIC, WWW, CITATION.]
The error (green) and attack (purple) curves for the ten reference networks listed in Table 4.1. The green vertical line corresponds to the estimated $f_c^{\mathrm{rand}}$ for errors, while the purple vertical line corresponds to $f_c^{\mathrm{targ}}$ for attacks. The estimated $f_c$ corresponds to the point where the giant component first drops below 1% of its original size. In most systems this procedure offers a good approximation for $f_c$. The only exception is the metabolic network, for which $f_c^{\mathrm{targ}}$ < 0.25, but a small cluster persists, pushing the reported $f_c^{\mathrm{targ}}$ to $f_c^{\mathrm{targ}}$ ≃ 0.5.
SECTION 8.15
ADVANCED TOPICS 8.F ATTACK THRESHOLD
The goal of this section is to derive (8.12), providing the attack threshold of a scale-free network. We aim to calculate $f_c$ for an uncorrelated scale-free network, generated by the configuration model with $p_k = c \cdot k^{-\gamma}$, where $k = k_{\min}, \ldots, k_{\max}$ and $c \approx (\gamma - 1)/(k_{\min}^{-\gamma+1} - k_{\max}^{-\gamma+1})$.

The removal of an f fraction of nodes in a decreasing order of their degree (hub removal) has two effects [9, 51]:
(i) The maximum degree of the network changes from $k_{\max}$ to $k'_{\max}$.
(ii) The links connected to the removed hubs are also removed, changing the degree distribution of the remaining network.
The resulting network is still uncorrelated, therefore we can use the Molloy-Reed criterion to determine the existence of a giant component.

We start by considering the impact of (i). The new upper cutoff, $k'_{\max}$, is given by

$$ f = \int_{k'_{\max}}^{k_{\max}} p_k\, dk = \frac{\gamma-1}{\gamma-1}\, \frac{k_{\max}'^{\,-\gamma+1} - k_{\max}^{-\gamma+1}}{k_{\min}^{-\gamma+1} - k_{\max}^{-\gamma+1}}. \tag{8.50} $$

If we assume that $k_{\max} \gg k'_{\max}$ and $k_{\max} \gg k_{\min}$ (true for large scale-free networks with natural cutoff), we can ignore the $k_{\max}$ terms, obtaining

$$ f = \left( \frac{k'_{\max}}{k_{\min}} \right)^{-\gamma+1}, \tag{8.51} $$

which leads to

$$ k'_{\max} = k_{\min}\, f^{\frac{1}{1-\gamma}}. \tag{8.52} $$

Equation (8.52) provides the new maximum degree of the network after we remove an f fraction of the hubs.
Next we turn to (ii), accounting for the fact that hub removal changes the degree distribution $p_k \to p'_k$. In the absence of degree correlations we assume that the links of the removed hubs connect to randomly selected stubs. Consequently, we calculate the fraction of links removed 'randomly', $\tilde{f}$, as a consequence of removing an f fraction of the hubs:

$$ \tilde{f} = \frac{\int_{k'_{\max}}^{k_{\max}} k p_k \, dk}{\langle k \rangle} = \frac{1}{\langle k \rangle} \, c \int_{k'_{\max}}^{k_{\max}} k^{-\gamma+1} \, dk = \frac{1}{\langle k \rangle} \, \frac{1-\gamma}{2-\gamma} \, \frac{(k'_{\max})^{-\gamma+2} - k_{\max}^{-\gamma+2}}{k_{\min}^{-\gamma+1} - k_{\max}^{-\gamma+1}} . \tag{8.53} $$

Ignoring the $k_{\max}$ term again and using $\langle k \rangle \approx \frac{\gamma-1}{\gamma-2} k_{\min}$ we obtain

$$ \tilde{f} = \left( \frac{k'_{\max}}{k_{\min}} \right)^{-\gamma+2} . \tag{8.54} $$

Using (8.51) we obtain

$$ \tilde{f} = f^{\frac{2-\gamma}{1-\gamma}} . \tag{8.55} $$

For $\gamma \to 2$ we have $\tilde{f} \to 1$, which means that the removal of a tiny fraction of the hubs removes all links, potentially destroying the network. This is consistent with the finding of CHAPTER 4 that for $\gamma = 2$ the hubs dominate the network.

In general the degree distribution of the remaining network is

$$ p'_{k'} = \sum_{k=k_{\min}}^{k'_{\max}} \binom{k}{k'} \tilde{f}^{\,k-k'} (1-\tilde{f})^{k'} p_k . \tag{8.56} $$
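The two effects of hub removal can be made concrete with a short numerical sketch. The function below simply evaluates the approximations (8.52) and (8.55); its name and the parameter values in the demo are illustrative assumptions, not part of the original derivation.

```python
def hub_removal_effects(f, gamma, k_min):
    """Return (k'_max, f_tilde) after removing the top f fraction of nodes.

    Evaluates the large-k_max approximations (8.52) and (8.55).
    Assumes 0 < f <= 1 and gamma > 2 (illustrative restriction).
    """
    k_max_new = k_min * f ** (1.0 / (1.0 - gamma))   # Eq. (8.52)
    f_tilde = f ** ((2.0 - gamma) / (1.0 - gamma))   # Eq. (8.55)
    return k_max_new, f_tilde


if __name__ == "__main__":
    # Remove 1% of the nodes (the largest hubs) for several exponents.
    for gamma in (3.0, 2.5, 2.1):
        k_new, f_tilde = hub_removal_effects(f=0.01, gamma=gamma, k_min=2)
        print(f"gamma={gamma}: k'_max={k_new:.1f}, f_tilde={f_tilde:.3f}")
```

As gamma approaches 2, the fraction of links lost grows toward 1 even though only 1% of the nodes are removed, in line with the remark following (8.55).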
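Note that (8.56) has the same form as the degree distribution (8.32) obtained in ADVANCED TOPICS 8.C. This means that we can now proceed with the calculation method developed there for random node removal. To be specific, we calculate $\kappa$ for a scale-free network with $k_{\min}$ and $k'_{\max}$ using (8.45):

$$ \kappa = \frac{2-\gamma}{3-\gamma} \, \frac{(k'_{\max})^{3-\gamma} - k_{\min}^{3-\gamma}}{(k'_{\max})^{2-\gamma} - k_{\min}^{2-\gamma}} . \tag{8.57} $$

Substituting (8.52) into this expression we have

$$ \kappa = \frac{2-\gamma}{3-\gamma} \, \frac{k_{\min}^{3-\gamma} f^{(3-\gamma)/(1-\gamma)} - k_{\min}^{3-\gamma}}{k_{\min}^{2-\gamma} f^{(2-\gamma)/(1-\gamma)} - k_{\min}^{2-\gamma}} = \frac{2-\gamma}{3-\gamma} \, k_{\min} \, \frac{f^{(3-\gamma)/(1-\gamma)} - 1}{f^{(2-\gamma)/(1-\gamma)} - 1} . \tag{8.58} $$

After simple transformations we obtain

$$ f_c^{\frac{2-\gamma}{1-\gamma}} = 2 + \frac{2-\gamma}{3-\gamma} \, k_{\min} \left( f_c^{\frac{3-\gamma}{1-\gamma}} - 1 \right) . \tag{8.59} $$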
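Equation (8.59) defines $f_c$ only implicitly, so in practice it is solved numerically. The following is a minimal sketch that uses plain bisection; the function name, the bracketing interval and the restriction to 2 < γ < 3 are assumptions made for illustration.

```python
def attack_threshold(gamma, k_min, tol=1e-12):
    """Solve Eq. (8.59) for f_c by bisection on (0, 1).

    Assumes 2 < gamma < 3. In that range the residual g(f) is positive
    as f -> 0 and equals -1 at f = 1, so a root is bracketed.
    """
    a = (2.0 - gamma) / (1.0 - gamma)            # exponent on the left side
    b = (3.0 - gamma) / (1.0 - gamma)            # exponent on the right side
    coef = (2.0 - gamma) / (3.0 - gamma) * k_min

    def g(f):
        # Left-hand side minus right-hand side of Eq. (8.59).
        return f ** a - 2.0 - coef * (f ** b - 1.0)

    lo, hi = 1e-12, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(attack_threshold(gamma=2.5, k_min=2))
```

For γ = 2.5 and $k_{\min} = 2$ the substitution $x = f_c^{1/3}$ turns (8.59) into $x^2 - 4x + 2 = 0$, whose root below one is $x = 2 - \sqrt{2}$, giving $f_c = (2 - \sqrt{2})^3 \approx 0.20$. This offers a closed-form check on the numerical result.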
SECTION 8.17
ADVANCED TOPICS 8.G THE OPTIMAL DEGREE DISTRIBUTION
In this section we derive the bimodal degree distribution that simultaneously optimizes a network's topology against attacks and failures, as discussed in SECTION 8.7 [37]. Let us assume, as we did in (8.17), that the degree distribution is bimodal, consisting of two delta functions:

$$ p_k = (1-r)\,\delta(k - k_{\min}) + r\,\delta(k - k_{\max}) . \tag{8.62} $$

We start by calculating the total threshold, $f_c^{tot}$, as a function of r and $k_{\max}$ for a fixed $\langle k \rangle$. To obtain analytical expressions for $f_c^{rand}$ and $f_c^{targ}$ we calculate the moments of the bimodal distribution (8.62),

$$ \langle k \rangle = (1-r) k_{\min} + r k_{\max} , \qquad \langle k^2 \rangle = (1-r) k_{\min}^2 + r k_{\max}^2 = \frac{(\langle k \rangle - r k_{\max})^2}{1-r} + r k_{\max}^2 . \tag{8.63} $$

Inserting these into (8.7) we obtain

$$ f_c^{rand} = \frac{\langle k \rangle^2 - 2r \langle k \rangle k_{\max} - 2(1-r)\langle k \rangle + r k_{\max}^2}{\langle k \rangle^2 - 2r \langle k \rangle k_{\max} - (1-r)\langle k \rangle + r k_{\max}^2} . \tag{8.64} $$
To determine the threshold for targeted attack, we must consider the fact that we have only two types of nodes, i.e. an r fraction of nodes have degree $k_{\max}$ and the remaining (1 − r) fraction have degree $k_{\min}$. Hence hub removal can either remove all hubs (case (i)), or only some fraction of them (case (ii)):

(i) $f_c^{targ} > r$. In this case all hubs have been removed, hence the nodes left after the targeted attack have degree $k_{\min}$. We therefore obtain

$$ f_c^{targ} = r + \frac{1-r}{\langle k \rangle - r k_{\max}} \left\{ \langle k \rangle \, \frac{\langle k \rangle - r k_{\max} - 2(1-r)}{\langle k \rangle - r k_{\max} - (1-r)} - r k_{\max} \right\} . \tag{8.65} $$

(ii) $f_c^{targ} < r$. In this case the removed nodes are all from the high-degree group, leaving behind some $k_{\max}$ nodes. Hence we obtain

$$ f_c^{targ} = \frac{\langle k \rangle^2 - 2r \langle k \rangle k_{\max} - 2(1-r)\langle k \rangle + r k_{\max}^2}{k_{\max}(k_{\max} - 1)(1-r)} . \tag{8.66} $$

With the thresholds (8.64)-(8.66) we can now evaluate the total threshold $f_c^{tot}$ (8.16). To obtain an expression for the optimal value of $k_{\max}$ as a function of r we determine the value of $k_{\max}$ for which $f_c^{tot}$ is maximal. Using (8.64) and (8.66), we find that for small r the optimal value of $k_{\max}$ can be approximated by

$$ k_{\max} \sim \left\{ \frac{2 \langle k \rangle^2 (\langle k \rangle - 1)^2}{2 \langle k \rangle - 1} \right\}^{1/3} r^{-2/3} = A r^{-2/3} . \tag{8.67} $$

Using this result and (8.16), for small r we have

$$ f_c^{tot} = 2 - \frac{1}{\langle k \rangle - 1} - \frac{3 \langle k \rangle}{A^2} \, r^{1/3} + O(r^{2/3}) . \tag{8.68} $$

Thus $f_c^{tot}$ approaches the theoretical maximum when r approaches zero. For a network of N nodes the maximum value of $f_c^{tot}$ is obtained when r = 1/N, being the smallest value consistent with having at least one node of degree $k_{\max}$. Given this r the equation determining the optimal $k_{\max}$, representing the size of the central hubs, is [37]

$$ k_{\max} = A N^{2/3} , \tag{8.69} $$

where A is defined in (8.67).
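The thresholds derived in this section are easy to evaluate numerically. The sketch below codes (8.64)-(8.66) and the optimal hub size (8.69). Two points are assumptions rather than statements from the text: the total threshold is taken to be the sum $f_c^{rand} + f_c^{targ}$ (our reading of (8.16), which is not reproduced here), and case (ii) is selected whenever (8.66) yields a value below r. The function names are hypothetical.

```python
def bimodal_thresholds(k_avg, r, k_max):
    """Return (f_rand, f_targ, f_tot) for the bimodal distribution (8.62)."""
    rk = r * k_max
    # Numerator shared by Eqs. (8.64) and (8.66).
    num = k_avg**2 - 2*r*k_avg*k_max - 2*(1 - r)*k_avg + r*k_max**2
    f_rand = num / (num + (1 - r)*k_avg)                       # Eq. (8.64)

    f_partial = num / (k_max*(k_max - 1)*(1 - r))              # Eq. (8.66)
    if f_partial < r:
        # Case (ii): the network breaks before all hubs are removed.
        f_targ = f_partial
    else:
        # Case (i): all hubs removed, remaining nodes have degree k_min.
        bracket = k_avg*(k_avg - rk - 2*(1 - r))/(k_avg - rk - (1 - r)) - rk
        f_targ = r + (1 - r)/(k_avg - rk)*bracket              # Eq. (8.65)

    # Assumption: Eq. (8.16) defines f_tot as the sum of both thresholds.
    return f_rand, f_targ, f_rand + f_targ


def optimal_k_max(k_avg, n_nodes):
    """Optimal hub degree k_max = A * N^(2/3), Eqs. (8.67) and (8.69)."""
    a = (2*k_avg**2*(k_avg - 1)**2/(2*k_avg - 1))**(1/3)
    return a * n_nodes**(2/3)


if __name__ == "__main__":
    n = 10_000
    k_opt = optimal_k_max(3, n)
    print(k_opt, bimodal_thresholds(3, 1/n, k_opt))
```

A useful sanity check is that `f_rand` agrees with the value obtained by inserting the moments (8.63) directly into $1 - 1/(\langle k^2 \rangle / \langle k \rangle - 1)$.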
SECTION 8.18
BIBLIOGRAPHY
[1] R. Albert, H. Jeong, and A.-L. Barabási. Attack and error tolerance of complex networks. Nature, 406: 378, 2000.
[2] D. Stauffer and A. Aharony. Introduction to Percolation Theory. Taylor and Francis, London, 1994.
[3] A. Bunde and S. Havlin. Fractals and Disordered Systems. Springer, 1996.
[4] B. Bollobás and O. Riordan. Percolation. Cambridge University Press, Cambridge, 2006.
[5] S. Broadbent and J. Hammersley. Percolation processes I. Crystals and mazes. Proceedings of the Cambridge Philosophical Society, 53: 629, 1957.
[6] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Structures and Algorithms, 6: 161, 1995.
[7] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin. Resilience of the Internet to random breakdowns. Phys. Rev. Lett., 85: 4626, 2000.
[8] D. S. Callaway, M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Network robustness and fragility: Percolation on random graphs. Phys. Rev. Lett., 85: 5468-5471, 2000.
[9] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin. Breakdown of the Internet under intentional attack. Phys. Rev. Lett., 86: 3682, 2001.
[10] B. Bollobás and O. Riordan. Robustness and Vulnerability of Scale-Free Random Graphs. Internet Mathematics, 1: 1-35, 2003.
[11] P. Baran. Introduction to Distributed Communications Networks. Rand Corporation Memorandum, RM-3420-PR, 1964.
[12] D. N. Kosterev, C. W. Taylor, and W. A. Mittelstadt. Model Validation of the August 10, 1996 WSCC System Outage. IEEE Transactions on Power Systems, 14: 967-979, 1999.
[13] C. Labovitz, A. Ahuja, and F. Jahanian. Experimental Study of Internet Stability and Wide-Area Backbone Failures. Proceedings of IEEE FTCS, Madison, WI, 1999.
[14] A. G. Haldane and R. M. May. Systemic risk in banking ecosystems. Nature, 469: 351-355, 2011.
[15] T. Roukny, H. Bersini, H. Pirotte, G. Caldarelli, and S. Battiston. Default Cascades in Complex Networks: Topology and Systemic Risk. Scientific Reports, 3: 2759, 2013.
[16] G. Tedeschi, A. Mazloumian, M. Gallegati, and D. Helbing. Bankruptcy cascades in interbank markets. PLoS One, 7: e52749, 2012.
[17] D. Helbing. Globally networked risks and how to respond. Nature, 497: 51-59, 2013.
[18] I. Dobson, B. A. Carreras, V. E. Lynch, and D. E. Newman. Complex systems analysis of series of blackouts: Cascading failure, critical points, and self-organization. CHAOS, 17: 026103, 2007.
[19] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. Proceedings of the fourth ACM international conference on Web search and data mining (WSDM '11). ACM, New York, NY, USA, 65-74, 2011.
[20] Y. Y. Kagan. Accuracy of modern global earthquake catalogs. Phys. Earth Planet. Inter., 135: 173, 2003.
[21] M. Nagarajan, H. Purohit, and A. P. Sheth. A Qualitative Examination of Topical Tweet and Retweet Practices. ICWSM, 295-298, 2010.
[22] P. Fleurquin, J. J. Ramasco, and V. M. Eguiluz. Systemic delay propagation in the US airport network. Scientific Reports, 3: 1159, 2013.
[23] B. K. Ellis, J. A. Stanford, D. Goodman, C. P. Stafford, D. L. Gustafson, D. A. Beauchamp, D. W. Chess, J. A. Craft, M. A. Deleray, and B. S. Hansen. Long-term effects of a trophic cascade in a large lake ecosystem. PNAS, 108: 1070, 2011.
[24] R. V. Solé and J. M. Montoya. Complexity and fragility in ecological networks. Proc. R. Soc. Lond. B, 268: 2039, 2001.
[25] F. Jordán, I. Scheuring, and G. Vida. Species Positions and Extinction Dynamics in Simple Food Webs. Journal of Theoretical Biology, 215: 441-448, 2002.
[26] S. L. Pimm and P. Raven. Biodiversity: Extinction by numbers. Nature, 403: 843, 2000.
[27] World Economic Forum. Building Resilience in Supply Chains. World Economic Forum, 2013.
[28] Joint Economic Committee of US Congress. Your flight has been delayed again: Flight delays cost passengers, airlines and the U.S. economy billions. Available at http://www.jec.senate.gov, May 22, 2008.
[29] I. Dobson, A. Carreras, and D. E. Newman. A loading dependent model of probabilistic cascading failure. Probability in the Engineering and Informational Sciences, 19: 15, 2005.
[30] D. J. Watts. A simple model of global cascades on random networks. PNAS, 99: 5766, 2002.
[31] K.-I. Goh, D.-S. Lee, B. Kahng, and D. Kim. Sandpile on scale-free networks. Phys. Rev. Lett., 91: 148701, 2003.
[32] D.-S. Lee, K.-I. Goh, B. Kahng, and D. Kim. Sandpile avalanche dynamics on scale-free networks. Physica A, 338: 84, 2004.
[33] M. Ding and W. Yang. Distribution of the first return time in fractional Brownian motion and its application to the study of on-off intermittency. Phys. Rev. E, 52: 207-213, 1995.
[34] A. E. Motter and Y.-C. Lai. Cascade-based attacks on complex networks. Phys. Rev. E, 66: 065102, 2002.
[35] Z. Kong and E. M. Yeh. Resilience to Degree-Dependent and Cascading Node Failures in Random Geometric Networks. IEEE Transactions on Information Theory, 56: 5533, 2010.
[36] G. Paul, S. Sreenivas, and H. E. Stanley. Resilience of complex networks to random breakdown. Phys. Rev. E, 72: 056130, 2005.
[37] G. Paul, T. Tanizawa, S. Havlin, and H. E. Stanley. Optimization of robustness of complex networks. European Physical Journal B, 38: 187-191, 2004.
[38] A. X. C. N. Valente, A. Sarkar, and H. A. Stone. Two-peak and three-peak optimal complex networks. Phys. Rev. Lett., 92: 118702, 2004.
[39] T. Tanizawa, G. Paul, R. Cohen, S. Havlin, and H. E. Stanley. Optimization of network robustness to waves of targeted and random attacks. Phys. Rev. E, 71: 047101, 2005.
[40] A. E. Motter. Cascade control and defense in complex networks. Phys. Rev. Lett., 93: 098701, 2004.
[41] A. Motter, N. Gulbahce, E. Almaas, and A.-L. Barabási. Predicting synthetic rescues in metabolic networks. Molecular Systems Biology, 4: 110, 2008.
[42] R. V. Solé, M. Rosas-Casals, B. Corominas-Murtra, and S. Valverde. Robustness of the European power grids under intentional attack. Phys. Rev. E, 77: 026102, 2008.
[43] R. Albert, I. Albert, and G. L. Nakarado. Structural Vulnerability of the North American Power Grid. Phys. Rev. E, 69: 025103(R), 2004.
[44] C. M. Schneider, N. Yazdani, N. A. M. Araújo, S. Havlin, and H. J. Herrmann. Towards designing robust coupled networks. Scientific Reports, 3: 1969, 2013.
[45] A.-L. Barabási. Linked: The New Science of Networks. Plume, New York, 2002.
[46] C. M. Song, S. Havlin, and H. A. Makse. Self-similarity of complex networks. Nature, 433: 392, 2005.
[47] S. V. Buldyrev, R. Parshani, G. Paul, H. E. Stanley, and S. Havlin. Catastrophic cascade of failures in interdependent networks. Nature, 464: 1025-1028, 2010.
[48] R. Cohen, D. ben-Avraham, and S. Havlin. Percolation critical exponents in scale-free networks. Phys. Rev. E, 66: 036113, 2002.
[49] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin. Anomalous percolation properties of growing networks. Phys. Rev. E, 64: 066110, 2001.
[50] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E, 64: 026118, 2001.
[51] R. Cohen and S. Havlin. Complex Networks: Structure, Robustness and Function. Cambridge University Press, Cambridge, UK, 2010.
9
ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
COMMUNITIES

ACKNOWLEDGEMENTS
MÁRTON PÓSFAI
NICOLE SAMAY
ROBERTA SINATRA
SARAH MORRISON
AMAL HUSSEINI

INDEX
1 Introduction
2 Basics of Communities
3 Hierarchical Clustering
4 Modularity
5 Overlapping Communities
6 Characterizing Communities
7 Testing Communities
8 Summary
9 Homework
10 ADVANCED TOPICS 9.A Counting Partitions
11 ADVANCED TOPICS 9.B Hierarchical Modularity
12 ADVANCED TOPICS 9.C Modularity
13 ADVANCED TOPICS 9.D Fast Algorithms for Community Detection
14 ADVANCED TOPICS 9.E Threshold for clique percolation
Bibliography

Figure 9.0 (cover image)
Art & Networks: KMode: Youth Mode
KMode is an art collective that publishes trend reports with an unusual take on various concepts. The image shows a page from Youth Mode: A Report on Freedom, discussing the subtle shift in the origins and the meaning of communities, the topic of this chapter [1]. The page reads:

NORMCORE
ONCE UPON A TIME PEOPLE WERE BORN INTO COMMUNITIES AND HAD TO FIND THEIR INDIVIDUALITY. TODAY PEOPLE ARE BORN INDIVIDUALS AND HAVE TO FIND THEIR COMMUNITIES.
MASS INDIE RESPONDS TO THIS SITUATION BY CREATING CLIQUES OF PEOPLE IN THE KNOW, WHILE NORMCORE KNOWS THE REAL FEAT IS HARNESSING THE POTENTIAL FOR CONNECTION TO SPRING UP. IT'S ABOUT ADAPTABILITY, NOT EXCLUSIVITY.

This book is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V26, 05.09.2014
SECTION 9.1
INTRODUCTION
Belgium appears to be the model bicultural society: 59% of its citizens are Flemish, speaking Dutch, and 40% are Walloons, who speak French. As multiethnic countries break up all over the world, we must ask: How has this country fostered the peaceful coexistence of these two ethnic groups since 1830? Is Belgium a densely knitted society, where it does not matter if one is Flemish or Walloon? Or do we have two nations within the same borders that have learned to minimize contact with each other? The answer was provided by Vincent Blondel and his students in 2007, who developed an algorithm to identify the country's community structure. They started from the mobile call network, placing individuals next to those whom they regularly called on their mobile phone [2]. The algorithm revealed that Belgium's social network is broken into two large clusters of communities and that individuals in one of these clusters rarely talk with individuals from the other cluster (Figure 9.1). The origin of this separation became obvious once they assigned to each node the language spoken by each individual, learning that one cluster consisted almost exclusively of French speakers and the other collected the Dutch speakers.
Figure 9.1 Communities in Belgium
Communities extracted from the call pattern of the consumers of the largest Belgian mobile phone company. The network has about two million mobile phone users. The nodes correspond to communities, the size of each node being proportional to the number of individuals in the corresponding community. The color of each community on a red–green scale represents the language spoken in the particular community, red for French and green for Dutch. Only communities of more than 100 individuals are shown. The community that connects the two main clusters consists of several smaller communities with less obvious language separation, capturing the culturally mixed Brussels, the country’s capital. After [2].
In network science we call a community a group of nodes that have a higher likelihood of connecting to each other than to nodes from other communities. To gain intuition about community organization, next we discuss two areas where communities play a particularly important role:

• Social Networks
Social networks are full of easy to spot communities, something that scholars noticed decades ago [3,4,5,6,7]. Indeed, the employees of a company are more likely to interact with their coworkers than with employees of other companies [3]. Consequently work places appear as densely interconnected communities within the social network. Communities could also represent circles of friends, or a group of individuals who pursue the same hobby together, or individuals living in the same neighborhood.

A social network that has received particular attention in the context of community detection is known as Zachary's Karate Club (Figure 9.2) [7], capturing the links between 34 members of a karate club. Given the club's small size, each club member knew everyone else. To uncover the true relationships between club members, sociologist Wayne Zachary documented 78 pairwise links between members who regularly interacted outside the club (Figure 9.2a).

The interest in the dataset is driven by a singular event: A conflict between the club's president and the instructor split the club into two. About half of the members followed the instructor and the other half the president, a breakup that unveiled the ground truth, representing the club's underlying community structure (Figure 9.2a). Today community finding algorithms are often tested based on their ability to infer these two communities from the structure of the network before the split.

• Biological Networks
Communities play a particularly important role in our understanding of how specific biological functions are encoded in cellular networks. Two years before receiving the Nobel Prize in Medicine, Lee Hartwell argued that biology must move beyond its focus on single genes. It must explore instead how groups of molecules form functional modules to carry out specific cellular functions [10]. Ravasz and collaborators [11] made the first attempt to systematically identify such modules in metabolic networks. They did so by building an algorithm to identify groups of molecules that form locally dense communities (Figure 9.3).

Communities play a particularly important role in understanding human diseases. Indeed, proteins that are involved in the same disease tend to interact with each other [12,13]. This finding inspired the disease module hypothesis [14], stating that each disease can be linked to a well-defined neighborhood of the cellular network.

The examples discussed above illustrate the diverse motivations that drive community identification. The existence of communities is rooted in who connects to whom, hence they cannot be explained based on the degree distribution alone. To extract communities we must therefore inspect a network's detailed wiring diagram. These examples inspire the starting hypothesis of this chapter:

H1: Fundamental Hypothesis
A network's community structure is uniquely encoded in its wiring diagram.

According to the fundamental hypothesis there is a ground truth about a network's community organization that can be uncovered by inspecting $A_{ij}$. The purpose of this chapter is to introduce the concepts necessary to understand and identify the community structure of a complex network. We will ask how to define communities, explore the various community characteristics and introduce a series of algorithms, relying on different principles, for community identification.

Figure 9.2 Zachary's Karate Club
[Figure: (a) network diagram with nodes labeled 1 to 34; (b) plot of CITATIONS (10 to 90) versus YEAR (1980 to 2015).]

(a) The connections between the 34 members of Zachary's Karate Club. Links capture interactions between the club members outside the club. The circles and the squares denote the two factions that emerged after the club split in two. The colors capture the best community partition predicted by an algorithm that optimizes the modularity coefficient M (SECTION 9.4). The community boundaries closely follow the split: The white and purple communities capture one faction and the green-orange communities the other. After [8].

(b) The citation history of the Zachary karate club paper [7] mirrors the history of community detection in network science. Indeed, there was virtually no interest in Zachary's paper until Girvan and Newman used it as a benchmark for community detection in 2002 [9]. Since then the number of citations to the paper exploded, reminiscent of the citation explosion to Erdős and Rényi's work following the discovery of scale-free networks (Figure 3.15).

The frequent use of Zachary's Karate Club network as a benchmark in community detection inspired the Zachary Karate Club Club, whose tongue-in-cheek statute states: "The first scientist at any conference on networks who uses Zachary's karate club as an example is inducted into the Zachary Karate Club Club, and awarded a prize." Hence the prize is not based on merit, but on the simple act of participation. Yet, its recipients are prominent network scientists (http://networkkarate.tumblr.com/). The figure shows the Zachary Karate Club trophy, which is always held by the latest inductee. Photo courtesy of Marián Boguñá.
Figure 9.3 Communities in Metabolic Networks
The E. coli metabolism offers a fertile ground to investigate the community structure of biological systems [11].

(a) The biological modules (communities) identified by the Ravasz algorithm [11] (SECTION 9.3). The color of each node, capturing the predominant biochemical class to which it belongs, indicates that different functional classes are segregated in distinct network neighborhoods. The highlighted region selects the nodes that belong to the pyrimidine metabolism, one of the predicted communities.

(b) The topological overlap matrix of the E. coli metabolism and the corresponding dendrogram that allows us to identify the modules shown in (a). The color of the branches reflects the predominant biochemical role of the participating molecules, like carbohydrates (blue), nucleotide and nucleic acid metabolism (red), and lipid metabolism (cyan).

(c) The red right branch of the dendrogram tree shown in (b), highlighting the region corresponding to the pyrimidine module.

(d) The detailed metabolic reactions within the pyrimidine module. The boxes around the reactions highlight the communities predicted by the Ravasz algorithm. After [11].
SECTION 9.2
BASICS OF COMMUNITIES
What do we really mean by a community? How many communities are in a network? How many different ways can we partition a network into communities? In this section we address these frequently emerging questions in community identification.
DEFINING COMMUNITIES
Our sense of communities rests on a second hypothesis (Figure 9.4):

H2: Connectedness and Density Hypothesis
A community is a locally dense connected subgraph in a network.

In other words, all members of a community must be reached through other members of the same community (connectedness). At the same time we expect that nodes that belong to a community have a higher probability to link to the other members of that community than to nodes that do not belong to the same community (density). While this hypothesis considerably narrows what would be considered a community, it does not uniquely define it. Indeed, as we discuss below, several community definitions are consistent with H2.

Maximum Cliques
One of the first papers on community structure, published in 1949, defined a community as a group of individuals whose members all know each other [5]. In graph theoretic terms this means that a community is a complete subgraph, or a clique. A clique automatically satisfies H2: it is a connected subgraph with maximal link density. Yet, viewing communities as cliques has several drawbacks:
• While triangles are frequent in networks, larger cliques are rare.
• Requiring a community to be a complete subgraph may be too restrictive, missing many other legitimate communities. For example, none of the communities of Figure 9.2 and 9.3 correspond to complete subgraphs.

Figure 9.4 Connectedness and Density Hypothesis
Communities are locally dense connected subgraphs in a network. This expectation relies on two distinct hypotheses:

Connectedness Hypothesis
Each community corresponds to a connected subgraph, like the subgraphs formed by the orange, green or the purple nodes. Consequently, if a network consists of two isolated components, each community is limited to only one component. The hypothesis also implies that on the same component a community cannot consist of two subgraphs that do not have a link to each other. Consequently, the orange and the green nodes form separate communities.

Density Hypothesis
Nodes in a community are more likely to connect to other members of the same community than to nodes in other communities. The orange, the green and the purple nodes satisfy this expectation.
Strong and Weak Communities
To relax the rigidity of cliques, consider a connected subgraph C of $N_C$ nodes in a network. The internal degree $k_i^{int}$ of node i is the number of links that connect i to other nodes in C. The external degree $k_i^{ext}$ is the number of links that connect i to the rest of the network. If $k_i^{ext} = 0$, each neighbor of i is within C, hence C is a good community for node i. If $k_i^{int} = 0$, then node i should be assigned to a different community. These definitions allow us to distinguish two kinds of communities (Figure 9.5):

• Strong Community
C is a strong community if each node within C has more links within the community than with the rest of the graph [15,16]. Specifically, a subgraph C forms a strong community if for each node i ∈ C,

$$ k_i^{int}(C) > k_i^{ext}(C) . \tag{9.1} $$

• Weak Community
C is a weak community if the total internal degree of a subgraph exceeds its total external degree [16]. Specifically, a subgraph C forms a weak community if

$$ \sum_{i \in C} k_i^{int}(C) > \sum_{i \in C} k_i^{ext}(C) . \tag{9.2} $$

A weak community relaxes the strong community requirement by allowing some nodes to violate (9.1). In other words, the inequality (9.2) applies to the community as a whole rather than to each node individually.

Note that each clique is a strong community, and each strong community is a weak community. The converse is generally not true (Figure 9.5).

The community definitions discussed above (cliques, strong and weak communities) refine our notions of communities. At the same time they indicate that we do have some freedom in defining communities.

Figure 9.5 Defining Communities
(a) Cliques
A clique corresponds to a complete subgraph. The highest order clique of this network is a square, shown in orange. There are several three-node cliques on this network. Can you find them?
(b) Strong Communities
A strong community, defined in (9.1), is a connected subgraph whose nodes have more links to other nodes in the same community than to nodes that belong to other communities. Such a strong community is shown in purple. There are additional strong communities on the graph. Can you find at least two more?
(c) Weak Communities
A weak community, defined in (9.2), is a subgraph whose nodes' total internal degree exceeds their total external degree. The green nodes represent one of the several possible weak communities of this network.

NUMBER OF COMMUNITIES
How many ways can we group the nodes of a network into communities? To answer this question consider the simplest community finding problem, called graph bisection: We aim to divide a network into two non-overlapping subgraphs, such that the number of links between the nodes in the two groups, called the cut size, is minimized (BOX 9.1).

Graph Partitioning
We can solve the graph bisection problem by inspecting all possible divisions into two groups and choosing the one with the smallest cut size. To determine the computational cost of this brute force approach we note that the number of distinct ways we can partition a network of N nodes into groups of $N_1$ and $N_2$ nodes is

$$ \frac{N!}{N_1! \, N_2!} . \tag{9.3} $$
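The brute-force approach to graph bisection can be sketched in a few lines. The example below enumerates every way of choosing a group of $N_1$ nodes and records the split with the smallest cut size. The function names and the toy network (two triangles joined by one link) are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import comb


def cut_size(edges, group):
    """Number of links with exactly one endpoint in `group`."""
    return sum((u in group) != (v in group) for u, v in edges)


def brute_force_bisection(nodes, edges, n1):
    """Inspect every split into groups of size n1 and len(nodes) - n1."""
    best_group, best_cut = None, None
    for candidate in combinations(nodes, n1):
        group = set(candidate)
        cut = cut_size(edges, group)
        if best_cut is None or cut < best_cut:
            best_group, best_cut = group, cut
    return best_group, best_cut


# Hypothetical toy network: two triangles joined by a single link.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

if __name__ == "__main__":
    print(brute_force_bisection(range(6), toy_edges, 3))
    # Number of candidate groups the loop visits for N = 10, N1 = 5.
    print(comb(10, 5))
```

The loop visits `comb(N, N1)` candidate groups, which equals the count in (9.3). This is what makes the approach impractical beyond very small networks.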
BOX 9.1 GRAPH PARTITIONING

Chip designers face a problem of exceptional complexity: They need to place on a chip 2.5 billion transistors such that their wires do not intersect. To simplify the problem they first partition the wiring diagram of an integrated circuit (IC) into smaller subgraphs, chosen such that the number of links between them is minimal. Then they lay out different blocks of an IC individually, and reconnect these blocks. A similar problem is encountered in parallel computing, when a large computational problem is partitioned into subtasks and assigned to individual chips. The assignment must minimize the typically slow communication between the processors.

The problem faced by chip designers or software engineers is called graph partitioning in computer science [17]. The algorithms developed for this purpose, like the widely used Kernighan-Lin algorithm (Figure 9.6), are the predecessors of the community finding algorithms discussed in this chapter.

There is an important difference between graph partitioning and community detection: Graph partitioning divides a network into a predefined number of smaller subgraphs. In contrast community detection aims to uncover the inherent community structure of a network. Consequently in most community detection algorithms the number and the size of the communities is not predefined, but needs to be discovered by inspecting the network's wiring diagram.

Figure 9.6 Kernighan-Lin Algorithm
The best known algorithm for graph partitioning was proposed in 1970 [18]. We illustrate this with graph bisection, which starts by randomly partitioning the network into two groups of predefined sizes. Next we select a node pair (i,j), where i and j belong to different groups, and swap them, recording the resulting change in the cut size. By testing all (i,j) pairs we identify the pair that results in the largest reduction of the cut size, like the pair highlighted in (a). By swapping them we arrive at the partition shown in (b). In some implementations of the algorithm, if no pair reduces the cut size, we swap the pair that increases the cut size the least.
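The swap step described in the caption of Figure 9.6 can be sketched as follows. This is a simplified greedy variant written for illustration, not the full Kernighan-Lin algorithm: it repeatedly applies the single swap that reduces the cut size the most and stops when no swap helps, without the node locking or the cut-increasing moves that some implementations use. The function names and the toy network are hypothetical.

```python
def cut_size(edges, side):
    """Number of links whose endpoints lie in different groups."""
    return sum(side[u] != side[v] for u, v in edges)


def greedy_swap_bisection(edges, side):
    """Improve a bisection by repeated best-pair swaps.

    `side` maps each node to group 0 or 1. Group sizes are preserved
    because every move exchanges one node from each group.
    """
    side = dict(side)
    while True:
        current = cut_size(edges, side)
        group_a = [n for n, s in side.items() if s == 0]
        group_b = [n for n, s in side.items() if s == 1]
        best = None
        for i in group_a:
            for j in group_b:
                side[i], side[j] = 1, 0          # tentatively swap i and j
                new_cut = cut_size(edges, side)
                side[i], side[j] = 0, 1          # undo the swap
                if best is None or new_cut < best[0]:
                    best = (new_cut, i, j)
        if best is None or best[0] >= current:
            return side, current                  # no swap reduces the cut
        _, i, j = best
        side[i], side[j] = 1, 0                   # commit the best swap


if __name__ == "__main__":
    # Hypothetical toy network: two triangles joined by a single link.
    edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
    start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
    print(greedy_swap_bisection(edges, start))
```

Each pass costs on the order of $N_1 N_2$ cut evaluations, far fewer than the exhaustive search, but the result can be a local rather than a global minimum of the cut size.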
Using Stirling's formula n! ! 2π n(n / e)n we can write (9.3) as
N! 2 N (N / e) N
N1!N2!
2 N1 (N1 / e)
N1
2 N2 (N2 / e)
N2
N N +1/2 . N N2N2 +1/2 N1 +1/2 1
(9.4)
To simplify the problem let us set the goal of dividing the network into two equal sizes N1 = N2 = N/2. In this case (9.4) becomes

$$\frac{2^{N+1}}{\sqrt{N}} = e^{(N+1)\ln 2 - \frac{1}{2}\ln N}, \qquad (9.5)$$

indicating that the number of bisections increases exponentially with the size of the network.

To illustrate the implications of (9.5) consider a network with ten nodes which we bisect into two subgraphs of size N1 = N2 = 5. According to (9.3) we need to check 252 bisections to find the one with the smallest cut size. Let us assume that our computer can inspect these 252 bisections in one millisecond ($10^{-3}$ sec). If we next wish to bisect a network with a hundred nodes into groups with N1 = N2 = 50, according to (9.3) we need to check approximately $10^{29}$ divisions, requiring about $10^{16}$ years on the same computer. Therefore our brute-force strategy is bound to fail, as it is impossible to inspect all bisections for even a modest size network.

Community Detection
While in graph partitioning the number and the size of communities is predefined, in community detection both parameters are unknown. We call a partition a division of a network into an arbitrary number of groups, such that each node belongs to one and only one group. The number of possible partitions follows [19-22]

$$B_N = \frac{1}{e}\sum_{j=0}^{\infty}\frac{j^{N}}{j!}. \qquad (9.6)$$

Figure 9.7
Number of Partitions
[Plot: the Bell number $B_N$ and the exponential $e^N$ as a function of N, for N between 0 and 50, on a logarithmic vertical axis.]
The number of partitions of a network of size N is provided by the Bell number (9.6). The figure compares the Bell number to an exponential function, illustrating that the number of possible partitions grows faster than exponentially. Given that there are over $10^{40}$ partitions for a network of size N=50, brute-force approaches that aim to identify communities by inspecting all possible partitions are computationally infeasible.

As Figure 9.7 indicates, $B_N$ grows faster than exponentially with the network size for large N.
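The counts quoted above are easy to reproduce. The sketch below assumes nothing beyond Python's standard library: it evaluates the number of bisections (9.3) with `math.comb` and obtains the Bell numbers of (9.6) from the Bell triangle recurrence instead of the infinite sum. The helper names `bisection_count` and `bell_numbers` are illustrative and do not come from the text.

```python
from math import comb

def bisection_count(n):
    """N!/(N1! N2!) from (9.3) for two groups of size n // 2."""
    return comb(n, n // 2)

def bell_numbers(n_max):
    """Return [B_0, ..., B_n_max] using the Bell triangle (exact integers)."""
    row, bells = [1], [1]
    for _ in range(n_max):
        new_row = [row[-1]]              # a row starts with the last entry of the previous row
        for above in row:
            new_row.append(new_row[-1] + above)
        row = new_row
        bells.append(row[0])
    return bells

SECONDS_PER_YEAR = 365 * 24 * 3600

# The text assumes that 252 bisections are inspected in one millisecond.
rate_per_second = bisection_count(10) / 1e-3
years_for_100_nodes = bisection_count(100) / rate_per_second / SECONDS_PER_YEAR

bells = bell_numbers(50)
print(bisection_count(10))
print(f"{bisection_count(100):.2e} bisections, about {years_for_100_nodes:.1e} years")
print(f"B_50 = {bells[50]:.2e}")
```

The Bell triangle is used here only because it avoids truncating the infinite sum in (9.6); both routes give the same integers.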
Equations (9.5) and (9.6) signal the fundamental challenge of community identification: The number of possible ways we can partition a network into communities grows exponentially or faster with the network size N. Therefore it is impossible to inspect all partitions of a large network (BOX 9.2).

In summary, our notion of communities rests on the expectation that each community corresponds to a locally dense connected subgraph. This hypothesis leaves room for numerous community definitions, from cliques to weak and strong communities. Once we adopt a definition, we could identify communities by inspecting all possible partitions of a network, selecting the one that best satisfies our definition. Yet, the number of partitions grows faster than exponentially with the network size, making such brute-force approaches computationally infeasible. We therefore need algorithms that can identify communities without inspecting all partitions. This is the subject of the next sections.
BOX 9.2 NP-COMPLETENESS
How long does it take to execute an algorithm? The answer is not given in minutes and hours, as the execution time depends on the speed of the computer on which we run the algorithm. We count instead the number of computations the algorithm performs. For example an algorithm that aims to find the largest number in a list of N numbers has to compare each number in the list with the maximum found so far. Consequently its execution time is proportional to N. In general, we call an algorithm polynomial if its execution time follows $N^x$. An algorithm whose execution time is proportional to $N^3$ is slower on any computer than an algorithm whose execution time is N. But this difference dwindles in significance compared to an exponential algorithm, whose execution time increases as $2^N$. For example, if an algorithm whose execution time is proportional to N takes a second for N = 100 elements, then an $N^3$ algorithm takes almost three hours on the same computer. Yet an exponential algorithm ($2^N$) will take $10^{20}$ years to complete.

A problem that an algorithm can solve in polynomial time is called a class P problem. Several computational problems encountered in network science have no known polynomial time algorithms; the available algorithms require exponential running time. Yet, the correctness of a solution can be checked quickly, i.e. in polynomial time. Such problems, called NP-complete, include the traveling salesman problem (Figure 9.8), the graph coloring problem, maximum clique identification, partitioning a graph into subgraphs of specific type, and the vertex cover problem (Box 7.4).

The ramifications of NP-completeness have captured the fascination of the popular media as well. Charlie Eppes, the main character of the CBS TV series Numb3rs, spends the last three months of his mother's life trying to solve an NP-complete problem, convinced that the solution will cure her disease. Similarly the motive for a double homicide in the CBS TV series Elementary is the search for a solution of an NP-complete problem, driven by its enormous value for cryptography.

Figure 9.8 Night at the Movies
Traveling Salesman is a 2012 intellectual thriller about four mathematicians who have solved the P versus NP problem, and are now struggling with the implications of their discovery. The P versus NP problem asks whether every problem whose solution can be verified in polynomial time can also be solved in polynomial time. This is one of the seven Millennium Prize Problems, hence a $1,000,000 prize waits for the first correct solution. The Traveling Salesman refers to a salesman who tries to find the shortest route that visits several cities exactly once, returning at the end to his starting city. While the problem appears simple, it is in fact NP-complete: no polynomial time algorithm is known, so in the worst case we are left trying the possible combinations to find the shortest path.
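The running times quoted in Box 9.2 follow from simple arithmetic. The snippet below assumes, as the box does, that the linear algorithm needs one second for N = 100, so that a single elementary step costs 0.01 s; the variable names are illustrative.

```python
N = 100
seconds_per_step = 1.0 / N            # the linear algorithm performs N steps in one second

cubic_hours = N**3 * seconds_per_step / 3600
exponential_years = 2**N * seconds_per_step / (365 * 24 * 3600)

print(f"N^3 algorithm: {cubic_hours:.2f} hours")
print(f"2^N algorithm: {exponential_years:.1e} years")
```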
SECTION 9.3
HIERARCHICAL CLUSTERING
To uncover the community structure of large real networks we need algorithms whose running time grows polynomially with N. Hierarchical clustering, the topic of this section, helps us achieve this goal. The starting point of hierarchical clustering is a similarity matrix, whose elements xij indicate the distance of node i from node j. In community identification the similarity is extracted from the relative position of nodes i and j within the network. Once we have xij, hierarchical clustering iteratively identifies groups of nodes with high similarity. We can use two different procedures to achieve this: agglomerative algorithms merge nodes with high similarity into the same community, while divisive algorithms isolate communities by removing low similarity links that tend to connect communities. Both procedures generate a hierarchical tree, called a dendrogram, that predicts the possible community partitions. Next we explore the use of agglomerative and divisive algorithms to identify communities in networks.
AGGLOMERATIVE PROCEDURES: THE RAVASZ ALGORITHM
We illustrate the use of agglomerative hierarchical clustering for community detection by discussing the Ravasz algorithm, proposed to identify functional modules in metabolic networks [11]. The algorithm consists of the following steps:

Step 1: Define the Similarity Matrix
In an agglomerative algorithm similarity should be high for node pairs that belong to the same community and low for node pairs that belong to different communities. In a network context nodes that connect to each other and share neighbors likely belong to the same community, hence their xij should be large. The topological overlap matrix (Figure 9.9) [11]

$$x_{ij}^{o} = \frac{J(i,j)}{\min(k_i,k_j)+1-\Theta(A_{ij})} \qquad (9.7)$$

captures this expectation. Here Θ(x) is the Heaviside step function, which is zero for x≤0 and one for x>0; J(i, j) is the number of common neighbors of nodes i and j, to which we add one (+1) if there is a direct link between i and j; min(ki,kj) is the smaller of the degrees ki and kj. Consequently:
• $x_{ij}^{o}=1$ if nodes i and j have a link to each other and have the same neighbors, like A and B in Figure 9.9a.
• $x_{ij}^{o}=0$ if i and j do not have common neighbors, nor do they link to each other, like A and E.
• Members of the same dense local network neighborhood have high topological overlap, like nodes H, I, J, K or E, F, G.

Figure 9.9 The Ravasz Algorithm
The agglomerative hierarchical clustering algorithm proposed by Ravasz was designed to identify functional modules in metabolic networks, but it can be applied to arbitrary networks.
(a) Topological Overlap
A small network illustrating the calculation of the topological overlap $x_{ij}^{o}$. For each node pair i and j we calculate the overlap (9.7). The obtained $x_{ij}^{o}$ for each connected node pair is shown on each link. Note that $x_{ij}^{o}$ can be nonzero for nodes that do not link to each other, but have a common neighbor. For example, $x_{ij}^{o}=1/3$ for C and E.
(b) Topological Overlap Matrix
The topological overlap matrix $x_{ij}^{o}$ for the network shown in (a). The rows and columns of the matrix were reordered after applying average linkage clustering, placing next to each other nodes with the highest topological overlap. The colors denote the degree of topological overlap between each node pair, as calculated in (a). Cutting the dendrogram at the orange line recovers the three modules built into the network. The dendrogram indicates that the EFG and the HIJK modules are closer to each other than they are to the ABC module.
After [11].
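The topological overlap (9.7) translates almost line by line into code. The sketch below assumes an undirected, unweighted network without self-loops, stored as a dictionary that maps each node to the set of its neighbors; the function name and the five-node toy network are hypothetical and are not the network of Figure 9.9.

```python
def topological_overlap(adj, i, j):
    """Topological overlap (9.7) of nodes i and j.

    adj maps each node to the set of its neighbors (undirected, no self-loops).
    """
    a_ij = 1 if j in adj[i] else 0
    common = len(adj[i] & adj[j])        # neighbors shared by i and j
    J = common + a_ij                    # add one if i and j are directly linked
    theta = 1 if a_ij > 0 else 0         # Heaviside step function of A_ij
    return J / (min(len(adj[i]), len(adj[j])) + 1 - theta)

# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = {}
for u, v in links:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

for pair in [("A", "B"), ("C", "E"), ("A", "E")]:
    print(pair, topological_overlap(adj, *pair))
```

In this toy network A and B are linked and share their only other neighbor, C and E are not linked but share the neighbor D, and A and E have nothing in common, which mirrors the three cases listed in the bullets.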
Step 2: Decide Group Similarity
As nodes are merged into small communities, we must measure how similar two communities are. Three approaches, called single, complete and average cluster similarity, are frequently used to calculate the community similarity from the node-similarity matrix xij (Figure 9.10). The Ravasz algorithm uses the average cluster similarity method, defining the similarity of two communities as the average of xij over all node pairs i and j that belong to distinct communities (Figure 9.10d).

Step 3: Apply Hierarchical Clustering
The Ravasz algorithm uses the following procedure to identify the communities:
1. Assign each node to a community of its own and evaluate xij for all node pairs.
2. Find the community pair or the node pair with the highest similarity and merge them into a single community.
3. Calculate the similarity between the new community and all other communities.
4. Repeat Steps 2 and 3 until all nodes form a single community.

Figure 9.10 Cluster Similarity
In agglomerative clustering we need to determine the similarity of two communities from the node similarity matrix xij. We illustrate this procedure for a set of points whose similarity xij is the physical distance rij between them. In networks xij corresponds to some network-based distance measure, like $x_{ij}^{o}$ defined in (9.7).
(a) Similarity Matrix
Seven nodes forming two distinct communities, 1 (A, B, C) and 2 (D, E, F, G). The table shows the distance rij between each node pair, acting as the similarity xij.

xij = rij     D      E      F      G
A           2.75   2.22   3.46   3.08
B           3.38   2.68   3.97   3.40
C           2.31   1.59   2.88   2.34

(b) Single Linkage Clustering
The similarity between communities 1 and 2 is the smallest of all xij, where i and j are in different communities. Hence the similarity is x12=1.59, corresponding to the distance between nodes C and E.
(c) Complete Linkage Clustering
The similarity between two communities is the maximum of xij, where i and j are in distinct communities. Hence x12=3.97.
(d) Average Linkage Clustering
The similarity between two communities is the average of xij over all node pairs i and j that belong to different communities. This is the procedure implemented in the Ravasz algorithm, providing x12=2.84.

Step 4: Dendrogram
The pairwise mergers of Step 3 will eventually pull all nodes into a single community. We can use a dendrogram to extract the underlying community organization. The dendrogram visualizes the order in which the nodes are assigned to specific communities. For example, the dendrogram of Figure 9.9b tells us that the algorithm first merged nodes A with B, K with J and E with F, as each of these pairs has $x_{ij}^{o}=1$. Next node C was added to the (A, B) community, I to (K, J) and G to (E, F). To identify the communities we must cut the dendrogram. Hierarchical clustering does not tell us where that cut should be. Using for example the cut indicated as a dashed line in Figure 9.9b, we recover the three obvious communities (ABC, EFG, and HIJK).

Applied to the E. coli metabolic network (Figure 9.3a), the Ravasz algorithm identifies the nested community structure of bacterial metabolism. To check the biological relevance of these communities, we color-coded the branches of the dendrogram according to the known biochemical classification of each metabolite. As shown in Figure 9.3b, substrates with similar biochemical role tend to be located on the same branch of the tree. In other words the known biochemical classification of these metabolites confirms the biological relevance of the communities extracted from the network topology.

Computational Complexity
How many computations do we need to run the Ravasz algorithm? The algorithm has four steps, each with its own computational complexity:
Step 1: The calculation of the similarity matrix $x_{ij}^{o}$ requires us to compare $N^2$ node pairs, hence the number of computations scales as $N^2$. In other words its computational complexity is $O(N^2)$.

Step 2: Group similarity requires us to determine in each step the distance of the new cluster to all other clusters. Doing this N times requires $O(N^2)$ calculations.

Steps 3 & 4: The construction of the dendrogram can be performed in $O(N\log N)$ steps.

Combining Steps 1-4, we find that the number of required computations scales as $O(N^2) + O(N^2) + O(N\log N)$. As the slowest step scales as $O(N^2)$, the algorithm's computational complexity is $O(N^2)$. Hence hierarchical clustering is much faster than the brute-force approach, which generally scales as $O(e^N)$.

Figure 9.11 Centrality Measures
Divisive algorithms require a centrality measure that is high for node pairs that belong to different communities and is low for node pairs in the same community. Two frequently used measures can achieve this:
(a) Link Betweenness
Link betweenness captures the role of each link in information transfer. Hence xij is proportional to the number of shortest paths between all node pairs that run along the link (i,j). Consequently, inter-community links, like the central link in the figure with xij=0.57, have large betweenness. The calculation of link betweenness scales as $O(LN)$, or $O(N^2)$ for a sparse network [23].
(b) Random-Walk Betweenness
A pair of nodes m and n are chosen at random. A walker starts at m, following each adjacent link with equal probability until it reaches n. Random walk betweenness xij is the probability that the link i→j was crossed by the walker after averaging over all possible choices for the starting nodes m and n. The calculation requires the inversion of an N×N matrix, with $O(N^3)$ computational complexity, and averaging the flows over all node pairs, with $O(LN^2)$. Hence the total computational complexity of random walk betweenness is $O[(L+N)N^2]$, or $O(N^3)$ for a sparse network.
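The cluster similarity measures of Figure 9.10 and the merge loop of the agglomerative procedure can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of [11]: the pairwise values are supplied through a function `x(i, j)`, the loop recomputes every community pair at each step and is therefore slower than the $O(N^2)$ scaling discussed above, and the function names and the four-node similarity values are hypothetical. Note that in Figure 9.10 the entries are distances, while the merge loop follows Step 3 and joins the pair with the highest similarity.

```python
from itertools import combinations, product

def cluster_similarity(x, group1, group2, how="average"):
    """Single, complete or average cluster similarity from pairwise values x(i, j)."""
    values = [x(i, j) for i, j in product(group1, group2)]
    if how == "single":
        return min(values)
    if how == "complete":
        return max(values)
    return sum(values) / len(values)

def agglomerate(nodes, x):
    """Repeatedly merge the two communities with the highest average similarity.

    Returns the merge history as (community, community, similarity) tuples,
    which is the information a dendrogram displays.
    """
    communities = [frozenset([n]) for n in nodes]
    history = []
    while len(communities) > 1:
        a, b = max(combinations(communities, 2),
                   key=lambda pair: cluster_similarity(x, *pair))
        history.append((a, b, cluster_similarity(x, a, b)))
        communities = [c for c in communities if c not in (a, b)] + [a | b]
    return history

# Distances r_ij of Figure 9.10 between community 1 (A, B, C) and community 2 (D, E, F, G).
r = {"A": {"D": 2.75, "E": 2.22, "F": 3.46, "G": 3.08},
     "B": {"D": 3.38, "E": 2.68, "F": 3.97, "G": 3.40},
     "C": {"D": 2.31, "E": 1.59, "F": 2.88, "G": 2.34}}
for how in ("single", "complete", "average"):
    print(how, round(cluster_similarity(lambda i, j: r[i][j], "ABC", "DEFG", how), 2))

# Hypothetical similarity values for a four-node example of the merge loop.
sim = {frozenset("AB"): 1.0, frozenset("CD"): 0.9, frozenset("AC"): 0.1,
       frozenset("AD"): 0.1, frozenset("BC"): 0.2, frozenset("BD"): 0.2}
history = agglomerate("ABCD", lambda i, j: sim[frozenset((i, j))])
for a, b, s in history:
    print(sorted(a), sorted(b), round(s, 2))
```

Applied to the table of Figure 9.10, the three measures reproduce the values 1.59, 3.97 and 2.84 given in the caption.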
DIVISIVE PROCEDURES: THE GIRVAN-NEWMAN ALGORITHM
Divisive procedures systematically remove the links connecting nodes that belong to different communities, eventually breaking a network into isolated communities. We illustrate their use by introducing an algorithm proposed by Michelle Girvan and Mark Newman [9,23], consisting of the following steps:

Step 1: Define Centrality
While in agglomerative algorithms xij selects node pairs that belong to the same community, in divisive algorithms xij, called centrality, selects node pairs that are in different communities. Hence we want xij to be high if nodes i and j belong to different communities and low if they are in the same community. Two centrality measures that satisfy this expectation are discussed in Figure 9.11. The faster of the two is link betweenness, defining xij as the number of shortest paths that go through the link (i, j). Links connecting different communities are expected to have large xij while links within a community have small xij.
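Link betweenness, and the remove-and-recompute loop that a divisive algorithm builds on it, can be sketched as follows. This is a minimal illustration, assuming an undirected, unweighted network without self-loops stored as a dictionary of neighbor sets. Betweenness is accumulated with a breadth-first search in the style of Brandes' algorithm, which is a choice made here and not necessarily the implementation used in [9,23]. The function names and the six-node toy network are hypothetical, and ties are resolved by taking the first maximum instead of a random one.

```python
from collections import deque

def link_betweenness(adj):
    """Number of shortest paths running along each link (ties share paths equally)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                                  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:            # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):                     # accumulate from the farthest nodes
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += share
                delta[v] += share
    return {link: b / 2.0 for link, b in bet.items()}  # each node pair was counted twice

def components(adj):
    """Connected components as a list of frozensets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps

def girvan_newman_levels(adj):
    """Remove the highest-betweenness link until no links remain; record each split."""
    adj = {v: set(nb) for v, nb in adj.items()}       # work on a copy
    levels = [components(adj)]
    while any(adj.values()):
        bet = link_betweenness(adj)                   # recomputed after every removal
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
        comps = components(adj)
        if len(comps) > len(levels[-1]):
            levels.append(comps)
    return levels

# Hypothetical toy network: two triangles joined by the bridge C-D.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
adj = {}
for a, b in links:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

bet = link_betweenness(adj)
levels = girvan_newman_levels(adj)
print(bet[frozenset("CD")], [sorted(c) for c in levels[1]])
```

In the toy network every shortest path between the two triangles crosses the bridge, so the bridge has the largest betweenness and is removed first, splitting the network into the two triangles.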
Step 2: Hierarchical Clustering
The final steps of a divisive algorithm mirror those we used in agglomerative clustering (Figure 9.12):
1. Compute the centrality xij of each link.
2. Remove the link with the largest centrality. In case of a tie, choose one link randomly.
3. Recalculate the centrality of each link for the altered network.
4. Repeat steps 2 and 3 until all links are removed.

Girvan and Newman applied their algorithm to Zachary's Karate Club (Figure 9.2a), finding that the predicted communities matched almost perfectly the two groups after the breakup. Only node 3 was classified incorrectly.

Figure 9.12 The Girvan-Newman Algorithm
(a) The divisive hierarchical algorithm of Girvan and Newman uses link betweenness (Figure 9.11a) as centrality. In the figure the link weights, assigned proportionally to xij, indicate that links connecting different communities have the highest xij. Indeed, each shortest path between these communities must run through them.
(b)-(d) The sequence of images illustrates how the algorithm removes one-by-one the three highest xij links, leaving three isolated communities behind. Note that betweenness needs to be recalculated after each link removal.
(e) The dendrogram generated by the Girvan-Newman algorithm. The cut at level 3, shown as an orange dotted line, reproduces the three communities present in the network.
(f) The modularity function, M, introduced in SECTION 9.4, helps us select the optimal cut. Its maximum agrees with our expectation that the best cut is at level 3, as shown in (e).

Computational Complexity
The rate limiting step of divisive algorithms is the calculation of centrality. Consequently the algorithm's computational complexity depends on which centrality measure we use. The most efficient is link betweenness, with $O(LN)$ [24,25,26] (Figure 9.11a). Step 3 of the algorithm introduces an additional factor L in the running time, hence the algorithm scales as $O(L^2N)$, or $O(N^3)$ for a sparse network.

HIERARCHY IN REAL NETWORKS
Hierarchical clustering raises two fundamental questions:

Nested Communities
First, it assumes that small modules are nested into larger ones. These nested communities are well captured by the dendrogram (Figures 9.9b and 9.12e). How do we know, however, if such hierarchy is indeed present in a network? Could this hierarchy be imposed by our algorithms, whether or not the underlying network has a nested community structure?

Communities and the Scale-Free Property
Second, the density hypothesis states that a network can be partitioned into a collection of subgraphs that are only weakly linked to other subgraphs. How can we have isolated communities in a scale-free network, if the hubs inevitably link multiple communities?
The hierarchical network model, whose construction is shown in Figure 9.13, resolves the conflict between communities and the scale-free property and offers intuition about the structure of nested hierarchical communities. The obtained network has several key characteristics:

Scale-free Property
The hierarchical model generates a scale-free network with degree exponent (Figure 9.14a, ADVANCED TOPICS 9.A)

$$\gamma = 1 + \frac{\ln 5}{\ln 4} = 2.161.$$

Size Independent Clustering Coefficient
While for the Erdős-Rényi and the Barabási-Albert models the clustering coefficient decreases with N (SECTION 5.9), for the hierarchical network we have C=0.743 independent of the network size (Figure 9.14c). Such N-independent clustering coefficient has been observed in metabolic networks [11].

Hierarchical Modularity
The model consists of numerous small communities that form larger communities, which again combine into ever larger communities. The quantitative signature of this nested hierarchical modularity is the dependence of a node's clustering coefficient on the node's degree [11,27,28]

$$C(k) \sim k^{-1}. \qquad (9.8)$$

In other words, the higher a node's degree, the smaller is its clustering coefficient.

Equation (9.8) captures the way the communities are organized in a network. Indeed, small degree nodes have high C because they reside in dense communities. High degree nodes have small C because they connect to different communities. For example, in Figure 9.13c the nodes at the center of the five-node modules have k=4 and clustering coefficient C=1. Those at the center of a 25-node module have k=20 and C=3/19. Those at the center of the 125-node modules have k=84 and C=3/83. Hence the higher the degree of a node, the smaller is its C.

The hierarchical network model suggests that inspecting C(k) allows us to decide if a network is hierarchical. For the Erdős-Rényi and the Barabási-Albert models C(k) is independent of k, indicating that they do not display hierarchical modularity. To see if hierarchical modularity is present in real systems, we calculated C(k) for ten reference networks, finding that (Figure 9.36):
• Only the power grid lacks hierarchical modularity, its C(k) being independent of k (Figure 9.36a).
• For the remaining nine networks C(k) decreases with k. Hence in these networks small-degree nodes are part of small dense communities, while hubs link disparate communities to each other.
• For the scientific collaboration, metabolic, and citation networks C(k) follows (9.8) in the high-k region. The form of C(k) for the Internet, mobile, email, protein interactions, and the WWW needs to be derived individually, as for those C(k) does not follow (9.8). More detailed network models predict $C(k) \sim k^{-\beta}$, where β is between 0 and 2 [27,28].

Figure 9.13 Hierarchical Network
The iterative construction of a deterministic hierarchical network.
(a) Start from a fully connected module of five nodes. Note that the diagonal nodes are also connected, but the links are not visible.
(b) Create four identical replicas of the starting module and connect the peripheral nodes of each module to the central node of the original module. This way we obtain a network with N=25 nodes.
(c) Create four replicas of the 25-node module and connect the peripheral nodes again to the central node of the original module, obtaining an N=125-node network. This process is continued indefinitely.
After [27].

Figure 9.14 Scaling in Hierarchical Networks
Three quantities characterize the hierarchical network shown in Figure 9.13:
(a) Degree Distribution
The scale-free nature of the generated network is illustrated by the scaling of pk with slope γ = 1 + ln 5/ln 4, shown as a dashed line. See ADVANCED TOPICS 9.A for the derivation of the degree exponent.
(b) Hierarchical Clustering
C(k) follows (9.8), shown as a dashed line. The circles show C(k) for a randomly wired scale-free network, obtained from the original model by degree-preserving randomization. The lack of scaling indicates that the hierarchical architecture is lost under rewiring. Hence C(k) captures a property that goes beyond the degree distribution.
(c) Size Independent Clustering Coefficient
The dependence of the clustering coefficient C on the network size N. For the hierarchical model C is independent of N (filled symbols), while for the Barabási-Albert model C(N) decreases (empty symbols).
After [27].

Figure 9.15 Ambiguity in Hierarchical Clustering
Hierarchical clustering does not tell us where to cut a dendrogram. Indeed, depending on where we make the cut in the dendrogram of Figure 9.9b, we obtain (b) two, (c) three or (d) four communities. While for a small network we can visually decide which cut captures best the underlying community structure, it is impossible to do so in larger networks. In the next section we discuss modularity, which helps us select the optimal cut.

In summary, in principle hierarchical clustering does not require preliminary knowledge about the number and the size of communities. In practice it generates a dendrogram that offers a family of community partitions characterizing the studied network. This dendrogram does not tell us which partition captures best the underlying community structure. Indeed, any cut of the hierarchical tree offers a potentially valid partition (Figure 9.15). This is at odds with our expectation that in each network there is a ground truth, corresponding to a unique community structure.
While there are multiple notions of hierarchy in networks [29,30], inspecting C(k) helps decide if the underlying network has hierarchical modularity. We find that C(k) decreases in most real networks, indicating that most real systems display hierarchical modularity. At the same time C(k) is independent of k for the Erdős-Rényi or Barabási-Albert models, indicating that these canonical models lack a hierarchical organization.
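The clustering coefficients quoted for the hubs of the hierarchical model can be checked numerically. The sketch below is one possible encoding of the construction of Figure 9.13, and the encoding itself is an assumption: each node is labeled by a tuple of base-five digits, with digit 0 marking the center at each level, so that a node whose last m digits are all nonzero is a peripheral node of an m-level replica. The function names are hypothetical.

```python
from itertools import combinations, product
from math import log

def hierarchical_network(levels):
    """Deterministic hierarchical network with 5**levels nodes, as neighbor sets."""
    nodes = list(product(range(5), repeat=levels))
    adj = {v: set() for v in nodes}

    def link(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for v in nodes:
        # Five-node modules: nodes that differ only in the last digit are fully connected.
        for d in range(v[-1] + 1, 5):
            link(v, v[:-1] + (d,))
        # Peripheral nodes of an m-level replica connect to the center of that block.
        for m in range(2, levels + 1):
            if all(v[-m:]):
                link(v, v[:-m] + (0,) * m)
    return adj

def local_clustering(adj, v):
    """Fraction of neighbor pairs of v that are linked to each other."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * linked / (k * (k - 1))

net = hierarchical_network(3)                 # the N = 125 network of Figure 9.13c
for hub in [(0, 0, 0), (1, 0, 0)]:
    print(hub, len(net[hub]), local_clustering(net, hub))

gamma = 1 + log(5) / log(4)
print(round(gamma, 3))
```

With this encoding the central hub of the 125-node network has k=84 and C=3/83, and the center of a 25-node replica has k=20 and C=3/19, in line with the values in the text.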
SECTION 9.4
MODULARITY
In a randomly wired network the connection pattern between the nodes is expected to be uniform, independent of the network's degree distribution. Consequently these networks are not expected to display systematic local density fluctuations that we could interpret as communities. This expectation inspired the third hypothesis of community organization:

H3: Random Hypothesis
Randomly wired networks lack an inherent community structure.

This hypothesis has some actionable consequences: By comparing the link density of a community with the link density obtained for the same group of nodes for a randomly rewired network, we could decide if the original community corresponds to a dense subgraph, or its connectivity pattern emerged by chance.

In this section we show that systematic deviations from a random configuration allow us to define a quantity called modularity, which measures the quality of each partition. Hence modularity allows us to decide if a particular community partition is better than some other one. Finally, modularity optimization offers a novel approach to community detection.
MODULARITY
Consider a network with N nodes and L links and a partition into nc communities, each community having Nc nodes connected to each other by Lc links, where c = 1,...,nc. If Lc is larger than the expected number of links between the Nc nodes given the network's degree sequence, then the nodes of the subgraph Cc could indeed be part of a true community, as expected based on the Density Hypothesis H2 (Figure 9.2). We therefore measure the difference between the network's real wiring diagram (Aij) and the expected number of links between i and j if the network is randomly wired (pij),

$$M_c = \frac{1}{2L}\sum_{(i,j)\in C_c}\left(A_{ij} - p_{ij}\right). \qquad (9.9)$$

Here pij can be determined by randomizing the original network, while keeping the expected degree of each node unchanged. Using the degree-preserving null model (7.1) we have

$$p_{ij} = \frac{k_i k_j}{2L}. \qquad (9.10)$$

If Mc is positive, then the subgraph Cc has more links than expected by chance, hence it represents a potential community. If Mc is zero then the connectivity between the Nc nodes is random, fully explained by the degree distribution. Finally, if Mc is negative, then the nodes of Cc do not form a community.

Using (9.10) we can derive a simpler form for the modularity (9.9) (ADVANCED TOPICS 9.B)

$$M_c = \frac{L_c}{L} - \left(\frac{k_c}{2L}\right)^2, \qquad (9.11)$$

where Lc is the total number of links within the community Cc and kc is the total degree of the nodes in this community.

To generalize these ideas to a full network consider the complete partition that breaks the network into nc communities. To see if the local link density of the subgraphs defined by this partition differs from the expected density in a randomly wired network, we define the partition's modularity by summing (9.11) over all nc communities [23]

$$M = \sum_{c=1}^{n_c}\left[\frac{L_c}{L} - \left(\frac{k_c}{2L}\right)^2\right]. \qquad (9.12)$$

Modularity has several key properties:

• Higher Modularity Implies Better Partition
The higher is M for a partition, the better is the corresponding community structure. Indeed, in Figure 9.16a the partition with the maximum modularity (M=0.41) accurately captures the two obvious communities. A partition with a lower modularity clearly deviates from these communities (Figure 9.16b). Note that the modularity of a partition cannot exceed one [31,32].

• Zero and Negative Modularity
By taking the whole network as a single community we obtain M=0, as in this case the two terms in the parenthesis of (9.12) are equal (Figure 9.16c). If each node belongs to a separate community, we have Lc=0 and the sum (9.12) has nc negative terms, hence M is negative (Figure 9.16d).

Figure 9.16 Modularity
To better understand the meaning of modularity, we show M defined in (9.12) for several partitions of a network with two obvious communities.
(a) Optimal Partition
The partition with maximal modularity M=0.41 closely matches the two distinct communities.
(b) Suboptimal Partition
A partition with a suboptimal but positive modularity, M=0.22, fails to correctly identify the communities present in the network.
(c) Single Community
If we assign all nodes to the same community we obtain M=0, independent of the network structure.
(d) Negative Modularity
If we assign each node to a different community, modularity is negative, obtaining M=−0.12.

We can use modularity to decide which of the many partitions predicted by a hierarchical method offers the best community structure, selecting the one for which M is maximal. This is illustrated in Figure 9.12f, which shows M for each cut of the dendrogram, finding a clear maximum when the network breaks into three communities.
THE GREEDY ALGORITHM
The expectation that partitions with higher modularity correspond to partitions that more accurately capture the underlying community structure prompts us to formulate our final hypothesis:

H4: Maximal Modularity Hypothesis
For a given network the partition with maximum modularity corresponds to the optimal community structure.

The hypothesis is supported by the inspection of small networks, for which the maximum M agrees with the expected communities (Figures 9.12 and 9.16).

The maximum modularity hypothesis is the starting point of several community detection algorithms, each seeking the partition with the largest modularity. In principle we could identify the best partition by checking M for all possible partitions, selecting the one for which M is largest. Given, however, the exceptionally large number of partitions, this brute-force approach is computationally not feasible. Next we discuss an algorithm that finds partitions with close to maximal M, while bypassing the need to inspect all partitions.

Greedy Algorithm
The first modularity maximization algorithm, proposed by Newman [33], iteratively joins pairs of communities if the move increases the partition's modularity. The algorithm follows these steps:
1. Assign each node to a community of its own, starting with N communities of single nodes.
2. Inspect each community pair connected by at least one link and compute the modularity difference ∆M obtained if we merge them. Identify the community pair for which ∆M is the largest and merge them. Note that modularity is always calculated for the full network.
3. Repeat Step 2 until all nodes merge into a single community, recording M for each step.
4. Select the partition for which M is maximal.

To illustrate the predictive power of the greedy algorithm consider the collaboration network between physicists, consisting of N=56,276 scientists in all branches of physics who posted papers on arxiv.org (Figure 9.17).
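The four steps of the greedy algorithm can be sketched in a few lines of Python. This is a naive illustration, not Newman's implementation: it recomputes M from scratch for every candidate merger instead of updating ∆M in constant time, so it is only suitable for small networks. The function names are hypothetical, and node labels are assumed to be sortable (for example integers).

```python
from collections import defaultdict


def modularity(edges, community_of):
    """M of Eq. (9.12); community_of maps node -> community label."""
    L = len(edges)
    inside, degree = defaultdict(int), defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / L - (degree[c] / (2 * L)) ** 2 for c in degree)


def greedy_modularity(edges):
    """Return (best partition, its modularity) found by greedy merging."""
    edges = list(edges)
    nodes = sorted({n for e in edges for n in e})
    community_of = {n: n for n in nodes}            # Step 1: one node per community
    best_M = modularity(edges, community_of)
    best_partition = dict(community_of)

    def merged(pair):
        # Partition obtained by relabeling community b as community a.
        a, b = pair
        return {n: (a if c == b else c) for n, c in community_of.items()}

    while len(set(community_of.values())) > 1:
        # Step 2: only community pairs connected by at least one link.
        linked = {tuple(sorted((community_of[u], community_of[v])))
                  for u, v in edges if community_of[u] != community_of[v]}
        if not linked:                               # disconnected network
            break
        # The largest resulting M corresponds to the largest ∆M.
        pair = max(sorted(linked), key=lambda p: modularity(edges, merged(p)))
        community_of = merged(pair)
        M = modularity(edges, community_of)          # Step 3: record M
        if M > best_M:                               # Step 4: keep the maximum
            best_M, best_partition = M, dict(community_of)
    return best_partition, best_M


# Toy network (an assumption for illustration): two triangles joined by one link.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
partition, M = greedy_modularity(edges)
```

On this toy network the mergers recover the two triangles as communities before the final merger into a single community lowers M back to zero.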
The greedy algorithm predicts about 600 communities with peak modularity M = 0.713. Four of these communities are very large, together containing 77% of all nodes (Figure 9.17a). In the largest community 93% of the authors publish in condensed matter physics while 87% of the authors in the second largest community publish in high energy physics, indicating that each community contains physicists of similar professional interests. The accuracy of the greedy algorithm is also illustrated in Figure 9.2a, showing that the community structure with the highest M for the Zachary Karate Club accurately captures the club's subsequent split.

Figure 9.17 The Greedy Algorithm
(a) Clustering Physicists The community structure of the collaboration network of physicists. The greedy algorithm predicts four large communities, each composed primarily of physicists of similar interest. To see this, on each cluster we show the percentage of members who belong to the same subfield of physics. Specialties are determined by the subsection(s) of the e-print archive in which individuals post papers. C.M. indicates condensed matter, H.E.P. high-energy physics, and astro astrophysics. These four large communities coexist with 600 smaller communities, resulting in an overall modularity M=0.713.
(b) Identifying Subcommunities We can identify subcommunities by applying the greedy algorithm to each community, treating them as separate networks. This procedure splits the condensed matter community into many smaller subcommunities, increasing the modularity of the partition to M=0.807.
(c) Research Groups One of these smaller communities is further partitioned, revealing individual researchers and the research groups they belong to.
After [33].

Computational Complexity
Since the calculation of each ∆M can be done in constant time, Step 2 of the greedy algorithm requires O(L) computations. After deciding which communities to merge, the update of the matrix can be done in a worst-case time O(N). Since the algorithm requires N–1 community mergers, its complexity is O[(L + N)N], or O(N²) on a sparse graph. Optimized implementations reduce the algorithm's complexity to O(N log²N) (ONLINE RESOURCE 9.1).

LIMITS OF MODULARITY
Given the important role modularity plays in community identification, we must be aware of some of its limitations.

Resolution Limit
Modularity maximization forces small communities into larger ones [34]. Indeed, if we merge communities A and B into a single community, the network's modularity changes by (ADVANCED TOPICS 9.B)

$$\Delta M_{AB} = \frac{l_{AB}}{L} - \frac{k_A k_B}{2L^2}, \qquad (9.13)$$

where lAB is the number of links that connect the nodes in community A with total degree kA to the nodes in community B with total degree kB. If A and B are distinct communities, they should remain distinct when M is maximized. As we show next, this is not always the case.

Consider the case when kAkB/(2L) < 1, in which case (9.13) predicts ∆MAB > 0 if there is at least one link between the two communities (lAB ≥ 1). Hence we must merge A and B to maximize modularity. Assuming for simplicity that kA ~ kB = k, if the total degree of the communities satisfies

$$k \leq \sqrt{2L} \qquad (9.14)$$

then modularity increases by merging A and B into a single community, even if A and B are otherwise distinct communities. This is an artifact of modularity maximization: if kA and kB are under the threshold (9.14), the expected number of links between them is smaller than one. Hence even a single link between them will force the two communities together when we maximize M. This resolution limit has several consequences:
• Modularity maximization cannot detect communities that are smaller than the resolution limit (9.14). For example, for the WWW sample with L=1,497,134 (Table 2.1) modularity maximization will have difficulties resolving communities with total degree kC ≲ 1,730.
• Real networks contain numerous small communities [36-38]. Given the resolution limit (9.14), these small communities are systematically forced into larger communities, offering a misleading characterization of the underlying community structure.

To avoid the resolution limit we can further subdivide the large communities obtained by modularity optimization [33,34,39]. For example, treating the smaller of the two condensed-matter groups of Figure 9.17a as a separate network and feeding it again into the greedy algorithm, we obtain about 100 smaller communities with an increased modularity M = 0.807 (Figure 9.17b) [33].

Online Resource 9.1
Modularity-based Algorithms
There are several widely used community finding algorithms that maximize modularity.
Optimized Greedy Algorithm: The use of data structures for sparse matrices can decrease the greedy algorithm's computational complexity to O(N log²N) [35]. See http://cs.unm.edu/~aaron/research/fastmodularity.htm for the code.
Louvain Algorithm: The modularity optimization algorithm achieves a computational complexity of O(L) [2]. Hence it allows us to identify communities in networks with millions of nodes, as illustrated in Figure 9.1. The algorithm is described in ADVANCED TOPICS 9.C. See https://sites.google.com/site/findcommunities/ for the code.

Modularity Maxima
All algorithms based on maximal modularity rely on the assumption that a network with a clear community structure has an optimal partition with a maximal M [40]. In practice we hope that Mmax is an easy to find maximum and that the communities predicted by all other partitions are distinguishable from those corresponding to Mmax. Yet, as we show next, this optimal partition is difficult to identify among a large number of close to optimal partitions.

Consider a network composed of nc subgraphs with comparable link densities kC ≈ 2L/nc. The best partition should correspond to the one where each cluster is a separate community (Figure 9.18a), in which case M=0.867. Yet, if we merge the neighboring cluster pairs into a single community we obtain a higher modularity M=0.871 (Figure 9.18b). In general (9.13) and (9.14) predict that if we merge a pair of clusters, we change modularity by

$$\Delta M = \frac{l_{AB}}{L} - \frac{2}{n_c^2}. \qquad (9.15)$$

In other words the drop in modularity is less than ∆M = −2/nc². For a network with nc = 20 communities, this change is at most ∆M = −0.005, tiny compared to the maximal modularity M≃0.87 (Figure 9.18b). As the number of groups increases, ∆M goes to zero, hence it becomes increasingly difficult to distinguish the optimal partition from the numerous suboptimal alternatives whose modularity is practically indistinguishable from Mmax. In other words, the modularity function is not peaked around a single optimal partition, but has a high modularity plateau (Figure 9.18d).

Figure 9.18 Modularity Maxima
A ring network consisting of 24 cliques, each made of 5 nodes.
(a) The Intuitive Partition The best partition should correspond to the configuration where each cluster is a separate community. This partition has M=0.867.
(b) The Optimal Partition If we combine the clusters into pairs, as illustrated by the node colors, we obtain M=0.871, higher than M obtained for the intuitive partition (a).
(c) Random Partition Partitions with comparable modularity tend to have rather distinct community structure. For example, if we assign each cluster randomly to communities, even clusters that have no links to each other, like the five highlighted clusters, may end up in the same community. The modularity of this random partition is still high, M=0.80, not too far from the optimal M=0.87.
(d) Modularity Plateau The modularity function of the network (a) reconstructed from 997 partitions. The vertical axis gives the modularity M, revealing a high-modularity plateau that consists of numerous low-modularity partitions. We lack, therefore, a clear modularity maximum; instead the modularity function is highly degenerate.
After [40].

In summary, modularity offers a first principle understanding of a network's community structure. Indeed, (9.12) incorporates in a compact form a number of essential questions, like what we mean by a community, how we choose the appropriate null model, and how we measure the goodness of a particular partition. Consequently modularity optimization plays a central role in the community finding literature.

At the same time, modularity has several well-known limitations: First, it forces together small weakly connected communities. Second, networks lack a clear modularity maximum, developing instead a modularity plateau containing many partitions with hard to distinguish modularity. This plateau explains why numerous modularity maximization algorithms can rapidly identify a high M partition: They identify one of the numerous partitions with close to optimal M. Finally, analytical calculations and numerical simulations indicate that even random networks contain high modularity partitions, at odds with the random hypothesis H3 that motivated the concept of modularity [41-43].

Modularity optimization is a special case of a larger problem: Finding communities by optimizing some quality function Q. The greedy algorithm and the Louvain algorithm described in ADVANCED TOPICS 9.C assume that Q = M, seeking partitions with maximal modularity. In ADVANCED TOPICS 9.C we also describe the Infomap algorithm, which finds communities by minimizing the map equation L, an entropy-based measure of the partition quality [44-46].
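The two numerical claims of this section, the resolution threshold for the WWW sample and the modularity values of the ring of cliques in Figure 9.18, can be checked with a short script. This is a sketch under stated assumptions: the helper names `ring_of_cliques` and `modularity` are illustrative, and neighboring cliques are assumed to be joined by a single link, as in the figure.

```python
from itertools import combinations
from math import sqrt


def ring_of_cliques(n_cliques, clique_size):
    """Link list of cliques placed on a ring; neighbors are joined by one link."""
    edges = []
    for c in range(n_cliques):
        members = [(c, i) for i in range(clique_size)]
        edges.extend(combinations(members, 2))             # links inside clique c
        edges.append(((c, 0), ((c + 1) % n_cliques, 1)))   # link to the next clique
    return edges


def modularity(edges, community_of):
    """M of Eq. (9.12); community_of is a function node -> community label."""
    L = len(edges)
    inside, degree = {}, {}
    for u, v in edges:
        cu, cv = community_of(u), community_of(v)
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / L - (degree[c] / (2 * L)) ** 2 for c in degree)


edges = ring_of_cliques(24, 5)                              # L = 264 links
M_intuitive = modularity(edges, lambda node: node[0])       # one community per clique
M_pairs = modularity(edges, lambda node: node[0] // 2)      # neighboring cliques paired
resolution_k = sqrt(2 * 1_497_134)                          # threshold (9.14), WWW sample

print(round(M_intuitive, 3), round(M_pairs, 3), int(resolution_k))
# prints: 0.867 0.871 1730
```

The paired partition scores slightly higher than the intuitive one even though it merges cliques that are clearly distinct, which is the resolution limit at work.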
SECTION 9.5
OVERLAPPING COMMUNITIES
A node is rarely confined to a single community. Consider a scientist, who belongs to the community of scientists that share his professional interests. Yet, he also belongs to a community consisting of family members and relatives and perhaps another community of individuals sharing his hobby (Figure 9.19). Each of these communities consists of individuals who are members of several other communities, resulting in a complicated web of nested and overlapping communities [36]. Overlapping communities are not limited to social systems: The same genes are often implicated in multiple diseases, an indication that disease modules of different disorders overlap [14].
While the existence of a nested community structure has long been appreciated by sociologists [47] and by the engineering community interested in graph partitioning, the algorithms discussed so far force each node into a single community. A turning point was the work of Tamás Vicsek and collaborators [36,48], who proposed an algorithm to identify overlapping communities, bringing the problem to the attention of the network science community. In this section we discuss two algorithms to detect overlapping communities, clique percolation and link clustering.

Figure 9.19 Overlapping Communities
Schematic representation of the communities surrounding Tamás Vicsek, who introduced the concept of overlapping communities. A zoom into the scientific community illustrates the nested and overlapping structure of the community characterizing his scientific interests. After [36].
CLIQUE PERCOLATION
The clique percolation algorithm, often called CFinder, views a community as the union of overlapping cliques [36]:
• Two k-cliques are considered adjacent if they share k – 1 nodes (Figure 9.20b).
• A k-clique community is the largest connected subgraph obtained by the union of all adjacent k-cliques (Figure 9.20c).
• k-cliques that cannot be reached from a particular k-clique belong to other k-clique communities (Figure 9.20c,d).

Figure 9.20 The Clique Percolation Algorithm (CFinder)
To identify k=3 clique communities we roll a triangle across the network, such that each subsequent triangle shares one link (two nodes) with the previous triangle.
(a)-(b) Rolling Cliques Starting from the triangle shown in green in (a), (b) illustrates the second step of the algorithm.
(c) Clique Communities for k=3 The algorithm pauses when the final triangle of the green community is added. As no more triangles share a link with the green triangles, the green community has been completed. Note that there can be multiple k-clique communities in the same network. We illustrate this by showing a second community in blue. The figure highlights the moment when we add the last triangle of the blue community. The blue and green communities overlap, sharing the orange node.
(d) Clique Communities for k=4 The k=4 community structure of a small network, consisting of complete four-node subgraphs that share at least three nodes. Orange nodes belong to multiple communities.
Images courtesy of Gergely Palla.

Online Resource 9.2 CFinder
The CFinder software, allowing us to identify overlapping communities, can be downloaded from www.cfinder.org.

The CFinder algorithm identifies all cliques and then builds an Nclique × Nclique clique-clique overlap matrix O, where Nclique is the number of cliques and Oij is the number of nodes shared by cliques i and j (Figure 9.39). A typical output of the CFinder algorithm is shown in Figure 9.21, displaying the community structure of the word bright. In the network two words are linked to each other if they have a related meaning. We can easily check that the overlapping communities identified by the algorithm are meaningful: The word bright simultaneously belongs to a community containing light-related words, like glow or dark; to a community capturing colors (yellow, brown); to a community consisting of astronomical terms (sun, ray); and to a community linked to intelligence (gifted, brilliant). The example also illustrates the difficulty the earlier algorithms would have in identifying communities of this network: they would force bright into one of the four communities and remove it from the other three. Hence communities would be stripped of a key member, leading to outcomes that are difficult to interpret.

Figure 9.21 Overlapping Communities
Communities containing the word bright in the South Florida Free Association network, whose nodes are words, connected by a link if their meaning is related. The community structure identified by the CFinder algorithm accurately describes the multiple meanings of bright, a word that can be used to refer to light, color, astronomical terms, or intelligence. After [36].

Could the communities identified by CFinder emerge by chance? To distinguish the real k-clique communities from communities that are a pure consequence of high link density we explore the percolation properties of k-cliques in a random network [48]. As we discussed in CHAPTER 3, if a random network is sufficiently dense, it has numerous cliques of varying order. A large k-clique community emerges in a random network only if the connection probability p exceeds the threshold (ADVANCED TOPICS 9.D)

$$p_c(k) = \frac{1}{[(k-1)N]^{1/(k-1)}}. \qquad (9.16)$$

Below pc(k) we expect only a few isolated k-cliques (Figure 9.22a). Once p exceeds pc(k), we observe numerous cliques that form k-clique communities (Figure 9.22b). In other words, each k-clique community has its own threshold:
• For k = 2 the k-cliques are links and (9.16) reduces to pc(k)~1/N, which is the condition for the emergence of a giant connected component in Erdős–Rényi networks.
• For k = 3 the cliques are triangles (Figure 9.22a,b) and (9.16) predicts pc(k)~1/√(2N).

In other words, k-clique communities naturally emerge in sufficiently dense networks. Consequently, to interpret the overlapping community structure of a network, we must compare it to the community structure obtained for the degree-randomized version of the original network.
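The clique adjacency rule and the threshold (9.16) can be illustrated with a small script. This is a simplified sketch, not the CFinder implementation: it enumerates all k-cliques by brute force rather than building the overlap matrix of maximal cliques, so it is only practical for tiny networks. The function names are illustrative, and node labels are assumed to be sortable.

```python
from itertools import combinations


def percolation_threshold(k, N):
    """Eq. (9.16): p_c(k) = [(k - 1) N]^(-1/(k - 1))."""
    return ((k - 1) * N) ** (-1.0 / (k - 1))


def k_clique_communities(edges, k):
    """Return the k-clique communities of a small network as node sets."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Brute-force enumeration of all k-cliques.
    cliques = [frozenset(c) for c in combinations(sorted(neighbors), k)
               if all(b in neighbors[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: two k-cliques are adjacent if they share k - 1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())


# Toy network (an assumption for illustration): triangles 0-1-2 and 1-2-3
# share a link, while triangle 3-4-5 shares only node 3 with them.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
communities = sorted(sorted(c) for c in k_clique_communities(edges, 3))
# communities == [[0, 1, 2, 3], [3, 4, 5]]; node 3 belongs to both

p_links = percolation_threshold(2, 20)       # 0.05
p_triangles = percolation_threshold(3, 20)   # about 0.158
```

The two threshold values agree with those quoted for the N=20 random networks of Figure 9.22.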
Computational Complexity
Finding cliques in a network requires algorithms whose running time grows exponentially with N. Yet, the CFinder community definition is based on cliques instead of maximal cliques, which can be identified in polynomial time [49]. If, however, there are large cliques in the network, it is more efficient to identify all cliques using an algorithm with O(e^N) complexity [36]. Despite this high computational complexity, the algorithm is relatively fast, processing the mobile call network of 4 million mobile phone users in less than one day [50] (see also Figure 9.28).
Figure 9.22 The Clique Percolation Algorithm (CFinder)
Random networks built with probabilities p=0.13 (a) and p=0.22 (b). As both p's are larger than the link percolation threshold (pc=1/N=0.05 for N=20), in both cases most nodes belong to a giant component.
(a) Subcritical Communities The 3-clique (triangle) percolation threshold is pc(3)=0.16 according to (9.16), hence at p=0.13 we are below it. Therefore, only two small 3-clique percolation clusters are observed, which do not connect to each other.
(b) Supercritical Communities For p=0.22 we are above pc(3), hence we observe multiple 3-cliques that form a giant 3-clique percolation cluster (purple). This network also has a second overlapping 3-clique community, shown in green.
After [48].

LINK CLUSTERING
While nodes often belong to multiple communities, links tend to be community specific, capturing the precise relationship that defines a node's membership in a community. For example, a link between two individuals may indicate that they are in the same family, or that they work together, or that they share a hobby, designations that only rarely overlap. Similarly, in biology each binding interaction of a protein is responsible for a different function, uniquely defining the role of the protein in the cell. This specificity of links has inspired the development of community finding algorithms that cluster links rather than nodes [51,52].

The link clustering algorithm proposed by Ahn, Bagrow and Lehmann [51] consists of the following steps:

Step 1: Define Link Similarity
The similarity of a link pair is determined by the neighborhood of the nodes connected by them. Consider for example the links (i,k) and (j,k), connected to the same node k. Their similarity is defined as (Figure 9.23a-c)

$$S((i,k),(j,k)) = \frac{|n_+(i) \cap n_+(j)|}{|n_+(i) \cup n_+(j)|}, \qquad (9.17)$$

where n+(i) is the list of the neighbors of node i, including itself. Hence S measures the relative number of common neighbors i and j have. Consequently S=1 if i and j have the same neighbors (Figure 9.23c). The smaller the overlap between the neighborhoods of the two links, the smaller is S (Figure 9.23b).
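The similarity (9.17) is a Jaccard index over inclusive neighborhoods and is straightforward to compute. The following is a minimal sketch for an undirected network given as a link list; the function name `link_similarity` and its argument order are illustrative assumptions.

```python
def link_similarity(edges, i, j, k):
    """S((i,k),(j,k)) of Eq. (9.17) for two links that share node k."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    if k not in neighbors.get(i, ()) or k not in neighbors.get(j, ()):
        raise ValueError("links (i,k) and (j,k) must both exist")
    n_i = neighbors[i] | {i}   # n_+(i): neighbors of i, including i itself
    n_j = neighbors[j] | {j}   # n_+(j)
    return len(n_i & n_j) / len(n_i | n_j)


# The two simple cases discussed in the text.
triple = [("a", "c"), ("b", "c")]                 # isolated connected triple
triangle = [("a", "c"), ("b", "c"), ("a", "b")]   # isolated triangle

S_triple = link_similarity(triple, "a", "b", "c")      # 1/3
S_triangle = link_similarity(triangle, "a", "b", "c")  # 1.0
```

The two results reproduce the values S=1/3 for the isolated triple and S=1 for the triangle.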
Figure 9.23 Identifying Link Communities
The link clustering algorithm identifies links with a similar topological role in a network. It does so by exploring the connectivity patterns of the nodes at the two ends of each link. Inspired by the similarity function of the Ravasz algorithm [4] (Figure 9.19), the algorithm aims to assign high similarity S to the links that connect to the same group of nodes.
(a) The similarity S of the (i,k) and (j,k) links connected to node k detects if the two links belong to the same group of nodes. Denoting with n+(i) the list of neighbors of node i, including itself, we obtain |n+(i)∪n+(j)| = 12 and |n+(i)∩n+(j)| = 4, resulting in S = 1/3 according to (9.17).
(b) For an isolated (ki = kj = 1) connected triple we obtain S = 1/3.
(c) For a triangle we have S = 1.
(d) The link similarity matrix for the network shown in (e) and (f). Darker entries correspond to link pairs with higher similarity S. The figure also shows the resulting link dendrogram.
(e) The link community structure predicted by the cut of the dendrogram shown as an orange dashed line in (d).
(f) The overlapping node communities derived from the link communities shown in (e).
After [51].

Step 2: Apply Hierarchical Clustering
The similarity matrix S allows us to use hierarchical clustering to identify link communities (SECTION 9.3). We use a single-linkage procedure, iteratively merging communities with the largest similarity link pairs (Figure 9.10).
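Step 2 can be sketched by noting that cutting a single-linkage dendrogram at similarity t yields the same groups as joining every link pair with S ≥ t and taking the connected components. The code below uses this equivalence instead of building the full dendrogram; it is an illustration under stated assumptions (a simple graph without self-loops or duplicate links, sortable node labels), and the function name `link_communities` is hypothetical.

```python
from itertools import combinations


def link_communities(edges, threshold):
    """Link communities from a single-linkage cut at the given similarity."""
    edges = [tuple(sorted(e)) for e in edges]
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    inclusive = {n: nb | {n} for n, nb in neighbors.items()}   # n_+(i)

    parent = {e: e for e in edges}        # union-find over links

    def find(e):
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for e1, e2 in combinations(edges, 2):
        shared = set(e1) & set(e2)
        if len(shared) != 1:
            continue                      # only link pairs sharing one node
        (i,) = set(e1) - shared
        (j,) = set(e2) - shared
        s = len(inclusive[i] & inclusive[j]) / len(inclusive[i] | inclusive[j])
        if s >= threshold:                # Eq. (9.17) above the cut: merge
            parent[find(e1)] = find(e2)

    groups = {}
    for e in edges:
        groups.setdefault(find(e), []).append(e)
    return [sorted(g) for g in groups.values()]


# Toy network (an assumption for illustration): two triangles sharing node 2.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
communities = sorted(link_communities(edges, threshold=0.5))
# communities == [[(0, 1), (0, 2), (1, 2)], [(2, 3), (2, 4), (3, 4)]]
```

Each link lands in exactly one community, while node 2 has links in both, so it belongs to both overlapping node communities.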
Taken together, for the network of Figure 9.23e, (9.17) provides the similarity matrix shown in (d). The single-linkage hierarchical clustering leads to the dendrogram shown in (d), whose cuts result in the link communities shown in (e) and the overlapping node communities shown in (f).

Figure 9.24 illustrates the community structure of the characters of Victor Hugo's novel Les Miserables identified using the link clustering algorithm. Anyone familiar with the novel can convince themselves that the communities accurately represent the role of each character. Several characters are placed in multiple communities, reflecting their overlapping roles in the novel. Links, however, are unique to each community.

Figure 9.24 Link Communities
The network of characters in Victor Hugo's 1862 novel Les Miserables. Two characters are connected if they interact directly with each other in the story. The link colors indicate the clusters, light grey nodes corresponding to single-link clusters. Nodes that belong to multiple communities are shown as pie charts, illustrating their membership in each community. Not surprisingly, the main character, Jean Valjean, has the most diverse community membership. After [51].

Computational Complexity
The link clustering algorithm involves two time-limiting steps: similarity calculation and hierarchical clustering. Calculating the similarity (9.17) for a link pair with degrees ki and kj requires max(ki,kj) steps. For a scale-free network with degree exponent γ the calculation of similarity has complexity O(N^{2/(γ−1)}), determined by the degree of the largest node, kmax. Hierarchical clustering requires O(L²) time steps. Hence the algorithm's total computational complexity is O(N^{2/(γ−1)}) + O(L²). For sparse graphs the latter term dominates, leading to O(N²).

The need to detect overlapping communities has inspired numerous algorithms [53]. For example, the CFinder algorithm has been extended to the analysis of weighted [54], directed and bipartite graphs [55,56]. Similarly, one can derive quality functions for link clustering [52], like the modularity function discussed in SECTION 9.4.

In summary, the algorithms discussed in this section acknowledge the fact that nodes naturally belong to multiple communities. Therefore by forcing each node into a single community, as we did in the previous sections, we obtain a misleading characterization of the underlying community structure. Link communities recognize the fact that each link accurately captures the nature of the relationship between two nodes. As a bonus, link clustering also predicts the overlapping community structure of a network.
SECTION 9.6
TESTING COMMUNITIES
Community identification algorithms offer a powerful diagnosis tool, allowing us to characterize the local structure of real networks. Yet, to interpret and use the predicted communities, we must understand the accuracy of our algorithms. Similarly, the need to diagnose large networks prompts us to address the computational efficiency of our algorithms. In this section we focus on the concepts needed to assess the accuracy and the speed of community finding.
ACCURACY
If the community structure is uniquely encoded in the network's wiring diagram, each algorithm should predict precisely the same communities. Yet, given the different hypotheses the various algorithms embody, the partitions uncovered by them can differ, prompting the question: Which community finding algorithm should we use?

To assess the performance of community finding algorithms we need to measure an algorithm's accuracy, i.e. its ability to uncover communities in networks whose community structure is known. We start by discussing two benchmarks, which are networks with predefined community structure, that we can use to test the accuracy of a community finding algorithm.

Girvan-Newman (GN) Benchmark
The Girvan-Newman benchmark consists of N=128 nodes partitioned into nc=4 communities of size Nc=32 [9,57]. Each node is connected with probability pint to the Nc–1 nodes in its community and with probability pext to the 3Nc nodes in the other three communities. The control parameter

$$\mu = \frac{k^{ext}}{k^{ext} + k^{int}}, \qquad (9.18)$$

captures the density differences within and between communities. We expect community finding algorithms to perform well for small µ (Figure 9.25a), when the probability of connecting to nodes within the same community exceeds the probability of connecting to nodes in different communities. The performance of all algorithms should drop for large µ (Figure 9.25b), when the link density within the communities becomes comparable to the link density in the rest of the network.

Figure 9.25 Testing Accuracy with the GN Benchmark
The position of each node in (a) and (c) shows the planted communities of the Girvan-Newman (GN) benchmark, illustrating the presence of four distinct communities, each with Nc=32 nodes.
(a) The node colors represent the partitions predicted by the Ravasz algorithm for mixing parameter µ=0.40 given by (9.18). As in this case the communities are well separated, we have an excellent agreement between the planted and the detected communities.
(b) The normalized mutual information as a function of the mixing parameter µ for the Ravasz algorithm. For small µ we have In≃1 and nc≃4, indicating that the algorithm can easily detect well separated communities, as illustrated in (a). As we increase µ the link density difference within and between communities becomes less pronounced. Consequently the communities are increasingly difficult to identify and In decreases.
(c) For µ=0.50 the Ravasz algorithm misplaces a notable fraction of the nodes, as in this case the communities are not well separated, making it harder to identify the correct community structure.
Note that the Ravasz algorithm generates multiple partitions, hence for each µ we show the partition with the largest modularity, M. Next to (a) and (c) we show the normalized mutual information associated with the corresponding partition and the number of detected communities nc. The normalized mutual information (9.23), developed for non-overlapping communities, can be extended to overlapping communities as well [59].

Lancichinetti-Fortunato-Radicchi (LFR) Benchmark
The GN benchmark generates a random graph in which all nodes have comparable degree and all communities have identical size. Yet, the degree distribution of most real networks is fat tailed, and so is the community size distribution (Figure 9.29). Hence an algorithm that performs well on the GN benchmark may not do well on real networks. To avoid this limitation, the LFR benchmark (Figure 9.26) builds networks for which both the node degrees and the planted community sizes follow power laws [58].

Figure 9.26 LFR Benchmark
The construction of the Lancichinetti-Fortunato-Radicchi (LFR) benchmark, which generates networks in which both the node degrees and community sizes follow a power law. The benchmark is built as follows [57]:
(a) Start with N isolated nodes.
(b) Assign each node to a community of size Nc, where Nc follows the power law distribution pNc ~ Nc^(−ζ) with community exponent ζ. Also assign each node i a degree ki selected from the power law distribution pk ~ k^(−γ) with degree exponent γ.
(c) Each node i of a community receives an internal degree (1µ)ki, shown as links whose color agrees with the node color. The remaining µki degrees, shown as black links, connect to nodes in other communities.
Having built networks with known community structure, next we need tools to measure the accuracy of the partition predicted by a particular community finding algorithm. As we do so, we must keep in mind that the
(d) All stubs of nodes of the same community are randomly attached to each other, until no more stubs are ‘‘free’’. In this way we maintain the sequence of internal degrees of each node in its community. The remaining µki stubs are randomly attached to nodes from other communities.
two benchmarks discussed above correspond to a particular definition of communities. Consequently algorithms based on clique percolation or link clustering, that embody a different notion of communities, may not fare so well on these. Measuring Accuracy
(e) A typical network and its community structure generated by the LFR benchmark with N=500, γ=2.5, and ζ=2.
To compare the predicted communities with those planted in the benchmark, consider an arbitrary partition into nonoverlapping communities. In each step we randomly choose a node and record the label of the community it belongs to. The result is a random string of community labels that follow a p(C) distribution, representing the probability that a randomly selected node belongs to the community C.

Consider two partitions of the same network, one being the benchmark (ground truth) and the other the partition predicted by a community finding algorithm. Each partition has its own p(C1) and p(C2) distribution. The joint distribution, p(C1, C2), is the probability that a randomly chosen node belongs to community C1 in the first partition and C2 in the second. The similarity of the two partitions is captured by the normalized mutual information [38]

$$I_n = \frac{\sum_{C_1,C_2} p(C_1,C_2)\,\log_2 \frac{p(C_1,C_2)}{p(C_1)\,p(C_2)}}{\frac{1}{2}H(\{p(C_1)\}) + \frac{1}{2}H(\{p(C_2)\})} . \qquad (9.19)$$

The numerator of (9.19) is the mutual information I, measuring the information shared by the two community assignments: I=0 if C1 and C2 are independent of each other; I equals the maximal value H({p(C1)}) = H({p(C2)}) when the two partitions are identical and

$$H(\{p(C)\}) = -\sum_{C} p(C)\,\log_2 p(C) \qquad (9.20)$$
is the Shannon entropy. If all nodes belong to the same community, then we are certain about the next label and H=0, as we do not gain new information by inspecting the community to which the next node belongs. H is maximal if p(C) is the uniform distribution, as in this case we have no idea which community comes next and each new node provides H bits of new information.

Figure 9.27
Testing Against Benchmarks
We tested each community finding algorithm that predicts nonoverlapping communities against the GN and the LFR benchmarks. The plots show the normalized mutual information In against µ for five algorithms (Girvan-Newman, Greedy Mod. (Opt), Louvain, Infomap, Ravasz). For the naming of each algorithm, see TABLE 9.1.
(a) GN Benchmark
The horizontal axis shows the mixing parameter (9.18), representing the fraction of links connecting different communities. The vertical axis is the normalized mutual information (9.19). Each curve is averaged over 100 independent realizations.
(b) LFR Benchmark
Same as in (a) but for the LFR benchmark. The benchmark parameters are N=1,000, ⟨k⟩=20, γ=2, kmax=50, ζ=1, maximum community size: 100, minimum community size: 20. Each curve is averaged over 25 independent realizations.
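The entropy (9.20) and the normalized mutual information (9.19) can be evaluated directly from two lists of community labels. The following is a minimal sketch, not a reference implementation: the function names and the toy label lists are illustrative assumptions, and the value returned when both partitions consist of a single community (where the denominator of (9.19) vanishes) is a convention chosen here.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """H({p(C)}) of (9.20), with p(C) estimated as the fraction of nodes in C."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def normalized_mutual_information(truth, predicted):
    """I_n of (9.19) for two label lists indexed by the same nodes."""
    n = len(truth)
    p1 = Counter(truth)                   # counts behind p(C1)
    p2 = Counter(predicted)               # counts behind p(C2)
    joint = Counter(zip(truth, predicted))  # counts behind p(C1, C2)
    mutual = sum(
        (c / n) * log2((c / n) / ((p1[a] / n) * (p2[b] / n)))
        for (a, b), c in joint.items()
    )
    denominator = 0.5 * shannon_entropy(truth) + 0.5 * shannon_entropy(predicted)
    # Assumption: two single-community partitions are treated as identical.
    return mutual / denominator if denominator > 0 else 1.0


# Toy partitions of four nodes (hypothetical data).
benchmark = [0, 0, 1, 1]
relabeled = ["a", "a", "b", "b"]   # same partition, different label names
unrelated = [0, 1, 0, 1]           # statistically independent of the benchmark

print(normalized_mutual_information(benchmark, relabeled))
print(normalized_mutual_information(benchmark, unrelated))
```

The measure depends only on how nodes are grouped, not on the label names, which is why the relabeled partition scores the same as the benchmark itself.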
In summary, In=1 if the benchmark and the detected partitions are identical, and In=0 if they are independent of each other. The utility of In is illustrated in Figure 9.25b that shows the accuracy of the Ravasz algorithm for the Girvan-Newman benchmark. In Figure 9.27 we use In to test the performance of each algorithm against the GN and LFR benchmarks. The results allow us to draw several conclusions:
• We have In≃1 for µ […]

[…] n + 2 we can combine (9.40) and (9.41) to obtain

$$\ln N_n(H_i) = C'_n - \ln k_i \,\frac{\ln 5}{\ln 4} \qquad (9.42)$$

or

$$N_n(H_i) \sim k_i^{-\frac{\ln 5}{\ln 4}} . \qquad (9.43)$$

To calculate the degree distribution we need to normalize $N_n(H_i)$ by calculating the ratio

$$p_{k_i} \sim \frac{N_n(H_i)}{k_{i+1} - k_i} \sim k_i^{-\gamma} . \qquad (9.44)$$

Using

$$k_{i+1} - k_i = \sum_{l=1}^{i+1} 4^l - \sum_{l=1}^{i} 4^l = 4^{i+1} = 3k_i + 4 \qquad (9.45)$$

we obtain

$$p_{k_i} = \frac{k_i^{-\frac{\ln 5}{\ln 4}}}{3k_i + 4} \sim k_i^{-1-\frac{\ln 5}{\ln 4}} . \qquad (9.46)$$

In other words the obtained hierarchical network's degree exponent is

$$\gamma = 1 + \frac{\ln 5}{\ln 4} = 2.16 . \qquad (9.47)$$
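The two numerical ingredients of this derivation, the identity (9.45) and the exponent (9.47), are easy to check. The short sketch below assumes the hub degree $k_i = \sum_{l=1}^{i} 4^l$ used in (9.45); the helper name `hub_degree` is illustrative.

```python
from math import log


def hub_degree(i):
    """Degree k_i of an H_i hub, taken as the sum of 4^l for l = 1..i."""
    return sum(4 ** l for l in range(1, i + 1))


# Check (9.45): k_{i+1} - k_i = 4^(i+1) = 3 k_i + 4 for the first few levels.
for i in range(1, 8):
    k_i = hub_degree(i)
    assert hub_degree(i + 1) - k_i == 4 ** (i + 1) == 3 * k_i + 4

# Degree exponent of (9.47).
gamma = 1 + log(5) / log(4)
print(round(gamma, 2))  # 2.16
```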
Clustering Coefficient
It is somewhat straightforward to calculate the clustering coefficient of the $H_i$ hubs. Their $\sum_{l=1}^{i} 4^l$ links come from nodes linked in a square, thus the number of connections between them equals their number. Consequently the number of links between the $H_i$'s neighbors is

$$\sum_{l=1}^{i} 4^l = k_n(H_i) , \qquad (9.48)$$

providing

$$C(H_i) = \frac{2k_i}{k_i(k_i - 1)} = \frac{2}{k_i - 1} . \qquad (9.49)$$

In other words we obtain

$$C(k) \simeq \frac{2}{k} , \qquad (9.50)$$

indicating that C(k) for the hubs scales as $k^{-1}$, in line with (9.8).
Empirical Results
Figure 9.36 shows the C(k) function for the ten reference networks. We also show C(k) for each network after we applied degree-preserving randomization (green symbols), allowing us to make several observations:
• For small k all networks have an order of magnitude higher C(k) than their randomized counterpart. Therefore the small degree nodes are located in much denser neighborhoods than expected by chance.
• For the scientific collaboration, metabolic, and citation networks with a good approximation we have $C(k) \sim k^{-1}$, while the randomized C(k) is flat. Hence these networks display the hierarchical modularity of the model of Figure 9.13.
• For the Internet, mobile phone calls, actors, email, protein interactions and the WWW C(k) decreases with k, while their randomized C(k) is k-independent. Hence while these networks display a hierarchical modularity, the observed C(k) is not captured by our simple hierarchical model. To fit the C(k) of these systems we need to build models that accurately capture their evolution. Such models predict that $C(k) \sim k^{-\beta}$, where β can be different from one [27].
• Only for the power grid we observe a flat, k-independent C(k), indicating the lack of a hierarchical modularity.
Taken together, Figure 9.36 indicates that most real networks display some nontrivial hierarchical modularity.
Figure 9.36
Hierarchy in Real Networks
The scaling of C(k) with k for the ten reference networks (purple symbols). The green symbols show C(k) obtained after applying degree-preserving randomization to each network, that washes out the local density fluctuations. Consequently communities and the underlying hierarchy are gone. Directed networks were made undirected to measure C(k). The dashed line in each figure has slope −1, following (9.8), serving as a guide to the eye.
Panels, each plotting C(k) against k on logarithmic axes: Power Grid, Internet, Mobile Phone Calls, Scientific Collaboration, Actor, Email, Protein, WWW, Metabolic, Citation.
SECTION 9.11
ADVANCED TOPICS 9.B MODULARITY
In this section we derive the expressions (9.12) and (9.13), characterizing the modularity function and its changes.

Modularity as a Sum Over Communities
Using (9.9) and (9.10) we can write the modularity of a full network as

$$M = \frac{1}{2L}\sum_{i,j=1}^{N}\left(A_{ij} - \frac{k_i k_j}{2L}\right)\delta_{C_i,C_j} , \qquad (9.51)$$

where $C_i$ is the label of the community to which node i belongs. As only node pairs that belong to the same community contribute to the sum in (9.51), we can rewrite the first term as a sum over communities,

$$\frac{1}{2L}\sum_{i,j=1}^{N} A_{ij}\,\delta_{C_i,C_j} = \sum_{c=1}^{n_c}\frac{1}{2L}\sum_{i,j\in C_c} A_{ij} = \sum_{c=1}^{n_c}\frac{L_c}{L} \qquad (9.52)$$

where $L_c$ is the number of links within community $C_c$. The factor 2 disappears because each link is counted twice in $A_{ij}$.

In a similar fashion the second term of (9.51) becomes

$$\frac{1}{2L}\sum_{i,j=1}^{N}\frac{k_i k_j}{2L}\,\delta_{C_i,C_j} = \sum_{c=1}^{n_c}\frac{1}{(2L)^2}\sum_{i,j\in C_c} k_i k_j = \sum_{c=1}^{n_c}\frac{k_c^2}{4L^2} , \qquad (9.53)$$

where $k_c$ is the total degree of the nodes in community $C_c$. Indeed, in the configuration model the probability that a stub connects to a randomly chosen stub is $\frac{1}{2L}$, as in total we have 2L stubs in the network. Hence the likelihood that our stub connects to a stub inside the module is $\frac{k_c}{2L}$. By repeating this procedure for all $k_c$ stubs within the community $C_c$ and adding a factor 1/2 to avoid double counting, we obtain the last term of (9.53). Combining (9.52) and (9.53) leads to (9.12).
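The equivalence between the pairwise form (9.51) and the sum over communities obtained by combining (9.52) and (9.53) can be checked numerically. The sketch below is illustrative only: the function names and the toy network (two triangles joined by one link) are assumptions, and the graph is taken to be unweighted, undirected and free of self-loops.

```python
def modularity_pairwise(edges, community):
    """Modularity from the double sum over node pairs, as in (9.51)."""
    L = len(edges)
    adjacent = {(i, j) for i, j in edges} | {(j, i) for i, j in edges}
    degree = {v: 0 for v in community}
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    total = 0.0
    for i in community:
        for j in community:
            if community[i] == community[j]:
                a_ij = 1 if (i, j) in adjacent else 0
                total += a_ij - degree[i] * degree[j] / (2 * L)
    return total / (2 * L)


def modularity_by_community(edges, community):
    """Modularity as a sum over communities of L_c / L - (k_c / 2L)^2."""
    L = len(edges)
    links_inside = {}   # L_c
    total_degree = {}   # k_c
    for i, j in edges:
        for v in (i, j):
            total_degree[community[v]] = total_degree.get(community[v], 0) + 1
        if community[i] == community[j]:
            links_inside[community[i]] = links_inside.get(community[i], 0) + 1
    return sum(
        links_inside.get(c, 0) / L - (k_c / (2 * L)) ** 2
        for c, k_c in total_degree.items()
    )


# Hypothetical example: two triangles connected by the link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

m_pairs = modularity_pairwise(edges, community)
m_comms = modularity_by_community(edges, community)
print(m_pairs, m_comms)
```

For this toy network each community has $L_c=3$ and $k_c=7$ with L=7, so both routes give 2(3/7 − 1/4) = 5/14.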
Merging Two Communities
Consider communities A and B and denote with $k_A$ and $k_B$ the total degree in these communities (equivalent with $k_c$ above). We wish to calculate the change in modularity after we merge these two communities. Using (9.12), this change can be written as

$$\Delta M_{AB} = \left[\frac{L_{AB}}{L} - \left(\frac{k_{AB}}{2L}\right)^2\right] - \left[\frac{L_A}{L} - \left(\frac{k_A}{2L}\right)^2 + \frac{L_B}{L} - \left(\frac{k_B}{2L}\right)^2\right] , \qquad (9.54)$$

where

$$L_{AB} = L_A + L_B + l_{AB} , \qquad (9.55)$$

$l_{AB}$ is the number of direct links between the nodes of communities A and B, and

$$k_{AB} = k_A + k_B . \qquad (9.56)$$

After inserting (9.55) and (9.56) into (9.54), we obtain

$$\Delta M_{AB} = \frac{l_{AB}}{L} - \frac{k_A k_B}{2L^2} , \qquad (9.57)$$

which is (9.13).
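As a sanity check, (9.57) can be compared with the direct difference (9.54). The numbers below are a hypothetical example (two triangles joined by a single link, so L=7, $L_A=L_B=3$, $k_A=k_B=7$, $l_{AB}=1$), and the function names are illustrative.

```python
def community_term(l_c, k_c, L):
    """Contribution of one community to the modularity: L_c/L - (k_c/2L)^2."""
    return l_c / L - (k_c / (2 * L)) ** 2


def delta_m_direct(l_a, k_a, l_b, k_b, l_ab, L):
    """Modularity change from (9.54), using (9.55) and (9.56) for the merged community."""
    merged = community_term(l_a + l_b + l_ab, k_a + k_b, L)
    separate = community_term(l_a, k_a, L) + community_term(l_b, k_b, L)
    return merged - separate


def delta_m_merge(l_ab, k_a, k_b, L):
    """Closed form (9.57)."""
    return l_ab / L - k_a * k_b / (2 * L ** 2)


direct = delta_m_direct(l_a=3, k_a=7, l_b=3, k_b=7, l_ab=1, L=7)
closed = delta_m_merge(l_ab=1, k_a=7, k_b=7, L=7)
print(direct, closed)
```

Both expressions give −5/14 here: merging the two triangles into one community lowers the modularity, so a greedy algorithm would not accept this move.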
SECTION 9.12
ADVANCED TOPICS 9.C FAST ALGORITHMS FOR COMMUNITY DETECTION
The algorithms discussed in this chapter were chosen to illustrate the fundamental ideas and concepts pertaining to community detection. Consequently they are guaranteed to be neither the fastest nor the most accurate algorithms. Recently two algorithms, called the Louvain algorithm and Infomap, have gained popularity, as their accuracy is comparable to the accuracy of the algorithms covered in this chapter but they offer better scalability. Consequently we can use them to identify communities in very large networks.

There are many similarities between the two algorithms:
• They both aim to optimize a quality function Q. For the Louvain algorithm Q is modularity, M, and for Infomap Q is an entropy-based measure called the map equation or L.
• Both algorithms use the same optimization procedure.
Given these similarities, we discuss the algorithms together.

THE LOUVAIN ALGORITHM
The $O(N^2)$ computational complexity of the greedy algorithm can be prohibitive for very large networks. A modularity optimization algorithm with better scalability was proposed by Blondel and collaborators [2]. The Louvain algorithm consists of two steps that are repeated iteratively (Figure 9.37):

Step I
Start with a weighted network of N nodes, initially assigning each node to a different community. For each node i we evaluate the gain in modularity if we place node i in the community of one of its neighbors j. We then move node i into the community for which the modularity gain is the largest, but only if this gain is positive. If no positive gain is found, i stays in its original community. This process is applied to all nodes until no further improvement can be achieved, completing Step I.
Figure 9.37
The Louvain Algorithm
The main steps of the Louvain algorithm. Each pass consists of two distinct steps:
Step I
Modularity is optimized by local changes. We choose a node and calculate the change in modularity, (9.58), if the node joins the community of its immediate neighbors. The figure shows the expected modularity change $\Delta M_{0,i}$ for node 0: $\Delta M_{0,2}=0.023$, $\Delta M_{0,3}=0.032$, $\Delta M_{0,4}=0.026$, $\Delta M_{0,5}=0.026$. Accordingly node 0 will join node 3, as the modularity change for this move is the largest, being $\Delta M_{0,3}=0.032$. This process is repeated for each node, the node colors corresponding to the resulting communities, concluding Step I.
Step II
The communities obtained in Step I are aggregated, building a new network of communities. Nodes belonging to the same community are merged into a single node, as shown on the top right. This process will generate self-loops, corresponding to links between nodes in the same community that are now merged into a single node.
The combination of Steps I and II is called a pass. The network obtained after each pass is processed again (Pass 2), until no further increase of modularity is possible. After [2].

The modularity change ΔM obtained by moving an isolated node i into a community C can be calculated using

$$\Delta M = \left[\frac{\Sigma_{in} + 2k_{i,in}}{2W} - \left(\frac{\Sigma_{tot} + k_i}{2W}\right)^2\right] - \left[\frac{\Sigma_{in}}{2W} - \left(\frac{\Sigma_{tot}}{2W}\right)^2 - \left(\frac{k_i}{2W}\right)^2\right] \qquad (9.58)$$
where $\Sigma_{in}$ is the sum of the weights of the links inside C (which is $L_C$ for an unweighted network); $\Sigma_{tot}$ is the sum of the link weights of all nodes in C; $k_i$ is the sum of the weights of the links incident to node i; $k_{i,in}$ is the sum of the weights of the links from i to nodes in C; and W is the sum of the weights of all links in the network.

Note that ΔM is a special case of (9.13), which provides the change in modularity after merging communities A and B. In the current case B is an isolated node. We can use ΔM to determine the modularity change when i is removed from the community it belonged to earlier. For this we calculate ΔM for merging i with the community C after we excluded i from it. The change after removing i is −ΔM.

Step II
We construct a new network whose nodes are the communities identified during Step I. The weight of the link between two nodes is the sum of the weights of the links between the nodes in the corresponding communities. Links between nodes of the same community lead to weighted self-loops.

Once Step II is completed, we repeat Steps I and II, calling their combination a pass (Figure 9.37). The number of communities decreases with each pass. The passes are repeated until there are no more changes and maximum modularity is attained.
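The two steps can be sketched for the simplest case. The code below is a minimal illustration under stated assumptions, not the implementation of [2]: the graph is unweighted and undirected with integer node labels, nodes are visited in dictionary order, the gain is evaluated with the special case of (9.13) mentioned above (community C merged with the isolated node i), and each internal link contributes weight 1 to the self-loop in Step II. All function names and the toy network are hypothetical.

```python
def louvain_step_one(adj):
    """Step I: move nodes between neighboring communities while modularity grows."""
    L = sum(len(nbrs) for nbrs in adj.values()) / 2
    degree = {v: len(adj[v]) for v in adj}
    community = {v: v for v in adj}       # every node starts in its own community
    total_degree = dict(degree)           # total degree of each community label
    moved = True
    while moved:
        moved = False
        for i in adj:
            old = community[i]
            links_to = {}                 # links from i to each neighboring community
            for j in adj[i]:
                links_to[community[j]] = links_to.get(community[j], 0) + 1
            total_degree[old] -= degree[i]    # take i out of its community

            def gain(c):
                # l_AB / L - k_A k_B / (2 L^2) with A = community c, B = node i
                return links_to.get(c, 0) / L - total_degree[c] * degree[i] / (2 * L ** 2)

            best, best_gain = old, gain(old)  # staying put is the baseline
            for c in links_to:
                if gain(c) > best_gain + 1e-12:
                    best, best_gain = c, gain(c)
            community[i] = best
            total_degree[best] += degree[i]
            moved = moved or best != old
    return community


def louvain_step_two(adj, community):
    """Step II: weighted network of communities; internal links become self-loops."""
    weights = {}
    for i in adj:
        for j in adj[i]:
            if i <= j:                    # visit each undirected link once
                a, b = sorted((community[i], community[j]))
                weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights


# Hypothetical example: two triangles connected by the link (2, 3).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
community = louvain_step_one(adj)
groups = {}
for node, label in community.items():
    groups.setdefault(label, set()).add(node)
print(sorted(sorted(g) for g in groups.values()))
print(louvain_step_two(adj, community))
```

Comparing each candidate with the gain of rejoining the old community implements the remove-then-insert reasoning of the text: a node moves only when the net change in modularity is positive. A full pass would rerun Step I on the weighted network returned by Step II, which requires replacing link counts with link weights.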
Computational Complexity
The Louvain algorithm is more limited by storage demands than by computational time. The number of computations scales linearly with L for the most time consuming first pass. With subsequent passes over a decreasing number of nodes and links, the complexity of the algorithm is at most O(L). It therefore allows us to identify communities in networks with millions of nodes.

INFOMAP
Introduced by Martin Rosvall and Carl T. Bergstrom, Infomap exploits data compression for community identification (Figure 9.38) [44-46]. It does so by optimizing a quality function for community detection in directed and weighted networks, called the map equation.

Consider a network partitioned into nc communities. We wish to encode in the most efficient fashion the trajectory of a random walker on this network. In other words, we want to describe the trajectory with the smallest number of symbols. The ideal code should take advantage of the fact that the random walker tends to get trapped in communities, staying there for a long time (Figure 9.38c).

Figure 9.38
From Data Compression to Communities
Infomap detects communities by compressing the movement of a random walker on a network.
(a) The orange line shows the trajectory of a random walker on a small network. We want to describe this trajectory with a minimal number of symbols, which we can achieve by assigning repeatedly used structures (communities) short and unique names.
(b) We start by giving a unique name to each node. This is derived using a Huffman coding, a data compression algorithm that assigns each node a code using the estimated probability that the random walk visits that node. The 314 bits under the network describe the sample trajectory of the random walker shown in (a), starting with 1111100 for the first node of the walk in the upper left corner, 1100 for the second node, etc., and ending with 00011 for the last node on the walk in the lower right corner.
(c) The figure shows a two-level encoding of the random walk, in which each community receives a unique name, but the names of nodes within communities are reused. This code yields on average a 32% shorter coding. The codes naming the communities and the codes used to indicate an exit from each community are shown to the left and the right of the arrows under the network, respectively. Using this code, we can describe the walk in (a) by the 243 bits shown under the network in (c). The first three bits 111 indicate that the walk begins in the red community, the code 0000 specifies the first node of the walk, etc.
(d) By reporting only the community names, and not the locations of each node within the communities, we obtain an efficient coarse graining of the network, which corresponds to its community structure.

To achieve this coding we assign:
• One code to each community (index codebook). For example the purple community in Figure 9.38c is assigned the code 111.
• Codewords for each node within each community. For example the top left node in (c) is assigned 001. Note that the same node code can be reused in different communities.
• Exit codes that mark when the walker leaves a community, like 0001 for the purple community in (c).

The goal, therefore, is to build a code that offers the shortest description of the random walk. Once we have this code, we can identify the network's community structure by reading the index codebook, which is uniquely assigned to each community (Figure 9.38c).
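The Huffman coding mentioned in Figure 9.38b gives frequently visited nodes short codewords. The sketch below builds such a code from a table of visit probabilities; it illustrates only the single-level node naming, not the two-level code or Infomap itself, and the function name and the probabilities are hypothetical.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Return a dict mapping each symbol to a prefix-free bit string."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {symbol: "0" for symbol in probabilities}
    while len(heap) > 1:
        # Merge the two least probable groups, prefixing their codes with 0 and 1.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical visit probabilities of a random walker on four nodes.
visit_probability = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(visit_probability)
average_length = sum(visit_probability[s] * len(code[s]) for s in code)
print(code, average_length)
```

The most visited node receives a one-bit name and the two rarest nodes three-bit names, so the expected description length per step is shorter than the two bits a fixed-length code would need.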
The optimal code is obtained by finding the minimum of the map equation

$$L = qH(Q) + \sum_{c=1}^{n_c} p_{\circlearrowright}^{c} H(P_c) . \qquad (9.59)$$

In a nutshell, the first term of (9.59) gives the average number of bits necessary to describe the movement between communities, where q is the probability that the random walker switches communities during a given step. The second term gives the average number of bits necessary to describe movement within communities. Here $H(P_c)$ is the entropy of within-community movements, including an "exit code" to capture the departure from a community.

The specific terms of the map equation and their calculation in terms of the probabilities capturing the movement of a random walker on a network are somewhat involved. They are described in detail in Refs. [44-46]. Online Resource 9.3 offers an interactive tool to illustrate the mechanism behind (9.59) and its use.

Online Resource 9.3
Map Equation for Infomap
For a dynamic visualization of the mechanism behind the map equation, see http://www.tp.umu.se/~rosvall/livemod/mapequation/.

At the end L serves as a quality function that takes up a specific value for a particular partition of the network into communities. To find the best partition, we must minimize L over all possible partitions. The popular implementation of this optimization procedure follows Steps I and II of the Louvain algorithm: We assign each node to a separate community, and we systematically join neighboring nodes into modules if the move decreases L. After each move L is updated using (9.59). The obtained communities are joined into supercommunities, finishing a pass, after which the algorithm is restarted on the new, reduced network.

Computational Complexity
The computational complexity of Infomap is determined by the procedure used to minimize the map equation L. If we use the Louvain procedure, the computational complexity is the same as that of the Louvain algorithm, i.e. at most O(L log L) or O(N log N) for a sparse graph.

In summary, the Louvain algorithm and Infomap offer tools for fast community identification. Their accuracy across benchmarks is comparable to the accuracy of the algorithms discussed throughout this chapter (Figure 9.28).
SECTION 9.13
ADVANCED TOPICS 9.D THRESHOLD FOR CLIQUE PERCOLATION
In this section we derive the percolation threshold (9.16) for clique percolation on a random network and discuss the main steps of the CFinder algorithm (Figure 9.39).

When we roll a k-clique to an adjacent k-clique by relocating one of its nodes, the expectation value of the number of adjacent k-cliques for the template to roll further should equal exactly one at the percolation threshold (Figure 9.20). Indeed, an expectation value smaller than one will result in a premature end of the k-clique percolation clusters, because starting from any k-clique, the rolling would quickly come to a halt. Consequently the size of the clusters would decay exponentially. An expectation value larger than one, on the other hand, allows the clique community to grow indefinitely, guaranteeing that we have a giant cluster in the system. The above expectation value is provided by

$$(k-1)(N-k-1)\,p^{k-1} , \qquad (9.63)$$

where the term (k−1) counts the number of nodes of the template that can be selected for the next relocation; the term (N−k−1) counts the number of potential destinations for this relocation, out of which only the fraction $p^{k-1}$ is acceptable, because each of the new k−1 edges (associated with the relocation) must exist in order to obtain a new k-clique. For large N, (9.63) simplifies to

$$(k-1)\,N\,p_c^{k-1} = 1 ,$$

which leads to (9.16).
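Solving the large-N condition for the critical link probability gives $p_c = [(k-1)N]^{-1/(k-1)}$. The short sketch below evaluates this expression; the function name and the chosen values of k and N are illustrative.

```python
def clique_percolation_threshold(k, n):
    """Critical link probability p_c solving (k - 1) * n * p_c**(k - 1) = 1."""
    return ((k - 1) * n) ** (-1.0 / (k - 1))


n = 1000
for k in (2, 3, 4):
    p_c = clique_percolation_threshold(k, n)
    # At the threshold the expected number of onward moves equals one.
    assert abs((k - 1) * n * p_c ** (k - 1) - 1) < 1e-9
    print(k, p_c)
```

For k=2 a clique is a single link and the expression reduces to $p_c = 1/N$, the threshold for the emergence of the giant component in a random network. Larger cliques need a denser network before they percolate.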
Figure 9.39
CFinder algorithm
The main steps of the CFinder algorithm.
(a) Starting from the network shown in the figure, our goal is to identify all cliques. All five k=3 cliques present in the network are highlighted.
(b) The overlap matrix O of the k=3 cliques, with rows and columns labeled by cliques 1 to 5:

$$O = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

This matrix is viewed as an adjacency matrix of a network whose nodes are the cliques of the original network. The matrix indicates that we have two connected components, one consisting of cliques (1, 2) and the other of cliques (3, 4, 5). The connected components of this network map into the communities of the original network.
(c) The two clique communities predicted by the adjacency matrix.
(d) The two clique communities shown in (c), mapped on the original network.
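The step from the overlap matrix to clique communities in Figure 9.39b is a connected-component search. The sketch below applies a depth-first search to the matrix O of the caption; the function name is illustrative, and finding the cliques and building O from the original network are not covered.

```python
def clique_communities(overlap):
    """Group cliques (numbered from 1) into the connected components of the overlap matrix."""
    n = len(overlap)
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u + 1)   # report 1-based clique labels, as in the figure
            for v in range(n):
                if overlap[u][v] and v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components


# Overlap matrix O of the five k=3 cliques in Figure 9.39b.
O = [
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
print(clique_communities(O))  # [[1, 2], [3, 4, 5]]
```

Each component is one clique community; the nodes of the original network that belong to its cliques form the community, and a node shared by cliques in different components belongs to both.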
SECTION 9.14
BIBLIOGRAPHY
[1] B. Droitcour. Young Incorporated Artists. Art in America, 92–97, April 2014.
[2] V. D. Blondel, J.L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. J. Stat. Mech., 2008.
[3] G.C. Homans. The Human Groups. Harcourt, Brace & Co, New York, 1950.
[4] S.A. Rice. The identification of blocs in small political bodies. Am. Polit. Sci. Rev., 21:619–627, 1927.
[5] R.D. Luce and A.D. Perry. A method of matrix analysis of group structure. Psychometrika, 14:95–116, 1949.
[6] R.S. Weiss and E. Jacobson. A method for the analysis of the structure of complex organizations. Am. Sociol. Rev., 20:661–668, 1955.
[7] W.W. Zachary. An information flow model for conflict and fission in small groups. J. Anthropol. Res., 33:452–473, 1977.
[8] L. Donetti and M.A. Muñoz. Detecting network communities: a new systematic and efficient algorithm. J. Stat. Mech., P10012, 2004.
[9] M. Girvan and M.E.J. Newman. Community structure in social and biological networks. PNAS, 99:7821–7826, 2002.
[10] L.H. Hartwell, J.J. Hopfield, and A.W. Murray. From molecular to modular cell biology. Nature, 402:C47–C52, 1999.
[11] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A.L. Barabási. Hierarchical organization of modularity in metabolic networks. Science, 297:1551–1555, 2002.
[12] K.I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A.L. Barabási. The human disease network. PNAS, 104:8685–8690, 2007.
[13] J. Menche, A. Sharma, M. Kitsak, S. Ghiassian, M. Vidal, J. Loscalzo, A.L. Barabási. Uncovering disease-disease relationships through the human interactome. 2014.
[14] A.L. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12:56–68, 2011.
[15] G. W. Flake, S. Lawrence, and C.L. Giles. Efficient identification of web communities. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 150–160, 2000.
[16] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. PNAS, 101:2658–2663, 2004.
[17] A.B. Kahng, J. Lienig, I.L. Markov, and J. Hu. VLSI Physical Design: From Graph Partitioning to Timing Closure. Springer, 2011.
[18] B.W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell Systems Technical Journal, 49:291–307, 1970.
[19] G.E. Andrews. The Theory of Partitions. Addison-Wesley, Boston, USA, 1976.
[20] L. Lovász. Combinatorial Problems and Exercises. North-Holland, Amsterdam, The Netherlands, 1993.
[21] G. Pólya and G. Szegő. Problems and Theorems in Analysis I. Springer-Verlag, Berlin, Germany, 1998.
[22] V. H. Moll. Numbers and Functions: From a classical-experimental mathematician's point of view. American Mathematical Society, 2012.
[23] M.E.J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004.
[24] M.E.J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27:39–54, 2005.
[25] U. Brandes. A faster algorithm for betweenness centrality. J. Math. Sociol., 25:163–177, 2001.
[26] T. Zhou, J.G. Liu, and B.H. Wang. Notes on the calculation of node betweenness. Chinese Physics Letters, 23:2327–2329, 2006.
[27] E. Ravasz and A.L. Barabási. Hierarchical organization in complex networks. Physical Review E, 67:026112, 2003.
[28] S. N. Dorogovtsev, A. V. Goltsev, and J. F. F. Mendes. Pseudofractal scale-free web. Physical Review E, 65:066122, 2002.
[29] E. Mones, L. Vicsek, and T. Vicsek. Hierarchy Measure for Complex Networks. PLoS ONE, 7:e33799, 2012.
[30] A. Clauset, C. Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453:98–101, 2008.
[31] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas. Comparing community structure identification. Journal of Statistical Mechanics, P09008, 2005.
[32] S. Fortunato and M. Barthélemy. Resolution limit in community detection. PNAS, 104:36–41, 2007.
[33] M.E.J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69:066133, 2004.
[34] S. Fortunato and M. Barthélemy. Resolution limit in community detection. PNAS, 104:36–41, 2007.
[35] A. Clauset, M.E.J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70:066111, 2004.
[36] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814, 2005.
[37] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas. Self-similar community structure in a network of human interactions. Physical Review E, 68:065103, 2003.
[38] L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas. Comparing community structure identification. J. Stat. Mech., P09008, 2005.
[39] J. Ruan and W. Zhang. Identifying network communities with a high resolution. Physical Review E, 77:016104, 2008.
[40] B. H. Good, Y.A. de Montjoye, and A. Clauset. The performance of modularity maximization in practical contexts. Physical Review E, 81:046106, 2010.
[41] R. Guimerà, M. Sales-Pardo, and L.A.N. Amaral. Modularity from fluctuations in random graphs and complex networks. Physical Review E, 70:025101, 2004.
[42] J. Reichardt and S. Bornholdt. Partitioning and modularity of graphs with arbitrary degree distribution. Physical Review E, 76:015102, 2007.
[43] J. Reichardt and S. Bornholdt. When are networks truly modular? Physica D, 224:20–26, 2006.
[44] M. Rosvall and C.T. Bergstrom. Maps of random walks on complex networks reveal community structure. PNAS, 105:1118, 2008.
[45] M. Rosvall, D. Axelsson, and C.T. Bergstrom. The map equation. Eur. Phys. J. Special Topics, 178:13, 2009.
[46] M. Rosvall and C.T. Bergstrom. Mapping change in large networks. PLoS ONE, 5:e8694, 2010.
[47] A. Perey. Oksapmin Society and World View. Dissertation for Degree of Doctor of Philosophy. Columbia University, 1973.
[48] I. Derényi, G. Palla, and T. Vicsek. Clique percolation in random networks. Physical Review Letters, 94:160202, 2005.
[49] J.M. Kumpula, M. Kivelä, K. Kaski, and J. Saramäki. A sequential algorithm for fast clique percolation. Physical Review E, 78:026109, 2008.
[50] G. Palla, A.L. Barabási, and T. Vicsek. Quantifying social group evolution. Nature, 446:664–667, 2007.
[51] Y.Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466:761–764, 2010.
[52] T.S. Evans and R. Lambiotte. Line graphs, link partitions, and overlapping communities. Physical Review E, 80:016105, 2009.
[53] M. Chen, K. Kuzmin, and B.K. Szymanski. Community Detection via Maximization of Modularity and Its Variants. IEEE Trans. Computational Social Systems, 1:46–65, 2014.
[54] I. Farkas, D. Ábel, G. Palla, and T. Vicsek. Weighted network modules. New J. Phys., 9:180, 2007.
[55] S. Lehmann, M. Schwartz, and L.K. Hansen. Biclique communities. Physical Review E, 78:016108, 2008.
[56] N. Du, B. Wang, B. Wu, and Y. Wang. Overlapping community detection in bipartite networks. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, IEEE Computer Society, Los Alamitos, CA, USA: 176–179, 2008.
[57] A. Condon and R.M. Karp. Algorithms for graph partitioning on the planted partition model. Random Struct. Algor., 18:116–140, 2001.
[58] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78:046110, 2008.
[59] A. Lancichinetti, S. Fortunato, and J. Kertész. Detecting the overlapping and hierarchical community structure of complex networks. New Journal of Physics, 11:033015, 2009.
[60] S. Fortunato. Community detection in graphs. Physics Reports, 486:75–174, 2010.
[61] D. Hric, R.K. Darst, and S. Fortunato. Community detection in networks: structural clusters versus ground truth. Physical Review E, 90:062805, 2014.
[62] M.S. Granovetter. The Strength of Weak Ties. The American Journal of Sociology, 78:1360–1380, 1973.
[63] J.P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.L. Barabási. Structure and tie strengths in mobile communication networks. PNAS, 104:7332, 2007.
[64] K.I. Goh, B. Kahng, and D. Kim. Universal Behavior of Load Distribution in Scale-Free Networks. Physical Review Letters, 87:278701, 2001.
[65] A. Maritan, F. Colaiori, A. Flammini, M. Cieplak, and J.R. Banavar. Universality Classes of Optimal Channel Networks. Science, 272:984–986, 1996.
[66] L.C. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40:35–41, 1977.
[67] J. Hopcroft, O. Khan, B. Kulis, and B. Selman. Tracking evolving communities in large linked networks. PNAS, 101:5249–5253, 2004.
[68] S. Asur, S. Parthasarathy, and D. Ucar. An event-based framework for characterizing the evolutionary behavior of interaction graphs. KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 913–921, 2007.
[69] D.J. Fenn, M.A. Porter, M. McDonald, S. Williams, N.F. Johnson, and N.S. Jones. Dynamic communities in multichannel data: An application to the foreign exchange market during the 2007–2008 credit crisis. Chaos, 19:033119, 2009.
[70] D. Chakrabarti, R. Kumar, and A. Tomkins. Evolutionary clustering, in: KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 554–560, 2006.
[71] Y. Chi, X. Song, D. Zhou, K. Hino, and B.L. Tseng. Evolutionary spectral clustering by incorporating temporal smoothness. KDD ’07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 153–162, 2007.
[72] Y.R. Lin, Y. Chi, S. Zhu, H. Sundaram, and B.L. Tseng. Facetnet: a framework for analyzing communities and their evolutions in dynamic networks. in: WWW ’08: Proceedings of the 17th International Conference on the World Wide Web, ACM, New York, NY, USA, pp. 685–694, 2008.
[73] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. KDD ’06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, USA, pp. 44–54, 2006.
[74] M.E.J. Newman and J. Park. Why social networks are different from other types of networks. Physical Review E, 68:036122, 2003.
[75] B. Krishnamurthy and J. Wang. On network-aware clustering of web clients. SIGCOMM Comput. Commun. Rev., 30:97–110, 2000.
[76] K.P. Reddy, M. Kitsuregawa, P. Sreekanth, and S.S. Rao. A graph based approach to extract a neighborhood customer community for collaborative filtering. DNIS ’02: Proceedings of the Second International Workshop on Databases in Networked Information Systems, Springer-Verlag, London, UK, pp. 188–200, 2002.
[77] R. Agrawal and H.V. Jagadish. Algorithms for searching massive graphs. Knowl. Data Eng., 6:225–238, 1994.
[78] A.Y. Wu, M. Garland, and J. Han. Mining scale-free networks using geodesic clustering. KDD ’04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, NY, USA, pp. 719–724, 2004.
10
ALBERT-LÁSZLÓ BARABÁSI
NETWORK SCIENCE
SPREADING PHENOMENA

ACKNOWLEDGEMENTS
MÁRTON PÓSFAI, NICOLE SAMAY, ROBERTA SINATRA, SARAH MORRISON, AMAL HUSSEINI, PHILIPP HOEVEL

INDEX
1 Introduction
2 Epidemic Modeling
3 Network Epidemics
4 Contact Networks
5 Beyond the Degree Distribution
6 Immunization
7 Epidemic Prediction
8 Summary
9 Homework
10 ADVANCED TOPICS 10.A Microscopic Models of Epidemic Processes
11 ADVANCED TOPICS 10.B Analytical Solution of the SI, SIS and SIR Models
12 ADVANCED TOPICS 10.C Targeted Immunization
13 ADVANCED TOPICS 10.D The SIR Model and Bond Percolation
14 Bibliography

Figure 10.0 (cover image)
Bill Smith
An epidemiological model of the perfect infectious disease (evolved growth system) is an artwork by Bill Smith, an Illinois-based artist (2009, mixed media, 84x84x84 inches) (http://www.widicus.org).

This work is licensed under a Creative Commons: CC BY-NC-SA 2.0. PDF V26, 05.09.2014
SECTION 10.1
INTRODUCTION
On the night of February 21, 2003 a physician from Guangdong Province in southern China checked into the Metropole Hotel in Hong Kong. He had previously treated patients suffering from a disease that, lacking a clear diagnosis, was called atypical pneumonia. The next day, after leaving the hotel, he went to the local hospital, this time as a patient. He died there several days later of atypical pneumonia [1].

The physician did not leave the hotel without a trace: That night sixteen other guests of the Metropole Hotel and one visitor also contracted the disease, which was eventually renamed Severe Acute Respiratory Syndrome, or SARS. These guests carried the SARS virus with them to Hanoi, Singapore, and Toronto, sparking outbreaks in each of those cities. Epidemiologists later traced close to half of the 8,100 documented cases of SARS back to the Metropole Hotel. With that the physician who brought the virus to Hong Kong became an example of a superspreader, an individual who is responsible for a disproportionate number of infections during an epidemic.

Figure 10.1
Superspreaders
One hundred forty-four of the 206 SARS patients diagnosed in Singapore were traced to a chain of five individuals that included four superspreaders. The most important of these was Patient Zero, the physician from Guangdong Province in China, who brought the disease to the Metropole Hotel. After [1].

A network theorist will recognize superspreaders as hubs, nodes with an exceptional number of links in the contact network on which a disease spreads. As hubs appear in many networks, superspreaders have been documented in many infectious diseases, from smallpox to AIDS [2]. In this chapter we introduce a network-based approach to epidemic phenomena that allows us to understand and predict the true impact of these hubs. The resulting framework, which we call network epidemics, offers an analytical and numerical platform to quantify and forecast the spread of infectious diseases.

Infectious diseases account for 43% of the global burden of disease, as captured by the number of years of lost healthy life. They are called contagious, as they are transmitted by contact with an ill person or with their secretions. Cures and vaccines are rarely sufficient to stop an infectious disease: it is equally important to understand how the pathogen responsible for the disease spreads in the population, which in turn determines the way we administer the available cures or vaccines.
Figure 10.2
Mobile Phone Viruses
Smart phones, capable of sharing programs and data with each other, offer a fertile ground for virus writers. Indeed, since 2004 hundreds of smart phone viruses have been identified, reaching a state of sophistication in a few years that took computer viruses about two decades to achieve [3]. Mobile viruses are transmitted using two main communication mechanisms [4]:

Bluetooth (BT) Viruses
A BT virus infects all phones found within BT range of the infected phone, which is about 10–30 meters. As physical proximity is essential for a BT connection, the transmission of a BT virus is determined by the owner’s location and the underlying mobility network, connecting locations by individuals who travel between them (SECTION 10.4). Hence BT viruses follow a spreading pattern similar to influenza.

Multimedia Messaging Services (MMS)
Viruses carried by MMS can infect all susceptible phones whose number is in the infected phone’s phonebook. Hence MMS viruses spread on the social network, following a long-range spreading pattern that is independent of the infected phone’s physical location. Consequently the spreading of MMS viruses is similar to the patterns characterizing computer viruses.

The diversity of phenomena regularly described as spreading processes on networks is staggering:

Biological
The spread of pathogens on their respective contact network is the main subject of this chapter. Examples include airborne diseases like influenza, SARS, or tuberculosis, transmitted when two individuals breathe the air in the same room; contagious diseases and parasites transmitted when people touch each other; the Ebola virus, transmitted via contact with a patient's bodily fluids; HIV and other sexually transmitted diseases, passed on during sexual intercourse. Infectious diseases also include cancers caused by cancer-causing viruses, like HPV or EBV, and diseases or infestations caused by parasites, like malaria or bedbugs.

Digital
A computer virus is a self-reproducing program that can transmit a copy of itself from computer to computer. Its spreading pattern has many similarities to the spread of pathogens. But digital viruses also have many unique features, determined by the technology behind the specific virus. As mobile phones morphed into handheld computers, we have lately also witnessed the appearance of mobile viruses and worms that infect smartphones (Figure 10.2).

Social
The role of the social and professional network in the spread and acceptance of innovations, knowledge, business practices, products, behavior, rumors and memes is a much-studied problem in social sciences, marketing and economics [5, 6]. Online environments, like Twitter, offer an unprecedented ability to track such phenomena. Consequently a staggering number of studies focus on social spreading, asking for example why some messages can reach millions of individuals, while others struggle to get noticed.

The examples discussed above involve diverse spreading agents, from biological to computer viruses, ideas and products; they spread on different types of networks, from social to computer and professional networks; they are characterized by widely different time scales and follow different mechanisms of transmission (Table 10.1). Despite this diversity, as we show in this chapter, these spreading processes obey common patterns and can be described using the same network-based theoretical and modeling framework.

Table 10.1
Networks and Agents
The spread of a pathogen, a meme or a computer virus is determined by the network on which the agent spreads and the transmission mechanism of the responsible agent. The table lists several much-studied spreading phenomena, together with the nature of the particular spreading agent and the network on which the agent spreads.

PHENOMENA                  AGENT                        NETWORK
Venereal Disease           Pathogens                    Sexual Network
Rumor Spreading            Information, Memes           Communication Network
Diffusion of Innovations   Ideas, Knowledge             Communication Network
Computer Viruses           Malware, Digital Viruses     Internet
Mobile Phone Virus         Mobile Viruses               Social Network/Proximity Network
Bedbugs                    Parasitic Insects            Hotel-Traveler Network
Malaria                    Plasmodium                   Mosquito-Human Network
SECTION 10.2
EPIDEMIC MODELING
Epidemiology has developed a robust analytical and numerical framework to model the spread of pathogens. This framework relies on two fundamental hypotheses:

i. Compartmentalization
Epidemic models classify each individual based on the stage of the disease affecting them. The simplest classification assumes that an individual can be in one of three states or compartments:
• Susceptible (S): Healthy individuals who have not yet come into contact with the pathogen (Figure 10.3).
• Infectious (I): Contagious individuals who have come into contact with the pathogen and hence can infect others.
• Recovered (R): Individuals who have been infected before, but have recovered from the disease, hence are not infectious.
The modeling of some diseases requires additional states, like immune individuals, who cannot be infected, or latent individuals, who have been exposed to the disease, but are not yet contagious.

Individuals can move between compartments. For example, at the beginning of a new influenza outbreak everyone is in the susceptible state. Once an individual comes into contact with an infected person, she can become infected. Eventually she will recover and develop immunity, losing her susceptibility to the particular strain of influenza.

ii. Homogenous Mixing
The homogenous mixing hypothesis (also called fully mixed or mass-action approximation) assumes that each individual has the same chance of coming into contact with an infected individual. This hypothesis eliminates the need to know the precise contact network on which the disease spreads, replacing it with the assumption that anyone can infect anyone else.

Figure 10.3
Pathogens
A pathogen, a word rooted in the Greek words “suffering, passion” (pathos) and “producer of” (genes), denotes an infectious agent or germ. A pathogen could be a disease-causing microorganism, like a virus, a bacterium, a prion, or a fungus. The figure shows several much-studied pathogens, like the HIV virus, responsible for AIDS, an influenza virus and the hepatitis C virus. After http://www.livescience.com/18107hivtherapeuticvaccinespromise.html and http://www.huffingtonpost.com/2014/01/13/deadlyvirusesbeautifulphotos_n_4545309.html

In this section we introduce the epidemic modeling framework built on these two hypotheses. To be specific, we explore the dynamics of three frequently used epidemic models, the so-called SI, SIS and SIR models, which help us understand the basic building blocks of epidemic modeling.
SUSCEPTIBLE-INFECTED (SI) MODEL
Consider a disease that spreads in a population of N individuals. Denote with S(t) the number of individuals who are susceptible (healthy) at time t and with I(t) the number of individuals that have already been infected. Before the outbreak everyone is susceptible (S = N) and no one is infected (I = 0). Let us assume that a typical individual has ⟨k⟩ contacts and that the likelihood that the disease will be transmitted from an infected to a susceptible individual in a unit time is β. We ask the following: If a single individual becomes infected at time t=0 (i.e. I(0)=1), how many individuals will be infected at some later time t?

Within the homogenous mixing hypothesis the probability that the infected person encounters a susceptible individual is S(t)/N. Therefore the infected person comes into contact with ⟨k⟩S(t)/N susceptible individuals in a unit time. Since I(t) infected individuals are transmitting the pathogen, each at rate β, the average number of new infections dI(t) during a timeframe dt is

\beta \langle k \rangle \frac{S(t)\, I(t)}{N}\, dt .

Consequently I(t) changes at the rate

\frac{dI(t)}{dt} = \beta \langle k \rangle \frac{S(t)\, I(t)}{N} .    (10.1)

Throughout this chapter we will use the variables

s(t) = S(t)/N , \quad i(t) = I(t)/N ,    (10.2)

to capture the fraction of the susceptible and of the infected population at time t. For simplicity we also drop the (t) variable from i(t) and s(t), rewriting (10.1) as (ADVANCED TOPICS 10.A)

\frac{di}{dt} = \beta \langle k \rangle s\, i = \beta \langle k \rangle i (1 - i) ,    (10.3)

where the product β⟨k⟩ is called the transmission rate or transmissibility. We solve (10.3) by writing

\frac{di}{i} + \frac{di}{1 - i} = \beta \langle k \rangle dt .

Integrating both sides, we obtain

\ln i - \ln(1 - i) = \beta \langle k \rangle t + \ln C .

With the initial condition i0 = i(t=0), we get C = i0/(1–i0), obtaining that the fraction of infected individuals increases in time as

i = \frac{i_0 e^{\beta \langle k \rangle t}}{1 - i_0 + i_0 e^{\beta \langle k \rangle t}} .    (10.4)

Equation (10.4) predicts that:

• At the beginning the fraction of infected individuals increases exponentially (Figure 10.4b). Indeed, early on an infected individual encounters only susceptible individuals, hence the pathogen can easily spread.

• The characteristic time of this initial growth, i.e. the time over which the infected fraction increases by a factor of e in the exponential regime, is

\tau = \frac{1}{\beta \langle k \rangle} .    (10.5)

Hence τ is the inverse of the speed with which the pathogen spreads through the population. Equation (10.5) predicts that increasing either the density of links ⟨k⟩ or β enhances the speed of the pathogen and reduces the characteristic time.

• With time an infected individual encounters fewer and fewer susceptible individuals. Hence the growth of i slows for large t (Figure 10.4b). The epidemic ends when everyone has been infected, i.e. when i(t→∞)=1 and s(t→∞)=0.

Figure 10.4
The Susceptible-Infected (SI) Model
(a) In the SI model an individual can be in one of two states: susceptible (healthy) or infected (sick). The model assumes that if a susceptible individual comes into contact with an infected individual, it becomes infected at rate β. The arrow indicates that once an individual becomes infected, it stays infected, hence it cannot recover.
(b) Time evolution of the fraction of infected individuals, i(t), as predicted by (10.4). At early times the fraction of infected individuals grows exponentially (exponential regime: if i is small, i ≈ i0 e^{β⟨k⟩t}). As eventually everyone becomes infected, at large times we have i(∞)=1 (saturation regime: if i → 1, di/dt → 0).

SUSCEPTIBLE-INFECTED-SUSCEPTIBLE (SIS) MODEL
Most pathogens are eventually defeated by the immune system or by treatment. To capture this fact we need to allow the infected individuals to recover, ceasing to spread the disease. With that we arrive at the so-called SIS model, which has the same two states as the SI model, susceptible and infected. The difference is that now infected individuals recover at a fixed rate μ, becoming susceptible again (Figure 10.5a). The equation describing the dynamics of this model is an extension of (10.3),

\frac{di}{dt} = \beta \langle k \rangle i (1 - i) - \mu i ,    (10.6)

where μ is the recovery rate and the μi term captures the rate at which the population recovers from the disease. The solution of (10.6) provides the fraction of infected individuals as a function of time (Figure 10.5b)

i = \left(1 - \frac{\mu}{\beta \langle k \rangle}\right) \frac{C e^{(\beta \langle k \rangle - \mu) t}}{1 + C e^{(\beta \langle k \rangle - \mu) t}} ,    (10.7)

where the initial condition i0 = i(t=0) gives C = i0/(1–i0–μ/β⟨k⟩). While in the SI model eventually everyone becomes infected, (10.7) predicts that in the SIS model the epidemic has two possible outcomes:

• Endemic State (μ < β⟨k⟩)
For a low recovery rate the fraction of infected individuals does not reach one but saturates at a constant value i(∞) < 1 (Figure 10.5b). Setting di/dt = 0 in (10.6) we obtain

i(\infty) = 1 - \frac{\mu}{\beta \langle k \rangle} .    (10.8)

• Disease-free State (μ > β⟨k⟩)
For a sufficiently high recovery rate the exponent in (10.7) is negative, hence i decreases exponentially with time. In this state the number of individuals cured per unit time exceeds the number of newly infected individuals. Therefore with time the pathogen disappears from the population.

Figure 10.5
The Susceptible-Infected-Susceptible (SIS) Model
(a) The SIS model has the same states as the SI model: susceptible and infected. It differs from the SI model in that it allows recovery, i.e. infected individuals are cured, becoming susceptible again at rate μ.
(b) Time evolution of the fraction of infected individuals, i(t), in the SIS model, as predicted by (10.7). At early times the outbreak is exponential (if i is small, i ≈ i0 e^{(β⟨k⟩–μ)t}). As recovery is possible, at large t the system reaches an endemic state, in which the fraction of infected individuals is constant, i(∞) = 1 – μ/β⟨k⟩, given by (10.8). Hence in the endemic state only a finite fraction of individuals are infected. Note that for high recovery rate μ the number of infected individuals decreases exponentially and the disease dies out.

In other words, the SIS model predicts that some pathogens will persist in the population while others die out shortly. To understand what governs the difference between these two outcomes we write the characteristic time of a pathogen as

\tau = \frac{1}{\mu (R_0 - 1)} ,    (10.9)

where

R_0 = \frac{\beta \langle k \rangle}{\mu}    (10.10)

is the basic reproductive number. It represents the average number of susceptible individuals infected by an infected individual during its infectious period in a fully susceptible population. In other words, R0 is the number of new infections each infected individual causes under ideal circumstances.

Table 10.2
[...] for R0 > 1 the pathogen will spread and persist in the population. The higher R0 is, the faster the spreading process. The table lists R0 for several well-known pathogens. After [7].

The basic reproductive number is valuable for its predictive power:

• If R0 exceeds unity, τ is positive, hence the epidemic is in the endemic state. Indeed, if each infected individual infects more than one healthy person, the pathogen is poised to spread and persist in the population. The higher R0 is, the faster the spreading process.

• If R0 < 1 then τ is negative and the epidemic dies out. Indeed, if each infected individual infects less than one additional person, the pathogen cannot persist in the population.
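The closed-form results (10.4), (10.7), (10.9) and (10.10) are easy to evaluate numerically. The following is a minimal sketch, not a reference implementation: the function names, the argument name k_avg (standing for ⟨k⟩) and the parameter values in the usage lines are illustrative assumptions that do not come from the text.

```python
import math


def si_fraction(t, beta, k_avg, i0):
    """Infected fraction i(t) of the SI model, closed form (10.4)."""
    growth = i0 * math.exp(beta * k_avg * t)
    return growth / (1.0 - i0 + growth)


def sis_fraction(t, beta, k_avg, mu, i0):
    """Infected fraction i(t) of the SIS model, closed form (10.7).

    Assumes beta * k_avg != mu and i0 != 1 - mu / (beta * k_avg);
    in those special cases the constant C is undefined.
    """
    rate = beta * k_avg
    plateau = 1.0 - mu / rate          # endemic level i(inf) of (10.8)
    c = i0 / (plateau - i0)            # C = i0 / (1 - i0 - mu / (beta <k>))
    growth = c * math.exp((rate - mu) * t)
    return plateau * growth / (1.0 + growth)


def basic_reproductive_number(beta, k_avg, mu):
    """R0 = beta <k> / mu, equation (10.10)."""
    return beta * k_avg / mu


def characteristic_time(beta, k_avg, mu):
    """tau = 1 / (mu (R0 - 1)), equation (10.9); negative when R0 < 1."""
    r0 = basic_reproductive_number(beta, k_avg, mu)
    return 1.0 / (mu * (r0 - 1.0))


if __name__ == "__main__":
    # Arbitrary illustrative parameters (assumed, not taken from the text).
    beta, k_avg, mu, i0 = 0.5, 2.0, 0.4, 0.01
    print("R0  =", basic_reproductive_number(beta, k_avg, mu))
    print("tau =", characteristic_time(beta, k_avg, mu))
    for t in (0, 5, 10, 20, 40):
        print(t, si_fraction(t, beta, k_avg, i0),
              sis_fraction(t, beta, k_avg, mu, i0))
```

With these assumed values the SI curve approaches one while the SIS curve levels off at the endemic value 1 – μ/β⟨k⟩; choosing μ larger than β⟨k⟩ instead makes the SIS curve decay toward zero. For very large t the exponential can overflow a floating-point number, so a production version would need a numerically safer form.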
Consequently, the reproductive number is one of the first parameters epidemiologists estimate for a new pathogen, gauging the severity of the problem they face. For several well-studied pathogens R0 is listed in Table 10.2. The high R0 of some of these pathogens underlies the dangers they pose: For example, each individual infected with measles causes over a dozen subsequent infections.

SUSCEPTIBLE-INFECTED-RECOVERED (SIR) MODEL
For many pathogens, like most strains of influenza, individuals develop immunity after they recover from the infection. Hence, instead of returning to the susceptible state, they are “removed” from the population. These recovered individuals do not count any longer from the perspective of the pathogen, as they cannot be infected, nor can they infect others. The SIR model, whose properties are discussed in Figure 10.6, captures the dynamics of this process.

Figure 10.6
The Susceptible-Infected-Recovered (SIR) Model
(a) In contrast with the SIS model, in the SIR model recovered individuals enter a recovered state, meaning that they develop immunity rather than becoming susceptible again. Flu, SARS and Plague are diseases with this property, hence we must use the SIR model to describe their spread.
(b) The differential equations governing the time evolution of the fraction of individuals in the susceptible s, infected i and removed r state:

\frac{ds}{dt} = -\beta \langle k \rangle i\, [1 - r - i]
\frac{di}{dt} = -\mu i + \beta \langle k \rangle i\, [1 - r - i]
\frac{dr}{dt} = \mu i

(c) The time dependent behavior of s, i and r (fraction of population versus t) as predicted by the equations shown in (b). According to the model all individuals transition from a susceptible (healthy) state to the infected (sick) state and then to the recovered (immune) state.

In summary, depending on the characteristics of a pathogen, we need different models to capture the dynamics of an epidemic outbreak. As shown in Figure 10.7, the predictions of the SI, SIS, and SIR models agree with each other in the early stages of an epidemic: When the number of infected individuals is small, the disease spreads freely and the number of infected individuals increases exponentially. The outcomes are different for large times: In the SI model everyone becomes infected; the SIS model either reaches an endemic state, in which a finite fraction of individuals are always infected, or the infection dies out; in the SIR model everyone recovers at the end. The reproductive number predicts the long-term fate of an epidemic: for R0 > 1 the pathogen persists in the population, while for R0 < 1 it dies out naturally.

The models discussed so far have ignored the fact that an individual comes into contact only with its neighbors in the pertinent contact network. We assumed homogenous mixing instead, which means that an infected individual can infect any other individual. It also means that an infected individual typically infects only ⟨k⟩ other individuals, ignoring variations in node degrees. To accurately predict the dynamics of an epidemic, we need to consider the precise role the contact network plays in epidemic phenomena.
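Since the SIR equations of Figure 10.6b have no closed-form solution, they are usually integrated numerically. The sketch below uses a simple forward-Euler scheme, chosen here only for brevity; the function name integrate_sir, the step size dt and the parameter values are assumptions made for illustration and are not taken from the text.

```python
def integrate_sir(beta, k_avg, mu, i0, t_max, dt=0.01):
    """Forward-Euler integration of the SIR equations of Figure 10.6b.

    Starts from s = 1 - i0, i = i0, r = 0 and returns a list of
    (t, s, i, r) tuples. k_avg stands for the average degree <k>.
    """
    s, i, r = 1.0 - i0, i0, 0.0
    trajectory = [(0.0, s, i, r)]
    steps = int(round(t_max / dt))
    for n in range(1, steps + 1):
        # beta <k> i [1 - r - i] dt, using s = 1 - r - i
        new_infections = beta * k_avg * i * s * dt
        # mu i dt
        recoveries = mu * i * dt
        s -= new_infections
        i += new_infections - recoveries
        r += recoveries
        trajectory.append((n * dt, s, i, r))
    return trajectory


if __name__ == "__main__":
    # Arbitrary illustrative parameters (assumed): R0 = beta <k> / mu = 5.
    trajectory = integrate_sir(beta=0.5, k_avg=2.0, mu=0.2, i0=0.01, t_max=60.0)
    for t, s, i, r in trajectory[::1000]:
        print(f"t={t:5.1f}  s={s:.3f}  i={i:.3f}  r={r:.3f}")
```

Each step moves the same amount out of one compartment and into the next, so s + i + r stays equal to one up to rounding error. A higher-order integrator would be a natural replacement if accuracy matters.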
Figure 10.7
Comparing the SI, SIS and SIR Models
The plot shows the growth of the fraction of infected individuals, i(t), as a function of t in the SI, SIS and SIR models. Two different regimes stand out:

Exponential Regime
The models predict an exponential growth in the number of infected individuals during the early stages of the epidemic. For the same β the SI model predicts the fastest growth (smallest τ, see (10.5)). For the SIS and SIR models the growth is slowed by recovery, resulting in a larger τ, as predicted by (10.9). Note that for sufficiently high recovery rate μ the SIS and the SIR models predict a disease-free state, when the number of infected individuals decays exponentially with time.

The panel below the plot summarizes the three models:

SI
Exponential regime (number of infected individuals grows exponentially): i = \frac{i_0 e^{\beta \langle k \rangle t}}{1 - i_0 + i_0 e^{\beta \langle k \rangle t}}
Final regime (saturation at t → ∞): i(∞) = 1
Epidemic threshold (disease does not always spread): no threshold

SIS
Exponential regime: i = \left(1 - \frac{\mu}{\beta \langle k \rangle}\right) \frac{C e^{(\beta \langle k \rangle - \mu) t}}{1 + C e^{(\beta \langle k \rangle - \mu) t}}
Final regime: i(\infty) = 1 - \frac{\mu}{\beta \langle k \rangle}
Epidemic threshold: R0 = 1

SIR
Exponential regime: no closed solution
Final regime: i(∞) = 0
Epidemic threshold: R0 = 1

Final Regime
The three models predict different long-term outcomes: In the SI model everyone becomes infected, i(∞)=1; in the SIS model a finite fraction of individuals are infected, i(∞) < 1 [...]
spreading on fully connected networks. Mobile Phone Viruses Mobile phone viruses spread via MMS and Bluetooth (Figure 10.2). An MMS virus sends a copy of itself to all phone numbers found in the phone's contact list. Therefore MMS viruses exploit the social network behind mobile communications. As shown in Table 4.1, the mobile call network is scalefree with a high degree exponent. Mobile viruses can also spread via Bluetooth, passing a copy of themselves to all susceptible phones with a BT connection in their physical proximity. As discussed above, this colocation network is also highly heterogenous [4]. In summary, in the past decade technological advances allowed us to map out the structure of several networks that support the spread of biological or digital viruses, from sexual to proximitybased contact networks (see also ONLINE RESOURCE 10.2). Many of these, like the email network, the internet, or sexual networks, are scalefree. For others, like colocation networks, the degree distribution may not be fitted with a simple power law, yet show significant degree heterogeneity with high ⟨k 2⟩. This means that the analytical results obtained in the previous section are of direct relevance to pathogens spreading on most networks. Consequently the underlying heterogenous contact networks allow even weakly virulent viruses to easily spread in the population.
SECTION 10.5
BEYOND THE DEGREE DISTRIBUTION
So far we have kept our models simple: We assumed that pathogens spread on an unweighted network uniquely defined by its degree distribution. Yet, real networks have a number of characteristics that are not captured by $p_k$ alone, like degree correlations or community structure. Furthermore, the links are typically weighted and the interactions have a finite temporal duration. In this section we explore the impact of these properties on the spread of a pathogen.
TEMPORAL NETWORKS
Most interactions that we perceive as social links are brief and infrequent. As a pathogen can be transmitted only when there is an actual contact, an accurate modeling framework must also consider the timing and the duration of each interaction. Ignoring the timing of the interactions can lead to misleading conclusions [39-41]. For example, the static network of Figure 10.17b was obtained by aggregating the individual interactions shown in Figure 10.17a. On the aggregated network the infection has the same chance of spreading from D to A as from A to D. Yet, by inspecting the timing of each interaction, we realize that while an infection starting from A can infect D, an infection that starts at D cannot reach A. Therefore, to accurately predict an epidemic process we must consider the fact that pathogens spread on temporal networks, a topic of increasing interest in network science [40-43]. By ignoring the temporality of these contact patterns, we typically overestimate the speed and the extent of an outbreak [42,43].

Figure 10.17
Temporal Networks
Most interactions in a network are not continuous, but have a finite duration. We must therefore view the underlying networks as temporal networks, an increasingly active research topic in network science.
(a) Temporal Network
The timeline of the interactions between four individuals. Each vertical line marks the moment when two individuals come into contact with each other. If A is the first to be infected, the pathogen can spread from A to B and then to C, eventually reaching D. If, however, D is the first to be infected, the disease can reach C and B, but not A. This is because there is a temporal path from A to D, but none from D to A.
(b) Aggregated Network
The network obtained by merging the temporal interactions shown in (a). If we only have access to this aggregated representation, the pathogen appears able to reach all individuals, independent of its starting point. After [40].
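The asymmetry between temporal and aggregated reachability can be checked with a few lines of Python. The sketch below is only an illustration: the contact times are hypothetical values chosen to reproduce the qualitative pattern described for Figure 10.17, and the function names `reachable` and `reachable_aggregated` are introduced here, not taken from the text.

```python
def reachable(contacts, source):
    """Nodes reachable from `source` along time-respecting paths.

    `contacts` holds (time, u, v) tuples describing instantaneous, undirected
    contacts. The sketch assumes distinct contact times and that `source` is
    infected before the first contact.
    """
    infected = {source}
    for _, u, v in sorted(contacts):      # replay the contacts in temporal order
        if u in infected or v in infected:
            infected.update((u, v))       # the pathogen crosses the contact
    return infected


def reachable_aggregated(contacts, source):
    """Nodes reachable from `source` when the timing of contacts is ignored."""
    neighbors = {}
    for _, u, v in contacts:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    seen, frontier = {source}, [source]
    while frontier:
        node = frontier.pop()
        for other in neighbors.get(node, ()):
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen


# Hypothetical contact times (hours of the day), chosen for illustration only.
contacts = [(9, "A", "B"), (12, "B", "C"), (15, "C", "D"), (16, "B", "C")]

print(sorted(reachable(contacts, "A")))             # A reaches every node
print(sorted(reachable(contacts, "D")))             # D never reaches A
print(sorted(reachable_aggregated(contacts, "D")))  # timing ignored: D reaches A
```

Replaying the contacts in temporal order is what distinguishes the two functions: the aggregated version treats every link as permanently available, which is why it overestimates how far an outbreak starting at D can travel.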
BURSTY CONTACT PATTERNS
The theoretical approaches discussed in SECTIONS 10.2 and 10.3 assume that the timing of the interactions between two connected nodes is random. This means that the inter-event times between consecutive contacts follow an exponential distribution, resulting in a random but uniform sequence of events (Figure 10.18a-c). The measurements indicate otherwise: The inter-event times in most social systems follow a power-law distribution [35,44] (Figure 10.18d-f). This means that the sequence of contacts between two individuals is characterized by periods of frequent interactions, when multiple contacts follow each other within a relatively short time frame. Yet, the power law also implies that occasionally there are very long time gaps between two contacts. Therefore the contact patterns have an uneven, “bursty” character in time (Figure 10.18d,e).

Bursty interactions are observed in a number of contact processes of relevance for epidemic phenomena, from email communications to call patterns and sexual contacts. Once present, burstiness alters the dynamics of the spreading process [43]. To be specific, power-law inter-event times increase the characteristic time τ; consequently the number of infected individuals decays more slowly than predicted by a random contact pattern. For example, if the time between consecutive emails followed a Poisson process, an email virus would decay following $i(t) \sim \exp(-t/\tau)$ with a decay time of τ≈1 day. In the real data, however, the decay time is τ≈21 days, a much slower process, correctly predicted by the theory if we use power-law inter-event times [43].

Figure 10.18
Bursty Interactions
(a) If the pattern of activity of an individual is random, the inter-event times follow a Poisson process, which assumes that at any moment an event takes place with the same probability q. The horizontal axis denotes time and each vertical line corresponds to an event whose timing is chosen at random. The observed inter-event times are comparable to each other and very long delays are rare.
(b) The absence of long delays is visible if we show the inter-event times $\tau_i$ for 1,000 consecutive random events. The height of each vertical line corresponds to the gaps seen in (a).
(c) The probability of finding exactly n events within a fixed time interval follows the Poisson distribution $P(n,q) = e^{-qt}(qt)^n/n!$, predicting that the inter-event time distribution follows $P(\tau_i) \sim e^{-q\tau_i}$, shown on a log-linear plot.
(d) The succession of events for a temporal pattern whose inter-event times follow a power-law distribution. While most events follow each other closely, forming bursts of activity, there are a few exceptionally long inter-event times, corresponding to long gaps in the contact pattern. The time sequence is not as uniform as in (a), but has a bursty character.
(e) The waiting time $\tau_i$ of 1,000 consecutive events, where the mean inter-event time is chosen to coincide with the mean inter-event time of the Poisson process shown in (b). The large spikes correspond to exceptionally long delays.
(f) The delay time distribution $P(\tau_i) \sim \tau_i^{-2}$ for the bursty process shown in (d) and (e). After [35].
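The contrast between random and bursty timing can be reproduced numerically. The following sketch draws 1,000 inter-event times from an exponential distribution and 1,000 from a power law with exponent 2, then compares the longest gap with the median gap. The sampling routine, the parameter values and the "longest to typical" ratio are illustrative choices made here, not quantities defined in the text.

```python
import random
import statistics


def exponential_gaps(n, mean_gap, rng):
    """Inter-event times of a Poisson process with the given mean gap."""
    return [rng.expovariate(1.0 / mean_gap) for _ in range(n)]


def power_law_gaps(n, tau_min, exponent, rng):
    """Inter-event times drawn from P(tau) ~ tau**(-exponent) for tau >= tau_min.

    Uses inverse-transform sampling; requires exponent > 1.
    """
    return [tau_min * (1.0 - rng.random()) ** (-1.0 / (exponent - 1.0))
            for _ in range(n)]


def longest_to_typical(gaps):
    """Ratio of the longest gap to the median gap (illustrative statistic)."""
    return max(gaps) / statistics.median(gaps)


rng = random.Random(0)  # fixed seed so the run is repeatable
random_gaps = exponential_gaps(1000, mean_gap=1.0, rng=rng)
bursty_gaps = power_law_gaps(1000, tau_min=1.0, exponent=2.0, rng=rng)

print("Poisson:", longest_to_typical(random_gaps))
print("bursty: ", longest_to_typical(bursty_gaps))
```

For the exponential case the longest gap stays within roughly an order of magnitude of the median, whereas the power-law sample typically contains a few gaps that are hundreds of times longer, the numerical counterpart of the spikes in panel (e).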
DEGREE CORRELATIONS
As discussed in CHAPTER 7, many social networks are assortative, implying that high-degree nodes tend to connect to other high-degree nodes. Do these degree correlations affect the spread of a pathogen? The calculations indicate that degree correlations leave key aspects of network epidemics in place, but they alter the speed with which a pathogen spreads in a network:
• Degree correlations alter the epidemic threshold $\lambda_c$: assortative correlations decrease $\lambda_c$ and disassortative correlations increase it [45,46].
• Despite the changes in $\lambda_c$, for the SIS model the epidemic threshold vanishes for a scale-free network with diverging second moment, whether the network is assortative, neutral or disassortative [47]. Hence the fundamental results of SECTION 10.3 are not affected by degree correlations.
• Given that hubs are the first to be infected in a network, assortativity accelerates the spread of a pathogen. In contrast, disassortativity slows the spreading process.
• Finally, in the SIR model assortative correlations were found to lower the prevalence but increase the average lifetime of an epidemic outbreak [48].
LINK WEIGHTS AND COMMUNITIES
Throughout this chapter we assumed that all tie strengths are equal, focusing our attention on pathogens spreading on an unweighted network. In reality tie strengths vary considerably, a heterogeneity that plays an important role in spreading phenomena. Indeed, the more time an individual spends with an infected individual, the more likely that she too becomes infected. In the same vein, previously we ignored the community structure of the network on which the pathogen spreads. Yet, the existence of communities (CHAPTER 9) leads to repeated interactions between the nodes within the same community, altering the spreading dynamics.

The mobile phone network allows us to explore the role of tie strengths and communities in spreading phenomena [49]. Let us assume that at t=0 we provide a randomly selected individual with some key information. At each time step this “infected” individual i passes the information to her contact j with probability $p_{ij} \sim \beta w_{ij}$, where β is the spreading probability and $w_{ij}$ is the strength of the tie, captured by the number of minutes i and j have spent with each other on the phone. Indeed, the more time two individuals talk, the higher the chance that they will pass on the information. To understand the role of the link weights in the spreading process, we also consider the situation when the spreading takes place on a control network, which has the same wiring diagram but all tie strengths are set equal to $w = \langle w_{ij}\rangle$.

As Figure 10.19a illustrates, information travels significantly faster on the control network. The reduced speed observed in the real system indicates that the information is trapped within communities. Indeed, as we discussed in CHAPTER 9, strong ties tend to be within communities while weak ties are between them [50]. Therefore, once the information reaches a member of a community, it can rapidly reach all other members of the same community, given the strong ties between them. Yet, as the ties between the communities are weak, the information has difficulty escaping the community. Consequently the rapid invasion of the community is followed by long intervals during which the infection is trapped within a community. When all link weights are equal (control), the bridges between communities are strengthened, and the trapping vanishes.

The difference between the real and the control spreading process is illustrated by Figure 10.19b,c, which shows the spreading pattern in a small neighborhood of the mobile call network. In the control simulation the information tends to follow the shortest path. When the link weights are taken into account, information flows along a longer backbone with strong ties. For example, the information rarely reaches the lower half of the network in Figure 10.19b, a region always reached in the control simulation shown in (c).

Figure 10.19
Information Diffusion in Mobile Phone Networks
The spread of information on a weighted mobile call graph, where the probability that a node passes information to one of its neighbors is proportional to the strength of the tie between them. The tie strength is the number of minutes two individuals talk on the phone.
(a) The fraction of infected nodes as a function of time. The blue circles capture the spread on the network with the real tie strengths; the green symbols represent the control case, when all tie strengths are equal.
(b) Spreading in a small network neighborhood, following the real link weights. The information is released from the red node, the arrow weight indicating the tie strength. The simulation was repeated 1,000 times; the size of the arrowheads is proportional to the number of times the information was passed along the corresponding direction, and the color indicates the total number of transmissions along that link. The background contours highlight the difference in the direction the information follows in the real and the control simulations.
(c) Same as (b), but we assume that each link has the same weight $w = \langle w_{ij}\rangle$ (control). After [49].
COMPLEX CONTAGION
Communities have multiple consequences for spreading, from inducing global cascades [51,52] to altering the activity of individuals [53]. The diffusion of memes, representing ideas or behavior that spread from individual to individual, further highlights the important role of communities [54]. Meme diffusion has attracted considerable attention from marketing [5,55] to network science [56,57], communications [58], and social media [59-61].

Pathogens and memes can follow different spreading patterns, prompting us to systematically distinguish simple from complex contagion [54,62,63]. Simple contagion is the process we explored so far: It is sufficient to come into contact with an infected individual to be infected. The spread of memes, products and behavior is often described by complex contagion, capturing the fact that most individuals do not adopt a new meme, product or behavioral pattern at the first contact. Rather, adoption requires reinforcement [64], i.e. repeated contact with several individuals who have already adopted. For example, the higher the fraction of a person’s friends who have a mobile phone, the more likely that she also buys one.

In simple contagion communities trap information or a pathogen, slowing the spreading (Figure 10.19a). The effect is reversed in complex contagion: Because communities have redundant ties, they offer social reinforcement, exposing an individual to multiple examples of adoption. Hence communities can incubate a meme, a product or a behavioral pattern, enhancing its adoption.

The difference between simple and complex contagion is well captured by Twitter data. Tweets, or short messages, are often labeled with hashtags, which are keywords acting as memes. Twitter users can follow other users, receiving their messages; they can forward tweets to their own followers (retweet), or mention others in tweets. The measurements indicate that most hashtags are trapped in specific communities, a signature of complex contagion [54]. A high concentration of a meme within a certain community is evidence of reinforcement. In contrast, viral memes spread across communities, following a pattern similar to that encountered in biological pathogens. In general the more communities a meme reaches, the more viral it is (Figure 10.20).

In summary, several network characteristics can affect the spread of a pathogen in a network, from degree correlations to link weights and the bursty nature of the contact pattern. As we discussed in this section, some network characteristics slow a pathogen, others aid its spread. These effects must therefore be accounted for if we wish to predict the spread of a real pathogen. While these patterns are of obvious relevance for infectious diseases, they also influence the spread of such noninfectious diseases as obesity (BOX 10.2).

Figure 10.20
Simple vs. Complex Contagion
The community structure of the Twitter follower network. Each circle corresponds to a community and its size is proportional to the number of tweets produced by the respective community. The color of a community represents the time when the studied hashtag (meme) is first used in the community. Lighter colors denote the first communities to use a hashtag, darker colors denote the last community to adopt it.
(a) Simple Contagion
The evolution of the viral meme captured by the #ThoughtsDuringSchool hashtag from its early stage (30 tweets, left) to the late stage (200 tweets, right). The meme jumps easily between communities, infecting many of them, following a contagion pattern encountered in the case of biological pathogens.
(b) Complex Contagion
The evolution of a nonviral meme captured by the #ProperBand hashtag from the early stage (left) to the final stage (65 tweets, right). The tweet is trapped in a few communities, having difficulty escaping them. This is a signature of reinforcement, an indication that the meme follows complex contagion. After [54].
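The distinction between simple and complex contagion can be illustrated with a deterministic toy model. The sketch below is an assumption-laden illustration rather than the model used in [54]: a node adopts once at least `threshold` of its neighbors have adopted, with `threshold=1` standing in for simple contagion and `threshold=2` for complex contagion. The two-clique network and the helper names are hypothetical.

```python
from itertools import combinations


def two_cliques(size):
    """Two fully connected communities of `size` nodes joined by one bridge."""
    adjacency = {node: set() for node in range(2 * size)}
    for block in (range(size), range(size, 2 * size)):
        for u, v in combinations(block, 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    adjacency[size - 1].add(size)      # the single weak bridge
    adjacency[size].add(size - 1)
    return adjacency


def spread(adjacency, seeds, threshold):
    """Final set of adopters when a node needs `threshold` adopting neighbors."""
    adopted = set(seeds)
    changed = True
    while changed:                     # iterate until no node changes state
        changed = False
        for node, neighbors in adjacency.items():
            if node not in adopted and len(neighbors & adopted) >= threshold:
                adopted.add(node)
                changed = True
    return adopted


network = two_cliques(4)               # nodes 0-3 and 4-7, bridge between 3 and 4
simple = spread(network, seeds={0, 1}, threshold=1)
complex_ = spread(network, seeds={0, 1}, threshold=2)

print(sorted(simple))    # crosses the bridge and reaches both communities
print(sorted(complex_))  # saturates the seed community, then stops at the bridge
```

With a threshold of two, the redundant ties inside the seed community supply the required reinforcement, while the single bridge never does, so the meme stays confined to one community.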
BOX 10.2
DO OUR FRIENDS MAKE US FAT?
Infectious diseases, like influenza, SARS, or AIDS, spread through the transmission of a pathogen. But could the social network aid the spread of noninfectious diseases as well? Recent measurements indicate that it does, offering evidence that social networks can impact the spread of obesity, happiness, and behavioral patterns, like giving up smoking [65,66].

Obesity is diagnosed through an individual’s body-mass index (BMI), which is determined by numerous factors, from genetics to diet and exercise. The measurements show that our friends also play an important role. The analysis of the social network of 5,209 men and women has found that if one of our friends is obese, the risk that we too gain weight in the next two to four years increases by 57% [65]. The risk triples if our best friend is overweight: In this case, our chances of weight gain jump by 171% (Figure 10.21). For all practical purposes, obesity appears to be just as contagious as influenza or AIDS, despite the fact that there is no "obesity pathogen" that transmits it.

Online Resource 10.3
Spreading in Social Networks
“If your friends are obese, your risk of obesity is 45 percent higher. … If your friend’s friends are obese, your risk of obesity is 25 percent higher. … If your friend’s friend’s friend, someone you probably don’t even know, is obese, your risk of obesity is 10 percent higher. It’s only when you get to your friend’s friend’s friend’s friends that there’s no longer a relationship between that person’s body size and your own body size.”
Watch Nicholas Christakis explaining the spread of health patterns in social networks.

Figure 10.21
The Web of Obesity
The largest connected component of the social network capturing the friendship ties between 2,200 individuals enrolled in the Framingham Heart Study. Each node represents an individual; nodes with blue borders are men, those with red borders are women. The size of each node is proportional to the person's BMI, yellow nodes denoting obese individuals (BMI ≥30). Purple links are friendship or marital ties and orange links are family ties (e.g. siblings). Clusters of obese and nonobese individuals are visible in the network. The analysis indicates that these clusters cannot be attributed to homophily, i.e. the fact that individuals of similar body size may befriend each other. They document instead a complex contagion process, capturing the "spread" of obesity along the links of the social network. After [65].
SECTION 10.6
IMMUNIZATION
Immunization strategies specify how vaccines, treatments or drugs are distributed in the population. Ideally, should a treatment or vaccine exist, it should be given to every infected individual or those at risk of contracting the pathogen. Yet, often cost considerations, the difficulty of reaching all individuals at risk, and real or perceived side effects of the treatment prohibit full coverage. Given these constraints, immunization strategies aim to minimize the threat of a pandemic by most effectively distributing the available vaccines or treatments.

Immunization strategies are guided by an important prediction of the traditional epidemic models: If a pathogen’s spreading rate λ is reduced under its critical threshold $\lambda_c$, the virus naturally dies out (Figure 10.11). Yet, the epidemic threshold vanishes in scale-free networks, questioning the effectiveness of this strategy. Indeed, if the epidemic threshold vanishes, immunization strategies cannot move λ under $\lambda_c$. In this section we discuss how to use our understanding of the network topology to design effective network-based immunization strategies that counter the impact of the vanishing epidemic threshold.
RANDOM IMMUNIZATION
The main purpose of immunization is to protect the immunized individual from an infection. Equally important, however, is its secondary role: Immunization reduces the speed with which the pathogen spreads in a population. To illustrate this effect consider the situation when a randomly selected fraction g of individuals is immunized in a population [8]. Let us assume that the pathogen follows the SIS model (10.3). The immunized nodes are invisible to the pathogen, and only the remaining (1−g) fraction of the nodes can contract and spread the disease. Consequently, the effective degree of each susceptible node changes from ⟨k⟩ to ⟨k⟩(1−g), which decreases the spreading rate of the pathogen from λ=β/µ to λ′=λ(1−g). Next we explore the consequences of this reduction in both random and scale-free contact networks.
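As a numerical illustration of the reduction λ′=λ(1−g), the sketch below computes the smallest immunized fraction that brings λ′ down to a given epidemic threshold. It assumes the two threshold forms used in this section, $1/(\langle k\rangle+1)$ for a random network and $\langle k\rangle/\langle k^2\rangle$ for a heterogeneous one; the function names and all parameter values are hypothetical.

```python
def critical_immunization(beta, mu, lambda_c):
    """Smallest fraction g for which (beta / mu) * (1 - g) equals lambda_c.

    Returns 0.0 when the pathogen is already below threshold without
    any immunization.
    """
    spreading_rate = beta / mu
    return max(0.0, 1.0 - lambda_c / spreading_rate)


def threshold_random(k_mean):
    """Epidemic threshold assumed for a random network: 1 / (<k> + 1)."""
    return 1.0 / (k_mean + 1.0)


def threshold_heterogeneous(k_mean, k2_mean):
    """Epidemic threshold assumed for a heterogeneous network: <k> / <k^2>."""
    return k_mean / k2_mean


# Hypothetical pathogen and network parameters, for illustration only.
beta, mu = 0.5, 1.0
g_random = critical_immunization(beta, mu, threshold_random(k_mean=4.0))
g_hetero = critical_immunization(
    beta, mu, threshold_heterogeneous(k_mean=4.0, k2_mean=400.0))

print(f"random network:        g_c = {g_random:.2f}")
print(f"heterogeneous network: g_c = {g_hetero:.2f}")
```

With the same average degree, the large second moment of the heterogeneous network pushes the required coverage close to the entire population, which is the practical meaning of a vanishing epidemic threshold.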
• Random Networks
If the pathogen spreads on a random network, for a sufficiently high g the spreading rate λ′ could fall below the epidemic threshold (10.25). The immunization rate $g_c$ necessary to achieve this is calculated by setting

$$\frac{(1-g_c)\beta}{\mu} = \frac{1}{\langle k\rangle + 1},$$

obtaining

$$g_c = 1 - \frac{\mu}{\beta}\frac{1}{\langle k\rangle + 1}. \tag{10.27}$$

Consequently, if vaccination increases the fraction of immunized individuals above $g_c$, it pushes the spreading rate under the epidemic threshold $\lambda_c$. In this case τ becomes negative and the pathogen dies out naturally. This explains why health officials encourage a high fraction of the population to take the influenza vaccine: The vaccine protects not only the individual, but also the rest of the population by decreasing the pathogen’s spreading rate. Similarly, a condom not only protects the individual who uses it from contracting HIV, but also decreases the rate at which AIDS spreads in the sexual network. Hence for random networks a sufficiently high immunization rate can eliminate the pathogen from the population.

• Heterogeneous Networks
If the pathogen spreads on a network with high ⟨k²⟩, and random immunization changes λ to λ(1−g), we can use (10.26) to determine the critical immunization $g_c$:

$$\frac{\beta}{\mu}(1-g_c) = \frac{\langle k\rangle}{\langle k^2\rangle}, \tag{10.28}$$

obtaining

$$g_c = 1 - \frac{\mu}{\beta}\frac{\langle k\rangle}{\langle k^2\rangle}. \tag{10.29}$$

For a random network (10.29) reduces to (10.27). For a scale-free network with γ …

BOX 10.3
HOW TO HALT AN EPIDEMIC?
Health safety officials rely on several interventions to control or delay an epidemic outbreak. Some of the most common interventions include:

Transmission-Reducing Interventions
Face masks, gloves, and hand washing reduce the transmission rate of airborne or contact-based pathogens. Similarly, condoms reduce the transmission rate of sexually transmitted pathogens.

Contact-Reducing Interventions
For diseases with severe health consequences officials can quarantine patients, close schools and limit access to frequently visited public spaces, like movie theaters and malls. These make the network sparser by reducing the number of contacts between individuals, hence decreasing the transmission rate.

Vaccinations
Vaccinations permanently remove the vaccinated nodes from the network, as they cannot be infected nor can they …

… 0, i.e. when the number of infected nodes grows exponentially with time. This yields the condition for a global outbreak as
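$$\lambda = \frac{\beta}{\mu} > \frac{\langle k\rangle}{\langle k^2\rangle - \langle k\rangle}, \tag{10.46}$$

allowing us to write the epidemic threshold for the SIR model as (Table 10.3)

$$\lambda_c = \frac{1}{\dfrac{\langle k^2\rangle}{\langle k\rangle} - 1}. \tag{10.47}$$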
SIS MODEL
In the SIS model the density of infected nodes is given by (10.18),

$$\frac{di_k}{dt} = \beta(1 - i_k)k\Theta_k - \mu i_k. \tag{10.48}$$
There is a small but important difference in the density function of the SIS model. For the SI and the SIR models, if a node is infected, then at least one of its neighbors must also be infected or recovered, hence at most (k−1) of its neighbors are susceptible, the origin of the (−1) term in the parenthesis of (10.34). However, in the SIS model the previously infected neighbor can become susceptible again, therefore all k links of a node can be available to spread the disease. Hence we modify the definition (10.34) to obtain
$$\Theta_k = \frac{\sum_{k'} k' p_{k'} i_{k'}}{\langle k\rangle} = \Theta. \tag{10.49}$$
Again keeping only the first order terms we obtain
$$\frac{di_k}{dt} = \beta k\Theta - \mu i_k. \tag{10.50}$$
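Multiplying the equation with $k p_k/\langle k\rangle$, the weight that appears in the definition (10.49) of Θ, and summing over k we have

$$\frac{d\Theta}{dt} = \left(\beta\frac{\langle k^2\rangle}{\langle k\rangle} - \mu\right)\Theta. \tag{10.51}$$

This again has the solution

$$\Theta(t) = Ce^{t/\tau}, \tag{10.52}$$

where the characteristic time of the SIS model is

$$\tau = \frac{\langle k\rangle}{\beta\langle k^2\rangle - \langle k\rangle\mu}. \tag{10.53}$$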
A global outbreak is possible if τ>0, which yields the condition for a global outbreak as
$$\lambda = \frac{\beta}{\mu} > \frac{\langle k\rangle}{\langle k^2\rangle}, \tag{10.54}$$

and the epidemic threshold for the SIS model as (Table 10.3)

$$\lambda_c = \frac{\langle k\rangle}{\langle k^2\rangle}. \tag{10.55}$$
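The two thresholds (10.47) and (10.55) depend on the network only through the first two moments of the degree distribution, so they can be evaluated directly from a list of node degrees. The sketch below is an illustration under that reading; the function name and the two example degree sequences are hypothetical.

```python
def epidemic_thresholds(degrees):
    """Epidemic thresholds of the SIS and SIR models for a degree sequence.

    SIS: lambda_c = <k> / <k^2>             (10.55)
    SIR: lambda_c = 1 / (<k^2> / <k> - 1)   (10.47)
    The SIR expression is undefined when <k^2> equals <k>.
    """
    n = len(degrees)
    k_mean = sum(degrees) / n
    k2_mean = sum(k * k for k in degrees) / n
    return {
        "SIS": k_mean / k2_mean,
        "SIR": 1.0 / (k2_mean / k_mean - 1.0),
    }


# Hypothetical degree sequences with 100 nodes each.
homogeneous = [4] * 100          # every node has four links
hub_dominated = [1] * 99 + [99]  # one hub connected to 99 leaves

print(epidemic_thresholds(homogeneous))
print(epidemic_thresholds(hub_dominated))
```

A single hub is enough to lower both thresholds by roughly an order of magnitude in this example, in line with the role the second moment plays in (10.47) and (10.55).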
SECTION 10.12
ADVANCED TOPICS 10.C TARGETED IMMUNIZATION
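In this section we derive the epidemic threshold for the SIS and SIR models on scale-free networks under hub immunization. We start with an uncorrelated network with power-law degree distribution $p_k = c\,k^{-\gamma}$, where $c \approx (\gamma - 1)/k_{\min}^{-\gamma+1}$ and $k \ge k_{\min}$. In SECTION 10.16 we obtained for the critical spreading rate

$$\lambda_c = \frac{\langle k\rangle}{\langle k^2\rangle} = \frac{1}{\kappa} \quad \text{(SIS model)}$$

and

$$\lambda_c = \frac{1}{\dfrac{\langle k^2\rangle}{\langle k\rangle} - 1} = \frac{1}{\kappa - 1} \quad \text{(SIR model)}.$$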
Under hub immunization we immunize all nodes whose degree is larger than $k_0$. From the perspective of the epidemic this is equivalent to removing the high-degree nodes from the network. Therefore, to calculate the new critical spreading rate, we need to determine the average degree ⟨k′⟩ and the second moment ⟨k′²⟩ after the hubs have been removed. This problem was addressed in ADVANCED TOPICS 8.F, where we studied the robustness of a network under attack. We have seen that hub removal has two effects:
1) The maximum degree of the network changes to $k_0$.
2) The links connected to the removed hubs are also removed, as if we randomly remove an

$$f = \left(\frac{k_0}{k_{\min}}\right)^{-\gamma+2} \tag{10.56}$$

fraction of links.

The degree distribution of the resulting network is

$$p'_{k'} = \sum_{k=k_{\min}}^{k_0} \binom{k}{k'} f^{k-k'}(1-f)^{k'} p_k.$$

According to (8.39) and (8.40) this yields

$$\langle k'\rangle = (1-f)\langle k\rangle,$$

$$\langle k'^2\rangle = (1-f)^2\langle k^2\rangle + f(1-f)\langle k\rangle,$$

where ⟨k⟩ is the average and ⟨k²⟩ is the second moment of the degree distribution before the link removal, but with maximum degree $k_0$. For the SIS model this means

$$\lambda'_c = \frac{(1-f)\langle k\rangle}{(1-f)^2\langle k^2\rangle + f(1-f)\langle k\rangle} = \frac{1}{(1-f)\kappa + f}, \tag{10.57}$$
where, according to equation (8.47), for 2<γ<3

$$\kappa = \frac{\gamma-2}{3-\gamma}k_0^{3-\gamma}k_{\min}^{\gamma-2}. \tag{10.58}$$

Combining (10.56), (10.57) and (10.58) we obtain

$$\lambda'_c = \left[\frac{\gamma-2}{3-\gamma}k_0^{3-\gamma}k_{\min}^{\gamma-2} - \frac{\gamma-2}{3-\gamma}k_0^{5-2\gamma}k_{\min}^{2\gamma-4} + k_0^{2-\gamma}k_{\min}^{\gamma-2}\right]^{-1}. \tag{10.59}$$

For the SIR model a similar calculation yields

$$\lambda'_c = \left[\frac{\gamma-2}{3-\gamma}k_0^{3-\gamma}k_{\min}^{\gamma-2} - \frac{\gamma-2}{3-\gamma}k_0^{5-2\gamma}k_{\min}^{2\gamma-4} + k_0^{2-\gamma}k_{\min}^{\gamma-2} - 1\right]^{-1}. \tag{10.60}$$

For both the SIR and SIS models if $k_0 \gg k_{\min}$ we have

$$\lambda'_c \approx \frac{3-\gamma}{\gamma-2}k_0^{\gamma-3}k_{\min}^{2-\gamma}. \tag{10.61}$$
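To see how the immunization cutoff $k_0$ controls the threshold, the expressions (10.56) to (10.58) can be evaluated numerically and compared with the approximation (10.61). The sketch below is a direct transcription of those formulas, valid for 2<γ<3; the function names and the parameter values are hypothetical choices made for illustration.

```python
def hub_immunization_threshold(gamma, k_min, k0, model="SIS"):
    """Critical spreading rate after immunizing all nodes with degree > k0."""
    f = (k0 / k_min) ** (2.0 - gamma)                       # (10.56)
    kappa = ((gamma - 2.0) / (3.0 - gamma)
             * k0 ** (3.0 - gamma) * k_min ** (gamma - 2.0))  # (10.58)
    kappa_after = (1.0 - f) * kappa + f                     # <k'^2> / <k'>
    if model == "SIS":
        return 1.0 / kappa_after                            # (10.57), (10.59)
    if model == "SIR":
        return 1.0 / (kappa_after - 1.0)                    # (10.60)
    raise ValueError("model must be 'SIS' or 'SIR'")


def approximate_threshold(gamma, k_min, k0):
    """Large-k0 approximation (10.61), shared by the SIS and SIR models."""
    return ((3.0 - gamma) / (gamma - 2.0)
            * k0 ** (gamma - 3.0) * k_min ** (2.0 - gamma))


# Hypothetical scale-free network: gamma = 2.5, k_min = 2.
for k0 in (1000, 100, 10):
    exact = hub_immunization_threshold(2.5, 2, k0)
    approx = approximate_threshold(2.5, 2, k0)
    print(f"k0={k0:5d}  SIS threshold={exact:.4f}  approximation={approx:.4f}")
```

Lowering $k_0$, that is immunizing more hubs, raises the threshold, so hub immunization restores a finite epidemic threshold where random immunization could not.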
SECTION 10.13
ADVANCED TOPICS 10.D THE SIR MODEL AND BOND PERCOLATION
The SIR model is a dynamical model that captures the time-dependent spread of an infection in a network. Yet, it can be mapped into a static bond percolation problem [103-106]. This mapping offers analytical tools that help us predict the model’s behavior.

Consider an epidemic process on a network, so that each infected node transmits a pathogen to each of its neighbors with rate β, and recovers after a recovery time τ=1/µ. We view the infection as a Poisson process, consisting of a series of random contacts with average inter-event time 1/β. Therefore the probability that an infected node has not yet transmitted the pathogen to a susceptible neighbor decreases exponentially in time, as $e^{-\beta t}$. The infected node stays infected until it recovers after a time τ=1/µ. Therefore the overall probability that the pathogen is passed on is $1 - e^{-\beta\tau}$. This process is equivalent to bond percolation on the same network, where each directed link is occupied with probability $p_b = 1 - e^{-\beta\tau}$ (Figure 10.35). If β and τ are the same for each node, the network can be considered undirected.

Figure 10.35
Mapping Epidemics into Percolation
Consider the contact network on which the epidemic spreads. To map the spreading process into percolation, we leave in place each link with probability $p_b = 1 - e^{-\beta/\mu}$, a probability determined by the biological characteristics of the pathogen. Therefore links are removed with probability $e^{-\beta/\mu}$. The cluster size distribution of the remaining network can be mapped exactly into the outbreak size. For large β/µ we will likely have a giant component, indicating that we could face a global outbreak. A small β/µ corresponds to a virus that has difficulty spreading and we end up with numerous small clusters, indicating that the pathogen will likely die out.

Although this mapping loses the temporal dynamics of the epidemic process, it has several advantages:
• The total fraction of infected nodes in the endemic state maps into the size of the giant component of the percolation problem.
• The probability that a pathogen dies out before reaching the endemic state equals the fraction of the nodes in a randomly selected finite component in the percolation problem.
• We can determine the epidemic threshold by exploiting the known properties of bond percolation.

Consider the average number of links outgoing from a node that can be reached by a link. This allows us to retrace the course of the epidemic: If an infected individual infects on average at least one other individual, then the epidemic can reach an endemic state. Since a node can be reached by one of its k links, the probability to be reached is $kp_k/N\langle k\rangle$. The probability of each of its k−1 outgoing links infecting its neighbor is $p_b$. Since the network is randomly connected, as long as the epidemic has not spread yet, the average number of neighbors infected by the selected node is

$$\langle R_i\rangle = p_b \sum_k \frac{p_k k(k-1)}{\langle k\rangle}.$$

An endemic state can be reached only if $\langle R_i\rangle > 1$, obtaining the condition for the epidemic as [107,108]

$$\left(\frac{\langle k^2\rangle}{\langle k\rangle} - 1\right) > \frac{1}{p_b}. \tag{10.62}$$
Equation (10.62) agrees with the result (10.46) derived earlier from the dynamical models: Scale-free networks with γ≤3 have a divergent second moment, hence such networks undergo a percolation transition even at $p_b \to 0$. That is, a virus can spread on this network regardless of how small the infection probability β or the recovery time τ is.
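The percolation condition (10.62) is easy to evaluate for a concrete case. The sketch below computes the occupation probability $p_b = 1 - e^{-\beta/\mu}$ and checks (10.62) for a given degree sequence; it assumes β>0, and the function names and the example network are hypothetical.

```python
import math


def transmissibility(beta, mu):
    """Bond occupation probability p_b = 1 - exp(-beta / mu)."""
    return 1.0 - math.exp(-beta / mu)


def outbreak_possible(degrees, beta, mu):
    """Check condition (10.62): <k^2>/<k> - 1 > 1 / p_b (assumes beta > 0)."""
    n = len(degrees)
    k_mean = sum(degrees) / n
    k2_mean = sum(k * k for k in degrees) / n
    return k2_mean / k_mean - 1.0 > 1.0 / transmissibility(beta, mu)


# Hypothetical network in which every node has exactly three links,
# so that <k^2>/<k> - 1 = 2 and an outbreak requires p_b > 1/2.
degrees = [3] * 1000

print(transmissibility(1.0, 1.0), outbreak_possible(degrees, 1.0, 1.0))
print(transmissibility(0.5, 1.0), outbreak_possible(degrees, 0.5, 1.0))
```

For this homogeneous example the transition sits at β/µ = ln 2; in a network with a very large second moment the left-hand side of (10.62) grows so much that even a small $p_b$ satisfies the condition.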
SECTION 10.14
BIBLIOGRAPHY
[1] D. Normile. The Metropole, Superspreaders and Other Mysteries. Science, 339:12721273, 2013. [2] J.O. LloydSmith, S.J. Schreiber, P.E. Kopp, and W.M. Getz. Superspreading and the effect of individual variation on disease emergence. Nature, 438:355359, 2005. [3] M. Hypponen. Malware Goes Mobile. Scientific American, 295:70, 2006. [4] P. Wang, M. Gonzalez, C. A. Hidalgo, and A.L. Barabási. Understanding the spreading patterns of mobile phone viruses. Science, 324:10711076, 2009. [5] E.M. Rogers. Diffusion of Innovations. Free Press, 2003. [6] T.W. Valente. Network models of the diffusion of innovations. Hampton Press, Cresskill, NJ, 1995. [7] History and Epidemiology of Global Smallpox Eradication From the training course titled "Smallpox: Disease, Prevention, and Intervention". The CDC and the World Health Organization. Slides 1617. [8] R.M. Anderson and R.M. May. Infectious Diseases of Humans: Dynamics and Control. Oxford University Press, Oxford, 1992. [9] R. PastorSatorras and A. Vespignani. Epidemic spreading in scalefree networks. Physical Review Letters, 86:3200–3203, 2001. [10] R. PastorSatorras and A. Vespignani. Epidemic dynamics and endemic states in complex networks. Physical Review E, 63:066117, 2001. [11] Y. Wang, D. Chakrabarti, C, Wang, and C. Faloutsos. Epidemic spreading in real networks: an eigenvalue viewpoint. Proceedings of 22nd International Symposium on Reliable Distributed Systems, pg. 2534, 2003.
SPREADING PHENOMENA
56
[12] R. Durrett. Some features of the spread of epidemics and information on a random graph. PNAS, 107:4491–4498, 2010.
[13] S. Chatterjee and R. Durrett. Contact processes on random graphs with power law degree distributions have critical value 0. Ann. Probab., 37:2332–2356, 2009.
[14] C. Castellano and R. Pastor-Satorras. Thresholds for epidemic spreading in networks. Physical Review Letters, 105:218701, 2010.
[15] B. Lewin (ed.). Sex i Sverige. Om sexuallivet i Sverige 1996 [Sex in Sweden. On the Sexual Life in Sweden 1996]. National Institute of Public Health, Stockholm, 1998.
[16] F. Liljeros, C. R. Edling, L. A. N. Amaral, H. E. Stanley, and Y. Åberg. The web of human sexual contacts. Nature, 411:907–908, 2001.
[17] A. Schneeberger, C. H. Mercer, S. A. Gregson, N. M. Ferguson, C. A. Nyamukapa, R. M. Anderson, A. M. Johnson, and G. P. Garnett. Scale-free networks and sexually transmitted diseases: a description of observed patterns of sexual contacts in Britain and Zimbabwe. Sexually Transmitted Diseases, 31:380–387, 2004.
[18] W. Chamberlain. A View from Above. Villard Books, New York, 1991.
[19] R. Shilts. And the Band Played On. St. Martin’s Press, New York, 2000.
[20] P. S. Bearman, J. Moody, and K. Stovel. Chains of affection: the structure of adolescent romantic and sexual networks. Am J Sociol., 110:44–91, 2004.
[21] M. C. González, C. A. Hidalgo, and A.-L. Barabási. Understanding individual human mobility patterns. Nature, 453:779–782, 2008.
[22] C. Song, Z. Qu, N. Blumm, and A.-L. Barabási. Limits of Predictability in Human Mobility. Science, 327:1018–1021, 2010.
[23] F. Simini, M. González, A. Maritan, and A.-L. Barabási. A universal model for mobility and migration patterns. Nature, 484:96–100, 2012.
[24] D. Brockmann, L. Hufnagel, and T. Geisel. The scaling laws of human travel. Nature, 439:462–465, 2006.
[25] V. Colizza, A. Barrat, M. Barthélemy, and A. Vespignani. The role of the airline transportation network in the prediction and predictability of global epidemics. PNAS, 103:2015, 2006.
[26] L. Hufnagel, D. Brockmann, and T. Geisel. Forecast and control of epidemics in a globalized world. PNAS, 101:15124, 2004.
[27] R. Guimerà, S. Mossa, A. Turtschi, and L. A. N. Amaral. The worldwide air transportation network: Anomalous centrality, community structure, and cities' global roles. PNAS, 102:7794, 2005.
[28] C. Cattuto, et al. Dynamics of Person-to-Person Interactions from Distributed RFID Sensor Networks. PLoS ONE, 5:e11596, 2010.
[29] L. Isella, C. Cattuto, W. Van den Broeck, J. Stehlé, A. Barrat, and J.-F. Pinton. What’s in a crowd? Analysis of face-to-face behavioral networks. Journal of Theoretical Biology, 271:166–180, 2011.
[30] K. Zhao, J. Stehlé, G. Bianconi, and A. Barrat. Social network dynamics of face-to-face interactions. Physical Review E, 83:056109, 2011.
[31] J. Stehlé, N. Voirin, A. Barrat, C. Cattuto, L. Isella, J.-F. Pinton, M. Quaggiotto, W. Van den Broeck, C. Régis, B. Lina, and P. Vanhems. High-resolution measurements of face-to-face contact patterns in a primary school. PLoS ONE, 6:e23176, 2011.
[32] B.N. Waber, D. Olguin, T. Kim, and A. Pentland. Understanding Organizational Behavior with Wearable Sensing Technology. Academy of Management Annual Meeting. Anaheim, CA. August, 2008.
[33] L. Wu, B.N. Waber, S. Aral, E. Brynjolfsson, and A. Pentland. Mining Face-to-Face Interaction Networks using Sociometric Badges: Predicting Productivity in an IT Configuration Task. In Proceedings of the International Conference on Information Systems. Paris, France. December 14–17, 2008.
[34] M. Salathé, M. Kazandjieva, J.W. Lee, P. Levis, M.W. Feldman, and J.H. Jones. A high-resolution human contact network for infectious disease transmission. PNAS, 107:22020–22025, 2010.
[35] A.-L. Barabási. The origin of bursts and heavy tails in human dynamics. Nature, 435:207–211, 2005.
[36] V. Sekara and S. Lehmann. Application of network properties and signal strength to identify face-to-face links in an electronic dataset. Proceedings of CoRR, 2014.
[37] S. Eubank, H. Guclu, V.S.A. Kumar, M.V. Marathe, A. Srinivasan, Z. Toroczkai, and N. Wang. Modelling disease outbreaks in realistic urban social networks. Nature, 429:180–184, 2004.
[38] H. Ebel, L.-I. Mielsch, and S. Bornholdt. Scale-free topology of email networks. Physical Review E, 66:035103, 2002.
[39] M. Morris and M. Kretzschmar. Concurrent partnerships and transmission dynamics in networks. Social Networks, 17:299–318, 1995.
[40] N. Masuda and P. Holme. Predicting and controlling infectious diseases epidemics using temporal networks. F1000 Prime Rep., 5:6, 2013.
[41] P. Holme and J. Saramäki. Temporal networks. Physics Reports, 519:97–125, 2012.
[42] M. Karsai, M. Kivelä, R. K. Pan, K. Kaski, J. Kertész, A.-L. Barabási, and J. Saramäki. Small but slow world: how network topology and burstiness slow down spreading. Physical Review E, 83:025102(R), 2011.
[43] A. Vázquez, B. Rácz, A. Lukács, and A.-L. Barabási. Impact of non-Poissonian activity patterns on spreading processes. Physical Review Letters, 98:158702, 2007.
[44] A. Vázquez, J.G. Oliveira, Z. Dezső, K.-I. Goh, I. Kondor, and A.-L. Barabási. Modeling bursts and heavy tails in human dynamics. Physical Review E, 73:036127, 2006.
[45] A.V. Goltsev, S.N. Dorogovtsev, and J.F.F. Mendes. Percolation on correlated networks. Physical Review E, 78:051105, 2008.
[46] P. Van Mieghem, H. Wang, X. Ge, S. Tang, and F. A. Kuipers. Influence of assortativity and degree-preserving rewiring on the spectra of networks. The European Physical Journal B, 76:643, 2010.
[47] M. Boguñá, R. Pastor-Satorras, and A. Vespignani. Absence of epidemic threshold in scale-free networks with degree correlations. Physical Review Letters, 90:028701, 2003.
[48] Y. Moreno, J. B. Gómez, and A.F. Pacheco. Epidemic incidence in correlated complex networks. Physical Review E, 68:035103, 2003.
[49] J.-P. Onnela, J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási. Structure and tie strengths in mobile communication networks. PNAS, 104:7332, 2007.
[50] M. S. Granovetter. The strength of weak ties. American Journal of Sociology, 78:1360–1379, 1973.
[51] A. Galstyan and P. Cohen. Cascading dynamics in modular networks. Physical Review E, 75:036109, 2007.
[52] J. P. Gleeson. Cascades on correlated and modular random networks. Physical Review E, 77:046117, 2008.
[53] P. A. Grabowicz, J. J. Ramasco, E. Moro, J. M. Pujol, and V. M. Eguiluz. Social features of online networks: The strength of intermediary ties in online social media. PLOS ONE, 7:e29358, 2012.
[54] L. Weng, F. Menczer, and Y.-Y. Ahn. Virality Prediction and Community Structure in Social Networks. Scientific Reports, 3:2522, 2013.
[55] S. Aral and D. Walker. Creating social contagion through viral product design: A randomized trial of peer influence in networks. Management Science, 57:1623–1639, 2011.
[56] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM Trans. Web, 1, 2007.
[57] L. Weng, A. Flammini, A. Vespignani, and F. Menczer. Competition among memes in a world with limited attention. Scientific Reports, 2:335, 2012.
[58] J. Berger and K. L. Milkman. What makes online content viral? Journal of Marketing Research, 49:192–205, 2012.
[59] S. Jamali and H. Rangwala. Digging digg: Comment mining, popularity prediction and social network analysis. Proc. Intl. Conf. on Web Information Systems and Mining (WISM), 32–38, 2009.
[60] G. Szabó and B. A. Huberman. Predicting the popularity of online content. Communications of the ACM, 53:80–88, 2010.
[61] B. Suh, L. Hong, P. Pirolli, and E. H. Chi. Want to be retweeted? Large scale analytics on factors impacting retweet in twitter network. Proc. IEEE Intl. Conf. on Social Computing, 177–184, 2010.
[62] D. Centola. The spread of behavior in an online social network experiment. Science, 329:1194–1197, 2010.
[63] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. Proc. ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining, 44–54, 2006.
[64] M. Granovetter. Threshold Models of Collective Behavior. American Journal of Sociology, 83:1420–1443, 1978.
[65] N.A. Christakis and J.H. Fowler. The Spread of Obesity in a Large Social Network Over 32 Years. New England Journal of Medicine, 357:370–379, 2007.
[66] N. A. Christakis and J. H. Fowler. The collective dynamics of smoking in a large social network. New England Journal of Medicine, 358:2249–2258, 2008.
[67] R. Pastor-Satorras and A. Vespignani. Evolution and structure of the Internet: A statistical physics approach. Cambridge University Press, Cambridge, 2007.
[68] Z. Dezső and A.-L. Barabási. Halting viruses in scale-free networks. Physical Review E, 65:055103, 2002.
[69] R. Pastor-Satorras and A. Vespignani. Immunization of complex networks. Physical Review E, 65:036104, 2002.
[70] R. Cohen, S. Havlin, and D. ben-Avraham. Efficient Immunization Strategies for Computer Networks and Populations. Physical Review Letters, 91:247901, 2003.
[71] F. Fenner et al. Smallpox and its Eradication. WHO, Geneva, 1988. http://www.who.int/features/2010/smallpox/en/
[72] L. A. Rvachev and I. M. Longini Jr. A mathematical model for the global spread of influenza. Mathematical Biosciences, 75:3–22, 1985.
[73] A. Flahault, E. Vergu, L. Coudeville, and R. Grais. Strategies for containing a global influenza pandemic. Vaccine, 24:6751–6755, 2006.
[74] I. M. Longini Jr, M. E. Halloran, A. Nizam, and Y. Yang. Containing pandemic influenza with antiviral agents. Am. J. Epidemiol., 159:623–633, 2004.
[75] I.M. Longini Jr, A. Nizam, S. Xu, K. Ungchusak, W. Hanshaoworakul, D. Cummings, and M. Halloran. Containing pandemic influenza at the source. Science, 309:1083–1087, 2005.
[76] V. Colizza, A. Barrat, M. Barthélemy, A.-J. Valleron, and A. Vespignani. Modeling the worldwide spread of pandemic influenza: baseline case and containment interventions. PLoS Med, 4:e13, 2007.
[77] T. D. Hollingsworth, N.M. Ferguson, and R.M. Anderson. Will travel restrictions control the international spread of pandemic influenza? Nature Med., 12:497–499, 2006.
[78] C.T. Bauch, J.O. Lloyd-Smith, M.P. Coffee, and A.P. Galvani. Dynamically modeling SARS and other newly emerging respiratory illnesses: past, present, and future. Epidemiology, 16:791–801, 2005.
[79] I. M. Hall, R. Gani, H.E. Hughes, and S. Leach. Real-time epidemic forecasting for pandemic influenza. Epidemiol Infect., 135:372–385, 2007.
[80] M. Tizzoni, P. Bajardi, C. Poletto, J. J. Ramasco, D. Balcan, B. Gonçalves, N. Perra, V. Colizza, and A. Vespignani. Real-time numerical forecast of global epidemic spreading: case study of 2009 A/H1N1pdm. BMC Medicine, 10:165, 2012.
[81] D. Balcan, H. Hu, B. Gonçalves, P. Bajardi, C. Poletto, J. J. Ramasco, D. Paolotti, N. Perra, M. Tizzoni, W. Van den Broeck, V. Colizza, and A. Vespignani. Seasonal transmission potential and activity peaks of the new influenza A/H1N1: a Monte Carlo likelihood analysis based on human mobility. BMC Med., 7:45, 2009.
[82] P. Bajardi, et al. Human Mobility Networks, Travel Restrictions, and the Global Spread of 2009 H1N1 Pandemic. PLoS ONE, 6:e16591, 2011.
[83] P. Bajardi, C. Poletto, D. Balcan, H. Hu, B. Gonçalves, J. J. Ramasco, D. Paolotti, N. Perra, M. Tizzoni, W. Van den Broeck, V. Colizza, and A. Vespignani. Modeling vaccination campaigns and the Fall/Winter 2009 activity of the new A/H1N1 influenza in the Northern Hemisphere. EHT Journal, 2:e11, 2009.
[84] M.E. Halloran, N.M. Ferguson, S. Eubank, I.M. Longini, D.A.T. Cummings, B. Lewis, S. Xu, C. Fraser, A. Vullikanti, T.C. Germann, D. Wagener, R. Beckman, K. Kadau, C. Macken, D.S. Burke, and P. Cooley. Modeling targeted layered containment of an influenza pandemic in the United States. PNAS, 105:4639–4644, 2008.
[85] G. M. Leung and A. Nicoll. Reflections on Pandemic (H1N1) 2009 and the international response. PLoS Med, 7:e1000346, 2010.
[86] A.C. Singer, et al. Meeting report: risk assessment of Tamiflu use under pandemic conditions. Environ Health Perspect., 116:1563–1567, 2008.
[87] R. Fisher. The wave of advance of advantageous genes. Ann. Eugen., 7:355–369, 1937.
[88] J. V. Noble. Geographic and temporal development of plagues. Nature, 250:726–729, 1974.
[89] D. Brockmann and D. Helbing. The Hidden Geometry of Complex, Network-Driven Contagion Phenomena. Science, 342:1337–1342, 2013.
[90] J. S. Brownstein, C. J. Wolfe, and K. D. Mandl. Empirical evidence for the effect of airline travel on interregional influenza spread in the United States. PLoS Med, 3:e40, 2006.
[91] D. Shah and T. Zaman, in SIGMETRICS’10, Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pp. 203–214, 2010.
[92] A. Y. Lokhov, M. Mezard, H. Ohta, and L. Zdeborová. Inferring the origin of an epidemy with dynamic message-passing algorithm. Phys. Rev. E, 90:012801, 2014.
[93] P. C. Pinto, P. Thiran, and M. Vetterli. Locating the Source of Diffusion in Large-Scale Networks. Physical Review Letters, 109:068702, 2012.
[94] C. H. Comin and L. da Fontoura Costa. Identifying the starting point of a spreading process in complex networks. Phys. Rev. E, 84:056105, 2011.
[95] D. Shah and T. Zaman. Rumors in a Network: Who's the Culprit? IEEE Trans. Inform. Theory, 57:5163, 2011.
[96] K. Zhu and L. Ying. Information source detection in the SIR model: A sample path based approach. Information Theory and Applications Workshop (ITA), 1–9, 2013.
[97] B. A. Prakash, J. Vreeken, and C. Faloutsos. Spotting culprits in epidemics: How many and which ones? ICDM’12, Proceedings of the IEEE International Conference on Data Mining, 11–20, 2012.
[98] V. Fioriti and M. Chinnici. Predicting the sources of an outbreak with a spectral technique. Applied Mathematical Sciences, 8:6775–6782, 2012.
[99] W. Dong, W. Zhang, and C.W. Tan. Rooting out the rumor culprit from suspects. Proceedings of CoRR, 2013.
[100] B. Barzel and A.-L. Barabási. Universality in network dynamics. Nature Physics, 9:673, 2013.
[101] A. Barrat, M. Barthélemy, and A. Vespignani. Dynamical Processes on Complex Networks. Cambridge University Press, 2012.
[102] S. N. Dorogovtsev, A.V. Goltsev, and J. F. F. Mendes. Critical phenomena in complex networks. Reviews of Modern Physics, 80:1275, 2008.
[103] R. Cohen and S. Havlin. Complex Networks: Structure, Robustness and Function. Cambridge University Press, 2010.
[104] P. Grassberger. On the critical behavior of the general epidemic process and dynamical percolation. Mathematical Biosciences, 63:157, 1983.
[105] M. E. J. Newman. The spread of epidemic disease on networks. Physical Review E, 66:016128, 2002.
[106] C. P. Warren, L. M. Sander, and I. M. Sokolov. Firewalls, disorder, and percolation in networks. Mathematical Biosciences, 180:293, 2002.
[107] R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin. Resilience of the Internet to random breakdown. Physical Review Letters, 85:4626–4628, 2000.
[108] D. S. Callaway, M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Network robustness and fragility: percolation on random graphs. Physical Review Letters, 85:5468–5471, 2000.