Computational Prediction of Protein Complexes from Protein Interaction Networks

Sriganesh Srihari
The University of Queensland Institute for Molecular Bioscience

Chern Han Yong
Duke-National University of Singapore Medical School

Limsoon Wong
National University of Singapore

ACM Books #16

Copyright © 2017 by the Association for Computing Machinery and Morgan & Claypool Publishers

Computational Prediction of Protein Complexes from Protein Interaction Networks
Sriganesh Srihari, Chern Han Yong, Limsoon Wong
books.acm.org
www.morganclaypoolpublishers.com

ISBN: 978-1-97000-155-6  hardcover
ISBN: 978-1-97000-152-5  paperback
ISBN: 978-1-97000-153-2  ebook
ISBN: 978-1-97000-154-9  ePub

Series ISSN: 2374-6769 print / 2374-6777 electronic

DOIs:
10.1145/3064650           Book
10.1145/3064650.3064651   Preface
10.1145/3064650.3064652   Chapter 1
10.1145/3064650.3064653   Chapter 2
10.1145/3064650.3064654   Chapter 3
10.1145/3064650.3064655   Chapter 4
10.1145/3064650.3064656   Chapter 5
10.1145/3064650.3064657   Chapter 6
10.1145/3064650.3064658   Chapter 7
10.1145/3064650.3064659   Chapter 8
10.1145/3064650.3064660   Chapter 9
10.1145/3064650.3064661   References, Bios

A publication in the ACM Books series, #16
Editor in Chief: M. Tamer Özsu, University of Waterloo

First Edition
10 9 8 7 6 5 4 3 2 1

Contents

Preface xi

Chapter 1  Introduction to Protein Complex Prediction 1
    1.1  From Protein Interactions to Protein Complexes 6
    1.2  Databases for Protein Complexes 11
    1.3  Organization of the Rest of the Book 13

Chapter 2  Constructing Reliable Protein-Protein Interaction (PPI) Networks 15
    2.1  High-Throughput Experimental Systems to Infer PPIs 15
    2.2  Data Sources for PPIs 23
    2.3  Topological Properties of PPI Networks 25
    2.4  Theoretical Models for PPI Networks 27
    2.5  Visualizing PPI Networks 31
    2.6  Building High-Confidence PPI Networks 34
    2.7  Enhancing PPI Networks by Integrating Functional Interactions 50

Chapter 3  Computational Methods for Protein Complex Prediction from PPI Networks 59
    3.1  Basic Definitions and Terminologies 60
    3.2  Taxonomy of Methods for Protein Complex Prediction 60
    3.3  Methods Based Solely on PPI Network Clustering 61
    3.4  Methods Incorporating Core-Attachment Structure 78
    3.5  Methods Incorporating Functional Information 85

Chapter 4  Evaluating Protein Complex Prediction Methods 91
    4.1  Evaluation Criteria and Methodology 91
    4.2  Evaluation on Unweighted Yeast PPI Networks 93
    4.3  Evaluation on Weighted Yeast PPI Networks 95
    4.4  Evaluation on Human PPI Networks 99
    4.5  Case Study: Prediction of the Human Mechanistic Target of Rapamycin Complex 103
    4.6  Take-Home Lessons from Evaluating Prediction Methods 105

Chapter 5  Open Challenges in Protein Complex Prediction 107
    5.1  Three Main Challenges in Protein Complex Prediction 107
    5.2  Identifying Sparse Protein Complexes 112
    5.3  Identifying Overlapping Protein Complexes 118
    5.4  Identifying Small Protein Complexes 124
    5.5  Identifying Protein Sub-complexes 134
    5.6  An Integrated System for Identifying Challenging Protein Complexes 136
    5.7  Recent Methods for Protein Complex Prediction 138
    5.8  Identifying Membrane-Protein Complexes 141

Chapter 6  Identifying Dynamic Protein Complexes 145
    6.1  Dynamism of Protein Interactions and Protein Complexes 145
    6.2  Identifying Temporal Protein Complexes 149
    6.3  Intrinsic Disorder in Proteins 156
    6.4  Intrinsic Disorder in Protein Interactions and Protein Complexes 157
    6.5  Identifying Fuzzy Protein Complexes 162

Chapter 7  Identifying Evolutionarily Conserved Protein Complexes 165
    7.1  Inferring Evolutionarily Conserved PPIs (Interologs) 166
    7.2  Identifying Conserved Complexes from Interolog Networks, I 170
    7.3  Identifying Conserved Complexes from Interolog Networks, II 178

Chapter 8  Protein Complex Prediction in the Era of Systems Biology 185
    8.1  Constructing the Network of Protein Complexes 185
    8.2  Identifying Protein Complexes Across Phenotypes 189
    8.3  Identifying Protein Complexes in Diseases 192
    8.4  Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 208
    8.5  Systems Biology Tools and Databases to Analyze Biomolecular Networks 221

Chapter 9  Conclusion 225

References 233
Authors' Biographies 279

Preface

The suggestion and motivation to write this book came from Limsoon, who thought it would be a great idea to compile our (Sriganesh's and Chern Han's) Ph.D. research on protein complex prediction from protein-protein interaction (PPI) networks, conducted at the National University of Singapore, into a comprehensive book for the research community. Since we (Sriganesh and Chern Han) completed our Ph.D.s not long ago, the timing could not have been better: the topic is still fresh in our minds, and the empirical setup (datasets and software pipelines) for evaluating the methods is still in a "quick-to-run" form. However, although we had our Ph.D. theses conveniently at our disposal for reference, it was only after we started writing this book that we realized the real scale of the task we had embarked upon. The problem of protein complex prediction may be just one of the plethora of computational problems that have opened up since the deluge of proteomics (protein-protein interaction; PPI) data over the last several years. In reality, however, this problem encompasses or directly relates to several important open problems in the area, in particular the fundamental problems of modeling, visualizing, and denoising PPI networks, predicting PPIs (novel as well as evolutionarily conserved), and predicting protein function from PPI data. Therefore, to write a comprehensive, self-contained book, we had to cover even these closely related problems to some extent, or at least allude to or reference them, without losing the connection between these problems and our central problem of protein complex prediction. The early tone for writing the book in this manner was set by our review article in a 2015 special issue of FEBS Letters, where we covered a number of protein complex prediction methods based on a diverse range of topological, functional, temporal, structural, and evolutionary information. However, being only a single article, the review described the methods briefly, and to compile


this description in the form of a book we had to delve a lot deeper into the algorithmic underpinnings of each of the methods, highlight how each method utilized the information (topological, functional, temporal, structural, and evolutionary) on which it was based in its own unique way, and evaluate and study the applications of the methods across a diverse range of datasets and scenarios. To do this well, we had to (i) cover in substantial detail the preliminaries, such as the experimental techniques available to infer PPIs, the limitations of each of these techniques, PPI network topology, modeling, and denoising, the PPI databases that are available, and how functional, temporal, structural, and evolutionary information on proteins can be integrated with PPI networks; and (ii) categorize protein complex prediction methods into logical groups based on some criteria, and dedicate a separate chapter to each group to make our description comprehensive. In the book, we cover (i) in Chapter 2 and in the form of independent sections within each of the other Chapters 3, 5, 6, 7, and 8. We cover (ii) by allocating Chapters 3 and 4 to "classical" methods and their comprehensive evaluation, Chapter 5 to methods that predict certain kinds of "challenging complexes" which the classical methods do not predict well, Chapter 6 to methods that utilize temporal and structural information, Chapter 7 to methods that utilize information on evolutionary conservation, and Chapter 8 to methods that integrate other kinds of omics datasets to predict "specialized" complexes, e.g., protein complexes in diseases. The need for a book exclusively dedicated to the problem of protein complex prediction from PPI networks at this point in time cannot be overstated. Over the last two decades, a major focus of high-throughput experimental technologies, and of computational methods to analyze the generated data, has been genomics, e.g., the analysis of genome sequencing data. It is only relatively recently that this focus has started to shift toward proteomics and computational methods to analyze proteomics data. For example, while the complete sequence of the human genome was assembled more than a decade ago, it is only over the last three years that there have been similar large-scale efforts to map the human proteome. The ProteomicsDB (http://www.proteomicsdb.org/), Human Proteome Map (http://humanproteomemap.org/), and Human Protein Atlas (http://www.proteinatlas.org/) projects have all appeared only over the last three years. Similarly, The Cancer Proteome Atlas (TCPA) project (http://app1.bioinformatics.mdanderson.org/tcpa/_design/basic/index.html) complements The Cancer Genome Atlas (TCGA) project. This means that developing more effective solutions for fundamental problems such as protein complex prediction has become all the more important today, as we try to apply these solutions to larger and more complex datasets arising from these newer technologies and projects. In this respect,


we had to write the book not just by treating protein complex prediction methods (i.e., their algorithmic details) as important in their own right, but also by giving significant weight to the applications of these methods in light of today's complex datasets and research questions. Several sections within each of Chapters 6, 7, and 8 play this dual role; e.g., a section in Chapter 7 discusses the evolutionary conservation of core cellular processes based on conservation patterns of protein complexes, and a section in Chapter 8 discusses the dysregulation of these processes in diseases based on the rewiring of protein complexes between normal and disease conditions. In the end, we hope that we have done justice to what we intend this book to be. We hope that it provides valuable insights into protein complex prediction, inspires further research in the area, especially on the open challenges, and inspires new applications in diverse areas of biomedicine.

Acknowledgments

Although this book is primarily concerned with the problem of protein complex prediction, it also covers several other aspects of PPI networks. We would therefore like to dedicate this book to the students (Honors, Masters, and Ph.D. students) who worked on these different aspects of PPI networks as part of the computational biology group at the Department of Computer Science, National University of Singapore, over the years. Several of the methods covered in this book are the result of the extensive research conducted by these students. Sriganesh would like to thank Hon Wai Leong (Professor of Computer Science, National University of Singapore), under whom he conducted his Ph.D. research on protein complex prediction; Mark Ragan (Head of the Division of Genomics of Development and Disease at the Institute for Molecular Bioscience, The University of Queensland), under whom he conducted his postdoctoral research, a substantial portion of which was on identifying protein complexes in diseases; and Kum Kum Khanna (Senior Principal Research Fellow and Group Leader at the QIMR Berghofer Medical Research Institute), whose guidance played a significant part in his understanding of the biological aspects of protein complexes. Sriganesh is grateful to Mark for passing him an original copy of a 1977 volume of Progress in Biophysics and Molecular Biology in which G. Rickey Welch makes a consistent, principled argument that "multienzyme clusters" are advantageous to the cell and organism because they enable metabolites to be channeled within the clusters and protein expression to be co-regulated [Welch 1977]; it is a possession that Sriganesh will deeply cherish. Chern Han would like to thank his coauthors: Sriganesh for doing the heavy lifting in writing, editing, and


driving this project, and Limsoon Wong for guiding him through his Ph.D. journey on protein complexes. He would also like to acknowledge the support of Bin Tean Teh (Professor with the Program in Cancer and Stem Cell Biology, Duke-NUS Medical School), who currently oversees his postdoctoral research. Limsoon would like to acknowledge Chern Han and Sriganesh for doing the bulk of the writing for this book, and especially thank Sriganesh for taking the overall lead on the project. When he suggested the book to Chern Han and Sriganesh, he had not imagined that he would eventually be a co-author. We are indebted also to the Editor-in-Chief of ACM Books, Tamer Özsu, Executive Editor Diane Cerra, Production Manager Paul C. Anagnostopoulos, and the entire team at ACM Books and Morgan & Claypool Publishers for their encouragement and for producing this book so beautifully.

Sriganesh Srihari
Chern Han Yong
Limsoon Wong
May 2017

1 Introduction to Protein Complex Prediction

Unfortunately, the proteome is much more complicated than the genome.
—Carol Ezzell [Ezzel et al. 2002]

In an early survey, American biochemist Bruce Alberts termed large assemblies of proteins the protein machines of cells [Alberts et al. 1998]. Protein assemblies are composed of highly specialized parts that coordinate to execute almost all of the biochemical, signaling, and functional processes in cells [Alberts et al. 1998]. It is not hard to see why protein assemblies are more advantageous to cells than individual proteins working in an uncoordinated manner. Compare, for example, the speed and elegance of the DNA replication machinery, which simultaneously replicates both strands of the DNA double helix, with what could ensue if each of the individual components (DNA helicases for separating the double-stranded DNA into single strands, DNA polymerases for assembling nucleotides, DNA primase for generating the primers, and the sliding clamp to hold these enzymes onto the DNA) acted in an uncoordinated manner [Alberts et al. 1998]. Although they might seem like individual parts brought together to perform arbitrary functions, protein assemblies can be very specific and enormously complicated. For example, the spliceosome is composed of 5 small nuclear RNAs (snRNAs or "snurps") and more than 50 proteins, and is thought to catalyze an ordered sequence of more than 10 RNA rearrangements as it removes an intron from an RNA transcript [Alberts et al. 1998, Baker et al. 1998]. The discovery of this intron-splicing process won Phillip A. Sharp and Richard J. Roberts the 1993 Nobel Prize in Physiology or Medicine.1

1. http://www.nobelprize.org/nobel_prizes/medicine/laureates/1993/


Protein assemblies are known to number in the hundreds even in the simplest of eukaryotic cells. For example, more than 400 protein assemblies have been identified in the single-celled eukaryote Saccharomyces cerevisiae (budding yeast) [Pu et al. 2009]. However, our knowledge of these protein assemblies is still fragmentary, as is our conception of how these assemblies work together to constitute the "higher level" functional architecture of cells. A faithful attempt toward the identification and characterization of all protein assemblies is therefore crucial to elucidate the functioning of the cellular machinery. To identify the entire complement of protein assemblies, it is important to first crack the proteome, a concept so novel that the word "proteome" first appeared only around 20 years ago [Wilkins et al. 1996, Bryson 2003, Cox and Mann 2007]. The proteome, as defined in the UniProt Knowledgebase, is the entire complement of proteins expressed or derived from protein-coding genes in an organism [Bairoch and Apweiler 1996, UniProt 2015]. With the introduction of high-throughput experimental (proteomics) techniques, including mass spectrometric [Cox and Mann 2007, Aebersold and Mann 2003] and protein quantitative trait locus (QTL) technologies [Foss et al. 2007], mapping of proteins on a large scale has become feasible. Just as genomics techniques (including genome sequencing) were first demonstrated in model organisms, proteome mapping has progressed initially and most rapidly for model prokaryotes including Escherichia coli (a bacterium) and model eukaryotes including Saccharomyces cerevisiae (budding or baker's yeast), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (a nematode), and Arabidopsis thaliana (a flowering plant). Table 1.1 summarizes the numbers of proteins or protein-coding genes identified from these organisms. Of these, the proportions of protein-coding genes that are essential (genes thought to be critical for the survival of the cell or organism; "fitness genes") range from ∼2% in Drosophila to ∼6.5% in Caenorhabditis and ∼18% in Saccharomyces [Cherry et al. 2012, Chen et al. 2012]. Recent landmark studies using large-scale proteomics [Wilhelm et al. 2014, Kim et al. 2014, Uhlén et al. 2010, Uhlén et al. 2015] on Homo sapiens (human) cells have characterized >17,000 (or >90% of) putative protein-coding genes from ≥40 tissues and organs in the human body. An encyclopedic resource on these proteins, covering their levels of expression and abundance in different human tissues, is available from the ProteomicsDB (http://www.proteomicsdb.org/) [Wilhelm et al. 2014], The Human Proteome Map (http://humanproteomemap.org/) [Kim et al. 2014], and The Human Protein Atlas (http://www.proteinatlas.org/) [Uhlén et al. 2010, Uhlén et al. 2015] projects. GeneCards (http://www.genecards.org/) [Safran et al. 2002, Safran et al. 2010] aggregates information on human protein-coding genes from >125 Web sources

Table 1.1  Examples of proteome resources for some model and higher-order organisms (as of December 2015), covering also Danio rerio (zebrafish), Mus musculus (house mouse), Rattus norvegicus (Norwegian rat), Schizosaccharomyces pombe (fission yeast), and Xenopus laevis (African clawed frog)

Organism | No. of Proteins/Protein-Coding Genes | Source | References
A. thaliana | >27,400 | http://www.arabidopsis.org/ | [Huala et al. 2001, Rhee et al. 2003]
C. elegans | ∼20,500 | http://www.wormbase.org/ | [Stein et al. 2001]
D. rerio | >26,000 | http://zfin.org/ | [Sprague et al. 2001]
D. melanogaster | ∼17,700 | http://flybase.org/ | [The Flybase Consortium 1996]
E. coli | ∼4,300 | http://ecocyc.org/, http://ecoli.iab.keio.ac.jp/ | [Keseler et al. 2009, Ishii et al. 2007]
H. sapiens | >17,000 | http://humanproteomemap.org/, http://www.proteomicsdb.org/, http://www.proteinatlas.org/ | [Wilhelm et al. 2014, Kim et al. 2014, Uhlén et al. 2010, Uhlén et al. 2015]
M. musculus | ∼23,200 | http://www.informatics.jax.org/ | [Bult et al. 2010]
R. norvegicus | ∼29,600 | http://rgd.mcw.edu/ | [Shimoyama et al. 2015]
S. cerevisiae | ∼6,500 | http://www.yeastgenome.org/ | [Cherry et al. 2012, Picotti et al. 2013]
S. pombe | ∼5,100 | http://www.pombase.org/ | [Wood et al. 2012, McDowall et al. 2015]
X. laevis | ∼4,700 | http://www.xenbase.org/ | [Karpinka et al. 2015]


and presents the information in an integrative, user-friendly manner. The expression levels of nearly 200 proteins that are essential for driving different human cancers are available from The Cancer Proteome Atlas (TCPA) project (http://app1.bioinformatics.mdanderson.org/tcpa/_design/basic/index.html) [Li et al. 2013], measured from more than 3,000 tissue samples across 11 cancer types studied as part of The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/). Short-hairpin RNA (shRNA)-mediated knockdown [Paddison et al. 2002, Lambeth and Smith 2013], clustered regularly interspaced short palindromic repeats (CRISPR)/Cas9-based gene editing [Sanjana et al. 2014, Baltimore et al. 2015, Shalem et al. 2015], and disruptive mutagenesis [Bökel 2008] screens using MCF-10A (near-normal mammary), MDA-MB-435 (breast cancer), KBM7 (chronic myeloid leukemia), HAP1 (haploid), A375 (melanoma), HCT116 (colorectal cancer), and HUES62 (human embryonic stem) cells have characterized 1,500–1,880 (or 8–10%) "core" protein-coding genes as essential in human cells [Marcotte et al. 2016, Silva et al. 2008, Wang et al. 2014, Hart et al. 2015, Hart et al. 2014, Wang et al. 2015, Blomen et al. 2015]. Comparative analyses of proteomes from different species have revealed interesting insights into the evolution and conservation of proteins. For example, it is estimated that the genomes (proteomes) of human and budding yeast diverged from a common ancestor about 1 billion years ago [Douzery et al. 2014], and these share several thousand genes, accounting for more than one-third of the yeast genome [O'Brien et al. 2005, Östlund et al. 2010]. Yeast and human orthologs are highly diverged; the amino-acid sequence similarity between human and yeast proteins ranges from 9–92%, with a genome-wide average of 32%. But sequence similarity predicts only a part of the picture [Sun et al. 2016]. Recent studies [Kachroo et al. 2015, Laurent et al. 2015] have reported that 414 (nearly half) of the essential protein-coding genes in yeast could be "replaced" by human genes, with replaceability depending on gene (protein) assemblies: genes in the same process tend to be similarly replaceable (e.g., sterol biosynthesis) or not replaceable (e.g., DNA replication initiation). Irrespective of whether in a lower-order model or a higher-order complex organism, a protein has to physically interact with other proteins and biomolecules to remain functional. Estimates in human suggest that over 80% of proteins do not function alone, but instead interact to function as macromolecular assemblies [Berggård et al. 2007]. This organization of individual proteins into assemblies is tightly regulated in cellular space and time, and is supported by protein conformational changes, posttranslational modifications, and competitive binding [Gibson and Goldberg 2009]. On the basis of stability (the area of the interaction surface and the


duration of the interaction) and partner specificity, the interactions between proteins are classified as homo- or hetero-oligomeric, obligate or non-obligate, and permanent or transient [Zhang 2009, Nooren and Thornton 2003]. Proteins in obligate interactions cannot exist as stable structures on their own and are frequently bound to their partners upon translation and folding, whereas proteins in non-obligate interactions can exist as stable structures in both bound and unbound states. Obligate interactions are generally permanent or constitutive, that is, once formed they exist for the entire lifetime of the proteins, whereas non-obligate interactions may be permanent or, alternatively, transient, wherein the protein interacts with its partners for a brief time period and dissociates after that. Depending on the functional, spatial, and temporal context of the interactions, protein assemblies are classified as protein complexes, functional modules, and biochemical (metabolic) and signaling pathways. Protein complexes are the most basic forms of protein assemblies and constitute fundamental functional units within cells. Complexes are stoichiometrically stable structures and are formed from physical interactions between proteins coming together at a specific time and place. Complexes are responsible for a wide range of functions within cells, including formation of the cytoskeleton, transportation of cargo, metabolism of substrates for the production of energy, replication of DNA, protection and maintenance of the genome, transcription and translation of genes into gene products, maintenance of protein turnover, and protection of cells from internal and external damaging agents. Complexes can be permanent, i.e., once assembled they can function for the entire lifetime of cells (e.g., ribosomes), or transient, i.e., assembled temporarily to perform a specific function and disassembled after that (e.g., cell-cycle kinase-substrate complexes formed in a cell-cycle-dependent manner). Functional modules are formed when two or more protein complexes, and often other biomolecules (viz. nucleic acids, sugars, lipids, small molecules, and individual proteins), interact at a specific time and place to perform a particular function and dissociate after that. This molecular organization has been termed "protein sociology" [Robinson et al. 2007]. For example, the DNA replication machinery, highlighted earlier, is formed by a tightly coordinated assembly of DNA polymerases, DNA helicase, DNA primase, the sliding clamp, and other complexes within the nucleus to ensure error-free replication of the DNA during cell division. Pathways are formed when sets of complexes and individual proteins interact via an ordered sequence of interactions to transduce signals (signaling pathways) or metabolize substrates from one form to another (metabolic pathways). For


example, the MAPK pathway is composed of a sequence of mitogen-activated protein kinases (MAPKs) that transduce signals from the cell membrane to the nucleus, inducing the transcription of specific genes within the nucleus. Unlike complexes and functional modules, pathways do not require all their components to co-localize in time and space.
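The classification just described can be summarized in a small data model. The sketch below (in Python) is purely illustrative; the class and field names are our own choices for exposition and do not correspond to any standard ontology or database schema.

```python
from dataclasses import dataclass
from enum import Enum

class Obligation(Enum):
    OBLIGATE = "obligate"          # partners cannot exist stably on their own
    NON_OBLIGATE = "non-obligate"  # partners are stable both bound and unbound

class Persistence(Enum):
    PERMANENT = "permanent"        # once formed, lasts the proteins' lifetime
    TRANSIENT = "transient"        # forms briefly, then dissociates

class AssemblyType(Enum):
    COMPLEX = "protein complex"    # proteins co-localized in time and space
    MODULE = "functional module"   # complexes plus other biomolecules
    PATHWAY = "pathway"            # ordered interactions, not co-localized

@dataclass
class Interaction:
    protein_a: str
    protein_b: str
    obligation: Obligation
    persistence: Persistence

# A hypothetical obligate, permanent interaction inside a stable complex.
example = Interaction("ProteinX", "ProteinY",
                      Obligation.OBLIGATE, Persistence.PERMANENT)
print(example.persistence.value)   # "permanent"
```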

1.1 From Protein Interactions to Protein Complexes

Physical interactions between proteins are fundamental to the formation of protein complexes. Therefore, mapping the entire complement of protein interactions (the "interactome") occurring within cells (in vivo) is crucial for identifying and characterizing complexes. However, inferring all interactions occurring during the entire lifetime of the cells in an organism is challenging, and this challenge increases multifold as the complexity of the organism increases, e.g., for multicellular organisms made up of multiple cell types. The development of high-throughput proteomics technologies, including yeast two-hybrid (Y2H) [Fields and Song 1989], co-immunoprecipitation (Co-IP) [Golemis and Adams 2002], and affinity-purification (AP)-based [Rigaut et al. 1999] screens, has revolutionized our ability to interrogate protein interactions on a massive scale, and has enabled global surveys of interactomes from a number of organisms. In particular, up to 70% of the interactions from model organisms including yeast [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002, Gavin et al. 2002, Gavin et al. 2006, Krogan et al. 2006], fly [Guruharsha et al. 2011], and nematode [Butland et al. 2005, Li et al. 2004] have been mapped, and the identification of interactions from higher-order multicellular organisms, including the flowering plant Arabidopsis, the fish Danio (zebrafish), and several mammals (Mus musculus (house mouse), Rattus norvegicus (Norwegian rat), and humans), is rapidly underway; the interactions are cataloged in large public databases [Stark et al. 2011, Rolland et al. 2014]. The earliest and most widely used experimental technique to capture binary interacting proteins on a high-throughput scale was yeast two-hybrid (Y2H) [Fields and Song 1989]. However, datasets of protein interactions inferred from Y2H screens were found to contain significant numbers of spurious interactions [Von Mering et al. 2002, Bader and Hogue 2002, Bader et al. 2004]. This is attributed in part to the nature of the Y2H protocol, in which all potential interactors are tested within the same compartment (the nucleus) even though some of them never meet during their lifetimes due to compartmentalization (different subcellular localizations) within living cells.


Co-immunoprecipitation or affinity-purification (Co-IP/AP) techniques were introduced later, and these are more specific in detecting interactions between co-complexed proteins [Golemis and Adams 2002, Rigaut et al. 1999, Köcher and Superti-Furga 2007]. In these protocols, cohesive groups or complexes of proteins are "pulled down," from which the binary interactions between the proteins are individually inferred. However, this indirect inference can lead to over- or under-estimation of protein interactions. In the tandem affinity purification (TAP) procedure [Rigaut et al. 1999, Puig et al. 2001], proteins of interest ("baits") are TAP-tagged and purified in an affinity column with potential interaction partners ("preys"). The pulled-down complexes are subjected to mass spectrometric (MS) analysis to identify the individual components within the complexes. However, although more reliable than Y2H, the TAP/MS procedure can be elaborate, and with the inclusion of MS it can also be expensive. The exhaustiveness of TAP/MS depends on the baits used; there is no way to identify all possible complexes unless all possible baits are tested. Proteins which do not interact directly with the chosen bait but interact with one or more of the preys might also get pulled down as part of the purified complex. In some cases, these proteins are indeed part of the real complex, whereas in other cases they are not (i.e., they are contaminants); therefore, multiple purifications are required, possibly with each protein as a bait and as a prey, to identify the correct set of proteins within the complex. The TAP procedure therefore uses two successive affinity purifications so that the chance of retaining contaminants is significantly reduced. Conversely, a chosen bait might form a real complex with a set of proteins without actually interacting directly with every protein in the set, and therefore some proteins might not get pulled down as part of the purified complex. In these cases, multiple baits would need to be tested to assemble the complete complex. Moreover, since some proteins participate in more than one complex, multiple independent purifications are required to identify all hosting complexes for these proteins. Binary interactions between the proteins in a pulled-down protein complex are inferred using two models: matrix and spoke. In the matrix model, a binary interaction is inferred between every pair of proteins within the complex, whereas in the spoke model interactions are inferred only between the bait and each of its preys. Since not all pairs of proteins within a complex necessarily interact, the matrix model usually overestimates the total number of binary interactions, whereas the spoke model underestimates it. Therefore, a balance is usually struck between the two models that is close enough to the estimated total number of interactions for the species or organism.
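The spoke and matrix models are simple enough to state in code. The following is a minimal sketch in Python; the bait and prey identifiers are hypothetical, and the two functions are our own illustration rather than code from any pull-down analysis tool.

```python
from itertools import combinations

def spoke_model(bait, preys):
    """Infer binary interactions only between the bait and each prey."""
    return {frozenset((bait, prey)) for prey in preys}

def matrix_model(bait, preys):
    """Infer binary interactions between every pair of pulled-down proteins."""
    members = [bait] + list(preys)
    return {frozenset(pair) for pair in combinations(members, 2)}

# Hypothetical pull-down: bait protein A co-purifies with preys B, C, and D.
bait, preys = "A", ["B", "C", "D"]
print(len(spoke_model(bait, preys)))   # 3 interactions: A-B, A-C, A-D
print(len(matrix_model(bait, preys)))  # 6 interactions: all pairs among A, B, C, D
```

For a purification containing n proteins (one bait and n - 1 preys), the spoke model thus yields n - 1 binary interactions while the matrix model yields n(n - 1)/2, which is why the former tends to underestimate, and the latter to overestimate, the true number of interactions.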


Table 1.2  Numbers of mapped physical interactions between proteins across different model and higher-order organisms

Organism | No. of Interactions | No. of Proteins
A. thaliana | 34,320 | 9,240
C. elegans | 5,783 | 3,269
D. rerio | 188 | 181
D. melanogaster | 36,741 | 8,071
E. coli | 99 | 104
H. sapiens | 230,843 | 20,006
M. musculus | 18,465 | 8,611
R. norvegicus | 4,537 | 3,328
S. cerevisiae | 82,327 | 6,278
S. pombe | 9,492 | 2,944
X. laevis | 532 | 471

Based on BioGrid version 3.4.130 (November 2015) [Stark et al. 2011, Chatr-Aryamontri et al. 2015].

Despite differences in procedures and technologies, different experimental protocols can effectively complement one another in detecting interactions. While TAP can be more specific and detects mainly stable (co-complexed) protein interactions, Y2H can be more exhaustive and detects even transient and between-complex interactions. Based on BioGrid version 3.4.130 (November 2015) (http://thebiogrid.org/) [Stark et al. 2011, Chatr-Aryamontri et al. 2015], the numbers of mapped physical interactions range from 99 in E. coli to ∼82,300 in S. cerevisiae and ∼230,900 in H. sapiens (summarized in Table 1.2). It remains to be seen how many of these interactions actually occur in the physiological contexts of living cells or cell types, how many are subject to genetic and physiological variations, and how many still remain to be mapped. The binary interactions inferred from the different experiments are assembled into a protein-protein interaction network, or simply a PPI network. The PPI network presents a global or "systems" view of the interactome, and provides a mathematical (topological) framework to analyze these interactions. Protein complexes are expected to be embedded as modular structures within the PPI network [Hartwell et al. 1999, Spirin and Mirny 2003]. Topologically, this modularity refers to densely connected subsets of proteins separated by less-dense regions in the network [Newman 2004, Newman 2010].


Biologically, this modularity represents division of labor among the complexes, and provides robustness against disruptions to the network from internal (e.g., mutations) and external (e.g., chemical attacks) agents. Computational methods developed to identify protein complexes therefore mine for modular subnetworks in the PPI network. While this strategy appears reasonable in general, limitations in PPI datasets, arising from the shortcomings in experimental protocols highlighted above, severely restrict the feasibility of accurately predicting complexes from the network. Specifically, the limitations in existing PPI datasets that directly impact protein complex prediction include:

1. the presence of a large number of spurious (noisy) interactions;
2. the relative paucity of interactions between "complexed" proteins; and
3. missing contextual (e.g., temporal and spatial) information about the interactions.

These limitations translate to the following three main challenges currently faced by computational methods for protein complex prediction:

1. difficulty in detecting sparse complexes;
2. difficulty in detecting small complexes (containing fewer than four proteins) and sub-complexes; and
3. difficulty in deconvoluting overlapping complexes (i.e., complexes that share many proteins), especially when these complexes occur under different cellular contexts.

While interactome coverage can be improved by integrating multiple PPI datasets, the lack of agreement between datasets from different experimental protocols [Von Mering et al. 2002, Bader et al. 2004], and the multifold increase in accompanying noise (spurious interactions), tend to cancel out the advantage gained from the increased coverage. Consequently, the confidence of each interaction has to be assessed (confidence scoring) and low-confidence interactions have to be removed from the datasets (filtering) before performing any downstream analysis. To summarize, computational identification of protein complexes from interaction datasets follows these steps (Figure 1.1):

1. integrating interactions from multiple experiments and stringently assessing the confidence (reliability) of these interactions;
2. constructing a reliable PPI network using only the high-confidence interactions;
3. identifying modular subnetworks from the PPI network to generate a candidate list of protein complexes; and
4. evaluating these candidate complexes against bona fide complexes, and validating and assigning roles for novel complexes.

Figure 1.1  Identification of protein complexes from protein interaction data. (a) A high-confidence PPI network is assembled from physical interactions between proteins after discarding low-confidence (potentially spurious) interactions. (b) Candidate protein complexes are predicted from this PPI network using network-clustering approaches (e.g., the 26S and 20S proteasomes, the anaphase promoting complex, the nuclear pore complex, the BAF complex, the Fanconi anemia core complex, the NFKB1-NFKB2-RELA-RELB complex, and potentially novel complexes). The quality of the predicted complexes is validated against bona fide complexes, whereas novel complexes are functionally assessed and assigned new roles where possible.

As we shall see in the following chapters, several sophisticated approaches have been developed over the years to overcome some of these challenges. Computational methods have co-evolved with proteomics technologies, and over the last ten years a plethora of methods have been developed to predict complexes from PPI networks; they are the subject of this book. In general, computational methods complement experimental approaches in several ways. They have helped counter some of the limitations arising in proteomic studies, e.g., by eliminating spurious interactions via interaction scoring, and by enriching true interactions via prediction of missing interactions. The novel interactions and protein complexes predicted by these methods have been added back to proteomics databases, further enhancing our resources and knowledge in the field.
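To make steps 2 and 3 above concrete, the following minimal sketch (in Python, standard library only) filters a toy scored interaction list at an assumed confidence threshold and then greedily grows dense subnetworks as candidate complexes. It illustrates the general strategy only and is not any particular published prediction method; the edge list, the 0.5 confidence cutoff, and the 0.6 density cutoff are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Toy scored interactions: (protein, protein, confidence score in [0, 1]).
scored_ppis = [
    ("A", "B", 0.9), ("A", "C", 0.8), ("B", "C", 0.7),
    ("C", "D", 0.2), ("D", "E", 0.9), ("D", "F", 0.85), ("E", "F", 0.6),
]

def build_network(ppis, min_conf=0.5):
    """Step 2: adjacency sets built from interactions above a confidence threshold."""
    adj = defaultdict(set)
    for u, v, score in ppis:
        if score >= min_conf:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

def density(nodes, adj):
    """Fraction of possible edges that are present among 'nodes'."""
    if len(nodes) < 2:
        return 0.0
    present = sum(1 for u, v in combinations(nodes, 2) if v in adj[u])
    return present / (len(nodes) * (len(nodes) - 1) / 2)

def dense_modules(adj, min_density=0.6, min_size=3):
    """Step 3: grow each edge into a cluster, adding neighbors while density stays high."""
    clusters = []
    for u in adj:
        for v in adj[u]:
            cluster = {u, v}
            grew = True
            while grew:
                grew = False
                frontier = set().union(*(adj[x] for x in cluster)) - cluster
                for w in frontier:
                    if density(cluster | {w}, adj) >= min_density:
                        cluster.add(w)
                        grew = True
            if len(cluster) >= min_size and cluster not in clusters:
                clusters.append(cluster)
    return clusters

network = build_network(scored_ppis)   # drops the low-confidence C-D edge
print(dense_modules(network))          # e.g., [{'A', 'B', 'C'}, {'D', 'E', 'F'}]
```

Real prediction methods differ chiefly in how seed clusters are chosen, how cluster quality (here, plain edge density) is scored, and how overlapping candidates are merged or filtered; these design choices are taken up in Chapter 3.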

1.2 Databases for Protein Complexes

Several high-quality resources for protein complexes have been developed over the years, covering both lower-order model and higher-order organisms (summarized in Table 1.3). In total, Aloy [Aloy et al. 2004], CYC2008 [Pu et al. 2009], and MIPS [Mewes et al. 2008] contain over 450 manually curated complexes from S. cerevisiae (budding yeast). CORUM [Ruepp et al. 2008, Ruepp et al. 2010] contains ∼3,000 mammalian complexes, of which ∼1,970 are protein complexes identified from human cells. The European Molecular Biology Laboratory (EMBL) and the European Bioinformatics Institute (EBI) maintain a database of manually curated protein complexes from 18 different species, including C. elegans, H. sapiens, M. musculus, S. cerevisiae, and S. pombe [Meldal et al. 2015]. Havugimana et al. [2012] present a dataset of 622 putative human soluble protein complexes (http://human.med.utoronto.ca/) identified using high-throughput AP/MS pull-down and PPI-clustering approaches. Huttlin et al. [2015] present 352 putative human complexes identified from human embryonic kidney (HEK293T) cells (http://wren.hms.harvard.edu/bioplex/). Wan et al. [2015] present a catalog of conserved metazoan complexes (http://metazoa.med.utoronto.ca/) identified by clustering high-quality pull-down interactions from C. elegans, D. melanogaster, H. sapiens, M. musculus, and Strongylocentrotus purpuratus (purple sea urchin). This

Table 1.3  Publicly available databases for protein complexes (a)

Database | Organisms | No. of Complexes | Source | Reference
Aloy | S. cerevisiae | 102 | http://www.russelllab.org/complexes/ | [Aloy et al. 2004]
BioPlex | H. sapiens (HEK293T) | 354 | http://wren.hms.harvard.edu/bioplex/ | [Huttlin et al. 2015, 2017]
CYC2008 | S. cerevisiae | 408 | http://wodaklab.org/cyc2008/ | [Pu et al. 2009]
COMPLEAT (b) | 3 species | 8,886 | http://www.flyrnai.org/compleat/ | [Vinayagam et al. 2013]
EMBL-EBI Complex Portal | 18 species | 1,564 | http://www.ebi.ac.uk/intact/complex/ | [Meldal et al. 2015]
Human Soluble | H. sapiens | 622 | http://human.med.utoronto.ca/ | [Havugimana et al. 2012]
hu.MAP | H. sapiens | >4,600 | http://proteincomplexes.org | [Drew et al. 2017]
Metazoan conserved (c) | Metazoa | 344–490 | http://metazoa.med.utoronto.ca/ | [Wan et al. 2015]
MIPS—CORUM | Mammals | 2,837 | http://mips.helmholtz-muenchen.de/genre/proj/corum/ | [Ruepp et al. 2008, Ruepp et al. 2010]
MIPS—Yeast | S. cerevisiae | 313 | http://www.helmholtz-muenchen.de/en/ibis/ | [Mewes et al. 2008]
Ori | Mammals | 279 | http://www.bork.embl.de/Docu/variable_complexes/ | [Ori et al. 2016]

a. No. of complexes as of 2016.
b. COMPLEAT includes protein complexes from D. melanogaster, H. sapiens, and S. cerevisiae. The EMBL-EBI portal includes protein complexes from 18 different species, including C. elegans (16 complexes), H. sapiens (441), M. musculus (404), S. cerevisiae (399), and S. pombe (16). CORUM includes mammalian protein complexes, mainly from H. sapiens (64%), M. musculus (house mouse) (15%), and R. norvegicus (Norwegian rat) (12%).
c. Includes mainly conserved complexes among the metazoans C. elegans, D. melanogaster, H. sapiens, M. musculus, and Strongylocentrotus purpuratus (purple sea urchin), consisting of 344 complexes with entirely ancient proteins and 490 complexes with largely ancient proteins conserved ubiquitously among eukaryotes.


dataset includes ∼300 complexes composed entirely of ancient proteins (evolutionarily conserved from lower-order organisms) and ∼500 complexes composed largely of ancient proteins conserved ubiquitously among eukaryotes. Drew et al. [2017] present a comprehensive catalog of >4,600 computationally predicted human protein complexes, covering >7,700 proteins and >56,000 interactions, obtained by analyzing data from >9,000 published mass spectrometry experiments. Vinayagam et al. [2013] present COMPLEAT (http://www.flyrnai.org/compleat/), a database of 3,077, 3,636, and 2,173 literature-curated protein complexes from D. melanogaster, H. sapiens, and S. cerevisiae, respectively. Ori et al. [2016] combined mammalian complexes from CORUM and COMPLEAT to generate a dataset of 279 protein complexes from mammals.

1.3 Organization of the Rest of the Book

The rest of this book is organized as follows. Chapter 2 discusses important concepts underlying PPI networks and presents prerequisites for understanding subsequent chapters. We discuss the different high-throughput experimental techniques employed to infer PPIs (including the Y2H and AP/MS techniques mentioned earlier), briefly explaining the biological and biochemical concepts underlying these techniques and highlighting their strengths and weaknesses. We explain computational approaches that denoise (PPI weighting) and integrate data from multiple experiments to construct reliable PPI networks. We also discuss topological properties of PPI networks, theoretical models for PPI networks, and the various databases and software tools that catalog and visualize PPI networks. Chapter 3 forms the crux of this book, as it introduces and discusses in depth the algorithmic underpinnings of some of the classical (seminal) computational methods for identifying protein complexes from PPI networks. While some of these methods work solely on the topology of the PPI network, others incorporate additional biological information, e.g., in the form of functional annotations, with PPI network topology to improve their predictions. Chapter 4 presents a comprehensive empirical evaluation of six widely used protein complex prediction methods from the literature using unweighted and weighted PPI networks from yeast and human. Taking a known human protein complex as an example, we discuss how the methods fare in recovering this complex from the PPI network. Based on this evaluation, we explain in Chapter 5 the shortcomings of current methods in detecting certain kinds of protein complexes, e.g., protein complexes that are sparse or that overlap with other complexes. Through this, we highlight the open challenges that need to be tackled to improve the coverage and accuracy of protein


complex prediction. We discuss some recently proposed methods that attempt to tackle these open challenges, and to what extent they have been successful. Chapter 6 is dedicated to an important class of protein complexes that are dynamic in their protein composition and assembly. While some of these protein complexes are temporal in nature (i.e., they assemble at a specific timepoint and dissociate after that), others are structurally variable (e.g., they change their 3D structure and/or composition) depending on the cellular context. Quite obviously, it is not possible to detect dynamic complexes solely by analyzing the PPI network; methods that integrate gene or protein expression and 3D structural information are required. These more sophisticated methods are covered here. Chapter 7 discusses methods to identify protein complexes that are conserved between organisms or species; these evolutionarily conserved complexes provide important insights into the conservation of cellular processes through evolution. Finally, in today's era of systems biology, where biological systems are studied as a complex interplay of multiple (biomolecular) entities, we explain how protein complex prediction methods are playing a crucial role in shaping the field; these applications are covered in Chapter 8. We discuss the application of these methods to predicting dysregulated or dysfunctional protein complexes, identifying the rewiring of interactions within complexes, and discovering new disease genes and drug targets. We conclude the book in Chapter 9 by reiterating the diverse applications of protein complex prediction methods, and thereby the importance of computational methods in driving this exciting field of research.

2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

No molecule arising naturally (MAN) is an island, entire of itself.
—John Donne (1573–1631), English poet and cleric (modified [Dunn 2010] from the original quote, "No man is an island, entire of itself.")

The identification of PPIs yields insights into functional relationships between proteins. Over the years, a number of different experimental techniques have been developed to infer PPIs. This inference of PPIs is orthogonal, but also complementary, to experiments inferring genetic interactions; both provide lists of candidate interactions and implicate functional relationships between proteins [Morris et al. 2014].

2.1 High-Throughput Experimental Systems to Infer PPIs

Physical interactions between proteins are inferred using different biochemical, biophysical, and genetic techniques (summarized in Table 2.1). Yeast two-hybrid (Y2H; less commonly, YTH) [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002] and protein-fragment complementation assays [Michnick 2003, Remy and Michnick 2004, Remy et al. 2007] enable the identification of direct binary physical interactions between proteins, whereas co-immunoprecipitation or affinity purification assays [Golemis and Adams 2002, Rigaut et al. 1999, Köcher and Superti-Furga 2007, Dunham et al. 2012] enable the pull-down of whole protein complexes, from which the binary interactions are inferred. Protein-fragment complementation assay (PCA)

Table 2.1  Experimental techniques for screening protein interactions; these techniques can be employed in a high-throughput manner to screen whole protein libraries for potential interactors

Experimental Technique | Cell Assay | Interaction Type | Key References
Yeast two-hybrid (Y2H) | In vivo (yeast, mammalian) | Binary interactions | [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002]
Co-immunoprecipitation followed by mass spectrometry (Co-IP/MS) | In vitro | Co-complex relationships | [Golemis and Adams 2002, Rigaut et al. 1999]
Protein-fragment complementation (PCA) | In vivo (yeast, mammalian) | Binary interactions; can infer membrane-protein interactions | [Michnick 2003, Remy and Michnick 2004, Remy et al. 2007, Tarassov et al. 2008]
Membrane yeast two-hybrid (MYTH), mammalian membrane YTH (MaMTH) | In vivo (yeast, mammalian) | Binary interactions between membrane proteins | [Lalonde et al. 2008, Kittanakom et al. 2009, Lalonde et al. 2010, Petschnigg et al. 2014, Yao et al. 2017]



coupled with bimolecular fluorescence complementation (BiFC) [Grinberg et al. 2004] enables mapping of the interaction surfaces of proteins, and is thus a good tool for confirming protein binding. Membrane YTH (MYTH) and mammalian membrane YTH (MaMTH) [Lalonde et al. 2008, Kittanakom et al. 2009, Lalonde et al. 2010, Petschnigg et al. 2014, Yao et al. 2017] enable the identification of interactions involving membrane or membrane-bound proteins, which are typically difficult to identify using traditional Y2H and AP techniques. Techniques inferring genetic interactions [Brown et al. 2006] enable the detection of functional associations or genetic relationships between proteins (genes), but these associations do not always correspond to physical interactions. Here, we present only an overview of each of the experimental techniques; for a more descriptive survey, the reader is referred to Brückner et al. [2009], Shoemaker and Panchenko [2007], and Snider et al. [2015].

Yeast Two-Hybrid (Y2H) Screening System

Y2H was first described by Fields and Song [1989] and is based on the modularity of binding domains in eukaryotic transcription factors. Eukaryotic transcription factors have at least two distinct domains: (1) the DNA binding domain (BD), which directs binding to a promoter DNA sequence (the upstream activating sequence, UAS); and (2) the transcription activating domain (AD), which activates the transcription of target reporter genes. Splitting the BD and AD domains inactivates transcription, but even indirectly connecting the AD and BD can restore transcription, resulting in the activation of specific reporter genes. Plasmids are engineered to produce a protein product (chimeric or "hybrid") in which the BD fragment is fused in-frame onto a protein of interest (the bait), while the AD fragment is fused in-frame onto another protein (the prey) (Figure 2.1). The plasmids are then transfected into cells chosen for the screening method, usually yeast cells. If the bait and prey proteins interact, the AD and BD domains are indirectly connected, resulting in the activation of reporters within the nuclei of cells. Typically, multiple independent yeast colonies are assayed for each combination of plasmids to account for heterogeneity in protein expression levels and their ability to activate reporter transcription. This basic Y2H technique has been improved over the years to enable large library screening [Chien et al. 1991, Dufree et al. 1993, Gyuris et al. 1993, Finley and Brent 1994]. Interaction mating is one such protocol that can screen more than one bait against a library of preys, and can save considerable time and materials. In this protocol, the AD- and BD-fused proteins begin in two different haploid yeast strains with opposite mating types. These proteins are brought together by mating, a process in which two haploid cells fuse to form a single diploid cell. The diploids are then tested using conventional reporter activation for possible interactors.


Figure 2.1  Schematic representation of the yeast two-hybrid protocol to detect an interaction between bait and prey proteins. Chimeric plasmids carry the in-frame fused bait and prey. If the bait and prey proteins interact, the DNA binding domain (BD) fused to the bait and the transcription activating domain (AD) fused to the prey are indirectly connected, resulting in the activation of the reporter gene. UAS: upstream activating sequence (promoter) of the reporter gene.

Therefore, different bait-expressing strains can be mated with a library of prey-expressing strains, and the resulting diploids can be screened for interactors. It is important to know how many viable diploids have arisen and to determine the false-positive frequency of the detected interactions. True interactors tend to come up in a timeframe specific to each given bait, with false positives clustering at a different timepoint. Multiple yeast colonies are assayed to confirm the interactors. Y2H screens have been used extensively to detect interactions among yeast proteins, with two of the earliest studies reporting 692 [Uetz et al. 2000] and 841 [Ito et al. 2000] interactions for S. cerevisiae. In the bacterium Helicobacter pylori, one of the first applications of Y2H identified over 1,200 interactions, covering about 47% of the bacterial proteome [Rain et al. 2001]. Applications in fly proceeded on an even greater scale when Giot et al. [2003] identified 10,021 protein interactions involving 4,500 proteins in D. melanogaster. More recently, Vo et al. [2016] used Y2H to map binary interactions in the yeast S. pombe (fission yeast). This network, called FissionNet, consisted of 2,278 interactions covering 4,989 protein-coding genes in S. pombe. The Y2H system has also been applied to humans, with two initial studies [Rual et al. 2005, Stelzl et al. 2005] yielding over 5,000 interactions among human proteins. More recently, Rolland et al. [2014] employed Y2H to characterize nearly 14,000 human interactions.


However, inherent to this type of library screening is a usually high number of detected false-positive interactions. One possible reason for the generation of false positives is that the experimental compartmentalization (within the nucleus) of bait and prey proteins does not correspond to the natural cellular compartmentalization. Moreover, proteins that are not correctly folded under experimental conditions, or that are "sticky," may show non-specific interactions. A third source of false positives is the interaction of the preys themselves with reporter proteins, which can turn on the reporter genes. Von Mering et al. [2002] estimated the accuracy of classic Y2H to be less than 10%, with subsequent evaluations suggesting the proportion of false positives to be between 50% and 70% in large-scale Y2H interaction datasets for yeast [Bader and Hogue 2002, Bader et al. 2004].

Co-Immunoprecipitation/Affinity Purification (AP) Followed by Mass Spectrometry (Co-IP/AP followed by MS)

Complementing the in vivo Y2H screens are the in vitro Co-IP/AP followed by MS screens, which identify whole complexes of interacting proteins, from which the binary interactions between proteins can be inferred [Golemis and Adams 2002, Rigaut et al. 1999, Köcher and Superti-Furga 2007, Dunham et al. 2012]. The Co-IP/AP followed by MS screens consist of two steps: co-immunoprecipitation/affinity purification and mass spectrometry (Figure 2.2). In the first step, cells are lysed

Figure 2.2  Schematic representation of the co-immunoprecipitation/affinity purification followed by mass spectrometry (Co-IP/AP followed by MS) protocol. The protein of interest (bait) is targeted with a specific antibody and pulled down with its interactors in a cell lysate buffer; the individual components of the pulled-down complex are identified using mass spectrometry. The depicted workflow: harvest and lyse cells; incubate the lysate with tagged proteins and introduce a specific antibody for the bait; the antibody binds to the bait, non-binding proteins are washed away, and the proteins (bait and preys) in the complex are retained; elute the complex; separate the eluted proteins using gel electrophoresis; cut bands from the gel and identify the proteins by mass spectrometry (MS). These days, liquid chromatography coupled with mass spectrometry (LC-MS) is increasingly used in place of gel separation, as a combined physical-separation and MS-analysis technique [Pitt 2009].


in a radioimmunoprecipitation assay (RIPA) buffer. The RIPA buffer enables efficient cell lysis and protein solubilization while avoiding protein degradation and interference with the biological activity of the proteins. A known member of the set of proteins (the protein of interest, or bait) is epitope-tagged and is either immunoprecipitated using a specific antibody against the tag or purified using affinity columns recognizing the tag, yielding the interacting partners (preys) of the bait. Normally, this purification is more effective when two consecutive purification steps are used with proteins that are doubly tagged (hence called tandem affinity purification, or TAP). This results in an enrichment of native multi-protein complexes containing the bait. The individual components within each such purified complex are then separated by gel electrophoresis and identified using mass spectrometry. In one of the first applications of TAP/MS, Ho et al. [2002] expressed 10% of the coding open reading frames from yeast, and the identified interactions connected 25% of the yeast proteome into multi-protein complexes. Subsequently, Gavin et al. [2002], Gavin et al. [2006], and Krogan et al. [2006] purified 1,993 and 2,357 TAP-tagged proteins covering 60% and 72% of the yeast proteome, and identified 7,592 and 7,123 protein interactions from yeast, respectively. One of the first proof-of-concept studies for humans applied AP/MS to characterize interactors using 338 bait proteins selected based on their putative involvement in diseases, and identified 6,463 interactions between 2,235 proteins [Ewing et al. 2007].

Comparison of Y2H and AP/MS Experimental Techniques

A majority of the interaction data collected so far has come from Y2H screening. For example, approximately half of the data available in databases including IntAct [Hermjakob et al. 2004, Kerrien et al. 2012] and MINT [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007] are from Y2H screens [Brückner et al. 2009] (more sources of PPI data are listed in Table 2.2). This could in part be attributed to the inaccessibility of mass spectrometry, which requires expensive, large equipment. In general, however, Y2H and AP/MS techniques are complementary in the kind of interactors they detect. If a set of proteins forms a stable complex, then an AP/MS screen can determine all the proteins within the complex, but may not necessarily confirm every interacting pair (the binary interactions) within the complex. On the other hand, a Y2H screen can detect whether any given two proteins directly interact. While stable interactions between co-complexed proteins can be accurately determined using AP/MS techniques, Y2H techniques are useful for identifying transient interactions between the proteins. However, due to considerable functional


cross-talk within cells, Y2H can also report an interaction even when the proteins are not directly connected. In addition, some types of interactions can be missed in Y2H due to inherent limitations of the technique—e.g., interactions involving membrane proteins, or proteins requiring posttranslational modifications to interact—but these limitations may also occur with AP/MS-based approaches [Brückner et al. 2009]. Therefore, only a combination of different approaches that necessarily also includes computational methods (to filter out the incorrectly detected interactions) will eventually lead to a fairly complete characterization of all physiologically relevant interactions in a given cell or organism.

Protein-Fragment Complementation Assay (PCA)

PCA is a relatively new technique that can detect protein interactions in vivo, as well as their modulation and spatial or temporal changes [Michnick 2003, Morell et al. 2009, Tarassov et al. 2008]. Similar to Y2H, PCA is based on the principle of splitting a reporter protein into two fragments, each of which cannot function alone [Michnick 2003]. However, unlike Y2H, PCA is based on the formation of a bimolecular complex between the bait and prey, where both are fused to the split domains of the reporter. Importantly, the formation of this complex occurs in competition with alternative endogenous interaction partners present within the cell. The interaction brings the two split fragments into proximity, enabling their non-covalent reassembly, folding, and the recovery of the reporter protein's function [Morell et al. 2009]. Typically, the reporter proteins are fluorescent proteins, and the formation of bimolecular complexes is visualized using bimolecular fluorescence complementation (BiFC). BiFC can also be used to map the interaction surfaces of these complexes. This enables investigation of competitive binding between mutually exclusive interaction partners as well as comparison of their intracellular distributions [Grinberg et al. 2004]. PCA can be used as a screening tool to identify potential interaction partners of a specific protein [Remy and Michnick 2004, Remy et al. 2007], or to validate interactions detected using other techniques such as Y2H [Vo et al. 2016]. In one of the first applications of PCA on a genome-wide in vivo scale, Tarassov et al. [2008] identified 2,770 interactions among 1,124 proteins from S. cerevisiae. Vo et al. [2016] used PCA as an orthogonal assay to reconfirm the interactions detected in S. pombe (from the FissionNet network consisting of 2,278 interactions; discussed earlier). PCA has also been employed to validate interactions between membrane proteins or membrane-associated proteins [Babu et al. 2012, Shoemaker and Panchenko 2007] (discussed next).


Techniques for Inferring Membrane-Protein Interactions

Membrane proteins are attached to or associated with the membranes of cells or their organelles, and constitute approximately 30% of the proteomes of organisms [Carpenter et al. 2008, Von Heijne 2007, Byrne and Iwata 2002]. Being non-polar (hydrophobic), membrane proteins are difficult to crystallize using traditional X-ray crystallography compared to soluble proteins, and are the least studied among all proteins using high-throughput proteomics techniques [Carpenter et al. 2008]. Membrane proteins are involved in the transportation of ions, metabolites, and larger molecules such as proteins, RNA, and lipids across membranes, in sending and receiving chemical signals and propagating electrical impulses across membranes, in anchoring enzymes and other proteins to membranes, in controlling membrane lipid composition, and in organizing and maintaining the shape of organelles and the cell itself [Lodish et al. 2000]. In humans, the G-protein-coupled receptors (GPCRs), which are membrane proteins involved in signal transduction across membranes, alone account for 15% of all membrane proteins; and 30% of all drug targets are GPCRs [Von Heijne 2007]. Due to the key roles of membrane proteins, identifying interactions involving these proteins has important applications, especially in drug development.

Membrane protein complexes are notoriously difficult to study using traditional high-throughput techniques [Lalonde et al. 2008]. Intact membrane-protein complexes are difficult to pull down using conventional AP/MS systems. This is due in part to the hydrophobic nature of membrane proteins as well as the ready dissociation of subunit interactions, either between trans-membrane subunits or between trans-membrane and cytoplasmic subunits [Barrera et al. 2008]. Further, membrane protein structure is difficult to study by commonly used high-resolution methods, including X-ray crystallography and NMR spectroscopy. A major avenue by which one can understand membrane proteins and their complexes is by mapping the membrane-protein "subinteractome"—the subset of interactions involving membrane proteins. The conventional Y2H system is confined to the nucleus of the cell, thereby excluding the study of membrane proteins. New biochemical techniques have been developed to facilitate the characterization of interactions among membrane proteins. Among these is the split-ubiquitin membrane yeast two-hybrid (MYTH) system [Miller et al. 2005, Kittanakom et al. 2009, Stagljar et al. 1998, Petschnigg et al. 2012]. This system is based on ubiquitin, an evolutionarily conserved 76-amino acid protein that serves as a tag for proteins targeted for degradation by the 26S proteasome. The presence of ubiquitin is recognized by ubiquitin-specific proteases (UBPs) located in the nucleus and cytoplasm of all eukaryotic cells. Ubiquitin can be split and expressed as two halves: the amino-


terminal (N) and the carboxyl terminal (C). These two halves have a high affinity for each other in the cell and can reconstitute to form pseudo-ubiquitin that is recognizable by UBPs. In MYTH, the bait proteins are fused to the C-terminal of a split-ubiquitin, and the prey proteins are fused to the N-terminal. The two halves reconstitute into a pseudo-ubiquitin protein if there is affinity between the bait and prey proteins. This pseudo-ubiquitin is recognized by UBPs, which cleaves after the C-terminus of ubiquitin to release the transcription factor, which then enters the nucleus to activate reporter genes. Two of the earliest studies using the MYTH screens reported a fair number of interactions among membrane proteins from yeast: 343 interactions among 179 proteins by Lalonde et al. [2010], and 808 interactions among 536 proteins by Miller et al. [2005]. PCA has also been adopted to identify and/or verify membrane-protein interactions. For example, Babu et al. [2012] used PCA to validate and integrate 1,726 yeast membrane-protein interactions obtained from multiple studies, and these encompassed 501 putative membrane protein complexes. The mammalian version of membrane yeast two-hybrid, MaMTH, is also based on the split-ubiquitin assay and is derived from the MYTH assay. Stagljar and colleagues [Petschnigg et al. 2014, Yao et al. 2017] used MaMTH to probe interactions involving the epidermal growth factor receptor/receptor tyrosine-protein kinase (RTK) ErbB-1 (EGFR/ERBB1), Erb-B2 receptor tyrosine kinase 2 (ERBB2), and other RTKs that localize to the plasma membrane in human cells. When applied to human lung cancer cells, the assay identified 124 interactors for wild-type and mutant EGFR [Petschnigg et al. 2014].

2.2 Data Sources for PPIs

Several public and proprietary databases now catalog protein interactions from both lower-order model and higher-order organisms (summarized in Table 2.2). These databases contain PPI data in an acceptable format required for data deposition, such as IMEx (http://www.imexconsortium.org/submit-your-data) [Orchard et al. 2012]. The Biomolecular Interaction Network Database (BIND) [Bader et al. 2003], now called the Biomolecular Object Network Database (BOND), includes experimentally determined protein-protein, protein-small molecule, and protein-nucleic acid interactions. BioGrid [Stark et al. 2011] catalogs physical and genetic interactions inferred from multiple high-throughput experiments. The Database of Interacting Proteins (DIP) [Xenarios et al. 2002] contains experimentally determined protein interactions with a "core" subset of interactions that have passed quality

Table 2.2  Public and proprietary databases for protein-protein, protein-small molecule, and protein-DNA interactions

PPI Database       Source                                                     Reference
BIND               http://bind.ca                                             [Bader et al. 2003]
BioGrid            http://thebiogrid.org                                      [Stark et al. 2011]
CCSB               http://interactome.dfci.harvard.edu/                       [Rolland et al. 2014, Yu et al. 2008, Yu et al. 2011]
CYGD               http://mips.helmholtz-muenchen.de/genre/proj/yeast/        [Güldener et al. 2005]
DIP                http://dip.doe-mbi.ucla.edu                                [Xenarios et al. 2002, Salwinski et al. 2004]
EMBL-EBI IntAct    http://www.ebi.ac.uk/intact/                               [Hermjakob et al. 2004, Kerrien et al. 2012]
HAPPI              http://discern.uits.iu.edu:8340/HAPPI/                     [Chen et al. 2009]
HPRD               http://www.hprd.org/                                       [Peri et al. 2004, Prasad et al. 2009]
InnateDB           http://www.innatedb.com/                                   [Lynn et al. 2008]
iRefIndex          http://irefindex.org/wiki/index.php?title=iRefIndex        [Razick et al. 2008, Turner et al. 2010]
MINT/HomoMINT      http://mint.bio.uniroma2.it/mint/                          [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007, Persico et al. 2005]
MIPS               http://mips.helmholtz-muenchen.de/proj/ppi/                [Mewes et al. 2008]
MPPI               http://mips.helmholtz-muenchen.de/proj/ppi/                [Pagel et al. 2005]
OPHID/IID          http://ophid.utoronto.ca/                                  [Brown and Jurisica 2005]
STRING             http://string-db.org/                                      [Von Mering et al. 2003, Szklarczyk et al. 2011]


assessment (for example, based on literature verification). The Centre for Cancer Systems Biology (CCSB) Interactome Database at Harvard [Rolland et al. 2014, Yu et al. 2008, Yu et al. 2011] contains yeast, plant, virus, and human interactions. STRING [Von Mering et al. 2003, Szklarczyk et al. 2011] catalogs physical and functional interactions inferred from experimental and computational techniques. MIPS Comprehensive Yeast Genome Database (CYGD) [G¨ uldener et al. 2005] and MIPS Mammalian Protein-Protein Interaction Database (MPPI) [Pagel et al. 2005] catalog protein interactions and also expert-curated protein complexes from yeast and mammals. The Human Protein Reference Database (HPRD) [Peri et al. 2004, Prasad et al. 2009] mainly contains experimentally identified human interactions. IRefIndex [Razick et al. 2008, Turner et al. 2010] and Integrated Interaction Database (IID) [Brown and Jurisica 2005] integrate experimental and computationally predicted interactions for human and several other species.
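
Most of these databases allow their interaction records to be downloaded in bulk, typically as tab-delimited files of protein pairs. As a minimal illustration (not tied to any particular database's export format), the sketch below assumes a hypothetical two-column file of interacting protein identifiers and loads it into an undirected graph; real downloads such as PSI-MI TAB files carry many more columns and need their own parsers.

    # Load a tab-delimited list of interacting protein pairs into a network.
    # Assumes a hypothetical file "interactions.tsv" with one interaction per
    # line: <protein A> <tab> <protein B>.
    import networkx as nx

    def load_ppi_network(path):
        g = nx.Graph()  # undirected PPI network G = (V, E)
        with open(path) as handle:
            for line in handle:
                if not line.strip() or line.startswith("#"):
                    continue  # skip blank lines and comment/header lines
                parts = line.rstrip("\n").split("\t")
                if len(parts) < 2:
                    continue
                a, b = parts[0], parts[1]
                if a != b:  # ignore self-interactions
                    g.add_edge(a, b)
        return g

    if __name__ == "__main__":
        network = load_ppi_network("interactions.tsv")
        print(network.number_of_nodes(), "proteins,",
              network.number_of_edges(), "interactions")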

2.3 Topological Properties of PPI Networks

A simple yet effective way to represent interaction data is in the form of an undirected network called a protein-protein interaction network or simply PPI network, given as G = (V, E), where V is the set of proteins and E is the set of physical interactions between the proteins. Such a network presents a global or "systems" view of the entire set of proteins and their interactions, and provides a topological (mathematical) framework to interrogate the interactions. In the definitions throughout this book, we also use V(G) and E(G) to refer to the set of proteins and interactions of a (sub)network of G. For a protein v ∈ V, the set N(v) or N_v includes all immediate neighbors of v, and deg(v) = |N(v)| is the degree of v. These neighbors together with their interactions, E_v = E(v) = {(v, u) : u ∈ N(v)} ∪ {(u, w) : u ∈ N(v), w ∈ N(v) ∩ N(u)}, constitute the local (immediate) neighborhood subnetwork of v.

PPI networks, like most real-world networks, have characteristic topological properties which are distinct from those of random networks. But, to understand this distinction we first need to understand what random networks are. Traditionally, random networks have been described using the Erdős–Rényi (ER) model, in which G(n, p) is a random network with |V| = n nodes and each possible edge connecting pairs of these nodes has probability p of existing [Erdős 1960, Bollobás 1985]. The expected number of edges in the network is \binom{n}{2} p, and the expected mean degree is np. Alternatively, a random network is defined as a network chosen uniformly at random from the collection of all \binom{\binom{n}{2}}{m} possible networks with n nodes and m edges. If p is the probability for the existence of an edge, the probability for each network


in this collection is

    p^m (1 - p)^{\binom{n}{2} - m},                                                            (2.1)

where the closer p gets to 1, the more skewed the collection is toward networks with a higher number of edges. When p = 1/2, the collection contains all 2^{\binom{n}{2}} possible networks, each with equal probability (1/2)^{\binom{n}{2}}.

A marked feature of PPI networks that makes them non-random is the extremely broad distribution of protein degrees: while a majority of proteins have a small number of immediate binding partners, there exist some proteins, referred to as "hubs," with an unusually large number of binding partners. Moreover, the degrees of these hubs are far larger than the average node degree of the network. This degree distribution P(k) can be approximated by a power law P(k) ∼ k^{−γ}, where k is the node degree and γ is a small constant. Such networks are called scale-free networks [Barabási 1999, Albert and Barabási 2002].

Proteins within PPI networks exhibit an inherent tendency to cluster, which can be quantified by the local clustering coefficient for these proteins [Watts and Strogatz 1998]. For a selected protein v that is connected to |N(v)| other nodes, if these immediate neighbors were part of a clique (a complete subgraph), there would be |N(v)| (|N(v)| − 1)/2 interactions between them. The ratio of the actual number of interactions that exist between these nodes to the total number of possible interactions gives the local clustering coefficient CC(v) for the protein v:

    CC(v) = \frac{2 |\{(u, w) : u, w \in N(v), (u, w) \in E\}|}{|N(v)| \cdot (|N(v)| - 1)}.      (2.2)

The average clustering coefficient of the entire network is therefore given by

    CC(G) = \frac{1}{|V(G)|} \sum_{v \in V(G)} CC(v).                                           (2.3)

This clustering property gives rise to groups (subnetworks) of proteins in the PPI network that exhibit dense interactions within the groups, but sparse or less-dense interactions between the groups. The interactions between the groups occur via a few "central" proteins through which most paths in the network pass. The average shortest path length \bar{d}(G) is given by

    \bar{d}(G) = \frac{2 \sum_{u \neq v} \ell(u, v)}{|V(G)| \cdot (|V(G)| - 1)},                 (2.4)

where \ell(u, v) is the length of the shortest path between two connected nodes u and v. This average shortest path length of the PPI network is significantly shorter than


that of an ER random network; in an ER network, the average shortest path length is proportional to ln n. However, both PPI networks and ER networks fall under "small-world" networks, because the average path length between any two nodes is still significantly smaller than the network size: n >> k >> ln n >> 1, where k is the average node degree. The average clustering coefficient of a PPI network is significantly higher than that of a random network constructed on the same protein set and with the same average shortest path length. This small-world property can also be quantified using two coefficients: closeness and betweenness [Hormozdiari et al. 2007]. The closeness of v is defined based on its average shortest path length to all other proteins reachable from v (that is, within the same connected component as v) in the network,

    Cl(v) = \frac{|V(G)| - 1}{\sum_{u \in R(G, v)} \ell(u, v)},                                  (2.5)

where R(G, v) is the set of proteins reachable from v. Closeness is thus the inverse of the average distance of v to all other nodes in the network. The betweenness of the node v measures the extent to which v lies "between" any pair of proteins connected to v in the network. Let S_{xy} be the number of shortest paths between a pair x, y ∈ R(G, v), and let S_{xy}(v) be the number of these shortest paths that pass through v. The betweenness Bet(v) of v is defined as:

    Bet(v) = \sum_{x, y \in R(G, v), x \neq y} \frac{S_{xy}(v)}{S_{xy}}.                         (2.6)

The distributions of closeness and betweenness coefficients for PPI networks are significantly different from those of random networks [Hormozdiari et al. 2007]. In particular, PPI networks contain central proteins with high betweenness, which connect and hold together different groups of proteins or regions of the network.
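
The quantities defined in Equations 2.2–2.6 are straightforward to compute once a PPI network is represented as a graph. The following is a minimal sketch using networkx on a toy network; the edge list is invented purely for illustration, and the library's normalization conventions for closeness and betweenness differ slightly from the unnormalized definitions above.

    # Compute basic topological properties of a (toy) PPI network: degree,
    # local and average clustering coefficients, average shortest path length,
    # closeness, and betweenness.
    import networkx as nx

    # Hypothetical toy network; a real analysis would load an organism-scale
    # PPI network instead.
    edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
             ("D", "E"), ("D", "F"), ("E", "F"), ("C", "E")]
    g = nx.Graph(edges)

    degrees = dict(g.degree())                       # deg(v) = |N(v)|
    local_cc = nx.clustering(g)                      # Equation 2.2
    avg_cc = nx.average_clustering(g)                # Equation 2.3
    avg_path = nx.average_shortest_path_length(g)    # Equation 2.4 (connected graph)
    closeness = nx.closeness_centrality(g)           # Equation 2.5 (normalized variant)
    betweenness = nx.betweenness_centrality(g, normalized=False)  # Equation 2.6 (per-pair counting)

    print("degrees:", degrees)
    print("average clustering coefficient:", round(avg_cc, 3))
    print("average shortest path length:", round(avg_path, 3))
    print("closeness:", {v: round(c, 3) for v, c in closeness.items()})
    print("betweenness:", {v: round(b, 3) for v, b in betweenness.items()})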

2.4 Theoretical Models for PPI Networks

Studying the topological properties of PPI networks has gained considerable attention in the last several years, and in particular various network models have been proposed to describe PPI networks. As new PPI data became available, these theoretical models have been refined to model the data more accurately. These include the earliest models such as the Erdős–Rényi model [Erdős 1960, Bollobás 1985], and more recent ones such as the Barabási–Albert [Barabási 1999, Albert and


Barabási 2002], Watts–Strogatz [Watts and Strogatz 1998], and hierarchical [Ravasz and Barabási 2003] network models.

With a fixed probability for each edge, the Erdős–Rényi (ER) networks (discussed above) appear intuitive and were once commonly used to model real-world networks. In reality, however, ER networks are rarely found: most real-world networks—including road and airline routes, social contacts, webpage links, and also PPI networks—do not have evenly distributed degrees. Moreover, although ER networks have the small-world property, they have almost no clustering effects. For these reasons, using ER networks as null models for comparison against real-world networks, including PPI networks, is usually inappropriate.

In the Barabási–Albert (BA) model [Barabási 1999, Albert and Barabási 2002], nodes are added one at a time to simulate network growth. The probability of an edge forming between an incoming new node and existing nodes is based on the principle of preferential attachment—that is, existing nodes with higher degrees have a higher likelihood of forming an edge with the incoming node. Hence, the degree distribution follows a power-law (scale-free) distribution P(k) ∼ k^{−γ} and exhibits small-world behavior but, like ER, lacks modular organization (low clustering behavior). The BA model is particularly important as one of the earliest network models proposed as a suitable null model, instead of ER, for comparisons against biological networks [Barabási 1999, Albert and Barabási 2002]. This led to a panoply of works describing the properties of various biological network types, including metabolic [Wagner and Fell 2001] and PPI networks [Yook et al. 2004], using BA. At the same time, the deficiencies of the BA model became clearer: while the scale-free property is preserved, modularity is not. Thus, better null models are needed.

The Watts–Strogatz (WS) model [Watts and Strogatz 1998] is seldom used for modeling biological networks, but it demonstrates how easily a non-small-world network (e.g., a lattice graph) can be transformed into a small-world network by random rewiring of a few edges. The generation procedure is remarkably simple: from a ring lattice with n vertices and k edges per vertex, each edge is rewired at random with probability p. If p = 0, the graph is completely regular, and if p = 1, the graph is completely random (disordered). For the WS model, the intermediate region 0 < p < 1 can be selected to examine its intrinsic properties. Aside from its potential use for investigating the level of ordered-ness of an observed network, the WS model has seen few applications in biology. Its major contribution toward network biology is the insight that, given how easily the small-world property can emerge from minor disruption of a regular lattice graph, this property is not a unique defining characteristic. Therefore, the myriad


of research papers that describe the biological network under analysis as small-world are not reporting particularly useful information.

The BA model is able to capture the scale-free property observed in biological networks but has very low internal clustering. This seemed at odds with the idea that biological networks, or at least PPI networks, are modular in nature. Proteins achieve their functionality by virtue of extensive interconnections with other proteins, forming simple physical interactions, which at higher levels can be envisaged as complexes and functional modules. To better capture the clustering effects in real biological networks, hierarchical models were proposed. Hierarchical network (HN) models [Ravasz and Barabási 2003] are iterative approaches for generating networks that encapsulate both scale-free and highly clustered behavior. For a real network, the average clustering coefficient of the entire network is higher than that of the BA and ER models. To construct an HN model, tightly clustered cores with high clustering coefficients are first generated. These are then iteratively connected by selecting random nodes in each core and having them connect to one another. The downside, however, is that developing HN models requires making certain assumptions. For instance, we have to assume that we know the distribution of the sizes and clustering densities of the embedded modules. We also assume that these modules combine in an iterative manner. In biological networks, however, it seems that the boundary between modules is not very sharp, with high levels of interconnectivity between them.

PPI networks from the yeast S. cerevisiae and the bacterium H. pylori resulting from some of the high-throughput studies mentioned earlier [Uetz et al. 2000, Ito et al. 2000, Xenarios et al. 2002] have been shown to have scale-free degree distributions [Pržulj et al. 2004, Jeong et al. 2001, Maslov and Sneppen 2002]. However, the degree distribution of the larger D. melanogaster (fruit fly) PPI network has been shown to decay faster than a power law [Giot et al. 2003]. Furthermore, the shortest path distribution and the frequencies of cycles of 3–15 nodes in the fruit fly network differ from those of randomly rewired networks which preserve the same degree distribution as the original PPI network [Giot et al. 2003]. To better capture these frequency distributions of node-cycles, geometric random graph models were proposed. In a geometric random graph model, the nodes correspond to independently and uniformly distributed points in a metric space, and two nodes are linked by an edge if the distance between them is smaller than or equal to some radius r, where the distance is an arbitrary distance norm in the metric space [Penrose 2003]. Pržulj et al. [2004] used geometric random graphs to model PPI networks, by defining the points in 2D, 3D, and 4D Euclidean space, with the distance between the points measured using Euclidean distance. Pržulj et al. studied the similarity between


Figure 2.3  The 29 graphlets of 3–5 nodes (grouped into 3-node, 4-node, and 5-node graphlets and numbered 1–29) defined by Pržulj et al. [2004].

geometric random graphs and PPI networks using the distributions of graphlets. A graphlet is a connected network with a small number of nodes, and graphlet frequency is the number of occurrences of a graphlet in a network. The authors defined 29 types of graphlets on 3–5 nodes (Figure 2.3). The relative frequency of a graphlet i in the PPI network G is defined as N_i(G)/T(G), where N_i(G) is the number of graphlets of type i ∈ {1, ..., 29}, and T(G) = \sum_{i=1}^{29} N_i(G) is the total number of graphlets of G. The same is defined for a generated geometric random network H. The relative graphlet frequency distance D(G, H) between the two networks G and H is measured as

    D(G, H) = \sum_{i=1}^{29} |F_i(G) - F_i(H)|,                                                 (2.7)

where F_i(G) = −log(N_i(G)/T(G)); the logarithm is used because the graphlet frequencies can differ by several orders of magnitude. The authors generated geometric random networks with the same number of nodes as the proteins in the S. cerevisiae and D. melanogaster PPI networks from high-throughput experiments. They found that, although the degree distributions of these PPI networks were closer to those of scale-free random networks, other topological parameters matched closely with those of geometric random networks. Specifically, the diameter, the local and whole-network clustering coefficients, and the relative graphlet frequencies, computed as above, showed that PPI networks were closer to geometric


random networks than to scale-free networks. Furthermore, the authors suggest that as the quality and quantity of PPI data improve, geometric random networks may become better suited than scale-free and other models for modeling PPI networks.
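
The null models discussed above are all available as standard graph generators, which makes it easy to compare an observed PPI network against them. The sketch below, with node counts and parameters chosen arbitrarily for illustration, generates ER, BA, WS, and geometric random networks of comparable size and compares their degree spread and average clustering coefficients.

    # Generate the random-network null models discussed above and compare
    # simple statistics. Node count and parameters are arbitrary examples.
    import networkx as nx

    n = 1000          # number of nodes (proteins)
    avg_deg = 6       # target average degree

    models = {
        "Erdos-Renyi":     nx.gnp_random_graph(n, p=avg_deg / (n - 1), seed=1),
        "Barabasi-Albert": nx.barabasi_albert_graph(n, m=avg_deg // 2, seed=1),
        "Watts-Strogatz":  nx.watts_strogatz_graph(n, k=avg_deg, p=0.1, seed=1),
        # Geometric random graph: nodes are random points in the unit square,
        # joined if their Euclidean distance is at most the given radius.
        "Geometric":       nx.random_geometric_graph(n, radius=0.05, seed=1),
    }

    for name, g in models.items():
        degs = [d for _, d in g.degree()]
        print(f"{name:16s} mean degree = {sum(degs) / n:5.2f}  "
              f"max degree = {max(degs):4d}  "
              f"avg clustering = {nx.average_clustering(g):.3f}")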

2.5 Visualizing PPI Networks

Visualization is an important component of the analysis of PPI networks. Very simply, a PPI network is visualized as dots and lines, where the dots (or other shapes) represent proteins and the lines connecting the dots represent interactions between the proteins. Such a visualization enables quick exploration of the topological properties of PPI networks—for example, counting the neighbors of a selected protein, counting the connected components of the network, or spotting dense and sparse regions of the network. However, the ease of this exploration and subsequent analysis depends on how effective the visualization method or tool used to render the PPI network is. This rendering concerns the field of graph or network layout, where layout algorithms are used to draw the network—by appropriately positioning the dots and lines—in a 2D space. A good layout should (visually) bring out the topological properties of the PPI network easily, and this has been a subject of research in graph visualization for several years. Here we briefly introduce some of the commonly used algorithms for PPI network layout; for details the readers are referred to the excellent reviews of Morris et al. [2014], Agapito et al. [2013], and Doncheva et al. [2012].

Random layout arranges the dots (nodes) and lines (edges) in a random manner in the 2D space. The advantage of this algorithm is its simplicity; on the other hand, it presents a high number of criss-crossing edges and does not necessarily use the available space optimally, especially for large networks. Circular layout arranges the nodes in succession, one after the other, on a circle. While this algorithm also suffers from criss-crossing of edges, it is widely used to visualize small (sub)networks such as protein complexes and pathways. Tree layout arranges the network as a tree with a hierarchical organization of the nodes. This is obviously more suitable for visualizing trees than networks with cycles. Often, the layout is "ballooned out" by placing the children of each node in the tree on a circle surrounding the node, resulting in several concentric circles. Force-directed layout places the nodes according to a system of forces based on physical concepts in spring mechanics. Typically, the system combines attractive forces between adjacent nodes with repulsive forces between all pairs of nodes to seek an optimal layout in which the overall edge lengths are small while the nodes are well separated.
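
For small networks, the layouts described above can be tried directly with standard libraries before turning to a dedicated tool such as Cytoscape. A minimal networkx/matplotlib sketch follows; the toy graph and parameters are arbitrary examples.

    # Draw a small toy network with the circular and force-directed (spring)
    # layouts described above. The graph used here is a random example.
    import networkx as nx
    import matplotlib.pyplot as plt

    g = nx.gnp_random_graph(30, 0.12, seed=7)   # toy stand-in for a small PPI subnetwork

    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    for ax, (name, pos) in zip(axes, [
            ("circular layout", nx.circular_layout(g)),
            ("force-directed (spring) layout", nx.spring_layout(g, seed=7)),
    ]):
        nx.draw_networkx(g, pos=pos, ax=ax, node_size=120, with_labels=False)
        ax.set_title(name)
        ax.axis("off")

    plt.tight_layout()
    plt.savefig("layouts.png")   # or plt.show() in an interactive session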


Figure 2.4  PPI network visualization using the Cytoscape 3.4.0 tool [Shannon et al. 2003, Smoot et al. 2010]. A portion of the human PPI network (1,977 proteins and 5,679 interactions; average number of neighbors 5.47, network diameter 31, clustering coefficient 0.573) downloaded from BioGrid [Stark et al. 2011] is visualized here using force-directed layout. Basic statistics—average number of neighbors, network diameter, etc.—are displayed for the network. Proteins (e.g., BRCA1), protein complexes (e.g., the proteasome, the eukaryotic initiation factor 4F complex, and the nuclear pore complex), pathways (e.g., the Fanconi anaemia pathway), and cellular processes (e.g., DNA-damage repair, chromatin remodeling, and transcriptional regulation) are "pulled-out" and highlighted. Cytoscape provides "link-out" to external databases and tools—e.g., KEGG [Kanehisa and Goto 2000]—to enable further analysis.

Once the PPI network is laid out, a good visualization tool should allow at least some basic visual analysis of the network. The following aspects become important here (see Figure 2.4). The ease of navigation through the PPI network to explore individual proteins and interactions is of prime importance. In particular, the tool should be able to load and enable navigation of even large networks.


Next is the provision to annotate the network using internal (e.g., labeling nodes by serial numbers or by their network properties) or external information (see below). The tool should also be able to compute (basic) topological properties of the network—for example, node degree, shortest path lengths, and clustering, closeness, and betweenness coefficients. These statistics help users get at least a preliminary idea of the network. Another valuable feature of a good tool is link-out to external databases, for example to PubMed literature (http://www.ncbi.nlm.nih.gov/pubmed), the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/), UniProt or SwissProt (http://www.uniprot.org/) [Bairoch and Apweiler 1996, UniProt 2015], BioGrid (http://thebiogrid.org/) [Stark et al. 2011], Gene Ontology (GO) (http://www.geneontology.org/) [Ashburner et al. 2000], and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/pathway.html) [Kanehisa and Goto 2000]. These enable functional annotation of proteins and interactions. Finally, the tool should also possibly support advanced analyses such as clustering of the network, comparison (based on topological characteristics, for example) between networks, and enrichment analysis, e.g., using GO terms. Table 2.3 lists some of the popular tools available for PPI network visualization and (visual) analysis. OMICS Tools (http://omictools.com/network-visualization-category) maintains an exhaustive list of visualization tools for PPI and other biomolecular network analysis.

Table 2.3  Software tools for PPI network visualization and analysis

Visualization Tool   Source                                               Reference
Arena3D              http://arena3d.org/                                  [Pavlopoulos et al. 2011]
AVIS                 http://actin.pharm.mssm.edu/AVIS2/                   [Seth et al. 2007]
BioLayout            http://www.biolayout.org/                            [Theocharidis et al. 2009]
Cytoscape            http://www.cytoscape.org/download.html               [Shannon et al. 2003, Smoot et al. 2010]
Medusa               http://coot.embl.de/medusa/                          [Hooper and Bork 2005]
NAViGaTOR            http://ophid.utoronto.ca/navigator/download.html     [Brown et al. 2009]
ONDEX                http://www.ondex.org/                                [Köhler et al. 2006]
Osprey               http://biodata.mshri.on.ca/osprey/servlet/Index      [Breitkreutz et al. 2003]
Pajek                http://vlado.fmf.uni-lj.si/pub/networks/pajek/       [Vladimir and Andrej 2004]
PIVOT                http://acgt.cs.tau.ac.il/pivot/                      [Orlev et al. 2004]
ProViz               http://cbi.labri.fr/eng/proviz.htm                   [Florian et al. 2005]


2.6 Building High-Confidence PPI Networks

From our discussions on experimental protocols in earlier sections, we know that some protocols—including the AP/MS ones—offer only pulled-down complexes consisting of baits and their preys, without specifying the binary interactions between these components. Therefore, binary interactions need to be specifically inferred between the bait and each of its preys within the pulled-down complexes. However, not all preys in a pulled-down complex interact directly with the bait (some get pulled down due to their interactions with other preys in the complex). Therefore, it is necessary to infer binary interactions not just between the bait and its preys but also between the interacting preys. Yet, care should be taken to avoid inferring spurious (false-positive) interactions between preys that do not interact.

To overcome these uncertainties, often a balance is sought between two kinds of models, spoke and matrix, which are used to transform pulled-down complexes into binary interactions between the proteins [Gavin et al. 2006, Krogan et al. 2006, Spirin and Mirny 2003, Zhang et al. 2008]. The spoke model assumes that the only interactions in the complex are between the bait and its preys, like the spokes of a wheel. This model is useful for reducing the complexity of the data, but misses all (true) prey–prey interactions. On the other hand, the matrix model assumes that every pair of proteins within a complex interacts. This model can cover all possible true interactions, but can also predict a large number of spurious interactions. An empirical evaluation using 1,993 baits and 2,760 preys from the dataset of Gavin et al. [2006] against 13,384 pairwise protein interactions between proteins within the expert-curated MIPS complexes [Mewes et al. 2006] revealed 80.2% false-negative (missing) interactions and 39% false-positive (spurious) interactions for the spoke model, and 31.2% false-negative interactions but 308.7% false-positive interactions for the matrix model [Zhang et al. 2008]. However, note that many of the missing interactions could be due to the lack of protein coverage in these experiments. A balance is struck between the two models that covers as many true interactions between the baits and preys as possible without allowing too many false interactions [Gavin et al. 2006]; see Figure 2.5.
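
The spoke and matrix expansions of a single pull-down are simple to state programmatically. The sketch below, with hypothetical bait and prey identifiers, generates the two sets of binary interactions from one purification.

    # Expand one pull-down (a bait with its co-purified preys) into binary
    # interactions under the spoke and matrix models.
    from itertools import combinations

    def spoke_model(bait, preys):
        # Only bait-prey edges, like the spokes of a wheel.
        return {frozenset((bait, p)) for p in preys}

    def matrix_model(bait, preys):
        # Every pair of proteins in the purification is connected.
        members = [bait] + list(preys)
        return {frozenset(pair) for pair in combinations(members, 2)}

    # Hypothetical purification: bait "B1" pulled down three preys.
    bait, preys = "B1", ["P1", "P2", "P3"]
    print("spoke :", sorted(tuple(sorted(e)) for e in spoke_model(bait, preys)))
    print("matrix:", sorted(tuple(sorted(e)) for e in matrix_model(bait, preys)))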

Gaining Confidence in PPI Networks

Although high-throughput studies have been successful in mapping large fractions of interactomes from multiple organisms, the datasets generated from these studies are not free from errors. High-throughput PPI datasets often contain a considerable number of spurious interactions, while missing a substantial number

Figure 2.5  Inferring protein interactions from pull-down protein complexes. Bait–prey relationships from pull-down complexes are assembled using the spoke model, where the bait is connected to each of the preys (A); or using the matrix model, where every bait–prey and prey–prey pair is connected (B). However, these models either miss many true interactions or produce too many spurious interactions. Therefore, a combination of the spoke and matrix models is used, where a balance is sought between the two models using weighting of interactions (C). Interactions with low weights are discarded to give the final set of high-confidence inferred interactions (D).

of true interactions [Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Consequently, a crucial challenge in adopting these datasets for downstream analysis—including protein complex prediction—is in overcoming these errors.

Spurious Interactions

Spurious or false-positive interactions in high-throughput screens may arise from technical limitations in the underlying experimental protocols, or from limitations in the (computational) inference of interactions from the screen. For example, the Y2H system, despite being in vivo, does not consider the localization (compartmentalization), time, and cellular context while testing for binding partners. Since all proteins are tested within one compartment (the nucleus), the chance is high that two proteins that belong to two different compartments, and are not likely to meet during their lifetimes in live cells, end up testing positive for interaction. Similarly, in vitro TAP pull-downs are carried out using cell lysates in an environment where every protein is present in the same "uncompartmentalized soup" [Mackay et al. 2007, Welch 2009]. Therefore, even though two proteins interact


under these laboratory conditions, it is not certain that they will ever meet or interact during their lifetimes in live cells. Opportunities are high for "sticky" molecules to function as bridges between proteins, causing these proteins to interact promiscuously with partners that they never interact with in live cells [Mackay et al. 2007]. Once these complexes are pulled down, the model used to infer binary interactions—between bait and prey or between preys—can also result in the inference of spurious interactions (further discussed below). Recent analyses showed that only 30–50% of interactions inferred from high-throughput screens actually occur within cells [Shoemaker and Panchenko 2007, Welch 2009], while the remaining interactions are false positives.

Missing Interactions and Lack of Concordance Between Datasets

Comparisons between datasets from different techniques have shown a striking lack of concordance, with each technique producing a unique distribution of interactions [Shoemaker and Panchenko 2007, Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Moreover, certain interactions depend on posttranslational modifications such as disulfide-bridge formation, glycosylation, and phosphorylation, which may not be supported in the adopted system. Many of these techniques also show bias toward abundant proteins (e.g., soluble proteins) and bias against certain kinds of proteins (e.g., membrane proteins). For example, AP/MS screens predict relatively few interactions for proteins involved in transport and sensing (trans-membrane proteins), while Y2H screens, being targeted in the nucleus, fail to cover extracellular proteins [Shoemaker and Panchenko 2007]. These limitations effectively result in a considerable number of missed interactions in interactome datasets. Welch [2009] summed up the status of interactome maps, based on these limitations, as "fuzzy," i.e., error-prone, yet filled with promise.

Estimating Reliabilities of Interactions

The coverage of true interactions can be increased by integrating datasets from multiple experiments. This integration ensures that all or most regions of the interactome are sufficiently represented in the PPI network. However, overcoming spurious interactions still remains a challenge, which is further magnified when datasets are integrated. Therefore, estimating the reliabilities of interactions becomes necessary, thereby keeping only the highly reliable interactions while discarding the spurious or less-reliable ones. Confidence or reliability scoring schemes assign a score (weight) to each interaction in the PPI network. For an interaction (u, v) ∈ E in the scored (weighted) PPI

2.6 Building High-Confidence PPI Networks

Table 2.4  Classification of confidence scoring (PPI weighting) schemes for protein interactions

Classification                Scoring Scheme                                        Reference
Sampling or counting-based    Bootstrap sampling                                    [Friedel et al. 2009]
                              Comparative Proteomic Analysis (ComPASS)              [Sowa et al. 2009]
                              Dice coefficient (a)                                  [Zhang et al. 2008]
                              Hypergeometric sampling                               [Hart et al. 2007]
                              Significance Analysis of INTeractions (SAINT)         [Choi et al. 2011, Teo et al. 2014]
                              Socio-affinity scoring                                [Gavin et al. 2006]
Independent evidence-based    Bayesian networks and C4.5 decision trees             [Krogan et al. 2006]
                              Topological Clustering Semantic Similarity (TCSS)     [Jain and Bader 2010]
                              Purification Enrichment (PE)                          [Collins et al. 2007]
Topology-based                Collaborative Filtering (CF)                          [Luo et al. 2015]
                              Functional Similarity (FS) Weight (a)                 [Chua et al. 2006]
                              Geometric embedding                                   [Pržulj et al. 2004, Higham et al. 2008]
                              Iterative Czekanowski-Dice (ICD) distance (a)         [Liu et al. 2008]
                              PageRank affinity                                     [Voevodski et al. 2009]

a. Dice coefficient, FS Weight, and Iterative CD scoring schemes can also be considered as independent evidence-based schemes, because if a pair of proteins have several common partners then these proteins most likely perform the same or similar functions and/or are present in the same cellular compartment (a biological evidence).

network G = (V, E, w), the score w(u, v) encodes the confidence for the physical interaction between the two proteins u and v. The scoring function w : V × V → R accounts for the biological variability and technical limitations of the experiments used to infer the interactions. The scoring schemes can be classified into three broad categories (Table 2.4): (i) sampling or counting-based, (ii) biological evidence-based, and (iii) topology-based schemes.


Sampling or Counting-Based Schemes

These schemes estimate the confidence of protein pairs by measuring the number of times each protein pair is observed to interact across multiple trials against what would be expected by chance given the abundance of each protein in the library. If the protein pairs are coming from the same experiment, the counting is performed across multiple purifications of the experiment. Given multiple PPI datasets, this idea can be extended to score interacting pairs by measuring the number of times each pair is observed across the different datasets against what would be expected at random given the number of times these proteins appear across the datasets. However, if the PPI datasets come from different experiments (e.g., Y2H and TAP/MS-based), which is usually the case, then it is useful to capture the relative reliability of each experimental technique or source of the datasets in this computation. For example, if Y2H is believed to be less reliable than TAP/MS-based techniques, then protein pairs can be assigned lower weights when observed in Y2H datasets, but higher weights when observed in TAP/MS datasets.

In the study by Gavin et al. [2006], a "socio-affinity" scheme based on this counting idea was used to estimate confidence for interactions inferred from pulled-down complexes detected from TAP purifications. The interactions within the pulled-down complexes are inferred as a combination of spoke- and matrix-modeled relationships. A socio-affinity index SA(u, v) then quantifies the tendency for two proteins u and v to identify each other when tagged (spoke model, S) and to co-purify when other proteins are tagged (matrix model, M):

    SA(u, v) = S(u, v)|_{u=bait} + S(u, v)|_{v=bait} + M(u, v),                                  (2.8)

    where  S(u, v)|_{u=bait} = \log \frac{n(u, v)|_{u=bait}}{f_u^{bait} \cdot n_{bait} \cdot f_v^{prey} \cdot n^{prey}_{u=bait}},

           S(u, v)|_{v=bait} = \log \frac{n(u, v)|_{v=bait}}{f_v^{bait} \cdot n_{bait} \cdot f_u^{prey} \cdot n^{prey}_{v=bait}},

    and    M(u, v) = \log \frac{n(u, v)}{f_u^{prey} \cdot f_v^{prey} \cdot \sum_{all\ baits} n^{prey}(n^{prey} - 1)/2},

where, for the spoke model (S), n(u, v)|_{u=bait} is the number of times that u retrieves v when u is tagged; f_u^{bait} is the fraction of purifications in which u was the bait; f_v^{prey} is the fraction of all retrieved preys that were v; n_{bait} is the total number of purifications (i.e., using baits); and n^{prey}_{u=bait} is the number of preys retrieved with u as bait. These terms are similarly defined for v as bait. For the matrix model (M), n(u, v) is the


number of times that u and v are seen in purifications as preys with baits other than u or v; f_u^{prey} and f_v^{prey} are as above; and n^{prey} is the number of preys observed with a particular bait (excluding the bait itself).

Friedel et al. [2009] combined the bait–prey relationships detected from the Gavin et al. [2006] and Krogan et al. [2006] experiments, and used a random sampling-based scheme to estimate the confidence of interactions. In this approach, a list Φ = (φ_1, ..., φ_n) of purifications was generated, where each purification φ_i consisted of one bait b_i and the preys p_{i,1}, ..., p_{i,m} identified for this bait in the purification: φ_i = (b_i, [p_{i,1}, ..., p_{i,m}]). From Φ, l = 1000 bootstrap samples were created by drawing n purifications with replacement. This means that the bootstrap sample S_j(Φ) contains the same number of purifications as Φ, and each purification φ_i can be contained in S_j(Φ) once, multiple times, or not at all, with multiple copies being treated as separate purifications. Interaction scores for the protein pairs are then calculated from these l bootstrap samples using socio-affinity scoring as above, where each protein pair is counted for the number of times the pair appeared across the randomly sampled sets of interactions against what would be expected for the pair at random based on the abundance of each protein in the two datasets.

Zhang et al. [2008] modeled each purification as a bit vector which lists the proteins pulled down as preys against a bait across different experiments. The authors then used the Sørensen-Dice similarity index [Sørensen 1948, Dice 1945] between the vectors to estimate the co-purification of preys across experiments, and thus the interaction reliability between proteins. Specifically, the pull-down data is transformed into a binary protein pull-down matrix in which a cell [u, i] of the matrix is 1 if u is pulled down as a prey in the experiment or purification i, and zero otherwise. For two protein vectors in this matrix, the Sørensen-Dice similarity index, or simply the Dice coefficient, is computed as follows:

    D(u, v) = \frac{2q}{2q + r + s},                                                             (2.9)

where q is the number of the matrix elements (experiments or purifications) that have ones for both proteins u and v; r is the number of elements where u has ones, but v has zeroes; and s is the number of elements where v has ones, but u has zeroes. If u and v indeed interact (directly or as part of a complex), then most likely the two proteins will be frequently co-purified in different experiments. The Dice coefficient therefore estimates the fraction of times u and v are co-purified in order to estimate the interaction reliability between u and v.
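
Given the binary pull-down matrix described above, the Dice coefficient of Equation 2.9 reduces to a few lines of code. The sketch below uses a small made-up matrix in which each protein's profile records the purifications in which it appeared as a prey.

    # Dice coefficient (Equation 2.9) between two proteins from a binary
    # pull-down matrix: profiles[p][i] == 1 if protein p was pulled down as a
    # prey in purification i. The example profiles are invented.
    def dice_coefficient(profile_u, profile_v):
        q = sum(1 for a, b in zip(profile_u, profile_v) if a == 1 and b == 1)
        r = sum(1 for a, b in zip(profile_u, profile_v) if a == 1 and b == 0)
        s = sum(1 for a, b in zip(profile_u, profile_v) if a == 0 and b == 1)
        if q == r == s == 0:
            return 0.0          # neither protein was ever observed
        return 2.0 * q / (2.0 * q + r + s)

    profiles = {
        "u": [1, 1, 0, 1, 0, 1],
        "v": [1, 1, 0, 0, 0, 1],
        "w": [0, 0, 1, 0, 1, 0],
    }
    print("D(u, v) =", round(dice_coefficient(profiles["u"], profiles["v"]), 3))  # frequently co-purified
    print("D(u, w) =", round(dice_coefficient(profiles["u"], profiles["w"]), 3))  # never co-purified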


Hart et al. [2007] generated a Probabilistic Integrated Co-complex (PICO) network by integrating matrix-modeled relationships from the Gavin et al. [2002], Gavin et al. [2006], Krogan et al. [2006], and Ho et al. [2002] datasets using hypergeometric sampling. Specifically, the significance (p-value) for observing an interaction between the proteins u and v at least k times in the dataset is estimated using the hypergeometric distribution as

    p(\#interactions \geq k \mid n, m, N) = \sum_{i=k}^{\min(n, m)} p(i \mid n, m, N),           (2.10)

    where  p(i \mid n, m, N) = \frac{\binom{n}{i} \binom{N - n}{m - i}}{\binom{N}{m}},

where k is the number of times the interaction between u and v is observed, n and m are the total numbers of interactions for u and v, respectively, and N is the total number of interactions in the entire dataset. The lower the p-value, the lower the chance that the observed interaction between u and v is random, and therefore the higher the chance that the interaction is true.

Methods such as Significance Analysis of INTeractome (SAINT) are based on quantitative analysis of mass spectrometry data. SAINT, developed by Choi et al. [2011] and Teo et al. [2014], assigns confidence scores to interactions based on the spectral counts of proteins pulled down in AP/MS experiments. The aim is to convert the spectral count X_{ij} for a prey protein i identified in a purification of bait j into the probability of a true interaction between the two proteins, P(True | X_{ij}). For this, the true and false distributions, P(X_{ij} | True) and P(X_{ij} | False), and the prior probability π_T of true interactions in the dataset are inferred from the spectral counts of all interactions involving prey i and bait j. Essentially, SAINT assumes that, if proteins i and j interact, then their "interaction abundance" is proportional to the product X_i X_j of their spectral counts X_i and X_j. To compute P(X_{ij} | True), the spectral counts X_i and X_j are learned not only from the interaction between i and j, but also from all bona fide interactions that involve i and j. The same principle is applied to compute P(X_{ij} | False) for false interactions. These probability distributions are then used to calculate the posterior probability of true interaction P(True | X_{ij}). The interactions are then ranked in decreasing order of their probabilities, and a threshold is used to select the most likely true interactions.
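
The tail probability in Equation 2.10 is an upper-tail hypergeometric probability, so it can be evaluated with a standard statistics library rather than summed by hand. A small sketch with made-up counts:

    # Upper-tail hypergeometric probability of Equation 2.10 using SciPy.
    # N: total interactions in the dataset, n and m: interaction counts of the
    # two proteins, k: number of times the pair was observed together.
    from scipy.stats import hypergeom

    def cocomplex_pvalue(k, n, m, N):
        # P(#interactions >= k) for a hypergeometric variable with population
        # size N, n "marked" items, and m draws; sf(k - 1) gives the >= k tail.
        return hypergeom.sf(k - 1, N, n, m)

    # Made-up example counts, purely for illustration.
    print(cocomplex_pvalue(k=4, n=20, m=30, N=5000))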


Comparative Proteomic Analysis (ComPASS) [Sowa et al. 2009] employs a comparative methodology to assign scores to proteins identified within parallel proteomic datasets. It constructs a stats table X[k × m] in which each cell X[i, j] = X_{i,j} is the total spectral count (TSC) for an interactor j (arranged as m rows) in an experiment i (arranged as k columns). ComPASS uses a D-score to normalize the TSCs across proteins such that the highest scores are given to proteins in each experiment that are found rarely, found in duplicate runs, and have high TSCs—all characteristics that qualify proteins to be candidate high-confidence interactors. The D-score is a modification of the Z-score, which weights all interactors equally regardless of the number of replicates or their TSCs. Let \bar{X}_j be the average TSC for interactor j across all the experiments,

    \bar{X}_j = \frac{\sum_{i=1}^{k} X_{i,j}}{k};  j = 1, 2, ..., m.                             (2.11)

The Z-score is computed as

    Z_{i,j} = \frac{X_{i,j} - \bar{X}_j}{\sigma_j},                                              (2.12)

where σ_j is the standard deviation of the spectral counts of interactor j across the experiments. The D-score improves the Z-score by incorporating the uniqueness, the TSC, and the reproducibility of the interaction to assign a score to each protein within each experiment. The D-score first rescales X_{i,j} as

    D^R_{i,j} = \sqrt{\left( \frac{k}{\sum_{i=1}^{k} f_{i,j}} \right)^{p} \cdot X_{i,j}},        (2.13)

    where  f_{i,j} = 1 when X_{i,j} > 0, and 0 otherwise,

and p is the number of replicate runs in which the interactor is present. A D-score distribution is generated using a simulated random dataset, and a D-score threshold D^T is determined below which 95% of this randomized data falls. A normalized D-score is then computed using this threshold as D^N_{i,j} = D^R_{i,j} / D^T. All interactors with D^N_{i,j} ≥ 1 are considered true, whereas those with D^N_{i,j} < 1 are less likely to be bona fide interactors. Huttlin et al. [2015] employed ComPASS to identify 23,744 high-confidence interactors for 2,594 baits expressed in human embryonic kidney (HEK293T) cells ("the BioPlex network," which as of 2016 contains over 50,000 interactions: http://wren.hms.harvard.edu/bioplex/). The readers are referred to the review by Nesvizhskii [2012] for a number of other scoring methods based on the analysis of mass spectrometry data.


Evidence-Based Schemes

These schemes use external or independent evidence to estimate the confidence of interactions in the PPI dataset. For example, this evidence may include Gene Ontology (GO) annotations [Ashburner et al. 2000, Mi et al. 2013] for protein functions and localization (compartmentalization), and co-complex memberships in validated complexes. Some of the methods are learning-based; for example, given a (training) set of known interacting pairs and GO annotations, the methods learn (train on) the conditional probability distribution for interacting pairs with and without similar GO annotations (e.g., similar functions or localization), and, using this learned distribution, the methods estimate the probability of interaction for the proteins in each pair. In Krogan et al.'s study [Krogan et al. 2006], a machine learning approach using Bayesian networks and C4.5 decision trees trained on validated physical interactions and functional evidence—co-occurrence in manually curated complexes from MIPS—was used to estimate confidence for protein pairs in a spoke-modeled experimental dataset. Collins et al. [2007] developed Purification Enrichment (PE) scoring to generate a "Consolidated network" using the matrix-modeled relationships from the Gavin et al. and Krogan et al. datasets. The PE scoring is based on a Bayes classifier trained on manually curated co-complexed protein pairs, GO annotations, mRNA expression patterns, and cellular co-localization and co-expression profiles. The Consolidated network was shown to be of high quality, comparable to that of PPIs derived from low-throughput experiments.

In other methods, explicit learning may not be involved; instead, the evidence is directly used to assess the interaction confidence of protein pairs. For example, Resnick's measure [Resnick 1995] for computing the semantic similarity between annotation terms has been adopted to compute confidence based on the GO annotations of the proteins [Xu et al. 2008, Pesquita et al. 2008, Jain and Bader 2010]. Specifically, the semantic similarity between two ontology terms (S, T) having a set \mathcal{A} of common ancestors in the GO graph is given as the information content,

    r(S, T) = - \sum_{A \in \mathcal{A}} p(A) \log(p(A)),                                        (2.14)

where p(A) is the fraction of proteins annotated to term A and all its descendants in the GO graph. Suppose that proteins u and v are annotated to sets of GO terms S and T, respectively. Then the semantic similarity between u and v is defined as the maximum information content (Resnick's measure) over the set S × T,

    sim(u, v) = \max_{S_i \in S, T_j \in T} r(S_i, T_j).                                         (2.15)

The interaction confidence between u and v is then estimated as sim(u, v).
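
Given term probabilities p(A) and the common ancestors of term pairs, Equations 2.14 and 2.15 amount to a weighted sum and a maximization. The sketch below uses invented GO term identifiers, probabilities, and common-ancestor sets; in practice these are derived from the GO graph and the annotation corpus.

    # Semantic similarity between two proteins from their GO annotations
    # (Equations 2.14 and 2.15). All values below are hypothetical.
    import math

    p = {"GO:a": 0.60, "GO:b": 0.20, "GO:c": 0.05, "GO:d": 0.02}   # p(A) per term

    def r(common_ancestors):
        # r(S, T) = - sum over common ancestors A of p(A) * log p(A)   (Eq. 2.14)
        return -sum(p[a] * math.log(p[a]) for a in common_ancestors)

    # Hypothetical common-ancestor sets for pairs of annotation terms.
    common = {("GO:c", "GO:d"): {"GO:a", "GO:b"},
              ("GO:c", "GO:c"): {"GO:a", "GO:b", "GO:c"}}

    def sim(terms_u, terms_v):
        # sim(u, v) = max over annotation pairs of r(S_i, T_j)          (Eq. 2.15)
        scores = [r(common[(s, t)]) for s in terms_u for t in terms_v
                  if (s, t) in common]
        return max(scores) if scores else 0.0

    print(round(sim({"GO:c"}, {"GO:c", "GO:d"}), 3))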


GO graphs tend to be unbalanced, with some paths containing more detail (depth) than others, which stems from the complex biological structure of the GO annotations. However, this creates a bias against terms that do not represent such complex structures, i.e., terms that do not have sufficient depth in the GO graph. To account for this topological imbalance of the GO graph, Jain and Bader [2010] developed Topological Clustering Semantic Similarity (TCSS), which collapses subgraphs that define similar concepts. Terms that are lower down the GO tree have higher information content (i.e., are more specific) than terms at higher levels (i.e., less specific). A cut-off is used to identify subgraphs—subgraph root terms and all their children—with high information content. Since GO terms often have multiple parents, it is likely that this results in overlapping subgraphs. Overlaps between subgraphs are removed in two steps: edge removal by transitive reduction, and term duplication. In the edge-removal step, a reduction is performed on the subgraphs: if nodes u and v are connected both via a directed edge u → v as well as a directed path u → w_1, ..., w_k → v, then a transitive reduction is performed to preserve only u → v. After this step, if a term still belongs to more than one subgraph, then the term and all its descendants are duplicated across the subgraphs. The similarity between two proteins is measured on this reduced GO graph using Resnick's similarity between their GO terms, as described above.

Topology-Based Schemes

These schemes analyze the topology of the PPI network—usually the immediate neighborhood of each protein pair—to estimate interaction confidence for the protein pairs. If the proteins in an interacting pair have many common neighbors in the network, the proteins and their shared neighbors have similar functions and/or are co-localized [Batada et al. 2004], and it is likely that the observed interaction between the proteins is true. This can be understood from the following simple example. Two proteins need to be localized to the same compartment to interact physically. Let us assume that a PPI screen with a false-positive rate of p reports that protein A interacts with C_1, ..., C_n and D_1, ..., D_m, and that another protein B interacts with C_1, ..., C_n and E_1, ..., E_m. Suppose that each of these proteins can be localized to, say, h subcellular compartments with equal chance. The probability that A and B interact with a C_i in two different places is therefore x = (1 − p) · (1 − p) · (h − 1)/h. Thus, the probability that A and B interact with each of C_1, ..., C_n in two different compartments is x^n, and the probability that A and B interact with some C_i in the same compartment is 1 − x^n, which monotonically increases with n. That is, the more common partners A and B have (as reported by the screen), the

Thus, the more common partners the two proteins have in the PPI network, the higher the chance they satisfy the co-localization requirement for interaction. In other words, topology-based schemes based on counting common partners can be viewed as ways to guess whether A and B are likely to be in the same compartment (without needing to know explicitly which compartment); that is, these indices in fact exploit the biological fact that two proteins must be in the same compartment to physically interact (i.e., a piece of biological information). Therefore, Functional Similarity Weighting (FS Weight) [Chua et al. 2006], Iterative Czekanowski-Dice (CD) weight [Liu et al. 2008], and the Dice coefficient [Zhang et al. 2008] can be classified as both topology-based and biological evidence-based schemes.

FS Weight, proposed by Chua et al. [2006], is inspired by the graph-theoretic Czekanowski-Dice (CD) distance, which is given by

    CD(u, v) = |N_u Δ N_v| / (|N_u ∪ N_v| + |N_u ∩ N_v|),    (2.16)

where N_u includes u and all the neighbors of u (similarly for N_v), and N_u Δ N_v = (N_u − N_v) ∪ (N_v − N_u) is the symmetric difference between the neighbor sets N_u and N_v. CD is a distance (dissimilarity) measure: if N_u = N_v, then CD(u, v) = 0; and if N_u ∩ N_v = ∅, then CD(u, v) = 1. FS Weight takes inspiration from CD in its use of common neighbors, and estimates the confidence of the physical interaction between u and v based on their common neighbors. Chua et al. [2006] show that proteins share functions with their strictly indirect neighbors more than with their strictly direct neighbors, and that proteins share functions with even higher likelihood with proteins that are simultaneously their direct and indirect neighbors than with proteins that are either strictly their direct or strictly their indirect neighbors. The weight FS(u, v) of the interaction between u and v is estimated as

    FS(u, v) = [2|N_u ∩ N_v| / (|N_u − N_v| + 2|N_u ∩ N_v| + λ_{u,v})] × [2|N_u ∩ N_v| / (|N_v − N_u| + 2|N_u ∩ N_v| + λ_{v,u})],    (2.17)

where λ_{u,v} and λ_{v,u} are used to penalize protein pairs with very few neighbors, and are given by

    λ_{u,v} = max(0, n_avg − (|N_u − N_v| + |N_u ∩ N_v|))  and
    λ_{v,u} = max(0, n_avg − (|N_v − N_u| + |N_v ∩ N_u|)),    (2.18)

where n_avg is the average number of level-1 neighbors that a protein has in the network. FS Weight thus assigns higher weights to protein pairs with a larger number of common neighbors. FS Weight can also be used to predict new functional associations between proteins based on the number of neighbors they share; some of these functional associations may be physical interactions.

In Iterative CD, proposed by Liu et al. [2008], the weight for each interaction is computed from the number of neighbors shared between the two interacting proteins, in a manner similar to FS Weight. However, Iterative CD then iteratively corrects these weights, such that the weights computed in an iteration use the weights computed in the previous iteration. By doing so, Iterative CD progressively reinforces the weights of true interactions and dampens the weights of false-positive interactions with each iteration. Iterative CD begins with an unweighted network (all weights set to 1) or a network with weights coming from some prior evidence (e.g., reliabilities of the experiments from which the protein pairs were inferred). The weight w^k(u, v) for each protein pair (u, v) in the k-th (k > 1) iteration is estimated as

    w^k(u, v) = Σ_{x ∈ N_u ∩ N_v} [w^{k−1}(x, u) + w^{k−1}(x, v)] / [Σ_{x ∈ N_u} w^{k−1}(x, u) + λ_u^k + Σ_{x ∈ N_v} w^{k−1}(x, v) + λ_v^k],    (2.19)

where w^1(u, v) = 1 if the interaction (u, v) exists in the original network and w^1(u, v) = 0 if it does not; alternatively, w^1(u, v) = r(e)(u, v), the reliability of the experiment e used to infer (u, v). The parameters λ_u and λ_v penalize protein pairs with very few common neighbors. Liu et al. show that the iterative procedure converges in two iterations for typical PPI networks, but 10–30 iterations may be necessary if the network has high levels of noise. The procedure produces a weight between 0 and 1 for each interaction, and a cut-off (recommended 0.2) is used to filter out low-scoring interactions.
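A minimal sketch of the iteration in Equation (2.19) follows; the toy network is hypothetical, and the penalty terms λ are computed once from the average degree (as in FS Weight) rather than per iteration, so this is an illustrative approximation rather than the authors' reference implementation.

```python
from collections import defaultdict

# Toy unweighted PPI network as an edge list (hypothetical proteins).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
nbrs = defaultdict(set)
for u, v in edges:
    nbrs[u].add(v)
    nbrs[v].add(u)

n_avg = sum(len(n) for n in nbrs.values()) / len(nbrs)   # average neighborhood size
w = {frozenset(e): 1.0 for e in edges}                   # w^1: all weights start at 1

def weight(w, x, y):
    return w.get(frozenset((x, y)), 0.0)

for _ in range(3):                                        # a few iterations usually suffice
    new_w = {}
    for u, v in edges:
        lam_u = max(0.0, n_avg - len(nbrs[u]))            # penalty for sparse neighborhoods
        lam_v = max(0.0, n_avg - len(nbrs[v]))
        common = nbrs[u] & nbrs[v]
        num = sum(weight(w, x, u) + weight(w, x, v) for x in common)
        den = (sum(weight(w, x, u) for x in nbrs[u]) + lam_u +
               sum(weight(w, x, v) for x in nbrs[v]) + lam_v)
        new_w[frozenset((u, v))] = num / den
    w = new_w

# Keep only interactions scoring above the recommended cut-off of 0.2.
filtered = {tuple(sorted(e)): round(s, 3) for e, s in w.items() if s >= 0.2}
print(filtered)
```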

Luo et al. [2015] scored interactions based on collaborative filtering (CF). This approach is inspired by personalized recommendation in e-commerce, where the problem is to identify useful patterns reflecting the connections between users and items from their usage history, and to make reliable predictions for possible user-item links based on these patterns [Herlocker et al. 2002]. The CF scheme first computes the similarity sim(u, v) between a pair of interacting proteins u and v in the PPI network using the cosine distance of their adjacency vectors,

    sim(u, v) = ⟨ū, v̄⟩ / (||ū|| · ||v̄||),    (2.20)

where ⟨· , ·⟩ denotes the inner product between the vectors, and ||·|| denotes the Euclidean norm of the vectors. Therefore, the closer sim(u, v) is to 1, the more common neighbors u and v share. This cosine distance is then rescaled by incorporating a parameter C_γ which, similar to λ in FS Weight and Iterative CD, is used to account for spurious neighbors in the network,

    sim(u, v) = ⟨ū, v̄⟩ / [√(||ū||²/C_γ + 1) · √(||v̄||²/C_γ + 1)].    (2.21)

As in e-commerce, where people judge an item based on multiple reviews, preferably from known contacts, the score sim(u, v) from Equation (2.21) is further rescaled as

    sim(u, v) = [⟨ū, v̄⟩ / (√(||ū||²/C_γ + 1) · √(||v̄||²/C_γ + 1))] · (r_{u,v})^d,    (2.22)

where (r_{u,v})^d (d being a tunable parameter) is the mean of the scores of the interactions with common neighbors, {(u, x) : x ∈ N(u) ∩ N(v)} ∪ {(v, x) : x ∈ N(u) ∩ N(v)}, in the PPI network.
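The cosine scoring of Equations (2.20)–(2.21) can be written in a few lines of NumPy; the toy adjacency matrix and the value of C_γ below are hypothetical, and the damped denominator follows the reconstruction of Equation (2.21) given above.

```python
import numpy as np

# Toy symmetric adjacency matrix over five hypothetical proteins.
proteins = ["A", "B", "C", "D", "E"]
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

C_gamma = 2.0  # hypothetical damping constant for spurious neighbors

def cf_sim(i, j, damped=True):
    """Cosine similarity of adjacency vectors (Eq. 2.20), optionally damped (Eq. 2.21)."""
    u, v = A[i], A[j]
    num = u @ v
    if damped:
        den = np.sqrt(u @ u / C_gamma + 1.0) * np.sqrt(v @ v / C_gamma + 1.0)
    else:
        den = np.linalg.norm(u) * np.linalg.norm(v)
    return num / den

print(cf_sim(0, 3, damped=False), cf_sim(0, 3))
```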

Voevodski et al. [2009] developed PageRank Affinity, a scoring scheme inspired by Google's PageRank algorithm [Haveliwala 2003], to score interactions in the PPI network. PageRank Affinity uses random walks to estimate the interconnectedness of protein interactions in the network. A random walk is a Markov process in which, at each step, the walk moves from the current protein to the next with a certain preset probability. PageRank Affinity begins with an unweighted network G (with all weights set to 1) given by the adjacency matrix

    A_G(u, v) = 1 if (u, v) ∈ E, and 0 otherwise.    (2.23)

The transition probability matrix for the random walks (often called the random walk matrix) is the normalized adjacency matrix in which each row sums to one,

    W_G = D_G^{−1} A_G,    (2.24)

where D_G is the degree matrix, a diagonal matrix containing the degrees of the proteins,

    D_G(u, v) = deg(u) = |N(u)| if u = v, and 0 otherwise.    (2.25)

Therefore, the transition probabilities are given by the matrix W_G, and the probabilities p^{t+1} at step t + 1 are simply p^{t+1} = p^t W_G. PageRank Affinity repeatedly simulates random walks beginning from each protein in the network. Let pr(s) be the steady-state probability distribution of a random walk with restart probability α, where the starting vector s gives the probability distribution upon restarting. Then pr(s) is the unique solution of the linear system

    pr_α(s)^{t+1} = α · s + (1 − α) · pr_α(s)^t · W_G.    (2.26)

The starting vector here is set as follows:

    s_u(i) = 1 if i = u, and 0 otherwise.    (2.27)

Therefore, pr(s_u) is the steady-state probability distribution of a random walk that always restarts at u. Let pr(s_u)[v] denote the steady-state probability that protein v has in the distribution vector pr(s_u). This can be thought of as the probability contribution that u makes to v, and is denoted pr(u → v); it provides the affinity between u and v when the walks always restart at u. The final affinity between u and v is computed as the minimum of pr(u → v) and pr(v → u).
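A bare-bones power-iteration sketch of Equations (2.23)–(2.27) is shown below; the toy network, the restart probability α, and the fixed iteration count are arbitrary illustrative choices, not Voevodski et al.'s implementation.

```python
import numpy as np

# Toy adjacency matrix (Eq. 2.23) over four hypothetical proteins.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_inv = np.diag(1.0 / A.sum(axis=1))   # inverse degree matrix (Eq. 2.25)
W = D_inv @ A                          # row-stochastic random-walk matrix (Eq. 2.24)

alpha = 0.15                           # restart probability (hypothetical choice)

def pagerank_from(u, iters=200):
    """Steady-state distribution of a walk that always restarts at protein u (Eq. 2.26)."""
    s = np.zeros(A.shape[0]); s[u] = 1.0     # starting vector (Eq. 2.27)
    pr = s.copy()
    for _ in range(iters):
        pr = alpha * s + (1 - alpha) * pr @ W
    return pr

def affinity(u, v):
    """PageRank Affinity: the smaller of the two contributions pr(u->v) and pr(v->u)."""
    return min(pagerank_from(u)[v], pagerank_from(v)[u])

print(affinity(0, 2))
```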

Pržulj et al. [2004] and Higham et al. [2008] argue that PPI networks are best modeled by geometric random graphs rather than by the scale-free or small-world graphs suggested in earlier works [Watts and Strogatz 1998, Barabási 1999]. A geometric random graph G = ⟨V, E⟩ with radius ε is a graph with a node set V of points in a metric space and an edge set E = {(u, v) : u, v ∈ V; 0 < ||u − v|| ≤ ε}, where ||·|| is an arbitrary distance norm in this space. The authors show that a PPI network can be represented as a geometric random graph by embedding the proteins of the PPI network in the metric spaces R², R³, or R⁴, and finding an ε such that any two proteins are connected in the PPI network if and only if the proteins are ε-close in the chosen metric space. This embedding can be used not only to weight all interactions, but also to predict new interactions between protein pairs and to remove spurious interactions from the PPI network: interactions between proteins that are farther than ε apart in the metric embedding are pruned off as noise, whereas non-interacting protein pairs that are ε-close are predicted to be interacting.

The embedding of the PPI network in a chosen metric space is described briefly as follows. Given a PPI network of N proteins, the interaction weights for protein pairs (i, j) are interpreted as pairwise distances d_ij, and the task is to find locations in m-dimensional Euclidean space (vectors {x[i]}_{i=1}^N in R^m) for these proteins so that the pairwise distances are preserved, i.e., ||x[i] − x[j]||_2 = d_ij for all i, j.

In PPI networks with only {0, 1} connectivity information, the length of the shortest path between proteins in the network is used in lieu of the Euclidean distance. Pržulj et al. suggest that the square root of the path length, d_ij = √path_ij, where path_ij denotes the path length between i and j, is a good function for this purpose. Conditional probabilities p(d_ij | interact) and p(d_ij | noninteract) are learned for the distances d_ij from the embedding. Here, p(d_ij | interact) is the probability density function describing the distances between pairs of proteins that are known to interact, and p(d_ij | noninteract) is the probability density function describing the distances between pairs of proteins that do not interact in the dataset (and are known not to interact). Given a distance threshold δ, these probabilities are used to compute the posterior probabilities p(interact | d_ij ≤ δ) and p(noninteract | d_ij ≤ δ) for protein pairs to interact or not interact. For each protein pair (i, j) within the δ-threshold, the weight of the interaction between i and j is then estimated as

    S(i, j) = p(interact(i, j) | d_ij ≤ δ) / [p(interact(i, j) | d_ij ≤ δ) + p(noninteract(i, j) | d_ij ≤ δ)].    (2.28)

A threshold on this estimated weight is applied to remove false-positive interactions from the network.

Combining Confidence Scores

Interactions scored high by all or a majority of confidence-scoring schemes (consensus interactions) are likely to be true interactions. Therefore, a simple majority-voting scheme counts the number of times each interaction is assigned a high score (above the recommended cut-off) by each scheme, and retains only the interactions scored high by a majority of the schemes. Chua et al. [2009] integrated multiple scoring schemes using a naïve Bayesian approach. For an interaction (u, v) that is assigned scores p_i(u, v) by different schemes i (assuming the scores are in the same range [0, 1]), the combined score can be computed as 1 − Π_i (1 − p_i(u, v)). However, even within the same range [0, 1], different scoring schemes tend to have different distributions of the scores they assign to interactions. Some schemes assign high scores (close to 1) to most or a sizeable fraction of the interactions, whereas other schemes are more conservative and assign low scores to many interactions. To account for this variability in score distributions, it is important to consider the relative ranking of interactions within each scoring scheme instead of their absolute scores. Chua et al. [2009] therefore proposed a rank-based combination scheme, which works as follows.

For each scheme i, the scored interactions are first binned in increasing order of their scores: the first 100 interactions are placed in the first bin, the second 100 interactions in the second bin, and so on. For each bin k in scheme i, a weight p(i, k) is assigned based on the number of interactions from the bin that match known interactions from an independent dataset:

    p(i, k) = (#interactions from bin k of scheme i that match known interactions) / (#total interactions in bin k across all schemes).    (2.29)

While combining the scores for an interaction (u, v), the Bayesian weighting is modified to 1 − Π_{(i,k) ∈ D(u,v)} (1 − p(i, k)), where D(u, v) is the list of scheme-bin pairs (i, k) that contain (u, v) across all schemes. This ensures that, irrespective of the scoring distributions of the schemes, an interaction that belongs to reliable bins is assigned a high final score.

Yong et al. [2012] present a supervised maximum-likelihood weighting scheme (SWC) to combine PPI datasets and to infer co-complexed protein pairs. The method uses a naïve Bayes maximum-likelihood model to derive the posterior probability that an interaction (u, v) is a co-complex interaction based on the scores assigned to (u, v) across multiple data sources. These data sources include PPI databases, namely BioGrid [Stark et al. 2011], IntAct [Hermjakob et al. 2004, Kerrien et al. 2012], MINT [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007], and STRING [Von Mering et al. 2003, Szklarczyk et al. 2011], and evidence from co-occurrence of proteins in PubMed literature abstracts (http://www.ncbi.nlm.nih.gov/pubmed). The set of features is the set of these data sources, and a feature F has value f if proteins u and v are related by data source F with score f, else f = 0. The features are discretized using minimum-description-length supervised discretization [Fayyad and Irani 1993]. Using a reference set of protein complexes, each (u, v) in the training set is given the class label co-complex if both u and v are in the same complex, and the class label non-co-complex otherwise. The maximum-likelihood parameters are learned for the two classes,

    P(F = f | co-complex) = N_{c, F=f} / N_c  and
    P(F = f | non-co-complex) = N_{¬c, F=f} / N_{¬c},    (2.30)

where N_c is the number of interactions with label co-complex, N_{c, F=f} is the number of interactions with label co-complex and feature value F = f, and likewise N_{¬c} and N_{¬c, F=f} for interactions with label non-co-complex. After learning the maximum-likelihood model, the score for each interaction is computed as the posterior probability of being a co-complex interaction, based on the naïve Bayes assumption of independence of the features:

    s(u, v) = P(co-complex | F_1 = f_1, F_2 = f_2, . . .) = [Π_i P(F_i = f_i | co-complex) · P(co-complex)] / Z,    (2.31)

where Z is the normalizing factor:

    Z = Π_i P(F_i = f_i | co-complex) · P(co-complex) + Π_i P(F_i = f_i | non-co-complex) · P(non-co-complex).    (2.32)
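The posterior computation of Equations (2.31)–(2.32) reduces to a few multiplications once the likelihoods are learned; in the sketch below, the feature tables, class priors, and feature values are made-up numbers for illustration, not parameters estimated in the SWC study.

```python
# Hypothetical learned likelihoods P(F = f | class) for two discretized features,
# together with class priors, as would be estimated from a reference complex set.
p_feat_given_cocomplex = {"biogrid": {0: 0.2, 1: 0.8}, "string": {0: 0.4, 1: 0.6}}
p_feat_given_noncocomplex = {"biogrid": {0: 0.9, 1: 0.1}, "string": {0: 0.7, 1: 0.3}}
prior_cocomplex = 0.1
prior_noncocomplex = 0.9

def swc_score(features):
    """Posterior P(co-complex | features) under the naive Bayes assumption (Eq. 2.31-2.32)."""
    lik_c = prior_cocomplex
    lik_n = prior_noncocomplex
    for name, value in features.items():
        lik_c *= p_feat_given_cocomplex[name][value]
        lik_n *= p_feat_given_noncocomplex[name][value]
    return lik_c / (lik_c + lik_n)   # the denominator is the normalizer Z

# A hypothetical protein pair supported by BioGrid but not by STRING.
print(swc_score({"biogrid": 1, "string": 0}))
```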

Table 2.5 lists some of the publicly available databases that integrate PPI datasets from multiple sources. However, a common problem with integrating multiple scored datasets is the low agreement between the schemes or experiments used to produce these data. Moreover, most scoring methods favor high-abundance proteins and are not effective enough to filter out common contaminants [Pu et al. 2015]. Therefore, better schemes for scoring and integrating PPI datasets are still needed.

2.7 Enhancing PPI Networks by Integrating Functional Interactions

In addition to the presence of spurious interactions, another limitation of existing PPI datasets is the lack of coverage of true interactions among certain kinds of proteins (the "sparse zone" [Rolland et al. 2014]). This is in part due to limitations in experimental protocols (e.g., the washing away of weakly connected proteins during purification of pull-down complexes in TAP experiments), and in part due to the under-representation of certain groups of proteins in these experiments (e.g., membrane proteins). The paucity of true interactions can considerably affect downstream analysis, including protein complex prediction. For example, in an analysis by Srihari and Leong [2012a] using protein complexes from MIPS and CYC2008, it was found that many true complexes are embedded in sparse and disconnected regions of the PPI network, which compromises their dense connectivity and modularity. As we shall see in a subsequent chapter, many computational methods find it difficult to identify these sparse complexes.

Computational prediction of protein interactions can be a good alternative to experimental protocols for enriching the PPI network with true interactions, and for "densifying" regions of the network that are sparsely connected.

Table 2.5  Publicly available databases that integrate PPI datasets from multiple experimental, literature, and computational sources

Database       Source                                                 Reference
ComPPI         http://comppi.linkgroup.hu/                            [Veres et al. 2015]
GeneMANIA      http://www.genemania.org/                              [Warde-Farley et al. 2010]
HIPPIE         http://cbdm.mdc-berlin.de/tools/hippie/                [Schaefer et al. 2012]
HitPredict     http://hintdb.hgc.jp/htp/                              [Patil et al. 2011]
HumanNet       http://www.functionalnet.org/                          [Lee et al. 2011]
I2D/OPHID      http://ophid.utoronto.ca/ophidv2.204/                  [Brown and Jurisica 2005, Brown and Jurisica 2007, Kotlyar et al. 2015]
IID/OPHID      http://ophid.utoronto.ca/iid/                          [Brown and Jurisica 2005, Kotlyar et al. 2015, Kotlyar et al. 2016]
InnateDB       http://www.innatedb.com/                               [Lynn et al. 2008]
IntScore       http://intscore.molgen.mpg.de/                         [Kamburov et al. 2012]
InWeb          http://www.lagelab.org/resources/                      [Li et al. 2017]
iRefIndex      http://irefindex.org/wiki/index.php?title=iRefIndex    [Razick et al. 2008, Turner et al. 2010]
MatrixDB       http://matrixdb.univ-lyon1.fr/                         [Chautard et al. 2011]
MyProteinNet   http://netbio.bgu.ac.il/myproteinnet/                  [Basha et al. 2015]
PrePPI         http://bhapp.c2b2.columbia.edu/PrePPI/                 [Zhang et al. 2012, Zhang et al. 2013]
PSICQUIC       http://psicquic.googlecode.com/                        [Aranda et al. 2011]
STRING         http://string-db.org/                                  [Von Mering et al. 2003, Szklarczyk et al. 2011]
UniHI          http://www.unihi.org/                                  [Kalathur et al. 2014]

However, accurate prediction of physical interactions between proteins is a difficult problem in itself, and, as several studies have noted [Von Mering et al. 2003, Szklarczyk et al. 2011, Srihari and Leong 2012a], most predicted interactions tend to be "functional associations"—that is, relationships connecting functionally similar pairs of proteins—rather than actual physical interactions between the proteins. Nevertheless, if these functional interactions succeed in "topologically enhancing" the PPI network, they can still aid downstream analysis, including protein complex prediction.

Computational Prediction of Protein Interactions

Although high-throughput techniques produce large amounts of data, the covered fractions of the interactomes of most organisms are far from complete [Cusick et al. 2009, Hart et al. 2006, Huang et al. 2007]. For example, while ∼70% of the interactomes of model organisms including S. cerevisiae have been mapped, these interactomes still lack interactions among membrane proteins [Von Mering et al. 2002, Hart et al. 2006, Huang et al. 2007]. Likewise, estimates show that less than 50% of the interactomes of higher-order organisms, including human (∼10%) and other mammals, have been mapped [Hart et al. 2006, Stumpf et al. 2008, Vidal 2016]. Computational prediction of interactions could partially compensate for this lack of coverage by predicting interactions between proteins in network regions with low coverage. Here, we only present a brief conceptual overview of computational methods developed for protein interaction prediction; for methodological details and for a comprehensive list of these methods, the reader is referred to the excellent surveys by Valencia and Pazos [2002], Obenauer and Yaffe [2004], Zahiri et al. [2013], Ehrenberger et al. [2015], and Keskin et al. [2016].

Gene Neighbors. A commonly used approach to predict protein interactions in prokaryotes is to use co-transcribed or co-regulated sets of genes. It is based on the observation that, in prokaryotes, proteins encoded by genes that are transcribed or regulated as single units—e.g., as operons—are often involved in similar functions and tend to physically interact. Computational methods exist to predict operons in bacterial genomes using intergenic distances [Ermolaeva et al. 2011, Price et al. 2005]. Analysis of gene-order conservation in bacterial and archaeal genomes shows that the protein products of 63–75% of operonic genes physically interact [Dandekar et al. 1998]. In eukaryotes, evidence from yeast and worm [Teichmann and Babu 2002, Snel et al. 2004] shows that co-regulated sets of genes encode proteins that are functionally similar, and these proteins are highly likely to interact. These studies therefore provide the basis to predict new interactions between proteins using sets of co-transcribed and co-regulated genes [Huynen et al. 2000, Bowers et al. 2004].

Phylogenetic Profiles. Similar phylogenetic profiles between proteins provide strong evidence for protein interactions [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. For a given protein, a phylogenetic profile is constructed as a vector of N elements, where N is the number of genomes (species). The presence or absence of the protein in a genome is indicated by a 1 or 0 at the corresponding position in the phylogenetic profile. The phylogenetic profiles of a collection of proteins can then be clustered using a bit-distance measure, to generate clusters of proteins that co-evolve. Proteins appearing in the same cluster are considered to be co-evolving, and these proteins are inferred to be functionally related and physically interacting. This inference is based on the hypothesis that interacting sets of non-homologous proteins that co-evolve are under evolutionary pressure to conserve their interactions and to maintain their co-functioning ability [Shoemaker and Panchenko 2007, Sun et al. 2005].
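As a toy illustration of profile comparison, the sketch below flags hypothetical presence/absence vectors that are nearly identical under a Hamming (bit) distance; a real analysis would use profiles over many genomes and a proper clustering procedure.

```python
from itertools import combinations

# Hypothetical phylogenetic profiles over six genomes (1 = present, 0 = absent).
profiles = {
    "P1": [1, 1, 0, 1, 0, 1],
    "P2": [1, 1, 0, 1, 0, 1],
    "P3": [0, 1, 1, 0, 1, 0],
    "P4": [1, 1, 0, 1, 1, 1],
}

def bit_distance(a, b):
    """Hamming distance between two presence/absence vectors."""
    return sum(x != y for x, y in zip(a, b))

# Pairs with nearly identical profiles are candidate co-evolving (and hence
# possibly interacting) proteins.
for p, q in combinations(profiles, 2):
    d = bit_distance(profiles[p], profiles[q])
    if d <= 1:
        print(p, q, "co-evolving profile (distance", d, ")")
```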

Co-Evolution of Interacting Proteins. Interacting proteins often co-evolve, so that changes in one protein of a pair leading to a loss of function or interaction are compensated by correlated changes in the other protein [Shoemaker and Panchenko 2007]. This co-evolution is reflected in the similarity between the phylogenetic protein trees (or simply, protein trees) of non-homologous interacting protein families. A protein tree represents the evolutionary history of protein families, i.e., proteins or protein families that diverged from a common ancestor. These protein trees, reconciled with their species trees, have their internal nodes annotated with speciation and duplication events [Vilella et al. 2009]. TreeSoft (http://treesoft.sourceforge.net/treebest.shtml) provides a suite of tools to build and visualize protein trees. The similarity between two protein trees can be computed by aligning the corresponding distance matrices so as to minimize the difference between the matrix elements: the smaller the difference between the matrices, the stronger the co-evolution between the two protein families. Interactions are predicted between the proteins corresponding to the aligned columns of the two matrices. The similarity between two protein trees is influenced by the speciation process and, therefore, there is a certain background similarity between any two protein trees, irrespective of whether the proteins interact or not. Statistical approaches exist to correct for these factors (phylogenetic subtraction) [Harvey and Pagel 1991, Harvey et al. 1995]. It is also worth noting that a protein can have multiple partners, and so taking into consideration its co-evolution with all its partners further enhances the accuracy of interaction prediction [Juan et al. 2008].

Gene Fusion. Gene fusion is a common event in evolution, wherein two or more genes in one species fuse into a single gene in another species.

Gene fusion is a result of duplication, translocation, or inversion events that affect coding sequences during the evolution of genomes. Therefore, gene fusions play an important role in determining the gene (and genomic) architecture of species. Gene fusions may occur to optimize the co-transcription of the genes involved in the fusion: by fusing two or more genes, it may be easier to transcribe them as a single entity, resulting in a single protein product. Typically, proteins coded by these fused genes in a species carry multiple functional domains, which originate from different proteins (genes) in the ancestor species. Therefore, one may infer interactions between these individual proteins in the ancestor species: it is likely that these proteins are partners in performing a particular function and that they interact in the ancestor species, and that gene fusion has occurred in another species to optimize transcription and to produce a single multidomain protein [Marcotte et al. 1999]. These fused proteins are referred to as chimeric or Rosetta Stone proteins [Marcotte et al. 1999]. The Rosetta Stone approach [Enright and Ouzounis 2001, Suhre 2007] infers protein interactions by detecting fusion events between protein sequences across species. In E. coli, this approach identified 6,809 putative interacting pairs of proteins, wherein both proteins of each pair had significant sequence similarity to a single (fused) protein from at least one other species (genome). The analysis of these interacting pairs revealed that, for more than half of them, both proteins were functionally related [Marcotte et al. 1999].

PPI Network Topology. The pattern of interactions between proteins in a PPI network says a lot about how proteins interact, and provides a way to predict new interactions. For example, if a pair of proteins have many common neighbors in the PPI network, then most likely the two proteins and their common neighbors are involved in the same or similar function(s). Therefore, one may infer a direct physical interaction between the two proteins based on the number of neighbors and/or functions the two proteins share. Chua et al. [2006] used the FS Weight interaction-scoring approach in this manner to predict interactions between level-2 neighbors (proteins connected via one other protein) in the PPI network. This is based on the observation that level-2 neighbors in the PPI network show the same or similar annotations for function and/or cellular compartment, and are therefore more likely to interact than random pairs of proteins in the network. These FS-weighted predicted interactions between level-2 neighbors are added back to the PPI network after removing low-weighted interactions. Using the same rationale, one can predict new interactions using other topology-based (common-neighbor counting) schemes, including the Dice coefficient [Zhang et al. 2008] and Iterative CD [Liu et al. 2008].

Likewise, the geometric embedding model [Pržulj et al. 2004, Higham et al. 2008] can also be used to predict new interactions: proteins that are ε-close in the geometric embedding of the PPI network are more likely to interact than random pairs of proteins and than proteins that are farther than ε apart in the embedding.

Functional Features. Interacting proteins are often involved in the same or similar functions. Therefore, if a pair of proteins are annotated with the same or similar functions, one could, with some degree of accuracy, infer a physical interaction between the two proteins. This is often referred to as "guilt by association," the principle that genes or proteins with related functions tend to share properties such as genetic or physical interactions [Oliver 2000]. This inference can be further strengthened by combining other evidence that supports their functional similarity—for example, if the genes coding for the two proteins are located close by on the genome or are transcribed as an operonic unit (for prokaryotes) [Dandekar et al. 1998, Kumar et al. 2002], or the coding genes are co-transcribed or co-expressed [Huynen et al. 2000, Bowers et al. 2004, Jansen et al. 2002], or show similar phylogenetic profiles [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. Proteins within the same protein complex (co-complexed proteins) show a strong tendency to share functions and cellular localization, and therefore to physically interact. On the other hand, proteins from different cellular compartments most likely never meet, and therefore do not interact in vivo during their lifetimes. Jansen et al. [2003] used interactions between co-complexed proteins from the MIPS protein complex catalog [Mewes et al. 2006] as the positive training set, and non-interacting pairs of proteins as the negative training set, in a Bayesian framework to predict new interactions in yeast. Blohm et al. [2014] present a dataset, the "Negatome," of protein pairs that are highly unlikely to interact, which can be used as a negative training set. The Gene Ontology graph [Ashburner et al. 2000] integrates information on the functional and localization properties of proteins, and therefore also provides a way to predict new interactions. For example, the TCSS approach by Jain and Bader [2010] can be used to compute the similarity between pairs of proteins using the GO graph, and protein pairs showing high GO-semantic similarity can be predicted to physically interact. Likewise, multiple pieces of experimental and functional information can be combined to predict new interactions.

For example, GeneMANIA (http://www.genemania.org/) [Warde-Farley et al. 2010] combines experimentally detected interactions from BioGrid [Stark et al. 2011, Chatr-Aryamontri et al. 2015], pathway annotations from Pathway Commons (http://www.pathwaycommons.org/) [Cerami et al. 2011], and information on the evolutionary conservation of interactions from the Interologous Interaction Database (I2D) [Brown and Jurisica 2005], along with GO-based similarity, to predict new interactions (GeneMANIA and I2D are also listed in Table 2.5). HumanNet [Lee et al. 2011] is a human functional interaction network that includes predicted interactions based on guilt by association for genes involved in human diseases.

Structural Information on Proteins. 3D structures of proteins provide first-hand evidence for protein interaction sites and binding surfaces. Therefore, by assessing the compatibility between the binding surfaces of two proteins, one can predict whether the two proteins interact. For example, Zhang et al. [2012, 2013] analyzed 3D structures of proteins from the Protein Data Bank (PDB) (http://www.rcsb.org/pdb/home/home.do) [Berman et al. 2000], a database which stores 3D structures for over 600 of the ∼6,000 characterized yeast proteins (∼10%), to predict new interactions between proteins in yeast. However, since 3D structures are available for only a small fraction of proteins, using this approach for prediction of interactions on a larger scale is not feasible. Zhang et al. proposed to overcome this limitation to some extent by deriving homology models for proteins without available 3D structures. Homology models were derived for an additional ∼3,600 yeast proteins using the ModBase (http://modbase.compbio.ucsf.edu/) [Pieper et al. 2006] and Skybase (http://skybase.c2b2.columbia.edu/pdb60_new/struct_show.php) [Mirkovic et al. 2007] databases. Given a query protein, these databases predict the most likely 3D structure for the protein based on its sequence similarity with templates built from proteins with available 3D structures. The final set of structurally predicted interactions from the Zhang et al. study is available in the PrePPI database (http://bhapp.c2b2.columbia.edu/PrePPI/). Struct2Net (http://groups.csail.mit.edu/cb/struct2net/webserver/) [Singh et al. 2010] uses a structure-threading approach to predict interactions between proteins. Given two protein sequences, Struct2Net "threads" the sequences onto known 3D structures from PDB and then, based on the best-matching structures, estimates the interaction between the two proteins. PredictProtein (http://www.predictprotein.org/) [Yachdav et al. 2014] combines structure- and GO-based methods to predict new interactions. Wang et al. [2012] curated a 3D-structure-resolved dataset of 4,222 high-quality human PPIs enriched for human disease genes, by examining relationships between 3,949 genes, 62,663 mutations, and 3,453 associated with human disorders.

Literature Mining. Interactions that are missed in PPI datasets but have direct or indirect references in scientific publications can be identified by mining the literature.

For example, these references may include abstracts or full texts of publications maintained in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) by the National Center for Biotechnology Information (NCBI). Text-mining tools based on natural language processing (NLP) and other machine-learning techniques mine for co-occurrences of protein names in these literature sources, and proteins frequently referenced together can be predicted to interact. For example, the PubGene tool, which is a part of COREMINE (http://www.coremine.com/medical/), mines for information on genes and proteins, including their co-occurrences in abstracts of publications, their sequence homology, and their association with cell processes and diseases. This information can be used to predict interactions between frequently co-occurring proteins. Similarly, iHOP (http://www.ihop-net.org/UniPub/iHOP/) [Hoffman and Valencia 2004] presents a network of genes and proteins that co-occur in the scientific literature.

Limitations of Computationally Predicted Protein Interactions. Despite their promise in enhancing experimentally curated interaction datasets, computational methods have their own limitations. For example, methods that rely on (high-throughput) experimental datasets to predict new interactions—e.g., PPI topology-based methods—carry an inherent bias in their predictions toward proteins already accounted for or enriched in these datasets. Moreover, these predictions are also affected by biological and technical noise (spurious interactions) in experimental datasets. Methods based on genomic distances, fusion of genes, and co-transcription of (operonic) genes are applicable only to a selected subset of species (prokaryotes), since most eukaryotic systems do not exhibit these properties as strongly. Knowledge-based methods—e.g., those based on the GO graph, 3D structures, and literature abstracts—are restricted by the amount and kind of available information, and are again applicable to only a subset of proteins. As a result of these limitations, accurate prediction of physical interactions between proteins remains a challenging problem, and most methods in fact predict only functional associations rather than direct physical interactions between proteins. Depending on the application, these functional associations can turn out to be more or less useful. However, despite these limitations, computational techniques are orthogonal to and can effectively complement experimental techniques. Computational predictions can be used to enhance the topologies of PPI networks—for example, by "densifying" sparse network regions or by piecing together disconnected proteins or regions of the network—which in turn can improve downstream applications of PPI networks, including protein complex prediction.

3 Computational Methods for Protein Complex Prediction from PPI Networks

All models are wrong, but some models are useful.
—George Box (1919–2013), British statistician (as quoted in Box [1979])

The process of identifying protein complexes from high-throughput interaction datasets involves the following steps [Spirin and Mirny 2003, Srihari and Leong 2012b, Srihari et al. 2015a] (Figure 1.1):

1. integrating high-throughput datasets from multiple sources and assessing the confidence of interactions;
2. constructing a reliable PPI network using only the high-confidence interactions;
3. identifying modular subnetworks from the network to generate a candidate list of complexes; and
4. evaluating the identified complexes against bona fide complexes, and validating and assigning roles to novel complexes.

In this chapter, we review some of the representative methods developed to date for computational prediction of protein complexes from PPI networks.

3.1 Basic Definitions and Terminologies

A protein-protein interaction (PPI) network is modeled as an undirected graph G = ⟨V, E⟩, where V is the set of proteins and E ⊆ V × V is the set of interactions between the proteins. We also use V(G) and E(G) to refer to the set of proteins and interactions of a (sub)network G, respectively. For a protein v ∈ V, N(v) or N_v is the set of immediate neighbors of v, deg(v) = |N(v)| = |N_v| is the degree of v, and E(v) is the set of interactions in the immediate neighborhood of v. The interaction density of G (or of a subnetwork of G) is defined as

    density(G) = 2|E(G)| / (|V(G)| · (|V(G)| − 1)),    (3.1)

which gives a real number in [0, 1] quantifying the "richness of interactions" within G: 0 for a network without any interactions and 1 for a fully connected network. The clustering coefficient CC(v) measures the "cliquishness" of the neighborhood of v, and is given by

    CC(v) = 2|E(v)| / (|N(v)| · (|N(v)| − 1)).    (3.2)

If the interactions of the network are scored (weighted), i.e., G = ⟨V, E, w⟩, then the weighted versions—weighted degree, weighted interaction density, and weighted clustering coefficient—are given as follows:

    deg_w(v) = Σ_{u ∈ N(v)} w(u, v),
    density_w(G) = 2 Σ_{e ∈ E(G)} w(e) / (|V(G)| · (|V(G)| − 1)),  and
    CC_w(v) = 2 Σ_{e ∈ E(v)} w(e) / (|N(v)| · (|N(v)| − 1)).    (3.3)

There are other variants for these definitions proposed in the literature; see Kalna and Higham [2007] and Newman [2010] for a survey.
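These definitions translate directly into code. The sketch below computes the density of Equation (3.1) and a weighted clustering coefficient in the spirit of Equation (3.3) for a small hypothetical weighted network using NetworkX; here E(v) is taken to be the set of interactions among the neighbors of v, one common reading of the definition above.

```python
import networkx as nx
from itertools import combinations

# Small hypothetical weighted PPI network.
G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 0.9), ("A", "C", 0.7), ("B", "C", 0.8),
                           ("B", "D", 0.4), ("C", "D", 0.6)])

def density(H):
    """Interaction density of (sub)network H (Eq. 3.1)."""
    n = H.number_of_nodes()
    return 2 * H.number_of_edges() / (n * (n - 1)) if n > 1 else 0.0

def weighted_cc(H, v):
    """Weighted clustering coefficient of protein v (Eq. 3.3)."""
    nbrs = list(H.neighbors(v))
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Total weight of interactions among the neighbors of v.
    w_sum = sum(H[x][y]["weight"] for x, y in combinations(nbrs, 2) if H.has_edge(x, y))
    return 2 * w_sum / (k * (k - 1))

print(density(G), weighted_cc(G, "B"))
print(density(G.subgraph(["A", "B", "C"])))   # density of a candidate subnetwork
```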

3.2 Taxonomy of Methods for Protein Complex Prediction

Although at a generic level most methods make the basic assumption that protein complexes are embedded among densely connected subsets of proteins within the PPI network, these methods vary considerably in their algorithmic strategies and in the biological information that they employ to detect complexes. Accordingly, we classify protein complex detection methods into the following two categories [Srihari and Leong 2012b, Srihari et al. 2015a, Li et al. 2010]:

1. methods based solely on network clustering; and
2. methods based on network clustering and additional biological information.

The biological information can be in the form of functional, structural, organizational, or evolutionary evidence on complexes or their constituent proteins. Wang et al. [2010] and Chen et al. [2014] present alternative classifications based, respectively, on the topological properties (interaction density, betweenness centrality, and clustering coefficient) and the dynamical properties of PPI networks capitalized on by computational methods. We present our classification of methods for protein complex prediction as two snapshots—methodology-based and chronology-based—as shown in Figures 3.1 and 3.2, respectively.

The methodology-based classification (Figure 3.1) is based on the algorithmic methodologies employed by the methods. At level 1 of the tree, we divide methods into those based solely on network clustering and those employing additional biological information. At subsequent levels, we further subdivide the methods based on their algorithmic strategies into: (i) methods employing merging or growing of clusters from the network (agglomerative); and (ii) methods employing repeated partitioning of the network (divisive). Agglomerative methods go bottom-up, i.e., they begin with small "seeds" (e.g., triangles or cliques) and repeatedly add proteins or merge clusters of proteins based on certain similarity criteria to arrive at the predicted set of complexes. Divisive methods, on the other hand, go top-down, i.e., they repeatedly partition the network into multiple dense subnetworks based on certain dissimilarity criteria to arrive at the predicted complexes. In the chronology-based classification (Figure 3.2), we bin methods based on the time (year) when they were developed, and for each bin we stack the methods based on the kind of biological information they employ for the prediction. Figures 3.1 and 3.2 and Table 3.1 show only the methods that are discussed in this chapter. Other methods that employ diverse kinds of biological information and/or algorithmic strategies are covered in the following chapters.

3.3 Methods Based Solely on PPI Network Clustering

Most methods that directly mine for dense subnetworks from the PPI network make use of solely the topology of the network.

Molecular COmplex DEtection (MCODE)

MCODE, proposed by Bader and Hogue [2003], is one of the first computational methods developed for protein complex detection from PPI networks. The MCODE algorithm operates in two steps, vertex weighting and complex prediction, with an optional third step for post-processing of the candidate complexes.

Figure 3.1  Methodology-based taxonomy of protein complex prediction methods (for large complexes, of size ≥ 4). At the topmost level, the complex prediction methods are classified into those based solely on PPI network clustering and those that take into account additional biological information (here, core-attachment structure and functional homogeneity). At the next level, the methods are further classified into those that merge and grow clusters (agglomerative) and those that partition the network into smaller clusters (divisive). Methods that are not shown here are covered in the following chapters.

Figure 3.2  Chronology-based taxonomy of protein complex prediction methods (for large complexes, of size ≥ 4). Complex prediction methods are binned based on the year (2000–2012) in which they were published. The methods are stacked into layers based on whether they rely solely on PPI network clustering, or on PPI network clustering with additional biological information (here, core-attachment structure and functional homogeneity). Methods that are not shown here are covered in the following chapters.

Table 3.1  Protein complex prediction methods discussed in this chapter and weblinks to their associated software

Classification              Method        Year         Source                                                                          Reference
PPI network clustering      MCODE         2003         http://apps.cytoscape.org/apps/mcode                                            [Bader and Hogue 2003]
                            MCL           2000, 2004   http://micans.org/mcl/; http://apps.cytoscape.org/apps/clustermaker            [Van Dongen 2000, Leal-Pereira et al. 2004]
                            SPC           2003         http://www.weizmann.ac.il/complex/compphys/sites/complex.compphys/..., files/uploads/software/clustering_prl.ps   [Spirin and Mirny 2003, Blatt et al. 1996]
                            LCMA          2005         http://alse.cs.hku.hk/complexes/                                                [Li et al. 2005]
                            CFinder       2006         http://www.cfinder.org/                                                         [Adamcsek et al. 2006]
                            DPClus        2006         http://kanaya.naist.jp/DPClus/                                                  [Altaf-Ul-Amin et al. 2006]
                            IPCA          2008         http://netlab.csu.edu.cn/bioinformatics/limin/IPCA/                             [Li et al. 2008]
                            SuperComplex  2008         http://www.cs.cmu.edu/~qyj/SuperComplex/                                        [Qi et al. 2008]
                            CMC           2009         http://www.comp.nus.edu.sg/~wongls/projects/...complexprediction/CMC-26may09   [Liu et al. 2009]
                            HACO          2010         http://www.bio.ifi.lmu.de/Complexes/ProCope/                                    [Friedel et al. 2009, Wang et al. 2009]
                            ClusterONE    2012         http://apps.cytoscape.org/apps/clusterone                                       [Nepusz et al. 2012]
                            Ensemble      2012         http://compbio.ddns.comp.nus.edu.sg/~cherny/SWC/                                [Yong et al. 2012]
Core-attachment structure   COACH         2009         http://www1.i2r.a-star.edu.sg/~xlli/coach.zip                                   [Wu et al. 2009]
                            MCL-CAw       2009, 2010   http://sites.google.com/site/mclcaw/                                            [Srihari et al. 2009, Srihari et al. 2010]
                            CORE          2009         Not available                                                                   [Leung et al. 2009]
                            CACHET        2012         http://www1.i2r.a-star.edu.sg/~xlli/CACHET/CACHET.htm                           [Wu et al. 2012]
Functional homogeneity      RNSC          2004         http://www.cs.utoronto.ca/~juris/data/ppi04/                                    [King et al. 2004]
                            DECAFF        2007         http://www1.i2r.a-star.edu.sg/~xlli/DECAFF/complexes.html                       [Li et al. 2007]
                            PCP           2008         http://www.comp.nus.edu.sg/~wongls/projects/...complexprediction/PCP-3aug07/   [Chua et al. 2008]

In the first step, each protein v in the PPI network G is weighted based on its clustering coefficient (CC). However, instead of using the entire neighborhood of v, MCODE uses the density of the highest k-core in the neighborhood of v, which amplifies the weights of proteins located in densely connected regions of G. A k-core is a subnetwork of proteins such that each protein in the subnetwork has degree no less than k. A k-core in the neighborhood of v is a subnetwork of v and all its neighbors that have degree no less than k. A k-core in the neighborhood of v is highest, or maximal, if there exists no (k + 1)-core in that neighborhood. If C_k(v) represents this highest k-core, the weight for v is assigned as

    CC(v) = 2|E(C_k(v))| / (|V(C_k(v))| · (|V(C_k(v))| − 1)).    (3.4)

In the second step, the protein v with the highest weight is used to "seed" a complex. MCODE then recursively moves outward from the seed, including proteins whose weight is no more than a preset percentage, controlled by the vertex-weight parameter, away from that of the seed protein. The process stops when there are no more proteins to be added to the complex. This process is repeated by selecting the next unseeded protein; a protein, once added to a complex, is not subsequently checked for seeding. At the end of this process, multiple non-overlapping candidate complexes are generated.

The optional third step post-processes the candidate list of complexes. First, complexes without 2-cores, i.e., 1-protein and 2-protein complexes, are filtered out. New proteins in the neighborhood of candidate complexes with weights higher than a preset "fluff" parameter are then added to these complexes. These newly added proteins are not marked as "seen" and hence can be added to multiple complexes (shared between the complexes). This step also includes a "hair-cut" option whereby proteins singly connected to complexes are removed before attempting to add new proteins to the complexes. The final complexes are ranked based on their interaction densities. The time complexity of the MCODE algorithm is O(|V(G)| · |E(G)| · h³), where h is the size of an average neighborhood in G.
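A rough sketch of the vertex-weighting step (Equation (3.4)) using NetworkX's k-core routines is given below; the toy network is hypothetical, and the seed selection shown is only the starting point of the expansion step, not the full MCODE procedure.

```python
import networkx as nx

# Toy PPI network (hypothetical proteins).
G = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"),
              ("A", "D"), ("D", "E"), ("E", "F")])

def mcode_weight(G, v):
    """Weight of v: density of the highest k-core in the closed neighborhood of v (Eq. 3.4)."""
    nbhd = G.subgraph([v] + list(G.neighbors(v)))
    k_max = max(nx.core_number(nbhd).values())     # highest k for which a k-core exists
    core = nx.k_core(nbhd, k_max)
    n = core.number_of_nodes()
    return 2 * core.number_of_edges() / (n * (n - 1)) if n > 1 else 0.0

weights = {v: mcode_weight(G, v) for v in G}
seed = max(weights, key=weights.get)               # the highest-weight protein seeds a complex
print(seed, weights)
```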

Markov CLustering (MCL)

The MCL algorithm, proposed by Van Dongen [2000], is a popular graph-clustering algorithm that works by extracting dense subnetworks or regions from graphs using random walks. These random walks are collectively called a flow. In biological applications, MCL was first applied to cluster protein families and protein ortholog groups [Enright et al. 2002], where interactions in the graph represented (sequence) similarities between the proteins.

MCL was subsequently found to be effective for clustering PPI networks into protein complexes and functional modules [Leal-Pereira et al. 2004, Brohee and van Helden 2006, Pu et al. 2007]. MCL works by manipulating the connectivity (adjacency) matrix of the network using two operators, expansion and inflation, to control the flow. Expansion controls the dispersion of the flow, whereas inflation controls the contraction of the flow, making the flow thicker in dense regions and thinner in sparse regions. These operators boost the probabilities of intra-cluster walks and demote those of inter-cluster walks. Such iterative expansion and inflation separates the network into multiple non-overlapping regions. An animated example of the clustering process is available from the MICANS website, http://www.micans.org/mcl/. Mathematically, expansion coincides with matrix multiplication, whereas inflation is a Hadamard power followed by a diagonal scaling of the matrix. Being mainly repeated matrix operations, MCL is therefore efficient and scalable even to large networks.

The inflation parameter I is a user input, and the higher the value of I, the finer the clustering. For PPI networks, Brohee and van Helden [2006] recommend I between 1.8 and 2.0 to obtain clusters that match bona fide protein complexes. However, if the PPI network is reasonably dense (e.g., with average node degree ≥ 10), MCL has a tendency to produce several large (size ≥ 30) clusters that subsume smaller clusters; in such cases, I > 2 is more appropriate to produce a finer clustering. On the other hand, a large I can also lead to the artificial breaking apart of clusters.

One drawback of MCL, particularly in its application to detecting protein complexes, is that it produces only disjoint (non-overlapping) clusters. Therefore, proteins that participate in more than one complex will be assigned (arbitrarily) to only one of the clusters. To overcome this problem, Pu et al. [2007] tweaked MCL by allowing proteins in a cluster that have a sufficiently large number of interacting partners in another cluster (the acceptor cluster) to also be assigned to that cluster. The minimum fraction f of partners a protein requires in the acceptor cluster C is defined by a power-law function f = a|C|^b, where a and b are empirically determined, the recommended values being a = 1.5 and b = −0.5.
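The expansion and inflation operators are plain matrix operations, as the NumPy sketch below illustrates on a toy adjacency matrix; the inflation value, iteration count, and cluster-extraction threshold are arbitrary choices, and production use would rely on the official mcl implementation rather than this toy loop.

```python
import numpy as np

# Toy adjacency matrix with self-loops added (a common MCL preprocessing step).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
M = A + np.eye(A.shape[0])
M = M / M.sum(axis=0)                    # column-normalize into a stochastic flow matrix

inflation = 2.0                          # recommended range for PPI networks: 1.8-2.0
for _ in range(50):
    M = M @ M                            # expansion: dispersing the flow
    M = M ** inflation                   # inflation: a Hadamard power strengthening intra-cluster flow
    M = M / M.sum(axis=0)                # re-scale columns back to probabilities

# Rows that retain non-zero mass act as cluster "attractors"; their support gives the clusters.
clusters = {tuple(np.nonzero(row > 1e-6)[0]) for row in M if row.sum() > 1e-6}
print(clusters)
```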

Superparamagnetic Clustering (SPC)

SPC, proposed by Blatt et al. [1996] and Getz et al. [2000, 2002], is based on the Potts model [Fukunaga 1990] (a model of ferromagnetism), in which data points in a given metric space are clustered based on "spins" assigned to them. Initially, each data point in the metric space is assigned a spin from a fixed range of spin values such that two close-enough (based on a distance threshold) data points are assigned the same spin.

Beginning from an initial condition (represented by a temperature parameter), the system is subjected to a relaxation (cooling) process, by constantly lowering the temperature. At each step, the spins on the data points are allowed to fluctuate such that a cost function is constantly minimized. This cost function measures the overall similarity between the spins of the data points, and is lowest when data points that are close to one another have similar spins. The system is considered unstable at the initial condition (high temperature), but when the cost function reaches its minimum (at a lower temperature), the system is considered stable (the superparamagnetic phase). At this stable condition, the data points are considered clustered, with points having the same spins assigned to the same cluster.

Let X = {x_1, x_2, . . . , x_N} be a set of data points in a given metric space that needs to be partitioned into M groups that form the clusters. If two data points x_i and x_j are close enough, ||x_i − x_j|| ≤ d, they are assigned the same spin s_i = s_j from a range [1, q]. The cost function H(X) is defined as

    H(X) = Σ_{(i,j)} J(x_i, x_j) · δ(x_i, x_j),    (3.5)

where J(x_i, x_j) ∝ −||x_i − x_j||, and δ(x_i, x_j) = 0 if s_i = s_j, else δ(x_i, x_j) = 1. The function δ(x_i, x_j) measures the dissimilarity between the spins of x_i and x_j. The cooling process makes the spins on the data points fluctuate such that H(X) decreases at each step until the function reaches a (local) minimum. However, due to the lack of sufficient details in the original works, we are unable to describe how the spins are updated to ensure that H(X) reaches a minimum and the cooling process halts. Blatt et al. [1996] describe a halting condition based on the variation of the fluctuating spins during the cooling process. During the initial stages of the cooling process, the spins fluctuate more (unstable system), but as the system cools, the spins tend to converge to specific values and fluctuate less (stable system). The aim is to locate a temperature range [T_t . . . T_{t+l}] in which the system is stable. A susceptibility parameter χ over the temperature range of the system is measured, which depends on the variance of the fluctuation m of the spins over the temperature range, given as

    χ = (N/l) · (⟨m²⟩ − ⟨m⟩²),    (3.6)

where, for each temperature point T_i of the system, m is computed as

    m = ((N_max/N) · q − 1) / (q − 1).    (3.7)

Here, N_max = max{N_1, N_2, . . . , N_q}, where N_μ is the number of spins with value μ, and ⟨· · ·⟩ denotes the average. As the system stabilizes, the variance of the fluctuating spins becomes negligible, and so does the susceptibility χ of the system. The average spin dissimilarity ⟨δ(x_i, x_j)⟩ over a subset of data points is used to decide whether or not the data points belong to the same cluster.

Spirin and Mirny [2003] used SPC to predict protein complexes from PPI networks. Again, due to the lack of sufficient details in the original works, we are unable to provide a clear description of this application of SPC to networks. Briefly, however, if two proteins p_i and p_j are immediate neighbors in the PPI network, they are mapped to data points x_i and x_j in the metric space such that x_j is one of the k nearest data points to x_i, and the two points x_i and x_j are assigned the same spins, s_i = s_j. As described above, the system is allowed to cool until it attains low susceptibility, at which stage the data points, and hence the proteins, are considered to be clustered. A subset of proteins is assigned to the same cluster (protein complex) if the average spin dissimilarity between all pairs of proteins in the subset is below a threshold.

Local Clique Merging Algorithm (LCMA)

LCMA, proposed by Li et al. [2005], is based on identifying and merging local cliques into larger dense subgraphs for protein complex detection. LCMA works in two steps. In the first step, local cliques are identified from the network G as follows. For a vertex v in G, its local neighborhood subgraph G_v = ⟨V_v, E_v⟩ is computed, where V_v = {v} ∪ {u : (u, v) ∈ E} and E_v = {(s, t) : (s, t) ∈ E, s ∈ V_v, t ∈ V_v}. LCMA then repeatedly removes a (loosely connected) vertex from this local neighborhood subgraph if the removal increases the interaction density of the subgraph, until the density cannot be increased any further. The resulting local subgraph is a local clique with density 1. LCMA then moves on to the local neighborhood of the next vertex to produce similar local cliques.

However, the local cliques produced this way may overlap considerably. Therefore, in the second step, LCMA repeatedly merges these local cliques using their neighborhood affinities (NA) to produce larger neighborhoods. The affinity between two local cliques (neighborhoods) C_1 and C_2 is given as NA(C_1, C_2) = |C_1 ∩ C_2|² / (|C_1| · |C_2|). Beginning with the set LC of all local cliques (neighborhoods), LCMA repeatedly merges two cliques C_1 and C_2 that have NA(C_1, C_2) ≥ ω, a preset threshold, to produce a merged and larger neighborhood. This resulting neighborhood is added back to LC and used in the merging process.


Local cliques that are not used up in the merging process are retained. The final set of merged neighborhoods and unmerged cliques are output as candidate protein complexes.
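To make the merging step concrete, the following is a minimal Python sketch of the neighborhood-affinity computation and a greedy clique-merging loop in the spirit described above; cliques are represented simply as sets of protein identifiers, the threshold ω is left as a parameter, and the function names are our own rather than those of the LCMA software.

```python
from itertools import combinations

def neighborhood_affinity(c1, c2):
    """NA(C1, C2) = |C1 ∩ C2|^2 / (|C1| * |C2|)."""
    inter = len(c1 & c2)
    return inter * inter / (len(c1) * len(c2))

def merge_local_cliques(cliques, omega=0.8):
    """Greedily merge cliques whose neighborhood affinity is at least omega.

    cliques: iterable of sets of protein identifiers.
    omega:   NA threshold (a hypothetical value; LCMA leaves it tunable).
    """
    pool = [frozenset(c) for c in cliques]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(pool, 2):
            if neighborhood_affinity(c1, c2) >= omega:
                pool.remove(c1)
                pool.remove(c2)
                pool.append(c1 | c2)   # the merged, larger neighborhood
                merged = True
                break                  # restart the scan over the updated pool
    return pool
```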

Complex Finder (CFinder)

CFinder, proposed by Adamcsek et al. [2006], uses the idea of clique percolation by Derényi et al. [2005] to locate k-clique percolation clusters that represent protein complexes in the PPI network. A k-clique is a complete subnetwork of size k, and two k-cliques are said to be adjacent if they share exactly k − 1 vertices. A k-clique percolation cluster includes: (i) all vertices that can be reached from each other via chains of adjacent k-cliques; and (ii) the edges within these cliques. A k-clique percolation cluster can be visualized using a k-clique adjacency graph—where the vertices represent the k-cliques of the original graph and there is an edge between two vertices if the corresponding k-cliques are adjacent—so that moving from one vertex to another along an edge is equivalent to “rolling” a k-clique template from one k-clique of the original graph to an adjacent one. For an Erdős–Rényi random network of N vertices with p being the probability that any two vertices are connected, Derényi et al. [2005] show that a giant k-clique component appears in the network at probability p = p_c(k), the percolation threshold,

p_c(k) = \frac{1}{[(k - 1) \cdot N]^{1/(k-1)}}.    (3.8)

For k = 2, this result agrees with the known threshold p_c(2) = 1/N, because 2-clique connectedness is equivalent to regular edge connectedness. Having located a k-clique at random, we can roll a k-clique template onto an adjacent k-clique by relocating one of its vertices while keeping the other k − 1 vertices fixed. Expression (3.8) can then be obtained by requiring that, after rolling, the expected number of adjacent k-cliques to which the template can roll further be equal to 1 at the percolation threshold p_c(k). The intuition behind this requirement is that a smaller expectation value would cause the rolling to halt too soon, whereas a larger expectation value would allow an infinite series of bifurcations for the rolling. This expectation can be estimated as (k − 1)(N − k − 1)p^{k−1}, where the first term (k − 1) counts the number of vertices that can be selected for the roll-on, and the second term (N − k − 1) counts the number of potential destinations for this roll-on, of which only a fraction p^{k−1} is acceptable, because each of the new k − 1 edges (associated with the roll-on) must exist in order to obtain a new k-clique. For large N, this criterion simplifies to (k − 1)Np^{k−1} = 1, giving Equation (3.8) for p_c(k).


Since we are interested in locating more than one k-clique rather than just a single giant k-clique, it is important that the number of k-cliques in the percolation cluster, denoted by N*, not grow as fast as N_k, the total number of k-cliques in the graph. We can define an order parameter associated with this choice as ψ = N*/N_k. The number of k-cliques in the graph can be estimated as:

N_k \approx \binom{N}{k} p^{k(k-1)/2} \approx \frac{N^k}{k!} p^{k(k-1)/2},    (3.9)

because k different vertices can be selected in \binom{N}{k} different ways, and any such selection makes a k-clique if and only if all the k(k − 1)/2 edges between these k vertices exist, each with probability p. Erdős [1960] showed that for random graphs with N vertices, the size of the largest component scales as N^{2/3}. Therefore, applying the same to the k-clique adjacency graph, the size of the giant component N* scales as N_k^{2/3}. Plugging p = p_c(k) from Equation (3.8) into Equation (3.9), we get the scaling

N_k ∼ N^{k/2}    (3.10)

for the total number of k-cliques. Thus, the size of the giant component N* is expected to scale as N_k^{2/3} ∼ N^{k/3}, and the order parameter ψ scales as N_k^{2/3}/N_k ∼ N^{−k/6}. But obviously, this holds only if k ≤ 3, because for k > 3 the size N*, which scales as N^{k/3}, grows faster than N. Therefore, for k > 3, we expect N* ∼ N, and using Equation (3.10), we get ψ = N*/N_k ∼ N^{1−(k/2)}, that is, ψ vanishes as some negative power of N. Since PPI networks are generally sparse and composed of more than one connected component, we expect this regime to hold. Adamcsek et al. [2006] identified k-clique percolation clusters from PPI networks, and suggest choosing k between 4 and 6 to identify clusters that match real protein complexes.
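As a quick illustration of Equation (3.8), the following snippet computes the percolation threshold p_c(k) for a few clique sizes in a hypothetical random network of N = 5,000 proteins; the numbers are only meant to show how the required edge probability grows with k.

```python
def percolation_threshold(k, n):
    """p_c(k) = 1 / ((k - 1) * N)^(1/(k - 1)), as in Equation (3.8)."""
    return 1.0 / ((k - 1) * n) ** (1.0 / (k - 1))

# For k = 2 this reduces to the classical giant-component threshold 1/N.
for k in (2, 3, 4, 5):
    print(k, percolation_threshold(k, 5000))
```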

Density-Periphery-based Clustering (DPClus)

DPClus, proposed by Altaf-Ul-Amin et al. [2006], works in a manner similar to MCODE [Bader and Hogue 2003] by first identifying seed vertices and then expanding them into protein complexes. DPClus begins by assigning a weight λ(u) to every protein u in the network: λ(u) = |N(u)| if the network is unweighted, and λ(u) = Σ_{v∈N(u)} w(u, v) if the network is weighted. Proteins are then sorted in non-increasing order of their weights, and the protein with the highest weight is chosen as a seed. This seed is initialized as a cluster, and neighbors of the cluster are then repeatedly chosen and added to the cluster. Specifically, a cluster property Cp(v, C) of a neighboring protein v with respect to a cluster C is computed as:


Cp(v, C) = \frac{\sum_{p \in C} w(v, p)}{|C|} \cdot \frac{1}{density_w(C)},    (3.11)

where density_w(C) is the weighted interaction density of C. The neighbors of C are then prioritised in non-increasing order of their cluster property. A neighbor v is repeatedly selected and added to C if: (i) Cp(v, C) ≥ Cp_in, a minimum threshold for the cluster property; and (ii) the addition of v does not reduce the density of C below a threshold d_in. The protein v is then considered to be not at the “periphery” of the cluster but rather to belong to the cluster (and hence the name, DPClus). Once a cluster cannot be expanded any further, it is output as a predicted complex and removed from the network. The next seed is then chosen and expanded using the remaining network. DPClus thus outputs only non-overlapping clusters; however, in a modification of their implementation (http://kanaya.naist.jp/DPClus/), the authors retain the original network when future seeds are expanded, to produce overlapping clusters.
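The cluster-property computation of Equation (3.11) can be sketched in Python as follows. The weighted-density convention used here (total pairwise weight over the number of possible pairs) is one common choice and may differ from the normalization in the DPClus implementation; the helper names and the edge-weight dictionary keyed by unordered protein pairs are our own assumptions.

```python
from itertools import combinations

def weighted_density(cluster, w):
    """Weighted density of a cluster: total pairwise weight divided by the
    number of possible pairs (an assumed normalization)."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(w.get(frozenset(p), 0.0) for p in pairs) / len(pairs)

def cluster_property(v, cluster, w):
    """Cp(v, C) of Equation (3.11): average weight of v's edges into C,
    normalized by the cluster's weighted density."""
    d = weighted_density(cluster, w)
    if d == 0:
        return 0.0
    into_c = sum(w.get(frozenset((v, p)), 0.0) for p in cluster)
    return (into_c / len(cluster)) / d

# w maps frozenset({u, v}) -> interaction weight; cluster is a set of proteins.
```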

Interaction Probability-based Clustering Algorithm (IPCA)

IPCA, developed by Li et al. [2008], modifies the criteria used in DPClus to add (neighboring) proteins to clusters by introducing two additional measures, cluster diameter and interaction probability. The diameter Diam_C of the cluster C is the maximum distance between any pair of proteins in the cluster. The interaction probability P(v, C) for a protein v with the cluster C is given by

P(v, C) = \frac{\sum_{p \in C} w(v, p)}{\sum_{p, q \in C} w(p, q)}.    (3.12)

Li et al. observed that if P(v, C′) ≥ t for every protein v ∈ C, where V(C′) = V(C) − {v} and t is a fixed threshold, then density(C) ≥ t, that is, the density of the cluster C is lower-bounded by t. Li et al. also observed that many real complexes have small diameters. Based on these observations, the criteria to select proteins v to be added to cluster C in IPCA are modified to: (i) P(v, C) ≥ t; and (ii) the diameter of the cluster after including v satisfies Diam_C ≤ d, where t and d are fixed thresholds.

Clustering Based on Merging Maximal Cliques (CMC)

CMC, proposed by Liu et al. [2009], detects protein complexes by repeatedly merging reliable maximal cliques from the PPI network. While LCMA [Li et al. 2005] also uses a similar strategy, it is restricted to unweighted networks, whereas CMC uses the weights of interactions to filter and merge cliques. The inclusion of weights allows CMC to discount the effect of noise in the network.


The interactions are weighted using Iterative-CD scoring (Chapter 2), which is available with the CMC software package. CMC begins by enumerating all maximal cliques in the network using a fast search space pruning-based maximal clique enumeration algorithm [Tomita et al. 2006]. Although enumerating all maximal cliques in a graph is NP-hard in general, this does not pose a problem here because PPI networks are quite sparse. Each clique C is scored using its weighted interaction density,

density_w(C) = \frac{\sum_{u, v \in C} w(u, v)}{|C| \cdot (|C| - 1)}.    (3.13)

The cliques are ranked in non-increasing order of their weighted densities. CMC then iteratively merges highly overlapping cliques based on the extent of their inter-connectivity. The overlap between two cliques C1 and C2 is defined as O(C1, C2) = |C1 ∩ C2|/|C1 ∪ C2|, and the cliques are considered to highly overlap if O(C1, C2) ≥ T_o, an overlap threshold. The inter-connectivity I_w(C1, C2) between C1 and C2 is then defined based on their non-overlapping regions:

I_w(C_1, C_2) = \frac{\sum_{u \in (C_1 - C_2)} \sum_{v \in C_2} w(u, v)}{|C_1 - C_2| \cdot |C_2|} \cdot \frac{\sum_{u \in (C_2 - C_1)} \sum_{v \in C_1} w(u, v)}{|C_2 - C_1| \cdot |C_1|}.    (3.14)

If I_w(C1, C2) ≥ T_m, a merge threshold, then C2 is merged with C1; otherwise C2 is removed if density_w(C1) ≥ density_w(C2) and O(C1, C2) ≥ T_o. Finally, all merged clusters are ranked based on their weighted densities and output as predicted complexes. Since CMC takes into account the weights of interactions, it prioritises more reliable cliques for the merging process while eliminating the less reliable ones, thereby reasonably discounting the effect of noise in PPI datasets.
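A minimal sketch of the two quantities CMC uses to decide merges—the overlap O(C1, C2) and the inter-connectivity of Equation (3.14)—assuming interaction weights are stored in a dictionary keyed by unordered protein pairs (the function names are ours, not from the CMC package):

```python
def overlap(c1, c2):
    """O(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|."""
    return len(c1 & c2) / len(c1 | c2)

def interconnectivity(c1, c2, w):
    """I_w(C1, C2) of Equation (3.14), computed over the non-overlapping regions."""
    def one_side(a, b):
        diff = a - b
        if not diff:
            return 0.0
        total = sum(w.get(frozenset((u, v)), 0.0)
                    for u in diff for v in b if u != v)
        return total / (len(diff) * len(b))
    return one_side(c1, c2) * one_side(c2, c1)

# w maps frozenset({u, v}) -> weight; c1, c2 are sets of proteins (cliques).
```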

Clustering with Overlapping Neighborhood Expansion (ClusterONE)

ClusterONE, proposed by Nepusz et al. [2012], works in a manner similar to MCODE, by greedy neighborhood expansion. However, unlike MCODE (and MCL), it allows overlaps between the generated protein complexes. Beginning with individual seed proteins, ClusterONE greedily expands them into clusters C ∈ C based on a cohesiveness measure, given by:

f(C) = \frac{w^{(in)}(C)}{w^{(in)}(C) + w^{(bound)}(C) + p(C)},    (3.15)

where w^{(in)}(C) is the total weight of interactions within cluster C, w^{(bound)}(C) is the total weight of interactions connecting C to the rest of the PPI network, and p(C) is a penalty term to model uncertainty in the data due to missing interactions. At each step, new proteins are added to C until f(C) does not increase any further.


Cluster C represents a locally cohesive group of proteins. When all such clusters are generated, the highly overlapping ones are merged into candidate protein complexes. The overlap between two clusters C1, C2 ∈ C is computed as:

ω(C_1, C_2) = \frac{|C_1 \cap C_2|^2}{|C_1| \cdot |C_2|},    (3.16)

and C1 and C2 are merged if ω(C1, C2) ≥ 0.8. If a cluster does not overlap with any other cluster, it is promoted by itself as a candidate protein complex without any merging. Likewise, if a cluster overlaps with more than one other cluster, it is merged individually with each of its overlapping clusters. In the final step, ClusterONE discards all candidate complexes that contain fewer than three proteins or whose density is below a certain threshold, to produce the final list of protein complexes.
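The cohesiveness score of Equation (3.15) and the greedy expansion it drives can be sketched as follows; the per-protein penalty value, the edge-weight dictionary keyed by frozensets, and the helper names are hypothetical choices for illustration, not the defaults of the ClusterONE software.

```python
def cohesiveness(cluster, w, penalty_per_protein=2.0):
    """f(C) = w_in / (w_in + w_bound + p(C)), as in Equation (3.15).

    w maps frozenset({u, v}) -> weight; the per-protein penalty is an
    assumed stand-in for ClusterONE's uncertainty term p(C)."""
    w_in = w_bound = 0.0
    for pair, weight in w.items():
        u, v = tuple(pair)
        inside = (u in cluster) + (v in cluster)
        if inside == 2:
            w_in += weight
        elif inside == 1:
            w_bound += weight
    return w_in / (w_in + w_bound + penalty_per_protein * len(cluster))

def neighbors_of(cluster, w):
    """All proteins adjacent to, but not inside, the cluster."""
    nb = set()
    for pair in w:
        u, v = tuple(pair)
        if (u in cluster) != (v in cluster):
            nb.add(v if u in cluster else u)
    return nb

def grow_cluster(seed, w):
    """Greedy expansion: repeatedly add the neighbor that most increases f(C)."""
    cluster = {seed}
    while True:
        best, best_f = None, cohesiveness(cluster, w)
        for cand in neighbors_of(cluster, w):
            f = cohesiveness(cluster | {cand}, w)
            if f > best_f:
                best, best_f = cand, f
        if best is None:
            return cluster
        cluster.add(best)
```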

Hierarchical Agglomerative Clustering with Overlaps (HACO)

HACO, proposed by Wang et al. [2009], modifies classical hierarchical agglomerative clustering (HAC) [Kaufman and Rousseeuw 2009, Sardiu et al. 2009] to allow overlaps between clusters and thereby identify overlapping complexes. At each step, the HAC algorithm with average linkage maintains a pool of candidate sets to be merged. The distance between two non-overlapping sets S1 and S2 is given by:

D(S_1, S_2) = \frac{1}{|S_1| \cdot |S_2|} \sum_{p \in S_1, q \in S_2} d(p, q),    (3.17)

where d(p, q) is the complement of the affinity between proteins p and q. The affinity is usually the reliability weight w(p, q) ∈ [0, 1] of the interaction (p, q) in the PPI network, and therefore d(p, q) can be set to, for example, d(p, q) = 1 − w(p, q). In each step, the two non-overlapping sets S1 and S2 with the closest distance are merged to generate a new set S12, and the original sets S1 and S2 are removed. The algorithm terminates when there are no remaining sets to be merged. In HACO, the sets S1 and S2 are retained for potential merges with other sets later, the intuition being that if there is another set S3 whose distance to S1 is only slightly greater than that of S2, then the decision to merge S1 and S2 could be arbitrary and unstable. This is more so when there are spurious interactions in the PPI network. Therefore, HACO produces two merged sets S12 and S13 by retaining S1, based on a divergence decision: if S1 is considerably different from S12, then S1 is retained (to generate S13); otherwise S1 is removed while S12 is retained. This procedure can potentially result in overlapping complexes, for example, when both S12 and S13 are produced, and thus improves on classical HAC for identifying overlapping protein complexes.
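A small sketch of the average-linkage distance of Equation (3.17), assuming reliability weights are stored in a dictionary keyed by unordered protein pairs and that missing interactions have weight 0:

```python
def average_linkage(s1, s2, w):
    """D(S1, S2): mean of d(p, q) = 1 - w(p, q) over all cross pairs."""
    total = sum(1.0 - w.get(frozenset((p, q)), 0.0) for p in s1 for q in s2)
    return total / (len(s1) * len(s2))
```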


Supervised Protein Complex Prediction (SuperComplex)

SuperComplex, proposed by Qi et al. [2008], works by supervised clustering of the PPI network using a Bayesian Network model learned from distinguishing features of real protein complexes. If a certain subnetwork of the PPI network represents a real protein complex, a distinguishing feature is a property that can sharply distinguish this subnetwork from other (random) subnetworks that do not represent real protein complexes. Qi et al. employed ten groups of features, of which a majority (nine groups) are based on the topology of the subnetwork, and one is based on the weight and size characteristics of the proteins involved in the subnetwork. Hence, we classify SuperComplex as still a (predominantly) topology-based clustering method. The ten groups of distinguishing features for a subnetwork S are as follows (the number of features in each group is given in parentheses):

1. number of proteins in S (1);
2. interaction density of S (1);
3. degree statistics: mean, variance, median, and maximum degree of the proteins in S (4);
4. interaction-weight statistics: mean and variance of interaction weights in S (2);
5. interaction density of S with respect to preset weight cut-offs on interactions (1);
6. degree correlation statistics: mean, variance, and maximum of the number of interactions with immediate neighbors of each protein in S (3);
7. clustering coefficient: see below (1);
8. topological coefficient statistics: see below (1);
9. the first three eigenvalues: see below (3); and
10. protein weight/size statistics: the average and maximum protein length, and the average and maximum protein weight of the proteins in the subnetwork (4).

The clustering coefficient (CC), which has been defined earlier (Section 3.1) but is redefined here for ease of reading, measures the “cliquishness” of the neighborhood of a protein v, and is computed as follows. If E_v is the number of interactions in the immediate neighborhood of v, then CC_v = 2|E_v|/(|N_v| · (|N_v| − 1)). This feature is then averaged over all proteins in the subnetwork S in question: CC(S) = (1/|S|) · Σ_{v∈S} CC_v.


Figure 3.3 The first three eigenvalues (singular values, SVs) for four example subnetwork topologies on four proteins A, B, C, D: Linear (SV1: 1.618, SV2: 1.618, SV3: 0.618), Clique (SV1: 3.0, SV2: 3.0, SV3: 1.0), Star (SV1: 1.7321, SV2: 1.7321, SV3: 0.0), and Hybrid (SV1: 2.1701, SV2: 2.1701, SV3: 1.0). (Redrawn from Qi et al. [2008])

The topological coefficient (TC) is computed as follows. Let u be a neighbor of v; then N_v ∩ N_u is the set of neighbors shared between v and u, and |N_v ∩ N_u| + 1 is the number of shared neighbors to which both v and u are linked, including each other. Thus, TC is defined as:

TC_v = \frac{1}{|N_v|} \cdot \frac{\sum_{u \in N_v, |N_v \cap N_u| \geq 1} (|N_v \cap N_u| + 1)}{|\{u : u \in N_v, |N_v \cap N_u| \geq 1\}|},    (3.18)

where u in the equation includes only the neighbors of v with which v shares at least one other neighbor. This feature is then averaged over all proteins in the subnetwork S in question: TC(S) = (1/|S|) · Σ_{v∈S} TC_v.

The eigenvalues feature includes the first three singular values (SVs) of the adjacency matrix of the subnetwork S. The singular value decomposition of an m × n matrix M is a factorization of the form M = UΣV, where U is an m × m unitary matrix, Σ is an m × n diagonal matrix with non-negative real numbers on its diagonal, and V is an n × n unitary matrix. The diagonal entries of Σ are known as the SVs of M. A common convention is to list the SVs in descending order. Qi et al. noted that subnetworks of different topologies (e.g., linear, clique, star, and hybrid) differ in their first three SVs; see Figure 3.3.

A supervised Bayesian Network (BN) is then trained using a positive training set of known protein complexes and a negative training set generated by randomly grouping proteins selected from the PPI network. In this model, the above features are treated as conditionally independent given two parameters: whether the subnetwork is a protein complex or not (C = 1 or 0), and the number of proteins N in the subnetwork. The number of proteins N is treated separately and is not included in the list of features, because of the tendency of the other features to depend on N.


For a given subnetwork S, the conditional probability that it represents a protein complex is computed using Bayes' rule as:

p(C = 1 \mid N, x_1, x_2, ..., x_m) = \frac{p(N, x_1, ..., x_m \mid C = 1) \cdot p(C = 1)}{p(N, x_1, ..., x_m)}
 = \frac{p(x_1, ..., x_m \mid N, C = 1) \cdot p(N \mid C = 1) \cdot p(C = 1)}{p(N, x_1, ..., x_m)}
 = \frac{\prod_{k=1}^{m} p(x_k \mid N, C = 1) \cdot p(N \mid C = 1) \cdot p(C = 1)}{p(N, x_1, ..., x_m)},    (3.19)

where x_1, x_2, ..., x_m represent the features discussed above. A similar conditional probability is computed for S to represent a non-complex, by replacing C = 1 with C = 0 in the above equation. Using the two posteriors, the log-likelihood ratio for the subnetwork S is given by:

L = \log \frac{p(C = 1 \mid N, x_1, x_2, ..., x_m)}{p(C = 0 \mid N, x_1, x_2, ..., x_m)}
 = \log \frac{p(N \mid C = 1) \cdot p(C = 1) \cdot \prod_{k=1}^{m} p(x_k \mid N, C = 1)}{p(N \mid C = 0) \cdot p(C = 0) \cdot \prod_{k=1}^{m} p(x_k \mid N, C = 0)}.    (3.20)

Maximum likelihood estimation is used for learning these conditional probabilities from the training data. The prior is P(C = 0) = 1 − P(C = 1), with P(C = 1) set to 0.0001. The log-likelihood score is then used to identify maximally scoring subnetworks from the PPI network, using a heuristic iterative procedure as follows. The procedure begins with seed clusters consisting of pairs of proteins (u, v) in which the protein u is connected only to the neighbor v with which it has the highest interaction weight. In each iteration, a neighbor is repeatedly chosen from the neighbors of each current seed cluster and added to the cluster if the new log-likelihood score L′ is higher than the current score L. If the addition of the neighbor decreases the score, the expanded cluster is still accepted with probability e^{(L′−L)/T}, where T is a temperature parameter, with the initial temperature T_0 chosen using cross validation on a dataset of training and testing samples. After each such iteration, the temperature is reduced as T := αT, where the scaling factor α is also chosen using cross validation. The clusters resulting after 20 iterations of this search procedure are output as predicted protein complexes.
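The scoring and acceptance steps can be sketched as follows; the `cpd` object holding the learned conditional probabilities is hypothetical (it stands in for whatever tables maximum likelihood estimation produces), and the acceptance rule follows the simulated-annealing form described above.

```python
import math
import random

def log_likelihood_ratio(features, n, cpd):
    """L of Equation (3.20) under the naive-Bayes factorization.

    cpd is a hypothetical object exposing the learned probabilities:
    cpd.prior(c), cpd.p_n(n, c), and cpd.p_feature(k, x, n, c)."""
    l = math.log(cpd.p_n(n, 1) * cpd.prior(1))
    l -= math.log(cpd.p_n(n, 0) * cpd.prior(0))
    for k, x in enumerate(features):
        l += math.log(cpd.p_feature(k, x, n, 1))
        l -= math.log(cpd.p_feature(k, x, n, 0))
    return l

def accept_expansion(l_current, l_new, temperature):
    """Simulated-annealing acceptance: always accept an improvement,
    otherwise accept with probability e^((L' - L) / T)."""
    if l_new >= l_current:
        return True
    return random.random() < math.exp((l_new - l_current) / temperature)
```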


Ensemble Clustering

Yong et al. [2012] and Srihari and Leong [2012a] developed ensemble approaches that use majority voting to aggregate clusters generated from different protein complex prediction methods and from different PPI networks, respectively. In the study by Yong et al. [2012], individual lists of clusters are first predicted from the PPI network using different protein complex prediction methods, namely MCL, CMC, ClusterONE, and HACO. Sets of similar clusters produced by different clustering methods are then identified: two clusters A and B produced by two different methods are assessed for similarity using the Jaccard index, J(A, B) = |A ∩ B|/|A ∪ B|, and the two clusters are deemed similar if J(A, B) ≥ 0.75. If the two clusters are similar, then only the one with the higher weighted density, say cluster A here, is retained, and it is assigned a score given by:

Score(A) = Num(A) · density_w(A),    (3.21)

where Num(A) is the number of different methods that produce A or clusters similar to A. Thus, clusters of higher quality (density) that are predicted by multiple methods are scored higher. All unique clusters produced by each method, which are not similar to clusters from other methods, are also retained and scored based on their densities. As the different complex prediction methods do not have large overlaps in their predictions, doing so improves the coverage of the generated clusters while maintaining their quality, by scoring higher those clusters with higher densities that are predicted by multiple methods. A similar approach is adopted by Benschop et al. [2010], with the difference that the overlapping (similar) clusters are merged such that the final merged cluster contains the proteins present in at least two originating clusters.

In the study by Srihari and Leong [2012a], clusters produced by the same method but from different (weighted) PPI networks are combined to produce consensus clusters. Let G = {G_1, G_2, ..., G_k} be k different PPI networks from which the clusters are predicted, and let C = ∪_i C(G_i) be the union of the sets of clusters C(G_i), i = 1, ..., k, produced from the k networks. Sets of similar clusters identified from different PPI networks are first found: two clusters A ∈ C(G_i) and B ∈ C(G_j), i ≠ j, are assessed for similarity using the Jaccard index J(A, B) = |A ∩ B|/|A ∪ B|, and the two clusters are deemed similar if J(A, B) ≥ 0.75. Next, a cluster-similarity network H = ⟨V_H, E_H⟩ is constructed, where V_H = C and E_H = {(A, B) : A ∈ C(G_i), B ∈ C(G_j), i ≠ j, J(A, B) ≥ 0.75}. Each connected component of the network H is then collapsed into a single consensus complex T such that T contains all proteins that are present in at least half of the clusters in the component.


All remaining clusters—singleton clusters that do not belong to any connected component, and clusters within components that are not collapsed into consensus complexes—are retained as individual clusters. The final list of clusters is then scored using weighted density, as before. This consensus approach accounts for the differences in clusters arising from differences in the weighting of interactions between the networks. Taking the consensus improves the protein coverage of the clusters while maintaining their quality, by including only proteins that are predicted from multiple networks.

In another approach, instead of combining the output clusters from different clustering methods, Asur et al. [2007] used the base clusters to construct a new network, such that an interaction exists in this network between two proteins u and v if and only if the two proteins are present together in at least one base cluster. The interaction (u, v) is weighted in proportion to the reliability of the clusters containing the protein pair as:

w(u, v) = \sum_{k=1}^{p} Rel(C_k) \cdot Mem(u, v, C_k),    (3.22)

where Rel(C_k) represents the reliability of cluster C_k (determined, for example, as the interaction density of the cluster) and Mem(u, v, C_k) ∈ {0, 1} indicates whether the pair (u, v) is a member of C_k. The resulting weighted network is then clustered using a hierarchical agglomerative method (a method not used previously to produce the base clusters) to obtain the final candidate set of protein complexes.
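A minimal sketch of the Jaccard-based grouping and scoring used by ensemble approaches in the style of Yong et al. [2012] (Equation (3.21)); the greedy grouping below is a simplification, and the function names and data structures are our own.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def ensemble_score(cluster_lists, density_w, j_thresh=0.75):
    """Aggregate clusters from several methods, in the spirit of Equation (3.21).

    cluster_lists: one list of frozensets per prediction method.
    density_w:     function mapping a cluster to its weighted density.
    Similar clusters (Jaccard >= j_thresh) are represented by their densest
    member, scored by (number of supporting methods) x (weighted density)."""
    tagged = [(c, m) for m, clusters in enumerate(cluster_lists) for c in clusters]
    used, scored = set(), {}
    for i, (c, m) in enumerate(tagged):
        if i in used:
            continue
        group = [(c, m)]
        for j in range(i + 1, len(tagged)):
            d, n = tagged[j]
            if j not in used and n != m and jaccard(c, d) >= j_thresh:
                group.append((d, n))
                used.add(j)
        representative = max((cl for cl, _ in group), key=density_w)
        methods = len({method for _, method in group})
        scored[representative] = methods * density_w(representative)
    return scored
```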

3.4 Methods Incorporating Core-Attachment Structure

Protein complexes are modular entities and, in particular, proteins within eukaryotic complexes can be categorized into two distinct modular subgroups based on their function and organization—cores, which constitute the central functional units of complexes, and attachments, which are peripheral and aid the core proteins in their functions [Gavin et al. 2006]—as depicted in Figure 3.4. This categorization into cores and attachments (the core-attachment structure) provides important insights into the manner in which proteins come together to constitute protein complexes and perform functions. For example, core proteins are strongly co-expressed with one another, and depending on the cellular context, a set of core proteins may interact with different sets of attachment proteins to form different protein complexes. Likewise, depending on the cellular context, the same attachment proteins may interact with different core proteins to form different complexes. But even among the attachment proteins, there are subsets of proteins called attachment modules that often function together in complexes (Figure 3.4).


Figure 3.4 Core-attachment modularity observed in yeast and eukaryotic protein complexes [Gavin et al. 2006]. Cores constitute central functional units within complexes, whereas attachments are peripheral and aid the core proteins in their functions. Core proteins are strongly co-expressed with one another, and depending on the cellular context, a set of core proteins may interact with different attachment proteins to form different protein complexes. Likewise, the same attachment proteins may interact with different core proteins to form different protein complexes. Even among the attachment proteins, there may be subsets of proteins, called attachment modules, that often function together when forming complexes.

In the PPI network, one can observe that core proteins are more densely connected with each other than to the rest of the proteins in the complex, but interact more densely with attachment proteins than with the rest of the PPI network [Gavin et al. 2006]. This provides an important topological insight for identifying protein complexes. For example, not all dense clusters identified from the PPI network may correspond to real protein complexes; in particular, such clusters may have been “densified” by spurious interactions. On the other hand, clusters that adhere to the core-attachment structure may correspond better to real complexes. CORE [Leung et al. 2009], COACH [Wu et al. 2009], MCL-CAw [Srihari et al. 2009, 2010], and CACHET [Wu et al. 2012] are some methods that predict protein complexes based on this idea.

Complex Detection by Core-Attachment (CORE)

CORE, proposed by Leung et al. [2009], builds protein complexes from a PPI network in two steps: (i) identifying core sets in which the proteins densely interact


with one another, and (ii) adding attachment proteins to these cores from proteins that may not densely interact with one another but interact densely with at least one of the cores. The probability for two proteins u and v, with degrees d_u and d_v respectively, to belong to the same core is determined by the number of interactions i (i = 0 or 1) and the number of common neighbors m between u and v. The probability that u and v have ≥ i interactions and ≥ m common neighbors is computed under the null hypothesis that the d_u interactions connecting u and the d_v interactions connecting v are assigned randomly in the network according to a uniform distribution. If there are |V(G)| = N proteins in the network, the probability that u and v have exactly i interactions between them is calculated by considering the number of combinations to assign the (d_u − i) and (d_v − i) interactions connecting u and v, respectively, to the other N − 2 proteins:

P_{interact}(i \mid N, d_u, d_v) = \frac{\binom{N-2}{d_u - i}\binom{N-2}{d_v - i}}{\binom{N-2}{d_u}\binom{N-2}{d_v} + \binom{N-2}{d_u - 1}\binom{N-2}{d_v - 1}}.    (3.23)

Only two situations are studied: when i = 1, u and v must have an interaction between them, and when i = 0, u and v might or might not have an interaction between them. The probability that u and v with i interactions have exactly m common neighbors is computed by considering three groups of proteins—common neighbors of u and v, neighbors of u, and neighbors of v—and is given as:

P_{common}(m \mid N, d_u, d_v, i) = \frac{\binom{N-2}{m}\binom{(N-2)-m}{d_u - i - m}\binom{(N-2)-m-(d_u - i - m)}{d_v - i - m}}{\binom{N-2}{d_u - i}\binom{N-2}{d_v - i}}.    (3.24)

The probability that u and v have ≥ i interactions and ≥ m common neighbors is then computed by taking the product of the above two quantities and summing over the allowed ranges:

p\text{-value}(u, v) = \sum_{i \leq j \leq 1,\; m \leq k \leq \min(d_u, d_v) - j} P_{interact}(j \mid N, d_u, d_v) \cdot P_{common}(k \mid N, d_u, d_v, j).    (3.25)

A small p-value(u, v) means that the null hypothesis is likely to be wrong, that is, u and v have a higher chance of being a pair of core proteins. If p-value(u, v) is the smallest p-value among p-value(u, k) and p-value(v, k) for all possible proteins k, then u and v are a pair of core proteins. This core (u, v) is then progressively expanded by adding new proteins x if p-value(x , u) for all proteins u in the core is smaller than p-value(x , k) for all proteins k not in the core, until no further proteins


can be added. It is easy to see that the core protein sets are disjoint, because each protein can only associate with a unique set of proteins with the lowest p-values. Given a core set, proteins that are common neighbors to at least half of the core proteins are selected as attachment proteins to that core. The final protein complex is the union of the core and all its attachments. Note that a set of attachments may be linked to more than one core, thus allowing attachments to be shared between complexes. An additional point to note is that, if the attachment proteins (to a given core) do not interact among themselves, then the attachments may not all belong to the same protein complex; instead, the core proteins may interact with different subsets of these attachments (modules; Figure 3.4) to form distinct (but overlapping) protein complexes. However, this pattern of overlapping protein complexes is not directly decipherable here, and requires more specialized methods, as we shall see in Chapter 5.
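Equations (3.23)–(3.25) translate directly into code using binomial coefficients; the sketch below assumes d_u, d_v ≥ 1 and m ≤ N − 2, and is meant only to show the structure of the computation.

```python
from math import comb

def p_interact(i, n, du, dv):
    """Equation (3.23): probability of exactly i (0 or 1) interactions
    between u and v under the null model; assumes du, dv >= 1."""
    numer = comb(n - 2, du - i) * comb(n - 2, dv - i)
    denom = (comb(n - 2, du) * comb(n - 2, dv)
             + comb(n - 2, du - 1) * comb(n - 2, dv - 1))
    return numer / denom

def p_common(m, n, du, dv, i):
    """Equation (3.24): probability of exactly m common neighbors given i
    direct interactions; assumes m <= min(du, dv) - i."""
    numer = (comb(n - 2, m)
             * comb(n - 2 - m, du - i - m)
             * comb(n - 2 - m - (du - i - m), dv - i - m))
    denom = comb(n - 2, du - i) * comb(n - 2, dv - i)
    return numer / denom

def core_p_value(i, m, n, du, dv):
    """Equation (3.25): probability of >= i interactions and >= m common
    neighbors, summed over the allowed ranges of j and k."""
    total = 0.0
    for j in range(i, 2):                            # i <= j <= 1
        for k in range(m, min(du, dv) - j + 1):      # m <= k <= min(du, dv) - j
            total += p_interact(j, n, du, dv) * p_common(k, n, du, dv, j)
    return total
```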

COre-AttACHment Based Complex Prediction (COACH)

COACH, proposed by Wu et al. [2009], works in a manner similar to CORE, in the sense that it first identifies cores and then adds attachments to these cores to construct the final protein complexes. COACH begins by building “preliminary cores,” which are dense but redundant or overlapping subsets of proteins that need to be postprocessed to determine the final cores. For every protein v in G = ⟨V, E⟩, its neighborhood subnetwork G_v = ⟨V_v, E_v⟩ is first computed, where V_v = {v} ∪ {u : (u, v) ∈ E} and E_v = {(s, t) : (s, t) ∈ E, s ∈ V_v, t ∈ V_v}. All proteins with degree 1 in this subnetwork (i.e., those connected only to v) are removed. A protein u ∈ G_v is designated as a core protein relative to G_v if deg(u) ≥ AvgDeg(G_v), the average degree computed using only the interactions in G_v. These core vertices and the interactions among them form a core subnetwork CS_v of the neighborhood subnetwork of v. A preliminary core is defined as a subnetwork of CS_v satisfying two properties: (i) the interaction density of the subnetwork is greater than a preset threshold (0.70); and (ii) the subnetwork is maximal, that is, it is not contained in any larger subnetwork of CS_v that also satisfies property (i). If the entire core subnetwork CS_v is itself dense enough (that is, satisfies property (i)), then CS_v is designated as a preliminary core. Otherwise, multiple preliminary cores, which can potentially overlap, are likely to be embedded in CS_v, resulting in redundancy among these preliminary cores. To identify these preliminary cores, COACH uses a core-removal and redundancy-filtering step, which works as follows. For a neighborhood network G_v whose core subnetwork CS_v is not dense enough, all core proteins are first removed from G_v, breaking G_v into multiple connected components. Next, low-degree proteins are repeatedly removed from


these components until the components achieve the required density threshold. Trivial components (those that are too small) are discarded. If the similarity between two resulting components exceeds t, a similarity threshold, then only the component with the higher density is retained as a preliminary core. This way, the resulting preliminary cores do not completely encompass one another (thus satisfying property (ii)). Finally, to determine the attachment proteins, for each preliminary core A identified above, if there exists a neighboring protein p in the network such that |V_A ∩ N_p|/|V_A| ≥ 0.5, then p is included as an attachment to the core A. A final complex is built from each such preliminary core A together with all its attachment proteins p.

MCL Followed by Core-Attachment-based Filtering (MCL-CAw)

MCL-CAw, proposed by Srihari et al. [2010], refines clusters produced by MCL [Van Dongen 2000] by incorporating core-attachment structure to predict protein complexes. MCL-CAw is an improvement over its predecessor MCL-CA [Srihari et al. 2009], which was designed only for unweighted networks. The motivation behind MCL-CAw is three-fold: (i) the clusters produced by MCL often include “noisy” proteins that get “pulled in” due to the presence of spurious (but sparse) interactions connecting them to the cluster proteins; (ii) MCL produces non-overlapping clusters, and in particular, if a protein interacts with proteins from two different clusters, it is arbitrarily assigned to only one of the clusters; and (iii) MCL often produces large clusters (of sizes ≥ 30) that possibly encompass multiple protein complexes. If protein complexes indeed follow core-attachment modularity, then refining the MCL clusters by extracting only the subsets of proteins that adhere to this core-attachment structure should produce clusters that correspond better to real protein complexes. MCL is first run on a PPI network to produce an initial set of clusters using the recommended inflation parameter I ∈ [1.8, 2.0] [Brohee and van Helden 2006]. For each cluster of size ≥ 30, a subnetwork is constructed using the proteins within the cluster and the interactions among them; this subnetwork is then reclustered with a higher inflation, I > 2, giving clusters of size at most 30. All resulting clusters are postprocessed using a core-attachment procedure as follows. From each cluster C (from above), MCL-CAw first identifies a core set of proteins Core(C). A protein p ∈ C is considered a core protein, i.e., p ∈ Core(C), if the following


two conditions are satisfied: (i) deg_w(p) ≥ AvgDeg(C); and (ii) deg_w^{(in)}(p, C) > deg_w^{(out)}(p, C), where deg_w(p) is the weighted degree of p, AvgDeg(C) is the average weighted degree of proteins in C, deg_w^{(in)}(p, C) is the weighted in-connectivity of p, defined as the total weight of interactions of p with proteins within C, and deg_w^{(out)}(p, C) is the weighted out-connectivity of p, defined as the total weight of interactions of p with proteins outside of C. Only those clusters with at least two core proteins are retained, while the rest are discarded. The attachment proteins are chosen from the non-core proteins. An attachment protein p is assigned to Core(C) based on the extent of interactions p has with Core(C), taking into account the interaction density and size of Core(C); the assignment is decided based on the following condition:

I(p, Core(C)) \geq \alpha \cdot I(Core(C)) \cdot \left(\frac{|Core(C)|}{2}\right)^{-\gamma},    (3.26)

where I (p, Core(C)) is the total weight of interactions of p with Core(C), I (Core(C)) is the total weight of interactions within Core(C), and α and γ are used to control the effects of I (Core(C)) and |Core(C)|. Large and dense cores require the attachments to be strongly connected to them, whereas weak connections with attachments are allowed for smaller and less-dense cores. The final complexes are then constructed by the union of cores and their attachments, as before. Proteins that neither qualify as cores nor attachments are discarded, thus eliminating noisy proteins from the MCL clusters. Moreover, an attachment protein is allowed to be assigned to more than one core as long as the above condition (Equation (3.26)) is satisfied for each of the cores, and therefore this step allows for sharing of proteins between multiple clusters.
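A sketch of the core-selection rule (conditions (i) and (ii) above), assuming interaction weights are stored in a dictionary keyed by unordered protein pairs; whether deg_w(p) is taken over the whole network or only within the cluster is a detail of the original implementation, and here it is computed over the whole network.

```python
def weighted_connectivity(p, cluster, w):
    """Weighted in- and out-connectivity of protein p with respect to cluster C.

    w maps frozenset({u, v}) -> interaction weight."""
    w_in = w_out = 0.0
    for pair, weight in w.items():
        if p not in pair:
            continue
        (other,) = pair - {p}
        if other in cluster:
            w_in += weight
        else:
            w_out += weight
    return w_in, w_out

def core_proteins(cluster, w):
    """Core(C): proteins whose weighted degree is at least the cluster average
    and whose in-connectivity exceeds their out-connectivity."""
    deg = {p: sum(wt for pair, wt in w.items() if p in pair) for p in cluster}
    avg_deg = sum(deg.values()) / len(cluster)
    core = set()
    for p in cluster:
        w_in, w_out = weighted_connectivity(p, cluster, w)
        if deg[p] >= avg_deg and w_in > w_out:
            core.add(p)
    return core
```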

Core-AttaCHment Structures Directly from BipartitE TAP Data (CACHET)

CACHET, proposed by Wu et al. [2012], works by the rationale that co-complex relationships identified from pulled-down complexes in TAP experiments hold vital information on cores: the bait–prey and prey–prey pairs within pulled-down complexes that are repeatedly observed across purifications may constitute permanent or constitutive interactions between these proteins, and may thus constitute cores within the complexes. These co-complex relationships are lost when the pulled-down complexes are converted to pairwise interactions in the PPI network. Therefore, CACHET works directly on the bait–prey interactions produced from TAP experiments instead of constructing the PPI network.


The TAP data is represented as a bipartite graph G = ⟨P, Q, E⟩, where P is the set of baits, Q is the set of preys captured by the baits, and E ⊆ {(u, v) : u ∈ P, v ∈ Q, u ≠ v} represents the set of bait–prey relationships. Note that, in TAP experiments, a protein can appear both as a bait and as a prey; each such protein is maintained as two separate entities, one as a bait and the other as a prey. Additionally, a bait might tag itself as a prey; to avoid the resulting self-relationships, E is made to contain only relationships (u, v) where u ≠ v. Each edge (u, v) ∈ E carries a reliability weight w(u, v) that estimates the likelihood of the bait u interacting with the prey v. The reliability of a subgraph S = ⟨P(S), Q(S), E(S)⟩ of G is defined as the average of the reliability weights of all the interactions in E(S):

Reliability(S) = \frac{\sum_{(u, v) \in E(S)} w(u, v)}{|E(S)|}.    (3.27)

Each protein u ∈ P(S) is assigned a degree-ratio given by:

dr_S(u) = \frac{\sum_{(u, l) \in E(S)} w(u, l)}{|E(S)|}.    (3.28)

The degree-ratio for u ∈ Q(S) is analogously computed. Similar to the above methods, CACHET first identifies cores and then adds attachments to these cores to build protein complexes. The cores are built from reliable bicliques of G. A subgraph of G, given by B = ⟨P(B), Q(B), E(B)⟩, is a biclique if there exists (u, v) ∈ E for every u ∈ P(B) and v ∈ Q(B). The biclique B is considered reliable if Reliability(B) ≥ t, a preset threshold (0.70). The procedure begins by identifying all maximal bicliques from G, using the algorithm by Li et al. [2007]. If a maximal biclique B has a reliability lower than the threshold, CACHET repeatedly prunes B by removing proteins w ∈ B with low degree-ratios dr_B(w) until Reliability(B) achieves the desired threshold. The pruned biclique is then added to the list of reliable bicliques. Several of these bicliques may overlap and may produce redundant cores. CACHET therefore merges overlapping bicliques based on their neighborhood affinity (NA) score. The NA score for two bicliques B_1 = ⟨P_1, Q_1, E_1⟩ and B_2 = ⟨P_2, Q_2, E_2⟩ is given by:

NA(B_1, B_2) = \frac{|(P_1 \cup Q_1) \cap (P_2 \cup Q_2)|^2}{|P_1 \cup Q_1| \cdot |P_2 \cup Q_2|}.    (3.29)

Two (near-)bicliques B_1 and B_2 from the list of reliable bicliques are repeatedly chosen and merged into a near-biclique if NA(B_1, B_2) ≥ ω, an overlap threshold, such that the resulting near-bicliques are all maximal and non-redundant.


After discarding the small near-bicliques (those below a size threshold), the final set of near-bicliques forms the cores. The attachment proteins are chosen from the set Q of preys, by looking for proteins that densely interact with the cores identified above. A prey p ∈ Q is added as an attachment to a core B if p satisfies the following two conditions: (i) p interacts with at least half the proteins in B; and (ii) the average reliability of these interactions between p and B is at least a preset threshold (0.7).

3.5 Methods Incorporating Functional Information

Proteins within a complex are enriched for the same or similar functions. Therefore, dense protein clusters that also show high functional coherence are more likely to correspond to real complexes than are random sets of (dense) clusters. Following this idea, Restricted Neighborhood Search Clustering (RNSC) [King et al. 2004], PCP [Chua et al. 2008], and Dense-neighborhood Extraction using Connectivity and conFidence Features (DECAFF) [Li et al. 2007] employ functional annotations (e.g., from the Gene Ontology (GO) [Ashburner et al. 2000]) to enhance complex prediction from PPI networks.

Restricted Neighborhood Search Clustering (RNSC)

RNSC, proposed by King et al. [2004], works in two steps: (i) clustering the PPI network, and (ii) filtering clusters based on their functional coherence. In the clustering step, RNSC uses a cost-based local search algorithm to repeatedly partition the node set V into smaller subsets. Beginning with a random clustering (obtained as user input), RNSC uses a variant of the tabu search metaheuristic [Glover 1986] to efficiently search the space of partitions of V, each of which is assigned a cost, to find a clustering with lower cost. At each step of this process, proteins are shuffled between the clusters and the costs are recomputed, gradually generating a clustering with a (locally) minimum cost. In the initial few steps, RNSC uses a simple integer-valued cost function that is efficient to compute, but it then switches to a more refined (but less efficient to compute) real-valued cost function as the clustering progresses. A common problem among local search algorithms is their tendency to settle in poor local minima. RNSC overcomes this problem by maintaining a tabu (forbidden) list of previously explored, less-optimal moves, and by occasionally dispersing the contents of a cluster at random. However, being randomized, different runs of RNSC can result in different clusterings.


In the filtering step, RNSC discards clusters that are not likely to correspond to real complexes, based on their cluster sizes and functional homogeneity. Small complexes (of sizes 2 and 3) that are completely connected correspond to single interactions or triangles in the network. However, a typical PPI network contains many interactions and even more triangles that do not necessarily correspond to complexes; hence, most small clusters (single interactions and triangles) do not correspond to complexes, and the small clusters that do correspond to complexes are difficult to identify solely from their network topology. RNSC therefore simply discards all small clusters. Next, all clusters of low interaction density are discarded; the authors recommend a cluster density cut-off in the range 0.65–0.75. Finally, RNSC computes the enrichment of Gene Ontology (GO) functional groups within the remaining clusters. A GO functional group F̂ is the set of all proteins annotated with a particular function. RNSC computes a hypergeometric test p-value for each cluster C:

p = 1 - \sum_{i=0}^{k-1} \frac{\binom{|\hat{F}|}{i}\binom{|V| - |\hat{F}|}{|C| - i}}{\binom{|V|}{|C|}},    (3.30)

where C contains k proteins annotated with terms in F̂. Clusters that do not pass the p-value cut-off of 0.001 for at least one GO term are discarded, and clusters that pass these filtering criteria are predicted as protein complexes.
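The hypergeometric test of Equation (3.30) is straightforward to compute with exact binomial coefficients; the example numbers below are hypothetical and only illustrate the call.

```python
from math import comb

def go_enrichment_p(cluster_size, annotated_in_cluster, group_size, n_proteins):
    """Hypergeometric tail p-value of Equation (3.30): probability of seeing at
    least `annotated_in_cluster` proteins from a GO group of `group_size` in a
    cluster of `cluster_size`, drawn from `n_proteins` proteins in the network."""
    total = comb(n_proteins, cluster_size)
    cumulative = sum(
        comb(group_size, i) * comb(n_proteins - group_size, cluster_size - i)
        for i in range(annotated_in_cluster)       # i = 0 .. k-1
    ) / total
    return 1.0 - cumulative

# e.g., a 6-protein cluster with 5 proteins from a 20-protein GO group, in a
# network of 5,000 proteins, is extremely unlikely to occur by chance:
print(go_enrichment_p(6, 5, 20, 5000))
```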

Protein Complex Prediction Using Indirect Neighbors (PCP)

Instead of postprocessing clusters using explicitly given functional homogeneity information, PCP, proposed by Chua et al. [2008], uses FS Weight [Chua et al. 2006] (Chapter 2) to assess the functional similarity between proteins based on the topology of their neighborhoods in the PPI network. Weights computed by FS Weight correlate well with the functional homogeneity and co-localization of proteins. Therefore, clusters identified from an FS-weighted network are expected to show high functional and co-localization coherence, and thus correspond better to real complexes, compared to those identified from an unweighted network. PCP begins by preprocessing the unweighted PPI network by: (i) weighting all interactions by FS Weight (weights in the range [0, 1]); (ii) adding new interactions between indirect (level-2) neighbors using FS Weight; and (iii) discarding all low-weighted interactions (typically weights below 0.2). The addition of new interactions is based on the observation that indirect neighbors in the network often share the same or similar functions; adding FS-weighted interactions


between them reinforces their functional association, and also potentially helps to densify the PPI network by completing many partial cliques in the network. PCP identifies dense clusters by merging maximal cliques. These maximal cliques are identified using an efficient search-space pruning algorithm by Tomita et al. [2006]. Cliques identified from the PPI network before the preprocessing step are usually small and only partially cover real complexes. On the other hand, the addition of FS-weighted interactions between level-2 neighbors helps to complete many partial cliques in the network, thus producing larger cliques. However, this procedure also tends to produce numerous overlapping cliques, which need to be reconciled into larger clusters that represent real complexes. To do this, PCP uses the inter-cluster density measure, which quantifies the interconnectivity between two cliques based on the interactions between the non-overlapping proteins of the two cliques. The inter-cluster density I(C_1, C_2) between two cliques C_1 and C_2 is defined as:

I(C_1, C_2) = \frac{\sum_{u \in (V_1 - V_2), v \in (V_2 - V_1), (u, v) \in E} FS(u, v)}{|V_1 - V_2| \cdot |V_2 - V_1|}.    (3.31)

The merging procedure begins by computing an average FS Weight for each clique C_i as follows:

FS_{avg}(C_i) = \frac{\sum_{(u, v) \in E(C_i)} FS(u, v)}{|E(C_i)|}.    (3.32)

The cliques are then sorted in non-increasing order of their average FS Weights, and the clique with the highest weight is repeatedly selected and compared with the remaining cliques. When an overlap above a threshold ω is found with another clique, given by J(C_i, C_j) = |C_i ∩ C_j|/|C_i ∪ C_j| ≥ ω, the two cliques are merged if their inter-cluster density is above a threshold I_0: I(C_i, C_j) ≥ I_0. The merging of the two cliques produces a larger but partial clique, which is added back to the list of (partial) cliques for future merging. Essentially, PCP follows a hierarchical merging procedure to merge the (partial) cliques. An initial graph H^0 = ⟨V_H^0, E_H^0⟩ is defined from the maximal cliques identified from G, such that each vertex X_i^0 ∈ V_H^0 represents a maximal clique from G, and each edge (X_i^0, X_j^0) ∈ E_H^0 is weighted using the interconnectivity between the cliques represented by X_i^0 and X_j^0 in G. Cliques in H^0 represent the highly overlapping and interconnected cliques from G, and these are collapsed and merged into larger partial cliques. This procedure is then repeated on graphs constructed at higher and higher levels: in each iteration k, a new graph H^k = ⟨V_H^k, E_H^k⟩ is constructed such that X_i^k ∈ V_H^k represents a partial clique and (X_i^k, X_j^k) represents the interconnectivity between two partial cliques from the previous iteration k − 1.


Maximal cliques are then identified from H^k and merged after reducing the inter-cluster density threshold as I_k := I_{k−1} − 0.1. These iterations continue while I_k > I_min, a minimum threshold. The final set of all partial cliques produced when the iterations stop represents the partial cliques from the original network G after repeated mergers. These final partial cliques, together with the cliques that were not used up in the merging process, are output as the list of protein complexes.

Dense-neighborhood Extraction Using Connectivity and conFidence Features (DECAFF)

DECAFF, proposed by Li et al. [2007], scores a PPI network based on functional similarity between proteins and then clusters this scored network to generate protein complexes. However, unlike PCP, DECAFF uses external evidence on functional similarity, rather than network topology, to score the PPI network. DECAFF starts by assigning an initial reliability r_{u,v} to each interaction (u, v) in the network based on the number of times (u, v) is observed across different experimental datasets, where higher reliabilities are assigned to interactions detected in multiple experiments. Using these reliabilities as prior probabilities, DECAFF computes posterior probabilities for the proteins to interact, employing the functional similarity between u and v from the MIPS functional catalog [Mewes et al. 2006] as additional evidence. This is done based on the following three cases: (i) u and v share the same or similar functions; (ii) u and v do not share any functions; and (iii) the functions of either u or v (or both) are unknown (unannotated proteins). For case (i), the posterior probability P(R|S) that the interaction is reliable given that the two proteins u and v share the same function is estimated as:

P(R|S) = \frac{P(R) \cdot P(S|R)}{P(S)},    (3.33)

where P(R) = r_{u,v} is the prior probability, and P(S) = P(S|R) · P(R) + P(S|¬R) · P(¬R) is the probability that two proteins share functions. P(S|R) is estimated using an independent high-confidence small-scale experimental dataset D_S:

P(S|R) = \frac{|\{(u, v) : share(u, v), (u, v) \in D_S\}|}{|D_S|},    (3.34)

where share(u, v) denotes that u and v share at least one function. A dataset D_N of one million randomly generated protein pairs that are not present in current protein interaction datasets is constructed, and P(S|¬R) is estimated as:

P(S|\neg R) = \frac{|\{(u, v) : share(u, v), (u, v) \in D_N\}|}{|D_N|}.    (3.35)

For case (ii), the posterior probability P(R|D) given that the two proteins do not share any functions is computed as:

P(R|D) = \frac{P(D|R) \cdot P(R)}{P(D)},    (3.36)

where P(D) = 1 − P(S) and P(D|R) = 1 − P(S|R). Finally, for case (iii), the posterior probability P(R|U) given that one or both proteins are unannotated is computed as:

P(R|U) = P(S) \cdot P(R|S) + P(D) \cdot P(R|D).    (3.37)

Once the network is scored using the posterior probabilities computed above, DECAFF clusters the network using LCMA [Li et al. 2005], with an added step to remove less-reliable clusters. For each cluster C generated by LCMA, a cluster reliability is computed as:

Reliability(C) = \frac{1}{|E(C)|} \sum_{(u, v) \in E(C)} R_{u, v},    (3.38)

4

Evaluating Protein Complex Prediction Methods The only relevant test of the validity of a hypothesis is comparison of prediction with experience. —Milton Friedman (1912–2006), American economist (as quoted in Friedman [1953])

In the previous chapter, we presented a taxonomy of protein complex prediction methods available in the literature, and described their algorithmic underpinnings. In this chapter, we evaluate a sampling of these algorithms for prediction of yeast and human complexes. Additionally, we investigate the impact of the weighting of PPI networks on the reduction of noise (false-positive interactions) from PPI datasets, and its impact on improving protein complex prediction.

4.1

Evaluation Criteria and Methodology We employ criteria and methodology described and used in the literature [Brohee and van Helden 2006, Pu et al. 2007, Srihari and Leong 2012b, Srihari et al. 2015a, Sardiu et al. 2009] to evaluate protein complex prediction methods. We consider a predicted complex P to match a reference (benchmark) complex C if the Jaccard index of P and C, Jaccard(P , C) ≥ match thresh, Jaccard(P , C) =

VP ∩ VC , VP ∪ V C

where VX is the set of proteins in complex X.

92

Chapter 4 Evaluating Protein Complex Prediction Methods

We define the score of a predicted complex P as the weighted density of interactions in the complex:  score(P ) =

weight(a, b)     VP  (VP  − 1)

a , b∈VP , a =b

Here, weight(a, b) is the weight of edge (a, b) between proteins a and b. To evaluate the predicted complexes, we plot precision-recall curves by calculating the precision and recall of predicted complexes at various score thresholds. Given a set of reference complexes C, consisting of large complexes (at least four proteins) Clarge and small complexes (fewer than four proteins) Csmall, and a set of predicted complexes P, we define the precision and recall at score threshold s as:

Recall s = Precisions =

    {C|C ∈ Clarge ∧ ∃P ∈ P, score(P ) ≥ s , P matches C} |Clarge|   {P |P ∈ P, score(P ) ≥ s ∧ ∃C ∈ C, C matches P } |{P |P ∈ P, score(P ) ≥ s}|

.

Note that while we evaluate recall only for large complexes, we evaluate precision to include both large and small complexes, so that the prediction algorithm is not penalized should it also predict small complexes. As a summarizing statistic for an entire set of predicted complexes, we calculate its F-score, which is the harmonic mean of the precision and recall over all the predicted complexes: F=

2 × precision × recall . precision + recall

We also test different weighting (interaction-scoring) methods to ameliorate the noise in PPI networks. To evaluate weighting methods independent of complex prediction algorithms, we calculate and plot the precision-recall-coverage graph of each PPI-weighting method on the prediction of co-complex interactions, i.e., protein pairs belonging to the same complex. Given a set of PPI interactions in the PPI network E, and weighting function weight(e) for each PPI edge e ∈ E, we define the precision, recall, and complex-coverage for co-complex interactions at weight threshold w as:

4.2 Evaluation on Unweighted Yeast PPI Networks

Recall w = Precisionw =

Coveragew =

93

    {e|e ∈ E ∧ weight(e) ≥ w ∧ e ∈ interactions(Clarge)} |interactions(Clarge)|

  {e|e ∈ E ∧ weight(e) ≥ w ∧ e ∈ interactions(C)} |{e|e ∈ E ∧ weight(e) ≥ w}|     {C|C ∈ Clarge ∧ (∃e ∈ E, weight(e) ≥ w ∧ e ∈ C)} |{C|C ∈ Clarge}|

,

where interactions(C) is the set of all co-complex interactions in the set of reference complexes C.

Data Sources for PPI Networks and Benchmark Protein Complexes To evaluate protein complex prediction algorithms on yeast and human complexes, we obtain yeast and human PPI data from two repositories, Biogrid (downloaded June 2016) [Stark et al. 2011] and IntAct (downloaded June 2016) [Hermjakob et al. 2004], keeping only the physical PPIs. In addition, for yeast PPIs, we incorporate the widely used Consolidated PPI dataset [Collins et al. 2007], which consolidates two high-throughput TAP-MS datasets from Gavin et al. [2006] and Krogan et al. [2006] using a probabilistic framework called Purification Enrichment (PE); see Chapter 2. We evaluate protein complex prediction algorithms on the yeast reference complexes dataset CYC2008 [Pu et al. 2009], and the human reference complexes dataset CORUM [Reuepp et al. 2008]. Here, we evaluate only methods for prediction of large protein complexes (i.e., complexes consisting of at least four proteins); prediction of small complexes is covered in Chapter 5.

4.2 Evaluation on Unweighted Yeast PPI Networks

PPI data is inherently noisy due to limitations of PPI assay technologies. It is characterized by high rates of spurious interactions (false positives) and missing interactions (false negatives) [Bader and Hogue 2002, Bader et al. 2004, Yong and Wong 2015a]. To ameliorate this problem for protein complex prediction, most methods are designed for weighted PPI networks, where the weights indicate the reliability or quality of the PPIs in some way, such as the number of times reported or the experiments used. As an indication of baseline performance, we first perform yeast protein complex prediction on an unweighted yeast PPI network, so that no information about the reliability of the PPIs is utilized. We obtain yeast PPI data from the BioGrid and IntAct databases and the Consolidated PPI dataset, deriving an unweighted PPI network composed of 136,552 interactions.


Table 4.1   Clustering algorithms tested^a

Clustering Algorithm    Parameters
CMC                     min_deg_ratio=1, min_size=4, overlap_thres=0.5, merge_thres=0.75
ClusterOne              -s 4
IPCA                    -S4 -P2 -T.6
MCL                     -I 3.0
RNSC                    -e10 -D50 -d10 -t20 -T3
COACH                   Default settings

a. Parameters used. CMC: min_deg_ratio, minimum degree ratio; min_size, minimum cluster size; overlap_thres, threshold to consider two clusters as overlapping; merge_thres, threshold to merge overlapping clusters. ClusterOne: -s, minimum cluster size. IPCA: -S, minimum cluster size; -P, shortest path length; -T, interaction probability threshold. MCL: -I, inflation parameter. RNSC: -e, number of experiments; -D, diversification frequency; -d, shuffling diversification length; -t, tabu length; -T, tabu list tolerance. COACH: default settings provided by the software were used (no parameters were set).

This set of PPIs accounts for 70.63% of co-complex interactions with a precision of 5.61%, and covers 99.33% of yeast complexes with at least one PPI each. We evaluate the performance of the following six protein complex prediction algorithms, which form a mix of purely network topology-based methods and methods that incorporate additional information: MCL [Bader and Hogue 2003], IPCA [Li et al. 2008], ClusterOne [Nepusz et al. 2012], CMC [Liu et al. 2009], COACH [Wu et al. 2009], and RNSC [King et al. 2004] (see Table 4.1 for the parameters used). When evaluating on unweighted PPI networks, we set the interaction weights to 1 for the methods that use interaction weights. As the unweighted PPI network (136,552 interactions) is too large for most clustering algorithms, we choose 50,000 interactions at random to form the unweighted PPI network for complex prediction. Interactions are chosen randomly so that no notion of PPI quality is factored into the network.

Figure 4.1 shows the precision-recall graph for complex prediction. Most algorithms perform poorly with unweighted PPI networks, recalling less than 7% of reference complexes at less than 10% precision. The exception is ClusterOne, which performs noticeably better than the others, achieving almost 20% recall and 10% precision. Even so, this performance is poorer than expected, showing that protein complex prediction is generally unsatisfactory without any reliability weighting of PPIs.


Figure 4.1   Protein complex prediction from the yeast unweighted PPI network. Precision-recall curves are shown for CMC, ClusterOne, IPCA, MCL, RNSC, and COACH.

4.3  Evaluation on Weighted Yeast PPI Networks

In the previous section we showed that an unweighted PPI network, which does not account for the reliability of PPIs, gives dismal protein complex prediction performance. Noise in the PPI data yields a network riddled with spurious and missing interactions, making it ill-suited for protein complex discovery. Here we evaluate the impact of weighting the PPI network to ameliorate such noise. We test three different ways of weighting PPIs to account for their reliability: weighting by topology, by publication count, and by experiment.

Weighting by Topology

Here, we weight each PPI based on its local topology in the PPI network. Proteins that share many common neighbors tend to interact themselves, as they share cellular locations or functions via the common neighbors. Moreover, groups of proteins that function together appear as dense regions in the PPI network. Thus, a PPI in a dense neighborhood with many shared common neighbors is more likely to be a true PPI than a solitary PPI. We weight each PPI using Iterative CD scoring with two iterations [Liu et al. 2009] (see Chapter 2), which accounts for both the density and the extent of shared neighbors around each PPI.
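As a simplified stand-in for illustration (not the Iterative CD score itself), the sketch below weights each edge by the Jaccard overlap of its endpoints' neighborhoods, which captures the shared-neighbor intuition described above.

```python
import networkx as nx

def shared_neighbor_weight(G):
    """Illustrative topology-based weighting: score each PPI by the Jaccard
    overlap of its endpoints' closed neighborhoods. This is a simplified
    stand-in, not the Iterative CD scoring of Liu et al. [2009] used in the
    evaluation; assumes G is an undirected nx.Graph."""
    weights = {}
    for u, v in G.edges():
        nu = set(G.neighbors(u)) | {u}
        nv = set(G.neighbors(v)) | {v}
        weights[frozenset((u, v))] = len(nu & nv) / len(nu | nv)
    return weights
```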

Weighting by Publication Count

Here, we weight each PPI by the number of times it has been reported across different experiments, similar to the sampling-based scoring method of Friedel et al. [2009] (Chapter 2). A true PPI is more likely to be observed in screens by different authors and reported in multiple publications, whereas a spuriously detected PPI is less likely to be observed repeatedly. For each PPI, we count the number of times it is reported in unique publications in the BioGrid and IntAct repositories, and we normalize the counts to a maximum of 1.
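A minimal sketch of this counting is shown below, assuming each PPI is associated with a set of publication identifiers; dividing by the maximum count is one way to normalize the weights so that the largest is 1. The input format is an assumption for illustration.

```python
def publication_count_weight(ppi_publications):
    """Publication-count weighting: count the unique publications reporting
    each PPI and normalize by the maximum count. `ppi_publications` maps an
    interaction (frozenset protein pair) to a set of publication identifiers."""
    counts = {ppi: len(pubs) for ppi, pubs in ppi_publications.items()}
    max_count = max(counts.values(), default=1)
    return {ppi: c / max_count for ppi, c in counts.items()}
```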

Weighting by Experiment

Here, we weight each PPI based on the experimental methods by which it was detected. Since different experimental methods are associated with different levels of specificity, the reliability of a PPI should depend on the method(s) used to detect it, as well as on the number of repeated observations by different authors. We use a Noisy-Or model to combine the number of times a PPI is reported and the types of experiment used to detect it. For each experimental detection method e, we estimate its reliability rel_e as the fraction of interactions it detects in which both proteins share at least one high-level cellular-component Gene Ontology term. We then estimate the reliability of each PPI (a, b) as

\[
\mathrm{weight}(a, b) = 1 - \prod_{e \in X(a, b)} \bigl(1 - \mathrm{rel}_e\bigr)^{\,n_{e,(a,b)}},
\]

where X(a, b) is the set of experimental detection methods that detected (a, b), and n_{e,(a,b)} is the number of times that method e detected (a, b).
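The following sketch implements the Noisy-Or combination above, assuming detection records of the form (protein_a, protein_b, method) and a precomputed reliability estimate rel_e per method; the input formats are assumptions for illustration.

```python
from collections import defaultdict

def noisy_or_weight(detections, reliability):
    """Experiment-based weighting via the Noisy-Or equation above:
    weight(a, b) = 1 - prod_e (1 - rel_e)^(n_e), where n_e is the number of
    times method e detected (a, b). `detections` is a list of
    (protein_a, protein_b, method) records; `reliability` maps each method
    to its estimated reliability rel_e."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b, method in detections:
        counts[frozenset((a, b))][method] += 1
    weights = {}
    for pair, method_counts in counts.items():
        unreliability = 1.0
        for method, n in method_counts.items():
            unreliability *= (1.0 - reliability.get(method, 0.0)) ** n
        weights[pair] = 1.0 - unreliability
    return weights
```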

We incorporate the yeast Consolidated PPI dataset by discretizing the PE scores into ten equally spaced bins and treating each bin as a separate experimental detection method. We avoid double-counting evidence by ensuring that PPIs reported under each publication ID are incorporated only once during weighting.

Since the three weighted PPI networks are derived from the same datasets, they share the same underlying network structure (i.e., the same set of PPIs) and differ only in their edge weights. All three networks are composed of 136,552 interactions. However, as noted earlier, the entire PPI network of 136,552 interactions is too large for most protein complex prediction methods. Moreover, most algorithms perform better when a smaller subset of high-quality PPIs is used. To predict protein complexes, we select the top k interactions from each weighted network (k = 10,000, 20,000, and 40,000) before running the prediction algorithms. We evaluated protein complex prediction at match_thresh = 0.5 (rough prediction) and 0.75 (stringent prediction).

Figure 4.2a shows the complex prediction results in terms of F-scores for the six algorithms, at various levels of k and match_thresh. Weighting by experiment frequently gave the best F-scores (CMC, IPCA, RNSC, and COACH for rough prediction; all clustering algorithms for stringent prediction), followed by weighting by publication count.

Figure 4.2   Yeast complex prediction using 6 protein complex prediction algorithms. (a) F-scores for the 6 algorithms, using PPI networks weighted by 3 methods (topology, publication count, and experiment), with the top 10,000, 20,000, and 40,000 interactions forming the PPI network. (b)-(c) Precision-recall graphs for the 6 algorithms, using the PPI network weighted by experiment with the top 10,000 interactions, for (b) match_thresh = 0.5 and (c) match_thresh = 0.75.

For ClusterOne and MCL, weighting by topology achieved the best F-scores, but only for rough prediction. For experiment and publication-count weighting, using the smallest network (10,000 interactions) consistently gave the best performance, whereas for topological weighting, larger networks (20,000 or 40,000 interactions) improved performance. This may reflect the concentration of highly weighted interactions in a few complexes under topological weighting, so that more interactions are required to uncover a wider range of complexes. Among the algorithms tested, RNSC achieved the best F-score, followed by CMC and IPCA; ClusterOne achieved the lowest F-score.

The F-score reflects the aggregate precision and recall of all the complexes predicted by each prediction algorithm. However, it is sometimes beneficial to consider only a subset of complexes predicted by each algorithm (i.e., the top-scoring predicted complexes). To evaluate performance at different prediction-score thresholds, as well as to observe the trade-off between precision and recall, we vary cutoffs on the predicted complexes' weighted densities to plot precision-recall graphs. Figures 4.2b-c show the precision-recall graphs for the six clustering algorithms, using the PPI network weighted by experiment with k = 10,000, for match_thresh = 0.5 and 0.75, respectively. When considering only the high-scoring predicted complexes (low recall range, left side of each graph), COACH achieves the highest precision, while IPCA attains the lowest precision. On the other hand, to maximize recall by considering all predicted complexes (right side of each graph), IPCA and CMC achieve better precision. Since these algorithms achieve different trade-offs between precision and recall, an assessment of their performance depends on whether the objective is to predict more complexes to maximize recall (for which IPCA and CMC perform best) or to predict fewer complexes with high precision (for which COACH performs best).

Figure 4.3 shows the precision-recall graph for the prediction of co-complex interactions. Since all the PPI networks are structurally identical, they achieve the same overall recall and precision of 70.63% and 5.61%, respectively, covering 99.33% of yeast complexes. However, the shapes of the precision-recall graphs differ according to the weighting method: weighting by topology achieves the highest precision at all recall levels (solid blue markers), followed by experiment (solid orange) and finally publication count (solid green). While topological weighting predicts co-complex interactions with high precision, these interactions are clustered in a few dense complexes, leading to lower complex coverage (hollow blue markers). In contrast, weighting by publication count and experiment gives lower precision, but the predicted co-complex interactions are distributed among a wider range of complexes (hollow green and orange markers), enabling the prediction of a wider range of complexes.

Figure 4.3   Prediction of co-complex interactions in the yeast PPI network weighted by three approaches: topology-, publication-count-, and experiment-based weighting. Solid markers show precision; hollow markers show complex coverage.

4.4  Evaluation on Human PPI Networks

Here we evaluate the same algorithms on the prediction of human complexes. Human complex prediction presents a distinct set of challenges, making it a more difficult task than predicting yeast complexes.

A significant challenge is the paucity of human PPI data. Hart et al. [2006] estimated the human interactome size at around 220,000 PPIs. Our human PPI data consists of around 260,000 PPIs; with an estimated false-positive rate of 50%, this means that our human PPI network covers just slightly over half the human interactome. Many human complexes are therefore sparsely connected, which is a major hurdle for most protein complex prediction algorithms, which are based on discovering dense clusters in the PPI network. In contrast, the yeast interactome is estimated at around 50,000 PPIs. Our yeast PPI data consists of around 130,000 PPIs, so even with the same false-positive rate, our yeast PPI network is a comprehensive representation of the yeast interactome.

A second challenge is the more elaborate structure of interactions between protein complexes in human. Many complexes form core-attachment structures in vivo [Gavin et al. 2006]; see Chapter 3. A core consists of proteins that bind together (more or less) permanently as a subcomplex, and recruits other proteins as attachments. Such attachments to a core can vary depending on cellular location or state. These complexes are present as overlapping clusters in the PPI network and are problematic for protein complex discovery algorithms to deconvolute: they are frequently discovered as a single large complex in which different core-attachment configurations are fused. While such overlapping complexes also occur in the yeast complexome, they are much more frequent in human.

Figure 4.4   Human complex prediction on the unweighted PPI network. Precision-recall curves are shown for CMC, ClusterOne, IPCA, MCL, RNSC, and COACH.

We obtain human PPI data from the BioGrid and IntAct databases, deriving a PPI network composed of 261,711 interactions. While twice the size of the yeast PPI network, this set of PPIs accounts for only 34.1% of human co-complex interactions, with a precision of 4.5%; this indicates the higher extent of noise in human PPI data compared to yeast. Despite this, almost all the complexes are covered by at least one PPI (99.7%). Predicting complexes using the unweighted PPI network (taking 50,000 interactions at random) gave dismal results, with every algorithm achieving less than 1% in both recall and precision (Figure 4.4). Clearly, noise is so prevalent in the human PPI network that complex prediction is untenable without addressing this problem.

To reduce noise in the PPI network, we applied the three PPI-weighting methods described above: weighting by topology, publication count, and experiment. Figure 4.5 shows the performance of these weighting methods on the prediction of co-complex interactions. As in yeast, topological weighting achieves the highest precision, but its highly weighted interactions are clustered in a few dense protein complexes, giving low protein complex coverage. Weighting by publication count and experiment gives lower precision, but the highly weighted interactions are distributed among a wider range of complexes, which enables their discovery by protein complex prediction algorithms. In contrast to yeast, for human co-complex interactions, weighting by publication count achieves higher precision than weighting by experiment.

Figure 4.5   Prediction of co-complex interactions in the human PPI network weighted by three approaches: topology, publication count, and experiment. Solid markers show precision; hollow markers show complex coverage.

Figure 4.6a shows the F-score performance of the six clustering algorithms on human complex discovery, for each of the three weighting methods with k = 10,000, 20,000, and 40,000. Consistent with the results for co-complex edge prediction, and in contrast to yeast, weighting by publication count performed better than weighting by experiment. Weighting by publication count achieves the highest F-scores for four clustering algorithms (CMC, IPCA, RNSC, and COACH), while topological weighting performed best for ClusterOne and MCL. As in yeast, the smallest networks (k = 10,000) gave the best performance for networks weighted by publication count and experiment, while larger networks performed better when weighted by topology. Again, this may reflect the concentration of highly weighted interactions in a few complexes under topological weighting, so that more interactions are required to uncover a wider range of complexes. Among the six clustering algorithms, IPCA and CMC achieve the highest F-scores, followed by RNSC and COACH.

Figures 4.6b-c show the precision-recall graphs of the six algorithms on human complex discovery, using weighting by publication count (k = 10,000), for match_thresh = 0.5 and 0.75, respectively. COACH achieves the highest precision across a large recall range, demonstrating the importance of the core-attachment model for human complexes. In contrast, MCL performs poorly; its limitation of predicting only non-overlapping clusters degrades performance, particularly for human complexes, where many complexes overlap. When considering only the top predicted complexes for each algorithm (low recall range, left side of each graph), COACH and ClusterOne achieve high precision.

Figure 4.6   Human complex prediction using 6 protein complex prediction algorithms. (a) F-scores for the 6 algorithms, using PPI networks weighted by 3 methods (topology, publication count, and experiment), with the top 10,000, 20,000, and 40,000 interactions forming the PPI network. (b)-(c) Precision-recall graphs for the 6 algorithms, using the PPI network weighted by publication count with the top 10,000 interactions, for (b) match_thresh = 0.5 and (c) match_thresh = 0.75.


In contrast, when considering all predicted complexes so as to maximize recall (right side of each graph), IPCA and CMC achieve the highest precision. Complex prediction performance is generally poorer in human than in yeast, with a maximum recall of 42% in human compared to 73% in yeast (both achieved by IPCA). Regardless, the strengths of the protein complex prediction algorithms are consistent across human and yeast complexes: for both, COACH achieves good precision among its top-scoring predictions, while IPCA and CMC are the most effective at maximizing recall.

4.5  Case Study: Prediction of the Human Mechanistic Target of Rapamycin Complex

To demonstrate the challenges of protein complex prediction, and to show how the idiosyncrasies underpinning different protein complex prediction algorithms can produce complexes with different compositions even on the same PPI network, we use the mechanistic target of rapamycin complex 2 (mTORC2) as an illustrative example. mTORC2 is a signaling protein kinase complex involved in the regulation of the cytoskeleton, metabolism, and cellular proliferation, and is implicated in diseases such as diabetes and cancer [Laplante and Sabatini 2012]. It consists of the four proteins MTOR, MLST8, MAPKAP1, and RICTOR, as per the CORUM database (Figure 4.7a).

mTORC2 presents three challenges for protein complex prediction algorithms. First, MTOR and MLST8 also participate in a related signaling complex called mTORC1 (consisting of MTOR, MLST8, RAPTOR, and AKT1S1). These two complexes are thus present as overlapping clusters in the PPI network, which may be difficult to tease apart. Second, many external proteins are connected to mTORC2 members, corresponding to regulators or phosphorylation targets of either mTORC1 or mTORC2. Finally, the mTORC2 complex is itself not fully connected in the PPI network, with only five of the six possible edges present among its proteins.

Each of the six protein complex prediction algorithms produces different complexes (Figure 4.7b-g), with CMC, COACH, and IPCA predicting complexes that best match mTORC2 (Jaccard score of 0.6). CMC's prediction includes three of the four mTORC2 proteins (MTOR, RICTOR, MAPKAP1), plus an additional protein, AKT1, a phosphorylation target of the complex. COACH predicts a complex identical to CMC's prediction, but additionally predicts a second complex which includes RAPTOR from the mTORC1 complex. IPCA's results demonstrate its propensity to promiscuously predict overlapping complexes around dense regions. It predicts six complexes that overlap with mTORC2 (of which three are shown).


Figure 4.7   Example human complex mTORC2. (a) The actual complex obtained from CORUM. (b)-(g) Predictions from the CMC, COACH, IPCA, MCL, ClusterONE, and RNSC methods. COACH, IPCA, and RNSC produce multiple predictions that (partially) match the actual complex; these are shown within different boundaries.

The first consists of three mTORC2 proteins and the target AKT1 (identical to CMC's prediction), while the second consists of three mTORC1 proteins and an mTORC1 target. The third consists of a mix of mTORC1 and mTORC2 proteins: MTOR, MLST8, and RICTOR from mTORC2, and MTOR, MLST8, and RAPTOR from mTORC1.


The three other predictions overlap mTORC2 to a lesser extent. In contrast, since MCL does not predict overlapping complexes, it merges the dense region around mTORC1 and mTORC2 into one large complex, which also includes three other proteins corresponding to targets or regulators of these complexes. ClusterONE also predicts a large complex, merging the entire mTORC1 and mTORC2 along with their targets and regulators. RNSC predicts two complexes: the first is a small complex consisting of two mTORC2 proteins with the target AKT1, and the second is a larger complex consisting of two mTORC1 proteins with two other interactors.

The best-matching mTORC2 prediction consists of only three of its four proteins, missing MLST8 and instead including the extraneous protein AKT1 (this complex is predicted in common by CMC, COACH, and IPCA). On the other hand, the only prediction that includes all four mTORC2 proteins (from ClusterONE) also includes the mTORC1 complex as well as four additional interactors. Thus, most predictions could not properly distinguish the boundary separating mTORC2 from the related complex mTORC1 or from their interactors.

4.6  Take-Home Lessons from Evaluating Prediction Methods

In this chapter, we evaluated six complex prediction algorithms on the prediction of yeast and human complexes. Due to the effects of noise in the PPI data, manifested as false positives (spurious interactions) and false negatives (missing interactions) in the PPI network, naïve use of PPI data without accounting for PPI quality results in dismal protein complex prediction performance. We employed three PPI-weighting methods to estimate PPI quality, namely weighting by topology, by publication count, and by the type of experimental technique. Topological weighting uses only information inherent in the PPI network, but tends to assign high weights to a limited set of interactions found in dense clusters, whereas weighting by publication count and experiment exploits external information that is easily obtained from the PPI databases. All three methods ameliorate the impact of noise in the PPI data and substantially improve protein complex prediction performance.

As different protein complex prediction algorithms are based on different clustering methods or models, their predicted complexes can vary considerably, leading to distinct trade-offs between precision and recall. For example, COACH, which employs a core-attachment model for complexes, achieves the highest precision among its top-scoring predicted complexes, while IPCA, which tends to generate a large number of complexes, is effective at predicting more complexes at the expense of precision. On the other hand, MCL, which does not generate overlapping


complexes, performs poorly for human complexes, which are highly overlapping. Thus, the objective of the user, as well as the nature of the interactome and complexome being considered, should inform the choice of protein complex prediction algorithm. The prediction of human complexes shares some of the difficulties of yeast complex prediction (noisy PPI data), but is further hindered by unique challenges: highly overlapping complexes and insufficient PPI coverage. In the next chapter, we discuss these challenges further, along with the difficulty of predicting small complexes, which was not evaluated here.

5  Open Challenges in Protein Complex Prediction

Euclid taught me that without assumptions there is no proof. Therefore, in any argument, examine the assumptions.
—Eric Temple Bell (1883-1960), Scottish mathematician (as quoted in Eves [1998])

Although existing computational methods for protein complex prediction incorporate diverse biological evidence (e.g., core-attachment structure [Leung et al. 2009, Wu et al. 2009, Srihari et al. 2010, Wu et al. 2012] and functional homogeneity [King et al. 2004, Chua et al. 2008, Li et al. 2007]) to improve their prediction ability, most methods rest on the basic premise that protein complexes are embedded as dense subnetworks within PPI networks. While this assumption is generally correct, limitations in current PPI datasets, some of which we highlighted in Chapter 2, cause most methods to miss many protein complexes.

5.1  Three Main Challenges in Protein Complex Prediction

Yong and Wong [2015a] evaluated the performance of protein complex prediction algorithms by stratifying protein complexes in three ways. They stratified protein complexes into large and small complexes, consisting of more than 3 and at most 3 proteins, respectively. Large complexes were further stratified by density (DENS) and by the number of external proteins highly connected to the complex (EXT); EXT indicates the extent to which a complex may overlap other complexes. The authors showed that yeast protein complexes with low DENS and/or high EXT were more difficult for complex prediction algorithms to predict (Figure 5.1a), because the predicted complexes matched poorly with the actual complexes (Figure 5.1b).


The predicted protein complexes with high EXT included large numbers of extraneous proteins (Figure 5.1c), and these were merged with other complexes as well (Figure 5.1d). Small protein complexes were also difficult to predict. The only complexes that were consistently predicted well were the large complexes with high DENS and low EXT. The same conclusions were drawn from evaluating human protein complexes (Figure 5.2).

Based on this evaluation, we conclude that existing methods for protein complex prediction do not perform well at identifying (i) sparse and (ii) small protein complexes, and at (iii) deconvoluting overlapping protein complexes. These are the three kinds of "challenging protein complexes."

The first limitation is partly because some protein complexes are naturally sparse (e.g., each protein within such a complex may not necessarily interact with every other protein in the complex), and partly because of the paucity of interactions between co-complexed proteins in currently available PPI datasets. A case study by Srihari and Leong [2012a] that mapped 123 large protein complexes from MIPS [Mewes et al. 2006] onto the Consolidated yeast PPI network (1,622 proteins, 9,704 interactions, and average node degree 11.96) found that only 89 complexes were completely "embedded" within the main connected component of the network, whereas the remaining 34 complexes were "scattered" among multiple components (Figure 5.3). The Consolidated network [Collins et al. 2007] is generated by combining TAP/MS interactions from the Gavin et al. [2006] and Krogan et al. [2006] studies using purification enrichment scoring (for details, refer to Chapter 2). The authors evaluated four methods, CMC, HACO, MCL, and MCL_CAw, and found that these methods were able to recover only 58 of these 123 complexes. Of the 65 unrecovered complexes, 27 were scattered across the different components of the network, and 34 complexes, although intact, had low interaction densities.

Figure 5.1   Performance of complex discovery algorithms on yeast complexes, stratified by size (SIZE), density (DENS), and the number of external proteins highly connected to the complex (EXT). (a) Recall. (b) Match score of the best-matching cluster. (c) Extra proteins in the best-matching cluster. (d) Merged complexes. Stratification categories: Lo EXT: EXT ≤ 3. Hi EXT: EXT > 3. Lo DENS: density ≤ 0.35. Med DENS: 0.35 < density ≤ 0.7. Hi DENS: density > 0.7. Large size: > 3 unique proteins. Small size: ≤ 3 unique proteins. (Figure adapted from Yong and Wong [2015a])

Figure 5.2   Performance of complex discovery algorithms on human complexes, stratified by size, density (DENS), and the number of external proteins highly connected to the complex (i.e., connected to at least half of the complex's proteins; EXT). The x-axis of each chart corresponds to the different stratified groups of complexes. (a) Proportion of complexes recalled. (b) Jaccard match score of the best-matching prediction. (c) Number of extraneous proteins in the best-matching prediction. (d) Number of complexes merged into the best-matching prediction. Stratification categories: Lo EXT: EXT ≤ 3. Hi EXT: EXT > 3. Lo DENS: density ≤ 0.35. Med DENS: 0.35 < density ≤ 0.7. Hi DENS: density > 0.7. Large size: > 3 unique proteins. Small size: ≤ 3 unique proteins. (Figure adapted from Yong and Wong [2015a])


Figure 5.3   The Consolidated network (1,622 proteins and 9,704 interactions) [Collins et al. 2007] contains one main connected component (covering 1,034 proteins and 8,377 interactions) and several small- to medium-sized components of sizes 2-15 (covering the remaining 588 proteins and 696 interactions). Of the 123 MIPS complexes (of size at least 4), 89 are completely embedded within the main component, whereas the remaining 34 complexes are "scattered" among multiple components. For example, the SAGA and Chaperonin-containing TRiC complexes are mostly embedded within a single (the main) component, whereas the CCR4 (inset) and Nuclear Pore complexes are scattered among multiple components, which makes the latter two difficult to identify. On the other hand, proteins within the three RNA polymerase complexes (I, II, and III) are tightly interconnected, which makes it difficult to deconvolute the individual complexes. The Cdc28p complexes, in turn, are formed during different cellular contexts, and without this contextual information they are difficult to deconvolute.


The second limitation arises partly because match measures such as the Jaccard coefficient become less effective for evaluating small complexes (e.g., a mismatch of only one protein in a three-protein complex renders the prediction inaccurate or less useful, despite achieving a Jaccard score of 0.50). As a result, most methods fare poorly in detecting small complexes. Some methods therefore explicitly exclude small complexes from their predictions (e.g., Wu et al. [2009]).

The third limitation can occur when proteins from different complexes are heavily shared among the complexes, resulting in a high density of interactions between the complexes. The example of the three RNA polymerase (Pol) complexes I, II, and III is highlighted in several studies [Krogan et al. 2006, Pu et al. 2007, Srihari et al. 2010, Liu et al. 2011]: due to the sharing of multiple proteins, the three complexes are not clearly discernible from each other (Figure 5.3). The presence of spurious interactions can further aggravate this problem by increasing the density of interactions between the complexes. In other cases, the individual complexes that share proteins are not all formed simultaneously; however, due to the lack of contextual data, it is difficult to deconvolute the individual complexes. For example, different Cdc28-cyclin complexes are formed depending on the cell-cycle phase, all of which share Cdc28; without the cell-cycle context, these complexes are lumped together into one large cluster which does not match any single (context-dependent) complex.

To handle the above limitations, specialized methods are needed; these form the subject of this chapter.

5.2  Identifying Sparse Protein Complexes

One of the first works that attempted to study sparse protein complexes was by Habibi et al. [2010], who modeled complexes as k-connected subgraphs. A graph G = (V, E) is connected if every two vertices of G are linked by a path. The graph G is k-connected (specifically, k-vertex-connected), for k < |V|, if for every subset S ⊆ V with |S| < k, the subgraph induced on V − S remains connected. From Menger's theorem [Menger 1927, Diestel 2000], any two vertices u and v in a k-connected graph are connected by k vertex-disjoint paths (i.e., paths that do not intersect except at u and v). Consequently, every vertex in a k-connected graph has degree at least k.
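For instance, the following sketch tests whether the subgraph induced by a complex's proteins in a PPI network is k-connected, using networkx; the function name and input conventions are illustrative assumptions.

```python
import networkx as nx

def is_k_connected_in_network(complex_proteins, G, k):
    """Check whether the subgraph of the PPI network G induced by a complex's
    proteins is k-(vertex-)connected, per the definition above.
    Assumes G is an undirected nx.Graph."""
    sub = G.subgraph([p for p in complex_proteins if p in G])
    if sub.number_of_nodes() <= k:
        return False
    # node_connectivity is the size of the smallest vertex cut, so the
    # subgraph is k-connected exactly when this value is at least k
    return nx.node_connectivity(sub) >= k
```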


The motivation to model complexes as k-connected subgraphs comes from studying the densities of known protein complexes in the PPI network. For example, when 827 known yeast complexes from MIPS [Mewes et al. 2006] and Aloy et al. [2004] were mapped onto a yeast PPI network (5,040 proteins and 27,557 interactions) from BioGrid [Stark et al. 2011], Habibi et al. [2010] found that 11 complexes had a density of zero (i.e., no connected protein pair) and 41 complexes had densities less than 0.1. However, several of these low-density complexes were in fact at least 2- or 3-connected in the network. Habibi et al. proposed that k-connectivity might be a better criterion to capture the topology of real complexes. For a complex C from a set of complexes C, and a given k, the k-connectivity score for C is

\[
kscore(C) = \frac{\max\{|s_i^k(C)| : i = 1, 2, \ldots, t\}}{\left|\{p \in C : \exists q,\ (p, q) \in E\}\right|}, \qquad (5.1)
\]

where s_1^k(C), s_2^k(C), ..., s_t^k(C) are the maximal k-connected subgraphs of the complex C. Therefore, the average k-connectivity score for the set of complexes C is given by

\[
kscore(\mathcal{C}) = \frac{\sum_{C \in \mathcal{C}} \max\{|s_i^k(C)| : i = 1, 2, \ldots, t\}}{\sum_{C \in \mathcal{C}} \left|\{p \in C : \exists q,\ (p, q) \in E\}\right|}. \qquad (5.2)
\]

While the average density of the 827 known complexes in the BioGrid PPI network was low (0.41), the kscore of these complexes was 0.784 for k = 2 and 0.537 for k = 3, meaning that, on average, 78.4% of the proteins within each real complex were located in a 2-connected subgraph, and 53.7% in a 3-connected subgraph. This meant that protein complexes, in particular the low-density ones, could be identified by modeling them as k-connected rather than dense subgraphs.

To do so, Habibi et al. proposed the k-Connected complex Finding Algorithm (CFA), which works in the following two steps. In the first step, CFA generates maximal k-connected subgraphs for various k ≥ 1 to build a candidate list of complexes. For a given k, CFA begins by removing all proteins of degree less than k, because by Menger's theorem these proteins cannot be part of any k-connected subgraph. Next, CFA finds a subset of h < k proteins whose removal disconnects the network into several connected components. This procedure is applied recursively to each of the connected components, and it terminates on a component when that component cannot be disconnected by removing h < k proteins, returning the component as a maximal k-connected subgraph. This entire step is applied to larger and larger values of k until no more k-connected subgraphs can be generated, thus returning maximal k-connected subgraphs for various values of k.

In the second step, CFA applies the following filtering rules to eliminate unlikely candidates. All subgraphs of size less than 4 are removed, which follows from the explanation earlier in the chapter that small subgraphs (of size less than 4) are numerous (of the order O(n³) for a network of size n) and many of these are unlikely to be real complexes. Furthermore, all k-connected subgraphs (k ≥ 2) with diameter


greater than k, and 1-connected subgraphs of diameter greater than 4, are removed. These subgraphs have very long shortest paths between their constituent proteins and likely do not represent the topologies of real complexes.

In the analysis by Habibi et al., the k-connected subgraphs with larger k matched real complexes more accurately, as expected. However, several 1-connected and 2-connected subgraphs also matched real complexes with a Jaccard of at least 0.5. In particular, when evaluated against the 827 MIPS and Aloy et al. complexes, the 1-connected and 2-connected subgraphs achieved recalls of 0.163 and 0.149, respectively (i.e., 134 and 123 recovered complexes), thus identifying several sparsely connected complexes from the PPI network.

Fan et al. [2012] devised a method to separate regions of high and low density within the PPI network and thereby identify high- and low-density protein complexes. The method starts by identifying the set C of all maximal cliques of size at least three in the PPI network using a branch-and-bound approach. This procedure begins with pairs of interacting proteins and adds one neighboring protein at a time to expand each pair into a clique, until no more vertices can be added. Since most proteins in the PPI network have small degrees (a long-tail degree distribution), this branch-and-bound procedure remains tractable. Next, for any two cliques C and C', a similarity is defined as

\[
S(C, C') = \frac{|C \cap C'|}{|C| \cdot |C'|}.
\]

For a clique C, all other cliques C' with S(C, C') ≥ t, a similarity threshold, are counted; the clique C is considered to be in a low-density neighborhood if this count is below a threshold c, and in a high-density neighborhood otherwise. Cliques with high and low neighborhood densities are treated differently while building complexes. For a maximal clique with low neighborhood density, a best neighboring protein is repeatedly identified and added, such that the density of the enlarged subgraph remains high enough and the length of the shortest path between any two proteins in the subgraph remains small (based on thresholds). For maximal cliques with high neighborhood densities, the method first sorts all cliques in non-increasing order of their densities. For a clique C, if there exists another clique C' with S(C, C') ≥ ω, a merge threshold, then the two cliques are merged and the resulting subgraph is added to the list of complexes; the two cliques C and C' are then removed. All unmerged cliques are added to the list of complexes.

Srihari and Leong [2012a] devised a method, SPARC (SPARse Complexes), that adds functional interactions to the PPI network to detect sparse complexes. SPARC is based on the observation that a protein complex detection method based solely on the topology of the PPI network tends to detect a protein complex with better


accuracy if: (i) a majority of the protein subunits of the complex are within the same connected component; and (ii) the interaction density of the complex is high relative to its immediate (local) neighborhood. Therefore, if functional interactions can be predicted and added to the PPI network such that these interactions improve the connectivity as well as the neighborhood-relative density of complexes, then existing methods should be able to detect sparse complexes with better accuracy.

Based on this observation, the authors devised a Component-Edge density (CE) score to measure the detectability of protein complexes. The CE score for a protein complex is composed of two parts: the Component score (CS) and the Edge-density score (ES). CS measures the fraction of proteins within a protein complex that belong to the same connected component. ES measures the density of interactions within the complex relative to the immediate neighborhood of the complex. For a protein complex B, its CS and ES scores with respect to the network G are given by

\[
CS(B, G) = \frac{\max_i |S_i(B)|}{\left|\{p : p \in B,\ \exists q \in B,\ (p, q) \in E(G)\}\right|} \qquad (5.3)
\]

for |B| > 0, and CS(B, G) = 0 otherwise, where S_i(B), i = 1, 2, ..., t, are the t connected components of B; and

\[
ES(B, G) = \frac{\sum_{e \in E(B)} w(e)}{\sum_{e \in E(N[B])} w(e)} \ \text{ for } E(N[B]) \neq \emptyset, \text{ and } ES(B, G) = 0 \text{ otherwise}, \qquad (5.4)
\]

where N[B] is the set of proteins in B together with the immediate neighborhood (i.e., level-1 neighbors) of B in the network. The CE score for the complex B is then defined as

\[
CE(B, G) = CS(B, G) \cdot ES(B, G). \qquad (5.5)
\]

For a set B containing m complexes, if CE(B_i, G) = 1 for i = 1, 2, ..., m, then each complex B_i ∈ B forms a connected component in the network that is disjoint (isolated) from the other complexes; therefore, B forms a set of complexes that are topologically the "easiest" to identify. On the other hand, if CE(B_i, G) = 0, then each complex forms a "hole" in the network G; that is, the complex has no interactions among its constituent proteins, but may have interactions with proteins outside the complex (extraneous interactions). Such complexes are topologically the "hardest" to identify. Given a threshold 0 ≤ t_ce ≤ 1, a complex B_i is considered sparse with respect to the network G if CE(B_i, G) < t_ce (and therefore likely not detectable), and dense otherwise (and therefore detectable). The SPARC method attempts to improve the CE scores of sparse complexes by adding functional interactions, thus making these complexes more easily detectable by existing protein complex prediction methods.
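A sketch of the CE score under the definitions above is given below, assuming a weighted undirected networkx graph and taking E(N[B]) to be the edges of the subgraph induced by B together with its level-1 neighbors; both the input format and this reading of E(N[B]) are assumptions.

```python
import networkx as nx

def ce_score(B, G):
    """Component-Edge density (CE) score of a candidate complex B in a
    weighted PPI network G (Eqs. 5.3-5.5). Assumes G is an undirected
    nx.Graph with a 'weight' attribute on every edge."""
    members = [p for p in B if p in G]
    connected = [p for p in members if any(q in B for q in G.neighbors(p))]
    if not connected:
        return 0.0
    sub = G.subgraph(members)
    # CS: size of the largest connected component of B over the number of
    # B's proteins that interact with another protein of B
    largest = max((len(c) for c in nx.connected_components(sub)), default=0)
    cs = largest / len(connected)
    # ES: total edge weight inside B over total edge weight within N[B]
    nbrs = set(members)
    for p in members:
        nbrs.update(G.neighbors(p))
    w_in = sub.size(weight="weight")
    w_nbr = G.subgraph(nbrs).size(weight="weight")
    es = w_in / w_nbr if w_nbr > 0 else 0.0
    return cs * es
```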


Given a PPI network G_P = (V_P, E_P), clusters C are generated using an existing method. All clusters C ∈ C with CE(C, G_P) ≥ t_ce are dense and are therefore added to the list of predicted complexes. The remaining clusters C' with CE(C', G_P) < t_ce are sparse and are retained for further post-processing. A network G_F = (V_F, E_F) of functional interactions is then added to the PPI network to give the augmented network G_A = (V_A, E_A), where V_A = V_P ∪ V_F and E_A = E_P ∪ E_F. The CE scores of the retained sparse clusters are recomputed with respect to this augmented network. If, for a cluster C', CE(C', G_A) ≥ t_ce, then C' is added to the list of predicted complexes. If not, SPARC explores the immediate neighborhood around C' and repeatedly considers a protein p in the neighborhood such that CE(C' ∪ {p}, G_A) > CE(C', G_A), adding p to C', until the CE score cannot be increased any further. If the CE score of the expanded cluster satisfies CE(C', G_A) ≥ t_ce, then C' is added to the list of predicted complexes.

Srihari and Leong [2012a] studied the effectiveness of SPARC in improving the predictions of four existing methods, MCL, MCL_CAw, CMC, and HACO, using a yeast PPI network containing 4,113 proteins and 26,518 interactions, and a functional network containing 3,980 proteins and 18,683 interactions (the augmented network contained 5,145 proteins and 43,905 interactions). SPARC was evaluated against detectable protein complexes from the MIPS catalog, that is, MIPS protein complexes with at least four proteins present in the PPI network. The methods achieved 25% higher recall on average upon SPARC-based processing of their clusters. For example, there are 155 MIPS protein complexes with at least four proteins present in the PPI network (of the total 313 protein complexes in MIPS). Before SPARC-based processing, MCL produced 294 clusters, which matched 38 of the 155 MIPS complexes (recall 0.245); after SPARC-based processing this improved to 56 complexes (recall 0.361). Among the 294 MCL clusters, 102 were deemed sparse (i.e., had CE scores below t_ce).

5.4  Identifying Small Protein Complexes

... as negative examples. Ten-fold cross-validation was performed to assess the performance of the SVM classifier. The classifier achieved considerably higher precision and recall in predicting heterodimeric complexes than MCL: a precision of 0.586 and recall of 0.659, compared to 0.017 and 0.023, respectively, for MCL.

Tatsuke and Maruyama [2013] proposed the Protein Partition Sampler (PPSampler) to predict small protein complexes from PPI networks based on Markov chain Monte Carlo (MCMC) sampling [Liu et al. 2008]. PPSampler is based on the Metropolis-Hastings (MH) algorithm, a kind of MCMC sampling, which generates a partition C of the entire protein set V according to a specified probability


distribution, with each element c ∈ C of this partition considered as a predicted protein complex. Therefore, the partition C can be represented as

\[
\mathcal{C} = \{c_1, \ldots, c_n \subseteq V : \forall i,\ c_i \neq \emptyset;\ \textstyle\bigcup_i c_i = V;\ \forall i, j\ (i \neq j),\ c_i \cap c_j = \emptyset\}. \qquad (5.21)
\]

Each partition is called a state, and the collection of all states the algorithm can reach is called the domain. The MH algorithm in PPSampler generates a sequence of states from a probability distribution given as

\[
P(\mathcal{C}) \propto \exp\!\left(\frac{-f(\mathcal{C})}{T}\right), \qquad (5.22)
\]

where f(C) is a scoring function on the partition C, and T is a parameter called the temperature (initially set to T = 5). Given a partition C, the proposal probability Q(C'|C) of reaching a new partition C' is determined in one of two ways; in both, a protein u ∈ V is chosen uniformly at random and removed from the cluster to which u belongs in C, so the probability of choosing u is 1/|V|. In the first way, u forms a singleton cluster with probability β (set to 1/100), in which case

\[
Q(\mathcal{C}' \mid \mathcal{C}) = \frac{\beta}{|V|}. \qquad (5.23)
\]

In the second way, u is added to an existing cluster in C. The cluster is chosen probabilistically as follows. All proteins v (≠ u) are ranked in non-increasing order of their interaction weights with u in the PPI network: w(u, v_1) ≥ w(u, v_2) ≥ ... ≥ w(u, v_{|V|−1}). The probability of choosing a cluster c ∈ C is then set proportional to the sum of the reciprocal ranks of the proteins v_i that the cluster contains, Σ_{v_i ∈ c} 1/i. As a result, the proposal distribution becomes

\[
Q(\mathcal{C}' \mid \mathcal{C}) \propto \frac{1 - \beta}{|V|} \cdot \sum_{v_i \in c} \frac{1}{i}. \qquad (5.24)
\]
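The proposal step can be sketched as follows, assuming the partition is stored as a list of protein sets and interaction weights are keyed by frozenset pairs; this illustrates Equations 5.23-5.24 only and is not the authors' implementation.

```python
import random

def propose_move(partition, proteins, weight, beta=0.01):
    """One PPSampler-style proposal: pick a protein u uniformly at random,
    remove it from its cluster, then either start a singleton cluster
    (probability beta) or join an existing cluster chosen with probability
    proportional to the summed reciprocal ranks of its members by w(u, .)."""
    u = random.choice(proteins)
    new_partition = [c - {u} for c in partition]
    new_partition = [c for c in new_partition if c]   # drop emptied clusters
    if random.random() < beta or not new_partition:
        new_partition.append({u})                      # singleton move
        return new_partition
    # rank all other proteins by their interaction weight with u (rank 1 = strongest)
    ranked = sorted((v for v in proteins if v != u),
                    key=lambda v: weight.get(frozenset((u, v)), 0.0), reverse=True)
    rank = {v: i + 1 for i, v in enumerate(ranked)}
    scores = [sum(1.0 / rank[v] for v in c) for c in new_partition]
    r, acc = random.uniform(0, sum(scores)), 0.0
    for c, s in zip(new_partition, scores):
        acc += s
        if r <= acc:
            c.add(u)
            break
    return new_partition
```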

Next, the scoring function f(C) is chosen as the product −f_1(C) · f_2(C) · f_3(C) of three scoring functions f_1(C), f_2(C), and f_3(C), which depend, respectively, on three factors: (i) the weights of interactions within the clusters of C, (ii) the relative frequency of cluster sizes in C, and (iii) the total number of proteins within clusters of size at least two in C. The first scoring function is defined as f_1(C) = Σ_{c∈C} f_1(c), where

\[
f_1(c) =
\begin{cases}
0, & \text{if } |c| = 1, \\
-\infty, & \text{if } |c| > N \text{ or } \exists u \in c,\ \forall v (\neq u) \in c,\ w(u, v) = 0, \\
\sum_{u, v (\neq u) \in c} w(u, v), & \text{otherwise.}
\end{cases}
\qquad (5.25)
\]


The second scoring function f_2(C) is defined based on how close the relative frequency of cluster sizes in C is to a predefined target distribution provided as a parameter. For each size i = 2, 3, ..., N, let φ(i) be a predefined target for the relative frequency of clusters of size i in C, and let φ_C(i) be the observed relative frequency of clusters of size i in C. Then f_2(C) is defined as

\[
f_2(\mathcal{C}) = \sum_{i=2}^{N} \frac{1}{1 + i^2} \cdot \frac{1}{(\varphi(i) - \varphi_{\mathcal{C}}(i))^2}. \qquad (5.26)
\]

Let S_2(C) be the number of proteins within clusters of size at least two, that is, S_2(C) = Σ_{c∈C, |c|≥2} |c|. The third scoring function f_3(C) is defined as

\[
f_3(\mathcal{C}) = \frac{1}{1 + \dfrac{(S_2(\mathcal{C}) - \lambda)^2}{1000}}, \qquad (5.27)
\]

where λ is a parameter specifying the target value for S_2(C), and the division by 1,000 adjusts the strength of f_3 relative to f_2 and f_1 (since (S_2(C) − λ)² can otherwise be a large number). The initial state C_0 of the algorithm consists of the following clusters: (i) a unique cluster of two proteins for every pair (u, v) that interacts with weight w(u, v) in the PPI network, and (ii) for each singleton (non-interacting) protein w, a cluster consisting of only w. This way, f(C_0) ≠ ∞, and therefore P(C_0) > 0, so C_0 is a valid initial state for the MH algorithm.

PPSampler was evaluated on a PPI network consisting of 49,607 interactions among 5,953 proteins, and the predicted protein complexes were matched against the 408 reference complexes from CYC2008 [Pu et al. 2009]. Of these 408 complexes, 172 (42%) and 87 (21%) are two-protein and three-protein complexes, respectively. PPSampler produced 350 predicted complexes of average size 5.72, which matched the CYC2008 complexes with a precision of 0.537 and recall of 0.534, giving an F-measure of 0.536. PPSampler performed particularly well for two- and three-protein complexes, achieving a precision of 0.302 and recall of 0.331 (F-measure 0.316) for two-protein complexes, and a precision of 0.533 and recall of 0.552 (F-measure 0.542) for three-protein complexes.

Widita and Maruyama [2013] subsequently improved PPSampler by modifying the scoring functions in the MH algorithm, to develop PPSampler2. The scoring function in PPSampler2 is defined as f(C) = −(g_1(C) + g_2(C) + g_3(C)), using three new scoring functions g_1(C), g_2(C), and g_3(C) that use the same data sources as f_1(C), f_2(C), and f_3(C). The first scoring function is given by g_1(C) = Σ_{c∈C} g_1(c), where


\[
g_1(c) =
\begin{cases}
0, & \text{if } |c| = 1, \\
-\infty, & \text{if } |c| > N \text{ or } \exists u \in c,\ \forall v (\neq u) \in c,\ w(u, v) = 0, \\
\dfrac{\sum_{u, v (\neq u) \in c} w(u, v)}{\sqrt{|c|}}, & \text{otherwise,}
\end{cases}
\qquad (5.28)
\]

which differs from f_1(c) by the additional √|c| factor in the denominator. This factor rescales Σ_{u,v(≠u)∈c} w(u, v) toward the density Σ_{u,v(≠u)∈c} w(u, v) / (|c| · (|c| − 1)) of the cluster. Widita and Maruyama observe that using the density penalizes larger clusters more severely, whereas the more gradual √|c| eases this penalization on larger clusters.

To define the second scoring function, the size distribution φ_C(i) for clusters of size i is estimated using a normal distribution with mean φ(i) and standard deviation σ_{2,i}, given by

\[
p(\varphi_{\mathcal{C}}(i) \mid \varphi(i), \sigma_{2,i}) \propto \exp\!\left(\frac{-(\varphi_{\mathcal{C}}(i) - \varphi(i))^2}{2\sigma_{2,i}^2}\right). \qquad (5.29)
\]

The joint probability of φ_C(2), φ_C(3), ..., φ_C(N), given φ(2), φ(3), ..., φ(N) and σ_2 = (σ_{2,2}, σ_{2,3}, ..., σ_{2,N}), is formulated as their product:

\[
p(\varphi_{\mathcal{C}}(2), \ldots, \varphi_{\mathcal{C}}(N) \mid \varphi, \sigma_2)
\propto \prod_{i=2}^{N} \exp\!\left(-\frac{(\varphi_{\mathcal{C}}(i) - \varphi(i))^2}{2\sigma_{2,i}^2}\right)
= \exp\!\left(-\sum_{i=2}^{N} \frac{(\varphi_{\mathcal{C}}(i) - \varphi(i))^2}{2\sigma_{2,i}^2}\right)
= \exp\{g_2(\mathcal{C})\}, \qquad (5.30)
\]

where g_2(C) = −Σ_{i=2}^{N} (φ_C(i) − φ(i))² / (2σ_{2,i}²). The third scoring function is derived by estimating S_2(C) using a normal distribution with mean λ and standard deviation σ_2, given by

\[
p(S_2(\mathcal{C}) \mid \lambda, \sigma_2) \propto \exp\!\left(-\frac{(S_2(\mathcal{C}) - \lambda)^2}{2\sigma_2^2}\right) = \exp\{g_3(\mathcal{C})\}, \qquad (5.31)
\]

where g_3(C) = −(S_2(C) − λ)² / (2σ_2²). The parameters in PPSampler2 are set to the following new values: T = 10⁻⁹, λ = 2000, and σ_2² = 10⁶. Upon evaluation on the same PPI dataset as above (as used for PPSampler), PPSampler2 generated 402 protein complexes, with average size 5, which matched the CYC2008 reference complexes with


a precision of 0.618 and recall of 0.742 (F-measure 0.674). For size-two complexes, PPSampler2 achieved a precision of 0.41 and recall of 0.6 (F-measure 0.49), and for size-three complexes, the precision was 0.58 and recall was 0.8 (F-measure 0.67), thus outperforming PPSampler for complexes of all sizes.
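To make the combined objective concrete, the sketch below evaluates f(C) = −(g_1(C) + g_2(C) + g_3(C)) for a candidate partition. The per-size standard deviations (collapsed here to a single value), the target distribution φ, the interpretation of relative frequencies as fractions of all clusters, and the input formats are assumptions; only λ = 2000 and σ_2² = 10⁶ follow the settings quoted above.

```python
import math
from collections import Counter

def ppsampler2_score(partition, weight, phi, N=50, lam=2000,
                     sigma2i=0.01, sigma2=1000.0):
    """Sketch of the PPSampler2 objective f(C) = -(g1 + g2 + g3) (Eqs. 5.28-5.31).
    `partition` is a list of protein sets, `weight` maps frozenset pairs to
    scores, and `phi` maps a cluster size i to its target relative frequency."""
    def g1(c):
        if len(c) == 1:
            return 0.0
        members = sorted(c)
        # -inf if too large or if some member has no interaction inside the cluster
        if len(c) > N or any(all(weight.get(frozenset((u, v)), 0.0) == 0.0
                                 for v in members if v != u) for u in members):
            return -math.inf
        total = sum(weight.get(frozenset((u, v)), 0.0)
                    for i, u in enumerate(members) for v in members[i + 1:])
        return total / math.sqrt(len(c))

    sizes = Counter(len(c) for c in partition)
    n_clusters = max(len(partition), 1)
    g2 = -sum((sizes.get(i, 0) / n_clusters - phi.get(i, 0.0)) ** 2
              / (2 * sigma2i ** 2) for i in range(2, N + 1))
    s2 = sum(len(c) for c in partition if len(c) >= 2)
    g3 = -((s2 - lam) ** 2) / (2 * sigma2 ** 2)
    return -(sum(g1(c) for c in partition) + g2 + g3)
```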

5.5  Identifying Protein Sub-complexes

Identifying protein sub-complexes is an interesting special case of overlapping and/or small complexes, in which a subset of proteins from a larger complex forms a smaller but, by itself, distinct functional complex. For example, the origin recognition complex (ORC) and the minichromosome maintenance (MCM) complex are sub-complexes that come together to facilitate DNA replication in eukaryotic cells by forming the pre-replication complex (pre-RC). One may relate sub-complexes to "cores," in which a set of core proteins interacts with different sets of attachments to form distinct complexes, as suggested by Gavin et al. [2006]. Since these sub-complexes share proteins with the larger complexes they constitute, most existing methods predict only the larger complexes and fail to detect the constituent smaller sub-complexes.

One indication of the presence of sub-complexes comes from examining data from TAP/MS experiments. If, during multiple TAP purifications, a bait and its set of preys are co-purified repeatedly in such a way that the purified sets of preys never completely overlap, then each such purified set of a bait and its preys may form a sub-complex. In the normal case, all bait-prey relationships are combined to build a PPI network in which every relationship occurring in more than one purification is collapsed into a single interaction (see Chapter 2). However, by preserving these relationships from individual purifications, it may be possible to extract the different sub-complexes represented by these relationships. CACHET [Wu et al. 2012], explained in Chapter 3, can be used to detect sub-complexes in this manner, because CACHET treats each individual set of a bait and its preys detected from different purifications as a unique core (modeled as a biclique), to which attachments are added to build distinct complexes; each of these distinct cores can be considered a sub-complex. Indeed, in an evaluation of protein complex prediction methods, Zaki and Mora [2014] found that CACHET performs particularly well in identifying sub-complexes.

Inspired by CACHET, Zaki and Mora developed TRIBAL (TRIad-Based ALgorithm) to detect sub-complexes from TAP data. Like CACHET, TRIBAL preserves the multi-interaction information from different purifications for the PPI scoring and clustering steps. TRIBAL starts by generating a pull-down matrix [b, (p × p)], in which each row represents a bait b and each column represents a pair of preys (p_i, p_j).


Figure 5.8   Identifying sub-complexes using TRIBAL [Zaki and Mora 2014]. TRIBAL preserves interactions identified from individual purifications. These interactions are mapped onto reference complexes; the subsets of proteins traced out by interactions from multiple purifications are predicted to be sub-complexes of the reference complexes. The panels illustrate a spoke-modeled interaction network, with reliable triads involving interactions (b1, p1) and (b2, p2), overlaid on a real complex (b1, p1, p2, b2), from which the sub-complex (b1, p1, b2) is identified.

Each cell (b_k, {p_i, p_j}) in the matrix is a 1 or a 0 depending on whether or not the preys p_i and p_j are co-purified along with the bait b_k. This way, the co-purification relationships among preys (with common baits) are maintained. The Dice coefficient [Zhang et al. 2008] (Chapter 2) is used to score each interaction between a bait and a prey pair, and all interactions scoring above a certain threshold are considered reliable. The interactions deemed reliable by the Dice scoring are used to build a purification network: each reliable triad (b, {p1, p2}) is converted into a spoke graph and added to the network, preserving duplicate interactions between baits and preys. Finally, these spoke-modeled interactions are treated as templates and are overlaid onto a set of known (real) complexes, as shown in Figure 5.8. If a set of duplicated interactions induces a connected subnetwork within a real complex, then that subnetwork is considered a sub-complex. When applied to the yeast TAP/MS dataset, TRIBAL was able to detect TIM9-TIM10 as a sub-complex of the TIM22 complex, ADA-GCN5 as a sub-complex of the SAGA complex, the ORC and MCM complexes as sub-complexes of the pre-RC complex, and the post-replication, DNA polymerase δ, DNA polymerase ε, and DNA polymerase ζ complexes as sub-complexes of the DNA replication complex.
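A sketch of the first step, collecting bait-prey-pair triads from purification records, is shown below; the input format (a list of (bait, preys) purifications) is an assumption for illustration.

```python
from itertools import combinations

def pulldown_triads(purifications):
    """Collect TRIBAL-style triads (bait, {prey_i, prey_j}) from TAP/MS
    purification records, marking each prey pair co-purified with a bait.
    `purifications` is assumed to be a list of (bait, iterable_of_preys)
    records, one per purification."""
    triads = set()
    for bait, preys in purifications:
        for pi, pj in combinations(sorted(set(preys)), 2):
            triads.add((bait, frozenset((pi, pj))))
    return triads
```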


5.6

An Integrated System for Identifying Challenging Protein Complexes

The three challenges that we have described remain open problems in protein complex prediction. Although the above-described approaches go some way toward addressing each of these issues with varying degrees of success, none of them can completely resolve them. Yong and Wong [2015a, 2015b] showed that most complexes fall into at least one of the three categories of challenging complexes—sparse complexes, small complexes, or overlapping complexes—in both yeast and human. The majority of complexes are small complexes, while substantial numbers of the large complexes are of low density or overlap with other complexes. The less-challenging complexes are the non-overlapping large complexes with high density, and based on current reference datasets these constitute only 15% and 5% of yeast and human complexes, respectively. Thus, although no approach can completely tackle all three challenges, it still makes sense to integrate these approaches so that the partial improvements against each challenge stack up to a greater overall improvement in complex prediction. To this end, Yong and Wong [2015b] combined three approaches in an integrated complex prediction system to address the three challenges (Figure 5.9): (i) Supervised Weighting of Composite Networks (SWC) [Yong et al. 2012], which uses diverse data sources with supervised learning to fill in missing PPIs and thereby densify sparse complexes; (ii) PPI network decomposition using GO terms and hub removal [Liu et al. 2011], which improves the prediction of overlapping complexes and complexes embedded within highly connected regions; and (iii) Size-Specific Supervised Weighting (SSS) [Yong et al. 2014], which integrates diverse data sources and their topological features to predict small complexes specifically. SWC and PPI network decomposition utilize multiple clustering-based complex prediction algorithms (CMC, ClusterONE, IPCA, COACH, MCL, and RNSC), with voting-based aggregation (the Ensemble method [Yong et al. 2012] described in Chapter 3), to derive complexes from their respective processed PPI networks. Each of the three approaches is run independently, and the resulting complexes are combined using a voting-based aggregation strategy, illustrated in the sketch below. Yong et al. tested the integrated system on yeast and human complex prediction, and observed marked improvement compared to each of its constituent approaches, and also compared to each individual clustering-based complex prediction algorithm, predicting a greater number of reference complexes at higher precision levels.
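The combination step can be illustrated with a small sketch. The code below is not the Ensemble implementation of Yong et al.; it is a hypothetical rendering of voting-based aggregation in which a predicted cluster is kept only if sufficiently similar clusters (by Jaccard overlap) are produced by a minimum number of the constituent approaches. The vote and similarity thresholds are assumptions.

    def jaccard(a, b):
        """Jaccard similarity between two protein sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def vote_aggregate(predictions, min_votes=2, sim_threshold=0.5):
        """Keep clusters supported by at least `min_votes` approaches.

        `predictions` is a list of cluster lists, one list per approach.  A
        cluster gets a vote from an approach if that approach produced some
        cluster with Jaccard similarity >= `sim_threshold` to it.
        """
        accepted = []
        for i, clusters in enumerate(predictions):
            for c in clusters:
                votes = sum(
                    any(jaccard(c, other) >= sim_threshold for other in others)
                    for j, others in enumerate(predictions) if j != i
                ) + 1  # the approach that produced c also counts as a vote
                if votes >= min_votes and not any(jaccard(c, a) >= sim_threshold for a in accepted):
                    accepted.append(set(c))
        return accepted

    # Toy usage: three approaches, two of which agree on {A, B, C}.
    preds = [[{"A", "B", "C"}], [{"A", "B", "C", "D"}], [{"X", "Y"}]]
    print(vote_aggregate(preds))  # [{'A', 'B', 'C'}]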


Figure 5.9  Integrated system to predict challenging protein complexes [Yong and Wong 2015b]. PPI data from experimental, STRING, and literature sources feed three arms: SWC (data integration and supervised weighting), PPI network decomposition (GO decomposition and hub removal, with decomposed clusters recombined and hubs re-added after clustering), and SSS (data integration and size-specific supervised weighting). Each arm is clustered with multiple algorithms (e.g., CMC and IPCA), the predictions from the algorithms are combined, and the scores are rescaled and combined into the final predictions, covering large, sparse, overlapping, and small complexes. Large, small, overlapping, and sparse complexes differ in their topological characteristics, and one single strategy (e.g., density-based prediction) does not cover all protein complexes. A combination of strategies is therefore required to predict protein complexes.


This improvement was especially dramatic for human complexes, where an additional 10% of reference complexes were predicted with a greater than twofold increase in precision. Further analysis showed that the three categories of challenging complexes saw the greatest improvements, as the three constituent approaches complemented each other to predict different types of challenging complexes. While this approach took convincing strides in tackling these three challenges in protein complex prediction, there remains much room for further improvement; e.g., only 40% of human complexes were predicted by the integrated system. Nonetheless, these results highlight a need to identify and understand specific challenges within protein complex prediction, and to design solutions to address them. As the “easy” protein complexes (which in fact constitute only a minority of yeast and human complexes) can already be predicted by the wide variety of existing protein complex prediction algorithms, it is the other, more difficult, complexes that present the best opportunities for improvements in protein complex prediction.

5.7

Recent Methods for Protein Complex Prediction

The last two to three years have seen the development of new protein complex prediction methods, which have been evaluated on more recent and larger PPI datasets and shown to outperform most traditional methods in precision and recall. For example, ClusterEP, proposed by Liu et al. [2016], is a supervised complex prediction method that uses emerging patterns (EPs) to distinguish real complexes from random subgraphs in the PPI network. An EP is a kind of conjunctive pattern that contrasts sharply between different classes of data; that is, the elements of the dataset that exhibit a particular EP are more likely to belong to one particular class than to the other classes. ClusterEP combines multiple topological properties of complexes to build these EPs, and works in the following three steps. In the first step, a feature vector is built to represent the positive (real complexes) and negative (random subnetworks) classes of data. This feature vector F consists of 22 topological features (F1–F22) for a subnetwork S. These features include the number of nodes in the subnetwork (F1) and the density of the subnetwork (F2); the remaining 20 features (F3–F22) are grouped into six categories based on the following properties: mean and variation of the degrees of nodes, clustering coefficient, topological coefficient, eigenvalues of the adjacency (or connectivity) matrix of the subgraph, and weight and size (number of amino acids) of the proteins in the subnetwork. The eigenvalue feature includes the first three singular values (SVs) of the adjacency matrix of the subnetwork S. The singular value decomposition


of an m × n matrix M is a factorization of the form M = UΣV^T, where U is an m × m unitary matrix, Σ is an m × n diagonal matrix with non-negative real numbers on its diagonal, and V is an n × n unitary matrix. The diagonal entries of Σ are known as the SVs of M. A common convention is to list the SVs in descending order. Subnetworks of different “shapes” (for example, with topologies such as linear, clique, star, and hybrid) have remarkably different first three SVs [Qi et al. 2008]; refer to Figure 3.3 from Chapter 3. The feature vector for the positive class Dp is learned from a set of real protein complexes. The feature vector for the negative class Dn is learned from a set of randomly generated PPI networks and their subgraphs, which is at least 20 times larger than the set of real complexes. The feature values in D = Dp ∪ Dn for each feature are discretized into ten equal-width bins. A bin for each feature is termed an item, and a set of items from different features is called an itemset (pattern). In the second step, ClusterEP identifies a specific type of EPs called noise-tolerant EPs (NEPs), defined as follows. Given two support thresholds δ1, δ2 > 0 with δ2 >> δ1, an NEP from D1 to D2 is an itemset X that satisfies the following two conditions: (i) SuppD1(X) ≤ δ1 and SuppD2(X) ≥ δ2; and (ii) no proper subset of X satisfies condition (i), where SuppDi(X) is the occurrence frequency of X in Di (i = 1, 2). The NEP from Dp to Dn is represented as NEP(Dn), and the NEP from Dn to Dp is represented as NEP(Dp). For a given subgraph S, let F(S) be the feature vector values of S. An EP-score for S with respect to Dp is computed as

EPscore(S, Dp) = Σ_{e ∈ NEP(Dp), e ⊆ F(S)} SuppDp(e),    (5.32)

and with respect to Dn is computed as

EPscore(S, Dn) = Σ_{e ∈ NEP(Dn), e ⊆ F(S)} SuppDn(e).    (5.33)

By definition, a noise-tolerant EP e ∈ NEP(Dp) is a pattern that occurs much more often in Dp than in Dn. Thus, EPscore(S, Dp) sums the number of real complexes in Dp over each noise-tolerant EP e ∈ NEP(Dp) that is consistent with F(S). In other words, EPscore(S, Dp) sums the support in Dp of patterns in F(S) that are more associated with real complexes. Analogously, EPscore(S, Dn) sums the support in Dn of patterns in F(S) that are more associated with random subgraphs. Therefore, the score EPscore(S, Dp) favors a positive label; that is, the larger the score the more likely S is a real protein complex. On the other hand, EPscore(S, Dn) favors a negative label; that is, the larger the score the more likely S is a random

140

Chapter 5 Open Challenges in Protein Complex Prediction

subgraph. These EP scores are normalized relative to the median EP scores in the two datasets,

Norm EPscore(S, Dp) = EPscore(S, Dp) / median(Dp),
Norm EPscore(S, Dn) = EPscore(S, Dn) / median(Dn),    (5.34)

and the normalized scores are used to compute a clustering score for S:

f(S) = Norm EPscore(S, Dp) / (Norm EPscore(S, Dp) + Norm EPscore(S, Dn)),    (5.35)

with f(S) > 1/2 meaning S is more likely to be a real complex. In the third step, ClusterEP uses a seed-and-expand procedure similar to MCODE [Bader and Hogue 2003] (Chapter 3) to identify complexes, with the difference that each time a protein p is added to an existing cluster S, ClusterEP checks for the following conditions: f(S ∪ {p}) > f(S) and the average degree of S ∪ {p} has increased. Two subgraphs S1 and S2 built in this manner are then checked for a possible overlap ω(S1, S2) = |S1 ∩ S2|² / (|S1| · |S2|). If the overlap ω(S1, S2) is at least a fixed threshold, the two clusters are merged. Liu et al. [2016] trained ClusterEP using a yeast PPI network containing 4,931 proteins and 22,227 interactions, and yeast complexes from MIPS (103 complexes) [Mewes et al. 2006] and the Gavin et al. [2006] study (491 complexes). ClusterEP was used to predict protein complexes from a human PPI network consisting of 6,305 proteins and 62,937 interactions from HPRD [Prasad et al. 2009]. The predicted complexes were evaluated against known complexes from CORUM (1,843 complexes) [Ruepp et al. 2008, 2010]. ClusterEP achieved a recall of 0.603 at a precision of 0.211, which was considerably better than that of some methods described before—for example, ClusterONE (recall of 0.376 at precision of 0.169)—on the same dataset.
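The clustering score and the overlap-based merging step can be sketched as follows. This is an illustrative rendering only, assuming precomputed EP scores and an assumed overlap threshold of 0.75; it is not the authors' implementation.

    def clustering_score(ep_pos, ep_neg, median_pos, median_neg):
        """Normalized clustering score f(S) following Equations 5.34-5.35."""
        norm_pos = ep_pos / median_pos
        norm_neg = ep_neg / median_neg
        return norm_pos / (norm_pos + norm_neg)

    def overlap(s1, s2):
        """Overlap score omega(S1, S2) = |S1 n S2|^2 / (|S1| * |S2|)."""
        inter = len(set(s1) & set(s2))
        return inter * inter / (len(s1) * len(s2))

    def merge_clusters(clusters, threshold=0.75):  # threshold value is an assumption
        """Greedily merge cluster pairs whose overlap reaches the threshold."""
        merged = []
        for c in clusters:
            c = set(c)
            for m in merged:
                if overlap(c, m) >= threshold:
                    m |= c
                    break
            else:
                merged.append(c)
        return merged

    print(clustering_score(ep_pos=12.0, ep_neg=3.0, median_pos=10.0, median_neg=6.0))  # ~0.71 > 0.5
    print(merge_clusters([{"A", "B", "C"}, {"A", "B", "C", "D"}, {"X", "Y"}]))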


Rizzetto et al. [2015] presented a SImulation-based COMplex PREdiction (SiComPre) approach that considers the absolute protein concentrations, protein binding sites, the DDIs involved in interactions between proteins, and the localization of protein interactors, to generate protein complexes. The approach is based on stochastic simulations in which multiple instances of proteins (corresponding to the square root of the absolute protein concentrations) move and interact randomly in a 2D space. The simulation space is divided into square lattices, and proteins diffuse randomly between neighboring lattices at discrete time steps. Proteins are represented as objects with binding sites, where proteins with complementary binding sites can interact to form complexes, or their bonds can break, resulting in sub-complexes or singleton proteins. Protein domains were retrieved from the SMART database [Letunic et al. 2012], which covered domains for 34% of the protein interactions. To compensate for the lack of sufficient DDI data, fictitious interacting domains were added to proteins: domains were added to a protein pair if the proteins are involved in the same biological function, according to MIPS. This procedure increased the coverage of interactions with DDI data to 84%. Starting with a yeast protein interaction dataset consisting of 1,622 proteins and 9,022 interactions, and a human interaction dataset consisting of 3,006 proteins and 13,992 interactions, Rizzetto et al. [2015] generated networks containing 1,474 proteins and 7,618 interactions with protein domains in yeast, and 2,342 proteins and 9,395 interactions with protein domains in human. Since the initial (random) positioning of the proteins and their concentration levels affect the simulations, multiple simulation runs were tested. However, most simulation runs produced the same common set of complexes, and the number of predicted complexes unique to each run varied by only 1% between the different runs. On average, the simulations yielded 657 complexes in yeast, of which 409 complexes matched a known complex from MIPS. In human, 1,158 complexes were predicted, of which 268 matched a known complex from CORUM.
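The lattice-based simulation underlying SiComPre can be illustrated with a highly simplified toy model. The sketch below is not SiComPre itself; the lattice size, move rule, binding rule, and domain names are assumptions chosen only to show how random diffusion plus complementary binding sites can yield complexes.

    import random

    GRID = 10  # assumed lattice size

    class Protein:
        def __init__(self, name, sites):
            self.name = name
            self.sites = set(sites)       # binding sites carried by this protein
            self.pos = (random.randrange(GRID), random.randrange(GRID))
            self.complex_id = None        # None until bound into a complex

    def step(proteins, complementary):
        """One simulation step: diffuse each protein, then bind co-located pairs."""
        for p in proteins:
            x, y = p.pos
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            p.pos = ((x + dx) % GRID, (y + dy) % GRID)   # periodic boundaries (assumption)
        for a in proteins:
            for b in proteins:
                if a is b or a.pos != b.pos:
                    continue
                # bind if any of a's sites is complementary to one of b's sites
                if any((s, t) in complementary for s in a.sites for t in b.sites):
                    cid = a.complex_id or b.complex_id or id(a)
                    a.complex_id = b.complex_id = cid

    random.seed(0)
    complementary = {("d1", "d2"), ("d2", "d1")}   # hypothetical complementary domain pair
    proteins = [Protein("P1", {"d1"}), Protein("P2", {"d2"}), Protein("P3", {"d3"})]
    for _ in range(1000):
        step(proteins, complementary)
    print({p.name: p.complex_id for p in proteins})  # P1 and P2 typically end up sharing a complex id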

5.8

Identifying Membrane-Protein Complexes

Native membrane-protein complexes are difficult to purify using traditional TAP procedures owing to the hydrophobic nature of membrane proteins (MPs) [Lalonde et al. 2008]. The conventional Y2H system is confined to the nucleus of the cell, thereby excluding the study of membrane proteins [Miller et al. 2005, Kittanakom et al. 2009]. Consequently, MPs and their interactions are often under-represented in public interaction datasets [Turner et al. 2010]. This poses a severe challenge for predicting MP complexes. Therefore, new experimental protocols are required that can cover MPs and interactions among MPs. Babu et al. [2012] developed a new TAP extraction and purification procedure that can pull down complexes involving MPs. The study used 2,141 MPs, of which 1,590 were tagged and processed using this TAP procedure. Of these tagged MPs, about 77% (1,228 out of 1,590) were purified as part of some complex. The physical interactions inferred within the purified complexes were scored using Purification Enrichment scoring [Collins et al. 2007]. These interactions were integrated with TAP/MS interaction datasets from Gavin et al. [2006] and Krogan et al. [2006] that mainly cover soluble proteins (the reader is referred to Chapter 2 for details on Purification Enrichment scoring, and the Gavin et al. and Krogan et al. datasets).


This resulted in an integrated PPI network consisting of 13,343 high-confidence interactions among 2,875 proteins, which represented two-thirds of the yeast proteome detectable by mass spectrometry. Babu et al. clustered this network using MCL [Enright et al. 2002] to derive 720 putative protein complexes, of which 501 complexes contained at least one MP. Comparisons between these predicted complexes and the expert-curated CYC2008 catalog [Pu et al. 2009] showed that, of the 167 curated complexes containing at least one MP, 67 complexes each had 90% or more of their proteins covered by a predicted complex. This suggests that as more interaction data on MPs become available, existing computational methods for protein complex prediction will be able to perform better at predicting MP complexes. Membrane proteins are involved in the transportation of ions, metabolites, and larger molecules such as proteins, RNA, and lipids across membranes. MP complexes involve highly transient interactions required for the “dynamic exchange” of cargoes across membranes. For example, dynamic exchange of proteins between MP complexes has been observed for the translocase of the mitochondrial outer membrane (TOM) and the NADH-ubiquinone oxidoreductase complexes [Rapaport 2005, Lazarou et al. 2007]. Therefore, MP complexes are not stable entities like their soluble counterparts. Traditional AP/MS techniques involve dual-step stringent purification, which filters out weak interactors of baits. Keilhauer et al. [2015] used an AP/MS protocol for yeast with a less-stringent single-step purification that preserves these weaker interactions. Although this protocol results in co-purification of a large number of nonspecific interactors, the complexes can still be identified because of their higher enrichment in specific bait pull-downs vs. all other pull-downs. This modified protocol, called Affinity Enrichment Mass Spectrometry (AE/MS) rather than AP/MS, identified several MP complexes for yeast, including the HOPS vacuolar and the SPOTS complexes [Keilhauer et al. 2015]. However, a problem with this protocol is the potentially large number of contaminants: a large number of controls may be necessary to comprehensively cover all possible nonspecific interactors. There is an ongoing collaborative effort to establish a “contaminant repository for affinity purification” (the “CRAPome”) containing control pull-downs from different laboratories performed under various experimental conditions [Mellacheruvu et al. 2013]. Readers are referred to the review by Laganowsky et al. [2013] for details on these and other ongoing efforts for studying MP complexes. Since MP complexes are not stable entities like their soluble counterparts, their identification requires understanding their dynamic assembly and disassembly: how the individual proteins come together to form complexes, and how these complexes are eventually degraded.


Studies reveal that this assembly occurs in an orderly fashion; that is, MP complexes are formed by an ordered assembly of intermediaries. To prevent unwanted intermediaries, this assembly is aided by chaperones [Daley 2008]. For example, in a study [Maddalo et al. 2011] that focused on 30 inner and outer membrane-bound protein complexes from E. coli, a well-characterized periplasmic chaperone (PpiD) was found to be involved in the assembly of several MPs. One of the proteins (YfgM) in this chaperone complex had no annotated function, but its association with the complex suggested that YfgM is also part of the PpiD interactome, and is therefore a chaperone involved in the assembly of MPs. Why membrane complexes assemble in an ordered manner is unclear, but studies suggest that this could be a protection mechanism of the cell against harmful intermediary complexes [Herrmann and Funes 2005].

6

Identifying Dynamic Protein Complexes

Governing dynamics, gentlemen!

—John Nash, played by Russell Crowe, A Beautiful Mind (2001)

Many, if not all, protein complexes are dynamic entities, which assemble at a specific sub-cellular space and time to perform a particular function and disassemble after that. For example, cyclin-CDK complexes regulate cell-cycle functions at periodic intervals, and are assembled when cyclin-dependent kinases (CDKs) are activated based on the concentration levels of cyclins during different phases of the cell cycle [Nurse 2001, Morgan 1995, Mendenhall and Hodge 1998, Enserink and Kolodner 2010]. To be able to detect dynamic complexes, we need to capture the dynamics of the proteins and their interactions. While the interaction between two dynamic proteins, at a minimum, requires the two proteins to co-localize and be expressed (to certain minimum levels), it also depends upon other spatial, temporal, structural, and/or contextual constraints of the proteins at that time [Hein et al. 2015]. Dynamic interactions occurring in this manner play an important role in driving all cellular systems, and taking these dynamics into consideration is crucial to understanding cellular functioning and organization [Przytycka et al. 2010, Washburn 2016]. While large numbers of genome-scale PPI datasets have been generated in the last several years, in general these lack specific spatial, temporal, and contextual information about the interactions. Therefore, it becomes necessary to integrate such information from other sources into the analysis of PPI networks in order to study the dynamics of PPIs and protein complexes.

6.1

Dynamism of Protein Interactions and Protein Complexes

Whether a protein interacts, and its choice of interaction partner, are regulated by different cellular mechanisms [Nooren and Thornton 2003, Przytycka et al. 2010, Hein et al. 2015].


For example, the co-localization of the interactors in time and space, as well as the local concentration of the interactors, are influenced by the expression of the coding genes, degradation rates of the mRNAs, and transport, secretion, and degradation of the protein products. Similarly, the binding affinities of different interactors are regulated through posttranslational modifications on the proteins, or changes to the physicochemical environment within the cell (e.g., changes to the concentration levels of effector molecules such as ATP that may affect the binding affinity), or are determined by the context (e.g., response to a stimulus or availability of a metabolite). Depending on these factors, interactions are categorized as obligate, where the proteins cannot exist as stable structures on their own and are frequently bound to their partners upon translation and folding; or as non-obligate, where the proteins can exist as stable structures in both bound and unbound states. Obligate interactions are generally permanent or constitutive: once formed, they can exist for the entire lifetime of the proteins. Non-obligate interactions may be permanent or, alternatively, transient, wherein a protein interacts with its partners for a brief time period and dissociates after that. It is not just the permanent interactions that result in the formation of long-lived multiprotein assemblies; transient interactions are also of crucial functional importance, particularly in cell signaling [Jordan et al. 2000, Pawson and Nash 2000]. In addition to the levels of expression and abundance of proteins and substrates, protein interactions are also mediated by the levels of intrinsic disorder in the participating proteins. Protein disorder refers to the phenomenon of a protein partially or fully not folding into a stable structure. Such proteins or protein regions are called intrinsically disordered proteins (IDPs) or IDP regions [Dunker et al. 2001, 2015] (this nomenclature will be discussed in detail later in the chapter). Nearly half of the eukaryotic proteome is found to contain regions of 40 or more amino acids with disorder [Dunker et al. 2001, 2015, Berlow et al. 2015]. Under different cellular and physiological contexts, disordered proteins and regions can undergo order-to-disorder transitions, thereby determining the binding partners and binding affinity of protein interactions. A commonly observed mechanism for this form of binding is through a short linear sequence motif (commonly referred to as a SLiM), which is about 10 amino acids in length. A SLiM can induce transient interactions of low affinity with multiple target proteins [Van Roey et al. 2014]. This mechanism is observed in the oncoprotein E7 from the human papilloma virus (HPV) (Figure 6.1): E7 contains an intrinsically disordered N-terminus with a short LxCxE motif which interacts with the TAZ2 domain of the cAMP response element binding (CREB)-binding protein (CBP)/p300, and the pocket domain of the retinoblastoma protein Rb, to deregulate the cell cycle of the host [Jansma et al. 2014].

Figure 6.1  Human papilloma virus (HPV) E7 oncoprotein with a disordered region containing an LxCxE motif occurring at positions 22–26 in HPV type-16 and positions 26–30 in HPV type-45 (the regions in red in the 'disorder' track). (Figure generated from the Protein Data Bank server [Berman et al. 2000].)


As a result of the dynamism in protein interactions, protein complexes also display dynamism in their formation, composition, and stability, which imparts them with important functional properties. For example, the highly conserved yeast cyclin-dependent kinase Cdc28p regulates the cell cycle by forming complexes with different cyclins; these complexes in turn phosphorylate different substrates to regulate different cell-cycle phases: Cdc28p forms a complex with cyclin Cln3p to enter the cell cycle, with Cln1,2p to enter and regulate activities in the G1 phase, with Clb5,6p to begin DNA replication in the S phase, and with Clb1,2,3,4p to enter M phase [Morgan 1995, Mendenhall and Hodge 1998, Enserink and Kolodner 2010]. Here, the dynamism of Cdc28p-cyclin complexes is determined by transient interactions mediated by the levels of cyclins during the different cell-cycle phases. The second example is that of RNA polymerase II (Pol II), which undergoes large structural rearrangements of protein subunits during transcription initiation and elongation [Cramer et al. 2001, Hahn 2004]. Pol II is composed of four structurally mobile elements—Core, Clamp, Shelf, and Jaw Lobe—that move relative to each other. The Core element forms a cleft through which the DNA enters from one side, and the active site is buried at its base. The Shelf and Jaw Lobe elements move relatively little and can rotate parallel to the active site cleft. The Clamp element, connected to the Core through a set of flexible switches, moves with a large swinging motion of up to 30 Å to open and close the cleft. The final example is of protein complexes formed by the intrinsically disordered hypoxia-inducible factor-1α (HIF-1α) depending on the levels of oxygen in the cellular milieu [Berlow et al. 2015]. HIF-1α plays an important role in the transcription of genes that are critical for cell survival under low cellular oxygen concentrations through interactions with the TAZ1 domain of the transcriptional coactivator CBP/p300. The C-terminal disordered region of HIF-1α forms a protein complex with CBP/p300 based on cellular oxygen levels: under normoxic (normal oxygen) conditions, the C-terminus is hydroxylated by the asparagine hydroxylase FIH to impair binding to the TAZ1 domain of CBP/p300; however, under hypoxic conditions, the C-terminus undergoes a disorder-to-order transition, forming helical structures that interact with the TAZ1 domain of CBP/p300 to activate survival-related genes. The importance of the dynamism of protein interactions and protein complexes cannot be overstated, and it is rapidly becoming the subject of extensive computational research. A number of computational methods have focused on the effective integration of temporal, contextual, and structural information with protein interaction networks to understand this dynamism of protein interactions and complexes; some of these methods are covered in this chapter.


6.2


Identifying Temporal Protein Complexes

The study by Han et al. [2004] was one of the first attempts to study dynamism at an interactome level by integrating expression levels of proteins with PPI networks. Han et al. combined a S. cerevisiae interaction network (the “filtered yeast interactome,” or FYI) with mRNA expression data to study the dynamics of hub proteins. The FYI dataset consisted of 2,493 high-confidence interactions (among 1,379 proteins) detected in at least two different experiments. Hubs in the FYI network were characterized using expression profiles containing 315 data points across five different experimental conditions. Pearson correlation coefficients (PCCs) were computed between the hubs and their partners. Han et al. found that for hubs with degree k ≥ 5, the average PCCs followed a bimodal distribution, with one peak centered around 0.1 and the other around 0.6. On the other hand, for proteins with k < 5, the average PCCs showed a normal distribution centered around 0.1. Based on this bimodal distribution, Han et al. suggested that hubs (with k ≥ 5) can be split into two distinct subtypes: ones with high average PCCs (“party hubs”), and ones with low average PCCs (“date hubs”). This bimodal distribution was observed only for specific conditions, e.g., “stress response” and “cell cycle,” but not for others, indicating condition-dependent participation of these hubs in interactions. When removed from the network, the party and date hubs had distinct effects on the overall topology of the network, with the removal of date hubs having a more significant effect on network connectivity. The removal of date hubs resulted in disconnected subnetworks that were larger in size and number than those resulting from the removal of party hubs. Based on this, Han et al. suggested that date hubs are central to the network, and tend to participate in a wide range of integrated connections required for the global organization of biological modules in the whole network. On the other hand, party hubs tend to form local modules, and although important for the function of these modules, tend to perform at a lower level in the functional organization. Although initially contested [Batada et al. 2006], the observations by Han et al. have since been reproduced on larger PPI datasets [Agarwal et al. 2010, Pritykin and Singh 2013]. For example, Pritykin and Singh [2013] further demonstrated that date hubs have higher betweenness centrality, while party hubs have higher clustering and functional similarity, thus supporting the central placement of date hubs and the modular placement of party hubs in PPI networks. Moreover, the authors found a higher enrichment for essential genes among party hubs than date hubs, as determined by a hypergeometric test, agreeing with the notion that cellular systems try to buffer their date hubs (which are more centrally located) against disruption.
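The date/party distinction can be illustrated with a small sketch that computes, for each hub, the average PCC between its expression profile and those of its interaction partners, and splits hubs at an assumed cut-off (0.5 here; Han et al. used the bimodal shape of the distribution rather than a single fixed value). Here `network` maps each protein to its list of interaction partners, and `expr` maps each protein to an expression profile.

    import numpy as np

    def average_pcc(hub, partners, expr):
        """Mean Pearson correlation between a hub and its interaction partners."""
        pccs = [np.corrcoef(expr[hub], expr[p])[0, 1] for p in partners]
        return float(np.mean(pccs))

    def classify_hubs(network, expr, min_degree=5, cutoff=0.5):
        """Label hubs (degree >= min_degree) as 'party' or 'date' hubs.

        The 0.5 cut-off is an illustrative assumption sitting between the two
        peaks (~0.1 and ~0.6) of the bimodal distribution described above.
        """
        labels = {}
        for hub, partners in network.items():
            if len(partners) >= min_degree:
                avg = average_pcc(hub, partners, expr)
                labels[hub] = "party" if avg >= cutoff else "date"
        return labels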


Komurov and White [2007] further expanded the concept to include “family hubs,” and showed that family hubs are more constitutively expressed and form static modules. Luscombe et al. [2004] studied the dynamics of transcription factor (TF) regulatory networks using 7,074 regulatory interactions between 142 TFs and 3,420 target genes from S. cerevisiae. The authors integrated gene-expression data from five conditions—cell cycle, sporulation, diauxic shift, DNA damage, and stress response—and identified the subnetworks active under each of these conditions. Luscombe et al. found that about half of the target genes were expressed in only one condition, whereas most TFs (95/142) were expressed across multiple conditions. Moreover, over half of the interactions were replaced by new ones between conditions, and only 66 interactions were retained across four or more conditions; these 66 “hot links” corresponded to housekeeping functions. Endogenous processes (cell cycle and sporulation) are multi-stage and operate with an internal transcriptional programme, whereas exogenous processes (diauxic shift, DNA damage, and stress response) constitute binary events that react to external stimuli with a rapid turnover of expressed genes. In Luscombe et al.'s analysis, the topology of the subnetworks changed considerably between these endogenous and exogenous processes. In particular, the average number of TFs regulating a target gene was smaller for exogenous processes, indicating that the TFs were regulating in simpler combinations. On the other hand, the average number of target genes regulated by a TF was larger for exogenous processes, indicating a wider regulation in response to external stimuli. Further, feedforward loops (FFLs) were more prevalent in the exogenous subnetworks. An FFL is a three-gene pattern composed of two input TFs, one of which regulates the other, and both jointly regulating a target gene. FFLs respond only to persistent input signals or stimuli. Therefore, FFLs suit exogenous conditions better, as cells cannot initiate a new stage until the previous one has stabilized. Within the cell-cycle phases, Luscombe et al. found that most TFs that are active in the cell cycle operate only in a particular phase, and a small minority of TFs are ubiquitously active throughout the cell cycle. A third of these ubiquitous TFs are static hubs and perform housekeeping functions. De Lichtenberg et al. [2005] studied the dynamics of protein complex formation using the example of the yeast cell cycle. The authors constructed a PPI network from 300 yeast cell-cycle proteins, of which 184 proteins were periodically expressed during the cell cycle. By mapping curated complexes from MIPS [Mewes et al. 2006] onto this cell-cycle network, De Lichtenberg et al. identified 29 periodic modules consisting of a combination of periodic and static proteins.


For example, the minichromosome maintenance / origin recognition (MCM/ORC) complex (the pre-replication complex), with 22 proteins, consists of 12 static proteins and 10 periodic proteins, of which six are expressed during the M/G1 phase, two during early G1, and one during late G1. The presence of static proteins within these modules, and the close timepoints at which most of the periodic proteins within a module are expressed, indicated a pattern for the assembly of modules. In particular, De Lichtenberg et al. suggested that yeast (and, in general, all eukaryotic) complexes are assembled just-in-time (as opposed to synthesized just-in-time, as in prokaryotes), where the transcription of only a few components of the complexes is tightly regulated and held off until the final complex assembly. This prevents accidental switching on of complexes, but at the same time is more efficient than synthesis of entire complexes as in prokaryotes. The analysis by De Lichtenberg et al. also revealed several binary dynamic complexes (e.g., Cdc28p formed binary complexes with different cyclin substrates during the different phases of the yeast cell cycle). Srihari and Leong [2012c] studied the dynamics of protein complex assembly during the yeast cell cycle using complexes derived from the Consolidated network [Collins et al. 2007]. Using the Cyclebase dataset (http://www.cyclebase.org/) [Gauthier et al. 2008] of timepoints (cell-cycle phases) at which yeast cell-cycle proteins show their peak expression, the authors derived the most likely cell-cycle phases at which cell-cycle complexes are assembled. Specifically, cell-cycle phases (G1, S, G2, M) were assigned to each yeast protein by averaging its expression across multiple datasets and computing the phase(s) at which the protein showed peak expression. Of the 6,114 yeast proteins available in the dataset, 5,514 were labeled static, and the remaining 600 dynamic. Of these dynamic proteins, 576 had a unique peak phase, whereas the phases for the remaining 24 were difficult to determine. By mapping these labels onto the Consolidated network (1,622 proteins and 9,704 interactions), the authors found that static-static interactions dominated the network (94.69%), whereas static-dynamic and dynamic-dynamic interactions formed relatively smaller fractions (4.6% and 0.716%, respectively). This agreed with the notion that, in order for the cellular system to be stable, most of the interactions have to be static, with only a small fraction of the interactions capable of changing dynamically. Protein complexes were derived using a core-attachment model (MCL-CAw [Srihari et al. 2010]), and by mapping the cell-cycle proteins onto these complexes, 57 cell-cycle complexes containing at least one dynamic protein each were identified. The authors found that the attachment proteins within these complexes were significantly more enriched for dynamic proteins compared to the cores. Moreover, when these complexes were mapped back onto the network, the static cores were shared between multiple complexes and were involved in multiphase interactions; that is, dynamic proteins peaking at different (usually adjacent) cell-cycle phases interacted with these static core proteins.
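A minimal sketch of this phase-assignment step (a hypothetical helper, not the authors' code) is to average each protein's expression per phase and take the arg-max; a protein whose peak barely exceeds the other phases can be treated as static. The margin cut-off below is an assumption.

    import numpy as np

    PHASES = ["G1", "S", "G2", "M"]

    def peak_phase(phase_expression, min_margin=0.1):
        """Assign a peak cell-cycle phase to a protein, or label it 'static'.

        `phase_expression` maps a phase name to a list/array of expression
        values (e.g., averaged over multiple datasets).
        """
        means = {ph: float(np.mean(vals)) for ph, vals in phase_expression.items()}
        ranked = sorted(means, key=means.get, reverse=True)
        if means[ranked[0]] - means[ranked[1]] < min_margin:
            return "static"
        return ranked[0]

    # Toy usage: a protein peaking in S phase vs. one expressed uniformly.
    print(peak_phase({"G1": [1.0, 1.1], "S": [2.0, 2.2], "G2": [1.0], "M": [0.9]}))  # 'S'
    print(peak_phase({"G1": [1.0], "S": [1.02], "G2": [1.0], "M": [1.0]}))           # 'static'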


Figure 6.2  Decomposing clusters by mapping cell-cycle phases [Srihari and Leong 2012c]. A cluster containing the kinase Cdc28 (Ybr160w) and its cyclin-substrates (including Yal040c, Ygr108w, Ygr109c, Ypl256c, Ypr119w, Ypr120c, Ydl155w, Ylr210w, and Ymr199c) is decomposed by mapping cell-cycle phases (G1, G1/S, S, G2, M) onto its proteins; the decomposition identifies protein complexes active at different cell-cycle phases.

These observations hinted at a biological design principle of “temporal reusability” of the static cores: by maintaining the core proteins throughout all phases, these proteins are reused, whereas the attachment proteins are transcribed when required to assemble the complexes, thus agreeing with the just-in-time assembly proposed by De Lichtenberg et al. [2005]. Srihari and Leong then decomposed large clusters using the cell-cycle phases of their constituent proteins, and found that these clusters contained multiple distinct complexes. For example, when the cell-cycle phases were mapped onto a cluster containing the kinase Cdc28 (Ybr160w) and its cyclin-substrates, the authors identified distinct dynamic complexes, as shown in Figure 6.2, that were active at distinct phases. Here, Cdc28 can be considered the static core, and the cyclins its dynamically regulated attachments. Tang et al. [2011], Li et al. [2012], and Ou-Yang et al. [2014] proposed methods to detect temporal protein complexes by constructing dynamic PPI (DPPI) networks using a static PPI network and time-course gene expression data. All three methods are very similar, and we describe here only the method by Ou-Yang et al., which is the most recent of the three. Given a generic (static) PPI network, Ou-Yang et al. constructed a DPPI network for each of the T timepoints by determining the peak timepoint(s) of expression for each protein. Specifically, each DPPI network is considered as a combination of a static and a dynamic subnetwork.


The static subnetwork is estimated based on the PCCs between the proteins: two interacting proteins u and v are said to have a static interaction if the PCC between their expression patterns across the T timepoints is greater than a threshold δ (set to 0.6). The dynamic subnetwork at each timepoint is estimated based on the active proteins at that timepoint. A protein u is considered active at a timepoint if its expression is greater than the active threshold AT(u) given by

AT(u) = μ(u) + 3σ(u)(1 − F(u)),    (6.1)

where μ(u) is the mean and σ(u) is the standard deviation of u's expression across all timepoints, and F(u) = 1/(1 + σ²(u)) is a weight function that corrects for fluctuations in u's expression [Wang et al. 2013]. Equation 6.1 is based on the “three-sigma principle” to differentiate active and inactive timepoints of a protein during the cellular cycle [Wang et al. 2013]. For a normal distribution, three standard deviations (“three sigmas”) on either side of the mean encompass over 99% of the data represented in the distribution. To allow for fluctuations in the expression of the protein, it is considered to be active if its expression varies beyond this three-sigma range. However, such fluctuations can also arise from noise in the expression data, and therefore a correction factor F(u), which is inversely related to the fluctuation, is additionally used in the equation to determine the active threshold. An interacting pair (u, v) from the input PPI network is added to the dynamic subnetwork at time t only if both u and v are active at t, that is, their expression levels are above their respective active thresholds AT(u) and AT(v). Finally, the DPPI network at timepoint t is the union of the static and dynamic subnetworks at t. Complexes are then identified by clustering each of these DPPI networks individually, and the final set of predicted temporal complexes is the union of all sets of complexes. Using PPI networks from BioGRID (59,748 interactions among 5,640 proteins) and DIP (21,592 interactions among 4,850 proteins), Ou-Yang et al. predicted 606 complexes with a precision of 0.363 and recall of 0.741, which were higher than those obtained by clustering without temporal segregation of the network (precision 0.332 and recall 0.647). In particular, Ou-Yang et al. showed that the three RNA polymerase (Pol) complexes, namely Pol I, II, and III, could be easily identified using temporal network segregation, whereas they are otherwise lumped into one large cluster when detected using only the generic PPI network. Similar to what Han et al. [2004] report on the bimodal distribution of PCC values, with date hubs centered around 0.1 and party hubs centered around 0.6, the motivation to divide the DPPI network into static and dynamic portions in Ou-Yang et al.'s work also comes from the observed differences in PCC distributions for the two portions: essentially, Ou-Yang et al. observed a bimodal distribution of PCC values, with one peak centered around 0.6, corresponding to static interactions, and the other around 0.1, corresponding to transient interactions (hence, δ was set to 0.6).
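A minimal sketch of this DPPI construction (illustrative only; variable names and data layout are assumptions) is shown below: the static subnetwork keeps edges whose endpoints are co-expressed (PCC > δ), and the dynamic subnetwork at timepoint t keeps edges whose endpoints are both above their three-sigma active thresholds at t.

    import numpy as np

    def active_threshold(expr):
        """Three-sigma active threshold of Equation 6.1 for one expression profile."""
        mu, sigma = float(np.mean(expr)), float(np.std(expr))
        weight = 1.0 / (1.0 + sigma ** 2)
        return mu + 3.0 * sigma * (1.0 - weight)

    def build_dppi(edges, expr, t, delta=0.6):
        """DPPI network at timepoint t: union of static and dynamic edges.

        `edges` is a list of (u, v) pairs from the static PPI network and
        `expr` maps a protein to its expression profile over T timepoints.
        """
        static_edges, dynamic_edges = set(), set()
        for u, v in edges:
            if np.corrcoef(expr[u], expr[v])[0, 1] > delta:
                static_edges.add((u, v))
            if expr[u][t] > active_threshold(expr[u]) and expr[v][t] > active_threshold(expr[v]):
                dynamic_edges.add((u, v))
        return static_edges | dynamic_edges

    # Toy usage over T = 4 timepoints.
    expr = {"A": np.array([0.1, 0.2, 3.0, 0.1]),
            "B": np.array([0.2, 0.1, 2.8, 0.2]),
            "C": np.array([1.0, 1.2, 0.8, 1.0])}
    print(build_dppi([("A", "B"), ("A", "C")], expr, t=2))  # keeps the co-expressed edge ('A', 'B')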


Hanna et al. [2015] proposed biclustering of gene-expression data and mapping the biclusters onto PPI networks as a means to generate dynamic PPI networks, and thereby identify temporal protein complexes. Essentially, Hanna et al. argue that if an interaction in the PPI network occurs only during specific conditions, then the corresponding gene pair will be correlated only under those conditions in the expression dataset. Therefore, dynamic PPI networks for specific conditions can be generated by biclustering the expression data and mapping the correlation between gene pairs falling within the same bicluster onto the PPI network. Biclustering of expression data is defined as follows [Cheng and Church 2000]. Let X be the set of genes and Y the set of conditions. Let aij be the element of the expression matrix A representing the expression value of gene i ∈ X under condition j ∈ Y. Let BC(I, J) specify a submatrix AIJ, I ⊂ X and J ⊂ Y, with the following mean squared residue (MSR) score MSR(BC(I, J)):

MSR(BC(I, J)) = (1 / (|I| · |J|)) Σ_{i∈I, j∈J} (aij − aiJ − aIj + aIJ)²,    (6.2)

where

aiJ = (1/|J|) Σ_{j∈J} aij,    (6.3)

aIj = (1/|I|) Σ_{i∈I} aij, and

aIJ = (1/(|I| · |J|)) Σ_{i∈I, j∈J} aij    (6.4)

are the row means, the column means, and the mean of the submatrix (I, J), respectively. The lower the MSR score, the higher the bicluster coherence; that is, the genes within the bicluster show higher correlation in their expression patterns. After generating the set of biclusters, Hanna et al. generated subnetworks by extracting interactions from the PPI network that corresponded to these biclusters. Protein complexes were then detected from each of these subnetworks individually via clustering, and the resultant complexes were combined into a final set of predicted complexes. Using a PPI network containing 5,192 proteins and 24,698 interactions from DIP [Xenarios et al. 2002], and the expression dataset GSE3431 from GEO [Barrett et al. 2013], Hanna et al. identified 89 protein complexes with a precision of 0.646 and recall of 0.717, compared to 76 complexes with a precision of 0.6 and recall of 0.7 without using gene-expression biclusters.
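A direct Python rendering of the MSR score (Equations 6.2–6.4) for one candidate bicluster might look like this; it is illustrative only, where `A` is the full expression matrix and `rows`/`cols` index the bicluster.

    import numpy as np

    def msr(A, rows, cols):
        """Mean squared residue of the bicluster defined by `rows` x `cols`."""
        sub = A[np.ix_(rows, cols)]
        row_means = sub.mean(axis=1, keepdims=True)   # a_iJ
        col_means = sub.mean(axis=0, keepdims=True)   # a_Ij
        overall = sub.mean()                          # a_IJ
        residue = sub - row_means - col_means + overall
        return float((residue ** 2).mean())

    # A perfectly coherent (additive) bicluster has MSR 0.
    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 3.0, 4.0],
                  [5.0, 9.0, 1.0]])
    print(msr(A, rows=[0, 1], cols=[0, 1, 2]))  # 0.0
    print(msr(A, rows=[0, 2], cols=[0, 1, 2]))  # > 0: less coherent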


Park and Bader [2012] studied how PPI networks change across timepoints and how these changes affect complexes, by simulating time-evolving PPI networks using stochastic models. Let {G(t) : t = 1, . . . , T} be a series of T time-ordered PPI networks. For an arbitrary pair t ≠ t′, G(t) and G(t′) can have different proteins and interactions. The authors infer a sequence of time-evolving stochastic block models, {M(t) : t = 1, 2, . . . , T}, where M(t) is a model generating the network G(t). Such a network-generative model M is built as follows. The total n vertices in M are grouped into K clusters. The probability for a vertex v to belong to a cluster k is πk, which is the parameter of a multinomial distribution, with Σ_k πk = 1. The parameter θij ∈ [0, 1] gives the probability of adding an unweighted undirected interaction between vertices u ∈ i and v ∈ j belonging to clusters i and j, modeled as independent Bernoulli trials. Essentially, each vertex u is sampled with probability πk for cluster k, and then each interaction euv is sampled as 0 or 1 in a Bernoulli trial with success probability θij for u ∈ i and v ∈ j. Therefore, the number of interactions at the cluster level is nij = Σ_{u∈i, v∈j} euv for i ≠ j, or Σ_{u≤v∈i} euv for i = j. The total possible number of interactions is mij = ni nj for i ≠ j, and mii = ni(ni − 1)/2 for i = j. The resulting clusters are then merged into larger clusters such that the probability for a vertex to belong to a large cluster k is π̂k = nk/n and the probability for two such vertices to be connected is θ̂ij = eij/mij, where eij is the number of interacting pairs where one protein is in cluster i and the other is in cluster j. This procedure is repeated T times to generate T PPI networks, which can be ordered and compared as if they were a single network slowly changing over the T timepoints. The numbers of vertices and edges generated in the model followed a yeast PPI dataset containing 13,401 interactions among 3,248 proteins. The vertices present at the different timepoints were based on a time-course gene-expression dataset covering 3,510 significantly periodic genes across 36 timepoints. This resulted in 36 network snapshots which on average contained 1,380 proteins with average degree 1.8. The authors recovered 31 dynamic protein complexes containing at least three proteins each from these networks. Most of the complexes matched across the entire time-course, but some complexes disappeared and reappeared. For example, a complex involved in DNA repair was most active at the end of each 12-point cycle. For other complexes, such as the mitochondrial ribosome complex and the nuclear pore complex, some of the proteins were transcribed and then degraded at specific timepoints—e.g., the ribosomal small subunit of mitochondria (RSM) subunit 22 was active only at t = 9, 20, and 32, suggesting a periodic requirement of RSM22 at specific timepoints for the functioning of the mitochondria.
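The generative step of such a block model can be sketched briefly. This is a toy illustration, not Park and Bader's implementation; the cluster membership probabilities and connection probabilities below are made-up values.

    import numpy as np

    def sample_block_model(n, pi, theta, rng=np.random.default_rng(0)):
        """Sample one undirected network from a stochastic block model.

        `pi` is the vector of cluster membership probabilities (sums to 1)
        and `theta[i][j]` is the probability of an edge between a vertex in
        cluster i and a vertex in cluster j.
        """
        clusters = rng.choice(len(pi), size=n, p=pi)      # multinomial memberships
        edges = set()
        for u in range(n):
            for v in range(u + 1, n):
                if rng.random() < theta[clusters[u]][clusters[v]]:  # Bernoulli trial
                    edges.add((u, v))
        return clusters, edges

    # Two clusters: dense within clusters, sparse between them.
    pi = [0.5, 0.5]
    theta = [[0.8, 0.05],
             [0.05, 0.8]]
    clusters, edges = sample_block_model(20, pi, theta)
    print(len(edges), "edges sampled; cluster sizes:", np.bincount(clusters))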


6.3

Intrinsic Disorder in Proteins

Intrinsically disordered proteins (IDPs) and IDP regions do not form stable tertiary structures, yet they exhibit biological activities [Dunker et al. 2001, 2015, Dyson and Wright 2005, Oldfield and Dunker 2014]. The word “disordered” has been adopted because of Jirgensons' use of it for protein classification [Jirgensons 1966]. The word “intrinsically” emphasizes a sequence-dependent characteristic: the structural instability of IDPs is likely encoded in their amino acid sequences [Wright and Dyson 1999]. Other terms have also been used in the literature to describe this phenomenon, namely naturally disordered, intrinsically unstructured, partially folded, flexible, rheomorphic, natively denatured, and natively unfolded. This simple definition of disorder, however, covers distinct structural phenomena, the two most apparent being static and dynamic disorder [Tompa and Fuxreiter 2008]. In static disorder, an IDP region might adopt one of multiple stable conformations (but is missing from electron density maps and thus qualifies as disordered). Such regions are sometimes called “wobbly” domains [Uversky et al. 2005]. By contrast, an IDP or IDP region might also constantly fluctuate between a large number of conformations and is best described as a conformational ensemble; this kind of disorder is dynamic. Computational studies of IDPs began with investigations into why they do not fold into stable tertiary structures. IDPs and IDP regions do not fold primarily because they are rich in polar residues and proline and depleted in hydrophobic residues [Xie et al. 1998, Dunker et al. 2002]. More than 50% of eukaryotic proteins have at least one long (>30 amino acids) IDP region. Partial disorder has been observed in almost every protein, but it seems to be more abundant in proteins belonging to certain biological processes such as transcription, cell-cycle regulation, and signal transduction [Tompa and Fuxreiter 2008]. The lack of structural constraints offers IDPs and IDP regions the flexibility to facilitate a diverse range of biological processes. In particular, IDP regions enable proteins to perform functions that require more flexibility and cannot be performed by rigidly structured proteins (e.g., movement through narrow pores or channels, and creation of chimeric or fused proteins [Dunker et al. 2001, 2015]). Some examples of IDPs include casein, phosvitin, fibrinogen, trypsinogen, and calcineurin [Oldfield and Dunker 2014]. There are five main databases for IDPs, namely DisProt (http://www.disprot.org/) [Hobohm and Sander 1994], Intrinsically Disordered Proteins with Extensive Annotations and Literature (IDEAL) (http://idp1.force.cs.is.nagoya-u.ac.jp/IDEAL/) [Fukuchi et al. 2012], the Database of Protein Disorder and Mobility Annotations (MobiDB) (http://mobidb.bio.unipd.it/) [De Domenico et al. 2012, Walsh et al. 2015], the Database of Disordered Protein Prediction (D2P2) (http://d2p2.pro/)


[Oates et al. 2013], and the Protein Ensemble Database (pE-DB) (http://pedb.vib.be/) [Varadi et al. 2014]. There are several software tools and Web servers available in the literature for IDP and IDP-region prediction; some of the popular ones include the Predictor Of Naturally Disordered Regions (PONDR®) (http://www.pondr.com/) [Romero et al. 1997], Intrinsically Unstructured Proteins Prediction (IUPred) (http://iupred.enzim.hu/) [Dosztányi et al. 2005], DisEMBL (http://dis.embl.de/) [Linding et al. 2003a], GlobPlot (http://globplot.embl.de) [Linding et al. 2003b], and DISOPRED (http://bioinf.cs.ucl.ac.uk/psipred/?disopred=1) [Jones and Ward 2003]. Some of these databases and predictors (e.g., DisProt) are based on the assumption that disorder is encoded in specific features of the amino acid sequence, whereas others (e.g., PONDR®) take into account the flexibility, hydropathy, charge, and coordination number, apart from the amino acid composition, to predict IDP regions. IDPs interact with other (IDP) proteins, nucleic acids, and other types of molecules, just like their structured counterparts. These interactions are often facilitated by small-molecule ligands, macromolecular binding partners, or posttranslational modifications that induce IDPs or IDP regions to become structured for the duration of those interactions. This kind of disorder-to-order transition in the partner-bound state is termed fuzziness [Tompa and Fuxreiter 2008]. The term emphasizes the structural ambiguity of protein interactions involving IDPs and IDP domains. Using a simple extension from disordered proteins, the fuzziness of PPIs can be divided along the same lines into static and dynamic fuzziness. If a PPI involves a static IDP or IDP region, then the fuzziness of the interaction is considered static fuzziness. In contrast, if a PPI involves a dynamic IDP or IDP region, then the fuzziness of the interaction is considered dynamic fuzziness (for further discussion of these definitions, see Section 6.5).

6.4

Intrinsic Disorder in Protein Interactions and Protein Complexes

The representation of proteins and their interactions in PPI networks—as dots and lines connecting these dots—neglects their biophysical properties. This simplified representation, although it makes for a powerful tool to explore topological properties, is insufficient to answer important questions such as: Which domains mediate the interactions? Which interactions are mediated by structured regions and which are mediated by disordered regions? Which interactions are transient and which are permanent? Which interactions exclude one another and which can occur simultaneously? Although there have been some recent efforts (e.g., Hein et al. [2015]) to quantify the strength and duration of interactions, these questions still largely remain unanswered using traditional proteomics techniques.


A complementary computational approach is to use protein docking or molecular dynamics simulations to predict the ensemble of interactions and macromolecular assemblies possible between complementary binding surfaces of proteins [Meng et al. 2011, Russel et al. 2012, De Ruyck et al. 2016]. Cross-docking involves identifying interacting partners by docking each protein in a set against every other protein. This, to some extent, helps us understand the structural dynamics of protein interactions. The Critical Assessment of PRedicted Interactions (CAPRI) challenge was started to assess the performance of docking algorithms and scoring approaches [Janin 2010]. The CAPRI challenge has led to significant improvements in available tools for docking [Pierce et al. 2014, De Vries and Zacharias 2013, Torchala et al. 2013, Kozakov et al. 2013]. For example, ZDOCK (http://zdock.umassmed.edu/) [Pierce et al. 2014] is a popular docking algorithm which uses a Fast Fourier Transform (FFT)-based global search to identify docking sites based on statistical potential, shape complementarity, and electrostatics. However, fuzzy or flexible (IDP-based) docking is much more difficult than rigid docking, and predicting docking pairs involving proteins undergoing large conformational changes remains a challenge. Certain proteins have a more central position in PPI networks. These proteins are typically hubs and are likely to be essential for survival [Jeong et al. 2001]. Some hub proteins have multiple simultaneous interactions (party hubs), while others have multiple sequential interactions separated in time or space (date hubs) [Han et al. 2004]. While proteomics techniques do not provide answers on the dynamics of protein interactions, and docking-based analysis for all proteins in a PPI network is challenging and limited by the availability of 3D structures, targeted analysis of hubs is more feasible. For example, in a pioneering work, Aloy and Russell [2002] predicted interactions for cyclin-dependent kinases (Cdks) using 3D structures of interaction surfaces. By mapping these structures onto a yeast PPI network consisting of 2,590 interactions, the authors identified structural interaction patterns for 59 of these interactions, which covered many of the Cdks. In another study (also highlighted earlier in Chapter 5), Kim et al. [2006] generated a structural interaction network (SIN) by integrating 3D structures of proteins with a yeast PPI network (available through the Structural Analysis of Protein Interaction Networks (SAPIN) server: http://sapin.crg.es/). The authors found that date hubs were often involved in mutually exclusive interactions in the SIN, which appeared to be more transient than simultaneously possible interactions. Tsuji et al. [2015] combined structural information from the Protein Data Bank (PDB) [Berman et al. 2000] with protein interactions from IntAct [Hermjakob et al. 2004] to investigate individual protein interactions involving hubs and protein complexes. The authors mapped human


PPIs from IntAct onto protein complex models from PDB to extract 3,959 unique experimentally determined interfaces for 7,241 interactions. Of these, about 80% of the proteins had a single interface. Proteins within protein complexes possessed more interfaces, with the largest number of interfaces reaching up to seven in some proteins (e.g., 26S proteasome subunit 7, MAP kinase kinase kinase (MAP3K) 3, and growth receptor bound (GRB) protein 2). When the amino-acid sequences of these interface regions were mapped onto the PDB protein complex models, it was found that approximately half (43%) of the sequence regions mapped to IDP regions, and between 16% and 57% mapped to the interface, surface, and interior (buried residues) of the protein complex models. In another study, Kar et al. [2009] integrated structural interfaces for hubs in a human cancer PPI network to build a cancer structural protein interface network (ciSPIN). The ciSPIN was used to study cancer-related properties of hub proteins using their interface dynamics. The Interactome3D database (http://interactome3d.irbbarcelona.org/) [Mosca et al. 2013] provides structural information for over 12,000 protein interactions covering eight model organisms. The above-mentioned studies on the existence of hubs capable of participating in multiple interactions suggest a distinct mechanism for molecular recognition: structural features that endow these hubs with the flexibility and mobility to accommodate a diverse range of interactions with other proteins [Hasty and Collins 2001]. Since IDPs are flexible, IDP-based interactions have been proposed as a mechanism to explain multiple binding by hubs in PPI networks [Dunker et al. 2005]. Several studies have supported this idea, with disorder being associated with the interaction patterns observed in both date [Ekman et al. 2006, Patil and Nakamura 2006, Patil et al. 2010] and party hubs [Jaffee et al. 2004, Marinissen and Gutkind 2005]. Some examples of completely (100% of the protein sequence/structure is disordered) or almost completely disordered hubs include α-synuclein, caldesmon, high mobility group protein A (HMGA), and synaptobrevin [Dunker et al. 2005]. There also exist hubs which are themselves ordered but interact with intrinsically disordered binding partners; an example is that of cyclin-dependent kinases (Cdks) binding to their inhibitors, the cyclin-kinase inhibitors (CKIs). Cdks were introduced earlier in this chapter. Cdks are considered the master timekeepers of cell division [Nurse 2001, Morgan 1995]. Cdks are regulated by binding to their cyclin partners to form heterodimeric complexes. For example, cyclins A and E, which interact with Cdk2, are required for the G1/S transition and progression through the S phase; exit from G1 is primarily under the control of cyclin D/Cdk4/6; and the interaction of Cdk1 with cyclin B1 directs the G2/M transition [Morgan 1995, Mendenhall and Hodge 1998, Enserink and Kolodner 2010, Dunker et al. 2005].


Therefore, not surprisingly, the activity of Cdks throughout the cell cycle is precisely directed by a combination of mechanisms, including the levels of cyclins and the levels of their inhibitors (CKI proteins), which are responsible for deactivating Cdk-cyclin complexes. The crystal structures of several Cdks and their complexes with cyclins and CKIs have been resolved [Pavletich 1999]. These studies suggest that CKIs are disordered but attain order upon binding to specific cyclin-Cdk complexes. The “molecular staple” analogy is often used to describe this phenomenon: the “prongs” of the CKI “staples” are unstructured and flexible, but binding to a cyclin-Cdk complex connects the prongs, conferring a transient structure on the CKI and facilitating deactivation of the cyclin-Cdk complex.

In addition to the two extremes of order and disorder observed in hubs, there also exists an intermediate category of partially disordered hubs. This group includes hub proteins that fold into overall stable 3D structures but contain disordered domains that play important roles in binding to the partners of these hubs. The percentage of the protein sequence/structure covered by these disordered regions ranges from approximately 12% (e.g., 14-3-3 proteins) to 79% (e.g., breast cancer susceptibility protein 1, BRCA1) [Dunker et al. 2005, Dunker et al. 2015]. A prototypical example is the tumor suppressor protein TP53, which has about 29% disordered regions and is involved in crucial biological processes including the response to cellular stress and DNA damage, arrest of cell cycle progression, and induction of apoptosis [Anderson and Appella 2004]. TP53 has four (not necessarily structured) domains, namely the N-terminal transcription activation domain, the central DNA-binding domain, the C-terminal tetramerization domain, and the C-terminal regulatory domain [Lee et al. 2000, Wells et al. 2008] (Figure 6.3). The N-terminal domain interacts with transcription factors II D (TFIID) and H (TFIIH), the mouse double minute 2 homolog / E3 ubiquitin-protein ligase (Mdm2), replication protein A (RPA), cAMP response element binding (CREB)-binding protein (CBP)/p300, and COP9 signalosome subunit 5 / JUN activation domain-binding protein 1 (CSN5/Jab1), among many other proteins. The C-terminal regulatory domain of TP53 interacts with glycogen synthase kinase 3β (GSK3β), poly(ADP-ribose) polymerase (PARP) 1, TATA-box binding protein associated factor 1 (TAF1), transformation/transcription domain associated protein (TRRAP), histone acetyltransferase hGcn5, 14-3-3, S100 calcium binding protein B (S100B), and several other proteins. While the DNA-binding domain is structured, the two terminal domains are intrinsically disordered and acquire structure only upon formation of interactions or complexes [Lee et al. 2000, Wells et al. 2008]. Multiple posttranslational modifications have also been reported in TP53, which are believed to alter the structures of these domains to enable protein interactions.

Figure 6.3 Disordered regions in the human tumor suppressor protein TP53. (a) TP53 sequence with predicted disordered regions (potentially disordered shown in red, and probably disordered shown in blue in the grey track), generated from the Protein Data Bank server [Berman et al. 2000]; (b) two views of the TP53 secondary structure with a predicted disordered tail [Walsh et al. 2015], generated from SWISS-MODEL [Kiefer et al. 2009]; and (c) high-confidence interactors of TP53 (including MDM2, EP300, CREBBP, KAT2B, ATM, CDKN1A, CDKN2A, SIRT1, MAPK8, and BRCA1), identified from STRING [Von Mering et al. 2003, Szklarczyk et al. 2011].


Other examples of partially disordered hubs include BRCA1, Mdm2, the xeroderma pigmentosum complementation group A protein (XPA), and estrogen receptor (ER) α [Dunker et al. 2015].
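The kinds of percentages discussed in this section, such as the fraction of a protein's interface regions that fall within predicted disordered regions, reduce to simple residue-interval computations. The sketch below is purely illustrative: the interval representation and the example coordinates are assumptions, and a real analysis would take interface regions from PDB complex models and disorder calls from a predictor such as IUPred.

# Illustrative sketch: estimate the fraction of interface residues that fall in
# predicted disordered regions. Regions are given as (start, end) residue
# intervals, 1-based and inclusive; the example values are hypothetical.

def residues(intervals):
    """Expand (start, end) intervals into a set of residue positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions

def disordered_interface_fraction(interface_regions, disordered_regions):
    interface = residues(interface_regions)
    if not interface:
        return 0.0
    disordered = residues(disordered_regions)
    return len(interface & disordered) / len(interface)

# Toy example: a 25-residue interface of which 10 residues are disordered.
print(disordered_interface_fraction([(40, 64)], [(1, 30), (55, 70)]))  # 0.4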

6.5 Identifying Fuzzy Protein Complexes

The concept of disorder in proteins can be extended to protein interactions and protein complexes involving IDPs or IDP domains, giving rise to fuzzy interactions and fuzzy complexes. As mentioned earlier, this can be a simple extension of the definition: if the IDP or its domain settles into one of a few alternative stable conformations in the bound state, the fuzziness of the resulting interaction or complex is considered static; if it remains an ensemble of interconverting conformations even in the bound state, the fuzziness is considered dynamic. Indeed, if conformational freedom provides IDPs and IDP regions with functional diversity, why would nature limit this freedom once they interact and attain a bound state? It would be twisted logic not to exploit these beneficial features after binding a partner, and it makes sense for IDPs and IDP regions to preserve their conformational freedom under all circumstances [Fuxreiter and Tompa 2012, Patel et al. 2007]. Fuxreiter and Tompa [Tompa and Fuxreiter 2008, Fuxreiter and Tompa 2012] identified 26 cases of protein complexes in which structural disorder was present even in the bound state and made a significant contribution to function. Based on these cases, the authors proposed four major categories of fuzziness in complexes: (i) polymorphic complexes, (ii) clamp complexes, (iii) flanking complexes, and (iv) random complexes. These categories are not completely distinct and can overlap with one another.

In a polymorphic complex, at least one of the proteins involved in the complex attains a few or multiple alternative conformations, and these alternative conformations underlie different biological functions. For example, consider the protein complex formed by the binding of α-importins to proteins “tagged” with nuclear localization signals (NLSs). Importins are a type of karyopherin that transports protein molecules into the cell nucleus, and an NLS is an amino acid sequence that tags a cargo protein for import into the nucleus (e.g., PKKKRKV found in the SV40 large T-antigen, and KR[PAATKKAGQA]KKKK found in nucleoplasmin). Here, the same NLS peptide exhibits different side-chain conformations, enabling it to tag different cargo proteins, which then form complexes with importins and are transported into the nucleus [Terry et al. 2007].


Other examples of polymorphic complexes include T-cell factor 4 (Tcf4, as the IDP) with β-catenin (as the partner), inhibitor 2 with protein phosphatase 1, myelin with actin, and RNase I with RNase inhibitor. In clamp complexes, the disordered segment connects two or more ordered binding regions. For example, the MAPK scaffold protein Ste5 interacts with the MAP kinase Fus3 via an eight-residue linker that dynamically interchanges among several conformations and lends flexibility to the interaction. The interaction of CKIs with cyclin-Cdk heterodimers, discussed above, is another example of a clamp complex. In flanking complexes, the IDP region acts as a recognition sequence (a linear motif) used to establish a specific contact with a partner, and it imparts the plasticity needed to localize to the target site. For example, the kinase inducible domain (KID) of CREB interacts with the kinase-inducible domain interacting (KIX) domain of CBP via a 29-residue IDP region. Random complexes are extreme cases of fuzziness, in which binding does not induce any structural order in the interacting regions. For example, T-cell receptor ζ chains lack a stable structure and remain in a fast dynamic equilibrium between conformations [Sigalov et al. 2004]. This random behavior is not limited to binding between proteins; for example, the binding of the single-stranded DNA (ssDNA)-binding (SSB) protein to DNA does not induce a strict disorder-to-order transition of the protein, which keeps fluctuating among several alternative conformations [Savvides et al. 2004].

Hegyi et al. [2007] studied disorder within protein complexes derived from E. coli and S. cerevisiae and correlated it with the size of the complexes. The protein complexes were derived by clustering interaction data from the IntAct [Hermjakob et al. 2004, Kerrien et al. 2012] and Gavin et al. [2006] datasets. Structural disorder for proteins or protein regions in these complexes was predicted using IUPred (http://iupred.enzim.hu/) [Dosztányi et al. 2005]. For each protein, the fraction of the amino-acid sequence predicted to be disordered was computed, and this fraction was averaged over the proteins within each complex. The complexes were then grouped according to size: single-protein, 2–4, 5–10, 11–20, 21–30, and 31–100 protein-containing complexes. The authors found significant differences in the extent of disorder across these size groups, with disorder tending to increase with complex size.
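A minimal sketch of this style of analysis is given below, assuming IUPred-like per-residue disorder scores and a 0.5 score cutoff; the cutoff and the toy data are assumptions chosen for illustration. It computes each protein's disorder fraction, averages it over the members of each complex, and summarizes the averages by complex-size group.

# Illustrative sketch: per-complex average disorder, summarized by complex size.
# Assumes per-residue disorder scores in [0, 1] (e.g., IUPred-like output) and a
# 0.5 cutoff for calling a residue disordered; all input data here are made up.
from statistics import mean

def disorder_fraction(scores, cutoff=0.5):
    """Fraction of residues of one protein called disordered."""
    return sum(score >= cutoff for score in scores) / len(scores)

def complex_disorder_by_size(complexes, protein_scores):
    """complexes: dict complex_id -> list of member protein ids.
    protein_scores: dict protein id -> list of per-residue disorder scores."""
    size_bins = [(1, 1), (2, 4), (5, 10), (11, 20), (21, 30), (31, 100)]
    grouped = {size_bin: [] for size_bin in size_bins}
    for members in complexes.values():
        # Average the members' disorder fractions to score the whole complex.
        avg_disorder = mean(disorder_fraction(protein_scores[p]) for p in members)
        for low, high in size_bins:
            if low <= len(members) <= high:
                grouped[(low, high)].append(avg_disorder)
                break
    return {size_bin: (len(values), mean(values) if values else None)
            for size_bin, values in grouped.items()}

# Toy usage: two complexes built from three hypothetical proteins.
scores = {"A": [0.1, 0.9, 0.8, 0.2], "B": [0.6, 0.7, 0.1, 0.1], "C": [0.05] * 4}
print(complex_disorder_by_size({"cplx1": ["A", "B"], "cplx2": ["C"]}, scores))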

Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

Many protein complexes are deeply conserved across species; in one analysis, 490 of 980 complexes that contained predominantly ancient (>1 billion years old) components were found to be ubiquitously conserved among eukaryotes. The high degree of conservation of complexes observed in such studies indicates that core cellular processes are preferentially conserved throughout evolution.

Detecting conserved complexes using PPI networks is typically based on identifying conserved dense (modular) subnetworks between the PPI networks of two or more species. The underlying assumption is that if a dense subnetwork is conserved between species, then the proteins within the subnetwork are under evolutionary pressure to maintain their interactions [Shoemaker and Panchenko 2007, Juan et al. 2008, Yamada and Bork 2009], and therefore the subnetwork corresponds to an evolutionarily conserved function, most often executed by protein complexes or functional modules. Most methods identify conserved subnetworks by first building interolog or orthology networks. An interolog is a conserved interaction, that is, an interaction between a pair of proteins in one organism whose homologs also interact in another organism [Walhout et al. 2000, Matthews et al. 2001]. The homologs are typically identified based on protein sequence similarity. One of the organisms is considered the source (S) and the other the target (T), and an interaction between a pair of proteins {uS, vS} from S is said to be conserved as {uT, vT} in T if the pair has a joint (geometric mean) sequence similarity ≥80% or a joint BLAST E-value