*742*
*18*
*16MB*

*English*
*Pages 461*
*Year 2019*

- Author / Uploaded
- Volkhard Helms

- Categories
- Biology

**Commentary**- Let’s download books from mobilism.org and upload them to libgen!

*Table of contents : Cover......Page 1Title Page......Page 5Copyright......Page 6Contents......Page 7Preface of the First Edition......Page 17Preface of the Second Edition......Page 191.1 Some Basics About Networks......Page 211.1.2 Small‐World Phenomenon......Page 221.1.3 Scale‐Free Networks......Page 231.2 Biological Background......Page 241.2.2 Cellular Components......Page 251.2.3 Spatial Organization of Eukaryotic Cells into Compartments......Page 271.3.1 Biochemical Pathways......Page 281.3.3 Signal Transduction......Page 311.4.1 Ontologies......Page 321.4.4 Reactome......Page 331.4.6 DAVID......Page 341.4.8 Systems Biology Markup Language......Page 351.7 Problems......Page 37Bibliography......Page 38Chapter 2 Structures of Protein Complexes and Subcellular Structures......Page 412.1 Examples of Protein Complexes......Page 422.1.1 Principles of Protein–Protein Interactions......Page 442.1.2 Categories of Protein Complexes......Page 472.2.1 Complexome of Saccharomyces cerevisiae......Page 482.2.2 Bacterial Protein Complexomes......Page 502.3 Experimental Determination of Three‐Dimensional Structures of Protein Complexes......Page 512.3.1 X‐ray Crystallography......Page 522.3.4 Cryo‐EM......Page 542.3.6 Fluorescence Resonance Energy Transfer......Page 552.3.7 Mass Spectroscopy......Page 562.4.1 Correlation‐Based Density Fitting......Page 582.5.1 Fourier Series......Page 602.5.4 Convolution Theorem......Page 612.5.5 Fast Fourier Transformation......Page 622.6 Advanced Density Fitting......Page 642.6.1 Laplacian Filter......Page 652.7 FFT Protein–Protein Docking......Page 662.8 Protein–Protein Docking Using Geometric Hashing......Page 682.9.1 CombDock......Page 692.9.3 3D‐MOSAIC......Page 722.10 Electron Tomography......Page 732.10.2 Protein Complexes in Mycoplasma pneumoniae......Page 752.11 Summary......Page 762.12.1 Mapping of Crystal Structures into EM Maps......Page 77Bibliography......Page 803.1 Modeling by Homology......Page 833.2.1 Size and Shape......Page 863.2.2 Composition of Binding Interfaces......Page 883.2.3 Hot Spots......Page 893.2.4 Physicochemical Properties of Protein Interfaces......Page 913.2.5 Predicting Binding Affinities of Protein–Protein Complexes......Page 923.2.6 Forces Important for Biomolecular Association......Page 933.3.1 Pairing Propensities......Page 953.3.2 Statistical Potentials for Amino Acid Pairs......Page 983.3.3 Conservation at Protein Interfaces......Page 993.3.4 Correlated Mutations at Protein Interfaces......Page 103Bibliography......Page 1064.1 Primer on Mathematical Graphs......Page 1094.2 A Few Words About Algorithms and Computer Programs......Page 1104.2.1 Implementation of Algorithms......Page 1114.2.2 Classes of Algorithms......Page 1124.3 Data Structures for Graphs......Page 1134.4 Dijkstra's Algorithm......Page 1154.4.1 Description of the Algorithm......Page 1164.4.2 Pseudocode......Page 1204.5 Minimum Spanning Tree......Page 1214.6 Graph Drawing......Page 1224.7 Summary......Page 1244.8 Problems......Page 1254.8.1 Force Directed Layout of Graphs......Page 127Bibliography......Page 1305.1 Experimental High‐Throughput Methods for Detecting Protein–Protein Interactions......Page 1315.1.2 Two‐Dimensional Gel Electrophoresis......Page 1325.1.3 Affinity Chromatography......Page 1335.1.4 Yeast Two‐hybrid Screening......Page 1345.1.5 Synthetic Lethality......Page 1355.1.8 Overlap of Interactions......Page 1365.1.9 Criteria to Judge the Reliability of Interaction Data......Page 1385.2 Bioinformatic Prediction of Protein–Protein Interactions......Page 1405.2.2 Phylogenetic Profiling/Coevolutionary Profiling......Page 1415.2.2.1 Coevolution......Page 1425.3 Bayesian Networks for Judging the Accuracy of Interactions......Page 1445.3.2 Bayesian Network......Page 1455.3.3 Application of Bayesian Networks to Protein–Protein Interaction Data......Page 1465.3.3.2 Prior and Posterior Odds......Page 1475.3.3.3 A Worked Example: Parameters of the Naïve Bayesian Network for Essentiality......Page 1485.3.3.4 Fully Connected Experimental Network......Page 1495.4.2 Protein Interaction Network of Escherichia coli......Page 1515.5 Protein Domain Networks......Page 1525.6 Summary......Page 1555.7.1 Bayesian Analysis of (Fake) Protein Complexes......Page 156Bibliography......Page 1586.1.1 Degree Distribution......Page 1616.1.2 Clustering Coefficient......Page 1636.2 Finding Cliques......Page 1656.3 Random Graphs......Page 1666.4 Scale‐Free Graphs......Page 1676.5 Detecting Communities in Networks......Page 1696.5.1 Divisive Algorithms for Mapping onto Tree......Page 1736.6 Modular Decomposition......Page 1756.6.1 Modular Decomposition of Graphs......Page 1776.7.1 MCODE......Page 1816.7.2 ClusterONE......Page 1826.7.3 DACO......Page 1836.7.4 Analysis of Target Gene Coexpression......Page 1846.8 Network Growth Mechanisms......Page 1856.10 Problems......Page 189Bibliography......Page 1987.1 Transcription Factors......Page 2017.3.1 Electrophoretic Mobility Shift Assay......Page 2037.3.2 DNAse Footprinting......Page 2047.3.3 Protein‐Binding Microarrays......Page 2057.4 Position‐Specific Scoring Matrices......Page 2077.5 Binding Free Energy Models......Page 2097.6 Cis‐Regulatory Motifs......Page 2117.7 Relating Gene Expression to Binding of Transcription Factors......Page 2127.9 Problems......Page 214Bibliography......Page 2158.1 Regulation of Gene Transcription at Promoters......Page 2178.2 Experimental Analysis of Gene Expression......Page 2188.2.2 Microarray Analysis......Page 2198.3 Statistics Primer......Page 2218.3.3 Fisher's Exact Test......Page 2238.3.4 Mann–Whitney–Wilcoxon Rank Sum Tests......Page 2258.3.6 Hypergeometric Test......Page 2268.4.1 Removal of Outlier Genes......Page 2278.4.3 Log Transformation......Page 2288.5 Differential Expression Analysis......Page 2298.5.2 SAM Analysis of Microarray Data......Page 2308.5.3 Differential Expression Analysis of RNA‐seq Data......Page 2328.5.3.2 DESeq......Page 2338.6 Gene Ontology......Page 2348.6.1 Functional Enrichment......Page 2368.8 Translation of Proteins......Page 2378.8.1 Transcription and Translation Dynamics......Page 2388.9 Summary......Page 2398.10 Problems......Page 240Bibliography......Page 244Chapter 9 Gene Regulatory Networks......Page 2479.1.1 Gene Regulatory Network of E. coli......Page 2489.2 Graph Theoretical Models......Page 2519.2.1 Coexpression Networks......Page 2529.2.2 Bayesian Networks......Page 2539.3.1 Boolean Networks......Page 2549.3.2 Reverse Engineering Boolean Networks......Page 2559.3.3 Differential Equations Models......Page 2569.4 DREAM: Dialogue on Reverse Engineering Assessment and Methods......Page 2589.4.1 Input Function......Page 2599.4.2 YAYG Approach in DREAM3 Contest......Page 2609.5 Regulatory Motifs......Page 2649.5.2 SIM......Page 2659.5.3 Densely Overlapping Region (DOR)......Page 2669.6.1 Key‐pathway Miner Algorithm......Page 2679.6.2 Identifying Sets of Dominating Nodes......Page 2689.6.4 Minimum Connected Dominating Set......Page 2699.7 Summary......Page 2709.8 Problems......Page 271Bibliography......Page 27410.1 Introduction to RNAs......Page 27710.2 Elements of RNA Interference: siRNAs and miRNAs......Page 27910.3 miRNA Targets......Page 28110.5 Role of TFs and miRNAs in Gene‐Regulatory Networks......Page 28410.6 Constructing TF/miRNA Coregulatory Networks......Page 28610.6.1 TFmiR Web Service......Page 28710.6.1.1 Construction of Candidate TF–miRNA–Gene FFLs......Page 28810.6.1.2 Case Study......Page 289Bibliography......Page 29011.1.1 DNA Methylation......Page 29311.1.1.1 CpG Islands......Page 29611.1.2 Histone Marks......Page 29711.1.3 Chromatin‐Regulating Enzymes......Page 29811.1.4 Measuring DNA Methylation Levels and Histone Marks Experimentally......Page 29911.2.1.2 Smoothing of DNA Methylation Data......Page 30111.2.2 Differential Methylation Analysis......Page 30211.2.3 Comethylation Analysis......Page 30311.2.4 Working with Data on Histone Marks......Page 30511.3.1 Measuring Chromatin States......Page 30611.3.2 Connecting Epigenetic Marks and Gene Expression by Linear Models......Page 30711.3.3 Markov Models and Hidden Markov Models......Page 30811.3.4 Architecture of a Hidden Markov Model......Page 31011.3.5 Elements of an HMM......Page 31111.4 The Role of Epigenetics in Cellular Differentiation and Reprogramming......Page 31211.4.2 Developmental Gene Regulatory Networks......Page 31311.5 The Role of Epigenetics in Cancer and Complex Diseases......Page 31511.7 Problems......Page 316Bibliography......Page 32112.1 Introduction......Page 32312.2 Resources on Metabolic Network Representations......Page 32612.3 Stoichiometric Matrix......Page 32812.4.1 Matrices: Definitions and Notations......Page 32912.4.2 Adding, Subtracting, and Multiplying Matrices......Page 33012.4.4 Square Matrices and Matrix Inversion......Page 33112.4.5 Eigenvalues of Matrices......Page 33212.4.6 Systems of Linear Equations......Page 33312.5 Flux Balance Analysis......Page 33412.5.1 Gene Knockouts: MOMA Algorithm......Page 33612.5.2 OptKnock Algorithm......Page 33812.6 Double Description Method......Page 33912.7.1 Steps of the Extreme Pathway Algorithm......Page 34412.7.2 Analysis of Extreme Pathways......Page 34812.7.3 Elementary Flux Modes......Page 34912.7.4 Pruning Metabolic Networks: NetworkReducer......Page 35112.8 Minimal Cut Sets......Page 35212.8.1 Applications of Minimal Cut Sets......Page 35712.9 High‐Flux Backbone......Page 35912.11.1 Static Network Properties: Pathways......Page 361Bibliography......Page 36613.1 Biological Oscillators......Page 36913.2 Circadian Clocks......Page 37013.2.1 Role of Post‐transcriptional Modifications......Page 37213.3 Ordinary Differential Equation Models......Page 37313.3.1 Examples for ODEs......Page 37413.4.1 Protein Synthesis and Degradation: Linear Response......Page 37613.4.2 Phosphorylation/Dephosphorylation – Hyperbolic Response......Page 37713.4.3 Phosphorylation/Dephosphorylation – Buzzer......Page 37913.4.4 Perfect Adaptation – Sniffer......Page 38013.4.5 Positive Feedback – One‐Way Switch......Page 38113.4.7 Negative Feedback – Homeostasis......Page 38213.4.8 Negative Feedback: Oscillatory Response......Page 38413.4.9 Cell Cycle Control System......Page 38513.5 Partial Differential Equations......Page 38613.5.2 Reaction–Diffusion Systems......Page 38813.6 Dynamic Phosphorylation of Proteins......Page 38913.7 Summary......Page 39013.8 Problems......Page 392Bibliography......Page 39314.1 Stochastic Processes......Page 39514.1.1 Binomial Distribution......Page 39614.1.3 Master Equation......Page 39714.2 Dynamic Monte Carlo (Gillespie Algorithm)......Page 39814.2.1 Basic Outline of the Gillespie Method......Page 39914.3.1 Expression of a Single Gene......Page 40014.3.2 Toggle Switch......Page 40114.4.1 Model System: Bacterial Photosynthesis......Page 40514.4.2 Pools‐and‐Proteins Model......Page 40614.4.3 Evaluating the Binding and Unbinding Kinetics......Page 40714.4.5 Steady‐State Regimes of the Vesicle......Page 40914.5 Parameter Optimization with Genetic Algorithm......Page 41214.6 Protein–Protein Association......Page 41514.7 Brownian Dynamics Simulations......Page 41614.8 Summary......Page 41814.9.1 Dynamic Simulations of Networks......Page 420Bibliography......Page 427Chapter 15 Integrated Cellular Networks......Page 42915.1 Response of Gene Regulatory Network to Outside Stimuli......Page 43015.2 Whole‐Cell Model of Mycoplasma genitalium......Page 43215.4 Integrative Differential Gene Regulatory Network for Breast Cancer Identified Putative Cancer Driver Genes......Page 43615.5 Particle Simulations......Page 44115.6 Summary......Page 443Bibliography......Page 444Chapter 16 Outlook......Page 447Index......Page 449*

Principles of Computational Cell Biology

Principles of Computational Cell Biology From Protein Complexes to Cellular Networks

Volkhard Helms

Second Edition

Author Volkhard Helms

Universität des Saarlandes Zentrum für Bioinformatik 66041 Saarbrücken Germany

All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate. Library of Congress Card No.:

applied for British Library Cataloguing-in-Publication Data

A catalogue record for this book is available from the British Library. Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliograﬁe; detailed bibliographic data are available on the Internet at . © 2019 Wiley-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microﬁlm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers. Registered names, trademarks, etc. used in this book, even when not speciﬁcally marked as such, are not to be considered unprotected by law. Print ISBN: 978-3-527-33358-5 ePDF ISBN: 978-3-527-81033-8 ePub ISBN: 978-3-527-81032-1 Typesetting SPi Global, Chennai, India Printing and Binding

Printed on acid-free paper

10 9 8 7 6 5 4 3 2 1

v

Contents Preface of the First Edition xv Preface of the Second Edition xvii 1

Networks in Biological Cells 1

1.1 1.1.1 1.1.2 1.1.3 1.2 1.2.1 1.2.2 1.2.3 1.2.4 1.3 1.3.1 1.3.2 1.3.3 1.3.4 1.4 1.4.1 1.4.2 1.4.3 1.4.4 1.4.5 1.4.6 1.4.7 1.4.8 1.5 1.6 1.7

Some Basics About Networks 1 Random Networks 2 Small-World Phenomenon 2 Scale-Free Networks 3 Biological Background 4 Transcriptional Regulation 5 Cellular Components 5 Spatial Organization of Eukaryotic Cells into Compartments 7 Considered Organisms 8 Cellular Pathways 8 Biochemical Pathways 8 Enzymatic Reactions 11 Signal Transduction 11 Cell Cycle 12 Ontologies and Databases 12 Ontologies 12 Gene Ontology 13 Kyoto Encyclopedia of Genes and Genomes 13 Reactome 13 Brenda 14 DAVID 14 Protein Data Bank 15 Systems Biology Markup Language 15 Methods for Cellular Modeling 17 Summary 17 Problems 17 Bibliography 18

vi

Contents

2

Structures of Protein Complexes and Subcellular Structures 21

2.1 2.1.1 2.1.2 2.2 2.2.1 2.2.2 2.2.3 2.3

Examples of Protein Complexes 22 Principles of Protein–Protein Interactions 24 Categories of Protein Complexes 27 Complexome: The Ensemble of Protein Complexes 28 Complexome of Saccharomyces cerevisiae 28 Bacterial Protein Complexomes 30 Complexome of Human 31 Experimental Determination of Three-Dimensional Structures of Protein Complexes 31 X-ray Crystallography 32 NMR 34 Electron Crystallography/Electron Microscopy 34 Cryo-EM 34 Immunoelectron Microscopy 35 Fluorescence Resonance Energy Transfer 35 Mass Spectroscopy 36 Density Fitting 38 Correlation-Based Density Fitting 38 Fourier Transformation 40 Fourier Series 40 Continuous Fourier Transform 41 Discrete Fourier Transform 41 Convolution Theorem 41 Fast Fourier Transformation 42 Advanced Density Fitting 44 Laplacian Filter 45 FFT Protein–Protein Docking 46 Protein–Protein Docking Using Geometric Hashing 48 Prediction of Assemblies from Pairwise Docking 49 CombDock 49 Multi-LZerD 52 3D-MOSAIC 52 Electron Tomography 53 Reconstruction of Phantom Cell 55 Protein Complexes in Mycoplasma pneumoniae 55 Summary 56 Problems 57 Mapping of Crystal Structures into EM Maps 57 Bibliography 60

2.3.1 2.3.2 2.3.3 2.3.4 2.3.5 2.3.6 2.3.7 2.4 2.4.1 2.5 2.5.1 2.5.2 2.5.3 2.5.4 2.5.5 2.6 2.6.1 2.7 2.8 2.9 2.9.1 2.9.2 2.9.3 2.10 2.10.1 2.10.2 2.11 2.12 2.12.1

3

Analysis of Protein–Protein Binding 63

3.1 3.2 3.2.1 3.2.2

Modeling by Homology 63 Properties of Protein–Protein Interfaces 66 Size and Shape 66 Composition of Binding Interfaces 68

Contents

3.2.3 3.2.4 3.2.5 3.2.6 3.3 3.3.1 3.3.2 3.3.3 3.3.4 3.4 3.5

Hot Spots 69 Physicochemical Properties of Protein Interfaces 71 Predicting Binding Aﬃnities of Protein–Protein Complexes 72 Forces Important for Biomolecular Association 73 Predicting Protein–Protein Interactions 75 Pairing Propensities 75 Statistical Potentials for Amino Acid Pairs 78 Conservation at Protein Interfaces 79 Correlated Mutations at Protein Interfaces 83 Summary 86 Problems 86 Bibliography 86

4

Algorithms on Mathematical Graphs

4.1 4.2 4.2.1 4.2.2 4.3 4.4 4.4.1 4.4.2 4.4.3 4.5 4.5.1 4.6 4.7 4.8 4.8.1

89 Primer on Mathematical Graphs 89 A Few Words About Algorithms and Computer Programs 90 Implementation of Algorithms 91 Classes of Algorithms 92 Data Structures for Graphs 93 Dijkstra’s Algorithm 95 Description of the Algorithm 96 Pseudocode 100 Running Time 101 Minimum Spanning Tree 101 Kruskal’s Algorithm 102 Graph Drawing 102 Summary 104 Problems 105 Force Directed Layout of Graphs 107 Bibliography 110

5

Protein–Protein Interaction Networks – Pairwise Connectivity 111

5.1

Experimental High-Throughput Methods for Detecting Protein–Protein Interactions 111 Gel Electrophoresis 112 Two-Dimensional Gel Electrophoresis 112 Aﬃnity Chromatography 113 Yeast Two-hybrid Screening 114 Synthetic Lethality 115 Gene Coexpression 116 Databases for Interaction Networks 116 Overlap of Interactions 116 Criteria to Judge the Reliability of Interaction Data 118 Bioinformatic Prediction of Protein–Protein Interactions 120 Analysis of Gene Order 121 Phylogenetic Proﬁling/Coevolutionary Proﬁling 121

5.1.1 5.1.2 5.1.3 5.1.4 5.1.5 5.1.6 5.1.7 5.1.8 5.1.9 5.2 5.2.1 5.2.2

vii

viii

Contents

5.2.2.1 5.3 5.3.1 5.3.2 5.3.3 5.3.3.1 5.3.3.2 5.3.3.3 5.3.3.4 5.4 5.4.1 5.4.2 5.4.3 5.5 5.6 5.7 5.7.1

Coevolution 122 Bayesian Networks for Judging the Accuracy of Interactions 124 Bayes’ Theorem 125 Bayesian Network 125 Application of Bayesian Networks to Protein–Protein Interaction Data 126 Measurement of Reliability “Likelihood Ratio” 127 Prior and Posterior Odds 127 A Worked Example: Parameters of the Naïve Bayesian Network for Essentiality 128 Fully Connected Experimental Network 129 Protein Interaction Networks 131 Protein Interaction Network of Saccharomyces cerevisiae 131 Protein Interaction Network of Escherichia coli 131 Protein Interaction Network of Human 132 Protein Domain Networks 132 Summary 135 Problems 136 Bayesian Analysis of (Fake) Protein Complexes 136 Bibliography 138

6

Protein–Protein Interaction Networks – Structural Hierarchies 141

6.1 6.1.1 6.1.2 6.2 6.3 6.4 6.5 6.5.1 6.6 6.6.1 6.7 6.7.1 6.7.2 6.7.3 6.7.4 6.8 6.9 6.10

Protein Interaction Graph Networks 141 Degree Distribution 141 Clustering Coeﬃcient 143 Finding Cliques 145 Random Graphs 146 Scale-Free Graphs 147 Detecting Communities in Networks 149 Divisive Algorithms for Mapping onto Tree 153 Modular Decomposition 155 Modular Decomposition of Graphs 157 Identiﬁcation of Protein Complexes 161 MCODE 161 ClusterONE 162 DACO 163 Analysis of Target Gene Coexpression 164 Network Growth Mechanisms 165 Summary 169 Problems 169 Bibliography 178

7

Protein–DNA Interactions 181

7.1 7.2

Transcription Factors 181 Transcription Factor-Binding Sites 183

Contents

7.3 7.3.1 7.3.2 7.3.3 7.3.4 7.4 7.5 7.6 7.6.1 7.7 7.8 7.9

Experimental Detection of TFBS 183 Electrophoretic Mobility Shift Assay 183 DNAse Footprinting 184 Protein-Binding Microarrays 185 Chromatin Immunoprecipitation Assays 187 Position-Speciﬁc Scoring Matrices 187 Binding Free Energy Models 189 Cis-Regulatory Motifs 191 DACO Algorithm 192 Relating Gene Expression to Binding of Transcription Factors 192 Summary 194 Problems 194 Bibliography 195

8

Gene Expression and Protein Synthesis 197

8.1 8.2 8.2.1 8.2.2 8.2.3 8.3 8.3.1 8.3.2 8.3.3 8.3.4 8.3.5 8.3.6 8.3.7 8.4 8.4.1 8.4.2 8.4.3 8.5 8.5.1 8.5.2 8.5.3 8.5.3.1 8.5.3.2 8.6 8.6.1 8.7 8.8 8.8.1 8.9 8.10

Regulation of Gene Transcription at Promoters 197 Experimental Analysis of Gene Expression 198 Real-time Polymerase Chain Reaction 199 Microarray Analysis 199 RNA-seq 201 Statistics Primer 201 t-Test 203 z-Score 203 Fisher’s Exact Test 203 Mann–Whitney–Wilcoxon Rank Sum Tests 205 Kolmogorov–Smirnov Test 206 Hypergeometric Test 206 Multiple Testing Correction 207 Preprocessing of Data 207 Removal of Outlier Genes 207 Quantile Normalization 208 Log Transformation 208 Diﬀerential Expression Analysis 209 Volcano Plot 210 SAM Analysis of Microarray Data 210 Diﬀerential Expression Analysis of RNA-seq Data 212 Negative Binomial Distribution 213 DESeq 213 Gene Ontology 214 Functional Enrichment 216 Similarity of GO Terms 217 Translation of Proteins 217 Transcription and Translation Dynamics 218 Summary 219 Problems 220 Bibliography 224

ix

x

Contents

9

9.1 9.1.1 9.1.2 9.2 9.2.1 9.2.2 9.3 9.3.1 9.3.2 9.3.3 9.4 9.4.1 9.4.2 9.5 9.5.1 9.5.2 9.5.3 9.6 9.6.1 9.6.2 9.6.3 9.6.4 9.7 9.8

227 Gene Regulatory Networks (GRNs) 228 Gene Regulatory Network of E. coli 228 Gene Regulatory Network of S. cerevisiae 231 Graph Theoretical Models 231 Coexpression Networks 232 Bayesian Networks 233 Dynamic Models 234 Boolean Networks 234 Reverse Engineering Boolean Networks 235 Diﬀerential Equations Models 236 DREAM: Dialogue on Reverse Engineering Assessment and Methods 238 Input Function 239 YAYG Approach in DREAM3 Contest 240 Regulatory Motifs 244 Feed-forward Loop (FFL) 245 SIM 245 Densely Overlapping Region (DOR) 246 Algorithms on Gene Regulatory Networks 247 Key-pathway Miner Algorithm 247 Identifying Sets of Dominating Nodes 248 Minimum Dominating Set 249 Minimum Connected Dominating Set 249 Summary 250 Problems 251 Bibliography 254

Gene Regulatory Networks

10

Regulatory Noncoding RNA 257

10.1 10.2 10.3 10.4 10.5 10.6 10.6.1 10.6.1.1 10.6.1.2 10.7

Introduction to RNAs 257 Elements of RNA Interference: siRNAs and miRNAs 259 miRNA Targets 261 Predicting miRNA Targets 264 Role of TFs and miRNAs in Gene-Regulatory Networks 264 Constructing TF/miRNA Coregulatory Networks 266 TFmiR Web Service 267 Construction of Candidate TF–miRNA–Gene FFLs 268 Case Study 269 Summary 270 Bibliography 270

11

Computational Epigenetics 273

11.1 11.1.1 11.1.1.1 11.1.2 11.1.3

Epigenetic Modiﬁcations 273 DNA Methylation 273 CpG Islands 276 Histone Marks 277 Chromatin-Regulating Enzymes 278

Contents

11.1.4

Measuring DNA Methylation Levels and Histone Marks Experimentally 279 11.2 Working with Epigenetic Data 281 11.2.1 Processing of DNA Methylation Data 281 11.2.1.1 Imputation of Missing Values 281 11.2.1.2 Smoothing of DNA Methylation Data 281 11.2.2 Diﬀerential Methylation Analysis 282 11.2.3 Comethylation Analysis 283 11.2.4 Working with Data on Histone Marks 285 11.3 Chromatin States 286 11.3.1 Measuring Chromatin States 286 11.3.2 Connecting Epigenetic Marks and Gene Expression by Linear Models 287 11.3.3 Markov Models and Hidden Markov Models 288 11.3.4 Architecture of a Hidden Markov Model 290 11.3.5 Elements of an HMM 291 11.4 The Role of Epigenetics in Cellular Diﬀerentiation and Reprogramming 292 11.4.1 Short History of Stem Cell Research 293 11.4.2 Developmental Gene Regulatory Networks 293 11.5 The Role of Epigenetics in Cancer and Complex Diseases 295 11.6 Summary 296 11.7 Problems 296 Bibliography 301 12

12.1 12.2 12.3 12.4 12.4.1 12.4.2 12.4.3 12.4.4 12.4.5 12.4.6 12.5 12.5.1 12.5.2 12.6 12.7 12.7.1 12.7.2 12.7.3 12.7.4 12.8 12.8.1

303 Introduction 303 Resources on Metabolic Network Representations 306 Stoichiometric Matrix 308 Linear Algebra Primer 309 Matrices: Deﬁnitions and Notations 309 Adding, Subtracting, and Multiplying Matrices 310 Linear Transformations, Ranks, and Transpose 311 Square Matrices and Matrix Inversion 311 Eigenvalues of Matrices 312 Systems of Linear Equations 313 Flux Balance Analysis 314 Gene Knockouts: MOMA Algorithm 316 OptKnock Algorithm 318 Double Description Method 319 Extreme Pathways and Elementary Modes 324 Steps of the Extreme Pathway Algorithm 324 Analysis of Extreme Pathways 328 Elementary Flux Modes 329 Pruning Metabolic Networks: NetworkReducer 331 Minimal Cut Sets 332 Applications of Minimal Cut Sets 337

Metabolic Networks

xi

xii

Contents

12.9 12.10 12.11 12.11.1

High-Flux Backbone 339 Summary 341 Problems 341 Static Network Properties: Pathways 341 Bibliography 346

13

Kinetic Modeling of Cellular Processes 349

13.1 13.2 13.2.1 13.3 13.3.1 13.4 13.4.1 13.4.2 13.4.3 13.4.4 13.4.5 13.4.6 13.4.7 13.4.8 13.4.9 13.5 13.5.1 13.5.2 13.6 13.7 13.8

Biological Oscillators 349 Circadian Clocks 350 Role of Post-transcriptional Modiﬁcations 352 Ordinary Diﬀerential Equation Models 353 Examples for ODEs 354 Modeling Cellular Feedback Loops by ODEs 356 Protein Synthesis and Degradation: Linear Response 356 Phosphorylation/Dephosphorylation – Hyperbolic Response Phosphorylation/Dephosphorylation – Buzzer 359 Perfect Adaptation – Sniﬀer 360 Positive Feedback – One-Way Switch 361 Mutual Inhibition – Toggle Switch 362 Negative Feedback – Homeostasis 362 Negative Feedback: Oscillatory Response 364 Cell Cycle Control System 365 Partial Diﬀerential Equations 366 Spatial Gradients of Signaling Activities 368 Reaction–Diﬀusion Systems 368 Dynamic Phosphorylation of Proteins 369 Summary 370 Problems 372 Bibliography 373

14

Stochastic Processes in Biological Cells

14.1 14.1.1 14.1.2 14.1.3 14.2 14.2.1 14.3 14.3.1 14.3.2 14.4 14.4.1 14.4.2 14.4.3 14.4.4 14.4.5 14.5

375 Stochastic Processes 375 Binomial Distribution 376 Poisson Process 377 Master Equation 377 Dynamic Monte Carlo (Gillespie Algorithm) 378 Basic Outline of the Gillespie Method 379 Stochastic Eﬀects in Gene Transcription 380 Expression of a Single Gene 380 Toggle Switch 381 Stochastic Modeling of a Small Molecular Network 385 Model System: Bacterial Photosynthesis 385 Pools-and-Proteins Model 386 Evaluating the Binding and Unbinding Kinetics 387 Pools of the Chromatophore Vesicle 389 Steady-State Regimes of the Vesicle 389 Parameter Optimization with Genetic Algorithm 392

357

Contents

14.6 14.7 14.8 14.9 14.9.1

Protein–Protein Association 395 Brownian Dynamics Simulations 396 Summary 398 Problems 400 Dynamic Simulations of Networks 400 Bibliography 407

15

Integrated Cellular Networks

15.1 15.2 15.3 15.4 15.5 15.6

16

409 Response of Gene Regulatory Network to Outside Stimuli 410 Whole-Cell Model of Mycoplasma genitalium 412 Architecture of the Nuclear Pore Complex 416 Integrative Diﬀerential Gene Regulatory Network for Breast Cancer Identiﬁed Putative Cancer Driver Genes 416 Particle Simulations 421 Summary 423 Bibliography 424 Outlook

427

Index 429

xiii

xv

Preface of the First Edition This book grew out of a course for graduate students in the ﬁrst year of the MSc bioinformatics program that the author teaches every year at Saarland University. Also included is some material from a special lecture on cell simulations. The book is designed as a textbook, placing emphasis on transmitting the main ideas of a problem, outlining algorithmic strategies for solving these, and describing possible complications or connections to other parts of the book. The main challenge during the writing of the book was the concentration on conceptual points that may be of general educative value rather than including the latest research results of this fascinating fast-moving ﬁeld. It is considered more important for a textbook to give a cohesive picture rather than mentioning all possible drawbacks and special cases where particular general guidelines may not apply. We apologize to those whose work could not be mentioned because of space constraints. The intended audience includes students of bioinformatics and from life science disciplines. Consequently, some basic knowledge in molecular biology is taken for granted. The language used is not very formal. Previous knowledge of computer science is not required, but a certain adeptness in basic mathematics is necessary. The book introduces all of the mathematical concepts needed to understand the material covered. In particular, Chapter 2 introduces mathematical graphs and algorithms on graphs used in classifying protein–protein interaction networks. Chapter 6 introduces linear and convex algebra typically being used in the description of metabolic networks. Chapter 7 discusses ordinary and stochastic diﬀerential equations used in the kinetic modeling of signal transduction pathways. Chapter 8 introduces the method of Fourier transformation for protein–protein docking and pattern matching. Also introduced are Bayesian networks in Chapter 4 as a way to judge the reliability of protein–protein interactions and inference techniques to model gene regulatory networks. We note, however, that the emphasis of this book is placed on discrete mathematics rather than on statistical methods. Not included yet are classical network ﬂow algorithms such as Menger’s theorem or the max-ﬂow min-cut theorem as they are currently rarely used in cellular modeling. The book focuses on proteins and the genes coding for them, as well as on metabolites. Less room is given to DNA, RNA, or lipid membranes that would, of course, also deserve a great deal of attention. The main reason for this was to provide a homogenous background for discussing algorithmic concepts.

xvi

Preface of the First Edition

The author is very grateful to Dr. Tihamér Geyer who coordinated the assignments for the lectures for valuable comments on the manuscript and for many solved examples and problems for this book. The following coworkers from Saarbrücken and elsewhere have provided valuable suggestions on diﬀerent portions of the text: Kerstin Kunz, Jan Christoph, and Florian Lauck. I thank Dr. Hawoong Jeong, Dr. Julio Collado-Vides, Dr. Agustino Martínez-Antonio, Dr. Ruth Sperling, Dr. James R. Williamson, Dr. Joanna Trylska, Dr. Claude Antony, and Dr. Nicholas Luscombe for sending me high-resolution versions of their graphics. I thank Dr. Andreas Sendtko and the publishing staﬀ at Wiley-VCH for their generous support of this book project, for their seemingly endless patience during the revision stage, and for excellent typesetting. I also thank the Center of Theoretical Biophysics at the University of California, San Diego, for their hospitality during a sabbatical visit in summer 2007 that ﬁnally allowed to complete this work. Finally, this book would not have been possible without the support and patience of my wife Regina and our two daughters. March 2008

Volkhard Helms Center for Bioinformatics Saarland University Saarbrücken, Germany

xvii

Preface of the Second Edition1 About 10 years after the publication of the ﬁrst edition, I ﬁnally managed to prepare this expanded second edition of this book. Its main spirit remained the same: it is designed as a textbook, placing emphasis on transmitting the main ideas of a problem, outlining algorithmic strategies for solving these, and describing possible complications or connections to other parts of the book. Because of the feedback from colleagues, I have reordered the content, starting now in Chapter 2 with an introduction into the structures of protein–protein complexes before we enter into the world of protein interaction networks. I refrain from listing all the rearrangements here. Usually, I tried to keep subsections intact and simply shifted them around. A few sections were removed from the text because I now felt that they were too specialized. About 50% of new content has been added. In terms of mathematical methods, much more room is now given to statistical methods. In terms of biology, several new chapters now address protein–DNA interactions, epigenetic modiﬁcations, and microRNAs. Still not covered are biophysical topics related to intracellular transport, cytoskeletal dynamics, and processes taking place at and across biological membranes. Maybe, there will be a need for a third edition eventually? In addition to those who contributed to the ﬁrst edition, the author is very grateful to Thorsten Will and Maryam Nazarieh for solved examples and problems for this book. The following coworkers from Saarbrücken and elsewhere have provided valuable suggestions on diﬀerent portions of the text: Mohamed Hamed Fahmy, Dania Humaidan, Olga Kalinina, Heiko Rieger, and Thorsten Will. I thank my group members of the past years with whom I had the privilege to work on exciting research projects related to the content of this book and I thank our secretary Kerstin Gronow-Pudelek for technical assistance. April 2018

1

Volkhard Helms Center for Bioinformatics Saarland University Saarbrücken, Germany

Problems: To really absorb the content of this textbook, it is advisable to also try to solve some of the problems enclosed.

1

1 Networks in Biological Cells Modern molecular and cell biology has worked out many important cellular processes in more detail, although some other areas are known to a lesser extent. It often remains to understand how the individual parts are connected, and this is exactly the focus of this book. Figure 1.1 displays a cartoon of a cell as a highly viscous soup containing a complicated mixture of many particles. Certainly, several important details are left out here that introduce a partial order, such as the cytoskeleton and organelles of eukaryotic cells. Figure 1.1 reminds us that there is a myriad of biomolecular interactions taking place in biological cells at all times and that it is pretty amazing how a considerable order is achieved in many cellular processes that are all based on pairwise molecular interactions. The focus of this book is placed on presenting mathematical descriptions developed in recent years to describe various levels of cellular networks. We will learn that many biological processes are tightly interconnected, and this is exactly where many links still need to be discovered in further experimental studies. Many researchers in the ﬁeld of molecular biology believe that only combined eﬀorts of modern experimental techniques, mathematical modeling, and bioinformatics analysis will be able to arrive at a suﬃcient understanding of the biological networks of cells and organisms. In this chapter, we will start with some principles of mathematical networks and their relationship with biological networks. Then, we will brieﬂy look at several biological key players to be used in the rest of this book (cells, compartments, proteins, and pathways). Without going into any further detail, we will directly move into the ﬁeld of network theory with the amazing “small-world phenomenon.”

1.1 Some Basics About Networks Network theory is a branch of applied mathematics and more of physics that uses the concepts of graph theory. Its developments are led by application to real-world examples in the areas of social networks (such as networks of acquaintances or among scientists having joint publications), technological networks (such as the World Wide Web that is a network of web pages and the Internet that is a network of computers and routers or power grids), and biological networks (such as neural networks and metabolic networks). Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

2

Principles of Computational Cell Biology

Figure 1.1 Is this how we should view a biological cell? The point of this schematic picture is that about 30% of the volume of a biological cell is taken up my millions of individual proteins. Therefore, biological cells are really “full.” However, of course, such pictures do not tell us much about the organization of biological processes. As we will see later in this book, there are many diﬀerent hierarchies of order in such a cell.

1.1.1

Random Networks

In a random network, every possible link between two “vertices” (or nodes) A and B is established according to a given probability distribution irrespective of the nature and connectivity of the two vertices A and B. This is what is “random” about these networks. If the network contains n vertices in total, the maximal number of undirected edges (links) between them is n × (n − 1)/2. This is because we can pick each of the n vertices as the ﬁrst vertex of an edge, and there are (n − 1) other vertices that this vertex can be connected to. In this way, we will actually consider each edge twice, using each end point as the ﬁrst vertex. Therefore, we need to divide the number of edges by 2. If every edge is established with a probability p ∈ [0, 1], the total number of edges in an undirected graph is p × n × (n − 1)/2. The mathematics of random graphs was developed and elucidated by two Hungarian mathematicians Erdös and Renyi. However, the analysis of real networks showed that such networks often diﬀer signiﬁcantly from the characteristics of random graphs. We will turn back to random graphs in Section 6.3. 1.1.2

Small-World Phenomenon

The term small-world phenomenon was coined to describe the observation that everyone in the world is linked to some other person through a short chain of social acquaintances. In a small-world experiment, the psychologist Stanley Milgram found in 1967 that, on average, any two US citizens randomly picked were connected to each other by only six acquaintances. Vertices in a network have short average distances. Usually, the distance between the nodes scales logarithmically with the total number, n, of the vertices. In a paper published in the journal Nature in 1998, the two mathematicians Duncan J. Watts and Steven H. Strogatz (Watts and Strogatz, 1998) reported

Networks in Biological Cells

that small-world networks are common in many diﬀerent areas ranging from neuronal connections of the worm Caenorhabditis elegans to power grids. 1.1.3

Scale-Free Networks

Only one year after the discovery of Watts and Strogatz, Albert-László Barabási from the Physics Department at the University of Notre Dame introduced an even simpler model for the emergence of the small-world phenomenon (Barabási and Albert 1999). Although Watts and Strogatz’s model was able to explain the short average path length and the dense clustering coeﬃcient of a small world (all these terms will be introduced in Chapter 6), it did not manage to explain another property that is typical for real-world networks such as the Internet: these networks are scale-free. In simple terms, this means that although the vast majority of vertices are weakly connected, there also exist some highly interconnected super-vertices or hubs. The term scale-free expresses that the ratio of highly to weakly connected vertices remains the same irrespective of the total number of links in the network. We will see in Section 6.4 that the connectivity of scale-free networks follows a power law. If a network is scale-free, it is also a small world. In this paper, Barabási and Albert presented a strikingly simple and intuitive algorithm that generates networks with a scale-free topology. It has two essential elements: • Growth. The network is started from a small number of (at least two) connected vertices. At every iteration step, a new vertex is added that forms links to m of the existing vertices. • Preferential attachment. One assumes that the probability of a link between a newly added vertex and an existing vertex i depends on the degree of i (the number of existing links between vertex i and other vertices). The more connections i has already, the more likely the new vertices will link to i. This behavior is described by the saying “the rich become richer.” Let us motivate this on the ﬁctitious example of the early days of air traﬃc. Initially, one needs to build two airports so that a ﬁrst regular ﬂight connection can be established between them. Eventually, a third airport is established. Most likely, initially, only one new ﬂight will go to either one of the existing airports. Now, the situation is unbalanced. Now, there exists one airport that is connected to two other cities, and the airports of those cities are only connected to one city. There is a certain chance that, after some time, the “missing” connection between the new airport and the other airport would be introduced, which would lead to a balanced situation again. Alternatively, a fourth airport could emerge that would also start by establishing only one ﬂight to one of the existing airports. Now, the airport that already has two connections would have an obvious practical advantage because passengers taking this route simply have more options to carry on. Therefore, the chance that this ﬂight is established is higher than for the other connections. Exactly, this idea is captured by the concept of preferential attachment. The same growth mechanism applies, for example, to the World Wide Web. Obviously, this network grows constantly over time, and many new pages are

3

4

Principles of Computational Cell Biology

added to it every moment. We know from our own experience that once a new web page is created, its owner will most likely include links to other popular pages (hubs) on the new page so that the second “rule” is also fulﬁlled. In the early exciting days of network theory when the study of large-scale networks took oﬀ like a storm, it was even suggested that the scale-free network model may be something like a law of nature that controls how natural small-world networks are formed. However, subsequent work on integrated biological networks showed that the concept of scale-free networks may rather be of theoretical value and that it may not be directly applicable to certain biological networks. For the moment, we will consider the idea of network topology (scale-free networks and small-world phenomenon) as a powerful concept that is useful for understanding the mechanism of network growth and vulnerability.

1.2 Biological Background Until recently, the paradigm of molecular biology was that genetic information is read from the genomic DNA by the RNA polymerase complex and is transcribed into the corresponding RNA. Ribosomes then bind to messenger RNA (mRNA) snippets and produce amino acid strands. This process is called translation. Importantly, the paradigm involved the notion that this entire process is unidirectional, see Figure 1.2. Central paradigm of molecular biology

(a)

DNA

RNA

Protein

Phenotype (symptoms)

Central paradigm of structural biology Genetic (b) information

Molecular structure

Biochemical function

Phenotype (symptoms)

Central paradigm of molecular systems biology Genetic information

(c)

Molecular structure

Biochemical function

Phenotype (symptoms)

Molecular interactions

Figure 1.2 (a) Since the 1950s, a paradigm was established, whereby the information ﬂows from DNA over RNA to protein synthesis, which then gives rise to particular phenotypes. (b) The emergence of structural biology – the ﬁrst crystal structure of the protein myoglobin was determined in 1960 – emphasized the importance of the three-dimensional structures of proteins determining their function. (c) Today, we have realized the central role played by molecular interactions that inﬂuence all other elements.

Networks in Biological Cells

1.2.1

Transcriptional Regulation

It is now well established that many feedback loops are provided in this system too, e.g. by the proteins known as transcription factors that bind to sequence motifs on the genomic DNA and mediate (activate or repress) transcription of certain genomic segments. Important discoveries of the past 20 years showed that cellular mRNA concentrations are also largely aﬀected by small RNA snippets termed microRNAs and that the chromatin structure is shaped by epigenetic modiﬁcations of the DNA and histone proteins that control the accessibility of genomic regions. The cellular network therefore certainly appears much more complicated today than it did 60 years ago. This brings us to the world of gene regulatory networks. Collecting the required information on the regulation of individual genes is a subject of intense active research. For example, the ENCODE project for human cells and the modENCODE project for the model organisms C. elegans and Drosophila melanogaster mapped the binding sites of hundreds of transcription factors throughout the genomes. Also, the FANTOM initiative started in Japan is a worldwide collaborative project aiming at identifying all the functional elements in mammalian genomes. However, occupancy maps of transcription factors alone are not being considered as compelling evidence of biologically functional regulation. To really prove or disprove which gene is activated or repressed by a particular transcription factor (or microRNA), one could create a knockout organism lacking the gene coding for this transcription factor and see which genes are no longer expressed or are now expressed in excess. Such genome-wide deletion libraries have actually been produced for the model organism Saccharomyces cerevisiae. However, in this way, we can only discover those combinations that are not lethal for the organism. Also, pairs or larger assemblies of transcription factors often need to bind simultaneously. It simply appears impossible to discover the full connectivity of this regulatory network by a traditional one-by-one approach. Fortunately, modern microarray and RNAseq experiments probe the expression levels of many genes simultaneously. Ongoing challenges are the noisy nature of the large-scale data and the fact that genes actually do not interact directly with each other. Analysis of gene expression data will be discussed in Chapter 8. In this book, we will be mostly concerned with the following four types of biological cellular networks: protein–protein interaction networks, gene regulatory networks, signal transduction networks, and metabolic networks. We will discuss them at diﬀerent hierarchical levels as shown in Figure 1.3 using the example of regulatory networks. 1.2.2

Cellular Components

Cells can be described at various levels in detail. We will mostly use three diﬀerent levels of description: (a) Inventory lists and lists of processes. • Proteins in particular compartments • Proteins forming macromolecular complexes

5

6

Principles of Computational Cell Biology

• Biomolecular interactions • Regulatory interactions • Metabolic reactions (b) Structural descriptions. • Structures of single proteins • Topologies of protein complexes • Subcellular compartments (c) Dynamic descriptions. • Cellular processes ranging from nanosecond dynamics for the association of two biomolecules up to processes occurring in seconds and minutes such as the cell division of yeast cells. We will assume that the reader has a basic knowledge about the organic molecules commonly found within living cells and refer those who do not to basic books on biochemistry or molecular biology. Depending on their role in metabolism, the biomolecules in a cell can be grouped into several classes. Basic unit

Motifs

Modules

Transcription factor

SIM

MIM Target gene and binding site (a)

(c)

(b)

FFL

Figure 1.3 Structural organization of transcriptional regulatory networks. (a) The “basic unit” comprises the transcription factor, its target gene with a DNA recognition site, and the regulatory interaction between them. (b) Units are often organized into network “motifs” that comprise speciﬁc patterns of inter-regulation that are overrepresented in networks. Examples of motifs include single-input/multiple output (SIM), multiple input/multiple output (MIM), and feed-forward loop (FFL) motifs. (c) Network motifs can be interconnected to form semi-independent “modules,” many of which have been identiﬁed by integrating regulatory interaction data with gene expression data and imposing evolutionary conservation. The next level consists of the entire network (not shown). Source: Babu et al. (2004). Drawn with permission of Elsevier.

Networks in Biological Cells

1. Macromolecules including nucleic acids, proteins, polysaccharides, and certain lipids. 2. The building blocks of macromolecules include sugars as the precursors of polysaccharides, amino acids as the building blocks of proteins, nucleotides as the precursors of nucleic acids (and therefore of DNA and RNA), and fatty acids that are incorporated into lipids. Interestingly, in biological cells, only a small number of theoretically synthesizable macromolecules exist at a given time point. At any moment during a normal cell cycle, many new macromolecules need to be synthesized from their building blocks, and this is meticulously controlled by the complex gene expression machinery. Even during a steady state of the cell, there exists a constant turnover of macromolecules. 3. Metabolic intermediates (metabolites). Many molecules in a biological cell have complex chemical structures and must be synthesized in several reactions from speciﬁc starting materials that may be taken up as the energy source. In the cell, connected chemical reactions are often grouped into metabolic pathways (Section 1.3). 4. Molecules of miscellaneous function including vitamins, steroid hormones, molecules that can store energy storage such as ATP, regulatory molecules, and metabolic waste products. Almost all biological materials that are needed to construct a biological cell are either synthesized by the RNA polymerase and ribosome machinery of the cell or are taken up from the outside via the cell membrane. Therefore, as a minimum inventory, every cell needs to contain the construction plan (DNA), a processing unit to transcribe this information into mRNA (polymerase), a processing unit to translate these mRNA pieces into protein (ribosome), and transporter proteins inside the cell membrane that transport material through the cell membrane. 1.2.3

Spatial Organization of Eukaryotic Cells into Compartments

Organization into various compartments greatly simpliﬁes the temporal and spatial process ﬂow in eukaryotic cells. As mentioned above, at each time point during a cell cycle, only a small subfraction of all potential proteins is being synthesized (and not yet degraded). Also, many proteins are only available in very small concentrations, possibly with only a few copies per cell. However, localizing these proteins to particular spots in the cell, e.g. by attaching them to the cytoskeleton or by partitioning them into lipid rafts, their local concentrations may be much higher. We assume that the reader is vaguely familiar with the compartmentalization of eukaryotic cells involving the lysosome, plasma membrane, cell membrane, Golgi complex, nucleus, smooth endoplasmic reticulum, mitochondrion, nucleolus, rough endoplasmic reticulum, and cytoskeleton. An important element of cellular organization is the active transport of macromolecules along the microtubules of the cytoskeleton that is carried out by molecular motor proteins such as kinesin and dynein. Here, we will not address the activities of molecular motors because this is rather a research topic in biophysics.

7

8

Principles of Computational Cell Biology

Table 1.1 Data on the genome length and on the number of protein-coding and RNA genes are taken from the Kyoto Encyclopedia of Genes and Genomes database (April 2018); data on the number of putative transporter proteins are taken from www.membranetransport.org.

Organism

Length of genome (Mb)

Number of proteincoding genes

Number of RNA genes

Number of transporter proteins

Prokaryotes Mycoplasma genitalium G37

0.6

476

43

53

Bacillus subtilis BSN5

4.2

4 145

113

552

Escherichia coli APEC01

4.6

4 890

93

665

1.3

6 002

425

341 662

Eukaryotes Saccharomyces cerevisiae S288C Drosophila melanogaster Caenorhabditis elegans Homo sapiens

1.2.4

12

13 929

3 209

100.2

20 093

24 969

669

20 338

19 201

1 467

3 150

Considered Organisms

Table 1.1 presents some statistics of the organisms considered in this book.

1.3 Cellular Pathways 1.3.1

Biochemical Pathways

Metabolism denotes the entirety of biochemical reactions that occur within a cell (Figure 1.4). In the past century, many of these reactions have been organized into metabolic pathways. Each pathway consists of a sequence of chemical reactions that are catalyzed by speciﬁc enzymes, and the outcome of one reaction is the input for the next one. Unraveling the individual enzymatic reactions was one of the big successes of applying biochemical methods to cellular processes. Metabolic pathways can be divided into two broad types. Catabolic pathways disintegrate complex molecules into simpler ones, which can be reused for synthesizing other molecules. Also, catabolic pathways provide chemical energy required for many cellular processes. This energy may be stored temporarily as high-energy phosphates (primarily in ATP) or as high-energy electrons (primarily in NADPH). Conversely, anabolic pathways synthesize more complex substances from simpler starting reagents by utilizing the chemical energy generated by exergonic catabolic pathways.

ATP

AMP

Purine biosynthesis

ADP Adenosine

dATP dADP Adenine GTP GDP Guanosine

GMP

dGTP dGDP Guanine

Histidine

Vitamin C

Pentose interconversions

Glucuronate metabolism

Histidine metabolism

Aspartate amino acid group synthesis

Serine Methionine

Glucuronate

Cysteine

Inositol metabolism

Tryptophan Ribulose-5P

Lactose

Inositol

Aromatic amino acid

Fructose

Galactose

Pentose phosphate pathway

Threonine

Asparagine

Asparate

Lysine

Porphyrin and corrinoids metabolism

Tyrosine Fumarate

Synthesis

Cellulose and sucrose

Haemoglobin

Succinate Oxaloacetate

Sucrose

Vitamin B 12 Cytochromes

2-oxo glutarate

Cellulose

Glucose-1P

Metabolism

Glucose-6P

Glucose

Starch Amylose Glycogen and glycogen metabolism Glycine

Fructose-6P

GAD-3P

Glycolysis and gluconeo genesis

Pyruvate Acetyl-CoA Pyruvate decarbox.

Glucosamin

Lactate

Chitin

Ethanol

Amino sugars metabolism Serine

Fats

Chlorophyll

Glutamine

Glutamate amino acid group synthesis

Pyruvate metabolism

Glutamate

Alanine

Proline

Small amino acid synthesis Urea

Valine Threonine

Fatty acid

Uracil

Leucine Iso-Leucine

Branched amino acid synthesis

Figure 1.4 Major metabolic pathways.

Uridine

Fatty acid metabolism

Cytidine

Arginine

Urea cycle

Aspartate

UTP

Thymin dTTP

Cytosine

CTP

CDP

dCTP

Pyrimidine synthesis

10

Principles of Computational Cell Biology

The traditional biochemical pathways were often derived from studying simple organisms where these pathways constitute a dominating part of the metabolic activity. For example, the glycolysis pathway was discovered in yeast (and in muscle) in the 1930s. It describes the disassembly of the nutrient glucose that is taken up by many microorganisms from the outside. Figure 1.5 shows the glycolysis pathway in Homo sapiens as represented in the KEGG database (Kanehisa et al. 2016). Nucleotide sugars metabolism

Glycolysis

Pentose and glucuronate interconversions 2.7.1.41

Starch and sucrose metabolism

α-D-Glucose-1P

3.1.3.10

5.4.2.2

Galactose metabolism 3.1.3.9 α-D-Glucose

D-Glucose

2.7.1.69

(extracellular)

2.7.1.1 2.7.1.2

3.1.6.3

α-D-Glucose-6P

2.7.1.63 D-Glucose

5.1.3.3

6-sulfate

5.1.3.15 5.3.1.9

(aerobic decarboxylation)

5.3.1.9

2.7.1.2 3.1.6.3

β-D-Glucose

β-D-Fructose-6P

5.3.1.9 β-D-Glucose-6P

2.7.1.1 2.7.1.63

3.1.3.11

Arbutin-6P

Arbutin (extracellular) Salicin (extracellular)

2.7.1.69

3.2.1.86

2.7.1.69

3.2.1.86

Fructose and mannose metabolism

β-D-Fructose-1,6P2 4.1.2.13

Salicin-6P

Carbon fixation in photosynthetic organisms

Pentose phosphate pathway

2.7.1.11

Glyceraldehyde-3P

5.3.1.1 Glycerone-P Galactose metabolism Glycerolipid metabolism

4.6.1.–

Glycerate-1,3P2 3.6.1.7

Cyclic glycerate-2,3P2

1.2.1.12 5.4.2.4

Glycerate-2,3P2

2.7.2.3 5.4.2.4

3.1.3.13 Glycerate-3P 5.4.2.1 2.7.2.–

Gluconeogenesis

Thiamine metabolism

Glycerate-2P 4.2.1.11

Aminophosphonate metabolism

Photosynthesis

Citrate cycle Pyruvate metabolism

Phosphoenolpyruvate 2.7.1.40

Tryptophan metabolism

Lysine biosynthesis 1.2.1.51

Acetyl-CoA Synthesis and degradation of ketone bodies 6.2.1.1

Phe,Tyr, and Trp biosynthesis

1.2.4.1

2.3.1.12

2-Hydroxyethyl-ThPP 1.2.4.1

S-Acetyldihydrolipoamide-E 1.8.1.4 Lipoamide-E 4.1.1.1 Dihydrolipoamide-E Ethanol

Acetate

1.1.1.27 Pyruvate

ThPP

4.1.1.1

Propanoate metabolism C5-Branched dibasic acid metabolism Butanoate metabolism Pantothenate and CoA biosynthesis

1.1.1.1 1.1.1.2

L-Lactate

Acetaldehyde

Alanine and aspartate metabolism

1.1.1.71 1.2.1.3 1.2.1.5

1.1.99.8

D-Alanine

metabolism

Tyrosine metabolism

00010 8/6/07

Figure 1.5 The glycolysis pathway as visualized in the KEGG database is connected to many other cellular pathways. Source: From http://www.genome.ad.jp/kegg.

Networks in Biological Cells

1.3.2

Enzymatic Reactions

Enzymes are proteins that catalyze biochemical reactions so that they proceed much faster than in aqueous solution, e.g. by factors of many thousands to billions of times. As is the case for any catalyst, the enzyme remains intact after the reaction is complete and can therefore continue to function. Enzymes reduce the activation energy of a reaction, but this aﬀects forward reaction and backward reaction in the same manner. Hence, the relative free energy diﬀerence and the equilibrium between the products and reagents are not aﬀected. Compared to other catalysts, enzymatic reactions are carried out in a highly stereo-, regio-, and chemoselective and speciﬁc manner. For the binding reaction P + L ↔ PL of a protein P and a ligand L, the binding constant k d : kd =

[P] ⋅ [L] [PL]

determines how much of the ligand concentration [L] is bound by the protein (with concentration [P]) under equilibrium conditions. [PL] is the concentration of the protein:ligand complex. The binding constant has the unit M. In the case of a “nanomolar inhibitor,” for example, where a blocking ligand binds to a protein with a k d in the order of 10−9 M, the product of the concentrations of free protein and of free ligand is 109 times smaller than the concentration of the protein–ligand complex. Thus, the equilibrium is very strongly shifted to the complexed form, and only a few free ligand molecules exist. The binding constant k d is also the ratio of the kinetic rates for the backward and forward reactions, k oﬀ and k on . The units of the two kinetic rates are M−1 s−1 for the forward reaction and s−1 for the backward reaction. Understanding the ﬁne details of enzymatic reactions is one of the main branches of biochemistry. Fortunately, in the context of cellular simulations, we need not be interested with the enzymatic mechanisms themselves. Here, instead, it is important to characterize the chemical diversity of the substrates a particular enzyme can turn over and to collect the thermodynamic and kinetic constants of all relevant catalytic and binding reactions. A rigorous system to classify enzymatic function is the Enzyme Classiﬁcation (EC) scheme. It contains four major categories, each divided into three hierarchies of subclassiﬁcations.

1.3.3

Signal Transduction

Here, we denote by signal transduction the transmission of a chemical signal such as phosphorylation of a target amino acid. Signal transduction is a very important subdiscipline of cell biology. Hundreds of working groups are looking at separate aspects of signal transduction, and large research consortia such as the Alliance of Cell Signaling have been formed in the past. In humans, about 70% of all proteins get phosphorylated at speciﬁc residues in certain conditions. Many proteins can be phosphorylated multiple times at diﬀerent amino acids. A phosphorylation step often characterizes a transition between active and

11

12

Principles of Computational Cell Biology

inactive states. The fraction of phosphorylated versus unphosphorylated proteins can be detected experimentally by mass spectrometry on a genome-wide level. 1.3.4

Cell Cycle

The cell cycle describes a series of processes in a prokaryotic or eukaryotic cell that leads from one cell division to the next one. The cell cycle is regulated by two types of proteins termed cyclins and cyclin-dependent kinases. In 2001, the Nobel Prize in Physiology or Medicine was awarded to Leland H. Hartwell, R. Timothy Hunt, and Paul M. Nurse who discovered these central molecules. Broadly speaking, a cell cycle can be grouped into three stages termed interphase, mitosis, and cytokinesis. These can be further split into the following: • The G0 phase. This is a resting phase outside the regular “cell cycle” where the cells exist in a quiescent state. • The G1 phase. This is the ﬁrst growth phase for the cell. • The S phase for the “synthesis” of DNA. In this phase, the cellular DNA is replicated to secure the hereditary information for the future daughter cells. • The G2 phase is the second growth phase. This is also a preparation phase for the subsequent cell division. • The M phase or mitosis and cytokinesis cover the processes to divide the cell into two daughter cells. There exist several surveillance points, the so-called checkpoints, when the cell is inspected for potential DNA damage or for lacking ability to perform critical cellular processes. If certain conditions are not fulﬁlled, checkpoints may prevent transitioning to the next state of the cell cycle. We will see in Chapter 15 how cellular processes may dynamically regulate each other. In Section 15.2, we will discuss an integrated computational model that simulated the nine-minute long cell cycle of the simple organism Mycoplasma genitalium almost in molecular detail. Very important for the cell cycle are phosphorylation reactions of the central cell cycle regulators.

1.4 Ontologies and Databases 1.4.1

Ontologies

“Ontology” is a term from philosophy and describes a structured controlled vocabulary. Why have ontologies nowadays become of particular importance in biological and medical sciences? The main reason is that, historically, biologists worked in separate camps, each on a particular organism, and each camp discovered a gene after gene, protein after protein. Because of this separation, every subﬁeld started using its own terminology. These early researchers did not know that, at a later stage, biologists wished to compare diﬀerent organisms to transfer useful information from one to the other in a process termed annotation. Thus, proteins deriving from the same ancestor may have been given completely diﬀerent names.

Networks in Biological Cells

It would require many years of intensive study for anyone of us to learn these associations. Instead, researchers have realized quite early that it would be extremely useful to generate general electronic repositories for classiﬁcation schemes that connect the corresponding genes and proteins belonging to diﬀerent organisms and that provide access to functional annotations. 1.4.2

Gene Ontology

One of the most important projects in the area of ontologies is the gene ontology (GO) (www.geneontology.org). This collaborative project started in 1998 as a collaboration of three databases dealing with model organisms, FlyBase (Drosophila), the Saccharomyces Genome Database (SGD), and the Mouse Genome Database (MGD). In the meantime, many other organizations have joined this consortium. In the GO project, gene products are associated with molecular functions, biological processes, and cellular components where they are expressed in a species-dependent manner. A gene product may be connected to one or more cellular components; it may be involved in one or more biological processes, during which it executes one or more molecular functions. GO has become widely used together with the analyses of diﬀerential gene expression or enriched pathways. We will revisit the gene ontology in Section 8.6. 1.4.3

Kyoto Encyclopedia of Genes and Genomes

Initiated in 1995, the Kyoto Encyclopedia of Genes and Genomes (KEGG) is an integrated bioinformatics resource consisting of three types of databases for genomic, chemical, and network information (http://www.genome.jp/kegg). KEGG consists of three graph objects called the gene universe (GENES, SSDB, and KEGG Orthology databases that contain more than 14 million genes from 280 eukaryotic, 2800 bacterial, and 171 archaeal genomes), the chemical universe (COMPOUND, GLYCAN, and REACTION databases that contain more than 17.000 chemical compounds and more than 9.700 reactions), and the protein network (PATHWAY database) (Table 1.2). The gene universe is a conceptual graph object representing ortholog/paralog relations, operon information, and other relationships between genes in all the completely sequenced genomes. The chemical universe is another conceptual graph object representing chemical reactions and structural/functional relations among metabolites and other biochemical compounds. The protein network is based on biological phenomena, representing known molecular interaction networks in various cellular processes. 1.4.4

Reactome

REACTOME (reactome.org) is a pathway database. At the moment, it focuses on human pathways and provides links to the NCBI Entrez Gene, Ensembl, and UniProt databases; the UCSC and HapMap Genome Browsers; the KEGG Compound and ChEBI small-molecule databases, PubMed, and Gene Ontology. Molecular interaction data can be overlayed from the Reactome Functional

13

14

Principles of Computational Cell Biology

Table 1.2 The three graph objects in KEGG. Graph

Vertex

Edge

Main databases

Gene universe

Gene

Any association of genes (ortholog/paralog relation, sequence/structural similarity, adjacency on chromosome, expression similarity)

GENES, SSDB, KO

Chemical universe

Chemical compound (including carbohydrate)

Any association of compounds (chemical reactivity, structural similarity, etc.)

COMPOUNDS, GLYCAN, REACTION

Protein network

Protein (including other gene products)

Known interaction/relation of proteins (direct protein–protein interaction, gene expression relation, enzyme–enzyme relation)

PATHWAY

Source: After Kanehisa et al. (2016).

Interaction Network and from external databases. Reactome also provides data on gene expression and supports overrepresentation analysis of functional terms. It is worth noting that diﬀerent databases have been developed according to diﬀerent philosophies and provide diﬀerent coverage. Stobbe and coworkers recently compared ﬁve diﬀerent databases including KEGG and Reactome and found signiﬁcant diﬀerences (Stobbe et al. 2014). The considerable ﬁnancial pressure of maintaining such databases will decide in the long run, which resources will survive. 1.4.5

Brenda

Since 1987, the Brenda resource (www.brenda-enzymes.org) has been developed in the group of Dietmar Schomburg. As of 2007, it is hosted at the Technical University Braunschweig/Germany. Brenda is a comprehensive information system on enzymatic reactions (Table 1.3). Data on enzyme function are manually extracted from the primary literature. One may wonder whether all this detail is required by a computational cell biologist analyzing the network capacities of a particular organism. In some ways no, in other ways yes. No, if you only want to analyze the pathway space (Chapter 12). Yes, if you are interested in particular reaction rates or in modeling time-dependent processes (Chapter 13). Computer scientists among the readers of this text should be aware that the rates of biochemical reactions vary signiﬁcantly with temperature and pH and may even change their directions. 1.4.6

DAVID

The DAVID tool developed at the National Institute of Allergy and Infectious Diseases (NIAID, an institute of the NIH) has become a popular and

Networks in Biological Cells

Table 1.3 Information stored in the BRENDA system for individual biochemical reactions. Nomenclature

Enzyme names, EC number, common/recommended name, systematic name, synonyms, CAS registry number

Reaction and speciﬁcity

Pathway, catalyzed reaction, reaction type, natural and unnatural substrates and products, inhibitors, cofactors, metals/ions, activating compounds, ligands

Functional parameters

K m value, K i value, pI value, turnover number, speciﬁc activity, pH optimum, pH range, temperature optimum, temperature range

Isolation and preparation

Puriﬁcation, cloned, renatured, crystallization

Organism-related information

Organism, source tissue, localization

Stability

Stability with respect to pH, temperature, oxidation, and storage; stability in organic solvent

Enzyme structure

Links to sequence/SwissProt entry, 3D-structure/PDB entry, molecular weight, subunits, posttranslational modiﬁcation

Disease

Disease

user-friendly web service (david.abcc.ncifcrf.gov). With respect to annotating the function of genes, it supports enrichment analysis of gene annotations, clustering of functional annotations, mapping to BioCarta and KEGG pathways, analyzing the association of genes to diseases, and more. It also provides tools to organize long lists of genes into functionally related groups of genes to help uncover the biological meaning of the data measured by high-throughput technologies. 1.4.7

Protein Data Bank

The Protein Data Bank (PDB, later renamed into RCSB, www.rcsb.org) was established in 1971 at the Brookhaven National Laboratory in the United States. It started with seven crystal structures of proteins. Since then, it has become the worldwide repository of information about the three-dimensional atomistic structures of large biological molecules. It currently holds more than 130 000 structures including proteins and nucleic acids. 1.4.8

Systems Biology Markup Language

The last item in this list is a programming language rather than a database. The systems biology markup language (SBML) has been formulated to allow the well-deﬁned construction of cellular reaction systems and allow exchange of simulation models between diﬀerent simulation packages. The idea is to be able to interface models of diﬀerent resolution and detail. Cell simulation methods usually import and export (sub)cellular models in SMBL language. SBML builds on the XML standard, which stands for eXtensible Markup Language.

15

Table 1.4 Mathematical techniques used in computational cell biology that are covered in this book.

Mathematical concept

Object of investigation

Analysis of complexity

Time dependent

Treated in chapter numbers of this book

Mathematical graphs

Protein–protein networks, protein complexes, gene regulatory networks

Yes

No

5, 6, 9, 10

Yes (count number of possible paths that connect two metabolites)

No

12

networksa)

Stoichiometric analysis, matrix algebra

Metabolic

Diﬀerential equations

Signal transduction, energy transduction, gene regulatory networks

Equations of motion

Individual proteins, protein complexes

Correlation functions, Fourier transformation

Reconstruction of two- and three-dimensional structures of cellular structures and individual molecules

Statistical tests

Machine learning (linear regression, hidden Markov model) a)

Yes

9, 13

Yes

14, 15

No

Yes, when applied on timedependent data

2

Diﬀerential expression and methylation; enriched network motifs

No

Yes, when applied on timedependent data

8, 9, 10

Predict gene expression, classify chromatin states

No

No

8, 11

May also be applied to gene regulatory networks and signal transduction networks.

No

Networks in Biological Cells

XML is similar to the HTML language that is used to design websites. The European Bioinformatics Institute (EBI) provides a compilation of hundreds of biological models mostly underlying published work at http://www.ebi.ac.uk/ biomodels-main.

1.5 Methods for Cellular Modeling Table 1.4 presents an overview of the methods in cellular modeling that are covered in this book.

1.6 Summary This introductory chapter took a ﬁrst look at the cellular components that will be the objects of computational and mathematical analysis in the rest of the book. Obviously, it was not intended to provide a rigorous introduction, but rather to whet the appetite of the reader without spending too much time on subjects that many readers will be very familiar with. We have seen that the central paradigms of molecular biology (a linear information ﬂow from DNA → RNA → proteins) and cellular biochemistry (grouping of biochemical reactions into major pathways) are being challenged by new discoveries on the roles of small RNA snippets, and by the discovery of highly interconnected hub proteins and metabolites that seem to connect almost “everything to everything.” This is one reason why mathematical and computational analysis is needed to keep the overview over all of the data being generated and to deepen our understanding about cellular processes.

1.7 Problems 1. Compare the glycolysis pathways of yeast and Escherichia coli. Open with a web browser of your choice, the web portals of KEGG (www .genome.jp/kegg) and REACTOME (www.reactome.org). Find the glycolysis pathways of S. cerevisiae and E. coli and compare them. 2. Extract details on enzymatic reactions from the BRENDA database. Go to www.brenda-enzymes.org. Type in “glucose-6-phosphate isomerase” as one of the central enzymes of the glycolysis pathway. The EC number of this enzyme is 5.3.1.9. It interconverts d-glucose 6-phosphate into d-fructose 6-phosphate and can do this in both directions. Browse the information collected on the properties of this enzyme in a large number of organisms. Note that the optimal pH for this enzyme ranges from 3 in Lactobacillus casei to 9.5 in Pisum sativum and that the temperature optimum ranges from 22 ∘ C in Cricetulus griseus to 100 ∘ C in Pyrobaculum aerophilum. We will leave the understanding how this amazing variability is

17

18

Principles of Computational Cell Biology

achieved through variation of the protein sequence to the ﬁeld of enzymology. Interestingly, the turnover number of this enzyme (how many molecules of d-glucose 6-phosphate or d-fructose 6-phosphate react at a single GPI enzyme per second) ranges from 0.0003 per second in Thermococcus litoralis to 650 per second in human if d-fructose 6-phosphate is the substrate and from 6.2 per second in Pyrococcus furiosus to 1700 per second in human if d-glucose 6-phosphate is the substrate. These rate constants are important parameters for modeling time-dependent behavior of metabolic networks and are thus also of relevance for this book. 3. Find protein interaction partners of GPI in yeast. Go to the web portal pre-PPI (https://bhapp.c2b2.columbia.edu/PrePPI) and enter the UNIPROT identiﬁer P06744 for human “glucose-6phosphate isomerase.” Find the predicted interactions of GPI with other human proteins. The top hit with the probability 0.99 is ATP-dependent 6-phosphofructokinase. Explore the list. 4. Discover consequences of GPI mutations in human. Go to the OMIM database (www.omim.org) and enter “glucose-6phosphate isomerase.” Click the top entry “172400” on the next list and scroll to “allelic variants.” Apparently, diﬀerent mutations have been identiﬁed in the GPI enzyme of various patients that all led to “hemolytic anemia.”

Bibliography Small-World Networks, Scale-Free Networks Barabási, A.L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286: 509–512. Watts, D.J. and Strogatz, S.H. (1998). Collective dynamics of ‘small-world’-networks. Nature 393: 409–410.

Gene Regulatory Networks Babu, M.M., Luscomebe, N.M., Aravind, L. et al. (2004). Structure and evolution of transcriptional regulatory networks. Current Opinion in Structural Biology 14: 283–291.

ENCODE The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74.

Networks in Biological Cells

FANTOM Consortium Lizio, M., Harshbarger, J., Shimoji, H. et al. (2015). Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biology 16: 22.

The KEGG Database Kanehisa, M., Sato, Y., Kawashima, M. et al. (2016). KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research 44: D457–D462.

Brenda Database Schomburg, I., Jeske, L., Ulbrich, M. et al. (2017). The BRENDA enzyme information system-From a database to an expert system. Journal of Biotechnology 261: 194–206.

Pathway Databases Stobbe, M.D., Jansen, G.A., Moerland, P.D., and van Kampen, A.H.C. (2014). Knowledge representation in metabolic pathway databases. Brieﬁngs in Bioinformatics 15: 455–470.

DAVID Dennis, G. Jr., Sherman, B.T., Hosack, D.A. et al. (2003). DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology 4: P3.

19

21

2 Structures of Protein Complexes and Subcellular Structures The ﬁrst chapter introduced proteins as some of the key players of biological cells. Many of them are enzymes; others are membrane transporters or structural proteins that provide cell stability and motility. Enzymes are typically highly specialized “experts” that catalyze one particular biochemical reaction with high catalytic rate and speciﬁcity. The functional dependencies between them lead to the formation of biochemical pathways as one way of cellular organization where several enzymes are chained one after the other, catalyzing several reactions on one molecule. It has been shown that some enzymes physically bind to other enzymes from the same pathway to form multienzyme complexes. This has the advantage that, in such cases, the diﬀusion of the substrate is not rate-limiting and also the diﬀusion of toxic or unstable intermediates is reduced. One well-documented example is the de novo synthesis of purines in eukaryotes involving 10 chemical reactions that are catalyzed by 6 enzymes. As revealed by ﬂuorescence microscopy of HeLa cells, all the six enzymes organize into clusters in the cellular cytoplasm (An et al. 2008). Alternatively, there exists a mechanism termed cluster-mediated channeling where, rather than physical coordination between active sites, only colocalization of sequential enzymes seems to be enough to promote metabolic eﬃciency (Schmitt and An 2017). It is now well accepted that the formation of protein complexes involving from 2 up to 100 proteins is an important hierarchical element in cells. Many cellular functions are mediated by the structural association of separate proteins and their coordinated activities, not by random diﬀusion and transient associations. It is believed that protein complexes assemble in a particular order. Correct assembly often requires energy-driven conformational changes, the involvement of chaperone proteins, and the presence of particular post-translational modiﬁcations. Diﬀerent cellular requirements may lead to diﬀerent compositions of protein complexes. To explain why complexes have evolved as an important form of cellular organization, we must resort to speculating that organizing proteins into large complexes must have functional advantages for the cell. We have already mentioned several of such complexes in Section 1.2, such as the ribosome formed by 80 proteins and rRNA and RNA polymerase I formed by 10 proteins. We will now re-encounter these and a few other well-known examples also involving complexes where multiple enzymes of a biochemical pathway are permanently arranged into a large complex. Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

22

Principles of Computational Cell Biology

2.1 Examples of Protein Complexes The ﬁrst three examples show three protein complexes that execute the central processing from DNA over RNAs to proteins. The RNA polymerase shown in Figure 2.1 is the key enzyme in gene transcription and synthesizes a copy of messenger RNA (mRNA) from a DNA template. For determining the three-dimensional atomic structure of RNA polymerase, Roger Kornberg was awarded the 2006 Nobel Prize in Chemistry. Spliceosome is a multicomponent macromolecular machine that is responsible for the splicing of pre-mRNA. Contacts with substrate mRNA fragments are formed by complementary RNA components of spliceosome. Figure 2.2 shows the structure of a precatalytic human spliceosome primed for activation that was obtained by cryo-electron microscopy. Comparison of crystal structures representing various stages of the catalytic cycle of spliceosome illustrates the large-scale conformational transitions of its domains that are the basis of various splicing steps.

Figure 2.1 RNA polymerase II is the central enzyme of gene expression. It synthesizes all mRNA in eukaryotes. Shown is a cartoon view of the high-resolution structure of the RNA polymerase II-mediator core transcription initiation complex from Saccharomyces cerevisiae determined by Plaschka et al. (2015). The bound template DNA strand is colored in red. Figure was generated from PDB entry 4V1M with NGL Viewer software 2015 (Rose and Hildebrand, 2015).

Structures of Protein Complexes and Subcellular Structures

′

Figure 2.2 Spliceosome: Spliceosome is a cellular “editor” that “cuts and pastes” the ﬁrst draft of RNA straight after it is formed from its DNA template. Shown is an atomic resolution structure determined by cryo-electron of the human spliceosome that is primed for activation. Source: Bertram et al. (2017). Reprinted with permission of Elsevier.

The ribonucleoprotein called the ribosome carries out the ﬁnal step of gene expression where the genomic information encoded in mRNAs is translated into protein sequences. About two-thirds of the prokaryotic ribosome is composed of RNA and one-third of protein. The ribosome is made up of two subunits, with the larger one (sedimentation at 50S in prokaryotes) being about twice as large as the small subunit. The large subunit shown in Figure 2.3 catalyzes the peptide bond formation. For determining the three-dimensional atomic structure of the ribosome, Venkatraman Ramakrishnan, Thomas A. Steitz, and Ada E. Yonath were awarded the 2009 Nobel Prize in Chemistry. The next example in Figure 2.4 shows the Arp2/3 complex formed by structural proteins binding to the cytoskeleton. We will re-encounter this macromolecular complex in Section 2.11. The apoptosome (Figure 2.5) is a key component of the process of induced cell death, termed apoptosis. It has an unexpected sevenfold symmetry. The formation of the apoptosome complex is initiated when cytochrome c is released from the mitochondria following a cell death stimulus.

23

24

Principles of Computational Cell Biology

CP L7/L12 stalk L1

Figure 2.3 The ribosome: Model of the large ribosomal subunit from Haloarcula marismortui. The protein is shown so that the surface of the subunit interacting with the small subunit faces the reader. RNA is shown in gray and the protein backbone in yellow. Source: Ban et al. (2000). Reprinted with permission of AAAS.

Figure 2.6 shows an example of a large multienzyme protein complex pyruvate dehydrogenase. Figure 2.7 shows a homodimer of the transcription factor Oct4 bound to DNA. 2.1.1

Principles of Protein–Protein Interactions

In biological systems, proteins rarely act alone. Instead, they often bind to other biomolecules to exert their cellular functions. In many cases, the binding partners are other proteins so that the proteins form homo- and heterodimers and oligomers. Dimerization or oligomerization of proteins may provide them with diverse structural and functional advantages such as an enhanced thermal stability or modifying the accessibility and binding speciﬁcity of their active sites. Figure 2.8 illustrates the main functional consequences of dimerization and oligomerization. An important aspect of protein–protein interactions is their high speciﬁcity due to conformational and physicochemical properties of the involved proteins. Because of the rather crowded environment in vivo, where about 30% of the cellular volume is taken up by proteins, proteins constantly “bump” into other proteins. In order to correctly fulﬁll their functions, they need to be highly speciﬁc in recognizing their biological partners and binding to them. This is the case, for example, for hormone-receptor and enzyme-inhibitor complexes.

Structures of Protein Complexes and Subcellular Structures

Figure 2.4 Arp 2/3 complex: The seven-subunit Arp2/3 complex choreographs the formation of branched actin networks at the leading edge of migrating cells. Figure generated from PDB entry 1K8K with NGL Viewer software. Figure 2.5 The human apoptosome. Source: Yu et al. (2005). Reprinted with permission of Elsevier. Cyt c

Arm

7 Hub

Regulatory region

All other unwanted nonspeciﬁc contacts should be short lived in comparison. An early study by Jones and Thornton (1996) characterized interfaces in terms of residue propensities, conservation, interaction propensities, protrusions, and planarity. These structural details of interfaces will be discussed in more detail in Chapter 3.

25

26

Principles of Computational Cell Biology

Figure 2.6 Pyruvate dehydrogenase is a huge multienzyme complex comprising 60 copies each of dihydrolipoyl acetyltransferase (E2), pyruvate decarboxylase (E1), and dihydrolipoyl dehydrogenase (E3), as well as binding proteins and regulatory kinases and phosphatases. The picture shows an atomic representation of the complete E1E2 complex involving the E2 catalytic and peripheral subunit-binding domains shown in green and in red as well as the E1 α2 β2 tetramers shown in blue (Milne et al. 2002). The positions of the E1 α2 β2 tetramers and the peripheral subunit-binding domains were determined by core-weighted ﬁtting to the density of the E1E2 complex. For visual clarity, only the back half of the model is presented. Source: Milne et al. (2002). Reprinted with permission of John Wiley & Sons.

α2 α1

α1 α4

α3 α2

α3 α5

α2 α5 α3 α1

α3 α4 α1

α2

Figure 2.7 Schematic representation of the Oct4 homodimer bound to DNA. Oct4 is colored by domain (POUS in green and POUHD in blue). The linker is highlighted in red. Residues that were not visible in the electron density map are represented by a dotted line. The DNA is shown in yellow. Source: Esch et al. (2013). Reprinted with permission of Springer Nature.

Structures of Protein Complexes and Subcellular Structures

(c)

(a)

(b)

(d)

Figure 2.8 Functional consequences of dimerization and oligomerization. (a) Concentration, stability, and assembly. (b) Cooperation and allostery. (c) Modiﬁcation of the active site. (d) Dimerization yields increased diversity in the formation of regulatory complexes.

2.1.2

Categories of Protein Complexes

How can one best categorize protein complexes? Some possible classiﬁcations would be categorizing them by their function, by their size, or by their nonprotein components as protein complexes may also involve various other components involving nucleic acids, carbohydrates, or lipids. A mechanistic classiﬁcation scheme distinguishes transient complexes (enzyme-inhibitor and signal transduction) from stable/permanent complexes with long lifetimes compared to those of typical biochemical processes. Obviously, obtaining structural information on transient complexes is much harder than for permanent complexes. One may also distinguish obligate and nonobligate complexes where the components of obligate complexes function only when they are in the bound state, whereas those of nonobligate complexes can also exist as monomers. Examples of the latter class are antibodies that exist in free form in the cell until a suitable antigen target appears and signaling complexes. Another example is the nonobligate assembly of RNA polymerase with diﬀerent initiation and elongation factor proteins. Although all components also exist stably in the unbound state, each assembly modulates the function of RNA polymerase in a diﬀerent manner.

27

28

Principles of Computational Cell Biology

Similar to individual proteins, complexes do not need to be present in cells during the entire cell cycle, but only during phases when they are required. However, what is the “resting state” of a complex? Do the complexes exist as half-formed entities or are their components only expressed when needed for some job followed by their immediate assembly? The latter scenario would require a quite elaborate and eﬃcient synthesis-and-assembly machinery. This question could be answered by analyzing the gene expression data obtained at various stages of the cell cycle of yeast that revealed which proteins oscillate with the frequency of the cell cycle (de Lichtenberg et al. 2005). When overlapping this set with the set of high-quality complexes, almost all complexes known at that time were not fully assembled during the entire cell cycle. Most of them remain half-formed in the cell once the fully assembled complex is not required.

2.2 Complexome: The Ensemble of Protein Complexes The protein components of protein complexes can be grouped into three types: cores, attachments, and modules (Figure 2.9). Core proteins appear in a particular complex most of the time, whereas attachment proteins are less often part of a complex. Modules are sets of attachment proteins that jointly occur in diﬀerent complexes. 2.2.1

Complexome of Saccharomyces cerevisiae

In 2006, two research consortia published extensive surveys of the protein– protein interactions of the yeast S. cerevisiae. Gavin et al. (2006) applied tandem aﬃnity puriﬁcation (TAP) (Section 5.1.3) to all 6466 open reading frames. About 1993 TAP fusion proteins could be puriﬁed, of which 88% were retrieved with at least one partner bound. A “socioaﬃnity” index characterizes the propensity of proteins to bind other proteins by measuring the log-odds ratio of the number of cases when two proteins occur together over what is expected based on the frequencies in the data set. This analysis resulted in a list of 491 unique multiprotein complexes involving “core” and “attachment” proteins. The cores of complexes contained 1–23 proteins with an average of 3.1 proteins. The attachment proteins involved “module” combinations. On an average, one such set of proteins is associated with 3.3 cores. The terminology “module” will be used in a diﬀerent context in Chapter 6, where it will mostly describe the functionally related proteins. Assume “modules” as somehow tightly associated proteins. Seventy-four known complexes could not be identiﬁed in this study. Reasons for this could be either that these complexes do not form in the particular experimental conditions or that the used puriﬁcation tag prevented the assembly of the complexes. Taking into account the number of known complexes, the authors suggested the existence of 300 further core machines, so that the total number of core complexes in yeast may be around 800. When comparing the number of genes in yeast (6 500) and humans (21 000), a simple extrapolation shows

Structures of Protein Complexes and Subcellular Structures

Attachment

Module

134 134 250 1

2

100

350

100

230 300

45 600

Complex 2

Complex 1

Cores

Figure 2.9 Deﬁnition and terminology used to deﬁne protein complex architecture. “Core” components are present in most isoforms, whereas “attachments” are present in only one of them. “Modules” are a subclass of “attachments” where two or more proteins are always together and present in multiple complexes. Source: From Gavin et al. (2006).

that there may exist some 3 000 core complexes in human cells. However, these would only be the stable, tightly bound complexes. The total number of complexes including transient and weakly binding ones may still be much higher. A separate, parallel study (Krogan et al. 2006) essentially applied the same methodology to all open reading frames of yeast and obtained 2357 puriﬁcations. These were categorized into 547 distinct (nonoverlapping) heteromeric protein complexes. On an average, each protein formed 5.26 interactions. The number of interactions per protein followed an inverse power law distribution (Section 6.1.1), indicating a scale-free topology. As expected, the authors found a signiﬁcant amount of colocalization and semantic similarity (in terms of similarity of genome ontology annotation) among the members of particular complexes. It is quite encouraging that the two independent studies reported a similar number of successful puriﬁcations and an overall similar number of complexes. Because of the diﬀerent deﬁnitions of complexes, the “core” complexes of the Gavin et al. (2006) screen ended up being smaller on an average than those of Krogan et al. (2006). The CYC2008 data set manually curated by the Wodak group (http://wodaklab.org/cyc2008/) is a widely used reference set of 408 heteromeric protein complexes from S. cerevisiae, for which solid evidence exists from small-scale experiments (Pu et al. 2008). Considering that protein complexes are involved in many cellular processes, their total number is quite small. Reuse of functional “modules” in diﬀerent complexes is an

29

Principles of Computational Cell Biology

eﬃcient manner to increase the functional variability of proteins and to simplify regulating complex formation at correct times and locations. 2.2.2

Bacterial Protein Complexomes

We will now turn to the simpler world of prokaryotes, where assembly of multiple proteins into complexes is also well known. The database EcoCyc provides a set of 285 “gold-standard” protein complexes in Escherichia coli. Hu et al. measured protein interactions in E. coli on a large scale (Hu et al. 2009). Based on these data, a clustering algorithm predicted 443 protein complexes, most of which consist of two to four polypeptides. Compared to the E. coli strain MG1655 with about 4500 genes, the genome of Mycoplasma pneumoniae (only around 700 genes) is one of the shortest genomes among self-replicating organisms. Still, TAP-mass spectrometry (TAP-MS) applied to M. pneumoniae identiﬁed 62 homomeric and 116 heteromeric soluble protein complexes (Kühner et al. 2009). Hence, prokaryotes contain fewer protein complexes than yeast, but the number of complexes is not proportional to the size of their genomes. Figure 2.10 compares the inventory of protein complexes of the two bacterial species E. coli and M. pneumoniae (Cauﬁeld et al. 2015). Apparently, only few complexes with more than two members are conserved perfectly across species. Most complexes are fractionally conserved. A comparison between protein complexes of E. coli and S. cerevisiae (Reid et al. 2010) showed that the complexes must have evolved in quite distinct ways between both organisms. For example, the archaeal thermosome is a homomer of 16 identical subunits, but the related complex from eukaryotes, TRiC/CCT chaperonin, contains eight diﬀerent, yet homologous proteins, two copies of each. Figure 2.10 The count of complexes in two Escherichia coli complex data sets and one Mycoplasma pneumoniae data set, having the given number of unique protein components. Homodimers are not included. Source: Cauﬁeld et al. (2015). http://journals.plos .org/ploscompbiol/article?id=10 .1371/journal.pcbi.1004107. Licensed Under CC BY 4.0.

90 Number of complexes

30

60

30

0 2

6 7 8 9 5 Complex size (in number of unique proteins)

3

4

E. coli – EcoCyc E. coli – Hu et al. M. pneumoniae – Kühner et al.

≥10

Structures of Protein Complexes and Subcellular Structures

The subunit specialization occurred early in eukaryote evolution and is well conserved with nearly 60% sequence identity between mammalian and yeast subunits of the same type. The experimentally determined three-dimensional structure of thermosome from Thermococcus sp. has been used as a template for homology modeling of the eukaryotic TRiC complex (Kalisman et al. 2012). 2.2.3

Complexome of Human

A widely used reference compendium on protein complexes in humans, rats, and mice is CORUM (http://mips.helmholtz-muenchen.de/corum/). It currently provides about 3000 diﬀerent protein complexes that are involved in practically all cellular processes ranging from metabolism, energy generation, cell cycle control and DNA processing, signal transduction in the regulation of cell fate, diﬀerentiation, and development. A comparison of protein complexes from yeast and human inspected 5960 protein pairs that belong to the same complex in human (van Dam and Snel 2008). In 2216 cases, both human proteins did not have an ortholog in S. cerevisiae. In 1828 cases, one of the two proteins lacks an ortholog. Of the remaining 1916 protein pairs, only 10% were never copuriﬁed in a large-scale experiment. This suggest that joint membership in a complex is conserved between human and S. cerevisiae during evolution at a level of about 90% as long as the respective genes are also conserved. As an example, we mention a comparative analysis of the two chromatin remodeling complexes SWI/SNF and RSC in Schizosaccharomyces pombe, S. cerevisiae, and human using data from aﬃnity puriﬁcation (Monahan et al. 2008). SWI/SNF plays a role in transcriptional activation, telomeric and rDNA silencing, and DNA repair. RSC regulates the activity of RNA polymerases and has important functions during the cell cycle. Thus, SWI/SNF and RSC are involved in a large part of all chromatin-related processes. Although the SWI/SNF and RSC complexes of S. pombe have six proteins in common, only three proteins are shared between the respective complexes in S. cerevisiae. For example, SWI/SNF and RSC have two Arp proteins in common. In S. cerevisiae, they are essential or almost essential for cell viability. Yet, in S. pombe, growth does not depend on them. Both S. pombe complexes contain Ssr1 and Ssr2. They are related to the Arp-proteins BAF and PBAF in human. Hence, there exists a conserved core complex that is completed by diﬀering components in the three species (Table 2.1).

2.3 Experimental Determination of Three-Dimensional Structures of Protein Complexes This subsection will brieﬂy introduce several important experimental methods that are able to generate three-dimensional structural data on protein complexes such as X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy. Atomic protein structures are made available to the scientiﬁc community via the Protein Data Bank (PDB) (Section 1.4.7).

31

32

Principles of Computational Cell Biology

Table 2.1 Composition of Schizosaccharomyces pombe SWI/SNF and RSC complexes compared to those of Saccharomyces cerevisiae and human. Schizosaccharomyces pombe

Saccharomyces cerevisiae

Human

SWI/SNF

RSC

SWI/SNF

RSC

BAF

PBAF

Snf22

Snf21

Snf2

Sth1

BRG1 or BRM

BRG1

Sol1

Swi1

Snf5

BAF250

Sfh1

Snf5

Sfh1

SNF5

SNF5

Ssr1, Ssr2

Ssr1, Ssr2

Swi3

Rsc8

BAF170, BAF155

BAF170, BAF155

Ssr3

Ssr3

Snf12

Rsc6

BAF60a

BAF60a or BAF60b

Ssr4

Ssr4

Arp42

Arp42

BAF53

BAF53

Arp9

Arp9

Arp9

Arp9

Arp7

Arp7 Actin

Actin

Tfg3

Taf14 Rsc1

Rsc1 or Rsc2

Rsc4

Rsc4

Rsc9

Rsc9

Rsc58 Snf59

Rsc7

BAF180

Rsc58 Swp82

Rsc7

Rtt102

Rtt102

Snf30 BAF57

BAF57

Snf11 Snf6 Rsc3 Rsc30 Ldb7 Htl1 Source: Monahan et al. (2008). Reprinted with permission of Springer Nature.

2.3.1

X-ray Crystallography

X-ray crystallography is the most widely used method for determining the structures of biomolecules. Successful structure determination yields very accurate structures of molecular complexes, from small molecules up to very large complexes such as the ribosome or viral capsids (Section 2.1). The basic principles of X-ray crystallography are shown in Figure 2.11. In the early days of crystallography, reconstructing the molecular structure of the target molecule in the crystal was a huge numerical task. Modern software packages now allow processing the experimental data and performing

Structures of Protein Complexes and Subcellular Structures

High voltage Crystal

X-ray beam

Lead screen Photographic plate

Figure 2.11 X-rays are electromagnetic waves in the ultrashort (“hard”) regime with wavelengths on the order of 0.1 nm. When they hit a sample, the electromagnetic X-rays undergo weak interactions with the electron clouds around the atomic nuclei, which lead to partial diﬀraction of the incoming beam into diﬀerent angles. As the interaction is quite weak, a noticeable diﬀraction intensity can only be detected in orientations where the diﬀracted beams from many molecules sum up in a constructive way. Here, we need to appreciate that electromagnetic waves are sinusoidal waves that may be described by an amplitude and phase. Therefore, intensities are only detected in those orientations where the path diﬀerence of waves originating from diﬀerent molecules equals integer multiples of their phases. This requires, ﬁrst of all, a very ordered orientation of all molecules as in a three-dimensional crystal. Still, in almost all orientations, the overlap of various waves will not be constructive. Images on the photographic plate (or charge-coupled display detector) are recorded for various rotational orientations of the crystal. Structure determination involves reconstruction of the molecular structure of the target molecule that will give rise to the observed reﬂections. The numerical methods mostly involve Fourier transformation. A crystallographic structure determination ultimately reveals contours of the electron density. Atomic models are then reﬁned using this electron density and information about typical bond lengths and bond angles of chemical bonds between atoms.

the structural modeling fairly automatically within a few hours of computing time. The two remaining bottlenecks of X-ray protein crystallography are (i) puriﬁcation of milligram quantities of the protein(s) and (ii) ﬁnding appropriate crystallization conditions where the proteins will assemble into three-dimensional crystals. Structural genomics is the name for a series of large-scale research initiatives worldwide that tackle all these various steps in an automated manner. The crystallization trials are done by robots in parallel. Light diﬀraction then reveals which trials lead to microcrystals. The protein structure initiative (PSI) aimed at determining the structures of proteins with so far not covered architectures (“folds”). Between 2000 and 2010, more than 3000 “distinct” structures resulting from this initiative were deposited into the PDB. Most of these were “novel” structures with i

Print out this energy together with the number of the iteration. You will see that the energy decreases fast in the beginning and then slower and slower. (v) Repeat from (i) until the total energy is essentially constant. (b) “Real” networks Now perform the same layout on the following networks. Give the ﬁnal energies and conﬁgurations for each of them and the number of iterations after which you stopped the layout process. (i) “sfnet_100.dat” is a thinned out scale-free network of 100 vertices, where every second edge has been left out. (ii) “11309.txt” and “2287.txt” contain networks extracted from the BIND database for the taxon identiﬁers 11309 and 2287, respectively (what species are these?). Hint: You may have to run a few trials with diﬀerent initial placements of the vertices and then choose the best result. What happens when you skip step (ii), i.e. do the optimization without the random forces? Plot the total energy versus iteration with and without the random forces for one of the “real” networks of your choice. Be careful to scale the axis so that the important diﬀerence can clearly be seen. These networks contain more than a single cluster. What happens to the diﬀerent clusters? Why? 7. Closeness centrality Consider an undirected tree of n vertices. A particular edge in the tree joins vertices 1 and 2 and divides the tree into two disjoint regions of n1 and n2 vertices as sketched in Figure 4.11. Show that closeness centralities C 1 and C 2 of the two vertices deﬁned according to equation Ci =

1 n =∑ , li j dij

are related by n n 1 1 + 1 = + 2. C1 n C2 n

109

110

Principles of Computational Cell Biology

Figure 4.11 Schematic graph (see Problem 7). 1

2

Bibliography Cormen, T.H., Leiserson, C.E., Rivest, R.L., and Stein, C. (2001). Introduction into Algorithms. MIT Press.

111

5 Protein–Protein Interaction Networks – Pairwise Connectivity The formidable advances in protein sciences in recent years have highlighted the importance of protein–protein interactions in biology. Well before the proteomics resolution, we knew that many proteins are capable of interacting with each other in a highly speciﬁc way and that the function of certain proteins is regulated by interacting partners. However, the extent and degree of the protein–protein interaction network were not realized. It is now believed that not only are a majority of proteins in a eukaryotic cell involved in complex formation at some point in the life of the cell but also that each protein may have on an average six to eight interacting partners (Section 5.4). In Chapters 2 and 3, we introduced experimental and computational methods to study individual protein–protein interfaces and three-dimensional structures of protein complexes. The structural details of molecular complexes are equally important when considering simultaneous binding of more than two proteins. Imagine that protein A binds another protein B via a particular interface on its surface. Then, protein A can obviously not bind a third protein C at the same interface at the same time. In this chapter, we will start with the collection of protein–protein interaction data from high-throughput experiments followed by its bioinformatics interpretations. This topic will be continued in Chapter 6 that goes deeper into the analysis of the resulting protein interaction networks.

5.1 Experimental High-Throughput Methods for Detecting Protein–Protein Interactions We will begin our discussion of protein–protein interactions by introducing several experimental techniques that can be applied to “ﬁsh” in cell lysates for unknown interaction partners. Even a bioinformatician should be familiar with the basics of these techniques because this greatly helps in understanding why diﬀerent methods give diﬀerent results. Also, we need to appreciate that diﬀerent and sometimes even contradictory answers about the composition of a macromolecular complex can be simultaneously correct depending on which properties of the complex are being probed by the respective experimental techniques.

Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

112

Principles of Computational Cell Biology

5.1.1

Gel Electrophoresis

Gel electrophoresis techniques are useful to separate and partially purify molecules on the basis of their physicochemical characteristics such as size, shape, or isoelectric point (pI). The ﬁrst term, gel, denotes the matrix employed for separating the molecules. Typically, the gel contains a cross-linked polymer of diﬀerent porosity. The second term, “electrophoresis,” points to the electromotive force whereby the molecules are pushed or pulled along the gel matrix. Once the molecules are placed into the wells of the gel, an electric voltage is applied across the gel. As a result, the molecules will migrate across the polymer matrix with diﬀerent velocities, toward the anode if they carry a negative charge, and toward the cathode if they are positively charged. Proteins show diﬀerent mobilities because of the diﬀerences in their total charge and conformations. Therefore, in the preparation of a gel electrophoresis experiment, one usually denatures the proteins by adding detergent molecules such as sodium dodecyl sulfate (SDS) that decorates the proteins with negative charges. This method is therefore termed sodium dodecyl sulfate–polyacrylamide gel electrophoresis (SDS–PAGE). The SDS treatment leads to denaturation of a protein into an unfolded conformation. The number of SDS molecules that bind to a protein is roughly proportional to their size, so that all proteins will have a comparable charge-to-mass ratio. Denatured proteins behave as long rods instead of particles with a complex three-dimensional shape. Hence, the velocity at which they move in the gel depends only on their mass. How far the proteins move through the gel until the voltage is turned oﬀ is roughly inversely proportional to the logarithm of their molecular mass. If several samples are placed next to each other in diﬀerent lanes of the gel, they will run parallel to each other. Protein bands in diﬀerent lanes that reach the same distance from where they were started represent molecules that migrated across the polymer matrix with the same velocity. This generally means that their masses are comparable to each other. There exist the so-called ladders that consist of a mixture of molecules of known masses. If such a mixture is placed into one lane of the gel parallel to the unknown samples, the bands detected in the marker lane can be compared to the bands in the lanes with the unknown components in order to determine their mass (Figure 5.1). 5.1.2

Two-Dimensional Gel Electrophoresis

Like one-dimensional electrophoresis, two-dimensional electrophoresis separates molecules by molecular weight in one direction using SDS–PAGE and, additionally, in a ﬁrst step, also by their pI in the perpendicular direction. As there is only a low random chance that two molecules resemble each other in both properties, they can be more eﬀectively separated by two-dimensional electrophoresis than one-dimensional electrophoresis. The procedure starts with placing the sample in a gel manufactured with a stationary pH gradient in one direction and applying an electrostatic potential diﬀerence across it. A pH gradient means that the pH of the gel is not constant, but changes from low pH to high pH. At all pH values except for their pI, the proteins will be charged. In conditions where they carry a positive charge, they will be pulled toward the

Sample C

Sample B

Sample A

Marker

Protein–Protein Interaction Networks – Pairwise Connectivity

100 90

Protein mass (kDa)

80 70

60

50

40

30 Anode

Figure 5.1 Schematic result of a gel electrophoresis run. The left lane labeled “Marker” contains three proteins with known weights of 40, 70, and 100 kDa. The other three lanes labeled “samples A to C” contain unknown mixtures of three to four proteins. Their masses can be assigned with reference to the “Marker” lane. Also, the puriﬁed protein bands can be cut out from the gel for further usage.

more negative end of the gel; if they are negatively charged, otherwise. When the protein moves to lower pH, for example, the concentration of free protons in the solution increases so that some of its negatively charged amino acids (Glu and Asp) will be protonated and thus neutralized. This means that the total charge of a protein depends on the surrounding pH. The protein therefore migrates along the pH gradient until it carries no overall charge. This location of the protein in the gel corresponds to the apparent pI of the protein. In the second step, the proteins separated according to their pI are placed at the start of an SDS gel (see above) and are now separated by their mass as well. 5.1.3

Aﬃnity Chromatography

Aﬃnity chromatography is a biochemical separation technique to separate a sample into fractions of diﬀerent sizes. It also contains a stationary phase that

113

Principles of Computational Cell Biology

W

Z

Bait

Y X

Affinity column

114

Figure 5.2 In aﬃnity puriﬁcation, a protein of interest (bait) is tagged with a molecular label (dark route in the middle of the ﬁgure) to allow easy puriﬁcation. The tagged protein is then copuriﬁed together with its interacting partners (W–Z). This strategy can be applied on a genome scale. Source: Aloy and Russell (2006). Drawn with permission of Springer Nature.

reversibly binds to a known subset of molecules. Usually, the separation starts from an uncharacterized sample containing a heterogeneous group of solvated molecules, such as a cell lysate, growth medium, or blood serum. The aﬃnity puriﬁcation then exploits a particular known property of the molecule of interest so that it will bind to a solid or stationary phase together with any putative partners that are bound to it, but not the other molecules of the sample. After this puriﬁcation step, the eluted target protein and its interaction partners are denatured and thus separated. Subsequently, they will be characterized and identiﬁed by a subsequent gel electrophoresis or mass spectroscopy step. The puriﬁcation process is sketched in Figure 5.2. In the area of protein complexes, a variant of the method is used that is termed tandem aﬃnity puriﬁcation (TAP). 5.1.4

Yeast Two-hybrid Screening

Yeast two-hybrid screening (Y2H) is a molecular biology technique used to discover protein–protein interactions by testing for physical interactions (such as binding) between two proteins. One protein is termed the bait and the other is a library protein (or prey). The idea behind the test is the activation of downstream reporter gene(s) by the binding of a transcription factor to an upstream activating sequence (UAS). For the purposes of two-hybrid screening, one utilizes a transcription factor consisting of two domains, a binding domain (BD) and an activating domain (AD). The BD is the domain responsible for binding to the UAS and the AD is responsible for the activation of transcription. The key to the two-hybrid screen is that in most eukaryotic transcription factors (Section 7.1), the AD and BD are modular and can function in close proximity to each other without direct binding. In the Y2H screen, the transcription factor is split into the BD and AD fragments. Then, the BD fragment is fused onto the bait protein X and the AD fragment onto a library protein Y . If X and Y bind to each other (Figure 5.3), then the AD and BD of the transcription factor would be indirectly connected again,

Protein–Protein Interaction Networks – Pairwise Connectivity

Y AD

X BD Reporter gene

Figure 5.3 The Y2H system is one of the most widely used high-throughput systems to detect protein–protein interactions. In the most common variant, a bifunctional transcription factor (usually GAL4) is split into its binding domain (BD) and its activation domain (AD). Each segment is then fused to a protein of interest (X and Y). If these two proteins interact, the activity of the transcription factor is reconstituted. The system can be scaled up and applied in genome-scale screens.

and transcription of the reporter gene(s) could occur. If the two proteins do not interact, there would be no transcription of the reporter gene. In this manner, a huge “library” of “prey” proteins can be tested one by one in parallel experiments on a multiwell chip for interaction with the bait. A common transcription factor used for Y2H screening is GAL4 that binds speciﬁcally to the UAS sequence and initiates activation of a downstream target gene. For example, this can be the gene coding for green ﬂuorescent protein so that the expression of the target gene can be eﬃciently and quickly read out. An advantage of this method is that the interactions are probed in vivo. As yeast is cheap and robust, the method can be applied on a large scale. Disadvantages are that the interactions need to be probed in the yeast nucleus and some proteins, such as membrane proteins, may not be translocated easily into the nucleus. Also, it is possible that the two proteins interact in Y2H experiments, although they are not simultaneously expressed during the cell cycle or in the particular compartment. In addition, a reported interaction may also be mediated through a third (or even more) protein that binds X and Y simultaneously. In this case, X and Y could be reported to interact directly (although they actually do not), and the mediating partners would remain undetected. 5.1.5

Synthetic Lethality

The synthetic lethality method aims to detect cases where mutation or deletion of two genes is viable when only one gene is aﬀected, but their combination is lethal to the cell under certain conditions. Because these mutations are lethal to the cell, these gene interactions cannot be isolated directly and must be constructed in an artiﬁcial way. Synthetic interaction can arise either from possible physical interaction between the two proteins coded by the genes because they belong to the same biochemical pathway or they have similar functions.

115

116

Principles of Computational Cell Biology

5.1.6

Gene Coexpression

The biological function of a protein complex is closely connected to the functions of its components. As the subunits need to be present at the correct ratios, the expression levels of the components of a complex should be quite similar. One can measure gene expression proﬁles (Section 8.2), for example, along the cell cycle of synchronized yeast cells under diﬀerent conditions. Independent studies showed that yeast proteins that make contacts to each other tend to show higher levels of coexpression than proteins that do not interact (see Section 9.2.1). There exist many other experimental methods such as surface plasmon resonance (SPR), nuclear magnetic resonance (NMR), or chemical cross-linking. On the one hand, these methods give more precise information whether two proteins really interact. They can provide binding constants and association rates in titration experiments. On the other hand, they require more work and are typically not applied on a large genome-wide scale. 5.1.7

Databases for Interaction Networks

The traditional major source for data on protein–protein interactions is the Protein Data Bank (PDB), which provides crystallographic data on protein–protein complexes (Section 1.4.7). Although atomic resolution structural data remains the gold standard for predicting and modeling protein–protein interactions, the recent development of other experimental techniques for determining interacting pairs of proteins has resulted in the development of a number of other protein–protein interaction databases. Table 5.1 lists some databases that contain information on protein interactions. Note that the number of important databases in this ﬁeld and the number of interactions deposited in each of them are expanding at an increasing speed. 5.1.8

Overlap of Interactions

Unfortunately, interaction data sets from high-throughput experiments discussed in the previous sections were quickly found to be incomplete in many regards and even contradictory. This means that two proteins may in reality interact with each other, although their interaction was not detected in the high-throughput experiment (“false negatives”) or that reported interactions may be artiﬁcial (“false positives”). Clearly, in the context of genome-wide analyses, these inaccuracies do not only result from intrinsic errors of the particular methods but are also aﬀected by the diﬃcult statistical nature of these decisions. We must consider that these inaccuracies are likely greatly magniﬁed in this case because the protein pairs that do not interact (negatives) by far outnumber those that do interact (positives). Yeast contains, for example, c. N = 6000 proteins. Between them up to N(N − 1)/2 ∼ 18 million potential interactions could be formed. However, the number of actual interactions is estimated to be at most 100 000. This gives a fraction of true positives of roughly 1 in 200 or 0.5%. To obtain an equal number of false positives as true positives, an experiment with an error rate of only 0.5% is required. This is obviously far

Protein–Protein Interaction Networks – Pairwise Connectivity

Table 5.1 Some public databases compiling data related to protein interactions.

URL

Number of interactions

Type

Proteins/ domains

CORUM

http://mips.helmholtzmuenchen.de/corum

2 837 complexes

Curated mammalian protein complexes

P

DIP

http://dip.mbi.ucla .edu/dip

81 000

Curated

P

HPRD

www.hprd.org

41 000

Curated

P, D

IntAct

https://www.ebi.ac.uk/ intact

805 000 molecular interactions

Curated, experimentally veriﬁed protein–protein interactions

P

Mentha

https://mentha .uniroma2.it

707 000

Integrates data from curated databases

P

MIPS

http://mips.helmholtzmuenchen.de/proj/ppi

1 859

Curated

Scoppi

http://scoppi.biotec .tu-dresden.de/scoppi

105 000

Automatic

D

STRING

https://string-db.org

1 380 million

Integrated data from genomic context, HT experiments, coexpression, previous knowledge

P

UniHI

www.unihi.org

350 000

Integrates data for human proteins

P

Yeast protein complexes

http://wodaklab.org/ cyc2008

408 complexes

Curated, validated by small-scale experiments

P

(P) and (D) stand for proteins and domains. The number of interactions reﬂects the status of April 2018.

below the accuracy of any of the existing techniques mentioned in the previous section. For example, it is estimated that the error rate of Y2H experiments is of the order of 50%. Also, we need to be aware that various high-throughput methods even yield diﬀering results on the same complex. For example, of the about 100 000 interactions that have been characterized for yeast (see Section 5.4.1), only a small fraction is supported by more than one method. Possible explanations for this discrepancy are as follows: (i) the experimental methods may not have reached saturation, (ii) many of the methods produce a signiﬁcant fraction of false positives, or (iii) some methods have diﬃculties for certain types of interactions. For example, it has turned out that each experimental technique produces a unique distribution of interactions with respect to functional categories (Figure 5.4). Because of probing interactions in the solution, TAP and mass spectrometry naturally predict few interactions for proteins involved in transport and sensing

117

118

Principles of Computational Cell Biology

SHS1 CIN2

CDC10 CDC12

GIN4

CDC3 CDC11

SPR28 ARC1 LPD1

TAP purification HMS-PCI purification Two-hybrid interaction

Annotated member of septin complex

Figure 5.4 Results from diﬀerent methods for complexes involving the cell cycle protein CDC11 located in the center of the ﬁgure. Source: von Mering et al. (2002). Drawn with permission of Springer Nature.

because these categories are enriched with membrane proteins. On the other hand, Y2H detects few proteins involved in translation because these proteins operate in the cytosol, not in the nucleus. 5.1.9

Criteria to Judge the Reliability of Interaction Data

Given this partly conﬂicting data, it would be very useful to have some conﬁdence measures at hand to separate putative interactions into likely and unlikely ones. The following principles may provide some guidance in judging interaction data. 1) mRNA abundance is a rough measure of protein abundance (see Section 8.8.1). By dividing the mRNA data for the yeast genome into mRNA abundance classes (bins) of equal size, from zero to maximum expression, it turned out that most protein–protein interaction data sets are heavily biased toward proteins of high abundance except for genetic techniques (Y2H and synthetic lethality). This means that, for example, a missing mass spectrometry (MS) detection of the interaction between a pair of low-abundance proteins should be treated with less conﬁdence than a missing interaction of two highly abundant proteins.

Protein–Protein Interaction Networks – Pairwise Connectivity

Coregulation

P

A

B

C

Figure 5.5 The gene cluster method. Genes A, B, and C are arranged linearly as one operon. When transcription is activated at promoter P, all the three genes are simultaneously transcribed. Genome 1

A

B

C

D

Genome 2

A

B

C

D

A

B

C

Genome 3

D

Figure 5.6 The gene neighborhood method analyzes the gene order in diﬀerent evolutionarily related organisms. Genes that always occur in the same order (A, B, C) are likely to form an operon, meaning that they would be jointly regulated and are quite likely to interact. Gene D may occur at diﬀerent locations, making it less likely to be part of the same operon.

(2) Analyzing the interaction coverage for protein–protein pairs expressed in diﬀerent cellular compartments would indicate whether a particular method shows a bias toward protein pairs from certain compartments. Indeed, comparing the protein localization from the MIPS (Munich Information Center for Protein Sequences) and TRIPLES (Transposon-Insertion Phenotypes, Localization and Expression in Saccharomyces cerevisiae) databases against the interaction coverage showed that in silico predictions such as conserved gene neighborhood, co-occurrence of genes, and gene fusion events (see Figures 5.5–5.7) overestimate mitochondrial interactions. Taken the other way around, an independent quality measure is to check whether interacting proteins belong to the same compartment (Figure 5.8). The Y2H method gave relatively poor results here. A possible complication of this approach is that proteins may actually be translocated between several compartments in reality, and not all such multiple occurrences in diﬀerent compartments may be annotated yet in databases or may even not have been discovered yet. (3) A certain degree of cofunctionality is to be expected for interacting protein–protein pairs. Although proteins from diﬀerent groups of biological functions may certainly interact with each other, the degree to which interacting proteins are annotated to the same functional category (see Section 8.6) is a measure of quality for predicted interactions. The predicted interactions should cluster fairly well along the diagonal, meaning that like pairs with like.

119

120

Principles of Computational Cell Biology

Sp1

Sp2

Sp3

Sp4

Sp5

Figure 5.7 If two protein-coding genes are separated in some species (Sp1, Sp4, and Sp5) and fused to form a single gene in other species (Sp2 and Sp3), a physical interaction is probable. Analyzing the pairwise connectivity of genes to detect putative interactions is termed the Rosetta Stone method. Almost 40 000 predicted pairwise functional associations were found in this way from a search in 23 complete genomes (Enright and Ouzounis 2001). B

A

D C

(a)

Figure 5.8 (a) Proteins C and D are localized in the same compartment and may interact. (b) On the other hand, proteins A and B belong to diﬀerent compartments, making it very unlikely that they physically interact in vivo unless one of them moves to the other compartment.

(b)

5.2 Bioinformatic Prediction of Protein–Protein Interactions Complementing the experimental methods introduced in Section 5.1, various purely computational methods have been developed for sequence-based predictions of protein–protein interactions that take into account the genomic content of gene pairs or the occurrence of genes in related organisms. Table 5.2 provides an overview over some of the methods that will be discussed in this section.

Protein–Protein Interaction Networks – Pairwise Connectivity

Table 5.2 Bioinformatics methods to predict protein–protein interactions.

Method name

Protein/domain interaction

Physical interaction/ functional relationship

Gene cluster and gene neighborhood

P

F

Phylogenetic proﬁle

P, D

F

Rosetta stone

P

PI, F

Sequence coevolution

P, D

PI, F

Classiﬁcation

P, D

PI

Integrative methods

P, D

PI, F

Here, P and D stand for proteins and domains. F and PI stand for functional relationship and direct physical interaction, respectively.

5.2.1

Analysis of Gene Order

In bacteria, an operon is formed by several genes having similar functions and encodes putatively interacting proteins. Activated from a single promoter, such genes are often transcribed as a single unit (Figure 5.6). In eukaryotes, neighboring genes also tend to be coregulated. Although gene order tends to be shuﬄed by neutral evolution between distantly related organisms, the organization into gene clusters or operons formed by coregulated genes is often conserved. Analysis of gene order conservation in various genomes found that about 70% of coregulated genes interact physically. 5.2.2

Phylogenetic Proﬁling/Coevolutionary Proﬁling

The method of phylogenetic proﬁling hypothesizes that the genes coding for two interacting proteins should either both be present in the genome of an organism or both be absent. Although this may not sound as a strong indicator at ﬁrst, with the availability of hundreds to thousands of sequenced genomes, this pattern may indeed become very powerful. In contrast to methods developed to predict direct physical interactions, phylogenetic proﬁling identiﬁes functionally connected proteins and that may jointly belong to a structural protein complex or a metabolic pathway. This is based on the hypothesis that functionally linked proteins evolve in a correlated manner, so that they will likely have homologs in the same group of organisms. Therefore, one can systematically identify edges between all the proteins coded by a genome. For instance, one expects to ﬁnd ﬂagellar proteins in bacteria that possess ﬂagella, but not in other organisms. To characterize the set of organisms having a homologous member of the protein family, one constructs a phylogenetic proﬁle for each protein (Figure 5.9). This proﬁle is a string with n entries, where n is the number of considered genomes. Whether or not a homologous protein of a given protein is present in the ith genome is marked as “1” at the ith position. “0” is used in cases where no homolog is detected. This presence is most conveniently tabulated in a Table 5.3.

121

122

Principles of Computational Cell Biology

S. cerevisiae (SC) P1 P2

P2

P5

P1 P4

P4

P7

P5

P6 P7

P1 P2

P3

P5

P6

P3

P7 P5

P3

E. coli (EC)

P6

B. Subtilis (BS)

H. influenza (HI)

Figure 5.9 The presence of protein families in various organisms detected, for example, by sequence homology searches in the genome sequences of these organisms. Table 5.3 Presence or absence of proteins P1–P7 in the four organisms in Figure 5.9. Proteins

EC

SC

BS

HI

P1

1

1

0

1

P2

1

1

1

0

P3

1

0

1

1

P4

1

1

0

0

P5

1

1

1

1

P6

1

0

1

1

P7

1

1

1

0

Each row is named a “proﬁle” that characterizes the presence or absence of protein Pi in diﬀerent organisms. From Table 5.3, we may compute the Hamming distances between all proﬁles as the number of columns with diﬀerent entries (Table 5.4). For example, P1 and P3 diﬀer in columns SC and BS. This yields a Hamming distance of 2. The entries of this table may again be presented in a graph as shown in Figure 5.10. If two proteins have similar proﬁles, this suggests that both proteins have evolved in a similar manner and may be functionally related to each other. The phylogenetic proﬁling technique also relates the biological function of so far uncharacterized proteins to other proteins in the same cluster that carry functional annotations. 5.2.2.1

Coevolution

Apart from simply detecting the presence and absence of pairs of proteins among diﬀerent genomes, one may also take into account the phylogenetic relationships among these organisms. Protein pairs that bind have a higher

Protein–Protein Interaction Networks – Pairwise Connectivity

Table 5.4 Hamming distances between proﬁles of proteins P1–P7.

P1 P2 P3 P4 P5

P1

P2

P3

P4

P5

P6

P7

0

2

2

1

1

2

2

0

2

1

1

2

0

0

3

1

0

2

2

3

1

0

1

1

0

2

0

P6 P7

0

Figure 5.10 Graphical representation of the Hamming distances between the phylogenetic proﬁles of proteins P1–P7. The thick black lines connect proteins with a Hamming distance of 0, and the thinner gray lines connect those with a distance of 1. The thinnest lines connect proteins with a distance of 2. The two pairs P2 and P7, as well as P3 and P6, have identical proﬁles (simultaneous presence or absence in the four diﬀerent organisms), compare Figure 5.9.

P1

P7

P2

P6

P3

P5

P4

correlation between their phylogenetic distance matrices than other homologs drawn from the ligand and receptor families that do not bind (Goh and Cohen 2002) (Figure 5.11). One may also use this concept to identify by phylogenetic analysis proteins that can functionally substitute for another in various organisms. Such proteins are expected to have an anticorrelated distribution pattern across organisms. This allows the discovery of nonobvious components of pathways, function prediction of uncharacterized proteins, and prediction of novel interactions. There now exists a small selection of promising experimental and theoretical methods to analyze the cellular interactome (by which we understand the set of all biomolecular interactions in a cell). We have encountered a ﬁrst problem in that each method detects too few interactions (as seen by the fact that the overlap between predictions of various methods is small), and a second problem in that each method has an intrinsic error rate producing “false positives” and “false negatives.” Although this may not sound too encouraging, there is hope that everything will converge to a big picture eventually. The ﬁrst problem can

123

124

Principles of Computational Cell Biology

?

Sp1

Sp2

Sp3

Sp4

Sp5

Sp6

Figure 5.11 Potentially interacting proteins or functionally related proteins should show a similar pattern of evolution across several species.

be partially solved by simply producing more experimental data. Also, solving the ﬁrst problem will help solve the second problem by combining predictions. We will now introduce such a statistical approach for combining results from various experimental and prediction methods that can be used to estimate the quality of the interactions by statistical methods.

5.3 Bayesian Networks for Judging the Accuracy of Interactions Given the considerable uncertainties of the available experimental data and of the in silico predictions on protein–protein interactions, it would be highly desirable to develop an indicator of conﬁdence for every suggested interaction or even for every possible interaction. Can one integrate evidence from many diﬀerent sources to increase the predictivity of true and false protein–protein predictions? One class of techniques to do this is composed up of Bayesian approaches that allow for the probabilistic combination of multiple data sets. For illustration, this strategy will be applied to real yeast interaction data. These approaches can be used for combining noisy genomic interaction data sets given as input. For normalization, each source of evidence for interactions is compared against samples of known positives and negatives (“gold standard”). The Bayesian network then outputs, for every possible protein pair, its likelihood of interaction.

Protein–Protein Interaction Networks – Pairwise Connectivity

5.3.1

Bayes’ Theorem

Bayes’ theorem belongs to the ﬁeld of probability theory and has its name from the English mathematician Thomas Bayes. Bayes’ theorem relates conditional and marginal probabilities of two variables. The unknown conditional probability distribution of a random variable A is expressed by available knowledge on another variable B, namely, the conditional probability distribution of B given A, and the marginal probability distribution of A alone. The theorem can be derived from the deﬁnition of conditional probability: P(A|B)P(B) = P(A, B) = P(B|A)P(A), where P(A, B) is the joint probability of A and B. In words, the probability of A given B times the probability of B equals the probability of both events A and B occurring together and also equals the probability of B given A times the probability of A. If one divides the left- and right-hand sides by P(B) given that P(B) is nonzero, we obtain P(B|A)P(A) . P(A|B) = P(B) This expression is known as Bayes’ theorem. It reads: “The probability of A given B equals the probability of B given A times the probability of A, divided by the probability of B.” P(A) is also called the prior probability of A where “prior” indicates that it precedes any information about B. P(A|B) is the posterior probability of A, given B. “Posterior” indicates that the probability is derived from or entailed by the speciﬁed value of B. For example, P(A) could be the likelihood of an arbitrary person you bump into to hold a PhD degree in bioinformatics. This likelihood is probably quite low these days. P(A|B) could then be the likelihood for some other person of holding a PhD in bioinformatics if you are given the evidence (B) that they regularly read the journal Nature, they work long but irregular hours, and show activities in running a start-up company. This likelihood P(A|B) is probably higher than P(A) alone. So posterior means determining the likelihood after you are provided with evidence B. P(B|A), for a particular value of B, is termed the likelihood function for A given B. It may also be written as L(A|B). P(B) is the prior probability of B. As it appears in the denominator, it serves as a normalizing constant. Using this terminology, one can formulate the theorem in the following way: likelihood × prior . posterior = normalizing constant If the ratio P(B|A)/P(B) is termed the standardized likelihood, the theorem takes on the following form: posterior = standardized likelihood × prior. 5.3.2

Bayesian Network

A Bayesian network is a directed acyclic graph of vertices and arcs that stand for variables and dependence relations among the variables (Figure 5.12). If vertex

125

126

Principles of Computational Cell Biology

Figure 5.12 An example of a Bayesian network. A directed arc, e.g. between variables Y and E 1 , denotes conditional dependency of E 1 on Y, as determined by the direction of the arc.

Y

E3

E1 E2

A is connected to another vertex B by an arc, then A is called a parent of B. If the value of a vertex is known, it is called an evidence vertex. A vertex can represent any sort of variable, either an observed measurement, a parameter, a latent variable, or a hypothesis. Vertices cannot only represent random variables. This is what is “Bayesian” about a Bayesian network. An important ﬁeld in machine learning deals with the task of deducing the structure (topology) of a Bayesian network from the given data. For each variable and its parents, Bayesian networks deﬁne their dependency by a conditional probability function or a table. A probabilistic Bayesian network model is completely speciﬁed by its graphical structure together with the conditional probability functions/tables. In Figure 5.12, one such measure is the probability Pr(E1 |Y ). The probability for the combined occurrence of E1 , E2 , and E3 is described by Pr(Y , E1 , E2 , E3 ) = Pr(E1 |Y ) Pr(E2 |Y ) Pr(E3 |Y ) Pr(Y ). In this case, the individual probabilities simply multiply as E1 , E2 , and E3 do not depend on each other. Obviously, for E1 , E2 , and E3 to occur, Y has to be “on” as well. 5.3.3 Application of Bayesian Networks to Protein–Protein Interaction Data Gerstein and coworkers (Jansen et al. 2003) implemented a Bayesian network to combine three sorts of data about protein–protein interactions from Saccharomyces cerevisiae: (i) data on protein–protein interactions from two-hybrid screens (Y2H) and in vivo pull-down (TAP) high-throughput experiments, (ii) genomic features consisting of expression data, protein function according to the biological process branch of the gene ontology, and the MIPS functional catalog, respectively, and data on the essentiality of proteins, (iii) “gold standards” of known interactions and of noninteracting protein pairs. As positives, they used the MIPS catalog of complexes, which is a manually compiled list of complexes (8250 protein pairs that each belong to a known complex) from the biomedical literature. The negatives are harder to deﬁne as it is hard to say for sure which protein–protein pairs deﬁnitely do not interact. However, a properly deﬁned negative set is as essential for the successful training of the Bayesian network as the set of positives. In lieu of an experimentally

Protein–Protein Interaction Networks – Pairwise Connectivity

veriﬁed set, Gerstein et al. assumed that proteins localized to diﬀerent compartments do not interact (Figure 5.8). As noted before, this choice may not be perfectly true because certain proteins may be exchanged between diﬀerent compartments. However, the power of this approach derives from its statistical ansatz, and it is probably true that, on an average, proteins localized to diﬀerent compartments are much less likely to interact than those that belong to the same compartment. 5.3.3.1

Measurement of Reliability “Likelihood Ratio”

Considering a genomic feature f expressed in binary terms (i.e. “absent” or “present”), the standardized likelihood ratio L(f ) introduced above may also be written as fraction of gold-standard positives having feature f . L(f ) = fraction of gold-standard negatives having feature f L(f ) = 1 means that the feature has no predictability: the same fractions of positives and negatives have feature f . The larger the L(f ), the better its predictability. For two features f 1 and f 2 with uncorrelated evidence, a “naïve” Bayesian network can be used that expresses the likelihood ratio of the combined evidence by their product: L(f1 , f2 ) = L(f1 ) × L(f2 ). For correlated evidence, L(f 1 , f 2 ) cannot be factorized in this way. Such a relationship between features is formally correctly treated by a non-naïve Bayesian network. The combined likelihood ratio is proportional to the estimated odds that two proteins are in the same complex, given multiple sources of information, or L(f ) = 5.3.3.2

fraction of gold-standard positives having features f1 and f2 fraction of gold-standard negatives having features f1 and f2

Prior and Posterior Odds

A positive result should refer to a pair of proteins that are in the same complex. Given the number of positives among the total number of protein pairs, the “prior” odds of ﬁnding a positive are Oprior =

P(pos) P(pos) = . P(neg) 1 − P(pos)

The “posterior” odds are the odds of identifying a positive complex after including the information f 1 …f N from N data sets: Opost =

P(pos ∣ f1 … fN ) . P(neg ∣ f1 … fN )

Again, the terms “prior” and “posterior” refer to the situation before and after knowing the information in the N data sets. In the case of protein–protein interaction data, the posterior odds describe the odds of having a protein–protein interaction given that we have the information from the N experiments, whereas the prior odds are related to the chance of randomly ﬁnding a protein–protein interaction when no experimental data are known.

127

128

Principles of Computational Cell Biology

If Opost > 1, the chances of having an interaction are higher than those of having no interactions. The likelihood ratio L deﬁned as P( f1 … fN |pos) , L( f1 … fN ) = P( f1 … fN |neg) relates prior and posterior odds according to Bayes’ rule: Opost = L( f1 … fN )Oprior . In the special case that the N features are conditionally independent (i.e. they provide uncorrelated evidence), L can be simpliﬁed as before to L( f1 … fN ) =

N ∏ i=1

L(fi ) =

N ∏ P(fi |pos) i=1

P(fi |neg)

.

L can be computed from contingency tables relating positive and negative examples with the N features (by binning the feature values f 1 …f N into discrete intervals), see below. Determining the prior odds Oprior is somewhat arbitrary in that it requires an assumption about the number of positives. Assuming that 30 000 is a conservative lower bound for the number of positives (i.e. pairs of proteins that are in the same complex) and 18 million possible protein pairs for yeast (Section 5.4.1), 3 × 104 P(pos) 1 18 × 106 = = . Oprior = P(neg) 600 17.97 × 106 18 × 106 This means that Opost > 1 can be achieved with L > 600. 5.3.3.3 A Worked Example: Parameters of the Naïve Bayesian Network for Essentiality

We will present only one data set from Jansen et al. (2003) that is simplest to explain and discuss. Proteins are termed essential or nonessential by considering whether a deletion mutant where this protein is knocked out from the genome is viable (nonessential) or not (essential). We expect that it should be more likely that both the proteins in a complex are essential or nonessential, but not a mixture of these two attributes. Deletion mutants of either one protein should impair the function of the same complex. Remarkably, already in 2003, data about essentiality were available for approximately 4000 out of the 6000 yeast proteins from individual gene knockout experiments. Four thousand proteins can form about 8 million interactions. In Table 5.5, the essentiality data are compiled for all possible protein–protein pairs and for those from the gold standard of interacting proteins. Column 1 describes the genomic feature. Protein pairs can take on three discrete values in the “essentiality data” (EE: both essential; NN: both nonessential; NE: one essential and one not). Column 2 gives the number of protein pairs with a particular feature (i.e. “EE”) drawn from the whole yeast interactome (around 8 million pairs). Columns “pos” and “neg” contain the overlap of these pairs with the 8250 gold standard positives and the 2 708 746 gold standard negatives.

Protein–Protein Interaction Networks – Pairwise Connectivity

Table 5.5 Essentiality of protein pairs is weakly associated with their tendency to interact.

Essentiality

Values

Number of protein pairs

Gold standard overlap pos

neg

P(E|pos)

P(E|neg)

L

EE

384 126

1 114

81 924

5.18E−01

1.43E−01

3.6

NE

2 767 812

624

285 487

2.90E−01

4.98E−01

0.6

NN

4 978 590

412

206 313

1.92E−01

3.60E−01

0.5

Sum

8 130 528

2 150

573 724

1.00E−00

1.00E+00

1.0

P(feature value|pos) and P(feature value|neg) give the conditional probabilities of the feature values. L is the ratio of these two conditional probabilities. E.g. P(E|pos) = 5.18 × 10−1 in the ﬁrst row, column 6, is obtained as the ratio of 1114:2150. P(E|neg) = 1.43 × 10−1 is obtained as the ratio of 81 924:573 724. The value of 3.6 is then obtained as the ratio of 0.518: 0.143. Although P(E|pos) seems only 50% predictive, this feature becomes more powerful by the small value of P(E|neg). The number of protein pairs for each category listed in the third column is not needed for the computation of likelihood factors. It is given here to emphasize the importance of suﬃcient coverage.

Jansen et al. performed a similar analysis for mRNA expression data by computing a correlation index for the expression of each protein pair and for the functional similarity. How can these diﬀerent sorts of information be combined? The way of linking the diﬀerent types of information depends on whether the information is independent or whether it is dependent. Here, a simple “naïve” Bayesian network may be used to connect such independent sorts of data because these information sets hardly overlap. 5.3.3.4

Fully Connected Experimental Network

The binary experimental interaction data sets from high-throughput experiments contain correlated evidence and should be combined via a fully connected Bayesian network (Table 5.6). Here, the four data sets can be combined in at most 24 = 16 diﬀerent ways (subsets). For each of these 16 subsets, a likelihood ratio is computed from the overlap with the gold standard positives (“pos”) and negatives (“neg”) in the same way as for the essentiality data. This representation generates a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighed according to the likelihood that it exists in a complex. First of all, Table 5.6 reveals some interesting eﬀects obtained when applying plain statistics to real data. Row 2 has a higher L score than row 3. This seems odd. Is it a stronger indication for true interaction if an interaction is only found by two experimental groups (row 2) than if it was found by all four experimental groups (row 3)? Our intuition would say that the opposite should be the case. However, this is what results from the performance of the gold standard in these experiments. In row 2, 26 members of the gold standard were positive and two were negative. In row 3, there were only nine members positive and one negative. Thus, the positive ones outnumber the negative ones by a smaller ratio. In this case, the diﬀerences arise from the small number of interactions. Having one or

129

130

Principles of Computational Cell Biology

Table 5.6 The ﬁrst four columns contain results from high-throughput experiments on protein–protein interactions.

Gavin (g)

Ho (h)

Uetz (u)

Ito (i)

Number of protein pairs

Gold standard overlap Pos

P(g, h, u, i|pos)

Neg

P(g, h, u, i|neg)

L

1

1

1

0

16

6

0

7.27E−04

0.00E+00

—

1

0

0

1

53

26

2

3.15E−03

7.38E−07

4268.3

1

1

1

1

11

9

1

1.09E−03

3.69E−07

2955.0

1

0

1

1

22

6

1

7.27E−04

3.69E−07

1970.0

1

1

0

1

27

16

3

1.94E−03

1.11E−06

1751.1

1

0

1

0

34

12

5

1.45E−03

1.85E−06

788.0

1

1

0

0

1 920

337

209

4.08E−02

7.72E−05

529.4

0

1

1

0

29

5

5

6.06E−04

1.85E−06

328.3

0

1

1

1

16

1

1

1.21E−04

3.69E−07

328.3

0

1

0

1

39

3

4

3.64E−04

1.48E−06

246.2

0

0

1

1

123

6

23

7.27E−04

8.49E−06

85.7

1

0

0

0

29 221

1 331

6 224

1.61E−01

2.30E−03

70.2

0

0

1

0

730

5

112

6.06E−04

4.13E−05

14.7

0

0

0

1

4 102

11

644

1.33E−03

2.38E−04

5.6

0

1

0

0

23 275

87

5 563

1.05E−02

2.05E−03

5.1

0

0

0

0

2 702 284

6 389

2 695 949

7.74E−01

9.95E−01

0.8

8 250

2 708 746

Sum

“1” means that an interaction was detected in this experiment, and “0” that is was not detected. The meaning of the other columns is equivalent to those in Table 5.5.

two negative ones makes quite a diﬀerence here. After all, this tells us that the diﬀerence between L = 4268.3 and L = 2955.0 is not meaningful. Another important fact about the values in Table 5.6 is that one needs to account for statistical ﬂuctuations of these values. The neg column contains a zero entry in the ﬁrst row reﬂecting that none of the gold standard negatives had positive g, h, and u experiments and a negative i experiment (which is comforting!). Consequently, the likelihood L is not deﬁned in this column as division by zero is not allowed. However, all experimental values are subject to statistical ﬂuctuations on the order of √ ± N + 1. L for this column can therefore adopt values of √ (6 ± 7) × 2 708 746 , √ (0 ± 1) × 8250 so that L can be between 1101 and inﬁnity. In this case, one would use the smallest possible value of 1101 as a most conservative estimate.

Protein–Protein Interaction Networks – Pairwise Connectivity

The combined experimental data in Table 5.6 yields a much higher likelihood factor for interactions that were conﬁrmed multiple times than the essentiality data in Table 5.5. However, this is not the end of the story. To achieve a stronger predictivity, the essentiality data may be combined with other factors such as mRNA coexpression. As argued before, the likelihood values of features with uncorrelated evidence are multiplicative. Consequently, we may also achieve predictions with Lcombined > 600 from such combined data sets. In summary, the Bayesian approach allows making relatively reliable predictions of protein–protein interactions by combining weakly predictive genomic features. In a similar manner, the approach could have been extended to include a number of other features related to interactions (e.g. phylogenetic co-occurrence, gene fusions, and gene neighborhood).

5.4 Protein Interaction Networks 5.4.1

Protein Interaction Network of Saccharomyces cerevisiae

As of August 2017, the data repository BioGrid (www.thebiogrid.org) contains 91 651 nonredundant physical interactions for S. cerevisiae involving 6367 genes. Annotated physical interactions in BioGrid need to be based on one of several types of experimental assays. The database Mentha (http://mentha.uniroma2.it) integrates experimentally validated physical interactions from several primary databases (including BioGrid). It currently holds 106 683 interactions for S. cerevisiae. An integrative bioinformatics study (Zhang et al. 2012, 2013) used a similar Bayesian strategy as that presented in Section 5.3 to compile the so-called PrePPI network for S. cerevisiae that involves about 60 000 high-conﬁdence pairwise interactions among yeast proteins with probability >0.5 (https://bhapp.c2b2 .columbia.edu/PrePPI). The major diﬀerences between PrePPI and the method presented in Section 5.3 is the inclusion of a structure-based scoring scheme (similar to that presented in Section 3.1) that checks whether an X-ray structure of a protein complex exists in the PDB database that is homologous to the two protein chains under investigation. 5.4.2

Protein Interaction Network of Escherichia coli

The protein interaction network of E. coli has only recently been addressed by large-scale experiments. For example, a protein interaction network for E. coli was compiled consisting of 5993 high-conﬁdence, nonredundant pairwise interactions among 1757 distinct E. coli proteins (Hu et al. 2009). Y2H reported 2234 PPIs in E. coli (Rajagopala et al. 2014). The integrated Mentha database provides 26 043 interactions involving 4184 genes. Wuchty and Uetz compared the protein interaction networks of S. cerevisiae and of E. coli and found quite similar behavior such as the more central roles of essential proteins, enrichment of interactions within protein complexes, depletion of interactions between protein complexes, and a signiﬁcant enrichment of interactions between targets of the same transcription factor (Wuchty and Uetz 2014). The protein interaction networks of prokaryotes and eukaryotes appear to have quite similar network characteristics.

131

132

Principles of Computational Cell Biology

5.4.3

Protein Interaction Network of Human

The Mentha database provides information on 277 371 physical PP interactions involving 18 506 human proteins. The current release of PrePPI contains 1.35 million predicted interactions of human proteins. For 127 000 of them, there exists evidence indicating a direct interaction. The authors determined the degree to which they could recover the proteins belonging to the complexes compiled in the CORUM database that consists of around 1800 multi-subunit complexes (http://mips.helmholtz-muenchen.de/corum). Approximately two-third of CORUM complexes were fully recovered at a likelihood ratio cutoﬀ of 600.

5.5 Protein Domain Networks So far in this chapter, we have always considered entire proteins. Yet, threedimensional protein structures suggest that the fundamental structural unit of proteins is a “domain.” In the following, we will focus on simple domains formed by a continuous stretch of a peptide chain, although we note that there exist more complicated domains, e.g. in the catalytic subunits of protein kinases, where the large helical domain is not formed from one but from several stretches of the protein sequence. In the case of simple domains, the respective portion of the polypeptide chain folds into a distinct structure that often carries out the biological functionality of this protein domain independent of neighboring sequences. Eukaryotes increasingly tend to have multidomain proteins, whereas the proteomes of bacteria or archaea mostly contain single-domain proteins. In any case, structural domain architectures govern interactions among proteins (Figure 5.13), oﬀering a framework for prediction models. One method to predict domain–domain interactions is the association method. To each domain pair (dm , dn ), this method assigns an interaction probability: Pr(dm , dn ) =

Imn , Nmn

where I mn stands for the number of interacting protein pairs that contain this pair of domains, and N mn is the total number of protein pairs having (dm , dn ). As the experimental coverage of protein interactomes is still incomplete, one often resorts to domain–domain interactions for predicting protein interaction networks. The databases iPfam and 3did (the “database of 3D interacting domains”) provide experimental sets of domain–domain interactions. Alternatively, the database InterDom deduces potential domain interactions by integrating data on domain fusions, interactions of proteins, and complexes and from the scientiﬁc literature. One may assume that interacting domains share a common biological process (BP) (annotation according to the GO terminology) and a common molecular function (MF) annotation (see Section 8.6). Statistical analysis (Schlicker et al. 2007) showed that the BP similarity of interacting domains is generally higher than the corresponding MF similarity. This suggests that interacting domains or proteins may perform diﬀerent functions, although they act in similar processes.

Protein–Protein Interaction Networks – Pairwise Connectivity

Figure 5.13 Nine domains (geometrical objects) are the fundamental units of these four proteins (shaded areas), mediating a distinct structure and biological functionality. Figure 5.14 Connectivity of domains to other domains. Source: Wuchty (2001). Drawn with permission of Oxford University Press. p(nlinks to other domains)

1

0.1

0.01

0.001

1 10 100 nlinks to other domains

Incorporating statistical checks for functional similarity may thus be useful for improving the prediction of protein–protein interactions. Sequences of large proteins often seem to have evolved by joining preexisting domains in new combinations, “domain shuﬄing,” that involves domain duplication or domain insertion. Remarkably, it was found that the number of domain–domain connections follows a power law (Figure 5.14). The majority of highly connected InterPro domains, the highly connected hub domains,

133

Table 5.7 The 10 most highly connected InterPro domains of Methanococcus, E. coli, yeast, Caenorhabditis elegans, Drosophila, and human. Methanococcus

Yeast

E. coli

Domain

kv

SAM Fer4 FMN-enzymes

10

SAM

NAD-binding

9

Fer4

AA-TRNA-ligase-1

8

AA-TRNA-ligase-II

Intein

7

Pyr-redox

7

C. elegans

Humans

Drosophila

Domain

kv

Domain

kv

Domain

Domain

kv

13

NAD-binding

20

P-kinase

18

P-kinase

11

Esterase

16

P-kinase-ST

18

EGF

57

P-rich extension

101

57

P-kinase

15

PH

16

PH

46

Zf-C2H2

13

Zf-C3HC4

14

EF-hand

45

Ank

12

AA-TRNA-ligase-II

14

Ank

37

EGF

FMN

12

EF-hand

14

P-kinase-st

35

HIS-KIN

11

C2

13

EGF-CA

kv

Domain

kv

ATP-GTP-A

169

GPCR-rhodopsin

162

53

P-rich extension

110

52

EGF

98

50

P-kinase

89

SH3

48

Ig

79

34

Antifreeze 1

46

PH

72 64

70

ATP-GTP-A

6

AA-TRNA-ligase-1

11

CPSase-L-chain

13

Zf-C3HC4

33

EF-hand

45

EF-hand

CBS

6

HIS-REC

10

GATase

13

Ig

30

PH

45

SH3

61

N6-MTASE

6

PAS

9

WD40

13

SH3

30

P-kinase-st

44

Zf-C2H2

58

Source: After Wuchty 2001. Drawn with permission of Oxford University Press.

Protein–Protein Interaction Networks – Pairwise Connectivity

appear in signaling pathways. Table 5.7 shows a list of the 10 best linked domains in various species. From left to right, the number of edges increases with increasing complexity of the organisms. At the same time, the number of signaling domains (PH, SH3), their ligands (proline-rich extensions), and receptors (GPCR/RHODOPSIN) increases. This shows that the evolutionary trends toward compartmentalization of the cell and multicellularity lead to a higher degree of organization. According to the Barabási–Albert model (Section 6.4), a so-called scale-free network is constructed by preferential attachment of newly added vertices to already well-connected ones. Along these lines, it was argued that vertices with many connections in metabolic network are metabolites originating very early in the course of evolution where they shaped a core metabolism. If we transfer this concept to the connectivity of domains, highly connected domains should also have originated very early. Is this true? In the simple organisms, Methanococcus and E. coli, the majority of highly connected domains are concerned with the maintenance of metabolism. None of the highly connected domains of higher organisms are found here. On the other hand, the enzyme helicase C has roughly similar degrees of connection in all organisms. Therefore, the BA model for network growth appears not applicable to this evolutionary scenario, although the domain–domain connectivity clearly follows a power law dependence. The expansion of protein families in multicellular vertebrates coincides with a higher connectivity of the respective domains. Extensive shuﬄing of domains to increase combinatorial diversity might give rise to a large repertoire of multidomain proteins. These could then carry out the complex variety of cellular functions without dramatically expanding the absolute size of the protein complement. Therefore, the greater proteome complexity of higher eukaryotes is not simply a consequence of the genome size but must also be a consequence of innovations in domain arrangements. Highly linked domains represent functional centers in various diﬀerent cellular aspects. They could be treated as “evolutionary hubs” that help to organize the domain space.

5.6 Summary A number of diﬀerent experimental high-throughput methods have been developed over the past 20 years to probe the “interactome” of a cell, i.e. all biomolecular interactions occurring under certain conditions. Unfortunately, the results of some methods have an error on the order of about 50%. Complementing these experimental techniques are computational methods that can be employed genome wide. Also, these methods are not very reliable per se. However, a statistical approach, Bayesian networks, enables to calibrate the accuracy of the individual approaches and obtain much more reliable results from combinations of various features. The protein–protein interaction network is nowadays considered as a highly connected assembly of cellular protein types. Biological functions emerge from the activity of individual proteins as well as from their coordinated interactions.

135

136

Principles of Computational Cell Biology

5.7 Problems 5.7.1

Bayesian Analysis of (Fake) Protein Complexes

One way to estimate whether a given combination of proteins is a potential complex or not is to use a Bayesian analysis. It allows determining probabilities (likelihood ratios) from the properties of known protein complexes. These likelihood ratios can then be used to estimate the probability whether a candidate is a potential complex. For Problems 1–3, fake binary complexes will be used where each of the two proteins has two properties: it belongs to one of the three compartments “A,” “B,” or “C,” and it has a mass. These two properties are encoded in the protein names as “Compartment” + “_” + “Mass.” For example, a protein-labeled “A_86” is localized to compartment “A” and has a mass of 86 units (you may think of kilodalton). To determine the likelihood ratios, you should download two “gold standard” data sets, GoldPos.dat and GoldNeg.dat from the book website. The ﬁles contain complexes that either deﬁnitely occur or do not occur, respectively, plus two “experimental” sets Exp1.dat and Exp2.dat. These sets, which have a certain overlap with the gold standard data sets, contain both true and false complexes at a variable ratio, i.e. the experiments were performed at diﬀerent levels of accuracy. Consequently, these sets are of diﬀerent sizes, too. As an estimate of the initial probability Oprior , you may assume that the relative sizes of the gold standard sets resemble the natural distribution of complexes and noncomplexes. 1. Likelihood ratios from the theoretical properties Use the gold standard data sets to determine likelihood ratios for the following properties: (a) Compartment For each complex in the gold standard data sets, determine the respective compartments of the two partners. Sort them into six categories: AA, BB, CC, AB, AC, and BC, where both proteins belong to compartment A, B, or C, or one is in A and the other in B, etc. Note that it does not matter which of the two partners is found in A and which in B. From the relative occurrences of the gold standard data sets in these six classes, determine the likelihood ratio Lcomp . (b) Mass diﬀerence For the gold standard sets, determine the absolute mass diﬀerence Δm = |m1 − m2 | between the two proteins and sort Δm into four categories with 0 ≤ Δm < 11, 11 ≤ Δm < 23, 23 ≤ Δm < 35, and 35 ≤ Δm. Use these numbers to determine the respective values for Lmass . 2. Likelihood ratios from the “experiments” Use a fully connected Bayesian scheme to determine the likelihood ratios Lexp from the “experimental” data sets Exp1.dat and Exp2.dat. This gives

Protein–Protein Interaction Networks – Pairwise Connectivity

C1

E1

C2

E2

Figure 5.15 Resulting gel from a ﬁctitious TAP experiment (see Problem 4).

you four categories for various combinations of positive and negative outcome for the complexes from the gold standard. Try to judge the quality of the experiments by determining the likelihood ratios for both experiments independently, i.e. from the overlap between only one experiment and the gold standard sets. 3. Identifying complexes (a) Test set For all potential complexes in the test set of test_small.dat, give the likelihood ratios for the properties Lcomp , Lmass , and Lexp and the ﬁnal probability Opost that it is a true complex. Start from a reasonable Oprior . Indicate for each potential complex whether you classify it as a true or as a false complex. (b) Re-evaluation with the gold standard sets. Now run the Bayesian analysis with the theoretical ratios Lcomp and Lmass alone on the gold standard positive and negative sets and count

137

138

Principles of Computational Cell Biology

how many false negatives and false positives you ﬁnd in the two sets, respectively. Give the percentage how many of the complexes in these two sets are classiﬁed correctly. Is the classiﬁcation biased; if yes, in which direction? 4. Analyze a TAP experiment Here, you will analyze the results of an electrophoresis gel and determine the masses of the fragments from a protein with a total weight of 100 kDa. For better resolution, the gels C1 and E1 were run a second time, three times as long (Figure 5.15). C1 and C2 are two controls with a calibration set that contained the masses 5, 12, 20, 42, 90, and 120 kDa, whereas for E1 and E2, the fragmented complex was used. For E1 and E2, two diﬀerent preparation protocols were used, which lead to diﬀerent sets of fragments. First, assign the masses in the control gels C1 and C2 and quantify the relation between distance traveled and mass. With this relation, determine, assign, and tabulate the masses of the fragments in E1 and E2. Name the fragments with capital letters starting from A for the smallest fragment. Note that some of the bands for larger masses may be due to parts of the protein that still consist of multiple subunits. Describe your observations and try to ﬁgure out which subunits of the complex stick together better.

Bibliography Experimental Methods Shoemaker, B.A. and Panchenko, A.R. (2007a). Deciphering protein–protein interactions. Part I. Experimental techniques and databases. PLOS Computational Biology 3: e42.

Protein–Protein Interactions Goh, C.S. and Cohen, F.E. (2002). Co-evolutionary analysis reveals insights into protein–protein interactions. Journal of Molecular Biology 324: 177–192. Hu, P., Janga, S.C., Babu, M. et al. (2009). Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biology 7: e1000096. von Mering, C., Krause, R., Snel, B. et al. (2002). Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417: 399–403. Rajagopala, S.V., Sikorski, P., Kumar, A. et al. (2014). The binary protein–protein interaction landscape of Escherichia coli. Nature Biotechnology 32: 285–290. Wuchty, S. and Uetz, P. (2014). Protein–protein interaction networks of E. coli and S. cerevisiae are similar. Scientiﬁc Reports 4: 7187.

Protein–Protein Interaction Networks – Pairwise Connectivity

Bioinformatic Prediction Methods Aloy, P. and Russell, R.B. (2006). Structural systems biology: modelling protein interactions. Nature Reviews Molecular Cell Biology 7: 188–197. Enright, A.J. and Ouzounis, C.A. (2001). Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biology 2: research0034.1. Garzón, J.I., Deng, L., Murray, D. et al. (2016). A computational interactome and functional annotation for the human proteome. Elife 22: 5. Shoemaker, B.A. and Panchenko, A.R. (2007b). Deciphering protein–protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLOS Computational Biology 3: e43. Zhang, Q.C., Petrey, D., Deng, L. et al. (2012). Structure-based prediction of protein–protein interactions on a genome-wide scale. Nature 490: 556–560. Zhang, Q.C., Petrey, D.I., Garzón, J.I. et al. (2013). PrePPI: a structure-informed database of protein–protein interactions. Nucleic Acids Research 41: D828–D833.

Bayesian Network Jansen, J., Yu, H., Greenbaum, D. et al. (2003). A Bayesian networks approach for predicting protein–protein interactions from genomic data. Science 302: 449–453.

Protein Domain Networks Schlicker, A., Huthmacher, C., Ramirez, F. et al. (2007). Functional evaluation of domain–domain interactions and human protein interaction networks. Bioinformatics 23: 859–865. Wuchty, S. (2001). Scale-free behavior in protein domain networks. Molecular Biological Evolution 18: 1694–1702.

139

141

6 Protein–Protein Interaction Networks – Structural Hierarchies This chapter follows up on the previous chapter that provided an introduction to protein interaction networks. Now is the time to organize all the various data, detect and categorize topological properties of such networks, and possibly relate them to biological function.

6.1 Protein Interaction Graph Networks As discussed in Chapter 3, a large amount of information on protein–protein interactions in biological cells is available over the past few years. As organisms contain between 400 and 30 000 protein-coding genes, a large number of pairwise protein interactions are possible in principle. How can all this information be conveniently stored for subsequent analysis? As already introduced in Section 4.3, one way of doing this is by representing the known interactions as a two-dimensional n × n matrix with entries “1” for the known interactions and entries “0” everywhere else (Figure 6.1b). Considering that the 6000 yeast proteins form about 60 000–100 000 interactions (see Section 5.4.1), less than 0.6% of all ﬁelds would be ﬁlled. Representing the data as a matrix would be quite wasteful in terms of computer memory and would make ﬁnding interaction partners quite ineﬃcient. Instead, it is preferable to use a mathematical data structure that also allows for the visualization of data connectivity. For example, a biologist may suspect that a certain protein is related to cancer and wonders which the interaction partners of this protein are. Mathematical graphs introduced in Chapter 4 are an ideal data structure to address such questions. Biomolecular interactions are typically bidirectional. When molecule A binds molecule B, B also binds A. Therefore, protein interaction networks should be represented by undirected graphs where vertices (nodes) stand for proteins and edges for known interactions (Figure 6.1a). The graph representation also aids in detecting signatures of putative permanent protein complexes (Figure 6.1c). 6.1.1

Degree Distribution

The degree of a vertex speciﬁes the number of edges by which it is linked to other vertices. The degree distribution of a graph is a function measuring the Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

142

Principles of Computational Cell Biology

B

E

A A

A

C

B

1

C

1

D

1

E

B

C

D

1

1

1

B

E

1 A

1

D

D

(a)

(b)

(c)

Figure 6.1 (a) Protein A interacts with proteins B, C and D. B also interacts with protein E. In this example, it is not clear whether any of these interactions occur simultaneously or independently. (b) Interaction matrix representing the connectivity of the network on the left. (c) Proteins A, B and C all interact with each other, suggesting that they may form a permanent complex ABC.

p(k)

(a)

(b)

1

2

3

4

k

Figure 6.2 (a) Cubic lattice and (b) the corresponding distribution p(k).

total number of vertices of a certain degree in this graph. Formally, the degree distribution is ∑ 1 1, p(k) = n v ∈V ∣deg(v )=k i

i

where vi is a vertex in the set V of the n vertices of the graph and deg(vi ) is the degree of vertex vi . This expression simply counts how many vertices have degree k. Figure 6.2 shows a simple example. In the two-dimensional lattice in Figure 6.2, each vertex has exactly four neighbors. p(k) is zero except for k = 4. What is the virtue of this representation? It is an abstraction of the network topology. If you were given p(k) of a square lattice – with its single peak at 4, respectively, 6 in three dimensions – you would quickly be able to draw the corresponding network. Although this is not very

0.15

1

0.12

0.1 p(k)

p(k)

Protein–Protein Interaction Networks – Structural Hierarchies

0.9

0.01

0.6

0.001

0.3

0.0001

3

6

9

12

k

(a)

143

1

15 (b)

10 100 1000 k

Figure 6.3 Degree distribution in random network (a) showing a Poisson distribution and for a scale-free network (b).

exciting for simple examples like this one, it becomes extremely valuable for complicated networks involving thousands of vertices and edges. The degree distribution p(k) is a common way of classifying graphs into categories. The p(k) of a random graph (see Section 6.3) has the shape of a Poisson distribution (see Section 14.1.2) where most vertices have an average number of connections; sparsely connected and densely connected vertices are equally unlikely (Figure 6.3). In scale-free networks (see Section 6.4), p(k) follows a power law: p(k) = k −𝛾 ,

𝛾 ∈ ℜ+ ,

so that p(k) decays much slower for large k than the exponential decay of a Poisson distribution. The implication of this is that highly connected “hubs” occur at a much larger frequency than expected in random graphs. Box 6.1 Technical Comment Why does a scale-free network appear as a straight line with a negative slope on a log–log plot? Let us start from the given power law dependence: p(k) = k −𝛾 . Taking the logarithm on both sides gives log p(k) = log(k −𝛾 ) = − 𝛾 • log k. Therefore, plotting log p(k) on the y-axis against log(k) on the x-axis will give a straight line with the slope −𝛾 as seen in Figure 6.3b.

6.1.2

Clustering Coeﬃcient

In 1998, Duncan J. Watts and Steven Strogatz introduced the clustering coeﬃcient of mathematical graphs as a measure of whether or not the graph has the

144

Principles of Computational Cell Biology

2

2

3

1

(a)

3

1

4

2

4

c=1

(b)

c = 1/3

3

1

4 (c)

c=0

Figure 6.4 Example illustrating the clustering coeﬃcient on an undirected graph for the shaded vertex i. Dotted lines connect the shaded vertex with three white vertices, its neighbors. The black edges connect neighbors of i among each other. In the panel (a) example of a fully connected subgraph, three out of three possible connections are formed between the white neighbor vertices. Thus, the clustering coeﬃcient of the black vertex equals 1. Vertices 1–2–3, 1–2–4, and 1–3–4 form triangles in the panel (a). The panel (b) contains one triangle (1–3–4) and the panel (c) same three triples as the panel (a).

topology of a small-world network. As an example, consider the group formed by all your N friends. The clustering coeﬃcient describes whether your friends are also friends among themselves. When your group of friends is a “clique” (see below), all of its members will also be friends of each other. The clustering coefﬁcient is deﬁned for vertex i (which is you in the example) and considers the connectivity of its neighbors (here, your friends) among each other (Figure 6.4). Let us consider a graph with n vertices V = v1 , v2 , …, vn and a set of edges E, where eij denotes an edge between vertices vi and vj . The neighborhood N i of a vertex vi is deﬁned as its immediately connected neighbors: Ni = {vj }∶eij ∈ E. As introduced before, the degree k i of vertex vi is the number |N i | of vertices in its neighborhood N i . By convention, |⋅| counts the numbers of vertices in the vertex set {vj }. The clustering coeﬃcient C i of vi is deﬁned as the ratio of the observed number of edges between its neighbor vertices over the maximum possible number of edges k i (k i − 1)/2 that could be formed between them (Figure 6.4): Ci =

2|{ejk }| ki (ki − 1)

∶vj , vk ∈ Ni , ejk ∈ Eundirected .

In this case, |⋅| counts the numbers of edges in the edge set {ejk }. By way of construction, where the actual number of edges is normalized by the maximum number of edges, 0 ≤ C i ≤ 1. For a directed graph, ejk is distinct from ekj so that k i (k i − 1) arcs could exist among the vertices within the neighborhood N i . In that case, the clustering coeﬃcient is equal to Ci =

|{ejk }| ki (ki − 1)

∶vj , vk ∈ Ni , ejk ∈ Edirected .

We can also describe the clustering coeﬃcient in an alternative way. If vertices vj and vk are neighbors of vertex vi , and if they are connected by an edge (arc) (j, k), then the three edges (i, j), (j, k), and (i, k) form a triangle, a cyclic structure with

Protein–Protein Interaction Networks – Structural Hierarchies

three vertices and three edges (Figure 6.4a). Let 𝜆G (vi ) be the number of triangles involving vi ∈ V (G) for an undirected graph G. Let 𝜏 G (vi ) be the number of triples on vi ∈ V with two edges and three vertices, one of which is vi and such that vi is the end point of both the edges (Figure 6.4b). Obviously, each triangle is also a triple. Therefore, the clustering coeﬃcient can also be written as Ci =

𝜆G (vi ) . 𝜏G (vi )

Since 𝜏G (vi ) =

1 k (k − 1), 2 i i

and 𝜆G (vi ) = |{ejk }|, as deﬁned above, the two preceding deﬁnitions are equivalent. The clustering coeﬃcient is equal to 1 if and only if every direct neighbor vertex of vi is also connected to every other direct neighbor vertex of vi . Turned around, the clustering coeﬃcient of vi is equal to 0 if none of its neighbor vertices is linked to any other of its neighbor vertices. According to Watts and Strogatz, the clustering coeﬃcient for the full network is obtained as the average clustering coeﬃcient of its n vertices: 1∑ C= C. n i=1 i n

This average clustering coeﬃcient can be used as another measure of network topology.

6.2 Finding Cliques In graph theory, a clique in an undirected graph G consists of a set of vertices V whereby each pair of vertices in V is connected by an edge (Figure 6.5). Equivalently, the subgraph induced by V can be termed a complete graph. The size of a clique is equal to the number of vertices it contains. The task of ﬁnding the largest clique of a graph G is termed the “clique problem.” This problem is NP-complete (Section 4.2.1). It is unlikely that there exists an eﬃcient algorithm for ﬁnding the largest clique of a graph. In a protein interaction network, cliques often correspond to permanent multiprotein complexes as will become clearer in later chapters. A k-clique is a clique of size k. The k-clique problem, therefore, is the task of ﬁnding a clique of size k, i.e. a complete subgraph G′ (V′ , E′ ) of G with |V′ | = k. Obviously, a k-clique can be found using a brute-force algorithm in O(nk ) time. One simply starts with all vertices [O(n)] and tests whether they have one connection [O(n2 )]… up to k − 1 connections [O(nk )] that yields a clique of size k when including the vertex itself. Another brute force algorithm inspects each subgraph with at least k vertices and checks whether it is a clique. An alternative

145

146

Principles of Computational Cell Biology

6 5 1

4

2 3 (a)

(b)

Figure 6.5 A clique in a graph is a set of pairwise adjacent vertices or a fully connected subgraph. Panel (a) shows a clique of size 5. In the graph at (b), vertices 1, 2, and 5 form a clique because each vertex is connected to the two other vertices.

algorithm starts from cliques of size 1 corresponding to the individual vertices. These cliques are then merged into larger and larger cliques until no further mergers are possible (Spirin and Mirny 2003). In the case of protein interactions, it is convenient to start from known binding pairs of proteins, e.g. A–B and C–D. If the interactions A–C, B–C, A–D, and B–D exist as well, this gives already a clique of size 4. Generally speaking, one can combine two cliques A and B when each vertex in clique A is connected to each vertex in clique B. Checking whether these connections exist can be done in linear time. However, some large clique may be missed in case 2 or more of its parts were already merged with vertices that do not belong to the clique. Yet, the strategy manages to identify at least one maximal clique, which is a clique that is not contained in any larger clique.

6.3 Random Graphs As mentioned in Section 1.1, a random graph describes a graph Gn,p of n vertices joined by edges that all have been chosen and placed between pairs of vertices at random. In Gn,p , each possible edge is present with probability p and absent with probability 1 − p. Then, the average number of edges in Gn,p is n(n − 1) p. 2 As each edge connects two vertices, the average degree of a vertex is n(n − 1)p = (n − 1)p −−−−→ np. n→∞ n In the 1950s, Erdös and Renyi studied how the expected topology of a random graph with n vertices changes as a function of the number of edges m. When m is small, the graph is likely fragmented into many small connected components (Section 4.1) having vertex sets of size at most O(log n). As m increases, the components grow at ﬁrst by linking to isolated vertices and later by fusing with other components. A transition happens at m = n/2, when many clusters cross-link

Protein–Protein Interaction Networks – Structural Hierarchies

spontaneously to form a unique largest component called the giant component. Its vertex set size is much larger than the vertex set sizes of any other components. It contains O(n) vertices, whereas the second largest component contains O(log n) vertices. In statistical physics, this phenomenon is called percolation and describes, for example, at what density of edges between lattice points in a planar lattice, a continuous path exists between two opposite walls. The shortest path length between any pairs of vertices in the giant component grows like log n. In a network with n = 1 000 000 vertices, log 106 = 6. Therefore, any two vertices have only a few edges in between them, and this is why these graphs are called small worlds. The properties of random graphs have been studied very extensively (Bollobas 2004). However, it turned out that random graphs are no adequate models for real-world networks because real networks appear to have a power law degree distribution (while random graphs have Poisson distribution), and real networks show strong clustering while the clustering coeﬃcient of a random graph is C = p. Can one cure this deﬁciency by manipulating the connectivities in a random graph to generate a power law degree distribution while leaving all other aspects as in the random graph model? Yes, given a degree sequence (e.g. a power law distribution), one can generate such a tweaked random graph by ﬁrst assigning to a vertex i a degree k i from the given degree sequence. Then, pairs of vertices are chosen uniformly at random to make edges so that the assigned degrees remain preserved. When all the degrees have been used to make edges, the resulting graph is a random member of the set of graphs with the desired degree distribution. However, this method does not allow specifying the clustering coeﬃcient at the same time. On the other hand, this property makes it possible to exactly determine many properties of these graphs in the limit of large n. For example, it can be shown that almost all random graphs with a ﬁxed degree distribution and having no vertices of degree smaller than two possess a unique giant component.

6.4 Scale-Free Graphs In 1999, Albert and Barabási published a paper in the journal Science that completely changed our understanding of complex networks (see Section 1.1.3). In this paper, a new network growth algorithm was introduced to generate networks with a scale-free topology. The Barabási–Albert ( BA) algorithm starts an initial connected network with n0 vertices and m0 edges. At every step t, a new vertex v is added, and m edges (m < n0 ) will be linked from v to the existing vertices with a preferential attachment probability ki Πi (ki ) = ∑N−1 j=1

kj

,

where k i is the degree of the ith vertex and the summation runs over all vertices present so far. Eventually, after t iterations, the graph will have (n0 + t) vertices and (n0 + mt) edges. It was found that if either growth or preferential attachment is eliminated from the BA algorithm, the resulting network does not exhibit scale-free properties anymore.

147

148

Principles of Computational Cell Biology

Considering the properties of networks generated by the BA model, the average path length in the BA model is proportional to ln n/ln(ln n), which is shorter than that in random graphs. Therefore, scale-free networks are termed ultrasmall worlds. As the degree distribution decays polynomial, many more hubs exist having a very high degree than is expected for random graphs, see Section 6.1.1. In the BA model, p(k) ∝ k −𝛾 with 𝛾 = 3, whereas real networks often show a softer decay with 𝛾 ≈ 2.1–2.4. The numerical result for the clustering coeﬃcient of the BA model is C ≈ n−0.75 . The original deﬁnition of the BA model left a number of mathematical details open. Bollobás and colleagues proposed the so-called linearized chord diagram to resolve these problems, making the model more amenable to mathematical approaches (Bollobás et al. 2001). Scale-free networks are resistant to random failures of arbitrary hubs (robustness) because a few high-degree hubs dominate their topology. Intuitively, we understand that an arbitrary failing vertex likely has a small degree so that the remaining network will be little aﬀected. On the other hand, scale-free networks are quite vulnerable when their hubs are attacked in a dedicated way. Validating ﬁndings on the average path length and the size of the giant component were made in numerical simulations and by analytical considerations. A classical result from the work of Barabási is shown in Figure 6.6. The authors investigated the protein interaction network of the yeast Saccharomyces cerevisiae based on the data from yeast two-hybrid experiments. Figure 6.6 shows the giant component of this interaction network. The authors color coded all proteins by information about their essentiality – green for nonessential, yellow for unknown, and red for essential proteins. In agreement with the vulnerability hypothesis, hub proteins were much more likely to be essential than nonhub proteins. If we consider the connectivity of protein–protein interaction networks (PPINs) over time and with spatial resolution, one detects both party hubs, which are proteins that make simultaneous contacts with most of their binding partner proteins, and date hubs, which are proteins that bind other proteins at diﬀerent times or locations (Figure 6.7). These observations suggest a concept of organized modularity where date hubs organize the proteome by connecting various biological processes (or modules) to each other, whereas party hubs belong to individual modules. We will end this section by noting that the BA model is a minimal model that captures the mechanisms responsible for the power law degree distribution observed in real networks. However, there may exist other models that explain the same phenomenon equally well. One discrepancy is the ﬁxed exponent of the predicted power law distribution (𝛾 = 3). There exist variants of the BA construction algorithm with cleaner mathematical properties. These may include eﬀects of adding or rewiring edges as this allows vertices to age so that they can no longer accept new edges or vary the forms of preferential attachment. These models also predict exponential and truncated power law degree distributions in some parameter regimes.

Protein–Protein Interaction Networks – Structural Hierarchies

Figure 6.6 Result of one of the ﬁrst large-scale analyses of the protein interactions in a biological network. Each vertex corresponds to a protein from S. cerevisiae. Edges reﬂect interactions detected in Y2H experiments. The plot shows the largest connected cluster that contains about 78% of all proteins. The vertex color reﬂects the phenotypic eﬀect of a gene deletion mutant where the gene coding for the respective protein was knocked out. Green circles reﬂect nonlethal gene deletions, red ones lethal mutations, orange ones lead to slow growth, and the yellow vertices reﬂect gene deletions where the eﬀect is unknown. Source: Jeong et al. (2001). Reprinted with permission of Springer Nature.

6.5 Detecting Communities in Networks Several distinct statistical properties are shared by many seemingly unrelated types of networks, including social networks (such as acquaintance networks and collaboration networks), technological networks (such as the Internet, the world wide web, and power grids), and biological networks (such as neural networks and metabolic networks). These properties include the “small-world eﬀect” (Section 1.1.2), a right-skewed degree distribution that often has a power law distribution (Figure 6.3b), and the tendency to cluster. In this section, we will focus on a fourth property common to many networks, the existence of a community structure. “Community” may stand here for module, class, group, cluster, etc. We deﬁne community as a subset of vertices within the graph such that the connections between the vertices are denser than the

149

150

Principles of Computational Cell Biology

Date hub

Party hub

Party hub

Figure 6.7 In this schematic protein interaction networks, proteins are colored according to mutual similarity in their mRNA expression patterns. “Party” hubs are highly correlated in expression with their partners and presumably interact with them at similar times. The partners of “date” hubs exhibit reduced coexpression, and presumably the corresponding physical interactions occur at diﬀerent times and/or diﬀerent locations. Source: Han et al. (2004). Drawn with permission of Springer Nature.

connections with the rest of the network (Figure 6.8). The detection of community structure is generally intended as a procedure for mapping the network into a tree (“dendogram” in social sciences). The leaves of the tree represent vertices; branches join vertices or (at higher level) groups of vertices. A traditional method to perform this mapping is hierarchical clustering. For every pair i, j of vertices in the network, a weight W ij is computed, which measures how closely connected the vertices are. Starting from the set of all vertices and no edges, edges are iteratively added between pairs of vertices in the order of decreasing weight. In this way, vertices are grouped into larger and larger communities, and the tree is built up to the root, which represents the whole network. What measures could be used together with hierarchical clustering to detect communities in a given network? (1) The number of vertex-independent paths between vertices. Two paths that connect the same pair of vertices are said to be vertex independent if they share none of the same vertices other than their initial and ﬁnal vertices. (2) The number of edge-independent paths deﬁned in the same spirit. A classical result from graph theory is that the number of vertex-independent (edge-independent) paths between two vertices i and j in a graph is equal to the minimum number of vertices (edges) that must be removed from the

Protein–Protein Interaction Networks – Structural Hierarchies

Figure 6.8 Three communities formed by densely connected vertices (circles with solid lines). There are far fewer connections between nodes in diﬀerent communities (gray lines).

graph to disconnect i and j from one another (theorem of Menger, 1927). These numbers are, therefore, a measure of the robustness of the network to deletion of vertices (edges). (3) One may also count the total number of paths that connect vertices i and j (not just those that are vertex- or edge-independent). As the number of paths between any two vertices is either 0 or inﬁnite, one typically weighs paths of length l by a factor 𝛼 l with small 𝛼 so that the weighted count of their number of paths converges. In doing so, long paths contribute to exponentially less weight than short paths. Although these vertex- or edge-dependent path deﬁnitions for weights work okay for certain community structures, there generally exist some problems with all these measures. In particular, counting of both vertex- and edge-independent paths has a tendency to separate single peripheral vertices from the communities to which they should rightly belong. If a vertex is, for example, connected to the rest of a network by only a single edge, then, to the extent that it belongs to any community, it should clearly be considered to belong to the community

151

152

Principles of Computational Cell Biology

at the other end of that edge. Unfortunately, both the numbers of independent paths and the weighted path counts for such vertices are small and, hence, single vertices often remain isolated from the network when the communities are constructed. This and other pathologies make the hierarchical clustering method, although useful, far from perfect. One may also measure the centrality of a vertex within a graph based on its betweenness (Freeman 1977). In a preprocessing step, shortest paths are computed between all vertices of the network. Vertices that belong to many shortest paths between other vertices have a higher betweenness than other nodes that lie only on few short paths (Figure 6.9). For a graph G: = (V , E) of n vertices, vertex v has a betweenness C B (v) of ∑ s≠v≠t∈V 𝜎st (v) CB (v) = . (n − 1)(n − 2) where 𝜎 st (v) = 1 if the shortest path from s to t passes through v and 0 otherwise.

Figure 6.9 Modiﬁed version of Figure 6.8 where the three dark vertices mediate the contacts between the three communities. Obviously, all shortest pathways between members of diﬀerent communities would run through one or two of these vertices making them vary “between” in the sense of the deﬁnition given in the text.

Protein–Protein Interaction Networks – Structural Hierarchies

Alternatively, we may focus on those edges that are least central, that are “between” communities. In analogy, edge betweenness of an edge is deﬁned as the number of shortest paths between pairs of vertices that run along this particular edge. If there is more than one shortest path between a pair of vertices, each path is given equal weight such that the total weight of all of the paths is 1. If a network contains communities or groups that are only loosely connected by a few intergroup edges, then all shortest paths between diﬀerent communities must go along one of these few edges. The edges connecting communities will have high-edge betweenness. By removing these edges ﬁrst, we are able to separate groups from one another and so reveal the underlying community structure of the graph. This is the basis of the Girvan–Newman algorithm introduced in 2002: 1. Calculate the betweenness for all m edges in a graph of n vertices (can be done in O(mn) time). 2. Remove the edge with the highest betweenness. 3. Recalculate the betweenness for all edges aﬀected by the removal. 4. Repeat from step 2 until no edges remain. Because step 3 has to be done for all edges over and over again, the algorithm requires in the worst case O(m2 n) operations. 6.5.1

Divisive Algorithms for Mapping onto Tree

A tree may be constructed from the computed edge betweenness by reversing the order of tree construction compared to an agglomerative algorithm. By starting from the full graph, edges with highest betweenness are iteratively discarded. In this manner, the graph is progressively divided into smaller and smaller disconnected subnetworks that are considered as communities. The critical point in this recipe is how to select the edges that should be discarded. The Girvan and Newman algorithm is one way of doing so. Figure 6.10 shows an example given in Girvan and Newman (2002) for the clustering of a social network that is based on the well-known karate club study of Zachary. Over a period of two years, he observed the personal relationships between 34 members of a karate club. At some point during the study, the administrator of the club and the club’s instructor started having a disagreement. Finally, the instructor left the club and started a new club. About half of the original club’s members moved with him to the new club. Zachary generated a network of friendships between the members of the club. As is shown in Figure 6.10b, hierarchical clustering performs quite poorly and cannot identify the two “camps” based on the knowledge of pairwise friendships. The Girvan–Newman algorithm, however, solves this task almost perfectly (Figure 6.10c). Only the gray vertex 3, which is a sort of in between the camps, is clustered closest with the white vertex 29. The Girvan–Newman algorithm suﬀers from a speed problem that results from requiring the repeated evaluation of a global property, the betweenness, for each edge whose value depends on the properties of the whole system. This becomes computationally very expensive for networks with, for example, more than 10 000 vertices. A faster algorithm was devised in Radicchi et al. (2004)

153

154

Principles of Computational Cell Biology

who introduced a divisive algorithm that only requires the consideration of local quantities. To identify edges that connect vertices belonging to distinct communities, they used a newly introduced edge-clustering coeﬃcient. This measure is deﬁned as the ratio of the number of triangles to which a given edge belongs to over the putative number of triangles that might include the edge, considering the degrees of the adjacent vertices. The edge pointing from vertex i to vertex j has an edge-clustering coeﬃcient of Ci,j(3) =

[

(3) +1 zi,j

min (ki − 1), (kj − 1)

],

where zi,j (3) is the number of triangles including the edge and min[(k i − 1), (k j − 1)] is the maximal possible number of triangles. In the numerator, 1 is added to zi,j (3) to remove the degeneracy for zi,j (3) = 0. With zi,j = 0, the actual values of k i and k j would otherwise be irrelevant. Edges connecting vertices in diﬀerent communities are included in few or no triangles and tend to have small values of C i,j (3) . On the other hand, many triangles exist within clusters (Figure 6.11).

23 27

15 10

13

16 20

5

31

14

34

30 21

6

4

17

9

33

7 19

3 29

1

2

11

24 32

28

22

8

(a)

26

25

18

12

Figure 6.10 (a) The friendship network from Zachary’s karate club study. The instructor and the administrator are represented by vertices 1 and 34. Vertices associated with the club administrator’s fraction are drawn as white circles and those associated with the instructor’s fraction are drawn as gray squares. (b) Hierarchical tree calculated by using edge-independent path counts, which fails to extract the known community structure of the network. (c) Hierarchical tree showing the complete community structure for the network calculated by using the Girvan–Newman algorithm. The initial split of the network into two groups is in agreement with the actual fractions observed by Zachary, except for the misclassiﬁed vertex 3. Source: Girvan and Newman (2002). Drawn with permission of PNAS.

Protein–Protein Interaction Networks – Structural Hierarchies

1 2 3 33 34 4 32 914 24 8 3128 30 5 6 7 11 20 29 26 25 1318 22 10 1715 1619 21 23 27 12 (b)

Figure 6.10 (Continued)

By considering higher order cycles, one can deﬁne coeﬃcients of order g (g)

(g) Ci,j

=

zi,j + 1 (g)

si,j

,

where zi,j (g) is the number of cyclic structures of order g the edge (i, j) belongs to and si,j (g) is the number of possible cyclic structures of order g that can be built given the degrees of the vertices. For every g, a detection algorithm is used that works exactly as the Girvan–Newman method with the diﬀerence that, at every step, the removed edges are those with the smallest value of C i,j (g) . By considering increasing values of g, one can smoothly interpolate between a local and a nonlocal algorithm.

6.6 Modular Decomposition A very important aspect of biology is categorization. Grouping the millions of various plants and animals into categories, subcategories, etc., made up large

155

156

Principles of Computational Cell Biology

3 29 25 28 33 34 30 24 31 9 23 21 19 16 15 26 32 27 10

4 14 2 1 8 22 20 18 13 12 7 17 6 5 11

(c)

Figure 6.10 (Continued)

parts of zoology and botany. On the one hand, collectors like to be able to quickly get access to a particular species, and a categorization scheme facilitates ﬁnding this particular species. Another, scientiﬁcally deeper, aspect is that a well-planned grouping provides insights into the organization of the individual elements. By uncovering which plants are related, we may understand which of them have a common ancestor and may then have similar metabolisms, etc. Likewise, in biological cells, grouping/categorization of the individual elements provides a much deeper level of understanding. Chapter 5 was mostly concerned with discovering and enumerating all protein–protein interactions. In this chapter, we have started to develop a hierarchical way of looking at the interactome. Here, we will now address the aspect of modularity. The notion that biological cells are organized into separate functional modules may come as a surprise. We just started appreciating the fact that almost every protein seems to interact with every other protein, and one may wonder how this ﬁts to the idea that cells are rather organized in a strictly hierarchical way where protein interactions are mostly conﬁned to within distinct modules. The notion of a biological module is still debated, and its deﬁnition may undergo further reﬁnement in the coming years. Hartwell et al. (1999) deﬁned a functional module as a discrete entity whose function is separable from those of other modules. The ability to separate distinct modules from each other relies on chemical isolation, which may be due to spatial localization or chemical

Protein–Protein Interaction Networks – Structural Hierarchies

Figure 6.11 The left vertex has a degree of 7 and the right vertex has a degree of 5. The continuous lines denote the edges formed and dashed lines are those that could be formed additionally between the neighbors. Two triangles are currently formed including both vertices out of the min [(ki − 1), (kj − 1)] = 4 that would be maximally possible for the given degrees of the two vertices. In both cases, one is subtracted as the central edge is already required to connect the two vertices. In this example, C i,j (3) = (2 + 1)/4 = 0.75.

speciﬁcity. We will see an example for such modularity in Section 15.2 that presents a modular computational model for the simple organism Mycoplasma genitalium. We will now have a look into the well-deﬁned concept of modular decomposition and then discuss its applicability to biological networks. 6.6.1

Modular Decomposition of Graphs

Tandem aﬃnity puriﬁcation (TAP) (see Section 5.1.3) showed that protein complexes can share components. Proteins can be reused and participate in several complexes. Methods for analyzing high-throughput protein interaction data are mainly based on clustering techniques. They are being applied to assign protein function by inference from the biological context as given by their interactors and to identify complexes as dense regions of the network. The TAP-MS method cannot reveal the logical organization into shared and speciﬁc components. Shared components are proteins or groups of proteins occurring in diﬀerent complexes. Shared components are fairly common. A shared component may be a small part of many complexes where it acts as a unit that is reused over and over again because of its biological function. On the other hand, it may form the main part of a complex, for example, in a family of variant complexes that only diﬀer from each other by a few proteins providing them with functional speciﬁcity. The modularity of PPINs can be illustrated by characterizing the shared components and how they are organized into complexes. For this, we will follow Gagneur et al. (2004).

157

158

Principles of Computational Cell Biology

c

a

f

d

g

e

b

Figure 6.12 A graph and its modules. In addition to the trivial modules {a}, {b}, …, {g} and {a, b, c, …, g}, this graph contains the modules {a, b, c}, {a, b},{a, c},{b, c}, and {e, f }. Source: From Gagneur et al. (2004). Reproduced with permission of BioMed Central Ltd.

In graph theory, vertices connected by an edge are called neighbors. A module contains several vertices, all having exactly the same neighbors outside the module (Figure 6.12). To reveal the hierarchical organization of the graph, all the elements of a module that have exactly the same neighbors outside the module can be substituted for a representative vertex. In such a quotient, all the elements of the module are replaced by the representative vertex, and the edges with the neighbors are replaced by edges to the representative (Figure 6.13). Quotients can be iterated until the entire graph is merged into a ﬁnal representative vertex. The sequence of iterating quotients can be captured in a “modular” tree. Each node of this tree represents a module that is a subset of its parent and the set of its descendant leaves. The modular decomposition of the example graph in Figure 6.13 gives a labeled tree that represents iterations of particular quotients, here the successive quotients on the modules {a, b, c} and {e, f }. The modular decomposition is a unique, canonical tree of iterated quotients. The vertices of the modular decomposition are categorized in three ways. In a series module, the direct descendants are all neighbors of each other (labeled by an asterisk within a circle). In a parallel module, the direct descendants are all non-neighbors of each other (labeled by two parallel lines within a circle). In a prime module, the structure of the module is neither a series nor parallel (labeled by a P within a circle). The graph can be retrieved from the tree in Figure 6.13(B) by recursively expanding the modules using the information in the labels. Therefore, the

Protein–Protein Interaction Networks – Structural Hierarchies

c

a d

f

g

f

g

∗

e

{a,b,c}

d

e

b (b) (a)

g P

∗ {a,b,c}

d

{e,f}

{a,b,c,d,e,f,g}

(d)

(c) (A) ∗ d

g

∗

(B)

Figure 6.13 (A) Modular decomposition and (B) resulting tree. Vertices a, b and c are collapsed into one representing vertex {a, b, c}. The edges with their common neighbors (a, d), (b, d), and (c, d) are replaced by one edge between d and {a, b, c}. Then, f and e are replaced by {e, f } and, ﬁnally, the whole graph by one vertex. Source: From Gagneur et al. (2004). Reproduced with permission of BioMed Central Ltd.

159

Principles of Computational Cell Biology

labeled tree can be seen as an exact alternative representation of the graph. This modular decomposition has been successfully applied to real protein interaction networks. The aim is that complexes are identiﬁed as modules. We have discussed in Section 5.1.9 that diﬀerent experimental techniques probe diﬀerent aspects of protein–protein interactions. For example, the yeast two-hybrid screen detects direct physical interactions between proteins, whereas protein complex puriﬁcation (PCP) by TAP with mass spectrometric identiﬁcation of the protein components identiﬁes multiprotein complexes. Therefore, the molecular decomposition will have a diﬀerent meaning because of diﬀerent semantics of such graphs. One problem faced by this approach is the currently incomplete nature of interaction networks. The hierarchical modular decomposition scheme is most powerful when applied to almost complete interaction sets or subsets. What is now the value of this approach? Parallel modules are often encountered in cases where related complexes contain diﬀerent variants of one type of proteins. One example is the protein phosphatase 2A that belongs to a family of serine/threonine phosphatase complexes. Each such complex is formed as a trimer of Tpd3 that serves as a structural anchor, one representative of two regulatory domains, Rts1 or Cdc55, and one representative of two catalytic domains, Pph21 or Pph22 (Figure 6.14). Altogether, four combinations are possible for the

Tpd3 Pph21

Tpd3 Rts 1

Pph22

Tpd3 Pph21

Rts 1

Tpd3 Cdc55

Pph22

MOD

160

Tpd3

Rts1

Cdc55

Pph21

Pph22

Cdc55

Figure 6.14 Modular decomposition of four alternative phosphatase 2A complexes. In this example, parallel modules group functionally equivalent proteins of which each complex contains either one or the other. These are the catalytic proteins Pph21 and Pph22 and the regulatory proteins Cdc55 and Rts1. Source: From Gagneur et al. (2004). Reproduced with permission of BioMed Central Ltd.

Protein–Protein Interaction Networks – Structural Hierarchies

complex. Modular decomposition of these combinations illustrates clearly the logical organization of the related complexes. According to the decomposition tree, a full complex must contain three components. This is expressed by the series module at the top. One of these components must be Tpd3. The other two components are either–or options between the two regulatory subunits Rts1or Cdc55, on the one hand, and between two regulatory subunits Pph21 or Pph22, on the other hand. These “rules” are expressed by the two parallel modules. These functional relationships are not apparent if one only inspects a list of pairwise protein–protein interactions that would be a typical outcome of a Y2H or TAP-MS experiment.

6.7 Identiﬁcation of Protein Complexes Given that there now exist various experimental data sets on the physical interactions between proteins of model organisms, computational research groups focus on organizing them as PPINs and on developing clustering approaches that identify dense areas in these networks. Such dense regions may then be considered as candidates of physical and/or functional protein complexes. 6.7.1

MCODE

The MCODE algorithm (Bader and Hogue 2003) operates in three stages. In the ﬁrst stage of MCODE, locally dense regions of the graph are identiﬁed. We ﬁrst need to introduce a bit of nomenclature. The density of a graph, G = (V , E), or of a subgraph, is deﬁned as the ratio of the number of edges |E| over the theoretical maximum number of possible edges, |E|max . A k-core is a graph G of minimal degree k (∀v in G, deg(v) ≥ k). The highest k-core of a graph is then its most densely connected subgraph. Using these deﬁnitions, MCODE computes for all vertices the highest k-core of the vertex neighborhood and uses this as a measure of their local network density. Furthermore, the core-clustering coeﬃcient of a vertex v stands for the density of the highest k-core of the immediate neighborhood of v (vertices that are directly connected to v) including v. (Note that C i does not include v.) The weight assigned to a vertex by MCODE is computed as the product of the vertex core-clustering coeﬃcient and the highest k-core level of the immediate neighborhood of the vertex. This scheme emphasizes the weight of densely connected vertices. In the second stage, MCODE constructs the molecular complexes. Using the weighted graph constructed in stage 1 as input, the algorithm selects the highest weighted vertex as the start vertex of the ﬁrst complex. In a recursive manner, it adds neighboring vertices of the start vertex to the complex if their weight exceeds a preset threshold that is set to a certain fraction of the weight of the start vertex. Once a neighbor vertex is added, its neighbor vertices are recursively checked in the same way and potentially added to the complex as well. Every vertex is tested only once because complexes are not allowed to overlap in this phase. Construction of a complex is completed when none of the remaining

161

162

Principles of Computational Cell Biology

neighbor vertices has a weight exceeding the threshold. The algorithm then selects among the remaining unseen vertices of the graph the one with highest weight and repeats the same process to build the next complex. In this manner, complexes are generated in the regions of the graph with highest density. In the third stage, the obtained results are ﬁltered and scored. All complexes need to include at least one subgraph of minimum degree 2 (a 2-core). The remaining complexes are scored and ranked. The complex score is computed as the product of the number of vertices in the complex subgraph |V | and the density of the complex subgraph DC . MCODE has a polynomial time complexity of O(nmh3 ) with n vertices, m edges, and h being the number of vertices of the average vertex neighborhood in the input graph. This complexity is due to the vertex-weighting step. k-Cores in a graph are identiﬁed by progressively deleting nodes with a degree smaller than k until all remaining vertices are connected to each other by degree k or more. Computing all node degrees in a graph takes O(n2 ) computations. Finding the highest k-core can be done in a bottom-up manner. Starting from k = 1, one identiﬁes k-cores until all vertices have been labeled. The number of iterations cannot exceed the maximal degree in the graph which is n. Thus, ﬁnding the highest k-core requires – in general – O(n3 ) computations. Here, this procedure is carried out in the neighborhood of a vertex, which contains a lower average number of vertices, say h. Thus, O(n3 ) becomes O(h3 ). The inner loop of the algorithm is applied twice to every edge in the input graph, making it O(2mh3 ). The outer loop works once on all vertices in the input graph. Therefore, the total time complexity of the weighting stage is O(n 2mh3 ) = O(nmh3 ). Stage 2 is O(n). The optional post-processing step where 2-cores are found once for each complex can take up to O(cs2 ) computations, with c equal to the number of complexes identiﬁed in the previous step and s equal to the number of vertices in the largest complex. 6.7.2

ClusterONE

The method ClusterONE consists of three major steps (Nepusz et al. 2012). Given a start vertex, further vertices are added or deleted in a greedy manner to identify sets with high cohesiveness. For a group of proteins V in a graph, the cohesiveness f is computed as f (V ) =

win (V ) win (V ) + wbound (V )

where win (V) is the summed edge weights of links connecting current members of V (internal edges), and wbound (V) is the summed edge weight of links between members of V and nodes in the remaining graph (boundary edges). In Figure 6.15, these terms are illustrated for a case example. The ﬁgure also explains the meaning of incident and boundary proteins. Subgraphs with high cohesiveness are more densely connected than other regions of the graph and well separated from the surrounding. The cohesiveness is optimized stepwise by adding and deleting those incident/boundary proteins that are most beneﬁcial. Several, potentially overlapping protein groups are grown by starting repeatedly from diﬀerent start

Protein–Protein Interaction Networks – Structural Hierarchies

F B

D

E C

A

Figure 6.15 Schematic protein interaction network to illustrate cohesiveness measure (see text). For convenience, all edges have unit weight. The current members of the cohesive subset are V = {C, D, E}. Their internal edges are marked as solid lines, and boundary edges are marked as dashed lines. Boundary edges can be thought to span a boundary (shown as a circle) that separates the current dense subset V from the remaining network. This border deﬁnes the set of incident proteins V inc = {B, F}, external vertices adjacent to the boundary, and boundary proteins V bound = {D, E}, which are internal vertices at the boundary. For the given example win (V) = 3, wbound (V) = 4 and f (V) = 37 .

vertices. Groups having large overlaps relative to their size are merged. In the ﬁnal step of the algorithm, complex candidates are removed having fewer than three proteins or when their density is lower than a set threshold. 6.7.3

DACO

The combinatorial approach DACO (short for “domain-aware cohesiveness optimization”) combines a local optimization of cohesiveness based on weighted protein interaction data with a structural consideration of the involved binding interfaces at the level of domain interaction data (Will and Helms 2014). DACO needs as input a probability-weighted PPIN, a list of proteins that are used to seed the growth, a threshold to generate the seed pairs, and an upper bound for the depth of search to keep the combinatorial explosion on a local level. Figure 6.16 illustrates the deﬁnitions for incident and boundary proteins on the domain level. The algorithm then checks on the basis of the weighted protein interactions for all incident proteins whether the cohesiveness can be increased by adding them to V . The same test is applied for removing all boundary proteins. Among all possible modiﬁcations V ′ , it selects the one that maximizes the cohesiveness. Every iteration can have three outcomes: (i) the algorithm could terminate and return the current complex candidate because no further increase is possible, (ii) the addition of a protein may be the optimal choice, or (iii) the removal of a protein could yield the largest gain. In the latter case, the boundary protein P is removed from V , leading to V ′ =V \ {P}. Additionally, the domains occupied by the distinct spanning edge that connected P to the remaining cluster V ′ are made available again. For the yeast protein interaction network, the DACO algorithm suggested around 10–100 times more protein complex candidates involving transcription factors than other methods such as MCODE and ClusterONE. However,

163

164

Principles of Computational Cell Biology

B

A

D

C

E

Figure 6.16 The gray nodes in this domain–domain interaction network are the proteins currently included in V = {A, C, E}, and the black edges show how the current dense cluster is connected on the domain level. Incident nodes V inc = {B, D} are those nodes that can be connected to V by a new domain interaction edge (colored in gray) to an unused domain of an internal protein (gray domains). Boundary nodes V bound = {A, E} are proteins in V with only one used domain.

DACO not only generated many complexes but also did well in reproducing reference data sets of known complexes. The biological plausibility of the results was validated in terms of colocalization and functional homogeneity within complexes. For example, in the special case of TF complexes, a high degree of in vivo localization to the nucleus was observed for all proteins within the same predicted complex as expected. Functional homogeneity was tested as the fraction of complex candidates with at least one enriched gene ontology (GO) annotation at a signiﬁcance level of P = 0.05 (Bonferroni corrected). 6.7.4

Analysis of Target Gene Coexpression

Causal links between TF tuples and their regulatory targets were analyzed on the basis of expression data. In general, genes that are regulated by the same control mechanism can be expected to exhibit a highly similar expression pattern in a certain condition. For the 290 predicted pairs and higher order tuples of TFs, coexpression of their target genes was analyzed using a time series of 32 time points for the yeast cell cycle. The aim of this was to determine signiﬁcantly cooperative TF tuples that are assumed to be decisive regulatory drivers. Here, a tuple of n TFs is assumed to be decisive if the coexpression of the target genes signiﬁcantly increases with the reﬁnement induced by binding site constraints of an nth TF. To quantify this measure, the coherence of expression was scored following Pilpel et al. (2001) in the following way. For a given set of K genes containing a particular motif or motif combination in their promoters and an expression data set, the Euclidean distances are calculated between the mean and variance-normalized expression proﬁles of each pair of genes. The expression coherence score (ECS) associated with a motif/motif combination is deﬁned as p/(number of possible pairs), where p is the number of gene pairs whose Euclidean distance is smaller than a threshold distance, which was taken as the typical Euclidean distance of randomly selected gene pairs.

Protein–Protein Interaction Networks – Structural Hierarchies

4 3

Normalized expression

2 1 0 –1 –2 –3 –4 0

5

10

15

20

25

30

Time points

Figure 6.17 Cell cycle expression proﬁles of all genes targeted by MET4 or MET32 (gray) are compared with the set of target genes where MET4 and MET32 bind as a colocalized complex to the two binding sites at pairwise distance between −50 and 50 bp (black). The targets of a colocalized complex have a signiﬁcantly higher expression coherence than the targets of the individual TFs.

Seventeen higher order TF combinations led to a signiﬁcantly increased ECS in the context of the yeast cell cycle (PdECS < 0.05) for a certain binding mode. As an example, Figure 6.17 shows the complex-induced reﬁnement in expression coherence among the target genes of the TF pair MET4/MET32. Seventy-six percent of the corresponding target gene sets were signiﬁcantly enriched with speciﬁc GO biological process annotations (P < 0.05, Bonferroni corrected) (Table 6.1). Not surprisingly, all signiﬁcant tuples were associated with either cell cycle control itself or with metabolic processes that are in cross talk with the cell cycle during normal growth.

6.8 Network Growth Mechanisms What is the “true” growth mechanism of real biological networks? Is it at all important to know this? One may argue that this is indeed the case, but it will be quite diﬃcult to ﬁnd this out as with any phylogenetic analysis. Traditionally, the growth of protein networks has been modeled by the duplication-divergence mechanism. Here, a duplication of a vertex reﬂects the duplication of the corresponding gene and a divergence or loss of redundant edges or functions is a consequence of gene mutations. Although this view emerges from our understanding about the evolution of genomes, it is unclear

165

166

Principles of Computational Cell Biology

Table 6.1 Predicted TF combinations with a signiﬁcant increase of expression coherence (PdECS ) among their mutual targets comprising 15 TF pairs and two triples.

TF pair

PdECS

GO process enrichment Binding Regulatory (P < 0.05, Bonferroni corrected) in targets mode Targets inﬂuence

MET4/MET32

0.0010 coloc.

19

+

Methionine metabolic process

TBP/HAP5

0.0335 med.

47

+

/

GLN3/DAL80

0.0009 med.

28

/

Allantoin catabolic process

DIG1/STE12/SWI6 0.0369 All

15

/

Fungal-type cell wall organization

FHL1/RAP1

0.0001 coloc.

116

+

rRNA transport

RPH1/GIS1

0.0001 med.

100

−

Hexose catabolic process

CBF1/MET32

0.0002 coloc.

33

O

Sulfate assimilation

DIG1/STE12

0.0003 med.

34

−

Response to pheromone

GCN4/RAP1

0.033

62

+

/

MSN4/MSN2

0.0021 med.

med.

105

+

Oligosaccharide biosynthetic process

DAL80/GZF3

0.0044 med.

20

−

Purine nucleobase metabolic process

SWI6/SWI4

0.0039 med.

53

+

Regulation of cyclin-dependent protein serine/threonine kinase activity

STB1/SWI6

0.0275 All

47

+

/

TBP/SWI6

0.0159 med.

14

+

/

GLN3/GZF3

0.0120 adj.

31

/

Allantoin catabolic process

MBP1/SWI6/SWI4 0.0307 med.

18

+

Regulation of cyclin-dependent protein serine/threonine kinase activity

MBP1/SWI6

25

/

Cell cycle process

0.0124 adj.

“Binding mode” refers to the relative position of the transcription factor binding site (TFBS) motifs. “All” Are those with shared target proteins without distance constraints. “coloc.” Means that the positions are within 50 bp from each other, and “adj.” are directly adjacent targets with TFBS distance of 0–10 bp. In the case of “med.,” the complex involves further bridge proteins, and the TFBS have pairwise distances between 10 and 50 bp. Only the most enriched GO process term is listed for each target set. The inferred regulatory inﬂuence on the rate of transcription is abbreviated as follows: + (increase), − (decrease), O (no statement possible), / (conﬂicting annotations). Source: Data from Will and Helms (2014).

whether this growth mechanism also applies to the evolution of PPINs. One way of ﬁnding this out may be to analyze the topological properties such as the degree distribution, clustering coeﬃcient, and mean path length of a large number of artiﬁcial networks resulting from this growth principle. However, it turned out that networks generated by diﬀerent growth algorithms (see below) all have very similar overall topological properties.

Protein–Protein Interaction Networks – Structural Hierarchies

Another strategy in exploring the properties of growth algorithms is to investigatel their modular structure and distribution of various subgraphs or motifs. As an example, we will use here the results by a study of Middendorf et al. (2005) who analyzed the protein–protein interaction map for Drosophila by Giot et al. (2003) derived from high-throughput data. One typical problem with data from high-throughput experiments, which we also mentioned in Chapter 3, is that the data set is subject to numerous false positives. Giot et al. (2003) were able to assign a conﬁdence score p ∈ [0, 1] to each interaction measuring how likely the interaction occurs in vivo. However, what threshold p* should be used to distinguish true from questionable interactions? With our background in topological analysis of networks gained so far, we expect that the network should contain one giant component. Therefore, Middendorf et al. (2005) measured the size of the components for diﬀerent values of p*. At p* = 0.65, the two largest components became connected. Therefore, they used this value as the threshold. Edges in the graph correspond to interactions for which p > p*. After removing self-interactions and isolated vertices, this resulted in a network with 3359 (4625) vertices and 2795 (4683) edges for p* = 0.65 (0.5). The duplication–mutation–complementation (DMC) algorithm is based on a model proposing that most of the duplicate genes observed today have been preserved by functional complementation. If either the gene or its copy loses one of its functions (edges), the other becomes essential in assuring the survival of the organisms. In the algorithm, a duplication step is followed by mutations that preserve functional complementarity. At every time step, a vertex v is chosen at random. A twin vertex vtwin is introduced copying all of v’s edges. For each edge of v, either the original edge is deleted with probability qdel or its corresponding edge of vtwin . Twins cojoin themselves with independent probability qcon , representing an interaction of a protein with its own copy. No edges are created by mutations. The DMC algorithm therefore assumes that the probability of creating new advantageous functions by random mutations is negligible. A variant of DMC is the duplication–random mutations (DMR) algorithm where possible interactions between twins are neglected. Instead, edges between vtwin and the neighbors of v can be removed with probability qdel , and new edges can be created at random between vtwin and any other vertices with probability qnew /N, where N is the current total number of vertices. DMR emphasizes the creation of new advantageous functions by mutation. Other models tested involved linear preferential attachment (after Barabási-Albert), random static networks (after Erdös–Renyi), random growing networks (growing graphs where new edges are created randomly between existing vertices), aging vertex networks (growing graphs modeling citation networks, where the probability for new edges decreases with the age of the vertex), and small-world networks (interpolation between regular ring lattices and randomly connected graphs). The aim of the Middendorf study was to ﬁnd out which network graph algorithm generates networks that best resemble the “true” Drosophila network. (We will disregard, for the moment, that the experimental network is error prone too.) As was mentioned before, networks generated by these diﬀerent algorithms could all be parameterized to produce networks having the same overall topological

167

168

Principles of Computational Cell Biology

S1

S2

S3

S4

S5

S6

S7

S8

S9

S10

S11

S12

S13

S14

S15

S16

S17

S18

S19

S20

S21

S22

S23

S24

S25

S26

S27

S28

S29

S30

S31

S32

S33

S34

S35

S36

S37

S38

S39

S40

S41

S42

S43

S44

S45

S46

S47

S48

S49

S50

S51

Figure 6.18 Fifty-one representative subgraphs of length 8 (out of 148 subgraphs in total). Source: Middendorf et al. (2005). Reprinted with permission of PNAS.

properties as the real network. The idea was therefore to quantify the ﬁne structure of the generated networks. Therefore, 1000 graphs with the same number of edges and vertices as measured in Drosophila were created by the seven network growth algorithms. The topology of the obtained networks was quantiﬁed by counting all possible subgraphs up to a given cutoﬀ, which could be the number of vertices, number of edges, or the length of a given walk. Middendorf et al. (2005) counted all subgraphs that can be constructed by a walk of length 8 (148 nonisomorphic subgraphs) (Figure 6.18). The average shortest path between two vertices of the Drosophila network’s giant component is 11.6 (9.4) for p* = 0.65 (0.5). This means that walks of length 8 can traverse large parts of the network. It turned out that for 60% of the subgraphs (S1–S30), the counts for Drosophila are closest to the DMC model. All of these subgraphs contain one or more cycles, including highly connected subgraphs (S1) and long linear chains ending in cycles (S16, S18, S22, S23, and S25). The DMC algorithm was the only mechanism that produces such cycles with a high occurrence. The protein interaction network of Drosophila was thus conﬁdently classiﬁed as DMC network. This method of quantifying the ﬁne structure of graph networks allows inferring growth mechanisms for real networks with conﬁdence. The method is robust against noise and data subsampling and requires no prior assumption about network features/topology.

Protein–Protein Interaction Networks – Structural Hierarchies

6.9 Summary This chapter introduced several topological descriptors of graph networks. Using this language, the biological protein interaction network is characterized by its small-world property, by the occurrence of network hubs, by the eﬀect of clustering, and by the property of community structure. The concept of communities (sometimes also termed “modules,” although the notion of a “biological module” is being used in diﬀerent ways) aims at connecting groups of proteins with a common biological function. In contrast to this, protein complexes refer to proteins that physically bind to each other. Ongoing challenges are to capture condition-speciﬁc protein interactions in various diﬀerentiation stages of mammalian cells or in disease states. Also, a mostly uncharted area is considering the eﬀects of post-translational modiﬁcations of proteins and of alternative splicing.

6.10 Problems 1.

Existing networks Characterize with a short explanation the following examples of networks into the categories introduced in Sections 6.3 and 6.4. Hint: Some of the examples might ﬁt into more than one category. If so, explain your choice. • Telephone system. • Network of highways. • The physical backbone of the Internet (cables, routers, computers, etc.). • The world wide web. • European airports connected by direct ﬂights. • Your own social network … … when you were in school (living at home) … now as a student.

2.

The random network (a) Implement the algorithm given in Section 6.3 to generate random graphs. Start from a given number of vertices and add one edge after the other. Take care not to add the same edge twice. Store the constructed network in a ﬁle. Hint: A simple yet eﬃcient representation of the network is a list with a cell for each vertex, which itself holds a list of the vertices connected to this vertex (and vice versa – why both?) In standard python, you can use the following code to deﬁne and initialize a two-dimensional list: net = [0] net = node * net for i in range(node): net[i] = []

169

170

Principles of Computational Cell Biology

Now you can assign values to the individual sublists with net[i]. append(k) for an edge between vertices i and k. Don’t forget the entry for k ≥ i. Hint: It is a good idea to read the variable parameters (number of vertices and edges) from the command line. Hint: To store the network topology, save the edges into a ﬁle. Each line then lists the (index of the) two vertices. Take care not to list the same edge twice (only print an edge when i < k). (b) Determine and plot the degree distribution of the random network created above. Hint: Implement a tool that reads in the network from part (a). You can avoid to explicitly save the network, when the network creation tool writes to stdout (standard output via, e.g. print) and this tool reads from stdin (standard input). Then, you just need to connect the two tools via a pipe. To plot the degree distribution, pipe the output of this tool into a ﬁle. On a Mac or Linux box, the easiest way to plot this ﬁle is using gnuplot. (“createNetwork [parameters] | degreeDist [parms] > outputfile”). Hint: First, determine the largest number of edges that occurs and initialize an array of that size. This array then holds the degree distribution. Now loop over the network list with the edge count (degree) of the vertices and increment the corresponding cells of the second array. Finally, print out the degree distribution from this array with proper normalization. (c) Verify that the degree distribution obeys the Poisson distribution P(k) with a mean value of 𝜆: 𝜆k P(k) = e−𝜆 k! Calculate 𝜆 from the numbers of vertices and edges and determine the Poisson distribution for a suﬃciently large range of k. Plot P(k) together with the degree distributions of the random network for the following numbers of vertices and edges (each given in the form vertices/edges): Plot 1

50/100

500/1 000

5 000/10 000

50 000/100 000

Plot 2

20 000/5 000

20 000/17 000

20 000/40 000

20 000/70 000

Plot 3

5 000/40 000

13 000/40 000

25 000/40 000

50 000/40 000

Describe each plot and explain the diﬀerence (or the trend) between the diﬀerent parameter sets. Do not forget to label the axes. 3.

The scale-free network (a) Implement the BA algorithm given in Section 6.4 to generate a scale-free network. Start with two connected vertices and add vertices sequentially by preferentially connecting the new edges to those vertices that already have more edges.

Protein–Protein Interaction Networks – Structural Hierarchies

Hint: To implement the preferential attachment, use a list that contains the vertices that are already connected (each vertex occurs in that list as often as its degree). When you add a new edge attach() the indices of the two vertices to the end of this (growing) list. For the next edge, randomly choose a vertex from that list. Why does this recipe give the desired preferential attachment? (b) Determine the degree distribution of the scale-free network created above. Plot the degree distribution for a network of 100 000 vertices with double logarithmic axes and determine the exponent 𝛾. Which region of degrees can you use to ﬁt the exponent 𝛾? Which problems do you encounter? To extend the range for ﬁtting the exponent, sum up P(k) and compare ∑ P(k ′ ), S(k) = k ′ ≤k

to the corresponding integral of the power law F(k) =

∫

dk ck −𝛾 .

First, verify that F(k) =

c 1−𝛾 k + F0 1−𝛾

where F 0 is an integration constant. Plot (1 − S(k)) and F(k) into the same plot. How many vertices in the network are now necessary for a comparable ﬁtting range (to ﬁt the exponent, also adjust c and F 0 )? Which main diﬀerence between the two ways to plot P(k) (directly versus integrated) do you observe? Why is this diﬀerence helpful? (c) Create a random network with the same number of vertices and edges as the scale-free network of 100 000 vertices and plot the degree distributions of both networks into the same plot. For which combination of logarithmic and/or linear axes is the diﬀerence between the degree distributions of the two networks seen best? Identify and explain the two major diﬀerences between the degree distribution of the random and of the scale-free network. 4.

Biological interaction networks The BioGRID repository (thebiogrid.org) contains many known interactions between proteins for many diﬀerent species. To set up a protein interaction network for a given species, perform the following steps: (a) Download the interactions from the download area of the BioGRID database. (b) Write a python script that creates a histogram of the taxon identiﬁers from those interactions, where the two partners are both proteins. Which are the top ﬁve species that have the largest number of these interactions in BioGRID? Give their taxon identiﬁers, their scientiﬁc names, and the respective number of occurrences.

171

172

Principles of Computational Cell Biology

Hint: Check the taxon identiﬁer ﬁle for the highest occurring number and initialize an array of that size. Then, read the interaction ﬁle line by line and split each line at the tabs to get the two taxon IDs. For each of them, increment the corresponding entry of the array due to whether the types are protein or small molecules. Alternatively, you can use a python dictionary. Note that a handful of taxon identiﬁers in the interaction ﬁle are not listed in the taxon overview. Hint: There is the class of small molecules, which have an ID of 0. They are not a species on their own! (c) For each of the top ﬁve species from (b), parse the interaction ﬁle to pick all interactions that belong to this species. Pick those interactions where both partners belong to the chosen species or where one belongs to the chosen species and the other one is a “small molecule.” Count the total number of interaction partners that are known for this species, regardless of their type. Hint: Look for one species at a time, i.e. run the script once for each species that you want to extract from the interaction ﬁle. Hint: Read footnote 5 of the README ﬁle – there may be either missing accession codes or missing molecule IDs for the proteins. A workaround is to create a new label for the proteins by concatenating the accession code and the molecule ID. This ensures that each protein is labeled uniquely. Convert the labels into integers, print out the interactions one per line, and pipe the output into a ﬁle for each of the top ﬁve species. Hint: Use a python dictionary (or a hash in Perl) to translate the protein identiﬁers into integers. For each new interaction, check whether any of the two proteins is already stored in that dictionary to avoid duplicate entries. For any protein not yet in the dictionary, increase a counter and store the counter value with the protein identiﬁer as the key. (d) Use the tool implemented in Problem 2 to determine the degree distribution of the protein interaction networks of the top ﬁve species. Do you see any diﬀerence between the degree distributions of their interaction networks? Are they more like the scale-free network or more like the random graph? Explain your observations. Hint: Plot the integrated P(k). 5.

Clustering coeﬃcient: scale-free versus random network (a) The average clustering coeﬃcient in a random or a scale-free network is claimed to be independent of the degree. Do an in silico experiment to verify this statement and proceed as follows. Implement a tool that (i) determines the cluster coeﬃcient C(k) of each vertex of a given network, (ii) averages the cluster coeﬃcients of vertices of the same degree, i.e. determines the average cluster coeﬃcient ⟨C(k)⟩, and (iii) averages over all C(k) to get the average degree-independent cluster coeﬃcient ⟨C⟩ of the network.

Protein–Protein Interaction Networks – Structural Hierarchies

Create two plots, one for the scale-free network á la Barabási and one for the random graph with the same number of edges, and plot C(k) (as a scatter plot), ⟨C(k)⟩ and ⟨C⟩ against k for networks of 400 000 vertices each. (b) Use the probability p = 2L/(N(N − 1)) for an edge between two arbitrary vertices and calculate an estimate for the probabilities that a vertex has k edges and for the cluster coeﬃcient C(k) of a vertex of degree k. Interpret the results. Does this estimate of C(k) reproduce the numerical results of (a)? Hint: Consider how many possible realizations of n links can occur between the k neighbors of a node. What is the probability for any of these conﬁgurations? 6.

Clusters of the scale-free network How many clusters are contained in a scale-free network constructed by the BA algorithm starting from a single node? What is the size of the largest cluster? Why?

7.

Cluster sizes and numbers (a) Determine the number of clusters N cl and the size of the largest cluster N max for a random graph of N = 100, 1000, 10 000, 100 000, and 1 000 000 vertices and twice as many edges and plot them against the size of the network, i.e. the number of vertices. To get more accurate results, create 10 networks of each size and average the results. What trends do you observe in the plots? Explain your observations. Hint: To identify the clusters, start from the ﬁrst vertex and assign it to the ﬁrst cluster. Then, follow all edges from there and assign the vertices connected to this ﬁrst vertex to the same cluster. Repeat from these vertices until you ﬁnd no more connected but unassigned vertices. Then, repeat this procedure from the ﬁrst unassigned vertex, which you assign to the second cluster. Repeat until all vertices are assigned to a cluster. Note that a vertex without any edges forms a cluster on its own. Hint: If you implemented a recursive algorithm to identify the clusters, you may run into a “maximum recursion depth exceeded” runtime error (or a similar error message). An iterative algorithm will work, though. (b) Check for the existence of the “spanning cluster” of a random graph. To do so, determine the size of the largest cluster N max and the number of clusters N cl of a random network for diﬀerent values of 𝜆 = 2L/N, i.e. for diﬀerent average degrees. For a random graph of 100 000 vertices, vary 𝜆 between 0 and 4 and create two plots, one with N max versus 𝜆 and one with N cl versus 𝜆. Do you observe any transition? If yes, at which value of 𝜆? To interpret your ﬁndings, determine the values and behavior at 𝜆 → 0 and at 𝜆 → ∞. Hint: The spacings between the values of 𝜆 need not be constant. Just choose enough (and sensible) values of 𝜆 so that the trend of N max and N cl in the plots is clear.

173

174

Principles of Computational Cell Biology

(c) Create 10 random networks, each with N = 200 and L = 400, 1200, and 2400, and determine the average numbers of clusters C(nc) for each 𝜆 = 2L/N. 8.

Clusters in biological networks (a) Read in the interactions listed in the BioGRID database (thebiogrid.org) for the fruit ﬂy, for the mouse, and for Escherichia coli and determine the histogram of cluster sizes P(C(k)), the size of the largest cluster N max , and the average cluster coeﬃcient ⟨C⟩ (see Problem 5). Hint: You can start from the interactions extracted in Problem 4, but do not limit them to proteins and small molecules this time. (b) To check the stability of these biological networks against directed attacks, take the interaction network of the mouse and determine the labels of the 200 vertices with the highest degrees. Compare the size of the largest cluster N max and the number of clusters N cl of the original network (i) to networks where you delete the 10, 20, 50, 100, or 200 vertices with the highest degrees and (ii) to networks, where you randomly delete the same numbers of vertices. How does the network behave?

9.

Theoretical estimate of the number of cliques Use the probability for a single edge in a random network of N vertices and L edges and derive an analytical estimate for the number of cliques C(nc) of size nc assuming that the network is large enough so that the cliques do not overlap. Tabulate the values of C(nc) for nc = 2, …, 10 with N = 1326 and L = 2548. How many of the vertices of the network belong to a clique?

10.

Cliques in model networks Adapt and implement the algorithm by Spirin and Mirny (Section 6.2) to search for cliques. Test the algorithm by identifying the cliques in the supplied testNetz.dat ﬁle. List all cliques found. (The network contains four cliques of size 2, two cliques of size 3, and one clique of size 5.) Hint: When reading the network ﬁle, check for duplicates. In the input ﬁles, the vertices are not necessarily sorted. Use the following modiﬁcations of the algorithm: • Start from cliques of size 2, not 4. This is both a runtime constraint and gives more data points to compare. You can save a lot of time by collecting the list of 2 cliques while you read in the network. • Combine the removal of redundant cliques and the search for additional vertices into the same step: if you ﬁnd a new vertex to extend an existing clique, then immediately mark the original (smaller) clique as obsolete. • After each extension step, remove the nonredundant smaller cliques, sort the list of new cliques, and get rid of duplicates.

Protein–Protein Interaction Networks – Structural Hierarchies

11.

Cliques in BIND networks Read in the supplied interaction networks that were extracted from the BIND database (interaction_*.dat), determine the numbers of cliques, and tabulate C(nc) for the occurring values of their size nc. List the members of the largest clique(s) only. Hint: In the interactions ﬁltered from the BIND database for each species – encoded in the ﬁle name via the taxon ID – either both molecules belong to the species or one is from the species and the other is a small molecule. The labels of the molecules are generated from the accession code and the molecule ID as in Problem 4. For each network, record the run time of the clique search. Which dependency do you expect? Explain your answer. Create diﬀerent plots where you plot the run time versus N, L, N × L, and N × C(2). Do these plots conﬁrm your expectation?

12.

Network communities A community is a part of a network that has more internal connections than to the rest of the network. In this problem, you will use the algorithm of Radicchi et al. (2004) to identify the communities of a given network (see Section 6.5.1). The edge-clustering coeﬃcient of an edge between vertices i and j was deﬁned in Section 6.5.1. (3) ̃i,j is inﬁnite. What is the (a) If one of the vertices has a degree of 1, then C maximal ﬁnite value that the edge-clustering coeﬃcient can take? For which conﬁguration does this occur? (b) Read the given scale-free network in the ﬁle sfnetz_1000.txt, determine the edge-clustering coeﬃcient for all occurring edges, and create a frequency histogram of the occurring values. Use a reasonable bin size. (3) ̃i,j , use a value of 10 times the maximal Hint: To denote an inﬁnite C occurring ﬁnite value from (a). Hint: The ﬁle includes the number of vertices on the ﬁrst line and then one edge per line.

13.

Network communities To determine the communities of the network supplied in the ﬁle HighSociety.txt, proceed in two steps (parts (a) and (b)). The overall process is easier to implement if you create a separate script for each of the two parts and save the intermediate results into a ﬁle. (a) Decomposition of the network (3) ̃i,j : Iteratively, delete the edges with the smallest C (i) Read in the network ﬁle. (3) ̃i,j for each edge. (ii) Calculate the edge-clustering coeﬃcient C (3) ̃i,j and delete it from the network. (iii) Find the edge with the smallest C Print out this edge.

175

176

Principles of Computational Cell Biology (3)

̃i,j , choose Hint: When you encounter multiple edges with the same C any of them. Does this actual choice make any diﬀerence in the ﬁnal result? (iv) Repeat from (ii) until there is no edge left. (b) Buildup of the communities and of the dendogram There are two criteria for a community (see Radicchi et al. 2004): (i) In a community in a strong sense, every single member of the subgraph V has more edges to the inside of the community (k in ) than to the outside (k out ): kiin (V ) > kiout (V )

∀i ∈ V ,

(ii) In a community in a weak sense, the total number of edges inside the subgraph V is larger than that outside: ∑ i∈V

kiin >

∑

kiout .

i∈V

Now, use the edges deleted in (a) in reverse order, i.e. the edge that was deleted last is now used ﬁrst to construct the communities. To do so, read in one edge after the other and check whether they have vertices in common with the already included edges. During this composition stage, you do not need to keep track of the edges, but only of the vertices that belong to the same subgraph. (i) If the latest edge is disjoint from the already processed edges, then start a new subgraph (list of vertices of this subgraph) from this one. (ii) If the latest edge has a single vertex in common with one of the existing subgraphs, then add the other vertex of this edge to that (list of the vertices of the) subgraph, too. (iii) If the two vertices of the latest edge belong to two diﬀerent subgraphs, then join the two subgraphs to form a single one from them. Print out the two lists of vertices that are joined in this step. After adding the last edge, you should end up with a single graph that contains all vertices of the network and a listing of the subgraphs just before they were joined to form larger ones. To draw the dendrogram of the network, look at the above choice (iii), the joining of two groups: start from the individual vertices and every time that this happens, connect two subgraphs. Hint: You may want to draw the dendrogram by hand. To identify communities, determine the two community criteria explained above each time after you added a new edge. If one of the two criteria (weak or strong) is met for one of the subgraphs, print out the list of vertices. For the weak criterion also, give the sums of the internal and external edges. Highlight these communities in the dendrogram. Do the communities obtained by the two criteria diﬀer? If so, what is the reason for this speciﬁc network?

Protein–Protein Interaction Networks – Structural Hierarchies

2

3 12

11

9

10

8 1

4 6

7

5

Figure 6.19 Partitioning based on edge betweenness (see Problem 14).

(c) Visualization of the communities The supplied ﬁle societyPositions.txt contains the ﬁnal positions from a force-directed layout of the “High Society” network. Use these coordinates to plot the communities that you identiﬁed in the previous exercise. Identify the vertices in at least one of the plots. Hint: The hierarchy of the communities that is captured in the dendrogram is best explained by creating multiple plots, starting from the smallest communities. Hint: Discern between the communities from the weak and from the strong criterion. 14.

Graph topology Use the Girvan–Newman algorithm to answer the following questions about the network shown in Figure 6.19. How many steps are required to have three disjoint components? (You may solve this problem either with a short program or by hand.) (a) Compute the edge betweenness for all the edges in the network. List the values in a table. (b) Remove the edge with highest betweenness value from the network. (c) Recalculate the edge betweenness values for all the edges in the remaining network. (d) Return to step (b) until the graph has no edges left.

15.

Network evolution Evolving networks are networks that change as a function of time, either by adding or removing nodes or links over time. (a) Write a function to read the data from the book website as an undirected graph. (b) Write a function that calculates the number of cliques of size 3, 4, and 5. (c) For each network provided at the book website, randomly insert or delete edges as a function of time (one edge per time, t = 100, so that the total of edges remain about constant). (d) Plot the number of cliques before and after each edge modiﬁcations as the function of time.

177

178

Principles of Computational Cell Biology

(e) Start from the original network and randomly shuﬄe the edges as mentioned below for 100 times: – For 2 × L steps, two edges e1 = (v1 , v2 ) and e2 = (v3 , v4 ) are randomly chosen from the network and rewired such that the start and end nodes are swapped, i.e. e3 = (v1 , v4 ) and e4 = (v3 , v2 ). – Determine using (P value < 0.05) whether clique motifs of size 3, 4, and 5 are signiﬁcantly enriched in the original network. In this, the P value is calculated as the ratio of the number of random times that a certain motif type is acquired more often than or equally often as in the real network.

Bibliography Network Topology Barabasi, A.L. and Oltvai, Z.N. (2004). Network biology: understanding the cell’s functional organization. Nature Reviews Genetics 5: 101–113. Bollobás, B., Riordan, O., Spencer, J., and Tusnády, G. (2001). The degree sequence of a scale-free random graph process. Random Structures and Algorithms 18: 279–290. Freeman, L.C. (1977). A set of measures of centrality based on betweenness. Sociometry 40: 35–41. Han, J.-D.J., Bertin, N., Hao, T. et al. (2004). Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature 430: 88–93. Jeong, H., Mason, S.P., Barabási, A.L., and Oltvai, Z.N. (2001). Lethality and centrality in protein networks. Nature 411: 41–42.

Random Graphs Bollobas, B. (2004). Random Graphs. London: Academic Press.

Finding Cliques and Clusters Spirin, V. and Mirny, L.A. (2003). Protein complexes and functional modules in molecular networks. Proceedings of the National Academy of Sciences of the United States of America 100: 12123–12128.

Protein–Protein Interaction Networks – Structural Hierarchies

Communities Girvan, M. and Newman, M.E.J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America 99: 7821–7826. Radicchi, F., Castellano, C., Cecconi, F. et al. (2004). Deﬁning and identifying communities in networks. Proceedings of the National Academy of Sciences of the United States of America 101: 5200–5205.

Modular Decomposition Gagneur, J., Krause, R., Bouwmeester, T., and Casari, G. (2004). Modular decomposition of protein–protein interaction networks. Genome Biology 5: R57. Hartwell, L.H., Hopﬁeld, J.J., Leibler, S., and Murray, A.W. (1999). From molecular to modular cell biology. Nature 402: C47–C52.

Protein Complex Prediction Tools Bader, G.D. and Hogue, C.W.V. (2003). An automated method for ﬁnding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 2. Nepusz, T., Yu, H., and Paccanaro, A. (2012). Detecting overlapping protein complexes in protein–protein interaction network. Nature Methods 9: 471–472. Pilpel, Y., Sudarsanam, P., and Church, G.M. (2001). Identifying regulatory networks by combinatorial analysis of promoter elements. Nature Genetics 29: 153–159. Will, T. and Helms, V. (2014). Identifying transcription factor complexes and their roles. Bioinformatics 30: i415–i421.

Network Evolution Giot, L., Bader, J.S., Brouwer, C. et al. (2003) A protein interaction map of Drosophila melanogaster. i Start Sciencei End 302: 1727–1736. Middendorf, M., Ziv, E., and Wiggins, C.H. (2005). Inferring networks mechanisms: The Drosophila melanogaster protein interaction network. Proceedings of the National Academy of Sciences of the United States of America 102: 3192–3197.

179

181

7 Protein–DNA Interactions Protein–DNA interactions belong to the most important biomolecular interactions in cells. Proteins that bind speciﬁcally to DNA can be transcription factors activating or repressing gene expression, enzymes involved in DNA repair or in chemical (epigenetic) modiﬁcations of DNA (see Chapter 11), proteins that pack or unpack the chromatin structure, proteins that help to unzip double-stranded DNA, DNA topoisomerases that are involved in DNA supercoiling, etc. From this long list, we will discuss here only the binding of transcription factors to DNA. Chapter 11 will mention some enzymes involved in epigenetics. Generally, protein–DNA interactions are stabilized by two types of binding forces. One of them is the electrostatic attraction between the negatively charged phosphate backbone of DNA and positively charged amino acids on the protein surface. This interaction involves only the DNA backbone and is thus mostly independent from the DNA sequence. The other attractive contributions are speciﬁc polar and nonpolar interactions between the nucleotide bases of particular DNA sequence motifs and their protein-binding partners. There exist two diﬀerent tasks related to discovering DNA-binding sites. Given a known DNA-binding motif, one may search for additional occurrences of this motif in the genomic sequence (this is the site search problem). Typically, several mismatches are allowed in the “hit” regions. Alternatively, one may try to discover novel DNA-binding motifs in a collection of sequences that have related biological functions (this is the sequence motif discovery problem).

7.1 Transcription Factors Transcription factors (TFs) are proteins with usually at least two structural domains. About 75% of the known TFs contain one DNA-binding domain and one activation domain. The activation domain is often sensitive to environmental conditions such as the concentration of ions or particular nutrients such as cyclic adenosine monophosphate (cAMP). The activation domain may also bind to other proteins. In this way, the TFs can provide ﬁne-tuned control of gene expression according to the particular cell state. In prokaryotes, the DNA-binding domains of TFs always bind to the DNA directly upstream of the transcription start site of the genes that they regulate. In eukaryotes, TFs bind Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

182

Principles of Computational Cell Biology

(a)

(b)

(c)

(d)

Figure 7.1 X-ray crystal structures of common structural topologies of eukaryotic transcription factors. (a) Zinc ﬁnger. Shown is the complex of mouse ZIF268 with DNA; PDB code 1ZAA. (b) Helix–loop–helix. Shown is the complex of the mouse E47 − NeuroD1-bHLH dimer in complex with the insulin promoter E-box sequence, PDB code 2QL2. (c) Leucine zipper. Shown is the complex of yeast GCN4 with DNA, PDB code 1YSA. (d) High-mobility group (HMG). Shown is the complex of mouse SRY-related box 18 (SOX18) with DNA, PDB code 4Y60. Figures were generated with NGL Viewer software (Rose and Hildebrand, 2015).

either to enhancer or promoter regions of genes. Depending on the particular transcription factor, and whether it helps in recruiting or repelling RNA polymerase to or from the transcription start site, the transcription of the adjacent gene is either up- or down-regulated. The most common structural topologies of eukaryotic TFs are the zinc ﬁnger, the helix–loop–helix, the leucine zipper, and the high-mobility group (HMG) (Figure 7.1). Most organisms contain hundreds up to a few thousands of transcription factors. Upper estimates can be obtained by counting the number of protein-coding genes in the genome that contain a DNA-binding domain. However, as mentioned above, not all of these proteins will necessarily be TFs. For example, Saccharomyces cerevisiae contains 245 genes with known DNA-binding domain, which is roughly 4% of all yeast genes. The human genome contains about 2600 genes with known DNA-binding domains, amounting to 11.8% of the human genes. Among these, about three quarters have been estimated to be TFs. In every human tissue, between 150 and 350 diﬀerent TFs are expressed. Transcription factors can only bind to dsDNA in an open chromatin conformation. Thus, their binding is intimately connected to the chromatin state of

Protein–DNA Interactions

the binding sequence. If the sequence is highly methylated at CG positions (see Section 11.1.1) or the bound histone proteins carry chemical modiﬁcations characteristic for the densely packed state of chromatin, then TF binding is physically impossible (see Chapters 9 and 10) because their binding sites on the DNA are simply not accessible to them. Binding of some transcription factors to DNA may induce strong curvature of the DNA distorting it from its canonical B-DNA shape. Such conformational eﬀects will not be discussed here.

7.2 Transcription Factor-Binding Sites A DNA region that forms a speciﬁc physical contact with a particular TF is termed transcription factor-binding site (abbreviated as TFBS). TFBSs are usually between 8 and 20 bp long and contain a core region of 5–8 bp of well-conserved nucleotide bases. The other positions adjacent to the core region show more sequence degeneracy. Most TFs bind in the major groove of double-stranded DNA, the others bind in the minor groove (Figure 7.1). The amino acid side chains at their binding interfaces make speciﬁc interactions, often hydrogen bonds with individual DNA bases. As the periodicity of double-stranded DNA is around 10 bp, the core regions of short TFBS motifs are a bit longer than half a turn of dsDNA. TFs may recognize DNA sequences that are similar but not identical, diﬀering by a few nucleotides. Experimental techniques (discussed in Section 7.3) reveal the multiplicity of DNA sequences that a particular TF may bind to. Some TFs bind to hundreds or even thousands of positions in the genome. As an example, Figure 7.2 shows sequence motifs that the two global TFs Yin Yang 1 (YY1) and CTCF bind to. Such sequence logos are a convenient way to visualize the degree of degeneracy in the TFBS.

7.3 Experimental Detection of TFBS Traditional experimental techniques to discover and analyze DNA-binding sites are the electrophoretic mobility shift assay (EMSA) and the DNAse footprinting assay. 7.3.1

Electrophoretic Mobility Shift Assay

An EMSA or gel shift assay is an aﬃnity electrophoresis technique for identifying speciﬁc binding of a protein–DNA or protein–RNA pair. The samples are electrophoretically separated on a polyacrylamide or agarose gel. The results are visualized by radioactive labeling of the DNA with 32 P or by tagging a ﬂuorescent dye. The control lane contains the DNA probe without protein. At the end of the experiment, a single band will show up that corresponds to the unbound DNA or RNA fragment. The other lane contains the DNA:protein mixture. If the protein actually binds to the DNA or RNA fragment, this lane will show an up-shifted band relative to the control lane, which is due to the larger and less mobile protein:DNA complex (Figure 7.3).

183

Principles of Computational Cell Biology

(a)

23

22

21

19

20

18

17

16

15

14

13

12

11

9

10

7 8

6

5

4

3

0

2

1

1

Bits

2

MEME (no 55C) 24 Ce, 11 00:11

13

11

12

9

8

7

6

5

4

10

(b)

3

0

2

1

1

Bits

2

MEME (no 55C) 24 De, 11 02:10

Figure 7.2 Sequence logos for the DNA-binding motifs that the transcription factors (a) YY1 (Yin Yang 1) and (b) CTCF (CCCTC-binding factor) bind to. In each case, motifs were derived from the top 500 TF ChIP-seq peaks by the ENCODE consortium. For YY1, 468 out of 500 sequences contained this motif, for CTCF 484 out of 500. Source: Wang et al. (2013). Taken with permission of Oxford University Press.

1

2

Figure 7.3 Resulting gel of an EMSA assay. The “shifted” band shows that the protein has bound to the studied DNA fragment.

3

Movement

184

Band “Shift”

7.3.2

DNAse Footprinting

DNAse footprinting is another electrophoresis method to characterize the sequence speciﬁcity of interactions between DNA and DNA-binding proteins in vitro. Before putting the solution onto the gel, a DNAse enzyme is added to the sample that cleaves DNA nonspeciﬁcally at many positions (Figure 7.4, middle row). On a polyacrylamide gel, the cleaved DNA fragments of diﬀering lengths will show up as diﬀerent lanes. In a second experiment, the protein of interest is added (right lane). If this protein binds speciﬁcally at a particular position of the DNA, it will prevent cleavage by DNAse at this position. Then, this DNA

Protein–DNA Interactions

1. PCR amplify

2. Add protein of interest

3. Cleave DNA

4. Run on denaturing polyacrylamide gel

Protected “footprint”

Figure 7.4 Main steps of DNAse footprinting assay.

fragment cannot be found on the gel (bottom, right lane) and thus represents the speciﬁc binding motif in the investigated DNA sequence for the protein. 7.3.3

Protein-Binding Microarrays

There exist also several high-throughput in vitro methods to measure the TF-DNA-binding aﬃnity of large numbers of DNA variants. One of them is

185

186

Principles of Computational Cell Biology

Double-stranded DNA microarrays

Bind epitope-tagged TF to dsDNA microarrays

GST

Label with fluorophore-tagged anti(epitope) antibody

SYBR Green

Scan triplicate microarrays

Calculate normalized PBM data

Figure 7.5 Schema of protein-binding microarray experiments.

a DNA microarray-based method called protein-binding microarray (PBM) (Berger and Bulyk 2006). With this technology, one can characterize the binding speciﬁcity of a single DNA-binding protein in vitro by adding it to the wells of a microarray spotted with a large number of putative binding sites in double-stranded DNA (Figure 7.5). The protein of interest carrying an epitope tag is expressed and puriﬁed and then applied to the microarray. After removing nonspeciﬁcally bound protein by a washing step, the protein is detected in a labeling step where a ﬂuorophore-conjugated antibody binds speciﬁcally to the epitope tag. One identiﬁes all spots carrying a signiﬁcant amount of protein. In the DNA sequences belonging to these spots, one identiﬁes enriched DNA-binding site motifs for the DNA-binding protein of interest. Biochemical assays in solution, such as the mentioned EMSA assays, DNAse footprinting assays, or protein-binding assays, are useful to deﬁne the sequence that is required for the recognition of DNA by a particular transcription factor. Based on the outcome of this, one can construct a sequence motif (see Sections 7.4 and 7.5) for this TF. Then, algorithms that assess DNA sequence similarity to a TFBS motif can detect instances of this motif, e.g. in a mammalian genome. However, because of the short length of these motifs and the relatively small number of invariant nucleotide positions in it, some motifs are found millions of times in the genome. Thus, in fact, although any motif instance could potentially be bound in vivo, only about 1 in 500 is actually bound in organisms with large genomes. As a speciﬁc example, the mouse genome contains

Protein–DNA Interactions

∼8 million instances of a match to the binding site motif of the TF GATA-binding factor 1, but only ∼15 000 DNA segments are bound by this transcription factor in erythroid cells (Hardison and Taylor 2012). 7.3.4

Chromatin Immunoprecipitation Assays

To overcome the mentioned limitations of in vitro assays, new massively parallel methods have been introduced, such as ChIP-chip and , whereby one can identify TF-binding sites in vivo. As suggested by their names, these methods are based on and new sequencing techniques, respectively. In Chip-seq experiments (Section 11.1.4), a cellular extract is puriﬁed using an antibody against a particular transcription factor. Then, the DNA sequences bound to the TF are digested using a restriction enzyme, and the remaining DNA can be considered to be tightly bound to the TF. This DNA is washed and sequenced. All DNA reads correspond to DNA fragments that were bound to the TF before. One may then use motif search packages such as MEME to identify enriched sequence motifs among those sequences. The binding motif can be represented in the form of a position-speciﬁc scoring matrix (PSSM).

7.4 Position-Speciﬁc Scoring Matrices A PSSM, (also termed position weight matrix or position-speciﬁc weight matrix) is a useful way to represent motifs (patterns) in biological sequences. A PSSM is essentially a rectangular matrix ﬁlled with scores. It contains one row for each symbol of the alphabet – in the case of DNA, the alphabet consists of four bases: adenine, cytosine, guanine, and thymine – and one column for each position in the pattern. We will illustrate the calculation of PSSM scores step by step with the following toy example of six sequences with four nucleotide positions each (Table 7.1). From these raw sequences, one computes the frequency matrix for observing every nucleotide in one of the four positions (Table 7.2). Out of 6 × 4 = 24 nucleotides in the four sequences, 7 are adenine, 6 are cytosine, 6 are guanine, and 5 are thymine. Thus, the frequencies pi of the four nucleotides are 0.29 (A), Table 7.1 Toy example of six DNA sequences that are 4 bp long. Position 1

Position 2

Position 3

Position 4

Sequence 1

A

C

A

T

Sequence 2

A

C

C

T

Sequence 3

A

G

G

G

Sequence 4

C

C

T

G

Sequence 5

A

T

A

G

Sequence 6

C

A

G

T

187

188

Principles of Computational Cell Biology

Table 7.2 Frequency of nucleotide bases at the four positions, cf. Table 7.1. Position 1

Position 2

Position 3

Position 4

Frequency A

4

1

2

0

Frequency C

2

3

1

0

Frequency G

0

1

2

3

Frequency T

0

1

1

3

Table 7.3 Score matrix of nucleotide bases at the four positions, cf. Tables 7.1 and 7.2. Position 1

Position 2

Position 3

Position 4

Score A

0.75

−0.45

0.12

−1.94

Score C

0.25

0.62

−0.34

−1.94

Score G

−1.94

−0.34

0.25

0.62

Score T

−1.94

−0.19

−0.19

0.78

0.25 (C and G), and 0.21 (T). From the frequency matrix, one computes the score matrix using j

j si

= ln

(ni + pi )∕(N + 1) pi

,

where N is the number of considered sequences (here, N = 6) (Table 7.3). Adding the frequencies pi in the denominator and dividing by N + 1 avoids problematic j cases with ni = 0, where the logarithm would not be deﬁned otherwise. Positions j with score si = 0 occur at the frequency that is expected randomly, positive entries denote enriched nucleotides at this position, and negative entries denote the opposite case. Each column contains log-likelihoods for the possible characters. A PSSM score is the sum of log-likelihoods, which corresponds to the product of the likelihoods. Thus, the score of a PSSM is a product multinomial distribution. From a physics viewpoint, the PSSM scores can also be looked at as the sum of binding energies of all nucleotides (characters of the substring) aligned with the PSSM (see Section 7.5). Although PSSMs typically use log-likelihood values, as described above, some methods employ log-odds scores. Then, an element in a PSSM is computed as pi,j , mi,j = log bi

Protein–DNA Interactions

with pi,j being the probability of observing symbol i at position j of the motif and bi equal to the probability of observing symbol i in a background model. The score then corresponds to the log-odds of the substring being generated by the motif versus being generated by the background, in a generative model of the sequence. A better prediction quality is obtained by exploiting the following features of regulatory regions: promoters reside next to transcriptional start sites (TSSs). If TSSs have been characterized by experiments or by a reliable prediction method, the search for TFBSs can be limited to regions surrounding the TSSs. Moreover, regulatory regions are typically subject to evolutionary pressure, so that they tend to be conserved over related species. A method termed phylogenetic footprinting exploits conservation information to identify regulatory regions and TFBSs. There exist many algorithms to discover TFBS sequence motifs. “Motif discovery” in biological sequences can be deﬁned as the problem of ﬁnding short-sequence elements building the “motif ” shared by a set of nucleotide sequences with a common biological function. These methods search for underlying functional reasons if binding motifs are shared by a set of sequences. One can split methods that discover binding motifs into enumerative, stochastic, and deterministic methods. Classical examples of tools using a deterministic optimization strategy are MEME and Consensus, whereas the Gibbs sampler is a purely stochastic method. A PSSM implicitly assumes the independence between diﬀerent positions in the pattern because the scores are calculated at each position independently from the symbols at other positions. Thus, the PSSM model assumes that each mononucleotide contributes independently to the binding aﬃnity. This is certainly a compromise in comparison to the real case where neighboring bases may clearly aﬀect each other either directly or through their eﬀect on the overall conformation of the DNA double strand. Moreover, it is known that many transcription factors may bind next to each other (see Section 7.6) and aﬀect each other in a cooperative manner.

7.5 Binding Free Energy Models The binding of a transcription factor to a single- or a double-stranded DNA is an elementary biomolecular association reaction. We will follow the binding free energy model of Djordjevic et al. (2003) that describes the reversible binding of a TF to a short piece of DNA with sequence S, kbind

−−−−→ TF − DNA TF + DNA ←−−−− kdiss

by the sequence-dependent rate constants k bind and k diss for TF binding and dissociation, respectively. The ratio of the bound and free forms equals the ratio of the two rate constants (see Sections 1.3.2 and 14.1.3) and is equal to

189

190

Principles of Computational Cell Biology k

ΔG(S)

(S)

KD −1 = kbind(S) = c ⋅ e− kT , where c is a constant and ΔG(S) is the binding free diss energy of the TF to its recognition sequence S on the DNA. When a solution contains both the DNA sequence and the TF with total concentration ntf , the equilibrium probability that the DNA is bound to a TF molecule is (replace in Section 9.4.1 [M] by ntf ) p(TF is bound to S) =

ntf ∕KD (S) c ⋅ e−ΔG(S)∕kT ⋅ ntf = ntf ∕KD (S) + 1 c ⋅ e−ΔG(S)∕kT ⋅ ntf + 1

where we have replaced K D −1 by the above expression. We multiply this with e+ΔG(S)/kT and divide by c • ntf . This gives: P(TF is bound to S) =

1 1+

eΔG(S)∕kT c ⋅ ntf

,

We then set 𝜇

c • ntf = e kT where 𝜇 is the chemical potential set by the TF concentration 𝜇 = kT ⋅ ln(c ⋅ ntf ). This is the so-called Fermi–Dirac form of binding probability. A sequence having a binding free energy well below the chemical potential (ΔG(Si ) − 𝜇 ≪ 0) is almost always bound to the TF. (P(TF is bound to S) → 1 because the exponential term is very small.) In cases when the binding free energy is well above the chemical potential, the sequence is rarely bound. → − The binding energy model (BEM) uses a vector of energy contributions, E . For any sequence Si , the binding energy predicted by the BEM model is → − → − E(Si ) = E ⋅ S i → − where S i is the vector encoding of sequence Si that can include whatever features of the sequence are relevant to its binding energy. If the only relevant features are → − which bases occur at each position within the binding site, then E will be a PSSM with the characteristic that each element is an energy contribution. → − → − When the energy contributions of each position are independent, E ⋅ S i can be written out as E(Si ) =

T L ∑ ∑

𝜀(b, m)Si (b, m)

b=A m=1

where L is the length of the binding site, 𝜀(b, m) are the energy contributions of base b at position m, and Si (b, m) is an indicator variable with Si (b, m) = 1 if base b occurs at position m of sequence Si and Si (b, m) = 0, otherwise. If the positions are not independent, one can include pairwise interactions between

Protein–DNA Interactions

adjacent positions m and n by adding interaction terms to the energy function → − → − such that E ⋅ S i is E(Si ) =

T L ∑ ∑

𝜀(b, m)Si (b, m) +

b=A m=1

L−1 T L T ∑ ∑ ∑ ∑

𝜀(b, m, c, n)Si (b, m, c, n)

m=1 n=m+1 b=A c=A

where 𝜀(b, m, c, n) is the energy contribution of having base b at position m and base c at position n.

7.6 Cis-Regulatory Motifs Although hundreds of TFs are present in a typical eukaryotic cell, the complex expression patterns of thousands of genes can only be implemented by a regulatory machinery involving combinations of TFs. Thus, prokaryotic and eukaryotic gene promoters often bind multiple TFs simultaneously. These transcription factors may also make structural contacts to each other and thus aﬀect their mutual binding aﬃnities in a cooperative manner. In that case, for steric reasons, the distance between TFBSs of contacting TFs is constrained to a certain range. All such combinatorial and cooperative eﬀects are diﬃcult to capture in a quantitative manner by a PSSM-based approach. A cluster of TFBSs is termed a cis-regulatory module (CRM). The existence of such a CRM is a footprint of a transcription factor complex. For metazoans, a typical CRM may be more than 500 bp long and is made up of 10–50 TFBSs to which between 3 and 15 diﬀerent sequence-speciﬁc TFs bind. If there exist multiple similar binding sites, this enhances the sensitivity for a TF, results in a more robust transcriptional response, and aﬀects how morphogen TFs are activated when the local TF concentration is low, or they may simply favor the binding of a homo-oligomeric TF (e.g. p53 or NF-𝜅B). Some transcription factors such as the TF pair Oct4 and Sox2 have well-known interaction partners. The ENCODE project (see Section 7.7) reported that 114 out of the 117 human TFs investigated in that project formed around 3300 pairs of statistically coassociated factors. Considering that 117 TFs can form 117 × 117/2 = 6000 potential pairs, this means that more than one half of all pairs are actually found in nature. These pairs included expected associations, such as those of Jun and Fos, but also some unexpected novel associations. If a TF binds to another transcription factor of the same type, this is termed a homotypic interaction. A well-known example of homotypic interactions is GATA-1. Heterotypic interactions are cases where a transcription factor binds to a TF of another type. Besides, DNA-binding transcription factors may also bind indirectly to other DNA-binding factors if this is mediated by further cofactors (see Section 6.7.3). Figure 7.6 illustrates various approaches that are used to identify CRMs.

191

192

Principles of Computational Cell Biology

CRM scanners (a)

CRM builders (b)

CRM genome screeners (c)

Figure 7.6 Methods for the detection of cis-regulatory modules (CRMs): (a) CRM scanners require user-deﬁned motif combinations as input to search for putative regulatory regions. (b) CRM builders analyze a set of coregulated genes as input and produce candidate motif combinations, as well as similar target regions. (c) CRM genome screeners search for homotypic or heterotypic motif clusters without making assumptions about the involved TFs.

7.6.1

DACO Algorithm

Section 6.7.3 presented the DACO algorithm for the construction of protein complexes. One application of this method is to construct protein complexes containing one or more transcription factors. As was discussed for transcription factor complexes of S. cerevisiae, protein complexes containing two or three transcription factors that bind to the promoter regions of target genes may exert much ﬁner transcriptional regulation than individual transcription factors.

7.7 Relating Gene Expression to Binding of Transcription Factors The ENCODE project (short for Encyclopedia Of DNA Elements) was a large-scale project lasting from 2003 to 2012 that aimed at identifying all functional elements in the human genome sequence for a variety of 147 cell types. In total, ENCODE sampled the binding sites for 119 of the 1800 known human transcription factors and general components of the transcriptional machinery. It turned out that 95% of genomic locations are within 8 kbp of a DNA–protein contact and that the union of all TFBS motifs covers 4.6% of all nucleotide bases. Classifying the genome into seven chromatin states

CAGE poly(A)+ K562 whole cell Classification (mean decreased Gini)

60 40 20 0

1 C X T 1 4 1 1 1 N 6 1 l1 1 A A L3 1 F 2 2 A 1 K 33 S 1 3 1 1 2 3 B 2 D l1 B 5 74 2 YY MY MA ES TSE2FELF GR TAL JUE2FLAF MX SPABPTB7BC AP SR ATA SFNFY SL AF TB FO RF F26 SFATA SPATF BPR2C UN SP FY SIX F2 FE J N N E N N UG R E E G U FO M ZB TH G ZB C N ZN Z BC

3500

0

–5

3000 2500 2000 1500 1000 500 0

–4

(a)

80

5

Regression (mean decreased Gini)

Measured expression (log2)

10

Relative importance of variables

Pearson’s r = 0.81 RMSE = 2.57 Classification: AUC = 0.89 Regression: r = 0.62 (RMSE = 3.06)

–2

0

2

4

Predicted expression (log2)

6

8

(b)

1 C F1 F4 N 1 T l1 PA X S B 1 6 63 1 2 1 L1 YA A A2 l1 5 2 L1 1 1 L3 F1 F1 D 33 2 K F B 3 2 74 YY MYEL E2 JUETS ES MXAB MA FO FY GRE2F F2 AP SP RF TA NF TB7 AT SP SIXUSFOS SP ATABC US LA JUN TB NFE AF SR BPATFR2C F2 H N N E R E M G G F G N ZN C ZN T ZB BC ZB

Figure 7.7 The ENCODE project studied how well the occupancy of transcription factor-binding sites is correlated with RNA production in K562 cells. (a) Scatter plot comparing a linear regression curve (red line) with observed values for RNA production (blue circles). (b) Bar graphs showing the most important transcription factors both in the initial classiﬁcation phase (top bar graph) and the quantitative regression phase (bottom bar graph). Larger values indicate increasing importance of the variable in the model. AUC, area under the curve; Gini, Gini coeﬃcient; RMSE, root mean square error. Source: The ENCODE Project Consortium (2012). Reprinted with permission of Springer Nature.

194

Principles of Computational Cell Biology

(see Section 11.3) indicated a set of close to 400 000 regions with enhancer-like features and around 70 000 regions having promoter-like features, as well as hundreds of thousands of quiescent regions. Gene expression levels had a wide range from 10−2 to 104 r.p.k.m. (reads per kb per million reads) for polyadenylated RNAs and from 10−2 to 103 r.p.k.m. for non-polyadenylated RNAs. If we assume that 1–4 r.p.k.m. corresponds to 1 copy per cell, almost one-fourth of the expressed protein-coding genes and 80% of the detected lncRNAs were detected in the studied cell samples in one or fewer copies per cell. There is considerable interest in relating the presence/absence of individual TFs to the expression level of nearby target genes. The ENCODE data are an ideal data set for this. Figure 7.7 shows the results from a linear regression analysis where the expression of single genes is related to the occupancy of the nearby TFBSs. The left panel shows a comparison of predicted and observed expression levels in K562 cells. The Pearson correlation is 0.81. The right panel shows the importance of individual TFs by the decrease in prediction performance. The global regulator YY1 has the largest contribution. Omitting it from the model signiﬁcantly aﬀects its performance.

7.8 Summary Protein–DNA interactions belong to the most important biomolecular interactions in cells. For a certain fraction, the interaction of individual transcription factors or other DNA-binding proteins has been thoroughly characterized by X-ray crystallography. Also, the binding site preference can be experimentally determined by in vitro assays or in vivo by Chip-seq experiments. Importantly, the binding aﬃnity of transcription factors to linear DNA sequence motifs is modulated by DNA curvature eﬀects, epigenetic modiﬁcations (Chapter 11), and by co-operative binding of multiple transcription factors.

7.9 Problems 1. PSSM matrices Compute a PSSM from the ﬁve sequences listed in Table 7.4. Table 7.4 Toy example of ﬁve DNA sequences that are 4 bp long, cf. Table 7.1. Position 1

Position 2

Position 3

Position 4

Sequence 1

G

C

A

T

Sequence 2

A

C

C

T

Sequence 3

G

G

G

C

Sequence 4

T

G

T

C

Sequence 5

A

T

A

C

Protein–DNA Interactions

Bibliography Protein–DNA Interactions Luscombe, N.M., Austin, S.E., Berman, H.M., and Thornton, J.M. (2000). An overview of the structures of protein–DNA complexes. Genome Biology 01: reviews001.1.

Transcription Factor Binding Berger, M.F. and Bulyk, M.L. (2006). Protein binding microarrays (PBMs) for rapid, high-throughput characterization of the sequence speciﬁcities of DNA binding proteins. Methods in Molecular Biology 338: 245–260. Djordjevic, M., Sengupta, A.M., and Shraiman, B.I. (2003). A biophysical approach to transcription factor binding site discovery. Genome Research 13: 2381–2390. Hardison, R.C. and Taylor, J. (2012). Genomic approaches towards ﬁnding cis-regulatory modules in animals. Nature Reviews Genetics 13: 469–483. The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489: 57–74. Weirauch, M.T., Cote, A., Norel, R. et al. (2013). Evaluation of methods for modelling transcription factor sequence speciﬁcity. Nature Biotechnology 31: 126–134. Zhao, Y., Ruan, S., Pandey, M., and Stormo, G.D. (2012). Improved models for transcription factor binding site identiﬁcation using nonindependent interactions. Genetics 191: 781–790.

Factorbook Wang, J., Zhuang, J., Iyer, S. et al. (2012). Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Research 22: 1798–1812. Wang, J., Zhuang, J., Iyer, S. et al. (2013). Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Research 41: D171–D176.

NGL viewer Rose, A.S. and Hildebrand, P.W. (2015). NGL Viewer: a web application for molecular visualization. Nucleic Acids Research 43: W576–W579.

195

197

8 Gene Expression and Protein Synthesis Among living organisms, the circular or linear genomes of bacteria contain c. 4000–4500 genes, those of fungi such as Saccharomyces cerevisiae and Saccharomyces pombe contain c. 5000–6000 genes, and those of higher eukaryotes c. 18 000 (mice) to 30 000 (Arabidopsis thaliana) genes. In this chapter, we will discuss the experimental methods for measuring gene expression and computational methods that detect diﬀerential expression and functional enrichment of genes.

8.1 Regulation of Gene Transcription at Promoters Transcription is the process through which a DNA segment is copied by the enzyme RNA polymerase II to produce a complementary piece of RNA sequence. Transcription is divided into three stages: initiation, elongation, and termination. Here, we will only be concerned with the initiation process and how this is controlled by additional proteins. As transcription exclusively proceeds in the 5′ → 3′ direction, the DNA template strand used must be oriented in the 3′ → 5′ direction. In prokaryotes, transcription begins with the binding of RNA polymerase to a promoter sequence in the DNA. An RNA core polymerase is a multi-subunit complex composed of α2 ββ′ subunits that catalyze the elongation of RNA (see Figure 2.1). At the start of initiation, the bacterial core enzyme is associated with a sigma factor that aids in ﬁnding the appropriate −35 and −10 bp upstream of the promoter sequences (Figure 8.1). Initiation of transcription in eukaryotes is far more complex than in prokaryotes as eukaryotic polymerases do not directly recognize their core promoter sequences. Instead, additional proteins termed transcription factors regulate the binding of RNA polymerase to DNA. Many eukaryotic promoters, but by no means all, contain a sequence motif termed TATA box. The TATA box is typically positioned in direct proximity upstream of the transcriptional start site (often within 50 bases). The TATA box functions as a binding spot for TATA binding proteins, which help in recruiting the RNA polymerase transcriptional complex. Figure 8.2 sketches a short genomic eukaryotic region that contains three exemplary genes. The ﬁgure serves to illustrate the position of the upstream promoter segment relative to gene B and the so-called untranslated regions at the 5′ and Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

198

Principles of Computational Cell Biology

Pribnow or TATA box 5′

TATAAT –10

TTGAC A –35

be transcribed

Gene start

Upstream

- 3′

GGGGGGGGGGGGGGG -

Gene to

Downstream

Figure 8.1 Typical promoter region of a prokaryotic gene. The TTGACA and TATAAT motifs at positions −35 and −10 nucleotides are not essential. The preference for the corresponding nucleotide at each position is between 50% and 80%. Gene B

Introns Gene A

5′ UTR exons

Intergenic A Intergenic B

Half distance gene A/B –2000

TSS +1000

Coding start [CSS]

3′ UTR exons

Coding exons

Coding end [CES]

Intergenic B Intergenic C

TES

Gene C

Half distance gene B/C

Promoter

Figure 8.2 Eukaryotic genomic region containing three genes A, B, and C. Diﬀerent genetic regions are distinguished: intergenic region between gene A and gene B or between gene B and gene C, promoter region, 5′ UTR, coding exons, 3′ UTR, introns, intragenic region, and CpG islands (not shown here). Shown in the ﬁgure is the + strand of DNA, − strand is analogous. Not shown in this ﬁgure are so-called enhancer regions further upstream of the promoter that plays crucial roles in transcriptional regulation in higher eukaryotes.

3′ ends of gene B. Not shown are the so-called enhancer regions. These are short (50–1500 bp) regions that can also be bound by transcription factors to increase the expression of genes that are located elsewhere on the genome up to millions of base pairs away from the enhancer region. As discussed in Section 7.1, a eukaryotic transcription factor is a protein needed to activate or repress the transcription of a gene but is not itself a part of the enzymes responsible for the chemical steps involved in transcription. Some transcription factors bind to cis-acting DNA sequences only, whereas others bind to DNA and other transcription factors. Regulation of gene transcription in an organism involves a complex network, of which the DNA-binding transcription factors are a key component (see Chapter 9).

8.2 Experimental Analysis of Gene Expression In this section, we will brieﬂy present several experimental methods that are being used to detect active gene expression. Real-time polymerase chain reaction (rtPCR) is a small-scale method to detect the expression of multiple genes at a time. In contrast, microarrays and next-generation sequencing (NGS) are high-throughput methods enabling, in principle, the detection of all genomic transcripts simultaneously.

Gene Expression and Protein Synthesis

8.2.1

Real-time Polymerase Chain Reaction

The polymerase chain reaction (PCR) is a well-known method for amplifying a speciﬁc target DNA sequence. PCR is used to isolate, sequence, or clone pieces of DNA. PCR was invented in 1983 by Kary Mullis, who was awarded the 1993 Nobel Prize in Chemistry for this. A PCR reaction usually consists of three steps. The sample consists of a dilute concentration of template DNA that is mixed with a heat-stable DNA polymerase (e.g. Taq polymerase), with primer sequences for the target DNA sequences, with deoxynucleoside triphosphates (dNTPs), and with magnesium. In the ﬁrst step of PCR, the sample is brought to a temperature of 95–98 ∘ C. As a result, the double-stranded DNA denatures and splits up into two single strands. In the second step, the temperature is lowered to about 55–65 ∘ C. This enables the primer sequences to bind (or anneal) to complementary sequence motifs at both ends of the target sequence, also known as the template. In the third step, the temperature is usually raised to 72 ∘ C. Then, the DNA polymerase can extend the primer sequences by adding dNTPs to create a new strand of DNA. Therefore, the amount of DNA is duplicated in the reaction. This series of denaturation, annealing, and extension steps is repeated for many cycles and yields an exponential ampliﬁcation of the template DNA. At the end of a conventional PCR run, the amount of ampliﬁed DNA product is quantiﬁed. In “real-time” PCR, one quantiﬁes in real time how the ampliﬁcation product accumulates after each cycle. Because PCR ampliﬁes DNA stretches, the cellular mRNA is ﬁrst reverse transcribed into complementary DNA (cDNA) by the enzyme reverse transcriptase. Detection of multiple PCR products in real time is made possible by adding a ﬂuorescent reporter molecule to each reaction well of a parallel chip. The detected ﬂuorescence level is proportional to the total quantity of product DNA. The change in ﬂuorescence over time serves to derive the amount of ampliﬁed DNA made in each cycle. A set of multiple internal reference genes (such as suitable housekeeping genes that exhibit rather constant expression levels in all cell types and experimental conditions) is used to normalize the expression of target genes. 8.2.2

Microarray Analysis

Figure 8.3 illustrates the basic steps of a microarray experiment. In a prototypical two-color microarray, cellular mRNA is collected from two conditions (e.g. cancer tissue and adjacent normal tissue), puriﬁed, and reverse transcribed into the complementary cDNA using again the enzyme reverse transcriptase. To identify the source, one of the four nucleotides provided to the assay is ﬂuorescently labeled. For the samples belonging to the two conditions, two diﬀerent nucleotides with diﬀerent ﬂuorophores emitting light at diﬀerent wavelengths (for example, at the colors green and red) are used. To quantify the amount of cDNA relative to each other, the extract of cDNA is then loaded on a well of a two-color microarray where it hybridizes to the microarray probes. After washing oﬀ the nonspeciﬁcally bonding sequences, only strongly paired strands will remain hybridized. If the well shines green under a laser, the well contains more cDNA belonging to the ﬁrst probe. If it is red, it contains more

199

Principles of Computational Cell Biology

Sample A

Sample B

Allele 1 Allele 2

3 R eg

io

n

2 R eg

io

n

1 R eg

io

n

3 R eg

io

n

2 n io R eg

io

n

1

Amplification

R eg

Production of target and labeling Cy5

Cy3

Allele 1 Allele 2

Microarray hybridization

Density

200

1 Cy5/Cy3 ratio

Figure 8.3 Basic steps of a microarray experiment.

cDNA from the second probe. If both amounts are equal, the ﬁeld is either yellow (high expression) or dark (expression below detection threshold). It is a common practice to use multiple short hybridization probes to monitor the expression levels of long mRNAs. In the bioinformatics analysis part, one then faces the task of how to assign a particular expression value to such genes. Often, one simply takes the median expression level of all probes covering one gene.

Gene Expression and Protein Synthesis

Large numbers of microarray expression data sets have been deposited in the Gene Expression Omnibus (GEO) repository of the NIH. Although microarray analysis is being challenged recently by the uprise of NGS methods, researchers will likely continue to use the rich microarray data sets deposited in GEO as the reference data. 8.2.3

RNA-seq

The term RNA-seq describes the sequencing and determination of transcription levels of the expressed cellular mRNAome by NGS methods. Here, we will ignore the technical details of the fast moving ﬁeld of NGS sequencing because new methods are constantly entering the market every few years. At the time of writing, third-generation methods are state of the art. Bioinformaticians working on questions related to computational systems biology often need to process RNA-seq data. Although RNA-seq provides the complete genomic picture at a single-base resolution, we will focus here on the expression levels of the entire genes.

8.3 Statistics Primer Biological experiments are subject to a considerable variability. Even if one repeats exactly the same experiment on two subsequent days, for example, and the expression of gene i is tested in a well-deﬁned cell culture, the results will never be exactly the same. There are many possible reasons for this. The most important reason is the considerable variability of biological materials. In this respect, experiments on living cells are very diﬀerent from, say, experiments in physical chemistry or material sciences where the changes between diﬀerent repetitions of the same experiment are typically very small. In contrast, experiments on living cells are “noisy.” Therefore, conclusions based on data from such experiments need to be carefully checked for their signiﬁcance using statistical tests. In this section, we will introduce some basic statistics measures that are useful, for example, when measuring gene expression using microarrays. Given n data points denoted by ai , where i = 1, …, n, their arithmetic mean a is 1∑ a. n i=1 i n

a=

In statistics and probability theory, the standard deviation σ measures how much variation or “dispersion” exists from the average (mean, or expected value). For the same n data points, a1 , a2 , …, an , their standard deviation from the mean is √ √ n √ 1 ∑ (a − a)2 . 𝜎=√ n − 1 i=1 i The variance 𝜎 2 is the square of the standard deviation.

201

Principles of Computational Cell Biology

In probability theory, a continuous probability distribution f has to fulﬁll three properties: the probability is non-negative everywhere, the integral over the full distribution is normalized to 1, and the probability that x lies between two points a and b is b

p[a ≤ x ≤ b] =

∫a

f (x)dx.

The well-known normal (or Gaussian) distribution is an example of a continuous probability distribution. It has a characteristic bell-shaped probability density function and is commonly referred to as the Gaussian function 1 −1 f (x; 𝜇, 𝜎 2 ) = √ e 2 𝜎 2π

(

x−𝜇 𝜎

)2

.

0.3

0.4

The parameter 𝜇 is the mean or expectation value (normally a sharp peak) and 𝜎 2 is the variance. The distribution with 𝜇 = 0 and 𝜎 2 = 1 is called the standard normal distribution. If a real-valued random variable clusters around a single mean value, this is typically modeled by a normal distribution as a ﬁrst try. Figure 8.4 illustrates a standard normal distribution and indicates how much of the distribution falls in between various multiples of the standard deviation. Only 4.6% of the values are at least 2𝜎 away from the mean. This is why a deviation of at least 2𝜎 is often considered a statistically meaningful deviation. The null hypothesis of a statistical hypothesis test corresponds to a general or default position. For example, the null hypothesis might state that there is no relationship between two measured phenomena. A null hypothesis can never be formally proven in a mathematical sense. However, a set of data can either reject a null hypothesis or fail to reject it. A p-value is the probability that the test statistic is at least as extreme as the one observed under the condition that the null hypothesis is true. A small p-value is an indication that the null hypothesis is false (see also Section 8.5.1).

0.2

34.1%

34.1%

0.1

202

2.3% –3σ

–2σ

13.6% –1σ

13.6% 2.3% μ

1σ

2σ

3σ

Figure 8.4 Standard normal distribution. 𝜇 is the mean of the (symmetric) normal distribution. 𝜎 is the standard deviation.

Gene Expression and Protein Synthesis

8.3.1

t-Test

A t-test is a parametric statistical hypothesis test that can be used when the population conforms to a normal distribution. A frequently used t-test is the one-sample location t-test that tests whether the mean of a normally distributed population has a particular value 𝜇0 t=

x − 𝜇0 √ 𝜎∕ n

where x is the sample mean, 𝜎 is the standard deviation of the sample, and n is the sample size. The critical value of the t-statistic t 0 is tabulated in t-distribution tables. In this case, the hypothesis (H 0 ) is that the population mean equals 𝜇0 . Another popular t-test is the two-sample location test. It tests the null hypothesis that the mean values of two normally distributed populations are equal. Strictly speaking, the name Student’s t-test refers to cases when the variances of the two populations are assumed to be equal. When this assumption is dropped, a modiﬁed test called Welch’s t-test may be used. 8.3.2

z-Score

The standard z-score of a raw score x is x−𝜇 z= 𝜎 where 𝜇 is the mean of the population and 𝜎 is the standard deviation of the population. We will see an example in Section 10.6.1. 8.3.3

Fisher’s Exact Test

Fisher’s exact test (named after its inventor, R.A. Fisher) is a statistical signiﬁcance test that is typically used to analyze contingency tables. It is valid for all sample sizes, although it is mostly used in practice when sample sizes are small. Fisher’s exact test of independence is used if there are two nominal variables and we want to check whether the proportions of one variable are diﬀerent depending on the value of the other variable. The test belongs to the class of exact tests. For such tests, one can compute the signiﬁcance of deviating from a null hypothesis in an exact way. For many other statistical tests, one has to rely on an approximation for the signiﬁcance that becomes exact only in the limiting case of assuming an inﬁnite sample size. For example, a hypothetical sample of teenagers might be grouped either by their gender (male and female) or by whether the individuals are regularly doing some sports or not. The data might look like Table 8.1. Then, it is of interest to ﬁnd out whether the imbalanced proportions that we observe are statistically signiﬁcant. Thus, if we knew beforehand that 10 of these 22 teenagers do sports, and that 10 of the 22 are female, what is the probability that these 10 individuals doing sports would be so unevenly distributed between the women and the men as given in Table 8.1? Assuming that we would randomly select 10 teenagers, what

203

204

Principles of Computational Cell Biology

Table 8.1 Example for Fisher’s exact test. Men

Doing sports

Women

1

Row total

9

10

Not doing sports

11

1

12

Column total

12

10

22

Table 8.2 Symbols used in Fisher’s exact test. Men

Women

Total

Doing sports

a

b

a+b

Not doing sports

c

d

c+d

Totals

a+c

b+d

a + b + c + d (= n)

is the chance that 9 are female and only 1 is male? We will label the cells by the letters a, b, c, and d, call the totals across rows and columns as marginal totals, and the grand total equals n = a + b + c + d. The symbolic table now looks like Table 8.2. According to Fisher, the probability for any such set of values is given by the hypergeometric distribution ( )( ) a+b c+d (c + d)! (a + b)! • a c a!((a + b) − a)! c!((c + d) − c)! = p= ( ) n! n (a + c)!(n − (a + c))! a+c (a + b)!(c + d)!(a + c)!(b + d)! a!b!c!d!n! ( ) n and the symbol! indicating the factorial operwith the binomial coeﬃcient k ator (see Section 8.6.1). This formula yields the exact probability of observing the speciﬁc distribution of the data assuming the given marginal totals on the null hypothesis that men and women are equally likely to do sports. One can compute the probability of any assignment of the 22 teenagers to the four cells of the table. The signiﬁcance is obtained by considering all those cases where the marginal totals are equally or more extreme as those in the observed table. In the case study, there is only one case that is more extreme in the same direction as the given data; it is shown in Table 8.3. Hence, we need to compute the values of p for both these tables (0.000 185 and 0.000 001 54) and add them together (c. 0.000 187). This corresponds to a one-tailed test. For a two-tailed test, we must also take into account data arrangements that are equally extreme in the opposite direction. =

Gene Expression and Protein Synthesis

Table 8.3 Most extreme inequilibrium, cf. Table 8.1. Men

Doing sports

0

Women

Row total

10

10

Not doing sports

12

0

12

Column total

12

10

22

8.3.4

Mann–Whitney–Wilcoxon Rank Sum Tests

The Mann–Whitney U test is also called the Mann–Whitney–Wilcoxon (MWW) or Wilcoxon rank sum test. It belongs to the most used statistical tests among nonparametric statistical hypothesis testing methods. Given a set of independent observations, this test can be used to estimate whether one sample of observations has larger values than the rest. If the two distributions have a diﬀerent shape, the Mann–Whitney U test is used to determine whether there are signiﬁcant diﬀerences between the distributions of the two groups. If the two distributions are of the same shape, the Mann–Whitney U test is used to determine whether there are diﬀerences in the medians of the two groups. Let us assume we are given the two distributions of eight values each listed in Table 8.4. The values could be average grades of the pupils in two classes of a high-school or expression levels of genes A and B in several individuals. From this, we form a joint ranked list (from lowest to highest value) (Table 8.5). It looks like there are more values from class A on the left side, but it is in fact impossible to judge by visual inspection of the data whether there is a signiﬁcant diﬀerence between the two classes. Thus, we will test by a rank sum test whether ranks are equally distributed in the joint rank list or not. The sum of all the ranks equals N(N + 1)/2 where N is the total number of observations. In this example, samples of class A have the ranks 1, 3, 4, 7, 9, 10, 11, and 15. The sum of these ranks is T 1 = 60. The samples of class B have the ranks 2, 5, 6, 8, 12, 13, 14, and 16. The sum of these ranks is T 2 = 76. This shows that the class B values have higher ranks on an average. From these rank sums, we compute the sum of ranking imbalances Table 8.4 Data distribution to illustrate Mann–Whitney U test. Class A

2.1

1.7

1.6

2.1

2.0

1.4

2.6

2.2

Class B

2.5

2.3

1.8

1.5

2.7

1.9

2.0

2.4

Table 8.5 Ordered list of the values from Table 8.4. Grade

1.4

1.5

1.6

1.7

1.8

1.9

2.0

2.0

2.1

2.1

2.2

2.3

2.4

2.5

2.6

2.7

Class

A

B

A

A

B

B

A

B

A

A

A

B

B

B

A

B

Joint rank

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

205

206

Principles of Computational Cell Biology

U using U1 = n1 ⋅ n2 +

n1 ⋅ (n1 + 1) − T1 2

U2 = n1 ⋅ n2 +

n2 ⋅ (n2 + 1) − T2 . 2

and

Here, nk is the number of samples in sample k. In our example, n1 = n2 = 8. For the example above, we get U 1 = 40 and U 2 = 24. The correctness of these calculations can be checked by noting that the two following conditions always hold: U1 + U2 = n1 ⋅ n2 and T1 + T2 = (n1 + n2 )

(n1 + n2 + 1) . 2

U is the sum of ranking imbalances. The question is how often such an imbalance in ranks can be due to chance. For this, we compare the smaller U value (24) with the critical value of the theoretical U distribution. In this case, we get from the Mann–Whitney U table using n1 = n2 = 8 and a signiﬁcance threshold of α = 0.05 (two-sided) a critical value of 13. Hence, the values show a signiﬁcant diﬀerence between the two classes.

8.3.5

Kolmogorov–Smirnov Test

The Kolmogorov–Smirnov (abbreviated as K–S) test is also a nonparametric test. Given continuous, one-dimensional probability distributions, a one-sample K–S test compares a sample against a reference probability distribution, whereas a two-sample K–S test compares two samples to each other. The K–S statistic determines a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution, or between the empirical distribution functions of two samples. The two-sample K–S test is one of the most useful and general nonparametric methods, as it is sensitive to diﬀerences in both location and shape of the empirical cumulative distribution functions of the two samples.

8.3.6

Hypergeometric Test

The hypergeometric test is a discrete probability distribution that computes the probability for k successes in n draws, without replacement, from a ﬁnite population of size N that contains exactly K successes, wherein each draw is either a success or a failure. The hypergeometric test will be explained in detail in Section 8.6.1. In contrast, the binomial distribution models the probability of successes in n draws with replacement (see Section 14.1.1).

Gene Expression and Protein Synthesis

8.3.7

Multiple Testing Correction

In case, the same data are tested by several independent statistical tests, the probability of false-positive predictions (the so-called “type 1” errors) is increasing. An example for this situation is if we test whether several functional annotations are enriched in a set of genes that are diﬀerentially expressed (DE) in two conditions. If we would simply test one functional annotation after the other, the chance that one of these statistical tests would produce an apparently signiﬁcant p-value would monotonously increase the more tests we perform. Hence, one has to take into account the total number of tests performed. To correct for the occurrence of false positives, there exist several methods for multiple testing corrections. One popular method is the false discovery rate method by Benjamini and Hochberg (1995). This method controls the expected ratio of errors among the rejected null hypothesis ( ) number of falsely rejected null hypothesis FDR = E . number of rejected null hypothesis This method is reported to keep a good balance between the ability to identify signiﬁcantly deregulated genes and limitations of its predictive power due to false positives.

8.4 Preprocessing of Data 8.4.1

Removal of Outlier Genes

Analysis of expression data sets starts with the identiﬁcation and omission of outlier genes and outlier samples. Outliers are experimental data points that deviate “too much” from the typical behavior observed in other samples or genes. The reason for such outliers could be either technical problems with the measurement, mislabeling of samples, or that this sample represents a truly unique case. Keeping such outlier data points in the data set would obscure the downstream analysis. Typical techniques to identify outliers are hierarchical clustering, boxplots, and computing the median absolute deviation (MAD) or the generalized extreme studentized deviate (GESD) test. GESD (Rosner 1983) is meant to identify one or more outliers in a data set, assuming that the majority of its data points are normally (Gaussian) distributed. For every data point xi , the algorithm calculates the deviation from the mean 𝜇 relative to the standard deviation 𝜎: Ri =

Maxi |xi − 𝜇| . 𝜎

At each iteration, the algorithm deletes the point having the largest deviation. This process is continued until all outliers fulﬁlling Ri > 𝜆i have been removed where 𝜆i are the critical values calculated for all outlier points according to the t distribution. GESD always labels at least one data point as an outlier even when there is no outlier. Therefore, GESD is supplied with a minimum threshold so

207

208

Principles of Computational Cell Biology

that a certain number of outliers must be detected before any gene is marked as an outlier. In contrast to GESD, the MAD algorithm (Rousseeuw and Croux 1993) is not based on the variance or standard deviation and thus makes no particular assumption on the statistical distribution of the data. At ﬁrst, the raw median is computed over all data points (e.g. considering the expression of all genes in all samples). From this, MAD obtains the MAD of single data points X i from the raw median as MAD = b•median(|Xi − median(X)|) where b is a scaling constant. For normally distributed data, it is taken as b = 1.4826. As a rejection criterion, one uses Xi − median(X) ≥ threshold. MAD The suitable threshold could be 3 (very conservative), 2.5 (moderately conservative), or 2 (poorly conservative). Consider the data (1, 3, 4, 5, 6, 6, 7, 7, 8, 9, 100). It has a median value of 6. The absolute deviations about 6 are (5, 3, 2, 1, 0, 0, 1, 1, 2, 3, 94). Sorting this list into (0, 0, 1, 1, 1, 2, 2, 3, 3, 5, 94) shows that the deviations have a median value of 2. Therefore, the MAD for this data is roughly 3 when scaled with b = 1.4826. Possible outliers above a rejection threshold would need to diﬀer from the median by 6–9 or more. In this case, this would obviously only be the extreme data point of 100. 8.4.2

Quantile Normalization

The typical steps after the removal of outliers and imputation of missing data (see Section 11.2.1) are normalization and potentially a logarithmic transformation of the data. There exist numerous methods for data normalization. We will only consider one of them here termed quantile normalization. Quantile normalization is a widely used normalization method in diverse ﬁelds. It has a quite drastic eﬀect on the original distributions because it shifts all values so that the ranks of the values in the distribution remain the same but the distributions become identical to each other after normalization. Hence, it is essentially a ranked-based normalization method. Let us consider three measurements of four variables (Table 8.6). At ﬁrst, the rank of each value in the respective distribution is determined. Then, the values are ordered by magnitude for each distribution. One determines the mean value of each column. The original values are then replaced by these mean values. Each normalized measurement now contains exactly the same values, however potentially in a diﬀerent order. 8.4.3

Log Transformation

For the analysis of diﬀerential expression, it is common to apply a logarithmic transformation to the expression data. The main reason for this is that upregulation and downregulation should be treated in a balanced manner. Imagine a case where one gene is upregulated fourfold and another gene is downregulated

Gene Expression and Protein Synthesis

Table 8.6 Normalization schema explaining quantile normalization algorithm. A

B

C

D

5

2

3

4

4

1

4

2

3

4

6

8

Original data

Determine ranks of original values in each row iv

i

ii

iii

iii

i

iii

ii

i

ii

iii

iv

Reorder rows by size 2

3

4

5

1

2

4

4

3

4

6

8

Compute column averages 2

3

4.67

5.67

Rank i

Rank ii

Rank iii

Rank iv

Replace original values by column averages 5.67

2

3

4.67

4.67

2

4.67

3

2

3

4.67

5.67

fourfold. If we consider ratios of the absolute expression levels between two sets of samples, the upregulated gene would be assigned a ratio of 4, whereas the downregulated gene is assigned the ratio 1/4. Generally, upregulation covers the full range between 1 and inﬁnity, whereas downregulated genes are pushed into the interval between 0 and 1. Instead, if one applies a log2 transformation, the fourfold upregulated gene would get a ratio of +2 and the fourfold downregulated gene a ratio of −2. Hence, both directions are now treated in a balanced manner. Another nice feature of the log2 transformation is that the transformed data often resemble a Gaussian distribution that facilitates, for example, the application of t-tests.

8.5 Diﬀerential Expression Analysis Figure 8.5 compares the expression of a gene in healthy individuals and in patients suﬀering from a hypothetical disease. Are these two distributions signiﬁcantly diﬀerent? First of all, this clearly depends on the number of data points taken. If this number is very low, any meaningful statistical test should report that the observed diﬀerence is not signiﬁcant. A common threshold is

209

Principles of Computational Cell Biology

9 Gene expression level (log2)

210

8.5

8

7.5 Case

Control

Figure 8.5 Schematic representation of the expression levels of a particular gene in healthy individuals (labeled “Control”) and in a group of patients (labeled “Case”). A log2 transformation was applied to expression levels.

to consider diﬀerences with p-values below 0.05 as signiﬁcant. More stringent requirements (e.g. p < 0.01 or even p < 0.001) are also known. Sometimes, fulﬁllment of these diﬀerent levels of stringency are marked as *, **, or *** next to plots showing experimental data. 8.5.1

Volcano Plot

A signiﬁcant p-value does not always suﬃce as evidence for clear deregulation. If the number of samples is considerably large, even small diﬀerences between the two distributions can be ranked as highly signiﬁcant. Thus, a second criterion is introduced, which can be either the absolute fold-change or the relative fold-change of expression normalized by the standard deviation. The Volcano plot shown in Figure 8.6 plots the p-value on the y-axis and the fold-change on the x-axis. Only data points in the upper left or upper right quadrants are considered as signiﬁcantly deregulated. It is subject to the particular application what threshold of the fold-change is considered meaningful. Alternatively, to also interpret the magnitude of the underlying diﬀerences, one can consider the eﬀect size by applying the Cohen’s d eﬀect size estimator that is independent of the sample size. Cohen’s d is deﬁned as 𝜇 − 𝜇2 d = √1 𝜎12 + 𝜎22 2 with 𝜇1 , 𝜇2 , 𝜎 1 , and 𝜎 2 being the mean and standard deviation of the two samples, respectively. Eﬀect sizes can be separated, for example, into small (0.2 ≤ d < 0.5), medium (0.5 ≤ d < 0.8), and large (d ≥ 0.8) eﬀects. 8.5.2

SAM Analysis of Microarray Data

The SAM (signiﬁcance analysis of microarrays) method (Tusher et al. 2001) is a popular statistical technique to identify genes showing signiﬁcant expression

Gene Expression and Protein Synthesis

6

–log10 p-value

5

4

3

2

1

P = 0.05

0 –4

–2

0

2

4

log2 fold change

Figure 8.6 A “volcano plot” visualizes the results of diﬀerential expression analysis. In this case, ﬁlled gray circles are genes that are signiﬁcantly downregulated (having signiﬁcant p-values 𝜂

s=1,…,m

if |mij | ≤ 𝜂

mij = 0

where 𝜂 is a decorrelation threshold. Given the interaction matrix M, the change of state xi of gene i between t and t + 1 obeys a threshold rule: ) ( n ∑ or mik xk (t) − bi xi (t + 1) = H k=1

xi (t + 1) = H(Mx(t) − b) where H is the step function with H(y) = 1 if y ≥ 0 and H(y) = 0 if y < 0, and the bi s are threshold values. In the case of small regulatory genetic systems, knowing the matrix M makes it possible to know all possible stationary behaviors of the organisms having the corresponding genome. Let us look at a well-characterized example derived for the model plant organism Arabidopsis. As shown in Espinosa-Soto et al. (2004), the network shown in Figure 9.5 has about 140 000 possible starting states and converges to only 10 ﬁxed point attractors or states with steady gene expression levels that are listed in Table 9.1. Based on this model, one can now easily simulate the behavior of virtual loss-of-function mutants where the vertex corresponding to the knocked out gene is permanently turned oﬀ. Boolean approaches suﬀer from their inability to describe intermediate levels of gene expression. Because of their discrete nature, they can sometimes give spurious results. Potentially more accurate representations of gene networks use continuous functions, in which expression values are allowed to take on any positive value. These approaches are mathematically implemented by diﬀerence or diﬀerential equations, either linear or nonlinear. 9.3.3

Diﬀerential Equations Models

Diﬀerential equations are often used to model complex GRNs in a quantitative manner. They will be introduced more thoroughly in Section 13.3 in a diﬀerent

Gene Regulatory Networks

EMF1 AP1

LFY

FT

AP2

FUL

WUS

UFO

AG

AP3

LUG

SEP

CLF PI

TFL1

Figure 9.5 Gene network architecture determining the fate of the ﬂoral organ from Arabidopsis thaliana. Each vertex corresponds to the concentration of active or functional protein encoded by each gene. The edges represent the regulatory interactions between vertices. Arrows represent positive (activating) interactions and blunt-end lines represent negative (repressing) interactions. AG, PI, AP3, WUS, etc., code for transcription factors; TFL1 and FT are likely membrane-bound signaling molecules; EMF1 and LUG are positive or negative cofactors probably involved in transcription; and CLF is a chromatin remodeling protein. The four dashed lines constitute four novel predictions of the model that need to be tested experimentally at a later stage. Source: Adapted from Espinosa-Soto et al. (2004).

context, and we will therefore keep the discussion short at this point. Diﬀerential equation models of gene networks consist of a set of rate equations describing how changes of gene expression depend on the expression levels of other genes (and possibly other factors as well). For each of the n genes, there exists one continuous function reﬂecting the expression level xi of gene i. The general form is 𝜕xi = fi (xi1 , xi2 , … , xil ) 𝜕t Each f i describes the joint eﬀect of its arguments on xi and integrates the eﬀects of molecular interactions and degradation. Its argument{xi1 , xi2 , …, xil } is a subset

237

238

Principles of Computational Cell Biology

Table 9.1 Gene expression in each of the 10 steady gene activation states in the wild-type Arabidopsis thaliana gene network. FT EMF1 TFL1 LFY FUL AP1 AP3

PI

AG UFO WUS AP2 SEP LUG CLF Cell type

0

1

2

0

0

0

0

0

0

0

0

0

0

1

1

Inf1

0

1

2

0

0

0

0

0

0

1

0

0

0

1

1

Inf2

0

1

2

0

0

0

0

0

0

1

1

0

0

1

1

Inf3

0

1

2

0

0

0

0

0

0

0

1

0

0

1

1

Inf4

1

0

0

2

0

2

0

0

0

0

0

1

1

1

1

Sep

1

0

0

2

0

2

2

2

0

1

0

1

1

1

1

Pe1

1

0

0

2

0

2

2

2

0

0

0

1

1

1

1

Pe2

1

0

0

2

2

0

2

2

2

1

0

1

1

1

1

St1

1

0

0

2

2

0

2

2

2

0

0

1

1

1

1

St2

1

0

0

2

2

0

0

1

2

0

0

1

1

1

1

Car

Obviously, the example network discussed here is not truly Boolean as the variables may adopt three values 0, 1, and 2 rather than only 0 and 1.

of all gene expression levels {x1 , x2 , …, xn }. In this way, only those gene expression levels are arguments for gene i that actually aﬀect its expression (its parents, in the language of Boolean networks). In addition, there may appear additional factors such as constants. Deriving an ODE model from experimental data involves estimating (ﬁtting) the parameters in the functions f i ( ).

9.4 DREAM: Dialogue on Reverse Engineering Assessment and Methods It is very diﬃcult if not even impossible to develop new mathematical methods for reverse engineering of GRNs as long as the correct answer is not known at least for some real biological networks that can then be used for validation. Therefore, in order to achieve a systematic evaluation of methods for reverse engineering of network topologies (also termed network inference methods), the idea came up to develop the models on synthetic data that was generated from assumed model topologies of GRNs. As in Section 8.8.1, we consider transcriptional regulatory networks consisting of genes, mRNA, and proteins where regulatory proteins (TFs) control the transcription rate (activation) of the genes. We ignore here other eﬀects such as microRNAs and epigenetic eﬀects. The state of the network is given by the vector of mRNA concentrations x and protein concentrations y. The gene network is modeled by the system of diﬀerential equations dxi ⋅ xi = mi ⋅ fi (y) − 𝜆RNA i dt dyi ⋅ yi = ri ⋅ xi − 𝜆Prot i dt

Gene Regulatory Networks

where mi is the maximum transcription rate, ri the translation rate, 𝜆i RNA and 𝜆i Prot are the mRNA and protein degradation rates, respectively, and f i (.) is the so-called input function of gene i. The input function describes the relative activation of the gene, which is between 0 (the gene is shut oﬀ ) and 1 (the gene is maximally activated), given the protein concentrations y, which include those of the TF proteins.

9.4.1

Input Function

Let us consider the binding reaction of two molecules L and M: L + M ⇄ LM. The dissociation equilibrium constant K D is deﬁned as KD =

[L][M] , [LM]

where [L], [M], and [LM] are the molecular concentrations of L and M and of the complex LM. In equilibrium, we may take T as the total concentration of molecule L T = [L] + [LM]. y is the fraction of molecules L that have reacted (bound) y=

[LM] . [LM] + [L]

Substituting [LM] by [L] [M]/K D gives y=

([L][M])∕KD ([M])∕KD = . ([L][M])∕KD + [L] ([M])∕KD + 1

Now, we will turn back to our task of modeling TF binding to DNA. TF j then takes the role of M so that [M] = yj . The probability P(S1 ) that gene i is in state S1 at a particular moment is termed the fractional saturation and is equal to the fraction y we just introduced. Hence, P(S1 ) depends on the TF concentration yj , and k ij is the dissociation constant for TF j at the promoter of gene i. ( )nij 𝜒j yj P(S1 ) = with 𝜒j = . 1 + 𝜒j kij nij is the so-called Hill coeﬃcient (describing cooperativity) for this binding equilibrium. P(S1 ) is large if the concentration yj of TF j is large and if the dissociation constant k ij is small (strong binding). The bound TF either activates or represses the expression of the gene. In state S0 , the relative activation is 𝛼 0 . In state S1 , it is 𝛼 1 . The input function f i (yj ) is obtained from P(S1 ) and its complement P(S0 ). 𝜒 1 . P(S0 ) = 1 − 1+𝜒j = 1+𝜒 j

j

239

240

Principles of Computational Cell Biology

The input function describes the mean activation of gene i as a function of the TF concentration yj fi (yj ) = 𝛼0 P(S0 ) + 𝛼1 P(S1 ) =

𝛼0 + 𝛼1 χj 1 + χj

.

This approach can be generalized to an arbitrary number of regulatory inputs. A gene that is controlled by N TFs has 2N states: each of the TFs can be bound or not bound. Thus, the input function for N regulators is 2N −1

f (y) =

∑

𝛼m P(Sm ).

m=0

Now, we turn back to the Dialogue on Reverse Engineering Assessment and Methods (DREAM) challenge and the construction of synthetic data for toy GRN models. We assume that binding of TFs to cis-regulatory sites on the DNA is in quasi-equilibrium because molecular binding and disassociation reactions are orders of magnitudes faster than the transcription and translation processes. Using the approach just described, the organizers of the DREAM challenge around Gustavo Stolovitzi generated synthetic expression data of model GRNs, i.e. using reasonable initial conditions, they computationally generated the activity in a network of given topology (N nodes, M directed edges). This output data, but not the original network used to generate this data, could then be downloaded by the competing teams from the competition website. The participants of this competition would then try to predict the topology of the GRN from the output data. 9.4.2

YAYG Approach in DREAM3 Contest

The DREAM contest is organized at regular intervals. Diﬀerent sorts of problems are given to the contestants each time. Here, we will present an approach used by the best-performing team around Mark Gerstein in the DREAM3 contest (Yip et al. 2010). YAYG is the concatenation of the author’s names. The authors assumed that the provided expression data are subject to Gaussian noise. If we were given xa b , the expression level of gene a detected in the deletion strain of gene b, and xa wt* , the true expression level of gene a in wild type, one would need to estimate from this data whether the deviation xa b − xa wt* is due to a causal interaction or merely due to this noise. For this, one would need to know the variance 𝜎 2 of the Gaussian, assuming the noise is nonsystematic so that the mean 𝜇 is zero. Then, the probability for observing a deviation as large as xa b − xa wt* by chance is [ ( b )] |xa − xwt∗ a | chance p , b→a = 2 1 − Φ 𝜎 where Φ is the cumulative distribution function of the standard Gaussian distribution. The complement of this function [ ( b )] |xa − xwt∗ a | pb→a = 1 − 2 1 − Φ 𝜎

Gene Regulatory Networks

is the probability that the deviation is caused by a regulatory process. Using these probabilities, all gene pairs (b, a) can be ordered in a descending order of pb→a . In a ﬁrst step, the variance 𝜎 2 is estimated from the data. Here, one faces two diﬃculties. On the one hand, it is unknown which genes are modulated and which ones are not modulated by the knocked-out gene. This is exactly what one would like to get from the data. On the other hand, the expression level of a gene detected in the wild-type strain, xawt , is also aﬀected by random noise and thus is not suitable as a gold standard reference point xawt* . Gerstein and coworkers proceeded in steps so that the estimate for pb→a becomes more and more reliable. Initially, they took the measured wild-type expression levels xawt as reasonable estimates of the true wild-type expression levels xawt* . For each gene a, an initial estimate for the variance of the Gaussian noise is computed from its expression levels in the diﬀerent deletion strains. Then, the following three steps are repeated a number of times: (1) The probability of the regulatory link pb→a is computed for each pair of genes (b, a) using the current reference points xawt . The set of potential regulations is based on a p-value threshold of 0.05. If there is only a small probability of less than 0.05 that an expression change of gene a between a deletion strain b and wild type can also be due to chance events, then b → a is treated as a potential regulation. If the probability exceeds 0.05, (b, a) is included in the set P of gene pairs that is used to reﬁne the error model. (2) The gene expression levels in set P are used to re-estimate the variance of the Gaussian noise ∑ 𝜎 =

b (b,a)∈P (xa

2 − xwt a )

. |P| − 1 (3) For each gene a, its wild-type expression level is re-estimated as the mean value of its observed expression levels in the wild-type and in mutant strains in which the expression level of a is unaﬀected by the deletion of b: 2

xwt a

=

xwt a +

∑

b b∶(b,a)∈P xa

1 + |b∶ (b, a) ∈ P|

.

Subsequent iterations serve to converge set P to a stable conﬁguration. After the iterative steps are completed, the probability of regulation pb→a is calculated using the ﬁnal estimate of the reference points xa wt and the variance of the Gaussian noise 𝜎 2 . For time series data following an initial perturbation, diﬀerential equations dxi = fi (x1 , x2 , … , xn ) are used to model the gene expression rates with xi as the dt expression level of gene i and f i as the input function (see previous section). For the functional form of f i , a linear model is considered: ∑ dxi aij xj , = ai0 − aii xi + dt j∈S

241

242

Principles of Computational Cell Biology

G6

G6 G7

G5

G7

G5

G8

G4

G8

G4

G9

G3

G9

G3

G10

G2

(a)

G2

G10

G1

G1 (b)

Figure 9.6 Yeast1-size10 network to test the GRN reconstruction algorithm in the DREAM contest. (a) The actual network. (b) Ten predicted regulatory links with highest ranks. The two dotted arrows illustrate the diﬀerence between the two networks. Source: Yip et al. (2010). http://journals.plos.org/plosone/article/metrics?id=10.1371/journal.pone.0008121. Licensed under CC BY 4.0.

with ai0 as the basal expression rate of gene i in the absence of regulators, aii the decay rate of mRNA transcripts of i, and S as the set of potential regulators of i (they assumed that no self-regulation takes place, so i not element of S). For each potential regulator j in S, aij explains how the expression of i is aﬀected by the abundance of j. A positive aij indicates that j is an activator of i, and a negative aij indicates that j is a suppressor of i. The linear model involves S + 2 parameters and assumes a linear relationship between the expression level of the regulating genes and the expression rate of the target gene. A second model assumed a sigmoidal relationship between the regulators and the target, but this will not be covered here. Figure 9.6 shows the predicted network (b) in comparison to the real network (a) for a 10-node network that was taken from the full Saccharomyces cerevisiae GRN. Obviously, almost all edges were correctly detected except for the inhibition of G9 by G4 (which itself is inhibited by G1). The algorithm turned this double inhibition into a direct activation from G1 on G9. This may seem incorrect but is equally plausible given the available data. Figure 9.7 shows two cases that could only be disentangled by considering the time-dependent expression traces. In the network illustrated in panel (a), G7 is downregulated by G3, G8, and G10. Panel (c) shows that G8 and G10 have high expression levels in wt, whereas G3 expression is quite low. In the gene knockouts (b), turning oﬀ the inhibition by G3 therefore only results in a small increase of G7, which is hard to detect. However, panel (c) also gives a hint that the higher expression level of G7 over time is anticorrelated with the reduced level of G3. The inhibitory edge between G3 and G7 could therefore be detected by the ODE model. In the network shown in panel (d), G6 is activated by G1 and suppressed by G5. G1 also suppresses G5. G1 therefore has two functions on G6. When G1 is

G8

G4

G9

G3

G2 G1

(b) G6 G5

G8

G4

G9

G3

G10

Expression of G6

G7

1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Wt G1-G2- G3-G4-G5-G6-G7-G8-G9-G10-

(a)

G2 G1

(d)

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

(e)

Strain 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

Wt G1-G2- G3-G4-G5-G6-G7-G8-G9-G10-

Strain

G3 G7 G8 G10

Wt 0 20 40 60 80 100 120140 160180

(c)

Expression of G6

G10

Expression of G7

G5

Expression of G6

G6 G7

(f)

Strain 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0

G1 G5 G6

Wt 0 20 40 60 80 100 120140 160180

Strain

Figure 9.7 Two regulation events that were missed by the noise models of the YYAG algorithm but detected by the diﬀerential equation models. (a) The actual E. coli1-size10 network. (b) The homozygous deletion proﬁle of G7 in the E. coli1-size10 network. (c) A perturbation time series of G7 in the E. coli1-size10 network. (d) The actual E. coli2-size10 network. (e) The homozygous deletion proﬁle of G6 in the E. coli2-size10 network. (f ) A perturbation time series of G6 in the E. coli2-size10 network. Source: Yip et al. (2010). http://journals.plos.org/plosone/article/metrics?id=10.1371/journal.pone.0008121. Licensed under CC BY 4.0.

244

Principles of Computational Cell Biology

expressed, deleting G5 (e) has no eﬀect. In the time-dependent expression proﬁle, G6 appears anticorrelated to G1. This does not ﬁt with the activating role of G1. However, G5 is also anticorrelated with G6. This provides evidence for inhibitory role of G5. In conclusion, reverse engineering of GRN networks is a hot topic because GRNs give detailed insights into the circuitry of cells. This is important for understanding the molecular causes, e.g. of diseases. New data are constantly appearing. The computational algorithms need to be adapted. Perturbation data (knockouts and time series following perturbations) are most useful for mathematic reconstruction of GRN topologies.

9.5 Regulatory Motifs As mentioned in the introduction of this chapter, wiring diagrams of regulatory networks somehow resemble electrical circuits. To identify overrepresented motifs in these networks, Uri Alon and coworkers (Shen-Orr et al. 2002) tried to break down networks into basic building blocks. They searched for network motifs as patterns of interconnections that recur in many diﬀerent parts of a network at frequencies much higher than those found in randomized networks. To do so, they again represented the transcriptional network as a connectivity matrix M such that Mij = 1 if operon j encodes a TF that transcriptionally regulates operon i and Mij = 0 otherwise (Figure 9.8). All n × n submatrices of M were generated by choosing n vertices that belong to a connected graph, for n = 3 and n = 4. Submatrices were enumerated by recursively searching for nonzero elements. P values representing the statistical signiﬁcance were computed for submatrices representing each type of connected subgraph by comparing the number of times they appear in real network versus in a large number of random networks (Sections 6.8 and 10.6). For n = 3, the only signiﬁcant motif was the FFL (Figure 9.9). Single-input motif (SIM) modules (Figure 9.10) were identiﬁed by searching for isolated columns of M with many 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 (a)

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 (b)

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 (c)

Figure 9.8 Connectivity matrix for causal regulation of target gene j (row) by transcription factor i (column). Dark ﬁelds indicate regulation. (a) Feed-forward loop (FFL) motif. Transcription factor 2 regulates transcription factor 3 and target gene 6, and transcription factor 3 again regulates target gene 6. (b) Single-input motif (SIM), see text. (c) Densely overlapping region (DOR).

Gene Regulatory Networks

X

Crp

Y

araC

Z

araBAD

Figure 9.9 Example of an FFL (L-arabinose utilization in E. coli). The global transcription factor CRP is activated in the presence of cyclic AMP and deactivated in its absence.

X

argR

Z1 Z2

Zn

argCBH

argD

argE

argF

argI

(a)

(b)

fis

crp proP

rcsA

hns

lrp

ihf

oxyR

ada

nhaR nhaA

Zm

ftsQAZ

Z4

Xn

osmC

Z3

X4

dps

Z2

X3

katG

Z1

X2

alkA

X1

rpoS

Figure 9.10 Example of a single-input motif (SIM) system (arginine biosynthesis in E. coli).

Figure 9.11 Example of a DOR. (a) In this motif, many inputs regulate many outputs. (b) This is an example of a DOR motif found in the stress response system of E. coli.

entries equal to 1. For n = 4, only the densely overlapping regulation motif, where two operons are regulated by the same two transcription factors, was signiﬁcant (Figure 9.11). We will revisit these motifs in Section 15.1. 9.5.1

Feed-forward Loop (FFL)

The term feed-forward loop (abbreviated later as FFL) characterizes a motif where a ﬁrst TF X controls a second TF Y , and together they control one or more operons Z1 , …, Z n . This motif is present in hundreds of gene systems in E. coli (Shen-Orr et al. 2002) and yeast, as well as in other organisms (Figures 9.8a and 9.9). As each of the three regulatory links of the FFL can have an activating or repressing eﬀect, there exist 23 = 8 possible structures of FFLs. 9.5.2

SIM

In a SIM (Figures 9.8b and 9.10), a single TF X controls a set of operons Z1 , …, Z n . In the purest form of a SIM, no other regulator modulates the expression of any

245

246

Principles of Computational Cell Biology

of these genes. This gives rise to the name SIM. X is usually autoregulatory. All regulations are of the same sign. The main role of this motif is to induce coordinated expression of a set of target genes that share certain biological functions. The motif can generate a temporal expression program, where several target promoters are switched on in a precisely ordered way. Because of the variations in the binding motifs and its sequence context in the respective promoters, a TF often possesses diﬀerent activation thresholds for various target genes. Therefore, when X activity increases gradually over time, these thresholds are crossed in a deﬁned sequence, from the lowest threshold over the next lowest threshold, etc., so that expression levels are activated in a temporal order. For example, the biosynthesis of arginine was found to implement a SIM design where the repressor ArgR controls several operons that code for enzymes belonging to the arginine biosynthesis pathway. After removing arginine from the medium, these promoters are activated sequentially with minutes between the activations of different promoters. The order of activation matches the location of the enzymes in the arginine biosynthesis pathway. 9.5.3

Densely Overlapping Region (DOR)

A densely overlapping region (DOR) motif represents a set of operons Z1 , …, Z m together with a set of input TFs, X 1 , …, X n , by which they are regulated in a densely interconnected manner (Figures 9.8c and 9.11). The DOR algorithm (Shen-Orr et al. 2002) detects dense regions of connections, with a high ratio of connections to TFs. To identify DORs, all operons regulated by more than two TFs were considered. One can introduce a (nonmetric) distance measure between operons k and j that considers the number of TFs controlling both operons: d(k, j) = (∑

1 n fn Mk,n Mj,n

)2

where f n = 0.5 for global TFs (to account for their unspeciﬁcity) and f n = 1 otherwise. When clustering operons with the average-linkage algorithm, DORs correspond to clusters having more than 10 connections, where the connections outnumber the TFs by a factor larger than 2. E. coli contains several DORs that regulate hundreds of output genes. Each of them is relevant for a broad biological function. One can think of a DOR as a gate array that connects diﬀerent inputs with diverse outputs via a computation. Thus, to fully unravel the function of a DOR, the regulatory connections are not enough. Also, one needs to characterize the input functions of the promoters of each output gene. Exhaustive analysis of data from microarray experiments revealed that these three basic motifs (FFL, SIM, and DOR) are signiﬁcantly overrepresented in natural GRNs. Comparison of the regulation of homologous genes in diﬀerent organisms revealed that they are often regulated by diﬀerent classes of TFs. It is therefore hardly possible as in other areas of molecular and cell biology to transfer knowledge about particular gene regulation mechanisms from one organism to others. Remarkably, it turns out that if one gene is regulated by

Gene Regulatory Networks

a FFL, the homologous gene is often regulated by another FFL in response to similar environmental stimuli. It therefore seems that the same network motifs have been “rediscovered” during evolution over and over again. The same applies to SIM or DOR networks motifs where similar output genes in diﬀerent organisms are often related by unrelated TFs (Alon 2007). As GRNs become better characterized, it is likely that further new motifs and motif functions will turn up, for example, in the areas of signaling networks and neuronal networks. If the mentioned observations can be generalized, complex biological networks likely possess a certain degree of structural simplicity and there exist only a limited set of network motifs. This raises the hope that one will be able to formulate the dynamics of large networks on the basis of elementary circuit patterns. We will see a further example of how regulatory motifs are used in Section 10.6.

9.6 Algorithms on Gene Regulatory Networks 9.6.1

Key-pathway Miner Algorithm

The key-pathway miner algorithm solves the problem of ﬁnding key pathways at the level of labeled graphs (Alcaraz et al. 2012). Key pathways are assumed as connected subnetworks where most of the components are expressed in most conditions. The algorithm can output either only the best solution found or multiple top solutions. For a labeled graph G = (V , E, d) of vertices V and edges E, there also exists a labeling function d: V → ℕ. Let k, l ∈ ℕ. Then, the (k, l)-KeyPathway problem consists of determining a connected subset U ⊆ V of maximal cardinality, which contains at most k elements u ∈ U with d(u) ≤ l. Any set U fulﬁlling these two conditions is termed a (k, l)-component. Any vertex v ∈ V for which d(v) ≤ l is termed an exception vertex. Vertices of the graph represent biological entities (e.g. genes or proteins); edges stand for interactions between two such entities, e.g. a protein–protein interaction. The labels on a vertex v denote the number of situations were v is active/expressed/methylated, etc. In a preprocessing stage, one generates an auxiliary-labeled graph that serves to reduce the problem size and to help in steering the algorithm to more promising regions of the search space. C(G, l) is the l-component graph that is deduced from G in the following way. The vertex set of C(G, l) contains all exception vertices of G. Two exception vertices are linked by an edge in C(G, l) if they are connected by a path in G, which does not contain exception vertices as inner vertices. For any subset U ⊆ V of exception vertices, S(U) is deﬁned as the set of all vertices v ∈ V that can be reached in G from an element of U without visiting an exception vertex that does not belong to U. Intuitively, one simply needs to select a connected set of k exception vertices U in C(G, l) to construct a (k, l)-component of G, namely S(U). The key-pathway miner algorithm then applies a greedy principle. For every vertex u, a set W u is iteratively constructed that begins with W u = {u}. At every iteration step, one adds a vertex v from C(G, l) to W u that is adjacent to W u (in C(G, l) and which maximizes |S(W u ∪ {v})|). The iterations are stopped

247

248

Principles of Computational Cell Biology

ATP2B2

MAP1A

APP

GRIA1

COX6A1 PFN2

EPB41L DLG2

DLG3

GNB1

GAPDH RUNDCJA

MAP2

VTPR1

RAP2A

YWHAB

GRIN

HTT PRKCZ

OPTN DVL3

ARHGEF4

CDK5

UBE2K

STMN2

SMARCA4

CTNNB

CALM1 9743

KCNQ3

PQBP1

FBA1 UOCRC2 PRNP

THY1 GRM7 CALM3

Figure 9.12 Largest subnetwork identiﬁed as downregulated in the caudate nucleus of Huntington’s disease patients found by the key-pathway miner algorithm for k = 2. Red nodes represent exception nodes, squared nodes are genes also reported as part of the Huntington’s disease KEGG pathway, nodes with dashed borders are HTT modiﬁers, and nodes with italic font are part of the calcium signaling pathway. Source: Alcaraz et al. (2012). Reproduced with permission of Royal Society of Chemistry.

when |W u | = k. The algorithm returns S(W u ) of maximal size found for some u. Figure 9.12 shows the results of the key-pathway miner algorithm for gene expression data from patients suﬀering from Huntington’s disease.

9.6.2

Identifying Sets of Dominating Nodes

A given GRN may contain hundreds of TFs and thousands of target genes. Are all TFs equally important or do a small number of “master regulators” control the rest of the network? This question is the motivation behind searching for a minimum-size set of dominating nodes (MDS, minimum dominating set) that regulate all other genes of the network (Nazarieh et al. 2016). Because each node that does not belong to the MDS is adjacent to at least one node in the MDS, full control over the network is provided by the MDS solution. The MDS method can be applied to any connected or disconnected regulatory network to identify key dominator nodes. If applied to the area of complex diseases, this method can capture several important disease and drug target genes. Besides the MDS concept, the task of identifying a set of master regulatory genes can also be considered as an analog of another optimization problem, namely, that of constructing a minimum connected dominating set (MCDS). The concepts of MDS and MCDS are visualized for a small toy network in Figure 9.13.

Gene Regulatory Networks

A

A

C

E

C

E

B

D

B

D

(a)

(b)

Figure 9.13 An illustration of the MDS and MCDS solutions of an example network. The network can be controlled by MDS and MCDS nodes. In the case of a GRN, directed arcs symbolize that a transcription factor regulates a target gene. (a) The MDS nodes {A, B} are the dominator nodes of the network. Together, they regulate all other nodes of the network (C, E, D). (b) Visualizes the respective set of MCDS nodes (black and gray). Here, node C is added in order to preserve the connection between the two dominators A and B to form an MCDS.

9.6.3

Minimum Dominating Set

A dominating set (DS) in an undirected (or directed) graph G = (V , E) is a subset of nodes D ⊆ V , where for each node v ∈ V , either v ∈ D or there is a node u ∈ D and an (directed) edge {u, v} ∈ E (Figure 9.13). We call a set D ⊆ V a MDS if it is a dominating set and it has minimum cardinality among all dominating sets for G. Computing an MDS is known to be an NP-complete problem. Here, an integer linear programming (ILP) formulation of MDS for directed graphs is presented. For each node v ∈ V , 𝛿 − (v) denotes the set of incoming nodes of v, i.e. the set of nodes u such that (u, v) ∈ E. ∑ xv minimize v∈V

subject to xu +

∑

xv ≥ 1

∀u ∈ V

v∈𝛿 − (u)

xv ∈ {0, 1} ∀v ∈ V Here, variables xu and xv are binary variables associated with nodes u and v in the graph. Using this formulation, a node v is selected as a dominator if its binary variable xv has value 1 in the computed solution and otherwise it is not selected. ∑ Because the objective function is to minimize v∈V xv , this yields an MDS. 9.6.4

Minimum Connected Dominating Set

An MCDS for a directed graph G = (V , E) is a set of nodes D ⊆ V of minimum cardinality that is a dominating set and that additionally has the property that the graph G[D] induced by D is weakly connected, i.e. such that in the underlying undirected graph, there is a path between any two nodes v,v′ ∈ D using only vertices in D. Computing an optimal MCDS in undirected graphs is known to be NP-hard.

249

250

Principles of Computational Cell Biology

For a set S ⊆ V , the set E(S) stands for all the edges connecting two vertices u,v ∈ S. Let the binary-valued yv variables indicate whether node v is selected to belong to the MCDS. The binary variables xe for the edges then yield a tree that contains all selected vertices and no vertex that was not selected. Thus, the selected vertices form a connected component. An ILP can be formulated to obtain an MCDS: ∑ yv minimize v∈V

subject to ∑ e∈E(S)

yu +

∑

xe =

e∈E

∑

xe ≤ ∑

∑

yi − 1

i∈V

yi

∀S ⊂ V , ∀ j ∈ S

i∈S\{j}

yv ≥ 1 ∀u ∈ V

v∈𝛿 − (u)

yv ∈ {0, 1} ∀v ∈ V xv ∈ {0, 1} ∀e ∈ E The ﬁrst constraint guarantees that the number of edges is one unit less than the number of nodes. This is necessary for them to form a (spanning) tree, but is not suﬃcient. The second constraint guarantees that the selected edges imply a tree. The third constraint guarantees that the set of selected nodes in the solution forms a dominating set of the graph, see Section 9.6.3. For dense undirected graphs, this ILP formulation provides a quick solution, but in the case of sparse graphs, ﬁnding the optimal solution may take considerable running time. Note that the given ILP formulation contains an exponential number of constraints because it has one constraint for each subset S ⊆ V . Therefore, already for relatively small instances, it is impractical to generate all its inequalities. Instead, one can use the following approach: we generate the ﬁrst constraint and all con∑ ∑ ∑ straints of the third type (i.e. e∈E xe = i∈V yi − 1 and yu + v∈𝛿−(u) yv ≥ 1 for each u ∈ V ). Then, we compute the optimal ILP solution subject to these constraints. Then, we check whether the found solution satisﬁes all constraints of the above ILP (even those that we did not add to our formulation). This is the case if and only if the computed set of vertices yields a connected (dominating) set. If this is the case, then we found the optimal solution and we stop. Otherwise, ∑ ∑ we add (violated) constraints of the second type (i.e. e∈E(S) xe ≤ i∈S\{j} yi for some subset V and some node j) to our formulation and compute the optimal ILP solution to this stronger formulation and repeat. Figure 9.14 shows an example for an MCDS controlling the subset of genes that are diﬀerentially expressed during the cell cycle of S. cerevisiae.

9.7 Summary GRNs are responsible for producing the right amount of protein at the right time in an intricately orchestrated timely manner. When measured at the level of mRNA expression by microarrays or RNAseq, it is hard to deduce which

Gene Regulatory Networks

PHD1

GCR1

STB1

ORC1

FKH1 RAP1 SWI4 ACE2 TID3 STB5

CAR1

STP4

SRD1 TEC1 YOX1 PMA2 SWI5

Figure 9.14 Tightly interwoven network of 17 transcription factors and target genes that organize the cell cycle of S. cerevisiae. Shown on the circumference of the outer circle are 164 target genes that are diﬀerentially expressed during the cell cycle. The inner circle consists of the 14 TFs from a heuristic form of the MCDS algorithm and 123 other target genes that are regulated by at least 2 of these TFs. Source: Nazarieh et al. (2016). https://bmcsystbiol .biomedcentral.com/articles/10.1186/s12918-016-0329-5. Licensed under CC BY 4.0.

gene is regulating which other gene as we only see the end result of the process. Although methods for automatic reconstruction of the topologies of gene regulatory network are beginning to yield useful results, their widespread usage is still hampered by the overlapping eﬀects and unclear roles of epigenetic modiﬁcations, microRNAs, and chromatin conformation.

9.8 Problems 1. Boolean network In this task, we consider the yeast cell cycle network described in Orlando et al. (2008). The following statements formulate the node dependencies in the network: • MBF is activated by CLN3.

251

252

Principles of Computational Cell Biology

• If CLN3 or MBF is transcribed and at least one of the inhibitors YOX1 and YHP1 is inactive, SBF is active. • YOX1 is active if both its TFs MBF and SBF are present. The same applies to HCM1. • YHP1 can be activated independently by MBF and SBF. • SBF and HCM1 jointly activate SFF. • ACE2 require SFF to be active. The same applies to SWI5. • CLN3 requires the presence of SWI5 and ACE2 and the inactivity of at least one of the inhibitors YOX1 and YHP1 to be activated. (a) Construct a Boolean network (conditional tables) from these statements and save it to a ﬁle boolean nw.txt. Load the network into BoolNet using the function loadNetwork and visualize its wiring with plotNetworkWiring(). (BoolNet is an R package available at https://cran.r-project.org/ web/packages/BoolNet/index.html that provides tools for assembling, analyzing, and visualizing synchronous and asynchronous Boolean networks as well as probabilistic Boolean networks.) (b) Plot the trajectory starting from (MBF, CLN3, YOX1, YHP1, SBF, HCM1, SFF, ACE2, SWI5) = (0, 1, 1, 1, 1, 1, 1, 1, 1) for the next two state transitions (solve on paper). 2. Gene regulatory networks and expression data The supplementary ﬁle TF_target_map.txt contains in each line a TF and the names of all proteins (UniProt accessions) to the gene promoters of which the TF binds to. Furthermore, expr_H1hESC.gtf contains H1ESC gene expression data from the ENCODE project in the GENCODE GTF format. Here, genes are given as Ensembl identiﬁers. (a) Build an initial human GRN using the TF/target associations given in TF_target_map.txt. Report the number of genes/proteins in the network and the number of interactions. (b) Read the ENCODE H1hESC expression data. First, separate all protein-coding genes into the ones with FPKM value above 1 and those below 1. The former ones will be considered as expressed. Convert the gene identiﬁers to UniProt accessions using mapping data from the HGNC web service at https://www.genenames.org/cgi-bin/download (tick both “UniProt ID” and “Ensembl ID” and untick the rest). How many of the genes can be mapped to UniProt proteins? How many of them are expressed in stem cells? (c) Reﬁne your initial GRN by integrating the speciﬁc expression data. For simplicity, assume that all promoters are accessible to the TFs and that a TF always aﬀects the expression of the gene it binds to. Consequently, all TFs that are expressed are assumed to interact with all their target genes. Report the number of proteins in this network and the number of interactions. Discuss what additional data would be needed to overcome the strong assumptions above. (d) Crucial regulatory drivers are often referred to as “master regulators.” Although there is no ﬁxed deﬁnition of the term, it sometimes means the TFs on the highest level of the regulatory hierarchy. Use the notion of

Gene Regulatory Networks

Figure 9.15 Toy Boolean network (see Problem 3).

D

B

E

A

C

F

topological sorting to determine putative master regulators in the reﬁned network. Do the TFs in the highest level of the hierarchy have parents as well? How many equivalent TFs are on this level? 3. Boolean network Consider the Boolean network shown in Figure 9.15, which describes the mutual regulation of the hypothetical genes A to F. A line with an arrowhead denotes an activation, whereas a ﬂat end denotes an inhibition, i.e. if A is high, B is activated, whereas high levels of D inhibit the expression of B. To investigate the behavior of this network, use a dynamic simulation with a synchronous update scheme. Assume that an activation has a weight of 1, whereas an inhibition is always three times stronger than an activation. Set all threshold to 0. (a) Weighted Interactions. Set up the propagation matrix that relates the states of genes A to F in the next iteration to the current state. (b) Implementation. Write a program to simulate the Boolean network. To enumerate the initial states, convert the binary levels of the genes into integer numbers where A determines the least signiﬁcant bit and F the most signiﬁcant one. An initial state where, e.g. only A, C, and D are at high levels would translate into 1 + 4 + 8 = 13. – When does it make sense to stop the propagation and why? – Which sequences do you get when you start from states 1, 4, 21, and 33? (c) Periodic orbits. Determine the attractors and the corresponding basins of attraction by going through all possible initial states and save at which state the Boolean network closes its orbit. – List these orbits with their respective lengths and basins of attraction. – Give the relative coverage of the state space by the basins of attraction. (d) Interpretation – Give the attractors in terms of active genes and characterize them with a few words. – Which ones are the special genes and what are their respective eﬀects on the behavior of the network? For this, explain what is determining

253

254

Principles of Computational Cell Biology

the period of the orbits. Further, compare the two shorter orbits with each other. Which gene is responsible for the diﬀerence?

Bibliography Gene Networks Brazhnik, P., de la Fuente, A., and Mendes, P. (2002). Gene networks: how to put the function in genomics. Trends in Biotechnology 20: 467–472. van Dam, S., Võsa, U., van der Graaf, A. et al. (2017). Gene co-expression analysis for functional classiﬁcation and gene-disease predictions. Brieﬁngs in Bioinformatics 19: 575–592. Martínez-Antonio, A. and Collado-Vides, J. (2003). Identifying global regulators in transcriptional regulatory networks in bacteria. Current Opinion in Microbiology 6: 482–489.

Mathematical Modeling of Gene Regulatory Networks Filkov, V. (2005). Identifying gene regulatory networks from gene expression data. In: Handbook of Computational Molecular Biology (ed. S. Aluru), 27-1–27-29. Chapman & Hall/CRC Press. Markowetz, F. and Spang, R. (2007). Inferring cellular networks – a review. BMC Bioinformatics 8: S5.

Gene Regulatory Networks of E. coli and S. cerevisiae Costanzo, M.C., Engel, S.R., Wong, E.D. et al. (2014). Saccharomyces genome database provides new regulation data. Nucleic Acids Research 42: D717–D725. Resendis-Antonio, O., Freyre-González, J.A., Menchaca-Méndez, R. et al. (2005). Modular analysis of the transcriptional regulatory network of E. coli. Trends in Genetics 21: 16–20.

Boolean Networks Espinosa-Soto, C., Padilla-Longoria, P., and Alvarez-Buylla, E.R. (2004). A gene regulatory model for cell-fate determination during Arabidopsis thaliana ﬂower development that is robust and recovers experimental gene expression proﬁles. The Plant Cell 16: 2923–2939. Gershenson, C. (2004). Introduction to random Boolean networks. In: Artiﬁcial Life IX: Proceedings of the Ninth International Conference on the Simulation and

Gene Regulatory Networks

Synthesis of Living Systems (ed. J. Pollack, M. Bedau, P. Husbands, et al.), 160–173. MIT Press. Orlando, D.A. et al. (2008). Global control of cell cycle transcription by coupled CDK and network oscillators. Nature 453: 944–947.

Motifs in Networks Alon, U. (2007). Network motifs: theory and experimental approaches. Nature Reviews Genetics 8: 450–461. Shen-Orr, S.S., Milo, R., Mangan, S., and Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31: 64–68.

DREAM Challenge Yip, K.Y., Alexander, R.P., Yan, K.-K., and Gerstein, M. (2010). Improved reconstruction of in silico gene regulatory networks by integrating knockout and perturbation data. PLoS One 5: e8121.

Key Pathway Miner Alcaraz, N., Friedrich, T., Kötzing, T. et al. (2012). Eﬃcient key pathway mining: combining networks and OMICS data. Integrative Biology 4: 756–764.

MDS and MCDS Nazarieh, M., Wiese, A., Will, T. et al. (2016). Identiﬁcation of key player genes in gene regulatory networks. BMC Systems Biology 10: 88.

255

257

10 Regulatory Noncoding RNA In 1958, Francis Crick formulated the so-called central dogma of molecular biology that postulates a unidirectional ﬂow of information where welldetermined pieces of the genomic DNA sequence are transcribed into messenger RNA (mRNA), which are then translated into protein. However, the “RNA revolution” that started in the 1990s has revealed that hundreds of evolutionary conserved sequences of the genome are also being transcribed into RNA, but not translated into protein. In this chapter, we will focus on one particular class of noncoding RNAs (ncRNAs), the so-called microRNAs (miRNAs), that are now recognized as equally important for regulating gene expression as the well-known transcription factors (TFs) (see Chapter 7). Many miRNAs are conserved throughout large parts of the plant and animal kingdom. As of 2018, the database mirbase1 contains more than 20.000 of such miRNA sequences, of which about 2700 are from human.

10.1 Introduction to RNAs Single-stranded RNA molecules often have recognizable “domains” of secondary structure (Figure 10.1). The double-stranded regions formed by the stacking of two or more consecutive base pairs are referred to as stems. The single-stranded regions of the secondary structure form a variety of patterns or motifs, all of which are some kind of loops. The hairpin loop is a single-stranded region that a stem ends into. A bulge loop can be viewed as a single-stranded region that interrupts a stem on one side. An interior loop on the other hand interrupts a stem on either side. A single-stranded region where more than two stems meet is called a multijunction or multibranch loop. Moreover, RNA molecules frequently adopt speciﬁc tertiary structures by forming intrastrand hydrophobic contacts. Figure 10.2 shows the three-dimensional structure of a ribozyme, an important class of ncRNAs that play prominent roles, e.g. in splicing. RNAs come in various forms, such as long ncRNAs, snoRNAs (small nucleolar RNAs), miRNAs, siRNAs, etc. Importantly, these RNA variants may either bind to newly synthesized mRNAs and thus prevent their translation until they themselves will eventually be degraded or may bind directly to coding DNA 1 www.mirbase.org Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

258

Principles of Computational Cell Biology

(5′ end) G

(3′ end) A A

U

G

C

Stem (S1)

A

U

G

A G

Interior loop G U

Hairpin loop

G

C

G

A

A

U

(S2)

G

C

Bulge loop G G A

G

G

A G G

U G

U

C

C

A

G

U

Multibranch loop G

(S3)

G

G

A

C

U

A

C

C

U

G

A

U

G

G

(S4)

G G A

(S5)

Figure 10.1 Basic structural motifs of RNA secondary structure. This RNA consists of ﬁve stems (labeled S1–S5) connected by loops (labeled according to loop type).

4 5 3

A652-A2 tertiary contact

Loop–loop kissing interaction

1

A730-loop

2 6

G638-loop

Figure 10.2 Three-dimensional structure of the VS ribozyme. This ribozyme from the mitochondria of Neurospora performs self-cleavage during replication. Shown is the catalytic domain (helices 2–6) of one protomer and the substrate helix (helix 1) that belongs to another protomer. The three-way helical junctions 2–3–6 and 3–4–5 organize the overall fold of the catalytic domain. The yellow sphere depicts the scissile phosphate. Red sticks correspond to the catalytic nucleobases. Junction 1–2–7 and accompanying helices 1 and 7 have been omitted for clarity. Source: Suslov et al. (2015). Reprinted with permission of Springer Nature.

Regulatory Noncoding RNA

Table 10.1 Diﬀerent types of RNA molecules. Short name

Full name

Function

Oligomerization

mRNA rRNA

Messenger RNA

Product of gene transcription

Single stranded

Ribosomal RNA

Part of ribosome

Single stranded

tRNA snRNA

Transfer RNA

Docks to ribosome

Single stranded

Small nuclear RNA

Splicing and other functions

snoRNA

Small nucleolar RNA

Nucleotide modiﬁcation of RNAs

Long ncRNA Long noncoding RNA Various miRNA

MicroRNA

siRNA

Small interfering RNA Gene regulation

Gene regulation

Single stranded Double stranded

The ﬁrst three rows contain the well-known mRNA, rRNA, and tRNA that are covered in all classical text books of molecular biology and cell biology. snRNA and snoRNA have speciﬁc functions in the nucleus. Long ncRNAs belong to a less-well characterized class.

sequences and silence them. Table 10.1 gives an overview over diﬀerent types of RNA molecules. In this text, we will focus on those RNA molecules that are involved in regulating gene transcription. Small nuclear RNA (snRNA) molecules are found inside the nucleus of eukaryotic cells. They are transcribed by RNA polymerase II or RNA polymerase III and are involved in a variety of important processes such as RNA splicing, regulation of TFs or RNA polymerase II, and maintaining the telomeres. A large group of snRNAs is known as snoRNAs. These are small RNA molecules that play an essential role in RNA biogenesis and guide chemical modiﬁcations of rRNAs, tRNAs, and snRNAs. They are located in the nucleolus and in the cajal bodies of eukaryotic cells.

10.2 Elements of RNA Interference: siRNAs and miRNAs RNA interference may involve small interfering RNAs (siRNAs) or miRNAs. Because both are encoded by DNA but not translated into protein, they are called ncRNAs. In 2006, the Nobel Prize in Physiology or Medicine was awarded to Andrew Fire and Craig Mello who reported in 1998 that the double-stranded RNA may cause the so-called RNA interference in the worm Caenorhabditis elegans. siRNA molecules, sometimes also termed short interfering RNA or silencing RNA, are a class of double-stranded RNA molecules that are 20–25 nucleotides in length (often precisely 21 nt) and play a variety of roles in biology. siRNAs are generated by a highly controlled multistep process. By trimming double-stranded RNA, the enzyme dicer produces small interfering RNA or miRNA. These generated RNAs are incorporated into the RNA-induced silencing complex (RISC), which targets mRNA by downregulating their translation. siRNAs typically base pair perfectly to mRNA and induce mRNA cleavage only in a single, speciﬁc target or interfere with the expression of a speciﬁc gene.

259

260

Principles of Computational Cell Biology

In contrast to double-stranded siRNA, miRNAs are single-stranded RNA molecules of 21–23 nucleotides in length. miRNAs play important generegulatory roles in animals and plants. Precisely, they bind to the 3′ -UTR regions of mRNAs of protein-coding genes and induce their posttranscriptional degradation. miRNAs in plants usually bind to their targets with perfect complementarity. In contrast, animal miRNAs typically have incomplete base pairing to a target and inhibit the translation of many diﬀerent mRNAs with similar sequences (Figure 10.3). The biogenesis of miRNAs consists of both nuclear and cytoplasmic steps (Figure 10.4). Following transcription, a protein complex containing the proteins Drosha and DGCR8 converts pri-miRNAs into pre-miRNAs. These pre-miRNAs are then exported from the nucleus to the cytoplasm. Similar to the processing of siRNAs, the cytoplasmic pre-miRNAs are then processed into mature miRNA duplexes by the Dicer protein. The guide strand of the mature miRNA duplex binds to the Argonaute–RISC complex and then directs the RISC complex to bind to mRNA target sequences, and thereby represses the respective gene.

Plants AUG

AGO-PIWI domain mRNA ORF

STOP NNNNN NNNN NN NN NN NNN NN CH3-2′O-N N N N N N N N N N N N N N N N N N N N N-P 5′

(a)

Animals

miRNA

21 19 17 15 13 11 9 7 5 3 1 20 18 16 14 12 10 8 6 4 2

ORF Seed match

AA A A

mRNA 3′ UTR

N NN N NN NNNNN NNNN N O H 3′ miRNA

(b)

NN NN

N N N N N N N-P 5′ 7

5 6

3 4

1 2

Seed

Figure 10.3 MicroRNAs (miRNAs) recognize their targets by Watson–Crick base pairing. (a) Plant miRNAs recognize fully or nearly complementary binding sites, which are generally located in ORFs. Importantly, miRNA nucleotides (N) 9–12 are usually engaged in Watson–Crick base pairing, which allows target cleavage by Argonaute proteins. (b) Animal miRNAs recognize partially complementary binding sites, which are generally located in 3′ -UTRs. Complementarity to the 5′ end of the miRNA – the “seed” sequence, containing nucleotides 2–7 – is a major determinant in target recognition and is suﬃcient to trigger silencing. For most miRNA-binding sites, the complementarity is limited to the seed sequence or to the seed sequence plus miRNA nucleotide 8. Source: Huntzinger and Izaurralde (2011). Reprinted with permission of Springer Nature.

Regulatory Noncoding RNA

Processing of Pri-miRNA into Pre-miRNA Drosha

Pre-miRNA

(AAA)n

DGCR8

5′ P

Pri-miRNA Nucleus

Exportin 5-RanGTP shuttle

Cytoplasm TRBP Dicer

5′ P

5′ P Processing of Pre-miRNA into mature miRNA

Pre-miRNA

5′ P Mature miRNA duplex 5′ P RISC Assembly of miRNA guide strand into RISC complex

RISC

Target mRNA recognition (AAAA)n and inhibition of translation or destabilization of mRNA

Figure 10.4 Schematic representation of miRNA biogenesis. Source: Mallanna and Rizzino (2010). Reprinted with permission of Elsevier.

The ﬁrst two miRNAs were identiﬁed in the year 2000. Both lin-4 and let-7 are important in timing the larval development of C. elegans. Interestingly, let-7 is highly conserved among various organisms (Figure 10.5), suggesting that its function is conserved as well.

10.3 miRNA Targets How do miRNAs function? In principle, they could “swim around” in the cytosol and search for complementary binding partners. However, as mentioned above, miRNAs prefer to bind to proteins of the Argonaute family before recognizing their target mRNAs. In this manner, they can be “presented” to potential binding partners in a well-deﬁned conformation. Animal miRNAs bind to partially

261

262

Principles of Computational Cell Biology U A U A A G G U U U G A U A U G U U G G A U G A U G G A G U G G C C U

U

C

A

U

C C A C C GG U AG A C U A U G C A A U U U U C U A C C U U A C C G G A

C. elegans

A CA U C

U A A

A U C

U G

A U G A U A U G U U G G A U G A U G G A G U U A A A C G G

A

G U G C U A G U U U U C U U U G C U

D. melanogaster

U U G A U A U G U U G G A U G A U G G A G G G G U

U U C C U G

U

U A C U A U A C A A U

G U

U G C C C C U A UG U G AU A G G A C U A U A C A A U C U A C U G U C U

C G G G G

H. sapiens chr22

Figure 10.5 Stem-loop structures of C. elegans, Drosophila melanogaster, and Homo sapiens let-7 transcripts. The 21-nt let-7 region is shaded. Source: Pasquinelli et al. (2000). Reprinted with permission of Springer Nature.

complementary target sites on mRNAs. This destabilizes the respective mRNAs and/or represses their translation. It is estimated that about 30–60% of the mammalian genes are potentially regulated by one or more miRNAs (Hafner et al. 2012). Thus, it is not surprising that miRNAs are involved in almost all biological processes, ranging from development to metabolic regulation and cancer. Most miRNAs aﬀect the levels of hundreds of target genes. On the other hand, many genes are concurrently regulated by several or many miRNAs. In this way, miRNAs add an additional layer of complexity to cellular gene expression networks. Much still needs to be discovered in this area. Figure 10.6 shows the miRNA circuity responsible for the maintenance of cell pluripotency and for the onset of diﬀerentiation. Considering their biological function, we may ask ourselves why miRNAs are precisely 21 nt long. Let us ﬁrst consider a particular 4-mer sequence – e.g. CGGC – and reﬂect on how many complementary binding positions one would expect for it in the human genome that contains 3 × 109 nt. Assuming an equal distribution of mono- and dinucleotides, we expect that 3 × 109 /44 positions of 4-mers = 11.7 × 106 positions are complementary to it. This would be not at all selective. On the other hand, for the full-length 21-mer, we expect

Regulatory Noncoding RNA

Rbl2

Ensures proper ES cell differentiation

p21

Nanog

Causes ES cell differentiation

miR-296

Dnmt 3a and 3b

miR-134

Sox2

miR-470

Oct4

Target genes of pluripotency factors Sox2, Oct4, Nanog in ES cells

Lin28

let-7 cluster

Sal4

miR-290 cluster Klf4 miR-302 cluster

miR-145

c-Myc

p53

miR-107

miR-141

miR-143

miR-200

miR-200c

miR-429

Attenuates ES cell differentiation

Figure 10.6 Regulatory networks of miRNAs and proteins involved in the control of cell self-renewal and diﬀerentiation in embryonic stem (ES) cells. Black arrows: direct binding and/or activation of miRNA/protein expression. Black dashed arrows: indirect activation of miRNA/protein expression. Red lines with a vertical stub: direct inhibition of miRNA/protein expression or evidence for direct inhibition as suggested by binding of the protein to the promoter region. Ovals colored in darker gray (proteins/protein coding genes) and rectangles (miRNAs/miRNA coding genes): expressed only in ES cells/expressed abundantly in ES cells compared to diﬀerentiated cells. Ovals colored in light gray (proteins/protein coding genes) and rectangles (miRNAs/miRNA coding genes): expressed only in diﬀerentiated cells/expressed abundantly in diﬀerentiated cells compared to ES cells. The pluripotency factors Sox2, Oct4, and Nanog, in addition to regulating the expression of numerous protein-coding genes, regulate the expression of several miRNAs in ES cells, including the miR-290 cluster and miR-302 cluster. Conversely, several tissue-speciﬁc miRNAs such as miR-296, miR-134, miR-470, and miR-145 inhibit the expression of Sox2, Oct4, and Nanog in ES cells. Source: Mallanna and Rizzino (2010). Drawn with permission of Elsevier.

3 × 109 /421 = 0.7 × 10−3 positions, which would be too few. Real miRNAs fall somehow in between these two extreme examples. In fact, it turned out that animal miRNAs do not bind selectively over their full length. Typically, they base pair to the 3′ -UTR regions of their target mRNAs with positions 2–7 from their 5′ ends. The target UTR regions of genes are typically about 800 nt long. Concatenating all human 3′ -UTRs gives a total length of 20.000 mRNAs × 800 nt = 16 × 106 nt. In this construct, every 6-mer recognition sequence of a miRNA is expected to bind speciﬁcally to about 16 × 106 /46 = 3900 positions. The result of this back-of-the-envelope calculation is not far from the hundreds of mRNAs that are targeted by typical miRNAs. There exist two sorts of databases related to miRNAs. One type of database collects information about identiﬁed miRNAs. mirbase.org is one representative

263

264

Principles of Computational Cell Biology

of this class. Databases of the other type store information about the targets that are regulated by miRNAs.

10.4 Predicting miRNA Targets Many modern bioinformatics approaches have been developed with the aim of predicting the targets of miRNAs. The following three-step protocol has been recommended for predicting evolutionarily conserved targets for a metazoan miRNA (Bartel 2009): (i) Detect the two 7 nt matches where the miRNA binds to the seed region of the target mRNA. For example, miRNA-1 having the sequence 5′ -UGGAAUGUAAAGAAGUAUGUA recognizes the CAUUCCA match (that is complementary to UGGAAUG in the reverse direction) and the ACAUUCC match (that is shifted by one nucleotide to the right). (ii) Whole-genome alignments are used to compile orthologous 3′ -UTRs. (iii) Conserved occurrences of either 7 nt match are identiﬁed within the set of orthologous UTRs. These are predicted as regulatory sites. Note that members of one miRNA family all have the same predicted targets. Searching for longer, conserved 8 nt long sites that comprise both 7 nt motifs (e.g. ACAUUCCA, in the case of miRNA-1) yields a higher prediction speciﬁcity, whereas searching for shorter, conserved 6 nt long seed matches yields higher sensitivity. Figure 10.7 gives an overview over the diﬀerent tasks where bioinformatics tools contribute to miRNA research. Algorithms in the area of miRNA target prediction typically consider (some of ) the following features: (i) seed complementarity between miRNA and mRNA strands, (ii) evolutionary conservation of miRNA target sites among species, (iii) free energy of the miRNA:mRNA duplex, (iv) target site accessibility, and (v) contribution of multiple binding sites. Of course, machine-learning tools are widely used in this area. Some of these tools are integrated with expression analysis and check whether the predicted miRNA:mRNA pairs are coexpressed.

10.5 Role of TFs and miRNAs in Gene-Regulatory Networks TFs and miRNAs work through common regulatory principles. The combination of TFs and miRNAs expressed in a particular cell controls its speciﬁc cell type. Individual TFs and miRNAs can control between tens and hundreds of target genes via binding to their cis-regulatory elements. Almost all genes in the genome appear to be regulated not by a single but by a combination of multiple regulating factors. Many TFs bind cooperatively to their target binding motifs on the DNA and/or cooperatively bind together with transcriptional cofactors. In a similar manner, reporter gene assays also revealed cooperative activity of multiple miRNAs. Cooperative action is therefore the mechanistic basis for combinatorial expression patterns of both TFs and miRNAs.

Regulatory Noncoding RNA

Genomic data

Correlating miRNA and mRNA expression

Novel miRNA discovery

miRNA–TF interaction

miRNA target prediction

miRNA regulatory network prediction

miRNA metabolic and signaling pathway analysis Validated miRNA databases

MiRNA and disease

miRNA deregulation in human disease

Circulating miRNAs MiRNA, EF, and phenotype

Linking SNP in miRNA target with disease

Predicting target of host and viral miRNA

Figure 10.7 Bioinformatics tools are used for diﬀerent purposes in miRNA research. Source: Akhtar et al. (2016). https://academic.oup.com/nar/article/44/1/24/2499630. Licensed under Creative Commons Attribution 4.0.

Controlling the accessibility of binding sites creates an additional control layer of gene regulation. The occupancy of TF binding sites in vivo depends on how these sites are covered by nucleosomes, whereby the positioning and remodeling of nucleosomes are regulated processes. In the same manner, the accessibility of miRNA recognition sites is aﬀected by a large protein family with more than 100 members that have RNA binding domains. Besides, the accessibility of miRNA binding sites is also aﬀected by the secondary structures adopted by mRNA target sequences. Simply considering the coexpression of a miRNA and its mRNA target, therefore, is not always indicative for functional interactions.

265

266

Principles of Computational Cell Biology

Although TFs can regulate transcription either in a activating a or repressing manner, miRNAs appear to have mostly repressing eﬀects on gene expression levels. Such repressive eﬀects are important for shaping cell-speciﬁc gene regulatory programs. Broadly expressed TFs mostly exert broad activity that is not speciﬁc for particular cell types. Such broad eﬀects can be turned into more speciﬁc eﬀects in combination with cell-type-speciﬁc transcriptional repressors so that gene expression can be focused on a smaller subset of cells. miRNA-mediated gene regulation can take place at fast speed and is reversible, which gives miRNAs a specialized regulatory niche. Downregulation of transcription requires that a sophisticated machinery is initiated in the nucleus that is separated from the cytoplasm where proteins are synthesized. The stability of already transcribed mRNA species puts an upper threshold on the speed at which transcriptional repression can shut down the expression of a target gene. On the other hand, miRNAs can rapidly turn oﬀ protein production right at the ribosome. Because of their small size and noncoding nature, miRNAs can be synthesized in much shorter time than TFs, which reduces the response time to stimuli that activate gene repression. A target that is repressed by miRNAs can also be reactivated more rapidly than a target that is repressed by TFs because the former mechanism only requires translocating a mRNA that exists already to an active ribosome. Another important diﬀerence between miRNA- and TF-mediated gene regulation is that the activity of miRNAs can be focused on individual compartments and aﬀect local gene expression levels in a fast manner. For example, highly compartmentalized neurons need to regulate gene expression at the translational level on a synapse-speciﬁc, rather than cell-wide, level. TFs cannot focus their regulatory eﬀects on subcellular compartments. In summary, speed, reversibility, and compartmentalization of miRNA-mediated control mechanisms predestine miRNAs to induce rapid, adaptive changes in gene expression that may, for example, maintain homeostasis or respond to speciﬁc environmental, nutrient, or neuronal signals.

10.6 Constructing TF/miRNA Coregulatory Networks Several bioinformatics tools or servers enable users to predict coregulatory networks where TFs and miRNAs jointly regulate a set of common target genes. Although one can consider such networks as simple directed mathematical graphs (see Chapter 4), it should be emphasized that the edges in the joint network have diﬀerent meanings. Still one can use common graph algorithms to identify motifs where one TF and one miRNA jointly regulate the concentration of one target mRNA. Feed forward loops are connectivity patterns that are found in many diﬀerent areas of a network and that can make up key functional modules. It has been shown that FFLs are important motif patterns in transcriptional regulatory networks (see Section 9.5). Diﬀerent scenarios are possible (Figure 10.8). In addition to the regulation of the common target gene, TF and

Regulatory Noncoding RNA

TF-FFL TF

MicroRNA-FFL miRNA

TF

gene

(a)

miRNA

Composite FFL TF

gene

miRNA

gene

Coherent FFLs TF

miRNA

TF

TF

gene

gene

(b)

miRNA

miRNA

gene

Incoherent FFLs TF

(c)

miRNA

gene

TF

miRNA

gene

TF

miRNA

gene

Figure 10.8 FFL and FBL types. (a) Three types of FFLs classiﬁed by the master regulator. Blunt arrows with dot end represent transcriptional activation or repression. (b) Coherent FFLs. In this kind of FFLs, two paths that regulate the target gene have the same eﬀects (either activation or repression). (c) Incoherent FFLs. The target gene is regulated by two opposite paths. Source: Zhang et al. (2015). Drawn with permission of Oxford University Press.

miRNA also aﬀect each other in some of these motifs. Figure 10.9 shows an experimentally validated case. 10.6.1

TFmiR Web Service

The web service TFmiR2 integrates genome-wide transcriptional and posttranscriptional regulatory interactions to elucidate human diseases (Hamed et al. 2015). Based on user-supplied lists of deregulated genes/TFs and miRNAs that were, e.g. deregulated in a disease, TFmiR investigates four diﬀerent types of interactions, TF → gene, TF → miRNA, miRNA → miRNA, and miRNA → gene. It then also unravels the circuitry between miRNAs, TFs, and target genes with respect to speciﬁed diseases. For each interaction type, TFmiR utilizes information provided by curated regulatory databases of both predicted and experimentally validated interactions. TFmiR identiﬁes three-node motifs as those shown in Figure 10.8 consisting of a TF, an miRNA, and their cotargeted gene. 2 http://service.bioinformatik.uni-saarland.de/tfmir

267

268

Principles of Computational Cell Biology

MYC (c-myc)

E2Fs

miR-106b-25 miR-17-92 miR-106a-363

Proliferation

CDK inhibitors (p21, p27)

CDKs (cyclin D1/CDK4)

TP53 (p53)

miR-34

Pocket proteins (pRb, p130)

Figure 10.9 Schematic model for TF-miRNA coregulatory network in cell proliferation. The E2F family and three miRNA clusters form several composite FFLs with CDK inhibitors and pocket proteins. They corporately control the progression of the cell cycle. The oncogene c-Myc can promote cell cycle progress through directly activating the E2F family and miRNA clusters, whereas the tumor repressor p53 represses E2Fs activity in an indirect way. The T-shaped dotted arrow indicates the indirect repression of E2Fs by p53. Source: Zhang et al. (2015). Drawn with permission of Oxford University Press.

Statistically signiﬁcant TF and miRNA pairs that cooperatively regulate the same target gene are identiﬁed using the hypergeometric distribution (see Section 8.6.1): ( )( ) k M−k x ∑ i N −i P-value = 1 − ( ) M i=0 N where k is the number of target genes of a certain miRNA, N is the number of genes regulated by a certain TF, x is the number of common target genes between this pair of TF and miRNA, and M is the number of genes in the union of all human genes targeted by human miRNAs and of all human genes regulated by human TFs in the used databases. After applying a multiple test correction with the false discovery rate according to the Benjamini–Hochberg method, only those pairs with an adjusted P-value 0 and Ai r′ < 0 whether they are “adjacent” in P(AK ), which means that the minimal face of P containing both rays contains no other extreme rays. We will omit the further mathematical and practical details of the DD method and refer the interested reader to the according specialized literature.

12.7 Extreme Pathways and Elementary Modes The metabolic pathway analysis searches for meaningful structural and functional units in metabolic networks. Today’s most powerful methods are based on the concepts from convex analysis that were introduced in Section 12.6. Two such approaches are the elementary ﬂux modes (EFMs) (Schuster and Hilgetag 1994) and extreme pathways (Schilling et al. 2000). Both methods span the space of feasible steady-state ﬂux distributions by nondecomposable routes, i.e. no subset of reactions involved in an EFM or extreme pathway can hold the network balanced using nontrivial ﬂuxes. Both algorithms follow the principles of the DD algorithm introduced in the previous section to generate from the set of observed ﬂuxes (S⋅v ≥ 0) a set of extremal rays/generating vectors of a convex polyhedral cone. As both methods employ quite similar algorithms, we will introduce here only the extreme pathway method in order not to confuse the reader. Metabolic pathway analysis has been used, for example, to study routing and ﬂexibility/redundancy of networks, functionality of networks, and the identiﬁcation of futile cycles. It gives all (sub)optimal pathways with respect to product/biomass yield, and it can be useful for calculability studies in metabolic ﬂux analysis. In a practical point of view, we may say that the algorithms construct balanced combinations of multiple reaction ﬂuxes that do not change the concentrations of metabolites when ﬂux ﬂows through them (input ﬂuxes are channeled to products, not to accumulation of intermediates). As the stoichiometric matrix describes the coupling of each reaction i to the concentration of metabolite j, we need to construct combinations of reactions that leave the metabolite concentrations unchanged. When such combined pathways applied to metabolites do not change their concentrations, we will call the combined pathways “balanced.” In algorithmic terms, we need to transform the stoichiometric matrix so that the matrix entries are brought to zero. Looking back at the previous section, we will start with the system ( ) S =𝟎 ⋅v , external ﬂuxes ≥𝟎 as in A⋅x ≥ 0, and derive the matrix R containing the basis vectors, the extreme rays. 12.7.1

Steps of the Extreme Pathway Algorithm

Let us start with the very simple system shown in Figure 12.9.

Metabolic Networks

b2

b1

A

ν2

ν1

B

C

ν4

b3

D

ν3

ν1 –1 1

ν2

ν3

–1 1

1 –1

ν4

b1 1

b2

b3

1 –1 1

1

A B C D

Figure 12.9 Simple metabolic network with four metabolites A–D connected by four internal ﬂux reactions v 1 –v 4 and three unconstrained external ﬂuxes b1 –b3 that connect the system with the environment. The lower part of the ﬁgure shows the stoichiometric matrix for this system.

Step 0. Construct the stoichiometric matrix of the metabolic system and take its transpose ST . Then, attach an n × n identity matrix from the left side that will serve for bookkeeping purposes (Figure 12.10). Reconﬁgure the network if needed, which means splitting up all reversible internal reactions into two separate, irreversible reactions (forward and backward reactions). Step 1. Identify all metabolites that do not have an unconstrained external ﬂux associated with them. The total number of such metabolites is denoted by 𝜇. The example network contains only one such metabolite C (𝜇 = 1). (The concentrations of these internal metabolites need to be balanced ﬁrst. For the balancing of other metabolites, we can later use the external ﬂuxes as well.) Examine the constraints on each of the exchange ﬂuxes as given by 𝛼j ≤ bj ≤ 𝛽j There may be three cases: (a) If the exchange ﬂux is constrained to be positive do nothing. (b) If the exchange ﬂux is constrained to be negative, multiply the corresponding row of the initial matrix by −1 (i.e. change the direction of ﬂow). (c) If the exchange ﬂux is unconstrained, move the entire row to a temporary matrix T(E) . Remove for the moment T(E) containing all rows representing the unconstrained external ﬂuxes (they will be added back later). In this example, we are talking about the rows b1 to b3 . This completes the ﬁrst tableau T(0) . Each element of this matrix will be designated T ij . T(0) for the example reaction system is shown in Figure 12.10. Starting with x = 1 and T(0) = T(x−1) , the next tableau is generated in the following way. Step 2. The new matrix T(x) is formed by copying all rows from T(x−1) , which already contain a zero in that column of ST that corresponds to the ﬁrst

325

326

Principles of Computational Cell Biology

ν1 ν2 ν3 ν4 b1 b2 b3 A –1

0

0

0

1

0

0

B

1

–1

1

0

0

1

0

C

0

1

–1 –1

0

0

0

D

0

0

0

0

0

1

1

(1) Take transpose

1 0 0 0 0 0 0

0 1 0 0 0 0 0

0 0 1 0 0 0 0

0 0 0 1 0 0 0

0 0 0 0 1 0 0

0 0 0 0 0 1 0

0 0 0 0 0 0 1

A

B

C

D

–1 0 0 0 1 0 0

1 –1 1 0 0 1 0

0 1 –1 –1 0 0 0

0 0 0 1 0 0 1

(2) Combine with identity matrix

1 0 0 0 0 0 0

0 1 0 0 0 0 0

0 0 1 0 0 0 0

0 0 0 1 0 0 0

0 0 0 0 1 0 0

0 0 0 0 0 1 0

0 0 0 0 0 0 1

A B C –1 1 0 0 –1 1 0 1 –1 0 0 –1 1 0 0 0 1 0 0 0 0

D 0 0 0 1 0 0 1

(3) Split into T(0) and T(E)

1 0 0 0

0 1 0 0

0 0 1 0

0 0 0 1

A –1 0 0 0

B 1 –1 1 0

C 0 1 –1 –1

D 0 0 0 1

0 0 0

0 0 0

0 0 0

0 0 0

1 0 0

0 1 0

0 0 1

1 0 0

0 1 0

0 0 0

0 0 1

Figure 12.10 Construction of the ﬁrst tableau in step 1.

metabolite identiﬁed in step 1, denoted by index C. (Here the third column of ST .) Step 3. From the remaining rows in T(x−1) , one forms all possible combinations of rows that contain values of the opposite sign in column C, such that adding the rows gives a zero value in this column (Figure 12.11). Step 4. For all rows added to T(x) in steps 2 and 3, make sure that no row exists that is a nonnegative combination of any other rows in T(x) . This step is equivalent to removing superﬂuous rays during the DD method. One method that may be used for this works as follows. Let A(i) be the set of column indices j for which the elements of row i equal zero. Here, A(1) = {2, 3, 4, 7, 8}, A(2) = {1, 4, 5, 6, 7, 8}, and A(3) = {1, 3, 5, 7}. Then, check to determine if there exists another row (h) for which A(i) is a subset of A(h). Here, A(2) does not include columns 2 and 3, and A(3) does not include columns 2 and 4.

Metabolic Networks

1 0 0 0

0 1 0 0

0 0 1 0

0 0 0 1

A –1 0 0 0

B 1 –1 1 0

C 0 1 –1 –1

D 0 0 0 1

1 0 0

0 1 1

0 1 0

0 0 1

A –1 0 0

B 1 0 –1

C 0 0 0

D 0 0 1

Figure 12.11 By going from the upper to the lower tableau, rows are being transferred if they already contain zero in the C column (dashed line – corresponding to step 2) or appropriate combinations are formed (step 3) to balance metabolite C. Here, row 2 can be added to row 3 (solid lines on the right) or to row 4 (solid lines on the left) to obtain zero in column C.

Therefore, A(1) is not a subset of neither A(2) nor A(3). Moreover, neither A(2) nor A(3) is a subset of either one of the other two sets. Therefore, no row must be eliminated in this example. Step 5. Steps 2–4 are completed for all substances that are not connected to an unconstrained exchange ﬂux transporting this substance. Each time x is incremented by one up to 𝜇. (In the example here, 𝜇 = 1.) The ﬁnal tableau will be T(𝜇) . The ﬁnal number of rows in T(𝜇) corresponds to k, which is the number of extreme pathways. Step 6. Then, T(E) is appended to the bottom of T(𝜇) . This results in a new tableau (Figure 12.12).

1 0 0

0 1 1

0 1 0

0 0 1

0 0 0

0 0 0

0 0 0

A –1 0 0

B 1 0 –1

C 0 0 0

D 0 0 1

0 0 0

0 0 0

0 0 0

0 0 0

1 0 0

0 1 0

0 0 1

1 0 0

0 1 0

0 0 0

0 0 1

1 0 0

0 1 1

0 1 0

0 0 1

0 0 0

0 0 0

0 0 0

A –1 0 0

B 1 0 –1

C 0 0 0

D 0 0 1

0 0 0

0 0 0

0 0 0

0 0 0

1 0 0

0 1 0

0 0 1

1 0 0

0 1 0

0 0 0

0 0 1

1 0 0

0 1 1

0 1 0

0 0 1

1 0 0

–1 0 1

0 0 –1

A 0 0 0

B 0 0 0

C 0 0 0

D 0 0 0

b1 b2 b3

b1 b2 b3

Figure 12.12 In the upper part of the picture, the tableau T(E) with the exchange ﬂuxes is combined with the previously formed tableau T(μ) . By going from the combined to the bottom tableau, all rows are being transferred that already contain zero in all metabolite columns (dashed line) or appropriate combinations are formed (continuous lines left and right, see Step 3). The bottom tableau is the ﬁnal tableau after step 8.

327

328

Principles of Computational Cell Biology

Step 7. Starting in the n + 1 column (or the ﬁrst nonzero column on the right side), if Ti,(n+1) ≠ 0, then add the corresponding nonzero row from T(E) to row i so as to produce 0 in the n + 1-th column. In order to use the exchange ﬂux in the appropriate direction, add the appropriate multiples of this nonzero row to row i if Ti,(n+1) < 0 or subtract them if Ti,(n+1) > 0. This procedure is iteratively repeated for all rows in the upper portion of the tableau so that the entire upper portion of the (n + 1) column contains zero values. Once this is done, one deletes from T(E) the row that corresponds to the exchange ﬂux for the substance that was just balanced. Step 8. The same procedure as in (step7) is applied to each of the columns on the right side of the tableau containing nonzero entries. (In this example, step 7 needs to be applied to every column except the third column of the right side, which corresponds to metabolite C.) The ﬁnal tableau T(ﬁnal) will contain on the left side the transpose of the matrix P containing the extreme pathways in place of the original identity matrix. Note that this algorithm perfectly agrees with the principles of the DD method in that we generated from the set of observed ﬂuxes (stoichiometric matrix S ≡ A) a set of generating vectors (P ≡ R). Figure 12.13 shows the pathways constructed for this simple network. 12.7.2

Analysis of Extreme Pathways

How do reactions appear in the pathway matrix? In the matrix P of extreme pathways, each column is an extreme pathway, and each row corresponds to a reaction in the network. The numerical value of the i,jth element corresponds to the relative ﬂux level through the ith reaction in the jth extreme pathway. Let us consider again the example network introduced at the beginning of this chapter (Figure 12.3). Figure 12.14 displays the stoichiometric matrix for this system and its three extreme pathways. The lengths of P1 , P2 , and P3 are 6, 6, and 7, respectively, as they are composed from six, six, or seven reactions. These values are also contained as diagonal elements of the symmetric pathway length matrix PLM that is computed from the ̃ where all nonzero entries are replaced by 1: normalized pathway matrix P P1 P2 P3 P1 P2 P3 ⎧ 2 2 2⎫ 𝜈 ⎧ 1 1 1⎫ 𝜈 ⎪ ⎪ ⎪ 1 ⎪ 1 ⎪1 0 1⎪ 𝜈2 ⎪1 0 1⎪ 𝜈2 P1 P2 P3 ⎪0 1 0⎪ 𝜈3 ⎪0 1 0⎪ 𝜈3 ⎧ 6 4 5⎫ P ⎪ 0 1 1⎪ 𝜈 ⎪ 0 1 1⎪ 𝜈 ̃=⎪ ̃T ⋅ P ̃=⎪ ⎪ 1 ⎪ 4 P ⎪ 4 PLM = P P=⎪ ⎨ 6 5⎬ P2 ⎨0 0 1⎬ 𝜈5 ⎨0 0 1⎬ 𝜈5 ⎪ ⎪1 0 0⎪ 𝜈6 ⎪1 0 0⎪ 𝜈6 7⎪ P3 ⎩ ⎭ ⎪ 2 2 2⎪ b ⎪ 1 1 1⎪ b ⎪ ⎪ ⎪ 1 ⎪ 1 ⎪ 1 1 1⎪ b 2 ⎪ 1 1 1⎪ b 2 ⎪ 1 1 1⎪ b 3 ⎪ 1 1 1⎪ b 3 ⎩ ⎩ ⎭ ⎭ The oﬀ-diagonal terms of PLM are the number of reactions that a pair of extreme pathways have in common. For example, the extreme pathways P2 and P3 have ﬁve reactions in common: v1 , v4 , and b1 , b2 , and b3 .

Metabolic Networks

b2

b1

A

ν1

ν2 B

C

ν4

D

b3

ν3 ν1

ν2

ν3

ν4

b1

1 (a)

0

0

0

1

b3 0

b2 –1

A 0

B 0

C 0

D 0

b2

b1

A

ν1

ν2 B

C

ν4

D

b3

ν3 ν1

ν2

ν3

ν4

0

1

1

0

(b)

b1 0

b3 0

b2 0

A 0

B 0

C 0

D 0

b2

b1

A

ν1

ν2 C

B

ν4

D

b3

ν3 ν1

ν2

ν3

ν4

0 (c)

1

0

1

b1 0

b2 1

b3 –1

A 0

B 0

C 0

D 0

Figure 12.13 Three extreme pathways are found that span the solution space of this network. Note that pathways P1 (panel a) and P3 (panel c) use the external ﬂuxes b2 and b3 in the opposite direction. The cyclic pathway P2 (panel b) formed by the opposing ﬂuxes v 2 and v 3 has no net overall eﬀect on the functional capabilities of the network. Such pathways are usually deleted at this point. You may wonder why the straight path b1 → v 1 → v 2 → v 4 → b3 is not among the solutions. It is indeed not an independent solution because it can be generated as a linear combination from the other three by adding P1 and P3.

One can also compute a reaction participation matrix PPM from P: ̃⋅P ̃T PPM = P where the diagonal elements correspond to the number of pathways in which the given reactions participate (Figure 12.14). 12.7.3

Elementary Flux Modes

The EFM technique was developed before that of the extreme pathways technique by Stephan Schuster, Reinhart Heinrich, and coworkers. Extreme pathways are a subset of elementary modes and for many systems both methods coincide.

329

330

Principles of Computational Cell Biology ν1 ν2

ν3

ν4

ν5 ν6 b1 b2 b3

–1

0

0

0

0

1

0

0

A

–2 –2

0

0

0

0

0

0

B

0 –1 –1

0

0

0

C D E

1

S=

0

byp

0

1

0

0

0

1 –1

1

0

0

0

0

0

0

0

1

0

1

0

–1

0

0

1

1

0

0

0

0

0 –1 byp

0

0 –1

1 –1

0

0

0

2A

cof

E cof

byp

D

byp

0 cof

(a)

(b)

byp

byp

2A

C

2B cof

2A

E

C

2B cof

cof

E cof

byp

byp

D

(c)

C

2B

D

byp

byp

(d)

Figure 12.14 The ﬁgure shows the same network as in Figure 12.3. (a) This is the stoichiometric matrix for the system shown in Figure 12.3. (b–d) Carrying out the same algorithm as before gives these three extreme pathways.

A pathway P(v) is an EFM if it fulﬁlls conditions (C1)–(C3). (C1) Pseudo steady state. S⋅v = 0. This ensures that none of the metabolites is consumed or produced in the overall stoichiometry. (C2) Feasibility. Rate vi ≥ 0 if reaction i is irreversible. This demands that only thermodynamically realizable ﬂuxes are contained in v. (C3) Nondecomposability. There is no vector v′ (unequal to the zero vector and to v) fulﬁlling (C1) and (C2) and that P(v′ ) is a proper subset of P(v). This is the core characteristics for EFMs and extreme pathways and ensures the decomposition of the network into smallest units (that are still able to hold the network in a steady state). The pathway P(v) is furthermore an extreme pathway if it fulﬁlls conditions (C1)–(C3) and conditions (C4) and (C5). (C4) Network reconﬁguration. Each reaction must be classiﬁed either as an exchange ﬂux or as an internal reaction. All reversible internal reactions must be split up into two separate, irreversible reactions (forward and backward reaction). (C5) Systemic independence. The set of extreme pathways in a network is the minimal set of ﬂux modes that can describe all feasible steady-state ﬂux distributions. EFMs and extreme pathways are robust methods for studying functional and structural properties of complex metabolic networks such as identifying enzyme subsets and essential reactions or searching for optimal as well as redundant realizations of stoichiometric transformations. Both methods yield no explicit

Metabolic Networks

predictions of metabolic behavior. They seek primarily to allow understanding of an organism’s metabolic capabilities. One practical problem is that the number of extreme pathways or EFMs may easily reach into the thousands or even millions for medium-sized metabolic networks containing a few hundred reactions and metabolites. For example, more than two million modes exist already for a simple model of the E. coli central metabolism consisting of 112 reactions, whereas recent genome-scale metabolic networks of E. coli consist of even more than 2000 reactions. A powerful application of EFM analysis involves decomposition of a given ﬂux distribution – which may be computed by FBA or determined by experiment – into EFMs. Although such decompositions are not unique, they may be helpful for the biological interpretation. Ip presented an algorithm where they ﬁrst select the reaction with nonzero ﬂux of maximum magnitude from the given ﬂux distribution (Ip et al. 2011). Then, an EFM is determined that both contains the selected reaction and conforms to the ﬂux distribution. The contribution of this EFM to the given distribution is determined before it is removed to give an updated ﬂux distribution. This new distribution is then used as an input for the next iteration of the algorithm. If the original input ﬂux distribution was properly balanced (S⋅v = 0), the remaining ﬂux will be zero after a ﬁnite number of iterations and the algorithm terminates. 12.7.4

Pruning Metabolic Networks: NetworkReducer

As mentioned, the advent of more and more detailed genome-scale metabolic models with thousands of metabolites and reactions complicates their interpretation and application of extreme pathways and EFM methods. Thus, computational scientists tried to ﬁnd ways enabling them to reduce genome-scale models to “core” models of lower complexity but having the same key elements and/or key functional features. One such method is the network reduction algorithm NetworkReducer that is able to simplify an input large-scale metabolic network to a smaller subnetwork, whereby desired properties of the larger network are kept (Erdrich et al. 2015). As in FBA, we consider vectors v of net reaction rates that fulﬁll S ⋅ v = 0. The ﬂuxes v satisfying this equation form the null space of S. Its dimensionality may also be termed the number of degrees of freedom (dof) and is given by dof = n − rank(S) where n is the number of reactions in the system. NetworkReducer uses the following speciﬁcations: (a) PM is the set of “protected metabolites” that must be kept in the reduced network. (b) PR is the set of “protected reactions” that must be kept in the reduced network. Optionally, one may also require that these PRs must be feasible as well (this means that for each PR i ∈ PR, there is at least one ﬂux vector v with vi ≠ 0). (c) Protected functions and phenotypes are characterized by appropriate inequalities.

331

332

Principles of Computational Cell Biology

(d) The reduced network may not have fewer dof than a minimum number: dof ≥ dofmin . (e) A speciﬁed minimal number of reactions must be kept (n ≥ nmin ). A key property of the algorithm is how it treats desired (protected) functions and phenotypes. Each protected function (there are s of them in total) is formulated by a respective set of linear equalities/inequalities Dk v ≤ d k ,

k = 1, … , s.

The network reduction algorithm ﬁrst checks the feasibility of the protected functions in the input network. Then, a loop tries to iteratively discard non-PRs unless this violates any of the desired conditions (a)–(e). To decide on the order of this process, the algorithm computes for each removable (nonprotected) reaction i the feasible ﬂux ranges separately for each protected function k deﬁned by Dk /dk . Let F i k denote the ﬂux range of reaction i under the protected function k, k = 1, …, s. From this, the union F i of all ﬂux ranges is formed: Fi =

s ⋃

Fik

k=1

Essential reactions possess an entirely positive or entirely negative ﬂux range F i k for any of the desired functionalities k. Such essential reactions are deleted from the list of removable reactions. From the current set of removable reactions, the next candidate reaction to be discarded is the reaction with overall smallest ﬂux range F i . It can be safely assumed that a considerable amount of ﬂux variability remains in the network after deleting this reaction. After discarding a reaction, one needs to test the feasibility of the protected functions (condition [c]), PRs and PMs. If any of these conditions is not fulﬁlled, then the reaction that was just deleted is reinserted and labeled as nonremovable. Then, one continues with the reaction having the second smallest overall range of ﬂuxes F i . After deleting a reaction, the ﬂux ranges are recomputed in the next iteration. The main loop of network pruning terminates when no additional reaction can be removed without violating any of conditions (a)–(e). Finally, unconnected metabolites in the reduced network that do not participate in any of the remaining reactions are deleted from the network. In a postprocessing step, the network can be (optionally) compressed further without losing dof. For example, reaction sets or enzyme sets belonging to a linear chain of reactions can be combined into a single reaction with collapsed stoichiometries. Compression does not aﬀect PRs and PMs. Klamt and coworkers applied the NetworkReducer algorithm to a genome-scale metabolic model of E. coli with 2384 reactions (Table 12.3). The algorithm pruned this model to a reduced model with 105 reactions. This is close to a manually constructed core model of E. coli that contained 88 reactions.

12.8 Minimal Cut Sets Extreme pathways and elementary ﬂux analysis characterize particular functional states in a metabolic network. The concept of minimal cut sets enables identifying structural failure modes (Klamt and Gilles 2004). A minimal cut set is deﬁned

Metabolic Networks

Table 12.3 Properties of the E. coli network models ColiGS and ColiCore.

E. coli genome-scale model (ColiGS)

E. coli pruned model (ColiPruned)

E. coli pruned and compressed model (ColiPrunedComp)

E. coli core model of Orth et al. (2010) (ColiCore)

Number of reactions

2384

455

105

88

Number of internal metabolites

1669

438

85

69

Number of external metabolites

305

33

33

17

Degrees of freedom

753

26

26

24

𝜇max (h−1 ; aerobic)

0.9290

0.9288

0.9288

0.8739

𝜇max (h−1 ; anaerobic)

0.2309

0.2309

0.2309

0.2117

Source: Erdrich et al. (2015). https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-0150191-x. Licensed under CC-BY 4.0.

Figure 12.15 Example network with ﬁve internal metabolites and eight reactions. Two of its reactions, v 4 and v 5 , are reversible. Reactions crossing the system boundary are connected to buﬀered (external) substances. Source: Klamt (2006). Drawn with permission of Elsevier.

ν1

A ν6

ν2

B

ν3

C

ν7

E

Psynthesis

ν4 D ν5

X

P

as the smallest “failure modes” in the network that prevent the correct functioning of a cellular reaction. The authors used the following example network for illustration (Figure 12.15). The analysis of this network will focus on the synthesis of product P. To understand how the topology of the network aﬀects the synthesis of P, we will use its computed EFMs (Figure 12.16). Similarly, one could use its extreme pathways. The production of P may be prevented by deleting or shutting down one or more reactions. In mathematical graph theory, a cut set is deﬁned as a set of vertices (or edges) whose removal increases the number of connected components of this graph. In the context of metabolic networks, a cut set (deﬁned with

333

Principles of Computational Cell Biology

ν1

ν1 A

A ν6 C

B

ν7

E

D ν5

(a)

X

P

E

ν5

(b)

X

P

ν1 A ν6 C

B

ν7

E

D ν5

(c)

B

X

C

E

D ν5

(d)

P

ν7

ν4

Psynthesis

ν4

ν2

ν6

ν3

Psynthesis

ν3

X

P

ν1

ν1

A

A

B

ν3

ν6 C

ν7

E

ν4 D ν5

(e)

ν7

D

A

ν2

C ν4

ν1

ν2

B

Psynthesis

ν4

ν2

X

ν2

P

B

ν3

ν6 C

ν7

E

ν4

Psynthesis

ν2

ν6

ν3

Psynthesis

ν3

Psynthesis

334

D ν5

(f)

X

P

Figure 12.16 For the example network of Fig. 12.15, one obtains six elementary ﬂux modes, see panels (a) to (f ). The latter ﬁve are coupled to product synthesis. Source: Klamt (2006). Drawn with permission of Elsevier.

Metabolic Networks

335

respect to a target reaction or product) will be understood as a set of reactions whose removal from the network prevents any feasible steady-state ﬂux distribution involving the target. Box 12.4 Deﬁnition We call a set of reactions a cut set (with respect to a deﬁned objective reaction) if, after the removal of these reactions from the network, no feasible balanced ﬂux distribution involves the objective reaction.

For example, C 1 = {v1 , v2 , v5 } is a cut set with respect to Psynthesis. However, inhibiting the three reactions of C 1 (by a gene knockout or by targeting the respective enzymes by small-molecule inhibitors) would not be an eﬃcient strategy because a subset of it, C 2 = {v2 , v5 }, is also a cut set that leads to the failure of the objective reaction Psynthesis. C 2 is optimal in the sense that no subset of C 2 is a cut set itself. Therefore, C 2 is called a minimal cut set. Box 12.5 Deﬁnition A minimal cut set must be a cut set itself, and it is minimal in the sense that none of its elements can be discarded without losing the cut set property.

Figure 12.17 displays all eight minimal cut sets preventing the synthesis of P in the model network. Removing a minimal cut set always guarantees the interruption of the objective function as long as the assumed network structure is correct. However, additional regulatory circuits or capacity restrictions may in practice allow that even a proper subset of a minimal cut set may function as a cut set. This means that in a real cell, fewer gene deletions or small-molecule inhibitors may be required than those found by the minimal cut set analysis. On the other hand, after removing a complete minimal cut set from the network, other pathways producing other metabolites may still be active. A systematic computation of minimal cut sets must ensure (i) that the calculated minimal cut sets are cut sets (interrupting all possible balanced ﬂux distributions involving the objective reaction), (ii) that the minimal cut sets are really minimal, and (iii) that all minimal cut sets are found. When designing an algorithm to compute minimal cut sets, one may exploit the fact that any feasible steady-state ﬂux distribution in a given network – expressed as vector r of the q net reaction rates – can be expressed by a nonnegative linear combination of the N elementary modes: r=

N ∑

𝛼i EMi ,

𝛼 ≥ 0.

i=1

If C is a proper cut set, each elementary mode involving the objective reaction (with a nonzero value) must contain at least one reaction from C. This guarantees that all elementary modes in which the objective reaction participates will

Principles of Computational Cell Biology

S1

A

S2

B

S1

A

S2

B

ν3

D

X

(a)

S1

P

E

D

X

(b)

A

C

Psynthesis

Psynthesis

E

S1

A

S2

B

P

ν6

ν1

ν7

E

D

(c)

X

S1

A

S2

B

C

E

ν2 ν4 D

(e)

X

P

E

ν7

D

(d)

P

C

Psynthesis

C

B

Psynthesis

S2

S1

A

S2

B

X

P

C

E

ν2

Psynthesis

C

Psynthesis

336

D ν5

(f)

X

P

Figure 12.17 Minimal cut sets for repressing synthesis of P in the example network of Figure 12.15. (a) MCS1, (b) MCS2, (c) MCS3, (d) MCS4, (e) MCS5, (f ) MCS6, (g) MCS7, and (h) MCS8. Source: Klamt (2006). Drawn with permission of Elsevier.

Metabolic Networks

A

S1

A

S2

B

ν6

ν1 C

B

ν4

E Psynthesis

S2

D ν5

(g)

X

P

C

ν4

E Psynthesis

S1

D ν5

(h)

X

P

Figure 12.17 (Continued)

be eliminated when the reactions in the cut set are removed from the network. To ensure that the rate rk of the objective reaction is 0 in all r, each remaining elementary mode must contain 0 at the kth place. For a detailed discussion of algorithms to compute MCS, we refer the reader to the original papers by Klamt and Gilles (Klamt and Gilles 2004; Klamt 2006). 12.8.1

Applications of Minimal Cut Sets

(1) Target identiﬁcation and repression of cellular functions. Looking at all minimal cut sets facilitates identifying the best-suited manipulation of a cellular system for a desired target operation. For practical reasons, a small number of interventions is desirable (small size of minimal cut set). Furthermore, if the cells are to be kept viable, it is preferable to choose minimal cut sets that aﬀect other pathways in the network only weakly. Certainly, there may also be practical considerations for selecting a particular minimal cut set as some of the cellular functions might be diﬃcult to shut down genetically or by inhibition. Also, it is impractical to select a minimal cut set containing an enzyme for which several other isozymes exist that catalyze the same reaction. (2) Network veriﬁcation and mutant phenotype predictions. When targeting cellular reactions/processes that are essential for cell survival, shutting down a corresponding minimal cut set should deﬁnitely lead to cell death. Such predictions, derived purely from network structure, may be used to verify hypothetical or reconstructed networks. If the cell can survive the deletion of a set of genes in an experiment that were predicted to be fatal, the underlying network structure must be incomplete (a false-negative prediction). One may, however, as often in biology also face the opposite where a deletion was predicted to be still viable and the cells die in the experiment. This may be a hint that, for example, additional gene regulatory eﬀects need to be taken into account. (3) Structural fragility and robustness. The concept of minimal cut sets is also known in completely diﬀerent research ﬁelds such as risk analysis of

337

Principles of Computational Cell Biology

industrial plants. There, a minimal cut set is also called the failure mode and characterizes the minimal set of events that will together cause a breakdown of the system. In this regard, one desires to characterize the importance of individual reactions to lead to failure of the system. Intuitively, the most fragile reactions will be those whose removal directly leads to system failure (or shut down of the respective target). The next dangerous ones are those that, together with only one other reaction, lead to system failure. In this analysis, we will, for the moment, assume that each reaction in a metabolic network has the same probability to fail. Therefore, small minimal cut sets are most probable to be responsible for a failing objective function. We will deﬁne a fragility coeﬃcient F i as the reciprocal of the average size of all minimal cut sets in which reaction i participates: Fi =

1 . avg({|MCSj | ∶ i ∈ MCSj })

Figure 12.18 shows the fragility coeﬃcients of the reactions in the example network of Figure 12.15 with respect to the production of P. Apparently, reaction v3 and Psynthesis itself are the most fragile reactions of the network with respect to the synthesis of product P. In summary, minimal cut sets are an irreducible combination of network elements whose simultaneous inactivation leads to a guaranteed dysfunction of certain cellular reactions or processes. Minimal cut sets are inherent and uniquely determined structural features of metabolic networks similar to elementary modes (or extreme pathways). Unfortunately, the computation of minimal cut sets and elementary modes becomes challenging in large networks. Analyzing the minimal cut sets gives deeper insights into the structural fragility of a given metabolic network and is useful for identifying target sets for an intended repression of network functions. ν1 ν1

A ν6

ν3 ν2

B

C ν4

ν7

E Psynthesis

338

D ν5

(a)

Minimal cut sets MCS1 MCS2 MCS3 x MCS4 MCS5 MCS6 MCS7 x MCS8 Fragility Fi

X

P

0.4

ν2

ν3

ν4

ν5

ν6

ν7 Psynthesis x

x x x x

x x x

0.5

x x

1

x x x

0.375 0.375

x 0.4

0.5

1

(b)

Figure 12.18 (a) Example network of Figure 12.15. (b) Participation of individual reactions in the minimal cut sets shown in Figure 12.17 and computation of the fragility coeﬃcient F i . For example, reaction v 2 is a member of MCS5 and MCS6 that both contain a total of two reactions. Therefore, the average size of the MCSs that v 2 belongs to is 2, with the reciprocal 0.5.

Metabolic Networks

12.9 High-Flux Backbone As discussed in Section 12.1, the systems-level characterization of the metabolic networks of model organisms required revising the picture of separate biochemical pathways into a densely woven metabolic network. FBA as well as experimental data showed that the connectivity of substrates in this network follows a power law (see Section 6.11). Constraint-based modeling approaches (FBA) were successful in analyzing the capabilities of cellular metabolism including its capacity to predict deletion phenotypes, the ability to calculate the relative ﬂux values of metabolic reactions, and the capability to identify properties of alternate optimal growth states in a wide range of simulated environmental conditions. Now, we will address the question which parts of the metabolism are involved in adaptation to environmental conditions. As an example for many successful studies employing FBA, we again use a study involving the Barabási group (Almaas et al. 2004). This work utilized the stoichiometric matrix from the Palsson group for E. coli strain MG1655 containing 537 metabolites and 739 reactions. FBA was applied to characterize the solution space (all possible ﬂux states under a given condition) using linear programming and constraints for each reaction ﬂux vi of the form 𝛽 i,min ≤ vi ≤ 𝛽 i,max . The ﬂux states were calculated that optimize cell growth on various substrates. The mass carried by reaction j producing (consuming) metabolite i is denoted by ̂ vij = |Sij |vj . with the entries of the stoichiometric matrix Sij and the ﬂuxes vj . Interestingly, the magnitudes of the individual ﬂuxes found by FBA varied widely. For example, the dimensionless ﬂux of succinyl coenzyme A synthetase reaction was 0.185, whereas the ﬂux of the aspartate oxidase reaction was 10 000 times smaller, 2.2 × 10−5 . Figure 12.19 shows the calculated ﬂux distribution for 100

10–1 10–2 10–3 –4

10

Succ Glu

10–5 10–6 10–5

(a)

Relative frequency, P(v)

Relative frequency, P(v)

100

10–4 10–3 10–2 10–1 Optimized flux, v

10–1

10–2

10–3

100 (b)

100

101 102 Experimental flux, v (% of Glc uptake rate)

103

Figure 12.19 (a) Calculated ﬂux distribution for optimized biomass production on succinate (black) and glutamate (red) substrates. The solid line corresponds to the power law ﬁt of the likelihood that a reaction has ﬂux v with P(v) ∝ (v + v 0 )−𝛼 , with v 0 = 0.0003 and 𝛼 = 1.5. (b) The distribution of experimentally determined ﬂuxes from the central metabolism of E. coli shows a power law behavior as well, with a best ﬁt to P(v) ∝ v−𝛼 with 𝛼 = 1. Source: Almaas et al. (2004). Reprinted with permission of Springer Nature.

339

Principles of Computational Cell Biology

k.Y(k)~1

(a)

k.Y(k)~k

(b)

Figure 12.20 Schematic illustration of two hypothetical scenarios in which either (a) all ﬂuxes have comparable activity, in which case one expects kY(k) ∼ 1, or (b) the majority of the ﬂux is carried by a single incoming or outgoing reaction, for which kY(k) ∼ k.

active (nonzero ﬂux) reactions of E. coli grown on a glutamate- or succinate-rich substrate. Remarkably, both computed and experimental ﬂux distributions showed a wide spectrum of ﬂuxes. The authors considered two diﬀerent potential local ﬂux structures (Figure 12.20). (a) A homogenous local organization would imply that all reactions producing (consuming) a given metabolite have comparable ﬂuxes, whereas (b) a more delocalized “high-ﬂux backbone (HFB)” would be expected if the local ﬂux organization was heterogeneous such that each metabolite has a dominant source (consuming) reaction. To distinguish between these two schemes contrasted in Figure 12.20, they deﬁned [ ]2 k ∑ vij Y (k, i) = ∑k j=1 l=1 vil where vij is the mass carried by reaction j, which produces (consumes) metabolite i. If all reactions producing (consuming) metabolite i had comparable vij values, Y (k, i) would scale as 1/k. If, however, the activity of a single reaction dominates, one would expect Y (k, i) to scale as 1 (i.e. to be independent of k). The measured result for k⋅Y (k) is shown in Figure 12.21. 100

0.096 FAD 8.1e–8

2.2e–5 0.096

8.1e–8 Glu substrate

k Y(k)

340

10 Glu in Glu out Succ in Succ out γ = 0.73 1

1

10 Degree (K)

100

Figure 12.21 Measured kY(k) shown as a function of k for incoming and outgoing reactions, averaged over all metabolites, indicates that Y(k) ∝ k−0.27 . Inset shows nonzero mass ﬂows, v ij , producing (consuming) FAD on a glutamate-rich substrate. Source: Almaas et al. (2004). Reprinted with permission of Springer Nature.

Metabolic Networks

Interestingly, the measured ﬂuxes showed that an intermediate situation exists between the two extreme cases. This proves that the levels of individual metabolites show similar large inhomogeneities in the overall ﬂux distribution. The results for ﬂavine adenine dinucleotide (FAD) suggests that the more the reactions consume (produce) a given metabolite, the more likely it is that a single reaction carries most of the ﬂux. Encouraged by this ﬁnding, the authors went on to characterize the main ﬂux backbone of the metabolic network of E. coli. From the large complex ﬂux network, a simple algorithm removed for each metabolite systematically all reactions but the one providing the largest incoming (outgoing) ﬂux distribution. In this way, the algorithm uncovered the high-ﬂux backbone of the metabolism, a distinct structure of linked reactions that form a giant component with a starlike topology (Figure 12.22). In Figure 12.22, only a few pathways appear disconnected, indicating that although these pathways are part of the HFB, their end product is only the second-most important source for another high-ﬂux metabolite. The groups of individual HFB reactions largely overlap with traditional biochemical partitioning of cellular metabolism. Thus, also at the systems level, the traditional concepts remain of value.

12.10 Summary Compared to other cellular networks discussed in this textbook, our understanding of metabolic networks may be considered quite mature. This is due to the almost complete characterization of central metabolism in most organisms and by the ability to perform direct ﬂuxome measurement using, for example, 13 C-labeled substrate. The mathematical approaches of FBA, elementary modes, and extreme pathways provide a robust toolbox to characterize topological properties of the networks and even make quantitative predictions. Recent methodological improvements promise to facilitate the analysis of millions of elementary modes. Metabolic network use is highly uneven (power law distribution) both at the global level and at the level of the individual metabolites. Although most metabolic reactions have low ﬂuxes, the overall activity of the metabolism is dominated by several reactions with very high ﬂuxes. E. coli responds to changes in growth conditions by reorganizing the rates of selected ﬂuxes predominantly within this HFB. Apart from minor changes, the use of the other pathways remains unaltered. These reorganizations result in large, discrete changes in the ﬂuxes of the HFB reactions.

12.11 Problems 12.11.1

Static Network Properties: Pathways

The ﬁrst three assignments will introduce you to some aspects of networks under steady-state conditions. This includes taking apart a network into independent paths, identifying crucial metabolites and reactions, and simplifying networks.

341

5 FGAM

AIR

NCAIR

CAIR

SLF

AHHMD AHHMP

GMP

3PSER

APS

2

PRAM

GL

GLAL

SER

PAPS

ADCHOR

D6RP5P A6RP5P

SAICAR AICAR DHF

XMP

SLFxt

PHP

DHPT

PABA FGAR GAR

GOP

IMP

PRFICA

GTP

ASUC

FTHF

METHF METTHF

DHSK

SME

SME5P

3PSME

DB4P

A6RP5P2

D8RL

A6RP

THF

PL5P DHP

PAP

AHTD

DHPP

6

ASER

OTHIO

GLY DQT

RL5P

CHOR

CYS

3

X5P

NPRAN CPAD5P

IGP

TRP

FADH

T3P2

T3P1

25

13DPG

G6P

3PG

G1P

2PG

PEP

SUCC

24 AC

ACTP

AKG

ICIT

23

GLUxt

oALA

AKP

ATP

NAD

Q

O2

O2xt

NAAD

NAMN

QA

ISUCC

11

QH2

NADP CO2xt

CIT

PANT

HEXT

NADH

CC2

OA ACCOA

ARGSUCC ARG

CITR

COA

10

ADPGLC GLYCOEN

9

DPCOA

SUCCOA

MAL

ACxt NAGLU NAGLUYP NAGLUSAL NAARON ORN

4PP NCYS 4PPNTO PNTO 4PPNTE

SUCCxt

TYR FUM

FDP

7 PYR

PHE

E4P F6P

8 PDLA

HPHPYR

PHPYR

FAD

AN

PL

FMN

RTHIO

PHEN

3DDAH7P

1

H2SO3

H2S

PYRDX

RIBFLY

PRPP AMP R5P

P5P

PDLA5P H2O2

ALA

4

NH3xt

NH3

12

13

GLU

NADPH

PRO

GLUGSAL GLUP

GL3P

22

6181ACP

MALACP

C160ACP

PA

MALCOA

C161ACP ACACP ACP

21

CMP

CDPDG

C141ACP

CDP

CTP

PS

CAASP

DOROA OROA

OMP

UMP

UDP

PRLP

IMACP

HISOLP

HISOL

GLN

CAP

PRBATP

PRBAMP PRFP

DIMGP

PGP

HCYS

LLCT

OSLHSER

LYS

MDAP

D26PIM

NS26DP

UTP

THR

PHSER

HSER

HIS

ILE

OMVAL

DHMVA

LEU

OICAP

IPPMAL

CBHCAP OIVAL

DHVAL

ACLAC

5MTA

5MTR

5MTRP

DKMPP

KMB

PEPTIDO DCTP

ADPHEP UNPTDO UDPNAG ODPETN LPS UNAGDA GA1P KDOP

CMPKDO

KDO UDPG23A

UDPG2AA UDPG2A

DISAC1P

19 LIPIV

C120ACP LIPA

BMxt

DGDP

PIP26DA D23PIC

ASPSA

BASP

ASP

ABUT

OBUT

AA

UNAGD

GA6P

CL SPMD

DADP

DTMP

DGLU UDPNAMAG

DUMP

UDPNAMA

DUDP

15

PTRC

DGTP DTTP DATP COST DODP DTDP

NS2A6O

DSAM

UDPG

KDOLIPIV K2LIPIV DALA

C140ACP

14

MET

20

A5P

MTHF

Biomass

PG

PE

VAL

17

SAM

AD

ADN

R1P

5MTR1P

16

ASN

LIPX

18

UDPNAM

UDPNAGEP

Figure 12.22 HFB in the metabolic network of E. coli as optimized by ﬂux balance analysis for growth on glutamate-rich substrate. Two metabolites (e.g. A and B) are connected with a directed edge pointing from A to B only if the reaction with maximal ﬂux consuming A is the reaction with maximal ﬂux producing B. Shown are all the metabolites that have at least one neighbor after completing this procedure. The background colors indicate diﬀerent known biochemical pathways. Source: Almaas et al. (2004). Reprinted with permission of Springer Nature.

Metabolic Networks ATP

A

v1

2B

ADP

v2

C

D

byp

v3 E

NADH NAD

v4

NAD

ADP

NADH

Glucose

v5 2F

ATP

v6

2H NADH

ATP

v7

J

v8

K

C NAD

ADP

v9

L

byp

Figure 12.23 Example of a metabolic network leading to the production of biomass L (see Problem 1).

1. Identifying targets The hypothetical metabolic network shown in Figure 12.23 produces “biomass” L from the substrates A and F. In various intermediate steps, accessory substances are produced or consumed. The metabolites are labeled with the letters A to L. (a) Inspect the network visually and identify (without calculation) the important substrates that are essential for the functioning of the network, i.e. the production of L is stopped when these are missing. Explain your ﬁndings. Now assume that this network was the central part of the metabolism of a dangerous bacterium and you want to develop an eﬀective drug. On which enzymes (reactions) would you concentrate when searching for an inhibitor? Explain your answer. Would you change your strategy, if you knew that high concentrations of “byp” slow down or even reverse reaction v3 ? (Why?) What if somebody discovers that high concentrations of D are lethal for the host? (b) Simplify the network of Figure 12.23 without changing its topology (structure). Do so by writing an equation for each reaction and solve for J and L. Example: From v1 you get A ⇒ 2B + C, from v2 : 2B + ATP ⇒ D + ADP. This can be rewritten (for the moment) as 2B = D + ADP − ATP. Insert this into v1 to get v12 : A + ATP ⇒ D + ADP + C. Now, try to eliminate D from this equation, etc. Note: Some of the internal substrates cannot be eliminated. Once you are done with the elimination of the internal substrates (A, F, … are “not” internal), draw the simpliﬁed network in the same layout as the original network. 2. Extreme pathways Construct the stoichiometric matrix of the network shown in Figure 12.24. Do not forget to label rows and columns.

343

344

Principles of Computational Cell Biology

b1

b2

A

C v1

v2

Figure 12.24 Simple metabolic network as in Figure 12.9 (see Problem 2).

b3

E v5

v4

3B v3 D

Hint: Split up v4 into a forward and a backward reaction (reconﬁguration). Then, calculate the extreme pathways from the stoichiometric matrix as explained in Section 12.7. Determine both the “pathway length matrix” and the “reaction participation matrix.” What information do they provide? Which reaction(s) contribute(s) to the most pathways? Which are the shortest and the longest pathways? Now assume that reaction b2 is essential for the organism, i.e. it dies if there is no output via b2 . Determine from the extreme pathways which (combinations of ) internal reactions are essential, i.e. if they are blocked, then the output via b2 is blocked, too. 3. Extreme pathways Characterize the steady-state properties of the network given in Figure 12.25 via its extreme pathways. (a) Simplify the network

b1

b2

G v1

A

v2 D

2C 2B

v7

v6

K

J G b3

E

v3

v4

L v10 b4

v5 2F

v8

v9

byp

M

H

b5

Figure 12.25 Simple metabolic network with ﬁve exchange ﬂuxes (see Problem 3).

Metabolic Networks

As a ﬁrst step, simplify the given network by grouping reactions and metabolites, respectively. Sketch the simpliﬁed network. Hint: Write down the reactions as equations. Then, you can proceed analogously as with a set of linear equations and eliminate (i.e. group) substances. Example: v1 ∶A + 2B → 2C v2 ∶2C → D ⇒ v12 ∶A + 2B → D Comment: You can actually reduce the network to three internal substrates. Internal means that the metabolite is not connected to any of the exchange ﬂuxes b. (b) Stoichiometric matrix Construct the stoichiometric matrix for the simpliﬁed network and calculate from it the extreme pathways. Give the pathways as formulas and sketch them. Determine both the “pathway length matrix” and the “reaction participation matrix” and interpret them. Which reactions contribute to the most pathways; are there reactions that do not contribute at all? (c) Extreme pathways of the original network Now reconstruct the extreme pathways of the original network from the simpliﬁed network. Again give the extreme pathways both as formulas and as sketches. Why can one determine the extreme pathways of the complete network via this “detour” of the simpliﬁed network? From the pathway length matrices, determine the lengths of the pathways for both the original and the simpliﬁed networks. The output (“biomass production”) of the considered network corresponds to the ﬂux through reaction b4 . Consider a reaction as “essential” for the network if blocking this reaction shuts down the network. List all essential reactions of the simplifying network. Can you ﬁgure out how these essential reactions can be determined from the extreme pathways? (d) Steady-state properties In the following analysis, we will neglect the internal reactions of the pathways, i.e. only consider those reactions that are labeled with a letter b. Then, we see how the (black box) network transforms input through b1 , b2 , and b3 into output through b4 and b5 . Hint: Look at the pathways in their formula form … Complete Table 12.4 that relates the input through b1 , …, b3 to the output via b4 and b5 . For each conﬁguration, give the contributions of each of the extreme pathways as, e.g. n1 × e1 + n2 × e2 + n3 × e3 with ni as the contributions. Also determine the ﬂux through the reactions v2 , v5 , and v7 of the original network. Hint: Extend the table so that it can hold all the information asked for.

345

346

Principles of Computational Cell Biology

Table 12.4 Data for Problem 3(d). Inward and outward directed ﬂuxes b1 to b5 . Conﬁguration

I

II

III

IV

V

b1

0

1

1

b2

0

2

2

b3

1

2

VI

5

b4

2

8

4

b5

1

5

3

Bibliography Metabolic Networks in General Francke, C., Siezen, R.J., and Teusink, B. (2005). Reconstructing the metabolic of a bacterium from its genome. Trends Microbiology 13: 550–558. McCloskey, D., Palsson, B.Ø., and Feist, A.M. (2013). Basic and applied uses of genome-scale metabolic network reconstructions of Escherichia coli. Molecular Systems Biology 9: 661. Orth, J.D., Thiele, I., and Palsson, B.Ø. (2010). What is ﬂux balance analysis? Nature Biotechnology 28: 245–248. Palsson, B.Ø. (2006). Systems Biology: Properties of Reconstructed Networks. Cambridge University Press. Papin, J.A., Price, N.D., Wiback, S.J. et al. (2003). Metabolic pathways in the post-genome area. Trends in Biochemical Sciences 28: 250–258. Price, N.D., Reed, J.L., and Palsson, B.Ø. (2004). Genome-scale models of microbial cells: evaluating the consequences of constraints. Nature Reviews Microbiology 2: 886–897.

Metabolism of E. coli Ouzounis, C.A. and Karp, P.D. (2000). Global properties of the metabolic map of Escherichia coli. Genome Research 10: 568–576. Schuetz, S., Kuepfer, L., and Sauer, U. (2007). Systematic evaluation of objective functions for predicting intracellular ﬂuxes in Escherichia coli. Molecular Systems Biology 3: 119.

Metabolism of S. cerevisiae Heavner, B.D., Smallbone, K., Barker, B. et al. (2012). Yeast 5 – an expanded reconstruction of the Saccharomyces cerevisiae metabolic network. BMC Systems Biology 6: 55.

Metabolic Networks

Herrgard, M. et al. (2008). A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nature Biotechnology 26: 1155–1160.

Flux Balance Analysis Almaas, E., Kovacs, B., Vicsek, T. et al. (2004). Global organization of metabolic ﬂuxes in the bacterium Escherichia coli. Nature 427: 839–843.

Double Description Method Gagneur, J. and Klamt, S. (2004). Computation of elementary modes: a unifying framework and the new binary approach. BMC Bioinformatics 5: 175.

Extreme Pathway Method Schilling, C.H., Letscher, D., and Palsson, B.O. (2000). Theory for the systemic deﬁnition of metabolic pathways and their use in interpreting metabolic function from a pathway-oriented perspective. Journal of Theoretical Biology 203: 229–248.

Elementary Modes Erdrich, P., Steuer, R., and Klamt, S. (2015). An algorithm for the reduction of genome-scale metabolic network models to meaningful core models. BMC Systems Biology 9: 48. Ip, K., Colijn, C., and Lun, D.S. (2011). Analysis of complex metabolic behavior through pathway decomposition. BMC Systems Biology 5: 91. Schuster, S. and Hilgetag, C. (1994). On elementary ﬂux modes in biochemical reaction systems at steady state. Journal of Biological Systems 2: 165–182.

Minimal Cut Sets Klamt, S. (2006). Generalized concept of minimal cut sets in biochemical networks. BioSystems 83: 233–247. Klamt, S. and Gilles, E.D. (2004). Minimal cut sets in biochemical reaction networks. Bioinformatics 20: 226–234.

Strain Design: MOMA Method Alper, H., Jin, Y.S., Moxley, J.F., and Stephanopoulos, G. (2005). Identifying gene targets for the metabolic engineering of lycopene biosynthesis in Escherichia coli. Metabolic Engineering 7: 155–164.

347

348

Principles of Computational Cell Biology

Park, J.H., Lee, K.H., Kim, T.Y., and Lee, S.Y. (2007). Metabolic engineering of Escherichia coli for the production of l-valine based on transcriptome analysis and in silico gene knockout simulation. Proceedings of the National Academy of Sciences of the United States of America 104: 7797–7802. Segre, D., Vitkup, D., and Church, G.M. (2002). Analysis of optimality in natural and perturbed metabolic networks. Proceedings of the National Academy of Sciences of the United States of America 99: 15112–15117.

Optknock Algorithm Burgard, A.P., Pharkya, P., and Maranas, C.D. (2003). Optknock: a bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnology and Bioengineering 84: 647–657. Fong, S.S., Burgard, A.P., Herring, C.D. et al. (2005). In silico design and adaptive evolution of Escherichia coli for production of lactic acid. Biotechnology and Bioengineering 91: 643–648. Kim, J., Reed, J.L., and Maravelias, C.T. (2011). Large-scale Bi-level strain design approaches and mixed-integer programming solution techniques. PLoS ONE 6: e24162.

349

13 Kinetic Modeling of Cellular Processes So far, we have mostly used stationary, time-independent mathematical models to describe metabolic ﬂux distributions, gene regulatory networks, or protein–protein interactions. In cells, however, many processes undergo important temporal ﬂuctuations, most notably the cell cycle itself. Therefore, we now introduce a new class of models that enable us to model time-dependent cellular phenomena such as individual biochemical reactions, more complex processes such as signal transduction cascades, or even the entire cell cycle.

13.1 Biological Oscillators Biochemical oscillations occur, among others, in signaling, metabolism, and development, where they control important elements of cell physiology such as DNA synthesis, circadian rhythms, and mitosis (Table 13.1). In the 1950s and 1960s, the ﬁrst biochemical oscillations were identiﬁed in glycolysis, synthesis of cyclic AMP, and the reaction catalyzed by the enzyme horseradish peroxidase. During the molecular biology revolution of the 1980s, many further cases of oscillations were discovered, involving period (PER) genes that regulate circadian rhythms in animals and the cyclin proteins that govern the cell cycle in eukaryotes. Characterizing the molecular basis of cellular oscillations requires one to take a look at experimental data from a theoretical perspective and to describe the chemical oscillatory processes by quantitative mathematical modeling. Such models exhibit typical well-known properties of dynamical systems such as feedback, time delay, bistability, and hysteresis. An essential requirement for biochemical oscillators is the existence of negative feedback. Only this can take a reaction network back to the beginning of its oscillation. In fact, already very simple genetic circuits can give rise to oscillations. For example, a negative feedback loop X → R ⊣ X can produce oscillations where X ﬁrst activates R and R then inhibits X so that the level of R itself goes down, eventually followed by a rise in the level of X. However, the negative feedback signal needs to have a suﬃcient time delay so that the chemical reactions do not converge into a stable steady state. Such time delay can be due to a physical constraint (for example, the minimal time required for transcription and translation, or the time involved with transporting chemical substances to other cellular Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

350

Principles of Computational Cell Biology

Table 13.1 Survey of biochemical oscillators. Function

Components

Oscillation period

Metabolism

Glucose, ATP, phospho-fructokinase

2 min

Signaling

Cyclic AMP, receptor, adenylate kinase

5 min

Signaling

Calcium, Ins(1,4,5)P3

>1 s

Signaling

NF-κB, I-κB, IKK

c. 2 h

Signaling

p53, MDM2

5h

Signaling

Msn2, adenylate cyclase, cAMP, PKA

c. 10 min

Frog egg cycles

CycB, Wee1, Cdc25, Cdc20

30 min

Circadian rhythm

PER, TIM, CLOCK, CYC

24 h

Source: Novák and Tyson (2008). Taken with permission of Springer Nature.

compartments), by a long chain of reaction intermediates (as in a metabolic pathway) or by dynamical hysteresis (overshoot and undershoot, because of the positive feedback in the reaction mechanism). Furthermore, also the kinetic rate laws of the reaction mechanism must fulﬁll some conditions; they have to possess suﬃcient “nonlinearities” by which the steady state is destabilized. Also, the reactions that synthesize and metabolize the interacting chemical substances must occur on suitable time scales enabling oscillations in the network.

13.2 Circadian Clocks Most organisms (animals, plants, fungi, and cyanobacteria) enhance their ﬁtness by coordinating their development with daily environmental changes through molecular timekeepers termed circadian clocks. Mammals display circadian rhythms in behavioral and physiological processes, such as sleep, feeding, blood pressure, and metabolism. In plants, circadian rhythms control, e.g., the opening of ﬂowers in the morning and their closure at night. Circadian rhythms (Figure 13.1) are guided by external light–dark signals that are integrated through intrinsic central and peripheral molecular clocks. Circadian rhythms are a subset of biological rhythms with an oscillation period of 24 hours. The term circadian combines the Latin words “circa” (about) and “dies” (day). Circadian rhythms are endogenously generated and self-sustaining. They also persist under constant environmental conditions, such as constant light (or dark) and constant temperature. For all circadian rhythms, the period remains relatively constant over a range of ambient temperatures. This is quite remarkable when considering that according to the Arrhenius equation, the speed of “normal” chemical reactions increases rapidly with temperature. Our own biological clocks contain three essential elements: a central oscillator that keeps time, the ability to sense time cues in the environment, and to reset the clock as the seasons change or after a long-distance ﬂight to another time zone, and a series of outputs tied to distinct phases of the oscillator that regulate activity and physiology. In mammals, the central clock resides in

Kinetic Modeling of Cellular Processes

Clock output

Period

2 amplitudes

Phase

Time

Figure 13.1 Schematic illustration of an oscillating output of a biological clock. The period denotes the time to complete one cycle. The amplitude of the rhythm is deﬁned as one-half of the peak-to-trough distance. The phase is the time of day for a maximum (or minimum) relative, e.g. to noon. The phase is often deﬁned in zeitgeber time (ZT). Zeitgeber is the German expression for “time giver,” and any stimulus that imparts time information to the clock is a zeitgeber. The onset of light is a powerful zeitgeber. Dawn is typically deﬁned as ZT0.

CRY

PER

BMAL 1 CLOCK E box

CRY CRY

CRY

PER

BMAL 1 CLOCK E box

PER PER

BMAL 1 CLOCK E box Clock-controlled genes

Figure 13.2 Minimal components of the mammalian clock. First, the two transcription factors CLOCK and BMAL1 heterodimerize. Then, the protein complex BMAL1:CLOCK binds to the so-called E-boxes in the promoters of the PER and CRY genes, as well as in other clock-controlled genes (bottom), activating their transcription. Subsequently, after transcription and translation, the PER and CRY proteins dimerize, re-enter the nucleus, and eventually inhibit CLOCK-BMAL1-activated transcription. This resets the clock.

the suprachiasmatic nucleus (SCN), a small region of the brain that contains c. 20 000 neurons. Figure 13.2 shows the minimal scheme for the mammalian clock. It requires several interconnecting transcriptional, translational, and post-translational loops to achieve gene expression with circadian periodicity. The SCN produces a rhythmic output that consists of a multitude of neural and hormonal signals that inﬂuence sleep and activity. The SCN clock is reset by

351

352

Principles of Computational Cell Biology

external light, which is sensed by the ganglion cells of the retina. Remarkably, autonomous circadian oscillators are also present in all tissues of the body, where they are synchronized with the central SCN by yet unidentiﬁed signals and regulate, in a tissue-speciﬁc manner, transcriptional activity throughout the day. Plants were the ﬁrst organisms for which the observation of a circadian rhythm was discovered. The molecular study of plant clocks began in 1985 with the observation that the mRNA abundance of the light-harvesting chlorophyll a/b-binding (LHCB) protein genes of peas oscillated with a circadian rhythm. Oligonucleotide-based arrays representing about 8200 diﬀerent genes were used to determine steady-state mRNA levels in Arabidopsis thaliana at four-hour intervals during the subjective day and night (Harmer et al. 2000). In this way, temporal patterns of gene expression in Arabidopsis plants were identiﬁed. For each gene, a correlation score si was computed between its expression values and cosine test waves with a period between 20 and 28 hours: ( ) N 2πtj 2 ∑ si = xi,j cos − 𝜑i N j=1 N In this equation, N is the number of data points in the time series, xi,j denotes the expression value of gene i at time t j , and 𝜑i is the phase when xi,j reaches its maximal value during the day (note that cos(0) = 1). Then, the score for the original data was compared to randomly shuﬄed data (time points are shuﬄed). Those genes having a greater periodic expression score than 95% of the randomly shuﬄed cases (plus FDR correction) were considered as circadian. Of the tested genes, 453 were classiﬁed as cycling. Interestingly, some unexpected components of the cell also showed daily oscillations. For example, the rigid plant cell wall normally prevents cell expansion. However, a simultaneous loosening of cell wall components, uptake of water, and synthesis of cell wall components was found to take place during night. 13.2.1

Role of Post-transcriptional Modiﬁcations

Transcription–translation feedback cycles generally operate on a time scale of up to a few hours. If, following synthesis, the repressor proteins PER and CRY would directly translocate to the nucleus to repress the clock genes CLOCK and BMAL1, the whole cycle would take just a few hours rather than one full day. To maintain the daily oscillations of clock proteins, a signiﬁcant delay between the activation and repression of transcription is required. This is ensured through post-translational modiﬁcations such as phosphorylation catalyzed by casein kinase 2 and possibly also by epigenetic eﬀects. For example, CLOCK acetylates histones H3 and H4 in nucleosomes to confer “open” chromatin structure and enable CLOCK-BMAL1 to bind to the E-boxes in cognate promoters and turn on transcription. CLOCK also acetylates BMAL1, making it a target for the binding of the CRY repressor, concomitant with deacetylation of histones by histone deacetylases. These dual eﬀects of acetylation by CLOCK contribute to circadian periodicity of gene expression.

Kinetic Modeling of Cellular Processes

In 2017, the Nobel Prize in Physiology or Medicine was awarded to Jeﬀrey C. Hall, Michael Rosbash, and Michael W. Young “for their discoveries of molecular mechanisms controlling the circadian rhythm.”

13.3 Ordinary Diﬀerential Equation Models When considering the time evolution of a function f , a natural approach is considering the ﬁrst time derivative of the function f ′ = 𝜕f /𝜕t and higher derivatives. An ordinary diﬀerential equation (ODE) contains one or more functions of a single independent variable and one or more derivatives of the functions with respect to that variable. A simple example is Newton’s second law of motion describing the motion of a particle of mass m. It is an ODE containing the second derivative d2 x/dt 2 of its positional coordinate x with respect to time t and relates that to the force F acting on the particle at its current position x: d2 x = F(x(t)). (13.1) dt 2 The only independent variable here is the time t. Recall that the particle velocity v: m

dx , dt expresses how the particle’s position changes with time and the particle acceleration a: dv d2 x = 2, a= dt dt expresses how the particle’s velocity changes with time. Therefore, Newton’s second law of motion describes whether the particle is accelerating or slowing down being subjected to the force F(x). The order of a diﬀerential equation is the order n of the highest derivative that appears. A function y(x) is a solution of an ODE if the derivatives of the function satisfy the ODE. There is no guarantee that such a function exists. Even if it does exist, the solution is usually not unique. A general solution of an nth-order equation contains n arbitrary variables that correspond to the n integration constants. A particular solution may be deduced from the general solution by setting the constants to particular values. A diﬀerential equation of order n and having the form v=

f (x, y′ , y′′ , … , y(n) ) = 0, is called an implicit diﬀerential equation. If it has the form f (x, y′ , y′′ , … , y(n−1) ) = y(n) , it is called an explicit diﬀerential equation. In contrast to ODEs, the functions of partial diﬀerential equations (PDEs) depend on several independent variables, and the diﬀerential equation may involve partial derivatives with respect to each of these variables (Section 13.5).

353

354

Principles of Computational Cell Biology

13.3.1

Examples for ODEs

When solving a diﬀerential equation, the task is to identify a function y so that its derivatives satisfy the equation. For example, the diﬀerential equation y′′ + y = 0,

(13.2)

has the general solution y = A cos(x) + B sin(x) where the values of the two constants A, B are determined by the boundary conditions. (Recall that the ﬁrst derivative of the sin(x) function is cos(x), and the second derivative is −sin(x).) Another illustrative example is the mathematical pendulum (Figure 13.3). For the sake of simplicity, one commonly assumes that the bob swinging is mass less, the weight is a point mass m, there is no friction, and the motion is restricted to the two-dimensional plane, i.e. the pendulum does not start to move out of the plane to make an ellipsoidal movement. Assuming that the weight has been pushed out of its equilibrium position, the projection of the gravitational force mg onto the particle’s plane of movement, mg sin 𝜃, with the gravity constant g generates a restoring force on the weight that directs the weight back to the central position (and actually beyond that into an inﬁnite-length pendulum motion as we assume the absence of friction). Let us use the coordinate of the arc length s to describe the weight’s motion along its line of movement. Newton’s second law for the arc length coordinate s is F(s) = m

d2 s . dt 2 Figure 13.3 Force diagram for a mathematical pendulum consisting of a weight attached to a ﬁxed point via a stiﬀ connection of length l. The only acting force is gravitation.

θ l

sinθ

θ

mg

mg

cosθ

mg

Kinetic Modeling of Cellular Processes

As force F(s), we need to use the projection of the gravitational force mg sin 𝜃 along the arc length (Figure 13.3): d2 s dt 2 2 ds −g sin 𝜃 = 2 . (13.3) dt The minus sign reﬂects that the gravitation force works against the particle’s displacement from its equilibrium position. The coordinate s is related to the rod length l and the angle 𝜃: −mg sin 𝜃 = m

s = l𝜃 d𝜃 ds =l v= dt dt d2 𝜃 d2 s a= 2 =l 2. dt dt Combining Eqs. (13.3) and (13.4) gives −g sin 𝜃 = l

(13.4)

d2 𝜃 , dt 2

or d2 𝜃 g + sin 𝜃 = 0. (13.5) dt 2 l This is the diﬀerential equation which, when solved for 𝜃(t), will give the motion of the pendulum. Unfortunately, it cannot be integrated directly. If we restrict the motion of the pendulum to a relatively small amplitude, we can approximate sin 𝜃 ≈ 𝜃 This approximation is not too bad for small angles if you recall the shape of the sinus function that crosses the origin at a 45∘ angle. Then, Eq. (13.5) transforms into d2 𝜃 g + 𝜃 = 0. (13.6) dt 2 l which is almost exactly Eq. (13.2) above. The solution to this equation is readily obtained as ) (√ g |𝜃0 | ≪ 1, 𝜃(t) = 𝜃0 cos t + 𝜑0 l with the amplitude 𝜃 0 and the initial phase 𝜑0 . As expected, a pendulum undergoes an oscillatory motion. Its frequency decreases when the pendulum’s length l increases. Solutions to diﬀerential equations can be identiﬁed by many analytical and numerical approaches. For example, in cases of linear diﬀerential equations, the original equation can be broken down into smaller equations, which are then solved, and the results are combined. Unfortunately, many interesting diﬀerential equations are nonlinear functions and cannot be broken down in this manner. There also exist many approaches to solve diﬀerential equations with software

355

356

Principles of Computational Cell Biology

programs. In the next subchapter, we will concentrate on ﬁnding stationary and oscillatory solutions of several model cases observed in cellular processes.

13.4 Modeling Cellular Feedback Loops by ODEs As formulated by the late Reinhart Heinrich (Heinrich et al. 2002), a pioneer in the mathematical modeling of cellular processes, mathematical treatments of signaling pathways may help answering questions such as (1) How do the magnitudes of signal output and signal duration depend on the kinetic properties of pathway components? (2) Can high signal ampliﬁcation be coupled with fast signaling? (3) How are signaling pathways designed to ensure that they are safely oﬀ in the absence of stimulation, yet display high signal ampliﬁcation following receptor activation? (4) How can diﬀerent agonists stimulate the same pathway in distinct ways to elicit a sustained or a transient response, which can have dramatically diﬀerent consequences? We will see in the following section how very simple signaling pathways involving positive and negative feedback links can give rise to quite complex behaviors such as toggle switches and oscillators. The following discussion is largely based on the brilliant review paper by Tyson et al. (2003). 13.4.1

Protein Synthesis and Degradation: Linear Response

Let us start with the example of protein synthesis and degradation shown in Figure 13.4. Here, protein R is constantly produced at a base line rate k 0 . R is termed the response magnitude. The synthesis rate of R may be further stimulated by the presence of a signal S. Here, S measures the signal strength, e.g. the concentration of the corresponding protein. k 1 is the rate by which synthesis of R responds to the concentration of S. k 2 is the rate constant for the degradation of R. One may safely assume that degradation depends linearly on the concentration of R. This means that during each time interval, a constant fraction of R is degraded by digestion enzymes, etc. We can set up the rate equation as a simple balance equation: dR = k0 + k1 S − k2 R dt S

Figure 13.4 Simple model for the synthesis of protein R (“response”) under activation of a signal S and subsequent degradation. S could be a transcription factor that binds to the promoter region of the gene coding for R.

k1

k0

(13.7)

R

k2

Kinetic Modeling of Cellular Processes

Figure 13.5 For the example of linear response, the steady-state response Rss depends linearly on the signal strength S. At zero signal strength, there is a baseline response given by the ratio of the two rate constants k0 and k2 . The magnitude of the response beyond this baseline depends on the ratio of the two constants k1 and k2 .

Rss

Slope k1/k2

k0/k2

S

where R and S denote the concentrations of response and signal molecules, respectively. This equation describes that the concentration R increases over time with a constant rate k 0 plus the signal-modulated rate k 1 S and – at the same time – decreases with a constant rate −k 2 R proportional to R itself. Assuming a constant signal strength S, we expect that the response R will adjust to a steady state after a certain time interval. A steady-state solution of any differential equation, dR/dt = f (R), is a constant Rss fulﬁlling the algebraic equation f (Rss ) = 0. Therefore, we require 0 = k0 + k1 S − k2 Rss , or k0 + k1 S k2 k k = 0 + 1 S. k2 k2

Rss =

This means that the steady-state response Rss depends linearly on the signal strength S with a proportionality constant given as the ratio of the two rate constants for input (k 1 ) and for degradation (k 2 ) (Figure 13.5). 13.4.2

Phosphorylation/Dephosphorylation – Hyperbolic Response

The following example models the equilibrium between a protein R (“response”) and its phosphorylated form RP (Figure 13.6). This example is a variation of the ﬁrst example of protein synthesis because the degradation product of RP is fed back to the starting substance R. The rate equation for the temporal change of the concentration of RP is given by dRP = k1 SR − k2 RP . dt This example is almost identical to the previous case except for multiplication of S with the concentration of the nonphosphorylated form R. Here, we are not considering a basal activity k 0 as in the ﬁrst example because the system is a closed

357

358

Principles of Computational Cell Biology

S ATP

k1

ADP

Figure 13.6 Simple model for the equilibrium between the phosphorylated form of a protein RP and the dephosphorylated form R. In the back reaction, a water molecule is taken up from solution and a molecule of inorganic phosphate Pi is set free so that ATP can be regenerated from ADP.

RP

R

k2 Pi

H2O

cycle. As the total concentration of protein RT = R + RP is constant, we can substitute R = RT − RP , in the above equation and obtain dRP = k1 S(RT − RP ) − k2 RP . dt

(13.8)

The steady-state concentration of the phosphorylated form is again obtained by requiring 0 = k1 S(RT − RP,ss ) − k2 RP,ss , leading to the stationary concentration: k1 SRT k2 + k1 S SRT = . (k2 ∕k1 ) + S

RP,ss =

In the two limiting cases:

RP,ss

⎧ S small ⎪ −−−−−→ S = ⋅ RT ⎨ S large (k2 ∕k1 ) + S ⎪ −− ⎩ −−−→

k1 R ⋅S k2 T . RT

In this case, the steady-state response is not linear, but hyperbolic (Figure 13.7). This is easy to understand as the total concentration of protein on both sides is constant. Increasing the signal strength S can shovel a large fraction of the protein into its phosphorylated form, but the maximum response is limited to the total concentration of protein, RT , of course.

Kinetic Modeling of Cellular Processes

Figure 13.7 Steady-state concentration of phosphorylated protein RP as a function of signal strength S. In the regime of small S, the rise of RP,ss is almost linear with S. The slope is determined by the ratio of the rates for synthesis and degradation. For larger S, the response saturates at the total concentration RT of protein R.

RP,ss

RT

S

13.4.3

Phosphorylation/Dephosphorylation – Buzzer

As a variation of the previous section, we will now assume that the phosphorylation and dephosphorylation reactions can be described by Michaelis–Menten kinetics. Equation (13.8) then becomes dRP k2 RP k S(RT − RP ) − . = 1 dt Km1 + RT − RP Km2 + RP The steady-state solution is obtained as k2 RP (Km1 + RT − RP ) = k1 S(RT − RP )(Km2 + RP ). Here, the biophysically acceptable solutions must be in the range 0 < RP < RT . With the “Goldbeter–Koshland function” G (Goldbeter and Koshland 1981) that is deﬁned as 2uK G(u, v, J, K) = √ 2 𝛽 + 𝛽 − 4uK(v − u) 𝛽(u, v, J, K) = v − u + u ⋅ J + v ⋅ K, the steady-state solution of the above equation is ( ) RP,ss K K = G k1 S, k2 , m1 , m2 . RT RT RT When plotting RP,ss as a function of S (Figure 13.8), we obtain a sigmoidal response curve if both J and K are much smaller than 1. Figure 13.8 Response in a phosphorylation/dephosphorylation equilibrium with Michaelis–Menten kinetics. RP,ss

S

359

Principles of Computational Cell Biology

This mechanism creates a switch-like signal response curve, which is called zero-order ultrasensitivity. All the three examples considered so far give a “graded” and reversible behavior of R with respect to S. “Graded” here means that R increases continuously with S. “reversible” means that if S is modiﬁed from Sinitial to Sﬁnal , this leads to the same response at Sﬁnal irrespective of whether the signal is turned up (Sinitial < Sﬁnal ) or down (Sinitial > Sﬁnal ). Although continuous and reversible, a sigmoidal response is abrupt. The element behaves like a buzzer where one must press hard and long enough on the button to activate the response. In terms of the phosphorylation signal, the signal S must be strong enough to create a noticeable change of the equilibrium. 13.4.4

Perfect Adaptation – Sniﬀer

Now, the simple linear response element of Figure 13.4 is supplemented with a second signaling pathway through species X (Figure 13.9a). The signal inﬂuences the response via two parallel pathways that push the response in opposite directions (an example of feed-forward control similar to Section 9.5.1). The action of S on X corresponds exactly to our ﬁrst example on protein synthesis. The steady-state response of X is linearly dependent on S: dR = k1 S − k2 X ⋅ R dt dX = k3 S − k4 X. dt For the steady state, setting the second equation to zero gives kS Xss = 3 . k4 S

k3 X

S k1 R

(a)

k4

X

R, S, X

360

k2

R

(b)

Time

Figure 13.9 (a) Coupling of the initial response pathway via R with a second signaling pathway (X). (b) Transient response (R, thin solid line) as a function of stepwise increases in signal strength S (thick solid line) with concomitant changes in the indirect signaling pathway X (dashed line). The signal inﬂuences the response via two parallel pathways that push the response in opposite directions.

Kinetic Modeling of Cellular Processes

Setting the ﬁrst equation to zero and replacing X ss by the expression just derived gives Rss =

k1 S kk = 1 4. k2 X k2 k3

In this set up, the response mechanism exhibits perfect adaptation to the signal. In Fig. 13.9(b) we are plotting the relaxation process of adapting the response to new levels of the signal S that is being increased abruptly in discrete steps. Although the signaling pathway responds transiently to changes in signal strength R (Figure 13.9b, right), its steady-state response Rss is independent of S and is only controlled by the ratio of the four kinetic rates of the system! Such behavior is typical of chemotactic systems, which respond to an abrupt change in attractants or repellents, but then adapt to a constant level of the signal. Our own sense of smell operates this way. Therefore, this element is termed a sniﬀer. 13.4.5

Positive Feedback – One-Way Switch

In the previous example, the signal aﬀected the response via two parallel pathways that stimulated synthesis and degradation. This was an example for a feed-forward control system. Alternatively, some component of a response pathway may feed back to the signal. This will be the case in the example presented here, where the response element R activates enzyme E (by phosphorylation) and EP enhances the synthesis of R (Figure 13.10): dR = k4 EP (R) + k1 S − k2 R. dt EP (R) is again a Goldbeter–Koshland function depending on the rate constants k 3 R, k 4 and also the rate that describes the spontaneous back reaction from EP to E. Solving this system gives the steady-state response shown in Figure 13.11. The equilibrium between EP and E is similar to the previous example of a phosphorylation/dephosphorylation equilibrium in that the total concentration S

Figure 13.10 Example of a positive feedback system built from a protein E and the response R. E P is the phosphorylated form of E. Here, R activates E by phosphorylation, and E P enhances the synthesis of R.

k1 k2 R

k4

k3

EP

E

361

Principles of Computational Cell Biology

Figure 13.11 Positive feedback system. As S increases, the response is low until S exceeds a critical value Scrit at which point the response increases abruptly to a high value. Then, if S decreases, the response stays high.

R

362

Scrit

S

of E + EP is constant. However, the previous example did not include the positive feedback of EP into the production of P. In the response curve, the control system is found to be bistable between 0 and Scrit . In this regime, there are two stable steady-state response values (on the upper and lower branches, the solid lines) separated by an unstable steady state (on the intermediate branch, the dashed line). This is called a one-parameter bifurcation. Which value is taken depends on the history of the system. After the signal threshold, Scrit has been crossed once, the system will remain on the upper curve. This is termed a one-way switch. Biological examples for this behavior include the maturation of frog oocytes in response to the hormone progesterone, and apoptosis, where the decision to shut down the cell must be clearly a one-way switch. 13.4.6

Mutual Inhibition – Toggle Switch

The following example is a minute variation of the previous one in that the enzyme E now stimulates degradation of R (Figure 13.12a). This system again leads to a discontinuous behavior (Figure 13.12b). This type of bifurcation is called a toggle switch. If S is decreased enough after starting from a high level, the switch will go back to the oﬀ-state on the lower curve, meaning a small response R. For an intermediate strength stimulus (Scrit1 < S < Scrit2 ), the response of the system can be either small or large, depending on the history of S(t). This is often called “hysteresis.” Biological examples for such behavior include the lac operon in bacteria, activation of M-phase promoting factor in frog egg extracts, and the autocatalytic conversion of normal prion protein into its pathogenic form. 13.4.7

Negative Feedback – Homeostasis

In negative feedback, the response counteracts the eﬀect of the stimulus (Figure 13.13a). Here, the response element R downregulates the enzyme E

R

Kinetic Modeling of Cellular Processes

S k1 k2

R

Scrit1

k3

Scrit2

k4

E

EP (a)

S

(b)

Figure 13.12 (a, b) Positive feedback system, termed “toggle switch.”

S k1 R k4

R

k2

k3

E

EP

(a)

(b)

S

Figure 13.13 (a, b) Negative feedback system.

catalyzing its production. Therefore, the rate of synthesizing R is a sigmoidal decreasing function of R (Figure 13.13b). In this case, the signal is in demand for R. If there is not enough R present, increasing the signal S does change the concentration of R much more: dR = k4 E(R) − k2 S ⋅ R dt E(R) = G(k4 , k 3 R, J3 , J4 ). This type of regulation is frequently encountered in biosynthetic pathways. It is termed homeostasis. It is sort of an adaptive mechanism, but not a sniﬀer because stepwise increases in S do not generate transient changes in R.

363

Principles of Computational Cell Biology

S k1 k2

X k0

X

k3 k7

YP

YP

Y k4

k5

R

RP

RP k6

(a)

(b)

Time

RP

364

RPmax

RPmin (c)

Scrit1

Scrit2

S

Figure 13.14 (a) Three-component system with feedback loop. (b) Feedback loop leads to oscillations of X (top solid line), Y P (dashed line), and RP (bottom solid line). (c) Within the range Scrit1 < S < Scrit2 , the steady-state response RP,ss is unstable. Within this range, RP (t) oscillates between RPmin and RPmax .

13.4.8

Negative Feedback: Oscillatory Response

Negative feedback as in the two-component system E → R ⊣ E can result in damped oscillations leading to a stable steady state but not to sustained oscillations (⊣ indicates inhibition). Sustained oscillations can only be generated by at least three components, X → Y → R ⊣ X. The third component (Y ) introduces a time delay in the feedback loop. This causes the control system to repeatedly overshoot and undershoot its steady state. There are two ways to close the negative feedback loop: (i) RP inhibits the synthesis of X and (ii) RP activates the degradation of X. Here, the second scenario is realized (Figure 13.14): dX = k0 + k1 S − k2 X − k7 RP ⋅ X dt

Kinetic Modeling of Cellular Processes

k X(YT − YP ) dYP k4 Y P = 3 − dt Km3 + (YT − YP ) Km4 + YP k Y (R − RP ) k6 RP dRP = 5 P T − dt Km5 + (RT − RP ) Km6 + RP Negative feedback has been proposed as a basis for oscillations in protein synthesis, activity of the mitosis-promoting factor (MPF), mitogen-activated protein kinase signaling pathways, and circadian rhythms. 13.4.9

Cell Cycle Control System

As an example of a truly complex system that absolutely relies on the exact temporal control of all steps involved, we will now consider the wiring diagram for the cyclin-dependent kinase (Cdk) network controlling the synthesis of DNA and mitosis. Fortunately, this system can be partly decomposed into the motives explained above. The network contains several proteins that regulate the activity of Cdk1-cyclin B heterodimers and consists of three modules that control the G1 /S, G2 /M, and M/G1 transitions of the cell cycle (Figure 13.15). Let us ﬁrst describe the individual components of this system. Cdk1 is a catalytic protein kinase domain whose function is to phosphorylate other proteins. It is activated by the binding of cyclin B and inhibited by the binding of the cyclin kinase inhibitor (CKI). Wee1 is another kinase that phosphorylates and thereby deactivates the Cdk1 bound to cyclin B. This phosphorylation can be eliminated by

Damaged DNA

APC

Cdk1

Cdc25

CycB Unreplicated DNA

P Cdc20

Misaligned chromosomes

Cdc20

Cdk1

P

APC

P

Wee1

Wee1

P

CycB Cdk1

CKI

G2/M module

odule

AP C

P

M/G1 m

Cdc25

CycB

Cdk1

Sm

odu

le

P P P

CKI

G1/

CKI

CycB

Growth

Figure 13.15 Wiring diagram of the cell cycle regulation in eukaryotes. Major events of the cell cycle are triggered by Cdk1 in combination with cyclin B (CycB). The active dimer (see center) can be inactivated in the G1 /S module by binding to an inhibitor (CKI) or in the G2 /M module by phosphorylation of the kinase subunit by a kinase called Wee1. The inhibitory phosphate group is removed by a phosphatase (Cdc25). Cdk1 activity can also be destroyed by proteolysis of its cyclin partner, mediated by the anaphase-promoting complex (APC) in combination with Cdc20.

365

366

Principles of Computational Cell Biology

S k1 R

k2

k3 EP

Cdk1:cycB k3

k4 E

CKIP

(a)

k2

Cdk1:cycB:CKI

k1

k4

CKI

(b)

Figure 13.16 (a) Three-component system with feedback loop (cf. Figure 13.10). (b) Toggle-switch in the G1 /S module of the cell cycle.

Wee1P

Wee

Cdk1:cycB

Cdc25p

Figure 13.17 Toggle switch in G2 /M phase involving the activating interaction of Cdk1:cyclin B on Cdc25 (cf. Figure 13.8) and the inhibitory interaction of Cdk1:cyclin B on Wee (cf. Figure 13.10). Cdk1:cycBP

Cdc25

phosphatase Cdc25, thus reactivating Cdk1. Cdk1 itself activates Cdc25 by phosphorylation, meaning that this is a positive feed-forward mechanism. Cdk1 also phosphorylates Wee1, preventing its own phosphorylation by Wee1. The G1/S module is a toggle switch due to the mutual inhibition between Cdk1-cyclin B and CKI, which acts as a stoichiometric Cdk inhibitor (Figure 13.16). In the G2/M module, a second toggle switch is realized that is based on the mutual activation of Cdk1-cyclin B and Cdc25 (a phosphatase activating the dimer) and mutual inhibition of Cdk1-cyclin B and Wee1 (a kinase inactivating the dimer) (Figure 13.17). The M/G1 module is an oscillator due to a negative-feedback loop, whereby Cdk1-cyclin B ﬁrst activates the anaphase-promoting complex (APC), which then activates Cdc20 so that cyclin B is eventually degraded again. The “signal” driving cell proliferation is cell growth. Before it has grown to a critical size, a newly generated cell cannot exit the G1 phase and enter the DNA synthesis/division phase (S/G2 /M).

13.5 Partial Diﬀerential Equations A PDE is a diﬀerential equation that contains an unknown function of several independent variables and its partial derivatives with respect to those variables. PDEs are used to model many diﬀerent types of processes that are distributed in

Kinetic Modeling of Cellular Processes

space, or distributed in space and time, involving the propagation of sound and heat, electrostatics, electrodynamics, ﬂuid ﬂow, or elasticity. Solving PDEs is an advanced ﬁeld of mathematics. We will restrict ourselves here to discussing the diﬀusion equation that describes density ﬂuctuations in a solution or material arising because of diﬀusional processes. This problem is closely related to many cellular phenomena. The equation is usually written as 𝜕𝜌(x, t) = ∇ ⋅ (D(𝜌, x)∇𝜌(x, t)), (13.9) 𝜕t with the density of the diﬀusing material 𝜌, the macroscopic diﬀusion coeﬃcient D, the spatial coordinate x, and the time t. We have already encountered the gradient operator ⎛𝜕∕𝜕x⎞ ∇:= ⎜𝜕∕𝜕y⎟ , ⎜ ⎟ ⎝𝜕∕𝜕z ⎠ in Section 4.6. The diﬀusion equation (Eq. (13.9)) can be derived by combining the continuity equation that states that a change in density in any part of the system is due to the inﬂow and outﬂow of the material into and out of that part of the system: 𝜕𝜌 + ∇ ⋅ j = 0, 𝜕t with the ﬂux j described by the phenomenological Fick’s ﬁrst law that sets the magnitude and direction of a ﬂux involving the diﬀusing material in any part of the system as proportional to the local density gradient: j = −D(𝜌)∇𝜌(x, t). This equation states that material will ﬂow from a region of higher concentration to regions of lower concentrations. This is why j points in the direction of the negative gradient (Section 4.6). Inserting Fick’s law into the continuity equation gives the diﬀusion equation (Eq. (13.9)). For ease of understanding, let us look at the one-dimensional form of the diffusion equation, where we additionally assume that the diﬀusion coeﬃcient is independent of 𝜌 and x: 𝜕 2 𝜌(x, t) 𝜕𝜌(x, t) . (13.10) =D 𝜕t 𝜕x2 This equation describes a one-dimensional process of diﬀusion. One solution of this equation is the Gaussian function: 𝜌 2 𝜌(x, t) = √ 0 e−(x ∕4Dt) , 4πDt as can be veriﬁed by twice taking the derivative with respect to x times D or once with respect to time t. This solution describes how a certain amount of material that was initially all localized in one point at the coordinate origin, spreads out into the surrounding medium for t > 0.

367

368

Principles of Computational Cell Biology

From this form of the solution, you can immediately imagine how diﬀusion proceeds. Initially, the Gaussian function√is quite well focused. With increasing time, its width increases proportional to t, but its amplitude decreases. In computational cell biology, solving the diﬀusion equation by numerical methods allows modeling the diﬀusive motion of particles in cells. Nowadays, modern optical techniques allow tracking of ﬂuorescent tracer molecules or Green Fluorescent Protein-labeled proteins. Comparing spatially resolved simulations with experimental observables has become an important technique even for experimentalists. An important software package in this area is the “Virtual Cell” initiative by the National Resource for Cell Analysis and Modeling at the University of Connecticut Health Center. The Virtual Cell environment (www .vcell.org) allows users from the experimental biologist’s community to utilize a sophisticated simulation package without having to deal with the mathematical background. The package itself will model diﬀusional and transport processes by appropriate ODE and PDE approaches. 13.5.1

Spatial Gradients of Signaling Activities

In cells, proteins are not equally distributed throughout the cell volume but are often localized to speciﬁc sites. Spatial gradients (diﬀerences of local concentrations) of protein activities may then organize signaling processes around cellular structures, such as membranes, chromosomes, and scaﬀolds. The basic requisite for signaling gradients is the spatial segregation of opposing reactions (e.g. kinases and phosphatases) in a universal protein modiﬁcation cycle (Figure 13.18). For a protein that is phosphorylated by a membrane-bound kinase and dephosphorylated by a cytosolic phosphatase (Figure 13.18b), a gradient of the phosphorylated protein will be established with a high concentration close to the membrane and a low concentration within the cell. The magnitude of the gradient will depend on the ratio of the diﬀusion constants versus reaction rates of kinase and phosphatase. 13.5.2

Reaction–Diﬀusion Systems

Reaction–diﬀusion systems describe how the concentration level(s) of one or more substances in a volume are aﬀected by the eﬀects of local chemical reactions that transform the substances into each other and because of diﬀusion that causes the substances to distribute in the volume. Reaction–diﬀusion systems are frequently used to describe dynamical processes in chemistry and related disciplines. Mathematically, reaction–diﬀusion systems can be described by extending Eq. (13.10) into 𝜕 2 𝜌(x, t) 𝜕𝜌(x, t) + R(x, t) =D 𝜕t 𝜕x2 where each component of the vector 𝜌(x, t) holds the concentration of one substance, D is a diagonal matrix of diﬀusion coeﬃcients, and R(x, t) accounts for all local reactions. The solutions of reaction–diﬀusion equations show a wide range of behaviors, including the formation of traveling waves and wave-like phenomena as well as other self-organized patterns like stripes and hexagons.

Kinetic Modeling of Cellular Processes Kin Phos

Kin Phos

Phos

Kin

Phos Phos

Phos

Kin

Phos Phos

Phos

Kin Kin Multi- Kin molecular structure Kin Kin

Phos

Phos

Phos

Phos Kin

Kin Phos Phos

Phos

Phos Phos Kin

Phos

Phos

Kin Phos

Phos Kin

(a)

(b)

Kin

Figure 13.18 Spatial segregation of two opposing enzymes in a protein modiﬁcation cycle generates intracellular gradients. (a) Kinases (Kin) localize to a supramolecular structure and the counteracting phosphates (Phos) are distributed elsewhere in the cytosol. (b) Kinases localize to the membrane.

13.6 Dynamic Phosphorylation of Proteins Nowadays, it is known that about 70% of all mammalian proteins are getting phosphorylated at speciﬁc amino acid positions, mostly at serine, histidine, or tyrosine residues. One may wonder whether it is really that important to know whether a protein is phosphorylated or not? Would it not be suﬃcient to know the concentration and localization of the protein in the cell? To better understand why the phosphorylation status is really important, we stress that each phosphate PO4 2− group carries a formal charge of −2 electron charges. Locally, this strongly alters the electrostatic potential of the protein region around the phosphorylated amino acid. This may alter the local pK a value of surrounding amino acids and may often aﬀect the speed of chemical reactions taking place in the protein. Importantly, phosphorylation also modiﬁes interactions of the protein with other biomolecules. To put this into perspective, we note that the total charges of proteins from E. coli strain K-12 can be modeled by a Gaussian curve with mean zero and standard deviation of c. 10 electron charges (Sear 2003). This means that changing the total charge of a protein by −2e has a relatively large eﬀect. Hence, characterizing the cellular phospho-proteome has gained more interest recently. The method of choice is mass spectroscopy. With this technique, one can determine the precise concentrations of all phosho-forms of each protein, not only whether the protein is phosphorylated or not. Often, the active form of a protein is phosphorylated in multiple locations. For example, the carboxy-terminal domain (CTD) of RNA polymerase II can carry hundreds of diﬀerent post-translational modiﬁcations (many of them phosphorylation variants) (Harlen and Churchman 2017). CTD regulates each step in the transcription process, from initiation to termination. In addition, the CTD is important for regulating several cotranscriptional processes, such as splicing

369

370

Principles of Computational Cell Biology

and chromatin modiﬁcation. Clearly, these posttranslational modiﬁcations have a dynamic component and are not ﬁxed over time. There exist a number of bioinformatics methods that predict phosphorylation sites on proteins and the kinases responsible for this. For example, the integrative computational approach, NetworKIN, connects consensus sequence motifs with protein association networks to predict which protein kinases target certain phosphorylation sites that were experimentally characterized in vivo (Linding et al. 2007). The algorithm has two stages. In the ﬁrst step, neural networks and position-speciﬁc scoring matrices (Section 7.4) are used to associate each phosphorylation site with one or more kinase families, whereby the intrinsic preference of kinases for consensus substrate motifs is exploited. In the second stage, the context for each substrate is represented by a probabilistic protein network extracted from the STRING database (see table 5.1). Another important area is that of dynamic phosphorylation changes in signal transduction. Models used for signaling pathways can be either deterministic or stochastic processes and either discrete or continuous with respect to time and to abundance of the components. Also, they may or may not treat the processes to be spatially dependent. Which model is best suitable, depends on the system, the available information, and the speciﬁc questions need to be studied. Traditionally, tailored diﬀerential equation models have been used (Heinrich et al. 2002). In deterministic approaches, the spatial distribution of compounds may be described, for example, by distinguishing diﬀerent compartments or by describing the dynamics of the considered process(es) in a continuous space with PDEs. Examples for modeling approaches that are discrete with respect to time and values of variables are Boolean networks (Section 9.3.1), Petri nets, or cellular automata. Stochastic eﬀects become most relevant in systems involving small molecule numbers. Then, one needs to consider individual reaction events between single particles, e.g. using the diﬀerent algorithms developed by Gillespie (Section 14.2). Compared to the signaling ﬁeld, only few experimental studies have characterized dynamic changes of the phospho-proteome during the cell cycle. For example, the group of Paul Nurse used SILAC experiments (Section 8.8) to characterize proteins phosphorylated by the master kinase Cdc2 in Schizosaccharomyces pombe (Swaﬀer et al. 2016). Interestingly, they used a construct where Cdc2 is fused with B-type cyclin Cdc13 and all other genes coding for cyclins are edited out. This mutant form of yeast was shown to undergo a very similar oscillating cell cycle as wild-type S. pombe. Figure 13.19 shows a Boolean network model for the core oscillator of S. pombe (Davidich and Bornholdt 2013).

13.7 Summary The techniques used in mathematical modeling of time-dependent phenomena in biological cells are quite mature because they are the same ones as in other

Kinetic Modeling of Cellular Processes

Figure 13.19 Topology of a Boolean network for ﬁssion yeast. Source: Davidich and Bornholdt (2013). http:// journals.plos.org/plosone/ article?id=10.1371/journal .pone.0071786. Licensed under CC-BY 4.0.

Start

Cig1/Cdc2

Cig2/Cdc2

Puc1/Cdc2

Cig2/Cdc13

Rum1

Ste9

Cdc25

PP

Slp1

Wee1

Cdc2_Tyr15

scientiﬁc disciplines. Challenges for the large-scale application are the determination of the required kinetic parameters as well as the formulation of the model topology. Pioneers in this ﬁeld used to be leading experts in one particular aspect of cells. They would formulate models that would be as simple as possible, but as complicated as necessary. The models developed by them would then precisely reproduce the (known) properties of this aspect and reveal hidden details that are not accessible to experimental detection. Deﬁning what components need to be included in the model and which ones not was a matter of biological expertise and intuition. Now, that we enter into the ﬁeld of data science and automatic generation of mathematical models, particular challenges arise when connected cellular processes need to be integrated and “emerging” phenomena need to be retrieved. As models become more complicated, it may not be so obvious which processes one should focus on and which aspects may be neglected. Fortunately, super-resolution light microscopy now enables cell biologists to visualize dynamic processes involving cellular components labeled with ﬂuorescent dye molecules with spatial resolution of 10–30 nm. This provides unprecedented insights into cellular processes. Three of the pioneers of this ﬁeld, Stefan Hell, Eric Betzig, and W.E. Moerner, were awarded the Nobel Prize in Chemistry in 2014 “for the development of super-resolved ﬂuorescence microscopy.”

371

372

Principles of Computational Cell Biology

13.8 Problems 1. A two-ODE system Let us assume that CDK1 is activated by a constant rate of cyclin synthesis (𝛼 1 ), whereas its inactivation by APC is described by a Hill function (see Figure 13.20). The inactivation rate is proportional to the concentration of CDK1 times a Hill function of APC. Hence, APCn1 dCDK1 = 𝛼1 − 𝛽1 ⋅ CDK1 ⋅ n1 dt K1 + APC1n1 The ODE for APC activation is CDK1n2 dAPC − 𝛽2 ⋅ APC = 𝛼2 ⋅ (1 − APC) n2 dt K2 + CDK1n2 For APC, the activation by CDK is proportional to the concentration of inactive APC (assuming that the total concentration of active and inactive APC is constant) times a Hill function of CDK1 whereas the rate of inactivation of APC is described by simple mass action kinetics. Figure 13.20 illustrates the mutual regulation. (a) Implement the model in Python (or similar). Start with all concentrations set to zero and simulate with t = [0; 25]. Use the following parameters: 𝛼 1 = 0.1, 𝛼 2 = 𝛽 1 = 3, 𝛽 2 = 1, K 1 = K 2 = 0.5, n1 = n2 = 8. (b) Plot the CDK1 and APC activity against time. Explain the oscillation (if any) and the behavior. (c) Investigate the oscillation behavior by plotting the concentration of CDK1 against APC. Also, plot the nullcline curves, i.e. plot the steady-state concentration of CDK1 with ﬁxed APC using the interval [0; 1] and vice versa. These curves describe the concentration change if there was no regulation of APC by CDK1. Show the ranges for x, y = CDK1, APC between [0; 1]. Explain your ﬁndings. (d) Repeat the previous steps with diﬀerent concentrations for APC and CDK1. Explain your conclusions. Figure 13.20 Schematic view of the two ODE model (see Problem 1). CDK1

APC

Kinetic Modeling of Cellular Processes

Bibliography Novák, B. and Tyson, J.J. (2008). Design principles of biochemical oscillators. Nature Reviews Molecular Cell Biology 9: 981–991.

Cycling Genes in Arabidopsis thaliana Harmer, S.L., Hogenesch, J.B., Straume, M. et al. (2000). Orchestrated transcription of key pathways in Arabidopsis by the circadian clock. Science 290: 2110–2113.

Regulatory Motifs Goldbeter, A. and Koshland, D.E. (1981). An ampliﬁed sensitivity arising from covalent modiﬁcation in biological systems. Proceedings of the National Academy of Sciences of the United States of America 78: 6840–6844. Tyson, J.J., Chen, K.C., and Novak, B. (2003). Sniﬀers, buzzers, toggles and blinkers: dynamics of regulatory and signaling pathways in the cell. Current Opinion in Cell Biology 15: 221–231.

Predicting Phosphorylation Sites Linding, R., Jensen, L.J., Ostheimer, G.J. et al. (2007). Systematic discovery of in vivo phosphorylation networks. Cell 129: 1415–1426. Sear, R.P. (2003). The eﬀects of added salt on the second virial coeﬃcients of the complete proteome of E. coli. Journal of Chemical Physics 118: 5157–5161.

Dynamic Phosphoproteome During Yeast Cell Cycle Swaﬀer, M.P., Jones, A.W., Flynn, H.R. et al. (2016). CDK substrate phosphorylation and ordering the cell cycle. Cell 167: 1750–1761.

Phosphorylation in Signal Transduction Davidich, M.I. and Bornholdt, S. (2013). Boolean network model predicts knockout mutant phenotypes of ﬁssion yeast. PLoS ONE 8: e71786.

373

374

Principles of Computational Cell Biology

Harlen, K.M. and Churchman, L.S. (2017). The code and beyond: transcription regulation by the RNA polymerase II carboxy-terminal domain. Nature Reviews Molecular Cell Biology 18: 263–273. Heinrich, R., Neel, B.G., and Rapoport, T.A. (2002). Mathematical models of protein kinase signal transduction. Molecular Cell 9: 957–970.

375

14 Stochastic Processes in Biological Cells In Section 8.3, we introduced statistical tests for analyzing gene expression data. This was done to treat the heterogeneity of biological data and to account for statistical ﬂuctuations and statistical noise. However, at that point, we did not discuss the fact that biological data are subject to considerable intrinsic ﬂuctuations. A considerable degree of heterogeneity is even found at the level of single-cell data taken from the same tissue. In this chapter, we will meet one of the reasons for this, namely, that many processes in biological cells have a stochastic nature so that diﬀerent events may occur in a random order and at random time intervals! For example, at any instance in time, many biomolecules will “encounter” other biomolecules in a cell. Which ones of these encounters will lead to the formation of stereospeciﬁc complexes is subject to stochasticity and can essentially not be predicted beforehand. Generally speaking, stochastic eﬀects are best noticeable and have the largest eﬀects if they involve processes that are based on small particle numbers and/or infrequent, so-called “rare” events. Transcriptional regulation and cell diﬀerentiation are two important processes of this kind that were addressed at several places in this book. Both processes are clearly subject to stochasticity.

14.1 Stochastic Processes Processes involving large numbers of particles appear as continuous processes that can be modeled, for example, by deterministic diﬀerential equations (Section 13.3). On the other hand, many quantities of cell biological interest are present only in small discrete copy numbers inside the cell (e.g. a single DNA, dozens of mRNAs, and hundreds of proteins). In such cases, one often employs methods from discrete stochastic mathematics to analyze and model such processes. A random process, also called stochastic process, describes a system that makes random transitions between diﬀerent states x in a state space X. Here, we will give a short introduction into stochastic processes with an eye on modeling gene transcription and particle dynamics. A random process determines a joint probability p for the system states at all time points t 1 …t n . In many situations, the transitions from state a to state b do not depend on the history of the full process but only on the state immediately Principles of Computational Cell Biology: From Protein Complexes to Cellular Networks, Second Edition. Volkhard Helms. © 2019 Wiley-VCH Verlag GmbH & Co. KGaA. Published 2019 by Wiley-VCH Verlag GmbH & Co. KGaA.

376

Principles of Computational Cell Biology

preceding the current state. Then, the state probabilities p fulﬁll p(xn , tn |x1 , t1 ; … ; xn−1 , tn−1 ) = p(xn , tn |xn−1 , tn−1 )

∀tn > tn−1 > … > t1

Such processes are called Markov processes. For Markov processes, we can determine the transition probability from the time distribution at two time points t 1 and t 2 : p(xa , t2 |xb , t1 ) =

p(xa , t2 ; xb , t1 ) . p(xb , t1 )

This equation describes the probability to be in state xa at time t 2 under the condition that the system was in state xb at time t 1 . This conditional probability is equivalent to the one we used with Bayesian probabilities (see Section 5.3). If the ensemble of states is not aﬀected by shifting the time variable from t to t + Δt, we call the process stationary.

14.1.1

Binomial Distribution

Suppose you throw n times a coin, which shows head with probability p and tail with probability 1 − p. The probability to obtain k times head is then ( ) n k P(k; n, p) = p (1 − p)n−k k This is the binomial distribution. It describes the number of successes (here: heads) in a sequence of n independent experiments from a population of size N with replacement and with a binary outcome. The formula combines the probabilities of k successes that occur with probability pk and that of n − k failures n−k that occur with probability (1 −(p)) . If the k successes can take place anywhere n among the n trials, there are diﬀerent ways of how k successes are disk tributed in a series of n trials. If the sampling is done without replacement, this yields a hypergeometric distribution, not a binomial one. However, if N is much larger than n, the binomial distribution continues to be a good approximation and is often used. The average number ⟨k⟩ (also termed 𝜆) of successes of n independent binomially distributed binary events is n⋅p. If the average 𝜆 = n⋅p is kept constant but the number of trials n goes to inﬁnity (implying that p = 𝜆/n goes to zero), the binomial distribution converges to the Poisson distribution P(k) = e−𝜆

𝜆k k!

It describes a wide range of independent identically distributed random events, for instance, the number of chemical reactions in a given time and volume, where 𝜆 is the average number of reactions. As a rule of thumb, this is an appropriate approximation if n ≥ 20 and p ≤ 0.05, or if n ≥ 100 and np ≤ 10.

Stochastic Processes in Biological Cells

14.1.2

Poisson Process

In probability theory, a Poisson process is a stochastic process that counts the number of events and the time points when these events take place in a given time interval. In relationship to the Binomial process just described, one of these “events” is equivalent to a (binomial) success, a no-event is equivalent to a failure. The basic form of a Poisson process is a continuous time counting process {N(t), t ≥ 0} with the following properties: N(0) = 0, the numbers of events counted in disjoint intervals are independent from each other, the probability distribution of the number of occurrences counted in any time interval only depends on the length of the interval, and no counted occurrences occur simultaneously. If these conditions apply, then • The probability distribution of k = N(t) is a Poisson distribution, where the k probability of observing k events during a time interval is P(k) = e−λ λk! with the mean number of events 𝜆. Interestingly, the variance of this distribution is also 𝜆. • The probability distribution of the waiting time between any two consecutive events is an exponential distribution with parameter 𝜆. • The events are distributed uniformly over any time interval. 14.1.3

Master Equation

The time derivative of the probability to be at a state xa equals the sum over all the other states of the probability to be at a particular state xb times the rate wb→a at which transitions are made from xb to xa , minus the probability to be at xa times the rate at which transitions are made back to xb : dp(xa , t) ∑ wb→a (t)p(xb , t) − wa→b (t)p(xa , t) = dt b≠a This is called a master equation. If the range of states is a discrete set of states, the master equation is a gain–loss equation for the probabilities of the separate states. The master equation can also be written in a compact form: dP =A⋅P dt where P is a column vector where element i represents state i, and A is the transition matrix of connections. At steady state, the state probabilities are, by definition, constant in time so that the left side equals zero. Actually, this matrix equation corresponds to n equations. Thus, the transition rate between any two states is equal in both directions, a condition termed detailed balance. To illustrate this principle, let us consider a system with two states a and b, see Figure 14.1. In a dynamic equilibrium, the same number of particles will transition from state a to b in a given short time interval as in the opposite direction. These numbers depend on the occupancy of a state (its state probability) and on the transition probability of an individual particle. Thus, wa→b ⋅ p(a) = wb→a ⋅ p(b)

377

378

Principles of Computational Cell Biology

Wa→b

Wb→a b

a

pb

pa

Figure 14.1 Schema illustrating detailed balance for a system with two states, a and b. p(a) and p(b) are the probabilities (occupancies) of the two states. The arcs mark transitions between the two states with the according transition probabilities wa→b and wb→a .

To give an example, let us assume that initially 60% of the particles are in state a and 40% of the particles are in state b. Thus, p(a) = 0.6 and p(a) = 0.4. Assume now that one-third of the particles that are in state a will switch to state b during a short time interval Δt. Thus, wa→b = 13 . For the above condition to hold, we get for wb→a : wb→a =

wa→b ⋅ p(a) = p(b)

1 3

⋅ 2 5

3 5

=

1 2

This means that half of the particles that are in state b should switch to state a during the same time interval Δt.

14.2 Dynamic Monte Carlo (Gillespie Algorithm) Stochastic simulations generate realizations of stochastic processes. Each simulation will generate a diﬀerent series of steps so that one typically needs to rerun the simulation a number of times to deduce the general behavior of the system. In this way, one can generate probability distributions of stochastic processes without assuming their form a priori. In the traditional computational approach to chemical/biochemical kinetics, one starts with a set of coupled ordinary differential equations (ODEs) that describe the time-dependent concentration of chemical species (i.e. the average instantaneous particle numbers per volume element) and uses some integrator to calculate the concentrations as a function of time given the rate constants and a set of initial concentrations. Successful applications involve studies of yeast cell cycle, metabolic engineering, whole-cell scale models of metabolic pathways using the package E-cell, and others. This approach assumes that particle concentrations in the reaction volume are well mixed and suﬃciently large. In that case, stochastic ﬂuctuations of the particle concentrations are small with respect to the concentrations and can thus be neglected. On the other hand, cellular processes occur in very small volumes and frequently involve very small number of molecules. For example, one Escherichia

Stochastic Processes in Biological Cells

coli cell contains on average only 10 molecules of the important Lac repressor. As a consequence, the modeling of reactions as continuous ﬂuxes of matter is no longer correct. In such cases, one has to account for the fact that signiﬁcant stochastic ﬂuctuations occur. Popular approaches to study such scenarios have been stochastic formulations of chemical kinetics or Monte Carlo computer simulations. In 1976, Daniel Gillespie introduced the exact Dynamic Monte Carlo (DMC) method (Gillespie 1976; Gillespie 1977) that connects traditional chemical kinetics and stochastic approaches. In the usual implementation of DMC for kinetic simulations, each reaction is considered as an event and each event has an associated probability of occurring. Assuming that the system is well mixed, the rate constants appearing in the DMC and the ODE methods are related. The probability P(Ei ) that a certain chemical reaction Ei takes place in a given time interval Δt is proportional to an eﬀective rate constant ̃ k = N∕V with the volume V and to the number of molecules from the chemical species that can take part in that event, e.g. the probability of the ﬁrst-order reaction X → Y + Z would be ̃ k1 NX with N x as the number of molecules of species X and ̃ k1 as the eﬀective rate constant of the reaction. Similarly, the probability of the reverse second-order reaction Y + Z → X would be ̃ k2 NY NZ . As the method is a probabilistic approach based on “events,” “reactions” included in the DMC simulations do not have to be solely chemical reactions. Any process that can be associated with a probability can be included as an event in the DMC simulations, e.g. a substrate attaching to a solid surface can initiate a series of chemical reactions. One can split the modeling into the physical events of substrate arrival, of attaching the substrate, followed by the chemical reaction steps. 14.2.1

Basic Outline of the Gillespie Method

Step i. A list of the components/species is generated and the initial distributions at time t = 0 are deﬁned. Step ii. A list of possible events Ei (chemical reactions as well as physical processes) is generated. Step iii. Using the current component/species distribution, a probability table P(Ei ) is prepared of all the events that can take place. The total probability is computed of all events: ∑ P(Ei ), Ptot = where P(Ei ) is the probability of event Ei in a given time interval. Step iv. Two random numbers r1 and r2 ∈ [0…1] are generated to decide which event E𝜇 will occur next and the amount of time 𝜏 by which E𝜇 occurs later since the most recent event. Using the random number r1 and the probability table, the event E𝜇 is determined by ﬁnding the event that satisﬁes the relation: 𝜇−1 ∑ i=1

P(Ei ) < r1 Ptot ≤

𝜇 ∑ i=1

P(Ei ).

379

380

Principles of Computational Cell Biology

The second random number r2 is used to obtain the amount of time 𝜏 between the reactions: 1 𝜏=− ln(r2 ). Ptot As the total probability of the events changes in time, the time step between occurring steps varies. To complete one run of the simulation, steps (iii) and (iv) are repeated at each step of the simulation until some ﬁnal time is reached. The necessary number of independent runs depends on the inherent noise of the system and on the desired statistical accuracy. In this way, every process is attempted as frequent as its probability contributes to the total probability. We will see example applications of the Gillespie algorithm in the following two subchapters.

14.3 Stochastic Eﬀects in Gene Transcription 14.3.1

Expression of a Single Gene

In Section 8.8.1, we introduced a simple deterministic model for gene expression. However, because of the small and discrete number of involved molecules (genes, RNA polymerases, mRNA molecules, and ribosomes), gene expression turns out to be a stochastic process, whereby randomness in transcription and translation leads to signiﬁcant cell-to-cell variations in mRNA and protein levels. In pioneering experimental and analytical work, Oudenaarden and coworkers studied the synthesis of a green ﬂuorescent protein (GFP) in Bacillus subtilis (Ozbudak et al. 2002) In their mutant bacterium, the transcriptional eﬃciency of this protein was regulated by a promoter construct that was sensitive to the local concentration of the chemical IPTG. Furthermore, translational eﬃciency could be modulated by engineering point mutations into the binding site on the ribosome and into the initiation codon of GFP. Figure 14.2a shows the observed number of GFP proteins in individual bacteria (that is, in individual cells). The mean number of proteins ⟨p⟩ depends on the kinetic rates for protein synthesis and degradation: ⟨p⟩ =

kR kP 𝛾R 𝛾P

The symbols are explained in Figure 14.3a. The ratio between ﬂuctuations of the protein number and the mean protein number is also called the phenotypic noise strength (or Fano factor). For a Poissonian process, this ratio would be equal to 1 by construction. When this was modeled by a stochastic model for the expression of a single gene, however, the strength of phenotypic noise depended on the average number of proteins that are translated from each mRNA transcript: 𝜎P2 ⟨p⟩

≅1+

kP 𝛾R

In agreement with this, mutants with diﬀerential translational eﬃciency gave values for the phenotypic noise strength between 32 and 35, Figure 14.2b.

σp2/

(fluorescence units)

Stochastic Processes in Biological Cells

Number of cells

〈p〉 400

σp

200

0

0

200

400

600

800

p (fluorescence units)

(a)

(b)

40

ERT 27 ERT 29 ERT 25 ERT 3

20

[IPTG] = 1 mM

0

0

100 200

(fluorescence units)

Figure 14.2 (a) Histogram showing the expression level of a ﬂuorescent reporter protein measured in a population of isogenic bacterial cells. (b) Phenotypic noise strength (see deﬁnition in the text) for four diﬀerent translational mutants. Noise strength is clearly dependent on translational eﬃciency. Source: Ozbudak et al. (2002). Reprinted with permission of Springer Nature.

A Gillespie-type stochastic simulation of this system reproduced the same k dependency of the magnitude of the ﬂuctuations on the ratio 𝛾P (Figure 14.3). R In real systems, transcription of a gene often negatively aﬀects its own expression via an autoregulatory feedback mechanism. Thattai and van Oudenaarden showed mathematically that autoregulatory negative feedback of the transcribed gene/protein somehow reduces the magnitude of the ﬂuctuations illustrated in Figure 14.3 (Thattai and van Oudenaarden 2001). 14.3.2

Toggle Switch

Next, we will reconsider the toggle switch that we already discussed in Section 13.4.6 and that was experimentally realized by Tim Gardner and Jim Collins (2000). We will describe a stochastic dynamics implementation of the Gardner–Collins system presented by Kauﬀman and coworkers (Ribeiro et al. 2006). The toggle switch consists of two repressors and two constitutive promoters (Figure 14.4). Each promoter is downregulated by the repressor that is transcribed from the opposing promoter. We model the production of protein due to gene expression, by Eq. (14.1) where Proi (t) stands for the promoter site, RNAP is the level of RNA polymerase, and ri is the resulting protein level due to the translation of the transcribed RNA. k i is the probability rate constant for RNAP to bind to DNA. The values 𝜏 are the times when each of the reaction products becomes available in the system. The constant ni is an integer value associated with the rate of translation of each distinct mRNA and the mRNA rate of transcription. The higher its value, the higher is the number of times when a single mRNA is translated. ki

RNAP(t) + Proi (t) → Proi (t + 𝜏1 ) + RNAP(t + 𝜏2 ) + ni × ri (t + 𝜏3 )

(14.1)

With the following chemical reactions, one can model the behavior of a toggle switch consisting of two identical genes via the Gillespie algorithm: k1

RNAP(t) + Pro1 (t) → Pro1 (t + 𝜏1 ) + RNAP(t + 𝜏2 ) + n1 × r1 (t + 𝜏3 )

(14.2)

381

Principles of Computational Cell Biology

γP

Protein kP

γR

mRNA kR DNA

Protein number

(a) 100

Protein number

382

100

b = 10

50

0

(b)

b=1

50

0

0

1

2

3

4

(c)

5

6

7

8

Time (h)

Figure 14.3 Stochastic simulation of single-gene expression using the Gillespie algorithm. (a) mRNA molecules are transcribed at rate kR from the template DNA strand. Proteins are translated at rate kP from each mRNA molecule. Both are degraded with rates 𝛾 R and 𝛾 P . Panels (b) and (c) show typical simulated time courses for protein number. The corresponding population histogram is shown to the right of each time course. In both cases, 𝛾 R = 0.1 s−1 , 𝛾 P = 0.002 s−1 . (b) A gene with low transcription but high translation rates (kR = 0.01 s−1 , kP = 1 s−1 ) produces bursts that are large, variable, and infrequent, resulting in strong ﬂuctuations. (c) Conversely, a gene with high transcription and low translation rates (kR = 0.1 s−1 , kP = 0.1 s−1 ) produces bursts that are small and frequent, causing only weak ﬂuctuations in protein concentration. Source: Ozbudak et al. (2002). Reprinted with permission of Springer Nature. k2

RNAP(t) + Pro2 (t) → Pro2 (t + 𝜏1 ) + RNAP(t + 𝜏2 ) + n2 × r2 (t + 𝜏3 ) k3

r1 (t) + Pro1 (t) −−→ Pro1 r1 (t + 𝜏1 ) k4

r2 (t) + Pro2 (t) → Pro2 r2 (t + 𝜏1 ) k5

Pro1 r1 (t) + Ind1 (t) → Pro1 (t) + r1 (t) + Ind1 (t) k6

Pro2 r2 (t) + Ind2 (t) → Pro2 (t) + r2 (t) + Ind2 (t)

(14.3) (14.4) (14.5) (14.6) (14.7)

Stochastic Processes in Biological Cells

Inducer 2

Repressor 2

Promoter 1

Repressor 1

Reporter

Promoter 2

Inducer 1

Figure 14.4 Toggle switch design by Gardner and Collins. Repressor 1 inhibits transcription from promoter 1 and is induced by inducer 1. Repressor 2 inhibits transcription from promoter 2 and is induced by inducer 2. Source: Adapted from Ribeiro et al. (2006). k7

r1 (t) → ∅ k8

r2 (t) → ∅

(14.8) (14.9)

Equations (14.2) and (14.3) represent the chemical processes of transcribing the two genes and translating their mRNA products. Reactions (14.4) and (14.5) describe the down regulation of the promoters by forming Pro1 r1 and Pro2 r2 , and reactions (14.6) and (14.7) describe the reactivation of the ability of the promoters to support expression by the inducers, Ind1 and Ind2 . The last two reactions, (14.8) and (14.9), model the decay processes of the gene expression products. n1 and n2 are associated with the rates of translation; we set them to 1 for simplicity. In the simulations, the stochastic rate constants of all reactions were set to 1 s−1 , except for the decay reactions, with a stochastic rate constant of 0.001 s−1 . The delay times are random variables following some distribution. For simplicity, one may use constant delays, such as 1 s for 𝜏 1 , 20 s for 𝜏 2 , and 10 s for 𝜏 3 . The initial numbers of the reactants are RNAP = 50, Pro1 = 1, Pro2 = 1, and Ind1 = 1. All other elements are not present initially. When started from this initial state, one obtains (after some transients) a stable state where gene 2 is oﬀ and 1 is on (Figure 14.5). When inducer 1 is removed and the other inducer 2 introduced, we toggle to the other possible stable state (gene 2 on, gene 1 oﬀ ). The variation of r1 and r2 levels in Figure 14.5 conﬁrms the robustness of the control mechanism of the toggle switch. At the beginning of the simulation, between 0 and 10 s, both r1 and r2 are expressed according to reactions (14.2) and (14.3). After a transient, as only inducer 1 exists, which releases Pro1 by reaction (14.6), only r1 is now produced. Around time 100 s, because of the toggle of the inducers, the synthesis of r1 stops, and r2 begins for the same reasons. If both inducers are absent, r1 and r2 are not generated after an initial transient, corresponding to a situation where the expression of both genes is repressed. Thus, it seems that deterministic attractors can be stable in the face of stochastic behavior. When both inducers are available, two stable states are

383

Principles of Computational Cell Biology

80 r1 No. of molecules

60

40 r2 20

0

20

40

60

80

100

120

140

160

180

200

Time (s)

Figure 14.5 In stochastic simulations of the system shown in Figure 14.4, initially only inducer 1 is present. The inducer is removed, at about time 100 s, and replaced by inducer 2 (toggle switch mechanism). Source: Adapted from Ribeiro et al. (2006). 80

60 No. of molecules

384

r1

40

r2

20

0

20

40

60

80

100

120

140

160

180

200

Time (s)

Figure 14.6 Having both inducers present, the two genes switch on and oﬀ, depending on the concentrations, making the system state switch frequently. Source: Adapted from Ribeiro et al. (2006).

possible: one where gene 1 is on and gene 2 is oﬀ, and the other where gene 2 is on and gene 1 is oﬀ. In this case, if the decay rates are too low, the system settles in one of the attractors and stays there. If there is intermediate decay, then the system will switch from one attractor to the other (Figure 14.6). The authors found that the decay equations play a very important role in the switching dynamics of the states when both inducers exist. The larger are the probability rate constants

Stochastic Processes in Biological Cells

for decay, the more likely it becomes for the genes to toggle. The reason for this is that, according to the Gillespie algorithm, if the probability rate constant of one reaction is increased, it becomes more likely that this reaction is selected. Thus, as one gene represses the other one, the corresponding repressor will start decaying faster with the higher probability rate constant, indicating more probability for the gene to be toggled by the other.

14.4 Stochastic Modeling of a Small Molecular Network As discussed before, bioinformatics approaches of cellular systems may either use systems of diﬀerential equations for signal transduction networks, graph description of protein–protein interaction networks, matrix equations for metabolic networks, or heuristic reverse engineering techniques for gene expression networks. All these were covered in previous chapters. Typically, these descriptions do not account for the molecular nature of individual genes and protein components. Here, we will encounter a particular system where it is of essential importance to account for the eﬀects of low particle numbers in a stochastic modeling approach utilizing variants of the Gillespie algorithm. 14.4.1

Model System: Bacterial Photosynthesis

The photosynthetic apparatus of purple bacteria is a very attractive model system for developing in silico approaches at the molecular level. In these evolutionary old species, the photosynthetic machinery consists of only four diﬀerent, relatively simple, and well-known transmembrane proteins, i.e. the reaction center (RC), the light-harvesting complexes (LHCs), the cytochrome bc1 complex and the F0 F1 ATPase, as well as two transporter molecules, the soluble electron carrier protein cytochrome c2 , and the electron/protein carrier ubiquinone (Q)/ubiquinol (QH2) that diﬀuses in the hydrophobic membrane. Except for some mechanistic details, the biological function of each macromolecule is known precisely (Figure 14.7). Moreover, the three-dimensional structures of all components could be determined in recent years at atomic resolution. In some species, such as in Rhodobacter sphaeroides, this system is spatially conﬁned to small vesicles of 30–60 nm diameter, which consequently contain a manageably small number of some 100 proteins in total, most of which are the simple LHCs. Figure 14.8 shows a cartoon of such a vesicle derived from stoichiometric and mechanistic considerations. Consequently, these vesicles are already large and complex enough to present a nontrivial metabolic subunit of a cell but are still small enough to be considered at the molecular level. Also, they are easily accessible in experiments and, as light is the central “metabolite” for them, they can be probed and monitored in dynamic experiments on time scales ranging from the picosecond scale for electronic transitions over the millisecond range for association and dissociation dynamics up to quasi-steady-state conditions. We will now introduce a kinetic model of photosynthetic chromatophore vesicles from the purple bacterium R. sphaeroides that was compiled from a

385

386

Principles of Computational Cell Biology

ATPase

ATP ADP + Pi 0

γ

H+

H

H

4

2

RC/LH1

7

+

+

bc1

1 3 LH2

2 c2

Q

6 4

e– 5

H+

H+

Figure 14.7 Artistic textbook-style rendering of the photosynthetic apparatus of purple bacteria. The inside of the chromatophore vesicle is below the membrane. The light-harvesting complexes labeled LH2 and LH1 collect the incident photons (process 0) and hand their energy onto the reaction centers (RCs) in the form of electronic excitations (process 1). The RC passes this energy onto a waiting quinone (Q) in the form of an electron–proton pair, where the proton (H+ ) is taken up from the cytoplasm (process 2). Later, this quinone, which has become ubiquinol (QH2) by the uptake of a second of these pairs, unbinds from the RC and diﬀuses inside the membrane (process 3) to deliver its freight to the cytochrome bc1 complex (bc1). The bc1 complex releases the protons to the inside of the vesicle and the stored energy is used to pump two further protons across the membrane (process 4). The electrons are then shuttled back to the RC by the water-soluble electron carrier protein cytochrome c2 (c2) (process 5), whereas the proton gradient is the driving force for the synthesis of ATP from ADP and inorganic phosphate in the F0 F1 –ATP synthase (ATPase) (processes 6 and 7). Source: Picture courtesy of Dr. Tihamér Geyer.

number of experimental references reported by various research groups (Geyer and Helms 2006a). As the experimental data were measured using a variety of diﬀerent spectroscopic and physicochemical techniques either for individual reactions or for the entire vesicle, it was not clear at all whether it could all be combined into one general picture. Fortunately, the data were found to be overall consistent, so that the entire system could be assembled into a model of a conversion chain working at steady-state conditions. Considering the stoichiometric relationships and the geometric dimensions of the individual proteins determined in electron microscopy and atomic force microscopy experiments also allowed suggesting a detailed spatial model of a chromatophore vesicle at a molecular resolution (Geyer and Helms 2006b) as shown in Figure 14.8. Panel (a) shows the vesicle from outside, and panel (b) shows a cut-through. 14.4.2

Pools-and-Proteins Model

A metabolic network, such as the one for bacterial photosynthesis, can be looked at from two diﬀerent sides: from the network side with a focus on the various metabolites and from the protein side where the proteins with their internal reactions form the building blocks. The protein-based view corresponds to the classical microbiological approach, where the functions and structures of individual proteins are ﬁgured out ﬁrst, whereas the network

Stochastic Processes in Biological Cells

4 nm

7

6

H+

3.3 nm 37

nm

+

H

4 5

12 n

m

(a)

(b)

2

H+

Figure 14.8 (a) A reconstructed chromatophore model vesicle of 45 nm diameter. The surface of this particle is 6300 nm2 leaving just enough room to position one ATPase molecule (shown in green at the top), 11 LH1/RC rings (blue/red) around the circumference, and the smaller LH2 rings (blue). The cytochrome bc1 complexes are shown in light blue. The Z-shaped LH1/RC dimers form a linear array around the “equator” of the vesicle, determining the vesicle’s diameter by their intrinsic curvature. (b) Shown is a cut section of the vesicle detailing some of the dimensions and illustrating the diﬀusion of the enclosed soluble electron carrier protein cytochrome c2 . Source: Geyer and Helms (2006b). Drawn with permission of Elsevier.

view reﬂects the experimentally accessible concentrations of the metabolites. The “pools-and-proteins” model shown in Figure 14.9 combines the details of individual proteins, each with its own discrete internal states and reactions, and the network view of uniform concentrations of a discrete number of indistinguishable metabolite molecules. In the actual computational implementation (see below), the proteins are treated as the “machines” where the conversions between the metabolites take place. For these to take place as in vivo, certain metabolites that otherwise diﬀuse freely in the cytosolic or membrane compartments ﬁrst have to bind to the proteins. Then, the proteins act on the bound metabolites and modify them into some product forms, which are then released. Correspondingly, in this model, each protein has a set of input connectors, where individual metabolite molecules are taken up from the corresponding pools and a set of outputs that ﬁnally release the product molecules into the respective pools. Thus, the network is built up from the respective numbers of the diﬀerent proteins present in the system and a pool for each of the metabolites. 14.4.3

Evaluating the Binding and Unbinding Kinetics

In the stochastic molecular simulation, each reaction is modeled as the binding and unbinding of discrete molecules according to the respective rate constants k on or k oﬀ . Importantly, these reactions often can only take place conditionally. A binding event, for example, can only occur if there is an empty binding site available. For an electron transfer to a bound substrate, on the other hand, this

387

388

Principles of Computational Cell Biology

ADP + Pi

H+ outside

γ

ATP

Out In QH2 LHC

E

bc1

RC

ATPase

Q

c2red c2ox

H+ inside

Figure 14.9 “Pools-and-proteins” view of bacterial photosynthesis. The network is formed from the proteins (rectangular boxes), which each exist in multiple copies, and one pool for each of the metabolites (rounded boxes), which are connected by the reactions (arrows). The path of the photon energy through the system into its chemical form of ATP is shown by thick arrows. The core complex of RC and LHC with their own private exciton pool is enclosed by the dashed line. The membrane, indicated by the gray area, and the placement of the proteins and pools with respect to the membrane are not relevant for the simulation. Source: Geyer et al. (2007). Drawn with permission of Elsevier.

electron has to be available to the binding site. The objective is therefore to ﬁrst check whether such a reaction can take place and then conditionally determine the probability for the reaction to occur during a time step, given the rate constant. For example, the binding reaction A + B → A∶ B, where a metabolite B binds to a protein A to form the complex A:B, can only take place at a given protein A if its binding site for B is empty. Then, the rate of binding events Ron at this protein A is proportional to the concentration of B, [B], and the association rate Ron = kon [B] When the time step Δt is small enough, i.e. Δt ≪ (k on × [B])−1 , the probability Pon for this reaction to take place during Δt is well approximated by Pon = Δt ⋅ kon [B]. At each time step, the occupation state of each binding site of every protein is determined. Pon is then evaluated only for empty binding sites and compared to a random number r in the interval [0, 1]. If r < Pon , one molecule B is placed in the respective binding site. Then, the binding reaction needs not to be considered anymore until B unbinds again, often after some charge transfer has occurred. To model the corresponding unbinding reaction: A∶ B → A + B,

Stochastic Processes in Biological Cells

with its rate k oﬀ , a more eﬃcient technique may be used than probing the reaction at every single time step. Note that the inverse of the rate constant k oﬀ gives the life time T oﬀ of the complex A:B. Correspondingly, a random unbinding duration t oﬀ is chosen from the (non-normalized) exponential probability distribution P(t oﬀ ) = exp[−t oﬀ /T oﬀ ] of the unbinding times. A timer is then initialized with t oﬀ that triggers the unbinding event. This timer approach can of course only be applied to unbinding reactions that are independent of the concentration of B. Otherwise, as for all association reactions, the reaction probability has to be determined at each time step in order to account for a possible change of [B] after the timer was initialized. 14.4.4

Pools of the Chromatophore Vesicle

A pool is characterized by the type of metabolite it contains and by its volume. In practice, a pool counts the number of particles and determines the respective concentration only when necessary. To be able to handle indeﬁnitely large pools such as the reservoir of protons outside of a vesicle or the incident light, a pool can be ﬁxed to a given constant particle number. Pools do not contain any geometric information except for their volumes. However, special geometries can be modeled by spatial discretization where each volume element (“voxel”) is represented by a small pool that is connected to its direct neighbors. According to the overview shown in Figure 14.9, a simulation of a chromatophore vesicle requires eight pools for biochemical metabolites (both cytochrome c2 and quinone in their oxidized and reduced forms, protons inside and outside of the vesicle, and one each for adenosine diphosphate, ADP, and for adenosine triphosphate, ATP), one pool for the incident light, and a number of exciton pools. The number of exciton pools depends on the number and connectivity of LHCs and RCs. An average-size vesicle of 45 nm diameter has an inner volume of 2.65 × 104 nm3 , determining the volumes of the two pools of the reduced and oxidized cytochrome c2 . The quinones diﬀuse in the eﬀectively two-dimensional plane between the two lipid layers of the vesicle membrane, which has an area of 5.28 × 103 nm2 . Correspondingly, this two-dimensional volume was used for the two pools of the membrane-bound quinones and quinols. Otherwise, the exact thickness of the quinone layer had to be known. As examples, Figures 14.10 and 14.11 describe the individual reactions modeled that are connected to the reaction center and to the cytochrome bc1 complex, respectively. 14.4.5

Steady-State Regimes of the Vesicle

Even such a very simple biological system can already show biologically interesting behaviors under steady-state conditions. For low light intensities, the supply of photons limits the throughput of the whole photosynthetic conversion chain, whereas for high intensities, the total throughput of the cytochrome bc1 complexes is the bottleneck. This is illustrated in Figure 14.12 that shows the steady-state rate of ATP production RATP versus the light intensity I from stochastic dynamic simulations of a vesicle consisting of 20 monomeric LHC/RC

389

390

Principles of Computational Cell Biology

H+out A3

Q

A2

RC bsQ

E A1

QH2

D2

P D4 bsc

c2red

A4

D4

c2ox

Figure 14.10 Reactions modeled in the RC. The respective rate constants are A1 (exciton-initiated electron transfer from P to Qb ) = 1010 s−1 ; A2 (quinone binding) = 1.25 × 105 nm2 s−1 ; D2 (quinol unbinding) = 40 s−1 ; A3 (proton uptake and transfer to Qb− ) = 1010 nm3 s−1 ; A4 (binding of reduced cytochrome c2 ) = 1.4 × 109 nm3 s−1 ; D4 (reduction of P and unbinding of oxidized c2 ) = 103 s−1 .

complexes, ﬁve cytochrome bc1 dimers, one adenosine triphosphate synthase (ATPase), and 20 cytochrome c2 molecules. For small light intensities, RATP increases linearly with I before saturating at a light intensity above around 4 W m−2 . You may wonder about these small light intensities that are far smaller than that at a bright sunny day (around 1000 W m−2 ). Interestingly, these bacteria live in a quite muddy environment at a light intensity of only about 18 W m−2 . Which of the system components responds to the transition between the two diﬀerent operation modes of the photosynthetic chain shown in Figure 14.7? This is apparent from Figure 14.13 that plots the redox state of the electron carrier cytochrome c2 inside the vesicle. For small light intensities, most of the 20 c2 particles in the vesicle are reduced, whereas for high intensities, they are essentially all oxidized. The “titration intensity” of a half-reduced c2 pool corresponds to the transition intensity between the linear increase of RATP and the saturation regime. For little light, the RCs are slower than the bc1 s and, consequently, the c2 are oxidized slower at the RCs than they can be reduced at the bc1 s. For high light intensities, the situation is reversed. Then, the RCs are faster and the oxidized c2 s are queuing up at the bc1 s, waiting to be reduced again. The transition between the two regimes takes place within a small intensity interval, as it results from the dynamic balance between the light-dependent turnover of the RCs and the light-independent rate of the bc1 s. Without the statistical ﬂuctuations of the turnovers of the RCs and the bc1 s, the transition would be a sharp step. As mentioned before, chromatophore vesicles are ideal systems to learn how to do computational molecular systems biology, because they are small, well

Stochastic Processes in Biological Cells

A3

QH2 A1

bc1-2 Q

D1

bsH

bc1-1 bsQo

R4

bsQi

Q

R2

R1

A2

H+out

A2 bsQi

FeS

D2

R3 bsc

QH2

bsc D4

A4

H+in

c2red

c2ox

Figure 14.11 Reactions modeled in the bc1 complex. The respective rate constants are A1 (QH2 binding at Q0 ) = 1.5 × 105 nm2 s−1 ; D1 (Q unbinding from Q0 , if Q was not transferred directly to Qi of the other dimer half in reaction R1 ) = 180 s−1 ; A2 (Q binding at Qi ) = 1.5 × 105 nm2 s−1 ; D2 (QH2 unbinding from Qi ) =180 s−1 ; A3 (binding of a proton from the outside) = 1010 nm3 s−1 ; A4 (binding of oxidized c2 with a mutual inhibition of the two binding sites of the dimer) = 1.4 × 107 nm3 s−1 ; D4 (unbinding of reduced c2 ) = 1000 s−1 ; R1 (swap of the Q unbinding from Q0 directly to Qi of the other dimer half, otherwise reaction D1 may occur) = 104 s−1 ; R2 (transfer of the electrons from Q0 to Qi and FeS plus release of the protons into the vesicle; rate is exponentially reduced with increasing transmembrane voltage) = 300 s−1 , Φ0 = 200 mV; R3 (movement of the FeS unit and electron transfer onto the bound c2 ) = 120 s−1 ; R4 (transfer of a bound proton to Qi ) = 1000 s−1 .

RATP (s–1)

100

75

50

25

0

1

2

3

4

5

6

7

I (W m–2)

Figure 14.12 Rate of ATP production per second as a function of light intensity. At high light intensity, the production saturates because the bc1 complexes are working at full speed and cannot pump more protons inside the vesicle.

391

Principles of Computational Cell Biology

20 N(c2red) (s–1)

392

15

10

5

0

1

2

3

4 I (W

5

6

7

m–2

)

Figure 14.13 Number of reduced cytochrome c2 particles in the stochastic simulation model at diﬀerent light intensities.

understood, and experimentally well characterized. In the scope of dynamical systemic simulations, they are a nearly unique system because they can be naturally probed and monitored by light as today’s technology allows to easily generate illumination signals of arbitrary shape and intensity.

14.5 Parameter Optimization with Genetic Algorithm Although biological cells often only contain relatively small numbers of particular proteins and metabolites, biological cells can utilize them in an amazing variety of ways through diﬀerent metabolic pathways or signaling cascades that often utilize the same proteins and metabolites. This frequent reusage represents a huge challenge to computer simulations of metabolic systems. Even though we may only be interested in the temporal change of the concentration of a single metabolite, we often need to take into account large pieces of the full metabolic environment. The same problem naturally applies to experimental determinations of rate constants where some rates can be measured more easily, whereas others are masked by neighboring reactions. Instead of trying to understand the dynamic behavior of a metabolic system in a bottom-up approach from the kinetics of its individual reactions, one may also follow a systemic top-down approach and probe and analyze the whole network as one entity. This section is based on the same simulation model of a chromatophore vesicle as introduced in Section 14.4. Here, the aim was to implement a data driven top-down approach to derive a set of kinetic parameters such that various sets of time-dependent experimental data are reproduced (Geyer et al. 2010). Precisely, based on the molecular stochastic simulations introduced before, automated parameter determinations were performed where an evolutionary algorithm (Figure 14.14) is used to generate, select, and thus optimize parameter sets of

Stochastic Processes in Biological Cells

Memory Template files Store

Generation

Simulation + Scoring

Copy Filter Modify Newly created

Best scored

Recombine

Genetic algorithm

Figure 14.14 Optimization of kinetic system parameters by an evolutionary algorithm. A set of stochastic simulations (top, right) is run for a ﬁrst “generation” of parameter sets. The results are scored and compared against suitable experiments (see text). The best parameter sets are kept unchanged and new combinations are generated from the promising parameter sets. This results in a new “generation” of parameter sets as input for a new set of simulations. This iterative cycle is repeated a given number of steps or until convergence is reached.

25 of the 44 rate constants such that multiple experiments are simultaneously reproduced as well as possible. The correlation between the sets of experiments and simulations then allowed judging both the consistency of the experimental data and the validity of the reconstructed in silico model. It turned out that the kinetic parameters for a dynamic model of a metabolic system can be reliably determined by simultaneously ﬁtting the results from stochastic dynamics simulations against a number of time-dependent experiments. After determining an optimal set of kinetic and biophysical parameters for the chosen experiments with an evolutionary algorithm, the molecular stochastic simulation model reproduced the observed dynamics over a wide kinetic range from millisecond long single-ﬂash experiments up to quasi steady-state conditions. For example, Figure 14.15 shows the temporal relaxation of four properties after fast laser-light ﬂashes. Remarkably, only about one-third of the stoichiometries, rate constants, and parameters of the model, which are all related to the kinetic behavior of the chromatophores, had so far been determined experimentally. For the others, only estimates were available. It was therefore not clear a priori whether a parameter determination would actually succeed in simultaneously narrowing down such a large number of parameters so that a reasonable agreement between the experiments and the simulation could be achieved. The success indicates that the methodology can also be applied to other more complex or less well-understood cellular subsystems. The next logical step at a higher level of complexity would be

393

Principles of Computational Cell Biology 250 2

200 ΔΦ (mV)

ΔΦ (a.u.)

A7_ΔΦ A7_ΔΦ 1

150 100 50 0

0 0

50

100 t (ms)

150

200

0

20

40

60

t (ms)

(b)

0 Average number of QH/RC

Cytochrome c oxidation (a.u.)

(a)

A7_cytc

–0.1

1

0.5 B1_Q 0

–0.2 0

5

(c)

10 t (ms)

15

20

0

Cytochrome c oxidation

–20 –30 A9_cytc –40 0

2

4

6

8

3

4

5

t (s)

A8_ΔΦ 1

0 2

4

6 t (s)

–2 –3 B6_cytc –4 0

100

8

10

(h)

200

300

t (ms)

(f)

2

0

–1

10

t (s)

(e)

Average number of oxidized P/RC

Cytochrome c oxidation

2

0

–10

(g)

1

(d)

0

ΔΦ (a.u.)

394

0.4 0.3

B6_P

0.2 0.1 0 0

100

200

300

t (ms)

Figure 14.15 Comparison between diﬀerent experimental data for the time-dependent oxidation state of cytochrome c2 (c, e and f ), membrane potential (a, b and g), number of reduced quinols per reaction center (d) and number of oxidized special pairs (h) and the simulations with the three best scored parameter sets (red, green, and blue points). Shown in black lines are ﬁt functions of the experimental data used for scoring the respective observable. The variation between the three time traces reﬂects the variability of the results with the optimized parameters when the same scenario is simulated repeatedly.

Stochastic Processes in Biological Cells

the bioenergetic processes of an entire bacterium or mitochondrium. Also, signaling processes are good candidates for such a bottom-up molecular stochastic description together with the systemic evolutionary parametrization.

14.6 Protein–Protein Association At the end of this chapter, we now turn to another phenomenon, biomolecular association, that occurs on a molecule scale and also has a stochastic element. As discovered by the Scottish botanist Robert Brown, biomolecules move in the cell by undirected diﬀusion, which is also termed Brownian motion, Figure 14.16. Protein diﬀusion in cellular media can often simply be modeled as random diﬀusion using an eﬀective diﬀusion constant (see Section 13.5 for a discussion of the diﬀusion equation). For example, experiments on GFP showed that its diﬀusion in cells is only reduced by a factor 5 compared to diﬀusion in pure water. The Einstein relation. D = 𝜇kB T relates the diﬀusion constant D of a particle to its mobility 𝜇 times the Boltzmann constant k B and the temperature T. In the limit of low Reynolds number, the mobility 𝜇 is the inverse of the drag coeﬃcient 𝛾. For spherical particles of radius r, Stokes’ law gives 𝛾 = 6π𝜂r

Figure 14.16 Schematic illustration of particles undergoing undirected Brownian motion in a rectangular simulation box. The arrows illustrate the particle displacements during the next time interval.

395

396

Principles of Computational Cell Biology

with the viscosity 𝜂 of the medium. The Einstein relation then becomes. D=

kB T , 6π𝜂r

which is also known as the Stokes–Einstein relationship. Using this equation, we can estimate the diﬀusion coeﬃcient of a globular protein. Its mean squared displacement (msd) in a time interval of length Δt is: msd = 6DΔt in three dimensions. A typical diﬀusion constant for a small protein such as barnase is 0.15 μm2 ms−1 (corresponding to 1.5 × 10−6 cm2 s−1 in the typical units). Typical cellular dimensions are 1 μm for the cell diameter, or 100 nm for the diameter of a vesicle. This means that such a protein traverses the cell roughly in one millisecond. In Sections 2.7–2.9, we have already touched on the methods of protein– protein docking. In Section 3.2.6, we introduced the forces steering biomolecular association reactions. Here, we will now address the question how the two proteins approach each other. The association of two proteins is often studied using the Brownian dynamics method. This is the name of an algorithm to generate diﬀusional trajectories of particle motion.

14.7 Brownian Dynamics Simulations The Ermak–McCammon algorithm (Ermak and McCammon 1978) is an iterative algorithm to compute trajectories for the diﬀusional motion of a particle using the translational/rotational diﬀusion coeﬃcients D and DR . The algorithm computes the forces F and torques T acting between diﬀusing particles and with the rest of the system that may be kept ﬁxed. From these forces and torques, it computes translational and rotational displacements Δr and Δw from the particles’ current positions during the next time step Δr = (kT)−1 D F Δt + R

with ⟨R⟩ = 0 and ⟨R2 ⟩ = 6 D Δt

and Δw = (kT)−1 DR T Δt + W

with ⟨W⟩ = 0 and ⟨W2 ⟩ = 6 DR Δt.

Importantly, the algorithm accounts for the solvent inﬂuence on the particle dynamics by the stochastic terms R and W. These noise terms will generate the characteristic ﬂickering motion of a Brownian particle. When talking about protein–protein association, we ﬁrst need to choose a good coordinate system or reaction coordinate. Possible choices are the center–center distance d1–2 or the average distance between contact pairs cdavg (Figure 14.17). Figure 14.18 shows a sketch of the interaction free energy versus the separation between proteins. We distinguish two cases. Shown in the left panel of Figure 14.18 is the case of two hydrophilic proteins that attract each other by electrostatic interactions. At large distances (region 1 in the scheme), e.g. beyond a few times the Debye length, the electrostatic interactions of both proteins are

Stochastic Processes in Biological Cells

cdavg

d1–2

(a)

(b)

Figure 14.17 Deﬁnition of diﬀerent criteria to describe the relative orientation of the binding interfaces of two proteins. (a) d1–2 is the distance between the centers of both proteins. This criterion is particularly suitable at large particle distances. Here, the orientation of the particles is not accounted for. (b) cdavg is the average distance between interaction pairs on the two interfaces (atoms or residues). This criterion is most useful at short distances.

1 2

Free energy

Free energy

2

3

(a)

1

3

Reaction coordinate

(b)

Reaction coordinate

Figure 14.18 Schematic representation of the free energy for protein–protein interaction along an idealized one-dimensional reaction coordinate such as the distance of their centers of masses. (a) Example of a protein–protein pair attracted by long-range steering in region 2. (b) Example of a protein–protein pair that undergoes random diﬀusion until the two proteins eventually collide and bind to each other via hydrophobic contacts.

screened from each other by the ionic cloud of the solution (Eq. (3.3)). Both proteins will undergo free diﬀusion. At closer distance (region 2), the hydrophilic proteins will start to experience the electrostatic ﬁeld of the other protein. First, they will feel the eﬀect of its net charge (monopole ﬁeld), and at closer distances, this ﬁeld will be modulated by its dipolar character and other ﬁne details. Detailed molecular dynamics simulations in explicit solvent demonstrated that the free energy surface for association resembles a monotonous downhill funnel into the bound complex (region 3) (Ulucan et al. 2014). Thus, desolvation of the interfaces occurs spontaneously and does not incur an energetic penalty. However, not all components of a protein complex attract each other electrostatically. For example, antibodies do not need to bind their binding partners fast, only tight. Shown in the right panel of Figure 14.18 is the case of hydrophobic binding partners that are not electrostatically attracted. Here, the binding

397

398

Principles of Computational Cell Biology

into the bound complex (region 3) may involve a transition over a high-energy intermediate state (region 2). As an example, we will discuss Brownian dynamics simulations to study the association pathway for the protein–protein pair of barnase and barstar (Spaar et al. 2006). As mentioned in Chapter 3, the surfaces of these proteins show strong electrostatic complementarity (Figure 3.11), leading to one of the fastest known association rates because of strong electrostatic steering. Also, the interaction is one of the tightest known protein–protein interactions characterized by a very high k D . In these simulations, one protein is kept ﬁxed at the origin, and the other is started at a deﬁned distance at a random angle and randomly oriented. The association trajectories from 200 000 such simulations were mapped onto a cubic lattice and the occupancy of the diﬀusing particle in each voxel was computed. The idea behind this is that very favorable locations should be highly populated and unfavorable regions for the approach should have low populations. Figure 14.19 shows the observed occupancy plots. Obviously, the orientation (b) is less constrained than position (a). Occupancy maps can be interpreted as probability distributions for the computation of an entropy landscape. The entropy of a system with N states is S = −kB

N ∑

Pn ln Pn ,

n=1

where Pn is the probability for each state. Using this relationship, local entropy diﬀerences were computed for the distribution within spheres around grid points, separately for the translational and for the orientational maps. The free energy diﬀerence for protein–protein encounter was then expressed as ΔG(cdavg ) = ΔECoul (cdavg ) + Edesolv (cdavg ) − TΔStransl (cdavg ) − TΔSrot (cdavg ) involving the Coulombic interaction energy ΔECoul computed by solving the Poisson–Boltzmann equation, a desolvation term ΔEdesolv that accounts for the unfavorable change in solvation energy of the two proteins and the contributions of the translational and rotational entropy loss. The results for the association of barnase and barstar are shown in Figure 14.20. Interestingly, the position of the encounter state for this protein–protein system is found as a well-localized minimum where the two biological interfaces are about 1 nm apart from each other. However, such Brownian dynamics simulations are not well suited to model the details how the two proteins bind into the bound complex. As mentioned before, molecular dynamics simulations that included explicit solvent molecules (Ulucan et al. 2014) gave a modiﬁed free energy schema for the Barnase:Barstar association that resembles (Figure 14.18a).

14.8 Summary When stochastic processes are involved, waiting times between events and turnover rates during a time interval are subject to Poissonian or related

Stochastic Processes in Biological Cells

cdavg = 13 Å

cdavg = 9 Å

0 ϕx (deg)

90

0

–90 –90

cdavg = 21 Å

90

–90 –90

90

ϕn,y (deg) 0 ϕn,x (deg)

90

ϕy (deg) 0 ϕn,x (deg)

90

0

–90 –90

0 ϕn,x (deg)

90

cdavg = 29 Å 90

0

–90 –90

90

cdavg = 17 Å

90

0

0 ϕx (deg)

90 ϕn,y (deg)

ϕn,y (deg)

–90 –90

cdavg = 25 Å

cdavg = 21 Å

ϕn,y (deg)

90

0

–90 –90

90

(b)

0 ϕx (deg)

0

cdavg = 13 Å

ϕn,y (deg) 0 ϕn,x (deg)

90

cdavg = 29 Å

90

0

0 ϕx (deg)

90

0

cdavg = 9 Å

–90 –90

–90 –90

ϕn,y (deg)

ϕy (deg)

ϕy (deg) 0 ϕx (deg)

90

–90 –90

90

90

0

(a)

0 ϕx (deg)

0

cdavg = 25 Å

90

–90 –90

90 ϕy (deg)

ϕy (deg)

ϕy (deg)

0

–90 –90

cdavg = 17 Å

90

90

0 ϕn,x (deg)

90

0

–90 –90

0 ϕn,x (deg)

90

Figure 14.19 Occupancy maps for barstar at various distances cdavg from barnase. (a) positions of barstar and (b) relative orientations of barstar. Preferred orientations for distances (cdavg ) between barstar and barnase ranging from 2.9 nm down to 0.9 nm. To understand these plots, you should imagine yourself standing on the binding interface of protein 1 and looking perpendicularly in the sky. The maps show the projected positions where protein 2 is found in the sky. At larger distances, there is almost no orientational preference. Once the two proteins approach each other more closely, there is a clearly preferred maximum of occupancy perpendicular to the binding patch. This means that the two proteins approach each other in a very carefully carried out maneuver that may be compared to docking a space ship to the international space station.

399

Principles of Computational Cell Biology 3

–2

–TΔS (kcal mol−1)

ΔEel (kcal mol−1)

0

–4 –6

2.5 2 1.5 1 0.5

–8 20

30

40

50

60

0 20

70

d1–2 (Å)

(a)

30

40

50

60

70

60

70

d1–2 (Å)

(c)

5

2

4 ΔG (kcal mol−1)

ΔEds (kcal mol−1)

400

3 2 1

1 0 –1 –2 –3 –4

0 20

(b)

30

40

50

d1–2 (Å)

60

20

70

(d)

30

40

50

d1–2 (Å)

Figure 14.20 (a) Electrostatic interaction energy between barnase and barstar at various center-to-center distances d1–2 . (b) Desolvation energy of the two proteins. (c) Entropy loss at close distances due to barstar adopting a preferred orientation. (d) Free energy of interaction between barnase and barstar obtained as the sum of the other three panels.

statistics. Especially when small particle numbers are involved, substantial ﬂuctuations may arise. Transcriptional regulation is one area where stochasticity has important eﬀects. On the one hand, it may be irritating at ﬁrst that several stochastic simulation runs give nonidentical results. On the other hand, this helps us understanding why replicates of biological experiments show variability. Nowadays, robust techniques exist to properly model and analyze stochastic systems. Such systems bear highly interesting phenomena and are relevant in cell biology, especially when it comes to phenomena involving single cells.

14.9 Problems 14.9.1

Dynamic Simulations of Networks

A static analysis of a (metabolic) network can reveal its steady-state properties like the most important ﬂux modes or identify seemingly redundant reactions. However, a network can exhibit a diﬀerent or unexpected behavior, when subjected to time-dependent concentration changes of the metabolites. This is where dynamic network simulations come into play. For these dynamic simulations, two major approaches exist: for large densities of the relevant molecules, the network can be treated by a set of diﬀerential equations that describe the continuous time evolution of the densities, whereas for small

Stochastic Processes in Biological Cells

densities, where the dynamics are governed by the binding and unbinding events of individual molecules, stochastic approaches are more appropriate. Problem 1 ﬁrst introduces you to the basic simulation techniques with a simple four-species network, before the second part exempliﬁes how the stochastic nature of individual binding and unbinding events can make signaling cascades sensitive to a few molecules. 1. A simple reaction network For this part, consider the network displayed in Figure 14.21: two molecules of A associate to create one molecule of B, which is converted into substance D, when it encounters one molecule of C. (a) Setting up the diﬀerential equations A convenient recipe to compile the (sometimes complicated) set of diﬀerential equations that describe a system is to start from the stoichiometric matrix (Section 12.2). To do so, ﬁrst set up the stoichiometric matrix S for the above example network. dR1 = k1 A2 dt dR dA = −2 1 dt dt (i) Then, walk through the columns of S to derive the rates dR1 /dt and dR2 /dt for the reactions R1 and R2 , respectively, from the entries that have a negative sign (these are the educts for the corresponding reaction) (Figure 14.22). (ii) Via the columns, you can then ﬁgure out, which reactions contribute to the time evolution of a given molecule. This recipe is explained for Figure 14.21 A simple reaction network (see Problem 1).

R1

2A

B R2

2D

C

Figure 14.22 Constructing a stoichiometric matrix for reaction network of Figure 14.21.

R1 A

–2

B

1

R2

(ii)

C D

(i)

401

402

Principles of Computational Cell Biology

R1 and dA/dt in Figure 14.22 and the equations above. Note that the stoichiometric matrix is not complete. From the complete stoichiometric matrix, give dR1 /dt and dR2 /dt explicitly and list the rates for the changes of A, B, C, and D in terms of the rates for R1 and R2 . Note: The amounts of metabolites A, B, C, and D are given as densities with units of particles per volume (for example, 1 nm−3 , mol m−3 , etc.). Consequently, the actual size of the system (test tube) is neglected in this description. (b) Implementation with rate equations The simplest way of solving a diﬀerential equation is based on the Taylor expansion truncated after the linear term: A(t + Δt) ≈ A(t) + Δt × dA(t)∕dt = A(t) + ΔA(t). This simple approximation requires small time steps to be fairly accurate. Note that the increments ΔA, ΔB, … are calculated at the beginning of the time step of size Δt. To use this so-called “Forward Euler” integrator to simulate the time evolution of the above reaction system, the densities A, B, C, and D are initialized to their values at t = 0 and then with these values the increments ΔA, ΔB, … are determined and added to A, B, … Then, the time is advanced to t + dt. These steps are repeated until the ﬁnal time is reached. Note: All increments are evaluated at the beginning of the time step; therefore, take care to ﬁrst determine all values of ΔA, ΔB, … and only then to add them to their respective variable. With the diﬀerences per time step ΔA, ΔB, ΔC, and ΔD implement a differential equation model of the above network. Use a time step Δt of 0.1 s and a ﬁnal time of 250 s. At t = 0, start from A = 20 μm−3 , C = 10 μm−3 , and B = D = 0. Set the reaction constants to k 1 = 10−3 μm3 s−1 and k 2 = 3 × 10−3 μm3 s−1 . Plot the time traces of A(t), B(t), C(t), and D(t) into a single plot, describe them, and explain from their behavior the dynamics of the network. For comparison, the traces of A (dashed line) and D (solid line) should look as sketched in Figure 14.23. Then, run the simulation until t = 100 s and give the ﬁnal values of A, B, C, and D. (c) Stochastic implementation Section 14.4 introduced the stochastic Gillespie method to solve a set of reactions. Here, you will use a simpler and more direct method, where for each of the considered reactions, the reaction probability P for a time step Δt is compared to a random number. If this random number is smaller than P, the event takes place during the actual time step and forms the corresponding product molecules. Hint: How to calculate the reaction probabilities is explained here for the binary reaction A + B → AB. For example, the density of molecules of type A, [A], is determined by the number N A of A molecules and the corresponding volume V as [A] = N A /V . With this, the rate for the change of

Stochastic Processes in Biological Cells

Particle concentrations

20

15

10

5

0

50

100

150

200

250

Time

Figure 14.23 Concentrations of metabolites A and D at various time points.

the density of A becomes (the square brackets indicating the density are omitted in the following) d[A] dA = = −kAB dt dt N N 1 dNA = −k A B V dt V V dNA Δt = −kN A NB ΔA = Δt dt V With the change of the number of molecules of type A during Δt of ΔA = –PN A N B , we can identify the probability for this bimolecular reaction between any of the A’s and any of the B’s to take place during Δt as P = k Δt/V . Correspondingly, for the above network, the parts of the inner loop that serve to integrate the number of molecules A looks as follows: while (t1): # why this test??? for i1 in range(NA): for i2 in range(NA): if random.random()