Computational Epigenomics and Epitranscriptomics 1071629611, 9781071629611

This volume details state-of-the-art computational methods designed to manage, analyze, and generally leverage epigenomi


English Pages 266 [267] Year 2023


Table of contents :
Preface
References
Contents
Contributors
Chapter 1: DNA Methylation Data Analysis Using Msuite
1 Introduction
2 Materials
3 Methods
3.1 Running Environment and Dependencies
3.2 Building Indices
3.3 Run Msuite
3.4 The Msuite Output
4 Notes
References
Chapter 2: Interactive DNA Methylation Array Analysis with ShinyÉPICo
1 Introduction
2 Materials
3 Methods
3.1 Data Upload
3.2 Quality Control
3.3 Normalization
3.4 Differentially Methylated Position Calculation
3.5 Differentially Methylated Region Calculation
3.6 Results Export
4 Notes
References
Chapter 3: Predicting Chromatin Interactions from DNA Sequence Using DeepC
1 Introduction
2 Materials
2.1 Data
2.1.1 Hi-C Data
2.1.2 Pre-trained Convolutional Filter Weights for Transfer Learning
2.1.3 Trained Models
2.1.4 Additional Data
2.2 Software
2.3 Hardware
3 Methods
3.1 Data Pre-processing
3.1.1 Pre-processing Using the Wrapper Script
3.1.2 Pre-processing Manually
3.1.3 Collate Data for Training
3.2 Training a Model
3.3 Predicting Chromatin Interactions
3.3.1 Run Predictions
3.3.2 Visualize Predictions Using the Wrapper Script
3.3.3 Visualize Predictions Manually
4 Notes
References
Chapter 4: Integrating Single-Cell Methylome and Transcriptome Data with MAPLE
1 Introduction
2 Materials
3 Methods
3.1 Overview of MAPLE
3.2 Methylome Matrix Construction
3.3 Downstream Analysis
3.4 Integration
4 Notes
References
Chapter 5: Quantitative Comparison of Multiple Chromatin Immunoprecipitation-Sequencing (ChIP-seq) Experiments with spikChIP
1 Introduction
2 Materials
2.1 Cell Culture
2.2 Equipment
2.3 Disposables
2.4 Software Requirements
3 Methods
3.1 Preparation of Spike-in Chromatin for ChIP-seq Experiments
3.1.1 Preparation of Drosophila SL2 Cells
3.1.2 Preparation of Drosophila SL2 Chromatin
3.1.3 Quality Control and DNA Quantification from Fragmented Chromatin
3.1.4 Incorporation of Drosophila SL2 Chromatin with Experimental Chromatin Samples
3.2 Computational Analysis of ChIP-seq Data Using Spike-in Chromatin
3.2.1 Generation of the Genome Index for ChIP-seq Mapping
3.2.2 Genome Mapping of each Individual ChIP-seq Sample
3.2.3 Extraction of Aligned Reads into Distinct Genome and Spike-in Files
3.2.4 Identification of ChIP-Enriched Regions on Each Experiment
3.2.5 Preparation of the Configuration File of spikChIP
3.2.6 Preparation of the Genome Definition File of spikChIP
3.2.7 Running spikChIP to Normalize a Collection of ChIP-seq Samples
3.2.8 Interpretation and Usage of spikChIP Output Files
3.2.9 Generation of Genome-Wide Profiles for Graphical Browsers
4 Notes
References
Chapter 6: A Guide to MethylationToActivity: A Deep Learning Framework That Reveals Promoter Activity Landscapes from DNA Methylomes i...
1 Introduction
2 Materials
2.1 Summary
2.1.1 Github Clone
2.1.2 Docker
2.1.3 St. Jude Cloud
3 Methods
3.1 Step 1: Response Variable (Only for Transfer Learning)
3.2 Input
3.3 Example Command
3.3.1 Optional Arguments
3.3.2 Optional Example Command
3.4 Output
3.5 Step 2: Feature Extraction
3.6 Input
3.7 Example Command
3.7.1 Optional Arguments
3.7.2 Optional Example Command
3.8 Output
3.9 Step 3: Format
3.10 Input
3.11 Example Command
3.11.1 Optional Arguments
3.11.2 Optional Example Command
3.12 Output
3.13 Step 4: Run Model
3.14 Input
3.15 Example Command
3.15.1 Optional Arguments
3.15.2 Optional Example Command
3.16 Output
3.17 Example Output
3.18 Step 5: Transfer Learning (Optional)
3.19 Input
3.20 Example Command
3.20.1 Optional Arguments
3.20.2 Optional Example Command
3.21 Output
References
Chapter 7: DNA Modification Patterns Filtering and Analysis Using DNAModAnnot
1 Introduction
2 Materials
2.1 Data Sources
2.2 Software and Installation
3 Methods
3.1 Loading Mandatory Files
3.1.1 Import Genome Sequence Information
3.1.2 Import Modification Input Files
3.2 Sequencing Quality Assessment and Filtering
3.3 Analysis of Global Distribution and Motif of DNA Modification Data
3.4 False Discovery Rate Estimations and Filtering (PacBio Only)
3.5 Analysis of DNA Modification Patterns with Genomic Annotations and Other Sequencing Data
3.5.1 Computing Counts by Genomic Feature
3.5.2 Quantitative Parameter by Feature and by Mod Count Categories
3.5.3 Computing Count Within Genomic Features
3.5.4 Computing Distance from Genomic Features
3.5.5 Local Visualization with Gviz
4 Notes
References
Chapter 8: Methylome Imputation by Methylation Patterns
1 Introduction
2 Materials
3 Methods
3.1 Implementation
3.2 Parameters
4 Notes
References
Chapter 9: Sequoia: A Framework for Visual Analysis of RNA Modifications from Direct RNA Sequencing Data
1 Introduction
2 Materials
3 Methods
3.1 Backend Computation
3.2 Visualization Interface
3.2.1 Data Selection
3.2.2 5-mer List
3.2.3 t-SNE Plot
3.2.4 Signal Plot
4 Notes
4.1 Input Files
4.2 Execution (Signal Extraction)
5 Summary and Discussion
References
Chapter 10: Predicting Pseudouridine Sites with Porpoise
1 Introduction
2 Materials
2.1 Software
2.2 Python Environment and Required Packages
2.3 Data Sources
3 Methods
3.1 Local Stand-Alone Version of Porpoise
3.1.1 Sequence Windows
3.1.2 Step-by-Step Usage Guide
3.1.3 Output Format
3.2 Webserver
3.2.1 Online Webserver Layout
3.2.2 Running Predictions Through the Online Webserver
3.3 Auto-pipeline for Model Training
3.3.1 Step-by-Step Details
3.3.2 Outputs
4 Notes
References
Chapter 11: Pseudouridine Identification and Functional Annotation with PIANO
1 Introduction
2 Materials
2.1 Software
2.2 Data Sources
3 Methods
3.1 A High-Accuracy Predictor of Human Ψ Sites Using a Machine Learning Approach
3.1.1 Dataset Preparation for the Machine Learning Approach
3.1.2 Feature Encoding Methods
3.1.3 Model Training and Evaluation
3.1.4 Functional Annotation of Putative Ψ Sites and Probability Estimation
3.2 Using PIANO Website to Obtain the Desired Ψ site
3.2.1 Input File of PIANO
3.2.2 Encoding Types Used to Start the Prediction Job
3.2.3 Result Explanation for Genomic Feature-Based Prediction
3.2.4 Result Explanation for Sequence-Based Prediction
3.2.5 Ψ Site Collection in PIANO Database
4 Notes
References
Chapter 12: Analyzing mRNA Epigenetic Sequencing Data with TRESS
1 Introduction
2 Materials
2.1 Data
2.2 Software
3 Methods
3.1 Download and Preprocess Data
3.2 Prepare R Environment
3.3 Conduct Peak Calling Analysis
3.4 Visualization of Individual Peaks
3.5 Conduct Differential Peak Calling Analysis
4 Notes
References
Chapter 13: Nanopore Direct RNA Sequencing Data Processing and Analysis Using MasterOfPores
1 Introduction
2 Materials
2.1 Nanopore Sequencing Test Datasets: Total RNA from Wild Type and snoRNA-depleted S. cerevisiae Strains
2.2 Required Infrastructure to Run MoP2 on the Yeast Dataset
2.3 Software Installation
3 Methods
3.1 Tuning the Pipeline Parameters
3.2 Running MoP2 Pipelines
3.3 Mop_preprocess: Pre-processing of FAST5 or FASTQ Files
3.3.1 Mop_preprocess Steps
3.3.2 Mop_preprocess Configuration and Running
3.3.3 Mop_preprocess Output
3.3.4 Mop_preprocess Runtime
3.4 Mop_tail: Estimation of poly(A) Tail Lengths
3.5 Mop_mod: Prediction of RNA Modifications Using Four Different Approaches
3.6 Mop_consensus: Identification of Robust Changes in RNA Modifications Across Two Conditions
3.7 MoP2 Execution Monitoring and Reporting
4 Notes
References
Chapter 14: Data Analysis Pipeline for Detection and Quantification of Pseudouridine (ψ) in RNA by HydraPsiSeq
1 Introduction
2 Materials
2.1 Library Sequencing
2.2 Analysis of Raw Reads' Quality by FastQC
2.3 Trimming
2.4 Alignment
2.5 Data Processing
2.6 Data Treatment
2.7 Data Analysis
3 Methods
3.1 Library Sequencing
3.2 Analysis of Raw Reads' Quality by FastQC
3.3 Trimming
3.4 Alignment
3.5 Data Processing
3.6 Data Treatment
3.7 Data Analysis
4 Notes
References
Chapter 15: Analysis of RNA Sequences and Modifications Using NASE
1 Introduction
2 Materials
3 Methods
3.1 Building a Workflow
3.2 The Minimal NASE Workflow
3.3 Adding Decoys
3.4 Adding Label-Free Quantitation
4 Notes
References
Chapter 16: Mapping of RNA Modifications by Direct Nanopore Sequencing and JACUSA2
1 Introduction
2 Materials
2.1 ONT Direct RNA Sequencing
2.2 Preparation of an In Vitro Transcriptome Sample
2.3 Hardware Requirements
2.4 Software Dependencies and Installation
3 Methods
3.1 Nanopore Direct RNA Sequencing
3.2 Preparation of an In Vitro Transcriptome Sample
3.3 Nanopore Read Processing
3.4 Use Case 1: Comparison of Wild-Type and Knockout Samples
3.5 Use Case 2: Comparison of Wild-Type and IVT Samples
4 Notes
References
Index


Methods in Molecular Biology 2624

Pedro H. Oliveira Editor

Computational Epigenomics and Epitranscriptomics

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Computational Epigenomics and Epitranscriptomics Edited by

Pedro H. Oliveira Genoscope, Évry, France

Editor Pedro H. Oliveira Genoscope, Évry, France

ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-2961-1 ISBN 978-1-0716-2962-8 (eBook) https://doi.org/10.1007/978-1-0716-2962-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

Nucleic acids (DNA and RNA) are key repositories of genetic information, and their primary sequence of four canonical nucleobases (A, C, G, T/U) in genomes and transcriptomes defines the genetic blueprints and cellular identities across all branches of life. Moreover, it is recognized that diversity within an organism is often governed by dynamic chemical modifications of nucleobases, which can operate as a regulatory layer to fine-tune key molecular and cellular processes. Changes in epigenomic and epitranscriptomic landscapes can affect a variety of such processes (e.g., transcription, translation, differentiation, and maintenance of genome integrity) and are often linked to the onset and progression of disease. Our understanding of the biochemistry and biological significance of the more than 45 DNA and 170 RNA chemical modifications reported to date [1, 2] has been largely propelled by high-throughput sequencing technologies and mass-spectrometry-based approaches, coupled with chemical, enzymatic, or antibody-dependent methodologies. In parallel, we have witnessed the development of increasingly robust computational methods and statistical tools tailored to make sense of a growing volume of often heterogeneous and noisy epi-ome data. In this book, the reader is introduced to state-of-the-art computational methods designed to manage, analyze, and generally leverage epigenomic and epitranscriptomic data. Topics include fine-mapping and quantification of modifications, visual analytics, imputation methods, supervised analysis, and integrative approaches for single-cell data. Ultimately this compendium will be of interest to a broad audience including students, biologists, bioinformaticians, and biomedical researchers.

Évry, France

Pedro H. Oliveira

References
1. Sood et al (2019) DNAmod: the DNA modification database. J Cheminform 11(30):1–10
2. Boccaletto et al (2018) MODOMICS: a database of RNA modification pathways. 2017 update. Nucleic Acids Res 46(Database issue):D303–D307


Contents

Preface
Contributors

1 DNA Methylation Data Analysis Using Msuite
Xiaojian Liu, Pengxiang Yuan, and Kun Sun

2 Interactive DNA Methylation Array Analysis with ShinyÉPICo
Octavio Morante-Palacios

3 Predicting Chromatin Interactions from DNA Sequence Using DeepC
Ron Schwessinger

4 Integrating Single-Cell Methylome and Transcriptome Data with MAPLE
Yasin Uzun, Hao Wu, and Kai Tan

5 Quantitative Comparison of Multiple Chromatin Immunoprecipitation-Sequencing (ChIP-seq) Experiments with spikChIP
Enrique Blanco, Cecilia Ballaré, Luciano Di Croce, and Sergi Aranda

6 A Guide to MethylationToActivity: A Deep Learning Framework That Reveals Promoter Activity Landscapes from DNA Methylomes in Individual Tumors
Karissa Dieseldorff Jones, Daniel Putnam, Justin Williams, and Xiang Chen

7 DNA Modification Patterns Filtering and Analysis Using DNAModAnnot
Alexis Hardy, Sandra Duharcourt, and Matthieu Defrance

8 Methylome Imputation by Methylation Patterns
Ya-Ting Sabrina Chang, Ming-Ren Yen, and Pao-Yang Chen

9 Sequoia: A Framework for Visual Analysis of RNA Modifications from Direct RNA Sequencing Data
Ratanond Koonchanok, Swapna Vidhur Daulatabad, Khairi Reda, and Sarath Chandra Janga

10 Predicting Pseudouridine Sites with Porpoise
Xudong Guo, Fuyi Li, and Jiangning Song

11 Pseudouridine Identification and Functional Annotation with PIANO
Jiahui Yao, Cuiyueyue Hao, Kunqi Chen, Jia Meng, and Bowen Song

12 Analyzing mRNA Epigenetic Sequencing Data with TRESS
Zhenxing Guo, Andrew M. Shafik, Peng Jin, Zhijin Wu, and Hao Wu

13 Nanopore Direct RNA Sequencing Data Processing and Analysis Using MasterOfPores
Luca Cozzuto, Anna Delgado-Tejedor, Toni Hermoso Pulido, Eva Maria Novoa, and Julia Ponomarenko

14 Data Analysis Pipeline for Detection and Quantification of Pseudouridine (ψ) in RNA by HydraPsiSeq
Florian Pichot, Virginie Marchand, Mark Helm, and Yuri Motorin

15 Analysis of RNA Sequences and Modifications Using NASE
Samuel Wein

16 Mapping of RNA Modifications by Direct Nanopore Sequencing and JACUSA2
Amina Lemsara, Christoph Dieterich, and Isabel S. Naarmann-de Vries

Index

Contributors

SERGI ARANDA • Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Barcelona, Spain
CECILIA BALLARÉ • Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Barcelona, Spain
ENRIQUE BLANCO • Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Barcelona, Spain
YA-TING SABRINA CHANG • Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
KUNQI CHEN • Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou, Fujian, China
PAO-YANG CHEN • Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
XIANG CHEN • Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA
LUCA COZZUTO • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
SWAPNA VIDHUR DAULATABAD • Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, Indianapolis, IN, USA
MATTHIEU DEFRANCE • Université Libre de Bruxelles, Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium
ANNA DELGADO-TEJEDOR • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
LUCIANO DI CROCE • Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain; ICREA, Barcelona, Spain
KARISSA DIESELDORFF JONES • Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA
CHRISTOPH DIETERICH • Klaus Tschira Institute for Integrative Computational Cardiology, University Heidelberg, Heidelberg, Germany; Department of Internal Medicine III (Cardiology, Angiology, and Pneumology), University Hospital Heidelberg, Heidelberg, Germany; German Centre for Cardiovascular Research (DZHK)-Partner Site HD/MA, Heidelberg, Germany
SANDRA DUHARCOURT • Université Paris Cité, CNRS, Institut Jacques Monod, 75013 Paris, France
XUDONG GUO • College of Information Engineering, Northwest A&F University, Yangling, China
ZHENXING GUO • Department of Biostatistics and Bioinformatics, Emory University Rollins School of Public Health, Atlanta, GA, USA
CUIYUEYUE HAO • Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
ALEXIS HARDY • Université Libre de Bruxelles, Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium
MARK HELM • Institute of Pharmacy and Biochemistry, Johannes Gutenberg University Mainz, Mainz, Germany
TONI HERMOSO PULIDO • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
SARATH CHANDRA JANGA • Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, Indianapolis, IN, USA; Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA; Centre for Computational Biology and Bioinformatics, Indiana University School of Medicine, 5021 Health Information and Translational Sciences (HITS), Indianapolis, IN, USA
PENG JIN • Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
RATANOND KOONCHANOK • Department of Human-Centered Computing, School of Informatics and Computing, Indiana University Purdue University, Indianapolis, IN, USA
AMINA LEMSARA • Klaus Tschira Institute for Integrative Computational Cardiology, University Heidelberg, Heidelberg, Germany; Department of Internal Medicine III (Cardiology, Angiology, and Pneumology), University Hospital Heidelberg, Heidelberg, Germany
FUYI LI • College of Information Engineering, Northwest A&F University, Yangling, China; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
XIAOJIAN LIU • Institute of Cancer Research, Shenzhen Bay Laboratory, Shenzhen, China
VIRGINIE MARCHAND • Université de Lorraine, CNRS, INSERM, UAR2008/US40 IBSLor, EpiRNA-Seq Core facility, F-54000, Nancy, France
JIA MENG • Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China; AI University Research Centre, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK
OCTAVIO MORANTE-PALACIOS • Epigenetics and Immune Disease Group, Josep Carreras Research Institute (IJC), Barcelona, Spain; Germans Trias i Pujol Research Institute (IGTP), Barcelona, Spain
YURI MOTORIN • Université de Lorraine, CNRS, INSERM, UAR2008/US40 IBSLor, EpiRNA-Seq Core facility, F-54000, Nancy, France; Université de Lorraine, CNRS, UMR7365 IMoPA, F-54000, Nancy, France
ISABEL S. NAARMANN-DE VRIES • Klaus Tschira Institute for Integrative Computational Cardiology, University Heidelberg, Heidelberg, Germany; Department of Internal Medicine III (Cardiology, Angiology, and Pneumology), University Hospital Heidelberg, Heidelberg, Germany; German Centre for Cardiovascular Research (DZHK)-Partner Site HD/MA, Heidelberg, Germany
EVA MARIA NOVOA • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
FLORIAN PICHOT • Institute of Pharmacy and Biochemistry, Johannes Gutenberg University Mainz, Mainz, Germany; Université de Lorraine, CNRS, INSERM, UAR2008/US40 IBSLor, EpiRNA-Seq Core facility, F-54000, Nancy, France
JULIA PONOMARENKO • Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
DANIEL PUTNAM • Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA
KHAIRI REDA • Department of Human-Centered Computing, School of Informatics and Computing, Indiana University Purdue University, Indianapolis, IN, USA
RON SCHWESSINGER • MRC WIMM Centre for Computational Biology, MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK
ANDREW M. SHAFIK • Department of Human Genetics, Emory University School of Medicine, Atlanta, GA, USA
BOWEN SONG • Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China; Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, UK
JIANGNING SONG • Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC, Australia
KUN SUN • Institute of Cancer Research, Shenzhen Bay Laboratory, Shenzhen, China
KAI TAN • Center for Childhood Cancer Research, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA; Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA; Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA; Penn Epigenetics Institute, University of Pennsylvania, Philadelphia, PA, USA; Department of Pediatrics, University of Pennsylvania, Philadelphia, PA, USA
YASIN UZUN • Center for Childhood Cancer Research, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA; Department of Biomedical and Health Informatics, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
SAMUEL WEIN • Center for Bioinformatics Tübingen, University of Tübingen, Tübingen, Germany
JUSTIN WILLIAMS • Department of Tumor Cell Biology, St. Jude Children’s Research Hospital, Memphis, TN, USA
HAO WU • Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA; Penn Epigenetics Institute, University of Pennsylvania, Philadelphia, PA, USA
HAO WU • Department of Biostatistics and Bioinformatics, Emory University Rollins School of Public Health, Atlanta, GA, USA
ZHIJIN WU • Department of Biostatistics, Brown University, Providence, RI, USA
JIAHUI YAO • Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
MING-REN YEN • Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan
PENGXIANG YUAN • Institute of Cancer Research, Shenzhen Bay Laboratory, Shenzhen, China

Chapter 1

DNA Methylation Data Analysis Using Msuite

Xiaojian Liu, Pengxiang Yuan, and Kun Sun

Abstract

DNA methylation is a widespread epigenetic modification involved in many biological regulatory pathways. The development of various powerful biochemical assays, including conventional bisulfite treatment-based and emerging bisulfite-free techniques, promises high-resolution DNA methylome profiling and has significantly propelled the DNA methylation research field. However, the analysis of the large-scale data generated by such assays is still complex and challenging. In this chapter, we present a step-by-step protocol for using Msuite for whole-spectrum DNA methylation data analysis, from quality control and read alignment to methylation calling and data visualization. The Msuite package and a testing dataset are freely available at https://github.com/hellosunking/Msuite

Key words Bisulfite sequencing, Data visualization, CpG dinucleotides

Pedro H. Oliveira (ed.), Computational Epigenomics and Epitranscriptomics, Methods in Molecular Biology, vol. 2624, https://doi.org/10.1007/978-1-0716-2962-8_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

DNA methylation is a pervasive and important epigenetic regulator in the mammalian genome, which affects diverse gene regulatory processes and is intricately regulated to guide complex biological processes, such as embryogenesis, aging, and tumorigenesis [1–4]. In mammalian genomes, DNA methylation mostly takes place at cytosines in CpG dinucleotides, and bisulfite sequencing has traditionally been the most commonly used method to detect such modifications [5–9]. Bisulfite treatment converts the unmethylated cytosines (which are read as thymines in sequencing data) while leaving the methylated ones unchanged, therefore allowing the two to be distinguished. However, in mammalian genomes, methylated cytosines mostly appear in CpG dinucleotides, which account for only a very limited proportion (e.g., ~5% in the human genome) of all cytosines [10–12]. Hence, DNA libraries after bisulfite treatment usually contain very few cytosines, which results in a biased base composition and low sequence complexity. In contrast, emerging bisulfite-free techniques, such as TET-assisted pyridine borane sequencing (TAPS) [13], employ the opposite strategy and convert only the methylated cytosines, resulting in a much more balanced DNA library with much higher complexity.

In both types of approach, the conversion means that the DNA library no longer matches the original DNA or the reference genome, which makes data analysis a complex and challenging task. Various data analysis tools have been developed, but most of them focus on sequence alignment and have not been optimized for bisulfite-free assays. To this end, we developed Msuite [14], an all-in-one, multifunctional package for DNA methylation data analysis. In this chapter, we provide a step-by-step protocol for using Msuite to perform DNA methylation data analysis.
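To make the two conversion strategies described in this Introduction concrete, the following shell sketch mimics the read-out of a single DNA strand after each treatment. It is an illustration only and is not part of Msuite: the function name convert_read, the assay labels "bs" and "taps", and the space-separated list of 1-based methylated positions are all hypothetical choices made for this example, and conversion is assumed to be 100% efficient.

```shell
#!/bin/sh
# Illustrative only (not part of Msuite): simulate how one strand reads out
# after conversion. Arguments (all hypothetical conventions for this sketch):
#   $1  DNA sequence (upper case)
#   $2  assay: "bs"   = bisulfite-like, unmethylated C is read as T
#              "taps" = TAPS-like, methylated C is read as T
#   $3  space-separated 1-based positions of methylated cytosines (may be empty)
convert_read() {
  awk -v seq="$1" -v assay="$2" -v meth="$3" 'BEGIN {
    # Record which positions are methylated.
    n = split(meth, pos, " ")
    for (i = 1; i <= n; i++) is_meth[pos[i]] = 1

    out = ""
    for (i = 1; i <= length(seq); i++) {
      base = substr(seq, i, 1)
      if (base == "C") {
        methylated = (i in is_meth)
        if (assay == "bs"   && !methylated) base = "T"   # unmethylated C -> T
        if (assay == "taps" &&  methylated) base = "T"   # methylated C -> T
      }
      out = out base
    }
    print out
  }'
}
```

For the sequence ACGTCCGA with a single methylated cytosine at position 2, `convert_read ACGTCCGA bs "2"` prints ACGTTTGA (only the methylated C survives), whereas `convert_read ACGTCCGA taps "2"` prints ATGTCCGA (only the methylated C changes). The second read keeps most of its cytosines, which is the reason given above for the higher complexity of bisulfite-free libraries.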

2 Materials

Msuite is freely available at https://github.com/hellosunking/Msuite and is implemented in C++ and Perl for Linux/Unix systems. Users are recommended to download a package release, for example, https://github.com/hellosunking/Msuite/archive/refs/tags/v1.1.2.tar.gz for the latest version. After downloading, uncompress the package using the following command:

$ tar zxf Msuite-1.1.2.tar.gz

A directory named “Msuite-1.1.2” will be created, and the main program “msuite” will be found under this directory. Note that we have included pre-compiled programs in the release package. In addition, the package contains a testing dataset, which was generated in silico using the SHERMAN program (https://www.bioinformatics.babraham.ac.uk/projects/sherman/), following the TAPS protocol against the human reference genome (key parameters: C->T conversion rate at CpG sites, 20%; C->T conversion rate at CpH sites, 0.5%; error rate, 0.1%).

3 Methods

3.1 Running Environment and Dependencies

Running Msuite on a Linux/Unix machine requires g++ (version 4.8 or higher), Perl (version 5.10 or higher), and R (version 3.0 or higher). The Msuite package already includes all the pre-compiled executables needed (see Note 1). Msuite depends on bowtie2 and samtools, which must be installed on the system. Users can download bowtie2 from https://sourceforge.net/projects/bowtie-bio/files/bowtie2/ and samtools from http://www.htslib.org/download/ and follow the installation instructions in the corresponding packages.
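Before going further, it can be convenient to confirm that the tools listed above are reachable on the PATH. The following is a minimal sketch using only the standard `command -v` shell built-in; the helper name check_tools is hypothetical and not part of Msuite, and the sketch only tests for presence, not for the minimum versions stated above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Msuite): report which of the given
# executables can be found on PATH. Returns non-zero if any is missing.
# Note: this checks presence only; it does not verify version numbers.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call for the dependencies named in this subsection:
#   check_tools g++ perl R bowtie2 samtools
```

If any tool is reported as missing, install it as described above before building the indices.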

3.2 Building Indices

Before running Msuite, genome indices must be built. To this end, a utility named “build.index.sh” is available under the “build.index” directory. The “build.index.sh” script requires three parameters: a genome sequence file in FASTA format (or a directory containing the sequences of individual chromosomes), a RefSeq annotation, and the identity of the genome index. Genome sequences can be downloaded from the UCSC Genome Browser (http://genome.ucsc.edu). For example, the hg38 reference genome can be downloaded via the following command:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

The RefSeq annotations for hg38 and mm10 (Mus musculus) are already included in the Msuite package. Users can download newer versions via the following command if needed:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz

Next, build the genome indices using the following command:

$ build.index/build.index.sh /path/to/hg38.fa.gz build.index/hg38.refGene.txt.gz hg38
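Because index building can take a long time, it may help to validate the three arguments of “build.index.sh” before launching it. The sketch below is an assumption-laden convenience wrapper, not part of Msuite: the function name index_command is hypothetical, it relies only on the argument order described in this subsection, and it prints the command rather than executing it.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of Msuite) for the three
# build.index.sh arguments described in this subsection. It only prints the
# command line that would be run; nothing is executed.
#   $1  genome FASTA file, or a directory of per-chromosome sequences
#   $2  RefSeq annotation file
#   $3  identity (name) of the genome index
index_command() {
  genome=$1
  annotation=$2
  index_id=$3
  if [ ! -e "$genome" ]; then
    echo "genome FASTA (or directory) not found: $genome" >&2
    return 1
  fi
  if [ ! -f "$annotation" ]; then
    echo "RefSeq annotation not found: $annotation" >&2
    return 1
  fi
  if [ -z "$index_id" ]; then
    echo "index identity must not be empty" >&2
    return 1
  fi
  printf 'build.index/build.index.sh %s %s %s\n' "$genome" "$annotation" "$index_id"
}

# Example (run from the Msuite directory):
#   index_command /path/to/hg38.fa.gz build.index/hg38.refGene.txt.gz hg38
```

Printing instead of executing keeps the sketch side-effect free; once the printed line looks right, it can be copied to the prompt and run as in the command above.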

3.3

Run Msuite

Msuite provides two analysis modes: three-letter mode and four-letter mode. The three-letter mode is generic and can be applied to most DNA methylation assays (e.g., WGBS), while the four-letter mode is specific to TAPS, 5hmC-CATCH [15], or similar approaches (see Note 2).
Prepare scripts for analyzing the testing dataset using the three-letter mode:
$ ./msuite -x hg38 -m TAPS -3 -1 ./testing_dataset/simu.read1.fq.gz -2 ./testing_dataset/simu.read2.fq.gz -p 8 -o ./testing_dataset/Msuite.Mode3
Or using the four-letter mode:
$ ./msuite -x hg38 -m TAPS -4 -1 ./testing_dataset/simu.read1.fq.gz -2 ./testing_dataset/simu.read2.fq.gz -p 8 -o ./testing_dataset/Msuite.Mode4
The “-x” option specifies the identity of the genome index (here hg38); the “-m” option specifies the assay type (users can set “-m BS” for WGBS data); the “-3”/“-4” option selects the three-letter or four-letter mode, respectively (when using the four-letter mode, “-m TAPS” must be set at the same time, as this mode only supports TAPS or similar assays; see Note 3); the “-p” option specifies the number of threads to be used (set “-p 0” to use all threads); the “-1” and “-2” options specify the input files (see Note 4); and the “-o” option sets the output directory.
A makefile will be generated in the output directory (“-o” option). Change to that directory and run the actual analysis:
$ cd ./testing_dataset/Msuite.Mode3
$ make
Msuite will perform data preprocessing (including adapter trimming and low-quality cycle removal) [16], read alignment, methylation calling, and data visualization (see Notes 5 and 6).
We have prepared a script, “run_testing_dataset.sh,” for building references and running the analysis on the testing dataset automatically. Users can call it using the following command:
$ ./run_testing_dataset.sh

3.4 The Msuite Output

Msuite writes all the results to the directory specified by the “-o” option. The alignment output is recorded in the files “Msuite.final.bam” and “Msuite.rmdup.sam,” while the methylation calls are recorded in the files “Msuite.CpG.meth.call,” “Msuite.CpH.meth.call,” and “Msuite.CpG.meth.bedgraph.” In addition, Msuite summarizes the most relevant quality control and analysis statistics, as well as visualizations of the methylation calls and M-bias, in an HTML file named “Msuite.report/index.html.” Figure 1 shows the HTML report of Msuite on the testing dataset.

Fig. 1 Example of the Msuite analysis summary on the testing dataset

4 Notes

1. The source package contains executable files pre-compiled with g++ version 4.8.5 for Linux x86_64 systems. If the analysis fails to run, which is usually caused by an outdated version of the libc++ library, users can recompile the programs using the following command:
$ make clean && make
2. Msuite can directly analyze data generated by ATAC-me [17] or similar protocols by setting “-k nextera.” Msuite also allows users to set the “-c cycle” option to control the cycles to be analyzed if not all cycles are needed.
3. If your data were generated using a BS-seq protocol, you must use the three-letter mode and set “-m BS.” The four-letter mode only supports processing of TAPS/5hmC-CATCH data, where non-CpG methylation is very low.
4. Msuite supports multiple input files as well as gzip-compressed files: users can concatenate input files with “,”. Files with the “.gz” suffix are automatically interpreted as gzip-compressed.
5. The Msuite package contains a program called “Mviewer” [18] for nucleotide-level, genotyping-aware DNA methylation data visualization, which is particularly suitable for visualizing imprinting events or end preferences in cell-free DNA data [19].
6. This protocol also works for Msuite2 [20], the successor of Msuite, which is freely available at https://github.com/hellosunking/Msuite2.
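The option rules from Section 3.3 and Notes 3 and 4 can be sketched as a small command builder. This Python helper is hypothetical and not part of Msuite; it uses only the options named in this chapter, and it assumes paired-end input with one read-2 file for each read-1 file.

```python
def msuite_command(index_id, assay, mode, read1_files, read2_files,
                   out_dir, threads=0, msuite="./msuite"):
    """Assemble an msuite command line (hypothetical helper, not part of Msuite)."""
    if mode not in (3, 4):
        raise ValueError("mode must be 3 (three-letter) or 4 (four-letter)")
    # Note 3: the four-letter mode only supports TAPS or similar assays.
    if mode == 4 and assay != "TAPS":
        raise ValueError("four-letter mode requires assay 'TAPS'")
    # Assumption: paired-end input, one read-2 file per read-1 file.
    if not read1_files or len(read1_files) != len(read2_files):
        raise ValueError("read-1 and read-2 file lists must match in length")
    return [
        msuite, "-x", index_id, "-m", assay, f"-{mode}",
        # Note 4: multiple input files are concatenated with ",".
        "-1", ",".join(read1_files),
        "-2", ",".join(read2_files),
        "-p", str(threads),  # 0 means "use all threads"
        "-o", out_dir,
    ]


# Hypothetical file names for a two-lane WGBS sample.
example = msuite_command(
    "hg38", "BS", 3,
    ["lane1.read1.fq.gz", "lane2.read1.fq.gz"],
    ["lane1.read2.fq.gz", "lane2.read2.fq.gz"],
    "./Msuite.out", threads=8,
)
print(" ".join(example))
```

Rejecting an invalid mode and assay combination before the makefile is generated saves a failed run later in the pipeline.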


References
1. Yin Y, Morgunova E, Jolma A, Kaasinen E, Sahu B, Khund-Sayeed S, Das PK, Kivioja T, Dave K, Zhong F et al (2017) Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356:eaaj2239. https://doi.org/10.1126/science.aaj2239
2. Baylin SB, Jones PA (2011) A decade of exploring the cancer epigenome – biological and translational implications. Nat Rev Cancer 11:726–734. https://doi.org/10.1038/nrc3130
3. Zemach A, McDaniel IE, Silva P, Zilberman D (2010) Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328:916–919. https://doi.org/10.1126/science.1186366
4. Feng S, Jacobsen SE, Reik W (2010) Epigenetic reprogramming in plant and animal development. Science 330:622–627. https://doi.org/10.1126/science.1190614
5. Meissner A, Mikkelsen TS, Gu H, Wernig M, Hanna J, Sivachenko A, Zhang X, Bernstein BE, Nusbaum C, Jaffe DB (2008) Genome-scale DNA methylation maps of pluripotent and differentiated cells. Nature 454:766–770. https://doi.org/10.1038/nature07107
6. Boyle P, Clement K, Gu H, Smith ZD, Ziller M, Fostel JL, Holmes L, Meldrim J, Kelley F, Gnirke A (2012) Gel-free multiplexed reduced representation bisulfite sequencing for large-scale DNA methylation profiling. Genome Biol 13:1–10. https://doi.org/10.1186/gb-2012-13-10-r92
7. Meissner A, Gnirke A, Bell GW, Ramsahoye B, Lander ES, Jaenisch R (2005) Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 33:5868–5877. https://doi.org/10.1093/nar/gki901
8. Sun K, Jiang P, Chan KCA, Wong J, Cheng YK, Liang RH, Chan WK, Ma ES, Chan SL, Cheng SH et al (2015) Plasma DNA tissue mapping by genome-wide methylation sequencing for noninvasive prenatal, cancer, and transplantation assessments. Proc Natl Acad Sci U S A 112:E5503–E5512. https://doi.org/10.1073/pnas.1508736112
9. Gu H, Smith ZD, Bock C, Boyle P, Gnirke A, Meissner A (2011) Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nat Protoc 6:468–481. https://doi.org/10.1038/nprot.2010.190
10. Luo C, Hajkova P, Ecker JR (2018) Dynamic DNA methylation: in the right place at the right time. Science 361:1336–1340. https://doi.org/10.1126/science.aat6806
11. Greenberg MVC, Bourc’his D (2019) The diverse roles of DNA methylation in mammalian development and disease. Nat Rev Mol Cell Biol 20:590–607. https://doi.org/10.1038/s41580-019-0159-6
12. Yoder JA, Walsh CP, Bestor TH (1997) Cytosine methylation and the ecology of intragenomic parasites. Trends Genet 13:335–340. https://doi.org/10.1016/s0168-9525(97)01181-5
13. Liu Y, Siejka-Zielinska P, Velikova G, Bi Y, Yuan F, Tomkova M, Bai C, Chen L, Schuster-Bockler B, Song CX (2019) Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat Biotechnol 37:424–429. https://doi.org/10.1038/s41587-019-0041-2
14. Sun K, Li L, Ma L, Zhao Y, Deng L, Wang H, Sun H (2020) Msuite: a high-performance and versatile DNA methylation data-analysis toolkit. Patterns (NY) 1:100127. https://doi.org/10.1016/j.patter.2020.100127
15. Zeng H, He B, Xia B, Bai D, Lu X, Cai J, Chen L, Zhou A, Zhu C, Meng H et al (2018) Bisulfite-free, nanoscale analysis of 5-hydroxymethylcytosine at single base resolution. J Am Chem Soc 140:13190–13194. https://doi.org/10.1021/jacs.8b08297
16. Sun K (2020) Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data. Bioinformatics 36:3561–3562. https://doi.org/10.1093/bioinformatics/btaa171
17. Barnett KR, Decato BE, Scott TJ, Hansen TJ, Chen B, Attalla J, Smith AD, Hodges E (2020) ATAC-Me captures prolonged DNA methylation of dynamic chromatin accessibility loci during cell fate transitions. Mol Cell 77:1350–1364 e1356. https://doi.org/10.1016/j.molcel.2020.01.004
18. Sun K, Lun FFM, Jiang P, Sun H (2017) BSviewer: a genotype-preserving, nucleotide-level visualizer for bisulfite sequencing data. Bioinformatics 33:3495–3496. https://doi.org/10.1093/bioinformatics/btx505
19. Sun K, Jiang P, Wong AIC, Cheng YKY, Cheng SH, Zhang H, Chan KCA, Leung TY, Chiu RWK, Lo YMD (2018) Size-tagged preferred ends in maternal plasma DNA shed light on the production mechanism and show utility in noninvasive prenatal testing. Proc Natl Acad Sci U S A 115:E5106–E5114. https://doi.org/10.1073/pnas.1804134115
20. Li L, An Y, Ma L, Yang M, Yuan P, Liu X, Jin X, Zhao Y, Zhang S, Hong X, Sun K (2022) Msuite2: all-in-one DNA methylation data analysis toolkit with enhanced usability and performance. Comput Struct Biotechnol J 20:1271–1276. https://doi.org/10.1016/j.csbj.2022.03.005

Chapter 2

Interactive DNA Methylation Array Analysis with ShinyÉPICo

Octavio Morante-Palacios

Abstract
Arrays provide a cost-effective platform for the analysis of human DNA methylation. ShinyÉPICo is an interactive, web-based, and graphical tool that allows the user to analyze Illumina DNA methylation arrays (450k and EPIC), from the user’s own computer or from a server. This tool covers the entire analysis, from the raw data input to the final list of differentially methylated positions or regions. Here, we describe the steps of the analysis, the different parameters available, and useful information to understand and select the best options in each step.

Key words DNA methylation, Epigenetics, Shiny, Web Interface, Differentially Methylated Positions, Differentially Methylated Regions

1 Introduction

Cellular epigenomic landscapes are shaped by a plethora of mechanisms, including posttranslational modification of histones, noncoding RNAs, chromatin accessibility, three-dimensional chromosome organization, and DNA methylation [1, 2]. DNA methylation is the best-studied epigenetic modification; in humans, it consists of the addition of a methyl group to carbon 5 of cytosines (5meC) [2]. DNA methylation is found mostly in cytosine-followed-by-guanine dinucleotides (CpG sites) but also occurs at non-CpG sites (CpA, CpT, and CpC). Originally, DNA methylation was studied in CpG-rich regions (CpG islands), which are generally found in gene promoters [3]. In that context, DNA methylation is associated with gene repression, and it is also relevant for X-chromosome inactivation, pre-mRNA alternative splicing, and long-term gene silencing [1, 4, 5]. However, more recent work has revealed a role of DNA methylation in the regulation of dynamic biological processes, such as cell differentiation, highlighting its importance in enhancers, gene bodies, and partially methylated domains [6, 7].

Pedro H. Oliveira (ed.), Computational Epigenomics and Epitranscriptomics, Methods in Molecular Biology, vol. 2624, https://doi.org/10.1007/978-1-0716-2962-8_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Despite the development of deep sequencing-based techniques for the study of DNA methylation with high resolution and coverage, such as whole-genome bisulfite sequencing (WGBS), the moderate cost and robust CpG coverage of DNA methylation arrays make them still very useful for DNA methylation studies, especially when a high number of samples is involved. Illumina Infinium microarrays are the most widely used platform for the study of human DNA methylation. The Infinium MethylationEPIC BeadChip is the latest version of the Illumina DNA methylation array, covering more than 850,000 methylation sites and 99% of RefSeq genes. The prior version was the Infinium Human Methylation 450k BeadChip, which covers 450,000 methylation sites.

Several tools have been created for the analysis of Illumina DNA methylation arrays. First, Genome Studio is the proprietary tool designed by Illumina, which only works in Windows. In addition, multiplatform R packages such as illuminaio, minfi, lumi, RnBeads, and limma [8–12] can be used from R to import array files, perform the normalization, and calculate differentially methylated positions.

ShinyÉPICo [13] is a graphical tool that relies on several well-established R packages, such as minfi, limma, and mCSEA, to provide a web interface that covers all the steps of the DNA methylation array analysis. First, this can be useful for users without a bioinformatics background, since the application is very user-friendly and guides the user through the analysis. Second, even for bioinformaticians, the automation and immediate graphical output of shinyÉPICo are very convenient for analyses that involve iterative repetitions and trying several parameters in each step. Lastly, since shinyÉPICo can be installed on a server and used remotely from other computers, it can help to optimize computer resources and facilitate the use of servers for array analysis.

In this chapter, the main steps of the DNA methylation array analysis with shinyÉPICo are discussed, explaining the different options in each one and the interpretation of graphs and statistics.

2 Materials

ShinyÉPICo is an R package, available in Bioconductor 3.14 (http://www.bioconductor.org/packages/release/bioc/html/shinyepico.html). It requires the installation of R 4.1 in GNU/Linux, macOS, or Windows. For comfortable use of the application, we recommend a computer with at least 16 GB of RAM, although the requirements can be higher or lower depending on the number of arrays analyzed.

After installation, following the instructions on the Bioconductor website, shinyÉPICo can be run by executing the function shinyepico::run_shinyepico(), which opens the web interface directly in the web browser. The run_shinyepico() function has four arguments that can be customized:
• n_cores: This numeric parameter controls the number of cores; by default, it is half of the detected cores. If you have limited RAM (8 GB or less), we recommend setting this to 1, to avoid the RAM overhead of multicore calculations.
• max_upload_size: This parameter sets the size limit (in MB) of the files that can be uploaded to the application. By default, this parameter is 2000 MB.
• host: IP used to deploy the application. By default, this parameter is your local IP (127.0.0.1), which means that only you, from your computer, will have access to the application. However, it is possible to make the app reachable from other computers in the same LAN by changing the IP to 0.0.0.0.
• port: Port used to deploy the application. By default, a random free port is used.

Alternatively, shinyÉPICo is also distributed as a ready-to-use Docker container (https://hub.docker.com/repository/docker/omorante/shinyepico). This option is especially useful for using shinyÉPICo remotely, by configuring a server, potentially with more RAM and computational power, and accessing it from other computers. This kind of server can be configured with the ShinyProxy project (https://www.shinyproxy.io).
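The guidance on the launch arguments can be sketched as follows. The Python helpers are hypothetical; they encode only the recommendations stated above and format the corresponding R call as text, using the four argument names listed in this section. Rounding "half of the detected cores" down, with a minimum of 1, is an assumption.

```python
def suggest_n_cores(detected_cores, ram_gb):
    """Suggest a value for n_cores following the guidance in the text."""
    # With 8 GB of RAM or less, a single core avoids multicore RAM overhead.
    if ram_gb <= 8:
        return 1
    # Assumption: half of the detected cores, rounded down, never below 1.
    return max(1, detected_cores // 2)


def r_launch_call(n_cores, max_upload_size=2000, host="127.0.0.1", port=None):
    """Format the R call as a string (argument names as listed in the text)."""
    args = [
        f"n_cores = {n_cores}",
        f"max_upload_size = {max_upload_size}",
        f'host = "{host}"',
    ]
    # Leaving the port out keeps the default behavior (a random free port).
    if port is not None:
        args.append(f"port = {port}")
    return "shinyepico::run_shinyepico(" + ", ".join(args) + ")"


# Example: a 16-core server with 64 GB of RAM, reachable from the LAN.
print(r_launch_call(suggest_n_cores(16, 64), host="0.0.0.0"))
```

The printed line can be pasted into an R session; nothing here launches the application itself.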

3 Methods

The shinyÉPICo workflow is divided into several parts. In the following sections, the full analysis is explained, from the data upload to the results export.

3.1 Data Upload

The first step in the shinyÉPICo workflow is to prepare the data in the proper format. IDAT files should be compressed into a .zip file. The file names should follow the standard convention XXXXXXXXXXXX_YYYYYY_ZZZ.idat, where XXXXXXXXXXXX is the Sentrix_ID, YYYYYY is the Sentrix_Position, and ZZZ is Grn or Red (corresponding, respectively, to the green and red signal files). Moreover, a CSV (comma-separated values) file with the annotation of the experiment should be included. It is mandatory to include the Sentrix_ID and Sentrix_Position columns, which allow the software to find the respective IDAT files. Other columns should be added to reflect the different variables (e.g., sample name, health/disease, treatment/control, age, sex). An example of a sample sheet is included in Table 1. shinyÉPICo autodetects variable types (numerical or categorical), so it is recommended not to use numbers to define categorical variables. For example, “Donor,” a categorical variable, should be filled in using letters or words, not numbers alone.

Table 1 Example of sample sheet

Sample_Name  Sample_Group  Donor  Sentrix_ID    Sentrix_Position
MAC A        MAC           A      202163550095  R02C01
MO B         MO            B      202163550095  R04C01
MAC C        MAC           C      202163550095  R06C01
MO A         MO            A      202163550095  R07C01
MO C         MO            C      202163550097  R05C01
MAC B        MAC           B      202163550097  R07C01

After preparing the zip file, it can be uploaded with the browse button on the Input tab. As indicated in the Materials section, the size of the zip file to upload is limited by the max_upload_size parameter. When a zip file with a proper sample sheet is uploaded, the sample name, variable of interest, and donor columns should be selected. The variable of interest is the one for which you want to calculate the methylation differences. The donor variable is useful when you have several samples from the same person. If this is not the case, the sample name can also be selected as the donor variable. Finally, it is possible to exclude samples from the analysis by deselecting them in the “Select Samples to Process” bar. When the “Continue” button is pressed, the information is processed, and the tab focus is automatically switched to the normalization tab.
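The input conventions above can be checked before uploading. The following Python sketch is hypothetical and independent of shinyÉPICo: it verifies the two mandatory columns, derives the IDAT file names implied by the naming convention, and flags columns whose values are all numeric. How shinyÉPICo itself autodetects variable types is not specified here, so the numeric check is only an approximation.

```python
import csv
import io

REQUIRED_COLUMNS = ("Sentrix_ID", "Sentrix_Position")


def read_sample_sheet(csv_text):
    """Parse a comma-separated sample sheet and check the mandatory columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing mandatory column(s): " + ", ".join(missing))
    return list(reader)


def expected_idat_files(rows):
    """Return the Grn and Red IDAT file names implied by each row."""
    files = []
    for row in rows:
        stem = f"{row['Sentrix_ID']}_{row['Sentrix_Position']}"
        files.extend([f"{stem}_Grn.idat", f"{stem}_Red.idat"])
    return files


def numeric_looking_columns(rows, columns):
    """Return the columns whose values all parse as numbers.

    If such a column is meant to be categorical, recode it with letters.
    """
    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False
    return [c for c in columns if rows and all(is_number(r[c]) for r in rows)]


# First two rows of Table 1.
sheet = (
    "Sample_Name,Sample_Group,Donor,Sentrix_ID,Sentrix_Position\n"
    "MAC A,MAC,A,202163550095,R02C01\n"
    "MO B,MO,B,202163550095,R04C01\n"
)
rows = read_sample_sheet(sheet)
print(expected_idat_files(rows))
print(numeric_looking_columns(rows, ["Sample_Group", "Donor"]))
```

Comparing the expected file names with the contents of the zip archive catches missing or misnamed IDAT files before the upload.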

3.2 Quality Control

After data upload, two quality control plots are shown. First, the QC signal plot shows the median methylated (mMed) and unmethylated (uMed) signal of each sample (Fig. 1a). If a sample is flagged as “Suboptimal” or lies far from the other samples, a problem with hybridization or sample quality is likely. Second, the bisulfite conversion plot, calculated with specific control probes, shows whether the bisulfite conversion has been successful (Fig. 1b). Moreover, the Sex Prediction and SNPs Heatmap tabs can be used to check that samples have been correctly annotated and hybridized in the proper order. The sex prediction plot relies on the X and Y chromosome intensities to identify the XX or XY genotype (Fig. 1c), whereas the SNP heatmap uses specific probes of the array to identify samples from the same donors (Fig. 1d). In the dendrogram, samples from the same donor should appear together in the same cluster. Altogether, this information can be used to cross-check the data from the sample sheet and detect errors in sample processing.

Fig. 1 Quality control plots. (a) Scatter plot showing the overall methylation signal in the methylated channel (mMed) against the unmethylated channel (uMed), for each sample. In this example, all the samples are above the signal threshold. (b) Lollipop plot depicting the minimum ratio between the converted and nonconverted control Illumina Infinium II probes. (c) Scatter plot showing the median X chromosome intensity against the median Y chromosome intensity, for each sample. The color indicates the sex prediction. In this case, all the samples are predicted as “male” (M). (d) Heatmap of SNP probes. The dendrogram clusters together the samples from the same donors

3.3 Normalization

The fluorescence signal of methylation arrays can be very sensitive to technical variables, and the global methylation profiles of different samples can look very different. It is necessary to use a normalization method to correct the data and make the samples comparable. ShinyÉPICo includes all the normalization methods present in the minfi package. It is important to understand the use cases of each normalization method in order to select one appropriate for the dataset of interest:
• Raw: This method does not perform any normalization, keeping the data as is. It is generally not recommended for calculating DMPs or DMRs.
• Illumina: A reverse-engineered implementation of the Genome Studio normalization. It relies on control probes of the array to equalize methylation values.
• Funnorm: A between-array normalization method that relies on data from control probes of the arrays. It also performs Noob normalization before Functional Normalization.
• Noob: A within-array normalization method with dye-bias normalization.
• SWAN: A within-array normalization method that allows Infinium I and Infinium II probes to be normalized together.
• Quantile: A between-array normalization method that assumes no global differences in methylation between the samples. When global changes are expected, such as in cancer samples, other methods, such as Funnorm, are recommended.
• Noob + Quantile: Noob within-array normalization followed by Quantile normalization, analogous to the Funnorm method. This nonstandard approach has shown good results in our experience.

For primary samples in which few methylation changes are expected, the Quantile or Noob + Quantile methods perform very effectively in our experience. However, when very broad methylation changes are expected, such as in cancer versus healthy samples, other methods, such as Funnorm, should be used. Moreover, the Illumina method is flexible for different types of datasets, and it can be useful when behavior similar to that of the Genome Studio software is desired. The result of normalization can be directly visualized by comparing the “Raw” and “Normalized” density plots (Fig. 2a, b) and boxplots.

In addition to the normalization method, other options can be selected. The Drop CpH and Drop SNP buttons remove probes of cytosines outside the CpG context or annotated to known SNPs, respectively. Moreover, the X and Y chromosomes can be removed, which can be useful to remove some variability if the samples are of different sexes. However, there are other ways to manage this sex bias, as explained in the next sections.

Finally, the normalization section contains two other tabs to perform an exploratory data analysis prior to DMP or DMR calculation. In the correlations tab, the variables of the sample sheet are correlated with the principal components (PCs) obtained from a principal component analysis (PCA) of the beta values (Fig. 2c). This is very useful to detect variables that could be influencing methylation. Moreover, the samples can be plotted in the PCA tab, where different PCs can be chosen for representation (Fig. 2d).
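The two statistics used in the correlations tab (Fig. 2c) are a Pearson correlation for numerical variables and the R-squared of a linear model (principal component ~ categorical variable) for categorical ones. A minimal, self-contained Python sketch of both computations follows; it illustrates the statistics only and is not the shinyÉPICo code, and the PC1 scores and ages are made-up values.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared_categorical(scores, groups):
    """R-squared of the linear model score ~ group.

    With a single categorical predictor the fitted values are the group
    means, so R-squared = 1 - SS_within / SS_total.
    """
    grand_mean = sum(scores) / len(scores)
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    ss_total = sum((s - grand_mean) ** 2 for s in scores)
    ss_within = sum(
        (s - sum(values) / len(values)) ** 2
        for values in by_group.values()
        for s in values
    )
    return 1 - ss_within / ss_total


# Made-up PC1 scores for six samples, with the groups of Table 1.
pc1 = [-2.1, 1.9, -1.8, 2.2, 2.0, -2.2]
sample_group = ["MAC", "MO", "MAC", "MO", "MO", "MAC"]
age = [34, 51, 47, 34, 47, 51]  # hypothetical numerical variable
print(r_squared_categorical(pc1, sample_group))
print(pearson(pc1, age))
```

A variable with a high R-squared or a strong correlation on the leading PCs is a candidate covariable for the model.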


Fig. 2 Normalization and exploratory data analysis plots. (a) Density plot of raw beta values, depicting the distribution of each sample. (b) Density plot of quantile-normalized beta values, depicting the distribution of each sample. (c) Correlations between variables and principal components. Pearson correlation is applied to correlate principal components with numerical variables. For categorical variables, linear models (principal component ~ categorical variable) are generated, and R-squared statistics are shown in the representation. (d) Principal component analysis representation, showing the principal component 1 (PC1) versus principal component 2 (PC2)
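The density plots in Fig. 2 show beta values, which are bounded between 0 and 1, whereas the DMP model of Section 3.4 is fitted on M-values. A minimal Python sketch of the conversion follows, assuming the usual logit definition M = log2(beta / (1 - beta)). The clamping offset is an illustrative assumption to avoid infinite values and is not a shinyÉPICo setting.

```python
import math


def beta_to_m(beta, offset=1e-6):
    """Convert a beta value to an M-value, M = log2(beta / (1 - beta)).

    The offset clamps beta away from exactly 0 and 1 (assumption made
    for this sketch) so that the logarithm stays finite.
    """
    clamped = min(max(beta, offset), 1 - offset)
    return math.log2(clamped / (1 - clamped))


def m_to_beta(m):
    """Inverse transformation, beta = 2^M / (2^M + 1)."""
    return 2 ** m / (2 ** m + 1)


for beta in (0.1, 0.5, 0.9):
    print(beta, beta_to_m(beta))
```

A beta value of 0.5 maps to an M-value of 0, and the transformation stretches the values close to 0 and 1.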

3.4 Differentially Methylated Position Calculation

When normalization is finished, the next section is the DMP calculation. Analyzing differential methylation is probably the main aim of most DNA methylation studies. ShinyÉPICo uses the limma package to generate a model and calculate contrasts. In order to fit the limma data assumptions, M-values are used instead of beta values to generate the model.

First, the variable of interest should be selected. Pairwise differences will be calculated between all groups specified in this variable. Moreover, covariables and interactions can also be included in the model. This is useful to take into account differences in methylation that can be driven by other biological variables, such as donor, sex, or age, or by technical variables, such as batch or array. The correlations tab of the normalization section can help to determine which variables may be affecting DNA methylation. Usually, sex (if the X/Y chromosomes are preserved) and donor have a great impact on DNA methylation and should be included. Including these covariables in the model can increase the statistical power and reduce false positives. Furthermore, if the effect of a variable on DNA methylation depends on another variable, an interaction term can be added to the model.

Second, if the ArrayWeights option is enabled, estimated relative quality weights for each array are calculated with the function limma::arrayWeights and used in the limma model. This option weights the influence of each array in the model according to its calculated quality. It is especially useful with large datasets of heterogeneous quality [14].

When the model is generated, a diagnostic plot is shown, depicting the mean-variance relation of each position of the array (Fig. 3a). Ideally, no relationship between mean and variance should be detected: the distribution should be similar to a horizontal straight line.

Finally, two other options are shown after clicking on the “Generate Model” button. The eBayes Trend and eBayes Robust options correspond to the options of the same name in the limma::eBayes function. In brief, the eBayes Trend option can be useful when a mean/variance relationship is shown in the diagnostic plot [15], and the Robust option can protect the statistical method against hypo-variable or hyper-variable positions [16].

After contrast calculation, a heatmap depicting the differences is shown (Fig. 3b). This heatmap is fully customizable. The significant DMPs can be filtered by difference in beta value, FDR, or p-value. Moreover, the groups and contrasts shown in the plot can also be selected. The “Remove Batch Effect” option corrects the represented beta values by subtracting the effects of the covariables and interactions. This can be appropriate when a strong effect of these covariables is observed, and it only applies to the representation, not to the statistical analysis. In addition, the heatmap can be divided into different clusters according to the dendrogram with the “Row Colors” option. The positions corresponding to each cluster can be downloaded as BED files in the following sections. Specific DMPs can also be explored in the DMP annotation tab, which shows the annotation of each DMP and allows the user to draw boxplots of selected DMPs.

Fig. 3 Differentially methylated positions and region calculation plots. (a) Density plot depicting the mean-variance relation of each analyzed position. (b) DNA methylation heatmap of an example dataset including three monocyte (MO) samples and three macrophage (MAC) samples (lower DNA methylation levels in blue and higher methylation levels in red). (c) Genomic plot depicting CpGs assigned to the B2M gene promoter. Consistent demethylation in MACs can be observed in several CpGs

3.5 Differentially Methylated Region Calculation

When DMPs are calculated, the DMR calculation is enabled. shinyÉPICo relies on mCSEA [17] for this calculation. Since mCSEA takes the limma results as input, DMR calculation is only available after the DMP calculation has been completed. A DMR can be understood as a methylation difference calculated over aggregated positions of the genome. This aggregation can use different criteria, but proximity is generally the main criterion, because DNA methylation levels are often strongly spatially correlated. mCSEA uses predefined regions of CpGs annotated to promoters, gene bodies, or CpG islands outside genes. These regions can be filtered by the number of CpGs they contain. Moreover, the statistics of mCSEA are calculated with permutations, and shinyÉPICo allows the user to select the number of permutations. More permutations will generate more accurate statistics, but the analysis will take longer to finish. After finishing this analysis, a heatmap can also be shown, depicting the average beta values of each DMR. Moreover, individual DMRs can be plotted in the “Single DMR Plot” tab (Fig. 3c).
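The trade-off between the number of permutations and the accuracy of the resulting statistics can be illustrated with a generic permutation test. The Python sketch below is not the enrichment statistic that mCSEA computes; it is a plain label-permutation test on made-up mean beta values of one region, included only to show that the smallest attainable p-value is 1 / (n_permutations + 1).

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=999, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Generic illustration, not mCSEA's method. Group labels are shuffled
    n_permutations times; the +1 terms count the observed labeling, so
    the p-value can never be smaller than 1 / (n_permutations + 1).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance so that ties are not lost to rounding.
        if permuted >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Made-up mean beta values of one region in two groups of five samples.
group_1 = [0.80, 0.82, 0.85, 0.81, 0.84]
group_2 = [0.40, 0.42, 0.38, 0.41, 0.39]
for n in (99, 999):
    print(n, permutation_p_value(group_1, group_2, n_permutations=n))
```

With 99 permutations no p-value below 0.01 can be reported, however strong the difference, which is why more permutations give finer statistics at the cost of running time.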

3.6 Results Export

After finishing the analysis, the results can be downloaded:
• R Objects: The objects used by shinyÉPICo during the analysis, such as the RGChannelSet (minfi raw object) and GenomicRatioSet (minfi normalized object), can be downloaded in RDS format, to import and use in R.
• Filtered Bed Files: DMP or DMR genomic positions can be downloaded in BED format. Genomic coordinates can be requested in the hg38 or hg19 genome. By default, BED files are downloaded per contrast, with two files for each contrast (one with the positions more methylated in one group and one with those more methylated in the other group). Moreover, heatmap clusters can be directly downloaded if the Row Colors option is enabled.
• Workflow Report: All the options selected during the analysis and the plots present in the application can be downloaded in an HTML report. This report is intended as a reference that records all the details of the analysis, which can be consulted at any time and reproduced in shinyÉPICo if necessary.
• Custom R Script: This option creates a custom R script, including all the parameters selected in the analysis and all the functions needed to reproduce exactly the results generated in shinyÉPICo outside the application.
• Heatmap(s): DMP and DMR heatmaps can be downloaded in PDF, ready to use in publication-quality figures.

Overall, these files can be used in a simple way to carry out typical downstream analyses, such as gene ontology enrichment analysis based on genomic regions or transcription factor motif enrichment analysis.
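The two-files-per-contrast layout of the BED export can be sketched as follows. This Python example is an assumption-laden illustration, not the shinyÉPICo export code: the input tuples, the probe names, and the four-column layout are hypothetical, and positions are assumed to be 1-based. The only fixed convention used is that BED intervals have a 0-based start and an exclusive end.

```python
def dmps_to_bed(dmps):
    """Split DMPs into two lists of BED lines by direction of change.

    dmps: iterable of (chrom, position_1based, name, delta_beta) tuples,
    a hypothetical input layout chosen for this sketch.
    Returns (more_methylated, less_methylated) in the first group.
    """
    more, less = [], []
    # Plain tuple sort: chromosomes are ordered as text (chr10 before chr2).
    for chrom, pos, name, delta in sorted(dmps):
        if delta == 0:
            continue  # no direction to report
        # BED uses a 0-based start and an end that is not included.
        line = f"{chrom}\t{pos - 1}\t{pos}\t{name}"
        (more if delta > 0 else less).append(line)
    return more, less


# Made-up probes and differences for one contrast.
example_dmps = [
    ("chr1", 1000, "probe_A", 0.35),
    ("chr2", 500, "probe_B", -0.42),
    ("chr1", 200, "probe_C", -0.30),
]
more_methylated, less_methylated = dmps_to_bed(example_dmps)
print("\n".join(more_methylated))
print("\n".join(less_methylated))
```

Files in this layout can be passed directly to region-based tools for the downstream analyses mentioned above.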

4 Notes

ShinyÉPICo enables interactive analysis of methylation arrays, providing graphs to aid in analysis decision-making. However, choosing the best options at each step is up to the user. This chapter and the shinyÉPICo and minfi vignettes in Bioconductor contain useful information for interpreting the graphs and making these decisions. In this section, I add some considerations and recommendations for frequent situations.

First, in projects with a large number of samples, quality control is one of the main issues. The amount or integrity of DNA in some of the samples may be low, especially if patient samples are involved. The shinyÉPICo signal plot makes it possible to detect potentially failed samples at a glance. Generally, samples hybridized in the same batch should have a similar signal. In uncertain cases, it is advisable to continue the analysis with all samples and check later, by exploring the data with the PCA plot, whether the suboptimal samples differ significantly from the rest.

Another key decision in the analysis is the method of normalization. For this, the proportion of expected methylation changes must be taken into account. For example, differentiation processes or primary cell samples from patients show relatively small changes in proportion to the 450,000/850,000 positions of the Illumina methylation arrays. In these cases, in our experience, the Noob + Quantile method performs very well, finely equalizing the overall methylation profiles of the samples. In contrast, if global methylation changes are expected, as occurs in cancer cells, it is necessary to use alternative methods, such as Functional Normalization or Illumina normalization.

Another key step in the analysis is the choice of the parameters of the limma linear model. In particular, the choice of appropriate covariates and interactions can make a big difference in the outcome. A valuable resource for this choice is the correlations and PCA plots in the Normalization tab. If a significant correlation of any covariate with the methylation data is observed, the covariate should be incorporated into the linear model.

Finally, validation of the results after finishing the analysis requires biological expertise related to the hybridized samples. Tools such as GREAT [18], which calculates the enrichment of CpG lists in functional categories, and HOMER [19], which calculates the enrichment of transcription factor motifs, can be very useful to determine whether the results are meaningful.

Funding

O.M.-P. holds an i-PFIS PhD Fellowship [IFI17/00034] from Acción Estratégica en Salud 2013–2016 ISCIII, co-financed by Fondo Social Europeo.

References

1. Goldberg AD, Allis CD, Bernstein E (2007) Epigenetics: a landscape takes shape. Cell 128:635–638. https://doi.org/10.1016/j.cell.2007.02.006
2. de la Calle-Fabregat C, Morante-Palacios O, Ballestar E (2020) Understanding the relevance of DNA methylation changes in immune differentiation and disease. Genes (Basel) 11. https://doi.org/10.3390/genes11010110
3. Jones PA (2012) Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 13:484–492. https://doi.org/10.1038/nrg3230
4. Lev Maor G, Yearim A, Ast G (2015) The alternative role of DNA methylation in splicing regulation. Trends Genet 31:274–280
5. Chow JC, Yen Z, Ziesche SM, Brown CJ (2005) Silencing of the mammalian X chromosome. Annu Rev Genomics Hum Genet 6:69–92
6. Lister R, Pelizzola M, Dowen RH et al (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462:315–322. https://doi.org/10.1038/nature08514
7. Neri F, Rapelli S, Krepelova A et al (2017) Intragenic DNA methylation prevents spurious transcription initiation. Nature 543:72–77. https://doi.org/10.1038/nature21373
8. Smith ML, Baggerly KA, Bengtsson H et al (2013) Illuminaio: an open source IDAT parsing tool for Illumina microarrays. F1000Research 2:264. https://doi.org/10.12688/f1000research.2-264.v1
9. Aryee MJ, Jaffe AE, Corrada-Bravo H et al (2014) Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30:1363–1369. https://doi.org/10.1093/bioinformatics/btu049


10. Müller F, Scherer M, Assenov Y et al (2019) RnBeads 2.0: comprehensive analysis of DNA methylation data. Genome Biol 20:55. https://doi.org/10.1186/s13059-019-1664-9
11. Ritchie ME, Phipson B, Wu D et al (2015) Limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43:e47. https://doi.org/10.1093/nar/gkv007
12. Du P, Kibbe WA, Lin SM (2008) lumi: a pipeline for processing Illumina microarray. Bioinformatics 24:1547–1548. https://doi.org/10.1093/bioinformatics/btn224
13. Morante-Palacios O, Ballestar E (2021) shinyÉPICo: a graphical pipeline to analyze Illumina DNA methylation arrays. Bioinformatics. https://doi.org/10.1093/bioinformatics/btaa1095
14. Ritchie ME, Diyagama D, Neilson J et al (2006) Empirical array quality weights in the analysis of microarray data. BMC Bioinf 7:261. https://doi.org/10.1186/1471-2105-7-261
15. Law CW, Chen Y, Shi W, Smyth GK (2014) Voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol 15:R29. https://doi.org/10.1186/gb-2014-15-2-r29
16. Phipson B, Lee S, Majewski IJ et al (2016) Robust hyperparameter estimation protects against hypervariable genes and improves power to detect differential expression. Ann Appl Stat 10:946–963. https://doi.org/10.1214/16-AOAS920
17. Martorell-Marugán J, González-Rumayor V, Carmona-Sáez P (2019) MCSEA: detecting subtle differentially methylated regions. Bioinformatics 35:3257–3262. https://doi.org/10.1093/bioinformatics/btz096
18. McLean CY, Bristor D, Hiller M et al (2010) GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol 28:495–501. https://doi.org/10.1038/nbt.1630
19. Heinz S, Benner C, Spann N et al (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell 38:576–589. https://doi.org/10.1016/j.molcel.2010.05.004

Chapter 3

Predicting Chromatin Interactions from DNA Sequence Using DeepC

Ron Schwessinger

Abstract

The genome 3D structure is central to understanding how disease-associated genetic variants in the noncoding genome regulate their target genes. Genome architecture spans large-scale structures determined by fine-grained regulatory elements, making it challenging to predict the effects of sequence and structural variants. Experimental approaches for chromatin interaction mapping remain costly and time-consuming, limiting their use for interrogating changes of chromatin architecture associated with genomic variation at scale. Computational models to predict chromatin interactions have either interpreted chromatin at coarse resolution or failed to capture the long-range dependencies of larger sequence contexts. To bridge this gap, we previously developed deepC, a deep neural network approach to predict chromatin interactions from DNA sequence at megabase scale. deepC employs dilated convolutional layers to simultaneously achieve a large sequence context and interpret the DNA sequence at single base pair resolution. Transfer learning of convolutional weights trained to predict a compendium of chromatin features across cell types allows deepC to predict cell type-specific chromatin interactions from DNA sequence alone. Here, we present a detailed workflow to predict chromatin interactions with deepC. We detail the necessary data pre-processing steps, guide the user through deepC model training, and demonstrate how to employ trained models to predict chromatin interactions and the effect of sequence variations on genome architecture.

Key words Machine learning, Deep neural networks, Gene regulation, Chromatin interactions, Genomic variation, DeepC

1 Introduction

Mammalian gene regulation is mediated through an intricate network of regulatory DNA elements comprising promoters and distal elements, such as insulators and enhancers, that may be located megabases away from the genes they regulate. Enhancers physically interact with their target promoters as they regulate gene expression [1, 2]. The genome is folded to minimize interactions between enhancers and nontarget promoters, providing one key driver of enhancer-promoter specificity [1]. Therefore, to identify

Pedro H. Oliveira (ed.), Computational Epigenomics and Epitranscriptomics, Methods in Molecular Biology, vol. 2624, https://doi.org/10.1007/978-1-0716-2962-8_3, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


the target promoter of a regulatory element of interest, we must either measure or predict the 3D organization of the surrounding locus. Methods such as Hi-C [3] are powerful approaches for sampling the physical interactions of the genome, and such studies have provided convincing evidence for the role of CTCF in organizing the genome into self-interacting domains [4, 5]. However, CTCF binding varies little between tissues, so it remains unclear how chromatin interactions are regulated in a programmed and cell type-specific manner. Understanding these aspects is critical not only for understanding basic biology but also for the interpretation of sequence variants in regulatory regions, which often underlie common human diseases. Computational models sophisticated enough to predict 3D folding directly from DNA sequence can be used to predict the consequences of sequence variation in silico at a scale that is infeasible for experimental approaches. Generating chromatin interaction maps is still expensive in terms of required cell numbers, material, and time. Interaction maps, especially at high resolution, are usually only affordable for a limited number of cell types and individual genomes. In contrast, computational models can be trained on genome-wide chromatin interactions over the reference genome and, once generalization to unseen sequences has been confirmed, can be used to predict the effect of genomic variation at a large scale. Moreover, such models enable systematic mutation of DNA sequences in silico. For example, the effects of deleting all annotated enhancers, promoters, and CTCF sites genome-wide can be predicted to assess which elements are critically important for predicting chromatin interactions from DNA sequence and are therefore likely to be functionally important for 3D genome folding [6].
Proposed methods for predicting large-scale but intricate chromatin architecture in silico are based on polymer models [7, 8] or other coarse-grained encodings of chromatin features such as CTCF binding, histone modifications, and transcription [9, 10]. Such models inherently lack the ability to predict the effect of small genetic variations without the use of intermediate models. In contrast, DNA sequence-based approaches focus on predicting interactions using window-to-window-based encodings [11–13], neglecting the interdependence of chromatin architecture. For example, insulator elements located adjacent to or in between enhancer and promoter elements are likely to influence their interaction frequency. Overcoming the combined limitations of existing approaches requires learning to predict the multiscale, convoluted nature of chromatin architecture from DNA sequence patterns that are sensitive to changes at base pair resolution. Machine learning approaches using deep neural networks have emerged as promising tools for such tasks. Specifically, convolutional neural networks (CNNs) have proven powerful in genomics due to their ability to


learn local patterns and re-apply them over larger sequence contexts [14–16]. Dilated convolutions [17, 18] have enabled the efficient aggregation of information over large spatial contexts while maintaining resolution. Where prediction problems proved too challenging or training data too sparse for ab initio learning, transfer learning [19] has enabled effective model training. Layers or entire networks can be pre-trained on an auxiliary, more tractable task. The learned parameters can then be used to seed a model for training on a more challenging task. Previously, we reported deepC [6], a neural network that combines dilated convolutions and transfer learning to predict chromatin interactions from DNA sequence at megabase scale. The combination of large-scale sequence context and base pair resolution allows deepC models to predict alterations of chromatin architecture caused by genomic variation ranging from large structural variations to single base pair variants. Users should note that the field of chromatin architecture prediction is continuing to develop. Concurrently with deepC, Fudenberg et al. proposed Akita [20], a neural network for predicting chromatin interactions of multiple cell types jointly from DNA sequence without transfer learning. More recent developments have focused on increasing prediction accuracy [21, 22], on homing in specifically on enhancer-promoter interactions [23, 24], or on increasing the utility for application to new cell types by using chromatin features as input rather than DNA sequence [25]. In this chapter, we describe the core deepC workflow for formatting Hi-C data, training models, and using them to predict chromatin interactions and estimate the impact of sequence variation (Fig. 1). In brief, Hi-C data in the form of sparse intrachromosomal contact matrices are encoded as a vertical zigzag pole of chromatin interactions associated with the center of a megabase-scale genomic window (Fig. 2).
The chromatin interactions are normalized with regard to their linear distance, yielding a Hi-C skeleton that highlights domain boundaries. Using these pre-processed data as a training set, deepC models are trained to predict chromatin interactions using DNA sequence as input while holding out chromosomes for testing and validation. The network architecture of the base deepC model is detailed in Table 1. To train a deepC model, the user will need to supply the pre-trained convolutional filter weights used for transfer learning. We provide pre-trained weights for human and mouse. After evaluating trained models on holdout chromosomes, the final models are used to predict chromatin interactions over regions of interest. By comparing the predictions over reference- and variant-harboring DNA sequences, the impact of variants on chromatin architecture can be estimated. The models may also be used to fine-map chromatin interactions where the original Hi-C may not

[Fig. 1 panel labels: Hi-C Interactions; Hi-C Skeleton; Chromosome; Pre-Processing; Concatenate; Sparse Contact Matrices; deepC Training Set; Hi-C Skeleton Files; Pre-trained Weights; Model Training; Validation; deepC Model; deepC Query Files (Reference, Variant); Prediction; Predictions (Reference, Variant); Quantification]

Fig. 1 Overall workflow for deepC analysis. Sparse Hi-C contact matrices are pre-processed, including distance normalization to Hi-C skeleton, and encoding for deepC training (see Fig. 2) on a chromosome basis. Processed chromosomes are concatenated to form a deepC training set. Using the training set and pre-trained convolutional weights from chromatin feature models, a cell type-specific deepC model is trained. After model validation, chromatin interactions over reference and variant regions can be predicted and their difference quantified. As shortcuts, pre-processed training data and pre-trained models are available

[Fig. 2 labels: bins numbered 7, 6, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 7 across the window; Target Output Vector entries 1:1, 2:1, 2:2, 3:2, 3:3, 4:3, 4:4, 5:4, 5:5, 6:5, 6:6, 7:6, 7:7]

Fig. 2 Overview of the deepC encoding of Hi-C data. Genomic windows of megabase size are binned into equal-sized bins matching the Hi-C data resolution. All pairwise interactions in a vertical zigzag pole over the center of the window are used as the target output vector, which a deepC model predicts using the sequence underlying the whole window. By shifting the pole and the window by bin-sized increments, the whole Hi-C map up to a window-sized linear distance is recovered


Table 1 Details of the default deepC model architecture. A module of 1D convolutional layers with max pooling and ReLU activation is followed by a module of 1D dilated convolutions with residual connections and gated activations, as used in Oord et al. 2016 [42], followed by flattening and a fully connected layer. The Seeded column indicates whether the convolutional filters of the layer should be seeded with pre-trained weights

Layer  Type              Hidden units  Filter width  Max pool width  Activation  Residual  Seeded
1      1D Conv           300           8             4               ReLU        No        Yes
2      1D Conv           600           8             5               ReLU        No        Yes
3      1D Conv           600           8             5               ReLU        No        Yes
4      1D Conv           900           4             5               ReLU        No        Yes
5      1D Conv           900           4             2               ReLU        No        Yes
6      1D Conv           100           1             1               ReLU        No        No

Layer  Type              Hidden units  Filter width  Dilation rate   Activation  Residual  Seeded
7      1D Dilated Conv.  100           3             1               Gated       Yes       No
8      1D Dilated Conv.  100           3             2               Gated       Yes       No
9      1D Dilated Conv.  100           3             4               Gated       Yes       No
10     1D Dilated Conv.  100           3             8               Gated       Yes       No
11     1D Dilated Conv.  100           3             16              Gated       Yes       No
12     1D Dilated Conv.  100           3             32              Gated       Yes       No
13     1D Dilated Conv.  100           3             64              Gated       Yes       No
14     1D Dilated Conv.  100           3             128             Gated       Yes       No
15     1D Dilated Conv.  100           3             256             Gated       Yes       No
16     1D Dilated Conv.  100           3             1               Gated       Yes       No

17  Fully Connected  0  Gated  Yes  No  Output size

sufficiently recover chromatin domains. Pre-trained models allow users to start directly with predictions. Throughout the chapter, links to the relevant sections of the deepC repository (https://github.com/rschwess/deepC) are listed. Sample data used to run the example commands and tutorials are either part of the repository or accessible via download links. These include example Hi-C data for human [4] and mouse [26] as well as example DNase-seq and CTCF ChIP-seq data [27]. Moreover, all data necessary to run the outlined examples as well as a snapshot of the deepC GitHub repository are available via Zenodo under https://doi.org/10.5281/zenodo.5785805.

2 Materials

2.1 Data

2.1.1 Hi-C Data

For training deepC models, intrachromosomal contact frequency matrices are required. The matrices need to be at the resolution desired for the final deepC model (also see Note 1). Contact matrices can be obtained from standard Hi-C data analysis pipelines such as HiC-Pro [28] and are often published alongside the raw sequencing data. Contact frequencies may be provided as raw or normalized interaction frequencies, for example, after iterative correction and eigenvector decomposition (ICE) normalization [29]. The deepC workflow was designed to handle intrachromosomal chromatin interactions. The deepC data processing functions accept contact matrices in two sparse matrix formats:

(a) A three-column tab-separated file per chromosome listing [start_coord_of_window1 start_coord_of_window2 interaction_frequency].

(b) Two files per chromosome in HiC-Pro output style: a three-column tab-separated matrix file [index_of_window1 index_of_window2 contact_frequency] and a coordinate file with three tab-separated columns [chromosome position index] linking the indices used in the matrix file to genomic windows.

For data pre-processing, we recommend processing individual contact matrices for each chromosome. Whole genome contact matrices in style (a) can usually be split by a simple grep command:

    grep -P "chr1\s+" whole_genome.matrix > chr1_contact.matrix

For splitting HiC-Pro style (b) contact matrices, the user may want to use the perl helper script from the deepC repository (https://github.com/rschwess/deepC/tree/master/formatted_data_links). Example to extract a chr2 intrachromosomal contact matrix:

    perl ./deepC/helper_for_preprocessing_and_analysis/extract_hicpro_matrix.pl \
      --bin insitu_k652_5000_abs.bed \
      --matrix insitu_k652_5000_iced.matrix \
      --outmatrix insitu_k652_5000_iced.chr2.matrix \
      --outbed insitu_k652_5000_abs.chr2.bed \
      -chr chr2

Good starting points for published Hi-C data are Rao et al. [4] (GSE63525) for human and Bonev et al. [26] (GSE96107) for mouse data. Note that Hi-C matrix formats usually list equally sized genomic windows, and the genomic coordinate identifying a window usually refers to the first genomic position of that window (the leftmost base).
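The relationship between the two accepted sparse formats can be sketched in Python. This is a minimal illustration rather than code from the deepC repository: the function name `hicpro_to_coordinate_format` is hypothetical, and the sketch assumes that each coordinate line lists the chromosome first, the window start second, and the window index last.

```python
def hicpro_to_coordinate_format(bed_lines, matrix_lines, chrom):
    """Convert HiC-Pro style (b) records to style (a) for one chromosome.

    Assumption: every coordinate line is tab-separated with the chromosome
    in the first field, the window start in the second field, and the
    window index in the last field.
    """
    # Map each window index on the requested chromosome to its start coordinate.
    index_to_start = {}
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom:
            index_to_start[int(fields[-1])] = int(fields[1])

    records = []
    for line in matrix_lines:
        idx1, idx2, freq = line.rstrip("\n").split("\t")
        idx1, idx2 = int(idx1), int(idx2)
        # Keep only pairs where both windows lie on the requested chromosome.
        if idx1 in index_to_start and idx2 in index_to_start:
            records.append((index_to_start[idx1], index_to_start[idx2], float(freq)))
    return records


# Toy input: two 5 kb windows on chr1 and one on chr2.
bed = ["chr1\t0\t5000\t1", "chr1\t5000\t10000\t2", "chr2\t0\t5000\t3"]
matrix = ["1\t1\t12.5", "1\t2\t3.0", "2\t3\t1.0", "3\t3\t8.0"]
chr1_records = hicpro_to_coordinate_format(bed, matrix, "chr1")
```

Each returned tuple corresponds to one line of format (a). Interchromosomal pairs, such as the contact between windows 2 and 3 in the toy input, are dropped because deepC handles intrachromosomal interactions only.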


2.1.2 Pre-trained Convolutional Filter Weights for Transfer Learning

Pre-trained convolutional filter weights for human and mouse are available from https://github.com/rschwess/deepC/tree/master/formatted_data_links in .npz format. Users who choose to pre-train their own CNN on chromatin features are referred to the deepHaem GitHub repository (https://github.com/rschwess/deepHaem), which includes detailed tutorials on data formatting and training of CNNs for chromatin features. Once trained, CNN filter weights can be saved by modifying and running the run_save_weights.py script.

2.1.3 Trained Models

To skip the data processing and training steps, the user may want to download pre-trained deepC models. Pre-trained deepC models for seven human cell lines and one mouse cell line are provided via the deepC repository (https://github.com/rschwess/deepC/tree/master/models). Every directory contains a hyperparameter file listing the exact parameters the model has been trained with. It also contains three model.* files that together comprise a saved TensorFlow model.

2.1.4 Additional Data

1. Chromosome sizes file: a two-column tab-separated file listing [chromosome size_in_bp], indicating the chromosome size for every chromosome named in the Hi-C matrices used. These files can be downloaded from the UCSC Sequence and Annotation Database [30] (https://hgdownload.soe.ucsc.edu/downloads.html), for example, https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes
2. Whole genome fasta file and a corresponding fasta index. For the tutorials, a chr17-only fasta file as well as a whole genome file are available from https://github.com/rschwess/deepC/tree/master/formatted_data_links. For other genomes, the user should download the respective genome-wide fasta file from the UCSC database (e.g., https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/) and index it, for example, with samtools faidx [31].
3. (Optional) Bigwig files of chromatin features for adding coverage tracks to the chromatin interaction plots.
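As an illustration of how the chromosome sizes file is used, the following Python sketch parses the two-column format and counts how many complete windows fit on a chromosome when the window is shifted in bin-sized steps. The function names are hypothetical, and the assumption that the first window starts at position 0 is ours, not a documented property of the deepC helper functions.

```python
def read_chrom_sizes(lines):
    """Parse [chromosome size_in_bp] lines into a dictionary."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, size = line.split("\t")[:2]
        sizes[chrom] = int(size)
    return sizes


def count_complete_windows(chrom_size, window_size, bin_size):
    """Number of full windows when sliding by bin_size from position 0.

    Assumption: windows start at 0, bin_size, 2 * bin_size, ... and a window
    is only counted if it fits entirely within the chromosome.
    """
    if chrom_size < window_size:
        return 0
    return (chrom_size - window_size) // bin_size + 1


# Toy chromosome of 20 kb with 10 kb windows and 5 kb bins.
sizes = read_chrom_sizes(["chrToy\t20000"])
n_windows = count_complete_windows(sizes["chrToy"], 10000, 5000)
```

In the toy example, the complete windows start at 0, 5,000, and 10,000 bp; a window starting at 15,000 bp would extend past the chromosome end and is not counted.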

2.2 Software

The core deepC framework was written in Python 3. To run model training and predictions, the user will need Python 3.5+ and the following Python packages:

1. tensorflow (tensorflow-gpu) [32]. TensorFlow with GPU support is preferable for predictions and essential for model training. deepC was developed under TensorFlow 1.8 but also supports 1.14+ and 2.1+.
2. numpy [33]
3. h5py (http://www.h5py.org)
4. pysam (https://github.com/pysam-developers/pysam)
5. pybedtools [34] and a compatible version of bedtools [35] installed and accessible from the command line

The data pre-processing and visualization capabilities were implemented in R. The user will need R version 3.4.4+ (not tested for earlier versions) and the following packages installed:

1. tidyverse (https://www.tidyverse.org)
2. optparse
3. RColorBrewer
4. cowplot (https://github.com/wilkelab/cowplot)
5. rtracklayer

In addition, a working version of perl must be accessible from the command line, as perl helper scripts are called from within R to speed up data processing.

2.3 Hardware

For training deepC models, a CUDA-capable GPU is required. We trained models on a single NVIDIA Titan V card with 12 GB of video memory but were restricted to a batch size of 1. Users with more video memory available are encouraged to increase the batch size.

3 Methods

3.1 Data Preprocessing

3.1.1 Pre-processing Using the Wrapper Script

Hi-C contact matrices can be pre-processed for deepC model training by using the provided wrapper script (./helper_for_preprocessing_and_analysis/wrapper_preprocess_hic_data.R) or by following the steps outlined in the pre-processing tutorial (./tutorials/tutorial_format_HiC_data_for_deepC.html).

1. Run the wrapper script from the command line, pointing to the chromosome-wise Hi-C input, the chromosome sizes file, and the directory of the deepC helper scripts.
2. Indicate the bin.size (resolution) at which the Hi-C data have been analyzed, which will determine the deepC resolution (also see Note 2).
3. Indicate the window.size, which limits the linear distance of the deepC model to train, usually 1 Mb + 1 × bin.size, that is, 1,005,000 bp for a 5 kb resolution model.
4. Indicate the name of the chromosome processed, matching the names in the chromosome sizes file.


5. Indicate if example plots of the Hi-C data and/or the skeleton-transformed data should be saved using --plot.hic and --plot.skeleton, respectively. Provide the start and end coordinates of the region to plot and the size ratio of the plot to produce.
6. The wrapper script will:
   (a) Extract all Hi-C interactions at a linear distance within the specified window size.
   (b) Convert the sparse matrix to the vertical zigzag pole encoding for deepC.
   (c) Remove genomic windows that have a median interaction value of zero. To turn this behavior off, use the --keep.median.zero option (also see Note 3).
   (d) Impute interaction windows with an observed frequency of 0 with the median interaction value of a 5 × 5 neighborhood.
   (e) Transform the interaction frequencies to skeleton interactions. To format the Hi-C data for deepC without skeleton transformation, use --no.transform (also see Note 4).
   (f) Format and output single chromosome data to a text file.
   (g) (Optional) If indicated, produce an example plot of the Hi-C or skeleton-transformed data over the specified genomic region.
7. Check if other pre-trained CNN filters may be desirable (also see Note 5).

Example command:

    Rscript ./deepC/helper_for_preprocessing_and_analysis/wrapper_preprocess_hic_data.R \
      --hic.matrix=gm12878_primary_chr17_5kb.contacts.KRnorm.matrix \
      --chromosome.sizes=hg19_chrom_sizes.txt \
      --sample=GM12878 \
      --bin.size=5000 \
      --window.size=1005000 \
      --chrom=chr17 \
      --helper=./deepC/helper_for_preprocessing_and_analysis \
      --plot.hic \
      --plot.skeleton \
      --plot.start=2e+06 \
      --plot.end=5000000 \
      --plot.height=6 \
      --plot.width=8
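Steps 6(c) and 6(d) can be illustrated with a short Python sketch. The actual implementation lives in the R helper functions of the repository; the version below is only a conceptual stand-in. It assumes the data are arranged as a matrix with one row per genomic window and one column per position in the zigzag pole, that the 5 × 5 neighborhood is clipped at the matrix edges, and that the median is taken over all observed values in the neighborhood, including zeros. The function names are hypothetical.

```python
from statistics import median


def drop_zero_median_windows(matrix):
    """Keep only rows (windows) whose median interaction value is non-zero."""
    return [row for row in matrix if median(row) != 0]


def impute_zeros(matrix, size=5):
    """Replace zero entries with the median of their size x size neighborhood.

    Assumptions: the neighborhood is clipped at the matrix edges, and the
    median is computed from the original (not yet imputed) values.
    """
    half = size // 2
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    imputed = [row[:] for row in matrix]
    for i in range(n_rows):
        for j in range(n_cols):
            if matrix[i][j] != 0:
                continue
            neighborhood = [
                matrix[r][c]
                for r in range(max(0, i - half), min(n_rows, i + half + 1))
                for c in range(max(0, j - half), min(n_cols, j + half + 1))
            ]
            imputed[i][j] = median(neighborhood)
    return imputed


# Toy example: a 5 x 5 block of fours with a single missing (zero) entry.
toy = [[4] * 5 for _ in range(5)]
toy[2][2] = 0
filled = impute_zeros(toy)
kept = drop_zero_median_windows([[0, 0, 1], [1, 2, 3]])
```

Filtering before imputing mirrors the order of steps 6(c) and 6(d): windows dominated by zeros are removed first, so the remaining isolated zeros are filled from informative neighbors.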


3.1.2 Pre-processing Manually

To run the data pre-processing steps manually, the user may follow the outlined workflow. The corresponding R code is detailed in https://github.com/rschwess/deepC/blob/master/tutorials/tutorial_format_HiC_data_for_deepC.html

1. Load the libraries tidyverse, cowplot, and RColorBrewer.
2. Source the R functions for Hi-C and deepC analysis published in the repository (./helper_for_preprocessing_and_analysis/functions_for_HiC.R and ./helper_for_preprocessing_and_analysis/functions_for_deepC.R).
3. Define the bin.size of the Hi-C data and the window.size to be used for formatting.
4. Define the number of percentiles used in the skeleton transformation. Default = 10.
5. Load the chromosome sizes file.
6. Read the Hi-C data using one of the two supported sparse matrix formats.
7. Remove all interactions between genomic windows more distal than the defined window.size.
8. Create a windowed chromosome template using the defined window.size and bin.size.
9. Remove all incomplete genomic windows that are smaller than the desired window.size.
10. Map the sparse contact map interactions to the deepC vertical zigzag pole format (see Fig. 2), where each window is assigned the pairwise interaction frequencies at increasing linear distance from the center (1:1, 2:1, 2:2, 3:2, 3:3, ...).
11. Calculate the median interaction value of the vertical zigzag pole per window.
12. Remove windows with a median interaction frequency of zero.
13. Impute pairwise interaction frequencies of zero with the median frequency of a 5 × 5 neighborhood.
14. Apply the skeleton transformation. Convert the interaction frequencies into unequal percentiles: 0–20%; 20–40%; 40–50%; 50–60%; 60–70%; 70–80%; 80–85%; 85–90%; 90–95%; 95–100%.
15. (Optional) Plot Hi-C data or the skeleton for a selected region.
16. Collapse the skeleton interaction values into a single comma-separated string per window.
17. Store in a four-column tab-separated text file listing [chromosome start_coordinate end_coordinate interaction_frequencies] with no header.
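The percentile binning of step 14 can be sketched in Python as follows. This is a conceptual illustration, not the repository's R implementation. It assumes that the values passed in belong to one linear distance (so that the transformation normalizes for distance), that the percentile rank of a value is the share of values strictly below it, and that the ten classes are labeled 1 to 10; the function name and the class labels are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Upper percentile bounds of the first nine classes, as listed in step 14.
PERCENTILE_BOUNDS = (20, 40, 50, 60, 70, 80, 85, 90, 95)


def skeleton_classes(values):
    """Assign each value to one of ten unequal percentile classes (1 to 10).

    Assumption: `values` holds the interaction frequencies observed at a
    single linear distance, so ranks are comparable within the list.
    """
    ordered = sorted(values)
    n = len(ordered)
    classes = []
    for value in values:
        # Percentile rank: percentage of values strictly below this value.
        percentile_rank = 100.0 * bisect_left(ordered, value) / n
        # Number of class bounds at or below the rank gives the class index.
        classes.append(bisect_right(PERCENTILE_BOUNDS, percentile_rank) + 1)
    return classes


# Toy example: 100 distinct frequencies, so each value's rank equals the value.
example = skeleton_classes(list(range(100)))
```

Because the upper classes are narrower (5% each above the 80th percentile) than the lower ones (20% each below the 40th), the strongest interactions are resolved more finely than the background.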

3.1.3 Collate Data for Training

Both pre-processing workflows will produce a four-column tab-separated file listing [chromosome start_coordinate end_coordinate interaction_frequencies], in which the interaction frequencies column lists the comma-separated skeleton-transformed frequencies of the vertical zigzag pole over the center of the indicated genomic window. It is recommended to process intrachromosomal interaction matrices per chromosome. To construct a full training dataset, concatenate all chromosome output files into a single file:

    cat coords_and_hic_skeleton_5kb_chr*_IMR90.bed > training_set_5kb_IMR90.txt
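Before training, it can be useful to check that every line of the collated file carries the expected number of values. The Python sketch below enumerates the zigzag pole pairs in the order given in the text (1:1, 2:1, 2:2, 3:2, 3:3, ...) and validates one training line against the window and bin size. Both function names are hypothetical, and the reading of the pair labels (bins counted outward from the window center, with the center bin labeled 1) is our interpretation of Fig. 2.

```python
def zigzag_pairs(bins_per_side):
    """Enumerate the zigzag pole pair labels 1:1, 2:1, 2:2, 3:2, 3:3, ...

    Assumption: bins are counted outward from the window center, with the
    center bin labeled 1, so `bins_per_side` includes the center bin.
    """
    pairs = [(1, 1)]
    for k in range(2, bins_per_side + 1):
        pairs.append((k, k - 1))
        pairs.append((k, k))
    return pairs


def parse_training_line(line, window_size, bin_size):
    """Parse one four-column training line and check its vector length."""
    chrom, start, end, values = line.rstrip("\n").split("\t")
    frequencies = [float(v) for v in values.split(",")]
    expected = window_size // bin_size
    if len(frequencies) != expected:
        raise ValueError(
            "expected %d values, found %d" % (expected, len(frequencies))
        )
    return chrom, int(start), int(end), frequencies


# A 1,005,000 bp window at 5 kb resolution has 201 bins: the center bin plus
# 100 bins on either side, giving 201 zigzag pole entries.
pole = zigzag_pairs(101)
toy_line = "chrToy\t0\t15000\t" + ",".join(["1"] * 3)
parsed = parse_training_line(toy_line, window_size=15000, bin_size=5000)
```

With the center bin labeled 1, a pole over n labeled bins has 2n − 1 entries, which for n = 101 gives the 201 outputs of a 5 kb model with a 1,005,000 bp window.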

3.2 Training a Model

deepC models are trained using the run_training_deepCregr.py script (also see Note 6). The user may follow the tutorial in the deepC repository (https://github.com/rschwess/deepC/blob/master/tutorials/tutorial_train_a_model.md) and use the example bash script provided as a guide (https://github.com/rschwess/deepC/blob/master/tutorials/example_script_deepc_train.sh). Default values for each parameter are outlined therein. A minimal training set of IMR90 skeleton chromatin interactions at 5 kb resolution is provided (minimal_training_set_example_IMR90.txt). It only contains 100 training instances per chromosome and therefore will not produce meaningful models but is intended to test the training procedure. For hyperparameters, see Notes 7–13.

1. Define the path to the local deepC repository copy.
2. Link to the training set file, which is the combination of all processed chromosomes.
3. Select test and validation chromosomes. Performance on test chromosomes will be checked after each epoch. Validation chromosomes will be held out from the training process entirely.
4. Choose how often the training script should report training loss.
5. Define the number of output classes (entries in the output vector). For example, a 5 kb resolution model with a window size of 1,005,000 bp has 201 outputs.
6. Specify the window size of the processed data and hence the model.
7. Define the keep probability for the dropout layers in the first module.
8. Limit the maximum number of epochs and maximum number of chromosomes over which the training should be run. Note that the training script will stop if one of the limits is reached (also see Note 13).


9. Define the number of training chromosomes after which the test chromosome performance should be evaluated and checkpoints saved.
10. Define training hyperparameters: learning rate, L2 norm strength, and ADAM [36] parameters (also see Note 8).
11. Define the batch size (also see Note 9).
12. Define the network architecture of the first module of convolutional layers with max pooling. Define the number of convolutional layers and the respective number of hidden units, filter width, and max pooling width. Importantly, to employ transfer learning, the dimensions of the first convolutional layers should match the dimensions of the convolutional layer weights provided in terms of the number of hidden units and the filter widths. Larger dimensions are supported, in which case the excess filter weights are sampled from the existing weights. Smaller dimensions are not supported. See Table 1 and the example training script for the dimensions of the provided human and mouse transfer learning weights. In addition, the number of hidden units of the last convolutional layer must match the number of hidden units in the dilated convolutional layers (dilation_units) (also see Notes 7 and 11).
13. Define the network architecture of the second, dilated convolutional module. Specify the successive dilation rates, the number of hidden units per layer, and the width of the convolutional filters. Importantly, the first dilated convolutional layer with a dilation rate of 1 is automatically applied and should not be specified explicitly. Specify if residual connections between the dilated layers should be applied (also see Note 7).
14. Define the transfer learning settings. Select whether to apply transfer learning, specify which convolutional filters of the first module should be seeded with pre-trained filter weights, and link to the weights file.
15. Set additional training options. Specify if the order of training examples per chromosome should be shuffled before each training epoch. Select the data type in which to temporarily store the DNA sequence and link to an indexed whole genome fasta file from which the training script should extract the DNA sequence.
16. Indicate if base pairs that are soft masked in the provided reference genome (lower case letters) should be used during training (also see Note 12).
17. Start the training and monitor training progress via tensorboard (also see Note 10).


18. Inspect the finished run via tensorboard. Ensure the training and test loss have converged and that the test loss is lower than the training loss. Checkpoints are saved automatically if the current test loss is smaller than the previously lowest test loss. Thus, the checkpoint with the highest training iteration number is the best available model in the output directory.
19. Validate trained models by predicting regions or entire chromosome maps over the holdout validation chromosomes using run_deploy_shape_deepCregr.py. See "Predicting chromatin interactions" for more details.
20. (Optional) If further training of the saved model is required, restart the training process. Link the training script with the path to the previously best model and enable the reloading of an existing model.

Example command:

    python ./deepC/tensorflow2.1plus_compatibility_version/run_training_deepCregr.py \
      --data_file ./minimal_training_set_example_IMR90.txt \
      --train_dir ./minimal_imr90_training \
      --test_chroms chr12,chr13 \
      --validation_chroms chr16,chr17 \
      --report_every 1 \
      --num_classes 201 \
      --bp_context 1005000 \
      --learning_rate 0.0001 \
      --l2_strength 0.001 \
      --max_epoch 1 \
      --max_chroms 18 \
      --save_every_chrom 3 \
      --keep_prob_inner 0.8 \
      --batch_size 1 \
      --conv_layers 6 \
      --hidden_units_scheme 300,600,600,900,900,100 \
      --kernel_width_scheme 8,8,8,4,4,1 \
      --max_pool_scheme 4,5,5,5,2,1 \
      --dilation_scheme 2,4,8,16,32,64,128,256,1 \
      --dilation_units 100 \
      --dilation_width 3 \
      --dilation_residual=True \
      --epsilon 0.1 \
      --seed_weights=True \
      --seed_scheme 1,1,1,1,1,0 \
      --seed_file ./saved_conv_weights_human_deepc_arch.npy.npz \
      --shuffle=True \
      --store_dtype bool \
      --whg_fasta ./hg19.fa \
      --use_softmasked=False \
      --gpu 0
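The way the example parameters fit together can be checked with some simple arithmetic, sketched here in Python. The numbers are taken from the example command above; the bin size of 5,000 bp comes from the 5 kb resolution stated in the text. The calculation rests on two assumptions that are ours rather than documented properties of the training script: the convolutions preserve sequence length, with pooling stride equal to the pooling width, and the receptive field of the dilated stack follows the standard formula for stride-1 dilated convolutions.

```python
# Values from the example training command.
bp_context = 1005000
bin_size = 5000  # 5 kb resolution of the example training set
max_pool_scheme = [4, 5, 5, 5, 2, 1]
dilation_scheme = [2, 4, 8, 16, 32, 64, 128, 256, 1]
dilation_width = 3

# Number of output classes: one per bin-sized step across the window.
num_classes = bp_context // bin_size

# Total down-sampling of the first module (assumes stride equals pool width).
pooling_factor = 1
for width in max_pool_scheme:
    pooling_factor *= width

# Sequence length entering the dilated module (assumes length-preserving
# convolutions, so only pooling shortens the sequence).
pooled_length = bp_context // pooling_factor

# The first dilated layer with rate 1 is applied automatically, so it is
# prepended here. Each layer widens the receptive field by
# (filter width - 1) * dilation rate positions.
all_dilations = [1] + dilation_scheme
receptive_field = 1 + sum((dilation_width - 1) * d for d in all_dilations)

print(num_classes)      # 201
print(pooling_factor)   # 1000
print(pooled_length)    # 1005
print(receptive_field)  # 1025
```

Under these assumptions, each position entering the dilated module summarizes 1 kb of sequence, and the receptive field of the dilated stack (1,025 positions) spans the full pooled window (1,005 positions). This is consistent with the stated aim of aggregating information over the whole megabase-scale window. Users who change --bp_context or --max_pool_scheme may want to repeat this check and adjust --dilation_scheme accordingly.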

3.3 Predicting Chromatin Interactions

3.3.1 Run Predictions

Using a trained model, chromatin interaction predictions are run using the Python scripts run_deploy_shape_deepCregr.py and run_deploy_shape_combination_deepCregr.py. Details of script parameters and usage are listed in the prediction tutorial (https://github.com/rschwess/deepC/blob/master/tutorials/tutorial_predict_and_plot.html). The resulting deepC predictions can be visualized using the wrapper script (https://github.com/rschwess/deepC/blob/master/helper_for_preprocessing_and_analysis/wrapper_plot_deepc_predictions.R) or by following the manual steps in the tutorial; both are implemented in R. Trained models are available to download from the deepC repository and the Zenodo archive.

1. Construct the query files for predictions (also see Notes 14–17). The basic format is a four-column tab-separated file indicating [chromosome start_position end_position variant_to_apply]. The first three columns define the genomic position of interest in bed-like, 0-based, half-open coordinates. The fourth column indicates the sequence variation to apply to the region. Here, the keyword reference indicates that the sequence should be extracted from the provided reference genome. A single dot (no quotations) “.” indicates that the outlined sequence should be deleted. DNA bases from the [A, C, G, T, N] alphabet indicate that the outlined genomic window should be replaced with the provided bases, regardless of whether they match the reference genome. Flanking regions required for the predictions will be extracted from the provided reference genome.
2. Select the appropriate run script to use:
   (a) run_deploy_shape_deepCregr.py: processes all regions and variants indicated in the query file separately, producing one output per line.
   (b) run_deploy_shape_combination_deepCregr.py: applies all variants indicated in the query file and produces one output of predicted chromatin interactions over the resulting DNA sequence, spanning all listed genomic windows.
3. Run the respective prediction script (also see Note 18).
   (a) Link to the input query file.
   (b) Define an output directory and a name_tag for the output files.
   (c) Link to the trained deepC model to use.
   (d) Link to the indexed genome fasta that should be used to extract the DNA sequence.
   (e) Specify whether bases that are soft-masked in the fasta file should be used or ignored.
   (f) Indicate the base pair context, that is, the window size used by the deepC model and for the predictions.
   (g) Specify over how many base pairs of the regions flanking the query window chromatin interactions should be predicted.
   (h) Indicate the number of classes (output vector dimension) and the bin.size of the deepC model.
   (i) Specify whether to run the prediction on a GPU, if available, or on a CPU only.
4. The output files are plain text files. The three header lines, marked by a leading #, indicate: (1) the query, the genomic window, and the variants applied; (2) the relative coordinates of the predicted interactions (position of the vertical zigzag pole) as determined by the genomic window, the added flanking base pairs, and the variants applied; (3) the number of base pairs by which all genomic coordinates following the applied variants need to be adjusted. The remaining lines list (tab-separated) the genomic windows of window.size and the predicted chromatin interactions of the respective central zigzag pole.

Example command:

python ./deepC/tensorflow2.1plus_compatibility_version/run_deploy_shape_deepCregr.py \
    --input example_region_short.bed \
    --out_dir ./test_predict_out \
    --name_tag predict \
    --model ./model_deepCregr_5kb_GM12878/model \
    --genome ./hg19.fa \
    --use_softmasked=False \
    --bp_context 1005000 \
    --add_window 500000 \
    --num_classes 201 \
    --bin_size 5000 \
    --run_on gpu
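The four-column query format described in step 1 is easy to generate programmatically. The following is a minimal sketch, assuming only what is stated above (tab-separated columns, 0-based half-open coordinates, and a fourth column that is reference, a single dot, or bases from A, C, G, T, N). The helper name make_query_line is hypothetical and not part of deepC, and the requirement that start is smaller than end is an assumption of this sketch.

```python
import re

# Fourth-column values other than "reference" and "." must be plain DNA bases.
BASES = re.compile(r"[ACGTN]+")


def make_query_line(chrom, start, end, variant="reference"):
    """Return one tab-separated query line for a 0-based, half-open window."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end (0-based, half-open)")
    if variant not in ("reference", ".") and not BASES.fullmatch(variant):
        raise ValueError("variant must be 'reference', '.', or bases from A, C, G, T, N")
    return "\t".join([chrom, str(start), str(end), variant])


query_lines = [
    make_query_line("chr17", 71000000, 72000000),          # reference window
    make_query_line("chr17", 71000000, 71000001, "A"),     # single base pair change
    make_query_line("chr17", 71000000, 71000011, "."),     # deletion of the window
    make_query_line("chr17", 71000000, 71000001, "GATAA"), # insertion
]
print("\n".join(query_lines))
```

Writing the printed lines to a file gives a query that can be passed to --input; each line is predicted separately or in combination, depending on the run script chosen in step 2.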

3.3.2 Visualize Predictions Using the Wrapper Script

The wrapper script can be used to plot Hi-C data with or without skeleton transformation, predicted chromatin interactions for a reference and a variant prediction, the corresponding differential predicted interactions, and up to three 1D genomic signals from bigwig tracks. All plots indicated will be stacked. An example output of the wrapper plots is shown in Fig. 3. For exact parameter names, see the --help message of the wrapper script and the detailed readme file (https://github.com/rschwess/deepC/tree/master/tutorials).

[Figure: stacked panels, from top to bottom: Hi-C, Skeleton, Reference, Variant, Difference, and DHS and CTCF Coverage tracks. Vertical axis of the interaction panels: Genomic Distance [kb], 0 to 1000. Horizontal axis: genomic position, 71,100,000 to 72,300,000.]

Fig. 3 Example output of Hi-C data and deepC predictions using the plotting wrapper script. From top to bottom, shown are Hi-C data, distance normalized Hi-C skeleton, reference sequence prediction, prediction over a 350 bp deletion of a CTCF site, differential plot reference – variant, and DNase-seq (DHS, DNase hypersensitivity) and CTCF ChIP-seq coverage tracks. All data and predictions are based on GM12878

1. Define a name tag for the plot and a name for the output directory.
2. Link to the deepC repository helper script directory.
3. Specify the bin.size and the window.size of the deepC model used.
4. Set the genomic region to plot by defining the chromosome, start position, and end position.
5. Set the width and height of the final plot and indicate the relative heights (0–1) of the individual plot parts.
6. Select which plots to produce and link to the relevant input files if a plot type was selected:
   (a) To plot Hi-C data without skeleton transformation, supply a deepC pre-processed file of untransformed data, as obtained by running the pre-processing wrapper with the --no.transform flag. Alternatively, the Hi-C data can be processed from the same Hi-C input files that the pre-processing wrapper supports (also see Note 19).
   (b) To plot skeleton data, provide the deepC formatted interactions, for example, as output from the pre-processing wrapper script. Alternatively, the skeleton transformation can be applied on the fly from Hi-C data input.
   (c) To plot deepC predictions, link to the output files of the prediction Python scripts. The user can supply and plot a reference and a variant prediction.
   (d) Differential plots (reference – variant) can be calculated and plotted using the supplied reference and variant definition.
   (e) To plot 1D genomic signals, supply bigwig files.
7. (Optional) Indicate whether to also calculate the mean absolute interaction difference per pairwise interaction within the window.size.
8. (Optional) Provide colors for the 1D genomic tracks.
9. (Optional) Set individual titles for the plot components.
10. (Optional) Check for additional comments and workflows in Notes 20–25.

Example command:

Rscript ./deepC/helper_for_preprocessing_and_analysis/wrapper_plot_deepc_predictions.R \
    --sample=gm1278_test \
    --out.dir='.' \
    --helper=./deepC/helper_for_preprocessing_and_analysis/ \
    --bin.size 5000 \
    --window.size 1005000 \
    --chrom=chr17 \
    --plot.start=71150000 \
    --plot.end=72250000 \
    --plot.width=12 \
    --plot.height=16 \
    --rel.heights='0.75,0.75,0.75,0.75,0.75,1' \
    --plot.hic \
    --plot.skeleton \
    --plot.deepc.ref \
    --plot.deepc.var \
    --plot.deepc.diff \
    --calc.deepc.diff \
    --hic.preprocessed=hic_5kb_chr17_GM12878_no_transform.bed \
    --skeleton.input=hic_skeleton_5kb_chr17_GM12878.bed \
    --deepc.ref.input=test_predict_out/class_predictions_predict_1_chr17_71000000_71999999.txt \
    --deepc.var.input=test_variant_out/class_predictions_predict_variant_1_chr17_71706322_71706671.txt \
    --plot.tracks \
    --track.input.1=./dnase_gm12878_encode_uw_merged_w50.bw \
    --track.input.2=./ctcf_gm12878_encode_broad_merged_w50.bw \
    --track.colour.1=#756bb1 \
    --track.colour.2=#e41a1c
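The bin.size and window.size passed to the wrapper must satisfy the constraint in Note 2, namely that window.size / bin.size is an odd number. A minimal sketch of this check is given below. The helper name is hypothetical, and the additional requirement that window.size is an exact multiple of bin.size is an assumption of this sketch rather than something stated by deepC.

```python
def odd_window_size(window_size, bin_size):
    """Return a window size whose number of bins is odd.

    If the number of bins is even, one bin.size is added, as suggested in Note 2.
    """
    if window_size % bin_size != 0:
        raise ValueError("window.size is assumed to be a multiple of bin.size")
    n_bins = window_size // bin_size
    if n_bins % 2 == 0:
        window_size += bin_size
    return window_size


print(odd_window_size(1_005_000, 5_000))  # 201 bins, already odd, returned unchanged
print(odd_window_size(1_000_000, 5_000))  # 200 bins, so one bin is added
```

For the values used in the example commands, 1,005,000 / 5,000 gives 201 bins, which also equals the --num_classes value used for training and prediction.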

3.3.3 Visualize Predictions Manually

To visualize predictions manually, follow the R tutorial (https://github.com/rschwess/deepC/blob/master/tutorials/tutorial_train_a_model.md). The tutorial runs through the basic data reading and plot functionalities that can be combined to produce the user’s desired visualizations. The basic steps are outlined here:

1. Load the libraries tidyverse, cowplot, and RColorBrewer. Load rtracklayer if the plotting of 1D signals from bigwig tracks is desired.
2. Source the R functions for Hi-C and deepC analysis published in the repository (./functions_for_HiC.R and ./functions_for_deepC.R).
3. Define the bin.size of the Hi-C data and the window.size to use for formatting.
4. Define the number of percentiles used in the skeleton transformation. Default = 10.
5. Read deepC prediction files as output from the prediction Python scripts.
6. Read Hi-C data from pre-processed or supported Hi-C input data types.
7. Read the skeleton-transformed Hi-C data from pre-processed files or perform the skeleton transformation using the Hi-C data.
8. Convert every chromatin interaction data frame to a ggplot2-compatible long format for plotting interactions (triangularize).
9. Plot chromatin interactions, observed, normalized, or predicted, using ggplot2 [37] geom_polygon.
10. Import bigwig files by defining a GRanges object [38] over the desired region and sub-setting the bigwig files over those ranges.
11. Plot bigwig signals using ggplot2 geom_area.
12. Create differential plots (reference – variant).
13. Calculate the mean absolute interaction difference per pairwise interaction within window.size.
14. Arrange multiple plots using cowplot.
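Steps 12 and 13 reduce to simple element-wise arithmetic. The tutorial implements them in R; the sketch below uses plain Python only to illustrate the calculation. It assumes that the reference and variant predictions have already been read into two equally shaped nested lists (one row per genomic bin, one value per pairwise interaction) and that missing predictions are encoded as NaN. The data layout and the function name are assumptions.

```python
from math import isnan


def mean_abs_difference(reference, variant):
    """Return the differential map (reference - variant) and its mean absolute value.

    Interactions where either prediction is NaN are excluded from the mean.
    """
    if len(reference) != len(variant):
        raise ValueError("reference and variant must have the same number of rows")
    diff = []
    for ref_row, var_row in zip(reference, variant):
        if len(ref_row) != len(var_row):
            raise ValueError("rows must have the same number of interactions")
        diff.append([r - v for r, v in zip(ref_row, var_row)])
    valid = [abs(d) for row in diff for d in row if not isnan(d)]
    if not valid:
        raise ValueError("no overlapping predictions to compare")
    return diff, sum(valid) / len(valid)


# Toy values, not real deepC output.
ref = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
var = [[1.0, 1.0, 3.0], [2.0, 5.0, float("nan")]]
diff, score = mean_abs_difference(ref, var)
print(score)
```

A single summary value of this kind makes it possible to rank many variants before plotting only the most interesting ones (see Note 24).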

4 Notes

1. DeepC can be used to fine-map chromatin interactions at resolutions at which the Hi-C contact frequencies start to become sparse [39].
2. Due to the vertical zigzag pole interaction data encoding, window.size / bin.size must be an odd number. When experiencing errors in the formatting script, the user should try adding 1 × bin.size to the chosen window.size.
3. By default, deepC pre-processing removes genomic windows where the median of the vertical zigzag pole interactions is 0 to focus on genomic windows with more interaction structure. The user may want to test the result of keeping these windows in the training set.
4. The skeleton normalization was only tested using raw and ICE-normalized contact frequencies. If the user chooses a different normalization, they should ensure that the skeleton transformation still produces an adequate representation of the Hi-C data, for example, by visual inspection or by training a deepC model without skeleton transformation.
5. The provided CNN filter weights were trained on a compendium of chromatin feature data from ENCODE [27] and other publicly available resources. The data composition largely follows deepSEA [14]. We observed good performance of these pre-trained filters for a variety of other tasks. However, users may choose to pre-train their own weights, including or focusing on different cell types or chromatin feature data. We recommend designing this compendium to be as broad as possible. Note that it is not essential for the chromatin feature compendium to include matching chromatin features of the exact cell type for which a deepC model should be trained. For related deep learning tasks, pre-seeding with TF motifs, for example, from the JASPAR motif database [40], has proven successful [41], and users may want to experiment with this approach.
6. For training of a deepC model, GPU support is essential.
7. Hyperparameters for the original deepC model were optimized using grid search. Resources permitting, users are encouraged to further optimize hyperparameters for their given Hi-C data, for example, using more advanced optimization strategies, such as Bayesian optimization.
8. Although we observed good results with a learning rate of 0.0001, the user may need to adjust the learning rate when working with Hi-C data substantially different from the data analyzed in the original deepC paper [4, 26].
9. In the original deepC model/publication, hardware limitations restricted us to a batch size of 1. This batch size leads to substantial fluctuations of the training loss per iteration. Hardware permitting, users are encouraged to trial larger batch sizes.
10. When inspecting deepC model training progression, the training loss is best smoothed over a considerable number of training steps, for example, by using the smoothing slider of tensorboard. In contrast, the test loss is best inspected without smoothing.
11. For the dilated module architecture, the dilation rates should increase exponentially using a consistent base to ensure a fast-growing receptive field and equal coverage. This is usually referred to as a dilated stack [42], and other works have used multiple stacks in their architecture. When using a single stack as in the default deepC architecture, using at least one final dilated layer with a dilation rate of 1 ensures a denser feature representation per position.
12. By default, deepC training ignores soft-masked base pairs. They are treated as N’s. We did not observe obvious differences between models trained with or without soft-masked base pairs. Users particularly interested in repeat sequences should train models including those base pairs.
13. Although deepC models usually converge fast after training on three to six chromosomes, it is recommended to train for at least one full epoch over all training chromosomes. We usually did not see improvements when training for longer than one full epoch.
14. The basic prediction query files allow for a wide range of prediction tasks. Note that query files should use the 0-based half-open coordinates as for bed files. For example:
    (a) To predict the interactions of a larger window of reference sequence, supply a larger window for reference prediction [chr17 71000000 72000000 reference].
    (b) To predict the impact of a single base pair change, run predictions for the reference and the variant case [chr17 71000000 71000001 G] and [chr17 71000000 71000001 A].
    (c) To predict the effect of a deletion, use the dot notation [chr17 71000000 71000011 .] and compare it to the reference prediction [chr17 71000000 71000011 reference].
    (d) To predict the effect of an insertion, provide more DNA bases in the fourth column than the size of the genomic window indicates [chr17 71000000 71000001 GATAA].
    (e) Deletions and insertions lead to a relative shift of the downstream chromatin interactions. Depending on the size of the deletion, this shift effect may mask finer changes in chromatin interactions when plotting differential plots and summarizing the difference. One alternative is to replace the deleted DNA sequence with a matching number of N’s [chr17 71000000 71000011 NNNNNNNNNNN]. This removes the interaction shift effect. However, this encoding does not accurately reflect the resulting DNA structure: border effects, such as the formation of novel binding sites at the deletion junction, will not be captured, and sizes of merged TADs may not be accurately reflected.
15. The size of insertions or replacement DNA sequence is only limited by file size and memory restrictions. Large genomic windows can be replaced by DNA sequences of kilobases and even megabases using the basic format. For example, the user may construct a long fasta sequence with multiple modifications and supply the sequence in the fourth column of the query format to replace a designated large genomic window.
16. Combinatorial mutations can be applied using the run_deploy_shape_combination_deepCregr.py script. Every line of the query file will be applied to the reference DNA sequence in order of the genomic positions. The relative positions of the variants to be applied will stay intact. For example, in the case of two deletions being applied to a large DNA sequence, the coordinates of the second deletion are not affected by the first deletion. Coordinates of the second deletion should be supplied as matching the reference genome coordinates.
17. Adding several hundred kilobases of flanking regions to the predictions usually allows the predicted interactions to be visualized and placed in better context.
18. Running predictions using only a CPU is possible, but GPU access speeds up the process significantly. As a rough guide, predicting chromatin interactions over several megabases may take about 5 min using a GPU and 2 h using a CPU.
19. While the wrapper script can pre-process and skeleton transform the Hi-C data on the fly, it is recommended to pre-process and save the Hi-C data separately using the pre-processing wrapper script. Using pre-processed files speeds up the plotting wrapper script significantly. Users will likely run the wrapper script several times to select plotting components and optimize the visualization.
20. The provided reference prediction does not have to match the reference genome; for example, the user may compare the impact of two different genomic variants relative to each other.
21. It may prove useful to create a single long reference prediction and subset it for plotting over various regions of interest.
22. If a reference prediction is available, interactions over variants need only be computed over the regions within window.size of the variant. To still visualize the variant chromatin interactions in a wider context, use the --fill.deepc.var flag to complete the region of interest with the supplied reference predictions.
23. It is often helpful to highlight the position of a genomic variant, for example, with a vertical dashed line, and to highlight the diagonal from the variant that indicates all pairwise interactions with the variant genomic bin. The slope for this diagonal is 2/bin.size.
24. To assess the potential impact of multiple variants, the user may choose to only calculate the mean absolute interaction difference of all predicted variants and visualize only prioritized ones.
25. When estimating the impact of InDels or other variations that introduce or remove a substantial number of DNA base pairs, the shifting effect needs to be taken into consideration, and N masking might prove more appropriate.
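The shift effect discussed in Notes 14 and 25 follows directly from the query format. The sketch below shows how far coordinates downstream of a variant move, and how N masking keeps them fixed. It assumes 0-based half-open windows as stated in Note 14; the sign convention (negative for a net loss of sequence) and both function names are assumptions and are not taken from the deepC scripts.

```python
def coordinate_shift(start, end, variant):
    """Base pairs by which coordinates downstream of the variant move.

    Negative values indicate a net loss of sequence, positive values a net gain.
    """
    window = end - start  # length of a 0-based, half-open window
    if variant == "reference":
        return 0
    if variant == ".":
        return -window
    return len(variant) - window


def n_mask(chrom, start, end):
    """Replace a deleted window with an equally long run of N's (no shift)."""
    return (chrom, start, end, "N" * (end - start))


print(coordinate_shift(71000000, 71000011, "."))      # deletion of an 11 bp window
print(coordinate_shift(71000000, 71000001, "GATAA"))  # 5 bases replace 1 bp
masked = n_mask("chr17", 71000000, 71000011)
print(coordinate_shift(masked[1], masked[2], masked[3]))
```

Comparing the shift of a deletion with that of its N-masked counterpart makes explicit why differential plots of the masked variant are not dominated by displaced interactions.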

References

1. Hanssen LLP, Kassouf MT, Oudelaar AM et al (2017) Tissue-specific CTCF-cohesin-mediated chromatin architecture delimits enhancer interactions and function in vivo. Nat Cell Biol 19:952–961. https://doi.org/10.1038/ncb3573
2. Deng W, Lee J, Wang H et al (2012) Controlling long-range genomic interactions at a native Locus by targeted tethering of a looping factor. Cell 149:1233–1244. https://doi.org/10.1016/J.CELL.2012.03.051
3. Lieberman-Aiden E, van Berkum NL, Williams L et al (2009) Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326:289–293. https://doi.org/10.1126/science.1181369
4. Rao SSP, Huntley MH, Durand NC et al (2014) A 3D map of the human genome at Kilobase resolution reveals principles of chromatin looping. Cell 159:1665–1680. https://doi.org/10.1016/j.cell.2014.11.021
5. Nora EP, Goloborodko A, Valton AL et al (2017) Targeted degradation of CTCF decouples local insulation of chromosome domains from Genomic compartmentalization. Cell 169:930.e22–944.e22. https://doi.org/10.1016/j.cell.2017.05.004
6. Schwessinger R, Gosden M, Downes D et al (2020) DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods 17:1118–1124. https://doi.org/10.1038/s41592-020-0960-3
7. Bianco S, Lupiáñez DG, Chiariello AM et al (2018) Polymer physics predicts the effects of structural variants on chromatin architecture. Nat Genet 50:662–667. https://doi.org/10.1038/s41588-018-0098-8
8. Buckle A, Brackley CA, Boyle S et al (2018) Polymer simulations of heteromorphic chromatin predict the 3D folding of complex Genomic Loci. Mol Cell 72:786.e11–797.e11. https://doi.org/10.1016/j.molcel.2018.09.016
9. Belokopytova PS, Nuriddinov MA, Mozheiko EA et al (2020) Quantitative prediction of enhancer–promoter interactions. Genome Res 30:72–84. https://doi.org/10.1101/gr.249367.119
10. Zhang S, Chasman D, Knaack S, Roy S (2019) In silico prediction of high-resolution Hi-C interaction matrices. Nat Commun 10:5449. https://doi.org/10.1038/s41467-019-13423-8
11. Whalen S, Truty RM, Pollard KS (2016) Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat Genet 48:488–496. https://doi.org/10.1038/ng.3539
12. Schreiber J, Libbrecht M, Bilmes J, Noble WS (2017) Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture. bioRxiv 103614. https://doi.org/10.1101/103614
13. Li W, Wong WH, Jiang R (2019) DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res 47:e60–e60. https://doi.org/10.1093/nar/gkz167
14. Zhou J, Troyanskaya OG (2015) Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods 12:931–934. https://doi.org/10.1038/nmeth.3547
15. Kelley DR, Snoek J, Rinn JL (2016) Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res 26:990–999. https://doi.org/10.1101/gr.200535.115
16. Alipanahi B, Delong A, Weirauch MT, Frey BJ (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33:831–838. https://doi.org/10.1038/nbt.3300
17. Kelley DR, Reshef YA, Bileschi M et al (2018) Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res 28:739–750. https://doi.org/10.1101/gr.227819.117
18. Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions
19. Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? Adv Neural Inf Proces Syst 4:3320–3328
20. Fudenberg G, Kelley DR, Pollard KS (2020) Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 17:1111–1117. https://doi.org/10.1038/s41592-020-0958-x
21. Zhou J (2021) Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale. bioRxiv 2021.05.19.444847. https://doi.org/10.1101/2021.05.19.444847
22. Zheng X, Wang J, Wang C (2021) HiCArch: a deep learning-based Hi-C data predictor. bioRxiv 2021.11.26.470146. https://doi.org/10.1101/2021.11.26.470146
23. Cao F, Zhang Y, Cai Y et al (2021) Chromatin interaction neural network (ChINN): a machine learning-based method for predicting chromatin interactions from DNA sequences. Genome Biol 22:1–25. https://doi.org/10.1186/S13059-021-02453-5
24. Chen K, Zhao H, Yang Y (2021) Capturing large genomic contexts for accurately predicting enhancer-promoter interactions. bioRxiv 2021.09.04.458817. https://doi.org/10.1101/2021.09.04.458817
25. Das A, Yang R, Gao V et al Epiphany: predicting the Hi-C Contact Map from 1D Epigenomic Data
26. Bonev B, Mendelson Cohen N, Szabo Q et al (2017) Multiscale 3D genome rewiring during mouse neural development. Cell 171:557.e24–572.e24. https://doi.org/10.1016/j.cell.2017.09.043
27. The ENCODE Project Consortium (2004) The ENCODE (ENCyclopedia Of DNA Elements) Project. Science (New York, NY) 306:636–640. https://doi.org/10.1126/science.1105136
28. Servant N, Varoquaux N, Lajoie BR et al (2015) HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biol 16:259. https://doi.org/10.1186/s13059-015-0831-x
29. Imakaev M, Fudenberg G, McCord RP et al (2012) Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods 9:999–1003. https://doi.org/10.1038/nmeth.2148
30. Karolchik D, Hinrichs AS, Furey TS et al (2004) The UCSC table browser data retrieval tool. Nucleic Acids Res 32. https://doi.org/10.1093/NAR/GKH103
31. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079. https://doi.org/10.1093/bioinformatics/btp352
32. Abadi M, Barham P, Chen J et al (2016) TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), p 265–284
33. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13:22–30. https://doi.org/10.1109/MCSE.2011.37
34. Dale RK, Pedersen BS, Quinlan AR (2011) Pybedtools: a flexible Python library for manipulating genomic datasets and annotations. Bioinformatics (Oxford, UK) 27:3423–3424. https://doi.org/10.1093/BIOINFORMATICS/BTR539
35. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842. https://doi.org/10.1093/bioinformatics/btq033
36. Kingma DP, Ba J (2014) Adam: a method for Stochastic Optimization. https://doi.org/http://doi.acm.org.ezproxy.lib.ucf.edu/10.1145/1830483.1830503
37. Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer, New York
38. Lawrence M, Huber W, Pagès H et al (2013) Software for computing and annotating genomic ranges. PLoS Comput Biol 9:e1003118. https://doi.org/10.1371/JOURNAL.PCBI.1003118
39. Schwessinger R, Gosden M, Downes D et al (2020) DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat Methods. https://doi.org/10.1038/s41592-020-0960-3
40. Sandelin A, Alkema W, Engström P et al (2004) JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 32:D91–D94. https://doi.org/10.1093/nar/gkh012
41. Quang D, Xie X (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 1:032821. https://doi.org/10.1101/032821
42. Oord A van den, Dieleman S, Zen H et al (2016) WaveNet: a generative model for Raw Audio. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, p 3437–3440

Chapter 4

Integrating Single-Cell Methylome and Transcriptome Data with MAPLE

Yasin Uzun, Hao Wu, and Kai Tan

Abstract

As a mechanism of epigenetic gene regulation, DNA methylation has crucial roles in developmental and differentiation programs. Thanks to the recently introduced bisulfite-sequencing-based methods, it is possible to profile the entire methylome at single-cell resolution. However, analysis of single-cell methylome data is challenging due to data sparsity and moderate correlation with transcript level. Our recently developed computational framework, MAPLE, addresses these challenges using supervised learning models. Using both genomic sequence and methylation information as the input, MAPLE predicts activity for each gene, which can be used to integrate with transcriptome data from the same cell types. Here, we provide an overview of our method and detailed guidance on how to use it for the integration of methylome and transcriptome data.

Key words DNA methylation, Single-cell, Epigenomics, Multi-omics, Data integration

Pedro H. Oliveira (ed.), Computational Epigenomics and Epitranscriptomics, Methods in Molecular Biology, vol. 2624, https://doi.org/10.1007/978-1-0716-2962-8_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

DNA methylation is a major epigenetic mechanism of gene regulation. Due to its vital role in development and disease, methylation of DNA has long been studied using both microarray and next-generation sequencing technologies. Because of its genome-wide and quantitative nature, bisulfite sequencing is regarded as the gold-standard technology for methylome profiling [1].

The advent of single-cell sequencing methods has revolutionized biology, revealing unprecedented molecular and cellular heterogeneities at genetic, epigenetic, and microenvironmental levels. For DNA methylation, many bisulfite conversion-based protocols have recently been developed for profiling the DNA methylome at single-cell resolution [2–13], leading to an exponential growth of single-cell methylome data.

Despite the rapid accumulation of single-cell methylome data, its interpretation presents formidable challenges, especially for genome-wide datasets. First, unlike transcriptome or chromatin accessibility data, single-cell bisulfite sequencing data is not just concentrated at coding regions or regulatory regions. As a result, such data is inherently sparse, and short genomic regions such as gene promoters contain a limited number of cytosine sites, making the estimation of methylation level across a genomic region difficult. The data sparsity makes imputation a necessity for performing high-dimensional data analysis such as principal component analysis.

Cell type annotation is another challenge for methylome data for the following reasons. First, the correlation between transcription and methylation is dependent on cell type. Although it is generally observed that there is a positive correlation between gene body methylation and transcription level, this is not the case in mammalian neurons [3, 14]. Second, regarding the more general negative correlation between gene promoter methylation and gene transcription level, recent single-cell dual-omic assays profiling both modalities have revealed that only a small number of genes follow this trend. In contrast, promoter methylation of many genes is positively correlated with transcription level or not correlated at all [10, 11, 15, 16]. Hence, there is no single universal relationship between DNA methylation and gene activity level.

One approach to annotating cell types in methylome data is to integrate the methylome data with transcriptome data if matched datasets are available. Several methods have been developed for integrating different single-cell data modalities generated using the same tissue type [17–19]. Although those methods have been demonstrated to perform well, their performance depends on the quality of the gene-by-cell activity matrices used as the input. Unlike transcriptome data, for the reasons described above, generating a reliable gene-by-cell activity matrix is challenging for methylome data. Thus, a reliable method is needed to generate such a matrix for any downstream analysis. Our computational framework, termed MAPLE (Methylome Association by Predictive Linkage to Expression), addresses this problem using a supervised learning approach and generates a reliable methylome-based gene activity matrix [20].

2 Materials

MAPLE uses four types of inputs provided by the user, as follows:

1. CpG methylation calls: This is a set of bed files describing the genomic coordinates of CpG sites and the number of methylated cytosines for each site, as described in Table 1. A separate file is expected per individual cell. They can be generated using the Bismark software [21].
2. Gene coordinates: This is a seven-column bed file describing the genomic coordinates of the genes, as described in Table 2. This annotation is available for human and mouse through the MAPLE project repository at https://github.com/tanlabcode/MAPLE.1.0. The annotation files for other organisms can be obtained from the Ensembl genome archive (https://www.ensembl.org/index.html).
3. CpG contents: This is a gene-by-genomic bin matrix. Each row is a gene, each column is a genomic bin, and the values are the CpG content of the bin, ranging from 0 to 1. The file is an R object file in RDS format. These inputs are also available for human and mouse genes through the MAPLE project repository at https://github.com/tanlabcode/MAPLE.1.0.
4. Trained classifier models: These are the pre-trained models for the three classifiers. The models were trained using single-cell dual-omics datasets. The model file for CNN is in “hd5” format compatible with the R keras package, whereas the models for RF and EN are R objects of “randomForest” and “cv.glmnet” classes, respectively, in the RDS format. The pre-trained models using four different single-cell dual-omics datasets are available through the MAPLE project repository at https://github.com/tanlabcode/MAPLE.1.0 and can directly be used for mammalian species.

Table 1 Format of the input files containing the CpG methylation call information

Chr.   Start     End       Methylation Pct.   Methylated C count   Unmethylated C count
chr1   3003226   3003227   100                2                    0
chr1   3026310   3026311   0                  0                    1
chr1   3060607   3060608   100                1                    0
...    ...       ...       ...                ...                  ...

Table 2 Format of the gene coordinate file containing the annotation information

Chr.    Start       End         Strand   Gene ID           Gene symbol   Gene biotype
chr10   100042193   100081877   -        ENSG00000120054   CPN1          protein_coding
chr10   100150094   100188334   -        ENSG00000107566   ERLIN1        protein_coding
chr10   100347124   100364834   +        ENSG00000099194   SCD           protein_coding
...     ...         ...         ...      ...               ...           ...
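The per-cell CpG call files in the Table 1 layout can be read with a few lines of code. The sketch below is written in Python for illustration only (MAPLE itself is an R tool) and assumes tab-separated columns without a header line; both function names are hypothetical. It also shows that the methylation percentage column is consistent with the two count columns.

```python
def read_cpg_calls(lines):
    """Parse rows laid out as in Table 1.

    Columns: chromosome, start, end, methylation percentage,
    methylated C count, unmethylated C count (assumed tab-separated).
    """
    calls = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom, start, end, pct, meth, unmeth = line.split("\t")[:6]
        calls.append((chrom, int(start), int(end), float(pct), int(meth), int(unmeth)))
    return calls


def percent_methylated(meth, unmeth):
    """Methylation percentage of one site, recomputed from its read counts."""
    total = meth + unmeth
    return 100.0 * meth / total if total else float("nan")


# The three example rows of Table 1.
example = [
    "chr1\t3003226\t3003227\t100\t2\t0",
    "chr1\t3026310\t3026311\t0\t0\t1",
    "chr1\t3060607\t3060608\t100\t1\t0",
]
for chrom, start, end, pct, meth, unmeth in read_cpg_calls(example):
    print(chrom, start, end, pct, percent_methylated(meth, unmeth))
```

In practice, the list of lines would come from opening one file per cell; the very low read counts per site in the example rows illustrate how sparse such data is.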

Yasin Uzun et al.

3 Methods

3.1 Overview of MAPLE

To train a predictive model, MAPLE takes advantage of data generated by single-cell dual-omics assays, which profile gene expression and methylation levels in the same cells. Using such data as inputs, we trained supervised-learning models to predict gene activities from bisulfite sequencing data. Using the trained models, MAPLE infers gene activities from single-cell methylome data alone for each cell, and these activities can in turn be used for downstream analysis.

MAPLE uses the genomic and methylation information gathered from the region extending 5 kb upstream and downstream of the transcription start site (TSS) of each gene. Since the relative importance of methylated cytosines for predicting gene activity depends on their distance to the TSS [20], the promoter regions are divided into 500 bp bins, and two features are computed for each bin. The first feature is CpG content, defined as the ratio of the number of CpG dinucleotides to the total number of dinucleotides in the bin. This feature accounts for the differences in genomic sequence content among different genes. The second feature is CpG methylation rate, defined as the ratio of methylated CpG sites to the total number of CpG sites. By combining these two features for each bin, MAPLE generates a feature set, which is used as the input for predicting gene activity.

As mentioned above, a major challenge in single-cell methylome data analysis is the inherent sparsity of the data. Even very deep sequencing can only cover a small fraction of all CpG sites in the genome, resulting in many bins with too few CpG calls to estimate the methylation rate of the bin and, subsequently, to infer gene activity. MAPLE addresses the sparsity problem by utilizing the concept of the meta-cell [22] as follows. First, methylation rates are computed for all gene-cell pairs by using the overall methylation rate of the 5 kb region flanking the TSS. Missing values are imputed by using the column average of the gene-by-cell methylation rate matrix, that is, for cells with an insufficient number of cytosine calls, the average methylation rate of all cells is used. Next, principal component analysis (PCA) is performed on this methylation rate matrix. Pairwise distances between cells are computed using the top principal components, which explain the greatest variance in the data. Then, for each cell, a meta-cell is generated by combining the neighboring cells based on pairwise distance. The methylation rate of the cell is calculated by using all the cytosine calls in the meta-cell. With this approach, the number of bins with insufficient cytosine calls is vastly reduced and a robust inference can be achieved.

Using the CpG content and methylation rate features computed based on meta-cells, MAPLE predicts the activity for each gene and cell pair. Three classes of statistical classifiers are used for the prediction: random forest, convolutional neural network, and elastic net regression. Gene activities are predicted by all three classifiers, and the mean of the three predictions is used as the final predicted value. We pre-trained each classifier using four single-cell dual-omics datasets [10, 11, 15, 16] and made these models available through the project repository at https://github.com/tanlabcode/MAPLE.1.0.

Single-Cell Methylome-Transcriptome Integration with MAPLE

MAPLE is open-source software implemented in the R statistical programming language and is publicly available under the MIT license. MAPLE predicts gene activities using CpG methylation calls provided by the user in bed file format. In addition to methylation calls, MAPLE requires two annotation files. The first annotation file contains gene coordinates in bed format, which are used for extracting transcription start site (TSS) information. The second annotation file contains the CpG contents of the genomic bins, which are generated using the reference genome sequence of the organism under study.

The first task of MAPLE is generating promoter regions (defined as 5 kb upstream and downstream of the TSS) across the genome. Each promoter region is divided into 500 bp bins. Next, the CpG sites in the promoter region, including both methylated and unmethylated sites, are identified. Meta-cells are computed in the second step. Using the total cytosine calls in a promoter region, the methylation rate of each promoter-cell pair is calculated. PCA is computed for the gene promoter-by-cell methylation matrix, and pairwise distances between the cells are calculated using the principal components explaining the greatest variance. Finally, meta-cells are computed based on the cell-cell pairwise distances.

After the meta-cells are computed, gene activity predictions are made for each cell by each classifier. The mean of the three predictions is used as the single output for each cell-gene pair, and the ensemble predictions are converted into a gene-by-cell activity matrix. This activity matrix can be analyzed directly or integrated with matching transcriptome and/or epigenome data using existing data integration methods such as Seurat [17], Harmony [19], or LIGER [18].

3.2 Methylome Matrix Construction
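Before the step-by-step procedure, the two per-bin features and the meta-cell pooling described in the overview (Subheading 3.1) can be illustrated with a small Python sketch. It is a simplification under stated assumptions rather than MAPLE's implementation: all function names are hypothetical, PCA is skipped (the per-cell profiles below merely stand in for top principal components), and neighbors are chosen by plain Euclidean distance.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cpg_content(seq: str) -> float:
    """Fraction of dinucleotides in a bin sequence that are CpG."""
    seq = seq.upper()
    n_dinuc = len(seq) - 1
    if n_dinuc <= 0:
        return 0.0
    n_cpg = sum(1 for i in range(n_dinuc) if seq[i:i + 2] == "CG")
    return n_cpg / n_dinuc


def methylation_rate(n_meth: int, n_unmeth: int) -> Optional[float]:
    """Methylated calls over all calls, or None when the bin has no calls."""
    total = n_meth + n_unmeth
    return n_meth / total if total > 0 else None


def nearest_neighbors(profiles: Sequence[Sequence[float]], cell: int, k: int) -> List[int]:
    """Indices of the k cells closest to `cell` (Euclidean distance), excluding itself."""
    dists = [(math.dist(profiles[cell], p), j) for j, p in enumerate(profiles) if j != cell]
    return [j for _, j in sorted(dists)[:k]]


def meta_cell_rate(
    counts: Sequence[Tuple[int, int]], cell: int, neighbors: Sequence[int]
) -> Optional[float]:
    """Pool (methylated, unmethylated) counts of a cell and its neighbors for one bin."""
    members = [cell] + list(neighbors)
    n_meth = sum(counts[m][0] for m in members)
    n_unmeth = sum(counts[m][1] for m in members)
    return methylation_rate(n_meth, n_unmeth)


# Toy data: three cells, each with a two-value profile (stand-in for top PCs)
profiles = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
# (methylated, unmethylated) counts of one promoter bin in each cell
bin_counts = [(0, 0), (3, 1), (0, 2)]

neighbors = nearest_neighbors(profiles, cell=0, k=1)
print(cpg_content("ACGTCG"))                  # 2 CpGs out of 5 dinucleotides
print(methylation_rate(*bin_counts[0]))       # no calls in cell 0 alone
print(neighbors, meta_cell_rate(bin_counts, 0, neighbors))
```

Cell 0 has no calls in this bin on its own, but pooling with its nearest neighbor yields a usable rate, which is the point of the meta-cell strategy.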

Cell-specific gene activities are computed using MAPLE in eight steps as follows (Fig. 1):

Fig. 1 Flowchart showing methylome-based gene activity prediction using MAPLE. Green color represents input, blue color represents intermediate results, and yellow color represents the output. CNN convolutional neural network, RF random forest, EN elastic net regression

1. Methylation calls for promoter bins are computed with the compute_binned_met_counts function. The inputs to the function are the gene coordinates file (annot_file) and the directory path containing the CpG call files (cov_dir). The CpG methylation rates of the promoter bins are calculated for each gene and cell pair. The output is a list of two elements. The first element is a matrix of methylated cytosine counts for the promoter bins. Each row corresponds to a cell-gene pair, and each column corresponds to a bin (numbered 1 to 20). The second element is the same as the first one but for unmethylated cytosines.
2. Meta-cells are generated with the compute_meta_cells function. The inputs to the function are the matrices of methylated (df_met) and unmethylated (df_demet) CpG call counts for the promoter bins, which are computed in the first step and saved in the list. The overall methylation rates of the promoters are calculated and then used for PCA. Pairwise cell-cell distances are computed using the top principal components to determine cell neighborhoods. Each cell and its nearest neighbors are defined as a meta-cell. Methylation levels of the meta-cells are calculated using all cells in the neighborhood. The number of nearest neighbors is set using the num_neighbors parameter, which defaults to 20 (see Note 1).
3. Features for gene activity prediction are generated with the get_fr_list function. The inputs are the meta-cell objects (meta_object) computed in the previous step and the CpG content file (cpg_content_file) for promoter bins. Missing values in the methylation rate matrix are imputed by the promoter average of all cells in the dataset. Cell-gene pairs with a large number of bins with missing values, as determined by the max_na_bins parameter (default is 5), can be excluded in this step (see Note 2). The methylation rates and CpG contents are combined into a single feature matrix for elastic net regression and random forest and a three-dimensional array for the convolutional neural network (CNN). The output of this function is a list containing the feature matrix and array.

4. Methylome-based gene activities are computed using the CNN model with the cnn_predict function. The inputs are the list of features computed in step 3 (fr_list) and the hd5-formatted pre-trained CNN model file (model_file). The output is a vector of predicted activities for all cell-gene pairs.
5. Methylome-based gene activities are computed using the elastic net regression model (EN) with the elastic_predict function. The inputs are the same as in step 4, except that the pre-trained model file is in the RDS file format (elastic_model_file) for the cv.glmnet object. The output is a vector of predicted activities for all cell-gene pairs.
6. Methylome-based gene activities are computed using the random forest model (RF) with the rf_predict function. The inputs are the same as in step 4, except that the pre-trained model is in the RDS file format (rf_model_file) for the randomForest object. The output is a vector of predicted activities for all cell-gene pairs.
7. The three predictions (CNN, EN, RF) are combined into one output with the ensemble_predict function. The input is the list of three vectors corresponding to the three predictions obtained in steps 4, 5, and 6 (prediction_list). The final gene activity scores for the cell-gene pairs are computed with this function as the mean of the three predictions (see Note 3).
8. The gene-by-cell activity matrix is constructed using the convert_preds_to_matrix function. The input is the vector of ensemble gene activity predictions, which is converted to a two-dimensional gene-by-cell matrix as the output.

3.3 Downstream Analysis
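To make the last two steps of Subheading 3.2 concrete (averaging the three predictions and reshaping them into a gene-by-cell matrix), a short Python sketch follows. The functions ensemble_mean and to_gene_by_cell are hypothetical stand-ins written for illustration; they are not the R functions ensemble_predict and convert_preds_to_matrix, whose exact signatures are not reproduced here. The cell names, gene order, and prediction values are made up.

```python
from typing import Dict, List, Sequence, Tuple


def ensemble_mean(prediction_list: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of several equally long prediction vectors."""
    n = len(prediction_list[0])
    if any(len(p) != n for p in prediction_list):
        raise ValueError("prediction vectors must have the same length")
    return [sum(vals) / len(prediction_list) for vals in zip(*prediction_list)]


def to_gene_by_cell(
    pairs: Sequence[Tuple[str, str]], scores: Sequence[float]
) -> Dict[str, Dict[str, float]]:
    """Arrange one score per (cell, gene) pair as matrix[gene][cell]."""
    matrix: Dict[str, Dict[str, float]] = {}
    for (cell, gene), score in zip(pairs, scores):
        matrix.setdefault(gene, {})[cell] = score
    return matrix


# One entry per cell-gene pair, in the same order for all three classifiers
pairs = [("cell1", "SCD"), ("cell1", "CPN1"), ("cell2", "SCD"), ("cell2", "CPN1")]
cnn = [1.0, 0.25, 0.5, 0.0]    # made-up CNN predictions
en = [0.5, 0.25, 0.25, 0.75]   # made-up elastic net predictions
rf = [0.0, 0.25, 0.75, 0.0]    # made-up random forest predictions

ensemble = ensemble_mean([cnn, en, rf])
activity = to_gene_by_cell(pairs, ensemble)
print(ensemble)
print(activity["SCD"]["cell1"], activity["CPN1"]["cell2"])
```

A nested dictionary is used only to keep the sketch dependency-free; in practice a labeled matrix type would be the natural container.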

Once the gene activity matrix is generated from the methylome data, any computational tool for single-cell genomic data can be used to analyze it. Here, we describe the data processing steps for Seurat [17], as it is widely used in the biomedical research community.

1. A Seurat object is constructed with the CreateSeuratObject function by providing the MAPLE-predicted activity matrix as the input count matrix and setting the assay name parameter to “MET.”
2. The predicted gene activity matrix is normalized using the SCTransform function [23], providing the initialized object as the input.
3. Highly variable features in the gene activity matrix are selected for dimensionality reduction with the FindVariableFeatures function, using the output generated in step 2.
4. The highly variable features are scaled with the ScaleData function so that the mean activity per gene is 0 and the variance is 1, using the output of step 3 as the input.
5. Principal component analysis (PCA) is performed with the RunPCA function on the output of step 4 to identify the principal components in the activity matrix. A UMAP (Uniform Manifold Approximation and Projection) [24, 25] dimensionality reduction is then generated using the RunUMAP function with the PCA result as the input.
6. A shared nearest neighbor (SNN) graph is constructed for the cells in the population with the FindNeighbors function. Next, cell clusters are identified using the FindClusters function. The Seurat object is provided as the input for both functions.
7. The gene activity pattern of the marker genes is investigated with the FeaturePlot and VlnPlot functions for the individual clusters, providing the object generated in step 6 and the list of marker genes for the expected cell types as the input (see Note 4).
8. Differentially activated genes are identified based on the MAPLE result using the FindMarkers function, providing the object with the clustering result as the input (see Note 5). The results are visualized with the DoHeatmap function. Cell type annotations can be assigned to the clusters identified in the dataset based on the output of steps 7 and 8.
9. A column is added to the metadata of the Seurat object by using the AddMetaData function with the Seurat object as the input, the metadata parameter set to “MET,” and col.name set to “tech” for the purpose of integration.

3.4 Integration

Methylome and matching transcriptome datasets can be integrated by using MAPLE-predicted gene activity matrices as the input for any single-cell integration tool. Here, we describe the integration of MAPLE-predicted gene activities with transcriptome data using Seurat [17] as follows (Fig. 2):

Fig. 2 Flowchart showing the steps for integration of single-cell methylome and transcriptome data. Yellow color represents methylome data, pink color represents transcriptome data, orange color represents integrated data

1. For the matching transcriptome dataset, execute steps 1–5 described in “Downstream Analysis.” Set the assay name parameter to “RNA” and the “tech” metadata column to “RNA.”
2. Select the integration features with the SelectIntegrationFeatures function, providing the transcriptome and methylome objects as the input.
3. Prepare the two objects for integration with the PrepSCTIntegration function, using the list of two objects and the selected features as input.
4. Find a set of anchors to be used for integrating the two modalities using the FindIntegrationAnchors function and the list generated in step 3 as the input. Set the normalization method to “SCT.”
5. Perform data integration using the IntegrateData function, providing the anchor set object generated in step 4 as the input and setting the normalization method to “SCT” (see Note 6).
6. Compute the data representation in lower dimensions by executing the RunPCA and RunUMAP functions consecutively on the integrated dataset object. Visualize the embedding of the two data modalities in reduced dimensions with the DimPlot function and the “group.by” parameter set to “tech.”
7. Perform co-embedded clustering on the integrated object as described in step 6 in the “Downstream Analysis” section.
8. As in step 7 in the “Downstream Analysis” section, determine the activities of the marker genes in the integrated data. Set the default assay for the integrated object to “SCT” and execute the FeaturePlot and VlnPlot functions with the object and the marker genes as the input. Group the data by clusters by setting the “group.by” parameter to “seurat_clusters” and split the data by modalities with the “split.by” parameter set to “tech.” Check whether the marker activities in the methylome data are consistent with the transcriptome data for the corresponding clusters.
9. Assign the cell types for the cells in the DNA methylome data by using the K-nearest neighbor transcriptome cells in the co-embedded representation with the knn function in R (see Note 7).
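The label transfer in step 9 amounts to a majority vote among the k nearest annotated transcriptome cells in the shared embedding. The following Python sketch illustrates the idea under simplifying assumptions: the function name knn_label, the two-dimensional coordinates, and the cell type names are all invented for the example, and Euclidean distance is used. It is not the R knn function referred to above.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_label(
    query: Sequence[float],
    ref_points: Sequence[Sequence[float]],
    ref_labels: Sequence[str],
    k: int,
) -> str:
    """Majority label among the k reference points closest to the query."""
    if not 0 < k <= len(ref_points):
        raise ValueError("k must be between 1 and the number of reference cells")
    ranked = sorted((math.dist(query, p), i) for i, p in enumerate(ref_points))
    votes = Counter(ref_labels[i] for _, i in ranked[:k])
    # On a tie, the label encountered first (nearest cell) wins
    return votes.most_common(1)[0][0]


# Hypothetical co-embedding coordinates of annotated transcriptome cells
rna_coords = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
rna_labels = ["typeA", "typeA", "typeA", "typeB", "typeB", "typeB"]

# Hypothetical coordinates of methylome cells in the same embedding
met_coords = [(0.05, 0.1), (4.9, 5.0)]
met_labels: List[str] = [knn_label(q, rna_coords, rna_labels, k=3) for q in met_coords]
print(met_labels)
```

As Note 7 points out, the choice of k matters: with many small cell populations, a large k lets abundant neighboring types outvote the correct one.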

4 Notes

1. The choice of the number of nearest neighbors (num_neighbors) for meta-cell computation depends on the sequencing depth and the dataset size. For datasets with deep sequencing and a small number of cells, the number of nearest neighbors can be smaller than the default value. If the sequencing depth is low and the number of cells is high, the number of nearest neighbors should be increased.
2. The choice of the maximum number of allowable bins with missing values for gene activity prediction (max_na_bins) depends on the sequencing depth. For deeply sequenced samples, this parameter can be lowered for more robust inference and increased otherwise.
3. Although the ensemble prediction of three classifiers provides consistent and robust results, the gene activity predictions of individual classifiers can provide satisfactory output, depending on the input. Using a single classifier can speed up processing when time and computing resources are limited. Please refer to the results in our study for details [20].
4. Since not all genes are strongly regulated by DNA methylation, inspecting a limited number of canonical marker genes may not be sufficient for accurate cell cluster annotation. Hence, we recommend that users use a marker gene set that is as comprehensive as possible.
5. The differences in the methylome activity across the cell clusters may be subtle compared to the transcriptome data. For this reason, we recommend using a lower setting for logfc.threshold than the default value (0.25) in the FindMarkers function.
6. The typical number of cells in a methylome sample (on the order of hundreds) is considerably lower than in a typical transcriptome sample (on the order of thousands). For this reason, the default value of 100 neighbors for the smoothing parameter k.weight in the IntegrateData function may be too high and can cause a run-time error. In that case, this parameter setting should be gradually lowered.

7. The number of nearest neighbors (k) in the knn function should be set based on the expected sample heterogeneity. If a high number of cell types with small numbers of cells are expected in the data (which can be inferred partially using the transcriptome data alone), k should be set low; otherwise, it must be set to a higher value.

References

1. Li Y, Tollefsbol TO (2011) DNA methylation detection: bisulfite genomic sequencing analysis. In: Tollefsbol TO (ed) Epigenetics protocols, vol 791. Humana Press, Totowa, pp 11–21
2. Ahn J, Heo S, Lee J, Bang D (2021) Introduction to single-cell DNA methylation profiling methods. Biomolecules 11(7):1013. https://doi.org/10.3390/biom11071013
3. Luo C, Keown CL, Kurihara L et al (2017) Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357(6351):600–604. https://doi.org/10.1126/science.aan3351
4. Luo C, Rivkin A, Zhou J et al (2018) Robust single-cell DNA methylome profiling with snmC-seq2. Nat Commun 9:3824. https://doi.org/10.1038/s41467-018-06355-2
5. Clark SJ, Smallwood SA, Lee HJ et al (2017) Genome-wide base-resolution mapping of DNA methylation in single cells using single-cell bisulfite sequencing (scBS-seq). Nat Protoc 12(3):534–547. https://doi.org/10.1038/nprot.2016.187
6. Kobayashi H, Koike T, Sakashita A et al (2016) Repetitive DNA methylome analysis by small-scale and single-cell shotgun bisulfite sequencing. Genes Cells 21(11):1209–1222. https://doi.org/10.1111/gtc.12440
7. Farlik M, Sheffield NC, Nuzzo A et al (2015) Single-cell DNA methylome sequencing and bioinformatic inference of epigenomic cell-state dynamics. Cell Rep 10(8):1386–1397. https://doi.org/10.1016/j.celrep.2015.02.001
8. Mulqueen RM, Pokholok D, Norberg SJ et al (2018) Highly scalable generation of DNA methylation profiles in single cells. Nat Biotechnol 36(5):428–431. https://doi.org/10.1038/nbt.4112
9. Bian S, Hou Y, Zhou X et al (2018) Single-cell multiomics sequencing and analyses of human colorectal cancer. Science 362(6418):1060–1063. https://doi.org/10.1126/science.aao3791
10. Angermueller C, Clark SJ, Lee HJ et al (2016) Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat Methods 13(3):229–232. https://doi.org/10.1038/nmeth.3728
11. Clark SJ, Argelaguet R, Kapourani C-A et al (2018) scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat Commun 9:781. https://doi.org/10.1038/s41467-018-03149-4
12. Gu C, Liu S, Wu Q et al (2019) Integrative single-cell analysis of transcriptome, DNA methylome and chromatin accessibility in mouse oocytes. Cell Res 29(2):110–123. https://doi.org/10.1038/s41422-018-0125-4
13. Pott S (2017) Simultaneous measurement of chromatin accessibility, DNA methylation, and nucleosome phasing in single cells. eLife 6. https://doi.org/10.7554/eLife.23203
14. Mo A, Mukamel EA, Davis FP et al (2015) Epigenomic signatures of neuronal diversity in the mammalian brain. Neuron 86(6):1369–1384. https://doi.org/10.1016/j.neuron.2015.05.018
15. Hernando-Herraez I, Evano B, Stubbs T et al (2019) Ageing affects DNA methylation drift and transcriptional cell-to-cell variability in mouse muscle stem cells. Nat Commun 10(1):4361. https://doi.org/10.1038/s41467-019-12293-4
16. Argelaguet R, Clark SJ, Mohammed H et al (2019) Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 576(7787):487–491
17. Stuart T, Butler A, Hoffman P et al (2019) Comprehensive integration of single-cell data. Cell 177(7):1888.e21–1902.e21. https://doi.org/10.1016/j.cell.2019.05.031
18. Welch JD, Kozareva V, Ferreira A et al (2019) Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177(7):1873.e17–1887.e17. https://doi.org/10.1016/j.cell.2019.05.006
19. Korsunsky I, Millard N, Fan J et al (2019) Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16(12):1289–1296. https://doi.org/10.1038/s41592-019-0619-0
20. Uzun Y, Wu H, Tan K (2020) Predictive modeling of single-cell DNA methylome data enhances integration with transcriptome data. Genome Res 31(1):101–109. https://doi.org/10.1101/gr.267047.120
21. Krueger F, Andrews SR (2011) Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 27(11):1571–1572. https://doi.org/10.1093/bioinformatics/btr167
22. Zhu Q, Gao P, Tober J et al (2020) Developmental trajectory of prehematopoietic stem cell formation from endothelium. Blood 136(7):845–856. https://doi.org/10.1182/blood.2020004801
23. Hafemeister C, Satija R (2019) Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol 20(1):296. https://doi.org/10.1186/s13059-019-1874-1
24. McInnes L, Healy J, Melville J (2018) UMAP: uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426
25. Becht E, McInnes L, Healy J et al (2018) Dimensionality reduction for visualizing single-cell data using UMAP. Nat Biotechnol 37:38–44. https://doi.org/10.1038/nbt.4314

Chapter 5

Quantitative Comparison of Multiple Chromatin Immunoprecipitation-Sequencing (ChIP-seq) Experiments with spikChIP

Enrique Blanco, Cecilia Ballaré, Luciano Di Croce, and Sergi Aranda

Abstract

Chromatin immunoprecipitation coupled with next-generation sequencing (ChIP-seq) is a powerful technique for characterizing the genomic distribution of chromatin-associated proteins, histone posttranslational modifications, and histone variants. However, in the absence of a reference control for monitoring experimental and biological variations, the standard ChIP-seq scheme is unable to accurately assess changes in the abundance of chromatin targets across different experimental samples. To overcome this limitation, combining external spike-in material with the experimental chromatin offers an effective solution for the quantitative comparison of ChIP-seq data across different conditions. Here, we detail (i) the experimental protocol for preparing quality control spike-in chromatin from Drosophila melanogaster cells and (ii) the computational protocol to compare ChIP-seq samples with spike-in based on the use of the spikChIP software.

Key words ChIP-seq, Chromatin, Spike-in, Normalization, Genome bin, ChIP peak, Local regression

1 Introduction

Chromatin is the macromolecular complex of DNA and histone proteins that packs the genome into its basic structural units, the nucleosomes [1]. Within chromatin, a plethora of interacting proteins organize the 3D distribution of the genome, regulate multiple gene expression programs, and coordinate the appropriate transmission of genetic and epigenetic information to cellular progeny [1–5]. Alterations in the functionality of the proteins associated with chromatin are intimately linked to severe developmental diseases and cancer [6, 7]. Due to its biological and pathological relevance, research on chromatin and epigenetics has been a rapidly moving field over the last decade, assisted by the development of novel technologies for the high-throughput molecular analysis of the genome.

Pedro H. Oliveira (ed.), Computational Epigenomics and Epitranscriptomics, Methods in Molecular Biology, vol. 2624, https://doi.org/10.1007/978-1-0716-2962-8_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

Enrique Blanco et al.

Since its development in the 1980s [8–10], the chromatin immunoprecipitation (ChIP) technique has been a widely used method in molecular biology [11]. The power of this technique later increased dramatically with the advent of massively parallel sequencing approaches [12–15]. Indeed, ChIP-seq experiments are currently a standard method to describe the genome-wide distribution maps of transcription factors, chromatin remodelers, and histone modifications [16–20]. The striking impact of the ChIP-seq technology is based on its relative technical simplicity (which allows it to be successfully adopted by most experimental laboratories), its high sensitivity and accuracy for mapping the genomic distribution of proteins, and the solid standardization of the experimental and computational methods needed to analyze such a volume of information efficiently.

The original scheme of ChIP-seq has remained substantially unmodified: chromatin isolation and fragmentation, immunoprecipitation using specific antibodies, DNA purification from protein complexes, library preparation, and parallel sequencing. Notably, ChIP-seq is a semiquantitative method that determines the relative occupancy of one factor on a given genomic region with respect to the rest of the genome. However, direct comparison of ChIP-seq signal strength at the same loci between different conditions (e.g., cell types, metabolic states, or pathological situations) is inaccurate in the absence of an independent internal control to monitor technical variability when performing the ChIP-seq scheme (i.e., variations in the efficiency of immunoprecipitation or library preparation [21]). Additionally, when an internal reference is missing, global changes in chromatin target occupancy might be obscured as a consequence of equilibrating the ChIP-seq libraries before sequencing or of computationally normalizing the sequencing output by the total number of reads [22].

To overcome these limitations, several labs have proposed strategies based on the addition of exogenous material as an independent internal reference control (spike-in), thus providing a feasible solution to accurately normalize ChIP-seq signal across samples [23–26]. Originally developed to correct gene expression measurements in microarrays and RNA-seq experiments [27, 28], the spike-in strategy is based on mixing the experimental sample with an amount of exogenous material (either from another species or synthetically produced) that is constant between experiments. Both the experimental sample and the spike-in are processed and analyzed in parallel. As long as the amount of spike-in ChIP signal is the same in all samples, the observable differences in the experimental samples across conditions can be exclusively attributed to biological variation. Any differences in the spike-in signal can be computationally equilibrated to eliminate technical variability, and this correction factor is then used to normalize the experimental signal of each condition.

ChIP-seq Normalization Using spikChIP

We have recently proposed a practical guideline to help researchers decide when and how to apply the spike-in strategy for normalizing ChIP-seq using mammalian cells [26]. At the experimental level, we recommend using fly material as the spike-in because of: (1) the accuracy of the genome assembly; (2) the extensive characterization of the fly chromatin at the epigenetic level; (3) the evolutionary divergence between fly and mammalian genomes, which enables an unambiguous alignment of the reads [24, 25]; and (4) the relative simplicity of preparing chromatin from fly cells. Moreover, we suggest the use of a second antibody for a fly-specific histone variant (H2Av) to capture the spike-in material [25]. This strategy aims to avoid the cross-reactivity constraint of the experimental antibody and to reduce any potential variability due to competition between the spike-in control and the experimental material, which usually exceeds the amount of spike-in material by far. Moreover, the genomic occupancy profile of the fly-specific H2Av is already characterized [25], which can be extremely useful as an additional control point for assessing ChIP-seq performance.

In this chapter, we detail an efficient procedure for preparing a stock of chromatin from D. melanogaster cells to use as an internal reference for ChIP-seq experiments. In addition, we provide a roadmap for the computational analysis of ChIP-seq data using spikChIP, a stand-alone pipeline designed in Perl and R to perform the systematic normalization of multiple ChIP-seq experiments with spike-in [26]. SpikChIP implements a local regression that enables a gradual normalization from background to positive ChIP signal regions (Fig. 1). This reduces the influence of sequencing noise of spike-in material during ChIP-seq normalization while minimizing the overcorrection of non-occupied genomic regions in the experimental ChIP-seq.
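The basic idea of a spike-in correction factor can be illustrated with a deliberately simplified Python sketch. It applies one global scaling factor per sample, derived from the number of reads aligned to the spike-in genome. This is an assumption made for illustration and is not how spikChIP works internally: spikChIP uses a local regression over genomic bins, as described above. The sample names, read counts, and function names are hypothetical.

```python
from typing import Dict


def spikein_scale_factors(spike_reads: Dict[str, int]) -> Dict[str, float]:
    """One global factor per sample so that all spike-in read counts match the smallest one."""
    if not spike_reads or min(spike_reads.values()) <= 0:
        raise ValueError("every sample needs a positive spike-in read count")
    reference = min(spike_reads.values())
    return {sample: reference / n for sample, n in spike_reads.items()}


def normalize_signal(signal: Dict[str, float], factors: Dict[str, float]) -> Dict[str, float]:
    """Apply each sample's factor to its experimental ChIP signal in one genomic bin."""
    return {sample: value * factors[sample] for sample, value in signal.items()}


# Hypothetical numbers of reads aligned to the fly (spike-in) genome
spike_reads = {"control": 1_000_000, "treated": 2_000_000}
factors = spikein_scale_factors(spike_reads)

# Hypothetical experimental signal (reads in one bin) before correction
bin_signal = {"control": 40.0, "treated": 80.0}
normalized = normalize_signal(bin_signal, factors)
print(factors)
print(normalized)
```

In this toy case the apparent twofold difference between the samples disappears after correction, because the treated sample also yielded twice as many spike-in reads, which points to a technical rather than a biological difference.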

2 Materials

2.1 Cell Culture

1. Drosophila Schneider’s Line 2 (SL2) cells (ATCC, Ref. CRL-1963; Drosophila melanogaster).
2. Heat-inactivated FBS: Fetal bovine serum (Thermo Fisher) inactivated for 30 min at 56 °C.
3. SL2 culture medium: Schneider’s Drosophila medium (Thermo Fisher) supplemented with 10% heat-inactivated FBS plus 1× penicillin/streptomycin (Gibco).
4. 1× PBS buffer.
5. Formaldehyde solution (Sigma).
6. Crosslinking Solution: 1% Formaldehyde in 1× PBS buffer.
7. Quenching Solution: 2 M glycine.
8. Complete EDTA-free Protease Inhibitor Cocktail (PIC; Roche).
9. QIAquick PCR purification kit (Qiagen).
10. Proteinase K 20 mg/mL (Thermo Scientific).
11. Spike-in Antibody (Active Motif), against the Drosophila-specific histone variant, H2Av.
12. Lysis Buffer: 50 mM Tris-HCl pH 8.1; 10 mM EDTA; 1% SDS.

Fig. 1 Scheme representing the normalization applied by SpikChIP. Sequencing data result in a number of reads corresponding to non-occupied genomic regions (background, in gray) and enriched genomic regions (peaks, in orange or red). The SpikChIP software runs by segmenting both the spike-in and experimental genome into bins. Using the ChIP signal for each spike-in bin, SpikChIP calculates distinct correction coefficients (here depicted by the letters α, β, δ, γ). These coefficients are then used to normalize the experimental ChIP-seq values

2.2 Equipment

1. Cell culture hood (i.e., biosafety cabinet).
2. Inverted microscope with 4× and 10× objectives.
3. Incubator set at 26 °C.
4. Refrigerated centrifuges.
5. Micropipettes.
6. Pipettor.
7. Freezers: -20 and -80 °C.
8. Bioruptor Pico (Diagenode) or equivalent.
9. Rotating shaker.
10. Thermo-block.
11. Nanodrop.
12. Complete electrophoresis apparatus.
13. Workstation with a minimum of 8 GB of RAM and 1 TB hard drive.

2.3 Disposables

1. Sterile plastic pipettes.
2. 15 and 50 mL conical tubes.
3. 75 cm² tissue culture-treated flasks.
4. Filter pipette tips.
5. Sterilized Pasteur pipettes.
6. 1.5 mL Eppendorf tubes.

2.4 Software Requirements

1. Operating system: a UNIX command-line platform (macOS or Linux) is required.
2. Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
3. GAWK: https://www.gnu.org/software/gawk
4. MACS2: https://pypi.org/project/MACS2
5. Perl: https://www.perl.org
6. R: https://www.r-project.org (affy and MASS packages are required).
7. SAMtools: http://www.htslib.org
8. SeqCode: https://github.com/eblancoga/seqcode
9. SpikChIP: https://github.com/eblancoga/spikChIP

3 Methods

3.1 Preparation of Spike-in Chromatin for ChIP-seq Experiments

3.1.1 Preparation of Drosophila SL2 Cells

Drosophila S2 cells are cultured in SL2 Culture Medium at 25 °C without additional CO2. Cell cultures should be maintained between 5 × 10⁴ and 4 × 10⁵ cells/cm².

3.1.2 Preparation of Drosophila SL2 Chromatin

1. Collect 4 × 10⁷ S2 cells by centrifugation at 300 × g for 5 min at room temperature (see Note 1).
2. Resuspend the cells in 10 mL of Crosslinking Solution by pipetting up and down and incubate for 10 min with gentle rotation.
3. Add 0.67 mL of Quenching Solution and incubate for 5 min at room temperature with gentle rotation.
4. Collect the cells by centrifugation at 3250 × g for 5 min at 4 °C.
5. Wash the pellets by gently suspending them in 10 mL of ice-cold 1× PBS and then collect them by centrifugation at 3250 × g for 5 min at 4 °C.
6. Repeat step 5 twice. Keep the samples on ice during the centrifugation steps (see Note 2).
7. Suspend the cross-linked cells in 1 mL of Lysis Buffer supplemented with protease inhibitors (1× complete EDTA-free protease inhibitor cocktail) by gently pipetting up and down (avoid foam formation), and then incubate for 10 min on ice.
8. Sonicate the suspension in 15 mL tubes in a refrigerated Bioruptor Pico at 4 °C for 24 cycles (30 s ON/30 s OFF, see Notes 3 and 4).
9. Transfer the suspension to a 1.5 mL Eppendorf tube, clarify by centrifugation at 15,000 × g for 10 min at 4 °C in a tabletop centrifuge, and then transfer the supernatant to a new tube. Fragmented chromatin can be maintained at 4 °C for 16 h; for longer periods, store it at -80 °C while performing the quality control described in Subheading 3.1.3.
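Two quantities implied by the steps above can be checked with simple arithmetic, sketched here in Python. The calculation assumes that the 0.67 mL of 2 M glycine is added directly to the 10 mL of Crosslinking Solution, and that the 75 cm² flasks listed under Disposables are used for culture; both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def final_concentration_mM(stock_M: float, stock_mL: float, sample_mL: float) -> float:
    """Final concentration (mM) after adding stock_mL of stock to sample_mL of sample."""
    return 1000.0 * stock_M * stock_mL / (stock_mL + sample_mL)


def cells_per_flask(density_per_cm2: float, area_cm2: float = 75.0) -> float:
    """Total cells in one flask at a given surface density."""
    return density_per_cm2 * area_cm2


# Glycine concentration during quenching (step 3)
glycine_mM = final_concentration_mM(stock_M=2.0, stock_mL=0.67, sample_mL=10.0)

# Cells per 75 cm2 flask at the lower and upper recommended densities
low = cells_per_flask(5e4)
high = cells_per_flask(4e5)

# Flasks needed to reach 4 x 10^7 cells at the upper density
flasks_needed = math.ceil(4e7 / high)

print(round(glycine_mM, 1), low, high, flasks_needed)
```

Under these assumptions the quenching step brings glycine to roughly 125 mM, and a single flask at the upper density does not hold the 4 × 10⁷ cells required in step 1.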

3.1.3 Quality Control and DNA Quantification from Fragmented Chromatin

1. Collect a 50 μL aliquot of the fragmented chromatin from step 9 of Subheading 3.1.2.
2. Add 150 μL lysis buffer plus 2.5 μL proteinase K (20 mg/mL).
3. Incubate the mixture from 4 h to overnight at 65 °C with vigorous shaking (800 rpm) in a Thermomixer.
4. Purify the DNA using the PCR purification kit and elute in 50 μL.
5. Quantify the purified DNA using a NanoDrop.

ChIP-seq Normalization Using spikChIP


6. Separate 800 ng of purified DNA by electrophoresis on a 1.2% (w/v) agarose gel. Optimal DNA fragmentation should be 100–500 bp. If fragments are larger than 500 bp, include additional sonication cycles until the appropriate fragmentation is reached.
7. If the DNA fragmentation is optimal, store the fragmented chromatin from step 9 of Subheading 3.1.2 in 50 μL aliquots at -80 °C, to minimize freeze-thaw cycles.

3.1.4 Incorporation of Drosophila SL2 Chromatin with Experimental Chromatin Samples

1. Prepare the experimental ChIP reaction mixes containing the experimental chromatin and the antibody of interest, according to the standard procedures.
2. Add the Spike-in chromatin (see Note 5).
3. Add Spike-in antibody. Usually, for a common ChIP reaction [30 μg sample chromatin (DNA) and 5 μg specific antibody], add 1.5 μg Spike-in antibody.
4. Remove 1% of the reaction mixes as input sample and keep at 20 °C.
5. Perform the ChIP and the sequencing using Illumina sequencing platforms according to the standard procedures (see Note 6).

3.2 Computational Analysis of ChIP-seq Data Using Spike-in Chromatin

3.2.1 Generation of the Genome Index for ChIP-seq Mapping

To analyze ChIP-seq samples with exogenous spike-in on a local computer (see Note 7), we must first generate a synthetic genome that combines the chromosomes of both species (Fig. 2; see Note 8). FASTA sequences are downloaded from the UCSC genome browser [29]. For example, chromosomes from the human hg38 and fruit fly dm3 assemblies are retrieved, respectively, from:

Fig. 2 Computational workflow for Subheading 3.2, steps 1–4


https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/
https://hgdownload.soe.ucsc.edu/goldenPath/dm3/chromosomes/

We will tag the names of the spike-in chromosomes for easy identification (e.g., >chr2L in Drosophila melanogaster must be replaced by >chr2L_FLY). The following GAWK command performs this operation on a FASTA file:

% gawk '{if ($0~">") {print $0"_FLY"} else {print $0}}' chrN.fa > chrN_FLY.fa

FASTA sequences of both genomes must be concatenated into a single multi-FASTA file with the UNIX cat command (see Note 9). Finally, genome indexes for read mapping are generated from this multi-FASTA file using a mapping tool such as Bowtie [30] or BWA [31]. If genome.fa is the multi-FASTA file combining the chromosomes of the two species, the following command generates the genome index files and stores them in the output folder:

% bowtie-build genome.fa output/genome
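For readers who prefer a scripted alternative, the tagging and concatenation steps of this subheading can be sketched in Python. This is only an illustration, not part of the original protocol: the function names and the toy sequences are hypothetical, and the sketch assumes that header lines start with ">" and that the tag is appended to the whole header line, as in the GAWK command.

```python
import io

def tag_fasta_headers(lines, tag="_FLY"):
    """Yield FASTA lines, appending `tag` to every header line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line + tag   # e.g. >chr2L becomes >chr2L_FLY
        else:
            yield line         # sequence lines are left untouched

def build_synthetic_genome(experiment_lines, spike_lines, tag="_FLY"):
    """Concatenate experimental records and tagged spike-in records."""
    combined = [line.rstrip("\n") for line in experiment_lines]
    combined.extend(tag_fasta_headers(spike_lines, tag))
    return combined

# Toy sequences, made up for the example (real chromosomes are far longer).
human = io.StringIO(">chr1\nACGT\nGGCC\n")
fly = io.StringIO(">chr2L\nTTAA\n>chrX\nCCGG\n")

genome = build_synthetic_genome(human, fly)
print("\n".join(genome))
```

With real data, the two inputs would be open file handles on the downloaded FASTA files, and the combined lines would be written to genome.fa before running bowtie-build.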

3.2.2 Genome Mapping of each Individual ChIP-seq Sample

The FASTQ raw data file of our ChIP-seq sample including spike-in material must be aligned to the genome index that contains the same species involved in the experiment (see Notes 10 and 11). We will run the following Bowtie command to map the ChIP-seq experiment sample.fastq against the genome index files that include the spike-in, using up to four processors, discarding multi-locus reads, and saving the output in SAM format in the sample.sam file:

% bowtie -p 4 -t -m 1 -S -q genome sample.fastq sample.sam

We will remove unaligned reads and convert the resulting SAM file into BAM format, to save storage space, with the following SAMtools [32] command:

% samtools view -b -F 0x4 -o sample.bam sample.sam

3.2.3 Extraction of Aligned Reads into Distinct Genome and Spike-in Files

To quantify differences in the amount of spike-in reads across conditions and apply the corresponding corrections, in the same proportions, to the reads of the true experiment, we will search the BAM file for the tag described above and extract both classes of reads into two separate SAM files (sample_experiment.sam and sample_spike.sam):

% samtools view -h sample.bam | grep -v FLY > sample_experiment.sam
% samtools view -h sample.bam | grep FLY > sample_spike.sam

Each SAM file will be converted into BAM format with this SAMtools command (see Note 12):

% samtools view -S -b -o sample_experiment.bam sample_experiment.sam
% samtools view -S -b -o sample_spike.bam sample_spike.sam
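The separation of reads in this subheading can also be sketched in Python. The grep commands match the tag anywhere on a line, whereas the sketch below inspects only the reference-name field of each alignment and the SN field of each @SQ header line. It is a minimal illustration under stated assumptions: the function name and the example records are hypothetical, and the input is assumed to be tab-separated SAM text.

```python
def split_sam_by_tag(sam_lines, tag="_FLY"):
    """Split SAM text lines into experiment and spike-in lists by reference name."""
    experiment, spike = [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@SQ"):
            # Route each sequence-dictionary line by its SN (sequence name) field.
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
            target = spike if fields.get("SN", "").endswith(tag) else experiment
            target.append(line)
        elif line.startswith("@"):
            # Other header lines (@HD, @PG, ...) are copied to both outputs.
            experiment.append(line)
            spike.append(line)
        else:
            rname = line.split("\t")[2]   # third SAM column: reference name
            (spike if rname.endswith(tag) else experiment).append(line)
    return experiment, spike

# Made-up records for illustration only.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chr2L_FLY\tLN:23011544",
    "read1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr2L_FLY\t200\t255\t4M\t*\t0\t0\tTTAA\tIIII",
]
experiment, spike = split_sam_by_tag(sam)
print(len(experiment), len(spike))
```

Restricting the test to the reference-name column avoids misclassifying a read whose name happens to contain the tag.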

3.2.4 Identification of ChIP-Enriched Regions on Each Experiment

MACS2 [33] will be used to identify the peaks of ChIP-seq signal in each BAM file generated before for both genomes. For instance, we perform the peak calling of one spike-in file of reads by running this command (the -g option sets the effective genome size, and dm is the preset for Drosophila melanogaster; the --outdir option stores the results in the sample_spike folder read by the next command; see Note 13 for additional flag options):

% macs2 callpeak -t sample_spike.bam -f BAM -g dm -n sample_spike --outdir sample_spike --nomodel --extsize 150

We will select the columns informing about the location of the peaks in the MACS2 output to generate final BED files:

% grep -v "#" sample_spike/sample_spike_peaks.xls | grep -v start | gawk 'BEGIN{OFS="\t"}{if (NF>0) print $1,$2,$3,NR-1}' > sample_spike_peaks.bed
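The same column selection can be sketched in Python. This is an illustration rather than a replacement for the command above: the function name is hypothetical, the example rows are made up, and the column-name row is shortened. It assumes, as the command above does, that the first three columns hold chromosome, start, and end, and it copies the coordinates unchanged.

```python
def macs2_xls_to_bed(xls_lines):
    """Keep chromosome, start, and end of each peak and add a running identifier."""
    peaks = []
    for line in xls_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 3 or fields[1] == "start":
            continue  # skip the column-name row
        peaks.append((fields[0], int(fields[1]), int(fields[2]), len(peaks) + 1))
    return peaks

# Made-up content mimicking the layout of a peaks table (columns shortened).
example_xls = [
    "# comment lines start with a hash",
    "",
    "chr\tstart\tend\tlength\tabs_summit\tpileup",
    "chr2L_FLY\t5001\t5400\t400\t5200\t35.0",
    "chrX_FLY\t120\t480\t361\t300\t12.0",
]
peaks = macs2_xls_to_bed(example_xls)
bed_lines = ["\t".join(str(value) for value in peak) for peak in peaks]
print("\n".join(bed_lines))
```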

3.2.5 Preparation of the Configuration File of spikChIP

SpikChIP is a standalone pipeline written in Perl and R that normalizes multiple ChIP-seq spike-in experiments by a local regression across samples. This approach has been shown to focus the impact of the corrections on the ChIP signal-enriched regions, diminishing secondary effects of the normalization over the background [26]. We will edit the following config.txt configuration file to instruct spikChIP about where to find our collection of input data (files are considered to be in the current directory in this example, Fig. 3; see Note 14):

Sample   BAM experiment     BAM spike-in   BED experiment           BED spike-in
1        1_experiment.bam   1_spike.bam    1_experiment_peaks.bed   1_spike_peaks.bed
...
N        N_experiment.bam   N_spike.bam    N_experiment_peaks.bed   N_spike_peaks.bed
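When many samples are involved, the lines of the configuration file described above can be produced by a short script. The Python sketch below is only illustrative: the function name is hypothetical, the file names follow the naming pattern of this example, and the use of tab-separated columns is an assumption that should be checked against the spikChIP documentation.

```python
def spikchip_config_rows(sample_ids):
    """Return one configuration line per sample, with five columns each."""
    rows = []
    for sid in sample_ids:
        fields = [
            str(sid),                        # sample name
            f"{sid}_experiment.bam",         # BAM file, experimental genome
            f"{sid}_spike.bam",              # BAM file, spike-in genome
            f"{sid}_experiment_peaks.bed",   # peaks, experimental genome
            f"{sid}_spike_peaks.bed",        # peaks, spike-in genome
        ]
        rows.append("\t".join(fields))       # tab delimiter is an assumption
    return rows

config_lines = spikchip_config_rows([1, 2, 3])
print("\n".join(config_lines))
```

Redirecting the printed lines to config.txt yields a file with the layout shown in this subheading.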


Fig. 3 Computational workflow for Subheading 3.2, steps 5–9

3.2.6 Preparation of the Genome Definition File of spikChIP

SpikChIP requires, as a second input, a file listing the chromosomes (and their sizes) that constitute the synthetic genome employed in the previous mapping step. For instance, we will download this file for human (hg38 assembly) from the UCSC genome browser:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/chromInfo.txt.gz

Once we concatenate the corresponding files for both species (e.g., hg38 and dm3), we will edit the final ChromInfo.txt to tag the spike-in genome chromosomes:

chr1        248956422
...
chrY        57227415
chr2L_FLY   23011544
...
chrX_FLY    22422827

3.2.7 Running spikChIP to Normalize a Collection of ChIP-seq Samples

We will run the following command to normalize with spikChIP the collection of ChIP-seq samples listed in the configuration file (see Note 15):

% spikChIP.pl -v config.txt ChromInfo.txt
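Before launching the run, it can be useful to assemble the genome definition file of Subheading 3.2.6 programmatically, so that its chromosome names match the tags used in the genome index. The Python sketch below is only an illustration under several assumptions: the function names are hypothetical, the output is written as two tab-separated columns, and sequences whose names contain an underscore (alternate, fix, random, and unplaced sequences, as discussed in Note 8) are dropped before tagging. The contig size in the example is made up.

```python
def parse_chrom_info(lines):
    """Read (name, size) pairs; any columns after the second are ignored."""
    sizes = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            sizes.append((fields[0], int(fields[1])))
    return sizes

def build_genome_definition(experiment_lines, spike_lines, tag="_FLY"):
    """Keep basic chromosomes of both species and tag the spike-in names."""
    rows = [(name, size) for name, size in parse_chrom_info(experiment_lines)
            if "_" not in name]                      # drop alt/fix/random/Un entries
    rows += [(name + tag, size) for name, size in parse_chrom_info(spike_lines)
             if "_" not in name]
    return [f"{name}\t{size}" for name, size in rows]

human = [
    "chr1\t248956422",
    "chrY\t57227415",
    "chr1_KI270706v1_random\t1000",   # size made up for the example
]
fly = ["chr2L\t23011544", "chrX\t22422827"]

definition = build_genome_definition(human, fly)
print("\n".join(definition))
```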

During the run, spikChIP reports the stage of the analysis that is currently in progress. The main informative messages are the following:

Stage 0. Configuration of the pipeline.
Stage 1. Producing the segmentation of both genomes in bins (1000 bps).
Stage 2. Processing each experiment with different normalization methods.
Stage 3. Distinguishing bins of spike and sample genomes into peaks or bg.
Stage 4. Generating the final boxplots of values.

3.2.8 Interpretation and Usage of spikChIP Output Files

SpikChIP segments both input genomes into bins of the same size, counts the reads of each ChIP-seq sample within each bin, and performs the local normalization on these values, using the corrections needed to adjust the spike-in bins to normalize the bins of the genome employed in the true experiment (see Note 16). Once the analysis is finished, we can explore the final files saved in the results/ folder. We will see the values normalized for each bin from both genomes in every ChIP-seq experiment (rows are bins, and columns are the normalized counts of each ChIP-seq experiment, following the order declared in the configuration file; see Note 17):

% head FINAL_1...N_SPIKCHIP_1000_avg_normalized_sample.txt
chr1*1*1001 value1 ... valueN
...
% head FINAL_1...N_SPIKCHIP_1000_avg_normalized_spike.txt
chr2L_FLY*1*1001 value1 ... valueN
...

We can examine the output files used by spikChIP to generate the boxplots, which segregate bins into peaks and background regions according to the set of peaks declared in the configuration file (only average values are shown here):

% head FINAL_1...N_SPIKCHIP_1000_sample_avg_peaks.txt
% head FINAL_1...N_SPIKCHIP_1000_sample_avg_bg.txt
% head FINAL_1...N_SPIKCHIP_1000_spike_avg_peaks.txt
% head FINAL_1...N_SPIKCHIP_1000_spike_avg_bg.txt
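To make the idea of fixed-size bins concrete, the following Python sketch counts read positions per 1000-bp bin and labels each bin with a key in the chromosome*start*end style shown above. It is not the spikChIP implementation: the function names and read positions are hypothetical, and the bin arithmetic is an assumption chosen so that the first bin reproduces the chr1*1*1001 key; the boundaries that spikChIP uses for subsequent bins may differ.

```python
from collections import Counter

def bin_key(chrom, position, bin_size=1000):
    """Return a 'chrom*start*end' key for the bin holding a 1-based position."""
    start = ((position - 1) // bin_size) * bin_size + 1
    return f"{chrom}*{start}*{start + bin_size}"

def count_reads_per_bin(read_starts, bin_size=1000):
    """Count how many read start positions fall into each bin."""
    counts = Counter()
    for chrom, position in read_starts:
        counts[bin_key(chrom, position, bin_size)] += 1
    return counts

# Made-up read start positions (chromosome, 1-based coordinate).
reads = [("chr1", 15), ("chr1", 999), ("chr1", 1001), ("chr2L_FLY", 2500)]
counts = count_reads_per_bin(reads)
for key, n in sorted(counts.items()):
    print(key, n)
```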

3.2.9 Generation of Genome-Wide Profiles for Graphical Browsers

To explore the normalized signal visually in the UCSC genome browser, we will generate custom tracks in BedGraph format by extracting the appropriate column of a particular spikChIP output file. The following UNIX commands produce the genome-wide profile of the values normalized in the experimental genome for the first ChIP-seq condition of a generic dataset, as declared in the configuration file (see Note 18):

% zcat FINAL_1...N_SPIKCHIP_1000_avg_normalized_sample.txt.gz | sed 's/\*/ /g' | gawk 'BEGIN{print "track type=bedGraph name=AVG_SPIKCHIP_SAMPLE description=AVG_SPIKCHIP_SAMPLE visibility=full maxHeightPixels=60 color=200,0,0"; OFS="\t"}{print $1,$2,$3,$4}' > SPIKCHIP_sample.bg

Before uploading the profile into the browser, we will check its beginning:

% head SPIKCHIP_sample.bg
track type=bedGraph name=AVG_SPIKCHIP_SAMPLE description=AVG_SPIKCHIP_SAMPLE visibility=full maxHeightPixels=60 color=200,0,0
chr1 1 1001 value_in_sample1
...

Finally, we will compress the custom track file to speed up the upload (see Note 19):

% gzip SPIKCHIP_sample.bg
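The conversion of this subheading can also be sketched in Python, which makes it straightforward to choose another sample column or to remove the spike-in tag mentioned in Note 18. The sketch is illustrative only: the function name, its parameters, and the example values are hypothetical, and the input is assumed to consist of a chromosome*start*end key followed by one value per sample.

```python
def to_bedgraph(lines, column=1, strip_tag=None, name="AVG_SPIKCHIP_SAMPLE"):
    """Build bedGraph lines from 'chrom*start*end value1 ... valueN' rows.

    column=1 selects the first sample; strip_tag removes a suffix such as '_FLY'.
    """
    header = (f"track type=bedGraph name={name} description={name} "
              "visibility=full maxHeightPixels=60 color=200,0,0")
    out = [header]
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        chrom, start, end = fields[0].split("*")
        if strip_tag and chrom.endswith(strip_tag):
            chrom = chrom[: -len(strip_tag)]   # e.g. chr2L_FLY becomes chr2L
        out.append("\t".join([chrom, start, end, fields[column]]))
    return out

# Made-up normalized values for two samples.
sample_rows = ["chr1*1*1001\t0.50\t0.75", "chr1*1001*2001\t1.25\t0.00"]
spike_rows = ["chr2L_FLY*1*1001\t2.00\t3.00"]

sample_track = to_bedgraph(sample_rows)
spike_track = to_bedgraph(spike_rows, column=2, strip_tag="_FLY", name="AVG_SPIKCHIP_SPIKE")
print("\n".join(sample_track))
```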

4 Notes

1. SL2 cells grow in suspension with some loosely adherent cells. To collect the cells, gently resuspend them by pipetting medium across the monolayer. Strongly adherent cells can be mechanically dislodged with a cell scraper.
2. At this step, the cross-linked cell pellets can be stored at -80 °C for up to 3 months.
3. The number of cycles should be optimized, since the efficiency of chromatin fragmentation can differ greatly depending on the equipment. The indicated values are suggested starting parameters for the indicated equipment. Select the lowest number of cycles that results in fragmented DNA ranging from 100 to 500 bp.
4. Alternatively, another sonicator can be used, as long as the samples are kept refrigerated throughout the process and the appropriate chromatin fragmentation is achieved.
5. The optimal amount of spike-in reads for normalization must be at least one million and, approximately, 2%–5% of the experimental genome. The abundance of the target in the experimental sample, as well as the efficiency of the antibody used for capturing it, will influence the chromatin sample:chromatin spike-in proportion in the final mixture. As a guideline, when using robust antibodies against abundant histone modifications, the chromatin mixture should contain between 2.5% and 1% of spike-in material (% calculated according to the amount of DNA corresponding to each chromatin material). In contrast, when using antibodies against transcription factors, histone modifiers, or low-abundance histone modifications, the mixture should contain between 0.1% and 0.05% of spike-in material.
6. As an alternative to massively parallel sequencing, immunoprecipitated and input material can be processed for qPCR by using standard procedures. The qPCR signal from a specific spike-in locus can then be used to normalize the qPCR signal for an experimental locus. To select a specific locus to amplify from spike-in material, researchers can find a source of H2Av, H3K4me3, H3K27ac, and H3K27me3 ChIP-seq profiles at the following link: https://genome.ucsc.edu/s/DiCroceLab/spikChIP_MethodsMolBiol_2021. Below are two examples of ChIP-qPCR normalization using spike-in: (i) the traditional "percentage of input" normalization method and (ii) the use of the qPCR signal from the spike-in as an internal reference.

(i) % input normalized by spike-in

Raw Ct (triplicates):

               ChIP                    Input (1%)
Sample #1      28.03  28.18  28.10     26.67  26.53  26.50
Sample #2      29.10  29.30  29.10     26.52  26.50  26.48
Spike-in #1    29.69  29.86  29.60     30.15  30.30  30.40
Spike-in #2    29.47  29.32  29.3      30.3   30.1   30.2

Derived values:

               Average Ct   Average Ct    Adjusted   % input   Correction   Spike-in
               (ChIP)       (Input 1%)    input                factor       equilibrated % input
Sample #1      28.10        26.57         19.92      0.34      0.83         0.42
Sample #2      29.17        26.50         19.86      0.16      1.00         0.16
Spike-in #1    29.72        30.28         23.64      1.48
Spike-in #2    29.36        30.2          23.6       1.79

Average Ct: average from triplicate.
Adjusted input (input adjusted to 100%): raw Ct of the input - 6.64.
% input: 100 × 2^(adjusted input - Ct ChIP).
Correction factor: % input spike-in / % input spike-in reference (here, Spike-in #2 is the reference).
Spike-in equilibrated % input: % input sample / correction factor.


(ii) Normalization using spike-in signal as internal reference

                    Raw Ct               Average   ΔCT     ΔΔCT    2^-ΔΔCT
Sample #1 ChIP      28.0  28.2  28.1     28.10     -1.6    0.00    1.00
Sample #2 ChIP      29.1  29.3  29.1     29.17     -0.2    1.42    0.37
Spike-in #1 ChIP    29.7  29.9  29.6     29.72
Spike-in #2 ChIP    29.5  29.3  29.3     29.36

Average: average from triplicate.
ΔCT (normalization by spike-in ChIP signal): CT(ChIP sample) - CT(reference spike-in ChIP).
ΔΔCT (normalization to a reference sample): ΔCT(ChIP sample) - ΔCT(a reference sample).
2^-ΔΔCT: relative quantification.

7. Virtual machines (e.g., Oracle Virtual Box™) allow users of other operating systems to safely run Linux environments on their computers with no modification of the current setup.
8. We recommend generating genome indexes from the basic list of chromosomes (i.e., Homo sapiens assembly hg38: chr1 to chr22, chrX, chrY, and chrM). Thus, FASTA files from alternate sequences (e.g., chr1_GL383518v1_alt), fix sequences (e.g., chr1_KN196472v1_fix), unlocalized contigs (e.g., chr1_KI270706v1_random), and unplaced contigs (e.g., chrUn_GL000195v1) would be excluded from the analysis. The same advice applies to spikChIP genome definition files (chromInfo.txt).
9. Genome indexes for mapping reads on a particular pair of genome assemblies are typically produced only once and reused later in all the mappings performed on the computer. The original FASTA files are no longer used by the alignment tools and can be deleted.
10. Genome mapping consists of unambiguously identifying the location of each read at a particular position of one of the chromosomes in one of the two species that constitute the ChIP-seq experiment. Unless the analysis focuses on genomic repeats, multi-reads (reads whose sequence matches more than one locus in one of the species) are usually discarded by introducing the appropriate flag option in the alignment tool.


11. It is a good habit to centralize the storage of raw data FASTQ files and BAM alignment files in two separate folders at a particular location within our file tree.
12. The SAMtools flagstat command can be used on each SAM/BAM file at any step of this protocol to obtain statistics on the number and the class of the ChIP-seq aligned reads.
13. It is recommended to use the flag option --broad of MACS2 when analyzing histone marks with a broad pattern of occupancy. Furthermore, a second BAM file can be introduced with the option -c to be used as a peak calling control (e.g., Input or IgG).
14. We recommend introducing the same set of ChIP-enriched regions on each genome in all the lines of the spikChIP configuration file to ensure a fair comparison. These BED files can be generated separately for each species (peaks.bed and spike_peaks.bed) by gathering all peaks in every condition or simply by selecting the condition in which the highest number of peaks was identified.
15. SpikChIP can be customized to delete all intermediate files at the end of the processing (-c flag option). Depending on the volume of the BAM files and the genomes analyzed in a run, we strongly recommend switching this option on.
16. In addition to its own normalization based on local regression, the spikChIP pipeline calculates in parallel up to four alternative normalization approaches: raw (absolute counts), traditional (sequencing depth), ChIP-Rx [24], and Tag removal [25]. Boxplots showing the performance of each method on peaks and background regions of the input ChIP-seq experiments are automatically generated. Further information is available in [26].
17. SpikChIP calculates separately the maximum and the average number of reads of each experiment to be assigned to each bin of the genomes. Thus, output files using normalized values calculated in both modes are generated by spikChIP. We suggest using maximum mode results when working with transcription factors and histone marks (e.g., H3K4me3) associated with sharp peaks, while average mode results are suitable for studying broad peaks of certain histone marks (e.g., H3K79me2).
18. To generate genome-wide profiles from normalized counts in the spike-in genome, it is necessary to undo the tag edit incorporated in the spikChIP input files (e.g., chr2L_FLY to chr2L) by adding a sed instruction to the command previously described.
19. The UCSC genome browser offers Track hubs as an alternative approach to directly uploading genome-wide profiles. Faster access to custom tracks visualized in this manner is achieved by rendering only the fragment of the chromosome currently viewed on the screen. However, track files must be permanently hosted on a web space provider, and their internet address should be supplied to the genome browser.
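The two ChIP-qPCR normalizations of Note 6 can be reproduced with a few lines of Python. The sketch below is only an illustration of the formulas given in the tables: the function and variable names are hypothetical, the raw Ct values are those of table (i), Spike-in #2 is taken as the spike-in reference for (i), and Sample #1 is taken as the reference sample for (ii), as the values in the tables imply. Small differences in the last digit with respect to the tables can arise from rounding.

```python
from math import log2
from statistics import mean

def percent_input(chip_cts, input_cts, input_fraction=0.01):
    """Percent of input: adjust the input Ct to 100% and compare with the ChIP Ct."""
    adjusted_input = mean(input_cts) - log2(1 / input_fraction)  # 1% input: Ct - 6.64
    return 100 * 2 ** (adjusted_input - mean(chip_cts))

def relative_quantification(sample_cts, spike_cts, ref_sample_cts, ref_spike_cts):
    """2^-ddCt, using the spike-in ChIP signal as internal reference."""
    d_ct = mean(sample_cts) - mean(spike_cts)              # dCt of the sample
    d_ct_ref = mean(ref_sample_cts) - mean(ref_spike_cts)  # dCt of the reference
    return 2 ** -(d_ct - d_ct_ref)

# Raw Ct triplicates from table (i) of Note 6.
chip = {"sample1": [28.03, 28.18, 28.10], "sample2": [29.10, 29.30, 29.10],
        "spike1": [29.69, 29.86, 29.60], "spike2": [29.47, 29.32, 29.3]}
inp = {"sample1": [26.67, 26.53, 26.50], "sample2": [26.52, 26.50, 26.48],
       "spike1": [30.15, 30.30, 30.40], "spike2": [30.3, 30.1, 30.2]}

pct = {name: percent_input(chip[name], inp[name]) for name in chip}

# (i) Correction factor and spike-in equilibrated % input (Spike-in #2 as reference).
correction_1 = pct["spike1"] / pct["spike2"]
equilibrated_1 = pct["sample1"] / correction_1
equilibrated_2 = pct["sample2"] / 1.0

# (ii) Relative quantification of Sample #2 (Sample #1 as reference sample).
rq_2 = relative_quantification(chip["sample2"], chip["spike2"],
                               chip["sample1"], chip["spike1"])

for label, value in [("% input sample 1", pct["sample1"]),
                     ("equilibrated sample 1", equilibrated_1),
                     ("relative quantification sample 2", rq_2)]:
    print(label, round(value, 2))
```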

Acknowledgments

This work was supported by the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) (BFU2016-75008-P, and PID2019108322GB-100), "Fundación Vencer El Cáncer" (VEC), the European Regional Development Fund (FEDER), and AGAUR to L.D.C., and by the Ramon y Cajal program of the Ministerio de Ciencia, Innovación y Universidades and the European Social Fund under the reference number RYC-2018-025002-I, and the Instituto de Salud Carlos III-FEDER (PI19/01814), to S.A. We acknowledge the funding support of the Spanish Ministry of Science and Innovation to the EMBL partnership, the Centro de Excelencia Severo Ochoa, and the CERCA Programme/Generalitat de Catalunya.

References

1. Cramer P (2014) A tale of chromatin and transcription in 100 structures. Cell 159(5):985–994. https://doi.org/10.1016/j.cell.2014.10.047
2. Bonev B, Cavalli G (2016) Organization and function of the 3D genome. Nat Rev Genet 17(11):661–678. https://doi.org/10.1038/nrg.2016.112
3. Almouzni G, Cedar H (2016) Maintenance of epigenetic information. Cold Spring Harb Perspect Biol 8(5). https://doi.org/10.1101/cshperspect.a019372
4. Gilbert DM, Takebayashi SI, Ryba T, Lu J, Pope BD, Wilson KA, Hiratani I (2010) Space and time in the nucleus: developmental control of replication timing and chromosome architecture. Cold Spring Harb Symp Quant Biol 75:143–153. https://doi.org/10.1101/sqb.2010.75.011
5. Aranda S, Mas G, Di Croce L (2015) Regulation of gene transcription by Polycomb proteins. Sci Adv 1(11):e1500737. https://doi.org/10.1126/sciadv.1500737
6. Mirabella AC, Foster BM, Bartke T (2016) Chromatin deregulation in disease. Chromosoma 125(1):75–93. https://doi.org/10.1007/s00412-015-0530-0

7. Espejo I, Di Croce L, Aranda S (2020) The changing chromatome as a driver of disease: a panoramic view from different methodologies. BioEssays 42(12):e2000203. https://doi.org/10.1002/bies.202000203
8. Solomon MJ, Larsen PL, Varshavsky A (1988) Mapping protein-DNA interactions in vivo with formaldehyde: evidence that histone H4 is retained on a highly transcribed gene. Cell 53(6):937–947
9. Gilmour DS, Lis JT (1984) Detecting protein-DNA interactions in vivo: distribution of RNA polymerase on specific bacterial genes. Proc Natl Acad Sci U S A 81(14):4275–4279
10. Gilmour DS, Lis JT (1985) In vivo interactions of RNA polymerase II with genes of Drosophila melanogaster. Mol Cell Biol 5(8):2009–2018
11. Aranda S, Shi Y, Di Croce L (2016) Chromatin and epigenetics at the forefront: finding clues among peaks. Mol Cell Biol 36(19):2432–2439. https://doi.org/10.1128/MCB.00328-16
12. Johnson DS, Mortazavi A, Myers RM, Wold B (2007) Genome-wide mapping of in vivo protein-DNA interactions. Science 316(5830):1497–1502. https://doi.org/10.1126/science.1141319

13. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K (2007) High-resolution profiling of histone methylations in the human genome. Cell 129(4):823–837. https://doi.org/10.1016/j.cell.2007.05.009
14. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE (2007) Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448(7153):553–560. https://doi.org/10.1038/nature06008
15. Albert I, Mavrich TN, Tomsho LP, Qi J, Zanton SJ, Schuster SC, Pugh BF (2007) Translational and rotational settings of H2A.Z nucleosomes across the Saccharomyces cerevisiae genome. Nature 446(7135):572–576. https://doi.org/10.1038/nature05632
16. Consortium EP (2004) The ENCODE (ENCyclopedia of DNA elements) project. Science 306(5696):636–640. https://doi.org/10.1126/science.1105136
17. Consortium EP (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414):57–74. https://doi.org/10.1038/nature11247
18. Consortium EP, Moore JE, Purcaro MJ, Pratt HE, Epstein CB, Shoresh N, Adrian J, Kawli T, Davis CA, Dobin A, Kaul R, Halow J, Van Nostrand EL, Freese P, Gorkin DU, Shen Y, He Y, Mackiewicz M, Pauli-Behn F, Williams BA, Mortazavi A, Keller CA, Zhang XO, Elhajjajy SI, Huey J, Dickel DE, Snetkova V, Wei X, Wang X, Rivera-Mulia JC, Rozowsky J, Zhang J, Chhetri SB, Zhang J, Victorsen A, White KP, Visel A, Yeo GW, Burge CB, Lecuyer E, Gilbert DM, Dekker J, Rinn J, Mendenhall EM, Ecker JR, Kellis M, Klein RJ, Noble WS, Kundaje A, Guigo R, Farnham PJ, Cherry JM, Myers RM, Ren B, Graveley BR, Gerstein MB, Pennacchio LA, Snyder MP, Bernstein BE, Wold B, Hardison RC, Gingeras TR, Stamatoyannopoulos JA, Weng Z (2020) Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583(7818):699–710. https://doi.org/10.1038/s41586-020-2493-4
19. Stunnenberg HG, International Human Epigenome C, Hirst M (2016) The international human Epigenome Consortium: a blueprint for scientific collaboration and discovery. Cell 167(5):1145–1149. https://doi.org/10.1016/j.cell.2016.11.007


20. Skipper M, Eccleston A, Gray N, Heemels T, Le Bot N, Marte B, Weiss U (2015) Presenting the epigenome roadmap. Nature 518(7539):313. https://doi.org/10.1038/518313a
21. Teng M, Du D, Chen D, Irizarry RA (2021) Characterizing batch effects and binding site-specific variability in ChIP-seq data. NAR Genom Bioinform 3(4):lqab098. https://doi.org/10.1093/nargab/lqab098
22. Chen K, Hu Z, Xia Z, Zhao D, Li W, Tyler JK (2015) The overlooked fact: fundamental need for spike-in control for virtually all genome-wide analyses. Mol Cell Biol 36(5):662–667. https://doi.org/10.1128/MCB.00970-14
23. Bonhoure N, Bounova G, Bernasconi D, Praz V, Lammers F, Canella D, Willis IM, Herr W, Hernandez N, Delorenzi M (2014) Quantifying ChIP-seq data: a spiking method providing an internal reference for sample-to-sample normalization. Genome Res 24(7):1157–1168. https://doi.org/10.1101/gr.168260.113
24. Orlando DA, Chen MW, Brown VE, Solanki S, Choi YJ, Olson ER, Fritz CC, Bradner JE, Guenther MG (2014) Quantitative ChIP-Seq normalization reveals global modulation of the epigenome. Cell Rep 9(3):1163–1170. https://doi.org/10.1016/j.celrep.2014.10.018
25. Egan B, Yuan CC, Craske ML, Labhart P, Guler GD, Arnott D, Maile TM, Busby J, Henry C, Kelly TK, Tindell CA, Jhunjhunwala S, Zhao F, Hatton C, Bryant BM, Classon M, Trojer P (2016) An alternative approach to ChIP-Seq normalization enables detection of genome-wide changes in histone H3 lysine 27 Trimethylation upon EZH2 inhibition. PLoS One 11(11):e0166438. https://doi.org/10.1371/journal.pone.0166438
26. Blanco E, Di Croce L, Aranda S (2021) SpikChIP: a novel computational methodology to compare multiple ChIP-seq using spike-in chromatin. NAR Genom Bioinform 3(3):lqab064. https://doi.org/10.1093/nargab/lqab064
27. Loven J, Orlando DA, Sigova AA, Lin CY, Rahl PB, Burge CB, Levens DL, Lee TI, Young RA (2012) Revisiting global gene expression analysis. Cell 151(3):476–482. https://doi.org/10.1016/j.cell.2012.10.012
28. Taruttis F, Feist M, Schwarzfischer P, Gronwald W, Kube D, Spang R, Engelmann JC (2017) External calibration with drosophila whole-cell spike-ins delivers absolute mRNA fold changes from human RNA-Seq and qPCR data. BioTechniques 62(2):53–61. https://doi.org/10.2144/000114514
29. Lee BT, Barber GP, Benet-Pages A, Casper J, Clawson H, Diekhans M, Fischer C, Gonzalez JN, Hinrichs AS, Lee CM, Muthuraman P, Nassar LR, Nguy B, Pereira T, Perez G, Raney BJ, Rosenbloom KR, Schmelter D, Speir ML, Wick BD, Zweig AS, Haussler D, Kuhn RM, Haeussler M, Kent WJ (2021) The UCSC genome browser database: 2022 update. Nucleic Acids Res. https://doi.org/10.1093/nar/gkab959
30. Langmead B, Trapnell C, Pop M, Salzberg SL (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10(3):R25. https://doi.org/10.1186/gb-2009-10-3-r25

31. Li H, Durbin R (2009) Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics 25(14):1754–1760. https://doi.org/10.1093/bioinformatics/btp324
32. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H (2021) Twelve years of SAMtools and BCFtools. Gigascience 10(2). https://doi.org/10.1093/gigascience/giab008
33. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol 9(9):R137. https://doi.org/10.1186/gb-2008-9-9-r137

Chapter 6

A Guide to MethylationToActivity: A Deep Learning Framework That Reveals Promoter Activity Landscapes from DNA Methylomes in Individual Tumors

Karissa Dieseldorff Jones, Daniel Putnam, Justin Williams, and Xiang Chen

Abstract

Genome-wide DNA methylomes have contributed greatly to tumor detection and subclassification. However, interpreting the biological impact of the DNA methylome at the individual gene level remains a challenge. MethylationToActivity (M2A) is a pipeline that uses convolutional neural networks to infer H3K4me3 and H3K27ac enrichment from DNA methylomes and thus infer promoter activity. It was shown to be highly accurate and robust in revealing promoter activity landscapes in various pediatric and adult cancers. The following will present a user-friendly guide through the model pipeline.

Key words H3K4me3, H3K27ac, Tumor, Promoter activity, Machine learning

1 Introduction

Cells orchestrate gene activity by controlling the frequency and quantity of transcribed RNA. This spatial and temporal control allows a given cell to respond to a plethora of intra- and extracellular signals and defines specific cell types with different cell functions. To initiate transcription, promoter regions regulate binding events at the transcription start site (TSS) by integrating signals from distal enhancers and local histone modifications (HMs). Notably, tumors take advantage of alternative promoter usage to activate repressed oncogenes [1–3], increase isoform diversity [2, 3], and evade host immunity [3, 4].

Authors Karissa Dieseldorff Jones and Daniel Putnam have equally contributed to this chapter. Pedro H. Oliveira (ed.), Computational Epigenomics and Epitranscriptomics, Methods in Molecular Biology, vol. 2624, https://doi.org/10.1007/978-1-0716-2962-8_6, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Pediatric tumors harbor fewer mutations than adult tumors [5, 6] and, concomitantly, use epigenetic deregulation to promote cancer initiation and progression [7]. Numerous groups have highlighted the potential of HMs and other epigenetic modifications to predict gene expression [8–10]. Chromatin immunoprecipitation sequencing (ChIP-seq) is the most common technique used to quantify HM activities [11], and Cut&Run has emerged as an alternative that addresses the limitations of ChIP-seq [12]. Still, studies of individual pediatric tumors are often limited by the rarity of patient samples, limited amounts of fresh starting material, and the extensive workload needed [13, 14].

Eukaryote DNA methylation (DNAm) is a regulatory mechanism that occurs when a methyl group (-CH3) is covalently attached to cytosine (C) to form 5-methylcytosine (5mC). It can be accurately profiled in different tissue types through both array [15, 16] and sequencing [17] methods. The DNAm pattern is mechanistically linked to transcription factor binding and HMs [18–25] and is integral in determining chromatin structure in normal as well as disease conditions [26, 27]. However, the extent to which DNAm contributes to the regulation of individual gene expression is not well understood [13, 28, 29], with few exceptions (e.g., hypermethylation of the promoters of RB1, CDKN2A, and MGMT) [30].

To address these challenges, we developed MethylationToActivity (M2A), a deep convolutional neural network that captures the complex relation between DNAm signatures and promoter activities [31]. M2A is highly accurate and robust in revealing promoter activity landscapes. Here, we explain the steps of software installation and pipeline application. The M2A workflow is summarized in Fig. 1, and each step is described in more detail in Table 1. M2A starts with raw DNAm feature extraction near individual TSSs. This is followed by high-level feature extraction through the CNN layers and mapping between the generalized (high-order) features and the final output (i.e., the H3K4me3 and H3K27ac enrichment of the promoter) in the fully connected (FC) layers.

2 Materials

2.1 Summary

M2A is available in four operating modes: (1) Github Clone, (2) Github Docker Image, (3) Github Pipeline, and (4) St. Jude Cloud. The main hub is located at https://github.com/chenlab-sj/M2A. This chapter elaborates on the Github Pipeline.

Fig. 1 Workflow illustrating both the MethylationToActivity vanilla version and optional transfer learning pipeline

2.1.1 Github Clone

1. Clone M2A from Github:
git clone https://github.com/chenlab-sj/M2A.git
2. Ensure the following software packages are installed:


Table 1 MethylationToActivity pipeline step descriptions

Process                         Description
1_ResponseVariable              Generate histone enrichment for each unique promoter region (transfer learning only)
2_MethylationFeatures           Process whole genome bisulfite sequencing (WGBS) features for model input
3_CombineInput                  Scale and recombine features, and, for transfer learning, calculate HM values (also stated as Step 3: Format)
4_RunModel                      Using pre-generated input, get HM predictions for each unique promoter region
5_TransferLearning (Optional)   Train fully connected layers of a particular model for increased performance in your domain of interest

Package | Version
pyBigWig | 0.3.13
Numpy | 1.17.1
Pandas | 0.25.1
Pandarallel | 1.4.2
Scikit-learn | 0.20.2
H5py | 2.9.0
Keras | 2.2.4
Tensorflow | 1.10.1
Scipy | 1.3.1
Matplotlib | 3.3.0
Cwltool | 1.0
Psutil | 5.6.1

3. Refer to the M2A README.md for instructions on using a cwl workflow, or use the GitHub Pipeline specified below.

2.1.2 Docker

M2A provides a Dockerfile that builds an image with all the required dependencies included.

1. To use this image, install Docker for your platform. In the M2A project directory, build the Docker image:

docker build --tag stjude/m2a:0.0.1 .

2. Refer to M2A README.md to follow instructions using a cwl workflow or use the Github Pipeline specified below.

2.1.3 St. Jude Cloud

St. Jude Cloud provides the infrastructure to run a user-friendly GPU accelerated pipeline.

1. Log in to St. Jude Cloud. Refer to the following URL: https://university.stjude.cloud/docs/genomics-platform/workflow-guides/methylation-to-activity/
2. Select the MethylationToActivity (M2A) Project. Click Start. This creates a new DNAnexus workspace and imports the workflow. When the “Launch Tool” and “View Results” tabs appear, the workflow has been successfully created.
3. Click “Launch Tool” with or without Transfer Learning. Select the desired option. This will redirect to the project analysis page with configurable settings and inputs.
4. Under the “Analysis Inputs 1” tab, input the files requested. The files will need to be uploaded to your project space on the cloud before they become available for selection. Note: Only input a “Promoter definitions” and/or “HDF5 format model” file(s) if the default is not desired. Refer to the following link for cloud file upload/download: https://university.stjude.cloud/docs/genomics-platform/analyzing-data/command-line
5. Choose the desired “Genome” and “Model selection” from the drop-down menus. Note: If an alternative “Promoter definitions” file is uploaded, the “Genome” must be set to “custom.”
6. Click the “Start Analysis” button located at the top right of the screen to queue the project run. The run may take some time to begin based on the DNAnexus queue.
7. Once the project has completed, download the result files. Refer to the cloud file upload/download link above.

3 Methods

3.1 Step 1: Response Variable (Only for Transfer Learning)

The main objective of Step 1 is to summarize the ChIP-seq signal in each promoter region. This is performed by summing the ChIP-seq signal (H3K27ac or H3K4me3), termed ChIPSum, over a 2 kb window centered at the transcription start site (+/- 1 kb in both directions). The same procedure is performed for the ChIP-seq control, termed InputSum. The ChIP-seq enrichment is computed as log2((ChIPSum + Alpha) / (InputSum + Alpha)), where Alpha is the first quartile of the control signal. This is called log2_ChipDivInput and is the response variable for transfer learning. See Fig. 2 for a visual representation.
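The enrichment calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of 1_getResponseVariable.py: the function name and the toy values are hypothetical, and it assumes that Alpha is the first quartile of the control sums taken across all promoters, which the text does not state explicitly.

```python
import numpy as np

def log2_chip_div_input(chip_sum, input_sum):
    """Return log2((ChIPSum + alpha) / (InputSum + alpha)) per promoter.

    chip_sum, input_sum: signal summed over each 2 kb promoter window.
    alpha: first quartile of the control sums (assumed here to be taken
    across all promoters; this is an illustrative choice).
    """
    chip_sum = np.asarray(chip_sum, dtype=float)
    input_sum = np.asarray(input_sum, dtype=float)
    alpha = np.percentile(input_sum, 25)
    return np.log2((chip_sum + alpha) / (input_sum + alpha))

# Hypothetical summed signals for four promoters
chip = [120.0, 8.0, 40.0, 0.0]
control = [10.0, 8.0, 20.0, 4.0]
enrichment = log2_chip_div_input(chip, control)
```

Adding the same Alpha to numerator and denominator acts as a pseudocount: it keeps the ratio finite when the control signal is close to zero and damps the enrichment of promoters with very little signal in either track.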

Fig. 2 Response variable overview illustrating the calculation behind the newly added column

3.2 Input

(1) ChipSeq.bw and matching Input.bw
(2) The default promoter definitions file (6_InputFiles/2_Promoter_Definitions_hg19.txt) contains the following columns:
• EnsmblID_T = Ensembl transcript ID
• EnsmblID_G = Ensembl gene ID
• Gene = human-readable gene name
• Strand = + or -
• Chr = “chr1, chr2, . . . chrX, chrY”
• Start = Gene Start
• End = Gene End
• RStart = Transcription start site (TSS) - 1000 bp
• REnd = Transcription start site (TSS) + 1000 bp

3.3 Example Command

python 1_getResponseVariable.py [Chip_Path] [Input_Path] 2_Promoter_Definitions_hg19.txt

3.3.1 Optional Arguments

--outFileName: Desired output file name (type: str)
--outDirectory: Desired output directory (type: str)

3.3.2 Optional Example Command

python 1_getResponseVariable.py [Chip_Path] [Input_Path] 2_Promoter_Definitions_hg19.txt --outFileName [OutFileName] --outDirectory [OutFilePath]

3.4 Output

• Rows: promoter sites
• Columns:
– EnsmblID_T: Ensembl Transcript ID
– EnsmblID_G: Ensembl Gene ID
– Gene: Gene Name
– Strand: + or -
– Chr: Chromosome
– Start: TSS Start Site
– End: TSS End Site
– RStart: Region Start Site
– REnd: Region End Site
– Log2_ChipDivInput: “Actual” signal (see more information in the description and figure)

3.5 Step 2: Feature Extraction

The main objective of Step 2 is to process the whole genome bisulfite sequencing (WGBS) features for model input. A signature is made from multiple methylation feature calculations per window (n=20), per window size (250 bp, 2500 bp), per promoter (TSS region). The three methylation (M-value) features calculated per window are the average methylation, the variance of methylation, and the fraction of the sum of squared differences (SSD) for the window methylation over the entire region. These three DNAm features are calculated for each of the 20 windows surrounding each TSS, at both window sizes (250 bp and 2500 bp). This results in 120 newly added columns of feature calculations (20 windows x 2 window sizes x 3 features). See Fig. 3 for a visual representation, including a calculation of each feature.

3.6 Input

(1) Methylation file
• Tab delimited
• Required headers (order does not matter):
– chrom (chromosome ID)
– pos (position of the C in the CpG)
– mval (calculated M-value of a given CpG)
(2) By default, uses the promoter definitions file provided in GitHub: 2_Promoter_Definitions_hg19.txt
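The windowed features described in Step 2 can be illustrated with a short Python sketch. This is not the code of 2_getMethylationV2.py: the function name and toy data are hypothetical, and several details are assumptions made only for illustration (contiguous windows split evenly upstream and downstream of the TSS, no strand handling, population variance, SSD taken about the window mean for a window and about the region mean for the region, and NaN for windows without CpGs).

```python
import numpy as np

def window_features(pos, mval, tss, window_size, n_windows=20):
    """Per-window mean, variance and FSSD of M-values around one TSS.

    Returns an array of shape (n_windows, 3). All layout choices here
    are assumptions for illustration, not the M2A implementation.
    """
    pos = np.asarray(pos)
    mval = np.asarray(mval, dtype=float)
    half = n_windows // 2
    region_start = tss - half * window_size
    region_end = tss + half * window_size

    # SSD of all CpG M-values in the region, about the region mean
    region_vals = mval[(pos >= region_start) & (pos < region_end)]
    if region_vals.size:
        region_ssd = np.sum((region_vals - region_vals.mean()) ** 2)
    else:
        region_ssd = 0.0

    feats = np.full((n_windows, 3), np.nan)  # empty windows stay NaN
    for j in range(n_windows):
        start = region_start + j * window_size
        w = mval[(pos >= start) & (pos < start + window_size)]
        if w.size == 0:
            continue
        ssd = np.sum((w - w.mean()) ** 2)
        fssd = ssd / region_ssd if region_ssd > 0 else 0.0
        feats[j] = (w.mean(), w.var(), fssd)
    return feats

# Hypothetical CpG positions and M-values near a TSS at position 1000
cpg_pos = [900, 950, 1100, 1200, 1300]
cpg_mval = [2.0, 4.0, -1.0, -3.0, 1.0]
feats250 = window_features(cpg_pos, cpg_mval, tss=1000, window_size=250)
feats2500 = window_features(cpg_pos, cpg_mval, tss=1000, window_size=2500)
signature = np.stack([feats250, feats2500])  # (2 sizes, 20 windows, 3 features)
```

Stacking the two window sizes gives 2 x 20 x 3 = 120 values per promoter, which matches the 120 feature columns mentioned above.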

Fig. 3 Feature extraction overview illustrating the calculation behind the newly added columns. DNA methylation features, including the windowed M-value mean, variance, and the fraction of the SSD of M-values (FSSD), were calculated and represented by the feature vectors Mave_i, Mvar_i, and Mfssd_i. Here, i represents the promoter, j represents a specific window for a particular promoter, and Mval_k represents the Mval for individual CpGs in a region, where Mval_k(j) is the Mval for an individual CpG in a specific window

3.7 Example Command

python 2_getMethylationV2.py [MethFileName] 2_Promoter_Definitions_hg19.txt

3.7.1 Optional Arguments

--nbWorkers: Number of threads to use (type: int)*
--outFileName: Desired output file name (type: str)
--outDirectory: Desired output directory (type: str)

*This is constrained by the threads available. It allows the work to be parallelized among a designated number of cores.

3.7.2 Optional Example Command

python 2_getMethylationV2.py [MethFileName] 2_Promoter_Definitions_hg19.txt --nbWorkers [Cores] --outFileName [OutFileName] --outDirectory [OutFilePath]

3.8 Output

• Rows: promoter sites
• Columns:
– EnsmblID_T: Ensembl Transcript ID
– EnsmblID_G: Ensembl Gene ID
– Gene: Gene Name
– Strand: + or -
– Chr: Chromosome
– Start: TSS Start Site
– End: TSS End Site
– Rstart: Region Start Site
– Rend: Region End Site
– 60 column succession of Window -10 to +10 for window size 250 bp for each feature (Example column title: 250_W10_M_Ave)
– 60 column succession of Window -10 to +10 for window size 2500 bp for each feature

3.9 Step 3: Format

The main objective of Step 3 is to combine all features into tensors (n-dimensional matrices) for input to our machine learning model. To accomplish this, the features are reshaped to (N, Resolutions, Windows, Features) dimensions, e.g., (96757, 2, 20, 3). If transfer learning is needed, this step also combines the response variable resulting from Step 1. See Fig. 1 for a visual representation.

Scenario 1: Simply run the model with Step 2 output.
Scenario 2: Run the model with Step 1 and Step 2 output for transfer learning.

3.10 Input

(1) Step 1 Output: Tab delimited file containing the response variable
• Positional and gene data including log2ChipDivInput
(2) Step 2 Output: Tab delimited file containing methylation features for 250 and 2,500 base pair window sizes
• Features: Average, Variation, Fraction of Region SSD
• Positional and gene data: EnsmblID_T, EnsmblID_G, Gene, Strand, Chr, Start, End, RStart, Rend
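As a rough illustration of the reshaping in Step 3, the sketch below turns a flat table of 120 feature columns per promoter into an (N, Resolutions, Windows, Features) array with NumPy. It is not the code of 3_Combine.py: the toy array is hypothetical, and the column order (resolution first, then window, then feature) is an assumption that should be checked against the actual Step 2 output.

```python
import numpy as np

# Toy N; the chapter cites N = 96757 promoters
n_promoters, n_res, n_win, n_feat = 5, 2, 20, 3

# Hypothetical flat feature table: one row per promoter, 120 columns.
# Assumed column order: resolution-major, then window, then feature.
flat = np.arange(n_promoters * n_res * n_win * n_feat, dtype=float)
flat = flat.reshape(n_promoters, n_res * n_win * n_feat)

# Reshape to (N, Resolutions, Windows, Features)
tensor = flat.reshape(n_promoters, n_res, n_win, n_feat)
```

If the columns were instead grouped by feature before window, the same result would need a reshape to (N, 2, 3, 20) followed by a transpose of the last two axes, so it is worth confirming the layout before relying on a plain reshape.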

3.11 Example Command

Scenario 1: python 3_Combine.py [MethylationFilePath]

Scenario 2: python 3_Combine.py [MethylationFilePath] --ResponseVariablePath [ResponseVariableFilePath]

3.11.1 Optional Arguments

--outFileName: Desired output file name (type: str)
--outDirectory: Desired output directory (type: str)

3.11.2 Optional Example Command

python 3_Combine.py [MethylationFilePath] --ResponseVariablePath [ResponseVariableFilePath] --outFileName [OutFileName] --outDirectory [OutFilePath]

3.12 Output

One HDF5 file for prediction or transfer learning (extension = .h5). The output dataset is termed “FeatureInput” in the HDF5 file. It is an m by n array, where m = window sizes (resolutions) and n = window position relative to the TSS (W-10, W-9, ..., W-1, W1, W2, ..., W10). All metadata and positional values are stored in the HDF5 file as bytes (binary format), with the exception of “FeatureInput”.

3.13 Step 4: Run Model

The main objective of Step 4 is to generate all of the H3K27ac and/or H3K4me3 enrichment signal predictions for each of the promoter sites (TSS regions). See Figs. 1 and 4 for a visual representation.

Fig. 4 Deep learning model overview

3.14 Input

(1) Feature file
• HDF5 file containing the dataset termed “FeatureInput” and metadata (for more information see the Step 3 output section)
(2) Model file
• Model files provided in GitHub: M2A_H3K27ac_Model_V2.h5, M2A_H3K4me3_Model_V2.h5

3.15 Example Command

python 4_getPredictions.py [FeatureFilePath] [ModelFilePath]

3.15.1 Optional Arguments

--outFileName: Desired output file name (type: str)
--outDirectory: Desired output directory (type: str)

3.15.2 Optional Example Command

python 4_getPredictions.py [FeatureFilePath] [ModelFilePath] --outFileName [OutFileName] --outDirectory [OutFilePath]

3.16 Output

• Rows: promoter sites
• Columns (order of columns may differ):
– EnsmblID_T: Ensembl Transcript ID
– EnsmblID_G: Ensembl Gene ID
– Gene: Gene Name
– Strand: + or -
– Chr: Chromosome
– Start: TSS Start Site
– End: TSS End Site
– Rstart: Region Start Site
– Rend: Region End Site
– Regressor_Pred: Predicted Signal

3.17 Example Output

chr | End | EnsmblID_G | EnsmblID_T | Gene | REnd | RStart | Start | Strand | Regressor_Pred
chr7 | 127231759 | ENSG00000004059.6 | ENST00000000233.5 | ARF5 | 127229399 | 127227399 | 127228399 | + | 3.0353603
chr12 | 9102551 | ENSG00000003056.3 | ENST00000000412.3 | M6PR | 9103551 | 9101551 | 9092961 | - | 3.6402566
chr12 | 2913124 | ENSG00000004478.5 | ENST00000001008.4 | FKBP4 | 2905119 | 2903119 | 2904119 | + | 3.7427473

3.18 Step 5: Transfer Learning (Optional)

Perform transfer learning on a single sample, updating the current machine learning model. This extends our trained machine learning model to learn features from a new data type. See Fig. 4 for a visual representation.

3.19 Input

(1) Feature file
• HDF5 file containing the dataset termed “FeatureInput” and metadata (for more information see the Step 3 output section)
(2) Model file
• Model files provided in GitHub: M2A_H3K27ac_Model_V2.h5, M2A_H3K4me3_Model_V2.h5
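The idea behind Step 5, retraining only the fully connected part of a trained model on new data while the earlier layers stay fixed, can be illustrated with a toy NumPy sketch. This is a stand-in and not M2A's Keras code or 5_getTransferModel.py: the "network" is a single frozen random projection with a linear head, and all names, shapes, and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained network: a frozen feature extractor
# followed by a trainable linear "fully connected" head.
W_frozen = rng.normal(size=(6, 8))   # never updated below
W_before = W_frozen.copy()

def extract(x):
    """Frozen feature extractor (linear map followed by ReLU)."""
    return np.maximum(x @ W_frozen, 0.0)

# Hypothetical new-domain data: inputs and measured response values
x_new = rng.normal(size=(200, 6))
h = extract(x_new)
y_new = h @ rng.normal(size=8) + 0.5

head_w = np.zeros(8)
head_b = 0.0

def mse(w, b):
    """Mean squared error of the head on the new-domain data."""
    return float(np.mean((h @ w + b - y_new) ** 2))

loss_start = mse(head_w, head_b)

# Conservative step size so that plain gradient descent cannot diverge
lr = 0.5 / (np.mean(np.sum(h ** 2, axis=1)) + 1.0)
for _ in range(500):
    err = h @ head_w + head_b - y_new
    head_w -= lr * 2.0 * (h.T @ err) / len(y_new)   # update head only
    head_b -= lr * 2.0 * err.mean()

loss_end = mse(head_w, head_b)
```

Only head_w and head_b change during the loop, so whatever the frozen part learned from the original training data is preserved while the final mapping adapts to the new domain.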

3.20 Example Command

python 5_getTransferModel.py [FeatureFilePath] [ModelFilePath]

3.20.1 Optional Arguments

--outFileName: Desired output file name (type: str)
--outDirectory: Desired output directory (type: str)

3.20.2 Optional Example Command

python 5_getTransferModel.py [FeatureFilePath] [ModelFilePath] --outFileName [OutFileName] --outDirectory [OutFilePath]

3.21 Output

Updated HDF5 model file. This will be the new model used for testing new data rather than the previously trained model. See Fig. 1 for an overview of transfer learning.

References

1. Davuluri RV, Suzuki Y, Sugano S, Plass C, Huang TH (2008) The functional consequences of alternative promoter use in mammalian genomes. Trends Genet 24:167–177. https://doi.org/10.1016/j.tig.2008.01.008
2. Demircioglu D et al (2019) A pan-cancer transcriptome analysis reveals pervasive regulation through alternative promoters. Cell 178:1465–1477.e17. https://doi.org/10.1016/j.cell.2019.08.018
3. Qamra A et al (2017) Epigenomic promoter alterations amplify gene isoform and immunogenic diversity in gastric adenocarcinoma. Cancer Discov 7:630–651. https://doi.org/10.1158/2159-8290.CD-16-1022
4. Sotillo E et al (2015) Convergence of acquired mutations and alternative splicing of CD19 enables resistance to CART-19 immunotherapy. Cancer Discov 5:1282–1295. https://doi.org/10.1158/2159-8290.CD-15-1020
5. Grobner SN et al (2018) The landscape of genomic alterations across childhood cancers. Nature 555:321–327. https://doi.org/10.1038/nature25480
6. Ma X et al (2018) Pan-cancer genome and transcriptome analyses of 1,699 paediatric leukaemias and solid tumours. Nature 555:371–376. https://doi.org/10.1038/nature25795
7. Huether R et al (2014) The landscape of somatic mutations in epigenetic regulators across 1,000 paediatric cancer genomes. Nat Commun 5:3630. https://doi.org/10.1038/ncomms4630
8. Dong X et al (2012) Modeling gene expression using chromatin features in various cellular contexts. Genome Biol 13:R53. https://doi.org/10.1186/gb-2012-13-9-r53
9. Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M (2010) Histone modification levels are predictive for gene expression. Proc Natl Acad Sci U S A 107:2926–2931. https://doi.org/10.1073/pnas.0909344107
10. Singh R, Lanchantin J, Robins G, Qi Y (2016) DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics 32:i639–i648. https://doi.org/10.1093/bioinformatics/btw427
11. Kelley DZ et al (2017) Integrated analysis of whole-genome ChIP-Seq and RNA-Seq data of primary head and neck tumor samples associates HPV integration sites with open chromatin marks. Cancer Res 77:6538–6550. https://doi.org/10.1158/0008-5472.CAN-17-0833
12. Skene PJ, Henikoff S (2017) An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. Elife 6. https://doi.org/10.7554/eLife.21856
13. Kagohara LT et al (2018) Epigenetic regulation of gene expression in cancer: techniques, resources and analysis. Brief Funct Genomics 17:49–63. https://doi.org/10.1093/bfgp/elx018
14. Zhang P, Lehmann BD, Shyr Y, Guo Y (2017) The utilization of formalin fixed-paraffin-embedded specimens in high throughput genomic studies. Int J Genomics 2017:1926304. https://doi.org/10.1155/2017/1926304
15. de Ruijter TC et al (2015) Formalin-fixed, paraffin-embedded (FFPE) tissue epigenomics using Infinium HumanMethylation450 BeadChip assays. Lab Invest 95:833–842. https://doi.org/10.1038/labinvest.2015.53
16. Moran S et al (2014) Validation of DNA methylation profiling in formalin-fixed paraffin-embedded samples using the Infinium HumanMethylation450 Microarray. Epigenetics 9:829–833. https://doi.org/10.4161/epi.28790
17. Gu H et al (2010) Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 7:133–136. https://doi.org/10.1038/nmeth.1414
18. Charlet J et al (2016) Bivalent regions of cytosine methylation and H3K27 acetylation suggest an active role for DNA methylation at enhancers. Mol Cell 62:422–431. https://doi.org/10.1016/j.molcel.2016.03.033
19. Kondo Y (2009) Epigenetic cross-talk between DNA methylation and histone modifications in human cancers. Yonsei Med J 50:455–463. https://doi.org/10.3349/ymj.2009.50.4.455
20. Onuchic V et al (2018) Allele-specific epigenome maps reveal sequence-dependent stochastic switching at regulatory loci. Science 361. https://doi.org/10.1126/science.aar3146
21. Rothbart SB, Strahl BD (2014) Interpreting the language of histone and DNA modifications. Biochim Biophys Acta 1839:627–643. https://doi.org/10.1016/j.bbagrm.2014.03.001
22. Sheffield NC et al (2017) DNA methylation heterogeneity defines a disease spectrum in Ewing sarcoma. Nat Med 23:386–395. https://doi.org/10.1038/nm.4273
23. Stadler MB et al (2011) DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature 480:490–495. https://doi.org/10.1038/nature10716
24. Zhu H, Wang G, Qian J (2016) Transcription factors as readers and effectors of DNA methylation. Nat Rev Genet 17:551–565. https://doi.org/10.1038/nrg.2016.83
25. Ziller MJ et al (2013) Charting a dynamic DNA methylation landscape of the human genome. Nature 500:477–481. https://doi.org/10.1038/nature12433
26. Hashimshony T, Zhang J, Keshet I, Bustin M, Cedar H (2003) The role of DNA methylation in setting up chromatin structure during development. Nat Genet 34:187–192. https://doi.org/10.1038/ng1158
27. Moore LD, Le T, Fan G (2013) DNA methylation and its basic function. Neuropsychopharmacology 38:23–38. https://doi.org/10.1038/npp.2012.112
28. Jones PA (2012) Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 13:484–492. https://doi.org/10.1038/nrg3230
29. Lay FD et al (2015) The role of DNA methylation in directing the functional organization of the cancer epigenome. Genome Res 25:467–477. https://doi.org/10.1101/gr.183368.114
30. Baylin SB (2005) DNA methylation and gene silencing in cancer. Nat Clin Pract Oncol 2(Suppl 1):S4–S11. https://doi.org/10.1038/ncponc0354
31. Williams J et al (2021) MethylationToActivity: a deep-learning framework that reveals promoter activity landscapes from DNA methylomes in individual tumors. Genome Biol 22:24. https://doi.org/10.1186/s13059-020-02220-y

Chapter 7

DNA Modification Patterns Filtering and Analysis Using DNAModAnnot

Alexis Hardy, Sandra Duharcourt, and Matthieu Defrance

Abstract

Mapping DNA modifications at the base resolution is now possible at the genome level thanks to advances in sequencing technologies. Long-read sequencing data can be used to identify modified base patterns. However, the downstream analysis of Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT) data requires the integration of genomic annotation and comprehensive filtering to prevent the accumulation of artifact signals. In this chapter, we present a linear workflow to fully analyze modified base patterns using the DNA Modification Annotation (DNAModAnnot) package. This workflow includes thorough filtering based on sequencing quality and false discovery rate estimation and provides tools for a global analysis of DNA modifications. Here, we provide an application example of this workflow with PacBio data and guide the user by explaining expected outputs via a fully integrated Rmarkdown script. This protocol is presented with tips showing how to adapt the provided code for annotating epigenomes of any organism according to the user's needs.

Key words Epigenomics, Epigenome Annotation, DNA modifications, DNA Methylation, PacBio Sequencing, Nanopore technology, DNAModAnnot

Pedro H. Oliveira (ed.), Computational Epigenomics and Epitranscriptomics, Methods in Molecular Biology, vol. 2624, https://doi.org/10.1007/978-1-0716-2962-8_7, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Recent advances in sequencing technologies have greatly contributed to the epigenomics field. Single-molecule real time (SMRT) sequencing from Pacific Biosciences (PacBio) and nanopore sequencing from Oxford Nanopore Technologies (ONT) allow the mapping of different DNA modifications at the base resolution in the whole genome [1]. These long-read sequencing technologies can even detect modifications in regions with a high amount of repeats that were previously inaccessible with Illumina sequencing [1]. Nanopore sequencing software such as Nanopolish or DeepSignal uses differences in electric current intensity to detect modified bases as they pass through the pores [2]. However, for many of these tools, the genome annotation (from the GC density to the presence of repeated regions) is essential, as it can impact the efficiency of DNA modification detection in ONT data, thus leading to higher false-positive rates in some genomic regions [2].

For SMRT sequencing, the SMRT-Link software (SMRTPortal for older versions) uses slowing down events of the DNA polymerase during sequencing to detect modified bases [1, 3]. Several modifications, such as 6-methyladenine (6mA) or 5-methylcytosine (5mC), can be detected as long as the coverage requirement is fulfilled [4]. PacBio also suggests defining a threshold based on the score parameter (also called "Modification QV") by comparing the score of all bases sequenced [3]. However, this method requires a strong signal that can be easily distinguished from the noise in order to choose an adapted threshold based on the score. Also, SMRT sequencing was found to overestimate the modification levels, especially when the amount of modified bases is very low in the genome [5]. Thus, ill-adapted filtering in such cases can cause high amounts of artifacts.

To overcome this lack of stringency, we have previously released DNAModAnnot (DNA Modification Annotation) [6], an R package allowing comprehensive filtering and analysis of modified patterns for PacBio or ONT data using adapted visualization tools. This package is divided into six modules, as illustrated in Fig. 1, that can be combined to fully analyze pre-processed PacBio or ONT data. DNAModAnnot provides tools to load pre-processed data ("Data Loading") and analyze the modification distribution at the genome level ("Global DNA Mod") or using the genome annotation provided ("DNA Mod annotation"). Furthermore, "Sequencing quality" assessment and False Discovery Rate estimation ("FDR estimation") can be directly used to perform a thorough filtering of PacBio or ONT data ("Filter") (Fig. 1). This modular toolbox uses object classes from the GenomicRanges [7] and Biostrings [8] packages, allowing a user-friendly coupling with functions from other main Bioconductor packages.

Fig. 1 Overall workflow of the DNAModAnnot package used to filter and analyze DNA modification patterns

In this chapter, we provide a roadmap for a systematic analysis of DNA modifications and an example of DNAModAnnot application on PacBio data. This workflow includes the loading of pre-processed files, the filtering steps based on sequencing quality and false discovery rate estimations, and the DNA modification pattern analysis with genomic annotations or analysis at the genome-wide level. It provides a summary of the functions provided by DNAModAnnot and a linear processing that can be easily extended with additional R packages for more advanced analyses.

2 Materials

2.1 Data Sources

In this protocol, we use PacBio RSII data [9] (and additional sequencing data listed in Table 1 [9, 10]) to analyze 6mA patterns in Tetrahymena thermophila. The patterns of this DNA modification have already been described in this organism [9]. Using this linear workflow, the user will learn to use DNAModAnnot by retrieving and analyzing these patterns. For SMRT sequencing data, DNAModAnnot needs the modifications.gff (data from modified bases only) and modifications.csv (data from all sequenced bases) files. These pre-processed data can be sourced from https://github.com/AlexisHardy/DNAModAnnot_AdditionalData, also listed in Table 1. Pre-processed files were produced using the SMRT-link-tools v7.0.1.66975-0, and 6mA was detected via the ipdSummary tool [11]. Command lines to regenerate the modifications.csv and modifications.gff files from the raw SMRT sequencing data (listed in Table 1) are detailed in the Notes section (see Note 1). DNAModAnnot [6] can also load ONT data pre-processed with the DeepSignal software [12], but we will only focus on PacBio data in this protocol (see Note 2). This package also needs the genome assembly (fasta) and its annotation (e.g., gff) in order to analyze the DNA modification patterns (listed in Table 1). All the files required to perform the analysis are listed in Table 1. This table also contains additional sequencing data, which can be analyzed together with DNA modification patterns.

Table 1 List of input files used in this protocol

Description | Format | Link | Required?
T. thermophila (June2014) genome assembly sequence | fasta | http://ciliate.org/index.php/home/downloads | Mandatory
T. thermophila (June2014) genome annotation | gff3 | http://ciliate.org/index.php/home/downloads | Mandatory
T. thermophila pre-processed SMRT-seq data (via SMRT-link tools v7.0.1.66975-0) using T_thermophila_June2014 genome assembly. SMRT-seq data was retrieved from GSM2534782 [9]. contig_list.txt contains the listing of contigs selected in this example. | gff, csv and txt | https://github.com/AlexisHardy/DNAModAnnot_AdditionalData | Mandatory
T. thermophila SMRT-seq data (to be retrieved via SRA Run Selector) [9] | bax.h5 | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2534782 | Not required if pre-processed files are available
T. thermophila MNase-seq [9] | bed | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2534785 | Optional
T. thermophila H2A.Z ChIP-seq [9] | bed | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2534783 | Optional
T. thermophila RNA-seq [10] | txt | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM692081 | Optional

2.2 Software and Installation

The required packages are detailed in the description file of the DNAModAnnot [6] package and can be installed via the install command of the BiocManager [13] package:

BiocManager::install(c('Biostrings', 'BSgenome', 'Gviz', 'seqLogo'))

DNAModAnnot [6] can be installed via GitHub using the devtools [14] package:

devtools::install_github("AlexisHardy/DNAModAnnot")

3 Methods

For SMRT sequencing data, methylation detection via SMRT Link [11] returns two files: modifications.gff (containing data from modified bases only) and modifications.csv (containing data from all sequenced bases) (see Note 2 for ONT data) (see Note 3). For both sequencing data types, the data must first be loaded into R. Sequencing quality can be assessed to filter out contigs with low coverage, which could bias the global statistics of modified base distribution. DNA modification distribution can also be analyzed at the genome level. For PacBio data, the false discovery rate can be estimated to select an appropriate filter based on available detection parameters (score or ipdRatio). By providing the genomic annotation, it is also possible to identify the patterns of DNA modifications associated with specific annotation features. Here, we provide an example of analysis of 6mA patterns using Tetrahymena thermophila PacBio RSII data [9] (and additional sequencing data listed in Table 1 [9, 10]), from the import of input files to the generation of graphs and reports. The results are presented in Figs. 2 and 3 and Table 2. An Rmarkdown document is also provided with all the commands and details of this protocol (see Note 4).

3.1 Loading Mandatory Files

Go to the webpages listed in Table 1 and collect the bed/csv/gff/fasta/txt files. For PacBio data, you can either download the SMRT Link [11] processed files called modifications.gff and modifications.csv, or download the bax.h5 files via the SRA Run Selector in the GSM2534782 repository and use SMRT Link [11] to generate the modifications.gff and modifications.csv files (see Note 1).

3.1.1 Import Genome Sequence Information

1. Import the genome sequence as a DNAStringSet object using the readDNAStringSet function from the Biostrings package [8], then filter it using the contig_file.txt file to keep only the sampled contigs (see Note 3).

organism_genome