Nanopore Sequencing: Methods and Protocols 1071629956, 9781071629956

This volume provides comprehensive dry and wet experiments, methods, and applications on nanopore sequencing. Chapters g

English Pages 317 [318] Year 2023

Author / Uploaded
Kazuharu Arakawa

Table of contents :
Preface
Contents
Contributors
Part I: Nanopore Sequencing for Genomics and Beyond
Chapter 1: The Current State of Nanopore Sequencing
1 Introduction
2 The Technology
3 The Platform
4 Applications
4.1 The World's Fastest Genomes
4.2 The World's Largest Genomes
4.3 The World's Most Complete Human Genome
4.4 Viral Genomes the World Over
4.5 Targeted Sequencing on the Other Side of the World
4.6 A World of Potential
5 Summary
References
Chapter 2: Hybrid Genome Assembly of Short and Long Reads in Galaxy
1 Introduction
2 Materials
2.1 Long- and Short-Read Datasets
2.2 Public Galaxy Instance (NanoGalaxy)
3 Methods
3.1 Basic Usage of Galaxy
3.2 Example Using Public Data
3.2.1 Analysis Overview
3.2.2 Hands-On: Obtain Sample Data from SRA
3.2.3 Hands-On: Quality Control of Nanopore Reads
3.2.4 Hands-On: Quality Filtering of Long-Read Sequencing Data
3.2.5 Hands-On: Assembly of Long-Read Sequencing Data
3.2.6 Hands-On: Quality Control of Short-Read Sequencing Data
3.2.7 Hands-On: Mapping Short Reads to Assembly for Polishing
3.2.8 Hands-On: Evaluate the Quality of Aligned Reads Data in BAM Format
3.2.9 Hands-On: Polish Assembly
3.2.10 Hands-On: Genome Assembly Metrics
4 Notes
References
Chapter 3: Microbial Genome Sequencing and Assembly Using Nanopore Sequencers
1 Introduction
2 Materials
2.1 Sequencing Reagents
2.2 Bioinformatics Tools
3 Methods
3.1 gDNA Extraction
3.2 Sequencing Library Preparation
3.2.1 DNA Repair and End Prep
3.2.2 Adapter Ligation and Cleanup
3.3 Nanopore Sequencing
3.4 Bioinformatics
4 Notes
References
Chapter 4: De Novo Genome Assembly of Japanese Black Cattle as Model of an Economically Relevant Animal
1 Introduction
2 Materials
2.1 Sample Preparation [Experiment]
2.2 Consumables and Equipment [Experiment]
2.3 Software [Data Analysis]
3 Methods
3.1 Experimental Procedures
3.1.1 Genomic DNA Extraction from Frozen Semen and DNA Quality Check
3.1.2 Long-Read Sequencing by PromethION
3.1.3 Short-Read Sequence by Illumina Sequencer
3.2 Bioinformatics Procedures
3.2.1 Outline
3.2.2 Code (See Note 13)
3.3 Analysis Example
3.4 Conclusions
4 Notes
References
Chapter 5: How to Sequence and Assemble Plant Genomes
1 Introduction
2 Materials
2.1 DNA Isolation
2.2 Size Selection
2.3 Library Prep
2.4 Draft Assembly
2.5 Assembling Organellar Genomes
2.6 Re-bridging Nuclear Genome
3 Methods
3.1 DNA Extraction
3.2 Size Selection
3.3 Library Prep
3.4 Draft Assembly
3.5 Assembling Organellar Genomes
3.6 Re-bridge the Nuclear Genome
4 Notes
References
Chapter 6: Detection of DNA Modification Using Nanopore Sequencers
1 Introduction
2 Comparison Method
3 Model-Based Method
4 Expanded Basecalling Method
5 Materials
6 Methods
7 Notes
References
Chapter 7: Ultralow-Input Genome Library Preparation for Nanopore Sequencing with Droplet MDA
1 Introduction
2 Materials
3 Methods
4 Notes
References
Chapter 8: The Method of Eliminating the Wolbachia Endosymbiont Genomes from Insect Samples Prior to a Long-Read Sequencing
1 Introduction
2 Materials
3 Methods
3.1 Rearing Host Spiders and the Parasitoid Larvae Until Adult Emergence (in the Case of R. nielseni, the Host Spiders Parasit...
3.2 Rearing the Adult Wasps for Sterilization of Wolbachia (in the Case of R. nielseni)
3.3 Homogenizing the Wasps, DNA Extraction, qPCR, and Evaporation
4 Notes
References
Chapter 9: A Nanopore Sequencing Course for Graduate School Curriculum
1 Introduction
2 Materials
2.1 Wet Lab Experiment Part
2.2 Bioinformatics Part
3 Methods
3.1 Genomic DNA Extraction
3.2 Nanopore Sequencing
3.3 Genome Assembly and Annotation
3.4 Genome Report
4 Notes
References
Part II: Analysis of Repetitive Regions and Structural Variants
Chapter 10: A Guide to Sequencing for Long Repetitive Regions
1 Introduction
2 Materials
2.1 Sample Homogenization
2.2 Nucleotide Extraction, Quantification, and Qualification
2.3 Sequencing Instruments and Library Prep Kits
2.4 PC Spec and Software
3 Methods
3.1 High-Molecular-Weight (HMW) Genomic DNA (gDNA) Isolation
3.2 Total RNA Extraction
3.3 gDNA Sequencing with a Nanopore Sequencer
3.4 Direct RNA Sequencing with a Nanopore Sequencer
3.5 cDNA Sequencing with an Illumina Sequencer
3.6 Bioinformatics Analysis
3.7 Filtering or Trimming of Sequence Reads
3.8 Terminal Domain Contig (as Seed Contig) Preparation
3.9 Collection of Repeat Units by Elongation
3.10 Scaffolding and Reordering of Repeat Unit Using Full-Length Long Read
3.11 Visualization of Repetitive Gene Architecture
4 Notes
References
Chapter 11: Analysis of Tandem Repeat Expansions Using Long DNA Reads
1 Introduction
2 Materials
2.1 Library Preparation for Long-Read Sequencing
2.2 Data Analysis
3 Methods
3.1 Designing gRNAs
3.2 Sequence Library Preparation
3.3 Data Analysis
3.3.1 Aligning Long Reads to the Reference Genome
3.3.2 Detect Changes in the Copy Number of Repeats with Tandem-Genotypes
3.3.3 Creating a Plot
3.3.4 Merge Reads to Create a Consensus Sequence
4 Notes
References
Chapter 12: Finding Rearrangements in Nanopore DNA Reads with LAST and dnarrange
1 Introduction
1.1 Examples of Match, Mismatch, Insertion, and Deletion Rates
1.2 Understanding Rearrangements
1.3 Simple Sequences
2 Methods
2.1 Installation
2.2 Getting the Camponotus Rates
2.3 Getting the Plasmodium falciparum Rates
2.4 Getting the Human Rates
2.5 Aligning Human DNA Reads to a Human Genome
2.6 An Alternative Way Using Windowmasker
2.7 Finding Rearrangements with dnarrange
2.8 Making Dotplot Figures of the Rearrangements
2.9 last-dotplot
2.10 Rearrangement Types and Thresholds
2.11 Other Features of dnarrange
3 Notes
References
Chapter 13: Long-Read Whole-Genome Sequencing Using a Nanopore Sequencer and Detection of Structural Variants in Cancer Genomes
1 Introduction
2 Materials
2.1 Sample Preparation (Experiment)
2.2 Consumables (Experiment)
2.3 Equipment (Experiment)
2.4 Dataset (Data Analysis)
2.5 Software (Data Analysis)
2.6 Reference Genome (Data Analysis)
3 Methods
3.1 HMW DNA Extraction (Experiment)
3.2 Library Preparation and Sequencing (Experiment)
3.3 Basecalling (Data Analysis)
3.4 Mapping of Sequencing Reads to the Reference Genome (Data Analysis)
3.5 SV Detection (Data Analysis)
3.6 Visualization and Interpretation of the Detected SVs (Data Analysis)
4 Notes
References
Part III: Rapid On-Site Microbial Detection and Epidemiology
Chapter 14: Full-Length 16S rRNA Gene Analysis Using Long-Read Nanopore Sequencing for Rapid Identification of Bacteria from C...
1 Introduction
2 Materials
2.1 General Laboratory Supplies
2.2 Sample Preparation
2.3 DNA Extraction
2.4 PCR
2.5 PCR Cleanup
2.6 DNA Quantification
2.7 Library Construction and Nanopore Sequencing
3 Methods
3.1 Preparation of Clinical Samples
3.1.1 Preparing Fecal Samples
3.1.2 Preparing Sputum Samples
3.1.3 Preparing Swab Samples
3.1.4 Preparing Whole Blood Samples
3.2 Bacterial Cell Disruption by Bead Beating
3.3 Automated DNA Purification Using Maxwell RSC System
3.4 Two-Step PCR
3.4.1 First PCR: Amplification of the 16S rRNA Gene
3.4.2 Second PCR
3.5 PCR Cleanup
3.6 DNA Quantification
3.7 Sequencing Library Preparation
3.8 Nanopore Sequencing
3.9 Bioinformatics Analysis
3.9.1 Taxonomic Classification Using EPI2ME Fastq 16S Workflow
3.9.2 Consensus Calling for Nanopore Sequencing Reads
4 Notes
References
Chapter 15: Nanopore Sequencing Data Analysis of 16S rRNA Genes Using the GenomeSync-GSTK System
1 Introduction
2 Materials
3 Methods
3.1 Preparing the Database
3.2 Preparing the Pipeline
3.3 Example of GenomeSync-GSTK Analysis
4 Notes
References
Chapter 16: Genomic Epidemiological Analysis of Antimicrobial-Resistant Bacteria with Nanopore Sequencing
1 Introduction
2 Materials and Methods
2.1 High Molecular Weight (HMW) of Bacterial Genomic DNA (gDNA)
3 Methods
3.1 Construction of Bacterial Complete Genomes with Nanopore Sequencing Data
3.2 Detection and Classification of Core Genes, Accessory Genes, and MGEs in Bacterial Genomes
3.3 Genomic Epidemiological Analysis of AMR Bacterial Isolates Using Their Genomes
4 Notes
References
Chapter 17: Rapid and Comprehensive Identification of Nontuberculous Mycobacteria
1 Introduction
2 Materials
2.1 Equipment
2.2 Consumables for Library Preparation and MinION Sequencing
2.3 Software Requirements for Computational Analysis
3 Methods
3.1 DNA Extraction from Culture
3.2 Preparing Library and Sequencing
3.3 Computational Analysis
4 Notes
References
Part IV: Nanopore Sequencing for Transcriptomics and Beyond
Chapter 18: Long-Read Single-Cell Sequencing Using scCOLOR-seq
1 Introduction
2 Materials
2.1 Oligonucleotides
2.2 Oligonucleotide Bead Design
2.3 Cell Encapsulation
2.4 Emulsion Breaking and Reverse Transcription
2.5 Exonuclease Digestion and SMART PCR
2.6 Nanopore Library Preparation and Sequencing
3 Methods
3.1 Cell Encapsulation
3.2 Emulsion Breaking
3.3 Reverse Transcription
3.4 Exonuclease Digestion
3.5 SMART PCR
3.6 Nanopore Library Preparation
3.7 Sequencing
4 Notes
References
Chapter 19: Unfolding the Bacterial Transcriptome Landscape Using Oxford Nanopore Technology Direct RNA Sequencing
1 Introduction
2 Materials
2.1 Sequencing Bacterial RNA
2.2 Software
2.3 System Requirement
3 Methods
3.1 Ribosomal RNA (rRNA) Depletion
3.2 Polyadenylation
3.3 Direct RNA-Seq
3.4 Priming and Loading the Flow Cell
3.5 UNAGI Pipeline
3.6 Running UNAGI
4 Notes
References
Chapter 20: Nanopore Direct RNA Sequencing of Monosome- and Polysome-Bound RNA
1 Introduction
2 Materials
2.1 Preparation of Whole Cell Extracts
2.2 Polysome Fractionation
2.3 RNA Purification from Sucrose Fractions
2.4 Direct RNA Sequencing
2.5 Bioinformatics Analysis
3 Methods
3.1 Preparation of Whole Cell Extracts (WCE)
3.2 Polysome Fractionation
3.3 RNA Extraction and Enrichment
3.4 Library Preparation for Direct RNA Sequencing
3.5 Bioinformatics Analysis
4 Notes
References
Chapter 21: RNA Modification Detection Using Nanopore Direct RNA Sequencing and nanoDoc2
1 Introduction
2 Materials
2.1 RNA-Sequence Data
2.2 Reference Data
2.3 Computational Environment
3 Method
3.1 Overview
3.1.1 Signal Resquiggling by Viterbi Using Trace Value
3.1.2 Training Using the Deep One-Class Algorithm
3.1.3 RNA Modification Detection Using Clustering
3.2 Data Preparation
3.3 Installation and Preparation of nanoDoc2
3.4 Inferring Modifications with nanoDoc2 Using Previously Trained Model
3.4.1 Mapping and Resquiggling
3.4.2 Modification Detection
3.4.3 Interpretation of Output
3.5 Training nanodoc2 for IVT Data
3.5.1 Prepare Training Dataset for Initial Training
3.5.2 Pre-training Model for 6-Mer Classification
3.5.3 Deep One-Class Classification for Each 6-Mer
4 Notes
References
Index

Methods in Molecular Biology 2632

Kazuharu Arakawa Editor

Nanopore Sequencing Methods and Protocols

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Nanopore Sequencing Methods and Protocols

Edited by

Kazuharu Arakawa Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan

Editor Kazuharu Arakawa Institute for Advanced Biosciences Keio University Tsuruoka, Japan

ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-2995-6
ISBN 978-1-0716-2996-3 (eBook)
https://doi.org/10.1007/978-1-0716-2996-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

The advent of sequencing technology has clearly been a fundamental driving force in the rapid maturation of molecular biology over the past two decades, and nanopore sequencing, developed by Oxford Nanopore Technologies (ONT), is one of the latest technologies in this line, with the commercial release of the MinION device in 2015. Several notable differences and advantages set this platform apart from previous technologies. Firstly, the sequencing method simply tracks the shifts in electric current as a DNA molecule passes through the protein nanopore, which removes the need for optical detection devices and therefore achieves an extremely small form factor, resulting in portable devices at economical pricing. Secondly, there is no limit on the sequencing length, and ultra-long reads of megabases are routinely reported. Thirdly, the devices can directly sequence RNA molecules, achieving a true direct RNA-Seq (rather than the conventional RNA-Seq, which is actually cDNA-Seq). Fourthly, neither the sequencing method nor library preparation requires PCR amplification, and thus all types of modifications in DNA as well as RNA molecules can be detected. The portable, affordable, and real-time nature of the devices has been thoroughly tested in field environments, such as at sites of disease outbreaks, in polar regions, and even on the International Space Station.

Nanopore sequencing technologies are being updated at an extremely rapid pace; in fact, the majority of new findings and software developments are communicated as preprints, because the conventional peer-reviewed journal submission process can leave parts of a report outdated by the time of publication. Sequencing throughput and accuracy are constantly improving, and new applications continue to emerge. Moreover, at least an overview-level understanding of both the experimental and bioinformatic aspects of sequencing is often critical to take full advantage of this fast-evolving platform. It is to this end that this book aims to coherently combine the dry and wet aspects of experiments in comprehensively describing numerous nanopore sequencing methodologies and applications, so that it will be useful for novices and experts alike. Most chapters include both the experimental and bioinformatic procedures, and some more specialized chapters go into greater depth on one or the other. Four major tenets of nanopore sequencing are discussed: (1) genome sequencing and assembly, (2) analysis of repetitive regions and structural variations, (3) rapid and on-site microbial identification and epidemiology, and (4) transcriptome analysis taking advantage of direct RNA sequencing. It is hoped that this book will serve as a guide to utilizing the most recent approaches in nanopore sequencing.

Tsuruoka, Japan
Kazuharu Arakawa

Contents

Preface
Contributors

PART I NANOPORE SEQUENCING FOR GENOMICS AND BEYOND

1 The Current State of Nanopore Sequencing
Jonathan Pugh
2 Hybrid Genome Assembly of Short and Long Reads in Galaxy
Tazro Ohta and Yuh Shiwa
3 Microbial Genome Sequencing and Assembly Using Nanopore Sequencers
Makoto Taniguchi and Kazuma Uesaka
4 De Novo Genome Assembly of Japanese Black Cattle as Model of an Economically Relevant Animal
Shinji Sasaki, Yasuhiko Haga, Hiroyuki Wakaguri, Kazumi Abe, and Yutaka Suzuki
5 How to Sequence and Assemble Plant Genomes
Ken Naito
6 Detection of DNA Modification Using Nanopore Sequencers
Yoshikazu Furuta
7 Ultralow-Input Genome Library Preparation for Nanopore Sequencing with Droplet MDA
Kazuharu Arakawa
8 The Method of Eliminating the Wolbachia Endosymbiont Genomes from Insect Samples Prior to a Long-Read Sequencing
Keizo Takasuka and Kazuharu Arakawa
9 A Nanopore Sequencing Course for Graduate School Curriculum
Kazuharu Arakawa

PART II ANALYSIS OF REPETITIVE REGIONS AND STRUCTURAL VARIANTS

10 A Guide to Sequencing for Long Repetitive Regions
Nobuaki Kono
11 Analysis of Tandem Repeat Expansions Using Long DNA Reads
Satomi Mitsuhashi and Martin C. Frith
12 Finding Rearrangements in Nanopore DNA Reads with LAST and dnarrange
Martin C. Frith and Satomi Mitsuhashi
13 Long-Read Whole-Genome Sequencing Using a Nanopore Sequencer and Detection of Structural Variants in Cancer Genomes
Yasuhiko Haga, Yoshitaka Sakamoto, Miyuki Arai, Yutaka Suzuki, and Ayako Suzuki

PART III RAPID ON-SITE MICROBIAL DETECTION AND EPIDEMIOLOGY

14 Full-Length 16S rRNA Gene Analysis Using Long-Read Nanopore Sequencing for Rapid Identification of Bacteria from Clinical Specimens
Yoshiyuki Matsuo
15 Nanopore Sequencing Data Analysis of 16S rRNA Genes Using the GenomeSync-GSTK System
Kirill Kryukov, Tadashi Imanishi, and So Nakagawa
16 Genomic Epidemiological Analysis of Antimicrobial-Resistant Bacteria with Nanopore Sequencing
Masato Suzuki, Yusuke Hashimoto, Aki Hirabayashi, Koji Yahara, Mitsunori Yoshida, Hanako Fukano, Yoshihiko Hoshino, Keigo Shibayama, and Haruyoshi Tomita
17 Rapid and Comprehensive Identification of Nontuberculous Mycobacteria
Yuki Matsumoto and Shota Nakamura

PART IV NANOPORE SEQUENCING FOR TRANSCRIPTOMICS AND BEYOND

18 Long-Read Single-Cell Sequencing Using scCOLOR-seq
Martin Philpott, Udo Oppermann, and Adam P. Cribbs
19 Unfolding the Bacterial Transcriptome Landscape Using Oxford Nanopore Technology Direct RNA Sequencing
Mohamad Al Kadi and Daisuke Okuzaki
20 Nanopore Direct RNA Sequencing of Monosome- and Polysome-Bound RNA
Lan Anh Catherine Nguyen, Toshifumi Inada, and Josephine Galipon
21 RNA Modification Detection Using Nanopore Direct RNA Sequencing and nanoDoc2
Hiroki Ueda, Bhaskar Dasgupta, and Bo-yi Yu

Index

Contributors

KAZUMI ABE • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
MOHAMAD AL KADI • Single Cell Genomics, Human Immunology, WPI Immunology Frontier Research Center, Osaka University, Osaka, Japan
MIYUKI ARAI • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
KAZUHARU ARAKAWA • Institute for Advanced Biosciences, Keio University, Tsuruoka, Yamagata, Japan; Graduate School of Media and Governance, Keio University, Fujisawa, Kanagawa, Japan; Faculty of Environment and Information Studies, Keio University, Fujisawa, Kanagawa, Japan; Exploratory Research Center on Life and Living Systems (ExCELLS), National Institutes of Natural Sciences, Okazaki, Aichi, Japan
ADAM P. CRIBBS • Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, National Institute of Health Research Oxford Biomedical Research Unit (BRU), University of Oxford, Oxford, UK; Oxford Centre for Translational Myeloma Research University of Oxford, Oxford, UK
BHASKAR DASGUPTA • Biological data Science Division, Research Center for Advanced Science and Technologies, The University of Tokyo, Tokyo, Japan
MARTIN C. FRITH • Artificial Intelligence Research Center, AIST, Tokyo, Japan; Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan; Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo, Japan
HANAKO FUKANO • Department of Mycobacteriology, Leprosy Research Center, National Institute of Infectious Diseases, Tokyo, Japan
YOSHIKAZU FURUTA • Toyota Central R&D Labs., Inc., Nagakute, Japan
JOSEPHINE GALIPON • Institute for Advanced Sciences, Keio University, Yamagata, Tsuruoka, Japan; Graduate School of Media and Governance, Keio University, Fujisawa, Kanagawa, Japan
YASUHIKO HAGA • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
YUSUKE HASHIMOTO • Department of Bacteriology, Gunma University Graduate School of Medicine, Maebashi, Japan
AKI HIRABAYASHI • Antimicrobial Resistance Research Center, National Institute of Infectious Diseases, Tokyo, Japan
YOSHIHIKO HOSHINO • Department of Mycobacteriology, Leprosy Research Center, National Institute of Infectious Diseases, Tokyo, Japan
TADASHI IMANISHI • Department of Molecular Life Science, Tokai University School of Medicine, Kanagawa, Japan
TOSHIFUMI INADA • The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
NOBUAKI KONO • Institute for Advanced Biosciences, Keio University, Tsuruoka City, Yamagata, Japan
KIRILL KRYUKOV • Department of Informatics, National Institute of Genetics, Shizuoka, Japan
YUKI MATSUMOTO • Department of Infection Metagenomics, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan
YOSHIYUKI MATSUO • Department of Human Stress Response Science, Institute of Biomedical Science, Kansai Medical University, Osaka, Japan
SATOMI MITSUHASHI • Department of Genomic Function and Diversity, Tokyo Medical and Dental University, Tokyo, Japan; Division of Neurology, Department of Internal Medicine, St. Marianna University School of Medicine, Kawasaki, Kanagawa, Japan
KEN NAITO • Research Center of Genetic Resources, National Agriculture and Food Research Organization, Ibaraki, Japan
SO NAKAGAWA • Department of Molecular Life Science, Tokai University School of Medicine, Kanagawa, Japan
SHOTA NAKAMURA • Department of Infection Metagenomics, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan; Center for Infectious Disease Education and Research, Osaka University, Osaka, Japan
LAN ANH CATHERINE NGUYEN • Institute for Advanced Sciences, Keio University, Yamagata, Tsuruoka, Japan; Graduate School of Media and Governance, Keio University, Fujisawa, Kanagawa, Japan
TAZRO OHTA • Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Mishima, Shizuoka, Japan
DAISUKE OKUZAKI • Single Cell Genomics, Human Immunology, WPI Immunology Frontier Research Center, Osaka University, Osaka, Japan; Genome Information Research Center, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan; Institute for Open and Transdisciplinary Research Initiatives, Osaka University, Osaka, Japan
UDO OPPERMANN • Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, National Institute of Health Research Oxford Biomedical Research Unit (BRU), University of Oxford, Oxford, UK; Oxford Centre for Translational Myeloma Research University of Oxford, Oxford, UK
MARTIN PHILPOTT • Botnar Research Centre, Nuffield Department of Orthopedics, Rheumatology and Musculoskeletal Sciences, National Institute of Health Research Oxford Biomedical Research Unit (BRU), University of Oxford, Oxford, UK
JONATHAN PUGH • Oxford Nanopore Technologies, Oxford, UK
YOSHITAKA SAKAMOTO • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
SHINJI SASAKI • University of the Ryukyus, Faculty of Agriculture, Okinawa, Japan; United Graduate School of Agricultural Sciences, Kagoshima University, Kagoshima, Japan
KEIGO SHIBAYAMA • Department of Bacteriology, Nagoya University Graduate School of Medicine, Nagoya, Japan
YUH SHIWA • Laboratory of Bioinformatics, Department of Molecular Microbiology, Faculty of Life Sciences, Tokyo University of Agriculture, Setagaya, Tokyo, Japan
AYAKO SUZUKI • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
MASATO SUZUKI • Antimicrobial Resistance Research Center, National Institute of Infectious Diseases, Tokyo, Japan
YUTAKA SUZUKI • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
KEIZO TAKASUKA • Institute for Advanced Biosciences, Keio University, Yamagata, Japan; Graduate School of Media and Governance, Keio University, Kanagawa, Japan
MAKOTO TANIGUCHI • Oral Microbiome Center, Taniguchi Dental Clinic, Kagawa, Japan; Genome Read Inc., Kagawa, Japan
HARUYOSHI TOMITA • Department of Bacteriology, Gunma University Graduate School of Medicine, Maebashi, Japan; Laboratory of Bacterial Drug Resistance, Gunma University Graduate School of Medicine, Maebashi, Japan
HIROKI UEDA • Biological data Science Division, Research Center for Advanced Science and Technologies, The University of Tokyo, Tokyo, Japan
KAZUMA UESAKA • The Center for Gene Research, Nagoya University, Aichi, Japan
HIROYUKI WAKAGURI • Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, Japan
KOJI YAHARA • Antimicrobial Resistance Research Center, National Institute of Infectious Diseases, Tokyo, Japan
MITSUNORI YOSHIDA • Department of Mycobacteriology, Leprosy Research Center, National Institute of Infectious Diseases, Tokyo, Japan
BO-YI YU • Biological data Science Division, Research Center for Advanced Science and Technologies, The University of Tokyo, Tokyo, Japan

Part I Nanopore Sequencing for Genomics and Beyond

Chapter 1
The Current State of Nanopore Sequencing
Jonathan Pugh

Abstract
Nanopore sensing is a disruptive, revolutionary way in which to sequence nucleic acids, including both native DNA and RNA molecules. The technology was first commercialized with the MinION™ sequencer from Oxford Nanopore Technologies™ in 2015; this review article looks at the current state of nanopore sequencing as of June 2022. After covering the unique characteristics of the technology and how it functions, we go on to look at the ability of the platform to deliver sequencing at all scales, from personal to high-throughput devices, before looking at how the scientific community is applying the technology around the world to answer their biological questions.

Key words Nanopore, Sequencing, Applications, MinION, PromethION, GridION, Flongle, Disruptive

1 Introduction

The term “disruptive” is applied to technologies all too often. Without looking far in the biotechnology, consumer electronics, or commercial technology space, you find claims of the next best thing. How then are genuine claims sorted from background noise? The answer lies not in the messaging from any individual company, but in the actions of technology users. Think of the mobile phone—now entirely integrated into society, it was also once viewed as disruptive. There came a tipping point in its adoption when people stepped through their front doors, looked at the mobile in their hand, and collectively decided the landline was a thing of the past. Novel aspects of the technology (hey look, you can pretend to be a snake that eats flowers!) combined well with traditional capabilities, consumers decided the technology was good enough for them, and the disruption was complete.

Nanopore sequencing is a technology demonstrating disruptive capabilities in users’ hands right now. In 2022, a group of researchers at Stanford University set a Guinness world record for

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


the fastest whole genome sequence [1]—one from a clinical research setting [2]. In 2016, researchers on the International Space Station carried out the first ever sequencing in space—and now are better informed on unknown microbes and their scientific experiments alike [3]. And when the world came to a halt in 2020 due to the global SARS-CoV-2 pandemic, nanopore sequencing helped researchers across the globe to sequence, assess, and track viral genomic information in ways that previously had not been possible within-country [4–6], building on work carried out on Ebola [7] and Zika [8] virus sequencing. It was the scientific community, seeing the potential of the sequencing platform provided by Oxford Nanopore Technologies, who stepped forward to demonstrate its disruptive capabilities in the real world.

Unsurprisingly, disruptive technologies are often based upon concepts not previously utilized in the space the technology targets. When thinking of whole genome sequencing, individuals commonly jump to short-read optical-based methods: widespread in use, but with limitations in device size and time to answer, among other things. Sequencing in this way has changed little over the decades, as centrally located devices rely upon increased parallelization and chemistry tweaks to demonstrate “progress.” Long-read optical-based efforts yield greater biological insight, but are, if anything, subject to more extreme device limitations than short-read optical methods. Nanopore sequencing enables both long-read and short-read biological insights, while shrugging off optical limitations and instead utilizing digital signals from an electrical output. Put romantically, the analogue wristwatch of optical sequencing, quaint and admired with historical sentiment, is being displaced by the digital smartwatch of nanopore sensing, connecting you to the world and delivering more information than you ever knew a wristwatch could.

2 The Technology

Charming analogies aside, how nanopore sequencing works is beautifully simple yet has proven fiendish to implement. Originally penned in 1989 by Professor David Deamer of UC Davis, the idea then lay dormant until a conversation with Harvard University’s Professor Dan Branton in 1991 [9]. Following years of academic development and collaboration, the concept was finally commercialized with the launch of the MinION device in 2015 [9]. The basis of how this system functions is covered in Fig. 1, and the resultant signal is decoded to the sequence of bases on the DNA strand through artificial neural networks akin to those used for speech recognition. Each nanopore does this thousands of times, strand after strand, resulting in an incredibly high-throughput system that can be tuned by altering total nanopore numbers or


Fig. 1 The principle of nanopore sequencing. (a) A protein nanopore (blue) is embedded into an electrically resistant lipid membrane (grey), before adapted DNA libraries containing a motor protein (purple) are introduced, and the motor feeds DNA progressively through the pore. (b) An ionic current (represented by light blue dots) is passed through the nanopore as the DNA translocates through the pore. (c) The bases within the nanopore block the current depending on their size and structure. As the strand moves progressively through the pore, a “squiggle” trace is produced, which is decoded into sequence data using artificial neural networks

the speed at which the molecular motor operates, among other variables. Being based upon electrical signals rather than optical outputs, the system provides real-time feedback, permitting experiments to be stopped and started as required. The merits of the nanopore system further extend to its capacity to sequence native DNA and RNA, avoiding amplification bias associated with PCR-reliant technologies [10]. Crucially, this allows native base modifications (e.g., methylation) to be captured in the information-rich nanopore signal with no additional sample preparation. Recent algorithmic developments ensure methylation status is provided alongside canonical base data [11]—this real-time data is permitting innovative research customers to demonstrate potential applications for intraoperative central nervous system (CNS) tumor classification while the patient is still on the operating table [12]. One final, and to some surprising, characteristic of nanopore sequencing is the read length-agnostic nature of the system. Nanopore sequencing has been typecast as a long-read technology, understandable when read lengths of up to 4 million bases in one continuous read have been demonstrated (easily the longest in the world). This sort of capability should not be underestimated, as longer fragments provide comprehensive information not possible with short reads: structural variants implicated in human disease or agricultural traits can reach up to megabases in scale, far beyond the ability of short reads to span; expressed transcripts can be sequenced end to end in single long reads, enabling unambiguous identification of fusion transcripts; and long reads allow you to access more of the genome than short reads [13], unarguably providing insights not possible with other technologies. On the short-read front, following updates to the sequencing software in early 2022, reads as short as 20 bases can now be processed. With


this capability, nanopore sequencing can provide insights into applications involving but not limited to cell-free DNA, ancient DNA, or liquid biopsy research. The same flow cells, devices, software, and chemistry are used for short and megabase-long reads alike, making nanopore sequencing the only single technology capable of spanning five orders of magnitude in read length. With nanopore sequencing you can observe the true biology present rather than simply sequence a biased proxy of your original sample.
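The "five orders of magnitude" figure can be checked from the two read-length extremes quoted in this section (20 bases and 4 million bases). The following is a minimal sketch of that arithmetic; the function and constant names are illustrative only and not part of any Oxford Nanopore software.

```python
import math

# Read-length extremes quoted in the text, in bases.
SHORTEST_READ = 20          # shortest read that can be processed (early 2022 software)
LONGEST_READ = 4_000_000    # longest continuous read demonstrated


def orders_of_magnitude(shortest: int, longest: int) -> float:
    """Return log10 of the ratio between the longest and shortest read length."""
    return math.log10(longest / shortest)


span = orders_of_magnitude(SHORTEST_READ, LONGEST_READ)
print(f"Read lengths span about {span:.1f} orders of magnitude")
```

The ratio is 200,000, which is a little over five powers of ten.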

3 The Platform

Putting the performance of the Oxford Nanopore platform on paper always poses a risk: by the time this work (written in June 2022) is published, it is highly likely some of the below information will be outdated. This demonstrates a core merit of Oxford Nanopore, that every aspect of the platform—nanopores, motors, buffers, software, algorithms, and hardware—is consistently iterated and improved upon to deliver ever greater performance (locked-down versions of the platform are available for those requiring revision control). In doing so, however, Oxford Nanopore ensures the community of scientists using its technology has access to the scale and performance of sequencing they need to answer their specific biological question.

Focusing on scale, thanks to a chemistry consistent across all flow cell and device types, a researcher interested in, e.g., fusion genes can use the Flongle™ for cheap, rapid, targeted results in minutes, or can use high-output PromethION™ Flow Cells to identify novel isoforms even if they are in low abundance. This is achieved through a combination of flow cell and device offerings, from the very small to the very high throughput (Fig. 2).

The MinION, the first sequencer launched by Oxford Nanopore, is powered via USB and runs off a laptop. The size of the MinION (W 105 mm, H 23 mm, D 33 mm, and a featherweight 87 g) makes it simple to ship (just pop it in a jiffy bag!) and lends itself to portability so well that it has been used at some of the farthest-flung regions of the globe [14] and off it [3]. Running MinION Flow Cells with 512 channels for sequencing, it has gone from generating 500 Mbases of data at launch to demonstrated outputs of 43 gigabases (Gb) in-field [15]. Using MinION as a benchmark, it is possible to see how the technology has since been scaled upwards and downwards.
The Flongle, consisting of the Flongle adapter (with the same footprint as the MinION Flow Cell) and a cheap, disposable Flongle Flow Cell (see Fig. 2), has over fourfold fewer channels (126) than the MinION Flow Cell but can still generate up to 2.8 Gb of data. Perfect for targeted experiments or sequencing of small


Fig. 2 The flow cells and devices for nanopore sequencing. The Flongle (a) consists of two parts, a reusable adapter, and a single-use flow cell. It has the same footprint as the MinION Flow Cell (b) meaning both can be run on the MinION (d), MinION Mk1C (e), or GridION (f) devices. Any combination of Flongle or MinION can be run on the GridION device. The PromethION Flow Cell (c) is compatible with all PromethION devices (g–j). With capacity for different numbers of flow cells, total device yields vary in line with the number of flow cells they can run. Where multiple flow cells can be run, all are individually controllable, meaning no requirement exists to run all flow cells at once and as a result samples can be run on demand. *Theoretical maximum output when flow cell or device is run 72 h (16 h for Flongle) at 420 bases/second. For devices, this is when all flow cells are run at once and the highest yielding flow cell option is chosen. Outputs may vary according to library type, run conditions, etc.

microbial genomes, the Flongle represents an individual flow cell price point that is unmatched and, with its innovative separation of electronics from fluidics, acts as a technological primer for future cost reductions against the higher-output MinION and PromethION Flow Cells. The PromethION Flow Cell, as with the MinION Flow Cell, comes in a reusable design that can be washed and reused to maximize data output across several experiments (and can be returned to Oxford Nanopore for recycling also). With 2,675 channels for sequencing, as of early 2022, the best-demonstrated yield in-field stands at 245 Gb [16]. As mentioned above, the technology stands to make further improvements and, with adjustments such as passing the DNA through nanopores even faster, holds promise of as much as doubling outputs as they stand today.
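The footnote to Fig. 2 defines theoretical maximum output in terms of run time (72 h, or 16 h for Flongle) and translocation speed (420 bases/second). A naive ceiling can be sketched by multiplying these by the channel counts given in the text. This sketch assumes every channel sequences continuously for the whole run, which is an idealization introduced here and not a statement of how the published figures were derived; the function name is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def theoretical_max_gb(channels: int, hours: float, bases_per_second: float = 420) -> float:
    """Idealized ceiling in gigabases: all channels sequence non-stop for the full run."""
    return channels * bases_per_second * hours * SECONDS_PER_HOUR / 1e9


# Channel counts and run times as quoted in the text and the Fig. 2 footnote.
flow_cells = {
    "Flongle": (126, 16),
    "MinION": (512, 72),
    "PromethION": (2675, 72),
}

for name, (channels, hours) in flow_cells.items():
    print(f"{name}: {theoretical_max_gb(channels, hours):.1f} Gb")
```

The naive product for the Flongle comes out slightly above the 2.8 Gb quoted above, presumably because not every channel is occupied for the whole run, so these values are best read as upper bounds.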


Oxford Nanopore sequencing devices (Fig. 2) are designed around the flow cells they run. Anything operating a MinION Flow Cell—the MinION, MinION Mk1C, and GridION™—is also compatible with Flongle Flow Cells. Where multiple flow cell positions are available as with the GridION, any combination of MinION or Flongle can be run alongside one another and stopped and started as experimental requirements dictate. PromethION Flow Cells run exclusively on PromethION devices, with the highest output PromethION 48 running up to 48 flow cells in parallel. In addition to a 24-flow-cell version, two recent additions to the PromethION line (PromethION 2 and PromethION 2 Solo) are capable of running two flow cells each, providing options for those who want PromethION-scale yield but with fewer samples to sequence. Except for the MinION and the PromethION 2 Solo, all devices contain the necessary compute to run sequencing and carry out basecalling—it is perfectly possible for the two remaining devices to be run from commercially available laptops. All devices utilize MinKNOW™, a data acquisition and control software, for run setup, basecalling, and data handling.

Sample preparation is possible through manual or automated methods. Kits are available to prepare native, amplified, or ultralong DNA libraries, along with cDNA and direct RNA. More recently, kits have been adapted for automation on liquid handlers. In addition to this, Oxford Nanopore has created the VolTRAX™, an automated sample preparation device containing magnetic arrays, heating elements, and fluorometers, running from a laptop and working with VolTRAX-enabled forms of Oxford Nanopore library preparation kits. Notable to Oxford Nanopore is the ability to prepare libraries of native DNA samples, which may contain a range of fragment sizes. In conjunction with retained modified base information, the technology is capable of giving a true representation of the biology present.
To complete the “full stack” offering, after basecalling with tools such as Guppy (to be replaced with a new basecaller, Dorado, in late 2022), a number of analysis options are available for those new to data analysis or the very green-fingered alike. The click-and-go, cloud-based analysis platform EPI2ME™ offers predefined workflows for species identification, structural variant calling, single cell analysis, and more. For those wishing to learn to build and manipulate data analysis pipelines, EPI2ME Labs provides step-by-step, didactic guidance for multiple applications including assembly, metagenomic classification, and cDNA isoform detection. Built within the Nextflow framework, EPI2ME Labs leverages the ability to scale to cluster- or cloud-based installations for those who need to deploy and routinely run high-throughput analysis.

4 Applications

While perhaps obvious due to the breadth of offerings covered thus far, it is worth reinforcing that Oxford Nanopore is not a company making a single device do a single thing for a single group of people. Oxford Nanopore is creating the next generation of tools for use by tomorrow’s scientists, thriving and succeeding off the incredible work by the scientific community: from clinical research on NICU cases [2] to tracking the endangered Kakapo [17], from sequencing on Icelandic glaciers [18] to identifying unwanted yeasts in breweries [19], research into the genomes of cancer patients in multisample studies [20], or checking the sequence of plasmid constructs [21]. Through the hard work of this community, the boundaries of what is possible are consistently redrawn, and in return Oxford Nanopore continues to drive their products to generate faster, more accurate, and lower-cost results.

On MinION alone, yields have increased over a hundredfold since the device’s introduction, developing hand in hand with accuracy improvements. As of June 2022 (taking note of the cautionary words at the start of The Platform section), single-pass, single-molecule accuracy sits at 99.6% modal. Duplex accuracy, where both strands of the DNA molecule are sequenced, is at 99.92% modal accuracy, greater than Q30—and there is still room for improvement. For years nanopore sequencing has provided insights well beyond the reach of alternative technologies, and with the mission to enable the analysis of anything, by anyone, anywhere, these insights will continue to develop not only in the DNA sequencing space, but in the transcriptomic, proteomic, and epigenomic space also. Here is a collection of just some of the ways nanopore technology is currently being used around the world.
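The statement that 99.92% accuracy is "greater than Q30" follows from the standard Phred relationship Q = -10 log10(1 - accuracy), under which Q30 corresponds to 99.9% accuracy. The sketch below applies that conversion to the two modal accuracies quoted above; the function names are illustrative.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Convert a per-base accuracy (fraction between 0 and 1) to a Phred quality score."""
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-q / 10.0)


# Modal accuracies quoted in the text (June 2022).
for label, acc in [("single-pass", 0.996), ("duplex", 0.9992)]:
    print(f"{label}: {acc:.2%} accuracy is about Q{accuracy_to_phred(acc):.1f}")
```

On this scale the single-pass figure sits just under Q24 and the duplex figure just under Q31.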

4.1 The World’s Fastest Genomes

Alluded to already, Oxford Nanopore worked with a team led by Stanford University School of Medicine to develop a rapid, whole genome sequencing approach, setting a world record for the fastest human genome ever sequenced in the process: 5 h and 2 min [1]. Headed by Dr Euan Ashley with collaborators from Oxford Nanopore; University of California, Santa Cruz; Baylor College of Medicine; NVIDIA; and Google, this stunning new benchmark was possible in part thanks to the PromethION 48, its utilization maximized by splitting one genome across all 48 flow cells at once [2, 22]. In one iteration of their experimental setup, the PromethION generated 204 Gb in 2 h and 42 min [2], just over 60-fold coverage of the human genome with one complete pass sequenced every ~2.5 min. This unprecedented speed of sequencing enabled the team to identify key variants in experimental data within 7 h and 18 min [22], almost half of that previously achieved with short-read sequencing [23].
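The coverage and "one pass every ~2.5 min" figures can be reproduced with simple arithmetic. The sketch below assumes a haploid human genome size of 3.1 Gb, which is an assumption introduced here and not a value stated in the text; the function names are illustrative.

```python
# Assumed haploid human genome size in gigabases (an assumption of this sketch).
GENOME_SIZE_GB = 3.1


def fold_coverage(yield_gb: float, genome_gb: float = GENOME_SIZE_GB) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return yield_gb / genome_gb


def minutes_per_pass(run_minutes: float, coverage: float) -> float:
    """Average time needed to sequence one genome-equivalent of data."""
    return run_minutes / coverage


run_minutes = 2 * 60 + 42          # 2 h 42 min, as quoted in the text
coverage = fold_coverage(204)      # 204 Gb generated in that time
print(f"Coverage: {coverage:.1f}-fold")
print(f"One genome-equivalent every {minutes_per_pass(run_minutes, coverage):.2f} min")
```

With these inputs the coverage lands in the mid-60s and each genome-equivalent takes roughly two and a half minutes, in line with the numbers reported above.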


4.2 The World’s Largest Genomes

PromethION Flow Cells are not only powerful in parallel, but when run individually provide hundreds of Gb of sequence data, including long sequencing reads. This capability enabled the successful assembly of the Australian lungfish genome, which, at 43 Gb, is ~30% larger than the previous record assembly belonging to the axolotl [24, 25] and currently the largest known animal genome assembly in the world. Using data batched into buckets of read N50 9 kb, 27 kb, or 46 kb, the resultant assembly possessed a contig N50 over 6000 times greater than the alternative long-read axolotl assembly, with over 17,000-fold fewer contigs [26].
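Read N50 and contig N50, used above to compare datasets and assemblies, are the length L such that sequences of length L or longer together contain at least half of all bases. The following is a minimal sketch of that calculation on toy lengths; it is not taken from any particular assembly tool.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half of the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return ordered[-1]  # not reached for positive lengths; kept for completeness


# Toy contig lengths in kb (illustrative values only).
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))
```

For the toy list the total is 290, so the two longest contigs (80 and 70) are enough to pass the halfway mark and the N50 is 70.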

4.3 The World’s Most Complete Human Genome

It probably has not escaped the attention of readers that in 2021, the Telomere-to-Telomere Consortium announced the full completion of the human genome, addressing the remaining gaps and removing “a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis, including all centromeric regions and the entire short arms of five human chromosomes” [27]. Specifically, reads greater than 100 kb in length generated with nanopore sequencing enabled the complete assemblies of the Y centromere and the entirety of chromosome X, excelling at highly repetitive regions of the genome that could not be adequately resolved by other technologies [27]. This work enables comprehensive study of genomic variation across the entire human genome, poised to drive future discovery in human health and disease [27].

4.4 Viral Genomes the World Over

So much has been written about the global SARS-CoV-2 pandemic and its impact on our lives, there is no succinct way to introduce this topic. Oxford Nanopore was humbled to be identified by researchers the world over as able to play a key part in the tracking of this ever-evolving situation [28], where rapid sequencing workflows are coupled with data streaming and analysis in real time. Thanks to the ARTIC Network and their comprehensive library preparation [29] and data analysis [30] workflows, modular, scalable devices such as GridION could conduct on-demand experiments offering rapid access to data for public health scientists. With devices easily deployed around the world within days and requiring little to no up-front cost, the decentralized network of public health scientists supported by nanopore sequencing has been able to generate over one million SARS-CoV-2 genomes as of Spring 2022, ensuring the evolution and spread of the virus can be accurately measured as required. In the growing field of genomic epidemiology, Oxford Nanopore is proud to offer a technology platform for supporting public health scientists now and in the future.

4.5 Targeted Sequencing on the Other Side of the World

Continuing the world theme, the Garvan Institute for Medical Research in Sydney, Australia, is situated on the opposite side of the world from the headquarters of Oxford Nanopore in Oxford, UK. In 2022, scientists from The Garvan, led by Dr Ira Deveson, published impressive work on targeting short tandem repeat (STR) expansion disorders with programmable nanopore sequencing [31]. For those unfamiliar with the term “programmable” in this context, it refers to a feature unique to nanopore sequencing: adaptive sampling (see Fig. 3 and accompanying legend). By simply inputting a reference file and genomic coordinates, the group were able to target all known neuropathogenic STRs in a single experiment with a single MinION flow cell [31]. With STR expansions responsible for heritable disorders including Huntington’s disease, fragile X syndrome, cerebellar ataxias, epilepsies, dementia, and ALS [31], the real-world implications and future potential of this work need little elaboration. This elegant and streamlined solution has unquestionable impact based on the evidence presented by Ira and his team—as a group they anticipate that adaptive sampling “will be a powerful approach to STR gene discovery” in addition to resolving “many previously unsolved cases in the future” [31].

Fig. 3 Adaptive sampling. Following upload of a file containing a genome reference and genomic coordinates to the operating software MinKNOW, the sequencing run starts and a strand of DNA moves to and through the nanopore (a). As the strand moves through the pore, basecalling begins and alignment happens in real time. Within ~0.5 s, a decision can be made. If the sequence does not match the regions of interest within the reference files, the strand is ejected from the nanopore (b). If the sequence does match, the strand is permitted to continue sequencing through its entirety (c). Once the strand is through the pore, the process begins with the next strand and so on
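The accept-or-eject decision described in the Fig. 3 legend can be illustrated with a toy sketch. This is not the MinKNOW implementation or its API: the `Region`, `Hit`, and `decide` names, the contig name, and the coordinates are all hypothetical, and the real-time basecalling and alignment steps are replaced by an already-computed alignment hit.

```python
from typing import NamedTuple, Optional, Sequence


class Region(NamedTuple):
    """A target region of interest (hypothetical structure; end is exclusive)."""
    contig: str
    start: int
    end: int


class Hit(NamedTuple):
    """Where the first part of a read aligned on the reference."""
    contig: str
    position: int


def decide(hit: Optional[Hit], targets: Sequence[Region]) -> str:
    """Return 'continue' if the read start maps inside a target region, else 'eject'."""
    if hit is None:
        # Assumption of this sketch: a read with no alignment does not match any target.
        return "eject"
    for region in targets:
        if hit.contig == region.contig and region.start <= hit.position < region.end:
            return "continue"
    return "eject"


# Toy coordinates, for illustration only.
targets = [Region("chrA", 100, 200)]
for hit in (Hit("chrA", 150), Hit("chrA", 500), None):
    print(hit, "->", decide(hit, targets))
```

In the toy run, only the read aligning inside the 100 to 200 interval is allowed to continue; the off-target and unaligned reads are ejected so the pore is freed for the next strand.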


4.6 A World of Potential

This is not an exhaustive list of applications, merely a snapshot of areas where nanopore sequencing is already making a difference. While these focus upon research-based examples, Oxford Nanopore has taken position in the foothills of applied markets including but not limited to agriculture, food safety, veterinary science, public health, and more. Additionally, with the establishment of subsidiary company Oxford Nanopore Diagnostics, the aim is to specifically enable actionable decisions in healthcare with end-to-end solutions. By collaborating with talented and visionary scientists the world over, nanopore sequencing will enter a forum where human lives are positively impacted every single day.

5 Summary

To close this introduction, the contents of this book demonstrate exactly why nanopore sensing is a technology for the future. Oxford Nanopore provides a toolkit for innovators, tinkerers, and the most inquisitive to assess the living world around them, encompassing a much broader scope than Oxford Nanopore can research themselves. Along this path of broadening biological understanding, there will be difficulties to work through and obstacles to overcome, but ultimately the outcome from doing so will lead to everyday breakthroughs that, collectively, combine into something much greater than the sum of their parts. The authors of the chapters within this book have begun to remove these obstacles for you, demonstrating brand-new capabilities and sharing with you the nuances they encountered. With their help, the community can continue to develop their understanding of the living world around us and generate the most important of answers—the ones for their most burning biological questions.

References

1. Fastest DNA sequencing technique | Guinness World Records. https://www.guinnessworldrecords.com/world-records/675050-fastest%C2%A0dna-sequencing%C2%A0technique. Accessed 30 May 2022
2. Gorzynski JE, Goenka SD, Shafin K et al (2022) Ultra-rapid nanopore whole genome genetic diagnosis of dilated cardiomyopathy in an adolescent with cardiogenic shock. Circ Genom Precis Med CIRCGEN121003591. https://doi.org/10.1161/circgen.121.003591
3. First DNA Sequencing in Space a Game Changer | NASA. https://www.nasa.gov/mission_pages/station/research/news/dna_sequencing/. Accessed 3 June 2022
4. Githinji G, de Laurent ZR, Mohammed KS et al (2021) Tracking the introduction and spread of SARS-CoV-2 in coastal Kenya. Nat Commun 12:4809. https://doi.org/10.1038/s41467-021-25137-x
5. Yingtaweesittikul H, Ko K, Rahman NA et al (2021) CalmBelt: rapid SARS-CoV-2 genome characterization for outbreak tracking. Front Med 8:790662. https://doi.org/10.3389/fmed.2021.790662
6. Ranasinghe D, Jayadas TTP, Jayathilaka D et al (2021) Comparison of different sequencing techniques with multiplex real-time PCR for detection to identify SARS-CoV-2 variants of concern. Medrxiv:2021.12.05.21267303. https://doi.org/10.1101/2021.12.05.21267303
7. Quick J, Loman NJ, Duraffour S et al (2016) Real-time, portable genome sequencing for Ebola surveillance. Nature 530:228–232. https://doi.org/10.1038/nature16996
8. Faria NR, Quick J, Claro IM et al (2017) Establishment and cryptic transmission of Zika virus in Brazil and the Americas. Nature 546:406–410. https://doi.org/10.1038/nature22401
9. Company history. https://nanoporetech.com/about-us/history. Accessed 3 June 2022
10. Depledge DP, Srinivas KP, Sadaoka T et al (2019) Direct RNA sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen. Nat Commun 10:754. https://doi.org/10.1038/s41467-019-08734-9
11. Oxford Nanopore integrates “Remora”: a tool to enable real-time, high-accuracy epigenetic insights with nanopore sequencing software MinKNOW. https://nanoporetech.com/about-us/news/oxford-nanopore-integrates-remora-tool-enable-real-time-high-accuracy-epigenetic. Accessed 5 June 2022
12. Djirackor L, Halldorsson S, Niehusmann P et al (2021) Intraoperative DNA methylation classification of brain tumors impacts neurosurgical strategy. Neuro-oncol Adv 3:vdab149. https://doi.org/10.1093/noajnl/vdab149
13. Ebbert MTW, Jensen TD, Jansen-West K et al (2019) Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol 20:97. https://doi.org/10.1186/s13059-019-1707-2
14. Dr. Kennda Lynch on Twitter: “A HUGE Thanks to Oxford @nanopore who helped to make my use of the MinION possible!!! Also thanks to @GTSciences, @NewEnglandBIO, and @QIAGENscience who all helped provide equipment, reagents, and support for my trip to Danakil to help film this episode!!”/Twitter. https://twitter.com/marsgirl42/status/1337200697255792641. Accessed 5 June 2022
15. Quince C, Nurk S, Raguideau S et al (2021) STRONG: metagenomics strain resolution on assembly graphs. Genome Biol 22:214. https://doi.org/10.1186/s13059-021-02419-7
16. 245 Gbases – the highest output record of the PromethION platform to date. https://www.linkedin.com/feed/update/urn:li:activity:6757616979751895041/. Accessed 6 June 2022
17. Applying portable nanopore sequencing technology to the conservation of the critically endangered kākāpō. https://nanoporetech.com/about-us/news/interview-applying-portable-nanopore-sequencing-technology-conservation-critically. Accessed 6 June 2022
18. Gowers G, Oliver F, Vince O, Charles J-H et al (2019) Entirely off-grid and solar-powered DNA sequencing of microbial communities during an ice cap traverse expedition. Genes-Basel 10:902. https://doi.org/10.3390/genes10110902
19. Shinohara Y, Kurniawan YN, Sakai H et al (2021) Nanopore based sequencing enables easy and accurate identification of yeasts in breweries. J I Brewing 127:160–166. https://doi.org/10.1002/jib.639
20. Sakamoto Y, Zaha S, Suzuki Y et al (2021) Application of long-read sequencing to the detection of structural variants in human cancer genomes. Comput Struct Biotechnol J 19:4207–4216. https://doi.org/10.1016/j.csbj.2021.07.030
21. Currin A, Swainston N, Dunstan MS et al (2019) Highly multiplexed, fast and accurate nanopore sequencing for verification of synthetic DNA constructs and sequence libraries. Synth Biol 4:ysz025. https://doi.org/10.1093/synbio/ysz025
22. Goenka SD, Gorzynski JE, Shafin K et al (2022) Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing. Nat Biotechnol 1–7. https://doi.org/10.1038/s41587-022-01221-5
23. Owen MJ, Niemi A-K, Dimmock DP et al (2021) Rapid sequencing-based diagnosis of thiamine metabolism dysfunction syndrome. New Engl J Med 384:2159–2161. https://doi.org/10.1056/nejmc2100365
24. Meyer A, Schloissnig S, Franchini P et al (2021) Giant lungfish genome elucidates the conquest of land by vertebrates. Nature 590:284–289. https://doi.org/10.1038/s41586-021-03198-8
25. Nowoshilow S, Schloissnig S, Fei J-F et al (2018) The axolotl genome and the evolution of key tissue formation regulators. Nature 554:50–55. https://doi.org/10.1038/nature25458
26. Hotaling S, Kelley JL, Frandsen PB (2021) Toward a genome sequence for every animal: where are we now? Proc Natl Acad Sci 118:e2109019118. https://doi.org/10.1073/pnas.2109019118
27. Nurk S, Koren S, Rhie A et al (2022) The complete sequence of a human genome. Science 376:44–53. https://doi.org/10.1126/science.abj6987
28. COVID-19: Community Timeline. https://nanoporetech.com/covid-19/community-timeline. Accessed 30 May 2022
29. Freed NE, Vlková M, Faisal MB, Silander OK (2020) Rapid and inexpensive whole-genome sequencing of SARS-CoV-2 using 1200 bp tiled amplicons and Oxford Nanopore Rapid Barcoding. Biol Methods Protoc 5:bpaa014. https://doi.org/10.1093/biomethods/bpaa014
30. Artic Network. https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html. Accessed 30 May 2022
31. Stevanovski I, Chintalaphani SR, Gamaarachchi H et al (2022) Comprehensive genetic diagnosis of tandem repeat expansion disorders with programmable targeted nanopore sequencing. Sci Adv 8:eabm5386. https://doi.org/10.1126/sciadv.abm5386

Chapter 2

Hybrid Genome Assembly of Short and Long Reads in Galaxy

Tazro Ohta and Yuh Shiwa

Abstract

Galaxy is a web browser-based data analysis platform that is widely used in biology. Public Galaxy instances allow the analysis of data and interpretation of results without requiring software installation. NanoGalaxy is a public Galaxy instance with tools and workflows for nanopore data analysis. This chapter describes the steps involved in performing genome assembly using short and long reads in NanoGalaxy.

Key words Galaxy, Workflow, Visualizations, Nanopore sequencing, Long-read sequencing, Hybrid genome assembly

1 Introduction

Nanopore sequencing, provided by Oxford Nanopore Technologies, is a long-read sequencing technology. Read lengths of up to 2.273 megabases were demonstrated in 2018 [1]. Features such as low sequencing costs and sequencing device portability enable researchers to perform advanced applications, such as in-field DNA sequencing. Researchers have utilized this technology in various sequencing applications, such as de novo genome assembly, whole-genome resequencing, structural genome variation detection, DNA methylation detection, metagenomic sequencing, and de novo transcript/splicing variant detection.

These new sequencing applications require an appropriate environment for performing data analysis for each objective and input dataset. Both hardware (computers) and software (data analysis tools) for long-read sequencing data require a different setup from those for short-read sequencing data [2]. For example, compared with short-read data analysis, long-read analysis uses less storage but requires much more RAM to assemble reads. A CPU power-optimized platform is preferred over those that utilize

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

15

16

Tazro Ohta and Yuh Shiwa

more parallelized computing. Data analysis using software for longread data also has characteristics that differ from those required for short-read data analysis. Researchers must focus on parameter tuning for their input data rather than scheduling jobs that iterate the same analysis for their samples. Although the main objective is obtaining biological knowledge from the results, designing proper hardware/software setups often takes time because it requires computing expertise and skills to prepare the environment. Galaxy is a platform designed for those who want to focus on data science rather than environment setup [3]. This platform is utilized in data science, particularly for biological applications, and has a long history of improved features. Galaxy provides a browserbased graphical user interface that allows researchers to perform data analysis without employing command-line operations. The Galaxy community offers many public free-to-use instances, although most have limitations such as relatively low available data storage. NanoGalaxy, one such public instance, is a flavor of the Galaxy setup designed with specific workflows and tools for nanopore data analysis. Here, we detail a method for performing nanopore data analysis using NanoGalaxy with an example case of genome assembly [4].

2 Materials

2.1 Long- and Short-Read Datasets

In this tutorial, we use datasets from a study [5] in which the Caenorhabditis elegans genome was sequenced using short-read Illumina (ERR2092781) and long-read Oxford Nanopore Technologies (ERR2092776) devices. These data were obtained from the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI).

2.2 Public Galaxy Instance (NanoGalaxy)

This tutorial details how to use Galaxy Europe, which is one of the many public Galaxy servers available worldwide [6]. It is operated by a team at the University of Freiburg (Uni Freiburg). Like other public servers, it is available free of charge, although the amount of available disk space is limited (250 GB per user as of February 2021). Galaxy Europe has a large cloud computing capacity with the advantage of each tool being allocated appropriate computing resources; hence, there is no need to worry about resource scarcity. Galaxy Europe has a dedicated interface that comprises commonly used tools and workflows for each specific purpose and subject area [7]. In this section, we describe the use of NanoGalaxy (https://nanopore.usegalaxy.eu) for analyzing nanopore data.

3 Methods

3.1 Basic Usage of Galaxy

Galaxy is open-source, browser-based software. Users do not need to install it; they access a server on which the platform runs. The public Galaxy servers operated by the Galaxy Project are available free of charge to anyone who creates an account. Public Galaxy servers limit the storage space allocated to each user and the tools users can run. Users can remove these restrictions by installing Galaxy on their own computers, but this requires knowledge of software development and server management. In this chapter, we introduce methods for performing an analysis using a public Galaxy server.

The use of Galaxy is straightforward. The platform has a three-panel interface that users see when they open the initial page of the Galaxy server. The interface includes the following from left to right: tool panel, main display panel, and history panel (Fig. 1; see Note 1). Data analysis using Galaxy generally involves the following steps:

• User registration (first time only): An e-mail address is required. Users must activate their accounts via an e-mail sent to the provided address (Fig. 2).
• Data import: Several options are available, including uploading files via a web browser, fetching files from a remote server, and importing files from a public database.
• Select a tool: The tools available to users vary according to the public Galaxy server. Users select the files and input parameters to start the analysis.
• Run the tool: Some tools may take a long time to run. If many jobs are running on the server, users may have to wait for the job to start.
• View results: The outputs can be viewed via a web browser once the tool has finished running.
• Run other tools: Users may use the output as the input to run another tool.
• Save history: Galaxy saves executed steps and intermediate results as a history, from which users can create a reusable workflow.
The Galaxy Project provides excellent online documentation that is frequently updated. See “A short introduction to Galaxy” for more information on basic usage [8, 9].

Fig. 1 Basic operation screen of Galaxy. It is divided vertically into three sections: the tool selection panel on the left, the operations and results display panel in the center, and the history panel on the right

3.2 Example Using Public Data

This tutorial describes how long- and short-read sequences can be combined to create high-quality genome sequences. We used the so-called hybrid assembly approach to reconstruct the C. elegans genome sequence using reads from two different sequencing platforms: Illumina (short reads) and Oxford Nanopore Technologies (long reads).

Fig. 2 Registration screen for NanoGalaxy user account. A user must register with a valid e-mail address in the form via the “Login or Register” button on the top menu bar

3.2.1 Analysis Overview

This chapter provides a detailed workflow for hybrid assembly from short- and long-read sequencing data using NanoGalaxy. The tutorial begins with quality control of long reads using NanoPlot [10] and filtlong [11]. The filtered long reads were assembled into contigs using Flye [12]. Next, to correct the assembly derived from error-prone long reads using high-accuracy short reads (polishing), quality control and trimming of short reads were performed using fastp [13]. Trimmed short reads were mapped to the assembly, and an alignment file was created using BWA-MEM2 [14]. The alignment file was then evaluated using QualiMap BamQC [15], and a polished (corrected) assembly file was produced using Pilon [16]. Finally, assembly metrics were calculated using Fasta Statistics [17].
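For readers who want to see how the overview maps onto command-line tools, the following is a minimal dry-run sketch in Python. It only builds and prints command strings; nothing is executed. The option names are those of the underlying command-line programs as we understand them, and we assume (without confirmation from the Galaxy tool wrappers) that the Galaxy tools call them in a comparable way. All output file names, directory names, and the thread count are hypothetical placeholders, and Fasta Statistics is omitted because it is a Galaxy-specific tool.

```python
# Dry-run sketch: builds shell command strings that roughly mirror the Galaxy
# steps of this tutorial. Nothing is executed. Output file names, directory
# names, and the thread count are hypothetical placeholders.
from typing import List, Tuple

LONG_ACC = "ERR2092776"   # Nanopore reads (accession from the text)
SHORT_ACC = "ERR2092781"  # Illumina reads (accession from the text)


def build_plan(threads: int = 8) -> List[Tuple[str, str]]:
    """Return (step description, command string) pairs in execution order."""
    long_fq = f"{LONG_ACC}.fastq"
    short_fq = f"{SHORT_ACC}.fastq"
    draft = "flye_out/assembly.fasta"  # Flye writes assembly.fasta into --out-dir
    return [
        ("download long reads", f"fasterq-dump {LONG_ACC}"),
        ("download short reads", f"fasterq-dump {SHORT_ACC}"),
        ("long-read QC", f"NanoPlot --fastq {long_fq} --N50 -o nanoplot_raw"),
        ("filter long reads",
         f"filtlong --min_length 10000 {long_fq} > long_filtered.fastq"),
        ("assemble",
         f"flye --nano-raw long_filtered.fastq --out-dir flye_out --threads {threads}"),
        ("short-read QC", f"fastp -i {short_fq} -o short_trimmed.fastq"),
        ("index draft", f"bwa-mem2 index {draft}"),
        ("map short reads",
         f"bwa-mem2 mem -t {threads} {draft} short_trimmed.fastq"
         " | samtools sort -o mapped.bam -"),
        ("index alignment", "samtools index mapped.bam"),
        ("alignment QC", "qualimap bamqc -bam mapped.bam -outdir bamqc"),
        ("polish",
         f"pilon --genome {draft} --unpaired mapped.bam --output polished --changes"),
    ]


if __name__ == "__main__":
    for number, (step, command) in enumerate(build_plan(), start=1):
        print(f"{number:2d}. {step}: {command}")
```

Keeping the plan as data rather than running it makes it easy to review each step against the corresponding hands-on section before adapting it to a local installation.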


3.2.2 Hands-On: Obtain Sample Data from SRA

To begin any analysis, a user must create a new Galaxy history. Therefore, this tutorial begins with the creation of a new history. The data are then downloaded directly from NCBI SRA using the accession numbers (Fig. 3).

1. Create a new history for this assembly exercise.
   (a) Click the + (new history) icon at the top of the history panel.
   (b) Click on the Unnamed history at the top of the history panel.
   (c) Type a new name, for example, “ONT C. elegans.”
   (d) Press Enter on the keyboard to save it.

Fig. 3 Tool setup screen for downloading FASTQ files from SRA using the Faster Download and Extract Reads in FASTQ tool


2. Retrieve the Nanopore reads data from SRA.
   (a) Type Faster Download in the tools panel search box (top of the left panel).
   (b) Click on the Faster Download and Extract Reads in FASTQ tool. The tool will be displayed on the central Galaxy panel.
   (c) Select the following parameters:
       (i) “select input type”: SRR accession.
       (ii) “Accession”: ERR2092776.
       (iii) No change in the other parameters.
   (d) Click the Execute button.
3. Retrieve the Illumina reads data from SRA.
   (a) Rerun the same tool to download the other dataset. Click on “fasterq-dump log” in the history panel. Click the looping arrow icon (“Run this job again”). This returns you to the tool form with all previously set parameters.
   (b) Select the following parameters:
       (i) “Accession”: ERR2092781.
       (ii) No change in the other parameters.
   (c) Click the Execute button.
4. Inspect the output datasets in the history panel.

Submitting each job creates four new items in the history panel (three collections and one tool log). Because the dataset we retrieved contained only single reads, the single-end data collection holds the downloaded data. Click on the other two collections to verify that they are empty. Empty collections can also be deleted by choosing any of the delete options (“Collection Only,” “Delete Datasets,” or “Permanently Delete Datasets”) when prompted.

When the output dataset “Single-end data (fasterq-dump)” in the history panel becomes green, click on its name. If the job fails for any reason, the resulting dataset will appear in red (see Note 2). The obtained dataset contains the FASTQ file ERR2092776. Click the eye icon next to the dataset name to view the file contents, which will be displayed on the central panel (Fig. 4). This file contains long-read sequences from C. elegans in the FASTQ format.

Click on “#: fasterq-dump log” in the history panel to view the number of reads downloaded from SRA. In this chapter, we use the “#” symbol instead of the ordered number (see Note 3). Similarly, check “Single-end data (fasterq-dump)” and “fasterq-dump log” corresponding to the short-read dataset.
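As a cross-check on the read count reported in the tool log, a downloaded FASTQ file can also be summarized directly. The sketch below is a minimal illustration that assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines); the function names and the two example reads are hypothetical and are not part of Galaxy or the SRA tools.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].strip(), seq, qual


def summarize(handle: TextIO) -> Tuple[int, int]:
    """Return (number of reads, total number of bases)."""
    n_reads = 0
    n_bases = 0
    for _name, seq, _qual in read_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
    return n_reads, n_bases


# Two made-up reads stand in for a real file opened with open("reads.fastq").
example = io.StringIO("@read1\nACGTAC\n+\nIIIIII\n@read2\nGGCC\n+\nIIII\n")
print(summarize(example))  # (2, 10)
```

For a real dataset, the same `summarize` function can be given a file handle; gzip-compressed files would need `gzip.open(path, "rt")` instead of `open`.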

Fig. 4 Screenshot showing the contents of the downloaded FASTQ file

3.2.3 Hands-On: Quality Control of Nanopore Reads

For long reads, sequence quality can be checked using NanoPlot, which provides basic statistics and plots for a quick quality control overview. Nanopore reads vary widely in length and are of relatively low quality compared with short Illumina reads.

1. Type NanoPlot in the tools panel search box (top of the left panel).
2. Click on the NanoPlot tool.
3. Select the following parameters:
   (a) “files”: Click on Single dataset (file button on the left) and then browse datasets (folder button on the right) and select the ERR2092776 dataset. Because dataset ERR2092776 is stored inside a collection, this operation makes it possible to select an individual dataset from the collection.
   (b) In the “Options for customizing the plots created” section:
       (i) “Show the N50 mark in the read length histogram”: Yes
   (c) No change in the other parameters.
4. Click the Execute button.
5. Inspect the output datasets in the history panel. Click on the eye icon next to the dataset named “NanoPlot on data #: HTML report.”

The HTML report summarizes various quality control metrics for each sample, such as a summary table of the number of reads and a histogram of read lengths (Fig. 5). For sample ERR2092776, the number of reads was 583,462 and the total number of bases was 8860.6 Mb, yielding approximately 88-fold coverage of the 101-Mb C. elegans genome (8860.6 Mb / 101 Mb ≈ 87.7). This coverage is sufficient, as a previous study [18] reported that more than 60-fold sequence coverage can generate contiguous assembled sequences from nanopore long reads. The N50 value for this dataset was 21,138 bp. N50 is a length-weighted median: half of the total bases in this dataset are contained in reads of at least 21 kb. Because this dataset had ample coverage of the genome, we removed reads shorter than 10 kb to reduce the amount of data and improve computational efficiency.

3.2.4 Hands-On: Quality Filtering of Long-Read Sequencing Data

Certain analyses may require reads of a specific quality or length. Removing excess reads can also improve assembly quality and computational efficiency. The reads were filtered using a tool called filtlong. In this example, all reads shorter than 10 kb were removed.

1. Type filtlong in the tools panel search box (top of the left panel).
2. Click on the filtlong tool.
3. Select the following parameters:
   (a) “Input FASTQ”: Click on Single dataset (file button on the left) and then browse datasets (folder button on the right) and select the ERR2092776 dataset.
   (b) In the “Output thresholds” section:
       (i) “Min. length”: 10,000
   (c) No change in the other parameters.
4. Click the Execute button.
5. Rerun the NanoPlot tool with the filtered dataset. Click on “NanoPlot on data #: HTML report” in the history panel. Click the looping arrow icon (“Run this job again”). Select the following parameters:
   (a) “files”: “filtlong on data #: Filtered FASTQ” (output of filtlong)
   (b) No change in the other parameters
6. Click the Execute button.
7. To inspect the output datasets in the history panel, click on the eye icon next to the dataset named “NanoPlot on data #: HTML report.”

Compared with the prefilter result (NanoPlot on data #: HTML report), the post-filter result showed a 35% decrease in the number of reads (from 583,462 to 377,125). The read length N50 increased slightly (from 21,138 to 22,408 bp). The total number of bases was reduced by approximately 11% (from 8860 to 7915 Mb) but still represented approximately 78-fold coverage of the genome.

Fig. 5 Screenshot displaying NanoPlot results

3.2.5 Hands-On: Assembly of Long-Read Sequencing Data

After quality control and length filtering (performed with filtlong), the reads were ready for assembly. Many tools are available for creating assemblies of long-read data, but for this tutorial, we used Flye, which is designed to handle a wide range of datasets, from small bacterial to large mammalian-scale assemblies.

1. Type Flye in the tools panel search box (top of the left panel).
2. Click on the Flye tool.
3. Select the following parameters:
   (a) “Input reads”: “filtlong on data #: Filtered FASTQ” (output of filtlong)
   (b) “Mode”: Nanopore raw
   (c) No change in the other parameters
4. Click the Execute button.
5. Inspect the output datasets in the history panel.

The first dataset (consensus) was a FASTA file containing the final assembly (60 contigs). Your results (e.g., the number of contigs) may be slightly different from those presented in this tutorial (see Note 4). The second (assembly graph) and third (graphical fragment assembly) datasets were the assembly graph files. These graphs represent the final genome assembly based on the reads and their overlap information. Tools such as Bandage [19] can be used to visualize the assembly graphs. The fourth dataset was a tabular file (assembly_info) containing additional information about the contigs and scaffolds.
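Before committing compute time to an assembly, it can be useful to confirm that the filtered read set still provides enough depth. The following sketch simply redoes the arithmetic behind the figures reported in the two preceding hands-on sections (read counts, total bases, and the 101-Mb genome size are taken from the text); the function names are our own and purely illustrative.

```python
GENOME_SIZE_MB = 101.0  # C. elegans genome size used in the text


def fold_coverage(total_bases_mb: float, genome_size_mb: float = GENOME_SIZE_MB) -> float:
    """Sequencing depth = total sequenced bases / genome size."""
    return total_bases_mb / genome_size_mb


def percent_decrease(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Values reported by NanoPlot before and after filtering with filtlong.
raw_cov = fold_coverage(8860.6)
filtered_cov = fold_coverage(7915.0)
reads_lost = percent_decrease(583_462, 377_125)
bases_lost = percent_decrease(8860.6, 7915.0)

print(f"coverage: {raw_cov:.1f}x -> {filtered_cov:.1f}x")
print(f"reads removed: {reads_lost:.1f}%, bases removed: {bases_lost:.1f}%")
```

The point of the comparison is that removing roughly a third of the reads costs only about a tenth of the bases, because the discarded reads are the short ones.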

3.2.6 Hands-On: Quality Control of Short-Read Sequencing Data

Next, to correct assemblies generated from error-prone long reads using highly accurate short reads, we first perform quality control and trimming of the short reads using fastp. Removing low-quality regions and adapter sequences improves the alignment and the corrections made in subsequent steps.

1. Type fastp in the tools panel search box (top of the left panel).
2. Click on the fastp tool.
3. Select the following parameters:
   (a) “Single-end or paired reads”: Single-end.
   (b) “Input 1”: Click on Single dataset (file button on the left) and then browse datasets (folder button on the right) and select the ERR2092781 dataset.
   (c) No change in the other parameters.
4. Click the Execute button.
5. Inspect the output datasets in the history panel. Click on the eye icon next to the dataset named “fastp on data #: HTML report.”

The mean read length (285 bp) was the same before and after trimming. Most reads passed the filter (97.9%), indicating that the original reads were of high quality.

3.2.7 Hands-On: Mapping Short Reads to Assembly for Polishing

The trimmed short reads were then mapped to the assembly created using Flye to produce an alignment file. BWA-MEM2 is a sequence aligner that is widely used for short-read datasets such as those analyzed in this tutorial.

1. Type BWA-MEM2 in the tools panel search box (top of the left panel).
2. Click on the BWA-MEM2 tool.
3. Select the following parameters:
   (a) “Will you select a reference genome from your history or use a built-in index?”: Use a genome from history and build index.
   (b) “Use the following dataset as the reference sequence”: “Flye on data #: consensus” (Flye output).
   (c) “Single or Paired-end reads”: Single.
   (d) “Select fastq dataset”: “fastp on data #: Read # output” (output of fastp).
   (e) No change in the other parameters.
4. Click the Execute button.
5. When the output dataset “BWA-MEM2 (mapped reads in BAM format)” in the history panel becomes green, proceed to the next step.

3.2.8 Hands-On: Evaluate the Quality of Aligned Reads Data in BAM Format

The quality of the alignment data (in BAM format) resulting from the mapping was evaluated using QualiMap. The tool summarizes the basic statistics of the alignment (number of reads, coverage, GC content, etc.) and produces several useful graphs.

1. Type QualiMap in the tools panel search box (top of the left panel).
2. Click on the QualiMap BamQC tool.
3. Select the following parameters:
   (a) “Mapped reads input dataset”: BWA-MEM2 on data # and data # (mapped reads in BAM format; output of BWA-MEM2)
   (b) No change in the other parameters
4. Click the Execute button.
5. Inspect the output datasets in the history panel by clicking on the eye icon next to the dataset named “QualiMap BamQC report.”

An important metric is the number and percentage of reads mapped to the reference genome, listed in the global section (mapped reads). A low percentage may indicate a problem with the data or analysis. In this dataset, more than 90% of the reads were mapped, which is a good mapping rate. The next metric to check is the average coverage displayed in the coverage section. For Pilon, which is used to polish the assembly, the developer recommends a total coverage of at least 50-fold when the default parameters are used [20]. The mean coverage for this dataset was 32-fold, which is below the recommended value; we nevertheless proceeded to the next step.

3.2.9 Hands-On: Polish Assembly

Next, the mapped short reads were compared with the assembly to create a polished (corrected) assembly file.

1. Type pilon in the tools panel search box (top of the left panel).
2. Click on the pilon tool.
3. Select the following parameters:
   (a) “Source for reference genome used for BAM alignments”: Use a genome from history.
   (b) “Select a reference genome”: “Flye on data #: consensus” (output of Flye)
   (c) “Type automatically determined by pilon”: Yes
   (d) “Input BAM file”: “BWA-MEM2 on data # and data # (mapped reads in BAM format)”
   (e) “Variant calling mode”: No
   (f) “Create changes file”: Yes
   (g) No change in the other parameters.
4. Click the Execute button.
5. Inspect the output datasets in the history panel.

The first dataset (changes in FASTA from pilon on data # and data #) is a file containing space-delimited records of all changes made to the assembly during polishing. This file contains approximately 650,000 lines, which means that approximately 650,000 changes were made to the assembly. The second dataset (FASTA from pilon on data # and data #) is a FASTA file containing the polished (corrected) assembly (see Note 5).

3.2.10 Hands-On: Genome Assembly Metrics

Finally, the metrics for the polished assembly were calculated using Fasta Statistics. This tool displays summary statistics for FASTA files. For genome assemblies, metrics such as assembly size, number of scaffolds, and N50 value allow users to evaluate the quality of the assembly.

1. Type Fasta Statistics in the tools panel search box (top of the left panel).
2. Click on the Fasta Statistics tool.
3. Select the following parameters:
   (a) “FASTA or Multi-FASTA file”: “FASTA from pilon on data # and data #” (output of pilon)
   (b) No change in the other parameters
4. Click the Execute button.
5. Inspect the output datasets in the history panel.

In this assembly, the number of gaps was zero; hence, the summary statistics for scaffolds and contigs were identical. This assembly contained 60 contigs/scaffolds and had an N50 of 4,409,168 bp. The assembly size was 112,339,285 bp, approximately 11% larger than the C. elegans genome size of 101 Mb (see Note 6).
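To make the N50 metric concrete, the sketch below computes the number of contigs, total size, and N50 from a list of contig lengths. It is an illustrative implementation of the standard definition (the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total), not the code used by the Fasta Statistics tool; the seven toy contig lengths are invented for the example.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no sequences given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for a non-empty list")


def assembly_summary(lengths: List[int]) -> Dict[str, int]:
    """Basic assembly metrics from contig (or scaffold) lengths."""
    return {
        "n_sequences": len(lengths),
        "total_bp": sum(lengths),
        "largest_bp": max(lengths),
        "n50_bp": n50(lengths),
    }


toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # total 300; 80 + 70 reaches 150
print(assembly_summary(toy_contigs)["n50_bp"])  # 70

# Size of the polished assembly relative to the 101-Mb genome size in the text.
size_ratio = 112_339_285 / 101_000_000
print(f"assembly / genome size = {size_ratio:.2f}")
```

The same function applies to read lengths, which is how the read N50 values reported by NanoPlot should be interpreted.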

4 Notes

1. The Galaxy interface may change over time. However, the Galaxy Project has excellent online documentation that explains the usage of these functions (https://training.galaxyproject.org/). Please refer to the online documentation if differences are observed between the screenshots in this chapter and the running server.
2. Automatic data downloads from SRA can be unreliable. If a job fails, please wait a few hours and rerun it.
3. In the history panel, each executed step is labeled with a sequential number. In this tutorial, we used the “#” symbol in place of that number to avoid confusion. For example, if you uploaded a FASTQ file, the first item in the history panel may be titled “1: data.fastq,” but we describe it as “#: data.fastq” in the main text.
4. The results may differ slightly from those presented in this tutorial because of differences in the tool versions, reference data, external databases, and stochastic processes in the algorithms.
5. One possible additional step is to perform further rounds of polishing using Pilon. Before each round of Pilon, it is necessary to map the short reads again, this time against Pilon’s output assembly, using BWA-MEM2. It is also useful to rename the datasets during these runs to denote the round number.
6. Numerous analytic tools are useful for nanopore data analysis but are unavailable or restricted in NanoGalaxy. For example, the current NanoGalaxy instance limits the use of QUAST, a genome assembly comparison tool [21] that can be used to evaluate assemblies by computing various metrics. BUSCO, a popular genome assembly assessment tool [22], is also unavailable in NanoGalaxy but is available in the main Galaxy instance, usegalaxy.org. We suggest that users search other public Galaxy instances if they cannot find the tool they intend to use.

Acknowledgments

The authors would like to thank all members of the valuable scientific communities: the Pitagora Network (https://pitagoranetwork.org) and Galaxy Project (https://galaxyproject.org/). The authors also acknowledge the support of the Freiburg Galaxy Team: Björn Grüning, Bioinformatics, University of Freiburg (Germany), funded by the Collaborative Research Centre 992 Medical Epigenetics (DFG grant SFB 992/1 2012), and the German Federal Ministry of Education and Research BMBF grant 031 A538A de.NBI-RBC.

References

1. Payne A, Holmes N, Rakyan V, Loose M (2018) BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics 35:2193–2198. https://doi.org/10.1093/bioinformatics/bty841
2. Amarasinghe SL, Su S, Dong X et al (2020) Opportunities and challenges in long-read sequencing data analysis. Genome Biol 21. https://doi.org/10.1186/s13059-020-1935-5
3. Afgan E, Baker D, Batut B et al (2018) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res 46:W537–W544. https://doi.org/10.1093/nar/gky379
4. de Koning W, Miladi M, Hiltemann S et al (2020) NanoGalaxy: nanopore long-read sequencing data analysis in Galaxy. GigaScience 9. https://doi.org/10.1093/gigascience/giaa105
5. Tyson JR, O’Neil NJ, Jain M et al (2017) MinION-based long-read sequencing and assembly extends the Caenorhabditis elegans reference genome. Genome Res 28:266–274. https://doi.org/10.1101/gr.221184.117
6. Galaxy Europe. https://usegalaxy.eu/. Accessed 18 May 2022
7. European Galaxy Flavours. https://galaxyproject.eu/posts/2020/12/28/subdomains/. Accessed 18 May 2022
8. Syme A, Soranzo N (2022) A short introduction to Galaxy (Galaxy training materials). https://training.galaxyproject.org/training-material/topics/introduction/tutorials/galaxy-intro-short/tutorial.html. Accessed 18 May 2022
9. Batut B, Hiltemann S, Bagnacani A et al (2018) Community-driven data analysis training for biology. Cell Syst 6:752–758.e1. https://doi.org/10.1016/j.cels.2018.05.012
10. De Coster W, D’Hert S, Schultz DT et al (2018) NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34:2666–2669. https://doi.org/10.1093/bioinformatics/bty149
11. Filtlong: quality filtering tool for long reads. https://github.com/rrwick/Filtlong. Accessed 18 May 2022
12. Kolmogorov M, Yuan J, Lin Y, Pevzner PA (2019) Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37:540–546. https://doi.org/10.1038/s41587-019-0072-8
13. Chen S, Zhou Y, Chen Y, Gu J (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34:i884–i890. https://doi.org/10.1093/bioinformatics/bty560
14. Vasimuddin Md, Misra S, Li H, Aluru S (2019) Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). https://doi.org/10.1109/IPDPS.2019.00041
15. Okonechnikov K, Conesa A, García-Alcalde F (2015) Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics btv566. https://doi.org/10.1093/bioinformatics/btv566
16. Walker BJ, Abeel T, Shea T et al (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9:e112963. https://doi.org/10.1371/journal.pone.0112963
17. Kyran A (2021) Fasta statistics: display summary statistics for a fasta file. https://github.com/galaxyproject/tools-iuc. Accessed 18 May 2022
18. Sutton JM, Millwood JD, Case McCormack A, Fierst JL (2021) Optimizing experimental design for genome sequencing and assembly with Oxford Nanopore Technologies. Gigabyte 2021:1–26. https://doi.org/10.46471/gigabyte.27
19. Wick RR, Schultz MB, Zobel J, Holt KE (2015) Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31:3350–3352. https://doi.org/10.1093/bioinformatics/btv383
20. Pilon: Methods of Operation. https://github.com/broadinstitute/pilon/wiki/Methods-of-Operation. Accessed 18 May 2022
21. Mikheenko A, Prjibelski A, Saveliev V et al (2018) Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34:i142–i150. https://doi.org/10.1093/bioinformatics/bty266
22. Manni M, Berkeley MR, Seppey M et al (2021) BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol 38:4647–4654. https://doi.org/10.1093/molbev/msab199

Chapter 3

Microbial Genome Sequencing and Assembly Using Nanopore Sequencers

Makoto Taniguchi and Kazuma Uesaka

Abstract

Microbial genomes are typically several million base pairs in length and are relatively easy to sequence and assemble into a single chromosome, given the advances in long-read sequencing platforms such as that of Oxford Nanopore Technologies. This chapter describes the experimental as well as computational steps in the sequencing and assembly of microbial genomes.

Key words Microbial genomes, Genome assembly, Nanopore sequencing, Complete genome

1 Introduction

In this chapter, we describe a common protocol for assembling complete bacterial genomes from Oxford Nanopore Technologies sequencing reads (hereafter, ONT reads). With this protocol, we have so far assembled nearly 300 complete, closed bacterial genomes. However, several requirements must be met for a successful experiment. First, the target bacterial strain should be isolated. Second, the ONT reads must provide at least 100× sequencing depth of the genome and have an average length of at least 5 kb for complete genome assembly. Finally, an error correction step is highly recommended to fix errors in the draft assembly (see Note 1).

The following protocol is divided into an experimental section and a bioinformatics section. The experimental section explains how to extract bacterial genomic DNA suitable for nanopore sequencing and how to prepare libraries for nanopore sequencing. To obtain longer reads, it is necessary to extract high-molecular-weight genomic DNA without fragmentation. The obtained genomic DNA is then subjected to library preparation using the Ligation Sequencing Kit.
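The depth requirement above translates directly into a target data volume for a sequencing run. The sketch below works this out for a hypothetical 5-Mb bacterial genome (our assumed example, in line with the "several million base pairs" stated in the abstract); the function names are illustrative only.

```python
def required_yield_bp(genome_size_bp: int, depth: int = 100) -> int:
    """Total bases needed to reach the requested sequencing depth."""
    return genome_size_bp * depth


def expected_read_count(yield_bp: int, mean_read_length_bp: int = 5000) -> float:
    """Approximate number of reads that make up the yield at a given mean length."""
    return yield_bp / mean_read_length_bp


genome_size = 5_000_000  # hypothetical 5-Mb genome
target = required_yield_bp(genome_size)   # 100x depth
print(f"target yield: {target / 1e9:.1f} Gbp")
print(f"reads at 5 kb mean length: {expected_read_count(target):,.0f}")
```

Under these assumptions, 100× depth corresponds to 0.5 Gbp, which agrees with the 0.5–1.0 Gbp at which the sequencing section of this protocol suggests stopping the run.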

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_3, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

2 Materials

2.1 Sequencing Reagents

1. DNA extraction kit (based on chloroform-phenol extraction)
2. Ligation Sequencing Kit (Oxford Nanopore Technologies)
3. NEBNext Ultra II End Repair/dA-Tailing Module (New England Biolabs)
4. NEBNext FFPE DNA Repair Mix (New England Biolabs)
5. NEBNext Quick Ligation Module (New England Biolabs)
6. Short Read Eliminator or Short Read Eliminator XS (Circulomics)
7. Agencourt AMPure XP beads
8. Nuclease-free water
9. Ethanol
10. MinION Flow Cell (Oxford Nanopore Technologies)
11. Flow Cell Wash Kit (Oxford Nanopore Technologies)
12. Qubit Fluorometer or Quant-iT PicoGreen Assay Kit (Thermo Fisher Scientific)

2.2 Bioinformatics Tools

1. NanoPlot
2. NanoFilt
3. Fastp
4. Flye
5. UniCycler
6. BWA
7. Minimap2
8. Samtools
9. Pilon
10. SV-Quest
11. cuteSV
12. DFAST

3 Methods

3.1 gDNA Extraction

1. Collect liquid-cultured bacterial cells by centrifugation, and lyse the cell walls with appropriate enzymes.
2. Perform Proteinase K and RNase treatments on the lysate in the presence of SDS, followed by chloroform-phenol extraction [1].
3. Follow the manufacturer’s protocol to size-select the extracted gDNA with the Short Read Eliminator to remove low-molecular-weight fragments.

4. Since freeze-thawing causes DNA fragmentation, genomic DNA should be stored at 4 °C and sequenced as soon as possible.
5. The extracted genomic DNA should be evaluated for the following points. The DNA should be of sufficiently high molecular weight (at least 20 kb, ideally 40 kb or more) and show no significant smear of low-molecular-weight fragments. The concentration of genomic DNA should be measured using a fluorescence-based method (Qubit or Quant-iT PicoGreen); when measured with a NanoDrop, the concentration may appear 2 to 10 times higher because of contaminants.

3.2 Sequencing Library Preparation

3.2.1 DNA Repair and End Prep

1. In a 0.2 mL thin-walled PCR tube, mix the following:

   Reagent                              Volume
   gDNA (1000 ng)                       X μL
   NEBNext FFPE DNA Repair Buffer       3.5 μL
   Ultra II End-prep reaction buffer    3.5 μL
   NEBNext FFPE DNA Repair Mix          2 μL
   Ultra II End-prep enzyme mix         3 μL
   Nuclease-free water                  up to 60 μL

2. Using a thermal cycler, incubate at 20 °C for 60 min and 65 °C for 30 min.
3. Transfer the DNA sample to a new 1.5-mL tube.
4. Add 60 μL of Short Read Eliminator (or SRE XS) to the end-prep reaction and mix by flicking the tube.
5. Incubate for 5 min at room temperature.
6. Centrifuge at 12,000 rpm for 30 min at room temperature.
7. Pipette off the supernatant.
8. Wash the tube wall with 180 μL of freshly prepared 80% ethanol without disturbing the pellet (which is almost invisible).
9. Centrifuge at 12,000 rpm for 30 min at room temperature.
10. Remove the ethanol using a pipette and discard.
11. Repeat the ethanol wash (steps 8–10).
12. Resuspend the pellet in 61 μL nuclease-free water.
13. Incubate for 30 min at 55 °C.
14. Quantify 1 μL of eluted sample using a Qubit fluorometer.
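The "X μL" of gDNA and the "up to 60 μL" of water in the end-prep mix of step 1 depend on the measured DNA concentration. The following sketch performs that calculation from the reagent volumes in the table; the function name and the example concentration of 50 ng/μL are hypothetical.

```python
# Fixed reagent volumes (in microliters) from the end-prep table in step 1.
FIXED_REAGENTS_UL = {
    "NEBNext FFPE DNA Repair Buffer": 3.5,
    "Ultra II End-prep reaction buffer": 3.5,
    "NEBNext FFPE DNA Repair Mix": 2.0,
    "Ultra II End-prep enzyme mix": 3.0,
}
TOTAL_UL = 60.0      # final reaction volume
TARGET_NG = 1000.0   # gDNA input


def end_prep_volumes(conc_ng_per_ul: float) -> tuple:
    """Return (gDNA volume, water volume) in microliters for the end-prep mix."""
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    dna_ul = TARGET_NG / conc_ng_per_ul
    water_ul = TOTAL_UL - sum(FIXED_REAGENTS_UL.values()) - dna_ul
    if water_ul < 0:
        # 1000 ng must fit into the 48 uL left after the fixed reagents.
        raise ValueError("gDNA too dilute: 1000 ng does not fit in the reaction")
    return dna_ul, water_ul


dna, water = end_prep_volumes(50.0)  # hypothetical Qubit reading, ng/uL
print(f"gDNA: {dna:.1f} uL, water: {water:.1f} uL")  # gDNA: 20.0 uL, water: 28.0 uL
```

Because the fixed reagents occupy 12 μL, the sample must be at least about 21 ng/μL (1000 ng in 48 μL) for the full input to fit; more dilute samples would need to be concentrated first or used at a lower input.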


3.2.2 Adapter Ligation and Cleanup

1. In a 1.5-mL Eppendorf DNA LoBind tube, mix in the following order:

   Reagent                              Volume
   DNA sample from the previous step    60 μL
   Ligation buffer (LNB)                25 μL
   NEBNext Quick T4 DNA Ligase          10 μL
   Adapter Mix F (AMX-F)                5 μL
   Total                                100 μL

2. Incubate the reaction for 10 min at room temperature.
3. Add 40 μL of resuspended AMPure XP beads to the reaction and mix by flicking the tube.
4. Incubate for 10 min at room temperature.
5. Spin down the sample and pellet the beads on a magnet. Keep the tube on the magnet, and pipette off the supernatant.
6. Wash the beads by adding 250 μL Long Fragment Buffer (LFB).
7. Flick the tube to resuspend the beads, spin down, then return the tube to the magnetic rack, and allow the beads to pellet.
8. Remove the supernatant using a pipette and discard.
9. Repeat the previous step.
10. Remove the tube from the magnetic rack and resuspend the pellet in 12 μL Elution Buffer (EB).
11. Incubate for 30 min at 37 °C.

3.3 Nanopore Sequencing

1. Prime the flow cell (refer to the kit manual).
2. In a new tube, prepare the library for loading as follows:

Reagent                              Volume
Sequencing Buffer II (SBII)          37.5 μL
Loading Beads II                     25.5 μL
DNA library                          12 μL
Total                                75 μL

3. Load 75 μL of the DNA library through the SpotON port.
4. Start sequencing.


5. When a suitable volume of data (0.5–1.0 Gbp) has been obtained, stop the sequencing run (see Note 2).
6. Wash the flow cell (refer to the kit manual).

3.4 Bioinformatics
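Before starting the analysis, it can be useful to check how much sequencing depth the collected data represent (Subheading 3.3, step 5; Note 4). The following is a minimal sketch using only gunzip and awk. The function name `fastq_depth`, the file name, and the 5-Mb genome size in the usage comment are illustrative assumptions, and the sketch assumes the standard four-line FASTQ layout.

```shell
# Hypothetical helper: read count, total bases, and approximate depth.
# Usage: fastq_depth reads.fastq.gz GENOME_SIZE_BP
fastq_depth() {
    gunzip -c "$1" | awk -v g="$2" '
        NR % 4 == 2 { bases += length($0); reads++ }   # sequence lines only
        END { printf "reads=%d bases=%.0f depth=%.1fx\n", reads, bases, bases / g }'
}

# Example (placeholder file name and genome size):
# fastq_depth ONT_trimmed.fq.gz 5000000
```

For a 5-Mb genome, 0.5 Gbp of reads corresponds to roughly 100× depth, which matches the depth recommended in Note 4.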

Each command is executed with default settings unless otherwise specified:

1. Draw a scatterplot of the quality and read length of ONT reads [2].
$ NanoPlot --fastq ONT.fastq.gz --loglength -t 8

2. Trim the first 50 bases of ONT reads (see Note 3). Discard reads with a mean quality below 10 and keep only reads longer than 1000 bp [2]. Setting a maximum length is also useful, since extremely long reads may be unreliable.
$ gunzip -c ONT.fastq.gz | NanoFilt -q 10 -l 1000 --maxlength 200000 --headcrop 50 | gzip > ONT_trimmed.fq.gz

3. If paired-end short reads (e.g., Illumina, MGI-seq) are available, quality filtering of the short reads can be performed [3].
$ fastp -i short_read_R1.fq.gz -I short_read_R2.fq.gz -3 -o QT_R1.fq.gz -O QT_R2.fq.gz \
    -h fastp_report.html -j fastp_report.json -q 20 -n 10 -t 1 -T 1 -l 20 -w 8

4. Use the Flye assembler [4] to assemble ONT reads (see Note 4). If you are getting Q20 (>99% accuracy) quality ONT reads, replace “--nano-raw” with “--nano-hq.”
$ flye --nano-raw ONT_trimmed.fq.gz --out-dir output_dir --threads 8 --scaffold

5. If paired-end short reads (100 or 150 bp recommended) are available, hybrid assembly is a better choice (see Note 5). Use the Unicycler assembler for this purpose [5].
$ unicycler -1 QT_R1.fq.gz -2 QT_R2.fq.gz -l ONT_trimmed.fq.gz -o output_dir -t 8

6. Visualize the assembly graph to see if a circular genome assembly has been obtained (see Notes 6 and 7). The Bandage graph browser is suitable for this purpose [6]. Nodes (contigs) that are split even though there are no multi-links can be merged using Edit => merge all possible nodes.

7. Polish the raw genome assembly two or three times (see Notes 8 and 9). If only long reads are available, skip this step.

#Map reads to raw genome assembly [7, 8]. bwa mem requires an index of the assembly.
$ bwa index assembly.fasta
$ bwa mem -t 8 assembly.fasta QT_R*.fq.gz | samtools sort -@ 4 - > short_mapped.bam
$ minimap2 -t 8 -ax map-ont assembly.fasta ONT_trimmed.fq.gz | samtools sort -@ 4 - > ONT_mapped.bam

#Indexing.
$ samtools index short_mapped.bam
$ samtools index ONT_mapped.bam

#Here, we use Pilon [9].
$ java -Xmx16G -jar pilon-1.24.jar --genome assembly.fasta --frags short_mapped.bam --nanopore ONT_mapped.bam --changes --threads 8 --outdir round1_outdir

8. Recheck assembly errors. If short reads are available, use SV-Quest [10], which calls sites that have small indels or potential structural errors.
$ SV_Quest.pl -f assembly.fasta -1 R1.fq.gz -2 R2.fq.gz

9. Recheck assembly errors using cuteSV [11], which detects structural variants from long reads. The default setting is too sensitive for homogeneous SVs; increase the minimum number of supporting reads to 30–50 for 100× depth ONT reads.

#Map ONT reads to the genome sequences.
$ minimap2 -t 8 -ax map-ont polished_assembly.fasta ONT_trimmed.fq.gz | samtools sort -@ 4 - > ONT_mapped.bam
$ samtools index ONT_mapped.bam

#Here, run cuteSV.
$ mkdir work_dir
$ cuteSV -t 8 -s 50 ONT_mapped.bam polished_assembly.fasta output.vcf work_dir

10. Annotate the assembly with the DDBJ Fast Annotation and Submission Tool (DFAST) (https://dfast.ddbj.nig.ac.jp/) [12] (see Notes 10 and 11). If you have a complete circular chromosome sequence, you can check “Rotate/flip the chromosome so that the dnaA gene comes first” to move the region upstream of dnaA to the first position of the chromosome; this prevents a coding sequence from being unexpectedly split across the beginning and end of the chromosome. This only works for DnaA-type replicons (use the “seqkit restart” command for Vibrio’s small chromosome).
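The rotation mentioned in step 10 simply moves the start of a circular sequence to a new coordinate. As an illustration only (DFAST and seqkit perform this in practice), the sketch below rotates a single-record FASTA so that a chosen 1-based position becomes the first base. The function name `rotate_fasta` and the 60-column output width are assumptions made for this example, and the sketch does not flip the strand.

```shell
# Hypothetical helper: rotate a single-record circular FASTA.
# Usage: rotate_fasta genome.fasta NEW_START   (1-based position)
rotate_fasta() {
    awk -v s="$2" '
        /^>/ {
            if (header != "") {              # a second record was found
                print "only one record is supported" > "/dev/stderr"
                bad = 1; exit
            }
            header = $0; next
        }
        { seq = seq $0 }                     # join wrapped sequence lines
        END {
            if (bad) exit 1
            n = length(seq)
            if (s < 1 || s > n) {
                print "start position out of range" > "/dev/stderr"
                exit 1
            }
            # bases s..n first, followed by bases 1..s-1
            rot = substr(seq, s) substr(seq, 1, s - 1)
            print header
            for (i = 1; i <= n; i += 60) print substr(rot, i, 60)
        }' "$1"
}

# Example (placeholder file name and coordinate):
# rotate_fasta chromosome.fasta 1234567 > chromosome_rotated.fasta
```

Choosing a start position in an intergenic region just upstream of the gene of interest keeps every coding sequence intact in the linear representation.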

4 Notes

1. The protocol presented here yields full-length genome assemblies of Q50 (>99.999% accuracy) or better from ONT reads alone. When ONT reads are combined with paired-end short reads, a genome assembly of Q60 (>99.9999% accuracy) or better can be obtained.
2. A 24-h sequencing run of bacterial genomic DNA on a portable MinION sequencer typically yields several hundred-fold depth of the genome with an average read length of 6–8 kb. However, excessive read depth may cause problems for genome assembly; it wastes computational resources and may cause errors by complicating the assembly graph. If your long-read data amount to several hundred-fold depth of the genome, NanoFilt’s --target_bases option would work well to retain the better-quality reads up to a user-specified volume [2].
3. The first several bases of an ONT read have a worse error profile than other regions of the read. This region should be force-trimmed. Similarly, short reads should have one base trimmed from both the 5′ and 3′ ends. To visualize the error profile of sequencing reads, run fastp without specifying output files (e.g., fastp -i ONT.fq.gz).
4. The relationship between the length of ONT reads and the contiguity of the resulting genome assembly varies case by case; some genome assemblies have better contiguity if longer ONT reads (≥20 kb) are selected, while others have better contiguity if shorter ONT reads (from 1 kb to 10 kb) are also used. In general, it is recommended to use ONT reads with an average length of 5–10 kb and about 100× depth of the genome. If assembly graphs are fragmented with links present, the assembly probably failed at complex repeats; try the Flye assembler or retry the Unicycler assembly after enriching for longer reads of tens of thousands of bp in length (e.g., seqkit seq -m 30000 ONT.fq.gz > Long_ONT.fq).
5. Both the Flye and Unicycler assemblers perform internal polishing to suppress assembly errors. However, a few SNV and/or indel errors are still present per Mb of the raw assembly. The post-polishing step reduces these errors to zero or near zero [13].
6. Although it is not necessary to create a full-length genome sequence for gene prediction, a complete chromosome sequence has some advantages for downstream analysis. It is recommended to create full-length sequences whenever possible.


7. If you find contig ends in the assembly graph that are not linked to other contig ends, the amount of sequencing data is probably insufficient (or hairpins or linear ends are present [13]). The former can be improved by additional sequencing.
8. The number of polishing cycles depends on how many errors remain in the assembled sequence. For the Unicycler assembly, two or three rounds of polishing are sufficient. For the Flye assembly, a combination of different algorithms (e.g., Polypolish and Pilon polishing) can be useful [14].
9. After two or three cycles of polishing, a very small number of errors may still be present in the assembly. Typically, indels less than 10 bp in length may be found using SV_Quest. A powerful way to fix such errors semi-manually is to perform local assembly with the short reads that are aligned to the potential error site. We prepared a small wrapper script to assist with this task (see https://github.com/kazumaxneo/local_assembly_and_alignment).
10. In addition, if you select the optional CheckM quality check, you can check for the potential risk of contaminating sequences or chimeric sequences [15].
11. If sequence errors remain in the public genome databases, technical bias may occur in comparative genome analysis. However, it is difficult for users to know whether assembly errors remain in a public genome sequence (no quality values are retained in a FASTA-format sequence!). It is highly desirable to register error-free genome sequences in public databases.

References

1. Morita H, Kuwahara T, Ohshima K et al (2007) An improved DNA isolation method for metagenomic analysis of the microbial flora of the human intestine. Microb Environ 22:214–222
2. De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C (2018) NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34:2666–2669. https://doi.org/10.1093/bioinformatics/bty149
3. Chen S, Zhou Y, Chen Y, Gu J (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34:i884–i890. https://doi.org/10.1093/bioinformatics/bty560
4. Kolmogorov M, Yuan J, Lin Y et al (2019) Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37:540–546. https://doi.org/10.1038/s41587-019-0072-8
5. Wick RR, Judd LM, Gorrie CL, Holt KE (2017) Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 13(6):e1005595. https://doi.org/10.1371/journal.pcbi.1005595
6. Wick RR, Schultz MB, Zobel J, Holt KE (2015) Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31:3350–3352
7. Li H, Handsaker B, Wysoker A et al (2009) The sequence alignment/Map format and SAMtools. Bioinformatics 25(16):2078–2079. https://doi.org/10.1093/bioinformatics/btp352
8. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760. https://doi.org/10.1093/bioinformatics/btp324
9. Walker BJ, Abeel T, Shea T et al (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9(11):e112963. https://doi.org/10.1371/journal.pone.0112963
10. https://github.com/kazumaxneo/SV-Quest
11. Jiang T, Liu Y, Jiang Y et al (2020) Long-read-based human genomic structural variation detection with cuteSV. Genome Biol 21:189. https://doi.org/10.1186/s13059-020-02107-y
12. Tanizawa Y, Fujisawa T, Nakamura Y (2017) DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics 34:1037–1039
13. Hashimoto Y, Taniguchi M, Uesaka K et al (2019) Novel multidrug-resistant enterococcal mobile linear plasmid pELF1 encoding vanA and vanM gene clusters from a Japanese vancomycin-resistant enterococci isolate. Front Microbiol 10:2568. https://doi.org/10.3389/fmicb.2019.02568
14. Wick RR, Holt KE (2022) Polypolish: short-read polishing of long-read bacterial genome assemblies. PLoS Comput Biol 18(1):e1009802. https://doi.org/10.1371/journal.pcbi.1009802
15. Parks DH, Imelfort M, Skennerton CT et al (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25:1043–1055

Chapter 4

De Novo Genome Assembly of Japanese Black Cattle as Model of an Economically Relevant Animal

Shinji Sasaki, Yasuhiko Haga, Hiroyuki Wakaguri, Kazumi Abe, and Yutaka Suzuki

Abstract

A genetic analysis of Japanese Black cattle using short reads and guided by the reference genome from Western breeds would miss the structural variation and/or other unique characteristics of Japanese Black cattle. To overcome this difficulty, a de novo genome assembly independent from the reference genome is required. This chapter describes the technical developments, with respect to both experimental and bioinformatics procedures, including the use of short and long reads, required for de novo genome assembly of Japanese Black cattle.

Key words De novo genome assembly, Nanopore sequencing, Long-read sequencing, Short-read sequencing, Genetic resource, Japanese Black cattle genome analysis

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Wagyu is Japan’s leading cattle breed for “delicious beef.” Japanese Black cattle, one of the main Wagyu breeds, is carefully maintained and protected as a precious genetic resource in Japan [1]. Almost all Japanese Black cattle are produced by artificial insemination with frozen semen [1]. The semen comes from so-called genetically elite sires, which have been shown to produce excellent meat quality. These precious semen stocks are stored in straws and conserved in liquid nitrogen (Fig. 1a). Frozen semen stocks have been, and are still being, stored even from sires born >50 years ago [2]. Apart from its economic importance, this is a valuable resource for genome analysis, particularly for family analyses. Family records are well organized, making it possible to retroactively select the most suitable individual for the analysis of the whole Japanese Black cattle population. In addition, relevant information regarding health records, such as the presence or absence of disease in an individual or its offspring, and other biological information, such as body mass and fatty acid percentage (important indicators of meat quality) and reproduction efficiency, has been collected and maintained for a long time [3, 4].

[Fig. 1 panel labels: (a) frozen semen of important seed bulls, stored in liquid nitrogen for 50 years; genomic DNA from semen (>60 kbp); MagAttract HMW DNA Kit; sequence library. (b) Short read (NovaSeq 6000) and long read (PromethION) statistics, tabulated below. (c) NovaSeq 6000; PromethION.]

                     NovaSeq 6000 (N=64)    PromethION (N=64)
Number of reads      920,828,388            12,394,984
Max length (bp)      151                    796,769
Average length       ND                     7,212
% mapped             97                     78
Total read length    58,933,016,834         86,316,187,094.90
Depth                52.90                  32.84

Fig. 1 Process for yielding sequencing data. (a) Extraction of high-molecular-weight DNA from bulls and frozen semen, both important to maintain Japanese Black cattle populations. (b) Genome sequencing by long and short reads of 64 Japanese Black cattle. (c) Distribution of read lengths by PromethION (left) and pictures of sequencers (right)

Although it is important for understanding and utilizing genetic resources to examine the genomic variation that is responsible for these characteristics, the Japanese Black cattle breed is genetically different from Western breeds [1]; consequently, the single nucleotide variant (SNV) profile of Japanese Black cattle differs as well [5]. A short-read analysis guided by the reference genome of Western breeds will therefore miss structural variation and other unique features of Japanese Black cattle. To solve this problem, a de novo genome assembly independent of the reference genome is required.

We have been exploring how these semen stocks can be utilized for such analyses. In the first section, we describe the procedure for extracting genomic DNA from frozen semen as the starting material (Fig. 1a). In the second section, we explain the process of de novo assembly using long sequence reads obtained by PromethION and describe the computational code used for that purpose. Using long reads, scaffolds were constructed, and each base was error-corrected using the highly accurate short reads obtained by NovaSeq 6000. In the last section, we demonstrate how we are attempting to create a Japanese Black cattle genome assembly.

2 Materials

2.1 Sample Preparation [Experiment]

Japanese Black cattle frozen semen (64 heads). Genomic DNA with egg yolk and 0.11 M citric acid solution.

2.2 Consumables and Equipment [Experiment]

1. Dilution buffer: 1× PCR buffer plus 40 mM DTT, 0.08% SDS, 0.05 mg/μL Proteinase K
2. Lysis buffer: 1% SDS, 100 mM Tris-HCl, pH 8.0, 10 mM EDTA, 2.5 M NaCl, 2% 2-Mercaptoethanol
3. 1× PCR buffer: 2 mM Tris-HCl, pH 8.0, 10 mM KCl, 0.01 mM EDTA, 0.05% Tween20, 0.05% Nonidet P-40, 5% Glycerol
4. MagAttract HMW DNA Kit (QIAGEN, Cat. # 67563)
5. NEBNext Companion Module for Oxford Nanopore Technologies Ligation Sequencing (Cat. # E7180S)
6. Ligation Sequencing Kit (Oxford Nanopore, Cat. # SQK-LSK109)
7. TapeStation (Agilent, Cat. # G2991AA)
8. Genomic DNA ScreenTape (Agilent, Cat. # 5067-5365)
9. Genomic DNA Reagent Kit (Agilent, Cat. # 5067-5366)
10. PromethION 48 Flow Cells (Oxford Nanopore, Cat. # FLO-PRO002)
11. TruSeq DNA Nano LT Library Prep Kit (Illumina, Cat. # 20015964)
12. TruSeq DNA UD Indexes (Illumina, Cat. # 20020590)
13. NovaSeq 6000 S2 Reagent Kit (300 Cycles) (Illumina, Cat. # 20028314)


2.3 Software [Data Analysis]

The following tools and versions were used for the analysis:

1. wtdbg2 (2.5 (20190621)) [8]
2. wtpoa-cns (2.5 (20190621)) [8]
3. minimap2 (2.12-r829) [9]
4. bwa mem (0.7.17-r1188) [10]
5. samtools (1.11-9-ga53817f) [11]
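Before running the analysis, it may be worth confirming that the listed programs can be found on the PATH. The following sketch relies only on the POSIX `command -v` builtin. The function name `check_tools` is an assumption for illustration, and the check covers presence only, not the versions listed above.

```shell
# Hypothetical helper: report which required programs are missing from PATH.
# Returns 0 if all tools are found and 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Programs from the list above (bwa mem is a subcommand of bwa)
check_tools wtdbg2 wtpoa-cns minimap2 bwa samtools ||
    echo "Install the missing tools before proceeding."
```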

3 Methods

3.1 Experimental Procedures

3.1.1 Genomic DNA Extraction from Frozen Semen and DNA Quality Check

To enable long-read sequencing for de novo genome assembly, a sufficient amount of long genomic DNA (usually several micrograms) is essential. However, in practice, it is not always possible to secure an ideal sample, partly due to DNA deterioration over time or because of a limited amount of available samples. The frozen semen samples of Japanese Black cattle used in this study may be relatively advantageous, since they are originally designed to retain biological activity for reproduction purposes and, thus, are supposed to remain relatively intact. Moreover, stored samples for commercial use are generally larger than those collected for research purposes. However, commercial samples are subjected to specific freezing conditions. For example, they are diluted with a cryoprotectant (mainly egg yolk and citrate) [6, 7] to improve freezing resistance. Although this approach is cost-effective and maintains DNA quality, it leads to protein-rich and extremely viscous samples. To overcome these potential difficulties, we first established the optimum protocol for extracting high-quality genomic DNA for sequencing.

1. After thawing the frozen straw, 500 μL of the semen sample was transferred to a 1.5 mL tube together with 500 μL of 0.11 M citric acid solution. The sample was then centrifuged at 6000 rpm for 2 min and the supernatant discarded. This washing process was repeated 2–3 times until the pellet turned white (see Note 1).
2. The pellet was dissolved in 1 mL PBS and pipetted slowly, then centrifuged at 6000 rpm for 2 min, and the supernatant was aspirated. This process was repeated once.
3. The pellet was resuspended in 300 μL Lysis buffer, and 100 μL Proteinase K was added, then incubated for 2 h at 55 °C on a shaker. Then 20 μL Proteinase K was added, followed by another 2 h of incubation at 55 °C.
4. DNA was extracted according to the tissue protocol of the MagAttract HMW DNA Kit. To the lysate, we added 300 μL of Buffer AL, 560 μL of Buffer MB, and 80 μL of MagAttract Suspension G and mixed on a shaker for 3 min at 1400 rpm (see Note 2).


5. The tube was placed on a magnet rack for 1 min and the supernatant discarded.
6. The pellet was resuspended in 1400 μL of Buffer MW1 and mixed on a shaker for 3 min at 1400 rpm, followed by aspiration on the magnet rack. This step was performed twice.
7. The pellet was resuspended in 1400 μL of Buffer PE and mixed on a shaker for 3 min at 1400 rpm, followed by aspiration on the magnet rack. This step was performed twice.
8. Then, 1400 μL of distilled water was added to the tube while on the magnet rack; after 1 min, the supernatant was discarded. This step was performed twice (see Note 3).
9. The pellet was resuspended in 150 μL Buffer AE and mixed on a shaker for 3 min at 1400 rpm; the tube was then placed on the magnet rack for 1 min and the supernatant collected into a new tube (see Note 4).
10. The concentration and length of the extracted gDNA were measured using a Qubit Fluorometer (Thermo Fisher) and a TapeStation (Agilent Technologies) (see Note 5).

3.1.2 Long-Read Sequencing by PromethION

1. To prepare PromethION libraries, 2000 ng of the extracted gDNA was diluted in 48 μL, and a Ligation Sequencing Kit was used following the manufacturer’s protocol. Then, 3.5 μL of NEBNext FFPE DNA Repair Buffer, 2 μL of NEBNext FFPE DNA Repair Mix, 3.5 μL of Ultra II End-prep reaction buffer, and 3 μL of Ultra II End-prep enzyme mix were added to the sample and incubated for 20 min at 65 °C.
2. Next, 60 μL of AMPure XP beads were added to the sample and incubated for 5 min at room temperature.
3. The tube was placed on a magnet rack until the supernatant was clear, and the supernatant was then discarded (see Note 6).
4. With the tube still on the magnet rack, 200 μL of freshly prepared 70% ethanol was added to the pellet and then discarded. This step was performed twice.
5. The tube was removed from the magnet rack and the pellet resuspended in 60 μL nuclease-free water and incubated for 2 min at room temperature.
6. The tube was placed on the magnet rack until the eluate was clear, and 60 μL of the eluate was transferred to a new tube (see Note 7).
7. Adapter ligation was performed by adding 25 μL Ligation Buffer, 10 μL NEBNext Quick T4 DNA Ligase, and 5 μL Adapter Mix to the sample and incubating for 10 min at room temperature (see Note 8).
8. Next, 40 μL of AMPure XP beads were added to the sample and incubated for 5 min at room temperature.


9. The tube was placed on a magnet rack until the supernatant was clear, and the supernatant was then discarded.
10. The tube was removed from the magnet rack, and 250 μL of Long Fragment Buffer was added to resuspend the pellet. Then, the tube was placed back on the magnet rack and the supernatant discarded. This step was performed twice (see Note 9).
11. The tube was removed from the magnet rack and the pellet resuspended in 25 μL Elution Buffer, then incubated for 2 min at room temperature (see Note 10).
12. The tube was placed on the magnet rack until the eluate was clear; 25 μL of the eluate was moved to a new tube (see Note 11).
13. The loading library was prepared by mixing 24 μL sample, 75 μL Sequencing Buffer, and 51 μL Loading Beads (see Note 12).
14. Last, the library was loaded on the PromethION flow cell and sequenced.

3.1.3 Short-Read Sequence by Illumina Sequencer

1. Short-read sequencing was performed by following a standard procedure. For details, see the Illumina home page (https://support.illumina.com/downloads/novaseq-6000-system-guide-1000000019358.html).
2. The sequence reads were obtained using an S4 flow cell of a NovaSeq 6000. The sequencing mode of 100-bp paired-end reads was employed.
3. A total of approximately 90 Gb (30× sequence depth) was generated and used for the following assembly analysis.

Figure 1a shows the genomic DNA extracted by the described protocol. In this way, we could extract genomic DNA with an average length of 65 kbp and a total yield of 10 μg, starting from 100 μL of semen straw material. The sequencing statistics are shown in Fig. 1b, and the distribution of PromethION read lengths is shown in Fig. 1c.

3.2 Bioinformatics Procedures

The de novo assembly of the Japanese Black cattle genome sequence was performed using long-read sequence data obtained by PromethION of Oxford Nanopore Technologies. Short-read data (150-base paired-end reads) were also obtained with the Illumina NovaSeq 6000 sequencer from the same starting material. Redbean (wtdbg2) [8] was used as the de novo assembly tool. First, the genome assembly was performed using only PromethION long-read sequences to obtain the first contigs. Then, the original long-read sequences and the highly accurate short-read sequences were mapped to the initially constructed contigs. Consensus sequences were obtained considering both individual long and short reads (this process is called polishing hereafter) to obtain contigs of error-corrected sequences. High-accuracy contig sequences were obtained by repeating this polishing step several times (Fig. 2a, b). The Redbean package, which contains the assembly program wtdbg2 and the wtpoa-cns program that derives the consensus sequence for polishing, can serve as an easy one-pot solution for genome assembly.

[Fig. 2a flow chart: long reads are assembled with wtdbg2 and wtpoa-cns (1st contig); long reads are mapped with minimap2 and polished with wtpoa-cns (2nd contig); short reads are mapped with BWA-MEM (loose gap penalty) and polished with wtpoa-cns (3rd contig); short reads are mapped with BWA-MEM (default) and polished with wtpoa-cns (4th contig).]

[Fig. 2b table; numeric values are not recoverable from the source. Columns: ARS-UCD assembly (reference genome); Japanese Black cattle de novo assembly samples (Redbean, NextDenovo). Rows: assembly version; constructor (USDA ARS; University of Tokyo); genome coverage; total sequence length; NG and L statistics; total contigs; gap length against reference genome.]

Fig. 2 De novo genome assembly of Japanese Black cattle in combination with long and short reads. (a) De novo genome assembly flow chart. (b) Comparison between the de novo genome assembly of Japanese Black cattle and the reference genome (ARS-UCD1.2)


For Redbean, the following processing was performed. PromethION’s fastq sequence was used as input for wtdbg2. Then, wtpoa-cns was used to obtain the first contigs. PromethION reads were remapped to the first contigs using the most standard mapping tool, minimap2 [9], and the second contigs were generated by wtpoa-cns. Next, NovaSeq reads were mapped to the second contigs. For this short-read mapping, the most popular short-read mapping program, bwa mem [10], was run with relaxed gap penalties, and the third contigs were generated by wtpoa-cns. Finally, bwa mem with default parameters was used to map the NovaSeq reads to the third contigs, and wtpoa-cns was used to generate the fourth contigs. We regarded those fourth contigs as the assembled genome sequences.

3.2.1 Outline

1. Generate the first contigs with wtdbg2 and wtpoa-cns using the PromethION reads.
2. Map the PromethION reads to the first contigs with minimap2, and generate the second contigs with wtpoa-cns.
3. Map the NovaSeq reads to the second contigs with bwa mem using relaxed gap penalties, and generate the third contigs with wtpoa-cns.
4. Map the NovaSeq reads to the third contigs with bwa mem (default parameters), and generate the fourth contigs with wtpoa-cns to obtain the final contig sequences.

Detailed codes for each step are as follows:

3.2.2 Code (See Note 13)

#Environment settings
promethion_input=YOUR_INPUT_PROMETHION_FASTQ_PREFIX
novaseq_input_r1=YOUR_INPUT_NOVASEQ_READ1_FASTQ_PREFIX
novaseq_input_r2=YOUR_INPUT_NOVASEQ_READ2_FASTQ_PREFIX
output_prefix=YOUR_OUTPUT_PREFIX

#For step (1) (initial assembling) (see Notes 14 and 15)
wtdbg2 -x ont -g 2.7g -t 5 -i ${promethion_input}.fastq.gz -fo ${output_prefix}
wtpoa-cns -t 5 -i ${output_prefix}.ctg.lay.gz -fo ${output_prefix}.ctg.lay.fa

# For step (2) (first polishing using long reads)
minimap2 -t 5 -ax map-ont ${output_prefix}.ctg.lay.fa ${promethion_input}.fastq.gz | samtools view -Sb - > ${output_prefix}.ctg.map.bam
samtools sort ${output_prefix}.ctg.map.bam -o ${output_prefix}.ctg.map.sorted.bam
samtools view ${output_prefix}.ctg.map.sorted.bam | wtpoa-cns -t 5 -d ${output_prefix}.ctg.lay.fa -i - -fo ${output_prefix}.ctg.2nd.fa

# For step (3) (second polishing using short reads allowing gaps) (see Note 16)
bwa index -p ${output_prefix}.ctg.2nd ${output_prefix}.ctg.2nd.fa
bwa mem -A1 -B1 -O1 -E1 -L0 -t 5 ${output_prefix}.ctg.2nd ${novaseq_input_r1}.fastq.gz ${novaseq_input_r2}.fastq.gz | samtools view -Sb - > ${output_prefix}.2nd_ctg.novaseq_map.bam
samtools sort ${output_prefix}.2nd_ctg.novaseq_map.bam -o ${output_prefix}.2nd_ctg.novaseq_map.sorted.bam
samtools index ${output_prefix}.2nd_ctg.novaseq_map.sorted.bam
# "-i -" makes wtpoa-cns read the alignments from standard input
samtools view ${output_prefix}.2nd_ctg.novaseq_map.sorted.bam | wtpoa-cns -t 5 -x sam-sr -d ${output_prefix}.ctg.2nd.fa -i - -fo ${output_prefix}.ctg.3rd.fa

# For step (4) (third polishing using short reads with default parameters)
bwa index -p ${output_prefix}.ctg.3rd ${output_prefix}.ctg.3rd.fa
bwa mem -t 5 ${output_prefix}.ctg.3rd ${novaseq_input_r1}.fastq.gz ${novaseq_input_r2}.fastq.gz | samtools view -Sb - > ${output_prefix}.3rd_ctg.novaseq_map.bam
samtools sort ${output_prefix}.3rd_ctg.novaseq_map.bam -o ${output_prefix}.3rd_ctg.novaseq_map.sorted.bam
samtools index ${output_prefix}.3rd_ctg.novaseq_map.sorted.bam
samtools view ${output_prefix}.3rd_ctg.novaseq_map.sorted.bam | wtpoa-cns -t 5 -x sam-sr -d ${output_prefix}.ctg.3rd.fa -i - -fo ${output_prefix}.ctg.4th.fa
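Contiguity of the resulting contigs is commonly summarized with N50 or NG50. N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly size; NG50 uses half of the estimated genome size instead. The following is a minimal sketch with standard Unix tools, not part of the authors' pipeline. The function name `contig_nx` is hypothetical, and the genome size in the usage comment simply reuses the 2.7-Gb value passed to wtdbg2 above.

```shell
# Hypothetical helper: N50 (no second argument) or NG50 (genome size given).
# Usage: contig_nx contigs.fa [GENOME_SIZE_BP]
contig_nx() {
    # 1) one length per contig, 2) sort longest first, 3) walk the cumulative sum
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk -v g="${2:-0}" '
        { l[NR] = $1; total += $1 }
        END {
            target = (g > 0 ? g : total) / 2
            for (i = 1; i <= NR; i++) {
                cum += l[i]
                if (cum >= target) { print l[i]; exit }
            }
            print "NA"    # contigs cover less than half of the genome size
        }'
}

# Example (placeholder genome size):
# contig_nx ${output_prefix}.ctg.4th.fa 2700000000
```

Because NG50 is normalized by a fixed genome size rather than by each assembly's own length, it is the more suitable of the two for comparing assemblies of different total size.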

3.3 Analysis Example

The procedures of sample preparation and genome assembly described above are illustrated in Figs. 1, 2, 3, 4 and 5. Genomic DNA was extracted from a frozen semen sample of a Japanese Black cattle sire (Fig. 1). De novo assembly of 64 Japanese Black cattle samples using NextDenovo [12] was also performed in addition to Redbean. The average gap length for the reference genome (GCA_002263795.2) was larger than that of Redbean (Fig. 2b). We developed a specialized database to use the generated genome assembly data (Fig. 3). Possible characteristic structural variants in the Japanese Black cattle genome, which were detected by NGMLR 0.2.6 [13] and Sniffles 1.0.10 [13], are exemplified (Fig. 4a–e). Although the details are omitted from this chapter, long-read sequencing can also be used for haplotype phasing with WhatsHap 1.1 [14] (Fig. 5).

[Fig. 3 screenshot labels: display setting; tool buttons (zoom in/out, information, ...); search box (chromosome, position, gene name); tracks: reference base, reference gene model, WGS short read depth (bigWig), WGS long read depth (bigWig), WES depth (bigWig), expression level (by color depth), Ref. genome, base alignment, assemble contig; Japanese Black cattle specific mutation A53591701G.]

Fig. 3 Japanese Black cattle custom genome browser

3.4 Conclusions

In this chapter, we have described a de novo genome assembly of Japanese Black cattle using frozen semen as starting material. We have also shown some examples of initial genome analysis. Using the protocol described here, we could generate genomic DNA contigs with an average NG50 of 10.4 Mbp. In addition, we constructed a specialized database as a custom genome browser for easy access to the analysis results. This platform is useful not only for researchers but also for farmers, who can obtain genetic information relevant for breeding. Interestingly, a number of unique genomic features were identified, such as duplicated, inverted, and translocated genomic regions, for which further detailed scientific characterization is needed. So far, we have collected and analyzed samples from a total of 64 Japanese Black cattle. This genomic information will be further applied for improved breeding. In particular, the present work allows for identification of possible causative genes of many hereditary diseases, which would bring substantial economic benefits. In addition, the evaluation of the genetic diversity between Japanese Black cattle and Western populations will shed light on breed domestication in Japan and further diversification within this limited geographical region.

Fig. 4 Examples of detected structural variations: (a) deletion (chr10:54,114,288-54,115,527), (b) insertion (chr10:56,081,080-56,086,841), (c) duplication (chr8:48,534,230-48,539,053), (d) inversion (chr10:84,328,051-84,333,028), and (e) translocation of genomic regions (chr20 - chr1 - chr20; chr14 - chr7 - chr14 and chr7 - chr14 - chr7; chr13 - chr9 - chr13 and chr9 - chr13 - chr9), detected with the de novo genomic assembly of Japanese Black cattle. Tracks shown in the panels: Sniffles SV detection, WGS long-read depth (bigWig), reference genome, base alignment, and assembled contig

Fig. 5 Phasing with long-read sequencing. Comparison of phasing results obtained with WhatsHap (https://whatshap.readthedocs.io/) using long reads and those obtained with Beagle (https://faculty.washington.edu/browning/beagle/beagle.html) using SNPs detected in 500 Japanese Black cattle whole-exome sequences obtained from short reads. The figure shows an overview of haplotype patterns (top) and an enlarged view of the region (below). The positions of heterozygous polymorphisms are indicated by blue and red vertical lines for haplotype 1 (HP1) and haplotype 2 (HP2). Blue indicates that the base matches the reference sequence and red that it matches the alternative sequence. The same haplotype pattern can be observed for samples processed by WhatsHap and Beagle (red arrows). Beagle uses statistical methods to estimate haplotypes, whereas WhatsHap phases directly from long-read sequences

4 Notes

1. Because semen samples are very viscous, take extra care to pipette them slowly.
2. The magnetic beads sometimes aggregate in the buffer; do not try to resuspend them. The aggregation is due to long DNA fragments.
3. Do not let the water stay for more than 1 min, as this leads to poor yield.
4. Wide bore tips are recommended. If using ordinary tips, take care to pipette slowly so as not to fragment the long DNA.
5. The concentration may differ between Qubit and TapeStation; Qubit is the more reliable. As for the length, FemtoPulse (Agilent Technologies) can detect longer gDNA (up to 165 kb).
6. As the DNA is bound to the beads, take extra care not to discard any of the beads.
7. Carryover of the beads is detrimental to the following enzyme reactions.
8. The ligation time can be extended to increase yield.
9. Washing with ethanol would damage the motor protein bound to the adapter.
10. The incubation time can be extended to increase yield. Incubating at 37 °C is also an option.
11. The library should be loaded immediately. If not, it should be kept at 4 °C. Do not freeze.
12. Loading beads form a precipitate quickly. Pipette well before loading.
13. The whole process took about 1–2 weeks using an Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz (Shirokane5 fat).
14. This code is an example using five threads. Many tools accept a -t option to set the number of threads.
15. Assembling the Wagyu genome data (putative genome size, 2.7 Gb; read depth, about 30x) requires 200–250 GB of memory.
16. This code is an example of polishing; the number of polishing rounds and the conditions should be adjusted depending on the accuracy of long-read sequencing and the genome status.

References

1. Namikawa K (1992) Japanese beef cattle: historical breeding processes of Japanese beef cattle and preservation of genetic resources as economic farm animal (in Japanese). Wagyu Registry Association, Kyoto
2. Sasaki S, Watanabe T, Ibi T, Hasegawa K, Sakamoto Y, Moriwaki S et al (2021) Identification of deleterious recessive haplotypes and candidate deleterious recessive mutations in Japanese Black cattle. Sci Rep 11(1):6687. https://doi.org/10.1038/s41598-021-86225-y
3. Motoyama M, Sasaki K, Watanabe A (2016) Wagyu and the factors contributing to its beef quality: a Japanese industry overview. Meat Sci 120:10–18. https://doi.org/10.1016/j.meatsci.2016.04.026
4. Gotoh T, Nishimura T, Kuchida K, Mannen H (2018) The Japanese Wagyu beef industry: current situation and future prospects - a review. Asian-Australas J Anim Sci 31(7):933–950. https://doi.org/10.5713/ajas.18.0333
5. Uemoto Y, Abe T, Tameoka N, Hasebe H, Inoue K, Nakajima H et al (2011) Whole-genome association study for fatty acid composition of oleic acid in Japanese Black cattle. Anim Genet 42(2):141–148. https://doi.org/10.1111/j.1365-2052.2010.02088.x
6. Phillips PH (1939) The preservation of bulls semen. J Biol Chem 130:415
7. Hurst V (1953) Dilution of bull semen with frozen egg yolk-sodium citrate. J Dairy Sci 36(2):181–184. https://doi.org/10.3168/jds.S0022-0302(53)91475-1
8. Ruan J, Li H (2019) Fast and accurate long-read assembly with wtdbg2. Nat Methods 17(2):155–158. https://doi.org/10.1038/s41592-019-0669-3
9. Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094–3100. https://doi.org/10.1093/bioinformatics/bty191
10. Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]
11. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079. https://doi.org/10.1093/bioinformatics/btp352
12. NextDenovo: https://github.com/Nextomics/NextDenovo
13. Sedlazeck FJ, Rescheneder P, Smolka M et al (2018) Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15:461–468. https://doi.org/10.1038/s41592-018-0001-7
14. Martin M, Patterson M, Garg S, Fischer SO, Pisanti N, Klau GW, Schoenhuth A, Marschall T. WhatsHap: fast and accurate read-based phasing. bioRxiv 085050

Chapter 5

How to Sequence and Assemble Plant Genomes

Ken Naito

Abstract

Although the nanopore sequencer is a great tool, many plant scientists have suffered from bad sequencing results, even though they have exactly followed the official protocol in preparing a library. This is because the protocol is not optimized for plant genomic DNA. The protocol may be good for sequencing animal or bacterial genomes, but not for plants. However, if the protocol is properly modified, one can obtain lots of long reads and achieve a telomere-to-telomere assembly. Here I present a protocol to that end.

Key words Plant genomes, Genome assembly, Nanopore sequencing, Organellar genomes

1 Introduction

Nanopore sequencing technology [1] has made a great impact on biology, including botany, ecology, and agricultural studies [2]. Its long read length and low cost have enabled even graduate students to achieve chromosome-level assembly. However, there have not been many reports on sequencing plant genomes yet. This is because of the difficulty in isolating plant genomic DNA. Plant cell walls require physical crushing for DNA extraction, which leads to fragmentation of DNA. Plants also accumulate various metabolites such as polysaccharides and polyphenols, which easily contaminate the DNA solution. Thus, many plant scientists have suffered from very low yield and short read length in nanopore sequencing, even though they have exactly followed the official protocols, which are actually optimized for bacterial and animal genomes. Here I present a solution that is optimized for plant genomes. The first half of the protocol covers DNA extraction and library preparation. The second half covers de novo assembly, including circularization of organellar genomes.

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

2 Materials

2.1 DNA Isolation

1. Very young, unexpanded leaves or sprouts germinated in the dark (1.5 g)
2. Liquid nitrogen
3. Water bath or dry bath that can hold a 50-mL conical tube
4. Mortar and pestle
5. Dispensing spoon
6. Qubit and dsDNA BR Assay Kit
7. Isopropanol
8. 70% ethanol
9. 50-mL conical tube
10. 1.5-mL microcentrifuge tubes
11. NucleoBond HMW DNA Kit (MACHEREY-NAGEL)

2.2 Size Selection

1. 1.5-mL microcentrifuge tubes
2. Short Read Eliminator XL (PacBio)
3. 70% ethanol
4. Microcentrifuge

2.3 Library Prep

1. Ligation Sequencing Kit (Oxford Nanopore Technologies)
2. AMPure XP (Beckman Coulter)
3. NEBNext FFPE DNA Repair Mix (New England Biolabs)
4. NEBNext Ultra II End Repair/dA-Tailing Module (New England Biolabs)
5. NEBNext Quick Ligation Module (New England Biolabs)
6. Magnetic rack (we recommend MagnaStand 1.5 [FastGene])
7. 1.5-mL microcentrifuge tubes
8. Water/dry bath

2.4 Draft Assembly

1. Linux system with >128-GB memory and many CPUs
2. Basic knowledge of bash
3. NECAT (conda install -c bioconda necat) [3]
4. minimap2 (conda install -c bioconda minimap2) [4]
5. samtools (conda install -c bioconda samtools) [5]
6. racon (conda install -c bioconda racon) [6]
7. medaka (conda create -n medaka -c conda-forge -c bioconda medaka) (https://github.com/nanoporetech/medaka)
8. Purge_haplotigs (conda create -n purge_haplotigs -c bioconda -c conda-forge purge_haplotigs) [7]
9. SNAP-aligner (conda install -c bioconda snap-aligner) [8]
10. Hypo (conda create -n hypo -c conda-forge -c bioconda hypo) [9]
11. seqkit (conda install -c bioconda seqkit) [10]
12. coverm (conda install -c bioconda coverm) (https://github.com/wwood/CoverM)
13. NCBI-blast (conda install -c bioconda blast) (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)

2.5 Assembling Organellar Genomes

1. Basic knowledge of bash
2. Flye (conda install -c bioconda flye) [11]
3. minimap2 (conda install -c bioconda minimap2) [4]
4. samtools (conda install -c bioconda samtools) [5]
5. racon (conda install -c bioconda racon) [6]
6. medaka (conda create -n medaka -c conda-forge -c bioconda medaka)
7. SNAP-aligner (conda install -c bioconda snap-aligner) [8]
8. Hypo (conda create -n hypo -c conda-forge -c bioconda hypo) [9]
9. seqkit (conda install -c bioconda seqkit) [10]

2.6 Re-bridging Nuclear Genome

1. Linux system with >128-GB memory and many CPUs
2. Basic knowledge of bash
3. NECAT (conda install -c bioconda necat) [3]
4. minimap2 (conda install -c bioconda minimap2) [4]
5. samtools (conda install -c bioconda samtools) [5]
6. racon (conda install -c bioconda racon) [6]
7. medaka (conda create -n medaka -c conda-forge -c bioconda medaka) (https://github.com/nanoporetech/medaka)
8. Purge_haplotigs (conda create -n purge_haplotigs -c bioconda -c conda-forge purge_haplotigs) [7]
9. SNAP-aligner (conda install -c bioconda snap-aligner) [8]
10. Hypo (conda create -n hypo -c conda-forge -c bioconda hypo) [9]
11. seqkit (conda install -c bioconda seqkit) [10]
12. coverm (conda install -c bioconda coverm) (https://github.com/wwood/CoverM)
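Before starting, it can save time to confirm that the command-line tools listed above are on your PATH. The sketch below is a generic check using the POSIX command -v builtin. The function name missing_tools is hypothetical, and the executable names in the example call are assumptions based on the conda package names and the commands used in this chapter. Tools installed into separate conda environments (medaka, purge_haplotigs, and Hypo) will only be found after the corresponding environment is activated.

```shell
#!/bin/sh
# missing_tools NAME...  (hypothetical helper)
# Prints the name of every listed executable that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names are assumed from the package names in this section.
missing_tools necat minimap2 samtools racon snap-aligner seqkit coverm blastn
```

An empty output means that every listed executable was found.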

3 Methods

3.1 DNA Extraction

All the problems in extracting plant DNA come from cell walls and vacuoles. These structures contain large amounts of secondary metabolites such as polysaccharides and polyphenols, which are very hard to remove. Polysaccharides in particular have chemical characteristics similar to those of DNA and thus cannot be removed by many DNA extraction kits or by phenol-chloroform methods. They even have absorbance similar to that of DNA, which is why a spectrophotometer such as NanoDrop reports a much higher DNA concentration than is really present. The recently released DNeasy Plant Pro Kit is able to remove such impurities, but, like many other plant DNA isolation kits, it uses a centrifuge column in the washing steps. For nanopore sequencing, DNA isolation with a column and centrifugation must be avoided because it breaks DNA down to 15–20 kbp. I had thus long been claiming that there was no high-molecular-weight DNA isolation kit for plants. However, the NucleoBond HMW DNA Kit has solved all the problems I had. This kit provides high-molecular-weight (>100 kbp) DNA with most of the secondary metabolites removed from many plant materials. If this kit does not work on your material, you may try PhytoPure (Nucleon) or isolate nuclei according to the protocol by Workman et al. [12]. Once you have isolated the nuclei, DNA isolation is easily done with the Nanobind Plant Nuclei Big DNA Kit (Circulomics). The following protocol is based on the NucleoBond HMW DNA Kit (buffers and reagents are included in the kit).

1. Put your plants in the dark for at least 48 h. This treatment decreases secondary metabolites.
2. Sample up to 1.5 g of fresh, soft, unexpanded leaves. If you cannot get enough leaves at once, store the leaves at -80 °C and repeat sampling.
3. Set your water/dry bath to 50 °C. If you see any CTAB precipitation in Buffer H1, warm it until it is fully dissolved.
4. Put the following into a 50-mL conical tube:

   Buffer H1        5 mL
   Proteinase K     200 μL

5. Pre-chill your mortar and pestle with liquid nitrogen (or by putting them at -80 °C for a few hours).
6. Grind your sample with the mortar and pestle. Pour additional liquid nitrogen every minute. Keep grinding for at least 3 min even if you do not see any visual changes. You will be very tired after this process, but this sacrifice will be rewarded with a higher DNA yield.
7. Pre-chill a dispensing spoon with liquid nitrogen.
8. Transfer all the ground sample with the dispensing spoon into the Buffer H1 prepared in step 4. Mix it by stirring with the spoon as quickly as possible, and tightly close the lid.
9. Put the tube into the 50 °C water/dry bath. Although the protocol says a 30-min incubation is enough, a 2–3-h incubation will increase the yield (see Note 1).
10. Take the tube out of the water/dry bath and add the following:

   RNase A          100 μL

11. Incubate for 5 min.
12. Set the plastic washer on the column and put the column on a new conical tube (for the flow-through).
13. Saturate the filter and column with Buffer H2 by slowly adding 13 mL of it.
14. When the 5-min incubation in step 11 is over, add the following and mix well:

   Buffer H2        10 mL

15. Transfer all the solution onto the filter. Do not pour it all at once, or your solution will overflow. If you are working with viscous samples, it may take quite a while to flow through. Be patient.
16. Add 6 mL of Buffer H3 slowly onto the filter (as illustrated in the official protocol) to push the remaining solution out.
17. Discard the filter. (Do not discard the column!)
18. Wash the column with 12 mL of Buffer H2. It is better to wash with 6 mL x2 or 4 mL x3.
19. Discard the flow-through and the tube.
20. Place the column on a new tube and elute your DNA with 5 mL of Buffer H5 (see Note 2).
21. Add 3.5 mL of isopropanol and mix very gently.
22. You should see your DNA precipitate.
23. Add 1 mL of 70% ethanol to a new microcentrifuge tube.
24. Transfer the DNA by pipetting with a wide bore tip into the microcentrifuge tube from step 23.
25. Remove the supernatant by pipetting (do not centrifuge).
26. Add 1 mL of 70% ethanol and then remove it.
27. Repeat step 26.
28. Let the pellet dry for a few minutes. Do not let it dry completely.
29. Dissolve your DNA in 150–500 μL of Buffer HE. It is better to leave it overnight at RT.
30. Measure the concentration with Qubit and NanoDrop to check that the difference is within twofold.
31. Run ~100 ng of your DNA in a 0.8% agarose gel at 50 V for at least 2 h. Use Lambda DNA and Lambda HindIII as markers (see Note 3).

3.2 Size Selection

Although the official protocol does not recommend size selection, we strongly recommend it because longer reads give greater contiguity in your assembly. The problem with size selection is the decrease in DNA molarity. At the same weight per volume, a solution of longer DNA has a much smaller number of DNA fragments than a solution of shorter DNA. Lower molarity results in lower output, not only because fewer pores are occupied by DNA strands but also because pores turn over faster when more electric current flows through them. The solution to this problem is simple: input more DNA (see Subheading 3.3).

1. Transfer 6–9 μg of DNA from your DNA solution into a new microcentrifuge tube.
2. Add water to make it 60 μL.
3. Add 60 μL of Short Read Eliminator (SRE) or SRE XL.
4. Mix well by flicking. Do not mix by pipetting or vortexing.
5. Centrifuge at 10,000 × g for 30 min.
6. Remove the supernatant and add 250 μL of fresh 70% EtOH (see Note 4).
7. Centrifuge at 10,000 × g for 2 min.
8. Repeat step 6, but this time you should see a white pellet.
9. Remove the supernatant, spin briefly in a microcentrifuge, and carefully remove all the remaining EtOH (it is better to use a 10-μL pipetter).
10. Add 50 μL of elution buffer. Mix by tapping/flicking.
11. Leave it overnight at RT.
12. Take 1 μL of the DNA solution to quantify with Qubit. Make sure you have recovered 2.5 μg or more of DNA.
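The effect of fragment length on molarity can be made concrete with a little arithmetic. The sketch below assumes an average mass of about 650 g/mol per base pair of double-stranded DNA, which is a commonly used approximation and not a value given in this chapter. The function name dna_fmol is hypothetical.

```shell
#!/bin/sh
# dna_fmol MASS_NG MEAN_LENGTH_BP  (hypothetical helper)
# Converts a mass of double-stranded DNA into femtomoles of fragments,
# assuming ~650 g/mol per base pair (an approximation).
dna_fmol() {
    awk -v ng="$1" -v bp="$2" 'BEGIN { printf "%.1f\n", ng * 1e6 / (bp * 650) }'
}

# The same 1 ug of DNA at two mean fragment lengths:
dna_fmol 1000 10000   # 10-kbp fragments, prints 153.8
dna_fmol 1000 50000   # 50-kbp fragments, prints 30.8
```

Under this assumption, the same mass of 50-kbp fragments contains five times fewer molecules than 10-kbp fragments, which is why more DNA is loaded after size selection.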

3.3 Library Prep

To put a sufficient amount of HMW DNA into your library prep reaction, some tricks are necessary. The ligation library prep kit includes the ligation buffer (LNB), which is optimized for non-size-selected DNA. The LNB has a strong dehydrating effect, which strips the hydrating water molecules from DNA to make the ligation reaction more effective. However, the dehydration by LNB is too strong for size-selected DNA and precipitates your DNA, which could reduce the ligation efficiency. Thus, to keep your DNA dissolved in your solution, we reduce the LNB in the ligation reaction by 50%. Instead, we add a 50% amount of the 5x ligation buffer of the NEBNext Quick Ligation Module. In addition, we recommend reducing the amount of the Adapter Mix (AMX-F or AMX-H) by 50%, so that you can make 12 libraries out of a single library prep kit.

1. Set your water/dry bath to 65 °C and take the AMPure XP out of the fridge.
2. Mix reagents as below in a new microcentrifuge tube:

   Size-selected DNA (2.5–3 μg)          50.5 μL
   FFPE Repair Buffer                    3.5 μL
   Ultra II End-Prep Reaction Buffer     3.5 μL
   Ultra II End-Prep Enzyme Mix          1.5 μL
   FFPE Repair Mix                       1.0 μL
   Total                                 60 μL

3. Incubate at RT for 15 min and at 65 °C for 15 min.
4. Thoroughly vortex the AMPure XP.
5. Take the tube out, add 60 μL of AMPure XP, and mix by flicking. Do not vortex or mix by pipetting.
6. Keep flicking for 5 min. (If your DNA gets clogged here, you had better do another round of AMPure XP.)
7. Spin down and pellet the beads on a magnet stand. Wait until the solution becomes completely clear.
8. While waiting, prepare 70% EtOH. Make sure you mix 350 μL of EtOH and 150 μL of nuclease-free water. Do not use your ready-made 70% EtOH stock, or you will lose ~50% of your DNA.
9. Remove the supernatant with the tube on the magnet stand. Be careful not to disturb the pellet.
10. Add 200 μL of 70% EtOH, wait for 10 sec, and remove the supernatant.
11. Repeat step 10.
12. Spin briefly in a microcentrifuge, return the tube to the magnet stand, and carefully remove all the remaining supernatant with a 10-μL pipetter.
13. Add 71 μL of nuclease-free water.
14. Elute your DNA by flicking for 3 min.
15. Set the tube on the magnet stand, let the beads pellet, and transfer the solution to a new microcentrifuge tube.
16. Quantify 1 μL of your DNA solution with Qubit. Ideally, you recover more than 2 μg of end-prepped DNA, which is often enough for two runs. You must have more than 1 μg of end-prepped DNA, which is enough for a single run.
17. Mix reagents as below:

   End-prepped DNA (2–2.5 μg)            70 μL
   NEBNext Quick T4 DNA Ligase           5 μL
   Adapter Mix (AMX-F or AMX-H)          2.5 μL
   NEBNext 5x Ligation Buffer            10 μL
   LNB                                   12.5 μL
   Total                                 100 μL

18. Incubate at RT for 30 min.
19. Thaw LFB and SFB during the incubation.
20. Thoroughly vortex the AMPure XP.
21. Add 40–100 μL of AMPure XP to the solution (see Note 5).
22. Keep flicking for 5 min.
23. Spin down and pellet the beads on the magnet stand.
24. Remove the supernatant, but do not use 70% EtOH to wash the beads. The adapter is bound to a motor protein, and you may remember that EtOH denatures proteins.
25. Take the tube off the magnet stand and suspend the beads in 250 μL of SFB (see Note 6).
26. Spin down, pellet the beads on the magnet stand, and remove the supernatant.
27. Take the tube off the magnet stand and suspend the beads in 250 μL of LFB.
28. Spin down and pellet the beads on the magnet stand. Remove the supernatant.
29. Spin down again, set the tube back on the magnet stand, and carefully remove all the remaining supernatant with a 10-μL pipetter.
30. Elute with 31 μL of elution buffer (EB) for MinION or GridION, or 61 μL of EB for PromethION.
31. Keep flicking for 3–5 min. It may be a good idea to incubate at 37–40 °C for 3 min to help elution.
32. Spin down, pellet the beads on the magnet stand, and transfer the eluate to a new tube.
33. Quantify 1 μL of your library with Qubit (see Note 7).
34. Run sequencing according to the official protocol.

3.4 Draft Assembly

I assume here that you have your fastq files (sequencing and basecalling are not described because there is nothing to add to the official protocol). If all the processes described above worked well, the read length N50 should be 30–50 kbp. In our experience, NECAT [3] is the best assembler for plant genomes. However, if your data amount to less than 25x of the estimated genome size of your species, Flye [11] may produce better results. I do not recommend Canu [13] because it takes too long to finish. I do not recommend Redbean [14] (Ruan and Li, 2019) either; it is very fast, but it often ends up with tens of thousands of contigs even for the genomes of read_list.txt


4. Make a config file by running the command below (see Note 9):
$ necat config config.txt

5. Open config.txt and edit it according to your needs (see Note 10).

6. Run the NECAT pipeline. There are three commands to run NECAT:
$ necat correct config.txt
for error correction of your nanopore reads,
$ necat assemble config.txt
for trimming and assembly, and
$ necat bridge config.txt
for bridging.

7. Check the results in the NECAT/6-bridge_contigs directory (see Note 11):
$ seqkit stats -a bridged_contigs.fasta
It is also good to run the command below (see Note 12):
$ seqkit fx2tab -l -n bridged_contigs.fasta

8. Polish contigs by Racon (long reads) (see Notes 13 and 14). Map the long reads to the assembled contigs:
$ minimap2 -t 48 -x map-ont --secondary=no polished_contigs.fasta nanopore_reads.fastq.gz > alignment.paf
Then run racon:
$ racon -t 48 nanopore_reads.fastq.gz alignment.paf polished_contigs.fasta > racon.fasta

9. Polish contigs by medaka (long reads) (see Note 15):
$ conda activate medaka
$ medaka_consensus -t 48 -d racon.fasta -i nanopore_reads.fastq.gz -o consensus -m r941_min_hac_g507
$ conda deactivate


10. Map the short reads for further polishing by Hypo (see Note 16):
$ cd consensus
$ snap-aligner index consensus.fasta INDEX -t24 (see Note 17)
$ snap-aligner paired INDEX read1.fastq.gz read2.fastq.gz -t 24 -so -F s -F b -o -bam consensus.bam (see Note 18)
$ samtools index -@24 consensus.bam

11. Make a text file containing the paths to the short reads (this assumes that consensus.fasta, read1.fastq.gz, read2.fastq.gz, and consensus.bam are all in your working directory):
$ echo read1.fastq.gz > illumina.txt
$ echo read2.fastq.gz >> illumina.txt

12. Calculate the depth of the short reads. Take the depth of the longest contig:
$ coverm contig -b consensus.bam > illumina_depth.tsv

13. Run Hypo:
$ conda activate hypo
$ hypo -d consensus.fasta -r @illumina.txt -b consensus.bam -s 500m -c 20 -p 48 -t 24 -o hypo1.fasta (see Note 19)
$ conda deactivate

14. Run another round of polishing as follows:
$ snap-aligner index hypo1.fasta INDEX2 -t24
$ snap-aligner paired INDEX2 read1.fastq.gz read2.fastq.gz -t 24 -so -F s -F b -o -bam hypo1.bam
$ conda activate hypo
$ hypo -d hypo1.fasta -r @illumina.txt -b hypo1.bam -s 500m -c 20 -p 48 -t 24 -o hypo2.fasta
$ conda deactivate

15. Map the long reads to the hypo-polished contigs and make a BAM file for purge_haplotigs (see Note 20):
$ minimap2 -t 24 --secondary=no -ax map-ont hypo2.fasta nanopore_reads.fastq.gz \
  | samtools sort -@24 -o hypo2.bam && samtools index hypo2.bam

16. To better understand your assembly results, make a table that contains the contig name, length, and read depth:
$ seqkit fx2tab -l -n hypo2.fasta | sort > length.tsv
$ coverm contig -b hypo2.bam | sort > depth.tsv
$ join length.tsv depth.tsv | tr " " "\t" > length-depth.tsv (see Note 21)
$ rm length.tsv depth.tsv

17. Run purge_haplotigs:
$ conda activate purge_haplotigs
$ purge_haplotigs hist -t 24 -g hypo2.fasta -b hypo2.bam (see Note 22)
Using the output file (.gencov) as an input, run the next command:
$ purge_haplotigs cov -i hypo2.bam.gencov -l 10 -m 30 -h 160 (see Note 23)
The output is coverage_stats.csv, which is used in the final step:
$ purge_haplotigs purge -g hypo2.fasta -c coverage_stats.csv -a 80 -r 500 (see Note 24)

18. To better understand the results of purge_haplotigs, join length-depth.tsv and curated.reassignments.tsv (see Note 24):
$ cat curated.reassignments.tsv | sort > temp.tsv
$ join length-depth.tsv temp.tsv | tr " " "\t" > length-depth-reassignments.tsv
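The sort and join pattern used in steps 16 and 18 is easy to try on toy data before running it on a real assembly. The sketch below uses invented contig names, lengths, and depths (ctg1 to ctg3 are hypothetical) and only standard tools (sort, join, tr). It assumes that both input tables have the contig name in the first column, which is the field that join matches on.

```shell
#!/bin/sh
# Toy illustration of the sort + join pattern; all values are invented.
workdir=$(mktemp -d)

# Contig name and length (listed out of order on purpose, then sorted).
printf 'ctg2\t800\nctg1\t1500\nctg3\t300\n' | sort > "$workdir/length.tsv"
# Contig name and mean read depth.
printf 'ctg3\t950.0\nctg1\t41.2\nctg2\t39.7\n' | sort > "$workdir/depth.tsv"

# join matches lines on the first field and writes space-separated output,
# so tr converts the spaces back to tabs.
join "$workdir/length.tsv" "$workdir/depth.tsv" | tr ' ' '\t' > "$workdir/length-depth.tsv"

cat "$workdir/length-depth.tsv"
```

Both inputs must be sorted on the contig name in the same way, which is why each table is piped through sort before join. The temporary directory can be deleted afterwards.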

3.5 Assembling Organellar Genomes

Since there are tens or hundreds of mitochondria and chloroplasts in a single cell, the contigs derived from organellar genomes have much higher coverage than those of the nuclear genome (there is usually only one nucleus per cell). Although organellar genomes are small and easy to assemble, there are usually too many sequence data for them to be assembled into a single contig. Thus, it is necessary to retrieve only a fraction of the long reads derived from organellar DNA.

1. Retrieve high-coverage contigs from curated.artefacts.fasta. First, make a list of high-coverage contigs from length-depth-reassignments.tsv (if the read depth of the nuclear genome is ~40x, the expected depth of the mitochondrial genome would be 200–800 and that of the chloroplast genome would be >1000):
$ cat length-depth-reassignments.tsv | grep JUNK | awk '$3>200 && $3 mitoCandidates.list.txt
$ cat length-depth-reassignments.tsv | grep JUNK | awk '$3>1000 {print $1}' > cpCandidates.list.txt


2. Retrieve the candidate contigs from curated.artefacts.fasta:
$ cat curated.artefacts.fasta | seqkit grep -f mitoCandidates.list.txt > mitoCandidates.fasta
$ cat curated.artefacts.fasta | seqkit grep -f cpCandidates.list.txt > cpCandidates.fasta

3. BLAST the candidate sequences against any database that includes organellar genomes. Alternatively, if organellar genomes of related species are available, download them as "relative_mt.fasta" and "relative_cp.fasta", and then BLAST them as queries against your candidate contigs as described below.

4. Make the blast database:
$ makeblastdb -in mitoCandidates.fasta -dbtype nucl
$ makeblastdb -in cpCandidates.fasta -dbtype nucl

5. Run blastn:
$ blastn -query relative_mt.fasta -db mitoCandidates.fasta -outfmt 6 -perc_identity 95 -max_hsps 4 > mt.blast.out.tsv (see Note 25)
$ blastn -query relative_cp.fasta -db cpCandidates.fasta -outfmt 6 -perc_identity 95 -max_hsps 4 > cp.blast.out.tsv (see Note 25)

6. Open the blast results and pick the contigs that have an alignment length of >2 kbp with an identity of >98% (the appropriate thresholds depend on the phylogenetic distance between the related species and yours). Because the candidate contigs are the BLAST subjects, their names are in the second column of the output:
$ cat mt.blast.out.tsv | awk '$3>98 && $4>2000 { print $2 }' | sort | uniq > mt.contigs.list.txt
$ cat cp.blast.out.tsv | awk '$3>98 && $4>2000 { print $2 }' | sort | uniq > cp.contigs.list.txt

7. Recover the long reads mapped to the mt.contigs and cp.contigs from hypo2.bam:
$ cat mt.contigs.list.txt | while read line; do samtools view -bh hypo2.bam $line | samtools bam2fq; done | pigz > mt.fastq.gz
$ cat cp.contigs.list.txt | while read line; do samtools view -bh hypo2.bam $line | samtools bam2fq; done | pigz > cp.fastq.gz

8. Check the total length of the retrieved reads:
$ seqkit stats -a mt.fastq.gz cp.fastq.gz (see Note 26)


9. Reduce the long reads down to 30–50x of the expected size of the organellar genomes:
$ seqkit sample -p 0.1 mt.fastq.gz | seqkit seq -m 30000 -M 100000 > mt4asm.fastq (see Note 27)
$ seqkit sample -p 0.01 cp.fastq.gz | seqkit seq -m 30000 -M 60000 > cp4asm.fastq (see Note 27)

10. Assemble (see Note 28):
$ flye -t 24 --nano-raw mt4asm.fastq -m 10000 -o MT (see Note 29)
$ flye -t 24 --nano-raw cp4asm.fastq -m 10000 -o CP (see Note 29)
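The -p values passed to seqkit sample in step 9 depend on how much data was retrieved in step 8. As a rough guide, the fraction can be estimated as target coverage × expected genome size / total bases. The sketch below is only that arithmetic. The function name sample_fraction and the example numbers are hypothetical, and because step 9 also filters reads by length, the coverage actually retained will be lower than the target.

```shell
#!/bin/sh
# sample_fraction TOTAL_BASES GENOME_SIZE TARGET_COVERAGE  (hypothetical helper)
# Prints the fraction of reads to keep so that the retained reads give
# roughly TARGET_COVERAGE x of GENOME_SIZE; capped at 1 (keep everything).
sample_fraction() {
    awk -v total="$1" -v size="$2" -v cov="$3" 'BEGIN {
        p = cov * size / total
        if (p > 1) p = 1
        printf "%.4f\n", p
    }'
}

# Example: 2 Gbp of retrieved reads, an assumed 400-kbp mitochondrial
# genome, and a 40x target.
sample_fraction 2000000000 400000 40   # prints 0.0080
```

The total number of bases corresponds to the sum_len column reported by seqkit stats.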

3.6 Re-bridge the Nuclear Genome

If your plant is an outcrossing species, the genome is highly heterozygous and hard to assemble. Suppose genomic regions A, B, and C lie on a chromosome in this order, where A and C are homozygous and B is heterozygous (Fig. 1). After “necat assemble”, A and C are each assembled into a single contig, whereas B is assembled into B, B′-1, and B′-2. When you perform “necat bridge” on these contigs, it may bridge A with B and B′-2 with C. This situation is a dead end of bridging, and purge_haplotigs can only remove B′-1. The overlap of B and B′-2 cannot be removed because most of each bridged contig is homozygous. Thus, to avoid such an unfortunate situation, I usually run purge_haplotigs before bridging. If you remove B′-1 and B′-2 beforehand, the bridging step simply connects A with B and B with C, resulting in a single contig. The steps to do so are as follows:

necat assemble -> medaka -> purge_haplotigs -> necat bridge -> hypo

Fig. 1 Left: undesirable case of bridging on a heterozygous genome. Right: performing purge_haplotigs before bridging may lead to a single contig


1. Run necat assemble:
$ necat assemble config.txt

2. Make a copy of the output directory (4-fsa) and move into it:
$ cp -r 4-fsa original_4-fsa
$ cd original_4-fsa

3. Run medaka to split suspected misassembly sites:
$ conda activate medaka
$ medaka_consensus -t 24 -i nanopore_reads.fastq.gz -d contigs.fasta -o consensus -m r941_min_hac_g507 -g
$ conda deactivate

4. Run purge_haplotigs:
$ cd consensus
$ minimap2 -t 24 -ax map-ont --secondary=no consensus.fasta nanopore_reads.fastq.gz | samtools sort -@24 -o consensus.bam && samtools index -@24 consensus.bam
$ seqkit fx2tab -ln consensus.fasta | sort > length.tsv
$ coverm contig -b consensus.bam -t 24 | sort > depth.tsv
$ join length.tsv depth.tsv | tr " " "\t" > length_depth.tsv
$ conda activate purge_haplotigs
$ purge_haplotigs hist -t 24 -g consensus.fasta -b consensus.bam
$ purge_haplotigs cov -i consensus.bam.gencov -l 5 -m 30 -h 160
$ purge_haplotigs purge -g consensus.fasta -c coverage_stats.csv -t 24

5. Rescue REPEAT contigs with a depth of 75–150% of the mean depth:
$ cat curated.reassignments.tsv | sort > temp.reassignments.tsv
$ join length_depth.tsv temp.reassignments.tsv | tr " " "\t" > reassignments.tsv
$ rm length.tsv depth.tsv length_depth.tsv temp.reassignments.tsv
$ cat reassignments.tsv | grep "REPEAT" | awk '$3 > 30 && $3 < 60 {print $1}' > rescue.list.txt (if the mean depth is 40)
$ cat curated.haplotigs.fasta | seqkit grep -f rescue.list.txt > rescued.fasta


6. Merge curated.fasta, rescued.fasta, mt.fasta, and cp.fasta: $ cat curated.fasta rescued.fasta mt.fasta cp.fasta > for_bridging.fasta

7. Replace 4-fsa/contigs.fasta with for_bridging.fasta: $ cp for_bridging.fasta ../../4-fsa/contigs.fasta (see Note 30)

8. Bridge: $ necat bridge config.txt

Then run hypo just as described above. That is all. If you have Hi-C data or Bionano data, proceed to scaffolding. Genetic linkage maps are also useful to reconstruct pseudomolecules of chromosomes.
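The depth window used in step 5 (30 to 60 for a mean depth of 40) follows from the 75–150% rule, so it can be derived from the mean depth instead of being typed by hand. The following is a minimal sketch and not part of the original protocol: the mean depth value and the four-column toy table (contig ID, length, depth, status) are assumptions made for illustration, and the real reassignments.tsv carries more columns.

```shell
# Assumed mean depth of the nuclear genome (hypothetical value; see Note 24).
mean_depth=40

# Derive the 75%-150% window from the mean depth.
lower=$(awk -v m="$mean_depth" 'BEGIN { print m * 0.75 }')
upper=$(awk -v m="$mean_depth" 'BEGIN { print m * 1.50 }')

# Toy stand-in for reassignments.tsv: contig ID, length, depth, status.
toy=$(printf '%s\n' \
  'ctg1 500000 38 REPEAT' \
  'ctg2 80000 12 REPEAT' \
  'ctg3 120000 45 HAPLOTIG' \
  'ctg4 60000 95 REPEAT' | tr ' ' '\t')

# Keep REPEAT contigs whose depth (column 3) falls inside the window.
rescue_list=$(printf '%s\n' "$toy" | grep "REPEAT" |
  awk -v lo="$lower" -v hi="$upper" '$3 > lo && $3 < hi { print $1 }')

echo "depth window: ${lower}-${upper}"
echo "$rescue_list"
```

In this toy table only ctg1 is rescued: ctg2 and ctg4 fall outside the window, and ctg3 is not a REPEAT contig.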

4 Notes
1. If your samples are clogged in this step, it is better to squash them with the dispensing spoon.
2. If you have lots of DNA, the dripping speed will be >10 sec/drip. If the dripping speed is nanopore_reads.fastq.gz
$ ls nanopore_reads.fastq.gz > read_list.txt

File extensions of your fastq files must be .fastq or .fastq.gz. .fq or .fq.gz are not allowed.
9. You may see a warning regarding perl, but ignore it. Now you see a file named “config.txt” (the file extension must be .txt).
10. The config file of NECAT should look as below:
PROJECT=
ONT_READ_LIST=
GENOME_SIZE=
THREADS=4
MIN_READ_LENGTH=3000
PREP_OUTPUT_COVERAGE=40
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=2
CNS_OUTPUT_COVERAGE=30
CLEANUP=1
USE_GRID=false
GRID_NODE=0
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true

Basically, all you need to edit is the first four lines. For example:
PROJECT=NECAT
ONT_READ_LIST=read_list.txt
GENOME_SIZE=500000000
THREADS=24


PROJECT specifies the name of the directory for the output of NECAT, ONT_READ_LIST specifies the path to the read_list.txt file, and GENOME_SIZE specifies the estimated genome size. Abbreviations like 500 m or 3 g are not acceptable. Be careful not to put the wrong number of zeros. THREADS specifies the number of CPUs you use for the assembly.
11. When NECAT is finished, there is an output directory with the name you specified in the config file. The contents of the output directory are as below:
1-consensus/
2-trim_bases/
3-assembly/
4-fsa/
5-align_contigs/
6-bridge_contigs/
scripts/
trimReads.fasta.gz

The assembled fasta files are in 4-fsa and 6-bridge_contigs. The other directories contain intermediate files and are thus not important for most users. trimReads.fasta.gz is a fasta file containing error-corrected and trimmed reads, which is not necessary in the later processes either. The 4-fsa directory contains “contigs.fasta,” which is the final output of the necat assemble process. 6-bridge_contigs contains “bridged_contigs.fasta,” the output contig sequences after bridging. There may be another fasta file named “polished_contigs.fasta” if you set “POLISH_CONTIGS=true” in the config file.
12. The output of this command is a table of contig IDs and contig lengths. The header lines (including the contig ID) also contain information about which contigs of contigs.fasta were bridged to make each bridged contig. This is important because most of the misassemblies (if any) by NECAT occur in the bridging process. When you suspect a misassembly in later analyses, you can often resolve it by replacing the bridged contig with the unbridged contigs from contigs.fasta.
13. Though the accuracy of nanopore sequencing has been greatly improving, basecalling of plant DNA sequences still has a higher error rate than that of animal or bacterial DNA. So, the assembled contigs also contain some errors, even after polishing by NECAT's built-in polishing process. Thus, three additional polishings are recommended: racon (with long reads) [6], medaka (with long reads), and hypo (with Illumina short reads) [9].
14. racon.fasta is the output of racon, which may have fewer contigs because it discards junk contigs to which no long reads are mapped (zero coverage).


15. The -m option specifies the flow cell version (r941 or r104), platform (min or prom), mode of basecalling (fast, hac, or sup), and version of guppy (g360 or g507).
16. The outputs of medaka are now considerably good and usually reach a BUSCO [15] score of >96%. However, if you use your assembly as a reference for SNP analyses, you may need further polishing with short reads. A 20x coverage of short reads is sufficient. We recommend SNAP-aligner [8] for read mapping and Hypo for calling the consensus, because they are much faster than standard tools such as bwa, bowtie2, or pilon.
17. -t specifies the number of CPUs to use. No space is allowed after the “-t”.
18. -t for number of CPUs, -so for position sort, -F s for filtering MAPQ>10, -F b for filtering properly mapped pairs, and -o -bam consensus.bam for output in bam format, named “consensus.bam.” After read mapping is finished, index your bam with samtools [5].
19. -d for the fasta file to be polished, -r for illumina.txt (@ is necessary when you have paired reads), -b for bam file, -s for genome size, -c for read depth, -p for number of sequences processed at once, -t for number of CPUs, and -o for output.
20. A reference genome sequence is usually a complete set of a haploid genome, but heterozygous sites are often assembled into separate contigs. Such contigs are called haplotigs, which are better removed from the reference sequence. There are several tools for this purpose, but we recommend Purge_haplotigs [7]. Purge_dups [16] is also popular, but we consider it unsuitable for plant genomes as it removes almost all the repeat sequences including transposable elements.
21. The tr command replaces characters. In this case we are replacing spaces with tabs.
22. By now you will recognize that -t is for the number of CPUs, -g for the fasta file, and -b for the bam file. This command creates two files with extensions of .gencov and .png, respectively. The png file is a histogram of coverage, which shows two peaks if your genome is highly heterozygous. If your plant is a selfer, the histogram shows only one peak.
23. -i for the .gencov file, -l for your threshold of low-coverage contigs (which I usually set to 50% of haplotig coverage), -m for middle coverage (the midpoint between the haplotig coverage and the diplotig coverage, or 75% of the diplotig coverage if you do not see a clear haplotig peak), and -h for high coverage (200% of diplotig coverage). You may also refer to length_depth.tsv to choose the parameters.


The command creates four files:
• curated.reassignments.tsv: tab-delimited file with contig ID, % of sequences aligned to other contigs in the original fasta file, % of total length to which the query contig is aligned, top 2 homologous contigs, and the status (KEEP, HAPLOTIG, REPEAT, or JUNK; see below).
• curated.fasta: fasta file containing KEEP contigs, from which haplotigs, repetitive contigs, and junk (contigs with high or low coverage) have been removed.
• curated.haplotigs.fasta: fasta file containing HAPLOTIG and REPEAT contigs.
• curated.artefacts.fasta: fasta file containing JUNK contigs.
24. Now open the joined length-depth-reassignments table (reassignments.tsv) and sort it by length; you will see that the longest contigs have similar read depths, which represent the basic read depth of the nuclear genome. When you sort by depth, you see some contigs with ~10 times higher coverage and a few with >100 times higher coverage than the basic coverage. These high-coverage contigs are mostly (not all) of mitochondrion and chloroplast, respectively.
25. -query for the fasta file containing query sequences, -db for the target database, -outfmt for output formatting (“6” is for tabular output), -perc_identity for the minimum percent identity (%), and -max_hsps for limiting the number of alignments per query.
26. The sizes of chloroplast and mitochondrial genomes are typically 150 kbp and 450 kbp, respectively, so the recovered data is definitely too much. You need only 30–50x coverage of the organellar genomes. It is also better to remove reads that are too long, which are often chimeric reads.
27. In the “seqkit sample” command, -p specifies the proportion of the input fastq to extract as output. In the “seqkit seq” command, -m is for minimum sequence length and -M for maximum length.
28. Flye is a good choice for assembling organellar genomes, but NECAT is sometimes better.
29. -t for number of CPUs, --nano-raw for input file as raw nanopore reads, -m for minimum overlap between the reads, and -o for output directory. If you do not obtain a single contig, consider using Trycycler (https://github.com/rrwick/Trycycler). It is also a good idea to polish the assembly by medaka and hypo.
30. The name of the replaced fasta must be contigs.fasta.
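The rules of thumb in Note 23 for the -l, -m, and -h thresholds reduce to simple arithmetic on the two peak positions read from the coverage histogram. The sketch below only illustrates that arithmetic: the peak values are hypothetical, the variable names are my own, and the final command is printed rather than executed so that the numbers can be checked first.

```shell
# Peak positions read by eye from the coverage histogram (hypothetical values).
haplotig_peak=20
diplotig_peak=40

# -l: 50% of the haplotig coverage.
low=$(awk -v h="$haplotig_peak" 'BEGIN { print int(h * 0.5) }')
# -m: midpoint between the haplotig and diplotig peaks.
mid=$(awk -v h="$haplotig_peak" -v d="$diplotig_peak" 'BEGIN { print int((h + d) / 2) }')
# -h: 200% of the diplotig coverage.
high=$(awk -v d="$diplotig_peak" 'BEGIN { print int(d * 2) }')

# Print the resulting command for inspection instead of running it.
echo "purge_haplotigs cov -i consensus.bam.gencov -l $low -m $mid -h $high"
```

If the histogram shows no clear haplotig peak, Note 23 suggests setting -m to 75% of the diplotig coverage instead of the midpoint.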


References
1. Deamer D, Akeson M, Branton D (2016) Three decades of nanopore sequencing. Nat Biotechnol 34(5):518–524. https://doi.org/10.1038/nbt.3423
2. Dumschott K, Schmidt MH-W, Chawla HS, Snowdon R, Usadel B (2020) Oxford Nanopore sequencing: new opportunities for plant genomics? J Exp Bot 71(18):5313–5322. https://doi.org/10.1093/jxb/eraa263
3. Chen Y, Nie F, Xie S-Q, Zheng Y-F, Dai Q, Bray T, Wang Y-X, Xing J-F, Huang Z-J, Wang D-P, He L-J, Luo F, Wang J-X, Liu Y-Z, Xiao C-L (2021) Efficient assembly of nanopore reads via highly accurate and intact error correction. Nat Commun 12(1):60. https://doi.org/10.1038/s41467-020-20236-7
4. Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094–3100. https://doi.org/10.1093/bioinformatics/bty191
5. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Subgroup GPDP (2009) The sequence alignment/map format and samtools. Bioinformatics 25(16):2078–2079. https://doi.org/10.1093/bioinformatics/btp352
6. Vaser R, Sović I, Nagarajan N, Šikić M (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27(5):737–746. https://doi.org/10.1101/gr.214270.116
7. Roach MJ, Schmidt SA, Borneman AR (2018) Purge Haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinformatics 19(1):460. https://doi.org/10.1186/s12859-018-2485-7
8. Bolosky WJ, Subramaniyan A, Zaharia M, Pandya R, Sittler T, Patterson D (2021) Fuzzy set intersection based paired-end short-read alignment. bioRxiv 2021.11.23.469039. https://doi.org/10.1101/2021.11.23.469039
9. Kundu R, Casey J, Sung W-K (2019) HyPo: super fast & accurate polisher for long read genome assemblies. bioRxiv 2019.12.19.882506. https://doi.org/10.1101/2019.12.19.882506
10. Shen W, Le S, Li Y, Hu F (2016) SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11(10):e0163962. https://doi.org/10.1371/journal.pone.0163962
11. Kolmogorov M, Yuan J, Lin Y, Pevzner PA (2019) Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37(5):540–546. https://doi.org/10.1038/s41587-019-0072-8
12. Workman R, Timp W, Fedak R, Kilburn D, Hao S, Liu K (2018) High molecular weight DNA extraction from recalcitrant plant species for third generation sequencing. Protocol Exchange. https://doi.org/10.1038/protex.2018.059
13. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27(5):722–736. https://doi.org/10.1101/gr.215087.116
14. Ruan J, Li H (2020) Fast and accurate long-read assembly with wtdbg2. Nat Methods 17(2):155–158. https://doi.org/10.1038/s41592-019-0669-3
15. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM (2021) BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol 38(10):4647–4654. https://doi.org/10.1093/molbev/msab199
16. Guan D, McCarthy SA, Wood J, Howe K, Wang Y, Durbin R (2020) Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics 36(9):2896–2898. https://doi.org/10.1093/bioinformatics/btaa025

Chapter 6
Detection of DNA Modification Using Nanopore Sequencers
Yoshikazu Furuta

Abstract
DNA modification is a crucial factor of epigenetic modification and has vital functions for gene regulation and phenotype control. A profound understanding of DNA modification requires precise mapping of the modified bases on genomic DNA. In addition to methods such as bisulfite sequencing and single-molecule real-time (SMRT) sequencing of PacBio sequencers, nanopore sequencers can also be utilized for the detection of DNA modification. Here, I will briefly review the three methods for the detection of DNA modification with nanopore sequencers and introduce a protocol using MinION and Megalodon.

Key words DNA modification, Epigenetics, Basecalling

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_6, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction
The bases of nucleic acids are adenine, guanine, cytosine, and thymine (uracil for RNA), which are known to be often chemically modified. Base modifications are investigated as a part of the epigenome and are known to have important roles in the regulation of gene expression and in many phenotypes, including human diseases [1, 2]. Modified bases such as 5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) in DNA and N6-methyladenine (6mA) in RNA are mainly investigated for eukaryotes, while 5mC, N4-methylcytosine (4mC), and 6mA in DNA are mainly studied for prokaryotes. In this chapter, I focus on the detection and mapping of these DNA modifications.

One of the most fundamental analyses in studies of DNA modification is to map the modified bases on the target genome (Fig. 1). Mass spectrometry can be utilized to detect and quantify modified bases, but it cannot map their positions because the target genome is digested into monomers before the quantification (Fig. 1a). For mapping modified bases, two methods are mainly employed: bisulfite sequencing [3, 4] and SMRT sequencing [5].

Fig. 1 Methods for detection of DNA modifications. Red letters represent modified bases. (a) Mass spectrometry. (b) Bisulfite sequencing. (c) SMRT sequencing. (d) Nanopore sequencing. The figure of the nanopore was taken from TogoTV (https://togotv.dbcls.jp/). See the main text for detail

Bisulfite sequencing is used for the detection of 5mC and 5hmC on the target genome by bisulfite treatment (Fig. 1b). The treatment converts unmodified cytosine to uracil but does not convert 5mC or 5hmC. As unmodified cytosines are sequenced as thymine, a comparison of the sequence reads with a reference sequence enables mapping of the 5mC and 5hmC. SMRT sequencing reads the sequence of DNA by measuring, in real time, the incorporation of fluorescently labeled dNTP substrates by DNA polymerase (Fig. 1c). When a base in the template DNA is modified, incorporation of the complementary dNTP takes longer. Thus, modified bases in the template DNA can be detected by measuring the delay in dNTP incorporation, which is termed the interpulse duration. Bisulfite sequencing and SMRT sequencing can map modified bases at single-nucleotide resolution, but it should be noted that the types of detectable modified bases are different. 5mC and 5hmC are the targets for bisulfite sequencing, while SMRT sequencing mainly targets 6mA and 4mC (5mC can also be detected but requires much higher coverage).

In addition to the two sequencing methods, nanopore sequencers developed by Oxford Nanopore Technologies are now another choice for the detection of modified bases (Fig. 1d). Nanopore sequencers read the sequence of nucleic acids by recording the fluctuation of ionic current across the nanopore while nucleotides pass through the nanopore protein. Modified bases in the nucleic acids change the intensity of the signal, and detection of such changes enables the detection and mapping of modified bases. Many tools have been developed for mapping modified bases from the output of nanopore sequencers by detecting the difference in the signals between unmodified and modified bases (Table 1). As such differences in the signals can be a cause of miscalling of the sequence, understanding the basis of the effect of modified bases is also important for improving the basecalling accuracy of nanopore sequencers. Tools for the detection of DNA modification from nanopore reads can currently be classified into three groups by their algorithms: the comparison method, the model-based method, and the expanded basecalling method (Fig. 2).

Table 1 Tools for DNA modification detection using nanopore sequencing

Name          References   Comment
Comparison method
Tombo         [6]          Formerly known as Nanoraw. Also implements an option to use the model method
NanoMod       [7]          Comprehensive explanation of resquiggle in the reference
nanodisco     [8]          Specialized for bacterial methylome analysis
ELIGOS        [9]          Utilizes the change in the error rate by nucleotide modification
Model-based method
Nanopolish    [10]         Hidden Markov model
SignalAlign   [11]         Hidden Markov model + hierarchical Dirichlet process
mCaller       [12]         Neural network
DeepSignal    [13]         Bidirectional recurrent neural network
DeepMod       [14]         Recurrent neural network + long short-term memory
Expanded basecalling method
Guppy         –            Implemented in or later than v3.2.1
Megalodon     –            Requires Guppy later than v4.0

Fig. 2 Methods for detection of DNA modifications using nanopore sequencers. (a) Comparison method. (b) Model-based method. (c) Expanded basecalling method. Bases colored in red represent modified bases. The ionic current signal is simplified as a single-digit number. See the main text for detail

2 Comparison Method
To detect changes in the raw signals of nanopore reads caused by DNA modification, an intuitive strategy is to sequence both the sample of interest and the same sample without modifications, map both of them to a reference, and compare their raw signals at each corresponding position in the reference (Fig. 2a). An unmodified control sample can be prepared by PCR, by whole genome amplification, or by isolating DNA from a strain in which the gene responsible for the DNA modifications of interest is deleted. Tools that adopt this method, which I call the comparison method, first align reads and their corresponding raw signals against a reference sequence to assign raw signals to the corresponding positions in the reference [6–9]. Then, for each position in the reference, the level and distribution of the assigned raw signals are compared between the sample and the unmodified control by a statistical test such as the Kolmogorov-Smirnov test [6]. Positions in the reference with statistically significant differences in the signals are detected as modified.

The comparison method can be employed for the detection of any type of DNA modification if DNA samples with and without base modifications can be prepared, but a limitation of this method is that it is difficult to determine the type of DNA modification solely from the result. The method is therefore appropriate for DNA samples whose type of modification is already known, either from prior knowledge or from other measurement methods such as mass spectrometry. For example, human genomic DNA is a suitable target of the method as most of the detected DNA modification is likely 5mC. Another thing to note is that the method cannot detect modified bases per read, as it compares the distribution of the raw signals between the sample and the unmodified control at each position of the reference after mapping the reads.
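To make the per-position comparison concrete, the following sketch computes the two-sample Kolmogorov-Smirnov statistic D (the largest gap between the two empirical distribution functions) for one reference position. It is an illustration only and not how Tombo or any other tool is implemented: the signal values are invented single-digit numbers in the spirit of Fig. 2, and only the statistic is computed, not a p-value.

```shell
# Invented signal levels assigned to one reference position (hypothetical data).
native="5 6 6 7 7 7 8"     # sample of interest
control="3 4 4 5 5 5 6"    # unmodified control

d_stat=$(awk -v a="$native" -v b="$control" 'BEGIN {
  n = split(a, x, " ")
  m = split(b, y, " ")
  # Collect every observed value; the ECDFs only change at these points.
  for (i = 1; i <= n; i++) pts[x[i]] = 1
  for (j = 1; j <= m; j++) pts[y[j]] = 1
  d = 0
  for (p in pts) {
    fa = 0; for (i = 1; i <= n; i++) if (x[i] + 0 <= p + 0) fa++
    fb = 0; for (j = 1; j <= m; j++) if (y[j] + 0 <= p + 0) fb++
    diff = fa / n - fb / m
    if (diff < 0) diff = -diff
    if (diff > d) d = diff   # keep the largest ECDF gap
  }
  printf "%.3f\n", d
}')

echo "KS statistic D = $d_stat"
```

A larger D indicates that the signal distribution of the sample differs more strongly from that of the control at this position; a real analysis would convert D into a p-value and correct for testing many positions.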

3 Model-Based Method
We can detect DNA modification from raw signals of nanopore sequencing if we know the rules of how DNA modification affects the intensity of raw signals. The model-based method first learns such rules by constructing pre-trained supervised learning models, then detects and maps the modified bases for the specific DNA modification by applying the obtained sequencing data to the model (Fig. 2b). While the comparison method requires raw signals of both the sample and the unmodified control, the model-based method requires only the signals of the modified sample if the model is ready.

For the construction of the pre-trained supervised learning models, nanopore sequencing data of DNA samples together with ground truth data of the positions of DNA modification are required. Such ground truth data can be prepared for 5mC by bisulfite sequencing and for 6mA by SMRT sequencing. Raw signals of nanopore reads and the ground truth data of positions of DNA modification are used as training data for the model construction. Various algorithms such as hidden Markov models and bidirectional recurrent neural networks have been used for the model construction, following the development of the latest algorithms in the field of machine learning [10–14]. With the model-based method, the probability of DNA modification at each nucleotide in each read can be calculated; thus, it is possible to analyze the modification per read.

To employ the model-based method, users must check whether an appropriate pre-trained model is available. Most of the tools provide a pre-trained model prepared by the developers, which was used as proof of the performance of the tool, but it should be noted that each prepared model uses a different set of data and algorithms for the supervised learning procedure. As most of the available models are pre-trained for well-studied types of DNA modification and for the data produced by widely used devices and versions of the flow cell, it would be difficult to find a model that can detect minor types of DNA modification or analyze raw signals of the latest version of the flow cell in its early stages after release. If appropriate models are not available, users should construct models themselves or use other methods.
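Because the model-based method yields a probability for each nucleotide in each read, per-read calls can be summarized into a per-position modification fraction. The sketch below shows one simple way to do that; the three-column layout (read ID, reference position, probability), the values, and the 0.5 cutoff are all assumptions for illustration, and individual tools apply their own formats and thresholds.

```shell
# Hypothetical probability cutoff for calling a base modified in a read.
cutoff=0.5

# Toy per-read calls: read ID, reference position, modification probability.
site_table=$(printf '%s\n' \
  'read1 100 0.95' \
  'read2 100 0.91' \
  'read3 100 0.10' \
  'read1 250 0.05' \
  'read2 250 0.20' |
  awk -v cut="$cutoff" '{
    cov[$2]++                 # reads covering this position
    mod[$2] += ($3 >= cut)    # reads called modified at this position
  }
  END {
    for (pos in cov)
      printf "%s\t%d\t%d\t%.1f\n", pos, cov[pos], mod[pos], 100 * mod[pos] / cov[pos]
  }' | sort -n)

# Columns: position, coverage, modified reads, modified fraction (%).
printf '%s\n' "$site_table"
```

Here position 100 is modified in two of three reads (66.7%), whereas position 250 is unmodified in both reads that cover it.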

4 Expanded Basecalling Method
The third method, the expanded basecalling method, detects specific modifications at the stage of basecalling (Fig. 2c). Unlike the model-based method, it directly converts raw signals to nucleotide bases together with the probability of base modification at the stage of basecalling. The Guppy basecaller v3.2.1 or above implements models for basecalling that can call the probability of 5mC and 6mA in addition to the canonical four bases. Following the basecalling, reads with modification information can be mapped to a reference, and the probability of the modification at each nucleotide position on the genome can be calculated. Although basecalling by Guppy only adds the modification information to the Fast5 files, Megalodon can conduct the whole procedure from the basecalling through the detection of the DNA modification in the reference.

As the expanded basecalling method also uses a pre-trained model for the basecalling, users must check whether the available model is suitable for the detection of the targeted type of DNA modification from their data. The model implemented with Guppy was trained with human genome data for 5mC and E. coli sequence data for 5mC and 6mA. Both data sets were produced using MinION with Flow Cell R9.4; thus, it is difficult to apply the model to data obtained from other devices or versions of flow cells, or to the detection of other types of modifications. Some of the latest models can be found at the GitHub repository of Rerio (https://github.com/nanoporetech/rerio) or elsewhere; these are not yet bundled with Guppy but are suitable for the detection of other types of DNA modification such as 5hmC or for data sequenced with other devices such as PromethION.

Which method should be chosen among the three? If users can find a model that matches the target type of DNA modification and the device used, the first choice would be the model-based method or the expanded basecalling method. If such a model is not available, the comparison method or the construction of a pre-trained model should be considered. It is also recommended to use multiple tools for the same sample to obtain robust results [15]. In the following section, I introduce an example of the detection of bacterial DNA modification using Megalodon, a tool of the expanded basecalling method. The model for the detection of 6mA and 5mC is already available for reads of MinION Flow Cell R9.4.1; thus, the protocol can be applied to typical analyses of bacterial epigenetics.

5 Materials
1. QIAamp PowerFecal Pro DNA Kit (QIAGEN).
2. Ligation Sequencing Kit (Oxford Nanopore Technologies).
3. MinION (Oxford Nanopore Technologies).
4. MinION Flow Cell (Oxford Nanopore Technologies).
5. Workstation with GPU. One without a GPU also works but requires a longer calculation time. An example of specifications is as follows:
– OS: Ubuntu 16.04 LTS
– GPU: NVIDIA RTX 2080 Ti
– CPU: Intel Core i7-9800X 3.80 GHz 8 core
6. The following tools were installed:
– Guppy v4.2.2 (GPU), pyguppy v4.2.2.
– Megalodon 2.3.3.
– SeqKit v0.15.0 [16].
– bedtools v2.28.0 [17].

6 Methods
1. Extract genomic DNA of the bacterial strain of interest from a culture using QIAamp PowerFecal Pro DNA Kit (see Note 1).
2. Construct a sequencing library using the Ligation Sequencing Kit and sequence the library using a MinION R9.4.1 Flow Cell (see Note 2).
3. Install tools on your workstation. Guppy can be installed using an installer available at the Nanopore Community (https://community.nanoporetech.com). Choose either the CPU or the GPU version, depending on the specifications of the workstation in use. The GPU version is used in this method. Pyguppy can be installed using pip as follows. Note that the version numbers of Guppy and Pyguppy must be the same. For example, to install Pyguppy v4.2.2, run the pip command with the version number specified:
$ pip install ont_pyguppy_client_lib==4.2.2
Megalodon and Seqkit can be installed using Conda (see Note 3):
$ conda install megalodon
$ conda install -c bioconda seqkit
Bedtools can be installed using apt-get in Ubuntu (see Notes 4 and 5):
$ apt-get install bedtools
4. Find a model for basecalling with modified bases (see Note 6). Here I use the latest models for Guppy uploaded at the GitHub repository of Rerio (https://github.com/nanoporetech/rerio). Find res_dna_r941_min_modbases-all-context_v001 in the list and download it. This is a model suitable for 6mA and 5mC detection in any context, which means it is not limited to the analysis of specific organisms such as human and E. coli. Note that models with “modbases” in their file name are for calling modified bases.
5. Run Megalodon (see Note 7). It runs basecalling for modified bases using Guppy, maps the reads to the reference, then calculates the fraction of modification at each base of the reference. When your computer is not equipped with a GPU, add the “--processes” option with the number of CPU cores and delete the “--devices” option. Do not forget to add the “--haploid” option when analyzing a bacterial genome:
$ megalodon fast5_directory --outputs basecalls mappings mod_mappings mods \
--reference path_to_reference_file \
--devices "cuda:0" \
--guppy-server-path path_to_guppy/bin/guppy_basecall_server \
--guppy-params "-d path_to_Rerio_model_folder" \
--guppy-config res_dna_r941_min_modbases-all-context_v001.cfg \
--overwrite \
--haploid

6. When Megalodon runs successfully, it outputs files in the “megalodon_results” directory. Among them, the two BED files, modified_bases.5mC.bed and modified_bases.6mA.bed, contain the fraction of each modified base at each nucleotide of the reference. The first column is the name of the reference, the third column is the position of the nucleotide on the reference, the sixth column shows the strand, and the eleventh shows the fraction of the modified base in percentage. The line in the following example shows the result at the positive strand of the 14874th nucleotide on the chromosome, which is 100% methylated:
$ less ./megalodon_results/modified_bases.5mC.bed
chromosome 14873 14874 . 18 + 14873 14874 0,0,0 18 100.0

7. The details of the calculation of the fraction above can be checked with per_read_modified_base_calls.db in the same result folder. The calculated probability of each nucleotide in each read is listed in the file. This .db file is an SQLite database; thus, it must be converted to a text file before browsing. Note that the file size of the output is huge:
$ megalodon_extras per_read_text modified_bases ./megalodon_results

8. One of the main purposes of bacterial epigenetic analysis is to detect the target motif of DNA methyltransferases. For this purpose, after calling the methylated nucleotides on the genome, sequences around the methylated bases are collected and applied to motif search. First, prepare a tab-separated file listing the entry names and the sequence lengths in the reference file. Seqkit can be used as follows. Note that the entry names must match those in the BED files produced by Megalodon:
$ seqkit fx2tab ../reference/reference.fasta -l -n > ../reference/ref_length.txt

Next, extract the sequence of the modified nucleotide and its flanking regions using the reference file and the BED file produced by Megalodon (see Note 8). Here I extracted 20-bp flanking sequences for the modified nucleotide positions with a coverage of 20 or more and a modified read fraction of 100%. These thresholds can be tweaked based on the purpose of your experiment:
$ awk '$10>=20 && $11==100' ./modified_bases.6mA.bed | bedtools slop -b 20 -i - -g ../reference/ref_length.txt | bedtools getfasta -s -fi ../reference/reference.fasta -bed - > modified_bases.6mA_flankseq.fasta

Then, apply the output Fasta file to MEME [18] for motif search. It can be submitted to the web server of MEME (https://meme-suite.org/meme/tools/meme), or you can run the command as follows if MEME is installed in your environment:
$ meme modified_bases.6mA_flankseq.fasta -dna -oc . -nostatus -mod zoops -nmotifs 5 -minw 4 -maxw 20 -objfun classic -markov_order 0 -o ./meme/6mA

An example of the output of MEME is shown in Fig. 3, which is a result of the analysis of the Bacillus cereus ATCC 14579 strain [19]. In this case, motifs of 5′-GATC-3′/5′-GATC-3′ and 5′-RGCGWT-3′/5′-AWCGCY-3′ were detected, and cytosines in these motifs were likely methylated. A third, long motif was also detected, but this is probably a false detection of a frequent motif in bacterial genomes such as the Shine-Dalgarno sequence.
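The column layout described in step 6 and the filter used in step 8 can be tried on a few lines without running Megalodon. In the sketch below, the first record is the example line from step 6, while the other two records are invented for illustration; the fields are assumed to be tab-separated, as in a BED file.

```shell
# Three BED records: the first is the example from step 6, the others are made up.
bed=$(printf '%s\n' \
  'chromosome 14873 14874 . 18 + 14873 14874 0,0,0 18 100.0' \
  'chromosome 20010 20011 . 35 - 20010 20011 0,0,0 35 100.0' \
  'chromosome 30500 30501 . 40 + 30500 30501 0,0,0 40 62.5' | tr ' ' '\t')

# Same condition as in step 8: coverage (column 10) of at least 20 and
# modified fraction (column 11) equal to 100%.
# Print reference name, position, strand, coverage, and fraction.
passed=$(printf '%s\n' "$bed" |
  awk 'BEGIN { FS = OFS = "\t" } $10 >= 20 && $11 == 100 { print $1, $3, $6, $10, $11 }')

printf '%s\n' "$passed"
```

Only the second record passes: the example line from step 6 is 100% methylated but has a coverage of 18, and the third record has sufficient coverage but a fraction below 100%.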


Fig. 3 An example of the result of motif search. Detected motifs for 5mC modifications in Bacillus cereus ATCC 14579 are shown. The two motifs at the top are likely to be true target motifs. The motif at the bottom is likely to be an artifact derived from a frequent motif throughout the genome

7 Notes
1. If a reference sequence for the methylation analysis is not available, it is a good idea to conduct de novo assembly with the basecalled reads to prepare a reference sequence, as Fastq files can also be obtained by basecalling the Fast5 files. QIAamp PowerFecal Pro DNA Kit includes physical disruption of the cell membrane by bead beating; thus, sequencing of the extracted DNA usually results in reads of 4–5 kb on average. As longer reads are ideal for de novo assembly for construction of a reference sequence, other kits specialized for the extraction of high-molecular-weight DNA such as Wizard HMW DNA Extraction Kit (Promega) can be used.
2. The Rapid Sequencing Kit also works, but the Ligation Kit usually results in longer read length and less bias; thus, the usage of the Ligation Kit will result in more genomic regions with higher coverage and more robust results in the methylation analysis. Of course, a kit that includes PCR during the library preparation cannot be used, as dNTP substrates incorporated during PCR do not maintain the base modifications on the template.
3. If Conda is not yet installed, install Anaconda or Miniconda in advance as explained on the official website or elsewhere. Anaconda includes many packages useful for data science, but it is about tenfold larger in file size. In many cases, installing Miniconda together with the required packages is easy enough and saves disk space.
4. In Mac OS X, brew can be used for the installation:
$ brew tap homebrew/science
$ brew install bedtools

5. Conda can also be used for the installation, but it is not the method listed in the original Bedtools documentation. In such cases, the Conda package repository is often maintained by a volunteer and sometimes does not provide the latest version of the software. Installation using Conda is usually easier, but be sure that it provides the latest version or the specific version you intend to install.
6. The procedure described here requires a model that has been pre-trained for the detection of the type of modification of interest, using reads produced from the same type of flow cell that you use. If such a model is not available, you should consider using the comparison method or training a model yourself if possible. An appropriate model may become available in the future; therefore, it is always recommended to store raw Fast5 files for future use.
7. During the preparation of this chapter, Oxford Nanopore Technologies implemented second-generation modified basecalling in the release of Guppy v6.1.1 as the tool Remora (https://github.com/nanoporetech/remora). Models for this method are now bundled with Guppy. Megalodon can also be used with those models. However, the models are currently available only for the detection of 5mC and 5hmC. If 6mA or other types of modification are the target, the protocol shown in this chapter is still useful.
8. The line consists of three commands. The first, awk, keeps only the lines of the input BED file whose coverage (10th column, or $10 in the command) is equal to or larger than 20 and whose modified read fraction (11th column, or $11 in the command) is equal to 100%. The second, bedtools slop, lists the positions of the 20-bp flanking sequences of the modified bases selected by the first command. The third, bedtools getfasta, extracts the sequences of the flanking regions according to the positions produced by the second command.
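The first two commands described in Note 8 can be sketched in Python for a single BED line. This is only an illustration of the logic, not a replacement for awk or bedtools: the example line, the contig name, and the chromosome length are invented, and the sketch assumes a tab-separated bedMethyl-style line with coverage in the 10th column and percent modified in the 11th.

```python
def passes_filter(fields, min_coverage=20, required_percent=100.0):
    """Mimic the awk filter: column 10 (coverage) >= 20 and column 11 (% modified) == 100."""
    coverage = int(fields[9])             # 10th column, $10 in awk
    percent_modified = float(fields[10])  # 11th column, $11 in awk
    return coverage >= min_coverage and percent_modified == required_percent


def add_flanks(start, end, chrom_length, flank=20):
    """Extend a 0-based, half-open BED interval by `flank` bp on each side.

    The result is clamped to the sequence ends, as bedtools slop does.
    """
    return max(0, start - flank), min(chrom_length, end + flank)


# Hypothetical bedMethyl-style line; all values are invented for illustration.
line = "contig1\t1000\t1001\tm\t1000\t+\t1000\t1001\t0,0,0\t35\t100.00"
fields = line.split("\t")

if passes_filter(fields):
    # chrom_length is an assumed value; bedtools reads it from a genome file.
    window = add_flanks(int(fields[1]), int(fields[2]), chrom_length=5_000_000)
    print(fields[0], window[0], window[1])
```

With a 20-bp flank on each side of a single modified base, each extracted window is 41 bp long unless it is truncated at a sequence end.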


References

1. Allis CD, Jenuwein T (2016) The molecular hallmarks of epigenetic control. Nat Rev Genet 17(8):487–500
2. Jin Z, Liu Y (2018) DNA methylation in human diseases. Genes Dis 5(1):1–8
3. Frommer M, McDonald LE, Millar DS, Collis CM, Watt F, Grigg GW, Molloy PL, Paul CL (1992) A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci USA 89(5):1827–1831
4. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, Pradhan S, Nelson SF, Pellegrini M, Jacobsen SE (2008) Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature 452(7184):215–219
5. Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW (2010) Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat Methods 7(6):461–465
6. Stoiber M, Quick J, Egan R, Eun Lee J, Celniker S, Neely RK, Loman N, Pennacchio LA, Brown J (2017) De novo Identification of DNA modifications enabled by genome-guided nanopore signal processing. bioRxiv:094672
7. Liu Q, Georgieva DC, Egli D, Wang K (2019) NanoMod: a computational tool to detect DNA modifications using Nanopore long-read sequencing data. BMC Genomics 20(Suppl 1):78
8. Tourancheau A, Mead EA, Zhang XS, Fang G (2021) Discovering multiple types of DNA methylation from bacteria and microbiome using nanopore sequencing. Nat Methods 18(5):491–498
9. Jenjaroenpun P, Wongsurawat T, Wadley TD, Wassenaar TM, Liu J, Dai Q, Wanchai V, Akel NS, Jamshidi-Parsian A, Franco AT, Boysen G, Jennings ML, Ussery DW, He C, Nookaew I (2021) Decoding the epitranscriptional landscape from native RNA sequences. Nucleic Acids Res 49(2):e7
10. Simpson JT, Workman RE, Zuzarte PC, David M, Dursi LJ, Timp W (2017) Detecting DNA cytosine methylation using nanopore sequencing. Nat Methods 14(4):407–410
11. Rand AC, Jain M, Eizenga JM, Musselman-Brown A, Olsen HE, Akeson M, Paten B (2017) Mapping DNA methylation with high-throughput nanopore sequencing. Nat Methods 14(4):411–413
12. McIntyre ABR, Alexander N, Grigorev K, Bezdan D, Sichtig H, Chiu CY, Mason CE (2019) Single-molecule sequencing detection of N6-methyladenine in microbial reference materials. Nat Commun 10(1):579
13. Ni P, Huang N, Zhang Z, Wang DP, Liang F, Miao Y, Xiao CL, Luo F, Wang J (2019) DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning. Bioinformatics 35(22):4586–4595
14. Liu Q, Fang L, Yu G, Wang D, Xiao CL, Wang K (2019) Detection of DNA base modifications by deep recurrent neural network on Oxford Nanopore sequencing data. Nat Commun 10(1):2449
15. Yuen ZW, Srivastava A, Daniel R, McNevin D, Jack C, Eyras E (2021) Systematic benchmarking of tools for CpG methylation detection from nanopore sequencing. Nat Commun 12(1):3438
16. Shen W, Le S, Li Y, Hu F (2016) SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11(10):e0163962
17. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842
18. Bailey TL, Johnson J, Grant CE, Noble WS (2015) The MEME Suite. Nucleic Acids Res 43(W1):W39–W49
19. Ivanova N, Sorokin A, Anderson I, Galleron N, Candelon B, Kapatral V, Bhattacharyya A, Reznik G, Mikhailova N, Lapidus A, Chu L, Mazur M, Goltsman E, Larsen N, D’Souza M, Walunas T, Grechkin Y, Pusch G, Haselkorn R, Fonstein M, Ehrlich SD, Overbeek R, Kyrpides N (2003) Genome sequence of Bacillus cereus and comparative analysis with Bacillus anthracis. Nature 423(6935):87–91

Chapter 7

Ultralow-Input Genome Library Preparation for Nanopore Sequencing with Droplet MDA

Kazuharu Arakawa

Abstract

Genome sequencing of small species, such as those of the meiofauna, can be challenging due to the extremely low input of genomic DNA. While nanopore sequencing is a promising technology for genome assembly due to its limitless long reads, the recommended input of 1 μg for the Ligation Sequencing Kit often precludes the use of this technology. Here, I detail an unbiased droplet-based multiple displacement amplification of picogram-order DNA to realize nanopore sequencing with ultralow input of genomic DNA. For this purpose, a microfluidic chip of the 10x Genomics Chromium Controller is utilized. With this method, over 10 μg of unbiased amplicons around 10 kbp in length can be obtained from as little as 50 pg of input DNA, which is enough for the construction of multiple sequencing libraries or for the size selection of longer DNA fragments.

Key words Nanopore sequencing, Whole genome amplification, Ultra-low input, Multiple displacement amplification, Droplet-MDA

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_7, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Long reads are essential in genome assembly. Owing to its affordability, portability, and limitless read length, nanopore sequencing by Oxford Nanopore Technologies has become a de facto choice in ab initio genome sequencing and assembly [1]. One technical hurdle in applying nanopore sequencing to uncommon organisms, however, is the recommended input of around 1 μg of DNA for library preparation. Biodiversity in terms of the number of species is far greater in small-sized organisms than in large-bodied ones [2]. In insects, which comprise as much as 90% of all animal species, a majority of species are less than several mm in length and less than 1 mg in body mass [3]. In studying these small organisms, the amount of obtainable sample is often the bottleneck and is frequently insufficient for nanopore sequencing. This is especially problematic for invertebrates, because genomic DNA extraction after thorough homogenization of the exoskeleton is difficult and often leads to DNA fragmentation. If one chooses to perform size selection and purification prior to library preparation, an even larger amount of starting material is desirable.

Short-read sequencing requires much less input (the starting material can be as low as several nanograms), since the sequencing method and the library preparation methods utilize PCR amplification. Even with extremely small input in the order of picograms, a highly efficient adapter ligation strategy allows the construction of whole-genome sequencing libraries from 50 pg of input DNA [4–6]. However, unbiased amplification for long reads, for example, with random long-range PCR of multi-kbp fragments, is unrealistic. Therefore, the whole-genome amplification (WGA) approach using the multiple displacement amplification (MDA) reaction with the highly processive ϕ29 polymerase under isothermal conditions becomes the first alternative [7–9]. With its strong binding to single-stranded DNA, ϕ29 polymerase debranches double-stranded DNA like a helicase while synthesizing a strand, which results in a large concatemer of branched amplified DNA from a small amount of template DNA [10]. With its 3′–5′ proofreading activity, the ϕ29 polymerase can accurately and rapidly amplify the whole genome, even from single cells [11], but it has several drawbacks [12]. Firstly, the rich-get-richer style of MDA leads to highly biased coverage, where certain regions of the genome are represented by a multitude of amplified DNA, whereas other regions can be missed completely. Secondly, it is known to produce a certain percentage of chimeras, which arise when a single-stranded amplicon happens to be used as a primer in the isothermal reaction.

In order to overcome the limitations of MDA, droplet-based compartmentalization methods have been developed to level the amplification coverage and to reduce chimeric amplicons [13–16].
These approaches are called droplet MDA, where the initial DNA fragments are almost individually enclosed with ϕ29 polymerase and other reagents in aqueous droplets surrounded by oil. The droplets are formed using microfluidic devices or by centrifugal methods, and since each template DNA fragment is individually compartmentalized, droplet MDA has been shown to result in uniform coverage of the genome with minimal chimeras [13–16]. The construction of microfluidic or centrifugal devices is not an easy task for most labs, so we have developed a protocol that utilizes the 10x Genomics Chromium Controller, a device developed for single-cell transcriptomic applications, to perform droplet MDA. With this protocol, one can start from as little as 50 pg of input DNA and obtain enough amplified DNA for size selection and for several runs of nanopore sequencing.

2 Materials

1. Quick-gDNA MicroPrep Kit (Zymo Research)
2. REPLI-g Midi Kit (Qiagen)
3. Thermal cycler
4. 8-tube strip
5. Chromium Controller (10x Genomics)
6. Chromium Next GEM Chip G Single Cell Kit (10x Genomics)
7. 25 mM dNTP mixture
8. 20% Ficoll
9. Milli-Q water
10. Droplet Generation Oil for EvaGreen (Bio-Rad)
11. Perfluorooctanol (PFO)
12. Centrifuge (up to 5000 × g)
13. Microcentrifuge
14. AMPure XP (Beckman Coulter)
15. Buffer EB (Qiagen) or 10 mM Tris–HCl, pH 7.0–8.5
16. T7 Endonuclease I (New England Biolabs)
17. TapeStation with Genomic DNA ScreenTape (Agilent)
18. Qubit Fluorometer with dsDNA BR reagent (Thermo Fisher Scientific)
19. BluePippin with High Pass Plus gel cassette and 0.75% Dye Free for 1–10 kb External Standards S1 (Sage Science)
20. Ligation Sequencing Kit (Oxford Nanopore Technologies)
21. MinION or PromethION Flow Cell (Oxford Nanopore Technologies)

3 Methods

1. Extract the genomic DNA with the Quick-gDNA MicroPrep Kit following the manufacturer’s instructions, but skip the optional RNA depletion step and elute twice by reapplying the flowthrough to the column to maximize DNA recovery with minimal elution volume (see Note 1).
2. Prepare eight samples of 5 μL extracted genomic DNA to maximize the cost-efficiency of using the Chromium Next GEM Chip.
3. Prepare Buffers D1 and N1 according to the instructions of the REPLI-g Midi Kit.


4. Add 5 μL of D1 to each sample, quickly mix by tapping and spin down with a microcentrifuge, and incubate for exactly 3 min at RT (see Note 2).
5. Immediately add 10 μL of N1 to the tubes, mix by tapping and spin down, and incubate on ice for 5 min.
6. Premix the following (per sample) and keep it on ice:
   41 μL REPLI-g Midi Reaction Buffer
   4 μL 25 mM dNTP mixture
   1 μL 20% Ficoll
   2 μL Milli-Q water
   Total: 48 μL
7. Add the above premix to each sample, and add 2 μL of REPLI-g Midi polymerase. Immediately mix well by tapping, spin down, and incubate for exactly 30 min at RT to preamplify the DNA (see Note 3).
8. Apply 70 μL of the above sample to port 1, 50 μL of 20% Ficoll to port 2 (see Note 4), and 45 μL of Droplet Generation Oil for EvaGreen (see Note 5) to port 3 of the Chromium Next GEM Chip (see Note 6).
9. Attach the gasket to the chip, set the chip in the Chromium system, and run it.
10. When the run completes (~18 min), immediately take out the chip, and carefully transfer the emulsified droplets from port 3 to a new 8-tube strip on ice. The emulsified sample should be around 90 μL, but try to measure and record the volume of each sample when pipetting. Proceed immediately to the next step.
11. Place the tubes in a thermal cycler, and incubate for 16 h at 30 °C for isothermal MDA, followed by 10 min at 65 °C to inactivate the polymerase, and then hold at 4 °C. The sample can be left overnight.
12. Break the emulsions by adding an equal volume of PFO to each tube. This should be around 90 μL, but use the volume recorded in step 10. Mix well by tapping and by inverting the tube 10–20 times. Do not vortex, because vortexing damages long DNA. When the solution looks homogeneous, leave it undisturbed for 3 min at RT. Then, centrifuge the sample at 5000 × g for 5 min at RT.


13. Now the sample should be phase separated into the upper oil and lower aqueous phases. Carefully aspirate and discard the upper oil layer. Do not touch the middle layer, because it contains DNA.
14. Purify the DNA and remove any trace of oil by performing a 1.8X AMPure XP cleanup three times, according to the manufacturer’s protocol (see Note 7).
15. Digest the amplified branched DNA at Holliday junctions by T7 Endonuclease I treatment. Mix the following, and incubate the mixture at 37 °C for 1 h and then at 25 °C overnight (see Note 8):
   136 μL purified DNA sample (volume adjusted with Buffer EB)
   16 μL 10x NEB buffer
   8 μL T7 Endonuclease I
   Total: 160 μL
16. Purify the DNA with 1.8X AMPure XP according to the manufacturer’s protocol, and elute in 63 μL of Buffer EB. Use 2 μL of the eluate for quantification by Qubit and 1 μL for the quality check of the fragment length distribution with TapeStation (see Note 9).
17. Size select the DNA with BluePippin, using the High Pass Plus gel cassette and 0.75% Dye Free for 1–10 kb External Standards S1 according to the manufacturer’s protocol. Note that 10 μg of DNA is required for each lane, and typically the amplified sample is enough for two lanes. Adjust the cutoff length depending on the fragment size distribution. If the fragment length peak is high, the cutoff can be 8 kbp or 10 kbp. If it is low, the threshold can be lowered to 6 kbp for maximal recovery (see Note 10). Purify the DNA with 1.8X AMPure XP according to the manufacturer’s protocol (see Note 11), quantify the DNA with Qubit, and check the size-selected fragment distribution with TapeStation.
18. Starting from around 2 μg of DNA, prepare the nanopore sequencing library with the Ligation Sequencing Kit according to the manufacturer’s protocol (see Note 12).
19. Sequence on MinION or PromethION according to the manufacturer’s protocol (see Note 13).
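The pipetting volumes in steps 2–8 can be cross-checked with simple arithmetic. The sketch below sums the premix of step 6 and the per-sample mixture that is loaded in step 8, and scales the premix for a batch of eight samples. The 10% pipetting overage is an assumption of this sketch and is not part of the protocol; the polymerase is deliberately left out of the premix because it must be added last (see Note 3).

```python
# Premix components per sample (step 6), in microliters.
PREMIX_UL = {
    "REPLI-g Midi Reaction Buffer": 41.0,
    "25 mM dNTP mixture": 4.0,
    "20% Ficoll": 1.0,
    "Milli-Q water": 2.0,
}

# Everything that ends up in one tube before droplet generation (steps 2-7).
PER_SAMPLE_UL = {
    "genomic DNA": 5.0,
    "Buffer D1": 5.0,
    "Buffer N1": 10.0,
    "premix": sum(PREMIX_UL.values()),
    "polymerase": 2.0,
}


def scale_premix(n_samples, overage=0.10):
    """Scale each premix component for n_samples plus an assumed pipetting overage."""
    factor = n_samples * (1.0 + overage)
    return {name: round(volume * factor, 1) for name, volume in PREMIX_UL.items()}


premix_total = sum(PREMIX_UL.values())      # total given in step 6
sample_total = sum(PER_SAMPLE_UL.values())  # volume applied to port 1 in step 8

print(f"premix per sample: {premix_total} uL")
print(f"mixture per sample: {sample_total} uL")
print(scale_premix(8))
```

The per-sample total equals the 70 μL applied to port 1, so the whole preamplified mixture is loaded onto the chip.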

4 Notes

1. Extraction from a very small amount of initial material requires extra care. Use of low-binding tubes and pipette tips is always recommended. When a sample is less than 1 mm in length, it can be crushed by pressing it against the tube wall with a pipette tip and immediately applying the lysis buffer. If larger, the sample can be crushed in a tube with a pestle. Centrifuge column-based extraction fragments the DNA to less than 50 kbp in length, but in our experience, the Quick-gDNA MicroPrep Kit is the most effective and reproducible method for extracting sub-nanogram genomic DNA. The elution volume should be exactly 10 μL, although only 5 μL of sample is used in the next step. A smaller elution volume results in significantly less output, and a larger elution volume results in too low a concentration. There is no need to quantify the DNA, because it is below the quantifiable limit, and quantification would consume precious DNA. Contamination can be a big problem in genome assembly of these microscopic animals, so antibiotic treatment prior to DNA extraction is highly recommended. More detailed tips and protocols for extracting DNA from a very small sample can be found in [6].
2. This step denatures the double-stranded DNA, and longer incubation could destabilize the DNA. Therefore, when using an 8-tube strip, pipette D1 onto the tube wall, and mix all tubes at once by spinning down.
3. Isothermal amplification starts immediately after the addition of polymerase, so it must be added last. Do not premix it in the previous step. The optimal input to the Chromium system is in the order of nanograms, which is much higher than the picogram-order input of genomic DNA used here. Therefore, this 30-min incubation preamplifies the input to match this level. However, this process could result in biased amplification and chimeras, so the incubation time must not exceed 30 min. The incubation time includes the subsequent steps of applying the sample to the Chromium Next GEM Chip, up until the start of the Chromium run.
4. The purpose of 20% Ficoll is to balance the pressure applied by the Chromium system to generate suitable droplets. It cannot be substituted by other crowding agents such as polyethylene glycol (PEG), because PEG interferes with the AMPure XP purification steps.
5. Alternatively, the partitioning oil included with the Chromium Next GEM Chip Kit can also be used.
6. The samples must be applied to the ports in this order because of the pressure on the microfluidics. Here we use the Chromium Next GEM Chip G Single Cell Kit because of its cost, but other Chromium chips should also work.


7. A 1X AMPure XP cleanup may be acceptable if one needs to minimize AMPure XP usage. However, at this stage, there should be >10 μg of amplified DNA, and a larger volume of AMPure XP is optimal for maximum recovery. Also note the usual AMPure XP handling tips, such as the use of freshly prepared 80% EtOH for washing and a longer incubation time (~15 min) at a higher temperature (~37 °C) for elution. In our experience, three rounds of AMPure XP cleanup are necessary for optimal performance of the subsequent T7 endonuclease treatment, because any trace of oil interferes with the reaction.
8. The reaction volume is very large, and the incubation time is very long, but this step is critical for nanopore sequencing. The branched structure of ϕ29 polymerase amplicons tends to block active nanopores during sequencing, resulting in a rapid decrease in the number of active pores. An even larger reaction volume may be optimal when the amount of amplified DNA is very large. For this purpose, the amount of purified DNA after step 14 can be quantified using Qubit, and the reaction volume can be scaled by multiplying it by (amount of DNA)/(10 μg) (i.e., the reaction volume in step 15 is optimized for 10 μg of DNA).
9. For high-throughput sequencing, DNA quantification by spectrophotometry, such as with NanoDrop, is not accurate and is therefore not recommended. Always use a fluorescence-based measurement such as Qubit. Other low-volume fragment analysis equipment, such as BioAnalyzer and Fragment Analyzer, can be used in place of TapeStation. Typically, at this stage, you should have >10 μg of DNA with a length distribution peaking at around 10–20 kbp. See Fig. 1a for an example result with single-specimen tardigrade genomic DNA (about 50-pg input) [17].
10. See Fig. 1b, c for the effect of size selection. With the TapeStation Analysis software, one can estimate the amount of DNA in a selected size range. Use this feature to estimate the amount of resulting size-selected DNA when deciding the size cutoff.
11. The purpose of the AMPure XP cleanup here is twofold. Firstly, the elution buffer of BluePippin contains TWEEN and a high concentration of Tris and TAPS, and it is replaced with Buffer EB to avoid interference with the library preparation steps. Secondly, multiple elutions are recommended for maximal recovery from BluePippin, so this step can be used to concentrate the pooled eluates.
12. For maximal ligation efficiency, we usually incubate for several hours or even overnight during the adapter ligation process instead of the 10 min specified in the manufacturer’s protocol. When ligating overnight, it is safer to run the reaction at 25 °C for the first several hours and at 4 °C for the remainder to minimize mis-ligation.


Fig. 1 TapeStation results during the QC steps of the protocol. (a) Electropherogram of eight samples amplified using the Chromium Next GEM Chip (left) and the fragment distribution of sample D1 (right) after step 14. Each sample is amplified to be more than 1 μg/μL. (b) Electropherogram and fragment distribution after T7 Endonuclease I treatment (step 16). Branched DNA is resolved, and the average length is shorter. (c) Electropherogram and fragment distribution after BluePippin size selection (step 17). Only fragments longer than 8 kbp are retained, which are averaging 17 kbp in length and suitable for nanopore sequencing

13. Even with thorough digestion with T7 Endonuclease I, the sequence yield tends to be lower than the normal ligation sequencing run. The yield is usually about 1/3 of a normal run, in our experience.
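The scaling rule in Note 8 can be written out as a short calculation. The sketch below multiplies each component of the step 15 reaction by (amount of DNA)/(10 μg); the 15-μg example value is hypothetical, and the sketch scales linearly in both directions, although the note itself only discusses enlarging the reaction for large amounts of DNA.

```python
# Reference T7 Endonuclease I reaction from step 15, optimized for 10 ug of DNA (Note 8).
REFERENCE_DNA_UG = 10.0
REFERENCE_REACTION_UL = {
    "purified DNA (adjusted with Buffer EB)": 136.0,
    "10x NEB buffer": 16.0,
    "T7 Endonuclease I": 8.0,
}


def scale_t7_reaction(dna_ug):
    """Scale every component by (amount of DNA / 10 ug), as suggested in Note 8."""
    factor = dna_ug / REFERENCE_DNA_UG
    return {name: volume * factor for name, volume in REFERENCE_REACTION_UL.items()}


# Hypothetical example: 15 ug of DNA measured by Qubit after step 14.
scaled_reaction = scale_t7_reaction(15.0)
for name, volume in scaled_reaction.items():
    print(f"{name}: {volume:.1f} uL")
print(f"total: {sum(scaled_reaction.values()):.1f} uL")
```

Because all components are scaled by the same factor, the buffer stays at 1x and the enzyme-to-DNA ratio is unchanged.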


Acknowledgments

The author thanks Yuki Takai for technical assistance. This work is supported by KAKENHI Grant-in-Aid for Transformative Research (A) from the Japan Society for the Promotion of Science (JSPS, grant no. 21H05279), Joint Research by Exploratory Research Center on Life and Living Systems (ExCELLS program no. 19-208 and 19-501), and partly by research funds from the Yamagata Prefectural Government and Tsuruoka City, Japan.

References

1. Kono N, Arakawa K (2019) Nanopore sequencing: review of potential applications in functional genomics. Develop Growth Differ 61(5):316–326. https://doi.org/10.1111/dgd.12608
2. Hutchinson GE, MacArthur RH (1959) A theoretical ecological model of size distributions among species of animals. Am Nat 93(869):117–125. https://doi.org/10.1086/282063
3. Kalinkat G, Jochum M, Brose U, Dell AI (2015) Body size and the behavioral ecology of insects: linking individuals to ecological communities. Curr Opin Insect Sci 9:24–30. https://doi.org/10.1016/j.cois.2015.04.017
4. Arakawa K (2016) No evidence for extensive horizontal gene transfer from the draft genome of a tardigrade. Proc Natl Acad Sci U S A 113(22):E3057. https://doi.org/10.1073/pnas.1602711113
5. Arakawa K, Yoshida Y, Tomita M (2016) Genome sequencing of a single tardigrade Hypsibius dujardini individual. Sci Data 3:160063. https://doi.org/10.1038/sdata.2016.63
6. Yoshida Y, Konno S, Nishino R, Murai Y, Tomita M, Arakawa K (2018) Ultralow input genome sequencing library preparation from a single tardigrade specimen. J Vis Exp 137. https://doi.org/10.3791/57615
7. Montoliu-Nerin M, Sánchez-García M, Bergin C, Grabherr M, Ellis B, Kutschera VE, Kierczak M, Johannesson H, Rosling A (2019) From single nuclei to whole genome assemblies. bioRxiv:625814. https://doi.org/10.1101/625814
8. Ye X, Yang Y, Tian Z, Xu L, Yu K, Xiao S, Yin C, Xiong S, Fang Q, Chen H, Li F, Ye G (2020) A high-quality de novo genome assembly from a single parasitoid wasp. bioRxiv:2020.07.13.200725. https://doi.org/10.1101/2020.07.13.200725
9. Hård J, Mold JE, Eisfeldt J, Tellgren-Roth C, Häggqvist S, Bunikis I, Contreras-Lopez O, Chin C-S, Rubin C-J, Feuk L, Michaëlsson J, Ameur A (2021) Long-read whole genome analysis of human single cells. bioRxiv:2021.04.13.439527. https://doi.org/10.1101/2021.04.13.439527
10. Silander K, Saarela J (2008) Whole genome amplification with Phi29 DNA polymerase to enable genetic or genomic analysis of samples of low DNA yield. Methods Mol Biol 439:1–18. https://doi.org/10.1007/978-1-59745-188-8_1
11. Huang L, Ma F, Chapman A, Lu S, Xie XS (2015) Single-cell whole-genome amplification and sequencing: methodology and applications. Annu Rev Genomics Hum Genet 16:79–102. https://doi.org/10.1146/annurev-genom-090413-025352
12. Sabina J, Leamon JH (2015) Bias in whole genome amplification: causes and considerations. Methods Mol Biol 1347:15–41. https://doi.org/10.1007/978-1-4939-2990-0_2
13. Fu Y, Zhang F, Zhang X, Yin J, Du M, Jiang M, Liu L, Li J, Huang Y, Wang J (2019) High-throughput single-cell whole-genome amplification through centrifugal emulsification and eMDA. Commun Biol 2:147. https://doi.org/10.1038/s42003-019-0401-y
14. Hosokawa M, Nishikawa Y, Kogawa M, Takeyama H (2017) Massively parallel whole genome amplification for single-cell sequencing using droplet microfluidics. Sci Rep 7(1):5199. https://doi.org/10.1038/s41598-017-05436-4


15. Li X, Zhang D, Ruan W, Liu W, Yin K, Tian T, Bi Y, Ruan Q, Zhao Y, Zhu Z, Yang C (2019) Centrifugal-driven droplet generation method with minimal waste for single-cell whole genome amplification. Anal Chem 91(21):13611–13619. https://doi.org/10.1021/acs.analchem.9b02786
16. Rhee M, Light YK, Meagher RJ, Singh AK (2016) Digital Droplet Multiple Displacement Amplification (ddMDA) for Whole Genome Sequencing of Limited DNA Samples. PLoS One 11(5):e0153699. https://doi.org/10.1371/journal.pone.0153699
17. Arakawa K (2022) Examples of extreme survival: tardigrade genomics and molecular anhydrobiology. Annu Rev Anim Biosci 10:17–37. https://doi.org/10.1146/annurev-animal-021419-083711

Chapter 8

The Method of Eliminating the Wolbachia Endosymbiont Genomes from Insect Samples Prior to a Long-Read Sequencing

Keizo Takasuka and Kazuharu Arakawa

Abstract

When extracting DNA from invertebrates for long-read sequencing, not only must DNA of sufficient quantity and size be obtained but, depending on the species, contamination by the endosymbiotic Wolbachia genome must also be eliminated. These requirements become troublesome, especially in small-sized species with a limited number of individuals available for the experiment. In this chapter, using tiny parasitoid wasps (Reclinervellus nielseni) that parasitize spiders as hosts, we developed a method of eliminating the Wolbachia genomes by administering an antibiotic to adult wasps via honey solution. Twenty days of rifampicin treatment from their emergence from cocoons resulted in a significant decrease in the Wolbachia genomes while keeping good DNA conditions for nanopore sequencing. An adequate quantity of DNA was then obtained by pooling several individuals. The method could be applied to other insects or invertebrates that can be maintained by laboratory feeding with liquid food.

Key words Antibiotics, Contamination, DNA extraction, Endocellular bacterial symbiont, Genome, Nanopore sequence, qPCR, Rifampicin

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_8, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Thanks to the nanopore sequencers developed by Oxford Nanopore Technologies (ONT), ultra-long-read sequencing has rapidly become a general technology even for non-model organisms; however, it requires a certain level of both quantity and size of genomic DNA for sequencing library preparation. ONT’s official library preparation protocol requires 1 μg (or 100–200 fmol) of high-molecular-weight genomic DNA. At the same time, the longer the reads, the better, because there is no apparent technical limit on the size of DNA that can be sequenced by nanopore sequencers [1]. Moreover, a fragment size of >60 kbp, which is the limit of TapeStation (Agilent), is one of the criteria for which we strive. In some situations, meeting these requirements may be an arduous task, especially for invertebrates that cannot be bred, have tiny bodies, and are available in only small numbers for the experiment.

In addition to the DNA conditions, the Wolbachia genome becomes a formidable obstacle to sequencing analyses. Wolbachia is a group of Gram-negative bacteria that are commonly found inside invertebrate cells, i.e., endosymbionts, occurring in all orders of insects as well as in nematodes, mites, spiders, and other arthropods [2]. Among arthropods, it is estimated that 66% of species are infected with Wolbachia [3]. The taxonomically wide and frequent occurrence of Wolbachia sometimes gets in the way of molecular biology on invertebrates; for example, on rare occasions, Wolbachia infection results in mitochondrial DNA introgression into the host’s COI gene, causing DNA barcoding to fail, as exemplified in a genus of parasitoid wasp (Ichneumonidae, Diplazon) [4]. The Wolbachia genome also disturbs de novo assembly simply as a DNA contaminant. The authors attempted whole-genome sequencing of an ichneumonid ectoparasitoid, Reclinervellus nielseni (Roman, 1923) (Fig. 1a), which parasitizes an orb-web spider (Fig. 1b), using an Illumina sequencer, but it failed because almost all assembled sequences matched those of Wolbachia sp. (unpublished result). This problem will also occur in sequencing with nanopore sequencers.

Fig. 1 Reclinervellus nielseni and its host spider, Cyclosa argenteoalba. (a) A female adult R. nielseni. (b) A young instar larva of R. nielseni attached to the host spider

To resolve this problem, endosymbiotic Wolbachia must be removed as far as possible before the host’s DNA extraction. The Wolbachia symbiont can be eliminated by providing an antibiotic, rifampicin, to host insects (e.g., bedbugs [5]). In the case of parasitoid wasps, a method of eliminating Wolbachia from endoparasitoids (Braconidae, Asobara spp.) of Drosophila fly larvae was established by feeding the parasitized host larvae a standard diet in medium supplemented with rifampicin or tetracycline, so that the parasitoid larvae are exposed to the antibiotics through the host hemolymph [6–8]. In another endoparasitoid of Drosophila larvae, Leptopilina heterotoma (Thomson, 1862) (Figitidae), Wolbachia elimination was achieved by applying rifampicin both to adult female wasps (via an aqueous honey solution containing the antibiotic) and to parasitoid larvae through their hosts [9]. However, the efficacy of administering antibiotics only to adult parasitoid wasps has not been verified. Such a method would be effective for parasitoids whose larvae cannot be given antibiotics via the host, such as parasitoids of hosts requiring solid foods and idiobiont parasitoids (which permanently immobilize their hosts at oviposition).

In this chapter, we tried to develop a method of eliminating the Wolbachia genomes from the parasitoid wasps of spiders (Ichneumonidae, the Polysphincta genus group) by administering the antibiotic only to adult wasps while keeping good DNA conditions for nanopore sequencing. Although we focus on the orb-web spider parasitoids and their specific rearing procedures, this method could apply to other insects or invertebrates that can be kept by artificially feeding them with liquid food.
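The Introduction quotes the library input both as a mass (1 μg) and as a molar amount (100–200 fmol); the two are linked by the mean fragment length. The sketch below shows the conversion, assuming an average mass of about 650 g/mol per base pair of double-stranded DNA (a common approximation, not a value given in this chapter); the function names are illustrative.

```python
# Assumed average molar mass of one double-stranded base pair (g/mol); an approximation.
AVG_BP_MASS_G_PER_MOL = 650.0


def dsdna_fmol(mass_ng, mean_length_bp):
    """Convert a dsDNA mass (ng) into femtomoles of molecules of a given mean length (bp)."""
    return mass_ng * 1e6 / (mean_length_bp * AVG_BP_MASS_G_PER_MOL)


def length_for_fmol(mass_ng, fmol):
    """Mean fragment length (bp) at which mass_ng of dsDNA corresponds to `fmol`."""
    return mass_ng * 1e6 / (fmol * AVG_BP_MASS_G_PER_MOL)


# 1 ug (1000 ng) of DNA with a mean fragment length of 10 kbp.
print(round(dsdna_fmol(1000, 10_000), 1), "fmol")

# Mean lengths at which 1 ug equals 200 fmol and 100 fmol, respectively.
print(round(length_for_fmol(1000, 200)), "bp")
print(round(length_for_fmol(1000, 100)), "bp")
```

Under this assumption, 1 μg corresponds to 100–200 fmol only for mean fragment lengths of roughly 8–15 kbp; for longer DNA, the same mass contains fewer molecules.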

2 Materials

1. Reclinervellus nielseni: The parasitoid wasp (Fig. 1a) of an orb-web spider, Cyclosa argenteoalba Bösenberg & Strand, 1906 (Fig. 1b), with a forewing length of 5–6 mm [10, 11]; the target species of this study.
2. Wooden frames for the spiders’ web-building: Rearing platform in which the parasitized spiders build a vertical orb web (Fig. 2; see Note 1).
3. Commercial flightless Drosophila sp.: Prey on which the host spiders feed.
4. Small empty vials and a CO2 cylinder: Tools to enclose and paralyze the adult wasps before cutting their wings.
5. Microscissors: An instrument for cutting the wings of the adult wasps.
6. Honey: Food supplied to the adult wasps and, on occasion, to the spiders (see Note 2).
7. Rifampicin: An antibiotic for killing Wolbachia infecting insect hosts [5, 7–9].
8. EtOH: A solvent (100%) to dissolve rifampicin and also a wash solution for gDNA extraction (70%) and AMPure purification (80%).


Keizo Takasuka and Kazuharu Arakawa

Fig. 2 The original wooden frame for vertical orb-web building. (a) The frame with the acrylic partitions closed. (b) That with the partitions opened. Four stainless-steel plates can be seen at the four corners of the frame. (c) The acrylic partition with four magnetic tapes applied to the four corners. (d) A vertical orb web built by C. argenteoalba in the frame

Fig. 3 The wasps (partly wingless) housed in the lids of racks of pipet tips with the Petri dish per lid holding the honey-rifampicin solution

9. Transparent hermetic container: A tool to enclose the adult wasps kept in captivity (Fig. 3; see Note 3).
10. Small Petri dishes and tissue papers: A reservoir to hold the honey solution as food.


11. Liquid nitrogen: A medium to freeze the adult wasps.
12. Freeze-Crush Apparatus SK-100 (Tokken): A freeze crusher to homogenize the frozen wasps.
13. 2.0-mL Safe-Lock Tubes (Eppendorf): The tubes recommended by Tokken for homogenizing samples with the SK-100.
14. Genomic-tip 20/G and Genomic DNA Buffer Set (QIAGEN): A kit for extracting DNA suitable for ultra-long reads.
15. RNase A, Proteinase K, and isopropanol: Reagents used for gDNA extraction.
16. 10 mM Tris-HCl pH 8.0: A solvent to elute DNA and also a diluent for qPCR.
17. TapeStation (Agilent Technologies, Inc.): An automated electrophoresis system for quality control of DNA samples.
18. Qubit 4 Fluorometer and Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific K.K.): A fluorometric system to specifically quantify DNA.
19. Primers specific to the genus Reclinervellus EF1α gene for qPCR: ReclEF1aF (5′-ACGATTATCGATGCTCCGGGACACAG-3′) and ReclEF1aR (5′-AACGAGCCTCGGAGTAAGGTGGC-3′) (see Note 4).
20. Primers specific to the R. nielseni COI gene for qPCR: RnCOIF (5′-ACAAATCAAGGTGTAGGAACAGGATG-3′) and RnCOIR (5′-ATTGCACCAGCAAGAACTGGAAC-3′) (see Note 5).
21. Primers specific to the Wolbachia 16S ribosomal RNA gene for qPCR: Wp16SF (5′-AATACGGAGAGGGCTAGCGT-3′) and Wp16SR (5′-CTTCAGCGTCAGATTTGAACCAG-3′) (see Note 6).
22. KAPA Library Quantification Kits Illumina/Roche LightCycler® 480 (Kapa Biosystems): A kit for qPCR with an engineered enzyme optimized for qPCR using SYBR Green I dye chemistry.
23. Tween 20: A surfactant used for qPCR.
24. LightCycler 96 (Roche Diagnostics K.K.): A real-time PCR instrument.
25. AMPure XP beads (Beckman Coulter): SPRI (solid-phase reversible immobilization) paramagnetic beads for DNA purification.

3 Methods

3.1 Rearing Host Spiders and the Parasitoid Larvae Until Adult Emergence (in the Case of R. nielseni, the Host Spiders Parasitized by This Species Were Collected in Several Sites in Kyoto and Hyogo Prefectures, Japan)

1. Keep the parasitized spiders in the original wooden frames for vertical orb-web building (Fig. 2d; see Note 1) and feed them with the commercial flightless Drosophila sp. or other prey they prefer.
2. After the parasitoid larvae attached to the spiders kill their hosts and spin cocoons to pupate, keep each cocoon individually, e.g., in a pill case. Record the day each larva killed its host spider to anticipate the day of emergence.
3. Check for emergence of the adult wasps every day so that they do not starve to death in the nursery. Leaving a newly emerged wasp unfed for a day causes a lethal loss of vitality. Record the day of emergence to count the total treatment days.

3.2 Rearing the Adult Wasps for Sterilization of Wolbachia (in the Case of R. nielseni)

1. After the adult wasps emerge from their cocoons, enclose each one in a small empty vial, close the orifice with the thumb, paralyze the wasp by spraying CO2, and cut its wings with the microscissors. The wingless wasps can be reared as usual and are easy to handle.
2. Accurately weigh 2 mg of rifampicin into a 1.5-mL tube and dissolve it in 1 mL of 100% EtOH (i.e., a rifampicin concentration of 2 μg/μL). Dilute honey with water at about one part honey to three parts water. Pipet 3 mL of the honey water and 30 μL of the rifampicin solution into the same small Petri dish, giving a final concentration of approximately 20 μg rifampicin/mL (60 μg in about 3.03 mL; see Note 7). Soak a quarter sheet of tissue paper (see Note 8) in the Petri dish. Store the remaining rifampicin solution in a freezer.
3. House the wingless wasps in the lid of a pipet tip rack, up to about ten individuals per lid in the order of emergence (see Note 9), with the Petri dish holding the honey-rifampicin solution (Fig. 3). Make new honey water and replace the honey-rifampicin solution at least every other day. Be careful not to crush the wasps when replacing the lid after the food is exchanged.
4. At 20–30 days after emergence, the wasps are ready for DNA extraction (see Note 10). Freeze the wasps alive in liquid nitrogen and preserve them individually in 2.0-mL Safe-Lock Tubes. They can also be kept in a deep freezer at this stage.

3.3 Homogenizing the Wasps, DNA Extraction, qPCR, and Evaporation

1. Crush the frozen whole-body wasp samples with the Freeze-Crush Apparatus SK-100, following the manufacturer’s protocol.
2. Extract DNA from each individual with the Genomic-tip 20/G, following the manufacturer’s protocol. Measure the size of the extracted DNA by electrophoresis (TapeStation) and quantify it with the fluorometer (Qubit 4 Fluorometer) (Table 1; see Note 11).


Table 1 Quantity and size of the DNA extracted by Genomic-tip 20/G and eluted in 100 μL of 10 mM Tris–HCl (pH 7.5), and Ct values for the three primer pairs

Sample  Sex  Days of rifampicin  Quantity of  Size of   Ct, R. nielseni  Ct, Reclinervellus  Ct, Wolbachia
ID           administration      DNA (ng)     DNA (bp)  COI              EF1α                16S
Rn5     ♀    28                  192.0        >60 K     22.65            31.02               35.70
Rn6     ♂    28                  92.6         >60 K     25.36            30.57               35.36
Rn7     ♂    22–26               44.8         58,063    25.88            31.50               31.06
Rn8     ♂    22–26               130.0        58,635    24.48            30.03               35.86
Rn9     ♂    22–26               121.0        54,088    24.53            30.09               34.22
Rn10    ♂    22–26               151.0        59,716    24.26            30.09               36.07
Rn11    ♂    22–26               68.2         58,063    26.03            31.23               39.42
Rn12    ♂    22–26               83.0         56,326    24.88            30.44               36.83
Rn13    ♂    22–26               99.8         52,974    23.86            31.08               35.83
Rn14    ♀    28–29               540.0        >60 K     20.89            28.92               37.05
Rn15    ♀    28–29               342.0        >60 K     22.63            30.28               28.37
Rn16    ♀    28–29               34.6         >60 K     28.29            30.46               29.36
Rn17    ♀    19–23               139.0        >60 K     24.38            30.43               32.91
Rn18    ♀    19–23               218.0        >60 K     22.98            29.84               34.66
Rn19    ♂    20–23               16.8         ND        27.07            30.34               32.58
Rn20    ♂    20–23               73.4         58,993    25.24            30.55               35.03
Rn21    ♂    20–23               11.4         ND        26.17            30.69               35.28
Rn22    ♂    20–23               33.4         ND        26.05            30.95               35.29
Rn23    ♀    24–28               292.0        >60 K     23.08            30.36               35.18
Rn24    ♀    24–28               434.0        54,254    22.47            29.88               33.71
Rn25    ♂    25–28               111.0        55,450    24.65            31.32               33.68
N59     ♂    non-adm.            144.0        53,683    24.13            30.71               30.83
N60     ♀    non-adm.            612.0        33,782    22.29            30.22               27.88
N61     ♀    non-adm.            406.0        58,358    22.57            29.88               33.98
N62     ♀    non-adm.            86.2         23,090    25.53            31.96               33.27
N63     ♂    non-adm.            392.0        41,730    23.72            30.67               33.77
N64     ♀    non-adm.            153.0        >60 K     23.60            31.08               33.95

All Ct values were obtained in a single qPCR run. DNA was quantified with the Qubit 4 Fluorometer and the Qubit dsDNA High Sensitivity Assay Kit and sized by electrophoresis on the TapeStation with Genomic DNA ScreenTape


3. Perform qPCR of the extracted DNA with the primers designed specifically for Wolbachia and the studied species (see Note 12).
4. Choose several samples with large DNA fragments (ideally >60 kbp) and a far lower Ct value for the EF1α gene than for the Wolbachia 16S ribosomal RNA gene, and pool them to obtain over 1.5 μg of DNA in total (see Note 13), allowing for potential loss during purification with AMPure XP beads. Evaporate the pooled DNA solution to 47 μL or less (be careful not to let it dry out) and confirm again with the fluorometer that the DNA quantity is over 1.5 μg. The sample is then ready for preparing the nanopore sequencing library with the Ligation Sequencing Kit.
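The sample choice in step 4 can be sketched as a simple filter followed by greedy pooling. The chapter gives no numeric cutoffs beyond "ideally >60 kbp" and "over 1.5 μg", so the size limit `MIN_SIZE_BP` and the Ct gap `MIN_DELTA_CT` below are illustrative assumptions, as is the rule of pooling the highest-yield samples first. The values are copied from Table 1, with ">60 K" stored as 60000 and "ND" as `None`.

```python
# Illustrative thresholds (assumptions, not values given in the chapter).
MIN_SIZE_BP = 50_000   # hypothetical lower limit on TapeStation fragment size
MIN_DELTA_CT = 3.0     # hypothetical minimum of Ct(Wolbachia 16S) - Ct(EF1a)
TARGET_NG = 1500.0     # "over 1.5 ug DNA in total" from step 4

# (sample ID, DNA ng, size bp, Ct EF1a, Ct Wolbachia 16S), copied from Table 1.
TABLE1 = [
    ("Rn5", 192.0, 60000, 31.02, 35.70), ("Rn6", 92.6, 60000, 30.57, 35.36),
    ("Rn7", 44.8, 58063, 31.50, 31.06), ("Rn8", 130.0, 58635, 30.03, 35.86),
    ("Rn9", 121.0, 54088, 30.09, 34.22), ("Rn10", 151.0, 59716, 30.09, 36.07),
    ("Rn11", 68.2, 58063, 31.23, 39.42), ("Rn12", 83.0, 56326, 30.44, 36.83),
    ("Rn13", 99.8, 52974, 31.08, 35.83), ("Rn14", 540.0, 60000, 28.92, 37.05),
    ("Rn15", 342.0, 60000, 30.28, 28.37), ("Rn16", 34.6, 60000, 30.46, 29.36),
    ("Rn17", 139.0, 60000, 30.43, 32.91), ("Rn18", 218.0, 60000, 29.84, 34.66),
    ("Rn19", 16.8, None, 30.34, 32.58), ("Rn20", 73.4, 58993, 30.55, 35.03),
    ("Rn21", 11.4, None, 30.69, 35.28), ("Rn22", 33.4, None, 30.95, 35.29),
    ("Rn23", 292.0, 60000, 30.36, 35.18), ("Rn24", 434.0, 54254, 29.88, 33.71),
    ("Rn25", 111.0, 55450, 31.32, 33.68), ("N59", 144.0, 53683, 30.71, 30.83),
    ("N60", 612.0, 33782, 30.22, 27.88), ("N61", 406.0, 58358, 29.88, 33.98),
    ("N62", 86.2, 23090, 31.96, 33.27), ("N63", 392.0, 41730, 30.67, 33.77),
    ("N64", 153.0, 60000, 31.08, 33.95),
]


def choose_pool(rows, min_size=MIN_SIZE_BP, min_delta=MIN_DELTA_CT, target=TARGET_NG):
    """Pool the highest-yield samples that pass both filters until the target is exceeded."""
    eligible = [
        r for r in rows
        if r[2] is not None and r[2] >= min_size and (r[4] - r[3]) >= min_delta
    ]
    eligible.sort(key=lambda r: r[1], reverse=True)  # largest DNA quantity first
    pool, total = [], 0.0
    for sample_id, ng, _size, _ef1a, _w16s in eligible:
        if total > target:
            break
        pool.append(sample_id)
        total += ng
    return pool, total


pool, total_ng = choose_pool(TABLE1)
print(pool, total_ng)
```

Under these assumed cutoffs the sketch picks the same four samples that Note 13 reports (Rn14, Rn23, Rn24, and N61); other cutoffs would give other pools, so the thresholds should be set from the actual qPCR and TapeStation results.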

4 Notes

1. These original wooden frames (Fig. 2) were developed by improving on Zschokke’s frames made of Perspex strips [12]. Zschokke’s model is made of acrylic, and its partitions are PVC or windowpanes, which makes the whole assembly heavy and difficult to carry with the partitions closed. In addition, because acrylic is slippery, coarse tape must be applied inside the frame to give the spiders grip. We therefore adopted a wooden frame without changing the basic framework and used 1-mm-thick acrylic plates for the partitions (Fig. 2a). Moreover, 10-cm-long magnetic stainless-steel plates as wide as the thickness of the wooden frame are attached to the four corners of the frame (Fig. 2b), and magnetic tapes are applied to the four corners of the acrylic partitions so that they overlap with the stainless-steel plates and hold the partitions closed (Fig. 2c). This design reduces material costs and weight and allows the frame to be moved with the partitions closed.
2. Honey is a food-processing invention of Hymenoptera, and it proves to be as excellent for parasitoid wasps as it is for bees [13]. It contains proteins and vitamins as well as sugars and, along with fat reserves, seems adequate for egg maturation in most parasitoid species that nourish and mature their eggs successively through their adult life [13]. Honey can also be used as a flavoring for spiders: when spiders are reluctant to build a web in a cage, soaking a freshly killed prey item in honey solution before handing it to them makes the spiders immediately respond to and suck the liquid clinging to the unfamiliar object, which is the actual prey [14].
3. We used lids of the pipet tip racks.
4. They were designed based on the EF1α (elongation factor 1α) genes of three Reclinervellus spp., R. nielseni (LC145480.1),


R. tuberculatus (LC145478.1), and R. masumotoi (LC145479.1), registered in NCBI, for universal usability among the three species, with a product length of 238 bp.
5. They were designed based on the mitochondrial COI gene of R. nielseni (LC145356.1), with a product length of 267 bp. A primer pair common to the mitochondrial COI genes of the three Reclinervellus spp. could not be designed because of sequence heterogeneity.
6. They were designed based on the 16S ribosomal RNA gene of Wolbachia pipientis Hertig, 1936 (LC101741.1), with a product length of 230 bp.
7. Although the experiment on Wolbachia elimination from bedbugs set the final concentration of rifampicin at 10 μg/mL in rabbit blood [5], 20 μg/mL in the honey solution caused no wasp deaths.
8. Separate the two plies of a tissue paper and tear one ply in half; this is the optimum quantity to hold 3 mL of liquid.
9. Because individuals can no longer be told apart inside the lid, the total number of treatment days varies with the spread of emergence days. The difference in emergence day among wasps in the same lid should therefore be as small as possible.
10. It is recommended that antibiotics be administered for as long as possible, but the experiment will fail if the wasps die before they are frozen. Since the life span of an adult R. nielseni is a little more than one month, rearing them for more than one month may result in their death. It may therefore be advisable to determine the average life span of the studied species in advance and set the duration of the antibiotic treatment at about three-fourths of that life span.
11. In our experiment, we handled 27 wasps (12 females and 15 males). Twenty-one individuals (♀8:♂13) were assigned to the rifampicin-administered group, while six individuals (♀4:♂2) were treated as the negative control (the non-administration group) (Table 1).
Regarding DNA quantity, females yielded significantly more DNA than males (Fig. 4; Welch’s t-test [two-sided], t = 3.1675, d.f. = 15, p = 0.00637). The difference may be caused by haplo-diploidy (i.e., the condition universal in Hymenoptera where the female is diploid and the male haploid [15]) and may thus be a general tendency among Hymenoptera. Females are therefore more suitable than males for DNA extraction in tiny Hymenoptera and other haplo-diploid taxa.


Fig. 4 Quantity of DNA compared between the sexes, ignoring the antibiotic treatments. The means of females and males are 287.4 ng and 104.8 ng, respectively

Fig. 5 Amplification curves of qPCR using the pair of primers specific to Wolbachia 16S comparing the rifampicin-administered group (pale gray) with the non-administration group as negative control (dark gray [blue in the digital file])

12. In the case of R. nielseni, the mean Ct value of Wolbachia 16S in the rifampicin-administered group (34.45) was significantly higher than that in the non-administration group (32.28) (Table 1, Figs. 5 and 6; Welch’s t-test [one-sided], t = 1.8885, d.f. = 8, p = 0.04782). Regarding the difference in Ct values between EF1α and Wolbachia 16S, there was no significant difference in the non-administration group (Fig. 7; paired t-test [two-sided], t = -1.5837, d.f. = 5, p = 0.17411), whereas the Ct value of Wolbachia 16S was significantly higher than that of EF1α in the rifampicin-administered group (Fig. 7; paired t-test [two-sided], t = -6.8756, d.f. = 20, p < 0.000002). These results suggest that the administration of rifampicin effectively eliminates the Wolbachia genes from the parasitoids. However, several samples in the administered group had Ct values of Wolbachia that were within the non-administration


Fig. 6 Ct values of the three pairs of primers (same dataset as Table 1), with Wolbachia 16S discriminated between antibiotic treatments. The means of COI, EF1α, Wolbachia 16S (non-adm.), and Wolbachia 16S (rifampicin-administered) are 24.36 (n = 27), 30.54 (n = 27), 32.28 (n = 6), and 34.45 (n = 21), respectively. Note that COI and EF1α are not included in the statistical analysis

Fig. 7 Ct values of the two pairs of primers (EF1α and Wolbachia 16S), comparing antibiotic treatments (same dataset as Table 1). The means of EF1α (non-adm.), Wolbachia 16S (non-adm.), EF1α (rifampicin-administered), and Wolbachia 16S (rifampicin-administered) are 30.75 (n = 6), 32.28 (n = 6), 30.48 (n = 21), and 34.45 (n = 21), respectively

group’s range (Table 1, Fig. 5), suggesting that rifampicin may be less effective at eliminating Wolbachia genes in some individuals. Sorting samples by Ct values after qPCR is therefore also essential. In addition, some non-administered individuals already had fairly high Ct values of Wolbachia 16S (e.g., N61–64; Table 1, Fig. 5) and were potentially usable for sequencing, indicating that Wolbachia density may also vary among individuals.
13. We chose four samples totaling over 1.6 μg of DNA: Rn14, Rn23, Rn24, and N61 (Table 1). After concentration to 44 μL by evaporation, the total DNA quantity was still 1,628 ng.
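The test statistics quoted in Notes 11 and 12 can be recomputed from the values in Table 1. The following is a minimal sketch using only the Python standard library; the function names `welch_t` and `paired_t` are illustrative, and the software the authors actually used is not stated in the chapter. Only the t statistics and degrees of freedom are computed here; the p-values would additionally require the t distribution, for example from a statistics package.

```python
import math
from statistics import mean, variance

# Ct values copied from Table 1 in row order
# (rifampicin-administered: Rn5 to Rn25; non-administration: N59 to N64).
RIF_EF1A = [31.02, 30.57, 31.50, 30.03, 30.09, 30.09, 31.23, 30.44, 31.08, 28.92, 30.28,
            30.46, 30.43, 29.84, 30.34, 30.55, 30.69, 30.95, 30.36, 29.88, 31.32]
RIF_16S = [35.70, 35.36, 31.06, 35.86, 34.22, 36.07, 39.42, 36.83, 35.83, 37.05, 28.37,
           29.36, 32.91, 34.66, 32.58, 35.03, 35.28, 35.29, 35.18, 33.71, 33.68]
NON_EF1A = [30.71, 30.22, 29.88, 31.96, 30.67, 31.08]
NON_16S = [30.83, 27.88, 33.98, 33.27, 33.77, 33.95]

# DNA quantity (ng) by sex, ignoring the antibiotic treatment.
FEMALE_NG = [192.0, 540.0, 342.0, 34.6, 139.0, 218.0, 292.0, 434.0, 612.0, 406.0, 86.2, 153.0]
MALE_NG = [92.6, 44.8, 130.0, 121.0, 151.0, 68.2, 83.0, 99.8, 16.8, 73.4, 11.4, 33.4,
           111.0, 144.0, 392.0]


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def paired_t(x, y):
    """Paired t statistic and degrees of freedom for matched measurements x and y."""
    d = [xi - yi for xi, yi in zip(x, y)]
    t = mean(d) / math.sqrt(variance(d) / len(d))
    return t, len(d) - 1


results = {
    "DNA quantity, female vs male (Welch)": welch_t(FEMALE_NG, MALE_NG),
    "Wolbachia 16S, rifampicin vs non-adm. (Welch)": welch_t(RIF_16S, NON_16S),
    "EF1a vs Wolbachia 16S, non-adm. (paired)": paired_t(NON_EF1A, NON_16S),
    "EF1a vs Wolbachia 16S, rifampicin (paired)": paired_t(RIF_EF1A, RIF_16S),
}
for label, (t, df) in results.items():
    print(f"{label}: t = {t:.4f}, d.f. = {df:.1f}")
```

The Welch degrees of freedom come out as fractional values, whereas the Notes report them as whole numbers.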


Acknowledgments

The first author expresses his cordial thanks to Nobuaki Kono and Yuki Takai (Keio University) for technical support in the molecular biological experiment and Takahiro Hosokawa (Kyushu University) for providing helpful information on Wolbachia and antibiotics. This work was supported by JSPS KAKENHI (21K06352), JST ACT-X (JPMJAX1918) for KT, and research funds from the Yamagata Prefectural Government and Tsuruoka City, Japan.

References

1. Kono N, Arakawa K (2019) Nanopore sequencing: review of potential applications in functional genomics. Develop Growth Differ 61(5):316–326. https://doi.org/10.1111/dgd.12608
2. Hoffmann A (2020) Wolbachia. Curr Biol 30(19):R1113–R1114. https://doi.org/10.1016/j.cub.2020.08.039
3. Hilgenboecker K, Hammerstein P, Schlattmann P, Telschow A, Werren JH (2008) How many species are infected with Wolbachia? – A statistical analysis of current data. FEMS Microbiol Lett 281(2):215–220. https://doi.org/10.1111/j.1574-6968.2008.01110.x
4. Klopfstein S, Kropf C, Baur H (2016) Wolbachia endosymbionts distort DNA barcoding in the parasitoid wasp genus Diplazon (Hymenoptera: Ichneumonidae). Zool J Linnean Soc 177(3):541–557. https://doi.org/10.1111/zoj.12380
5. Hosokawa T, Koga R, Kikuchi Y, Meng XY, Fukatsu T (2010) Wolbachia as a bacteriocyte-associated nutritional mutualist. Proc Natl Acad Sci U S A 107(2):769–774. https://doi.org/10.1073/pnas.0911476107
6. Kamiyama T, Shimada-Niwa Y, Tanaka H, Katayama M, Kuwabara T, Mori H, Kunihisa A, Itoh T, Toyoda A, Niwa R (2022) Whole-genome sequencing analysis and protocol for RNA interference of the endoparasitoid wasp Asobara japonica. DNA Res 29(4):dsac019. https://doi.org/10.1093/dnares/dsac019
7. Dedeine F, Vavre F, Fleury F, Loppin B, Hochberg ME, Boulétreau M (2001) Removing symbiotic Wolbachia bacteria specifically inhibits oogenesis in a parasitic wasp. Proc Natl Acad Sci U S A 98(11):6247–6252. https://doi.org/10.1073/pnas.101304298
8. Furihata S, Hirata M, Matsumoto H, Hayakawa Y, Skoulakis EMC (2015) Bacteria endosymbiont Wolbachia promotes parasitism of parasitoid wasp Asobara japonica. PLOS ONE 10(10):e0140914. https://doi.org/10.1371/journal.pone.0140914
9. Vavre F, Fleury F, Varaldi J, Fouillet P, Bouletreau M (2000) Evidence for female mortality in Wolbachia-mediated cytoplasmic incompatibility in haplodiploid insects: epidemiologic and evolutionary consequences. Evolution 54(1):191–200. https://doi.org/10.1111/j.0014-3820.2000.tb00019.x
10. Fitton MG, Shaw MR, Gauld ID (1988) Pimpline ichneumon-flies Hymenoptera, Ichneumonidae (Pimplinae). Handbook for the Identification of British Insects. Royal Entomological Society of London, London
11. Matsumoto R, Konishi K (2007) Life histories of two ichneumonid parasitoids of Cyclosa octotuberculata (Araneae): Reclinervellus tuberculatus (Uchida) and its new sympatric congener (Hymenoptera: Ichneumonidae: Pimplinae). Entomol Sci 10(3):267–278. https://doi.org/10.1111/j.1479-8298.2007.00223.x
12. Zschokke S, Herberstein ME (2005) Laboratory methods for maintaining and studying web-building spiders. J Arachnol 33(2):205–213. https://doi.org/10.1636/Ct04-72.1
13. Shaw MR (1997) Rearing Parasitic Hymenoptera, vol 25. The Amateur Entomologist. The Amateur Entomologists’ Society, London
14. Takasuka K (2021) A feeding aid for web-building spiders reluctant to build a web. Am Arachnol 86:3–4
15. Gauld ID, Bolton B (1988) The Hymenoptera. Oxford University Press, New York

Chapter 9

A Nanopore Sequencing Course for Graduate School Curriculum

Kazuharu Arakawa

Abstract

High-throughput long-read sequencing has become affordable enough for any molecular biology lab to use genome sequencing in its research. Complete genome sequencing and assembly of bacterial genomes is one such application; it is powerful yet simple enough for anyone without advanced molecular biology or bioinformatics skills to conduct on their own. High-throughput sequencing will become a basic routine tool in molecular biology labs in the near future, just like polymerase chain reaction and electrophoresis. To assist the use of such nanopore sequencing technologies, we designed a graduate school course for learning both the experimental and bioinformatic skills of complete bacterial genome sequencing and assembly.

Key words Nanopore sequencing, Complete bacterial genome sequencing, De novo assembly, Education, Genome annotation

1 Introduction

The portable and affordable MinION nanopore sequencer by Oxford Nanopore Technologies significantly lowered the entry cost for any lab to take advantage of high-throughput long-read sequencing [1]. With MinION, it is already easy enough for a single researcher to sequence and assemble a complete bacterial genome de novo in a couple of weeks. It is not difficult to imagine that such sequencing and assembly of plasmids and small genomes will be routinely conducted in most biology labs in the coming decade, just as we routinely perform polymerase chain reaction (PCR) and electrophoresis. This creates a new demand in education to teach graduate students the skills needed to fully utilize this technology, both in wet experimental biology and in dry bioinformatics analyses. In the Systems Biology Program, Graduate School of Media and Governance, Keio University, I have designed such a course to

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_9, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Fig. 1 Students of the Genome Engineering Workshop at the Systems Biology Program, Graduate School of Media and Governance, Keio University starting a nanopore sequencing run

fully utilize the nanopore sequencing technology for bacterial genome sequencing, offered since 2018 under the name “Genome Engineering Workshop” (Fig. 1). In this course, the students (1) bring in microbes of their choice for de novo sequencing and assembly, (2) extract long genomic DNA and sequence the genome using a nanopore sequencer (wet lab experiment), (3) assemble and annotate the obtained data (bioinformatics), and finally, (4) write up a genome report to be published in the peer-reviewed journal Microbiology Resource Announcements (MRA) of the American Society for Microbiology (ASM) and make the data publicly available by registering them with the National Center for Biotechnology Information (NCBI) and the DNA Data Bank of Japan (DDBJ). The entire process is conducted in a single course of 15 lectures of 90 min each (Table 1). Many genome report papers have been published each year as a result [2–17], including reports on potentially industrially or medically useful microbes, such as a cryophilic bacterium isolated from the vicinity of a hydrothermal vent of the Nankai Trough [2, 9] and a skin commensal bacterium isolated from the site of an orthopedic surgery [10]. Sequencing genomes de novo and publishing the results help motivate the students to participate in the course. In the following protocol, I detail the experimental and bioinformatic procedures used in this course, including long genomic DNA extraction, nanopore sequencing, and genome assembly and annotation. Of course, this class was made possible largely by the existence of the nanopore sequencer and the surrounding software tools. Regardless of the size of the microbial genome, it was not easy to determine the complete genome by short reads alone due to the presence of repetitive regions and rRNA/tRNA


Table 1 Course schedule of the Genome Engineering Workshop at the Systems Biology Program, Graduate School of Media and Governance, Keio University

Day 1 (lecture 1): Lecture on history of DNA sequencing. Note: announce the requirement for bringing in a microbe for sequencing. It is advised to have the first lecture at least 2 weeks before day 2.
Day 2 (lectures 2–5): Wet lab experiment I. Extraction of long genomic DNA.
Day 3 (lectures 6–9): Wet lab experiment II. Extraction of long genomic DNA.
Day 4 (lecture 10): Wet lab experiment III. Nanopore sequencing.
Day 5 (lectures 11–12): Bioinformatics analysis I. Genome assembly and error correction. Note: It is advised to have about 2 weeks’ interval from the experiment part to allow time for basecalling and Illumina sequencing.
Day 6 (lectures 13–14): Bioinformatics analysis II. Genome annotation.
Day 7 (lecture 15): Writing of genome report.

gene clusters. However, using long reads with an average length of several tens of kbp, a circular genome can now be assembled in a single shot using the Canu assembler [18] with default parameters. Genome annotation of microbial genomes is also extremely simple nowadays, and with DDBJ DFAST [19], you can simply upload an assembled FASTA file using a browser to perform fully automated gene prediction and functional annotation. The website also provides quality checking tools, where the assembly completeness can also be easily assessed using CheckM [20]. For microbial genomes, library preparation is also extremely simple and quick, finishing within an hour using the Rapid Barcoding Kit. This kit also allows the multiplexing of multiple samples to be run in a single flow cell to reduce sequencing costs. With these tools, anyone can now easily perform whole genome analysis without necessarily having advanced experimental or bioinformatics techniques or facilities. Taking advantage of these modern kits and software tools, the main focus of the experimental part of the course is dedicated to the extraction of long genomic DNA. In order to utilize the full potential of the nanopore sequencer, the length of the input


DNA is the key factor, and long DNA fragments cannot be obtained using the spin-column-based methods typically used for ordinary molecular biology applications. Moreover, careful pipetting and handling of DNA are essential throughout the experiment. In this course, we use the Genomic-tip 20/G Kit (QIAGEN), which uses a gravity-flow column to extract long genomic DNA, usually yielding fragments larger than several hundred kbp. Confirming such long fragments therefore requires pulsed-field gel electrophoresis.

2 Materials

2.1 Wet Lab Experiment Part

1. Pellet of approximately 1 × 10^9 bacterial cells
2. Centrifuge (capable of 10,000 × g or higher)
3. Heat block
4. Thermal cycler
5. Genomic-tip 20/G Kit (QIAGEN)
6. Genomic DNA Buffer Set (QIAGEN) (see Note 1)
7. 100 mg/mL lysozyme stock solution
8. RNase A solution (100 mg/mL)
9. >600 mAU/mL Proteinase K solution
10. 70% ethanol
11. Isopropanol
12. 15-mL centrifuge tubes
13. Buffer EB (QIAGEN) or 10 mM Tris–HCl, pH 7.0–8.5
14. 2-mL microtubes
15. Tube rotator
16. Rapid Barcoding Kit (Oxford Nanopore Technologies)
17. MinION device and Flow Cell (Oxford Nanopore Technologies)
18. Used and washed MinION Flow Cell
19. NanoDrop (Thermo Fisher Scientific)
20. Qubit Fluorometer and Qubit dsDNA BR Reagent (Thermo Fisher Scientific)
21. TapeStation and Genomic DNA ScreenTape (Agilent)
22. Pippin Pulse (Sage Science) or other pulsed-field gel electrophoresis equipment
23. 0.75% agarose gel for pulsed-field gel electrophoresis
24. 0.5× KBB electrophoresis buffer
25. SYBR Gold
26. CHEF DNA size marker for pulsed-field gel electrophoresis (Bio-Rad)
27. KAPA HyperPlus Kit (Kapa Biosystems)
28. Illumina sequencer with sequencing reagent (Illumina)

2.2 Bioinformatics Part

1. UNIX server, or alternatively, any UNIX environment with sufficient memory for assembly. The required RAM differs according to the size of the genome, but typically 32 GB should be sufficient.
2. Conda or similar package management environment.
3. BBMap (http://sourceforge.net/projects/bbmap/).
4. NanoPlot [21].
5. Canu [18].
6. Bwa [22].
7. Samtools [23].
8. Pilon [24].
9. gVolante server (https://gvolante.riken.jp) [25].
10. DFAST server (https://dfast.ddbj.nig.ac.jp) [19].
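Before the bioinformatics sessions, it can help to confirm that the command-line tools listed above are installed on the server. The following is a minimal sketch; the executable names in `REQUIRED_TOOLS` are assumptions based on how these tools are commonly installed (for example through Conda) and may differ on a given system, so they should be adjusted to the local installation.

```python
import shutil

# Tool name -> executable expected on PATH.
# The executable names are assumptions; check them against the local installation.
REQUIRED_TOOLS = {
    "BBMap": "bbmap.sh",
    "NanoPlot": "NanoPlot",
    "Canu": "canu",
    "Bwa": "bwa",
    "Samtools": "samtools",
    "Pilon": "pilon",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of the tools whose executable cannot be found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All tools found")
```

The `which` argument is only there so that the lookup can be replaced in a test; in normal use the default `shutil.which` searches the current PATH.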

3 Methods

3.1 Genomic DNA Extraction

1. Announce the requirement for students to bring in their own microbe for sequencing well ahead (>2 weeks) of the actual experiment (see Note 2). The requirements are as follows:
• The microbe must be safe, not hazardous, and not infectious and can be handled under biosafety level (BSL) 1.
• The species is identified (preferably with 16S rRNA sequence identity) or a strain ID is available.
• No genome sequence is yet available.
• All potential co-authors agree that the sequence data will be openly published and deposited to a public database immediately after the course.
• The student can prepare and bring a cell pellet (10^9 cells or OD 1) kept at -80 °C by themselves.
• If the bacterium is Gram-positive and lysozyme is not enough for lysis, appropriate lysis enzymes can be provided (such as achromopeptidase for lactic acid bacteria).
• The student agrees to share the sample with other students if desired.
• All of the above is fully confirmed by the lecturer before preparing the sample.


2. Prepare a culture of Escherichia coli (10^9 cells or OD 1) for demonstration and for students who neither bring their own samples nor choose to share samples brought in by others.
3. If not already pelleted, centrifuge the bacterial culture containing 1 × 10^9 or slightly fewer cells at 5000 × g, 4 °C, for 10 min, and discard the supernatant (see Note 3).
4. Add 2 μL of RNase A solution to 1 mL of Buffer B1, and add the mixture to the pelleted cells. Resuspend the cells by vortexing, and add 20 μL of lysozyme stock solution and 45 μL of Proteinase K solution. Incubate at 37 °C for 1 h (see Note 4).
5. Add 0.35 mL of Buffer B2, vortex briefly to mix, and incubate at 50 °C for 30 min.
6. Centrifuge for 15 min at 10,000 × g, 4 °C (see Note 5). Equilibrate the Genomic-tip placed on a 15-mL centrifuge tube with 1 mL of Buffer QBT while waiting for the centrifuge.
7. Taking care not to disturb the precipitate, transfer the supernatant to the equilibrated column, and allow it to pass through by gravity flow (see Note 6).
8. Wash the column three times with 1 mL of Buffer QC. Discard the centrifuge tube and the flowthrough after the wash, and place the Genomic-tip in a new centrifuge tube.
9. Elute the DNA with 2 mL of Buffer QF prewarmed to 50 °C. The column can be discarded, but do not discard the tip holders; these will be reused.
10. Precipitate the DNA by adding 1.4 mL of isopropanol (kept at RT) to the eluted DNA; gently invert the tube 10–20 times. If DNA is visible as small white fibrils, carefully collect them with a pipette tip with minimal aspiration of isopropanol, and transfer the DNA to ice-cold 70% ethanol. Tap the tube to gently wash the DNA, collect the DNA again with a pipette tip with minimal aspiration of ethanol, and transfer the DNA to 100 μL of Buffer EB. Keep the tube lid open for 15 min to evaporate the remaining ethanol. Skip to step 14. If DNA is not visible, continue with the following steps.
11. Aliquot the 3.4 mL sample into two 2-mL tubes, 1.7 mL each. Centrifuge immediately at 5000 × g for 15 min at 4 °C. Proceed immediately to the next step.
12. Taking care not to disturb the pellet, discard the supernatant by aspirating from the tube wall opposite the pellet. Add 1 mL of ice-cold 70% ethanol to one tube, gently pipette against the pellet to release it from the tube wall, and transfer all of the 70% ethanol with the DNA pellet to the other tube to merge the aliquots. Centrifuge at 5000 × g for 15 min at 4 °C (see Note 7).


13. Taking care not to disturb the pellet, discard the supernatant. Add 50–100 μL of Buffer EB depending on the pellet size, and keep the tube lid open for 15 min to evaporate the remaining ethanol (see Note 8).
14. Place the tube with the extracted DNA on a tube rotator at 3 rpm overnight to fully dissolve the DNA. The following quality checks (QC) can be performed during this incubation.
15. Check DNA purity with NanoDrop. A260/230 and A260/280 should both be above 1.6, ideally over 2.0. When A260/230 is too low, the amount of DNA may be insufficient, or ethanol may remain.
16. Check DNA quantity with Qubit and fragment size with TapeStation (see Note 9). For nanopore sequencing, at least 100 ng/7.5 μL is required. At this stage, the TapeStation must show fragments above 60 kbp in length.
17. If the extracted DNA passes the above QCs, run 50 ng/5 μL of the DNA in pulsed-field gel electrophoresis to check the fragment size distribution in the 50–200 kbp range. The fragments should ideally run above the 200-kbp marker (see Note 10). If the extracted DNA does not pass these QC criteria, which is typical for many students on the first trial, retry the DNA extraction the next day, carefully reconsidering the amount of input and the incubation time for lysis.

3.2 Nanopore Sequencing
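Before library preparation starts, the pass criteria from steps 15 and 16 of Subheading 3.1 can be gathered into one check so that each student sample is judged in the same way. The following is a minimal sketch: the function name `qc_failures` and the example readings are hypothetical, and a TapeStation reading reported as ">60 K" is assumed to be entered as 60000 and treated as passing.

```python
# Thresholds taken from steps 15 and 16 of Subheading 3.1.
MIN_PURITY_RATIO = 1.6      # A260/280 and A260/230 should both be above this value
MIN_NG_PER_UL = 100 / 7.5   # at least 100 ng in the 7.5 uL used for one library
MIN_FRAGMENT_BP = 60_000    # TapeStation fragment size; ">60 K" is entered as 60000 (assumption)


def qc_failures(a260_280, a260_230, ng_per_ul, fragment_bp):
    """Return the reasons a sample fails QC; an empty list means it passes."""
    failures = []
    if a260_280 <= MIN_PURITY_RATIO:
        failures.append("A260/280 not above 1.6")
    if a260_230 <= MIN_PURITY_RATIO:
        failures.append("A260/230 not above 1.6")
    if ng_per_ul < MIN_NG_PER_UL:
        failures.append("less than 100 ng per 7.5 uL")
    if fragment_bp < MIN_FRAGMENT_BP:
        failures.append("TapeStation fragments below 60 kbp")
    return failures


# Hypothetical readings for two student samples.
print(qc_failures(1.9, 2.1, 25.0, 60_000))
print(qc_failures(1.8, 1.2, 10.0, 35_000))
```

Returning the list of reasons rather than a single pass/fail flag makes it easier to decide what to reconsider (input amount, lysis time, or ethanol carry-over) when an extraction is repeated.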

1. Using previously used and washed MinION Flow Cells, allow the students to practice the flow cell priming and drop-on sample loading procedures with water in place of the sample/priming buffer. Any air bubble introduced into the flow cell instantly kills the active pores, so the students must master flow cell handling at this stage.
2. Group students and samples into multiplexing groups of around 4–5. DNA concentration should be adjusted to be equal within the multiplexing group (see Note 12). Decide which barcode to use for each sample, and hand the corresponding barcoded Fragmentation Mix to each student. If multiple students extracted DNA from the same sample, they should choose at this stage which of the extracted DNA samples to use in the following procedures.
3. Set one thermal cycler to 30 °C with the heated lid off and another to 80 °C with the heated lid set to 100 °C. Immediately after adding 2.5 μL of Fragmentation Mix to 7.5 μL of DNA, mix quickly by tapping, incubate at 30 °C for exactly 1 min, then immediately transfer the tube to 80 °C and incubate for 1 min (see Note 11). Keep the sample on ice after the incubation.


4. Mix the multiplexing samples in equal amounts so that the total volume does not exceed 20–30 μL. When 4–5 samples are multiplexed, 5 μL of each barcoded sample can be mixed to give a final volume of 20–25 μL. Add 1 μL of RAP per 10 μL of barcoded DNA (2 μL per 20 μL or 2.5 μL per 25 μL), gently mix by tapping, spin down, and incubate for 5 min at room temperature (see Note 12).
5. Prepare the loading library by adding 34 μL Sequencing Buffer (SQB) to the sample, and adjust the volume to 75 μL with Loading Beads (LB). For example, if five barcoded samples are mixed at 5 μL each, the resulting DNA library with RAP is 27.5 μL, so 13.5 μL of LB is added.
6. Prime the flow cell and run sequencing according to the manufacturer's protocol.
7. Before the bioinformatics part, the lecturer basecalls, demultiplexes, and merges the fastq files for the students. Additionally, use about 10 ng of the remaining DNA to prepare an Illumina sequencing library with the KAPA HyperPlus Kit, and sequence it on an Illumina sequencer following the manufacturer's protocols to obtain short reads for error correction (see Note 13).

3.3 Genome Assembly and Annotation

1. Install the required software with Conda.
2. Check the quality of the sequenced reads with NanoPlot. For example, with BC01.fastq:

    NanoPlot -t 24 --fastq BC01.fastq -p BC01

3. Check the read length distribution with stats.sh of BBMap:

    stats.sh BC01.fastq

4. Extract reads over a length threshold chosen from the distribution found above so that about ×50–×100 coverage of the genome is retained. For example, if sequenced reads over 10 kbp in length amount to 240 Mbp for a genome with an expected size of 4 Mbp (i.e., ×60 coverage), extract reads over 10 kbp with reformat.sh of BBMap:

    reformat.sh in=BC01.fastq out=BC01-filter10k.fq minlength=10000 qin=33

Check again with stats.sh that the size distribution and total length of the filtered sequences are as expected:

    stats.sh BC01-filter10k.fq


5. Assemble the filtered reads with Canu with default parameters (see Note 14):

    canu -nanopore-raw BC01-filter10k.fq -d BC01 -p BC01 -fast useGrid=false genomeSize=4m maxThreads=8

The expected genome size is specified with the genomeSize option. Check whether the assembly resulted in contigs of the expected genome size with stats.sh:

    stats.sh BC01/BC01.contigs.fasta

6. Check the circularity of the contigs by looking at the FASTA headers:

    grep ">" BC01.contigs.fasta

The header contains the contig length as well as the "suggestCircular" flag, which shows whether the chromosome seems circular. If the assembled genome is expected to be linear, compare the assembled size with the expected size to see if the largest contig corresponds to the complete chromosome. If the genome is expected to be circular, check that the longest contig is of the expected size and is circular. Other smaller contigs could be plasmids if they are suggested to be circular. Discard sequences that do not seem to be either the chromosome or a plasmid. For circular contigs, use the coordinates given with the suggestCircular flag to manually delete the overlapping end with a text editor.
7. Access the DFAST server with a web browser; go to "Taxonomy/Completeness Check" from the top-left pull-down menu named "Analysis" (Fig. 2; see Note 15). Upload the assembled genome FASTA file to perform the completeness check. At this stage, sequencing errors remain in the assembly, so the completeness may still be low.
8. Error correct the assembly using Illumina reads with Pilon:

    bwa index BC01.contigs.fasta
    bwa mem -t 8 BC01.contigs.fasta BC01_S1_merged_R1.fq | samtools sort -@8 -o sorted.bam
    samtools index sorted.bam
    pilon --genome BC01.contigs.fasta --bam sorted.bam --threads 4 --output pilon1

Check with DFAST/CheckM again for completeness. Repeat Pilon correction several times until the completeness reaches 100%, or saturates (see Note 16).


Fig. 2 DFAST web server interface. “Genome Annotation” or completeness assessment with CheckM at “Taxonomy/Completeness Check” can be toggled from the analysis pull-down menu located in the top left

9. Annotate the genome by uploading the error-corrected FASTA file to the DFAST server under the "Genome Annotation" Analysis menu. When the genome is circular, always go to the Advanced Options, and check the "Rotate/flip the chromosome so that the dnaA gene comes first" option. To increase the annotated information, enabling the "Enable HMM scan against TIGRFAM" and "Enable RPSBLAST against COG" options is recommended.

3.4 Genome Report

1. Write up the genome report following the instructions of MRA by ASM using its template (https://journals.asm.org/journal/mra/submit). The genome report is highly structured, with only 500 words of text, so it should be relatively easy.
2. By entering the necessary information on the DFAST annotated genome page, an annotated sequence file ready for DDBJ submission can be generated directly. Using this data, register the assembled genome and raw reads with DDBJ or NCBI to make the sequence publicly available.
3. Submit the genome report to MRA with the registered accession numbers and wait for the peer-review results.
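The read-length filtering in Subheading 3.3, step 4 amounts to finding the longest length cutoff at which the retained reads still reach the target coverage. The following Python sketch illustrates only that arithmetic; the function names and the toy read lengths are assumptions made for illustration, and in practice the read lengths would be taken from the stats.sh or NanoPlot output.

```python
def pick_length_cutoff(read_lengths, genome_size, target_coverage):
    """Return the largest cutoff whose retained reads (length >= cutoff)
    still sum to at least target_coverage x genome_size bases.
    Returns None if even the full read set falls short."""
    needed = genome_size * target_coverage
    total = 0
    # Walk from the longest read downwards, accumulating bases.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= needed:
            return length
    return None


def coverage(read_lengths, genome_size, cutoff=0):
    """Coverage obtained from reads of at least `cutoff` bases."""
    kept = sum(n for n in read_lengths if n >= cutoff)
    return kept / genome_size


# Toy data (hypothetical): seven reads on a 1000-bp "genome", target x2.
toy_reads = [900, 800, 700, 600, 500, 400, 300]
cutoff = pick_length_cutoff(toy_reads, genome_size=1000, target_coverage=2)
print(cutoff, coverage(toy_reads, 1000, cutoff))
```

With the figures quoted in step 4 (240 Mbp of reads over 10 kbp for a 4-Mbp genome), `coverage` returns 60, matching the ×60 stated there.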

4 Notes

1. The composition of all buffers in this buffer set is available in the QIAGEN Genomic DNA Handbook and is as follows:
• Buffer B1 (bacterial lysis buffer): 0.5% Tween-20, 0.5% Triton X-100, pH 8.0
• Buffer B2 (bacterial lysis buffer): 3 M guanidine HCl, 20% Tween-20
• Buffer QBT (equilibration buffer): 750 mM NaCl, 50 mM MOPS pH 7.0, 15% isopropanol, 0.15% Triton X-100
• Buffer QC (wash buffer): 1.0 M NaCl, 50 mM MOPS pH 7.0, 15% isopropanol
• Buffer QF (elution buffer): 1.25 M NaCl, 50 mM Tris–HCl pH 8.5, 15% isopropanol
The equilibration buffer QBT tends to run out much faster than the other buffers, so preparing only Buffer QBT may be efficient.
2. It is a good idea to introduce the students to bioresource banks such as RIKEN BRC (https://jcm.brc.riken.jp/), which maintains a culture collection of a multitude of microbes, many of which do not have sequenced genomes. The lecturer can assist the students in obtaining samples from such culture banks by signing the Material Transfer Agreements.
3. The starting amount is extremely critical in the extraction step. Usually, more than 1 × 10^9 cells can easily clog the column and make the gravity flow very slow or even unfeasible. Some bacteria produce and contain large amounts of viscous compounds that can also interfere with the gravity flow. In such cases, the input amount should be halved or lowered even further. In other words, if one encounters problems with the gravity flow due to a clogged column, or if the amount of extracted DNA is too low, the amount of starting material is the critical parameter to adjust when retrying the extraction steps.
4. Also add other lysis enzymes here if lysozyme is not sufficient. A longer incubation time is better in this step for thorough lysis, so it is increased from the default 30 min, but 1 h is probably the longest feasible in classroom settings. In a research context, this step can be longer, such as 2 h.
5. Centrifugation is optional in the manufacturer's protocol, but this step is critical to reduce clogging of the column. The centrifugal force can be higher (such as 15,000 × g).
6. From this step onwards, the DNA is freed from the cell and is easily fragmented. Therefore, all pipetting procedures must be performed carefully and slowly, with wide-bore tips or 1-mL tips. The column may clog even with all of the above precautions. If the column stops flowing by gravity, the flow can be assisted by applying gentle pressure with a syringe connected to the column with tightly rolled parafilm. Again, this process must be very gentle, as slow as one drop per 10 s, to minimize DNA fragmentation.
7. It is fine if the pellet is not visible at this stage; assume that it is there. The pellet is less visible and more easily released from the tube wall in isopropanol than in ethanol, so this step must be performed quickly. It is also fine in this step if a trace amount of isopropanol remains after discarding the supernatant.
8. You should be able to see the pellet at this stage. If not, the starting amount was probably not sufficient. If the pellet is faintly visible, lower the elution buffer volume to 50 μL to obtain a higher concentration of DNA. Long DNA fragments obtained with this protocol do not dissolve easily, especially when dried. This is why the remaining ethanol is evaporated after adding the elution buffer.
9. For high-throughput sequencing, DNA quantification by spectrophotometry, as with NanoDrop, is not accurate and is therefore not recommended. Always use a fluorescence-based measurement such as Qubit. Other low-volume fragment analysis equipment such as BioAnalyzer and Fragment Analyzer can be used in place of TapeStation.
10. We use Pippin Pulse with FastGene Agarose 0.75%, 0.5× KBB, 75 V for 16 h and visualize with SYBR Gold, but any pulsed-field gel electrophoresis system can be used.
11. Addition of adapters by tagmentation in the Rapid Sequencing Kits uses transposases, and this step results in fragmentation of the DNA. Moreover, this process takes place at room temperature, so the reaction time of 1 min needs to be strictly controlled, or the DNA can be over-fragmented. Therefore, we highly recommend using two thermal cyclers and quickly transferring the sample manually between them after the 1-min incubation.
12. The manufacturer's protocol for the Rapid Barcoding Kit recommends AMPure XP concentration for multiplexing; however, AMPure XP purification requires some skill, and letting the students do this process carries the risk of losing the precious sample. On the other hand, of the 75 μL sample loading mixture for nanopore sequencing, the 4.5 μL of nuclease-free water and half of the 25.5 μL of Loading Beads can be safely replaced with the DNA library, which raises the potential DNA library volume from 11 μL to about 27 μL. We therefore multiplex the samples by direct mixing, without AMPure XP procedures. From the sequencing point of view, the barcoding kit typically identifies only 50–75% of the sequenced reads correctly, so multiplexing five samples of a 4-Mbp bacterial genome at 100× requires 4 Mbp × 100 × 5/0.5 = 4 Gbp of sequenced reads. This can be sequenced with about half of the MinION Flow Cell capacity, so the flow cell can be run until it yields 4 Gbp of sequences, washed, and then reused for other multiplexed samples to maximize cost-effectiveness. If fewer samples are multiplexed, Flongle may be a more economical option.
13. This step could be designed to be performed by the students, but given the limited time of 15 lectures per course and the technical difficulty of Illumina library preparation, which includes multiple AMPure XP purification steps, we currently choose to do it on the lecturer's side.
14. Canu is a sophisticated and comprehensive pipeline for assembly, and it includes a quality trimming step. This is usually sufficient, so no prior trimming is performed here. Alternatively, one can use tools such as NanoFilt to filter the reads. If the assembly does not result in a single chromosome of the expected length, try different length cutoff thresholds (such as a higher threshold for longer average reads with less coverage, no length filtering with all reads used, or simply subsampling the reads to sufficient coverage).
15. CheckM is the de facto QC software for assembly completeness of bacterial genomes. For eukaryotes, BUSCO should be used, which can be easily run from the gVolante server.
16. For CheckM, the choice of an appropriate taxon is sometimes critical to assess completeness correctly. If automatic taxon inference does not result in a good completeness measure, manually set the taxon at the nearest corresponding family, or sometimes even at the genus level. At the time of writing, error correction with short reads is necessary to generate a high-quality genome without remaining indels. However, nanopore-only error correction should be possible in the future, and currently Nanopolish, Tombo, or Racon can be used in combination to create a somewhat polished genome with only nanopore reads.
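The volume and yield arithmetic in Subheading 3.2, steps 4–5 and Note 12 can be checked with a few lines of Python. This is a minimal sketch under the figures stated in the text (34 μL SQB, 75 μL final volume, 1 μL RAP per 10 μL of barcoded DNA, and a barcode identification rate of 0.5); the function names are hypothetical and are introduced here only for illustration.

```python
def required_yield_bp(genome_size_bp, coverage, n_samples, demux_rate):
    """Bases to sequence so that each of n_samples reaches the requested
    coverage when only a fraction demux_rate of reads gets a barcode."""
    return genome_size_bp * coverage * n_samples / demux_rate


def loading_mix(n_samples, vol_per_sample_ul, sqb_ul=34.0, total_ul=75.0):
    """Volumes (in uL) for the direct-mixing library described in the text."""
    pooled = n_samples * vol_per_sample_ul
    rap = pooled / 10.0                    # 1 uL RAP per 10 uL barcoded DNA
    library = pooled + rap
    beads = total_ul - sqb_ul - library    # top up to the final volume with LB
    if beads < 0:
        raise ValueError("library volume exceeds the loading mixture")
    return {"pooled": pooled, "rap": rap, "library": library, "beads": beads}


# Five samples of a 4-Mbp genome at 100x, half of the reads demultiplexed.
print(required_yield_bp(4_000_000, 100, 5, 0.5))   # 4 Gbp, as in Note 12
# Five barcoded samples mixed at 5 uL each, as in step 5.
print(loading_mix(5, 5.0))
```

For five samples at 5 μL each, the sketch reproduces the 27.5 μL library and 13.5 μL of Loading Beads given in step 5.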

Acknowledgments

The author thanks Yuki Takai and Naoko Ishii for technical assistance. This work is supported by research funds from the Yamagata Prefectural Government and Tsuruoka City, Japan.


References

1. Kono N, Arakawa K (2019) Nanopore sequencing: review of potential applications in functional genomics. Dev Growth Differ 61(5):316–326. https://doi.org/10.1111/dgd.12608
2. Evans-Yamamoto D, Takeuchi N, Masuda T, Murai Y, Onuma Y, Mori H, Masuyama N, Ishiguro S, Yachie N, Arakawa K (2019) Complete genome sequence of Psychrobacter sp. strain KH172YL61, isolated from deep-sea sediments in the Nankai Trough, Japan. Microbiol Resour Announc 8(16). https://doi.org/10.1128/MRA.00326-19
3. Nagata S, Ii KM, Tsukimi T, Miura MC, Galipon J, Arakawa K (2019) Complete genome sequence of Halomonas olivaria, a moderately halophilic bacterium isolated from olive processing effluents, obtained by nanopore sequencing. Microbiol Resour Announc 8(18). https://doi.org/10.1128/MRA.00144-19
4. Nguyen TTT, Oshima K, Toh H, Khasnobish A, Fujii Y, Arakawa K, Morita H (2019) Draft genome sequence of Butyricimonas faecihominis 30A1, isolated from feces of a Japanese Alzheimer's disease patient. Microbiol Resour Announc 8(23). https://doi.org/10.1128/MRA.00462-19
5. Saito M, Nishigata A, Galipon J, Arakawa K (2019) Complete genome sequence of Halomonas sulfidaeris strain Esulfide1 isolated from a metal sulfide rock at a depth of 2,200 meters, obtained using nanopore sequencing. Microbiol Resour Announc 8(23). https://doi.org/10.1128/MRA.00327-19
6. Tsurumaki M, Deno S, Galipon J, Arakawa K (2019) Complete genome sequence of halophilic deep-sea bacterium Halomonas axialensis strain Althf1. Microbiol Resour Announc 8(31). https://doi.org/10.1128/MRA.00839-19
7. Inoue H, Shibata S, Ii K, Inoue J, Fukuda S, Arakawa K (2020) Complete genome sequence of Bifidobacterium longum strain Jih1, isolated from human feces. Microbiol Resour Announc 9(22). https://doi.org/10.1128/MRA.00319-20
8. Kurihara Y, Kawai S, Sakai A, Galipon J, Arakawa K (2020) Complete genome sequence of Halomonas meridiana strain Eplume2, isolated from a hydrothermal plume in the Northeast Pacific Ocean. Microbiol Resour Announc 9(20). https://doi.org/10.1128/MRA.00330-20
9. Murai Y, Masuda T, Onuma Y, Evans-Yamamoto D, Takeuchi N, Mori H, Masuyama N, Ishiguro S, Yachie N, Arakawa K (2020) Complete genome sequence of Bacillus sp. strain KH172YL63, isolated from deep-sea sediment. Microbiol Resour Announc 9(16). https://doi.org/10.1128/MRA.00291-20
10. Seo K, Tanaka K, Fukuda S, Arakawa K (2020) Complete genome sequences of two Cutibacterium acnes strains isolated from an orthopedic surgical site. Microbiol Resour Announc 9(17). https://doi.org/10.1128/MRA.00290-20
11. Takahashi Y, Takahashi H, Galipon J, Arakawa K (2020) Complete genome sequence of Halomonas meridiana strain Slthf1, isolated from a deep-sea thermal vent. Microbiol Resour Announc 9(16). https://doi.org/10.1128/MRA.00292-20
12. Takeyama N, Huang M, Sato K, Galipon J, Arakawa K (2020) Complete genome sequence of Halomonas hydrothermalis strain Slthf2, a halophilic bacterium isolated from a deep-sea hydrothermal-vent environment. Microbiol Resour Announc 9(15). https://doi.org/10.1128/MRA.00294-20
13. Nishimura K, Ikarashi M, Yasuda Y, Sato M, Cano Guerrero M, Galipon J, Arakawa K (2021) Complete genome sequence of Sphingomonas paucimobilis strain Kira, isolated from human neuroblastoma SH-SY5Y cell cultures supplemented with retinoic acid. Microbiol Resour Announc 10(6). https://doi.org/10.1128/MRA.01156-20
14. Takahashi H, Yang J, Yamamoto H, Fukuda S, Arakawa K (2021) Complete genome sequence of Adlercreutzia equolifaciens subsp. celatus DSM 18785. Microbiol Resour Announc 10(19). https://doi.org/10.1128/MRA.00354-21
15. Warashina T, Yamamura S, Suzuki H, Amachi S, Arakawa K (2021) Complete genome sequence of Geobacter sp. strain SVR, an antimonate-reducing bacterium isolated from antimony-rich mine soil. Microbiol Resour Announc 10(14). https://doi.org/10.1128/MRA.00142-21
16. Ishikawa S, Huang M, Tomita A, Kurihara Y, Watanabe R, Iwai H, Arakawa K (2022) Complete genome sequences of four bacteria isolated from the gut of a Spiny Ant (Polyrhachis lamellidens). Microbiol Resour Announc 11(7):e0033322. https://doi.org/10.1128/mra.00333-22
17. Takeda T, Fukumitsu N, Yuzawa S, Arakawa K (2022) Complete genome sequence of Streptomyces albus strain G153. Microbiol Resour Announc 11(7):e0033222. https://doi.org/10.1128/mra.00332-22
18. Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27(5):722–736. https://doi.org/10.1101/gr.215087.116
19. Tanizawa Y, Fujisawa T, Nakamura Y (2018) DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics 34(6):1037–1039. https://doi.org/10.1093/bioinformatics/btx713
20. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25(7):1043–1055. https://doi.org/10.1101/gr.186072.114
21. De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C (2018) NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34(15):2666–2669. https://doi.org/10.1093/bioinformatics/bty149
22. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14):1754–1760. https://doi.org/10.1093/bioinformatics/btp324
23. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079. https://doi.org/10.1093/bioinformatics/btp352
24. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9(11):e112963. https://doi.org/10.1371/journal.pone.0112963
25. Nishimura O, Hara Y, Kuraku S (2017) gVolante for standardizing completeness assessment of genome and transcriptome assemblies. Bioinformatics 33(22):3635–3637. https://doi.org/10.1093/bioinformatics/btx445

Part II Analysis of Repetitive Regions and Structural Variants

Chapter 10

A Guide to Sequencing for Long Repetitive Regions

Nobuaki Kono

Abstract

Full-length analysis of genes with highly repetitive sequences is challenging in two respects: assembly algorithm and sequencing accuracy. The de Bruijn graph often used in short-read assembly cannot distinguish adjacent repeat units. On the other hand, the accuracy of long reads is not yet high enough to identify each and every repeat unit. In this chapter, I present an example of a strategy to solve these problems and obtain the full length of long repeats by combining the extraction and assembly of repeat units based on overlap-layout-consensus with scaffolding by long reads.

Key words Highly repetitive sequence, De novo sequencing, Structural protein, Overlap-layout-consensus, Non-model organism

1 Introduction

A genome project usually proceeds through a set of steps: genome sequencing, assembly, curation, and annotation. However, when dealing with genomes containing specific gene structures, conventional methods do not work, and specially customized processing is required. A long repetitive sequence is a typical example of a gene with such a unique property. Long repetitive sequences are a common feature of structural proteins such as the silk proteins of silkworms and spiders. In recent years, these proteins have attracted much attention as protein materials with potential for artificial use. However, because of their long repetitive structure, their full-length sequences remained unknown for a long time. The first reason is length: spidroin (spider fibroin) genes average more than 10 kbp [1], nearly ten times larger than many common genes. Transcriptome analysis, which sequences cDNA synthesized from mRNA, effectively reveals the complete picture of a gene. However, the efficiency of cDNA synthesis decreases in inverse proportion to gene length, making it challenging to obtain full-length sequence data of long genes. Targeting the region without relying on cDNA synthesis, by PCR amplification followed by DNA sequencing, is not optimal either. The problem there is the repetitive structure: PCR amplification of repetitive regions is unstable and can produce chimeras and artifacts. Deep sequencing of genomic DNA may seem to remain as a solution, but that is where the most significant challenge lies. De novo assembly based on short reads is usually conducted with the de Bruijn graph. This method is based on short k-mers, and when a repetitive region appears, the graph closes on itself and can no longer be extended. Due to these numerous problems, most of the spidroin sequences reported in 2019 were either determined by Sanger sequencing with steady cloning [2] or reported only in fragments [3]. However, the full-length sequence is now considered essential both for the artificial design of spidroins and for defining gene types. Therefore, I introduce the strategy that has been developed as a solution to these problems and has been successful in the analysis of spidroin genes with long repetitive sequences in spiders and bagworms [1, 4–8]. This protocol consists of an experimental part in the first half, from nucleic acid preparation to sequencing, and a bioinformatics part in the second half, introducing analysis algorithms that use the obtained sequencing data.

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_10, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023
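The limitation of k-mer-based assembly mentioned above can be illustrated with a toy example. In the following Python sketch, the repeat unit, the flanking sequences, and the value of k are all made-up values chosen for illustration; the point is only that the set of distinct k-mers, which defines the nodes of a de Bruijn graph, is the same whether the unit is repeated 3 or 10 times, so the number of copies cannot be recovered from the k-mers alone.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical sequences, for illustration only.
unit = "GGAGCAGGTGGCTATGGACCA"          # 21-nt repeat unit
left = "ATTCGCATGCAATCGGATTAC"          # non-repetitive 5' flank
right = "TTGACCGTAGGCATCAAGTCA"         # non-repetitive 3' flank
k = 15                                   # k shorter than the repeat unit

gene_3x = left + unit * 3 + right
gene_10x = left + unit * 10 + right

# The two genes differ in length, but their k-mer sets are identical,
# so a k-mer graph built from either one looks the same.
print(len(gene_3x), len(gene_10x))
print(kmers(gene_3x, k) == kmers(gene_10x, k))
```

Reads longer than the whole repeat array do not have this problem, which is why long reads are used for scaffolding later in the protocol.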

2 Materials

2.1 Sample Homogenization

1. Multi-Beads Shocker (Yasui Kikai)
2. Multi-Beads Shocker 2 mL tube (Yasui Kikai)
3. Multi-Beads Shocker metal cone (Yasui Kikai)
4. BioMasher II (Nippi)
5. Thermo Shaker

2.2 Nucleotide Extraction, Quantification, and Qualification

1. QIAGEN Genomic-tip 20/G (QIAGEN)
2. TRIzol reagent (Life Technologies)
3. NucleoTrap mRNA Mini (MACHEREY-NAGEL)
4. Dynabeads mRNA Purification Kit (Thermo Fisher Scientific)
5. Agilent 2200 TapeStation (Agilent)
6. Genomic DNA ScreenTape (Agilent)
7. Genomic DNA Reagents (Agilent)
8. RNA ScreenTape (Agilent)
9. RNA ScreenTape Reagents (Agilent)
10. Qubit Fluorometer (Thermo Fisher Scientific)
11. Qubit dsDNA Broad Range Assay Kit (Thermo Fisher Scientific)
12. Qubit RNA Broad Range Assay Kit (Thermo Fisher Scientific)
13. NanoDrop 2000 (Thermo Fisher Scientific)
14. BluePippin (Sage Science)
15. BluePippin High Pass Plus (Sage Science)

2.3 Sequencing Instruments and Library Prep Kits

1. MinION Flow Cell (Oxford Nanopore Technologies)
2. Ligation Sequencing Kit (Oxford Nanopore Technologies)
3. Direct RNA Sequencing Kit (Oxford Nanopore Technologies)
4. MinION or GridION (Oxford Nanopore Technologies)
5. Illumina Sequencer and sequencing reagents
6. KAPA HyperPlus Kit (for Illumina) (KAPA BIOSYSTEMS)
7. SMART-Seq v4 Ultra Low Input RNA Kit for Sequencing (Takara Bio)
8. NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs)
9. NEBNext Ultra II RNA Library Prep Kit for Illumina (New England Biolabs)

2.4 PC Spec and Software

1. UNIX computer: processor (Intel (R) Xeon (R) CPU E5-2667 v2 at 3.30 GHz, 32 threads) and memory (256 GB) were used in this study.
2. Trimmomatic (0.33) [9]. https://github.com/usadellab/Trimmomatic.
3. Bridger (2014-12-01) [10]. https://sourceforge.net/projects/rnaseqassembly/files/?source=navbar.
4. SeqKit (0.15.0) [11]. https://bioinf.shenwei.me/seqkit/.
5. proovread (2.13.4) [12]. https://github.com/BioInf-Wuerzburg/proovread.
6. BLAST (2.2.30+). ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST.
7. MAFFT (7.273). https://mafft.cbrc.jp/alignment/software/.

3 Methods

The first half (see Subheadings 3.1, 3.2, 3.3, 3.4 and 3.5) introduces the experimental protocol from the extraction of nucleic acids (genomic DNA and RNA) from specimens to sequencing, and the second half (see Subheadings 3.6, 3.7, 3.8, 3.9, 3.10 and 3.11) introduces the curation process of long repetitive sequences using the obtained sequence data. Because the kits are used according to the manufacturers' instructions, detailed descriptions of the kit protocols are omitted. Only points of concern that should be noted or that were modified empirically are mentioned.


3.1 High-Molecular-Weight (HMW) Genomic DNA (gDNA) Isolation

The nanopore sequencer reads the nucleic acids directly. For this reason, the key to data accuracy is how intact the extracted nucleic acids are. The protocol introduced here is for a spider; note that nucleic acid extraction needs to be optimized for the target organism with regard to the choice of body part and homogenization method.

1. Place the legs into a 1.5 mL tube and gently homogenize with BioMasher II for up to 10 s until no noticeable clumps remain (see Note 1).
2. Perform gDNA extraction with the Genomic-tip 20/G according to the manufacturer's instructions; for the Proteinase K treatment, incubate with shaking for 3 h in a Thermo Shaker set at 50 °C and 250 rpm.
3. Use 1 μL of extracted gDNA, and quantify using the Qubit dsDNA BR Assay with the Qubit Fluorometer.
4. Use 1 μL of extracted gDNA, and assess the size distribution using Genomic DNA ScreenTape with TapeStation.
5. The desired amount and size of gDNA per run of nanopore gDNA sequencing are as follows:
Amount: 1 μg (100–200 fmol) of gDNA for R9.4.1 flow cells, or 1.5–3 μg (150–300 fmol) for R10.3 flow cells
Size: gDNA over 60 kbp in length
6. Size selection (optional): remove DNA smaller than 15 kb using the HighPass PLUS cassette with BluePippin. Repeat the extraction process four times to increase DNA recovery with extraction buffer (Buffer EB, 0.1% Tween 20) (see Note 2).
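Input amounts in step 5 are given both as mass and as molar quantity, and converting between the two depends on the mean fragment length of the sample. The Python sketch below shows the conversion under the common approximation of about 650 g/mol per base pair of double-stranded DNA; this constant and the function names are assumptions for illustration and are not part of the kit documentation.

```python
# Approximate average mass of one base pair of dsDNA (assumption).
AVG_BP_MASS_G_PER_MOL = 650.0


def ng_to_fmol(mass_ng, mean_length_bp):
    """Convert a dsDNA mass in ng to fmol for a given mean fragment length.
    ng -> g is a factor of 1e-9 and mol -> fmol is 1e15, hence 1e6 overall."""
    return mass_ng * 1e6 / (mean_length_bp * AVG_BP_MASS_G_PER_MOL)


def fmol_to_ng(amount_fmol, mean_length_bp):
    """Inverse conversion: fmol of dsDNA to ng."""
    return amount_fmol * mean_length_bp * AVG_BP_MASS_G_PER_MOL / 1e6


# Example: 1 ug of gDNA with a mean fragment length of 10 kbp.
print(round(ng_to_fmol(1000, 10_000), 1))
```

Because the molar amount falls as fragments get longer, the mean length measured with TapeStation in step 4 should be used when applying such a conversion.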

3.2 Total RNA Extraction

1. Immerse the body part where the target gene is expressed, or the whole body, in the Multi-Beads Shocker 2 mL tube containing 1 mL of TRIzol Reagent (see Note 3).
2. Put the Multi-Beads Shocker metal cone into the Multi-Beads Shocker 2 mL tube and homogenize with the Multi-Beads Shocker (2500 rpm, 30 s).
3. Immediately after homogenization, place the Multi-Beads Shocker 2 mL tube on ice and proceed to total RNA extraction according to the RNeasy Plus Mini Kit protocol.
4. Use 1 μL of extracted total RNA, and quantify using the Qubit RNA BR Assay with the Qubit 3.0 Fluorometer.
5. Use 1 μL of extracted total RNA, and assess the quality as RIN (RNA integrity number) using RNA ScreenTape with TapeStation. RIN values range from 1 to 10, with 1 representing the most degraded state [13] (see Note 4).
6. Isolate mRNA according to the mRNA selection kit protocol (see Note 5).

7. The desired amount per run of nanopore direct RNA sequencing: 500 ng of poly-A+ RNA is required.
8. The desired amount and quality per run of cDNA sequencing on the Illumina sequencer are as follows:
Total RNA with RIN ≥6.3 is required [14].
0.01–1000 ng of total RNA is required.

3.3 gDNA Sequencing with a Nanopore Sequencer

1. Perform library preparation for nanopore sequencing according to the Ligation Sequencing Kit protocol.
2. Since the accuracy of the nanopore sequencer is improving day by day, make sure that the MinKNOW software is updated to the latest version before running.
3. Sequence the library using MinION Flow Cells with the nanopore sequencer.
4. Repeat the nanopore gDNA sequencing until the desired yield is obtained (see Note 6).
5. In this nanopore gDNA sequencing, the objective is to obtain several long reads that cover the entire length of the region of interest. For example, to cover 10 kbp genes with reads averaging 20 kbp, more than 10× sequencing is necessary. Therefore, if the target is a spider with a genome size of 3 Gb, it is necessary to repeat the process until about 30 Gb of sequencing data is obtained with an average read length of 20 kbp.
6. Perform the basecalling with MinKNOW. Set the basecalling parameters according to the versions of the flow cell and Library Prep Kit in use. The resulting sequenced reads will be referred to as dna_ont.fq in the following bioinformatics sections.
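The yield target in step 5 follows from simple coverage arithmetic, which the Python sketch below makes explicit. The estimate of how many reads span a whole gene relies on simplifying assumptions that are not from the protocol: all reads have the same length and their start positions are uniformly distributed over the genome. The function names are hypothetical.

```python
def target_yield_bp(genome_size_bp, coverage):
    """Total bases needed to reach the given coverage."""
    return genome_size_bp * coverage


def expected_spanning_reads(coverage, read_len_bp, region_len_bp):
    """Expected number of reads that contain an entire region of
    region_len_bp, assuming fixed-length reads with uniform start sites.
    A read spans the region only if it starts within the
    (read_len - region_len) bases upstream of the region start."""
    if read_len_bp <= region_len_bp:
        return 0.0
    return coverage * (read_len_bp - region_len_bp) / read_len_bp


# Figures from step 5: 3-Gb genome, 10x, 20-kbp reads, 10-kbp gene.
print(target_yield_bp(3_000_000_000, 10))
print(expected_spanning_reads(10, 20_000, 10_000))
```

Under these assumptions, 10× coverage with 20 kbp reads gives on the order of five reads spanning a 10 kbp gene, which is consistent with the aim of obtaining several full-length reads.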

3.4 Direct RNA Sequencing with a Nanopore Sequencer

1. Perform library preparation for nanopore direct RNA sequencing according to the Direct RNA Sequencing Kit protocol.
2. Sequence the library using MinION Flow Cells with the nanopore sequencer.
3. Perform the basecalling with MinKNOW. Set the basecalling parameters according to the versions of the flow cell and Library Prep Kit in use. The resulting sequenced reads will be referred to as rna_ont.fq in the following bioinformatics sections.

3.5 cDNA Sequencing with an Illumina Sequencer

1. Using the extracted total RNA, perform library preparation for Illumina cDNA sequencing according to the protocol of the NEBNext Ultra II RNA Library Prep Kit for Illumina.
2. If the amount of total RNA is only 0.01–5 ng, use the SMART-Seq v4 Ultra Low Input RNA Kit for Sequencing and the KAPA HyperPlus Kit (for Illumina) for library preparation.


Fig. 1 Overview. Curation of the long repetitive gene sequence proceeds in the following steps: filtering or trimming of sequence reads, seed contig preparation, collection of repeat units by elongation, and scaffolding and reordering of repeat units using long reads. The visualization step to check the gene architecture is optional

3. Perform 150-bp paired-end sequencing on the Illumina sequencer. Reads of 100 bp or longer are preferred because the data obtained from cDNA sequencing will be used for assembly. The resulting sequenced reads will be referred to as rna_illumina_R1.fq and rna_illumina_R2.fq in the following bioinformatics sections.

3.6 Bioinformatics Analysis

As shown in Fig. 1, curation of the long repetitive gene sequence proceeds in the following steps: filtering or trimming of sequence reads, seed contig preparation, collection of repeat units by elongation, and scaffolding and reordering of repeat units using long reads. The following flow is based on this directory tree structure (Fig. 2).

.
├── data
│   ├── dna_ont.fq (Nanopore gDNA sequencing read)
│   ├── rna_illumina_R1.fq (Illumina cDNA sequencing read: forward)
│   ├── rna_illumina_R2.fq (Illumina cDNA sequencing read: reverse)
│   └── rna_ont.fq (Nanopore direct RNA sequencing read)
├── known_spidroin.prot (Known spidroin amino acid sequences)
└── smoc.pl (SMoC script)

Fig. 2 Reference data to determine if size selection is necessary. Although gDNA of 60 kbp or more can be extracted in both cases, (a) contains many short fragments and requires size selection or fragment removal. On the other hand, in the case of (b), size selection is not necessary.

3.7 Filtering or Trimming of Sequence Reads

1. Filter nanopore sequencing reads (dna_ont.fq) with NanoFilt:
$ NanoFilt -q 7 -l 10000 --headcrop 50 data/dna_ont.fq > data/dna_ont-nanofil-7-50-10k.fq

2. Remove adapters from the Illumina reads and trim them:
$ java -jar trimmomatic-0.33.jar PE -threads 32 -phred33 -trimlog log.txt dna_illumina_R1.fq dna_illumina_R2.fq dna_illumina_R1_trim.fq unpaired_output_R1.fq dna_illumina_R2_trim.fq unpaired_output_R2.fq ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:15 MINLEN:36
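The long-read filter in step 1 can be pictured with a small Python sketch. It is not NanoFilt's code: the function names are hypothetical, the mean quality is computed by averaging error probabilities, and it is an assumption of this sketch that the length and quality thresholds are checked on the untrimmed read before the first 50 bases are cropped. Check the NanoFilt documentation for the exact behavior.

```python
import math


def mean_phred(quals):
    """Mean quality obtained by averaging error probabilities, not raw Phred scores."""
    probs = [10 ** (-q / 10) for q in quals]
    return -10 * math.log10(sum(probs) / len(probs))


def filter_reads(records, min_q=7, min_len=10000, headcrop=50):
    """Yield (name, seq, qual) for reads that pass the thresholds, with the head cropped.

    records: iterable of (name, sequence, quality_string), Phred+33 encoded.
    Assumption (not taken from NanoFilt): thresholds apply to the untrimmed read.
    """
    for name, seq, qual in records:
        if not qual or len(seq) < min_len:
            continue  # too short (or empty) to keep
        scores = [ord(c) - 33 for c in qual]
        if mean_phred(scores) < min_q:
            continue  # average quality below the cutoff
        yield name, seq[headcrop:], qual[headcrop:]
```

With the defaults above, the sketch mirrors the thresholds of the command in step 1 (-q 7, -l 10000, --headcrop 50).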

3.8 Terminal Domain Contig (as Seed Contig) Preparation

1. Conduct de Bruijn graph assembly of the Illumina cDNA sequence reads using Bridger:
$ perl Bridger.pl --seqType fq --left dna_illumina_R1_trim.fq --right dna_illumina_R2_trim.fq --pair_gap_length 0 -k 31 -CPU 32 -o assembly_OUT


2. Pick out the nonrepetitive N/C-terminus contigs from the assembled contigs with a BLAST search. Using the known spidroin sequences (known_spidroin.prot) as the query and the assembled contigs as the database, perform a BLAST search to find the contig that will be the seed sequence:
$ makeblastdb -in assembly_OUT/Bridger.fasta -dbtype nucl
$ tblastn -query known_spidroin.prot -db assembly_OUT/Bridger.fasta -outfmt 6 -num_threads 32 -evalue 1.0e-5

3. If the top hit contig name is comp304_seq0, this sequence is used as the seed sequence (seed.nucl) to obtain repeat units by extension (see Subheading 3.9):
$ seqkit grep -nrp comp304_seq0 assembly_OUT/Bridger.fasta > seed.nucl

3.9 Collection of Repeat Units by Elongation

1. Repeat units are collected with the spidroin motif collection (SMoC) algorithm, which is based on the overlap-layout-consensus approach [15]. The seed contig is used to screen for short reads that contain exact matches of extremely large k-mers (approximately 100 nt) extending to the 5′-end. The selected short reads are aligned on the 3′-side of the matching k-mer to build a position weight matrix (PWM). Based on stringent thresholds, the seed sequence (terminus contig) is extended until the subsequent repeat appears. Continue the extension until no novel repeat unit appears or until the terminal region appears.
2. If the seed sequence (seed.nucl) found by the BLAST search is an N-terminal region, extend toward the C-terminal side (downstream). The extension program smoc.pl can be obtained here (https://github.com/nkono/SMoC):
$ perl smoc.pl --seed seed.nucl --read data/dna_illumina_R1.fq --output seed_out

3. After the SMoC program finishes, two files, seed_out_w100_c5_contig.res and seed_out_w100_c5_contig.fasta, are generated. The seed_out_w100_c5_contig.res file records the history of the extension process (Fig. 3). The seed_out_w100_c5_contig.fasta file is a multi-FASTA file with the extended sequences.
4. When the seed sequence is a C-terminal region, extend toward the N-terminal side (upstream). Reverse-complement the seed sequence with seqkit and use the result as the seed:
$ seqkit seq -t dna -v -rp seed.nucl > seed_comp.nucl
$ perl smoc.pl --seed seed_comp.nucl --read data/dna_illumina_R1.fq --output seed_comp_out


Fig. 3 The results of SMoC program execution. The result at the point where the seed sequence is extended by 11 nucleotides. Extension is performed until the N/C-terminal domains appear

Fig. 4 Amino acids of the N/C-terminal domains and repeat unit obtained by SMoC. The order of each of the nine repeat units is not known

5. Select the most extended sequence from the result file “seed_out_w100_c5_contig.fasta” and save it as “cassette.nucl”. Figure 4 shows the translated amino acid sequence of the extended sequence.
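The extension idea described in step 1 can be illustrated with a minimal Python sketch. This is not the smoc.pl implementation: the function name, parameters, and thresholds are hypothetical, the "PWM" is reduced to a single column (the base immediately 3′ of the matching k-mer), and reads are plain strings on the same strand as the seed.

```python
from collections import Counter


def extend_seed(seed, reads, k=100, min_reads=5, min_frac=0.9, max_steps=1000):
    """Greedily extend `seed` one base at a time toward the 3' side.

    At each step, reads containing an exact match to the last k bases of the
    contig vote for the next base; the contig is extended only when enough
    reads agree (illustrative thresholds, not those of SMoC).
    """
    contig = seed
    for _ in range(max_steps):  # guard against endless extension in perfect repeats
        kmer = contig[-k:]
        if len(kmer) < k:
            break
        counts = Counter()
        for read in reads:
            pos = read.find(kmer)
            while pos != -1:
                nxt = pos + k
                if nxt < len(read):
                    counts[read[nxt]] += 1  # base right after the k-mer match
                pos = read.find(kmer, pos + 1)
        total = sum(counts.values())
        if total < min_reads:
            break  # not enough supporting reads
        base, n = counts.most_common(1)[0]
        if n / total < min_frac:
            break  # ambiguous column: stop rather than guess
        contig += base
    return contig
```

Stopping at an ambiguous column is a design choice of this sketch. In the protocol itself, extension continues until no novel repeat unit or the terminal region appears, and the .res file records the extension history.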


Fig. 5 BLAST search results for nanopore gDNA sequencing data. All of the obtained cassettes (NtermDomain, N-terminal domain; RepetitiveUnit, repeat unit; CtermDomain, C-terminal domain) hit the read 315894d9-19f6-4d61-a6a0-81b2cff5d777. This indicates that 315894d9-19f6-4d61-a6a0-81b2cff5d777 is a read covering the full length of the spidroin gene

3.10 Scaffolding and Reordering of Repeat Unit Using Full-Length Long Read

1. Search the nanopore gDNA sequencing reads (dna_ont-nanofil-7-50-10k.fq) with BLAST for a “full-length long read” harboring the entire set of N/C-terminal domains and repeat units (see Subheading 3.9). An example of the BLAST search result is shown in Fig. 5, and here, 315894d9-19f6-4d61-a6a0-81b2cff5d777 is the full-length long read:
$ seqkit fq2fa data/dna_ont-nanofil-7-50-10k.fq > data/dna_ont-nanofil-7-50-10k.fasta
$ makeblastdb -in data/dna_ont-nanofil-7-50-10k.fasta -dbtype nucl
$ blastn -query cassette.nucl -db data/dna_ont-nanofil-7-50-10k.fasta -outfmt 6 -num_threads 32

2. Extract the full-length long read that was found from the nanopore gDNA sequencing reads (dna_ont-nanofil-7-50-10k.fasta) with seqkit. Here is an example of extracting the full-length long read “315894d9-19f6-4d61-a6a0-81b2cff5d777”:
$ seqkit grep -nrp 315894d9-19f6-4d61-a6a0-81b2cff5d777 data/dna_ont-nanofil-7-50-10k.fasta > data/315894d9-19f6-4d61-a6a0-81b2cff5d777.nucl

3. Correct errors in the extracted full-length long read with the Illumina cDNA sequencing reads:
$ proovread -l data/315894d9-19f6-4d61-a6a0-81b2cff5d777.nucl -s data/rna_illumina_R1.fq data/dna_illumina_R2_trim.fq -p ONT_proovread -t 32

4. Translate the full-length long read in all three reading frames (Fig. 6a):
$ seqkit fq2fa ONT_proovread/ONT_proovread.untrimmed.fq | seqkit translate --frame 1,2,3 | seqkit seq -w 0 > 315894d9-19f6-4d61-a6a0-81b2cff5d777_3frame.prot
$ cat 315894d9-19f6-4d61-a6a0-81b2cff5d777_3frame.prot


Fig. 6 Curation of the long read covering the full length of spidroin. The figure shows how the frameshift is corrected while mapping the exhaustively collected repeat units to the error-corrected long read. (a) Translate the corrected long read in three frames, and (b) align the repeat units obtained by the Illumina read assembly (3.9). (c) In each of the three frames, color the N-terminus red, the C-terminus purple, and the repetitive regions green. (d) Manually curate the frame shift

5. Align the obtained repeat units (cassette.prot) to the three-frame-translated full-length long read with MAFFT. An example of the alignment result is shown in Fig. 6b, c:
$ seqkit translate cassette.nucl > cassette.prot
$ cat cassette.prot 315894d9-19f6-4d61-a6a0-81b2cff5d777_3frame.prot > cassette_ONT.fasta
$ mafft --thread 32 --text --quiet --clustalout --reorder --auto cassette_ONT.fasta

6. Even after error correction, some frameshifts remain in the full-length long read. Manually correct the full-length long read along the alignment (Fig. 6d).
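Picking the full-length long read in step 1 amounts to finding a subject read that is hit by every cassette. A minimal Python sketch follows, assuming the default 12-column tabular BLAST output (-outfmt 6), in which column 1 is the query ID and column 2 the subject ID. The function name and the read names in the usage example are hypothetical; the query names follow Fig. 5.

```python
from collections import defaultdict


def reads_hit_by_all(blast_lines, required_queries):
    """Return subject IDs (reads) that are hit by every required query.

    blast_lines: lines of tabular BLAST output (-outfmt 6, default columns).
    """
    hits = defaultdict(set)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        hits[sseqid].add(qseqid)
    required = set(required_queries)
    return sorted(read for read, queries in hits.items() if required <= queries)


example = [
    "NtermDomain\tread_A\t95.0\t300\t15\t0\t1\t300\t10\t310\t1e-100\t400",
    "RepetitiveUnit\tread_A\t92.0\t500\t40\t0\t1\t500\t400\t900\t1e-150\t600",
    "CtermDomain\tread_A\t94.0\t280\t17\t0\t1\t280\t9000\t9280\t1e-90\t380",
    "RepetitiveUnit\tread_B\t91.0\t500\t45\t0\t1\t500\t50\t550\t1e-140\t580",
]
full_length = reads_hit_by_all(example, ["NtermDomain", "RepetitiveUnit", "CtermDomain"])
```

This check only tests presence. Whether the hits are in the expected order and on one strand still has to be confirmed, for example from the subject coordinates in columns 9 and 10.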


3.11 Visualization of Repetitive Gene Architecture

A dot plot visualization of self-alignment is suitable for observing the whole architecture of a long repetitive sequence. Here is an example of visualization using the MAFFT version 7 web service.
1. Access the MAFFT website: https://mafft.cbrc.jp/alignment/software/.
2. Enter “Alignment” under “Online version” (Fig. 7).
3. Enter the error-corrected long-read sequence in the “Input” text area twice in Multi-FASTA format (Fig. 7):
$ seqkit fq2fa ONT_proovread/ONT_proovread.untrimmed.fq | seqkit seq -w 0
4. Execute the alignment and visualization with the “Submit” button (Fig. 7).
5. Open the dot plot image (Fig. 7).

Fig. 7 Interface of MAFFT version 7 online version. Operation screen to visualize a self-alignment dot plot for observing gene architecture


Fig. 8 Visualization of gene architecture with dot plots. These plots are self-alignment dot plots of nanopore gDNA sequencing reads, drawn with the MAFFT version 7 online version. The dots are concentrated in repetitive regions, and the rest are nonrepetitive regions. (a) This gene is composed of only one type of repeat unit. (b) This gene consists of only one type of repeat unit but contains introns or nonrepetitive linker sequences inside the repetitive region. (c) This gene has two types of repeat units separated by introns or nonrepetitive linker sequences

6. Figure 8 illustrates the dot plot images of three gene structures.

As a result of the above operations, the directory structure is as follows:

.
├── 315894d9-19f6-4d61-a6a0-81b2cff5d777_3frame.prot
├── ONT_proovread
│   ├── ONT_proovread.chim.tsv
│   ├── ONT_proovread.ignored.tsv
│   ├── ONT_proovread.parameter.log
│   ├── ONT_proovread.trimmed.fa
│   ├── ONT_proovread.trimmed.fq
│   └── ONT_proovread.untrimmed.fq
├── assembly_OUT
│   ├── Bridger.fasta
│   ├── Bridger.fasta.nhr
│   ├── Bridger.fasta.nin
│   └── Bridger.fasta.nsq
├── cassette.nucl
├── cassette.prot
├── cassette_ONT.fasta
├── data
│   ├── 315894d9-19f6-4d61-a6a0-81b2cff5d777.nucl
│   ├── dna_ont-nanofil-7-50-10k.fasta
│   ├── dna_ont-nanofil-7-50-10k.fasta.nhr
│   ├── dna_ont-nanofil-7-50-10k.fasta.nin
│   ├── dna_ont-nanofil-7-50-10k.fasta.nsq
│   ├── dna_ont-nanofil-7-50-10k.fq
│   ├── dna_ont.fq
│   ├── known_spidroin.prot
│   ├── rna_illumina_R1.fq
│   ├── rna_illumina_R2.fq
│   ├── rna_ont.fq
│   ├── seed.nucl
│   ├── unpaired_output_R1.fq
│   └── unpaired_output_R2.fq
├── dna_ont-nanofil-7-50-10k.fasta
├── known_spidroin.prot
├── log.txt
├── seed.nucl
├── seed_comp.nucl
├── seed_comp_out_w100_c5_contig.fasta
├── seed_comp_out_w100_c5_contig.res
├── seed_out_w100_c5_contig.fasta
├── seed_out_w100_c5_contig.res
└── smoc.pl

4 Notes

1. For nucleic acid extraction, body parts with a large amount of muscle tissue (e.g., legs) are preferred, and the abdomen, which contains digestive fluids and other body fluids, should be avoided. However, if the body size is less than a few centimeters, using the whole body makes little difference. Specimens stored at -80 °C in liquid nitrogen are preferred.
2. The peak of the DNA size distribution should be at least 60 kbp. Nanopore gDNA sequencing preferentially sequences shorter DNA fragments, so it is necessary to remove as many short fragments as possible. However, since size selection may reduce the amount of DNA to about one-tenth or less, prepare 10 μg or more of DNA if size selection is necessary. Reference data for deciding whether size selection is needed are shown in Fig. 2.
3. When working with RNA samples, an RNase-free environment is essential. All operations should be performed with gloves, RNase-free reagents, pipettes, and benches. Samples should be stored below 4 °C until use and should be used within 2 weeks [14].
4. The RIN value should be greater than 6.3 [14]. In addition, 500 ng of mRNA is required for nanopore direct RNA sequencing. Although the ratio of mRNA to total RNA varies among organisms, it should be assumed to be in the range of approximately 0.1–1%, and the amount of total RNA should be prepared accordingly. For example, in the case of the Jorō spider (Trichonephila clavata), the ratio of mRNA to total RNA is about 0.7%, and 72 μg or more of total RNA is required.
5. Various mRNA selection kits are available, and at least NucleoTrap and Dynabeads have been verified to have high recovery rates and reproducibility.
6. Nanopore gDNA sequencing aims to obtain several long reads that cover the full length of the long repetitive gene. In the case of a spider with a genome size of 3 Gb, if you can prepare more than 30 Gb (10×) of sequence data with an average read length of 20 kb, you can expect to obtain several reads covering the entire length of a 10 kbp gene.

References

1. Kono N, Nakamura H, Ohtoshi R, Moran DAP, Shinohara A, Yoshida Y, Fujiwara M, Mori M, Tomita M, Arakawa K (2019) Orb-weaving spider Araneus ventricosus genome elucidates the spidroin gene catalogue. Sci Rep 9(1):8380. https://doi.org/10.1038/s41598-019-44775-2
2. Hayashi CY, Lewis RV (2000) Molecular architecture and evolution of a modular spider silk protein gene. Science 287(5457):1477–1479
3. Babb PL, Lahens NF, Correa-Garhwal SM, Nicholson DN, Kim EJ, Hogenesch JB, Kuntner M, Higgins L, Hayashi CY, Agnarsson I, Voight BF (2017) The Nephila clavipes genome highlights the diversity of spider silk genes and their complex expression. Nat Genet 49(6):895–903. https://doi.org/10.1038/ng.3852
4. Kono N, Nakamura H, Ohtoshi R, Tomita M, Numata K, Arakawa K (2019) The bagworm genome reveals a unique fibroin gene that provides high tensile strength. Commun Biol 2:148. https://doi.org/10.1038/s42003-019-0412-8
5. Kono N, Nakamura H, Mori M, Tomita M, Arakawa K (2020) Spidroin profiling of cribellate spiders provides insight into the evolution of spider prey capture strategies. Sci Rep 10(1):15721. https://doi.org/10.1038/s41598-020-72888-6
6. Kono N, Nakamura H, Mori M, Yoshida Y, Ohtoshi R, Malay AD, Pedrazzoli Moran DA, Tomita M, Numata K, Arakawa K (2021) Multicomponent nature underlies the extraordinary mechanical properties of spider dragline silk. Proc Natl Acad Sci U S A 118(31):e2107065118. https://doi.org/10.1073/pnas.2107065118
7. Kono N, Nakamura H, Tateishi A, Numata K, Arakawa K (2021) The balance of crystalline and amorphous regions in the fibroin structure underpins the tensile strength of bagworm silk. Zool Lett 7(1):11. https://doi.org/10.1186/s40851-021-00179-7
8. Kono N, Ohtoshi R, Malay AD, Mori M, Masunaga H, Yoshida Y, Nakamura H, Numata K, Arakawa K (2021) Darwin’s bark spider shares a spidroin repertoire with Caerostris extrusa but achieves extraordinary silk toughness through gene expression. Open Biol 11(12):210242. https://doi.org/10.1098/rsob.210242
9. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120. https://doi.org/10.1093/bioinformatics/btu170
10. Chang Z, Li G, Liu J, Zhang Y, Ashby C, Liu D, Cramer CL, Huang X (2015) Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome Biol 16:30. https://doi.org/10.1186/s13059-015-0596-2
11. Shen W, Le S, Li Y, Hu F (2016) SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11(10):e0163962. https://doi.org/10.1371/journal.pone.0163962
12. Hackl T, Hedrich R, Schultz J, Forster F (2014) proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30(21):3004–3011. https://doi.org/10.1093/bioinformatics/btu392
13. Schroeder A, Mueller O, Stocker S, Salowsky R, Leiber M, Gassmann M, Lightfoot S, Menzel W, Granzow M, Ragg T (2006) The RIN: an RNA integrity number for assigning integrity values to RNA measurements. BMC Mol Biol 7:3. https://doi.org/10.1186/1471-2199-7-3
14. Kono N, Nakamura H, Ito Y, Tomita M, Arakawa K (2016) Evaluation of the impact of RNA preservation methods of spiders for de novo transcriptome assembly. Mol Ecol Resour 16(3):662–672. https://doi.org/10.1111/1755-0998.12485
15. Kono N, Arakawa K (2019) Nanopore sequencing: review of potential applications in functional genomics. Dev Growth Differ 61(5):316–326. https://doi.org/10.1111/dgd.12608

Chapter 11

Analysis of Tandem Repeat Expansions Using Long DNA Reads

Satomi Mitsuhashi and Martin C. Frith

Abstract

Abnormal expansion or shortening of tandem repeats can cause a variety of genetic diseases. The use of long DNA reads has facilitated the analysis of disease-causing repeats in the human genome. Long-read sequencers enable us to directly analyze repeat length and sequence content by covering whole repeats; they are therefore considered suitable for the analysis of long tandem repeats. Here, we describe an expanded repeat analysis using target sequencing data produced by the Oxford Nanopore Technologies (hereafter referred to as ONT) nanopore sequencer.

Key words Nanopore sequencer, Repeat expansion diseases, Long read sequencer, Tandem repeat

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_11, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Sequences that are adjacent to identical or similar sequences are called tandem repeats or simple repeats. Tandem repeats can be polymorphic or mutable: their copy numbers may vary from person to person and may change from one generation to the next. Abnormal elongation or shortening of tandem repeats can cause diseases in humans. To date, more than 40 tandem repeat diseases are known [1, 2]. Measuring repeat lengths is important in the diagnosis of disease. Repeat length can be determined by PCR, Southern blotting, and other methods including short-read next-generation sequencing [3]. Long-read sequencing is an important addition to conventional methods since it allows not only measuring repeat length but also directly determining sequences within repeats by obtaining an entire read that completely covers tandem repeats. Indeed, repeat contents may explain disease mechanisms (e.g., polyglutamine disease) and determine the disease manifestation [4, 5], or modulate the onset of the disease [6]. In addition, expanded repeats in some newly found genetic diseases were identified by long-read sequencers in recent years [2, 7]. Such discoveries may need whole-genome sequencing, but the subsequent diagnostic assays may be achieved by analyzing the specific candidate repeat by targeted sequencing. Long-read whole-genome sequencing is costly, so the cost can be reduced by approaches that sequence only the target region of interest. There are several methods to enrich target repeat loci for long-read sequencing, such as adaptive sampling [8] or Cas9 enrichment [9] (see Note 1). Here we explain a method to detect expanded repeats with our software called tandem-genotypes, using nanopore sequencing datasets of two pathogenic repeat loci: a GGC repeat in the 5′ UTR of the NOTCH2NLC gene and an intronic pentanucleotide repeat in the RFC1 gene.

2 Materials

2.1 Library Preparation for Long-Read Sequencing

Tandem repeat analysis, which we focus on in this chapter, can be performed on either whole-genome sequencing (WGS) or targeted sequencing of repeat regions. Note that a long-read sequencer cannot obtain reads longer than the length of the DNA used, so it is necessary to prepare DNA longer than the length you want to read (including flanking region and repeat). Many of the known causes of repeat diseases have repeat lengths of ~10,000 bases or fewer [10], which long reads usually cover well.
1. MinION flow cell (ONT)
2. Cas9 Sequencing Kit (ONT)
3. 1–10 μg DNA extracted from cells or tissues
4. 100 μM in TE, pH 7.5, S. pyogenes Cas9 Alt-R crRNA (IDT)
5. 100 μM in TE, pH 7.5, S. pyogenes Cas9 Alt-R tracrRNA (IDT)
6. Nuclease-free TE, pH 7.5 (10 mM Tris-HCl [pH 7.5], 0.1 mM EDTA) (IDT)
7. Nuclease-Free Duplex Buffer (IDT)
8. Alt-R S. pyogenes HiFi Cas9 nuclease V3 (IDT)
9. Agencourt AMPure XP beads (Beckman Coulter)

2.2 Data Analysis

Several methods exist for analyzing tandem repeats in long reads, but here we explain a method that uses the LAST alignment software and tandem-genotypes [11]. Our method is simple and can estimate the repeat length regardless of the content of a given repeat. It can be used for both nanopore and PacBio long reads. Indeed, this protocol has helped identify the causes of repeat expansion diseases in real patients [4, 5, 7]. Both WGS data and target data are analyzed by essentially the same method. For WGS data, it is possible to comprehensively analyze millions of tandem repeats within the entire genome when the user gives an annotation of the repeats in the genome (see Note 2). Furthermore, prioritization can be performed by comparing to repeat lengths of healthy controls (see Note 3), which is described elsewhere [11]. See Note 4 for software availability.
1. UNIX environment: here we used a MacBook Pro (13-in, 2017), 3.5 GHz Intel Core i7, 16 GB RAM, macOS High Sierra (10.13.4).
2. Target sequence data: here we used F1-1 from Human Genetic Variation Database HGV0000008 [7] as NOTCH2NLC.fa, accessible with permission at https://www.hgvd.genome.med.kyoto-u.ac.jp/repository/HGV0000008.html.
3. We also used extracted reads that mapped to the RFC1 repeat from long-read nanopore WGS of HG02081_1 (PRJEB37264) [12] as RFC1.fa, because this individual has a moderately expanded repeat in the locus.
4. LAST v1406 (including lastdb, lastal, last-train, maf-convert, maf-cut, etc.).
5. tandem-genotypes v1.9.0.
6. lamassemble v1.4.2.
7. mafft v7.457.

3 Methods

3.1 Designing gRNAs

ONT recommends using Integrated DNA Technologies (IDT) crRNA and tracrRNA to cut the DNA. To design crRNAs, we use the custom Alt-R® CRISPR-Cas9 guide RNA design tool (see Note 4). The tool rates both off-target and on-target potentials between 1 and 100 (higher is preferred). We normally use crRNAs with both scores >~60; however, we previously used guide RNAs with an off-target potential as low as 1 (Table 1) and could still enrich the repeat at a depth of coverage as high as 1600×. crRNAs should be designed both upstream and downstream of the target sequence, as sequencing adapters are ligated to the target strand only. It may be preferable to obtain reads of both strands to increase the accuracy of the consensus. Note that the upstream gRNA is designed to target the plus strand and the downstream gRNA the minus strand. We typically obtain >400 reads, but this may depend on the target size. When the target size was greater than 10 kb, we observed that the number of enriched reads decreased exponentially (Fig. 1).


Table 1 Examples of gRNA used in this and other studies (Refs. [4, 7])

Name               crRNA                 Target locus (GRCh38/hg38)  On-target score  Off-target score  PAM  Strand  Target size (bp)
NOTCH2NLC-forward  UUCUUAGCCCACUUGUACCC  chr1:149389497-149389516    58               66                GGG  +       3972
NOTCH2NLC-reverse  GGAGCACUCAAAAGUUUAGA  chr1:149393450-149393469    83               1                 TGG  -       3972
RFC1-forward       GACAGUAACUGUACCACAAU  chr4:39345975-39345997      80               69                AGG  +       5642
RFC1-reverse       CUAUAUUCGUGGAACUAUCU  chr4:39351595-39351617      64               72                AGG  -       5642

On-target and off-target scores from the custom Alt-R CRISPR-Cas9 guide RNA design tool are shown
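As a small worked example, the target sizes listed in Table 1 are consistent with the distance from the start of the forward guide locus to the end of the reverse guide locus. The Python sketch below reproduces that arithmetic; the function name is hypothetical, and reading the size as "reverse end minus forward start" is an inference from the table values, not a definition given in the text.

```python
def target_size(forward_locus, reverse_locus):
    """Distance from the start of the forward guide to the end of the reverse guide.

    Loci are strings such as "chr1:149389497-149389516".
    """
    def parse(locus):
        chrom, span = locus.replace(" ", "").split(":")
        start, end = (int(x) for x in span.split("-"))
        return chrom, start, end

    chrom_f, start_f, _ = parse(forward_locus)
    chrom_r, _, end_r = parse(reverse_locus)
    if chrom_f != chrom_r:
        raise ValueError("guides must lie on the same chromosome")
    return end_r - start_f


notch2nlc = target_size("chr1:149389497-149389516", "chr1:149393450-149393469")
rfc1 = target_size("chr4:39345975-39345997", "chr4:39351595-39351617")
```

Both values stay well below the 10 kb size above which the number of enriched reads was observed to drop.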

Fig. 1 Three different gRNA pairs targeting the same expanded repeat show that the number of reads obtained decreases exponentially when the target length exceeds 10 kb (x-axis: target length (bp); y-axis: number of reads)

3.2 Sequence Library Preparation

Since the detailed experimental protocol is provided with the commercially available kit and on community sites, here we briefly describe the core part of the Cas9-enrichment protocol for nanopore long-read sequencers.
1. Using >5 μg of DNA is recommended, but obtaining such large amounts of DNA can be difficult in some situations (e.g., old and depleted DNA samples). We used ~2.0 μg of DNA and could successfully obtain 300–400 target reads using a single MinION Flow Cell.

Fig. 2 Free nanopores are dark green, while pores in use (sequencing) are light green (arrow). Yellow represents free adaptors that were not attached to the DNA (x-axis: time; y-axis: state time equivalent (%))

2. Form an RNA and protein complex (RNP) by combining the crRNA, tracrRNA, and Cas9 protein.
3. Treat the DNA ends with a dephosphorylating enzyme to avoid unnecessary ligation of adapters.
4. Cut the terminally dephosphorylated DNA with the Cas9 RNP prepared in step 2, while simultaneously adding deoxyadenosine triphosphate and Taq polymerase for A-tailing. The Cas9 cut sites are newly phosphorylated and A-tailed; therefore, the adaptor will only be attached to the ends at the target site.
5. Ligate the adapters.
6. Purify the DNA with AMPure beads.
7. Perform sequencing. Sequencing adaptors are attached to fewer DNA molecules than in non-enrichment genomic DNA ligation protocols. The pore usage can be as low as 10% (Fig. 2, arrow).

3.3 Data Analysis

3.3.1 Aligning Long Reads to the Reference Genome

1. Prepare the reference genome as described here (https://gitlab.com/mcfrith/last/-/blob/main/doc/last-cookbook.rst) or in another chapter of this book (“Finding rearrangements in nanopore DNA reads with LAST and dnarrange”). Several files will be created after executing the lastdb command. Here, the reference database is named “GRCh38” and the path to the files is shown as “path-to-reference/GRCh38” in the following analysis. We have tested the -uNEAR (both with and without repeat masking) and -uRY4 options. These different options did not make much difference.


Fig. 3 Aligned reads of the targeted NOTCH2NLC repeat. The MAF format file was converted to BAM format and then visualized in IGV (see Note 5). Pink reads are forward strands and blue reads are reverse strands

2. Estimate the sequencing error rates using last-train [13] and create a parameter file. Long-read sequencers have been evolving rapidly, and their sequencing error properties have changed frequently over time. This functionality is therefore useful because optimal parameters for alignment are generated automatically. The -P option specifies the number of threads to use:
$ last-train -P8 path-to-reference/GRCh38 NOTCH2NLC.fa > NOTCH2NLC.train.out
$ last-train -P8 path-to-reference/GRCh38 RFC1.fa > RFC1.train.out

3. Align long reads to the reference genome using lastal [14]. For split alignment, use last-split [15], which is also available in the LAST program through the --split option (version >1387). Specify the parameter file created in step 2 with the -p option. The -P option specifies the number of threads to use. Sequence alignments are produced in MAF format (http://genome.ucsc.edu/FAQ/FAQformat.html#format5). If necessary, you can visualize aligned reads with genome viewer tools (e.g., IGV) (Fig. 3) (see Note 5):
$ lastal -P8 --split -p NOTCH2NLC.train.out path-to-reference/GRCh38 NOTCH2NLC.fa > NOTCH2NLC.maf
$ lastal -P8 --split -p RFC1.train.out path-to-reference/GRCh38 RFC1.fa > RFC1.maf
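To get a quick feel for the MAF files produced above, a minimal Python reader can list the reference span and read strand of each alignment block. This is a sketch under stated assumptions: each block starts with an "a" line, the first "s" line is the reference and the second is the read, and the "s" fields are name, 0-based start, aligned size, strand, sequence length, and aligned text. The function names and the example records are hypothetical.

```python
from collections import Counter


def parse_maf(lines):
    """Yield one dict per alignment block: reference span, read name, read strand."""
    block = []
    for line in list(lines) + [""]:  # trailing "" flushes the last block
        line = line.rstrip("\n")
        if line.startswith("s "):
            f = line.split()
            block.append((f[1], int(f[2]), int(f[3]), f[4]))
        elif not line.strip() or line.startswith("a ") or line == "a":
            if len(block) >= 2:
                ref, rstart, rsize, _ = block[0]
                read, _, _, strand = block[1]
                yield {"ref": ref, "start": rstart, "end": rstart + rsize,
                       "read": read, "strand": strand}
            block = []


def strand_counts(lines):
    """Count alignment blocks per read strand."""
    return Counter(b["strand"] for b in parse_maf(lines))


example_maf = """# toy example
a score=100
s chr1 100 10 + 1000 ACGTACGTAC
s read1 0 10 + 50 ACGTACGTAC

a score=90
s chr1 200 5 + 1000 ACGTA
s read2 3 5 - 40 ACGTA
""".splitlines()
blocks = list(parse_maf(example_maf))
```

Counting blocks per strand in this way gives a rough check that reads from both strands were obtained, which matters for the consensus step later in the protocol.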

3.3.2 Detect Changes in the Copy Number of Repeats with tandem-genotypes

To detect copy number changes (relative to the reference genome), the alignment file (MAF format) created in the previous section and a repeat annotation are needed.
1. Prepare the repeat annotation files. A repeat annotation can be a file of four or more columns, containing the chromosome, the repeat start and end positions, and the repeat unit (BED format). For example, the NOTCH2NLC and RFC1 loci would be described by the following BED format files:
chr1 149390802 149390842 GGC (NOTCH2NLC.bed)
chr4 39348000 39349100 AAAAG (RFC1.bed)

It is also possible to use one of several repeat annotation files from the UCSC Genome Database (simpleRepeat.txt, microsat.txt, or rmsk.txt), or a RepeatMasker “.out” file, or the output of “tantan -f4” (see Note 2) [16].
2. Estimate repeat copy number changes with tandem-genotypes. Specify the file prepared in step 1. Optionally, gene annotation can be added by using the -g option:
$ tandem-genotypes -v -o2 -g refFlat.txt NOTCH2NLC.bed NOTCH2NLC.maf > NOTCH2NLC-tg.out
$ tandem-genotypes -v -o2 -g refFlat.txt RFC1.bed RFC1.maf > RFC1-tg.out
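The four-column annotation described in step 1 can also be written programmatically. The Python sketch below formats one annotation line and computes the repeat copy number implied by the reference interval. The function names are hypothetical, and treating the copy number as (end - start) divided by the unit length is a simplifying assumption of this sketch.

```python
def bed_line(chrom, start, end, unit):
    """Format one repeat annotation: chromosome, start, end, repeat unit (tab-separated)."""
    if not (0 <= start < end):
        raise ValueError("start must be non-negative and smaller than end")
    unit = unit.upper()
    if not unit or set(unit) - set("ACGT"):
        raise ValueError("repeat unit must consist of A, C, G, T")
    return f"{chrom}\t{start}\t{end}\t{unit}"


def reference_copies(start, end, unit):
    """Approximate number of repeat units spanned by the annotated interval."""
    return (end - start) / len(unit)


notch2nlc_bed = bed_line("chr1", 149390802, 149390842, "ggc")
notch2nlc_copies = reference_copies(149390802, 149390842, "GGC")
```

Because tandem-genotypes reports changes relative to the reference, a number of this kind is what a reported change has to be added to in order to obtain an absolute copy number.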

For the following step (see Subheading 3.3.4), in which reads are merged to make a consensus sequence, add both the -v and -o2 options. These two options can be omitted if you will not run tandem-genotypes-merge. The -o2 option outputs representative values of the copy number changes of the short and long alleles in columns 7 and 8, respectively (Fig. 4). After removing some outliers, the k-medoids method (with k = 2) is used for clustering. This may not be accurate if the coverage is low. Note that it assumes two alleles at all repeat sites. Even if the two alleles have the same length, the two estimates may differ slightly because of read errors. Repeat lengths can be difficult to interpret when insertion or deletion error rates differ between forward and reverse strands, which results in a systematic difference in inferred repeat lengths between the strands. The -v option outputs read names:
Column 1: Chromosome number
Column 2: Repeat start coordinate
Column 3: Repeat end coordinate
Column 4: Unit of the repeat (AT, CTG, AAAAT, etc.)
Column 5: Gene name
Column 6: Annotation of repeat location (5′UTR, 3′UTR, coding, intron, exon, intergenic, promoter)
Column 7: Predicted value of the shorter copy number change
Column 8: Predicted value of the longer copy number change
Column 9: Change in copy number of the reads covering the repeat on the forward strand
Column 10: Change in copy number of the reads covering the repeat on the reverse strand

Fig. 4 Example of the output generated by tandem-genotypes. Arrows indicate representative copy number changes from the reference genome, for both alleles

3.3.3 Creating a Plot

When the copy number changes of a repeat are shown in a histogram, the distribution of the changes across reads can be seen. If there is an obvious (heterozygous) repeat expansion (e.g., >10 repeat copy number change), you can see a clear bimodal distribution (Fig. 5). We can specify the number of most-prioritized repeats (see Note 3) to draw, arranged in columns and rows, with -c1 and -r1. The plots named “NOTCH2NLC-tg.out.pdf” and “RFC1-tg.out.pdf” will be created with the following commands (Fig. 5):
$ tandem-genotypes-plot -c1 -r1 NOTCH2NLC-tg.out
$ tandem-genotypes-plot -c1 -r1 RFC1-tg.out
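The bimodal distribution, and the two representative values reported with -o2, can be illustrated with a small one-dimensional clustering sketch in Python. It is not the tandem-genotypes implementation: the function name is hypothetical, no outliers are removed, and the two medoids are found by exhaustive search, which is only practical for small toy inputs.

```python
def two_medoids(values):
    """Pick two medoids (taken from the data) minimizing total absolute distance.

    values: per-read copy number changes. Returns (shorter, longer).
    """
    vals = list(values)
    if not vals:
        raise ValueError("no reads")
    uniq = sorted(set(vals))
    if len(uniq) == 1:
        return uniq[0], uniq[0]  # all reads agree: both alleles get the same value
    best = None
    for i, a in enumerate(uniq):
        for b in uniq[i + 1:]:
            # each read is assigned to its nearer medoid
            cost = sum(min(abs(v - a), abs(v - b)) for v in vals)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return best[1], best[2]


# Toy data: one allele near the reference length, one expanded by ~100 copies.
changes = [0, 1, 0, -1, 1, 0, 100, 101, 99, 100]
short_allele, long_allele = two_medoids(changes)
```

With few reads per allele the medoids become unstable, which matches the caveat above that the estimate may not be accurate when coverage is low.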

3.3.4 Merge Reads to Create a Consensus Sequence

tandem-genotypes-merge generates a consensus sequence for each allele. This step requires the installation of mafft [17] and lamassemble [18]. Give it the read file, the last-train parameter file, and the tandem-genotypes output created with the -v and -o2 options:
$ tandem-genotypes-merge NOTCH2NLC.fa NOTCH2NLC.train.out NOTCH2NLC-tg.out > NOTCH2NLC-merge.fa
$ tandem-genotypes-merge RFC1.fa RFC1.train.out RFC1-tg.out > RFC1-merge.fa

Fig. 5 tandem-genotypes-plot generates a histogram of the repeat copy number changes in each read. Panels: GGC: NOTCH2NLC:NIID 5′UTR chr1:149390802−149390842; AAAAG: RFC1 intron chr4:39348424−39348483. x-axis: copy number change from the reference genome. y-axis: number of reads. The red bars represent forward strand reads and the blue bars represent reverse strand reads

The merged sequence file can be aligned to the reference genome. Then, the target sequence can be cut out with maf-cut (Fig. 6) by giving an aligned file (aligned-merged.maf) and an original sequence file (e.g., NOTCH2NLC-merge.fa). Note that this function of maf-cut is only included in LAST version ≥1291:
$ maf-cut chr1:149390802-149390842 aligned-merged.maf NOTCH2NLC-merge.fa > NOTCH2NLC-mafcut.fa
$ maf-cut chr4:39348000-39349100 aligned-merged.maf > RFC1-mafcut.fa

Fig. 6 Consensus repeat sequences. Note that “:a” represents the short allele and “:b” represents the long allele, separated by the tandem-genotypes -o2 option

4 Notes

1. To sequence raw (i.e., not PCR-amplified) genomic DNA molecules at high depth, Cas9 enrichment or adaptive sampling is the method of choice. In the Cas9 enrichment protocol, Cas9 cleaves the target site, and adapters for nanopore sequencing are ligated to the newly cut ends. Adaptive sampling (AS) is available only on nanopore sequencers and enables selective sequencing of target sites. Briefly, Cas9 enrichment is appropriate for 100× coverage, while AS typically produces 30× coverage using a single MinION Flow Cell. PCR-based repeat enrichment is cost-effective, when PCR is possible, compared to Cas9 enrichment or adaptive sampling; however, PCR is difficult for some repeats and may be error-prone for tandem repeats, causing repeat shortening. In such cases, interpretation of the obtained reads can be difficult.

2. If you want to investigate the repeats in the entire genome, there are multiple annotation files that you can download from UCSC (rmsk.txt, microsatellite.txt, simpleRepeat.txt). These files may not contain the repeats of known repeat diseases, so if you want to examine known repeat diseases, create your own BED file or use hg38-disease-tr.txt, which comes with tandem-genotypes (an hg19 version is also available). You can also create your own annotation from the reference genome, for example, using a tool called tantan [16] as follows. Option -w is the upper limit of the repeat unit length:

$ tantan -f4 -w1000 GRCh38.fa > tantan-out

3. When multiple repeat sites are given, the output of tandem-genotypes is prioritized: repeats are arranged from top to bottom according to repeat length change and the functional annotation of the region where the repeat resides (e.g., coding). Nevertheless, long repeat changes can also exist in healthy controls. Because such repeats are less likely to cause disease, it may be desirable to re-prioritize the repeats in light of the repeat changes seen in controls. For this, you can use tandem-genotypes-join to join the files and reorder them. For example, to prioritize repeats that are longer in file-1 than in the controls (file-2, file-3, and file-4), join them as follows:

$ tandem-genotypes-join file-1 : file-2 file-3 file-4 > prioritized

4. Software listed in this section is accessible at the following websites at the time of writing:
• LAST: https://gitlab.com/mcfrith/last
• tandem-genotypes: https://github.com/mcfrith/tandem-genotypes
• lamassemble: https://gitlab.com/mcfrith/lamassemble/blob/master/lamassemble
• tantan: https://gitlab.com/mcfrith/tantan
• samtools: http://samtools.github.io
• IGV: https://software.broadinstitute.org/software/igv/
• custom Alt-R® CRISPR-Cas9 guide RNA design tool: https://sg.idtdna.com/site/order/designtool/index/CRISPR_CUSTOM
The tools introduced here may be updated in the future, so be sure to check the websites before using them.

5. Conversion from MAF format to SAM or BAM format can be done as follows. The tool maf-convert is included in the LAST package. It can convert MAF format to various formats such as sam, psl, and tab:

$ maf-convert sam -d file.maf > file.sam
$ samtools view -bS -o file.bam file.sam
$ samtools sort -o file.sorted.bam file.bam
$ samtools index file.sorted.bam


References

1. Chintalaphani SR, Pineda SS, Deveson IW, Kumar KR (2021) An update on the neurological short tandem repeat expansion disorders and the emergence of long-read sequencing diagnostics. Acta Neuropathol Commun 9(1):98. https://doi.org/10.1186/s40478-021-01201-x
2. Depienne C, Mandel JL (2021) 30 years of repeat expansion disorders: What have we learned and what are the remaining challenges? Am J Hum Genet 108(5):764–785. https://doi.org/10.1016/j.ajhg.2021.03.011
3. Dolzhenko E, Deshpande V, Schlesinger F, Krusche P, Petrovski R, Chen S, Emig-Agius D, Gross A, Narzisi G, Bowman B, Scheffler K, van Vugt J, French C, Sanchis-Juan A, Ibanez K, Tucci A, Lajoie BR, Veldink JH, Raymond FL, Taft RJ, Bentley DR, Eberle MA (2019) ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics 35(22):4754–4756. https://doi.org/10.1093/bioinformatics/btz431
4. Nakamura H, Doi H, Mitsuhashi S, Miyatake S, Katoh K, Frith MC, Asano T, Kudo Y, Ikeda T, Kubota S, Kunii M, Kitazawa Y, Tada M, Okamoto M, Joki H, Takeuchi H, Matsumoto N, Tanaka F (2020) Long-read sequencing identifies the pathogenic nucleotide repeat expansion in RFC1 in a Japanese case of CANVAS. J Hum Genet 65(5):475–480. https://doi.org/10.1038/s10038-020-0733-y
5. Miyatake S, Yoshida K, Koshimizu E, Doi H, Yamada M, Miyaji Y, Ueda N, Tsuyuzaki J, Kodaira M, Onoue H, Taguri M, Imamura S, Fukuda H, Hamanaka K, Fujita A, Satoh M, Miyama T, Watanabe N, Kurita Y, Okubo M, Tanaka K, Kishida H, Koyano S, Takahashi T, Ono Y, Higashida K, Yoshikura N, Ogata K, Kato R, Tsuchida N, Uchiyama Y, Miyake N, Shimohata T, Tanaka F, Mizuguchi T, Matsumoto N (2022) Repeat conformation heterogeneity in cerebellar ataxia, neuropathy, vestibular areflexia syndrome. Brain 145(3):1139–1150. https://doi.org/10.1093/brain/awab363
6. Wright GEB, Collins JA, Kay C, McDonald C, Dolzhenko E, Xia Q, Becanovic K, Drogemoller BI, Semaka A, Nguyen CM, Trost B, Richards F, Bijlsma EK, Squitieri F, Ross CJD, Scherer SW, Eberle MA, Yuen RKC, Hayden MR (2019) Length of uninterrupted CAG, independent of polyglutamine size, results in increased somatic instability, hastening onset of Huntington disease. Am J Hum Genet 104(6):1116–1126. https://doi.org/10.1016/j.ajhg.2019.04.007
7. Sone J, Mitsuhashi S, Fujita A, Mizuguchi T, Hamanaka K, Mori K, Koike H, Hashiguchi A, Takashima H, Sugiyama H, Kohno Y, Takiyama Y, Maeda K, Doi H, Koyano S, Takeuchi H, Kawamoto M, Kohara N, Ando T, Ieda T, Kita Y, Kokubun N, Tsuboi Y, Katoh K, Kino Y, Katsuno M, Iwasaki Y, Yoshida M, Tanaka F, Suzuki IK, Frith MC, Matsumoto N, Sobue G (2019) Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease. Nat Genet 51(8):1215–1221. https://doi.org/10.1038/s41588-019-0459-y
8. Payne A, Holmes N, Clarke T, Munro R, Debebe BJ, Loose M (2021) Readfish enables targeted nanopore sequencing of gigabase-sized genomes. Nat Biotechnol 39(4):442–450. https://doi.org/10.1038/s41587-020-00746-x
9. Gilpatrick T, Lee I, Graham JE, Raimondeau E, Bowen R, Heron A, Downs B, Sukumar S, Sedlazeck FJ, Timp W (2020) Targeted nanopore sequencing with Cas9-guided adapter ligation. Nat Biotechnol 38(4):433–438. https://doi.org/10.1038/s41587-020-0407-5
10. Mitsuhashi S, Matsumoto N (2020) Long-read sequencing for rare human genetic diseases. J Hum Genet 65(1):11–19. https://doi.org/10.1038/s10038-019-0671-8
11. Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N (2019) Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol 20(1):58. https://doi.org/10.1186/s13059-019-1667-6
12. Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, Bosworth C, Armstrong J, Tigyi K, Maurer N, Koren S, Sedlazeck FJ, Marschall T, Mayes S, Costa V, Zook JM, Liu KJ, Kilburn D, Sorensen M, Munson KM, Vollger MR, Monlong J, Garrison E, Eichler EE, Salama S, Haussler D, Green RE, Akeson M, Phillippy A, Miga KH, Carnevali P, Jain M, Paten B (2020) Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol. https://doi.org/10.1038/s41587-020-0503-6
13. Hamada M, Ono Y, Asai K, Frith MC (2017) Training alignment parameters for arbitrary sequencers with LAST-TRAIN. Bioinformatics 33(6):926–928. https://doi.org/10.1093/bioinformatics/btw742
14. Kielbasa SM, Wan R, Sato K, Horton P, Frith MC (2011) Adaptive seeds tame genomic sequence comparison. Genome Res 21(3):487–493. https://doi.org/10.1101/gr.113985.110
15. Frith MC, Kawaguchi R (2015) Split-alignment of genomes finds orthologies more accurately. Genome Biol 16:106. https://doi.org/10.1186/s13059-015-0670-9
16. Frith MC (2011) A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 39(4):e23. https://doi.org/10.1093/nar/gkq1212
17. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772–780. https://doi.org/10.1093/molbev/mst010
18. Frith MC, Mitsuhashi S, Katoh K (2020) lamassemble: multiple alignment and consensus sequence of long reads. Methods in Molecular Biology 2231:135–145

Chapter 12

Finding Rearrangements in Nanopore DNA Reads with LAST and dnarrange

Martin C. Frith and Satomi Mitsuhashi

Abstract

Long-read DNA sequencing techniques such as nanopore are especially useful for characterizing complex sequence rearrangements, which occur in some genetic diseases and also during evolution. Analyzing the sequence data to understand such rearrangements is not trivial, due to sequencing error, rearrangement intricacy, and abundance of repeated similar sequences in genomes. The LAST and dnarrange software packages can resolve complex relationships between DNA sequences and characterize changes such as gene conversion, processed pseudogene insertion, and chromosome shattering. They can filter out numerous rearrangements shared by controls, e.g., healthy humans versus a patient, to focus on rearrangements unique to the patient. One useful ingredient is last-train, which learns the rates (probabilities) of deletions, insertions, and each kind of base match and mismatch. These probabilities are then used to find the most likely sequence relationships/alignments, which is especially useful for DNA with unusual rates, such as DNA from Plasmodium falciparum (malaria) with 80% a+t. This is also useful for less-studied species that lack reference genomes, so the DNA reads are compared to a different species’ genome. We also point out that a reference genome with ancestral alleles would be ideal.

Key words Gene conversion, Processed pseudogene, Chromothripsis, Mutation, Probability, Alignment, Evolution, Ancestral

1 Introduction

The LAST software was first made public in 2008: it was intended as a general tool for finding and aligning related regions in gigabase-scale sequence data. It is not really designed for recent long-read data such as nanopore; nevertheless, it can be used and has some unique advantages. The main advantages are:

1. It can learn the rates (probabilities) of deletions, insertions, and each kind of base match and mismatch (see Fig. 1), which are due to a combination of sequencing errors and real sequence differences. It then uses these probabilities to determine the most probable alignments [1].

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_12, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

Human
        a         c         g         t
a    0.28      0.00044   0.0087    0.00054
c    0.0013    0.2       0.00038   0.0023
g    0.009     0.00024   0.2       0.00039
t    0.00051   0.001     0.00039   0.29
Delete: open 0.037, extend 0.43
Insert: open 0.019, extend 0.4

Plasmodium falciparum
        a         c         g         t
a    0.38      0.00068   0.0079    0.0007
c    0.00033   0.11      5.1e-05   0.00086
g    0.0053    6.6e-05   0.1       0.00051
t    0.00038   0.0012    0.00055   0.39
Delete: open 0.035, extend 0.56
Insert: open 0.023, extend 0.52

C. maculatus vs. C. floridanus
        a         c         g         t
a    0.28      0.0017    0.017     0.0019
c    0.0013    0.19      0.00089   0.0065
g    0.016     0.00096   0.18      0.0012
t    0.0018    0.008     0.0014    0.3
Delete: open 0.035, extend 0.46
Insert: open 0.03, extend 0.43

Fig. 1 Rates (probabilities) of base matches, mismatches, deletions, and insertions between nanopore DNA reads and reference genomes. In each 4×4 matrix, a row corresponds to a genomic base, and a column to a read base

2. It can disentangle complex rearrangements and duplications between sequences, even when confounded by repeats, by finding the most probable split of a sequence into rearranged parts based on the alignment probability of each part [2].

Another useful feature is that LAST can compare DNA to protein sequences, allowing frameshifts, which are frequent in long DNA reads. This has been used for taxonomic and functional binning of metagenomic long reads [3].

Here we shall focus on finding rearrangements from LAST alignments of nanopore DNA reads to a reference genome. The next problem is that we typically find thousands of rearrangements, many of which seem to be artifactually rearranged reads, and many others real-but-benign [4, 5]. The artifacts seem to be sporadic, so they can be depleted by sequencing to several-fold coverage of the genome and then discarding reads with unique rearrangements not shared by any other read [4, 5]. When seeking rearrangements causing genetic disease, we also use nanopore reads from control individuals without the same disease and discard benign rearrangements shared with controls. These tasks downstream from LAST are done by dnarrange.

1.1 Examples of Match, Mismatch, Insertion, and Deletion Rates

Three examples of these rates, for nanopore DNA reads versus reference genomes, are shown in Fig. 1. The human rates are for one set of human nanopore reads, HG02723_1 [6], versus a reference human genome (hg38). We can see, for example, that a↔g substitutions are much more frequent than other substitutions, which is often the case for nanopore sequences [4, 5]. Also, the probability (rate) of opening a deletion is much higher than that of opening an insertion. The P. falciparum rates are for one set of nanopore reads [7] versus a reference genome: these are interesting because P. falciparum DNA is 80% a+t, so the match and mismatch rates are quite different from those of human. The final example is a set of nanopore reads from the ant Camponotus maculatus [7] versus the closest-available reference genome: Camponotus floridanus. This example has higher substitution rates, as expected between different species.
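These observations can be checked directly from the numbers in Fig. 1. The sketch below is a minimal illustration, assuming that each 4×4 matrix holds joint probabilities with rows as genomic bases and columns as read bases (as stated in the figure caption), so that row sums approximate the base composition of the aligned genome. The helper names are invented for this example.

```python
BASES = "acgt"

# Values copied from Fig. 1 (rows: genomic base, columns: read base)
HUMAN = [
    [0.28, 0.00044, 0.0087, 0.00054],
    [0.0013, 0.2, 0.00038, 0.0023],
    [0.009, 0.00024, 0.2, 0.00039],
    [0.00051, 0.001, 0.00039, 0.29],
]
PFALCIPARUM = [
    [0.38, 0.00068, 0.0079, 0.0007],
    [0.00033, 0.11, 5.1e-05, 0.00086],
    [0.0053, 6.6e-05, 0.1, 0.00051],
    [0.00038, 0.0012, 0.00055, 0.39],
]


def at_fraction(matrix):
    """Fraction of aligned genomic bases that are a or t."""
    total = sum(sum(row) for row in matrix)
    return (sum(matrix[0]) + sum(matrix[3])) / total


def ranked_mismatches(matrix):
    """Mismatch rates, highest first, labeled as genome>read."""
    pairs = [
        (matrix[i][j], BASES[i] + ">" + BASES[j])
        for i in range(4)
        for j in range(4)
        if i != j
    ]
    return sorted(pairs, reverse=True)


print("human a+t fraction:", round(at_fraction(HUMAN), 3))
print("P. falciparum a+t fraction:", round(at_fraction(PFALCIPARUM), 3))
print("two most frequent human mismatches:", ranked_mismatches(HUMAN)[:2])
```

The a+t fraction computed this way reflects only the aligned columns sampled by last-train, so it need not match the genome-wide figure exactly.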

1.2 Understanding Rearrangements

Various mutational processes can cause almost arbitrarily complex rearrangements, duplications, and deletions of DNA sequence (see Fig. 2). For example, chromosomes can shatter into many fragments that rejoin in a different order and orientation, or DNA polymerase can template-switch during DNA replication causing rearrangements, duplications, and deletions. No matter how complex these changes, there is one key point: almost every part of a descendant sequence comes from a unique part of ancestral sequence(s). This is shown in Fig. 2b: we can visually scan the derived sequence from top to bottom and see where each part comes from (joined-up diagonal lines). We might wonder about “insertions”: where does the inserted sequence come from? It must come from somewhere: typically, it has been duplicated or moved from elsewhere in the genome. Rarely, an insertion may come from a different genome, such as a virus. There is also non-templated insertion aka “spontaneous generation” of new sequence with no ancestor: this is the only exception to the rule.
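The key point, that every part of a derived sequence has one ancestral origin, can be illustrated with a toy model. In this sketch a derived sequence is assembled from pieces of a made-up ancestor, and each derived base records the ancestral coordinate it came from. The ancestor, the pieces, and the function names are all hypothetical and chosen only for illustration.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(seq):
    """Reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def derive(ancestor, pieces):
    """Build a derived sequence from (start, end, strand) pieces of the ancestor.

    Returns the derived sequence and, for each derived base, the ancestral
    coordinate it descends from.
    """
    seq_parts, origin = [], []
    for start, end, strand in pieces:
        part = ancestor[start:end]
        coords = list(range(start, end))
        if strand == "-":
            part = revcomp(part)
            coords.reverse()
        seq_parts.append(part)
        origin.extend(coords)
    return "".join(seq_parts), origin


ancestor = "aaccggttacgtacgtgatc"  # made-up 20-base ancestor
# Keep 0-5, invert 10-15, then duplicate 5-10; 15-20 is deleted
pieces = [(0, 5, "+"), (10, 15, "-"), (5, 10, "+"), (5, 10, "+")]
derived, origin = derive(ancestor, pieces)
deleted = sorted(set(range(len(ancestor))) - set(origin))
print(derived)
print("ancestral positions with no descendant:", deleted)
```

Note that the mapping runs from derived to ancestral positions: an ancestral base may have zero, one, or several descendants, but each derived base has exactly one origin, which is what makes the ancestral viewpoint simple.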

Fig. 2 A made-up example of complex rearrangements during DNA sequence evolution. Each colored block represents a segment of DNA, e.g., a few hundred bp. These segments become rearranged, duplicated, or deleted. The blocks labeled n and s are similar sequences from an earlier duplication, i.e., paralogs. Panel labels: (a) ancestral genome(s), rearrangements, derived sequence, and derived reference genome (e.g. hg38); (b) ancestral genome(s) versus derived sequence; (c) derived reference genome (e.g. hg38) versus derived sequence. Reproduced from [5]


Unfortunately, we do not have ideal ancestral sequences: in practice, we compare DNA reads to a “reference” genome that has separately undergone complex changes. The relationship between two descendant sequences is fundamentally less simple (see Fig. 2c), e.g., duplication in both lineages creates a many-to-many relationship, and reference-specific deletion means that parts of DNA reads correctly align nowhere in the reference but may align incorrectly to paralogs. Even if we could perfectly detect the correct relationships (diagonal red lines in Fig. 2c), it is hard to understand what changes have occurred.

In practice, we assume the reference genome is ancestral, even though it is not. We are interested in rearrangements that cause genetic disease: at least in these cases the reference is likely to represent the ancestral state. If virus insertions are suspected, viral chromosomes can be added as extra chromosomes to the reference. We will find spurious rearrangements because the reference is not perfectly ancestral: the hope is that these will also be found in control data, thus discarded. Another way to detect reference-genome regions with non-ancestral status is by comparing to an outgroup, e.g., an ape genome [4]. It would be useful to have a reference genome that contains ancestral alleles [4, 5].

1.3 Simple Sequences

DNA is rife with “simple sequences” such as catcatcatcat or aaaattaaaacaaa. They cause many similarities between sequences that are not correct relationships, because they are not descended from a common ancestor. There are several methods for detecting and “masking” simple sequences, which do not all work equally well [8]. LAST uses tantan [8] to detect simple sequences and converts them to lowercase letters. However, masking can cause problems by hiding correct relationships between sequences, which might cause incorrect relationships to be found instead. So, by default, LAST treats lowercase the same as uppercase when finding alignments. Alignments without a significant amount of uppercase-to-uppercase alignment are suspicious and can optionally be discarded by last-postmask.
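To show what “uppercase-to-uppercase alignment” means, the following sketch counts the columns of a pairwise alignment in which both sequences have an uppercase (unmasked) letter. It is a deliberately crude proxy: the actual criterion used by last-postmask may differ, and the example alignment rows are made up.

```python
def uppercase_columns(ref_row, read_row):
    """Count alignment columns where both rows hold an uppercase letter.

    Gap characters ("-") and lowercase (masked) letters do not count.
    """
    if len(ref_row) != len(read_row):
        raise ValueError("alignment rows must have equal length")
    return sum(1 for a, b in zip(ref_row, read_row) if a.isupper() and b.isupper())


# Made-up alignment: a masked "cat" repeat flanked by unmasked sequence
ref_row = "ACGTcatcatcatTTGA"
read_row = "ACGTcat---catTTGA"
print(uppercase_columns(ref_row, read_row))

# An alignment consisting only of masked simple sequence has no such columns
print(uppercase_columns("catcatcat", "catcatcat"))
```

An alignment like the second one would be a candidate for discarding, since nothing outside the simple sequence supports it.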

2 Methods

2.1 Installation

It may be easiest to install the software from Bioconda [9] or from Debian Med [10]. After setting up Bioconda on your computer, this command installs dnarrange and all its dependencies, including LAST (see Note 1):

conda install dnarrange

The following discussion applies to LAST version ≥ 1387 and dnarrange version ≥ 1.5.2, and not to older versions.

2.2 Getting the Camponotus Rates

Let us first see how to get the ant results in Fig. 1. We need the reference-genome sequence in fasta format, which we got by searching “camponotus” at NCBI Genome (https://www.ncbi.nlm.nih.gov/genome/). We renamed this file Cflo_v7.5.fa. The first step is to prepare “index” data structures for the genome, which enable fast sequence comparison:

lastdb -P16 -uNEAR antDB Cflo_v7.5.fa

This creates several files whose names start with antDB. The -P16 option makes it faster by running 16 parallel threads, with no effect on the result. Adjust as appropriate for your computer (see Note 2). The -uNEAR option changes the seeding scheme. LAST uses a seed-and-extend heuristic: it first finds “seeds,” simple gapless alignments, and then tries to extend full alignments from the seeds. Although the seeds are gapless, they can allow some mismatches: the more mismatches they allow, the longer they need to be to retain specificity. -uNEAR specifies short seeds with few mismatches, which is appropriate for searching indel-rich nanopore reads against a closely related genome. If you omit -uNEAR, it may not make much difference in practice.

Next, we need the DNA reads in fasta or fastq format. We do not use the extra information in fastq, so fasta is preferable because the files are much smaller. We can find the rates of substitution, etc., between reads and genome like this, using the original fastq file in this case:

last-train -P16 -Q0 antDB CmacRNAseq_180528.fastq.gz > ants.train
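Since the quality data in fastq is not used here, reads can be converted to the smaller fasta format beforehand. The sketch below is one simple way to do so, assuming the common four-line fastq layout in which sequences are not wrapped over several lines; the function name and the demonstration records are made up for this example.

```python
import io


def fastq_to_fasta(fastq_lines):
    """Yield fasta lines from four-line fastq records.

    Assumes each record is exactly four lines (header, sequence, "+", quality).
    """
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError("expected a fastq header starting with '@'")
            yield ">" + line[1:]
        elif i % 4 == 1:
            yield line
        # lines 2 and 3 of each record ("+" and qualities) are dropped


# Made-up two-read fastq, held in memory for demonstration
demo = io.StringIO("@read1\nACGT\n+\n!!!!\n@read2\nGGCA\n+\n####\n")
fasta_lines = list(fastq_to_fasta(demo))
print("\n".join(fasta_lines))
```

For a compressed file, the same function can be given a text-mode handle from gzip.open(path, "rt") in the Python standard library.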

The -Q0 option makes it ignore the fastq quality data (see Note 3). If you have fasta instead of fastq, -Q0 has no effect and can be omitted. It is worth noting that last-train does not actually use all the DNA reads: it uses a random sample of size one million bases (probably overkill). This means that giving huge read files to last-train does not make it much slower, except for the time needed to read the file and get a random sample. last-train works iteratively: it compares the reads to the genome to find the rates, then uses these rates to do the comparison more accurately and get better rates, and so on. It prints data for each iteration (for troubleshooting), with the final rates at the end. It is not necessary to look at them: you can just pass the output file to the next alignment step.

2.3 Getting the Plasmodium falciparum Rates

The P. falciparum rates were found like this:

lastdb -P16 -uNEAR -R02 plaDB plaFal3D7.fa
last-train -P16 -Q0 plaDB P.falciparum_targeted_seq.fastq > pf.train


The only change is the -R02 option, which is recommended for DNA with 80% a+t. This option changes the tantan parameters for defining “simple sequence,” which is necessary for very (a+t)-rich DNA. If you omit -R02, it might not matter too much in practice. last-train avoids lowercased simple sequence, but in this case it hardly affects the result.

2.4 Getting the Human Rates

The human rates were found like this:

lastdb -P16 -uRY4 hdb hg38.analysisSet.fa
last-train -P16 -Q0 hdb HG02723_1.fastq.gz > hum.train

The change here is to use -uRY4 instead of -uNEAR. This makes LAST faster and use less memory, at a cost in sensitivity. The main reason for doing this is that the fastq file is quite big, 81 gigabases. This is not necessary for last-train, which just uses a sample of the fastq: the point is to speed up the subsequent alignment step. RY4 makes it use 1/4 of the seeds, in a similar way to this: just use seeds starting with a (see Note 4). This last-train run took 23 min, most of which was spent decompressing the fastq file to get a random sample from it. It is not necessary to re-run last-train for each dataset, unless there is reason to think the rates may have changed, e.g., due to different versions of sequencing hardware or base-calling software.

2.5 Aligning Human DNA Reads to a Human Genome

The next step is to actually align all the reads to the genome:

lastal -P16 --split -p hum.train hdb HG02723_1.fastq.gz | gzip > out.maf.gz

This lastal command works in two stages. First, it finds and aligns similar regions of the reads and the genome, often aligning the same part of a read to several genome regions. Second, --split makes it cut these alignments down to a unique best alignment for each part of each read. This is appropriate if the genome is ancestral to the reads (see Fig. 2). Sometimes, part of a read matches two or more genome regions with almost equal probability: --split chooses the most probable alignment (arbitrarily if exactly equal) and outputs “mismap probabilities,” the probability that each part of a read is aligned to the wrong place. Finally, gzip compresses the output (optional, see Notes 5 and 6). This alignment took 6 h, with 16 threads (-P16).
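The notion of a mismap probability can be illustrated with a simplified calculation. The sketch below assumes that a read part has a few candidate alignments, each with a known natural-log likelihood, that priors are equal, and that exactly one candidate is correct; the probability that a candidate is wrong is then one minus its share of the total likelihood. This is a conceptual sketch under those assumptions, not the computation performed by --split, and the input values are hypothetical.

```python
import math


def mismap_probabilities(log_likelihoods):
    """Probability that each candidate alignment is wrong.

    Assumes equal priors and that exactly one candidate is correct.
    """
    top = max(log_likelihoods)
    # Subtract the maximum before exponentiating, for numerical stability
    weights = [math.exp(x - top) for x in log_likelihoods]
    total = sum(weights)
    return [1.0 - w / total for w in weights]


# Two equally good candidates (e.g., near-identical repeat copies)
tied = mismap_probabilities([0.0, 0.0])
print(tied)

# One candidate far more likely than the other
clear = mismap_probabilities([10.0, 0.0])
print(clear)
```

The tied case corresponds to the situation described for repeats such as L1 elements, where the chosen alignment is arbitrary and its mismap probability is high.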

2.6 An Alternative Way Using WindowMasker

Our published human studies have mostly not used RY4 (because it is somewhat new) and instead saved time by “masking” repeats. The main cause of alignment slowness is the abundance of repeats, such as LINEs, SINEs, and simple sequences, so each repeat in a read gets preliminarily aligned to multiple genome locations. We can mitigate this problem by masking repeats. We wish to mask as little as possible in order to maximize sensitivity, but enough to make the run time tolerable. For this aim, WindowMasker [11] seems to work well: it is part of the BLAST package, which can be obtained from NCBI, Bioconda, or Debian Med. The following commands find repeats in the genome and convert them to lowercase:

windowmasker -mk_counts -in hg38.analysisSet.fa > hum.wmstat
windowmasker -ustat hum.wmstat -outfmt fasta -in hg38.analysisSet.fa > hum-wm.fa

The next step is to run lastdb like this:

lastdb -P16 -uNEAR -R11 -c hwmdb hum-wm.fa

The -R11 option retains lowercase from the input and additionally lowercases simple repeats found by tantan. The -c tells it to “mask” lowercase: this will exclude lowercase from seeds but not from final alignments, and then discard alignments that lack a significant amount of uppercase-to-uppercase alignment (the same as last-postmask). The subsequent last-train and alignment steps are the same as above.

2.7 Finding Rearrangements with dnarrange

dnarrange gets rearranged DNA reads from the read-to-genome alignments and performs two optional filtering steps. It discards reads with unique rearrangements not shared by any other read, and discards reads with rearrangements shared by control reads. So we need control DNA reads aligned to the same genome. A few control sets aligned to hg38 are available at https://github.com/mcfrith/dnarrange. We can run dnarrange like this:

dnarrange -v out.maf.gz : controls/hg38-* > groups.maf

The -v (verbose) option just makes it show progress messages as it is working (useful for troubleshooting and reassurance that it is doing something). The “:” separates case from control files (see Note 7). The output has the alignments of the rearranged reads, in groups where reads that cover the same rearrangement are in the same group. In our studies of human patients, the number of groups per patient declines from thousands without controls to a few dozen after control filtering [5]. Filtering is more effective when some controls are from the same family or ethnic group as the case individual.

It is possible to not use control DNA reads, but then an overwhelming number of rearrangements may be found. They may include dubious rearrangements in regions where the reference genome does not represent the ancestral state, or where the reference is incomplete. For example, if a segment of the reference has been deleted, the corresponding part of a DNA read may get wrongly aligned to a paralog elsewhere in the genome, showing an incorrect rearrangement. An alternative is to use last-postmask (if repeats were not already masked with lastdb option -c):

last-postmask out.maf.gz | dnarrange -v - : controls/hg38-* > groups.maf

The “-” has a standard meaning of reading the data that is piped in. As mentioned above, last-postmask discards alignments that are mostly of lowercased simple sequence. It is doubtful whether such alignments reflect correct relationships. Nevertheless, such alignments can be informative because they do indicate changes such as expansion or insertion of simple sequence (see Note 8).

2.8 Making Dotplot Figures of the Rearrangements

We can make a dotplot figure of each group in groups.maf like this:

last-multiplot -a refGene.txt -a rmsk.txt groups.maf fig-dir

This puts the figures in a new directory fig-dir. We have used the -a option to show genes and repeats, using files downloaded from the UCSC Genome Browser (http://genome.ucsc.edu). It is also possible to show BED, GFF/GTF, or RepeatMasker .out data, or unsequenced gaps in AGP or gap.txt format (see Note 9). Naturally, these dotplots are more useful when there are not too many of them, and making many of them is slow. Some examples of human rearrangements, in HG02723_1 versus hg38, are shown in the remaining figures. Figure 3 shows a classic genetic phenomenon: gene conversion. This is a process where one DNA sequence gets replaced, probably during DNA repair, by a copy of a similar sequence. In this case, part of an L1 LINE element in chromosome 20 has been replaced by the homologous part of another L1 element. Here, the converted read parts are aligned to one L1 in chromosome 6, but their mismap probabilities are quite high (not shown), because they have almost equally probable alignments to other L1s. So the specific donor element may not be confidently knowable.


Fig. 3 Alignment of 2 human DNA reads to human genome hg38, showing gene conversion. The read identifiers are shown on the left: the final - or + indicates that the read has been reverse-complemented or not. The figure shows only part of each read. To the left of the vertical black line is 1.5 kb of chromosome 6, to the right is 10 kb of chromosome 20. The vertical stripes show repeat elements in hg38 (The stripes have 3 different colors for forward-oriented elements, reverse-oriented elements, and simple repeats)

Gene conversion is a poster child for probability-based sequence alignment of the sort done by --split [4]. A naive alignment method would just align these DNA reads continuously to this L1 in chromosome 20 because the converted sequence is similar to the replaced sequence. Figure 4 shows a short-range rearrangement, localized within 5 kb, with a rather small rearranged fragment. This resembles mutations that have been attributed to template switching during DNA replication, which can create numerous complex mutation patterns, and have caused erroneous variant annotations [12]. Figure 5 shows insertion of a processed pseudogene from chromosome 11 into chromosome 13. This is a process where an mRNA molecule, after its introns have been removed, is reverse-transcribed and inserted into the genome (perhaps by a reverse transcriptase enzyme from a retrotransposon).


Fig. 4 Alignment of 2 human DNA reads to human genome hg38, showing a short-range rearrangement. The top read is short, and its identifier is omitted

Finally, Fig. 6 shows a DNA read aligning to chromosome 1, with a gap in the alignment, which might be a deletion in the read or an insertion in the reference. The gap coincides exactly with an L1PA2 element, which is a young LINE retrotransposon. There is no known mechanism for precise excision of a LINE element, whereas LINE insertion is normal, so this is an insertion that occurred in the lineage leading to the reference sequence. This is an example where the reference does not have the ancestral state. Outgroups (other mammal genomes) lack this L1PA2 (see Fig. 6).

2.9 last-dotplot

last-multiplot (which comes with dnarrange) uses last-dotplot (part of LAST) to draw each figure. It may be useful to run last-dotplot directly. For example, Fig. 4 was made like this:

last-dotplot -a refGene.txt -a rmsk.txt -j2 --labels1=2 -1 chr4:35117500-35122500 groups.maf fig4.png

Option -j2 draws the gray lines joining the aligned segments, --labels1=2 adds the start coordinate and length to the top of the figure, and -1 chr4:35117500-35122500 shows that range of chr4.


Fig. 5 Alignment of 14 human DNA reads to human genome hg38, showing insertion of a processed pseudogene from chromosome 11 into chromosome 13. The vertical green stripes show exons of USP28, which encodes ubiquitin-specific peptidase 28

Fig. 6 Above: alignment of a human DNA read to human genome hg38, showing reference-specific insertion of an L1 LINE retrotransposon in chromosome 1. Below: alignments of other mammal genomes to this part of chromosome 1 (screenshot from http://genome.ucsc.edu)

2.10 Rearrangement Types and Thresholds

A rearrangement is indicated when two parts of a DNA read align to disjoint places in the genome. dnarrange classifies four kinds of disjointness:

1. Different chromosomes (see Fig. 3).
2. Opposite strands of one reference sequence (see Fig. 4).
3. Non-colinear alignment of two read parts to the same strand of the same reference sequence. In other words: the alignment of the upstream read part ends at coordinate X in the reference sequence, the alignment of the downstream read part starts at coordinate Y, and Y is upstream of X.
4. Two consecutive read parts (i.e., with no aligned part between them in the read) that align colinearly with a big gap in the reference (see Fig. 6).

By default, dnarrange ignores “big gaps” < 10 kb and non-colinearities where Y is less than 1 kb upstream of X. This is because small deletions (colinear gaps) are frequent and arguably not “rearrangements,” and small tandem duplications (non-colinear) are overwhelmingly numerous. You can set the g (gap) and r (reverse jump) thresholds in bp:

dnarrange -v -g1000 -r100 out.maf.gz : controls/hg38-* > groups.maf

We have said that dnarrange discards DNA reads that share rearrangements with control reads, but the truth is a bit more complex. dnarrange prioritizes rearrangement types in this order: inter-chromosome > inter-strand > non-colinear > gap. For each read, it finds the highest-priority rearrangement type present in that read, and discards the read if it shares a rearrangement of that type with control reads.
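The four kinds of disjointness, the default thresholds, and the priority order can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not dnarrange's actual implementation: the Part record, the function names, and the convention that coordinates are counted along the aligned strand are all assumptions made for this sketch.

```python
from dataclasses import dataclass

# Priority order quoted in the text:
# inter-chromosome > inter-strand > non-colinear > gap.
PRIORITY = ["inter-chromosome", "inter-strand", "non-colinear", "gap"]


@dataclass
class Part:
    """One aligned part of a read (hypothetical record, not a dnarrange type)."""
    chrom: str
    strand: str  # "+" or "-"
    start: int   # start in the reference, counted along the aligned strand
    end: int     # end in the reference, counted along the aligned strand


def classify(upstream, downstream, min_gap=10_000, min_rev=1_000):
    """Return the kind of disjointness between two consecutive read parts, or None."""
    if upstream.chrom != downstream.chrom:
        return "inter-chromosome"
    if upstream.strand != downstream.strand:
        return "inter-strand"
    x, y = upstream.end, downstream.start
    if x - y >= min_rev:   # Y lies at least min_rev bp upstream of X
        return "non-colinear"
    if y - x >= min_gap:   # colinear, but separated by a big reference gap
        return "gap"
    return None


def top_priority(kinds):
    """Pick the highest-priority kind present in a read, or None if there is none."""
    present = [kind for kind in PRIORITY if kind in kinds]
    return present[0] if present else None
```

The defaults min_gap=10_000 and min_rev=1_000 mirror the default thresholds described above; passing min_gap=1000 and min_rev=100 would correspond to the -g1000 -r100 example.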


2.11 Other Features of dnarrange


We have covered the most important features of dnarrange, but it can do some further things. dnarrange-merge merges each group of DNA reads into a consensus sequence, using lamassemble [13]. By re-aligning the consensus sequence to the genome, we can perhaps see the rearrangement more clearly and accurately. More simply, dnarrange-merge can just get the rearranged reads without merging them:

dnarrange-merge HG02723_1.fastq.gz groups.maf > some-reads.fastq

This may be useful, because we can re-align just these reads to the genome more slowly and sensitively, for example like this:

lastdb -P16 -uNEAR nearDB hg38.analysisSet.fa
lastal -P16 -m50 --split -p hum.train nearDB some-reads.fastq > out2.maf

Here we have used NEAR instead of RY4 for greater sensitivity. We have re-used the last-train result: no need to re-run it. Finally, we added option -m50, which makes it even more slow and sensitive. Higher m values make it increasingly slow and sensitive: the default is 10 (but the -uRY options set the default to 2).

A further issue is that some large and complex rearrangements, e.g., in genetic diseases, are larger even than “long” reads. So each group of reads covers only part of the rearrangement. In order to understand the whole rearrangement, we need to correctly order and orient the groups (i.e., the rearrangement parts). dnarrange includes a method that seeks a most-parsimonious order and orientation, which succeeded in fully characterizing some large and complex rearrangements [5, 14]. Interestingly, these rearrangements have holistic features, e.g., loss of sequence, that are knowable only from the whole rearrangement and not from the parts.

dnarrange can be very slow and memory-consuming. This seems to be because many DNA reads align dubiously to a few genomic hotspots, and dnarrange spends much effort comparing these reads to each other. The solution is to use control reads, which rapidly discard these hotspot reads. This works better if the case and control reads were analyzed in the same way, e.g., with/without masking. It also works better if the exact same reference was used. (There are unfortunately different versions of, e.g., hg38.) Another way to mitigate this problem is to use last-postmask.

It is useful if the DNA reads are long enough to cover a whole rearrangement, but on the other hand long reads have a disadvantage. The problem is that if a read overlaps two rearrangements, and one of them is shared by controls, the whole read may get discarded. This is a limitation of dnarrange because it is hard to know whether a read covers two small rearrangements or two parts of one large rearrangement. Therefore, a combination of longer and not-so-long reads may be best.

Our previous publications show further interesting rearrangements found by these methods, such as tandem heptuplication, 3′ transduction from a LINE, or chromosome shattering [4, 5].

3 Notes

1. At the time of writing, conda seems to have a bug where it may install old versions of software: it may be better to use mamba (https://github.com/mamba-org/mamba).
2. As a special case, -P0 uses as many threads as your computer claims it can run simultaneously, which is a good way to annoy the other users of a shared server.
3. LAST can alternatively use the quality data for more accurate training and alignment. This assumes, however, that the qualities indicate probability of substitution error, not insertion or deletion error, and we are not confident this holds for nanopore data.
4. Actually, RY4 uses seeds starting with combinations of purines and pyrimidines [15]. There are faster alternatives using 1/8, 1/16, and 1/32 of the seeds: RY8, RY16, and RY32.
5. We can use gzip options to get faster but worse compression. For example, gzip -5 shaves 20% off the alignment run time and adds 6% to the output size.
6. The data transfer from lastal to gzip is inefficient because lastal produces output in bursts, and gzip takes time to absorb each burst, making lastal wait. This can be fixed with mbuffer:

lastal -P16 --split -p hum.train hdb HG02723_1.fastq.gz | mbuffer | gzip > out.maf.gz

mbuffer absorbs the bursts quickly and feeds them to gzip. mbuffer is available in Bioconda at the time of writing, although it is not biology-specific.
7. It is possible to give dnarrange more than one case file: it will only output groups that have reads from all case files.
8. To analyze tandem repeat changes, we have specialized software tandem-genotypes [16], described elsewhere in this volume.
9. It sometimes happens that a putative rearrangement coincides with an unsequenced gap in the reference genome, so it is useful to visualize unsequenced gaps.

Acknowledgements

We thank Takeshi Mizuguchi, Kazuharu Misawa, and Naomichi Matsumoto for helping us to fix inefficiencies in dnarrange.

References

1. Hamada M, Ono Y, Asai K, Frith MC (2017) Training alignment parameters for arbitrary sequencers with last-train. Bioinformatics 33(6):926–928
2. Frith MC, Kawaguchi R (2015) Split-alignment of genomes finds orthologies more accurately. Genome Biology 16(1):1–17
3. Huson DH, Albrecht B, Bağcı C, Bessarab I, Gorska A, Jolic D, Williams RB (2018) MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biology Direct 13(1):1–17
4. Frith MC, Khan S (2018) A survey of localized sequence rearrangements in human DNA. Nucleic Acids Res 46(4):1661–1673
5. Mitsuhashi S, Ohori S, Katoh K, Frith MC, Matsumoto N (2020) A pipeline for complete characterization of complex germline rearrangements from long DNA reads. Genome Medicine 12(1):1–17
6. Shafin K, Pesout T, Lorig-Roach R, Haukness M, Olsen HE, Bosworth C, Armstrong J, Tigyi K, Maurer N, Koren S, et al. (2020) Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nature Biotechnology 38(9):1044–1053
7. Shabardina V, Kischka T, Manske F, Grundmann N, Frith MC, Suzuki Y, Makałowski W (2019) NanoPipe - a web server for nanopore MinION sequencing data analysis. GigaScience 8(2):giy169
8. Frith MC (2011) A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res 39(4):e23–e23
9. Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J (2018) Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods 15(7):475–476
10. Möller S, Krabbenhöft HN, Tille A, Paleino D, Williams A, Wolstencroft K, Goble C, Holland R, Belhachemi D, Plessy C (2010) Community-driven computational biology with Debian Linux. BMC Bioinformatics 11(Suppl 12):S5
11. Morgulis A, Gertz EM, Schäffer AA, Agarwala R (2006) WindowMasker: window-based masker for sequenced genomes. Bioinformatics 22(2):134–141
12. Löytynoja A, Goldman N (2017) Short template switch events explain mutation clusters in the human genome. Genome Research 27(6):1039–1049
13. Frith MC, Mitsuhashi S, Katoh K (2021) lamassemble: multiple alignment and consensus sequence of long reads. In: Multiple sequence alignment, pp 135–145. Springer
14. Lei M, Liang D, Yang Y, Mitsuhashi S, Katoh K, Miyake N, Frith MC, Wu L, Matsumoto N (2020) Long-read DNA sequencing fully characterized chromothripsis in a patient with Langer-Giedion syndrome and Cornelia de Lange syndrome-4. J Hum Genet 65(8):667–674
15. Frith MC, Noé L, Kucherov G (2020) Minimally overlapping words for sequence similarity search. Bioinformatics 36(22-23):5344–5350
16. Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N (2019) Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biology 20(1):1–17

Chapter 13

Long-Read Whole-Genome Sequencing Using a Nanopore Sequencer and Detection of Structural Variants in Cancer Genomes

Yasuhiko Haga, Yoshitaka Sakamoto, Miyuki Arai, Yutaka Suzuki, and Ayako Suzuki

Abstract

Long-read sequencing technologies enable us to precisely identify structural variants (SVs), which are sometimes associated with various types of diseases, including cancers. In this chapter, we introduce experimental and computational procedures for conducting long-read whole-genome sequencing (WGS) of cancer genomes from fresh frozen tissues/cells. We also demonstrate the analysis of SVs in cancer genomes, using long-read WGS data from lung cancer cell lines and several representative computational tools, such as cuteSV and Sniffles2, as examples.

Key words Long-read sequencing, Structural variant, Nanopore sequencing, Whole-genome sequencing, Cancer genome analysis

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_13, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Recently, long-read sequencing technologies have been widely utilized for sequencing analysis of various types of healthy and diseased human genomes. Long-read analysis has proven an especially powerful tool for detecting structural variants (SVs) whose breakpoints, often located in repetitive regions, need to be covered by long-read sequences for precise detection. Various types of SVs occur as variations or mutations in human genomes, including large deletions, insertions, inversions, tandem duplications, and translocations, and many are associated with diseases including cancers [1]. In cancer genomes, many somatic SVs have been detected and characterized as possible causes of carcinogenesis and cancer progression [2]. For example, oncogenic chromosomal rearrangements, detected in lung adenocarcinomas, were shown to generate fusion genes and function as driver events of carcinogenesis [3, 4]. Additionally, more complicated SVs, which consist of a combination of multiple types of SVs, have been observed in cancer genomes and reported to affect functions of tumor-suppressor genes [5]. Long-read sequencing enables precise detection of SV breakpoints and their complete structures, which is crucial for the comprehensive understanding of aberrant molecular events involved in cancer genome evolution.

Here, we introduce examples of the experimental and computational procedures used for long-read whole-genome sequencing (WGS) analysis and for detecting SVs in cancer genomes (Fig. 1). On the experimental side, we first extract high-molecular-weight (HMW) DNA from fresh frozen cells or tissues to obtain sequences with longer read lengths. The DNA samples then undergo library preparation, including the steps of end-prep and adapter ligation, prior to sequencing on the PromethION, a nanopore-type long-read sequencer. On the computational side, post-sequencing analyses, including basecalling, mapping to the reference genome, and detection of SVs, are performed using bioinformatics tools specifically developed for long-read datasets. The mapped sequencing reads and the detected SVs are then visualized and passed through further analyses to interpret their functional and biological relevance for cancers.

Fig. 1 Summary of the experimental and computational procedures used for long-read WGS and SV detection. The experimental workflow (left) and the computational workflow (right) are shown separately. [Experimental workflow: sample (frozen cells/tissues), HMW DNA extraction, library preparation (end-prep and adapter ligation), sequencing (PromethION). Computational workflow: base calling (FASTQ), mapping with Minimap2 (BAM), SV detection with cuteSV, Sniffles2, (Nanomonsv) (VCF), visualization with IGV.]

SV Analysis Using Long Read Sequencing Data

2 Materials

Preparation of samples, consumables, and equipment used in the experimental analysis for data generation of long-read WGS are described in Subheadings 2.1, 2.2 and 2.3. Preparation of test datasets, reference genome data, and software used in the computational analysis for detecting SVs are described in Subheadings 2.4, 2.5 and 2.6.

2.1 Sample Preparation (Experiment)

In preparing fresh frozen cells or tissues for obtaining HMW DNA samples, cells or tissues should be rapidly frozen, and repeated freezing and thawing must be avoided to prevent damage and DNA degradation.

2.2 Consumables (Experiment)

1. MagAttract HMW DNA Kit (QIAGEN)
2. Genomic DNA ScreenTape (Agilent Technologies)
3. Genomic DNA Reagents (Agilent Technologies)
4. Qubit dsDNA HS/BR Assay Kit (Thermo Fisher Scientific)
5. Ligation Sequencing Kit (Oxford Nanopore Technologies)
6. NEBNext FFPE Repair Mix (New England Biolabs)
7. NEBNext Ultra II End repair/dA-tailing Module (New England Biolabs)
8. NEBNext Quick Ligation Module (New England Biolabs)
9. Agencourt AMPure XP beads (Beckman Coulter)
10. Flow Cell Priming Kit (Oxford Nanopore Technologies)
11. PromethION Flow Cell (Oxford Nanopore Technologies)
12. 2 mL DNA LoBind tube (Eppendorf)
13. 1.5 mL DNA LoBind tube (Eppendorf)
14. 0.2 mL PCR tube
15. 70% ethanol
16. Nuclease-free water

2.3 Equipment (Experiment)

1. TapeStation (Agilent Technologies)
2. Qubit Fluorometer (Thermo Fisher Scientific)
3. PromethION (Oxford Nanopore Technologies)
4. Rotator
5. Thermal cycler
6. Thermomixer
7. Vortex mixer
8. Magnetic stand
9. Centrifuge


2.4 Dataset (Data Analysis)

For SV analysis and detection, we use FASTQ files output by the PromethION for long-read WGS of lung cancer cell lines. All data are publicly available in the DNA Data Bank of Japan under the accession number DRA008154 [5] (see Note 1).

2.5 Software (Data Analysis)

For the computational procedures used for mapping and detecting SVs, the required software tools (with the versions we used in a Miniconda3 virtual environment [https://docs.conda.io/en/latest/miniconda.html]) are as follows: Minimap2 [6] (version 2.17-r941), SAMtools [7] (version 1.6), Sniffles2 [8, 9] (version 2.0.6), and cuteSV [10] (version 1.0.8). For the visualization of SVs, the Integrative Genomics Viewer (IGV) [11–13] is used. To avoid conflicts between tool dependencies, install the tools for each analysis in a dedicated conda virtual environment. Commands for installing these tools and checking that they run are shown below:

$ conda install -c bioconda samtools
$ samtools
$ conda install -c bioconda htslib
$ conda install -c bioconda minimap2
$ minimap2 --version
$ conda install -c bioconda sniffles=2.0
$ sniffles --help
$ conda install -c bioconda cutesv
$ cuteSV -h
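Before starting a long analysis, it can be convenient to confirm that every tool is reachable from the active environment. The following is a minimal Python sketch of such a check, assuming the executables carry the names used in the installation commands above; the function names are hypothetical and the sketch does not check versions.

```python
import shutil

# Executable names taken from the installation commands in this subheading.
TOOLS = ["minimap2", "samtools", "sniffles", "cuteSV"]


def find_tools(names=TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=TOOLS):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}\t{path or 'NOT FOUND'}")
```

Running the script inside the conda environment lists one line per tool; an empty result from missing_tools() suggests the environment is ready.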

2.6 Reference Genome (Data Analysis)

Download the FASTA file of the human reference genome hg38 from the UCSC Genome Browser [14] and decompress it to extract information on chromosomes 1–22, X, Y, and M for further analysis. The index file of the reference genome should be prepared for mapping analysis using Minimap2:

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
$ tar xvzf hg38.chromFa.tar.gz
$ for i in {1..22} X Y M; do cat chroms/chr${i}.fa >> hg38.fa; done
$ minimap2 -x map-ont -d hg38.mmi hg38.fa
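The concatenation step in the commands above can also be written in Python, which makes the chromosome order explicit. This is a sketch under the assumption that the archive has been unpacked into a directory of per-chromosome files named chr1.fa, chr2.fa, and so on, as the shell loop expects; the function names are hypothetical.

```python
import shutil
from pathlib import Path


def chromosome_names():
    """Names in the order used by the shell loop: chr1..chr22, chrX, chrY, chrM."""
    return [f"chr{i}" for i in list(range(1, 23)) + ["X", "Y", "M"]]


def concatenate_chromosomes(chrom_dir, out_fasta):
    """Write the per-chromosome FASTA files from chrom_dir into one file."""
    chrom_dir = Path(chrom_dir)
    with open(out_fasta, "wb") as out:  # "wb" starts from an empty output file
        for name in chromosome_names():
            with open(chrom_dir / f"{name}.fa", "rb") as part:
                shutil.copyfileobj(part, out)
```

One design difference is worth noting: the shell loop appends with >>, so running it twice would duplicate every chromosome in hg38.fa, whereas this sketch overwrites the output file on each run.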

As an alternative to the current reference genome (hg38), the complete sequence of the human genome, free of ambiguous N bases, was recently released by the Telomere-to-Telomere Consortium [15] and can be accessed at https://github.com/marbl/CHM13.

3 Methods

The experimental procedures are presented in Subheadings 3.1 and 3.2, and the computational procedures, in Subheadings 3.3, 3.4, 3.5 and 3.6.

3.1 HMW DNA Extraction (Experiment)

All procedures are carried out using the MagAttract HMW DNA Kit (QIAGEN) according to the manufacturer’s protocol (MagAttract HMW DNA Handbook, 2020). The protocol steps are as follows:

1. Load fresh frozen samples in new 2 mL DNA LoBind tubes. If tissues (approximately 2–3-mm cube) are used, the samples should be excised using surgical scissors.
2. Add Buffer ATL (220 μL) and Proteinase K (20 μL) to the tube and mix thoroughly by vortexing. Incubate the samples at 56 °C for 16 h (overnight) using a thermomixer at 900 rpm. The cells/tissues should be completely lysed.
3. Transfer the samples (200 μL) to a new 2 mL tube. Add RNase A (4 μL) to the samples and mix briefly by vortexing, then incubate them at room temperature for 2 min. Next, add Buffer AL (150 μL), Buffer MB (280 μL), and MagAttract Suspension G (40 μL), which should be mixed well by vortexing immediately before use. Then incubate the samples at room temperature for 3 min using a thermomixer at 1400 rpm.
4. Place the tubes on a magnetic stand and incubate them at room temperature for 1 min. After discarding the supernatant, remove the tubes from the magnetic stand and add Buffer MW1 (700 μL) to the tubes. Incubate the mixture at room temperature for 2 min using a thermomixer at 1400 rpm. Repeat this wash step a second time.
5. Incubate the tubes on the magnetic stand for 1 min. After discarding the supernatant, remove the tubes from the magnetic stand and add Buffer PE (700 μL). Then incubate the samples at room temperature for 2 min using a thermomixer at 1400 rpm. Repeat this wash step a second time.
6. Incubate the tubes on the magnetic stand for 1 min. After completely discarding the supernatant, add nuclease-free water (700 μL) without disturbing the beads. Repeat this rinse step a second time. After that, incubate the tubes on the magnetic stand for 1 min and discard the supernatant.
7. Remove the tubes from the magnetic stand and add Buffer AE (100 μL) for elution. Incubate the samples at room temperature for 3 min using a thermomixer at 1400 rpm. Incubate the tubes on the magnetic stand for 1 min. Transfer the supernatant, which contains the HMW DNA samples, to a new 1.5 mL tube. Repeat this step a second time.
8. Place the tubes on ice. To quantify and check the quality of the obtained HMW DNA samples, use TapeStation with Genomic DNA ScreenTape analysis and the Qubit dsDNA HS/BR Assay Kit according to the manufacturer’s protocols (see Note 2) (Fig. 2). The fragment size distribution and DNA Integrity Number should be checked in addition to the amount of DNA.

Fig. 2 Examples of DNA QC and sequencing. (a) The results of the automated DNA quality control (QC) protocol using TapeStation. (b–d) The PromethION sequencing report summarizes the sequencing run and output statistics in several plots, three of which are shown here as examples: (b) the read length distribution, (c) the cumulative read output, and (d) the status of flow cell pores

3.2 Library Preparation and Sequencing (Experiment)

The procedures in this section are carried out using the Ligation Sequencing Kit (SQK-LSK112) (Oxford Nanopore Technologies) following the manufacturer’s protocol (Nanopore Protocol, Ligation Sequencing gDNA [SQK-LSK112], 2021). Before priming, the flow cell should be kept at room temperature and subjected to a quality check (QC) (see Note 3).

1. About 1 μg of the HMW DNA sample is needed for the library preparation (see Note 4).
2. End-prep step. Mix the DNA sample (48 μL), NEBNext FFPE DNA Repair Buffer (3.5 μL), NEBNext FFPE DNA Repair Mix (2 μL), Ultra II end-prep reaction buffer (3.5 μL), and Ultra II end-prep enzyme mix (3 μL) in a PCR tube. Incubate the tube at 20 °C for 5 min, then at 65 °C for 5 min.
3. Purification step. Add AMPure XP Beads (AXP) (60 μL) to the sample, then incubate the mixture at room temperature for 5 min on the rotator (see Note 5). Place the tube on the magnetic stand for 1 min, then discard the supernatant. Add 70% ethanol (200 μL) to the tube and discard the ethanol. This ethanol wash is performed on the magnetic stand twice. After spinning down the tube, remove the remaining ethanol on the magnetic stand. To dry the beads, keep the cap open for 30 s. After removing the sample from the magnetic stand, resuspend the beads carrying the DNA samples in nuclease-free water (61 μL) and incubate at room temperature for 2 min. Place the tube on the magnetic stand and incubate for 1 min. Transfer 60 μL of the supernatant (containing the DNA samples) to a new 1.5 mL tube. To perform DNA quantification, use 1 μL of the DNA sample with the Qubit dsDNA HS Assay Kit.
4. Adapter ligation step. Mix the DNA sample (60 μL) with Ligation Buffer (LNB) (25 μL), NEBNext Quick T4 DNA Ligase (10 μL), and Adaptor Mix H (AMX H) (5 μL). Then incubate the tube at room temperature for 10 min.
5. Purification step. Add AXP (40 μL) to the sample and incubate it at room temperature for 5 min on the rotator. Place the tube on the magnetic stand for 1 min before discarding the supernatant. Then mix the samples with Long Fragment Buffer (LFB) (250 μL) and discard the supernatant. Perform this LFB wash twice in total. After that, spin down the tube and remove the remaining supernatant on the magnetic stand. To dry the beads, keep the tube cap open for 30 s. Then resuspend the beads in Elution Buffer (EB) (25 μL) after removing the tube from the magnetic stand, and incubate the solution for 10 min. Place the tube on the magnetic stand and incubate it for 1 min. Then collect 24 μL of the supernatant (DNA samples) and transfer it to a new 1.5 mL tube. Use 1 μL of the DNA sample for DNA quantification with the Qubit dsDNA HS Assay Kit.
6. Priming step. Prepare the Priming Mix by adding Flush Tether (FLT) (30 μL) to the Flush Buffer (FB) tube. Set the flow cell in the PromethION. To remove air, use a P1000 micropipette to draw a small amount of the yellow solution from the inlet port of the flow cell. Then add the Priming Mix (500 μL) to the inlet port (see Note 6). After 5 min, add 500 μL Priming Mix a second time.
7. Sequencing step. Mix, by pipetting, the DNA samples (24 μL) from step 5 with Sequencing Buffer II (SBII) (75 μL) and Loading Beads II (LBII) (51 μL), then load 150 μL of the mixture to the inlet port of the flow cell and close the port. Start sequencing via the MinKNOW controller.
8. Check the run report after the sequencing is completed (Fig. 2).

3.3 Basecalling (Data Analysis)

The nucleotides passing through the nanopore during sequencing on the PromethION generate an electric signal that needs to be converted into a DNA sequencing read. This conversion is called basecalling. Guppy, a machine learning-based basecaller provided by Oxford Nanopore Technologies and integrated in MinKNOW, enables real-time and/or onboard basecalling. Alternatively, use one of the several third-party basecallers [16]. The sequencing output includes FAST5 files, which contain the squiggle data, and FASTQ files.
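As a quick sanity check of the basecalled output, the read lengths in a FASTQ file can be summarized with a few lines of Python. This is a minimal sketch, assuming the common four-line FASTQ record layout and using only the standard library; the function names are hypothetical and it is not a substitute for the PromethION run report.

```python
import gzip
from statistics import mean, median


def read_lengths(fastq_path):
    """Yield the length of every read in a FASTQ file (plain or .gz).

    Assumes four lines per record: header, sequence, '+', qualities.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the second line of each record
                yield len(line.strip())


def summarize(fastq_path):
    """Return read count, total bases, and mean/median/longest read length."""
    lengths = list(read_lengths(fastq_path))
    if not lengths:
        return {"reads": 0, "bases": 0, "mean": 0, "median": 0, "longest": 0}
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "longest": max(lengths),
    }
```

For a whole-genome run the list of lengths is large but still fits comfortably in memory on a typical analysis server; for a rough check, summarizing a single FASTQ chunk is usually enough.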

3.4 Mapping of Sequencing Reads to the Reference Genome (Data Analysis)

To map the sequencing reads to the reference genome, we use Minimap2 [6] with the -a option, which generates output in the Sequence Alignment/Map (SAM) format (.sam). We give the reference genome index (.mmi), built beforehand with the -d option (see Subheading 2.6), followed by the FASTQ file (.fq) as input. For aligning long reads with a high error rate, use the option -x map-ont. This process takes at least several hours. The output SAM should include the MD tag, which is obtained with the Minimap2 option --MD. Use SAMtools [7] to convert the output SAM to a Binary Alignment/Map (BAM) file and to sort and index the BAM file. Below, the basic commands are shown first, followed by specific examples of their application:

minimap2 -a [reference index] [input FASTQ] > [SAM]
samtools view -b [SAM] > [BAM]
samtools sort -o [sorted BAM] [BAM]
samtools index [sorted BAM]

1. Mapping is performed using Minimap2, and the mapped reads are sorted using SAMtools (execution time is long):

$ minimap2 -ax map-ont --MD hg38.mmi sample1.fq.gz | samtools view -b - | samtools sort -o sample1.sorted.bam -

2. The BAM file is indexed using SAMtools:

$ samtools index sample1.sorted.bam
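When several samples are processed, it can help to generate the two commands above from a sample name so that file names stay consistent. The sketch below only builds the command lines, using exactly the options shown in steps 1 and 2; it does not run anything, and the function names and the "{sample}.sorted.bam" naming pattern are assumptions for illustration.

```python
import shlex


def mapping_pipeline(index, fastq, sample):
    """Build the piped commands of step 1 and the indexing command of step 2."""
    sorted_bam = f"{sample}.sorted.bam"
    pipeline = [
        ["minimap2", "-ax", "map-ont", "--MD", index, fastq],
        ["samtools", "view", "-b", "-"],
        ["samtools", "sort", "-o", sorted_bam, "-"],
    ]
    index_cmd = ["samtools", "index", sorted_bam]
    return pipeline, index_cmd


def as_shell(pipeline):
    """Render a list of argument lists as one shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in pipeline)


if __name__ == "__main__":
    pipeline, index_cmd = mapping_pipeline("hg38.mmi", "sample1.fq.gz", "sample1")
    print(as_shell(pipeline))
    print(shlex.join(index_cmd))
```

Printing the commands first, and running them only after inspection, keeps the long mapping step from being launched with a mistyped file name.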

3.5 SV Detection (Data Analysis)

For SV detection utilizing long-read data, various computational tools have been developed and compared [17]. Here, we introduce two representative tools, cuteSV [10] and Sniffles2 [8, 9], for SV detection using data from cancer cell lines. Note that the parameters should be changed according to the purpose of the analyses and the sample conditions (e.g., tumor purity).

cuteSV [sorted BAM] [reference FASTA] [output VCF] [working directory]
sniffles --input [sorted BAM] --vcf [output VCF]

To see the list of options available for cuteSV, use the cuteSV -h command or read it on GitHub (https://github.com/tjiangHIT/cuteSV/blob/master/README.md). To see the list of options available for Sniffles2, use the sniffles --help command. To extract the names of all supporting reads for each detected SV, use the --output-rnames option. Sniffles2 supports multi-sample calling to extract common SVs between multiple samples. Example calls below demonstrate the use of both tools:

1. SV detection is performed using cuteSV (see Note 7).

$ cuteSV sample1.sorted.bam hg38.fa sample1.cutesv.vcf . --sample sample1 --min_support 4 --min_size 1000 --max_size 1000000 --genotype --max_cluster_bias_INS 100 --diff_ratio_merging_INS 0.3 --max_cluster_bias_DEL 100 --diff_ratio_merging_DEL 0.3

2. SV detection is performed using Sniffles2 (see Note 8).

$ sniffles --input sample1.sorted.bam --vcf sample1.sniffles.vcf.gz --sample-id sample1 --output-rnames --allow-overwrite --tandem-repeats human_GRCh38_no_alt_analysis_set.trf.bed --non-germline

There are other tools available for SV detection. For example, Nanomonsv [18] can be used to detect somatic SVs in tumor genomes; it requires as input sequencing data from both a tumor and the normal counterpart. This tool is especially useful for detecting and annotating long insertions, such as LINE-1 insertions, which are important in cancer genome rearrangements [19]. We recommend using Nanomonsv for detecting somatic SVs in a cancer genome if sequencing datasets from both the tumor and the normal counterpart are available.

3.6 Visualization and Interpretation of the Detected SVs (Data Analysis)

Currently, there is still no gold standard for the tools and procedures used for SV detection from long-read data. A common flaw of the available tools is that they tend to generate many false-positive detections, so the results must be checked carefully. This can be done by visualizing the obtained sequences and the detected SVs and validating them, for example, by checking breakpoints using short-read sequencing data.

Most of the SV detection tools generate a VCF (Variant Call Format) output file. The VCF files include metadata lines (beginning with “##”) and a header line (beginning with “#”) in addition to the variant information of each genomic position (Fig. 3). A detailed description of the output files is provided in the tutorial and the web page of each tool. Additionally, to interpret the functional and biological relevance of the detected SVs, annotation information, such as regions (e.g., genic/intergenic, exon/intron, repetitive regions) and genes (e.g., gene name, function, association with diseases), needs to be included in the output files. To do this, either use annotation tools, such as AnnotSV [20], or write custom code to add any information needed.

IGV [11–13] can be used to visualize the obtained results, including the BAM file for mapped reads and the VCF file for information of SV breakpoints (Fig. 3). In our analysis example, we detected large deletions using cuteSV in highly mutated genes, such as SMARCA4 in PC-14 and CDKN2A in LC2/ad [5]. We visualized the breakpoints of the deletions and sequencing reads mapped to the surrounding regions of the breakpoints (Fig. 4). In PC-14 cells, the SMARCA4 gene is partially deleted, resulting in impaired expression of RNA and protein of this gene. In LC2/ad cells, the large (940-kb) deletion completely includes the CDKN2A region, which causes complete genomic loss of this gene.

Fig. 3 Example of VCF (Variant Call Format) output for SV detection using cuteSV. The cuteSV output for the CDKN2A deletion is shown as an example
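The VCF layout described above (metadata lines starting with "##", a header line starting with "#", then one tab-separated line per variant) is simple enough to read with custom code. The following Python sketch is one possible starting point for such code, not the output parser of any particular tool. It assumes that the INFO column carries the SVTYPE, END, and SVLEN keys that the VCF specification reserves for structural variants; individual callers may omit some of them for certain SV types, in which case the sketch returns None. The function names are hypothetical.

```python
def parse_info(info_field):
    """Turn 'KEY=VALUE;FLAG' into a dict; flags without '=' map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def to_int(value):
    """Convert to int, returning None for missing or non-numeric values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def read_sv_records(lines):
    """Yield one dict per variant line, skipping '##' metadata and the '#' header."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt, qual, flt, info = line.split("\t")[:8]
        fields = parse_info(info)
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "id": vid,
            "filter": flt,
            "svtype": fields.get("SVTYPE"),
            "end": to_int(fields.get("END")),
            "svlen": to_int(fields.get("SVLEN")),
        }
```

Passing an open file handle (for example, open("sample1.cutesv.vcf")) to read_sv_records gives records that can then be filtered by type or size, or joined with gene annotation.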

Fig. 4 Visualization of SVs detected using cuteSV. Examples of IGV visualization of SV breakpoints and mapped reads, each showing the cuteSV VCF track and the sorted BAM track. (a) SMARCA4 deletion in PC-14 (about 22 kbp). (b) CDKN2A deletion in LC2/ad (about 940 kbp). Note that setting the option --qccoverage 0 is necessary to detect these deletions using Sniffles2 (see Note 8)

4 Notes

1. The datasets of long-read WGS obtained from the various lung cancer cell lines [5] are also downloadable from the database DBKERO [21] at the following URLs:
• RERF-LC-MS (DRA008154, DRR171452): https://kero.hgc.jp/cgi-bin/download/long_read/adenocarcinoma_cell_lines/genome/promethion/RERF-LC-MS/1d.fq.gz
• RERF-LC-KJ (DRA008154, DRR171453): https://kero.hgc.jp/cgi-bin/download/long_read/adenocarcinoma_cell_lines/genome/promethion/RERF-LC-KJ/1d.fq.gz
• PC-14 (DRA008154, DRR171454): https://kero.hgc.jp/cgi-bin/download/long_read/adenocarcinoma_cell_lines/genome/promethion/RERF-LC-KJ/1d.fq.gz
• LC2/ad (DRA008154, DRR171429, DRR171430, DRR171431, DRR171432, DRR171433): https://kero.hgc.jp/cgi-bin/download/long_read/adenocarcinoma_cell_lines/genome/promethion//LC2ad/LC2ad_PromethION_30x.fq.gz
2. The Agilent Femto Pulse system (Agilent Technologies) is used for high-resolution qualification and quantification of higher-molecular-weight (HMW) DNAs (>60 kb).
3. For a PromethION Flow Cell, the flow cell quality control (QC) protocol should be performed and the number of active pores should be checked before use.
4. Sequencing can still be performed with less than 1 μg of DNA, depending on the situation (e.g., limited sample availability).
5. Before use, AMPure XP beads should be kept at room temperature for at least 30 min and resuspended by vortexing.
6. No air bubbles should be introduced in the port of the flow cell.
7. For running cuteSV, set the options “--max_cluster_bias_INS,” “--diff_ratio_merging_INS,” “--max_cluster_bias_DEL,” and “--diff_ratio_merging_DEL” to the parameters recommended for ONT data in the developer’s tutorial (https://github.com/tjiangHIT/cuteSV).
8. For running Sniffles2, a bed file (human_GRCh38_no_alt_analysis_set.trf.bed), which can be obtained from GitHub (https://github.com/fritzsedlazeck/Sniffles), is passed to the option “--tandem-repeats.” Users should adjust the SV filtering parameters themselves. When tuning options and parameters, use the “--no-qc” option to output all SV candidates by suppressing the QC filtering steps. Use the option “--non-germline” for the detection of somatic SVs. In the analysis of clinical cancer samples, we must be especially careful with the parameter settings because somatic SVs can have low variant allele frequencies when tumors with low tumor purity and/or high heterogeneity are analyzed.
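The caution about low variant allele frequencies in Note 8 can be made concrete with a back-of-the-envelope calculation. The sketch below rests on simplifying assumptions that are not part of the protocol: the SV is clonal (present in every tumor cell), reads are sampled in proportion to allele counts, and the locus copy numbers in tumor and normal cells are known. Under these assumptions, the expected fraction of reads carrying the SV is purity × variant copies / (purity × tumor copies + (1 − purity) × normal copies). The function names are hypothetical.

```python
def expected_vaf(purity, tumor_copies=2, normal_copies=2, variant_copies=1):
    """Expected fraction of reads carrying a clonal somatic SV at one locus."""
    tumor_alleles = purity * tumor_copies
    normal_alleles = (1.0 - purity) * normal_copies
    return purity * variant_copies / (tumor_alleles + normal_alleles)


def expected_support(coverage, purity, **copy_numbers):
    """Expected number of SV-supporting reads at a given mean coverage."""
    return coverage * expected_vaf(purity, **copy_numbers)
```

For example, a heterozygous SV in a diploid region of a sample with 20% tumor purity has an expected allele frequency of 0.1, which at 30× coverage corresponds to about 3 supporting reads. That is below a threshold such as --min_support 4, so the SV would probably be missed unless the threshold is lowered or the coverage increased.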

Acknowledgements

This work was supported by the Japan Agency for Medical Research and Development (AMED P-PROMOTE Grant JP22cm0106582) and by the Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research (JSPS KAKENHI Grants 22H04925 (PAGS) and 21J13203).

References

1. Sakamoto Y, Zaha S, Suzuki Y et al (2021) Application of long-read sequencing to the detection of structural variants in human cancer genomes. Comput Struct Biotechnol J 19:4207–4216. https://doi.org/10.1016/j.csbj.2021.07.030
2. Li Y, Roberts ND, Wala JA et al (2020) Patterns of somatic structural variation in human cancer genomes. Nature 578:112–121. https://doi.org/10.1038/s41586-019-1913-9
3. Soda M, Choi YL, Enomoto M et al (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448:561–566. https://doi.org/10.1038/nature05945
4. Kohno T, Ichikawa H, Totoki Y et al (2012) KIF5B-RET fusions in lung adenocarcinoma. Nat Med 18:375–377. https://doi.org/10.1038/nm.2644
5. Sakamoto Y, Xu L, Seki M et al (2020) Long-read sequencing for non-small-cell lung cancer genomes. Genome Res 30:1243–1257. https://doi.org/10.1101/GR.261941.120
6. Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094–3100. https://doi.org/10.1093/bioinformatics/bty191
7. Li H, Handsaker B, Wysoker A et al (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. https://doi.org/10.1093/bioinformatics/btp352
8. Sedlazeck FJ, Rescheneder P, Smolka M et al (2018) Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods 15:461–468. https://doi.org/10.1038/s41592-018-0001-7
9. Smolka M, Paulin LF, Grochowski CM et al (2022) Comprehensive structural variant detection: from mosaic to population-level. bioRxiv 2022.04.04.487055. https://doi.org/10.1101/2022.04.04.487055
10. Jiang T, Liu Y, Jiang Y et al (2020) Long-read-based human genomic structural variation detection with cuteSV. Genome Biol 21. https://doi.org/10.1186/s13059-020-02107-y
11. Robinson JT, Thorvaldsdóttir H, Winckler W et al (2011) Integrative genomics viewer. Nat Biotechnol 29:24–26. https://doi.org/10.1038/nbt.1754
12. Thorvaldsdóttir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14:178–192. https://doi.org/10.1093/bib/bbs017
13. Robinson JT, Thorvaldsdóttir H, Wenger AM et al (2017) Variant review with the integrative genomics viewer. Cancer Res 77:e31–e34
14. Kent WJ, Sugnet CW, Furey TS et al (2002) The human genome browser at UCSC. Genome Res 12:996–1006. https://doi.org/10.1101/gr.229102
15. Nurk S, Koren S, Rhie A et al (2022) The complete sequence of a human genome. Science 376:44–53. https://doi.org/10.1126/science.abj6987
16. Wick RR, Judd LM, Holt KE (2019) Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol 20:1–10. https://doi.org/10.1186/s13059-019-1727-y
17. Dierckxsens N, Li T, Vermeesch JR, Xie Z (2021) A benchmark of structural variation detection by long reads through a realistic simulated model. Genome Biol 22:1–16. https://doi.org/10.1186/s13059-021-02551-4
18. Shiraishi Y, Koya J, Chiba K et al (2020) Precise characterization of somatic structural variations and mobile element insertions from paired long-read sequencing data with nanomonsv. bioRxiv 2020.07.22.214262. https://doi.org/10.1101/2020.07.22.214262
19. Rodriguez-Martin B, Alvarez EG, Baez-Ortega A et al (2020) Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition. Nat Genet 52:306–319. https://doi.org/10.1038/s41588-019-0562-0
20. Geoffroy V, Herenger Y, Kress A et al (2018) AnnotSV: an integrated tool for structural variations annotation. Bioinformatics 34:3572–3574. https://doi.org/10.1093/bioinformatics/bty304
21. Suzuki A, Kawano S, Mitsuyama T et al (2018) DBTSS/DBKERO for integrated analysis of transcriptional regulation. Nucleic Acids Res 46:D229–D238. https://doi.org/10.1093/nar/gkx1001

Part III Rapid On-Site Microbial Detection and Epidemiology

Chapter 14

Full-Length 16S rRNA Gene Analysis Using Long-Read Nanopore Sequencing for Rapid Identification of Bacteria from Clinical Specimens

Yoshiyuki Matsuo

Abstract

Amplicon sequencing of the 16S ribosomal RNA (rRNA) gene is a practical and reliable method for taxonomic profiling of bacterial communities. This chapter describes the detailed workflow for full-length 16S rRNA gene amplicon analysis using nanopore sequencing and bioinformatics pipelines to analyze nanopore sequencing data for taxonomic assignment. This approach offers a higher taxonomic resolution for bacterial identification from clinical specimens with a markedly reduced timeframe and improved versatility.

Key words 16S rRNA gene, Bacteria, Bioinformatics, Clinical sample, Long read, Metagenome, Nanopore sequencing

1 Introduction

With recent advances in sequencing technologies, metagenomic analyses have emerged as innovative options for diagnosis and treatment in clinical microbiology [1]. Sequencing-based approaches have the potential to overcome the limitations of traditional culture-based techniques, which have a long turnaround time and low sensitivity [2, 3]. The ribosomal RNA (rRNA) gene is one of the most commonly used genetic markers for bacterial identification [4]. The 16S rRNA gene is present in all prokaryotes, with a relatively small size of approximately 1500 bp. It consists of hypervariable regions (V1–V9) interspersed with sequences that are highly conserved among different species. These characteristics of the 16S rRNA gene render it suitable for taxonomic classification.

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_14, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Amplicon sequencing of the 16S rRNA gene is a powerful and practical strategy for profiling bacterial communities. The 16S rRNA genes are amplified from a broad range of bacteria using polymerase chain reaction (PCR) with universal primers annealing to the conserved regions. The resulting amplicons are sequenced and subjected to bioinformatics analysis, where the hypervariable regions help discriminate different bacterial groups [5]. In the clinical context, 16S rRNA gene amplicon sequencing has been utilized to identify pathogenic bacteria and describe the diversity of the human microbiome [6, 7].

Second-generation sequencing platforms, such as those provided by Illumina, have been extensively used in a wide range of research areas including 16S rRNA gene analysis [8]. Despite their high-throughput capacity, these technologies have technical restrictions, especially in terms of sequencing read length. These are short-read-based sequencers, and the analysis of partial regions of the 16S rRNA gene provides limited taxonomic resolution for most bacterial groups [9].

Nanopore sequencing platforms developed by Oxford Nanopore Technologies offer remarkable advantages over conventional short-read sequencers [10, 11]. A nanopore sequencing platform produces long sequences with no theoretical read-length limit, enabling sequencing of the full-length 16S rRNA gene. Since all hypervariable regions are considered in distinguishing bacterial taxa, the full-length 16S rRNA gene amplicon analysis by nanopore sequencing allows the identification of bacteria with a higher taxonomic resolution (e.g., at the species level). Another competitive advantage of nanopore sequencing is that sequencing reads are generated in real time, which significantly reduces the turnaround time from the sample to the result [12, 13]. Furthermore, it is suitable for processing a small number of samples on a case-by-case basis [14–17].

Given these features of nanopore sequencing, we established a method for full-length 16S rRNA gene amplicon analysis coupled with a bioinformatics pipeline, which allowed us to identify bacterial species present in clinical specimens with a total analysis time of less than 4 hours [18, 19]. Here, we present a workflow for nanopore amplicon sequencing that targets the full-length 16S rRNA gene and bioinformatics pipelines to analyze nanopore sequencing data for taxonomic profiling (Fig. 1). The laboratory workflows include detailed protocols for DNA extraction from clinical samples of different origins, full-length 16S rRNA gene amplification, and nanopore sequencing on a MinION sequencer. We also describe methods of computational analysis, including taxonomic classification using the cloud-based EPI2ME platform and the generation of high-quality single-molecule consensus sequences to improve raw read accuracy [20, 21].

[Figure: (a) sample preparation, bead beating, DNA purification, first PCR (16S rRNA gene amplification), second PCR (barcoding), PCR cleanup, adapter attachment, nanopore sequencing; (b) basecalling, demultiplexing, barcode trimming, read filtering, clustering, consensus calling, alignment to reference database, taxonomic assignment]

Fig. 1 Overview of the 16S rRNA gene analysis for the identification of bacteria from clinical specimens. (a) Laboratory workflow of nanopore amplicon sequencing. (b) Bioinformatics pipeline for taxonomic classification of bacterial species

2 Materials

2.1 General Laboratory Supplies

1. Centrifuge tube (1.5 mL/5 mL/15 mL)
2. Micropipette
3. Aerosol barrier pipette tip
4. Vortex mixer
5. Benchtop mini-centrifuge
6. High-speed microcentrifuge (for 1.5 mL tube)
7. Centrifuge (for 5 mL/15 mL tube)
8. Rocking platform shaker
9. Rotator mixer
10. Block heater

2.2 Sample Preparation

1. Phosphate-buffered saline (PBS; 137 mM NaCl, 2.7 mM KCl, 1.5 mM KH2PO4, 8.1 mM Na2HPO4, pH 7.4)
2. Cotton tip swab
3. EDTA blood collection tube
4. Red Blood Cell Lysis Buffer (Roche Diagnostics)
5. Host Depletion Solution (Zymo Research)
6. Microbial Selection Buffer (Zymo Research)
7. Microbial Selection Enzyme (Zymo Research)
8. DNA/RNA Shield 2X concentrate (Zymo Research)

2.3 DNA Extraction

1. Micro Smash Beads Cell Disrupter (TOMY Digital Biology).
2. Maxwell RSC instrument (Promega).
3. EZ-Beads (Promega/AMR).
4. Maxwell RSC Blood DNA Kit (Promega). The kit contains the following components: Lysis Buffer, Proteinase K Solution, Elution Buffer, Maxwell RSC cartridge, RSC plunger, and elution tube.

2.4 PCR

1. 0.2 mL thin-walled tube
2. Thermal cycler
3. Gel electrophoresis device
4. Gel imaging system
5. 27F forward primer (anchor sequence followed by the 16S rRNA gene-specific sequence AGRGTTYGATYMTGGCTCAG): 5′-TTTCTGTTGGTGCTGATATTGCAGRGTTYGATYMTGGCTCAG-3′
6. 1492R reverse primer (anchor sequence followed by the 16S rRNA gene-specific sequence CGGYTACCTTGTTACGACTT): 5′-ACTTGCCTGTCGCTCTATCTTCCGGYTACCTTGTTACGACTT-3′
7. Barcode Primer (BP01–12) supplied in PCR Barcoding Kit (Oxford Nanopore Technologies)
8. KAPA2G Robust HotStart ReadyMix PCR Kit (Kapa Biosystems)
9. PCR-grade water
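The inner primers listed above contain IUPAC degenerate bases (R, Y, M), so each primer is in practice a mixture of related oligonucleotides. The following Python sketch is one way to check primer length and degeneracy before ordering. The split of each primer into a 22-nt anchor and a 20-nt gene-specific part is an assumption based on the commonly used 27F/1492R sequences, and the function names are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# (anchor, gene-specific) parts; the split point is an assumption.
PRIMERS = {
    "27F": ("TTTCTGTTGGTGCTGATATTGC", "AGRGTTYGATYMTGGCTCAG"),
    "1492R": ("ACTTGCCTGTCGCTCTATCTTC", "CGGYTACCTTGTTACGACTT"),
}


def expand(seq):
    """Return every non-degenerate sequence encoded by an IUPAC string."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in seq))]


def describe(name):
    """Summarize full sequence, length, and degeneracy of one primer."""
    anchor, specific = PRIMERS[name]
    return {
        "full": anchor + specific,
        "length": len(anchor) + len(specific),
        "degeneracy": len(expand(specific)),
    }


if __name__ == "__main__":
    for primer in PRIMERS:
        print(primer, describe(primer))
```

A degeneracy of 16 for 27F means that each individual variant makes up only a fraction of the primer pool, which is worth keeping in mind when interpreting weak amplification of a particular taxon.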

2.5 PCR Cleanup

1. Magnetic rack
2. Agencourt AMPure XP Kit (Beckman Coulter)
3. Freshly prepared 70% ethanol
4. TN buffer: 10 mM Tris–HCl pH 8.0, 50 mM NaCl

2.6 DNA Quantification

1. 0.5 mL thin-walled tube (Promega)
2. Quantus Fluorometer (Promega)
3. QuantiFluor ONE dsDNA System (Promega)

2.7 Library Construction and Nanopore Sequencing

1. MinION Flow Cell (Oxford Nanopore Technologies).
2. MinION Mk1C (Oxford Nanopore Technologies).
3. PCR Barcoding Kit (Oxford Nanopore Technologies). The kit contains the following components used for constructing the library: Rapid Adapter (RAP), Sequencing Buffer (SQB), and Loading Beads (LB). Barcode Adapter (BCA) and Sequencing Tether (SQT) are not used in this protocol.
4. TN buffer: 10 mM Tris–HCl pH 8.0, 50 mM NaCl.
5. PCR-grade water.
6. Flow Cell Priming Kit (Oxford Nanopore Technologies). The kit contains the following components: Flush Tether (FLT) and Flush Buffer (FB).
7. Flow Cell Wash Kit (Oxford Nanopore Technologies).

3 Methods

3.1 Preparation of Clinical Samples

The following are the methods used for preparing the four types of patient-derived specimens. Different pretreatment protocols should be used, depending on the source of the starting materials. As a negative control, a null sample (e.g., PBS used as a solvent for clinical specimen collection) is processed according to the same procedure in all subsequent steps.

3.1.1 Preparing Fecal Samples

1. Add 1 mL of PBS per 100 mg of feces.
2. Mix thoroughly by vortexing.
3. Allow the tube to stand for 2 min so that large debris, such as undigested food residue, settles.
4. Transfer 300 μL of the supernatant to a new tube.
5. Centrifuge at ≥9000 × g for 5 min.
6. Discard the supernatant.
7. Resuspend the pellet (~30 mg of feces) in 300 μL of PBS.
8. Proceed to Subheading 3.2.

3.1.2 Preparing Sputum Samples

1. Add three volumes of PBS to the sputum sample (e.g., 300 μL of sputum + 900 μL of PBS).
2. Mix thoroughly by vortexing.
3. Centrifuge at 100 × g for 3 min.
4. Transfer the supernatant to a new tube.
5. Centrifuge at ≥9000 × g for 5 min.
6. Carefully remove and discard the supernatant.
7. Resuspend the pellet in 300 μL of PBS.
8. Proceed to Subheading 3.2.


3.1.3 Preparing Swab Samples

1. Collect the specimen (nasal, oral, skin, and other samples) using a cotton swab.
2. Wash the swab off into a tube containing 500 μL of PBS.
3. Proceed to Subheading 3.2.

3.1.4 Preparing Whole Blood Samples

This protocol includes steps to reduce the amount of human-derived DNA contamination. Host DNA removal often increases the sensitivity and reliability of subsequent PCR amplification of bacterial 16S rRNA genes, especially in samples with a relatively low bacterial load.

1. Collect whole blood in an EDTA tube.
2. Add two volumes of Red Blood Cell Lysis Buffer to blood samples in a tube (e.g., 2 mL of blood + 4 mL of Red Blood Cell Lysis Buffer in a 15 mL tube).
3. Mix by inverting and incubate at room temperature for 10 min with gentle mixing on a rocking platform.
4. Centrifuge at 100 × g for 3 min.
5. Transfer the supernatant to a new tube.
6. Centrifuge at ≥9000 × g for 5 min.
7. Carefully remove and discard the supernatant.
8. Resuspend the pellet in 200 μL of PBS and transfer the sample to a 1.5 mL tube.
9. Add 1 mL of Host Depletion Solution to 200 μL of sample.
10. Incubate the sample at room temperature for 15 min using a rotator mixer. This procedure selectively lyses the host cells while keeping the bacterial cells intact.
11. Centrifuge at ≥9000 × g for 5 min.
12. Carefully remove and discard the supernatant without disturbing the pellet containing bacterial cells.
13. Resuspend the pellet in 150 μL of Microbial Selection Buffer.
14. Add 1 μL of Microbial Selection Enzyme and pulse vortex to mix.
15. Incubate at 37 °C for 30 min on a block heater. During this step, nucleic acids released from the host cells are degraded.
16. Add 150 μL of 2X concentrated DNA/RNA Shield to the sample and mix thoroughly by vortexing (total 300 μL).
17. Incubate at room temperature for 5 min to inactivate the nuclease.
18. Proceed to Subheading 3.2.
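The centrifugation steps in Subheading 3.1 are given as relative centrifugal force (× g), whereas many benchtop centrifuges are set in rpm. The Python sketch below applies the standard relation RCF = 1.118 × 10^-5 × r × rpm^2, with r the rotor radius in cm. The 8 cm radius in the example is a hypothetical value; use the radius stated for your own rotor.

```python
import math

# Constant in RCF = K * radius_cm * rpm**2 (standard centrifugation relation).
K = 1.118e-5


def rcf_for_rpm(rpm, radius_cm):
    """Relative centrifugal force (x g) produced at a given rotor speed."""
    return K * radius_cm * rpm ** 2


def rpm_for_rcf(rcf, radius_cm):
    """Rotor speed (rpm) needed to reach a given RCF (x g)."""
    return math.sqrt(rcf / (K * radius_cm))


if __name__ == "__main__":
    radius_cm = 8.0  # hypothetical rotor radius, not taken from the protocol
    for rcf in (100, 9000, 18000):
        print(f"{rcf} x g -> {rpm_for_rcf(rcf, radius_cm):.0f} rpm")
```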

3.2 Bacterial Cell Disruption by Bead Beating

1. Transfer 300 μL of the sample from Subheading 3.1 to an EZ-Beads tube (see Note 1).
2. Set the EZ-Beads tube in a Micro Smash instrument and disrupt cells by shaking in a tridimensional motion at 2500 rpm for 2 min (see Notes 2 and 3).
3. Briefly spin the EZ-Beads tube to collect contents.
4. Proceed to Subheading 3.3 (recommended: DNA purification before 16S rRNA gene amplification) or skip to Subheading 3.4 (direct PCR without purifying DNA [18]), depending on the sample type (see Note 4).

3.3 Automated DNA Purification Using Maxwell RSC System

The Maxwell RSC Blood DNA kit is used with a Maxwell RSC instrument to provide automated DNA purification (see Note 5). The kit utilizes cellulose-based paramagnetic particles to capture DNA, and up to 16 samples can be processed in one run using cartridges prefilled with optimized reagents.

1. Add 300 μL of Lysis Buffer to the sample in the EZ-Beads tube from Subheading 3.2.
2. Add 30 μL of Proteinase K Solution to the EZ-Beads tube.
3. Mix by inverting and briefly spin the tube.
4. Incubate at 56 °C for 20 min on the block heater.
5. Briefly spin the tube.
6. Transfer the supernatant (~500 μL) to a new tube without carrying over the zirconia beads.
7. Centrifuge at 18,000 × g for 3 min.
8. Transfer the cleared lysate to the Maxwell cartridge.
9. Add 50 μL of Elution Buffer to an elution tube.
10. Start the extraction run.

3.4 Two-Step PCR

The near-full-length 16S rRNA gene (V1–V9 regions) is amplified by PCR using a universal 27F/1492R primer set (inner primers). The primers are flanked by specified anchor sequences that allow for a subsequent second round of PCR using the PCR Barcoding Kit with rapid adapter attachment chemistry (see Note 6). The second PCR, with reduced cycle numbers, extends the amplicons with barcodes and 5′ tag sequences required for rapid adapter attachment (Fig. 2; see Notes 7 and 8).
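When several specimens are amplified in parallel, the 25 μL first-round reaction of Subheading 3.4.1 is often prepared as a shared mix and then aliquoted before the template is added. The sketch below scales the listed volumes (12.5 μL of 2X ReadyMix and 0.5 μL of primer mix per reaction, water up to 25 μL) for a given number of reactions. The 10% overage and the choice to add the template separately to each tube are assumptions for illustration, not part of the protocol.

```python
REACTION_UL = 25.0   # total volume per reaction, from the protocol
READYMIX_UL = 12.5   # KAPA2G Robust HS ReadyMix (2X) per reaction
PRIMER_UL = 0.5      # 10 uM 27F/1492R primer mix per reaction


def first_pcr_master_mix(n_reactions, template_ul, overage=0.10):
    """Scale the first-PCR recipe; template is added per tube afterwards."""
    if not 0.5 <= template_ul <= 5.0:
        raise ValueError("template volume should be 0.5-5 uL per reaction")
    water_ul = REACTION_UL - READYMIX_UL - PRIMER_UL - template_ul
    factor = n_reactions * (1.0 + overage)  # assumed pipetting overage
    return {
        "ReadyMix (2X)": round(READYMIX_UL * factor, 2),
        "Primer mix": round(PRIMER_UL * factor, 2),
        "Water": round(water_ul * factor, 2),
        "Aliquot per tube": round(REACTION_UL - template_ul, 2),
    }


if __name__ == "__main__":
    # Eight reactions with 2 uL of template each.
    for name, volume in first_pcr_master_mix(8, 2.0).items():
        print(f"{name}: {volume} uL")
```

Including the null sample described in Subheading 3.1 in the reaction count keeps the negative control on the same mix as the specimens.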

[Figure: schematic of the ~1500 bp 16S rRNA gene (V1–V9) carried through the first PCR (inner primers: 27F/1492R with anchor), the second PCR (outer primers: barcode and 5′ tag), and rapid adapter attachment]

Fig. 2 Two-step PCR approach for nanopore amplicon library preparation. The V1–V9 regions of the 16S rRNA gene are amplified with the 27F/1492R primer set (inner primers) targeting the sequences conserved among different bacterial species. The anchor sequence introduced in the first PCR acts as a priming site for outer primers used in the second reaction to add barcodes for multiplexing. The outer primers contain the 5′ tags that facilitate the ligase-free attachment of rapid adapters for nanopore sequencing

3.4.1 First PCR: Amplification of the 16S rRNA Gene

The expected amplicon size is approximately 1500 bp.

1. Prepare the PCR master mix in a 0.2 mL thin-walled tube.

   Component                          Volume
   Template DNA                       0.5–5 μL
   10 μM 27F/1492R primer mix         0.5 μL
   KAPA2G Robust HS ReadyMix (2X)     12.5 μL
   Water                              Up to 25 μL
   Total                              25 μL

2. Perform PCR using the following cycling conditions.

   Step                    Temperature   Time     Cycles
   Initial denaturation    95 °C         3 min    1
   Denaturation            95 °C         15 sec   25–35
   Annealing               55 °C         15 sec
   Extension               72 °C         30 sec

3. Analyze 2–5 μL of the PCR products by gel electrophoresis to verify successful amplification.

3.4.2 Second PCR

The first PCR products are subjected to a second round of amplification with barcoded outer primers supplied in the PCR Barcoding Kit (for ≤12 samples). The outer primers contain anchor sequences complementary to the inner primers used in the first PCR. The expected amplicon size is approximately 1600 bp.

1. Prepare the PCR master mix in a 0.2 mL thin-walled tube (see Notes 9 and 10).

   Component                          Volume
   First PCR products                 1 μL
   Barcode primer (BP01–12)           0.5 μL
   KAPA2G Robust HS ReadyMix (2X)     12.5 μL
   Water                              11 μL
   Total                              25 μL

2. Perform PCR using the following cycling conditions.

   Step                    Temperature   Time     Cycles
   Initial denaturation    95 °C         3 min    1
   Denaturation            95 °C         15 sec   8–10
   Annealing               62 °C         15 sec
   Extension               72 °C         30 sec

3. Analyze 1 μL of the PCR products by gel electrophoresis.

3.5 PCR Cleanup

1. Bring the AMPure XP beads to room temperature and resuspend by vortexing.
2. To select DNA fragments over 500 bp, add 0.5 volume of AMPure XP beads to the sample (e.g., 20 μL of the second PCR product + 10 μL of AMPure XP beads in a 0.2 mL tube).
3. Mix by pipetting and incubate at room temperature for 5 min.
4. Place the tube on a magnetic rack for 2 min to separate the AMPure XP beads.
5. Remove and discard the supernatant.
6. Keep on the magnetic rack and add 200 μL of 70% ethanol without disturbing the bead pellet.
7. Remove and discard the supernatant.
8. Repeat steps 6 and 7 (wash the beads twice in total).
9. Remove the tube from the magnetic rack and briefly spin down.
10. Place the tube back on the magnetic rack and remove residual ethanol.
11. Remove the tube from the magnetic rack and resuspend the beads in 10 μL of TN buffer.
12. Incubate at room temperature for 2 min.
13. Place the tube on the magnetic rack for 2 min.
14. Transfer the eluate to a new tube.
15. [Optional] Analyze 1 μL of the purified sample by gel electrophoresis to confirm the recovery.

3.6 DNA Quantification

1. Bring the QuantiFluor ONE dsDNA Dye to room temperature.
2. Add 1 μL of the eluted sample to 200 μL of QuantiFluor ONE dsDNA Dye in a 0.5 mL tube.
3. Mix thoroughly by vortexing.
4. Incubate at room temperature for 5 min, protected from light.
5. Measure fluorescence using the Quantus Fluorometer to quantify the DNA concentration.
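The measured concentrations (ng/μL) feed into the pooling step of Subheading 3.7, which is specified in femtomoles. The Python sketch below converts between mass and molar amount and computes equal-mass pooling volumes. It assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, and the function names are hypothetical.

```python
BP_MASS = 660.0  # g/mol per base pair of dsDNA (approximation, assumed)


def fmol_to_ng(fmol, length_bp):
    """Mass in ng of a given molar amount of dsDNA of a given length."""
    return fmol * length_bp * BP_MASS / 1e6


def ng_to_fmol(ng, length_bp):
    """Molar amount in fmol of a given mass of dsDNA of a given length."""
    return ng * 1e6 / (length_bp * BP_MASS)


def equal_pool(conc_ng_per_ul, total_ng, final_ul):
    """Volumes (uL) that pool samples in equal mass, plus buffer top-up."""
    share_ng = total_ng / len(conc_ng_per_ul)
    volumes = {name: share_ng / conc for name, conc in conc_ng_per_ul.items()}
    buffer_ul = final_ul - sum(volumes.values())
    if buffer_ul < 0:
        raise ValueError("samples too dilute for the requested final volume")
    return volumes, buffer_ul


if __name__ == "__main__":
    print(f"100 fmol of 1600 bp amplicon = {fmol_to_ng(100, 1600):.1f} ng")
    samples = {f"barcode{i:02d}": 20.0 for i in range(1, 6)}
    print(equal_pool(samples, total_ng=100.0, final_ul=10.0))
```

For a ~1600 bp amplicon, 100 fmol corresponds to about 106 ng under this approximation, which is why femtomoles and nanograms are nearly interchangeable for this library.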

3.7 Sequencing Library Preparation

Using the PCR Barcoding Kit, up to 12 samples are pooled and analyzed in a single sequencing run. The kit also offers a ligase-free adapter attachment with the rapid chemistry, which can be completed in a single-step reaction.

1. Pool the barcoded samples from Subheading 3.5 to a total of 50–100 femtomoles in 10 μL of TN buffer. For the V1–V9 amplicons of ~1600 bp, 50–100 fmol of double-stranded DNA is almost equivalent to 50–100 ng. In the following example, five barcoded samples are pooled together in equal proportions (see Note 11).

   Component                           Volume    DNA
   Sample A (barcode 01), 20 ng/μL     1 μL      20 ng
   Sample B (barcode 02), 20 ng/μL     1 μL      20 ng
   Sample C (barcode 03), 20 ng/μL     1 μL      20 ng
   Sample D (barcode 04), 20 ng/μL     1 μL      20 ng
   Sample E (barcode 05), 20 ng/μL     1 μL      20 ng
   TN buffer                           5 μL      –
   Total                               10 μL     100 ng

2. Add 1 μL of rapid adapter (RAP) and mix gently.
3. Incubate at room temperature for 5 min.
4. Store the tube on ice until use.

3.8 Nanopore Sequencing

1. Perform a flow cell check to assess the number of active nanopores available for sequencing.
2. Open the priming port of the flow cell and remove air bubbles by slowly drawing back 20–30 μL of the buffer from the priming port (see Note 12).
3. Add 30 μL of Flush Tether (FLT) directly to the Flush Buffer (FB; provided in tubes, pre-aliquoted with 1.17 mL). Mix thoroughly by vortexing to prepare the flow cell priming mix.
4. Load 800 μL of priming mix into the flow cell via the priming port. Avoid introducing air.
5. Wait for 5 min.
6. Prepare the sequencing library for loading as follows.

   Component                               Volume
   Pooled library (from Subheading 3.7)    11 μL
   Water                                   4.5 μL
   Sequencing buffer (SQB)                 34 μL
   Loading beads (LB) (a)                  25.5 μL
   Total                                   75 μL

   (a) Mix the bead suspension well before adding it to the loading mixture

7. Lift the cover of the SpotON sample port.
8. Load 200 μL of the flow cell priming mix (prepared in step 3) via the priming port (caution: not the SpotON sample port). Avoid introducing air.
9. Gently mix the sequencing library (prepared in step 6) by pipetting immediately before loading.
10. Load the library into the flow cell via the SpotON sample port by adding it dropwise. Let each drop flow into the port before adding the next drop.
11. Put the SpotON sample cover back and close the priming port.
12. Start the sequencing run. The following are typical examples of run parameters with real-time basecalling on the MinION Mk1C.

   Parameter                     Setting
   Flow cell type                FLO-MIN106 (R9.4.1)
   Kit                           PCR Barcoding Kit SQK-PBK004
   Basecalling                   On
   Basecalling configuration     Fast or high-accuracy basecalling
   Barcoding                     On
   Trim barcodes                 On
   Barcode both ends             Off
   Mid-read barcode filtering    On
   Q score filtering             Default value: 8 (fast) or 9 (high accuracy)

13. After the sequencing run is completed, flush the flow cell with the DNase I supplied in the Flow Cell Wash Kit and store it for subsequent use.

3.9 Bioinformatics Analysis

Raw electrical signals (FAST5 files) are processed using the Guppy basecaller to generate sequence data in real time during the sequencing run. Nanopore sequencing data are stored locally on a MinION Mk1C device. The output files can be transferred to a mounted removable USB drive via the File Manager embedded in MinION software and subjected to data analysis for bacterial identification.
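Basecalled reads carry a per-base Phred quality string, and the Q score filtering applied during the run works on a per-read summary of it. The Python sketch below shows one common definition of a read's mean Q score, in which per-base error probabilities are averaged before converting back to the Phred scale. Whether a particular basecaller or filtering tool uses exactly this definition is an assumption here, as is the Phred+33 encoding.

```python
import math


def mean_q(quality_string, offset=33):
    """Mean read Q score computed from averaged per-base error probabilities."""
    if not quality_string:
        raise ValueError("empty quality string")
    # Phred definition: Q = -10 * log10(P_error), so P_error = 10 ** (-Q / 10).
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_q_filter(quality_string, min_q):
    """True if the read's mean Q score reaches the threshold."""
    return mean_q(quality_string) >= min_q


if __name__ == "__main__":
    # '+' encodes Q10 and '?' encodes Q30 under Phred+33.
    print(round(mean_q("+?"), 2))
    print(passes_q_filter("+?", 9), passes_q_filter("+?", 15))
```

Because error probabilities are averaged, a few low-quality bases pull the mean down more than a plain average of Q values would: the two-base example scores about 13 rather than 20.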

3.9.1 Taxonomic Classification Using EPI2ME Fastq 16S Workflow

Oxford Nanopore Technologies provides a cloud-based data analysis platform, EPI2ME, which offers a range of workflows, including Fastq 16S for bacterial identification [22]. The EPI2ME Fastq 16S workflow classifies nanopore reads using the BLAST program [23] against the curated NCBI 16S rRNA bacterial database [24]. As shown in Fig. 3, the V1–V9 16S rRNA amplicons from a mock community sample containing ten bacterial species (ATCC MSA-1000) were sequenced and basecalled with the fast configuration. Randomly subsampled reads (25,000) were analyzed using the Fastq 16S workflow, reporting the taxonomy of the best hit.
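The reads submitted to the workflow are first restricted to the expected amplicon length and, optionally, randomly subsampled. The logic of those two operations can be sketched in Python as follows. This is an illustration under simple assumptions (well-formed four-line FASTQ records, a fixed random seed), not a reimplementation of SeqKit, and the function names are hypothetical.

```python
import random


def read_fastq(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip(), seq, qual


def length_filter(records, min_len=1300, max_len=1800):
    """Keep reads whose length lies within the V1-V9 amplicon window."""
    return [r for r in records if min_len <= len(r[1]) <= max_len]


def subsample(records, n, seed=11):
    """Randomly pick n reads; return all reads if fewer are available."""
    if len(records) <= n:
        return list(records)
    return random.Random(seed).sample(records, n)


if __name__ == "__main__":
    demo = []
    for name, length in (("r1", 1200), ("r2", 1500), ("r3", 1900)):
        demo += [f"@{name}\n", "A" * length + "\n", "+\n", "I" * length + "\n"]
    kept = length_filter(read_fastq(demo))
    print([header for header, _, _ in subsample(kept, 25000)])
```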

Software

1. SeqKit v2.2.0 [25, 26]
2. EPI2ME Desktop Agent v3.3.0 [27]

Preprocessing of Nanopore Sequencing Reads

Open the Terminal on the Mac and execute each command. Please note that all of the following are single-line commands, and line breaks should be ignored. The bold italic letters in the operation examples indicate user inputs that should be set appropriately.

1. Merge multiple compressed nanopore FASTQ files (.fastq.gz) into a single file (see Note 13).

   cat *.fastq.gz > bc#.fastq.gz

[Figure: donut chart of relative abundance, with the sequencing statistics and per-taxon read percentages listed below]

   Sequencing statistics
   Number of reads        25,000
   Minimum read length    1,300
   Average read length    1,427.5
   Maximum read length    1,738
   Average Q score        11.4

   Taxon                                                                 % of reads
   Staphylococcus epidermidis                                            5.8
   Staphylococcus saccharolyticus                                        5.5
   Clostridium beijerinckii (*including reads assigned to C. diolis)     14.5
   Cereibacter sphaeroides                                               11.7
   Cereibacter johrii                                                    1.1
   Rhodobacteraceae (family)                                             0.5
   Lactobacillus gasseri                                                 11.4
   Streptococcus mutans                                                  10.1
   Deinococcus radiodurans                                               9.3
   Enterobacteriaceae (family)                                           7.8
   Enterococcus faecalis                                                 5.2
   Bacillus cereus group (species group)                                 2.6
   Bacillaceae (family)                                                  0.6
   Bifidobacterium adolescentis                                          4.9
   Others                                                                8.8

Fig. 3 Taxonomic classification of a mock bacterial community using the EPI2ME Fastq 16S workflow. The V1–V9 regions of the 16S rRNA gene were amplified from a mock community sample (ATCC MSA-1000) comprising the following ten bacterial species: Bacillus cereus, Bifidobacterium adolescentis, Clostridium beijerinckii, Deinococcus radiodurans, Enterococcus faecalis, Escherichia coli, Lactobacillus gasseri, Cereibacter sphaeroides, Staphylococcus epidermidis, and Streptococcus mutans. The amplicons were sequenced on a MinION Mk1C (software version 21.11.7) and a MinION Flow Cell R9.4.1. Basecalling was performed using Guppy version 5.1.13 with the following settings: selected kit = SQK-PBK004, fast basecalling, trim barcodes = on, barcode both ends = off, and mid-read barcode filtering = on. Size-selected reads (25,000) were analyzed for taxonomic assignment using the cloud-based EPI2ME Fastq 16S workflow ver. 2022.01.07. The sequences were mapped against the NCBI bacterial 16S rRNA database (bacteria and archaea, 22,162 sequences, created on 2021/12/21) using BLAST with a minimum accuracy of 77% as a default parameter. Low-abundance taxa with less than 0.5% of total reads were discarded from the analysis. The relative abundance of each taxon is shown in a donut chart. Solid fills represent the reads correctly assigned to the bacterial species comprising the mock community (*Clostridium diolis is not included in the mock community, but it has recently been proposed to be reclassified as Clostridium beijerinckii). Misclassified reads (assigned to closely related species not present in the mock community) or unclassified reads (not classified at the species level but placed in a higher taxonomic rank) are indicated by patterned fills. Species-level discrimination is not possible for Escherichia or Bacillus

2. Unzip the combined FASTQ file.

   gunzip bc#.fastq.gz

3. Filter reads by length (1300–1800 bp) to retain amplicons corresponding to the V1–V9 regions of the 16S rRNA gene (see Notes 14 and 15).

   seqkit seq -m 1300 -M 1800 bc#.fastq > filt_bc#.fastq

4. [Optional] Subsample the appropriate number of reads (e.g., 25,000 reads) (see Note 16).

   seqkit sample -n 25000 -2 filt_bc#.fastq > sub_bc#.fastq

Data Analysis

1. Upload the FASTQ files via the EPI2ME desktop agent.
2. Select Fastq 16S from the workflow list.
3. Adjust the parameters. The 16S analysis-specific parameters are BLAST E-value, minimum coverage, minimum identity, and maximum target sequences.
4. Start the analysis.

3.9.2 Consensus Calling for Nanopore Sequencing Reads

This pipeline is used to generate consensus sequences to improve the per-read accuracy of nanopore sequencing data. Because high-quality reads are required for successful clustering, basecalling is performed with a high-accuracy configuration. The reads are filtered by length and average Phred quality scores and then clustered based on sequence similarity. The representative sequence for each cluster (draft sequence) is polished by choosing the most common base at each position among all the sequences belonging to the cluster, generating error-corrected high-quality sequences. The pipeline was applied to 16S rRNA gene analysis of the mock bacterial community. The resulting consensus sequences were aligned using BLASTN against the NCBI 16S rRNA database. Eight out of ten bacterial species comprising the mock community sample were successfully identified, with a sequence identity of over 99.7% (Fig. 4).
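The idea of polishing by "choosing the most common base at each position" can be illustrated with a column-wise majority vote over reads that have already been aligned to the draft sequence. The Python sketch below is a deliberately simplified illustration with hypothetical inputs. It assumes pre-aligned, equal-length, gap-padded reads and is not how Medaka itself computes a consensus.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Column-wise majority vote over equal-length, gap-padded reads."""
    if not aligned_reads:
        raise ValueError("no reads supplied")
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("reads must be aligned to the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:  # a majority gap means most reads lack this position
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    reads = ["ACGT-A", "ACGTTA", "ACCT-A", "A-GT-A"]  # toy alignment
    print(majority_consensus(reads))
```

Random sequencing errors differ from read to read, so they are outvoted as cluster depth grows, whereas systematic errors shared by most reads survive a simple vote.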

Computing Environment

MacBook Pro (Apple 2017), 2.3 GHz dual-core Intel Core i5-7360U CPU, 8 GB RAM, macOS 11.6.

Software Used in the Pipeline

Install the following bioinformatics tools and prerequisite libraries using a package manager such as Conda [28]. The latest version of Medaka (v1.6.0) can be installed on the macOS platform from its source (see Note 17).

1. SeqKit v2.2.0 [25, 26]
2. VSEARCH v2.21.1 [29, 30]
3. Medaka v1.6.0 [31]

[Figure: donut chart of read clusters, with the sequencing statistics and per-cluster results listed below]

   Sequencing statistics
   Number of reads        23,542
   Minimum read length    1,300
   Average read length    1,464.4
   Maximum read length    1,620
   Average Q score        15.9

   Cluster ID   % of reads   Taxonomy of best BLAST hit                      Sequence identity (%)
   1            15.1         Staphylococcus epidermidis                      100
   2            14.1         Clostridium beijerinckii (*Clostridium diolis)  99.79 (99.79)
   3            14.0         Cereibacter sphaeroides                         100
   4            10.5         Lactobacillus gasseri                           100
   5            10.4         Streptococcus mutans                            99.86
   6            9.5          Deinococcus radiodurans                         99.93
   7            7.9          Escherichia fergusonii                          99.73
                             Shigella flexneri                               99.73
   8            5.4          Enterococcus faecalis                           100
   9            5.0          Bacillus tropicus                               99.86
                             Bacillus paramycoides                           99.86
                             Bacillus nitratireducens                        99.86
                             Bacillus luti                                   99.86
                             Bacillus albus                                  99.86
   10           4.4          Bifidobacterium adolescentis                    100
   -            3.5          Clusters with low read counts                   -

Fig. 4 Taxonomic assignment of nanopore consensus sequencing data from a mock bacterial community. The ten-species mock community sample (ATCC MSA-1000) was analyzed by V1–V9 16S rRNA amplicon sequencing using a MinION Mk1C (software version 21.11.7) and a MinION Flow Cell R9.4.1. Basecalling was performed using Guppy version 5.1.13 with the following settings: selected kit = SQK-PBK004, high-accuracy basecalling, trim barcodes = on, barcode both ends = off, and mid-read barcode filtering = on. A total of 23,542 nanopore sequencing reads that passed the filtering conditions (1300–1800 bases, mean Q score ≥ 15) were processed for clustering followed by consensus calling. The resulting consensus sequence for each cluster was aligned using the BLASTN 2.13.0+ program against the NCBI 16S rRNA database (bacteria and archaea, 22,323 sequences, created on 2022/04/25). Eight out of 10 bacterial species comprising the mock community (solid fills) were correctly assigned with sequence identity of over 99.7% to the reference genome (*Clostridium diolis has recently been proposed to be reclassified and included in Clostridium beijerinckii). Even with full-length 16S rRNA gene analysis and read error correction, species-level discrimination was not possible for closely related members of Escherichia or Bacillus (patterned fills)

Bioinformatics Pipelines

Open the Terminal on the Mac and execute each command. Please note that all of the following are single-line commands, and line breaks should be ignored. The bold italic letters in the operation examples indicate user inputs that should be set appropriately.

1. Merge multiple compressed nanopore FASTQ files (.fastq.gz) into a single file (see Note 13).

   cat *.fastq.gz > filename.fastq.gz

2. Unzip the combined FASTQ file. gunzip filename.fastq.gz

3. Filter reads by length (1300–1800 bp) and Q score (e.g., minimum cut-off value of 15) (see Note 14). seqkit seq -m 1300 -M 1800 -Q 15 filename.fastq > filtered. fastq

4. Run VSEARCH to cluster reads using both strands with the similarity threshold specified by the --id option (e.g., 87%). Each cluster is outputted to a separate FASTA file (see Note 18). vsearch --cluster_fast filtered.fastq --id 0.87 --strand both --clusters output

5. Select output files to be analyzed and rename them with consecutive numbers and a file extension (e.g., cluster1.fasta, cluster2.fasta, cluster3.fasta, etc.). 6. Extract a representative sequence (centroid) per cluster. The first sequence of each cluster is selected and saved as the centroid in the FASTA format (see Note 19). seqkit head -n 1 cluster#.fasta > centroid#.fasta

7. Activate the relevant virtual environment for running Medaka (see Note 20). source ~/medaka/venv/bin/activate

8. Run Medaka for each cluster to generate a consensus sequence (see Note 21).

Full-Length 16S rRNA Gene Analysis with Nanopore Sequencing

209

medaka_consensus -i cluster#.fasta -d centroid#.fasta -o outdir -m model

9. For taxonomic assignment, perform an alignment search using the generated consensus sequences as queries against the 16S rRNA database (see Note 22).
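As a check on the length filter in step 3, it can help to see how many reads fall inside the chosen window before filtering. The following is a minimal sketch using awk rather than SeqKit. The function name count_reads_in_window is hypothetical, and the sketch assumes an uncompressed FASTQ file with exactly four lines per record (no wrapped sequence lines). It does not evaluate Q scores.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SeqKit): report how many FASTQ reads have a
# length within [min, max]. Assumes four lines per record, sequence on line 2.
count_reads_in_window() {
  fastq="$1"; min="$2"; max="$3"
  awk -v min="$min" -v max="$max" '
    NR % 4 == 2 {                       # the sequence line of each record
      len = length($0)
      if (len >= min && len <= max) kept++
      total++
    }
    END { printf "%d/%d\n", kept, total }   # reads in window / all reads
  ' "$fastq"
}

# Example: count_reads_in_window filename.fastq 1300 1800
```

If only a small fraction of reads falls inside the window, the amplicon size or the basecalling output may be worth inspecting before clustering.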

4 Notes

1. The EZ-Beads tube contains zirconium oxide beads of two different sizes (0.2-mm spheres and a large 5-mm bead) that enable efficient mechanical cell lysis by bead beating.

2. Alternatively, the EZ-Beads tubes can be vortexed at the maximum speed for 5 min using a Disruptor Genie or Vortex-Genie (Scientific Industries) with a tube holder.

3. Caution: Bead beating on a device with a high-speed linear reciprocating motion should be avoided, as this may result in breakage of the EZ-Beads tubes.

4. A variety of substances inherent to clinical samples often decrease PCR sensitivity and may lead to false-negative results. The primary reason for DNA purification is to reduce the amount of PCR inhibitors in the samples, which is a practical approach in most cases.

5. Alternatively, the samples can be processed manually (e.g., spin column-based technology) for DNA extraction.

6. The outer primers supplied in the kit have particular chemical modifications that enable rapid ligase-free adapter attachment. The details of the rapid adapter attachment chemistry are proprietary and have not yet been disclosed.

7. As an alternative approach, the four-primer PCR protocol provides a simplified method for barcoding and post-PCR adapter attachment with rapid chemistry [19]. Barcoded amplicons are generated in a single reaction containing both the inner and outer primer pairs. The protocol is available at the Oxford Nanopore website (a Nanopore community account is required to access the documentation). We prefer the two-step PCR strategy described in this chapter over the four-primer PCR protocol for constructing the nanopore sequencing library, as the former has produced more consistent results for various clinical specimens.

8. The two-step PCR method allows us to perform nanopore amplicon sequencing using user-defined arbitrary inner primer pairs in combination with the PCR Barcoding Kit or PCR Sequencing Kit, taking advantage of rapid adapter attachment chemistry. This method can be applied to a wide range of sequence-based analyses, including the detection of molecular markers other than the 16S rRNA gene and the identification of genetic variations in targeted loci [32]. The sequences of inner primers should be as follows:
   FW 5′-TTTCTGTTGGTGCTGATATTGC-target-specific sequence-3′
   RV 5′-ACTTGCCTGTCGCTCTATCTTC-target-specific sequence-3′

9. If the yield of the desired product in the first PCR is low, samples may need to be concentrated using AMPure XP beads prior to the second round of PCR, following the protocol provided in Subheading 3.5. This additional step also removes reaction contaminants, including primer dimers, which is beneficial for downstream analysis.

10. Another issue to consider is the choice of the PCR master mix used for the second PCR. Some were not compatible with the extended outer primers (approximately 60 bases long) and frequently failed to generate barcoded amplicons. We also found that sequencing libraries prepared using a particular PCR reagent were not sequenced efficiently on the nanopore device, resulting in extremely low throughput. Although the reason for this is unknown, it may be due to a problem in attaching RAP to the amplicons.

11. For samples in which the DNA concentration is too low (e.g., negative controls yielding no PCR products), add the same volume of eluate as for the other samples pooled in the library construction.

12. Set the volume of the P1000 micropipette to 200 μL and insert the tip into the priming port. Turn the wheel of the pipette slowly to increase the volume until a small amount of buffer is aspirated into the tip. Do not remove too much; the sensor array of the flow cell must remain covered by the buffer.

13. The basecalled reads are saved in the FASTQ format with a default of 4000 reads per file.

14. The reads are filtered by length to eliminate those outside the expected size range, determined based on the size distribution of 16S rRNA genes in the SILVA v138.1 database [33]. The Mothur project [34] provides a composite SILVA reference dataset (SEED alignment: silva.seed_v138_1.align) [35], containing 5736 16S rRNA gene sequences spanning from the end of the 27F primer to the beginning of the 1492R primer (i.e., the forward and reverse primer sequences are not included in the dataset). We further excluded the mitochondria- and chloroplast-derived sequences from the reference file and obtained 5700 sequences of bacterial 16S rRNA genes that ranged in size from 1336 to 1743 bases. Based on this, the V1–V9 reads were selected by length, retaining sequences of 1300–1800 bases.

15. To process multiple files at once, run the following command:
   for f in bc*.fastq; do seqkit seq -m 1300 -M 1800 "$f" > "filt_$f"; done

16. To process multiple files at once, run the following command:
   for f in filt*.fastq; do seqkit sample -n 25000 -2 "$f" > "${f/filt/sub}"; done

17. The Medaka package is installed into a Python virtual environment "medaka" created in the home directory.

18. The output files are saved with filenames like "output0, output1, output2, . . ."

19. To process multiple files at once, run the following command:
   for f in cluster*.fasta; do seqkit head -n 1 "$f" > "${f/cluster/centroid}"; done

20. To deactivate the active environment, use the following command:
   deactivate

21. The generated consensus sequence file (consensus.fasta) is saved to the output directory specified by the -o option. The -m option is used to set the Medaka model according to the version of the basecaller Guppy. The Medaka v1.6.0 default setting is r941_min_hac_g507, a model for MinION R9.4.1 Flow Cells using Guppy v5.0.7 with a high-accuracy configuration. Select the Medaka model with the highest version equal to or less than the Guppy version. To process multiple files (e.g., 10 files) at once with the default model, run the following command:
   for i in $(seq 1 10); do medaka_consensus -i cluster$i.fasta -d centroid$i.fasta -o consensus$i -m r941_min_hac_g507; done

22. Some reads may have barcode sequences left at the ends, likely because of incomplete barcode trimming by Guppy. In the study using the mock community (Fig. 4), these extra regions were removed manually from the consensus sequence prior to analysis for taxonomic assignment.
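The selection rule in Note 21 (take the model with the highest version that does not exceed the Guppy version) can be sketched as a small shell function. This is an illustration only: pick_model_version is a hypothetical helper, the candidate versions passed to it are placeholders to be replaced with the Guppy versions of the models available in your own Medaka installation, and the sketch relies on the version-sort option (-V) of GNU sort.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the Guppy version used for basecalling and a list
# of candidate model versions, print the highest candidate that is <= Guppy.
# Usage: pick_model_version GUPPY_VERSION CANDIDATE [CANDIDATE ...]
pick_model_version() {
  guppy="$1"; shift
  best=""
  for v in "$@"; do
    # v is eligible when the Guppy version sorts last (or is identical)
    newest=$(printf '%s\n%s\n' "$v" "$guppy" | sort -V | tail -n 1)
    if [ "$newest" = "$guppy" ]; then
      if [ -z "$best" ]; then
        best="$v"
      else
        best=$(printf '%s\n%s\n' "$best" "$v" | sort -V | tail -n 1)
      fi
    fi
  done
  if [ -n "$best" ]; then
    printf '%s\n' "$best"
  else
    return 1        # no candidate is old enough for this Guppy version
  fi
}

# Example with placeholder candidates:
# pick_model_version 5.1.13 3.0.3 5.0.7 6.1.0
```

With Guppy 5.1.13, as used for Fig. 4, and the placeholder candidates shown, the function selects 5.0.7, which corresponds to the r941_min_hac_g507 model named in Note 21.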


Data Availability

The exact pipeline commands, sequence files, and other supplementary materials are available at figshare (https://doi.org/10.6084/m9.figshare.20367651).

References

1. Chiu CY, Miller SA (2019) Clinical metagenomics. Nat Rev Genet 20(6):341–355. https://doi.org/10.1038/s41576-019-0113-7
2. Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, Pallen MJ (2012) Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol 30(5):434–439. https://doi.org/10.1038/nbt.2198
3. Didelot X, Bowden R, Wilson DJ, Peto TEA, Crook DW (2012) Transforming clinical microbiology with bacterial genome sequencing. Nat Rev Genet 13(9):601–612. https://doi.org/10.1038/nrg3226
4. Langille MG, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, Clemente JC, Burkepile DE, Vega Thurber RL, Knight R, Beiko RG, Huttenhower C (2013) Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol 31(9):814–821. https://doi.org/10.1038/nbt.2676
5. Johnson JS, Spakowicz DJ, Hong BY, Petersen LM, Demkowicz P, Chen L, Leopold SR, Hanson BM, Agresta HO, Gerstein M, Sodergren E, Weinstock GM (2019) Evaluation of 16S rRNA gene sequencing for species and strain-level microbiome analysis. Nat Commun 10(1):5029. https://doi.org/10.1038/s41467-019-13036-1
6. Clarridge JE 3rd (2004) Impact of 16S rRNA gene sequence analysis for identification of bacteria on clinical microbiology and infectious diseases. Clin Microbiol Rev 17(4):840–862. https://doi.org/10.1128/CMR.17.4.840-862.2004
7. Srinivasan R, Karaoz U, Volegova M, MacKichan J, Kato-Maeda M, Miller S, Nadarajan R, Brodie EL, Lynch SV (2015) Use of 16S rRNA gene for identification of a broad range of clinically relevant bacterial pathogens. PLoS One 10(2):e0117617. https://doi.org/10.1371/journal.pone.0117617
8. Ravi RK, Walton K, Khosroheidari M (2018) MiSeq: a next generation sequencing platform for genomic analysis. Methods Mol Biol 1706:223–232. https://doi.org/10.1007/978-1-4939-7471-9_12
9. Kuczynski J, Lauber CL, Walters WA, Parfrey LW, Clemente JC, Gevers D, Knight R (2011) Experimental and analytical tools for studying the human microbiome. Nat Rev Genet 13(1):47–58. https://doi.org/10.1038/nrg3129
10. Deamer D, Akeson M, Branton D (2016) Three decades of nanopore sequencing. Nat Biotechnol 34(5):518–524. https://doi.org/10.1038/nbt.3423
11. Kono N, Arakawa K (2019) Nanopore sequencing: review of potential applications in functional genomics. Develop Growth Differ 61(5):316–326. https://doi.org/10.1111/dgd.12608
12. Mitsuhashi S, Kryukov K, Nakagawa S, Takeuchi JS, Shiraishi Y, Asano K, Imanishi T (2017) A portable system for rapid bacterial composition analysis using a nanopore-based sequencer and laptop computer. Sci Rep 7(1):5657. https://doi.org/10.1038/s41598-017-05772-5
13. Nakagawa S, Inoue S, Kryukov K, Yamagishi J, Ohno A, Hayashida K, Nakazwe R, Kalumbi M, Mwenya D, Asami N, Sugimoto C, Mutengo MM, Imanishi T (2019) Rapid sequencing-based diagnosis of infectious bacterial species from meningitis patients in Zambia. Clin Transl Immunology 8(11):e01087. https://doi.org/10.1002/cti2.1087
14. Tanaka H, Matsuo Y, Nakagawa S, Nishi K, Okamoto A, Kai S, Iwai T, Tabata Y, Tajima T, Komatsu Y, Satoh M, Kryukov K, Imanishi T, Hirota K (2019) Real-time diagnostic analysis of MinION-based metagenomic sequencing in clinical microbiology evaluation: a case report. JA Clin Rep 5(1):24. https://doi.org/10.1186/s40981-019-0244-z
15. Komiya S, Matsuo Y, Nakagawa S, Morimoto Y, Kryukov K, Okada H, Hirota K (2022) MinION, a portable long-read sequencer, enables rapid vaginal microbiota analysis in a clinical setting. BMC Med Genet 15(1):68. https://doi.org/10.1186/s12920-022-01218-8
16. Ishino M, Omi M, Araki-Sasaki K, Oba S, Yamada H, Matsuo Y, Hirota K, Takahashi K (2022) Successful identification of Granulicatella adiacens in postoperative acute infectious endophthalmitis using a bacterial 16S ribosomal RNA gene-sequencing platform with MinION™: a case report. Am J Ophthalmol Case Rep 26:101524. https://doi.org/10.1016/j.ajoc.2022.101524
17. Omi M, Matsuo Y, Araki-Sasaki K, Oba S, Yamada H, Hirota K, Takahashi K (2022) 16S rRNA nanopore sequencing for the diagnosis of ocular infection: a feasibility study. BMJ Open Ophthalmol 7:e000910. https://doi.org/10.1136/bmjophth-2021-000910
18. Kai S, Matsuo Y, Nakagawa S, Kryukov K, Matsukawa S, Tanaka H, Iwai T, Imanishi T, Hirota K (2019) Rapid bacterial identification by direct PCR amplification of 16S rRNA genes using the MinION nanopore sequencer. FEBS Open Bio 9(3):548–557. https://doi.org/10.1002/2211-5463.12590
19. Matsuo Y, Komiya S, Yasumizu Y, Yasuoka Y, Mizushima K, Takagi T, Kryukov K, Fukuda A, Morimoto Y, Naito Y, Okada H, Bono H, Nakagawa S, Hirota K (2021) Full-length 16S rRNA gene amplicon analysis of human gut microbiota using MinION nanopore sequencing confers species-level resolution. BMC Microbiol 21(1):35. https://doi.org/10.1186/s12866-021-02094-5
20. Karst SM, Ziels RM, Kirkegaard RH, Sorensen EA, McDonald D, Zhu Q, Knight R, Albertsen M (2021) High-accuracy long-read amplicon sequences using unique molecular identifiers with Nanopore or PacBio sequencing. Nat Methods 18(2):165–169. https://doi.org/10.1038/s41592-020-01041-y
21. Santos A, van Aerle R, Barrientos L, Martinez-Urtaza J (2020) Computational methods for 16S metabarcoding studies using Nanopore sequencing data. Comput Struct Biotechnol J 18:296–305. https://doi.org/10.1016/j.csbj.2020.01.005
22. EPI2ME workflows (registration required). https://epi2me.nanoporetech.com/workflows
23. Zhang Z, Schwartz S, Wagner L, Miller W (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7(1–2):203–214. https://doi.org/10.1089/10665270050081478
24. 16S ribosomal RNA (Bacteria and Archaea type strains) [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information (NCBI); Available from: https://www.ncbi.nlm.nih.gov/refseq/targetedloci/
25. SeqKit. https://bioinf.shenwei.me/seqkit/
26. Shen W, Le S, Li Y, Hu F (2016) SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One 11(10):e0163962. https://doi.org/10.1371/journal.pone.0163962
27. EPI2ME Desktop Agent (registration required). https://epi2me.nanoporetech.com/software
28. Conda. https://docs.conda.io/en/latest/
29. VSEARCH. https://github.com/torognes/vsearch
30. Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. https://doi.org/10.7717/peerj.2584
31. Medaka. https://github.com/nanoporetech/medaka
32. Tabata Y, Matsuo Y, Fujii Y, Ohta A, Hirota K (2022) Rapid detection of single nucleotide polymorphisms using the MinION nanopore sequencer: a feasibility study for perioperative precision medicine. JA Clin Rep 8(1):17. https://doi.org/10.1186/s40981-022-00506-7
33. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glockner FO (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41(Database issue):D590–D596. https://doi.org/10.1093/nar/gks1219
34. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537–7541. https://doi.org/10.1128/AEM.01541-09
35. Silva reference files. https://mothur.org/wiki/silva_reference_files/

Chapter 15

Nanopore Sequencing Data Analysis of 16S rRNA Genes Using the GenomeSync-GSTK System

Kirill Kryukov, Tadashi Imanishi, and So Nakagawa

Abstract

With the development of nanopore sequencing technology, long reads of DNA sequences can now be determined rapidly from various samples. This protocol introduces the GenomeSync-GSTK system for identifying the bacterial species in a given sample, using nanopore sequencing data of 16S rRNA genes as an example. GenomeSync is a collection of genome sequences designed to provide easy access to the genomic data of species on demand. GSTK (genome search toolkit) is a set of scripts for managing local homology searches using genomes obtained from the GenomeSync database. With this protocol, nanopore sequencing data analyses of metagenomes and amplicons can be performed efficiently. We also discuss reanalysis of data in light of future developments in nanopore sequencing technology and the accumulation of genome sequencing data.

Key words Meta 16S rRNA analysis, Nanopore sequencing, minimap2, GenomeSync, NAF

1 Introduction

Bacterial infectious diseases have an enormous impact on public health; however, their diagnosis is often challenging. In particular, identifying the causative bacteria in general hospitals takes several days and is sometimes not possible even then. In recent years, owing to advances in DNA sequencing technology, many researchers have been developing systems to identify bacterial species present in clinical samples by sequencing. MinION, a USB-connected DNA/RNA sequencer developed by Oxford Nanopore Technologies (ONT), is widely used for bacterial species identification [1]. MinION has the advantages of relatively quick and easy library preparation and of allowing reads to be analyzed as they are generated during sequencing. In addition, the long-read sequencing capability allows full-length sequencing of the 16S ribosomal RNA (rRNA) gene, which is widely used as a marker for bacterial identification in a given sample. Other aspects, such as sequencing cost and ease of management, also make it potentially useful in relatively small-scale clinical settings [2].

In this protocol, we present a bioinformatics analysis pipeline for nanopore sequencing data of the bacterial 16S rRNA gene, named GenomeSync-GSTK. GenomeSync is a synchronizable database of genome sequences [3], and GSTK (genome search toolkit) is a set of scripts for managing massive local homology searches [4]. We have applied GenomeSync-GSTK analyses in various studies [2–10]. Here, we describe how to analyze nanopore sequencing data of 16S rRNA genes using the GenomeSync-GSTK system. We also introduce the Nucleotide Archival Format (NAF) software [11], which is used to compress the FASTA-format genomes stored in the GenomeSync database. Currently, command-line operation from the Terminal software is required (i.e., no graphical user interface is available for the GenomeSync-GSTK system). As an example, we show several commands for analyzing 16S rRNA amplicon nanopore sequencing data derived from a mock sample of 20 bacterial species, generated by Mitsuhashi et al. [12].

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_15, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

2 Materials

A user should prepare a suitable computer to run the GenomeSync-GSTK system introduced in this protocol. The computer should preferably meet the following minimum requirements:

OS: Ubuntu 18.04
CPU: 2 cores
Memory: 8 GB
Storage: 100 GB SSD

We have not tested the GenomeSync-GSTK system on other Linux distributions, but, in principle, it should work. On Windows, we were able to run the system by installing Ubuntu on Windows Subsystem for Linux (WSL); however, WSL is updated frequently, and we cannot guarantee that the system will continue to work. We have also prepared Mac versions for both Intel and ARM CPUs. The hardware requirements for the Mac are the same as those for the Linux computer, but the "Xcode Command Line Tools" package must be installed.

3 Methods

3.1 Preparing the Database

To identify bacteria that cause infectious diseases from metagenomic or amplicon sequencing data, genome sequences of various bacterial species are needed. For this purpose, we have developed and published the GenomeSync database (http://genomesync.org/). The GenomeSync database is mainly based on genome sequences registered in NCBI Assembly (https://www.ncbi.nlm.nih.gov/assembly) and classified based on NCBI Taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy) [13]. As of September 6, 2022, 507,297 genome assemblies from 96,715 species have been registered. For metazoans and plants, the number of genome sequences is limited to one per species because of their large genome sizes.

In the GenomeSync database, the FASTA file of each genome is compressed with Nucleotide Archival Format (NAF; https://github.com/KirillKryukov/naf; [11]) to save download time, bandwidth, and disk space (see Note 2). NAF can compress FASTA and FASTQ files of DNA, RNA, and amino acid sequences without limit on sequence length or the number of sequences. We previously conducted a comprehensive benchmark of various compression software for FASTA files in terms of compression ratio, compression and decompression speed, and memory use; we found that NAF offers a good balance between compression ratio and decompression speed compared with other compressors (Sequence Compression Benchmark, http://kirr.dyndns.org/sequence-compression-benchmark/) [14].

The curl and wget programs are used to download the genomes from the GenomeSync database. If your Ubuntu (or Windows WSL) computer does not have these programs, you can install them using the following commands in the Terminal software:

  sudo apt install -y curl
  sudo apt install -y wget

If you use a Mac, brew (https://brew.sh/) is a convenient way to install these programs. After installing brew, type the following commands to install curl and wget:

  brew install curl
  brew install wget

To download the genome sequences of common bacterial and archaeal species, which are marked as "representative" in the NCBI Assembly database, and to store them in the $HOME/GenomeSync directory, use the following command:

  curl -s 'http://genomesync.nig.ac.jp/selector/?t=(rep)Archaea&t=(rep)Bacteria' | wget -i - --directory-prefix=$HOME/GenomeSync -x -N -nH -nv


In addition, so that contaminating human DNA in a sample can be identified and removed, the human genome can be downloaded with the following command:

  curl -s 'http://genomesync.nig.ac.jp/selector/?t=Homo' | wget -i - --directory-prefix=$HOME/GenomeSync -x -N -nH
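Both download commands use the same selector URL, with one t= parameter per taxon. A minimal sketch of composing such a URL for an arbitrary list of taxa is given below. The function name genomesync_selector_url is hypothetical and written only for illustration, and the sketch assumes that taxa are simply joined with & as in the archaea and bacteria command; it does not check whether a taxon name is known to GenomeSync.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a GenomeSync selector URL from a list of taxa.
# Each argument becomes one "t=" parameter; prefix a taxon with "(rep)" to
# request only representative genomes, as in the commands of this section.
genomesync_selector_url() {
  url='http://genomesync.nig.ac.jp/selector/?'
  sep=''
  for taxon in "$@"; do
    url="${url}${sep}t=${taxon}"
    sep='&'                      # join the second and later taxa with "&"
  done
  printf '%s\n' "$url"
}

# Example (requires curl, wget, and network access):
# curl -s "$(genomesync_selector_url '(rep)Archaea' '(rep)Bacteria')" |
#   wget -i - --directory-prefix="$HOME/GenomeSync" -x -N -nH -nv
```

Keeping the URL construction in one place makes it easier to rerun the same selection later when the database is updated.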

Further, use the following commands to obtain the phylogeny and statistics information for each genome sequence stored in GenomeSync, based on the NCBI Taxonomy:

  wget --directory-prefix=$HOME/GenomeSync -r -np -N -l 1 -nH -A ".hash,.array,.txt" http://genomesync.nig.ac.jp/summary/
  wget --directory-prefix=$HOME/GenomeSync -r -np -N -l 1 -nH -A ".hash,.array,.txt,.dmp,.tab" http://genomesync.nig.ac.jp/taxonomy/

This completes the preparation of the GenomeSync database. Because GenomeSync is updated regularly (approximately once a week), you can rerun the curl and wget commands above to download only the data that have changed (the -N option makes wget skip files that are not newer than the local copy), which makes it easy to keep the database up to date. You can also refer to the GenomeSync homepage for more details on downloading genome data of various species (http://genomesync.org/downloading.html).

3.2 Preparing the Pipeline

Next, we set up a dedicated pipeline, the Genome Search Toolkit (GSTK) [4], to efficiently analyze the NAF-compressed genome data obtained from GenomeSync as described above. GSTK is capable of various analyses, but in this protocol, it performs the following tasks in an automated manner:

1. Convert sequencing data from FASTQ format to FASTA format and change the sequence name of each read.
2. Perform a brute-force search for each read against the genome sequence of a given species using minimap2 [15] (see Note 1).
3. For each read, determine which organism's genome gives the maximum minimap2 score.
4. Summarize how many reads correspond to each species.
5. Visualize the results as an HTML-format Krona chart [16].

When the GSTK command is executed, folders containing the output of the above five processing steps are created (see Subheading 3.3 for details). We now configure GSTK as follows. First, download GSTK and related programs to the home directory. The software packages are available at the following link: http://genomesync.org/tools/gstk-with-tools-2022-04-19/.


For Linux (or Windows WSL) users, the following command can be used to download the GSTK package:

  wget --directory-prefix=$HOME/GSTK http://genomesync.org/tools/gstk-with-tools-2022-04-19/gstk-with-tools-linux-2022-04-19.zip

For Mac users, the following packages can be used:

(Mac Intel CPU version)
  wget --directory-prefix=$HOME/GSTK http://genomesync.org/tools/gstk-with-tools-2022-04-19/gstk-with-tools-intel_mac-2022-04-19.zip

(Mac ARM CPU version)
  wget --directory-prefix=$HOME/GSTK http://genomesync.org/tools/gstk-with-tools-2022-04-19/gstk-with-tools-m1_mac-2022-04-19.zip

Then, go to the GSTK directory created under the home directory (i.e., $HOME/GSTK) and extract the downloaded file using the unzip command as follows (only the Linux version is shown):

  unzip gstk-with-tools-linux-2022-04-19.zip

When you run the above command, a directory named tools is created. In this directory, you will find directories named naf, gstk, fastq-to-fasta, zstd, perl, minimap2, and krona. For the Mac version, a coreutils directory is also created. Each folder contains the corresponding software. In this analysis, we use gstk.pl in the gstk folder. First, to check which genome sequences downloaded from the GenomeSync database will be used in the analysis, execute the following command:

  $HOME/GSTK/tools/gstk/gstk.pl list-genomes --rep-taxa 'Bacteria;;Archaea' --taxa 'Homo'

The list-genomes command lists the genome sequences that GSTK will use for analysis, as selected by the --rep-taxa and --taxa options. As of March 31, 2022, 15,526 genomes were listed using the genome data from the GenomeSync database. The human genome and the representative bacterial and archaeal genomes downloaded from the GenomeSync database are used in Subheading 3.3. You can check whether the genome sequences were downloaded properly with the following command:

  $HOME/GSTK/tools/gstk/gstk.pl verify-genomes --format naf --rep-taxa 'Bacteria;;Archaea' --taxa 'Homo'


3.3 Example of GenomeSync-GSTK Analysis

Here, we show how to analyze the 16S rRNA gene sequencing data generated by Mitsuhashi et al. [12] from a mock community of 20 bacterial species obtained from BEI Resources (https://www.beiresources.org/) (see Note 3). The FASTQ-format data from that study (the almost full-length 16S rRNA gene amplified by PCR and sequenced with MinION Rapid 1D sequencing, 5 min of data acquisition) can be downloaded from the DDBJ Sequence Read Archive (DRA) under accession ID DRA005399 (https://ddbj.nig.ac.jp/public/ddbj_database/dra/fastq/DRA005/DRA005399/DRX076255/) and decompressed as follows:

  bunzip2 DRR082417.fastq.bz2

Put the decompressed FASTQ-format sequencing data into an arbitrary directory and execute the following command in it:

  $HOME/GSTK/tools/gstk/gstk.pl analyze --in DRR082417.fastq --searcher minimap2 --rep-taxa 'Bacteria;;Archaea' --taxa 'Homo' --n-search-tasks 2 --n-merging-threads 1 --force

The details of each option are as follows:

--in: FASTQ-format sequencing data file (a FASTA-format file is also acceptable).

--rep-taxa, --taxa: Specify the taxa to be used in the analysis. With --rep-taxa, only representative species in the taxa are used; with --taxa, all species in the taxa are used. In this case (i.e., --rep-taxa 'Bacteria;;Archaea' --taxa 'Homo'), the genomes of representative bacterial and archaeal species and of all species in the genus Homo (only one species, Homo sapiens) are used in the analysis. See also Subheading 3.1.

--n-search-tasks: Number of threads used for the minimap2 analysis, set according to the number of CPU cores of the computer used. The higher the number, the faster the calculation is completed.

--n-merging-threads: Number of threads used to merge the results of the minimap2 analysis, set according to the number of CPU cores of the computer used in the same way as above.

--force: Checks whether results already exist from a previous minimap2 analysis in GSTK and analyzes only the data that have not yet been analyzed. The option is unnecessary for the first analysis but is safe to include.
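Because both thread options depend on the number of CPU cores, one way to choose them is to query the core count first. The sketch below is an illustration under stated assumptions and is not part of GSTK: detect_cores is a hypothetical helper that asks getconf for the number of online processors and falls back to 2, the minimum listed in Materials, when that query is unavailable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GSTK): print the number of online CPU cores,
# falling back to 2 if the value cannot be determined.
detect_cores() {
  n=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
  case "$n" in
    ''|*[!0-9]*) n=2 ;;          # empty or non-numeric answer: use the fallback
  esac
  printf '%s\n' "$n"
}

# Example: pass the result to gstk.pl, e.g. --n-search-tasks "$(detect_cores)"
```

On a shared machine, a smaller value than the detected core count may be preferable so that other jobs are not starved.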


Note: If the minimap2 analysis stalls partway through the calculation, stop it by pressing Ctrl+C and then execute the same command again. The calculation resumes from the species for which it has not yet been completed.

When the calculation ends, the following directories are created in the directory where the command was executed:

  1-renamed-fasta
  2-search-log
  2-search-output
  3-merged-search-output
  4-count-by-taxon
  5-krona-chart

Each folder corresponds to one of the processing steps listed at the beginning of Subheading 3.2. The Krona chart named DRR082417.krona.html in the 5-krona-chart directory shows the results of the analysis (Fig. 1). The names of species, classified by phylogeny, are displayed in a pie chart format based on the number of reads mapped to each species. The color in the pie chart represents the average minimap2 alignment score, as indicated in the upper left corner ("Color by Avg. score"). If the average score is small, the certainty that the reads correspond to the species is low. The original data of the Krona chart are stored as the DRR082417.count-by-taxon file in the 4-count-by-taxon directory, which can be opened in a terminal or text editor (Fig. 2).
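The per-read best-hit selection and the splitting of tied reads among species, which explain why read counts in the output can be non-integer, can be sketched with awk as follows. This is a simplified illustration and not the GSTK implementation: count_best_hits is a hypothetical function, and the input is an assumed three-column, tab-separated table of read ID, species, and alignment score with one row per read and species pair, which is not the format of the files GSTK itself writes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of best-hit counting with ties split equally.
# Input (assumed format, NOT a GSTK file): read_id <TAB> species <TAB> score
# Output: species <TAB> fractional read count, sorted by species name.
count_best_hits() {
  awk -F '\t' '
    {
      rid = $1; species = $2; score = $3 + 0
      # A new read, or a better score for a known read, resets the tie list
      if (!(rid in best) || score > best[rid]) {
        best[rid] = score; ties[rid] = 0
      }
      # Record every species that reaches the current best score
      if (score == best[rid]) {
        ties[rid]++
        hit[rid, ties[rid]] = species
      }
    }
    END {
      # Each read contributes 1/(number of tied species) to each of them
      for (rid in best)
        for (i = 1; i <= ties[rid]; i++)
          count[hit[rid, i]] += 1 / ties[rid]
      for (species in count)
        printf "%s\t%.2f\n", species, count[species]
    }
  ' "$1" | sort
}

# Example: count_best_hits hits.tsv
```

A read whose top score is shared by two species adds 0.5 to each, so the per-species totals still sum to the number of reads that had at least one hit.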

Fig. 1 Krona chart of the 16S rRNA amplicon sequencing analysis. The pie chart shows the proportion of each species according to the number of reads for which it was the best hit in the minimap2 analysis. The proportions according to the phylogenetic relationships are shown in the inner rings, which can be clicked to display the proportions within that lineage only. In this figure, "Bacteria" is selected. In the upper right corner, "Magnitude" shows the number of reads used for this pie chart. The Magnitude value is not an integer because a read whose minimap2 alignment score is tied between species is divided equally among the tied species (932 reads were mapped to the bacterial genomes in this search). The color in the chart indicates the average minimap2 alignment score, as shown in the upper left corner ("Color by Avg. score"). The sample analyzed by the GenomeSync-GSTK system in this protocol includes the following 20 bacterial species: Acinetobacter baumannii, Actinomyces odontolyticus, Bacillus cereus, Bacteroides vulgatus, Clostridium beijerinckii, Cutibacterium acnes, Deinococcus radiodurans, Enterococcus faecalis, Escherichia coli, Helicobacter pylori, Lactobacillus gasseri, Listeria monocytogenes, Neisseria meningitidis, Pseudomonas aeruginosa, Rhodobacter sphaeroides, Staphylococcus aureus, Staphylococcus epidermidis, Streptococcus agalactiae, Streptococcus mutans, and Streptococcus pneumoniae

Fig. 2 A snapshot of the raw output file of the GenomeSync-GSTK analysis. The number of reads mapped, average map score, NCBI Taxonomy ID, and species name are listed in tab-separated columns. The number of reads mapped is not an integer because a read whose minimap2 alignment score is tied is divided among the tied species and counted. The Krona chart shown in Fig. 1 was generated from these data. This text format can be used for other statistical calculations

4 Notes

1. This protocol introduces 16S rRNA amplicon-sequencing analysis using the GenomeSync-GSTK system, but the system can target various genomic regions for analysis. The pipeline uses minimap2 [15] as its sequence search tool, which is widely used for various nanopore sequencing data analyses [17]. We previously compared the results for nanopore sequencing data of 16S rRNA genes analyzed by minimap2 with those obtained by BLASTN [2]. In that study, DNA was obtained from cerebrospinal fluid samples of six culture-positive meningitis patients, and the 16S rRNA gene regions were amplified and sequenced using MinION. The results by BLASTN and minimap2 were consistent with the culture results in four samples; in two of these, BLASTN and minimap2 identified different species of the same genus. This suggests that minimap2 could be as accurate as BLASTN for full-length 16S rRNA gene analysis. Moreover, minimap2 is still being updated, and its accuracy and speed continue to improve. Other research groups have also been developing other versions of minimap2, such as mm2-ax [18] and mm2-fast [19]. The minimap2 program used by GSTK can be changed by editing the line starting with "MINIMAP2 =", which gives the path to the minimap2 program, in $HOME/GSTK/tools/gstk/conf/commands-external.conf. The current version of minimap2 used for the GSTK is 2.24 (r1122).

2. The GenomeSync-GSTK system is based on the NAF compressor [11]. The NAF-compressed FASTA files downloaded from the GenomeSync database can be decompressed by the following command:
   $HOME/GSTK/tools/naf/1.3.0/unnaf file.naf -o file.fa


You can also use a pipe (|) to convert NAF-compressed FASTA to gzip-compressed FASTA as follows:
$HOME/GSTK/tools/naf/1.3.0/unnaf file.naf | gzip -c9 > file.gz

To compress any FASTA- or FASTQ-formatted file (DNA, RNA, or amino acid sequences), use the following command:
$HOME/GSTK/tools/naf/1.3.0/ennaf file.fa -o file.naf
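When many NAF files have been downloaded, the pipe-based conversion shown above can be repeated for each file. The following is a minimal shell sketch under stated assumptions, not part of the GSTK distribution: the function name naf_to_gzip_cmds, the input file names, and the .fa.gz output naming are hypothetical, while the unnaf path and the gzip options are those used in this note. The function only prints the commands so that they can be reviewed before being run.

```shell
# Path to unnaf as used in this note.
UNNAF="$HOME/GSTK/tools/naf/1.3.0/unnaf"

# Print one "unnaf | gzip" command per NAF file given as an argument.
# Hypothetical helper: nothing is executed, the commands are only printed.
naf_to_gzip_cmds() {
  local naf
  for naf in "$@"; do
    # Output name: replace the .naf suffix with .fa.gz (an assumed convention).
    echo "$UNNAF $naf | gzip -c9 > ${naf%.naf}.fa.gz"
  done
}

# Example with two hypothetical files.
naf_to_gzip_cmds genome1.naf genome2.naf
```

Piping the printed lines to a shell would execute them. File names containing spaces are not handled by this sketch.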

For details of the NAF, please see the GitHub page (https://github.com/KirillKryukov/naf) [11] as well as the Sequence Compression Benchmark database (http://kirr.dyndns.org/sequence-compression-benchmark) [14].

3. In this protocol, we used nanopore sequencing reads of the 16S rRNA genes obtained in 2017 as an example [12]. Since then, nanopore sequencing technology has continued to improve in data quality and quantity [20]. In particular, the nanopore base-caller software has been updated several times, significantly improving the accuracy of nucleotide sequences even from the same raw data (i.e., fast5 files). Thus, fast5 files should be kept so that they can be base-called again with updated software for more accurate analyses in the future. Similarly, the number and quality of available genomes have been increasing, which may result in more accurate species identification when the same data are reanalyzed with a newer database. The GenomeSync database is designed for easy synchronization of newly decoded genome sequences, as noted in Subheading 3.1. In addition, the GSTK pipeline allows for easy reanalysis: when reanalysis is performed using updated genome sequences, the minimap2 search is not repeated against previously analyzed sequences; only the updated genome sequences are automatically selected, leading to a significant reduction in calculation time. We hope that the GenomeSync-GSTK system described in this protocol will be used in various metagenomic and amplicon-sequencing studies.

4. This protocol is designed mainly for analyzing bacterial 16S rRNA amplicon data. However, it can be readily extended to include eukaryotic genomes if eukaryotic sequences are targeted in the PCR amplification [21]. For example, the following command will download the representative fungal genomes:
curl -s 'http://genomesync.nig.ac.jp/selector/?t=(rep)Fungi' | wget -i - --directory-prefix=$HOME/GenomeSync -x -N -nH -nv

After this, fungal genomes can be used in the analysis. Here is an example command:
$HOME/GSTK/tools/gstk/gstk.pl analyze --in DRR082417.fastq --searcher minimap2 --rep-taxa 'Fungi' --n-search-tasks 2 --n-merging-threads 1 --force

5. Care must be taken to include all taxa that are expected to be found in the dataset. For instance, the above command for analyzing a fungal dataset does not include any bacterial or human genomes. As a result, any bacterial or human reads present in the data will be left unidentified, or possibly misidentified as fungal. Therefore, it is a good idea to include human and bacterial genomes even when analyzing other organisms. Any other possible source of contamination should also be included [22, 23]. For example, when a lab is working on multiple projects or organisms, cross-contamination may occur, introducing reads from unrelated organisms into the sequencing data. Such reads may be misidentified if their organism of origin is not included in the analysis.

Funding

This work was supported by the JSPS KAKENHI Grants-in-Aid for Scientific Research (C) (20K06612 to K.K.) and Scientific Research on Innovative Areas (16H06429, 16K21723, 19H04843 to S.N.) and by Takeda Science Foundation (T.I.).

References

1. Ciuffreda L, Rodríguez-Pérez H, Flores C (2021) Nanopore sequencing and its application to the study of microbial communities. Comput Struct Biotechnol J 19:1497–1511
2. Nakagawa S, Inoue S, Kryukov K et al (2019) Rapid sequencing-based diagnosis of infectious bacterial species from meningitis patients in Zambia. Clin Transl Immunol 8:e0202049–e0202011
3. GenomeSync (2022). http://genomesync.org/. Accessed 31 May 2022
4. GSTK (2022). http://kirill-kryukov.com/study/tools/gstk/. Accessed 31 May 2022
5. Kai S, Matsuo Y, Nakagawa S et al (2019) Rapid bacterial identification by direct PCR amplification of 16S rRNA genes using the MinION™ nanopore sequencer. FEBS Open Bio 39:46–10
6. Tanaka H, Matsuo Y, Nakagawa S et al (2019) Real-time diagnostic analysis of MinION™-based metagenomic sequencing in clinical microbiology evaluation: a case report. JA Clin Rep 5:1–2
7. Ohno A, Umezawa K, Asai S et al (2021) Rapid profiling of drug-resistant bacteria using DNA-binding dyes and a nanopore-based DNA sequencer. Sci Rep 11:3436
8. Matsuo Y, Komiya S, Yasumizu Y et al (2021) Full-length 16S rRNA gene amplicon analysis of human gut microbiota using MinION™ nanopore sequencing confers species-level resolution. BMC Microbiol 21:35
9. Komiya S, Matsuo Y, Nakagawa S et al (2022) MinION, a portable long-read sequencer, enables rapid vaginal microbiota analysis in a clinical setting. BMC Med Genomics 15:68
10. Shinozuka Y, Kawai K, Kurumisawa T et al (2021) Examination of the microbiota of normal cow milk using MinION™ nanopore sequencing. J Vet Med Sci 83(11):1620–1627
11. Kryukov K, Ueda MT, Nakagawa S et al (2019) Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences. Bioinformatics 35:3826–3828
12. Mitsuhashi S, Kryukov K, Nakagawa S et al (2017) A portable system for rapid bacterial composition analysis using a nanopore-based sequencer and laptop computer. Sci Rep 7:1–9
13. Sayers EW, Bolton EE, Brister JR et al (2021) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 50:D20–D26
14. Kryukov K, Ueda MT, Nakagawa S et al (2020) Sequence Compression Benchmark (SCB) database – a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences. Gigascience 9:giaa072
15. Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094–3100
16. Ondov BD, Bergman NH, Phillippy AM (2011) Interactive metagenomic visualization in a Web browser. BMC Bioinform 12:385
17. Santos A, van Aerle R, Barrientos L et al (2020) Computational methods for 16S metabarcoding studies using Nanopore sequencing data. Comput Struct Biotechnol J 18:296–305
18. Sadasivan H, Maric M, Dawson E et al (2022) Accelerating Minimap2 for accurate long read alignment on GPUs. bioRxiv 03(09):483575
19. Kalikar S, Jain C, Md V et al (2022) Accelerating long-read analysis on modern CPUs. bioRxiv 07(21):453294
20. Wang Y, Zhao Y, Bollas A et al (2021) Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 39:1348–1365
21. Hibbett D, Abarenkov K, Kõljalg U et al (2016) Sequence-based classification and identification of Fungi. Mycologia 108:1049–1068
22. Ballenghien M, Faivre N, Galtier N (2017) Patterns of cross-contamination in a multispecies population genomic project: detection, quantification, impact, and solutions. BMC Biol 15(1):25
23. Steinegger M, Salzberg SL (2020) Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank. Genome Biol 21(1):115

Chapter 16

Genomic Epidemiological Analysis of Antimicrobial-Resistant Bacteria with Nanopore Sequencing

Masato Suzuki, Yusuke Hashimoto, Aki Hirabayashi, Koji Yahara, Mitsunori Yoshida, Hanako Fukano, Yoshihiko Hoshino, Keigo Shibayama, and Haruyoshi Tomita

Abstract

Antimicrobial-resistant (AMR) bacterial infections caused by clinically important bacteria, including ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species) and mycobacteria (Mycobacterium tuberculosis and nontuberculous mycobacteria), have become a global public health threat. Their epidemic and pandemic clones often accumulate useful accessory genes in their genomes, such as AMR genes (ARGs) and virulence factor genes (VFGs). This process is facilitated by horizontal gene transfer among microbial communities via mobile genetic elements (MGEs), such as plasmids and phages. Nanopore long-read sequencing allows easy and inexpensive analysis of complex bacterial genome structures, although some aspects of sequencing data calculation and genome analysis methods are not systematically understood. Here we describe the latest and most recommended experimental and bioinformatics methods available for the construction of complete bacterial genomes from nanopore sequencing data and the detection and classification of genotypes of bacterial chromosomes, ARGs, VFGs, plasmids, and other MGEs based on their genomic sequences for genomic epidemiological analysis of AMR bacteria.

Key words ESKAPE pathogens, Mycobacteria, Antimicrobial resistance, Virulence, Mobile genetic element, Chromosome, Plasmid, Phage

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_16, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

The emergence of antimicrobial-resistant (AMR) bacteria has coincided with a decline in antimicrobial discovery and development. An analysis shows that AMR bacterial infections were the direct cause of an estimated 1.27 million deaths worldwide in 2019, more than the deaths caused by several widely recognized diseases, such as malaria and HIV infection [1]. AMR infections caused by ESKAPE pathogens (Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species) [2], other Enterobacterales bacteria including Escherichia coli, and Mycobacterium tuberculosis were the major causes of the reported deaths [1]. Infections caused by nontuberculous mycobacteria (NTM), such as Mycobacterium avium complex (MAC) and Mycobacterium abscessus complex (MABC), which are environmental bacteria that are naturally resistant to various antimicrobials, have been increasing worldwide and have also become a global public health threat [3].

Bacterial pathogens responsible for AMR infections have acquired AMR genes (ARGs) through horizontal gene transfer (HGT) among microbial communities [4]. ARGs are transferred via mobile genetic elements (MGEs), such as plasmids and phages, and epidemic and pandemic bacterial clones often accumulate ARGs, along with virulence factor genes (VFGs), in their genomes, which consist of chromosomes and episomes [5]. Although bacteria usually harbor circular chromosomes and plasmids, ESKAPE pathogens, such as E. faecium and K. pneumoniae [6, 7], and mycobacteria, such as M. avium [8], have also been reported to harbor transferable linear plasmids associated with AMR.

Innovations in long-read sequencing technologies have allowed the construction and analysis of complete genome structures, including chromosomes and plasmids, of AMR bacterial pathogens causing nosocomial infections [9, 10]. Nanopore sequencing is an inexpensive and easy technology that yields long-read sequencing data, but the experimental and bioinformatics methods involved are difficult to keep up with because the technology is updated rapidly [11]. With the advancement of sequencing technologies, numerous methods for genomic epidemiological analysis of AMR bacterial pathogens have also been developed [12].
This chapter presents the methods and tools we recommend, especially for the analysis of ESKAPE pathogens, other Enterobacterales bacteria (including E. coli), and mycobacteria (M. tuberculosis and NTM), aiming to assist researchers in analyzing AMR bacteria.

2 Materials and Methods

2.1 High Molecular Weight (HMW) of Bacterial Genomic DNA (gDNA)

The quality of HMW gDNA is crucial for efficient nanopore sequencing. For tips on the preparation of bacterial gDNA, see Notes 1 and 2.

Most Recommended Bioinformatics Tools

1. Construction of Bacterial Genome Sequences

Trimmomatic [15] (https://github.com/usadellab/Trimmomatic)
Porechop [16] (https://github.com/rrwick/Porechop)
Filtlong (https://github.com/rrwick/Filtlong)
Shovill (https://github.com/tseemann/shovill)
Canu [18] (https://github.com/marbl/canu)
LAST [19] (https://github.com/mcfrith/last-genome-alignments)
Racon [20] (https://github.com/isovic/racon)
Minimap2 [21] (https://github.com/lh3/minimap2)
Pilon [22] (https://github.com/broadinstitute/pilon)
BWA-MEM2 [23, 24] (https://github.com/bwa-mem2/bwa-mem2)

2. Functional Analysis of Bacterial Genome Sequences

DFAST [25, 26] (https://github.com/nigyta/dfast_core)
DFAST_QC (https://github.com/nigyta/dfast_qc)
MLST [27] (https://bitbucket.org/genomicepidemiology/mlst.git)
ResFinder [28, 29] (https://bitbucket.org/genomicepidemiology/resfinder.git)
Staramr (https://github.com/phac-nml/staramr)
VirulenceFinder [30, 31] (https://bitbucket.org/genomicepidemiology/virulencefinder.git)
VFDB [32, 33] (http://www.mgc.ac.cn/VFs/)
TXSScan [34, 35] (https://github.com/macsy-models/TXSScan)
CONJscan [36] (https://github.com/gem-pasteur/Macsyfinder_models)
MOB-suite [37, 38] (https://github.com/phac-nml/mob-suite)

3. Comparative Analysis of Bacterial Genome Sequences

Roary [39, 40] (https://github.com/sanger-pathogens/Roary)
RAxML [41, 42] (https://github.com/stamatak/standard-RAxML)
FigTree (http://tree.bio.ed.ac.uk/software/figtree/)
Proksee (https://proksee.ca/)
Mauve [43, 44] (https://github.com/krishnap25/mauve)
Easyfig [45] (https://github.com/mjsull/Easyfig)

3 Methods

3.1 Construction of Bacterial Complete Genomes with Nanopore Sequencing Data

DNA library preparation, sequencing, and base-calling are performed according to the latest instructions of the manufacturers, including Oxford Nanopore Technologies (ONT) and Illumina (for tips on DNA library preparation, see Note 3). Their flow cells, reagents, and procedures are not discussed here because they are updated rapidly. This chapter focuses on bioinformatics methods and tools. To construct complete bacterial genome sequences, de novo assembly using high-quality sequencing reads is important. Briefly, raw nanopore long-read sequencing data are adequately trimmed and assembled de novo, resulting in near-complete genome sequences. Complete genome sequences are then constructed by trimming the assembled contigs and correcting errors using more accurate short-read sequencing data.

1. For paired-end short-read sequencing data, perform adapter and quality trimming of FASTQ files using Trimmomatic [15] with the default parameter of a minimum average quality cutoff of Q20, and check the sequence statistics using FastQC (https://github.com/s-andrews/FastQC). For nanopore long-read sequencing data, concatenate FASTQ files into one file using Seqtk (https://github.com/lh3/seqtk), perform adapter trimming of the FASTQ file using Porechop [16] and then quality trimming using Filtlong (https://github.com/rrwick/Filtlong) (alternatively NanoFilt [46]) with the customized parameters of a minimum average quality cutoff of Q10 and a minimum read length of 1 kbp, and check the sequence statistics using NanoPlot [46]. The recommended amounts of sequencing data are at least 30× the genome size for short-read sequencing data (for E. coli, with a genome size of approximately 5 Mbp, 150 Mbp or more) and at least 100× the genome size for the less accurate nanopore sequencing data (for E. coli, 500 Mbp or more).

Example Command for Trimmomatic (When the Illumina Library Was Prepared Using the Nextera XT DNA Library Prep Kit)

trimmomatic PE Illumina_R1.fastq.gz Illumina_R2.fastq.gz output_forward_paired.fastq.gz output_forward_unpaired.fastq.gz output_reverse_paired.fastq.gz output_reverse_unpaired.fastq.gz ILLUMINACLIP:NexteraPE-PE.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36

Example Command for Porechop
porechop --input ONT_reads.fastq.gz --output output_reads.fastq.gz

Example Command for Filtlong
filtlong --min_length 1000 --min_mean_q 10 ONT_reads.fastq.gz | gzip > output_reads.fastq.gz
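The recommended data amounts in step 1 are simply the genome size multiplied by the target depth. As a minimal sketch (the function names required_bases and fastq_total_bases are hypothetical helpers, not part of any tool named in this chapter), the required yield and the yield actually present in a FASTQ file can be checked as follows, assuming standard four-line FASTQ records:

```shell
# Minimum number of bases for a target depth: genome size (bp) x depth.
required_bases() {
  local genome_size_bp="$1" depth="$2"
  echo $(( genome_size_bp * depth ))
}

# Total number of bases in a FASTQ stream read from standard input.
# Assumes four lines per record, so sequences are on lines 2, 6, 10, ...
fastq_total_bases() {
  awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }'
}

# E. coli-sized genome (about 5 Mbp), as in the text.
required_bases 5000000 30    # short-read data
required_bases 5000000 100   # nanopore data
```

For a compressed file, the second helper could be used as gzip -dc output_reads.fastq.gz | fastq_total_bases, and the result compared with the value printed by required_bases.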

2. Perform de novo assembly of the paired-end short-read sequencing data from step 1 using Shovill (https://github.com/tseemann/shovill), a fast pipeline based on SPAdes [47], with the customized parameter of a minimum contig length of 300 bp, resulting in bacterial draft genomes. Assembly graphs (FASTG files) from the assemblers can be visualized, with their connections, using Bandage [48]. The draft genome sequences from short-read sequencing should be compared to the complete genome sequences to be constructed later to ensure that there are no discrepancies in the types of ARGs and VFGs harbored, since calculations using different assemblers may result in differences in the DNA regions and/or plasmids detected within the bacterial genomes.

Example Command for Shovill
shovill --outdir shovill_output --R1 Illumina_R1.fastq.gz --R2 Illumina_R2.fastq.gz --minlen 300

3. Perform de novo assembly of the nanopore long-read sequencing data from step 1 using Canu [18] (alternatively Flye [49] and Miniasm/Minipolish [50, 51]) with the default parameters. If the assembled contigs of circular episomes contain overlapping regions, detect the overlapping sequences within and between the contigs via a genome-scale sequence comparison using LAST [19] and trim them manually using genome-browsing software, such as SnapGene (Dotmatics), CLC Genomics Workbench (QIAGEN), and NCBI Genome Workbench [52].

Example Command for Canu (When the Estimated Genome Size Is 5.0 Mbp)
canu -d canu_output genomeSize=5.0m -nanopore-raw ONT_reads.fastq.gz

4. Correct sequence errors in the assembled contigs from step 3, based on the nanopore long-read sequencing data from step 1, by running Racon [20] (alternatively Medaka, https://github.com/nanoporetech/medaka) with Minimap2 [21] twice with default parameters. Then, further correct the contigs with the paired-end short-read sequencing data from step 1 by running Pilon [22] (alternatively Polypolish [7]) with BWA-MEM2 [23, 24] twice with default parameters, resulting in typical complete bacterial genomes consisting of circular chromosomes and episomes (if bacterial genomes include a linear episome(s), see Note 3). The hybrid assembler Unicycler automatically performs such trimming and error correction steps using Miniasm/Racon and Pilon [53], and Trycycler integrates the results of different assemblers [54]. Importantly, up to this step, all command-line operations (and the majority of the bioinformatics tools introduced later) can also be performed using the UseGalaxy web servers located in the European Union (EU), the United States (US), and Australia (AU) [55, 56] and using NanoGalaxy [57], a Galaxy-based toolkit for analyzing nanopore long-read sequencing data, available at the UseGalaxy.eu web server.

Example Commands for Minimap2 and Racon
minimap2 -x map-ont canu_assembly.fasta ONT_reads.fastq.gz > minimap2.paf
racon ONT_reads.fastq.gz minimap2.paf canu_assembly.fasta > racon_assembly.fasta
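The example above shows a single round of Racon polishing, whereas step 4 recommends two rounds, each of which maps the reads to the output of the previous round. The following is a minimal sketch of how the rounds could be chained; the function name racon_rounds and the intermediate file names (racon_round1.paf, racon_round1.fasta, and so on) are hypothetical, and only the Minimap2 and Racon invocations shown above are used. The function prints the commands rather than running them, so that they can be inspected first.

```shell
# Print the Minimap2 and Racon commands for a given number of polishing rounds.
# Arguments: assembly FASTA, reads FASTQ, number of rounds (default 2).
# Hypothetical helper: nothing is executed, the commands are only printed.
racon_rounds() {
  local current="$1" reads="$2" rounds="${3:-2}" round=1
  while [ "$round" -le "$rounds" ]; do
    # Map the reads to the current assembly, then polish it with Racon.
    echo "minimap2 -x map-ont $current $reads > racon_round${round}.paf"
    echo "racon $reads racon_round${round}.paf $current > racon_round${round}.fasta"
    # The polished output becomes the input of the next round.
    current="racon_round${round}.fasta"
    round=$(( round + 1 ))
  done
}

racon_rounds canu_assembly.fasta ONT_reads.fastq.gz 2
```

The output of the final round (here racon_round2.fasta) would then take the place of racon_assembly.fasta in the short-read correction that follows. File names containing spaces are not handled by this sketch.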

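Example Commands for BWA-MEM2 and Pilon (Pilon requires a coordinate-sorted, indexed BAM file, which is prepared here with SAMtools)
bwa-mem2 index racon_assembly.fasta
bwa-mem2 mem racon_assembly.fasta Illumina_R1.fastq.gz Illumina_R2.fastq.gz | samtools sort -o bwa-mem2.bam -
samtools index bwa-mem2.bam
pilon --genome racon_assembly.fasta --frags bwa-mem2.bam --vcf --tracks --changes --verbose --outdir pilon_output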

5. Perform genome annotation of the assembled contigs from step 2 and from step 4 using DFAST [25, 26], a fast pipeline based on Prokka [58] (alternatively RASTtk [59, 60] and PGAP [61]), with the options of sorting sequences by length and fixing the sequence origin (only for circular complete sequences). Moreover, check the taxonomy based on the average nucleotide identity (ANI) to the genomes of type and reference bacterial strains and assess the completeness of genomes using DFAST_QC (https://github.com/nigyta/dfast_qc), which implements FastANI [62] and CheckM [63]. Annotation using DFAST can be performed on the DFAST web server [25, 26], and annotation using RASTtk can be performed on the RAST [59, 60] and BV-BRC (formerly PATRIC) [64, 65] web servers without command-line operations. If bacterial genomes include phage-derived sequences, the completeness of the phage genomes can be assessed using CheckV [66].

Example Commands for DFAST and DFAST_QC
dfast --genome genome_assembly.fasta --minimum_length 300 --out dfast_output
dfast_qc --input_fasta genome_assembly.fasta --out_dir dfast_qc_output

3.2 Detection and Classification of Core Genes, Accessory Genes, and MGEs in Bacterial Genomes

In epidemiological studies of bacterial pathogens, genotype classification of isolates has classically been performed using pulsed-field gel electrophoresis (PFGE) of bacterial gDNA digested with restriction enzymes and PCR-based methods, such as multilocus sequence typing (MLST) and variable-number tandem repeat (VNTR) typing [67]. Some PCR-based methods can also be performed computationally using raw sequencing data and/or genome assemblies. The utilization of public databases is important for the detection and classification of bacterial core genomes and accessory genes, such as ARGs, VFGs, and MGEs. This section explains the bioinformatics methods and tools needed for bacterial epidemiological studies of ESKAPE pathogens and mycobacteria.

1. For most bacterial species, including ESKAPE pathogens, perform genotype classification of the assembled contigs from Subheading 3.1, step 5, using the MLST program [27] with schema selection of the identified species. Information on sequence types (STs) and bacterial isolates from MLST studies is collected and curated in the PubMLST database [68], but the STs of some bacterial species are maintained in other databases, such as EnteroBase (for E. coli) [69] and the BIGSdb-Pasteur website (for K. pneumoniae) [70]. For further tips for MLST, see Note 4. MLST data can be visualized using eBURST [74] for defining groups and clonal complexes of related isolates. For M. tuberculosis, perform mycobacterial interspersed repetitive unit-VNTR (MIRU-VNTR) typing using MIRUReader [75] with the assembled contigs (or long-read sequencing data).

2. Perform detection and classification of bacterial ARGs in the assembled contigs from Subheading 3.1, step 5, using ARG detection tools, such as ResFinder [28, 29], Staramr (https://github.com/phac-nml/staramr), and Abricate (https://github.com/tseemann/abricate), with ARG databases, such as ResFinder, NCBI AMRFinder+ [76, 77], and CARD [78, 79], and with the recommended sequence detection settings of ≥90% identity and >60% coverage (these settings are also recommended for the subsequent sequence-based detection tools based on BLAST [80]). For further tips for ARG detection, see Notes 5 and 6.

Example Command for ResFinder (If You Input a Genome of E. coli)
resfinder --outputPath resfinder_output --species "Escherichia coli" --min_cov 0.6 --threshold 0.9 --acquired --inputfasta genome_assembly.fasta

Example Command for Staramr
staramr search --pid-threshold 90 --percent-length-overlap-resfinder 60 --percent-length-overlap-plasmidfinder 60 -o staramr_out genome_assembly.fasta
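The recommended thresholds (≥90% identity and >60% coverage) can also be applied afterwards to any tabular list of hits. The following is a minimal sketch that assumes a hypothetical tab-separated table with the gene name in column 1, percent identity in column 2, and percent coverage in column 3; the function name filter_hits and the example rows are made up for illustration and do not reproduce the output format of ResFinder, Staramr, or Abricate.

```shell
# Keep hits with identity >= min_identity and coverage > min_coverage.
# Input: tab-separated lines of gene, percent identity, percent coverage
# (an assumed layout, not the native output of any specific tool).
filter_hits() {
  local min_identity="${1:-90}" min_coverage="${2:-60}"
  awk -F'\t' -v id="$min_identity" -v cov="$min_coverage" \
    '$2 + 0 >= id + 0 && $3 + 0 > cov + 0'
}

# Made-up example table: only the first and last rows pass both thresholds.
printf 'blaNDM-1\t100.0\t100.0\ntet(A)\t89.5\t100.0\naadA2\t95.2\t60.0\nsul1\t90.0\t75.3\n' |
  filter_hits 90 60
```

Note that the identity cutoff is inclusive and the coverage cutoff is strict, matching the wording of step 2, so a hit with exactly 60% coverage is discarded.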

3. Perform detection and classification of bacterial VFGs in the assembled contigs from Subheading 3.1, step 5, using VFG detection tools, such as VirulenceFinder [30, 31] and Abricate, with VFG databases, such as VirulenceFinder and VFDB [32, 33]. Furthermore, perform the detection and classification of protein secretion systems using TXSScan [34, 35]. TXSScan can detect major bacterial protein secretion systems, from type I secretion systems (T1SSs) to T9SSs. For further tips for bacterial protein secretion systems, see Note 7. Since T4SSs are also involved in plasmid conjugation, TXSScan (alternatively CONJscan [36]) is useful for detecting conjugation-associated genes in plasmids and chromosomes.

Example Command for TXSScan (If You Input a Circular Complete Genome)
macsyfinder --db-type ordered_replicon --replicon-topology circular --sequence-db genome_assembly.fasta --models TXSS all

4. Perform detection and classification of bacterial MGEs, including plasmids, phages, insertion sequences (ISs), and integrative and conjugative elements (ICEs), in the assembled contigs from Subheading 3.1, step 5, using PlasmidFinder [84, 85] and MOB-suite [37, 38] for plasmids, PHASTER [86, 87] and VirSorter [88] for phages, and MobileElementFinder [89], ISfinder [90, 91], ICEfinder [92], and CONJscan for other MGEs. For further tips for plasmid detection, see Note 8. To detect bacterial DNA defense systems, such as restriction-modification (R-M) systems, CRISPR-Cas systems, and cyclic-oligonucleotide-based antiphage signaling systems (CBASS), which are involved in the defense against the aforementioned HGT events, the latest tools, such as PADLOC [95, 96] and DefenseFinder [97], are available.

Example Command for MOB-Suite (If You Input a Multi-FASTA File)
mob_typer --multi --infile genome_assembly.fasta --min_rep_ident 90 --min_mob_ident 90 --min_con_ident 90 --min_rep_cov 60 --min_mob_cov 60 --min_con_cov 60 --out_file mob-typer_output.txt

5. Additional useful bioinformatics tools, which are frequently used for the detection and classification of core and accessory genes of isolates, have been developed for specific bacterial pathogens, such as some ESKAPE pathogens and E. coli. For example, SCCmecFinder can identify all types of staphylococcal cassette chromosome mec (SCCmec), designated I to XIII, in S. aureus [98]; Kleborate can identify ICEKp- and plasmid-associated virulence loci involved in the production of yersiniabactin, colibactin, salmochelin, and aerobactin and in hypermucoidy (rmpA and rmpA2) in Klebsiella species [99]; Pathogenwatch (https://www.sanger.ac.uk/tool/pathogenwatch/) provides a platform for comparing genomes of bacterial pathogens, including S. aureus and Klebsiella species, from around the world, integrating diverse data sets with rich representations; ClermonTyping can identify the seven main E. coli phylogenetic groups (phylogroups A, B1, B2, C, D, E, and F) as well as Escherichia albertii, Escherichia fergusonii, five cryptic Escherichia clades (I–V), and E. coli sensu stricto in Escherichia species [100].

3.3 Genomic Epidemiological Analysis of AMR Bacterial Isolates Using Their Genomes

Sequence-based pangenomic core detection of collections of genomes is useful for bacterial classification and phylogenetic studies. Visualization of phylogenetic trees and metadata, such as ARGs, VFGs, plasmids, and other MGEs, of bacterial isolates is important for summarizing the results of molecular epidemiological and comparative genomic analysis. This section explains the bioinformatics methods and tools needed for conducting bacterial epidemiological studies using genomes that have been analyzed and annotated in detail and shows some example figures from actual previous analyses.

1. Perform pangenomic analysis of bacterial isolates with the DFAST annotation results (GFF3 files) from Subheading 3.1, step 5, using Roary [39, 40] (alternatively PIRATE [101]) with the default parameters (the minimum percentage identity for BLASTP [102] is 95% and should be ≥70% if it is to be changed for Roary analysis). Perform phylogenetic analysis of the core genome alignments (FASTA files) of bacterial isolates from the pangenomic analysis using RAxML [41, 42] (alternatively FastTree [103, 104] and PhyML [105, 106]) with the recommended bootstrap values of 100. The output files in Newick format from the phylogenetic analysis can be visualized using FigTree (http://tree.bio.ed.ac.uk/software/figtree/) or MEGA (Mega Limited) [107], and the phylogenetic tree should be shown with the metadata of bacterial isolates, such as location and year of isolation, STs from MLST analysis, and ARGs and plasmid replicons detected, using iTOL [108, 109], Phandango [110], Microreact [111], or Excel (Microsoft) (Fig. 1). Although many bioinformatics operations are needed before performing pangenomic analysis of bacterial isolates, Bactopia provides an all-in-one pipeline for pangenomic analysis from raw sequencing data [112].

Example Commands for Roary and RAxML (the -e option makes Roary write the core gene alignment file, core_gene_alignment.aln)
roary -e *.gff
raxmlHPC -f a -x 12345 -p 12345 -# 1000 -m GTRGAMMA -s core_gene_alignment.aln -n raxml_output

Fig. 1 AMR Enterobacterales isolates sequenced on Illumina systems. Phylogenetic analysis of bacterial isolates performed using DFAST and Roary and visualized using FigTree. Bar lengths represent the number of substitutions per site in the core genome constructed. Bacterial species identified using DFAST_QC, years in which the bacteria were isolated, sequence types (STs) of MLST analysis determined from genomes, ARGs detected using ResFinder, and plasmid replicons detected using PlasmidFinder are shown

Fig. 2 Circular representation of an AMR-associated plasmid sequenced on ONT and Illumina systems. A plasmid in E. coli MH13-051M (pMH13-051M_1, accession: AP018572) [94] annotated using DFAST and visualized using CGview. Red, yellow, green, cyan, gray, dark green, purple, and black indicate blaNDM-1, other ARGs detected using ResFinder, T4SS-associated genes detected using TXSScan, MGEs detected using ISfinder and MobileElementFinder, other CDSs, GC Skew+, GC Skew-, and GC content, respectively

2. For bacterial circular genomes, including chromosomes and plasmids, visualize the sequences (GenBank files) of bacterial isolates from Subheading 3.1, step 5, using the Proksee server (former CGview server, https://proksee.ca) [113, 114] along with the genes of interest, GC Skew, and GC content (Fig. 2).

238

Masato Suzuki et al.

Fig. 3 AMR-associated plasmids sequenced on ONT and Illumina systems. Linear comparison of plasmids from E. coli MH13-051M (pMH13-051M_1, accession: AP018572) and Citrobacter freundii MH17-012N (pMH17-012N_1, accession: AP018567 and pMH17-012N_2, accession: AP018568) [94] analyzed using BLAST and visualized using EasyFig. The detailed genetic structure around blaNDM-1 in pMH13-051M_1 visualized using SnapGene. Red, yellow, green, cyan, and gray arrows indicate blaNDM-1, other ARGs detected using ResFinder, T4SS-associated genes detected using TXSScan, MGEs detected using ISfinder and MobileElementFinder, and other CDSs, respectively. The colors in comparison of plasmids show percent identity and sequence direction as indicated. Blue for matches in the same direction and red for matches in the inverted direction

The CGView Comparison Tool (CCT) can also generate circular maps and perform large-scale sequence comparisons within genomes of particular or all available species [115].

3. For large bacterial genomes (>1 Mbp), such as chromosomes, compare a set of the sequences (GenBank files) of bacterial isolates from Subheading 3.1, step 5, with reference sequences (GenBank files) from databases, such as the NCBI nr database and BV-BRC (former PATRIC) database, using progressiveMauve [43, 44]. For short bacterial genomes (≤1 Mbp), such as plasmids, compare a set of the sequences (GenBank files) of bacterial isolates from Subheading 3.1, step 5, with reference sequences (GenBank files) from databases, such as the NCBI nr database and PLSDB plasmid database [116, 117], using Easyfig [45] with the recommended settings of a minimum length of 2000 bp, a maximum e-value of 0.001, and a minimum identity value of 90 (Fig. 3). When creating linear comparison figures of circular genomes, it is customary to place the dnaA gene for chromosomes and the replicon initiator protein (RIP) gene for plasmids at the beginning of the sequences.
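To make the effect of the recommended thresholds concrete, the sketch below applies the same three cutoffs (minimum length 2000 bp, maximum e-value 0.001, minimum identity 90) to BLAST hits in tabular format (`-outfmt 6`). This is not Easyfig's own code: the function name and the example hits are hypothetical, and the standard 12-column tabular layout is assumed.

```python
def filter_hits(lines, min_length=2000, max_evalue=0.001, min_identity=90.0):
    """Keep BLAST tabular hits that pass length, e-value, and identity cutoffs.

    Assumes the standard 12 columns of BLAST outfmt 6:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        f = line.split("\t")
        pident, length, evalue = float(f[2]), int(f[3]), float(f[10])
        if length < min_length or evalue > max_evalue or pident < min_identity:
            continue
        # Descending subject coordinates indicate a match on the opposite strand.
        orientation = "same" if int(f[8]) <= int(f[9]) else "inverted"
        kept.append((f[0], f[1], pident, length, orientation))
    return kept


if __name__ == "__main__":
    example_hits = [  # hypothetical hits between two plasmids
        "pA\tpB\t99.5\t5000\t10\t2\t1\t5000\t100\t5099\t0.0\t9000",
        "pA\tpB\t85.0\t5000\t700\t5\t1\t5000\t100\t5099\t1e-50\t4000",
        "pA\tpB\t98.0\t1500\t30\t0\t1\t1500\t200\t1699\t0.0\t2600",
        "pA\tpB\t95.0\t3000\t150\t0\t1\t3000\t7000\t4001\t1e-100\t4800",
    ]
    for hit in filter_hits(example_hits):
        print(hit)
```

The orientation label mirrors the color convention of Fig. 3, in which matches in the same direction and in the inverted direction are drawn differently.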

Bioinformatics Approaches in Studying AMR Bacteria

4 Notes

1. Bacterial gDNA should be prepared gently, using enzymatic lysis methods, such as the traditional method involving phenol-chloroform extraction combined with ethanol precipitation, or phenol-free commercially available HMW DNA extraction kits, such as the Genomic-tips used along with the Genomic DNA Buffer Set (QIAGEN) and the MagAttract HMW DNA Kit (QIAGEN). Column-based kits like the former used with a portable centrifuge, or magnetic bead-based kits like the latter, are useful in limited laboratory capacity situations where a centrifuge is unavailable [13, 14].

2. In the case of gram-negative bacteria, such as Enterobacterales, Acinetobacter species, and Pseudomonas species, bacterial cells can be lysed using conventional methods with lysozyme and Proteinase K. In the case of gram-positive bacteria, such as Enterococcus species, Staphylococcus species, and mycobacteria, DNA purification efficiency is greatly affected by lysis methods due to the thick cell walls. Prolonged treatment with other enzymes, such as achromopeptidase and lysostaphin, in addition to the abovementioned lytic enzymes, is important. Moreover, for lipid-rich bacteria, such as mycobacteria, pretreatment of bacterial cells with chloroform-methanol and purification of gDNA by multiple steps of phenol-chloroform extraction to remove lipids are especially important [17]. Contaminating lipids in gDNA solutions cannot be detected by commonly used DNA detection methods based on absorbance or fluorescence and could inhibit nanopore sequencing by blocking protein pores.

3. For both short-read and long-read sequencing of bacterial gDNA, commercially available library preparation kits that perform DNA fragmentation and adapter ligation with transposases in a one-step reaction, such as the Nextera XT DNA Library Prep Kit (Illumina) for Illumina short-read sequencing and the Rapid Barcoding Kit (Oxford Nanopore Technologies) for ONT long-read sequencing, are very useful because of the fewer steps involved and less time required. However, for the analysis of bacterial episomes with linear structures, such as linear plasmids, it is important to sequence DNA fragments from both ends. Although they involve more steps and time, library preparation kits with physical or DNase-mediated DNA fragmentation and subsequent adapter ligation, such as the TruSeq DNA Sample Prep Kit (Illumina) and the QIAseq FX DNA Library Kit (QIAGEN) for Illumina short-read sequencing, and the Ligation Sequencing Kit with Native Barcoding Kit (Oxford Nanopore Technologies) for ONT long-read sequencing, are useful. To verify both the 5′ and 3′ ends of linear DNA sequences after de novo assembly using nanopore long-read sequencing data, mapping paired-end short-read sequencing data onto the assembly using BWA-MEM and visualizing the result using the Integrative Genomics Viewer (IGV) [118] are helpful [6].

4. There are multiple MLST methods for some bacterial species, such as E. coli and A. baumannii (the Achtman scheme [71] and the Pasteur scheme [72] are more common, respectively). Although MLST basically targets housekeeping genes conserved in bacterial chromosomes, it should be noted that clinically important clones lacking the target gene(s) could be spreading [73]. Novel gene sequences and their combinations in MLST analysis need to be registered by the curators of the database.

5. ESKAPE pathogens often acquire ARGs through HGT via conjugative plasmids. Plasmid-mediated HGT is rare in mycobacteria, and very few ARGs, such as erm (intrinsic genes conferring inducible resistance to macrolide antimicrobials), are naturally encoded on chromosomes (e.g., erm(37) and erm(41) are conserved in chromosomes of M. tuberculosis and M. abscessus, respectively) [81, 82].

6. Novel variants of ARG products, such as β-lactamases (enzymes that degrade β-lactam antimicrobials), with amino acid changes can involve changes in substrate specificities and biochemical activities, and their naming should be considered carefully according to the international standards [83] (https://www.ncbi.nlm.nih.gov/pathogens/submit-beta-lactamase/).

7. In general, T3SSs, T4SSs, T6SSs, and their secreted proteins are important for the virulence of gram-negative bacteria (the T6SS is conserved in K. pneumoniae, Enterobacter species, and A. baumannii, and the T3SS and T6SS are conserved in P. aeruginosa and pathogenic E. coli), and T7SSs and their secreted proteins are important for the virulence of gram-positive bacteria, including mycobacteria.

8. The nucleotide database of PlasmidFinder basically includes sequences of RIP genes encoded by plasmids in Enterobacterales and some gram-positive bacteria, including Enterococcus species and Staphylococcus species, and does not include RIP gene sequences specific for plasmids in Acinetobacter and Pseudomonas species, such as incompatibility group P-2 (IncP-2) plasmids [93]. The nucleotide database of MOB-typer, implemented in MOB-suite, is an extension of the PlasmidFinder database, but it should be noted that there are quite a few misclassifications; for example, IncP-1 and IncP-6 (similar names but completely different sequences) are simply indicated as “IncP.” Recently, it was suggested that hybrid plasmids with multiple different RIP genes might be associated with multidrug resistance [94].
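The end verification described in Note 3 is usually done by eye in IGV, but a rough numeric check can complement it. The sketch below is only an illustration under stated assumptions: it expects per-base depth lines in the three-column form produced by `samtools depth -a` (contig, position, depth), and the contig name, window size, and depth threshold are hypothetical values chosen for the example.

```python
def parse_depth(lines, contig):
    """Collect per-base depths for one contig from 'contig<TAB>pos<TAB>depth' lines."""
    depths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == contig:
            depths.append(int(fields[2]))
    return depths


def ends_supported(depths, window=100, min_mean_depth=10.0):
    """Report whether the first and last `window` bases reach the mean depth cutoff.

    Returns a (left_ok, right_ok) tuple. The window and cutoff are assumptions;
    short-read depth normally tapers at true ends, so treat this as a rough screen.
    """
    if len(depths) < 2 * window:
        raise ValueError("contig is shorter than two windows")
    left = sum(depths[:window]) / window
    right = sum(depths[-window:]) / window
    return left >= min_mean_depth, right >= min_mean_depth


if __name__ == "__main__":
    # Hypothetical depth profile: well-covered left end, poorly covered right end.
    toy = [f"contig_1\t{i + 1}\t{30 if i < 50 else 2}" for i in range(100)]
    print(ends_supported(parse_depth(toy, "contig_1"), window=50))
```

An end that fails such a screen is a candidate for closer inspection of the mapped reads rather than proof of a misassembly.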

Acknowledgments

This work was supported by grants (JP22fk0108133, JP22fk0108139, JP22fk0108148, JP22fk0108642, JP22gm1610003, JP22wm0225004, JP22wm0225008, JP22wm0225022, JP22wm0325003, JP22wm0325022, and JP22wm0325037 to M. Suzuki; JP22wm0225022 to M. Yoshida; JP22wm0325054 to H. Fukano; JP22fk0108093, JP22fk0108129, JP22fk0108608, JP22gm1610003, JP22wm0125007, JP22wm0225004, JP22wm0225022, JP22wm0325003, and JP22wm0325054 to Y. Hoshino; JP22fk0108148, JP22fk0108604, and JP22gm1610003 to K. Shibayama; JP22fk0108604 and JP22wm0225008 to H. Tomita) from the Japan Agency for Medical Research and Development (AMED); a grant (21KA1004 to H. Tomita) from the Ministry of Health, Labor and Welfare, Japan; and grants (20K07509 to M. Suzuki; 22K16368 to Y. Hashimoto; 21K15440 to A. Hirabayashi; 22K07067 to H. Tomita) from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

References

1. Antimicrobial Resistance Collaborators (2022) Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet 399(10325):629–655. https://doi.org/10.1016/S0140-6736(21)02724-0
2. Rice LB (2008) Federal funding for the study of antimicrobial resistance in nosocomial pathogens: no ESKAPE. J Infect Dis 197(8):1079–1081. https://doi.org/10.1086/533452
3. Cowman S et al (2019) Non-tuberculous mycobacterial pulmonary disease. Eur Respir J 54(1). https://doi.org/10.1183/13993003.00250-2019
4. Partridge SR et al (2018) Mobile genetic elements associated with antimicrobial resistance. Clin Microbiol Rev 31(4). https://doi.org/10.1128/CMR.00088-17
5. De Oliveira DMP et al (2020) Antimicrobial resistance in ESKAPE pathogens. Clin Microbiol Rev 33(3). https://doi.org/10.1128/CMR.00181-19
6. Hashimoto Y et al (2019) Novel multidrug-resistant enterococcal mobile linear plasmid pELF1 encoding vanA and vanM gene clusters from a Japanese vancomycin-resistant enterococci isolate. Front Microbiol 10:2568. https://doi.org/10.3389/fmicb.2019.02568
7. Hawkey J et al (2022) Linear plasmids in Klebsiella and other Enterobacteriaceae. Microb Genom 8(4). https://doi.org/10.1099/mgen.0.000807
8. Rabello MC et al (2012) First description of natural and experimental conjugation between Mycobacteria mediated by a linear plasmid. PLoS One 7(1):e29884. https://doi.org/10.1371/journal.pone.0029884
9. Conlan S et al (2014) Single-molecule sequencing to track plasmid diversity of hospital-associated carbapenemase-producing Enterobacteriaceae. Sci Transl Med 6(254):254ra126. https://doi.org/10.1126/scitranslmed.3009845
10. Lemon JK et al (2017) Rapid nanopore sequencing of plasmids and resistance gene detection in clinical isolates. J Clin Microbiol 55(12):3530–3543. https://doi.org/10.1128/JCM.01069-17


11. Wang Y et al (2021) Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol 39(11):1348–1365. https://doi.org/10.1038/s41587-021-01108-x
12. Boolchandani M et al (2019) Sequencing-based methods and resources to study antimicrobial resistance. Nat Rev Genet 20(6):356–370. https://doi.org/10.1038/s41576-019-0108-4
13. Bento Lab (2016) Nat Biotechnol 34(5):455. https://doi.org/10.1038/nbt0516-455
14. Hirabayashi A et al (2021) On-site genomic epidemiological analysis of antimicrobial-resistant bacteria in Cambodia with portable laboratory equipment. Front Microbiol 12:675463. https://doi.org/10.3389/fmicb.2021.675463
15. Bolger AM et al (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120. https://doi.org/10.1093/bioinformatics/btu170
16. Wick RR et al (2017) Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom 3(10):e000132. https://doi.org/10.1099/mgen.0.000132
17. Kaser M et al (2009) Optimized method for preparation of DNA from pathogenic and environmental mycobacteria. Appl Environ Microbiol 75(2):414–418. https://doi.org/10.1128/AEM.01358-08
18. Koren S et al (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27(5):722–736. https://doi.org/10.1101/gr.215087.116
19. Frith MC et al (2010) Parameters for accurate genome alignment. BMC Bioinformatics 11:80. https://doi.org/10.1186/1471-2105-11-80
20. Vaser R et al (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27(5):737–746. https://doi.org/10.1101/gr.214270.116
21. Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094–3100. https://doi.org/10.1093/bioinformatics/bty191
22. Walker BJ et al (2014) Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9(11):e112963. https://doi.org/10.1371/journal.pone.0112963
23. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26(5):589–595. https://doi.org/10.1093/bioinformatics/btp698

24. Vasimuddin Md, Misra S, Li H, Aluru S (2019) Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In: IEEE parallel and distributed processing symposium (IPDPS)
25. Tanizawa Y et al (2018) DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics 34(6):1037–1039. https://doi.org/10.1093/bioinformatics/btx713
26. Tanizawa Y et al (2019) Generating publication-ready prokaryotic genome annotations with DFAST. Methods Mol Biol 1962:215–226. https://doi.org/10.1007/978-1-4939-9173-0_13
27. Larsen MV et al (2012) Multilocus sequence typing of total-genome-sequenced bacteria. J Clin Microbiol 50(4):1355–1361. https://doi.org/10.1128/JCM.06094-11
28. Zankari E et al (2012) Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother 67(11):2640–2644. https://doi.org/10.1093/jac/dks261
29. Florensa AF et al (2022) ResFinder - an open online resource for identification of antimicrobial resistance genes in next-generation sequencing data and prediction of phenotypes from genotypes. Microb Genom 8(1). https://doi.org/10.1099/mgen.0.000748
30. Joensen KG et al (2014) Real-time whole-genome sequencing for routine typing, surveillance, and outbreak detection of verotoxigenic Escherichia coli. J Clin Microbiol 52(5):1501–1510. https://doi.org/10.1128/JCM.03617-13
31. Malberg Tetzschner AM et al (2020) In silico genotyping of Escherichia coli isolates for extraintestinal virulence genes by use of whole-genome sequencing data. J Clin Microbiol 58(10). https://doi.org/10.1128/JCM.01269-20
32. Chen L et al (2005) VFDB: a reference database for bacterial virulence factors. Nucleic Acids Res 33(Database issue):D325–D328. https://doi.org/10.1093/nar/gki008
33. Liu B et al (2022) VFDB 2022: a general classification scheme for bacterial virulence factors. Nucleic Acids Res 50(D1):D912–D917. https://doi.org/10.1093/nar/gkab1107
34. Abby SS et al (2016) Identification of protein secretion systems in bacterial genomes. Sci Rep 6:23080. https://doi.org/10.1038/srep23080
35. Abby SS, Rocha EPC (2017) Identification of protein secretion systems in bacterial genomes using MacSyFinder. Methods Mol Biol 1615:1–21. https://doi.org/10.1007/978-1-4939-7033-9_1
36. Cury J et al (2020) Identifying conjugative plasmids and integrative conjugative elements with CONJscan. Methods Mol Biol 2075:265–283. https://doi.org/10.1007/978-1-4939-9877-7_19
37. Robertson J, Nash JHE (2018) MOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies. Microb Genom 4(8). https://doi.org/10.1099/mgen.0.000206
38. Robertson J et al (2020) Universal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance. Microb Genom 6(10). https://doi.org/10.1099/mgen.0.000435
39. Page AJ et al (2015) Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31(22):3691–3693. https://doi.org/10.1093/bioinformatics/btv421
40. Sitto F, Battistuzzi FU (2020) Estimating pangenomes with Roary. Mol Biol Evol 37(3):933–939. https://doi.org/10.1093/molbev/msz284
41. Rokas A (2011) Phylogenetic analysis of protein sequence data using the randomized Axelerated maximum likelihood (RAXML) program. Curr Protoc Mol Biol Chapter 19:Unit19 11. https://doi.org/10.1002/0471142727.mb1911s96
42. Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30(9):1312–1313. https://doi.org/10.1093/bioinformatics/btu033
43. Darling AC et al (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14(7):1394–1403. https://doi.org/10.1101/gr.2289704
44. Darling AE et al (2010) progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147. https://doi.org/10.1371/journal.pone.0011147
45. Sullivan MJ et al (2011) Easyfig: a genome comparison visualizer. Bioinformatics 27(7):1009–1010. https://doi.org/10.1093/bioinformatics/btr039
46. De Coster W et al (2018) NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34(15):2666–2669. https://doi.org/10.1093/bioinformatics/bty149
47. Bankevich A et al (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19(5):455–477. https://doi.org/10.1089/cmb.2012.0021
48. Wick RR et al (2015) Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31(20):3350–3352. https://doi.org/10.1093/bioinformatics/btv383
49. Kolmogorov M et al (2019) Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol 37(5):540–546. https://doi.org/10.1038/s41587-019-0072-8
50. Li H (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32(14):2103–2110. https://doi.org/10.1093/bioinformatics/btw152
51. Wick RR, Holt KE (2019) Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Res 8:2138. https://doi.org/10.12688/f1000research.21782.4
52. Kuznetsov A, Bollin CJ (2021) NCBI genome workbench: desktop software for comparative genomics, visualization, and GenBank data submission. Methods Mol Biol 2231:261–295. https://doi.org/10.1007/978-1-0716-1036-7_16
53. Wick RR et al (2017) Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 13(6):e1005595. https://doi.org/10.1371/journal.pcbi.1005595
54. Wick RR et al (2021) Trycycler: consensus long-read assemblies for bacterial genomes. Genome Biol 22(1):266. https://doi.org/10.1186/s13059-021-02483-z
55. Giardine B et al (2005) Galaxy: a platform for interactive large-scale genome analysis. Genome Res 15(10):1451–1455. https://doi.org/10.1101/gr.4086505
56. Galaxy Community (2022) The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2022 update. Nucleic Acids Res. https://doi.org/10.1093/nar/gkac247
57. de Koning W et al (2020) NanoGalaxy: Nanopore long-read sequencing data analysis in Galaxy. Gigascience 9(10). https://doi.org/10.1093/gigascience/giaa105
58. Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30(14):2068–2069. https://doi.org/10.1093/bioinformatics/btu153
59. Overbeek R et al (2014) The SEED and the rapid annotation of microbial genomes using subsystems technology (RAST). Nucleic Acids Res 42(Database issue):D206–D214. https://doi.org/10.1093/nar/gkt1226


60. Brettin T et al (2015) RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep 5:8365. https://doi.org/10.1038/srep08365
61. Tatusova T et al (2016) NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 44(14):6614–6624. https://doi.org/10.1093/nar/gkw569
62. Jain C et al (2018) High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9(1):5114. https://doi.org/10.1038/s41467-018-07641-9
63. Parks DH et al (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25(7):1043–1055. https://doi.org/10.1101/gr.186072.114
64. Wattam AR et al (2018) Assembly, annotation, and comparative genomics in PATRIC, the all bacterial bioinformatics resource center. Methods Mol Biol 1704:79–101. https://doi.org/10.1007/978-1-4939-7463-4_4
65. Davis JJ et al (2020) The PATRIC bioinformatics resource center: expanding data and analysis capabilities. Nucleic Acids Res 48(D1):D606–D612. https://doi.org/10.1093/nar/gkz943
66. Nayfach S et al (2021) CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39(5):578–585. https://doi.org/10.1038/s41587-020-00774-7
67. Maiden MC et al (2013) MLST revisited: the gene-by-gene approach to bacterial genomics. Nat Rev Microbiol 11(10):728–736. https://doi.org/10.1038/nrmicro3093
68. Jolley KA et al (2018) Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res 3:124. https://doi.org/10.12688/wellcomeopenres.14826.1
69. Zhou Z et al (2020) The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity. Genome Res 30(1):138–152. https://doi.org/10.1101/gr.251678.119
70. Diancourt L et al (2005) Multilocus sequence typing of Klebsiella pneumoniae nosocomial isolates. J Clin Microbiol 43(8):4178–4182. https://doi.org/10.1128/JCM.43.8.4178-4182.2005
71. Wirth T et al (2006) Sex and virulence in Escherichia coli: an evolutionary perspective. Mol Microbiol 60(5):1136–1151. https://doi.org/10.1111/j.1365-2958.2006.05172.x
72. Diancourt L et al (2010) The population structure of Acinetobacter baumannii: expanding multiresistant clones from an ancestral susceptible genetic pool. PLoS One 5(4):e10034. https://doi.org/10.1371/journal.pone.0010034
73. Carter GP et al (2016) Emergence of endemic MLST non-typeable vancomycin-resistant Enterococcus faecium. J Antimicrob Chemother 71(12):3367–3371. https://doi.org/10.1093/jac/dkw314
74. Feil EJ et al (2004) eBURST: inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data. J Bacteriol 186(5):1518–1530. https://doi.org/10.1128/JB.186.5.1518-1530.2004
75. Tang CY, Ong RT (2020) MIRUReader: MIRU-VNTR typing directly from long sequencing reads. Bioinformatics 36(5):1625–1626. https://doi.org/10.1093/bioinformatics/btz771
76. Feldgarden M et al (2021) AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep 11(1):12728. https://doi.org/10.1038/s41598-021-91456-0
77. Feldgarden M et al (2022) Curation of the AMRFinderPlus databases: applications, functionality and impact. Microb Genom 8(6). https://doi.org/10.1099/mgen.0.000832
78. McArthur AG et al (2013) The comprehensive antibiotic resistance database. Antimicrob Agents Chemother 57(7):3348–3357. https://doi.org/10.1128/AAC.00419-13
79. Alcock BP et al (2020) CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res 48(D1):D517–D525. https://doi.org/10.1093/nar/gkz935
80. Altschul SF et al (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410. https://doi.org/10.1016/S0022-2836(05)80360-2
81. Andini N, Nash KA (2006) Intrinsic macrolide resistance of the Mycobacterium tuberculosis complex is inducible. Antimicrob Agents Chemother 50(7):2560–2562. https://doi.org/10.1128/AAC.00264-06
82. Brown-Elliott BA et al (2012) Antimicrobial susceptibility testing, drug resistance mechanisms, and therapy of infections with nontuberculous mycobacteria. Clin Microbiol Rev 25(3):545–582. https://doi.org/10.1128/CMR.05030-11

83. Bradford PA et al (2022) Consensus on beta-lactamase nomenclature. Antimicrob Agents Chemother 66(4):e0033322. https://doi.org/10.1128/aac.00333-22
84. Carattoli A et al (2014) In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother 58(7):3895–3903. https://doi.org/10.1128/AAC.02412-14
85. Carattoli A, Hasman H (2020) PlasmidFinder and in silico pMLST: identification and typing of plasmid replicons in whole-genome sequencing (WGS). Methods Mol Biol 2075:285–294. https://doi.org/10.1007/978-1-4939-9877-7_20
86. Arndt D et al (2016) PHASTER: a better, faster version of the PHAST phage search tool. Nucleic Acids Res 44(W1):W16–W21. https://doi.org/10.1093/nar/gkw387
87. Arndt D et al (2019) PHAST, PHASTER and PHASTEST: tools for finding prophage in bacterial genomes. Brief Bioinform 20(4):1560–1567. https://doi.org/10.1093/bib/bbx121
88. Roux S et al (2015) VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985. https://doi.org/10.7717/peerj.985
89. Johansson MHK et al (2021) Detection of mobile genetic elements associated with antibiotic resistance in Salmonella enterica using a newly developed web tool: MobileElementFinder. J Antimicrob Chemother 76(1):101–109. https://doi.org/10.1093/jac/dkaa390
90. Siguier P et al (2006) ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res 34(Database issue):D32–D36. https://doi.org/10.1093/nar/gkj014
91. Siguier P et al (2012) Exploring bacterial insertion sequences with ISfinder: objectives, uses, and future developments. Methods Mol Biol 859:91–103. https://doi.org/10.1007/978-1-61779-603-6_5
92. Liu M et al (2019) ICEberg 2.0: an updated database of bacterial integrative and conjugative elements. Nucleic Acids Res 47(D1):D660–D665. https://doi.org/10.1093/nar/gky1123
93. Shintani M et al (2022) Precise classification of antimicrobial resistance-associated IncP-2 megaplasmids for molecular epidemiological studies on Pseudomonas species. J Antimicrob Chemother 77(4):1203–1205. https://doi.org/10.1093/jac/dkac006
94. Hirabayashi A et al (2021) Plasmid analysis of NDM metallo-beta-lactamase-producing Enterobacterales isolated in Vietnam. PLoS One 16(7):e0231119. https://doi.org/10.1371/journal.pone.0231119
95. Payne LJ et al (2021) Identification and classification of antiviral defence systems in bacteria and archaea with PADLOC reveals new system types. Nucleic Acids Res 49(19):10868–10878. https://doi.org/10.1093/nar/gkab883
96. Payne LJ et al (2022) PADLOC: a web server for the identification of antiviral defence systems in microbial genomes. Nucleic Acids Res. https://doi.org/10.1093/nar/gkac400
97. Tesson F et al (2022) Systematic and quantitative view of the antiviral arsenal of prokaryotes. Nat Commun 13(1):2561. https://doi.org/10.1038/s41467-022-30269-9
98. Kaya H et al (2018) SCCmecFinder, a web-based tool for typing of staphylococcal cassette chromosome mec in Staphylococcus aureus using whole-genome sequence data. mSphere 3(1):e00612-17. https://doi.org/10.1128/mSphere.00612-17
99. Lam MMC et al (2021) A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex. Nat Commun 12(1):4188. https://doi.org/10.1038/s41467-021-24448-3
100. Beghain J et al (2018) ClermonTyping: an easy-to-use and accurate in silico method for Escherichia genus strain phylotyping. Microb Genom 4(7). https://doi.org/10.1099/mgen.0.000192
101. Bayliss SC et al (2019) PIRATE: a fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience 8(10). https://doi.org/10.1093/gigascience/giz119
102. Altschul SF et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402. https://doi.org/10.1093/nar/25.17.3389
103. Price MN et al (2009) FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 26(7):1641–1650. https://doi.org/10.1093/molbev/msp077
104. Price MN et al (2010) FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One 5(3):e9490. https://doi.org/10.1371/journal.pone.0009490
105. Guindon S et al (2009) Estimating maximum likelihood phylogenies with PhyML. Methods Mol Biol 537:113–137. https://doi.org/10.1007/978-1-59745-251-9_6
106. Guindon S et al (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59(3):307–321. https://doi.org/10.1093/sysbio/syq010
107. Tamura K et al (2021) MEGA11: molecular evolutionary genetics analysis version 11. Mol Biol Evol 38(7):3022–3027. https://doi.org/10.1093/molbev/msab120
108. Letunic I, Bork P (2007) Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23(1):127–128. https://doi.org/10.1093/bioinformatics/btl529
109. Letunic I, Bork P (2021) Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Res 49(W1):W293–W296. https://doi.org/10.1093/nar/gkab301
110. Hadfield J et al (2018) Phandango: an interactive viewer for bacterial population genomics. Bioinformatics 34(2):292–293. https://doi.org/10.1093/bioinformatics/btx610
111. Argimon S et al (2016) Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Microb Genom 2(11):e000093. https://doi.org/10.1099/mgen.0.000093
112. Petit RA 3rd, Read TD (2020) Bactopia: a flexible pipeline for complete analysis of bacterial genomes. mSystems 5(4). https://doi.org/10.1128/mSystems.00190-20
113. Stothard P, Wishart DS (2005) Circular genome visualization and exploration using CGView. Bioinformatics 21(4):537–539. https://doi.org/10.1093/bioinformatics/bti054
114. Grant JR, Stothard P (2008) The CGView server: a comparative genomics tool for circular genomes. Nucleic Acids Res 36(Web Server issue):W181–W184. https://doi.org/10.1093/nar/gkn179
115. Grant JR et al (2012) Comparing thousands of circular genomes using the CGView comparison tool. BMC Genomics 13:202. https://doi.org/10.1186/1471-2164-13-202
116. Galata V et al (2019) PLSDB: a resource of complete bacterial plasmids. Nucleic Acids Res 47(D1):D195–D202. https://doi.org/10.1093/nar/gky1050
117. Schmartz GP et al (2022) PLSDB: advancing a comprehensive database of bacterial plasmids. Nucleic Acids Res 50(D1):D273–D278. https://doi.org/10.1093/nar/gkab1111
118. Robinson JT et al (2011) Integrative genomics viewer. Nat Biotechnol 29(1):24–26. https://doi.org/10.1038/nbt.1754

Chapter 17

Rapid and Comprehensive Identification of Nontuberculous Mycobacteria

Yuki Matsumoto and Shota Nakamura

Abstract

Next-generation sequencing is a powerful tool to accurately identify pathogens. The MinION sequencer is best suited for the rapid identification of bacterial species due to its real-time sequence output. In this chapter, we introduce a method to identify nontuberculous mycobacteria (NTM) in one sequencing analysis from culture isolates using the MinION sequencer. NTM disease is now recognized as a growing global health concern due to its increasing incidence and prevalence. There are over 200 NTM species, of which the major pathogens are further classified into many subspecies showing different antibiotic susceptibilities. Therefore, identifying the pathogens at the subspecies level of NTM is necessary to select an appropriate treatment regimen. The protocol described here includes DNA extraction by lysis using silica beads, library preparation, sequencing by the MinION sequencer, and analysis of multilocus sequence typing using the software “mlstverse” and enables rapid and comprehensive identification of 175 species of NTM at the subspecies level with high sensitivity and accuracy.

Key words Nontuberculous mycobacteria, Multi-locus sequence typing, Rapid identification, Whole genome sequencing, Pathogen detection

1 Introduction

Next-generation sequencing is a powerful tool to accurately identify pathogens [1, 2]. In contrast to second-generation sequencing technologies, the newer sequencing devices from Oxford Nanopore Technologies provide sequencing data immediately after sequencing starts. One such device, the MinION sequencer, is best suited for rapidly identifying bacterial species due to its sequencing capacity [3]. In this chapter, we introduce a method to identify nontuberculous mycobacteria (NTM) in one sequencing analysis from culture isolates using the MinION sequencer.

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_17, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


The incidence and prevalence of NTM diseases have recently been increasing worldwide [4–6], and epidemiological studies have reported that race, ethnicity, sex, and geographical factors affect the predominant species [7, 8]. Thus, NTM diseases are now recognized as a new global health concern. Over 200 NTM species have been discovered, of which approximately 140 are pathogenic to humans and animals, and at least six novel species of NTM have been reported in succession in the last three years [9–14]. Of these, the Mycobacterium avium complex (MAC; primarily consisting of M. avium, Mycobacterium intracellulare, and their subspecies), Mycobacterium kansasii, and the Mycobacterium abscessus complex (consisting of three subspecies of M. abscessus) are the major pathogens causing NTM pulmonary disease [15, 16]. These two complexes are further classified into many subspecies that show varying antibiotic susceptibilities [17–19]. The vast diversity of NTM subspecies in terms of drug susceptibilities poses a significant obstacle in clinical treatment. Without subspecies identification, current therapy for NTM disease requires prolonged treatment and often leads to unsatisfactory clinical outcomes. Therefore, identifying NTM at the subspecies level is an essential step prior to treatment [18, 20].

Conventional identification methods for NTM are usually based on staining or culture growth. Mass spectrometry analysis is becoming popular in large hospitals [21, 22]. The most accurate methods for definitive species identification are based on molecular assays, such as nucleic acid hybridization and DNA sequencing of several housekeeping genes, known as multi-locus sequence typing (MLST) [23, 24]. We developed “mlstverse,” an MLST database and identification software dedicated to NTM, in 2018 [25]. The mlstverse software enables the comprehensive identification of up to 175 species of NTM at the subspecies level, with high sensitivity and accuracy (see Note 1).
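The idea behind MLST can be illustrated with a toy example: each housekeeping locus is assigned an allele number, and the combination of allele numbers (the profile) determines the type. The sketch below is not the mlstverse implementation, which has its own interface and database; the locus names, allele sequences, and profile table are made-up data for illustration only.

```python
def call_alleles(locus_seqs, allele_db):
    """Map each locus sequence to a known allele number by exact match.

    Returns None for a locus whose sequence is not in the toy database
    (i.e., a putative novel allele).
    """
    profile = {}
    for locus, seq in locus_seqs.items():
        profile[locus] = allele_db.get(locus, {}).get(seq.upper())
    return profile


def assign_type(profile, profiles, loci):
    """Look up the type for a complete allele profile; None if unknown or incomplete."""
    key = tuple(profile.get(locus) for locus in loci)
    if None in key:
        return None
    return profiles.get(key)


# Toy data (hypothetical): two loci, a few alleles, two known profiles.
LOCI = ("hsp65", "rpoB")
ALLELE_DB = {"hsp65": {"ACGT": 1, "ACGA": 2}, "rpoB": {"GGCC": 1}}
PROFILES = {(1, 1): "type_A", (2, 1): "type_B"}

if __name__ == "__main__":
    isolate = {"hsp65": "acga", "rpoB": "GGCC"}
    print(assign_type(call_alleles(isolate, ALLELE_DB), PROFILES, LOCI))
```

Real schemes use many more loci and tolerate sequencing errors, which is one reason dedicated software is used rather than exact string matching.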

2 Materials

2.1 Equipment

1. VORTEX-GENIE 2 with TurboMix adapter
2. Thermal cycler
3. Qubit Fluorometer
4. Magnetic rack for Agencourt AMPure XP beads

2.2 Consumables for Library Preparation and MinION Sequencing

1. MN Bead Tubes Type B (NucleoSpin® Bead Tubes Type B)
2. Rapid PCR Barcoding Kit (Oxford Nanopore Technologies)
3. Agencourt AMPure XP beads (Beckman Coulter)
4. LongAmp Taq 2X Master Mix (New England Biolabs)
5. 10 mM Tris-HCl pH 8.0 with 50 mM NaCl
6. Nuclease-free water
7. Freshly prepared 70% ethanol in nuclease-free water
8. 10 mM Tris-HCl pH 8.0 with 50 mM NaCl
9. 1.5 mL DNA LoBind tubes (Eppendorf)
10. 0.2 mL thin-walled PCR tubes

2.3 Software Requirements for Computational Analysis

1. minimap2
2. samtools
3. R
4. mlstverse (R package)
5. mlstverse.Mycobacterium.db (R package)

3 Methods

3.1 DNA Extraction from Culture

1. Suspend a bacterial pellet in 250 μL of nuclease-free water and transfer it to an MN Bead Tube Type B. Close the tube cap tightly. MGIT culture isolates can be used as the starting material instead (see Notes 2 and 3).
2. Vortex with the VORTEX-GENIE 2 for 5 min at maximum speed. Spin down briefly.
3. Heat for 5 min at 95 °C. Centrifuge for 5 min at 13,000 rpm.
4. Collect 100 μL of the supernatant in a DNA LoBind tube. Store at -80 °C if the experiment is to be paused.
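Step 3 gives the centrifugation speed in rpm, which depends on the rotor. As a rough aid, the standard relation RCF = 1.118 × 10^-5 × r × rpm^2 (r in cm) converts rpm to relative centrifugal force. The sketch below applies it; the 8 cm rotor radius is a hypothetical value that is not given in this protocol, so substitute the radius of your own rotor.

```python
def rpm_to_rcf(rpm, rotor_radius_cm):
    """Relative centrifugal force (x g) for a given speed and rotor radius."""
    return 1.118e-5 * rotor_radius_cm * rpm ** 2


def rcf_to_rpm(rcf, rotor_radius_cm):
    """Speed (rpm) needed to reach a given RCF on a rotor of this radius."""
    return (rcf / (1.118e-5 * rotor_radius_cm)) ** 0.5


if __name__ == "__main__":
    radius_cm = 8.0  # hypothetical rotor radius; check your centrifuge manual
    rcf = rpm_to_rcf(13000, radius_cm)
    print(f"13000 rpm on a {radius_cm} cm rotor is about {rcf:.0f} x g")
```

Reporting the force rather than the speed makes the step easier to reproduce on a different centrifuge.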

3.2 Preparing Library and Sequencing

1. Combine template DNA and fragmentation mix (FRM) in a 0.2 mL thin-walled PCR tube:
   – 3 μL template DNA (10 ng)
   – 1 μL FRM from the Rapid PCR Barcoding Kit
2. Mix gently by flicking the tube, and spin down. Incubate at 30 °C for 1 min using a thermal cycler and then at 80 °C for 1 min. Put the tube on ice to cool it down.
3. Prepare the PCR reaction by adding the following components to a 0.2 mL PCR tube:
   – 4 μL tagmented DNA from step 2
   – 1 μL barcode (RLB 01-12A, at 10 μM each) from the Rapid PCR Barcoding Kit
   – 20 μL nuclease-free water
   – 25 μL LongAmp Taq 2X Master Mix


4. Amplify using the following cycling conditions:
   – Initial denaturation: 3 min at 95 °C
   – Denaturation: 15 s at 95 °C (14 cycles)
   – Annealing: 15 s at 56 °C (14 cycles)
   – Extension: 6 min at 65 °C (14 cycles)
   – Final extension: 6 min at 65 °C
   – Hold at 4 °C
   If the initial amount of template DNA is less than 10 ng, increase the number of cycles to 18.
5. Transfer the amplified DNA samples to a clean 1.5 mL DNA LoBind tube.
6. Bring the AMPure XP beads to room temperature (RT) and resuspend them by vortexing before use. Add 30 μL of resuspended AMPure XP beads to the reaction and mix by pipetting gently.
7. Incubate on a rotator mixer for 5 min at RT.
8. Prepare 500 μL of fresh 70% ethanol in nuclease-free water.
9. Spin down the sample and place the tube on a magnet for 1 min. Keep the tube on the magnet and pipette off the supernatant.
10. Keep the tube on the magnet and wash the beads by adding 200 μL of 70% ethanol without disturbing the pellet. Remove the ethanol using a pipette and discard.
11. Repeat the previous step to wash again.
12. Spin down the sample and place the tube on the magnet for 1 min. Pipette off any residual ethanol. Dry for about 30 s, but do not dry the pellet to the point of cracking.
13. Remove the tube from the magnet and resuspend the pellet in 10 μL of 10 mM Tris–HCl pH 8.0 with 50 mM NaCl. Incubate for 2 min at RT.
14. Place the tube on the magnet until the eluate is clear and colorless. Transfer the 10 μL of eluate to a new 1.5 mL DNA LoBind tube.
15. Quantify 1 μL of the eluted sample using a Qubit fluorometer.
16. Pool all barcoded libraries into a single LoBind tube. The desirable total amount of library is 50–100 fmol in 10 μL.
17. Add 1 μL of Rapid Adapter (RAP) from the Rapid PCR Barcoding Kit, mix gently by flicking the tube, and spin down. Incubate for 5 min at RT.
18. Prime, load, and sequence the prepared library by following the instructions provided by the manufacturer.
19. Basecall the raw fast5 signal data to fastq format.
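Step 16 specifies the pooled library in fmol, whereas the Qubit reading in step 15 is a mass concentration. A minimal sketch of the conversion follows, assuming the common approximation of 660 g/mol per base pair of double-stranded DNA. The mean fragment length is not stated in this protocol and must be measured or estimated by the user; the barcode names, concentrations, and the 1000 bp length in the example are made up for illustration.

```python
AVG_MASS_PER_BP = 660.0  # g/mol per bp of dsDNA (standard approximation)


def ng_to_fmol(mass_ng, mean_length_bp):
    """Convert a dsDNA mass in ng to fmol for a given mean fragment length."""
    return mass_ng * 1e6 / (AVG_MASS_PER_BP * mean_length_bp)


def equimolar_volumes(conc_ng_per_ul, mean_length_bp, total_fmol):
    """Volume (uL) of each library so that all contribute equal fmol."""
    fmol_each = total_fmol / len(conc_ng_per_ul)
    volumes = {}
    for barcode, conc in conc_ng_per_ul.items():
        fmol_per_ul = ng_to_fmol(conc, mean_length_bp)
        volumes[barcode] = fmol_each / fmol_per_ul
    return volumes


if __name__ == "__main__":
    qubit = {"RLB01": 6.6, "RLB02": 13.2}  # ng/uL, made-up readings
    for barcode, vol in equimolar_volumes(qubit, 1000, 60).items():
        print(f"{barcode}: {vol:.2f} uL")
```

If the summed volume exceeds 10 μL, the libraries are too dilute for the target amount and a lower total within the 50–100 fmol range may be the practical choice.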


3.3 Computational Analysis


1. Install the required software (see Note 4). Before performing the following operations, download the source code from the official GitHub repositories of mlstverse:
# git clone https://github.com/ymatsumoto/mlstverse
# git clone https://github.com/ymatsumoto/mlstverse.Mycobacterium.db

Install the mlstverse packages from the local repositories via the R console:
# R
> if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
> BiocManager::install("Rsamtools", version = "3.8")
> install.packages("devtools")
> devtools::install("mlstverse")
> devtools::install("mlstverse.Mycobacterium.db")

2. Map the sequenced reads to the reference sequences in the MLST database. The reference sequences are available at https://github.com/ymatsumoto/mlstverse as Loci.fasta; this file is also included in the mlstverse repository cloned in step 1. The input to mlstverse must be a sorted and indexed BAM file:
# minimap2 -ax map-ont Loci.fasta input.fastq > input.sam
# samtools sort input.sam -o input.bam
# samtools index input.bam

3. Identify the species using mlstverse (see Note 5). Here, input.bam is the file generated in step 2, and the analysis results are stored in the variable result. The score table contains the statistics of the Kolmogorov–Smirnov test (see Note 6). See Matsumoto et al. (2019) [25] for the complete list of the 175 NTM species that mlstverse can identify.
# R
> library(mlstverse)
> library(mlstverse.Mycobacterium.db)
> filenames <- "input.bam"
> result <- mlstverse(filenames)
> print(result$score)
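Note 6 describes how the per-locus read depths are compared with a fitted negative binomial distribution. The sketch below illustrates that idea only and is not the mlstverse implementation. It assumes method-of-moments parameter estimates, computes just the KS distance (no p-value), and uses made-up depth values.

```python
import math
from statistics import mean, pvariance


def fit_negative_binomial(depths):
    """Method-of-moments estimate of (r, p); an assumption, not mlstverse's estimator."""
    m = float(mean(depths))
    v = float(pvariance(depths))
    if v <= m:
        raise ValueError("negative binomial fit needs variance > mean")
    return m * m / (v - m), m / v


def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial with size r and success probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)


def ks_distance(depths):
    """Largest gap between the empirical CDF and the fitted negative binomial CDF."""
    r, p = fit_negative_binomial(depths)
    ordered = sorted(depths)
    n = len(ordered)
    model_cdf = 0.0
    seen = 0
    distance = 0.0
    # Both CDFs are step functions that jump at integers, so checking
    # every integer up to the maximum depth finds the largest gap.
    for k in range(ordered[-1] + 1):
        model_cdf += nb_pmf(k, r, p)
        while seen < n and ordered[seen] <= k:
            seen += 1
        distance = max(distance, abs(seen / n - model_cdf))
    return distance


if __name__ == "__main__":
    depths = [2, 4, 4, 6, 10, 12, 3, 7]  # made-up read counts per locus
    print(f"KS distance: {ks_distance(depths):.3f}")
```

A large distance means the depth profile does not look like a single negative binomial, which is the situation Note 6 associates with mixed or copy-number-inflated input.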


Fig. 1 Workflow of the analysis pipeline. The mlstverse uses a fasta or fastq file as input and, using a given database, outputs an identification result as a table. (The figure is modified from Matsumoto et al. (2019) under the terms of CC-BY [25])

Fig. 2 Sensitivity and specificity of the mlstverse. (a) Species-level evaluation of sequence homologies of 16S rRNA genes obtained from SILVA [32] (left) and profile similarity in the mlstverse database (right). Species were ordered based on hierarchical-clustering results with complete linkage. The profile shown on the right was sorted in the same order. (b) Subspecies-level evaluation of M. abscessus profile similarities in the mlstverse database. The color bar under the dendrogram indicates subspecies information stored in the assembly metadata corresponding to subspecies not yet classified (grey), subsp. abscessus (red), subsp. bolletii (blue), or subsp. massiliense (yellow). (The figure is modified from Matsumoto et al. (2019) under the terms of CC-BY [25])

4 Notes

1. The sensitivity and specificity of the mlstverse were evaluated by comparison with identification based on 16S rRNA gene similarity (Fig. 2a). Using 16S rRNA gene similarity, only 11 of the 175 NTM species could be identified at the species level with over 98.7% identity. Moreover, none of the species within the Mycobacteroides, Mycolicibacillus, and Mycolicibacter subgenera could be distinguished (these genera have since been reintegrated into the genus Mycobacterium [26]). In contrast, the mlstverse identified all 175 species with an average MLST score difference of 0.98 and a minimum difference of 0.1. We also assessed the resolution for discriminating the subspecies of M. abscessus based on differences in the MLST scores. Figure 2b shows the correlations of the scores obtained by MLST using 184 genes among the M. abscessus subspecies. Significant differences in MLST scores were found between the profiles of all combinations of different subspecies (p < 10⁻¹⁵). Only the scores from the M. abscessus samples with subspecies labels were used in this test. Subspecies could be distinguished from each other by a mean MLST score difference of 0.38. The MLST analysis using 184 genes thus demonstrated clear discrimination among NTM species.
2. Although this protocol describes identification from bacterial isolates cultured on agar, incubating NTM on agar can take up to six weeks. The method can also be applied to MGIT cultures, which complete within two weeks. In this case, it is necessary to confirm that NTM has grown sufficiently for sequencing.
3. Because this protocol omits purification for simplicity, the purity of the DNA may not meet the criteria recommended by Oxford Nanopore Technologies (OD 260/280 of 1.8 and OD 260/230 of 2.0–2.2). In most cases sequencing is expected to succeed; if it fails, we recommend repeating the extraction with a smaller bacterial pellet.


4. We also provide a Docker image as an alternative way to install mlstverse. If Docker is available, you can deploy an execution environment by pulling the image from Docker Hub with the following command:
# docker pull ymts43/mlstverse

5. Here mlstverse is run with default parameters, but optional parameters can be adjusted for the read depth as well as the p-value and score thresholds. See https://github.com/ymatsumoto/mlstverse for the detailed list of these optional parameters.
6. The mlstverse performs the analysis to identify the NTM species according to the workflow shown in Fig. 1. The results contain the MLST score, which ranges between 0 and 1, and the statistics of a Kolmogorov–Smirnov (KS) test that checks the certainty of the score. The KS test compares a sample with a reference probability distribution [27]. Here, the mlstverse compares the observed depth distribution across the loci with a negative binomial distribution estimated from the observed data. The sequencing depth is represented as the number of reads mapped to a locus, which follows a Poisson process [28, 29], and the depth distribution among loci is well approximated by a negative binomial distribution [30, 31]. The mlstverse therefore estimates the parameters of a negative binomial distribution from the observed per-locus depths and compares the two distributions with the KS test. If the shape of the observed distribution deviates significantly from the estimated negative binomial distribution, the KS test returns a small p-value, suggesting that the input gDNA could be metagenomic or contain more copies than usual. The p-value obtained from the KS test can therefore be used as an indicator of the certainty of the obtained score.

References

1. Gardy JL, Loman NJ (2018) Towards a genomics-informed, real-time, global pathogen surveillance system. Nat Rev Genet 19:9–20
2. Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333–351
3. Plesivkova D, Richards R, Harbison S (2019) A review of the potential of the MinION™ single-molecule sequencing system for forensic applications. WIREs Forensic Sci 1:e1323
4. Henkle E, Hedberg K, Schafer S et al (2015) Population-based incidence of pulmonary nontuberculous mycobacterial disease in Oregon 2007 to 2012. Ann Am Thorac Soc 12:642–647
5. Lee H, Myung W, Koh W-J et al (2019) Epidemiology of nontuberculous mycobacterial infection, South Korea, 2007–2016. Emerg Infect Dis 25:569–572
6. Namkoong H, Kurashima A, Morimoto K et al (2016) Epidemiology of pulmonary nontuberculous mycobacterial disease, Japan. Emerg Infect Dis 22:1116–1117
7. Hoefsloot W, van Ingen J, Andrejak C et al (2013) The geographic diversity of nontuberculous mycobacteria isolated from pulmonary samples: an NTM-NET collaborative study. Eur Respir J 42:1604–1613
8. Smith GS, Ghio AJ, Stout JE et al (2016) Epidemiology of nontuberculous mycobacteria isolations among central North Carolina residents, 2006–2010. J Infect 72:678–686
9. Dahl JL, Iii GW, Tran PM et al (2019) Mycolicibacterium nivoides sp. nov isolated from a peat bog. Int J Syst Evol Microbiol 71:004438
10. Nouioui I, Sangal V, Cortés-Albayay C et al (2019) Mycolicibacterium stellerae sp. nov., a rapidly growing scotochromogenic strain isolated from Stellera chamaejasme. Int J Syst Evol Microbiol 69:3465–3471
11. Abe Y, Fukushima K, Matsumoto Y et al (2022) Mycobacterium senriense sp. nov., a slowly growing, non-scotochromogenic species, isolated from sputum of an elderly man. Int J Syst Evol Microbiol 72:005378
12. Liu G, Yu X, Luo J et al (2021) Mycobacterium vicinigordonae sp. nov., a slow-growing scotochromogenic species isolated from sputum. Int J Syst Evol Microbiol 71:004796
13. Ghielmetti G, Rosato G, Trovato A et al (2021) Mycobacterium helveticum sp. nov., a novel slowly growing mycobacterial species associated with granulomatous lesions in adult swine. Int J Syst Evol Microbiol 71:004615
14. Cheng Y, Lei W, Wang X et al (2021) Mycolicibacterium baixiangningiae sp. nov. and Mycolicibacterium mengxianglii sp. nov., two new rapidly growing mycobacterial species. Int J Syst Evol Microbiol 71:005019
15. Prevots DR, Marras TK (2015) Epidemiology of human pulmonary infection with nontuberculous mycobacteria: a review. Clin Chest Med 36:13–34
16. Griffith DE, Aksamit T, Brown-Elliott BA et al (2007) An official ATS/IDSA statement: diagnosis, treatment, and prevention of nontuberculous mycobacterial diseases. https://doi.org/10.1164/rccm.200604-571st
17. Philley JV, Griffith DE (2019) Medical management of pulmonary nontuberculous mycobacterial disease. Thorac Surg Clin 29:65–76
18. Uchiya K-I, Asahi S, Futamura K et al (2018) Antibiotic susceptibility and genotyping of Mycobacterium avium strains that cause pulmonary and disseminated infection. Antimicrob Agents Chemother 62:e02035–e02017
19. Maurer FP, Castelberg C, Quiblier C et al (2014) Erm(41)-dependent inducible resistance to azithromycin and clarithromycin in clinical isolates of Mycobacterium abscessus. J Antimicrob Chemother 69:1559–1563
20. Griffith DE, Brown-Elliott BA, Benwill JL et al (2015) Mycobacterium abscessus. “Pleased to Meet You, Hope You Guess My Name...”. Ann Am Thorac Soc 12:436–439
21. Claydon MA, Davey SN, Edwards-Jones V et al (1996) The rapid identification of intact microorganisms using mass spectrometry. Nat Biotechnol 14:1584–1586
22. Jannetto PJ, Fitzgerald RL (2016) Effective use of mass spectrometry in the clinical laboratory. Clin Chem 62:92–98
23. Enright MC, Spratt BG (1999) Multilocus sequence typing. Trends Microbiol 7:482–487
24. Stackebrandt E, Goebel BM (1994) Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Evol Microbiol 44:846–849
25. Matsumoto Y, Kinjo T, Motooka D et al (2019) Comprehensive subspecies identification of 175 nontuberculous mycobacteria species based on 7547 genomic profiles. Emerg Microbes Infect 8:1043–1053
26. Oren A, Garrity GM (2019) Notification of changes in taxonomic opinion previously published outside the IJSEM. Int J Syst Evol Microbiol 69:13–32
27. Conover WJ (1999) Practical nonparametric statistics. Wiley, New York
28. Lander ES, Waterman MS (1988) Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 2:231–239
29. Wendl MC (2006) Occupancy modeling of coverage distribution for whole genome shotgun DNA sequencing. Bull Math Biol 68:179–196
30. de Torrenté L, Zimmerman S, Suzuki M et al (2020) The shape of gene expression distributions matter: how incorporating distribution shape improves the interpretation of cancer transcriptomic data. BMC Bioinformatics 21:562
31. Hooper SD, Dalevi D, Pati A et al (2010) Estimating DNA coverage and abundance in metagenomes using a gamma approximation. Bioinformatics 26:295–301
32. Quast C, Pruesse E, Yilmaz P et al (2013) The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 41:D590–D596

Part IV Nanopore Sequencing for Transcriptomics and Beyond

Chapter 18

Long-Read Single-Cell Sequencing Using scCOLOR-seq

Martin Philpott, Udo Oppermann, and Adam P. Cribbs

Abstract

Single-cell sequencing allows for the measurement of sequence information from individual cells with next-generation sequencing (NGS). However, its application to third-generation sequencing platforms such as Oxford Nanopore has been challenging because of their lower basecalling accuracy. Here we describe a method to perform highly accurate single-cell COrrected Long-Read sequencing (scCOLOR-seq) by droplet-based encapsulation of cells and sequencing using the Oxford Nanopore Sequencing system.

Key words Oxford Nanopore Sequencing, Single-cell, Long-read

1 Introduction

Single-cell sequencing is revolutionizing the way that scientists investigate biology. It provides higher resolution of cellular differences and a better understanding of cellular function in the context of the underlying microenvironment [1]. However, current droplet-based short-read sequencing methods only allow for the measurement of gene expression at the proximal ends of a transcript [2–4]. Long-read sequencing platforms, such as Oxford Nanopore, provide full-length sequencing of mRNA, therefore allowing examination of RNA splicing events, single-nucleotide polymorphisms, structural variation, and translocations [5]. However, despite the superior read length of ONT sequencing over short-read Illumina sequencing, its lower basecalling accuracy makes single-cell applications challenging, because they require highly accurate molecular barcode assignment [6]. Here we describe scCOLOR-seq, a method in which molecular barcodes are synthesized using dimer or trimer nucleotides that allow error detection [7]. This yields highly accurate cell assignment and PCR deduplication, which makes long-read single-cell sequencing practical.
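The error-detection idea can be illustrated with a short sketch. It assumes that each barcode position is synthesized as a block of two (dimer) or three (trimer) identical nucleotides, so a block whose bases disagree signals a sequencing error. The function names and the majority-vote rule for trimers are illustrative assumptions, not the published TallyNN implementation.

```python
from collections import Counter


def check_dimer_barcode(seq):
    """Collapse a dimer-encoded barcode; return (barcode, ok).

    ok is False when the length is odd or any two-base block disagrees.
    """
    if len(seq) % 2:
        return None, False
    collapsed = []
    for i in range(0, len(seq), 2):
        first, second = seq[i], seq[i + 1]
        if first != second:
            return None, False  # error detected, but it cannot be corrected
        collapsed.append(first)
    return "".join(collapsed), True


def correct_trimer_umi(seq):
    """Collapse a trimer-encoded UMI by majority vote within each block.

    Blocks with no majority are reported as 'N'; trailing bases that do not
    fill a complete block are ignored.
    """
    bases = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        base, count = Counter(seq[i:i + 3]).most_common(1)[0]
        bases.append(base if count >= 2 else "N")
    return "".join(bases)


if __name__ == "__main__":
    print(check_dimer_barcode("AAGGTTCC"))
    print(correct_trimer_umi("AAAGTGCCC"))
```

Under this assumption a dimer block can only flag an error, whereas a trimer block can also correct a single miscalled base.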

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_18, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


2 Materials

Prepare all solutions using ultrapure water and analytical grade reagents. Prepare all reagents at room temperature and then follow the appropriate storage conditions.

2.1 Oligonucleotides

1. Template Switch Oligonucleotide (TSO): 5′ AAGCAGTGGTATCAACGCAGAGTGAATrGrGrG.
2. SMART PCR Primer: 5′ AAGCAGTGGTATCAACGCAGAGT.

2.2 Oligonucleotide Bead Design

1. Synthesize oligonucleotide capture beads using reverse direction amidites. Synthesize the barcode and UMI sequences using either dimer (original design) or trimer (newer design) reverse amidites (ATDBio, Oxford, UK).

2.3 Cell Encapsulation

1. Lysis buffer: 6% w/v Ficoll PM-400, 0.2% Sarkosyl, 20 mM EDTA, 200 mM Tris–HCl pH 7.5. Store at -20 °C in aliquots and add DTT to 50 mM after thawing.
2. Cell buffer: 1X phosphate-buffered saline, 0.01% filtered bovine serum albumin. Store at -20 °C in aliquots.
3. Phosphate-buffered saline (PBS).
4. Jurkat cells.
5. NIH3T3 cells.
6. TrypLE (if working with adherent cells).
7. Flowmi 40-μm cell strainer (Merck).
8. Gel loading tips.
9. Nadia Innovate (DolomiteBio).
10. Nadia Innovate Chip (DolomiteBio).

2.4 Emulsion Breaking and Reverse Transcription

1. Perfluorooctanol (PFO)
2. Reverse transcriptase reaction buffer: 75 μL H2O, 40 μL 20% Ficoll PM-400, 40 μL 5X RT buffer (supplied with RT enzyme), 10 μL 50 μM TSO, 20 μL dNTPs, 5 μL NXGen RNase Inhibitor (Lucigen), 10 μL 10X Maxima H-enzyme (Thermo Fisher Scientific)
3. 6X saline-sodium citrate (SSC)
4. TE/SDS buffer: 10 mM Tris–HCl, 0.1 mM EDTA, 0.5% SDS
5. TE/TW buffer: 10 mM Tris–HCl, 0.1 mM EDTA, 0.01% Tween 20
6. 10 mM Tris–HCl pH 8.0


2.5 Exonuclease Digestion and SMART PCR


1. Exonuclease reaction buffer: 20 μL 10X Exo I buffer, 170 μL H2O, 10 μL Exonuclease I
2. 2X KAPA HiFi HotStart Ready Mix (Roche)
3. AMPure XP or SPRI beads (Beckman Coulter)
4. High Sensitivity D5000 ScreenTape (Agilent Technologies)
5. TapeStation system (Agilent Technologies)

2.6 Nanopore Library Preparation and Sequencing

1. 2X KAPA HiFi HotStart Ready Mix (Roche)
2. Ligation sequencing kit (Oxford Nanopore)
3. PromethION Flow Cell (Oxford Nanopore)

3 Methods

3.1 Cell Encapsulation

Always work with cells on ice.

1. Trypsinize adherent NIH3T3 cells for 5 min using TrypLE and then spin down the cells at 300 × g for 5 min.
2. Spin down Jurkat cells at 300 × g for 5 min.
3. Resuspend the cells in 1 mL of 1X PBS and then spin down at 300 × g for 3 min.
4. Count the cells on a hemocytometer or an automatic cell counter.
5. Spin down the cells at 300 × g for 3 min and then resuspend in ice-cold cell buffer at a concentration of 310 cells/μL (see Note 1).
6. Mix Jurkat and NIH3T3 cells at a 50:50 ratio so that there is a total of 77,500 cells in 250 μL of cell buffer. Store on ice until use.
7. Prepare 1 mL of lysis buffer, add 50 μL of 1 M DTT, and place on ice until needed.
8. Transfer 155,000 scCOLOR-seq beads into a 1.5 mL tube and then spin down at 1000 × g for 1 min. Discard the supernatant and resuspend in 250 μL of lysis buffer.
9. Load the cells and beads into the Nadia encapsulator instrument according to the manufacturer’s instructions (see Note 2).
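The numbers in steps 5 to 8 can be cross-checked with a few lines of arithmetic. This sketch only restates values from the steps above (310 cells/μL, 250 μL, 155,000 beads, 50 μL of 1 M DTT in 1 mL of buffer); the helper names are made up for illustration.

```python
def volume_for_cells(n_cells, cells_per_ul):
    """Volume (uL) that contains n_cells at the given concentration."""
    return n_cells / cells_per_ul


def final_conc_mM(stock_mM, stock_ul, diluent_ul):
    """Concentration after adding stock_ul of stock to diluent_ul of buffer."""
    return stock_mM * stock_ul / (stock_ul + diluent_ul)


if __name__ == "__main__":
    total_cells = 310 * 250             # cells loaded in 250 uL
    per_cell_line = total_cells / 2     # 50:50 Jurkat and NIH3T3
    print(f"total cells: {total_cells}")
    print(f"volume per cell line: {volume_for_cells(per_cell_line, 310):.0f} uL")
    print(f"beads per uL: {155000 / 250:.0f}")
    print(f"DTT after addition: {final_conc_mM(1000, 50, 1000):.1f} mM")
```

The DTT figure comes out slightly below the nominal 50 mM because the added 50 μL also increases the total volume.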

3.2 Emulsion Breaking

It is important to use low-bind pipette tips and Eppendorf DNA Lo-Bind tubes for all washing steps. All wash steps are performed at 1000 × g for 1 min.

1. Prepare 30 mL of 6X SSC in a 50 mL Falcon tube. Gently add the emulsion to the tube.
2. Add 1 mL of PFO and shake vigorously up and down 3–4 times to break the emulsion.


3. Centrifuge the tube at 1000 × g for 1 min.
4. Transfer the upper SSC layer, which may contain residual beads, to a new 50 mL Falcon tube.
5. Refloat the beads sitting on the oil interface by rapidly adding 30 mL of fresh 6X SSC to the original 50 mL Falcon tube.
6. Once the oil has fully settled at the bottom of the tube, but before the beads have time to settle out of the aqueous phase, remove the bead/SSC suspension and add it to a new 50 mL Falcon tube. Avoid transferring any oil to the new tube (see Note 3).
7. Centrifuge the tubes from steps 4 and 6 at 1000 × g for 1 min.
8. The beads are now pelleted at the bottom of the Falcon tubes and should be visible.
9. Carefully remove the supernatant, but leave approximately 0.5 mL.
10. Resuspend the remaining beads by gentle pipetting and pool the corresponding beads from steps 4 and 6 by transferring them to a 1.5 mL Eppendorf Lo-Bind tube (see Note 4).
11. Centrifuge at 1000 × g for 1 min.
12. Remove the supernatant and wash the beads in 1 mL of 6X SSC.
13. Centrifuge at 1000 × g for 1 min.

3.3 Reverse Transcription

1. Remove the supernatant from the previous step and wash the beads in 300 μL of 5X RT buffer.
2. Centrifuge at 1000 × g for 1 min.
3. Prepare the RT mix according to Table 1.
4. Remove the supernatant from the beads and add 200 μL of RT mix.
5. Incubate at room temperature for 30 min with rotation, followed by 90 min at 42 °C with rotation.
6. Centrifuge at 1000 × g for 1 min. Remove the supernatant, add 1 mL of TE/SDS, and mix by briefly pipetting up and down.
7. Centrifuge at 1000 × g for 1 min. Remove the supernatant, add 1 mL of TE/TW, and mix by briefly pipetting up and down.
8. Repeat step 7 once.
9. Centrifuge at 1000 × g for 1 min. Remove the supernatant, add 1 mL of 10 mM Tris–HCl pH 8.0, and mix by briefly pipetting up and down.

Table 1 RT mix composition

Volume (μL)    Reagent
75             H2O
40             20% Ficoll PM-400
40             5X RT buffer
10             50 μM TSO
20             dNTPs
5              RNase Inhibitor
10             Reverse transcriptase

3.4 Exonuclease Digestion

1. Prepare the exonuclease mix according to Table 2.
2. Remove the supernatant from the beads and add 200 μL of exonuclease mix.
3. Incubate at 37 °C for 45 min with rotation.
4. Centrifuge at 1000 × g for 1 min. Remove the supernatant, add 1 mL of TE/SDS, and mix by briefly pipetting up and down.
5. Centrifuge at 1000 × g for 1 min. Remove the supernatant, add 1 mL of TE/TW, and mix by briefly pipetting up and down.
6. Repeat step 5 once.
7. Centrifuge at 1000 × g for 1 min. Remove the supernatant and add 1 mL of H2O.
8. Mix the beads well and count them using a Fuchs-Rosenthal hemocytometer to determine the bead number per mL.

Table 2 Exonuclease mix composition

Volume (μL)    Reagent
20             10X Exo I buffer
170            H2O
10             Exo I
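When several samples are processed in parallel, the RT and exonuclease mixes of Tables 1 and 2 are typically prepared as master mixes. The sketch below scales the per-sample volumes; the 10% pipetting overage is an assumption made for illustration and is not specified in this protocol.

```python
RT_MIX_UL = {
    "H2O": 75,
    "20% Ficoll PM-400": 40,
    "5X RT buffer": 40,
    "50 uM TSO": 10,
    "dNTPs": 20,
    "RNase inhibitor": 5,
    "Reverse transcriptase": 10,
}

EXO_MIX_UL = {
    "10X Exo I buffer": 20,
    "H2O": 170,
    "Exonuclease I": 10,
}


def scale_mix(recipe_ul, n_samples, overage=0.1):
    """Scale per-sample volumes (uL) to n_samples plus a fractional overage."""
    factor = n_samples * (1 + overage)
    return {reagent: round(vol * factor, 1) for reagent, vol in recipe_ul.items()}


if __name__ == "__main__":
    # Each recipe should total the 200 uL that is added to the beads.
    print(sum(RT_MIX_UL.values()), sum(EXO_MIX_UL.values()))
    for reagent, vol in scale_mix(RT_MIX_UL, n_samples=4).items():
        print(f"{reagent}: {vol} uL")
```

Summing the recipe first is a quick sanity check that the table was transcribed correctly before any reagent is pipetted.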

3.5 SMART PCR

1. Ensure the beads are resuspended and split them into multiple PCR tubes (or wells of a 96-well PCR plate) by transferring a volume equivalent to 2000 beads into each tube; each tube will yield about 100 single cells (see Note 5).
2. Spin down the tubes, remove the supernatants, and add the PCR components listed in Table 3. If processing all the beads in a sample, it may be more convenient in step 1 to centrifuge the beads at 1000 × g for 1 min before splitting, remove the supernatant, add 50 μL of PCR master mix (Table 3) for every 2000 beads, resuspend, and transfer 50 μL aliquots into multiple PCR tubes (or wells of a 96-well PCR plate).
3. Mix well and proceed to PCR using the cycling conditions in Table 4.
4. For each sample split in step 1, pool 10 μL of each individual PCR reaction into a 1.5 mL centrifuge tube.
5. Centrifuge at 1000 × g for 1 min and transfer the supernatant to a fresh tube.
6. Add AMPure XP beads to each tube at 0.6X the sample volume (a 0.6:1 ratio) (see Note 6).
7. Mix well and incubate at room temperature for 5 min.
8. Centrifuge briefly, place the tube in a magnetic rack, and wait until the solution clears completely.
9. Remove the supernatant and add 200 μL of 80% EtOH while the tube is still on the magnetic rack.
10. After 30 s, remove the supernatant and add 200 μL of 80% EtOH while the tube is still on the magnetic rack.
11. After 30 s, remove the supernatant. Centrifuge the tube briefly, replace it on the magnetic rack, and remove the residual EtOH.
12. Air-dry for 3 min at room temperature.
13. Remove the tube from the magnetic rack and resuspend the beads in 50 μL of H2O.
14. Incubate for 2 min at room temperature.
15. Centrifuge the tube briefly and place it in the magnetic rack. Transfer the supernatant to a fresh PCR tube.
16. Evaluate the eluted cDNA using a High Sensitivity D5000 ScreenTape on a TapeStation system. The library should have an average size of 1000–2000 bp. Depending on the sample, the product could have either a smooth or an uneven profile (Figs. 1 and 2).

Table 3 SMART PCR mix composition

Volume (μL)    Reagent
24.6           H2O
0.4            100 μM SMART PCR primer
25             2X KAPA HiFi HotStart master mix

Table 4 SMART PCR cycling conditions

Cycles    Temp (°C)    Time
1         95           3 min
4         98           20 s
          65           45 s
          72           3 min
20        98           20 s
          67           20 s
          72           3 min
1         72           5 min
          4            Hold

Fig. 1 A schematic of the mRNA capture bead design showing the region of dimer and trimer nucleotides used to make up the barcode and UMI region

Fig. 2 TapeStation D5000 high sensitivity trace showing a post PCR-amplified product from NIH3T3:Jurkat cells
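The bead split in step 1 and the bead cleanup in step 6 both reduce to simple arithmetic. The following sketch works through them, assuming a hypothetical bead count of 40,000 beads/mL in 1 mL; the 2000 beads per tube, roughly 100 cells per tube, 10 μL pooled per reaction, and the 0.6X ratio are taken from the steps above.

```python
import math


def smart_pcr_plan(beads_per_ml, volume_ml, beads_per_tube=2000, cells_per_tube=100):
    """Number of PCR tubes, bead volume per tube (uL), and rough cell yield."""
    total_beads = beads_per_ml * volume_ml
    n_tubes = math.floor(total_beads / beads_per_tube)
    return {
        "n_tubes": n_tubes,
        "bead_volume_ul": beads_per_tube / beads_per_ml * 1000,
        "approx_cells": n_tubes * cells_per_tube,
    }


def ampure_volume_ul(sample_volume_ul, ratio=0.6):
    """Bead volume for a cleanup at the given bead-to-sample ratio."""
    return sample_volume_ul * ratio


if __name__ == "__main__":
    plan = smart_pcr_plan(beads_per_ml=40000, volume_ml=1.0)  # hypothetical count
    pooled_ul = 10 * plan["n_tubes"]  # 10 uL pooled from each reaction
    print(plan)
    print(f"AMPure XP for {pooled_ul} uL pool: {ampure_volume_ul(pooled_ul):.0f} uL")
```

Passing `ratio=0.8` gives the volume for the alternative cleanup mentioned in Note 6.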


3.6 Nanopore Library Preparation

1. Use 1 μg of SMART PCR product (cDNA) as input for the Oxford Nanopore Ligation sequencing kit; as little as 200 ng has been shown to generate a sequenceable library.
2. Perform the library preparation according to the manufacturer’s protocol.

3.7 Sequencing

1. Sequence the library using a PromethION sequencing flow cell according to the manufacturer’s protocol. The run time is 48 h, and each library is expected to return between 50 and 250 million reads.
2. Perform basecalling using Guppy in GPU mode. For example, high-accuracy basecalling for a run on an R9.4.1 flow cell prepared with the SQK-LSK110 kit uses “guppy_basecaller --compress_fastq -c dna_r9.4.1_450bps_hac.cfg -x 'cuda:1'”.
3. Perform the data analysis using the TallyNN pipeline (https://github.com/Acribbs/TallyNN).
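The expected read yield can be turned into a rough per-cell sequencing depth when planning a run. This is a back-of-the-envelope sketch: it uses the 50 to 250 million reads quoted in step 1 and treats the number of cells as a user-supplied estimate (for example, about 100 cells per 2000-bead PCR reaction). It ignores reads lost to barcode filtering, so real values will be lower.

```python
def reads_per_cell_range(n_cells, min_reads=50e6, max_reads=250e6):
    """Lower and upper bound on raw reads per cell for one library."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return min_reads / n_cells, max_reads / n_cells


if __name__ == "__main__":
    low, high = reads_per_cell_range(2000)  # hypothetical cell number
    print(f"about {low:,.0f} to {high:,.0f} raw reads per cell")
```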

4 Notes

1. If the cells appear to be clumping, filter them through a 30- or 40-μm cell strainer prior to counting.
2. Prepare the beads and cells before starting the Nadia instrument so that the reagents can be loaded efficiently. Working quickly prevents the reagents from clumping and causing blockages.
3. Waiting too long after the oil has settled before removing the bead-containing supernatant can reduce bead recovery.
4. It is important to use Lo-Bind tubes to increase the recovery of your cDNA.
5. We do not see any negative effects when increasing the number of beads to 4000 per well, which saves PCR reagents.
6. For certain cell types where shorter mRNAs are produced, a 0.8X AMPure XP bead cleanup has been shown to retain more cDNA.

Acknowledgments

Research support was obtained from Innovate UK, the National Institute for Health Research Oxford Biomedical Research Unit (to U.O.), Cancer Research UK (CRUK) (to U.O.), the Bone Cancer Research Trust (BCRT) (to A.P.C. and U.O.), the Leducq Epigenetics of Atherosclerosis Network (LEAN) program grant from the Leducq Foundation (to U.O.), the Chan Zuckerberg Initiative (to A.P.C.), the Myeloma Single Cell Consortium (to U.O.), and a Medical Research Council Career Development Fellowship (MR/V010182/1; to A.P.C.).

References

1. Eberwine J, Sul JY, Bartfai T, Kim J (2014) The promise of single-cell sequencing. Nat Methods 11:25–27
2. Klein AM, Mazutis L, Ilke A et al (2015) Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161:1187–1201
3. Macosko EZ, Basu A, Satija R et al (2015) Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161:1202–1214
4. Zheng GX, Terry JM, Belgrader P et al (2017) Massively parallel digital transcriptional profiling of single cells. Nat Commun 8:14049
5. Weirather JL, de Cesare M, Wang Y et al (2017) Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis. F1000Res 6:100
6. Rang FJ, Kloosterman WP, de Ridder J (2018) From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol 19:90
7. Philpott M, Watson J, Thakurta A et al (2021) Nanopore sequencing of single-cell transcriptomes with scCOLOR-seq. Nat Biotechnol 39:1517–1520

Chapter 19

Unfolding the Bacterial Transcriptome Landscape Using Oxford Nanopore Technology Direct RNA Sequencing

Mohamad Al Kadi and Daisuke Okuzaki

Abstract

Current genome annotation ignores important features of the transcriptome, such as untranslated regions and operon maps. RNA sequencing (RNA-seq) helps in identifying such features; however, the fragmentation step of classical RNA-seq makes this task challenging. Long-read sequencing methods, such as that of Oxford Nanopore Technologies (ONT), enable the sequencing of intact RNA molecules. Here, we present a method to annotate the full features of bacterial transcriptomes by combining a modified ONT direct RNA-seq method with our computational pipeline, UNAGI bacteria. The method reveals the full complexity of the bacterial transcriptome landscape, including transcription start sites, transcription termination sites, operon maps, and novel genes.

Key words RNA-seq, Long-read sequencing, Transcription, ONT, UNAGI bacteria

1

Introduction Genome annotation is mainly a computational process to identify genes and annotate their functions for a given genome. It includes homology-based and ab initio approaches focusing on proteincoding genes. Nonetheless, some challenges persist [1]. While most typical protein-coding genes are detected, there are still some false positives and false negatives [2]. Notably, these include protein-coding genes that do not fit the protein models and noncoding genes that lack translational features, i.e., open reading frames [3]. In addition, with its focus on open reading frames, genome annotation ignores other important features of the bacterial transcriptome, such as untranslated regions (UTRs) and operon maps. UTRs are crucial elements for gene regulation and a source of small RNAs [4]. Operon maps are essential for understanding the functional relationships between genes. Moreover, the prokaryotic transcriptome has shown more complexity than what was thought before [5, 6]. Antisense transcription is widespread in

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_19, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


prokaryotes [7], and the traditional idea of operons is changing. Many genes are transcribed alternatively in different operon forms, suggesting more complexity in gene regulation. Various conditions and signals can induce the expression of alternative operons with some similarity to alternative splicing in eukaryotes [6]. Many of these discoveries were revealed in the past decade owing to RNA sequencing (RNA-seq) technology. RNA-seq is a revolutionary tool that has been combined with computational approaches to enhance gene discovery and transcriptome annotation [8]. The standard RNA-seq protocol includes RNA fragmentation and reverse transcription into cDNAs. The generated sequences or reads are mapped to the genome in silico to measure gene expression. However, RNA fragmentation leads to information loss. This loss is critical in the case of compact genomes, such as bacterial genomes, where it is hard to distinguish between overlapping transcriptomic features because fragmented reads cannot be attributed to their original features. Similarly, operons still remain predicted and not directly detected. In recent years, a new technology in the form of Oxford Nanopore Technologies (ONT) sequencing has emerged. An electric current is applied across an electro-resistant membrane; a DNA/RNA strand passing through a nanopore in the membrane causes a change in the current based on the passing nitrogenous bases. The whole molecule is sequenced without fragmentation, and the generated sequences are called long reads to distinguish them from the short reads of classical RNA-seq. One of the latest methods of this technology is direct RNA sequencing [9]. It sequences the RNA directly without reverse transcription. It has the potential to reveal the aforementioned complexity of bacterial transcriptomes and annotate its full features. Nevertheless, the direct RNA-seq kit and computational tools are designed mainly for eukaryotic RNA. 
Many challenges arise in the case of bacteria, such as the lack of poly(A) tail in bacterial mRNA and the focus of most computational tools on eukaryotic transcriptomes without considering bacterial features such as operons. Here, we present a method (Fig. 1) to annotate the full features of bacterial transcriptomes by combining a slightly modified ONT direct RNA-seq method with a computational pipeline, UNAGI bacteria, designed to process the generated long reads and identify the elements of bacterial transcriptomes. These include transcription start sites (TSSs), transcription termination sites (TTSs), operons, and the discovery of novel genes focusing on bacterial antisense and intergenic RNAs.


Fig. 1 Mapping the full features of the bacterial transcriptome using ONT direct RNA sequencing and the UNAGI bacteria pipeline

2 Materials

2.1 Sequencing Bacterial RNA

1. NEBNext rRNA Depletion Kit (bacteria) (New England Biolabs)
2. Escherichia coli poly(A) polymerase (New England Biolabs)
3. 10× E. coli poly(A) polymerase reaction buffer (New England Biolabs)
4. 10 mM ATP
5. Direct RNA Sequencing Kit (Oxford Nanopore Technologies)
6. Flow Cell Priming Kit (Oxford Nanopore Technologies)
7. NEBNext Quick Ligation Reaction Buffer (New England Biolabs)
8. SuperScript III reverse transcriptase
9. 2 M U/mL T4 DNA Ligase
10. 10 mM dNTP
11. Nuclease-free water
12. Ethanol
13. Agencourt RNAClean XP beads
14. Qubit RNA HS Assay Kit
15. Qubit DNA HS Assay Kit

2.2 Software

1. MinKNOW
2. Guppy
3. Python 3.7
4. Minimap2
5. Samtools
6. Bedtools

2.3 System Requirement

The bioinformatics work was performed on Ubuntu 20.04.2 LTS (GNU/Linux 4.4.0-19041-Microsoft x86_64).

3 Methods

3.1 Ribosomal RNA (rRNA) Depletion

The rRNA depletion kit is used to deplete rRNA from total RNA samples. In this protocol, we aim to obtain approximately 500 ng of rRNA-depleted RNA, although the final yield can be as little as 50 ng. We assume that mRNA accounts for approximately 5% of total bacterial RNA; therefore, the input for this step should be (target amount of rRNA-depleted RNA) × 20, i.e., 500 ng × 20 = 10,000 ng (10 μg) of total RNA in a maximum of 11 μL of nuclease-free water. The following steps follow the manufacturer’s instructions (see Note 1).
1. In a 0.2 mL PCR tube on ice, add 2 μL of NEBNext bacterial rRNA depletion solution and 2 μL of NEBNext probe hybridization buffer to 11 μL of the total RNA; mix well by flicking (see Note 2).
2. Incubate in a thermal cycler at 95 °C for 2 min and ramp down to 22 °C at a rate of 0.1 °C/s. Hold at 22 °C for 5 min.
3. Place the tube on ice immediately and add 2 μL of RNase H reaction buffer, 2 μL of NEBNext thermostable RNase H, and 1 μL of nuclease-free water to the reaction; mix well by flicking.
4. Incubate in a thermal cycler at 50 °C for 30 min (with the lid temperature set to 55 °C).
5. Place the tube on ice immediately and add 5 μL of DNase I reaction buffer, 2.5 μL of NEBNext DNase I (RNase-free), and 22.5 μL of nuclease-free water to the reaction; mix well by flicking.
6. Incubate in a thermal cycler for 30 min at 37 °C (with the lid set to 40 °C or off); meanwhile, prepare the Agencourt RNAClean XP beads by vortexing them thoroughly, and prepare fresh 80% ethanol.
7. Add 90 μL (1.8×) of beads to the sample in a 1.5 mL tube, mix well by flicking, and incubate for 15 min on ice.

8. Place the tube on a magnetic rack and incubate for approximately 2 min until the solution is clear. Discard the supernatant without disrupting the beads.
9. Add 200 μL of fresh 80% ethanol, incubate for 30 s, and remove the supernatant.
10. Repeat steps 8–9.
11. Remove residual ethanol and air-dry for a maximum of 5 min to avoid overdrying the beads (see Note 3).
12. Remove the tube from the rack and add 7 μL of nuclease-free water. Mix well by flicking and incubate at room temperature for 2 min. Place the tube back on the rack until the solution is clear, and transfer 5.5 μL of the supernatant to a 0.2 mL PCR tube.
13. Dilute 0.5 μL of the rRNA-depleted RNA tenfold and use the diluted sample for quantitative and qualitative analyses using a Qubit assay and a Bioanalyzer, respectively.

3.2 Polyadenylation

RNA must have a poly(A) tail at the 3′ end to be sequenced by direct RNA-seq. At this step, E. coli poly(A) polymerase adds a poly(A) tail to the rRNA-depleted RNA.
1. To the rRNA-depleted RNA, add 10 μL of nuclease-free water, 2 μL of 10× E. coli poly(A) polymerase reaction buffer, 2 μL of ATP (10 mM), and 1 μL of E. coli poly(A) polymerase.
2. Incubate at 37 °C for 30 min.
3. Transfer the sample to a 1.5-mL tube, add 20 μL (1×) of Agencourt RNAClean XP beads directly, and mix well by flicking. Incubate at room temperature for 10 min on a rotator mixer. Place on a magnetic rack for 2 min until a pellet is formed and the solution is clear. Wash twice with 150 μL of freshly prepared 70% ethanol without disrupting the beads. Air-dry for 30 s and resuspend in 9.5 μL of nuclease-free water. Incubate for 5 min and transfer the supernatant to a 1.5-mL Lo-Bind tube. Dilute 0.5 μL of the poly(A) RNA tenfold and use it to quantify the RNA with a Qubit assay (see Note 4).
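The quantities in Subheadings 3.1 and 3.2 rest on two simple calculations: the total RNA input needed for a target amount of rRNA-depleted RNA, and the stock concentration back-calculated from a tenfold-diluted Qubit aliquot. The following Python sketch illustrates this arithmetic. The function names and the example Qubit reading are hypothetical and are not part of any kit protocol or of UNAGI.

```python
def total_rna_input_ng(target_depleted_ng, mrna_fraction=0.05):
    """Total RNA (ng) needed to reach a target amount of rRNA-depleted RNA.

    Assumes, as in Subheading 3.1, that mRNA is about 5% of total RNA.
    """
    if not 0 < mrna_fraction <= 1:
        raise ValueError("mrna_fraction must be in (0, 1]")
    return target_depleted_ng / mrna_fraction


def stock_concentration(reading_ng_per_ul, dilution_factor=10):
    """Concentration of the undiluted sample from a diluted Qubit reading."""
    return reading_ng_per_ul * dilution_factor


def remaining_mass_ng(stock_ng_per_ul, eluate_ul, aliquot_ul=0.5):
    """RNA mass left after removing the aliquot used for quantification."""
    return stock_ng_per_ul * (eluate_ul - aliquot_ul)


if __name__ == "__main__":
    # 500 ng target at 5% mRNA corresponds to 10 ug of total RNA.
    print(f"Total RNA input: {total_rna_input_ng(500):.0f} ng")

    # Hypothetical Qubit reading of 4.0 ng/uL on the tenfold dilution of
    # poly(A)-tailed RNA eluted in 9.5 uL (Subheading 3.2, step 3).
    stock = stock_concentration(4.0)
    print(f"Stock: {stock:.1f} ng/uL; "
          f"available: {remaining_mass_ng(stock, 9.5):.0f} ng")
```

With this hypothetical reading, the 9 μL of eluate that remains would carry about 360 ng, which falls within the 50–500 ng range used for library preparation.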

3.3 Direct RNA-Seq

At this step, an RNA library is prepared for sequencing on the MinION platform.
1. In a 0.2 mL PCR tube, mix the reagents in the following order: 3 μL of NEBNext Quick Ligation Reaction Buffer, 9 μL of RNA (50–500 ng), 0.5 μL of RNA calibration standard (RCS, 110 nM) as a control, 1 μL of RT adapter, and 1.5 μL of T4 DNA ligase. Mix well by flicking and incubate at room temperature for 10 min.
2. Meanwhile, prepare the reverse transcription master mix: 9 μL of nuclease-free water, 2 μL of 10 mM dNTPs, 8 μL of 5× first-strand buffer, and 4 μL of 0.1 M dithiothreitol (DTT).

3. Add the master mix to the tube from step 1.
4. Add 2 μL of SuperScript III reverse transcriptase and mix by flicking.
5. Incubate in a thermal cycler at 50 °C for 50 min, 70 °C for 10 min, and then at 4 °C before moving to the next step. Transfer the sample to a 1.5-mL Lo-Bind tube.
6. Add 72 μL (1.8×) of Agencourt RNAClean XP beads directly and mix well by flicking. Incubate at room temperature for 10 min on a rotator mixer. Place on a magnetic rack for 2 min until a pellet is formed and the solution is clear. Wash twice with 150 μL of freshly prepared 70% ethanol without disrupting the beads. Air-dry for 30 s and resuspend in 20 μL of nuclease-free water. Incubate for 5 min and transfer the supernatant to a 1.5 mL Lo-Bind tube.
7. Add the reagents to the tube in the following order: 8 μL of NEBNext Quick Ligation Reaction Buffer, 6 μL of RNA adapter (RMX), 3 μL of nuclease-free water, and 3 μL of T4 DNA Ligase. Mix by flicking the tube and incubate for 10 min at room temperature.
8. Add 16 μL of RNAClean XP beads to the reaction and mix by flicking. Incubate on a rotator mixer for 10 min at room temperature. Place the tube on a magnetic rack and wait until the solution is clear.
9. Add 150 μL of wash buffer (WSB) (see Note 5). Resuspend the beads by flicking the tube. Return the tube to the rack, allow the pellet to form, and pipette out the supernatant.
10. Repeat step 9 once.
11. Resuspend in 21 μL of elution buffer by gently flicking the tube. Incubate for 10 min at room temperature and transfer 21 μL of the eluate to a 1.5 mL Lo-Bind tube.
12. Quantify 1 μL using a Qubit DNA HS Assay.

3.4 Priming and Loading the Flow Cell

1. Add 30 μL of flush tether (FLT) directly to the flush buffer (FB) tube and mix by vortexing. Load 800 μL of this priming mix into the flow cell via the priming port without introducing air bubbles to avoid damaging the nanopores. Wait for 5 min.
2. Meanwhile, dilute 20 μL of RNA library to 37.5 μL with nuclease-free water. Mix the diluted RNA library with 37.5 μL of RNA Running Buffer (RRB).
3. Continue priming the flow cell by loading 200 μL of the priming mix into the priming port.
4. Add 75 μL of the sample to the flow cell through the sample port dropwise.
5. Close the ports and lid and begin the sequencing.
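The bead and loading volumes in Subheadings 3.3 and 3.4 can be checked by summing the reaction components. The sketch below does this bookkeeping in Python. The dictionary and function names are illustrative only, and the 0.4× ratio is simply what 16 μL of beads in the summed 40 μL ligation reaction works out to, not a value stated in the protocol.

```python
# Volumes in microliters, copied from Subheadings 3.3 and 3.4.
RT_ADAPTER_LIGATION = {"ligation buffer": 3.0, "RNA": 9.0, "RCS": 0.5,
                       "RT adapter": 1.0, "T4 DNA ligase": 1.5}
RT_MASTER_MIX = {"water": 9.0, "dNTPs": 2.0,
                 "first-strand buffer": 8.0, "DTT": 4.0}
REVERSE_TRANSCRIPTASE = 2.0
RMX_LIGATION = {"eluate": 20.0, "ligation buffer": 8.0, "RMX": 6.0,
                "water": 3.0, "T4 DNA ligase": 3.0}


def total_volume(components):
    """Sum the volumes of all components in a reaction."""
    return sum(components.values())


def bead_ratio(bead_ul, reaction_ul):
    """Bead-to-sample volume ratio used in a cleanup."""
    return bead_ul / reaction_ul


def loading_mix(library_ul=20.0, diluted_ul=37.5, rrb_ul=37.5):
    """Return (water to add, final volume to load) for Subheading 3.4."""
    return diluted_ul - library_ul, diluted_ul + rrb_ul


if __name__ == "__main__":
    rt_total = (total_volume(RT_ADAPTER_LIGATION)
                + total_volume(RT_MASTER_MIX) + REVERSE_TRANSCRIPTASE)
    print("RT reaction (uL):", rt_total,
          "| bead ratio:", bead_ratio(72.0, rt_total))

    ligation_total = total_volume(RMX_LIGATION)
    print("Adapter ligation (uL):", ligation_total,
          "| bead ratio:", bead_ratio(16.0, ligation_total))

    water, loaded = loading_mix()
    print("Water to add (uL):", water, "| volume loaded (uL):", loaded)
```

Summing the components in this way makes it easy to recompute the bead volume if a reaction is scaled.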

3.5 UNAGI Pipeline

1. UNAGI is deposited at GitHub. Execute the following command to download it:
$ git clone https://github.com/iMetOsaka/UNAGI_Bacteria
The pipeline uses three external tools: minimap2, samtools, and bedtools. Precompiled binaries are provided. The configuration file UNAGI/app/conf.ini contains the paths to these tools. If there are problems with the included precompiled binaries, the paths can be replaced with versions of the user’s choice (see Note 6).
2. UNAGI is written in Python, so Python must be installed. UNAGI was tested on Python 3. Type the following command:
$ python3 --version

If a version number is printed, Python 3 is already installed; otherwise, install Python 3.

3.6 Running UNAGI
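Before the first run, it may help to confirm that Python 3 and the three external tools named in Subheading 3.5 can be found. The following is a minimal sketch that uses only the Python standard library; the function names are hypothetical and not part of UNAGI. It only looks for tools on the system PATH, which matters if the paths in the configuration file are changed to system-wide installations (see Note 6), and it says nothing about the precompiled binaries bundled with UNAGI.

```python
import shutil
import sys

REQUIRED_TOOLS = ("minimap2", "samtools", "bedtools")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the system PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3,)):
    """Return True if the interpreter version is at least `minimum`."""
    return tuple(version_info[:len(minimum)]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python 3 available:", python_ok())
    absent = missing_tools()
    if absent:
        print("Not found on PATH:", ", ".join(absent))
    else:
        print("minimap2, samtools, and bedtools were all found on PATH.")
```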

1. The raw reads file should be provided to the pipeline in fastq or fastq.gz format as the input file (option -i). A test file (test.fq) is included.
2. The genome sequence should be provided to the pipeline for mapping the reads under the option -g. The reads in our test file are from Vibrio parahaemolyticus, and a genome file is provided (genome_VibrioParahaemolyticus.fna) for testing.
3. Genes should be provided to the pipeline in GFF format under the option -a. The GFF file includes many genomic features, but only genes will be extracted and used. The last column in a GFF file contains information about the gene, such as ID, gene name, and locus. You can choose what will be displayed in the output under the value “gene_name” in the configuration file. For example, we set this value to locus_tag, since it is used more frequently for our species of interest than the gene name given by the annotator. UNAGI can be started as below:
$ path/to_unagi/unagi -i [path_to_fastq_file] -o [path_to_desired_output_directory] -g [path_to_genome_file] -a [path_to_annotation_file]

You can test the pipeline first with the test file. Run the following in the UNAGI directory:
$ ./unagi -i ./test/RawReads.fastq -g ./test/genome_VibrioParahaemolyticus.fna -a ./test/genes_VibrioParahaemolyticus.gff -o ./test/results


4. First, TSSs will be identified. Two approaches will be used: genes-aware and genes-unaware. The final TSSs will be combined in one file, but the files resulting from each approach will be written separately to an intermediate directory. Theoretically, reads mapped to a gene should start at the TSS. However, because of fragmentation and incomplete reading of the 5′ end in direct RNA-seq, start sites are scattered, and clustering is used to mitigate these effects. Sites close to each other will be considered as one cluster (Fig. 2). The sites are grouped in one cluster as long as the distance between one site and the next is less than a defined threshold. A new cluster will be created once the distance becomes longer than the threshold. The default threshold is 10 bp, set in the configuration file under the value min_TTS_threshold. Based on our assumption, the first site is considered the real TSS, and the rest result from fragmentation or incomplete reading of the 5′ end. The second step is filtering based on cluster size. Genome coverage will be calculated using bedtools, and cluster size (number of sites) will be compared to the genome coverage at the predicted site. In the genes-aware approach, reads mapped to each gene will be processed separately, and only reads that start before the start codon will be considered. The size of each generated cluster will be compared to the genome coverage at the predicted TSS and should exceed a defined fraction of that coverage. The default value is 0.3, set under min_coverage_for_TSS in the configuration file. If all clusters are excluded, the first site is considered the TSS of this gene. The resulting TSSs will be written to the intermediate file aware_tss.tab. Information includes chromosome, site, count (cluster size), gene, strand, and predicted termination site. In the genes-unaware approach, clustering will be performed continuously throughout the genome. All raw clusters will be written to an intermediate file, raw_clustering_5_prime.tab. Next, these clusters will be filtered based on the genome coverage with more stringent criteria than in the genes-aware approach. The genes-aware method searches for TSSs in the 5′-UTR (upstream of start codons), whereas the genes-unaware approach has no such information; thus, it has less certainty and a higher probability of false positives. In general, good clusters are narrow with a high count. Finally, both files will be combined into one file, and duplicates will be removed.

Fig. 2 Transcription start site identification. Clustering is performed by grouping start sites that are close to each other into clusters. The first site in the cluster is considered a start site. Next, clusters with a low count, such as cluster 2, will be filtered out. In the case of the genes-aware approach, only cluster 1 is considered in the first stage and not cluster 2, because it lies after the start codon of the gene

5. Next, TTSs will be identified. Reads mapped to each gene will be grouped together, and only reads that end after the stop codon are considered. As for TSSs, end sites will be sorted and clustered, but the TTS is defined as the most frequent site, not the furthest one. In general, the added poly(A) tail limits the effect of fragmentation and incomplete reading.
6. Identifying operons: Operons are detected directly by examining reads that overlap genes. If a read covers more than one gene, it is considered an operon. The intersection area should cover more than 0.33 of the gene to be considered. This threshold can be changed under the value operon_coverage_threshold in the configuration file. There is also a mapping quality threshold under the value operon_readQuality_threshold. Low mapping quality results in false positives, and such reads should not be considered. The default value is 60. The number of detected operons is usually high for many reasons, such as false positives resulting from fragmentation and operons resulting from transcription readthrough and weak terminators.
Most of these operons tend to have low expression values, while true important operons have high expression values (Fig. 3). Detected operons are reported with detailed information about their relative expression. Relative expression is defined as operon count/gene count. Since operons have more than one gene, they have more than one relative expression value. We provide all relative expression values in the fifth column in the order that genes are present in the operon. The fourth column contains the highest relative expression value. It can be used to filter the results when browsing the table, so that the user can focus on highly expressed operons (see Note 7).

Fig. 3 Operon detection. Three operons are detected, A–D, A–C, and A–B. Operon A–D results from transcription readthrough and its expression is weak relative to its genes. On the other hand, operon A–B is a sub-operon with high relative expression

7. Transcriptome reconstruction: Transcripts are predicted without using a priori gene annotation, and they are then compared to the given annotation and classified as overlapping, intergenic, or antisense. Due to the compact nature of bacterial genomes and the presence of operons, most predicted transcripts will not be highly accurate. However, our method is well suited to discovering novel intergenic and antisense genes. Moreover, expression levels (as read counts) of the predicted transcripts are provided, and they can be used to filter out low-accuracy transcripts.
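The clustering logic described in steps 4 and 5 can be sketched in a few lines of Python. This is an illustration of the idea as described in the text, not the UNAGI source code. The function names, the toy coordinates, and the coverage values are hypothetical; only the forward strand is handled; and a gap exactly equal to the threshold is assumed to stay within a cluster.

```python
from collections import Counter


def cluster_sites(sites, max_gap=10):
    """Group sorted positions; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for pos in sorted(sites):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def call_tss(start_sites, coverage, min_fraction=0.3, max_gap=10,
             fallback_to_first=False):
    """Return (TSS, cluster size) pairs for forward-strand read starts.

    `coverage` maps a position to the read depth at that position (for
    example, taken from a per-base depth table). A cluster is kept when
    its size exceeds min_fraction of the depth at its first site.
    """
    clusters = cluster_sites(start_sites, max_gap)
    calls = []
    for cluster in clusters:
        depth = coverage.get(cluster[0], 0)
        if depth > 0 and len(cluster) / depth > min_fraction:
            calls.append((cluster[0], len(cluster)))
    if not calls and clusters and fallback_to_first:
        # Genes-aware fallback: keep the first site if every cluster fails.
        calls.append((clusters[0][0], len(clusters[0])))
    return calls


def call_tts(end_sites, max_gap=10):
    """Return (TTS, count) pairs: the most frequent end site per cluster."""
    return [Counter(cluster).most_common(1)[0]
            for cluster in cluster_sites(end_sites, max_gap)]


if __name__ == "__main__":
    starts = [100, 100, 101, 103, 108, 150, 400, 401, 402]
    depth = {100: 10, 150: 10, 400: 5}  # hypothetical coverage values
    print("TSS calls:", call_tss(starts, depth))

    ends = [900, 905, 905, 905, 907, 1200]
    print("TTS calls:", call_tts(ends))
```

In this toy example, the single start site at position 150 forms a cluster of size 1 that falls below the 0.3 coverage fraction and is dropped, mirroring the low-count cluster in Fig. 2.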

4 Notes

1. Due to the large amount of RNA, rRNA depletion can be performed twice to remove rRNA efficiently. However, this can lead to more RNA fragmentation.
2. Although the manufacturer’s protocol calls for pipetting, we prefer flicking to avoid RNA fragmentation, which can severely affect the results.
3. Alternatively, the air-drying period can be shortened by briefly spinning the tube and then carefully returning it to the magnetic rack so that the beads adhere to the wall of the tube while the ethanol stays at the bottom. The ethanol can then be removed and the beads dried for a short time (around 10 s).
4. An optional qualitative analysis using a Bioanalyzer will show a shift in length resulting from the addition of the poly(A) tail.
5. Be careful not to wash with alcohol at this step, since the protein adapters will be denatured. Additionally, if you must store the library, do not store it at -20 °C, since this will damage the protein adapters.
6. For example, if minimap2 is already installed and available in the user’s shell, the value in the configuration file can be changed from “../tools/minimap2/minimap2” to just “minimap2”.
7. Keep in mind that RNA-seq is a snapshot of the transcriptome under a certain condition and at a certain time. Lowly expressed operons can be highly expressed under other conditions.
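The relative expression values described in Subheading 3.6, step 6, and the filtering caveat in Note 7 can be illustrated with a short Python sketch. The function names, read counts, and the 0.5 cutoff are hypothetical; UNAGI reports these values in its output table, and the choice of cutoff is left to the user.

```python
def relative_expression(operon_count, gene_counts):
    """Operon read count divided by the read count of each member gene."""
    if any(count <= 0 for count in gene_counts):
        raise ValueError("gene counts must be positive")
    return [operon_count / count for count in gene_counts]


def highly_expressed(operons, min_relative=0.5):
    """Keep operons whose highest relative expression reaches min_relative."""
    kept = []
    for name, operon_count, gene_counts in operons:
        values = relative_expression(operon_count, gene_counts)
        if max(values) >= min_relative:
            kept.append((name, max(values), values))
    return kept


if __name__ == "__main__":
    # Hypothetical read counts loosely modeled on Fig. 3 (genes A to D).
    operons = [
        ("A-B", 80, [100, 90]),
        ("A-C", 20, [100, 90, 60]),
        ("A-D", 5, [100, 90, 60, 50]),
    ]
    for name, top, values in highly_expressed(operons):
        print(name, round(top, 2), [round(v, 2) for v in values])
```

Because expression depends on the growth condition, a cutoff of this kind is best treated as a browsing aid rather than as evidence that a lowly expressed operon is an artifact.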


References
1. Salzberg SL (2019) Next-generation genome annotation: we still struggle to get it right. Genome Biol 20:92
2. Haft DH, DiCuccio M, Badretdin A et al (2018) RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res 46:D851–D860
3. Li W, O’Neill KR, Haft DH et al (2021) RefSeq: expanding the prokaryotic genome annotation pipeline reach with protein family model curation. Nucleic Acids Res 49:D1020–D1028
4. Menendez-Gil P, Toledo-Arana A (2021) Bacterial 3’UTRs: a useful resource in posttranscriptional regulation. Front Mol Biosci 7:464
5. Al Kadi M, Ishii E, Truong DT et al (2021) Direct RNA sequencing unfolds the complex transcriptome of Vibrio parahaemolyticus. mSystems 6:e00996–e00921
6. Yan B, Boitano M, Clark TA et al (2018) SMRT-Cappable-seq reveals complex operon variants in bacteria. Nat Commun 9:3676
7. Georg J, Hess WR (2018) Widespread antisense transcription in prokaryotes. Microbiol Spectr 6
8. Bischler T, Tan HS, Nieselt K et al (2015) Differential RNA-seq (dRNA-seq) for annotation of transcriptional start sites and small RNAs in Helicobacter pylori. Methods 86:89–101
9. Garalde DR, Snell EA, Jachimowicz D et al (2018) Highly parallel direct RNA sequencing on an array of nanopores. Nat Methods 15:201–206

Chapter 20
Nanopore Direct RNA Sequencing of Monosome- and Polysome-Bound RNA
Lan Anh Catherine Nguyen, Toshifumi Inada, and Josephine Galipon

Abstract
Polysome fractionation makes use of density gradients and ultracentrifugation to separate transcripts based on their specific number of bound ribosomes, and can be combined with downstream analysis such as cDNA-seq (commonly known as RNA-seq), microarray analysis, RT-qPCR, or Northern blotting. Here, we describe the application of Nanopore direct RNA sequencing to quantify monosome- and polysome-bound full-length transcripts after polysome fractionation, RNA cleanup, and size selection, using the yeast glucose stress response as an example use case.

Key words Direct RNA sequencing, Ribosomes, Translation, Stress response

1 Introduction

The study of ribosomes and their associated elements has a long history dating back to the early days of molecular biology. The ultracentrifugation of whole cell extracts on density gradients, currently known as polysome fractionation, is one of the oldest techniques to study ribosome-associated RNA [1, 2]. The resolution is sufficient to separate populations of RNAs associated with a precise number of ribosomes and even partially assembled ribosomes [3]. Consequently, this method is critical to study the molecular dynamics of translation initiation and elongation, and a large body of literature is devoted to identifying the protein composition of RNA-bound ribosomes and ribosome-associated proteins, as well as to quantifying ribosome-bound RNA under various conditions. After fractionation, the RNA content has previously been analyzed by Northern blotting [4], RT-qPCR [5], microarrays [6], or high-throughput sequencing of short cDNA fragments (commonly referred to as RNA-seq) [7].

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3_20, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Polysome profiling is powerful and provides some advantages compared to ribosome profiling, the high-throughput sequencing of ribosome-protected fragments (ribo-seq) [8]. Ribo-seq generates a genome-wide snapshot of ribosome footprints, which are 28 nt on average, providing an indirect estimation of translational activity by normalizing to mRNA levels. However, it can be difficult to distinguish between monosomes and polysomes, as well as between transcript isoforms with alternative 5′- and 3′-ends, as ribo-seq provides the average distribution of ribosomes on transcripts originating from a particular locus [9]. In contrast, polysome fractionation, in association with the downstream techniques listed above, can enable not only identification of the composition of the ribosome itself, but also the number of ribosomes associated with each RNA isoform. However, each downstream analysis method has its limitations. Northern blotting can identify transcript length with great accuracy but is both low-throughput and labor-intensive. RT-qPCR is quick and practical for the quantification of a subset of target genes, but cannot inform about transcript length nor detect alternative transcript isoforms in many cases. Finally, microarrays and cDNA-seq both have limitations in terms of their ability to identify the boundaries of transcripts. In the past few years, novel sequencing technologies emerged that allow high-throughput strand-specific sequencing of long reads, with Oxford Nanopore Technologies (Nanopore) and PacBio the current leaders in the market. Nanopore is the first and currently the only one to enable the direct sequencing of RNA transcripts. In this chapter, we describe a protocol for polysome fractionation and direct RNA sequencing of monosome and polysome-bound poly(A)-tailed transcripts. 
As an example, we describe the use of Nanopore Flongle Flow Cells to monitor translational regulation in fission yeast during glucose stress, previously defined as a sudden 60-fold reduction in glucose during exponential growth [10, 11].

2 Materials

2.1 Preparation of Whole Cell Extracts

1. Liquid nitrogen
2. 75% (w/v) ethanol
3. Polysome lysis buffer (PLB): 20 mM pH 7.4 HEPES-KOH, 2 mM magnesium acetate, 100 mM potassium acetate
4. 100 mg/mL cycloheximide
5. 1 tablet/10 mL cOmplete™, EDTA-Free Protease Inhibitor Cocktail (Roche)
6. Ceramic mortar and pestle
7. Aluminum foil
8. Sterile and nuclease-free 50 mL tubes
9. Sterile and nuclease-free 1.5 mL tubes
10. Refrigerated centrifuge, up to 20,400 × g

2.2 Polysome Fractionation

1. Low-sucrose solution (10% w/v): 10% (w/v) sucrose, 10 mM pH 7.4 Tris–HCl at 4 °C, 70 mM ammonium acetate, 4 mM magnesium acetate
2. High-sucrose solution (50% w/v): 50% (w/v) sucrose, 10 mM pH 7.4 Tris–HCl at 4 °C, 70 mM ammonium acetate, 4 mM magnesium acetate
3. Open-Top Thinwall Ultra-Clear 4-mL tubes (Beckman Coulter)
4. Long-cap rubber adapters (BioComp)
5. Tube stand with gradient midpoint mark for 4 mL tubes (BioComp)
6. Sterile and nuclease-free 15 mL tubes
7. Stainless blunt-type syringe tip
8. Gradient maker: Gradient Master Model 108 (BioComp)
9. Ultracentrifuge: Optima XE-90 (Beckman Coulter)
10. Rotor: SW 60 Ti Swinging-Bucket, with 6 × 4 mL buckets (Beckman Coulter)
11. Piston Gradient Fractionator, equipped with parts for SW 60 Ti (BioComp)
12. Fraction collector: MicroCollector AC-5700 (Atto)
13. 96-well plate Nunclon™ Delta Surface (Sigma Aldrich)

2.3 RNA Purification from Sucrose Fractions

1. Nuclease-free water
2. 8 M guanidine-HCl
3. 99.5% (w/v) ethanol
4. 75% (w/v) ethanol
5. 5 mg/mL glycogen
6. RNA buffer: 300 mM sodium chloride, 20 mM pH 7.4 Tris–HCl, 10 mM pH 8.0 EDTA, 1% (w/v) sodium dodecyl sulfate (SDS)
7. 3 M pH 5.2 sodium acetate
8. RNeasy Kit (QIAGEN)
9. RNA ScreenTape for TapeStation (Agilent)
10. RNA Ladder for TapeStation (Agilent)
11. RNA Sample Buffer for TapeStation (Agilent)
12. RNA Clean Concentrator Kit (Zymo Research)
13. Poly(A) mRNA Magnetic Isolation Module (New England Biolabs)
14. Refrigerated centrifuge, up to 20,400 × g
15. Vacuum-centrifuge dryer: MV-100 Micro Vac (TOMY)
16. Spectrophotometer: Nanodrop ND-1000 (Thermo Fisher Scientific)
17. Capillary electrophoresis: TapeStation 2200 (Agilent)
18. Fluorometer: Quantus (Promega) or Qubit 2.0 (Thermo Fisher Scientific)

2.4 Direct RNA Sequencing

1. Direct RNA Sequencing Kit (Oxford Nanopore Technologies)
2. Flongle Sequencing Expansion Kit (Oxford Nanopore Technologies)
3. SuperScript IV Reverse Transcriptase (Invitrogen)
4. 2000 U/μL T4 DNA Ligase (New England Biolabs)
5. T4 DNA Ligase Buffer (New England Biolabs)
6. 10 mM dNTP (each)
7. SuperScript IV First-Strand Buffer (Invitrogen)
8. SuperScript IV Reverse Transcriptase (Invitrogen)
9. 0.1 M dithiothreitol
10. AMPure XP beads (Agencourt) (see Note 1)
11. Tube rotator with 1.5-mL tube adapter
12. 16-Tube SureBeads™ Magnetic Rack (Bio-Rad)
13. 1.5-mL Eppendorf DNA Lo-Bind safe-lock tubes (Eppendorf)
14. Thermocycler
15. MinION Sequencer (Oxford Nanopore Technologies)
16. Flongle Adapter (Oxford Nanopore Technologies)
17. Flongle Flow Cells (Oxford Nanopore Technologies)
18. Windows 10 PC with the minimum required specifications for Nanopore MinION: 500 GB solid-state drive, 16 GB random access memory, i7 fourth-generation processor, and USB 3.0 port
19. Up-to-date MinKNOW software

2.5 Bioinformatics Analysis

Version numbers are given as an example (see Note 2).
1. Graphics card: GeForce RTX 2060 (NVIDIA)
2. Guppy v5.0.7
3. Python v3.9
4. Minimap2 v2.22
5. Samtools v1.13
6. Salmon v1.8
7. R v4.2.0
8. RStudio 2022.02.2+485 “Prairie Trillium”
9. BiocManager v3.15
10. edgeR v3.38.1

3 Methods

This method assumes that the yeast cells to be processed are already pelleted, flash-frozen, and stored at -80 °C prior to starting. Please note that washing the cells before flash-freezing is not recommended, as it is likely to cause significant changes in the polysome profile. The general workflow is presented in Fig. 1. The polysome fractionation method is adapted from previous literature [10, 12, 13]. The following protocol is written to accommodate pellets collected from 40 mL yeast cell cultures at a concentration of approximately 2.0–2.5 × 10^7 cells/mL (0.8–1.0 × 10^9 cells) and is optimized for the SW 60 Ti swing rotor (Beckman Coulter), which is, to the best of our knowledge, the smallest rotor in the world used for gradient fractionation (4 mL). The cell culture volume and number of cells collected may be scaled up and/or the size of the swing rotor and buckets changed as deemed appropriate (see Note 3). Similarly, if the coverage obtained by Flongle is deemed insufficient, Sect. 3.4 may be easily scaled up to a protocol for standard Nanopore MinION Flow Cells.
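The cell numbers quoted above are the product of culture volume and cell density, which is also the starting point when scaling the protocol. A minimal Python sketch of this arithmetic follows; the function names and the lower-density example are illustrative only.

```python
def total_cells(culture_ml, cells_per_ml):
    """Number of cells in a culture of the given volume and density."""
    return culture_ml * cells_per_ml


def culture_volume_ml(target_cells, cells_per_ml):
    """Culture volume needed to collect a target number of cells."""
    if cells_per_ml <= 0:
        raise ValueError("cells_per_ml must be positive")
    return target_cells / cells_per_ml


if __name__ == "__main__":
    # Range stated in the protocol: 40 mL at 2.0-2.5 x 10^7 cells/mL.
    low = total_cells(40, 2.0e7)
    high = total_cells(40, 2.5e7)
    print(f"Cells per pellet: {low:.1e} to {high:.1e}")

    # Hypothetical case: a culture harvested at 1.0 x 10^7 cells/mL.
    print(f"Volume for {low:.1e} cells: "
          f"{culture_volume_ml(low, 1.0e7):.0f} mL")
```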

3.1 Preparation of Whole Cell Extracts (WCE)

1. Prepare one ceramic mortar and pestle pair per sample, clean them thoroughly, and wipe them with 75% (w/v) ethanol. After wrapping them in aluminum foil, store them at -80 °C for at least an hour to let them cool down. Skipping this step will cause the ceramic to crack when exposed to liquid nitrogen in the following steps.
2. Prepare a PLB* stock solution on ice as follows: put 0.6 mL of PLB per sample into a single 15 mL tube on ice, with an additional volume that will be used later in step 7 (for four samples, prepare 5 mL of PLB). Add cycloheximide to a final concentration of 0.1 mg/mL and cOmplete™, EDTA-free Protease Inhibitor Cocktail to a final concentration of 1×. In this protocol, we refer to the polysome lysis buffer (PLB) with added ingredients as PLB* (see Note 4).
3. Label the side and the cap of one 50 mL tube per sample, and place them on ice.
4. After taking out the mortar and pestle from the -80 °C deep freezer and removing the aluminum foil, add liquid nitrogen to further cool them down to -196 °C. After the liquid nitrogen has evaporated, add some more liquid nitrogen, and then add the frozen cell pellet directly to the liquid nitrogen (see Notes 5 and 6). Keep the cells in liquid nitrogen at all times. Add 0.6 mL of cold PLB* directly to the liquid nitrogen (do not dip the pipette tip into the liquid nitrogen, as the PLB* will instantly freeze), and start grinding the frozen cell pellet and PLB* using the pestle. When the liquid nitrogen has evaporated, keep grinding the frozen powder for a few turns, then quickly add more liquid nitrogen, and resume grinding. Repeat this process five times.

Fig. 1 Workflow of the method described in this paper. ★N indicates RNA quality control done with a Nanodrop spectrophotometer, while ★Q indicates RNA quantification done with a Quantus (Promega) or Qubit (Thermo Fisher Scientific) fluorometer. The mortar icon comes from Flaticon.com

5. After the last round of grinding, quickly transfer the powder from the first sample into one of the 50 mL tubes that were placed on ice in step 3. Let the powder thaw on ice with the cap closed. The remaining samples may be processed within 30 min of placing the first sample on ice and similarly placed on ice to thaw alongside the first sample. The thawing process may take up to 2 h.
6. After thawing, transfer the samples to 1.5 mL tubes on ice and centrifuge at 2300 × g for 10 min at 4 °C. Transfer the supernatant to a fresh 1.5 mL tube on ice, and centrifuge at 13,000 × g for 10 min at 4 °C. Again, transfer the supernatant (whole cell extract) to a fresh 1.5 mL tube on ice. Prepare a 1/10 dilution of each whole cell extract in PLB*, and assess the concentration and quality of RNA with a Nanodrop or other spectrophotometer, using leftover PLB* as a blank.
7. Bring all samples to the same concentration using some of the remaining excess PLB*. All things considered, the minimum concentration of total RNA recommended at this stage is 500 ng/μL (see Note 7). Excess whole cell extract may be aliquoted, flash-frozen in liquid nitrogen, and stored at -80 °C for further analysis. Make sure to save an aliquot of the samples as a control for total RNA extraction.

3.2 Polysome Fractionation

Steps 1–10 may be prepared either before starting Sect. 3.1, or during the 2 h thawing process in Sect. 3.1, step 5.
1. Prepare 15 mL of 10% (w/v) sucrose buffer and 15 mL of 50% (w/v) sucrose buffer in two separate tubes.
2. Set an Open-Top Thinwall Ultra-Clear 4 mL tube onto the tube stand with gradient mid-point mark, and use a marker to mark the midpoint of the gradient on the tube. Then, add 10% (w/v) sucrose solution until it reaches the midpoint using a metal syringe (approximately 2 mL). Then, vertically insert a syringe filled with 50% (w/v) sucrose solution into the 10% (w/v) sucrose solution so that the tip sits right above the bottom of the tube, and slowly fill the tube from the bottom with the 50% (w/v) sucrose solution until it reaches the marked midpoint. The heavier sucrose solution will remain at the bottom. Remove the syringe while being careful not to disturb the interface between the two concentrations.
3. Remove any excess by fitting the long-cap rubber adapter. At this stage, it is critical to avoid introducing any bubbles. This is done by fitting the rubber cap at a slightly inclined angle so that the remaining air may escape through the small hole in the rubber cap.


Lan Anh Catherine Nguyen et al.

4. Repeat steps 2–3 for all gradients. Do not set the gradient tubes on the gradient maker at this point.
5. Turn on the Gradient Master Model 108. Make sure the platform is perfectly horizontal and that the tube stand is perfectly centered on the platform. This is essential to ensure reproducibility.
6. Set the appropriate gradient program on the Gradient Master Model 108. In our case, we used LIST > SW60 > LONG SUCROSE 10–50%. This program was custom-made for the SW 60 Ti rotor with the following sequence: 0:10/86.0/30 (MEM#1) 0:15/86.0/00 (MEM#2); Series: 121212. You may need to choose and test the appropriate program according to your tube size and rotor size, and if it does not yet exist, it is possible to ask BioComp’s customer support to optimize the gradient program, for a fee.
7. Set the gradients onto the gradient maker, and run the program by pressing USE.
8. After the gradient program is completed, carefully place the tubes into the swing rotor buckets while making sure to not disturb the gradients. Then, carefully place the swing rotor buckets containing the gradients at 4 °C and let them cool down for an hour.
9. During this time, turn on the Optima XE-90 ultracentrifuge and set the parameters as follows: RPM, 30,000; time, 2 h; temperature, 4 °C; ACCEL, 7; and DECEL, 7 (see Note 8).
10. To let the temperature drop to 4 °C, close the lid, and turn on the vacuum. Do not start the centrifugation at this point.
11. At least an hour later, remove the gradients from 4 °C, and apply a volume of whole cell extract corresponding to 100 μg of RNA on top of the gradient for each sample. If multiple samples and gradients are prepared, make sure that all the tubes have the same weight and that the same volume is added for all samples (see Note 9).
12. Set the buckets onto the swing rotor, and the swing rotor into the ultracentrifuge, while being very careful not to disturb the gradients. However, it is also important not to take too long at this stage to avoid warming the samples to room temperature.
13. As soon as the rotor is set, close the door and run the centrifugation. The vacuum function will bring the rotor chamber back to 4 °C. Remain in the vicinity of the ultracentrifuge for monitoring until it reaches maximum speed.
14. During ultracentrifugation, the free RNA, the monosomes, and the polysomes are separated by molecular weight along the sucrose gradient.


15. Collect fractions on a 96-well plate following instructions for the Piston Gradient Fractionator while measuring the 254 nm absorbance along the gradient. After collection, place the 96-well plate on ice immediately to slow down RNA degradation. At this stage, the experiment can be paused either by wrapping the 96-well plate in aluminum foil, or by transferring the fractions to 2 mL tubes, before storing at -80 °C.

3.3 RNA Extraction and Enrichment

1. If the samples were stored at -80 °C, let them slowly thaw at 4 °C. This may take around 40 min. Thawing time depends on the size of the fractions collected in your setup, and on whether the fractions are in 96-well plates or in 1.5 mL tubes, so make sure to check the thawing progress regularly.
2. Based on the 254 nm absorbance measured during collection, determine which fractions contain free RNA, monosomes, and polysomes, respectively. Merge relevant fractions into a single 2 mL tube (or 15 mL tube in case of larger volumes). Here, we prepared merged fractions labeled “free RNA,” “monosomes,” and “polysomes” (cf. Fig. 1).
3. After merging the fractions and mixing, separate them again into 150 μL aliquots in 2 mL tubes. Then, successively add 330 μL of guanidine-HCl (8 M), 1 μL of glycogen (5 mg/mL), and 1.44 mL of cold 99.5% (w/v) ethanol. We recommend preparing a master mix on ice containing the required amounts of guanidine-HCl (8 M), glycogen (5 mg/mL), and 99.5% (w/v) ethanol for the necessary number of samples plus one extra, and adding 1.771 mL of this mix to each sample (2 × 885.5 μL using a P1000 pipette). This can be done without changing the filter tip if the master mix is dispensed from above.
4. Make sure all caps are firmly closed, and mix the tubes by inverting.
5. Place the samples at -80 °C for an hour.
6. Centrifuge at 13,000 × g for 15 min at 4 °C, and discard all of the supernatant without disturbing the pellet.
7. Add 150 μL of RNA buffer, 15 μL of sodium acetate (3 M, pH 5.2), and 495 μL of 99.5% (w/v) ethanol to the pellet. Similarly to step 3, we recommend preparing a master mix containing RNA buffer, sodium acetate (3 M, pH 5.2), and 99.5% (w/v) ethanol and then adding 660 μL of this mix to each sample. Mix by flicking.
8. Place the samples at -80 °C for an hour.
9. Centrifuge at 13,000 × g for 15 min at 4 °C, and discard the supernatant, similarly to step 6.


10. Wash with cold 75% (w/v) ethanol, centrifuge at 13,000 × g for 15 min at 4 °C, and discard the supernatant carefully.
11. Place the tubes with their caps open in the MV-100 Micro Vac at room temperature until the pellet is dry (do not use the heating function).
12. Resuspend in 20 μL nuclease-free water on ice, and pool the samples from the same fraction together. In addition, to increase the starting amount of RNA for downstream procedures, the RNA from three individual 4 mL gradients was pooled together for each glucose starvation time point. This hurdle may be overcome by using a swing rotor and gradient fractionator system that can accommodate larger tubes.
13. Assess the quantity and quality using a Nanodrop spectrophotometer.
14. To remove any remaining guanidine contamination, clean up the RNA using the RNeasy Kit following the manufacturer’s instructions, and assess the quantity and the quality using a Nanodrop spectrophotometer and TapeStation capillary electrophoresis (Fig. 2).
15. To remove small RNA (10 reads) expressed in either the monosome, the polysome fractions, or both, at different time points during glucose starvation
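The volumes quoted in Sect. 3.2 (step 11), Sect. 3.3 (steps 3 and 7), and Note 7 follow from simple arithmetic, which the following minimal Python sketch reproduces. The function names are hypothetical and chosen for illustration only; the per-sample volumes, the 100 μg loading target, and the 200 μL loading limit are taken from the text.

```python
# Per-aliquot volumes in microliters, copied from Sect. 3.3, steps 3 and 7.
PRECIPITATION_MIX = {
    "guanidine-HCl (8 M)": 330.0,
    "glycogen (5 mg/mL)": 1.0,
    "ethanol (99.5%)": 1440.0,
}
WASH_MIX = {
    "RNA buffer": 150.0,
    "sodium acetate (3 M, pH 5.2)": 15.0,
    "ethanol (99.5%)": 495.0,
}


def per_sample_volume(per_sample_ul):
    """Volume of master mix (in microliters) to dispense into each sample."""
    return sum(per_sample_ul.values())


def master_mix(per_sample_ul, n_samples, extra=1):
    """Scale the per-sample volumes to n_samples plus `extra` spare reactions."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    factor = n_samples + extra
    return {reagent: volume * factor for reagent, volume in per_sample_ul.items()}


def loading_volume_ul(rna_ng_per_ul, target_ug=100.0, max_ul=200.0):
    """Volume of whole cell extract (microliters) that contains target_ug of RNA.

    Raises ValueError when the extract is too dilute to fit under max_ul,
    the loading limit given in Note 7 for the long-cap adapters.
    """
    volume = target_ug * 1000.0 / rna_ng_per_ul
    if volume > max_ul:
        raise ValueError("extract too dilute: %.1f uL exceeds %.1f uL" % (volume, max_ul))
    return volume


if __name__ == "__main__":
    print(per_sample_volume(PRECIPITATION_MIX))  # volume added in step 3
    print(per_sample_volume(WASH_MIX))           # volume added in step 7
    print(master_mix(PRECIPITATION_MIX, n_samples=6))
    print(loading_volume_ul(500.0))              # extract at the minimum concentration
```

Loading 100 μg of RNA in at most 200 μL requires at least 500 ng/μL, which is where the minimum concentration recommended in Sect. 3.1, step 7 comes from.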

4 Notes

1. The use of AMPure XP RNA beads (which come at a significantly higher cost) is not absolutely necessary, as long as the AMPure XP beads are handled with care at all times (gloves and filter tips), including by other lab members who are not working with RNA. For extra safety, we suggest reserving one bottle specially for RNA experiments. However, AMPure XP RNA beads are guaranteed RNase-free by the manufacturer. In addition, we do not recommend resuspending AMPure XP beads by pipetting since it may cause material loss.
2. The graphics card and software versions used at the time of our analysis are indicated as a reference, but there is no explicit need to use those exact versions; whenever possible, we recommend using up-to-date software at the time of analysis.
3. We highly recommend scaling up the cell culture and using a bigger rotor (such as the SW 41 Ti that fits 6 × 13 mL, or even the SW 32 Ti that fits 6 × 38 mL) to load more whole cell extract per gradient and therefore maximize the amount of RNA recovered per run for downstream analysis.
4. Cycloheximide may be omitted or replaced by other translational inhibitors as deemed appropriate.
5. It is highly recommended to wear watertight cryoprotectant gloves.
6. The cells will be in frozen pellet form. Put the tube vertically above the mortar and tap it to detach the cells. After the cell pellet falls into the mortar, immediately add more liquid nitrogen.
7. The maximum recommended input on top of the gradient using this specific rotor is 200 μL when using the long-cap adapters. Moreover, if the amount of RNA is too high, the absorbance values may exceed the detection range of standard spectrophotometers (0 < A254 nm < 2). The use of a spectrophotometer with a wider range such as TRIAX from BioComp (0 < A254 nm < 7) circumvents this limitation.
8. In addition, it is essential to set the ACCEL parameter to either 1, 4, or 7 and the DECEL parameter to either 1, 4, 7, or NONE. These parameters control how fast the swing rotor reaches full speed during acceleration and how fast it stops spinning when the centrifugation time is up. Other parameter values are not compatible with the SW 60 Ti rotor and will increase the risk of tubes breaking and leaking inside the buckets. For other models of swing rotors, these settings may differ.


9. This is easier to achieve if all samples were diluted to the same concentration in Sect. 3.1, step 6. It is important that all the tubes are the same weight. The ultracentrifuge will stop if it detects too large an imbalance, and precious time will be lost, and/or the hooks on the swing rotor buckets might become damaged.
10. This is about 10–40 times less than the recommended poly(A) RNA input to obtain 1 × 10^5 reads in 24 h [14], but we still obtained 0.4–1.4 × 10^5 reads (cf. Table 1). This means that the amount of input RNA has little influence on the total read count of Flongle. Low RNA inputs may adversely affect the diversity of the sample, but in the case of enriched fractions such as here, the diversity is expected to be lower than in the total poly(A) transcriptome, especially since we are size selecting to eliminate small RNAs

>E.coli16S
AAATTGAAGAGTTTGATCATGGCTCA...
>E.coli23S
GCACCTCGATGTCGGCTCATCACATCC...

Similarly, the FASTA-formatted reference sequence for the Curlcake dataset is deposited in NCBI under the accession number GSE124309 [14]. For SARS-CoV-2, the transcriptome is obtained experimentally by in vitro transcription, and sequences are obtained from an open-source repository mentioned in reference [26] (see Note 1).

2.3 Computational Environment

To detect RNA modification using learned weights, a Linux machine with more than 20 GB of GPU memory (VRAM) and multiple CPU cores is required. However, for training the nanoDoc2 neural network, a GPU with at least 50 GB of memory (VRAM) (in total) is required. Moreover, the following software must be installed:
(i) Guppy basecaller (version 6.1.2 or above): Guppy basecaller is available on the ONT website to the members of the ONT community. One should run Guppy basecaller with the fast5 output option “--fast5_out”.
(ii) NanoDoc2: Pull from the open-source GitHub link https://github.com/uedaLabR/nanoDoc2.git (the procedure is described below in the Method section). The software NanoDoc2 was tested using Python (version 3.6), with packages TensorFlow (version 2.5.2), Cuda (version 11.2), and Cudnn (version 8.1) (see Note 2).

3 Method

3.1 Overview

The software nanoDoc2 consists of three parts: the first part is sequence-to-signal assignment (or resquiggling), the second part is the deep neural network training using deep one-class algorithm [30] (DOC), and the third part is the detection of RNA modification using the aforementioned trained neural network and clustering algorithms (Fig. 1). A typical user may use only the first and third part to detect RNA modification with pretrained weight that can be downloaded from a repository as described below (see the following sections).

RNA Modification Detection Using Direct RNA-Seq and nanoDoc2

Fig. 1 Data flow of nanoDoc2 (diagram: WT and IVT basecalled fast5; step 1: map, resquiggle, sort, yielding WT and IVT parquet; step 2: train the neural network or download learnt weight; step 3: detection of RNA modification, yielding the result output.txt). Training of the neural network can be done separately; alternatively, a user can download the learnt weight (step 2). nanoDoc2 takes basecalled multifast5 as an input. Both WT and IVT data are required for detection (step 1). Finally, the detection algorithm takes WT and IVT parquet files as inputs and outputs the result file (step 3)

3.1.1 Signal Resquiggling by Viterbi Using Trace Value

Trace values are intermediate output of the Guppy basecaller representing 8-state probabilities with A, T, C, and G as flip base state and A*, T*, C*, and G* as flop base state, where the flop state corresponds to the polynucleotide region (named as the flip-flop model) (Fig. 2a). Both the raw current signal and trace value are time series data. In the recent version of the Guppy basecaller (version 6.1), one data point in “trace” space (8-dimensional) corresponds to 10 data points in raw current signal space. The basecalling software also outputs segmentation boundaries along trace data, where each segmentation boundary corresponds to one nucleotide along the basecalled sequence. In our approach, first, flip and flop outputs are merged into a simpler four A, T, C, and G states for each trace data point (4-dimensional). Next, we applied change point detection algorithms [27] to the merged trace value, yielding additional segmentation boundaries to the original basecalled boundaries. Each segmentation boundary is compared to a reference sequence by following a dynamic programming algorithm (Fig. 2b and c) such that the most probable base at the boundary when matched to the base in the reference sequence is scored as positive, while the corresponding mismatch is scored as negative.
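The merging of flip and flop states and the match/mismatch scoring described above can be illustrated with a minimal Python sketch. It is not the nanoDoc2 implementation: the column order of the trace rows (flip A, C, G, T followed by flop A, C, G, T), the score values of +1 and -1, and all function names are assumptions made for illustration, and the dynamic programming step itself is not shown.

```python
BASES = "ACGT"  # assumed column order: flip A, C, G, T, then flop A, C, G, T


def merge_flip_flop(trace_row):
    """Collapse one 8-value trace row into four per-base values.

    The flip and flop values of the same base are added together.
    """
    if len(trace_row) != 8:
        raise ValueError("expected 8 trace values per data point")
    return [trace_row[i] + trace_row[i + 4] for i in range(4)]


def most_probable_base(merged_row):
    """Return the base with the largest merged trace value."""
    best = max(range(4), key=lambda i: merged_row[i])
    return BASES[best]


def boundary_score(merged_row, reference_base, match=1, mismatch=-1):
    """Positive score if the most probable base equals the reference base."""
    return match if most_probable_base(merged_row) == reference_base else mismatch


if __name__ == "__main__":
    # Made-up trace rows (values in the 0-255 range) for three data points.
    trace = [
        [10, 0, 0, 0, 200, 5, 0, 0],   # mostly flop A
        [0, 180, 20, 0, 0, 30, 0, 0],  # mostly flip C
        [0, 0, 0, 90, 0, 0, 0, 120],   # T, split between flip and flop
    ]
    merged = [merge_flip_flop(row) for row in trace]
    print([most_probable_base(row) for row in merged])
    print([boundary_score(row, ref) for row, ref in zip(merged, "ACG")])
```

In a full resquiggling step, scores of this kind would feed the dynamic programming recursion that aligns segmentation boundaries to the reference sequence.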


Hiroki Ueda et al.

Fig. 2 Data preprocessing of nanoDoc2 (a) The workflow of data preprocessing of nanoDoc2. Fast5 files are input into Guppy basecaller producing modified multifast5 files including basecalled fastq information and intermediate base probability as trace value. Fastq information is used for genome mapping by Minimap2 (mappy) to assign the global position of the read. The resquiggling operation is performed next using the global read position by dynamic programming (b) Representation of trace value (top panel) and raw current signal (bottom panel) (c) Trace value is subjected to change point detection to yield additional boundaries and then to the Viterbi algorithm, by which the trace value is further aligned to the genomic sequence by using dynamic programming

3.1.2 Training Using the Deep One-Class Algorithm

To prepare the training data, a batch of raw current signals corresponding to each 6-mer composition is extracted (see the following sections for details). This signal of the 6-mer combination (included in a 1024-length array) is used as an input to the deep neural network and subjected to classification with 4078 possible classes (Fig. 4a and b), featurizing the input signal to emit a 256-dimensional representation. This deep neural network consists of a convolutional neural network (CNN) with WaveNet architecture (Fig. 3a) [28]. The deep neural network is further trained for dimensionality reduction, in which an additional encoder network (along with a decoder network) is attached to reduce the dimensionality from 256 to 24 (Figs. 3b and 4c and d). The combined deep neural network (CNN-WaveNet fused to the encoder network) is then trained by a second classifier in such a way that two instances of fused CNN-WaveNet networks are coupled by sharing weights (i.e., network parameters) (Fig. 3c). One of the networks processes raw signals of target 6-mers to analyze feature boundaries of the target 6-mer (target network) in the hyperdimensional space (24 dimensions), while the other processes raw signals of 6-mers that have similar (but not identical) sequences to the target one (secondary network). Those neighboring raw signals are chosen so that the middle 4-mer nucleotide sequence of the input 6-mer is different from the target one (thus 4^4 − 1 = 255 combinations). During training, input to both the target and secondary networks is prepared from the IVT unmodified signal of 6-mer nucleotide sequences. The training is performed for each of the 4078 classes of 6-mers to obtain 4078 types of trained network. After training, the target network (architecturally an encoder-fused CNN-WaveNet network) is used during inference.

Fig. 3 Network training of nanoDoc2 (a) Decomposition of resquiggled signal to 6-mer, using bin size of 512. Neural network with WaveNet architecture is trained by 4078 labels from 6-mer combinations of four nucleotides (b) Neural network is stacked to encoder-decoder architecture and 6-mer classification is repeated to get a smaller dimension output (24 dimensions) (c) Deep one-class (DOC) training for each 6-mer. Signals of 6-mer (target network) and its neighbors with similar sequences (secondary network) were subjected to the DOC classification
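The count of 255 neighbors can be checked with a short Python sketch. It assumes one reading of the selection rule above, namely that a neighbor keeps the first and last base of the target 6-mer and differs in its middle 4-mer; the function name is hypothetical and this is not code from nanoDoc2.

```python
from itertools import product


def neighbor_6mers(target):
    """List the 6-mers that share the flanking bases of `target`
    but carry a different middle 4-mer (assumed neighbor definition)."""
    if len(target) != 6 or set(target) - set("ACGT"):
        raise ValueError("target must be a 6-mer over A, C, G, T")
    first, middle, last = target[0], target[1:5], target[5]
    neighbors = []
    for combo in product("ACGT", repeat=4):
        candidate = "".join(combo)
        if candidate != middle:  # exclude the target itself
            neighbors.append(first + candidate + last)
    return neighbors


if __name__ == "__main__":
    neighbors = neighbor_6mers("GATTAC")
    print(len(neighbors))  # 4**4 - 1
    print(neighbors[:3])
```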


Fig. 4 The result of network training, where the blue vertical line indicates maximum or minimum accuracy or loss in the validation dataset, respectively (a) Training and validation accuracy of initial training for 6-mer for 4078 classification, yielding maximum validation accuracy of 0.55 at epoch 37 (b) Training and validation loss of initial training for 6-mer for 4078 classification (c) Training and validation accuracy of additional training for 6-mer for 4078 classification, yielding maximum validation accuracy of 0.48 at epoch 187 (d) Training and validation loss of additional training for 6-mer 4078 classification

3.1.3 RNA Modification Detection Using Clustering

After training using IVT data, we used the trained networks to infer modifications in experimental WT samples. From IVT and WT reads, 6-mer signals (formatted to a length of 1024) of each site are extracted and then fed to the target network (Fig. 5a). The network emits a 24-dimensional output for each input signal. This is repeated for each site along the RNA, and for each site the outputs from the target network are compiled and clustered by the k-means algorithm with three clusters and 20 iterations (Fig. 5b). The idea behind clustering is to partition signals in the 24-dimensional space so that the presence of modification will be reflected as a separate cluster. In other words, if there were no RNA modification, the signals from IVT and WT should be present in similar numbers within each cluster; in the presence of RNA modifications, however, more biased clusters should be formed. Bias of abundance is calculated as the p-value of the chi-square test for each cluster, and the minus logarithm of the p-value, normalized by a constant factor, is used as the cluster score. The maximum score over all clusters indicates whether this site is likely to include a modification.

Fig. 5 Detection of modification and scoring of nanoDoc2 (a) During inference, input to target network is unmodified signals, while input to the secondary network is experimental data including modifications (WT). The output features of the same number of signals extracted from target and secondary networks are compiled together and applied to a k-means clustering algorithm (three clusters were sought). The cluster members are annotated by their correspondence to target and secondary networks (b) Bias of IVT or WT is calculated using chi-square test for each cluster (from an abundance of IVT and WT data in each cluster, as tabulated in the right panel for illustration). Independent clusters consisting of WT RNA reads indicate the existence of RNA modifications. Minimal p-value is reported as final score by taking minus logarithm and normalized

Panel b, chi-square test (illustration):
Cluster      Count IVT   Count WT   P-value
Cluster 1    296         210        0.7
Cluster 2    645         378        0.0001
Cluster 3    65          412        0.000001

3.2 Data Preparation

The default input file format of the Guppy basecaller is multifast5, a file format where multiple raw signals (default 4000) are accumulated in a hierarchical HDF5 file format. Each raw signal relates to one nucleotide molecule that passed through the pore molecule in the sequencing experiment. By basecalling, Guppy adds sequence data in fastq format as a new attribute to the data of each molecule. The output of Guppy is also a fast5 file with added information including the sequence data, the segmentation boundaries on the raw signal detected by Guppy, and 8-dimensional trace data (see above). NanoDoc2 works directly with the Guppy-generated multifast5 file. First, nanoDoc2 uses Viterbi decoding to resquiggle the raw signal using trace data (see Subheading 3.1.1). Prior to Viterbi decoding, a genome aligner is used to map basecalled sequences to one of the reference sequences. After Viterbi decoding, each raw signal is sorted according to the genomic position of the reference sequence and normalized against the theoretical current value using the least mean square approach. The theoretical current values are obtained from 5-mer models given in the following site: https://github.com/nanoporetech/kmer_models (provided by ONT). The output of the resquiggling and sorting step is written in Apache parquet (https://parquet.apache.org/) format, which will be used in the subsequent steps. The Apache parquet format is an open-source column-wise data format designed for efficient storage and retrieval of data. This format can also be easily parsed by popular scripting languages like Python.

3.3 Installation and Preparation of nanoDoc2

Execute the following commands to get the source from GitHub, install the required libraries (Table 1), and download the prepackaged learned weights for 6-mers. We recommend using a Python virtual environment to install nanoDoc2. The learned weights for 6-mers need to be downloaded from ZENODO (https://zenodo.org/), as described below. In order to run nanoDoc2, one requires experimental direct RNA sequence data for the sample of interest that include modification, as well as unmodified IVT reference data. In this chapter, we use downloaded IVT data as described in references [14, 26].

Table 1 Prerequisite Python libraries for nanoDoc2
Python library   Tested version         Python library   Tested version
funcy            1.17                   matplotlib       3.5.2
pyarrow          6.0.1                  mappy            2.23
numpy            1.19.2                 Bio              1.3.9
numba            0.53.1                 scikit_learn     0.23.2
scipy            1.5.4                  tf_agents        0.6.0
tqdm             4.31.1                 keras            2.4.3
tensorflow       2.5.2                  keras-nightly    2.5.0.dev2021032900
faiss-gpu        1.7.2                  biopython        1.79
h5py             3.1.0                  ont-fast5-api    4.0.0
click            8.0.4                  ruptures         1.1.5
pandas           1.1.5                  protobuf         3.20.1

To install nanoDoc2 and the prerequisite Python libraries, execute the following commands:
$ git clone https://github.com/uedaLabR/nanoDoc2.git
$ cd nanoDoc2
$ python3 -m venv venv3
$ source venv3/bin/activate
(venv3) $ pip install --upgrade pip
(venv3) $ pip install -r requirements.txt

To download the pre-learned weight, execute the following commands (if you would like to train the model from the IVT dataset yourself, skip the steps below and proceed to Subheading 3.5):
(venv3) $ cd
(venv3) $ mkdir weight6mer
(venv3) $ cd weight6mer

Now get the four archive files of learned weight from ZENODO (https://zenodo.org/), where each file is about 28 GB in size.
https://zenodo.org/record/6583336/files/weight_A.tar.gz
https://zenodo.org/record/6586529/files/weight_T.tar.gz
https://zenodo.org/record/6587256/files/weight_C.tar.gz
https://zenodo.org/record/6588796/files/weight_G.tar.gz

Place these four files in the ./weight6mer directory. For this, you may use the wget command to download these files to the ./weight6mer directory.
(venv3) $ wget https://zenodo.org/record/6583336/files/weight_A.tar.gz
(venv3) $ wget https://zenodo.org/record/6586529/files/weight_T.tar.gz
(venv3) $ wget https://zenodo.org/record/6587256/files/weight_C.tar.gz
(venv3) $ wget https://zenodo.org/record/6588796/files/weight_G.tar.gz


To extract the archived files and remove the archive files after extraction, execute the following commands (see Note 1):
(venv3) $ tar -zxvf ./weight_A.tar.gz
(venv3) $ tar -zxvf ./weight_T.tar.gz
(venv3) $ tar -zxvf ./weight_C.tar.gz
(venv3) $ tar -zxvf ./weight_G.tar.gz
(venv3) $ rm -r ./weight_?.tar.gz
(venv3) $ cd ./nanoDoc2

3.4 Inferring Modifications with nanoDoc2 Using Previously Trained Model

The inference workflow can be split into three steps. First, we need to prepare properly segmented data by using a genomic mapper (based on basecalled sequence) like Minimap2 [29] (corresponding Python wrapper is mappy) followed by our Viterbi resquiggling algorithm. Then, the segmented WT and IVT data are fed to the DOC classifier network. For each input data, the DOC classifier outputs its low-dimensional representation. All outputs from WT and IVT data for a fixed genomic location are then compiled to a list of outputs, which is then subjected to the k-means clustering in the third step. This clustering is repeated for each genomic location. The presence of modification at a genomic location is deduced by analyzing the biases of abundance of IVT or WT in clusters which is converted to a modification score. To visualize the prediction of modification, one can plot such scores to generate a modification probability plot as shown in Fig. 6. We evaluated our method for 16S and 23S ribosomal RNA of E. coli. With a threshold score of 0.1, 35 out of 36 modifications are successfully detected without a false-positive peak (Table 2). Note that this method has quasi base resolution, but it does not have the exact base resolution reflecting the fact that the effect of modification to the current signal and its position differs by the modification types. The highest score is detected in a range of -4 to +2 nucleotide distances with respect to the known modification sites, where all the 15 different types of modifications were detected (Table 2).
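The conversion of cluster bias into a modification score can be sketched in Python as follows. This is a minimal illustration and not the nanoDoc2 code: the 2 × 2 table used for each cluster (reads inside versus outside the cluster, IVT versus WT), the normalization constant of 100, the cap at 1, and the function names are all assumptions. The k-means step is not shown; the counts are the illustrative ones from Fig. 5b.

```python
import math


def chi_square_p(a, b, c, d):
    """P-value of the Pearson chi-square test (1 degree of freedom,
    no continuity correction) on the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence of bias
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def cluster_scores(ivt_counts, wt_counts, norm=100.0):
    """Score each cluster as -log10(p) / norm, capped at 1 (assumed scaling)."""
    total_ivt, total_wt = sum(ivt_counts), sum(wt_counts)
    scores = []
    for ivt, wt in zip(ivt_counts, wt_counts):
        p = chi_square_p(ivt, wt, total_ivt - ivt, total_wt - wt)
        p = max(p, 1e-300)  # guard against log10(0)
        scores.append(min(-math.log10(p) / norm, 1.0))
    return scores


def site_score(ivt_counts, wt_counts, norm=100.0):
    """The site score is the maximum score over all clusters."""
    return max(cluster_scores(ivt_counts, wt_counts, norm))


if __name__ == "__main__":
    ivt = [296, 645, 65]   # illustrative counts from Fig. 5b
    wt = [210, 378, 412]
    print(cluster_scores(ivt, wt))
    print(site_score(ivt, wt))
```

With these counts, the third cluster, which is dominated by WT reads, receives the highest score.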

3.4.1 Mapping and Resquiggling

Issue the following command under the ./nanoDoc2 directory for mapping reads to the reference and segmentation:
python ./nanoDoc.py fast5toresegmentedpq \
-i /path/to/multifast5_dir \
-o /path/to/out/out1_dir \
-r /path/to/out/ref.fa \
-fm /path/to/fmer_current \
-t 12 \
-qv 5 \
-mo 12 10 30 20

Fig. 6 Result of RNA modification detection using nanoDoc2, where the x-axis is the genomic position on the 16S (a) and 23S (b) ribosomal RNA of E. coli and the y-axis is the score for modifications. The blue line indicates the score from nanoDoc2, and vertical red lines indicate known modification sites [6, 25]

where
-i: input directory containing multifast5 files; multifast5 files are searched recursively under this directory,
-o: output directory where the parquet file output will be written,
-r: reference file in FASTA format,
-fm: theoretical k-mer model file (see Note 3),
-t: number of threads to use (default value is 12, optional),
-qv: average q-value threshold for basecall quality (default value is 5, optional). Reads with an average q-value below this threshold will not be used in the analysis,
-mo: mappy option (default values are 12 10 30 20, optional), where the values correspond to the mappy options ‘k’, ‘w’, ‘min_chain_score’, and ‘min_dp_score’, respectively (see Note 4).
Repeat the procedure for the WT sample of interest and the unmodified IVT sample. This will take 6–10 h of execution time in this example.
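Because the same command has to be run once for the WT sample and once for the IVT sample, it can be convenient to assemble it programmatically. The following Python sketch only builds the argument list from the options documented above; the helper name and all directory names in the example are hypothetical placeholders.

```python
def resquiggle_command(fast5_dir, out_dir, reference, kmer_model,
                       threads=12, min_qv=5, mappy_opts=(12, 10, 30, 20)):
    """Build the argument list for the fast5toresegmentedpq command.

    Defaults mirror the default values listed in the option description.
    """
    cmd = [
        "python", "./nanoDoc.py", "fast5toresegmentedpq",
        "-i", fast5_dir,
        "-o", out_dir,
        "-r", reference,
        "-fm", kmer_model,
        "-t", str(threads),
        "-qv", str(min_qv),
        "-mo",
    ]
    cmd.extend(str(value) for value in mappy_opts)
    return cmd


if __name__ == "__main__":
    # Hypothetical sample directories; replace with real paths.
    samples = {"WT": "/path/to/wt_fast5", "IVT": "/path/to/ivt_fast5"}
    for name, fast5_dir in samples.items():
        cmd = resquiggle_command(fast5_dir, "/path/to/out/" + name + "_pq",
                                 "/path/to/out/ref.fa", "/path/to/fmer_current")
        print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from within the ./nanoDoc2 directory.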

Table 2 NanoDoc2 score at the position of known modifications on E. coli rRNA
rRNA          Position   Modification   Detected relative position   NanoDoc2 score
E. coli 16S   516        Ψ              0                            0.58
E. coli 16S   527        m7G            -1                           0.86
E. coli 16S   966        m2G            -1                           0.88
E. coli 16S   967        m5C            -2                           0.88
E. coli 16S   1207       m2G            0                            0.75
E. coli 16S   1402       m4Cm           -1                           0.62
E. coli 16S   1407       m5C            -2                           0.41
E. coli 16S   1498       m3U            -3                           0.80
E. coli 16S   1516       m2G            2                            0.65
E. coli 16S   1518       m62A           0                            0.65
E. coli 16S   1519       m62A           -1                           0.65
E. coli 23S   745        m1G            2                            0.87
E. coli 23S   746        Ψ              1                            0.87
E. coli 23S   747        m5U            0                            0.87
E. coli 23S   955        Ψ              -3                           0.67
E. coli 23S   1618       m6A            -4                           0.11
E. coli 23S   1835       m2G            -1                           0.43
E. coli 23S   1911       Ψ              -1                           0.47
E. coli 23S   1915       m3Ψ            0                            0.59
E. coli 23S   1917       Ψ              -2                           0.59
E. coli 23S   1939       m5U            -1                           0.03
E. coli 23S   1962       m5C            -4                           0.48
E. coli 23S   2030       m6A            -4                           0.20
E. coli 23S   2069       m7G            0                            0.81
E. coli 23S   2251       Gm             2                            0.41
E. coli 23S   2445       m2G            2                            0.72
E. coli 23S   2449       D              0                            0.75
E. coli 23S   2457       Ψ              2                            0.92
E. coli 23S   2498       Cm             2                            0.84
E. coli 23S   2501       ho5C           -1                           0.84
E. coli 23S   2503       m2A            -3                           0.84
E. coli 23S   2504       Ψ              -4                           0.84
E. coli 23S   2552       Um             -2                           0.68
E. coli 23S   2580       Ψ              -1                           0.88
E. coli 23S   2604       Ψ              -2                           0.75
E. coli 23S   2605       Ψ              -3                           0.75

3.4.2 Modification Detection

For modification detection, run the following command:
python ./nanoDoc.py analysis \
-w /path/to/weight_dir \
-r /path/to/ref/ \
-rpq /path/to/IVT parquet file dir \
-tgpq /path/to/target parquet file dir \
-o /path/to/output result file \
-tsid E.coli16S \
-s 1 \
-e 1600 \
-minreadlen 500

where
-w: learned weight file (see Note 5),
-r: reference sequence file in FASTA format,
-rpq: directory of the IVT parquet file created by the “fast5toresegmentedpq” command,
-tgpq: directory of the WT parquet file created by the “fast5toresegmentedpq” command,
-o: result text file output,
-tsid: transcript identifier, name of the transcript in the FASTA file,
-s: start position,
-e: end position,
-minreadlen: minimum read length used in the analysis (default value is 500, optional).
This will take a few hours of execution time depending on the data size and length of the transcript (see Notes 6–8).

3.4.3 Interpretation of Output

The output from nanoDoc2 contains five columns:
1. Position or genomic position
2. 6-mer nucleotide
3. IVT depth used for analysis
4. WT depth used for analysis
5. Score
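As an illustration of how such a result file might be post-processed, the following Python sketch lists the sites whose score reaches the threshold of 0.1 used in Subheading 3.4. It assumes that each line holds the five columns in the order listed above, separated by whitespace, with no header line; this layout is an assumption and should be checked against the actual output file. The function names and the example lines are made up for illustration.

```python
def parse_result_line(line):
    """Parse one result line into a dictionary (assumed 5-column layout)."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("expected 5 columns, got %d" % len(fields))
    return {
        "position": int(fields[0]),
        "kmer": fields[1],
        "ivt_depth": int(fields[2]),
        "wt_depth": int(fields[3]),
        "score": float(fields[4]),
    }


def candidate_sites(lines, threshold=0.1):
    """Return the records whose score is at or above the threshold."""
    sites = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_result_line(line)
        if record["score"] >= threshold:
            sites.append(record)
    return sites


if __name__ == "__main__":
    # Made-up example lines, not real nanoDoc2 output.
    example = [
        "514\tGCCAGC\t1000\t980\t0.02",
        "515\tCCAGCA\t1000\t975\t0.31",
        "516\tCAGCAG\t1000\t990\t0.58",
    ]
    for site in candidate_sites(example):
        print(site["position"], site["kmer"], site["score"])
```

To read a real file, pass an open file object, for example candidate_sites(open("/path/to/result file")).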


It is possible to make plots like the one in Fig. 6 (without the annotation of modification type) by using a ready-made script in the following way (see Note 9):

python ./nanoDoc.py plotgraph \
  -f /path/to/result file \
  -o /path/to/output image file \
  -a /path/to/known-site \
  -s 16 4

where
-f: result text file generated by the “analysis” command,
-o: output file in PNG format; if omitted, a “result.png” file is created in the same folder as the input file,
-a: file containing a list of known modification sites for annotation purposes (optional),
-s: size of the output graph given as X Y, where X is the size along the x-axis and Y is the size along the y-axis (optional; default value is 16 4, meaning 16 × 4).

3.5 Training nanoDoc2 for IVT Data

This section explains how to train the nanoDoc2 DOC classifier. For this, we need IVT long-read data that is mapped and resquiggled by following the procedure described in Subheading 3.4.1. The output of the preparation phase is an Apache parquet file, which now will be fed to a script to make 6-mer data. After that, actual training using the DOC classification algorithm will be performed.

3.5.1 Prepare Training Dataset for Initial Training

As described above, the IVT data were fetched from the Curlcake dataset and the SARS-CoV-2 dataset and combined to prepare a mapped, resquiggled dataset (Subheading 3.4.1). Now, use the following script to prepare 6-mer data for initial training from the IVT data already obtained:

python ./nanoDocPrep.py make6mer \
  -r /path/to/reference file \
  -p /path/to/parquet file \
  -j False \
  -o /path/to/output \
  -takecnt 1200

where
-r: path to reference file, FASTA format (can be multiple),
-p: path to parquet file (can be multiple),
-o: path to output,
-j: if true, output will be merged into one file (default false),
-takecnt: read counts for each 6-mer.

For example:

python ./nanoDocPrep.py make6mer \
  -r /path/to/ref1.fa -r /path/to/ref2.fa \
  -p /path/to/parquet1_dir -p /path/to/parquet2_dir \
  -j True \
  -o /path/to/init_train6mer.pq \
  -takecnt 1200
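The -takecnt option can be pictured as a cap on the number of reads kept for each 6-mer. The sketch below illustrates that idea only. It is not the nanoDocPrep.py implementation: the function name `cap_per_kmer`, the (6-mer, signal) tuple layout, and the choice to keep the first reads encountered instead of sampling at random are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def cap_per_kmer(records: Iterable[Tuple[str, Sequence[float]]],
                 takecnt: int) -> Dict[str, List[Sequence[float]]]:
    """Keep at most `takecnt` signal segments for every 6-mer.

    `records` yields (6-mer, signal) pairs; this layout is hypothetical.
    The first `takecnt` segments seen for a 6-mer are kept (an assumption).
    """
    kept: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for kmer, signal in records:
        if len(kept[kmer]) < takecnt:
            kept[kmer].append(signal)
    return dict(kept)
```

With a cap of 1200, as in the example command, a 6-mer covered by many thousands of reads contributes no more training examples than a moderately covered one, which keeps the classes balanced.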

The network is pre-trained before the actual DOC classification with weight sharing. For this, we need to prepare an initial 6-mer dataset separately:

python ./nanoDocPrep.py make6mer \
  -r /path/to/ref1.fa -r /path/to/ref2.fa \
  -p /path/to/parquet1_dir -p /path/to/parquet2_dir \
  -o /path/to/init_train6mer.pq \
  -j True \
  -takecnt 1200

Note that the DOC architecture includes two WaveNet-based neural networks. To save time in loading the data, we make two 6-mer datasets, each saved as an independent file: one for the target network and another for the secondary network. See the following example.

For the target network:

python ./nanoDocPrep.py make6mer \
  -r /path/to/ref1.fa -r /path/to/ref2.fa \
  -p /path/to/parquet1_dir -p /path/to/parquet2_dir \
  -o /path/to/doctrain2_dir \
  -takecnt 12750

For the secondary network:

python ./nanoDocPrep.py make6mer \
  -r /path/to/ref1.fa -r /path/to/ref2.fa \
  -p /path/to/parquet1_dir -p /path/to/parquet3_dir \
  -o /path/to/doctrain3_dir \
  -takecnt 50


Here, we train the secondary network using only a small amount of input data. Therefore, we split our input data into three parts: one is used for pre-training, and the other two are used for the actual DOC classification. This will take a few hours of execution time depending on the size of the transcript.

3.5.2 Pre-training Model for 6-Mer Classification

The DOC architecture is too complex to train directly as a coupled network. We therefore pre-train the WaveNet network in two steps. In the first step, the network learns to assign each input 6-mer signal to one of 4078 classes. This number is slightly below the 4^6 = 4096 possible 6-mers because our IVT data lack a few of them. The neural network is trained to learn the features of the nanopore current while reducing the dimensionality from 1024 to 256. This is done by issuing the following command, which runs the optimizer for 100 epochs:

python ./nanoDocPrep.py traincnn \
  -i /path/to/init_train6mer.pq \
  -o /path/to/outputweight_dir \
  -epoches 100 \
  -device /GPU:0

This will take about 12 h of execution time using two NVIDIA A100 GPUs.

In the next step, we train the WaveNet network with an encoder-decoder stacked at its end (for 50 epochs in this case). Note that during this phase the layers already trained above are not trained further (transfer learning approach). The purpose of this autoencoder is to reduce the dimensionality of the output from 256 to only 24 (the dimension of the bottleneck layer). After this, we discard the decoder part and use the encoder-fused WaveNet as our DOC classifier. This dimensionality reduction is done via:

python ./nanoDocPrep.py traincnnadd \
  -i /path/to/init_train6mer.pq \
  -inw /path/to/inweight_dir \
  -o /path/to/outputweight_dir2 \
  -epoches 200 \
  -device /GPU:0

where -inw, “inweight_dir” is the “outputweight_dir” in the previous “traincnn” command. This will take about 24 h of execution time using two NVIDIA A100 GPUs.
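As a check on the class count quoted for the first pre-training step, the sketch below enumerates all 4^6 = 4096 possible 6-mers and reports which of them never occur in a set of reference sequences. It is an illustration only: the helper names are hypothetical, the alphabet is assumed to be A/C/G/T as in a FASTA reference (U is converted to T), and presence of a 6-mer in the reference is used as a stand-in for coverage by IVT reads, which is not the same thing.

```python
from itertools import product
from typing import Iterable, Set

ALPHABET = "ACGT"  # assumed reference alphabet
K = 6


def all_kmers(k: int = K) -> Set[str]:
    """Every possible k-mer over the alphabet (4**k of them)."""
    return {"".join(p) for p in product(ALPHABET, repeat=k)}


def observed_kmers(sequences: Iterable[str], k: int = K) -> Set[str]:
    """k-mers that occur at least once in the given sequences."""
    seen: Set[str] = set()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set(ALPHABET):  # skip windows containing N etc.
                seen.add(kmer)
    return seen


def missing_kmers(sequences: Iterable[str], k: int = K) -> Set[str]:
    """k-mers that never occur; these classes would have no training data."""
    return all_kmers(k) - observed_kmers(sequences, k)
```

Running `len(missing_kmers(reference_sequences))` on the combined IVT references gives a quick upper bound on how many 6-mer classes can be trained at all.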

3.5.3 Deep One-Class Classification for Each 6-Mer

After initial training of the DOC classifier network, two instances of the network are coupled and fed with the datasets prepared for the target and secondary networks. This training is performed with weights shared between the two networks. Use the following command to do this:

python ./nanoDocPrep.py traindoc \
  -d1 /path/to/inittrain2_dir \
  -d2 /path/to/inittrain3_dir \
  -o /path/to/output_weightdir \
  -inw /path/to/inweight_dir \
  -ssize 12750 \
  -epoches 3 \
  -device /GPU:0

where -inw “inweight_dir” is the “outputweight_dir2” in the previous “traincnnadd” command, and -ssize is the number of reads for each 6-mer used in DOC training (default 12,750). The output weights of this training are saved and later used for inference (as discussed above). This will take about 48 h of execution time using two NVIDIA A100 GPUs.
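The three training commands are chained through their weight directories: the output of “traincnn” is the -inw of “traincnnadd”, whose output is in turn the -inw of “traindoc”. The sketch below assembles the three argument lists so that this hand-off is explicit. It uses only the sub-commands, flags, and values shown in this section; the function name `build_training_commands`, the directory layout under `work_dir`, and the decision to return the commands instead of executing them are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def build_training_commands(init_pq: str, d1: str, d2: str,
                            work_dir: str, device: str = "/GPU:0") -> List[List[str]]:
    """Return argument lists for traincnn, traincnnadd, and traindoc, in order."""
    work = Path(work_dir)
    w1 = str(work / "outputweight_dir")    # written by traincnn
    w2 = str(work / "outputweight_dir2")   # written by traincnnadd
    w3 = str(work / "output_weightdir")    # written by traindoc
    script = ["python", "./nanoDocPrep.py"]

    traincnn = script + ["traincnn", "-i", init_pq, "-o", w1,
                         "-epoches", "100", "-device", device]
    traincnnadd = script + ["traincnnadd", "-i", init_pq, "-inw", w1,
                            "-o", w2, "-epoches", "200", "-device", device]
    traindoc = script + ["traindoc", "-d1", d1, "-d2", d2, "-o", w3,
                         "-inw", w2, "-ssize", "12750", "-epoches", "3",
                         "-device", device]
    return [traincnn, traincnnadd, traindoc]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in turn. Passing arguments as a list also avoids shell quoting problems when a path contains spaces.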

4 Notes

1. Reference sequences used in this study are deposited in the GitHub repository, under the nanoDoc2/references directory.

2. The most recent versions of TensorFlow and CUDA can probably be used (not tested). Keras 2.4 is also required in addition to TensorFlow because the Keras library is needed to build the joint network used in DOC training.

3. The k-mer model downloaded from the link below was used in this study: https://github.com/nanoporetech/kmer_models/blob/master/r9.4_180mv_70bps_5mer_RNA/template_median69pA.model

4. Minimap2 parameters may be loosened for heavily modified RNA sequences. Parameters for Minimap2 can be set in the “fast5ToReSegmentedPq” command described above, where
‘k’: k-mer size
‘w’: word size
‘min_chain_score’: minimum chain score
‘min_dp_score’: minimum dp score
One could decrease those parameters if the RNA is basecalled with many errors caused by heavy modifications.

5. In this example, ./weight6mer/docweight_copy should be designated as the weight directory.

6. All data described here were generated with the Nanopore pore version R9.4, and thus, trained weights are available only for version R9.4. We have not tested other pore versions.

7. The current version is only applicable to the plus strand, i.e., users need to prepare the transcript sequence as a reference.

8. For transcriptome-wide analysis with a large genome, testing is yet to be done.

9. This method does not have single-base resolution. The -2 to 4 positions should also be considered as candidates for modification sites.
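Note 9 can be applied mechanically when comparing a score peak with annotated sites. The sketch below expands a peak position into the candidate window from peak − 2 to peak + 4 and checks annotated sites against it. It assumes that the window in Note 9 is measured from the peak position and that the “Detected relative position” column of Table 2 is the peak position minus the annotated position. Both readings are assumptions, and the function names are hypothetical.

```python
from typing import Iterable, List


def candidate_window(peak: int, upstream: int = 2, downstream: int = 4) -> range:
    """Positions peak-2 .. peak+4 (inclusive), following Note 9."""
    return range(peak - upstream, peak + downstream + 1)


def sites_near_peak(peak: int, known_sites: Iterable[int]) -> List[int]:
    """Annotated sites that fall inside the candidate window of a peak."""
    window = candidate_window(peak)
    return [site for site in known_sites if site in window]


# Rows taken from Table 2 (E. coli 23S):
# (annotated position, detected relative position)
TABLE2_EXAMPLES = [(2457, 2), (2504, -4), (2580, -1)]

for site, relative in TABLE2_EXAMPLES:
    peak = site + relative  # assumed sign convention, see lead-in
    print(site, peak, site in candidate_window(peak))
```

Under this reading, a relative position between -4 and 2 corresponds exactly to an annotated site inside the peak − 2 to peak + 4 window.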

Acknowledgments

We thank Dr. Hiroyuki Aburatani and Dr. Genta Nagae (the Research Center for Advanced Science and Technology, the University of Tokyo) and Dr. Tsutomu Suzuki and Mr. Ryo Noguchi (Department of Chemistry and Biotechnology, Graduate School of Engineering, the University of Tokyo) for productive discussions and helpful advice. This work was supported by Exploratory Research for Advanced Technology (ERATO; JPMJER2002) from the Japan Science and Technology Agency (JST).

Kazuharu Arakawa (ed.), Nanopore Sequencing: Methods and Protocols, Methods in Molecular Biology, vol. 2632, https://doi.org/10.1007/978-1-0716-2996-3, © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023