Metagenomic Data Analysis 1071630717, 9781071630716

This volume describes different sequencing methods, pipelines and tools for metagenome data analyses. Chapters guide rea


English Pages 442 [443] Year 2023



Author / Uploaded
Suparna Mitra

Categories
Biology
Molecular

Table of contents :
Preface
Contents
Contributors
Chapter 1: From Genomics to Metagenomics in the Era of Recent Sequencing Technologies
1 From Microbial Genomics to Metagenomics
2 Metagenomic Applications
2.1 Metagenomic Sequencing: Development, Applications, and Techniques
2.2 Bacterial Analysis Using 16S Sequencing
2.3 Shotgun Sequencing
2.4 Investigation of Eukaryotic Microbes
2.5 Eukaryotic Sequencing: Fungal Organisms
2.5.1 Utilization of ITS for Fungal Sequencing
2.5.2 Alternative Methods of Fungal Sequencing
2.6 Microbial Biodiversity
2.7 Rarefaction Curves: Importance and Application
3 Next-Generation Sequencing Technology
3.1 Overview of Short-Read Sequencing Techniques
3.2 Massively Parallel Sequencing
3.3 454 Sequencing (Roche)
3.4 Polony Sequencing
3.5 Illumina Technology (Solexa)
3.5.1 NextSeq 500/550
3.5.2 NovaSeq 6000
3.6 SOLiD (Life Technologies)
3.7 Ion Torrent (Life Technologies)
3.8 c-PAS Sequencing (Complete Genomics)
3.9 DNA Nanoball Sequencing (Complete Genomics)
3.10 Helicos SMS (Helicos Biosciences)
4 Third-Generation Technology: Progression from Short-Read Sequencing
4.1 Single-Molecular Real-Time Sequencing (Pacific Biosciences)
4.2 Nanopore Sequencing (Oxford Nanopore Technologies)
References
Chapter 2: Quality Control in Metagenomics Data
1 Introduction
2 Considerations in Study Design and Methodology
3 Solutions to Support Reproducibility
4 Code Walkthrough Introduction
5 Downloading SRA Project Data
6 Ensuring Data Integrity
7 Quality Control Statistics
7.1 Basic Statistics
7.2 Per Base Sequence Quality
7.3 Per Tile Sequence Quality
7.4 Per Sequence Quality Scores
7.5 Per Base Sequence Content
7.6 Per Sequence GC Content
7.7 Per Base N Content
7.8 Sequence Length Distribution
7.9 Sequence Duplication Levels
7.10 Overrepresented Sequences
7.11 Adapter Content
7.12 K-Mer Content
7.13 Scaling Quality Control to Multiple Samples
8 Trimming and Filtering Reads
9 Removing Host Derived Content
10 Taxonomic Classification
11 Data Handling, Visualization, and Comparative Analysis
12 Best Practices in Dimensionality Reduction
13 Contamination and Ensuring Reliable Classifications
13.1 Likelihood of Contamination
13.2 Likelihood of False Positive Classification
13.3 Significance of Taxa Detection
13.4 Strength of Computational Evidence
14 Conclusion
References
Chapter 3: Metagenomics Databases for Bacteria
1 Bacterial Metagenomics
2 Introduction to Bacterial Metagenomics Database
3 Greengenes
4 SILVA
5 Ribosomal Database Project
6 Genome Taxonomy Database
7 Conclusions
References
Chapter 4: Amplicon Sequencing Pipelines in Metagenomics
1 Introduction to Amplicon Sequencing
2 The Pipeline for Amplicon Sequencing
3 Data Generation
4 Installation of Packages
5 Data Preprocessing
6 mothur 16S rRNA Amplicon Sequencing Data Analysis Pipeline
6.1 Download the Reference Data
6.2 Start Mothur Environment
6.3 Assembly of Paired Reads and Quality Control
6.4 Sequence Alignment and Quality Control
6.5 Advanced Quality Control
6.6 Taxonomic Analysis
6.7 Diversity Analysis
7 DADA2 16S rRNA Amplicon Sequencing Data Analysis Pipeline
7.1 Download the Reference Data
7.2 DADA2 Standard Sequence Denoising Procedure
7.3 Assembly of Paired-End Reads and Generation of ASV Table
7.4 Taxonomic Analysis
7.5 Diversity Analysis
8 Notes
References
Chapter 5: A Practical Guide to 16S rRNA Microbiome Analysis in Musculoskeletal Disorders
1 Introduction
2 Materials
2.1 FASTQ Files
2.1.1 Interpreting FASTQ Files
2.2 Metadata File
2.3 Manifest File
2.4 Software and Computing Needs
2.4.1 MEGAN
2.4.2 LEfSe
3 Methods
3.1 Demultiplexing Samples
3.2 QIIME 2
3.2.1 Directory Setup
3.2.2 Importing Demultiplexed FASTQ
3.2.3 Summarize the Demultiplexed Data and Review Quality
3.2.4 Denoise Using DADA2
3.2.5 Train RDP Classifier
3.2.6 Assign Taxonomy
3.3 MEGAN
3.4 LEfSe
4 Notes
References
Chapter 6: DIAMOND + MEGAN Microbiome Analysis
1 Introduction
2 Materials
2.1 Datasets
2.1.1 Short-read Samples
2.1.2 Long-read Samples
2.2 Software and Databases
2.3 Computational Resources
3 Methods
3.1 DIAMOND
3.1.1 DIAMOND Index
3.1.2 Short-read Alignment
3.1.3 Long-read Alignment
3.2 Meganization
3.3 Alignment and Meganization for Very Large Files
3.4 Interactive Analysis Using MEGAN
3.4.1 Tree Layout
3.4.2 Algorithm Parameters
3.4.3 Charts
3.5 Functional Analysis
3.5.1 Read Inspection
3.5.2 Alignment Viewer and Gene Centric Assembly
3.6 Comparative Analysis
3.7 Analyzing Long Reads
3.8 DIAMOND + MEGAN Analysis Using the AnnoTree Database
3.9 Megan-server
References
Chapter 7: Interactive Web-Based Services for Metagenomic Data Analysis and Comparisons
1 Introduction
2 Question One: Who Is There?
2.1 BV-BRC
2.2 RDP
2.2.1 RDP Classifier
2.2.2 RDPipeline
2.3 mothur in Galaxy Platform
2.4 Kaiju
2.5 PhyloPythiaS
3 Question Two: What Do They Do?
3.1 MG-RAST
3.2 WebMGA
4 Question Three: Are There Any Functional Correlations Between Microorganisms in a Particular Biome?
4.1 MicrobiomeAnalyst
4.2 WHAM!
5 Question Four: How Similar or Different Are Biomes from Each Other?
5.1 METAGENassist
6 Comprehensive Metagenomic Analysis
6.1 MGnify: EBI-Metagenomics
6.2 MGnify Data Workflow
6.3 Amplicon Analysis Pipeline
6.4 Metagenomic and Transcriptomic Raw Reads Pipeline
6.5 Assembly Pipeline
7 Challenges Using Web-Based Tools
8 Conclusion
References
Chapter 8: Application of High-Throughput Sequencing (HTS) to Enhance the Well-Being of an Endangered Species (Malayan Tapir):...
1 Introduction
2 Materials and Methods
2.1 Data Preprocessing
2.2 Uploading Sequences into MG-RAST
2.3 Submission of Sequences for Annotation
2.4 Post Sequence Submission
2.5 Analyzing the Annotated Dataset
2.6 Statistical Analysis
2.7 Data Visualization: Rarefaction
2.8 Stacked Bar Charts
3 Discussion
4 Notes
References
Chapter 9: Designing Knowledge-Based Bioremediation Strategies Using Metagenomics
1 Introduction
1.1 Next-Gen Sequencing Platforms
1.2 Comparative Metagenomics
2 Methods
2.1 MG-RAST Analysis
2.2 Taxonomic and Functional Gene Analysis
2.2.1 Data and Database Selection
2.2.2 Default Parameters
2.2.3 Selection of Features
2.2.4 Visualization Tools
2.2.5 MG-RAST Plugins
2.3 Annotation Systems, Data Normalization, and Validation
3 Methods
3.1 Case Study 1: Enhancing Biodegradation Process Efficiency by In Silico Analysis
3.2 Case Study 2: Comparative Metagenomics for Understanding the Impact of Seasonal Shifts on WWT Process Efficiency
3.3 Case Study 3: Soil Metagenomics
4 Notes
References
Chapter 10: Nanopore Sequencing Techniques: A Comparison of the MinKNOW and the Alignator Sequencers
1 Introduction
1.1 Sequencing History
1.2 MinION
2 Materials
2.1 Hardware and Software
2.2 RNA Isolation and Poly-A Enrichment
2.3 Library Preparation
3 Methods
3.1 RNA Extraction
3.2 Poly-A Enrichment Using NEBNext Poly(A) mRNA Magnetic Isolation Module (E7490)
3.3 Library Preparation and Direct RNA Sequencing Using MinION
3.4 Principals of Alignment Using MinKNOW
3.5 Principals of Alignment Using Alignator
4 Alignment of Nanopore Reads to Human cDNA Database Using the Alignator v1
4.1 Normalization of Read Counts
4.2 Gene Set Enrichment Analysis
5 Notes
References
Chapter 11: MAIRA: Protein-based Analysis of MinION Reads on a Laptop
1 Introduction
2 Materials
2.1 Hardware
2.2 Software
2.3 Datasets
3 Methods
3.1 Real-time Analysis
3.1.1 Main Graphical User Interface
3.1.2 Analysis Setup
3.1.3 Genus Identification
3.1.4 Species Identification
3.1.5 Virulence Factors and Antibiotic Resistance Genes
3.1.6 Exporting Data
3.1.7 Loading Files
3.1.8 Controls and Filters
3.2 Command-line Mode
3.2.1 Running Analysis
3.2.2 Exporting Data
3.2.3 Building New Databases
References
Chapter 12: Recovery and Analysis of Long-Read Metagenome-Assembled Genomes
1 Introduction
2 Materials
2.1 Data Collection
2.2 Software and Environment
3 Methods
3.1 Basecalling and Adapter Trimming in Long Reads
3.2 Quality Assessment of Raw Short Reads
3.3 Quality Trimming and Adapter Removal in Short Reads
3.4 Metagenome Assembly
3.4.1 Short-Read Assembly
3.4.2 Long-Read Assembly
3.5 Estimating the Coverage of Assembled Contigs Using Short Reads and Long Reads
3.6 Metagenome Binning of Contigs Assembled Using Short Reads
3.7 Quality Assessment of Recovered Genomes Bins
3.8 Taxonomic Classification of Recovered Genomes Bins
3.9 Error Correction of Long-Read Sequence
3.9.1 Frameshift Correction (MEGAN-LR)
3.9.2 Racon
3.9.3 Medaka
3.10 Comparative Analysis of Short- and Long-Read Assemblies
3.11 Gene Quality Assessment in Recovered Genomes
4 Notes
References
Chapter 13: Cloud Computing for Metagenomics: Building a Personalized Computational Platform for Pipeline Analyses
1 Introduction
2 Materials
3 Methods
3.1 Log into the Azure Portal
3.2 Select and Set Up a Virtual Machine (VM)
3.3 Add a Data Disk to the VM (Optional)
3.4 Logging into the Virtual Machine (VM) for Further Configuration
3.4.1 Connecting for the First Time
3.4.2 Apply Security Updates
3.4.3 Download and Install the Miniconda Python Distribution
3.4.4 Install QIIME2
3.4.5 Install and Set Up Jupyter Lab
3.4.6 Configuring the VM to Access the Data Disk (Optional)
3.5 Connect to Jupyter Lab Running on the VM Through Your Web Browser
3.6 Disconnect and Shut Down the VM
4 Further Exploration
5 Notes
References
Chapter 14: Artificial Intelligence in Medicine: Microbiome-Based Machine Learning for Phenotypic Classification
1 Introduction
2 Materials
3 Methods
3.1 Preparation of Microbiome Datasets
3.2 Machine Learning Classification
3.2.1 Dataset Preparation (See Note 2)
3.2.2 Machine Learning Modeling
4 Summary
5 Notes
References
Chapter 15: Tracking Antibiotic Resistance from the Environment to Human Health
1 Introduction
2 Environmental Resistome
3 Clinical Resistome
4 The Overlap Between Clinical and Environmental Resistome
5 Whole-Genome Sequencing in Detection and Control of Antimicrobial Resistance
6 Resistome Analysis Tools
7 Resistome Databases
8 Tools
9 Conclusions
References
Chapter 16: Targeted Enrichment of Low-Abundance and Uncharacterized Taxon Members in Complex Microbial Community with Primer-...
1 Introduction
2 Materials
3 Methods
3.1 Probe Design from Next Generation Sequencing Datasets
3.2 Evaluation of in Silico Specificity and Coverage of Probes
3.3 Sample Fixation
3.4 Fluorescent In Situ Hybridization and Microscopic Imaging
3.5 Optimizing Parameters for Fluorescent In Situ Hybridization (FISH)
3.6 Image Analysis: Quantitative FISH
3.7 Fixation-Free and In-Solution FISH for Fluorescence-Activated Cell Sorting (FACS)
3.8 Quality Check of Sorted Samples
3.9 Downstream Bioinformatic Analysis
4 Notes
References
Chapter 17: Assembly and Annotation of Viral Metagenomes from Short-Read Sequencing Data
1 Introduction
2 Materials
2.1 Hardware
2.2 Software
2.3 Sequences
2.4 Viral Databases
3 Methods
3.1 Read Quality Control, Adapter Trimming, and Decontamination
3.2 Contig Assembly
3.3 Viral Sequence Identification
3.4 Mapping to Reference Databases
4 Notes
References
Chapter 18: Manipulating and Basic Analysis of Tabular Metagenomics Datasets Using R
1 Introduction
2 R Language
3 Reading and Manipulating Tabular Data
3.1 Base R
3.2 Readr and the Tidyverse
4 Basic Analysis of Tabular Data
5 Summary
References
Chapter 19: Metagenomics Data Visualization Using R
1 Introduction
2 Visualization Options Within R
2.1 Base R
2.2 ggplot2
3 Common Comparative Visualization for Metagenomic Data
3.1 Alpha (α) Diversity and Beta (β) Diversity
4 Conclusions
References
Chapter 20: Comprehensive Guideline for Microbiome Analysis Using R
1 Introduction
2 Phyloseq
2.1 Application
3 MegaR
3.1 Application
4 DADA2
4.1 Pipeline workflow and functions
5 Metacoder
5.1 Application
6 MicrobiomeExplorer
6.1 Application
References
Index


Methods in Molecular Biology 2649

Suparna Mitra Editor

Metagenomic Data Analysis

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in a readily reproducible, step-by-step fashion, opening with an introductory overview and a list of the materials and reagents needed to complete the experiment, followed by a detailed procedure that is supported by a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Metagenomic Data Analysis Edited by

Suparna Mitra Leeds Institute of Medical Research, University of Leeds, Leeds, UK

Editor Suparna Mitra Leeds Institute of Medical Research University of Leeds Leeds, UK

ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-0716-3071-6
ISBN 978-1-0716-3072-3 (eBook)
https://doi.org/10.1007/978-1-0716-3072-3

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

Metagenomics is, by definition, the study of the collected genetic material (genomes) of a mixed community of organisms. Over almost two decades, next-generation sequencing (NGS) technologies have enabled monumental advances in microbial science, answering biological questions and clarifying the composition and function of microbiomes. Rapidly developing sequencing technologies have turned metagenomics into a widely used tool for microbiome studies, especially in clinical medicine and ecology. Consequently, the toolkit for metagenomic data analysis keeps growing and now offers multiple approaches to a wide range of problems. By taking advantage of NGS, metagenomics explores the taxonomic and functional components of the microbial community, including the bacteria and archaea present in samples. The complexity and diversity of the microbiome are beyond our imagination, and the continued study of both new and known species is of great importance. In this regard, the popularity of sequencing has shifted taxonomic analysis from the traditional phenotypic or morphological level to the high-resolution molecular level. Microbiome function has also been investigated using metagenomics, which helps in fields ranging from energy generation and conservation and the prevention of pathogenic invasion to food digestion and nutrition. While traditional genomic analysis of microbial communities provides an essential platform for researching many microbial species, metagenomics enables the population-level analysis of previously unknown and “unculturable” species. This book, “Metagenomic Data Analysis,” in the “Methods in Molecular Biology” series, includes comprehensive reviews of the most recent fundamental developments in bioinformatics methods for metagenomic data analysis and of the challenges associated with increasing data size and analysis complexity.

Throughout this volume, prominent authors in the field address the challenges and complexities of the newly available tools and techniques of metagenomics. The book provides a comprehensive collection of up-to-date protocols for metagenomics tools and databases and allows easy setup and step-by-step analysis by the user. It is intended for scientists looking for a compact overview of cutting-edge computational methods in metagenomics analysis. It may also serve as a comprehensive guide for graduate students planning to specialize in metagenomics and for researchers planning a new metagenomics project. The material, presented with examples and step-by-step guides, should be accessible to a novice biologist with a good grasp of standard computational concepts. Equally, the volume should be helpful for experienced researchers seeking more advanced knowledge and guidance. The first chapter is educational, providing background on the metagenomics field and on multiple sequencing techniques for readers with different backgrounds and skills. The second chapter focuses on quality control of raw sequence data. Quality control is critical for drawing robust and reproducible conclusions from metagenomic data. Numerous considerations can help maximize the quality of metagenomics data, including experimental design, the quality of the sequence data, and the quality of the reported microbial community.


The third chapter focuses on metagenomics databases for bacterial annotation. It presents the key concepts, technical options, and challenges of metagenomics projects, as well as the curation processes and versatile functions of four representative bacterial metagenomics databases: Greengenes, SILVA, the Ribosomal Database Project (RDP), and the Genome Taxonomy Database (GTDB). Chapters 4 and 5 provide examples of amplicon sequencing pipelines in metagenomics. The first example comprises two independent stepwise pipelines, using mothur and DADA2 in parallel, that present the basic principles of the analysis and enable comparisons between the two pipelines (Chap. 4). The second is a practical guide to 16S rRNA microbiome analysis using QIIME2 that highlights the utility of graphical microbiome tools for further analysis and for identifying biologically relevant taxa with reference to an outcome of interest (Chap. 5). Chapter 6 focuses on whole-genome shotgun metagenomic data analysis (the DIAMOND+MEGAN pipeline), covering taxonomic and functional potential, using publicly available datasets that contain both short-read and long-read samples.

Not every lab is equipped with high-performance computing power or has researchers with strong computational or programming knowledge. With that in mind, web-based bioinformatics tools have recently been developed to facilitate the analysis of complex metagenomic data without prior knowledge of a programming language or any special installation. Chapter 7 provides a simple guide to some of the fundamental web-based services for metagenomic data analysis, such as PATRIC, RDP, mothur, Kaiju, PhyloPythiaS, MG-RAST, WebMGA, MicrobiomeAnalyst, WHAM!, METAGENassist, and MGnify: EBI-Metagenomics. Two additional chapters (Chaps. 8 and 9) provide examples of metagenomic data analysis protocols using the MG-RAST metagenomics analysis server. The first is an interesting example from the conservation of an endangered animal species (the Malayan tapir), showing gut microbiome analysis using amplicon sequencing pipelines (Chap. 8). The other describes techniques in taxonomic and functional gene analysis for understanding bioremediation potential, along with novel strategies built on in silico analysis for improving existing aerobic wastewater treatment methods (Chap. 9).

Over the last decades, technical advances such as automation and increased sequencing rates due to parallelization further refined and accelerated Sanger sequencing. However, the production of short reads led to highly fragmented de novo assemblies of larger genomes. Third-generation sequencing (TGS) approaches, like the benchtop solution MinION (Oxford Nanopore Technologies), overcame this problem by directly sequencing the input DNA/RNA strands. The next two chapters (Chaps. 10 and 11) describe two tools focused on the analysis of MinION sequencing reads from microbiome samples. In addition, Chap. 12 outlines bioinformatics workflows for the recovery and characterization of complete genomes from long-read metagenome data (MinION), as well as some complementary procedures for comparing cognate draft genomes and gene quality obtained from short-read and long-read sequencing.

Cloud computing services such as Microsoft Azure, Amazon Web Services, and Google Cloud provide a range of tools and services that enable scientists to rapidly prototype, build, and deploy platforms for their computational experiments. Chapter 13 describes a protocol to deploy and configure an Ubuntu Linux virtual machine in the Microsoft Azure cloud, which includes Miniconda Python, a Jupyter Lab server, and the QIIME2 toolkit, configured for access through a web browser to facilitate a typical metagenomics analysis pipeline.


Advanced computational approaches in artificial intelligence, such as machine learning, are increasingly applied in the life sciences and health care to analyze large-scale, complex biological data such as microbiome data. Chapter 14 describes the experimental procedures for using microbiome-based machine learning models for phenotypic classification. This book contains two further special topics (Chaps. 15 and 16). The first focuses on clinical and environmental resistomes, the available databases, and computational tools for resistome analysis through the detection and characterization of antibiotic resistance genes (ARGs) in bacterial genomes and metagenomes (Chap. 15). The second is fluorescent in situ hybridization (FISH) coupled with fluorescence-activated cell sorting (FACS), a powerful tool that enables the detection, visualization, and separation of low-abundance microbial members in samples with complex microbial compositions. Chapter 16 describes the workflow from designing appropriate FISH probes from metagenomic or metatranscriptomic datasets to the preparation and treatment of samples for FISH-FACS procedures. Moving beyond bacterial metagenomics, this book also provides analysis protocols for viral metagenomics (Chap. 17), enabling the detection, characterization, and quantification of viral sequences in shotgun-sequenced datasets of purified virus-like particles and whole metagenomes.

Finally, three chapters are dedicated to R programming techniques and R packages. The first covers the manipulation and basic analysis of tabular metagenomics datasets (Chap. 18); the second focuses on metagenomics data visualization and comparative plots using basic R programming (Chap. 19). These two chapters come with easy step-by-step examples and code, giving a novice R learner enough background to start their own data analyses, including data manipulation, coding, and plotting with R. The final chapter presents a comprehensive guideline for microbiome analysis using the most widely used R packages (Chap. 20). Collectively, this timely and comprehensive collection of detailed metagenomic data analysis protocols will support many new and upcoming projects in the metagenomics field.

I would like to thank the Series Editor, Prof. John Walker, for giving me the opportunity to edit this volume. I want to extend my sincere gratitude to all the authors who contributed to this book. I deeply acknowledge their contributions, in-depth knowledge, and professional presentation. This book would not exist without their hard work. I would also like to express my gratitude to the collaborators at Springer Nature for their support and for their efficient and professional handling of this project. On behalf of all authors and the publisher, I thank all the readers of this book, and I hope it will become a source of inspiration and new ideas, stimulating further engagement and method development in this exciting field. Happy reading!

Leeds, UK

Suparna Mitra

Contents

Preface
Contributors

1 From Genomics to Metagenomics in the Era of Recent Sequencing Technologies
   Saskia Benz and Suparna Mitra
2 Quality Control in Metagenomics Data
   Abraham Gihawi, Ryan Cardenas, Rachel Hurst, and Daniel S. Brewer
3 Metagenomics Databases for Bacteria
   Dapeng Wang
4 Amplicon Sequencing Pipelines in Metagenomics
   Dapeng Wang
5 A Practical Guide to 16S rRNA Microbiome Analysis in Musculoskeletal Disorders
   C. M. Rooney and S. Mitra
6 DIAMOND + MEGAN Microbiome Analysis
   Anupam Gautam, Wenhuan Zeng, and Daniel H. Huson
7 Interactive Web-Based Services for Metagenomic Data Analysis and Comparisons
   Nehal Adel Abdelsalam, Hajar Elshora, and Mohamed El-Hadidi
8 Application of High-Throughput Sequencing (HTS) to Enhance the Well-Being of an Endangered Species (Malayan Tapir): Characterization of Gut Microbiome Using MG-RAST
   Ramitha Arumugam, Prithivan Ravichandran, Swee Keong Yeap, Reuben Sunil Kumar Sharma, Shahrizim Bin Zulkifly, Donny Yawah, and Geetha Annavi
9 Designing Knowledge-Based Bioremediation Strategies Using Metagenomics
   Niti B. Jadeja and Atya Kapley
10 Nanopore Sequencing Techniques: A Comparison of the MinKNOW and the Alignator Sequencers
   Sebastian Oeck, Alicia I. Tüns, and Alexander Schramm
11 MAIRA: Protein-based Analysis of MinION Reads on a Laptop
   Caner Bağcı, Benjamin Albrecht, and Daniel H. Huson
12 Recovery and Analysis of Long-Read Metagenome-Assembled Genomes
   Krithika Arumugam, Irina Bessarab, Mindia A. S. Haryono, and Rohan B. H. Williams
13 Cloud Computing for Metagenomics: Building a Personalized Computational Platform for Pipeline Analyses
   Martin Callaghan
14 Artificial Intelligence in Medicine: Microbiome-Based Machine Learning for Phenotypic Classification
   Xi Cheng and Bina Joe
15 Tracking Antibiotic Resistance from the Environment to Human Health
   Eman Abdelrazik and Mohamed El-Hadidi
16 Targeted Enrichment of Low-Abundance and Uncharacterized Taxon Members in Complex Microbial Community with Primer-Free FISH Probes Designed from Next Generation Sequencing Dataset
   Pui Yi Maria Yung and Shi Ming Tan
17 Assembly and Annotation of Viral Metagenomes from Short-Read Sequencing Data
   Mihnea R. Mangalea, Kristopher Keift, Breck A. Duerkop, and Karthik Anantharaman
18 Manipulating and Basic Analysis of Tabular Metagenomics Datasets Using R
   Alex Coleman and Martin Callaghan
19 Metagenomics Data Visualization Using R
   Alex Coleman, Anupam Bose, and Suparna Mitra
20 Comprehensive Guideline for Microbiome Analysis Using R
   Joseph Boctor, Mariam Oweda, and Mohamed El-Hadidi

Index

Contributors

EMAN ABDELRAZIK • Bioinformatics Group, Center of Informatics Sciences (CIS), Nile University, Giza, Egypt
NEHAL ADEL ABDELSALAM • University of Science and Technology, Zewail City, Giza, Egypt; Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
BENJAMIN ALBRECHT • CeGaT GmbH, Tübingen, Germany
KARTHIK ANANTHARAMAN • Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
GEETHA ANNAVI • Department of Biology, Faculty of Science, Universiti Putra Malaysia, Serdang, Selangor Darul Ehsan, Malaysia
KRITHIKA ARUMUGAM • Singapore Centre for Environmental Life Sciences Engineering, Nanyang Technological University, Singapore, Singapore
RAMITHA ARUMUGAM • Department of Biology, Faculty of Science, Universiti Putra Malaysia, Serdang, Selangor Darul Ehsan, Malaysia; Dataplx Consultancy, Puchong, Selangor, Malaysia
CANER BAĞCI • Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany; International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Developmental Biology, Tübingen, Germany
SASKIA BENZ • School of Medicine, University of Leeds, Leeds, UK
IRINA BESSARAB • Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
JOSEPH BOCTOR • Biotechnology Program, American University in Cairo (AUC), Cairo, Egypt
ANUPAM BOSE • Department of Mathematics, University of Leeds, Leeds, UK
DANIEL S. BREWER • Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK; Earlham Institute, Norwich Research Park, Norwich, UK
MARTIN CALLAGHAN • School of Computing, University of Leeds, Leeds, UK; Research Computing, IT Services, University of Leeds, Leeds, UK
RYAN CARDENAS • Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
XI CHENG • Bioinformatics & Artificial Intelligence Laboratory, Department of Physiology and Pharmacology, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
ALEX COLEMAN • Research Computing, IT Services, University of Leeds, Leeds, UK
BRECK A. DUERKOP • Department of Immunology and Microbiology, University of Colorado School of Medicine, Aurora, CO, USA
MOHAMED EL-HADIDI • Bioinformatics Group, Center for Informatics Sciences (CIS), Nile University, Giza, Egypt
HAJAR ELSHORA • Bioinformatics Group, Center for Informatics Sciences (CIS), Nile University, Giza, Egypt; Biomedical Informatics Program, School of Information Technology and Computer Science, Nile University, Giza, Egypt


ANUPAM GAUTAM • Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany; International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, Tübingen, Germany
ABRAHAM GIHAWI • Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
MINDIA A. S. HARYONO • Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
RACHEL HURST • Bob Champion Research & Education Building, Norwich Medical School, University of East Anglia, Norwich, UK
DANIEL H. HUSON • Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
NITI B. JADEJA • Ashoka Trust for Research in Ecology and the Environment, Royal Enclave, Bengaluru, India
BINA JOE • Bioinformatics & Artificial Intelligence Laboratory, Department of Physiology and Pharmacology, University of Toledo College of Medicine and Life Sciences, Toledo, OH, USA
ATYA KAPLEY • Environmental Biotechnology and Genomics Division, National Environmental Engineering Research Institute (CSIR-NEERI), Nagpur, India
KRISTOPHER KEIFT • Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
MIHNEA R. MANGALEA • Department of Immunology and Microbiology, University of Colorado School of Medicine, Aurora, CO, USA
SUPARNA MITRA • Leeds Institute of Medical Research, University of Leeds, Leeds General Infirmary, Leeds, UK
SEBASTIAN OECK • Department of Medical Oncology, West German Cancer Center, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
MARIAM OWEDA • Bioinformatics Group, Center of Informatics Sciences (CIS), Nile University, Giza, Egypt
PRITHIVAN RAVICHANDRAN • Perdana University Graduate School (PUGSOM), Perdana University, Serdang, Selangor, Malaysia
C. M. ROONEY • Leeds Institute of Medical Research, University of Leeds, Leeds, UK
ALEXANDER SCHRAMM • Department of Medical Oncology, West German Cancer Center, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
REUBEN SUNIL KUMAR SHARMA • Faculty of Veterinary Medicine, Universiti Putra Malaysia, Serdang, Selangor Darul Ehsan, Malaysia
SHI MING TAN • Singapore Centre for Environmental Life Sciences Engineering (SCELSE), Nanyang Technological University (NTU), Singapore, Singapore
ALICIA I. TÜNS • Department of Medical Oncology, West German Cancer Center, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
DAPENG WANG • National Heart and Lung Institute, Imperial College London, London, UK; Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK; LeedsOmics, University of Leeds, Leeds, UK
ROHAN B. H. WILLIAMS • Singapore Centre for Environmental Life Sciences Engineering, National University of Singapore, Singapore, Singapore
DONNY YAWAH • Department of Wildlife and National Parks (DWNP), Wildlife Genetic Resource Banking Laboratory, Ex-Situ Conservation Division, Peninsular Malaysia, Ministry of Natural Resources and Environment Malaysia (NRE), Kuala Lumpur, Malaysia


SWEE KEONG YEAP • China-ASEAN College of Marine Sciences, Xiamen University Malaysia, Sepang, Selangor, Malaysia
PUI YI MARIA YUNG • Singapore Centre for Environmental Life Sciences Engineering (SCELSE), Nanyang Technological University (NTU), Singapore, Singapore
WENHUAN ZENG • Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
SHAHRIZIM BIN ZULKIFLY • Department of Biology, Faculty of Science, Universiti Putra Malaysia, Serdang, Selangor Darul Ehsan, Malaysia


Chapter 1
From Genomics to Metagenomics in the Era of Recent Sequencing Technologies
Saskia Benz and Suparna Mitra

Abstract
Metagenomics, also known as environmental genomics, is the study of the genomic content of a sample of organisms obtained from a common habitat. Metagenomics and other “omics” disciplines have captured the attention of researchers for several decades. The effect of microbes in our body is a relevant concern for health studies. Through sampling the sequences of microbial genomes within a certain environment, metagenomics allows study of the functional metabolic capacity of a community as well as its structure based upon distribution and richness of species. The exponentially increasing number of microbiome publications illustrates the importance of sequencing techniques, which have allowed the expansion of microbial research into areas including the human gut, antibiotics, enzymes, and more. This chapter illustrates how the field of metagenomics has evolved with the progress of sequencing technologies. Further, from this chapter, researchers will be able to learn about the current options for sequencing techniques and compare their cost and read statistics, which will be helpful for planning their own studies.

Key words Metagenomics, Next-generation sequencing (NGS), High-throughput sequencing (HTS), 16S sequencing, Shotgun sequencing, Eukaryotic sequencing, Polony sequencing, Third-generation sequencing technology

1 From Microbial Genomics to Metagenomics

Metagenomics is a rapidly advancing field within microbial sciences, providing a unique insight into the diversity of microbial communities. The plethora of applicable sources ranges from air, soil, and marine sites to those found within animals and the human body. Metatranscriptomics informs us of the genes that are expressed by the community as a whole. The metagenomic study of bacterial communities thus enables progressive understanding of microbial interactions implicated within medicine, ecology, agriculture, and biotechnology.

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Microbiome function has been investigated using metagenomics in areas ranging from energy generation and conservation to prevention of pathogenic invasion, food digestion, and nutrition. While traditional genomic analysis of microbial communities provides an essential platform for researching many microbial species, metagenomics enables analysis at a population level of previously unknown and “un-culturable” species. Over almost the past two decades, next-generation sequencing (NGS) technologies have enabled monumental advances within microbial science. Furthermore, the development of novel NGS methods in translational research, including forensic science, clinical diagnostics, and microbial science, has greatly altered the landscape of research on microbial genomics [1]. Through sampling the sequences of microbial genomes within a certain environment, metagenomics allows study of the functional metabolic capacity of a community as well as its structure based upon the distribution and richness of species. As of June 2022, a PubMed search shows over 25,000 metagenomic studies published in the last decade, with annual output rising steeply to nearly 4000 papers in 2020 alone [2]. This illustrates the importance of this technique, which has allowed the expansion of microbial research into areas including the human gut, antibiotics, enzymes, and more. Metagenomics has existed since the late 1990s and grew rapidly thereafter, initially owing to the Roche/454 Life Sciences Genome Sequencer, which enabled long-read, high-throughput sequencing [3].

2 Metagenomic Applications

Regarding human health, the microbiome has been of increasing interest, particularly in relation to the human gut. With over a thousand species of microbiota living and interacting as complex communities, the gut has important implications in regulating metabolism and immunity, in protecting the host against harmful pathogens, and even in relation to hormones. Traditional genomic techniques are limited in how effectively they identify microbes and in how far they let us understand microbial roles in health and disease. Metagenomics, however, has proven an essential platform to study the gut microbiome, dysbiosis, and diversity, as well as to discover novel microbial interactions and pathways, functional genes, and how the microbiome has evolved within its host [4]. Human-microbial interactions are not, however, limited to the gut: important communities have been studied using metagenomics within the oral cavity, skin, and vagina, along with the subsequent effects of the microbiome on the brain [5]. Furthermore, the microbiome found in soil vitally influences not only the growth of plants but also animal life and many different ecosystems; therefore, metagenomics has been paramount even in relation to climate change [6]. Moreover, metagenomic approaches have been utilized within the cattle industry in several facets, including cheese production [7]. The sheer breadth of fields in which metagenomic research can provide insight into microbial populations is astounding. Importantly, despite differing areas of study and samples taken, data analysis methods remain similar with little distinction, as described in the upcoming chapters of this book. The applications of metagenomics within different microbial species and areas of research will be discussed throughout this book, from quality control within metagenomics in Chap. 2 to data visualization in Chap. 18. This introductory chapter will therefore provide an initial exploration of metagenomic sequencing techniques, including the advantages and disadvantages, applications, and research relating to each.

2.1 Metagenomic Sequencing: Development, Applications, and Techniques

High-throughput sequencing (HTS) is the approach most commonly applied in metagenomics, providing scientists and researchers with an affordable and rapid technique that has revolutionized how DNA sequences are obtained. HTS has dramatically reduced the price and time burdens of sequencing. Human genome sequencing costs have fallen from billions of USD in 2001 to $1000 in 2014 and around $600 in 2021, and sequencing now takes a matter of weeks rather than years [8]. HTS methods, principally based upon pyrosequencing and PCR, are therefore catalyzing the transformation of an array of scientific and clinical fields [1].

2.2 Bacterial Analysis Using 16S Sequencing

For decades, the 16S ribosomal RNA (rRNA) gene has been a principal focus for bacterial analysis and sequencing. 16S rRNA is a commonly used biomarker to investigate bacterial diversity, identify species, and understand communities using metagenomic processes. Numerous studies have explored the oral cavity by using 16S rRNA sequencing to determine the diversity and microbial species found in disease and in health, and across varying periods of time and space. Such methodology allows the study of how bacterial communities change during the development of oral diseases, including periodontal disease, gingivitis, and caries. Such research allows investigation into which microbes are implicated in disease, thereby creating targets for disease prevention and treatment [9]. 16S rRNA gene sequencing uses PCR to target and amplify parts of the hypervariable regions (V1–V9) of the 16S gene in bacteria. The most popular regions include V2–V3 and V3–V4, as these areas contain the maximum nucleotide heterogeneity and display the maximum discriminatory power. Bukin and colleagues compared the resolution of the V2–V3 and V3–V4 16S rRNA regions for estimating microbial community diversity using paired-end Illumina MiSeq reads, and showed that the fragment including the V2 and V3 regions has higher resolution for lower-rank taxa (genera and species) [10].
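The idea of targeting a region of the 16S gene with a primer pair can be illustrated in silico. The following is a minimal sketch, not a description of any published pipeline: the function names, the toy template, and the primer sequences are all hypothetical and do not correspond to real 16S primers. It locates a forward primer and the binding site of a reverse primer on a template and returns the sequence between them, allowing IUPAC degenerate bases in the primers.

```python
import re

# IUPAC degenerate nucleotide codes mapped to regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_to_regex(primer):
    """Translate a primer with IUPAC codes into a regular expression."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(template, forward_primer, reverse_primer):
    """Return the region between the two primer sites (primers excluded).

    Returns None when either primer site is absent. Only exact matches
    (up to degenerate codes) are considered; mismatches are not modeled.
    """
    template = template.upper()
    fwd = re.search(primer_to_regex(forward_primer), template)
    if fwd is None:
        return None
    # The reverse primer anneals to the opposite strand, so on the given
    # strand we search for its reverse complement downstream of the
    # forward primer.
    downstream = template[fwd.end():]
    rev = re.search(primer_to_regex(reverse_complement(reverse_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


# Toy template (hypothetical): flank + forward site + target + reverse site + flank.
toy_template = "TTTT" + "ACGTAC" + "GGGCCCAAA" + "TTGGCA" + "CCCC"
amplicon = extract_amplicon(toy_template, "ACGTAY", "TGCCAA")
```

Real amplicon workflows additionally tolerate primer mismatches and handle reads in both orientations, which this sketch deliberately leaves out.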


Following molecular barcode pooling and sequencing, raw data are mapped onto taxonomy databases (e.g., NCBI-nr) and analyzed using bioinformatics to create a taxonomic profile [11]. However, it is worth noting that while the apparently “user friendly” 16S rRNA sequencing analysis tools can be utilized by non-bioinformaticians, recent data indicate that superficial changes in the bioinformatics may lead to radical changes in biological outcomes. The recommendation is therefore to understand all the parameters and metadata while doing the analyses or to partner with a bioinformatician when analyzing such 16S data [12]. 16S rRNA sequencing has also been used in a variety of areas, including clinical samples regarding gut microbiota found in irritable bowel syndrome patients compared to controls [13], the fecal microbiome of type 2 diabetic rats [14], and environmental samples regarding microbial dynamics in soil [15]. More recently, HTS of the entire gene has been performed, so that millions of sequence reads spanning the gene can be discriminated using third-generation technologies, including denoising algorithms [16] and circular consensus sequencing (CCS) [17]. More exploration into 16S rRNA microbiome sequencing can be read in Chaps. 4 and 5 of this book.

2.3 Shotgun Sequencing

The final sequencing method this chapter will touch upon, before exploring short- and long-read HTS methods in greater depth, is shotgun sequencing. Shotgun metagenomic sequencing involves the random breakage of long DNA molecules, for example chromosomes, followed by sequencing of the resulting fragments [18]. In combination with 16S rRNA analysis, shotgun sequencing provides effective data regarding the microbes found in different communities [18]. The majority of studies perform shotgun sequencing, 16S sequencing, or both for microbial research. Unlike 16S rRNA sequencing, however, shotgun sequencing utilizes all of a sample’s genomic DNA, together with databases such as NCBI-nr [19], to generate a taxonomic profile. There have been several medical projects using shotgun metagenomics, for example, in gastric cancer [20], pancreatitis [21], Parkinson’s disease [22], and many more. The biggest shotgun project so far is the Dutch Microbiome Project. A recent manuscript from Gacesa and coauthors described bacterial composition, function, antibiotic resistance, and virulence factors in the gut microbiomes of 8208 Dutch individuals from a three-generational cohort comprising 2756 families [23]. Shotgun sequencing has been very popular not only for human samples but for environmental samples as well. For example, a study by Akinola et al. used shotgun sequencing of the microbiome found in the maize rhizosphere to review the metabolic profiles of soil from varying locations. Understanding such microbiomes may aid sustainability within agriculture, if soils can be optimized for plant growth [24]. This method allows native niches to be understood more effectively without culturing bias. Saxena et al. used shotgun metagenomics to reveal the influence of land use and rain on the benthic microbial communities in a tropical urban waterway [25]. There are many such examples. The biggest challenge with studying microbes is the complexity with which communities live and interact. As described, NGS technologies have augmented the ability to sequence microbes in greater quantity and within different environments. This is of particular interest within humans, as microbial organisms form communities which work symbiotically in a plethora of ways, from enabling digestion [26] to facilitating defense against pathogens [18]. In order to understand the complexity of such microbiomes, different sequencing techniques are used. For example, while 16S or ITS sequencing has high bacterial and fungal coverage, shotgun sequencing overall has greater taxonomic resolution [27].

2.4 Investigation of Eukaryotic Microbes

Unlike bacterial microbes, eukaryotic microbes are ignored in many projects, in part due to a lack of understanding of eukaryotic microbial communities at this level of detail, but potentially also due to technical issues. For example, while eukaryotic protist diversity was discovered to be higher than originally predicted, estimates suggest that fewer than 10% of rDNA sequences have currently been detected [28]. Importantly, eukaryotic microbes have extensive roles as parasites, predators, producers, and organic degraders [29, 30]. Therefore, the lack of deeper understanding of their diversity, ecology, taxonomy, and evolution, particularly in comparison to their bacterial counterparts, is a significant issue to resolve. More specifically, the presence of certain protozoa in the human gut may influence the composition of microbial communities and bacterial diversity, thereby affecting human health [30]. For example, data show that the enteric protozoan Blastocystis has a potential role in modulating the gut microbiome ecosystem in humans by changing microbiome structure and also altering host immunity [31]. Primers targeting the 18S rRNA gene are critical for the exploration of eukaryotic microorganisms, and such primers have been used in many studies to investigate eukaryotic communities in varying environments. For eukaryotic molecular phylogeny, 18S rRNA sequencing is the gold standard, using both quantitative PCR (q-PCR) and advanced sequencing technology [32]. Full-length 18S rRNA sequencing can also be performed using Oxford Nanopore technology [33] (see Chap. 10) or Pacific Biosciences (PacBio) single-molecule real-time (SMRT) sequencing [34]. Short-length sequencing is also possible using Illumina PE250/300 technology [1]. Many of these aforementioned techniques will be introduced shortly and examined in greater depth in the upcoming chapters of this book.


2.5 Eukaryotic Sequencing: Fungal Organisms

As the second largest eukaryotic kingdom on Earth, and a particularly diverse one, Fungi are of significant interest within research. Despite this, over 90% of an estimated 5.1 million species currently remain undiscovered. The principal fungal phyla are Basidiomycota and Ascomycota, which encompass the best-known species and the largest group of Fungi, respectively [35]. In terms of utilization, the Ascomycota are important within both medicine and the food industry, and as model organisms for the study of cell biology and genetics, while the Basidiomycota have roles in mushroom production and carbon cycling [36]. There are a further six phyla, ranging from the intracellularly parasitic Microsporidia to the soil-inhabiting Blastocladiomycota. This kingdom shows huge variation in organism function, morphology, life strategy, and habitat [35]; therefore, different techniques are used for identification and discovery of such species.

2.5.1 Utilization of ITS for Fungal Sequencing

Most commonly, the internal transcribed spacer (ITS) region is used to identify new species of Fungi. ITS comprises two components, ITS1 and ITS2, and these regions are used in preference to other genes as a fungal DNA barcode [37]. The vast amount of research required within this area is clear; therefore, HTS is paramount to understanding the communities and relationships surrounding Fungi. Combining HTS with the aforementioned DNA barcoding of fungal organisms has enabled a novel viewpoint toward the study of fungal biodiversity [35]. This again illustrates the use and importance of metagenomics, which allows the discovery of novel organisms and the study of potential clinical, agricultural, and environmental consequences of such findings.

2.5.2 Alternative Methods of Fungal Sequencing

Importantly, fungal organisms, including pathogens, may be detected by methods other than ITS, such as 18S rRNA sequencing. ITS is generally accepted to be superior to the 18S rRNA gene for identification of fungi, due to augmented phylogenetic resolution of the Aspergillus and Candida genera. However, this advantage is paired with lower sensitivity in detecting fungi, as shown by Wagner et al., who evaluated both techniques on clinical specimens [38]. Moreover, 18S rRNA sequencing benefited from taking less time and having greater potential for laboratory automation in the future [38].

2.6 Microbial Biodiversity

Any microbial community can be characterized by its biodiversity, indicated by the total number of species and the relative abundance of each microorganism. The diversity of microbial communities is best described using species richness, which is relevant in a variety of research areas from health and disease to ecological conservation studies. The quantification of species richness enables comparison between different niches as well as investigation into how saturated or diverse local colonies are within a given community. Species richness is therefore an indicator of biodiversity, from microorganisms in soil to larger mammalian organisms found across the globe [39].
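The two ingredients of biodiversity named above, the number of species and their relative abundances, can be computed directly from a vector of per-taxon counts. The sketch below is illustrative only: the function names and the example counts are assumptions, and the Shannon and Gini-Simpson formulas are the standard textbook definitions rather than a method prescribed by this chapter.

```python
import math


def species_richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def relative_abundances(counts):
    """Proportion of the total contributed by each observed taxon."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i), using natural logarithms."""
    return -sum(p * math.log(p) for p in relative_abundances(counts))


def gini_simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i^2): the probability that two
    individuals drawn with replacement belong to different taxa."""
    return 1.0 - sum(p * p for p in relative_abundances(counts))


# Hypothetical per-taxon read counts for two samples with equal richness.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

richness_even = species_richness(even_sample)
richness_skewed = species_richness(skewed_sample)
shannon_even = shannon_index(even_sample)
shannon_skewed = shannon_index(skewed_sample)
```

Both hypothetical samples contain four taxa, yet the evenly distributed one scores higher on the Shannon index, which is why richness alone does not capture the numerical composition of a community.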

2.7 Rarefaction Curves: Importance and Application

Rarefaction curves are visual representations of the statistical approximation of the number of operational taxonomic units (OTUs) expected in a random subsample drawn from a larger sample, enabling the measurement of relative species richness [40]. There are common issues with studies which utilize species richness as a measure of biodiversity within communities. In order to avoid such “pitfalls,” rarefaction curves may be used as a method to standardize and compare between datasets [39]. When collecting microorganisms from a given location, evaluating how effectively a sample reflects true biodiversity is paramount. Although novel sequencing techniques such as HTS have improved the detection of microbial species, it is presently impossible to discover and identify all species in a microbial community. Nevertheless, bioinformatic analysis tools are commonly used to compare diversity between microbial samples, including the Simpson and Shannon-Weaver diversity indices, both computed from OTUs. Although species richness is often treated as synonymous with the biodiversity of a sample, its true value is elusive to quantify. As a greater number of individuals is sampled, more species will be found. This is equally the case for higher taxa, whether genera or phyla [41]. The rationale behind using rarefaction curves is therefore that the curve rises steeply at first, when most sampled individuals belong to species not yet recorded, then flattens as unrecorded species become progressively rarer [41], and eventually reaches a plateau when no further taxa are sampled. Without using taxon sampling and formulating rarefaction curves, comparisons of species richness between samples may have significant weaknesses. For example, there may be differences in the number of species collected, in the distribution of relative taxa abundance, or in underlying species richness [42]. Moreover, the comparison of raw data regarding species richness is only valid when such rarefaction curves reach a definitive plateau. However, this may not always be possible, as illustrated by taxon sampling within certain tropical areas [43, 44]. Nevertheless, rarefaction curves indicate the point beyond which additional sequencing depth yields no further novel species from a population, therefore allowing the most reliable and effective study of species richness and biodiversity.
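The rise-then-plateau behavior described above can be reproduced with a short calculation. The sketch below uses the classical analytical rarefaction formula, in which the expected number of taxa in a random subsample of n individuals is the sum over taxa of 1 − C(N − N_i, n)/C(N, n), where N is the total count and N_i the count of taxon i. The function names and the example counts are assumptions made for illustration; this is not necessarily the procedure used by the tools cited in this chapter, several of which rarefy by repeated random subsampling instead.

```python
from math import comb


def expected_richness(counts, n):
    """Expected number of taxa in a random subsample of n individuals
    drawn without replacement from a sample with the given taxon counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the sample total")
    denominator = comb(total, n)
    # comb(total - c, n) / denominator is the probability that taxon c
    # is entirely missed by the subsample.
    return sum(1 - comb(total - c, n) / denominator for c in counts)


def rarefaction_curve(counts, step=1):
    """Return (depth, expected richness) pairs from 0 up to the full sample."""
    total = sum(counts)
    depths = list(range(0, total + 1, step))
    if depths[-1] != total:
        depths.append(total)
    return [(n, expected_richness(counts, n)) for n in depths]


# Hypothetical OTU counts: a few dominant taxa and several rare ones.
otu_counts = [50, 30, 15, 4, 1]
curve = rarefaction_curve(otu_counts, step=10)
```

Plotting the second element of each pair against the first gives the rarefaction curve; for these hypothetical counts the curve climbs quickly while the dominant OTUs are being picked up and approaches the observed richness of five only as the rare OTUs are finally sampled.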

3 Next-Generation Sequencing Technology

3.1 Overview of Short-Read Sequencing Techniques

Second-generation sequencing can be divided into two groups: sequencing by synthesis, and sequencing by hybridization and ligation (Fig. 1). Short-read sequencing techniques are a subset of next-generation sequencing, which has revolutionized research into metagenomics [1]. Read length, time, and cost are compared in Table 1, and each technique is explored in more depth in this chapter.

Fig. 1 Types of high-throughput sequencing technology

3.2 Massively Parallel Sequencing

Massively parallel sequencing techniques (MPSS) were first developed over 17 years ago [1], building upon traditional Sanger sequencing methods with the capacity for high-throughput results at a lower data cost. MPSS is used interchangeably with next-generation sequencing and is therefore an umbrella term for the high-throughput technologies described in this chapter, which have transformed genomic sequencing. During each cycle of MPSS, a type IIS restriction enzyme is required to cleave the target sequence, before subsequent ligation of a fluorescent linker at a specific sequence, rather than the cycles of polymerase extension used in previous methods [47]. Overall, this approach enables greater analysis depth and therefore statistical analysis to discover novel relationships between genes, even when they are expressed at low levels [47]. The utilization of MPSS is varied, from discovering novel transcripts within species of plants such as Arabidopsis [48] to clinical screening of entire genomes for novel mutations or pathogens which may cause human disease [49].

3.3 454 Sequencing (Roche)

The first NGS technology, released by Margulies et al. in 2005 [3], was 454 sequencing, a short-read sequencing technique based upon emulsion PCR and subsequent pyrosequencing. 454 Life Sciences, taken over by Roche in 2007, paved the way to longer read lengths than previous techniques could achieve, allowing a significant contribution to RefSeq microbial genomes. Additionally, 454 sequencing was important within metagenomics, including in determining the potential cause of the decline in honey bee numbers [50].

Table 1 Sequencing technologies: Read length, accuracy, reads and time per run, cost per 1 billion bases [45, 46]

| Sequencing technique | Read length (base pairs) | Volume per run | Time per run | Cost per one million bases (USD) | Sequencing chemistry |
|---|---|---|---|---|---|
| 454 Pyrosequencing | 700 bp | 0.7 Gb | 4–5 h | $10 per million bases | Pyrosequencing |
| cPAS sequencing (combinatorial probe anchor synthesis) | BGISEQ-50 is 35–50 bp; BGISEQ-500 is 50–300 bp | BGISEQ-50 is 160 M; BGISEQ-500 is 1300 M per flow cell | 1–9 days depending on the sequencer | Between $5 and $120 | Sequencing by synthesis |
| DNA Nanoball sequencing | 440–500 bp | 20–60 Gb | 9 days | $4400/genome | Hybridization and ligation |
| Helicos SMS (Single Molecule Sequencing) | 25–60 bp | 21–65 Gb | 8 days | $0.01 | Hybridization and synthesis |
| Illumina sequencing | HiSeq/MiSeq 100 bp | HiSeq = 120–1500 Gb; MiSeq = 0.3–15 Gb | HiSeq = 3–10 days | HiSeq = $0.02–0.07 per million bases; MiSeq = $0.13 per million bases | Reversible terminator sequencing |
| Ion Torrent sequencing | Maximum 600 bp | 200–500 bp | 4–5 h | Between $300 and $750 per run | Sequencing by synthesis |
| Nanopore sequencing | Up to 98 kb | Up to 30 Gb | 48/72 h | Less than $1 per million bases | Nanopore |
| Polony sequencing | 26 bp; 13 per colony | – | 49 h | Total $130,000 | Hybridization by ligation |
| Sanger | 800 bp | Approx. 800 bp | 2 h | $2400 per million | Dideoxynucleotide terminator |
| SOLiD | 50–75 bp | 30 Gb | 7–14 days | $0.13 per million | Sequencing by ligation |
| SMRT (single molecule real-time sequencing) | Greater than 900 bp | 0.5–1 Gb | 1–2 h | $2 per million bases | Sequencing by synthesis |


Despite the longer reads and fast run time, there are limitations with 454 sequencing (summarized in Table 2). For example, studies have shown the impact of accumulated light intensity variance, which leads to a high error rate in homopolymer regions, a problem not found with Illumina technology [51]. Moreover, 454 sequencing lost popularity due to its higher cost and the development of its competitor, Illumina sequencing, which offers higher throughput at a lower cost per base (Table 2). The technique was discontinued in 2016 [52]; however, the datasets it generated remain important.

3.4 Polony Sequencing

Another short-read sequencing technique is Polony sequencing, developed in the 1990s and 2000s and refined to sequence polonies in situ by 2003. Polony sequencing enables a genome of interest to be compared to a reference genome in a cost-effective and accurate way [53]. Polony amplification uses a thin polyacrylamide film [54]. This technique has been further adapted by Kim et al. into Polony Multiplex Analysis of Gene Expression (PMAGE), which combines the aforementioned amplification with sequencing by ligation [55]. PMAGE has the capacity to detect rare mRNAs present at as few as one transcript per three cells, allowing Kim et al. to identify changes in transcription indicative of future pathology in mice with hypertrophic cardiomyopathy [55]. Shendure and colleagues also applied Polony technology to strains of Escherichia coli for barcode sequencing. While limitations included a high volume of raw data collected (786 gigabits) with only marginal useful information (1 bit out of 10,000), overall they showed that Polony sequencing was a low-cost (compared to conventional methods), highly accurate technique [56].
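To put the quoted yield in perspective, the two figures above can be combined in a one-line calculation. This is only arithmetic on the numbers stated in the text, taking "1 bit out of 10,000" at face value as the useful fraction of the raw data:

```latex
\[
786 \times 10^{9}\ \text{bits} \times \frac{1}{10\,000}
  = 7.86 \times 10^{7}\ \text{bits}
  = 78.6\ \text{megabits of useful information}
\]
```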

3.5 Illumina Technology (Solexa)

Illumina technology, released by Solexa in 2007, has dominated NGS, contributing to 82% of bacterial genomes in RefSeq as of 2020 [57]. Short DNA fragments are attached by their ends to a solid surface and amplified by bridge amplification, after which fluorescently labeled dNTPs enable sequencing. Advantages of Illumina technology include its high accuracy and low cost; however, it can sequence reads of only a few hundred bases [52] and therefore may not be advantageous for every dataset. Illumina has been used for sequencing in a multitude of microbiome studies, from those examining the Zika virus in clinical samples [58] and dairy contamination [59] to oral health and periodontal disease [60]. Moreover, while Illumina could initially sequence only very short reads, improvement in this ability has enabled its utilization in thousands of research papers, resulting in rapid growth in the volume of microbiome sequencing and data.

3.5.1 NextSeq 500/550

Released in 2014, Illumina® NextSeq 500 was developed for ancient DNA, for contributions in palaeogenomics, the genomic analysis of degraded or ancient samples. Providing augmented flexibility of microarray scanning, further applications include transplantation medicine, prenatal amniotic fluid investigations of


Table 2 The advantages and disadvantages of high-throughput technologies [1, 75, 83, 99, 100]

| Technique | Pro | Con |
|---|---|---|
| 454 | Long reads (1 kb max); comparatively fast run time | Comparatively low throughput (approximately one million reads); high error rates in homopolymer repeats; cancelled by Roche in mid-2016 |
| cPAS | Non-PCR application for preparation of sequencing arrays; high technical reproducibility | Lower accuracy for indels |
| DNA Nanoball | Reduced reagent cost; lower number of optical duplicates | Short sequences of DNA |
| HeliScope Genetic Analysis System | Lack of bias with single-molecule sequencing; cost-effective and accurate for bacterial genome sequencing | High error rate (overcome with repetitive sequencing); underrepresentation of GC-rich/poor regions; short read lengths |
| Illumina | Highest throughput and lowest cost per base out of the short-read techniques; read lengths up to 300 bp; most library preparation protocols are compatible with Illumina | Sample loading requires tight control and is therefore technically challenging; requires sequence complexity; less complex 16S libraries must be mixed with a reference PhiX or diluted |
| Ion Torrent | No optical scanning or fluorescent nucleotides required due to semiconductor technology; fast run times (4–5 h); high number of applications in diverse fields | Higher error rate in homopolymers |
| Nanopore | Extremely long reads for long repetitive sequences, structural variations, and methylations; lower cost than second-generation sequencing | Relatively high error rate compared to short-read sequencing; systematic error, unlike SMRT sequencing (overcome with short-read sequencing data) |
| Polony | Highly accurate; low cost compared to conventional methods | Low percentage of useful data generated due to high yield |
| SMRT PacBio | Very long reads (20 kb and above), therefore effective for drafting genome improvements and finalizing genome assemblies; fast run time (1–2 h) | High cost (2 dollars per Mb); lowest throughput capability at approximately 500 Mb maximum; limited range of applications; relatively high error rates (approx. 14%); polymerase used has limited longevity |
| SOLiD | Second highest throughput technique of this cohort; low error rates; high accuracy (99.94%) | Extremely short reads (maximum is 75 nt); long run times; less applicable for de novo genome assembly; sample preparation services and kits are not as well developed due to its lower usage compared to Illumina |

12

Saskia Benz and Suparna Mitra

cell-free DNA, and analyzing preserved tissue samples [61]. The NextSeq 500 Sequencing System has been discontinued. The NextSeq 550 System is an alternative solution that provides the increased flexibility of microarray scanning in addition to sequencing. 3.5.2

NovaSeq 6000

The NovaSeq™ 6000 was subsequently released by Illumina® in 2017, with the aim of eventually enabling a $100 genome. Its uses include identifying SARS-CoV-2 variants and characterizing maize microbiomes to improve current agricultural practices [62, 63]. Both the Illumina NextSeq and NovaSeq 6000 were documented to enable equal read depth at a lower cost, with both being comparable to BGI MGISEQ-2000 and BGISEQ-500 [64].

3.6 SOLiD (Life Technologies)

The third NGS technology released was Sequencing by Oligo Ligation Detection (SOLiD), developed in 2007 by Life Technologies [65]. This, alongside Illumina technology, enabled the generation of a far greater number of reads than 454 sequencing. Owing to this higher throughput, ChIP-sequencing [66] and transcriptome profiling studies [67] have largely used SOLiD as well as Illumina sequencing. Moreover, as shown in Table 2, SOLiD technology benefits from low error rates and high accuracy [68]; however, because of its short reads it is used less for de novo genome assembly. SOLiD is also used less frequently than Illumina technology, so the associated sample preparation protocols and services are not as well developed [1].

3.7 Ion Torrent (Life Technologies)

An alternative sequencing platform to Illumina is the semiconductor-based Ion Torrent. First marketed in 2010 by Ion Torrent, this technique reads nucleotide sequences by measuring pH with a CMOS (complementary metal oxide semiconductor) sensor array chip [69]. Daum and colleagues used Ion Torrent Personal Genome Machines (PGM) to investigate drug resistance genes in Mycobacterium tuberculosis; the research highlighted the potential of Ion Torrent sequencing for global monitoring of multi-drug resistant strains [70]. Ion Torrent has also been used to identify shifts in soil fungal communities in relation to prescribed forest fires [71]. In a comparative study of the PGM and Illumina MiSeq, the bacterial community profiles obtained differed between platforms, which was hypothesized to be due to the lack of full-length reads for certain microorganisms using Ion Torrent [72]. Despite short read lengths, the PGM platform is a high-throughput, low-cost, and scalable method for metagenomic analyses and Tag sequencing [73] aimed at determining microbial community function and structure.


3.8 cPAS Sequencing (Complete Genomics™)

The combinatorial probe-anchor synthesis (cPAS)-based BGISEQ-500 sequencer was first announced in 2015 as a novel sequencing method by Complete Genomics [74]. Combining DNA nanoball nanoarrays with stepwise polymerase sequencing, cPAS was developed to investigate non-coding transcriptomes. Advantages of this technique (Table 2) are its non-PCR preparation of sequencing arrays and its high technical reproducibility [75]. Although data generated by cPAS are generally adequate, with SNP (single nucleotide polymorphism) detection accuracy similar to that of the Illumina HiSeq 2500, certain weaknesses, including poorer accuracy of indel detection [74], have been highlighted. For example, Fang et al. investigated the cPAS-based BGISEQ-500 for human gut metagenomics and confirmed its effective application as a novel platform for metagenomic studies, but recommended caution in combining data from multiple platforms [76]. For palaeogenomic data, Illumina sequencing has historically been dominant; however, Mak et al. compared the BGISEQ-500 against the Illumina HiSeq 2500 and concluded that cPAS technology could potentially be used in this area to overcome issues faced with Illumina, such as data generation costs [77].

3.9 DNA Nanoball Sequencing (Complete Genomics™)

DNA nanoball sequencing is another short-read technology developed by Complete Genomics. It uses nanoarrays on which many DNA nanoballs are sequenced simultaneously [75]. Nanoballs are created by rolling circle replication, in which small fragments of genomic DNA are amplified. As with many HTS techniques, the advantages shown in Table 2 include lower reagent and overall costs than traditional sequencing [75]. However, a limitation of DNA nanoball sequencing is the short reads generated, which may hinder subsequent mapping to reference genomes [78]. Applications of DNA nanoball sequencing include finding SNPs implicated in lung cancer as well as evidence of selection pressures within lung tumors [79]. SNPs involved in Mendelian disorders have also been investigated with DNA nanoball sequencing in the genomes of families affected by primary ciliary dyskinesia or Miller syndrome [80]. More recently, Kim and colleagues used nanoball sequencing to characterize the SARS-CoV-2 transcriptome, an essential preliminary step towards understanding its pathogenicity [81].

3.10 Helicos SMS (Helicos Biosciences)

Helicos™ Single Molecule Sequencing (SMS), marketed by Helicos Biosciences, was the earliest commercially used NGS method to utilize single-molecule fluorescent sequencing. By providing both sequence information and accurate quantification through the unbiased sequencing of cellular nucleic acids, Helicos SMS provides a unique perspective on genome biology [82]. With advantages (Table 2) including high cost-effectiveness and accuracy for the resequencing of bacterial genomes, HeliScope™ SMS can be applied to areas including epidemiology, diagnostics, and evolutionary biology [83], among others. For example, the M13 bacteriophage was sequenced using this method by Harris et al. [84]. The Helicos platform has also been investigated for non-invasive trisomy 21 detection, where it overcame GC bias [85].

4 Third-Generation Technology: Progression from Short-Read Sequencing

Short-read sequencing is accurate, cost-effective and supported by many tools for analysis. However, when longer nucleic acids are sequenced, short reads make reconstruction and analysis of the target molecules difficult [86]. Third-generation sequencing methods were developed to build upon previous techniques, enabling the generation of reads of over ten thousand bp [87]. Such technologies have allowed novel insights into both sequence diversity and evolution, through the production of de novo assemblies of a vast number of microbial genomes and reconstructions of animal and plant genomes [87]. The fundamental difference of third-generation sequencing is the combined advantage of single-molecule sequencing and extremely long read lengths, above 20 kb [88]. Third-generation sequencing methods are introduced below, and long-read techniques for microbial genomes are investigated in Chap. 12 of this book.
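The effect of read length on the number of fragments that must be reconstructed can be illustrated with the standard coverage relation C = N × L / G, where N is the number of reads, L the read length, and G the genome size. The following Python sketch is illustrative only: the 300 bp and 20 kb read lengths are taken from Table 2, whereas the 5 Mb genome size, the 30× target depth, and the function name are assumptions made for this example.

```python
import math


def reads_needed(genome_size_bp: int, read_length_bp: int, depth: float) -> int:
    """Smallest number of reads N such that N * L / G >= depth."""
    if genome_size_bp <= 0 or read_length_bp <= 0 or depth <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(depth * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 5_000_000  # assumed genome size, for illustration only
TARGET_DEPTH = 30           # assumed target coverage, for illustration only

# Read lengths quoted in Table 2: Illumina (300 bp) and SMRT (20 kb).
short_read_count = reads_needed(GENOME_SIZE_BP, 300, TARGET_DEPTH)
long_read_count = reads_needed(GENOME_SIZE_BP, 20_000, TARGET_DEPTH)

print(f"300 bp reads needed: {short_read_count:,}")
print(f"20 kb reads needed:  {long_read_count:,}")
```

Under these assumptions the long-read dataset contains far fewer fragments to place, which is one reason long reads simplify the reconstruction of repetitive regions.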

4.1 Single-Molecule Real-Time Sequencing (Pacific Biosciences)

The more established third-generation technology is single-molecule real-time (SMRT) sequencing, developed by Pacific Biosciences [34]. SMRT sequencing combines parallelized single-molecule DNA sequencing with fluorescence detection of each nucleotide as it is incorporated by a polymerase. With uses in environmental research, medicine, and agriculture, SMRT sequencing has enabled previously unmatched sequencing accuracy using long DNA molecules [88]. Applications include de novo genome sequencing, facilitated by the longer read length of SMRT sequencing, and the sequencing of full-length gene isoforms and splice variants [89]. For example, SMRT sequencing has been investigated as a strategy for identifying suspected parental gonadal mosaicism [90], owing to the long reads generated by the PacBio RS II. Moreover, in chronic myeloid leukemia, SMRT sequencing enabled significantly more sensitive detection of drug resistance [91] than traditional Sanger sequencing. Furthermore, in applied agriculture, SMRT sequencing has provided novel insights into the gene structures of Oryza rice species [92], among many other food crops.


A limitation of SMRT sequencing (Table 2) is the longevity of the polymerase used for fluorescence detection, although improvements had raised the average read length to 30 kb by 2018 [93]. As a more recent technology than short-read sequencing, long-read sequencing, including SMRT sequencing, is anticipated to continue evolving and to replace many short-read applications in the future [88].

4.2 Nanopore Sequencing (Oxford Nanopore Technologies)

Finally, the second third-generation technology released, which has further enhanced long-read sequencing, is nanopore sequencing. The device used is the Oxford Nanopore MinION, commercialized by Oxford Nanopore Technologies, which measures fluctuations in electrical current as DNA molecules move through a nanopore [33]. The distinguishing feature of nanopore sequencing is its direct detection of nucleotides without actively performing DNA synthesis [94]. Nanopore technology has been particularly advantageous for large, complex eukaryotic genomes, for example in de novo sequencing of the silk spider [95]. Spider silk protein genes have been studied for industrial applications because of the physicochemical properties of the silk, including high thermostability, tensile strength, and toughness. The challenges of spider genomes are their great length and the issues that arise from PCR amplification [95]; nanopore sequencing is one of the few technologies that provides a solution. Furthermore, applications of nanopore sequencing range from rapidly identifying pathogens in clinical samples, such as empyema from pleural effusion [96], to comparing the microbial composition of soil samples in relation to renewable energy [97]. Nanopore sequencing has several advantages over previous technologies, including lower cost (Table 2) than massively parallel sequencers, at around $1000 for the initial reagents and the nanopore device itself. Whereas short-read sequencing requires complex libraries, nanopore sequencing generates very long reads. It therefore enables de novo genome assembly and the determination of genome structural variations [98] and long repetitive sequences. Free of the polymerase limitations of SMRT sequencing, nanopore sequencing can provide read lengths from 500 bp to 2.3 Mb [93].

However, limitations of nanopore technology lie in the extraction and delivery of very high-molecular-weight DNA to the pore (Table 2), which affect run yield [99]. Nanopore sequencing currently also has a higher error rate, including both deletions and insertions, and therefore lower accuracy than short-read sequencing. Furthermore, Jain et al. found that nanopore errors appear to be systematic, unlike those of SMRT sequencing, so their correction also requires short-read sequencing data [99]. Nanopore techniques are discussed in Chap. 10 of this book, and specific usage of the portable MinION nanopore sequencer is covered in Chap. 11.
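The distinction between random and systematic errors can be made concrete with a simple majority-vote model. The sketch below is a deliberate simplification and not the consensus algorithm of any platform: it assumes that each of n passes over a base errs independently with probability p, and that all erroneous passes agree on the same wrong base (a worst case). The 14% per-read error rate is the approximate SMRT figure from Table 2; the function name and the pass counts are hypothetical.

```python
from math import comb


def majority_vote_error(p: float, n: int) -> float:
    """Probability that more than half of n independent passes are wrong.

    Assumes (pessimistically) that all wrong passes report the same base.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer (avoids ties)")
    first_majority = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(first_majority, n + 1)
    )


PER_READ_ERROR = 0.14  # approximate SMRT per-read error rate (Table 2)

for passes in (1, 5, 15):  # hypothetical pass counts
    random_case = majority_vote_error(PER_READ_ERROR, passes)
    # A systematic error recurs in every pass, so voting cannot remove it.
    systematic_case = PER_READ_ERROR
    print(
        f"{passes:>2} passes: random errors -> {random_case:.2e}, "
        f"systematic errors -> {systematic_case:.2e}"
    )
```

In this model the consensus error falls rapidly with the number of passes when errors are random, whereas a systematic error persists at every depth, which is consistent with the need for independent short-read data when correcting nanopore reads.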


In determining which sequencing technique to use, accuracy, resolution, read length, and read quantity are important to consider. Despite the huge increase in known microbial richness brought about by NGS and third-generation sequencing, taxonomists cannot name all of the organisms discovered, owing to the lack of available tools to interpret each of the millions of reads produced by sequencing. Overall, multiple sequencing techniques and approaches are available, as described in this introductory chapter, with Table 2 summarizing the strengths and weaknesses of each. To use each approach optimally, researchers should choose the technique most applicable to the purpose of their project. The subsequent chapters further illustrate how metagenomics and other recent sequencing technologies can be utilized within research.
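As a small illustration of matching a technique to a project requirement, the sketch below encodes the maximum read lengths quoted in this chapter and filters platforms by a required read length. It is a toy lookup, not a recommendation tool: the dictionary and function names are hypothetical, the SMRT value uses the 20 kb figure that Table 2 gives as a lower bound, and a real decision would also weigh accuracy, throughput, and cost.

```python
from typing import List

# Approximate read lengths (bp) quoted in Table 2 and Sect. 4.2 of this chapter.
MAX_READ_LENGTH_BP = {
    "454": 1_000,
    "Illumina": 300,
    "SOLiD": 75,
    "SMRT PacBio": 20_000,  # Table 2 states "20 kb and above"
    "Nanopore": 2_300_000,  # Sect. 4.2 quotes reads of up to 2.3 Mb
}


def platforms_with_reads_at_least(min_length_bp: int) -> List[str]:
    """Return, in sorted order, the platforms whose quoted read length suffices."""
    return sorted(
        name
        for name, length in MAX_READ_LENGTH_BP.items()
        if length >= min_length_bp
    )


# Hypothetical requirements: reads spanning a 10 kb repeat, or a 200 bp amplicon.
print(platforms_with_reads_at_least(10_000))
print(platforms_with_reads_at_least(200))
```

A requirement of 10 kb leaves only the two third-generation platforms, while a 200 bp requirement also admits 454 and Illumina; SOLiD, with a 75 nt maximum, is excluded in both cases.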

References

1. Van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C (2014) Ten years of next-generation sequencing technology. Trends Genet 30(9):418–426
2. PubMed: metagenomics – search results – PubMed (2022). https://pubmed.ncbi.nlm.nih.gov/?term=metagenomics&filter=datesearch.y_10. Accessed 24 June 2022
3. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA et al (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057):376–380
4. Moore-Connors JM, Dunn KA, Bielawski JP, Van Limbergen J (2016) Novel strategies for applied metagenomics. Inflamm Bowel Dis 22(3):709–718. https://doi.org/10.1097/mib.0000000000000717
5. Huttenhower C, Gevers D et al (2012) Structure, function and diversity of the healthy human microbiome. Nature 486(7402):207–214. https://doi.org/10.1038/nature11234
6. Derksen F, Bensing J, Lagro-Janssen A (2013) Effectiveness of empathy in general practice: a systematic review. Br J Gen Pract 63(606):e76–e84
7. Suárez N, Weckx S, Minahk C, Hebert EM, Saavedra L (2020) Metagenomics-based approach for studying and selecting bioprotective strains from the bacterial community of artisanal cheeses. Int J Food Microbiol 335:108894. https://doi.org/10.1016/j.ijfoodmicro.2020.108894
8. Innovation at Illumina: the road to the $600 human genome (2021). https://www.nature.com/articles/d42473-021-00030-9. Accessed 20 July 2021
9. Yang X, Que G (2020) Advance in study on 16S rRNA gene sequencing technology in oral microbial diversity. J Cent South Univ 45(7):849–855. https://doi.org/10.11817/j.issn.1672-7347.2020.190236
10. Bukin YS, Galachyants YP, Morozov I, Bukin S, Zakharenko A, Zemskaya T (2019) The effect of 16S rRNA region choice on bacterial community metabarcoding results. Sci Data 6(1):1–14
11. Laudadio I, Fulci V, Stronati L, Carissimi C (2019) Next-generation metagenomics: methodological challenges and opportunities. OMICS 23(7):327–333
12. Tsou AM, Olesen SW, Alm EJ, Snapper SB (2020) 16S rRNA sequencing analysis: the devil is in the details. Gut Microbes 11(5):1139–1142. https://doi.org/10.1080/19490976.2020.1747336
13. Duan R, Zhu S, Wang B, Duan L (2019) Alterations of gut microbiota in patients with irritable bowel syndrome based on 16S rRNA-targeted sequencing: a systematic review. Clin Transl Gastroenterol 10(2):e00012
14. Peng W, Huang J, Yang J, Zhang Z, Yu R, Fayyaz S et al (2020) Integrated 16S rRNA sequencing, metagenomics, and metabolomics to characterize gut microbial composition, function, and fecal metabolic phenotype in non-obese type 2 diabetic Goto-Kakizaki rats. Front Microbiol 10:3141
15. DeBruyn JM, Nixon LT, Fawaz MN, Johnson AM, Radosevich M (2011) Global biogeography and quantitative seasonal dynamics of Gemmatimonadetes in soil. Appl Environ Microbiol 77(17):6295–6300
16. Eren AM, Morrison HG, Lescault PJ, Reveillaud J, Vineis JH, Sogin ML (2015) Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J 9(4):968–979
17. Jiao X, Zheng X, Ma L, Kutty G, Gogineni E, Sun Q et al (2013) A benchmark study on error assessment and quality control of CCS reads derived from the PacBio RS. J Data Mining Genomics Proteomics 4(3):16008
18. Weinstock GM (2012) Genomic approaches to studying the human microbiota. Nature 489(7415):250–256
19. Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33(suppl_1):D501–D504
20. Erawijantari PP, Mizutani S, Shiroma H, Shiba S, Nakajima T, Sakamoto T et al (2020) Influence of gastrectomy for gastric cancer treatment on faecal microbiome and metabolome profiles. Gut 69(8):1404–1415
21. Yu S, Xiong Y, Fu Y, Chen G, Zhu H, Mo X et al (2021) Shotgun metagenomics reveals significant gut microbiome features in different grades of acute pancreatitis. Microb Pathog 154:104849
22. Qian Y, Yang X, Xu S, Huang P, Li B, Du J et al (2020) Gut metagenomics-derived genes as potential biomarkers of Parkinson’s disease. Brain 143(8):2474–2489
23. Gacesa R, Kurilshikov A, Vich Vila A, Sinha T, Klaassen M, Bolte L et al (2022) Environmental factors shaping the gut microbiome in a Dutch population. Nature 604(7907):732–739
24. Akinola SA, Ayangbenro AS, Babalola OO (2021) The immense functional attributes of maize rhizosphere microbiome: a shotgun sequencing approach. Agriculture 11(2):118
25. Saxena G, Mitra S, Marzinelli EM, Xie C, Wei TJ, Steinberg PD et al (2018) Metagenomics reveals the influence of land use and rain on the benthic microbial communities in a tropical urban waterway. mSystems 3(3):e00136-17
26. Flint HJ, Bayer EA, Rincon MT, Lamed R, White BA (2008) Polysaccharide utilization by gut bacteria: potential for new insights from genomic analysis. Nat Rev Microbiol 6(2):121–131
27. Durazzi F, Sala C, Castellani G, Manfreda G, Remondini D, De Cesare A (2021) Comparison between 16S rRNA and shotgun sequencing data for the taxonomic characterization of the gut microbiota. Sci Rep 11(1):1–10
28. Wang Y, Tian RM, Gao ZM, Bougouffa S, Qian P-Y (2014) Optimal eukaryotic 18S and universal 16S/18S ribosomal RNA primers and their application in a study of symbiosis. PLoS One 9(3):e90053
29. Blifernez-Klassen O, Klassen V, Doebbe A, Kersting K, Grimm P, Wobbe L et al (2012) Cellulose degradation and assimilation by the unicellular phototrophic eukaryote Chlamydomonas reinhardtii. Nat Commun 3(1):1–9
30. Wegener Parfrey L, Walters WA, Knight R (2011) Microbial eukaryotes in the human microbiome: ecology, evolution, and future directions. Front Microbiol 2:153
31. Audebert C, Even G, Cian A, Loywick A, Merlin S, Viscogliosi E et al (2016) Colonization with the enteric protozoa Blastocystis is associated with increased diversity of human gut bacterial microbiota. Sci Rep 6(1):1–11
32. Eukaryotic 18S rRNA sequencing – CD genomics (2021). https://www.cd-genomics.com/microbioseq/eukaryotic-18s-rrnasequencing.html. Accessed 20 July 2021
33. Jain M, Olsen HE, Paten B, Akeson M (2016) The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol 17(1):1–11
34. Rhoads A, Au KF (2015) PacBio sequencing and its applications. Genomics Proteomics Bioinformatics 13(5):278–289
35. Badotti F, Fonseca PLC, Tomé LMR, Nunes DT, Góes-Neto A (2018) ITS and secondary biomarkers in fungi: review on the evolution of their use based on scientific publications. Rev Bras Bot 41(2):471–479
36. Kirk PM, Cannon PF, David J, Stalpers JA (2001) Ainsworth and Bisby’s dictionary of the fungi, vol 9. CABI Publishing
37. Schoch CL, Seifert KA, Huhndorf S, Robert V, Spouge JL, Levesque CA et al (2012) Nuclear ribosomal internal transcribed spacer (ITS) region as a universal DNA barcode marker for fungi. Proc Natl Acad Sci 109(16):6241–6246
38. Wagner K, Springer B, Pires V, Keller PM (2018) Molecular detection of fungal pathogens in clinical specimens by 18S rDNA high-throughput screening in comparison to ITS PCR and culture. Sci Rep 8(1):1–7
39. Gotelli NJ, Colwell RK (2001) Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecol Lett 4(4):379–391
40. Kim B-R, Shin J, Guevarra RB, Lee JH, Kim DW, Seol K-H et al (2017) Deciphering diversity indices for a better understanding of microbial communities. J Microbiol Biotechnol 27(12):2089–2093
41. Bunge J, Fitzpatrick M (1993) Estimating the number of species: a review. J Am Stat Assoc 88(421):364–373
42. Denslow JS (1995) Disturbance and diversity in tropical rain forests: the density effect. Ecol Appl 5(4):962–968
43. Anderson RS, Ashe JS (2000) Leaf litter inhabiting beetles as surrogates for establishing priorities for conservation of selected tropical montane cloud forests in Honduras, Central America (Coleoptera; Staphylinidae, Curculionidae). Biodivers Conserv 9(5):617–653
44. Stork N (1991) The composition of the arthropod fauna of Bornean lowland rain forest trees. J Trop Ecol 7(2):161–180
45. Hong M, Tao S, Zhang L, Diao L-T, Huang X, Huang S et al (2020) RNA sequencing: new technologies and applications in cancer research. J Hematol Oncol 13(1):1–16
46. Sikkema-Raddatz B, Johansson LF, de Boer EN, Almomani R, Boven LG, van den Berg MP et al (2013) Targeted next-generation sequencing can replace Sanger sequencing in clinical diagnostics. Hum Mutat 34(7):1035–1042
47. Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D et al (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18(6):630–634
48. Nakano M, Nobuta K, Vemaraju K, Tej SS, Skogen JW, Meyers BC (2006) Plant MPSS databases: signature-based transcriptional resources for analyses of mRNA and small RNA. Nucleic Acids Res 34(suppl_1):D731–D735
49. Tucker T, Marra M, Friedman JM (2009) Massively parallel sequencing: the next big thing in genetic medicine. Am J Hum Genet 85(2):142–154
50. Cox-Foster DL, Conlan S, Holmes EC, Palacios G, Evans JD, Moran NA et al (2007) A metagenomic survey of microbes in honey bee colony collapse disorder. Science 318(5848):283–287
51. Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT (2012) Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PLoS One 7(2):e30087
52. Segerman B (2020) The most frequently used sequencing technologies and assembly methods in different time segments of the bacterial surveillance and RefSeq genome databases. Front Cell Infect Microbiol 10:527102
53. Porreca GJ, Shendure J, Church GM (2006) Polony DNA sequencing. Curr Protoc Mol Biol 76(1):7.8.1–7.8.22
54. Mitra RD, Church GM (1999) In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res 27(24):e34
55. Kim JB, Porreca GJ, Song L, Greenway SC, Gorham JM, Church GM et al (2007) Polony multiplex analysis of gene expression (PMAGE) in mouse hypertrophic cardiomyopathy. Science 316(5830):1481–1484
56. Shendure J, Porreca GJ, Reppas NB, Lin X, McCutcheon JP, Rosenbaum AM et al (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309(5741):1728–1732
57. Cronn R, Liston A, Parks M, Gernandt DS, Shen R, Mockler T (2008) Multiplex sequencing of plant chloroplast genomes using Solexa sequencing-by-synthesis technology. Nucleic Acids Res 36(19):e122
58. Quick J, Grubaugh ND, Pullan ST, Claro IM, Smith AD, Gangavarapu K et al (2017) Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples. Nat Protoc 12(6):1261–1276
59. Zhang J, Su L, Wang Y, Deng S (2020) Improved high-throughput sequencing of the human oral microbiome: from Illumina to PacBio. Can J Infect Dis Med Microbiol 2020:6678872
60. Sanz-Martin I, Doolittle-Hall J, Teles RP, Patel M, Belibasakis GN, Hämmerle CH et al (2017) Exploring the microbiome of healthy and diseased peri-implant sites using Illumina sequencing. J Clin Periodontol 44(12):1274–1284
61. Paijmans JL, Baleka S, Henneberger K, Taron UH, Trinks A, Westbury MV et al (2017) Sequencing single-stranded libraries on the Illumina NextSeq 500 platform. arXiv preprint arXiv:171111004
62. Goswami C, Sheldon M, Bixby C, Keddache M, Bogdanowicz A, Wang Y et al (2022) Identification of SARS-CoV-2 variants using viral sequencing for the Centers for Disease Control and Prevention genomic surveillance program. BMC Infect Dis 22(1):1–12
63. Babalola OO, Fadiji AE, Ayangbenro AS (2020) Shotgun metagenomic data of root endophytic microbiome of maize (Zea mays L.). Data Brief 31:105893
64. Senabouth A, Andersen S, Shi Q, Shi L, Jiang F, Zhang W et al (2020) Comparative performance of the BGI and Illumina sequencing technology for single-cell RNA-sequencing. NAR Genom Bioinform 2(2):lqaa034
65. Pandey V, Nutter RC, Prediger E (2008) Applied Biosystems SOLiD™ system: ligation-based sequencing. In: Janitz M (ed) Next generation genome sequencing: towards personalized medicine. Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, pp 29–42
66. Park PJ (2009) ChIP–seq: advantages and challenges of a maturing technology. Nat Rev Genet 10(10):669–680
67. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63
68. Liu L, Li Y, Li S, Hu N, He Y, Pong R et al (2012) Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012:251364
69. Merriman B, Ion Torrent R&D Team, Rothberg JM (2012) Progress in Ion Torrent semiconductor chip based sequencing. Electrophoresis 33(23):3397–3417
70. Daum LT, Rodriguez JD, Worthy SA, Ismail NA, Omar SV, Dreyer AW et al (2012) Next-generation Ion Torrent sequencing of drug resistance mutations in Mycobacterium tuberculosis strains. J Clin Microbiol 50(12):3831–3837
71. Brown SP, Callaham MA Jr, Oliver AK, Jumpponen A (2013) Deep Ion Torrent sequencing identifies soil fungal community shifts after frequent prescribed fires in a southeastern US forest ecosystem. FEMS Microbiol Ecol 86(3):557–566
72. Salipante SJ, Kawashima T, Rosenthal C, Hoogestraat DR, Cummings LA, Sengupta DJ et al (2014) Performance comparison of Illumina and Ion Torrent next-generation sequencing platforms for 16S rRNA-based bacterial community profiling. Appl Environ Microbiol 80(24):7583–7591
73. Whiteley AS, Jenkins S, Waite I, Kresoje N, Payne H, Mullan B et al (2012) Microbial 16S rRNA Ion Tag and community metagenome sequencing using the Ion Torrent (PGM) Platform. J Microbiol Methods 91(1):80–88
74. Huang J, Liang X, Xuan Y, Geng C, Li Y, Lu H et al (2017) A reference human genome dataset of the BGISEQ-500 sequencer. Gigascience 6(5):gix024
75. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG et al (2010) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327(5961):78–81
76. Fang C, Zhong H, Lin Y, Chen B, Han M, Ren H et al (2018) Assessment of the cPAS-based BGISEQ-500 platform for metagenomic sequencing. Gigascience 7(3):gix133
77. Mak SST, Gopalakrishnan S, Carøe C, Geng C, Liu S, Sinding M-HS et al (2017) Comparative performance of the BGISEQ-500 vs Illumina HiSeq2500 sequencing platforms for palaeogenomic sequencing. Gigascience 6(8):gix049
78. Porreca GJ (2010) Genome sequencing on nanoballs. Nat Biotechnol 28(1):43–44
79. Lee W, Jiang Z, Liu J, Haverty PM, Guan Y, Stinson J et al (2010) The mutation spectrum revealed by paired genome sequences from a lung cancer patient. Nature 465(7297):473–477
80. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT et al (2010) Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328(5978):636–639
81. Kim D, Lee JY, Yang JS, Kim JW, Kim VN, Chang H (2020) The architecture of SARS-CoV-2 transcriptome. Cell 181(4):914–921.e10. https://doi.org/10.1016/j.cell.2020.04.011
82. Thompson JF, Steinmann KE (2010) Single molecule sequencing with a HeliScope genetic analysis system. Curr Protoc Mol Biol Chapter 7:Unit7.10. https://doi.org/10.1002/0471142727.mb0710s92
83. Steinmann KE, Hart CE, Thompson JF, Milos PM (2011) Helicos single-molecule sequencing of bacterial genomes. Methods Mol Biol 733:3–24. https://doi.org/10.1007/978-1-61779-089-8_1
84. Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I et al (2008) Single-molecule DNA sequencing of a viral genome. Science 320(5872):106–109
85. van den Oever JME, Balkassmi S, Verweij EJ, van Iterson M, van Scheltema PNA, Oepkes D et al (2012) Single molecule sequencing of free DNA from maternal plasma for noninvasive trisomy 21 detection. Clin Chem 58(4):699–706. https://doi.org/10.1373/clinchem.2011.174698
86. Heather JM, Chain B (2016) The sequence of sequencers: the history of sequencing DNA. Genomics 107(1):1–8
87. Lee H, Gurtowski J, Yoo S, Nattestad M, Marcus S, Goodwin S et al (2016) Third-generation sequencing and the future of genomics. BioRxiv:048603
88. Ardui S, Ameur A, Vermeesch JR, Hestand MS (2018) Single molecule real-time (SMRT) sequencing comes of age: applications and utilities for medical diagnostics. Nucleic Acids Res 46(5):2159–2168
89. Au KF, Sebastiano V, Afshar PT, Durruthy JD, Lee L, Williams BA et al (2013) Characterization of the human ESC transcriptome by hybrid sequencing. Proc Natl Acad Sci 110(50):E4821–E4830
90. Wilbe M, Gudmundsson S, Johansson J, Ameur A, Stattin EL, Annerén G et al (2017) A novel approach using long-read sequencing and ddPCR to investigate gonadal mosaicism and estimate recurrence risk in two families with developmental disorders. Prenat Diagn 37(11):1146–1154
91. Cavelier L, Ameur A, Häggqvist S, Höijer I, Cahill N, Olsson-Strömberg U et al (2015) Clonal distribution of BCR-ABL1 mutations and splice isoforms by single-molecule long-read RNA sequencing. BMC Cancer 15(1):1–12
92. Stein JC, Yu Y, Copetti D, Zwickl DJ, Zhang L, Zhang C et al (2018) Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza. Nat Genet 50(2):285–296. https://doi.org/10.1038/s41588-018-0040-0
93. Payne A, Holmes N, Rakyan V, Loose M (2019) BulkVis: a graphical viewer for Oxford nanopore bulk FAST5 files. Bioinformatics 35(13):2193–2198. https://doi.org/10.1093/bioinformatics/bty841
94. Branton D, Deamer DW, Marziali A, Bayley H, Benner SA, Butler T et al (2010) The potential and challenges of nanopore sequencing. Nanosci Technol:261–268
95. Kono N, Arakawa K (2019) Nanopore sequencing: review of potential applications in functional genomics. Develop Growth Differ 61(5):316–326
96. Mitsuhashi S, Kryukov K, Nakagawa S, Takeuchi JS, Shiraishi Y, Asano K et al (2017) A portable system for rapid bacterial composition analysis using a nanopore-based sequencer and laptop computer. Sci Rep 7(1):1–9
97. Valliammai MG, Gopal NO, Anandham R (2021) Elucidation of microbial diversity and lignocellulolytic enzymes for the degradation of lignocellulosic biomass in the forest soils of Eastern and Western Ghats of Tamil Nadu, India. Biofuels Bioprod Biorefin 15(1):47–60
98. Stancu MC, Van Roosmalen MJ, Renkens I, Nieboer MM, Middelkamp S, De Ligt J et al (2017) Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat Commun 8(1):1–13
99. Jain M, Koren S, Miga KH, Quick J, Rand AC, Sasani TA et al (2018) Nanopore sequencing and assembly of a human genome with ultra-long reads. Nat Biotechnol 36(4):338–345
100. Roche: DNA library prep for DNA nanoball technology sequencing platforms (2021). https://sequencing.roche.com/en/blog/dna-library-prep-for-dna-nanoball-technology-sequencing-platform.html. Accessed 10 Nov 2021

Chapter 2

Quality Control in Metagenomics Data

Abraham Gihawi, Ryan Cardenas, Rachel Hurst, and Daniel S. Brewer

Abstract

Experiments involving metagenomics data are becoming increasingly commonplace. Processing such data requires a unique set of considerations, and quality control is critical to extracting pertinent insights. In this chapter, we outline some considerations in terms of study design and other confounding factors that can often only be realized at the point of data analysis. We also outline some basic principles of quality control in metagenomics, including overall reproducibility and some good practices to follow. The general quality control of sequencing data is then outlined, and we introduce ways to process this data by using bash scripts and developing pipelines in Snakemake (Python). A significant part of quality control in metagenomics is analyzing the data to ensure you can spot relationships between variables and identify when they might be confounded. This chapter provides a walkthrough of analyzing some microbiome data (in the R statistical language) and demonstrates a few ways to identify overall differences and similarities in microbiome data. The chapter concludes with remarks about considering taxonomic results in the context of the study and interrogating sequence alignments using the command line.

Key words Metagenomics, Microbial bioinformatics, Contamination, Quality control, Data, Microbiome, Bacteria, Virus

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

We define sequencing microbial communities as sequencing the nucleic acids from multiple organisms in a collection of samples. Sequencing microbial communities has several advantages: it is culture-free and so can identify organisms that are difficult to culture; it can be hypothesis-free; it has potential quantitative ability; it can capture information from across an organism's genome (in the case of shotgun sequencing); and prior knowledge of the expected microbial constituents is not necessarily required, although it can help validate results [1]. It is possible to sequence a single region of each organism's genome, as in 16S ribosomal RNA (rRNA) gene amplicon sequencing (18S for eukaryotic organisms), or to sequence a greater range of genomic regions with shotgun metagenomic sequencing. In this chapter, we will offer a practical guide to overcoming some of the challenges in analyzing microbiome sequencing data (particularly shotgun sequencing data) to obtain meaningful and insightful results.

Publications mentioning sequencing and the microbiome have been increasing exponentially in recent years, with no sign of plateauing (Fig. 1). The number of these publications that mention contamination has also been increasing exponentially, but with a much lower total number of studies (317 vs. 12,189).

Fig. 1 Publication dates of microbiome sequencing studies: cumulative number of publications (y-axis, log scale) against date of publication (x-axis, 2005 to 2020). Search terms for each condition were: 'sequencing' and 'microbiome' (purple, 12,189 total studies) and additionally 'contamination' (yellow, 317 total studies). Publications were identified using the easyPubMed R package (version 2.13) as of 26-Apr-2021

Although far from the earliest study of the microbiome, one of the most significant studies in this field is the Human Microbiome Project [2]. The early phase of this project focused on characterizing the bacteria detected in samples from various anatomical sites such as gastrointestinal, oral, urogenital, skin, and nasal [3]. One of the major findings of this analysis was that the bacterial communities at each bodily site are quite distinctive. This data provides a foundation for future studies into the human microbiome by describing some of the microbial communities that can be expected. The latest phase of the project (the integrative Human Microbiome Project) has built on this by beginning to decipher the complex relationship between the microbiome and disease [4].

Quality control is critical in extracting robust and reproducible conclusions from metagenomic data. There are numerous considerations that can help maximize the quality of metagenomics data, including experimental design, the quality of sequence data, and the quality of the reported microbial community.

2 Considerations in Study Design and Methodology

The first step to achieving high-quality data is establishing a good study design and methodology. No amount of quality control can completely rectify imperfections in study design. It is always better to give some prior thought to the methods than to identify confounding variables at the analysis stage that can nullify any findings.

Metagenomics often demonstrates batch effects, which occur when non-biological factors in an experiment are linked with differences in the data. This can result in erroneous conclusions and obscure biologically meaningful ones. Minor variations in sample storage have a minimal effect [5], but multiple freeze-thaw cycles can lead to degraded nucleic acids [1]. Care needs to be taken with formalin-fixed paraffin-embedded (FFPE) tissue, as fixation degrades nucleic acids and introduces genomic changes, including cytosine deamination [6]. Many studies have obtained interesting metagenomic insights from FFPE tissue, but mixing tissue types can hinder analysis by introducing large batch effects [7, 8]. It is always good practice to replicate any preliminary findings on a separate cohort with an orthogonal technique to ensure that any biological results are reliable, robust, and reproducible. To a certain extent, it is possible to take batch effects into account in statistical analysis, but this reduces the power of the test, and it is the opinion of the authors that there is only so much correction that is possible before the data becomes unusable.

To reduce batch effects, one should keep the methodology as consistent as possible throughout the duration of the study. Firstly, the same person should ideally follow the exact same protocols, on the same equipment, in as short a time frame as possible. Secondly, the experimental conditions should be mixed across the duration and batches of the project. For example, do not run all of condition A and then run all of condition B.

One context-specific example of batch effects is that of animal studies. Mice are typically coprophagic, so mice caged together share similarities in their gastrointestinal microbial constituents [9]. Comparing two cages of mice will reveal differences that can be wrongly attributed to the experimental condition if condition is aligned with cage. This needs to be considered alongside financial and ethical obligations when planning animal experiments.

Similar concerns of confounding variables occur in human studies, where the microbiome of an individual might be affected by many factors, for example age, gender, seasonal changes, lifestyle, ethnicity, geographical location, socioeconomic status, and culture [10]. Even pet ownership has been identified as a potential confounding variable in microbiome studies [5]. Confounders can be better controlled in prospective studies, but even in retrospective studies, care must be taken to match the characteristics of the experimental groups as far as possible.

It is recommended that the preparation of samples for DNA extraction prior to any metagenomics study should involve a number of quality control checks and batch testing of reagents and kits prior to use with samples. This is to minimize contamination that can be present in molecular biology grade water, reagents, and DNA extraction and library preparation kits; the contamination present in the latter is well documented and termed the "kitome" [11, 12].
Where relevant for particular study designs, protocols and reagents designed to improve the detection of low abundance bacteria may be used effectively. These include working in a clean-room environment and, particularly importantly, using reagents that are guaranteed contamination free. Care should be taken with the method of lysis of the bacteria prior to DNA extraction. For example, some enzyme preparations, including mixed enzyme cocktails and lysozyme, can contain bacterial DNA contaminants, and it is therefore appropriate to screen batches prior to use. The use of repeated bead-beating protocols has been shown to improve the quality and extraction of DNA from bacteria and other members of the microbiome, including fungi, from certain samples [13–15]. Certified molecular biology grade nuclease-free beads are recommended, with the composition (sizes/material) of the beads adjusted to suit the tissue type. In addition, the speed and duration of bead beating can be tested and validated for effective lysis of microbiota to obtain yields of quality DNA suitable for metagenomics studies.

Protocols for human depletion of clinical samples prior to bacterial DNA extraction (bacteria enrichment protocols) can be useful for certain studies [16]. However, it has been shown that under certain sample preparation conditions susceptible bacteria can also be lysed during the initial human cell lysis steps and thus removed or depleted from the sample of interest [17], which may pose a critical issue. When using human host depletion during DNA extraction, it is important to test and validate that the protocol effectively recovers the key species of interest.

Finally, one of the most important additions to a study design is to prepare and sequence negative control (blank) samples at each step: blanks for each set of samples extracted and for each library preparation batch, plus blank reagents and no-template controls to determine background contamination. One recent study investigating the microbiota in human cancer tissue used a ratio of ~50% controls to samples to enable effective sequencing data analyses [7]. Inclusion and analysis of mock community positive control samples can also aid interpretation and validation of study protocols [14]. Criteria have recently been recommended to aid study design when investigating low microbial biomass microbiome samples, to avoid (i) contaminant DNA and (ii) sample cross-contamination. These are summarized in the "RIDE" checklist [18], which includes several of the sample preparation criteria discussed above.
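The earlier advice in this section to mix experimental conditions across batches can be sketched on the command line. The following is a minimal illustration rather than a method used by the authors: it assumes a hypothetical tab-separated sample sheet (sample name, condition) and a hypothetical helper called assign_batches, and it deals the samples of each condition round-robin across the batches so that no batch contains only one condition.

```shell
#!/usr/bin/env bash
set -euo pipefail

# assign_batches N_BATCHES
# Reads "sample<TAB>condition" lines on stdin and appends a batch number.
# Samples of each condition are dealt round-robin across the batches,
# so every batch receives a mix of conditions.
# (Function name and sample-sheet layout are illustrative assumptions.)
assign_batches() {
    local n_batches="$1"
    awk -F'\t' -v n="$n_batches" 'BEGIN { OFS = "\t" }
        { print $1, $2, (seen[$2]++ % n) + 1 }'
}

# Hypothetical sample sheet: four samples from each of two conditions.
sample_sheet=$(printf '%s\t%s\n' \
    S1 A  S2 A  S3 A  S4 A \
    S5 B  S6 B  S7 B  S8 B)

batched=$(printf '%s\n' "$sample_sheet" | assign_batches 2)
printf '%s\n' "$batched"
```

With two batches, each batch ends up with two samples from condition A and two from condition B. The processing order within a batch could additionally be randomized, for example with the GNU coreutils shuf command.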

3 Solutions to Support Reproducibility

One of the successes of the bioinformatics community is the incredible range of openly available computational software for all manner of applications. When planning the analyses for a project, selection of the correct tools is crucial. It is necessary to trial and benchmark multiple computational tools using simulated or prototype data to find the optimal tool for your use. Some criteria for what makes a good tool include whether it: provides accurate and reproducible results (in a reasonable format); is suitable for your specific application (e.g., does a metagenomic assembly tool allow for co-assembly); is computationally efficient and scalable to your needs; is user-friendly and provides explanatory error messages; installs easily; is in widespread use in the community; is well supported; and doesn't need to be adapted or updated too often.

One of the most laborious and time-consuming tasks a computational biologist will face is installing different computational software, along with the software and databases each depends on. Mangul et al. tested the installability of 98 different software tools and concluded that 51% were "easy to install" and 28% failed to install at all [19]. Fortunately, there are a few solutions that are becoming best practice.

Installation of software has become easier over the last few years through the use of package managers such as Conda (https://docs.conda.io/), which allow easy installation of computational software and their dependencies. These tools can install multiple software packages and are typically quite good at resolving any conflicts in dependencies, although they can take a considerable amount of time to do so. Additionally, it can be troublesome to update software to newer versions, and sometimes multiple versions of a dependency are required.

A newer alternative to package managers is containerized software. Software containers such as Docker [20] and Singularity [21] have changed the way that analysis pipelines can be used and shared. They allow the exact same code to be applied on different machines, which aids the reproducibility of research. Container files typically hold a minimal Linux operating system with all of the dependencies necessary to run the analyses installed. Commands and workflows can be run within these containers with minimal intrusion from the user's operating system and environment. This means that an entire set of analyses can be repeated and investigated without having to spend time worrying about the exact versions of the software.

Another recommendation is to integrate your analyses in workflow software, which reduces development time, improves efficiency, aids reproducibility, is scalable, and allows effective monitoring and restarting of analyses. Although it is quite possible to create pipelines in a variety of coding languages, some of the easiest pipelines to read, implement, and adapt are written in workflow languages such as Snakemake [22] (which is implemented in Python) and Nextflow [23].

There are still a few limitations of containers, however. Firstly, metagenomic databases can be too large to include in containers (particularly those built on remote servers), meaning that they have to be shared in addition to any container and loaded (using a bind argument, for example) before they can be used. Secondly, containers can be difficult to update and usually rely on sourcing the original build recipe if additional software is required. Thirdly, software and tools can often rely on Internet access. This can be difficult to spot within a particular tool without reading the source code, and it means that containers can run into issues on an offline machine. Additionally, as data on the Internet is subject to change, this might further hinder reproducibility. For example, the code to produce Fig. 1 downloads the data to plot, which will almost definitely have changed when the code is run at a later date. For this reason (and if feasible), it is often a good idea to save the data that is used to create plots.
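As a sketch of the bind argument mentioned above: Singularity's --bind option maps a host directory into the container at run time, so a large database can stay outside the image. The host path, the /database mount point, and the helper name build_bind_command below are illustrative assumptions, not part of the chapter's pipeline. The helper only prints the command so that it can be inspected before it is run.

```shell
#!/usr/bin/env bash
set -euo pipefail

# build_bind_command HOST_DB_DIR IMAGE COMMAND...
# Prints a "singularity exec" command line that mounts HOST_DB_DIR
# at /database inside the container before running COMMAND.
# (/database and the host path are illustrative assumptions; paths
# containing spaces are not handled by this simple sketch.)
build_bind_command() {
    local host_db="$1" image="$2"
    shift 2
    printf 'singularity exec --bind %s:/database %s' "$host_db" "$image"
    printf ' %s' "$@"
    printf '\n'
}

# Hypothetical host path; the image name matches the container used in this chapter.
cmd=$(build_bind_command /data/metagenomic_db qcmetagenomics_v1_latest.sif ls /database)
echo "$cmd"
# To run it on a machine with Singularity installed: eval "$cmd"
```

Printing the command first is a deliberate choice here: it documents exactly which host directory was exposed to the container, which is worth recording alongside the results.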

4 Code Walkthrough Introduction

This chapter provides a walkthrough re-analyzing some Illumina (MiSeq) sequencing data made publicly available from a study by Thomas et al. entitled "Metagenomic characterization of the effect of feed additives on the gut microbiome and antibiotic resistome of feedlot cattle" [24]. A basic understanding of the Linux command line and the R programming language will be beneficial, but we will introduce a lot of the basics here. Thomas et al. investigated the impact that antibiotics in animal feed have on the microbiome. To do so, the authors used Illumina MiSeq to produce paired-end reads that are 250 nucleotides long. The sequence data is available on NCBI's Sequence Read Archive (SRA) [25] under the bioproject PRJNA390551. Please note that this walkthrough will not use the same methods as the manuscript and may not be applicable to other technologies such as long read sequencing. Also, for simplicity, we will investigate a reduced selection of samples (N = 9) from three anatomical sites: colon, cecum, and rumen.

To facilitate the analysis in this chapter, we have containerized all necessary software using Singularity, so only the Singularity software and access to the Unix command line are required. To access the Unix command line, simply open a terminal on a Mac or Linux machine, or open an emulator (such as Cygwin) on a Windows machine. Instructions on how to install Singularity can be obtained from https://sylabs.io/guides/3.3/user-guide/installation.html

Once Singularity is installed, use the command line code below to pull the metagenomicsQC container from the Sylabs cloud library:

    singularity pull library://r-cardenas/default/qcmetagenomics_v1:latest

Or by accessing the link: https://cloud.sylabs.io/library/r-cardenas/default/qcmetagenomics_v1

The image can also be built using the recipe file within our GitHub page (https://github.com/UEA-Cancer-Genetics-Lab/Metagenomics_QC), which also contains a number of the scripts used in this chapter.

    # Ensure Metagenomics_QC directory is made
    mkdir -pv Metagenomics_QC/
    # Navigate to the newly created folder / directory
    cd Metagenomics_QC/
    # Note: Singularity container can be built using the recipe file
    # sudo singularity build metagenomics.simg singularity/metagenomicsQC.def
    # Once you have the container, use the shell command to enter it, while
    # setting your home directory to your current working directory.
    singularity shell --home "$PWD" qcmetagenomics_v1_latest.sif

5 Downloading SRA Project Data

The code below uses the SRA toolkit to download a small number of the samples from the animals that were not treated with antibiotics. Type the commands below one line at a time on the Unix command line. Lines with # at the start are comments to improve readability and do not need to be run. This code will download the files from the SRA, reformat the data to FASTQ, and rename the files to something a bit more understandable before compressing them. This step can take a while depending on your Internet connection and how busy the servers are. Go grab yourself a beverage, go for a quick walk, do a quick bit of exercise, or something more productive like reading a bit of that manuscript you've had open in a tab for far too long now. Just keep the terminal open in the corner to make sure it's still running without any issues.

    # First we will create a folder/directory structure to store the data
    # For this we will use the mkdir command with -p and -v flags
    # (-p makes parent directories if required, -v makes it verbose so it feeds back what it has done)
    # At any time you can find out what folder/directory you are in with the 'pwd' command
    mkdir -pv QC_Metagenomics_data/analysis
    cd QC_Metagenomics_data/analysis/
    mkdir -pv images processed_data raw_data results
    cd raw_data/

    # Download the data from the SRA
    for x in 'SAMN07243779' 'SAMN07243778' 'SAMN07243777' 'SAMN07243769' 'SAMN07243768' 'SAMN07243767'; do
        ~/Downloads/sratoolkit.2.11.0mac64/bin/prefetch "$x"
    done

    # Extract FASTQ files
    for x in 'SRR5738067' 'SRR5738068' 'SRR5738074' 'SRR5738079' 'SRR5738080' 'SRR5738085' 'SRR5738092' 'SRR5738091' 'SRR5738094'; do
        fastq-dump -I --split-files "$x"
    done

    # Rename the files to something more illustrative
    mv SRR5738068_1.fastq 03-Colon_1.fastq
    mv SRR5738068_2.fastq 03-Colon_2.fastq
    mv SRR5738067_1.fastq 04-Colon_1.fastq
    mv SRR5738067_2.fastq 04-Colon_2.fastq
    mv SRR5738074_1.fastq 05-Colon_1.fastq
    mv SRR5738074_2.fastq 05-Colon_2.fastq
    mv SRR5738079_1.fastq 03-Cecum_1.fastq
    mv SRR5738079_2.fastq 03-Cecum_2.fastq
    mv SRR5738080_1.fastq 04-Cecum_1.fastq
    mv SRR5738080_2.fastq 04-Cecum_2.fastq
    mv SRR5738085_1.fastq 05-Cecum_1.fastq
    mv SRR5738085_2.fastq 05-Cecum_2.fastq
    mv SRR5738092_1.fastq 03_Rumen_1.fastq
    mv SRR5738092_2.fastq 03_Rumen_2.fastq
    mv SRR5738091_1.fastq 04_Rumen_1.fastq
    mv SRR5738091_2.fastq 04_Rumen_2.fastq
    mv SRR5738094_1.fastq 05_Rumen_1.fastq
    mv SRR5738094_2.fastq 05_Rumen_2.fastq

    # Compress fastq files to save space
    gzip *fastq
    # Navigate up to the analysis/ directory
    cd ../
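The individual mv commands above could also be driven from a small lookup table, which keeps the run-to-sample mapping in one place and handles both mates of each pair. This is only a sketch of an alternative, to be run in place of the mv commands and before compression. The function name rename_runs and the two-column table layout are illustrative assumptions. The accessions and sample names are those listed in this section.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rename_runs DIR
# Reads "run_accession sample_name" pairs on stdin and renames both
# mates (_1 and _2) of each run found in DIR.
# (The function name and the two-column table are illustrative assumptions.)
rename_runs() {
    local dir="$1" run sample mate
    while read -r run sample; do
        for mate in 1 2; do
            if [ -e "$dir/${run}_${mate}.fastq" ]; then
                mv -v "$dir/${run}_${mate}.fastq" "$dir/${sample}_${mate}.fastq"
            else
                echo "warning: $dir/${run}_${mate}.fastq not found" >&2
            fi
        done
    done
}

# Run-to-sample mapping taken from the mv commands in this section.
run_table='SRR5738068 03-Colon
SRR5738067 04-Colon
SRR5738074 05-Colon
SRR5738079 03-Cecum
SRR5738080 04-Cecum
SRR5738085 05-Cecum
SRR5738092 03_Rumen
SRR5738091 04_Rumen
SRR5738094 05_Rumen'

# Example call from within raw_data/ (before compressing the files):
# printf '%s\n' "$run_table" | rename_runs .
```

Missing files produce a warning rather than an error, so a partially completed download is reported without stopping the loop.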

6 Ensuring Data Integrity

After downloading or transferring files, it is good practice to ensure the files are an exact copy of the version on the server, i.e., to check the integrity of the data. A quick and useful way to do this is by finding a fingerprint (or checksum) for each file using the md5sum command (or md5). This is also called hashing: it takes a file of arbitrary size and condenses the data to return a fixed-length string. If the md5 values are the same as what they are supposed to be (they are often supplied by the data provider), you can be confident the files are identical and that nothing went wrong in the download process. Below we provide the md5 values for each file downloaded. To obtain these, type "md5sum raw_data/*fastq.gz", which will run an md5 checksum on all of the files in the raw_data/ directory that end in fastq.gz.

    MD5 (03-Cecum_1.fastq.gz) = afb204df6ae58af7da00cc78221da044
    MD5 (03-Cecum_2.fastq.gz) = aec5b97a325f687604d43d18732a8486
    MD5 (03-Colon_1.fastq.gz) = d0a129dddd7aea45b2b7481dd3e93146
    MD5 (03-Colon_2.fastq.gz) = 808ace970c7c2aea59af4757cbcd26e3
    MD5 (03_Rumen_1.fastq.gz) = 0c91bb2498439a965f8f30caae051ece
    MD5 (03_Rumen_2.fastq.gz) = ed839b91ec20fa50c8cc38070ab6b9a2
    MD5 (04-Cecum_1.fastq.gz) = e56de2d0a02c66d62ec67f79363600c6
    MD5 (04-Cecum_2.fastq.gz) = f3b9c4725790e174f170696f5ab4faeb
    MD5 (04-Colon_1.fastq.gz) = 53a0ae19e568a1406a32a6b028b8ba3a
    MD5 (04-Colon_2.fastq.gz) = 09be6df77a6c3fc21d0363f7b6a78f47
    MD5 (04_Rumen_1.fastq.gz) = 3246c89fa20bfd7e02b4ab76b6a11a0c
    MD5 (04_Rumen_2.fastq.gz) = 95237571c6003704ab3afb4f9a1dffdf
    MD5 (05-Cecum_1.fastq.gz) = 641770d9f3e824f7509f9967e3e6e15e
    MD5 (05-Cecum_2.fastq.gz) = 82e7009e4c37ef11c3f1baee7bd7e918
    MD5 (05-Colon_1.fastq.gz) = 465addb89806851caca724cf214506f0
    MD5 (05-Colon_2.fastq.gz) = 01f69c06108177ba992df39e71b69010
    MD5 (05_Rumen_1.fastq.gz) = 9c42ae670b201d2221f1c9d3049ba25c
    MD5 (05_Rumen_2.fastq.gz) = 78eaa7c5f555b726abffd118c7029515
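Rather than comparing checksums by eye, the comparison can be automated with the check mode of md5sum. The sketch below assumes GNU coreutils md5sum (macOS ships md5 instead) and a manifest previously written with "md5sum *.fastq.gz > checksums.md5". The function name verify_checksums and the manifest file name are illustrative assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# verify_checksums DIR MANIFEST
# Runs md5sum in check mode inside DIR. MANIFEST is a file in DIR that was
# written earlier by "md5sum *.fastq.gz > MANIFEST".
# Returns 0 only if every listed file matches its recorded checksum.
# (Function and manifest names are illustrative assumptions.)
verify_checksums() {
    local dir="$1" manifest="$2"
    ( cd "$dir" && md5sum --check --quiet "$manifest" )
}

# Example call, assuming raw_data/checksums.md5 was recorded beforehand:
# verify_checksums raw_data checksums.md5 && echo "all files intact"
```

With --quiet, md5sum stays silent for files that verify and reports only those that fail, so a clean run produces no output and a zero exit status.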

7 Quality Control Statistics

High-quality input data is a requirement for metagenomic taxonomy assignment and other subsequent analyses: poor quality in, poor quality out. Unfortunately, there is no consensus on the best way to remove low quality data. For example, there are mixed reports as to whether or not quality trimming is required for RNA sequencing gene quantification [26, 27]. In the case of microbe identification, it is generally advised to quality trim sequence data to avoid mis-classification [28]. In this section, we will walk through an example of how to filter and adapt sequencing reads to ensure high-quality data.

The first step in ensuring good quality data is to calculate quality metrics and visualize the quality. For this, we use a popular tool called FastQC [29]. First ensure you are in the analysis/ directory (you can use the pwd command to find out where you are, and cd ../ or cd directory_x/ to navigate up to the parent directory or into a directory called directory_x, respectively).

    # Run fastqc on all raw data. The asterisk matches any pattern, and by adding
    # fastq.gz to the end we ensure that only fastq.gz files are matched
    fastqc raw_data/*fastq.gz
    # List all the files in the raw_data directory
    # This should return all of the fastqc files that you have just created
    ls raw_data/

Open up the first report in the raw_data directory using your preferred browser ("05-Colon_2_fastqc.html"). You may need to download the FastQC files locally to view them in a browser if you are carrying out this tutorial on a high performance computing cluster. The report presents an analysis of the quality of your sequences. A full description of each of the plots is available on the tool website, but we summarize some of the important aspects below.

7.1 Basic Statistics

This provides some basic statistics on the file, including the filename, file type, what quality score encoding was used (more on that below), the total number of sequences in the file, the number of sequences flagged as poor quality, the sequence length, and the percentage GC content.

7.2 Per Base Sequence Quality

This plot is arguably the most informative. A box plot shows the distribution of quality scores (y-axis) along individual read positions (x-axis). The mean quality score at each position is shown by the line running through the box plots. There are different ways to encode base quality scores in FASTQ files, but FastQC is good at determining the correct quality score system. The most common scoring system is the Phred system (Q), where the probability P of a base being incorrect is given by $P = 10^{-Q/10}$. What this translates to in practice is that Q = 30 denotes a 1/1000 chance of a base being incorrect, Q = 20 denotes a 1/100 chance, Q = 10 denotes a 1/10 chance, and so on. Statistically speaking, Q = 20 therefore means that approximately 1 base in a 100 bp read is incorrect. A Q value above 30 is currently a good standard to aim for. It is entirely normal for Illumina sequencing data to start off with a relatively lower quality and then rise in the first 2 base pairs before dropping off toward the end of the read. For this reason, it can be useful to trim both ends of sequencing reads to leave just the high-quality middle portion. This example shows high-quality scores with low variation and a maximum read length of ~251 bp (Fig. 2).
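The relationship between a Phred score and its error probability is easy to check numerically. The following sketch uses awk for the arithmetic. The function names phred_to_error_prob and expected_errors are illustrative, and the second function makes the simplifying assumption that every base in the read has the same score.

```shell
#!/usr/bin/env bash
set -euo pipefail

# phred_to_error_prob Q
# Prints the probability that a base call with Phred score Q is wrong,
# using P = 10^(-Q/10).
phred_to_error_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# expected_errors Q LENGTH
# Expected number of miscalled bases in a read of LENGTH bases,
# under the simplifying assumption that every base has Phred score Q.
expected_errors() {
    awk -v q="$1" -v len="$2" 'BEGIN { printf "%g\n", len * 10 ^ (-q / 10) }'
}

for q in 10 20 30; do
    printf 'Q=%s\tP(error)=%s\n' "$q" "$(phred_to_error_prob "$q")"
done
```

For instance, expected_errors 20 100 corresponds to the statement that a 100 bp read at Q = 20 carries about one incorrect base.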

7.3 Per Tile Sequence Quality

This heatmap shows the deviation from the average quality, split by position in the read (x-axis) and by flow cell tile (y-axis). Blue colors are good and mean that there is little to no deviation from the average quality. Redder colors indicate that a particular tile had worse quality scores than other tiles at that base position. If there are areas of red, it may mean that your sequencer needs servicing, or it could be a much simpler issue, such as a bubble in the flow cell.

Fig. 2 Per base sequence quality for the walkthrough file 05-Colon_2.fastq. Overall this plot suggests high quality sequencing reads

7.4 Per Sequence Quality Scores

A useful metric is the mean quality score across an entire read. This plot shows the distribution of these per-read mean values across all sequencing reads. You would expect a single sharp peak at a high quality value, rising steeply and falling away again rapidly. Multiple peaks would suggest that a group of sequencing reads were of poorer quality.

7.5 Per Base Sequence Content

A line graph showing the percentage of each nucleotide base (A, C, G, T) at each position in a read. We expect the relative proportions to remain constant over the length of a read. The distribution can be skewed, particularly at the beginning of a read, if adapter sequences are present. The type of library preparation can also introduce bias.

7.6 Per Sequence GC Content

A plot showing the theoretical and observed distribution of percentage GC content in each read. Ideally, it should display a normally distributed bell curve that matches the theoretical distribution. Abnormalities in the observed distribution can indicate some form of contamination, and overrepresented sequences can also skew the distribution. RNA sequencing is another factor that can result in some abnormal observations for this plot.

7.7 Per Base N Content

When a sequence is unknown, it is replaced with “N”. This plot shows the percentage of bases that are “N” across the length of the reads. Small proportions of N may be possible, especially where quality drops toward the end of the read. If there are many N’s, it may be associated with low quality.

7.8 Sequence Length Distribution

A plot demonstrating the distribution of sequence lengths. The distribution should peak at a length similar to your expected read length.
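The same distribution can be tabulated directly from a FASTQ file as a quick cross-check on the report. This is a minimal sketch that assumes the standard four-lines-per-record FASTQ layout (no wrapped sequence lines). The function name read_length_distribution is illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# read_length_distribution FASTQ
# Prints "length<TAB>count" for a FASTQ file (plain or gzip-compressed),
# assuming the standard four-lines-per-record layout.
read_length_distribution() {
    local fastq="$1"
    case "$fastq" in
        *.gz) gzip -cd "$fastq" ;;
        *)    cat "$fastq" ;;
    esac |
        awk 'NR % 4 == 2 { n[length($0)]++ }
             END { for (len in n) printf "%d\t%d\n", len, n[len] }' |
        sort -n
}

# Example call from within the analysis/ directory:
# read_length_distribution raw_data/05-Colon_2.fastq.gz
```

The second line of every four-line record holds the sequence, so counting the lengths of those lines gives the number of reads at each length.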

7.9 Sequence Duplication Levels

A plot showing how many times each sequence is duplicated (x-axis) against the percentage of reads (y-axis). You should expect most sequences not to be duplicated, so there should be a peak at 1 that then drops off.

7.10 Overrepresented Sequences

Overrepresented sequences can signify either some form of contamination or something that is biologically important. FastQC will automatically try to find the source of the overrepresented sequence, which can include some adapter sequences. This warning may also be triggered if the library fragmentation is a bit more selective.

7.11 Adapter Content

Adapter sequences are synthetic oligonucleotides that are joined to the ends of sequencing reads with the purpose of attaching them to the flow cell. In this section, Illumina adapter sequences, which might be alluded to in the overrepresented sequences or in the kmer content, are detected. The plot shows the percentage of the library, which is estimated to be attributed to adapter sequences at each position in a read. This graph typically increases in percentage toward the end of the read because FastQC assumes the rest of the read contains adapter sequences when one is detected in the middle of a read.

7.12 K-Mer Content

A k-mer is simply a string of nucleotides of length k. This analysis shows the presence of overrepresented k-mers (FastQC uses k = 7, or 7-mers) and how they are distributed across the length of the read. On its own, this isn't particularly useful, but it can be used in conjunction with some of the other metrics above, like adapter content.

7.13 Scaling Quality Control to Multiple Samples

In the case of the first file (05-Colon_2_fastqc), overall the quality looks quite good. FastQC is a great tool for visualizing quality control statistics, but it can be unfeasibly laborious to investigate each report in depth and to compare samples or identify outliers. The tool MultiQC [30] provides a great approach to summarizing quality control reports for all samples from a range of tools in addition to FastQC. Running it from the command line is simple: run the command "multiqc ." (the trailing dot denotes the current directory) from within the directory containing the FastQC reports. The output will be a file named multiqc_report.html. This report contains much of the same information as the FastQC reports, but it presents the data in a way that lets you visualize and compare all samples.
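Alongside the MultiQC report, a quick text summary of which modules raised warnings can be useful on a remote machine without a browser. The sketch below relies on the summary.txt file inside each unpacked FastQC report, which, to our understanding, lists one tab-separated line per module (status, module name, file name). The reports are assumed to have been unpacked first (for example with FastQC's --extract option or unzip), and the function name flag_fastqc_modules is illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

# flag_fastqc_modules DIR
# Scans DIR for unpacked FastQC reports (*_fastqc/summary.txt) and prints
# every module with a WARN or FAIL status as "status<TAB>module<TAB>file".
# (Assumes the tab-separated summary.txt layout described in the lead-in.)
flag_fastqc_modules() {
    local dir="$1"
    find "$dir" -path '*_fastqc/summary.txt' -exec cat {} + |
        awk -F'\t' '$1 == "WARN" || $1 == "FAIL"' |
        sort
}

# Example call from within the analysis/ directory:
# flag_fastqc_modules raw_data
```

This is not a replacement for inspecting the plots. It only narrows down which samples and modules deserve a closer look.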

8 Trimming and Filtering Reads

It is common for raw sequencing data to contain imperfections. The real benefit of visualizing QC is to compare the quality metrics before and after trimming to remove inappropriate sequences. For each of the quality metrics above, you need to critically evaluate whether your trimming parameters have resolved the issues and left a sufficient quantity of data for your analysis; MultiQC can help answer these questions. To ensure that the sequences are all of good quality we will apply Trimmomatic [31], which is widely used and specifically designed for Illumina reads. Trimmomatic is a flexible tool with a wide variety of parameters, such as removing the ends of reads below a defined quality value and setting a minimum read length. Additionally, Trimmomatic has been granted permission to use and distribute proprietary adapter sequences so that these can be searched for and removed from sequencing data. To trim the reads, a shell script written in Bash helps to avoid repeating commands. The first line, #!/bin/bash, tells your computer to run the contents of the script in the bash shell [32]. The line containing set -euxo pipefail is a useful piece of defensive programming for bash scripts. The -e flag tells the script to terminate if any command returns an error, and -o pipefail extends this to pipelines, so that a pipeline fails if any command within it fails. The -u flag stops the script from running commands with a variable that has not been set. The -x flag makes bash print each command before executing it, which makes it easier to debug exactly where an error occurs. Next, the script takes its six inputs from the command line and assigns them to variables ($1 represents the first entry on the command line after the script name, $2 the second, and so on). Copy and paste the code below into a file called “quality_trim.sh” and save it in the analysis/ directory. Also ensure that you are in this directory.

Quality Control in Metagenomics Data


#!/bin/bash
# This is a script to quality trim files
# To run the script use:
# ./quality_trim.sh R1.fastq.gz R2.fastq.gz output_paired1.fastq.gz output_unpaired1.fastq.gz output_paired2.fastq.gz output_unpaired2.fastq.gz
set -euxo pipefail

# Get the input and output file names from the command line
raw_fastq1="$1"
raw_fastq2="$2"
out_paired1="$3"
out_unpaired1="$4"
out_paired2="$5"
out_unpaired2="$6"

# Ensure the adapter sequence file is downloaded from the trimmomatic GitHub repository
# (-O sets the output file name; the raw URL is used so that the FASTA file itself is saved, not an HTML page)
[ -f NexteraPE-PE.fa ] || wget -O NexteraPE-PE.fa https://raw.githubusercontent.com/timflutre/trimmomatic/master/adapters/NexteraPE-PE.fa

# Run Trimmomatic in paired-end mode
trimmomatic PE "$raw_fastq1" "$raw_fastq2" \
  "$out_paired1" "$out_unpaired1" \
  "$out_paired2" "$out_unpaired2" \
  ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 HEADCROP:12 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50

The Trimmomatic command in the script processes each pair of files by first finding and removing adapter sequences, then removing the first 12 bases and any poor-quality bases at either end of the read, then scanning the read four nucleotides at a time and trimming when the average quality falls below 20 (a 1/100 chance of an error). Lastly, Trimmomatic keeps only reads with a minimum length of 50 nucleotides. Before running the script, make it executable by typing chmod +x quality_trim.sh on the command line. Then run it as follows. Be sure to run the script on all sequencing files in the raw_data/ directory, including the negative control samples.

./quality_trim.sh raw_data/03-Cecum_1.fastq.gz raw_data/03-Cecum_2.fastq.gz processed_data/03-Cecum_R1_trimmed.fastq.gz processed_data/03-Cecum_R1_unpaired.fastq.gz processed_data/03-Cecum_R2_trimmed.fastq.gz processed_data/03-Cecum_R2_unpaired.fastq.gz
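The sliding-window step and the meaning of a quality score of 20 can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Trimmomatic's own code; the function names and the toy quality values are hypothetical.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def sliding_window_trim(qualities, window=4, threshold=20):
    """Return the number of leading bases to keep.

    The read is scanned from the 5' end and cut at the start of the first
    window whose mean quality falls below the threshold (simplified logic).
    """
    for start in range(len(qualities) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < threshold:
            return start
    return len(qualities)


# Toy per-base Phred scores for one read (hypothetical values)
toy_qualities = [38, 38, 37, 36, 35, 30, 22, 12, 8, 5]
bases_to_keep = sliding_window_trim(toy_qualities, window=4, threshold=20)
error_at_q20 = phred_to_error_probability(20)
```

In this toy read the first window with a mean below 20 starts at the sixth base, so only the first five bases would be kept, and such a read would then be discarded by a minimum length filter of 50.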

You could create a “for” loop in bash to simplify processing these files. The loop reads: for each of the identifiers denoted “s”, run the quality trimming script with the corresponding command line parameters.


for s in '03-Colon_' '04-Colon_' '05-Colon_' '03-Cecum_' '04-Cecum_' '05-Cecum_' '03-Rumen_' '04-Rumen_' '05-Rumen_'; do
  ./quality_trim.sh raw_data/${s}1.fastq.gz raw_data/${s}2.fastq.gz \
    processed_data/${s}R1_trimmed.fastq.gz processed_data/${s}R1_unpaired.fastq.gz \
    processed_data/${s}R2_trimmed.fastq.gz processed_data/${s}R2_unpaired.fastq.gz
done

Trimmed sequences will be present in the processed_data/ directory. Run FastQC on your trimmed sequencing files and examine the output. It should appear much more reasonable than the raw data.

9 Removing Host Derived Content

A large proportion of the DNA and RNA in metagenomic samples obtained from organisms is host-derived [1]. It is advisable to deplete host sequences from samples: this limits taxonomic misclassification, reduces file sizes, and minimizes the computing resources required. This study used a laboratory-based method for microbial DNA enrichment, but it is still good practice to apply depletion approaches to the sequence data. There are many tools for this job [33], but we recommend BBDuk from the BBTools suite. In this case, we will use a reference genome for cattle (Bos taurus) from NCBI and download it using the wget command:

wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Bos_taurus/latest_assembly_versions/GCF_002263795.1_ARS-UCD1.2/GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz

Then run depletion with:

mkdir -pv depleted_data
# Run BBDuk on a single sample
bbduk.sh in=processed_data/03-Cecum_R1_trimmed.fastq.gz in2=processed_data/03-Cecum_R2_trimmed.fastq.gz out=depleted_data/03-Cecum_R1_depleted.fastq.gz out2=depleted_data/03-Cecum_R2_depleted.fastq.gz outs=depleted_data/03-Cecum_R3_depleted.fastq.gz ref=GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz mcf=0.5
# For loop to process all samples (${s} keeps the variable name separate from the file name suffix)
for s in '03-Colon_' '04-Colon_' '05-Colon_' '03-Cecum_' '04-Cecum_' '05-Cecum_' '03-Rumen_' '04-Rumen_' '05-Rumen_'; do
  bbduk.sh in=processed_data/${s}R1_trimmed.fastq.gz in2=processed_data/${s}R2_trimmed.fastq.gz out=depleted_data/${s}R1_depleted.fastq.gz out2=depleted_data/${s}R2_depleted.fastq.gz outs=depleted_data/${s}R3_depleted.fastq.gz ref=GCF_002263795.1_ARS-UCD1.2_genomic.fna.gz mcf=0.5
done


This command works by comparing each sequencing read to all of the unique k-mers in the reference genome. If 50% or more of a read is covered by k-mers found in the reference genome (mcf=0.5), the read is removed. The command returns paired-end reads as well as single-end reads whose mate has been removed. This step requires more RAM than most personal computers have (~30–40 GB) and so can be omitted for the purposes of this tutorial. If you have access to that much RAM, feel free to run depletion using the code above.
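The covered-fraction rule described above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it ignores reverse complements and BBDuk's internal data structures, uses a short k for readability, and the reference and reads are hypothetical toy sequences.

```python
def reference_kmer_set(reference, k):
    """Collect the set of unique k-mers present in a reference sequence."""
    return {reference[i:i + k] for i in range(len(reference) - k + 1)}


def covered_fraction(read, reference_kmers, k):
    """Fraction of read bases covered by at least one reference k-mer."""
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in reference_kmers:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(read) if read else 0.0


def is_host_read(read, reference_kmers, k, min_cov_fraction=0.5):
    """Flag a read as host-derived when enough of it matches the reference."""
    return covered_fraction(read, reference_kmers, k) >= min_cov_fraction


# Toy reference and reads (hypothetical sequences; forward strand only)
K = 5
host_kmers = reference_kmer_set("ACGTTGCAAGGCTTAACCGG", K)
host_like = "TTGCAAGGCT"      # exact substring of the toy reference
microbial_like = "GGGGGGGGGG"  # shares no 5-mer with the toy reference
chimeric = "TTGCAAAAAA"        # only the first six bases are covered
```

With a threshold of 0.5, the first and third toy reads would be removed as host-derived and the second would be retained.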

10 Taxonomic Classification

Part of the art of microbial bioinformatics is knowing which computational tool to use for taxonomic classification. Each tool has strengths and weaknesses. Reading tool manuscripts is not always the best way to gauge performance, and there have been numerous independent benchmarking studies that examine which tool might be best in particular circumstances [33, 34]. Kraken [35] is a widely used tool for taxonomic classification. Usefully, this tool allows for user-defined metagenomic databases. It is a k-mer-based approach that is typically quite sensitive [33]. Drawbacks include false positive results, and databases containing a large number of sequences can prohibitively increase the memory requirements (although more recent versions have improved this aspect). Additionally, sequencing reads that do not contain enough unique k-mers to be mapped to a species-level classification are given a less specific classification at a higher taxonomic level (e.g., order or family level). These classifications can be redistributed to a more specific taxonomic level (e.g., species level) using a Bayesian approach (Bracken [36]). Taxonomic profilers such as mOTUs2 [37] and MetaPhlAn [38] use pre-built databases of taxonomically informative sequences and are therefore less adaptable to user-defined sequences. By using smaller pre-built databases, they are less computationally intensive and often perform well at predicting relative abundance. Each taxonomic profiler uses a different rationale to develop its pre-computed database. mOTUs2, for example, uses a database of 40 widely expressed single-copy marker genes across the tree of life that were shown to be highly taxonomically informative [37]. In this walkthrough we will run MetaPhlAn on the quality-trimmed bovine metagenome data. To do so, we will use a Snakemake pipeline (implemented in Python) to make processing all samples easier. The pipeline will be kept simple by analyzing only the paired-end samples that have undergone quality trimming (another potential source of data is the unpaired reads produced by quality trimming and/or host depletion).


#!/usr/bin/env python
# import the modules needed throughout the script
import glob
import re

# find all input files to process
input_files = glob.glob("processed_data/*_R1_trimmed.fastq")

# Prepare a list of output files expected based on our input list
output_files = [re.sub(r'_R1_trimmed\.fastq$', '_mph.tsv', file) for file in input_files]

# The default target: request every expected MetaPhlAn report
rule all:
    input:
        output_files

# Run MetaPhlAn on one pair of trimmed FASTQ files
rule run_metaphlan:
    input:
        R1="processed_data/{sample}_R1_trimmed.fastq",
        R2="processed_data/{sample}_R2_trimmed.fastq"
    output:
        out="processed_data/{sample}_mph.tsv",
        bwtout="processed_data/{sample}_bowtie.bz"
    shell:
        "metaphlan {input.R1},{input.R2} --input_type fastq --tax_lev g -o {output.out} --sample_id_key {wildcards.sample} --bowtie2out {output.bwtout}"

Save the above to a file called “mph.snake” in the “analysis/” directory. First decompress the trimmed reads with “gunzip processed_data/*gz”, then launch the pipeline from the command line with “snakemake --snakefile mph.snake --jobs 1”. This is a Snakemake workflow with only one rule, which runs MetaPhlAn on all the files it finds. The output files are tab-separated files containing some header information about each run and the sample name, followed by the taxa, their NCBI taxonomic IDs, and their relative abundance. MetaPhlAn comes with some handy utility scripts. All of the reports can be merged with the following command: “merge_metaphlan_tables.py processed_data/*mph.tsv > all_metaphlan_genera.tsv”.
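As a quick check of a merged table before moving on, it can be read with a few lines of Python. The sketch below assumes a layout with comment lines starting with “#”, a “clade_name” column, an “NCBI_tax_id” column, and one column per sample; the exact layout depends on the MetaPhlAn version, and the toy table, sample names, and abundance values here are hypothetical.

```python
import csv
import io


def read_merged_table(handle):
    """Parse a merged abundance table into {sample: {clade: abundance}}."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    # Every column other than the two annotation columns is treated as a sample
    sample_names = [n for n in header if n not in ("clade_name", "NCBI_tax_id")]
    sample_columns = [header.index(n) for n in sample_names]
    table = {name: {} for name in sample_names}
    for row in rows:
        clade = row[0].split("|")[-1]  # keep only the most specific rank
        for name, column in zip(sample_names, sample_columns):
            table[name][clade] = float(row[column])
    return table


# Toy merged table (hypothetical content; taxonomy IDs replaced by NA)
toy_table = io.StringIO(
    "#toy header line\n"
    "clade_name\tNCBI_tax_id\t03-Colon_mph\t03-Cecum_mph\n"
    "k__Bacteria|p__Bacteroidetes|g__Prevotella\tNA\t40.0\t25.5\n"
    "k__Bacteria|p__Firmicutes|g__Ruminococcus\tNA\t60.0\t74.5\n"
)
abundances = read_merged_table(toy_table)
```

A useful sanity check on real data is that the relative abundances within each sample sum to approximately 100.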

11 Data Handling, Visualization, and Comparative Analysis

The data in all_metaphlan_genera.tsv capture the relative abundance of all genera found in our nine samples. Now we can begin to carry out some analysis in the programming language R, which you can access by typing R on the command line. The code below will read in and “clean” our data so that it is in a format that can be analyzed.


# load libraries
library(tidyverse)
library(magrittr)
library(ggpubr)
library(viridis)
library(ggbiplot)
library(ape)
library(ggrepel)
# read in the data - skipping over the header line
raw_data <- [...] %>% select(-NCBI_tax_id)
# form a true community matrix
filtered_data <- [...] %>% column_to_rownames('clade_name')
# transpose the data so that taxa are across top
community_matrix <- [...]

[...] sample_x_lr.sam 2>log

Here “-ax map-ont” denotes ONT reads. The “2>” redirects standard error to the file called “log.” The SAM file can then be processed in the same way as described above for short reads, in order to calculate the coverage of each LRAC sequence. In some cases, it may be desirable to obtain a detailed picture of how coverage changes across a sequence. This can be estimated by calculating the per-base coverage, defined as the number of mapped reads covering a particular base in the sequence, using the genomeCoverageBed tool in the BEDTools suite:

$ genomeCoverageBed -ibam sample_x_sort.bam -d -g header.txt > perbasecov.txt

where “-d” reports the depth at each position of the LRAC with 1-based coordinates, and “-g” specifies a tab-delimited file that stores each contig identifier and its length (in bp). Detailed parameter descriptions can be found in the documentation of each tool.

3.6 Metagenome Binning of Contigs Assembled Using Short Reads

A typical short-read metagenome assembly [58] usually generates tens, hundreds, or possibly thousands of contigs from a typical member of the microbial community. These contigs need to be grouped into draft genomes, a procedure referred to as “genome binning” (see Note 12). The draft genomes generated by this process are currently referred to as metagenome-assembled genomes (MAGs). The terminology remains confusing because the term “binning” was used originally, and in many cases still is, to refer

Long-Read Metagenomics


to the taxonomic annotation of contig sequences (e.g., see [24] for one recent example in the context of long-read genomes). Once recovered from a metagenome assembly, one genome bin (MAG) should ideally contain only contigs belonging to one genome and contain all the contigs arising from that genome. However, in practice, determining whether that is the case remains highly challenging, with issues of incompleteness (missing contigs) and contamination (contigs from non-cognate genomes) remaining complex to resolve. This is particularly true for metagenomes from microbial communities of substantive complexity, such as mammalian gastrointestinal tracts, soils and sediments, aquatic and marine environments, and wastewater communities [8, 59–63]. At the time of writing, there remains ample scope for further development of our understanding of binning procedures for short-read data, and the limitations discussed above were also a major driver for the development of long-read metagenomics. Here we illustrate a command line execution of metagenome binning using MetaBAT2, a widely used automated binning method, as follows:

$ runMetaBat.sh -d -v -m 2000 SRAC.fasta sample_X_sort.bam

where “sample_X_sort.bam” is a sorted BAM file made from aligning the filtered reads to the assembled contigs, as described above. MetaBAT2 uses probabilistic distances of genome abundance and tetranucleotide frequencies to cluster contigs into draft genome bins, as described in [64]. Here we are using it in single sample mode; however, in its original use, MetaBAT2 uses data from multiple samples (referred to as co-assembly procedures in the literature), which we do not discuss here but refer the reader to [64] for further details. The recovered genomes are output into FASTA files, in which each contig is represented as a single sequence. More detailed descriptions of parameter and flag settings for MetaBAT2 can be found in the documentation.
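To give a feel for one of the two signals used in binning, the Python sketch below computes a tetranucleotide frequency profile for a contig and a simple distance between two profiles. This is an illustration of the concept only: MetaBAT2 itself uses a probabilistic distance combined with abundance information [64], and the Euclidean distance, function names, and toy sequences here are assumptions made for the example.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the DNA alphabet
ALL_TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 256 tetranucleotides."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    # Windows containing ambiguous bases (e.g., N) are ignored
    total = sum(counts[t] for t in ALL_TETRAMERS)
    return {t: (counts[t] / total if total else 0.0) for t in ALL_TETRAMERS}


def euclidean_distance(freq_a, freq_b):
    """Plain Euclidean distance between two frequency profiles."""
    return sum((freq_a[t] - freq_b[t]) ** 2 for t in ALL_TETRAMERS) ** 0.5


# Toy contigs (hypothetical sequences, far shorter than real contigs)
contig_a = tetranucleotide_frequencies("ACGTACGTACGT")
contig_b = tetranucleotide_frequencies("ACGTACGTACGTACGT")
contig_c = tetranucleotide_frequencies("GGGGCCCCGGGGCCCC")
```

Contigs with similar composition give small distances (here, the first two toy contigs), whereas a contig with a different composition is further away; real binning tools combine this kind of signal with coverage.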
3.7 Quality Assessment of Recovered Genome Bins

As metagenome assembly is a challenging and computationally difficult task due to the presence of multiple species, as discussed above, the recovered genome bins have to be assessed for quality in terms of completeness and contamination. CheckM is a widely used automated method for such assessment [27]. CheckM uses sets of single-copy marker genes within a lineage and examines the presence of the genes in the bins, under the assumption that a complete genome will contain a full complement, or close to a full complement of these genes, and that these will be observed in single copy within a genome. Assuming the FASTA files containing the genomes are in the directory (folder) called bins_dir, CheckM can be run as follows:


Krithika Arumugam et al.

$ checkm lineage_wf -t 44 -x fa bins_dir checkm_lineage_wf_bins/

Among other outputs, CheckM returns a table that contains per-bin summary statistics, including estimated completeness and contamination. According to consensus standards established by the Genomic Standards Consortium (GSC) [65], a high-quality metagenome-assembled genome (MAG) is minimally consistent with holding completeness >90% and contamination <5% [...]

[...] Rank -> Species) and then selected (Choose menu, Select -> All nodes) before exporting frameshift-corrected contigs (Choose menu, File -> Export -> Export Frame-Shift Corrected Reads). In the output files, the header corresponding to each LRAC sequence is augmented to contain the number of insertions and deletions causing frameshifts in forward and reverse directions, for example, the FASTA header “>900|corrected+120–50.”


Fig. 3 Screenshot from MEGAN-LR showing the process of “exporting frameshift corrected contigs/reads”

Here, “900” is the contig name from the assembly, and 120 and 50 are the insertions and deletions in the forward and reverse directions, respectively. The “>” character denotes headers in the FASTA file format, and should not be confused with the use of the same character to denote the command prompt in the R Statistical Computing Environment, as used below.

3.9.2 Racon

Racon is a consensus module and error correction tool designed to be coupled with Miniasm [70]. It is based on the use of Partial Order Alignment (POA) graphs [71, 72]. Racon can be run as follows:

$ racon trim.fastq sample_x_lr.sam flye_dir/LRAC.fasta -t 44

using the LRAC sequences generated from metaFlye, as described above.

3.9.3 Medaka

Medaka was developed by Oxford Nanopore Technologies (ONT) as a tool to create consensus sequences and variant calls from ONT data. Medaka can be run from a conda environment:

$ source activate medaka
$ medaka_consensus -i trim.fasta -d flye_dir/LRAC.fasta -o medaka_consensus_dir -t 8 -m r941_min_high_g330

It is essential to specify the correct model using the “-m” flag, according to the basecaller used. Detailed parameter descriptions can be found in the documentation. See Notes for further details about using “Conda.”


3.10 Comparative Analysis of Short- and Long-Read Assemblies


Given that the short-read and long-read data were obtained from the same DNA aliquot, a natural question is the degree to which genomes extracted from either data source recapitulate each other, since it is difficult to know a priori which source will produce the superior genome, given the different strengths and weaknesses of each approach. In our previous work [2, 11], we discuss a simple method to examine this issue, named the concordance statistic, κ, which is derived from BLASTN analysis of contigs from a fractionated short-read genome (treating these as the query sequences) against a chromosomal-length LRAC sequence. The concordance statistic is described in [11] and is a composite statistic that takes into account the quality and extent of alignments, the proportion of the short-read contigs that are aligned to the LRAC sequence, and the proportion of the LRAC that is tiled by short-read contigs. For a given pairwise combination of short-read genome bin and long-read chromosomal sequence, a κ-score close to 1 indicates a likely cognate genome pair (Fig. 4). Our R package srac2lrac (available from https://github.com/rbhwilliams/srac2lrac) can be used to calculate the concordance statistic as described below. Taking a set of contigs from a draft genome assembled from short-read data and a chromosomal-length (preferably complete) LRAC sequence, we first align the former (query) to the latter (subject) using BLASTN, returning a set of alignments in tabular form. The command lines for running BLASTN are given below. Subject sequences (LRAC) have to be indexed before BLASTN alignment.

$ makeblastdb -in LRAC_corrected.fasta -parse_seqids -dbtype nucl >log
$ blastn -db LRAC_corrected.fasta -query SRAC_2k.fasta -outfmt 6 -out srac_spades2lrac_fs_flye.out

Tab-delimited output format (-outfmt 6) is specified for ease of parsing the data. The following steps are invoked from an R environment (RStudio or an R session initiated from the command prompt of a terminal window). R can be downloaded from https://www.r-project.org/

1. From within the R Statistical Computing Environment, the following R packages are loaded after installation, using the library command:
> library(Biostrings)
> library(gdata)
> library(RANN)
> library(igraph)
> library(abind)



Fig. 4 Summary of concordance statistical analysis for Candidatus Accumulibacter sp. isolate SSA1 chromosome (GenBank: CP058708.1) and a short-read metagenome-assembled genome from the same reactor community (bin 32). (a) Distribution of κ scores for SSA1 against 80 bins recovered from the corresponding short-read assembly. Bin 32 has the highest κ at 0.97; (b) coverage-GC plot for the short-read assembly, with bin 32 highlighted (closed black circles and dark grey convex hull; other bins highlighted by light grey convex hulls); (c) short-read (SR, black crosses) and long-read (LR, grey crosses) coverage profiles across SSA1. (d–f) BLASTN statistics for alignments of short read contigs (bin 32) against SSA1. Horizontal segments show alignment position on LR-chr, and height of segment is value of corresponding statistic (y-axis), namely percent identity (PID) (d), the ratio of alignment length to query length (al2ql) (e) and log10-bitscore (f). (g) GC content as a function of position on SSA1 (grey closed circles, computed in adjacent windows of length 46,700 bp) and for aligned short-read contigs (black closed circles); (h–k) distribution of four component statistics of κ (as described above). (See Arumugam et al. (2021) for further details (this legend modified from that of Fig. 1 of Arumugam et al. (2021) under Creative Commons Attribution 4.0 International License))

Additional R packages, including identifier.mapping_1.0, orf.lcam_1.0, and RKXM_1.0, can be downloaded from https://github.com/rbhwilliams/srac2lrac and installed using R CMD INSTALL from the shell:
$ R CMD INSTALL identifier.mapping_1.0.tar.gz
> library(identifier.mapping)
> library(RKXM)


2. The short-read genome bins from MetaBAT2 are first stored as a list in R:
> srac.bins.metabat.m2k [...]
[...] =2000 bp are stored as an R/Biostrings DNAStringSet, and then summarized into a data frame that contains contig length, coverage, and GC content:
> srac [...]
srac.summData [...]
lrac.flye [...]
blastn.flye [...]
source(file="srac2lrac.r")
7. Query names (contig headers from short read assembly) in the BLASTN output are shortened to match contig headers in “srac.summData.”
> blastn.flye.qnames [...]
colnames(blastn.flye) [...]
blastn1.flye [...] blastn2.flye [...] blastn3.flye [...] blastn4.flye [...] blastn5.flye [...] blastn6.flye [...]

[...] log 2>&1

where “--max-target-seqs” denotes the maximum number of aligned sequences written to the output. The “-f 6” denotes tabular output format. The “qseqid qlen slen sseqid sallseqid” denotes the columns to be written to the output. By importing this table into suitable analysis software, such as R, the distribution of the ratio of “qlen” to “slen” can be calculated and plotted. In Fig. 5, we show this distribution for Candidatus Accumulibacter sp. isolate SSA1 chromosome, complete genome (GenBank: CP058708.1), showing the case for (a) the uncorrected whole chromosomal sequence and (b) the same sequence corrected for frameshift errors using MEGAN-LR (Fig. 6).
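The ratio of query length to subject length can also be computed outside R. The following Python sketch reads a tabular file with the five columns named above and summarizes the ratios; the interval used to call a hit “near full length” (0.9 to 1.1), the function names, and the toy hits are hypothetical choices for illustration, not values taken from our analysis.

```python
import csv
import io

# Column order as requested on the command line described above
COLUMNS = ["qseqid", "qlen", "slen", "sseqid", "sallseqid"]


def length_ratios(handle):
    """Return qlen/slen for every alignment row in a tab-delimited table."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    return [int(row["qlen"]) / int(row["slen"]) for row in reader]


def fraction_near_full_length(ratios, lower=0.9, upper=1.1):
    """Fraction of hits whose query length is close to the subject length."""
    if not ratios:
        return 0.0
    return sum(lower <= r <= upper for r in ratios) / len(ratios)


# Toy alignment table (hypothetical identifiers and lengths)
toy_hits = io.StringIO(
    "gene_1\t300\t300\tprotA\tprotA\n"
    "gene_2\t150\t300\tprotB\tprotB\n"
    "gene_3\t290\t300\tprotC\tprotC\n"
    "gene_4\t100\t400\tprotD\tprotD\n"
)
ratios = length_ratios(toy_hits)
near_full = fraction_near_full_length(ratios)
```

Under the assumption that frameshift errors truncate predicted genes, a distribution concentrated near 1 after correction would suggest that most predicted proteins are close to the length of their best database match.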


Fig. 6 Completed genome of Candidatus Accumulibacter sp. isolate SSA1 chromosome (GenBank: CP058708.1). Circular plot was made using Circos (0.69-9); the green outer circle shows the entire chromosome, reconstructed in a single contig. Moving inwards, the positions of protein coding genes on the forward and reverse strand are shown in red and blue, respectively; the next three layers show the position of rRNA operons (brown); CRISPR regions (dark green), and tRNA genes (orange); the two next layers show mean-adjusted GC proportion and (innermost) long-read coverage

4 Notes

1. The commands mentioned above are described individually, but when submitting to a cluster set up with Sun Grid Engine (SGE), the commands can be run from a file with appropriate SGE options, some of which are shown below:


#!/bin/sh
#$ -cwd
#$ -pe orte 40
#$ -V
#$ -S /bin/bash

2. If computational resources are insufficient to process the data, the raw data can be subsampled to a depth sufficient for further processing. For example, Seqtk [76] can be used to subsample read data.
3. When using Porechop, if an adapter is found in the middle of a read, the read is split around the position of the adapter sequence motif (default option), resulting in the number of trimmed reads being greater than the number of raw reads. Reads with internal adapters can instead be discarded by enabling the “--discard_middle” parameter.
4. It is important to understand the FASTQ file format and PHRED quality scores in order to check for file errors, if any. The following links might be useful: https://en.wikipedia.org/wiki/FASTQ_format and https://drive5.com/usearch/manual/quality_score.html
5. The number and ordering of reads in the read1 and read2 FASTQ files have to be preserved (the paired-end option in cutadapt implements this by default) so as to not hinder further processing.
6. The Fastx toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) can be used for basic processing of short-read FASTQ files (example: FASTQ to FASTA conversion).
7. Assembly graphs in GFA format produced by genome assemblers can be visualized using BANDAGE [75].
8. The “--save-gp” parameter in the spades command line returns a High-Resolution Graph (HRG), which preserves variant information and can be used to resolve strains using recent tools like STRONG [74].
9. Relevant assembly metrics can be calculated using metaQUAST [73]. The manual describes the tool in detail.
10. Several tools, such as Busybee [55], MetaBCC-LR [79], and MEGAN-LR [24], have recently been developed to bin LRAC sequences.
11. SAM files can easily consume a considerable amount of storage space as they are uncompressed. They can be deleted after conversion to BAM in order to be mindful of storage space on the cluster.
12. Several metagenome binning tools have been developed to bin SRAC sequences, and benchmarking studies [77, 78] can be helpful when selecting a tool.
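In relation to Note 4, the structure of a FASTQ record and the decoding of its quality string can be sketched in Python as follows. The sketch assumes the Phred+33 encoding and four-line records; the function names and the toy record are hypothetical.

```python
def parse_fastq_record(lines):
    """Validate one four-line FASTQ record and return (name, sequence, quality)."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("FASTQ header line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("FASTQ separator line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    return header[1:], sequence, quality


def decode_phred33(quality):
    """Convert a Phred+33 encoded quality string to integer scores."""
    return [ord(character) - 33 for character in quality]


# Toy record (hypothetical read): 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0
toy_record = ["@read_1\n", "ACGT\n", "+\n", "II5!\n"]
name, sequence, quality = parse_fastq_record(toy_record)
scores = decode_phred33(quality)
```

Checks of this kind (header character, separator line, equal sequence and quality lengths) are a quick way to detect truncated or corrupted files before starting a long analysis.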


Acknowledgments

This research was supported by the Singapore National Research Foundation and Ministry of Education under the Research Centre of Excellence Programme and by program grants 1102-IRIS-1002 and 1301-IRIS-59 from the National Research Foundation (NRF), and in part by the Life Sciences Institute (LSI), National University of Singapore, and the National Supercomputing Centre (NSCC), Singapore, supported by Project 11000984. We thank our colleagues Xianghui Liu, Rogelio E. Zuniga-Montanez, Samarpita Roy, Guanglei Qiu, Ying Yu Law, Stefan Wuertz, Daniela I. Drautz-Moses, Federico M. Lauro, Daniel H. Huson, Peerada Prommeenate, Benjaphon Suraraksa, Varunee Kongduan, Adeline Chua, and Yuguang Ipsen for excellent collaboration in relation to sample and/or data provision, data analysis, and code provision.

References

1. Nicholls SM, Quick JC, Tang S, Loman NJ (2019) Ultra-deep, long-read nanopore sequencing of mock microbial community standards. Gigascience 8. https://doi.org/10.1093/gigascience/giz043
2. Arumugam K, Bağcı C, Bessarab I et al (2019) Annotated bacterial chromosomes from frameshift-corrected long-read metagenomic data. Microbiome 7. https://doi.org/10.1186/s40168-019-0665-y
3. Somerville V, Lutz S, Schmid M et al (2019) Long-read based de novo assembly of low-complexity metagenome samples results in finished genomes and reveals insights into strain diversity and an active phage system. BMC Microbiol 19(1):143
4. Bertrand D, Shaw J, Kalathiyappan M et al (2019) Hybrid metagenomic assembly enables high-resolution analysis of resistance determinants and mobile elements in human microbiomes. Nat Biotechnol 37:937–944
5. Stewart RD, Auffret MD, Warr A et al (2019) Compendium of 4,941 rumen metagenome-assembled genomes for rumen microbiome biology and enzyme discovery. Nat Biotechnol 37:953–961
6. Moss EL, Maghini DG, Bhatt AS (2020) Complete, closed bacterial genomes from microbiomes using nanopore sequencing. Nat Biotechnol 38:701–707
7. Giguere DJ, Bahcheli AT, Joris BR, Paulssen JM (2020) Complete and validated genomes from a metagenome. bioRxiv
8. Singleton CM, Petriglieri F, Kristensen JM et al (2021) Connecting structure to function with the recovery of over 1000 high-quality metagenome-assembled genomes from activated sludge using long-read sequencing. Nat Commun 12:2009
9. Hu Y, Fang L, Nicholson C, Wang K (2020) Implications of error-prone long-read whole-genome shotgun sequencing on characterizing reference microbiomes. iScience 23:101223
10. Cuscó A, Pérez D, Viñes J et al (2021) Long-read metagenomics retrieves complete single-contig bacterial genomes from canine feces. BMC Genomics 22:330
11. Arumugam K, Bessarab I, Haryono MAS et al (2021) Recovery of complete genomes and non-chromosomal replicons from activated sludge enrichment microbial communities with long read metagenome sequencing. NPJ Biofilms Microbiomes 7:1–13
12. Liu L, Wang Y, Che Y et al (2020) High-quality bacterial genomes of a partial-nitritation/anammox system by an iterative hybrid assembly method. Microbiome 8:155
13. Antipov D, Korobeynikov A, McLean JS, Pevzner PA (2016) hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics 32:1009–1015
14. Chng KR, Li C, Bertrand D et al (2020) Cartography of opportunistic pathogens and antibiotic resistance genes in a tertiary hospital environment. Nat Med 26:941–951
15. Brown CL, Keenum IM, Dai D et al (2021) Critical evaluation of short, long, and hybrid assembly for contextual analysis of antibiotic resistance genes in complex environmental metagenomes. Sci Rep 11:3753

16. Morisse P, Lecroq T, Lefebvre A (2020) Long-read error correction: a survey and qualitative comparison. bioRxiv 2020.03.06.977975
17. Andrews S, Others (2010) FastQC: a quality control tool for high throughput sequence data. Available online at http://www.bioinformatics.babraham.ac.uk/projects/fastqc
18. Martin M (2011) Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17:10–12
19. Wick R (2017) Porechop. GitHub. https://github.com/rrwick/Porechop
20. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA (2017) metaSPAdes: a new versatile metagenomic assembler. Genome Res 27:824–834
21. Kolmogorov M, Bickhart DM, Behsaz B et al (2020) metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 17:1103–1110
22. Kang DD, Li F, Kirton E et al (2019) MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7:e7359
23. Huson DH, Beier S, Flade I et al (2016) MEGAN community edition - interactive exploration and analysis of large-scale microbiome sequencing data. PLoS Comput Biol 12:e1004957
24. Huson DH, Albrecht B, Bağcı C et al (2018) MEGAN-LR: new algorithms allow accurate binning and easy interactive exploration of metagenomic long reads and contigs. Biol Direct 13:6
25. Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60
26. Chaumeil P-A, Mussig AJ, Hugenholtz P, Parks DH (2019) GTDB-Tk: a toolkit to classify genomes with the genome taxonomy database. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz848
27. Parks DH, Imelfort M, Skennerton CT et al (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25:1043–1055
28. Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069
29. Vaser R, Sović I, Nagarajan N, Šikić M (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res 27:737–746
30. Medaka: sequence correction tool provided by ONT. In: GitHub. https://github.com/nanoporetech/medaka

257

31. Olm MR, Brown CT, Brooks B, Banfield JF (2017) dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J 11:2864–2868 32. Madden T (2013) The BLAST sequence analysis tool. In: The NCBI handbook [Internet], 2nd edn. National Center for Biotechnology Information (US), Bethesda 33. Langmead B, Salzberg SL (2012) Fast gappedread alignment with Bowtie 2. Nat Methods 9: 357–359 34. Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34: 3094–3100 35. Danecek P, Bonfield JK, Liddle J et al (2021) Twelve years of SAMtools and BCFtools. Gigascience 10. https://doi.org/10.1093/ gigascience/giab008 36. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842 37. R Core Team (2020) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. https://www.R-project.org/ 38. Wick RR, Judd LM, Holt KE (2019) Performance of neural network basecalling tools for Oxford Nanopore sequencing. Genome Biol 20:129 39. Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM (2013) An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS One 8:e85024 40. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120 41. Bushnell B BBDuk: adapter. Quality trimming and filtering. https://sourceforge.net/pro jects/bbmap/ 42. Miller JR, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95:315–327 43. Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29:987–991 44. Pop M (2009) Genome assembly reborn: recent computational challenges. Brief Bioinform 10:354–366 45. Pop M, Salzberg SL, Shumway M (2002) Genome sequence assembly: algorithms and issues. Computer 35:47–54 46. 
Quince C, Walker AW, Simpson JT et al (2017) Shotgun metagenomics, from sampling to analysis. Nat Biotechnol 35:833–844 47. Peng Y, Leung HCM, Yiu SM, Chin FYL (2012) IDBA-UD: a de novo assembler for

258

Krithika Arumugam et al.

single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28: 1420–1428 48. Boisvert S, Raymond F, Godzaridis E et al (2012) Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol 13: R122 49. Li D, Liu C-M, Luo R et al (2015) MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31: 1674–1676 50. Koren S, Walenz BP, Berlin K et al (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res 27:722–736 51. Wick RR, Judd LM, Gorrie CL, Holt KE (2017) Unicycler: resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol 13:e1005595 52. Shafin K, Pesout T, Lorig-Roach R et al (2020) Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nat Biotechnol 38:1044– 1053 53. Vaser R, Sˇikic´ M (2021) Time- and memoryefficient genome assembly with Raven. Nat Comput Sci 1:332–336 54. Antipov D, Hartwick N, Shen M et al (2016) plasmidSPAdes: assembling plasmids from whole genome sequencing data. Bioinformatics 32:3380–3387 55. Laczny CC, Kiefer C, Galata V et al (2017) BusyBee Web: metagenomic data analysis by bootstrapped supervised binning and annotation. Nucleic Acids Res 45:W171–W179 56. Krzywinski M, Schein J, Birol I et al (2009) Circos: an information aesthetic for comparative genomics. Genome Res 19:1639–1645 57. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760 58. Walt AJ van der, van der Walt AJ, van Goethem MW et al (2017) Assembling metagenomes, one community at a time. BMC Genomics 18:521 59. Xie F, Jin W, Si H et al (2021) An integrated gene catalog and over 10,000 metagenomeassembled genomes from the gastrointestinal microbiome of ruminants. Microbiome 9:137 60. 
Delmont TO, Eren AM, Maccario L et al (2015) Reconstructing rare soil microbial genomes using in situ enrichments and metagenomics. Front Microbiol 6:358 61. Slaby BM, Hackl T, Horn H et al (2017) Metagenomic binning of a marine sponge

microbiome reveals unity in defense but metabolic specialization. ISME J 11:2465–2478 62. Speth DR, In’t Zandt MH, Guerrero-Cruz S et al (2016) Genome-based microbial ecology of anammox granules in a full-scale wastewater treatment system. Nat Commun 7:11172 63. Parks DH, Rinke C, Chuvochina M et al (2017) Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life. Nat Microbiol 2:1533– 1542 64. Kang DD, Froula J, Egan R, Wang Z (2015) MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3:e1165 65. Field D, Amaral-Zettler L, Cochrane G et al (2011) The genomic standards consortium. PLoS Biol 9:e1001088 66. Bowers RM, Kyrpides NC, Stepanauskas R et al (2017) Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat Biotechnol 35: 725–731 67. Parks DH, Chuvochina M, Waite DW et al (2018) A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36: 996–1004 68. Parks DH, Chuvochina M, Chaumeil P-A et al (2020) A complete domain-to-species taxonomy for bacteria and archaea. Nat Biotechnol 38:1079–1086 69. Watson M, Warr A (2019) Errors in long-read assemblies can critically affect protein prediction. Nat Biotechnol 37:124–126 70. Li H (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics 32:2103–2110 71. Lee C (2003) Generating consensus sequences from partial order multiple sequence alignment graphs. Bioinformatics 19:999–1008 72. Lee C, Grasso C, Sharlow MF (2002) Multiple sequence alignment using partial order graphs. Bioinformatics 18:452–464 73. Mikheenko A, Saveliev V, Gurevich A (2016) MetaQUAST: evaluation of metagenome assemblies. Bioinformatics 32:1088–1090 74. Quince C, Nurk S, Raguideau S et al (2021) Metagenomics strain resolution on assembly graphs. Genome Biol 22(1):214. https://doi. 
org/10.1186/s13059-021-02419-7 75. Wick RR, Schultz MB, Zobel J, Holt KE (2015) Bandage: interactive visualization of de novo genome assemblies. Bioinformatics 31:3350–3352

Long-Read Metagenomics 76. Li H (2012) seqtk Toolkit for processing sequences in FASTA/Q formats. GitHub 767:69 77. Yue Y, Huang H, Qi Z et al (2020) Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets. BMC Bioinform 21:334 78. Sczyrba A, Hofmann P, Belmann P et al (2017) Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nat Methods 14:1063–1071

259

79. Wickramarachchi A, Mallawaarachchi V, Rajan V, Lin Y (2020) MetaBCC-LR: metagenomics binning by coverage and composition for long reads. Bioinformatics 36:i3–i11 80. Mo¨lder F, Jablonski KP, Letcher B et al (2021) Sustainable data analysis with Snakemake. F1000Res 10:33 81. Di Tommaso P, Chatzou M, Floden EW et al (2017) Nextflow enables reproducible computational workflows. Nat Biotechnol 35:316– 319

Chapter 13 Cloud Computing for Metagenomics: Building a Personalized Computational Platform for Pipeline Analyses

Martin Callaghan

Abstract Cloud Computing services such as Microsoft Azure, Amazon Web Services, and Google Cloud provide a range of tools and services that enable scientists to rapidly prototype, build, and deploy platforms for their computational experiments. This chapter describes a protocol to deploy and configure an Ubuntu Linux Virtual Machine in the Microsoft Azure cloud, which includes Miniconda Python, a Jupyter Lab server, and the QIIME toolkit configured for access through a web browser to facilitate a typical metagenomics analysis pipeline.

Key words Cloud, Azure, Jupyter, QIIME, Metagenomics, Python

1 Introduction

Research scientists will have widely differing experiences of support and access to computational resources in their places of work. The ability to rapidly access compute resources of sufficient power, memory, and disk storage can make a new project either very easy or very difficult, depending on how quickly this can be implemented. Buyya et al. have described Cloud Computing as the fifth utility for research [1]. Cloud Computing services can fill a gap for many researchers. Although there are costs associated with using Cloud (which in some cases can be considerable), Cloud’s pay-for-what-you-use model can be very attractive to test out a new pipeline, try out a new tool, or set up a lab-in-the-cloud for a workshop before investing in new hardware “on premise” or purchasing more Cloud resources.

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_13, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


In this chapter, we describe the setup and configuration of a typical Metagenomics stack on the Azure Cloud (see Note 1). The general principles are applicable to other Public Clouds or even to standalone servers. This is not a fully cloud-native approach of the kind described by Celesti et al., because many genomics and metagenomics tools need some further development to make full use of new cloud models [2]; nonetheless, it should serve as a useful introduction to the main principles.

2 Materials

1. A shell/terminal application on your computer

Linux and Mac users already have this in the form of the Terminal application, but Windows users will need to download and install a small helper application. In the instructions that follow, lines that start with a $ represent commands that you type in. The $ sign represents the prompt, so do not type the $ sign.

For Windows users, we recommend MobaXterm [https://mobaxterm.mobatek.net/download.html] as it provides a number of other tools that will be helpful at later stages of your research, including a file transfer application and a text editor. From the download page:
• Click the free Home Edition download link.
• Then select the Portable edition download link.

This will download a small “zipped” file to your Downloads folder. Unzip this, and the MobaXterm program itself can be run by double-clicking the icon. If you try to double-click MobaXterm inside the zipped folder, it will fail to run properly.

Apple Mac users will already have the Terminal application, but recent versions of the Apple Mac operating system (“Catalina” and later) have changed the default shell from BASH (Bourne Again Shell) to ZSH (Z shell). As BASH is the default on most Linux systems (including the virtual machine we will create), we suggest that you change the default shell on your Mac back to BASH for ease of use.
• Open Mac Terminal from your Applications folder.
• Type (see Note 2):

$ chsh -s /bin/bash


Then:
• Enter your password when requested.
• Close the terminal application and reopen it to get a new BASH shell.

Linux users should already have all the tools they need in the default Terminal application without any further configuration.

2. A Cloud account

In order to set up your Cloud services, you will need an account with Microsoft Azure. At the time of writing, Azure provides a free trial account with £150 of credit to be used within 30 days. Follow the sign-up instructions at https://azure.microsoft.com/free/ to apply for an account.

If you are a student, you will be able to get some free resources that last 12 months rather than just 30 days and don’t require a credit card to sign up. Follow the sign-up instructions at https://azure.microsoft.com/en-gb/free/students/.

3 Methods

In setting up the Azure resources, we will do the following:
1. Log into the Azure Portal.
2. Select and set up a Virtual Machine (VM).
3. Add a data disk to the VM (optional).
4. Log into the Virtual Machine and
   (a) Apply the security updates.
   (b) Download and install the Miniconda Python distribution.
   (c) Download and install QIIME2.
   (d) Install and set up Jupyter Lab.
   (e) Configure the data disk on the VM.
5. Connect to Jupyter Lab running on the VM through your Web Browser.
6. Disconnect and shut down the VM.

3.1 Log into the Azure Portal

In your Web Browser, connect to the Azure Portal at https://portal.azure.com. At the prompts, enter the username and password you signed up with earlier (Fig. 1). After a successful login, you will see the initial Azure Portal web page.


Fig. 1 Accessing the Azure portal

3.2 Select and Set Up a Virtual Machine (VM)

3.2.1 At the portal page, select the Create a resource link (Fig. 2).

3.2.2 Select Compute from the left-hand menu and then Create from within the Virtual Machine option (see Note 3) (Fig. 3).

3.2.3 The Create a virtual machine page has many options across seven tabs (Fig. 4). On the Basics tab, complete as follows (Table 1):

3.3 Add a Data Disk to the VM (Optional)

Move on to the Disks tab. (This is an optional step, needed only if you want a data disk larger than the one created by default.) In the Data disks section, select Create and attach a new disk. A new page will appear preconfigured for a 1024GB additional disk. Change the size if required (the example output later in this chapter comes from a 512GB disk), then click the OK button to return to the previous page. No other options are required, so click the blue Review + Create button at the bottom left of the page.

Fig. 2 Creating an Azure resource group

Fig. 3 Selecting and creating a virtual machine


Fig. 4 Selecting virtual machine options

Table 1 Options for Azure VM setup: Basics tab

Option | Action
Subscription | This should be preselected to the subscription you configured when you signed up.
Resource group | Select the Create new option, and give it a suitable name such as mgs-rg for the purposes of this exercise.
Virtual machine name | Again select a suitable name; we’ll use mgs-server.
Region | Normal practice is to choose the region that is geographically closest to you.
Availability options | As this is a trial exercise, we will select No infrastructure redundancy required.
Image | Ubuntu Server 20.04 LTS - Gen 1
Size | There are many options here, but a 4-core, 16GB memory VM is a good place to start. Choose Standard_B4ms.
Authentication type | Choose Password.
Username / Password | Enter a username, a strong password, and its confirmation. Make sure you remember these as we will use them later.
Public inbound ports | Select Allow selected ports.
Select inbound ports | Make sure just SSH (22) is selected.

Azure will take a few minutes to create your virtual machine, and you will receive a message indicating that the process has been successful. Most errors are caused by failing to complete all the options in the forms. Any that you have completed incorrectly will be highlighted.


Fig. 5 The Virtual Machine overview

3.4 Logging into the Virtual Machine (VM) for Further Configuration

3.4.1 Connecting for the First Time

From the previous steps, you will have recorded the username and password needed to access the VM. When the portal has completed the creation of your Virtual Machine, you will be presented with its Overview page (Fig. 5). The most important piece of information you will need for later steps is the VM IP (Internet Protocol) address (see Note 4). In our example, this is 51.11.137.122, which can be copied to your clipboard.

Note also the three buttons at the top of the portal page:
• Start
• Restart
• Stop

At the moment, Start is grayed out, which means the VM is currently running. When we have finished this setup, remember to return to this portal and click the Stop button to preserve your credit (or limit your credit card bill).

Using your Terminal application (Windows users, start a new session in MobaXterm), we’ll connect to the VM using the ssh (secure shell) command. In our example, we will use:

Username: mgs-user
IP address: 51.11.137.122


In the terminal, enter the command (remember the $ symbol represents the prompt, so don’t type in the $ symbol):

$ ssh mgs-user@51.11.137.122

You will be prompted for your password:

mgs-user@51.11.137.122's password:

Enter it exactly, noting that you won’t see the cursor move as you do so. The first time you connect, you will be prompted to accept the server’s host key:

The authenticity of host '51.11.137.122 (51.11.137.122)' can't be established.
ECDSA key fingerprint is SHA256:xluMxzOdKErCsjrnHlPlFWhvpM2bGZFYcFpQgw3ipTw.
Are you sure you want to continue connecting (yes/no/[fingerprint])?

It is OK to accept this by typing yes and pressing [ENTER] on your keyboard. When you have successfully connected, you will see an informational message and the command line prompt, similar to:

Welcome to Ubuntu 20.04.2 LTS (GNU/Linux 5.8.0-1033-azure x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Mon Jun 7 15:27:59 UTC 2021

  System load:  0.0                Processes:             142
  Usage of /:   27.9% of 28.90GB   Users logged in:       0
  Memory usage: 1%                 IPv4 address for eth0: 10.1.0.4
  Swap usage:   0%

 * Super-optimized for small spaces - read how we shrank the memory
   footprint of MicroK8s to make it the smallest full K8s around.

   https://ubuntu.com/blog/microk8s-memory-optimisation

Last login: Mon Jun 7 15:27:28 2021 from 80.229.15.165
mgs-user@mgs-server:~$

At this prompt, we now need to:
• Apply the security updates
• Download and install the Miniconda Python distribution [3]
• Download and install QIIME2 [4]
• Install and set up Jupyter Lab
• Configure the data disk on the VM (optional)

3.4.2 Apply Security Updates

Since Azure created the virtual machine image we used, the developers of the operating system will have released a number of security patches (see Note 5). To apply these security patches, first refresh the list of available packages and then install the updates:

$ sudo apt update
$ sudo apt upgrade

You may be prompted to accept these patches; if so, type Y on your keyboard and press [ENTER].

3.4.3 Download and Install the Miniconda Python Distribution

We now need to download and install the Miniconda Python distribution (see Note 6). Enter: $ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
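Before running a downloaded installer, it can be worth confirming that the file arrived intact. The following is a minimal sketch, assuming the sha256sum utility is available (it ships with Ubuntu) and that you have copied the expected hash from the Miniconda download page. The function name verify_sha256 and the <expected-hash> placeholder are illustrative and not part of the Miniconda tooling.

```shell
# verify_sha256 FILE EXPECTED_HASH
# Succeeds (exit status 0) when the SHA-256 digest of FILE equals
# EXPECTED_HASH, and fails otherwise.
# Hypothetical helper name; sha256sum itself is a standard coreutils tool.
verify_sha256() (
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  [ "$actual" = "$2" ]
)

# Example use on the VM (replace <expected-hash> with the published value):
#   verify_sha256 Miniconda3-latest-Linux-x86_64.sh <expected-hash> \
#     && echo "checksum OK" || echo "checksum mismatch, download again"
```

If the check fails, the safest course is usually to delete the file and repeat the wget step rather than run the installer.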

When the download has completed:

$ bash Miniconda3-latest-Linux-x86_64.sh

You will be prompted to:
• Accept the license agreement by pressing [ENTER], scrolling down the page, and typing yes at the prompt.
• Accept the suggested location to install Miniconda. Again, press [ENTER].
• Allow the installer to initialize Miniconda. Again, type yes and press [ENTER].

At this point, log out of your VM (by typing the logout command and pressing the [ENTER] key) and then log back in again, using the ssh command as you did before. This is to make sure that Miniconda has completed its final configuration.


This time, the prompt will have changed slightly to:

(base) mgs-user@mgs-server:~$

Note the (base) at the start of the line; this confirms that Miniconda has been correctly installed.

3.4.4 Install QIIME2

Following the QIIME2 installation guide [5], download the QIIME2 installer file. Note that this instruction was correct at the time of writing; check the website for the most up-to-date version of the installer.

$ wget https://data.qiime2.org/distro/core/qiime2-2022.11-py38-linux-conda.yml

This file allows Miniconda to download and configure a QIIME environment containing most of the tools and dependencies needed to run a QIIME pipeline. To install, enter:

$ conda env create -n qiime2-2022.11 --file qiime2-2022.11-py38-linux-conda.yml

This will take a few minutes, and you’ll see many lines such as:

sysroot_linux-64-2.1 | 30.2 MB | ################################ | 100%
perl-xml-namespacesu | 11 KB   | ################################ | 100%
perl-socket-2.027    | 31 KB   | ################################ | 100%
pyqt-impl-5.12.3     | 5.9 MB  | ################################ | 100%
r-tibble-3.1.1       | 738 KB  | ################################ | 100%
lcms2-2.12           | 443 KB  | ################################ | 100%
q2galaxy-2021.4.0    | 60 KB   | ################################ | 100%

scrolling up your screen. This is fine and the installer will eventually complete with a message similar to:

done
#
# To activate this environment, use
#
#     $ conda activate qiime2-2022.11
#
# To deactivate an active environment, use
#
#     $ conda deactivate
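The name given to -n is the one every later conda activate command must use, so it can help to derive it from the installer filename rather than retype it. The following is a minimal sketch, assuming the file follows the qiime2-<release>-py<version>-linux-conda.yml naming pattern seen above. The function qiime_env_name is a hypothetical helper and not a QIIME 2 or conda command.

```shell
# qiime_env_name FILE
# Prints the release prefix of a QIIME 2 environment file, for example
# "qiime2-2022.11" for "qiime2-2022.11-py38-linux-conda.yml".
# Hypothetical helper; it only manipulates the filename text.
qiime_env_name() (
  base=${1##*/}                   # drop any leading directory
  printf '%s\n' "${base%%-py*}"   # drop the "-py38-linux-conda.yml" suffix
)

# Example use:
#   yml=qiime2-2022.11-py38-linux-conda.yml
#   env=$(qiime_env_name "$yml")
#   conda env create -n "$env" --file "$yml"
#   conda activate "$env"
```

Keeping the environment name tied to the release in this way makes it harder for the name in the install command and the name in later activate commands to drift apart.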


The installation is now complete. To confirm that QIIME has been correctly installed, we need to activate this new environment and launch QIIME:

$ conda activate qiime2-2022.11

Note that the prompt changes:

(qiime2-2022.11) mgs-user@mgs-server:~$

At this new prompt type:

$ qiime --help

If this prints the help page to the screen, it has been configured correctly. To proceed with installing Jupyter, we need to add some additional features to this qiime environment.

3.4.5 Install and Set Up Jupyter Lab

Jupyter Lab is the modern user interface for Jupyter Notebooks, providing an updated Notebook interface plus a text editor and a straightforward way of accessing the terminal through the web browser. To install Jupyter Lab, still in the (qiime2-2022.11) environment:

$ conda install -c conda-forge jupyterlab

Conda will provide a long list of packages to be downloaded, installed, and updated. At the prompt:

Proceed ([y]/n)?

Type y and then [ENTER]. Again, you’ll see a list of packages being downloaded and extracted, and after a few minutes you will return to your prompt. The final Jupyter stage is to configure Jupyter to use this newly created QIIME environment. Still in the (qiime2-2022.11) environment, enter the following command:

$ python -m ipykernel install --user --name=qiime2-2022.11 --display-name "QIIME2"

which will confirm with this message:

Installed kernelspec qiime2-2022.11 in /home/mgs-user/.local/share/jupyter/kernels/qiime2-2022.11
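To confirm which name the kernel was actually registered under, the output of jupyter kernelspec list can be checked. The following is a minimal sketch, assuming the usual two-column listing (kernel name, then path). The function has_kernel is a hypothetical helper that only reads text from standard input.

```shell
# has_kernel NAME
# Reads `jupyter kernelspec list` output on standard input and succeeds
# (exit status 0) if a kernel called NAME appears in the first column.
# Hypothetical helper; jupyter kernelspec list is the real Jupyter command.
has_kernel() (
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
)

# Example use on the VM, inside the QIIME environment:
#   jupyter kernelspec list | has_kernel qiime2-2022.11 \
#     && echo "kernel registered" || echo "kernel not found"
```

If the kernel is not found, rerunning the ipykernel install command with the intended name is usually enough.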


This is a good point to confirm that you can access your Jupyter/QIIME VM over the Internet through your web browser. For additional security, you will set up an ssh tunnel from your computer to the VM. This will prevent anyone else from easily accessing your VM. To start Jupyter Lab in the VM:

$ jupyter lab --no-browser

You will see a number of lines like these:

Or copy and paste one of these URLs:
    http://localhost:8888/lab?token=375c75d558a0ebbf8f9990862f5c4b3728d94ed484d6f3f3
    http://127.0.0.1:8888/lab?token=375c75d558a0ebbf8f9990862f5c4b3728d94ed484d6f3f3
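Jupyter normally listens on port 8888 but moves to another free port if 8888 is already taken, so the port to forward is the one printed in these URLs. The following is a minimal sketch that extracts the port from a URL and prints the matching ssh tunnel command. The helper names url_port and tunnel_cmd are hypothetical, and the username and address are the example values used in this chapter.

```shell
# url_port URL
# Prints the port number from a URL of the form http://host:PORT/...
url_port() (
  rest=${1#*://}          # strip the scheme ("http://")
  hostport=${rest%%/*}    # keep only "host:port"
  printf '%s\n' "${hostport##*:}"
)

# tunnel_cmd URL USER HOST
# Prints (does not run) the ssh command that forwards that port.
# Hypothetical helper; the ssh options -N and -L are standard OpenSSH.
tunnel_cmd() (
  port=$(url_port "$1")
  printf 'ssh -N -L %s:localhost:%s %s@%s\n' "$port" "$port" "$2" "$3"
)

# Example use (on your own computer, not on the VM), pasting the URL
# that Jupyter printed:
#   tunnel_cmd "<URL from Jupyter>" mgs-user 51.11.137.122
```

Printing the command rather than running it keeps the sketch safe to experiment with; copy the printed line into the second terminal when it looks right.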

In a second terminal window, set up the ssh tunnel itself:

$ ssh -N -L 8888:localhost:8888 mgs-user@51.11.137.122

You will be prompted for your password again, but on entering it, instead of logging in, the prompt will appear to hang. This is expected, and it means the tunnel is now in place. Looking back to the URLs shown on your other terminal, copying and pasting one of them into your browser address bar will give you access to Jupyter Lab running on your VM (Fig. 6).

Provided this works as expected and you get Jupyter Lab running in your browser, you can now:
• Close your browser window.
• Close the ssh tunnel by pressing the [CTRL] and [C] keys together in your ssh tunnel terminal window.
• Close Jupyter Lab by pressing the [CTRL] and [C] keys together in the VM terminal window, at which point you will be prompted to confirm that you wish to terminate Jupyter Lab by entering y on your keyboard.

The final step of setup in the VM is to fully configure Jupyter so that it can access the QIIME environment and all of its tools. This needs to be done inside the qiime2-2022.11 environment, so make sure that the prompt shows:

(qiime2-2022.11) mgs-user@mgs-server:~$


Fig. 6 A first view of the Jupyter environment

At this prompt enter the command:

$ jupyter serverextension enable --py qiime2 --sys-prefix

3.4.6 Configuring the VM to Access the Data Disk (Optional)

As the VM we created only has a small hard drive, we now need to add the larger hard drive we created in the portal earlier as a ‘data drive’. This is quite advanced work and is not necessary if you are content with the size of the VM’s original disk. Still connected to the VM, we need to confirm if the VM can “see” the hard drive, using the Linux lsblk “list block devices” command: $ lsblk -o NAME,HCTL,SIZE,MOUNTPOINT | grep -i "sd"

The device we want is the sdb device (the clue is its size; we created a 512GB drive earlier):

sda     0:0:0:0      30G
├─sda1             29.9G /
├─sda14               4M
└─sda15             106M /boot/efi
sdb     3:0:0:0     512G
sdc     1:0:1:0      32G
└─sdc1               32G /mnt
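Relying on size alone can be error-prone when several disks are similar. The following is a rough sketch, assuming the lsblk layout shown above, in which whole disks start at the left margin and partitions are drawn beneath them with tree characters. On that assumption, a freshly attached data disk is a whole disk with no partition lines under it. The function unpartitioned_disks is a hypothetical helper that only reads text.

```shell
# unpartitioned_disks
# Reads lsblk tree output on standard input and prints each whole disk
# that has no partition lines beneath it (a new, empty data disk).
# Hypothetical helper; it does not touch any device.
unpartitioned_disks() (
  awk '
    NF == 0      { next }                        # skip blank lines
    $1 == "NAME" { next }                        # skip the header row
    /^[a-z]/     { if (disk != "" && !kids) print disk
                   disk = $1; kids = 0; next }   # a new whole-disk line
                 { kids = 1 }                    # a partition of that disk
    END          { if (disk != "" && !kids) print disk }
  '
)

# Example use on the VM, reusing the command from the text:
#   lsblk -o NAME,HCTL,SIZE,MOUNTPOINT | grep -i "sd" | unpartitioned_disks
```

Treat the result as a cross-check against the size shown in the portal, not as a replacement for reading the lsblk output yourself, because partitioning the wrong device destroys its data.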

We now need to partition this new device to set it up to accept data. Use the following commands:

$ sudo parted /dev/sdb --script mklabel gpt mkpart xfspart xfs 0% 100%
$ sudo mkfs.xfs /dev/sdb1
$ sudo partprobe /dev/sdb1

All should complete without error messages. The next step is to create a directory at which to “mount” the new disk so that it can be accessed by users:

$ sudo mkdir /datadrive

and then use the mount command to connect the disk to the directory you have just created:

$ sudo mount /dev/sdb1 /datadrive

We can confirm that the drive is connected by using the df command:

$ df

which returns:

Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/root       30309264 8011772  22281108  27% /
devtmpfs         8195072       0   8195072   0% /dev
tmpfs            8198896       0   8198896   0% /dev/shm
tmpfs            1639780    1048   1638732   1% /run
tmpfs               5120       0      5120   0% /run/lock
tmpfs            8198896       0   8198896   0% /sys/fs/cgroup
/dev/loop0         56832   56832         0 100% /snap/core18/1997
/dev/sda15        106858    8008     98851   8% /boot/efi
/dev/loop1         56832   56832         0 100% /snap/core18/2066
/dev/loop3         32896   32896         0 100% /snap/snapd/11841
/dev/loop2         69248   69248         0 100% /snap/lxd/20326
/dev/loop4         32896   32896         0 100% /snap/snapd/12057
/dev/sdc1       32894736   49180  31151556   1% /mnt
tmpfs            1639776       0   1639776   0% /run/user/1000
/dev/sdb1      536606724 3774380 532832344   1% /datadrive


The last line indicates that the device has been mounted at /datadrive, and it is only 1% full. To make sure that the drive reconnects every time we restart the VM, we need to add these settings to the special /etc/fstab file. To do this, we first need to get the Universally Unique Identifier (UUID) of the new drive using the blkid utility:

$ sudo blkid

which will return information similar to:

/dev/sda1: LABEL="cloudimg-rootfs" UUID="114c7d02-c4d7-4dff-b3f1-5a58b37de5d2" TYPE="ext4" PARTUUID="4f91f90d-bc71-4910-812a-5c9abc8a75e4"
/dev/sda15: LABEL_FATBOOT="UEFI" LABEL="UEFI" UUID="E8AC-DF8B" TYPE="vfat" PARTUUID="ee401b4a-c5d5-47c2-9d73-097b38242947"
/dev/sdc1: UUID="aac5405a-49d1-4804-a490-438e2caf7746" TYPE="ext4" PARTUUID="53a7d6e8-01"
/dev/loop0: TYPE="squashfs"
/dev/loop1: TYPE="squashfs"
/dev/loop2: TYPE="squashfs"
/dev/loop3: TYPE="squashfs"
/dev/loop4: TYPE="squashfs"
/dev/sdb1: UUID="479b81d6-c715-47e4-ab7c-3e3cfd5dbc8f" TYPE="xfs" PARTLABEL="xfspart" PARTUUID="be567819-9bf1-4697-ab27-1e703fce41a4"
/dev/sda14: PARTUUID="9ddd4814-b7b2-4836-8633-594746c3ee23"

We should be able to identify the line we need. Here, we need the /dev/sdb1 line, which represents the disk we created in the Azure portal earlier. Open the /etc/fstab file in a text editor:

$ sudo nano /etc/fstab

Using the UUID value for the /dev/sdb1 device and the mount point directory /datadrive, add the following line to the end of the /etc/fstab file (using your own value for the UUID):

UUID=479b81d6-c715-47e4-ab7c-3e3cfd5dbc8f /datadrive xfs defaults,nofail 1 2

As we used the nano editor, use Ctrl+O to save the file and Ctrl+X to exit the editor. Now, each time you reboot the VM, the data drive will be mounted automatically.
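A mistake in /etc/fstab can stop the VM from booting cleanly, so it may help to generate the line from the actual UUID and to test it before rebooting. The following is a minimal sketch: fstab_line is a hypothetical helper that only prints text, while blkid -s UUID -o value and mount -a are standard utilities. The tee step shown in the comments is an alternative to editing the file by hand, so use one approach or the other, not both.

```shell
# fstab_line UUID MOUNTPOINT
# Prints an /etc/fstab entry for an XFS data disk, using the same
# options as the line shown above. Hypothetical helper; prints only.
fstab_line() (
  printf 'UUID=%s %s xfs defaults,nofail 1 2\n' "$1" "$2"
)

# Example use on the VM:
#   uuid=$(sudo blkid -s UUID -o value /dev/sdb1)
#   fstab_line "$uuid" /datadrive | sudo tee -a /etc/fstab
#   sudo mount -a    # reports an error now if the new entry is malformed
```

Running mount -a straight after the change surfaces a bad entry while you are still logged in and able to correct it.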


3.5 Connect to Jupyter Lab Running on the VM Through Your Web Browser

Each time you wish to use your server-in-the-cloud you will need to (as described in the previous stages):
1. Start the VM in the Azure Portal.
2. Log into the VM via ssh and start the Jupyter Lab server with these two commands:

$ conda activate qiime2-2022.11
$ jupyter lab --no-browser

3. Set up an ssh tunnel to the VM.
4. Connect to the Jupyter server through your Web Browser.

Each time you launch the Jupyter Lab server and connect through the Web browser, you will be presented with the launcher page (Fig. 7). From here, to launch a notebook to perform an analysis, select the QIIME2 notebook button. This launches a new Jupyter Lab Notebook. Using a Notebook for analysis is beyond the scope of this chapter, but there are many web tutorials and videos to learn from, and the Github repository accompanying this chapter has an example with some of the key terms explained. It is recommended that you create each analysis experiment in its own Jupyter Notebook.

An example Notebook based on the QIIME2 “Moving Pictures” tutorial can be found in the Github repository linked to this chapter. This can be downloaded directly into the Jupyter environment. From the launcher page, select the Terminal button and then at the prompt enter:

$ git clone https://github.com/cloud-metagenomics/mgs-jupyter.git

You will see a new directory (folder) mgs-jupyter appear in the left-hand file viewer window.
• Double-click the directory name mgs-jupyter to open the directory.
• Double-click the Notebook moving-window-tutorial.ipynb to see an explanation of a number of basic commands that will allow you to follow along with the tutorial in the QIIME2 documentation (https://docs.qiime2.org/2021.4/tutorials/moving-pictures/).


Fig. 7 Launching Jupyter Lab

3.6 Disconnect and Shut Down the VM

At the end of each session, remember to:

3.6.1 Close the Jupyter Lab Server

In the VM terminal, press Ctrl+C. When prompted:

Shutdown this Jupyter server (y/[n])?

answer y and press the Enter key.

3.6.2 Ensure the ssh Tunnel Has Closed in Your Other ssh Terminal Window

When the Jupyter Server has been closed down, the ssh tunnel should terminate at the same time, and a message similar to:

Connection to 51.11.137.122 closed by remote host.

appears on the screen (see Note 7).

3.6.3 Return to the Azure Portal to Stop the Virtual Machine

4 Further Exploration

This chapter gives an introduction to installing and running a number of useful tools for Metagenomics researchers. The Virtual Machine can be extended with a number of other tools and applications through the Conda package management system we installed in the first part of this chapter. For example, we could additionally install:
• R and RStudio (https://rstudio.com) to allow access to RStudio in a Web Browser
• RClone (https://rclone.org) to allow easy copying of data to and from Cloud storage (Google Drive, OneDrive, Dropbox, etc.)
• NextFlow (https://nextflow.io) or SnakeMake (https://snakemake.readthedocs.io) workflow management systems to make reproducible pipelines easier to share and run.

Together, these tools and packages then permit an individual researcher or small group to access their computational platform and test new concepts and techniques before migrating production analyses to larger-scale Cloud or HPC (High Performance Computing) resources.

5 Notes

1. Although we are using the Azure Cloud in this chapter, the same principles can be applied to resources in other public clouds such as Google Cloud and Amazon Web Services.
2. In all the Shell instructions, the $ character represents the prompt you see on screen. You don’t type the $, just the rest of the command.
3. Take care over the choice of VM options and sizes. Prices are indicated at the appropriate stage, and you will need to make sure that you are aware of how much you are spending.
4. If you regularly stop and restart your VM with significant time gaps, Azure will reallocate you a different IP address. You will note that there is an option to retain the IP address between sessions, which carries a small additional cost.
5. It is important to install the security patches regularly, on a weekly or fortnightly basis. Each time you log in you will be informed of the number of pending patches you should consider installing.
6. Miniconda Python is updated regularly. To see the instructions to download the latest version and any updated information, visit the Miniconda Web site (https://docs.conda.io/en/latest/miniconda.html).
7. If for some reason the ssh tunnel remains active and you don’t see your regular command line prompt, it can also be closed by pressing Ctrl+C on your keyboard.

References

1. Buyya R, Srirama S, Casale G et al (2018) A manifesto for future generation cloud computing: research directions for the next decade. ACM Comput Surv (CSUR) 51(5):1–38
2. Celesti A, Celesti F, Fazio M et al (2017) Are next-generation sequencing tools ready for the cloud? Trends Biotechnol 35(6):486–489
3. Continuum Analytics (2021) Anaconda: your data science toolkit. https://www.anaconda.com/products/individual. Accessed 23 Feb 2021
4. Bolyen E, Rideout JR, Dillon MR et al (2019) Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol 37:852–857
5. QIIME 2 development team (2021) Qiime2. https://qiime2.org. Accessed 23 Feb 2021

Chapter 14

Artificial Intelligence in Medicine: Microbiome-Based Machine Learning for Phenotypic Classification

Xi Cheng and Bina Joe

Abstract

Advanced computational approaches in artificial intelligence, such as machine learning, have been increasingly applied in life sciences and healthcare to analyze large-scale complex biological data, such as microbiome data. In this chapter, we describe the experimental procedures for using microbiome-based machine learning models for phenotypic classification.

Key words Microbiome, Machine learning, Phenotype, Disease, Classification, Diagnosis

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_14, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Artificial intelligence (AI), a branch of computer science, has emerged as a promising approach in healthcare [1–5]. Specifically, machine learning (ML), one of the most popular subtypes of AI, has been widely used to recognize, comprehend, and learn patterns within data using various complex algorithms. Clinical applications of AI, and of ML specifically, have demonstrated tremendous potential in precision medicine owing to the availability of large-scale clinical data such as patients’ health records [6]. The term “omics” refers to a collection of high-throughput approaches for comprehensive and simultaneous profiling of a given class of molecules from a biological sample, such as DNA, mRNA, proteins, or metabolites, which correspond to genomics, transcriptomics, proteomics, and metabolomics, respectively. The “big data” generated by omics have proven to be useful and promising training datasets for ML modeling in the predictive diagnostics of diseases [7]. In our laboratory, we have applied ML to model different types of omics data for diagnostic screening and classification of complex disease traits. For example, we reported a robust performance of supervised ML models trained on whole transcriptomic RNA-seq data of human left ventricle tissue for diagnosing clinical cardiomyopathies and further classifying patients with dilated or ischemic cardiomyopathy [8].

Beyond the “omics” of the host, advances in sequencing technologies in recent years have led to the characterization of the “omics” of bacteria that live in symbiosis with the host. The genomes of such bacteria residing within a single host are collectively called the microbiome. To distinguish the genomic analysis of the microbiota DNA from that of the host DNA, another “omics” term was introduced: “metagenomics.” In our recent work, we mined the metagenome with supervised ML algorithms, such as random forest and neural networks, to model fecal 16S metagenomics data from a significant number of human subjects for diagnostic classification of cardiovascular diseases and inflammatory bowel diseases [9, 10]. In these publications, we detailed the methods of microbiome-based supervised ML for phenotypic classification using Caret [11] in the R programming language.

In this chapter, we describe general procedures for using scikit-learn [12], which is developed in Python and is currently one of the most popular ML libraries, for predictive phenotypic classification through microbiome-based supervised ML. The method is readily adaptable to the analysis of any input microbiome data, not restricted to the gut microbiome (e.g., from oral, vaginal, nasal, or skin samples), and is applicable to any phenotypic trait or disease.

2 Materials

1. Computer (operating system: Windows/macOS/Linux) with internet access.
2. scikit-learn [12], installed following https://scikit-learn.org/stable/install.html.
3. pandas [13], installed following https://pandas.pydata.org/docs/getting_started/install.html.
4. NumPy [14], installed following https://numpy.org/install/.

3 Methods

3.1 Preparation of Microbiome Datasets

Microbiome datasets, such as bacterial taxa and operational taxonomic units (OTUs), can be obtained through microbiome sequencing approaches, such as 16S rRNA sequencing and shotgun metagenomics sequencing followed by corresponding bioinformatics analyses.


Fig. 1 A sample csv file of bacterial taxa data for Phenotype_1 and Phenotype_2 (see Note 1)

A sample csv (comma-separated values) file of bacterial taxa data for two different phenotypes is shown in Fig. 1. The first column indicates the phenotypes (e.g., Cancer vs. Healthy Control) to be classified, and each row represents one sample (e.g., Patient_1, Patient_2, etc.). Column 2 and beyond document the relative abundance of different bacterial taxa (e.g., Column 3: Phylum Acidobacteria, Column 4: Phylum Actinobacteria, etc.) for the sample in each row.

3.2 Machine Learning Classification

3.2.1 Dataset Preparation (See Note 2)

1. The dataset in csv format, as described above, is loaded using pandas.read_csv() (see Note 3).

2. The microbiome features, such as bacterial taxa and OTUs, are converted to a NumPy array using data.iloc[].to_numpy() (see Note 4).

3. The phenotypes to be classified (e.g., Phenotype_1, Phenotype_2) are converted to a NumPy array using data.iloc[].to_numpy().
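As a minimal sketch of the three steps above, assuming a file laid out like Fig. 1 with the phenotype in the first column and taxa abundances in the remaining columns, the loading and conversion could look as follows. The in-memory csv text, its column names, and its values are made up for illustration and are not the chapter's data; the column positions passed to iloc would need adjusting for a real file (see Notes 2 and 4).

```python
import io

import pandas as pd

# Hypothetical stand-in for a csv file such as the one in Fig. 1.
# First column: phenotype label; remaining columns: relative abundances.
# All names and values below are invented for illustration only.
csv_text = """Phenotype,Acidobacteria,Actinobacteria,Bacteroidetes,Firmicutes
Phenotype_1,0.02,0.10,0.48,0.40
Phenotype_1,0.01,0.12,0.50,0.37
Phenotype_2,0.05,0.20,0.30,0.45
Phenotype_2,0.04,0.22,0.28,0.46
"""

# Step 1: load the dataset (for a real file: pd.read_csv("your_file.csv")).
data = pd.read_csv(io.StringIO(csv_text))

# Step 2: microbiome features (every column after the first) as a NumPy array.
X = data.iloc[:, 1:].to_numpy()

# Step 3: phenotype labels (the first column) as a NumPy array.
y = data.iloc[:, 0].to_numpy()

print(X.shape, y.shape)
```

If a real file carries extra columns such as a sample ID or sample type (see Note 1), those columns would have to be excluded from the iloc selection used for the features.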


3.2.2 Machine Learning Modeling

1. The data, including microbiome features and their corresponding phenotypes, are randomly split into training samples (70%) and testing samples (30%) (see Note 5).

In the training phase, the ML model (discussed below) is trained on the labeled training samples (70% of all the original samples) to learn how to use microbiome features to classify samples as belonging to “Phenotype_1” or “Phenotype_2.” After training is completed, the trained ML model is tested (similar to students taking a test) on the unlabeled testing samples (30% of all the original samples): it uses their microbiome features to predict their labels (“Phenotype_1” or “Phenotype_2”), after which we compare the model’s predicted labels with the actual labels.

2. A Random Forest (RF) classifier is initialized for supervised ML modeling (see Note 6).

3. The RF classifier is trained on the training samples, with “X_train” representing the microbiome features of the training samples and “y_train” representing their corresponding phenotypes (e.g., Phenotype_1, Phenotype_2).

4. The trained RF model is applied to the microbiome features “X_test” of the unlabeled testing samples to predict their corresponding phenotypes.

5. The predicted labels “y_test_predict” of the testing samples are compared with their actual labels “y_test” to calculate an accuracy value “y_test_accuracy” (see Note 7).

[Figure 2: confusion-matrix heat map. Axes: True label (rows) and Predicted label (columns), each with classes Phenotype_1 and Phenotype_2. Cell counts: true Phenotype_1: 9 and 0; true Phenotype_2: 1 and 8.]

Fig. 2 An example confusion matrix to evaluate the ML model’s performance on predicting the phenotypes of the testing samples

6. A confusion matrix can further be used to evaluate the detailed prediction performance of the ML model on the testing samples. The example confusion matrix in Fig. 2 indicates that only one sample, whose true phenotype is “Phenotype_2,” is wrongly predicted to be “Phenotype_1.”
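The six modeling steps can be sketched with scikit-learn as below. This is an illustration under stated assumptions rather than the authors' own script: the feature matrix is synthetic (random numbers, not real abundances), the fixed random_state values and the stratify argument are choices made here for reproducibility and balanced classes, and the variable name cm is hypothetical. The names X_train, y_train, X_test, y_test, y_test_predict, and y_test_accuracy follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for microbiome features: 30 samples per phenotype,
# 5 features each. These are NOT real abundances, only illustrative numbers.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(30, 5)),  # Phenotype_1 samples
    rng.normal(loc=3.0, scale=1.0, size=(30, 5)),  # Phenotype_2 samples
])
y = np.array(["Phenotype_1"] * 30 + ["Phenotype_2"] * 30)

# Step 1: random 70%/30% split (random_state and stratify are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Step 2: initialize a Random Forest classifier with default hyperparameters.
clf = RandomForestClassifier(random_state=42)

# Step 3: train on the labeled training samples.
clf.fit(X_train, y_train)

# Step 4: predict phenotypes for the testing samples from their features.
y_test_predict = clf.predict(X_test)

# Step 5: compare predicted labels with actual labels.
y_test_accuracy = accuracy_score(y_test, y_test_predict)

# Step 6: confusion matrix (rows: true labels, columns: predicted labels).
cm = confusion_matrix(y_test, y_test_predict,
                      labels=["Phenotype_1", "Phenotype_2"])

print(y_test_accuracy)
print(cm)
```

In scikit-learn's convention, entry cm[i, j] counts samples whose true label is class i and whose predicted label is class j, so the diagonal holds the correctly classified samples and accuracy equals the diagonal sum divided by the total.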

4 Summary

The procedures described above serve as a beginner’s guide to the basics of using scikit-learn to perform microbiome-based ML for phenotypic classification. Having followed them, readers should be able to appreciate the use of ML for mining the microbiome and to judge whether ML performance reaches levels of accuracy acceptable for adoption in the clinic. The level of “accuracy” sufficient for an ML study to be recognized as successful is not a set parameter but a relative one. In reality, achieving a high level of accuracy with ML classification based on the microbiome, which is a highly variable entity, can quickly become much more complicated because of practical considerations such as the criteria for inclusion and exclusion of samples and the high dimensionality of microbiome features. Thus, readers are forewarned that microbiome-based ML studies could require more thorough ML analyses, such as extensive feature-engineering experiments and grid search for tuning and testing different ML models, which are beyond the scope of this chapter. As the next step, readers are therefore referred to the online scikit-learn documentation (https://scikit-learn.org/stable/), which includes an extensive introduction to concepts, methods, examples, and code for comprehensive ML analyses applicable to various scenarios.

5 Notes

1. Only a small part of the file is shown because of limited space. In practice, there are usually a significant number of samples and features for each phenotype. Similar file formats can be adapted to represent other types of microbiome datasets, such as OTU tables. The csv file can include other information, such as sample ID (e.g., Patient_1) and sample type (e.g., gut/oral/skin). The sample file in this chapter is simplified to include only the groups and features.
2. The example code is specific to the data format shown in Fig. 1 and should be adjusted according to the user’s specific dataset.
3. Multiple parameters of pandas.read_csv() can be adjusted for the specific dataset. Please refer to https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html.
4. The numbers passed to dataframe.iloc[] (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) should be adjusted according to the locations of the features within the specific dataset.
5. The function sklearn.model_selection.train_test_split in scikit-learn provides various parameters that can be adjusted according to the user’s needs (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Common ratios for splitting training and testing sets are 80–20%, 70–30% (used in this section), 65–35%, and so forth. Often, cross-validation (e.g., k-fold cross-validation) or a separate validation set held out in addition to the training and testing sets (e.g., 70% training, 15% validation, 15% testing), neither of which is covered in this chapter, is used to tune and evaluate models during training. In addition, feature selection and engineering are often used in practice, according to the user’s specific dataset, to prioritize or engineer features and improve ML modeling.
6. Only the Random Forest approach, one of the most popular supervised ML models, is presented in this section, with default parameters. Various hyperparameters of the Random Forest classifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) can be further adjusted to improve classification performance. Other supervised ML models, such as neural networks and support vector machines, and their corresponding scikit-learn tutorials can be found at https://scikit-learn.org/stable/supervised_learning.html. In addition, a grid search approach, such as sklearn.model_selection.GridSearchCV in scikit-learn (which provides embedded cross-validation; https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), is often used to search comprehensively for optimized hyperparameter values for a specific classifier (e.g., Random Forest, Neural Networks).
7. Other widely used performance measures, such as area under the receiver operating characteristic curve (ROC AUC) and precision, can also be calculated using the supporting functions provided by scikit-learn (https://scikit-learn.org/stable/modules/model_evaluation.html).
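To illustrate Notes 6 and 7, the sketch below combines GridSearchCV (with its embedded k-fold cross-validation) and the ROC AUC measure. It is a hedged example only: the data are synthetic, the small hyperparameter grid, the choice of cv=5, the 0/1 label encoding, and the names param_grid, search, y_test_score, and test_auc are all assumptions made here and are not taken from the chapter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic two-class data (illustrative numbers, not real abundances).
# Labels are encoded as 0 (say, Phenotype_1) and 1 (say, Phenotype_2).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(40, 5)),
    rng.normal(loc=2.0, scale=1.0, size=(40, 5)),
])
y = np.array([0] * 40 + [1] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A deliberately tiny, hypothetical grid of Random Forest hyperparameters.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3]}

# 5-fold cross-validated grid search on the training samples only,
# scored by ROC AUC; the best setting is refit on all training samples.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

# Predicted probability of class 1 for each held-out testing sample.
y_test_score = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, y_test_score)

print(search.best_params_)
print(test_auc)
```

Keeping the testing samples out of the grid search means the reported ROC AUC is measured on data that played no part in choosing the hyperparameters.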

Acknowledgments

This work was supported by the National Heart, Lung and Blood Institute of the National Institutes of Health (R01HL143082).

References

1. Rong G, Mendez A, Assi EB, Zhao B, Sawan M (2020) Artificial intelligence in healthcare: review and prediction case studies. Engineering 6(3):291–301
2. Tamarappoo BK, Lin A, Commandeur F, McElhinney PA, Cadet S, Goeller M et al (2021) Machine learning integration of circulating and imaging biomarkers for explainable patient-specific prediction of cardiac events: a prospective study. Atherosclerosis 318:76–82. https://doi.org/10.1016/j.atherosclerosis.2020.11.008
3. Kwan AC, McElhinney PA, Tamarappoo BK, Cadet S, Hurtado C, Miller RJH et al (2020) Prediction of revascularization by coronary CT angiography using a machine learning ischemia risk score. Eur Radiol. https://doi.org/10.1007/s00330-020-07142-8
4. Eisenberg E, McElhinney PA, Commandeur F, Chen X, Cadet S, Goeller M et al (2020) Deep learning-based quantification of epicardial adipose tissue volume and attenuation predicts major adverse cardiovascular events in asymptomatic subjects. Circ Cardiovasc Imaging 13(2):e009829. https://doi.org/10.1161/CIRCIMAGING.119.009829
5. Han D, Kolli KK, Gransar H, Lee JH, Choi SY, Chun EJ et al (2020) Machine learning based risk prediction model for asymptomatic individuals who underwent coronary artery calcium score: comparison with traditional risk prediction approaches. J Cardiovasc Comput Tomogr 14(2):168–176. https://doi.org/10.1016/j.jcct.2019.09.005
6. Payrovnaziri SN, Chen Z, Rengifo-Moreno P, Miller T, Bian J, Chen JH et al (2020) Explainable artificial intelligence models using real-world electronic health record data: a systematic scoping review. J Am Med Inform Assoc 27(7):1173–1185. https://doi.org/10.1093/jamia/ocaa053
7. Martorell-Marugan J, Tabik S, Benhammou Y, del Val C, Zwir I, Herrera F et al (2019) Deep learning in omics data analysis and precision medicine. In: Husi H (ed) Computational biology. Codon Publications, Brisbane
8. Alimadadi A, Manandhar I, Aryal S, Munroe PB, Joe B, Cheng X (2020) Machine learning-based classification and diagnosis of clinical cardiomyopathies. Physiol Genomics 52(9):391–400
9. Aryal S, Alimadadi A, Manandhar I, Joe B, Cheng X (2020) Machine learning strategy for gut microbiome-based diagnostic screening of cardiovascular disease. Hypertension 76(5):1555–1562
10. Manandhar I, Alimadadi A, Aryal S, Munroe PB, Joe B, Cheng X (2021) Gut microbiome-based supervised machine learning for clinical diagnosis of inflammatory bowel diseases. Am J Physiol Gastrointest Liver Physiol 320:G328
11. Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(1):1–26
12. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al (2011) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830
13. McKinney W (2011) Pandas: a foundational python library for data analysis and statistics. Python High Perform Sci Comput 14(9):1–9
14. Oliphant TE (2006) A guide to NumPy. Trelgol Publishing USA. https://ecs.wgtn.ac.nz/foswiki/pub/Support/ManualPagesAndDocumentation/numpybook.pdf

Chapter 15

Tracking Antibiotic Resistance from the Environment to Human Health

Eman Abdelrazik and Mohamed El-Hadidi

Abstract

Antimicrobial resistance (AMR) is one of the threats to our world according to the World Health Organization (WHO). Resistance is a dynamic evolutionary process in which host-associated microbes have to adapt to their stressful environments. AMR can be classified according to the mechanism of resistance or the biome where resistance takes place. Antibiotics are one of the stresses that lead to resistance through antibiotic resistance genes (ARGs). The resistome can be defined as the collection of all ARGs in an organism’s genome or metagenome. There is a growing body of evidence that the environment is the largest source of ARGs, but the extent to which the environment contributes to the evolution of antimicrobial resistance is still under investigation. The transfer of ARGs from the environment to humans and vice versa is a nature-to-nature feedback loop in which no accurate starting point of the evolutionary event can be set. Thus, tracking resistome evolution and transfer to and from different biomes is crucial for the surveillance and prediction of the next resistance outbreak. Herein, we review the overlap between clinical and environmental resistomes and the available databases and computational tools for resistome analysis through ARG detection and characterization in bacterial genomes and metagenomes. To date, no tool can predict the evolution and dynamics of resistance in a distinct biome. Hopefully, by understanding the complicated relationship between the environmental and clinical resistomes, we can develop tools that track the feedback loop from nature to nature in terms of the evolution, mobilization, and transfer of ARGs.
Key words Resistome analysis tools, Clinical resistome, Environmental resistome, Antimicrobial resistance, Resistome antibiotic resistance, Antibiotic resistome, Metagenomics, Horizontal gene transfer

1 Introduction

Resistance is another evolutionary mode in which host-associated microorganisms must adapt to a stressful environment. Not only microbes but also hosts pay this high price for sustainability.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-1-0716-3072-3_15. Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_15, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Antimicrobial resistance (AMR) is a broad term covering changes in the response of microorganisms to the drugs used to treat microbial infections. According to the Centers for Disease Control and Prevention (CDC), more than 2.8 million resistance-related infections occur each year in the United States, resulting in more than 35,000 deaths [10]. It is estimated that at least 700,000 people die each year from drug-resistant infections. This number could rise to 10 million by 2050, far surpassing cancer as the world’s leading cause of death [43]. In 2019, the World Health Organization (WHO) ranked AMR as one of the top 10 threats to global health because of growing resistance in bacterial diseases, including tuberculosis [46]. Antibiotic resistance (AR) specifically describes the failure of certain types of antibiotics to kill or inhibit the growth of certain types of bacteria [16].

Two models explain the emergence of resistance [8]. (a) Intrinsic resistance occurs when the bacterium has biological properties that make it naturally resistant to a particular antibiotic. These traits usually result from mutations in bacterial genes that occur during the bacterial life cycle, or from misuse of antibiotics, which in turn accelerates the rate of mutation. (b) Acquired resistance occurs when bacterial species share genetic material with each other and acquire resistance genes. The latter arises through horizontal gene transfer (HGT), which occurs through one of three mechanisms: transduction, transformation, and conjugation.

AR can also be classified as (a) microbiological (environmental) resistance or (b) clinical resistance. Microbiological resistance is the ability of a bacterial isolate to be less sensitive to distinct classes of antibiotics than other bacteria of the same species. This definition is irrespective of the level of resistance (i.e., low or high) and does not necessarily correlate with clinical resistance. Clinical resistance, on the other hand, is defined by a level of antimicrobial activity associated with a high risk of therapeutic failure. Again, a clinically resistant organism may be classified using a cutoff value in a phenotypic test system (for example, a determination of the MIC or zone size) [33]. The resistome can be defined as the full collection of antibiotic resistance genes and their precursors [17]. We discuss two types of resistome below.

2 Environmental Resistome

The environment, including oceans, glaciers, agricultural and nonagricultural soils, and animals, is considered the largest natural reservoir of metagenomes. Antibiotic resistance genes (ARGs) are found within all of these biomes. Oceans, for example, could be a source of AR genes via different mechanisms, including coastal runoff of resistant bacteria and the production of antibiotics by marine bacteria, which challenges other bacteria to produce ARGs in response. In 2015, researchers investigated metagenome samples from San Pedro Ocean, Hawaii Ocean, and other marine biomes for genes responsible for resistance to ampicillin, tetracycline, nitrofurantoin, and sulfadimethoxine. ARGs were found at all sites: 72% were previously uncharacterized ARGs whose protein products resemble those of known ARGs, while the remaining 28% were known ARGs (encoding beta-lactamases and bicyclomycin resistance pumps) [26]. Another example is Shewanella algae, a human pathogenic bacterial species isolated from freshwater and the deep sea that is not known to produce antibiotics [47]. Shewanella algae is reported to contain a quinolone resistance gene, qnrA [42]. Quinolones are synthetic antibiotics that are not found naturally in any biome, so this is clear evidence that environmental bacteria can evolve intrinsic resistance.

The soil resistome can develop naturally or through anthropogenic practices. Among anthropogenic agronomic practices, manure, biosolids, biocides, and untreated wastewater used for irrigation contain antibiotics that can leach into agricultural soils, resulting in the transfer of ARGs to soil microbial communities [15]. The interaction between soil and the acquired ARGs is significantly linked with soil characteristics, antibiotic metabolism, mobile genetic elements, treatment duration, and the structure of soil microbial communities [14]. On the other hand, soil bacteria, whether antibiotic producers or their neighbors (such as root-associated pseudomonads and bacilli), have an intrinsic natural resistome that serves as a common defense mechanism against other pathogenic bacteria and as a weapon in the competition for water and other nutrients. In a recent study, five aliquots of surface soil were collected at different depths from the Mackay Glacier region of Antarctica and analyzed using shotgun metagenomic sequencing and downstream phylogenetic analysis of ARGs, which identified 177 naturally occurring ARGs lacking mobile genetic elements (MGEs) such as plasmids, transposons, and integrons. The study sheds light on the origins of antibiotic resistance genes in remote soil that is unlikely to have been exposed to anthropogenic practices [48]. Further evidence for the historical origins of ARGs is the finding of an ancient Staphylococcus hominis strain, isolated from River Lena permafrost, that is resistant to the aminoglycoside, beta-lactam, MLS (macrolide, lincosamide, and streptogramin B), and phenicol antibiotic groups [27].

It has been shown that the resistant bacteria of livestock animals have acquired ARGs through routine exposure to antibiotics in veterinary therapeutics [1]. However, another study has shown that in some farm animals with no exposure to antibiotics, such as cattle and chickens, 3.7% and 2% of their sequences, respectively, encoded resistance to antibiotics [6]. This holds not only for farm animals but also for migratory birds, which have been shown to have lower gut microbiome diversity than other environmental microbiomes. However, they have a larger resistome, which includes 1030 different ARGs conferring resistance to tetracycline, β-lactam, chloramphenicol, aminoglycoside, macrolide-lincosamide-streptogramin (MLS), sulphonamide, and quinolone [9]. Strikingly, an investigation of Paenibacillus isolates from Lechuguilla Cave, an underground ecosystem isolated for four million years, characterized 18 chromosomal resistance determinants in isolate LC231 that were linked with phenotypic resistance to most clinically used antibiotics [40]. All these findings, coupled with other studies [11, 18, 37], provide solid evidence for the conservation of resistance in the environment over millions of years, even before the era of antibiotic discovery (see supplemental materials).

3 Clinical Resistome

Bacteria are distributed everywhere in the environment and in the human body, where they constitute the gut, skin, oral, vaginal, and placental microbiomes. Certain bacteria are dangerous because they cause opportunistic infections, and these pathogenic bacteria are responsible for the emergence of antibiotic resistance in clinical settings through several scenarios. First, sensitive bacteria turn resistant by developing mutations after long-term exposure to antibiotics. For instance, Pseudomonas aeruginosa, which is abundant in cystic fibrosis patients, becomes clinically resistant after long-term exposure to antibiotics by developing chromosomal changes that let it resist them [2]. Second, sensitive bacteria can acquire ARGs via horizontal gene transfer, which facilitates the rapid transfer of ARGs between distantly related bacterial species. These acquired ARGs help bacteria to adapt to clinical settings such as hospitals [20]. Third, certain pathogenic bacteria are naturally resistant without exposure to any antibiotics, such as Pseudomonas aeruginosa and Escherichia coli, whose outer membranes have low permeability and thus block the diffusion of antibiotics into the cell [13]. Lastly, bacteria that synthesize antibiotics have to protect themselves from the antibiotics they produce; thus, they develop a self-resistance mechanism, as in the case of the producer Streptomyces species, which produce modification enzymes that convert streptomycin into an inactive form [41].

Regardless of the origin of clinical resistance, detecting resistance can itself be tricky because of so-called heteroresistance. Heteroresistance is present when one or more bacterial subpopulations with an increased level of antibiotic resistance arise within a heterogeneous population whose members differ in resistance levels and mechanisms [3].

4 The Overlap Between Clinical and Environmental Resistome

The intersection between the clinical and environmental resistome becomes apparent when comparing the microbiomes of populations untouched by the antibiotic era with those shaped by it. In 2015, a group of researchers conducted a metagenomic analysis of the gut microbiota of the Hadza, a hunter-gatherer population in Tanzania known for living in the wild with no exposure to any source of antibiotic contamination in food, environment, or health practices. A comparison of the resistome functionality of the Hadza population and urban Italian adults demonstrated the presence of ARGs in the Hadza gut microbiota, which suggests that the environment, even without antibiotic exposure, is the main source of resistance [44].

Antibiotic resistance gene homologs are further evidence of overlap between clinical and nonclinical settings. For instance, the β-lactamase genes are responsible for resistance to a large group of antibiotics, the β-lactams. An investigation of 232 metagenomes from ten different environments found that homologs of β-lactamase genes of classes A, B, C, and D (represented by the blaCTX-M, blaKPC, and blaGES genes) clustered in nonclinical environments, while class A β-lactamases predominated in the cluster of human and animal feces [24]. These findings clarify the diversity of β-lactamase precursors, especially in nonclinical environments, and suggest a possible scenario of ARG transmission between animals and humans, who share a common resistome.

The intersection between resistomes can also span different biomes. A clear example is the time-series analysis of the mobilome and resistome of Han River planktonic microbial communities across different coastal locations down to the downstream of the river. This study [31] clarified how much humans affect environmental ARGs and how much the environment itself contributes to the ARG bloom: it found a 4.6-fold increase in ARG content near the river downstream, where human activities increased, compared with ARG content away from the river downstream, where human activities decreased. This gradual increase indicates the environmental contribution to the dynamic resistance process. In addition, the abundance of mobile genetic elements (MGEs) was linked with ARG sequences shared between the river and human gut pathogens, which was taken as evidence of horizontal transfer of ARG-carrying MGEs among different microbial taxa along the Han River.


Herbaspirillum seropedicae is a real example of resistome overlap between clinical and environmental settings. This bacterium is known in the environment for its role in biological nitrogen fixation. However, it has also been isolated from immunocompromised patients. A genome comparison of clinical and environmental strains of Herbaspirillum seropedicae found that both share the same proportion of antimicrobial resistance genes [21]. Resistance determinants found in environmental Herbaspirillum seropedicae are not considered a threat to human health, whereas their transfer to clinical settings and expression in the human body makes them a threat. Different hypotheses have been proposed to explain the transfer and dissemination of ARGs between environmental and clinical contexts [41]. The most interesting proposed routes of antibiotic resistance gene transfer are mediated by horizontal gene transfer. The first route proposes ARG transfer from antibiotic-producing bacteria to non-producer bacteria in the soil, followed by the movement of ARGs to clinical bacteria or the human host through different carriers. The second route proposes the direct transfer of ARGs from environmental bacteria to clinical strains. Thus, tracking the route of ARG transfer and dissemination may help in developing proactive solutions to prevent this transfer.

5 Whole-Genome Sequencing in Detection and Control of Antimicrobial Resistance

Understanding the different mechanisms of bacterial resistance to antibiotics is crucial to developing new, powerful antibiotics and increasing the efficiency of currently available ones. Conventional methods of identifying and characterizing antibiotic resistance can fail when bacteria share the same resistance phenotype, that is, resistance to the same classes of antibiotics, but owe it to different resistance determinants and mechanisms [29]. However, with the emergence of high-throughput sequencing technologies such as whole-genome sequencing (WGS), the characterization of resistance determinants with similar phenotypes becomes feasible at single base-pair resolution. Nowadays, WGS is used as a high-throughput diagnostic tool in clinics and hospitals. It enables the identification of novel resistance mechanisms with high precision and the rapid identification of resistance determinants from pure cultures or hard-to-culture bacterial samples. In addition, the accumulation of sequencing data over time could provide a rich source for prospective studies that predict novel antibiotic resistance and susceptibility patterns and help in monitoring antimicrobial resistance outbreaks [28]. Although conventional microbiology methods are cheaper and faster than WGS in antibiotic resistance detection, whole-genome sequencing for antimicrobial susceptibility testing (WGS-AST) could be cost-effective and rapid and could give accurate predictions in infection control compared to large retrospective projects, which study the etiological determinants of antibiotic resistance outbreaks [25, 45].

BacWGSTdb 2.0 [23] is a real-world example of a database that stores bacterial genome sequencing data and the corresponding resistance phenotype metadata, combined with bioinformatics tools. It covers up to 20 bacterial species of clinical importance and uses these data for bacterial classification and for tracking sources of infection and antimicrobial resistance determinants. With the help of BacWGSTdb, multiple studies have performed comprehensive genomic analysis and antimicrobial resistance gene identification for clinical isolates such as drug-resistant Acinetobacter pittii isolates [12], multidrug-resistant Escherichia coli [52], and carbapenem-resistant Enterobacterales [39].

6 Resistome Analysis Tools

There is now a wealth of electronic health and omics data representing different perspectives, from the host genome to the microbiome associated with the host. These data provide new insights into, and a comprehensive understanding of, the current threat of antibiotic resistance and the development of future resistance. Bioinformatics provides ARG data resources and analysis tools that use current omics data to predict outbreaks, contain the current antibiotic resistance crisis, and develop drugs and diagnostic tools.

7 Resistome Databases

ARG data resources take the form of structured ARG reference databases. These databases can be categorized according to the spectrum of AR genes they cover. For instance, some databases include all AR genes reported in the literature, while others contain only acquired AR genes and integrons. Smaller groups cover ARGs specific to certain microorganisms or to distinct antibiotic classes such as β-lactams [50]. In addition, some databases focus on functional metagenomic AR genes.

The Comprehensive Antibiotic Resistance Database (CARD) is a widely used ARG database that contains all AR genes combined with the Antibiotic Resistance Ontology (ARO), which facilitates identifying AR genes in new, unannotated sequences [34]. The updated release of CARD (October 2021) contains 6453 ontology terms, 4937 reference sequences, 1788 SNPs, 2775 publications, 4983 AMR detection models, 2675 genomic islands, and 30,591 plasmids. The web-based ResFinder database [53] applies the same concept as CARD to a different set of AR genes, namely only the acquired ARGs. The MEGARes database [30] includes sequence data for approximately 8000 manually curated antimicrobial resistance genes, with annotation, and is accompanied by AmrPlusPlus. AMR++ version 1.0 is a pipeline for the analysis and quantification of AMR genes from metagenomic and high-throughput sequence data and is available via the Galaxy project (https://galaxyproject.org/use/amrplusplus/). MEGARes 2.0 is an expansion of MEGARes that contains the previous sets of AR genes plus metal and biocide resistance determinants and comes with the updated AMR++ version 2.0 [19]. SARG version 2.0 [51] is designed for ARG detection and fast annotation in metagenomes from environmental samples. It contains sequences from CARD [34], the ARDB database [32], and curated sequences from the latest protein collection of the NCBI-NR database.

A newer approach to constructing ARG databases uses artificial neural networks to identify novel ARGs. DeepARG-DB [5] is a database containing ARGs predicted from CARD, ARDB, and UniProt by the neural network models DeepARG-SS and DeepARG-LS, which achieve high precision and recall and thereby limit false negatives in gene detection.

One database specific to a certain class of antibiotics is the Beta-lactamase database (BLDB) (http://bldb.eu/), which is concerned with a specific superfamily of enzymes, the β-lactamases, that act on an important antibiotic class, the β-lactams. The clinical efficacy of β-lactams in the treatment of multidrug-resistant (MDR) gram-negative bacteria is threatened by β-lactamase enzymes [36]. As reported in the latest release (November 2021), the BLDB provides structural, functional, mutant, and kinetic information for the four classes of β-lactamases: class A, class B with its subclasses, class C, and class D.

Moreover, the INTEGRALL database (http://integrall.bio.ua.pt) provides comprehensive information about integrons, which are key players in antibiotic resistance gene acquisition [35]. The current release of INTEGRALL (September 2021) includes 1509 integrase genes, 8562 gene cassettes, and 482 bacterial species. Most of the databases mentioned above focus on bacterial whole genomes, whereas the FARME database (http://staff.washington.edu/jwallace/farme/), the first of its kind, focuses on functional antibiotic resistance genes that come from metagenomic samples [49]. It includes DNA sequences and predicted protein sequences associated with antibiotic resistance, as well as regulatory elements and mobile genetic elements.

8 Tools

The methodology of AMR analysis tools depends on the sequence data type. The main methodological approaches are (a) read-based methods, which process raw reads, and (b) assembly-based methods, which require contigs of assembled genomes. So far, AMR tools rely on BLAST or hidden Markov model (HMM) algorithms to implement these approaches. HMMs can provide a hierarchical classification of protein families, from AMR alleles to AMR gene families. Other tools are specialized for the detection of ARGs from shotgun metagenomic data. Tool input can be protein or nucleotide sequences. AMR tools can be accessed via a web interface or run on local servers through a graphical user interface (GUI), Docker images, or the command line (CL).

ARG detection and annotation tools are based on the previously mentioned ARG databases. For instance, ARGs-OAP v2.0 [51], a metagenomic analysis pipeline, uses SARG version 2.0 as a reference database together with SARGfam, which uses HMMs to create profiles of ARG subtypes, from alleles to gene families. SARGfam facilitates the characterization of AMR genes based on the hierarchical classification of predicted ORFs from assembled reads or of AMR proteins. ARGs-OAP is available via the Galaxy project (https://galaxyproject.org/use/argsoap/). In addition, the DeepARG architecture can be used as a standalone tool to generate new deep learning models for a particular set of genes [5]. ARGminer [4] builds on DeepARG-DB for the curation of new ARGs. It is a web-based crowdsourcing tool for curating ARG annotations that collects information from repositories such as NCBI, PubMed, CARD, ARDB, and UniProt and aggregates ARG information from ResFinder, CARD, MEGARes, ARDB, and SARG. The most important annotation that ARGminer provides is the evidence for the mobility and occurrence of ARGs in clinically important bacterial strains. Moreover, the NCBI AMRFinder [22] relies on NCBIfam for the hierarchical classification of ARG protein families using HMMs. Its distinguishing feature is an extensive AMR database of 4579 resistance proteins and 800 gene families representing 34 classes of antimicrobials and disinfectants.

The tools above are limited by read length and by the distinct set of ARGs in their reference database. A contig-based tool, fARGene [7], addresses these limitations, as it accepts short or long reads. It takes metagenomic sequences as input and reconstructs antibiotic resistance genes as output. The novelty of fARGene is that it can find novel AR genes that share some degree of homology with known AR genes. So far, HMM models have been developed only for β-lactamase classes A, B, and C, but the tool can be optimized for any required set of resistance gene families.

As a state-of-the-art comprehensive tool for systematic resistance analysis, sraX [38] leverages the key advantages of other tools to implement a fully automated pipeline for resistance analysis. In addition to identifying and annotating resistance determinants, it can analyze hundreds of bacterial genomes in parallel. It provides full annotation of resistance determinants in terms of genomic context analysis, SNP analysis, and drug class analysis. All of these features are available in offline and batch modes through one-step commands that end in a single hyperlinked HTML file with navigation.
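The BLAST-based, read-based approach described in this section can be illustrated with a short sketch. It assumes that reads have already been aligned against an ARG reference database and that the hits are available in the standard 12-column BLAST tabular format (-outfmt 6). The function name, the thresholds, and the three example rows are hypothetical and are not taken from any of the tools discussed here.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_arg_hits(blast_tsv, min_identity=90.0, min_length=50, max_evalue=1e-10):
    """Return {query: best reference} for hits passing the (illustrative) filters.

    The thresholds are placeholders for this sketch, not recommended values.
    """
    best = {}
    reader = csv.DictReader(
        io.StringIO(blast_tsv), fieldnames=BLAST6_COLUMNS, delimiter="\t"
    )
    for row in reader:
        identity = float(row["pident"])
        length = int(row["length"])
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        # Discard weak hits before choosing the best one per read.
        if identity < min_identity or length < min_length or evalue > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or bitscore > current[1]:
            best[row["qseqid"]] = (row["sseqid"], bitscore)
    return {query: ref for query, (ref, _) in best.items()}


# Made-up hits for two reads, for illustration only.
example = "\n".join([
    "read1\tblaTEM-1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180",
    "read1\tblaSHV-1\t92.0\t100\t8\t0\t1\t100\t1\t100\t1e-30\t120",
    "read2\ttetM\t75.0\t100\t25\t0\t1\t100\t1\t100\t1e-12\t60",
])
print(best_arg_hits(example))
```

With these example rows, read1 is assigned to its highest-scoring reference and read2 is dropped because its identity falls below the threshold. Such a filter can only report genes that resemble entries in the reference database, which is the limitation that HMM-based tools such as fARGene aim to relax.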

9 Conclusions

Resistance is a dynamic evolutionary process affected by human activities as well as by the microbial community structure of the environment and by selection pressure. What we see in the analysis of a particular biome is only a static snapshot, one piece among several in the antimicrobial resistance puzzle. The available bioinformatics tools for resistome analysis have several limitations: (a) they cannot precisely detect the uncharacterized resistome by searching well-known reference databases; (b) some genes have completely new resistance mechanisms with no sequence similarity to known ARGs; (c) they cannot track heteroresistance phenotypes and intermediate levels of resistance because mutation frequency and occurrence are unstable; (d) they detect and characterize a static set of genes rather than tracking the whole dynamic process from its origins; and (e) they cannot precisely characterize genes that confer resistance to more than one antibiotic class. Therefore, tools that track the dynamics of antibiotic resistance and predict upcoming resistance outbreaks need to be developed.

Currently, there is a growing body of evidence supporting the intersection and overlap between the environmental and the clinical resistome. By understanding this complicated relationship between the environment and humans, we could develop tools that track the feedback loop from nature to nature in terms of evolution, mobilization, and transfer. The proposed tools should characterize attributes of AR genes such as (a) the origin of ARGs, environmental or clinical; (b) the transferability of ARGs, that is, whether or not they are carried on MGEs; and (c) the routes of ARG transfer and dissemination. Such tools could also assess the risk of ARG transmission by measuring ARG abundance in taxa and ARG mobility routes. Tracking the evolution of the antibiotic resistance (AR) process and of antibiotic-resistant bacteria (ARB) will help in building models to assess the risk and predict the dissemination of ARGs. Finally, modeling these routes will help in developing anti-evolutionary drugs that prevent the development of new antibiotic resistance genes.


References

1. Allen HK (2014) Antibiotic resistance gene discovery in food-producing animals. Curr Opin Microbiol 19:25–29. https://doi.org/10.1016/j.mib.2014.06.001
2. Andersson DI, Balaban NQ, Baquero F, Courvalin P, Glaser P, Gophna U et al (2020) Antibiotic resistance: turning evolutionary principles into clinical reality. FEMS Microbiol Rev 44:171–188. https://doi.org/10.1093/femsre/fuaa001
3. Andersson DI, Nicoloff H, Hjort K (2019) Mechanisms and clinical relevance of bacterial heteroresistance. Nat Rev Microbiol 17:479–496. https://doi.org/10.1038/s41579-019-0218-1
4. Arango-Argoty GA, Guron GKP, Garner E, Riquelme MV, Heath LS, Pruden A et al (2020) ARGminer: a web platform for the crowdsourcing-based curation of antibiotic resistance genes. Bioinformatics 36:2966–2973. https://doi.org/10.1093/bioinformatics/btaa095
5. Arango-Argoty G, Garner E, Pruden A, Heath LS, Vikesland P, Zhang L (2018) DeepARG: a deep learning approach for predicting antibiotic resistance genes from metagenomic data. Microbiome 6. https://doi.org/10.1186/s40168-018-0401-z
6. Argudín M, Deplano A, Meghraoui A, Dodémont M, Heinrichs A, Denis O et al (2017) Bacteria from animals as a pool of antimicrobial resistance genes. Antibiotics 6:12. https://doi.org/10.3390/antibiotics6020012
7. Berglund F, Österlund T, Boulund F, Marathe NP, Larsson DGJ, Kristiansson E (2019) Identification and reconstruction of novel antibiotic resistance genes from metagenomes. Microbiome 7. https://doi.org/10.1186/s40168-019-0670-1
8. Blair JMA, Webber MA, Baylay AJ, Ogbolu DO, Piddock LJV (2014) Molecular mechanisms of antibiotic resistance. Nat Rev Microbiol 13:42–51. https://doi.org/10.1038/nrmicro3380
9. Cao J, Hu Y, Liu F, Wang Y, Bi Y, Lv N et al (2020) Metagenomic analysis reveals the microbiome and resistome in migratory birds. Microbiome 8. https://doi.org/10.1186/s40168-019-0781-8
10. CDC (2019) Biggest threats and data. Centers for Disease Control and Prevention. Available at https://www.cdc.gov/drugresistance/biggest-threats.html

11. Chen B, Yuan K, Chen X, Yang Y, Zhang T, Wang Y et al (2016) Metagenomic analysis revealing antibiotic resistance genes (ARGs) and their genetic compartments in the Tibetan environment. Environ Sci Technol 50:6670–6679. https://doi.org/10.1021/acs.est.6b00619
12. Chopjitt P, Putthanachote N, Ungcharoen R, Hatrongjit R, Boueroy P, Akeda Y et al (2021) Genomic characterization of clinical extensively drug-resistant Acinetobacter pittii isolates. Microorganisms 9(2):242. https://doi.org/10.3390/microorganisms9020242
13. Coculescu B (2009) Antimicrobial resistance induced by genetic changes. J Med Life 2:114–123. Available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3018982/
14. Cycoń M, Mrozik A, Piotrowska-Seget Z (2019) Antibiotics in the soil environment – degradation and their impact on microbial activity and diversity. Front Microbiol 10. https://doi.org/10.3389/fmicb.2019.00338
15. Cytryn E (2013) The soil resistome: the anthropogenic, the native, and the unknown. Soil Biol Biochem 63:18–23. https://doi.org/10.1016/j.soilbio.2013.03.017
16. D’Costa VM, King CE, Kalan L, Morar M, Sung WWL, Schwarz C et al (2011) Antibiotic resistance is ancient. Nature 477:457–461. https://doi.org/10.1038/nature10388
17. D’Costa VM, McGrann KM, Hughes DW, Wright GD (2006) Sampling the antibiotic resistome. Science 311:374–377. https://doi.org/10.1126/science.1120800
18. Davies J, Davies D (2010) Origins and evolution of antibiotic resistance. Microbiol Mol Biol Rev 74:417–433. https://doi.org/10.1128/mmbr.00016-10
19. Doster E, Lakin SM, Dean CJ, Wolfe C, Young JG, Boucher C et al (2019) MEGARes 2.0: a database for classification of antimicrobial drug, biocide and metal resistance determinants in metagenomic sequence data. Nucleic Acids Res 48. https://doi.org/10.1093/nar/gkz1010
20. Dzidic S, Bedeković V (2003) Horizontal gene transfer-emerging multidrug resistance in hospital bacteria. Acta Pharmacol Sin 24:519–526. Available at https://europepmc.org/article/med/12791177
21. Faoro H, Oliveira WK, Weiss VA, Tadra-Sfeir MZ, Cardoso RL, Balsanelli E et al (2019) Genome comparison between clinical and environmental strains of Herbaspirillum seropedicae reveals a potential new emerging bacterium adapted to human hosts. BMC Genomics 20. https://doi.org/10.1186/s12864-019-5982-9
22. Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I et al (2019) Validating the AMRFinder tool and resistance gene database by using antimicrobial resistance genotype-phenotype correlations in a collection of isolates. Antimicrob Agents Chemother 63. https://doi.org/10.1128/aac.00483-19
23. Feng Y, Zou S, Chen H, Yu Y, Ruan Z (2021) BacWGSTdb 2.0: a one-stop repository for bacterial whole-genome sequence typing and source tracking. Nucleic Acids Res 49(D1):D644–D650
24. Gatica J, Jurkevitch E, Cytryn E (2019) Comparative metagenomics and network analyses provide novel insights into the scope and distribution of β-lactamase homologs in the environment. Front Microbiol 10. https://doi.org/10.3389/fmicb.2019.00146
25. Harris SR, Cartwright EJP, Török ME, Holden MTG, Brown NM, Ogilvy-Stuart AL et al (2012) Whole-genome sequencing for analysis of an outbreak of meticillin-resistant Staphylococcus aureus: a descriptive study. Lancet Infect Dis 13(2):130–136. https://doi.org/10.1016/S1473-3099(12)70268-2
26. Hatosy SM, Martiny AC (2015) The ocean as a global reservoir of antibiotic resistance genes. Appl Environ Microbiol 81:7593–7599. https://doi.org/10.1128/AEM.00736-15
27. Kashuba E, Dmitriev AA, Kamal SM, Melefors O, Griva G, Römling U et al (2017) Ancient permafrost staphylococci carry antibiotic resistance genes. Microb Ecol Health Dis 28:1345574. https://doi.org/10.1080/16512235.2017.1345574
28. Köser CU, Ellington MJ, Peacock SJ (2014) Whole-genome sequencing to control antimicrobial resistance. Trends Genet 30(9):401–407
29. Kumburu HH, Sonda T, van Zwetselaar M, Leekitcharoenphon P, Lukjancenko O, Mmbaga BT et al (2019) Using WGS to identify antibiotic resistance genes and predict antimicrobial resistance phenotypes in MDR Acinetobacter baumannii in Tanzania. OUP Academic
30. Lakin SM, Dean C, Noyes NR, Dettenwanger A, Ross AS, Doster E et al (2016) MEGARes: an antimicrobial resistance database for high throughput sequencing. Nucleic Acids Res 45:D574–D580. https://doi.org/10.1093/nar/gkw1009
31. Lee K, Kim D-W, Lee D-H, Kim Y-S, Bu J-H, Cha J-H et al (2020) Mobile resistome of human gut and pathogen drives anthropogenic bloom of antibiotic resistance. Microbiome 8. https://doi.org/10.1186/s40168-019-0774-7
32. Liu B, Pop M (2009) ARDB – Antibiotic Resistance Genes Database. Nucleic Acids Res 37:D443–D447. https://doi.org/10.1093/nar/gkn656
33. MacGowan AP (2008) Clinical implications of antimicrobial resistance for therapy. J Antimicrob Chemother 62:ii105–ii114. https://doi.org/10.1093/jac/dkn357
34. McArthur AG, Waglechner N, Nizam F, Yan A, Azad MA, Baylay AJ et al (2013) The comprehensive antibiotic resistance database. Antimicrob Agents Chemother 57:3348–3357. https://doi.org/10.1128/aac.00419-13
35. Moura A, Soares M, Pereira C, Leitao N, Henriques I, Correia A (2009) INTEGRALL: a database and search engine for integrons, integrases and gene cassettes. Bioinformatics 25:1096–1098. https://doi.org/10.1093/bioinformatics/btp105
36. Naas T, Oueslati S, Bonnin RA, Dabos ML, Zavala A, Dortet L et al (2017) Beta-lactamase database (BLDB) – structure and function. J Enzyme Inhib Med Chem 32:917–919. https://doi.org/10.1080/14756366.2017.1344235
37. Olaitan AO, Rolain J-M (2016) Ancient resistome. In: Paleomicrobiology of humans. ASM Press, pp 75–80. https://doi.org/10.1128/9781555819170.ch8
38. Panunzi LG (2020) sraX: a novel comprehensive resistome analysis tool. Front Microbiol 11. https://doi.org/10.3389/fmicb.2020.00052
39. Paveenkittiporn W, Kamjumphol W, Ungcharoen R, Kerdsin A (2020) Whole-genome sequencing of clinically isolated carbapenem-resistant Enterobacterales harboring mcr genes in Thailand, 2016–2019. Front Microbiol 11:586368. https://doi.org/10.3389/fmicb.2020.586368
40. Pawlowski AC, Wang W, Koteva K, Barton HA, McArthur AG, Wright GD (2016) A diverse intrinsic antibiotic resistome from a cave bacterium. Nat Commun 7. https://doi.org/10.1038/ncomms13803
41. Peterson E, Kaur P (2018) Antibiotic resistance mechanisms in bacteria: relationships between resistance determinants of antibiotic producers, environmental bacteria, and clinical pathogens. Front Microbiol 9. https://doi.org/10.3389/fmicb.2018.02928
42. Poirel L, Rodriguez-Martinez J-M, Mammeri H, Liard A, Nordmann P (2005) Origin of plasmid-mediated quinolone resistance determinant QnrA. Antimicrob Agents Chemother 49:3523–3525. https://doi.org/10.1128/AAC.49.8.3523-3525.2005
43. Ragheb MN, Thomason MK, Hsu C, Nugent P, Gage J, Samadpour AN et al (2019) Inhibiting the evolution of antibiotic resistance. Mol Cell 73:157–165.e5. https://doi.org/10.1016/j.molcel.2018.10.015
44. Rampelli S, Schnorr SL, Consolandi C, Turroni S, Severgnini M, Peano C et al (2015) Metagenome sequencing of the Hadza hunter-gatherer gut microbiota. Curr Biol 25:1682–1693. https://doi.org/10.1016/j.cub.2015.04.055
45. Su M, Satola SW, Read TD (2019) Genome-based prediction of bacterial antibiotic resistance. J Clin Microbiol 57(3):e01405–18. https://doi.org/10.1128/JCM.01405-18
46. Ten Health Issues WHO Will Tackle This Year (2019). www.who.int. Available at https://www.who.int/news-room/spotlight/ten-threats-to-global-health-in-2019
47. Tseng S-Y, Liu P-Y, Lee Y-H, Wu Z-Y, Huang C-C, Cheng C-C et al (2018) The pathogenicity of Shewanella algae and ability to tolerate a wide range of temperatures and salinities. Can J Infect Dis Med Microbiol. Available at https://www.hindawi.com/journals/cjidmm/2018/6976897/. Accessed 26 Aug 2020
48. Van Goethem MW, Pierneef R, Bezuidt OKI, Van De Peer Y, Cowan DA, Makhalanyane TP (2018) A reservoir of “historical” antibiotic resistance genes in remote pristine Antarctic soils. Microbiome 6:40. https://doi.org/10.1186/s40168-018-0424-5
49. Wallace JC, Port JA, Smith MN, Faustman EM (2017) FARME DB: a functional antibiotic resistance element database. Database. https://doi.org/10.1093/database/baw165
50. Xavier BB, Das AJ, Cochrane G, Ganck SD, Kumar-Singh S, Aarestrup FM et al (2016) Consolidating and exploring antibiotic resistance gene data resources. J Clin Microbiol 54:851–859. https://doi.org/10.1128/JCM.02717-15
51. Yin X, Jiang X-T, Chai B, Li L, Yang Y, Cole JR et al (2018) ARGs-OAP v2.0 with an expanded SARG database and hidden Markov models for enhancement characterization and quantification of antibiotic resistance genes in environmental metagenomes. Bioinformatics 34:2263–2270. https://doi.org/10.1093/bioinformatics/bty053
52. Yue M, Liu D, Hu X, Ding J, Li X, Wu Y (2021) Genomic characterisation of a multidrug-resistant Escherichia coli strain carrying the mcr-1 gene recovered from a paediatric patient in China. J Glob Antimicrob Resist 24:370–372
53. Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O et al (2012) Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother 67:2640–2644. https://doi.org/10.1093/jac/dks261

Chapter 16

Targeted Enrichment of Low-Abundance and Uncharacterized Taxon Members in Complex Microbial Community with Primer-Free FISH Probes Designed from Next Generation Sequencing Dataset

Pui Yi Maria Yung and Shi Ming Tan

Abstract

Methods to obtain high-quality assembled genomic information of rare and unclassified member species in complex microbial communities remain a high priority in microbial ecology. Additionally, three-dimensional spatial information that highlights the morphology and spatial interactions of these members would provide additional insights into their ecological role in the community. Fluorescent in-situ hybridization (FISH) coupled with fluorescence-activated cell sorting (FACS) is a powerful tool that enables the detection, visualization, and separation of low-abundance microbial members in samples containing complex microbial compositions. Here, we describe the workflow from designing appropriate FISH probes from metagenomics or metatranscriptomics datasets to the preparation and treatment of samples to be used in FISH-FACS procedures.

Key words Fluorescence in-situ hybridization (FISH), Fluorescence-activated cell sorting (FACS), 16S ribosomal RNA gene, Hypervariable region, Metagenomics, Metatranscriptomics

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_16, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

The study of complex microbial communities typically involves a combination of analyses needed to gain a comprehensive understanding of their composition, structure, and function [1]. High-quality genome recovery of low-abundance member species is often challenging. Here, we describe a simple methodological approach to enrich low-abundance microbial members and obtain a sufficient amount of cellular material from the targeted microbial member for downstream genomic sequencing analysis. The approach differs from other FISH-FACS studies [2–5] because it allows the design of new FISH probes for previously uncharacterized taxa from any high-throughput sequencing dataset (DNA or RNA) and is independent of published sequences. In contrast, most FISH-FACS studies use existing broad-based taxon FISH probes to hybridize to target organisms that have established taxonomic identities and therefore cannot resolve strain- or species-specific differences. The probe can be used both for high-resolution fluorescence imaging of FISH samples, for example by confocal laser scanning microscopy, and for FISH–fluorescence-activated cell sorting (FISH-FACS). Collectively, the workflow is amplicon sequencing-free and sample-specific and can be readily applied to enrich rare members in microbiomes for which shotgun nucleic acid survey data are available.

While comparative sequence alignment of full-length 16S rRNA gene sequences remains the gold standard for FISH probe design, it is often not possible to obtain full-length 16S rRNA gene sequences from shotgun metagenomics/metatranscriptomics datasets. The probe design method described in this chapter allows one to extract taxonomically useful short rRNA reads from among the massive number of sequencing reads in shotgun surveys, without the need to perform comparative sequence alignment of full-length 16S rRNA sequences. In addition, the number of potential nontarget taxa increases with the size of the database, which makes probe design for an intended target group a challenge. A curated sample-centric database benefits FISH probe design because more accurate information on in silico specificity and coverage can be assigned to the FISH probes. The described approach has been used to enrich previously uncharacterized, low-abundance microbial members from a wastewater treatment plant [6].

2 Materials

All solutions need to be prepared using ultrapure water and analytical grade reagents. All reagents are stored at room temperature unless otherwise specified.

2.1 Sample fixation
Phosphate-buffered saline (PBS), available from Thermo Fisher, USA; 4% (w/v) paraformaldehyde (PFA) in PBS, available from Sigma-Aldrich, USA; ethanol (absolute), available from Merck, USA. Reagents can be stored at -20 °C and should last up to 5 years if stored properly.

2.2 Fluorescence in-situ hybridization (FISH)
2.2.1 FISH probes are commercially synthesized and purified by high-pressure liquid chromatography (Sigma-Aldrich, USA or Integrated DNA Technologies, USA).
2.2.2 Universal FISH probes are recommended as controls:
2.2.2.1 EUB338: targets most bacteria; sequence “GCTGCCTCCCGTAGGAGT” [7].
2.2.2.2 NON338: nonsense probe; sequence “ACTCCTACGGGAGGCAGC” [8].
2.2.3 Hybridization buffer (HB), recommended volume of 1 mL. Stocks: 5 M NaCl, 1 M Tris/HCl, formamide, 10% SDS, MilliQ water. Final concentrations: NaCl, 900 mM; Tris/HCl, 20 mM; SDS, 0.01%. The volume and concentration of formamide depend on probe stringency.
2.2.4 Washing buffer (WB), recommended volume of 50 mL. Stocks: 5 M NaCl, 1 M Tris/HCl, 0.5 M EDTA, MilliQ water. Final concentrations: Tris/HCl, 20 mM; EDTA, 5 mM. The volume of NaCl depends on the formamide concentration used in the hybridization experiment. Note that EDTA is only added if the formamide concentration is ≥20%.

2.3 Genomic DNA extraction for shotgun sequencing
2.3.1 High-quality DNA can be extracted with commercially available extraction kits, following the manufacturer’s recommendations.
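The stock volumes for the hybridization buffer in 2.2.3 follow from the dilution relation C1 × V1 = C2 × V2. The sketch below works this out for a 1 mL buffer. It assumes that the formamide concentration is given as percent (v/v) of the final volume and that MilliQ water makes up the remainder; the function names and the 30% formamide value in the example are illustrative choices, not values prescribed by this protocol.

```python
def stock_volume_ul(final_conc, stock_conc, final_volume_ul):
    """Stock volume from C1 * V1 = C2 * V2 (both concentrations in the same unit)."""
    return final_conc * final_volume_ul / stock_conc


def hybridization_buffer_volumes(formamide_percent, final_volume_ul=1000.0):
    """Return the volume (in microliters) of each component of the buffer.

    Assumption: formamide_percent is % (v/v) of the final volume, and water
    fills the remaining volume.
    """
    volumes = {
        "5 M NaCl": stock_volume_ul(900, 5000, final_volume_ul),     # mM / mM
        "1 M Tris/HCl": stock_volume_ul(20, 1000, final_volume_ul),  # mM / mM
        "10% SDS": stock_volume_ul(0.01, 10, final_volume_ul),       # % / %
        "formamide": formamide_percent / 100 * final_volume_ul,
    }
    water = final_volume_ul - sum(volumes.values())
    if water < 0:
        raise ValueError("components exceed the final volume")
    volumes["MilliQ water"] = water
    return volumes


# Example: 1 mL of buffer at a hypothetical 30% formamide.
for component, volume in hybridization_buffer_volumes(30).items():
    print(f"{component}: {volume:.1f} uL")
```

The washing buffer is not included because its NaCl volume depends on the formamide concentration in a way that is not tabulated here.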

3 Methods

3.1 Probe Design from Next Generation Sequencing Datasets

1. Any adapter- and quality-trimmed metagenomics or metatranscriptomics dataset can be used as input for probe design, provided that the rRNA sequences have not been removed by rRNA depletion.
2. Load the dataset into the RiboTagger software [9], which is publicly available (https://github.com/xiechaos/ribotagger), and select the hypervariable region of the 16S ribosomal RNA gene sequence that you want to detect.
3. Each identified sequence tag has a default length of 33 base pairs (bp), and an operational taxonomic unit (OTU) is defined by its unique 33 bp-ribotag sequence.
4. Adjust the -tag INT parameter in the RiboTagger software to “-tag 17” so that the probes are around 17 bp long. This can also be done manually by trimming the 33-bp ribotag sequence from its 3′ end (Fig. 1).
5. Truncating the 33 bp-ribotag sequence might lead to hybridization with other OTUs that share the same probe sequence. An evaluation of probe specificity is therefore required.


Fig. 1 A diagram illustrating FISH probe design from 33 bp-RiboTag. Truncation of the length of the 33 bp-RiboTag from the 3′ end resulted in a 17 bp-FISH probe
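The manual truncation described in Subheading 3.1 and the resulting risk that several OTUs end up with the same probe sequence can be sketched as follows. The sketch assumes that ribotags are stored as strings written 5′ to 3′, so that trimming from the 3′ end keeps the first 17 bases. The function names, OTU labels, and tag sequences are made up for illustration and are not RiboTagger output.

```python
from collections import defaultdict


def truncate_ribotags(ribotags, probe_length=17):
    """Keep the first probe_length bases of each tag (trim from the 3' end).

    Assumption: tags are written 5' to 3'.
    """
    return {otu: tag[:probe_length] for otu, tag in ribotags.items()}


def shared_probes(probes):
    """Return probe sequences that are shared by more than one OTU."""
    otus_by_probe = defaultdict(list)
    for otu, probe in probes.items():
        otus_by_probe[probe].append(otu)
    return {
        probe: sorted(otus)
        for probe, otus in otus_by_probe.items()
        if len(otus) > 1
    }


# Three made-up 33-bp ribotags; otu_A and otu_B differ only after position 17.
example_tags = {
    "otu_A": "ACGTACGTACGTACGTACCCCGGGGAAAATTTT",
    "otu_B": "ACGTACGTACGTACGTAGGGGCCCCTTTTAAAA",
    "otu_C": "TTGACCGGTTAACCGGTACGTACGTACGTACGT",
}
probes = truncate_ribotags(example_tags)
print(shared_probes(probes))
```

In this example the 33-bp tags are unique, but otu_A and otu_B collapse onto one 17-bp sequence after truncation, which is the situation that the specificity evaluation is meant to detect. A check within the sample is only a first screen; it does not replace the evaluation against a reference database.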

3.2 Evaluation of in Silico Specificity and Coverage of Probes

1. Load the ARB software [10] with the latest available SILVA SSU Ref NR database.
2. Import the probe sequences into ARB using the “multiple probe function” tool in ARB.
3. Calculate the coverage of sequences in the target taxon using this equation:

\[ \text{Coverage} = \frac{\text{Number of perfectly matched sequences in target taxon}}{\text{Total number of sequences in target taxon}} \times 100 \]

4. Specificity is defined as the fraction of sequences in the target taxon with a perfectly complementary target site for the FISH probe relative to the total number of perfectly matched sequences in the database and can be calculated using this equation:

\[ \text{Specificity} = \frac{\text{Number of perfectly matched sequences in target taxon}}{\text{Total number of perfectly matched sequences in database}} \times 100 \]
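The two definitions above can be written as a pair of small functions. This is a minimal sketch that takes the match counts (for example, as read off from ARB) as plain integers; the function names and the counts in the example are hypothetical.

```python
def coverage_percent(matched_in_target, total_in_target):
    """Percentage of target-taxon sequences that the probe matches perfectly."""
    if total_in_target <= 0:
        raise ValueError("total_in_target must be positive")
    return 100 * matched_in_target / total_in_target


def specificity_percent(matched_in_target, matched_in_database):
    """Percentage of all perfect matches in the database that fall in the target taxon."""
    if matched_in_database <= 0:
        raise ValueError("matched_in_database must be positive")
    return 100 * matched_in_target / matched_in_database


# Hypothetical counts: 50 sequences in the target taxon, 45 of them matched,
# and 60 perfect matches in the whole database.
print(coverage_percent(45, 50))     # 90.0
print(specificity_percent(45, 60))  # 75.0
```

A probe with high coverage but low specificity would label most target cells together with many nontarget cells, so both values need to be considered when choosing a probe.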

3.3 Sample Fixation

It is recommended to store the samples at 4 °C until sample processing. Fixation with paraformaldehyde (PFA) should be performed on samples upon arrival at the laboratory. 1. Mix thoroughly one volume of samples with three volumes of 4% paraformaldehyde (PFA) in a sample tube, and fix the sample for 3 h at 4 °C. For example, if the sample is in 100 μL PBS, then add 300 μL of 4% PFA solution. The volume of resuspension depends on the number of cells within the sample. Fixed samples can be stored at -20 °C. Avoid freeze and thaw cycle, and fixed samples can be stored up to 1 year. 2. After fixation, remove the fixative by centrifugation and removing the supernatant. 3. Gently resuspend the cell pellet in PBS.

Targeted Enrichment of Low-Abundance and Uncharacterized Taxon Members. . .


4. Repeat the washing step (i.e., centrifuge and resuspend in PBS).
5. Resuspend the biomass in equal volumes of 100% ethanol and PBS. The final volume determines the cell density of your sample.

3.4 Fluorescent In Situ Hybridization and Microscopic Imaging

1. Thaw the samples (if stored at -20 °C) to room temperature.
2. Vortex the sample briefly for approximately 3 s, gently pipette 10–20 μL of fixed sample onto microscope slides (eight wells with 6 mm diameter, Cell-Line, USA), and immobilize the cells by drying in the hybridization oven (Shake “n” Stack, Thermo Fisher, USA) set at 46 °C for 10 min.
3. Prepare ethanol (Merck, USA) at 50%, 80%, and 96% (v/v) in 50 mL tubes, using molecular grade water as diluent. The total volume should be at least 40 mL to allow complete submersion of the slide in the ethanol bath. For high-throughput applications, staining jars and dishes that allow immersion of multiple slides can be used.
4. Carefully place the FISH slide into the 50 mL tubes containing ethanol in the following order: 50%, 80%, and 96% (v/v), for 3 min each.
5. Prepare the hybridization and washing buffers based on the melting curve of the FISH probe. The washing buffer can be placed in a 48 °C water bath.
6. Prepare hybridization chambers from 50 mL tubes by placing a piece of paper towel (cut to a size that fits into the 50 mL tube) into the tube and saturating the paper towel with hybridization buffer.
7. Add 10 μL of hybridization buffer to the FISH probes, then add the FISH probes to the sample to a final concentration of 5 ng/μL.
8. Transfer the slide to the hybridization chamber. Close the lid of the hybridization chamber to prevent evaporation of the buffer, and incubate in the 46 °C hybridization oven for at least 60 min. This follows the standard FISH protocol described by Manz et al. [11].
9. Remove the slide from the hybridization chamber, and rinse the slide with 1–5 mL of pre-heated washing buffer. Subsequently, dip the entire slide into the washing buffer (in 50 mL tubes) for a few seconds, take the slide out, and incubate at 46 °C for 20 min.
10. After the washing step, dip the entire slide into cold (4 °C) molecular grade water (50 mL Falcon tube) for a few seconds to remove the washing buffer.


11. Air-dry the slide with compressed air.
12. Mount the dried samples with 3 μL of Citifluor (Citifluor LTD, United Kingdom), and place a coverslip of #1.5 thickness (~170 μm) over the sample. The sample is then ready to be examined with a fluorescence microscope.
13. The imaging settings of the confocal laser scanning microscope depend on the fluorophores chosen. For example, 633-, 561-, and 488-nm lasers can be used for the excitation of Cy5, Cy3, and FITC/A488 fluorophores, respectively.
14. It is important to avoid crosstalk between the different fluorophores through careful adjustment of emission filters (e.g., Cy5: 642–695 nm; Cy3: 571–615 nm; FITC/A488: 500–535 nm) and through acquisition of individual fluorophore signals under individual laser tracks.

3.5 Optimizing Parameters for Fluorescent In Situ Hybridization (FISH)

A melting curve analysis is performed to determine the optimal hybridization and washing conditions for the newly designed probes. A confocal microscope with image acquisition capability and image analysis software (Digital Image Analysis In Microbial Ecology (DAIME) software [12]) are required for this procedure.
1. Use a standardized hybridization temperature (46 °C) and washing temperature (48 °C) throughout the melting curve analysis [13]. It is also important to apply the same hybridization and washing durations to the samples at each formamide (FA) concentration.
2. Conduct FISH on samples with varying concentrations of FA (range 10–70%) in the hybridization buffers and corresponding decrements in the concentration of sodium chloride (NaCl) in the washing buffers.
3. The stringency of hybridization is altered by varying the concentrations of FA and NaCl.
4. Measure the fluorescence intensity of the probe-labeled objects at each FA concentration with a confocal microscope, using the same exposure time and detector settings.
5. Import the images into the DAIME software [12], and segment the probe-labeled objects using the “RATS-L” algorithm.
6. Plot the mean fluorescence intensity of the probe-labeled objects against the respective FA concentrations as a melting curve. Technical replicates (multiple fields of view) are highly recommended to allow statistical analysis of the fluorescence intensity.
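The summary behind the melting curve in step 6 can be sketched in Python. This is a minimal illustration, not the DAIME workflow: it assumes the mean object intensity for each field of view has already been exported, and all values are hypothetical. The helper that picks the highest FA concentration still retaining 90% of the maximum signal is an arbitrary heuristic included for illustration only; the cut-off is not taken from this protocol.

```python
from statistics import mean, stdev


def melting_curve(intensities_by_fa):
    """Return (FA %, mean intensity, SD) tuples ordered by FA concentration."""
    curve = []
    for fa in sorted(intensities_by_fa):
        values = intensities_by_fa[fa]
        sd = stdev(values) if len(values) > 1 else 0.0
        curve.append((fa, mean(values), sd))
    return curve


def highest_fa_above(curve, fraction=0.9):
    """Highest FA % whose mean intensity is >= fraction * maximum mean.

    The 0.9 cut-off is an assumption for illustration, not a protocol value.
    """
    top = max(m for _, m, _ in curve)
    return max(fa for fa, m, _ in curve if m >= fraction * top)


# Hypothetical intensities (arbitrary units), one value per field of view.
intensities_by_fa = {
    10: [200.0, 210.0, 190.0],
    20: [205.0, 195.0, 200.0],
    30: [190.0, 195.0, 185.0],
    40: [150.0, 160.0, 140.0],
    50: [60.0, 70.0, 50.0],
    60: [20.0, 25.0, 15.0],
    70: [10.0, 12.0, 8.0],
}

curve = melting_curve(intensities_by_fa)
for fa, m, sd in curve:
    print(f"{fa:>3d}% FA  mean={m:7.1f}  sd={sd:5.1f}")
print("Highest FA retaining 90% of maximum signal:", highest_fa_above(curve))
```

The tuples returned by `melting_curve` can be passed to any plotting tool; the standard deviation across fields of view gives the error bars for the curve.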


3.6 Image Analysis: Quantitative FISH


Quantitative FISH analysis can be performed to quantify the relative abundance of the target taxon through the following equation:

\[ \text{Relative abundance of target taxon} = \frac{\text{Biovolume of target taxon hybridized by specific probe}}{\text{Biovolume of biomass hybridized by probe EUB338}} \]
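As a minimal sketch of how this ratio might be applied across several fields of view, the Python snippet below assumes that the two biovolumes per image have already been exported from the image analysis software. The numbers are hypothetical, and the function name is introduced here for illustration only.

```python
from statistics import mean, stdev


def relative_abundance(target_biovolume, eub338_biovolume):
    """Biovolume ratio of the specific probe signal to the EUB338 signal."""
    if eub338_biovolume <= 0:
        raise ValueError("EUB338 biovolume must be positive")
    return target_biovolume / eub338_biovolume


# Hypothetical (target, EUB338) biovolumes in cubic micrometers, one pair per image.
fields_of_view = [(12.0, 400.0), (9.0, 300.0), (20.0, 500.0)]

ratios = [relative_abundance(t, e) for t, e in fields_of_view]
print("Per-image ratios:", [round(r, 3) for r in ratios])
print(f"Mean = {mean(ratios):.4f}, SD = {stdev(ratios):.4f} (n = {len(ratios)})")
```

Reporting the mean and standard deviation over randomly positioned images makes the spread between fields of view visible rather than hiding it in a single pooled ratio.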

1. Biovolume can be calculated with an image analysis software tool.
2. Acquire multiple 3D microscopic images at random positions to determine the biovolume. Segment the probe-labeled cells using the “surface segmentation” algorithm in Imaris.
3. Use a filter (e.g., by adjusting absolute intensity threshold values) to remove background noise from the segmented images.

3.7 Fixation-Free and In-Solution FISH for Fluorescence-Activated Cell Sorting (FACS)

Fixation-free and in-solution FISH is performed according to the protocol described by Yilmaz et al. [2]. It is better to store the samples at 4 °C to preserve sample integrity. The FACS procedure is highly dependent on the machine used, but general guidelines are given below.
1. Prior to cell sorting, clean the sampling port and fluidic lines of the FACS machine using the cleaning solutions recommended by the equipment manufacturer. The recommended run time is 1 h, followed by 30 min of flushing with sterile, DNA-free water.
2. Use new sterile sheath fluid for each sorting experiment. These procedures help to minimize the entry of contaminants that would complicate downstream MDA reactions.
3. A nozzle with a tip size of 100 μm is recommended, and the sheath fluid pressure will need to be adjusted accordingly.
4. The choice of laser depends on the choice of fluorophore, and the cell sorter should be calibrated with size-defined beads (ranging from 0.4 to 1 μm). Drop delay parameters should also be calibrated using standardized beads suitable for the machine.
5. Constructing sorting gates is highly recommended to exclude smaller cellular aggregates and to select probe-labeled cells that display fluorescence intensity above that of the negative hybridization controls or autofluorescent particles. Examples of sorting gates are as follows: (a) a gate to differentiate bacterial cells by their forward and side light-scattering properties; (b) a gate to filter away large cellular aggregates, assuming that a single cell shows a linear relationship between forward/side scatter height and area; and (c) a gate to isolate probe-labeled cells exhibiting high Cy5 and A488 fluorescence relative to non-labeled cells or background noise.
6. Recommended controls are as follows: (1) a no-probe control, to which no FISH probes are added, and (2) non-specific controls (e.g., NON338) to estimate the levels of non-specific binding for the respective fluorophores.
7. Two rounds of cell sorting should be performed with the defined sorting gates. In the initial round, sort the cells into a sterile tube containing 1 mL of TE buffer. Perform a second round of sorting on the initially sorted cells, collecting into new sterile tubes containing 3 μL of TE buffer. Sorting purity is defined as the percentage of events that map back to the sorting gates defined during the initial FACS sorting.
8. Gate-guided sorting should result in a clear and distinct population with enhanced signal on the appropriate channel after hybridization of the probes. Sorted cells should be visualized by fluorescence microscopy to confirm the presence of positively labeled cells, with co-localization of the universal and targeted probes. The effectiveness of FISH-FACS sorting can be evaluated by (1) sorting purity obtained from flow cytometric analysis, (2) quantitative FISH analysis of pre- and post-sorted samples, and (3) 16S rRNA profiling of genomic DNA obtained from pre- and post-sorted samples.

3.8 Quality Check of Sorted Samples

One way to evaluate the sorting efficiency is to compare the microbial community composition of the pre-sorted and FISH-FACS processed samples. This can be done by metagenomic sequencing of the samples followed by taxonomic profiling with software tools that can extract 16S rRNA gene sequences from metagenomic datasets.
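The comparison of pre- and post-sorted profiles can be reduced to a fold enrichment of the target taxon. The Python sketch below assumes that per-OTU 16S read counts are already available as dictionaries; the OTU labels and counts are hypothetical and do not come from the study cited in this chapter.

```python
def taxon_fraction(counts, taxon):
    """Fraction of all 16S reads assigned to `taxon`."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("profile contains no reads")
    return counts.get(taxon, 0) / total


# Hypothetical 16S read counts per OTU before and after FISH-FACS.
pre_sort = {"target_OTU": 12, "OTU_A": 600, "OTU_B": 388}
post_sort = {"target_OTU": 940, "OTU_A": 40, "OTU_B": 20}

pre_fraction = taxon_fraction(pre_sort, "target_OTU")
post_fraction = taxon_fraction(post_sort, "target_OTU")
fold_enrichment = post_fraction / pre_fraction

print(f"Pre-sort:  {100 * pre_fraction:.2f}%")
print(f"Post-sort: {100 * post_fraction:.2f}%")
print(f"Fold enrichment: {fold_enrichment:.1f}x")
```

A fold enrichment is undefined when the target is absent from the pre-sorted profile, so a real analysis would need to handle a zero pre-sort fraction explicitly.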

3.9 Downstream Bioinformatic Analysis

The sorted samples can undergo DNA extraction and be sent for whole genome sequencing (Fig. 2). If the nucleic acid quantity is too low, the DNA can be concentrated or amplified (for example, by multiple displacement amplification, MDA) to increase the DNA content. However, users should be aware that amplification steps such as MDA can introduce bias and should review the latest literature on whole genome amplification protocols. Once the metagenomic datasets are received, standard analyses can be performed, for example, quality trimming, metagenome assembly and binning, and estimation of genome completeness, contamination levels, and sequencing coverage.


Fig. 2 Workflow guide to obtain draft genomes of low-relative abundance member from a complex microbial community

4 Notes

1. Estimating the cell concentration in the sample is important to avoid crowding of cells or having an insufficient number of cells for downstream analysis. The cell number in the sample can be estimated using cell staining and microscopic counts.
2. For samples with high levels of autofluorescence due to interference from impurities, the following procedure may be used to reduce the background fluorescence: (a) Harvest bacterial biomass by centrifugation. Centrifugation speed and time depend on the nature of the sample. (b) Remove the supernatant and gently resuspend the cell pellet in PBS; repeat the centrifugation procedure.
3. Where it is not possible to apply two probes in a simultaneous hybridization because of different hybridization stringencies, apply the probe with the higher melting temperature in the first round of hybridization and washing, and subsequently apply the probe with the lower melting temperature.


4. Fixation-free and in-solution FISH: for samples with a high proportion of aggregates, a short sonication step can be performed to break up the aggregates. Alternatively, passing the hybridized sample repeatedly through a syringe needle will also help to disperse the aggregates to a size suitable for FACS. The procedure is highly dependent on the nature of the samples.
5. Besides elucidating the genomic and functional capabilities of the microbial community, the omics datasets (metagenomics and metatranscriptomics) can also be used as input for the design of FISH probes for targeted enrichment or to study three-dimensional spatial information.
6. Successful enrichment was demonstrated by Tan et al. [6] on activated sludge samples from a full-scale wastewater treatment plant. The purity assessment indicated that probe-labeled cells were enriched from an initial abundance of 1.15 ± 0.24% to 94.53 ± 5.05% (n = 3, mean ± SD) after an initial round of sorting.
7. The designed probes can be used to target members using specific unique tag sequences from the hypervariable regions of the 16S rRNA gene. The V6 region was selected among the V4–V7 regions because it was better suited to estimating taxa richness at the species level [14], and the 16S secondary structure model in this region had demonstrated moderate accessibility to FISH probes [15]. Furthermore, the V6 region provides a meaningful taxonomic resolution that can distinguish between closely related species.
8. If mixed 16S sequences are detected in FISH-FACS sorted samples, check whether the probes are binding specifically. This can be done either through image analysis of hybridized samples (e.g., by calculating Manders' co-localization coefficient of the different probes) or by checking the database for the sequences of the designed probes.
9. It is important that the designed FISH probe is as specific as possible for its intended target OTU to (1) reduce genomic heterogeneity prior to genomic assembly and (2) increase the completeness of the genomic bins. Given the short length of the ribotag template (33 bp), designing a specific probe is challenging. The following three approaches could be used in future experiments to increase the specificity of the FISH probe.
(a) The first approach involves using the entire 33 bp-ribotag (from the RiboTagger software) as a FISH probe. In most instances, an OTU is defined by its unique 33 bp-ribotag sequence and the OTU represents a substantial fraction of the longer representative sequence of the ribotag. As part of our validation procedures, the length of the ribotag was truncated so that FISH probes could be used at a standardized hybridization temperature with other published FISH probes. Truncating the ribotags altered the probe specificity, as many OTUs that contained the complementary binding site of the truncated FISH probe were included in the sorted samples.
(b) The second approach involves designing multiple FISH probes for the target OTU. A longer representative sequence (81 bp) can be retrieved for the respective ribotag using the RiboTagger software by activating the “-long” function. Subsequently, comparative sequence analysis of the representative sequence of the target OTU and of non-target OTUs in the sequencing dataset allows FISH probes to be designed with a central mismatch to non-target OTUs, thus increasing the specificity of the FISH probe (Fig. 3).
(c) The final approach involves designing FISH probes from other variable regions of the 16S rRNA gene of the same target taxon. This can be achieved by using the RiboTagger software to extract other variable regions (V4, V5, and V7). However, this approach requires the target OTU to have an established taxonomy so that probes can be designed against the same target organism. Alternatively, FISH probes from other hypervariable regions can be designed using the approach of Hasegawa et al. [16], where full-length 16S rRNA reference sequences containing the V6 sequence are downloaded from a curated database for probe design with comparative sequence analysis. However, there is a risk that the target OTU in the sample might have a different full-length 16S rRNA sequence from that in the curated database.

Fig. 3 Design of two FISH probes from the 81 bp-ribotag. The universal recognition profile is used by the RiboTagger software to extract taxonomically relevant reads from omics dataset. FISH probes are designed downstream of the universal recognition profile
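When candidate probes from the approaches in Note 9 are compared, the coverage and specificity definitions from Subheading 3.2 can be computed directly from match counts. The following is a minimal Python sketch that assumes the counts have been read from the probe match results by hand; the function names and all numbers are hypothetical.

```python
def coverage(target_matches, target_total):
    """Percentage of target-taxon sequences perfectly matched by the probe."""
    if target_total <= 0:
        raise ValueError("target taxon must contain at least one sequence")
    return 100.0 * target_matches / target_total


def specificity(target_matches, database_matches):
    """Percentage of all perfect matches in the database that fall in the target taxon."""
    if database_matches <= 0:
        raise ValueError("probe has no perfect matches in the database")
    return 100.0 * target_matches / database_matches


# Hypothetical counts for one candidate probe.
target_total = 200      # sequences in the target taxon
target_matches = 150    # of those, perfectly matched by the probe
database_matches = 160  # perfect matches anywhere in the database

print(f"Coverage:    {coverage(target_matches, target_total):.2f}%")
print(f"Specificity: {specificity(target_matches, database_matches):.2f}%")
```

In this example, 10 of the 160 perfect matches lie outside the target taxon, which is the kind of off-target binding that truncating a ribotag can introduce.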


Acknowledgments

This research was supported by the Singapore National Research Foundation (NRF) and the Ministry of Education (MOE) under the Research Centre of Excellence Programme. Dr. Shi Ming Tan was supported by the Singapore NRF Environmental and Water Technologies (EWT) PhD Scholarship. We thank Larry Liew for performing the sampling of activated sludge and the technical staff from the SCELSE sequencing laboratory for their assistance.

Competing Interests

The authors declare no competing financial or nonfinancial interests.

References

1. Widder S et al (2016) Challenges in microbial ecology: building predictive understanding of community function and dynamics. ISME J 10:2557–2568. https://doi.org/10.1038/ismej.2016.45
2. Yilmaz S, Haroon MF, Rabkin BA, Tyson GW, Hugenholtz P (2010) Fixation-free fluorescence in situ hybridization for targeted enrichment of microbial populations. ISME J 4:1352–1356. https://doi.org/10.1038/ismej.2010.73
3. Gougoulias C, Shaw LJ (2012) Evaluation of the environmental specificity of fluorescence in situ hybridization (FISH) using fluorescence-activated cell sorting (FACS) of probe (PSE1284)-positive cells extracted from rhizosphere soil. Syst Appl Microbiol 35:533–540. https://doi.org/10.1016/j.syapm.2011.11.009
4. Bruder LM, Dorkes M, Fuchs BM, Ludwig W, Liebl W (2016) Flow cytometric sorting of fecal bacteria after in situ hybridization with polynucleotide probes. Syst Appl Microbiol 39:464–475. https://doi.org/10.1016/j.syapm.2016.08.005
5. Nettmann E et al (2013) Development of a flow-fluorescence in situ hybridization protocol for the analysis of microbial communities in anaerobic fermentation liquor. BMC Microbiol 13:278. https://doi.org/10.1186/1471-2180-13-278
6. Tan SM et al (2019) Primer-free FISH probes from metagenomics/metatranscriptomics data permit the study of uncharacterised taxa in complex microbial communities. NPJ Biofilms Microbiomes 5:17. https://doi.org/10.1038/s41522-019-0090-9
7. Amann RI et al (1990) Combination of 16S rRNA-targeted oligonucleotide probes with flow cytometry for analyzing mixed microbial populations. Appl Environ Microbiol 56:1919–1925. https://doi.org/10.1128/aem.56.6.1919-1925.1990
8. Wallner G, Amann R, Beisker W (1993) Optimizing fluorescent in situ hybridization with rRNA-targeted oligonucleotide probes for flow cytometric identification of microorganisms. Cytometry 14:136–143. https://doi.org/10.1002/cyto.990140205
9. Xie C, Goi CL, Huson DH, Little PF, Williams RB (2016) RiboTagger: fast and unbiased 16S/18S profiling using whole community shotgun metagenomic or metatranscriptome surveys. BMC Bioinform 17:508. https://doi.org/10.1186/s12859-016-1378-x
10. Ludwig W et al (2004) ARB: a software environment for sequence data. Nucleic Acids Res 32:1363–1371. https://doi.org/10.1093/nar/gkh293
11. Manz W, Amann R, Ludwig W, Wagner M, Schleifer KH (1992) Phylogenetic oligodeoxynucleotide probes for the major subclasses of proteobacteria: problems and solutions. Syst Appl Microbiol 15:593–600
12. Daims H, Lucker S, Wagner M (2006) Daime, a novel image analysis program for microbial ecology and biofilm research. Environ Microbiol 8:200–213. https://doi.org/10.1111/j.1462-2920.2005.00880.x
13. Pernthaler J, Glockner FO, Schonhuber W, Amann R (2001) Fluorescence in situ hybridization (FISH) with rRNA-targeted oligonucleotide probes. Methods Microbiol 30(30):207–226. https://doi.org/10.1016/S0580-9517(01)30046-6
14. Yarza P et al (2014) Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol 12:635–645. https://doi.org/10.1038/nrmicro3330
15. Behrens S et al (2003) In situ accessibility of small-subunit rRNA of members of the domains bacteria, archaea, and Eucarya to Cy3-labeled oligonucleotide probes. Appl Environ Microbiol 69:1748–1758
16. Hasegawa Y et al (2010) Imaging marine bacteria with unique 16S rRNA V6 sequences by fluorescence in situ hybridization and spectral analysis. Geomicrobiol J 27:251–260. https://doi.org/10.1080/01490450903456806

Chapter 17

Assembly and Annotation of Viral Metagenomes from Short-Read Sequencing Data

Mihnea R. Mangalea, Kristopher Kieft, Breck A. Duerkop, and Karthik Anantharaman

Abstract

Viral metagenomics enables the detection, characterization, and quantification of viral sequences present in shotgun-sequenced datasets of purified virus-like particles and whole metagenomes. Next-generation sequencing (Illumina) runs producing short single- or paired-end reads are a principal platform for metagenomics, and assembly of short reads allows for the identification of distinguishing viral signatures and complex genomic features for taxonomy and functional annotation. Here we describe the identification and characterization of viral genome sequences, from bacteriophages and eukaryotic viruses, in a cohort of human stool samples, using multiple methods. Following the purification of virus-like particles, sequencing, quality refinement, and genome assembly, we begin the protocol with raw short reads deposited in an open-source nucleotide archive. We highlight the use of VIBRANT, an automated computational tool for the characterization of microbial viruses and their viral community function. Finally, we also describe an alternative assembly-free option of mapping reads to established databases of reference genomes and previously characterized metagenome-assembled viral genomes.

Key words Metagenomics, Virus, Bacteriophages, Shotgun sequencing, Genome databases, Read mapping, VIBRANT, Phage identification, Phage auxiliary metabolic genes

1 Introduction

Metagenomics, or the meta-analysis of a collection of genomes from an environment [1], has greatly impacted the field of microbiology by allowing microorganisms and their functional capacity to be identified without the need for culturing in non-native environments [2]. Viral metagenomics enables the genetic exploration of the most abundant organisms on the planet [3] by sequencing viral nucleic acids directly, leading to the discovery of novel viruses, their surveillance, and a greater understanding of viral ecology and interactions with their hosts [4]. Yet despite the incredible richness of viruses in the biosphere [5], defining viral diversity remains challenging due to the vast majority of uncharacterized sequences

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_17, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


(viral dark matter) [6, 7]. In contrast to environmental microbial metagenomes and cultured microbial genomes, which have been broadly sampled and sequenced, the global viral metagenome has lacked definition and characterization [3].

Metagenomics of viruses began in 2002, when two uncultured marine viral communities were sequenced, revealing that over 65% of the sequences did not match known sequences [8]. The following year, in 2003, the first metagenomic analysis of human feces also described a majority of unknown viral sequences, despite an estimated 1200 viral genotypes present in the sample [9]. As next-generation sequencing strategies became less costly and more efficient in parallel with methods for virus-like particle (VLP) recovery from small amounts of biological material, viral metagenomes began to flood the sequence space [6]. In 2010 and 2011, separate VLP metagenome sequencing studies from multiple human subjects helped to define the “human gut virome” [10, 11], incorporating assembly of VLP sequences, analysis of gene content based on several databases, and sequence classification relative to defined bacteriophage (phage) families.

Assembly of short-read sequences into partial or complete contiguous genome sequences (contigs) allows for the characterization of genome features, which greatly improves analyses of short-read data [12]. The computational foundation for current genomic and metagenomic assemblies relies on resolving de Bruijn graphs (dBg) based on connected and overlapping series of kmers [12, 13], which are sequences of any length k. In 2012, notable dBg assemblers SPAdes [14] and IDBA-UD [15] emerged as reliable options for metagenomic assemblies by incorporating variable kmer sizes to build longer contigs, leading to reconstruction of metagenome-assembled genomes (MAGs). 
Building upon the stepwise kmer assembly process, MEGAHIT more recently incorporated succinct dBgs, which are more memory-efficient and faster for assembly of large and complex short-read metagenomic datasets [16, 17]. Assembly of metagenomic sequences into much longer contigs has contributed greatly to viral discovery by identifying novel human viruses and phages. Notably, the novel crAssphage genome was discovered in 2014 by cross-assembly from a dozen unrelated human intestinal metagenomes and recruited reads from hundreds of other metagenomic studies indicating a high abundance and prevalence of this Bacteroides-associated phage [18]. Furthermore, metagenomic read recruitment has been valuable for microbial population genetics and comparative genomics of novel phage groups distributed in global oceans [19–23] and the human intestine [10, 18, 24–26]. Mapping to existing databases (see Viral Databases in Materials) such as the RefSeq Virus database from NCBI has been an insightful and necessary step for early virus identification tools such as Metavir [27, 28], ViromeScan [29], VIP [30], and FastViromeExplorer [31]. Further characterizations

Viral Contigs from Short Read Metagenome Sequences


of metagenomic viral sequences rely on taxonomic classification by annotating genes and contigs from assembled sequences leading to viral taxonomic annotation. In 2015, VirSorter v1 emerged as a standalone automated virus discovery tool for predicted viral genomes from microbial DNA extractions using publicly available genome data [32]. In 2017, VirFinder became the first tool to incorporate distinct kmer signature frequency and machine learning methods for viral contig discovery, outperforming VirSorter [33]. Other notable viral discovery tools include MARVEL [34], a machine learning tool trained specifically on dsDNA viruses from the Caudovirales order, as well as PHASTER [35], Prophage Hunter [36], and ViR [37] for the identification of proviruses and endogenous viral elements. Recently, VirSorter v2 was released for detection of DNA and RNA viral genomes, with several updates to the previous version including detection of more diverse viral groups and machine learning applications for viral estimations [38]. CheckV is another tool that recently advanced the identification of viral genomes and host contamination from metagenomes [39]. Our method for identification of viral sequences from de novo assembled viral metagenomic reads ultimately relies on the VIBRANT (Virus Identification By iteRative ANnoTation) platform [40]. Released in 2020, VIBRANT combines neural networks of protein signature similarities to identify diverse phages and their metabolic potential, while outperforming previous viral classification tools like VirSorter (v1), VirFinder, and MARVEL [40]. When used on a set of metagenomic viral sequences extracted from 25 human stool samples, VIBRANT identified nearly 5000 nonredundant phage contigs corresponding to free and integrated prophages with minimal host sequence contamination [41]. 
Importantly, VIBRANT also helps define viral community function by identifying auxiliary metabolic genes (AMGs) and evaluating the metabolic pathways available for viral communities. The demonstrated applications of VIBRANT to intestinal virome analyses [41, 42] and environmental samples [43] make this a powerful tool for better understanding the role of viruses in diverse microbiomes and ecosystems. This protocol serves as a walkthrough for curating a set of viral contigs from an archived metagenomic sequence dataset and mapping to previously referenced viral genomes.
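The de Bruijn graph idea mentioned in this Introduction can be illustrated with a toy Python sketch. It is not how SPAdes, IDBA-UD, or MEGAHIT are implemented: it ignores sequencing errors, reverse complements, coverage, and multiple kmer sizes, and the reads are hypothetical. It only shows how overlapping kmers link (k-1)-mers into a graph whose unambiguous paths spell out a contig.

```python
from collections import defaultdict


def kmers(sequence, k):
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from `start` while exactly one unvisited successor exists."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        if nxt in seen:  # stop on cycles
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping short reads.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
graph = de_bruijn_edges(reads, k=4)
print(walk_unambiguous(graph, "ATG"))
```

The three 7 bp reads overlap well enough that a single path through the graph joins them into one longer sequence, which is the sense in which assembly builds contigs from short reads.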

2 Materials

The hardware parameters described herein can be modified according to user preferences and resource availability. For sequence analyses and mapping to viral databases, a computer with access to the Internet is required, as well as optional access to remote


server repositories where large sequencing datasets for your lab projects may be stored. We do not provide methodological details for isolating, extracting, or sequencing viral nucleic acids in this chapter, although detailed explanations and comparisons of VLP isolation methods are available. Shkoporov et al. (2018) describe VLP enrichment, viral nucleic acid extraction, and shotgun sequencing from human fecal samples [44], and Kleiner et al. (2015) compare multiple methods for VLP purification from fecal samples, including their efficiencies for removal of contaminating bacterial and host DNA [45]. This protocol begins with a dataset of short-read sequences of VLPs derived from a cohort of 25 individual stool samples. See Mangalea et al. (2021) [41] for more information on these sequences and the resulting phage metagenomics analyses, including additional phage-host assignment techniques not described here. The following resources are required to successfully run the protocol presented below.

2.1 Hardware

1. A computer console with at least 4 GB RAM available, connected to the Internet for file sharing and web browsing, with access to a Unix-based terminal for interaction with the appropriate host for implementing commands. For example, a Macintosh MacBook Pro with a 2.5 GHz Intel Core i7 processor and 16 GB RAM was used in these analyses. Additionally, most of the analyses involving raw reads were performed on a remote server, accessed via secure shell protocol (ssh), housing the Linux distribution Ubuntu 18.04 LTS “Bionic Beaver” (GNU/Linux 5.4.0-53-generic x86_64) (see Note 1).

2.2 Software

The following tools and respective versions were used for this analysis:
1. FastQC v0.11.8 (https://github.com/s-andrews/FastQC).
2. BBtools v38.56 (https://sourceforge.net/projects/bbmap/).
3. MEGAHIT assembler v1.2.7 (https://github.com/voutcn/megahit).
4. QUAST v5.0.2 (https://github.com/ablab/quast).
5. VIBRANT v1.2.1 (https://github.com/AnantharamanLab/VIBRANT).
6. HMMER3 v3.3 (https://github.com/EddyRivasLab/hmmer).
7. Prodigal v2.6.3 (https://github.com/hyattpd/Prodigal).
8. CheckV v0.6.0 (https://bitbucket.org/berkeleylab/checkv/src/master/).
9. R version 3.6.3 and RStudio (https://www.r-project.org).
10. ggplot2 v3.3.3 (https://github.com/tidyverse/ggplot2).
11. ComplexHeatmap v2.5.1 (https://github.com/jokergoo/ComplexHeatmap).
12. DIAMOND v2.0.0.138 (https://github.com/bbuchfink/diamond).
13. PhageTaxonomyTool: a custom Python program, PPT.py, available in this repository (https://github.com/AnantharamanLab/Kieft_and_Zhou_et_al._2020).

2.3 Sequences

The metagenomic sequences used in this protocol are derived from the VLP purification protocol outlined in Mangalea et al. (2021) [41]. We begin with VLP reads that have been trimmed and decontaminated of common lab-specific genomic contaminants (more on this in Subheading 3.1 and Note 2).
1. Sequences can be found deposited at the European Nucleotide Archive under accession number PRJEB42612 (see Note 2); https://www.ebi.ac.uk/ena/browser/view/PRJEB42612

2.4 Viral Databases

The use of VIBRANT is not dependent on local protein databases for predicting known and novel viral signatures from metagenomic sequences, therefore significantly reducing reference-associated recall biases and environmental biases. Rather, VIBRANT utilizes hidden Markov models (HMMs) based on the Pfam (v32), VOG (v94), and KEGG (March 2019) databases, which can be used to functionally annotate proteins distantly related to sequences in the reference database. VIBRANT employs a “v-score” metric, which allows for the distinction between virus-like and nonviral HMM annotations. Using the abundance of database annotations and their associated v-scores, VIBRANT’s neural network machine learning model is capable of predicting which contigs are likely to be of viral origin. The neural network model was trained on annotations from NCBI RefSeq reference viral genomes. However, since annotation abundances and virus-like information (i.e., v-scores) are used instead of homology to the reference sequences, the identification of viruses distant or completely distinct from references viruses is possible. On the other hand, there are several advantages to utilizing curated databases of high-quality reference genomes when exploring your metagenomic datasets. Mapping raw reads or clustering assembled contigs to catalogued references from the following resources enables unique insights into the host ranges, general biology, and taxonomy of your uncharacterized viral samples. These reference databases are a few suggestions that may be useful for characterizing short-read metagenomic sequences, some specifically tailored to human intestinal bacteriophages. 1. NCBI Viral Genomes RefSeq sequences [46] (see Note 3). 2. RVDB [47]; https://github.com/ArifaKhanLab/RVDB


Mihnea R. Mangalea et al.

3. IMG/VR 3.0 [48]; https://img.jgi.doe.gov/vr
4. Gut Virome Database [42]; https://doi.org/10.25739/12sqk039
5. Gut Phage Database [49]; http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/gut_phage_database/

3 Methods

Carry out the following steps on your personal computer, optionally connected to a remote server for ease in manipulating large files. Depending on the number of computing cores in your machine, you can set multiple threads on single cores or run jobs in parallel on multiple cores (see Note 4). In this section, all lines of code typed directly into the command line (shell) are denoted in a distinct font with a “$” at the beginning (e.g., $ example command).

3.1 Read Quality Control, Adapter Trimming, and Decontamination

1. Initial quality control and assessment of your reads can be performed using FastQC, “A high-throughput sequence analysis tool” [50]. To start, create a directory in which all of your raw sequences will be stored. The command mkdir (GNU coreutils) followed by your directory name will make this directory:

    $ mkdir your_directory_name

Next, create another directory, within the initial directory you just created, for the output of your initial read quality analysis:

    $ mkdir FastQC_out

Run FastQC on all fastq.gz files in your directory and write the output to FastQC_out:

    $ fastqc *fastq.gz -o FastQC_out

Assess the quality of reads by opening the fastqc.html file associated with each read file. This will open a browser window where you can see, for instance, if sequence adapters are present on your reads.

2. Upon successful download of all sequences associated with your project, and assessment of sequence quality with FastQC, you may need to trim adapter sequences. This can be performed with a variety of tools, and here we will use bbduk from the BBtools suite [51]. First, create an index adapter

Viral Contigs from Short Read Metagenome Sequences


fasta file, with all possible adapters used during your sequencing run. For example, here we used TruSeq universal adapters for I5 and index adapters for I7 barcodes (header of TruSeq adapter and indices):

    >TruSeq_Universal_Adapter
    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
    >TruSeq_Adapter_Index_1
    GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
    >TruSeq_Adapter_Index_2
    GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
    …
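As a minimal sketch, the adapter file can be written from the shell with a here-document and then checked by counting its records. The file name my_adapters.fasta is the one passed to ref= in the trimming script of this protocol. Only the three records listed in the text are included here; the remaining index adapters from your own sequencing run would need to be appended.

```shell
# Write the adapter records listed in the text to my_adapters.fasta.
# Only three records are included; append the other index adapters
# used in your own run.
cat > my_adapters.fasta <<'EOF'
>TruSeq_Universal_Adapter
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>TruSeq_Adapter_Index_1
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>TruSeq_Adapter_Index_2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
EOF

# Count the records (header lines start with ">").
grep -c '^>' my_adapters.fasta   # prints 3
```

Counting header lines is a quick way to confirm that the number of records matches the number of adapters you expect.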

Next, compose a shell script that will sequentially trim adapters from both R1 and R2 reads from each sample (see Note 5; example for one sample):

    #!/bin/bash
    sample=ST_2_S27_L002_

    # pass 1: trim adapters towards the right end of each read (ktrim=r)
    bbduk.sh -Xmx40g threads=16 \
      in1=${sample}R1_001.fastq.gz in2=${sample}R2_001.fastq.gz \
      out1=${sample}cleanedR_R1.fastq.gz out2=${sample}cleanedR_R2.fastq.gz \
      ref=my_adapters.fasta ktrim=r mink=4 minlength=20 qtrim=f

    # pass 2: trim adapters towards the left end of each read (ktrim=l)
    bbduk.sh -Xmx40g threads=16 \
      in1=${sample}cleanedR_R1.fastq.gz in2=${sample}cleanedR_R2.fastq.gz \
      out1=${sample}cleaned_R1LR.fastq.gz out2=${sample}cleaned_R2LR.fastq.gz \
      ref=my_adapters.fasta ktrim=l mink=4 minlength=20 qtrim=f

    # remove the intermediate files written by pass 1
    rm -f ./*cleanedR_R1.fastq.gz*
    rm -f ./*cleanedR_R2.fastq.gz*

(see Note 6 for a brief explanation of above code example).
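The example above handles a single sample. One possible way to extend it to every sample in a directory is sketched below as a dry run that only prints the bbduk.sh commands, so they can be inspected before anything is executed. The helper names sample_prefix and print_trim_commands are hypothetical, and the sketch assumes that every R1 file name ends in R1_001.fastq.gz, as in the example.

```shell
# Dry-run sketch: print the bbduk.sh commands for each sample rather than running them.
# Assumption: every R1 file name ends in R1_001.fastq.gz, as in the example above.

# sample_prefix FILE: strip the R1 suffix, e.g. ST_2_S27_L002_R1_001.fastq.gz -> ST_2_S27_L002_
sample_prefix() {
  printf '%s\n' "${1%R1_001.fastq.gz}"
}

# print_trim_commands PREFIX: print the right-end and left-end trimming passes for one sample
print_trim_commands() {
  s=$1
  opts="ref=my_adapters.fasta mink=4 minlength=20 qtrim=f"
  echo "bbduk.sh -Xmx40g threads=16 in1=${s}R1_001.fastq.gz in2=${s}R2_001.fastq.gz out1=${s}cleanedR_R1.fastq.gz out2=${s}cleanedR_R2.fastq.gz ktrim=r $opts"
  echo "bbduk.sh -Xmx40g threads=16 in1=${s}cleanedR_R1.fastq.gz in2=${s}cleanedR_R2.fastq.gz out1=${s}cleaned_R1LR.fastq.gz out2=${s}cleaned_R2LR.fastq.gz ktrim=l $opts"
}

# Loop over the R1 files in the current directory (prints nothing if there are none).
for r1 in *R1_001.fastq.gz; do
  [ -e "$r1" ] || continue
  print_trim_commands "$(sample_prefix "$r1")"
done
```

Once the printed commands look right, the output can be redirected into a script file and run, followed by the same clean-up of intermediate files as in the one-sample example.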


3. Run FastQC again to check that adapters have been successfully trimmed from your sequences. Create another directory for cleaned (trimmed) reads and another one for the new FastQC output:

    $ mkdir new_directory_name
    $ mkdir FastQC_cleaned_out

Run FastQC on all cleaned fastq.gz files in your directory and write the output to the new FastQC directory:

    $ fastqc *cleaned_R1LR.fastq.gz -o FastQC_cleaned_out
    $ fastqc *cleaned_R2LR.fastq.gz -o FastQC_cleaned_out

Again, assess the quality of reads by opening the fastqc.html file associated with each read file.

4. Decontamination of reads matching common contaminants is an important step in metagenomic analysis, especially for samples coming from humans, mice, or other host-associated environments. Additionally, common organisms that are routinely cultured in your lab setting can be included in the decontamination process, as well as phiX174 DNA that may not have been removed through demultiplexing. First, create a directory where you will add all references to include for decontamination:

    $ mkdir refs_for_decon

Next, download all appropriate reference genomes from NCBI and add them to this folder, namely Human_hg38_UCSC_20140711.fa, Mus_Musculus_mm10.fa, and phiX174.fa (see Note 7). Then generate a mapping reference using this command from BBtools (the comma-separated list of references must not contain spaces):

    $ bbsplit.sh ref=Human_hg38_UCSC_20140711.fa,Mus_Musculus_mm10.fa,phiX174.fa

Generate a shell script (as above) to sequentially decontaminate all cleaned reads, and move this script into the local directory for decontamination. Example script for one sample:


    #!/bin/sh
    sample=ST_2_S27_L002_cleaned_

    # map reads against the contaminant references; unmapped reads are written to outu1/outu2
    bbsplit.sh -Xmx150g threads=16 ambiguous2=split qtrim=lr path=/refs_for_decon \
      in1=${sample}R1LR.fastq.gz in2=${sample}R2LR.fastq.gz \
      basename=${sample}out_%.fastq.gz \
      outu1=${sample}unmappedR1.fastq.gz outu2=${sample}unmappedR2.fastq.gz \
      scafstats=${sample}scaffoldStats refstats=${sample}referenceStats \
      ihist=${sample}insertsizehistogram 2>&1 | tee ${sample}bbsplit.out

    # sort the outputs into directories (the directories must already exist)
    mv *unmapped* ./Decon_reads_virome
    mv *insertsizehistogram* ./Decon_insert_size_histogram
    mv *Human* *Mus* *phiX174* ./Decon_mapped_human_mice_phiX174
    mv *bbsplit.out* ./Decon_bbsplit_out
    mv *scaffoldStats* ./Decon_scaffold_Stats
    mv *referenceStats* ./Decon_reference_Stats

Mapped reads will be tallied in the refstats and scafstats output files (named here with the suffixes referenceStats and scaffoldStats), which can be found in the /Decon_reference_Stats and /Decon_scaffold_Stats directories. Unmapped reads (i.e., the cleaned and decontaminated reads) will be found in the /Decon_reads_virome directory, and these can be renamed if needed (see Note 8). These are the reads that you will proceed with for further downstream analysis.
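The mv commands in the example script assume that the destination directories already exist. If a destination is missing, mv fails when given several files, or renames a single file to that name. A minimal sketch for creating the directories beforehand, using the directory names from the script:

```shell
# Create the output directories that the mv commands in the decontamination script expect.
# mkdir -p leaves existing directories untouched, so this is safe to re-run.
for d in Decon_reads_virome Decon_insert_size_histogram \
         Decon_mapped_human_mice_phiX174 Decon_bbsplit_out \
         Decon_scaffold_Stats Decon_reference_Stats; do
  mkdir -p "./$d"
done
```

Run this once in the directory that holds the decontamination script, before the script itself.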

3.2 Contig Assembly

1. With cleaned and decontaminated reads, it is now possible to begin assembly of reads into larger contigs. Here, we use the MEGAHIT succinct de Bruijn graph (dBg) assembler for paired-end read files R1 and R2. Example for Sample 3:

    $ megahit -1 ST_3_S28_L002_cleaned_unmappedR1.fastq.gz -2 ST_3_S28_L002_cleaned_unmappedR2.fastq.gz --presets meta-large -o 3_megahit_out/


The assembler will go through each k-mer size in the k-list (see Note 9) and can take several hours per sample, depending on the number of CPU threads available to it. The assembled contigs will be deposited in the output folder (3_megahit_out/), where you will find a fasta file labeled final.contigs.fa, which can be renamed as you wish.

2. To assess assembly quality, including number of contigs produced, sizes of contigs, and other general quality metrics, you can use QUAST: Quality Assessment Tool for Genome Assemblies (see Note 10). Example for Sample 3:

    $ quast.py 3_megahit_out/final.contigs.fa

3. Depending on the results of your post-assembly quality assessment, you may want to remove smaller, fragmented, assembled contigs. In our analysis of the RA intestinal virome [41], we chose to remove contigs shorter than 5000 base pairs; the VIBRANT workflow automates this cutoff at 1000 base pairs (see Note 11). For example, one way to filter your contigs by a minimum length is using reformat.sh from BBtools:

    $ reformat.sh in=3_P_contigs.fa out=3_P_contigs_5k.fa minlength=5000
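To check how many contigs a given cutoff would keep, the records in a fasta file can be counted by length with awk. The sketch below is an independent sanity check, not part of BBtools. The function name count_long_contigs is hypothetical, and the three short records are made-up data so that the example runs on its own.

```shell
# count_long_contigs MINLEN FASTA: print how many records have a sequence length >= MINLEN.
# Sequences may be wrapped over several lines; lengths are summed per record.
count_long_contigs() {
  awk -v min="$1" '
    /^>/ { if (seen && len >= min) n++; seen = 1; len = 0; next }
         { len += length($0) }
    END  { if (seen && len >= min) n++; print n + 0 }
  ' "$2"
}

# Tiny made-up fasta file so the sketch runs without real data.
demo=$(mktemp)
cat > "$demo" <<'EOF'
>contig_1
ACGTACGTAC
>contig_2
ACGT
ACGTAC
>contig_3
ACG
EOF

count_long_contigs 10 "$demo"   # prints 2
```

With real data, the call would look like count_long_contigs 5000 3_P_contigs.fa, and the count should match the number of records in 3_P_contigs_5k.fa.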

3.3 Viral Sequence Identification

1. To identify viral contigs, VIBRANT can be run directly on the generated contigs from assembly (see Note 12). In addition to identification, VIBRANT will also functionally annotate the contigs:

    $ VIBRANT_run.py -i 3_P_contigs_5k.fa -folder VIBRANT_results_folder

2. After predicting viral contigs, you will likely want to assess the completeness and quality of the contigs. This will provide information such as which contigs represent complete genomes and which are “high quality” (near-complete) genomes, and will identify potential cellular sequence contamination. Contamination is of concern for predictions of integrated viruses (e.g., prophages). This step can be done using CheckV (see Note 13). Both VIBRANT and CheckV will provide estimations of virus lifestyle (e.g., virulent or temperate) (see Note 14). In your directory for VIBRANT results (VIBRANT_results_folder), you will find several multi-fasta files containing predicted lytic, lysogenic, and circular phages, and a combined list denoted as our_sample.phages_combined.fna. This file can be directly input to CheckV, followed by the name of an output directory (here checkv_output_folder):

    $ checkv end_to_end 3_P_contigs_5k.phages_combined.fna checkv_output_folder

3. Taxonomic assignment of phages can be challenging due to the lack of universally conserved gene features, such as a 16S rRNA gene. To overcome this, a common approach is to compare unknown phages to a reference database and determine taxonomy according to the presence of multiple shared protein groups (see Note 15). Here, you will use a pre-compiled database of bacterial phage and archaeal virus proteins from the GenBank and RefSeq NCBI databases. A custom Python script (PhageTaxonomyTool, PTT.py, see Note 16) is used to query the input phage proteins against the database and hierarchically select the most likely database hit. The resulting file ending in “virustaxonomy.tsv” will contain predictions per input contig.

    $ PTT.py -i 3_P_contigs_5k.phages_combined.faa

4. VIBRANT generates protein functional annotations for all three databases (KEGG, Pfam, and VOG) and a best-hit annotation (see Note 17). Furthermore, AMGs from KEGG metabolic pathways annotations are highlighted (see Note 18).

5. The AMGs highlighted by VIBRANT can be visualized in various ways. Here we will discuss a heatmap that depicts individual AMG proportions per cohort or sample group. For this example, you will compare CCP-, CCP+, and healthy control (HC) samples (https://www.ebi.ac.uk/ena/browser/view/PRJEB42612) from our previously published work [41]. First, we need a data table indicating each AMG and its abundance by proportion in each group (see Note 19). With this data table, the R script below will use the package “ComplexHeatmap” to generate a PDF figure to visualize the proportion of each AMG per group. Note that for R scripts, any line beginning with a “#” will not be computed as these lines are recognized as comments or explanations of the succeeding line of code.


    # import the R package ComplexHeatmap
    library(ComplexHeatmap)
    # read in the AMG proportion data
    in_data […]

[…]

    > 1 + 2
    [1] 3
    > 3 + 5 * 2
    [1] 13

There are also a number of mathematical functions built into R that we can call using the appropriate R command:

    > log(10)
    [1] 2.302585
    > sin(100)
    [1] -0.5063656

The R language also allows for logical comparisons of values:

    > 1 == 1 # a test for equality
    [1] TRUE
    > 10 != 1 # is not equal to
    [1] TRUE
    > 1 < 3 # less than
    [1] TRUE
    > 1 […]
    > 100 > 1000 # greater than
    [1] FALSE

Manipulating and Basic Analysis of Tabular Metagenomics Datasets Using R


In the above examples, we’ve also included code comments; these are annotations of the code, consisting of everything after a # symbol on a line. Comments are not executed by R, so they allow us to provide additional context and aid the readability of our code. R also allows for the assignment of variables that can represent data throughout programs and scripts. Another key feature of the R language is vectorization; data in R is typically held in a vector, which in R is an ordered set of values of the same data type. Vectors allow R functions to operate over all vector elements without the need for a loop, which makes for less error-prone code.

    > x <- 10
    > print(x)
    [1] 10
    > y <- 1:10
    > print(y)
     [1]  1  2  3  4  5  6  7  8  9 10
    > y * 2
     [1]  2  4  6  8 10 12 14 16 18 20

Using these basic building blocks, it is possible to build sophisticated programs and scripts that automate common tasks. R also supports user-defined functions, allowing users to declare custom functions that make it easier to repeat computational tasks and aid code readability. Collections of user-defined functions can be brought together into R packages, which can be shared for other users to download and use via the Comprehensive R Archive Network (CRAN). To install a third-party R package, we use the install.packages command; this communicates with CRAN or a CRAN mirror to retrieve the package data, which it downloads and installs onto our system. To gain access to the package’s functions, the package must be loaded into the current R session. This is achieved using the library command, and traditionally, packages are loaded at the top of an R script before the first R commands to be run.

    > install.packages('vegan') # this will generate a lot of information about the installation
    > library(vegan) # to load the installed package into the current R session

When navigating functions from base R and external packages, it can often be useful to read the documentation on how to use a given function. Within the R console, it is possible to look up the documentation page for a given function by using ? before the function name or by using the help function. This will open the


Alex Coleman and Martin Callaghan

documentation page for the specified function in a web browser, or in RStudio, it will open the page in the Help pane.

    > ?log
    > help(log)

While the R console is good for testing small snippets of code, for larger programs it is better to organize our code into .R files. These are simple text files that contain all the R commands we want our analysis to perform, which can be executed as an entire file, line by line, rather than typing each line into the R console. Organizing our code into a series of script files helps us demarcate different sections of our analysis, aiding readability and reproducibility.

This brief introduction to the R programming language highlights some of the basic tools and commands on offer. It has also touched on how to install and load third-party packages, some examples of which we will touch on in more detail in subsequent sections. Overall, the R language provides a versatile tool, which can be used to codify analysis workflows through script files that can be easily shared to aid workflow transparency and reproducibility.

3 Reading and Manipulating Tabular Data

The R language includes a number of built-in data structures for representing different forms of data. As described above, R is a vectorized language, and therefore, all data structures are formulations of these basic vector units. For tabular data, the data.frame data structure is often used. The data.frame structure is organized into a series of rows and columns where each column is a vector and so is an ordered set of values that all share the same data type. In R, there are five main data types: double, integer, complex, logical, and character. We can check the type of a given value or variable by using the command typeof.

    > typeof(5.2)
    [1] "double"
    > typeof(5L) # the L symbol forces the value to be an integer
    [1] "integer"
    > typeof(2+1i)
    [1] "complex"
    > typeof(TRUE)
    [1] "logical"
    > typeof("foo")
    [1] "character"


In the data.frame data structure, each column must contain values of the same type; otherwise, R coerces the column to a common type (typically character), and type-specific operations become more complicated. In the following code examples, we will explore a dataset published by Schmidt et al. (2019) (https://elifesciences.org/articles/42693/) that looks at microbe populations along the gastrointestinal tract. A full outline of code snippets used in this demonstration can be found on GitHub.

3.1 Base R

For metagenomics data, the tabular data that we need to analyze through R is often the abundance matrix data generated by tools like QIIME and mothur from high-throughput amplicon data. This data will often be organized as a matrix of samples and operational taxonomic units (OTUs), which may be saved as a comma- or tab-separated values file (.csv/.tsv/.txt). To read these data files into R as a data.frame, we can use base R functions.

    # we would load this data by using the read.csv function
    > read.csv(file = "data/elife-data.csv", header = TRUE, stringsAsFactors = FALSE)

In this code snippet, we’re reading in the .csv data file and specifying some important arguments, separated by commas, to the read.csv function. The first argument we use is file, where we specify a character string giving the path to our data file. Here the file sits in a data directory inside the working directory of our R session, so a short relative path is enough; if it were somewhere else, we would need to include either the relative or the full path to the file so that R knows which file to read the data from. Next, we specify the header argument, which accepts a logical value to confirm whether the first row of the file holds the column headers. Finally, we also use the stringsAsFactors argument, which determines how R treats columns that contain character values. If this is set to FALSE, then the column type will be character, but if this is set to TRUE, R converts the strings into factors, which store each distinct string as a category represented internally by an integer code.

For other tabular data file formats, base R includes the function read.table. This is actually the parent function of read.csv and shares many of the arguments described above, but its sep argument defaults to whitespace, so it usually needs to be set explicitly. The sep argument specifies the separator that is used in the file to delimit the values within each row (in read.csv the sep argument defaults to ‘,’ because values are comma separated). This can be useful when reading in data that is separated by different characters such as tabs or other symbols.
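Before choosing a value for sep, it can help to check from the command line which separator a file actually uses. The sketch below is a rough heuristic, not part of R: it counts tabs and commas in the first line of a file and reports whichever is more frequent (ties are reported as comma). The function name guess_sep and the two small demo files are made up for illustration, and quoted fields that contain commas could mislead it.

```shell
# guess_sep FILE: print "tab" or "comma", whichever occurs more often in the first line
guess_sep() {
  first=$(head -n 1 "$1")
  tabs=$(( $(printf '%s' "$first" | tr -cd '\t' | wc -c) ))
  commas=$(( $(printf '%s' "$first" | tr -cd ',' | wc -c) ))
  if [ "$tabs" -gt "$commas" ]; then echo tab; else echo comma; fi
}

# Two tiny made-up files so the sketch runs without real data.
demo_csv=$(mktemp)
demo_tsv=$(mktemp)
printf 'species,sample_1,sample_2\nA,0.1,0.2\n' > "$demo_csv"
printf 'species\tsample_1\tsample_2\nA\t0.1\t0.2\n' > "$demo_tsv"

guess_sep "$demo_csv"   # prints comma
guess_sep "$demo_tsv"   # prints tab
```

A result of tab suggests calling read.table with sep = "\t", while comma suggests read.csv or read.table with sep = ",".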

    # the read.table function
    > read.table(file = "data/elife-data.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)

After successfully executing either of these two functions, your data will be loaded as a data.frame object. By default, just calling functions like this will print the data to the console. To ensure the data.frame object persists during the R session, so that we can manipulate it further, we should assign it to a variable.

    > data <- read.csv(file = "data/elife-data.csv", header = TRUE, stringsAsFactors = FALSE)
    > dim(data)
    [1] 311 937

    # access data by indices
    # select out data in the first row
    > data[1,]
      X                              ...1 SAMEA2737770 CT1_OR SAMEA2737863 CT10_OR SAMEA2737681 SAMEA2738135 SAMEA2737855
    1 1 [Clostridium] scindens ATCC 35704  1.52504e-05      0 2.543906e-05       0 4.623989e-05            0 1.608987e-05 ...
    # select out data in the third column
    > data[,3]
    [1] 1.525040e-05 3.058298e-05 2.440898e-03 1.210354e-02 1.237325e-02 ...
    # get value at first row, third column
    > data[1,3]
    [1] 1.52504e-05

The square brackets operator is actually a function that retrieves elements from vectors, matrices, or data frames in R depending on the value passed to it. In the above examples, we specify the indices of either the row or the column, with the comma separating the two dimensions (rows before the comma, columns after it). It is also possible to index dataframes using character vectors corresponding to column names or using a vector of boolean values. This makes it possible to retrieve subsets of the larger dataframe, making it easier to perform specific transformations or analyses on specific columns.

    # index column by name
    > data$SAMEA2737770
    # index a column by name using a concatenated vector (preferred)
    > data[,c('SAMEA2737770')]

Now that we’ve touched on the basics of interacting with our dataframe, we should start to think about tidying it up and some other potential operations to make subsequent analysis easier. In the first instance, you may notice that the first two columns of this dataset are a little strange. We can use the colnames function to return a character vector of all the column names; we can then use square bracket notation to index the first 10.

    > colnames(data)[1:10]
     [1] "X"            "...1"         "SAMEA2737770" "CT1_OR"       "SAMEA2737863" "CT10_OR"      "SAMEA2737681"
     [8] "SAMEA2738135" "SAMEA2737855" "SAMEA2737825"

Here we can see columns have specific sample identifiers relating to the 935 samples used in this study, but the first two columns are named “X” and “...1”. You can inspect these columns using some of the indexing methods shown above to see whether they correspond to a duplicate column of row numbers and the species names of identified microbial populations. These generic names have been assigned because the column names in the .csv are empty, and we can use some simple R to drop the duplicated row numbers column “X” and rename the “...1” column to species.

    # drop X column
    > data <- data[, -1]
    # rename the "...1" column to species
    > colnames(data)[1] <- "species"

[…]

    # 1. set rownames to species column values
    > row.names(data) <- data$species
    # 2. use subset to remove the species column
    > data <- subset(data, select = -species)
    # 3. transpose the dataframe
    > data <- as.data.frame(t(data))
    # add rownames as samples column
    > data$samples <- row.names(data)
    # reset rownames to numbers
    > row.names(data) <- NULL
    # indexing out conditional data and showing only two columns
    > data[data[,c("[Clostridium] scindens ATCC 35704")] > 0.001,][,c("samples","[Clostridium] scindens ATCC 35704")]
        samples                   [Clostridium] scindens ATCC 35704
    69  SAMEA2737789              0.001461609
    102 SAMEA2737878              0.006294709
    180 SAMEA2737713              0.009383403
    311 SAMEA2737679              0.001819131
    315 SAMEA2737685              0.014274474
    321 SAMEA2737833              0.004166907
    502 M08_01_V1_stool_metaG_0   0.001320758
    503 M08_01_V2_stool_metaG_28  0.001708478
    505 M08_01_V3_stool_metaG_56  0.001089231
    517 M08_04_V2_stool_metaG_0   0.002550806
    519 M08_04_V3_stool_metaG_28  0.001419051
    560 CCIS36699628ST_4_0        0.001698072
    568 CCIS36797902ST_4_0        0.002883346
    604 CCIS93040568ST_20_0       0.001591678

    # create subset dataframe variable
    > subset.data <- data[data[,c("[Clostridium] scindens ATCC 35704")] > 0.001,]

The subset.data variable can now be used for further analysis without losing the original data variable containing our original data.frame. Another common query might be to subset a dataframe like above but also sort the returned values into ascending or descending order to identify the top or bottom ten rows. This is accomplished using the order function in base R, which takes a data.frame column as an argument and returns a vector of row indices that would put that column in sorted order. The order function uses several default arguments, including decreasing, which defaults to FALSE (ascending order), so it is worth consulting the function’s help page to understand its full behavior. To retrieve the top or bottom items of a vector or data.frame, we can use either the head or tail function. These functions take a vector or data.frame as an argument along with the number of elements or rows to return from the top or bottom.

    > head(data[order(data$`[Clostridium] scindens ATCC 35704`, decreasing = TRUE),c("samples","[Clostridium] scindens ATCC 35704")], 10)
        samples                   [Clostridium] scindens ATCC 35704
    315 SAMEA2737685              0.014274474
    180 SAMEA2737713              0.009383403
    102 SAMEA2737878              0.006294709
    321 SAMEA2737833              0.004166907
    568 CCIS36797902ST_4_0        0.002883346
    517 M08_04_V2_stool_metaG_0   0.002550806
    311 SAMEA2737679              0.001819131
    503 M08_01_V2_stool_metaG_28  0.001708478
    560 CCIS36699628ST_4_0        0.001698072
    604 CCIS93040568ST_20_0       0.001591678

In the above function, backticks have been applied before and after the column name of the data.frame. This has to be done in this instance because the name includes the character “[”, which, as we’ve seen before, is itself the function for subsetting data structures; the name also contains spaces. Backticks aren’t required when the name is given as a character vector, but they must be used when selecting such a column with the dollar symbol.

Base R also includes a number of useful summary statistical functions that can be applied to data.frame structures. For instance, the colMeans function makes it straightforward to compute the mean for every column in a dataset and return a smaller dataset containing those values.

    # calculate column means
    > data.means <- as.data.frame(colMeans(subset(data, select = -samples)))
    > colnames(data.means) <- "colMeans"
    # get top abundance species
    > data.means$species <- rownames(data.means)
    > rownames(data.means) <- NULL
    # get top 10 mean abundances
    > head(
    + data.means[order(-data.means$colMeans)
    + ,c('species','colMeans')]
    + ,10)
        species                                   colMeans
    311 Total classifiable abundance              0.91461408
    257 Prevotella copri DSM 18205                0.12093128
    58  Ruminococcus bromii L2-63                 0.03156450
    65  Faecalibacterium prausnitzii L2-6         0.01923269
    185 Rothia mucilaginosa DY-18                 0.01874681
    39  [Eubacterium] rectale M104/1              0.01793377
    299 Alistipes putredinis DSM 17216            0.01759808
    277 Bacteroides stercoris ATCC 43183          0.01715889
    63  Faecalibacterium cf. prausnitzii KLE1255  0.01599012
    34  [Eubacterium] eligens ATCC 27750          0.01559898

Here we combine a number of steps to finally return the 10 species with the highest mean relative abundance. First, the colMeans function is called on a subset of our data (dropping the samples column because it is a character type for which a mean can’t be computed), and the output of colMeans is cast as a data.frame structure by the as.data.frame function; next, we set the column name of the new data.means variable to colMeans and do a series of steps to create a species column based on row names and reset the row names to a numeric index. Finally, the data.means variable is indexed using the order function from earlier to return the data.frame contents in descending order of the colMeans column, and the head function is used to retrieve the top 10 items from that dataframe.

Overall, base R provides a number of useful tools for reading in and manipulating tabular data sources. These tools exist out of the box after an R installation, reducing extra requirements to get started analyzing your data. However, as we’ll go on to explore, several third-party packages do exist for more sophisticated handling of tabular data in general and also taxonomic data.

3.2 Readr and the Tidyverse

Readr is a third-party R package available through CRAN that is designed with handling “rectangular” datasets in mind. The package is part of the R tidyverse family of packages that all share a similar grammar and design philosophy around data analysis in the R language. Readr defines “rectangular” data as files such as .tsv, .csv, or fixed width files and attempts to process them into a more sophisticated data structure called a tibble. A tibble is a tidyverse improvement on the base R data.frame data structure, which is used as standard throughout tidyverse packages and is the data structure returned by many of the readr package functions.

To load a tabular dataset such as a .csv file with readr, we make use of the readr-specific function read_csv. This looks very similar to the base R function read.csv but has some crucial differences. However, before we can use any commands from readr, we need to install the package and load it into our current R session. This is the same for any third-party packages we might want to use and was described in the previous subsection. If readr has previously been installed, running the install.packages command again is not required and the package can be loaded into the current R session using the library command.

    > install.packages("readr")
    > library(readr)

The read_csv function from readr is a variant of the read_delim function from within the same package. By default, it is slightly more flexible than base R’s read.csv, which typically means we only need to specify the file we want to load, without any additional arguments.

    # here loading an example mtcars.csv file
    > data […]
    > library(readxl)
    > data […]

[…] %>% is the pipe character that aids with this by specifying a sequence of actions that often originate from a data variable.

    > library(dplyr)
    > data %>%
    + select(SAMEA2737770)
    # A tibble: 311 × 1
       SAMEA2737770
              <dbl>
     1    0.0000153
     2    0.0000306
     3    0.00244
     4    0.0121
     5    0.0124
     6    0.000307
     7    0.0000615
     8    0
     9    0.000340
    10    0.00207
    # … with 301 more rows

    > data %>%
    + slice(2)
    # A tibble: 1 × 937
       ...1 ...2          SAMEA2737770 CT1_OR SAMEA2737863 CT10_OR SAMEA2737681 SAMEA2738135 SAMEA2737855 SAMEA2737825 CT12_OR
      <dbl> <chr>                <dbl>  <dbl>        <dbl>   <dbl>        <dbl>        <dbl>        <dbl>        <dbl>   <dbl>
    1     2 [Clostridium…    0.0000306      0            0       0            0            0            0            0       0
    # … with 926 more variables: SAMEA2737705 <dbl>, SAMEA2737707 <dbl>, SAMEA2737709 <dbl>, SAMEA2738151 <dbl>,
    #   SAMEA2737823 <dbl>, CT13_OR <dbl>, SAMEA2737711 <dbl>, SAMEA2738153 <dbl>, SAMEA2737812 <dbl>, CT14_OR <dbl>,
    #   SAMEA2737716 <dbl>, SAMEA2738159 <dbl>, SAMEA2737719 <dbl>, SAMEA2738162 <dbl>, SAMEA2737692 <dbl>,
    #   SAMEA2737720 <dbl>, SAMEA2738163 <dbl>, SAMEA2737844 <dbl>, SAMEA2737840 <dbl>, CT15H_OR <dbl>, SAMEA2737766 <dbl>,
    #   SAMEA2737723 <dbl>, SAMEA2738165 <dbl>, SAMEA2738167 <dbl>, SAMEA2737817 <dbl>, SAMEA2737734 <dbl>,
    #   SAMEA2738170 <dbl>, SAMEA2737805 <dbl>, CT18_OR <dbl>, SAMEA2737729 <dbl>, SAMEA2737835 <dbl>, CT19_OR <dbl>,
    #   SAMEA2737771 <dbl>, SAMEA2737754 <dbl>, SAMEA2737877 <dbl>, SAMEA2737829 <dbl>, SAMEA2737786 <dbl>, …

As in the previous examples in base R, our initial dataset requires some wrangling into a more appropriate shape, which involves dropping the duplicated row numbers column and renaming a column to species. We can accomplish this in dplyr using the select and rename functions. With select, we drop a column by giving the minus symbol followed by the number of the column or its name. The rename function is passed a dataset, followed by the new column name equals the old column name.

    # using select to remove a column
    > data <- data %>%
    + select(-1)

    # we can use rename to rename columns by previous name
    > data <- rename(data, species = "...2")

[…]

    > library(tibble)
    > library(tidyr)
    > data <- data %>%
    + column_to_rownames(var = "species") %>%
    + rownames_to_column() %>%
    + pivot_longer(-rowname) %>%
    + pivot_wider(names_from=rowname, values_from=value)
    > data
    # A tibble: 935 × 312
       name          `[Clostridium] s… `[Clostridium] h… `Dorea longicate… `Dorea formicige… `Coprococcus co… `Lachnospiraceae…
       <chr>         <dbl>             <dbl>             <dbl>             <dbl>             <dbl>            <dbl>
     1 SAMEA2737770  0.0000153         0.0000306         0.00244           0.0121            0.0124           0.000307
     2 CT1_OR        0                 0                 0                 0                 0                0
     3 SAMEA2737863  0.0000254         0                 0.000630          0.00129           0.00119          0.0000169
     4 CT10_OR       0                 0                 0                 0                 0                0
     5 SAMEA2737681  0.0000462         0                 0.0132            0.00505           0.0182           0.0000310
     6 SAMEA2738135  0                 0                 0                 0                 0                0
     7 SAMEA2737855  0.0000161         0                 0.00148           0.00391           0.00812          0.0000486
     8 SAMEA2737825  0.0000299         0                 0.000970          0.00159           0.000180         0.000452
     9 CT12_OR       0                 0                 0                 0                 0                0
    10 SAMEA2737705  0.0000131         0                 0.000132          0.000413          0.00184          0
    # … with 925 more rows, and 305 more variables: Lachnospiraceae bacterium 9_1_43BFAA <dbl>, Clostridium sp. D5 <dbl>,
    #   Lachnospiraceae bacterium 1_4_56FAA <dbl>, Ruminococcus gnavus ATCC 29149 <dbl>, Ruminococcus torques L2-14 <dbl>,
    #   Ruminococcus lactaris ATCC 29176 <dbl>, Ruminococcus torques ATCC 27756 <dbl>, Blautia obeum A2-162 <dbl>,
    #   Blautia obeum ATCC 29174 <dbl>, Ruminococcus sp. 5_1_39BFAA <dbl>, Ruminococcus sp. SR1/5 <dbl>,
    #   Blautia hydrogenotrophica DSM 10507 <dbl>, Blautia hansenii DSM 20583 <dbl>, Clostridium sp. SY8519 <dbl>,
    #   Lachnospiraceae bacterium 3_1_57FAA_CT1 <dbl>, [Clostridium] bolteae ATCC BAA-613 <dbl>,
    #   Clostridiales bacterium 1_7_47FAA <dbl>, [Clostridium] saccharolyticum WM1 <dbl>, …

The above code first converts the column species into row names and converts the row names to columns, creating a base R data.frame; this passes into pivot_longer, which creates a tall tibble data structure where the first column is the row names (species name), the second column is the sample name, and the third column is the value of the relative abundance. Finally, pivot_wider pivots the tall tibble into a wide tibble with the first column row names being converted into a series of columns, the second column into rows, and the third column into the table values. The output of this is reassigned to the data variable, and our transposed dataset is ready for further work.


Now that we’ve reshaped our data into a more useful format, we can also explore how to perform condition-specific filtering and subsetting using tidyverse functions. In the base R section, we explored how to subset using a conditional test, which, under the hood, created a boolean vector that we used to select rows. With dplyr, we again use pipes to chain together a series of commands to first filter the dataset and then select out specific columns.

    # filter by condition and return specific column
    > data %>%
    + filter(`[Clostridium] scindens ATCC 35704` > 0.001) %>%
    + select(name, `[Clostridium] scindens ATCC 35704`)
    # A tibble: 14 × 2
       name                      `[Clostridium] scindens ATCC 35704`
       <chr>                     <dbl>
     1 SAMEA2737789              0.00146
     2 SAMEA2737878              0.00629
     3 SAMEA2737713              0.00938
     4 SAMEA2737679              0.00182
     5 SAMEA2737685              0.0143
     6 SAMEA2737833              0.00417
     7 M08_01_V1_stool_metaG_0   0.00132
     8 M08_01_V2_stool_metaG_28  0.00171
     9 M08_01_V3_stool_metaG_56  0.00109
    10 M08_04_V2_stool_metaG_0   0.00255
    11 M08_04_V3_stool_metaG_28  0.00142
    12 CCIS36699628ST_4_0        0.00170
    13 CCIS36797902ST_4_0        0.00288
    14 CCIS93040568ST_20_0       0.00159

It is also possible to chain together dplyr functions and base R functions. For example, to retrieve the 10 samples with the highest relative abundances for a given bacterial species, we can combine the dplyr functions desc and arrange to sort a column in descending order, use select to keep the columns we want to inspect, and use the base R head function to return the top ten rows.
> data %>%
+ arrange(desc(`[Clostridium] scindens ATCC 35704`)) %>%
+ select(name, `[Clostridium] scindens ATCC 35704`) %>%
+ head(10)
# A tibble: 10 × 2
   name                      `[Clostridium] scindens ATCC 35704`
 1 SAMEA2737685              0.0143
 2 SAMEA2737713              0.00938
 3 SAMEA2737878              0.00629
 4 SAMEA2737833              0.00417
 5 CCIS36797902ST_4_0        0.00288
 6 M08_04_V2_stool_metaG_0   0.00255
 7 SAMEA2737679              0.00182
 8 M08_01_V2_stool_metaG_28  0.00171
 9 CCIS36699628ST_4_0        0.00170
10 CCIS93040568ST_20_0       0.00159

Alex Coleman and Martin Callaghan

With base R, we showed how to quickly compute column means and sort them into descending order. This can be achieved in a number of different ways using combinations of dplyr functions; we’ll use an approach that selects all columns except the name column (as it’s a character type column) and applies the base R mean function across all columns to return a very wide dataframe with one row and many columns. To make this easier to view (and plot if desired), we’ll then use the pivot_longer function to switch columns and rows to create a tall format dataset, which can be sorted into descending order with arrange and desc.
> data %>%
+ select(!name) %>%
+ summarise(across(.fns = mean)) %>%
+ pivot_longer(everything(), names_to = 'species', values_to = 'mean-abd') %>%
+ arrange(desc(`mean-abd`))
# A tibble: 311 × 2
   species                                   `mean-abd`
 1 Total classifiable abundance              0.915
 2 Prevotella copri DSM 18205                0.121
 3 Ruminococcus bromii L2-63                 0.0316
 4 Faecalibacterium prausnitzii L2-6         0.0192
 5 Rothia mucilaginosa DY-18                 0.0187
 6 [Eubacterium] rectale M104/1              0.0179
 7 Alistipes putredinis DSM 17216            0.0176
 8 Bacteroides stercoris ATCC 43183          0.0172
 9 Faecalibacterium cf. prausnitzii KLE1255  0.0160
10 [Eubacterium] eligens ATCC 27750          0.0156
# … with 301 more rows
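Recent dplyr releases (1.1.0 and later) warn when across is called without an explicit column selection, so the same summary may be easier to maintain with the columns named up front. The sketch below assumes dplyr and tidyr are installed and uses a made-up three-sample table `toy`; it also uses the column name `mean_abd` (with an underscore) so that no backticks are needed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table: one row per sample, one column per species
toy <- tibble(
  name = c("S1", "S2", "S3"),
  sp_A = c(0.1, 0.4, 0.1),
  sp_B = c(0.9, 0.6, 0.9)
)

# Average every numeric column, then reshape to a tall table and sort
mean_abd <- toy %>%
  summarise(across(where(is.numeric), mean)) %>%
  pivot_longer(everything(), names_to = "species", values_to = "mean_abd") %>%
  arrange(desc(mean_abd))

mean_abd
```

Selecting with `where(is.numeric)` drops the character column automatically, so the separate `select(!name)` step is not needed in this variant.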

Whether using the tidyverse or base R, the ability to write comprehensive workflows that load, clean, and manipulate a tabular data file is invaluable for ensuring that an analysis is reproducible and can be understood. Performing these data manipulation tasks in R, as opposed to a spreadsheet, also ensures we avoid editing our initial raw data directly, which can often introduce errors and compromise the reproducibility and replicability of findings.

4 Basic Analysis of Tabular Data
In this final section, we will briefly look at using R to perform some basic analysis of tabular datasets. We will look specifically at the vegan package, which provides a useful toolkit of functions for diversity analysis and other ecology-based methods. Many other packages are also available to aid with data analysis and specific metagenomics analyses, as this is an expanding topic of interest within the metagenomics community. Continuing with the dataset from Subheading 3, we’ll install and load the vegan package and briefly explore calculating some alpha-diversity scores and performing some group comparisons.
> install.packages("vegan")
> library(vegan)

Determining the diversity within a sample or group of samples is a common analysis step in metagenomics workflows. This type of diversity is called alpha diversity and can be calculated in a number of different ways [1]. We can calculate diversity indices for these data using the diversity function from vegan. This defaults to the Shannon index but can also be adjusted to use the Simpson or inverse Simpson index [2, 3].
> shannon.data <- data %>%
+ column_to_rownames(var = "name") %>%
+ diversity("shannon")

This particular snippet still uses some tidyverse logic to make it easier to read. It converts the name column, which holds the sample names, into the table’s row names and calculates the Shannon diversity index for every row. This returns a base R named vector with a diversity index for every row. We can also use the estimateR function to estimate the species richness of a sample using a number of different measures, including chao1 [4]. This function requires integer counts rather than relative abundances, so we include a step that rescales the abundance data to integers. It returns the data as a base R matrix with sample names as the columns and a number of rows for the different estimation measures. In the snippet below, a number of steps are combined using tidyverse syntax to retrieve chao1 values in a tall data format, which is useful for plotting data.
> chao1.dat <- data %>%
+ column_to_rownames(var = "name") %>%
+ mutate(. *1e15) %>%
+ estimateR() %>%
+ as.data.frame %>%
+ slice(2) %>%
+ rownames_to_column() %>%
+ pivot_longer(-rowname) %>%
+ rename(samples = "name", chao1.index = "value")
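To see what these two measures compute, they can be written out by hand in base R. This is a minimal sketch for intuition only, not a replacement for the vegan functions: the vector `counts` is made up, the helper names `shannon_by_hand` and `chao1_by_hand` are hypothetical, and the Chao1 helper uses the commonly quoted bias-corrected form S_obs + F1(F1 - 1) / (2(F2 + 1)).

```r
# Hypothetical integer counts for one sample (one entry per species)
counts <- c(10, 5, 1, 1, 2, 0, 1)

# Shannon index: H = -sum(p * log(p)) over the species that are present
shannon_by_hand <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

# Bias-corrected Chao1: observed richness plus a correction based on
# the number of singletons (F1) and doubletons (F2)
chao1_by_hand <- function(x) {
  s_obs <- sum(x > 0)
  f1 <- sum(x == 1)
  f2 <- sum(x == 2)
  s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
}

shannon_by_hand(counts)
chao1_by_hand(counts)
```

Because the Chao1 correction depends on how many species were seen exactly once or twice, it is most informative when computed on raw read counts.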

We can combine these metrics with some base R statistics functions, such as aov and wilcox.test, to perform an analysis of variance and the Wilcoxon nonparametric test. In this example, we’ll perform comparisons on subsets of our data from oral and stool samples. To create the subsetted data, we’ll use some further tidyverse tools to subset our data based on a string match within sample names.
# create subsetted data
> library(stringr)
> chao1.groups <- chao1.dat %>%
+ filter(str_detect(samples, "stool|oral")) %>%
+ mutate(group = case_when(str_detect(samples, 'stool') ~ 'stool',
+ str_detect(samples, 'oral') ~ 'oral'))

Here we use the str_detect function from stringr, another tidyverse package for manipulating strings, to keep only rows that contain the strings “stool” or “oral.” We then use the mutate function to create a new column called “group,” which is either “stool” or “oral” depending on which string str_detect identifies within the samples column.
# summarise results of analysis of variance
> summary(aov(chao1.groups$chao1.index ~ chao1.groups$group))
                    Df Sum Sq Mean Sq F value Pr(>F)
chao1.groups$group   1    890   889.6   2.576   0.11
Residuals          174  60077   345.3
# show results of Wilcoxon test
> wilcox.test(pull(chao1.groups[,c('chao1.index')]) ~ pull(chao1.groups[,c('group')]))

Wilcoxon rank sum test with continuity correction

data: pull(chao1.groups[, c("chao1.index")]) by pull(chao1.groups[, c("group")])
W = 3290.5, p-value = 0.08953
alternative hypothesis: true location shift is not equal to 0


In the above snippet, we use the summary function to retrieve a summary of the aov result when comparing the two subgroups in the subsetted data. The wilcox.test function also returns a summary of the test. In the wilcox.test call, the tidyverse function pull is used to retrieve the subsetted columns as base R vectors. In both calls, the tilde symbol (~) denotes a formula, which is passed to the function.
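Both functions also accept a `data` argument, which lets the formula refer to column names directly. The sketch below illustrates that calling style on simulated values; the data frame `toy_groups` and its numbers are made up and are not the chapter's results.

```r
# Simulated richness values for two hypothetical groups of ten samples
set.seed(1)
toy_groups <- data.frame(
  chao1.index = c(rnorm(10, mean = 50, sd = 5), rnorm(10, mean = 60, sd = 5)),
  group = rep(c("oral", "stool"), each = 10)
)

# The formula reads "chao1.index explained by group"; columns are looked
# up in the data frame passed through the data argument
aov_fit <- aov(chao1.index ~ group, data = toy_groups)
summary(aov_fit)

wilcox_fit <- wilcox.test(chao1.index ~ group, data = toy_groups)
wilcox_fit$p.value
```

Written this way, the formula stays short and the test output is labelled with the column names rather than with the full subsetting expressions.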

5 Summary
Overall, this chapter has highlighted how basic data manipulation and analysis can be performed on metagenomics datasets in R. Interacting with and manipulating tabular datasets is a common feature of metagenomics workflows, and tools such as R offer a text-based mechanism for specifying clear code instructions for how analyses are performed and results produced, which aids the replicability and transparency of analysis. This chapter has highlighted both the base R library and the popular tidyverse approach to data manipulation and wrangling, and has touched on some more specific packages for computational ecology that are useful in a bioinformatics context. It should serve as a stepping stone for further reading about the R ecosystem of third-party packages, in which more advanced R users write and share their own packages, often to help other users with domain-specific problems. A number of these packages exist for metagenomics, such as Metacoder, MegaR, phyloseq, and many more.

References
1. Whittaker RH (1960) Vegetation of the Siskiyou Mountains, Oregon and California. Ecol Monogr 30(3):279–338
2. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423 and 623–656
3. Simpson EH (1949) Measurement of diversity. Nature 163(4148):688
4. Chao A (1984) Nonparametric estimation of the number of classes in a population. Scand J Stat 11(4):265–270

Chapter 19
Metagenomics Data Visualization Using R
Alex Coleman, Anupam Bose, and Suparna Mitra

Abstract
Communicating key findings is a crucial part of the research process. Data visualization is the field of graphically representing data to help communicate key findings. Building on previous chapters around data manipulation using the R programming language, this chapter will explore how to use R to plot data and generate high-quality graphics. It will cover plotting using the base R plotting functionality and introduce the famous ggplot2 package [2] that is widely used for data visualization in R. After this general introduction to data visualization tools, the chapter will explore more specific data visualization techniques for metagenomics data and their use cases using these basic packages.

Key words R programming language, Data visualization, Plotting, ggplot2, Communication, Research outputs, Research visualization

1 Introduction
A key final stage of any form of data analysis is the visualization step, where results are visualized for sharing and publication. Programming languages like R enable us to quickly develop lines of code that output high-quality graphics and can be incorporated into automatable workflows. This helps reduce the amount of time spent tinkering with visualizations and enables researchers to spend more time working through the manipulation and analysis steps of their work, often now with very large amounts of data. Base R includes a plotting library by default that has a robust level of functionality; however, as touched on in the data manipulation chapter (Chap. 18 of this book), one of the major strengths of R is its expansive third-party package ecosystem. This means that for data visualization, there are a number of additional options on top of the base R plotting library. The most famous of these is the ggplot2 library, developed by Hadley Wickham based upon the influential Grammar of Graphics book on data visualization [1, 2]. In this chapter, we will introduce both the base R plotting and ggplot2 libraries for data visualization and expand into examples of more specific visualization tools and techniques for metagenomics data.

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3_19, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

2 Visualization Options Within R
Plotting in R is performed by writing specific functions that are executed in R to render graphical outputs. These functions can be executed directly within an R console or written into a larger script file that can be run in its entirety. In this section, we’ll extend the example R snippets shown in the data manipulation chapter and showcase the basic functions that allow us to plot data, configure the chart format, add axis labels, and perform other common visualization operations. The examples in this section will cover plotting with both the base R library and the popular ggplot2 library [2].

2.1 Base R
In base R, the entry point function for creating plots is the plot function. This function can either take a data structure as its first argument or be passed specific x or y arguments specifying the coordinates to plot. For reading data files, please refer to the section “Reading and manipulating tabular data” in the data manipulation chapter (Chap. 18 of this book). In the following code examples, we will explore a dataset published by Schmidt et al. (2019) (https://elifesciences.org/articles/42693/) that looks at microbe populations along the gastrointestinal tract. An example of the output of the plot function can be seen in Fig. 1.

# equivalent methods of plotting a simple scatter plot
# passing a dataframe object to plot
# we would load this data by using the read.csv function
> read.csv(file = "data/elife-data.csv", header=T, stringsAsFactors=FALSE)
> two.col.subset
> plot(two.col.subset)
# using $ to index out specific columns and specifying x and y arguments
> plot(x = data$`Butyrivibrio proteoclasticus B316`, y = data$`Dorea formicigenerans ATCC 27755`)

Fig. 1 Example output of a simple plot call

Just calling plot like this will either open a new window, separate from the R console, containing the plot or, in RStudio, show the plot in the plotting pane in the bottom right-hand corner. To programmatically save the plot, we have to add some additional function calls before and after our plot call. These specify that we want to write a file to disk and set additional parameters like size and resolution. To export this plot as a .png file, we would use the png function; the pdf and jpeg functions do the same for the .pdf and .jpeg formats, and similar functions exist for some other standard image formats. Within this function, we specify a filename to save the file as, the resolution, the height and width, and the units of height/width. After calling the png function, we then include our plot call, and to save this plot as the specified file, we call the dev.off() function. This closes the graphical device, saving the plot to the .png file we previously specified. To save the above plot call as a .png file, we’d use the following three lines.
> png(filename = "figures/my-figure.png", res = 300,
+ height = 3, width = 6, units = 'in')
> plot(x = data$`Butyrivibrio proteoclasticus B316`, y = data$`Dorea formicigenerans ATCC 27755`)
> dev.off()
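The open-device, plot, close-device pattern can be tried end to end on a small made-up table. The sketch below is an illustration under stated assumptions: the data frame `toy` and its column names are hypothetical, the two-column subset is built with plain base R indexing, and the pdf device is used instead of png only because it needs no extra graphics libraries; the file is written to a temporary location.

```r
# Hypothetical abundance table; column names contain spaces, so
# check.names = FALSE keeps them unchanged
toy <- data.frame(
  "Species one" = c(0.10, 0.25, 0.40),
  "Species two" = c(0.30, 0.20, 0.05),
  "Species three" = c(0.60, 0.55, 0.55),
  check.names = FALSE
)

# Select two columns by name to get a two-column data frame
two.col <- toy[, c("Species one", "Species two")]

# Open a file device, draw the plot into it, then close the device
out_file <- tempfile(fileext = ".pdf")
pdf(file = out_file, height = 3, width = 6)
plot(two.col)
dev.off()

file.exists(out_file)
```

Nothing is written to disk until dev.off() is called, so forgetting that last line is a common reason for empty or missing figure files.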

Calling plot in this way means the plot is rendered with very basic formatting; however, we can use additional arguments within the plot call itself and call additional functions that give us extra tools for configuring the format of plots in R. It is also possible to quickly plot different graph types using a number of other more specific plot functions, such as hist, boxplot, and barplot, which create histograms, boxplots, and bar plots. Certain graphical parameters can be configured using the par function. In Fig. 2, par has been called with the mfrow argument, which sets up a grid on which figures can be drawn by passing a vector specifying the number of rows and columns. After the par function is called, each plot function call adds its graph to the next position in the grid.
> png(filename = "figures/figure2.png", res = 300,
+ height = 10, width = 12, units = 'in')
> par(mfrow=c(2,2))
> plot(two.col.subset)
> barplot(log(colMeans(two.col.subset)))
> boxplot(two.col.subset)
> hist(two.col.subset$`Butyrivibrio proteoclasticus B316`, main = "")
> dev.off()

Fig. 2 Adding multiple graphs within a single plot and showcasing other plot formats

To manipulate the formatting of the plot, we can pass additional arguments to the plot function to configure aspects of the final figure. In Fig. 3, the panel showcases how sequentially adding arguments changes aspects of the same plot. In the first plot, the col argument has been added to set the desired color of the scatter points. Colors can be specified by name, such as “red” in the example, or using a hexadecimal RGB triplet. Next, the symbol used for scatter points was adjusted using the pch parameter by passing an integer between 0 and 25 (all symbols can be viewed with the ?points help entry). By default, the plot function takes the two column names as the x and y axis labels; these can be set separately using the xlab and ylab arguments, and a main title for the figure can be set with the main argument. In the final chart, the plot has been adjusted so that the axes meet at the 0,0 coordinates; this is achieved by setting the axis interval style arguments xaxs and yaxs to “i,” which removes the default padding added around the data range. There are also many other parameters that can be configured with base R plots, and more details about these can be found on the help page for the par function, which can be opened with ?par.
> par(mfrow=c(2,2))
> plot(two.col.subset, col = 'red')
> plot(two.col.subset, col = 'red', pch = 8)
> plot(two.col.subset, col = 'red', pch = 8, main = "A scatter plot",
+ ylab = "D. formicigenerans ATCC 27755", xlab = "B. proteoclasticus B316")
> plot(two.col.subset, col = 'red', pch = 8, main = "A scatter plot",
+ ylab = "D. formicigenerans ATCC 27755", xlab = "B. proteoclasticus B316",
+ xaxs='i', yaxs='i')

Fig. 3 Adjusting formatting parameters in a basic scatterplot

The simplicity of the native base R plot function makes it easy to use and to incorporate within an R script analysis workflow to produce graphical outputs that are formatted as desired and at a publication-ready resolution. It also doesn’t require any extra packages, meaning that you can get to work producing data visualizations immediately when using R. Next, we’ll explore using the ggplot2 package [2, 3], which introduces a different but powerful framework for building data visualizations that can also be customized.

2.2 ggplot2

The ggplot2 package offers the ability to declaratively create graphics in R based on the principles of the Grammar of Graphics. It works on the principle of starting with some data, specifying an aesthetic mapping (such as what your X and Y variables are), and then adding layers to your plot corresponding to the way you would like to visualize your data. The easiest way to get ggplot2 is to install the whole tidyverse package using install.packages("tidyverse") [4]. Alternatively, install just ggplot2: install.packages("ggplot2").

Fig. 4 A basic scatterplot using ggplot2

All plots created with ggplot2 begin with the ggplot function. Within a ggplot call, we set the data argument to the data source to be used, such as a data.frame or tibble data structure; next, we set the mapping argument, which is typically an aesthetic mapping created with the aes function that describes how features of the data are mapped onto visual properties of the figure. For the simple scatter plot shown in Fig. 4, we specify the aesthetic mapping for the x and y axis to correspond to two columns in our data source. Next, we close the ggplot call and start to add layers to our plot; at this stage, the plot contains all the information about what the data is and what data is to be mapped onto the graph, but it doesn’t know what geometry to use to draw the data on the graph. To add a geometry, we add a layer to our ggplot call by using the + symbol followed by a geom_* function that corresponds to the geometry we want to use. In the scatter plot example, we use geom_point to plot the data as points. These steps are the basics of building plots with ggplot2; each plot requires some data, a mapping, and a specified geometric representation. This system is based on adding layers to the plot, which makes it possible to add multiple geoms to a single plot to describe additional facets of your data.
> library(ggplot2)
> ggplot(data = small.dat, mapping = aes(x = `Butyrivibrio proteoclasticus B316`,
+ y = `Dorea formicigenerans ATCC 27755`)) +
+ geom_point()

By default, ggplot function calls output to a new window or the plots pane in RStudio; it is also possible to save plots programmatically, similar to saving base R plots, although this is done through a single function called ggsave. The ggsave function uses arguments to specify key features of the final saved image such as file type, width, height, and resolution. We also need to pass the plot we wish to save as an argument to ggsave; therefore, standard practice is to assign the output of a ggplot function call to a variable, which can then be used in the ggsave function.
# assign the plot to a variable
> gplot1
> ggsave('figure/gfigure1.png', plot = gplot1, dpi = 300,
+ height = 3, width = 6, units = 'in')
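The assign-then-save pattern can be sketched in full on a small made-up table. In this illustration the data frame `toy`, the variable `p`, and the output location are assumptions; the file is written to a temporary path so that nothing in a project folder is touched.

```r
library(ggplot2)

# Hypothetical two-column table standing in for two species columns
toy <- data.frame(
  x = c(0.1, 0.2, 0.4, 0.8),
  y = c(0.3, 0.1, 0.5, 0.6)
)

# Assign the plot to a variable instead of printing it
p <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point()

# Pass that variable to ggsave; the file type follows the extension
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = p, dpi = 300, height = 3, width = 6, units = "in")

file.exists(out_file)
```

Passing the plot explicitly through the `plot` argument avoids relying on ggsave's default of saving whichever plot was displayed last.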

Fig. 5 Using ggplot2 and gridExtra package to create a multipanel plot

In the base R section, we showed a number of plots that included multiple figures side-by-side. This is not something that ggplot2 supports out of the box, but the behavior can be replicated by combining ggplot2 with other packages such as gridExtra. An example of combining ggplot2 and gridExtra is shown in Fig. 5, and the code to produce it is shown in the snippet below. This example also showcases some of the additional geoms that can be used to plot data in different formats. The geom_col call is used to create a bar plot that uses the data values as the height of the bars, whereas geom_bar plots bars corresponding to the counts of values. The geom_boxplot and geom_histogram geoms represent the data as boxplots and histograms. In general, ggplot2 prefers data in a tall format, with a single column containing data values and other columns providing context to the data value, rather than a wide format, where each data variable is in a separate column. Therefore, to produce different plot types such as a bar plot, histogram, and boxplot, the data must be reshaped into a tall format using the pivot_longer function.
> ## multi panel plot
> library(gridExtra)
> plot1
> small.mean.dat %
+ select(`Butyrivibrio proteoclasticus B316`, `Dorea formicigenerans ATCC 27755`) %>%
+ summarise(across(.fns = mean)) %>%
+ log %>%
+ pivot_longer(everything(), names_to = 'species', values_to = 'log-mean-abd')
> plot2
# convert wide data to tall format, ggplot2 prefers tall format data
> tall.data %
+ pivot_longer(!name, names_to = 'species', values_to = 'abund')
> plot3 <- ggplot(data = tall.data %>% filter(str_detect(species, c("Butyrivibrio proteoclasticus B316|Dorea formicigenerans ATCC 27755"))),
+ mapping = aes(x = species, y = abund)) +
+ geom_boxplot()
> plot4 <- ggplot(data = tall.data %>% filter(str_detect(species, "Butyrivibrio proteoclasticus B316")),
+ mapping = aes(x = abund)) +
+ geom_histogram()
> grid.plot
> ggsave('figures/gfigure2.png', plot = grid.plot, dpi = 300,
+ height = 10, width = 12, units = 'in')

Fig. 6 Adjusting formatting of points in a ggplot2 plot

Similar to base R, ggplot2 produces plots with some basic formatting defaults that are readily customizable. Through the ggplot2 framework, we can apply formatting to different layers of our plot and also add layers that control aspects of formatting. Adding color to a plot is straightforward within ggplot2; you can either do it at the geom level (as shown in the snippet below) or include an aes mapping for color. Figure 6 and the associated code snippet show how to adjust plots to add color and change symbols in a scatter plot. Using the aes color mapping is very powerful and allows us to represent an entirely different category of data using color. This is useful for data with both values and additional categorical information, such as GDP, population, and continental data. The symbols used on the plot are changed with the shape argument within geom_point, which behaves similarly to pch in the base R plot function. The shape can also be used to represent different levels of categorical data, with ggplot using a generic mapping of available symbols to the different levels of data.
> plotf1
> # change shape
> plotf2
> grid.plotf1
> ggsave('figures/gfigure3.png', plot = grid.plotf1, dpi = 300,
+ height = 10, width = 12, units = 'in')
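The difference between setting color and shape at the geom level and mapping them through aes can be sketched on a small made-up table. The data frame `toy`, its `site` column, and the plot names `p_fixed` and `p_mapped` are assumptions introduced only for this illustration.

```r
library(ggplot2)

# Hypothetical table with one categorical column
toy <- data.frame(
  x = c(0.1, 0.2, 0.4, 0.8),
  y = c(0.3, 0.1, 0.5, 0.6),
  site = c("oral", "oral", "stool", "stool")
)

# Fixed colour and symbol: set as arguments of the geom, outside aes()
p_fixed <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = "red", shape = 8)

# Colour and symbol driven by the data: mapped to a column inside aes(),
# which also adds a legend for the categories
p_mapped <- ggplot(toy, aes(x = x, y = y, color = site, shape = site)) +
  geom_point()
```

A value placed inside aes is treated as data to be mapped onto a scale, whereas a value given directly to the geom is applied as-is to every point.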

Fig. 7 Making points transparent and adding labels to a ggplot2 plot

It is also possible to control the transparency of points on a ggplot, which is useful when multiple data points overlap, as it gives a better visual impression of where high-density clusters exist (Fig. 7). This is adjusted within the geom_point call by setting the alpha argument, which takes a value between 1 (0% transparent) and 0 (100% transparent). Plot labels can be adjusted by adding a new layer to the ggplot call, using the + symbol and calling the labs function. The labs function allows you to set custom labels relating to the plot, such as its title and its x and y axis labels. Custom labels are specified as arguments to the function, which also supports additional text features such as a subtitle and alt text. This gives you the power to customize the annotations around your plots to ensure they are as accessible as possible.
> plotf3
> # adding labels
> plotf4
> grid.plotf2
> ggsave('figures/gfigure4.png', plot = grid.plotf2, dpi = 300,
+ height = 10, width = 12, units = 'in')
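A minimal sketch of transparency and labels together is given below. The data frame `toy` is simulated, the label texts are placeholders, and the `alt` argument of labs assumes a reasonably recent ggplot2 release.

```r
library(ggplot2)

# Simulated overlapping points (hypothetical values)
set.seed(7)
toy <- data.frame(x = rnorm(200), y = rnorm(200))

p_labelled <- ggplot(toy, aes(x = x, y = y)) +
  # alpha below 1 makes dense regions appear darker than sparse ones
  geom_point(alpha = 0.3) +
  # labs adds the title, subtitle, axis labels, and alt text as one layer
  labs(
    title = "A scatter plot",
    subtitle = "Simulated values",
    x = "Species one",
    y = "Species two",
    alt = "Scatter plot of two simulated abundance variables"
  )
```

Since labs is an ordinary layer, it can be added to a plot that was created and stored earlier, which keeps the labelling separate from the plotting logic.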

Fig. 8 Setting the origin of the chart to 0,0 and using ggplot2-based themes

Similar to base R plotting, ggplot2 does not default to setting the origin of the plot to 0,0 but adds a small buffer space between the axis line and the edge of the plot. This can be overridden by adding a layer to the plot that sets the scale of the x and y axis. The scale_x_continuous and scale_y_continuous functions are the default functions that set the scale for continuous variables along the x and y axis; setting their expand argument to the vector c(0, 0) removes this buffer space and shows how we can manipulate the axis scales if needed. Finally, ggplot2 also allows for highly specific customization of chart elements through the theme layer, which provides a single function for adjusting the graphical presentation of all non-data elements of a chart. This allows a high level of customization and styling for graphics and the generation of style templates that can be shared. Examples of these configuration options are shown in Fig. 8.
> plotf5
> plotf6
> grid.plotf3
> ggsave('figures/gfigure5.png', plot = grid.plotf3, dpi = 300,
+ height = 10, width = 12, units = 'in')
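The scale and theme layers described above can be sketched as follows. The data frame `toy`, the axis limits, and the particular theme settings are illustrative assumptions rather than the settings used for Fig. 8.

```r
library(ggplot2)

# Hypothetical values between 0 and 1
toy <- data.frame(
  x = c(0.1, 0.2, 0.4, 0.8),
  y = c(0.3, 0.1, 0.5, 0.6)
)

p_origin <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  # expand = c(0, 0) removes the buffer, so the axes meet at the lower limits
  scale_x_continuous(expand = c(0, 0), limits = c(0, 1)) +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 1))

# A complete built-in theme, followed by a tweak to one non-data element
p_themed <- p_origin +
  theme_classic() +
  theme(axis.title = element_text(face = "bold"))
```

Applying a complete theme first and individual theme adjustments afterwards matters, because a complete theme added later would overwrite the earlier tweaks.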

3 Common Comparative Visualization for Metagenomic Data

3.1 Alpha (α) Diversity and Beta (β) Diversity

After obtaining taxonomic or functional assignments for metagenomic data, researchers often want to plot them to compare samples. Diversity analyses come first. Diversity analysis tells us how many taxa/species/OTUs (depending on the data type) are in a sample and how similar multiple samples are.

Alpha diversity is the diversity present in a specific sample, and it can be described in two parts: “Species richness” is the number of different species in a sample, and “Species diversity” tells us how evenly the microbes are distributed in a sample. New researchers are often unsure which diversity index to use to present their data. The Simpson index is considered more of a dominance index, as it accounts for the number of species present and the relative abundance of each species, whereas the Shannon–Wiener index is based on the randomness present at a site and considers both species richness and equitability of distribution in a sample. The Shannon evenness index (Shannon’s equitability index) is a pure diversity index, independent of species richness. It measures how evenly the microbes are distributed in a sample without considering the number of species.

Beta diversity shows the differences between samples when they are compared. The main focus is on investigating the difference in taxonomic abundance profiles from different samples. There are multiple distance measures (e.g., Bray-Curtis, Jaccard, and Kulczynski) that can capture differences between multiple samples.

Vegan: Community Ecology Package in R
One of the oldest and most reliable packages for diversity analyses in R is vegan [5]. The vegan package has two major components: multivariate analysis (mainly ordination) and methods for diversity analysis of ecological communities. Please refer to the author’s manual for full details of the package [5]. Here we provide only a few examples of visual diversity plots using vegan.
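To make the index definitions above concrete, they can be written out by hand in base R before turning to the packaged functions. This is a sketch for intuition only: the matrix `abund` is made up, the helper names are hypothetical, the Simpson index is taken in its 1 - sum(p^2) form, and the evenness index is computed as H / log(S).

```r
# Hypothetical counts: two samples (rows) by four taxa (columns)
abund <- rbind(
  sample1 = c(10, 10, 10, 10),  # perfectly even community
  sample2 = c(37, 1, 1, 1)      # community dominated by one taxon
)

# Simpson index in the form 1 - sum(p^2)
simpson <- function(x) {
  p <- x / sum(x)
  1 - sum(p^2)
}

# Shannon index, and Shannon evenness as H divided by log(richness)
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}
evenness <- function(x) shannon(x) / log(sum(x > 0))

# Bray-Curtis dissimilarity between two samples
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

apply(abund, 1, simpson)
apply(abund, 1, evenness)
bray_curtis(abund["sample1", ], abund["sample2", ])
```

Both toy samples contain four taxa, so richness is identical; the Simpson and evenness values differ only because of how the counts are distributed.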

> library(vegan)
Loading required package: permute
Loading required package: lattice
This is vegan 2.5-7
> ?diversity
# from this we can see the details of the diversity options that vegan has.
# Usual usage of the command is:
diversity(x, index = "simpson", MARGIN = 1, base = exp(1))
# index can be any of the diversity indices, one of "shannon", "simpson" or "invsimpson".
fisher.alpha(x, MARGIN = 1, ...)
specnumber(x, groups, MARGIN = 1)

Here is one example plot in which we compare the effects of two drugs, X and Y. You can find this anonymized drug data in the GitHub repository. We can obtain taxonomic profile data from any metagenomic analysis tool as a CSV or Excel file. To read data in Excel format, we can use the package “xlsx” [6].
dff temp2
p %>%
  select(-c(1)) %>%
  select(-c(3)) %>%
  filter(p$Group == 'X' & p$Timepoint == 'D5') %>%
  select(-c(1:2)) %>%
  mutate_if(is.character, as.numeric) %>%
  colMeans() %>%
  data.frame() %>%
  rownames_to_column("Dataset") -> temp5
names(temp5)[2] <- "D5"
temp5 %>%
  mutate(D5=round(D5,digits = 2)) -> temp5

Now, we will join these data using the “inner_join()” function, matching rows based on the “Dataset” column, which holds the bacteria names.
inner_join(temp1,temp2,by=c('Dataset')) %>%
  inner_join(temp5,by=c('Dataset')) -> InitialDays
InitialDays[is.na(InitialDays)]
head(InitialDays)
             Dataset      D1      D2     D5
1     Bacteroidaceae 3787.00 3500.44 4134.5
2  Dysgonomonadaceae    0.00    0.44    0.0
3   Odoribacteraceae   77.33  171.22   88.2
4 Porphyromonadaceae    6.89    0.67    0.0
5     Prevotellaceae   19.56    0.00    0.9
6      Rikenellaceae 3815.11 4295.89  996.3

Now, we want to average these days to get an early-days profile (example Fig. 20) so that we can compare it with “midtreatment” or “end of study.” This step is completely optional, and whether it is needed depends on the particular case. This example is created only for demonstration purposes. Users might need to compare only specific days. In that case, please skip this part of the code.
#average
InitialDays %>%
  mutate(Values = round((D1+D2+D5)/3,digits = 2)) -> xInitialDays
> dim(InitialDays)
[1] 76 4

From the dim() function, we can see that the “InitialDays” data has 76 bacteria in the list. Many of them appear in tiny proportions in the whole data. In such cases, common practice is to display the first few important bacteria and combine all the rest as “others.” Here we will display the first 10 bacteria and sum the proportions of all the rest (i.e., the 11th–76th bacteria) as “others.”
#top 10 and 'others': percentage
xInitialDays %>%
  select(Dataset,Values) %>%
  filter(Values!=0) %>%
  mutate(Values=round(as.numeric(Values),digits=2)) %>%
  mutate(per=round(Values/sum(Values)*100,digits = 2)) %>%
  arrange(desc(per)) -> xInitialDays
xInitialDays %
  select(-c(1)) %>%
  select(-c(3)) %>%
  filter(p$Group == 'X' & p$Timepoint == 'D7') %>%
  select(-c(1:2)) %>%
  mutate_if(is.character, as.numeric) %>%
  colMeans() %>%
  data.frame() %>%
  rownames_to_column("Dataset") -> temp7
names(temp7)[2] <- "D7"
temp7 %>%
  mutate(D7=round(D7,digits = 2)) -> temp7
p %>%
  select(-c(1)) %>%
  select(-c(3)) %>%
  filter(p$Group == 'X' & p$Timepoint == 'D10') %>%
  select(-c(1:2)) %>%
  mutate_if(is.character, as.numeric) %>%
  colMeans() %>%
  data.frame() %>%
  rownames_to_column("Dataset") -> temp10
names(temp10)[2] <- "D10"
temp10 %>%
  mutate(D10=round(D10,digits = 2)) -> temp10
p %>%
  select(-c(1)) %>%
  select(-c(3)) %>%
  filter(p$Group == 'X' & p$Timepoint == 'D15') %>%
  select(-c(1:2)) %>%
  mutate_if(is.character, as.numeric) %>%
  colMeans() %>%
  data.frame() %>%
  rownames_to_column("Dataset") -> temp15
names(temp15)[2] <- "D15"
temp15 %>%
  mutate(D15=round(D15,digits = 2)) -> temp15
inner_join(temp7,temp10,by=c('Dataset')) %>%
  inner_join(temp15,by=c('Dataset')) -> MidDays
MidDays[is.na(MidDays)] %
  mutate(Values = round((D7+D10+D15)/3,digits = 2)) -> xMidDays
xMidDays %>%
  select(Dataset,Values) %>%
  filter(Values!=0) %>%
  mutate(Values=round(as.numeric(Values),digits=2)) %>%
  mutate(per=round(Values/sum(Values)*100,digits = 2)) %>%
  arrange(desc(per)) -> xMidDays
xMidDays ColAssgn } }

xmiddays % ggplot(aes(x='',y=reorder(Values,Values),fill = Dataset))+ geom_bar(width=1,stat="identity")+ scale_fill_manual(breaks = ColAssgn$taxa, values = ColAssgn$cols)+ theme_void()+ theme_classic() + theme(legend.position = "top") + coord_polar("y",start=0) + theme(axis.line = element_blank())+ theme(axis.text = element_blank()) + theme(axis.ticks = element_blank())+ labs(x = NULL, y = NULL, fill = NULL)+ ylab("Mid Treatment") } xmiddays() (figure 21)
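The "top 10 plus others" step described above can be sketched in base R as follows. This is a minimal illustration rather than the chapter's original code: the data frame `props`, its values, and the helper `top_n_with_others` are hypothetical, and only the column names `Dataset`, `Values`, and `per` follow the surrounding code.

```r
# Hypothetical mean proportions for 15 taxa (names and values are made up).
set.seed(1)
props <- data.frame(
  Dataset = paste0("Taxon", 1:15),
  Values  = round(runif(15, min = 0.1, max = 20), 2)
)

# Keep the n largest rows and collapse everything else into one "others" row.
top_n_with_others <- function(df, n = 10) {
  df <- df[order(df$Values, decreasing = TRUE), ]
  top <- df[seq_len(min(n, nrow(df))), ]
  if (nrow(df) > n) {
    rest <- df[(n + 1):nrow(df), ]
    top <- rbind(top, data.frame(Dataset = "others", Values = sum(rest$Values)))
  }
  # Percentage of each row relative to the total, rounded to two digits.
  top$per <- round(top$Values / sum(top$Values) * 100, 2)
  rownames(top) <- NULL
  top
}

collapsed <- top_n_with_others(props, n = 10)
print(collapsed)
```

The resulting table has 11 rows (10 taxa plus "others") and could be passed to a ggplot pie chart in the same way as the data frames above.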

Metagenomics Data Visualization Using R

In a similar way, we can also create the end-of-treatment plot (average of Days 20, 25, and 30; figure not shown) and the three corresponding plots for drug Y. All the code can be found via our GitHub link. To combine the plots for both drugs in one multigraph panel (Fig. 22), we can use the code below:

##Final Plots
group_x

hmp_samples consists of samples as rows and its phenotype information in the columns.

Comprehensive Guideline for Microbiome Analysis Using R

5.1 Application

• Data reading: Loading the data into R is the first step in any analysis. Because taxonomic data are hierarchical, this step can be awkward. Metacoder includes parsing functions for file formats commonly used in metagenomics studies, and these return a taxmap object. The taxmap class is designed to hold any number of tables, vectors, or lists containing taxonomic data and to make data manipulation smoother.
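To make the idea of a hierarchical taxonomy concrete, the following base R sketch splits lineage strings into one column per rank. It is not Metacoder's parser: the table `otu`, its column names, and the `k__`/`p__`/`g__` prefix format are assumptions chosen for illustration.

```r
# Hypothetical OTU table: one lineage string plus counts for two samples.
otu <- data.frame(
  otu_id   = c("OTU_1", "OTU_2", "OTU_3"),
  lineage  = c("k__Bacteria;p__Firmicutes;g__Lactobacillus",
               "k__Bacteria;p__Firmicutes;g__Clostridium",
               "k__Bacteria;p__Bacteroidetes;g__Bacteroides"),
  sample_a = c(10, 0, 25),
  sample_b = c(3, 7, 40),
  stringsAsFactors = FALSE
)

# Split each lineage on ";" and strip the rank prefix (e.g. "p__").
ranks <- strsplit(otu$lineage, ";", fixed = TRUE)
tax_table <- do.call(rbind, lapply(ranks, function(x) sub("^[a-z]__", "", x)))
colnames(tax_table) <- c("kingdom", "phylum", "genus")
rownames(tax_table) <- otu$otu_id
print(tax_table)
```

A taxmap object stores this hierarchy as a tree alongside the count tables, which avoids repeating such string handling in every analysis step.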

Joseph Boctor et al.

• Manipulations of Abundance Matrix
– Removing low-abundance counts: Remember that the abundance matrix has samples as columns and OTUs as rows. Each cell gives the number of times an OTU was observed in a sample, and some cells contain only a handful of observations. Because low-abundance sequences may result from sequencing errors, counts or OTUs with fewer than a chosen number of reads are routinely removed.
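The filtering step above can be sketched in base R. This is an illustration of the idea only, not Metacoder's implementation: the matrix `counts` and the threshold `min_count` are hypothetical.

```r
# Hypothetical abundance matrix: OTUs as rows, samples as columns.
counts <- matrix(c(120, 0, 1, 3,
                   80, 2, 0, 45,
                   0, 1, 1, 60),
                 nrow = 4,
                 dimnames = list(paste0("OTU_", 1:4), c("s1", "s2", "s3")))

min_count <- 5  # assumed threshold; choose it to suit the dataset

# Set counts below the threshold to zero ...
filtered <- counts
filtered[filtered < min_count] <- 0

# ... and drop OTUs that have no observations left.
filtered <- filtered[rowSums(filtered) > 0, , drop = FALSE]
print(filtered)
```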

– Taking into account unbalanced sampling: Because of the limitations of sequencing methods, some samples contain more reads than others. As a result, certain samples may appear more diverse simply because they were sequenced more deeply. Using the function calc_obs_props, Metacoder divides each count by the total number of counts observed in its sample, yielding a proportion.
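The conversion from counts to per-sample proportions can be written in one line of base R. The sketch below uses a hypothetical matrix `counts` and shows only the arithmetic, not the Metacoder function itself.

```r
# Hypothetical abundance matrix: OTUs as rows, samples as columns.
counts <- matrix(c(10, 30, 60,
                   5, 5, 40),
                 nrow = 3,
                 dimnames = list(paste0("OTU_", 1:3), c("s1", "s2")))

# Divide every column (sample) by its total number of counts.
props <- sweep(counts, 2, colSums(counts), "/")
print(props)
```

After this step every column sums to 1, so samples with different sequencing depths become comparable.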

• Calculations
– Obtaining the information per taxon: We now have abundance data for each OTU, but not for each taxon. Using calc_taxon_abund, we can sum the abundance per taxon and append the result to the taxmap object as a new table.
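Summing OTU abundances for every taxon, including higher ranks, can be sketched in base R as below. The objects `otu_counts` and `lineage` are hypothetical, and the approach (expanding each OTU into all of its ancestor taxa and then calling `rowsum`) is an illustration of the idea rather than Metacoder's own method.

```r
# Hypothetical OTU counts and lineages (one lineage per OTU).
otu_counts <- matrix(c(10, 0, 25,
                       3, 7, 40),
                     nrow = 3,
                     dimnames = list(paste0("OTU_", 1:3), c("s1", "s2")))
lineage <- c("Bacteria;Firmicutes;Lactobacillus",
             "Bacteria;Firmicutes;Clostridium",
             "Bacteria;Bacteroidetes;Bacteroides")

# For every OTU, list the taxon it belongs to at each rank
# (e.g. "Bacteria", "Bacteria;Firmicutes", ...).
taxa_per_otu <- lapply(strsplit(lineage, ";", fixed = TRUE), function(x) {
  vapply(seq_along(x), function(i) paste(x[1:i], collapse = ";"), character(1))
})

# Repeat each OTU row once per taxon it belongs to, then sum by taxon.
row_index <- rep(seq_along(taxa_per_otu), lengths(taxa_per_otu))
taxon_abund <- rowsum(otu_counts[row_index, , drop = FALSE],
                      group = unlist(taxa_per_otu))
print(taxon_abund)
```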

Fig. 17 Heat trees. The heat tree’s architecture is made up of nodes, edges, and labels. The element properties (for example, size and color) are qualities that are influenced by the conditions of the data and mapping properties

– Graphical display of the taxonomic data: Now that per-taxon information is available (the tax_abund and tax_occ tables), users can visualize the data with heat trees. Heat trees, also known as taxonomic trees, are graphs in which the size and color of tree elements correspond to a chosen statistic. For example, color can represent the number of OTUs assigned to each taxon across the whole dataset (Fig. 17).

– Alpha Diversity Calculation: Alpha diversity measures the diversity within a sample or group of samples. It can be estimated at any taxonomic level, although it is most commonly calculated at the species or OTU rank. Several scores can be used to express alpha diversity. The simplest is the number of species, but the most widely used scores also take into account how abundant each species is. With Metacoder, the Inverse Simpson Index of the OTUs is computed using the vegan package.

– Comparing the treatment groups: Researchers are usually interested in how different groups of samples compare. We might want to explore which taxa differ between the mouth and the stomach, or between males and females, for example. The compare_groups function makes these comparisons easier. For each taxon, a Wilcoxon Rank Sum test is performed to check whether the median abundances of the samples differ between treatments. With this information we can generate a differential heat tree, which shows which taxa are more abundant in each treatment. A matrix of heat trees comparing several treatments can also be generated with the function heat_tree_matrix (Fig. 18).
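The two calculations described above can be sketched in base R. The inverse Simpson index is computed here from its definition instead of through vegan, and the test is applied to a single hypothetical OTU; `counts`, `group`, and the body sites are made-up illustration data, not the chapter's dataset.

```r
# Hypothetical counts: 4 OTUs (rows) in 6 samples (columns).
counts <- matrix(c(50, 30, 15, 5,
                   40, 40, 10, 10,
                   45, 35, 15, 5,
                   10, 10, 40, 40,
                   5, 15, 45, 35,
                   15, 5, 35, 45),
                 nrow = 4,
                 dimnames = list(paste0("OTU_", 1:4), paste0("s", 1:6)))
group <- c("mouth", "mouth", "mouth", "stomach", "stomach", "stomach")

# Inverse Simpson index: 1 / sum(p_i^2), with p_i the proportion of OTU i.
inverse_simpson <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}
alpha <- apply(counts, 2, inverse_simpson)
print(alpha)

# Wilcoxon rank-sum test on the proportion of one OTU between the two groups.
props <- sweep(counts, 2, colSums(counts), "/")
test <- wilcox.test(props["OTU_1", group == "mouth"],
                    props["OTU_1", group == "stomach"])
print(test$p.value)
```

In a real comparison the test is repeated for every taxon, so the resulting p-values should be corrected for multiple testing before interpretation.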

Fig. 18 Heat tree matrix. To demonstrate pairwise comparisons, a plot of a matrix of heat trees is established. A bigger, labeled tree acts as a key for a matrix of smaller, unlabeled trees

6 MicrobiomeExplorer

Even though DNA sequencing appears to reduce many scientific questions to simply counting the relevant sequences of organisms, sampling, counting, and descriptive statistics all pose subtle yet substantial obstacles. As a result, plenty of microbiome-specific analysis workflows and algorithms have emerged. The microbiomeExplorer package, together with the Shiny R app that comes with it, provides tools and visualizations for analyzing the results of 16S rRNA amplicon sequencing experiments. The analysis can be run directly from the R command line; however, the package's main goal is to make most of these analyses accessible to noncomputational users via the Shiny R interface. By combining multiple analytical methods in one package, it aims to meet the needs of both computational scientists and bench scientists with minimal coding experience. It also includes a set of interactive graphs based on the plotly package that are both robust and well designed. This sets it apart from earlier efforts such as phyloseq and metaviz, which are either command-line only or lack some of the graphical features.

6.1 Application

• Data uploading: MicrobiomeExplorer takes data in a variety of formats.

Fig. 19 MicrobiomeExplorer Shiny app (Load & Filter tab). Under the Load & Filter tab, we can upload the counts matrix, phenotype file, and taxonomy files

– RDATA or RDS files holding MRexperiment-class objects.
– Biological Observation Matrix (BIOM) files, a file format that can be used with applications such as qiime2 or mothur.
– Files with raw counts: A counts file is required and can be delimited (csv, tsv). The data must be formatted so that each sample is a column and each unique feature is a row. The counts data can be linked to a delimited phenotype file, and a feature data (taxonomy information) file must be provided if aggregation to a certain phylogenetic level is required (Fig. 19).

• Preprocessing the Data
– Data quality control: Before beginning an analysis, it is recommended to inspect the results of the sequencing experiment and perform quality control checks. Certain samples may contain fewer features than expected or a low number of reads overall. Several methods are available for filtering out samples that are not suitable for further analysis. QC graphs can be produced that show the number of unique features in each sample as a barplot (Fig. 20a) or as a scatter plot (Fig. 20b) against the number of reads. These can be colored by specific phenotypes stored in the pData slot of the MRexperiment.
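The quantities shown in the QC plots (unique features and reads per sample) are easy to compute directly. The base R sketch below uses a hypothetical counts matrix and phenotype table laid out as described above, with samples as columns and features as rows; it does not use the microbiomeExplorer functions themselves.

```r
# Hypothetical counts matrix: features as rows, samples as columns.
counts <- matrix(c(0, 12, 300, 7,
                   5, 0, 0, 0,
                   40, 3, 150, 9),
                 nrow = 4,
                 dimnames = list(paste0("feature_", 1:4), c("s1", "s2", "s3")))

# Hypothetical phenotype table with one row per sample.
pheno <- data.frame(diet = c("Western", "BK", "Western"),
                    row.names = c("s1", "s2", "s3"))

# Per-sample QC metrics, linked to the phenotype used for coloring.
qc <- data.frame(
  n_features = colSums(counts > 0),  # features observed at least once
  n_reads    = colSums(counts),      # total reads in the sample
  diet       = pheno[colnames(counts), "diet"]
)
print(qc)
```

Sample s2 stands out here with a single feature and five reads, the kind of sample the QC plots are meant to reveal.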

Fig. 20 MicrobiomeExplorer Shiny app (Load & Filter tab). Here we can find the QC plots (scatter plot and bar plot) colored by the diet column in the phenotype file; they display the unique features and reads and their distribution. Also under the Load & Filter tab, we can select the minimum number of features or reads that must be present in a sample; in our tutorial, we set the minimum number of features to 100 and the minimum number of reads to 500 and then hit the filter button

Histograms also display the overall frequency distribution of features and reads.

– Filtering and subsetting: Three sliders in the app can be used to adjust quantitative restrictions on the data. A feature can be required to be present in a certain number of samples, and samples can be required to contain a certain number of features or reads. It is also possible to subset by a particular phenotype. This lets users exclude specific samples or restrict the analysis to a portion of the data.

– Normalization: Normalization allows the user to adjust for differences in library size before analysis. Certain app functionalities are disabled until it has been performed. Normalization is also required for differential abundance testing and is performed silently if the user does not request it. The two methods offered in the package are based on either proportions or cumulative sum scaling (CSS) (Fig. 21).
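Sample filtering and CSS normalization can be sketched in base R as follows. The thresholds match the tutorial values (100 features, 500 reads), but the data are simulated and both helper functions are hypothetical. The CSS function is a simplified version with a fixed quantile `p = 0.5`; the metagenomeSeq package normally chooses the quantile from the data, so results will not match it exactly.

```r
set.seed(42)
# Hypothetical counts: 200 features (rows) x 6 samples (columns).
counts <- matrix(rpois(200 * 6, lambda = 5), nrow = 200,
                 dimnames = list(paste0("feature_", 1:200), paste0("s", 1:6)))
counts[, "s6"] <- 0
counts[1:20, "s6"] <- 1  # s6 has only 20 features and 20 reads

# Keep samples with enough observed features and enough reads.
filter_samples <- function(x, min_features = 100, min_reads = 500) {
  keep <- colSums(x > 0) >= min_features & colSums(x) >= min_reads
  x[, keep, drop = FALSE]
}
filtered <- filter_samples(counts)

# Simplified cumulative sum scaling: scale each sample by the sum of its
# counts at or below the p-th quantile of its non-zero counts.
css_normalize <- function(x, p = 0.5, scale = 1000) {
  factors <- apply(x, 2, function(col) {
    q <- quantile(col[col > 0], probs = p)
    sum(col[col <= q])
  })
  sweep(x, 2, factors / scale, "/")
}
normalized <- css_normalize(filtered)
print(colnames(filtered))
```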

Fig. 21 MicrobiomeExplorer Shiny app (Analysis tab, "DNA" sign). Here we select proportion as the normalization method and aggregate the features to the genus level, then hit the aggregate button

Fig. 22 MicrobiomeExplorer Shiny app (INTRA SAMPLE analysis tab). Here we selected diet as the grouping phenotype for the relative abundance plot and Bacteroides under select features for the feature abundance plot

– Aggregation: Before any analysis can be run, the user must first aggregate the data to a particular feature level. The code in global.R can be used to limit the levels that are available. Once aggregation is done, the analysis sections become available. In addition, by selecting the "Report" option, the user can add an analysis to a report, which will reproduce the visualizations inside the report (Fig. 21).

• Analysis
– Intra-sample analysis: Intra-sample analysis comprises functions for examining the bacterial communities within a sample or a set of samples. The relative abundance of the top features, the abundance of a given feature, and the alpha diversity within the sample can all be displayed. A single set of input elements is used to produce all the visualizations in this section (Fig. 22).

Fig. 23 Feature plot (panel title: "Percentage of Bacteroides"; y-axis: log percentage; x-axis: diet) representing the abundance of a single feature (Bacteroides) across two different diet types (Western and BK)

> Relative abundance: The relative abundance plot shows the most abundant features as a barplot, with a user-defined variable on the x-axis. The user can also facet by phenotypes, change the number of features displayed, toggle between total numbers of reads and normalized values (if normalized), and change the overall plot width. Clicking on a specific feature in the plot opens a feature abundance graph for that feature.

> Feature abundance: The feature abundance graph shows the abundance of a single feature as a boxplot or a categorical scatterplot, grouped by the selected x-axis variable. The user can choose a log2 scale, set the plot width, and choose whether to show individual sample points. Feature plots can be opened by clicking on a feature in the relative abundance plot or by choosing a specific feature in the input section (Fig. 23).

> Alpha diversity: Alpha diversity measures the complexity or diversity within a sample, such as a habitat or location. It is calculated with functions from the vegan package and displayed as a boxplot using the same input settings as the relative abundance and feature plots. The user can color the boxes, and thereby split them, by phenotype and can set the overall plot width. Several diversity measures are available, with Shannon diversity being the default. Users should familiarize themselves with the different measures and be able to distinguish the differences and subtleties in their interpretation. Shannon diversity, in particular, reflects both how many kinds of microorganisms are present in a sample and how evenly they are distributed (Fig. 24).

Fig. 24 Alpha diversity plot (panel title: "Shannon diversity index at genus level"; y-axis: Shannon_Diversity; x-axis: diet, Western and BK) visualizing the distribution of specific microorganisms in a sample. N.B.: MouseID is used to color the boxplot

– Inter-sample analysis: Inter-sample analysis uses feature heatmaps and beta diversity calculations to look for differences between samples or groups of samples. Keep in mind that these functions can be computationally expensive when there are many samples and a low aggregation level is used.

> Beta diversity: Beta diversity measures how communities differ between samples, in contrast to the complexity within a single sample (alpha diversity). A pairwise distance or similarity matrix must be computed before beta diversity can be calculated. The user can choose from a variety of measures available through the vegan package, with Bray–Curtis being the recommended default for microbiome studies. Users should familiarize themselves with the different measures and be aware of the differences and subtleties in their interpretation. The chosen distance matrix is then subjected to principal component analysis, a dimension reduction method, and displayed as a scatter plot. The user can select which principal components to display, add coloring and confidence ellipses based on phenotypes, set the point shape based on phenotypes, and change the point size and overall plot width. The app also includes PERMANOVA (permutational multivariate analysis of variance) from the vegan package. Conceptually, a PERMANOVA analysis lets the user test statistically whether the cluster centers of a dissimilarity or distance matrix differ between groups of samples. The user can also pick a phenotype and a stratum variable, with the results displayed both inside the visualization and in a table below it (Fig. 25).

Fig. 25 MicrobiomeExplorer Shiny app (INTER SAMPLE analysis tab). For the beta diversity analysis, we selected the default bray method, with status as the adonis variable. From the plot options, we colored the plot by diet and selected the Add confidence ellipse option. For the heatmap, top features were sorted by variance, and for the plot options, we selected diet as phenotype annotation, phylum as feature annotation, and turned on the log scaling
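The alpha and beta diversity calculations described above can be sketched in base R from their definitions. The app itself relies on vegan; here the Shannon index and Bray–Curtis dissimilarity are written out by hand on a hypothetical genus-level matrix `counts`. Note the swap in the last step: the ordination uses classical multidimensional scaling (`cmdscale`, i.e. principal coordinates) as a stand-in for the dimension reduction applied in the app, and PERMANOVA is not shown.

```r
# Hypothetical genus-level counts: 4 genera (rows) in 4 samples (columns).
counts <- matrix(c(50, 30, 15, 5,
                   45, 35, 10, 10,
                   10, 10, 40, 40,
                   5, 15, 45, 35),
                 nrow = 4,
                 dimnames = list(paste0("genus_", 1:4), paste0("s", 1:4)))

# Shannon index: H = -sum(p_i * log(p_i)) over features with p_i > 0.
shannon <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}
alpha <- apply(counts, 2, shannon)

# Bray-Curtis dissimilarity between two samples: sum|a - b| / sum(a + b).
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)
n <- ncol(counts)
bc <- matrix(0, n, n, dimnames = list(colnames(counts), colnames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    bc[i, j] <- bray_curtis(counts[, i], counts[, j])
  }
}

# Classical multidimensional scaling of the distance matrix to two axes.
ordination <- cmdscale(as.dist(bc), k = 2)
print(round(bc, 2))
```

In this toy example, samples s1 and s2 are close to each other (dissimilarity 0.1) and far from s3 and s4 (0.6), which is the separation a beta diversity scatter plot would show.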

> Heatmap: The heatmap provides a different perspective on the differences and similarities among the samples in a dataset (Fig. 26). The user can select particular features or view the top 50 features, which can be sorted by variance, Fano factor, or median absolute deviation (MAD). The heatmap is built with heatmaply and rendered with plotly, so the same options for interacting with the plot are available. After the heatmap has been rendered, the user can change the number of features included, disable the log scale, and add annotations to the rows (phenotypes) and columns (higher taxonomy levels) of the heatmap. The heatmap is not recommended for datasets with a very large number of samples (5000+), since it can take a long time to render.

Fig. 26 MicrobiomeExplorer Shiny app (INTER SAMPLE analysis tab, "Heatmap"). For the heatmap, top features were sorted by variance, and from the plot options, we selected diet as phenotype annotation, phylum as feature annotation, and turned on the log scaling

> Correlation: Correlation lets the user examine the relationship between two features, or between one feature and a numerical phenotype, in a scatterplot augmented with a linear regression statistic. In both types of correlation graph, the user can facet and/or color by phenotypes. To help assess the association, the user selects one of three methods: Spearman (the default), Pearson, or Kendall (Fig. 27).

> Differential Abundance: Differential abundance (DA) analysis tests the null hypothesis that the means or mean ranks of a given feature are the same across groups. It can therefore detect differences in feature abundance between two or more levels of a phenotype (Fig. 28). The app offers four methods: DESeq2, Kruskal–Wallis, limma, and a zero-inflated log normal model. DESeq2 and limma are widely used tools for comparing microarray and RNA-sequencing data and can be applied to microbiome data in the same way. Kruskal–Wallis is a nonparametric test that looks for differences in distribution between groups. To account for zero-inflation in microbiome data, the metagenomeSeq package provides a zero-inflated log normal model. DESeq2 is often used with sample sizes of less than 25.
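The statistics behind these three views can be sketched with base R functions. The example below is a hedged illustration on simulated data: `abund`, `diet`, and the third diet level "other" are hypothetical. It covers ranking features for the heatmap, a Spearman correlation, and a Kruskal–Wallis test. DESeq2, limma, and the zero-inflated log normal model need their own packages and are not shown.

```r
set.seed(7)
# Hypothetical normalized abundances: 5 features (rows) x 12 samples (columns).
abund <- matrix(rgamma(5 * 12, shape = 2, rate = 0.5), nrow = 5,
                dimnames = list(paste0("feature_", 1:5), paste0("s", 1:12)))
diet <- factor(rep(c("Western", "BK", "other"), each = 4))

# Statistics used to pick the top features for a heatmap:
# variance, Fano factor (variance / mean), and median absolute deviation.
feature_stats <- data.frame(
  variance = apply(abund, 1, var),
  fano     = apply(abund, 1, var) / rowMeans(abund),
  mad      = apply(abund, 1, mad)
)
top_by_variance <- rownames(feature_stats)[order(feature_stats$variance,
                                                 decreasing = TRUE)]

# Spearman correlation between two features across samples.
spearman <- cor.test(abund["feature_1", ], abund["feature_2", ],
                     method = "spearman")

# Kruskal-Wallis test: does feature_1 differ between the diet groups?
kw <- kruskal.test(x = abund["feature_1", ], g = diet)

print(top_by_variance)
print(c(rho = unname(spearman$estimate), kw_p = kw$p.value))
```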

Fig. 27 Correlation plot using Spearman correlation (the default setting), with columns faceted by diet and two features selected for correlation

Fig. 28 Differential abundance plot using DESeq2 as the method, diet as the comparison phenotype, BK as comparison level 1, and Western as comparison level 2 (both are diet types)

Fig. 29 Longitudinal analysis plot. Bacteroides was the selected feature, relative time was the longitudinal phenotype, and mouseID was used as the ID phenotype

> Longitudinal analysis: Longitudinal analysis lets users create feature plots with greater control over the data displayed. For a given feature, the user selects a phenotype and the levels of that phenotype to display in the plot. The graph keeps the levels in the order chosen, which allows sorting by, for example, dates or tissues. If desired, the user can summarize by a particular phenotype, and the summaries are then connected by lines across the levels. In the resulting interactive plot, the user can also select and color individual IDs (Fig. 29).
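The two ideas behind the longitudinal plot, a fixed order for the phenotype levels and a per-level summary, can be sketched in base R. The data frame `long`, its time labels, and the abundance values are hypothetical.

```r
# Hypothetical long-format data: one feature measured in three mice
# at three time points (rows are deliberately not in time order).
long <- data.frame(
  mouseID   = rep(c("PM1", "PM2", "PM3"), times = 3),
  time      = rep(c("week8", "week1", "week4"), each = 3),
  abundance = c(30, 34, 32, 10, 12, 14, 20, 25, 21)
)

# Fix the display order of the longitudinal phenotype explicitly.
long$time <- factor(long$time, levels = c("week1", "week4", "week8"))

# Summarize the feature per level (mean across IDs); these are the
# points a plot would connect with a line.
summary_by_time <- aggregate(abundance ~ time, data = long, FUN = mean)
print(summary_by_time)
```

Because `time` is a factor with explicit levels, both the summary table and any plot built from it follow the chosen order rather than the order in which the rows appear.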

Suparna Mitra (ed.), Metagenomic Data Analysis, Methods in Molecular Biology, vol. 2649, https://doi.org/10.1007/978-1-0716-3072-3, © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


N Nanopore sequencing (Oxford Nanopore Technologies)................15–16 NCBI ....................................................27, 36, 38, 48, 57, 59, 62, 64, 65, 99, 115–119, 123, 124, 127, 139, 145, 147, 148, 153, 225, 227, 242, 243, 297, 318, 321, 324, 329, 332, 334 NCBIfam ....................................................................... 297 NCBI-nr ................................................... 4, 57, 108, 111, 112, 114, 125–127, 247, 296 NCBI-SRA .................................................................... 110 NeatMap ........................................................................ 401 NextFlow ........................................................26, 256, 278 Next-generation sequencing (NGS) ....................... v, 2, 5, 7–14, 16, 55, 56, 69, 73, 191, 192, 303–313 NextSeq 500/550.............................................. 10–12, 90 Non-metric multidimensional scaling (NMDS)...................................................... 43, 398 NovaSeq 6000................................................................. 12

O ONT MinION ..................................................... 236, 238 Operational taxonomic units (OTUs) ............... 7, 77–79, 82, 95, 96, 146, 147, 153, 158, 164, 167, 170, 282, 283, 286, 305, 313, 343, 371, 395, 397, 399–403, 411, 413, 416, 418, 421, 422, 424, 426

Ordination .........................................................42, 43, 82, 371, 397–399, 401, 402 Oxford Nanopore Technologies (ONT) ........... 224, 225, 227, 236–238, 242–244, 246, 248

P Pacific BioSciences (PacBio)................. 5, 11, 14–15, 223 Paired-end ..............................................3, 27, 37, 49, 63, 73, 75, 80–81, 86, 92, 95, 137, 144, 145, 148, 150, 153, 161, 166, 254, 325, 411, 413 PATRIC .............................................................................vi Per base N content.......................................................... 33 Per base sequence quality .................................. 31, 32, 73 Pfam ............................................................. 153, 321, 327 PhageTaxonomyTool........................................... 321, 327 Phenotypic classification ................................ vii, 281–287 Phosphate buffer saline (PBS)............304, 306, 307, 311 PhyloPythiaS...................................vi, 135, 148, 149, 171 Phyloseq...................................................... 42, 73, 79–81, 357, 393–405, 416, 427 Pipeline .................................................. vi, 26, 37, 38, 47, 58–60, 63, 69–82, 86, 89, 108, 109, 125, 126, 134, 135, 140, 142–144, 149, 151, 160–164, 169, 178, 179, 236, 238, 256, 261–279, 296, 297, 310, 405, 411, 412, 416 Plotly............................................................ 378, 427, 433 Polony sequencing ............................................. 9, 10, 196 Poly-A enrichment .............................. 210–211, 213–215 Primer .............................................. 5, 56, 59, 61, 63, 70, 73, 85, 86, 95, 96, 163, 240, 411, 413, 416 Principal Components Analysis (PCA) ...................43–45, 378–382, 398 Principal coordinates analysis (PCoA) ....................43–45, 78, 79, 89, 97, 98, 104, 123, 200, 398 Probe design.................................................304–306, 313 Prodigal ......................................................................... 320 Projection ............................................................. 153, 399 Prokka ..................................................124, 237, 246, 253 Python ...............................................vi, 26, 37, 261, 263, 269, 279, 282, 321, 327, 411

Q QIIME2................................................. vi, 88, 90, 91, 94, 103, 263, 269–271, 276, 428 Quality control .................................................v, 3, 21–51, 58, 60, 63, 71, 74–77, 82, 102, 108–111, 145, 161, 163, 166, 170, 175, 178, 322–325, 428 Quality scores ............................................. 31, 32, 59, 60, 73, 79, 80, 88, 91, 92, 145, 253, 412, 415 Quantitative FISH ............................................... 309, 310 QUAST........................................................ 255, 320, 333 QUAST: Quality Assessment Tool for Genome Assemblies.......................................................... 326

R R.........................................vii, 22, 27, 38, 39, 42, 79–82, 103, 159, 210, 218–220, 249, 250, 253, 278, 282, 327, 334, 339–357, 359–436 Racon ...................................................110, 237, 246, 248 Rarefaction curves .....................................................7, 179 RColorBrewer ................................................................ 386 RDP classifier ................................. 94–96, 140, 142, 143 Read ..............................2, 26, 57, 69, 86, 108, 135, 181, 196, 209, 223, 235, 296, 304, 318, 339, 384, 395 Real-time analysis ................................................. 224–232 Redundancy analysis (RDA)......................................... 398 Reference Viral Database (RVDB)............................................ 321, 329, 335 RefSeq....................................................... 8, 10, 151, 199, 202, 318, 321, 327, 329 Reliable classifications ...............................................46–50 Reproducibility........................................... 11, 13, 25–26, 50, 72, 342, 354 ResFinder database............................................... 295, 297 Resistome......................................... vii, 27, 290–296, 298 Ribosomal Database Project (RDP) ........................ vi, 58, 60–63, 74, 86, 94, 96, 135, 140–148, 150, 179, 181–183, 185, 187, 188, 190, 191 RiboTagger.................................................. 305, 312, 313 Richness ................................................2, 6, 7, 16, 78, 81, 147, 179, 201, 312, 317, 355, 371, 404 RNA isolation....................................................... 210–211 RStudio......................................................... 39, 210, 217, 218, 220, 249, 278, 320, 340, 342, 360, 364

S Sample fixation .............................................304, 306–307 SAMtools .............................................................. 237, 243 SARGfam ....................................................................... 297 SARG version 2.0................................................. 296, 297 Secure shell protocol (ssh).................................. 266, 267, 269, 272, 276, 277, 279, 320, 332 SEED .................................................... 60, 115–117, 120, 123, 127, 200, 201, 204 454 sequencing (Roche)........................... 8–10, 196, 202 Shannon..................................................78, 82, 122, 143, 355, 371, 431, 432 Short-read................................................... vi, 7–8, 10–16, 57, 69, 108–110, 112–116, 119, 121, 123, 124, 126–128, 148, 209, 223, 224, 235, 236, 238–246, 249–254, 317–335 Short-read assembly ............................240–241, 250, 351 SIAMCAT........................................................................ 42 SILVA ........................................................... vi, 58–61, 74, 145, 147, 161, 166, 306 Simpson .........................7, 355, 371, 372, 374–376, 426

Single-molecular real-time sequencing (Pacific Biosciences) ......................................14–15 Single-molecule real-time (SMRT) sequencing ..................................... 5, 9, 11, 14, 15 Singularity........................................................... 26, 27, 50 Snakemake ........................................ 26, 38, 71, 256, 278 Snakemake pipeline ......................................................... 38 SNP analysis................................................................... 298 SOLiD (Life Technologies) ............................................ 12 SQLite................................................................... 227, 233 Squiggles............................................................... 210, 224 SRA toolkit ...................................................................... 28 SraX................................................................................ 297 16S rRNA sequencing ................... 4, 282, 304, 313, 405 18S rRNA sequencing ...................................................... 5 Student’s t-test ................................................................ 42 Support vector machines (SVM) ......................................148, 287, 405, 406

T Tabular.................................................. vii, 63, 65, 75, 78, 88, 203, 231, 249, 251, 253, 339–357, 360 Taxa detection ................................................................. 47 Taxonomic ............................ 4, 36, 55, 69, 85, 107, 134, 179, 196, 224, 237, 304, 319, 343, 371, 397 Taxonomic profiling............................................... 70, 310 t-distributed Stochastic Neighbour Embedding (t-SNE) ................................................................ 43 Tenuk ............................................................................. 176 Theme layer ................................................................... 369 Third generation technology....................................14–16 Tidyverse..............................................349–355, 357, 363 Trimming............................ 30, 34–37, 50, 73, 109, 135, 145, 150, 161, 163, 238–240, 310, 322–325, 413 TSV ......................................................................... 88, 428

U Uniform Manifold Approximation and Projection (UMAP) ............................................................... 43 UNIPROT ........................................................... 296, 297 Universally Unique Identifier (UUID) ....................... 275

V Vegan ................................. 355, 371, 375, 426, 431, 432 Vegdist .................................................................. 375, 376 VIBRANT ......................... 319–321, 326, 327, 333, 334 Viral metagenomics.................................................vii, 317 Virome ........................................318, 319, 326, 331, 333 Virtual machine (VM) ...................vi, 262–275, 277, 278 Virulence factors ......................................... 225, 230, 231 Virus........................................................10, 14, 107, 133, 317–319, 321, 326, 327, 329–331, 333, 335 Virus-like particle (VLP) ..............vii, 318, 320, 332, 333 Visualization ........................ vii, 3, 38–42, 50, 51, 59, 71, 79, 91, 94, 119, 148, 153, 157, 161, 167, 168, 170, 171, 179, 200, 203, 204, 210, 339, 340, 359–393, 395, 398, 399, 405, 426, 430, 433 VOG ..................................................................... 321, 327 V2, V3, V4 regions ..................3, 95, 146, 312, 313, 413

W Washing buffer (WB) ........................................... 305, 307 WebMGA................................................ vi, 149, 153, 171 Web tools.............................................................. 169, 199 WHAM!.................................................................... vi, 157 Wilcoxon signed-rank ..................................................... 42

Z Zed Shell (ZSH) ........................................................... 262