Gene Prediction: Methods and Protocols [1st ed.] 978-1-4939-9172-3;978-1-4939-9173-0

This volume introduces software used for gene prediction with focus on eukaryotic genomes. The chapters in this book des


English Pages XI, 284 [286] Year 2019


Table of contents :
Front Matter ....Pages i-xi
tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences (Patricia P. Chan, Todd M. Lowe)....Pages 1-14
Predicting RNA Families in Nucleotide Sequences Using StructRNAfinder (Vinicius Maracaja-Coutinho, Raúl Arias-Carrasco, Helder I. Nakaya, Victor Aliaga-Tobar)....Pages 15-27
Structural and Functional Annotation of Eukaryotic Genomes with GenSAS (Jodi L. Humann, Taein Lee, Stephen Ficklin, Dorrie Main)....Pages 29-51
Practical Guide for Fungal Gene Prediction from Genome Assembly and RNA-Seq Reads by FunGAP (Byoungnam Min, In-Geol Choi)....Pages 53-64
Whole-Genome Annotation with BRAKER (Katharina J. Hoff, Alexandre Lomsadze, Mark Borodovsky, Mario Stanke)....Pages 65-95
EuGene: An Automated Integrative Gene Finder for Eukaryotes and Prokaryotes (Erika Sallet, Jérôme Gouzy, Thomas Schiex)....Pages 97-120
ChemGenome2.1: An Ab Initio Gene Prediction Software (Akhilesh Mishra, Priyanka Siwach, Poonam Singhal, B. Jayaram)....Pages 121-138
Multi-Genome Annotation with AUGUSTUS (Stefanie Nachtweide, Mario Stanke)....Pages 139-160
GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data (Jens Keilwagen, Frank Hartung, Jan Grau)....Pages 161-177
Coding Exon-Structure Aware Realigner (CESAR): Utilizing Genome Alignments for Comparative Gene Annotation (Virag Sharma, Michael Hiller)....Pages 179-191
Predicting Genes in Closely Related Species with Scipio and WebScipio (Martin Kollmar)....Pages 193-206
AnABlast: Re-searching for Protein-Coding Sequences in Genomic Regions (Alejandro Rubio, Carlos S. Casimiro-Soriguer, Pablo Mier, Miguel A. Andrade-Navarro, Andrés Garzón, Juan Jimenez, Antonio J. Pérez-Pulido)....Pages 207-214
Generating Publication-Ready Prokaryotic Genome Annotations with DFAST (Yasuhiro Tanizawa, Takatomo Fujisawa, Masanori Arita, Yasukazu Nakamura)....Pages 215-226
BUSCO: Assessing Genome Assembly and Annotation Completeness (Mathieu Seppey, Mosè Manni, Evgeny M. Zdobnov)....Pages 227-245
Evaluating Genome Assemblies and Gene Models Using gVolante (Osamu Nishimura, Yuichiro Hara, Shigehiro Kuraku)....Pages 247-256
Choosing the Best Gene Predictions with GeneValidator (Ismail Moghul, Anurag Priyam, Yannick Wurm)....Pages 257-267
COGNATE: Comparative Gene Annotation Characterizer (Jeanne Wilbrandt)....Pages 269-281
Back Matter ....Pages 283-284


Methods in Molecular Biology 1962

Martin Kollmar Editor

Gene Prediction Methods and Protocols

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes: http://www.springer.com/series/7651


Edited by

Martin Kollmar Group Systems Biology of Motor Proteins, Department NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Goettingen, Germany


ISSN 1064-3745
ISSN 1940-6029 (electronic)
Methods in Molecular Biology
ISBN 978-1-4939-9172-3
ISBN 978-1-4939-9173-0 (eBook)
https://doi.org/10.1007/978-1-4939-9173-0
Library of Congress Control Number: 2019935814

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Chapter 3 is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapter.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.

Preface

Since the application of new high-throughput sequencing methods (next-generation sequencing) to genome sequencing, the number of sequenced eukaryotic genomes has been increasing ever more rapidly. Several large-scale projects to sequence thousands of species were started years ago, such as the 1k fungi project, the “Genome 10k” project (10,000 vertebrates), the i5k project (5000 insects), and the 959 nematodes project. While genome sequencing moves from record to record, genome annotation is progressing slowly. Despite considerable experimental and bioinformatics efforts, even the annotation of the human genome has not been finished yet. For other species, the quality of the annotations is significantly worse. In addition to the intrinsic incompleteness of genome annotations, annotations are currently available for only a small fraction of all sequenced genomes, and they are usually updated only for the most important model species. For many species, only the initial genome annotations are available, although incorporation of new data (e.g., transcript sequencing) and application of new approaches could considerably improve annotations. Improvements in genome annotations have a direct positive effect on all downstream analyses as well as on diverse applications in medicine, biology, biotechnology, and agriculture.

This volume introduces software for gene prediction with a focus on eukaryotic genomes. The primary audience is researchers and research groups working on the assembly and annotation of single species or small groups of species. Such groups usually do not have access to the advanced and complex annotation pipelines in use at some large-scale sequencing centers, and usually do not have particular expertise in gene prediction software. Also, such groups often focus on a particular biological aspect that can be explained well with just a very preliminary and partially incomplete genome annotation. The protocols described in this volume should enable these groups to considerably improve and complete their genome annotations so that they become useful for a wider research community. Re-annotation of long-available genome assemblies should also be simplified. Available software also contains options and parameters that are often hidden from the novice or occasional user. The chapters explain software and web server usage as applied in typical use cases, written in the spirit of the series, which aims to provide practical guidance and troubleshooting advice.

Goettingen, Germany
Martin Kollmar


Contributors

VICTOR ALIAGA-TOBAR  Facultad de Ciencias Químicas y Farmacéuticas, Advanced Center for Chronic Diseases—ACCDiS, Universidad de Chile, Santiago, Chile; Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
MIGUEL A. ANDRADE-NAVARRO  Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany
RAÚL ARIAS-CARRASCO  Facultad de Ciencias Químicas y Farmacéuticas, Advanced Center for Chronic Diseases—ACCDiS, Universidad de Chile, Santiago, Chile; Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
MASANORI ARITA  Department of Informatics, National Institute of Genetics, Shizuoka, Japan; RIKEN Center for Sustainable Resource Science, Yokohama, Kanagawa, Japan
MARK BORODOVSKY  Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA, USA; School of Computational Science and Engineering, Atlanta, GA, USA; Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
CARLOS S. CASIMIRO-SORIGUER  Facultad de Ciencias Experimentales (Área de Genética), Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC), Universidad Pablo de Olavide, Sevilla, Spain
PATRICIA P. CHAN  Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
IN-GEOL CHOI  Department of Biotechnology, College of Life Sciences and Biotechnology, Korea University, Seoul, South Korea
STEPHEN FICKLIN  Department of Horticulture, Washington State University, Pullman, WA, USA
TAKATOMO FUJISAWA  Department of Informatics, National Institute of Genetics, Shizuoka, Japan
ANDRÉS GARZÓN  Facultad de Ciencias Experimentales (Área de Genética), Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC), Universidad Pablo de Olavide, Sevilla, Spain
JÉRÔME GOUZY  Laboratoire des Interactions Plantes-Microorganismes (LIPM), Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
JAN GRAU  Institute of Computer Science, Martin Luther University Halle-Wittenberg, Halle (Saale), Germany
YUICHIRO HARA  Laboratory for Phyloinformatics, RIKEN Center for Biosystems Dynamics Research (BDR), Kobe, Japan
FRANK HARTUNG  Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
MICHAEL HILLER  Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Center for Systems Biology, Dresden, Germany
KATHARINA J. HOFF  University of Greifswald, Institute of Mathematics and Computer Science, Greifswald, Germany
JODI L. HUMANN  Department of Horticulture, Washington State University, Pullman, WA, USA
B. JAYARAM  Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India; Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India; Department of Chemistry, Indian Institute of Technology Delhi, New Delhi, India
JUAN JIMENEZ  Facultad de Ciencias Experimentales (Área de Genética), Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC), Universidad Pablo de Olavide, Sevilla, Spain
JENS KEILWAGEN  Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI), Federal Research Centre for Cultivated Plants, Quedlinburg, Germany
MARTIN KOLLMAR  Group Systems Biology of Motor Proteins, Department of NMR-Based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Goettingen, Germany
SHIGEHIRO KURAKU  Laboratory for Phyloinformatics, RIKEN Center for Biosystems Dynamics Research (BDR), Kobe, Japan
TAEIN LEE  Department of Horticulture, Washington State University, Pullman, WA, USA
ALEXANDRE LOMSADZE  Joint Georgia Tech and Emory University Wallace H Coulter Department of Biomedical Engineering, Atlanta, GA, USA
TODD M. LOWE  Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, USA
DORRIE MAIN  Department of Horticulture, Washington State University, Pullman, WA, USA
MOSÈ MANNI  Department of Genetic Medicine and Development, Swiss Institute of Bioinformatics, University of Geneva Medical School, Geneva, Switzerland
VINICIUS MARACAJA-COUTINHO  Facultad de Ciencias Químicas y Farmacéuticas, Advanced Center for Chronic Diseases—ACCDiS, Universidad de Chile, Santiago, Chile; Beagle Bioinformatics, Santiago, Chile; Instituto Vandique, João Pessoa, Brazil
PABLO MIER  Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany
BYOUNGNAM MIN  Department of Biotechnology, College of Life Sciences and Biotechnology, Korea University, Seoul, South Korea
AKHILESH MISHRA  Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India; Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India
ISMAIL MOGHUL  UCL Cancer Institute, University College London, London, UK
STEFANIE NACHTWEIDE  Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, Germany
YASUKAZU NAKAMURA  Department of Informatics, National Institute of Genetics, Shizuoka, Japan
HELDER I. NAKAYA  Department of Clinical and Toxicological Analyses, School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
OSAMU NISHIMURA  Laboratory for Phyloinformatics, RIKEN Center for Biosystems Dynamics Research (BDR), Kobe, Japan
ANTONIO J. PÉREZ-PULIDO  Facultad de Ciencias Experimentales (Área de Genética), Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC), Universidad Pablo de Olavide, Sevilla, Spain
ANURAG PRIYAM  School of Biological and Chemical Sciences, Queen Mary University of London, London, UK
ALEJANDRO RUBIO  Facultad de Ciencias Experimentales (Área de Genética), Centro Andaluz de Biología del Desarrollo (CABD, UPO-CSIC), Universidad Pablo de Olavide, Sevilla, Spain
ERIKA SALLET  Laboratoire des Interactions Plantes-Microorganismes (LIPM), Université de Toulouse, INRA, CNRS, Castanet-Tolosan, France
THOMAS SCHIEX  Unité de Mathématiques et Informatique Appliquées de Toulouse (MIAT), Université de Toulouse, INRA, Castanet-Tolosan, France
MATHIEU SEPPEY  Department of Genetic Medicine and Development, Swiss Institute of Bioinformatics, University of Geneva Medical School, Geneva, Switzerland
VIRAG SHARMA  Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany; Max Planck Institute for the Physics of Complex Systems, Dresden, Germany; Center for Systems Biology, Dresden, Germany; CRTD-DFG Center for Regenerative Therapies Dresden, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Dresden, Germany; Paul Langerhans Institute Dresden (PLID) of the Helmholtz Center Munich at University Hospital Carl Gustav Carus and Faculty of Medicine, Technische Universität Dresden, Dresden, Germany; German Center for Diabetes Research (DZD), Munich, Germany
POONAM SINGHAL  Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India
PRIYANKA SIWACH  Supercomputing Facility for Bioinformatics and Computational Biology, Indian Institute of Technology Delhi, New Delhi, India; Department of Biotechnology, Chaudhary Devi Lal University, Sirsa, Haryana, India
MARIO STANKE  Institute of Mathematics and Computer Science, University of Greifswald, Greifswald, Germany
YASUHIRO TANIZAWA  Department of Informatics, National Institute of Genetics, Shizuoka, Japan
JEANNE WILBRANDT  Center for Molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Bonn, Germany
YANNICK WURM  School of Biological and Chemical Sciences, Queen Mary University of London, London, UK
EVGENY M. ZDOBNOV  Department of Genetic Medicine and Development, Swiss Institute of Bioinformatics, University of Geneva Medical School, Geneva, Switzerland

Chapter 1

tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences

Patricia P. Chan and Todd M. Lowe

Abstract

Transfer RNAs are the largest, most complex non-coding RNA family, universal to all living organisms. tRNAscan-SE has been the de facto tool for predicting tRNA genes in whole genomes. The newly developed version 2.0 has incorporated advanced methodologies with improved probabilistic search software and a suite of new gene models, enabling better functional classification of predicted genes. This chapter describes the use of the UNIX command-driven and online web versions, illustrating different search modes and options.

Key words Transfer RNA, Non-coding RNA, Gene prediction, Covariance model, RNA secondary structure

1 Introduction

tRNAscan-SE [1] has been the most widely adopted tool for predicting transfer RNA (tRNA) genes in genomic sequences over the last two decades. Its users include RNA biologists, sequencing centers, database annotators, and other basic researchers. tRNAscan-SE gene predictions can be found for over four thousand genomes in the Genomic tRNA Database [2]. The tRNAscan-SE software employs covariance models [3] that capture the primary sequence and secondary structure information of tRNA training data to search for complete tRNA genes in query sequences. The results provide researchers with the genomic coordinates, predicted function (isotype and anticodon), and secondary structure of the predicted tRNA genes. To improve performance and prediction accuracy, the latest version of tRNAscan-SE integrates Infernal v1.1, the state-of-the-art covariance model search software [4], with updated models based on a much broader diversity of tRNA genes. The program achieves better functional classification by utilizing isotype-specific covariance models, and enables mitochondrial tRNA gene prediction in mammals and other vertebrates.

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_1, © Springer Science+Business Media, LLC, part of Springer Nature 2019


While the full functionality of tRNAscan-SE is available as downloadable, standalone UNIX-based software, we have also developed a user-friendly web-based version [5] to increase accessibility to scientists who wish to search relatively short sequences (up to bacterial chromosome sizes) and may not have expertise with installing and running UNIX command-line software. In this chapter, we illustrate the use of both the online and command-driven versions for finding tRNAs encoded in DNA or RNA from any species.

2 Predicting tRNA Genes Using tRNAscan-SE

2.1 Using Online Version

The tRNAscan-SE web server (http://trna.ucsc.edu/tRNAscanSE/) is a convenient, ready-for-use means to identify tRNA genes in one or more query sequences. The graphical interface also provides easy navigation to the details of prediction results and a quick way to learn about the features of the software without requiring familiarity with UNIX-based commands or installation on one’s own computer. Web-based analysis limits query sequences to a maximum of five million base pairs. The standalone version can be used for larger genomic sequences.

2.1.1 Enter Search Options

In addition to providing query sequence(s), the user is asked to select the source of the query sequence(s), if known (Fig. 1). One or more query sequences can be analyzed at a time, either typed or pasted into the text field, or uploaded as a FASTA-formatted file. The selected sequence source should correspond to the origin of the query sequences, namely sequences from eukaryotic, bacterial, archaeal, or mitochondrial chromosomes (see Note 1). If the incorrect sequence source is given, the search still may identify tRNA genes, but the boundaries of the prediction may not be as accurate, and/or some low-scoring tRNAs could be missed. If the source of the query sequences is not known, for example a sequence from a metagenome, we recommend using the “Mixed (general tRNA model)” option. Alternatively, you could analyze query sequences with each of the possible sources one by one, and then only use the predictions that give the highest score for each identified gene.

If you would like to obtain predictions and scores given by the original tRNAscan-SE v1.3 algorithm (for example, to match results found in older published predictions), select the Legacy search mode (Fig. 2a). This can be used in conjunction with the extended option of showing first-pass hit origin to check if the predictions are detected by tRNAscan and/or EufindtRNA—the fast first-pass screening algorithms that identify tRNA gene candidates in pre-2.0 tRNAscan-SE versions (Fig. 2b). If you require maximum search sensitivity and can accept much longer processing time, you may select the “Infernal without HMM filter” search mode (Fig. 2a). However, we do not recommend this mode except for non-standard mitochondrial tRNAs (e.g., those potentially missing tRNA stem-loops), as typical tRNAs are equally well identified by the much faster, efficient Default search mode.

Fig. 1 Main search options for tRNAscan-SE online version. The minimum required options include the sequence source and the query sequence. BED file format can be optionally selected for output

tRNAscan-SE provides an overall bit score for each gene prediction. The higher the score, the more similar the prediction is to the consensus profile represented by the covariance model. The overall score can be split into the primary sequence score (i.e., conservation of the full linear sequence) and the secondary structure score (i.e., all base pairs expected in tRNAs). You can choose to show these detailed score components in the prediction results under the Extended Options (Fig. 2b). As part of the functional classification, tRNAscan-SE evaluates the gene predictions for possible pseudogenes based on characteristics commonly observed in non-functional tRNAs: a relatively weak overall score (

Fig. 2 (a) Search mode and (b) extended options for tRNAscan-SE online version

> wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/182/925/GCF_000182925.2_NC12/GCF_000182925.2_NC12_genomic.fna.gz
> gunzip GCF_000182925.2_NC12_genomic.fna.gz
# Download RNA-seq data (SRA toolkit should be installed)
> fastq-dump -I --split-files SRR100067
# Check the line counts of two FASTQ files
> wc -l SRR100067_[12].fastq
# 125204192 SRR100067_1.fastq
# 125204192 SRR100067_2.fastq

This procedure retrieves a genome assembly of 21 scaffolds in FASTA format and 31,301,048 RNA-seq read pairs (paired-end reads) in FASTQ format from NCBI. The current version of FunGAP accepts only the FASTA and FASTQ formats for the genome assembly and RNA-seq reads, respectively. RNA-seq raw reads may vary by sequencing platform (e.g., Illumina, PacBio, or Oxford Nanopore) and sequencing method (e.g., single-end or paired-end), but users can adapt any type of RNA-seq reads to the pipeline (see Subheading 4).
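The line counts reported by wc -l translate into read counts because a FASTQ record normally occupies four lines. The following Python sketch illustrates that arithmetic and a simple consistency check between the two mate files. It is not part of FunGAP; the function names are hypothetical, and it assumes the common four-lines-per-record layout without wrapped sequence lines.

```python
def reads_from_line_count(n_lines):
    """Convert a FASTQ line count into a read count (four lines per record)."""
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4


def count_fastq_reads(lines):
    """Count reads in any iterable of FASTQ lines, e.g. an open file handle."""
    return reads_from_line_count(sum(1 for _ in lines))


def count_read_pairs(lines_1, lines_2):
    """Return the number of read pairs, or raise if the mate files disagree."""
    n_1 = count_fastq_reads(lines_1)
    n_2 = count_fastq_reads(lines_2)
    if n_1 != n_2:
        raise ValueError(f"mate files differ: {n_1} vs {n_2} reads")
    return n_1
```

With the tutorial files, one would open SRR100067_1.fastq and SRR100067_2.fastq and pass both handles to count_read_pairs; the 125204192 lines reported above correspond to 31,301,048 reads per file.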

3 Generating a Genome Annotation with FunGAP, Step by Step

The gene prediction procedure is divided into three steps. The first step is data preprocessing and preparation: checking the quality of the RNA-seq raw reads and building a protein sequence database for homology-based gene prediction. The second step is running FunGAP from the command line. The final step is evaluating the FunGAP output, such as checking genome completeness and examining transposable element genes.

3.1 Step 1a—RNA-Seq Reads Quality Control (Optional): ~15 min

Users can run any tool or set of tools to trim adapter sequences and low-quality bases and to exclude short reads. We used the Trim Galore program for automatic detection and removal of adapter sequences. In this tutorial, we sampled 500,000 reads to reduce the computation time, but users can use the entire dataset at the expense of computation time. Note that the two FASTQ files should share the same prefix and differ only in their suffix, e.g., _1.fastq and _2.fastq. In this run, we make the files SRR100067_sampled_1.fastq and SRR100067_sampled_2.fastq.

# Run Trim Galore!
> mkdir trim_galore_out
> trim_galore \
    --paired --quality 20 --phred33 --length 40 \
    --output_dir trim_galore_out \
    SRR100067_1.fastq SRR100067_2.fastq
# Rename the Trim Galore output files
> mv trim_galore_out/SRR100067_1_val_1.fq SRR100067_qced_1.fastq
> mv trim_galore_out/SRR100067_2_val_2.fq SRR100067_qced_2.fastq
# Sample 500,000 reads (you do not need to do this for your actual dataset)
> head -n 2000000 SRR100067_qced_1.fastq > SRR100067_sampled_1.fastq
> head -n 2000000 SRR100067_qced_2.fastq > SRR100067_sampled_2.fastq
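The head -n 2000000 commands keep the first 500,000 reads because each read spans four lines. As a minimal sketch of the same idea in Python, the helper below (a hypothetical name, not part of FunGAP or Trim Galore) takes the leading records from any iterable of FASTQ lines. It assumes four lines per record, and like head it takes the first reads rather than a random sample.

```python
from itertools import islice


def take_first_reads(lines, n_reads):
    """Return an iterator over the lines of the first n_reads FASTQ records.

    Equivalent in spirit to `head -n 4*n_reads` on a four-line-per-record file.
    """
    if n_reads < 0:
        raise ValueError("n_reads must be non-negative")
    return islice(lines, 4 * n_reads)
```

Applying the same n_reads to both mate files keeps the pairs in sync, provided both files list the reads in the same order, as is the case for the Trim Galore paired output used here.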


3.2 Step 1b—Retrieving Protein Sequence Database:

> python $FUNGAP_DIR/download_sister_orgs.py \
    --download_dir sister_orgs \
    --taxon "Neurospora" \
    --num_sisters 3 \
    --email_address 'your email address here'
# Concatenate the downloaded sequences
> zcat sister_orgs/*faa.gz > prot_db.faa
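On systems without zcat, the concatenation step can be reproduced in Python. The sketch below is an illustration only, not a FunGAP utility: the function name is hypothetical, and it assumes the downloaded proteomes are gzip-compressed FASTA files matched by a glob pattern such as sister_orgs/*faa.gz.

```python
import glob
import gzip
import shutil


def concatenate_gzipped_fasta(pattern, out_path):
    """Decompress all files matching pattern and append them to out_path.

    Files are processed in sorted order; the list of input paths is returned.
    """
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(out_path, "wb") as out_handle:
        for path in paths:
            with gzip.open(path, "rb") as in_handle:
                # Stream the decompressed bytes to avoid loading whole files.
                shutil.copyfileobj(in_handle, out_handle)
    return paths
```

For the tutorial data, the call would be concatenate_gzipped_fasta("sister_orgs/*faa.gz", "prot_db.faa"), which yields the file passed to --sister_proteome in the next step.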

3.3 Step 1c—Selection of a Species Model for AUGUSTUS:

> python $FUNGAP_DIR/get_augustus_species.py \
    --genus_name Neurospora \
    --email_address 'your email address here'
# The output is neurospora_crassa

3.4 Step 2—Running FunGAP: ~20 h

When the preparatory Step 1 is finished, FunGAP is ready to annotate the genome from the command line. A run with the default parameters looks as follows (ensure that FunGAP version 1.0.1 is used):

> python $FUNGAP_DIR/fungap.py \
    --output_dir fungap_out \
    --trans_read_1 SRR100067_sampled_1.fastq \
    --trans_read_2 SRR100067_sampled_2.fastq \
    --project_name Neurospora_crassa \
    --genome_assembly GCF_000182925.2_NC12_genomic.fna \
    --augustus_species neurospora_crassa \
    --org_id Neucr \
    --sister_proteome prot_db.faa \
    --num_cores 20
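When the same run has to be repeated for several genomes, it can help to assemble the command from a table of options. The Python sketch below only builds and prints the command line; it does not run FunGAP. The option names and values are the ones shown in this tutorial, while the helper names and the example install path are hypothetical.

```python
import os
import shlex

# Option values copied from the tutorial command above.
TUTORIAL_OPTIONS = {
    "output_dir": "fungap_out",
    "trans_read_1": "SRR100067_sampled_1.fastq",
    "trans_read_2": "SRR100067_sampled_2.fastq",
    "project_name": "Neurospora_crassa",
    "genome_assembly": "GCF_000182925.2_NC12_genomic.fna",
    "augustus_species": "neurospora_crassa",
    "org_id": "Neucr",
    "sister_proteome": "prot_db.faa",
    "num_cores": 20,
}


def build_fungap_command(fungap_dir, options):
    """Return the argument list for fungap.py given a dict of option values."""
    command = ["python", os.path.join(fungap_dir, "fungap.py")]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


def format_command(command):
    """Render an argument list as a shell-quoted string for logging."""
    return " ".join(shlex.quote(part) for part in command)
```

The resulting list can be handed to subprocess.run once the input files are in place; printing format_command(...) first makes it easy to compare the assembled command with the one given above.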

3.5 Step 3a—Examination of the FunGAP Outputs

Users can recognize the completion of the annotation by the existence of the log records and output files. The location of the FunGAP output files is set by the parameter --output_dir. In this tutorial, fungap_out/fungap_out is the temporary output directory. Three output files are found in the output directory: (1) gene features in GFF3, (2) protein FASTA, and (3) the annotation summary in HTML (Fig. 1). The output files give a first indication of the quality of the annotation result. In the summary file, users should check whether the alignment rate of the RNA-seq reads is sufficient. Poor gene prediction often results from a low alignment rate. If relevant RNA-seq data are used, the alignment rate should exceed 80%. Otherwise, users should suspect sequence contamination or inadequate quality of the RNA-seq reads. In this N. crassa practice run, the RNA-seq alignment rate was 98%. The summary file also shows that 10,280 genes were predicted in this practice run (Fig. 1).

Fig. 1 An example of the summary report by FunGAP. Gene structure (a), Transcriptome read assembly (b), Transcript length distribution (c), and Protein length distribution (d) shown in the summary report. The summary report was made from the gene prediction result of Neurospora crassa data in this guide

3.6 Step 3b—Checking the Completeness of Annotation:

> run_BUSCO.py \
    --in fungap_out/fungap_out/fungap_out_prot.faa \
    --cpu 10 \
    --mode prot \
    --lineage_path sordariomyceta_odb9 \
    --out busco_Neurospora_crassa

3.7 Step 3c—Detection and Removal of Transposable Element Genes: ~10 min

Although repeat sequences are detected and masked by RepeatMasker [15] within FunGAP before gene prediction, some repeat regions remain unmasked, so that transposable element (TE) genes such as transposases can be included in the prediction. In particular, ectomycorrhizal fungi are well known for genomic enrichment of TEs [16]. TEs are usually excluded before downstream analysis because their abundance can affect the total number of genes and the number of transcribed genes inferred by RNA-seq. Pfam [17] annotation with InterProScan [11] is also used to detect TE genes. The Pfam domains related to TEs that are frequently found in fungi are listed in Table 1. FunGAP provides a script, detect_te_genes.py, to detect TE genes. In our practice dataset, we found 12 TE genes. Users may consider excluding TE genes from the final gene prediction if the number of predicted genes appears to be inflated by TEs.

> python $FUNGAP_DIR/detect_te_genes.py \
    --protein_fasta fungap_out/fungap_out/fungap_out_prot.faa
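The Pfam-based criterion can be illustrated with a short Python sketch that flags genes carrying at least one of the 24 TE-related Pfam domains from Table 1. This is an illustration of the idea only and makes no claim about how detect_te_genes.py is implemented. The function name and the input layout (a dictionary mapping gene IDs to Pfam accessions, e.g. parsed from InterProScan output) are assumptions.

```python
# The 24 TE-related Pfam accessions listed in Table 1.
TE_PFAM_IDS = frozenset({
    "PF00075", "PF00078", "PF00665", "PF02925", "PF02992", "PF03184",
    "PF03221", "PF03732", "PF04687", "PF05699", "PF05840", "PF05970",
    "PF07727", "PF08283", "PF08284", "PF10551", "PF13358", "PF13359",
    "PF13456", "PF13837", "PF13976", "PF14214", "PF14223", "PF14529",
})


def find_te_genes(gene_to_pfams):
    """Return {gene_id: sorted TE-related Pfam IDs} for genes with any hit.

    gene_to_pfams maps a gene ID to an iterable of Pfam accessions;
    a version suffix such as ".6" is ignored.
    """
    te_genes = {}
    for gene_id, pfams in gene_to_pfams.items():
        accessions = {pfam.split(".")[0] for pfam in pfams}
        matched = sorted(TE_PFAM_IDS & accessions)
        if matched:
            te_genes[gene_id] = matched
    return te_genes
```

Genes returned by find_te_genes are candidates for removal; whether to drop them remains a judgment call, as discussed above.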

4 Challenging Scenarios

Many fungal genome sequencing projects are being carried out by individual laboratories with various strategies and conditions. Here are additional scenarios for the situations that users most frequently ask about, such as (1) RNA-seq data are not available,


Byoungnam Min and In-Geol Choi

Table 1 List of Pfam domains related to transposable elements. Genes annotated with these 24 Pfam domains can be regarded as genes related to transposable elements

Pfam ID    Pfam description
PF00075    RNase H
PF00078    Reverse transcriptase (RNA-dependent DNA polymerase)
PF00665    Integrase core domain
PF02925    Bacteriophage scaffolding protein D
PF02992    Transposase family tnp2
PF03184    DDE superfamily endonuclease
PF03221    Tc5 transposase DNA-binding domain
PF03732    Retrotransposon gag protein
PF04687    Microvirus H protein (pilot protein)
PF05699    hAT family C-terminal dimerization region
PF05840    Bacteriophage replication gene A protein (GPA)
PF05970    PIF1-like helicase
PF07727    Reverse transcriptase (RNA-dependent DNA polymerase)
PF08283    Geminivirus rep protein central domain
PF08284    Retroviral aspartyl protease
PF10551    MULE transposase domain
PF13358    DDE superfamily endonuclease
PF13359    DDE superfamily endonuclease
PF13456    Reverse transcriptase-like
PF13837    Myb/SANT-like DNA-binding domain
PF13976    GAG-pre-integrase domain
PF14214    Helitron helicase-like domain at N-terminus
PF14223    Gag-polypeptide of LTR copia-type
PF14529    Endonuclease-reverse transcriptase

(2) RNA-seq data types are not acceptable to FunGAP, and (3) various output formats such as GenBank or transcript nucleotide sequence are required.

4.1 Case 1—When RNA-Seq Reads Are Not Available

The sequencing of mature mRNA molecules provides direct evidence for exon–intron structure, so RNA-seq read data are crucial for obtaining a high-quality gene prediction. If RNA-seq data are not available, one alternative option is using the available


RNA-seq reads (surrogate RNA-seq) from the taxonomically closest neighbor organism. Surrogate RNA-seq reads can be searched for and downloaded from the NCBI Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) or the MycoCosm database (https://genome.jgi.doe.gov/programs/fungi/index.jsf) [18]. To probe how taxonomically distant a species can be and still be useful, we benchmarked the N. crassa OR74A genome with various surrogate RNA-seq data. When the surrogate RNA-seq reads were obtained at the same species and genus level (average nucleotide identity 96.0% and 92.7%), the prediction was quite good compared to the prediction from the original RNA-seq data, whereas the prediction quality decreased dramatically when the surrogate RNA-seq was obtained at the family level (Table 2). GeneMark-ES self-training gene prediction gave higher BUSCO completeness but a lower number of matches to the reference than the prediction using surrogate RNA-seq reads from the family level. Therefore, we recommend using this alternative option only when surrogate RNA-seq reads can be obtained at the genus or species level. However, for a quick-and-dirty genome analysis, reads from family-level relatives will also give a prediction of some quality.

4.2 Case 2—When RNA-Seq Reads Are Available from Other than Illumina Platforms

If mRNA molecules are sequenced on non-Illumina platforms, or the reads are in a format other than FASTQ (such as FASTA), users can provide a BAM file that they generated themselves. In this case, the Hisat2 step is skipped. As of version 1.0.1, FunGAP also takes an Illumina single-end FASTQ file as input. Note that the provided BAM file should be sorted using Samtools (http://samtools.sourceforge.net/) so that Trinity can handle it appropriately.

# Illumina single-end FASTQ file
> python $FUNGAP_DIR/fungap.py \
  --output_dir fungap_out \
  --trans_read_single illumina-single-end_s.fastq \
  --project_name Neurospora_crassa \
  --genome_assembly GCF_000182925.2_NC12_genomic.fna \
  --augustus_species neurospora_crassa \
  --org_id Neucr \
  --sister_proteome prot_db.faa \
  --num_cores 20

# BAM file
> python $FUNGAP_DIR/fungap.py \
  --output_dir fungap_out \
  --trans_bam NcrassaRNA_sorted.bam \
  --project_name Neurospora_crassa \
  --genome_assembly GCF_000182925.2_NC12_genomic.fna \
  --augustus_species neurospora_crassa \
  --org_id Neucr \
  --sister_proteome prot_db.faa \
  --num_cores 40
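Before passing a self-generated BAM file to FunGAP, it can be worth checking that it is really sorted. The sketch below reads the sort order from the header text of a BAM file (for example, as printed by samtools view -H). It assumes that coordinate sorting, the default of samtools sort, is what is required here; the function name sort_order and the example header are illustrative only.

```python
def sort_order(sam_header_text):
    """Return the SO value of the @HD header line, or None if it is absent."""
    for line in sam_header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


# Made-up header of a coordinate-sorted file.
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n"
if sort_order(header) == "coordinate":
    print("BAM header reports coordinate sorting")
else:
    print("BAM file should be sorted before use")
```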


Table 2 Comparison of gene prediction results by FunGAP with surrogate RNA-seq data. When RNA-seq data are not available, users can use surrogate RNA-seq data from taxonomically close neighbors for gene prediction. Genus-level surrogate data were acceptable but family-level surrogate failed to make reliable gene prediction

                           With RNA-seq data from
                           Neurospora crassa OR74A  Neurospora crassa FGSC 73 trp-3 [19]  Neurospora tetrasperma  Sordaria macrospora  No RNA-seq (a)
Common taxon level         Strain                   Species                               Genus                   Family
NCBI accession (b)         SRR100067                MycoCosm (e)                          SRR5192932              SRR944971            –
ANI value (c)              100%                     97.00%                                92.72%                  86.52%
Reads alignment rate (d)   98.50%                   93.34%                                48.51%                  13.47%
Assembled transcripts      5298                     9917                                  828                     30
BRAKER1 predicted genes    8708                     8756                                  8731                    10,162               8714
BRAKER1 complete BUSCOs    3606                     3631                                  3573                    2372                 3353
BRAKER1 missing BUSCOs     38                       28                                    39                      328                  70
BUSCO completeness         99.0%                    99.2%                                 99.0%                   91.2%                98.1%
Matches to reference       61.6%                    62.7%                                 62.7%                   47.3%                35.5%

(a) GeneMark-ES with self-training (--ES option)
(b) NCBI Sequencing Reads Archive (https://www.ncbi.nlm.nih.gov/sra)
(c) Average Nucleotide Identity with Neurospora crassa OR74A genome assembly calculated by pyani (https://github.com/widdowquinn/pyani)
(d) Reads alignment against Neurospora crassa OR74A genome assembly
(e) https://genome.jgi.doe.gov/Neucr_trp3_1/Neucr_trp3_1.home.html

4.3 Case 3—When Other Format of Output Files Is Required

Gene and transcript sequences (nucleotide FASTA) or GenBank format are not generated by FunGAP by default, but there are Python scripts for this. They take the genome assembly (FASTA)


and the FunGAP-generated GFF3 file as inputs. The example commands are the following:

# Make transcript sequences in nucleotide FASTA
> python $FUNGAP_DIR/gff3_transcript.py \
  --input_fasta GCF_000182925.2_NC12_genomic.fna \
  --input_gff3 fungap_out/fungap_out/fungap_out.gff3 \
  --output_prefix fungap_out

# Make GenBank format
> python $FUNGAP_DIR/generate_genbank.py \
  --input_fasta GCF_000182925.2_NC12_genomic.fna \
  --input_gff3 fungap_out/fungap_out/fungap_out.gff3 \
  --output_prefix fungap_out
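To make clear what deriving transcript sequences from a genome FASTA and a GFF3 file involves, here is a minimal sketch of the core step: joining exon intervals and reverse-complementing the result for minus-strand genes. It is not the implementation of gff3_transcript.py; the function names and the toy contig are hypothetical, and file parsing is left out.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(contig_seq, exons, strand):
    """Join exon sequences into one transcript sequence.

    exons holds (start, end) pairs with 1-based inclusive coordinates,
    as used in GFF3.
    """
    parts = [contig_seq[start - 1:end] for start, end in sorted(exons)]
    joined = "".join(parts)
    return reverse_complement(joined) if strand == "-" else joined


contig = "AAACCCGGGTTT"  # toy contig
print(spliced_sequence(contig, [(1, 3), (7, 9)], "+"))
print(spliced_sequence(contig, [(1, 3), (7, 9)], "-"))
```

Sorting the exons by start coordinate before joining means that the same code works regardless of the order in which exon features appear in the GFF3 file.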

4.4 Future Updates of FunGAP

The current version of FunGAP requires many external programs to be installed and configured manually. Installing the various prerequisite programs is not an easy task for inexperienced users. To make FunGAP easier to access, we are packaging the pipeline into a Docker container (https://www.docker.com/) that is portable across operating systems. In addition, we intend to expand FunGAP to annotate other clades of eukaryotic genomes, such as algae and protists, by simply plugging in specific protein databases and gene model parameters. FunGAP can also be applied to the annotation of assembled contigs from mycobiome studies. These future updates will help individual laboratories with limited resources, and users who are not Linux experts, to obtain high-quality eukaryotic gene predictions from many NGS data and applications.

Acknowledgments

This work is supported by the Cooperative Research Program for Agriculture Science & Technology Development (Project No. PJ01044003 and No. PJ01337602), Rural Development Administration, Republic of Korea, and Dr. Byoungnam Min was supported by the Korea University grant.

References

1. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32(5):767–769. https://doi.org/10.1093/bioinformatics/btv661
2. Reid I, O’Toole N, Zabaneh O, Nourzadeh R, Dahdouli M, Abdellateef M, Gordon PM, Soh J, Butler G, Sensen CW, Tsang A (2014) SnowyOwl: accurate prediction of fungal genes by using RNA-Seq and homology information to select among ab initio models. BMC Bioinformatics 15:229. https://doi.org/10.1186/1471-2105-15-229
3. Zickmann F, Renard BY (2015) IPred - integrating ab initio and evidence based gene predictions to improve prediction accuracy. BMC Genomics 16:134. https://doi.org/10.1186/s12864-015-1315-9
4. Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR (2008) Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol 9(1):R7. https://doi.org/10.1186/gb-2008-9-1-r7
5. Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491. https://doi.org/10.1186/1471-2105-12-491
6. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B (2006) AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34:W435–W439. https://doi.org/10.1093/nar/gkl200
7. Borodovsky M, Lomsadze A (2011) Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Curr Protoc Bioinformatics. Chapter 4:Unit 4.6.1-10. https://doi.org/10.1002/0471250953.bi0406s35
8. Min B, Grigoriev IV, Choi IG (2017) FunGAP: fungal genome annotation pipeline using evidence-based gene model evaluation. Bioinformatics 33(18):2936–2937. https://doi.org/10.1093/bioinformatics/btx353
9. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421. https://doi.org/10.1186/1471-2105-10-421
10. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210–3212. https://doi.org/10.1093/bioinformatics/btv351
11. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, Pesseat S, Quinn AF, Sangrador-Vegas A, Scheremetjew M, Yong SY, Lopez R, Hunter S (2014) InterProScan 5: genome-scale protein function classification. Bioinformatics 30(9):1236–1240. https://doi.org/10.1093/bioinformatics/btu031
12. Smit A, Hubley R (2008) RepeatModeler Open-1.0. http://www.repeatmasker.org. Accessed 26 Sep 2018
13. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29(7):644–652. https://doi.org/10.1038/nbt.1883
14. Krueger F (2015) Trim galore. A wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files. https://www.bioinformatics.babraham.ac.uk/projects/trim_galore. Accessed 26 Sep 2018
15. Tarailo-Graovac M, Chen N (2009) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinformatics. Chapter 4:Unit 4.10. https://doi.org/10.1002/0471250953.bi0410s25
16. Peter M, Kohler A, Ohm RA, Kuo A, Krutzmann J, Morin E, Arend M, Barry KW, Binder M, Choi C, Clum A, Copeland A, Grisel N, Haridas S, Kipfer T, LaButti K, Lindquist E, Lipzen A, Maire R, Meier B, Mihaltcheva S, Molinier V, Murat C, Poggeler S, Quandt CA, Sperisen C, Tritt A, Tisserant E, Crous PW, Henrissat B, Nehls U, Egli S, Spatafora JW, Grigoriev IV, Martin FM (2016) Ectomycorrhizal ecology is imprinted in the genome of the dominant symbiotic fungus Cenococcum geophilum. Nat Commun 7:12662. https://doi.org/10.1038/ncomms12662
17. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, Sonnhammer EL, Tate J, Punta M (2014) Pfam: the protein families database. Nucleic Acids Res 42:D222–D230. https://doi.org/10.1093/nar/gkt1223
18. Grigoriev IV, Nikitin R, Haridas S, Kuo A, Ohm R, Otillar R, Riley R, Salamov A, Zhao X, Korzeniewski F, Smirnova T, Nordberg H, Dubchak I, Shabalov I (2014) MycoCosm portal: gearing up for 1000 fungal genomes. Nucleic Acids Res 42:D699–D704. https://doi.org/10.1093/nar/gkt1183
19. Baker SE, Schackwitz W, Lipzen A, Martin J, Haridas S, LaButti K, Grigoriev IV, Simmons BA, McCluskey K (2015) Draft genome sequence of Neurospora crassa strain FGSC 73. Genome Announc 3(2). https://doi.org/10.1128/genomeA.00074-15

Chapter 5

Whole-Genome Annotation with BRAKER

Katharina J. Hoff, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke

Abstract

BRAKER is a pipeline for highly accurate and fully automated gene prediction in novel eukaryotic genomes. It combines two major tools: GeneMark-ES/ET and AUGUSTUS. GeneMark-ES/ET learns its parameters from a novel genomic sequence in a fully automated fashion; if available, it uses extrinsic evidence for model refinement. From the protein-coding genes predicted by GeneMark-ES/ET, we select a set for training AUGUSTUS, one of the most accurate gene finding tools that, in contrast to GeneMark-ES/ET, integrates extrinsic evidence already into the gene prediction step. The first published version, BRAKER1, integrated genomic footprints of unassembled RNA-Seq reads into the training as well as into the prediction steps. The pipeline has since been extended to the integration of data on mapped cross-species proteins, and to the usage of heterogeneous extrinsic evidence, both RNA-Seq and protein alignments. In this book chapter, we briefly summarize the pipeline methodology and describe how to apply BRAKER in environments characterized by various combinations of external evidence.

Key words Protein-coding genes, Gene prediction, AUGUSTUS, GeneMark-ES/ET, RNA-Seq reads, Protein mapping to genome, Genome annotation pipeline, BRAKER

1 Introduction

BRAKER [1] is a pipeline for the fully automated prediction of protein-coding genes with GeneMark-ES/ET [2–4] and AUGUSTUS [5–10] in novel eukaryotic genomes (novel or not: some genomes are re-sequenced, re-assembled, etc., and need annotation). In contrast to other genome annotation pipelines, such as MAKER [11, 12], BRAKER trains both gene finders in a fully automated fashion before making final gene prediction steps. For gene prediction, both GeneMark-ES/ET and AUGUSTUS use statistical models with a large number of parameters. Optimal parameters are species specific. While the same parameters can be used for clades of closely related species, the use of parameters from

The authors “Alexandre Lomsadze”, “Mark Borodovsky”, and “Mario Stanke” contributed equally. Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_5, © Springer Science+Business Media, LLC, part of Springer Nature 2019


more distant species typically leads to low prediction accuracy. Thus, a training step for parameter optimization is required.

Most gene prediction tools, including AUGUSTUS, must be trained on a previously generated (expert-curated) set of gene structures; this procedure is called “supervised training.” GeneMark-ES/ET has the extraordinary property of generating parameters by self-training, or “unsupervised training.” An expert-curated training set is therefore not required prior to running GeneMark-ES/ET, which can either train itself on a genomic sequence that it had not previously been trained for or, if extrinsic evidence for intron intervals is available, incorporate this information into self-training.

AUGUSTUS is one of the most accurate tools for predicting genes, as shown by the independent EGASP [13–15], nGASP [16], and RGASP [17] assessments. AUGUSTUS not only has an elaborate statistical model but also has the capacity to integrate extrinsic evidence from various sources. In contrast to GeneMark-ES/ET, AUGUSTUS is not self-training but requires a training gene set. Compiling such a set often poses problems for users with little experience in bioinformatics, as this step requires running special tools and tests. The task also takes a significant amount of time even for experienced users.

Creating a high-quality training set is easier when extrinsic data are available. For example, alignments of protein sequences of a closely related species against the new genome can be used to derive gene structures (e.g., using Scipio [18] or GenomeThreader [19]). Inferring gene models from alignments of expressed sequence tags (ESTs) against the genome (e.g., with PASA [20]) is also an option. If proteins of closely related species or ESTs are available, we refer the reader to WebAUGUSTUS [8], a user-friendly web service that runs Scipio or PASA in conjunction with training AUGUSTUS (local installation of the underlying AutoAug.pl pipeline, part of AUGUSTUS, is possible but difficult).

However, with progress made in transcriptome sequencing technology, ESTs have been replaced by short-read RNA-Seq (e.g., with Illumina sequencing). One possible use of short-read RNA-Seq data in the context of gene prediction would be to first assemble the short reads into longer transcript sequences and subsequently map the longer transcripts to the genome to infer genes suitable for training a gene finder. However, the RGASP assessment showed convincingly that gene identification methods that use unassembled RNA-Seq reads and statistical models in the prediction step are superior to methods that assemble the short reads into transcripts [17]. AUGUSTUS, which infers gene structure evidence from aligned raw RNA-Seq reads, performed particularly well in the RGASP assessment.


Fig. 1 Schematic view of the BRAKER approach to gene prediction: GeneMark-ES/ET is trained (using extrinsic data upon availability) and predicts a first gene set (genemark.gtf). This gene set is filtered. AUGUSTUS is trained on the filtered gene set. AUGUSTUS predictions with species-specific parameters are performed, using extrinsic data upon availability

The principle of BRAKER (Fig. 1) is to execute self-training GeneMark-ES/ET to produce an initial set of predicted genes. GeneMark-ES/ET can run either in ab initio unsupervised mode (ES mode) or in semi-supervised mode (ET mode) if additional evidence for putative splice sites is available in the form of short RNA-Seq reads to genome alignments spanning splice junctions. Out of the genes predicted by GeneMark-ES/ET, BRAKER selects those having support in all introns in the extrinsic data (in the absence of extrinsic data, BRAKER selects all genes longer than 800 nt in spliced form). Besides multi-exon genes, a number of single-exon genes, proportional to the number of single-exon genes in the initial GeneMark-ES/ET gene set, are selected at random and added to the whole set of selected genes. Genes with few exons contribute less to improve AUGUSTUS’ accuracy during training; however, a certain number of single-exon genes must remain in the training set. The genes with no or few introns usually outnumber the genes with a large number of introns in the initial GeneMark-ES/ET gene set. We apply the following approach to sample genes with a small number of introns. A gene with n ≤ 5 introns is removed from the list if n < N, where N is a random variable sampled from a Poisson distribution with


parameter λ = 2. Accordingly, the chance that a gene is kept in training increases with the number of introns. In the next step, the gene set, represented by genomic coordinates and genomic sequences, is translated into protein sequences. Duplicates are pruned from this set using NCBI BLAST [21, 22] with a similarity threshold of maximum pairwise percent identity of 80%. This set is used to train AUGUSTUS, which uses the newly trained species-specific parameters and extrinsic evidence, if available, to make the final round of gene predictions.

Both GeneMark-ES/ET and AUGUSTUS are tools capable of using unassembled RNA-Seq information. Self-training GeneMark-ES/ET uses these data in the training step, while AUGUSTUS uses them in the prediction step. The BRAKER1 pipeline was designed to chain both tools in an optimal way with and without splice-site information from mapped RNA-Seq reads. Since its original publication, BRAKER1 has been extended in several aspects: currently, it can run with and without any extrinsic evidence; it can use hints on splice sites obtained from alignments of homologous proteins to the genome of interest, instead of or in addition to the hints from the RNA-Seq read alignments; it can use alignments of proteins from closely or remotely related species; it can use RNA-Seq coverage information for prediction of genes with UTRs, instead of CDS-only prediction. In this book chapter, we refer to this pipeline as the BRAKER pipeline and describe how to apply the pipeline in various circumstances defined by the presence or absence of different types of data.
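The intron-based sampling rule described above can be sketched as follows. This is an illustration only, not the code used inside BRAKER: the function names are hypothetical, the cutoff of five introns is treated as inclusive (an assumption), and the separate random selection of single-exon genes is left out, so the example uses multi-exon genes only.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def keep_gene(n_introns, rng, lam=2.0, max_sampled=5):
    """Decide whether a gene stays in the training set.

    Genes with more than max_sampled introns are always kept (assumed
    inclusive cutoff). Other genes are removed when n_introns < N,
    with N drawn from Poisson(lam).
    """
    if n_introns > max_sampled:
        return True
    return not (n_introns < sample_poisson(lam, rng))


rng = random.Random(0)
intron_counts = [1, 1, 2, 3, 8, 12]  # made-up intron counts per gene
kept = [n for n in intron_counts if keep_gene(n, rng)]
print(kept)
```

With λ = 2, a gene with one intron is kept with probability P(N ≤ 1) = 3e^(−2) ≈ 0.41, whereas a gene with four introns is kept with probability P(N ≤ 4) ≈ 0.95, which is why intron-rich genes dominate the sampled set.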

2 BRAKER Software and Input Files

In this section, we describe the required computational resources, software, and input files for executing BRAKER.

2.1 Computational Resources

BRAKER can in principle be executed on a modern desktop computer with 8 GB RAM (per core). Many subprocesses that are initiated by BRAKER can be parallelized. We recommend a workstation with eight cores. Please note that many parallelized steps in BRAKER use data parallelization, i.e., large files are split into smaller files, and each smaller file is then processed on a separate core. Choosing a very large number of cores may lead to scenarios where one or several of the smaller files no longer contain sufficient data for processing. BRAKER has therefore been limited to run with at most 48 cores. Also, k-fold cross-validation of optimize_augustus.pl is by default executed with k = 8 on 8 cores only. Please note that if you set BRAKER to run with more than 8 and up to 48 cores, k will be adapted to the number of cores.
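The effect of choosing too many cores for data-parallel steps can be seen with a toy example. The sketch below distributes a small list of contigs over a given number of chunks in round-robin fashion; it is a simplified illustration under that assumption and does not reproduce how BRAKER actually splits its files. The function name and contig names are made up.

```python
def split_round_robin(items, n_chunks):
    """Distribute items over n_chunks lists in round-robin order."""
    chunks = [[] for _ in range(n_chunks)]
    for index, item in enumerate(items):
        chunks[index % n_chunks].append(item)
    return chunks


contigs = ["contig_%d" % i for i in range(1, 11)]  # 10 hypothetical contigs
for cores in (4, 48):
    chunks = split_round_robin(contigs, cores)
    empty = sum(1 for chunk in chunks if not chunk)
    print(cores, "cores:", empty, "chunks without data")
```

With more chunks than contigs, most chunks receive no data at all, which mirrors the situation that the upper limit on the number of cores is meant to avoid.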

2.2 Software

BRAKER is available for download from GitHub at https://github.com/Gaius-Augustus/BRAKER. You can clone the repository with:

Bash input
$ git clone https://github.com/Gaius-Augustus/BRAKER.git

The repository also contains example input files that will be used in this book chapter to demonstrate the usage of BRAKER. Running BRAKER requires a Linux system with Bash and Perl. Furthermore, BRAKER requires the following CPAN-Perl modules to be installed:
• File::Spec::Functions
• Hash::Merge
• List::Util
• Logger::Simple
• Module::Load::Conditional
• Parallel::ForkManager
• POSIX
• Scalar::Util::Numeric
• YAML

BRAKER calls external bioinformatics software. The called software depends on the input file combination. All input file combinations require the following software to be installed:
• AUGUSTUS 3.3.1 or newer. Please use the latest AUGUSTUS version distributed by the original authors from GitHub at https://github.com/Gaius-Augustus/Augustus (see Note 1).
• GeneMark-ES/ET 4.38 or newer.
• NCBI BLAST+ 2.2.31+ or newer [21, 22].

If you run BRAKER with RNA-Seq alignments in BAM-format, the following software is required:
• BamTools 2.5.1 or newer [23].
• SAMtools 1.7-4-g93586ed or newer [24].

If you run BRAKER with protein data from a closely related species, a protein alignment tool is required. We recommend GenomeThreader 1.7.0 or newer.

2.3 Files

BRAKER is a pipeline that can be run with different input file combinations (as depicted in Figs. 2 and 3). Here, we describe the formats of files that serve as input to BRAKER.

2.3.1 Genome File

Running BRAKER always requires a genome file in FASTA-format. Ideally, you should provide a genome file that contains all (longer) contigs of a genome assembly to BRAKER, and not parts of a single or few chromosomes. Very short contigs often carry only partial

Fig. 2 BRAKER pipeline: (A) Training GeneMark-ES on genome data only; ab initio gene prediction with AUGUSTUS. (B) Training GeneMark-ET supported by RNA-Seq spliced alignment information, prediction with AUGUSTUS with that same spliced alignment information. (C) Training GeneMark-EP on protein spliced alignment information, prediction with AUGUSTUS with that same spliced alignment information. (D) Training GeneMark-ETP supported by RNA-Seq alignment information and protein spliced alignment information, prediction with AUGUSTUS using the same alignment information. Introns supported by both RNA-Seq and protein alignment information are trusted—their prediction in gene structures by GeneMark-ETP and AUGUSTUS is enforced. Proteins used for (C) and (D) can be of longer evolutionary distance

Fig. 3 BRAKER pipelines for integration of protein information from closely related species: (A) Training GeneMark-ET supported by RNA-Seq spliced alignment information, prediction with AUGUSTUS with spliced alignment information from RNA-Seq data and with gene features determined by alignments from proteins of a very closely related species against the target genome. (B) Training AUGUSTUS on the basis of spliced alignment information from proteins of a very closely related species against the target genome. (C) Training GeneMark-ET on the basis of RNA-Seq spliced alignment information, training AUGUSTUS on a set of training gene structures compiled from RNA-Seq supported gene structures predicted by GeneMark-ET and spliced alignment of proteins of a very closely related species


gene structures, if any. While AUGUSTUS is capable of predicting partial genes in short sequences, AUGUSTUS training is performed on complete gene structures, only. Including very short contigs in a BRAKER run usually does not improve parameter training but increases runtime because all contigs still have to be processed in the prediction step. We recommend to exclude short contigs (i.e., rnaseq.bam

2.3.3 Protein Sequence File

BRAKER accepts files with protein sequences in FASTA-format. Please be aware that BRAKER will align those proteins to the target genome with GenomeThreader or other protein spliced aligners in order to use the alignment information for gene prediction purposes. Only protein sequences from species that are rather closely related align well to the target genome. It will not improve gene prediction results if you include proteins from very distantly related species in the protein sequence file. If you intend to use the protein information to generate training genes for AUGUSTUS, protein sequences must be of full length, i.e., not sequences of partial proteins.

2.3.4 Hints File

BRAKER accepts AUGUSTUS-specific hints files. Hints files contain extrinsic evidence that indicates certain features of protein-coding genes at certain positions in the genome. Hints files can contain information from RNA-Seq, protein alignments, manual annotation, and possibly many other sources. Hints files are in a tabulator-separated 9-column GFF-format. The following format example has been generated by the AUGUSTUS tool bam2hints that is used within BRAKER to generate hints from RNA-Seq BAM-files.


File contents example: RNAseq.hints
2R  b2h  intron  336478  343473  0  .  .  mult=3;pri=4;src=E
2R  b2h  intron  336480  343473  0  .  .  mult=11;pri=4;src=E
2R  b2h  intron  336482  343473  0  .  .  mult=2;pri=4;src=E
2R  b2h  intron  336658  427382  0  .  .  pri=4;src=E

Note that the last column contains a field mult=INT. This indicates the coverage information for a feature, e.g., the given intron in line 1 has support from three RNA-Seq reads. The field src=E indicates that this feature was extracted from expression data. The source tags correspond to an AUGUSTUS configuration file that contains weights on how to treat evidence from this particular source. Within BRAKER, four types of sources are handled by default:
• E for spliced read information from RNA-Seq,
• P for information from proteins,
• W for coverage information from RNA-Seq (from “wiggle” files),
• M for manual hints; BRAKER flags hints with M if they have support from the sources E and P. The prediction of genes with hints of type M is practically enforced by the corresponding parameters.

The third column of the hints file contains the feature name of a hint. The following features are currently supported by BRAKER: intron, start, stop, ass, dss, exonpart, exon, CDSpart, and nonexonpart (Repeats). The most important feature is intron, because it is the only feature that GeneMark-ES/ET uses for training. The features ass and dss are automatically derived from intron hints by AUGUSTUS. exon and exonpart features should only be used if UTR-training of AUGUSTUS has been enabled because if no UTR parameters are available, such hints may lead to false positive CDS/CDSpart predictions by AUGUSTUS. CDSpart and CDS features are typically generated from protein data of close homology when running BRAKER. nonexonpart hints are implicitly provided by using a soft-masked genome.

There are two typical use cases in which hints files are provided to BRAKER:

(a) Instead of RNA-Seq alignments in BAM-format, a hints file with the information from the BAM-file can be provided to BRAKER. Such a hints file is the first product that braker.pl generates from a BAM-file, using a tool called bam2hints.


Fig. 4 Current outline of the GaTech protein mapping pipeline that can be used to generate evidence for running GeneMark-EP

Users run bam2hints to extract hints from BAM-files prior to calling BRAKER for two reasons:

1. Parallelization: BRAKER runs bam2hints in parallel for provided BAM-files, but only on the number of cores that are provided to BRAKER. If the number of BAM-files is large, and additional cores are available but should not be allocated to BRAKER itself, running bam2hints separately on more cores may reduce computational time.
2. File size: hints files are much smaller than BAM-files. Depending on the computational environment, a small file size may be desired (e.g., in virtual environments).

Please note that BRAKER cannot train UTR parameters for AUGUSTUS if no BAM-file is provided.

(b) For running BRAKER with evidence from proteins of remote homology, such as generated by the GaTech protein mapping pipeline (see Fig. 4). GeneMark-ES/ET requires that in this case the hints file contains information on how many protein alignments cover a particular splice-site pair in column 6 (the value should be identical to the mult=INT value):


File contents example: ep.hints
2R  ProSplign  intron  5760114  5760177  8   -  .  src=P;mult=8;
2R  ProSplign  intron  6210484  6210546  13  -  .  src=P;mult=13;
2R  ProSplign  intron  8216329  8216383  6   +  .  src=P;mult=6;

In contrast to hints files from RNA-Seq alignments, the hints file for running BRAKER with intron evidence from proteins of remote homology must contain strand information in column 7.
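To illustrate the hints format described in this section, the following sketch parses one line of a 9-column hints file and checks the two extra requirements stated for intron hints from proteins: a strand in column 7 and a column 6 value equal to mult. The function names are hypothetical and are not part of BRAKER or AUGUSTUS; the example lines are taken from the file contents shown above, written with tab separators.

```python
def parse_hint(line):
    """Split one 9-column GFF hints line into a dict; attributes become a sub-dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    attributes = {}
    for item in cols[8].split(";"):
        if item:  # a trailing ';' leaves an empty item
            key, _, value = item.partition("=")
            attributes[key] = value
    return {
        "seq": cols[0], "source": cols[1], "feature": cols[2],
        "start": int(cols[3]), "end": int(cols[4]), "score": cols[5],
        "strand": cols[6], "frame": cols[7], "attributes": attributes,
    }


def check_protein_intron_hint(hint):
    """Return a list of problems for an intron hint from proteins (src=P)."""
    problems = []
    if hint["strand"] not in ("+", "-"):
        problems.append("strand missing in column 7")
    if hint["score"] != hint["attributes"].get("mult"):
        problems.append("column 6 differs from mult")
    return problems


rna = parse_hint("2R\tb2h\tintron\t336478\t343473\t0\t.\t.\tmult=3;pri=4;src=E")
prot = parse_hint("2R\tProSplign\tintron\t5760114\t5760177\t8\t-\t.\tsrc=P;mult=8;")
print(rna["attributes"]["mult"], check_protein_intron_hint(prot))
```

Running the check on an RNA-Seq hint such as the first example line would report both problems, which reflects the difference between the two kinds of hints files.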

3 BRAKER Gene Prediction, Step by Step

3.1 Installing and Configuring BRAKER

The BRAKER repository contains three directories:
• BRAKER/docs/ contains documentation on BRAKER, e.g., the file userguide.pdf that provides detailed installation and configuration instructions,
• BRAKER/scripts/ contains the BRAKER Perl scripts and modules, most importantly the script that executes BRAKER: braker.pl,
• BRAKER/examples/ contains example data for testing BRAKER.

An RNA-Seq alignment file in BAM-format (134 MB) that is required for some testing scenarios needs to be downloaded separately. It is available at http://bioinf.uni-greifswald.de/bioinf/braker/RNAseq.bam; you can download it, e.g., using the command line tool wget:

Bash input
$ cd BRAKER/examples
$ wget http://bioinf.uni-greifswald.de/bioinf/braker/RNAseq.bam

The example data set has been generated in order to demonstrate in rather short runtime that all software components work. It was not chosen to lead to good GeneMark-ES/ET and AUGUSTUS parameters or highly accurate gene predictions. In this book chapter, we will assume that you are working on an Ubuntu system in bash. If you need to install dependencies on another system, Ubuntu/Debian-specific package installation commands (sudo apt install . . .) might be different.


BRAKER requires Perl 5 (or newer). On Ubuntu, Perl is installed by default upon system installation. For installing the CPAN dependencies, we recommend the installation and usage of cpanminus:

Bash input
$ sudo apt install cpanminus

You can subsequently install the required CPAN-modules with a bash loop as follows:

Bash input
$ for module in File::Spec::Functions Hash::Merge List::Util Logger::Simple \
    Module::Load::Conditional Parallel::ForkManager POSIX Scalar::Util::Numeric \
    YAML; do
    sudo cpanm "$module"
  done

The easiest way to run and configure BRAKER is to add all programs and scripts that are called by BRAKER to your $PATH variable in a bash configuration script, such as the ~/.bashrc file. This will ensure that BRAKER automatically finds all required dependencies. In order to add any software to your $PATH, add or extend the PATH line at the bottom of your ~/.bashrc file. We demonstrate it here for the path to braker.pl only, but you can add the paths to all other executables in the same fashion; separate different paths by colons (:). You need to change your_path_to_braker to the actual path where braker.pl and the other scripts reside:

File contents example: ~/.bashrc
PATH=/your_path_to_braker/BRAKER/scripts:$PATH

When you start a new bash session, changes in the ~/.bashrc file are automatically loaded. If you continue to work in a session that had been opened before changing ~/.bashrc, you have to load the new configuration:

Bash input
$ source ~/.bashrc

You can test whether your changes have taken effect by:

1. Printing the $PATH variable in bash:

Bash input
$ echo $PATH

The result will look similar to this; you should find the directory that you just added to the $PATH definition:

Bash output
/your_path_to_braker/BRAKER/scripts:/usr/local/bin:/usr/bin:/bin

2. Checking whether the system finds the software that you just added to the $PATH, in our example braker.pl:

Bash input
$ which braker.pl

This command should return the full path to the executable, e.g.:

Bash output
/your_path_to_braker/BRAKER/scripts/braker.pl

If there is an empty return value, you most likely made a spelling mistake when extending the $PATH. The bioinformatics software tools that are called by BRAKER all have their own installation documentation. In case of doubt, we recommend that you read the individual documentation. In the following, we give short instructions and commands for a “typical installation” on Ubuntu that will work in most cases.


3.1.1 GeneMark-ES/ET

Download GeneMark-ES/ET from http://exon.gatech.edu/GeneMark/license_download.cgi. Unpack GeneMark-ES/ET:

Bash input
$ tar -xzf gm_et_linux_64.tar.gz

The resulting uncompressed folder contains a subdirectory, gm_et_linux_64/gmes_petap/, where executables reside. Add this directory to your $PATH. Move the file gm_key (separate download link on the website) to your home directory and make it a hidden file:

Bash input
$ mv gm_key ~/.gm_key

3.1.2 AUGUSTUS, SAMtools, and BamTools

AUGUSTUS consists of the actual binary program augustus and several small auxiliary tools, referred to as auxprogs, that need to be compiled from source. Install the Ubuntu package dependencies of AUGUSTUS and the third-party software that is needed in order to compile the auxprogs:

Bash input
$ sudo apt install libboost-iostreams-dev libboost-all-dev bamtools libbamtools-dev \
    autotools-dev autoconf

The tool bam2wig requires htslib, bcftools, and samtools from GitHub (the Makefile is currently not compatible with the Ubuntu package version of SAMtools). Download and install these tools as follows:

Bash input
$ git clone https://github.com/samtools/htslib.git
$ cd htslib
$ autoheader
$ autoconf
$ ./configure
$ make
$ sudo make install
$ cd ..
$ git clone https://github.com/samtools/bcftools.git
$ cd bcftools
$ autoheader
$ autoconf
$ ./configure
$ make
$ sudo make install
$ cd ..
$ git clone https://github.com/samtools/samtools.git
$ cd samtools
$ autoheader
$ autoconf -Wno-syntax
$ ./configure
$ make
$ sudo make install
$ cd ..

Export an environment variable TOOLDIR that points to the directory where the abovementioned tools reside (e.g., ~/):

Bash input
$ export TOOLDIR=~/

Obtain AUGUSTUS from GitHub and compile (default configuration is sufficient for BRAKER):

Bash input
$ git clone https://github.com/Gaius-Augustus/Augustus.git
$ cd Augustus
$ make

Binaries will be stored to a directory Augustus/bin/. Add the path to the AUGUSTUS binaries, the path to Augustus/scripts/, and the path to samtools (e.g., /usr/local/bin/) to your $PATH.

AUGUSTUS looks for configuration files (species-specific parameter files and others) in a directory Augustus/config/. The path to that location must be stored in an environment variable $AUGUSTUS_CONFIG_PATH. In case of BRAKER, the $AUGUSTUS_CONFIG_PATH must be a writable directory because BRAKER will store newly trained species parameter sets there. Add the following line to your ~/.bashrc file:

File contents example: ~/.bashrc
export AUGUSTUS_CONFIG_PATH=/your_path_to/Augustus/config/

Confirm that important executables can be found:

Bash input
$ which augustus
$ which optimize_augustus.pl
$ which samtools
$ which bamtools
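The same check can be wrapped in a small function that reports every missing tool in one pass. This is only a sketch: the function name check_tools is hypothetical, and it relies on nothing beyond the POSIX command -v lookup.

```shell
# check_tools NAME...
# Hypothetical helper: prints one status line per tool and returns non-zero
# if any tool cannot be found. command -v is the POSIX way to look up an
# executable in $PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            printf '%s\t%s\n' "$tool" "$path"
        else
            printf '%s\tNOT FOUND\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For the tools of this subsection, the call would be `check_tools augustus optimize_augustus.pl samtools bamtools`; a NOT FOUND line points to a directory that still has to be added to the $PATH.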

3.1.3 NCBI BLAST+

Install via the Ubuntu package system:

Bash input
$ sudo apt-get install ncbi-blast+

3.1.4 GenomeThreader

Download GenomeThreader from http://genomethreader.org/. Unpack it:

Bash input
$ tar -xzf gth-1.7.0-Linux_x86_64-64bit.tar.gz

Add the path to the directory containing the executable gth, which is located in gth-1.7.0-Linux_x86_64-64bit/bin/, to your $PATH. In addition, add the following lines to your ~/.bashrc file:

File contents example: ~/.bashrc
export BSSMDIR="${HOME}/gth-1.7.0-Linux_x86_64-64bit/bin/bssm"
export GTHDATADIR="${HOME}/gth-1.7.0-Linux_x86_64-64bit/bin/gthdata"

Replace ${HOME} by the location of GenomeThreader if it resides elsewhere. Confirm that the executable can be found:

Bash input
$ which gth

3.1.5 Configuration Options

In addition to adding tool locations to the $PATH variable and setting $AUGUSTUS_CONFIG_PATH, BRAKER offers two more ways to determine which binary of an external bioinformatics tool is executed:

• Command line options. All paths to tools can be provided as command line options when calling braker.pl. If command line options are provided, they take precedence over any other co-existing configuration. The options are:

--AUGUSTUS_CONFIG_PATH=/path/
--AUGUSTUS_BIN_PATH=/path/ (only required if the AUGUSTUS binaries do not reside in the default location relative to $AUGUSTUS_CONFIG_PATH)
--AUGUSTUS_SCRIPTS_PATH=/path/ (only required if the AUGUSTUS scripts do not reside in the default location relative to $AUGUSTUS_CONFIG_PATH)
--BAMTOOLS_PATH=/path/
--GENEMARK_PATH=/path/
--SAMTOOLS_PATH=/path/
--ALIGNMENT_TOOL_PATH=/path/ (this is the path to GenomeThreader)
--BLAST_PATH=/path/

• Environment variables. If environment variables have been exported and no corresponding command line option is used when calling braker.pl, the environment variables will be used instead of the location in $PATH. The environment variables can be added to your ~/.bashrc, similar to the $AUGUSTUS_CONFIG_PATH:

File contents example: ~/.bashrc
export GENEMARK_PATH=/path/
export AUGUSTUS_BIN_PATH=/path/
export AUGUSTUS_SCRIPTS_PATH=/path/
export BAMTOOLS_PATH=/path/
export BLAST_PATH=/path/
export SAMTOOLS_PATH=/path/
export ALIGNMENT_TOOL_PATH=/path/
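The precedence among the three configuration mechanisms (command line option, then environment variable, then $PATH) can be pictured as follows. This is a hypothetical illustration of the rule as described in this subsection, not code taken from braker.pl; the function name resolve_tool_dir and its argument order are assumptions.

```shell
# resolve_tool_dir CLI_VALUE ENV_VALUE EXECUTABLE
# Hypothetical illustration of the precedence described above, not the
# actual BRAKER code: a command line value wins, then the environment
# variable, then the directory in which the executable is found via $PATH.
resolve_tool_dir() {
    cli_value=$1
    env_value=$2
    executable=$3
    if [ -n "$cli_value" ]; then
        printf '%s\n' "$cli_value"
    elif [ -n "$env_value" ]; then
        printf '%s\n' "$env_value"
    elif found=$(command -v "$executable"); then
        dirname "$found"
    else
        return 1
    fi
}
```

For example, `resolve_tool_dir "" "${SAMTOOLS_PATH:-}" samtools` prints the exported $SAMTOOLS_PATH if it is set and otherwise falls back to the directory of the samtools executable found in $PATH.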

3.2 Running BRAKER

BRAKER is executed by calling the script braker.pl. The following command line options can be relevant for running BRAKER:

• --genome=genome.fa assigns the FASTA file with genomic sequences of the target species.

• --species=speciesname specifies the species name that should be used to store species-specific parameters; in most modes, this is an optional argument. If it is not provided, BRAKER will generate a species name with the pattern Sp_INT where INT is an integer that has not previously been used on your system to name AUGUSTUS parameter sets. We recommend setting a descriptive name because once trained, the parameter set can be reused for running AUGUSTUS.

• --softmasking should be specified if the genome has been soft-masked. It must be specified if UTRs shall be trained from RNA-Seq data. We recommend using soft-masked genomes and enabling this flag for all BRAKER runs.

• --gff3 stores BRAKER output gene models in GFF3-format.

• --cores=INT specifies the maximum number of cores that can be used during computation. Be aware: Reserving a very large number of cores might be a waste of resources, because most cores will be idle during a large proportion of run time. We recommend the usage of 8 cores (because optimize_augustus.pl carries out a k-fold cross-validation with k = 8).

• --fungus GeneMark-ES/ET option: run algorithm with fungal branch point model.

• --crf executes discriminative training using conditional random fields (CRF) within AUGUSTUS; resulting parameters are only kept for final predictions if they show higher accuracy than hidden Markov model (HMM) parameters. The additional step of CRF training increases run time.

• --keepCrf keeps and uses CRF parameters even if they are not better than HMM parameters.

• --AUGUSTUS_ab_initio will, if extrinsic evidence is provided and used for predicting genes with AUGUSTUS, execute an additional AUGUSTUS run without the evidence. Results are stored in an output file augustus.ab_initio.gtf.


BRAKER will create a directory braker/speciesname/ relative to where BRAKER was called. This directory will contain all results and a log file braker.log that lists all commands and subprocesses initiated by BRAKER. Instead of braker/speciesname/, you may specify a different location to store results of your BRAKER run with --workingdir=DIRECTORY. The most important result files in the output folder are:

• augustus.hints.gtf contains genes predicted by AUGUSTUS with extrinsic evidence in GTF-format (the file will not be produced if BRAKER is executed with --esmode and no extrinsic evidence). AUGUSTUS reports gene and transcript as separate features in the GTF-file. AUGUSTUS may predict alternative transcripts, i.e., in addition to the transcript g20.t1 in the example below, a transcript g20.t2 could be reported.

File contents example: augustus.hints.gtf
IV  AUGUSTUS  gene         126732  127514  0.99  +  .  g20
IV  AUGUSTUS  transcript   126732  127514  0.99  +  .  g20.t1
IV  AUGUSTUS  start_codon  126732  126734  .     +  0  transcript_id "g20.t1"; gene_id "g20";
IV  AUGUSTUS  CDS          126732  126880  0.99  +  0  transcript_id "g20.t1"; gene_id "g20";
IV  AUGUSTUS  exon         126732  126880  .     +  .  transcript_id "g20.t1"; gene_id "g20";
IV  AUGUSTUS  intron       126881  127390  1     +  .  transcript_id "g20.t1"; gene_id "g20";
IV  AUGUSTUS  CDS          127391  127514  1     +  1  transcript_id "g20.t1"; gene_id "g20";
IV  AUGUSTUS  exon         127391  127514  .     +  .  transcript_id "g20.t1"; gene_id "g20";
IV  AUGUSTUS  stop_codon   127512  127514  .     +  0  transcript_id "g20.t1"; gene_id "g20";

If the command line option --gff3 has been used, a file augustus.hints.gff3 (or in --esmode augustus.ab_initio.gff3) with the same content in GFF3-format will be available.

File contents example: augustus.hints.gff3
IV  AUGUSTUS  gene         126732  127514  0.99  +  .  ID=g20;
IV  AUGUSTUS  transcript   126732  127514  0.99  +  .  ID=g20.t1;Parent=g20
IV  AUGUSTUS  start_codon  126732  126734  .     +  0  Parent=g20.t1;
IV  AUGUSTUS  CDS          126732  126880  0.99  +  0  ID=g20.t1.CDS1;Parent=g20.t1
IV  AUGUSTUS  exon         126732  126880  .     +  .  ID=g20.t1.exon1;Parent=g20.t1
IV  AUGUSTUS  intron       126881  127390  1     +  .  Parent=g20.t1;
IV  AUGUSTUS  CDS          127391  127514  1     +  1  ID=g20.t1.CDS2;Parent=g20.t1
IV  AUGUSTUS  exon         127391  127514  .     +  .  ID=g20.t1.exon2;Parent=g20.t1
IV  AUGUSTUS  stop_codon   127512  127514  .     +  0  Parent=g20.t1;

• GeneMark-E*/genemark.gtf contains genes predicted by GeneMark-ES/ET in GTF-format (the file will not be produced if BRAKER is executed with --trainFromGth).

File contents example: genemark.gtf
IV  GeneMark.hmm  exon         5936236  5936451  0  +  .  gene_id "70_g"; transcript_id "70_t";
IV  GeneMark.hmm  start_codon  5936236  5936238  .  +  0  gene_id "70_g"; transcript_id "70_t";
IV  GeneMark.hmm  CDS          5936236  5936451  .  +  0  gene_id "70_g"; transcript_id "70_t";
IV  GeneMark.hmm  exon         5936968  5937053  0  +  .  gene_id "70_g"; transcript_id "70_t";
IV  GeneMark.hmm  CDS          5936968  5937053  .  +  0  gene_id "70_g"; transcript_id "70_t";
IV  GeneMark.hmm  exon         5937100  5937445  0  +  .  gene_id "70_g"; transcript_id "70_t";
IV  GeneMark.hmm  CDS          5937100  5937445  .  +  1  gene_id "70_g"; transcript_id "70_t";
IV  GeneMark.hmm  stop_codon   5937443  5937445  .  +  0  gene_id "70_g"; transcript_id "70_t";

GeneMark-E*/ is a subdirectory in the BRAKER output folder. The star will be replaced by the particular version of GeneMark-ES/ET that was executed by BRAKER (e.g., GeneMark-ET/).

• hintsfile.gff contains the extrinsic evidence data extracted from RNA-Seq and/or protein data. The introns are used for training GeneMark-ES/ET, while all features are used for predicting genes with AUGUSTUS. The file is in GFF-format (example given in Subheading 2.3.4).

The new species-specific AUGUSTUS parameters are stored in a directory ${AUGUSTUS_CONFIG_PATH}/species/speciesname/ and can be reused for running AUGUSTUS (also independent from BRAKER). Concerning the accuracy of results, see Note 2.
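As a first sanity check of an AUGUSTUS GTF result file, it can help to count the predicted genes and transcripts and to list genes with alternative transcripts. The sketch below is not part of BRAKER; the function name summarize_augustus_gtf is hypothetical, and it assumes tab-separated columns and transcript identifiers of the form GENE.tN in column 9 of transcript lines, as in the augustus.hints.gtf example.

```shell
# summarize_augustus_gtf FILE
# Hypothetical helper: counts gene and transcript features in an AUGUSTUS
# GTF file and lists genes with more than one transcript. Assumes
# tab-separated columns and transcript IDs of the form GENE.tN in column 9.
summarize_augustus_gtf() {
    awk -F '\t' '
        $3 == "gene"       { genes++ }
        $3 == "transcript" {
            transcripts++
            gene = $9
            sub(/\.t[0-9]+$/, "", gene)   # g20.t1 -> g20
            per_gene[gene]++
        }
        END {
            printf "genes\t%d\ntranscripts\t%d\n", genes, transcripts
            for (g in per_gene)
                if (per_gene[g] > 1)
                    printf "alternative\t%s\t%d\n", g, per_gene[g]
        }
    ' "$1"
}
```

Calling `summarize_augustus_gtf augustus.hints.gtf` prints the two totals followed by one line per gene that has alternative transcripts. Such counts do not replace the visual inspection recommended in Note 2.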

3.2.1 Genome File Only

If only the genome sequence is available but no extrinsic data that can be used by GeneMark-ES/ET for model refinement during training, self-training GeneMark-ES is executed with the genome as sole input. Genes predicted by GeneMark-ES with a coding sequence longer than 800 nt are selected for training AUGUSTUS. AUGUSTUS predicts genes in the genomic sequences ab initio (see Fig. 2A). In large genomes this approach may have lower accuracy compared to all other modes of running BRAKER.

Before choosing this approach, consider that running BRAKER with hints from proteins of remote homology (see Subheading 3.2.3) can be expected to improve prediction accuracy. Also consider that RNA-Seq data for your species might be available in the Sequence Read Archive (GenBank, NCBI). Running BRAKER with such data might therefore also be an option (see Subheading 3.2.2). The genome file only approach is practical when no suitable evidence is available or if computational time is a limiting factor.

The accuracy of the ab initio self-training depends on the clade. It is best for genomes with homogeneous genes, such as those from fungi and protists. On the other end of the spectrum, in mammalian genomes, the current self-training algorithm does not produce reliable results due to genome inhomogeneity (about 40% variance in the gene GC-content). In plants and animals with a narrower range of inhomogeneity (e.g., insects where 90% of genes vary in GC-content by no more than 10%), the self-training produced gene predictions with decent accuracy.

The command line option for running this pipeline is --esmode (derived from the tool name GeneMark-ES). A minimal command would be:

braker.pl --genome=genome.fa --esmode

The pipeline can be applied to the soft-masked example genome sequence as follows:

Bash input
$ braker.pl --genome=genome.fa --esmode --softmasking

If BRAKER is run with --esmode, then the AUGUSTUS output file is not named augustus.hints.gtf but augustus.ab_initio.gtf.

3.2.2 With Evidence from RNA-Seq Alignment Data

If a genome sequence and corresponding RNA-Seq alignments (from the same species) are available, GeneMark-ET is executed. GeneMark-ET uses information about putative splice sites from spliced RNA-Seq read alignments in order to enhance training. In particular, GeneMark-ET uses the information of how many alignments support an individual splice site. For this reason, BRAKER should not be executed with alignments of assembled RNA-Seq data: The information on how many reads support a putative splice site will be lost during the assembly step. After training on the filtered GeneMark-ET gene set, AUGUSTUS predicts genes using RNA-Seq spliced alignments as extrinsic evidence for introns (see Fig. 2B). If the AUGUSTUS training of untranslated regions is enabled, RNA-Seq coverage information will additionally be integrated. Please note that this is the only mode that currently allows training and prediction of untranslated regions of genes with BRAKER.


In order to run BRAKER with RNA-Seq data supplied as BAM-file(s) (in case of multiple files, separate them by comma), call BRAKER with the following minimal set of options:

braker.pl --genome=genome.fa --bam=file1.bam,file2.bam

The pipeline is applicable to the example data set as follows:

Bash input
$ braker.pl --genome=genome.fa --bam=RNAseq.bam --softmasking
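When many BAM-files are available, the comma-separated value for --bam can be assembled programmatically rather than typed by hand. The following sketch is an illustration only: the function name bam_list is hypothetical, and it assumes that all BAM-files reside in one directory and that their names contain no spaces, commas, or newlines.

```shell
# bam_list DIR
# Hypothetical helper: prints all *.bam files in DIR as one comma-separated
# string, the format that --bam expects for multiple files.
# Assumes file names without spaces, commas, or newlines.
bam_list() {
    list=""
    for f in "$1"/*.bam; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern if no BAM exists
        list="${list:+$list,}$f"
    done
    printf '%s\n' "$list"
}
```

With a hypothetical directory alignments/, the call becomes `braker.pl --genome=genome.fa --bam="$(bam_list alignments)" --softmasking`.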

If you wish to incorporate RNA-Seq coverage information into AUGUSTUS predictions, the command line option --UTR=on will lead to an attempt to construct UTR training examples from information in the RNA-Seq BAM-file. If a sufficient number of training structures can be generated, species-specific UTR parameters will be trained for AUGUSTUS. Subsequently, AUGUSTUS will predict genes including coverage information and with UTRs. The file augustus.hints_utr.gtf will contain the final gene models. Note: UTR training will fail for the provided example data set because it does not contain sufficient information for constructing a large number of training UTRs.

Depending on local computational resources and the number of BAM-files, some users prefer to carry out bam2hints conversion before running BRAKER as follows:

Bash input
$ bam2hints --intronsonly --in=RNAseq.bam --out=RNAseq.hints

The example data set contains a prepared RNA-Seq hints file and the pipeline can be tested as follows:

Bash input
$ braker.pl --genome=genome.fa --hints=RNAseq.hints --softmasking

The training of UTR parameters is not possible on the basis of hints files.

3.2.3 With Evidence Generated by Mapping Cross-Species Proteins

If RNA-Seq data is not available, spliced alignments of protein families can provide evidence that is formally similar to the information about introns from RNA-Seq alignments: genomic coordinates and a count of how many alignments support a particular splice junction. Proteins of closely related species serve well as informants about splice junctions, but this approach is also suitable if the phylogenetic distance between target and informant species increases. Full-length alignability of informant proteins and target genome is not required. The gene prediction accuracy of BRAKER with this type of evidence may be lower than with RNA-Seq evidence.

Constructing spliced alignments of a large number of proteins to a large genome is computationally expensive. Therefore, GeneMark-ES is used to generate predicted proteins that can be searched for similarity to protein family members with BLAST. Genomic sequences that were predicted to carry proteins with resulting BLAST hits can be aligned to their hit proteins with a spliced aligner, such as ProSplign [30]. Intron evidence can be extracted from the spliced alignments. A possible pipeline is outlined in Fig. 4 (Tomas Bruna, Alexandre Lomsadze, and Mark Borodovsky, available for download at http://exon.gatech.edu/GeneMark/Braker/protein_mapping_pipeline.tar.gz).

It is preferable that the protein database contains many representatives of a single family. Suitable databases with orthologous gene clusters are, e.g., eggNOG [31] or OrthoDB [32]. One can in principle use larger databases, such as RefSeq, provided that the computational time is acceptable.

A protein mapping pipeline for generating hints from proteins of remote homology is not part of BRAKER. Instead, BRAKER runs with the externally generated hints file. Experienced BRAKER users have reported that they generated suitable hints files with their own mapping pipelines.

The conceptual design of the BRAKER pipeline with evidence from proteins of remote homology is depicted in Fig. 2C. For calling BRAKER with a hints file from proteins of remote homology, provide the option --epmode, which will ensure that GeneMark-EP from the GeneMark-ES/ET tool suite is called:

braker.pl --genome=genome.fa --hints=ep.hints --epmode

The example data set contains a suitable hints file, and BRAKER can be called with it as follows:

Bash input
$ braker.pl --genome=genome.fa --hints=ep.hints --epmode --softmasking


3.2.4 With Evidence by Mapping Cross-Species Proteins and RNA-Seq Alignments

Using remotely related proteins in addition to RNA-Seq always increases the accuracy, although the running time increases significantly. If both data sources are used, we refer to the GeneMark-ES/ET tool as GeneMark-ETP. Intron information that is present in both data sources is weighted as reliable evidence, and prediction of genes with this information is enforced both in GeneMark-ETP and in AUGUSTUS. The BRAKER pipeline for this mode is shown in Fig. 2D.

From the AUGUSTUS point of view, two separate gene prediction runs are performed after training (see Fig. 5):

1. In one run, AUGUSTUS runs with evidence from RNA-Seq and with evidence provided by both RNA-Seq and proteins (evidence from proteins only is not used).

2. In another run, AUGUSTUS runs with protein and RNA-Seq evidence, and protein evidence is given higher priority.

In both runs, introns provided by both evidence sources are enforced. Subsequently, the gene models of both runs are merged with the AUGUSTUS tool joingenes. The reason for running AUGUSTUS twice is to increase sensitivity. In practice, we observed that a small proportion of gene models that have support from RNA-Seq data only get lost if AUGUSTUS is run with both evidence sources in one run.

For calling BRAKER with both sources of evidence, provide evidence from both sources and specify the option --etpmode. A minimal call would look like this:

braker.pl --genome=genome.fa --hints=ep.hints --bam=RNAseq.bam --etpmode

As an alternative to providing the RNA-Seq evidence in a BAM-file, it can also be provided in a hints file (separately or merged with other hints):

braker.pl --genome=genome.fa --hints=ep.hints,RNAseq.hints --etpmode

The pipeline can be tested with the example data set as follows:

Bash input
$ braker.pl --genome=genome.fa --hints=ep.hints --bam=RNAseq.bam --etpmode --softmasking
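To get a feeling for how much intron evidence is shared between the two sources, and would therefore count as reliable in this mode, the two hints files can be compared by coordinates. The sketch below is a hypothetical illustration and not the filter that BRAKER applies internally: the function name shared_introns is an assumption, both files are assumed to be tab-separated GFF, and strand is ignored because RNA-Seq hints do not necessarily carry it.

```shell
# shared_introns RNASEQ_HINTS PROTEIN_HINTS
# Hypothetical illustration, not the internal BRAKER filter: prints intron
# hints from the protein file whose sequence name, start, and end also
# occur as an intron in the RNA-Seq file. Strand is ignored.
shared_introns() {
    awk -F '\t' '
        $3 != "intron" { next }
        { key = $1 ":" $4 ":" $5 }
        NR == FNR { seen[key] = 1; next }   # first file: remember introns
        (key in seen)                       # second file: print matches
    ' "$1" "$2"
}
```

For the example data, `shared_introns RNAseq.hints ep.hints | wc -l` would give the number of protein-derived introns that are also supported by RNA-Seq alignments.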


3.2.5 Evidence from Proteins of Close Homology

It is well established that alignments of proteins of closely related species to the target genome are helpful to genome annotation. The general approach is employed by many tools (e.g., Scipio [18] and GenomeThreader [19]) and pipelines (e.g., MAKER [11, 12] and WebAUGUSTUS [8]). From the BRAKER perspective, using this type of extrinsic evidence is merely a side-project because other resources in principle already satisfy the needs of users for accomplishing this task. Nevertheless, three different pipeline modes that incorporate proteins of close homology are implemented in BRAKER (see Fig. 3).

If a file with protein sequences in FASTA-format is provided with the argument --prot_seq=FILE, BRAKER executes alignment of those proteins against the target genome. BRAKER in principle supports GenomeThreader (--prg=gth), Exonerate [33] (--prg=exonerate), and Spaln2 (--prg=spaln) [34–36]. We recommend GenomeThreader because it is faster than Exonerate and more accurate than Spaln2. BRAKER is routinely tested with GenomeThreader only. The argument --prg=TOOLNAME (TOOLNAME can be gth for GenomeThreader, exonerate, or spaln) must be provided if a protein sequence file is given. BRAKER will generate hints for introns, parts of CDS, start codons, and stop codons from protein alignments.

Please be aware that GenomeThreader will only confidently align proteins that are fairly closely related to the target species. Additional accuracy can be expected to diminish for informant species whose protein homologs are less than 80% identical on average (approximately the distance between Drosophila melanogaster and Drosophila pseudoobscura). In our experience, providing a large number of protein sequences from several rather distantly related species to BRAKER for running GenomeThreader increases run time, but not prediction accuracy.

If both RNA-Seq alignment and protein sequence evidence is provided, AUGUSTUS is run twice, as described in Subheading 3.2.4 and depicted in Fig. 5. In the following, we describe three different ways to call BRAKER with proteins of close homology and refer the reader to Note 3.

(A) Evidence from RNA-Seq Alignments for Training, Additional Evidence from Protein Alignments for Prediction

If RNA-Seq evidence is available, GeneMark-ET usually performs very well and produces a high-quality training gene set for AUGUSTUS. Evidence from proteins of close homology can be added to the RNA-Seq evidence during the prediction step with AUGUSTUS in order to increase prediction accuracy; in this setup, proteins are not used for training AUGUSTUS (see Fig. 3A). A minimal BRAKER call with proteins of close homology, the aligner GenomeThreader, and RNA-Seq data with the example data looks like this:

Bash input
$ braker.pl --genome=genome.fa --bam=RNAseq.bam --prot_seq=prot.fa --prg=gth --softmasking

Fig. 5 The AUGUSTUS runs of the BRAKER pipeline, which are depicted in the preceding figures as a single box (left side), actually comprise two separate runs (right side) if both RNA-Seq and protein evidence is provided. First, evidence that occurs in both sources is filtered and weighted as reliable, i.e., manual hints for AUGUSTUS (green arrows). This reliable evidence is merged with the remaining RNA-Seq hints for a first AUGUSTUS run (red arrows). For another run, priority 5 is assigned to protein hints, and priority 4 is assigned to RNA-Seq hints (orange arrows), and both sets of hints are merged with the reliable hints (blue arrows). These hints are used for a second AUGUSTUS run. The results of both runs are merged by the AUGUSTUS tool joingenes in a nonredundant fashion.

(B) Evidence from Proteins of Close Homology Only

GenomeThreader produces complete gene structures when aligning proteins to the genome. In the absence of RNA-Seq data, and if proteins of a very closely related species are available, using the protein alignment derived gene models for training of AUGUSTUS and predicting genes with AUGUSTUS and protein evidence in BRAKER is an alternative to WebAUGUSTUS or pipelines such as GeMoMa [37, 38] (GeMoMa uses the genome and gene coordinates of an informant species rather than its protein sequences). The BRAKER pipeline with GenomeThreader and AUGUSTUS for proteins of close homology is illustrated in Fig. 3B.

If this mode is chosen, specified by the command line argument --trainFromGth, GeneMark-ES/ET will not be executed. A minimal call for running BRAKER in this mode is:

Bash input
$ braker.pl --genome=genome.fa --prot_seq=prot.fa --prg=gth --trainFromGth --softmasking

(C) Evidence from RNA-Seq Alignments and Evidence from Proteins of Close Homology for Training and Prediction

In addition to combining evidence from RNA-Seq and proteins of close homology as described, BRAKER can combine the GeneMark-ET RNA-Seq gene set with the gene structures produced by GenomeThreader protein alignment and use a combined set for training AUGUSTUS. Both sources of evidence are then used in the AUGUSTUS gene prediction step, too (see Fig. 3C). The command line option to add GenomeThreader produced genes to the gene set for training AUGUSTUS is --gth2traingenes. It can be applied to the example data set as follows:

Bash input
$ braker.pl --genome=genome.fa --prot_seq=prot.fa --prg=gth --bam=RNAseq.bam \
    --gth2traingenes --softmasking

In principle, training gene structures derived from RNA-Seq data with GeneMark-ET and training gene structures from the alignment of proteins of close homology with GenomeThreader should, when used together, improve AUGUSTUS parameters during training. However, it appears that the genes that contribute most to training AUGUSTUS are included in both sets, and the genes that can be added from proteins do not improve training. This observation has been confirmed and reported to us by independent users. We advise users who choose this approach to carefully compare the results of their BRAKER run with the results of a BRAKER run that excludes GenomeThreader genes from the training step as described above.

3.2.6 Using BRAKER to Execute AUGUSTUS with Pretrained Parameters

If a high-quality AUGUSTUS parameter set for a particular species already exists (produced by BRAKER or other sources), BRAKER can be used to process and integrate extrinsic evidence from RNA-Seq alignments, from proteins of remote homology, and from protein sequences of close homology with AUGUSTUS.


Running GeneMark-ES/ET and training AUGUSTUS are skipped in this case. The existing parameter set must be specified with --species=speciesname (speciesname = parameter set name), and --skipAllTraining will bypass execution of GeneMark-ES/ET and training of AUGUSTUS. This option can be applied to all BRAKER running modes. You may test this with, e.g., the fly parameter set and the example RNA-Seq BAM-file:

Bash input
$ braker.pl --genome=genome.fa --bam=RNAseq.bam --species=fly --skipAllTraining --softmasking

BRAKER by default creates the output directory braker/speciesname. This can be disadvantageous if you wish to run similar tasks for the same parameter set with different extrinsic evidence combinations or genomes. We therefore recommend specifying an output directory specific to the particular BRAKER run with --workingdir=DIRECTORY.

3.2.7 Training and Predicting UTRs on the Basis of an Existing BRAKER Run

Since training UTR parameters for AUGUSTUS is a functionality that has been added to BRAKER rather recently, it might currently be a common use case to update an existing BRAKER run with UTR training and AUGUSTUS predictions that integrate coverage information from RNA-Seq. In order to do this, the existing parameter set must be specified with --species=speciesname, and the genome file and RNA-Seq BAM-file must be provided. The option --useexisting will tell BRAKER to modify the existing species parameter set. The argument --AUGUSTUS_hints_preds=augustus.hints.gtf must point to the already existing BRAKER output file of AUGUSTUS in GTF-format. --UTR=on enables UTR training. The argument --flanking_DNA=INT refers to the size of the genomic noncoding flanking region around training genes. It must be provided. In a full BRAKER run, a suitable size is determined automatically. You can extract it from the old braker.log file:

Bash input
$ grep gff2gbSmallDNA.pl braker.log | perl -ne 'm/\s(\d+)\s/; print "$1\n";'

The return value might look similar to the one below:

Bash output
1178

A call for training UTR parameters and performing gene predictions with UTR parameters and coverage information could look like this:

braker.pl --species=Sp_1 --useexisting --genome=genome.fa --bam=RNAseq.bam \
    --AUGUSTUS_hints_preds=augustus.hints.gtf --UTR=on --flanking_DNA=1178

4 Notes

1. The most frequently reported problems with running BRAKER have their source in using outdated AUGUSTUS scripts. Please always use up-to-date AUGUSTUS (from https://github.com/Gaius-Augustus/Augustus) with BRAKER. BRAKER may not be compatible with AUGUSTUS versions provided from other sources.

2. The accuracy of results always depends on the input files and on the properties of the individual species. We strongly advise inspecting the results file augustus.hints.gtf (or augustus.hints_utr.gtf) in context with the available extrinsic evidence in a visualizing genome browser, e.g., the UCSC genome browser [39], JBrowse [40], or Artemis [41].

3. Recently, we have expanded GeneMark-EP to the “close species case” when a set of reliably annotated homologous proteins is available to generate the extrinsic evidence. Therefore, use of GeneMark-EP instead of GenomeThreader for such cases (Fig. 3) is an option.

Acknowledgements This work is supported in part by the US National Institutes of Health grant HG000783 to MB, by the German Research Foundation grant 1009/12-1 to MS and by the US National Institutes of Health grant GM128145 to MB and MS.


References

1. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2015) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32(5):767–769
2. Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 33(20):6494–6506
3. Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M (2008) Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res 18:1979–1990. https://doi.org/10.1101/gr.081612.108
4. Lomsadze A, Burns PD, Borodovsky M (2014) Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res 42(15):e119
5. Stanke M, Schöffmann O, Dahms St, Morgenstern B, Waack S (2006) Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinf 7:62
6. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B (2006) AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res 34:W435–W439
7. Stanke M, Steinkamp R, Waack S, Morgenstern B (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res 32:W309–W312
8. Hoff KJ, Stanke M (2013) WebAUGUSTUS – a web service for training AUGUSTUS and predicting genes in eukaryotes. Nucleic Acids Res 41(W1):W123–W128
9. König S, Romoth LW, Gerischer L, Stanke M (2016) Simultaneous gene finding in multiple genomes. Bioinformatics 32(22):3388–3395
10. Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24(5):637–644
11. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18(1):188–196
12. Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinf 12(1):491
13. Abbott A (2005) Competition boosts bid to find human genes. Nature 435:134
14. Guigó R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, Harrow J, Hubbard T, Lewis SE, Reese MG (2006) EGASP: the human ENCODE Genome Annotation Assessment Project. Genome Biol 7(1):S2
15. Stanke M, Tzvetkova A, Morgenstern B (2006) AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol 7(1):S11
16. Coghlan A, Fiedler T, McKay S, Flicek P, Harris T, Blasiar D, the nGASP Consortium, Stein L (2008) nGASP - the nematode genome annotation assessment project. BMC Bioinf 9(1):549
17. Steijger T, Abril JF, Engstrom PG, Kokocinski F, Akerman M, Alioto T, Ambrosini G, Antonarakis SE, Behr J, Bohnert R, Bucher P, Cloonan N, Derrien T, Djebali S, Du J, Dudoit S, Gerstein M, Gingeras TR, Gonzalez D, Grimmond SM, Habegger L, Iseli C, Jean G, Kahles A, Lagarde J, Leng J, Lefebvre G, Lewis S, Mortazavi A, Niermann P, Rätsch G, Reymond A, Ribeca P, Richard H, Rougemont J, Rozowsky J, Sammeth M, Sboner A, Schulz MH, Searle SMJ, Solorzano ND, Solovyev V, Stanke M, Steijger T, Stevenson BJ, Stockinger H, Valsesia A, Weese D, White S, Wold BJ, Wu J, Wu TD, Zeller G, Zerbino D, Zhang MQ, Hubbard TJ, Guigo R, Harrow J, Bertone P (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10(12):1177–1184
18. Keller O, Odronitz F, Stanke M, Kollmar M, Waack S (2008) Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinf 9(1):278
19. Gremme G (2013) Computational gene structure prediction. PhD thesis, Universität Hamburg
20. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31(19):5654–5666
21. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) A basic local alignment search tool. J Mol Biol 215(3):403–410
22. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL (2009) BLAST+: architecture and applications. BMC Bioinf 10(1):421
23. Barnett DW, Garrison EK, Quinlan AR, Strömberg MP, Marth GT (2011) BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27(12):1691–1692
24. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
25. Chen N (2004) Using RepeatMasker to identify repetitive elements in genomic sequences. Curr Protoc Bioinf 5(1):4.10.1–4.10.14
26. Price AL, Jones NC, Pevzner PA (2005) De novo identification of repeat families in large genomes. Bioinformatics 21(Suppl 1):i351–i358
27. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
28. Daehwan K, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14(4):R36
29. Wu TD, Nacu S (2010) Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26(6):873–881
30. Kapustin Y, Souvorov A, Tatusova T, Lipman D (2008) Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 3(1):20
31. Powell S, Szklarczyk D, Trachana K, Roth A, Kuhn M, Muller J, Arnold R, Rattei T, Letunic I, Doerks T, et al (2011) eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res 40(D1):D284–D289
32. Waterhouse RM, Tegenfeldt F, Li J, Zdobnov EM, Kriventseva EV (2012) OrthoDB: a hierarchical catalog of animal, fungal and bacterial orthologs. Nucleic Acids Res 41(D1):D358–D365
33. Slater GSC, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinf 6(1):31
34. Gotoh O (2008) Direct mapping and alignment of protein sequences onto genomic sequence. Bioinformatics 24(21):2438–2444
35. Gotoh O (2008) A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Res 36(8):2630–2638
36. Iwata H, Gotoh O (2012) Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res 40(20):e161
37. Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F (2016) Using intron position conservation for homology-based gene prediction. Nucleic Acids Res 44(9):e89
38. Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinf 19(1):189
39. Casper J, Zweig AS, Villarreal C, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, Lee CM, Lee BT, Karolchik D et al (2017) The UCSC genome browser database: 2018 update. Nucleic Acids Res 46(D1):D762–D769
40. Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH (2009) JBrowse: a next-generation genome browser. Genome Res 19(9):1630–1638. https://doi.org/10.1101/gr.094607.109
41. Carver T, Harris SR, Berriman M, Parkhill J, McQuillan JA (2011) Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data. Bioinformatics 28(4):464–469

Chapter 6

EuGene: An Automated Integrative Gene Finder for Eukaryotes and Prokaryotes

Erika Sallet, Jérôme Gouzy, and Thomas Schiex

Abstract

EuGene is an integrative gene finder applicable to both prokaryotic and eukaryotic genomes. EuGene annotated its first genome in 1999. Starting from genomic DNA sequences representing a complete genome, EuGene is able to predict the major transcript units in the genome from a variety of sources of information: statistical information, similarities with known transcripts and proteins, but also any GFF3 structured information supporting the presence or absence of specific types of elements. EuGene has been used to find genes in the plants Arabidopsis thaliana, Medicago truncatula, and Theobroma cacao; in the tomato, sunflower, and Rosa genomes; and in the nematode Meloidogyne incognita genome, among many others. The large fraction of plants in this list probably influenced EuGene development, especially its capacity to withstand genomes with a large number of repeated regions and transposable elements. Depending on the sources of information used for prediction, EuGene can be considered as purely ab initio, purely similarity based, or hybrid. With the general availability of NGS-transcribed sequence data in genome projects, EuGene adopts a default hybrid behavior that strongly relies on similarity information. Initially targeted at eukaryotic genomes, EuGene has also been extended to offer integrative gene prediction for bacteria, allowing for richer and more robust predictions than either purely statistical or homology-based prokaryotic gene finders. This text has been written as a practical guide that will give you the capacity to train and execute EuGene on your favorite eukaryotic genome. As the prokaryotic case is simpler and has already been described, only the main differences from the eukaryotic version are reported.
Key words Integrative gene finder, Prokaryotic and eukaryotic genomes, Protein-coding genes, Noncoding genes, EuGene

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_6, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

1.1 What Is EuGene

EuGene is a machine learning-based integrative gene finder. It aims at annotating a complete genome sequence with a precise delineation of likely transcribed or translated regions on each strand of the input genome. It is therefore not limited to the prediction of protein-coding exons and introns but can also predict UTRs and ncRNA genes. To maximize the quality of the prediction, EuGene has been designed to easily incorporate a variety of information. It immediately departed from the pure ab initio HMM-based approach, which relies exclusively on statistical sequence information, to also incorporate multiple features that are directly related to the annotation. Each type of information can be integrated in a modular software architecture based on the use of plug-ins. In its current standard usage, EuGene incorporates:

• Probabilistic models of intergenic, intronic, transcribed, and translated regions and their boundaries such as splice sites. These models are typically automatically trained for every new genome but can also be reused from models that have been trained on a family of genomes (group of related species).

• Information extracted from alignment/mapping of sequenced transcripts (RNA-seq, short/long or assembled transcripts) that are considered as evidence of transcription (be it for ncRNA or protein genes) and (possibly non-canonical) splicing.

• Information on similarities with known protein sequences, which is considered as an indication of protein-coding regions.

• Information on repeats identified by dedicated repeat detection tools. These regions are automatically delineated and masked for prediction but can be locally unmasked if similarities to a transcriptome or proteome exist.

• Output of existing specialized genome analysis tools that can be informative on various region types (e.g., the output of specialized gene finders for ncRNA genes is directly integrated by EuGene).

With the general availability of NGS-transcribed sequence data in genome projects, EuGene now adopts a default hybrid behavior that exploits similarities with expressed sequences and proteins to automatically build training sets and also to perform its prediction on the whole genome. Because of its wide availability and its high quality and quantity, RNA-seq has replaced several previously used types of information (e.g., conserved regions between different genomes [1]). Thanks to a very general plug-in named "AnnotaStruct", EuGene can integrate various GFF3 information on gene structures. This makes it capable of integrating new or unusual data types very easily. This plug-in is used, for example, to integrate the output of the ncRNA gene predictors tRNAscan-SE [2], RNAmmer [3], and cmsearch on Rfam [4]. None of these sources of information is considered absolutely certain inside EuGene. Instead, each source of information is weighted before it is integrated in the prediction process. This weighting is controlled by a few (typically one or two) parameters. These parameters are preset to values that have previously been optimized on a group of species and can be used directly for a new species of the same group.


1.2 How Does EuGene Work


EuGene has been designed using a pure optimization approach: given a genomic DNA sequence to annotate, a specific weighted directed acyclic graph that can represent every possible annotation of the sequence is built, edited, and weighted on the fly in such a way that a constrained maximum weight path in this graph defines a consistent prediction of optimal quality [5, 6]. The edge editing and weighting process is performed by the plug-ins, each representing a source of information. Every plug-in outputs raw votes (real numbers) on specific types of edges that represent region types at every base of the genomic sequence or specific sites like splicing or transcription and translation start/stop between bases. These votes are then transformed by a parameterized scaling function to contribute to the weight of every edge. As an example, the "Stop" plug-in exploits the existence of a STOP codon in the genomic sequence to edit the graph and disable all predictions that would generate an in-frame STOP codon and instead force them to become a 3′ UTR. Similarly, the "BlastX" plug-in will transform the existence of a similarity with a protein into votes in favor of a coding region in the corresponding region/frame.

The parameters of the statistical voters require a sequence training set of each predicted type (coding, transcribed but non-coding, etc.) to produce their raw votes. This learning set is automatically built by the EuGene pipeline from a stringent selection of similarities between the genomic sequence, a reference dataset of proteins, and assembled transcripts of the same species. The parameters of the statistical models (Markovian models) are estimated using a maximum likelihood criterion.

To balance the strength of all raw votes, EuGene uses scaling parameters that have previously been optimized on various genomes. These parameters do not need to be re-optimized in the general case, even if this is possible in practice. They were initially set by maximizing the empirical quality of the prediction (a linear combination of the fractions of proper and improper predictions, at the base, exon/intron, and gene level) on sets of annotated sequences. This was achieved using a combination of a genetic algorithm and a block coordinate descent algorithm.

Once properly trained and weighted, the voters are able to edit and weight the graph. An optimal (longest) path in the graph, representing the prediction that accumulates more votes than any other, is obtained using a dedicated dynamic programming algorithm (a sophisticated variant of the Bellman algorithm [7], similar to the Semi-HMM Viterbi algorithm) that takes into account the lengths of the regions in the prediction. For mathematically oriented readers, EuGene can also be presented as a Semi Linear Conditional Random Field (see [8] for such a description), although it was designed and implemented before this mathematical model was known and published [9].
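To make the idea of a maximum weight path concrete, here is a deliberately tiny sketch. It is not EuGene's gene model or algorithm: it assumes only two hypothetical region types per base (0 = non-coding, 1 = coding), integer votes given as "noncoding,coding" pairs, and a fixed penalty for switching type, and it ignores region lengths and signals such as splice sites. The function name best_path and the input encoding are invented for the illustration.

```shell
#!/bin/sh
# Toy maximum-weight path by dynamic programming (illustration only).
# Usage: best_path PENALTY n1,c1 n2,c2 ...
#   nK / cK: vote for "non-coding" / "coding" at base K (hypothetical encoding)
# Prints one digit per base: 0 = non-coding, 1 = coding.
best_path() (
  penalty=$1; shift
  [ "$#" -gt 0 ] || exit 1
  printf '%s\n' "$@" | awk -F, -v penalty="$penalty" '
    { v0[NR] = $1 + 0; v1[NR] = $2 + 0 }        # store the votes as numbers
    END {
      s0[1] = v0[1]; s1[1] = v1[1]              # best scores ending in each state
      for (i = 2; i <= NR; i++) {
        # reach state 0 at base i: stay in 0, or switch from 1 and pay the penalty
        if (s0[i-1] >= s1[i-1] - penalty) { s0[i] = s0[i-1] + v0[i]; b0[i] = 0 }
        else { s0[i] = s1[i-1] - penalty + v0[i]; b0[i] = 1 }
        # reach state 1 at base i: stay in 1, or switch from 0 and pay the penalty
        if (s1[i-1] >= s0[i-1] - penalty) { s1[i] = s1[i-1] + v1[i]; b1[i] = 1 }
        else { s1[i] = s0[i-1] - penalty + v1[i]; b1[i] = 0 }
      }
      # trace the best path back from the better final state
      state = (s0[NR] >= s1[NR]) ? 0 : 1
      path = ""
      for (i = NR; i >= 1; i--) {
        path = state path
        if (i > 1) state = (state == 0) ? b0[i] : b1[i]
      }
      print path
    }'
)

best_path 1 2,0 0,3 0,3 2,0    # prints 0110
```

With a larger switching penalty the same votes yield a single region (best_path 5 2,0 0,3 0,3 2,0 prints 1111), which loosely mirrors how the weighting of the sources of information changes the predicted structure.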


The topology of the graph (the gene model) used by EuGene differs between the eukaryotic and prokaryotic modes. Whereas the eukaryotic mode knows about splicing, the prokaryotic mode instead allows for overlapping coding genes and operons (polycistronic transcripts). EuGene can be run on both strands simultaneously or independently on each strand. In the latter case (default for the prokaryotic mode), there is no constraint on overlaps between predicted genes (coding or not), which allows for the prediction of, e.g., antisense ncRNA genes when transcriptional evidence exists.

1.3 EuGene in Practice

The EuGene software combines a Perl pipeline, which organizes and collects all the evidence, with C++ programs that perform the actual training and prediction. After intensive use on a variety of eukaryotic genomes of various sizes and complexities (e.g., the Helianthus annuus L. genome is 3.5 Gb in size and highly complex [10]), the complete EuGene pipeline has evolved to be reliable, easily configurable, and suitable for various computational architectures:

• It starts from a single directory containing the genomic sequence to annotate and produces, in a new directory, a GFF3-formatted annotation as well as additional files in FASTA format for the predicted transcriptome, proteome, etc.

• All the execution is controlled from a single configuration file, in text format, that specifies the transcriptomes and proteomes that should be used for training and prediction, as well as a few specific global options. In its default mode, the eukaryotic configuration file expects a transcriptome defined by assembled transcripts. The intronless prokaryotic version directly uses stranded NGS reads.

• The program can exploit the capacities of multi-core machines or clusters. Various schedulers (e.g., PBS, SGE, SLURM) are supported. The use of a cluster is advised for the annotation of large eukaryotic genomes (more than a few hundred Mb) because of the computational cost of protein similarity identification (which is internally optimized for large genomes using a sliding window).

• The pipeline also allows for warm restarts following unexpected interruptions or the addition or removal of informative transcriptomes and proteomes.

2 Setting Up an EuGene Pipeline Instance

In this section, we describe the installation and use of the eukaryotic version of the pipeline under Linux. In the following command lines, the symbol "%" represents the shell prompt and assumes that a "sh" shell (such as bash) is used. If you use a "csh" shell, replace the export commands with setenv. Use the text editor of your choice instead of vi.

Table 1 Software dependencies. The indicated versions are those supported by EuGene and are automatically installed (edit sh.make_install.euk.bash for details)

Software      Supported version   URL
EuGene        4.2a                http://eugene.toulouse.inra.fr
Diamond       v0.9.22             https://ab.inf.uni-tuebingen.de/software/diamond
Ncbi-blast+   2.2.31              https://blast.ncbi.nlm.nih.gov/Blast.cgi
tRNAscan-SE   1.3.1               http://lowelab.ucsc.edu/tRNAscan-SE
Infernal      1.1.2               http://eddylab.org/infernal
Genometools   1.5.6               http://genometools.org
Bedtools      2.24.0              http://bedtools.readthedocs.io
GMAP          2017-09-05          http://research-pub.gene.com/gmap
Red           05/22/2015          http://toolsmith.ens.utulsa.edu
RNAmmer       1.2                 http://www.cbs.dtu.dk/services/RNAmmer

2.1 Installation and Compilation

Download release 1.5 of the eukaryotic EuGene pipeline from http://eugene.toulouse.inra.fr/Downloads/egnep-Linux-x86_64.1.5.tar.gz. Uncompress the tarball in the directory where you want to install EuGene, and run the installation script for all the software dependencies described in Table 1:

% gzip -cd egnep-Linux-x86_64.1.5.tar.gz | tar xvf -
% cd egnep-1.5
% sh sh.make_install.euk.bash

Download and install RNAmmer from http://www.cbs.dtu.dk/services/RNAmmer. It is free for academics once the license has been read and accepted. Note that EuGene can work without it: in this case, only the ribosomal RNAs detected by Rfam will be integrated in the prediction (see Subheading 3.6). Check that the necessary Perl modules for the pipeline are installed by running the following command until it displays "Required perl modules found":

% $PWD/bin/int/check_requirements.pl


Define the EGNEP and EUGENEDIR environment variables, which respectively hold the Perl EuGene pipeline directory and the EuGene C++ software directory. By default, the latter is installed in the parent directory of the pipeline directory. You must use absolute paths. These two variables must be set at installation and each time the pipeline is used.

% export EGNEP=$PWD
% export EUGENEDIR=$EGNEP/../eugene-4.2a
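Because both variables have to be set again in every new session, a quick sanity check before launching the pipeline can save a failed run. The following sketch is not part of the EuGene distribution; check_eugene_env is a hypothetical helper that only verifies what the text above requires, namely that both variables are set, hold absolute paths, and point to existing directories.

```shell
#!/bin/sh
# Sketch: verify EGNEP and EUGENEDIR before running the pipeline.
# check_eugene_env is a hypothetical helper, not shipped with EuGene.
check_eugene_env() (
  status=0
  for name in EGNEP EUGENEDIR; do
    eval "value=\${$name:-}"        # read the variable whose name is in $name
    case $value in
      '') echo "$name is not set" >&2; status=1 ;;
      /*) [ -d "$value" ] || { echo "$name is not a directory: $value" >&2; status=1; } ;;
      *)  echo "$name must be an absolute path: $value" >&2; status=1 ;;
    esac
  done
  exit "$status"
)

check_eugene_env || echo "Fix the variables listed above before running EuGene." >&2
```

The export lines themselves can be placed in a shell startup file so that they do not have to be retyped.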

2.2 Resource Requirements and Execution Time

EuGene can be executed in a multi-core/cluster environment using a Perl program (paraloop) compatible with multicore systems and with SGE, SLURM, or PBS clusters. Choose the appropriate configuration file according to your system and adapt it. In the following example, we work with an SGE cluster, and the jobs are submitted to the queue named main.q:

% cd $EGNEP/bin/ext/paraloop/etc
% cp templates/paraloop.root.cfg_SGE paraloop.root.cfg
% vi paraloop.root.cfg
PARALOOP_queue=main.q
PARALOOP_qsub_params=

In most environments, the PARALOOP_*_params key has to be uncommented and initialized according to the resources of cluster nodes (e.g., memory). The running time depends on multiple factors: the genome size and complexity, the size and number of datasets provided as sources of evidence, and the resources of the computing environment (number of nodes available to the user, processor performance). On a cluster of 500 cores (without contention for resources), the elapsed running time can be estimated at one day for a fungal genome, less than 1 week for a plant genome of 500 Mb, and less than 2 weeks for a complex plant genome like sunflower (>3 Gb). Note that the running time has been optimized for large sequences. Consequently, it is preferable to build the pseudomolecules before running the annotation (rather than annotating contigs prior to scaffolding).

2.3 EuGene Execution

To use the EuGene pipeline, you need to create three directories corresponding to the script parameters indir, outdir, and workingdir and run the main script. indir contains the genomic sequences that you want to annotate. outdir is the directory where the results of the annotation are saved. workingdir is the directory where the pipeline writes all temporary files. Once the directories are created, copy the multi-FASTA files of the genomic sequences to annotate into indir. There is no constraint on the extension of file names. Files can be compressed with gzip, bzip2, xz, or lzma. Once the configuration file has been completed (see Subheading 3.1), running the eukaryotic variant of EuGene takes a single command line with four parameters:

% $EGNEP/bin/int/egn-euk.pl \
    --indir /path/of/myindir \
    --outdir /path/of/myoutdir \
    --workingdir /path/of/myworkingdir \
    --cfg /path/of/myeugeneconffile.cfg

3 Performing a Genome Annotation with EuGene

Before running the pipeline, we advise you to check that the headers of all your multi-FASTA files (genome, transcriptomes, proteomes) do not contain any special characters (such as "|"). Keep in mind that you always need to use absolute paths in the configuration file and in the command lines.
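The header check recommended above can be automated. The sketch below is one possible way to do it, under explicit assumptions: the helper name find_suspicious_headers is hypothetical, the input files are uncompressed, and the set of "safe" characters (letters, digits, underscore, dot, hyphen, and space) is a conservative choice made for this example, since the text only names "|" as a problematic character.

```shell
#!/bin/sh
# Sketch: list FASTA header lines that contain characters outside a
# conservative whitelist (assumed here: A-Z a-z 0-9 _ . - and space).
# Prints "file:line: header" for each hit and returns 1 if any was found.
find_suspicious_headers() {
  LC_ALL=C awk '
    /^>/ && substr($0, 2) ~ /[^A-Za-z0-9_. -]/ {
      print FILENAME ":" FNR ": " $0
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$@"
}
```

A typical call would be find_suspicious_headers /path/of/myindir/*.fasta, with the same check repeated on the transcriptome and proteome files; compressed files would first have to be decompressed, for example with gzip -cd.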

3.1 A Single Configuration File

A single configuration file centralizes all the information EuGene needs to annotate a genome: the user-supplied data (transcriptomes, proteomes, complementary results), the paths and parameters of the external programs, and the EuGene-specific parameters. A configuration file describes a "recipe," i.e., all the ingredients (datasets used as sources of evidence) and all the measurements (cutoffs, filters, weights) that are needed for annotating a genomic sequence. Configuration files that have been used recently to annotate complete genomes (e.g., that of the Medicago genome) are available on the page http://eugene.toulouse.inra.fr/Configuration. Each line is of the form key=value. Lines beginning with "#" are comments. The first part of the file must be filled in: this is where the user describes the ingredients of the recipe. All the parameters of the second part have default values, which should be modified with caution. The symbols "%i" and "%e" stand for $EGNEP and $EUGENEDIR, respectively. This abbreviated notation shortens paths that refer to files stored in these directory trees. Create a copy of the generic configuration file:

% cp $EGNEP/cfg/egnep.cfg $EGNEP/cfg/egnep.myspecies.recipe1.cfg
% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
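To see what a given key resolves to once the "%i" and "%e" abbreviations are expanded, a small reader for this file format can be handy. The sketch below is an illustration of the format as described above, not the parser used by the pipeline: read_cfg_value is a hypothetical helper, and letting the last occurrence of a key win is an assumption of this example.

```shell
#!/bin/sh
# Sketch: print the value of KEY from a key=value configuration file, with
# %i and %e replaced by $EGNEP and $EUGENEDIR. Hypothetical helper; comment
# lines start with "#". Assumption: if a key occurs twice, the last one wins.
# Usage: read_cfg_value FILE KEY
read_cfg_value() {
  EGNEP="${EGNEP:-}" EUGENEDIR="${EUGENEDIR:-}" awk -v key="$2" '
    # replace every literal occurrence of token in s by repl
    function expand(s, token, repl,    out, p) {
      out = ""
      while ((p = index(s, token)) > 0) {
        out = out substr(s, 1, p - 1) repl
        s = substr(s, p + length(token))
      }
      return out s
    }
    /^#/ { next }
    index($0, key "=") == 1 { value = substr($0, length(key) + 2); found = 1 }
    END {
      if (!found) exit 1
      value = expand(value, "%i", ENVIRON["EGNEP"])
      value = expand(value, "%e", ENVIRON["EUGENEDIR"])
      print value
    }
  ' "$1"
}
```

For example, read_cfg_value $EGNEP/cfg/egnep.myspecies.recipe1.cfg repeat_sequence_db would print the full path that the abbreviated entry refers to, which makes it easy to test that the file exists.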


In your configuration file, fill in:

1. The name of your organism, without spaces:

organism=myspecies

2. The prefix of the output files, without spaces. All result files will start with this string:

output_prefix=myspecies.20180901

3. The path of the RNAmmer software (result of the command "% which rnammer"):

prg_rnammer=/path/to/rnammer/exe

After you register at https://www.girinst.org/repbase, download the RepBaseXX.XX_REPET.embl.tar.gz tarball into the $EGNEP/db directory. Untar it, and then specify the path of the repbaseXX.XX_aaSeq_cleaned_TE.fa file in your configuration file (the XX.XX string must be replaced by the downloaded version of RepBase):

% cd $EGNEP/db/
% gzip -cd RepBaseXX.XX_REPET.embl.tar.gz | tar xvf -
% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
repeat_sequence_db=%i/db/RepBaseXX.XX_REPET.embl/repbaseXX.XX_aaSeq_cleaned_TE.fa

3.2 Evidence Datasets

This section describes how to define the transcriptomes and proteomes that you want to integrate into your recipe to guide the prediction. The result of the annotation process depends directly on the quality and completeness of transcriptomes and proteomes, particularly those used to train statistical models. Indeed, for each new genome, probabilistic models are automatically trained from a stringent selection of similarities with a so-called “training” transcriptome and proteome.

3.2.1 Transcriptomes

The evolution of sequencing technologies has transformed the number and type of transcribed sequences produced: a decade ago, ESTs (expressed sequence tags) were generated; for several years, RNA-seq has been producing shorter reads, but in much greater quantities; thanks to new Iso-Seq (isoform sequencing) applications, it is now possible to directly sequence full-length transcripts, which in a few years should be the standard in the production of expressed sequences. To be able to integrate these different types of sequences with the same protocol, the RNA-seq data must be assembled into transcripts before being used inside the EuGene pipeline. Note that some transcriptome assemblers (e.g., Trinity) generate many partially matured (intron-retaining) transcripts. For a given locus of the Trinity output, we recommend keeping only the contig with the longest ORF, which is likely the correctly matured transcript. The use of raw Trinity files can generate a large number of fragmented gene models. The alignments of transcripts on the genome strongly guide the gene models predicted by EuGene, so it is essential to have at least one good-quality transcriptome of the species to be annotated or of a very close genotype/strain/race. The training transcriptome (exactly one must be chosen) must be of the species to be annotated or of a very close genotype.

In practice: Copy the transcriptomes you want to integrate to a directory (e.g., $EGNEP/db/), and edit the configuration file. For each transcriptome, add a block of six lines. Below is an example that describes a transcriptome X (X must be a number):

% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
est_X_file=/path/of/my/transcriptomeX.fasta
est_X_pcs=30
est_X_pci=97
est_X_remove_unspliced=0
est_X_training=1
est_X_preserve=1

Parameter details:

• est_X_file. Full path of the multi-FASTA file.

• est_X_pcs and est_X_pci (value between 0 and 100). The mapping software performs a spliced alignment of the transcripts against the genomic sequence. Transcripts whose alignment spans more than est_X_pcs percent of the transcript length with a sequence identity greater than est_X_pci are retained. For each transcript, only the best GMAP alignment is retained.

• est_X_remove_unspliced (possible values: 0, 1, 2). Beyond sequence similarity, a spliced alignment necessarily implies the presence of two splice sites at the endpoints of each intron. Intuitively, the presence of these signals in the genomic sequence gives a higher confidence in the alignment when it exists. The parameter est_X_remove_unspliced allows one to give a different weight to spliced alignments: if the value is 0, the same weight is given to all alignments; if the value is 1, the unspliced alignments are ignored; if the value is 2, more weight is given to the spliced alignments.

• est_X_training (possible values: 0, 1). The value 1 identifies the training transcriptome (a single training dataset is expected).

• est_X_preserve (possible values: 0, 1). The value 1 means that the genomic region where a transcript aligns is "protected" against repeat masking (see Subheading 3.5).

Then, provide a list of the transcriptome numbers to be included in the recipe. Below is an example that integrates the transcriptomes X, U, and V (configured as described above):

% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
est_list=X U V

• est_list. Non-empty list of transcriptome numbers to use in the recipe (the order of the numbers does not matter). The list must contain a training transcriptome.
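When several transcriptomes are configured, it is easy to end up with zero or two training sets in est_list. The sketch below shows one way to generate the six-line blocks and to count the training transcriptomes actually listed. Both helpers (emit_est_block, count_training_transcriptomes) are hypothetical and not part of the pipeline; the default values written by emit_est_block are simply those of the example above.

```shell
#!/bin/sh
# Sketch with hypothetical helpers (not shipped with EuGene).

# Print the six-line block for transcriptome number $1 stored in file $2.
# $3 is the training flag (0 or 1, default 0); the other values are copied
# from the example block above.
emit_est_block() {
  printf 'est_%s_file=%s\n' "$1" "$2"
  printf 'est_%s_pcs=30\n' "$1"
  printf 'est_%s_pci=97\n' "$1"
  printf 'est_%s_remove_unspliced=0\n' "$1"
  printf 'est_%s_training=%s\n' "$1" "${3:-0}"
  printf 'est_%s_preserve=1\n' "$1"
}

# Print how many transcriptomes named in est_list have est_N_training=1
# in the configuration file $1. Exactly one is expected.
count_training_transcriptomes() (
  cfg=$1
  n=0
  list=$(grep -E '^est_list=' "$cfg" | tail -n 1 | cut -d= -f2-)
  for id in $list; do
    if grep -Eq "^est_${id}_training=1[[:space:]]*$" "$cfg"; then
      n=$((n + 1))
    fi
  done
  echo "$n"
)
```

For instance, emit_est_block 2 /path/of/my/transcriptome2.fasta 0 >> $EGNEP/cfg/egnep.myspecies.recipe1.cfg appends a non-training transcriptome, and count_training_transcriptomes on the same file should then print 1.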

To build the training set for the generation of statistical models, EuGene stringently filters the alignments of the training transcriptome. The threshold values applied are defined by the following parameters (it is not recommended to modify them):

training_est_pcs=99
training_est_pci=99
training_est_remove_unspliced=1

The more transcriptomes from different organs or tissues are integrated, and/or the greater the depth of sequencing, the more splice variants or partially matured messenger RNAs (with intron retention) will be found in the assemblies. In some cases, local alignments are inconsistent with each other. Therefore, only alignments containing predominantly represented introns are preserved by default. To disable this option, change the gmap_intron_filter key to 0 in the configuration file. EuGene automatically removes small exons at the ends of transcript alignments. If you want to change this behavior, two values must be edited in the configuration file: the "trim-end-exons" value in the prg_gmap_param parameter and the "minlen" value in the prg_gmap_filter_patch parameter. Here is an example that sets a minimum length of 30 nt:

% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
prg_gmap_param=-n0 -B 5 -t 16 -L 100000 -K 25000 --trim-end-exons=30
prg_gmap_filter_patch=%i/bin/int/misc/filter_transcript_smallendexons.pl --minlen 30

3.2.2 Proteomes

EuGene can use similarities with known proteins, taken from proteomes that are more or less curated and from species that are more or less close to the one being annotated. To this end, choose one or more proteomes; we advise using at least the public, curated UniProtKB/Swiss-Prot database, and possibly the taxonomic division of TrEMBL corresponding to your species. The training proteome (only one must be chosen) must be as complete and curated as possible. In the absence of a proteome from a closely related species, choose UniProtKB/Swiss-Prot.

In practice: Copy the multi-FASTA proteomes to a directory (e.g., $EGNEP/db/), and then index each of them with the following command:

% $EGNEP/bin/ext/ncbi-blast/bin/makeblastdb -in /path/of/my/proteomeY.fasta \
    -dbtype prot -parse_seqids

The next step is to describe the integration of the proteomes in the configuration file. For each proteome, add a block of six lines. The example below describes a proteome Y (Y must be a number):

% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
blastx_db_Y_file=/path/of/my/proteomeY.fasta
blastx_db_Y_weight=0.3
blastx_db_Y_pcs=50
blastx_db_Y_remove_repet=0
blastx_db_Y_preserve=1
blastx_db_Y_training=1

Parameter details:

• blastx_db_Y_file. Full path of the multi-FASTA file.

• blastx_db_Y_weight (a decimal number). Confidence in the proteome. We advise choosing a value between 0.1 (e.g., for TrEMBL) and 0.5 (for the proteome of a well-annotated, closely related species).

• blastx_db_Y_pcs (value between 0 and 100). Minimum percentage of the protein length that must be aligned.

• blastx_db_Y_remove_repet (possible values: 0, 1). If the value is 1, proteins that are similar to a protein associated with repeats are removed (see Subheading 3.5). We suggest a value of 1 to limit the prediction of proteins related to transposable elements.

• blastx_db_Y_preserve (possible values: 0, 1). A value of 1 means that the genomic regions where a protein similarity is present are “protected” against repeat masking (see Subheading 3.5).

• blastx_db_Y_training (possible values: 0, 1). A value of 1 identifies the training proteome (a single protein training dataset is expected).

Then list the numbers of the proteomes to include in the recipe. Below is an example that integrates the proteomes A and C (configured as described above):

% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
blastx_db_list=A C

• blastx_db_list. Non-empty list of the proteome numbers to use in the recipe. The list must contain a training proteome.
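Before launching a long run, the proteome rules stated above (six keys per proteome, a non-empty list, a single training proteome) can be checked with a few lines of Python. This is a minimal sketch, not part of the EuGene distribution: the functions parse_cfg and check_proteomes are hypothetical helpers, and the parser assumes that the configuration file is a plain list of key=value lines.

```python
REQUIRED_SUFFIXES = ("file", "weight", "pcs", "remove_repet", "preserve", "training")


def parse_cfg(text):
    """Collect key=value pairs; lines without '=' are ignored (assumed format)."""
    cfg = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def check_proteomes(cfg):
    """Return a list of problems found in the blastx_db_* keys."""
    problems = []
    ids = cfg.get("blastx_db_list", "").split()
    if not ids:
        problems.append("blastx_db_list is empty")
    n_training = 0
    for pid in ids:
        for suffix in REQUIRED_SUFFIXES:
            key = "blastx_db_%s_%s" % (pid, suffix)
            if key not in cfg:
                problems.append("missing key " + key)
        if cfg.get("blastx_db_%s_training" % pid) == "1":
            n_training += 1
    if ids and n_training != 1:
        problems.append("expected exactly one training proteome, found %d" % n_training)
    return problems


EXAMPLE = """\
blastx_db_1_file=/path/of/my/proteome1.fasta
blastx_db_1_weight=0.5
blastx_db_1_pcs=50
blastx_db_1_remove_repet=1
blastx_db_1_preserve=1
blastx_db_1_training=1
blastx_db_2_file=/path/of/my/proteome2.fasta
blastx_db_2_weight=0.1
blastx_db_2_pcs=50
blastx_db_2_remove_repet=1
blastx_db_2_preserve=0
blastx_db_2_training=0
blastx_db_list=1 2
"""
problems = check_proteomes(parse_cfg(EXAMPLE))
print(problems or "proteome configuration looks consistent")
```

The same pattern applies to the est_* keys of the transcriptomes, with est_list in place of blastx_db_list.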

From here, your configuration file is complete. The next step is to adapt EuGene to your data and genome characteristics, by changing default settings or activating certain options.

3.3 Splice Sites

3.3.1 Statistical Models for the Detection of Splice Sites

EuGene uses pre-trained statistical models (weight array method, or WAM, matrices [11]) to capture the signals contained in the regions around the splice sites. The default matrix downloaded at installation is suitable for dicot plant genomes. Pre-trained matrices are available for other species groups, including nematodes, oomycetes, and fungi. The available matrices and the species used to build them are listed on the web page http://eugene.toulouse.inra.fr/WAM; see Subheading 4.1 to build your own models. To use another matrix, install it in the EuGene tree and edit the wam_dataset key in the configuration file:

% cd $EUGENEDIR/models/WAM
% wget http://eugene.toulouse.inra.fr/Downloads/WAM_myspeciesgroup_date.tar.gz
% gzip -cd WAM_myspeciesgroup_date.tar.gz | tar xvf -
% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
wam_dataset=myspeciesgroup

3.3.2 Automatic Detection of Non-canonical Splice Sites

Introns of eukaryotic genomes start at a so-called donor site, in which the conserved dinucleotide “GT” appears, and end at a so-called acceptor site, in which the conserved dinucleotide “AG” appears. These GT/AG splice sites are said to be canonical. Some species have non-canonical splice sites in which, for example, the “GC” dinucleotide appears instead of “GT”. EuGene automatically detects the presence of non-canonical sites based on transcript alignments. A conserved dinucleotide appearing in more than 1% (the default value of the noncansite_required_percent parameter) of the donor or acceptor sites detected in the alignments is activated as a non-canonical conserved dinucleotide (donor or acceptor): the EuGene graph is then edited to allow the transition from the exon state to the intron state (donor) or from the intron state to the exon state (acceptor). However, this transition is only allowed where a transcript alignment confirms the exact position of the non-canonical splice site.

The detection of too many distinct non-canonical conserved dinucleotides is a warning sign about the quality of the alignments, caused for example by a poor alignment by the alignment program or by a poor-quality transcriptome. Thus, if more than 3 (the default value of the max_noncansite_candidate_nb parameter) non-canonical donor or acceptor conserved dinucleotides exceed the frequency threshold noncansite_required_percent, EuGene stops.

3.4 The “AnnotaStruct” Generic Plug-In

Additional information is sometimes available to aid in the structural annotation of a genome, for example, results of proteomic analyses, which provide information on translated DNA regions, or TSS (transcription start site) mapping results. EuGene, through its generic “AnnotaStruct” plug-in, can integrate various kinds of information provided that they are formatted as a GFF3 file. The $EGNEP/ADDITIONAL directory should contain the files to integrate with “AnnotaStruct”. In practice, to integrate a new data source, two files must be created: the raw data file in GFF3 format, and the configuration file describing how to integrate these data. The example below shows how to integrate the mapping of proteomic analysis results.

3.4.1 Expected GFF3 Format

It must be a valid GFF3 file. The features allowed in the third column are CDS, five_prime_UTR, three_prime_UTR, UTR, intron, exon, transcript_region, ncRNA, and intergenic_region. An Ontology_term attribute [12] can be added in the ninth column to specify the nature of the feature more precisely (Table 2). The following example file describes four translated regions. The term SO:0000004 indicates that these are “interior coding exons”.

% more $EGNEP/ADDITIONAL/mygenome.MQ.peptides.gff3
Chr1  MaxQuant  CDS  72033  72080  .  +  .  ID=regionMQ.323.1;Ontology_term=SO:0000004
Chr1  MaxQuant  CDS  73072  73119  .  +  .  ID=regionMQ.325.1;Ontology_term=SO:0000004
Chr3  MaxQuant  CDS  3391   3471   .  +  .  ID=regionMQ.326.1;Ontology_term=SO:0000004
Chr3  MaxQuant  CDS  8614   8664   .  +  .  ID=regionMQ.327.1;Ontology_term=SO:0000004
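A file of this kind can be sanity-checked before it is handed to the plug-in. The sketch below is illustrative only and is not part of EuGene: the function name check_annotastruct_line is hypothetical, and the checks are limited to the constraints stated above (nine tab-separated columns, an allowed feature type) plus basic GFF3 coordinate and strand rules.

```python
# Feature types listed in the text as accepted in the third column.
ALLOWED_TYPES = {
    "CDS", "five_prime_UTR", "three_prime_UTR", "UTR", "intron",
    "exon", "transcript_region", "ncRNA", "intergenic_region",
}


def check_annotastruct_line(line):
    """Return None if the GFF3 line looks usable, else a short error message."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return "expected 9 tab-separated columns, found %d" % len(cols)
    ftype, start, end, strand = cols[2], cols[3], cols[4], cols[6]
    if ftype not in ALLOWED_TYPES:
        return "feature type %r is not allowed" % ftype
    if not (start.isdigit() and end.isdigit() and 1 <= int(start) <= int(end)):
        return "start and end must be positive integers with start <= end"
    if strand not in {"+", "-", ".", "?"}:
        return "invalid strand %r" % strand
    return None


good = "\t".join([
    "Chr1", "MaxQuant", "CDS", "72033", "72080", ".", "+", ".",
    "ID=regionMQ.323.1;Ontology_term=SO:0000004",
])
bad = good.replace("\tCDS\t", "\tgene\t")
print(check_annotastruct_line(good))
print(check_annotastruct_line(bad))
```

Running such a check on every non-comment line catches the most common formatting mistakes, such as spaces used instead of tabs.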

Table 2 The ontology terms allowed in the ninth column of a GFF3 file read by the “AnnotaStruct” generic plug-in. They provide additional precision on the nature of some features (see http://eugene.toulouse.inra.fr/Downloads/SO.png)

Ontology_term | Sequence ontology definition | Feature
SO:0000196 | The sequence of the five_prime_coding_exon that codes for protein | CDS
SO:0000197 | The sequence of the three_prime_coding_exon that codes for protein | CDS
SO:0000004 | Interior coding exon | CDS
SO:0005845 | An exon that is the only exon in a gene | CDS
SO:0000191 | Interior intron | intron
SO:0000200 | The coding exon that is most 5-prime on a given transcript | exon
SO:0000202 | The coding exon that is most 3-prime on a given transcript | exon

3.4.2 The “AnnotaStruct” Configuration File

The AnnotaStruct configuration file, which describes how to integrate the information source into EuGene, consists of 24 lines. Each line sets a parameter of the “AnnotaStruct” plug-in; the parameters are described in the EuGene C++ program documentation, available at http://eugene.toulouse.inra.fr/Downloads/20180920/EuGeneDoc.pdf (Subheading 2.4.5.1). We provide an “AnnotaStruct” template file, $EGNEP/cfg/plugin_AnnotaStruct_template.cfg, in which all values are equal to zero, so using it unchanged has no effect.

3.4.3 Example: Integrating the Mapping of Proteomic Analysis Results

Create the data file in GFF3 format according to the specifications described in Subheading 3.4.1, then copy the template configuration file and edit it:

% cp $EGNEP/cfg/plugin_AnnotaStruct_template.cfg $EGNEP/ADDITIONAL/plugin_AnnotaStruct_Proteomics.cfg
% vi $EGNEP/ADDITIONAL/plugin_AnnotaStruct_Proteomics.cfg
AnnotaStruct.CDS*[CPT] 2

A positive score for AnnotaStruct.CDS*[CPT] allows each position of a CDS feature to vote in the EuGene graph in favor of predicting a translated region. The higher the value, the more weight the information is given. We recommend a value between 0.1 and 3.

Then edit the EuGene configuration file: for each additional information source, add a block of two lines. The example below shows how to integrate the source Z (Z must be a number):

% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
additional_Z_file=%i/ADDITIONAL/mygenome.MQ.peptides.gff3
additional_Z_cfg_template=%i/ADDITIONAL/plugin_AnnotaStruct_Proteomics.cfg

Parameter details:

• additional_Z_file. Absolute path of the GFF3 data file.

• additional_Z_cfg_template. Absolute path of the corresponding “AnnotaStruct” configuration file.

Then list the numbers of the information sources to include in the recipe. Below is an example that integrates the source Z (configured as described above):

% vi $EGNEP/cfg/egnep.myspecies.recipe1.cfg
additional_list=Z

• additional_list. List of the numbers of the information sources to use in the recipe (can be empty).

The page http://eugene.toulouse.inra.fr/Configuration provides more examples of AnnotaStruct-formatted GFF3 files and configuration files.

3.5 Repeat Region Detection

The objective of the EuGene pipeline is not to annotate repeated elements. The pipeline therefore masks the repeated regions and then annotates the masked genome. In the default configuration, three tools are executed to search for repeated regions in the genome:

• Red [13].

• LTRHarvest [14].

• A similarity search (NCBI BLASTX) against a library of proteins from repeated regions. This library consists of two datasets: (a) RepBase proteins [15] and (b) proteins derived from repeat regions specific to the annotated species. The automatic procedure for detecting the proteins of dataset (b) is as follows: EuGene annotates the genome a first time, extracts the predicted proteins, and searches among them for those with strong similarity to RepBase (following a BLASTP search, proteins that align with a RepBase protein over more than 80% of their length are retained). If you already have a repeat library for your species, you can skip the automatic building of dataset (b) by specifying the path of your own dataset in the species_repeat_domains parameter.
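The 80% rule used to build dataset (b) can be sketched as follows. This is an illustration under simplifying assumptions rather than the pipeline's actual code: the hit tuples and the function name repeat_like_proteins are hypothetical, and coverage is computed from a single alignment per hit instead of merging several alignments of the same protein.

```python
def repeat_like_proteins(hits, min_cov_pct=80.0):
    """Select predicted proteins aligned to a RepBase protein over more than
    min_cov_pct percent of their own length.

    hits: iterable of (query_id, query_length, query_start, query_end),
          with 1-based inclusive coordinates on the predicted protein.
    """
    selected = set()
    for query_id, query_len, q_start, q_end in hits:
        aligned = q_end - q_start + 1
        if 100.0 * aligned / query_len > min_cov_pct:
            selected.add(query_id)
    return sorted(selected)


# Hypothetical BLASTP hits of predicted proteins against RepBase proteins.
hits = [
    ("p1", 200, 1, 190),    # 95% of the protein is aligned: kept
    ("p2", 300, 50, 120),   # about 24% aligned: discarded
    ("p3", 100, 10, 90),    # 81% aligned: kept
]
repeat_like = repeat_like_proteins(hits)
print(repeat_like)
```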

The genomic regions detected by any of these three approaches are masked before annotation. Since the masking results may contain false positives, the regions that are expressed must be protected. To this end, the transcribed regions, that is, those having similarities with a transcriptome or a proteome marked to be preserved (one whose “_preserve” parameter is equal to 1) or those detected as known ncRNAs, are unmasked. Beforehand, we ensure, as far as possible, that these unmasked transcribed/translated regions do not come from repeated regions. For instance, the proteomes for which blastx_db_Y_remove_repet is equal to 1 are cleaned, i.e., the proteins that align with a RepBase protein are removed before the similarity search.

3.6 Non-coding RNA Annotation

In addition to protein-coding genes, EuGene can annotate non-coding genes. Three mechanisms have been set up for this purpose:

• First, EuGene integrates the results of three ncRNA predictors, tRNAscan-SE, RNAmmer, and cmsearch on Rfam, which look for ncRNAs from known families. These RNAs are identified in the GFF3 output file by additional fields in column 9 (see Subheading 3.9). These predictors can be disabled with the options --no_ncrna_detection, --no_rfamscan, --no_trnascan, and/or --no_rnammer on the command line of the main program, $EGNEP/bin/int/egn-euk.pl.

• Second, a transcriptome built from total RNA libraries (e.g., without selection of mRNAs by their poly(A) tail) contains sequences from both mRNAs and non-coding RNAs. To integrate this type of data, the behavior of EuGene can be modified so that, when a transcript aligns in a region of low coding potential, EuGene can detect an ncRNA. To enable this behavior, set the parameter Est.mRNAOnly[CPT] to 0:

% vi $EGNEP/cfg/euk/plugin_Est.cfg
Est.mRNAOnly[CPT] 0

• Third, a region supported by a transcript (longer than 200 nt by default) and without significant coding potential (i.e., no CDS longer than 40 amino acids by default) is annotated by EuGene as a long ncRNA. The default values can be modified by editing the Output.MinCDSLen and Output.MinRescueTranscriptLen parameters in the $EGNEP/cfg/euk/egn_euk_generic.cfg file.

3.7 Computational Management

When EuGene is run, only calculations that have not yet been completed are started: if a first run did not complete, a second run (using the same workingdir directory) only runs the computations whose results are missing. This can save a lot of time when annotating a large genome. Although EuGene has been optimized to reduce the duration of the longest tasks, some unavoidable, expensive calculations of the pipeline will always require significant execution time (e.g., the search for some long Rfam ncRNAs).

Testing the existence of an output file is not enough to ensure that a calculation has completed correctly. Thus, for each calculation whose execution ends without error, a file bearing the name of the result file followed by the “.success” extension is created. For example, at the end of the tRNA search on chromosome Chr2 with the tRNAscan-SE software, the file /path/of/myworkingdir/0001/Chr2/Chr2.trnascan.gff3.success is created. When EuGene restarts, a process is executed only if the corresponding “.success” file does not exist. This is useful for restarting EuGene after an interruption due to an error, or after modifying a recipe. Suppose that we created a first recipe with two proteomes numbered 1 and 2 (blastx_db_list=1 2) and now want to integrate another proteome numbered 4. We can change the list to blastx_db_list=1 2 4 in the configuration file and run the same EuGene command again: in this case, only the protein similarity calculations with proteome 4 are executed.

3.8 Advanced Setup

3.8.1 Changing the Minimum Size of Introns

By default, EuGene does not predict introns shorter than 40 nt. To change this minimum value (for example, to set a minimum size of 35 nt), edit the corresponding intron.dist file and specify that the intron lengths n = 35 and n + 1 = 36 are allowed. Longer intron lengths will be extrapolated.

% vi $EUGENEDIR/models/intron.dist
35 0.0
36 0.0
% cd $EUGENEDIR
% make -i install

The “-i” option forces the compilation/installation to proceed even if the compilation of the LaTeX documentation fails (which may happen on some systems).

3.8.2 Performing a Strand-Specific Annotation

EuGene can be run simultaneously on both strands or independently on each strand. Independent strand runs allow predicting overlapping genes in eukaryotic mode. To use this behavior, set independent_strand_annotation to 1 in the EuGene configuration file.

3.8.3 Annotating Atypical Gene Structures

The publication “Improved methods and resources for paramecium genomics: transcription units, gene annotation and gene expression” by Olivier Arnaiz et al. [16] illustrates the high flexibility of EuGene to annotate atypical gene structures. It also presents alternative ways to integrate RNA-seq data.


3.8.4 Annotating a Heterozygous Genome with Separated Haplotypes

Some genome assemblers (e.g., FALCON) can generate a primary haplotype genome sequence and an additional assembly of alternate alleles (haplotigs). To annotate such a genome assembly, we suggest running the EuGene pipeline twice with the same sources of evidence: first on the primary contigs and then on the haplotigs. As the pipeline keeps only the best alignment of each transcript, a single annotation run on the concatenation of both assemblies would degrade the prediction of the unsupported allele.

3.9 Output Files

3.9.1 EuGene Output Files

EuGene generates ten files in the directory /path/of/myoutdir. The name of each file starts with the prefix defined in the output_prefix key. If the prefix is myspecies.20180801, the following files are created:

• myspecies.20180801.gff3. Structural annotation in GFF3 format. The gene IDs are built from the chromosome name, followed by the letter “g”, followed by the gene number. Genes are numbered in steps of 10 units, according to their genomic positions. To change this step, edit the “step” value in the prg_consolidate_gff3 key in the EuGene configuration file. If the chr_sorting_by_size value is equal to 1, the genomic sequences are sorted by decreasing size before the annotation, so gene number 1 is the first annotated gene on the largest sequence. Protein-coding genes are described using the following features: gene, mRNA, exon, three_prime_UTR, five_prime_UTR, and CDS. The ncRNAs are described with the following features: gene, rRNA, tRNA, and ncRNA. Column nine contains additional attributes describing the features (see Table 3).

• myspecies.20180801. Multi-FASTA file of the genomic sequences, optionally sorted by decreasing size if the chr_sorting_by_size value is equal to 1.

• myspecies.20180801.*.fna. Five multi-FASTA files corresponding to the CDSs, genes, mRNAs, ncRNAs, and protein sequences.

• myspecies.20180801.general_statistics.xls. Statistics on the annotation including, among others, the number of predicted genes, the numbers of protein-coding and non-coding genes, the average lengths of various features, and the average GC% of ncRNA genes. This gives a first indication of the annotation quality.

• myspecies.20180801.statistics_per_gene.xls. Tab-delimited file that gives per-mRNA and per-ncRNA statistics such as length, number of introns, or 5′ UTR length.

• myspecies.20180801.egnep_report.txt. Text file containing information on the different stages of the pipeline. The first part is dedicated to the alignment of the transcriptomes. For each transcriptome, it reports the number of raw sequences, and the number and percentage of aligned sequences remaining after the various filters (these correspond to the sequences actually used for the annotation). If this latter value is low, it is important to understand why: Is the transcriptome assembly of poor quality? Is there contamination? Are the mapping parameters suitable? The second part lists the detected non-canonical splice sites (see Subheading 3.3.2). Finally, the third part gives the percentage of the genome considered as repeated regions, both before and after the unmasking triggered by aligned transcripts, ncRNAs, or proteins (see Subheading 3.5).

Table 3 List of attributes in column nine of the output GFF3 file

Attribute name | Definition | Associated feature
Ontology_term | Term of the sequence ontology. Ex: a CDS with code SO:0000196 represents “the sequence of the 5′ exon that encodes for protein” | All
est_cons/est_incons | Percentage of the predicted region which is consistent/inconsistent with aligned transcripts. For example, a CDS with est_cons = 100 is completely supported by a transcript. The CDS may simultaneously have a nonzero est_incons if part of it is covered by a gap in another aligned transcript | CDS, three_prime_UTR, and five_prime_UTR
anti_codon | Transfer RNA anticodon | tRNA
Product | Transfer RNA-associated amino acid | tRNA
rfam_acc | Rfam accession number. Ex: rfam_acc = RF00097 | ncRNA (if detected by cmsearch on Rfam)
rfam_id | Rfam family Id. Ex: rfam_id = snoR71 | ncRNA (if detected by cmsearch on Rfam)
Subunit | The subunit of a ribosomal RNA | rRNA

3.9.2 Upload Annotation Files in a Genome Browser

To facilitate the visualization of the genome annotation produced, the program convert_egn2mygenomebrowser.pl can generate all the files needed to build an instance of a genome browser such as JBrowse [17]. The generated files can be used directly in MyGenomeBrowser [18]. From the outdir and workingdir directories and the EuGene configuration file, the program gathers the genome, the annotation results, and all the evidence used for the annotation into a tarball named MyGenomeBrowser.tar.gz. Example:

% $EGNEP/bin/int/misc/convert_egn2mygenomebrowser.pl \
    --egnep_outdir /path/of/myoutdir \
    --egnep_workdir /path/of/myworkingdir \
    --cfg /path/of/myeugeneconffile.cfg


3.10 Prokaryotic EuGene Version

This section describes the main differences between the eukaryotic version of the pipeline, described in the previous sections, and the prokaryotic version, presented in detail in [8, 19].

• The topology of the prokaryotic graph that describes the genome structure is different: it does not allow for introns but instead enables overlapping CDSs (on the same or on opposite strands). The prediction of UIRs (Untranslated Internal Regions: regions transcribed between two genes) can gather several genes into a single transcription unit. These operons are annotated in the GFF3 output file.

• EuGene is run independently on each strand, allowing, for example, the detection of antisense ncRNAs.

• An additional plug-in that predicts potential transcription start sites (TSS) is used. This prediction is based on the detection of abrupt changes in the expression level of RNA-seq data aligned to the genome.

• Since prokaryotic genomes are compact and contain little repeated DNA, there is no repeat masking process.

• Expression data can be integrated in different formats: oriented single-end or oriented paired-end reads in FASTA or FASTQ format; mapped reads in Bam/Sam, Bed, or Wig format; and tiling array data as a pair of ndf/pair files.

In practice: To install and configure the prokaryotic pipeline, download and untar the http://eugene.toulouse.inra.fr/Downloads/egnppLinux-x86_64.1.3.tar.gz tarball and follow the instructions in the QUICKSTART file. The EGNPP environment variable must be set. The command to run is:

% $EGNPP/bin/int/egn-prok.pl \
    --indir /path/of/myindir \
    --outdir /path/of/myoutdir \
    --cfg /path/of/myeugeneppconffile.cfg

The indir directory must contain two directories named genome and data. The genome directory must contain the genomic sequences to annotate, and the data directory all the evidence used in the annotation recipe. In the configuration file:

• The reference proteome must be specified in the training_db_file parameter.

• The expression data to be used in the recipe are described using evidence_* keys. For each type of data, add a block of 3 to 5 lines.

Below is an example that integrates oriented paired-end RNA-seq data in FASTQ format, termed A:

evidence_A_pat=*.fastq.gz
evidence_A_format=fastq
evidence_A_type=ope
evidence_A_oriented_end=1
evidence_A_small=0
evidence_list=A

Parameter details:

• evidence_A_pat. Pattern used inside the /path/of/myindir/data/ directory to list all the data files to process.

• evidence_A_format (possible values: fastq, fasta, wig, sam, bam, pair). File format.

• evidence_A_type (possible values: ose, ope). Respectively, “oriented single-end” or “oriented paired-end”. The key is compulsory if the files are in FASTA or FASTQ format.

• evidence_A_oriented_end (possible values: 1, 2). Number of the read end that gives the orientation (it depends on the extraction and sequencing protocols).

• evidence_A_small (possible values: 0, 1). A value of 1 specifies that the RNA-seq data come from a small RNA library.
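The constraints listed above for one evidence block can be checked programmatically before a run. The following is a minimal sketch, not a tool shipped with the pipeline: check_evidence is a hypothetical helper, the configuration is assumed to be already loaded into a dictionary of strings, and treating evidence_A_pat as mandatory is an assumption.

```python
FORMATS = {"fastq", "fasta", "wig", "sam", "bam", "pair"}
READ_TYPES = {"ose", "ope"}


def check_evidence(cfg, name):
    """Return a list of problems for the evidence block called `name`."""
    problems = []
    prefix = "evidence_%s_" % name
    if not cfg.get(prefix + "pat"):
        problems.append(prefix + "pat is missing")  # assumed to be mandatory
    fmt = cfg.get(prefix + "format")
    if fmt not in FORMATS:
        problems.append(prefix + "format has unsupported value %r" % fmt)
    # The read type is compulsory only for raw reads (FASTA or FASTQ).
    if fmt in {"fasta", "fastq"} and cfg.get(prefix + "type") not in READ_TYPES:
        problems.append(prefix + "type must be ose or ope for FASTA/FASTQ files")
    for key, allowed in (("oriented_end", {"1", "2"}), ("small", {"0", "1"})):
        value = cfg.get(prefix + key)
        if value is not None and value not in allowed:
            problems.append("%s%s must be one of %s" % (prefix, key, sorted(allowed)))
    return problems


cfg = {
    "evidence_A_pat": "*.fastq.gz",
    "evidence_A_format": "fastq",
    "evidence_A_type": "ope",
    "evidence_A_oriented_end": "1",
    "evidence_A_small": "0",
    "evidence_list": "A",
}
evidence_problems = check_evidence(cfg, "A")
print(evidence_problems or "evidence block A looks consistent")
```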

Note: to integrate Roche NimbleGen tiling array hybridization results, three parameters must be specified: evidence_A_pat, the pattern used to list the files in NimbleGen PAIR format; evidence_A_format, set to “pair”; and evidence_A_ndf, the absolute path of the NimbleGen-formatted design (ndf) file.

3.11 Limitations

The current eukaryotic gene model does not allow for the prediction of ncRNAs or other genes inside the introns of genes. When such intronic genes exist, EuGene tends to predict split genes so that the intronic gene can be predicted. The EuGene C++ software contains plug-ins capable of handling frameshifts in CDSs and of detecting splice variants based on transcript alignments [20]. These plug-ins are not enabled in the standard version of the pipeline because the goal is to predict a reference annotation of the full genome. The EuGene pipeline is not suitable for the annotation of chloroplast and mitochondrial genomes; in these cases, it is best to use specialized tools.

4 Companion Tools

4.1 Building New Splice Sites WAM Matrices for a New Species or Group of Species

EuGene uses statistical models (Weight Array Method matrices) that have already been trained for the detection of splice sites. The available models are described here: http://eugene.toulouse.inra.fr/WAM/. If no available model matches your species, you can build a new WAM matrix. The prerequisite is to have several complete genomes of closely related species and, for each of them, a good-quality transcriptome that is as complete as possible. Indeed, to train the models, the transcripts must cover a sufficiently large number of genes from various families to capture the variability of the signals around the splice sites.

In practice: Copy the template configuration file and edit it to fit your data:

% cp $EGNEP/cfg/euk/egn_build_wam.cfg $EGNEP/cfg/egn_build_wam_myspeciesgroup.cfg

Specify the name of the group of species and the genome/transcriptome pairs:

% vi $EGNEP/cfg/egn_build_wam_myspeciesgroup.cfg
wam_species=myspeciesgroup
wam_dataset_list=E F
wam_dataset_E_genome=/path/of/genomeE
wam_dataset_E_transcriptome=/path/of/transcriptomeE
wam_dataset_F_genome=/path/of/genomeF
wam_dataset_F_transcriptome=/path/of/transcriptomeF

Then run:

% $EGNEP/bin/int/egn_build_wam.pl --cfg $EGNEP/cfg/egn_build_wam_myspeciesgroup.cfg \
    --outdir /path/of/wamoutdir

At the end of the execution, a tarball /path/of/wamoutdir/myspeciesgroup.tar.gz containing the donor and acceptor WAM matrices is created. See the instructions in Subheading 3.3.1 to use these models. More information about the program options can be obtained with the command:

% $EGNEP/bin/int/egn_build_wam.pl --help
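For readers unfamiliar with the weight array method [11], the idea can be sketched in a few lines of Python. This is a generic textbook-style illustration, not the code of egn_build_wam.pl and not EuGene's matrix format: it assumes a first-order model in which each position depends on the preceding nucleotide, fixed-length windows over A, C, G, and T, and a pseudocount of 1; the function names and the toy training windows are invented for the example.

```python
import math

ALPHABET = "ACGT"


def train_wam(windows, pseudocount=1.0):
    """Estimate P(first base) and, per position, P(base | previous base)."""
    length = len(windows[0])
    first = {b: pseudocount for b in ALPHABET}
    trans = [{p: {b: pseudocount for b in ALPHABET} for p in ALPHABET}
             for _ in range(length - 1)]
    for w in windows:
        first[w[0]] += 1
        for i in range(1, length):
            trans[i - 1][w[i - 1]][w[i]] += 1
    total = sum(first.values())
    first = {b: c / total for b, c in first.items()}
    for position in trans:
        for row in position.values():
            row_total = sum(row.values())
            for b in row:
                row[b] /= row_total
    return first, trans


def log_score(wam, window):
    """Log-probability of a candidate window under the trained model."""
    first, trans = wam
    score = math.log(first[window[0]])
    for i in range(1, len(window)):
        score += math.log(trans[i - 1][window[i - 1]][window[i]])
    return score


# Toy donor-site windows (hypothetical), centered on the GT dinucleotide.
donor_windows = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT"]
wam = train_wam(donor_windows)
donor_like = log_score(wam, "AGGTAAGT")
background = log_score(wam, "TTCCTTCC")
print(donor_like > background)
```

With real data, thousands of windows extracted around transcript-confirmed splice sites are needed, which is why several genome/transcriptome pairs are requested above.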

4.2 Transferring Annotations Between Genome Releases

The EuGene tarball includes the program $EGNEP/bin/int/misc/egn_annotation_transfer.pl, which transfers the structural annotation of a sequence to another sequence. There is no gene discovery: only the genes of the initial annotation are transferred to the new sequence. The genes annotated on the new sequence will have a “valid” structure, that is, one compatible with EuGene’s internal gene model (e.g., every CDS starts with a start codon, ends with a stop codon, and has a length that is a multiple of 3). The name of the source gene is given in the “Alias” attribute of the output GFF3 file. The program is useful for transferring the annotation between genotypes/strains of the same species while preserving the correspondence between genes. The pipeline is detailed here: http://eugene.toulouse.inra.fr/Tools/egn_annotation_transfer.html

Usage:

% $EGNEP/bin/int/misc/egn_annotation_transfer.pl --help
--scaffolds       genomic sequences to annotate (multi-fasta)
--ref_scaffolds   reference genomic sequences (multi-fasta)
--ref_gff3        reference structural annotation (gff3)
--outfile         transferred structural annotation (gff3)
--workingdir      working directory
--cfg             configuration file
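The notion of a “valid” CDS mentioned above (start codon, stop codon, length multiple of 3) can be made concrete with a short check. This sketch is only an illustration and is not taken from egn_annotation_transfer.pl: it assumes the standard genetic code with ATG as the only start codon, and the rejection of in-frame internal stop codons is an additional assumption.

```python
START_CODONS = {"ATG"}                 # assumption: standard code, ATG only
STOP_CODONS = {"TAA", "TAG", "TGA"}    # assumption: standard genetic code


def is_valid_cds(seq):
    """Check the three structural rules named in the text, plus internal stops."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    if seq[:3] not in START_CODONS or seq[-3:] not in STOP_CODONS:
        return False
    # Added assumption: no in-frame stop codon before the final one.
    for i in range(3, len(seq) - 3, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return False
    return True


print(is_valid_cds("ATGAAATAA"))     # start, one codon, stop
print(is_valid_cds("ATGAATAA"))      # length is not a multiple of 3
print(is_valid_cds("ATGTAAAAATAA"))  # in-frame internal stop
```

For species that use a non-standard genetic code, the codon sets would have to be adapted.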

To contact the authors, use [email protected]. To receive information and updates on EuGene, subscribe to the eugene-info mailing list: https://groupes.renater.fr/sympa/info/eugene-info.

References

1. Foissac S, Bardou P, Moisan A, Cros MJ, Schiex T (2003) EUGENE’HOM: a generic similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res 31(13):3742–3745
2. Lowe TM, Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25(5):955–964
3. Lagesen K, Hallin PF, Rødland E, Stærfeldt HH, Rognes T, Ussery DW (2007) RNAmmer: consistent annotation of rRNA genes in genomic sequences. Nucleic Acids Res 35(9):3100–3108
4. Nawrocki EP, Eddy SR (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29:2933–2935
5. Schiex T, Moisan A, Rouzé P (2001) Eugène: an eukaryotic gene finder that combines several sources of evidence. In: Gascuel O, Sagot MF (eds) Computational biology. JOBIM 2000. Lecture notes in computer science, vol 2066. Springer, Heidelberg
6. Foissac S, Gouzy J, Rombauts S, Mathé C, Amselem J, Sterck L, Van de Peer Y, Rouzé P, Schiex T (2008) Genome annotation in plants and fungi: EuGene as a model platform. Curr Bioinforma 3(2):87–97
7. Bellman R (1957) Dynamic programming. Princeton Univ. Press, Princeton, NJ
8. Sallet E, Roux B, Sauviac L, Jardinaud MF, Carrere S, Faraut T, de Carvalho-Niebel F, Gouzy J, Gamas P, Capela D, Bruand C (2013) Next-generation annotation of prokaryotic genomes with EuGene-P: application to Sinorhizobium meliloti 2011. DNA Res 20(4):339–354
9. Lafferty J, McCallum A, Pereira FCN (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML ’01 proceedings of the eighteenth international conference on machine learning
10. Badouin H et al (2017) The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution. Nature 546(7656):148–152
11. Zhang MQ, Marr TG (1993) A weight array method for splicing signal analysis. Bioinformatics 9(5):499–509
12. Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M (2005) The sequence ontology: a tool for the unification of genome annotations. Genome Biol 6:R44
13. Girgis HZ (2015) Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale. BMC Bioinformatics 16:227
14. Ellinghaus D, Kurtz S, Willhoeft U (2008) LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics 9:18
15. Bao W, Kojima KK, Kohany O (2015) Repbase update, a database of repetitive elements in eukaryotic genomes. Mob DNA 6:11
16. Arnaiz O, Van Dijk E, Bétermier M, Lhuillier-Akakpo M, de Vanssay A, Duharcourt S, Sallet E, Gouzy J, Sperling L (2017) Improved methods and resources for paramecium genomics: transcription units, gene annotation and gene expression. BMC Genomics 18(1):483
17. Skinner ME, Uzilov AV, Stein LD, Mungall CJ, Holmes IH (2009) JBrowse: a next-generation genome browser. Genome Res 19:1630–1638
18. Carrere S, Gouzy J (2017) myGenomeBrowser: building and sharing your own genome browser. Bioinformatics 33(8):1255–1257
19. Sallet E, Gouzy J, Schiex T (2014) EuGene-PP: a next-generation automated annotation pipeline for prokaryotic genomes. Bioinformatics 30(18):2659–2661
20. Foissac S, Schiex T (2005) Integrating alternative splicing detection into gene prediction. BMC Bioinformatics 6:25

Chapter 7

ChemGenome2.1: An Ab Initio Gene Prediction Software

Akhilesh Mishra, Priyanka Siwach, Poonam Singhal, and B. Jayaram

Abstract

Gene prediction, also known as gene identification, gene finding, gene recognition, or gene discovery, is one of the important problems of molecular biology and is receiving increasing attention due to the advent of large-scale genome sequencing projects. We designed an ab initio model (called ChemGenome) for gene prediction in prokaryotic genomes based on physicochemical characteristics of codons. In this chapter, we present the methodology of the latest version of this model, ChemGenome2.1 (CG2.1). The first module of the protocol builds a three-dimensional vector from three calculated quantities for each codon: the double-helical trinucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for protein-nucleic acid interactions. As this three-dimensional vector moves along any genome, the net orientation of the resultant vector should differ significantly between genic and non-genic regions to make a distinction feasible. The putative protein-coding genes predicted from these parameters are passed through a second module of the protocol, which reduces the number of false positives with a filter based on stereochemical properties of protein sequences. The chemical properties of amino acid side chains taken into consideration are the presence of an sp3 hybridized γ carbon atom, hydrogen bond donor ability, a short or absent δ carbon, and linearity of the side chains/non-occurrence of bidentate forks with terminal hydrogen atoms in the side chain. The final prediction of the potential protein-coding genes is based on the frequency of occurrence of amino acids in the predicted protein sequences and their deviation from the frequency values of Swissprot protein sequences, both at monomer and tripeptide levels. The final screening is based on Z-score.

Though CG2.1 is a gene finding tool for prokaryotes, considering the underlying similarity in the chemical and physical properties of DNA among prokaryotes and eukaryotes, we attempted to evaluate its applicability for gene finding in the lower eukaryotes. The results give hope that gene finding based on a physicochemical model of codons is a viable idea for eukaryotes as well, though, undoubtedly, improvements are needed.

Key words Physicochemical model, Hydrogen bond energy, Stacking energy, Nucleic acid-protein interaction, Swissprot, Tripeptide frequencies

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_7, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

Since the release of the first complete genome sequence, untiring efforts have been made to understand the DNA language, which has only four letters but no apparent syntax. Because experimental approaches are capital-intensive and time-consuming, computational methods for automatic annotation of DNA sequences have been looked upon as a fast and reliable option since the beginning. Over the years, a large number of efforts have been made to develop efficient protocols for the analysis of DNA sequences and for gene prediction in particular. The majority of these methods are based on statistical and mathematical techniques and artificial intelligence techniques built on genome, gene, cDNA, and protein sequence databases [1–32]. Generally these are based on capturing the sequence characteristics of genes by training on known genic sequences. Though some of these methods are able to predict genes with high precision (as high as ~98%), their dependency on training data makes them genome specific, and caution is to be exercised in extending them to annotating newly sequenced genomes.

With the goal of developing an in silico gene-finding model that captures the essence of DNA sequences on the basis of their intrinsic physicochemical properties and is universal in application, the hypothesis of "ChemGenome" was conceptualized [33]. It was a simple concept, stating that the function of any DNA sequence can be explained by its physical and chemical properties, as these decide the nature of its interactions with various regulatory proteins and polymerases [33–41]. For instance, upon interaction with RNA polymerase/transcriptional machinery, the genic DNA helix unwinds or melts while non-genic regions stay unchanged, which brings their relative stability into consideration as a way to capture their identity; the stability of a given DNA region can be attributed to hydrogen bonding between bases and base stacking interactions. The hypothesis led to the development of a simple three-parameter model based on Watson-Crick hydrogen bonding energy, base pair stacking energy, and DNA-protein interactions (Fig. 1).

The hydrogen bonding (x dimension) and stacking energies (y dimension) for each codon were assigned based on finite-difference Poisson–Boltzmann calculations, assuming canonical B-form structures, while the third parameter (z dimension) for each codon is based on the conjugate rule [33, 40]. For a given DNA sequence, the resultant vector (j-vector) is found by accumulating the x, y, z components of the individual codons, and the orientation of this resultant vector from the origin is given by direction cosines. A universal plane was created based on training on 1500 gene/non-gene (shifted-gene) pairs of the E. coli K12 genome. The hypothesis is that gene and non-gene vectors lie on different sides of the plane. The plane parameters were optimized to give the best separation in orientation between the gene and non-gene vectors. Thus a computation of the j-vector for any given DNA sequence and its orientation relative to the plane decides whether or not it is a potential gene coding for a protein. A general specification of the procedure is referred to as the ChemGenome algorithm and was applied to 331 prokaryotic genomes in its initial version and subsequently to 900 prokaryotic genomes, giving gene/non-gene classification accuracies comparable to those of knowledge-based methods [33].

Fig. 1 Physicochemical model for analyzing DNA sequences

The ChemGenome algorithm was subjected to further improvements, first by including the sequence dependence of the solution structures of codons in the x and y parameters, and second by rendering the j-vector approach into a fully ab initio gene finding tool by reframing the z parameter as representing the propensity of a codon for intermolecular interactions. The codons here refer to the double-helical trinucleotide in a given reading frame. The sequence effects on DNA structure were investigated based on molecular dynamics (MD) simulations, and the results were applied to sequence-dependent structures of codons. Treating the problem in general involves, at a minimum, the study of the sequence-dependent structures of all ten unique dinucleotide steps. To address the immediate sequence context, the study required a minimal consideration of 136 unique tetranucleotide steps. MD has been applied to this problem by a consortium of researchers who collectively performed 15-ns trajectories on 39 different 15-base-pair DNA sequences in which multiple copies of all the 136 tetranucleotides are represented [42, 43]. The three parameters (x, y, z) of codons were calculated from the above MD simulations as a special case of the results on tetranucleotides. The resulting gene-finding program was named ChemGenome2 (CG2) and was found to perform at a level equivalent to or better than previously reported methods for gene finding from whole genome sequences [34].

ChemGenome was devised for the analysis of prokaryotic genomes; how would the same methodology work in eukaryotes? The question arises naturally, as prokaryotes and eukaryotes share considerable similarity in the physical and chemical properties of DNA. A preliminary analysis was made earlier by applying the physicochemical model exclusively to the exonic regions of eukaryotic genes, instead of whole genomes, and remarkable values of sensitivity, specificity, and correlation coefficients were obtained [33]. The present study was designed to assess the efficacy of the latest version of our physicochemical model (CG2.1, details of improvements given in Subheading 3) in eukaryotic gene prediction from complete genome sequences; eukaryotes have quite different gene structures (organized in introns and exons) from prokaryotes, except for many unicellular protists, which often do not have introns at all. In this chapter, we present the complete methodology of CG2.1 and its application to eukaryotic genome analysis, using the whole genomes of five lower eukaryotes (representing considerable diversity in genomes). The results give hope that gene finding based on physicochemical properties of DNA sequences is a viable idea for eukaryotes as well, though, undoubtedly, improvements are needed. The present study offers a platform to build on the concept of eukaryotic genome annotation via physicochemical properties.

2 Methods

The overall methodology of CG2.1, with all the essential steps, is given in the form of a flowchart in Fig. 2. There are broadly two stages: stage I targets the prediction of putative genes from whole genomes and works only in "DNA space," while stage II filters out false positives from the predicted results through various filters by working in "Protein, Swissprot space as well as DNA space." The details of each step are discussed below.

2.1 Stage I

2.1.1 Prediction of Putative Genes from Whole Genomes

After reading the whole genome in fasta format, the very first step is to seek all the possible ORFs from it. An ORF is defined as a fragment of DNA that starts with special initiation codons (ATG, CTG, GTG, TTG) and extends through a series of triplets representing amino acids until it ends at one of the three stop codons (TAA, TAG, TGA) [44]. The default minimum length of ORFs is set at 99 base pairs (bp) for prokaryotes and 300 bp for eukaryotes. All the possible ORFs equal to or longer than the threshold length in each of the six reading frames of the double-stranded DNA are extracted. Each ORF is then subjected to the resultant j-vector calculation, which refers to the summation of all the j-vectors for the corresponding codons of a given ORF. The x, y, and z components of the j-vector of each codon are the nucleotide base pairing energy, the base pair stacking energy, and an index of the propensity of a codon for intermolecular interactions, each defined on the interval of −1 to +1. The x, y, z parameters of the j-vector for each codon are listed in Table 1 and are developed as follows.

Fig. 2 A flowchart of the complete methodology of ChemGenome2.1 (CG2.1)

(a) The x component (hydrogen bond energies): The Watson-Crick (WC) hydrogen bond energies are calculated from the MD trajectories using the ptraj and anal modules of the AMBER [45] software. Denoting the successive bases of a trinucleotide as i, j, and k, and their Watson-Crick partners on the complementary strand as l, m, and n, the hydrogen bond energy is calculated from the simulation data as follows:

\[ E_{HB} = E_{i-l} + E_{j-m} + E_{k-n} \]

where E_{i-l} refers to the electrostatic plus van der Waals interactions of all the hydrogen bonding atoms of base i with those of base l. The hydrogen bond energy for all the 32 unique trinucleotides was calculated from all the 39 sequences in the ABC database, and the data were averaged over the multiple copies of the same trinucleotide. These energies span a range of values from 17.4 to 10.7 kcal/mol. The resultant energies were then linearly mapped onto the [−1, 1] interval, giving the x coordinate as per

\[ x[i] = \left[ \left\{ (E[i] + E_{\min}) \times \left( E_{\text{desired range}} / E_{\text{actual range}} \right) \right\} - E_{\text{desired min}} \right] \]

where E[i] is the hydrogen bonding energy for the ith codon, and i ranges from 1 to 64. E_{desired range} here is 2 and E_{desired min} is 1.

(b) The y component (base pair stacking energies): The stacking energies, which here comprise electrostatic and van der Waals components, were calculated for all the 32 unique double-helical trinucleotide sequences in a similar manner.

\[ E_{Stack} = (E_{i-m} + E_{i-n}) + (E_{j-l} + E_{j-n}) + (E_{k-l} + E_{k-m}) + (E_{i-j} + E_{i-k} + E_{j-k}) + (E_{l-m} + E_{l-n} + E_{m-n}) \]

After averaging the energies of multiple copies of the same trinucleotide obtained from the MD trajectories, the energies were seen to span the range of 56.2 kcal/mol to 52.9 kcal/mol. The resultant energies were mapped onto the interval [−1, 1], giving the y coordinate for each codon:

\[ y[i] = \left[ \left\{ (E[i] + E_{\min}) \times \left( E_{\text{desired range}} / E_{\text{actual range}} \right) \right\} - E_{\text{desired min}} \right] \]

where E[i] is the base pair stacking energy for the ith codon, and i ranges from 1 to 64.

(c) The z values were assigned based on the rule of conjugates [40] proposed earlier. According to the rule of conjugates, Adenine (A) is the conjugate of Cytosine (C) and Guanine (G) is the conjugate of Thymine (T). A codon and its conjugate codon are assigned equal and opposite values (+1 for the codon and −1 for its conjugate codon). Codons starting with G are assigned +1. Codons starting with C and ending with G or T are assigned +1, and those ending with A or C are assigned −1. The rule of conjugates fixes the remaining 32 values. The conjugate rule extends the wobble hypothesis to capture the general spirit of the molecular events at the recognition site: the dynamics of the third base of the codon on mRNA in the presence of the anticodon.

Table 1 The x (hydrogen-bonding energy), y (stacking energy), and z (protein-nucleic acid interaction propensity parameter) values assigned for each of the 64 codons [34]

CODON  x     y     z    CODON  x     y     z
CCC    1.0   0.97  1    TCC    0.85  0.66  1
CCG    0.85  0.14  1    TCG    0.41  0.10  1
CCT    0.03  1.00  1    TCT    0.15  0.74  1
CCA    0.02  0.81  1    TCA    0.18  0.23  1
CGC    0.98  1.00  1    TGC    0.49  0.38  1
CGG    0.85  0.14  1    TGG    0.02  0.81  1
CGT    0.30  0.71  1    TGT    0.13  0.07  1
CGA    0.41  0.10  1    TGA    0.18  0.23  1
CTC    0.07  0.75  1    TTC    0.19  0.50  1
CTG    0.03  0.20  1    TTG    0.18  0.26  1
CTT    0.82  0.87  1    TTT    0.93  0.56  1
CTA    0.33  0.12  1    TTA    0.85  0.65  1
CAC    0.07  0.25  1    TAC    0.20  0.11  1
CAG    0.03  0.20  1    TAG    0.33  0.12  1
CAT    0.15  0.15  1    TAT    0.94  0.41  1
CAA    0.18  0.26  1    TAA    0.85  0.65  1
GCC    0.90  0.13  1    ACC    0.86  0.49  1
GCG    0.98  1.00  1    ACG    0.30  0.71  1
GCT    0.27  0.24  1    ACT    0.01  0.48  1
GCA    0.49  0.38  1    ACA    0.13  0.07  1
GGC    0.90  0.13  1    AGC    0.27  0.24  1
GGG    1.0   0.97  1    AGG    0.03  1.00  1
GGT    0.86  0.49  1    AGT    0.01  0.48  1
GGA    0.85  0.66  1    AGA    0.15  0.74  1
GTC    0.09  0.01  1    ATC    0.25  0.10  1
GTG    0.07  0.25  1    ATG    0.15  0.15  1
GTT    0.57  0.10  1    ATT    1.0   0.29  1
GTA    0.20  0.11  1    ATA    0.94  0.41  1
GAC    0.09  0.01  1    AAC    0.57  0.10  1
GAG    0.07  0.75  1    AAG    0.82  0.87  1
GAT    0.25  0.10  1    AAT    1.0   0.29  1
GAA    0.19  0.50  1    AAA    0.93  0.56  1

Note Universal plane equation identified for prokaryotes and used for gene prediction in all the 372 genomes studied: nx = 0.698451, ny = 6.82635, nz = 22.8116, d = 1.0

2.1.2 Selection of Putative Genes from Open Reading Frames

Initially the best plane is generated for every genome using a pocket algorithm [46], a modification of perceptron learning [47] that keeps the learning well-behaved with non-separable training data. Finally, the best universal plane, covering the maximum number of genomes with sensitivity greater than 95%, is generated and is used to segregate the ORF vectors into gene and non-gene vectors. The ORF vectors lying above the plane are identified as gene vectors, while the ones lying below the plane are identified as non-gene vectors, irrespective of the genome species being studied. The first stage of the methodology ends here, and its output is a set of putative protein-coding genes comprising true genes and false-positive entries.
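The classification step described above can be illustrated with a short shell sketch using awk. This is a minimal illustration under stated assumptions, not the ChemGenome implementation: the function name classify_orf, the input format (one codon per line, given as its three numbers x y z), the choice to evaluate the plane at the direction cosines of the resultant vector, and the convention that a positive value means "above the plane" are all assumptions made here. The plane coefficients are passed as arguments rather than hard-coded.

```shell
# classify_orf NX NY NZ D
# Reads one "x y z" codon vector per line from standard input (hypothetical format),
# sums them into a resultant vector, normalizes it to direction cosines, and reports
# on which side of the plane NX*x + NY*y + NZ*z + D = 0 the unit vector lies.
classify_orf() {
  awk -v nx="$1" -v ny="$2" -v nz="$3" -v d="$4" '
    { sx += $1; sy += $2; sz += $3 }          # accumulate the codon components
    END {
      len = sqrt(sx*sx + sy*sy + sz*sz)
      if (len == 0) { print "undefined"; exit }
      # direction cosines are sx/len, sy/len, sz/len
      s = nx*sx/len + ny*sy/len + nz*sz/len + d
      print (s > 0 ? "gene" : "non-gene")     # assumed: positive side = above the plane
    }'
}
```

With per-codon values looked up from a table such as Table 1, a call would take the form `classify_orf NX NY NZ D < orf_vectors.txt`, where orf_vectors.txt is a hypothetical file name.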

2.2 Stage II: Filtering Out False Positives from Selected Putative Genes

Filter 1 (protein space module): The predicted putative protein-coding genes include a large number of false positives (gene-like sequences incapable of producing functional proteins). To reduce the number of false positives, the first filter was devised on the basis of stereochemical properties of protein sequences. The properties taken into consideration were (1) the presence of an sp3 hybridized γ carbon atom, (2) hydrogen bond donor ability, (3) a short or absent δ carbon, and (4) linearity of the side chains/non-occurrence of bidentate forks with terminal hydrogen atoms in the side chains [48, 49]. A computational analysis of 175,000 Swissprot protein sequences shows that naturally occurring protein sequences are characterized by a very high occurrence of amino acids with sp3 γ carbon atoms and short side chains, relative to randomly generated polypeptide sequences. Hydrogen bond donating side chains are heavily under-represented, as are, to a lesser extent, the unbranched amino acid side chains. Pursuing these observations and utilizing the differences between protein sequences and random sequences, a computational filter was developed to distinguish between genes coding for proteins and non-gene regions in genomic DNA sequences. An analysis of 239,418 gene sequences from 331 prokaryotic genomes showed prediction sensitivities greater than 90%.

Table 2 Distribution of 8000 tripeptides in 20 groups on the basis of frequency value range (a scale running between the most represented and the least represented tripeptide)

Range            Score    Range            Score
0–0.00125        1        0.0125–0.02455   0.1
0.00125–0.0025   0.9      0.02455–0.0366   0.2
0.0025–0.00375   0.8      0.0366–0.04865   0.3
0.00375–0.005    0.7      0.04865–0.0607   0.4
0.005–0.00625    0.6      0.0607–0.07275   0.5
0.00625–0.0075   0.5      0.07275–0.0848   0.6
0.0075–0.00875   0.4      0.0848–0.09685   0.7
0.00875–0.01     0.3      0.09685–0.1089   0.8
0.01–0.01125     0.2      0.1089–0.12095   0.9
0.01125–0.0125   0.1      0.12095–0.133    1
0.0125           0

Filter 2 (Swissprot space): The second filter was developed on the basis of frequencies of codons via their corresponding amino acids from 175,000 Swissprot proteins. A query nucleotide sequence is converted into amino acids, and their frequencies of occurrence are compared with the Swissprot data (Swissprot space). Both the Swissprot amino acid frequencies and the query amino acid frequencies were normalized for 100 amino acids, and the difference between the Swissprot and query frequencies was calculated for each amino acid, as was the overall standard deviation. After validation on a large dataset of functionally annotated genes and non-genes from 372 prokaryotic genomes, we found a standard deviation (cut-off) of

>vertebrates.txt

will produce a suitable file like this:

Stefanie Nachtweide and Mario Stanke

File contents example: vertebrates.txt
((monDom5:0.340786,(((hg38:0.035974,rheMac3:0.043601):0.109934,(mm10:0.084509,rn6:0.091589):0.271974):0.020593,(bosTau8:0.18908,canFam3:0.13303):0.032898):0.258392):0.181168,galGal4:0.559442);
bosTau8 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/bosTau8.fa
canFam3 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/canFam3.fa
galGal4 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/galGal4.fa
hg38 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/hg38.fa
mm10 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/mm10.fa
monDom5 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/monDom5.fa
rheMac3 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/rheMac3.fa
rn6 /home/mario/Augustus/docs/tutorial-cgp/data/genomes/rn6.fa

Note that absolute path names are required. Now, the actual alignment can be performed with:

Bash input
$ runProgressiveCactus.sh vertebrates.txt cactusout \
    vertebrates.hal --maxThreads=4 2>&1 > cactus.out

This command outputs a binary alignment file vertebrates.hal. For the example data it requires about 10 min. On large data, this command can be expected to be time-consuming, and it may need to be run on a fast machine with many cores (increase maxThreads accordingly).

3.2.2 Exporting a HAL Alignment to MAF Alignments

The file vertebrates.hal contains the complete whole-genome multiple alignment. To allow for independent parallel executions of the gene prediction step, we recommend splitting the global alignment into several overlapping alignment chunks and simultaneously converting the chunks to MAF format:

Bash input
$ hal2maf_split.pl --halfile vertebrates.hal --refGenome hg38 \
    --cpus 4 --chunksize 50000 --overlap 25000 --outdir mafs

hal2maf_split.pl builds upon the HAL tools, which are also part of the progressiveCactus package. The above script uses a single reference genome among the aligned genomes (here the human hg38) as a guide to split the alignment. The reference genome is conceptually divided into regions of size chunksize such that neighboring regions overlap by overlap base pairs. Then, for each such region, the local alignments of the input alignment for the given region are exported into one MAF alignment. To avoid undersized chunks on which only partial genes are predicted, we recommend choosing a reference species that has a high-quality, near-complete genome assembly with long scaffolds, if available. A sensible setting for chunksize in real data could be 1000000. With a higher value each process requires more RAM and a longer run time. The above splitting command also introduces some fraction of truncated genes even when chunksize is large. In such cases, and where truncated genes cannot be patched up with joingenes (see Subheading 3.4 below), choose a larger overlap. If the progressiveCactus/submodules/hal/bin directory is not in your global python path, use the parameter hal_exec_dir of hal2maf_split.pl to point to the directory that contains hal2maf. The splitting command generates the directory mafs/, which contains the output alignment chunks in MAF format:

Bash output
$ ls -1 mafs
chr16.0-49999.maf
chr16.100000-149999.maf
chr16.125000-174999.maf
chr16.150000-199999.maf
chr16.175000-210154.maf
chr16.25000-74999.maf
chr16.50000-99999.maf
chr16.75000-124999.maf
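The chunking scheme described above can be sketched in a few lines of shell to preview how many chunks a given chunksize and overlap would produce. This is only an illustration of the arithmetic, not the code of hal2maf_split.pl; the function name chunk_intervals and the 0-based inclusive coordinates are assumptions chosen to match the file names in the listing above.

```shell
# chunk_intervals LEN CHUNKSIZE OVERLAP
# Prints 0-based inclusive "start-end" intervals, one per line, for a sequence of
# length LEN. Consecutive intervals start CHUNKSIZE-OVERLAP apart; the last one is
# cut at the end of the sequence.
chunk_intervals() {
  len=$1; chunksize=$2; overlap=$3
  step=$((chunksize - overlap))
  if [ "$len" -le 0 ] || [ "$step" -le 0 ]; then
    echo "need LEN > 0 and OVERLAP < CHUNKSIZE" >&2
    return 1
  fi
  start=0
  while :; do
    end=$((start + chunksize - 1))
    if [ "$end" -ge $((len - 1)) ]; then
      echo "$start-$((len - 1))"   # final, possibly shorter, chunk
      break
    fi
    echo "$start-$end"
    start=$((start + step))
  done
}
```

For the example sequence of length 210155 with `chunk_intervals 210155 50000 25000`, the sketch yields eight intervals ending with 175000-210154, the same set of coordinates as in the mafs/ listing.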

3.3 Loading Genomes into an SQLite Database

In multi-genome gene prediction, the genomes cannot all be processed in linear order. As all genomes together are often too large to be stored simultaneously in memory (as done in the example in Subheading 3.1), we implemented a database solution to obtain so-called random access to the regions. A single database for all species contains indices that allow efficient retrieval of only those regions that are needed at the time or by the respective parallel job. SQLite3 and MySQL are supported. To avoid redundancy, the SQLite3 solution does not store the genomes themselves in the database, but only byte offsets into the respective FASTA genome files. We recommend the SQLite3 solution over the MySQL solution (see Note 3).

First, create a table of all genome names and sequence files (as for parameter speciesfilenames of augustus):

Bash input
$ for f in $PWD/genomes/*.fa; do echo -ne "$(basename $f .fa)\t$f\n"; done >genomes.tbl

With the example data from the tutorial, the generated genomes.tbl will be a table with eight genomes. Now, load the genomes into an SQLite database:

Bash input
$ while read line
  do
    species=$(echo "$line" | cut -f 1)
    genome=$(echo "$line" | cut -f 2)
    load2sqlitedb --noIdx --species=$species --dbaccess=vertebrates.db $genome
  done < genomes.tbl
$ load2sqlitedb --makeIdx --dbaccess=vertebrates.db
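Before running (or re-running) the loading loop, it can help to verify that the table is well-formed, since a missing tab or a wrong path would otherwise only surface later. The following is a minimal sketch; the function name check_tbl is hypothetical, and the insistence on absolute paths is an assumption here that mirrors the $PWD-based paths used above (the text states the absolute-path requirement explicitly only for the Cactus input file).

```shell
# check_tbl FILE
# Every line of FILE must be "name<TAB>/absolute/path", and the path must be readable.
# Returns a non-zero status if any line fails the check.
check_tbl() {
  rc=0
  tab=$(printf '\t')
  while IFS="$tab" read -r name path extra; do
    if [ -z "$name" ] || [ -z "$path" ] || [ -n "$extra" ]; then
      echo "malformed line (expected two tab-separated fields): $name" >&2; rc=1
    elif [ "${path#/}" = "$path" ]; then
      echo "not an absolute path: $path" >&2; rc=1
    elif [ ! -r "$path" ]; then
      echo "file not readable: $path" >&2; rc=1
    fi
  done < "$1"
  return $rc
}
```

A call such as `check_tbl genomes.tbl && echo "table looks fine"` prints the confirmation only when every line passes.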

This loop loads each genome into the database. The final command then creates indices on the tables that allow regions to be accessed quickly. When genome regions are requested in subsequent steps, only small parts of vertebrates.db and the respective genome files need to be read. vertebrates.db is now a (flat file) database that contains offsets into the eight genomes. You can check if loading was successful with the following database query:

Bash input
$ sqlite3 -header -column vertebrates.db "\
    SELECT speciesname, \
    sum(end-start+1) AS 'genome length',\
    count(*) AS '# chunks',\
    count(distinct seqnr) AS '# seqs'\
    FROM genomes natural join speciesnames\
    GROUP BY speciesname;"

Multi-Genome Annotation with AUGUSTUS

It returns a summary of the genomes in the database (see Note 4):

Bash output
speciesname  genome length  # chunks    # seqs
-----------  -------------  ----------  ----------
bosTau8      156091         4           1
canFam3      184728         4           1
galGal4      149999         3           1
hg38         210155         5           1
mm10         178393         4           1
monDom5      540519         11          1
rheMac3      220640         5           1
rn6          99944          2           1

Check that all genomes are in the database and that the number of sequences and total genome size for each genome are correct.

3.4 De Novo Comparative Gene Finding

We next demonstrate an application where only the naked genomes are available ('de novo' gene finding). A more typical example with RNA-Seq follows in the next section. Create a new folder for the de novo experiments and, therein, softlinks to the MAF files.

Bash input
$ mkdir augCGP_denovo
$ cd augCGP_denovo
$ num=1
$ for f in ../mafs/*.maf; do ln -s $f $num.maf; ((num++)); done

Run comparative AUGUSTUS in so-called CGP mode on all alignment chunks in parallel.

Bash input
$ for ali in *.maf
  do
    id=${ali%.maf} # this will remove .maf suffix
    augustus \
      --species=human \
      --softmasking=1 \
      --treefile=../tree.nwk \
      --alnfile=$ali \
      --dbaccess=../vertebrates.db \
      --speciesfilenames=../genomes.tbl \
      --/CompPred/outdir=pred$id > aug$id.out 2> err$id.out &
  done

This command starts one AUGUSTUS process for each of the eight alignment files in the background using & and may take a few minutes (see Note 5). This simple parallelization approach is only for demonstration purposes. In real applications with several hundreds or thousands of alignment chunks, we recommend running parallel jobs on a compute cluster. Set the option --softmasking=1 in cases where the genomes are soft-masked. The runs generate the folders pred1/, ..., pred8/, one for each alignment chunk, each containing GFF files with gene predictions for each input genome.

Bash output
$ ls pred1/
bosTau8.cgp.gff  mm10.cgp.gff
canFam3.cgp.gff  monDom5.cgp.gff
galGal4.cgp.gff  rheMac3.cgp.gff
hg38.cgp.gff     rn6.cgp.gff
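Because the AUGUSTUS processes run in the background, any follow-up step that reads their output should start only after all of them have finished. In a script, the shell builtin wait can enforce this. The sketch below shows the pattern in isolation; the function name run_all_then_wait and the idea of passing the per-chunk command as a function name are illustrative assumptions, not part of the AUGUSTUS tutorial.

```shell
# run_all_then_wait CMD ITEM...
# Starts CMD once per ITEM in the background, then blocks until every
# background job started from this shell has finished.
run_all_then_wait() {
  cmd=$1; shift
  for item in "$@"; do
    "$cmd" "$item" &
  done
  wait   # returns once all background children have exited
}
```

In the loop above, the same effect is obtained by placing a plain `wait` on its own line after `done`.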

Merge gene predictions from parallel runs with

Bash input
$ mkdir joined_pred
$ while read line
  do
    species=$(echo "$line" | cut -f 1)
    find pred* -name "${species}.cgp.gff" >${species}_gtfs.lst;
    joingenes -f ${species}_gtfs.lst -o joined_pred/$species.gff
  done < ../genomes.tbl

This will create the folder joined_pred/ with the final gene predictions for each input genome in a single file.

3.5 RNA-Seq Based Comparative Gene Finding

We here demonstrate how RNA-Seq data can be incorporated into comparative AUGUSTUS. In general, the same types of extrinsic evidence can be incorporated as in single-species gene finding with AUGUSTUS (including RNA-Seq, cDNA, ESTs, protein sequences, etc.). In the CGP mode, each piece of evidence is specific to a genome. The evidence can be incorporated for each genome or for any subset of genomes.

Note that RNA-Seq from different genomes or species can complement each other, e.g. tissues or conditions that were sampled in one species can also result in an improved accuracy for the genes expressed in that tissue or under that condition in other species. We recommend aligning RNA-Seq reads only to the native genome of the RNA-Seq sample or, if that is not available, to the closest genome, and not additionally to other genomes. On the one hand, the alignment to more than one genome is unnecessary, as evidence is shared between genomes. On the other hand, the alignment of RNA-Seq from one species to another species' genome is more error-prone. RNA-Seq evidence is generated from the spliced alignments of individual reads against the respective genome. A typical sequence of steps could be to execute an aligner (e.g. STAR [12]), then filter the alignments for quality with filterBAM, and finally convert them to an AUGUSTUS-specific hints file with bam2hints for intron hints and wig2hints.pl for so-called exonpart hints. This step is not specific to comparative AUGUSTUS and is described, e.g., in the AUGUSTUS Wiki (http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=Augustus.Augustus) and in chapter [5]. We here assume that a so-called hints file that summarizes the extrinsic evidence is already available for each genome. The hints for this tutorial are in tutorial-cgp/data/hints/.

3.5.1 Loading RNA-Seq Hints into the SQLite Database

In this example, intron and exonpart (ep) hints for a subset of four of the eight species (human, mouse, chicken, and macaque) are provided in the hints subdirectory. Prepare a text file with a list of species names and the locations of the corresponding hints files.

Bash input
$ for f in $PWD/hints/*.gff; do echo -ne "$(basename $f .hints.gff)\t$f\n"; done > hints.tbl

The file hints.tbl will now look similar to this:

File contents example: hints.tbl
galGal4 /home/mario/Augustus/docs/tutorial-cgp/data/hints/galGal4.hints.gff
hg38    /home/mario/Augustus/docs/tutorial-cgp/data/hints/hg38.hints.gff
mm10    /home/mario/Augustus/docs/tutorial-cgp/data/hints/mm10.hints.gff
rheMac3 /home/mario/Augustus/docs/tutorial-cgp/data/hints/rheMac3.hints.gff

Load the hints into the database vertebrates.db from above. You may want to make a backup copy of the database first. The backup is useful if you want to add different sets of hints to the same genome assemblies.

Bash input
$ while read line
  do
    species=$(echo "$line" | cut -f 1)
    hints=$(echo "$line" | cut -f 2)
    load2sqlitedb --noIdx --species=$species --dbaccess=vertebrates.db $hints
  done < hints.tbl
$ load2sqlitedb --makeIdx --dbaccess=vertebrates.db

Check if loading was successful and the content is plausible with the following database query:

Bash input
$ sqlite3 -header -column vertebrates.db "\
    SELECT count(*) AS '#hints',typename,speciesname\
    FROM (hints as H join featuretypes as F on H.type=F.typeid)\
    natural join speciesnames\
    GROUP BY speciesid,typename;"

This returns a summary of how many hints of each type are in the database for each species (see Note 6):

Bash output
#hints      typename    speciesname
----------  ----------  -----------
3368        exonpart    galGal4
129         intron      galGal4
7905        exonpart    hg38
267         intron      hg38
7930        exonpart    mm10
378         intron      mm10
11050       exonpart    rheMac3
265         intron      rheMac3
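As a rough cross-check of this summary, the hints can also be counted per feature type directly in each hints file, using the fact that the feature type is the third tab-separated column of a GFF line. The sketch below is an illustration only; the function name count_hint_types is hypothetical, and whether the counts and type labels in a file match the database exactly (for example, if a file uses an abbreviated type label) is not something the text above guarantees.

```shell
# count_hint_types FILE
# Counts GFF lines per feature type (column 3), ignoring comment lines and lines
# with fewer than three tab-separated fields. Output: "count<TAB>type", sorted by type.
count_hint_types() {
  awk -F '\t' '!/^#/ && NF >= 3 { n[$3]++ }
               END { for (t in n) print n[t] "\t" t }' "$1" |
    sort -t "$(printf '\t')" -k2,2
}
```

For example, `count_hint_types hints/hg38.hints.gff` lists one line per hint type found in that file.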

Multi-Genome Annotation with AUGUSTUS 3.5.2 Preparing an Extrinsic Configuration File

155

Extrinsic evidence can be more or less trustworthy depending on the source. For example, an intron present in a reference annotation of one of the genomes may be trusted completely, while an intron inferred from RNA-Seq alignments or a spliced alignment of a protein homolog has some chance to be wrong. Such parameters are set in a text file, we refer to as extrinsic config file. Start its creation by copying an existing extrinsic config file:

Bash input $ cp ${AUGUSTUS_CONFIG_PATH}/extrinsic/extrinsic-cgp.cfg extrinsic-rnaseq.cfg

Open the file extrinsic-rnaseq.cfg with a text editor, go to the first [GROUP] section, and replace the line "none" that follows the comment

File contents example: extrinsic-rnaseq.cfg
[GROUP]
# replace 'none' by the names of genomes with src=W and src=E hints in the database
none

as instructed, by the space-separated list of names of genomes with RNA-Seq hints, i.e.

File contents example: extrinsic-rnaseq.cfg
[GROUP]
hg38 mm10 rheMac3 galGal4
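The same edit can be scripted when many configuration files have to be prepared. The sketch below is an illustration under assumptions: the function name set_first_group is hypothetical, and it assumes that the placeholder is a line consisting only of the word none inside the first [GROUP] section, as in the file contents example above.

```shell
#!/usr/bin/env bash
# set_first_group CFG NAME... (hypothetical helper):
# print CFG to stdout with the first line "none" inside the first [GROUP]
# section replaced by the space-separated genome names.
set_first_group() {
  local cfg=$1
  shift
  awk -v names="$*" '
    /^\[GROUP\]/ { groups++ }                       # count [GROUP] headers
    groups == 1 && !replaced && /^none[ \t]*$/ {    # placeholder in first group
      print names
      replaced = 1
      next
    }
    { print }                                       # everything else unchanged
  ' "$cfg"
}
```

For example, set_first_group extrinsic-rnaseq.cfg hg38 mm10 rheMac3 galGal4 > extrinsic-rnaseq.new.cfg writes the edited copy to a new file (the output file name is arbitrary), which can then be inspected before it replaces the original.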

In the comparative mode of AUGUSTUS, hints can be integrated for multiple species. The configuration file allows specifying the extrinsic parameters individually for each species or, often more conveniently, for groups of species. For example, hints from existing annotations may be available for two genomes, one more trustworthy than the other. Another genome may have RNA-Seq evidence, and for yet another genome no evidence may be available. For instructions on changing specific parameters, we refer the reader to the bottom of the files extrinsic.cfg and extrinsic-cgp.cfg in the folder config/extrinsic of the AUGUSTUS package or to the chapter on single-genome gene prediction with AUGUSTUS [5].


3.5.3 Running AUGUSTUS-CGP with RNA-Seq hints

Create a new folder for these experiments and switch to the new directory:

Bash input
$ mkdir augCGP_rnaseq
$ cd augCGP_rnaseq
$ # create here softlinks to the alignment chunks for convenience
$ num=1; for f in ../mafs/*.maf; do ln -s $f $num.maf; ((num++)); done

Next, run comparative AUGUSTUS in parallel on the alignment chunks:

Bash input
$ for ali in *.maf
$ do
$   id=${ali%.maf} # remove .maf suffix
$   augustus \
$     --species=human \
$     --softmasking=1 \
$     --treefile=../tree.nwk \
$     --alnfile=$ali \
$     --dbaccess=../vertebrates.db \
$     --speciesfilenames=../genomes.tbl \
$     --alternatives-from-evidence=0 \
$     --dbhints=1 \
$     --UTR=1 \
$     --allow_hinted_splicesites=atac \
$     --extrinsicCfgFile=../extrinsic-rnaseq.cfg \
$     --/CompPred/outdir=pred$id > aug$id.out 2> err$id.out &
$ done

The option UTR=1 enables the model for untranslated regions and is recommended whenever 'exonpart' hints are incorporated. dbhints=1 enables the retrieval of hints from the database. The option allow_hinted_splicesites=atac enables the prediction of the rare AT-AC splice sites when they are supported by hints. This is in addition to the default GT-AG and GC-AG splice sites (first and last two intronic bases). The above command will generate a folder for each alignment chunk that contains GFF files with gene predictions for each input genome. Finally, merge the gene predictions from parallel runs with the command in the last box from Subheading 3.4.
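Since the runs are started in the background, it is worth confirming that every chunk produced output before merging. The following sketch assumes the naming used above (chunks N.maf, output folders predN containing GFF files); the function name list_incomplete_chunks is hypothetical. A GFF file being present does not prove that a run finished cleanly, so the files errN.out should still be inspected.

```shell
#!/usr/bin/env bash
# list_incomplete_chunks (hypothetical helper): print the id of every
# alignment chunk N.maf in the current directory whose output folder predN
# contains no GFF file.
list_incomplete_chunks() {
  local ali id gff found
  for ali in *.maf; do
    [ -e "$ali" ] || continue            # pattern did not match any file
    id=${ali%.maf}
    found=0
    for gff in "pred$id"/*.gff; do
      if [ -e "$gff" ]; then
        found=1
        break
      fi
    done
    [ "$found" -eq 1 ] || echo "$id"
  done
}
```

Calling the shell builtin wait in the shell that started the jobs blocks until all background runs have ended; list_incomplete_chunks can then be used to identify chunks that need to be rerun.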

3.6 Joining with Single-Genome Gene Predictions

Comparative AUGUSTUS delivers, in principle, an incomplete genome annotation because it only annotates significantly alignable regions. Other genome regions, e.g. regions that are unique to one input genome, are not annotated. In addition, the gene ranges can sometimes break up genes, e.g. when some genome assemblies are fragmented or wrong. Therefore, we recommend supplementing the comparative annotation with a single-genome annotation. For demonstration purposes, we run regular AUGUSTUS independently on each genome, using RNA-Seq where available:

Bash input
$ mkdir aug_rnaseq
$ cd aug_rnaseq
$ for assembly in bosTau8 canFam3 monDom5 rn6; do
$   # make ab initio predictions for genomes without hints
$   augustus --species=human ../genomes/$assembly.fa --softmasking=1 > $assembly.gff &
$ done
$ for assembly in hg38 mm10 rheMac3 galGal4; do
$   # make RNA-Seq based predictions on other genomes
$   augustus --species=human ../genomes/$assembly.fa --hintsfile=../hints/$assembly.hints.gff \
$     --UTR=on --allow_hinted_splicesites=atac --extrinsicCfgFile=extrinsic.M.RM.E.W.cfg \
$     --softmasking=on --alternatives-from-evidence=off > $assembly.gff &
$ done

If extrinsicCfgFile is not a path to a file, a file with that name, here extrinsic.M.RM.E.W.cfg, is searched for in the config/extrinsic folder. We now have two annotations for each genome, the comparative and the non-comparative one, which we merge using joingenes. When doing this, we give the comparative annotation a higher priority. In the data directory, issue

Bash input
$ mkdir aug_joined
$ cd aug_joined
$ for assembly in hg38 mm10 rheMac3 galGal4 bosTau8 canFam3 monDom5 rn6; do
$   joingenes -g ../augCGP_rnaseq/joined_pred/$assembly.gff,../aug_rnaseq/$assembly.gff \
$     --priorities=2,1 --output=$assembly.jg.gff
$ done

joingenes does not only select the higher-priority transcripts (here the comparative ones) from conflicting transcript versions, but also sometimes extends comparatively predicted transcripts that appear to be truncated, using overlapping predictions of the same gene set or from single-genome predictions.

3.7 Annotation Mapping

An important special case of comparative gene prediction with extrinsic evidence is annotation mapping. Here, one or several genomes already have trusted annotations, while further, usually new, genomes of related species require annotation. The existing annotations are leveraged so that orthologous gene structures of previously annotated genes are lifted, or mapped, to the other genomes where possible. This can be done with comparative AUGUSTUS using a variant of the protocol in Subheading 3.5 in which RNA-Seq evidence is replaced with the evidence from the existing annotation(s). We refer the reader to Exercise 4 of the CGP tutorial in the AUGUSTUS package. Such an approach has recently been applied to annotate 16 de novo assembled mouse strains [13] and is also implemented in the comparative annotation toolkit CAT [14].

4 Notes

1. If issues with the installation or basic running arise, first consult the instructions in the files README-cgp.txt and README.TXT. Problems that appear to result from a bug (e.g., a segmentation fault) can be reported on the GitHub page (https://github.com/Gaius-Augustus/Augustus).

2. If your clade is very diverse, such that a whole-genome alignment is difficult and very gappy, e.g. when considering divergent vertebrate species, then it may be preferable to run comparative AUGUSTUS (if necessary several times) on less diverse subclades, e.g. on mammals or primates only. If you believe that a whole-genome duplication has taken place in some genomes with regard to others in your data set, then use a duplicated genome as reference for hal2maf_split.pl. Otherwise, some homology information could be systematically ignored in comparative AUGUSTUS.

3. If sqlite3 is not available on your system, use the alternative with MySQL. However, we have experienced performance issues with MySQL when accessing the database from 1000 comparative AUGUSTUS processes simultaneously, which did not occur with the file-based sqlite3 database.

4. Make sure that genome versions and genome names match. If the error "failed retrieving sequence" occurs, a reason could be that an assembly version in the database is not identical to the one that was aligned. In such a case, the program getSeq, which retrieves sequence ranges from the database, can be helpful to find the discordance. If unsure, you may want to reimport the correct genome assemblies.

5. If some jobs require too much memory for the computing nodes, consider decreasing the parameter chunksize of hal2maf_split.pl, or rerun the "stragglers" on a node with more memory.

6. Sometimes individual RNA-Seq libraries can worsen the predictions when included in addition to all the other available RNA-Seq libraries, e.g. if they are not poly-A selected. Such problematic libraries could be identified using a genome browser and then excluded before running AUGUSTUS again.

Acknowledgements

This chapter is based on research that was funded partially by Deutsche Forschungsgemeinschaft grant STA 1009/10-1 to MS and by a scholarship of the Studienstiftung des deutschen Volkes to SN.

References

1. Stanke M, Waack S (2003) Gene prediction with a hidden Markov model and new intron submodel. Bioinformatics 19(Suppl 2):ii215–ii225
2. Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24(5):637–644
3. Keller O, Kollmar M, Stanke M, Waack S (2011) A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics 27(6):757–763
4. Hoff KJ, Stanke M (2013) WebAUGUSTUS – a web service for training AUGUSTUS and predicting genes in eukaryotes. Nucleic Acids Res 41(W1):W123–W128
5. Hoff KJ, Stanke M (2018) Predicting genes in single genomes with AUGUSTUS. Curr Protoc Bioinf (.e57)
6. Gross S, Do C, Sirota M, Batzoglou S (2007) CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol 8(12):R269
7. Gross SS, Brent MR (2006) Using multiple alignments to improve gene prediction. J Comput Biol 13(2):379–393
8. König S, Romoth LW, Gerischer L, Stanke M (2016) Simultaneous gene finding in multiple genomes. Bioinformatics 32(22):3388–3395
9. Nachtweide S (2018) The simultaneous identification of genes in related species. Doctoral thesis
10. Rosenbloom KR, Armstrong J, Barber GP, Casper J, Clawson H, Diekhans M, Dreszer TR, Fujita PA, Guruvadoo L, Haeussler M, et al (2014) The UCSC genome browser database: 2015 update. Nucleic Acids Res 43(D1):D670–D681
11. Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D (2011) Cactus: algorithms for genome multiple sequence alignment. Genome Res 21(9):1512–1528
12. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15–21
13. Lilue J, Doran AG, Fiddes IT, Abrudan M, Armstrong J, Bennett R, Chow W, Collins J, Collins S, Czechanski A, Danecek P, Diekhans M, Dolle D-D, Dunn M, Durbin R, Earl D, Ferguson-Smith A, Flicek P, Flint J, Frankish A, Fu B, Gerstein M, Gilbert J, Goodstadt L, Harrow J, Howe K, Kolmogorov M, Koenig S, Lelliott C, Loveland J, Mott R, Muir P, Navarro F, Odom D, Park N, Pelan S, Phan SK, Quail M, Reinholdt L, Romoth L, Shirley L, Sisu C, Sjoberg-Herrera M, Stanke M, Steward C, Thomas M, Threadgold G, Thybert D, Torrance J, Wong K, Wood J, Yang F, Adams DJ, Paten B, Keane TM (2018) Sixteen diverse laboratory mouse reference genomes define strain-specific haplotypes and novel functional loci. Nat Genet 50:1574–1583
14. Fiddes IT, Armstrong J, Diekhans M, Nachtweide S, Kronenberg ZN, Underwood JG, Gordon D, Earl D, Keane T, Eichler EE, Haussler D, Stanke M, Paten B (2018) Comparative Annotation Toolkit (CAT) – simultaneous clade and personal genome annotation. Genome Res. https://doi.org/10.1101/gr.233460.117

Chapter 9

GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data

Jens Keilwagen, Frank Hartung, and Jan Grau

Abstract

GeMoMa is a homology-based gene prediction program that predicts gene models in target species based on gene models in evolutionarily related reference species. GeMoMa utilizes amino acid sequence conservation, intron position conservation, and RNA-seq data to accurately predict protein-coding transcripts. Furthermore, GeMoMa supports the combination of predictions based on several reference species, which allows transferring high-quality annotations of different reference species to a target species. Here, we present a detailed description of the GeMoMa modules and the GeMoMa pipeline and how they can be used on the command line to address particular biological problems.

Key words Gene prediction, Homology, Intron position conservation, RNA-seq, Open-source

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_9, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

Next-generation sequencing technologies are advancing continuously, leading to decreasing sequencing time and costs. Consequently, the number of newly sequenced genomes grows rapidly. Genome assembly is an important prerequisite to leverage those data, but deeper genome analyses depend on a precise and reliable gene annotation, especially of protein-coding genes. Besides empirical data (e.g., sequencing of full-length cDNAs and RNA-seq libraries), which are an important factor in gene annotation, common annotation pipelines use a combination of ab initio and homology-based gene prediction software to annotate protein-coding genes in newly sequenced genomes [1, 2]. Crucial problems of many gene annotation pipelines are lowly or very specifically expressed genes, for which RNA-seq data are weak or even missing, and a large number of intronic sequences. Furthermore, most of the ab initio approaches need extensive training steps.

To overcome these common problems, we developed an algorithm that is based on an earlier observed feature, namely the high degree of conservation of (1) the occurrence of an intron and (2) its fixed position with respect to the translated protein sequence [3, 4]. The high conservation of introns at the same position is plausible, as any alteration in the splice or regulatory sites of an intron is likely to result in incorrect splicing and, in most cases, causes a frame shift in the protein-coding region that leads to a premature stop codon. Selection against such an incorrect and non-functional protein might be expected to be as strong as for mutations occurring in the coding region. Alterations that involve the loss of an entire intron are rather rare, but they do not interfere with correct splicing, and expression of the encoded protein is therefore not impaired. For this reason, the gene structures of intron-containing protein-coding genes are often more conserved than the amino acid sequences of the orthologous proteins themselves [3, 5].

This conservation of orthologous gene structures was previously used for the manual analysis of gene structure predictions [6]. The challenge was to convert it into an algorithm and program workflow that takes into account the different prerequisites, such as the order of exons, their conservation, and correct splice sites. Our homology-based gene prediction program, Gene Model Mapper (GeMoMa) [4], uses the basic local alignment search tool (BLAST) [7] to individually align the amino acid sequence encoded by each coding exon of a gene to the genome sequence of a target species. In the following steps, GeMoMa filters the multiple BLAST hits, which occur especially for short exons (but also for longer ones), with respect to their genomic position. Doing so, GeMoMa identifies genomic regions in which similar sequences for all existing exons of a given transcript occur in the correct order, interrupted only by their respective introns. These matching exons are then refined for proper splice sites and joined in the correct order, while allowing for rare cases of intron loss or gain [4].

Using this approach, we could show that GeMoMa is superior for gene annotation in plants and animals in comparison to commonly used programs like genBlastG and exonerate [4, 8, 9]. Due to reduced sequencing prices, the amount of available RNA-seq data has increased dramatically in recent years, and several programs already use multiple sources including RNA-seq for gene prediction [10, 11]. Therefore, we recently extended GeMoMa to utilize RNA-seq data as an additional source of information [12]. In short, GeMoMa uses an additional module to identify introns and, hence, experimentally verified splice sites in mapped RNA-seq data if available. The performance of GeMoMa including mapped RNA-seq data turned out to be equal to or even better than that of other programs like BRAKER1 [11], MAKER2 [2], and CodingQuarry [10].


In this chapter, we describe the GeMoMa pipeline step by step. We shed light on the availability and installation as well as on all individual modules of the pipeline. Finally, we give some notes on algorithmic and technical aspects of GeMoMa.

2 Quick Start Guide

The software package is named after its main module GeMoMa, which is an abbreviation for Gene Model Mapper. In the remainder of this chapter, we write all module names in italics to distinguish them from the package name.

2.1 Availability

GeMoMa is freely available as a software package at http://www.jstacs.de/index.php/GeMoMa or as a web server via a Galaxy integration at http://galaxy.informatik.uni-halle.de/. The latter provides the same functionality as the package. However, it has limited capabilities regarding the size of the data set, as predictions may be made for at most 10 reference transcripts at once. It is well suited for homology-based prediction of gene families or specific genes of interest as well as for testing purposes, but for complete genome (re-)annotations, we recommend using the GeMoMa package.

2.2 Installation

GeMoMa is implemented in Java and available as a portable JAR. For this reason, GeMoMa does not need a lengthy and error-prone installation. However, GeMoMa requires a Java runtime environment, which is standard software on most modern computers. The latest Java version can be downloaded from https://java.com/en/download/. In addition, the GeMoMa pipeline uses TBLASTN to identify genomic regions that are similar to exons of reference species. Hence, NCBI BLAST needs to be installed to run the complete GeMoMa pipeline. The latest BLAST version can be downloaded for free from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.

2.3 User Interface

GeMoMa is implemented using the Java library Jstacs [13]. It implements the interface JstacsTool, which allows providing several user interfaces. For the GeMoMa package, we provide the command line interface and a Galaxy integration as user interfaces. To create the XML files needed for the Galaxy integration, run the provided script, where <version> has to be replaced by the GeMoMa version:

./createGalaxyIntegration.sh <version>


Due to its modularity, GeMoMa is able to benefit from the large compute power of compute clusters. Splitting, running, and combining partial jobs of GeMoMa depends on the cluster, the job management system, and many other aspects (cf. Subheading 4.2). For user convenience, we also provide the module GeMoMaPipeline, which runs the complete pipeline as one job. However, this module can only exploit the compute power of a single server and does not distribute the partial jobs to a compute cluster.

This chapter focuses on a detailed explanation of how the GeMoMa package can be used in its command line mode. All modules and parameters are described on the homepage http://www.jstacs.de/index.php/GeMoMa. Furthermore, it is possible to obtain a description via the command line. All modules of the pipeline can be listed by running:

java -jar GeMoMa-<version>.jar CLI

For each module, the parameters can be listed by running:

java -jar GeMoMa-<version>.jar CLI <module>

where <version> has to be replaced by the GeMoMa version and <module> by the name of the specific module, e.g.,

java -jar GeMoMa-1.5.3.jar CLI ERE

We can run a GeMoMa module with specific parameters using the following generic command line:

java -jar GeMoMa-<version>.jar CLI <module> <parameter>=<value> <parameter>=<value> <parameter>=<value> ...

Specific examples are provided in the next section.
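Because every call repeats the same java -jar prefix, a small shell function can shorten the command lines used in the remainder of this chapter. This is only a convenience sketch and not part of the GeMoMa package: the function name gemoma and the variables GEMOMA_JAR and GEMOMA_DRY_RUN are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Path to the GeMoMa JAR; the default assumes version 1.5.3 in the current directory.
GEMOMA_JAR=${GEMOMA_JAR:-GeMoMa-1.5.3.jar}

# gemoma MODULE [parameter=value ...] (hypothetical wrapper):
# run one GeMoMa module via the command line interface.
# With GEMOMA_DRY_RUN=1 the command is only printed, not executed.
gemoma() {
  if [ "${GEMOMA_DRY_RUN:-0}" = "1" ]; then
    echo java -jar "$GEMOMA_JAR" CLI "$@"
  else
    java -jar "$GEMOMA_JAR" CLI "$@"
  fi
}
```

With this definition, gemoma ERE is equivalent to the parameter listing example above, and gemoma without arguments lists all modules.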

3 Sequential Execution of GeMoMa Modules

GeMoMa is implemented in a modular fashion as open-source software. In Fig. 1, we give an overview of the individual GeMoMa modules, the required input data, and how the output of one module serves as input for the next module to yield the complete pipeline from input genomes and, optionally, RNA-seq data to the final predicted gene models. For standard applications of GeMoMa that are to be run on a single compute server, we also provide the complete GeMoMa pipeline as one of the GeMoMa modules, and we briefly describe the usage of this pipeline in the next subsection. In the remainder of this section, we describe the individual modules of GeMoMa in the order in which they are used in the pipeline (Fig. 1). In all sections, we use a minimal toy example that is included in the package as a test case to explain how to run the individual modules.


Fig. 1 Schema of the GeMoMa pipeline adapted from Keilwagen et al. [12]. Blue items represent input data sets, green boxes represent GeMoMa modules, while grey boxes represent external modules. The GeMoMa Annotation Filter allows to combine predictions from different reference species and produces the final output. RNA-seq data are optional

3.1 GeMoMaPipeline Module

Despite the modularity of GeMoMa, it can be started directly as one job running the complete pipeline. For this purpose, we implemented the module GeMoMaPipeline, which combines all modules of the pipeline and exploits the complete compute power of a server using multi-threading. Users need to specify all parameters at the beginning. GeMoMaPipeline forwards these parameters and all necessary intermediate results to the specific GeMoMa modules. Furthermore, GeMoMaPipeline splits the jobs for TBLASTN and GeMoMa and runs them in parallel. In addition, GeMoMaPipeline reduces the memory requirements by loading the target genome and the statistics of optional RNA-seq data only once and using them for all individual runs of the module GeMoMa. Using the provided toy data and one thread, we run the pipeline with the following command:

java -jar GeMoMa-1.5.3.jar CLI GeMoMaPipeline t=test_data/target-fragment.fasta a=test_data/ref-annotation.gff g=test_data/ref-fragment.fasta r=MAPPED ERE.m=test_data/target-accepted_hits.bam ERE.c=true threads=1

where t specifies the genome of the target species, a and g specify the annotation and genome (assembly) of the reference species, r specifies that we have mapped RNA-seq data, ERE.m specifies the RNA-seq alignment map, ERE.c specifies that the coverage should be computed from RNA-seq data and later be used in GeMoMa, and threads specifies the maximal number of threads to be used for the computation.

Furthermore, the pipeline can also be used for multiple reference species. Here is an example of how GeMoMaPipeline can be used with three reference species, mapped transcriptome data, and 90 threads (test data not provided with the package):

java -jar GeMoMa-1.5.3.jar CLI GeMoMaPipeline t=target-genome.fasta a=ref-annotation1.gff g=ref-assembly1.fa a=ref-annotation2.gff g=ref-assembly2.fa a=ref-annotation3.gff g=ref-assembly3.fa r=MAPPED ERE.m=example.BAM ERE.c=true threads=90
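With many reference species, typing the repeated a= and g= parameters becomes error-prone. One way to generate them is sketched below, under the assumption that the references are listed in a tab-separated table with the annotation file in the first column and the genome file in the second. The table name references.tbl and the function name reference_args are hypothetical and not part of GeMoMa.

```shell
#!/usr/bin/env bash
# reference_args TABLE (hypothetical helper): turn a two-column, tab-separated
# table (annotation file, genome file; one reference species per line) into
# the repeated a=... and g=... parameters, one parameter per output line.
reference_args() {
  local annotation genome
  while IFS=$'\t' read -r annotation genome || [ -n "$annotation" ]; do
    # skip blank or incomplete lines
    if [ -z "$annotation" ] || [ -z "$genome" ]; then
      continue
    fi
    printf 'a=%s\ng=%s\n' "$annotation" "$genome"
  done < "$1"
}

# Possible use (bash 4 or later), keeping the order annotation/genome per species:
#   mapfile -t refs < <(reference_args references.tbl)
#   java -jar GeMoMa-1.5.3.jar CLI GeMoMaPipeline t=target-genome.fasta \
#     "${refs[@]}" r=MAPPED ERE.m=example.BAM ERE.c=true threads=90
```

Emitting one parameter per line keeps file names with spaces intact when the output is read into an array.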

3.2 ERE: Extracting RNA-seq Evidence

The accuracy of gene prediction methods often relies on the accuracy of splice site prediction. Hence, a lot of research has been conducted in the last two decades to improve splice site prediction. In recent years, RNA-seq has become a standard experiment in many laboratories around the world, and it has the potential to identify splice sites based on experimental data. Furthermore, nucleotide archives like SRA from NCBI or ENA from EMBL-EBI provide access to thousands of publicly available RNA-seq data sets. Mapping RNA-seq data to the target genome allows identifying split reads, where one part of the read aligns to an exon and another part of the read aligns to the next exon, separated by an intron that is not included in the RNA-seq data. Hence, these split reads allow identifying experimentally verified splice sites for many genes of a target organism.

Read mapping has to be performed with a read mapper, e.g., TopHat2 [14] or STAR [15], before running the GeMoMa pipeline. The output of read mappers contains an alignment file, which lists the alignments of short reads against the target assembly. There are two standard formats for the alignment file, namely, the human-readable sequence alignment map (SAM) and the binary alignment map (BAM). Due to their binary encoding, BAM files are often much smaller than SAM files. There are different RNA-seq library preparation protocols. Depending on the library preparation for RNA-seq, the strand of the reads might be known ("stranded protocol") or unknown.

The module ERE is able to read both formats and to handle all library preparation protocols. ERE extracts experimentally verified introns and, optionally, coverage from mapped RNA-seq data. Based on the introns, donor and acceptor splice sites can be identified within the GeMoMa module. For a single BAM file of an unstranded library (like that from the toy data), we run ERE using the following command:

java -jar GeMoMa-1.5.3.jar CLI ERE c=true m=test_data/target-accepted_hits.bam

where c=true specifies that ERE should return the coverage as bedgraph. The strand of the RNA-seq library can be specified using s=<value>. In addition, ERE can read multiple alignment files of the same kind at once by using the parameter m multiple times:

java -jar GeMoMa-1.5.3.jar CLI ERE s=FR_FIRST_STRAND m=example1.SAM m=example2.SAM

A successful run of ERE returns a file containing the introns in GFF format and, if specified, the coverage information in bedgraph format. Depending on the parameter s, the coverage might be one file for FR_UNSTRANDED or two files for stranded data. These output files of ERE can be fed into the GeMoMa module as optional inputs.

3.3 Extractor: Extracting CDS Parts

The main idea of GeMoMa is to exploit intron position conservation for homology-based gene prediction. Hence, we try to identify genomic loci in the target organism that are homologous to coding exons in the reference species. Subsequently, we try to combine these matches to complete gene models. The module Extractor uses the gene annotation and genome assembly of a reference species to extract all necessary information, including the parts of the CDS that correspond to (partially) coding exons and the information on how to combine these exons to gene models again. For a given reference species, we run Extractor using the following command:

java -jar GeMoMa-1.5.3.jar CLI Extractor a=test_data/ref-annotation.gff g=test_data/ref-fragment.fasta

where a and g specify the annotation and genome (assembly) of the reference species. The format of the annotation file can be GFF or GTF (cf. Fig. 2a), while the format of the genome has to be FastA. If we would also like to receive gene models that contain ambiguous nucleotides in the reference species (Ambiguity=AMBIGUOUS) and the proteins of the reference species (p=true), we need to run:

java -jar GeMoMa-1.5.3.jar CLI Extractor a=test_data/ref-annotation.gff g=test_data/ref-fragment.fasta Ambiguity=AMBIGUOUS p=true

(a) Reference annotation

Chr4  phytozomev10  CDS  1024  1075  .  -  0  ID=AT4G31700.2.TAIR10.CDS.1;Parent=AT4G31700.2.TAIR10
Chr4  phytozomev10  CDS  642   914   .  -  2  ID=AT4G31700.2.TAIR10.CDS.2;Parent=AT4G31700.2.TAIR10
Chr4  phytozomev10  CDS  306   547   .  -  2  ID=AT4G31700.2.TAIR10.CDS.3;Parent=AT4G31700.2.TAIR10

(b) CDS parts

>AT4G31700.TAIR10_0
MKQGVLTPGRVRLLLHR
>AT4G31700.TAIR10_1
TPCFRGHGRRTGERRRKSVRGCIVSPDLSVLNLVIVKKGENDLPGLTDTEKPRMRGPKRASKIRKLFNLKKEDDVRTYVNTYRRKFTNKK
>AT4G31700.TAIR10_2
KEVSKAPKIQRLVTPLTLQRKRARIADKKKKIAKANSDAADYQKLLASRLKEQRDRRSESLAKKRSRLSSAAAKPSVTA*

(c) Assignment

#geneID           transcript          cds-parts  phases  chr   strand  start  end   full-length  ...
AT4G31700.TAIR10  AT4G31700.2.TAIR10  0, 1, 2    0,2,2   Chr4  -1      306    1075  true         ...

(d) Reference proteins

>AT4G31700.2.TAIR10
MKQGVLTPGRVRLLLHRGTPCFRGHGRRTGERRRKSVRGCIVSPDLSVLNLVIVKKGENDLPGLTDTEKPRMRGPKRASKIRKLFNLKKEDDVRTYVNT
YRRKFTNKKGKEVSKAPKIQRLVTPLTLQRKRARIADKKKKIAKANSDAADYQKLLASRLKEQRDRRSESLAKKRSRLSSAAAKPSVTA*

(e) Filtered predictions

ST4.03ch12  GeMoMa  prediction  330   1066  .  -  .  ID=AT4G31700.2.TAIR10 R0; ref-gene=AT4G31700.TAIR10; AA=188; score=857; tae=1; tde=1; tie=1; minSplitReads=6066; tpc=1; minCov=79; avgCov=6663.5301; start=M; stop=*; evidence=1; Parent=gene 0;
ST4.03ch12  GeMoMa  CDS         1015  1066  .  -  0  ID=AT4G31700.2.TAIR10 R0 cds0; Parent=AT4G31700.2.TAIR10 R0; de=true; pc=1; minCov=79
ST4.03ch12  GeMoMa  CDS         645   917   .  -  2  ID=AT4G31700.2.TAIR10 R0 cds1; Parent=AT4G31700.2.TAIR10 R0; ae=true; de=true; pc=1; minCov=106
ST4.03ch12  GeMoMa  CDS         330   568   .  -  2  ID=AT4G31700.2.TAIR10 R0 cds2; Parent=AT4G31700.2.TAIR10 R0; ae=true; pc=1; minCov=5303
ST4.03ch12  GAF     gene        330   1066  .  -  .  transcripts=1; complete=1; ID=gene 0; maxEvidence=1; maxTie=1.0

Fig. 2 Input data as well as intermediate and final results of GeMoMa

Further important parameters of Extractor are:

- s=<value>: allows specifying the transcripts that are extracted and later used for homology-based gene prediction. This is especially useful if we are interested in specific genes or gene families and not in a complete genome (re-)annotation.

- r=true: specifies that the Extractor should try to repair gene models that are not correctly annotated in the reference annotation, e.g., that have wrong phase information.

- f=false: specifies that the Extractor should also return partial gene models, i.e., gene models without start or stop codon.

- sefc=true: specifies that the stop codon is excluded from the CDS annotation. This is especially reasonable if we are interested in full-length gene models and the reference annotation excludes the stop codons from the CDS.

The outputs of Extractor are a FastA file of cds-parts (cf. Fig. 2b), i.e., protein-coding (parts of) exons, which is subsequently used for TBLASTN and GeMoMa, and a tab-separated file, which is denoted as assignment and contains the information on how to combine these exons to gene models (cf. Fig. 2c). The column "cds-parts" in the assignment states that the 0-th, 1-st, and 2-nd part of gene AT4G31700.TAIR10 are combined in this order to obtain transcript AT4G31700.2.TAIR10. Optionally, protein (cf. Fig. 2d) and CDS sequences can be returned.

3.4 TBLASTN: Finding Matches to Protein-Coding Exons

NCBI BLAST is a state-of-the-art software suite for finding sequences similar to some query sequence. In the GeMoMa pipeline, we use TBLASTN as an external module, which provides genomic positions for sequences that are similar to coding exons of a reference species. Before running TBLASTN, we create a nucleotide BLAST database from the target genome (see Note 1):

makeblastdb -out blastdb -hash_index -in test_data/target-fragment.fasta -title target -dbtype nucl -logfile blastdb-logfile

Subsequently, we run TBLASTN using the following command (see Note 1):

tblastn -query cds-parts.fasta -db blastdb -evalue 100 -seg no -out tblastn.txt -db_gencode 1 -matrix BLOSUM62 -num_threads 1 -word_size 3 -comp_based_stats F -gapopen 11 -gapextend 1 -outfmt "6 std sallseqid score nident positive gaps ppos qframe sframe qseq sseq qlen slen salltitles"

Some of these parameter values are the default values of TBLASTN. However, they might change with later versions of NCBI BLAST. Hence, we list them here. The parameter -evalue is set to 100 to also receive BLAST hits for short exons. In addition, the parameter -comp_based_stats is used to receive a score that can be summed in GeMoMa. The parameter -outfmt specifies a sparse output format that is later parsed by the GeMoMa module. The results of TBLASTN are stored in a file tblastn.txt as specified by the parameter -out.

3.5 GeMoMa: Predicting Transcripts

Based on the results of TBLASTN, we can run the main module GeMoMa, which combines the BLAST hits to yield predictions of complete gene models as shown in Fig. 3. If TBLASTN does not find a good match for one or a few CDS parts in a promising region, GeMoMa tries to find at least one good match for each of the missing parts by performing pairwise alignments using the same substitution matrix and gap costs as TBLASTN. For this reason, the algorithm internally runs the prediction twice. In the initial run, only promising regions are identified. In the second run, additional alignments are performed if necessary, and splice sites and in-frame combinations of the parts are considered. This approach reduces the search space and yields high-quality predictions.

Fig. 3 GeMoMa algorithm. Instead of looking for a homolog of the complete protein-coding gene, we use TBLASTN to identify regions that are homologous to the partially coding exons E1, E2, and E3. The colored boxes depict matches found by TBLASTN, where the colors correspond to the specific exons and partially gray boxes indicate partial matches. Possible connections between the matches are depicted by dashed lines and are based on correct exon order and distance. GeMoMa uses dynamic programming to find the path with the highest score. In the second internal round of GeMoMa, it discards connections between matches if these lead to incompatible combinations of reading frames or do not have valid splice sites

For splice site prediction, GeMoMa relies on experimentally supported splice sites from RNA-seq data if available. In case of genes that are not covered by RNA-seq data (due to low abundance, sample or other problems), GeMoMa searches for the conserved dinucleotides (GT or GC as donor and AG as acceptor splice sites) that are specific for canonical intron borders. If GeMoMa detects two potential exons in the respective genomic region, all combinations of donor and acceptor splice sites that yield in-frame combinations of the exons are tested and scored. Only the highest-scoring combination is further used for gene prediction. If we have no RNA-seq data, we can run GeMoMa using the following command line (see Note 1):


java -jar GeMoMa-1.5.3.jar CLI GeMoMa t=tblastn.txt tg=test_data/target-fragment.fasta c=cds-parts.fasta a=assignment.tabular

where cds-parts.fasta and assignment.tabular are results of the module Extractor. In contrast, if we have RNA-seq data and have processed them with a read mapper and the module ERE, we obtain files for introns and coverage that can be passed to GeMoMa as optional parameters (see Note 1):

java -jar GeMoMa-1.5.3.jar CLI GeMoMa t=tblastn.txt tg=test_data/target-fragment.fasta c=cds-parts.fasta a=assignment.tabular i=introns.gff coverage=UNSTRANDED coverage_unstranded=coverage.bedgraph

The example above shows the case of an unstranded RNA-seq library. For a stranded RNA-seq library, the parameter coverage needs to be changed accordingly, and the coverage files have to be passed to the parameters coverage_forward and coverage_reverse instead of coverage_unstranded.

GeMoMa has several further parameters that can be used to modify its behavior under specific conditions. Here, we briefly describe three of them: query proteins, maximum intron length, and timeout.

First, users can supply complete sequences of query proteins, e.g., q=proteins.fasta, where proteins.fasta is an optional output of Extractor (cf. p=true). In this case, GeMoMa computes for each prediction a pairwise alignment between the reference protein and the predicted protein, which allows it to report further attributes (cf. Table 1).

Second, users can modify the maximum intron length that is used within GeMoMa. If two TBLASTN hits of neighboring exons are located on the same chromosome and the same strand, but their distance is larger than the maximum intron length, GeMoMa does not build a gene model combining both hits. Hence, it is essential to adapt the maximum intron length, which has a default value of 15,000 bp, especially for target organisms that contain large introns, e.g., animals. At the end of the ERE protocol, we return a cumulative distribution of the lengths of the extracted introns, which can be used to identify a good parameter value. However, the length of the extracted introns is often also affected by parameter settings of the read mapper.

Third, users can alter the value for timeout, which denotes the maximal number of seconds to be used for the prediction of one transcript. If GeMoMa exceeds this time limit, no prediction is returned for this reference transcript. By default, the limit is set to 1 h, but for some reference transcripts, e.g., those with plenty of exons, the timeout might need to be increased.

The predictions made by GeMoMa are provided in GFF format in a file predicted_annotation.gff. By default, predicted transcripts are listed with type “prediction” in the GFF, which may be modified by the parameter tag of GeMoMa.

Table 1 GFF attributes of the GeMoMa pipeline

| Attribute | Long name | Module | Necessary parameter | Feature | Range | Description |
|---|---|---|---|---|---|---|
| score | GeMoMa score | GeMoMa | | Prediction | | Score computed by GeMoMa using the substitution matrix, gap costs, and additional penalties |
| minCov | Minimal coverage | GeMoMa | Coverage, ... | Prediction | ≥ 0 | Minimal coverage of any base of the prediction given RNA-seq evidence |
| avgCov | Average coverage | GeMoMa | Coverage, ... | Prediction | ≥ 0 | Average coverage of all bases of the prediction given RNA-seq evidence |
| tpc | Transcript percentage coverage | GeMoMa | Coverage, ... | Prediction | {NA} ∪ [0, 1] | Percentage of covered bases per predicted transcript given RNA-seq evidence |
| tae | Transcript acceptor evidence | GeMoMa | Introns | Prediction | {NA} ∪ [0, 1] | Percentage of predicted acceptor sites per predicted transcript with RNA-seq evidence |
| tde | Transcript donor evidence | GeMoMa | Introns | Prediction | {NA} ∪ [0, 1] | Percentage of predicted donor sites per predicted transcript with RNA-seq evidence |
| tie | Transcript intron evidence | GeMoMa | Introns | Prediction | {NA} ∪ [0, 1] | Percentage of predicted introns per predicted transcript with RNA-seq evidence |
| minSplitReads | Minimal split reads | GeMoMa | Introns | Prediction | ≥ 0 | Minimal number of split reads for any of the predicted introns per predicted transcript |
| iAA | Identical amino acid | GeMoMa | Query proteins | Prediction | [0, 1] | Percentage of identical amino acids between reference transcript and prediction |
| pAA | Positive amino acid | GeMoMa | Query proteins | Prediction | [0, 1] | Percentage of aligned positions between reference transcript and prediction yielding a positive score in the substitution matrix |
| evidence | | GAF | | Prediction | | Number of reference organisms that have a transcript yielding this prediction |
| alternative | | GAF | | Prediction | | Alternative gene ID(s) leading to the same prediction |
| maxTie | Maximal tie | GAF | | Gene | [0, 1] | Maximal tie of all transcripts of this gene |
| maxEvidence | Maximal evidence | GAF | | Gene | | Maximal evidence of all transcripts of this gene |

3.6 GAF: GeMoMa Annotation Filter

GeMoMa makes multiple predictions per reference transcript, some even with a very low GeMoMa score. In addition, GeMoMa predictions might be highly overlapping or even identical for different reference transcripts, either from the same or from different reference species, especially if the reference transcripts are from the same gene family. The module GeMoMa itself does not resolve such questionable or redundant predictions. One reason for the latter point is that GeMoMa makes predictions for each reference transcript independently. We implemented the module GeMoMa annotation filter (GAF) to handle such cases by joining or reducing such predictions using various filters. GAF filters predicted transcripts based on several criteria comprising the relative GeMoMa score of the predicted transcript, the completeness of the predicted transcript (i.e., whether the transcript starts with a start codon and ends with a stop codon), and the number of reference organisms that perfectly support the predicted transcript. Using only one reference species, we can run GAF using the following command (see Note 1):

java -jar GeMoMa-1.5.3.jar CLI GAF g=predicted_annotation.gff

where predicted_annotation.gff is the output of GeMoMa (cf. Fig. 1). GAF returns the filtered and combined annotation as filtered_predictions.gff (cf. Fig. 2e). If we have predicted annotations based on several reference species, we have to modify the command line appropriately. Here, we give the example for three reference species:

java -jar GeMoMa-1.5.3.jar CLI GAF g=species1/predicted_annotation.gff g=species2/predicted_annotation.gff g=species3/predicted_annotation.gff

In this case, the parameter evidence filter (e) becomes interesting as it allows for returning only predictions that are perfectly supported by multiple reference species.

3.7 Attributes of GeMoMa Predictions

Using GeMoMa and GAF, we receive results as GFF files containing some custom attributes. We briefly explain the most prominent attributes in Table 1. These attributes can be used for statistics of the predicted annotation or for user-specific filters.
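As an illustration of such a user-specific filter, the following minimal sketch keeps only predictions whose introns are all supported by RNA-seq data (attribute tie, cf. Table 1). It assumes the usual GFF layout with nine tab-separated columns and semicolon-separated key=value pairs in the attribute column. The function names, the default feature type "prediction", and the threshold are illustrative choices and not part of GeMoMa.

```python
def parse_attributes(column9):
    """Split a GFF attribute column such as 'ID=t1;score=512;tie=1' into a dict."""
    attributes = {}
    for entry in column9.strip().split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def as_number(value):
    """Return a float, or None for absent or non-numeric values such as 'NA'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def select_predictions(gff_lines, feature_type="prediction", min_tie=1.0):
    """Keep feature lines of the given type whose tie attribute is at least min_tie."""
    selected = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and empty lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue
        tie = as_number(parse_attributes(columns[8]).get("tie"))
        if tie is not None and tie >= min_tie:
            selected.append(line.rstrip("\n"))
    return selected
```

Predictions without intron evidence (tie equal to NA, or no tie attribute at all) are dropped by this sketch; whether that is appropriate depends on the application, e.g., single-exon genes have no introns that could be supported.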

4 Remarks and Hints for Specific Applications

In this section, we give some notes regarding algorithmic as well as technical aspects of GeMoMa.

4.1 Algorithmic Remarks

In this section, we present some useful remarks about algorithmic aspects.

• By concept, GeMoMa is unable to identify genes in the target genome without any sufficiently similar homolog in one of the reference species. If such highly organism-specific genes are central to your study, we recommend complementing GeMoMa predictions with RNA-seq-based or ab initio methods.

• GeMoMa currently only considers the protein-coding (parts of) exons, which means that pseudogenes and non-coding RNAs will not be identified by GeMoMa. Untranslated regions (UTRs) of protein-coding genes are currently also not reconstructed.

• The prediction quality of GeMoMa depends on the quality of the target genome sequence. Heavily fragmented genomes comprising tens of thousands of contigs might lead to sub-optimal performance, since the current version of GeMoMa does not predict gene models across contig boundaries. If you want to make gene predictions for such a genome and have RNA-seq data available, you might consider using an RNA-based scaffolder like Rascaf [16] before running the GeMoMa pipeline.

• If multiple reference species at a reasonable evolutionary distance from your target organism are available, we generally recommend using all of them in GeMoMa. The resulting predictions may afterwards be filtered and summarized by the GAF module. If there is doubt with regard to the quality of individual reference annotations, we further recommend using the evidence filter of the GAF module, requiring support by at least two reference species.

• Depending on the sequencing depth of your RNA-seq data (if available), it may be helpful to increase the value of the parameter r of the GeMoMa module, which specifies the minimum number of split reads required for assuming a potential intron at a certain position. While larger values of r may reduce sensitivity for introns of low-coverage transcripts, they may also compensate for spurious split-read mappings due to high sequencing depth.

4.2 Technical Remarks

In this section, we present some useful remarks for using the software.

• Starting any module of GeMoMa prints the values of all parameters, including those specified by the user as well as the default parameters, which can be used to double-check all inputs. In addition, each successful run of any GeMoMa module returns a protocol besides its main output.

• For all GeMoMa modules, an output directory may be specified and is automatically created if it does not already exist. We recommend using unique directory names for different GeMoMa modules or, for instance, for GeMoMa runs using different reference species. The files written to the output directory follow a generic naming schema that does not depend on the current GeMoMa run. If a file with the same name already exists in this output directory, a number is appended to the output file name automatically, which might be tedious to disentangle afterwards.

• GeMoMa does not have specific hardware requirements. However, providing query proteins or using a large target genome or RNA-seq data in the module GeMoMa increases memory consumption. Hence, users might need to specify the Java virtual machine parameters -Xms and -Xmx to increase the initially used and maximal usable main memory (RAM). For instance, you may call

java -Xms1G -Xmx8G -jar GeMoMa-1.5.3.jar CLI GeMoMa

to start the module GeMoMa with initially 1 GB and a maximum of 8 GB of RAM.

• The two most time-consuming steps of the GeMoMa pipeline are TBLASTN and the module GeMoMa itself. Both steps are automatically parallelized when using the GeMoMaPipeline module for execution on a single (multi-core) compute server. To parallelize these steps on a compute cluster, the GeMoMa JAR file contains a helper class that may split the cds-parts.fasta output of the Extractor module into a specified number of smaller chunks. You may call this helper class with

java -cp GeMoMa-.jar projects.FastaSplitter "_"

It is important to specify the underscore as the last parameter, because this indicates that all CDS parts of a gene are to be included in the same chunk, which allows running GeMoMa directly on the chunked TBLASTN results. The TBLASTN step and the GeMoMa module may then be executed independently on those smaller chunks, e.g., on separate cluster nodes. Afterwards, the resulting output files (predicted_annotation.gff) only need to be concatenated to yield the joint GeMoMa prediction, which can finally be fed into GAF.

• If we combine predictions based on multiple reference species, gene or transcript IDs are not necessarily unique between reference species. Hence, GAF may add a prefix to the predictions using the parameter p, as in the following example:

java -jar GeMoMa-1.5.3.jar CLI GAF g=species1/predicted_annotation.gff p=species1 g=species2/predicted_annotation.gff p=species2 g=species3/predicted_annotation.gff p=species3

which adds the prefix species1 to the IDs of the first prediction file, species2 to the IDs of the second prediction file, and species3 to the IDs of the third prediction file. These prefixes can be used to make the IDs unique and to distinguish between the reference species.

• CDS and protein sequences of the final predictions can be obtained using the Extractor on the final prediction and the target genome.

5 Notes

1. All (intermediate) results of all command line calls are included in the download in the folder test_results/. There are two folders: one for using RNA-seq data (test_results/with) and one for not using RNA-seq data (test_results/without). The files can be used to compare the output of a command line call to the expected output. In addition, they allow starting the analysis at any specific step. In this case, the input directory path has to be specified in the command line, e.g.,

java -jar GeMoMa-1.5.3.jar CLI GeMoMa t=test_results/with/tblastn.txt tg=test_data/target-fragment.fasta c=test_results/with/cds-parts.fasta a=test_results/with/assignment.tabular i=test_results/with/introns.gff coverage=UNSTRANDED coverage_unstranded=test_results/with/coverage.bedgraph
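The comparison with the expected output mentioned in Note 1 can be done with any diff tool. The following minimal sketch shows one possible way to do it. It rests on two assumptions that are ours and not taken from GeMoMa: lines starting with "#" are comments that may differ between runs, and the order of the remaining lines is irrelevant. The function names are hypothetical.

```python
from collections import Counter


def load_records(path):
    """Count all non-empty lines that are not comments."""
    with open(path, encoding="utf-8") as handle:
        return Counter(
            line.rstrip("\n")
            for line in handle
            if line.strip() and not line.startswith("#")
        )


def compare_outputs(own_path, expected_path):
    """Return (missing, unexpected) lines of own_path relative to expected_path."""
    own = load_records(own_path)
    expected = load_records(expected_path)
    missing = sorted((expected - own).elements())
    unexpected = sorted((own - expected).elements())
    return missing, unexpected
```

For instance, compare_outputs("tblastn.txt", "test_results/with/tblastn.txt") would return two empty lists if both files contain the same records.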

References

1. Hoff KJ, Stanke M (2015) Current methods for automated annotation of protein-coding genes. Curr Opin Insect Sci 7:8–14. https://doi.org/10.1016/j.cois.2015.02.008. ISSN 2214-5745
2. Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinf 12(1):491. https://doi.org/10.1186/1471-2105-12-491. ISSN 1471-2105
3. Hartung F, Blattner FR, Puchta H (2002) Intron gain and loss in the evolution of the conserved eukaryotic recombination machinery. Nucleic Acids Res 30(23):5175–5181. https://doi.org/10.1093/nar/gkf649
4. Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F (2016) Using intron position conservation for homology-based gene prediction. Nucleic Acids Res 44(9):e89. https://doi.org/10.1093/nar/gkw092
5. Fedorov A, Merican AF, Gilbert W (2002) Large-scale comparison of intron positions among animal, plant, and fungal genes. Proc Natl Acad Sci U S A 99(25):16128–16133. https://doi.org/10.1073/pnas.242624899
6. Hartung F, Suer S, Bergmann T, Puchta H (2006) The role of AtMUS81 in DNA repair and its genetic interaction with the helicase AtRecQ4A. Nucleic Acids Res 34(16):4438–4448. https://doi.org/10.1093/nar/gkl576
7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410. https://doi.org/10.1016/S0022-2836(05)80360-2. ISSN 0022-2836
8. She R, Chu JS-C, Uyar B, Wang J, Wang K, Chen N (2011) genBlastG: using BLAST searches to build homologous gene models. Bioinformatics 27(15):2141–2143. https://doi.org/10.1093/bioinformatics/btr342
9. Slater G, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinf 6(1):31. https://doi.org/10.1186/1471-2105-6-31. ISSN 1471-2105
10. Testa AC, Hane JK, Ellwood SR, Oliver RP (2015) CodingQuarry: highly accurate hidden Markov model gene prediction in fungal genomes using RNA-seq transcripts. BMC Genomics 16(1):170. https://doi.org/10.1186/s12864-015-1344-4. ISSN 1471-2164
11. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32(5):767. https://doi.org/10.1093/bioinformatics/btv661
12. Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinf 19(1):189. https://doi.org/10.1186/s12859-018-2203-5. ISSN 1471-2105
13. Grau J, Keilwagen J, Gohr A, Haldemann B, Posch S, Grosse I (2012) Jstacs: a Java framework for statistical analysis and classification of biological sequences. J Mach Learn Res 13(June):1967–1971
14. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL (2013) TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol 14:R36
15. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29(1):15. https://doi.org/10.1093/bioinformatics/bts635
16. Song L, Shankar DS, Florea L (2016) Rascaf: improving genome assembly with RNA sequencing data. Plant Genome 9(3)

Chapter 10

Coding Exon-Structure Aware Realigner (CESAR): Utilizing Genome Alignments for Comparative Gene Annotation

Virag Sharma and Michael Hiller

Abstract

Alignment-based gene identification methods utilize sequence conservation between orthologous protein-coding genes to annotate genes in newly sequenced genomes. CESAR is an approach that makes use of existing genome alignments to transfer genes from one genome to other aligned genomes, and thus generates comparative gene annotations. To accurately detect conserved exons that exhibit an intact reading frame and consensus splice sites, CESAR produces a new alignment between orthologous exons, taking information about the exon’s reading frame and splice site positions into account. Furthermore, CESAR is able to detect most evolutionary splice site shifts, which helps to annotate exon boundaries at high precision. Here, we describe how to apply CESAR to generate comparative gene annotations for one or many species, and discuss the strengths and limitations of this approach. CESAR is available at https://github.com/hillerlab/CESAR2.0.

Key words Comparative gene annotation, Genome alignment, CESAR, Splice site shift

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_10, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

Identifying coding genes in genomic sequences is an important step in annotating a genome. Several different approaches exist for this task [1]. Transcriptome-based methods align entire sequenced mRNAs or parts of them to the genome to infer exons and introns. Ab initio gene prediction methods detect genes solely based on characteristic sequence patterns. Homology-based approaches utilize the fact that homologous genes often have conserved sequences and use information about genes in a related species to search for similar sequences in the given genome.

One type of homology-based approach makes use of alignments between entire genomes to project (or map) an existing gene annotation of a “reference” species to an aligned “query” species that lacks a gene annotation [2]. These projection approaches assume that exons of the reference species that align well to the query species are likely homologous exons. Thus, the coordinates of aligned exon boundaries in the query genome reveal the location of likely homologous exons (Fig. 1a).

A genome alignment
Human ccagcgcagcgggtgcggcgATGATCCTGGAGGAGAGGCCGGACGGCGCGGGCGCCGGC
Mouse gcgtccgagcga--gcagcgATGATCCTTGAGGAGAGGCCAGATGGCCAGGGCACTGGC
coding exon (translation) start coordinates: chr15: 75704394
Human GAGGAGAGCCCGCGGCTGCAG gtgcgcagaactggcgcggcggcggga-----ggagg
Mouse GAGGAAAGCTCTCGGCCGCAGgacgacggcagcatccgcaaggtgggggctgagcagg
incorrect exon end coordinates: chr15: 75704453

B CESAR alignment
Human ccagcgcagcgggtgcggcgATGATCCTGGAGGAGAGGCCGGACGGCGCGGGCGCCGGC
Mouse cagcgtccgagcgagcagcgATGATCCTTGAGGAGAGGCCAGATGGCCAGGGCACTGGC
coding exon (translation) start coordinates: chr15: 75704394
Human GAGGAGAGCCCGCGGCTGCAG--------------------- gtgcgcagaactggcgcggc
Mouse GAGGAAAGCTCTCGGCCGCAGGACGACGGCAGCATCCGCAAG gtgggggctgagcagggata
correct exon end coordinates: chr15: 75704474

Fig. 1 Coordinates of aligned exon boundaries do not always correspond to real exon coordinates in another genome. (a) Part of the genome alignment between human and mouse that covers the first coding exon of RHPN1 (blue font). The human consensus donor dinucleotide (“gt”, bold) aligns to a non-consensus donor site (red font) in the mouse, indicating that the exon end coordinates may not correspond to the respective mouse exon end. (b) CESAR re-aligns this sequence and detects a consensus donor site that is shifted 21 nt downstream. These exon end coordinates precisely correspond to the exon end in mouse. Please note that the CESAR alignment shows a 21 nt insertion in mouse. These additional 21 exonic bases are translatable in the same reading frame

Utilizing genome alignments for projecting gene annotations has several advantages. First, genome alignments do not only align exons but also the surrounding genomic context, which is helpful to distinguish orthologs from paralogs or processed pseudogenes, as the latter are often located in a different syntenic context. Second, many protein-coding exons are conserved over large phylogenetic distances. If sensitive alignment parameters are used, genome alignments capture the majority of human coding exons in other mammals and even other vertebrates [3]. Third, by making use of existing multiple genome alignments, gene annotations can be projected to numerous query species, as we recently demonstrated by projecting human genes to 143 other vertebrates [3].

Despite their utility for projecting gene annotations, genome alignments have two serious limitations. First, genome alignment programs are not aware of the reading frame and splice sites of the reference exon. Consequently, alignments between conserved exons may incorrectly exhibit frameshifts or non-consensus splice sites due to alignment ambiguities. Since one aims at projecting only truly conserved exons that exhibit an intact reading frame and consensus splice sites in the query species, such alignment ambiguities cause conserved exons to be missed in the resulting gene annotation. Second, the position of splice sites of truly conserved exons can shift during evolution [4]. Since genome alignments do not aim at generating an exon alignment with consensus splice sites, the position of the projected exon boundaries in the query genome may be incorrect for such exons.

CESAR is a method to resolve these two limitations [4, 5]. For a given exon, CESAR uses the query sequence provided by the genome alignment and then re-aligns this putative exonic sequence, incorporating information about both the reading frame and the splice sites of the reference exon. For the given exonic sequence, CESAR aims at finding an alignment that (1) has consensus splice sites and (2) preserves the reading frame and thus lacks inactivating mutations such as frameshifts and in-frame stop codons. As a result, CESAR correctly infers exon conservation for more than 5300 exons that had a broken reading frame or non-consensus splice sites in the genome alignment between human and mouse, and it is able to correctly detect >90% of evolutionary splice site shifts [4] (Fig. 1b). This leads to an accurate comparative gene annotation, exemplified by our observation that 99.1% of the human exons that CESAR projects to the mouse genome overlap annotated mouse exons, and for 96.8% of the projected exons both boundaries are correct. An example illustrating a gene annotation produced by CESAR is shown in Fig. 2.

We recently re-implemented CESAR in C (CESAR 2.0), which drastically reduces runtime and memory consumption [5]. In contrast to the original implementation, which only allowed re-aligning a single coding exon (referred to as “single-exon mode” in the following), CESAR 2.0 also provides a “multi-exon mode” that allows re-aligning entire multi-exon genes at once against a locus in the query genome. In multi-exon mode, CESAR 2.0 can detect exons that do not align in the genome alignment (Fig. 3a), and it can recognize intron deletion events that result in a larger composite exon in the query species. Furthermore, CESAR 2.0 improves the ability to detect distal evolutionary splice site shifts, which further enhances the precise identification of exon boundaries [5]. In the following, we describe how to use this new implementation (simply referred to as CESAR in the following) to obtain comparative gene annotations.
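The two properties that CESAR requires of a projected exon, an intact reading frame and consensus splice sites, can be illustrated with a few simple checks. The following sketch is not CESAR's algorithm, which re-aligns the sequences; it merely tests a given exon sequence after the fact. The function names are hypothetical, and the donor and acceptor dinucleotides GT and AG are assumed here as the consensus.

```python
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})


def in_frame_stop_positions(exon, phase=0):
    """Return the 0-based start positions of in-frame stop codons.

    phase is the number of leading bases that complete a codon started in
    the previous exon (0 for an exon that starts with a complete codon).
    For the last coding exon, the terminal stop codon is expected.
    """
    seq = exon.upper()
    return [
        i for i in range(phase, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS
    ]


def preserves_frame(reference_length, query_length):
    """An insertion or deletion keeps the frame if its length is a multiple of 3."""
    return (query_length - reference_length) % 3 == 0


def is_consensus_donor(dinucleotide, consensus=("GT",)):
    """Check the first two intronic bases downstream of the exon."""
    return dinucleotide.upper() in consensus


def is_consensus_acceptor(dinucleotide, consensus=("AG",)):
    """Check the last two intronic bases upstream of the exon."""
    return dinucleotide.upper() in consensus
```

Applied to Fig. 1, the mouse exon in the CESAR alignment is 21 nt longer than the human exon, so preserves_frame(60, 81) holds, and the shifted donor site "gt" passes is_consensus_donor, whereas the aligned position "ga" in the genome alignment does not.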

2 Materials

2.1 Availability

CESAR’s source code, pre-compiled binaries, and other tools required to annotate exons in a query genome are available from the GitHub repository https://github.com/hillerlab/CESAR2.0. Open a terminal in your Linux-like environment and do the following:

git clone https://github.com/hillerlab/CESAR2.0/
cd CESAR2.0/

Fig. 2 UCSC genome browser screenshot showing CESAR’s gene annotation in the mouse genome (mm10 assembly, chr12:21211760-21428239; tracks “Human Coding Exons Mapped by CESAR” and “Basic Gene Annotation Set from GENCODE”; genes shown: Asap2, Itgb1bp1, Cpsf3, Iah1, Adam17, and Ywhaq). CESAR was applied to project human exons to mouse, resulting in a gene annotation that matches the mouse Gencode annotation. Please note that CESAR only considers coding exons. Thus, UTR exons are not projected, as indicated by red arrows for Itgb1bp1

2.2 Installation

Compiling CESAR’s source code (written in C) requires the gcc compiler:

make

Alternatively, a CESAR binary pre-compiled under 64-bit Linux is present in the precompiledBinary_x86_64 subdirectory. To automate the task of annotating exons in a query genome, it is necessary to add the “tools” subdirectory to the $PATH variable in your environment and to set the $profilePath variable. This allows calling the tools located in this directory without specifying the full path. If you are using a bash shell, do

export PATH=$PATH:`pwd`/tools
export profilePath=`pwd`

If you are using C-shell, do

setenv PATH ${PATH}:`pwd`/tools
setenv profilePath `pwd`

2.3 Computational Requirements

CESAR has been tested on different distributions of Linux (CentOS, Ubuntu, SUSE). The simplest way to get CESAR running is to use the precompiled binary. For users working with a Windows machine, a virtual machine (VMware or VirtualBox) running Linux should be able to support CESAR, though this has not been tested.

The memory requirement grows with the lengths of the reference and query sequences; the values in Table 1 are consistent with growth proportional to the product of both lengths. As shown in Table 1, a desktop machine with 32 GB of RAM is sufficient to run CESAR in single-exon mode on all human genes using the human-mouse genome alignment. The memory requirements for CESAR’s multi-exon mode are more demanding, as intronic sequences can be large. Still, 32 GB of RAM is sufficient to re-align 99.6% of the human genes in their entirety to the respective mouse genomic locus. Importantly, before allocating memory, CESAR pre-computes an upper bound of the required memory and exits with a warning if more memory is needed than specified by the user with the “-maxMemory” parameter (set to 16 GB by default).

Table 1 Memory requirements for CESAR for short, typical, and very long exons or genes in both single-exon and multi-exon mode

| Reference length (bp) | Query length (bp) | Memory (GB) | Mode |
|---|---|---|---|
| 100 | 152 | 0.001 | Single-exon |
| 1,005 | 1,170 | 0.01 | Single-exon |
| 5,001 | 4,664 | 0.18 | Single-exon |
| 10,227 | 10,038 | 0.77 | Single-exon |
| 984 | 5,484 | 0.04 | Multi-exon |
| 5,004 | 137,114 | 5.72 | Multi-exon |
| 9,510 | 135,903 | 10.03 | Multi-exon |
| 17,673 | 19,225 | 2.55 | Multi-exon |

Multi-exon mode refers to aligning all exons of the reference to the entire query locus that contains the entire orthologous gene

Fig. 3 Strength and weakness of CESAR’s multi-exon mode (panel A: human SDHC, hg38 chr1:161,312,916-161,364,786, and mouse Sdhc, mm10 chr1:171,126,897-171,151,888; panel B: human SLC27A6, hg38 chr5:128,965,517-129,033,642, aligned to the black flying-fox). (a) Multi-exon mode recovers an exon that does not align in the genome alignment. Top: UCSC genome browser screenshot showing the human SDHC gene and the genome alignment to mouse. The fourth exon does not align (red box). Bottom: UCSC genome browser showing the orthologous mouse Sdhc gene and CESAR’s gene annotation obtained in single- and multi-exon mode. In contrast to the single-exon mode, multi-exon mode detects exon 4 (red box) with its precise splice sites, as shown by the sequence alignment underneath. (b) Multi-exon mode detects a false exon that is truly absent. The ninth coding exon (red box) of human SLC27A6 does not align to the black flying-fox genome and this coding exon is truly deleted [9]. Other exons exhibit numerous frameshifting and stop codon mutations, showing that this gene is inactivated in the black flying-fox [9]. CESAR’s multi-exon mode nevertheless annotates the ninth coding exon in the black flying-fox; however, the sequence alignment reveals several large insertions and deletions and a low sequence identity. Thus, a post-processing step can filter out such poorly aligning exons that are unlikely to be real

2.4 Input

CESAR’s gene-annotation workflow requires the following data as input:

1. The genomes of the reference and all query species.
2. Transcripts annotated in the reference genome.
3. A genome alignment between the reference and one or more query genomes.

How to obtain each of these inputs is described below in Subheadings 3.1–3.3.

3 Annotating Genes from a Genome Alignment

3.1 Preparing the Genome Assembly Input Data

Obtain the genome sequence of both the reference and all query species. To this end, download each genome as a single file in fasta format from NCBI (https://www.ncbi.nlm.nih.gov/assembly), from Ensembl (https://www.ensembl.org/downloads.html), or from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/downloads.html). Each fasta file must be converted into 2bit format using faToTwoBit from the UCSC source code [6]. For example, if the fasta file for the mouse genome is called “mm10.fa,” the following command converts it to a 2bit file:

    faToTwoBit mm10.fa mm10.2bit

Afterward, create a “2bitDir” directory. In this directory, each species must have a subdirectory whose name is identical to the assembly name (e.g. hg38 for human, mm10 for mouse, oryAfe1 for aardvark). An example is provided with CESAR’s source code:

    find extra/miniExample/2bitDir

which lists the following files:

    extra/miniExample/2bitDir
    extra/miniExample/2bitDir/hg38
    extra/miniExample/2bitDir/hg38/chrom.sizes
    extra/miniExample/2bitDir/hg38/hg38.2bit
    extra/miniExample/2bitDir/oryAfe1
    extra/miniExample/2bitDir/oryAfe1/oryAfe1.2bit
    extra/miniExample/2bitDir/oryAfe1/chrom.sizes

In addition, create a file called “chrom.sizes” in each subdirectory that contains the sizes of all scaffolds of that genome, using twoBitInfo from the UCSC source code:

    for file in $(find 2bitDir -name "*.2bit"); do
        d=$(dirname "$file")
        f=$(basename "$file")
        twoBitInfo "$d/$f" "$d/chrom.sizes"
    done
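Because later steps expect exactly this directory layout, it can help to verify it before continuing. The following is a minimal sketch, not part of CESAR: the function name check_2bit_dir is hypothetical, and the check only assumes the layout described above (one subdirectory per assembly containing <assembly>.2bit and chrom.sizes).

```shell
# check_2bit_dir DIR (hypothetical helper, not shipped with CESAR):
# verify that every assembly subdirectory of DIR contains a non-empty
# <assembly>.2bit and chrom.sizes file, as in the layout described above.
# Prints one line per problem; the exit status is non-zero if any is found.
check_2bit_dir() (
  status=0
  for dir in "$1"/*/; do
    [ -d "$dir" ] || continue          # glob matched nothing
    assembly=$(basename "$dir")
    for f in "$assembly.2bit" chrom.sizes; do
      if [ ! -s "$dir$f" ]; then
        echo "missing or empty: $dir$f"
        status=1
      fi
    done
  done
  exit "$status"
)

# Example: check_2bit_dir extra/miniExample/2bitDir
```

The function body runs in a subshell, so its variables do not leak into the calling shell.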

3.2 Preparing the Reference Gene Annotation Input Data

The second step in the CESAR gene-annotation workflow is to obtain the set of reference transcripts whose orthologs you wish to annotate in the query genome(s). For example, if the reference species is human, the human Ensembl gene annotation can be used [7]. Ensembl transcripts can be downloaded from the Ensembl ftp site (https://www.ensembl.org/info/data/ftp/index.html) by clicking on the “GTF” link under “Gene sets” for Human. At the time of writing, Ensembl v93 genes are available for the human GRCh38 assembly. Clicking on “Homo_sapiens.GRCh38.93.gtf.gz” saves the human gene set file to disk. Alternatively, the UCSC genome browser provides gene annotations, which can be downloaded in gtf format from the Table browser (http://genome.ucsc.edu/cgi-bin/hgTables) [6]. After download, transcripts in gtf format need to be converted to genePred format using gtfToGenePred from the UCSC source code:

    # go to the directory that contains the downloaded transcripts, e.g.
    cd ~/Downloads
    # unzip the file, in case it is compressed
    gzip -d Homo_sapiens.GRCh38.93.gtf.gz
    # this produces a file called Homo_sapiens.GRCh38.93.gtf.
    # Convert to genePred format
    gtfToGenePred Homo_sapiens.GRCh38.93.gtf Homo_sapiens.GRCh38.93.gp

Ensure that the generated genePred file has the right format (see Note 1). Next, we filter the transcripts to retain only protein-coding transcripts. This filtering step also discards the following problematic transcripts: (1) transcripts with a CDS length that is not a multiple of 3 (e.g. genes that utilize programmed ribosomal frameshifts or exhibit a polymorphism in the reference), and (2) transcripts with micro-introns smaller than 30 bp, as such introns often occur in incorrectly annotated transcripts.

    # At this stage, it is useful to specify the input file as a variable
    # (here in Bash notation)
    export inputGenes=Homo_sapiens.GRCh38.93.gp
    formatGenePred.pl ${inputGenes} ${inputGenes}.forCESAR ${inputGenes}.ignore

The filtered transcripts are written to ${inputGenes}.forCESAR, the file that is used in Subheading 3.4.

Instead of considering all available coding transcripts of a gene, one can also run the gene-annotation workflow with the longest transcript only. In this case, add the “-longest” flag:

    formatGenePred.pl ${inputGenes} ${inputGenes}.forCESAR ${inputGenes}.ignore -longest
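To make the two discard criteria concrete, the sketch below flags transcripts in a genePred file whose CDS length is not a multiple of 3 or that contain an intron shorter than 30 bp. It is an illustration only and not a substitute for formatGenePred.pl, whose exact rules may differ. The function name and the reason labels are hypothetical, the input is assumed to be standard 10-column genePred, and all introns (not only those within the CDS) are checked, which is an assumption.

```shell
# genepred_flag_problems FILE (hypothetical helper, NOT formatGenePred.pl):
# print "name<TAB>reason" for transcripts that fail one of the criteria in
# the text. Assumes 10-column genePred: name chrom strand txStart txEnd
# cdsStart cdsEnd exonCount exonStarts exonEnds.
genepred_flag_problems() {
  awk -F'\t' '
  {
    cds_start = $6 + 0; cds_end = $7 + 0
    n = split($9, s, ","); split($10, e, ",")
    if (s[n] == "") n--            # exon lists end with a trailing comma
    cds = 0; micro = 0
    for (i = 1; i <= n; i++) {
      lo = (s[i] + 0 > cds_start) ? s[i] + 0 : cds_start
      hi = (e[i] + 0 < cds_end)   ? e[i] + 0 : cds_end
      if (hi > lo) cds += hi - lo  # coding part of this exon
      if (i < n && s[i + 1] - e[i] < 30) micro = 1
    }
    if (cds == 0)          print $1 "\tno_cds"
    else if (cds % 3 != 0) print $1 "\tcds_not_multiple_of_3"
    if (micro)             print $1 "\tintron_shorter_than_30bp"
  }' "$1"
}

# Example: genepred_flag_problems Homo_sapiens.GRCh38.93.gp | cut -f2 | sort | uniq -c
```

Counting the reasons, as in the example line, gives a quick impression of how many transcripts each criterion would affect.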

3.3 Preparing the Genome Alignment

CESAR requires as input a genome alignment between the selected reference and one or more query genomes in maf format (https://genome.ucsc.edu/FAQ/FAQformat.html#format5). CESAR can handle either a pairwise or a multiple genome alignment stored in this format. Genome alignments can be downloaded from the UCSC genome browser (http://hgdownload.soe.ucsc.edu/downloads.html). Alternatively, genome alignments can be created with the chaining and netting pipeline [8]. The entire process of creating a pairwise genome alignment in maf format can be automated by the UCSC script doBlastzChainNet.pl, as described in http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto. CESAR’s workflow requires that the genome alignment is indexed by the provided mafIndex tool, which uses the chrom.sizes file of the reference genome:

    mafIndex ali.maf ali.bb -chromSizes=extra/miniExample/2bitDir/hg38/chrom.sizes

3.4 Preparing and Executing the CESAR Gene Annotation Jobs

After preparing the three types of input (genome sequences, transcript information, and the genome alignment), the variables that are used as inputs to the CESAR gene-annotation workflow need to be defined.

    export reference=...     # the assembly name of the reference (e.g. hg38)
    export twoBitDir=...     # the directory containing the genomes and chrom.sizes
                             # files (e.g. extra/miniExample/2bitDir)
    export alignment=...     # the indexed alignment file (ali.bb above)
    export querySpecies=...  # a comma-separated list of the query species that you
                             # want to annotate. Each query species must be contained
                             # in ${alignment}.
    export outputDir=...     # name of the output directory that will contain exon
                             # coordinates (in subdirectories). The directory will be
                             # created, if it does not exist.
    export resultsDir=...    # name of the directory that will contain the final gene
                             # annotation (one gene annotation file per query species)
    export maxMemory=...     # maximum amount of memory in GB that CESAR is allowed
                             # to allocate
    export profilePath=...   # path to the directory that contains the 'extra'
                             # subdirectory containing CESAR's profiles and matrices
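Since every generated job depends on these variables, a quick check that none of them is unset or empty can prevent a job list full of malformed commands. The following sketch is not part of CESAR; the function name check_cesar_vars is hypothetical, and it assumes the variables were exported as shown above (including inputGenes from Subheading 3.2).

```shell
# check_cesar_vars (hypothetical helper): report which workflow variables
# are unset or empty in the environment. Relies on the variables having
# been exported, since printenv only sees exported variables.
check_cesar_vars() (
  status=0
  for v in inputGenes reference twoBitDir alignment querySpecies \
           outputDir resultsDir maxMemory profilePath; do
    if [ -z "$(printenv "$v")" ]; then
      echo "not set: $v"
      status=1
    fi
  done
  exit "$status"
)

# Example: check_cesar_vars && echo "all variables defined"
```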


Next, we generate the gene-annotation workflow commands for all filtered transcripts:

    for transcript in $(cut -f1 ${inputGenes}.forCESAR); do
        echo "annotateGenesViaCESAR.pl ${transcript} ${alignment} ${inputGenes}.forCESAR ${reference} ${querySpecies} ${outputDir} ${twoBitDir} ${profilePath} -maxMemory ${maxMemory}"
    done > jobList

The result is a file called “jobList” in which each line is a single job that re-aligns a single transcript to all query species. Each job is completely independent of any other job. Hence, the jobs can be run in parallel on a compute cluster. In the absence of a compute cluster, the jobs can be run sequentially:

    chmod +x jobList
    ./jobList
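Because the jobs are independent, a multi-core desktop machine offers a middle ground between a cluster and a strictly sequential run. The sketch below runs the lines of a job file with a fixed number of concurrent processes. It is an assumption-laden illustration, not part of CESAR: the function name run_joblist is hypothetical, the “-P” option is not part of POSIX xargs (GNU xargs is assumed), and job lines are assumed to contain no quotes or backslashes, which holds for the commands generated above. Keep in mind that the memory limit applies per job, so the number of parallel jobs multiplied by ${maxMemory} should fit into the available RAM.

```shell
# run_joblist FILE N (hypothetical helper): run each line of FILE as a shell
# command, with at most N commands running at the same time.
# Assumes GNU xargs (-P is an extension) and job lines without quotes or
# backslashes. xargs returns a non-zero status if any job fails.
run_joblist() {
  xargs -P "$2" -I '{}' sh -c '{}' < "$1"
}

# Example: run_joblist jobList 4
```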

Using CESAR to project 196,259 human coding exons to mouse takes approximately 7 h on a desktop machine using a single core. The memory requirement varies with the size of the input gene (see Note 2 and Table 1).

3.5 Merging CESAR’s Output Into a Single Gene Annotation File per Species

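In this step, we collect the results obtained in the previous step (after each job has successfully finished) in a single genePred file for each query species.

    for species in $(echo $querySpecies | sed 's/,/ /g'); do
        echo "bed2GenePred.pl $species $outputDir /dev/stdout | awk '{if (\$4 != \$5) print \$0}' > $resultsDir/$species.gp"
    done > jobListGenePred
    chmod +x jobListGenePred
    ./jobListGenePred

The dollar signs of the awk program are escaped so that they are written literally to the job file instead of being expanded by the shell. The awk filter removes entries whose fourth and fifth columns (transcript start and end in genePred format) are identical.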

This step takes only a few minutes. The final results are in $resultsDir (specified as a variable above) as a single genePred-formatted file per query species. GenePred files can be converted to gtf format using genePredToGtf from the UCSC source code:

    genePredToGtf file mm10.gp mm10.gtf

The $outputDir directory that is used to store temporary results may be deleted afterward.

3.6 Visualizing the Gene Annotations

This step is optional. An obtained genePred file can be visualized in the UCSC genome browser of the query genome, as shown in Fig. 2. This can be done by converting the genePred file into gtf format, as described above, and then uploading this file to the UCSC genome browser via their “Custom Track” feature.

3.7 Running CESAR in Multi-exon Mode

To run CESAR in multi-exon mode, all steps are exactly the same as described above (Subheadings 3.1–3.3 and 3.5); only Subheading 3.4 differs. After specifying the variables, the following command generates the jobs that run CESAR in multi-exon mode:

    for transcript in $(cut -f1 ${inputGenes}.forCESAR); do
        echo "annotateGenesViaCESAR_multi_exon.pl ${transcript} ${alignment} ${inputGenes}.forCESAR ${reference} ${querySpecies} ${outputDir} ${twoBitDir} ${profilePath} -maxMemory ${maxMemory}"
    done > jobList

The jobs listed in “jobList” can be executed on a compute cluster or run sequentially.

4 Notes

1. In case of problems with the transcript file, one can use UCSC’s genePredCheck tool to check whether the converted genePred file has a valid format.

2. A limitation of CESAR is that its memory requirement is proportional to the lengths of the input sequences. By default, CESAR stops with a warning if it estimates that more than 16 GB of memory may be required:

    CRITICAL src/Cesar.c:117 main(): The memory consumption is limited to 16.0000 GB by default. Your attempt requires 30.1539 GB. You can change the limit via --max-memory.

   If your computer provides more memory, set the maxMemory variable (Subheading 3.4) to a higher value. For example, 32 GB of RAM is sufficient to align all human exons in single-exon mode. In multi-exon mode, CESAR may require even more memory if the transcript has many exons or if introns in the query genome are large. For such genes, CESAR can be run in single-exon mode.
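As a rough planning aid, the values in Table 1 can be turned into a back-of-the-envelope estimate. The sketch below assumes that memory scales with the product of reference and query length at about 8 bytes per position pair. This constant is fitted by eye to the larger rows of Table 1 (which imply roughly 7.5 to 8.3 bytes) and is not taken from CESAR’s source code; the upper bound that CESAR pre-computes itself remains authoritative. Both function names are hypothetical.

```shell
# estimate_cesar_mem_gb REF_LEN QUERY_LEN (hypothetical helper):
# rough memory estimate in GB. The factor 8 (bytes per reference x query
# position) is an assumption fitted by eye to Table 1, not CESAR's formula.
estimate_cesar_mem_gb() {
  awk -v r="$1" -v q="$2" 'BEGIN { printf "%.2f\n", r * q * 8 / 1e9 }'
}

# fits_in_memory REF_LEN QUERY_LEN LIMIT_GB (hypothetical helper):
# succeed if the rough estimate does not exceed the given limit.
fits_in_memory() {
  awk -v r="$1" -v q="$2" -v m="$3" 'BEGIN { exit !(r * q * 8 / 1e9 <= m) }'
}

# Example: fits_in_memory 5004 137114 16 && echo "should fit into 16 GB"
```

For the 5,004 bp reference and 137,114 bp query row of Table 1, this estimate gives about 5.5 GB, compared to the measured 5.72 GB.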

5 Special Cases

1. In case exons are truly deleted or overlap an assembly gap in the query, CESAR’s multi-exon mode has a tendency to align random intronic sequence to such reference exons, instead of producing an alignment where these exons are entirely deleted. Such exon alignments are characterized by large insertions and


deletions and a low sequence identity (Fig. 3b). A subsequent filtering step can be used to remove those exons from the resulting gene annotation that are poorly aligning and thus unlikely to be real exons.

2. CESAR’s multi-exon mode requires that all aligning exons of a gene are located on a single locus in the query genome (same scaffold and same strand in a co-linear order). It is therefore recommended to use the single-exon mode for query assemblies with a high degree of fragmentation, where many genes will partially align to different scaffolds. Alternatively, CESAR can be run in both single- and multi-exon mode, and the resulting annotations can be combined.

3. CESAR’s source code provides splice site profiles obtained for human. These profiles are used in the re-alignment process to locate orthologous or shifted splice sites. Splice site profiles will be similar for closely related species such as mammals; however, they may differ if species distantly related to human are used as the reference. In this case, it is recommended to obtain new splice site profiles for the reference species, which can be done as follows:

    # obtain a file that contains the longest transcript per gene for the reference
    formatGenePred.pl ${inputGenes} ${inputGenes}.CESAR ${inputGenes}.ignore -longest
    # define the following variables
    export input=${inputGenes}.CESAR
    export ref_2bit=...   # the path to the two bit file of reference species
    # extract the sequences around the splice sites from all transcripts
    extract_splice_sites.pl $input acc_seqs.txt donor_seqs.txt $ref_2bit
    # extract the sequence upstream of the first exon from the genes
    get_start_context.pl $input start_seqs.txt $ref_2bit
    # Lastly, convert these sequences to profiles:
    create_profiles.pl acc_seqs.txt acc_profile.txt
    create_profiles.pl donor_seqs.txt do_profile.txt
    create_profiles.pl start_seqs.txt firstCodon_profile.txt
    # clean-up
    rm acc_seqs.txt donor_seqs.txt start_seqs.txt
    # Move these files to the relevant clade so that CESAR can read these profiles
    export clade=...      # name of the new clade, for example chicken
    mkdir -p CESAR2.0/extra/tables/$clade
    mv acc_profile.txt CESAR2.0/extra/tables/$clade
    mv do_profile.txt CESAR2.0/extra/tables/$clade
    mv firstCodon_profile.txt CESAR2.0/extra/tables/$clade
    # copy the original stop codon profile and the codon substitution matrix
    cp CESAR2.0/extra/tables/human/lastCodon_profile.txt CESAR2.0/extra/tables/$clade
    cp CESAR2.0/extra/tables/human/eth_codon_sub.txt CESAR2.0/extra/tables/$clade

Acknowledgment

This work was supported by the Max Planck Society and the German Research Foundation (HI 1423/3-1).

References

1. Picardi E, Pesole G (2010) Computational methods for ab initio and comparative gene finding. Methods Mol Biol 609:269–284. https://doi.org/10.1007/978-1-60327-241-4_16
2. Zhu J, Sanborn JZ, Diekhans M, Lowe CB, Pringle TH, Haussler D (2007) Comparative genomics search for losses of long-established genes on the human lineage. PLoS Comput Biol 3(12):e247. https://doi.org/10.1371/journal.pcbi.0030247
3. Sharma V, Hiller M (2017) Increased alignment sensitivity improves the usage of genome alignments for comparative gene annotation. Nucleic Acids Res 45(14):8369–8377. https://doi.org/10.1093/nar/gkx554
4. Sharma V, Elghafari A, Hiller M (2016) Coding exon-structure aware realigner (CESAR) utilizes genome alignments for accurate comparative gene annotation. Nucleic Acids Res 44(11):e103. https://doi.org/10.1093/nar/gkw210
5. Sharma V, Schwede P, Hiller M (2017) CESAR 2.0 substantially improves speed and accuracy of comparative gene annotation. Bioinformatics 33(24):3985–3987. https://doi.org/10.1093/bioinformatics/btx527
6. Casper J, Zweig AS, Villarreal C, Tyner C, Speir ML, Rosenbloom KR, Raney BJ, Lee CM, Lee BT, Karolchik D, Hinrichs AS, Haeussler M, Guruvadoo L, Navarro Gonzalez J, Gibson D, Fiddes IT, Eisenhart C, Diekhans M, Clawson H, Barber GP, Armstrong J, Haussler D, Kuhn RM, Kent WJ (2018) The UCSC Genome Browser database: 2018 update. Nucleic Acids Res 46(D1):D762–D769. https://doi.org/10.1093/nar/gkx1020
7. Zerbino DR, Achuthan P, Akanni W, Amode MR, Barrell D, Bhai J, Billis K, Cummins C, Gall A, Giron CG, Gil L, Gordon L, Haggerty L, Haskell E, Hourlier T, Izuogu OG, Janacek SH, Juettemann T, To JK, Laird MR, Lavidas I, Liu Z, Loveland JE, Maurel T, McLaren W, Moore B, Mudge J, Murphy DN, Newman V, Nuhn M, Ogeh D, Ong CK, Parker A, Patricio M, Riat HS, Schuilenburg H, Sheppard D, Sparrow H, Taylor K, Thormann A, Vullo A, Walts B, Zadissa A, Frankish A, Hunt SE, Kostadima M, Langridge N, Martin FJ, Muffato M, Perry E, Ruffier M, Staines DM, Trevanion SJ, Aken BL, Cunningham F, Yates A, Flicek P (2018) Ensembl 2018. Nucleic Acids Res 46(D1):D754–D761. https://doi.org/10.1093/nar/gkx1098
8. Kent WJ, Baertsch R, Hinrichs A, Miller W, Haussler D (2003) Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci U S A 100(20):11484–11489. https://doi.org/10.1073/pnas.1932072100
9. Sharma V, Hecker N, Roscito JG, Foerster L, Langer BE, Hiller M (2018) A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nat Commun 9(1):1215. https://doi.org/10.1038/s41467-018-03667-1

Chapter 11

Predicting Genes in Closely Related Species with Scipio and WebScipio

Martin Kollmar

Abstract

Scipio and WebScipio are homology-based gene prediction software designed for annotating multigenic families and for transferring annotations from one species to closely related species. Their strengths include the power to cope with sequencing-related problems, such as sequencing errors and assemblies with short contigs, and the ability to correctly predict genes with unusually long introns and/or rather short exons. WebScipio is connected to diArk, the largest collection of eukaryotic genome assemblies, and thereby offers a very convenient way to correct existing annotations and to extend protein family datasets. WebScipio is also a key resource for researchers interested in mutually exclusive splicing, allowing them to search for alternative exons not only in introns but also in up- and downstream regions in case the search sequence is incomplete. In this chapter, I describe how to use Scipio and WebScipio, keeping a first-time user in mind.

Key words Eukaryotes, Sequenced genomes, Gene structure reconstruction, Gene prediction

1 Introduction

Gene finding denotes the identification of transcribed sequence within genomic sequence [1–3]. For protein-coding genes, the approaches are usually divided into ab initio and homology-based methods. Ab initio methods use statistical models for almost all aspects of coding and non-coding regions, and the accuracy of their gene predictions increases considerably with the amount and quality of the data used to build these models. Consequently, the better the models, the more species-specific they become. In contrast, homology-based methods do not require any knowledge about the characteristics of the genome to be studied. They align sets of known sequences (proteins or transcribed DNA) to the new genome, and hits with a certain similarity are considered exonic. Exon borders are subsequently refined using general markers such as a methionine codon as start, stop codons as ends, and a few known splice site patterns in case exons are interrupted by introns. In practice, ab initio software solutions are used for the de

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_11, © Springer Science+Business Media, LLC, part of Springer Nature 2019


novo annotation of genomes, incorporating transcribed sequences and sequences from protein homologs into model training, while homology-based methods are used to transfer existing genome annotations to genomes from closely related species and to annotate specific proteins or protein families across many, including distantly related, species.

Scipio is a homology-based gene prediction tool that locates the regions coding for a query protein sequence in a DNA target sequence [4, 5]. It is a convenient tool for the determination of exact gene structures and reverse translations. As the basis for the gene prediction, Scipio uses the protein-DNA alignments generated by the BLAST-like Alignment Tool (BLAT; [6]). As homology-based software, Scipio can resolve many cases that ab initio methods very often fail to predict correctly, such as unusually long introns, unusually long genes (ab initio software most often splits these single genes into multiple ones), short exons (most often missed by ab initio software), tandemly arranged gene duplicates (resulting from gene duplications and most often predicted as single genes by ab initio software), and rare or underestimated biological events such as noncanonical splice sites. In addition, Scipio can resolve most problems related to genome sequencing, such as sequencing errors resulting in low-quality bases and PCR errors leading to frameshifts and in-frame stop codons, and problems related to genome assembly that result in genes spread over multiple contigs (continuous genomic sequences).

WebScipio was designed as an easy-to-use web interface to the Scipio software [5, 7, 8] with the additional benefit of providing access to diArk’s collection of eukaryotic genome assemblies [9–11]. In this way, WebScipio offers an easy way to extend, for example, protein family sequence datasets with homologs from genomes for which assemblies and/or annotations are not available at the major nucleotide databases (GenBank, DDBJ, ENA). The output allows visual inspection of every nucleotide of the gene with exons highlighted by the corresponding protein translations, which is likely the easiest way to manually check intron splice patterns for consistency and accuracy.

In addition to predicting genes in genomes of related species, WebScipio is currently the only software available for predicting mutually exclusive exons (MXEs; [12]). Mutually exclusive splicing is a specific type of alternative splicing that leads to the inclusion of one, and only one, exon of a cluster of neighboring exons into a transcript [13]. Although MXEs are highly underrepresented in gene prediction datasets and RNA-seq studies of global alternative splicing [14–16], the analyses of MXEs predicted by WebScipio resulted in a tenfold increase of MXEs in plants [17], Drosophila melanogaster [18], and human [19]. This chapter explains how to use Scipio at the command line, and how to predict genes and mutually exclusive exons with WebScipio online.

How to use Scipio

2 Predicting Genes Using the WebScipio Web Interface

The WebScipio web interface guides the user step by step from the input sequences to the predicted gene structure. First, a target genome sequence needs to be specified. This is done either by selecting an organism using an autocompletion search form (see Note 1), which then returns a list of available genome assemblies (see Note 2), or by uploading a DNA sequence in FASTA format (current limit: 1 MB). In the next step, one or more protein sequences need to be entered into a form field or uploaded. In the optional third step, Scipio parameters can be adjusted, and the search for alternatively spliced exons and tandemly arranged gene duplicates can be enabled.

At the WebScipio search page, we provide several example cases to demonstrate the power of WebScipio in gene reconstruction, resolving problems that arise from sequencing errors (e.g., low-quality bases and PCR errors leading to frameshifts and in-frame stop codons). Examples also illustrate the effects of the Scipio parameters used to find very short exons (coding for two to six amino acids) and to correct ambiguous exon borders. Sequences for some of these examples can be downloaded from https://www.webscipio.org/help/examples. This page also lists all parameters (target species, genome assembly, Scipio parameters) necessary to obtain the results as presented. The examples at the search page run fully automatically and thus do not need any sequence or selection from the user, but the input data are stored temporarily, which allows the user to change parameters and restart the prediction. Here, I will demonstrate some of these examples and discuss the effects of the most important Scipio parameters.

1. Go to https://www.webscipio.org/search or click on the “WebScipio” menu item if you started at the main entry page (https://www.webscipio.org).

2. Below the autocompletion form, click on “Examples” to open the list of available example cases.

3. The first example is a dynactin-6 (p62) gene from Fusarium oxysporum f. sp. lycopersici 4287, which contains a frameshift due to a sequencing error (Fig. 1a). In ab initio gene predictions, this sequencing error is not resolved and a second intron is predicted instead. Click on the example and follow the workflow to the final results. The gene structure scheme represents the query protein sequence matched on the target genome sequence. Click on the “Search details” arrow to open a summary of the search, including full details on the genomic location and the differences between query sequence and genomic sequence. Click on the “Alignment” result tab to inspect the differences at nucleotide resolution. In the “Search details” summary, the number of mismatches (different amino acid

[Fig. 1, panels A–E: gene structure schemes, each with a 200 bp scale bar. (A) Fusarium oxysporum f. sp. lycopersici 4287, gi|145202742|gb|DS231697.1| (1705 bp); (B) Fusarium agapanthi NRRL 31653, gi|1033907302|gb|LUFC01000948.1| (1704 bp); (C) Fusarium algeriense NRRL 66647, PVPZ01000777.1 (1704 bp); (D) and (E) Fusarium asiaticum NRRL28720, gi|1061149483|gb|LHTZ01000135.1| (1720 bp)]

Fig. 1 Gene prediction of dynactin-6 (p62) homologs. The figures show gene structure schemes of dynactin-6 homologs in Fusarium species. Dark and light grey boxes represent exons and introns, respectively. Red lines indicate mismatches (differences between search sequence and translation of the target genomic sequence), blue lines indicate frameshifts (one or two missing or additional nucleotides in the target genomic sequence), and green lines represent additional amino acids in the query protein sequence or the translated genomic target sequence. A red box represents a large region present in the query sequence for which a homologous region in the target sequence has not been found given the restrictions set by the parameters. (a) Gene structure of the dynactin-6 (p62) gene from Fusarium oxysporum f. sp. lycopersici 4287. (b) Gene structure of dynactin-6 from Fusarium agapanthi NRRL 31653, searched with dynactin-6 from F. oxysporum and “Max. Mismatch” set to infinite. (c) Gene structure of dynactin-6 from Fusarium algeriense NRRL 66647, searched with dynactin-6 from F. oxysporum and “Max. Mismatch” set to infinite. (d) Gene structure of dynactin-6 from Fusarium asiaticum NRRL28720, searched with dynactin-6 from F. oxysporum, and “Max. Mismatch” and “Min. Identity” set to infinite and 70%, respectively. (e) Gene structure of dynactin-6 from Fusarium asiaticum NRRL28720, searched with dynactin-6 from F. oxysporum as in (d) but in addition with “Gap to Close” set to 20


found in the translation of the target genome sequence compared to the expected amino acid as given in the query protein sequence), the Identity between query and target sequence, and the Scipio search Score are listed. These items are Scipio cut-off parameters and can be adjusted in the “Expert Options” section as follows.

4. Move up the page to the autocompletion form, remove the part “oxysporum f. sp. lycopersici 4287,” and wait for the list of available Fusarium species (in case the autocompletion form does not respond immediately, you can also remove everything and start typing Fusarium). The first or one of the first hits will be Fusarium agapanthi NRRL 31653. Select this fungus from the list and click on the “Select” button below the list of available genome assemblies (currently, there is only a contig-level assembly from March 2016 available). From your previous search with “Example 1,” a protein query sequence should already be present in the Protein Data form. Click on the “Submit” button below the protein sequence form and then on the “Browser” button in the “Start Search” section. You will see that Scipio does not predict a gene based on the given parameters but suggests which parameters to change in the Expert Options. Open the Expert Options (click on the arrow-down icon). From the parameters at the top, lower the “Min. Identity” cut-off to 70% and change the “Max. Mismatch” number to infinite (“inf.”). Click again on the Browser button in the Start Search section. Scipio has now predicted the dynactin-6 (p62) homolog from F. oxysporum f. sp. lycopersici 4287 in F. agapanthi NRRL 31653 (Fig. 1b). In the “Search details” you will see that increasing the “Max. Mismatch” cut-off would already have been sufficient.

5. Move up to the autocompletion form and try other Fusarium species. The Fusarium algeriense NRRL 66647 homolog, for example, has 35 amino acid differences compared to the F. oxysporum f. sp. lycopersici dynactin-6 (changing the “Max. Mismatch” cut-off to infinite would be sufficient; Fig. 1c). Select Fusarium asiaticum NRRL28720 from the species list and repeat the search. Here, it is necessary to lower the “Min. Identity” cut-off as well to obtain a gene prediction. However, the predicted gene contains a gap in the middle (red box; Fig. 1d). Move up the page to the “Scipio Expert Options” section and increase “Gap to close” to 20. Start the prediction again, and the dynactin-6 homolog in F. asiaticum NRRL28720 will be predicted correctly, also providing all the details of the differences to the F. oxysporum f. sp. lycopersici homolog (inspect the “Alignment”; there are many mismatches as well as additional and missing amino acids in some regions; Fig. 1e). You might wonder why relaxed parameters


are not applied by default. The reason is that the score of BLAT hits and Scipio results depends on the length of the identified homologous region. A longer region, although showing more discrepancies to the query sequence, might thus get a better score than a more similar region if the latter is only partially present because of an assembly gap.

6. Examples 2 and 3 show difficult cases of a gene spread over multiple contigs (such genes will not be predicted by ab initio software) and a gene with a three-nucleotide 5′ exon, respectively (such short exons will not be predicted by ab initio software). By repeating the search for a homolog of the protein from example 3 in a related Magnaporthe fungus or in more distant fungi from Diaporthaceae and Sordariales, you will see that Scipio correctly predicts the three-nucleotide exon in homologous genes.

7. Examples 4–6 represent cases with very short internal exons or genes with regions of relatively low homology. When running these examples, those “Expert Options” that need to be adjusted will be highlighted. Specifically, the “Max. Move Exon” parameter specifies the number of amino acids that will be cut on each exon border. Subsequently, the most likely place of the cut amino acids will be searched within the intron, either by placing amino acids back on both exon borders or by identifying additional internal exons, both under the strict requirement that intron splice sites agree with the set of accepted sites (default: only GT---AG and GC---AG are accepted, but these can be extended to further sites such as the U12-type splice site sequence AT---AC; parameter “Accepted Splice Sites”). The “Gap to Close” parameter specifies the maximum size of a gap in a gene prediction that Scipio closes by adding mismatches to exon borders until intron splice site restrictions are fulfilled. By default, up to six additional amino acids in the query sequence will be tolerated without introducing a gap (for unmatched query sequence) in the target sequence. The “Min. Intron Length” parameter has been implemented to model those cases where the query sequence is shorter than the target protein sequence (e.g., protein surface loops of different length). By default, every region longer than 21 nucleotides will be treated as an intron, while shorter sequences are regarded as exonic. Many vertebrates have exons shorter than the default BLAT tile size (seven amino acids), which is the width of the search window used to scan the genome (the size of the match that triggers an alignment). Lowering the BLAT tile size will considerably slow down the search process but does not guarantee that missing exons will be found. For this reason, Scipio provides an exhaustive search based on the Needleman-Wunsch algorithm with the “Exhaust Align Size” specifying the maximum search


region and the “Exhaust Gap Size” defining the number of amino acids to be searched. While all parameters are already optimized for each example, you can change parameters and rerun the search to test the effect of each parameter on the search result.

8. All gene-related data (genomic sequence, coding sequence, exons and introns, translation, log file, and many more) can be downloaded under the “Download Resultfiles” tab.
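The default settings of the “Accepted Splice Sites” and “Min. Intron Length” parameters described in step 7 amount to two simple rules, which the sketch below spells out. It is an illustration of the stated defaults only and does not reproduce Scipio’s implementation; both function names are hypothetical, and treating a region of exactly 21 nucleotides as exonic is an assumption based on the wording “longer than 21 nucleotides.”

```shell
# intron_has_accepted_sites SEQ (hypothetical helper): succeed if the intron
# sequence starts with GT or GC and ends with AG, the default accepted
# splice sites named in the text. Lower-case input is accepted.
intron_has_accepted_sites() {
  case "$(printf '%s' "$1" | tr 'acgt' 'ACGT')" in
    GT*AG|GC*AG) return 0 ;;
    *)           return 1 ;;
  esac
}

# classify_region LENGTH_NT (hypothetical helper): regions longer than 21
# nucleotides are treated as introns, shorter ones as exonic (default
# behaviour described in the text; exactly 21 counts as exonic here).
classify_region() {
  if [ "$1" -gt 21 ]; then echo intron; else echo exonic; fi
}

# Example: intron_has_accepted_sites GTAAGTCCCTTTTCAG && echo accepted
```

Extending the accepted set to the U12-type AT---AC pattern mentioned in the text would correspond to adding `AT*AC` to the first `case` branch.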

3 Predicting Mutually Exclusive Spliced Exons (WebScipio Only)

WebScipio offers an advanced option to search for mutually exclusive spliced exons (MXEs) based on the exon-intron structure determined by Scipio. The search algorithm builds on three assumptions: firstly, MXEs have a similar length; secondly, their splice sites and reading frames are conserved; thirdly, their sequences are homologous and code for the same part of a protein. Accordingly, WebScipio generates a list of putative exons within a certain genomic region, respecting potential exon lengths, reading frames, and splice site restrictions. The putative exons are translated, candidates with in-frame stop codons are omitted, the remaining exon candidates are aligned to the query exon, and the alignments are scored and ranked. While reading frames and splice site patterns are invariable for MXEs (otherwise they could not be spliced in a mutually exclusive manner), the other parameters guiding the search process are adjustable.

1. Go to the WebScipio search page, open the list of examples, and click on example 7, which contains the muscle myosin heavy chain from the water flea Daphnia pulex. This gene contains nine clusters of mutually exclusive spliced exons, of which eight code for different parts of the motor domain, allowing for transcripts encoding myosins with very different functions adjusted to the various muscles present in developmental stages and in the adult flea [20].

2. Open the “Alignment” view and inspect the predicted exons at nucleotide resolution. Some MXEs are almost identical, resulting in scores of more than 88%, such as the MXEs of cluster 5 (exon 10); others have variable sequences, resulting in scores of 25% to 40%, such as the MXEs of cluster 8 (exon 19). The MXEs of cluster 6 (exon 16) are very short (18 amino acids).

3. In the section “Search for Mutually Exclusive Exons” several parameters can be adjusted: (a) The “Allowed length difference for exons [aa]” sets the maximum difference between the length of a reference exon and the lengths of candidate alternative exons. (b) The “Minimal score for exons [%]” is the

200

Martin Kollmar

threshold above which alternative exons are taken into account. The score is defined by the global alignment of the translated exon sequence to the corresponding alternative exon translation divided by the global alignment of the exon translation to itself. (c) The “Minimal exon length [aa]” parameter is used to restrict the length of the reference exons for the mutually exclusive spliced exons search. Very short exons of less than ten amino acids likely result in hundreds of mutually exclusive exon candidates slowing the processing considerably. In addition, short exons usually encode less specific sequences resulting in many false-positive predictions. (d) Enabling the “Maximal recursion depth” starts repeating the search for MXE candidates using the candidates determined in the first round as reference. This option is useful when the first reference exon is rather the outlier of the MXE cluster while the other exons are rather similar. Reducing the score and length cut-off parameters could provide the same result, but might lead to multiple false positives in other clusters then. Choosing parameters thus depends on each gene and needs careful balancing.
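The length filter in (a) and the normalized score in (b) can be illustrated with a small Python sketch. This is not WebScipio's implementation: the alignment scoring scheme (+1 match, -1 mismatch, -1 gap), the function names, the cut-off defaults, and the example peptides are all assumptions made for illustration, since the substitution scores actually used by WebScipio are not given here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (score only, no traceback).

    The scoring values are illustrative assumptions, not WebScipio's.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def mxe_score(reference, candidate):
    """Score of candidate vs. reference, divided by the reference's
    self-alignment score, expressed in percent."""
    self_score = global_alignment_score(reference, reference)
    return 100.0 * global_alignment_score(reference, candidate) / self_score


def filter_candidates(reference, candidates, max_len_diff=20, min_score=15.0):
    """Apply the stop-codon, length-difference, and minimal-score filters.

    max_len_diff and min_score are hypothetical defaults.
    """
    kept = []
    for cand in candidates:
        if "*" in cand:  # translated candidate contains an in-frame stop codon
            continue
        if abs(len(cand) - len(reference)) > max_len_diff:
            continue
        score = mxe_score(reference, cand)
        if score >= min_score:
            kept.append((cand, score))
    # Rank the surviving candidates, best score first
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Because the score is normalized by the self-alignment of the reference exon, an identical candidate always scores 100%, which makes cut-offs comparable between long and short exons.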

4 Using Scipio at the Command Line

For the computation of initial protein-DNA spliced alignments, Scipio utilizes the program BLAT written by Jim Kent [6]. Scipio provides some general search parameters that filter the BLAT output for further post-processing, and offers several expert options that influence the post-processing steps. For genes of very close homologs, Scipio mainly performs the following steps to assemble the fragmented BLAT hits into complete gene structures:

• BLAT does not try to align codons that are split by introns. Thus, Scipio searches for missing codons, preferring those that are split at splice sites, and adds the nucleotides to the corresponding exons.
• Scipio is able to reconstruct the (maybe not so) rare cases of genes that are spread over multiple contigs. First, all BLAT hits are collected and sorted by score. Then non-overlapping hits are taken to form a collection of hits of the same query.
• Exons that BLAT has split into multiple separate matches, usually because of frameshifts, are joined back into a single match.
• Scipio retrieves all the corresponding sequences and groups them together with the BLAT results to form the output.
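The collection step for genes spread over several contigs (second item above) can be pictured as a greedy selection. The following Python sketch assumes that each BLAT hit has been reduced to a target name, a half-open interval on the query protein, and a score; the `Hit` tuple, its field names, and the selection details are hypothetical, and Scipio itself (written in Perl) handles many more cases.

```python
from collections import namedtuple

# Hypothetical, simplified view of a BLAT hit: query coordinates are
# half-open, so a hit covering residues 0..119 has q_start=0, q_end=120.
Hit = namedtuple("Hit", "target q_start q_end score")


def overlaps(a, b):
    """True if two hits cover at least one common query position."""
    return a.q_start < b.q_end and b.q_start < a.q_end


def collect_hits(hits):
    """Sort hits by score and keep each one that does not overlap
    an already chosen hit on the query."""
    chosen = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, other) for other in chosen):
            chosen.append(hit)
    # Return the collection in query order, ready for assembly
    return sorted(chosen, key=lambda h: h.q_start)
```

In this sketch a lower-scoring hit that overlaps a better one is simply dropped; hits from different contigs that cover different parts of the query end up in the same collection.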

For the prediction of distant protein homologs, the Scipio post-processing focuses especially on gap closing (mapping the parts of the query sequence to the target sequence that BLAT fails to recognize) and hit extension (modeling the regions at exon borders, including terminal exons, where homology is too low to be identified by BLAT). This is achieved by using the Needleman-Wunsch algorithm [21] to search for unmapped query sequence in the respective target regions and by introducing parameters that allow a higher divergence from the exon border regions predicted by BLAT. The default values should be good enough for most cases, but these parameters need manual adaptation especially when searching for very divergent homologs or for homologs in very divergent species. Detailed schemes of the Scipio workflow, including all parameters that can be adjusted manually and showing some of the most important decisions that Scipio makes to provide the best possible result, are available at https://www.webscipio.org/help/scipio.

4.1 Requirements and Installation

The Scipio source code can be downloaded at https://www.webscipio.org/webscipio/download_scipio. Scipio has been used and tested on UNIX-based computers including Linux distributions such as CentOS and Ubuntu, but has never been tested on Windows environments. Scipio requires BioPerl (download from bioperl.org; [22]), the YAML Perl module (download from CPAN), the BLAT binary (see instructions at UCSC, https://genome.ucsc.edu/FAQ/FAQblat.html, or get it from https://github.com/djhshih/blat), and a genome assembly of choice in FASTA format (you can search diArk, https://www.diark.org, for sequenced eukaryotic genomes).

4.2 Adjusting Search Parameters

The only required program arguments to run Scipio are the path to the genome assembly file (a DNA sequence file) and the path to one or more protein query sequences (a protein sequence file), both in FASTA format.

$ scipio.pl []

Several key options for optimizing the gene prediction quality have already been discussed (see Subheading 2), some more are given in Table 1 (for a full list of the search options as well as options to adjust the output please read the documentation or the help page at https://www.webscipio.org/help/scipio). Another option often useful for chromosome scale assemblies is the --single_target_hits (or --chromosome) parameter. This is set automatically in WebScipio when a chromosome assembly is selected as genome target file. This parameter prevents the assembly of BLAT hits from multiple genomic targets, which might lead, especially in the case of cross-species searches, to composed hits that stretch across multiple chromosomes. This option might also be useful when genes are relatively short and contigs relatively long as found, for example, for most fungal genomes.
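The effect of --single_target_hits can be pictured with a short Python sketch. It only illustrates the idea of restricting a composed hit to one genomic target; the tuple layout, the greedy selection, and the rule of keeping the target with the highest summed score are assumptions made for this example and are not taken from the Scipio source.

```python
from collections import defaultdict


def greedy_non_overlapping(hits):
    """Pick hits by descending score, skipping any that overlap on the query."""
    chosen = []
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        _, start, end, _ = hit
        if all(end <= s or start >= e for _, s, e, _ in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h[1])


def compose_hits(hits, single_target_hits=False):
    """hits: tuples (target, query_start, query_end, score), half-open intervals.

    With single_target_hits=True, partial hits are only combined when they
    lie on the same target (assumed rule: keep the best-scoring target).
    """
    if not hits:
        return []
    if not single_target_hits:
        return greedy_non_overlapping(hits)
    by_target = defaultdict(list)
    for hit in hits:
        by_target[hit[0]].append(hit)
    per_target = [greedy_non_overlapping(group) for group in by_target.values()]
    return max(per_target, key=lambda group: sum(h[3] for h in group))
```

In the example used for testing, a slightly better hit on a second chromosome is pulled into the composed hit by default, whereas the restricted mode keeps both partial hits on the first chromosome.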


Table 1 Scipio options (full list available in the documentation)

General options

--blat_output=
    Name of BLAT output file (.psl) to read from. Without this option, Scipio starts BLAT to produce a temporary output file that is processed and then deleted. This option is useful if BLAT has already been run separately, for example in parallel on multiple target files.

--keep_blat_output
    Use this option if you do not want the BLAT output files to be deleted (this option is switched on automatically if --blat_output is given).

Options controlling choice of hits

--min_score=
    Minimal score (between 0 and 1) of a hit for a query, in order to appear in the output. The score is calculated as matches minus mismatches, divided by query length. If a hit is composed of multiple partial hits, the minimal score applies to the best-scoring partial hit (default: 0.3).

--min_identity=
    Minimal identity between query and target sequence in every single BLAT hit to be processed by Scipio, in percent (default: 90).

--min_coverage=
    Minimal portion of the respective query sequence that must be found in every single BLAT hit to be processed by Scipio, in percent (default: 60).

--max_mismatch=
    Maximum number of mismatches in a hit. Use 0 to allow any number of mismatches (default: 0).

--multiple_results
    If this option is given, all hits for the same query with scores exceeding the minimal score are shown. Multiple hits are named _(1), _(2), etc.

--single_target_hits (or --chromosome)
    Prevents Scipio from composing hits from multiple targets (contigs).

Options passed to BLAT (ignored when BLAT is not run)

--blat_bin=
    Name of BLAT executable (default: “blat”).

--blat_params=
    Parameters passed to BLAT; see BLAT documentation.

--blat_tilesize=, --blat_score=, --blat_identity=
    Values for the BLAT parameters -tileSize, -minScore, and -minIdentity; by default, --blat_identity is set to 90% of --min_identity.

Parameters to adjust the Needleman-Wunsch algorithm (all penalty values are multiples of the mismatch penalty)

--nw_insert_penalty=
    Accounts for three missing nucleotides in the target sequence (there is an additional amino acid in the query sequence; default: 1.5).

--nw_gap_penalty=
    Accounts for three extra nucleotides in the target sequence (an amino acid is missing in the query sequence; default: 0.8).

--nw_frameshift_penalty=
    Accounts for one or two missing or additional nucleotides in the target sequence (default: 2.5).

--nw_intron_penalty=
    Value for an intron of any size, with GT---AG pattern (default: 2.0).

--nw_stop_penalty=
    Extra penalty for stop codons in the alignment (default: 2.5).

--exhaust_align_size=
    Maximum sequence length for exhaustive search: because the Needleman-Wunsch algorithm is slow, the optimal alignment will not be computed in DNA regions longer than this (default: 500 nucleotides). Instead, Scipio will try to place additional amino acids at the intron borders (resulting in one big intron).

--exhaust_gap_size=
    Same as --exhaust_align_size, but referring to the query sequence instead of the target sequence (default: 3 × --blat_tilesize).

Expert options

--min_intron_len=
    Minimal length of an intron (default: 22 nucleotides). Any shorter sequence of extra nucleotides will be inserted into the exon as additional nucleotides in a sequence shift.

--transtable=
    Let Scipio use an alternative genetic code table.

--max_move_exon=
    Determines how far Scipio shifts the BLAT prediction to find a correct intron (default: two amino acids). BLAT chooses between possible intron positions by minimizing mismatches. In rare cases, mostly cross-species alignments, there are several intron candidates causing an equal number of mismatches, so that the correct one can only be recognized by matching the splice site consensus. In even rarer cases, the true intron location is more than two codons away from the BLAT prediction, which is when you need this option.

--gap_to_close=
    Maximum size of a gap in a query that Scipio closes by adding mismatches to exon boundaries (default: six amino acids).

--accepted_intron_penalty=
    GT---AG and GC---AG are by far the most common 5′ and 3′ splice sites and are thus accepted by default as correct intron borders. In very rare cases, other splice sites have been observed. To mark these introns as “intron” and not as “intron?”, this parameter has to be given as 1.1 (to accept AT---AC intron borders) or 1.2 (to also accept GG---AG and GA---AG intron borders).
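The hit-selection thresholds in Table 1 amount to simple arithmetic on BLAT match counts, which the following Python sketch illustrates. The score formula is the one given for --min_score; the definitions of identity (matches among aligned residues) and coverage (aligned residues relative to query length), as well as the function names, are assumptions for this example and may differ from what Scipio computes internally.

```python
def hit_score(matches, mismatches, query_length):
    """Score as described for --min_score: (matches - mismatches) / query length."""
    return (matches - mismatches) / query_length


def passes_blat_hit_filters(matches, mismatches, query_length,
                            min_identity=90.0, min_coverage=60.0,
                            max_mismatch=0):
    """Check a single BLAT hit against the identity, coverage, and
    mismatch thresholds (defaults as listed in Table 1)."""
    aligned = matches + mismatches
    identity = 100.0 * matches / aligned        # assumed definition
    coverage = 100.0 * aligned / query_length   # assumed definition
    # max_mismatch = 0 means "any number of mismatches allowed"
    if max_mismatch != 0 and mismatches > max_mismatch:
        return False
    return identity >= min_identity and coverage >= min_coverage
```

For a 500-residue query with 450 matches and 50 mismatches, the score is 0.8, well above the default minimal score of 0.3, and the hit also passes the default identity and coverage filters; the same hit is rejected once --max_mismatch is set to a value below 50.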


To prevent the wrong assembly of BLAT hits in cases where this would introduce extremely long introns between exons on different contigs, Scipio offers the --max_assemble_size parameter, which adjusts the maximum size of intron parts at target borders. If joining two partial hits across two contigs would create an intron that exceeds the given size (default: 75,000 nucleotides), the two hits are not joined as parts of one composed hit. If the option --multiple_results is enabled, both partial hits are treated separately and result in independent gene predictions; otherwise the lower-scoring partial hit is discarded.

A parameter of increasing importance is --transtable, with which another genetic code table can be employed (the most recent list of genetic codes can be found at https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). The more species are sequenced, the more species turn out to have reassigned codons [23]. The largest diversity of alternative codes is found in mitochondrial genomes and ciliates, but most sequenced genomes with an altered nuclear genetic code are available from yeasts [24]. When applying the --transtable parameter, it is highly recommended to set the number of allowed mismatches (--max_mismatch parameter) to infinite (= 0; this is the default value), because the number of reassigned codons within a gene might easily exceed the number of allowed mismatches (in case you previously adjusted this value).

4.3 Evaluating the Results in Pretty Human-Readable Format

The yaml2log.pl script converts the YAML output files into easily readable log files with summary information about the results and clearly arranged sequence alignments (see Note 3). The YAML files can also be uploaded at WebScipio (menu “Upload Result File”), giving access to the gene structure scheme in SVG format and the other result presentations from the Result Tabs.

5 Notes

1. The autocompletion form allows searching for scientific names (the main species name as well as alternative and synonymous scientific names such as teleomorph and anamorph names of fungi) as well as common names. Entering a taxon name lists all the respective species. For best orientation, the autocompletion returns a list of at most ten hits and, if there are more, gives the total number of hits at the bottom of the list. Search hits in scientific names are highlighted in bold. For example, try searching for “elephant” or “hummingbird,” or get the list of all available genomes for the taxon “rhizaria.”


2. The list of available genome assemblies is derived from the diArk database [9–11]. To give the user some indicators for data quality, the assembly version, release date, genome coverage, number of contigs, and the N50 values are provided, if available. For more information, a link to diArk’s species page is provided, where additional data related to each genome assembly, such as sequencing and assembly methods used, GC content, and A50 plots, are shown. Please keep in mind that none of the measures is suitable on its own to indicate the best and most complete assembly. The Assemblathon assessment of de novo genome assembly methods did not reveal a best approach, and approaches and measures might be very different for different taxa or even closely related species [25].
3. A few users reported an error when running the yaml2gff.pl script on YAML files generated by WebScipio (YAML Error: Invalid element in map; Code: YAML_LOAD_ERR_BAD_MAP_ELEMENT; Line: 3; Document: 1; at /sw/lib/perl5/site_perl/5.16.2/YAML/Loader.pm line 351). This is most likely produced with a different version (“YAML::Syck”) of the Perl YAML module. Please download the latest script, or contact me.

References

1. Gerstein MB, Bruce C, Rozowsky JS et al (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17:669–681. https://doi.org/10.1101/gr.6339607
2. Sleator RD (2010) An overview of the current status of eukaryote gene prediction strategies. Gene 461:1–4. https://doi.org/10.1016/j.gene.2010.04.008
3. Yandell M, Ence D (2012) A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet 13:329–342. https://doi.org/10.1038/nrg3174
4. Keller O, Odronitz F, Stanke M et al (2008) Scipio: using protein sequences to determine the precise exon/intron structures of genes and their orthologs in closely related species. BMC Bioinformatics 9:278. https://doi.org/10.1186/1471-2105-9-278
5. Hatje K, Keller O, Hammesfahr B et al (2011) Cross-species protein sequence and gene structure prediction with fine-tuned Webscipio 2.0 and Scipio. BMC Res Notes 4:265. https://doi.org/10.1186/1756-0500-4-265
6. Kent WJ (2002) BLAT – the BLAST-like alignment tool. Genome Res 12:656–664. https://doi.org/10.1101/gr.229202
7. Odronitz F, Pillmann H, Keller O et al (2008) WebScipio: an online tool for the determination of gene structures using protein sequences. BMC Genomics 9:422. https://doi.org/10.1186/1471-2164-9-422
8. Hatje K, Hammesfahr B, Kollmar M (2013) WebScipio: reconstructing alternative splice variants of eukaryotic proteins. Nucleic Acids Res 41:W504–W509. https://doi.org/10.1093/nar/gkt398
9. Odronitz F, Hellkamp M, Kollmar M (2007) diArk – a resource for eukaryotic genome research. BMC Genomics 8:103. https://doi.org/10.1186/1471-2164-8-103
10. Hammesfahr B, Odronitz F, Hellkamp M, Kollmar M (2011) diArk 2.0 provides detailed analyses of the ever increasing eukaryotic genome sequencing data. BMC Res Notes 4:338. https://doi.org/10.1186/1756-0500-4-338
11. Kollmar M, Kollmar L, Hammesfahr B, Simm D (2015) diArk – the database for eukaryotic genome and transcriptome assemblies in 2014. Nucleic Acids Res 43:D1107–D1112. https://doi.org/10.1093/nar/gku990
12. Pillmann H, Hatje K, Odronitz F et al (2011) Predicting mutually exclusive spliced exons based on exon length, splice site and reading frame conservation, and exon sequence homology. BMC Bioinformatics 12:270. https://doi.org/10.1186/1471-2105-12-270
13. Smith CWJ (2005) Alternative splicing – when two’s a crowd. Cell 123:1–3. https://doi.org/10.1016/j.cell.2005.09.010
14. Barbosa-Morais NL, Irimia M, Pan Q et al (2012) The evolutionary landscape of alternative splicing in vertebrate species. Science 338:1587–1593. https://doi.org/10.1126/science.1230612
15. Djebali S, Davis CA, Merkel A et al (2012) Landscape of transcription in human cells. Nature 489:101–108. https://doi.org/10.1038/nature11233
16. Gerstein MB, Rozowsky J, Yan K-K et al (2014) Comparative analysis of the transcriptome across distant species. Nature 512:445–448. https://doi.org/10.1038/nature13424
17. Hatje K, Kollmar M (2014) Kassiopeia: a database and web application for the analysis of mutually exclusive exomes of eukaryotes. BMC Genomics 15:115. https://doi.org/10.1186/1471-2164-15-115
18. Hatje K, Kollmar M (2013) Expansion of the mutually exclusive spliced exome in Drosophila. Nat Commun 4:2460. https://doi.org/10.1038/ncomms3460
19. Hatje K, Rahman R-U, Vidal RO et al (2017) The landscape of human mutually exclusive splicing. Mol Syst Biol 13:959
20. Kollmar M, Hatje K (2014) Shared gene structures and clusters of mutually exclusive spliced exons within the metazoan muscle myosin heavy chain genes. PLoS One 9:e88111. https://doi.org/10.1371/journal.pone.0088111
21. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453. https://doi.org/10.1016/0022-2836(70)90057-4
22. Stajich JE, Block D, Boulez K et al (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12:1611–1618. https://doi.org/10.1101/gr.361602
23. Kollmar M, Mühlhausen S (2017) Nuclear codon reassignments in the genomics era and mechanisms behind their evolution. BioEssays 39:1600221. https://doi.org/10.1002/bies.201600221
24. Mühlhausen S, Schmitt HD, Pan K-T et al (2018) Endogenous stochastic decoding of the CUG codon by competing Ser- and Leu-tRNAs in Ascoidea asiatica. Curr Biol 28:2046–2057.e5. https://doi.org/10.1016/j.cub.2018.04.085
25. Bradnam KR, Fass JN, Alexandrov A et al (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2:10. https://doi.org/10.1186/2047-217X-2-10

Chapter 12

AnABlast: Re-searching for Protein-Coding Sequences in Genomic Regions

Alejandro Rubio, Carlos S. Casimiro-Soriguer, Pablo Mier, Miguel A. Andrade-Navarro, Andrés Garzón, Juan Jimenez, and Antonio J. Pérez-Pulido

Abstract

AnABlast is a computational tool that highlights protein-coding regions within intergenic and intronic DNA sequences which escape detection by standard gene prediction algorithms. DNA sequences with small protein-coding genes or exons, complex intron-containing genes, or degenerated DNA fragments are efficiently targeted by AnABlast. Furthermore, this algorithm is particularly useful in detecting protein-coding sequences with nonsignificant homologs to sequences in databases. AnABlast can be executed online at http://www.bioinfocabd.upo.es/anablast/.

Key words Gene finding, Coding DNA sequences, In silico annotation tool, Small genes, Fossil DNA sequences

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_12, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

A great number of wet-lab groups now sequence whole genomes as a common practice, taking advantage of the current burst of the genomics era. In silico analysis of such large amounts of sequence data is essential for accurate annotation [1, 2], but computational tools for predicting genes usually have accuracies of around 90% or even lower for exons in protein-coding genes from eukaryotic organisms [3, 4]. Thus, a significant number of coding sequences escape detection when using currently available genome annotation tools.

The identification of similar proteins through BLAST analysis is one of the most useful strategies in genome annotation. Finding significant alignments facilitates the assignment of putative functions to query amino acid sequences through the identification of related proteins in sequence databases. Nonsignificant alignments, those below the significance score threshold and therefore discarded by in silico gene finders, are often found in conventional similarity searches. In hypothetical polypeptide sequences obtained from the electronic translation of noncoding DNA, such alignments occur by chance at a very low frequency. Protein-coding sequences, however, mostly arise from previous ones during evolution, and nonsignificant alignments can include both functional and random evolutionary footprints coming from common ancestors [5, 6]. Thus, in polypeptide sequences computationally translated from coding DNA, footprints of ancestral common sequences increase the frequency of nonsignificant alignments beyond that of random matches, and consequently the accumulation of nonsignificant alignments efficiently discriminates putative coding DNA sequences from noncoding ones (which lack such footprints). Even in the case of highly divergent genes, coding sequences may contain footprints of common ancestral proteins that can be found, among the millions of proteins available in databases, by using this strategy [5, 6]. Therefore, alignments accumulated in predicted amino acid sequences provide a method to discern coding from noncoding DNA, a new strategy used by AnABlast (Ancestral-patterns search through A BLAST-based strategy) to overcome limitations of current in silico algorithms in identifying putative coding DNA sequences [7].

Here, we present the AnABlast web application, a novel computational tool which allows the identification of putative coding regions in recent genomic sequences, or in intergenic and intron sequences of annotated genomes.

2 Short Introduction to the Web Interface

AnABlast can be executed from a web application and only needs a genomic sequence of up to 25 Kb to start its execution. Alternatively, a BLAST report can be provided if, for example, the user wants to run the slow similarity search on a computer cluster (Fig. 1). In this latter case, the genomic sequence can be as long as 1 Mb.

Fig. 1 Screenshot of the AnABlast Web page at http://www.bioinfocabd.upo.es/anablast/

Briefly, AnABlast compares the predicted amino acid sequence from each of the six reading frames of a genomic DNA sequence against a non-redundant protein database and accumulates all hits found, which we call protomotifs, including low-scored alignments [5]. These protomotifs usually accumulate in coding sequences but rarely in noncoding sequences. Thus the graphic profile of accumulated AnABlast protomotifs yields peaks that may accurately mark putative protein-coding genes, pseudogenes, and fossil sequences, even in the presence of sequencing errors. AnABlast runs a BLASTX search against the UniRef50 database and gathers hits with a default minimal bit-score of 30, though the user can choose to run AnABlast with a higher sensitivity (bit-score 28) or with a lower sensitivity but higher specificity (bit-score 35). The results appear in a genome browser that uses the JBrowse tool, including four tracks:

• AnABlast profile: diagram showing protomotif accumulation, in green colors for the forward strand and in red colors for the reverse strand.
• Peaks: regions with a protomotif accumulation above the threshold (70 is used by default, meaning that 70 UniRef50 proteins share this protomotif). The functional annotation predicted by Sma3s [8] is shown when an element is selected.
• ORFs: open reading frames found from a start codon to a stop codon.
• Predicted genes: gene structure predicted by a gene finder, AUGUSTUS for eukaryotic sequences [9] and Prodigal for prokaryotic sequences [4].
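The six-frame translation that underlies the comparison, and the start-to-stop definition used for the ORFs track, can be sketched in a few lines of Python. This is only an illustration with the standard genetic code; the function names are hypothetical, and AnABlast itself relies on BLASTX for the translated search rather than on code like this.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return {(strand, offset): protein} for the six reading frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(strand_seq[offset:])
    return frames


def find_orfs(protein):
    """ORFs in a translated frame: from a start (M) to the next stop (*)."""
    return [m.group(0) for m in re.finditer(r"M[^*]*\*", protein)]
```

Applying `find_orfs` to each of the six translations gives candidate ORFs per frame, which can then be compared with the positions of the protomotif peaks.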

A representative landscape resulting from the AnABlast analysis of the fission yeast Schizosaccharomyces pombe region coding for the annotated and well-characterized cwf2 and atp-11 genes is shown in Fig. 2. Users can select one region from the last three tracks to obtain both the nucleotide and amino acid sequences. In addition, users can zoom in on a region and gather the sequences from the AnABlast profile pulldown. At present, one execution of AnABlast takes approximately 1 min for each Kb of input sequence.

Fig. 2 Representative AnABlast profile highlighting exons in the annotated cwf2 and atp-11 protein-coding genes in the S. pombe genome (Pombase annotations). The six different colors represent profiles of accumulated alignments in protein sequences predicted from each of the six possible reading frames. (a) Using a high-sensitivity (low-specificity) cutoff, several unspecific peaks appear in noncoding regions; (b) using the default cutoff, peaks appear only where they match true protein-coding regions; (c) using a low-sensitivity cutoff, coding sequences are precisely delimited, but the first exon of cwf2 does not show any peak; (d) annotated gene exons (yellow), AnABlast peaks (pink), and ORF tracks (blue) are shown to give a reference on the genomic region where the peaks appear
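How the bit-score cutoff and the peak threshold interact can be sketched with a toy profile in Python. The sketch assumes that hits have already been mapped to half-open coordinates on the genomic sequence and builds a single profile, whereas AnABlast keeps separate profiles per reading frame; the function names and the use of "at or above" for the threshold are assumptions for this example.

```python
def protomotif_profile(length, hits, min_bitscore=30.0):
    """Count, for every position, the hits that cover it.

    hits: iterable of (start, end, bitscore) with 0-based, half-open
    coordinates on the genomic sequence; hits below min_bitscore are ignored.
    """
    profile = [0] * length
    for start, end, bitscore in hits:
        if bitscore < min_bitscore:
            continue
        for pos in range(max(start, 0), min(end, length)):
            profile[pos] += 1
    return profile


def call_peaks(profile, threshold=70):
    """Return (start, end) intervals where the profile reaches the threshold."""
    peaks = []
    start = None
    for pos, value in enumerate(profile):
        if value >= threshold and start is None:
            start = pos
        elif value < threshold and start is not None:
            peaks.append((start, pos))
            start = None
    if start is not None:
        peaks.append((start, len(profile)))
    return peaks
```

With a toy set of hits and a peak threshold of 4, a cutoff of 28 yields an extra peak built only from weak alignments, a cutoff of 30 keeps a single peak, and a cutoff of 35 removes it as well, mirroring the trade-off between sensitivity and specificity described for Fig. 2.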

3 Examples of Using AnABlast

AnABlast can be executed online at http://www.bioinfocabd.upo.es/anablast/. By default, it will use optimal parameters with a default sensitivity [7], making the use of this program extremely simple. The most basic use of AnABlast is the identification of coding sequences in intergenic regions of annotated genomes.

3.1 Basic Search of Intergenic Coding Regions in Annotated Genomes

To illustrate the usage of the AnABlast web application as a new method for the search of new putative coding regions, we took the genome sequence and annotation of Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 (ASM694v2 assembly) from the Ensembl Bacteria database [10], and extracted all the intergenic regions longer than 1 Kb. We discarded the regions where the gene finder Prodigal predicted protein-coding genes and used AnABlast to analyze the eight remaining regions, which were potential candidates to harbor genes that escaped the gene finder. Two adjacent peaks were identified in a region flanked by the genes STM3083 (putative mannitol dehydrogenase) and STM3085 (putative gntR family regulatory protein) (Fig. 3). The two peaks in the forward strand were then BLASTed against UniProt (a different database than UniRef50) to search for homologous protein sequences [11]. The first one (with the lower signal) has an ORF with a predicted amino acid sequence similar to a uroporphyrinogen decarboxylase (UniProt:A0A0V2D8Y5), while the second one (with the higher peak) is similar to a racemase (UniProt:A0A158N139), both of them from different species of Salmonella. Overall, these results show how AnABlast is useful to discover new putative protein-coding genes where other methods have failed. In eukaryotic genomes, in addition to new putative genes as shown above for the prokaryotic Salmonella genome, AnABlast peaks also highlight new exons in annotated genes.
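The first step of this example, extracting intergenic regions longer than 1 Kb from an annotation, can be sketched in Python as follows. The sketch assumes gene coordinates are available as 1-based inclusive (start, end) pairs and treats the sequence as linear, which ignores the region spanning the origin of a circular chromosome; the function name and the toy coordinates are hypothetical.

```python
def intergenic_regions(genes, genome_length, min_length=1000):
    """Return intergenic intervals longer than min_length.

    genes: list of (start, end) pairs, 1-based inclusive, from either strand.
    Overlapping or nested genes are handled by tracking the furthest gene end.
    """
    regions = []
    cursor = 1  # first position not yet covered by a gene
    for start, end in sorted(genes):
        # The gap cursor..start-1 has length start - cursor
        if start - cursor > min_length:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if genome_length - cursor + 1 > min_length:
        regions.append((cursor, genome_length))
    return regions
```

The regions returned this way would still need to be checked against the gene finder's predictions, as described above, before being submitted to AnABlast.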

3.2 Identification of Small Genes Without Significant Homologs in the Database

Genes encoding very small polypeptides and/or lacking significant homology to any others in the database are difficult to identify by conventional in silico methods. As an example, Fig. 4 shows the AnABlast profile of an intergenic genomic region revealing one such putative small gene in the S. pombe genome [7]. As this example shows, AnABlast is particularly useful in the identification of small ORFs and/or coding sequences lacking significant homology to others in databases, which often escape conventional searches.

3.3 Pseudogenes and Rearranged DNA Fragments

The identification of ORFs within DNA regions highlighted by AnABlast helps to identify coding sequences of putative new genes. However, AnABlast peaks often identify coding sequences lacking complete ORFs, accumulating both stop codons and frameshifts, which usually represent pseudogenes and fossil coding sequences. AnABlast is therefore also helpful in identifying genomic rearrangements (Fig. 5) or even sequences acquired by horizontal transfer, showing evolutionary remnants [7]. Furthermore, simple visual inspection of the profiles obtained may identify near-identical patterns in different chromosome locations, which reveal direct or inverted repeats of a genomic region (see example in Fig. 5). All the features described above make AnABlast a useful tool for the exhaustive analysis of the enormous amount of genomic data currently being generated, ahead of the coming postgenomic era.

Fig. 3 AnABlast results for region AE006468:3242290-3248051 of S. enterica LT2. Protomotif accumulations for four previously annotated genes in the reverse strand are highlighted in red colors and two novel signals appearing in the forward strand in green colors, one of which could represent a racemase protein not previously annotated. Complete ORFs were found for two of the discovered regions that were missed by a conventional gene finder (Prodigal)

Fig. 4 AnABlast peak (green color, forward strand) in the S. pombe genome encoding a small peptide with no significant similarity to known proteins in databases

Fig. 5 Similar AnABlast profiles uncovering a rearranged region repeated in two different chromosomes (a and b) in the S. pombe genome. Annotated genes flanking the corresponding chromosomal intervals are indicated

References

1. Alioto T (2012) Gene prediction. Methods Mol Biol 855:175–201
2. Nesvizhskii AI (2014) Proteogenomics: concepts, applications and computational strategies. Nat Methods 11:1114–1125. https://doi.org/10.1038/nmeth.3144
3. Guigó R, Flicek P, Abril JF et al (2006) EGASP: the human ENCODE genome annotation assessment project. Genome Biol 7(Suppl 1):S2.1–S231
4. Hyatt D, Chen GL, Locascio PF et al (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119. https://doi.org/10.1186/1471-2105-11-119
5. Pérez AJ, Thode G, Trelles O (2004) AnaGram: protein function assignment. Bioinformatics 20:291–292
6. Thode G, García-Ranea JA, Jimenez J (1996) Search for ancient patterns in protein sequences. J Mol Evol 42:224–233
7. Jimenez J, Duncan CD, Gallardo M et al (2015) AnABlast: a new in silico strategy for the genome-wide search of novel genes and fossil regions. DNA Res 22:439–449. https://doi.org/10.1093/dnares/dsv025
8. Casimiro-Soriguer CS, Muñoz-Mérida A, Pérez-Pulido AJ (2017) Sma3s: a universal tool for easy functional annotation of proteomes and transcriptomes. Proteomics 2017:17. https://doi.org/10.1002/pmic.201700071
9. Stanke M, Schöffmann O, Morgenstern B, Waack S (2006) Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7:62
10. Kersey PJ, Allen JE, Armean I et al (2016) Ensembl genomes 2016: more genomes, more complexity. Nucleic Acids Res 44:D574–D580. https://doi.org/10.1093/nar/gkv1209
11. The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158–D169. https://doi.org/10.1093/nar/gkw1099

Chapter 13

Generating Publication-Ready Prokaryotic Genome Annotations with DFAST

Yasuhiro Tanizawa, Takatomo Fujisawa, Masanori Arita, and Yasukazu Nakamura

Abstract

DDBJ Fast Annotation and Submission Tool (DFAST) is a genome annotation pipeline for prokaryotes, which also assists data submission to the public sequence database. It is available both as a web service and as a stand-alone tool that runs on local machines. DFAST can annotate a typical-sized bacterial genome within 5 min. The default annotation workflow contains a gene prediction phase for protein coding sequence, rRNA, tRNA, and CRISPR, and a functional annotation phase to infer protein functions. DFAST generates result files in standard annotation formats and data files for submission to DNA Data Bank of Japan (DDBJ). In this chapter, the annotation workflow and applications of DFAST are introduced.

Key words Annotation, Bacteria, Archaea, Database

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_13, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

Genome annotation is a fundamental step in genome sequence analysis, through which biological insights are inferred from the sequence data. Therefore, good annotation is essential for the downstream analysis. Another factor to consider is the deposition of newly determined nucleotide sequences into the public sequence database, which is now considered a prerequisite for publication in scientific journals. DDBJ Fast Annotation and Submission Tool (DFAST) is a genome analysis pipeline for prokaryotes that facilitates quick and accurate genome annotation and generation of data files for submission to the DDBJ (DNA Data Bank of Japan), which is one of the members of the International Nucleotide Sequence Database Collaboration [1]. DFAST was originally developed as an online service based on the lightweight annotation pipeline Prokka [2], and equipped with well-curated reference protein databases and an input form to edit the required information for DDBJ. Thereby, even users who are not familiar with bioinformatics can automatically generate data files for DDBJ submission by uploading a genome sequence file to the web server. As of January 2019, more than 20,000 jobs have been processed online since its first launch in 2016.

Recently, we have fully renewed the background annotation engine of DFAST [3]. By employing the ultra-fast sequence aligner GHOSTX [4], the new version can annotate a typical-sized bacterial genome within 5 min, which is approximately ten times faster than the previous version. DFAST also has a new feature to predict pseudogenes that have lost their function due to premature stop codons or frameshifts in a coding region. It is also available for download as a stand-alone tool called DFAST_core, which runs on Linux and Macintosh from the command line. DFAST_core generates the same result as the web version and allows users to customize the workflow freely. It is useful for processing large amounts of data in batches on a local machine or for incorporation into other analytical pipelines.

In this chapter, the annotation workflow of DFAST and its basic usage for both the web and stand-alone versions are explained.

The DFAST workflow consists of three phases: the structural annotation phase for predicting biological features such as protein coding sequences (CDSs), ribonucleic acids (RNAs), and clustered regularly interspaced short palindromic repeats (CRISPRs), the functional annotation phase to infer protein functions of predicted CDSs, and the output phase. DFAST uses MetaGeneAnnotator (MGA) to predict CDSs by default. MGA is a self-training gene prediction tool for all kinds of prokaryotic genes including atypical genes such as horizontally transferred and prophage-encoded genes. It can precisely predict translation starts of genes with a ribosomal binding site detection model [5]. Optionally, Prodigal [6] can be used for CDS prediction.
In the structural annotation phase, gene prediction tools are run in parallel, and partial or overlapping features are cleaned up. The functional annotation phase is processed in the following order:

1. Ortholog assignment (optional, available only in the stand-alone version)
All-against-all pairwise protein alignments are conducted between a query and each reference genome. Orthologous genes are identified based on a reciprocal-best-hit approach. It also conducts self-to-self alignments within a query genome to infer in-paralogous genes.

2. Homology search against the default reference database
DFAST uses GHOSTX as the default aligner, which runs several tens to a hundred times faster than BLASTP with comparable sensitivity when E-values are less than 10^-6. The default reference database is mainly composed of protein sequences from 124 reference prokaryotic genomes from the National Center for Biotechnology Information (NCBI) RefSeq database. In the stand-alone version, additional homology searches can be conducted with users’ custom reference databases.

3. Pseudogene detection
CDSs together with their flanking regions are re-aligned to their subject protein sequences using LAST, which allows frameshift alignment [7]. The query is annotated as a possible pseudogene when stop codons or frameshifts are found in the flanking regions. This also detects specific characteristics of translation such as an opal stop codon (UGA) to incorporate selenocysteine and an amber stop codon (UAG) to incorporate pyrrolysine.

4. Profile hidden Markov model (HMM) database search (optional)
If no matches are found in the previous steps, the profile HMM search against TIGRFAM [8] is conducted. It uses hmmscan of the HMMER software package [9].

5. Assignment of clusters of orthologous groups (COG) functional categories
RPS-BLAST and the rpsbproc utility are used to search against the COG database provided by the NCBI Conserved Domain Database [10, 11].

The outputs of DFAST include standard format files such as FASTA, GenBank, and GFF, and data files for submission to the DDBJ. The stand-alone version can also generate a feature table file (.tbl), which can be used as an input for the tbl2asn utility program to create a data submission file to the GenBank database.
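The reciprocal-best-hit criterion used for ortholog assignment (step 1 of the functional annotation phase) can be illustrated with a minimal sketch. This is not DFAST's implementation: the function names and the score dictionaries are hypothetical, and the sketch assumes that alignment scores for both search directions have already been collected as {(query_id, subject_id): score}.

```python
def best_hits(scores):
    """Map each query ID to its highest-scoring subject ID.

    scores: dict {(query_id, subject_id): alignment_score} (hypothetical input).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(query_vs_ref, ref_vs_query):
    """Return {query_gene: reference_gene} pairs that are each other's best hit."""
    forward = best_hits(query_vs_ref)    # query genome searched against reference
    backward = best_hits(ref_vs_query)   # reference searched against query genome
    return {q: r for q, r in forward.items() if backward.get(r) == q}


if __name__ == "__main__":
    q_vs_r = {("q1", "r1"): 200, ("q1", "r2"): 50, ("q2", "r2"): 80, ("q3", "r1"): 150}
    r_vs_q = {("r1", "q1"): 190, ("r1", "q3"): 140, ("r2", "q2"): 85}
    print(reciprocal_best_hits(q_vs_r, r_vs_q))
```

In this toy input, q3 finds r1 as its best hit, but r1 prefers q1, so q3 is not reported as an ortholog.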

2 Preparing to Use DFAST

2.1 DFAST Web Server

The web version of DFAST is available at https://dfast.nig.ac.jp and can be used freely without any user registration. It can be used by uploading a sequence file in the FASTA format. It is suitable for beginners, and is especially useful for creating files for submission to the DDBJ because it has an on-line editor to enter metadata required for the data submission.

2.2 Installation of the Stand-Alone Version (DFAST_core)

DFAST_core runs on a typical Mac or Linux machine with Python 2.7 or 3.4 (and later). Most of the external binaries are bundled in the software package except for the following binaries and Python modules.

2.2.1 Prerequisites

1. Perl and Java
Some of the external programs called from DFAST depend on Perl or Java. These should work with the pre-installed versions on your system. The “Time::Piece” Perl module may be required for Red Hat/CentOS/Fedora. It can be installed using the following command:

$ sudo yum install perl-Time-Piece

2. Biopython module
This can be installed using the Python package management tool “pip”.

$ (sudo) pip install biopython

3. “futures” and “six” modules (required only on Python 2.7)
DFAST uses the “concurrent.futures” module for multiprocessing and the “six” module for compatibility with Python 2 and 3. To run DFAST under Python 2.7, both modules need to be installed as follows:

$ (sudo) pip install futures six

2.2.2 Installation

The source code and executables for DFAST_core are available at https://github.com/nigyta/dfast_core, and can be obtained by either of the following methods:

1. Via “git” command (recommended)

$ git clone https://github.com/nigyta/dfast_core.git

This creates a directory “dfast_core” into which all the files will be downloaded. Hereinafter, this directory is denoted as “$DFAST_APP_ROOT”. Once the files are obtained, the latest version can be retrieved as follows:

$ git pull

2. Download the distribution
The latest release is available at https://github.com/nigyta/dfast_core/releases. Download and unarchive it as follows:

$ wget https://github.com/nigyta/dfast_core/archive/x.x.x.tar.gz
$ tar xvfz x.x.x.tar.gz

This creates a directory “dfast_core-x.x.x” into which the files will be uncompressed. “x.x.x” represents the version number (as of January 2019, the latest version is v. 1.2.0). Hereinafter, this directory is denoted as “$DFAST_APP_ROOT”.

It is recommended that “$DFAST_APP_ROOT” be added to the “PATH” (replace “$DFAST_APP_ROOT” with the installed directory on your machine):

$ export PATH=$DFAST_APP_ROOT:$PATH

Use the following command to see the usage:

$ dfast -h

or, specifying the Python interpreter explicitly:

$ python $DFAST_APP_ROOT/dfast -h

2.2.3 Initial Setup

Reference databases required for the default workflow can be downloaded using the bundled script with the following commands:

$ cd $DFAST_APP_ROOT
$ python scripts/file_downloader.py --protein dfast
$ python scripts/file_downloader.py --cdd Cog --hmm TIGR

The former python command downloads the default reference protein database of DFAST, and the latter downloads the COG database for RPS-BLAST and the TIGRFAM database for hmmscan. Database indexing is performed automatically after the files are downloaded.

2.2.4 Test Run

The following command will invoke the default workflow of DFAST_core using a test file.

$ dfast --genome example/test.genome.fna

Usually, execution of the aforementioned command completes within a minute. If it does not execute properly, please make sure that the required modules are installed and the default databases have been set up. The result will be generated in the directory “OUT”.
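If the test run fails, a quick check of the Python prerequisites listed in Subheading 2.2.1 can help narrow down the cause. The following is a minimal sketch, not part of DFAST_core; the helper names are hypothetical, and it only tests whether the modules named in the prerequisites (“Bio” for Biopython, plus “concurrent.futures” and “six” on Python 2) can be imported by the current interpreter.

```python
import importlib
import sys


def required_modules(python_major):
    """Modules named in the prerequisites for the given Python major version."""
    modules = ["Bio"]  # Biopython is imported under the name "Bio"
    if python_major == 2:
        # Needed only on Python 2.7 according to the prerequisites.
        modules += ["concurrent.futures", "six"]
    return modules


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(required_modules(sys.version_info[0]))
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All required Python modules were found.")
```

This does not check Perl, Java, or the reference databases, which need to be verified separately.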

3 Generating a Genome Annotation and Data Submission Files with DFAST

In this section, the basic usage of both the web and the stand-alone versions is introduced. In particular, we describe how to create data submission files for DDBJ using the web version, and give several examples of useful options for customizing the workflow in the stand-alone version.

3.1 How to Use the Web Version

3.1.1 Submit a Job

1. Visit the website (https://dfast.nig.ac.jp), and navigate to the job submission page (i.e., click “Start your Project!”; Fig. 1).

2. Upload a genome sequence file in the FASTA format, which is the only required input. The input file can either be a single- or multi-FASTA, and the file size is limited to 15 MB. The sequence may contain canonical nucleotides (“ACGT” or “acgt”) as well as gaps represented by repeats of “N” or “n”; ambiguous nucleotides, however, such as “R” (A or G) and “W” (A or T) are not allowed. For testing the web service, the user can click on “Run in demo mode”, in which case the Escherichia coli O26 genome is used as example data.

3. Optionally, a job title and an email address can be specified. When the job is completed, a notification will be sent to the provided email address.

4. Several options are available to customize the workflow. For example, Prodigal and tRNAscan-SE can be chosen to predict CDSs and tRNAs, respectively, instead of the default prediction tool. Furthermore, an additional reference database can also be specified. Currently, we have manually curated reference databases for lactic acid bacteria, Escherichia coli, cyanobacteria, and Bifidobacterium. When the additional database is specified, the homology search is conducted against it first. If no significant matches are found in the first search, the query will be subjected to a subsequent homology search against the default database. We plan to expand the scope of the organism-specific databases to cover more diverse microbes.

5. Press the “Run” button to submit the job.

Fig. 1 Screenshot of the job submission page of the DFAST web server (https://dfast.nig.ac.jp/dfc/)
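The input constraints described in step 2 can be checked locally before uploading. The sketch below is an illustration only, not the validation performed by the web server; the function name is hypothetical, and it assumes that “15 MB” means 15 × 1024 × 1024 bytes and that only “ACGT”, “acgt”, “N”, and “n” are accepted in sequence lines.

```python
ALLOWED = set("ACGTNacgtn")
MAX_BYTES = 15 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def check_fasta(text):
    """Return a list of problems found in FASTA-formatted text (empty if none)."""
    problems = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("input exceeds the 15 MB size limit")
    seen_header = False
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            seen_header = True
            continue
        if not seen_header:
            problems.append("line %d: sequence data before the first '>' header" % lineno)
        bad = sorted(set(line) - ALLOWED)
        if bad:
            problems.append("line %d: disallowed characters %s" % (lineno, "".join(bad)))
    if not seen_header:
        problems.append("no FASTA header found")
    return problems


if __name__ == "__main__":
    print(check_fasta(">contig1\nACGTNNNNacgt\n>contig2\nACRGW\n"))
```

Ambiguous nucleotides such as “R” and “W” are reported with the line number on which they occur, so they can be replaced (for example with “N”) before submission.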


Fig. 2 Screenshot of the job result page

When the job has been submitted correctly, the user is redirected to a job details page. A unique identifier is assigned to each job, and only those who know the identifier can access the job result.

3.1.2 Check the Job Result

The job result page contains four panels: Result, Features, DDBJ Submission, and Log (Fig. 2). In the Result panel, statistics such as total sequence length and number of predicted CDSs are shown, and the results can be downloaded in several formats. In the Features panel, all the annotated genes are shown in a tabular format. Here, the user can check the gene sequences and follow external links to the NCBI BLAST web service. The user can edit a gene product name and symbol, and add a note to each predicted gene.

3.1.3 How to Create DDBJ Submission Files

If the user wants to submit the annotated genome to the DDBJ, data files for submission to the DDBJ can be created following the guidelines shown in the “DDBJ Submission” panel. After providing the metadata required for submission in the input form, the user can download the data files to send to the DDBJ. The user needs to register to the BioProject and BioSample databases in advance; this can be achieved through the DDBJ submission portal D-way (https://trace.ddbj.nig.ac.jp/D-way/).

3.2 How to Use the Stand-Alone Version

3.2.1 Basic Usage

The minimum required input is a genome sequence file in the FASTA format:

$ dfast --genome path/to/your_genome.fna

This launches the default annotation workflow for the given query file and creates an output directory “OUT” in the current directory, which contains the annotation results in standard formats such as GenBank, GFF, and FASTA. The following options are available to tweak the workflow:

$ dfast --genome your_genome.fna --organism "Escherichia coli" --strain "str. xxx" --minimum_length 200 --use_prodigal --aligner blastp --no_cdd --no_hmm --out your_result

In this customized workflow, Prodigal is used to predict CDSs (--use_prodigal), BLASTP is used for homology search against the protein database (--aligner blastp), and the time-consuming RPS-BLAST and hmmscan steps are skipped (--no_cdd and --no_hmm). Sequences shorter than 200 bp in an input file are excluded from the result (--minimum_length 200; of note, GenBank only accepts sequences that are at least 200 bp in length). The result will be generated in the “your_result” directory. The organism name and the strain name are shown in the result files; however, they do not affect the result of annotation.

3.2.2 Advanced Usage

The annotation result greatly depends on the reference database used. The DFAST default reference database mainly consists of protein sequences obtained from well-characterized reference strains, and covers most major lineages. However, it is recommended to use additional reference databases for more precise annotation. Several tips for using the reference databases are introduced below:

1. How to use an additional reference database
We provide manually curated reference databases for groups of specific organisms. These can be downloaded by using the script “file_downloader.py”, and can be used in the workflow. The following shows an example of using a custom database for Escherichia coli. Use the following command to download the database:

$ python $DFAST_APP_ROOT/scripts/file_downloader.py --protein ecoli

The database file is downloaded to the directory “$DFAST_APP_ROOT/db/protein/DFAST-ECOLI”.

Then, use the “--database” option to specify an additional database as follows:

$ dfast --genome your_genome.fna --database $DFAST_APP_ROOT/db/protein/DFAST-ECOLI

After CDSs are predicted, they are searched against the specified database, and then against the default database if no hits are found in the first database. To see the list of available databases:

$ python $DFAST_APP_ROOT/scripts/file_downloader.py -h

2. OrthoSearch (orthologous gene assignment)
“OrthoSearch” identifies orthologous genes based on the reciprocal-best-hit approach from all-against-all pairwise protein sequence alignments between a query and each reference genome. It is useful to transfer annotation from a close relative when its annotated genome is available as a reference. Here, an example to download the reference genome from the NCBI Assembly Database and use it in OrthoSearch is shown. To obtain the reference:

$ python scripts/file_downloader.py --assembly GCF_000203855.3

“GCF_000203855.3” is an accession number for the complete genome of Lactobacillus plantarum WCFS1, one of the reference strains of Lactobacillus. The user can search the accession numbers at the NCBI website (https://www.ncbi.nlm.nih.gov/assembly/). The reference file (in GenBank format) is downloaded as “GCF_000203855.3.gbk” in the current directory.

Then, use the “--references” option to enable the OrthoSearch as follows:

$ dfast --genome your_genome.fna --references GCF_000203855.3.gbk

More reference files can be specified separated by comma (no spaces).

$ dfast --genome your_genome.fna --references GCF_000203855.3.gbk,GCF_000014525.1.gbk

In this case, the highest-scoring hit is used from the all-against-all alignments against each of the reference genomes.


3. Homology search against a large-scale database
When annotating a genome from less well-characterized species whose close relatives are not present in the default database, considerable numbers of genes may remain to be functionally annotated. In such a case, it is effective to use a large-scale reference database. However, the default aligner (GHOSTX) is memory intensive despite its speed, and therefore it may not be feasible to conduct a homology search using GHOSTX against a large-sized reference database on an ordinary computer. “BLASTsearch” is a specialized function for homology search using BLASTP against a large-scale reference database such as pre-formatted BLAST databases for RefSeq and SwissProt available at the NCBI FTP server. The method of conducting the BLASTsearch against the SwissProt database is described below.

(a) Download the BLAST database from the FTP server

$ wget ftp://ftp.ncbi.nlm.nih.gov//blast/db/swissprot.tar.gz

The downloaded file can be placed at any location. Then, unarchive it:

$ tar xvfz swissprot.tar.gz

(b) Create a custom configuration file
There is no command-line option for BLASTsearch. To enable it, the user needs to create a custom configuration file. First, copy the default configuration file:

$ cp $DFAST_APP_ROOT/dfc/default_config.py your_config.py

Edit the BLASTsearch section in the configuration file as below:

{
    "component_name": "BlastSearch",
    "enabled": True,
    "options": {
        # "cpu": 2,  # Uncomment this to set the component-specific number of CPUs.
        "skipAnnotatedFeatures": False,
        "evalue_cutoff": 1e-6,
        "qcov_cutoff": 75,
        "scov_cutoff": 75,
        "aligner": "blastp",  # Must be blastp
        "aligner_options": {},
        "dbtype": "auto",  # Must be either of auto/ncbi/uniprot/plain
        "database": "/path/to/swissprot",
    },
},


Set “enabled” to “True” and specify the path to the downloaded database file in “database”.

(c) Run DFAST by specifying the custom configuration file

$ dfast --genome your_genome.fa --config your_config.py

BLASTsearch will be executed prior to the homology search against the default database.

3.2.3 More Advanced Usage

In addition to the options described above, more advanced options are available for expert users, as shown below. For more details, please refer to the help documents at https://github.com/nigyta/dfast_core/tree/master/docs.

1. Custom reference database
Users can create their own reference database. The DFAST format for the reference sequences is a tab-separated table containing entries such as gene identifiers, gene product names, and sequences. A helper script is also available to convert a FASTA-formatted sequence file into the DFAST format reference file and to index the database file.

2. Custom workflow
Users can create their own configuration file as shown in the example of BLASTsearch. All the settings of DFAST are defined in the configuration file, allowing a flexible workflow, in which users can choose gene prediction tools and reference databases according to their choice.

3. INSDC submission
Similar to the web version, DFAST_core can create data files for submission to the DDBJ. This is recommended for users who need to process large amounts of genome data through command-line operations. DFAST_core partly supports data submission to GenBank. The following shows a full example of how to create DDBJ submission files using sample data. The “--metadata_file” option specifies a text file containing the metadata required to submit the data.

$ dfast --genome $DFAST_APP_ROOT/example/sample.lactobacillus.fna --complete t --organism "Lactobacillus hokkaidonensis" --strain LOOC260 --seq_names "Chromosome,pXXXX,pYYYY" --seq_topologies c,c,l --seq_types c,p,p --additional_modifiers "culture_collection=JCM:18461; isolation_source=silage; note=You can add a comment here; collection_date=2017-06-26" --metadata_file $DFAST_APP_ROOT/example/sample.metadata.txt --locus_tag_prefix LH260 --step 10 --use_separate_tags t --out LHLOOC
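In the command above, “--seq_names”, “--seq_topologies”, and “--seq_types” each carry a comma-separated list. When many genomes are processed in a script, it can be worth confirming that the three lists line up before launching DFAST. The sketch below is a hypothetical pre-flight check, not part of DFAST_core; it assumes that each option is meant to hold exactly one entry per sequence, in the same order, and makes no assumption about the meaning of the individual codes.

```python
def pair_sequence_metadata(seq_names, seq_topologies, seq_types):
    """Split the three comma-separated option values and pair them per sequence.

    Raises ValueError if the lists differ in length (assumed to be a mistake).
    """
    names = seq_names.split(",")
    topologies = seq_topologies.split(",")
    types = seq_types.split(",")
    if not (len(names) == len(topologies) == len(types)):
        raise ValueError(
            "option lists differ in length: %d names, %d topologies, %d types"
            % (len(names), len(topologies), len(types))
        )
    return list(zip(names, topologies, types))


if __name__ == "__main__":
    # Values taken from the example command in the text.
    for row in pair_sequence_metadata("Chromosome,pXXXX,pYYYY", "c,c,l", "c,p,p"):
        print(row)
```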


References

1. Tanizawa Y, Fujisawa T, Kaminuma E et al (2016) DFAST and DAGA: web-based integrated genome annotation tools and resources. Biosci Microbiota Food Health 35:173–184. https://doi.org/10.12938/bmfh.16-003
2. Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069. https://doi.org/10.1093/bioinformatics/btu153
3. Tanizawa Y, Fujisawa T, Nakamura Y (2018) DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication. Bioinformatics 34:1037–1039. https://doi.org/10.1093/bioinformatics/btx713
4. Suzuki S, Kakuta M, Ishida T, Akiyama Y (2014) GHOSTX: an improved sequence homology search algorithm using a query suffix array and a database suffix array. PLoS One 9:e103833. https://doi.org/10.1371/journal.pone.0103833
5. Noguchi H, Taniguchi T, Itoh T (2008) MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res 15:387–396. https://doi.org/10.1093/dnares/dsn027
6. Hyatt D, Chen G-L, Locascio PF et al (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119. https://doi.org/10.1186/1471-2105-11-119
7. Sheetlin SL, Park Y, Frith MC, Spouge JL (2014) Frameshift alignment: statistics and post-genomic applications. Bioinformatics 30:3575–3582. https://doi.org/10.1093/bioinformatics/btu576
8. Haft DH, Selengut JD, White O (2003) The TIGRFAM database of protein families. Nucleic Acids Res 31:371–373. https://doi.org/10.1093/nar/gkg128
9. Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195. https://doi.org/10.1371/journal.pcbi.1002195
10. Galperin MY, Makarova KS, Wolf YI, Koonin EV (2015) Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res 43:D261–D269. https://doi.org/10.1093/nar/gku1223
11. Marchler-Bauer A, Bo Y, Han L et al (2017) CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res 45:D200–D203. https://doi.org/10.1093/nar/gkw1129

Chapter 14

BUSCO: Assessing Genome Assembly and Annotation Completeness

Mathieu Seppey, Mosè Manni, and Evgeny M. Zdobnov

Abstract

Genomics drives the current progress in molecular biology, generating unprecedented volumes of data. The scientific value of these sequences depends on the ability to evaluate their completeness using a biologically meaningful approach. Here, we describe the use of the BUSCO tool suite to assess the completeness of genomes, gene sets, and transcriptomes, using their gene content as a complementary method to common technical metrics. The chapter introduces the concept of universal single-copy genes, which underlies the BUSCO methodology, covers the basic requirements to set up the tool, and provides guidelines to properly design the analyses, run the assessments, and interpret and utilize the results.

Key words BUSCO, Orthologs, Genome completeness, Quality assessment, Gene content, Phylogenomics

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_14, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Completeness Assessment

The ever-increasing volumes of sequencing data play a crucial role in advancing biological research. However, comprehensive and unbiased genomic analyses rely on the quality and completeness of such resources. This makes thorough quality control of sequencing data “products” such as genomes, genes, or transcriptomes ever more important. The assembly of reads from high-throughput sequencing technologies is a challenging procedure both theoretically and practically, especially for large genomes. Fast and accurate quality assessment of the resulting assemblies allows researchers to iteratively tweak workflows and parameters to achieve the best results. However, such evaluations are often complicated, and pose challenges for the scalability of methods, as well as for the interpretation and presentation of results.

When assessing the quality and completeness of an assembly, different complementary metrics can be used. A first step for identifying potential problems in sequencing data that are likely to hamper the quality of the assembly is to analyze the k-mer distributions of reads, from which the researcher can estimate sequencing bias, coverage depth, repeat content, and heterozygosity of the sample. Tools such as GenomeScope [1] and KmerGenie [2] provide an easy solution for obtaining various statistics from the k-mer distributions of short reads. With a first draft assembly in hand, one can compare its size with the expected genome size estimated experimentally, for example by flow cytometry, or computationally, by analyzing the k-mer distributions. Metrics such as fragment (contig/scaffold) length distributions and contig/scaffold N50, which reflect the contiguity of the assembly, are informative measures but can also be misleading. The N50 value indicates that half the genome is assembled on contigs/scaffolds of length N50 or longer. Novel metrics that provide more realistic estimates of genome fragmentation have been proposed; for example, REAPR [3] provides a “corrected” N50 metric based on the identification of assembly errors. By calculating the fraction of reads that map onto the assembly, one can measure how well the original reads are represented in the genome, and the analysis of read depth can be used to identify assembly “artifacts” such as duplicated or collapsed regions.

Although the above-mentioned technical statistics are essential to estimate the overall completeness of an assembly, including intergenic regions, such measures ignore biological aspects and the important question of genomic data quality in terms of completeness of gene content. This is a crucial consideration that also affects data interpretation and helps to guide improved assembly and annotation strategies. In cases where extensive transcriptomic resources (EST, RNA-seq) for the species of interest are available, one could assess the comprehensiveness of the gene set by aligning transcripts to the assembly to obtain the proportion of mapping transcripts.
However, aligning spliced transcripts to corresponding genomic loci can be problematic, and results highly depend on the tools and parameters used for mapping.

An attractive alternative, which complements the strategies described above, is to quantify the completeness of genomic data sets in terms of the expected gene content based on evolutionary principles. OrthoDB’s sets of Benchmarking Universal Single-Copy Orthologs, BUSCO (http://busco.ezlab.org/), provide a rich source of data to assess the quality and completeness of genome assemblies, gene annotations, and transcriptomes. BUSCO quality assessment facilitates informative comparisons, for example of newly sequenced draft genome assemblies to those of model species, or can be used to quantify iterative improvements to assemblies or annotations [4, 5]. BUSCO has rapidly become well-established as an essential genomics tool, using up-to-date data from many species present in OrthoDB, and with broader utilities than the once popular, but discontinued, Core Eukaryotic Genes Mapping Approach (CEGMA) [6].
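The N50 definition given earlier in this section can be made concrete with a short sketch. This is a generic illustration, not code from BUSCO or any assembler; the function name is hypothetical, and the total assembly length is used as the reference, as in the definition above.

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths.

    N50 is the length L such that contigs of length L or longer
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer comparison avoids rounding issues
            return length


if __name__ == "__main__":
    # Total = 54; the three longest contigs (10 + 9 + 8 = 27) reach half of it.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```

The example also shows why N50 alone can mislead: a single very long but misassembled scaffold raises the value without the assembly being any more correct.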

2 Universal Single-Copy Genes

Protein coding genes that make up the BUSCO data sets are evolving under “single-copy control” [7], and are selected from OrthoDB [8] orthologous groups that contain genes present as single-copy orthologs in at least 90% of the species included in the group (Fig. 1). While allowing for rare gene duplications or losses, this establishes an evolutionarily informed expectation that these genes should be found as single-copy orthologs in any newly sequenced genome or gene set from that group. Therefore, if there are many BUSCO genes from the appropriate clade that cannot be identified in a genome assembly or annotated gene set, it is possible that the sequencing and/or assembly and/or annotation approaches have failed to fully capture the complete expected gene content. Lineage assessments are available for vertebrates, arthropods, fungi, nematodes, plants, protists (see Subheading 3.2 and Table 1), and prokaryotes.

Fig. 1 BUSCO genes are selected among groups of orthologs matching specific evolutionary expectations. A BUSCO group has to encompass at least 90% of the species within its corresponding lineage, therefore showing a high universality, and be maintained as single-copy in at least 90% of the species to fulfill the low duplicability requirement. To illustrate this, the proportions of orthologous groups in mouse (Mus musculus), fly (Drosophila melanogaster), and yeast (Saccharomyces cerevisiae) are depicted according to the presence of orthologs in the other species within their respective lineage (pie charts: vertebrata, arthropoda, fungi) and the proportion of single-copy predominance in universal groups. The BUSCO sampling space is restricted to those passing both 90% thresholds. Adapted from [7]
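The two 90% thresholds described in Fig. 1 can be expressed as a small filter over orthologous groups. The sketch below is only an illustration of the stated criteria, not the OrthoDB/BUSCO selection code; the function name and input layout are hypothetical. It assumes that universality is measured over all species in the lineage, and that the single-copy fraction is measured over the species in which the group is present.

```python
def is_busco_candidate(copy_counts, min_percent=90):
    """Decide whether an orthologous group passes both thresholds.

    copy_counts: dict {species: number of genes of that species in the group},
                 with 0 for species lacking the gene (hypothetical layout).
    """
    n_species = len(copy_counts)
    if n_species == 0:
        return False
    present = sum(1 for count in copy_counts.values() if count >= 1)
    single = sum(1 for count in copy_counts.values() if count == 1)
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    universal = present * 100 >= min_percent * n_species
    low_duplicability = present > 0 and single * 100 >= min_percent * present
    return universal and low_duplicability


if __name__ == "__main__":
    group = {"sp%d" % i: 1 for i in range(9)}
    group["sp9"] = 0  # lost in one of ten species
    print(is_busco_candidate(group))
```

A group lost in one of ten species still passes, whereas a group duplicated in two of ten species does not.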


Table 1 Detailed list of all BUSCO eukaryotic datasets available with BUSCO v3

Name                              Taxonomic rank  Number of genes  Consensus of n OrthoDB species
eukaryota_odb9                    Domain          303              65
metazoa_odb9                      Kingdom         978              65
nematoda_odb9                     Phylum          982              8
arthropoda_odb9                   Phylum          1066             60
insecta_odb9                      Class           1658             42
endopterygota_odb9                                2442             35
diptera_odb9                      Order           2799             25
hymenoptera_odb9                  Order           4415             25
vertebrata_odb9                                   2586             65
actinopterygii_odb9               Superclass      4584             20
tetrapoda_odb9                                    3950             55
aves_odb9                         Class           4915             40
mammalia_odb9                     Class           4104             50
euarchontoglires_odb9             Superorder      6192             25
laurasiatheria_odb9               Superorder      6253             25
fungi_odb9                        Kingdom         290              85
microsporidia_odb9                Phylum          518              14
dikarya_odb9                      Subkingdom      1312             75
ascomycota_odb9                   Phylum          1315             75
saccharomyceta_odb9                               1759             70
pezizomycotina_odb9               Subphylum       3156             50
sordariomyceta_odb9                               3725             30
eurotiomycetes_odb9               Class           4046             25
saccharomycetales_odb9            Order           1711             30
basidiomycota_odb9                Phylum          1335             25
embryophyta_odb9                                  1440             20
alveolata_stramenophiles_ensembl                  234              24
protists_ensembl                                  215              33

The taxonomic ranks match the NCBI taxonomy browser classification (https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)


BUSCO uses sequence profiles embedded in lineage-specific datasets to assess the orthology status of predicted genes in the species under analysis. These consensus sequences are derived from Hidden Markov Model (HMM) profiles built from multiple sequence alignments of orthologs selected from OrthoDB and capture the conserved alignable amino acids across the species set, reducing any potential species bias that would result from pairwise alignments toward original sequences. Currently available lineages have been selected based on their taxonomic range and coverage in terms of the numbers of available sequenced and annotated genomes. As more species are sequenced and included in OrthoDB, we will update the BUSCO assessment datasets and provide sets for new lineages.

3 The BUSCO Software

The BUSCO assessment procedure (Fig. 2) is implemented as a Python 3 (https://www.python.org/) package built upon several third-party tools, each performing one step of the global analysis to characterize BUSCO orthologs. The kind of input sequence (a genome assembly, an annotated gene set, or a transcriptome assembly) defines which of these steps need to be executed, namely (1) locate candidate regions using local alignment against amino acid consensus sequences, (2) extract gene models from these regions based on block profiles, and (3) score the candidate genes against the profile Hidden Markov Model (HMM) of their corresponding BUSCO genes. The tools are pipelined within three assessment modes offered to the user. The methods and examples presented hereafter are based on version 3.0.x of the BUSCO open source software.

Fig. 2 Description of the BUSCO workflow for the three types of sequence input, genome (a), gene set (b), and transcriptome (c). The same dataset is used in all modes, although not all information embedded is utilized in each situation. The genome mode includes two phases in which the three main steps are run, with the second pass only targeting the missing and fragmented BUSCO genes using additional consensus sequences and retrained AUGUSTUS parameters

Mathieu Seppey et al.

3.1 Setup

3.1.1 BUSCO Python Package

The BUSCO sources are hosted on https://gitlab.com/ezlab/busco/ where they can be downloaded or cloned using a git client. A mock input (sample_data/target.fa), an example lineage (sample_data/example), and the corresponding BUSCO genome evaluation results are available at the root of the repository and can be used to test and validate the installation of each component required to run a complete analysis. The Python package needs to be installed by calling the script setup.py. All Python 3 versions are supported. In the main folder, one of the following commands has to be executed:

    sudo python setup.py install    # system wide installation
    python setup.py install --user  # current user only

The user is encouraged to pay attention to which Python version is used when calling setup.py, to ensure that the same version will be used when running the analysis.

3.1.2 Configuration File

BUSCO uses the configparser module from the Python standard library to manage its configuration in a dedicated file organized in sections. The first section contains all parameters controlling the BUSCO run, while additional sections locate executables which are part of external tools. Below is an extract of the file config.ini.default, a self-documented default configuration present in the config/ folder, which users can copy and adapt to their own needs.

    [busco]
    debug = True
    gzip = False
    [tblastn]
    # section describing the “tblastn” executable
    path = /usr/bin/
    # do not append the executable to the path

The path to the configuration file has to be declared in the $BUSCO_CONFIG_FILE environment variable; otherwise, the file has to be placed at the default location inside the BUSCO directory: config/config.ini.
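The lookup described above can be illustrated with a minimal Python sketch. It is not BUSCO's own code: the function names are hypothetical, and giving the environment variable precedence over the default location is an assumption made for the illustration.

```python
import configparser
import os


def find_busco_config(busco_dir, environ=None):
    """Return $BUSCO_CONFIG_FILE if declared, else <busco_dir>/config/config.ini.

    Assumption: the environment variable takes precedence over the default file.
    """
    environ = os.environ if environ is None else environ
    declared = environ.get("BUSCO_CONFIG_FILE")
    if declared:
        return declared
    return os.path.join(busco_dir, "config", "config.ini")


def load_busco_config(path):
    """Parse the sectioned configuration file with the standard configparser."""
    parser = configparser.ConfigParser()
    with open(path) as handle:
        parser.read_file(handle)
    return parser
```

With such a helper, `load_busco_config(find_busco_config("busco_folder")).get("tblastn", "path")` would return the folder declared for the tblastn executable.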


3.1.3 Third Party Software

Each of the following tools has to be installed prior to running BUSCO in an assessment mode that requires it. The path to each executable has to be declared in the configuration file. It is recommended that the user makes sure that each of the software packages works independently before attempting to run any assessments with BUSCO. The minimal version that is required is mentioned for each tool, and the user can obtain information about future version compatibility on the BUSCO website and on the web pages of each tool.

TBLASTN

The genome and transcriptome modes require a translated BLAST search (TBLASTN) [9]. BUSCO uses the NCBI implementation available in the BLAST+ suite from version 2.2.x (see Note 1). It can be downloaded from https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/, and release notes are available on https://www.ncbi.nlm.nih.gov/books/NBK131777/. Two executables have to be declared in the configuration file: makeblastdb and tblastn.

AUGUSTUS Gene Predictor

AUGUSTUS, a tool for predicting genes in eukaryotic genomic sequences [10], is needed by the genome assessment mode. BUSCO supports versions 3.2.1 or higher and the software can be obtained from http://bioinf.uni-greifswald.de/augustus/. As it includes multiple Perl scripts (https://www.perl.org/), the user should refer to the AUGUSTUS documentation for its Perl requirements. The executables that have to be declared in the configuration file are augustus, etraining, gff2gbSmallDNA.pl, new_species.pl, and optimize_augustus.pl. These entries are not sufficient and additional environment variables have to be set as follows (see Note 2):

    export PATH=/path/augustus-3.x.x/bin:$PATH
    export PATH=/path/augustus-3.x.x/scripts:$PATH
    export AUGUSTUS_CONFIG_PATH=/path/augustus-3.x.x/config/

AUGUSTUS makes its predictions based on parameters that are species-specific. It comes with predefined values corresponding to well annotated species. While BUSCO automatically preselects the most appropriate species to be used for each analysis, it is worth mentioning that these are listed in $AUGUSTUS_CONFIG_PATH/species/ and it is possible for the user to indicate a different species (see Subheading 4.2).

HMMER

To evaluate amino acid sequences using profile HMMs, all modes of BUSCO require HMMER, version 3.1b2 or higher [11], which can be obtained from http://hmmer.org/. The only executable that has to be declared in the configuration file is hmmsearch.
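Since each tool should work independently before BUSCO is run, a small pre-flight check can be helpful. The sketch below is only an illustration: the mode labels and the function name are hypothetical, and it searches the PATH rather than reading the folders declared in the BUSCO configuration file.

```python
import shutil

# Executables named in this section, grouped by the tool they belong to.
EXECUTABLES = {
    "tblastn": ["makeblastdb", "tblastn"],
    "augustus": ["augustus", "etraining", "gff2gbSmallDNA.pl",
                 "new_species.pl", "optimize_augustus.pl"],
    "hmmer": ["hmmsearch"],
}

# Tools required by each assessment mode, as stated in the text.
TOOLS_BY_MODE = {
    "genome": ["tblastn", "augustus", "hmmer"],
    "transcriptome": ["tblastn", "hmmer"],
    "protein": ["hmmer"],
}


def missing_executables(mode, which=shutil.which):
    """List the executables required by `mode` that `which` cannot locate."""
    needed = [exe for tool in TOOLS_BY_MODE[mode] for exe in EXECUTABLES[tool]]
    return [exe for exe in needed if which(exe) is None]
```

An empty list returned by `missing_executables("genome")` means that all eight executables were found on the PATH; it does not verify their versions.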

3.2 BUSCO Datasets

Analyses are based on features describing BUSCO genes that were carefully selected (Fig. 1). They are organized in datasets corresponding to specific lineages [5]. Version 3 of BUSCO comes with 28 eukaryotic datasets (Table 1) representing major groups, which can be downloaded from the BUSCO website along with 16 prokaryotic sets. They are identified by a name and a version, e.g., “eukaryota”_“odb9”. Each BUSCO gene, with its unique identifier, e.g., EOG090C01CE (see Note 3), is represented by different parameters found in different files. The content of a standard BUSCO dataset is the following:

– hmms/: contains one profile HMM file for each BUSCO gene. Required by HMMER.
– info/: contains the list of species used to create the set and additional information.
– prfl/: contains one block profile file for each BUSCO gene. Required by AUGUSTUS.
– ancestral: a FASTA file for each BUSCO gene, which contains a consensus of the extant sequences. Required by TBLASTN.
– ancestral_variants: a FASTA file for each BUSCO gene, which contains a consensus and variants of the extant sequences. Required by TBLASTN.
– dataset.cfg: configuration and information about the dataset, including the default species used by AUGUSTUS among those provided with the tool, which corresponds to the most appropriate for the majority of species within the lineage, e.g., “fly” for the Insecta dataset.
– scores_cutoff: minimal HMMER scores to reach for each gene to be considered as orthologous to BUSCO genes and classified as found.
– lengths_cutoff: minimal length values for BUSCO gene matches to be called complete.
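The dataset layout listed above can be checked after download with a few lines of Python. This is a sketch under one assumption: entries written with a trailing slash are treated as folders and the others as plain files. The function name is hypothetical.

```python
import os

# Entries of a standard BUSCO dataset, as listed in the text.
EXPECTED_DIRS = ["hmms", "info", "prfl"]
EXPECTED_FILES = ["ancestral", "ancestral_variants", "dataset.cfg",
                  "scores_cutoff", "lengths_cutoff"]


def missing_dataset_entries(dataset_dir):
    """Return the expected folders and files that are absent from dataset_dir."""
    missing = [name + "/" for name in EXPECTED_DIRS
               if not os.path.isdir(os.path.join(dataset_dir, name))]
    missing += [name for name in EXPECTED_FILES
                if not os.path.isfile(os.path.join(dataset_dir, name))]
    return missing
```

For example, `missing_dataset_entries("lineages/NAME_OF_LINEAGE")` returns an empty list when the folder holds all eight entries.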

3.3 Running BUSCO

3.3.1 Genome

When the input to be evaluated is a genome assembly, i.e. nucleotide sequences in the forms of contigs, scaffolds, or chromosomes, the genome mode has to be selected (Fig. 2a). It runs two phases composed of three steps each. In the beginning of the first phase, TBLASTN is run taking BUSCO amino acid consensus sequences as queries and the input genomic sequences as database. The goal is to identify the subset of sequences in this genome that are most likely to contain matches for each BUSCO gene. Second, AUGUSTUS is run to delineate precise gene structures on these regions, from which a protein sequence is extracted. Finally, HMMER is run to assign a score to the candidate amino acid sequence before the BUSCO algorithm proceeds to a preliminary classification. The second phase of BUSCO genome mode involves a retraining step, which produces a better set of parameters for AUGUSTUS, inferred from the single-copy BUSCO genes found to be complete during the first phase. The rest of the run focuses on finding the missing BUSCO genes with a TBLASTN step based on additional variants of the amino acid consensus, followed by an AUGUSTUS step using the retrained parameters, and a new HMMER run to obtain a final classification. The BUSCO genome mode is run as follows:

    python busco_folder/scripts/run_BUSCO.py -i SEQUENCE_FILE.fna -o OUTPUT_NAME -l lineages/NAME_OF_LINEAGE -m geno

Every BUSCO mode prints a score on stdout and produces a comprehensive output folder named run_OUTPUT_NAME. The following files and folders are found in a BUSCO genome run output:

– short_summary_OUTPUT_NAME.txt: a text file that contains the final BUSCO score and a summary of the parameters that were used.

    # The lineage dataset is: NAME_OF_LINEAGE
    # BUSCO was run in mode: genome
    C:80.0%[S:80.0%,D:0.0%],F:0.0%,M:20.0%,n:10
    8 Complete BUSCOs (C)
    8 Complete and single-copy BUSCOs (S)
    0 Complete and duplicated BUSCOs (D)
    0 Fragmented BUSCOs (F)
    2 Missing BUSCOs (M)
    10 Total BUSCO groups searched

– full_table_OUTPUT_NAME.tsv: the detailed list of all BUSCO genes and their predicted status in the genome.

    #Busco id    Status    Contig  Start  End   Score  Length
    EOG09000001  Complete  sample  3018   3142  320    193
    EOG09000002  Complete  sample  3164   4762  872    443

– missing_buscos_list_OUTPUT_NAME.tsv: the list of missing BUSCO gene identifiers.
– blast_output/: contains the raw output of the two TBLASTN runs and the corresponding coordinates as defined by BUSCO to represent candidate regions.
– augustus_output/: contains a log file dedicated to AUGUSTUS, the list of single-copy genes that were used to retrain AUGUSTUS, and, in the subfolder predicted_genes/, one gene model for each candidate region evaluated, named after the BUSCO block profile used, e.g., EOG09000001.out.1. In the subfolder extracted_proteins/, there are one nucleotide and one amino acid sequence for each gene model, e.g., EOG09000001.fna.1 and EOG09000001.faa.1. Note that the two previously mentioned subfolders represent all candidates, including those that were not retained as positive matches in the end of the analysis, therefore containing irrelevant material that is not listed in the final full table file. Consequently, the user should be cautious when considering their content as meaningful biological sequences and refer to the coordinates and identifiers in the full table file. The retraining parameters produced by BUSCO to be used by the second AUGUSTUS run are stored in the subfolder retraining_parameters/ (see Note 4). Finally, several intermediate files produced during the analysis in GenBank and General Feature Format (GFF) can be found in the remaining folders.
– hmmer_output/: contains a tabular format of each HMMER output, one for each candidate protein evaluated, named after the BUSCO profile HMM used, e.g., EOG09000001.out.1. These represent all candidates that were and were not retained as positive matches in the end of the analysis.
– single_copy_busco_sequences/: contains the nucleotide and amino acid sequences of all BUSCO genes that were found complete and not duplicated during the first phase of the BUSCO genome analysis. They are the genes used to train custom AUGUSTUS gene models. To access the genes that were found in the second phase, or found duplicated and fragmented during both phases, the user will have to manually extract the sequences from the other folders, according to the coordinates and identifiers in the full table file.

3.3.2 Annotated Gene Set

When the input to be evaluated is an annotated gene set in the form of amino acid sequences, the protein mode has to be selected (Fig. 2b). It consists of a single assessment of every sequence against every BUSCO profile HMM followed by the classification. Annotated gene sets usually contain protein isoforms that are relevant and therefore kept in the final result. However, in order to properly evaluate the amount of BUSCO gene duplications (which can be technical artifacts or true duplications), isoforms should be removed before any BUSCO assessment. The BUSCO protein mode is run as follows:

    python busco_folder/scripts/run_BUSCO.py -i SEQUENCE_FILE.faa -o OUTPUT_NAME -l lineages/NAME_OF_LINEAGE -m prot


The BUSCO protein run output folder contains a short_summary_OUTPUT_NAME.txt and a missing_buscos_list_OUTPUT_NAME.tsv file identical to those of the genome mode. The full_table_OUTPUT_NAME.txt file is slightly different, having the identifier of the sequence and no start and end coordinates.

    # Busco id   Status    Sequence  Score  Length
    EOG09000001  Complete  sample1   320    193
    EOG09000002  Complete  sample2   872    443
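The two full table layouts shown for the genome and protein modes can be read with a short parser. The sketch below assumes that fields are tab-separated and that comment lines start with "#"; the column keys and the function name are hypothetical labels chosen for the illustration.

```python
# Column labels following the headers shown for the genome and protein modes.
GENOME_COLUMNS = ["busco_id", "status", "contig", "start", "end", "score", "length"]
PROTEIN_COLUMNS = ["busco_id", "status", "sequence", "score", "length"]


def parse_full_table(lines, columns):
    """Turn the data lines of a full table into a list of dictionaries.

    Lines starting with '#' and empty lines are skipped. A row with fewer
    fields than columns simply yields fewer keys.
    """
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows
```

Given a file handle opened on a genome-mode table, `parse_full_table(handle, GENOME_COLUMNS)` yields one dictionary per reported BUSCO gene, which makes it easy to select entries by status or to retrieve coordinates.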

The folder hmmer_output/ contains a tabular format of each HMMER output, one for each BUSCO profile HMM that has been searched, e.g., EOG09000001.out.1.

3.3.3 Transcriptome

The last mode available has to be selected when the input is a transcriptome assembly (Fig. 2c) in the form of nucleotide sequences representing individual transcripts. A TBLASTN run taking BUSCO amino acid consensus sequences as queries and the input transcripts as database is conducted to obtain a subset of sequences harboring potential matches to each BUSCO gene. A six-frame translation is done on these transcripts and HMMER is run to assign a score to the candidate amino acid sequences before the BUSCO algorithm proceeds to the final classification. As for protein isoforms, alternate transcripts should be removed from the input before running BUSCO in order to obtain a meaningful duplication score. The BUSCO transcriptome mode is run as follows:

    python busco_folder/scripts/run_BUSCO.py -i SEQUENCE_FILE.fna -o OUTPUT_NAME -l lineages/NAME_OF_LINEAGE -m tran
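The three command lines given for the genome, protein, and transcriptome modes differ only in their input file and in the value passed to -m. When several runs have to be scripted, the command can be assembled as in the following sketch, where the function name and the mode labels are hypothetical and only the flags shown in the text are used.

```python
# Values passed to -m in the three command lines of this chapter.
MODES = {"genome": "geno", "protein": "prot", "transcriptome": "tran"}


def busco_command(input_file, output_name, lineage_dir, mode,
                  script="busco_folder/scripts/run_BUSCO.py"):
    """Build the argument list for one BUSCO run in the requested mode."""
    if mode not in MODES:
        raise ValueError("unknown mode: " + mode)
    return ["python", script, "-i", input_file, "-o", output_name,
            "-l", lineage_dir, "-m", MODES[mode]]
```

The returned list can be passed to `subprocess.run(command, check=True)` from the standard library.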

The BUSCO transcriptome run output folder contains a short_summary_OUTPUT_NAME.txt and a missing_buscos_list_OUTPUT_NAME.tsv file identical to those of the two previously described modes. The full_table_OUTPUT_NAME.txt file contains the identifier of the transcript and is similar to that of the protein mode. The folders blast_output/ and hmmer_output/ have the same content as their equivalents in the genome mode. The folder translated_proteins/ contains the six-frame translated version of every transcript having a match to a BUSCO amino acid consensus during the TBLASTN analysis, including discarded candidates and transcripts included in the final classification.

3.3.4 Optional Arguments

The BUSCO script possesses several options that allow the user to either act on the assessment outcome by fine-tuning parameters or control the usability of the tool, affecting the structure of the outputs or the resources and time consumption. While the first category will be discussed later in the chapter, it is worth highlighting useful options belonging to the second category. The full list of parameters can be printed by calling the help option:

    python busco_folder/scripts/run_BUSCO.py -h

In addition to the command line, most of the parameters can be defined in the configuration file to become the default value at each run. For example:

    python busco_folder/scripts/run_BUSCO.py --cpu 8

is equivalent to having in the configuration file:

    [busco]
    cpu = 8

Note that the parameters provided through the command line will always override the entry in the configuration file. A few parameters are restricted to the file, an important one being the debug mode that every BUSCO user should know.

    [busco]
    ;debug = True
    # to enable, uncomment by removing the ;

Since BUSCO makes thousands of calls to external commands such as augustus or tblastn, it may be useful for the user to be able to track each of these calls and run them manually (see Note 5). Therefore, the debug mode prints all commands and parameters that are called during the run.

    DEBUG ['/usr/bin/tblastn', '-db', '/tmp/test_db']

Once all issues related to external commands are fixed, or if the analysis was killed accidentally, the run can be started again from the beginning using the --force option to rewrite over existing files, or preserving the output of each step that was successfully completed using the --restart option. Finally, as each analysis generates a large amount of small files, some storage systems may be affected when multiple runs are conducted and kept during a project. The --tarzip option solves this issue by archiving all subfolders in the output that are likely to contain a high number of elements (i.e. AUGUSTUS and HMMER outputs).
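The precedence rule stated in this subsection, where a value given on the command line overrides the entry of the configuration file, can be summarized in a few lines. This is a sketch of the rule only, not of BUSCO's implementation, and the function name is hypothetical.

```python
def effective_parameters(config_section, cli_args):
    """Merge both sources; a command line value that was given wins.

    config_section: dict of entries read from the [busco] section.
    cli_args: dict of command line options, None meaning 'not provided'.
    """
    merged = dict(config_section)
    merged.update({key: value for key, value in cli_args.items()
                   if value is not None})
    return merged
```

With `cpu = 8` in the file and `--cpu 4` on the command line, the merged value is 4, whereas an option left out of the command line keeps the value from the file.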

4 Understanding BUSCO

4.1 Choice of the Dataset

Once the tool is properly set up, the first decision that has to be made is which dataset should be used among those available (Table 1, see Note 6). The primary goal of the BUSCO tool is to allow evaluation, comparison, and reevaluation of assemblies and annotations. A good rule of thumb is to select the most specific lineage the species belongs to, as it will provide the best resolution possible for the evaluation. For instance, when working with insects, the user should choose the dataset belonging to the class Insecta, unless the organism belongs to the order Diptera or Hymenoptera for which an order-specific dataset exists [12]. While it is always incorrect to use any dataset issued from a lineage to which the species does not belong, two reasons could lead the user to select a more generic dataset, representing a higher taxonomic level. First, the time required by most steps during a BUSCO analysis increases linearly with the number of genes included in the dataset, which tends to increase in lower taxonomic levels. Moreover, species belonging to certain lineages such as Mammalia have a complex gene structure [13], which can drastically increase the run time per gene compared to a generic set such as Metazoa. Therefore, the user has to balance the resolution needed vs the runtime. Second, BUSCO is often used to compare an assembly or an annotation to previously published material of related species or a previous version of the same project. In this situation, it is recommended to plan in advance the comparative aspect of the project to select a dataset that encompasses all species involved. For example, while BUSCO provides an avian dataset, a comparative genomics study that mixes birds and other amniotes will prefer the Tetrapoda dataset for evaluating all species involved.
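The rule of thumb and the comparative case described above can be expressed as a small lookup. The sketch below is an illustration under several assumptions: the lineages are simplified, hand-written lists ordered from the root to the species (in practice they would come from a taxonomy resource), all lineages passed to `shared_dataset` list the same ranks in the same order, and only part of Table 1 is included. The function names are hypothetical.

```python
# A subset of the dataset names of Table 1, keyed by taxon.
AVAILABLE = {
    "eukaryota": "eukaryota_odb9", "metazoa": "metazoa_odb9",
    "arthropoda": "arthropoda_odb9", "insecta": "insecta_odb9",
    "endopterygota": "endopterygota_odb9", "diptera": "diptera_odb9",
    "hymenoptera": "hymenoptera_odb9", "vertebrata": "vertebrata_odb9",
    "tetrapoda": "tetrapoda_odb9", "aves": "aves_odb9",
    "mammalia": "mammalia_odb9",
}


def most_specific_dataset(lineage, available=AVAILABLE):
    """Return the dataset of the deepest taxon of `lineage` that has one."""
    for taxon in reversed(lineage):
        if taxon.lower() in available:
            return available[taxon.lower()]
    return None


def shared_dataset(lineages, available=AVAILABLE):
    """Return the most specific dataset that encompasses every lineage."""
    common = []
    for taxa in zip(*lineages):
        if len({taxon.lower() for taxon in taxa}) != 1:
            break  # the lineages diverge from this position onward
        common.append(taxa[0])
    return most_specific_dataset(common, available)
```

For a fly, the first function returns diptera_odb9 rather than insecta_odb9, and for a study mixing a bird and a lizard the second function returns tetrapoda_odb9, in line with the examples given in the text.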

4.2 Choice of the Parameters

BUSCO offers multiple ways to fine-tune several aspects of the analysis. In particular, AUGUSTUS is a tool that provides many options and BUSCO can pass any parameter to this tool in the genome mode using the option --augustus_parameters. However, the golden rule when using BUSCO is not to change default parameters unless there is a biological or experimental reason to do so. The user should keep in mind that their goal is not to improve the BUSCO score per se, but to improve the overall quality of their assembly and annotation, which also relies on remaining comparable with the rest of the BUSCO user community. One biological justification for editing a parameter is that of a different codon usage (the AUGUSTUS parameter translation_table), for example in ciliates [14], as keeping the default parameter would impair the evaluation of the assembled genome. However, if the goal is not to evaluate and compare, but to recover as many BUSCO gene sequences as possible for downstream analyses (see Subheading 6), it becomes relevant to explore the palette of parameters offered by BUSCO. The default species passed to AUGUSTUS can be specified with the --species option and the Expect value used with TBLASTN can be modified using the --evalue option. By default, BUSCO considers only the three best contigs matching a BUSCO gene during the TBLASTN step for the subsequent analyses to minimize the computing time. While this is an efficient tradeoff between performance and BUSCO gene recall for most use cases, the user can increase this limit up to 20 using the --limit option to try recovering a few extra BUSCO gene sequences.

4.3 Interpretation of the Results

BUSCO produces a report for each of the three modes of assessment using the same scoring scheme. Expected BUSCO genes can fall into different categories: C:complete [S:single-copy, D:duplicated], F:fragmented, and M:missing. These are reported as absolute numbers as well as percentage of the total BUSCO genes (n:) included in the dataset. To judge whether a score is satisfying, the user will have to consider the type of sequence first. A very good genome assembly should contain all BUSCO genes that were not lost during the evolution of the species, which cannot be precisely defined. Model organisms, which have good reference genomes, often reach a score above 95% complete (BUSCO 3.0.2: Mus musculus; GRCm38.p6; mammalia_odb9; C: 95.2% [S: 90.9%, D: 4.3%], F: 2.4%, M: 2.4%, n: 4104—Drosophila melanogaster; Drosophila melanogaster Release 6 plus ISO1 MT; diptera_odb9; C: 98.7% [S: 98.2%, D: 0.5%], F: 0.8%, M: 0.5%, n:2799—Saccharomyces cerevisiae; Saccharomyces cerevisiae S288C; saccharomycetales_odb9; C: 98.3% [S: 97.7%, D: 0.6%], F: 0.7%, M: 1.0%, n: 1711). Non-model genome projects commonly report BUSCO scores ranging from 50% up to 95% complete, depending on the challenge posed by the species’ biology (e.g., genome size, amount of repetitive elements) and its taxonomic position [15–18]. The score of an annotated gene set may reach a value lower than its genomic equivalent, since an annotation pipeline might miss BUSCO genes present in the assembly as it aims at predicting thousands of genes with broad parameters, while the BUSCO software targets very specific sequences with tailored parameters. Consequently, the user should assess both the assembly and the annotation result to judge whether the gene prediction strategy is appropriate or can be improved in light of the expected gene content in the genome (see Subheading 6.1). 
It is important to mention that the user should never complete the annotated gene set with the BUSCO genes recovered by the genome mode prior to assessing it, as it would bias the evaluation of true annotation efficiency. Finally, a good transcriptome score can be much lower than its genomic counterpart, as not all BUSCO genes are necessarily expressed together, especially in a single tissue or condition [19].
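The one-line score notation used in the reports and in the examples above can be parsed for tabulation or sanity checks. The sketch below is an illustration, not part of BUSCO: the regular expression accepts both the compact form found in the short summary file and the spaced form used in this section, and the tolerance accounts for rounding of the printed percentages.

```python
import re

SCORE_RE = re.compile(
    r"C:\s*(?P<C>[\d.]+)%\s*\[S:\s*(?P<S>[\d.]+)%,\s*D:\s*(?P<D>[\d.]+)%\],\s*"
    r"F:\s*(?P<F>[\d.]+)%,\s*M:\s*(?P<M>[\d.]+)%,\s*n:\s*(?P<n>\d+)"
)


def parse_score_line(line):
    """Extract the C, S, D, F, M percentages and n from a BUSCO score line."""
    match = SCORE_RE.search(line)
    if match is None:
        raise ValueError("no BUSCO score found in: %r" % line)
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


def is_consistent(scores, tolerance=0.15):
    """Check that C = S + D and that C + F + M = 100, up to rounding."""
    complete_ok = abs(scores["C"] - (scores["S"] + scores["D"])) <= tolerance
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tolerance
    return complete_ok and total_ok
```

Applied to the Mus musculus example, the parser returns C = 95.2, S = 90.9, D = 4.3, F = 2.4, M = 2.4, and n = 4104, and the consistency check passes.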


The duplication of a few BUSCO genes in a genome is compatible with biological reality, because the constraint to remain single copy may be relaxed in some sublineages and because we allowed duplications in up to 10% of the species when defining BUSCO markers [7]. However, a high duplication rate in a genome could denote a potential assembly of different haplotypes, a recent whole genome duplication [20], or technical artifacts that will have to be investigated. As mentioned earlier, the duplication rate of transcriptomes and annotated gene sets unfiltered for isoforms may be considerably higher. In some situations, the user will want to filter these out to decrease the duplication rate down to values expected in a genome. A high rate of fragmented BUSCO genes indicates issues in the sequencing and assembly process or the inability of the annotation pipeline to fully capture the complexity of some gene models. Turning fragmented BUSCO genes into complete ones is a good indicator of a significant improvement of the quality of an assembly, especially when supported by changes in other metrics such as N50. To define the presence, absence, and fragmentation status of each BUSCO gene, the classifier applies to all results a score and a length threshold based on the distribution of these metrics in the species used to produce the datasets. This implies that, in limited cases, a gene which is an outlier in terms of length or score may be classified as fragmented or missing while it is in fact present and complete. An advanced user may be able to spot such situations when manually investigating the outputs. However, much care should be taken when reinterpreting the results, as close homologs are sometimes difficult to distinguish from actual BUSCO genes (see Note 7) and remain the most likely explanation in such situations.
Finally, no adjusted scores should be reported alone in a publication, for the sake of like-for-like comparisons within the community of users.
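The role of the two thresholds can be illustrated with a deliberately simplified sketch. It is an assumption-laden reading of the description above, not the BUSCO classifier: each candidate is reduced to a (score, length) pair, a single score cutoff and a single length cutoff are used per gene, and the way the real cutoff files encode these values is not modeled.

```python
def classify(hits, score_cutoff, length_cutoff):
    """Assign a status to one BUSCO gene from its candidate matches.

    hits: list of (score, length) pairs, one per candidate sequence.
    A candidate is 'found' when it reaches the score cutoff, and 'complete'
    when it also reaches the length cutoff (simplifying assumption).
    """
    found = [hit for hit in hits if hit[0] >= score_cutoff]
    complete = [hit for hit in found if hit[1] >= length_cutoff]
    if len(complete) > 1:
        return "Duplicated"
    if len(complete) == 1:
        return "Complete"
    if found:
        return "Fragmented"
    return "Missing"
```

Under these assumptions, a genuine ortholog that is unusually short for its group would be reported as fragmented, which is the outlier situation discussed above.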

5 Plotting the Results

It is common to represent BUSCO scores side-by-side using bar plots to illustrate different milestones of an assembly and annotation project, or different species as part of a comparative study. To encourage the use of a standard and distinctive layout in publications, while allowing a certain degree of customization, BUSCO includes a dedicated script to produce a figure and its source code that can be edited by the user. It requires only the short summary files of each BUSCO run that should appear on the plot to be grouped in a single folder, the working directory, in which the outputs will be generated.

    python busco_folder/scripts/generate_plot.py -wd PATH


Fig. 3 Illustration of the BUSCO default side-by-side representation of assessment scores as produced by the plotting script. Three hypothetical species evaluated with 100 BUSCO gene profiles are depicted with various degrees of completeness and duplication

The language underlying the figure creation is R [21] and its popular library ggplot2 [22]. If these are available on the system running BUSCO, an image file will be produced automatically by calling the R script and written in the working directory. Otherwise, the user can specify the --no_r option to skip this step and find an R code file in the working directory, which they are free to edit and run anywhere. Figure 3 is an example of the resulting default plot.
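Grouping the short summary files in one working directory is the only preparation the plotting script needs, and it can be scripted as in the sketch below. The function name is hypothetical, and the file pattern follows the short_summary_OUTPUT_NAME.txt naming described earlier.

```python
import glob
import os
import shutil


def collect_short_summaries(run_dirs, working_dir):
    """Copy every short_summary_*.txt found in run_dirs into working_dir."""
    os.makedirs(working_dir, exist_ok=True)
    copied = []
    for run_dir in run_dirs:
        pattern = os.path.join(run_dir, "short_summary_*.txt")
        for path in sorted(glob.glob(pattern)):
            copied.append(shutil.copy(path, working_dir))
    return copied
```

After the call, the plotting script can be pointed at the same folder with its -wd argument.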

6 Beyond Completeness Assessment

Although BUSCO’s main function is to perform genomics data quality control, it is worth mentioning that one can take advantage of the pipeline for performing other common operations in genomics, such as for building training sets for gene predictors, identifying reliable markers for large-scale phylogenomics studies (https://gitlab.com/ezlab/busco_usecases/tree/master/phylogenomics), and selecting high-quality reference species for comparative genomics analyses. These aspects are presented in great detail in the publication entitled “BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics” [5].

6.1 A Few Words on Gene Predictor Training

Running BUSCO provides the user with high-quality gene model training data that can greatly improve genome annotation procedures. Gene prediction remains a challenging procedure, especially in the absence of supporting evidence such as native transcripts or homologs from close species. To achieve the best results, gene prediction tools such as AUGUSTUS [10], SNAP [23], GENEID [24], and GeneMark [25] need to optimize their parameter configurations for each specific genome. BUSCO genes can be used as initial sets of high-quality gene models for such optimization. For example, using BUSCO-trained parameters for gene prediction resulted in improvements in the quality of the resulting gene model annotations over using available pre-trained parameters from other species [5]. Since BUSCO employs AUGUSTUS for gene prediction, the pipeline automatically provides AUGUSTUS-ready parameters trained on BUSCO genes identified as complete single copy (see Note 4). Moreover, BUSCO provides the --long option to enable the optimization mode when retraining AUGUSTUS, which can further improve the obtained retraining parameters, with a cost in terms of time consumption that depends on the complexity of the organism gene models. Other gene predictors like SNAP can be trained as well, by using as input the GFF and GenBank-formatted gene models generated by BUSCO.

7 Notes

1. Inconsistencies when using multi-threading on TBLASTN 2.4.x and higher have been reported multiple times. If the user faces such an issue, a rollback to version 2.2.x or 2.3.x is a safe option. If this is not possible, BUSCO supports the option blast_single_core=True in the configuration file to ignore multithreading (--cpu) for blast steps only.

2. BUSCO needs to write in the $AUGUSTUS_CONFIG_PATH/species/ folder. Therefore, an unprivileged user on a shared environment will encounter the following error: Cannot write to AUGUSTUS config path. This can be solved by copying the entire $AUGUSTUS_CONFIG_PATH folder to a location where the user has write permission and redeclaring the environment variable to target this location.

    export AUGUSTUS_CONFIG_PATH=/new/location/

3. The BUSCO orthologous group identifiers EOGxxxxxxxx cannot be shared or compared between different datasets and versions. The orthology delineation method uses a representation of the relationship between genes that is unique to each lineage as it arises from all duplication and speciation events underlying the evolution of the lineage. Therefore, a genomic sequence suitable to be a BUSCO gene in one dataset may not have the same orthology relationships to the sequences with different evolutionary distances that are considered to define BUSCO genes in other datasets.

4. To reuse the retraining parameters as a custom species with AUGUSTUS, independently from BUSCO, the user needs to move the folder retraining_parameters/ back to the $AUGUSTUS_CONFIG_PATH/species/ folder of their AUGUSTUS install and rename it to its original name, which can be deduced from its content. If the folder contains the file BUSCO_OUTPUT_NAME_xxx_parameters.cfg, the correct name to be used for naming the folder and identify the species within AUGUSTUS is BUSCO_OUTPUT_NAME_xxx.

5. BUSCO removes all temporary files at the end of the analysis. To run manually a command that accesses temporary files, the user will have to kill the run before it reaches the end.

6. Producing a BUSCO dataset is not a trivial task. Genes have to be sampled from orthologous groups that are suitable in terms of phyletic profile (Fig. 1) and containing a sufficient number of species to properly represent the lineage in question. For this reason, and to encourage users to take advantage of existing datasets to produce comparable results, no detailed procedure for creating custom datasets is available. This remains achievable by an advanced user having access to a good sample of orthologs from their lineage of interest.

7. When close homologs to BUSCO genes are present in the sequence that is analyzed, the BUSCO classifier will give a better score to the true copies and therefore be able to discard the other sequences. However, if the actual BUSCO is missing, close homologs may sometimes reach a sufficient score to be considered as positive matches to a BUSCO gene that is in fact not present.
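The renaming step of Note 4 can be automated by deducing the species name from the parameter file, as in the sketch below. The function names are hypothetical, and the sketch assumes that exactly one file ending in _parameters.cfg is present in the folder.

```python
import os

SUFFIX = "_parameters.cfg"


def augustus_species_name(retraining_dir):
    """Deduce the species name from the *_parameters.cfg file of the folder."""
    names = [name[:-len(SUFFIX)] for name in os.listdir(retraining_dir)
             if name.endswith(SUFFIX)]
    if len(names) != 1:
        raise ValueError("expected exactly one *%s file, found %d"
                         % (SUFFIX, len(names)))
    return names[0]


def species_destination(retraining_dir, augustus_config_path):
    """Path the folder should take inside $AUGUSTUS_CONFIG_PATH/species/."""
    return os.path.join(augustus_config_path, "species",
                        augustus_species_name(retraining_dir))
```

The folder can then be copied to the returned path, for instance with `shutil.copytree`, provided the user has write permission there (see Note 2).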

Acknowledgments

We would like to thank all members of the Zdobnov group, in particular Felipe Simão and Christopher Rands, for their useful comments. This work was partly supported by the Swiss Institute of Bioinformatics SER funding and the Swiss National Science Foundation funding 31003A_166483 to E.Z.

Chapter 15

Evaluating Genome Assemblies and Gene Models Using gVolante

Osamu Nishimura, Yuichiro Hara, and Shigehiro Kuraku

Abstract

In the daily practice of de novo genome assembly and gene prediction, one naturally wants to evaluate the resulting products. Different programs and parameter settings give rise to different outputs, and one must decide which output to adopt for the downstream analyses that address biological questions. Instead of a superficial assessment based on length statistics of the output sequences (e.g., N50 scaffold length), completeness assessment that scores the coverage of reference orthologs is increasingly used. We previously launched a web service, gVolante (https://gvolante.riken.jp/), which provides a user-friendly interface and a uniform environment for completeness assessment with the pipelines CEGMA and BUSCO. Completeness assessments performed on gVolante report scores based not only on the coverage of reference genes but also on sequence lengths, allowing quality control from multiple aspects. This chapter focuses on the procedure for such assessment and provides technical tips for higher accuracy.

Key words Completeness assessment, BUSCO, CEGMA, Ortholog, CVG

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_15, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

Whether one evaluates de novo genome assemblies or the gene models derived from them, metrics based solely on the lengths and base compositions of the sequences cannot indicate their accuracy (see refs. 1, 2). The first published framework for evaluating sequence sets by the coverage of protein-coding genes was the pipeline CEGMA [3]. This tool, which was originally developed to predict orthologs of pre-selected conserved genes [4], accepts only nucleotide sequences that are presumed to contain introns, although a transcript sequence set can also be input. In 2015, the group of researchers including its developer announced that CEGMA would no longer be supported, partly because of its intricate installation requirements, and that the pipeline BUSCO, introduced in Chapter 14, could serve as its successor (http://www.acgt.me/blog/2015/5/18/goodbye-cegma-hello-busco). BUSCO accepts not only nucleotide sequences (genome or transcriptome sequences) but also amino acid sequences (predicted peptides) and scores the coverage of a given reference protein-coding gene set [5, 6].

Completeness assessment generally requires two elements: (1) an ortholog search pipeline and (2) a reference gene set. The abovementioned tools, CEGMA and BUSCO, not only serve as ortholog search pipelines but also provide their dedicated ready-to-use sets of selected orthologs, CEG (Core Eukaryote Genes) and the BUSCO datasets, respectively. In addition, we previously introduced an ortholog set for vertebrates, Core Vertebrate Genes (CVG), consisting of rigorously selected one-to-one orthologs retained by 29 vertebrates from a variety of taxa [7]. This gene set is increasingly used for completeness assessment of de novo genome assemblies (e.g., bottlenose dolphins [8], sea lamprey germline [9], and Madagascar ground gecko [10]) and gene models [11], as well as for assessing transcriptome sequence sets of diverse vertebrates [12].

In 2017, we launched the web server gVolante (https://gvolante.riken.jp/), which allows users to run completeness assessments with simple operations in a web browser [13]. The release of gVolante removed the severe difficulty of using CEGMA, which enables a fair choice among completeness assessment methods based simply on accuracy. gVolante offers a flexible selection of the combination of an ortholog search pipeline (CEGMA or BUSCO) and a reference gene set (e.g., CVG). This chapter provides a step-by-step tutorial for using gVolante and tips for interpreting completeness assessment results.

2 Materials

2.1 Working Environment

Prepare a device (not necessarily a PC or Mac) that runs a recent version of a web browser and has Internet access fast enough for uploading gigabyte-sized files.

2.2 Sequence Set to Assess

Prepare a FASTA file containing multiple nucleotide or amino acid sequences. The maximum size of the file (even if compressed) is 10 GB. The web server accepts a compressed or archived file in the .gz, .tgz, .bz2, .tbz, .tar, or .zip format, which shortens the upload time and enables the submission of a sequence set whose total length exceeds 10 Gb. The file name should contain only letters, digits, hyphens, and periods. For a quick trial, a test file containing human peptide sequences is available (https://gvolante.riken.jp/test_file.zip).
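The constraints above can be checked locally before uploading. The following Python sketch is illustrative only and is not part of gVolante: the function names `check_upload` and `is_compressed` are hypothetical, "letters" is assumed to mean ASCII letters, and 10 GB is assumed to mean 10 × 10^9 bytes. Underscores are not among the permitted characters stated above, so the sketch rejects them; relax the pattern if the server turns out to accept them.

```python
import re

# Assumption: "10 GB" is taken as 10 * 10**9 bytes; the server may count differently.
MAX_BYTES = 10 * 10**9
# Compression/archive suffixes listed in the text.
ARCHIVE_SUFFIXES = (".gz", ".tgz", ".bz2", ".tbz", ".tar", ".zip")
# Assumption: only ASCII letters, digits, hyphens, and periods are allowed.
NAME_PATTERN = re.compile(r"[A-Za-z0-9.-]+")


def check_upload(name, size_bytes):
    """Return a list of problems for a candidate upload (empty if none found)."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("file name has characters other than letters, digits, hyphens, periods")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds the 10 GB limit")
    return problems


def is_compressed(name):
    """True if the name ends with one of the accepted compression/archive suffixes."""
    return name.lower().endswith(ARCHIVE_SUFFIXES)


# Hypothetical file name and size, for illustration.
print(check_upload("my-assembly.fasta.gz", 2_500_000_000))
print(is_compressed("my-assembly.fasta.gz"))
```

Returning a list of problems rather than a single boolean lets all issues be reported at once before a long upload is attempted.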

3 Evaluating a Genome Assembly, Step by Step

3.1 Accessing the gVolante Web Server

Launch a web browser and access the analysis page of the gVolante website (https://gvolante.riken.jp/analysis.html).

3.2 Uploading the File

Choose a sequence file to assess and click on the [UPLOAD FILE] button to upload it. Uploading a compressed file can save a considerable amount of time. Press the button only once, because pressing it twice can cause a problem. If the upload succeeds, the message “Uploading [file name] is complete” is shown.

3.3 Entering a Project Title and Your Email Address

Enter a project title that will later allow you to identify the job. Providing an email address is optional, but it allows you to receive a summary of the assessment result and a link to the result webpage by email, so that you do not need to return to the same page after a time-consuming assessment.

3.4 Specifying a Length Cutoff

Optionally, specify a cutoff value for the minimum length of the sequences used for computing sequence length statistics and base composition. The default value is 1, which does not exclude any sequence in the input file. Note that this cutoff applies only to the computation of sequence length statistics and base composition, not to the completeness assessment based on reference orthologs.
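The effect of the length cutoff can be illustrated with a small Python sketch. It is not gVolante's code: the function name `length_stats` and the particular statistics (sequence count, total length, N50, and GC fraction) are assumptions chosen for illustration, since the text above does not enumerate the exact statistics that the server reports.

```python
def length_stats(seqs, min_length=1):
    """Length statistics for the sequences that are at least min_length long.

    seqs: iterable of nucleotide strings.
    Returns a dict with the number of sequences kept, their total length,
    their N50, and their GC fraction (computed over A, C, G, T only).
    """
    kept = [s.upper() for s in seqs if len(s) >= min_length]
    lengths = sorted((len(s) for s in kept), reverse=True)
    total = sum(lengths)

    # N50: the length L such that sequences of length >= L together
    # contain at least half of all bases.
    n50, running = 0, 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break

    gc = sum(s.count("G") + s.count("C") for s in kept)
    acgt = sum(s.count(base) for s in kept for base in "ACGT")
    return {
        "count": len(kept),
        "total_length": total,
        "n50": n50,
        "gc_fraction": gc / acgt if acgt else 0.0,
    }


# Toy sequences: a cutoff of 5 drops the two shortest ones.
toy = ["A" * 10, "C" * 8, "G" * 6, "T" * 4, "AC"]
print(length_stats(toy))
print(length_stats(toy, min_length=5))
```

Because the cutoff only filters the input to these statistics, raising it changes the reported lengths and base composition but, as stated above, leaves the ortholog-based completeness scores untouched.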

3.5 Choosing a Sequence Type

Choose one of the three sequence types: (1) genome (nucleotide) sequences, (2) coding/transcribed (nucleotide) sequences, or (3) peptide (amino acid) sequences. When (2) or (3) is chosen, the pipeline CEGMA cannot be selected in the next step.

3.6 Choosing an Ortholog Search Pipeline

Choose one of the three options for an ortholog search pipeline: (1) CEGMA, (2) BUSCO v2/v3, or (3) BUSCO v1. BUSCO v1 is included for users who need to match the conditions of past assessments for comparison. Because BUSCO v3 was released after a refactoring of BUSCO v2, these two versions of BUSCO function in the same way. As stated above in Subheading 3.5, CEGMA cannot be chosen when a non-genomic sequence is input.
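The dependency between the sequence type (Subheading 3.5) and the selectable pipelines can be written down as a small lookup. This Python sketch only restates the rule given in the text; the function name `allowed_pipelines` and the string labels are hypothetical and do not correspond to any gVolante interface.

```python
# Pipelines offered on the analysis page, as listed in the text.
PIPELINES = ("CEGMA", "BUSCO v2/v3", "BUSCO v1")
# Hypothetical labels for the three sequence types of Subheading 3.5.
SEQUENCE_TYPES = ("genome", "transcript", "peptide")


def allowed_pipelines(sequence_type):
    """Return the pipelines that can be selected for a given sequence type."""
    if sequence_type not in SEQUENCE_TYPES:
        raise ValueError(f"unknown sequence type: {sequence_type!r}")
    if sequence_type == "genome":
        return PIPELINES
    # CEGMA is available only for genomic nucleotide sequences.
    return tuple(p for p in PIPELINES if p != "CEGMA")


for seq_type in SEQUENCE_TYPES:
    print(seq_type, allowed_pipelines(seq_type))
```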

3.7 Choosing a Reference Ortholog Set

For BUSCO, the reference ortholog sets associated with the individual versions (v1 and v2/v3), as well as CVG (see Subheading 1), are available as options, and the user needs to choose one of them. When CEGMA is chosen as the ortholog search pipeline as described above in Subheading 3.6, two reference ortholog sets are available: CEG, the original ortholog set for CEGMA, and CVG. In addition, one is advised to specify a set of parameters used for gene prediction in completeness assessment, depending on the availability of information about the typical gene structure of the species of interest. The parameter “Max intron length” specifies the allowable distance between separate exons, and the parameter “Gene flanks” specifies the length of the genomic regions in which flanking exons are searched for. We offer preset values for these parameters for some taxonomic groups.

3.8 Starting the Analysis

Click on the [START YOUR ANALYSIS] button. After the submitted sequence file is validated, the job information page is shown, and the job scheduler then starts the queued analyses in order. If you did not provide your email address as described above in Subheading 3.3, save the Job ID or the URL shown on this page so that you can access the analysis results later. If no other job is waiting in the queue, an analysis can take up to days (for a whole genome) or hours (for a transcriptome or peptides). If the submitted file is large or includes many duplicate genes or sequences, the processing time increases significantly.

3.9 Viewing the Assessment Result

When the analysis ends, the result is shown on the web page accessible under the specified Job ID. If you provided your email address as described above in Subheading 3.3, an email will inform you of the job completion together with a summary of the result. The results summary lists completeness scores, sequence length statistics, and the project information (Fig. 1). Completeness is measured by the numbers of reference orthologs identified as complete, duplicated, partial, and missing, as defined by CEGMA and BUSCO. Clicking on the [SHOW ORTHOLOG DETAILS] button opens a page listing the individual reference orthologs that were retrieved from, or missing in, the given sequence file. In addition, the aLeaves service (https://aleaves.riken.jp) [14] allows users to examine the orthology identification results in more detail.
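How the ortholog counts relate to a percentage score can be shown with simple arithmetic. The Python sketch below is an illustration rather than gVolante's implementation: the function names are hypothetical, and rounding to one decimal place is an assumption. The first two calls use counts quoted in Subheading 5 of this chapter (161 of 248 and 209 of 233 reference orthologs detected as complete); the counts passed to `summarize` are made up.

```python
def completeness_percent(n_detected, n_reference):
    """Percentage of reference orthologs detected, rounded to one decimal place."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    if not 0 <= n_detected <= n_reference:
        raise ValueError("n_detected must lie between 0 and n_reference")
    return round(100.0 * n_detected / n_reference, 1)


def summarize(n_complete, n_complete_or_partial, n_reference):
    """Derive percentage scores and the number of missing orthologs from counts."""
    return {
        "complete_percent": completeness_percent(n_complete, n_reference),
        "complete_or_partial_percent": completeness_percent(
            n_complete_or_partial, n_reference
        ),
        # Orthologs that are neither complete nor partial count as missing.
        "missing": n_reference - n_complete_or_partial,
    }


print(completeness_percent(161, 248))  # counts quoted in Subheading 5
print(completeness_percent(209, 233))  # counts quoted in Subheading 5
print(summarize(200, 220, 233))        # made-up counts for illustration
```

Reporting the "complete" and "complete or partial" percentages side by side makes it easier to see whether a low score reflects truly absent genes or fragmented ones.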

Fig. 1 Screenshot of the analysis result page of gVolante with point-by-point guides: (a) upper part, the result of completeness assessment; (b) lower part, sequence length statistics and job summary. Detailed guides are given in gray background for the individual items displayed

4 Recent Addition to Sequence Length Statistics

Thanks to methods enabled by chromosome conformation capture, such as Hi-C, generating chromosome-scale genome assemblies is becoming more and more popular. As the typical sequence lengths in de novo assemblies increase, more attention must be paid to global continuity than to the local exon-level linkage that has been a major target of completeness assessment based on the coverage of reference orthologs. To respond to the growing demand for evaluating megabase-long sequences, we implemented new functions in gVolante version 1.2 (released in July 2018). First, the assessment computes and displays the proportions of nucleotides contained in sequences longer than 1 Mb and 10 Mb in the given genome assembly. This allows users to assess whether the major proportion of the assembly consists of chromosome-sized sequences. Second, the distribution of the lengths of gaps (sequence tracts of unknown bases, e.g., ‘NNNNN’) is displayed in the assessment result page (Table 1).

Table 1 Exemplar gap length distributions displayed by gVolante

(A) Hummingbird genome assembly by Supernova using 10x Genomics Chromium linked reads*

Gap (N) length    Count
100,000           17
95,000            3
90,000            5
85,000            5
80,000            7
75,000            12
70,000            12
65,000            11
60,000            13
55,000            20
50,000            13
45,000            22
40,000            33
35,000            66
30,000            66
25,000            95
20,000            160
15,000            238
10,000            468
5,000             1,160
3,000             242
400               1,692
100               6,546
10                7,843

(B) Whale shark genome assembly by Canu using Pacific Biosciences long reads**

Gap (N) length    Count
88                1
58                1
56                1
55                1
54                1
37                1
22                1
17                1
14                1
13                1
1                 1

* The FASTA file is publicly available at http://cf.10xgenomics.com/samples/assembly/2.0.0/hummer/hummer_pseudohap.fasta.gz
** The assembly is publicly available at NCBI Genome (ASM164234v2) https://www.ncbi.nlm.nih.gov/nuccore/1151551436
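Both statistics described in this section are straightforward to compute for one's own assembly. The following Python sketch is a minimal illustration under stated assumptions, not gVolante's code: the function names are hypothetical, gaps are taken to be maximal runs of N (upper or lower case), and exact run lengths are counted, whereas the binning used for the display in Table 1 is not specified in the text.

```python
import re
from collections import Counter


def fraction_in_long_sequences(lengths, threshold):
    """Fraction of all nucleotides that lie in sequences longer than threshold."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(n for n in lengths if n > threshold) / total


def gap_length_counts(seqs):
    """Count maximal runs of N (case-insensitive) by run length over all sequences."""
    counts = Counter()
    for seq in seqs:
        for run in re.finditer(r"[Nn]+", seq):
            counts[len(run.group())] += 1
    return counts


# Toy example: three sequence lengths and two short sequences with gaps.
lengths = [2_000_000, 500_000, 500_000]
print(fraction_in_long_sequences(lengths, 1_000_000))
print(fraction_in_long_sequences(lengths, 10_000_000))
print(gap_length_counts(["ACGTNNNNNACGTNNACGT", "nnACGTNN"]))
```

For a real assembly, the sequence lengths and sequences would come from a FASTA parser; the toy inputs above keep the sketch self-contained.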

5 Tips for Interpreting Assessment Results

Some users have probably experienced a case in which different assessment settings resulted in different completeness scores for the same input sequence file. As one would expect, assessing an animal genome with a reference ortholog set for plants is likely to yield remarkably low completeness scores. A large decrease in completeness scores can also occur in an assessment with a seemingly optimal setting. One remarkable example concerns the evaluation of the genome of the brownbanded bamboo shark, an elasmobranch [15]. When we chose CEGMA with its dedicated ortholog set CEG and the gene prediction parameter “vrt” (Vertebrate) for CEGMA, the completeness score was 64.9% (161 orthologs detected as complete, out of 248). In contrast, BUSCO v2/v3 with our custom ortholog set for vertebrates resulted in a much higher score of 89.7% (209 orthologs detected as complete, out of 233). The difference between these scores has several explanations, including the compatibility of the gene prediction setting with the species of interest and the difference in the definition of “complete” ortholog detection between CEGMA and BUSCO. Points to consider in interpreting completeness scores are listed below as items (a) to (d).

Table 2 Comparison of completeness scores between the genome assemblies and their resultant gene models

Species / sequence type                        Source                            # of sequences   Complete   Complete or partial   Missing   Avg. per core gene   % multiple

African savanna elephant Loxodonta africana
  Genome                                       NCBI Loxafr3.0 GCA_000001905.1    2,352            220        227                   6         1.01                 0.91
  Gene models                                  GCF_000001905.1                   28,099           231        232                   1         2.00                 44.16

Naked mole-rat Heterocephalus glaber
  Genome                                       HetGla_1.0 Ensembl release 92     39,267           210        228                   5         1.03                 2.38
  Gene models                                  Ensembl release 92                41,267           226        228                   5         1.38                 29.2

Madagascar ground gecko Paroedura picta
  Genome                                       NCBI Ppicta_assembly_v1           110,906          211        227                   6         1.01                 0.95
  Gene models                                  Hara et al. 2018 (a)              27,039           214        230                   3         1.00                 0.47

Arctic lamprey Lethenteron camtschaticum
  Genome                                       NCBI LetJap1.0                    86,125           202        223                   10        1.01                 0.99
  Gene models                                  Kadota et al. 2017 (b)            34,435           210        230                   3         1.01                 1.43

Complete: # of orthologs identified as “complete”. Complete or partial: # of orthologs identified as “complete” or “partial”. Missing: # of orthologs identified as “missing”. Avg. per core gene: average # of orthologs per core gene. % multiple: % of detected core genes with multiple orthologs.

The assessments were executed on gVolante using BUSCO v2/v3 and the reference ortholog dataset CVG consisting of 233 genes. Gene models were assessed by inputting their deduced amino acid sequences.
(a) See literature by Hara et al. [10]
(b) See literature by Kadota et al. [11]

(a) CEGMA and BUSCO sometimes miss orthologs in a genome with long introns unless the gene prediction parameters are set specifically.

(b) Employing an ortholog set for a more specific taxon should yield a more accurate completeness score (e.g., the nematode ortholog set rather than the metazoan ortholog set, for assessing a nematode sequence set). Therefore, completeness scores obtained with different reference ortholog sets are not readily comparable between different assemblies or gene models.

(c) BUSCO tends to detect more “complete” genes than CEGMA, while CEGMA tends to detect a larger total gene count including both “complete” and “partial”/“fragmented” orthologs.

(d) Assessment of the amino acid sequences of predicted genes often yields a larger completeness score than assessment of the genomic sequences from which the predicted genes are derived (Table 2). Also, when the analyzed gene set contains multiple splicing variants per gene, completeness assessment is likely to yield a higher number of orthologs detected per reference gene than the assessment of genomic sequences.

A best practice for completeness assessment is to run assessments under multiple conditions. Scores that fluctuate across those runs are better interpreted by considering the factors listed above.

References

1. Veeckman E, Ruttink T, Vandepoele K (2016) Are we there yet? Reliably estimating the completeness of plant genome sequences. Plant Cell 28(8):1759–1768. https://doi.org/10.1105/tpc.16.00349

2. Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman JA, Chapuis G, Chikhi R, Chitsaz H, Chou WC, Corbeil J, Del Fabbro C, Docking TR, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca NA, Ganapathy G, Gibbs RA, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt JB, Ho IY, Howard J, Hunt M, Jackman SD, Jaffe DB, Jarvis ED, Jiang H, Kazakov S, Kersey PJ, Kitzman JO, Knight JR, Koren S, Lam TW, Lavenier D, Laviolette F, Li Y, Li Z, Liu B, Liu Y, Luo R, Maccallum I, Macmanes MD, Maillet N, Melnikov S, Naquin D, Ning Z, Otto TD, Paten B, Paulo OS, Phillippy AM, Pina-Martins F, Place M, Przybylski D, Qin X, Qu C, Ribeiro FJ, Richards S, Rokhsar DS, Ruby JG, Scalabrin S, Schatz MC, Schwartz DC, Sergushichev A, Sharpe T, Shaw TI, Shendure J, Shi Y, Simpson JT, Song H, Tsarev F, Vezzi F, Vicedomini R, Vieira BM, Wang J, Worley KC, Yin S, Yiu SM, Yuan J, Zhang G, Zhang H, Zhou S, Korf IF (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1):10. https://doi.org/10.1186/2047-217X-2-10

3. Parra G, Bradnam K, Ning Z, Keane T, Korf I (2009) Assessing the gene space in draft genomes. Nucleic Acids Res 37(1):289–297. https://doi.org/10.1093/nar/gkn916

4. Parra G, Bradnam K, Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23(9):1061–1067. https://doi.org/10.1093/bioinformatics/btm071

5. Simao FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210–3212. https://doi.org/10.1093/bioinformatics/btv351

6. Waterhouse RM, Seppey M, Simao FA, Manni M, Ioannidis P, Klioutchnikov G, Kriventseva EV, Zdobnov EM (2017) BUSCO applications from quality assessments to gene prediction and phylogenomics. Mol Biol Evol 35(3):543–548. https://doi.org/10.1093/molbev/msx319

7. Hara Y, Tatsumi K, Yoshida M, Kajikawa E, Kiyonari H, Kuraku S (2015) Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation. BMC Genomics 16:977. https://doi.org/10.1186/s12864-015-2007-1

8. Vijay N, Park C, Oh J, Jin S, Kern E, Woo Kim H, Zhang J, Park JK (2018) Population genomic analysis reveals contrasting demographic changes of two closely related dolphin species in the last glacial. Mol Biol Evol 35(8):2026–2033. https://doi.org/10.1093/molbev/msy108

9. Smith JJ, Timoshevskaya N, Ye C, Holt C, Keinath MC, Parker HJ, Cook ME, Hess JE, Narum SR, Lamanna F, Kaessmann H, Timoshevskiy VA, Waterbury CKM, Saraceno C, Wiedemann LM, Robb SMC, Baker C, Eichler EE, Hockman D, Sauka-Spengler T, Yandell M, Krumlauf R, Elgar G, Amemiya CT (2018) The sea lamprey germline genome provides insights into programmed genome rearrangement and vertebrate evolution. Nat Genet 50(2):270–277. https://doi.org/10.1038/s41588-017-0036-1

10. Hara Y, Takeuchi M, Kageyama Y, Tatsumi K, Hibi M, Kiyonari H, Kuraku S (2018) Madagascar ground gecko genome analysis characterizes asymmetric fates of duplicated genes. BMC Biol 16(1):40. https://doi.org/10.1186/s12915-018-0509-4

11. Kadota M, Hara Y, Tanaka K, Takagi W, Tanegashima C, Nishimura O, Kuraku S (2017) CTCF binding landscape in jawless fish with reference to Hox cluster evolution. Sci Rep 7(1):4957. https://doi.org/10.1038/s41598-017-04506-x

12. Irisarri I, Baurain D, Brinkmann H, Delsuc F, Sire JY, Kupfer A, Petersen J, Jarek M, Meyer A, Vences M, Philippe H (2017) Phylotranscriptomic consolidation of the jawed vertebrate timetree. Nat Ecol Evol 1(9):1370–1378. https://doi.org/10.1038/s41559-017-0240-5

13. Nishimura O, Hara Y, Kuraku S (2017) gVolante for standardizing completeness assessment of genome and transcriptome assemblies. Bioinformatics 33(22):3635–3637. https://doi.org/10.1093/bioinformatics/btx445

14. Kuraku S, Zmasek CM, Nishimura O, Katoh K (2013) aLeaves facilitates on-demand exploration of metazoan gene family trees on MAFFT sequence alignment server with enhanced interactivity. Nucleic Acids Res 41:W22–W28. https://doi.org/10.1093/nar/gkt389

15. Hara Y, Yamaguchi K, Onimaru K, Kadota M, Koyanagi M, Keeley SD, Tatsumi K, Tanaka K, Motone F, Kageyama Y, Nozu R, Adachi N, Nishimura O, Nakagawa R, Tanegashima C, Kiyatake I, Matsumoto R, Murakumo K, Nishida K, Terakita A, Kuratani S, Sato K, Hyodo S, Kuraku S (2018) Shark genomes provide insights into elasmobranch evolution and the origin of vertebrates. Nat Ecol Evol 2(11):1761–1771. https://doi.org/10.1038/s41559-018-0673-5

Chapter 16

Choosing the Best Gene Predictions with GeneValidator

Ismail Moghul, Anurag Priyam, and Yannick Wurm

Abstract

GeneValidator is a tool for determining whether the characteristics of newly predicted protein-coding genes are consistent with those of similar sequences in public databases. For this, it runs up to seven comparisons per gene. Results are shown in an HTML report containing summary statistics and graphical visualizations that aim to be useful for curators. Results are also presented in CSV and JSON formats for automated follow-up analysis. Here, we describe common usage scenarios of GeneValidator that use the JSON output results together with standard UNIX tools. We demonstrate how GeneValidator’s textual output can be used to filter and subset large gene sets effectively. First, we explain how low-scoring gene models can be identified and extracted for manual curation, for example, as input for genome browsers or gene annotation tools. Second, we show how GeneValidator’s HTML report can be regenerated from a filtered subset of GeneValidator’s JSON output. Subsequently, we demonstrate how GeneValidator’s GUI can be used to complement manual curation efforts. Additionally, we explain how GeneValidator can be used to merge information from multiple annotations by automatically selecting the higher-scoring gene model at each common gene locus. Finally, we show how GeneValidator analyses can be optimized when using large BLAST databases.

Key words Genome annotation, Gene prediction, Gene validation, GeneValidator

Ismail Moghul and Anurag Priyam contributed equally to this work.

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_16, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

Using accurate gene annotations is important because they affect subsequent analyses [1]. For some species, annotations can be downloaded directly from a public database such as Ensembl or NCBI [2]. For newly sequenced species, approaches to identify protein-coding genes in a genome sequence typically combine evidence from multiple data sources (including ab initio models, ESTs, RNA-seq, and protein alignments) [3–5]. Whether gene feature annotations are downloaded from a public database or are newly generated, they may contain errors resulting from biases of the underlying data, algorithmic choices [6], and the general limitations of a one-dimensional representation of DNA sequences. Common errors include frameshifts, incorrect exon–intron structure, incorrect merging of adjacent genes, and incorrect splitting of genes at long intron positions [7].

We previously described GeneValidator (GV), a tool to evaluate the quality of protein-coding gene predictions based on comparisons with a database of known proteins [8]. In brief (Fig. 1), GV first runs a BLAST search against the given database, retaining the sequences of hits with an e-value stronger than 10⁻⁵. Next, GV runs up to seven validations on each gene prediction. Each validation tests whether the characteristics of the query gene deviate from those of similar sequences in the reference database. Based on predefined thresholds, the result of each validation is a pass or a fail. The overall score of the prediction is a scaled percentage of the validations that passed. Predictions with a score lower than 75 (i.e., more than one failed validation) may be regarded as potentially problematic. An explanation of the approach and an overview of the data underlying each validation are included in the HTML report, along with several visualizations to facilitate interpretation. Detailed results are also available in CSV and JSON formats for spreadsheet and programmatic access.

Fig. 1 High-level schematic of the steps carried out by GeneValidator

Results produced by GV depend on the quality and coverage of the database used for validation. Furthermore, higher scores indicate consistency with database sequences, not biological truth. Several publicly available databases of protein sequences such as Swiss-Prot [9], UniRef50 [9, 10], TrEMBL [9], or NR [2] can be used with GV. The GV approach becomes increasingly reliable as proteomes of more species are submitted to these databases by the global research community, and as the quality of submitted sequences improves due to experimental validation, manual verification by experts, and technological and algorithmic advances in sequencing and automated gene prediction.

We created GV to be flexible. Many of GV’s features are designed to facilitate automatic processing of large gene sets (e.g., whole-genome annotation) as part of custom workflows. These include GV’s versatile JSON output, the ability to leverage HPC facilities, and the possibility to use advanced BLAST search options. GV also includes a web server that can be used as a shared resource. Here, we discuss five common use cases of GV that can be easily incorporated into custom workflows.

2 Installing and Running GeneValidator

GV runs on Linux and macOS. To install GV, run the command shown below. This will install GV and all its dependencies to a directory called “genevalidator” in the current working directory.

    sh -c "$(curl -fsSL https://install-genevalidator.wurmlab.com)"

The software includes example sequences to test the installation. The following command can be used to run GV on these example sequences with the included Swiss-Prot database. GV will print the results of validations for each gene prediction to the terminal, ending with a summary and the directory where detailed results were saved.

    genevalidator --db genevalidator/blast_db/swissprot \
      --num_threads 4 \
      genevalidator/exemplar_data/protein_data.fa

3 GeneValidator Workflows

A gene set will almost inevitably contain some gene predictions with low scores. It can be desirable to curate these manually. Here, we begin by providing two approaches to facilitate inspection of these low-scoring predictions. First (Subheading 3.1), we show how to use GV’s JSON output to extract the sequence identifiers of low-scoring gene predictions. Among other things, these can be used to subset the initial gene set, to prioritize inspection in a genome browser [11], or for annotation editing in a tool such as Apollo [12]. Second (Subheading 3.2), we show how to create a new HTML report by subsetting GV’s JSON output. This can reduce the need to navigate through a long HTML report. Subsequently (Subheading 3.3), we introduce GV’s graphical interface. This is helpful for rapidly viewing how GV’s validation results change during manual curation.

We also provide guidance on two more general challenges based on our applications of GV. First (Subheading 3.4), we show how GV can be used to automatically select the best gene model from multiple gene sets at each common gene locus. Furthermore (Subheading 3.5), we show how to restrict GV to use a specific subset of a BLAST database. This is to avoid BLAST searching against sequences unlikely to be informative.

3.1 Extracting Sequence Identifiers of Low-Scoring Gene Predictions

GV’s JSON output can be used with JQ (https://stedolan.github.io/jq/), a command-line JSON processor (included in the GV package), to select gene predictions matching a particular criterion and access validation results and associated metadata. In the example below, we extract, for manual curation, the identifiers of predictions that have a score lower than 75 (i.e., having failed more than one validation) and at least two BLAST hits. The idea is that while two BLAST hits are insufficient for GV’s statistical tests (and thus result in a low score), they may provide sufficient evidence for biologically interpreting whether the prediction could be appropriate.

1. Extract the FASTA headers of gene predictions that have at least two BLAST hits and an overall score of less than 75.

    jq --raw-output ".[] | select(.no_hits >= 2 and .overall_score < 75) | .definition" input_file_results.json \
        > sequence_definitions.txt

2. Extract the sequence identifier (first word of the FASTA header) using the cut command.

    cut -d " " -f 1 sequence_definitions.txt \
        > sequence_ids.txt
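For readers who prefer to post-process the JSON in a script, the two steps above can be sketched in Python. This is only an illustration, not part of GV: the field names (no_hits, overall_score, definition) are the ones used in the jq command above, while the function name and the example records are made up.

```python
import json


def low_scoring_ids(results, min_hits=2, max_score=75):
    """Return identifiers of predictions with enough BLAST hits but a low score."""
    ids = []
    for record in results:
        if record["no_hits"] >= min_hits and record["overall_score"] < max_score:
            # Like `cut -d " " -f 1`: keep the first word of the FASTA header.
            ids.append(record["definition"].split(" ")[0])
    return ids


# Made-up records standing in for the contents of input_file_results.json.
example_results = json.loads("""
[
  {"definition": "gene1 hypothetical protein", "no_hits": 3, "overall_score": 60},
  {"definition": "gene2 putative kinase", "no_hits": 1, "overall_score": 0},
  {"definition": "gene3 sugar transporter", "no_hits": 12, "overall_score": 90}
]
""")

print("\n".join(low_scoring_ids(example_results)))
```

To apply the sketch to real output, the records would be loaded with json.load from the results file instead of the inline example.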

3.2 Subsetting the HTML Report to Only Low-Scoring Gene Predictions

GV’s JSON output can be filtered using JQ and input back to GV to reproduce results for the selected gene predictions. This is useful to create smaller HTML reports, for example, focusing on a particular gene family. In the example below, we subset GV’s output for the low-scoring gene predictions selected in Subheading 3.1.

1. Select gene predictions that have at least two BLAST hits and an overall score of less than 75.

    jq "[ .[] | select(.no_hits >= 2 and .overall_score < 75) ] | sort_by(.overall_score)" input_file_results.json \
        > input_file_results_subset.json

2. Reproduce GV’s output.

    genevalidator --json input_file_results_subset.json

3.3 Using GeneValidator Web Server to Iteratively Refine Gene Models

Although running GV from the command line is ideal for processing large datasets and for custom workflows, a graphical user interface can facilitate iterative usage. For example, during manual curation of gene models, running GV repeatedly as a gene model is revised can help a curator verify that the changes they are making indeed improve the gene model. Building on the lessons learnt when developing the SequenceServer BLAST interface [13], we also built a graphical user interface (app) for GV that is accessible through a web browser.

1. Launching the GV app requires the path to a directory containing one or more BLAST databases; the interface (accessible at http://localhost:5678) is opened automatically in the default browser.

    genevalidator app --num_threads 4 \
        --database_dir genevalidator/blast_db/

2. To validate gene predictions, paste the corresponding FASTA sequences into the text area, select the database to compare to, and click “Analyse Sequences” (Fig. 2). The results are then shown on the same page.

We also host a GV web server at https://genevalidator.wurmlab.com with two caveats: first, it is suitable for up to ten queries at a time, and second, given computational constraints on this server, we only provide the Swiss-Prot and the UniRef50 databases.

3.4 Merging Gene Predictions from Two Different Sources

Different gene prediction approaches are unlikely to generate identical gene models for a locus. GV can be used to select the higher-scoring gene model for each locus from multiple gene sets. Briefly, we first identify annotations corresponding to the same locus from the different sources (steps 1–3 below). Subsequently, we generate a FASTA file containing alternative predictions for each locus and use GV’s “--select_single_best” option to select the higher-scoring one (step 4 below).

We make multiple simplifying assumptions to generate this mapping of annotations to loci. Specifically, we assume that we have a single transcript (splice form) per source per locus, that gene predictions from different loci do not overlap, and that annotations are available in a GFF3 format file. Often, additional preprocessing of gene sets will be necessary to fulfill these assumptions.

Fig. 2 A screenshot of the GeneValidator web application as launched from the command line via “genevalidator app” or by accessing https://genevalidator.wurmlab.com

1. Intersect the transcript annotations in the GFF3 files (requires prior installation of bedtools). We require that both overlapping features are on the same strand (“-s”). If comparing more than two GFF3 files, see the bedtools documentation (“-b” can take multiple values). The output file contains the entire input record from both input files (“-wa -wb”).

    awk '/\tmRNA\t/' geneset1.gff > geneset1_mrnas.gff
    awk '/\tmRNA\t/' geneset2.gff > geneset2_mrnas.gff
    bedtools intersect -wa -wb -s \
        -a geneset1_mrnas.gff -b geneset2_mrnas.gff \
        > geneset_overlaps.bed

2. Extract the GFF3 attributes columns (i.e., the 9th and 18th columns), which contain the sequence identifiers.

    awk '{printf ("%s;\t%s;\n", $9, $18)}' \
        geneset_overlaps.bed > attributes_columns.tsv

3. Extract the sequence identifiers from the attributes columns.

    perl -nle '@ids = /ID=(.*?);/g; print join("\t", @ids) if @ids' \
        attributes_columns.tsv > mapping_ids.tsv

4. Now that we have identifiers of the annotations corresponding to the same locus from both gene sets, their respective sequences can be extracted and then used with GV’s “--select_single_best” option.

(a) Create indexes for each of the FASTA files (requires prior installation of samtools).

    samtools faidx geneset1.fasta
    samtools faidx geneset2.fasta

(b) Create the output FASTA file.

    touch output.fa

(c) Loop over the “mapping_ids.tsv” file. Extract the FASTA sequence for each ID, and write both sequences to a temporary FASTA file. Run GV using the “--select_single_best” option on the temporary FASTA file. The “--select_single_best” mode prints the highest-scoring sequence to STDOUT in FASTA format, which is appended to the output file previously created.

    cat mapping_ids.tsv | while read -r line; do
        echo "$line" | cut -f 1 | \
            xargs samtools faidx geneset1.fasta \
            > gv_run_tmp.fa
        echo "$line" | cut -f 2 | \
            xargs samtools faidx geneset2.fasta \
            >> gv_run_tmp.fa
        genevalidator --select_single_best gv_run_tmp.fa \
            >> output.fa
        rm gv_run_tmp.fa
    done

It may be desirable to also include gene models that are unique to either set in the final output. We leave this as an exercise for the reader.
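As a rough illustration of steps 2 and 3, and of one possible starting point for the exercise above, the ID mapping can also be sketched in Python. The sketch assumes the tab-separated, 18-column layout produced by “bedtools intersect -wa -wb” on two GFF3 files; the function names and the example record are hypothetical and not part of GV.

```python
import re

# Matches an "ID=..." entry in a GFF3 attributes column.
ID_PATTERN = re.compile(r"(?:^|;)ID=([^;]+)")


def paired_ids(overlap_lines):
    """Return (id_in_set1, id_in_set2) for each overlapping pair of transcripts."""
    pairs = []
    for line in overlap_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 18:
            continue  # not a complete pair of GFF3 records
        first = ID_PATTERN.search(cols[8])    # 9th column: attributes of set 1
        second = ID_PATTERN.search(cols[17])  # 18th column: attributes of set 2
        if first and second:
            pairs.append((first.group(1), second.group(1)))
    return pairs


def unmatched_ids(all_ids, pairs, column):
    """Return IDs of one gene set (column 0 or 1) that overlap nothing in the other."""
    matched = {pair[column] for pair in pairs}
    return [seq_id for seq_id in all_ids if seq_id not in matched]


# Made-up overlap record: two mRNA features on the same scaffold and strand.
record_a = ["scf1", "toolA", "mRNA", "100", "900", ".", "+", ".", "ID=mrnaA;Parent=geneA"]
record_b = ["scf1", "toolB", "mRNA", "150", "950", ".", "+", ".", "ID=mrnaB;Parent=geneB"]
example_pairs = paired_ids(["\t".join(record_a + record_b) + "\n"])
print(example_pairs)
print(unmatched_ids(["mrnaA", "mrnaC"], example_pairs, column=0))
```

Sequences whose identifiers are returned by unmatched_ids could then be appended to the output FASTA file without running them through “--select_single_best”.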


3.5 Using NCBI’s Nonredundant Database of Protein Sequences with GV

While it is desirable to validate gene predictions against a gold standard database like Swiss-Prot, its limited coverage [9] makes this challenging for many species. At the same time, technological advances continue to increase the quality of automated predictions [14]. This makes it tempting to use a more comprehensive database such as NCBI’s nonredundant collection (NR) of manually reviewed as well as automatically generated protein sequences for validation. However, the large size of the NR database means BLAST searches can take days.

We show how to use BLAST’s ability to restrict searches to a list of identifiers [15] to accelerate a GV analysis. For this, we first restrict the BLAST search to a particular taxonomic lineage to avoid BLAST searching against sequences unlikely to be informative. Additionally, we exclude sequences from the focal species to avoid circular self-validation.

For the implementation below, we consider the example of the red fire ant, Solenopsis invicta [16]. We first obtain taxon identifiers of all species in Eukaryota (taxonomy id: 2759). Subsequently, we exclude all Solenopsis species (taxonomy id: 13685). We then obtain GenInfo identifiers (GI numbers) of all sequences in the retained taxa. We finally run GV using this list.

1. Obtain a list of eukaryotic taxon identifiers (this requires prior installation of TaxonKit [18]).

    taxonkit list --ids 2759 --indent "" \
        > taxon_ids_eukaryotes.txt

2. Obtain a list of Solenopsis taxon identifiers.

    taxonkit list --ids 13685 --indent "" \
        > taxon_ids_solenopsis.txt

3. Subtract the second list from the first.

    grep -Fvx -f taxon_ids_solenopsis.txt \
        taxon_ids_eukaryotes.txt > taxon_ids.txt

4. Download a tab-delimited file from NCBI linking taxon ids and GI numbers.

    curl -L -O ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz

5. Use csvtk (https://github.com/shenwei356/csvtk), a multithreaded CSV/TSV processor (packaged with GV), to extract the rows where the taxid is in the taxon_ids.txt file.

    zcat prot.accession2taxid.gz | \
        csvtk --tabs grep --fields taxid \
        --pattern-file taxon_ids.txt | \
        cut -f 4 | tail -n +2 > gi_list.txt

6. Finally, we pass this file to GV using the “--blast_options” option.

    genevalidator --blast_options "-gilist gi_list.txt" \
        --db nr --num_threads 40 geneset1.fa

Starting with BLAST+ version 2.8.0 (in development at the time of this writing), steps 4 and 5 can be skipped, and the list of taxon ids from step 3 can be passed directly to BLAST using the new “-taxidlist” option.
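The logic of steps 3 and 5 can be sketched in Python as follows. This is an illustration under stated assumptions rather than a replacement for the commands above: it assumes the header line of prot.accession2taxid (accession, accession.version, taxid, gi), uses hypothetical function names, and runs on a few made-up rows; the real file is far too large to hold in memory as a list.

```python
import csv


def retained_taxa(all_taxa, excluded_taxa):
    """Step 3: keep taxon ids that are not in the excluded set (like grep -Fvx)."""
    excluded = set(excluded_taxa)
    return [taxon for taxon in all_taxa if taxon not in excluded]


def gi_numbers(accession2taxid_lines, taxon_ids):
    """Step 5: return the gi column of rows whose taxid is in taxon_ids."""
    wanted = set(taxon_ids)
    reader = csv.DictReader(accession2taxid_lines, delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return [row["gi"] for row in reader if row["taxid"] in wanted]


# Made-up input: a tiny taxon list and three rows in accession2taxid layout.
eukaryotes = ["2759", "7227", "13685", "13686"]
solenopsis = ["13685", "13686"]
rows = [
    "accession\taccession.version\ttaxid\tgi\n",
    "AAA00001\tAAA00001.1\t13686\t111\n",
    "BBB00002\tBBB00002.1\t7227\t222\n",
    "CCC00003\tCCC00003.1\t562\t333\n",
]

taxa = retained_taxa(eukaryotes, solenopsis)
print(gi_numbers(rows, taxa))
```

For the real file, the lines would be streamed (for example, from gzip.open in text mode) rather than listed, and the resulting identifiers written one per line to gi_list.txt.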

4 Tips and Tricks

1. GV’s overall score is based on the percentage of validations that pass, i.e., where the score is above a threshold that we have determined to be appropriate. To emphasize the fact that GV results are highly dependent on the quality of information in databases and cannot be solely relied upon to classify a “perfect” gene prediction, the overall score is decreased by 10%. The highest possible score is thus 90%.

2. GV will run the validations provided there are at least five BLAST hits for a given prediction. This can be changed using the “--min_blast_hits” option. A higher number of BLAST hits will increase the relevance of the comparisons.

3. GV generates several summary statistics for the input gene set. These include first, second, and third quartiles of the overall scores, number of good and bad predictions, and number of predictions with insufficient BLAST hits. In addition to providing an overview of the quality of the input gene set, the summary statistics can be used to choose between predictions from two different sources.

4. GV includes a tool for downloading sequence databases from NCBI to use for comparisons (i.e., “genevalidator ncbi-blast-dbs”). This is a parallelized alternative to the “update_blastdb.pl” script included in the BLAST+ package.

5. GV is also able to run BLAST searches on NCBI servers using BLAST’s “-remote” option (e.g., “genevalidator --db 'swissprot -remote' geneset.fa”). This has the benefit of being able to immediately use the most up-to-date version of a given database. However, using a remote BLAST database is very slow. We recommend using this for validating only a few genes (e.g., fewer than 25).

6. It is possible to run BLAST independently and to subsequently provide the output in XML (“-outfmt 5”) or tab-delimited (“-outfmt 6”) format to GV. This can be particularly useful if BLAST results have already been produced for other analyses or when BLAST can be run on a cluster.

7. BLAST is often the slowest step of the GV pipeline, especially when working with large datasets. In such cases, DIAMOND [17] can be used instead of BLAST for (up to 20,000×) faster database searching. Since DIAMOND’s XML output is compatible with BLAST, it can be used directly with GV along with one additional input, i.e., a FASTA file of hit sequences (when used with BLAST, GV is able to automatically extract hit sequences from the BLAST database). Our wiki (https://github.com/wurmlab/genevalidator/wiki) provides detailed instructions for using GV with DIAMOND.

8. To resume a terminated analysis, GV can be run with the “--resume” option. In resume mode, GV skips previously successful steps, including running BLAST. Gene predictions that were successfully processed are skipped as well.

9. It is possible to split an input gene set into multiple chunks, run GV on each chunk across multiple compute nodes, and combine the results for each chunk into a single report.

(a) After splitting the input file and running GV on each input file, the following command can be used to merge the individually produced GV JSON files.

    cat */*.json | jq ".[]" | jq --slurp "." > MERGED_JSON

(b) The merged JSON can then be used to produce a single report for the whole gene set.

    genevalidator --json MERGED_JSON
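The merge in tip 9(a) simply concatenates the top-level arrays of the per-chunk JSON files into one array. A minimal Python sketch of the same idea is given below; the function names are hypothetical, and it assumes, as the jq command does, that every chunk file contains a JSON array of result records.

```python
import json


def merge_results(chunks):
    """Concatenate several lists of GV result records into a single list."""
    merged = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged


def merge_result_files(paths, merged_path):
    """Read one JSON array per input path and write the combined array."""
    chunks = []
    for path in paths:
        with open(path) as handle:
            chunks.append(json.load(handle))
    with open(merged_path, "w") as handle:
        json.dump(merge_results(chunks), handle)


# Made-up records from two chunks.
chunk_one = [{"definition": "gene1", "overall_score": 90}]
chunk_two = [{"definition": "gene2", "overall_score": 40},
             {"definition": "gene3", "overall_score": 70}]
print(len(merge_results([chunk_one, chunk_two])))
```

The order of records in the merged file follows the order of the input paths, which mirrors the behavior of concatenating the files with cat.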

Acknowledgments

This work was supported by the Natural Environment Research Council [grant NE/L00626X/1] and the Biotechnology and Biological Sciences Research Council [grant BB/K004204/1 and BB/M009513/1]. This research used Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT (https://doi.org/10.5281/zenodo.438045).


References

1. Yandell M, Ence D (2012) A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet 13:329–342
2. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Ostell J, Pruitt KD et al (2018) GenBank. Nucleic Acids Res 46:D41–D47
3. Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12:491
4. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32:767–769
5. Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J (2018) Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinformatics 19:189
6. Schnoes AM, Brown SD, Dodevski I, Babbitt PC (2009) Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 5:e1000605
7. Steijger T, Abril JF, Engström PG, Kokocinski F, RGASP Consortium, Hubbard TJ et al (2013) Assessment of transcript reconstruction methods for RNA-seq. Nat Methods 10:1177–1184
8. Drăgan M-A, Moghul I, Priyam A, Bustos C, Wurm Y (2016) GeneValidator: identify problems with protein-coding gene predictions. Bioinformatics 32(10):1559–1561
9. The UniProt Consortium (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res 45:D158–D169
10. Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH, The UniProt Consortium (2015) UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31:926–932
11. Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G et al (2016) JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol 17:66
12. Lee E, Helt GA, Reese JT, Munoz-Torres MC, Childers CP, Buels RM et al (2013) Web Apollo: a web-based genomic annotation editing platform. Genome Biol 14:R93
13. Priyam A, Woodcroft BJ, Rai V, Munagala A, Moghul I, Ter F et al (2015) Sequenceserver: a modern graphical user interface for custom BLAST databases. bioRxiv. https://doi.org/10.1101/033142
14. Minoche AE, Dohm JC, Schneider J, Holtgräwe D, Viehöver P, Montfort M et al (2015) Exploiting single-molecule transcript sequencing for eukaryotic gene prediction. Genome Biol 16:549
15. Bethesda (MD): National Center for Biotechnology Information (2008) BLAST® Command Line Applications User Manual [Internet] - Limiting a Search with a List of Identifiers. https://www.ncbi.nlm.nih.gov/books/NBK279673. Accessed 13 Sept 2018
16. Wurm Y, Wang J, Riba-Grognuz O, Corona M, Nygaard S, Hunt BG et al (2011) The genome of the fire ant Solenopsis invicta. Proc Natl Acad Sci U S A 108(14):5679–5684
17. Buchfink B, Xie C, Huson DH (2015) Fast and sensitive protein alignment using DIAMOND. Nat Methods 12:59–60
18. Shen W, Xiong J (2019) TaxonKit: a cross-platform and efficient NCBI taxonomy toolkit. bioRxiv. https://doi.org/10.1101/513523

Chapter 17

COGNATE: Comparative Gene Annotation Characterizer

Jeanne Wilbrandt

Abstract

Comprehensive structural characterization of protein-coding gene repertoires is a crucial step to identify differences and commonalities in comparative genomics contexts. This requires a descriptive set of standardized parameters as well as summary statistics of, e.g., gene lengths and exon counts. We developed the tool COGNATE to gather this data from a given structural annotation file in combination with the corresponding genome assembly with a single simple command line call. COGNATE relies on clearly stated parameter definitions and thus serves to enhance dataset comparability. Here, it is shown how the tool can be used; special attention is given to input formatting.

Key words Comparative genomics, Eukaryotic genes, Protein-coding, Gene annotation, Structural annotation, Gene structure

Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0_17, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1 Introduction

One of the first steps in the interpretation of genome sequences is the localization of protein-coding genes. This includes the delineation of canonical coding sequences (gene prediction sensu stricto, [1]), untranslated regions (UTRs), and introns (in eukaryotes); often, (intrinsic) transcriptomic and (extrinsic) proteomic evidence are incorporated (e.g., when using MAKER, [2]) in this process of structural annotation [3]. The modular organization of eukaryotic genes in exons and introns has elicited constant interest in the constraints, dynamics, and evolution of gene structures [4–7]. Comparative (meta-) analyses to uncover, for example, a correlation of gene expression, sequence conservation, and gene structure [8] rely on gene structure parameters. Often, published data is used [9], which may be incomplete or not readily comparable due to inexactness of terminology or divergent study aims [10]. This highlights the demand for standardized data extraction based on clearly stated definitions to obtain a comprehensive dataset of gene structure parameters. The tool COGNATE [10] has been developed to meet this need and provide an easy-to-use method of data collection. The terminology used by COGNATE has been described explicitly to prevent misunderstandings [10]. Providing COGNATE results upon genome publication can help to enhance comparability of protein-coding gene structure datasets.

COGNATE is a command-line tool that extracts an extensive set of basic genome and gene structure data. Thus, a primary description of a genome and its annotation of protein-coding genes is feasible with one call; it is also possible to analyze several genomes and annotations sequentially as a batch with a single call of COGNATE. The original publication features a graphical overview of the information flow in COGNATE (i.e., what is analyzed, measured, and output) [10].

For each annotation of protein-coding genes within a genome, COGNATE analyzes (all) the following (sub)units:

• Assembly and scaffolds
• Longest transcript per gene
• Exons
• CDSs (identical with exons if no UTRs are annotated)
• Introns

In general terms, the following categories of parameters are computed for each (sub)unit (a full list of the 296 parameters gathered by COGNATE can be found in [10]):

• GC content and CpG depletion (given as CpG o/e)
• Count
• Length
• Strandedness
• Density and coverage (ratios of feature covered by another, number- and length-wise)

Note that COGNATE provides median values in addition to means to account for the frequent non-normality of data distributions.

2 Setting Up COGNATE

2.1 Package and Requirements

The COGNATE package has been tested on local machines (Ubuntu 16.04, openSUSE 15.0) with Perl (v5.22.1, v5.26.1) built for x86_64-linux-gnu-thread-multi. COGNATE does not need any special user privileges, and the package contains the required Genome Annotation Library (GAL; Barry Moore, 2010–2015; https://github.com/The-Sequence-Ontology/GAL). It may be necessary to install additional Perl modules; the COGNATE package includes a helper script (/Utilities/COGNATE_CheckDep.pl) which checks the presence of required Perl modules. The package also contains a README file, which includes the information given here as well as further details.

2.2 Installation

It is assumed that Perl (>v5.22) is installed with all commonly present modules.

• Download the package from one of the following URLs:
  – https://www.zfmk.de/en/COGNATE
  – https://github.com/ZFMK/COGNATE
• Unpack the archive.
  – Leave the GAL directory where it is; no installation is needed.
• Test whether COGNATE is ready to run by moving into the package directory and executing:

    perl /COGNATE_v1.0/COGNATE_v1.01.pl --help

• If necessary: Install missing Perl modules (using, e.g., cpanm).
  – Optional: Check whether you are missing Perl modules by running:

    perl /COGNATE_v1.0/Utilities/COGNATE_CheckDep.pl

  – It is possible that modules are missing which CheckDep.pl did not look for. The respective error message (and advice) follows this example scheme:

    Can't locate Statistics/Basic.pm in @INC (you may need to install the Statistics::Basic module) [...]

3 Running COGNATE

3.1 Input Information

COGNATE was initially designed for the analysis of structural gene annotations of eukaryotes but should, in principle, also work for prokaryotes (see Note 1). As input, COGNATE requires two data files per species, namely, (a) the annotation of protein-coding genes within the respective genome in GFF3 format and (b) the nucleotide sequences (usually a genome assembly) used for annotation in FASTA format. If more than one species is to be analyzed as a batch, i.e., some of the results will be collected in one line per species in batch files, it is useful to provide a COGNATE.input file. Make sure that the files (GFF3, FASTA, COGNATE.input, described in the following) are properly formatted and that they correspond to each other (e.g., scaffold IDs of annotation and assembly match).


GFF3: A full description of the GFF3 format in gene annotation contexts has been issued by the Sequence Ontology (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). The first line usually contains the gff version information preceded by two “#”-symbols (also used for commenting out lines). The most important of the tab-separated columns when using COGNATE are column 1 (sequence ID/region), column 3 (type), columns 4 and 5 (start and end position), and column 9 (attributes). The sequence ID (usually a scaffold identifier) must correspond to the sequence ID used in the corresponding nucleotide sequence FASTA file. Types must include at least gene, mRNA or transcripts, and exon. Attributes have to give parent information so COGNATE can resolve the relationships between types. A minimal example is given in Fig. 1.

Fig. 1 GFF3 format as required by COGNATE, minimal example. The gff3 file format of gene annotation for one single-exon gene. The first line specifies the file format version, while the following lines are the actual annotation. The nine columns are tab-separated and indicate (in order): the region (scaffold ID), source (tool), type of annotation, start, stop, score, strand, phase (feature start with reference to the reading frame, used for coding sequences, i.e., feature type CDS), and attributes. Note that the attribute column contains not only an ID for each individual feature, but also parent information

FASTA: The FASTA format in which the annotated (genomic) sequence is to be provided follows the convention of a “>”-symbol indicating the header (sequence ID) of a sequence and the following line(s) being the sequence (until the next “>” is encountered). COGNATE expects to receive a nucleotide FASTA file, where the sequence can be either “interleaved” (broken over several lines of the same length; the last line may be shorter) or “sequential” (sequence on one line). A truncated example is depicted in Fig. 2. Frequently, FASTA headers include additional information (e.g., separated by “|”-symbols) beyond the sequence ID. In such cases, it is possible that FASTA and GFF3 files do not match and COGNATE will be unable to extract the necessary information. Identifying such cases is possible with COGNATE’s built-in file check (toggled on by default, can be skipped with the option -f). Depending on the case, truncating the FASTA headers to match the pattern found in the annotation file may solve incompatibilities.

Fig. 2 Fasta format, minimal example. Two truncated nucleotide sequences in fasta format as accepted by COGNATE. Note that the sequence ID (the header, indicated by “>”) matches the scaffold ID given in the gff3 example (Fig. 1). Sequences should be represented by IUPAC nucleic acid codes; lower-case letters are accepted

COGNATE.input: The COGNATE.input file allows the locations of annotation and sequence files to be specified for multiple species at the same time. The three tab-separated columns must contain name, GFF3 file, and FASTA file (including the full path), as exemplified in Fig. 3. The input file may contain comments (i.e., lines beginning with a “#”-symbol), which will be ignored by COGNATE.

Fig. 3 COGNATE.input file, minimal example. The COGNATE.input file can be used to provide file and name specifications (in three tab-separated columns) for multiple species or instances for a batch run. File paths need to be absolute, but the name may be omitted (i.e., the first column may remain empty) and will be set to a default name

3.2 Execution and Options

COGNATE requires full path declaration in all path specifications (GFF and FASTA file or COGNATE.input file and therein). A default COGNATE call (but providing a NAME with the -n option), using the example data provided with the package (see Note 2), looks like this (make sure to adapt the file paths):

    perl COGNATE_v1.01.pl \
        --gff home/COGNATE_v1.01/Example_data/Dmel_example_annotation.gff \
        --fasta home/COGNATE_v1.01/Example_data/Dmel_example_scaffolds.fna \
        -n dmel_example

For several species (using the input file provided with the package, which analyzes the example three times), COGNATE will be executed like this:

    perl COGNATE_v1.01.pl \
        --input home/COGNATE_v1.01/Example_data/triple-dmel-ex_COGNATE.input \
        -f --batch dmel-3_example

One option that largely influences measurement results is the transcript length. COGNATE analyzes only one transcript per gene, which can be chosen to be the shortest, median, or longest (default) one (option --length). There are options to control output file generation; see Subheading 3.4.


COGNATE provides the option to check for overlapping gene annotations (--check_overlaps; results are provided in the summary file; see Note 3). If it is toggled on, COGNATE requires up to 5 h to analyze an annotation comprising ca. 18,000 genes. For larger gene sets, a run including overlap checking may take several days. Without overlap checking, the same annotation of 18,000 genes can be analyzed within 35 min.

3.3 Internal Definitions

• A median of means or medians for a certain parameter results from the calculation of the median/mean values per structure entity, which in turn were calculated for all sub-structures, for example, coding sequence (CDS) length for one CDS -> median CDS length per transcript for one transcript -> median of median CDS length per transcript for the whole annotation.

• For GC content, two types are calculated: total (GC/length) and (non-)ambiguity (GCS/(length − NRYKMBDHV)) [noAm]. The latter GC content is not dependent on assembly quality. Both types include softmasked sequences. Thus, “GC content without ambiguity” means that the length of a nucleotide sequence was tallied excluding the bases N, R, Y, K, M, B, D, H, V (IUPAC codes for all bases that are not G or C (S = G/C)) and the GC content was calculated as count of G, C, S/length without ambiguity.

• Protein length is calculated without stop codon (*).

• COGNATE only evaluates one transcript per gene, leaving out alternative mRNAs for one gene although they are given in NCBI gffs. The number of these alternative mRNAs is recorded for each gene, though.

• Count of alternative spliceforms (files 01, 07) = count of annotated transcripts (mRNA) per gene minus 1.

• L90[genes] (file 01) is the count of the largest scaffold/contig sequences (SCS) required to find at least 90% of the annotated genes analyzed by COGNATE.

• CpG o/e (observed/expected) measures CpG-dinucleotide depletion and is calculated as (frequency of CG/(frequency of C × frequency of G)), where the frequency is the count of a (di)nucleotide in a sequence/length of this sequence.

• Count of strand-mix genes (file 01) = count of transcripts where a CDS/exon/intron differs from the transcript’s strand.

• Strands of transcripts (file 07) are given as +/−. Individual strandedness for CDSs/exons/introns can be found in the respective file (11/12/13) and is given as +/−, or, in case of a conflict with the transcript strandedness, as +!/−!.

• Transcripts on +:− strand (file 03) = percentage of all transcripts on + strand and percentage of all transcripts on − strand (normalization against total count of non-isoform transcripts).
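To make some of these definitions concrete, the following Python sketch restates the GC content, CpG o/e, and L90[genes] definitions given above. It is an independent illustration, not COGNATE’s Perl implementation; the function names are hypothetical, sequences are upper-cased so that softmasked bases are included, and the handling of empty or C/G-free sequences is an assumption.

```python
AMBIGUOUS = set("NRYKMBDHV")  # codes excluded from the length "without ambiguity"


def gc_content_total(seq):
    """Total GC content: count of G and C divided by the full sequence length."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_content_noam(seq):
    """GC content without ambiguity: count of G, C, S over the non-ambiguous length."""
    seq = seq.upper()
    length = sum(1 for base in seq if base not in AMBIGUOUS)
    if length == 0:
        return 0.0
    return (seq.count("G") + seq.count("C") + seq.count("S")) / length


def cpg_oe(seq):
    """CpG o/e: frequency of CG / (frequency of C x frequency of G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return None  # undefined without both C and G
    n = len(seq)
    return (seq.count("CG") / n) / ((c_count / n) * (g_count / n))


def l90_genes(scaffolds):
    """Number of largest scaffolds needed to cover at least 90% of all genes.

    scaffolds: list of (scaffold_length, gene_count) tuples.
    """
    total = sum(genes for _, genes in scaffolds)
    if total == 0:
        return 0
    found = 0
    ranked = sorted(scaffolds, key=lambda item: item[0], reverse=True)
    for rank, (_, genes) in enumerate(ranked, start=1):
        found += genes
        if found * 10 >= total * 9:  # integer form of found/total >= 0.9
            return rank
    return len(ranked)


print(gc_content_noam("gcnnat"), cpg_oe("CGCG"), l90_genes([(1000, 5), (500, 4), (100, 1)]))
```

In the last example, the two largest scaffolds hold nine of ten genes, so two scaffolds suffice to reach the 90% threshold.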

3.4 Output

As output, COGNATE generates by default 14 files for a single annotation and 7 files in batch mode. Five of these seven accumulate summary lines for the analyzed batch. The batch may also consist of only one species and may be addressed by multiple/consecutive runs. Using the same batch name in consecutive runs leads to appending new lines to the present batch files. Most of the remaining result files store measurements for individual gene structure elements. Speaking generally, the following file types are generated:

• Overviews (summaries) of measured variables
• Lists of all measured variables referring to features of a given (sub)unit
• Batch files with one line of summary statistics per analyzed annotation

Specifically, COGNATE saves its output to the working directory (which can also be specified with the option --workingdir /DIR/); see Table 1 for an overview of saving locations, files, and file contents. It is possible to restrict the output file generation to certain files (by either excluding or explicitly including file IDs with the options --print or --dont_print). By default, COGNATE will ask whether an existing output directory (COGNATE_NAME) shall be overwritten (toggle off with the option --overwrite); if this is declined (by typing “no” on the command line), COGNATE will quit. Overwriting does not affect the batch files saved to the working directory; new results will be appended to these files if the same batch name is used; otherwise, new batch files will be generated. The example calls specified above (Subheading 3.2) will thus by default not overwrite the existing example output that is available for comparison (directory /Example_results/).

COGNATE has been designed to produce output that can be used for downstream analyses like statistical testing or visualization. As an example, a minimal working example R script is provided here (Fig. 4) to obtain a jitter plot (see Note 4) comparing median and mean exon length per transcript of the example data with default appearance (Fig. 5).

4 Notes

1. During development, COGNATE was tested and used with NCBI RefSeq data as well as with MAKER2 [2] and BRAKER2 [12] output. All genomes and annotations belonged to insect species. COGNATE expects genes, mRNAs, and exons to be present (specified in the type column in the GFF), introns may be implicit. COGNATE currently does not analyze UTR

Table 1 Overview of saving locations, files, and file contents. NAME will be replaced by a user-specified name tag or given automatically (scheme: genome_ID, where ID is automatically increased in batch runs)

File ID | In the working directory | In the COGNATE directory | Description
 | COGNATE_NAME/ | | This is the COGNATE directory, it is automatically generated. NAME is either user-specified or defaults to genome_ID
00 | | COGNATE_NAME_00_analyzed_transcripts.fa | FASTA with translations (aa) of the longest transcript per gene
01 | | COGNATE_NAME_01_summary.tsv | Human readable overview of the analyzed parameters, including min, max, stdev, and other statistics
02 | | COGNATE_NAME_02_scaffold_general.tsv | Scaffold-specific general parameters, one scaffold per line (for each scaffold: length, GC content)
03 | | COGNATE_NAME_03_scaffold_transcripts.tsv | Scaffold-specific parameters regarding transcripts, one scaffold per line (for all transcripts per scaffold: count, added length, density, coverage, strand occupation)
04 | | COGNATE_NAME_04_scaffold_CDSs.tsv | Scaffold-specific parameters regarding CDSs, one scaffold per line (for all CDSs per scaffold: count, added length, density, coverage)
05 | | COGNATE_NAME_05_scaffold_exons.tsv | Scaffold-specific parameters regarding exons, one scaffold per line (for all exons per scaffold: count, added length, density, coverage)
06 | | COGNATE_NAME_06_scaffold_introns.tsv | Scaffold-specific parameters regarding introns, one scaffold per line (for all introns per scaffold: count, added length, density, coverage)
07 | | COGNATE_NAME_07_transcript_general.tsv | Transcript-specific general parameters, one transcript per line (for each transcript: length, GC content, CpG o/e, strand, protein length, count of alternative spliceforms)
10 | | COGNATE_NAME_10_transcript_introns.tsv | Transcript-specific parameters regarding introns, one transcript per line (for all introns per transcript: count, added length, median and mean length, median and mean GC content, median CpG o/e, coverage, density)
11 | | COGNATE_NAME_11_CDSs.tsv |
12 | | COGNATE_NAME_12_exons.tsv |
13 | | COGNATE_NAME_13_introns.tsv | Intron-specific parameters, one intron per line (for each intron: length, GC content, CpG o/e, strand)
14 | COGNATE_NAME_14_batch_general.tsv | |
15 | COGNATE_NAME_15_batch_scaffoldmeans.tsv | |
16 | COGNATE_NAME_16_batch_scaffoldmedians.tsv | |

10

14

COGNATE_NAME_09_transcript_ Transcript-specific parameters regarding exons, one transcript exons.tsv per line (for all exons per transcript: count, added length, median and mean length, median and mean GC content, median CpG o/e, coverage, density)

09

(continued)

Medians of scaffold-specific parameters, one annotation/ species per line (for all scaffolds per annotation: length; GC content; count, added length, coverage, density for all transcripts/CDSs/exons/introns per scaffold)

Means of scaffold-specific parameters, one annotation/species per line (for all scaffolds per annotation: length; GC content; count, added length, coverage, density for all transcripts/ CDSs/exons/introns per scaffold)

General parameters regarding assembly and annotation, one annotation/species per line (size, GCs, CpG o/e, N and L statistics, counts, intron deviation measures)

Exon-specific parameters, one exon per line (for each exon: length, GC content, CpG o/e, strand)

CDS-specific parameters, one CDS per line (for each CDS: length, GC content, CpG o/e, strand)

COGNATE_NAME_08_transcript_ Transcript-specific parameters regarding CDSs, one transcript CDSs.tsv per line (for all CDSs per transcript: count, added length, median and mean length, median and mean GC content, median CpG o/e, coverage, density)

08

COGNATE – Comparative Gene Annotation Characterizer 277

COGNATE_NAME_17_batch_transcriptmeans.tsv

COGNATE_NAME_18_batch_transcriptmedians.tsv

COGNATE_NAME_19_batch_component_ sizes.tsv

COGNATE_NAME_20_batch_bashcommands.txt

17

18

19

20

File ID In the working directory

Table 1 (continued)

In the COGNATE directory

Bash commands to execute BUSCO v1.1 [11] using the generated file 00 as input

Assembly size, coding and intronic component sizes in Mbp and % per assembly/annotation, one annotation per line

Medians of transcript-specific parameters, one annotation/ species per line (for all transcripts per scaffold: length; GC content; count, added length, coverage, density for all CDSs/exons/introns per transcript)

Means of transcript-specific parameters, one annotation/ species per line (for all transcripts per scaffold: length; GC content; count, added length, coverage, density for all CDSs/exons/introns per transcript)

Description

278 Jeanne Wilbrandt

COGNATE – Comparative Gene Annotation Characterizer

279

Fig. 4 R script for plotting median and mean exon length per transcript, minimal example. This minimal working R script exemplifies how COGNATE output can be read, reshaped, and plotted for visualization. In this case, median and mean exon length for the four transcripts given in the example provided with COGNATE are used to produce the plot depicted in Fig. 5

Fig. 5 Plot of median and mean exon length per transcript. The result of plotting median and mean exon length per transcript for the COGNATE example data with the R script shown in Fig. 4 with default appearance parameters. R allows extensive customization of plots, for example with the label(), theme(), or scale_color_manual() functions

280

Jeanne Wilbrandt

annotations since this feature is usually not available in non-model organisms. Annotations from other sources were often incompatible due to missing annotation elements (e.g., no exons or no parent information in the comment column of the GFF) or identifier mismatches. Reformatting may help in such cases. 2. COGNATE example data can be found in the directory/ Example_data/ and consists of the annotation and assembled sequence of three scaffolds (NW_007931083.1, NW_001846187.1, NW_007931104.1) of the NCBI RefSeq data for Drosophila melanogaster. The original files (GCF_000001215.4_Release_6_plus_ISO1_MT_genomic. gff.gz, GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz) are available from ftp://ftp.ncbi.nlm.nih.gov/ genomes/all/GCF/000/001/215/GCF_000001215.4_ Release_6_plus_ISO1_MT/. 3. Note that COGNATE can check for and record overlapping transcripts (toggled on with the -c option; find them in the 01_summary file), which might be relevant in downstream analyses. COGNATE also warns (standard output, i.e., in the terminal) when an intron is encountered that does not contain any sequence information (either consisting of N bases or has 0 length). This information is not recorded by default but can be captured and stored by redirecting the standard output (from terminal to a file, appending “>> log.txt 2>&1” to the COGNATE call). 4. An excellent reference for plotting in R using the ggplot2 package [13] can be found at STHDA (ggplot2 Essentials: STHDA. http://www.sthda.com/english/wiki/ggplot2essentials).
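The redirection described in Note 3 can also be done from a script, which is convenient in batch runs. The following Python sketch appends both standard output and standard error of a command to a log file, mirroring ">> log.txt 2>&1". The command used here is a stand-in chosen only so that the example runs anywhere; the function name run_and_log and the log location are assumptions, and the actual COGNATE call would have to be supplied as a list of arguments.

```python
import os
import subprocess
import sys
import tempfile


def run_and_log(command, log_path):
    """Run a command and append its stdout and stderr to log_path
    (the scripted equivalent of '>> log.txt 2>&1')."""
    with open(log_path, "a") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


# Stand-in command (an assumption for illustration only); replace it with
# the actual COGNATE call, given as a list of arguments.
stand_in = [sys.executable, "-c", "print('example warning line')"]

# Hypothetical log location in a temporary directory.
log_file = os.path.join(tempfile.mkdtemp(), "log.txt")
exit_code = run_and_log(stand_in, log_file)

with open(log_file) as handle:
    logged_text = handle.read()
```

Because the log is opened in append mode, repeated calls accumulate in the same file, so one log can collect the warnings of several runs.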

Acknowledgments

I would like to acknowledge the help of Barry Moore with implementing specific functions in GAL/COGNATE. Thanks also go to Oliver Niehuis and Bernhard Misof for their input to the original COGNATE publication, as well as to Hannes Jäkel, Jan Philip Oeyen, Malte Petersen, and Tanja Ziesmann for their help.

References

1. Zhang MQ (2002) Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet 3(9):698–709
2. Holt C, Yandell M (2011) MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics 12(1):491
3. Hoff K, Stanke M (2015) Current methods for automated annotation of protein-coding genes. Curr Opin Insect Sci 7:8–14
4. Hawkins JD (1988) A survey on intron and exon lengths. Nucleic Acids Res 16(21):9893–9908
5. Lynch M (2006) The origins of eukaryotic gene structure. Mol Biol Evol 23(2):450–468
6. Zhu L, Zhang Y, Zhang W, Yang S, Chen J-Q, Tian D (2009) Patterns of exon-intron architecture variation of genes in eukaryotic genomes. BMC Genomics 10:47
7. Bonnet A, Grosso AR, Elkaoutari A, Coleno E, Presle A, Sridhara SC et al (2017) Introns protect eukaryotic genomes from transcription-associated genetic instability. Mol Cell 67(4):608–621.e6
8. Waterhouse RM, Zdobnov EM, Kriventseva EV (2011) Correlating traits of gene retention, sequence divergence, duplicability and essentiality in vertebrates, arthropods, and fungi. Genome Biol Evol 3:75–86
9. Elliott TA, Gregory TR (2015) What's in a genome? The C-value enigma and the evolution of eukaryotic genome content. Philos Trans R Soc Lond B Biol Sci 370(1678):20140331
10. Wilbrandt J, Misof B, Niehuis O (2017) COGNATE: comparative gene annotation characterizer. BMC Genomics 18(1):535
11. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM (2015) BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31(19):3210–3212
12. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M (2016) BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32(5):767–769
13. Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer, New York. Available from: http://ggplot2.org


Martin Kollmar (ed.), Gene Prediction: Methods and Protocols, Methods in Molecular Biology, vol. 1962, https://doi.org/10.1007/978-1-4939-9173-0, © Springer Science+Business Media, LLC, part of Springer Nature 2019

