Practical Applications of Computational Biology & Bioinformatics, 15th International Conference (PACBB 2021) [325, 1 ed.] 3030862577, 9783030862572

This book features novel research papers spanning many different subfields in bioinformatics and computational biology,

English Pages 188 Year 2021


Table of contents :
Preface
Organization
Program Committee Chairs
General Co-chairs
Advisory Committee
Organizing Committee
Program Committee
PACBB 2021 Sponsors
Contents
Computational Methods for the Identification of Genetic Variants in Complex Diseases
1 Introduction
2 Methods
2.1 Data Preparation
2.2 Gene Selection
2.3 Feature Extraction and Reduction
3 Results and Discussion
4 Conclusion
References
Using Reduced Amino-Acid Alphabets and Simulated Annealing to Identify Antimicrobial Peptides
1 Introduction
2 Methods
2.1 Alphabet Encoding
2.2 k-mer Database
2.3 Performance Metrics
2.4 Simulated Annealing
3 Results and Discussion
4 Conclusion
References
Acufenometry in the Self-management of Tinnitus: A Revised Interface to Improve the User Experience
1 Introduction
2 Acufenometry
3 Study
3.1 Study Design
3.2 Results
4 Discussion and Conclusions
References
The pegi3s Bioinformatics Docker Images Project
1 Introduction
2 Related Work
3 The pegi3s Bioinformatics Docker Images Project
3.1 Docker Images
3.2 Pipeline Development
3.3 Containerization of Applications
4 Discussion
References
On the Reproducibility of MiRNA-Seq Differential Expression Analyses in Neuropsychiatric Diseases
1 Introduction
2 Materials and Methods
2.1 Data Acquisition and Identification of the Samples
2.2 Download of the Genome and Annotation Files
2.3 MiARma-Seq Analysis
2.4 Quality Control
2.5 Application of the Same Statistical Criteria of the Original Articles to miARma-Seq Results
2.6 Comparison Between the Original Results and miARma-Seq Results
3 Results and Discussion
3.1 Quality Control
3.2 Application of the Same Statistical Criteria of the Original Articles to MiARma-Seq Results
3.3 Comparison Between the Original Results and miARma-Seq Results
4 Conclusion
References
Computational Tools for the Analysis of 2D-Nuclear Magnetic Resonance Data
1 Introduction
2 Methods
2.1 Data Reading
2.2 Data Visualization
2.3 Dimension Reduction
2.4 Further Analysis
3 Results and Discussion
3.1 Tomato Fruit Extracts
3.2 Worm (Caenorhabditis Elegans) Metabolome
4 Conclusion
References
Recurrent Deep Neural Networks for Enzyme Functional Annotation
1 Introduction
2 Datasets and Deep Learning Models
3 Influence of Sequence Length and Type of Truncation
4 Influence of Network Type and Attention Addition
5 Influence of Different Aminoacid Encoding Schemes
6 Discussion and Conclusions
References
Assessing the Impact of Data Set Enrichment to Improve Drug Sensitivity in Cancer
1 Introduction
2 Materials and Methods
2.1 Original Data and the Study Data Sets
2.2 Methods
3 Results and Discussion
3.1 Regression Analysis Results
3.2 Classification
3.3 Graph Mining
3.4 Classification Analysis Results Using ILP
4 Conclusions
References
Deep Neural Network to Curate LTR Retrotransposon Libraries from Plant Genomes
1 Introduction
2 Materials and Methods
2.1 Creation of Plant LTR-RTs Sequence Dataset
2.2 Experimental Analysis Using ML Models
2.3 Experiments Using Deep Neural Networks.
2.4 Generalization Tests
2.5 Hardware Specifications
3 Results
3.1 Descriptive Analysis of the Dataset
3.2 Design of a Deep Neural Network Based Model to Detect Non-intact LTR Retrotransposon Sequences
3.3 Test for the Generalization of the Implemented Model
4 Discussion
5 Conclusion
References
A Hybrid of Bees Algorithm and Regulatory On/Off Minimization for Optimizing Lactate Production
1 Introduction
2 A Hybrid of Bees Algorithm and Regulatory on/Off Minimization
2.1 Bee Representation of Metabolic Genotype
2.2 Initialization of the Population
2.3 Scoring Fitness of Individuals
2.4 Neighbourhood Search
2.5 Randomly Assigned and Termination
3 Experimental Results
3.1 Experimental Result and Discussion for Lactate
3.2 Comparative Analysis for Lactate Case
3.3 Performance Measurement for Lactate Case
4 Conclusion and Future Works
References
A Study on Burrows-Wheeler Aligner’s Performance Optimization for Ancient DNA Mapping
1 Introduction
2 Methodology
2.1 Ancient DNA Sequence Data
2.2 Sequence Reads Processing and Alignment
2.3 Variant Calling and Filtering
3 Results and Discussion
3.1 Runtime and Memory Usage Differences Between Strategies
3.2 Impact on the Identification of Endogenous Reads
3.3 Impact on Variant Calling
3.4 BWA-aln VS BWA-MEM on Accurate and Effective Mapping
4 Conclusions and Future Prospects
References
BioTMPy: A Deep Learning-Based Tool to Classify Biomedical Literature
1 Introduction
2 Package Description and Implementation
3 Validation
3.1 Dataset and Challenge Overview
3.2 Preprocessing
3.3 Data Analysis
3.4 Embeddings
3.5 Evaluation
4 Discussion and Conclusions
References
May Gender Have an Impact on Methylation Profile and Survival Prognosis in Acute Myeloid Leukemia?
1 Background
2 Materials and Methods
2.1 Data
2.2 Detection of Differentially Methylated CpG Sites and Genomic Regions
2.3 Survival Analysis
3 Results and Discussion
3.1 Principal Component Analysis
3.2 Detection of Differentially Methylated CpG Sites
3.3 Detection of Differentially Methylated Genomic Regions
3.4 Survival Analysis
4 Conclusions
References
Towards a Multivariate Analysis of Genome-Scale Metabolic Models Derived from the BiGG Models Database
1 Introduction
2 Results and Discussion
2.1 Genomes’ Comparative Functional Analysis
2.2 Models’ Analysis
3 Conclusion
4 Materials and Methods
4.1 Genomes’ Comparative Functional Analysis
4.2 Draft Models
4.3 Multivariate Analysis
5 Supplementary Materials
References
A Comparison of Different Compound Representations for Drug Sensitivity Prediction
1 Introduction
2 Methods
2.1 Data Sets
2.2 Models
2.3 Model Training and Evaluation
2.4 DeepMol
3 Results and Discussion
4 Conclusion
References
Combinatorial Optimization of Succinate Production in Escherichia coli
1 Introduction
2 Strain Optimization with MEWpy
2.1 Constraint-Based Modeling
2.2 Stoichiometric Model
2.3 GECKO-Like Model
2.4 E(T)FL Model
2.5 Optimization Setup
3 Analysis of Solutions Distributions
4 Illustrative Solutions
5 Conclusion
References
Predicting Adverse Drug Reactions from Drug Functions by Binary Relevance Multi-label Classification and MLSMOTE
1 Introduction
2 Related Work
3 Dataset and the Proposed Methodology
3.1 Problem Statement
3.2 Dataset
3.3 The Proposed Methodology
4 Experimental Setup and Results
4.1 Evaluation Metrics
4.2 Experimental Setup
4.3 Experimental Results and Discussion
5 Conclusion
References
Author Index

Lecture Notes in Networks and Systems 325

Miguel Rocha Florentino Fdez-Riverola Mohd Saberi Mohamad Roberto Casado-Vara   Editors

Practical Applications of Computational Biology & Bioinformatics, 15th International Conference (PACBB 2021)

Lecture Notes in Networks and Systems Volume 325

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation (DCA), School of Electrical and Computer Engineering (FEEC), University of Campinas (UNICAMP), São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

More information about this series at http://www.springer.com/series/15179


Editors

Miguel Rocha, Departamento de Informática, Universidade do Minho, Braga, Portugal
Florentino Fdez-Riverola, Escuela Superior de Ingeniería Informática, Universidade de Vigo, Ourense, Spain
Mohd Saberi Mohamad, Department of Genetics and Genomics, United Arab Emirates University, Abu Dhabi, United Arab Emirates
Roberto Casado-Vara, BISITE, Digital Innovation Hub, University of Salamanca, Salamanca, Spain

ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-030-86257-2 ISBN 978-3-030-86258-9 (eBook) https://doi.org/10.1007/978-3-030-86258-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The success of bioinformatics in recent years has been prompted by research in molecular biology and molecular medicine in several initiatives. These initiatives gave rise to an exponential increase in the volume and diversification of data, including nucleotide and protein sequences and annotations, high-throughput experimental data and biomedical literature, among many others. Systems biology is a related research area that has been replacing the reductionist view that dominated biology research in the last decades, requiring the coordinated efforts of biological researchers with those working on data analysis, mathematical modeling, computer simulation and optimization.

The accumulation and exploitation of large-scale databases call for new computational technology and for research into these issues. In this context, many widely successful computational models and tools used by biologists in these initiatives, such as clustering and classification methods for gene expression data, are based on computer science/artificial intelligence (CS/AI) techniques. In fact, these methods have been helping in knowledge discovery, modeling and optimization tasks, aiming at the development of computational models so that the response of complex biological systems to any perturbation can be predicted.

The 15th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB) aims to promote interaction among the scientific community to discuss applications of CS/AI with an interdisciplinary character, exploring the interactions between sub-areas of CS/AI, bioinformatics, chemoinformatics and systems biology.
The PACBB’21 technical program includes 17 papers by authors from many different countries (Australia, Colombia, Egypt, Germany, India, Malaysia, Portugal, Saudi Arabia, Slovakia, South Korea, Spain, Switzerland, Turkey, United Arab Emirates, UK and USA) and from different subfields in bioinformatics and computational biology. There will be special issues in JCR-ranked journals, such as Interdisciplinary Sciences: Mathematical Biosciences and Engineering, Integrative Bioinformatics, Information Fusion, Neurocomputing, Sensors, Processes and Electronics. Therefore, this event will strongly promote interaction among researchers from international research groups working in diverse fields. The scientific content will be innovative, and it will help improve the valuable work that is being carried out by the participants.

This symposium is organized by the University of Salamanca with the collaboration of the United Arab Emirates University, the University of Minho and the University of Vigo. We would like to thank all the contributing authors, the members of the program committee and the sponsors IBM, Indra, AEPIA, APPI, AIIS, EurAI and AIR Institute. We acknowledge the funding support of the project “Intelligent and sustainable mobility supported by multi-agent systems and edge computing” (Id. RTI2018-095390-B-C32), and finally, we thank the local organization members and the program committee members for their valuable work, which is essential for the success of PACBB’21.

Miguel Rocha
Florentino Fdez-Riverola
Mohd Saberi Mohamad
Roberto Casado-Vara

Organization

Program Committee Chairs

Mohd Saberi Mohamad, United Arab Emirates University, United Arab Emirates
Miguel Rocha, University of Minho, Portugal

General Co-chairs

Florentino Fdez-Riverola, University of Vigo, Spain
Roberto Casado Vara, University of Salamanca, Spain

Advisory Committee

Grabriella Panuccio, Istituto Italiano di Tecnologia, Italy

Organizing Committee Juan M. Corchado Rodríguez Roberto Casado Vara Fernando De la Prieta Sara Rodríguez González Javier Prieto Tejedor Pablo Chamoso Santos Belén Pérez Lancho Ana Belén Gil González Ana De Luis Reboredo Angélica González Arrieta

University of Salamanca, AIR Institute, Spain University of Salamanca, University of Salamanca, University of Salamanca, University of Salamanca, AIR Institute, Spain University of Salamanca, University of Salamanca, University of Salamanca, University of Salamanca, University of Salamanca,

Spain Spain Spain Spain Spain Spain Spain Spain Spain Spain

Emilio S. Corchado Rodríguez Alfonso González Briones Yeray Mezquita Martín Javier J. Martín Limorti Alberto Rivas Camacho Elena Hernández Nieves Beatriz Bellido María Alonso Diego Valdeolmillos Sergio Marquez Marta Plaza Hernández David García Retuerta Guillermo Hernández González Ricardo S. Alonso Rincón Javier Parra

University of Salamanca, Spain University of Salamanca, University of Salamanca, University of Salamanca, University of Salamanca, University of Salamanca, University of Salamanca, University of Salamanca, AIR Institute, Spain University of Salamanca, University of Salamanca, University of Salamanca, AIR Institute, Spain

Spain Spain Spain Spain Spain Spain Spain Spain Spain Spain

University of Salamanca, Spain University of Salamanca, Spain

Program Committee

Vera Afreixo, University of Aveiro, Portugal
Manuel Álvarez Díaz, University of A Coruña, Spain
Carlos Bastos, University of Aveiro, Portugal
Lourdes Borrajo, University of Vigo, Spain
Ana Cristina Braga, University of Minho, Portugal
Fernanda Brito Correia, DEIS/ISEC/Polytechnic Institute of Coimbra, Portugal
Rui Camacho, University of Porto, Portugal
Angel Canal, University of Salamanca, Spain
Roberto Casado Vara, University of Salamanca, Spain
Yingbo Cui, National University of Defense Technology, China
Fernando De La Prieta, University of Salamanca, Spain
Sergio Deusdado, IPB-Polytechnic Institute of Bragança, Portugal
Oscar Dias, University of Minho, Portugal
Florentino Fdez-Riverola, University of Vigo, Spain
João Diogo Ferreira, University of Lisbon, Faculty of Sciences, Portugal
Nuno Filipe, University of Porto, Portugal
Nuno A. Fonseca, University of Porto, Portugal
Dino Franklin, Federal University of Uberlandia, Spain
Narmer Galeano, Universidad Catolica de Manizales, Colombia
Rosalba Giugno, University of Verona, Italy
Gustavo Isaza, University of Caldas, Colombia
Paula Jorge, IBBCEB Center of Biological Engineering, Portugal
Rosalia Laza, Universidad de Vigo, Spain
Thierry Lecroq, University of Rouen, France
Giovani Librelotto, Universidade Federal de Santa Maria, Brazil
Hugo López-Fernández, Instituto de Investigação e Inovação em Saúde, i3S, Portugal
Eva Lorenzo Iglesias, University of Vigo, Spain
Marcos Martínez-Romero, Stanford University, USA
Mohd Saberi Mohamad, United Arab Emirates University, United Arab Emirates
Loris Nanni, University of Padua, Italy
José Luis Oliveira, University of Aveiro, Portugal
Joel P. Arrais, University of Coimbra, Portugal
Cindy Perscheid, Hasso Plattner Institute, Germany
Armando Pinho, University of Aveiro, Portugal
Ignacio Ponzoni, Planta Piloto de Ingeniería Química, PLAPIQUI-UNS-CONICET, Argentina
Miguel Reboiro-Jato, University of Vigo, Spain
Jose Ignacio Requeno, Western Norway University of Applied Sciences, HVL, Norway
Miguel Rocha, Center for Computer Science and Technologies, CCTC, University of Minho, Portugal
João Manuel Rodrigues, DETI/IEETA, University of Aveiro, Portugal
Gustavo Santos-Garcia, Universidad de Salamanca, Spain
Ana Margarida Sousa, University of Minho, Portugal
Niclas Ståhl, University of Skövde, Sweden
Carolyn Talcott, SRI International, USA
Rita Margarida Teixeira Ascenso, ESTG-IPL, Portugal
Antonio J. Tomeu-Hardasmal, University of Cadiz, Spain
Alicia Troncoso, Universidad Pablo de Olavide, Spain
Eduardo Valente, IPCB, Portugal
Alejandro F. Villaverde, Instituto de Investigaciones Marinas (C.S.I.C.), Spain
Pierpaolo Vittorini, University of L’Aquila, Department of Life, Health and Environmental Sciences, Portugal

PACBB 2021 Sponsors

Contents

Computational Methods for the Identification of Genetic Variants in Complex Diseases
Débora Antunes, Daniel Martins, Fernanda Correia, Miguel Rocha, and Joel P. Arrais

Using Reduced Amino-Acid Alphabets and Simulated Annealing to Identify Antimicrobial Peptides
John Healy, Michela Caprani, Orla Slattery, and Joan O’Keeffe

Acufenometry in the Self-management of Tinnitus: A Revised Interface to Improve the User Experience
Pierpaolo Vittorini, Pablo Chamoso, and Fernando De la Prieta

The pegi3s Bioinformatics Docker Images Project
Hugo López-Fernández, Pedro Ferreira, Miguel Reboiro-Jato, Cristina P. Vieira, and Jorge Vieira

On the Reproducibility of MiRNA-Seq Differential Expression Analyses in Neuropsychiatric Diseases
Daniel Pérez-Rodríguez, Hugo López-Fernández, and Roberto C. Agís-Balboa

Computational Tools for the Analysis of 2D-Nuclear Magnetic Resonance Data
Bruno Pereira, Marcelo Maraschin, and Miguel Rocha

Recurrent Deep Neural Networks for Enzyme Functional Annotation
Ana Marta Sequeira and Miguel Rocha

Assessing the Impact of Data Set Enrichment to Improve Drug Sensitivity in Cancer
Pedro Ferreira, João Ladeiras, and Rui Camacho

Deep Neural Network to Curate LTR Retrotransposon Libraries from Plant Genomes
Simon Orozco-Arias, Mariana S. Candamil-Cortes, Paula A. Jaimes, Estiven Valencia-Castrillon, Reinel Tabares-Soto, Romain Guyot, and Gustavo Isaza

A Hybrid of Bees Algorithm and Regulatory On/Off Minimization for Optimizing Lactate Production
Mohd Izzat Yong, Mohd Saberi Mohamad, Yee Wen Choon, Weng Howe Chan, Hasyiya Karimah Adli, Khairul Nizar Syazwan WSW, Nooraini Yusoff, and Muhammad Akmal Remli

A Study on Burrows-Wheeler Aligner’s Performance Optimization for Ancient DNA Mapping
Cindy Sarmento, Sílvia Guimarães, Gülşah Merve Kılınç, Anders Götherström, Ana Elisabete Pires, Catarina Ginja, and Nuno A. Fonseca

BioTMPy: A Deep Learning-Based Tool to Classify Biomedical Literature
Nuno Alves, Ruben Rodrigues, and Miguel Rocha

May Gender Have an Impact on Methylation Profile and Survival Prognosis in Acute Myeloid Leukemia?
Agnieszka Cecotka, Lukasz Krol, Grainne O’Brien, Christophe Badie, and Joanna Polanska

Towards a Multivariate Analysis of Genome-Scale Metabolic Models Derived from the BiGG Models Database
Alexandre Oliveira, Emanuel Cunha, Fernando Cruz, João Capela, João Sequeira, Marta Sampaio, and Oscar Dias

A Comparison of Different Compound Representations for Drug Sensitivity Prediction
Delora Baptista, João Correia, Bruno Pereira, and Miguel Rocha

Combinatorial Optimization of Succinate Production in Escherichia coli
Vítor Pereira and Miguel Rocha

Predicting Adverse Drug Reactions from Drug Functions by Binary Relevance Multi-label Classification and MLSMOTE
Pranab Das, Jerry W. Sangma, Vipin Pal, and Yogita

Author Index

Computational Methods for the Identification of Genetic Variants in Complex Diseases

Débora Antunes (1), Daniel Martins (2,3), Fernanda Correia (4), Miguel Rocha (1), and Joel P. Arrais (2)

(1) Department of Informatics, University of Minho, Braga, Portugal
(2) CISUC, University of Coimbra, Coimbra, Portugal
(3) CNC, University of Coimbra, Coimbra, Portugal
(4) ISEC, Polytechnic Institute of Coimbra, Coimbra, Portugal

Abstract. Complex diseases, such as Type 2 Diabetes, arise from dysfunctional complex biological mechanisms, caused by multiple variants on underlying groups of genes, combined with lifestyle and environmental factors. Thus far, the known risk factors are not sufficient to predict the manifestation of the disease. Genome-Wide Association Studies (GWAS) data were used to test for genotype-phenotype associations and were combined with a network-based analysis approach. Three datasets of genes associated with this disease were built and features were extracted for each of these genes. Machine learning models were employed to develop a predictor of the risk associated with Type 2 Diabetes, to help in the identification of new genetic markers associated with the disease. The obtained results highlight that the use of gene regions and protein-protein interaction networks can identify new genes and pathways of interest and improve model performance, providing new possible interpretations of the biology of the disease.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 1–10, 2022. https://doi.org/10.1007/978-3-030-86258-9_1

1 Introduction

Complex diseases are conditions influenced by mutations in a group of genes that interact with each other and with environmental factors. Contrary to the case of monogenic disorders, the genes associated with complex diseases do not have any effect individually, which hinders their identification [7].

The case study for this work was the complex disease Type 2 Diabetes (T2D), a subtype of diabetes that accounts for 90% of diabetes cases worldwide. In this subtype, the cells of the organism cannot respond to an essential hormone called insulin, leading not only to high blood sugar levels (hyperglycemia) but also to an increase in insulin production. According to Stančáková et al. [11], until 2016, more than 80 variants were associated with this condition, mostly through Genome-Wide Association Studies (GWAS) and considering independent effects. However, those variants only explained about 10% of the T2D variability within a population, producing little information that can be used in a medical context.

GWAS find genetic variations associated with a particular disease by scanning sets of DNA or genomes of many individuals. Frequently, they focus on Single-Nucleotide Polymorphisms (SNPs) and phenotypes within a population, sequencing the genotypes of tag-SNPs, each representative of a haplotype (a group of genes that are inherited together). The identified genetic associations can be used to predict and detect disease risk, to support treatment, and to produce knowledge about the underlying biological entities and processes. Such studies are particularly useful in finding genetic variations that contribute to complex diseases like T2D [12]. However, these relationships are not easy to understand because of the complex pathways involved and the growing number of variables [8].

In single SNP association analysis, the association between each SNP and the phenotype is tested. In some cases, the SNPs are treated as independent; the methods used to study this association include generalised linear models, statistical tests such as the χ² test, and Bayesian approaches [8]. Multiple SNP association analysis examines the relationship between the phenotype and the combined effect of multiple SNPs. Different types of analysis have been proposed to account for these associations: haplotype-based methods, SNP-SNP interaction models and models based on biological knowledge [8]. Machine Learning (ML) approaches have been used to analyse the SNPs produced in GWAS and to provide information about the relation between variants and phenotypes in complex diseases [3].

This paper proposes a predictor of the risk associated with the complex disease T2D that allows the identification of new genetic variants associated with the disease.
The pipeline used GWAS data, grouped them into gene regions and applied a network-based analysis approach. Using these methods, new subsets of genes were defined, their most relevant features were selected and ML techniques were applied to predict the risk of T2D.

2 Methods

2.1 Data Preparation

The first step involved choosing and preparing the datasets. The Final dataset was built from two initial datasets, Case and Control. The sizes of the datasets at each stage of the construction are summarised in Table 1.

The Case dataset originated from a privately owned, gzip-compressed Variant Call Format (VCF) file containing exome information from 71 Portuguese patients diagnosed with T2D, covering 57,142,453 loci. Since the focus of the study was the genetic factors, only the genomic data were used. An initial filtering step restricted the variants to Insertions/Deletions (INDELs) or SNPs. Quality control of these data revealed that two of the 71 individuals were related; for this reason, the one with more missing variants was excluded from the study. Variants that did not follow the Hardy-Weinberg Equilibrium (HWE) or that had a quality score below 20 were also removed.

The Control dataset resulted from a selection of VCF files collected from the Iberian Populations in Spain (IBS) in the Phase 3 release of the 1000 Genomes Project [1]. This group was chosen to prevent a bias in which cases and controls would be distinguished on the basis of population genetic divergences.

To create a dataset with all the information, the two previous datasets were merged into the Merged dataset by finding their common variants. Two variants were considered equal if their location (chromosome and position) was the same and their REF and ALT alleles matched. To produce the Final dataset, the variants with a missing-genotype rate of more than 10% were removed (Table 1).

Table 1. Summary of the number of samples and variants in the different datasets.

                     Case dataset (a)   Control dataset   Merged dataset   Final dataset
Number of samples    70                 107               177              177
Number of variants   228,301 (b)        81,404,605        172,629          168,715

(a) After preprocessing. (b) Includes 225,070 SNPs and 3,231 INDELs.
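The merging and missing-rate filtering steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the record layout (dictionaries with `chrom`, `pos`, `ref`, `alt` and a `genotypes` list in which `None` marks a missing call) and the function names are hypothetical.

```python
def variant_key(record):
    """Identity of a variant: location plus REF and ALT alleles."""
    return (record["chrom"], record["pos"], record["ref"], record["alt"])


def merge_common_variants(case_variants, control_variants):
    """Keep variants present in both datasets; genotypes are concatenated, cases first."""
    control_by_key = {variant_key(v): v for v in control_variants}
    merged = []
    for v in case_variants:
        match = control_by_key.get(variant_key(v))
        if match is not None:
            merged.append({**v, "genotypes": v["genotypes"] + match["genotypes"]})
    return merged


def filter_by_missing_rate(variants, max_missing=0.10):
    """Drop variants whose fraction of missing genotypes (None) exceeds max_missing."""
    kept = []
    for v in variants:
        genotypes = v["genotypes"]
        missing = sum(1 for g in genotypes if g is None)
        if genotypes and missing / len(genotypes) <= max_missing:
            kept.append(v)
    return kept


# Hypothetical toy data: only the first variant matches on all four fields.
case = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "genotypes": [0, 1]},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "T", "genotypes": [None, 1]},
]
control = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "genotypes": [0, 0, 2]},
    {"chrom": "1", "pos": 200, "ref": "C", "alt": "A", "genotypes": [1, 1, 1]},
]
final = filter_by_missing_rate(merge_common_variants(case, control))
```

In practice the records would come from a VCF parser; keying on the full (chromosome, position, REF, ALT) tuple is what keeps variants with the same position but a different ALT allele from being treated as equal.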

2.2 Gene Selection

The first analysis was a single SNP association. For that, we used the χ² test, which tests each variant for association with the disease, treating the variants as independent. At a p-value threshold of 0.05, the χ² test showed that, of the 168,715 variants, 9,427 (5.6%) presented a statistical association with the phenotype.

Three sets of genes were selected and used to build different datasets for this study: the dataset of Known Risk Genes, the dataset of Significant Genes and the dataset of Central Genes. To select these sets of genes, the position of each variant was associated with a gene ID using the Python package pyensembl and a reference GTF file from the GRCh37 version of Ensembl [13]. The variants were grouped by these gene IDs, and the gene p-value was calculated as the average of the p-values of the corresponding set of variants. The final list contained 16,513 genes and the respective p-values.

The set of Known Risk Genes comprised the genes on the final list that matched the list of 75 known risk genes from the Type 2 Diabetes Knowledge Portal [10] (the T2D related and CAUSAL genes). In total, 67 Known Risk Genes were selected (Table 2). The set of Significant Genes included 82 genes with a p-value of less than 0.05 (Table 2).

To select the Central Genes, a Protein-Protein Interaction (PPI) network was built. This PPI network can be represented by a graph whose nodes are genes (proteins) and whose edges are their interactions. The PPI network file was downloaded from BioGRID [9]. After a preprocessing step that selected the interactions with associated experimental data and with both interactors being human proteins, the file had 366,327 interactions and included 17,940 proteins. The PPI file and the final list of genes (proteins) were used as input for the R package dmGWAS. This tool implements a dense module searching method and outputs a list of modules associated with the disease, ranked by significance. In this study, the top 50 modules were extracted and combined into a subnetwork of significant PPIs containing 252 genes. Using the R package igraph, three network metrics (degree, betweenness and closeness), which measure the centrality of each gene in this subnetwork, were applied. By choosing the genes that were in the top 100 of each metric, 77 genes were selected for the set of Central Genes. Of these 77 genes, three were in common with the Known Risk Genes, namely CAV1, PCBD1 and WFS1 (in bold in Table 2).

Feature Extraction and Reduction

At this point, three sets of genes had been selected: the 67 Known Risk Genes, the 82 Significant Genes and the 77 Central Genes. First, feature extraction was applied to each gene, using the information present in the corresponding group of variants. Four features were extracted for each gene: the first component after applying Principal Component Analysis (PCA), the first component after applying t-distributed Stochastic Neighbor Embedding (t-SNE), and two statistical measures, the mean and the variance. With four features per gene, the Known Risk Genes, Significant Genes and Central Genes datasets had 268, 328 and 308 features, respectively.

Feature reduction was performed on two of these datasets, the Significant Genes and the Central Genes datasets. For each dataset, 1000 Extremely Randomized Trees (Extra-Trees) models were trained. In every training cycle, the 100 most important features were recorded and, at the end, their frequencies were calculated. For each dataset, the 25 features with the highest frequency were selected. The 25 features of the Top 25 Significant Genes dataset belonged to 25 different genes, while those of the Top 25 Central Genes dataset belonged to 12 different genes (in grey boxes in Table 2).

Three machine learning models, Support Vector Machines (SVM), Decision Tree and Logistic Regression, were trained on the following gene datasets: Significant Genes, Top 25 Significant Genes, Known Risk Genes, Central Genes and Top 25 Central Genes. The classifiers were run 1000 times using 5-fold cross-validation. A grid search was performed for each dataset and the overall best parameters were selected and used for the study. The final parameters are shown in Table 3. Three metrics were used to evaluate the models: Accuracy, F1-score and Area Under the Curve (AUC).
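The frequency-based feature reduction described above can be sketched in a few lines. The following Python fragment is only an illustration and not the authors' code: the function names `top_features` and `select_by_frequency` are hypothetical, and the importance vectors are simulated with random numbers, whereas in the study they would come from the 1000 trained Extra-Trees models.

```python
import random
from collections import Counter


def top_features(importances, top_n):
    """Return the indices of the top_n largest importance values."""
    order = sorted(range(len(importances)), key=lambda i: importances[i], reverse=True)
    return order[:top_n]


def select_by_frequency(importance_runs, top_n=100, keep=25):
    """Count how often each feature enters the top_n across runs and
    return the `keep` features with the highest frequency."""
    counts = Counter()
    for importances in importance_runs:
        counts.update(top_features(importances, top_n))
    return [feature for feature, _ in counts.most_common(keep)]


# Simulated stand-in for per-model feature importances (assumption):
# 328 features, as in the Significant Genes dataset, over 1000 runs.
rng = random.Random(0)
runs = [[rng.random() for _ in range(328)] for _ in range(1000)]
selected = select_by_frequency(runs, top_n=100, keep=25)
print(len(selected))
```

With real models, each inner list would be replaced by the importance vector of one trained model; the counting and selection logic stays the same.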


Table 2. Known Risk Genes, Significant Genes and Central Genes selected for this study. In the grey boxes are the genes selected in the dimensionality reduction. In bold are the genes common to the Known Risk Genes and the Central Genes lists.

Known risk genes      Significant genes     Central genes
ABCC8     PCSK1       AAMDC     MAST1       APP       MYC
AKT2      PDX1        AKT1S1    MSTN        ATXN1     NCL
ANGPTL4   PLCB3       AMMECR1L  MYPOP       BAG3      NEK6
ANKH      PNPLA3      ANP32A    NBPF14      BAIAP2    NFKBIA
APOE      POC5        B3GALNT1  NBPF4       BTRC      NSMF
APPL1     POLD1       BCL2L10   NDUFB6      CALM1     OPTN
BLK       PPARG       C10orf95  NGLY1       CASP1     PCBD1
BSCL2     PPP1R15B    C1orf162  NKX2-1      CASP8     PCNA
CAV1      PTF1A       C20orf202 NUFIP2      CAV1      PICK1
CDKN1B    QSER1       CDKN2C    OLIG1       CDC37     PIK3R1
CEL       RFX6        CGB5      OR13J1      CDH1      PIN1
EIF2AK3   RREB1       CHKA      OR2T5       CDK2      PLK1
ERAP2     SIX2        CLEC18A   P4HTM       CDKN1A    PPP1CA
GATA4     SIX3        CNOT11    PAGR1       CEP70     PTPN6
GATA6     SLC16A11    CSRP2     PARP11      DEAF1     RAC1
GCG       SLC19A2     CXCL13    PNMT        DISC1     RPS6KB1
GCK       SLC30A8     CXCL5     PNRC2       ENO1      SFN
GCKR      SLC5A1      DCAF16    POTED       ERBB2     SKP1
GIPR      TBC1D4      DDTL      PPP1R7      ESR1      SMAD3
GLIS3     TM6SF2      DEXI      PRAMEF13    GFAP      SPRED1
GLP1R     TRMT10A     DLEU1     PRDX6       GRB2      STK11
GRB10     WARS        DLX6      PRRT2       HLA-B     STX1A
HNF1A     WFS1        DOK1      PTRF        HNRNPC    SYK
HNF1B     WSCD2       GLIPR1    RAX         HSP90AB1  TGFBR2
HNF4A     ZFP57       GPR25     RNASE10     HSPA8     TNF
IGF1      ZNF771      HEXIM2    RNF182      HSPD1     TRAF6
IRS2                  HFE2      RYBP        HTT       TRIM54
KCNJ11                HMOX2     S100A16     INCA1     TSC22D1
KLF11                 HNRNPAB   SCG5        IQUB      UBC
LPL                   HOXB8     SKIL        JPH3      UBE2Z
MC4R                  HSD3B1    SOX21       KANK2     USP2
MNX1                  ID2       SYT4        KCTD13    VCP
MTNR1B                IFNA13    TADA2B      KDR       WFS1
NAT2                  IL33      TAF11       KIFC3     YWHAE
NEUROD1               IL36RN    TEX22       KRT34     YWHAG
NEUROG3               JOSD1     TLX1NB      LMO4      YWHAZ
NKX2-2                KCNMB4    TMEM178A    LNX1
PAM                   KRTAP5-1  TMEM60      LNX2
PAX4                  LSMEM1    TPST1       MAP3K1
PAX6                  MAFA      TRAM1L1     MDFI
PCBD1                 MAFF      WDR45B      MEOX2

3 Results and Discussion

The results are illustrated graphically in Fig. 1. The Decision Tree models produced higher values for all metrics (≥0.87) than the Logistic Regression and SVM models, which suggests that this model could better address the complexity of the data. Both the Logistic Regression and the SVM models used linear functions for the classification, which indicates that such functions have greater difficulty in explaining the underlying structure of the data. The values of the three metrics were also similar to one another, which shows the robustness of the results.

The lower values obtained when using the Known Risk Genes as input were expected, because it is known that these gene associations do not account for a high percentage of the heritability. The Significant Genes dataset had features that were extracted directly from the most significant genes of the original dataset and, as expected, produced good results. When just the top 25 features were used, which is less than 8% of the full dataset’s size, the three metrics kept relatively good values.

Lower values in the results from the Central Genes dataset were expected, given that the features were extracted from genes that are central to a network of significance but generally not significant themselves. Although the Accuracy, F1-score and AUC values were lower than those from the Significant Genes dataset, the results were still good. The best results were from the Top 25 Central Genes dataset, which had less than 9% of the full dataset’s size. With only the 25 features of this dataset it was possible to predict the risk of disease with a good degree of success.

To add biological context to the genes, a functional annotation of the Known Risk Genes, the Significant Genes and the Central Genes was conducted using the online platform Database for Annotation, Visualization and Integrated Discovery (DAVID) [5,6].
This platform finds the most relevant and over-represented biological terms related to the gene lists provided. The biological process annotation from Gene Ontology (GO) revealed that, of 29,683 biological process terms, 160 were shared by the genes of the three lists. Given that most of the genes differ across the three lists, many of the identified genes (either Significant Genes or Central Genes) share the same terms as the Known Risk Genes (Fig. 2).

Table 3. Parameters and respective values chosen for the SVM, Decision Tree and Logistic Regression models after grid search.

SVM                 Decision tree               Logistic regression
kernel   linear     n estimators       50       C         0.25
C        0.4        criterion          entropy  tol       1e−3
tol      1e−3       min samples leaf   1        solver    liblinear
gamma    25         min samples split  5        penalty   l1
degree   3          max leaf nodes     50


Fig. 1. Average accuracy, F1-score and area under curve, and respective standard deviation, of the SVM, Decision Tree (Tree) and Logistic Regression (Log) models, for Significant Genes (black), Top 25 Significant Genes (dark grey), Known Risk Genes (light grey), Central Genes (light stripes) and Top 25 Central Genes (dark stripes).

Even though these genes share these terms, this does not mean that they share the same pathways or interactors; however, they can be involved in similar functions. The functional annotation of the Central Genes also highlighted pathways that integrate the genes from the list (Table 4). One of the most interesting is the Translocation of SLC2A4 (GLUT4) to the plasma membrane, since T2D is characterised by insulin resistance and this pathway is related to the disease [4]. Even though most of the genes identified are not on the lists of gene associations from HPO or the T2D Knowledge Portal, they are involved in biological processes and pathways of interest for this disease. Investigation into these genes and pathways could reveal an association with T2D.

Fig. 2. Number of genes from the Known Risk Genes, Significant Genes and Central Genes associated with the top biological processes found.

By grouping the variants into genes, this pipeline loses allele information that is important for more specific genetic studies. Although it is possible to retrace some of the information, it becomes complex to understand, for instance, which specific mutations have an effect on the disease. This difficulty increases further when looking at the Central Genes, since their selection is based not on their genetic information but on their relationship to significant genes.

In 2017, Boyle, Li, and Pritchard [2] proposed the omnigenic model to explain complex diseases. Their model states that a small number of genes (“core genes”) have direct roles in the disease and, if they are damaged or lose function, have large effect sizes on the disease risk. This approach could help identify these genes. However, since these genes are affected by other processes, they might not be directly associated with any mutation, and for this reason their usefulness in the prediction of the disease risk might be limited.

This pipeline can, to some degree, identify genes of interest. Looking at the list of Central Genes identified from the PPI networks, it is possible to identify three genes in common with the Known Risk Genes list: CAV1, PCBD1 and WFS1 (in bold in Table 2). Features from two of these three genes were included in the Top 25 Central Genes dataset, the one with the best results in the models’ evaluation. Although this number of genes is low, it indicates that the pipeline has correctly selected genes of interest. In addition, the functional analysis of the Central Genes identified a pathway that is directly associated with T2D.

Table 4. Top 9 significant pathways associated with the Central Genes.

Regulation of PLK1 Activity at G2/M Transition
    PLK1 is a protein that phosphorylates several proteins involved in the transition from phase G2 to phase M of mitosis.

RHO GTPases activate PKNs
    RHO GTPases are a family of proteins involved in processes like changes in morphology and mitosis. They can bind to a PKN protein. PKN is involved in the regulation of the cell cycle, receptor trafficking, vesicle transport and apoptosis.

Chk1/Chk2(Cds1) mediated inactivation of Cyclin B:Cdk1 complex
    The kinases Chk1/Chk2(Cds1) are used as a checkpoint during mitosis, induced when there is DNA damage.

Activation of BAD and translocation to mitochondria
    BAD has an important role in apoptosis. Calcineurin activates BAD by dephosphorylation. After activation, it is transported to the external membrane of the mitochondria, releasing cytochrome C, which is a factor for apoptosis.

CLEC7A (Dectin-1) signaling
    CLEC7A (or Dectin-1) is a protein involved in the detection of bacterial and fungal cells. CLEC7A signalling induces the production of cytokines and interleukins.

Translocation of SLC2A4 (GLUT4) to the plasma membrane
    GLUT4 is a protein encoded by the gene SLC2A4 and is responsible for the uptake of glucose from the bloodstream when the level of insulin increases. When insulin binds to the receptors of the cell, it starts the movement of the vesicles containing glucose towards the plasma membrane.

Constitutive Signalling by Ligand-Responsive EGFR Cancer Variants
    After activation, the EGFR receptor starts several signalling cascades that initiate the transcription of genes involved in apoptosis and cellular proliferation. Over-expression of wild-type EGFR or EGFR cancer mutants results in aberrant activation of these signalling cascades, providing an advantage to cancer cells.

Regulation of signalling by CBL
    CBL negatively regulates signalling pathways by targeting proteins with ubiquitin for proteasomal degradation.

GPVI-mediated activation cascade
    GPVI is a receptor of collagen that initiates a signalling cascade that leads to platelet activation.

4 Conclusion

In this study we developed a T2D risk predictor that identified genes of interest for the disease, using a case-control analysis of variants’ genotypes. The study identified 82 Significant Genes and 77 Central Genes, three of which were already Known Risk Genes. After dimensionality reduction, 25 Significant Genes and 12 Central Genes were highlighted. The functional analysis of the Central Genes identified pathways of interest, including one already associated with the disease. The models successfully predicted the risk of disease in these datasets, especially in the Top 25 Central Genes dataset. Even though much of the allele information is lost by using this approach, important biological information is revealed. Using gene regions and integrating them into PPI networks seems to be useful for gaining insight into the biology of the disease. From this point on, it would be necessary to validate the risk predictor using bigger datasets, from other complex diseases or even from other types of studies.

Acknowledgement. This work is funded by national funds through the FCT - Foundation for Science and Technology, I.P., within the scope of the project CISUC UID/CEC/00326/2020 and by European Social Fund, through the Regional Operational Program Centro 2020 and by the Portuguese Research Agency FCT, through D4 - Deep Drug Discovery and Deployment (CENTRO-01-0145-FEDER-029266).

References

1. Auton, A., et al.: A global reference for human genetic variation. Nature 526(7571), 68–74 (2015)
2. Boyle, E.A., Li, Y.I., Pritchard, J.K.: An expanded view of complex traits: from polygenic to omnigenic. Cell 169(7), 1177–1186 (2017)
3. Collins, A., Yao, Y.: Machine learning approaches: data integration for disease prediction and prognosis. In: Yao, Y. (ed.) Applied Computational Genomics. TRBIO, vol. 13, pp. 137–141. Springer, Singapore (2018). https://doi.org/10.1007/978-981-13-1071-3_10
4. Gaster, M., et al.: GLUT4 is reduced in slow muscle fibers of type 2 diabetic patients: is insulin resistance in type 2 diabetes a slow, type 1 fiber disease? Diabetes 50(6), 1324–1329 (2001)
5. Huang, D.W., Sherman, B.T., Lempicki, R.A.: Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37(1), 1–13 (2009)
6. Huang, D.W., Sherman, B.T., Lempicki, R.A.: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4(1), 44–57 (2009)
7. Jordan, B.: Genes and non-mendelian diseases: dealing with complexity. Perspect. Biol. Med. 57(1), 118–131 (2014)
8. Morris, A.P., Cardon, L.R.: Genome-wide association studies. In: Balding, D., Moltke, I., Marioni, J. (eds.) Handbook of Statistical Genomics, 4th edn, pp. 597–550. Wiley (2019)
9. Oughtred, R., et al.: The BioGRID interaction database: 2019 update. Nucleic Acids Res. 47(D1), D529–D541 (2019)
10. Portal, Type 2 Diabetes Knowledge: Curated T2D effector gene predictions
11. Stančáková, A., Laakso, M.: Genetics of type 2 diabetes. In: Stettler, C., Christ, E., Diem, P. (eds.) Endocrine Development, vol. 31, pp. 203–220. Karger Publishers (2016)
12. Visscher, P.M., et al.: 10 years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101(1), 5–22 (2017)
13. Yates, A.D., et al.: Ensembl 2020. Nucleic Acids Res. 48(D1), D682–D688 (2019)

Using Reduced Amino-Acid Alphabets and Simulated Annealing to Identify Antimicrobial Peptides

John Healy1(B), Michela Caprani2, Orla Slattery2, and Joan O’Keeffe2

1 Department of Computer Science and Applied Physics, Galway-Mayo Institute of Technology, Galway, Ireland [email protected]
2 Marine and Freshwater Research Centre (MFRC), Galway-Mayo Institute of Technology, Galway, Ireland [email protected], {orla.slattery,joan.okeeffe}@gmit.ie

Abstract. The efficient detection of similarity between biological sequences is a fundamental task in bioinformatics. This paper describes a k-mer approach for identifying and classifying antimicrobial peptide sequences using 64-bit encoded multiple spaced seeds and a suite of reduced amino acid alphabets. We implemented and tested the approach using a total of 74 reduced alphabets that were either published, altered using simulated annealing, or randomly generated. Our results show that the approach is very accurate and that all of the reduced alphabets of sizes between 9 and 16 were equally effective and far more accurate than smaller sized alphabets. Our custom designed alphabets exhibited higher sensitivity for some families of AMP than any of the published reduced alphabets that we tested.

1 Introduction

Protein composition is typically encoded in sequence formats using the standard 20 letter (5-bit) IUPAC amino acid alphabet. However, the use of reduced amino acid alphabets is well established in bioinformatics [1] and they have been demonstrated to provide optimal encodings of the structure, stability, and folding of proteins [2]. Peterson et al. [3] showed that reduced amino acid alphabets can exhibit increased sensitivity where there exists structural similarity but a low sequence identity and are also highly effective for pattern-based searches. In the context of comparing protein sequences, reduced alphabets provide an additional degree of indirection to an alignment system as grouped amino acids are effectively polymorphic.

The size of reduced alphabets varies from simple encodings based on hydrophobic and hydrophilic properties [4,5] to complex groupings of residues based on the chemical, electro-chemical and structural properties of amino acids. Murphy et al. [6] used a greedy algorithm to construct a set of reduced alphabets using correlation coefficients between pairs of amino acids based on a BLOSUM50 scoring matrix. Wang and Wang [7] published a suite of reduced alphabets based on an analysis of the statistical contact potentials of the MJ matrix developed by Miyazawa and Jernigan [8]. Li et al. [9] used sequence alignments from a BLOSUM62 matrix to construct a collection of reduced alphabets. Cannata et al. [10] reported a branch and bound technique for exhaustively analysing simplified amino acid alphabets scored from user-defined block substitution matrices, and both Lenckowski et al. [11] and Nanni and Lumini [12] demonstrated how genetic algorithms can be used to construct reduced alphabets.

In this paper we present a k-mer based approach for identifying antimicrobial peptides (AMPs) in protein sequence data using reduced amino acid alphabets. Recently, Dong et al. [13] demonstrated that reduced alphabets can be highly effective for identifying AMPs using a Support Vector Machine. AMPs are short peptide sequences, typically fewer than 50 amino acids long, with an amphipathic structure, a net cationic charge (+2 to +9) and a highly diverse sequence composition and secondary structure [14,15]. AMPs are an important focus of protein research as they exhibit antimicrobial activity and a potential to act as alternative therapeutic agents to combat resistant strains of bacteria [16].

We analysed the effectiveness of 74 different reduced amino acid alphabets as encodings to classify 33 different AMP families. We tested the approach with different sizes of published alphabets, published alphabets optimised using simulated annealing, randomly generated alphabets and custom alphabets designed from multiple sequence alignments of AMPs. The training and testing of the approach and the subsequent analysis of AMP sequences was undertaken on an OSX 11.2.3 platform, with a 3.2 GHz Apple M1 processor, 8 GB of RAM and an instance of the OpenJDK 16.36.2231 64-bit Java Virtual Machine. The source code, data set and supplementary material are available on GitHub at https://github.com/gmit-amp-res.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 11–21, 2022. https://doi.org/10.1007/978-3-030-86258-9_2

2 Methods

A set of 33 different AMP families was collated from the Antimicrobial Peptide Database (APD) [17] and the AntiBP Server [18] on the basis that they represent the most abundant sequences from a dearth of published AMPs. Each AMP family was processed using a reduced alphabet, added to a multiple spaced-seed k-mer database and then benchmarked for accuracy using 10-fold cross-validation. In addition to the 19 alphabets described by [4–7,9], we created custom reduced alphabets of various sizes optimised for AMPs from multiple sequence alignments of AMP sequences using Jalview [19].

2.1 Alphabet Encoding

The k-mer based classification system we developed processes protein sequences in 64-bit blocks, the largest primitive type available in the Java language, allowing for a highly compact representation of sequence data and the ability to exploit low-level, time-efficient computations using bit shifting. The number of bits required to represent each symbol is $\lceil \log_2 |\Sigma| \rceil$, where $|\Sigma|$ is the size of the reduced amino acid alphabet. In order to utilise all 64 bits to their maximum extent, we placed an upper limit of 4 bits per symbol, thereby capping the alphabet size at 16 symbols.

To maximise the sensitivity of the system, we used the following six spaced seeds with a length (l) and weight (w) that are optimised for 64-bit encoding: 110110101000111 (l = 15, w = 9), 110101100010111 (l = 15, w = 9), 110110010100111 (l = 15, w = 9), 1101100011010111 (l = 16, w = 10) and 1110010100110111 (l = 16, w = 10) from Choi et al. [20], and 111101011101111 (l = 15, w = 12) from Buchfink et al. [21]. These published seeds were selected on the basis that they have the highest weight for their length and are independent of our data set. Given the paucity of AMP sequence data available, designing an effective suite of custom spaced seeds was not a viable option.

The seeds are 64-bit encoded to match a specified amino acid alphabet. For example, using the reduced 2 letter (1-bit) amino acid alphabet ($\Sigma^{1}_{2}$) from Chan et al. [4] (polar: AGTSNQDEHRKP, hydrophobic: CMFILVWY), the seeds will be encoded as:

1101101010001110000000000000000000000000000000000000000000000000
1101011000101110000000000000000000000000000000000000000000000000
1101100101001110000000000000000000000000000000000000000000000000
1101100011010111000000000000000000000000000000000000000000000000
1110010100110111000000000000000000000000000000000000000000000000
1111010111011110000000000000000000000000000000000000000000000000

Using the 15 letter (4-bit) alphabet ($\Sigma^{4}_{15}$) from Murphy et al. [6] ([A, C, D, E, G, H, N, P, Q, S, T, W, (LVIM), (FY), (KR)]), the same seeds will be represented as:

1111111100001111111100001111000011110000000000001111111111110000
1111111100001111000011111111000000000000111100001111111111110000
1111111100001111111100000000111100001111000000001111111111110000
1111111100001111111100000000000011111111000011110000111111111111
1111111111110000000011110000111100000000111111110000111111111111
1111111111111111000011110000111111111111000011111111111111110000
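The encoded seeds listed above follow a simple rule: every seed position is widened to the number of bits per symbol and the result is left-aligned in a 64-bit word. A minimal Python sketch of that rule is given below. The function names `bits_per_symbol` and `expand_seed` are illustrative only and are not taken from the authors' Java implementation, and the seeds are handled as bit strings rather than as packed 64-bit integers.

```python
import math

WORD_BITS = 64  # width of the 64-bit block used for each encoded seed


def bits_per_symbol(alphabet_size):
    """Bits needed per symbol: ceil(log2 |alphabet|), with a minimum of 1."""
    return max(1, math.ceil(math.log2(alphabet_size)))


def expand_seed(seed, bits):
    """Widen every seed position to `bits` bits, then right-pad with zeros to 64 bits."""
    if len(seed) * bits > WORD_BITS:
        raise ValueError("seed does not fit in a 64-bit word at this symbol width")
    widened = "".join(position * bits for position in seed)
    return widened.ljust(WORD_BITS, "0")


seed = "110110101000111"
print(expand_seed(seed, bits_per_symbol(2)))   # 2 letter (1-bit) alphabet
print(expand_seed(seed, bits_per_symbol(15)))  # 15 letter (4-bit) alphabet
```

At 4 bits per symbol a seed of length 16 fills the word exactly, which is why no longer seeds can be used with the larger alphabets.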

The bit representation of the specified reduced alphabet thus controls not only the number of symbols used but also the size of k in the k-mer database.

2.2 k-mer Database

A k-mer database of encoded sequence data was created for each AMP family by tiling over each of its constituent protein sequences with an offset of 1. By decomposing the set of sequences from each AMP family, T, into a spectrum of k-mers, a mapping of k-mers to their frequency of occurrence can then be used as the basis for identification and classification. The k-spectrum of an AMP sequence, S, of size |S| and a constant positive integer k, is the set

$$S^k = \{\, S[i : i + k - 1] \mid 0 < i \le |S| - k + 1 \,\},$$

where $S[i : i + k - 1]$ is the substring of symbols from index $i$ to $i + k - 1$ in $S$ and $S \in T$. Each k-mer is 64-bit encoded using a reduced alphabet and added to a hash map for that AMP family.
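The k-spectrum and the per-family frequency map can be illustrated with a short Python sketch. It is a simplification under stated assumptions: k-mers are kept as plain strings rather than 64-bit encoded spaced-seed blocks, the 0/1 labels for the polar and hydrophobic groups are an arbitrary choice, and all function names are hypothetical. The `majority_score` helper shows one way such a map can serve as the basis for classification, by summing stored frequencies over a query's k-spectrum.

```python
from collections import Counter

# Polar group of the 2 letter alphabet quoted in Sect. 2.1; all other residues
# are treated as hydrophobic. The 0/1 labels are an assumption of this sketch.
POLAR = set("AGTSNQDEHRKP")


def reduce_sequence(sequence):
    """Rewrite a protein sequence in the reduced 2 letter alphabet."""
    return "".join("0" if residue in POLAR else "1" for residue in sequence)


def k_spectrum(sequence, k):
    """All overlapping k-mers of a sequence, tiled with an offset of 1."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_family_map(sequences, k):
    """Map every k-mer seen in one family to its frequency of occurrence."""
    counts = Counter()
    for sequence in sequences:
        counts.update(k_spectrum(sequence, k))
    return counts


def majority_score(query, family_map, k):
    """Sum the stored frequencies of each k-mer in the query's k-spectrum."""
    return sum(family_map[kmer] for kmer in k_spectrum(query, k))


# Toy sequences, invented for illustration only.
family = [reduce_sequence(s) for s in ("GLFDIVKK", "GLFDIIKK")]
family_map = build_family_map(family, k=3)
print(majority_score(reduce_sequence("GLFDIVRK"), family_map, k=3))
```

A sequence of length n yields n − k + 1 k-mers, so building the spectrum and scoring a query are both linear in the sequence length for a fixed k.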

For each k-mer $S \in S^k$ with a frequency of occurrence $f(S)$, the surjective function $M : S \to f(S)$ creates a k-mer hash map from $S$ to $f(S)$ for an AMP family. The k-mer database aggregates the hash maps from all 33 AMP families into a container and provides operations for adding and searching. An unknown query sequence is processed and classified by encoding its amino acid symbols into a tiling of 64-bit k-mers using a reduced alphabet, searching the database for matching k-mers and then ranking the results. The running time for classifying a protein query sequence is O(n), where n is the sequence length.

We used a simple majority count as a metric to score a query sequence by summing the occurrence frequency of its k-spectrum. The k-mer database for an AMP family, T, computes a majority count for the k-spectrum of a query sequence, $S^k$, of size n as

$$\mathrm{score}(S^k, T) = \sum_{i=1}^{n-k+1} f(S_i), \quad S_i \in S^k.$$

The classification is determined as the highest scoring AMP family.

2.3 Performance Metrics

We measured the accuracy of the approach by subjecting each AMP hash table in the k-mer database to 10-fold cross validation, with 80% of the AMP sequences for each family used for constructing the database and 20% retained for testing. We used the generalised version of the Matthews Correlation Coefficient (MCC) [22] that extends the measure from that of a binary to a multi-class k-category correlation coefficient called the $R_k$ metric. The multi-class form of the MCC,

$$MCC_k = \frac{c \cdot s - \vec{t} \cdot \vec{p}}{\sqrt{s^2 - \vec{p} \cdot \vec{p}} \times \sqrt{s^2 - \vec{t} \cdot \vec{t}}},$$

uses a k × k confusion matrix to compute a score in the range [−1..1], with a minimum value between −1 and 0. The variable s is the total number of samples in the test data set, i.e. the sum of the values in the k × k confusion matrix, c is the total number of correct predictions, p is the number of times each class was predicted and t is the number of times each class actually occurred. The MCC has been shown to be a superior metric for measuring the accuracy of a classifier [23] and is now a well-established measure in bioinformatics [24]. We also computed the macro-F1 and weighted F1 scores for each reduced alphabet tested.

2.4 Simulated Annealing

In addition to benchmarking the system with the alphabets described by [4–7,9], we used an implementation of the basic simulated annealing (SA) algorithm [25], shown as Algorithm 1, to optimise these alphabets for AMP sequences using the multi-class MCC metric as a heuristic. We also generated a set of random reduced alphabets of different sizes and again used SA to improve their accuracy.
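The multi-class MCC used as the annealing heuristic can be computed directly from a confusion matrix. The Python sketch below uses the quantities s, c, t and p as defined in Sect. 2.3. The function name `multiclass_mcc`, the convention that rows hold true classes and columns hold predicted classes, and the choice to return 0 when the denominator vanishes are assumptions of this sketch rather than details taken from the paper.

```python
import math


def multiclass_mcc(confusion):
    """Multi-class MCC from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    s = sum(sum(row) for row in confusion)        # total number of samples
    c = sum(confusion[i][i] for i in range(k))    # correct predictions
    t = [sum(confusion[i]) for i in range(k)]     # times each class occurred
    p = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # times each class was predicted
    numerator = c * s - sum(tk * pk for tk, pk in zip(t, p))
    denominator = (math.sqrt(s * s - sum(pk * pk for pk in p))
                   * math.sqrt(s * s - sum(tk * tk for tk in t)))
    if denominator == 0:
        return 0.0  # assumed convention when every sample falls in a single class
    return numerator / denominator


print(multiclass_mcc([[5, 0, 0], [0, 4, 1], [0, 1, 4]]))
```

For two classes the expression reduces to the familiar binary MCC, which gives a quick sanity check for any implementation.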

Data: n partitions and m amino acid letters, where n ≤ 16 < m
Result: An optimised grouping of m amino acid letters into n partitions
parent ← Fisher-Yates shuffle of m letters into n partitions;
for temperature ← 30 to 0 step −1 do
    for transitions ← 1000 to 0 step −1 do
        child ← fisherYatesShuffle(parent);
        delta ← MCC(child) − MCC(parent);
        if delta > 0 then
            parent ← child;
        else
            parent ← child with probability e^(delta/temperature);
        end
    end
end
Algorithm 1: Simulated annealing algorithm for amino acid alphabets, where the number of partitions n = |Σ|.

SA is an iterative heuristic search technique that circumvents the issues of foothills and other local optima that adversely affect hill-climbing algorithms by sometimes randomly choosing a worse solution than its current state. This allows the algorithm to exit local optima early in a search.
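A runnable Python sketch of this annealing loop is given below. It is an illustration under several assumptions and not the authors' implementation: the neighbour move swaps one letter between two groups, the temperature loop stops at 1 to avoid a division by zero, the best state seen is tracked and returned, and `toy_score` is a placeholder objective standing in for the cross-validated multi-class MCC. All function names are hypothetical.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_partition(letters, n, rng):
    """Shuffle the letters (random.shuffle is a Fisher-Yates shuffle) and deal them into n groups."""
    shuffled = list(letters)
    rng.shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def neighbour(partition, rng):
    """Assumed move: swap one letter between two randomly chosen groups."""
    child = [group[:] for group in partition]
    a, b = rng.sample(range(len(child)), 2)
    i = rng.randrange(len(child[a]))
    j = rng.randrange(len(child[b]))
    child[a][i], child[b][j] = child[b][j], child[a][i]
    return child


def anneal(score, n, letters=AMINO_ACIDS, start_temp=30, transitions=1000, seed=0):
    """Maximise score(partition) by simulated annealing; returns the best state seen."""
    rng = random.Random(seed)
    parent = random_partition(letters, n, rng)
    parent_score = score(parent)
    best, best_score = parent, parent_score
    for temperature in range(start_temp, 0, -1):  # stops at 1 to avoid dividing by zero
        for _ in range(transitions):
            child = neighbour(parent, rng)
            child_score = score(child)
            delta = child_score - parent_score
            # Improvements are always accepted; worse states with probability e^(delta/T).
            if delta > 0 or rng.random() < math.exp(delta / temperature):
                parent, parent_score = child, child_score
                if parent_score > best_score:
                    best, best_score = parent, parent_score
    return best, best_score


def toy_score(partition):
    """Placeholder objective (assumption): fraction of groups that do not mix
    hydrophobic and polar residues. The study used the multi-class MCC instead."""
    hydrophobic = set("CMFILVWY")
    pure = sum(1 for group in partition if len({aa in hydrophobic for aa in group}) == 1)
    return pure / len(partition)


best, best_score = anneal(toy_score, n=4, start_temp=5, transitions=200)
print(best, best_score)
```

Because scores lie in a narrow range, the acceptance probability stays high at the starting temperatures used here, so returning the best state seen rather than the final state is a pragmatic addition.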

3 Results and Discussion

The results of benchmarking the 74 alphabets against all 33 AMP families using 10-fold cross validation are presented Table 1 and in Fig. 1. The range of the multi-class MCC score was [−0.0001...1]. The most salient feature of the results is the clear bifurcation in accuracy across all measures for alphabet sizes of 9 or more. Notwithstanding the consistently high specificity of all the alphabets tested, the optimum sensitivity was reached with the Murphy 12 alphabet. The potential for polymorphism from the grouping together of substitutable amino acids in reduced alphabets is clearly limited. This is consistent with the findings of Peterson et al. [3] who showed that an appropriately designed amino acid alphabet of size 12 exhibits superior sensitivity and specificity to the standard 20 letter IUPAC alphabet. The alphabets designed for this study accounted for 15 of the top 20 scoring amino acid groupings, including randomly generated 9, 11, 14 and 15 letter alphabets optimised through the simulated annealing technique shown in Algorithm 1. This should not be surprising, as the Pigeonhole Principle tells us that, given m items and n containers, if m > n, then there is at least one container with m n items. Indeed, the MCC score for randomly generated sequences could be further optimised by increasing the values of the temperature and transitions parameters used in Algorithm 1. Increasing the initial temperature will enable the algorithm to perform more like a random walk at the early stages of execution and increasing the number of transitions at each temperature allows

16

J. Healy et al.

Table 1. Ranked performance of the top 20 alphabets v/s all 33 AMP families, where Σ = the number of bits used to encode an alphabet of a given size, Sn = Sensitivity, Sp = Specificity and MCC = averaged Multi-class Matthews Correlation Coefficient. We optimised the highlighted alphabets for AMP alignment. Published alphabets outside the top 20 are also shown with their rank.

Rank  Alphabet              Σ Enc/Size  Sn     Sp     MCC    Macro F1  Macro WF1
1     Murphy 12             4/12        0.850  0.995  0.846  0.671     0.886
2     Murphy 10             4/10        0.838  0.995  0.833  0.639     0.873
3     SA Random 9           4/10        0.833  0.995  0.828  0.645     0.863
4     Li 10                 4/10        0.825  0.995  0.822  0.646     0.870
5     Buchfink 11           4/11        0.825  0.995  0.818  0.608     0.846
6     SA Random 15          4/15        0.816  0.994  0.816  0.656     0.871
7     AMP Thionin 13        4/13        0.808  0.994  0.808  0.659     0.859
8     Murphy 15             4/15        0.808  0.994  0.805  0.642     0.849
9     AMP Temporin 11       4/11        0.803  0.994  0.805  0.664     0.856
10    SA Random 12          4/12        0.803  0.994  0.803  0.642     0.855
11    Thionin Clustal 15    4/15        0.799  0.994  0.800  0.644     0.854
12    Thionin Clustal 10    4/10        0.795  0.994  0.796  0.643     0.846
13    Thionin Clustal 14    4/14        0.795  0.994  0.795  0.641     0.850
14    SA Random 14          4/14        0.791  0.994  0.794  0.651     0.852
15    AMP Brevinin 13       4/13        0.791  0.994  0.793  0.658     0.852
16    AMP Lectin 15         4/15        0.791  0.994  0.793  0.641     0.851
17    Temporin Clustal 12   4/12        0.791  0.994  0.792  0.658     0.834
18    Thionin Clustal 12    4/12        0.791  0.994  0.791  0.642     0.844
19    AMP Thionin 9         4/9         0.795  0.994  0.790  0.609     0.826
20    Temporin Clustal 16   4/16        0.786  0.994  0.789  0.645     0.850
34    Li 4                  2/4         0.718  0.991  0.704  0.496     0.738
35    Wang 3                2/3         0.679  0.99   0.655  0.460     0.664
37    Li 3                  2/3         0.675  0.99   0.65   0.428     0.656
38    Murphy 4              2/4         0.667  0.99   0.649  0.460     0.669
47    Murphy 5              3/5         0.521  0.985  0.496  0.424     0.499
48    Li 5                  3/5         0.521  0.985  0.493  0.351     0.504
50    Murphy 6              3/6         0.440  0.983  0.406  0.319     0.428
53    Murphy 8              3/8         0.432  0.983  0.398  0.319     0.435
55    Physicochemical 5     3/5         0.423  0.983  0.391  0.259     0.387
61    Wang 5 Variant        3/5         0.385  0.981  0.351  0.275     0.345
64    Physicochemical 6     3/6         0.346  0.98   0.322  0.226     0.311
65    Wang 2                1/2         0.321  0.979  0.316  0.207     0.338
66    Wang 5                3/5         0.359  0.981  0.311  0.231     0.339
67    Polar-hydrophobic 2   1/2         0.316  0.979  0.310  0.198     0.317

the algorithm to explore amino acid alphabets close to the current state being processed. The values for temperature and transitions were fixed at 30 and 1000, respectively, to reduce the computational overhead of generating a k × k

Reduced Amino-Acid Alphabets and Simulated Annealing to Identify AMPs


confusion matrix to calculate the multi-class MCC score for each state transition by the algorithm. The running time of Algorithm 1 may well be significantly improved by vectorising and parallelising this process. Eleven of the top 20 scoring alphabets were designed specifically for AMP sequences by analysing multiple sequence alignments using Jalview [19]. The sequence-aligned alphabets Thionin Clustal (9–15) and Temporin Clustal (11, 12 and 16) were highly effective in the determination of both vertebrate- and invertebrate-derived AMPs. In particular, Thionin Clustal 13 and Temporin Clustal 11 achieved an MCC score of 1 for profiling the α-defensin, Abaecin, Ascaphin, Aurein, Caerin and Ocellatin AMP family types. These custom alphabets surpassed the performance of published alphabets including Murphy 15, Li 3–5 and Wang 2–5. In addition, these custom alphabets captured specific compositional characteristics commonly associated with AMP amino acid sequences [26]. For example, both the Thionin Clustal 13 and Temporin Clustal 11 alphabets retained distinct compositional motifs, including regions enriched in arginine (R), lysine (K), cysteine (C), and proline (P) residues. Numerous studies of multicellular eukaryotic organisms have demonstrated the conserved nature of such amino acid residues and their importance in antimicrobial activity [15,27–29]. These residues carry cationic and amphipathic features that increase the electrostatic interactions between amino acid side chains and the electronegative components of

Fig. 1. A comparison of the multi-class MCC values for 33 AMP families using 74 reduced alphabets of different sizes. The best alphabets had 9 ≤ |Σ| ≤ 16.


microbial membranes, resulting in cellular death [30]. Replacement or rearrangement of such amino acid groups may lead to a decrease in antimicrobial activity [31]. Although the Thionin Clustal and Temporin Clustal reduced alphabets were successful in the identification of most mammalian-derived AMP families, the remaining alphabets failed to identify any Thionin (a plant-derived peptide) and Temporin (a frog-derived peptide) AMPs. Through further amino acid analysis, it was apparent that both AMP groups were hypervariable in composition, in contrast with the analogous nature of most vertebrate species [32]. Moreover, in comparison to other AMP families, the discovery of plant-derived AMPs remains limited, with a total of just 273 entries to date [33]. Consequently, it will remain difficult to create precise reduced amino acid alphabets until further advancements are made to decode their genomic and proteomic data. Despite these limitations, researchers have recently widened their search for novel AMPs from native protein sources [34,35]. In particular, plant- and algae-derived lectins (carbohydrate-binding proteins) have demonstrated inhibition by effectively binding to the sugar moieties of microbes and have displayed potent antiviral properties by blocking glycosylated spike proteins present on coronaviruses [36,37]. We designed a custom reduced alphabet, LectinClustal12, targeted at lectin-derived AMPs by visualising the amino acid similarities between all available lectin sequences using Jalview and a BLOSUM62 matrix. This aided in the formation of a selective 12-bit alphabet based on the conserved serine and glycine-rich repeats identified between lectins [38]. The LectinClustal12 alphabet achieved an overall MCC score of 1.0, whilst other published alphabets received MCC values at the bottom of the range. Such a

Fig. 2. Multi-class MCC v/s AMP family for the top 10 highest-scoring alphabets.


custom alphabet could be applied to identify novel lectin-based antiviral and antimicrobial agents from other proteinaceous species (Fig. 2).
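The multi-class MCC values quoted above are derived from a k × k confusion matrix. As a minimal sketch, assuming the statistic is Gorodkin's K-category correlation coefficient [22] and that rows hold true classes while columns hold predictions, the score could be computed as follows. The function name `multiclass_mcc` is illustrative and is not taken from the authors' code.

```python
from math import sqrt


def multiclass_mcc(confusion):
    """Multi-class MCC for a square confusion matrix.

    Assumption: confusion[i][j] counts samples whose true class is i
    and whose predicted class is j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)           # all samples
    correct = sum(confusion[i][i] for i in range(k))     # diagonal
    true_counts = [sum(confusion[i]) for i in range(k)]  # row sums
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts)
    )
    denominator = sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # By convention the score is taken as 0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    perfect = [[5, 0, 0], [0, 4, 0], [0, 0, 6]]
    print(multiclass_mcc(perfect))
```

For two classes this expression reduces to the familiar binary MCC, which makes it easy to sanity-check against a hand calculation.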

4 Conclusion

Reduced amino acid alphabets can increase the sensitivity of protein sequence search without sacrificing specificity. The 64-bit k-mer system with reduced alphabets that we described is an accurate mechanism for correctly classifying AMP sequences, and custom alphabets targeted at specific AMP families can be highly effective.
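To illustrate how a reduced alphabet fits a 64-bit k-mer, the following sketch packs residues at 4 bits each, so that one 64-bit word holds a 16-mer. The grouping used here is a 10-letter reduction in the style of Murphy et al. [6] and is an assumption made for illustration, as are the names `GROUPS`, `pack_kmer` and `packed_kmers`; it is not the authors' implementation.

```python
# Illustrative 10-group reduction (assumed, not taken from this paper).
GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]

# Map every amino acid letter to the index of its group.
CODE = {aa: index for index, group in enumerate(GROUPS) for aa in group}

BITS_PER_SYMBOL = 4              # enough for alphabets of up to 16 letters
MAX_K = 64 // BITS_PER_SYMBOL    # 16 symbols fit in one 64-bit word


def pack_kmer(kmer):
    """Encode a k-mer (k <= 16) as an integer that fits in 64 bits."""
    if len(kmer) > MAX_K:
        raise ValueError("k-mer too long for a 64-bit word")
    value = 0
    for aa in kmer:
        # Shift previous symbols left and append the new group code.
        value = (value << BITS_PER_SYMBOL) | CODE[aa]
    return value


def packed_kmers(sequence, k=MAX_K):
    """Return the packed value of every overlapping k-mer in a sequence."""
    return [pack_kmer(sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    # K and R share a group, so these two k-mers collapse to the same code.
    print(pack_kmer("KR") == pack_kmer("RK"))
```

Because substitutable residues share a code, sequences that differ only by such substitutions map to identical integers, which is the property the reduced alphabets exploit.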

References

1. Stephenson, J.D., Freeland, S.J.: Unearthing the root of amino acid similarity. J. Mol. Evol. 77(4), 159–169 (2013)
2. Solis, A.D.: Reduced alphabet of prebiotic amino acids optimally encodes the conformational space of diverse extant protein folds. BMC Evol. Biol. 19(1), 1–19 (2019)
3. Peterson, E.L., Kondev, J., Theriot, J.A., Phillips, R.: Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics 25(11), 1356–1362 (2009)
4. Chan, H.S., Dill, K.A.: Compact polymers. Macromolecules 22(12), 4559–4573 (1989)
5. Lau, K.F., Dill, K.A.: A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 22(10), 3986–3997 (1989)
6. Murphy, L.R., Wallqvist, A., Levy, R.M.: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. 13(3), 149–152 (2000)
7. Wang, J., Wang, W.: A computational approach to simplifying the protein folding alphabet. Nat. Struct. Biol. 6, 1033–1038 (1999)
8. Miyazawa, S., Jernigan, R.L.: Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256(3), 623–644 (1996)
9. Li, T., Fan, K., Wang, J., Wang, W.: Reduction of protein sequence complexity by residue grouping. Protein Eng. 16(5), 323–330 (2003)
10. Cannata, N., Toppo, S., Romualdi, C., Valle, G.: Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics 18(8), 1102–1108 (2002)
11. Lenckowski, J., Walczak, K.: Simplifying amino acid alphabets using a genetic algorithm and sequence alignment. In: Marchiori, E., Moore, J.H., Rajapakse, J.C. (eds.) Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, EvoBIO 2007. LNCS, vol. 4447, pp. 122–131. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-71783-6_12
12. Nanni, L., Lumini, A.: A genetic approach for building different alphabets for peptide and protein classification. BMC Bioinform. 9(1), 1–10 (2008)
13. Dong, G., Zheng, L., Huang, S.H., Gao, J., Zuo, Y.: Amino acid reduction can help to improve the identification of antimicrobial peptides and their functional activities. Front. Genet. 12, 549 (2021)


14. Kaiser, V., Diamond, G.: Expression of mammalian defensin genes. J. Leukoc. Biol. 68(6), 779–784 (2000)
15. Khamis, A.M., Essack, M., Gao, X., Bajic, V.B.: Distinct profiling of antimicrobial peptide families. Bioinformatics 31(6), 849–856 (2015)
16. Zasloff, M.: Antimicrobial peptides of multicellular organisms. Nature 415(6870), 389–395 (2002)
17. Wang, G., Li, X., Wang, Z.: APD3 - the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res. 44, D1087–D1093 (2016)
18. Lata, S., Sharma, B.K., Raghava, G.P.S.: Analysis and prediction of antibacterial peptides. BMC Bioinform. 8, 263 (2007)
19. Waterhouse, A.M., Procter, J.B., Martin, D.M.A., Clamp, M., Barton, G.J.: Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009)
20. Choi, K.P., Zeng, F., Zhang, L.: Good spaced seeds for homology search. Bioinformatics 20(7), 1053–1059 (2004)
21. Buchfink, B., Xie, C., Huson, D.H.: Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12(1), 59–60 (2015)
22. Gorodkin, J.: Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 28(5–6), 367–374 (2004)
23. Chicco, D., Jurman, G.: The advantages of the Matthews Correlation Coefficient over F1 score and accuracy in binary classification evaluation. BMC Genom. 21(1), 1–13 (2020)
24. Chicco, D.: Ten quick tips for machine learning in computational biology. BioData Mining 10(1), 1–17 (2017)
25. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
26. Lata, S., Mishra, N.K., Raghava, G.P.: AntiBP2: improved version of antibacterial peptide prediction. BMC Bioinform. 11(1), 1–7 (2010)
27. Schmitt, P., Rosa, R.D., Destoumieux-Garzón, D.: An intimate link between antimicrobial peptide sequence diversity and binding to essential components of bacterial membranes. Biochimica et Biophysica Acta (BBA) - Biomembranes 1858(5), 958–970 (2016)
28. Tennessen, J.A.: Molecular evolution of animal antimicrobial peptides: widespread moderate positive selection. J. Evol. Biol. 18(6), 1387–1394 (2005)
29. Yount, N.Y., Yeaman, M.R.: Multidimensional signatures in antimicrobial peptides. Proc. Natl. Acad. Sci. 101(19), 7363–7368 (2004)
30. Brogden, K.A.: Antimicrobial peptides: pore formers or metabolic inhibitors in bacteria? Nat. Rev. Microbiol. 3(3), 238–250 (2005)
31. Schmidt, N.W., et al.: Arginine in α-defensins: differential effects on bactericidal activity correspond to geometry of membrane curvature generation and peptide-lipid phase behavior. J. Biol. Chem. 287(26), 21866–21872 (2012)
32. Tam, J.P., Wang, S., Wong, K.H., Tan, W.L.: Antimicrobial peptides from plants. Pharmaceuticals 8(4), 711–757 (2015)
33. Hammami, R., Ben Hamida, J., Vergoten, G., Fliss, I.: PhytAMP: a database dedicated to antimicrobial plant peptides. Nucleic Acids Res. 37(suppl 1), D963–D968 (2009)
34. Cho, J., Sung, B., Kim, S.: Buforins: histone H2A-derived antimicrobial peptides from toad stomach. Biochimica et Biophysica Acta - Biomembranes 1788(8), 1564–1569 (2009)


35. Krizsan, A., Volke, D., Weinert, S., Sträter, N., Knappe, D., Hoffmann, R.: Insect-derived proline-rich antimicrobial peptides kill bacteria by inhibiting bacterial protein translation at the 70S ribosome. Angewandte Chemie Int. Edn. 53(45), 12236–12239 (2014)
36. Barre, A., Van Damme, E.J., Simplicien, M., Benoist, H., Rougé, P.: Man-specific, GalNAc/T/Tn-specific and Neu5Ac-specific seaweed lectins as glycan probes for the SARS-CoV-2 (COVID-19) coronavirus. Mar. Drugs 18(11), 543 (2020)
37. Nascimento da Silva, L.C., et al.: Exploring lectin–glycan interactions to combat COVID-19: lessons acquired from other enveloped viruses. Glycobiology (2020)
38. Millet, J.K., Séron, K., Labitt, R.N., Belouzard, S.: Middle East respiratory syndrome coronavirus infection is inhibited by griffithsin. Antiviral Res. 133, 1–8 (2016)

Acufenometry in the Self-management of Tinnitus: A Revised Interface to Improve the User Experience

Pierpaolo Vittorini1(B), Pablo Chamoso2, and Fernando De la Prieta2

1 Department of Life, Health and Environmental Sciences, University of L’Aquila, 67100 L’Aquila, Italy
[email protected]
2 BSAL/BISITE Research Group, University of Salamanca, Calle Espejo 12, Edificio I+D+i, 37007 Salamanca, Spain
{chamoso,fer}@usal.es

Abstract. Tinnitus is an annoying ringing in the ears, in varying shades and intensities. Tinnitus can affect a patient’s overall health and social well-being. The diagnostic procedure for tinnitus usually consists of three steps: an audiological examination, a psychoacoustic measurement, and a disability evaluation. The authors recently started a project whose aim is to provide a low-cost device and an app to patients, supporting the self-management of tinnitus. In this short paper, we report on the study conducted to evaluate the improved design of the acufenometry examination (i.e., the identification of the frequency and intensity of the tinnitus). Measured with the Single-Ease Question metric, the average rating of the task increased from 2.86/5 for the first implementation to 3.96/5 for the re-design and re-implementation (p = 0.0005). The results show that the perceived usability of the acufenometry task actually improved from the initial implementation to the new one.

Keywords: Tinnitus · Acufenometry · App · Expectation measure · Single-ease question

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 22–30, 2022. https://doi.org/10.1007/978-3-030-86258-9_3

1 Introduction

Health informatics can be essentially defined as the application of computer science, engineering and telecommunication to healthcare [3]. Accordingly, it concerns the use of methods, applications and devices in all aspects of both individual and public health [8,12,27,28]. In such a context, the paper focuses on the management of tinnitus. Tinnitus is a complex of annoying ringing, buzzing or hissing sounds in the ears, in varying shades and intensities [4]. The prevalence of tinnitus increases with age and is more marked in males than females [14]. The overall health and social well-being of patients suffering from


this condition can be affected in a number of ways, from insomnia, poor concentration and anxiety to ongoing depression and inability to work [2,15]. Furthermore, recent studies also pointed out the role of emotional states and emotion dynamics as factors contributing to how tinnitus leads to distress and disability [18,19].

The evaluation of a patient with tinnitus usually requires a long process, because it entails the collection of anamnestic data, a clinical evaluation, audiological examinations, the compilation of questionnaires and appointments with a psychologist. More concretely, a patient affected by tinnitus is first evaluated by an otorhinolaryngologist or audiologist. Then, a multidisciplinary approach is normally adopted: it involves neurologists, dentists, internists, radiologists, psychologists and psychiatrists. Among the fundamental tests, we mention pure tone audiometry, impedanciometry, otoacoustic emissions, the determination of the minimum masking level of tinnitus and the sound discomfort threshold, and the psychoacoustic measurements of frequency and intensity that match the tinnitus, i.e., acufenometry, which is the main focus of this paper.

In such a context, the authors started a project whose aim is to support patients in the daily management of tinnitus. So far, the project has developed a device and an app [6,7] supporting (a part of) the diagnostic procedure summarised above, focusing on three main phases, i.e., audiological examination, psychoacoustic measurement of tinnitus, and disability evaluation. Accordingly, the project developed:

• a simple, affordable, platform-independent device that can be connected to a smartphone/tablet, able to execute (a part of) the audiological and psychoacoustic examinations needed to diagnose tinnitus, and
• a dedicated app that controls the device, automates both the execution of the examinations and the administration of the questionnaires needed to measure the disability induced by the tinnitus, with an easy-to-use interface through which the execution of examinations and their reporting could be performed directly by the end-users.

Numerous apps that perform hearing tests are available [16], even if significant research in this field, particularly in terms of assessment and validation, is still needed [5]. With respect to tinnitus, however, only a small set of apps has been developed so far: in [10], a smartphone app with the purpose of promoting skills that help manage and cope with tinnitus is presented; the app described in [24] enables researchers to collect longitudinal data under real-life conditions and with high cost efficiency (e.g., in [20] it is used to monitor the circadian variations of tinnitus). Furthermore, different audiometers, controlled by ad-hoc apps and developed for audiologists and physicians, are available in [11,25]. With respect to the literature summarised above, and to the best of our knowledge, our project is the first attempt to develop an app and an affordable device specifically developed for patients, capable of performing audiological and


P. Vittorini et al.

psychoacoustic measurements, automatically reporting on them, and evaluating disability through standardised questionnaires, in an integrated environment that enables patients with tinnitus to self-manage their condition. The specific research presented in this paper focuses on the acufenometry process, and in particular, on a study about the re-design and re-implementation of the app interface that controls it. To this aim, Sect. 2 introduces the procedure behind the execution of an acufenometry and the device/app that support its execution. Then, Sect. 3 describes the study and summarises the main results. Finally, Sect. 4 ends the paper with a short discussion and the conclusions.

2 Acufenometry

Acufenometry is used to determine the frequency and intensity of tinnitus. Several methods can be adopted, e.g., frequency matching, method of adjustment, forced-choice double-staircase adaptive procedure [9,17]. The standard measure for tinnitus is frequency matching, which is easily understandable by patients and technically straightforward to implement in the app. In this method, patients are asked to compare the frequency of a test-sound (i.e., a pure tone) with that of tinnitus. Two tones are presented alternately to both ears so that each is heard 4–5 times; the frequency is changed (increased or decreased) until the patient finds the one that is the closest to that of tinnitus. Intensity is then established by comparing the test-sound with that of tinnitus. A pure tone at the previously identified frequency is generated. Then, the intensity is increased in 5 dB steps until the patient hears it. In this way the “threshold of perception” of a signal is established and taken as the reference level of 0 dB. As the intensity is increased further in 5 dB steps, the patient is asked to report when the sound level completely masks that of tinnitus. The frequency and intensity identified as above represent the result of the acufenometry.

Figure 1 shows the interconnection scheme between the smartphone, the device and the external audio outputs. In this scheme, the device acts as an external peripheral, controlled by the smartphone and its dedicated app, connected through a USB On-The-Go (OTG) port that serves both to control the device (set up and execute commands) and to supply power to it. Furthermore, the external outputs of the hardware device are connected to the air headset and bone conduction transducers through standard audio cables¹.

¹ The device is equipped with both air and bone transducers; the bone transducers are unused for the acufenometry but are essential for the pure tone audiometry testing.
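The intensity-matching steps described above can be sketched as a small routine. This is a minimal illustration, assuming the patient's answers arrive through two callbacks; the names `hears_tone` and `masks_tinnitus`, the starting level and the `max_db` safety limit are hypothetical and do not come from the app described in the paper.

```python
from typing import Callable, Tuple

STEP_DB = 5  # step size stated in the text


def match_intensity(
    hears_tone: Callable[[int], bool],
    masks_tinnitus: Callable[[int], bool],
    start_db: int = 0,
    max_db: int = 100,  # assumed safety limit, not from the paper
) -> Tuple[int, int]:
    """Return (perception threshold, masking level relative to it)."""
    level = start_db
    # Phase 1: raise the tone until the patient first hears it.
    while not hears_tone(level):
        level += STEP_DB
        if level > max_db:
            raise RuntimeError("tone never perceived within the limit")
    threshold = level  # taken as the 0 dB reference

    # Phase 2: keep raising until the tone completely masks the tinnitus.
    while not masks_tinnitus(level):
        level += STEP_DB
        if level > max_db:
            raise RuntimeError("tinnitus never masked within the limit")
    return threshold, level - threshold


if __name__ == "__main__":
    # Simulated patient: hears the tone from 20 dB, masked from 35 dB.
    print(match_intensity(lambda db: db >= 20, lambda db: db >= 35))
```

In an interactive app the callbacks would wait for user input; simulated responses are used here only so the routine can be exercised without hardware.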


Fig. 1. The dedicated device acts as an external peripheral which is controlled by the app

3 Study

3.1 Study Design

For this study, we gathered quantitative data using the Expectation Measure (EM) [1]. The EM is a self-reported metric that rates how easy (or difficult) each task was, in comparison to how easy (or difficult) the user thought it was going to be. The expectation (“before”) and the experience (“after”) ratings are expressed on a five-point Likert scale, where 1 = very difficult, 2 = difficult, 3 = normal, 4 = easy, 5 = very easy. Note that the experience rating is equivalent to the Single-Ease Question (SEQ) metric [23]. Therefore, hereafter, we indicate the expectation rating with EM, and the experience rating with SEQ. The study proceeded as follows [26]:

• The evaluator explained the goal of the task to the user and established a friendly environment;
• The evaluator asked the following question: On a scale from 1 to 5, where 1 = very difficult, 2 = difficult, 3 = normal, 4 = easy, 5 = very easy, [EM] how easy do you think that completing an acufenometry will be?
• By first using the old and then the new implementation, the user performed the acufenometry: he/she had to reproduce a sound similar to that of his/her tinnitus via the app. For users that did not have tinnitus when performing the test, the evaluator played a sample of tinnitus sound and asked them to match it with the application;
• For each implementation, the evaluator asked the following question: On a scale from 1 to 5, where 1 = very difficult, 2 = difficult, 3 = normal, 4 = easy, 5 = very easy, [SEQ] how easy was completing an acufenometry, actually? Users were also invited, if they wished, to leave a comment on how to improve the application.


Fig. 2. The (a) old and the (b) new interface for the acufenometry

The same procedure was adopted for evaluating both the initial and the refined implementation of the acufenometry. In the first implementation [6] (see Fig. 2a), the switches placed at the top have to be used to select which ear experiences tinnitus, the horizontal/vertical arrows change the frequency/intensity of the emitted sound, and the central button can be tapped to confirm that the emitted sound actually resembles the tinnitus. The initial interface followed the design patterns of the other apps available for performing the acufenometry (see Fig. 3): the selection of the intensity and frequency is performed through scrollbars and/or by directly entering the values. In the new version of the interface (see Fig. 2b), the user is only required to move the black dot: if the dot is on the left/right side of the circle, the sound is sent to the left/right ear; the closer the dot is to the top/bottom, the more acute/deep the sound; the further/closer the dot is from the centre, the louder/weaker the sound. The revised interface was designed by following the guidelines regarding the use of continuous instead of discrete controls [13].
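The mapping from dot position to sound in the new interface can be illustrated with a short sketch. Only the qualitative behaviour (side selects the ear, height selects the pitch, distance from the centre selects the loudness) comes from the description above; the frequency range, the logarithmic pitch scale, the 80 dB ceiling, the handling of a perfectly centred dot and the function name `dot_to_sound` are all assumptions made for illustration.

```python
from math import hypot

F_MIN_HZ = 125.0     # assumed lowest test frequency
F_MAX_HZ = 8000.0    # assumed highest test frequency
MAX_LEVEL_DB = 80.0  # assumed level at the rim of the circle


def dot_to_sound(x, y):
    """Map a dot inside the unit circle to (ear, frequency in Hz, level in dB).

    x and y are in [-1, 1] with the circle centred on (0, 0);
    y = 1 is the top of the circle.
    """
    radius = hypot(x, y)
    if radius > 1.0:
        raise ValueError("dot must lie inside the unit circle")
    # Left half -> left ear, right half -> right ear (x == 0 is
    # arbitrarily assigned to the right ear in this sketch).
    ear = "left" if x < 0 else "right"
    # Higher on the screen -> more acute tone, on a logarithmic scale.
    fraction = (y + 1.0) / 2.0
    frequency = F_MIN_HZ * (F_MAX_HZ / F_MIN_HZ) ** fraction
    # Further from the centre -> louder tone.
    level = radius * MAX_LEVEL_DB
    return ear, frequency, level


if __name__ == "__main__":
    print(dot_to_sound(-0.5, 0.0))
```

A logarithmic frequency scale is chosen here because pitch perception is roughly logarithmic; a linear mapping would satisfy the description in the text equally well.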

Fig. 3. Apps for performing acufenometry: (a) “Acufenos” and (b) “Tinnitus Masker”


Fig. 4. Summary of the EM and SEQ analyses

To analyse the data, we proceeded as explained in [1]. First, we calculated the average EM and SEQ with 95% confidence intervals. Then, we placed the results in a scatterplot (expectation on the x-axis, experience on the y-axis). As suggested in [1], tasks in the upper-right quadrant (i.e., good expectation and good experience) can be considered satisfactory; tasks in the lower-right quadrant (i.e., good expectation and low experience) need to be addressed with priority; tasks in the upper-left quadrant (i.e., low expectation and good experience) show a surprisingly good user experience; tasks in the lower-left quadrant (i.e., low expectation and low experience) are no surprises, but may be addressed because they represent important opportunities to make improvements. As for the SEQ, we also placed the results in a graph: on the x-axis the two implementations, on the y-axis the average ratings with confidence intervals. The statistical analyses were performed using R (version 4.0.4) [21]; the results follow. A total of 26 users participated: their average age was 39 with a standard deviation of 15, ranging from 19 to 74; 27% were female, 73% were male; 62% were Italian, 38% were Spanish.

3.2 Results

Figure 4 summarises the results of the EM and SEQ analyses. The table on the top reports all numbers: the average EM with 95% confidence intervals; the average SEQ with 95% confidence intervals (for the two interfaces); the p-value of the Wilcoxon test [22] used to assess whether the difference is statistically significant or not². The chart on the left shows that the old interface was rated with a low expectation and experience, whereas the new one was rated as surprisingly good. The chart on the right focuses on SEQ, confirming the statistically significant improvement.

² We preferred a non-parametric test instead of the parametric t-test because both EM and SEQ are qualitative measures, even if numerically expressed in a Likert-scale ranging from 1 to 5.
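The EM/SEQ analysis (average ratings with 95% confidence intervals, then a quadrant label) can be sketched as follows. The authors performed their analysis in R; this Python version is only an illustration. It assumes a normal-approximation interval, takes the scale midpoint 3 ("normal") as the boundary between low and good ratings, and uses the hypothetical names `mean_ci` and `quadrant`. The expectation value in the usage line is a placeholder, since the average EM is reported only in Fig. 4.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(ratings, confidence=0.95):
    """Mean of the ratings with a normal-approximation confidence interval."""
    centre = mean(ratings)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return centre, centre - half_width, centre + half_width


def quadrant(expectation, experience, midpoint=3.0):
    """Label a task by its average expectation (EM) and experience (SEQ)."""
    good_expectation = expectation >= midpoint
    good_experience = experience >= midpoint
    if good_expectation and good_experience:
        return "satisfactory"
    if good_expectation:
        return "address with priority"
    if good_experience:
        return "surprisingly good"
    return "opportunity for improvement"


if __name__ == "__main__":
    hypothetical_em = 2.5  # placeholder, not a value from the paper
    print(quadrant(hypothetical_em, 2.86))  # old interface SEQ
    print(quadrant(hypothetical_em, 3.96))  # new interface SEQ
```

For the paired comparison of the two interfaces, a Wilcoxon signed-rank test is available in SciPy as `scipy.stats.wilcoxon`; it is left out here to keep the sketch free of third-party dependencies.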

4 Discussion and Conclusions

The paper summarises the authors’ research aimed at improving the user experience of patients dealing with the self-execution of the acufenometry, supported by a dedicated app and device. The research consisted of a re-design and re-implementation of the interface that guides a user in performing the acufenometry, and a study regarding its usability. The study was conducted with self-reported metrics (i.e., Expectation Measure and Single-Ease Question) that, given the pandemic period, could not be supported by direct observations. Nevertheless, the results show that the perceived usability of the acufenometry task actually improved from the initial implementation to the new one, and that the task is now experienced as easier than expected. As future work, we plan a further study that will include direct observations of patients, with the aim of investigating the overall app usability in performing a complete diagnostic evaluation (i.e., pure tone audiometry, acufenometry, questionnaires), as well as the ergonomic aspects of wearing the air/bone integrated headset. This latter point is also crucial, given that the bone transducers are placed on small arms (see Fig. 1). As a consequence, we aim at verifying (i) if and when the headset is difficult to wear, and (ii) if the bone transducers can be easily adjusted to the correct placement (over the bones placed on the rear of the ear) and pressure (so that the vibration is correctly transferred to the bones and not dampened, e.g., by hair).

References

1. Albert, W., Dixon, E.: Is this what you expected? The use of expectation measures in usability testing. In: Proceedings of the Usability Professionals Association 2003 Conference. Scottsdale, AZ (2003)
2. American Tinnitus Association: Impact of Tinnitus (2019). https://www.ata.org/understanding-facts/impact-tinnitus
3. Bath, P.A.: Health informatics: current issues and challenges. J. Inf. Sci. 34(4), 501–518 (2008)
4. Hoekstra, C., Venekamp, R., van Zanten, B.: Tinnitus. Huisarts en wetenschap 58(10), 548–551 (2015). https://doi.org/10.1007/s12445-015-0287-y
5. Bright, T., Pallawela, D.: Validated smartphone-based apps for ear and hearing assessments: a review. JMIR Rehabil. Assist. Technol. 3(2), e13 (2016). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5454564/
6. Pablo, C., Fernando, D.L.P., Alberto, E., Angelo, T., Pierpaolo, V.: An app supporting the self-management of tinnitus. In: Fdez-Riverola, F., Mohamad, M.S., Rocha, M., De Paz, J.F., Pinto, T. (eds.) 11th International Conference on Practical Applications of Computational Biology and Bioinformatics. PACBB 2017. AISC, vol. 616, pp. 83–91. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60816-7_11
7. Chamoso, P., De La Prieta, F., Eibenstein, A., Santos-Santos, D., Tizio, A., Vittorini, P.: A Device Supporting the Self Management of Tinnitus. In: Rojas, I., Ortuño, F. (eds.) IWBBIO 2017. LNCS, vol. 10209, pp. 399–410. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56154-7_36


8. He, J., Baxter, S.L., Xu, J., Xu, J., Zhou, X., Zhang, K.: The practical implementation of artificial intelligence technologies in medicine (2019). https://doi.org/10.1038/s41591-018-0307-0
9. Henry, J.A.: “Measurement” of Tinnitus. Otol. Neurotol. 37(8), e276–e285 (2016). http://journals.lww.com/00129492-201609000-00043
10. Henry, J.A., et al.: Development and field testing of a smartphone “App” for tinnitus management. Int. J. Audiol. 56(10), 784–792 (2017). https://pubmed.ncbi.nlm.nih.gov/28669224/
11. Inventis: Homepage - Inventis (2021). https://www.inventis.it/
12. Lancia, L., Pisegna Cerone, M., Vittorini, P., Romano, S., Penco, M.: A comparison between EASI system 12-lead ECGs and standard 12-lead ECGs for improved clinical nursing practice. J. Clin. Nurs. 17(3), 370–377 (2008)
13. Laubheimer, P.: Input Controls for Parameters: Balancing Exploration and Precision with Sliders, Knobs, and Matrices. Nielsen Norman Group, New York (2017)
14. McCormack, A., Edmondson-Jones, M., Somerset, S., Hall, D.: A systematic review of the reporting of tinnitus prevalence and severity (2016). https://pubmed.ncbi.nlm.nih.gov/27246985/
15. Moring, J., Bowen, A., Thomas, J., Bira, L.: The emotional and functional impact of the type of tinnitus sensation. J. Clin. Psychol. Med. Settings 23(3), 310–318 (2015). https://doi.org/10.1007/s10880-015-9444-5
16. Paglialonga, A., Tognola, G., Pinciroli, F.: Apps for hearing science and care. Am. J. Audiol. 24(3), 293–298 (2015). https://pubmed.ncbi.nlm.nih.gov/26649533/
17. Penner, M.J., Bilger, R.C.: Consistent within-session measures of tinnitus. J. Speech Hearing Res. 35(3), 694–700 (1992). https://pubmed.ncbi.nlm.nih.gov/1608262/
18. Probst, T., Pryss, R., Langguth, B., Schlee, W.: Emotion dynamics and tinnitus: Daily life data from the “trackYourTinnitus” application. Sci. Rep. 6, 1–9 (2016). https://pubmed.ncbi.nlm.nih.gov/27488227/
19. Probst, T., Pryss, R., Langguth, B., Schlee, W.: Emotional states as mediators between tinnitus loudness and tinnitus distress in daily life: results from the “trackYourTinnitus” application. Sci. Rep. 6, 1–8 (2016). https://pubmed.ncbi.nlm.nih.gov/26853815/
20. Probst, T., et al.: Does tinnitus depend on time-of-day? An ecological momentary assessment study with the “TrackYourTinnitus” application. Front. Aging Neurosci. 9, 253 (2017). https://pubmed.ncbi.nlm.nih.gov/28824415/
21. R Core Team: R: A Language and Environment for Statistical Computing (2018). https://www.R-project.org/
22. Riffenburgh, R.H.: Statistics in Medicine. Elsevier/Academic Press, New York (2012)
23. Sauro, J., Dumas, J.S.: Comparison of three one-question, post-task usability questionnaires. In: Conference on Human Factors in Computing Systems - Proceedings, pp. 1599–1608. ACM Press, New York, New York, USA (2009). http://dl.acm.org/citation.cfm?doid=1518701.1518946
24. Schlee, W., et al.: Measuring the moment-to-moment variability of Tinnitus: the TrackYourTinnitus smart phone app. Front. Aging Neurosci. 8, 294 (2016). https://pubmed.ncbi.nlm.nih.gov/28018210/
25. SHOEBOX Ltd: Accurate and Boothless Audiometric Testing (2021). https://www.shoebox.md/
26. Tullis, T., Albert, W.: Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics. Elsevier, New York (2013)


27. Vittorini, P., Tarquinio, A., di Orio, F.: XML technologies for the Omaha system: a data model, a Java tool and several case studies supporting home healthcare. Comput. Methods Programs Biomed. 93(3), 297–319 (2009)
28. Yamin, M.: IT applications in healthcare management: a survey. Int. J. Inf. Technol. 10(4), 503–509 (2018). https://doi.org/10.1007/s41870-018-0203-3

The pegi3s Bioinformatics Docker Images Project

Hugo López-Fernández1,2,3,4(B), Pedro Ferreira1,2, Miguel Reboiro-Jato3,4, Cristina P. Vieira1,2, and Jorge Vieira1,2

1 Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
{hugo.fernandez,pedro.ferreira}@i3s.up.pt, {cgvieira,jbvieira}@ibmc.up.pt
2 Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
3 CINBIO, Department of Computer Science, ESEI – Escuela Superior de Ingeniería Informática, Universidade de Vigo, 32004 Ourense, Spain
[email protected]
4 SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain

Abstract. Among the available Linux container technologies, Docker is one of the most popular ones. Docker images can be used to provide ready-to-use software packages, where all required dependencies are already installed, and they can be deployed in any operating system where Docker is installed. They are also a convenient way to store immutable working software packages, thus contributing to reproducibility. Moreover, the usage of Docker images greatly eases the development of complex pipelines, standalone software applications with graphical user interfaces that require external software, and even the development of databases. Therefore, not surprisingly, Docker images are now ubiquitously used in computational biology and bioinformatics. Here, we present the pegi3s Bioinformatics Docker Images Project (https://pegi3s.github.io/dockerfiles/), a collection of more than 70 Docker images for commonly used software in the fields of genomics, transcriptomics, proteomics, phylogenetics, and sequence handling, among others, that is constantly growing. Several features distinguish this project from much larger projects, namely: 1) by providing a list of tools that are classified into broad categories, it is easier to find the most adequate tool(s) for a given project; 2) by providing the hyperlinks to the software manuals, we facilitate the process of finding the parameter combinations that are best suited for a given processing step; 3) most importantly, we provide clear instructions on how to run the images, provide test data that can be used to quickly evaluate the Docker image, and give all details on how each Docker image was built. All images are routinely used by ourselves, in the context of our research and teaching activities, meaning that they have been extensively tested. 
Therefore, we believe that this project, which is offered as a service in the context of the European ELIXIR program, is of interest to many researchers, whether or not they have a background in informatics.

H. López-Fernández and P. Ferreira contributed equally to this work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 31–40, 2022. https://doi.org/10.1007/978-3-030-86258-9_4

Keywords: Docker · Bioinformatics · Reproducibility

1 Introduction

Computational biology and bioinformatics rely heavily on different software tools and packages and, ultimately, on connecting them successfully into workflows or automated pipelines that perform complete analyses without human intervention, greatly speeding up the creation of new knowledge [1]. Nevertheless, installing such software applications (including managing their dependencies) can be a cumbersome task that negatively affects research efficiency and reproducibility. The main difficulty is that not all software versions will work in the context of a given pipeline. Moreover, older software versions, as well as software libraries, may become unavailable with time. Therefore, it may be impossible to re-analyze the data under the same conditions in the future. The usage of immutable, ready-to-use Linux containers solves these issues, and thus it is not surprising that they are ubiquitously used.

Software applications developed for operating systems (OS) other than Linux require applications and libraries that are protected by international laws, so containers cannot be created for them unless a compatibility layer is used, such as Wine¹, which allows running Windows applications on Linux OS. Nevertheless, most scientific software has been developed for Linux OS. Any given bioinformatics software running on Linux OS can be containerized together with all its dependencies in such a way that it can be seamlessly executed regardless of the OS distribution used by the host system. This is achieved by using advanced features of modern Linux kernels that allow running a set of processes in a fully isolated environment [2]. Because the required dependencies are installed only once, on a clean system (without undesired interactions with other installed packages), dependency management is greatly simplified.
Unlike virtual machines, containers share the machine’s OS kernel and therefore do not require an OS per application. Moreover, because containers are lightweight, it usually does not take long to download a container from an online repository. Among the different container tools available, Docker is one of the most popular, and many bioinformatics projects based on Docker have emerged in recent years. Some of these projects are presented in Sect. 2. In addition, recommendations regarding the writing of Dockerfiles (descriptions of Docker images) and the creation of containerized bioinformatics software have been published [3, 4].

Despite the very appealing features described above, it should be noted that Docker is rarely used in multi-user systems such as High Performance Computing (HPC) systems, grid infrastructures, or Linux clusters. Indeed, processes within a Docker container are normally executed with root privileges under the Docker daemon process tree, thus escaping the resource usage policies, accounting controls, and process controls that are imposed on normal users. Moreover, users authorized to access the Docker remote API can easily gain privileged access to the host system [2]. Docker images that implement software with graphical interfaces also cannot be deployed within multi-user systems, since these are usually headless systems. Therefore, Docker images are mostly appreciated in the context of institutional bioinformatics platforms that provide services to research groups, or in research groups that routinely develop and/or perform large-scale bioinformatics analyses. Although there is a tool that enables the user to execute Docker containers in user mode (udocker [2]), and is thus appropriate for multi-user systems, this solution only works for a fraction of all available Docker images.

Nearly three years ago, we started to use Docker at the Phenotypic Evolution Group at i3S (pegi3s) to manage the bioinformatics software we needed. Instead of installing every software application on every computer where it was required, we created one Dockerfile for each application and a GitHub project² to host the Dockerfiles and associated resources. These files were then used to create the Docker images that are hosted at Docker Hub³. Once an image is published, all it takes is a pull from Docker Hub to the computer where it is needed. This is how the pegi3s Bioinformatics Docker Images Project started. Given the large number of Docker images that were developed, it seemed logical to create detailed instructions on how to run every Docker image, and to provide hyperlinks to the software application manuals. This way, it is easier for new students and researchers to start using the required software tools. To date, 77 Docker images for commonly used software applications, pipelines, and software for the automated submission to web servers have been developed. As evidenced by the more than 70 000 pulls, many researchers who do not belong to i3S are currently using these images.

1 https://www.winehq.org.
The usefulness of the pegi3s Bioinformatics Docker Images Project has been recognized by ELIXIR, an intergovernmental organization that brings together life science resources from across Europe, where it is advertised as a service (search for pegi3s at https://elixir-europe.org/services). In this work, we present our project in detail, as well as our own experience using Docker to deploy software commonly used in the genomics, transcriptomics, phylogenetics, and protein modeling fields, and to develop pipelines and databases. New images are constantly being added. For users wishing to use containerized software with a graphical user interface (GUI) in Windows hosts, we also provide a VirtualBox image with Docker installed.

2 Related Work

The first bioinformatics Docker projects to be published (2015) were BioBoxes [5] and BioShaDock [6]. BioBoxes⁴ aimed to define a way to specify standardised bioinformatics containers. As the authors state in their original publication: “A biobox is a software container with a standardised interface that describes what kind of input files and parameters are accepted and which output files are to be returned.” [5]. Along with a repository of Docker images (which currently has fewer than 10 images), the project had a command line interface to facilitate the usage of the images. BioShaDock was started as a bioinformatics-focused Docker registry, providing a local and fully controlled environment to build and publish bioinformatics software as portable Docker images.

2 https://pegi3s.github.io/dockerfiles.
3 https://hub.docker.com.
4 http://bioboxes.org.


BioShaDock was then merged into the BioContainers project, created in 2017 [7]. BioContainers⁵ provides the infrastructure and basic guidelines to create, manage, and distribute bioinformatics packages and containers. First, it provides a base specification and infrastructure to develop, build, and deploy new bioinformatics software. Second, it has a repository with a series of containers ready to be used by the bioinformatics community. The project also provides guidelines and help on how to create reproducible pipelines and workflows using bioinformatics containers [3]. Every container of the project is deployed and permanently deposited in a public registry (Docker Hub or Quay.io), and one of its main features is that it automatically builds Docker containers for all BioConda packages.

Two other Docker-related projects with more specific goals were published in 2017: Dugong [8] and The Dockstore [9]. Dugong⁶ is announced as a complete Docker desktop environment for bioinformatics and is based on Ubuntu 16.04, providing a complete GUI that facilitates the installation and use of different bioinformatics software obtained from LinuxBrew and BioConda. Dugong includes the Anaconda Navigator, a graphical package manager for Conda and the BioConda repository, and a ready-to-use Jupyter Notebook installation. The Dockstore⁷ is an open platform used by the Global Alliance for Genomics and Health (GA4GH) for sharing Docker-based tools and workflows described with either the Common Workflow Language (CWL), the Workflow Description Language (WDL), or Nextflow.

In 2018, the Reproducible Bioinformatics Project⁸ was published [8]. This project, focused on workflow reproducibility, aims to provide a schema and an infrastructure, based on Docker images and R packages, to guarantee reproducible results. Based on this project, the authors provide five ready-to-use workflows for RNA-seq, miRNA-seq, ChIP-seq, single-cell RNA-seq, and circular RNA.
To finish this brief overview, we also mention ORCA⁹, a comprehensive bioinformatics container environment for education and research published in 2019 [10]. This environment includes hundreds of pre-compiled and configured bioinformatics tools and can be used to easily install a multitude of bioinformatics tools on a fresh Linux server, providing a private container to each individual user, or shared containers to a collaborative group of users.

The main difference between our project and much larger projects like BioContainers is the exhaustive documentation provided for each Docker image, which allows researchers without a background in informatics to use them easily. This documentation includes test cases, a list of the most used commands and options, hyperlinks to the software’s manual, and a “docker run” command where only data paths need to be adjusted to successfully execute the image. All images listed in our repository are curated, tested, and maintained by ourselves. A great effort is made to ensure that they work properly without posing a security risk to others. Images for new software applications are created upon request of project users or as we need them to conduct our research and develop our own pipelines.

5 https://biocontainers.pro.
6 https://dugongbioinformatics.github.io.
7 https://dockstore.org.
8 http://www.reproducible-bioinformatics.org.
9 https://hub.docker.com/r/bcgsc/orca.
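To illustrate the idea of a “docker run” command where only the data path changes, the following is a minimal sketch in Python, not a command taken from the project documentation. The image name, the tool call, and the /data mount point are hypothetical placeholders; only the standard Docker CLI options --rm and -v are assumed.

```python
import shlex
from pathlib import Path


def docker_run_command(image, tool_args, host_dir, container_dir="/data"):
    """Return the argv list for a one-off `docker run` of a containerized tool.

    Only `host_dir` is expected to change from one user to another; the
    remaining parts of the command can be copied unchanged.
    """
    host_path = Path(host_dir).expanduser().resolve()
    return [
        "docker", "run", "--rm",                # remove the container when it exits
        "-v", f"{host_path}:{container_dir}",   # bind-mount the user's data directory
        image,
        *tool_args,
    ]


# Hypothetical image and tool names, used only for illustration.
cmd = docker_run_command(
    image="pegi3s/example-tool",
    tool_args=["example-tool", "/data/input.fasta"],
    host_dir="/path/to/your/data",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a host where Docker is installed; the actual command for each image should be copied from that image's documentation.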

3 The pegi3s Bioinformatics Docker Images Project

3.1 Docker Images

The main aim of our project is to have a manageable, curated set of well-documented and tested bioinformatics Docker images. By routinely using them, we can easily detect any unforeseen problems and correct them promptly. For each Docker image in our project, we provide clear instructions on how to use it (main commands and most used options), give test cases, and provide the hyperlink to the software’s manual. Therefore, for every image, there is a “docker run” command where only data paths need to be adjusted to successfully execute the image. The homepage of the project groups the available bioinformatics Docker images into the following categories:

• Programs: images associated with scientific software applications for genomics, proteomics, phylogenetics, and so on. This is the category with the largest number of images. While most of them are command-line applications, there are also several programs with GUIs. In the latter case, as mentioned in the project description, the user must first disable access control by typing “xhost +” in the command line, before running the Docker image.
• Automated submission to web servers: currently, this category contains four images that automate the submission process to web servers.
• Pipelines: images providing implementations of small pipelines. For instance, the “splign-compart” image includes a script that executes the Splign/Compart pipeline as implemented in our SEDA tool¹⁰ [11].
• Compi pipelines: images of Compi-based pipelines [12] developed by us, as explained in more detail in Subsect. 3.2.
• External images: images that are only mirrors of external images, that is, images whose original Dockerfiles were not written by us. The purpose of having these images in the project is to add basic documentation on the image usage to our repository, following the same format we have for the remaining images.
• Additional images: images for general-purpose software like R or Biopython.

As of April 16, 2021, our project has 77 Docker images. Only one of them (pegi3s/dnasp-v6) is for software with a GUI developed for Windows OS. To develop this image, Wine was used as the translation layer. Although DnaSP is easy to install in Windows OS, this image allowed us to perform workflow analyses without having to use computers with different OS. Nevertheless, recent Windows versions (Windows 10 64-bit: Pro, Enterprise, or Education - Build 17134 or higher) allow running Docker as a native process¹¹. Moreover, Linux applications can now be deployed as-is on the Windows Subsystem for Linux (WSL). Therefore, the need for Docker images for Windows software applications may diminish or even disappear in the near future, with the full and efficient integration of Windows and Linux OS.

Table 1 shows the top 10 downloaded images, based on the pull counts reported by the Docker Hub API. Although these counts are not 100% accurate (e.g. the pull count is increased by automated builds and by docker pull commands executed when the image already exists in the host), they reflect the community usage of these images. In addition, all pegi3s Docker images can be run using Singularity, and we provide a guide with some examples¹².

10 https://www.sing-group.org/seda/manual/operations.html#splign-compart-pipeline.
11 https://docs.docker.com/docker-for-windows/install.

Table 1. Top 10 downloaded images of the project on April 16, 2021.

Docker image        Number of pulls
SAMtools-BCFtools   34453
FastQC              14710
HyPhy               2315
SRA Toolkit         1898
BWA                 1633
SeqKit              1461
Utilities           855
Bedtools            852
T-Coffee            709
MrBayes             555
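As a small worked example of how the counts in Table 1 can be read, the sketch below computes the total number of pulls of the ten listed images and the fraction contributed by the most downloaded ones. The numbers are copied from Table 1; the shares are relative to these ten images only, not to all 77 images of the project, and the function name is our own.

```python
# Pull counts copied from Table 1 (Docker Hub, April 16, 2021).
PULLS = {
    "SAMtools-BCFtools": 34453,
    "FastQC": 14710,
    "HyPhy": 2315,
    "SRA Toolkit": 1898,
    "BWA": 1633,
    "SeqKit": 1461,
    "Utilities": 855,
    "Bedtools": 852,
    "T-Coffee": 709,
    "MrBayes": 555,
}


def cumulative_share(pulls, top_n):
    """Fraction of the listed pulls contributed by the `top_n` most pulled images."""
    ranked = sorted(pulls.values(), reverse=True)
    return sum(ranked[:top_n]) / sum(ranked)


total = sum(PULLS.values())
print(f"Total pulls of the ten listed images: {total}")
print(f"Share of the two most pulled images: {cumulative_share(PULLS, 2):.1%}")
```

Under this reading, the two most pulled images account for most of the pulls in the table, which is worth keeping in mind when interpreting the counts as a measure of community usage.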

Following good programming practices, we have also created some images where we put common scripts that can be used in several scenarios. These images are “utilities”, “blast_utilities”, and “biopython_utilities”. The “utilities” image mostly contains simple scripts to process FASTA files along with some other general scripts (e.g. to remove the last line of a set of files or to deinterleave FASTQ files). Similarly, the “blast_utilities” and “biopython_utilities” images contain reusable scripts that require BLAST or Biopython.

3.2 Pipeline Development

We have been working on the development of Compi pipelines for large-scale detection of positively selected amino acid sites [13]. These pipelines, listed in Table 2, are part of the pegi3s Bioinformatics Docker Images Project. These developments built on the pegi3s Docker images already available for commonly used bioinformatics software, meaning that there is a Docker image for every pipeline step requiring a new software tool. In addition, scripts for common tasks were added to the utilities images presented before. This way, pipelines are defined so that they use “docker run” commands every time external software is needed, and all the Docker images used are from the pegi3s project. The only exception is FastScreen, which was the first pipeline to be developed and includes the dependencies in the same Docker image. As Table 2 shows, all the pipelines are publicly available at Compi Hub [14] and the corresponding Docker images are available at the project’s repository at Docker Hub.

12 https://github.com/pegi3s/dockerfiles/blob/master/tutorials/singularity.md.

Table 2. Docker images of Compi pipelines for Positively Selected Sites (PSS) identification.

Pipeline                                                      Alias (GitHub and Docker Hub)  Compi Hub ID (https://www.sing-group.org/compihub/explore/)
FastScreen                                                    pss-fs                         5d5bb64f6d9e31002f3ce30a
GenomeFastScreen [15]                                         pss-genome-fs                  5e2eaacce1138700316488c1
IPSSA (Integrated Positively Selected Sites Analyses)         ipssa                          5fa91806407682001ad3a1e9
Auto-PSS-Genome (Automatic Positively Selected Sites Genome)  auto-pss-genome                5faa52ccf05e940c9c2762e4
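As an illustration of the kind of common-task script mentioned above (deinterleaving FASTQ files), the following is a minimal Python sketch. It is not the script shipped in the “utilities” image. It assumes that every record spans exactly four lines and that the two mates of each pair alternate in the input (forward read first); all function names are our own.

```python
from itertools import islice


def read_fastq_records(lines):
    """Yield FASTQ records as 4-tuples (header, sequence, separator, quality)."""
    iterator = iter(lines)
    while True:
        record = tuple(islice(iterator, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def deinterleave(lines):
    """Split interleaved paired-end records into (forward, reverse) record lists."""
    forward, reverse = [], []
    for index, record in enumerate(read_fastq_records(lines)):
        # Even positions hold forward reads, odd positions their mates.
        (forward if index % 2 == 0 else reverse).append(record)
    if len(forward) != len(reverse):
        raise ValueError("odd number of records: the last read has no mate")
    return forward, reverse


# Tiny made-up input with two read pairs.
interleaved = [
    "@read1/1", "ACGT", "+", "IIII",
    "@read1/2", "TGCA", "+", "IIII",
    "@read2/1", "GGCC", "+", "IIII",
    "@read2/2", "CCGG", "+", "IIII",
]
fwd, rev = deinterleave(interleaved)
print(len(fwd), "forward and", len(rev), "reverse reads")
```

Working on an iterable of lines keeps the sketch independent of file handling; an open file object can be passed in the same way, in which case the records keep their trailing newlines.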

3.3 Containerization of Applications

We also took advantage of Docker to containerize the software developed by us and make its installation easier for end users. This was the case for our desktop tools with GUIs, namely ADOPS [16], BDBM [17], and SEDA [11]. For ADOPS and SEDA, we simply created Docker images with the tools inside; the GUIs can then be accessed by sharing the host’s X server with the container (creating a volume with “-v $HOME/.Xauthority:/home/developer/.Xauthority”) and passing the host’s DISPLAY environment variable to the container (adding “-e DISPLAY=$DISPLAY”). This method only works for Linux OS. In the case of BDBM, the Docker image incorporates an XPRA server and, therefore, Linux, Windows, and OS X installers could be created using platform-specific XPRA clients¹³.

ADOPS, BDBM, and SEDA are desktop tools implemented in Java that run their external software dependencies (BLAST, T-Coffee, EMBOSS tools, Augustus, among others) as system processes. This means that their dependencies must be available in the system where they are executed and, therefore, they must be included in the Docker image. In the case of SEDA, we included an additional execution mode that runs external dependencies through Docker images. This way, the user can provide an image name (SEDA is equipped with a set of default images) and SEDA uses it to execute the corresponding software tools. This also simplifies the containerization of SEDA itself, since only SEDA and Docker must be included in the Docker image. We also relied on Docker to develop and deploy EvoPPI [18].

13 https://xpra.org.
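The X server sharing described above for ADOPS and SEDA can be sketched in Python as follows. The sketch only assembles the two options quoted in the text (the .Xauthority volume and the DISPLAY variable); the image name is a hypothetical placeholder, the fallback display ":0" is our assumption, and a real invocation may need further options that the text does not list.

```python
import os


def gui_docker_command(image, container_home="/home/developer"):
    """Build a `docker run` argv that shares the host X server with a container."""
    xauthority = os.path.join(os.path.expanduser("~"), ".Xauthority")
    display = os.environ.get("DISPLAY", ":0")  # ":0" is an assumed fallback
    return [
        "docker", "run", "--rm",
        "-v", f"{xauthority}:{container_home}/.Xauthority",  # share X credentials
        "-e", f"DISPLAY={display}",                          # forward the display
        image,
    ]


# Hypothetical image name, used only for illustration.
print(" ".join(gui_docker_command("pegi3s/example-gui-tool")))
```

As noted in Subsect. 3.1, the user may first need to disable access control with “xhost +” on the host before running images that provide a GUI.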


EvoPPI is a web application for the comparison of multiple interactome datasets from the same or different species. From an architectural point of view, EvoPPI is composed of a Single Page Application frontend implemented with Angular that communicates with a REST backend implemented with Java EE, which stores the information in a MySQL database. This means that the frontend, backend, and database can be seen as independent components. Based on this, a Docker container was defined for each component with the dependencies necessary for its execution. In addition, to orchestrate the execution of these containers, a Docker Compose configuration was defined, in which all the necessary services (e.g. network, persistent volumes, among others) are created. Furthermore, when comparing interactomes of different species in EvoPPI, it is necessary to use BLAST to establish orthologies. In this case, we also use Docker containers to run BLAST, but with the particularity that the backend application directly requests the execution of the container through the Docker Java API¹⁴. This EvoPPI deployment configuration is available in a GitHub repository¹⁵, so that any researcher can run EvoPPI in a simple way.

4 Discussion

The usage of container technologies, especially Docker, is now ubiquitous in bioinformatics, as it offers a way to improve the reproducibility of analyses and provides an easy way to migrate complex pipelines. Docker images provide many benefits to both biologists and bioinformaticians beyond reproducibility. They simplify dependency management, since programs must be installed only once (when the Docker image is created) and can then be used in any system with Docker installed. This also allows having multiple versions of the same software available, something that can be difficult without containers. Maintaining different software versions is important not only for reproducibility but also in those cases where the most recent software version does not include all functionalities of previous versions, usually due to licensing issues. This is the reason why our project home page lists two different versions of the Genome Analysis Toolkit (gatk-3 and gatk-4). In addition, since there are Docker clients for Windows and OS X systems, bioinformatics software that is only available for Linux can be used on those operating systems.

Our experience with Docker has been very positive, since all the software tools used at the pegi3s laboratory are now available as Docker images, and thus new students and researchers waste no time on software installation and configuration when starting a new project. Having a list of tools classified in broad categories also eases the process of finding the most adequate tool(s) for a given project. For instance, for a project involving the de novo assembly of a genome there are four Docker images available, namely abyss, edena, soapdenovo2, and spades. This is not an exhaustive list of the software available for this purpose, but it is a good start.
Moreover, since these Docker images have been developed for our research and teaching needs, they are not a random collection of software tools. This means that a researcher visiting the pegi3s Bioinformatics Docker Images Project page will also find there Docker images for read quality evaluation and read trimming, for instance. The clear instructions given at the pegi3s Bioinformatics Docker Images Project, as well as the direct links to the software manuals, also greatly ease the process of finding the appropriate software parameters for a given processing step. The pipelines and desktop tools requiring external software applications that have been developed by us also greatly benefited from the available pegi3s Docker images. Moreover, they are a useful resource for teaching bioinformatics-related subjects at university.

The philosophy of the pegi3s Bioinformatics Docker Images Project closely matches that of Canonical’s Ubuntu open source collaborative project, the source image used in all our Docker images. According to Canonical¹⁶, Ubuntu is an ancient African word meaning ‘humanity to others’ that reminds us that ‘I am what I am because of who we all are’.

14 https://github.com/docker-java/docker-java.
15 https://github.com/sing-group/evoppi-docker.

Acknowledgments. This work was financed by National Funds through FCT (Fundação para a Ciência e a Tecnologia, I.P.), under the project UIDB/04293/2020 and through the individual scientific employment program-contract with Hugo López-Fernández (2020.00515.CEECIND), and also by BioData.pt (project 22231/01/SAICT/2016). This work was also partially supported by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) under the scope of the strategic funding ED431C2018/55-GRC Competitive Reference Group.

References

1. Perkel, J.M.: Workflow systems turn raw data into scientific knowledge. Nature 573, 149–150 (2019). https://doi.org/10.1038/d41586-019-02619-z
2. Gomes, J., et al.: Enabling rootless Linux Containers in multi-user environments: the udocker tool. Comput. Phys. Commun. 232, 84–97 (2018). https://doi.org/10.1016/j.cpc.2018.05.021
3. Gruening, B., et al.: Recommendations for the packaging and containerizing of bioinformatics software. F1000Res. 7, 742 (2019). https://doi.org/10.12688/f1000research.15140.2
4. Nüst, D., et al.: Ten simple rules for writing Dockerfiles for reproducible data science. PLoS Comput. Biol. 16, e1008316 (2020). https://doi.org/10.1371/journal.pcbi.1008316
5. Belmann, P., Dröge, J., Bremges, A., McHardy, A.C., Sczyrba, A., Barton, M.D.: Bioboxes: standardised containers for interchangeable bioinformatics software. GigaScience 4 (2015). https://doi.org/10.1186/s13742-015-0087-0
6. Moreews, F., et al.: BioShaDock: a community driven bioinformatics shared Docker-based tools registry. F1000Res. 4, 1443 (2015). https://doi.org/10.12688/f1000research.7536.1
7. da Veiga Leprevost, F., et al.: BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics 33, 2580–2582 (2017). https://doi.org/10.1093/bioinformatics/btx192
8. Menegidio, F.B., Jabes, D.L., Costa de Oliveira, R., Nunes, L.R.: Dugong: a Docker image, based on Ubuntu Linux, focused on reproducibility and replicability for bioinformatics analyses. Bioinformatics 34, 514–515 (2018). https://doi.org/10.1093/bioinformatics/btx554
9. O’Connor, B.D., et al.: The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows. F1000Res. 6, 52 (2017). https://doi.org/10.12688/f1000research.10137.1
10. Jackman, S.D., et al.: ORCA: a comprehensive bioinformatics container environment for education and research. Bioinformatics 35, 4448–4450 (2019). https://doi.org/10.1093/bioinformatics/btz278
11. Lopez-Fernandez, H., et al.: SEDA: a desktop tool suite for FASTA files processing. IEEE/ACM Trans. Comput. Biol. Bioinform 1 (2020). https://doi.org/10.1109/TCBB.2020.3040383
12. López-Fernández, H., Graña-Castro, O., Nogueira-Rodríguez, A., Reboiro-Jato, M., Glez-Peña, D.: Compi: a framework for portable and reproducible pipelines. PeerJ Comput. Sci. 7, e593 (2021). https://doi.org/10.7717/peerj-cs.593
13. López-Fernández, H., et al.: Inferring positive selection in large viral datasets. In: Fdez-Riverola, F., Rocha, M., Mohamad, M.S., Zaki, N., Castellanos-Garzón, J.A. (eds.) PACBB 2019. AISC, vol. 1005, pp. 61–69. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-23873-5_8
14. Nogueira-Rodríguez, A., López-Fernández, H., Graña-Castro, O., Reboiro-Jato, M., Glez-Peña, D.: Compi hub: a public repository for sharing and discovering Compi pipelines. In: Panuccio, G., Rocha, M., Fdez-Riverola, F., Mohamad, M.S., Casado-Vara, R. (eds.) PACBB 2020. AISC, vol. 1240, pp. 51–59. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-54568-0_6
15. López-Fernández, H., Vieira, C.P., Fdez-Riverola, F., Reboiro-Jato, M., Vieira, J.: Inferences on Mycobacterium leprae host immune response escape and antibiotic resistance using genomic data and GenomeFastScreen. In: Panuccio, G., Rocha, M., Fdez-Riverola, F., Mohamad, M.S., Casado-Vara, R. (eds.) PACBB 2020. AISC, vol. 1240, pp. 42–50. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-54568-0_5
16. Reboiro-Jato, D., Reboiro-Jato, M., Fdez-Riverola, F., Vieira, C.P., Fonseca, N.A., Vieira, J.: ADOPS–Automatic detection of positively selected sites. J. Integr. Bioinform. 9, 200 (2012). https://doi.org/10.2390/biecoll-jib-2012-200
17. Vázquez, N., López-Fernández, H., Vieira, C.P., Fdez-Riverola, F., Vieira, J., Reboiro-Jato, M.: BDBM 1.0: a desktop application for efficient retrieval and processing of high-quality sequence data and application to the identification of the putative Coffea S-locus. Interdiscip. Sci. Comput. Life Sci. 11(1), 57–67 (2019). https://doi.org/10.1007/s12539-019-00320-3
18. Vázquez, N., et al.: EvoPPI 1.0: a web platform for within- and between-species multiple interactome comparisons and application to nine PolyQ proteins determining neurodegenerative diseases. Interdiscip. Sci. Comput. Life Sci. 11(1), 45–56 (2019). https://doi.org/10.1007/s12539-019-00317-y

16 https://ubuntu.com/about.

On the Reproducibility of MiRNA-Seq Differential Expression Analyses in Neuropsychiatric Diseases

Daniel Pérez-Rodríguez1,2, Hugo López-Fernández3,4,5,6(B), and Roberto C. Agís-Balboa1,2

1 Translational Neuroscience Group-CIBERSAM, Galicia Sur Health Research Institute (IIS Galicia Sur), Área Sanitaria de Vigo-Hospital Álvaro Cunqueiro, SERGAS-UVIGO, 36213 Vigo, Spain
[email protected], [email protected]
2 NeuroEpigenetics Lab. University Hospital Complex of Vigo, SERGAS-UVIGO, 36213 Vigo, Spain
3 Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
[email protected]
4 Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
5 Department of Computer Science, CINBIO, Universidade de Vigo, ESEI – Escuela Superior de Ingeniería Informática, 32004 Ourense, Spain
6 SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain

Abstract. MiRNAs are attracting considerable interest as potential biomarkers for neuropsychiatric diseases due to their expression plasticity. In the last decade, a large number of studies have been published in this regard with promising results; however, there is widespread concern about the reproducibility of these results. This study aims to compare the differentially expressed miRNAs reported by 5 recent studies of neuropsychiatric diseases with those obtained through the miARma-Seq pipeline [1]. In general, we found low reproducibility (0–74%) and some variations related to the software used for the differential expression analysis. Our results support the idea that miRNAs reported as potential biomarkers in neuropsychiatric diseases are strongly correlated with the analytical methodology and the biological references used; nonetheless, further research is needed to establish the magnitude of this problem and identify its main causes.

Keywords: miRNA-Seq · Reproducibility · Neuropsychiatry

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 41–51, 2022. https://doi.org/10.1007/978-3-030-86258-9_5

1 Introduction

Micro-RNAs (miRNAs) are small non-coding RNAs involved in the post-transcriptional regulation of gene expression. They participate in most biological processes, and their expression levels are affected by everyday events such as sleep, exercise, stress or medications [2]. In the last decade, a great deal of effort has been put into the discovery of miRNA biomarkers, and many studies have focused on finding those differentially expressed between groups of patients and healthy controls (DE miRNAs). Thanks to this, the functions and implications of these DE miRNAs in different diseases have begun to be unraveled. Medical disciplines such as neuropsychiatry, in which diagnosis relies on the identification of a set of unspecific symptoms, could benefit from the expression plasticity of miRNA molecules. Consequently, a large number of miRNA-Seq studies have been carried out with the aim of finding specific biomarkers for conditions such as depression, schizophrenia, or Alzheimer’s disease, reporting hundreds of promising candidates. However, there is widespread concern about the reproducibility of these findings, mainly due to the great heterogeneity in the analysis procedures [3, 4], the great influence that biological references have on the results [4, 5], and the biological variability itself, which compromises the generalization of the results to new data [6, 7].

With the aim of assessing this problem in the field of neuropsychiatric diseases, we re-analyzed 5 recent studies (hereinafter, original studies) that aimed to discover miRNA biomarkers. These analyses were conducted using the miARma-Seq pipeline [1], which combines third-party software tools to perform, among other functions, quality checking, read alignment, quantification, and differential expression analysis. By using a pipeline, we factor out the variability related to the analysis procedure, so that a measure of the disagreement between the results of the original studies and new results obtained using an alternative procedure can be estimated.
MiARma-Seq takes as input an ini file in which the paths of the sample data, reference genome, genomic annotations, sample attributes and comparisons are provided; it then returns a list of DE miRNAs along with detailed information about each step of the process. Our objective is to compare the DE miRNAs obtained through miARma-Seq with those reported by the original studies. To do this, we attempted to replicate the criteria and filters applied in the original studies. In this way, a modest approximation of the magnitude of the reproducibility problem can be obtained.

2 Materials and Methods

Six recent miRNA-Seq studies on neuropsychiatric diseases were selected [8–13], all with experimental data available in public repositories. The analysis of each study consisted of the following steps: (1) data acquisition and identification of the samples; (2) download of the genome and annotation files; (3) miARma-Seq analysis; (4) quality control; (5) application of the same statistical criteria of the original articles to miARma-Seq results; (6) comparison between the original results and miARma-Seq results.

2.1 Data Acquisition and Identification of the Samples

Raw data in FASTQ format were downloaded from the NCBI Bioproject database (BD) using the fasterq-dump tool (sra-tools) [14]. The condition of each sample (e.g. case/control) was identified using the supplementary material of the original articles and/or the information available in the SRA database, the latter accessed through the esearch/efetch/xtract tools [15]. About 300 GB of data were downloaded in total; Table 1 summarizes the data to be analyzed.

Table 1. Summary of the articles and data to be re-analyzed.

| Study | Year | Organism | Contrast | Sample | Bioproject | Size (GB) |
|---|---|---|---|---|---|---|
| Nie et al. | 2020 | Human | AD-Control, PD-Control | Peripheral blood | PRJNA587017 | 1.4 |
| Mavrikaki et al. | 2019 | Rattus norvegicus | Females I-G, Males I-G | Bed nucleus of the stria terminalis | PRJNA543123 | 39.3 |
| Wang et al. | 2018 | Human | ADHD-Control | White blood cells | PRJNA450485 | 2.2 |
| Martin et al. | 2017 | Human | PTSD-Control | Peripheral blood | PRJNA347370 | 0.48 |
| Hicks et al. | 2016 | Human | ASD-Control | Saliva | PRJNA310758 | 13.4 |
| Hoss et al. | 2016 | Human | PD-C, PDD-PDN | Brain (post-mortem) | PRJNA295431 | 242.1 |
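As an illustration of the download step of Sect. 2.1, the sketch below only assembles the command lines; it does not run them. The options actually used by the authors are not stated, so the output directory, the thread count and the run accession `SRR0000000` are placeholders, and the helper names are hypothetical.

```python
import shlex
from typing import List


def runinfo_pipeline(bioproject: str) -> List[List[str]]:
    """Two EDirect commands meant to be piped: esearch ... | efetch ..."""
    return [
        ["esearch", "-db", "sra", "-query", bioproject],
        ["efetch", "-format", "runinfo"],
    ]


def fasterq_dump(run: str, outdir: str = "fastq", threads: int = 4) -> List[str]:
    """Command that converts one SRA run into FASTQ files."""
    return ["fasterq-dump", run, "--outdir", outdir, "--threads", str(threads)]


def as_shell(commands: List[List[str]]) -> str:
    """Render a list of commands as one shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in commands)


# PRJNA310758 is the Hicks et al. accession from Table 1;
# SRR0000000 is a placeholder, not a real run of that project.
runinfo_cmd = as_shell(runinfo_pipeline("PRJNA310758"))
download_cmd = shlex.join(fasterq_dump("SRR0000000"))
print(runinfo_cmd)
print(download_cmd)
```

The run table returned by the first pipeline is where the sample-to-condition information would typically be looked up before the second command is issued for each run.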

Hicks’ study [10] searches for potential miRNA biomarkers for autism spectrum disorder (ASD) through the comparison of saliva samples of ASD subjects (n = 24) with controls (n = 21). Hoss’ study [13] seeks Parkinson’s disease (PD) miRNA signatures by comparing the miRNA profile of PD frontal cortex (n = 29) with normal control brain (n = 33). In this study, control samples were not available in the Bioproject database under the given accession number and had to be downloaded from their previous study (PRJNA272617). Additionally, control samples C_0034, C_0047 and C_0049 were removed since they were not used in the original study. Martin’s study [8] analyzes miRNA expression in posttraumatic stress disorder (PTSD) through the comparison of blood samples from military personnel with PTSD (n = 15) and without PTSD (n = 9). Wang’s study [11] searches for potential miRNA biomarkers for attention-deficit/hyperactivity disorder (ADHD) through the comparison of white blood cell samples of ADHD patients (n = 5) and controls (n = 5). ADHD and control samples were mixed into two pooled libraries, so only two files were retrieved from the Bioproject database. Mavrikaki’s study [12] investigates sex differences in miRNA expression profiles in the bed nucleus of the stria terminalis (BNST) of socially isolated male/female rats and controls. Finally, Nie’s study [9] searches for potential miRNA biomarkers for Alzheimer’s disease (AD, n = 5) and PD (n = 7) in plasma exosomes by comparing both to control samples (n = 34). They performed the differential expression analysis (DEA) with several software tools and compared their results. Regarding the data, they reported those obtained through two different methods of exosome isolation, EQ and SC, and also provided the data without exosome isolation (PC). Since the main results of the paper were obtained with the EQ isolation kit, we chose the EQ samples and discarded the other files.

2.2 Download of the Genome and Annotation Files

The reference genome and annotation files are needed first to translate miRNA sequences to genome positions and then to miRNA IDs. The original studies performed the alignment with old genomic builds: five of them [8–11, 13] used human samples and aligned with hg19, and only Mavrikaki [12] used the Rattus norvegicus genome and aligned with rn6. Both genomes were downloaded from “NCBI Datasets” [16] in their latest versions: hg38 and mRatBN7.2, respectively. Annotation files were downloaded from miRBase v22 [17] for the human genome and from NCBI for the rat genome.

2.3 MiARma-Seq Analysis

A Docker image of miARma-Seq [1] was used to run all the analyses. We configured miARma-Seq to perform the “Known miRNAs analysis” pipeline, with the quality check using FastQC [18], adapter removal with Cutadapt [19], alignment with Bowtie1 [20], quantification with featureCounts [21] and the DEA with EdgeR [22]. Two adjustments were made to the analysis process to adapt it to the requirements of the data sets. First, adapter removal was performed only on the Hoss [13] samples, as they explicitly state the adapter sequence. Second, for Wang [11] we used NOISeq [23] instead of EdgeR for the DEA due to the absence of replicates. Each of the two genomes was indexed once, during its first analysis, inside the miARma-Seq pipeline. From now on, we will use the name "miARma-Seq" to refer to the software mentioned in this section. All the configuration files used in the analyses are available in the Supplementary Material.

2.4 Quality Control

The miARma-Seq results were visualized using MultiQC reports [24]. We checked the alignment and assignment rates, as well as contamination with adapters.
2.5 Application of the Same Statistical Criteria of the Original Articles to miARma-Seq Results

The original criteria for determining DE miRNAs were applied to the miARma-Seq results. The criteria were: false discovery rate-corrected p-value (q-value) < 0.05 in the Mavrikaki [12], Hoss [13] and Martin [8] studies; p-value < 0.05, transcripts per million > 1000 and fold change greater than 1 or lower than −0.67 in the Wang [11] study; and q-value < 0.05 and fold change greater than 1 or lower than −1 in the Nie [9] study. No factor corrections were applied to the miARma-Seq results. The resulting miRNAs were written to a text file, one per line.
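The filtering step can be sketched as follows. This is a minimal illustration and not the authors’ script: the record layout, the function names and the toy rows are assumptions, and only the thresholds quoted in the calls (q-value < 0.05, fold change above 1 or below −1) come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class DeaResult:
    mirna: str                   # identifier, e.g. a MIMAT accession
    p_value: float
    q_value: float               # FDR-corrected p-value
    fold_change: float           # on whatever scale the study reports
    tpm: Optional[float] = None  # transcripts per million, if available


def is_de(row: DeaResult, max_p=None, max_q=None,
          fc_above=None, fc_below=None, min_tpm=None) -> bool:
    """Return True if the row passes every criterion that was supplied."""
    if max_p is not None and not row.p_value < max_p:
        return False
    if max_q is not None and not row.q_value < max_q:
        return False
    if min_tpm is not None and not (row.tpm is not None and row.tpm > min_tpm):
        return False
    if fc_above is not None and fc_below is not None:
        # Two-sided fold-change rule: strongly up OR strongly down.
        return row.fold_change > fc_above or row.fold_change < fc_below
    return True


def select_de(rows: Iterable[DeaResult], **criteria) -> List[str]:
    """Names of the miRNAs that satisfy the given criteria, in input order."""
    return [row.mirna for row in rows if is_de(row, **criteria)]


# Toy rows with made-up values, only to exercise the filter.
rows = [
    DeaResult("mirna_a", 0.001, 0.01, 2.3),
    DeaResult("mirna_b", 0.020, 0.20, 1.8),
    DeaResult("mirna_c", 0.004, 0.03, 0.4),
]
q_only = select_de(rows, max_q=0.05)
q_and_fc = select_de(rows, max_q=0.05, fc_above=1.0, fc_below=-1.0)
print("\n".join(q_only))  # one miRNA per line, as in the text files described above
```

Passing the criteria as keyword arguments keeps one filter function for all studies, so each study differs only in the thresholds it supplies.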


2.6 Comparison Between the Original Results and miARma-Seq Results

Prior to the comparisons, the miRNA nomenclature of the miARma-Seq results was matched with that used in the original results using a custom script. miARma-Seq results were obtained in the miRBase unique identifier format (MIMAT id) and then changed to the hsa- format for the Wang [11] and Hoss [13] comparisons, the miR- format for Martin [8] and Nie [9], and the rnor- format for the Mavrikaki [12] data. MiRNA precursors were recorded with the mature miRNA name. DE miRNAs reported by the original studies were arranged into text files, one miRNA per line. To spot miRNAs present in both the miARma-Seq results and the original results, sorting and line-by-line comparisons were made using bash scripting.
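The renaming and matching described above can be approximated with set operations. The sketch below is written in Python rather than bash for readability; the accession-to-name mapping and all identifiers in it are invented placeholders, not miRBase entries.

```python
from typing import Dict, Iterable, List, Tuple


def rename(ids: Iterable[str], id_to_name: Dict[str, str],
           strip_species: bool = False) -> List[str]:
    """Translate accessions to names, optionally dropping the species prefix."""
    names = []
    for accession in ids:
        name = id_to_name[accession]
        if strip_species:
            name = name.split("-", 1)[1]  # "hsa-miR-x1" -> "miR-x1"
        names.append(name)
    return names


def compare(original: Iterable[str],
            reanalysis: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (shared, only in original, only in re-analysis), each sorted."""
    a, b = set(original), set(reanalysis)
    return sorted(a & b), sorted(a - b), sorted(b - a)


# Placeholder mapping: these accessions and names are hypothetical.
mapping = {
    "MIMAT_X1": "hsa-miR-x1",
    "MIMAT_X2": "hsa-miR-x2",
    "MIMAT_X3": "hsa-miR-x3",
}
original_list = ["miR-x1", "miR-x4"]
reanalysis_list = rename(["MIMAT_X1", "MIMAT_X2"], mapping, strip_species=True)
shared, only_original, only_reanalysis = compare(original_list, reanalysis_list)
print(shared, only_original, only_reanalysis)
```

The three returned lists correspond to the coincidences and to the miRNAs reported by only one of the two analyses.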

3 Results and Discussion

3.1 Quality Control

MultiQC reports [24] were generated to visualize FastQC [18] and Bowtie1 [20] results. In the FastQC summary, all samples had high scores for duplicated sequences, overrepresented sequences and k-mer content. This is the expected outcome for RNA-Seq data, where sequence abundance has a biological meaning. In the adapter content module, no adapter contamination was found except in the Hoss [13] samples.

Table 2. Average percentage of alignments (number of sequences aligned successfully) and assignments (number of aligned sequences successfully annotated) per study.

| Study | Contrast | % Alignment | % Assignment |
|---|---|---|---|
| Wang et al. | ADHD-Control | 98.5 | 69.45 |
| Mavrikaki et al. | Females I-G, Males I-G | 98.2 | 82.80 |
| Hoss et al. | PD-C, PDD-PDN | 95.6 | 49.94 |
| Martin et al. | PTSD-Control | 85.1 | 55.72 |
| Nie et al. | AD-Control, PD-Control | 60.5 | 10.70 |
| Hicks et al. | ASD-Control | 15.6 | 1.96 |

Regarding the Bowtie1 report, alignment rates (Table 2) averaged above 85% for all samples, except for the Nie [9] (62.6%) and Hicks [10] (15.6%) data. The same is true for the assignment rates, where generally acceptable rates (>50%) were achieved in all studies except for these two (Hicks, 2.0%; Nie, 10.7%). The data of both studies were downloaded again and re-analyzed, with identical results. Hicks’ study was omitted from subsequent comparisons.

3.2 Application of the Same Statistical Criteria of the Original Articles to MiARma-Seq Results

After applying the DE criteria of the original studies to the miARma-Seq results, the MIMAT IDs of the filtered miRNAs were written to text files, one per line. Only in the Martin study [8] did miARma-Seq not suggest any DE miRNA. The number of predictions was generally higher than that reported in the original articles, averaging 113%.

3.3 Comparison Between the Original Results and miARma-Seq Results

A summary of the results is shown in Fig. 1, and the full results are available in the Supplementary Material. As shown in Table 3, on average only 27.59% of the miRNAs reported in the original studies were present in the miARma-Seq results. The ratio of the number of matches to the total number of DE miRNAs reported by these studies was calculated for each comparison and study (Table 4). This measure, which we shall refer to as the agreement ratio (AR), was highest in the Hoss [13] (74.08%) and Mavrikaki [12] (53.50%) comparisons, whereas it was 16.38% for Nie [9] and 0% for Wang [11] and Martin [8]. The two studies with 0% AR were those with the fewest miRNAs reported in the original results. In our opinion, the use of an animal model could largely explain the better data quality and greater reproducibility of Mavrikaki’s results, since the samples came from rats of the same age, kept under laboratory conditions, and were taken simultaneously. This differs from what typically happens with human samples, which are collected with long periods of time in between, come from patients with very diverse backgrounds, and are often processed by different personnel.

Fig. 1. Source of the total DE miRNAs. In green is the proportion of coincidences (n coincidences/n different miRNAs); in aqua, the proportion of divergent miRNAs reported by the original article ((original results - n coincidences)/n different miRNAs); and in purple, those reported by miARma-Seq ((miARma-Seq results - n coincidences)/n different miRNAs).

Table 3. Results of the comparisons. The number of DE miRNAs identified by the original articles is in the “Original” column, those identified using miARma-Seq in the “miARma-Seq” column and the criteria for selecting differentially expressed miRNAs in the "DEA criteria" column. These criteria were copied from those used in the original studies.

| Comparison | Original | miARma-Seq | Coincidences | DEA criteria |
|---|---|---|---|---|
| Wang et al., ADHD-Control | 16 | 2 | 0 | pval < 0.05; 0.67 > FC > 1.5; TPM > 1000 |
| Mavrikaki et al., Females I-G | 68 | 53 | 36 | FDR pval < 0.05 |
| Mavrikaki et al., Males I-G | 37 | 31 | 20 | FDR pval < 0.05 |
| Hoss et al., PD-C | 191 | 386 | 92 | FDR pval < 0.05; no factor adjustment |
| Hoss et al., PDD-PDN | 0 | 0 | 0 | FDR pval < 0.05; no factor adjustment |
| Martin et al., PTSD-Control | 8 | 0 | 0 | FDR pval < 0.05 |
| Nie et al., AD-Control, predicted by 3 DEA software | 38 | 51 | 14 | FC > 1 or FC < -1; p-val < 0.05 |
| Nie et al., AD-Control, EdgeR | 56 | | 15 | |
| Nie et al., AD-Control, Limma | 71 | | 23 | |
| Nie et al., AD-Control, DESeq | 37 | | 11 | |
| Nie et al., PD-Control, predicted by 3 DEA software | 20 | 33 | 0 | FC > 1 or FC < -1; p-val < 0.05 |
| Nie et al., PD-Control, EdgeR | 40 | | 1 | |
| Nie et al., PD-Control, Limma | 36 | | 1 | |
| Nie et al., PD-Control, DESeq | 20 | | 0 | |

Hoss et al. used Limma [25] to perform the DEA. In the PD-C comparison, 92 of the 191 miRNAs reported in the original article matched the miARma-Seq results, giving an AR of 48.17%. This was the analysis in which miARma-Seq yielded the most results relative to the original, with 2.02 times as many predictions. Although the use of an old genome build and miRBase version (v20) [17] could partially explain this result, the same would be expected in Martin’s analysis, as they used the same outdated references. However, the opposite phenomenon is observed, and the reason for this rather contradictory result is still not entirely clear. Neither miARma-Seq nor Limma found any DE miRNA in the PDD-PDN comparison.

Martin et al. used DESeq2 [26] to perform the DEA. In contrast to the 8 DE miRNAs reported in the original study, no DE miRNAs were found among the miARma-Seq results.

Wang et al. used Limma as DEA software. Neither of the two DE miRNAs predicted by miARma-Seq matched any of the 16 results of Limma.

Mavrikaki et al. used DESeq2 to perform the DEA. In the Females I-G comparison, 36 of the 68 miRNAs reported in the original article matched the miARma-Seq results, giving an AR of 52.94%. In the Males I-G comparison, 20 of the 37 miRNAs in the original article matched the miARma-Seq results, with an AR of 54.05%. Finally, miARma-Seq predicted fewer DE miRNAs in both comparisons.

Table 4. Agreement ratio (AR: number of matches/number of original results) per comparison and study.

| Study | Comparison | AR per comparison | AR per study |
|---|---|---|---|
| Wang et al. | ADHD-Control | 0.00% | 0.00% |
| Mavrikaki et al. | Females I-G | 52.94% | 53.50% |
| | Males I-G | 54.05% | |
| Hoss et al. | PD-C | 48.17% | 74.08% |
| | PDD-PDN | 100.00% | |
| Martin et al. | PTSD-Control | 0.00% | 0.00% |
| Nie et al., AD-Control | 3 DEA soft | 36.84% | 16.38% |
| | EdgeR | 26.79% | |
| | Limma | 32.39% | |
| | DESeq | 29.73% | |
| Nie et al., PD-Control | 3 DEA soft | 0.00% | |
| | EdgeR | 2.50% | |
| | Limma | 2.78% | |
| | DESeq | 0.00% | |
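The agreement ratios can be recomputed from the counts in Table 3. The sketch below is illustrative rather than the authors’ code: the function and variable names are assumptions, and the convention that a comparison with zero original results scores 100% is inferred from the PDD-PDN row of Table 4.

```python
from statistics import mean
from typing import Dict, List, Tuple


def agreement_ratio(matches: int, n_original: int) -> float:
    """AR in percent: matches / number of original results."""
    if n_original == 0:
        # Assumed convention (PDD-PDN in Table 4): nothing to reproduce
        # and nothing found counts as full agreement.
        return 100.0
    return 100.0 * matches / n_original


# (comparison, original DE miRNAs, coincidences), taken from Table 3.
TABLE_3: Dict[str, List[Tuple[str, int, int]]] = {
    "Wang": [("ADHD-Control", 16, 0)],
    "Mavrikaki": [("Females I-G", 68, 36), ("Males I-G", 37, 20)],
    "Hoss": [("PD-C", 191, 92), ("PDD-PDN", 0, 0)],
    "Martin": [("PTSD-Control", 8, 0)],
    "Nie": [
        ("AD 3 DEA soft", 38, 14), ("AD EdgeR", 56, 15),
        ("AD Limma", 71, 23), ("AD DESeq", 37, 11),
        ("PD 3 DEA soft", 20, 0), ("PD EdgeR", 40, 1),
        ("PD Limma", 36, 1), ("PD DESeq", 20, 0),
    ],
}

# AR per study: mean of the ARs of that study's rows.
ar_per_study = {
    study: mean([agreement_ratio(m, n) for _, n, m in rows])
    for study, rows in TABLE_3.items()
}
# Overall average: mean over all 14 rows of the table.
ar_overall = mean(
    [agreement_ratio(m, n) for rows in TABLE_3.values() for _, n, m in rows]
)

for study, value in ar_per_study.items():
    print(f"{study}: {value:.2f}%")
print(f"overall: {ar_overall:.2f}%")
```

Under these assumptions the per-study values match the last column of Table 4, and the mean over all 14 rows gives the 27.59% quoted in Sect. 3.3.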

In the Nie et al. study, the results of all the software tools they used for DEA were also compared with the miARma-Seq results. MiRNAs predicted by all three tools had the highest AR in the AD-Control comparison but not in the PD-Control comparison. Also, DESeq was the only tool that suggested fewer DE miRNAs in both comparisons than miARma-Seq’s EdgeR, which may be indicative of a higher specificity of this tool. Interestingly, the AR of the original EdgeR results was the lowest of the three tools. One reason that could partially explain the generally low AR of Nie’s results is that they annotated not only with miRBase but also with the Rfam database. This would be expected to increase the proportion of assigned sequences and thus the number of DE miRNAs. In fact, the miARma-Seq analysis of these samples had very poor assignment rates, averaging 10.7%.

4 Conclusion

The results of a miRNA-Seq study depend, on the one hand, on factors that the researcher cannot control, such as the available knowledge about the genome and its annotations; on the other hand, factors such as experimental design, software used, statistical tests and filtering criteria, which depend to a greater extent on the researcher, also have a great influence. Small parameter changes in one or several steps may result in noticeable differences in the results. This study offers a glimpse of the diversity of conclusions that the same data can yield when both kinds of factors are changed. Despite the small sample size, our results consistently suggest low reproducibility in the conclusions of the 5 original studies, with only 28% of the identified miRNAs being replicated in the miARma-Seq analysis on average. If this is the case, it would be highly recommended to replicate and re-evaluate old studies, as well as to take previous results into account when analyzing new data. The use of pipelines such as miARma-Seq eases the comparison between studies analyzed with the same pipeline since, by standardizing the process, the observed variability will be mainly attributable to the samples, reference genome or annotations. Additionally, it would be interesting to further study the differences in quality and replicability that we observed when comparing the results of human samples with those of Rattus norvegicus; this could lead to the identification of key experimental factors for the replicability of miRNA-Seq experiments. Finally, with regard to data availability, greater efforts are needed to ease access to the studies’ data, especially to basic information such as sample conditions, adapter sequences, and results tables. In a growing discipline such as miRNA-Seq, it is important to improve the accessibility of these data to promote the reassessment of current knowledge, and thus build a solid foundation for future research.

Acknowledgements. This work was supported by Instituto de Salud Carlos III through the project PI18/01311 (co-funded by European Regional Development Fund, “A way to make Europe”) and by a Ramón y Cajal grant [RYC2014-15246] to RCA-B. National funding by FCT, Foundation for Science and Technology (Portugal), through the individual scientific employment program contract with Hugo López-Fernández (2020.00515.CEECIND). The authors would like to thank Galicia Sur Health Research Institute, Galicia Sur Biomedical Foundation, and the Area Sanitaria de Vigo for their support.

References

1. Andrés-León, E., Núñez-Torres, R., Rojas, A.M.: miARma-Seq: a comprehensive tool for miRNA, mRNA and circRNA analysis. Sci Rep. 6, 25749 (2016). https://doi.org/10.1038/srep25749
2. Esteller, M.: Non-coding RNAs in human disease. Nat Rev Genet. 12, 861–874 (2011). https://doi.org/10.1038/nrg3074
3. Peixoto, L., et al.: How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets. Nucleic Acids Res. 43, 7664–7674 (2015). https://doi.org/10.1093/nar/gkv736
4. Simoneau, J., Dumontier, S., Gosselin, R., Scott, M.S.: Current RNA-seq methodology reporting limits reproducibility. Brief. Bioinform. 22, 140–145 (2021). https://doi.org/10.1093/bib/bbz124
5. Zhao, S., Zhang, B.: A comprehensive evaluation of ensembl, RefSeq, and UCSC annotations in the context of RNA-seq read mapping and gene quantification. BMC Genom. 16 (2015). https://doi.org/10.1186/s12864-015-1308-8
6. Hansen, K.D., Wu, Z., Irizarry, R.A., Leek, J.T.: Sequencing technology does not eliminate biological variability. Nat. Biotechnol. 29, 572–573 (2011). https://doi.org/10.1038/nbt.1910
7. McIntyre, L.M., et al.: RNA-seq: technical variability and sampling. BMC Genom. 12, 293 (2011). https://doi.org/10.1186/1471-2164-12-293
8. Martin, C.G., et al.: Circulating miRNA associated with posttraumatic stress disorder in a cohort of military combat veterans. Psychiatry Res. 251, 261–265 (2017). https://doi.org/10.1016/j.psychres.2017.01.081
9. Nie, C., et al.: Differential expression of plasma exo-miRNA in neurodegenerative diseases by next-generation sequencing. Front. Neurosci. 14 (2020). https://doi.org/10.3389/fnins.2020.00438
10. Hicks, S.D., Ignacio, C., Gentile, K., Middleton, F.A.: Salivary miRNA profiles identify children with autism spectrum disorder, correlate with adaptive behavior, and implicate ASD candidate genes involved in neurodevelopment. BMC Pediatr. 16 (2016). https://doi.org/10.1186/s12887-016-0586-x
11. Wang, L.J., et al.: Blood-bourne microRNA biomarker evaluation in attention-deficit/hyperactivity disorder of Han Chinese individuals: an exploratory study. Front. Psychiatr. 9 (2018). https://doi.org/10.3389/fpsyt.2018.00227
12. Mavrikaki, M., et al.: Sex-dependent changes in miRNA expression in the bed nucleus of the Stria terminalis following stress. Front. Mol. Neurosci. 12 (2019). https://doi.org/10.3389/fnmol.2019.00236
13. Hoss, A.G., Labadorf, A., Beach, T.G., Latourelle, J.C., Myers, R.H.: microRNA profiles in Parkinson’s disease prefrontal cortex. Front. Mol. Neurosci. 8 (2016). https://doi.org/10.3389/fnagi.2016.00036
14. ncbi/sra-tools. NCBI - National Center for Biotechnology Information/NLM/NIH (2021)
15. Kans, J.: Entrez Direct: E-utilities on the Unix Command Line. National Center for Biotechnology Information (US) (2021)
16. NCBI Datasets. https://www.ncbi.nlm.nih.gov/datasets/. Accessed 11 May 2021
17. Kozomara, A., Birgaoanu, M., Griffiths-Jones, S.: miRBase: from microRNA sequences to function. Nucleic Acids Res. 47, D155–D162 (2019). https://doi.org/10.1093/nar/gky1141
18. Andrews, S.: FASTQC. A quality control tool for high throughput sequence data (2010)
19. Martin, M.: Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011). https://doi.org/10.14806/ej.17.1.200
20. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009). https://doi.org/10.1186/gb-2009-10-3-r25
21. Liao, Y., Smyth, G.K., Shi, W.: featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014). https://doi.org/10.1093/bioinformatics/btt656
22. Robinson, M.D., McCarthy, D.J., Smyth, G.K.: edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139–140 (2010). https://doi.org/10.1093/bioinformatics/btp616
23. Tarazona, S., et al.: Data quality aware analysis of differential expression in RNA-seq with NOISeq R/Bioc package. Nucleic Acids Res. 43, e140 (2015). https://doi.org/10.1093/nar/gkv711
24. Ewels, P., Magnusson, M., Lundin, S., Käller, M.: MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32, 3047–3048 (2016). https://doi.org/10.1093/bioinformatics/btw354
25. Ritchie, M.E., et al.: limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015). https://doi.org/10.1093/nar/gkv007
26. Love, M.I., Huber, W., Anders, S.: Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014). https://doi.org/10.1186/s13059-014-0550-8

Computational Tools for the Analysis of 2D-Nuclear Magnetic Resonance Data

Bruno Pereira1(B), Marcelo Maraschin2, and Miguel Rocha1

1 Centre of Biological Engineering, University of Minho, Campus of Gualtar, Braga, Portugal
[email protected]
2 Plant Morphogenesis and Biochemistry Laboratory, Federal University of Santa Catarina, Santa Catarina, Brazil

Abstract. Metabolomics is one of the omics sciences that has been gaining a lot of interest due to its potential for correlating an organism’s biochemical activity with its phenotype. While nuclear magnetic resonance (NMR) is one of the main analytical techniques in metabolomics, one-dimensional NMR suffers from some limitations. Two-dimensional NMR approaches (2D-NMR) deliver a solution to one of its main disadvantages, low sensitivity. Addressing a growing need for integrated frameworks to handle data analysis and mining in this domain, new functionalities regarding 2D-NMR were added to specmine, an R package for metabolomics and spectral data analysis/mining. These functionalities allow reading, visualization, and analysis of 2D-NMR data within the same environment, making it possible for users to establish their own pipelines. Two case studies, from Bruker and Varian datasets, were used to validate the functions developed, and a pipeline was implemented and made available through R Markdown.

Keywords: Metabolomics · 2D NMR · Univariate analysis · Multivariate analysis

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 52–61, 2022. https://doi.org/10.1007/978-3-030-86258-9_6

1 Introduction

Nuclear Magnetic Resonance (NMR) spectroscopy is a technique that allows the study of atomic nuclei and their chemical environment when they are subjected to electromagnetic radiation [20]. High-resolution NMR is unique within metabolomics because it is nondestructive and requires minimal sample preparation, allowing virtually any biological analyte soluble in a given solvent to be analyzed, thus measuring free small molecules independently of their chemical nature [19]. Multidimensional NMR makes it possible to assess different metabolite classes and establish a correlation between them, using different nuclei in the same experiment, e.g., 1H and 13C [7]. Multidimensional NMR, and more specifically two-dimensional NMR, can also resolve overlapping peaks, thus enabling better peak assignment and metabolite identification [14].

The main techniques used in Two-Dimensional Nuclear Magnetic Resonance (2D-NMR) metabolomics are homonuclear and heteronuclear-based approaches, where the second dimension is provided by a second nucleus. The interactions between coupled nuclei are measured through shared nuclear spin polarization, allowing the identification of metabolite signatures through different chemical shift pairs [4,7,15]. In terms of software for 2D-NMR, the main objective is to identify and assign these signatures to metabolites, while closing the gap on spectral artifacts that can arise from multiple Free Induction Decay (FID) acquisitions. Despite the available software, such as ChemoSpec2D [8], rNMR [13] and MetaboMiner [22], there is no framework that provides a complete analysis of these spectra, supports different 2D techniques and offers the means to apply chemometric methods. In the past few years, the authors’ group developed an R package called specmine [3], which provides methods for a complete metabolomic data analysis. This work presents the functions developed and integrated within specmine that allow reading, visualization, and analysis of 2D-NMR metabolomic data. Two case studies are provided to validate these novel functions.

2 Methods

2.1 Data Reading

To develop 2D-NMR metabolomics data analysis tools, we first needed to define an object that can represent these data in specmine. In a One-Dimensional Nuclear Magnetic Resonance (1D-NMR) spectrum, a single x axis represents the resonance frequency, while the y axis values represent the measured intensity. In 2D-NMR, two x axes are necessary to provide the resonance frequencies of an experiment (one for each dimension), while intensity values shift from the y axis to the z axis, where each point results from two resonance frequencies. For those reasons, a 2D specmine dataset represents the data from a 2D-NMR experiment as a list of numeric matrices. A graphical representation of this data structure is provided in Fig. 1.

Fig. 1. Representation of the structure of 2D data in a specmine dataset.

Following the work done for 1D-NMR, two different functions were developed to read data from Bruker and Varian equipment. The first one was implemented using a function from the mrbin [12] package, setting the dimension parameter to “2D”. The resonance frequency values for each dimension are rounded to two decimal places, and the matrix from each sample is appended to a list with its sample ID. Regarding Varian, the code uses a Python 3 script built with the package nmrglue [9], which transforms the FIDs from the metabolomic experiment into readable matrices by applying apodization (improving appearance, highlighting Signal-to-Noise Ratio (SNR) or resolution), zero-filling (reducing noise, empowering line-fitting methods), Fourier transformation (converting time-domain spectra into the frequency domain) and phase correction (correcting signals arising from artifacts) [6].
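To make the four processing steps concrete, the following one-dimensional sketch applies them to a synthetic FID using only the Python standard library. It does not use nmrglue and is not the script described above; the function names, the exponential window, the naive discrete Fourier transform and all numeric values are illustrative assumptions. In a 2D experiment the same operations would be applied along both dimensions.

```python
import cmath
import math
from typing import List


def apodize(fid: List[complex], decay: float) -> List[complex]:
    """Exponential window: damps the noisy tail of the FID."""
    return [s * math.exp(-decay * t) for t, s in enumerate(fid)]


def zero_fill(fid: List[complex], size: int) -> List[complex]:
    """Pad with zeros to `size` points, giving a finer frequency grid."""
    return list(fid) + [0j] * (size - len(fid))


def fourier_transform(fid: List[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (time -> frequency domain)."""
    n = len(fid)
    return [
        sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(fid))
        for k in range(n)
    ]


def phase_correct(spectrum: List[complex], p0_degrees: float) -> List[complex]:
    """Zero-order phase correction: rotate every point by the same angle."""
    rotation = cmath.exp(1j * math.radians(p0_degrees))
    return [s * rotation for s in spectrum]


# Synthetic FID: one decaying resonance with a 30 degree phase error.
n_points, phase_error = 32, 30.0
fid = [
    cmath.exp(1j * math.radians(phase_error))
    * cmath.exp(2j * math.pi * 5 * t / n_points)
    * math.exp(-t / 8)
    for t in range(n_points)
]

spectrum = phase_correct(
    fourier_transform(zero_fill(apodize(fid, 0.05), 64)), -phase_error
)
peak_bin = max(range(len(spectrum)), key=lambda k: spectrum[k].real)
print(peak_bin)
```

After the correction the peak is purely absorptive, so its real part is maximal at the bin matching the resonance frequency (bin 5 of 32 becomes bin 10 of 64 after zero-filling).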

2.2 Data Visualization

Plotting one or more 2D spectra is achieved through the package plotly [10]. This is the R counterpart of the open-source Python graphing library, and it helps to build interactive, high-quality graphs. The function developed allows the user to visualize a single spectrum or a group of spectra within the same plot. Interactivity enables the user to zoom in/out, hover over a peak (receiving information regarding its intensity and its F1 and F2 dimensions), rotate the plot and select which spectra to plot. If the user does not specify any samples to plot, the function will generate a plot with four spectra: the two with the highest and the two with the lowest SNR. SNR was formulated here based on the work of Wang et al. [21], who related the Coefficient of Variation (CV) and the SNR in NMR-based metabolomics studies. In their case, the authors used a specific noise region (9.5 - 10 parts per million (ppm)) to extract its standard deviation and divided the intensity of a peak by that value. In the general case, the noise region has to be identified from the data, and the literature supports that noise regions have a high CV (above 15 %) [5,16,21].
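A minimal sketch of these quantities is given below, assuming a fixed noise window as in Wang et al. [21]. It is not the specmine implementation: the function names and the toy values are hypothetical, and the default selection assumes that at least four samples are available.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence


def noise_values(ppm: Sequence[float], intensity: Sequence[float],
                 low: float = 9.5, high: float = 10.0) -> List[float]:
    """Intensities that fall inside the assumed noise window (in ppm)."""
    return [y for x, y in zip(ppm, intensity) if low <= x <= high]


def snr(peak_intensity: float, noise: Sequence[float]) -> float:
    """Peak intensity divided by the standard deviation of the noise."""
    return peak_intensity / stdev(noise)


def cv_percent(values: Sequence[float]) -> float:
    """Coefficient of variation in percent (undefined for a zero mean)."""
    return 100.0 * stdev(values) / abs(mean(values))


def default_samples(snr_by_sample: Dict[str, float]) -> List[str]:
    """Two samples with the lowest SNR followed by the two with the highest."""
    ranked = sorted(snr_by_sample, key=snr_by_sample.get)
    return ranked[:2] + ranked[-2:]


# Toy trace: one point outside and four points inside the noise window.
ppm = [9.4, 9.6, 9.7, 9.8, 9.9]
intensity = [50.0, 1.0, -1.0, 1.0, -1.0]
noise = noise_values(ppm, intensity)
example_snr = snr(10.0, noise)
chosen = default_samples({"a": 3.0, "b": 10.0, "c": 1.0, "d": 7.0, "e": 5.0})
print(round(example_snr, 2), chosen)
```

When no fixed window is known, `cv_percent` could be evaluated per region across samples and regions above the 15 % mark treated as candidate noise regions, in line with the criterion cited above.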

This SNR formulation is used in sample selection, the first of the four steps that allow the visualization of 2D specmine objects. The second step builds the list of buttons, creating a graphical button in the interactive plot for each sample to be visualized, presented as a dropdown menu. The third step adds the spectra of the selected samples to a single plotly object as surfaces, which are colored according to whether or not a metadata variable is given. The fourth and last step adds the final layout to the plot, including the title (given by the user), the axis names, the dropdown menu with the buttons, and the legend.

2.3 Dimension Reduction

Extracting information from the 2D-NMR specmine structure is a computational challenge because of the need to iterate over a list of large matrices. By performing peak detection and building a 1D-structured specmine dataset, with the combinations of ppm values as variables (rows) and the samples in the columns, we reduce the computational cost. The peak detection algorithm, taken from the rNMR [13] software, is a search for local maxima, i.e., points in the matrix that are larger than all surrounding points. The user can set the degree of a filter that relates the defined threshold to the points in the spectra, determining whether all/none of the points should be above the threshold or whether row/column points should be searched for local maxima. The search itself is based on the intersect function of the R base package, where two vectors (in this case, the same matrix with added Not Available (NA) values) are compared to find points that belong to both vectors. A new template, a matrix filled with NA, is built for each spectrum to serve as the base for the 1D dataset; then, from the peak list obtained in the peak detection phase, each row/column pair that holds a peak has its intensity value assigned in this matrix. From this state, a concatenation step converts this list of matrices into a single matrix by combining rows and columns from all spectra and establishing new variables. Variables that do not have any intensity value are removed. Note that this newly created 1D specmine dataset contains NAs, so a method for NA imputation should be used.
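The reduction from a list of matrices to a single table can be sketched as follows. This is a simplified illustration under stated assumptions, not the rNMR or specmine code: it uses a plain 8-neighbour comparison instead of the intersect-based search, assumes that all spectra share the same F1/F2 axes, and uses `None` in place of NA. All names and toy values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Matrix = List[List[float]]


def local_maxima(matrix: Matrix, threshold: float) -> Dict[Tuple[int, int], float]:
    """Points above `threshold` that are larger than all surrounding points."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    peaks = {}
    for i in range(n_rows):
        for j in range(n_cols):
            value = matrix[i][j]
            if value <= threshold:
                continue
            neighbours = [
                matrix[a][b]
                for a in range(max(0, i - 1), min(n_rows, i + 2))
                for b in range(max(0, j - 1), min(n_cols, j + 2))
                if (a, b) != (i, j)
            ]
            if all(value > other for other in neighbours):
                peaks[(i, j)] = value
    return peaks


def to_1d_dataset(spectra: Dict[str, Matrix], f1_ppm: Sequence[float],
                  f2_ppm: Sequence[float],
                  threshold: float) -> Dict[str, Dict[str, Optional[float]]]:
    """Variables are 'F1_F2' ppm pairs; missing peaks are stored as None."""
    peaks = {sid: local_maxima(m, threshold) for sid, m in spectra.items()}
    positions = sorted({pos for found in peaks.values() for pos in found})
    return {
        f"{f1_ppm[i]:.2f}_{f2_ppm[j]:.2f}": {
            sid: peaks[sid].get((i, j)) for sid in spectra
        }
        for i, j in positions
    }


# Two toy 3 x 3 spectra on a shared ppm grid.
axis = [1.0, 2.0, 3.0]
spectra = {
    "s1": [[0.0, 1.0, 0.0], [1.0, 9.0, 1.0], [0.0, 1.0, 0.0]],
    "s2": [[0.0, 1.0, 7.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],
}
dataset = to_1d_dataset(spectra, axis, axis, threshold=5.0)
print(dataset)
```

Only positions where at least one sample has a peak become variables, which mirrors the removal of variables without any intensity value; the remaining `None` entries are what a subsequent NA imputation step would have to fill.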

2.4 Further Analysis

After peak detection, it is possible to apply the functionalities of specmine to analyze the resulting dataset as a standard 1D-NMR dataset. In terms of univariate parametric analysis, it is possible to perform Student’s t-test and one-way and multifactor Analysis of Variance (ANOVA). Regarding multivariate analysis, the main available tool is Principal Component Analysis (PCA). There is also the option to perform clustering (hierarchical and k-means), linear regression and correlation analysis. Methods for feature selection and machine learning are also available, which can be useful in specific case studies where some metabolites are believed to discriminate between conditions.

3 Results and Discussion

Two case studies, covering Bruker and Varian datasets, are presented to validate the functionalities described above and to provide pipeline examples of how this type of data can be analyzed. In both cases, negative chemical shift values are present, which can be explained by multiple FIDs leading to a challenging and less accurate direct current correction, causing offsets in the chemical shifts [17]. Manual processing of these offsets is not included in this analysis, since the collected data are already processed and the remaining chemical shifts should not hinder visualization or analysis. In order to remove these artifacts, manual phase cycling before the first Fourier transformation is recommended [17].

Tomato Fruit Extracts

The main objective of the study, which used Bruker equipment to generate data, was to validate Multi-Scan Single Shot (M3S) Correlation Spectroscopy (COSY) experiments as a method for the quantification of metabolites in biological samples. Tomato fruit pericarps were chosen as relevant because the signals of their major metabolites overlap in 1D proton spectra [11] and their compositional changes across fruit development are well characterized [1]. Our aim in this case study is to achieve results consistent with those of the authors. The 2D-NMR data was retrieved from MetaboLights, under the accession numbers MTBLS131 and MTBLS132 for tomato samples recorded at 500 and 700 megahertz (MHz) resonance frequency, respectively. The samples were harvested at four different stages, measured in days post anthesis (DPA), and on three different trusses, with three biological replicates for each truss at each development stage. An initial assessment of both datasets, using the reading and summary functions, identified the spectral widths of both dimensions, the number of missing values, and the mean, median and standard deviation of each sample spectrum. Missing values were not present and standard deviation values were high, which was expected because intensity values are directly related to the peak values of spectral metabolite signatures. In terms of visualization, the sample spectrum corresponding to extract 514 was used to assess the capabilities developed. Figure 2 shows the direct comparison between the reference spectra obtained by the authors and the spectra plotted with specmine, for both frequencies. The main interactions between nuclei, which lie on the diagonal of the spectra, can be identified for both frequencies. The layout of these interactions and the ppm scales are consistent with the reference, as is the difference in resolution between spectra recorded at different frequencies.
Peaks that lie further from the diagonal and have biological meaning, such as glucose (Glc), are also identifiable, which validates this tool.

Metabolomic Data Analysis

Fig. 2. Fast M3S COSY spectra of a tomato fruit pericarp extract (extract 514, 34 days post anthesis) recorded over 5 min at 298 K on 500 MHz (a) and 700 MHz (b) Bruker NMR spectrometers equipped with cryogenically cooled probes, taken from Jézéquel et al. [11]. Plot of the second biological replicate of the same extract on 500 MHz (c) and 700 MHz (d) using the specmine package.

In the peak detection step, 1529 and 868 peaks were detected across all samples for the datasets MTBLS131 and MTBLS132, respectively. This means that most of the noise in the datasets was removed and that the search space of available variables was reduced. More than one peak was identified for each metabolite's reference information. The difference in the number of peaks detected can be explained by the difference in resolution of the data, the rounding of ppm values and intensity values adjacent to each other. Further analysis requires initial treatment of missing values, which were replaced by 5e-04. After this procedure, univariate analysis was conducted using one-way ANOVA and Tukey's Honestly Significant Difference (HSD) post hoc test to assess which variables have significant mean differences according to development stage. The results can be seen in Table 1: variable X4.12.4.14 (fructose) has a significant effect in discriminating the tomato samples harvested at 8 days of development from the other stages, while variable X0.75.3.24 (glucose) has a significant effect in discriminating the tomato samples harvested at 21 days of development from the other stages.

Table 1. ANOVA results from dataset MTBLS131 after peak detection with development stage metadata.

Combination of ppm (X.F1ppm.F2ppm)   p values    Logs    fdr         tukey
X3.72.4.06                           1.584e-06   5.800   1.866e-04   21-8; 34-8; 55-8
X4.12.4.14 (Fru)                     1.586e-06   5.800   1.866e-04   21-8; 34-8; 55-8
X0.75.3.24 (Glc)                     1.597e-06   5.797   1.866e-04   21-8; 34-21; 55-21
X0.75.3.33 (Glc)                     1.618e-06   5.791   1.866e-04   21-8; 34-21; 55-21
X3.83.3.85                           1.622e-06   5.790   1.866e-04   21-8; 34-21; 55-21
X3.81.3.95                           1.623e-06   5.790   1.866e-04   21-8; 34-8; 55-8
X3.57.3.77                           1.632e-06   5.787   1.866e-04   21-8; 34-8; 55-8
X3.63.3.62                           1.637e-06   5.786   1.866e-04   21-8; 34-8; 55-8
X3.88.3.95                           1.663e-06   5.779   1.866e-04   21-8; 34-8; 55-8
X3.41.3.51                           1.713e-06   5.766   1.866e-04   21-8; 34-8; 55-8
X0.76.3.3 (Glc)                      1.785e-06   5.748   1.866e-04   21-8; 34-8; 55-8
X3.26.3.34                           1.793e-06   5.746   1.866e-04   21-8; 34-8; 55-8

3.2 Worm (Caenorhabditis elegans) Metabolome

The main objective of this study, from which the Varian data originated, was to identify metabolites and their changes in samples from the C. elegans endo- and exometabolome subjected to a heat-shock condition. The authors analyzed the samples using INADEQUATE (Incredible Natural Abundance DoublE QUAntum Transfer Experiment) network analysis [2] and developed their own method to extract relevant information. For this reason, the analysis after peak detection for this case study consisted of obtaining and comparing PCA plots. The 2D-NMR data was retrieved from Metabolomics Workbench, with Project ID PR000095. In terms of pre-processing, the apodization function was changed from exponential to Lorentz-to-Gauss only for this data, which is composed of two datasets, endo- and exometabolome, with samples divided into heat shock at 33 °C and control. As with the Bruker data, there were no missing values in either dataset and the standard deviation values were high. For visualization, it was necessary to merge half of the rows and columns because a matrix with 4096 rows and columns could not be rendered. The results are similar to the previous case study (identification of the same regions of metabolite resonances as in the reference), further validating the visualization functionalities. In terms of PCA, three results were obtained that compare the scores plots (PC1 and PC2) from the endo- and exometabolome. These plots are shown in Fig. 3.

Fig. 3. Scores plot from the PCA done by Clendinen, et al., on the endo- and exometabolome dataset, (a) and (c), respectively. Scores plot from the PCA for the endo- and exometabolome dataset using specmine after peak detection and preprocessing, (b) and (d), respectively.

In Fig. 3(b), it was not possible to observe as good a separation along PC1 for all samples as the authors obtained. Nonetheless, the majority of the samples can be separated along PC1 following the same distribution with respect to the Temperature metadata, i.e., most control samples have negative values on PC1 whereas heat shock samples have positive values. This suggests that the heat shock condition has an effect on the C. elegans endometabolome. Despite the lack of information regarding which peaks were picked by the authors' pipeline, the results obtained after peak detection suggest that this step selected relevant information for future analysis. The two samples that do not follow this pattern on PC1 could be a result of differences in spectral pre-processing, of the hierarchical alignment of 2D spectra [18] performed by the authors and/or of the quality of the peaks detected. As shown in Fig. 3(d), there is no good separation along PC1. No separation between conditions is possible, which suggests that the exometabolome is not affected by heat shock. However, this scores plot was obtained using an exometabolome dataset with an outlier. According to the authors, this outlier was identified in their PCA, and Fig. 3(d) shows one control sample with the highest positive value along PC1. This indicates that the possible outlier spectrum propagated through the specmine analysis and was also identifiable in the PCA. The sample was the second replicate, N2 Control2 INAD, and presented the highest number of peaks detected in the dataset (206).

4 Conclusion

New functions were developed for the R package specmine to provide tools for 2D-NMR data. Since 1D-NMR lacks the sensitivity to resolve overlapping resonances in more complex samples, 2D-NMR has been applied and adapted to provide more informative data that is easier to interpret. This led to the development of key functions that enable 2D-NMR analysis with specmine, supporting the purpose of providing tools for metabolomic data analysis in a complete and user-friendly environment. With this addition, specmine's flexibility will follow the growth of metabolomics and the output of data generated by multidimensional NMR. By extending its functionalities while maintaining easy-to-interpret data structures, researchers can link data from different experiments without the usual need for multiple packages. The most recent version of specmine, including the developed functionalities, is available on CRAN. A pipeline for this type of data is provided here, which contains all the work developed to achieve the results in this paper. A simplified version of this pipeline is available within the specmine package, in the form of a vignette.

References

1. Carrari, F., et al.: Integrated analysis of metabolite and transcript levels reveals the metabolic shifts that underlie tomato fruit development and highlight regulatory aspects of metabolic network behavior. Plant Physiol. 142(4), 1380–1396 (2006)
2. Clendinen, C.S., Pasquel, C., Ajredini, R., Edison, A.S.: 13C NMR metabolomics: INADEQUATE network analysis. Anal. Chem. 87(11), 5698–5706 (2015)
3. Costa, C., Maraschin, M., Rocha, M.: An R package for the integrated analysis of metabolomics and spectral data. Comput. Methods Prog. Biomed. 129, 117–124 (2016)
4. Čuperlović-Culf, M.: Experimental methodology. In: NMR Metabolomics in Cancer Research, Chap. 3, pp. 193–213. Woodhead Publishing, Beijing (2013)
5. Dumas, M., Maibaum, E., Teague, C., Ueshima, H., Zhou, B., Lindon, J., Nicholson, J., Stamler, J., Elliott, P., Chan, Q., Holmes, E.: Assessment of analytical reproducibility of 1H NMR spectroscopy based metabonomics for large-scale epidemiological research: the INTERMAP study. Anal. Chem. 78(7), 2199–2208 (2006)
6. Emwas, A.H., et al.: Recommendations and standardization of biomarker quantification using NMR-based metabolomics with particular focus on urinary analysis. J. Proteome Res. 15(2), 360–373 (2016)
7. Emwas, A.H., et al.: NMR spectroscopy for metabolomics research. Metabolites 9(7), 123 (2019)
8. Hanson, B.A.: ChemoSpec2D: exploratory chemometrics for 2D spectroscopy (2020). https://CRAN.R-project.org/package=ChemoSpec2D. R package version 0.4.176
9. Helmus, J.J., Jaroniec, C.P.: Nmrglue: an open source Python package for the analysis of multidimensional NMR data. J. Biomol. NMR 55(4), 355–367 (2013). http://nmrglue.com
10. Inc., P.T.: Collaborative data science (2015). https://plot.ly
11. Jézéquel, T., Deborde, C., Maucourt, M., Zhendre, V., Moing, A., Giraudeau, P.: Absolute quantification of metabolites in tomato fruit extracts by fast 2D NMR. Metabolomics 11(5), 1231–1242 (2015)
12. Klein, M.: mrbin: Magnetic Resonance Binning, Integration and Normalization (2021). https://CRAN.R-project.org/package=mrbin. R package version 1.5.0
13. Lewis, I.A., Schommer, S.C., Markley, J.L.: rNMR: open source software for identifying and quantifying metabolites in NMR spectra. Mag. Reson. Chem. 47(SUPPL. 1), S123 (2009)
14. Mahrous, E.A., Farag, M.A.: Two dimensional NMR spectroscopic approaches for exploring plant metabolome: a review. J. Adv. Res. 6(1), 3–15 (2015)
15. Öman, T., et al.: Identification of metabolites from 2D H-C HSQC NMR using peak correlation plots. BMC Bioinform. 15(1) (2014)
16. Parsons, H., Ekman, D., Collette, T., Viant, M.: Spectral relative standard deviation: a practical benchmark in metabolomics. Analyst 134(3), 478–485 (2009)
17. ur Rahman, A., Choudhary, M.I., tul Wahab, A.: Chapter 5 - the second dimension. In: ur Rahman, A., Choudhary, M.I., tul Wahab, A. (eds.) Solving Problems with NMR Spectroscopy, 2nd edn, pp. 191–225. Academic Press, Boston (2016)
18. Robinette, S.L., et al.: Hierarchical alignment and full resolution pattern recognition of 2D NMR spectra: application to nematode chemical ecology. Anal. Chem. 83(5), 1649–1657 (2011)
19. Takis, P.G., Ghini, V., Tenori, L., Turano, P., Luchinat, C.: Uniqueness of the NMR approach to metabolomics. TrAC Trends Anal. Chem. 120, 115300 (2019)
20. Teng, Q.: Structural Biology, 2nd edn. Springer US, New York (2013)
21. Wang, B., Goodpaster, A., Kennedy, M.: Coefficient of variation, signal-to-noise ratio, and effects of normalization in validation of biomarkers from NMR-based metabonomics studies. Chemomet. Intell. Lab. Syst. 128, 9–16 (2013)
22. Xia, J., Bjorndahl, T.C., Tang, P., Wishart, D.S.: MetaboMiner - semi-automated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinform. 9(1), 507 (2008)

Recurrent Deep Neural Networks for Enzyme Functional Annotation

Ana Marta Sequeira and Miguel Rocha

CEB-Centre Biological Engineering, University of Minho, 4710-057 Braga, Portugal

Abstract. Enzyme functional annotation has been a challenging problem in Bioinformatics for many years, with Deep Learning recently appearing as an efficient alternative. Here, the use of recurrent neural networks, trained on sequential data and boosted by attention mechanisms, is analysed. We assess the consequences of the choice of different parameters, such as the length of the sequence and the type of truncation, which are often not mentioned in previous studies. We also compare different amino acid encoding schemes to describe the protein, using one-hot, z-scales and Blosum62 encodings, as well as embedding layers. Lastly, we try to understand what the network is learning and inferring. Our results show that, for enzyme classification, networks built with bidirectional recurrent layers and attention lead to better results. In addition, using simpler encoding schemes (e.g. one-hot) leads to higher performance. Using attention and embedding layers, we demonstrate that the model is capable of learning biologically meaningful representations.

1 Introduction

Enzymes are proteins that catalyse reactions, regulating biological processes. Enzyme prediction and function annotation have a broad range of applications and have been a challenging problem in bioinformatics. The Nomenclature Committee of the International Union of Biochemistry assigns each enzyme an Enzyme Commission (EC) number, a four-digit representation separated by periods (e.g. 2.4.2.17) that is based on the chemical reactions catalyzed. The four levels are related to each other in a functional hierarchy. The first number describes one of the 7 main enzymatic classes, the second describes the subclass, the third represents the sub-subclass and the fourth refers to the substrate of the enzyme. The most straightforward way to accurately determine enzyme function is through experimental techniques. However, these experiments require large amounts of time, effort and resources. Besides, the discovery rate of enzyme sequences has increased significantly with the advances in high-throughput techniques. In this context, and with the emergence of Machine and Deep Learning (DL), numerous tools to predict enzyme function have been published in recent years. Here, we will focus on the prediction of enzymes using DL.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 62–73, 2022. https://doi.org/10.1007/978-3-030-86258-9_7

Recurrent Deep Neural Networks for Enzyme Functional Classification

In this topic, DEEPre (2018) [13] uses sequence one-hot encoding and position-specific scoring matrices (PSSM) as input features for convolutional and recurrent neural networks (RNNs), and solvent accessibility and functional domains (from PFAM) as input to fully connected layers (DNN). This model was later updated to mlDEEPre [28] to predict multi-functional enzymes. Amidi et al. propose EnzyNet [3], a 3D convolutional neural network (CNN) classifier that predicts the EC number of enzymes based only on their voxel-based spatial structure. DeepEC [17] uses 3 CNNs and one-hot encoding for the prediction of EC numbers, also implementing homology analysis for EC numbers that cannot be classified by the CNNs. DeEPn [20] uses a CNN and a DNN with physicochemical features, such as amino acid composition, autocorrelation or quasi-sequence order, as input. Also using CNNs, Gao et al. [9] predict enzyme function with PSSMs and 3D structural data. In a different approach, UDSMProt [23] puts forward a universal deep protein classification model that is pretrained on Swiss-Prot and fine-tuned on specific classification tasks, achieving good performance in enzyme classification. This model is based on language models using RNNs. Although not specific to enzymes, D-SPACE [19] uses CNNs to encode sequences into an embedding space that is then used to annotate the whole protein space. The majority of these approaches use protein sequences as inputs, generally encoded as one-hot vectors. As RNNs are powerful when dealing with sequential data, processing input sequences one element at a time, we aim to study the use of RNNs to predict enzyme function and the effects of adding attention layers. Furthermore, we aim to understand the consequences of the length of the protein sequence and the type of truncation, and of the use of other amino acid (AA) encoding schemes to describe the protein. We also try to shed light on what the network is learning and inferring.
In what follows, we first define the datasets used and the experimental setup. Section 3 reports on the comparison of different values for the sequence length threshold and truncation schemes. Section 4 evaluates the influence of different types of recurrent networks and of attention mechanisms. Section 5 describes the influence of different encoding schemes.

2 Datasets and Deep Learning Models

The dataset used was retrieved from Dalkiran et al. [7]. Using UniProt IDs, the dataset was updated and filtered using UniRef90 and UniRef50 [24] to remove sequences with more than 90% and 50% similarity, respectively, yielding the datasets ECPRED90 and ECPRED50. EC numbers were predicted up to the first, second, third and fourth levels in different experiments. For each classification task, we retrieved from ECPRED90 and ECPRED50 the sequences whose EC numbers are complete up to that level, that do not have different EC numbers for a single sequence (single-label) and that belong to classes containing more than 50 sequences. Table 1 presents the number of enzyme sequences and classes in each task.
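
The selection rules above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' pipeline: the function names and the `annotations` input format are hypothetical, incomplete levels are assumed to be marked with non-numeric placeholders such as `-`, and a sequence is treated as single-label when all of its EC numbers agree up to the requested level.

```python
from collections import Counter


def ec_prefix(ec_number, level):
    """Return the EC number cut at `level`, e.g. ('2.4.2.17', 2) -> '2.4'.

    Returns None when the number is not complete up to that level.
    """
    parts = ec_number.split(".")[:level]
    if len(parts) < level or any(not p.isdigit() for p in parts):
        return None
    return ".".join(parts)


def build_task(annotations, level, min_per_class=50):
    """Map each retained sequence id to its class label for one EC level.

    `annotations` maps a sequence id to its list of EC numbers.
    """
    labelled = {}
    for seq_id, ec_numbers in annotations.items():
        prefixes = {ec_prefix(ec, level) for ec in ec_numbers}
        # Keep only single-label sequences that are complete at this level.
        if len(prefixes) == 1 and None not in prefixes:
            labelled[seq_id] = prefixes.pop()
    counts = Counter(labelled.values())
    # Keep classes with MORE than `min_per_class` sequences.
    return {s: lab for s, lab in labelled.items() if counts[lab] > min_per_class}
```

Running `build_task` once per level (1 to 4) on each dataset would give one classification task per column of Table 1.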

A. M. Sequeira and M. Rocha

Table 1. Number of enzyme sequences and classes for each EC level and dataset

Dataset    EC 90                                 EC 50
EC level   1        2        3        4          1       2       3       4
Enzymes    152048   149460   165984   123994     67356   44761   61132   31396
Classes    7        59       156      492        7       53      128     278

RNNs are a class of neural networks that are powerful for modeling sequence data, since they process an input sequence one element (timestep) at a time, using a loop to iterate over the elements of the sequence while maintaining an internal state [6]. Long short-term memory (LSTM) networks are RNNs able to remember past information even when dealing with long sequences, addressing long-standing issues with RNN training, such as vanishing and exploding gradients. A typical LSTM is composed of cells and three gates (input, output and forget) that regulate the flow of information into and out of the cells. LSTMs have shown higher performance compared to other networks [22]. For some sequences, such as biological ones, a model can perform better if it processes the sequence both forwards and backwards, learning the context around an item and not only the items coming before it [6,22]. For this reason, bidirectional LSTMs (BiLSTMs) have shown better results than simple LSTMs and are widely used in bioinformatics, as they can capture non-local and long-range relationships in the protein sequence [2,10,12,22]. However, using BiLSTMs can lead to higher computational cost and time. A base network was defined, composed of three BiLSTM layers with (64, 64, 32) units and a dropout rate (DR) of 0.1, followed by two dense layers with (32, 16) units, a DR of 0.1 and batch normalization. In Sect. 4, we will evaluate the influence of changing BiLSTM to LSTM and of adding attention mechanisms. By default, enzymes were one-hot encoded, as this is the simplest encoding and is unbiased. The models were implemented using Keras and TensorFlow [1,6], and the code was written using the ProPythia package [21]. Padding was set to post. To evaluate model performance, the Matthews correlation coefficient (MCC), accuracy, log loss, F1, ROC and precision were calculated. Precision, recall and F1 were assessed for each class.
MCC and accuracy for the multiclass case are those defined by sklearn, where MCC is defined in terms of a confusion matrix C for K classes and accuracy is the subset accuracy. Precision and recall are weighted averages. Here, for simplicity, we present the MCC scores. The remaining metrics can be viewed in the repository. Training was run for 100 epochs with early stopping (patience of 30) and learning rate reduction (patience of 20) callbacks, a batch size of 64 and Adam optimization. All the models were run on an NVIDIA GeForce RTX 2080 Ti graphics card. The scripts, datasets and notebooks related to this work are available at https://github.com/marta-seq/enzymeClassification.
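
For readers unfamiliar with the multiclass MCC, the sketch below computes it from label lists using the confusion-matrix formulation documented for sklearn's `matthews_corrcoef`. It is a plain Python illustration, not the code used in this work; the function name is hypothetical, and returning 0.0 when the denominator vanishes is a convention adopted here.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    With s samples, c correct predictions, t_k true occurrences and
    p_k predicted occurrences of class k:
        MCC = (c*s - sum_k p_k*t_k)
              / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)
    p_k = Counter(y_pred)
    classes = set(t_k) | set(p_k)

    numerator = c * s - sum(p_k[k] * t_k[k] for k in classes)
    denom_pred = s * s - sum(p_k[k] ** 2 for k in classes)
    denom_true = s * s - sum(t_k[k] ** 2 for k in classes)
    denominator = math.sqrt(denom_pred * denom_true)
    return numerator / denominator if denominator else 0.0
```

The score ranges from -1 to 1, with 1 for a perfect prediction and values near 0 for predictions no better than chance, which makes it suitable for the imbalanced class sizes in these tasks.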

3 Influence of Sequence Length and Type of Truncation

One-hot encoding has been widely used to represent protein sequences in DL models. As DL models require inputs of the same shape, using raw AA sequences requires handling proteins with different lengths. To accomplish this, a length threshold is defined and sequence padding and truncation are applied. For sequences shorter than the threshold, zeros are added up to the required length. Sequences longer than the threshold are truncated. Padding and truncation can be performed at any position of the sequence, for example taking into account the N- and C-terminals, which are biologically relevant parts of the sequence. Usually, details on the concrete steps of padding and truncation are omitted or not justified, although they heavily impact model performance. In one of the few studies on this matter, Rio et al. analysed the impact of different ways of padding the sequences in an EC number prediction problem using convolutional and dense layers. They showed that padding has an effect on model performance, highlighting its relevance [15]. Enzymes show a broad range of sequence lengths, with the majority having between 250 and 500 amino acids. The strategies adopted depend on the biological problem and the model architecture. As RNNs have well-established masking layers, which inform the model to skip padding positions, we considered that padding would not impact model performance. However, different truncation schemes, the elimination or not of sequences based on length and the choice of sequence length threshold might.
Previous studies on protein function prediction use different strategies: DeepEC excludes sequences longer than 1000 AAs and pads shorter sequences [17]; DEEPre excludes sequences longer than 5000 AAs [13]; DeepLoc, a DL model using convolutional and LSTM layers to predict protein subcellular location, fixes a protein length of 1000, removing the middle part of longer proteins (to keep the N- and C-terminals) and padding shorter ones [2]; finally, Bileschi et al., who predict family domains using an embedding layer and ResNet convolutions, pad sequences to the length of the longest sequence in the batch [5]. As there is no baseline for choosing adequate options, we tested different kinds of truncation and sequence lengths. For sequence length, we tested the values 100, 300, 500, 700, 900 and 1000. The truncation types tested were post, pre, keep middle, keep terminals and elimination of sequences longer than the threshold. In post, sequences are truncated at the end (keeping the N-terminus); in pre, at the beginning (keeping the C-terminus); in keep middle, equal parts are removed from both the beginning and the end; and in keep terminals, equal parts are kept from both the beginning and the end. Furthermore, for lengths 500, 700, 900 and 1000, the option of eliminating longer sequences was also tested. In this task, the first level of EC numbers of the dataset ECPRED90 with 8 classes (7 enzymatic classes and 22708 non-enzymes) was considered. Enzymes were one-hot encoded and the base network defined in the previous section was used. A 5-fold cross validation (CV) was used, and the average time needed to train each fold (in hours) was calculated for each sequence length. Results are summarized in Table 2, which presents the MCC scores and standard deviations.

Table 2. Comparison of different sequence length and truncation strategies on enzyme classification (7 classes plus non-enzymes). MCC scores for 5-fold CV

Len    Post            Pre             Middle          Terminals       Eliminate       Time (h)
100    0.729 ± 0.008   0.712 ± 0.008   0.730 ± 0.010   0.710 ± 0.009   -               6.1
300    0.811 ± 0.013   0.794 ± 0.023   0.816 ± 0.020   0.810 ± 0.013   -               13.6
500    0.783 ± 0.097   0.830 ± 0.021   0.815 ± 0.026   0.836 ± 0.020   0.849 ± 0.028   21.4
700    0.826 ± 0.023   0.829 ± 0.024   0.831 ± 0.021   0.778 ± 0.116   0.817 ± 0.047   29.7
900    0.765 ± 0.108   0.686 ± 0.209   0.828 ± 0.029   0.803 ± 0.067   0.780 ± 0.151   36.8
1000   0.813 ± 0.027   0.765 ± 0.136   0.799 ± 0.058   0.624 ± 0.304   0.839 ± 0.030   47.3

Results indicate that the best models are achieved using 500 and 700 as sequence length. Using longer sequences does not improve results. Eliminating sequences yields good results, but at the cost of losing a large number of sequences. Differences between truncation types do not appear to be significant. As expected, the time required increases with length, from an average of 6 h for 100 AAs to 47 h for 1000. Considering predictive performance and time, we chose a sequence length of 500 amino acids. We set the truncation to pre in further studies.
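
The four truncation schemes compared in Table 2, together with post padding, can be sketched on plain strings as follows. This is an illustrative Python sketch, not the ProPythia or Keras code used in the experiments: the function names, the mode strings and the `-` padding character are assumptions, and when the number of residues to split is odd the extra residue is assigned to one side.

```python
def truncate(seq, max_len, mode="pre"):
    """Shorten `seq` to `max_len` residues using one of four schemes."""
    if len(seq) <= max_len:
        return seq
    excess = len(seq) - max_len
    if mode == "post":
        # Cut the end of the sequence, keeping the N-terminus.
        return seq[:max_len]
    if mode == "pre":
        # Cut the beginning of the sequence, keeping the C-terminus.
        return seq[excess:]
    if mode == "keep_middle":
        # Remove (roughly) equal parts from the beginning and the end.
        start = excess // 2
        return seq[start:start + max_len]
    if mode == "keep_terminals":
        # Keep (roughly) equal parts of the beginning and the end.
        head = (max_len + 1) // 2
        tail = max_len - head
        return seq[:head] + seq[len(seq) - tail:]
    raise ValueError(f"unknown truncation mode: {mode}")


def pad_post(seq, max_len, pad_char="-"):
    """Pad at the end of the sequence up to `max_len`."""
    return seq + pad_char * (max_len - len(seq))
```

For example, with a threshold of 4 the sequence `ABCDEFGHIJ` becomes `ABCD` (post), `GHIJ` (pre), `DEFG` (keep middle) or `ABIJ` (keep terminals).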

4 Influence of Network Type and Attention Addition

Attention is a technique first described by Bahdanau et al. that tries to mimic cognitive attention, enhancing important parts of the input data and fading out the rest. It is expressed as a vector of importance weights [4]. Attention mechanisms have become an integral part of sequence modeling [26] and have been applied in various tasks, including bioinformatics. This mechanism allows longer sequences to be modeled and also helps with model interpretability [22]. Attention has been shown to improve performance in various protein function classification tasks [2,14,27]. Besides, it can also identify protein regions important for the classification problem, such as regions for subcellular location and binding sites [2,27], capture high-level structural properties of proteins, connecting AAs that are spatially close in the three-dimensional structure but far apart in the underlying sequence, and capture substitution properties [27]. In this section, we aim to evaluate the influence of the RNN type, comparing the use of LSTM and BiLSTM. In addition, we tested the network with an attention layer. One-hot encoded sequences with a length of 500 and truncation set to pre were used. The 7 enzymatic classes were classified and the models were evaluated using 5-fold CV.
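
The idea of a vector of importance weights can be illustrated with a minimal sketch in plain Python. It only shows the normalisation and pooling step: in an attention layer such as the one used here the per-timestep scores are produced by learned parameters, whereas in this sketch they are simply passed in, and the function name is hypothetical.

```python
import math


def attention_pool(hidden_states, scores):
    """Turn per-timestep scores into weights and pool the hidden states.

    `hidden_states` holds one vector per timestep and `scores` one
    unnormalised importance score per timestep. Returns the softmax
    weights and the weighted sum of the hidden states.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context
```

Because the weights sum to one, timesteps with high scores dominate the pooled representation, and inspecting the weights indicates which positions of the sequence the model relies on.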

Using LSTM, the model achieves an overall MCC of 0.67, increasing to 0.83 when using attention. BiLSTM reaches 0.85, and 0.87 when using attention. Models using BiLSTM layers take longer to run (approximately 17 h compared to 9 h using LSTM), but the addition of attention did not significantly impact the running time of the models. As BiLSTM with attention is the architecture yielding the best results, we established it as the base network for the remaining experiments. Model interpretability and explainability are gaining importance, especially when dealing with biological classification, where there is interest not only in classifying proteins efficiently, but also in understanding how the models work and what they are giving importance to. As attention has been shown to identify protein regions important for the classification problem, we can use it to understand what the model is highlighting. Studies using attention to improve both the performance and the interpretability of models are starting to appear. DeepLoc [2] uses a CNN and an RNN with attention to predict subcellular localization, using attention to observe which regions (N- or C-terminals) are most important for the model. BERTology [27], which aims to explore how transformer models discern structural and functional properties of proteins, analyzes how attention aligns with various protein properties, both at the token level (secondary structure, binding sites) and at the token-pair level (contact maps). Here, we make use of the attention layer to select the most important LSTM cells and plot the activation values at every timestep for a protein sequence. The ECPRED90 dataset was divided considering test and validation sets with 20% of the examples. For a protein sequence, the activation values of the attention layer are retrieved and ordered, and the top 5 units are selected.
The units corresponding to the last BiLSTM layer are retrieved and the activation values for each of the timesteps are plotted in color against the protein sequence. An example can be seen in Fig. 1, representing the sequence Q91VB2, EC 2.7.11.17. The figure plots the activation values for two of the five highest-attention BiLSTM units. The steps with the highest activation are shown in red, while black boxes represent the binding site (residues 29–56) and the active site (residues 139–151) of the enzyme, taken from UniProt and InterPro. The model appears to be able to identify these two important regions.

5 Influence of Different Amino Acid Encoding Schemes

DL models require inputs to be vectors. When the input is the sequence, the encoding scheme will assign a numerical representation to each AA. Benchmarking of different AA encoding schemes demonstrated that the encoding process plays a critical role in the applicability and quality of the model [8].
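
As a minimal illustration of assigning a numerical representation to each AA, the sketch below one-hot encodes a sequence over the 20 standard amino acids. It is plain Python written for this explanation, not the ProPythia encoder used in the experiments: the alphabet ordering, the function name and the choice of an all-zero vector for non-standard residues are assumptions.

```python
# The 20 standard amino acids in alphabetical one-letter order (an
# arbitrary but fixed ordering chosen for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional 0/1 vectors.

    Residues outside the 20-letter alphabet are encoded as all zeros,
    which is an assumption of this sketch.
    """
    encoded = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        if residue in AA_INDEX:
            vector[AA_INDEX[residue]] = 1
        encoded.append(vector)
    return encoded
```

Every pair of distinct residues is equally far apart in this representation, which is why one-hot encoding carries no prior knowledge about AA similarity.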

Fig. 1. Activations of BiLSTM units for protein Q91VB2. In red the timesteps with more activation. Black boxes represent the binding and the active site of the enzyme.

Commonly used encoding schemes include one-hot encoding (ONE-HOT), substitution matrices such as the BLOck SUbstitution Matrix (BLOSUM), physicochemical character-based schemes and representations learned strictly by the models in the form of embedding layers. ONE-HOT is the simplest encoding used. Each AA is described by a 20-dimensional vector with 19 positions set to 0 and one position set to 1. This scheme does not capture information regarding similarity between AAs, as all are equidistant in the feature space [16]. Physicochemical character-based schemes are usually derived by principal component analysis (PCA) of a large property matrix used to describe individual AAs. Z-scales, as described by Sandberg et al. [18], are 5-dimensional vectors based on 26 physicochemical properties, including lipophilicity, polarity/charge, electronegativity, heat of formation and hardness, and are arguably the most widely used descriptors of this type. Even though they are calculated in different ways, descriptors of this type appear to perform similarly in the machine learning context [25]. BLOSUM is a family of substitution matrices used for sequence alignment of proteins, first described by Henikoff [11]. While ONE-HOT does not assume prior knowledge, BLOSUM and Z-scales capture prior knowledge about AAs as similarities between vectors: BLOSUM captures evolutionary relationships and Z-scales capture physicochemical properties. Another possible way to encode AAs is to make the encoding a learnable part of the model, for example using embedding layers. The model then encodes and extracts patterns in the data that are relevant and specific to the problem in question [8]. Learning the embedding directly from data as part of the continuous iteration of the model has led to state-of-the-art results in many fields [8].
In bioinformatics, ONE-HOT and problem-specific embeddings achieve similar or better results than encodings based on AA properties, with some authors suggesting that they are a simple, assumption-free and optimal way to perform feature engineering [8,16]. We compared the AA encoding schemes by feeding the network the protein sequence with AAs encoded as one-hot vectors (21 dimensions), as rows of the BLOSUM matrix [11] (20 dimensions) or as Z-scales [18] (5 dimensions). To assess the ability of the models to learn suitable representations, we

Recurrent Deep Neural Networks for Enzyme Functional Classification


fed the network protein sequences with categorically encoded AAs, preceded by an embedding layer. We tested embedding layers with output dimensions of 20, 8 and 5. These encoding schemes were tested with the network defined in Sect. 4 using 5-fold CV. The models were evaluated on the ECPRED90 and ECPRED50 datasets, predicting EC numbers at the four hierarchical levels.

Table 3. Classification of enzyme EC numbers with different AA encoding schemes with datasets ECPRED90 and ECPRED50. Four different levels of EC were predicted (cl. = classes).

ECPRED90
Encoding           1 level - 7 cl.   2 level - 59 cl.   3 level - 156 cl.   4 level - 492 cl.
Embedding dim 20   0.924 ± 0.008     0.878 ± 0.004      0.856 ± 0.005       0.843 ± 0.015
Embedding dim 8    0.916 ± 0.004     0.858 ± 0.038      0.832 ± 0.020       0.826 ± 0.018
Embedding dim 5    0.911 ± 0.007     0.865 ± 0.007      0.816 ± 0.012       0.812 ± 0.009
BLOSUM62           0.925 ± 0.010     0.876 ± 0.024      0.863 ± 0.002       0.837 ± 0.014
Zscales            0.921 ± 0.011     0.880 ± 0.005      0.832 ± 0.007       0.822 ± 0.008
One-hot            0.928 ± 0.002     0.878 ± 0.018      0.855 ± 0.016       0.839 ± 0.010

ECPRED50
Encoding           1 level - 7 cl.   2 level - 53 cl.   3 level - 128 cl.   4 level - 278 cl.
Embedding dim 20   0.799 ± 0.017     0.702 ± 0.007      0.657 ± 0.009       0.648 ± 0.008
Embedding dim 8    0.799 ± 0.015     0.684 ± 0.033      0.653 ± 0.028       0.635 ± 0.019
Embedding dim 5    0.796 ± 0.022     0.637 ± 0.029      0.602 ± 0.021       0.616 ± 0.018
BLOSUM62           0.827 ± 0.007     0.719 ± 0.005      0.661 ± 0.003       0.654 ± 0.005
Zscales            0.815 ± 0.018     0.730 ± 0.006      0.671 ± 0.013       0.627 ± 0.007
One-hot            0.831 ± 0.012     0.748 ± 0.004      0.687 ± 0.022       0.645 ± 0.028

Results are summarized in Table 3. Predictive performance at all EC levels does not seem to differ significantly across encoding types. However, ONE-HOT took significantly less time to run. The lower-dimensional embeddings (5 and 8) tend to produce slightly worse results. Furthermore, to analyse the similarities between AAs in each encoding, and to understand what information the embeddings capture in comparison with the other encoding schemes, we calculated the cosine similarity between the AA vectors of each encoding and plotted them as heatmaps (Fig. 2). ONE-HOT encoding, as expected, did not show any relationship between AAs. Z-scales are the encoding closest to BLOSUM62. The embedding of dimension 20 also shares similarities with this matrix, demonstrating that models can learn biologically relevant information. For example, in these three encoding schemes, AAs I, L and V are considered similar to each other and dissimilar to AAs R and K. Embeddings of dimensions 8 and 5, as expected, are less sensitive in detecting similarities.
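The cosine-similarity comparison described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the input is assumed to be a mapping from each AA to its encoding vector (a BLOSUM62 row, a Z-scale vector or a learned embedding row).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(encoding):
    """Pairwise cosine similarities for a dict {residue: vector}.

    Returns the residue order and a square matrix (list of lists) that
    could be passed to any heatmap plotting routine.
    """
    residues = sorted(encoding)
    matrix = [
        [cosine_similarity(encoding[a], encoding[b]) for b in residues]
        for a in residues
    ]
    return residues, matrix
```

For one-hot vectors this matrix is the identity, which matches the observation that ONE-HOT shows no relationship between AAs.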


Fig. 2. Heatmaps of the cosine similarity matrices of the AA encodings. From left to right: ONE-HOT, BLOSUM62, Z-SCALES, Embedding dimension 20, Embedding dimension 8, Embedding dimension 5.

6 Discussion and Conclusions

In this paper, we studied the use of RNNs for the enzyme classification problem using datasets with 90% and 50% similarity, taking only the sequence as input.

First, we argue that the choice of sequence length and of the type of padding and truncation should be considered in every problem. For the ECPRED90 dataset and using RNNs, we observed that using 1000 and 900 AAs leads to considerably higher running times without increasing predictive performance. The best-performing setups were those with lengths of 500 and 700. Therefore, we consider 500 the best choice in terms of both classification performance and running time. The fact that the majority of sequences in our dataset have between 250 and 500 AAs may be one of the reasons why 500 is enough for this task. The truncation type should be chosen in accordance with the chosen length. For the length of 500, truncating the C terminal (post) leads to worse performance. Keeping the terminals and truncating the N terminal (pre) improves scores.

Secondly, we demonstrated that using a BiLSTM, which captures information in both the forward and backward directions, leads to a significant improvement over a simple LSTM. Adding an attention layer leads to higher performance, as the model better 'focuses' on the important parts of the sequence. Using the attention values, we tried to shed light on what the model is capturing and demonstrated that it is capable of identifying biologically meaningful regions. These results are in accordance with the literature, where BiLSTMs with attention have been used both to improve scores and to improve model interpretability [2,27].

Thirdly, as the translation of a protein sequence into a proper encoding scheme is crucial for model performance, we tested six AA encoding schemes: BLOSUM62, capturing evolutionary information; Z-scales, capturing physicochemical information; embeddings that are learned alongside the model; and ONE-HOT encoding, which does not represent any similarity between AAs. We did not find major differences across the encoding schemes, with ONE-HOT performing better overall and in less time. Z-scales and BLOSUM62 matrices also lead to good performance, with the 20-dimensional embedding achieving close results. Embeddings of lower dimensions perform worse, especially when the similarity threshold decreases. Moreover, by calculating the cosine similarities of the different encodings, we observed that embeddings are capable of capturing meaningful biological relations. This experiment is in agreement with the works of ElAbd et al. and Raimondi et al. [8,16], where ONE-HOT encoding is usually the best performer. This may be due to the use of 20 dimensions to represent each AA, allowing the network to assign a weight to the contribution of each AA depending on the classification task. In contrast, predefined scales such as Z-scales and BLOSUM62 allow DL models to assign weights to physicochemical or evolutionary characteristics, but these are not specific to the problem in question. Embeddings project the AAs into a vector space. This projection is learned jointly with the model and captures meaningful aspects of the AAs. The use of different AA encoding schemes should be addressed more often in protein classification studies, as it is an important factor in model performance.

This work highlights the importance of some parameter choices in DL models that are often overlooked in the literature. We demonstrated that using a BiLSTM with attention leads to better results for enzyme classification. In addition, simpler encoding schemes such as ONE-HOT yield better results with less computation time. Using other architectures that are yielding good results in biological classification tasks, such as ConvLSTM, CNNs and autoencoders, would be a future improvement of this work.
Furthermore, analysing the model from a biological point of view, with the development of further visualization techniques, may not only shed light on the information captured by the model, but also uncover meaningful biological relations that might not yet be described in the literature.

Acknowledgements. This study was supported by the European Regional Development Fund under the scope of Norte2020, through the project DeepBio (ref. NORTE01-0247-FEDER-039831). This study was also supported by the PhD scholarship with reference 2020.07867.BD, granted by the Portuguese Foundation for Science and Technology and the European Social Fund under the scope of Norte2020.

References

1. Abadi, M., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015)
2. Almagro Armenteros, J.J., et al.: DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics 33(21), 3387–3395 (2017). https://doi.org/10.1093/bioinformatics/btx431
3. Amidi, A., et al.: EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation. PeerJ 2018(5), 1–18 (2018). https://doi.org/10.7717/peerj.4750
4. Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, pp. 1–15 (2015)
5. Bileschi, M.L., et al.: Using deep learning to annotate the protein universe. bioRxiv, pp. 1–29 (2019). https://doi.org/10.1101/626507
6. Chollet, F., et al.: Keras (2015)
7. Dalkiran, A., Rifaioglu, A.S., Martin, M.J., Cetin-Atalay, R., Atalay, V., Doğan, T.: ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinform. 19(1), 1–13 (2018). https://doi.org/10.1186/s12859-018-2368-y
8. Elabd, H., et al.: Amino acid encoding for deep learning applications. BMC Bioinform. 21(1), 1–14 (2020). https://doi.org/10.1186/s12859-020-03546-x
9. Gao, R., et al.: Prediction of enzyme function based on three parallel deep CNN and amino acid mutation. Int. J. Mol. Sci. 20(11) (2019). https://doi.org/10.3390/ijms20112845
10. Guo, Y., et al.: DeepACLSTM: deep asymmetric convolutional long short-term memory neural models for protein secondary structure prediction. BMC Bioinform. 20(1), 1–12 (2019). https://doi.org/10.1186/s12859-019-2940-0
11. Henikoff, S., Henikoff, J.G.: Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89(22), 10915–10919 (1992)
12. Li, S., Chen, J., Liu, B.: Protein remote homology detection based on bidirectional long short-term memory. BMC Bioinform. 18(1), 1–8 (2017). https://doi.org/10.1186/s12859-017-1842-2
13. Li, Y., et al.: DEEPre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics 34(5), 760–769 (2018). https://doi.org/10.1093/bioinformatics/btx680
14. Liu, J., Gong, X.: Attention mechanism enhanced LSTM with residual architecture and its application for protein-protein interaction residue pairs prediction. BMC Bioinform. 20(1), 1–11 (2019). https://doi.org/10.1186/s12859-019-3199-1
15. Lopez-del Rio, A., Martin, M., Perera-Lluna, A., Saidi, R.: Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Sci. Rep. 10(1), 1–14 (2020). https://doi.org/10.1038/s41598-020-71450-8
16. Raimondi, D., Orlando, G., Vranken, W.F., Moreau, Y.: Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis. Sci. Rep. 9(1), 1–11 (2019). https://doi.org/10.1038/s41598-019-53324-w
17. Ryu, J.Y., Kim, H.U., Lee, S.Y.: Deep learning enables high-quality and high-throughput prediction of enzyme commission numbers. Proc. Natl. Acad. Sci. U. S. A. 116(28), 13996–14001 (2019). https://doi.org/10.1073/pnas.1821905116
18. Sandberg, M., et al.: New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J. Med. Chem. 41(14), 2481–2491 (1998). https://doi.org/10.1021/jm9700575
19. Schwartz, A.S., et al.: Deep semantic protein representation for annotation, discovery, and engineering. bioRxiv (2018). https://doi.org/10.1101/365965
20. Semwal, R., Aier, I., Tyagi, P., Varadwaj, P.K.: DeEPn: a deep neural network based tool for enzyme functional annotation. J. Biomol. Struct. Dyn. (2020). https://doi.org/10.1080/07391102.2020.1754292
21. Sequeira, A.M., Lousa, D., Rocha, M.: ProPythia: a python automated platform for the classification of proteins using machine learning. In: Panuccio, G., Rocha, M., Fdez-Riverola, F., Mohamad, M., Casado-Vara, R. (eds.) Practical Applications of Computational Biology & Bioinformatics. AISC, vol. 1240, pp. 32–41. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-54568-0_4
22. Shi, Q., et al.: Deep learning for mining protein data. Brief. Bioinform. 1–25 (2019). https://doi.org/10.1093/bib/bbz156
23. Strodthoff, N., Wagner, P., Wenzel, M., Samek, W.: UDSMProt: universal deep sequence models for protein classification. Bioinformatics 36(8), 2401–2409 (2020). https://doi.org/10.1093/bioinformatics/btaa003
24. Suzek, B.E., et al.: UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31(6), 926–932 (2015). https://doi.org/10.1093/bioinformatics/btu739
25. Van Westen, G.J., et al.: Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets. J. Cheminform. 5(9), 1–11 (2013). https://doi.org/10.1186/1758-2946-5-42
26. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS), pp. 5999–6009 (2017)
27. Vig, J., et al.: BERTology meets biology: interpreting attention in protein language models. bioRxiv (2020). https://doi.org/10.1101/2020.06.26.174417
28. Zou, Z., Tian, S., Gao, X., Li, Y.: mlDEEPre: multi-functional enzyme function prediction with hierarchical multi-label deep learning. Front. Genet. 10, 1–10 (2019). https://doi.org/10.3389/fgene.2018.00714

Assessing the Impact of Data Set Enrichment to Improve Drug Sensitivity in Cancer

Pedro Ferreira (1), João Ladeiras (1), and Rui Camacho (2)

(1) Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
(2) Faculdade de Engenharia da Universidade do Porto and INESC TEC, Porto, Portugal

Abstract. Cancer is one of the diseases with the highest mortality rate in the world. To understand the different origins of the disease, and to facilitate the development of new ways to treat it, laboratories cultivate, in vitro, cancer cells (cell lines), taken from patients with cancer. These cell lines enable researchers to test new approaches and to have an appropriate procedure for comparison of results. The methods used in an initial study at EMBL-EBI Institute (Cambridge, UK) were based on algorithms that construct “propositional like” models. The results reported were promising but we believe that they can be improved. A relevant limitation of the algorithms used in the original study is the absence or severe lack of comprehensibility of the models constructed. In Life Sciences, the possibility of understanding a model is an asset to help the specialist to understand the phenomenon that produced the data. With our study we have improved the performance of forecasting models and constructed understandable models. To meet these objectives we have used Graph Mining and Inductive Logic Programming algorithms.

1 Introduction

Besides being one of the diseases with the highest mortality in the world, cancer is also one of the diseases with the highest morbidity, i.e., patients suffering from the disease are subject to a very low quality of life. It is estimated that the disease remains the leading cause of death in developed countries and the second leading cause of death in under-developed countries [1], according to predictions based on estimates made in 2012 [2]. Methods used to treat the disease often lack specificity for each case, i.e., although there are many similarities between cases, most treatments are not customized. This lack of customization may reduce the success of the treatment and can also lead to the aggravation of the disease. In an attempt to facilitate the development of new forms of treatment, laboratories cultivate cancer cells [3,8,9] taken from patients with the disease, called cell lines. These cell lines consist of arrays of genomically identical cells, and are useful both for studying their properties and for testing the effects of various drugs, allowing consistency and reproducibility of study results.

In an early study [4], the effects of a large number of chemical compounds on various cell lines of different cancer types were tested in the laboratory. The level of sensitivity of cells to these compounds was measured using the half maximal Inhibitory Concentration (IC50), which measures how effective a substance is at inhibiting a particular biological function. This study also involved the use of Machine Learning methods to predict the degree of efficacy of drugs in different types of cancer. The Machine Learning methods used in [4] produce only "propositional-like" models. In this type of domain it would be very useful to understand the phenomenon that produces the disease. Therefore, the objective of our study is to construct understandable models using relational learning algorithms. We also aim to improve the accuracy of the results, and for that purpose we investigate the usefulness of other "propositional learners".

In our study we initially replicated the results obtained in [4], providing a baseline for the performance evaluation of further analyses. Additionally, using the same data preparation, we ran other types of propositional algorithms. We also used Graph Mining (GM) as a way to enrich the original data sets and assessed its impact on the models' performance and comprehensibility. In a second stage, we used Inductive Logic Programming (ILP), aiming at the induction of understandable models that can be used as hypotheses to help biochemistry experts understand what, in the molecules used, causes the increase in efficiency in cancer treatment.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 74–84, 2022. https://doi.org/10.1007/978-3-030-86258-9_8

2 Materials and Methods

2.1 Original Data and the Study Data Sets

The data used in our experiments is the result of the work done in the "Genomics of Drug Sensitivity in Cancer" project [5], and its pre-processing follows the same approach used in [4]. The original data set, publicly available on the "Genomics of Drug Sensitivity in Cancer" project website, consists of measured IC50 values (see footnote 1) for various pairs of cell lines and compounds. It contains 639 different cell lines, each with 77 gene mutation properties. Each cell line also has information about its Microsatellite Instability Status (MIS), cancer type and corresponding tissue. Each gene mutation is described by its sequence variation and copy number variation. The data set contains 131 drugs, which translates to 639 × 131 = 83709 potential IC50 values. The information in the data set is not complete. In the first version of the data set, about 58% of the IC50 values of the cell-drug pairs are available.

1 Sensitivity, the target variable of the process, is the degree of effectiveness with which drugs inhibit cancer cells. It is measured by the IC50, which is the concentration of the drug required for 50% inhibition.


P. Ferreira et al.

In an updated version some values were added, amounting to about 18% of the total number of results. The use of cell lines to assess the performance of new drugs in diseases like cancer is being recognized as a valuable approach [8].

The pre-processing aimed at a data set containing both cell mutation properties and drug properties, in this case molecular descriptors and fingerprints generated with the same version of the Pharmaceutical Data Exploration Laboratory (PaDEL) software used in [4]. All the data was therefore compiled into instances, each having one IC50 value and the corresponding cell and drug properties. In the first step of the pre-processing, all the missing values were discarded. For each compound, a Simplified Molecular Input Line Entry Specification (SMILES) was generated using the Application Programming Interface (API) available on the PubChem website (see footnote 2). At this point all the drugs for which the SMILES information was unavailable were removed from the data set. The 1D and 2D descriptors and fingerprints were generated using the default properties of the PaDEL software. PaDEL could not calculate the descriptors and fingerprints of some drugs, so the final number of drugs was reduced to 110. The total number of compound features at this point was 1603. Features with missing values were then removed for all the compounds in the data set, as well as features with the same value for all the compounds, resulting in 790 final features. The final number of cell line features was 142 and the final number of drug features was 790, resulting in a total of 932 features plus the IC50 value. The final data set resulted in 40691 instances for the first version, and an additional 15578 for the updated one.

2.2 Methods

The data analysis in our study included not only a regression analysis similar to the original one [4] but also the discretization of the numerical class, producing a classification data set. Using a Graph Mining algorithm, we enriched the data sets with new features that include the most frequent and discriminative sub-graphs in the set of molecules. Finally, we applied ILP.

Regression Analysis

As in [4], we assessed the models' performance using 8-fold cross-validation, so the data from the first version of the data set (as described in [4]) was divided into 8 different bins. Again following [4], two versions of the data set splitting were generated. The first one randomly splits the instances into 8 bins of equal size. In the second one, the training set and the test set do not share any cell line. This latter split was called "stringent". The stringent form of the data set aims to show the predictive power of the models on cell lines not seen during training.

2 http://pubchem.ncbi.nlm.nih.gov/.
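The difference between the two splitting strategies can be sketched as below. This is a minimal illustration under stated assumptions rather than the procedure of [4]: instances are assumed to be (cell line, drug, IC50) tuples, and the function names, the round-robin assignment of cell lines to bins and the fixed seed are choices made for this example.

```python
import random


def random_folds(instances, n_folds=8, seed=0):
    """Non-stringent split: shuffle instance indices into near-equal bins."""
    indices = list(range(len(instances)))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def stringent_folds(instances, n_folds=8, seed=0):
    """Stringent split: all instances of a cell line go to the same bin.

    `instances` is a list of (cell_line, drug, ic50) tuples (assumed layout).
    Training and test sets built from different bins never share a cell line.
    """
    cell_lines = sorted({instance[0] for instance in instances})
    random.Random(seed).shuffle(cell_lines)
    fold_of = {cell: i % n_folds for i, cell in enumerate(cell_lines)}
    folds = [[] for _ in range(n_folds)]
    for index, instance in enumerate(instances):
        folds[fold_of[instance[0]]].append(index)
    return folds
```

Note that the stringent bins may differ in size, because cell lines do not all have the same number of measured IC50 values.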


The algorithms were trained and tested with both the stringent and the non-stringent data set versions. As in the original study, we also made a blind test, using the additional instances of the updated version of the data set (the extra 18% of examples) as the test set. After replicating the original results we made a set of new experiments, in which we tuned the parameters of the Artificial Neural Networks (ANN), Random Forest (RF) and Support Vector Machines (SVM) algorithms. To evaluate the performance of the regression algorithms, we measured the RMSE, the Pearson Correlation Coefficient and the Coefficient of Determination.

Classification Analysis

Another interesting experiment is to show how the same algorithms and data sets used in the regression analysis handle classification; the results also serve as a reference for the evaluation of the ILP experiments. For classification we divided the data into 2 different classes, "good" and "bad". The data discretization and data set generation were done by calculating the tertiles of the IC50 distribution, which were used as the upper and lower threshold values in the decision of the class. Examples of the data set are therefore considered "good" below the lower threshold and "bad" above the upper threshold. The remaining examples are discarded. The discretization results in a total of 27118 examples for the first data set and a total of 10427 examples for the additional one (blind test). The validation is also performed using 8-fold cross-validation and the blind test set. The classification models were assessed by measuring the Accuracy, Recall, Precision and Specificity. In order to get more comprehensible models we applied two feature selection (FS) methods: Pearson Correlation and Mutual Information. FS is useful to get simpler models without significant loss in accuracy.
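The tertile-based discretization and the four classification measures named above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the tertile cut points are estimated with Python's `statistics.quantiles` (the original study may have used a different estimator), and treating "good" as the positive class is an assumption.

```python
import statistics


def discretize_ic50(ic50_values):
    """Label values below the lower tertile 'good' and above the upper 'bad'.

    Returns (index, label) pairs; values in the middle third are dropped.
    """
    lower, upper = statistics.quantiles(ic50_values, n=3)
    labelled = []
    for index, value in enumerate(ic50_values):
        if value < lower:
            labelled.append((index, "good"))  # low IC50: sensitive
        elif value > upper:
            labelled.append((index, "bad"))   # high IC50: resistant
    return labelled


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Keeping only the two outer thirds retains roughly two thirds of the instances, which is consistent with the reduction from 40691 to 27118 examples reported above.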
Graph Mining Enrichment

The first goal was to find groups of molecule substructures (fragments) that can be associated with good (low) and bad (high) IC50 values. The frequent substructures were calculated using Xifeng Yan's gSpan implementation (Graph-Based Substructure Pattern Mining [6]). The input for the gSpan execution was the graph data of the drug molecules, created from the SMILES information. The minimum support ranged from 20% to 60%, increasing by 5% in each iteration. The minimum size of the fragments ranged from 3 to 20 vertices. The number of fragments found was limited to between 200 and 500. With these criteria, 10 different groups of fragments were found. This resulted in 10 additional features, each indicating the presence or absence of a fragment.

Inductive Logic Programming

Using all the features from the data sets used for classification, we performed classification using the ILP algorithm Aleph [7].


The data preparation involves the creation of the background knowledge (BK) and of the positive and negative examples. The background knowledge was created by gathering the data on descriptors, fingerprints and drug fragments, as well as the cell line features, for each drug and cell line present in the data sets, and encoding it as Prolog facts and rules. The positive and negative examples were generated as simple facts indicating the class to which each example belongs. In this experiment we used both data sets obtained from the first division of the non-stringent splitting type.
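The generation of Prolog facts described above can be sketched as follows. The predicate names `has_fragment/2`, `mutated/2` and `sensitive/2`, as well as the input layout, are hypothetical and chosen only for illustration; the source does not give the actual encoding of the background knowledge.

```python
def prolog_atom(name):
    """Quote an arbitrary identifier as a Prolog atom."""
    text = str(name).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"


def background_facts(drug_fragments, cell_mutations):
    """Build background-knowledge facts (hypothetical predicates).

    drug_fragments: dict {drug: set of fragment ids present in the drug}
    cell_mutations: dict {cell line: set of mutated genes}
    """
    facts = []
    for drug, fragments in sorted(drug_fragments.items()):
        for fragment in sorted(fragments):
            facts.append(
                f"has_fragment({prolog_atom(drug)}, {prolog_atom(fragment)}).")
    for cell, genes in sorted(cell_mutations.items()):
        for gene in sorted(genes):
            facts.append(f"mutated({prolog_atom(cell)}, {prolog_atom(gene)}).")
    return facts


def example_facts(labelled_pairs):
    """Split (cell, drug, label) triples into positive and negative facts."""
    positives, negatives = [], []
    for cell, drug, label in labelled_pairs:
        fact = f"sensitive({prolog_atom(cell)}, {prolog_atom(drug)})."
        (positives if label == "good" else negatives).append(fact)
    return positives, negatives
```

Aleph conventionally reads the background knowledge, positive examples and negative examples from separate files (with extensions `.b`, `.f` and `.n`), so the three lists returned here would each be written to one of them.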

3 Results and Discussion

3.1 Regression Analysis Results

Table 1 summarizes the results of the regression analysis on the original data set using parameter tuning for all algorithms. The results show a small improvement over the original results reported in [4]: parameter tuning reduced the error, although the reduction might be considered fairly low. For the Random Forest we observed an improvement of approximately 2.4% for the non-stringent tests and of 3.6% for the stringent ones. Comparing the parameter-tuned results against those of the Neural Networks, we observed that they became the best results obtained for the non-stringent tests. We obtained an improvement of 1.2% for the non-stringent tests but an increase in error of 1.2% for the stringent ones.

Table 1. Results summary on the original feature set.

Algorithm  Metric  Non stringent   Stringent       Non stringent blind test   Stringent blind test
NN         RMSE    0.82 ± 0.014    0.8 ± 0.013     0.91 ± 0.005               0.9 ± 0.003
NN         RP      0.85 ± 0.003    0.86 ± 0.005    0.8 ± 0.002                0.81 ± 0.001
NN         R2      0.72 ± 0.005    0.74 ± 0.009    0.65 ± 0.003               0.65 ± 0.002
RF         RMSE    0.81 ± 0.015    0.81 ± 0.017    0.9 ± 0.002                0.91 ± 0.003
RF         RP      0.85 ± 0.003    0.85 ± 0.007    0.81 ± 0.001               0.81 ± 0.001
RF         R2      0.73 ± 0.005    0.73 ± 0.012    0.65 ± 0.001               0.65 ± 0.002
SVM        RMSE    0.82 ± 0.016    0.82 ± 0.019    0.89 ± 0.002               0.89 ± 0.005
SVM        RP      0.85 ± 0.004    0.85 ± 0.008    0.81 ± 0.001               0.81 ± 0.002
SVM        R2      0.72 ± 0.006    0.72 ± 0.013    0.66 ± 0.001               0.66 ± 0.004
XGBoost    RMSE    0.83 ± 0.035    0.84 ± 0.032    0.92 ± 0.008               0.84 ± 0.005
XGBoost    RP      0.84 ± 0.010    0.84 ± 0.009    0.8 ± 0.003                0.84 ± 0.002
XGBoost    R2      0.71 ± 0.017    0.70 ± 0.015    0.64 ± 0.005               0.71 ± 0.004


The SVM algorithm proved to be a fairly good alternative to the Neural Networks and Random Forests, since its results show a very close performance. We consider that the only drawback of the algorithm might be its time complexity, since it was the algorithm that consumed the most time in all runs.

Extracting Comprehensible Information from Models

In scientific studies, and especially in the Life Sciences, understandable Machine Learning models may provide information that helps to understand the phenomena that produced the data. After training the RF models we calculated the importance of each feature, which corresponds to the total decrease in node impurity from splitting on each variable, measured in this case (regression) by the residual sum of squares. These values, obtained for each model, were then averaged and sorted. Figure 1 shows the most important features in both the stringent and non-stringent tests. All of the most important features belong to the "compound" set of features, and the two most important are molecule fingerprints.

Fig. 1. Random Forest attribute importance – top 10 results
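The averaging and sorting of feature importances across the cross-validation models can be sketched as below. This is a minimal sketch with hypothetical names: each model is assumed to provide a mapping from feature name to importance, and a feature absent from a model is treated as having importance 0 there.

```python
def rank_features(importances_per_model, top_n=10):
    """Average per-model importances and return the top_n features.

    importances_per_model: list of dicts {feature name: importance},
    one dict per trained model (for example, one per cross-validation fold).
    """
    totals = {}
    for importances in importances_per_model:
        for feature, score in importances.items():
            totals[feature] = totals.get(feature, 0.0) + score
    n_models = len(importances_per_model)
    averaged = {feature: total / n_models for feature, total in totals.items()}
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```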

The Impact of Feature Selection

We additionally tried to further improve the results by using two different feature selection (FS) methods: Pearson Correlation and Mutual Information. For the regression analysis, the best FS results for the set of ML algorithms used are presented in Table 2. The results with FS are worse on the training set runs than those obtained on the original data set, but are better overall on the blind test. Despite the differences, their absolute value is most often less than 5%.
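A filter of the Pearson Correlation kind can be sketched as follows. The source does not state how the features were ranked, so scoring each feature by the absolute value of its correlation with the target and keeping the k best is an assumption, as are the function names and the column-wise input layout.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def select_top_k(columns, target, k):
    """Return the names of the k features most correlated with the target.

    columns: dict {feature name: list of values}, one value per instance.
    Features are ranked by |r| (assumed criterion).
    """
    ranked = sorted(
        columns,
        key=lambda name: abs(pearson(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]
```

With k = 500 this corresponds to the setting "first 500 best attributes" used for the results reported below.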


Table 2. Regression results for molecular descriptors 1D+2D+FP using Pearson Correlation using the first 500 best attributes.

Algorithm  Metric  Non stringent   Stringent       Non stringent blind test   Stringent blind test
RF         RMSE    0.83 ± 0.030    0.83 ± 0.034    0.86 ± 0.001               0.84 ± 0.001
RF         RP      0.85 ± 0.011    0.84 ± 0.007    0.83 ± 0.001               0.84 ± 0.001
RF         R2      0.72 ± 0.018    0.71 ± 0.013    0.68 ± 0.001               0.71 ± 0.001
SVM        RMSE    0.84 ± 0.032    0.85 ± 0.036    0.88 ± 0.001               0.86 ± 0.001
SVM        RP      0.84 ± 0.011    0.83 ± 0.008    0.82 ± 0.001               0.84 ± 0.001
SVM        R2      0.71 ± 0.019    0.70 ± 0.013    0.67 ± 0.001               0.70 ± 0.001
XGBoost    RMSE    0.83 ± 0.030    0.83 ± 0.034    0.86 ± 0.001               0.84 ± 0.001
XGBoost    RP      0.85 ± 0.011    0.84 ± 0.008    0.83 ± 0.001               0.84 ± 0.001
XGBoost    R2      0.72 ± 0.018    0.71 ± 0.013    0.68 ± 0.001               0.71 ± 0.001
ANN        RMSE    0.83 ± 0.031    0.83 ± 0.032    0.86 ± 0.001               0.84 ± 0.003
ANN        RP      0.85 ± 0.011    0.84 ± 0.007    0.83 ± 0.001               0.84 ± 0.001
ANN        R2      0.72 ± 0.018    0.71 ± 0.013    0.68 ± 0.001               0.71 ± 0.001

3.2 Classification

Classifier models were evaluated by calculating the Accuracy, Precision, Recall and F1-score from the confusion matrices. Table 3 shows the results on the original discretised data set. The impact of Feature Selection can be seen in Table 4. The results suggest that feature selection may provide only a small increase in performance.

3.3 Graph Mining

In this section we present the regression analysis results for a data set enriched with the molecule fragments identified as relevant by the gSpan algorithm. Although the results shown in Table 5 are similar to the ones obtained with fingerprints, the number of identified GM fragments is significantly smaller than the number of FPs. As in the previous analysis, the results on the Stringent data sets are slightly worse than those on the Non Stringent data sets. This difference in accuracy may be explained by the fact that, in the Stringent data sets, the train and test sets contain different cell lines, whereas in the Non Stringent data sets the train and test sets may share cell lines. This result was seen in all the algorithms used. The same behavior was noticed when using the blind test.


Table 3. Classification results on the discretised original data set with molecular descriptors 1D+2D+FP

Algorithm  Metric     Non stringent  Stringent  Non stringent blind test  Stringent blind test
RF         Accuracy   0.8923         0.8850     0.8639                    0.8843
RF         Precision  0.8905         0.8845     0.8517                    0.8697
RF         Recall     0.8945         0.8824     0.8794                    0.8697
RF         f1-score   0.8851         0.8925     0.8834                    0.8653
SVM        Accuracy   0.8956         0.8911     0.8860                    0.8939
SVM        Precision  0.8874         0.8861     0.8737                    0.9048
SVM        Recall     0.9062         0.8948     0.9013                    0.8868
SVM        f1-score   0.8965         0.8901     0.8871                    0.8954
XGBoost    Accuracy   0.9059         0.8974     0.8756                    0.8936
XGBoost    Precision  0.9064         0.9023     0.8720                    0.9140
XGBoost    Recall     0.9052         0.8883     0.8790                    0.8747
XGBoost    f1-score   0.9058         0.8953     0.8755                    0.8939
ANN        Accuracy   0.8968         0.8902     0.8705                    0.8821
ANN        Precision  0.8968         0.8982     0.8653                    0.9055
ANN        Recall     0.8969         0.8770     0.8761                    0.8598
ANN        f1-score   0.8968         0.8874     0.8706                    0.8820

Table 4. Classification results using Mutual Information Gain as FS algorithm using the best 500 attributes and molecular descriptors 1D+2D+FP

Algorithm  Metric     Non stringent  Stringent  Non stringent blind test  Stringent blind test
RF         Accuracy   0.8939         0.8925     0.8850                    0.8919
RF         Precision  0.8842         0.8910     0.8711                    0.9075
RF         Recall     0.9064         0.8915     0.9024                    0.8787
RF         f1-score   0.8951         0.8912     0.8864                    0.8928
SVM        Accuracy   0.8882         0.8799     0.8798                    0.8824
SVM        Precision  0.8875         0.8853     0.8720                    0.9054
SVM        Recall     0.8891         0.8694     0.8888                    0.8605
SVM        f1-score   0.8880         0.8771     0.8801                    0.8822
XGBoost    Accuracy   0.8966         0.8930     0.8866                    0.8931
XGBoost    Precision  0.8847         0.8920     0.8717                    0.9080
XGBoost    Recall     0.9121         0.8915     0.9052                    0.8808
XGBoost    f1-score   0.8981         0.8916     0.8881                    0.8941
ANN        Accuracy   0.8936         0.8920     0.8843                    0.8927
ANN        Precision  0.8849         0.8907     0.8702                    0.9075
ANN        Recall     0.9051         0.8908     0.9021                    0.8808
ANN        f1-score   0.8948         0.8906     0.8858                    0.8938

P. Ferreira et al.

Table 5. Regression results using GM fragments and 1D+2D descriptors

Algorithm  Metric  Non stringent  Stringent     Non stringent blind test  Stringent blind test
RF         RMSE    0.84 ± 0.034   0.89 ± 0.023  0.95 ± 0.003              0.86 ± 0.002
RF         RP      0.84 ± 0.011   0.82 ± 0.011  0.79 ± 0.002              0.83 ± 0.001
RF         R2      0.70 ± 0.018   0.67 ± 0.017  0.62 ± 0.002              0.69 ± 0.002
XGBoost    RMSE    0.83 ± 0.032   0.86 ± 0.032  0.92 ± 0.002              0.83 ± 0.004
XGBoost    RP      0.85 ± 0.009   0.83 ± 0.012  0.80 ± 0.001              0.84 ± 0.002
XGBoost    R2      0.71 ± 0.015   0.69 ± 0.019  0.64 ± 0.001              0.71 ± 0.003

3.4 Classification Analysis Results Using ILP

For the ILP experiments, we used both data sets from the first division obtained with the non-stringent splitting type.

Table 6. ILP non-stringent results

Some of the rules that were created by the ILP algorithm from both data sets are shown in Fig. 2.

Fig. 2. Some rules found by the ILP system Aleph

We carried out only a very limited number of ILP experiments, because ILP requires very long run times. Although the accuracy results were worse than those of the propositional learners (see Table 6), ILP found a set of very interesting simple rules with high coverage. These rules, together with this line of investigation, could be very rewarding to biochemistry experts. For each example, an ILP system searches a generalization lattice that is determined by the number of background predicates and their degree of non-determinacy. For the current domain, the branching factor and the number of nodes of the search space are very large, and very large computational power is required to search a significant amount of the space to get good clauses. In this study we had limited computational power and were not able to run extensive searches in each example-based search space.

4 Conclusions

The ability to accurately predict in silico the sensitivity of a specific cancer cell to a given pharmaceutical is a subject of major importance, not only for its immense potential in decreasing the repetitive, time-consuming and expensive work done in vitro, but also as a major step towards personalized treatment. Our study aimed at taking a step towards exploring new methodologies to increase performance and understandability, which raises an opportunity for improvement in treatment. In summary we have shown that: i) the original work by [4] can be improved; ii) we can extract valuable and comprehensible information from traditional ML algorithms like Random Forest; iii) the use of Graph Mining as feature construction can identify more effectively the relevant fragments that discriminate the good and bad classes, with the extra advantage of reducing the number of attributes (fingerprint attributes); and iv) ILP is useful in the task of building comprehensible models that easily integrate expert domain knowledge.

Acknowledgements. The authors would like to acknowledge the Mestrado Integrado em Engenharia Informática e Computação (MIEIC) at Faculdade de Engenharia da Universidade do Porto (FEUP). This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020.

References

1. Mathers, C., Fat, D.M., Boerma, J.: The global burden of disease: 2004 update. World Health Organization (2008)
2. Ferlay, J., et al.: V1.0, Cancer Incidence and Mortality Worldwide: IARC CancerBase No 11 (2013)
3. Chaudry, A.: Cell culture. The Science Creative Quarterly, August 2004
4. Menden, M.P., et al.: Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS ONE 8(4), e61318 (2013)
5. Garnett, M.J., et al.: Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483(7391), 570–575 (2012)
6. Yan, X., Han, J.: gSpan: graph-based substructure pattern mining. In: 2002 IEEE International Conference on Data Mining, 2002. ICDM 2003. Proceedings, pp. 721–724. IEEE (2002)
7. Srinivasan, A.: The Aleph Manual (2003). http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph
8. Sharma, S.V., Haber, D.A., Settleman, J.: Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat. Rev. Cancer 10, 241–253 (2010)
9. Allen, D.D., Caviedes, R., Cárdenas, A.M., Shimahara, T., Segura-Aguilar, J., Caviedes, P.A.: Cell lines as in vitro models for drug screening and toxicity studies. Drug Dev. Ind. Pharm. 31(8), 757–768 (2005)

Deep Neural Network to Curate LTR Retrotransposon Libraries from Plant Genomes

Simon Orozco-Arias1,2(B), Mariana S. Candamil-Cortes1, Paula A. Jaimes1, Estiven Valencia-Castrillon1, Reinel Tabares-Soto3, Romain Guyot3,4, and Gustavo Isaza2

1 Department of Computer Science, Universidad Autónoma de Manizales, Manizales, Colombia
[email protected], {mariana.candamilc,paula.jaimesb,estiven.valenciac}@autonoma.edu.co
2 Department of Systems and Informatics, Universidad de Caldas, Manizales, Colombia
[email protected]
3 Department of Electronics and Automation, Universidad Autónoma de Manizales, Manizales, Colombia
[email protected]
4 Institut de Recherche Pour Le Développement, CIRAD, Univ. Montpellier, Montpellier, France
[email protected]

Abstract. Transposable elements are mobile sequences in all eukaryotic genomes. LTR (Long Terminal Repeat) retrotransposons are the most abundant elements in plant genomes, where they play a fundamental role in evolution, gene function and genetic diversity. It is therefore important to develop bioinformatic tools to identify them in sequenced genomes and to classify them, taking into account that over time these elements may undergo deletions, insertions or recombination, generating incomplete and inactive elements, which are no longer considered a valid reference for identification and classification studies. LTR retrotransposons play fundamental roles in evolution and genetic diversity, hence the importance of understanding their function and studying in depth the variations that they may present. With the increase of whole genome sequencing, it is necessary to automate the analysis process, reduce the execution time and develop more advanced tools. Here, we propose an automatic curator of plant LTR retrotransposon libraries, based on Deep Learning (DL), which obtained an F1-score of 91.18% on the test dataset. Generalization tests using four different genomes were performed, obtaining the best results for Oryza granulata, with an F1-score of 93.6% and an execution time of 22.61 seconds for the prediction by the neural network, using LTR retrotransposons obtained with the LTR_STRUC software. Taking into account that the conventional bioinformatics methods require approximately six hours to curate the same genome, we conclude that our proposed method is efficient and can speed up the curation of LTR retrotransposon libraries of plant genomes published in massive sequencing projects.

Keywords: LTR retrotransposons · Curation · Nesting insertions · Bioinformatics · Machine learning · Deep neural networks · k-mer-based methods

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 85–94, 2022. https://doi.org/10.1007/978-3-030-86258-9_9


1 Introduction

Transposable elements (TEs) are mobile genetic structures, discovered by Barbara McClintock in corn [1]. These elements have the ability to integrate into new positions in genomes and sometimes increase their copy number over time [2]. The amount of DNA in the nucleus (also called the DNA C-value) can vary due to polyploidy events, segmental duplications and massive amplification of TEs [3]. Based on their transposition mechanism, TEs can be divided into Class I (or Retrotransposons), which use an RNA molecule as the replication intermediate, and Class II (or Transposons), which use a DNA molecule as the intermediate [4]. Retrotransposons (RTs) are the most abundant elements in eukaryotic genomes. Among these, Long Terminal Repeat Retrotransposons (LTR-RTs) surpass all other TE orders in plants in copy number and diversity [5, 6], constituting up to 75% of nuclear DNA [7], and even 80% in angiosperms [3].

Considering that TE activity can induce mutational effects, the host genome has developed processes to silence TEs or interrupt their life cycle [8]. Annotation of plant genomes has shown the presence of numerous insertions of transposable elements into other TEs, called nested insertions [9]. While multiple insertions of TEs can have a profound impact on the structure of the inserted element, insertions of smaller and non-coding elements can go unnoticed by structural TE detection algorithms. As a consequence, these algorithms can predict a complete element whose sequence in fact contains nested insertions of other types of TEs. To study nested elements and eventually curate them, a variety of bioinformatics techniques have been developed that detect nested structures present in the genome, such as TE-greedy-nester [9] and TEnest [10]. However, these tools require complex installation processes, with prerequisites such as BLAST [11], LTR_FINDER [12] and GenomeTools [13], and executing them involves extensive manual work. Other TE detection software, such as EDTA [14], has a TE library filter module, but it requires a significant execution time.

Currently, several Machine Learning (ML) algorithms have been implemented to solve problems in biology and genomics [15]. One of the most widely used ML techniques is Deep Learning (DL), which uses nonparametric models based on neural networks (NN) to fit complex associations between input and output data [16]. To assess the effectiveness of a model, various metrics are used, such as recall (sensitivity) [17], F1-score [17], accuracy [18], precision [17], specificity [19] and the ROC curve [20], in order to guarantee good generalization of the model to subsequent data.

In the present study, an automatic curator of LTR-RTs in plants has been developed using DL techniques, accelerating the execution time by using a NN and graphics processing units (GPUs) [21]. The curator identifies complete LTR-RT sequences that cannot be considered intact elements, because they present nested insertions or their overall length does not match the one stipulated in the literature for each lineage/family. This curator can be attached as an additional module to existing LTR-RT predictors to filter the library produced and increase its quality.


2 Materials and Methods

2.1 Creation of Plant LTR-RTs Sequence Dataset

To create the dataset, public databases such as Repbase [22], RepetDB [23] and PGSB [24] were used. Additionally, LTR_STRUC [25] and EDTA [14] were used to obtain LTR-RT sequences from plant species that were not available in public databases. Then, the same filters as in InpactorDB [26] were applied. These filters included: a) a nested LTR-RT element inside the predicted LTR-RT element, with domains belonging to different superfamilies (i.e. Gypsy versus Copia), b) a nested LTR-RT element inside the predicted LTR-RT element, with domains belonging to different lineages/families, c) predicted LTR-RT elements showing a length increase compared to the length reported in the literature [5], with a tolerance of 20%, and d) predicted LTR-RT elements showing Class II (Transposons) insertions. Sequences that successfully passed all InpactorDB filters were kept as putative complete and intact LTR-RT elements (named class 0). Elements that were eliminated by any of the filters were taken as non-intact sequences (named class 1).

2.2 Experimental Analysis Using ML Models

In order to use biological data with ML-based models, feature extraction was performed. k-mer frequencies were counted in each sequence [20] with a maximum k-mer length of six nucleotides, which provides the right amount of information without generating an excessive computational cost [27]. Likewise, the preprocessing techniques of scaling and dimension reduction using PCA were applied with an explained variance of 96% (reducing the initial number of features from 5,460 to 2,254). We used the F1-score metric in the training process because the dataset is unbalanced, so this metric contributes significantly to the knowledge of the behavior and generalization of the model [20]. Finally, the accuracy, precision and recall of the model were calculated.
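The k-mer feature extraction described above can be sketched as follows. This is a minimal illustration, assuming that all k-mers with k from 1 to 6 over the alphabet A, C, G, T are counted with overlapping windows; this assumption is consistent with the reported 5,460 initial features, since 4 + 16 + 64 + 256 + 1,024 + 4,096 = 5,460. The function names, the toy sequence and the choice of raw counts rather than normalized frequencies are assumptions made for this example.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_vocabulary(max_k=6):
    """All k-mers over ACGT for k = 1..max_k, in a fixed order."""
    return ["".join(p)
            for k in range(1, max_k + 1)
            for p in product(ALPHABET, repeat=k)]


def kmer_frequencies(sequence, max_k=6):
    """Count overlapping k-mers (k = 1..max_k) in one DNA sequence."""
    sequence = sequence.upper()
    counts = dict.fromkeys(kmer_vocabulary(max_k), 0)
    for k in range(1, max_k + 1):
        for i in range(len(sequence) - k + 1):
            kmer = sequence[i:i + k]
            # Windows containing N or other ambiguity codes are skipped.
            if kmer in counts:
                counts[kmer] += 1
    return counts


# Toy sequence (hypothetical), including one ambiguous base.
features = kmer_frequencies("ACGTACGTNACGT")
```

Each sequence is thus mapped to a fixed-length vector of 5,460 values, which can then be scaled and reduced with PCA as described in the text.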
The data were partitioned into 80% for training, 10% for validation (validation dataset) and 10% for testing (test dataset). All experiments were performed using Python 3.7 with the Scikit-Learn library version 0.24.0 and TensorFlow version 2.2.0.

2.3 Experiments Using Deep Neural Networks

Experiments were carried out using different NN architectures, varying the number of layers and the number of neurons in each layer. The architecture developed by [28] was used as the initial model; it consists of three layers, with ReLU as the activation function in the hidden layers and a dropout of 0.5 in each layer. These hyper-parameters were used in our NN, with the addition of Batch Normalization with a momentum of 0.99. Adam optimization was used to find a good configuration of the NN and, for the prediction, a softmax activation function was tested. Experiments using Categorical Crossentropy were carried out for the loss function. For the training process, 200 epochs and a batch_size of 128 were used (Fig. 1).
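As a small illustration of the output layer and loss mentioned above, the following sketch computes a softmax over two class scores and the corresponding categorical cross-entropy in plain Python. It is not the authors' TensorFlow implementation; the function names and the example scores are assumptions chosen for this example, with class 0 standing for intact and class 1 for non-intact elements.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, probabilities))


# Hypothetical scores for one sequence: class 0 (intact) vs class 1 (non-intact).
probs = softmax([2.0, 0.5])
loss = categorical_crossentropy([1, 0], probs)
```

The loss is small when the network assigns a high probability to the true class and grows as that probability approaches zero, which is the quantity the optimizer minimizes during training.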


Fig. 1. FNN architecture based on Nakano et al. [28].

2.4 Generalization Tests

Once the computational model was defined, generalization tests were performed with genomes of different sizes: Coffea eugenioides (GCA_003713205.1, 678 Mb), Coffea humblotiana (407 Mb) [29], Oryza indica (GCA_011764405.2, 355 Mb) [30] and Oryza granulata (GCA_003991445.1, 752 Mb) [31]. The LTR_STRUC [25] and LTR_FINDER [12] software were first run to predict LTR-RTs for each genome. Two workflows were then performed, the first using conventional bioinformatics methods with the filters proposed in InpactorDB [26], and the second using the proposed FNN architecture. For each workflow, the accuracy percentages and the execution time were recorded, including the time of the pre-processing techniques required in each case.

2.5 Hardware Specifications

All the analyses in this project were performed using the HPC clusters of the French Bioinformatics Institute (https://www.france-bioinformatique.fr) and IRD (https://bioinfo.ird.fr/), managed by Slurm and used with a set value of 20 cores. For the DL experiments, the Google Colaboratory platform [32] was used, which provides an NVIDIA T4 GPU and 16 GB of RAM.

3 Results

3.1 Descriptive Analysis of the Dataset

Once the filters were applied to the dataset, 56,442 sequences were obtained for class 0 (curated sequences) and 49,215 for class 1 (elements considered non-intact).

3.2 Design of a Deep Neural Network Based Model to Detect Non-intact LTR Retrotransposon Sequences

With the generated dataset, experiments were performed using a NN based on the FNN proposed in [28] (Fig. 1). Figure 2 shows the training curves, which reach a performance of 91.75% (test dataset) for the F1-score metric and a loss lower than 0.6, and Table 1 shows different performance metrics. Additionally, the Receiver Operating Characteristic (ROC) curve and the Precision-Recall Curve (PRC) are presented in Fig. 3. In order to test the generalization, a k-fold cross-validation was performed with k = 3, giving an F1-score of 90.10% with a standard deviation of 0.84.
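The cross-validation step can be sketched as follows. This is a minimal illustration of how k = 3 folds partition a dataset and how the per-fold scores are summarized by their mean and standard deviation. The fold construction, the function name and the per-fold scores are placeholders chosen for this example; they are not the values or the code of the study, and in practice the samples would usually be shuffled (and possibly stratified by class) before folding.

```python
from statistics import mean, stdev


def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=3))

# Hypothetical per-fold F1-scores, used only to show the summary step.
fold_scores = [0.90, 0.91, 0.89]
summary = (mean(fold_scores), stdev(fold_scores))
```

Every sample is used for validation exactly once, so the mean score and its spread give a less split-dependent estimate than a single train/test partition.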

Fig. 2. Training curves with FNN implemented for binary classification (A) F1-score vs epochs (B) Loss vs epochs.

Table 1. Results for each metric.

Metric                                          Value
Precision                                       0.9140
F1-score                                        0.9121
Recall                                          0.9125
Accuracy                                        0.9125
Area under ROC curve (AUC)                      0.963
Area under the precision recall Curve (auPRC)   0.966
False positive rate                             0.0355
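For reference, the metrics in Table 1 follow from the entries of a confusion matrix. The sketch below shows the standard definitions for a single positive class. The counts are hypothetical and are not those of the study, and the source does not state how per-class values were averaged, so this is an illustration of the formulas rather than a reproduction of the reported numbers.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "false_positive_rate": fp / (fp + tn),
    }


# Hypothetical confusion-matrix counts, for illustration only.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
```

Because the F1-score is the harmonic mean of precision and recall, it stays informative when the two classes are unbalanced, which is the reason given in Sect. 2.2 for monitoring it during training.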

3.3 Test for the Generalization of the Implemented Model

In order to analyze the performance of the implemented computational model, LTR retrotransposons were predicted in four different plant species: Oryza indica, Oryza granulata, Coffea eugenioides and Coffea humblotiana, using LTR_FINDER and LTR_STRUC software (Table 2). The datasets created for each genome were further processed by two workflows. The first one was based on conventional bioinformatics tools, using the filters implemented in InpactorDB, while the second one used the model described above, including the pre-processing techniques necessary for the DNA sequences to be processed correctly. Table 3 shows the number of elements detected by the filters and predicted by the FNN.

Fig. 3. (A) ROC and (B) PRC curves for the test dataset.

Table 2. Number of predicted LTR-RTs with LTR_FINDER and LTR_STRUC software.

Genome              Genome size  Accession number  LTR-RTs (LTR_FINDER)  LTR-RTs (LTR_STRUC)
Oryza indica        355 Mb       GCA_011764405.2   923                   854
Oryza granulata     752 Mb       GCA_003991445.1   8,597                 5,734
Coffea eugenioides  678 Mb       GCA_003713205.1   6,872                 3,590
Coffea humblotiana  407 Mb       [29]              2,659                 2,533

Table 3. Number of sequences obtained by executing each of the methods for LTR_STRUC and LTR_FINDER datasets.

Species             LTR_STRUC:                LTR_STRUC:           LTR_FINDER:               LTR_FINDER:
                    Bioinformatics            Computational model  Bioinformatics            Computational model
                    conventional method                            conventional method
Oryza indica        474                       404                  474                       396
Oryza granulata     3,266                     3,148                4,777                     4,700
Coffea eugenioides  2,436                     2,263                4,596                     4,090
Coffea humblotiana  1,630                     1,474                1,721                     1,496


The best results in the generalization tests were obtained for Oryza granulata, with an F1-score of 0.93, an accuracy of 0.92 and a precision of 0.95. Analyzing the sequences obtained with both techniques, we found that 2,302 sequences were correctly classified as intact LTRs by the FNN. Additionally, the FNN model took 22.61 seconds to run the predictions, whereas the conventional bioinformatic workflow required about six hours.

4 Discussion

Taking into account the exponential growth in the number of available complete genomes for complex organisms [33], a new generation of algorithms is necessary to process this formidable amount of data. In particular, it is relevant to implement machine learning based algorithms that can automate and optimize numerous bioinformatic processes. Such approaches are particularly relevant for the identification, classification and subsequent analysis of LTR retrotransposons in plant species [34]. In fact, this order of transposable elements represents the majority of repeated sequences in plant genomes and can make up more than 50% of the genome size. The propensity of these elements to increase their copy number in genomes is directly related to their mode of transposition, which uses messenger RNA in the replication mechanism [35]. Their number can be so large that they accumulate and insert themselves into each other, creating nested structures that are particularly difficult to identify and annotate in genome sequences [36].

Most of the tools for LTR-RT identification and reference library creation do not include automatic curation tools for these insertions. This leaves users with a long manual curation process to identify and remove nested insertions from LTR-RT reference sequences, which indicates the importance of implementing a novel tool for the automatic curation of LTR-RT reference sequence libraries. Currently, the most common technique to perform this curation process is the sequence homology approach, a strategy used in conventional bioinformatics in which the initial data are compared to references (proteins, domains, nucleotides) available in databases such as REXDB [5]. However, it has disadvantages: it requires a lot of manual work and the execution time is quite long.

For this reason, despite the existence of software for the detection of nested structures [9, 10, 37] and of strategies such as EDTA [14] to create libraries of good quality, we propose a new strategy based on deep learning to identify and filter out nested sequences that could represent "low quality" sequences in LTR-RT reference libraries. Thus, a model was implemented that achieved an F1-score of 91.18%, with values for precision, accuracy and recall of 91.40%, 91.25% and 91.25% respectively; these percentages were obtained with a computational tool that runs in seconds. It is emphasized that the non-intact sequences are removed from the final dataset and stored in new files, because these sequences are of great relevance for future studies that observe the divergence and establish an evolutionary scale of the analyzed species.

Finally, generalization tests were performed with four plant genomes: C. eugenioides, C. humblotiana, O. indica and O. granulata. A percentage higher than 85% was obtained in all cases. Interestingly, the execution time is greatly reduced with the implemented neural networks when compared to conventional methods.


However, the results obtained for the LTR_FINDER dataset range between 54.8% and 59.2% for the F1-score. These significantly lower percentages, compared to LTR_STRUC, can be attributed to a higher rate of false positives. Altogether, our results indicate that the FNN method is relevant for the curation of LTR-RT sequences, reducing the execution time for the creation of better-quality reference libraries for plant genomes.

5 Conclusion

The model based on DL techniques proposed here contributes greatly to the improvement of current identification and classification processes, because it decreases the execution time while maintaining a good detection performance. This model was implemented for the automatic curation of LTR-RT libraries in plants, with an F1-score of 91.18%. Similarly, the generalization tests show that the proposed architecture can be used for species with different genome sizes; the best result was obtained for Oryza granulata, with an F1-score of 93.6% in approximately four minutes, showing correct performance and adequate generalization while reducing the execution time.

References

1. Ravindran, S.: Barbara McClintock and the discovery of jumping genes. Proc. Natl. Acad. Sci. U S A 109, 20198–20199 (2012). https://doi.org/10.1073/pnas.1219372109
2. Lisch, D.: How important are transposons for plant evolution? Nat. Rev. Genet. 14, 49–61 (2013). https://doi.org/10.1038/nrg3374
3. Bennetzen, J.L.: Transposable elements, gene creation and genome rearrangement in flowering plants. Curr. Opin. Genet. Dev. 15, 621–627 (2005). https://doi.org/10.1016/j.gde.2005.09.010
4. Wicker, T., Sabot, F., Hua-Van, A., et al.: A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982 (2007). https://doi.org/10.1038/nrg2165
5. Neumann, P., Novák, P., Hoštáková, N., Macas, J.: Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mob. DNA 10, 1 (2019)
6. Orozco-Arias, S., Isaza, G., Guyot, R., Tabares-Soto, R.: A systematic review of the application of machine learning in the detection and classification of transposable elements. PeerJ 7, e8311 (2019). https://doi.org/10.7717/peerj.8311
7. Baucom, R.S., Estill, J.C., Chaparro, C., et al.: Exceptional diversity, non-random distribution, and rapid evolution of retroelements in the B73 maize genome. PLoS Genet. 5 (2009). https://doi.org/10.1371/journal.pgen.1000732
8. Esposito, S., Barteri, F., Casacuberta, J., Mirouze, M., Carputo, D., Aversano, R.: LTR-TEs abundance, timing and mobility in Solanum commersonii and S. tuberosum genomes following cold-stress conditions. Planta 250(5), 1781–1787 (2019). https://doi.org/10.1007/s00425-019-03283-3
9. Lexa, M., Jedlicka, P., Vanat, I., et al.: TE-greedy-nester: structure-based detection of LTR retrotransposons and their nesting. Bioinformatics 36, 4991–4999 (2021). https://doi.org/10.1093/bioinformatics/btaa632
10. Kronmiller, B.A., Wise, R.P.: TEnest: automated chronological annotation and visualization of nested plant transposable elements. Plant Physiol. 146, 45–59 (2008). https://doi.org/10.1104/pp.107.110353
11. McGinnis, S., Madden, T.L.: BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 32, 20–25 (2004). https://doi.org/10.1093/nar/gkh435
12. Xu, Z., Wang, H.: LTR-FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res. 35, 265–268 (2007). https://doi.org/10.1093/nar/gkm286
13. Gremme, G., Steinbiss, S., Kurtz, S.: Genome tools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Trans. Comput. Biol. Bioinforma. 10, 645–656 (2013). https://doi.org/10.1109/TCBB.2013.68
14. Ou, S., Su, W., Liao, Y., et al.: Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019). https://doi.org/10.1186/s13059-019-1905-y
15. Larrañaga, P., Calvo, B., Santana, R., et al.: Machine learning in bioinformatics. Brief Bioinform. 7, 86–112 (2006). https://doi.org/10.1093/bib/bbk007
16. Montesinos-López, O.A., Montesinos-López, A., Pérez-Rodríguez, P., et al.: A review of deep learning applications for genomic selection. BMC Genom. 22, 1–23 (2021). https://doi.org/10.1186/s12864-020-07319-x
17. Schietgat, L., Vens, C., Cerri, R., et al.: A machine learning based framework to identify and classify long terminal repeat retrotransposons. PLoS Comput. Biol. 14, e1006097 (2018). https://doi.org/10.1371/journal.pcbi.1006097
18. Loureiro, T., Camacho, R., Vieira, J., Fonseca, N.A.: Improving the performance of transposable elements detection tools. J. Integr. Bioinform. 10, 231 (2013). https://doi.org/10.2390/biecoll-jib-2013-231
19. Douville, C., Springer, S., Kinde, I., et al.: Detection of aneuploidy in patients with cancer through amplification of long interspersed nucleotide elements (LINEs). Proc. Natl. Acad. Sci. U S A 115, 1871–1876 (2018). https://doi.org/10.1073/pnas.1717846115
20. Orozco-Arias, S., Piña, J.S., Tabares-Soto, R., et al.: Measuring performance metrics of machine learning algorithms for detecting and classifying transposable elements. Processes 8, 1–20 (2020). https://doi.org/10.3390/pr8060638
21. Huynh, L.N., Balan, R.K., Lee, Y.: DeepSense: A GPU-based deep convolutional neural network framework on commodity mobile devices. In: Proceedings of the 26th International Conference on World Wide Web, pp. 351–360 (2016). https://doi.org/10.1145/3038912.3052577
22. Bao, W., Kojima, K.K., Kohany, O.: Repbase update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 4–9 (2015). https://doi.org/10.1186/s13100-015-0041-9
23. Amselem, J., Cornut, G., Choisne, N., et al.: RepetDB: a unified resource for transposable element references. Mob. DNA 10, 4–11 (2019). https://doi.org/10.1186/s13100-019-0150-y
24. Spannagl, M., Nussbaumer, T., Bader, K.C., et al.: PGSB plantsDB: updates to the database framework for comparative plant genome research. Nucleic Acids Res. 44, D1141–D1147 (2016). https://doi.org/10.1093/nar/gkv1130
25. McCarthy, E.M., McDonald, J.F.: LTR STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19, 362–367 (2003). https://doi.org/10.1093/bioinformatics/btf878
26. Orozco-Arias, S., Jaimes, P.A., Candamil, M.S., et al.: InpactorDB: a classified lineage-level plant LTR retrotransposon reference library for free-alignment methods based on machine learning. MDPI Genes 12, 17 (2021). https://doi.org/10.3390/genes12020190
27. Orozco-Arias, S., Candamil-Cortés, M.S., Jaimes, P.A., et al.: K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes. PeerJ 9, e11456 (2021). https://doi.org/10.7717/peerj.11456
28. Nakano, F.K., Mastelini, S.M., Barbon, S., Cerri, R.: Improving hierarchical classification of transposable elements using deep neural networks. In: Proceedings of the International Joint Conference on Neural Networks. IEEE, Rio de Janeiro, Brazil (2018)
29. Raharimalala, N., Rombauts, S., McCarthy, A., et al.: The absence of the caffeine synthase gene is involved in the naturally decaffeinated status of Coffea humblotiana, a wild species from Comoro archipelago. Sci. Rep. 11, 1–14 (2021). https://doi.org/10.1038/s41598-021-87419-0
30. Datta, K., Datta, S.K.: Indica Rice (Oryza sativa, BR29 and IR64). In: Wang, K. (ed.) Agrobacterium Protocols. Methods in Molecular Biology, vol. 343. Humana Press (2006). https://doi.org/10.1385/1-59745-130-4:201
31. Shi, C., Li, W., Zhang, Q.J., et al.: The draft genome sequence of an upland wild rice species, Oryza granulata. Sci. Data 7, 1–12 (2020). https://doi.org/10.1038/s41597-020-0470-2
32. Bisong, E.: Google Colaboratory. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners (2019)
33. Buermans, H.P.J., Den Dunnen, J.T.: Next generation sequencing technology: advances and applications. Biochim. Biophys. Acta 1842, 1932–1941 (2014). https://doi.org/10.1016/j.bbadis.2014.06.015
34. Yan, H., Bombarely, A., Li, S.: Deep TE: a computational method for de novo classification of transposons with convolutional neural network. Bioinformatics 36, 4269–4275 (2020)
35. Kumar, A., Bennetzen, J.L.: Plant retrotransposons. Annu. Rev. Genet. 33, 479–532 (1999)
36. Gao, C., Xiao, M., Ren, X., et al.: Characterization and functional annotation of nested transposable elements in eukaryotic genomes. Genomics 100, 222–230 (2012). https://doi.org/10.1016/j.ygeno.2012.07.004
37. Zeng, F.-C., Zhao, Y.-J., Zhang, Q.-J., Gao, L.-Z.: LTRtype, an efficient tool to characterize structurally complex LTR retrotransposons and nested insertions on genomes. Front. Plant. Sci. 8, 402 (2017). https://doi.org/10.3389/fpls.2017.00402

A Hybrid of Bees Algorithm and Regulatory On/Off Minimization for Optimizing Lactate Production Mohd Izzat Yong1 , Mohd Saberi Mohamad2(B) , Yee Wen Choon3,4 , Weng Howe Chan1 , Hasyiya Karimah Adli3,4 , Khairul Nizar Syazwan WSW3,4 , Nooraini Yusoff3,4 , and Muhammad Akmal Remli3,4 1 Artificial Intelligence and Bioinformatics Group, School of Computing,

Faculty of Engineering, Universiti Teknologi Malaysia, 81310 Skudai, Johor, Malaysia [email protected] 2 Department of Genetics and Genomics, College of Medical and Health Sciences, United Arab Emirates University, P.O. Box 17666, Al Ain, Abu Dhabi, United Arab Emirates [email protected] 3 Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu 16100, Kelantan, Malaysia {hasyiya,nizar.w,akmal}@umk.edu.my 4 Department of Data Science, Universiti Malaysia Kelantan, Kota Bharu 16100, Kelantan, Malaysia

Abstract. Metabolic engineering has grown dramatically and is now widely used, particularly in the production of biomass using microorganisms. Metabolic network models have been used extensively in computational procedures developed to optimise metabolic production and to suggest modifications in organisms. A known problem is the unrealistic flux distributions suggested by previous work on rational modelling frameworks such as OptKnock and OptGene. To address this issue, a hybrid of the Bees Algorithm and Regulatory On/Off Minimization (BAROOM) is introduced. Using Escherichia coli (E. coli) as the model organism, BAROOM determines the optimal set of genes to knock out in order to improve lactate production. The results show that BAROOM performs better than other methods in increasing lactate production in the model organism by identifying the optimal set of genes to be knocked out.

Keywords: Metabolic engineering · Bioinformatics · Artificial Intelligence · OptKnock · OptGene · Gene knockout · Modelling · Optimization

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 95–104, 2022. https://doi.org/10.1007/978-3-030-86258-9_10

1 Introduction

Gene knockout is a common genetic engineering technique for deleting genes in order to understand their impact on organisms. Multiple genes can be knocked out at the same time, as in double or triple knockouts, which inactivate two or three genes simultaneously. Escherichia coli (E. coli) K-12 is a well-characterized mutant strain used in analysing unknown gene functions and gene regulatory networks with the gene knockout method. It has also been used in testing mutational effects in the parent strain E. coli K-12 BW25113 [1].

Although OptKnock, OptGene, Simulated Annealing (SA) and Set-based Evolutionary Algorithms (SEAs) have previously proved their ability to find knockout genes that boost metabolite production, their robustness in global, local, multivariable, and multimodal function optimization is lacking. As a result, other methods are required to address these issues [2–4]. The Bees Algorithm (BA) [5] is a swarm intelligence-based algorithm that searches for optimal solutions by mimicking the food-foraging behaviour of honeybees, and it has compared favourably with existing algorithms. BA combines neighbourhood search with random search to solve an optimization problem and is commonly used in combinatorial and functional optimization. Nonetheless, while BA has proved effective in handling optimization problems such as controller formation, image analysis, and multi-objective optimization, its reliance on random search leaves it considerably weak in local search.

Choon et al. [6] demonstrated that a hybrid of BA and flux balance analysis (FBA), namely BAFBA, is superior in predicting the optimal genes to knock out in order to improve growth and production yield. The improved performance can be attributed to BAFBA's ability to find the most promising solutions and to search selectively for the global maximum of the objective function by exploring neighbourhoods. Other hybrids, such as Bees Hill Flux Balance Analysis, which incorporates the hill climbing algorithm, further demonstrated that hybridization can improve the performance of BA [7].

Regulatory On/Off Minimization (ROOM) is an attractive candidate for constructing a new BA hybrid. Previous work demonstrated that ROOM outperformed FBA and minimization of metabolic adjustment (MOMA) when predicting the fluxes of the final metabolic steady state [8]. Specifically, ROOM differs from MOMA in that it 1) minimizes the total number of significant flux changes from the wild-type strain, 2) searches for a flux distribution that fulfils the stoichiometric (mass balance), thermodynamic, and flux capacity constraints of a mutant, and 3) correlates better with experimental data than the predictions of both FBA and MOMA. Hence, BAROOM is proposed to identify gene knockouts that improve the production of chemicals of interest.

2 A Hybrid of Bees Algorithm and Regulatory On/Off Minimization

2.1 Bee Representation of Metabolic Genotype

The food-foraging behaviour of bees starts with scout bees being sent out to find promising sites. Scout bees move randomly while searching and return to the hive when the search is done. Scout bees that have found a site rated above a particular threshold then perform a "waggle dance". This dance conveys three pieces of information about the site: a) the direction in which it is situated, b) its distance from the hive, and c) its quality. The information helps the colony to evaluate the quality of the food and the amount of energy needed to harvest it. After that, more bees are sent to follow the scout bees directly to the most promising sites. This paper proposes BAROOM to identify the knockout list. Figure 1 illustrates the flow of the conventional BA, whilst Fig. 2 shows the BAROOM flowchart. The flow details are described in the following subsections.


Fig. 1. Flowchart of a basic bees algorithm.

Fig. 2. Flowchart of the hybrid BAROOM algorithm.

2.2 Initialization of the Population

This phase initializes the population randomly by declaring the absence or presence of each reaction, which serves as the input to the next phase. First, the parameters are set, as this is essential in BA. The population size, the number of scout bees, and the number of sites selected out of the visited sites are declared as dimension, n, and m, and are set to 2000, 30, and 20, respectively. The number of best sites out of the m selected sites, the number of bees recruited for the best e sites, the number of bees recruited for the other m−e selected sites, and the initial patch size are declared as e, nep, nsp, and ngh, and are set to 2, 4, 8, and 1, respectively. Then, the population is initialized based on the length of the reaction list of the model, which is 1532 reactions after pre-processing, and the population must be greater than the number of reactions. Hence, a population matrix of 1532 × 2000 is randomly generated with the values 0 and 1, which represent the absence and the presence of a reaction.

2.3 Scoring Fitness of Individuals

The fitness of each site visited by a bee is calculated through ROOM, using the ROOM.m function in the Cobra toolbox. ROOM assumes that the organism adapts by minimizing the set of regulatory changes. Hence, ROOM looks for a point in flux space that is close to the wild-type point after a gene is knocked out. According to Shlomi et al. [9], ROOM employs mixed integer linear programming (MILP) to solve the following problem, where the range $[w^l, w^u]$ is the threshold around the wild-type flux vector $w$ that defines a significant flux change:

$$\min_{v,\,y} \sum_{i=1}^{m} y_i \tag{1}$$

subject to:

$$S \cdot v = 0$$

$$v_i - y_i\,(v_{\max,i} - w_i^{u}) \le w_i^{u} \tag{2}$$

$$v_i - y_i\,(v_{\min,i} - w_i^{l}) \ge w_i^{l} \tag{3}$$

$$v_j = 0,\; j \in A, \qquad y_i \in \{0, 1\}$$

$$w^{u} = w + \delta\,|w| + \varepsilon, \qquad w^{l} = w - \delta\,|w| - \varepsilon$$

where, for every flux $i$, $1 \le i \le m$, $y_i = 1$ for a significant flux change in $v_i$ and $y_i = 0$ otherwise, and $A$ is the set of reactions removed by the knockout. Hence, when $y_i = 1$, inequalities (2) and (3) impose no new constraints on $v_i$, and when $y_i = 0$, inequalities (2) and (3) constrain $v_i$ to the range stated previously. The sizes of $\delta$ and $\varepsilon$ influence the running time of the MILP solver.

Figure 3 shows the flow of ROOM. The growth rate and the minimum production of the biochemical are returned after the calculation. The growth rate is used to evaluate the survival of the cell after the gene is knocked out and must be more than 0.1 h−1. The minimum production must be more than −1e-3 mmol gDW−1 h−1 to avoid considering insignificant improvements.
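The role of the thresholds in the ROOM formulation can be illustrated with a small sketch. It does not solve the MILP (the paper relies on ROOM.m in the Cobra toolbox for that); it only computes the bounds w^l and w^u around a wild-type flux vector, marks which fluxes of a given candidate distribution would count as significant changes (y_i = 1), and applies the growth and production cut-offs quoted above. The function names, the example flux values, and the defaults delta = 0.03 and eps = 0.001 are assumptions chosen for illustration.

```python
def room_thresholds(w, delta=0.03, eps=0.001):
    """Return per-flux lower and upper bounds (w_l, w_u) around wild-type fluxes w."""
    lower = [wi - delta * abs(wi) - eps for wi in w]
    upper = [wi + delta * abs(wi) + eps for wi in w]
    return lower, upper


def significant_changes(v, w, delta=0.03, eps=0.001):
    """Return y, where y[i] = 1 if v[i] lies outside [w_l[i], w_u[i]], else 0."""
    lower, upper = room_thresholds(w, delta, eps)
    return [0 if lo <= vi <= hi else 1 for vi, lo, hi in zip(v, lower, upper)]


def is_acceptable(growth_rate, production, min_growth=0.1, min_production=-1e-3):
    """Apply the growth-rate and minimum-production cut-offs stated in the text."""
    return growth_rate > min_growth and production > min_production


# Hypothetical wild-type and mutant flux vectors (not taken from the paper).
wild_type = [10.0, 5.0, 0.0, -2.0]
mutant = [10.2, 0.0, 0.0005, -2.5]

y = significant_changes(mutant, wild_type)
print("y =", y, "objective =", sum(y))
```

In the full method, the solver chooses v and y jointly so that the sum of y is minimal; the sketch only evaluates that sum for one fixed candidate.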


Fig. 3. The flow of ROOM.

2.4 Neighbourhood Search

The steps involved in this neighbourhood (local) search phase are: a) selecting the sites with the highest fitness by sorting the fitness values, and the corresponding positions in the population, in descending order; b) recruiting bees to the fittest sites, whose fitness is then evaluated by ROOM, while sites with low fitness are abandoned; the number of bees recruited to a site depends on its fitness, and the recruited bees search around it for more promising solutions; and c) selecting the bees with the highest fitness as the new population for the next iteration.

2.5 Randomly Assigned and Termination

When the local search yields no improvement, the remaining bees are assigned to explore the search space randomly to discover new possible solutions. The fitness is recalculated and compared again with the fixed minimum production value. After that, the production rates are sorted to identify the best production. The fitness of the new population is measured by ROOM, followed by the neighbourhood search, random assignment, and termination phases. The process iterates until it meets the termination criterion, which is reaching the maximum number of iterations, imax = 50.
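The initialization, neighbourhood search, and random assignment phases described above can be sketched as a compact loop. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the fitness here is a toy function (agreement with a hypothetical target pattern) standing in for the ROOM evaluation, a neighbour is produced by flipping ngh randomly chosen bits, and all function names are invented for the example. The parameter defaults follow the values given in Sect. 2.2.

```python
import random


def random_site(length, rng):
    """A candidate solution: 1 = reaction present, 0 = reaction absent."""
    return [rng.randint(0, 1) for _ in range(length)]


def neighbour(site, ngh, rng):
    """Flip `ngh` randomly chosen positions (the patch size)."""
    new_site = site[:]
    for idx in rng.sample(range(len(site)), ngh):
        new_site[idx] = 1 - new_site[idx]
    return new_site


def bees_algorithm(fitness, length, n=30, m=20, e=2, nep=4, nsp=8, ngh=1,
                   imax=50, seed=0):
    rng = random.Random(seed)
    # Initialization: n scout bees placed at random sites.
    population = [random_site(length, rng) for _ in range(n)]
    for _ in range(imax):
        # Rank sites by fitness, best first, and keep the m best.
        population.sort(key=fitness, reverse=True)
        next_population = []
        for rank, site in enumerate(population[:m]):
            # nep bees for the best e sites, nsp bees for the other m - e sites.
            recruits = nep if rank < e else nsp
            candidates = [site] + [neighbour(site, ngh, rng) for _ in range(recruits)]
            # Only the fittest bee of each patch survives.
            next_population.append(max(candidates, key=fitness))
        # Remaining bees scout the search space at random.
        next_population += [random_site(length, rng) for _ in range(n - m)]
        population = next_population
    return max(population, key=fitness)


# Toy stand-in for the ROOM-based fitness: agreement with a hypothetical pattern.
TARGET = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]


def toy_fitness(site):
    return sum(1 for s, t in zip(site, TARGET) if s == t)


best = bees_algorithm(toy_fitness, len(TARGET))
print(best, toy_fitness(best))
```

Because each patch keeps its current site among the candidates, the best fitness in the population never decreases from one iteration to the next.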

3 Experimental Results

This paper uses the E. coli iAF1260 model [10] as the dataset for the experiment. The model contains 1261 genes, 2382 unique biochemical reactions, and 1668 metabolites. All simulations were conducted in aerobic minimal media conditions. The rate of glucose uptake was set to 10 mmol/gDW/h, with a non-growth-associated maintenance rate of 7.6 mmol ATP/gDW/h. The experiments were carried out on a Windows platform with a 2.1 GHz processor and 1 GB of random-access memory (RAM) in the MATLAB environment. Cobra Toolbox 2.0.5 is required to execute the algorithm.

3.1 Experimental Result and Discussion for Lactate

Following the completion of the experiments, three knockout lists were collected. Table 1 shows the three knockout lists found by BAROOM for lactate production. These three knockout lists imply that genes are deleted in anaerobic conditions. The first list suggests the removal of the Nuo and AdhE genes, which leads to 11.0576 mmol gDW−1 h−1 of lactate production and a growth rate of 0.1379 h−1. The role of the Nuo gene, which encodes NADH dehydrogenase in oxidative phosphorylation, is to reoxidize NADH to NAD+.


According to Yang et al. [11] and Yun et al. [12], a deficiency in the Nuo gene under anaerobic conditions does not affect the metabolic state and causes no growth defects because the respiratory chain is inactivated. Yun et al. [12] also reported that a deficiency in the Nuo gene can result in higher production of D-lactate compared with the wild-type strain.

The second list suggests the removal of the Nuo, AckA, Pta and AdhE genes, which yields 11.5832 mmol gDW−1 h−1 of lactate production and a growth rate of 0.1189 h−1. The deletion of the Nuo gene, which encodes NADH dehydrogenase in oxidative phosphorylation, appears again. As previously mentioned, lactate production can be increased through the NADH pathway by inactivating NADH-consuming metabolic fluxes. According to Yang et al. [11], the deletion of Pta and AckA completely shuts down the pyruvate-formate pathway (PFL), and the flux to lactate increases substantially because less carbon flows through the competing PFL branch. In addition, the deletion of the AdhE gene, which encodes acetaldehyde dehydrogenase, reduces the production of ethanol, as acetaldehyde dehydrogenase is responsible for the interconversion between acetyl-CoA (ACCOA) and acetaldehyde (ACALD), the ethanol precursor. The reduced levels of acetate and ethanol further increase the yield of lactate since these two byproducts are eliminated.

The third knockout list includes the Nuo, Pta, AckA, AdhE and Frd genes and leads to the highest lactate production, 12.1762 mmol gDW−1 h−1, with a growth rate of 0.1159 h−1. Compared with the second list, one additional gene, Frd, is deleted, and lactate production increases further. As reported by Suman et al. [13], carbon can be diverted to lactate by inactivating the pathways that synthesize unwanted metabolites such as acetate, ethanol and succinate.
Table 1. List of gene knockouts identified by BAROOM for lactate production.

| No. | Enzyme | Associated gene | Lactate (mmol gDW−1 h−1) | Growth rate (h−1) |
|---|---|---|---|---|
| 1 | NADH dehydrogenase (ubiquinone-8 and 3 protons); Acetaldehyde dehydrogenase | Nuo; AdhE | 11.0576 | 0.1379 |
| 2 | NADH dehydrogenase (ubiquinone-8 and 3 protons); Phosphotransacetylase; Acetate kinase; Acetaldehyde dehydrogenase | Nuo; Pta; AckA; AdhE | 11.5832 | 0.1189 |
| 3 | NADH dehydrogenase (ubiquinone-8 and 3 protons); Phosphotransacetylase; Acetate kinase; Acetaldehyde dehydrogenase; Fumarate reductase | Nuo; Pta; AckA; AdhE; Frd | 12.1762 | 0.1159 |

Note: Bold font represents the best result.

3.2 Comparative Analysis for Lactate Case

OptKnock is one of the rational modelling frameworks that propose gene knockout strategies leading to the overproduction of metabolites in E. coli [3]. Table 2 shows the gene knockout lists determined by OptKnock for lactate production. As shown in Table 2, OptKnock achieved its highest lactate production, 10.53 mmol gDW−1 h−1, with the third knockout list, which is lower than the 12.1762 mmol gDW−1 h−1 identified by BAROOM. Hence, the results predicted by BAROOM are better for lactate production. A possible explanation is that OptKnock maximizes biochemical production (outer layer) under the condition that the cells still survive (inner layer) after gene knockout. The inner layer maximizes cell growth, which assumes that biomass yield is maximized as well. However, the mutants are not exposed to long-term pressure, and hence the assumption of maximal biomass yield is not valid for knockout mutants. The metabolic flux of a knockout mutant does not move directly towards the biomass-maximizing state assumed by OptKnock. As ROOM imposes stricter phenotypic constraints on mutants, it suggests more practical knockout strategies than OptKnock.

Table 2. List of gene knockouts identified by OptKnock for lactate production [3].

| No. | Enzyme | Lactate (mmol gDW−1 h−1) |
|---|---|---|
| 1 | Acetate kinase or phosphotransacetylase, acetaldehyde dehydrogenase | 5.58 |
| 2 | Acetate kinase or phosphotransacetylase, phosphofructokinase or fructose-1,6-bisphosphate aldolase | 0.19 |
| 3 | Acetate kinase or phosphotransacetylase, phosphofructokinase or fructose-1,6-bisphosphate aldolase, acetaldehyde dehydrogenase, glucokinase | 10.53 |

Note: Bold font represents the best result.

Besides, the lactate production results are compared with those of three wet-laboratory studies. Table 3 shows the lactate production obtained from engineered E. coli strains. Based on Table 3, the highest lactate production identified by Yang et al. [11] is 8.29 mmol gDW−1 h−1. The highest lactate production identified by Zomorrodi et al. [14] is 7.36 ± 0.83 mmol gDW−1 h−1. However, no lactate production is reported for the strains of Jantama [15], because the ldhA gene, which encodes the lactate dehydrogenase responsible for producing lactate from pyruvate, is deleted. By comparison, BAROOM identifies a production rate of 12.1762 mmol gDW−1 h−1, which is higher than the results from the wet-laboratory experiments [11, 14, 15].

Table 3. Lactate production for the engineered strains in E. coli.

| No. | Reference | Relevant deletions | Lactate (mmol gDW−1 h−1) |
|---|---|---|---|
| 1 | BAROOM | Nuo, Pta, AckA, AdhE, Frd | 12.1762 |
| 2 | Yang et al. [11] | nuo | 0.06 |
|   |   | ackA-pta | 5.13 |
|   |   | ackA-pta-nuo | 8.29 |
| 3 | Zomorrodi et al. [14] | pta | 7.36 ± 0.83 |
|   |   | ppc | 1.13 ± 0.1 |
|   |   | adhE | 0.23 ± 0.03 |
|   |   | pykF | 0.13 ± 0.2 |
| 4 | Jantama [15] | ldhA, adhE, ackA, pflB, mgsA, poxB, ptsG | 0.00 ± 0.00 |
|   |   | ldhA, adhE, ackA, pflB, mgsA, poxB, galP | 0.00 ± 0.00 |
|   |   | ldhA, adhE, ackA, pflB, mgsA, poxB, manX | 0.00 ± 0.00 |
|   |   | ldhA, adhE, ackA, pflB, mgsA, poxB, galP, manX | 0.00 ± 0.00 |
|   |   | ldhA, adhE, ackA, pflB, mgsA, poxB, galP, ptsG | 0.00 ± 0.00 |

Note: Bold font represents the best result.

3.3 Performance Measurement for Lactate Case

The mean and standard deviation of the growth rate are shown in Table 4. The standard deviation for the best solution of BAROOM is at most about 0.015 for every maximum knockout setting. This suggests that the growth rates in the 50 individual runs are close to the mean. Thus, the BAROOM algorithm is stable and reliable, as the growth rate it finds in each run clusters near the mean growth rate. BAROOM returns the solver status in standardized form for the final knockout list in each run. It is critical to know whether a solution is valid, since only valid solutions will be used in laboratory experiments. In this paper, the solver status returned in all 50 runs for each of the 5 maximum knockout settings is 1. This indicates that the solution returned by BAROOM is optimal in all runs for each maximum knockout setting in the lactate case study. Table 4 shows the accuracy of the results. Next, BAROOM also returns the type of solution found. In all 50 runs, BAROOM returns the value 1 (valid solution). This shows that BAROOM obtains valid solutions with high accuracy, and it can be concluded that all the mutants in this paper can ensure the production of lactate.

Table 4. Performance measurement of lactate production with maximum knockout from value 1 until value 5. All values are calculated over 50 runs for each maximum knockout.

| Max knockout measurement | KO = 1 | KO = 2 | KO = 3 | KO = 4 | KO = 5 |
|---|---|---|---|---|---|
| Mean (growth rate) | 0.1061 | 0.1067 | 0.1093 | 0.1107 | 0.1125 |
| Standard deviation (growth rate) | 0 | 0.0045 | 0.0133 | 0.0122 | 0.0151 |
| Accuracy (optimal solution) | 100% | 100% | 100% | 100% | 100% |
| Accuracy (valid solution) | 100% | 100% | 100% | 100% | 100% |
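The quantities reported in Table 4 can be computed from per-run outputs along the following lines. This is a sketch under assumptions: the growth rates and solver statuses below are hypothetical placeholders for five runs (the paper uses 50 runs per setting), the function name is invented, and the population standard deviation is used because the text does not say whether the population or sample form was applied.

```python
from statistics import mean, pstdev


def summarise_runs(growth_rates, solver_statuses):
    """Summarise repeated runs: mean, standard deviation, and share of status 1."""
    optimal = sum(1 for status in solver_statuses if status == 1)
    return {
        "mean": mean(growth_rates),
        "std": pstdev(growth_rates),  # population SD; an assumption
        "accuracy": 100.0 * optimal / len(solver_statuses),
    }


# Hypothetical values for five runs, for illustration only.
rates = [0.10, 0.11, 0.12, 0.11, 0.11]
statuses = [1, 1, 1, 1, 1]

summary = summarise_runs(rates, statuses)
print(summary)
```

A small standard deviation relative to the mean is what the text interprets as stability across runs.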

4 Conclusion and Future Works

In conclusion, this paper discusses in detail the gene knockout strategies suggested by BAROOM that lead to lactate overproduction in an E. coli model. BAROOM achieves better results than previous in silico and in vivo studies [10, 13–15]. Furthermore, the overall performance of BAROOM is also presented. The results indicate that BAROOM is highly stable and reliable and that it determines an optimal and valid solution for the final knockout list in every run. However, several improvements can be carried out in future research, such as a) applying the method to different microorganisms and target metabolites, b) improving the local search ability of BA, and c) providing a web server for the method introduced in this paper.

Acknowledgement. We would like to thank Universiti Malaysia Kelantan for supporting this research via Post-Doctoral (Research) Scheme and the UMK Fund (grant number: R/FUND/A0100/01850A/001/2020/00816).

References

1. Baba, T., et al.: Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: Keio collection. Mol. Syst. Biol. 2, 1–11 (2006)
2. Patil, K.R., Rocha, I., Förster, J., Nielsen, J.: Evolutionary programming as a platform for in silico metabolic engineering. BMC Bioinform. 6(1), 1–12 (2005)
3. Burgard, A.P., Pharkya, P., Maranas, C.D.: OptKnock: A Bilevel programming framework for identifying gene knockout strategies for microbial strain optimization. Biotechnol. Bioeng. 84(6), 647–657 (2003)
4. Rocha, M., et al.: Natural computation meta-heuristics for the in silico optimization of microbial strains. BMC Bioinform. 9(1), 1–16 (2008)
5. Pham, D.T., Ghanbarzadeh, A., Koç, E., Otri, S., Rahim, S., Zaidi, M.: The bees algorithm - a novel tool for complex optimisation problems. In: Intelligent Production Machines and Systems, pp. 454–459 (2006)
6. Choon, Y.W., et al.: Identifying gene knockout strategies using a hybrid of bees algorithm and flux balance analysis for in silico optimization of microbial strains. In: Distributed Computing and Artificial Intelligence, pp. 371–378 (2012)
7. Choon, Y.W., Mohamad, M.S., Deris, S., Chong, C.K., Omatu, S., Corchado, J.M.: Gene knockout identification using an extension of bees hill flux balance analysis. BioMed. Res. Int. (2015)
8. Kleessen, S., Nikoloski, Z.: Dynamic regulatory on/off minimization for biological systems under internal temporal perturbations. BMC Syst. Biol. 6(1), 16 (2012)
9. Shlomi, T., Berkman, O., Ruppin, E.: Regulatory on-off minimization of metabolic flux changes after genetic perturbations. Proc. Natl. Acad. Sci. USA 102(21), 7695–7700 (2005)
10. Feist, A.M., Zielinski, D.C., Orth, J.D., Schellenberger, J., Herrgard, M.J., Palsson, B.: Model-driven evaluation of the production potential for growth coupled products of Escherichia coli. Metabol. Eng. 12(3), 173–186 (2010)
11. Yang, Y., Bennett, G.N., San, K.: Effect of Inactivation of nuo and ackA-pta on redistribution of metabolic fluxes in Escherichia coli. Biotechnol. Bioeng. 65(3), 291–297 (1999)
12. Yun, N.R., San, K.Y., Bennett, G.N.: Enhancement of lactate and succinate formation in adhE or pta-ackA mutants of NADH dehydrogenase-deficient Escherichia coli. J. Appl. Microbiol. 99(6), 1404–1412 (2005)
13. Suman, M., Clomburg, J., Gonzalez, R.: Escherichia coli strains engineered for homofermentative production of d-lactic acid from glycerol. Appl. Environ. Microbiol. 76(13), 4237–4336 (2010)
14. Zomorrodi, A.R., Suthers, P.F., Ranganathan, S., Maranas, C.D.: Mathematical optimization applications in metabolic networks. Metab. Eng. 14(6), 672–686 (2012)
15. Jantama, K.: Glucose is taken up by galactose permease in metabolic engineered Escherichia coli to produce succinate. Suran. J. Sci. Technol. 17(4), 369–386 (2010)

A Study on Burrows-Wheeler Aligner’s Performance Optimization for Ancient DNA Mapping

Cindy Sarmento1(B), Sílvia Guimarães2, Gülşah Merve Kılınç3, Anders Götherström4, Ana Elisabete Pires2, Catarina Ginja2, and Nuno A. Fonseca2

1 Faculdade de Ciências, Universidade do Porto, Porto, Portugal [email protected]
2 CIBIO/InBIO-Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal {silvia.guimaraes,aepires,catarinaginja,nuno.fonseca}@cibio.up.pt
3 Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey [email protected]
4 Archaeological Research Laboratory, Stockholm University, Stockholm, Sweden [email protected]

Abstract. The high levels of degradation characteristic of ancient DNA molecules severely hinder the recovery of endogenous DNA fragments and the discovery of genetic variation, limiting downstream population analyses. Optimization of read mapping strategies for ancient DNA is therefore essential to maximize the information we are able to recover from archaeological specimens. In this paper we assess Burrows-Wheeler Aligner (BWA) effectiveness for mapping of ancient DNA sequence data, comparing different sets of parameters and their effect on the number of endogenous sequences mapped and variants called. We also consider different filtering options for SNP calling, which include minimum values for depth of coverage and base quality in addition to mapping quality. Considering our results, as well as those of previous studies, we conclude that BWA-MEM is a good alternative to the current standard BWA-backtrack strategy for ancient DNA studies, especially when the computational resources are limited and time is a constraint.

Keywords: Ancient DNA · Alignment · BWA-MEM · BWA-aln · Variant calling

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 105–114, 2022. https://doi.org/10.1007/978-3-030-86258-9_11

1 Introduction

Over the past two decades high-throughput sequencing (HTS) technologies have proven their promised game-changing performance regarding genome-wide studies by enabling the simultaneous production of millions of sequences at progressively lower costs. HTS had a huge impact on ancient DNA research, with great developments since early studies in the 1980's [1]. The number of ancient genomes sequenced and published has risen to more than a thousand, representing mainly mammalian taxa but also other animal taxa and some plant and microbial species [2]. Despite the continuous improvement of both molecular and bioinformatic methods that has allowed access to short fragments (i.e. 25. The y-axes are in logarithmic scale.

3.3 Impact on Variant Calling

The number of deletions, insertions, and single-nucleotide polymorphisms (SNPs) per mapping strategy before and after filtering is shown in Fig. 2. Before filtering we can see a similar pattern between the four strategies tested: most variants are SNPs and there is a very similar number of insertions and deletions (Fig. 2(a)). When we compare the two algorithms, we can see that we get more variants with both MEM strategies than with ALN1 or ALN2. After filtering out variants with DP < 2, QUAL < 10, MQ < 30 and SNPs within 3 bp of indels, we can see that this pattern remained (Fig. 2(b)). Differences in the number of insertions, deletions and SNPs are, however, not significant, neither before nor after filtering (α = 0.01 and p > 0.1094).
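The filtering step described above (removing variants with DP < 2, QUAL < 10, or MQ < 30, and SNPs within 3 bp of indels) can be sketched on plain dictionaries. The record layout, function names, and example values are hypothetical, and two simplifications are assumed: distance is measured between start positions only, and every indel in the input counts when testing SNP proximity, whether or not that indel itself passes the quality thresholds. The study itself does not state which tool performed this filtering in the text shown here.

```python
def passes_quality(variant, min_dp=2, min_qual=10, min_mq=30):
    """Keep variants that meet all three thresholds quoted in the text."""
    return (variant["DP"] >= min_dp
            and variant["QUAL"] >= min_qual
            and variant["MQ"] >= min_mq)


def filter_variants(variants, snp_gap=3):
    """Drop low-quality variants and SNPs within `snp_gap` bp of any indel."""
    indel_sites = [(v["chrom"], v["pos"]) for v in variants if v["type"] != "SNP"]
    kept = []
    for v in variants:
        if not passes_quality(v):
            continue
        near_indel = v["type"] == "SNP" and any(
            chrom == v["chrom"] and abs(pos - v["pos"]) <= snp_gap
            for chrom, pos in indel_sites
        )
        if not near_indel:
            kept.append(v)
    return kept


# Hypothetical records, for illustration only.
variants = [
    {"chrom": "1", "pos": 100, "type": "SNP", "DP": 5, "QUAL": 40, "MQ": 60},
    {"chrom": "1", "pos": 200, "type": "SNP", "DP": 1, "QUAL": 40, "MQ": 60},
    {"chrom": "1", "pos": 300, "type": "INS", "DP": 4, "QUAL": 30, "MQ": 37},
    {"chrom": "1", "pos": 302, "type": "SNP", "DP": 6, "QUAL": 50, "MQ": 60},
    {"chrom": "2", "pos": 302, "type": "SNP", "DP": 6, "QUAL": 50, "MQ": 20},
]

kept = filter_variants(variants)
print([(v["chrom"], v["pos"], v["type"]) for v in kept])
```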


Fig. 2. Number of variants per type across the alignment strategies tested. (a) Before filtering. (b) After filtering out variants with DP < 2, QUAL < 10, MQ < 30 and SNPs within 3 bp of indels. Y-axes are in logarithmic scale.

Next, we considered SNPs only while comparing different filtering parameters (Fig. 3). Again, we see that the sets MEM1 and MEM2 appear to result in a higher number of SNPs, but the increase is not significant. Considering unfiltered SNPs, for example, the number obtained with ALN1 is not significantly different from the number of SNPs obtained with either ALN2, MEM1 or MEM2 (p = 0.256). The same applies with each of the filters tested (p-values are 0.066, 0.224 and 0.300): no significant difference was found in the number of SNPs that pass each quality filter across all alignment methods. Considering the three filter combinations applied, the differences between the resulting number of SNPs are also not significant (p-values for pairwise comparisons using Wilcoxon rank sum test range from 0.37 to 0.63).
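For readers unfamiliar with the Wilcoxon rank sum test mentioned above, the following sketch shows how the statistic and a two-sided p-value can be obtained with the normal approximation. It is an illustration under assumptions: the SNP counts are hypothetical, no tie or continuity correction is applied, and for samples this small an exact test (as provided by statistical packages) would normally be preferred.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test using the normal approximation."""
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at i and give each the average rank.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, z, p_value


# Hypothetical SNP counts for two mapping strategies, for illustration only.
counts_a = [120, 135, 150, 160]
counts_b = [110, 125, 140, 155]

w, z, p_value = rank_sum_test(counts_a, counts_b)
print(w, round(z, 4), round(p_value, 4))
```

A p-value above the chosen α, as in the comparisons reported in the text, means the difference in counts is not considered significant.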

Fig. 3. Distribution of the number of SNPs obtained using three quality filter combinations across mapping strategies.

Finally, we looked at SNP annotation and effect prediction. We assessed whether the differences between the number of SNPs detected by each strategy were significant when considering the number of SNPs annotated to coding regions (e.g. missense and synonymous SNPs). The MEM strategies yielded a significantly higher number of these variants (Fig. 4). The number of variants in intergenic regions is also significantly higher for MEM strategies (for α = 0.05). Differences in the total number of SNPs annotated to intronic regions are not significant.

Fig. 4. Distribution of the number of SNPs annotated to intronic, intergenic and coding regions of the genome across all mapping strategies.

3.4 BWA-aln vs BWA-MEM on Accurate and Effective Mapping

Despite BWA-MEM being described as an algorithm more adequate for reads longer than 70 bp and general recommendations for BWA-backtrack usage for reads shorter than 100 bp [9], we show that BWA-MEM has a performance comparable to that of BWA-aln in the identification of endogenous ancient reads. However, BWA-MEM is significantly faster than BWA-aln with default parameters and with parametrizations optimized for ancient data by Schubert et al. [6], or with more permissive parametrizations also frequently used in ancient DNA studies. These findings are consistent with previous studies by Xu et al. [8] and Oliva et al. [13]. However, unlike Xu et al. [8], we do not observe a significant difference when using BWA-MEM with the non-default parameters.

We took one step forward to investigate the effect of the different mapping strategies on the called variants. According to our results, the mapping strategy does not have a significant impact on variant calling: regardless of the algorithm and parameters applied, the number and type of variants retrieved are very similar. Moreover, the SNPs obtained in each case present similar values of base quality, mapping quality and depth of coverage. These results further point towards similar overall performances for both BWA algorithms. BWA-MEM can, however, retrieve a higher number of variants in coding regions, particularly missense variants, which can be of great interest for studies focused on the cause of phenotypic changes through time, as it can help identify target genes.
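To make the two strategies concrete, the sketch below builds the command lines for a single-end run of each algorithm. It is an assumption-laden illustration rather than the pipeline used in this study: the parameter values shown for `bwa aln` (`-l 1024` to disable seeding, `-n 0.01`, `-o 2`) are ones commonly used for ancient DNA and are not confirmed here as the study's ALN or MEM parameter sets, the file names are placeholders, and the helper functions are hypothetical. Each command writes to standard output, so the second element of each pair names the file the caller would redirect it to.

```python
def bwa_aln_commands(reference, reads, prefix, seed_length=1024,
                     max_diff=0.01, gap_opens=2):
    """Return (argv, stdout_file) pairs for a single-end bwa aln + samse run."""
    sai = prefix + ".sai"
    aln = ["bwa", "aln", "-l", str(seed_length), "-n", str(max_diff),
           "-o", str(gap_opens), reference, reads]
    samse = ["bwa", "samse", reference, sai, reads]
    return [(aln, sai), (samse, prefix + ".aln.sam")]


def bwa_mem_commands(reference, reads, prefix):
    """Return the (argv, stdout_file) pair for bwa mem with default parameters."""
    return [(["bwa", "mem", reference, reads], prefix + ".mem.sam")]


# Placeholder file names, for illustration only.
for argv, out_file in (bwa_aln_commands("ref.fa", "sample.fq", "sample")
                       + bwa_mem_commands("ref.fa", "sample.fq", "sample")):
    print(" ".join(argv), ">", out_file)
```

The backtrack strategy needs two steps (alignment, then conversion to SAM), whereas BWA-MEM produces SAM output in a single call.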

4 Conclusions and Future Prospects

Our study complements recent investigations on the most effective methods for ancient DNA read mapping. Even when applied to real ancient sequencing data, with reads shorter than 70 bp, and considering both the identification of endogenous reads and the prediction of variants, the BWA-MEM algorithm with default parameters exhibits results comparable to those of BWA-aln with seeding disabled, the most used method in ancient genomics. However, BWA-MEM has the great advantage of accelerating the alignment procedure considerably.

Although BWA-based mapping procedures using linear reference genomes show low reference bias compared to other software-based methods [13], this bias is still considerable and prevalent in published ancient sequence data, potentially affecting population analyses [4]. Recent works, mainly applied to modern data, have developed alternative mapping approaches where the linear reference is replaced by genome graphs that include variation information and have found positive results, obtaining alignments free of reference bias including for ancient DNA data [29, 30]. Although migration from linear reference genomes to graph genomes might not be possible soon, future works will likely focus their efforts on further developing this promising approach to achieve complete reference bias elimination.

Acknowledgments. We thank Ana Arruda, Catarina Viegas, Andrea Martins, Cleia Detry and Simon J.M. Davis for providing access to well-documented cattle specimens for ancient DNA analysis. This work received funding from: the project PORBIOTA- Portuguese E-Infrastructure for Information and Research on Biodiversity (POCI-01-0145-FEDER-022127), supported by Operational Thematic Program for Competitiveness and Internationalization (POCI), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (FEDER); Fundação Nacional para a Ciência e a Tecnologia (FCT), Portugal, contract grant 2020.02754.CEECIND (C.G.), Norma Transitória contract grant DL 57/2016/CP1440/CT0029 (A.E.P.) and the ARCHAIC Project grant PTDC/CVTLIV/2827/2014 co-funded by COMPETE 2020 POCI-01-0145-FEDER-016647 and LISBOA-01-0145-FEDER-016647 (C.G.). This work was also supported by National Funds through FCT/MCTES under the UIDB/50027/2020 funding.

References 1. Higuchi, R., Bowman, B., Freiberger, M., et al.: DNA sequences from the quagga, an extinct member of the horse family. Nature 312, 282–284 (1984). https://doi.org/10.1038/312282a0 2. Mitchell, K.J., Rawlence, N.J.: Examining natural history through the lens of palaeogenomics. Trends Ecol. Evol. 36, 258–267 (2021). https://doi.org/10.1016/j.tree.2020.10.005 3. Prüfer, K., Stenzel, U., Hofreiter, M., et al.: Computational challenges in the analysis of ancient DNA. Genome Biol. 11, R47 (2010). https://doi.org/10.1186/gb-2010-11-5-r47 4. Günther, T., Nettelblad, C.: The presence and impact of reference bias on population genomic studies of prehistoric human populations. PLOS Genet. 15, e1008302 (2019). https://doi.org/ 10.1371/journal.pgen.1008302 5. Gopalakrishnan, S., Samaniego Castruita, J.A., Sinding, M.-H.S., et al.: The wolf reference genome sequence (Canis lupus lupus) and its implications for Canis spp. population genomics. BMC Genom. 18, 495 (2017). https://doi.org/10.1186/s12864-017-3883-3 6. Schubert, M., Ginolhac, A., Lindgreen, S., et al.: Improving ancient DNA read mapping against modern reference genomes. BMC Genom. 13, 178 (2012). https://doi.org/10.1186/ 1471-2164-13-178 7. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler trans-form. Bioinform. Oxf. Engl. 25, 1754–1760 (2009). https://doi.org/10.1093/bioinformatics/btp324



8. Xu, W., Lin, Y., Zhao, K., et al.: An efficient pipeline for ancient DNA mapping and recovery of endogenous ancient DNA from whole-genome sequencing data. Ecol. Evol. 11, 390–401 (2020). https://doi.org/10.1002/ece3.7056
9. Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM (2013). http://arxiv.org/abs/1303.3997
10. Poullet, M., Orlando, L.: Assessing DNA sequence alignment methods for characterizing ancient genomes and methylomes. Front. Ecol. Evol. 8 (2020). https://doi.org/10.3389/fevo.2020.00105
11. Langmead, B., Salzberg, S.L.: Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012). https://doi.org/10.1038/nmeth.1923
12. NovoAlign | Novocraft. http://www.novocraft.com/products/novoalign/. Accessed 13 Apr 2021
13. Oliva, A., Tobler, R., Cooper, A., et al.: Systematic benchmark of ancient DNA read mapping. Brief Bioinform. (2021). https://doi.org/10.1093/bib/bbab076
14. Davis, S.J.M., Svensson, E.M., Albarella, U., et al.: Molecular and osteometric sexing of cattle metacarpals: a case study from 15th century AD Beja, Portugal. J. Archaeol. Sci. 39, 1445–1454 (2012). https://doi.org/10.1016/j.jas.2011.12.003
15. Rodríguez-Varela, R., Günther, T., Krzewińska, M., et al.: Genomic analyses of pre-European conquest human remains from the Canary Islands reveal close affinity to modern North Africans. Curr. Biol. 27, 3396-3402.e5 (2017). https://doi.org/10.1016/j.cub.2017.09.059
16. Yang, D.Y., Eng, B., Waye, J.S., et al.: Improved DNA extraction from ancient bones using silica-based spin columns. Am. J. Phys. Anthropol. 105, 539–543 (1998). https://doi.org/10.1002/(SICI)1096-8644(199804)105:4%3c539::AID-AJPA10%3e3.0.CO;2-1
17. Dabney, J., Knapp, M., Glocke, I., et al.: Complete mitochondrial genome sequence of a Middle Pleistocene cave bear reconstructed from ultrashort DNA fragments. Proc. Natl. Acad. Sci. U. S. A. 110, 15758–15763 (2013). https://doi.org/10.1073/pnas.1314445110
18. Meyer, M., Kircher, M.: Illumina sequencing library preparation for highly multiplexed target capture and sequencing. Cold Spring Harb. Protoc. 5 (2010). https://doi.org/10.1101/pdb.prot5448
19. Günther, T., Valdiosera, C., Malmström, H., et al.: Ancient genomes link early farmers from Atapuerca in Spain to modern-day Basques. Proc. Natl. Acad. Sci. U. S. A. 112, 11917–11922 (2015). https://doi.org/10.1073/pnas.1509851112
20. Jónsson, H., Ginolhac, A., Schubert, M., et al.: mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters. Bioinformatics 29, 1682–1684 (2013). https://doi.org/10.1093/bioinformatics/btt193
21. Martin, M.: Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011). https://doi.org/10.14806/ej.17.1.200
22. Magoč, T., Salzberg, S.L.: FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics 27, 2957–2963 (2011). https://doi.org/10.1093/bioinformatics/btr507
23. bwa man page - General Commands | ManKier. https://www.mankier.com/1/bwa. Accessed 14 Apr 2021
24. Li, H., Handsaker, B., Wysoker, A., et al.: The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009). https://doi.org/10.1093/bioinformatics/btp352
25. McKenna, A., Hanna, M., Banks, E., et al.: The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010). https://doi.org/10.1101/gr.107524.110
26. Jun, G., Wing, M.K., Abecasis, G.R., Kang, H.M.: An efficient and scalable analysis framework for variant extraction and refinement from population scale DNA sequence data. Genome Res. gr.176552.114 (2015). https://doi.org/10.1101/gr.176552.114



27. Danecek, P., Bonfield, J.K., Liddle, J., et al.: Twelve years of SAMtools and BCFtools. GigaScience 10 (2021). https://doi.org/10.1093/gigascience/giab008
28. McLaren, W., Gil, L., Hunt, S.E., et al.: The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016). https://doi.org/10.1186/s13059-016-0974-4
29. Paten, B., Novak, A.M., Eizenga, J.M., Garrison, E.: Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017). https://doi.org/10.1101/gr.214155.116
30. Martiniano, R., Garrison, E., Jones, E.R., et al.: Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph. bioRxiv 782755 (2020). https://doi.org/10.1101/782755

BioTMPy: A Deep Learning-Based Tool to Classify Biomedical Literature
Nuno Alves, Ruben Rodrigues, and Miguel Rocha
Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal

Abstract. Identifying the most relevant articles for a given task among a rapidly increasing number of options is a highly time-consuming job for researchers. To help with it, a package called BioTMPy (https://github.com/BioSystemsUM/biotmpy) was developed to implement a complete pipeline to classify biomedical literature using state-of-the-art Deep Learning (DL) models. The package is divided into distinct modules that can be used in different steps of a pipeline, together or independently. To validate BioTMPy, the package was used to compare several pre-trained embeddings on a dataset from a BioCreative challenge, where BioWordVec showed a slightly better performance than GloVe, PubMed vectors and “pubmed ncbi” embeddings. Additionally, we implemented and compared several state-of-the-art DL models encompassing recurrent and convolutional layers, as well as transformers with attention mechanisms, including those from the BERT family. We obtained an improvement of over 7% for average precision and 3% for F1-score when compared to the challenge’s best submission.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 115–125, 2022. https://doi.org/10.1007/978-3-030-86258-9_12

1 Introduction

The scientific community has been reporting its work in the form of articles and other literature for many years. Furthermore, given the progress of science and technology, the publication rate has increased massively in the last few decades [1]. Consequently, it is now extremely difficult to efficiently extract information regarding a given research topic. For instance, the query “breast cancer” on PubMed returns more than 400,000 documents. Although this challenge has been addressed by the Biomedical Text Mining (BioTM) field, the existing methods can still be further enhanced [2,3].

BioTM aims at using text mining methods, supported by computational tools, to accelerate information extraction from biomedical literature written in the form of unstructured free text [1]. Within this field, a challenging task is Document Classification (DC), whose aim is to assign pre-defined labels to documents, including both binary classification (2 labels) and multi-classification (more than 2 labels) [4]. Initially, DC was performed manually, which showed scalability problems, as did its rule-based alternative. With the recent increasing popularity of Machine Learning (ML), ML models have been able to mitigate these scalability and performance issues [5]. As an example, PubMed, one of the most used information retrieval systems, which contains more than 32 million papers, until 2013 only provided a system that matched terms from a query with terms present within documents. Then, a formula called Term Frequency-Inverse Document Frequency (TF-IDF) was implemented, which considers both the frequency of a term in a document and its frequency in all the retrieved documents. In 2017, a new ML-based system called “Best Match” was implemented in PubMed, becoming the default search system in a renewed PubMed website. Its authors also consider that, in the future, their system may be upgraded to a Deep Learning (DL) algorithm [3].

Currently, the state-of-the-art models for many Natural Language Processing (NLP) tasks are DL models [6]. This is the result of the emergence of pre-trained embeddings (e.g. GloVe, BioWordVec), models with high complexity like Bidirectional Encoder Representations from Transformers (BERT) [7], and the recent hardware improvements regarding graphical processing units (GPUs) [6,8]. Therefore, in this work we aimed to use these latest methods and models to develop and validate a tool that provides functionalities to perform document classification of biomedical literature.
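For readers unfamiliar with the TF-IDF weighting mentioned above, the following is a minimal sketch in plain Python. It assumes one common variant (term count divided by document length, multiplied by the logarithm of the number of documents over the document frequency); the function name `tf_idf` and the toy documents are illustrative only, and neither PubMed's nor BioTMPy's actual implementation is reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other variants add smoothing terms.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical): three already-tokenized "documents".
docs = [
    ["breast", "cancer", "mutation"],
    ["breast", "cancer", "screening"],
    ["protein", "interaction", "mutation"],
]
scores = tf_idf(docs)
```

In this toy corpus, "screening" occurs in a single document and therefore receives a higher weight than "breast", which occurs in two; a term present in every document would receive a weight of zero.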

2 Package Description and Implementation

The package developed in this work, named BioTMPy (Biomedical Text Mining with Python), aims to address the challenge mentioned above by facilitating the search for relevant biomedical documents. Users can use this package to rank literature related to a topic of interest by relevance. BioTMPy can be employed to run a complete pipeline for document classification using ML models, including DL models that are state-of-the-art in different NLP tasks [6]. BioTMPy was implemented in Python 3.8 and uses a number of packages: TensorFlow/Keras [8], scikit-learn [9], pandas [10], NumPy [11], HuggingFace’s Transformers [12], Matplotlib [13], and the Natural Language Toolkit (NLTK) [14]. The overall structure of BioTMPy can be seen in Fig. 1. Its division into several modules enables their independent application in different steps of a document classification pipeline. BioTMPy includes methods for textual data loading, preprocessing and analysis, and for model training, optimization and evaluation, among other text mining and NLP methods. More precisely, the package encompasses 6 modules:


Fig. 1. Representation of BioTMPy’s structure, divided into 6 main modules: wrappers, data structures, preprocessing, machine learning, pipelines and web.

1. Wrappers: this module contains methods to convert text data provided in different formats (XML, CSV and text files) into a Python pandas dataframe. It starts by splitting each text into sentences and tokens through functions from the NLTK package, which are then saved within BioTMPy’s data structures.
2. Data Structures: provides structures to save data as attributes of each object, namely: document id, title, abstract, full-text, etc. When creating these data structures, many preprocessing steps can be applied, such as conversion to lowercase, removal of stop words and punctuation, splitting of words by hyphen, stemming and lemmatization. Furthermore, a Relevance object can be created to store the label, document id, confidence score from a prediction and a description outlining the topic of the document.
3. Preprocessing: the dataframe created with the wrapper can be converted into the numerical inputs needed by the models. For traditional ML models, features at both sentence and word levels can be created, such as: Part-of-Speech (POS) tagging, number of tokens, sentence size, Named Entity Recognition (NER) and Term Frequency-Inverse Document Frequency (TF-IDF). Regarding DL, 3 main types of inputs can be created: 1) a 2D array (number of documents × number of words) for common DL models (e.g. recurrent neural networks, convolutional neural networks); 2) a 3D array (number of documents × number of sentences × number of words) for Hierarchical Attention Networks (HAN); and 3) 3 arrays of 2 dimensions each (number of documents × token IDs/Masks/Segments) for Bidirectional Encoder Representations from Transformers (BERT). This module also contains functions to integrate pre-trained embeddings in some of these DL models. Moreover, it offers unsupervised methods like t-distributed stochastic neighbor embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), as well as preliminary data analysis (most frequent words and distribution plots), which can give some insights and improve further hyperparameter selection (e.g. maximum number of words per document). Additionally, Config objects are provided to ease the process of saving all the parameters and instances used throughout the pipeline, consequently facilitating new predictions, the replication of results and the comparison of models.
4. Machine Learning (mlearning): module focused on the implementation of ML models. For traditional ML, functions built over scikit-learn to train, predict, optimize and evaluate models are provided. Regarding DL, BioTMPy contains several implemented models, including a model combining convolutional neural network (CNN) layers and a bidirectional Long Short-Term Memory (LSTM) network with an attention layer on top (adapted from Burns et al. [15]), a HAN, and models containing different versions of pre-trained BERT models (BERT, BioBERT [16] and SciBERT [17]). All these models were implemented with the Functional API from Keras, consequently allowing the application of its functions (e.g. fit, predict, save).
5. Pipelines: the main aim of this module is to give intuitive examples of how to use the package to execute, for example, a document classification pipeline from start to finish. These pipelines are presented as Jupyter notebooks, giving a clear overview of the entire process. Additionally, the module contains implementations of cross-validation and hyperparameter tuning for these models. Within the folders named models and hp results, all the metrics obtained from the predictions are saved (e.g. average precision, F1, ROC and Precision-Recall curves), as well as the results from cross-validation and the best hyperparameters from their optimization process.
6. Web: allows the deployment of the best model previously developed, in the form of a web application using Flask. On the web application, a user is able to enter PubMed IDs, a search term or PDF files. These inputs are then used in methods from the PubMed Reader file, which acquire the documents from the PubMed database using the BioPython package.
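The first two DL input layouts listed in the Preprocessing module (documents × words, and documents × sentences × words) can be illustrated with a small sketch. This is not BioTMPy's code: the function names, the reserved indices 0 (padding) and 1 (out-of-vocabulary), and the use of nested Python lists instead of NumPy arrays are all assumptions made to keep the example dependency-free. The BERT-specific token ID, mask and segment arrays require a model tokenizer and are not shown.

```python
PAD, OOV = 0, 1  # assumed reserved indices: 0 = padding, 1 = out-of-vocabulary


def build_vocab(docs):
    """Map every token seen in the given documents to an index starting at 2."""
    vocab = {}
    for doc in docs:
        for sentence in doc:
            for token in sentence:
                if token not in vocab:
                    vocab[token] = len(vocab) + 2
    return vocab


def encode(tokens, vocab, length):
    """Index tokens, then truncate or right-pad to a fixed length."""
    ids = [vocab.get(token, OOV) for token in tokens][:length]
    return ids + [PAD] * (length - len(ids))


def to_2d(docs, vocab, max_words):
    """documents x words: the sentences of each document are concatenated."""
    return [encode([t for s in doc for t in s], vocab, max_words) for doc in docs]


def to_3d(docs, vocab, max_sents, max_words):
    """documents x sentences x words, the layout a hierarchical model expects."""
    batch = []
    for doc in docs:
        rows = [encode(s, vocab, max_words) for s in doc[:max_sents]]
        rows += [[PAD] * max_words for _ in range(max_sents - len(rows))]
        batch.append(rows)
    return batch


# Each document is a list of sentences; each sentence is a list of tokens.
docs = [
    [["p53", "binds", "mdm2"], ["mutation", "disrupts", "binding"]],
    [["kinase", "assay"]],
]
vocab = build_vocab(docs)
flat = to_2d(docs, vocab, max_words=4)
nested = to_3d(docs, vocab, max_sents=2, max_words=3)
```

The maximum lengths (`max_words`, `max_sents`) are exactly the kind of hyperparameter that the preliminary data analysis mentioned in item 3 is meant to inform.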

3 Validation

3.1 Dataset and Challenge Overview

To validate our implementation, BioTMPy was applied to a dataset retrieved from a document triage sub-task on “Mining protein interactions and mutations for precision medicine”, part of Track 4 of BioCreative VI. This sub-task ended in 2019 and received a total of 22 submissions from 10 different teams [18]; teams had to identify documents relevant to the topic mentioned above. The provided dataset comprises a training set and a test set containing 4082 and 1464 documents, respectively (Table 1). All the documents (instances) in these datasets were previously manually curated as relevant or non-relevant (labels) by BioGRID curators [18].


Table 1. Number of documents included in both training and test sets from the BioCreative’s challenge.

Dataset    Documents  Relevant  Non-relevant
Training   4082       1729      2353
Test       1464       730       734

3.2 Preprocessing

In the preprocessing phase, all texts were split into sentences, and each sentence was split into tokens. With the exception of BioBERT, which was pre-trained with documents containing capital letters, all models received lower-case text as input. Words containing a hyphen were also split into two separate words (e.g. “calcium-activated” becomes “calcium”, “activated”), a step not performed for BERT models since their tokenizers are able to convert these words into subwords. Stop words were also removed for all models except BERT models, since the latter may be capable of reducing the influence of these common words on the prediction due to their complexity and performance. To remove these words, a set of common words from the NLTK package was used. Finally, regarding static pre-trained embeddings, a token (‘OOV’) was used to index words not present in the vocabulary of these embeddings. For BERT models, tokens such as ‘[CLS]’ and ‘[SEP]’ were added to delineate the start and end of a sentence, respectively.

3.3 Data Analysis

The BioCreative dataset was analysed in detail, using methods implemented in BioTMPy, to gain insight into the overall content of the documents. The 20 most frequent words for each label (after stop word removal) can be seen in Fig. 2. This shows a high content similarity between documents of both labels: only 5 of the top 20 words of each label (25%) are absent from the top 20 of the other label, a pattern also present in the top 100 most frequent words (Supplementary Material - Fig. S1), where only 13% of the words differ between labels. This analysis shows the difficulty of creating a high-performing model. Among the 5 words present only in the top 20 of the relevant label, we can observe words such as “mutation” and “mutant”, which are directly related to the topic of interest in this dataset. These words will therefore likely have a high impact on the final models’ predictions.
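The preprocessing steps of Sect. 3.2 and the frequency comparison described above can be sketched together in a few lines. This is an illustrative approximation, not the package's implementation: it uses whitespace tokenization and a tiny hand-written stop-word set where the paper uses NLTK, and the function names and toy texts are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word set; the paper uses the NLTK list instead.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "by", "to", "with"}


def preprocess(text, lowercase=True, split_hyphen=True, remove_stop=True):
    """Lower-case, split hyphenated words and drop stop words (all optional)."""
    if lowercase:
        text = text.lower()
    if split_hyphen:
        text = text.replace("-", " ")
    # Whitespace tokenization with basic punctuation stripping (an assumption).
    tokens = [tok.strip(".,;:()") for tok in text.split()]
    tokens = [tok for tok in tokens if tok]
    if remove_stop:
        tokens = [tok for tok in tokens if tok.lower() not in STOP_WORDS]
    return tokens


def top_words(texts, k):
    """The k most frequent words over a list of raw texts."""
    counts = Counter(tok for text in texts for tok in preprocess(text))
    return [word for word, _ in counts.most_common(k)]


def unique_to_first(top_a, top_b):
    """Words of top_a that do not appear in top_b."""
    other = set(top_b)
    return [word for word in top_a if word not in other]


# Hypothetical toy texts standing in for the two labels.
relevant = [
    "The mutation disrupts protein binding",
    "Mutation of the protein",
    "Protein binding assay",
]
non_relevant = [
    "Protein binding in the cell",
    "Cell protein expression",
    "Protein binding",
]
top_rel = top_words(relevant, k=3)
top_non = top_words(non_relevant, k=3)
only_relevant = unique_to_first(top_rel, top_non)
share_unique = len(only_relevant) / len(top_rel)
```

With k = 20 on the real training set, the same computation would correspond to the 5 label-specific words (25%) reported above. For BERT-style models, the flags would be switched off, since those models rely on their own tokenizers.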


Fig. 2. Twenty most frequent words from documents of each label (relevant and non-relevant) of the training set, after stop word removal. Words represented by white bars are not present in the top 20 of the opposite label.

3.4 Embeddings

A performance comparison was made between four pre-trained embeddings, namely: 1) GloVe [19] - pre-trained with texts from Wikipedia 2014 and Gigaword fifth edition; 2) BioWordVec [20] - 200-dimensional vectors, published in 2018, resulting from a fastText model pre-trained with PubMed documents and Medical Subject Headings (MeSH); 3) PubMed vectors (“pubmed pmc”) [21] - pre-trained with documents from PubMed and PMC accessible in 2013; 4) “pubmed ncbi” [22] - 100-dimensional word vectors resulting from a word2vec model pre-trained with PubMed abstracts accessible in 2016. These static embeddings were used as input to a HAN model and a CNN-BiLSTM model adapted from Burns et al. [15]. Two preprocessing options were tested: words either split or not split by hyphen. A 10-fold cross-validation was performed on the resulting 4 combinations (model × hyphen option) for each embedding. GloVe showed significantly worse performance in 8 of the 12 pairwise comparisons performed (Fig. 3). Also, despite no significant difference being observed between the other 3 embeddings, BioWordVec seems to perform slightly better, with the best mean. These results are partially in agreement with the percentage of words of our training set that are missing from the vocabularies of these embeddings (GloVe - 51%; BioWordVec - 37%; “pubmed pmc” - 37%; “pubmed ncbi” - 9%). Finally, splitting words by hyphen and using the HAN model both seem to benefit the overall performance.
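The percentage of training-set words missing from an embedding's vocabulary, quoted above, can be computed along the following lines. This is a sketch under an assumption: the source does not say whether the percentage is taken over distinct words or over all token occurrences, so distinct words are used here. The function name and the toy vocabularies are hypothetical.

```python
def oov_rate(corpus_tokens, embedding_vocab):
    """Fraction of distinct corpus words that have no pre-trained vector.

    Assumption: the rate is computed over distinct words, not occurrences.
    """
    corpus_vocab = set(corpus_tokens)
    if not corpus_vocab:
        return 0.0
    missing = corpus_vocab - set(embedding_vocab)
    return len(missing) / len(corpus_vocab)


# Toy example with two hypothetical embedding vocabularies.
corpus = ["brca1", "mutation", "binding", "mutation", "p53"]
embeddings = {
    "general": {"mutation", "binding"},
    "biomedical": {"brca1", "mutation", "binding", "p53"},
}
rates = {name: oov_rate(corpus, vocab) for name, vocab in embeddings.items()}
```

Words counted as missing here are the ones that would be mapped to the shared ‘OOV’ token during preprocessing, which is why a lower rate is expected to help a model using static embeddings.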


Fig. 3. Results from cross-validation performed with 4 distinct pre-trained embeddings using a) a Hierarchical Attention Network and b) a CNN-BiLSTM model, also testing the impact of splitting words by hyphen.

3.5 Evaluation

All hyperparameters were tuned with the Hyperband tuner, which combines random search with early stopping and adaptive resource allocation, using a factor of 3 and a random split of the training set into a 90% training set and a 10% development set. The search space for each optimization can be seen in the Supplementary Material (Tables S1–S4). Due to the high complexity of BERT models, only the top layers of the models containing BioBERT were optimized. This choice was based on previous manual tuning, in which BioBERT seemed to perform better than the other BERT models. For the final models, early stopping with a patience of 5 was used, together with the ModelCheckpoint callback, to retrieve the model with the lowest loss on the development set. For BERT models, instead of early stopping, models were saved at different epochs, because these models normally take few epochs to converge (around 2 to 4 epochs).

The scores obtained with the predictions of the models on the held-out test set of the BioCreative challenge can be seen in Table 2. The models used to make the final predictions were: a CNN-BiLSTM with an attention layer, a HAN containing 2 Bi-LSTMs, and models containing pre-trained BERT models (including BioBERT and SciBERT, which were pre-trained with scientific documents). For the first two models (CNN-BiLSTM and HAN), BioWordVec pre-trained embeddings were used, based on the comparison mentioned above. For the BERT models, some variations were also introduced: a Dense block at the top, a Bi-LSTM followed by a Dense block, or the embedding of the ‘[CLS]’ token as input to the final layers. The latter technique is commonly used because the embedding of this token is in some cases able to capture the important information of the entire text.

Table 2. Best scores obtained with the predictions of several models using the test set of the BioCreative’s challenge. At the bottom, one can see the scores of the challenge’s best submission made in 2019.

ID    Model                        Avg Prec  Precision  Recall  F1-score
CLB   CNN-BiLSTM                   0.6756    0.5980     0.7670  0.6721
HB    Hierarchical Bi-LSTM         0.7425    0.6145     0.8040  0.6966
BD    BERT (tuned) + Dense         0.7145    0.6400     0.7628  0.6960
BioD  BioBERT (tuned) + Dense      0.7798    0.6331     0.8239  0.7160
SD    SciBERT (tuned) + Dense      0.7678    0.6548     0.7813  0.7124
BioC  BioBERT (tuned) + CLS        0.7828    0.6471     0.7969  0.7104
SC    SciBERT (tuned) + CLS        0.7370    0.6298     0.7926  0.7019
BioL  BioBERT (tuned) + LSTM       0.7883    0.6611     0.7955  0.7221
SL    SciBERT (tuned) + LSTM       0.7455    0.6414     0.7798  0.7038
      Challenge’s best submission  0.7158    0.6289     0.7656  0.6906
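The training safeguards described in this section (a random 90%/10% split and early stopping with a patience of 5, keeping the epoch with the lowest development loss) can be sketched without any DL framework. The code below only mimics the decision logic on a list of precomputed losses; it does not call the Keras callbacks the authors used, and the function names, seed and loss values are hypothetical.

```python
import random


def split_train_dev(items, dev_fraction=0.1, seed=0):
    """Randomly split items into (train, dev) lists; the seed is arbitrary."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_dev = round(len(items) * dev_fraction)
    return items[n_dev:], items[:n_dev]


def early_stopping(dev_losses, patience=5):
    """Return (best_epoch, best_loss, stopped_epoch) for per-epoch dev losses.

    Epochs are numbered from 1. stopped_epoch is None when training runs
    through all the losses without exhausting the patience.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(dev_losses, start=1):
        if loss < best_loss:
            # A real pipeline would save ("checkpoint") the model here.
            best_loss, best_epoch = loss, epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, best_loss, epoch
    return best_epoch, best_loss, None


train_ids, dev_ids = split_train_dev(range(4082))
losses = [0.70, 0.62, 0.58, 0.60, 0.59, 0.61, 0.63, 0.60, 0.57]
result = early_stopping(losses, patience=5)
```

In the toy loss sequence, the lower loss at the ninth epoch is never reached because five consecutive epochs fail to improve on epoch 3; this is the usual trade-off when choosing the patience value.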

Overall, most of the models outperformed the challenge’s best submission on the test set, with the models containing BioBERT or SciBERT surpassing the best submission on all 4 metrics. Moreover, the best model (BioBERT tuned plus LSTM) achieved absolute improvements of 7.25% for average precision, 3.22% for precision, 2.99% for recall and 3.15% for F1-score.
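The improvements quoted above are absolute differences between rows of Table 2, and the F1-score is the harmonic mean of precision and recall. A short arithmetic check, using only values copied from Table 2 (the variable and function names are illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from Table 2.
best_model = {"avg_prec": 0.7883, "precision": 0.6611, "recall": 0.7955, "f1": 0.7221}
best_submission = {"avg_prec": 0.7158, "precision": 0.6289, "recall": 0.7656, "f1": 0.6906}

# Absolute gains, expressed in percentage points.
gains = {
    metric: round(100 * (best_model[metric] - best_submission[metric]), 2)
    for metric in best_model
}

# F1 recomputed from the rounded precision and recall of the best model.
recomputed_f1 = f1_score(best_model["precision"], best_model["recall"])
```

Because the tabulated precision and recall are rounded to four decimals, a recomputed F1 can differ from the tabulated one in the last digit.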

4 Discussion and Conclusions

The package developed in this work can ease the process of retrieving relevant literature for a given topic. We consider that the provided pipelines, which showcase the use of complex and recent DL models to classify biomedical documents, can be very helpful for a researcher with little expertise in this field. It is important to mention that the package can be used for both binary classification and multi-classification problems, and that several modules can be easily adapted to other text mining pipelines. During the implementation of the hyperparameter optimization and cross-validation, we found that it is difficult to balance the number of parameters that a user can choose against keeping the package intuitive to use.


Therefore, the current version of BioTMPy requires some manual adjustment of hyperparameters and cross-validation by the user to execute the complete training process of the models. In the future, we want to simplify these processes and make them more intuitive. Ideally, users could simply specify in a configuration file the methods of a pipeline that they want to perform.

The comparison between distinct pre-trained embeddings showed that BioWordVec slightly outperformed the other alternatives. This might be a consequence of some characteristics of this model, including a possibly more complete vocabulary or the use of the fastText model, which integrates sub-word information, coupled with MeSH terms. Finally, note that although GloVe was pre-trained with Wikipedia texts, it still obtained satisfactory results on biomedical literature. Nevertheless, it is clear that the use of embeddings specifically pre-trained with biomedical literature benefits the classification results.

Regarding the model comparison, models containing BioBERT or SciBERT performed better. This was expected due to their high complexity and their status as state-of-the-art models for NLP tasks. Furthermore, these models surpassed the models with BERT, likely because BioBERT and SciBERT were pre-trained with scientific documents. Moreover, their good results may be a consequence of the hyperparameter optimization performed, where the parameters of the top layers of these models were tuned using BioBERT, a decision based on prior manual tuning in which BioBERT returned the best results. Another reason for the success of BioBERT might be that this model was pre-trained with documents from PubMed, the same database used to create the dataset.
The best model (BioBERT and LSTM) is available in a web application (https://biotmpyppi.bio.di.uminho.pt/), where a user can enter PubMed IDs, a search term or PDF files to rank by relevance documents about protein-protein interactions altered by mutations. This application can act as a second filter of documents, since it uses documents retrieved from the sort system of PubMed. Using the interpretability capacity of the HAN model and analysing the top 20 words of documents from the confusion matrix (results in Supplementary Material - Figs. S2 and S3), we observed the presence of words like “mutation”, “interaction”, and some protein names (i.e. words related to the topic of the dataset), which are likely misleading the models into classifying documents as relevant when in fact they are not. To improve our results in the future, changes could be made at different levels of the pipeline, such as a different truncation process, a finer hyperparameter optimization, or other models such as BioMedRoBERTa [23].

Supplementary Material. The Supplementary Material of this work can be found at https://1drv.ms/b/s! AoJwG4IEnfhEgY86sSRWG7yF3z B6g.


Acknowledgements. This research has been supported by FCT - Fundação para a Ciência e Tecnologia through the DeepBio project - ref. NORTE-01-0247-FEDER-039831, funded by Lisboa 2020, Norte 2020, Portugal 2020 and FEDER - Fundo Europeu de Desenvolvimento Regional.

References
1. Krallinger, M., Valencia, A.: Text-mining and information-retrieval services for molecular biology (2005)
2. Mirończuk, M.M., Protasiewicz, J.: A recent overview of the state-of-the-art elements of text classification, September 2018
3. Fiorini, N., et al.: Best match: new relevance search for PubMed. PLoS Biol. 16(8), e2005343 (2018)
4. Cohen, A.M., Hersh, W.R.: A survey of current work in biomedical text mining. Briefings Bioinform. 6, 57–71 (2005)
5. Ignatow, G., Mihalcea, R.: An introduction to text mining: research design, data collection, and analysis (2018). https://study.sagepub.com/introtextmining
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding, October 2018
7. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, December 2017, NIPS, pp. 5999–6009 (2017)
8. Chollet, F.: Deep Learning with Python (2018)
9. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
10. McKinney, W., Team, P.: Pandas: powerful python data analysis toolkit, p. 1625 (2015)
11. Harris, C.R., et al.: Array programming with NumPy. Nature 585(7825), 357–362 (2020)
12. Wolf, T., et al.: HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv arXiv:1910.03771 (2019)
13. Hunter, J.D.: Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9(3), 90–95 (2007)
14. Natural language toolkit. https://www.nltk.org/
15. Burns, G.A., Li, X., Peng, N.: Building deep learning models for evidence classification from the open access biomedical literature. Database J. Biol. Databases Curation 2019 (2019)
16. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2019)
17. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text, March 2019. http://arxiv.org/abs/1903.10676
18. Islamaj Doğan, R., et al.: Overview of the BioCreative VI Precision Medicine Track: Mining protein interactions and mutations for precision medicine (2019)
19. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, pp. 1532–1543 (2014)
20. Zhang, Y., Chen, Q., Yang, Z., Lin, H., Lu, Z.: BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data 6(1), 52 (2019). www.nature.com/scientificdata
21. Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., Ananiadou, S.: Distributional semantics resources for biomedical text processing. Aistats 5, 39–44 (2013)
22. Kim, S., Fiorini, N., Wilbur, W.J., Lu, Z.: Bridging the gap: incorporating a semantic similarity measure for effectively mapping PubMed queries to documents. J. Biomed. Inform. 75, 122–127 (2017)
23. Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks, pp. 8342–8360 (2020). https://github.com/allenai/

May Gender Have an Impact on Methylation Profile and Survival Prognosis in Acute Myeloid Leukemia?
Agnieszka Cecotka1, Lukasz Krol1, Grainne O’Brien2, Christophe Badie2, and Joanna Polanska1
1 Department of Data Science and Engineering, Silesian University of Technology, Gliwice, Poland {agnieszka.cecotka,lukasz.krol,joanna.polanska}@polsl.pl
2 Public Health England, Centre for Radiation, Chemical and Environmental Hazards, Oxfordshire, UK {grainne.obrien,christophe.badie}@phe.gov.uk

Abstract. DNA methylation alteration is crucial for the initiation and development of Acute Myeloid Leukemia (AML). However, only a few epigenetic biomarkers of AML have been discovered so far. DNA methylation of CpG-rich gene promoters has the highest impact on gene expression level, so biomarkers should be sought in these genomic regions. Principal Component Analysis of methylation level reveals that male and female AML patients differ in CpG-rich gene promoters. Statistical comparison between males and females conducted for each CpG site confirms the statistical relevance of differences only in CpG-rich promoter regions. P-value integration results in the detection of AML-specific hypermethylated genomic regions and shows that almost 10% of CpG-rich promoters are differentially methylated between males and females. Functional analysis of genes with differentially methylated promoters in AML results in 322 enrichments for women and 1893 for men. Genes are mainly related to homeobox, cell development and morphogenesis. Survival analysis identifies gender-specific potential epigenomic prognostic markers. Dissimilarity of the survival-significant gene sets between males and females is 75%.

Keywords: DNA methylation · Epigenetics · Acute Myeloid Leukaemia · Sex disparity · Prognostic markers · Survival analysis

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 126–135, 2022. https://doi.org/10.1007/978-3-030-86258-9_13

1 Background

Acute myeloid leukemia (AML) is a cancer of the myeloid cell line in the bone marrow. In this malignancy, abnormal hematopoietic cells are produced and accumulated. At the same time, the production of other blood cells is defective [1]. AML is the most common acute leukemia in adults [2]. It occurs more often in men than in women [2, 3].

DNA methylation is the modification of cytosine into 5-methylcytosine in CpG sites of the genome. CpG sites are symmetrical dinucleotides, where cytosine is followed by guanine [4]. DNA methylation is an epigenetic process that controls gene transcription [5]. It plays a crucial role in cell differentiation by silencing genes [6]. DNA methylation of gene promoters has the highest impact on gene expression level, especially when the promoter lies on a CpG island [7] or shore [8]. In cancer, alterations in the DNA methylation profile can lead to the inhibition of tumor suppressor genes and the activation of proto-oncogenes [9]. Aberrant DNA methylation is the signature of AML and is essential for its initiation and development [10]. Changes in DNA methylation can vary among AML patients, but some of them have diagnostic and prognostic implications in AML. Hypermethylation of C1R, CEBPA, MEG3, CDKN2A, CDH1, HIC1, CDKN2B, CD34, RHOC, SCRN1, F2RL1, FAM92A1, MIR155HG, VWA8 is correlated with better survival. Hypermethylation of DNMT3A, GATA4, GPX3, ITGBL1, TERT, BARD1, BCL9L, CLEC11A, DEFB1, FOXD2, IGF1, IL18, ITIH1, LSP1, P2RX6, RNASE, TUBGCP2 is correlated with poor survival. However, only a few epigenetic biomarkers with clinical value have been discovered. The reason may be variety among patients, for example sex-related variety.

Differences between males and females are observed in the occurrence and mortality of different types of cancer. Most cancers affect males more often than females; AML occurs 1.66 times more often in males. Some of the differences can be explained by the impact of sex hormones, but some can be epigenetically conditioned [11]. The hormonal and genetic disparity between genders determines the effect of chemotherapy. Chemotherapy is used without considering this disparity, which results in different effectiveness and toxicity between genders [12]. Diverse gene expression can be a reason for differences between genders in cancers. Identification of these differences is crucial for the early diagnosis and prognosis of cancer [13].
Gender-related differences in methylation profile have already been reported for chronic lymphocytic leukemia (CLL), but almost 95% of them concern the X chromosome [14]. Similar studies have never been conducted for AML. The presented study aims to find differences in DNA methylation between male and female AML patients in a set of CpG sites and gene regions, as well as to detect features affecting survival time in both genders.

2 Materials and Methods

2.1 Data

In this study, we used the publicly available dataset from the TCGA-LAML project [15]. Data were obtained with the Illumina Infinium Human Methylation 450K array [16], which measures methylation level at 485,577 CpG sites. Probes that were not significantly different from the background, probes in repeat regions, probes overlapping common SNPs, and probes lying on sex chromosomes were filtered out [17]. Data for the remaining 396,065 CpG sites contain normalized measurements expressed as β-values ranging from 0 to 1 [18]. Because 2,627 (0.66%) features had missing values, the remaining 393,438 CpG sites were included in further analysis. Data for healthy donors were downloaded from the GEO database (GSE73103) [19]. These data were obtained with the same array and consist of β-values for 397,615 CpG sites.


A. Cecotka et al.

According to the annotation system provided by Illumina, a gene name, genomic region, genome location (chromosome number and locus), and CpG site density category (island, shore, shelf, open sea) are assigned to each probe [16].

Fig. 1. Scheme of genomic regions according to the Illumina annotation system

We analyzed gene CpG-rich regulatory sequence (RS) regions (related to the transcription start site (TSS1500, TSS200) and 5’UTR annotation) lying on CpG islands or shores, gene body regions (related to 1stExon, ExonBnd, and Body annotation), and 3’UTR regions (Fig. 1) separately. The number of CpG sites belonging to particular genomic regions is presented in Fig. 2. Some CpG sites belong to more than one region, so the overall number of CpG sites is not a sum of the numbers of CpG sites in particular regions.
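The region definitions above can be illustrated with a minimal sketch. The label strings and the probe identifiers below are hypothetical placeholders modelled on the annotation names used in the text, not the exact contents of the Illumina manifest, and the function is an illustration rather than the authors' implementation.

```python
# Hypothetical annotation labels, modelled on the region names used in the text.
RS_GROUPS = {"TSS1500", "TSS200", "5'UTR"}
BODY_GROUPS = {"1stExon", "ExonBnd", "Body"}
CPG_RICH_CONTEXTS = {"island", "shore"}


def assign_regions(gene_groups, cpg_context):
    """Return the set of analysed region types that one probe belongs to."""
    groups = set(gene_groups)
    regions = set()
    # Regulatory-sequence probes count only when they lie on an island or shore.
    if groups & RS_GROUPS and cpg_context in CPG_RICH_CONTEXTS:
        regions.add("CpG-rich RS")
    if groups & BODY_GROUPS:
        regions.add("Body")
    if "3'UTR" in groups:
        regions.add("3'UTR")
    return regions


# Toy probes: (gene-group annotations, CpG density context).
probes = {
    "probe_1": (["TSS200", "Body"], "island"),   # annotated to two transcripts
    "probe_2": (["TSS1500"], "opensea"),         # regulatory but not CpG-rich
    "probe_3": (["3'UTR"], "shelf"),
}

assigned = {name: assign_regions(*annot) for name, annot in probes.items()}
per_region_total = sum(len(r) for r in assigned.values())
print(assigned)
print(per_region_total, "region memberships for", len(probes), "probes")
```

Because a probe such as `probe_1` is counted in two region types, the per-region counts do not add up to the number of probes, which is the effect the paragraph describes.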

Fig. 2. Number of CpG sites among types of genomic regions in Healthy and AML datasets

AML data were collected for 140 patients. Other features provided for each patient were gender, vital status, the number of days to death (for deceased patients) or the number of days from diagnosis to last follow-up (for living patients), and information about receiving prior treatment. The clinical features were compared between genders. Proportions of vital status and of prior treatment were examined with the χ2 test, age at diagnosis with the t-test, and days to death with the Wilcoxon rank sum test (because the samples were not normally distributed). The significance level was set to 0.05. There is no evidence that males and females differ in vital status, receipt of prior treatment, age at diagnosis, or the number of days to death (survival time) (Table 1). From the healthy dataset, 39 females and 45 males of the same age were chosen. Additionally, independent validation data were obtained from our own resources during an ongoing project in the Centre for Radiation, Chemical and Environmental Hazards (PHE) laboratory. The independent dataset comprises 2 AML males, 1 AML female, 4 healthy males and 1 healthy female, and was obtained with the Illumina Infinium Human Methylation EPIC array.


Table 1. Comparison between male and female AML patients for clinical factors

Feature                   Vital status       Prior treatment    Age at diagnosis   Days to death
Values                    Alive    Dead      Yes      No        Mean (SD)          Median (MAD)
Male                      36       37        19       54        54.21 (15.67)      365 (212)
Female                    28       39        19       48        53.76 (16.38)      243 (212)
Statistical significance  p-value = 0.3720   p-value = 0.8709   p-value = 0.3844   p-value = 0.7567
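As a rough check of the clinical comparison, the χ2 test can be applied to the vital-status counts of Table 1. The sketch below assumes a Pearson χ2 test without continuity correction (the text does not say which variant was used); under that assumption the counts give a p-value of about 0.372.

```python
import math


def chi2_test_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With one degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Rows: male, female. Columns: alive, dead (counts taken from Table 1).
vital_status = [[36, 37], [28, 39]]
stat, p = chi2_test_2x2(vital_status)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

The closed-form survival function is used only because a 2x2 table has a single degree of freedom; larger tables would need a general χ2 distribution.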

2.2 Detection of Differentially Methylated CpG Sites and Genomic Regions

Principal Component Analysis (PCA) was conducted on β-values, first for all CpG sites and then for particular genomic regions separately. First, elements more than three scaled MADs [20] from the median were flagged as outliers. Then, the normality of the distribution for each CpG site was examined with the Lilliefors test. The methylation level for each CpG site was compared between males and females using the Wilcoxon rank sum test. Then, the Benjamini-Hochberg FDR correction was applied to the Wilcoxon test p-values. To detect genomic regions differentially methylated between males and females, Stouffer’s p-value integration [21] was conducted. P-values of sites annotated to the same genomic region were integrated to obtain one global p-value for the whole region [22]. Functional analysis of genes with a differentially methylated CpG-rich RS region was performed using the STRING tool [23]. Items with p-value ≤ 0.05 were considered significantly enriched.

2.3 Survival Analysis

Additionally, the methylation level of each genomic region was calculated as the median of the β-values of the CpG sites that make up the region. Survival curves were compared with the log-rank test [24], and the impact of gender on risk was checked with the Cox proportional hazards model [25]. For the survival analysis with feature selection in AML patients, the methylation level of CpG-rich RS regions, gender, race, number of days to death (for deceased patients) or number of days from diagnosis to last follow-up (for living patients), and information about receiving prior treatment were considered. It was conducted with the Broadside tool [26, 27].
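Three of the steps in Sect. 2.2 can be sketched in a few lines: outlier flagging with the scaled MAD, Benjamini-Hochberg adjustment, and Stouffer integration. This is a minimal sketch under stated assumptions, not the authors' pipeline: the scaling constant 1.4826 (consistency with the normal distribution) and the unweighted form of Stouffer's method are assumptions, and the function names are illustrative.

```python
from statistics import NormalDist, median


def mad_outliers(values, n_mads=3.0):
    """Flag values more than n_mads scaled MADs away from the median."""
    center = median(values)
    scaled_mad = 1.4826 * median(abs(v - center) for v in values)
    return [abs(v - center) > n_mads * scaled_mad for v in values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def stouffer(p_values):
    """Combine one-sided p-values with the unweighted Stouffer Z-method."""
    nd = NormalDist()
    # Clamp to keep the inverse normal CDF finite for p-values of 0 or 1.
    clamped = [min(max(p, 1e-15), 1.0 - 1e-15) for p in p_values]
    z = sum(nd.inv_cdf(1.0 - p) for p in clamped) / len(clamped) ** 0.5
    return 1.0 - nd.cdf(z)


print(mad_outliers([1, 2, 3, 4, 100]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
print(stouffer([0.05, 0.05]))
```

In this sketch the site-level p-values annotated to one region would be passed to `stouffer` to obtain the single region-level p-value described above.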

3 Results and Discussion

3.1 Principal Component Analysis

Principal Component Analysis was performed for all CpG sites and then for CpG sites assigned to particular genomic regions, for both healthy donors and AML patients. The first two components of each analysis are presented in Fig. 3.


Fig. 3. PCA on β-values of all CpG sites (A) and CpG-rich RS regions (B) in healthy donors and of all CpG sites (C) and CpG-rich RS regions (D) in AML patients

Considering all of the CpG sites, PCA does not show any clusters in either group. However, for CpG sites belonging to CpG-rich RS regions in AML patients, it reveals two distinct subgroups, distinguished by gender.

3.2 Detection of Differentially Methylated CpG Sites

The clear separation of males and females in RS regions prompted us to examine the differences between genders across CpG sites and genomic regions in AML and to compare them with healthy donors. In the Lilliefors test, 60% of features show a non-normal distribution. The Wilcoxon rank sum test was performed for each CpG site. The significance level was set to 0.025 because each test was performed as both a right-tailed and a left-tailed test. In every type of region except CpG-rich RS regions, the percentage of hypermethylated sites (after FDR correction) is similar between AML patients and healthy donors, and it is lower than the significance level. Only in CpG-rich RS regions in AML patients are the differences between males and females statistically significant. Hypermethylated CpG sites, according to FDR, are presented in Fig. 4. The number of CpG sites hypermethylated in females is very similar to the number hypermethylated in males. Of all the CpG sites that are hypermethylated in AML males and females, only 5.54% are hypermethylated in healthy males and females. Boxplots of the methylation level of the three most differentiating CpG sites in AML, with values from the validation set, are presented in Fig. 5. In two of them the relationship between methylation level in males and females is preserved. In each case, no difference between healthy males and females is observed.
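The two one-tailed tests at a significance level of 0.025 can be sketched for a single CpG site as follows. This is an illustration only: it uses the normal approximation to the Wilcoxon rank sum statistic with a tie correction, the β-values are invented toy data, and the FDR correction applied in the study is left out.

```python
from statistics import NormalDist

ALPHA = 0.025  # per-tail significance level, as in the text


def rank_sum_tails(x, y):
    """Return (p_right, p_left) for the Wilcoxon rank sum test of x against y.

    Uses the normal approximation with a tie correction and no
    continuity correction.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        ties = j - i + 1
        tie_term += ties ** 3 - ties
        rank_sum_x += average_rank * sum(
            1 for k in range(i, j + 1) if pooled[k][1] == 0
        )
        i = j + 1
    mean = n1 * (n + 1) / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0.0:
        return 1.0, 1.0  # all values tied: no evidence in either direction
    z = (rank_sum_x - mean) / variance ** 0.5
    cdf = NormalDist().cdf(z)
    return 1.0 - cdf, cdf


# Toy beta-values for one CpG site (hypothetical data).
males = [0.62, 0.70, 0.66, 0.74, 0.69]
females = [0.31, 0.40, 0.35, 0.45, 0.38]

p_right, p_left = rank_sum_tails(males, females)
if p_right <= ALPHA:
    call = "hypermethylated in males"
elif p_left <= ALPHA:
    call = "hypermethylated in females"
else:
    call = "no significant difference"
print(call)
```

Testing each tail at 0.025 keeps the overall error rate per site at 0.05 while also recording the direction of the difference.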


Fig. 4. Hypermethylated CpG sites in CpG-rich Regulatory Sequence regions and their percentage in AML and Healthy groups among chromosomes.

Fig. 5. Methylation level in exemplary CpG sites across all examined groups. Values from the independent validation set are marked with red colour.

3.3 Detection of Differentially Methylated Genomic Regions

P-values of CpG sites belonging to genomic regions of particular genes are integrated and compared to adjusted significance levels to find differentially methylated regions. Results are shown in Table 2. In AML patients, the number of genomic regions hypermethylated in males is larger in almost all types of genomic regions, the exception being Body regions. The largest disparity as well as the highest percentage of differentially methylated regions is observed in CpG-rich RS regions. Because the methylation level of the CpG-rich Regulatory Sequence region has the greatest impact on gene transcription, about 9.64% of genes (3.82% hypermethylated in females and 5.82% hypermethylated in males) can be differently expressed in males and females in AML. The high number of hypermethylated 3’UTR regions in males needs to be investigated.

Table 2. Number and percentage of different types of hypermethylated genomic regions in genders, according to integrated p-values, for healthy and AML patients

           CpG-rich RS regions   Body regions        3’UTR regions       Overall
           Females   Males       Females   Males     Females   Males     Females   Males
Healthy    272       228         260       522       574       2168      1106      2918
           2.37%     1.98%       1.54%     3.09%     5.50%     20.76%    2.85%     7.52%
AML        462       704         482       470       417       521       1361      1695
           3.82%     5.82%       2.68%     2.61%     3.80%     4.75%     3.31%     4.13%

Genes with differentially methylated CpG-rich RS regions were analyzed for their functions in the STRING tool. Interpro Keywords KW-0225, KW-0818, and KW-9995 (Disease mutation, Triplet repeat expansion, and Disease) occur in both male and female enriched items. Some of the factors are specific to genes hypermethylated in females, and many more are specific to genes hypermethylated in males. There are also several items that appear only in the analysis of the overall set of genes, i.e. genes hypermethylated in males or in females. The factors enriched among genes hypermethylated in females are connected to diseases linked to the X chromosome, e.g. mental disorders, and also to the homeobox, a short DNA fragment that occurs in genes involved in morphogenesis and organ development. Factors enriched among genes hypermethylated in males are mentioned in many publications about DNA methylation profiling in different malignancies. These genes are expected to be tumor suppressor genes or diagnostic and prognostic markers (e.g. in pancreatic cancer [28] or bladder cancer [29]). They are also connected to the homeobox and, furthermore, to morphogenesis, development of different tissues and organs, cell differentiation, and G protein pathways.

3.4 Survival Analysis

Gender has no impact on survival time in AML patients. The log-rank test shows no difference between the survival curves of the two genders (p-value = 0.3902), and the Cox proportional hazards model indicates that gender has no impact on the risk (p-value = 0.3866). However, the results of feature selection for survival analysis suggest that males and females have different prognostic markers. Each examined feature is ranked by its total effect value and z-score.
Top features for males and females were selected based on the highest total effect, the highest z-score, and the highest variance of gene region methylation among patients. Results are presented in Fig. 6. All selected features have a positive impact on survival time: higher methylation in these regions is correlated with better survival. Two of the CpG-rich Regulatory Sequence genomic regions (HOXA10 and GRIN3A) can be prognostic markers in both males and females, as they have a relatively high total effect and z-score.
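The log-rank comparison of two survival curves used in this section can be sketched as below. The survival times are invented toy data, since the patient-level TCGA values are not reproduced here, and the function is a minimal two-group version rather than the software used in the study.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi2, p_value).

    events_* hold 1 for an observed death and 0 for a censored follow-up.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = sum(1 for u, _, _ in data if u >= t)
        at_risk_a = sum(1 for u, _, g in data if u >= t and g == 0)
        deaths = sum(1 for u, e, _ in data if u == t and e)
        deaths_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        share_a = at_risk_a / at_risk
        observed_minus_expected += deaths_a - deaths * share_a
        if at_risk > 1:
            variance += (
                deaths * share_a * (1.0 - share_a)
                * (at_risk - deaths) / (at_risk - 1)
            )
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # One degree of freedom: survival function is erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


# Toy survival times in days (hypothetical), all deaths observed.
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
chi2_diff, p_diff = logrank_test(group_a, [1] * 5, group_b, [1] * 5)
chi2_same, p_same = logrank_test(group_a, [1] * 5, group_a, [1] * 5)
print(f"separated groups: p = {p_diff:.4f}")
print(f"identical groups: p = {p_same:.4f}")
```

A large p-value, such as the one reported for gender in the text, corresponds to the second case, where the observed deaths in each group match what is expected under a common survival curve.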


Fig. 6. Kaplan-Meier survival curves for genders (A) and total effect for the most important features in survival analysis (B).

4 Conclusions

The obtained results suggest differences in methylation profile between males and females in AML. Corresponding differences are not observed in healthy donors. Gender disparity in AML concerns CpG sites in CpG-rich Regulatory Sequence genomic regions. Alterations of DNA methylation in these regions have the greatest impact on gene expression. Integration of p-values of CpG sites shows that almost 10% of genes can be differentially expressed between males and females in AML. These genes are connected to many molecular processes and functions, which were examined in the functional analysis. Several enriched GO Terms, such as GO:0006935 (chemotaxis) and GO:0048870 (cell motility), are related to AML development [30]. Additionally, the expression of homeobox genes, found in the functional analysis, is correlated with epigenetic modifiers and specific to malignant hematopoiesis, which suggests their potential causal relationships [31]. Furthermore, survival analysis demonstrates that males and females with AML can have different prognostic markers. The identified gender-specific differences in epigenomic markers should be considered in the diagnosis and prognosis of AML.

Acknowledgements. This work was financed by European Social Fund POWR.03.02.00-00-I029 (AC) and SUT grant no. 02/070/BK_21/0019 (LK, JP).

Contributions. AC: data analysis, manuscript writing, LK: feature importance in survival analysis, GO: validation methylation experiment, CB: biological interpretations of results support, JP: outline of the study concept. All the authors read and accepted the manuscript.

References

1. Lowenberg, B., Downing, J.R., Burnett, A.: Acute myeloid leukemia. New Engl. J. Med. 341(14), 1051–1062 (1999)
2. Siegel, R.L., Miller, K.D., Jemal, A.: Cancer statistics. CA Cancer J. Clin. 69(1), 7–34 (2019)
3. Juliusson, G., et al.: Age and acute myeloid leukemia: real world data on decision to treat and outcomes from the Swedish Acute Leukemia Registry. Blood 113(18), 4179–4187 (2009)
4. Zemach, A., McDaniel, I.E., Silva, P., Zilberman, D.: Genome-wide evolutionary analysis of eukaryotic DNA methylation. Science 328(5980), 916–919 (2010)
5. Bird, A.: The essentials of DNA methylation. Cell 70(1), 5–8 (1992)
6. Reik, W., Dean, W., Walter, J.: Epigenetic reprogramming in mammalian development. Science 293(5532), 1089–1093 (2001)
7. Schübeler, D.: Function and information content of DNA methylation. Nature 517(7534), 321–326 (2015)
8. Irizarry, R.A., et al.: Genome-wide methylation analysis of human colon cancer reveals similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat. Genet. 41(2), 178 (2009)
9. Gonzalo, S.: Epigenetic alterations in aging. J. Appl. Physiol. 109(2), 586–597 (2010)
10. Yang, X., Wong, M., Ng, R.K.: Aberrant DNA methylation in acute myeloid leukemia and its clinical implications. Int. J. Mol. Sci. 20(18), 4576 (2019)
11. Rubin, J.B., et al.: Sex differences in cancer mechanisms. Biol. Sex Differ. 11(1), 1–29 (2020)
12. Kim, H.I., Lim, H., Moon, A.: Sex differences in cancer: epidemiology genetics and therapy. Biomol. Ther. 26(4), 335–342 (2018). https://doi.org/10.4062/biomolther.2018.103
13. Shin, J.Y., Jung, H.J., Moon, A.: Molecular markers in sex differences in cancer. Toxicol. Res. 35(4), 331–341 (2019)
14. Lin, S., et al.: Sex-related DNA methylation differences in B cell chronic lymphocytic leukemia. Biol. Sex Differ. 10(1), 2 (2019). https://doi.org/10.1186/s13293-018-0213-7
15. TCGA-LAML. https://portal.gdc.cancer.gov/projects/TCGA-LAML. Accessed 10 May 2021
16. Sandoval, J., et al.: Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics 6(6), 692–702 (2011)
17. Cancer Genome Atlas Research Network, Ley, T.J., et al.: Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. New Engl. J. Med. 368(22), 2059–2074 (2013). https://doi.org/10.1056/NEJMoa1301689
18. Houseman, E.A., et al.: Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinform. 9(1), 1–15 (2008)
19. Voisin, S., et al.: Many obesity-associated SNPs strongly associate with DNA methylation changes at proximal promoters and enhancers. Genome Med. 7, 103 (2015). https://doi.org/10.1186/s13073-015-0225-4
20. Martin, R.D., Zamar, R.H.: Bias robust estimation of scale. Ann. Stat., 991–1017 (1993)
21. Stouffer, S.A., Suchman, E.A., DeVinney, L.C., Star, S.A., Williams, R.M., Jr.: The American soldier: adjustment during army life (studies in social psychology in world war ii), vol. 1 (1949)
22. Cecotka, A., Polanska, J.: Region-specific methylation profiling in acute myeloid leukemia. Interdisc. Sci. Comput. Life Sci. 10(1), 33–42 (2018)
23. Mering, C.V., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., Snel, B.: STRING: a database of predicted functional associations between proteins. Nucleic Acids Res. 31(1), 258–261 (2003)
24. Peto, R., Peto, J.: Asymptotically efficient rank invariant test procedures. J. Royal Stat. Soc. Ser. A (General) 135(2), 185–198 (1972)
25. Cox, D.R., Oakes, D.: Analysis of Survival Data. Chapman and Hall/CRC, London (2018)
26. Krol, L.: Distributed Monte Carlo feature selection: extracting informative features out of multidimensional problems with linear speedup. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds.) BDAS 2015-2016. CCIS, vol. 613, pp. 463–474. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-34099-9_35
27. Krol, L., Polanska, J.: Multidimensional feature selection and interaction mining with decision tree based ensemble methods. In: Fdez-Riverola, F., Mohamad, M.S., Rocha, M., De Paz, J.F., Pinto, T. (eds.) PACBB 2017. AISC, vol. 616, pp. 118–125. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60816-7_15
28. Kondratyeva, L.G., et al.: Downregulation of expression of mater genes SOX9, FOXA2, and GATA4 in pancreatic cancer cells stimulated with TGFβ1 epithelial–mesenchymal transition. In: Doklady Biochemistry and Biophysics, vol. 469, no. 1, pp. 257–259. Pleiades Publishing, July 2016
29. López, J.I., et al.: A DNA hypermethylation profile reveals new potential biomarkers for the evaluation of prognosis in urothelial bladder cancer. Apmis 125(9), 787–796 (2017)
30. Chen, J., et al.: Integrating GO and KEGG terms to characterize and predict acute myeloid leukemia-related genes. Hematology 20(6), 336–342 (2015)
31. Skvarova Kramarzova, K., et al.: Homeobox gene expression in acute myeloid leukemia is linked to typical underlying molecular aberrations. J. Hematol. Oncol. 7, 94 (2014). https://doi.org/10.1186/s13045-014-0094-0

Towards a Multivariate Analysis of Genome-Scale Metabolic Models Derived from the BiGG Models Database

Alexandre Oliveira, Emanuel Cunha, Fernando Cruz, João Capela, João Sequeira, Marta Sampaio, and Oscar Dias

Centre of Biological Engineering, University of Minho, Braga, Portugal
{alexandre.oliveira,ecunha,fernando.cruz,joao.capela,jsequeira,msampaio}@ceb.uminho.pt, [email protected]

Abstract. Genome-Scale Metabolic Models (GEMs) are a relevant tool in systems biology for in silico strain optimisation and drug discovery. An easier way to reconstruct a model is to use available GEMs as templates to create the initial draft, which can be curated until a simulation-ready model is obtained. This approach is implemented in merlin’s BiGG Integration Tool, which reconstructs models from existing GEMs present in the BiGG Models database. This study aims to assess draft models generated using models from BiGG as templates for three distinct organisms, namely, Streptococcus thermophilus, Xylella fastidiosa and Mycobacterium tuberculosis. Several draft models were reconstructed using the BiGG Integration Tool and different templates (all, selected and random). The variability of the models was assessed using the reactions and metabolic functions associated with the models’ genes. This analysis showed that, even though the models shared a significant portion of reactions and metabolic functions, models from different organisms are still differentiated. Moreover, the template used to generate the draft models also seems to contribute to the variability, although to a lesser extent. This study concluded that the BiGG Integration Tool provides a fast and reliable alternative for draft reconstruction for bacteria.

Keywords: Genome-Scale Metabolic Models · Merlin · BiGG Integration Tool · BiGG models

1 Introduction

The reconstruction of comprehensive Genome-Scale Metabolic Models (GEMs) is nowadays a common approach in systems biology. The reconstruction of GEMs relies on using genomic data of a given organism to assemble a genome-wide metabolic network, which can predict the metabolic behaviour in different conditions [1, 2], using simulation methods like Flux Balance Analysis (FBA) [3]. Furthermore, these models are used for in silico strain optimisation and drug target discovery [4]. A wide variety of models are available in several online databases. Even though most reconstructed models correspond to bacterial organisms, models for more complex organisms, such as plants and mammals, have become more relevant lately [5].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 136–144, 2022. https://doi.org/10.1007/978-3-030-86258-9_14


BiGG Models is a centralised online database of high-quality, manually curated GEMs collected from the available literature [6]. Since 2010, BiGG has compiled accessible information on the models’ reactions, metabolites and genes. Currently, this knowledge base contains 108 metabolic models from a wide variety of organisms, ranging from bacteria, such as Escherichia coli, to more complex organisms like Homo sapiens. In addition, BiGG attempts to connect the information it contains with external databases and to standardise reaction and metabolite identifiers across GEMs to allow direct comparison between models [6]. The first step of the model reconstruction, in a bottom-up approach, is to create a draft metabolic network using the organism’s annotated genome and biochemical databases. However, this draft network corresponds to an incomplete set of reactions that includes gaps, dead-end metabolites and blocked reactions, requiring further curation to obtain the final model [1]. An alternative approach is to use existing GEMs as templates to create the initial draft. In this approach, reactions are added to a draft model when homologous genes in the template models are available. CarveMe implements a top-down approach that uses all reactions and metabolites from BiGG to build a universal model, which is then carved into a final simulation-ready gapless model [7]. Hence, this study aims to assess draft GEMs generated using BiGG models as templates for three distinct organisms: Streptococcus thermophilus, Xylella fastidiosa and Mycobacterium tuberculosis. For this, an in-house developed tool available in merlin, named BiGG Integration Tool (BIT) [8, 9], was used. Furthermore, we assessed the variability of the generated draft models’ reactions and metabolic functions for different reconstruction approaches and compared them with the models generated using CarveMe.

2 Results and Discussion

We created 21 draft models using three distinct approaches and analysed the models through a multivariate analysis. In detail, we reconstructed seven draft models for each bacterium: M. tuberculosis, S. thermophilus and X. fastidiosa. BIT allows draft reconstructions to be created automatically using three templates. For the first template (all), BIT uses all information available in BiGG Models, whereas for the second template (selected), the user selects a set of models from BiGG Models; in this case, three models were used. Finally, for the last template (random), BIT randomly selects a set of three models from the database. Likewise, we used CarveMe [7] to obtain a draft model for each bacterium. The models were then analysed regarding the variability of reactions. The metabolic functions of the draft models were compared through the Clusters of Orthologous Genes (COGs) database.
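The three template modes can be illustrated with a short sketch of how the template sets might be chosen. It is not BIT's implementation: the model identifiers are placeholders, the selected set is arbitrary, and the split into one all-template, one selected-template and five random-template runs per organism is an assumption that is merely consistent with seven BIT drafts per bacterium.

```python
import random


def template_sets(bigg_models, selected, n_random=5, k=3, seed=0):
    """Return the template model sets for one organism, keyed by run name."""
    rng = random.Random(seed)
    sets = {"all": list(bigg_models), "selected": list(selected)}
    for run in range(1, n_random + 1):
        # Random template: k models drawn without replacement.
        sets[f"random_{run}"] = rng.sample(list(bigg_models), k)
    return sets


# Placeholder identifiers standing in for the 108 models in BiGG.
bigg_models = [f"model_{i:03d}" for i in range(1, 109)]
user_selection = bigg_models[:3]  # hypothetical user choice of three models

organisms = ["M. tuberculosis", "S. thermophilus", "X. fastidiosa"]
plan = {
    organism: template_sets(bigg_models, user_selection, seed=index)
    for index, organism in enumerate(organisms)
}
n_drafts = sum(len(sets) for sets in plan.values())
print(n_drafts, "draft reconstructions planned")
```

Fixing the seed makes the random templates reproducible, which matters when the resulting drafts are later compared against each other.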


2.1 Genomes’ Comparative Functional Analysis

The recogniser [10] tool was used to collect COG identifiers for all species present in BiGG, as well as for two bacteria unavailable in BiGG, S. thermophilus and X. fastidiosa. As shown in Fig. 1, we analysed the principal components contributing to the variability of the COG-annotated metabolic functions. The results suggest that the organisms in the BiGG database are grouped by phylum. Moreover, there is a clear separation between eukaryotes and prokaryotes. These results corroborate the database authors’ latest publication [6], in which models from eukaryotes and prokaryotes were segregated using PCA. The similarity of metabolic functions among the three bacteria M. tuberculosis, S. thermophilus and X. fastidiosa was further analysed using the metabolic COG identifiers. According to Fig. 2, M. tuberculosis had the highest number of COG identifiers (1080), of which 544 were unique. On the other hand, X. fastidiosa and S. thermophilus shared most of their COG identifiers with M. tuberculosis, having only 155 and 126 unique COG identifiers, respectively. All organisms in this study share 226 metabolic COG identifiers. Hence, there seems to be a clear distinction in functional annotation among the organisms selected for this study. BIT and CarveMe were then used to generate draft models for S. thermophilus, X. fastidiosa and M. tuberculosis (Supplementary Material 1), representing different microorganisms, namely a lactic acid bacterium (gram-positive), a plant pathogen (gram-negative) and a well-studied bacterium that already has a GEM available in BiGG, respectively. Next, we assessed the variability of reactions and genes’ metabolic functions included in these draft reconstructions.
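The unique and shared counts reported for the COG identifiers are the regions of a three-set Venn diagram, which can be computed with plain set operations. The sketch below uses small invented identifier sets, not the real annotations.

```python
def venn3_regions(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "only_a": len(a - b - c),
        "only_b": len(b - a - c),
        "only_c": len(c - a - b),
        "a_and_b_only": len((a & b) - c),
        "a_and_c_only": len((a & c) - b),
        "b_and_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Toy COG identifier sets (hypothetical, for illustration only).
mtub = {"COG0001", "COG0002", "COG0003", "COG0004", "COG0005"}
sthe = {"COG0001", "COG0002", "COG0006"}
xfas = {"COG0001", "COG0003", "COG0007"}

regions = venn3_regions(mtub, sthe, xfas)
print(regions)
```

The seven region sizes always sum to the size of the union, which gives a quick consistency check on any published Venn diagram.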

Fig. 1. PCA plot comparing metabolic COG identifiers obtained for the organisms present in BiGG as well as S. thermophilus and X. fastidiosa. Principal Components (PC) 1 and 2 are depicted with the percentage of explained variance. PCA scores have been plotted and coloured according to the organism’s phylum, and ellipses represent the clusters obtained for Eukarya (orange) and Bacteria (blue), with the point outside both belonging to Archaea. Each dot is annotated with an organism-specific identifier, using the first letter of the genus and the first three letters of the species second name (Color figure online).


Fig. 2. Venn diagram of the COG identifiers obtained with recogniser for M. tuberculosis, S. thermophilus and X. fastidiosa. Numbers indicate total number of unique COG identifiers.

2.2 Models’ Analysis

The draft models were generated from BiGG using BIT with the mentioned templates (all, selected and random). The content of the models generated by each template was compared in a Venn diagram, representing the number of unique and shared reactions in the different models of a given organism (Fig. 3). This analysis allowed us to assess the influence of the template on the content of the models. The models obtained using the all-template have a larger number of reactions, of which over 55% are missing in the other templates’ models. Nevertheless, all-template models include most reactions of the other templates, though to a lesser extent for M. tuberculosis. One possible explanation is that BiGG includes models for M. tuberculosis, which have been used to create the draft model of this organism in the all-template. Concerning the remaining templates, homology searches may return other matches based on the selected models. Hence, this analysis suggests that the BIT template influences the number of reactions included in a draft model.

Fig. 3. Venn diagrams for reactions by organism. Several draft models have been generated from BiGG using the merlin’s BIT and different templates: all, random and selected. Venn Diagrams illustrate the number of reactions shared between the different models for each organism.


The draft reconstructions derived from the selected-template were analysed together with CarveMe’s models. The number of reactions shared among the draft models of the three bacteria is presented in the Venn diagram in Fig. 4, whereas the diagrams for the remaining BIT templates are presented in Supplementary Material 2. With the selected-template, the three resulting models shared 180 reactions. Moreover, the draft model for S. thermophilus contained the most unique reactions (556 reactions), while X. fastidiosa shared 351 reactions, mostly with M. tuberculosis, which represents 60% of its total reactions. Nevertheless, this result differs from the metabolic annotation, as M. tuberculosis had more unique COG identifiers than any other bacterium (Fig. 2).

Fig. 4. Venn diagrams for reactions of draft models created with BIT’s selected-template and CarveMe, showing the number of reactions shared between the models of different organisms.

Regarding the draft models created with CarveMe, 384 reactions are shared among the three bacteria. However, in contrast with BIT’s selected-template results, CarveMe’s draft model for X. fastidiosa has the highest number of unique reactions (443 reactions), whereas the S. thermophilus model shares 725 reactions with the other models, 527 with M. tuberculosis, and 582 with X. fastidiosa. Figure 5 displays the comparison of the draft models created with both tools. Here, we analysed the number of reactions shared among draft models of the same organism but created with BIT’s selected-template and CarveMe. Almost half of the reactions in BIT’s selected-template models are not present in CarveMe’s models. Thus, although both tools use BiGG to generate draft reconstructions, the obtained models are significantly different. However, models created with CarveMe include more reactions than BIT’s selected-template models, as the former tool generates a simulation-ready gapless model [7]. Hence, CarveMe’s models will also include artifacts, like sink and demand reactions, that are not included in the drafts generated with BIT, which can explain some of the variability. On the other hand, BIT’s models still require curation and gap filling to obtain a simulation-ready model.
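The counts quoted for the CarveMe model of S. thermophilus are consistent with the inclusion-exclusion principle, as the short check below shows. It assumes that the pairwise figures (527 and 582) each include the 384 reactions shared by all three models; the variable names are illustrative.

```python
# Counts quoted in the text for the CarveMe draft of S. thermophilus.
shared_with_mtub = 527   # assumed to include reactions shared by all three
shared_with_xfas = 582   # assumed to include reactions shared by all three
shared_by_all = 384

# Inclusion-exclusion: reactions shared with at least one other model.
shared_with_any = shared_with_mtub + shared_with_xfas - shared_by_all

# Reactions shared with exactly one of the other two models.
only_with_mtub = shared_with_mtub - shared_by_all
only_with_xfas = shared_with_xfas - shared_by_all

print(shared_with_any, only_with_mtub, only_with_xfas)  # 725 143 198
```

The result of 725 matches the total number of shared reactions stated for the S. thermophilus model.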


Fig. 5. Venn diagram for reactions by organism’s model. The number of reactions shared between the models created with the merlin’s BIT using the selected-template and those created with CarveMe was assessed for each organism.

Finally, the reaction space of all draft models was analysed by PCA, and the score plots for the first three principal components are shown in Fig. 6. These components explained 32.7% of the variability in the reaction space. Principal Component (PC) 1 separates the data into three groups. The group with the lowest scores contains four of the random-template models of M. tuberculosis, while the group with the highest scores covers all S. thermophilus models. The third group comprises all models of X. fastidiosa and the remaining models of M. tuberculosis. PC2 separates this last group by organism. PC3 does not clearly separate models by organism, although it groups CarveMe’s models together. According to the reactions’ PCA, models of the same organism are more similar to each other than to models of a different organism. Nonetheless, the template used also contributes to the variability in the models’ reaction space, but to a lesser extent.
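To illustrate what the score plots and the explained-variance percentage measure, the sketch below runs a PCA on a small binary presence/absence matrix (rows are models, columns are reactions). It is a dependency-free illustration that extracts only the first component by power iteration; the matrix is a hypothetical toy, and the actual methodology is the one described in Supplementary Material 2.

```python
import math


def pca_first_component(rows, iterations=200):
    """Return (explained variance ratio, per-row scores) for the first principal component."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the columns (reactions).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    total_variance = sum(cov[i][i] for i in range(m))
    if total_variance == 0.0:
        raise ValueError("all models are identical; PCA is undefined")
    # Power iteration from a fixed, non-uniform start vector (an arbitrary choice).
    vec = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(m))
                     for i in range(m))
    scores = [sum(r[j] * vec[j] for j in range(m)) for r in centered]
    return eigenvalue / total_variance, scores


# Hypothetical toy matrix: two models of organism A, two of organism B.
presence = [
    [1, 1, 0],  # A, template 1
    [1, 1, 0],  # A, template 2
    [0, 0, 1],  # B, template 1
    [0, 0, 1],  # B, template 2
]
ratio, scores = pca_first_component(presence)
```

In this toy case the models differ only by organism, so the first component carries essentially all of the variance and the scores of the two organisms have opposite signs. In the study, the first three components together explain only 32.7%, which indicates a much less clear-cut structure.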

Fig. 6. PCA of the draft models’ reaction space. Several draft models have been generated from the BiGG Models database using BIT (varying the template) and CarveMe package (using the universal BiGG model). Principal Components (PC) 1, 2, and 3 are depicted with the percentage of explained variance. PCA scores have been plotted and coloured according to the set of template models. Each dot is annotated with a model-specific identifier, using the first letter of the genus and the first three letters of the species second name. Since five random-template models have been created using a different set of template models, these models are also numbered accordingly. Ellipses surrounding a given set of models are merely presented for illustration purposes and do not represent real k-means clusters.


Genes used in the draft models were retrieved and cross-referenced with the COG annotation of the genomes to assess the metabolic functions included in the draft models. The metabolic annotation of BIT’s selected-template and CarveMe models was assessed using a Venn diagram (Fig. 7). A substantial portion of COG functions is shared among all models created using BIT’s selected-template. Likewise, models created with CarveMe also reveal the same pool of common metabolic functions. However, a similar number of COG functions is unique to each draft model created with BIT’s selected-template and CarveMe tools. In contrast, the metabolic COG annotation performed on the three organisms indicates a smaller portion of common metabolic functions and higher percentages of unique metabolic COGs for each organism, suggesting that the representation of the metabolism in the draft models is still incomplete. Interestingly, the large number of unique reactions among the draft models created with both BIT’s selected-template and CarveMe tools does not support the metabolic annotation.

Fig. 7. Venn diagram for the metabolic COG annotation of the draft models generated using BIT’s selected-template and CarveMe.

The collections of COG identifiers obtained for each model were then represented in a scatter plot using PCA scores (Fig. 8). This PCA suggested that neither the tool nor the template used to generate draft models significantly impacts the model’s metabolic characterisation: the metabolic COG annotation obtained for each model does not seem to change significantly with the template or method. According to PC 1 (Fig. 8), the functional characterisations of the S. thermophilus models can be differentiated from those of the X. fastidiosa and M. tuberculosis models. Likewise, X. fastidiosa and M. tuberculosis models obtained different PCA scores according to PC 2. These results show that all three methodologies result in similar sets of metabolic genes, distinct from other organisms’ sets of metabolic genes. Although the metabolic characterisation of the draft models’ genes allows differentiating models by organism rather than by template or method, the analysis of both reaction spaces still suggests that models seem to share a significant portion of reactions (Fig. 6).


Fig. 8. PCA of the draft models’ metabolic COG annotation. Several draft models have been generated from the BiGG Models database using BIT (varying the template) and CarveMe package (using the universal BiGG model). Principal Components (PC) 1, 2, and 3 are depicted with the percentage of explained variance. PCA scores have been plotted and coloured according to the set of template models. Each dot is annotated with a model-specific identifier, using the first letter of the genus and the first three letters of the species second name. Since five random-template models have been created using a different set of template models, these models are numbered accordingly. Ellipses surrounding a given set of models are merely presented for illustration purposes and do not represent real k-means clusters.

3 Conclusion

This study concludes that BIT can reconstruct differentiated draft models from BiGG, regarding reactions and metabolic functions of the models’ genes. This means that BiGG can be used as a source of templates in a bottom-up approach, as it appears to generate distinct models for different species. Nevertheless, because of the distribution of organisms analysed in this work, this can only be stated for simple bacterial organisms. Therefore, further analysis is required to assess the applicability of this method in more complex organisms. Moreover, this tool presents an easy and fast alternative to reconstruct GEMs. However, it must be considered that the template used can also affect the resulting drafts. Thus, it must be carefully selected for higher-quality results. Further curation and gap-filling will still be required to obtain the final simulation-ready model.

4 Materials and Methods

4.1 Genomes’ Comparative Functional Analysis

The functional comparison was performed for each organism with a corresponding BiGG model and for S. thermophilus and X. fastidiosa. The COG database is a popular resource for functional characterisation [11] and was used as the reference for functional annotation with reCOGnizer [10], as described in Supplementary Material 3.


4.2 Draft Models

BIT was used to reconstruct draft models from BiGG. A detailed description of how the tool works is presented in Supplementary Material 4. Seven drafts were reconstructed for each organism, using different templates: all, selected and random. In addition, three drafts were reconstructed using CarveMe for comparison. The methods used to reconstruct the drafts are described in detail in Supplementary Material 1.

4.3 Multivariate Analysis

The reactions and metabolic functions of the 24 draft models were compared using Venn diagrams and PCA plots. The methodology used for this analysis is presented in Supplementary Material 2.

5 Supplementary Materials

All Supplementary Material files mentioned in the manuscript are available at https://nextcloud.bio.di.uminho.pt/s/GZC2577Nz7K4AqP. All the scripts used for this work are available at https://github.com/BioSystemsUM/bit-analysis.

References

1. Thiele, I., Palsson, B.Ø.: A protocol for generating a high-quality genome-scale metabolic reconstruction. Nat. Protoc. 5, 93–121 (2010)
2. Feist, A.M., Herrgård, M.J., Thiele, I., Reed, J.L., Palsson, B.: Reconstruction of biochemical networks in microorganisms. Nat. Rev. Microbiol. 7, 129–143 (2009)
3. Orth, J.D., Thiele, I., Palsson, B.Ø.: What is flux balance? Nat. Biotechnol. 28, 245–248 (2010)
4. O’Brien, E.J., Monk, J.M., Palsson, B.O.: Using genome-scale models to predict biological capabilities. Cell 161, 971–987 (2015)
5. Zhang, C., Hua, Q.: Applications of genome-scale metabolic models in biotechnology and systems medicine. Front. Physiol. 6, 1–8 (2016)
6. Norsigian, C.J., et al.: BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree. Nucleic Acids Res. 48, D402–D406 (2020)
7. Machado, D., Andrejev, S., Tramontano, M., Patil, K.R.: Fast automated reconstruction of genome-scale metabolic models for microbial species and communities. Nucleic Acids Res. 46, 7542–7553 (2018)
8. Dias, O., Rocha, M., Ferreira, E.C., Rocha, I.: Reconstructing genome-scale metabolic models with merlin. Nucleic Acids Res. 43, 3899–3910 (2015)
9. Capela, J., et al.: merlin v4.0: an updated platform for the reconstruction of high-quality genome-scale metabolic models. bioRxiv (2021)
10. Sequeira, J.C., Rocha, M., Alves, M.M., Salvador, A.F.: UPIMAPI, reCOGnizer and KEGGCharter: three tools for functional annotation. In: BOD 2021 - X Bioinformatics Open Days. Braga, Portugal, vol. 57 (2021)
11. Galperin, M.Y., Kristensen, D.M., Makarova, K.S., Wolf, Y.I., Koonin, E.V.: Microbial genome analysis: The COG approach. Brief. Bioinform. 20, 1063–1070 (2019)

A Comparison of Different Compound Representations for Drug Sensitivity Prediction

Delora Baptista, João Correia, Bruno Pereira, and Miguel Rocha

Centre of Biological Engineering, University of Minho, Campus of Gualtar, Braga, Portugal

Abstract. Deep learning (DL) has become increasingly popular in the field of drug discovery. A large variety of end-to-end DL methods for chemical compounds have recently been proposed in the literature, potentially eliminating the need for expert-designed compound representations. This study aims to determine which types of representations and DL algorithms are most suitable for the specific problem of anticancer drug response prediction. A newly developed chemoinformatics package called DeepMol was used to benchmark 12 different compound representation methods on 5 anti-cancer drug sensitivity datasets. We found that DL models that are able to learn compound representations directly from SMILES strings or molecular graphs can perform as well as or even better than models trained on molecular fingerprints, even on smaller datasets. We also conclude that popular molecular fingerprints might not always be the best choice and less well-known fingerprints might be worth exploring in future drug response prediction studies.

1 Introduction

In recent years, machine learning (ML) has become an important tool for computer-aided drug design. Quantitative structure-activity relationship (QSAR) modeling, for example, typically uses ML models to predict molecular properties or the bioactivity of compounds. ML has been an especially popular choice for the modeling of drug response in cancer [2,3]. One of the first steps in an ML workflow for drug discovery is the selection of suitable input features. Chemical compounds are typically represented using molecular descriptors or molecular fingerprints. Molecular descriptors are the experimentally obtained or theoretical physical and chemical properties of a compound. Molecular fingerprints encode molecules as bit or count vectors. The type of information that is encoded depends on the type of fingerprint. Circular fingerprints, for example, describe the surrounding environment of each atom in the molecule up to a predefined radius, while substructure key-based fingerprints set each bit of the bit vector to one or zero depending on whether the compound contains the corresponding substructure or feature from a list of predefined structural keys [6].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 145–154, 2022. https://doi.org/10.1007/978-3-030-86258-9_15


Another alternative is to use end-to-end DL approaches that can learn relevant features directly from raw input data, eliminating the need for explicitly calculated descriptors and fingerprints. Certain types of DL algorithms are able to learn directly from line notations, such as Simplified Molecular-Input Line-Entry System (SMILES) strings. Recurrent neural networks (RNNs) and 1D convolutional neural networks (CNNs) are particularly well-suited to this type of sequential input. Graph neural networks (GNNs) are able to learn features from compound structures represented as molecular graphs.

In recent years, several benchmarking studies have been undertaken to determine whether learned representations of compounds perform better than traditional descriptors and fingerprints. Hop et al. found that GNNs outperformed fully-connected neural networks (FCNNs) trained on molecular fingerprints on a majority of benchmark tasks [11]. In contrast, another study [17] reached the opposite conclusion: learned representations performed worse than traditional features. An extensive benchmarking study using the MoleculeNet collection of data sets concluded that learned representations do not perform well when the training data set is small or highly imbalanced [26]. Therefore, the most suitable representation for a given prediction problem probably depends on the type of problem itself, as well as other factors such as dataset size, making it essential to evaluate this for each specific application.

In this study, we tested several compound featurization methods and DL algorithms to try to understand which representations are the most suitable for predicting drug sensitivity in cancer cell lines. We used a new Python package, developed in-house and named DeepMol, to perform this analysis. All of the data sets and scripts used for this study are available online at https://github.com/BioSystemsUM/DeepMol/tree/pacbb21/pacbb21_paper.

2 Methods

2.1 Data Sets

We benchmarked a selection of DL models on several human cancer cell line drug screening datasets (Table 1). Single-cell line datasets were used so that it would be possible to study the effect of different compound representations without having to take cell line features into account.

Table 1. Details on the datasets used in this study.

Dataset      Compounds   Output variable           Task type
NCI 1        3466        Sensitive/Not sensitive   Classification
NCI 109      3431        Sensitive/Not sensitive   Classification
PC-3         4294        −log(IC50)                Regression
CCRF-CEM     3047        −log(IC50)                Regression
A549/ATCC    20730       −log(GI50)                Regression


The NCI 1 and NCI 109 human tumor cell line growth inhibition datasets were used to develop binary classification models. These datasets were chosen because they have been widely used in the literature to validate graph classification algorithms. In our case, we opted to use balanced versions of these datasets [20], available from https://github.com/shiruipan/graph_datasets. In each dataset, the output variable indicates whether a given compound is active or inactive in a specific cell line.

To develop and evaluate regression models, we used two single-cell line cytotoxicity datasets (PC-3 and CCRF-CEM) from a recent drug sensitivity prediction study [8]. The authors of this study obtained the original datasets from ChEMBL [18], performed some filtering and data cleaning steps, and transformed the original half maximal inhibitory concentration (IC50) values into −log(IC50) values (pIC50). These pIC50 values were used as the output variable in our regression models.

The previously mentioned datasets are all relatively small, each comprising fewer than 5,000 compounds. However, these smaller datasets were preferred because we assumed that they would more closely reflect the behavior of the compound representation methods when used in drug sensitivity prediction models trained on publicly available anti-cancer screening datasets, which usually have data for even fewer compounds. Indeed, the original version of the popular Genomics of Drug Sensitivity in Cancer (GDSC) resource, for example, provides access to a dataset (GDSC1) containing screening data for only 367 compounds, while the Cancer Therapeutics Response Portal (CTRPv2) dataset has data for only 481 compounds. Nevertheless, we also evaluated DL models on a larger dataset derived from the National Cancer Institute 60 Human Cancer Cell Line Screen (NCI-60) dataset.
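As a small worked example of the −log(IC50) transformation mentioned above, the sketch below converts a concentration to a pIC50 value. It assumes the usual convention of a base-10 logarithm applied to the concentration expressed in mol/L; the text does not state the units of the source data, so the `unit` argument and the function name are illustrative assumptions.

```python
import math

# Conversion factors to mol/L (assumed convention for pIC50).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9}


def to_pic50(ic50, unit="nM"):
    """Return -log10 of an IC50 value after converting it to mol/L."""
    if ic50 <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50 * UNIT_TO_MOLAR[unit])


potent = to_pic50(10, "nM")     # lower IC50, higher pIC50
weak = to_pic50(10, "uM")       # higher IC50, lower pIC50
```

Because of the negative logarithm, more potent compounds receive larger target values, and a tenfold change in IC50 corresponds to a change of one pIC50 unit. This is worth keeping in mind when reading RMSE values on this scale.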
We selected the A549/ATCC cell line and, after removing low-quality experiments, compounds without sensitivity data, and compounds that we were unable to map to SMILES strings using the files provided by the Developmental Therapeutics Program (DTP), we obtained a dataset with sensitivity values, measured as −log(GI50) (where GI50 is the half maximal growth inhibition concentration), for 20,730 compounds. Prior to modeling, all SMILES strings were preprocessed using the ChEMBL Structure Pipeline [4].

2.2 Models

2.2.1 Pre-computed Features

We evaluated six different types of molecular fingerprints in this work: extended connectivity fingerprint (ECFP) (ECFP4 and ECFP6), Molecular ACCess System (MACCS) keys, atom pair fingerprints (AtomPair), RDKit fingerprints (RDKitFP) and RDKit layered fingerprints (LayeredFP). ECFP fingerprints [23] are a popular circular fingerprint based on the Morgan algorithm [19]. ECFP4 fingerprints use a radius of 2 to define the circular neighborhood surrounding each atom, while ECFP6 fingerprints use a radius of 3. MACCS is a type of substructure key-based fingerprint which uses 166 predefined keys [9]. The AtomPair fingerprint is a topological fingerprint based on determining the shortest distance between all pairs of atoms within a molecule [5]. The RDKitFP is another topological fingerprint that was developed by the RDKit [16] project. The algorithm finds all subgraphs in a molecule containing a number of bonds within a predefined range, hashes the subgraphs, and then uses these hashes to generate a bit vector of fixed length. The LayeredFP [16] uses the same algorithm as the RDKitFP to identify subgraphs, but different bits are set in the final fingerprint based on different “layers” (different atom and bond type definitions). Compound structures were encoded as bit vectors using each fingerprinting algorithm and the resulting fingerprints were used as inputs to FCNNs. With the exception of MACCS fingerprints, which have a fixed length, we limited the size of all fingerprints to 1024 bits.

2.2.2 Mol2vec Embeddings

Mol2vec is an unsupervised method that generates continuous vectors representing molecules using the Word2vec word embedding algorithm [12]. Each molecule is considered a “sentence” and molecular substructures (calculated using the Morgan algorithm) are considered “words”. We used a pre-trained Mol2vec model to generate 300-dimensional embeddings for the molecules in each dataset, and fed these embeddings into FCNNs.

2.2.3 TextCNN

TextCNN is a 1D CNN that was originally developed for sentence classification [13]. We used a modified version of this algorithm (implemented in DeepChem [22]) which uses one-hot encoded SMILES strings as inputs instead of words. It applies several 1D convolutional filters, followed by a max-over-time pooling operation, which summarizes each filter using its maximum value. These learned features are then fed into fully-connected layers to predict the output.

2.2.4 Graph Neural Networks

The structure of a chemical compound can be represented as a molecular graph, where nodes are atoms and edges represent bonds.
Conventional neural network (NN) architectures such as FCNNs or CNNs are unable to learn directly from this type of data. GNNs generalize deep neural networks to graph-structured data. Inputs to a GNN are usually node features (e.g. atom type) and adjacency matrices encoding the structure of the graph, and can sometimes include edge features as well. The node and edge features are used to initialize the graph. In general, GNNs apply learnable functions to update the node-level representations, progressively incorporating information about the neighborhood of a node into its representation. After several rounds of updates, a pooling operation can be used to obtain a graph-level (molecular-level) representation. The graph-level representations can then be fed into fully-connected layers to predict a given output. In this work, we benchmarked four different GNN algorithms: neural fingerprints (GraphConv) [10], graph convolutional network (GCN) [15], graph attention network (GAT) [25] and the AttentiveFP algorithm [27].
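The generic update-then-pool scheme described above can be illustrated with a deliberately small sketch. It is not any of the four benchmarked algorithms: the update rule (a ReLU over a weighted self term plus a weighted sum of neighbor features), the fixed identity weights and the three-atom toy molecule are all assumptions chosen to keep the arithmetic easy to follow. In a real GNN the weight matrices are learned.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def message_passing_round(features, adjacency, w_self, w_neigh):
    """One update: h_v <- relu(w_self . h_v + w_neigh . sum of neighbor features)."""
    updated = []
    for node, h in enumerate(features):
        aggregated = [0.0] * len(h)
        for neighbor in adjacency[node]:
            aggregated = [a + x for a, x in zip(aggregated, features[neighbor])]
        combined = [s + n for s, n in
                    zip(matvec(w_self, h), matvec(w_neigh, aggregated))]
        updated.append([max(0.0, c) for c in combined])
    return updated


def sum_pool(features):
    """Graph-level representation: element-wise sum over all nodes."""
    return [sum(column) for column in zip(*features)]


# Toy molecular graph for the heavy atoms of ethanol, C-C-O.
# Node features are one-hot atom types in the order [C, O].
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not learned values

updated = message_passing_round(features, adjacency, identity, identity)
graph_vector = sum_pool(updated)
```

After one round, the middle carbon's representation already reflects both of its neighbors, while the terminal atoms only see one. Further rounds would propagate information across longer paths before pooling.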

2.3 Model Training and Evaluation

Each data set was split into a training set (70%) and a test set (30%). All models were trained and evaluated using the same splits for each data set. All models were trained for 100 epochs with a batch size of 256 samples, and used the Adam [14] optimization algorithm. Binary cross-entropy was used as the loss function for all classification models, while the mean squared error was used for regression models. Other model-specific hyperparameters were tuned using a 5-fold cross-validated randomized search, in which 30 different hyperparameter combinations were tested. These included the number of hidden layers and hidden units and the use of regularization methods such as L2 weight regularization and dropout [24], among others. The best model that was found for each algorithm was then refit on the entire training set and evaluated on the held-out test set. Additional details on the models (including the full search space and the best hyperparameters found for each type of model) are available online (https://github.com/BioSystemsUM/DeepMol/blob/pacbb21/pacbb21_paper/supplementary_material.pdf).

2.4 DeepMol

All preprocessing, featurization and modeling steps were implemented using DeepMol, a newly developed chemoinformatics package. DeepMol is a Python-based machine and deep learning framework for drug discovery. It offers a variety of functionalities that enable a smoother approach to many drug discovery and chemoinformatics problems. This framework uses TensorFlow [1], Keras [7], scikit-learn [21] and DeepChem [22] to either build custom ML and DL models or make use of pre-built models. It also uses the RDKit [16] framework to perform operations on molecular data. Regarding compound standardization, it allows users to use the ChEMBL Structure Pipeline [4] or apply custom standardization steps using RDKit [16] standardization methods. These steps include standardization of some non-standard valence states, molecule sanitization, charge neutralization, stereochemistry removal, the removal of smaller fragments and kekulization, among others. DeepMol also offers several featurization methods including molecular fingerprints, molecular embeddings and graph-based featurizers. In summary, DeepMol offers a complete workflow to perform machine and deep learning tasks for molecules represented as SMILES strings. It has modules that perform standard tasks such as loading and standardizing data, computing molecular features, performing feature selection and data splitting. It also provides methods to deal with unbalanced datasets and to do unsupervised exploration of the data. This way, DeepMol provides a common platform to treat the data and build, train, optimize and evaluate ML and DL models using different ML frameworks.
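To make the evaluation protocol of Sect. 2.3 concrete (a 70/30 split, followed by a 5-fold cross-validated randomized search over 30 hyperparameter combinations), the following sketch reproduces its control flow in plain Python. It is not DeepMol's API: every function name, the search space and the `toy_score` stand-in for "train a model and return its validation score" are hypothetical.

```python
import random


def train_test_split(items, test_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_folds(items, k=5):
    """Yield (training, validation) lists for each of the k folds."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


def randomized_search(train, search_space, evaluate, n_iter=30, k=5, seed=0):
    """Sample n_iter random combinations; keep the best mean validation score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(options)
                  for name, options in search_space.items()}
        scores = [evaluate(params, tr, va) for tr, va in k_folds(train, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, training, validation):
    # Hypothetical stand-in: a real implementation would fit a model on
    # `training` and score it on `validation`. Higher is better.
    return -abs(params["hidden_layers"] - 2) - params["dropout"]


compounds = list(range(100))  # placeholder compound identifiers
train, test = train_test_split(compounds)
search_space = {"hidden_layers": [1, 2, 3], "dropout": [0.0, 0.25, 0.5]}
best_params, best_score = randomized_search(train, search_space, toy_score)
```

The test set is never touched during the search. The selected configuration would then be refit on all of `train` and scored once on `test`, as described in Sect. 2.3. Because only 30 combinations are sampled, the search may miss the best configuration in a large space, a limitation the authors acknowledge in their discussion.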

3 Results and Discussion

In this section, we report and discuss the performance of 12 DL algorithms benchmarked on 5 drug response datasets of variable size and with different output variables. Figure 1 reports the results for classification tasks, with model performance quantified using the area under the receiver operating characteristic curve (ROC-AUC). For regression problems, model performance scores are reported in Fig. 2, using the root mean squared error (RMSE) values. The full results tables and plots for additional scoring metrics are available from https://github.com/BioSystemsUM/DeepMol/tree/pacbb21/pacbb21_paper/results.
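For readers less familiar with the two scoring metrics, the sketch below computes them from first principles. It is a didactic, dependency-free version and not the code used to produce the figures; the example labels and predictions are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up example: two inactive and two active compounds.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
error = rmse([6.0, 5.0, 7.0], [6.0, 5.0, 9.0])
```

A ROC-AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, whereas RMSE is expressed in the units of the target (here, −log concentration units), so the two figures are not directly comparable.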

Fig. 1. Performance (ROC-AUC) of different deep learning models on the NCI 1 and NCI 109 classification tasks. Higher scores mean better performance.

ECFP4 fingerprints outperformed other fingerprints and end-to-end DL models on the NCI 1 classification task, having achieved a ROC-AUC score of 0.83. The LayeredFP model achieved a very similar ROC-AUC score (also 0.83, when rounded), while the best end-to-end DL model (TextCNN) reached a score of 0.81. On the NCI 109 dataset, the GCN algorithm ranked first in terms of performance, but other methods such as LayeredFP, TextCNN and GraphConv were not far behind (Fig. 1). With the exception of the LayeredFP model, fingerprint-based methods generally performed worse than most of the end-to-end DL models on this dataset.

On the PC-3 dataset, the TextCNN model achieved the lowest RMSE (0.61), followed by the LayeredFP model and the GraphConv model, both with an RMSE of 0.65 (Fig. 2). Other GNNs did not perform as well as GraphConv, having been surpassed by most of the fingerprint-based models. In the CCRF-CEM regression task, several fingerprint-based models (AtomPair, RDKitFP and MACCS) outperformed the best end-to-end DL model, which was once again the TextCNN model (RMSE = 0.76) (Fig. 2). On the larger A549 dataset (Fig. 2), TextCNN was the best model (RMSE = 0.79), followed by AtomPair and GCN, which both reached an RMSE value of 0.81. The increase in dataset size did not benefit all of the end-to-end DL models, however.

Fig. 2. Performance (RMSE) of different deep learning models on the PC-3, CCRF-CEM and NCI-60 A549 regression tasks. Lower scores mean better performance.

In general, performance scores were similar between many of the models. This finding is in agreement with the MoleculeNet benchmarking study, which was also unable to find clear differences between models when benchmarking on smaller datasets (fewer than 3000 compounds) [26]. The best type of compound representation method seems to depend on the dataset itself, even when the prediction tasks are similar. Other factors, such as the particular data split that was used or the limited number of hyperparameter combinations that were explored using random search, could also have influenced the results.

Surprisingly, results were consistently worse when using Mol2vec embeddings, which were generated using a model that had been pre-trained on a dataset with over 19 million molecules [12]. The Mol2vec models performed poorly on the training sets as well, indicating that these models were underfitting. Dataset-specific fine-tuning of the embedding model may be necessary to improve the performance of FCNNs trained on these embeddings.

End-to-end DL models performed as well as, and at times even surpassed, models trained on pre-computed features. TextCNN models, in particular, ranked highly across all datasets. The GCN and GraphConv models also performed well on some of the prediction tasks. This is contrary to what would be expected given the limited number of compounds and the fact that these models were not pre-trained on larger chemical datasets beforehand. GNNs with attention mechanisms (GAT and AttentiveFP), however, did not perform as well as the other end-to-end methods on the datasets that were used in this study.

Regarding molecular fingerprints, LayeredFP consistently performed well across all datasets, and models trained using atom pair fingerprints also performed relatively well, both outperforming the more popular ECFP fingerprints in 4 out of 5 datasets. LayeredFPs are able to encode information about larger subsets of the molecular graphs than ECFPs, capturing more information on the global structure of the molecules. AtomPair fingerprints also encode global features since all pairs of atoms are taken into account. The results suggest that global molecular features might be important for the prediction of drug sensitivity. Therefore, these less well-known fingerprints might be interesting alternatives to some of the more commonly used options, at least for drug response prediction tasks.

Despite the use of regularization methods, most models still had a tendency to overfit. In the future, we will implement an early stopping mechanism similar to the Keras EarlyStopping callback for all models in DeepMol to try to mitigate this.
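The early stopping idea mentioned above can be sketched as a small patience counter on the validation loss. This is only an illustration of the general mechanism, loosely following the `patience` and `min_delta` notions of the Keras callback; the class, its method names and the loss values are hypothetical and do not describe a DeepMol feature.

```python
class EarlyStopping:
    """Stop training when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.wait = 0

    def should_stop(self, epoch, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(epoch, loss):
        stopped_at = epoch
        break
```

With these values, training halts three epochs after the best validation loss, and the weights from `stopper.best_epoch` would typically be restored. With a fixed budget of 100 epochs and no such check, a model keeps fitting the training set after validation performance has peaked.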

4 Conclusion

In this study, we used DeepMol, a new Python package developed in-house for chemoinformatics, to evaluate the performance of different compound representation methods and DL algorithms on several cancer drug sensitivity prediction tasks. We found that end-to-end DL models are capable of outperforming traditional fingerprint-based models even on small datasets. Additionally, less well-known molecular fingerprints may be interesting alternatives to some of the more popular types of molecular fingerprints. These findings can help guide the development of new DL-based drug response prediction models trained on screening data from other screening projects. The DeepMol framework is under constant development and it is currently at a pre-release version. New models and features will be added in the future.

References 1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, vol. 16, pp. 265–283 (2016) 2. Adam, G., Ramp´ aˇsek, L., Safikhani, Z., Smirnov, P., Haibe-Kains, B., Goldenberg, A.: Machine learning approaches to drug response prediction: challenges and recent progress. NPJ Precis. Oncol. 4(1), 19 (2020). https://doi.org/10.1038/s41698-0200122-1 3. Ali, M., Aittokallio, T.: Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys. Rev. 11(1), 31–39 (2018). https://doi.org/10.1007/s12551-018-0446-z

A Comparison of Different Compound Representations

153


Combinatorial Optimization of Succinate Production in Escherichia coli

Vítor Pereira and Miguel Rocha

Centre of Biological Engineering, Department of Informatics, University of Minho, Braga, Portugal
[email protected], [email protected]

Abstract. Genome-scale metabolic models are mathematical formulations widely used to describe the relationship between cells’ genotype and phenotype. Over the years, several attempts have been made to expand these formulations with macromolecular expression. Recently, GECKO models proposed the inclusion of enzyme mass constraints to improve the phenotype predictions of a yeast genome-scale metabolic model. Taking a step forward, the ETFL formulation includes the gene expression machinery, enabling models to compute the entire metabolic and gene expression proteome in a growing cell. These formulations may lead to more biologically accurate predictions and improve the design of new strains. The present work explores the use of such models for the optimization of succinate production in Escherichia coli, taken here as a case study to show the potential of using different modeling approaches in strain design applications. All the optimizations were conducted using MEWpy, a recently proposed Metabolic Engineering Framework.

1 Introduction

One of the most challenging goals of Metabolic Engineering is the prediction of microbial strains’ behaviour [1–3]. Besides providing a better understanding of cell metabolism, this task allows for discovering genetic modifications that favor the increased production of compounds of interest. To that end, several constraint-based approaches have been developed in an attempt to achieve better prediction of in vitro and in vivo cell behaviour. Traditional stoichiometric approaches try to find strains that increase production by performing Flux Balance Analysis (FBA) and modifying metabolic reactions’ fluxes through targeted genetic modifications. However, such a strategy is oblivious to other equally essential parameters of biological systems (e.g., enzyme kinetics and abundance, transcriptional regulation, signaling). These elements can profoundly affect cells’ metabolism and, consequently, simplified formulations may lead to inaccurate phenotype predictions. Incorporating some of those elements in metabolic models, typically in the form of additional constraints, may provide more accurate predictions and improve strain optimization methods.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 155–164, 2022. https://doi.org/10.1007/978-3-030-86258-9_16


With the increased availability of omics data, some new constraint-based modeling approaches have been proposed, which consider additional factors known to influence cell metabolism, and thus are helpful when designing mutant strains for enhanced compound production. In particular, the GECKO [4] and sMOMENT [5] methods incorporate enzymatic constraints into stoichiometric models to achieve better phenotype predictions. On the other hand, the authors in [6] proposed implementations of the metabolism and expression model formalism (ME-models) that present a hierarchical model formulation, from metabolism to RNA synthesis. The Expression Flux (EFL) models allow simulating enzyme and mRNA concentration levels, while Expression and Thermodynamics Flux (ETFL) models additionally include thermodynamics-compliant intracellular fluxes.

We recently made available MEWpy [7], a metabolic engineering workbench in Python that, among other features, offers a set of evolutionary computation-based combinatorial strain design optimization methods. MEWpy delivers a practical interface to several strain optimization heuristics, allowing users to simulate and optimize microbial production on Genome-Scale Metabolic Models (GSMMs) that define gene-protein-reaction (GPR) associations, but also on models enhanced with enzymatic, transcriptional, and translational layers, including GECKO, sMOMENT, EFL and ETFL models. To our knowledge, MEWpy is the only computational tool capable of fully exploiting the modeling paradigms mentioned above for strain optimization.

The present work aims to assess MEWpy strain design capabilities, considering as a case study the overproduction of succinate in Escherichia coli. Succinate is used as a building block to produce polyurethanes, resins, polybutylene succinate (PBS), and plasticizers, and as a precursor for chemicals such as 1,4-butanediol. Escherichia coli strains have been engineered for the increased production of this product, mostly anaerobically. The potential to produce succinate aerobically can offer further significant advantages, particularly by allowing faster biomass generation, carbon throughput, and product formation [8].

Here, the optimization of succinate production was conducted using four modeling approaches: a GSMM that includes the metabolic network and gene-protein-reaction rules, the same GSMM with added enzymatic constraints (GECKO-like) obtained using the autoPACMEN toolbox [5], and two metabolic and expression models (EFL), one of which includes thermodynamic constraints (ETFL). All the models are derived from the iJO1366 GSMM [9]. To the best of the authors’ knowledge, this is the first study to combine these different phenotype simulation methods in a strain design case study, and thus a valuable contribution to the metabolic engineering field.

2 Strain Optimization with MEWpy

Enzyme abundances and kinetics can readily impact the level of metabolic fluxes of a given pathway, either positively or negatively. The Metabolic Engineering Workbench in Python (MEWpy) provides methods to account for enzymatic restrictions when dealing with constraint-based models (CBMs). It offers a range of different strategies, which depend on the model’s formulation. One of MEWpy’s highlighted features is its support of GECKO, sMOMENT, ETFL, and OptRAM [11] models, allowing users to model and optimize microbial production on GSMMs defining gene-protein-reaction associations, but also on models enhanced with transcriptional and translational layers. Metaheuristics such as Evolutionary Algorithms (EAs) and Simulated Annealing (SA) drive the optimization towards the best set of enzymes or genes to under/over-express or delete so that the production of a target compound is maximized. EAs are population-based algorithms able to perform multi-objective optimization. Each solution in the population encodes a combination of genetic modifications whose quality, or fitness, is assessed against defined objectives using FBA phenotype prediction methods. Mimicking the Darwinian evolutionary principles of variation, inheritance, selection, and time, solutions are mated and mutated to produce a new pool of offspring, from which the “best” are selected to integrate the next generation.

2.1 Constraint-Based Modeling

Constraint-based modeling approaches resort to kinetic data to obtain detailed information on the dynamics of biological systems. Assuming that the concentration of internal metabolites is in a quasi steady-state [10], the metabolite mass balance can be defined as:

$$S \cdot v = 0 \tag{1}$$

where $S$ represents an $m \times n$ matrix of stoichiometric coefficients and $v$ a vector of $n$ reaction fluxes. Since all metabolic fluxes that lead to the formation or degradation of intracellular metabolites are mass balanced, Eq. 1 is pivotal for CBM methods. For each flux, bounds need to be defined to establish thermodynamic feasibility and flux capacity. These bounds are given by:

$$0 \le v_i \le \beta_i, \quad \forall i \in N_{irr} \tag{2}$$

$$\alpha_i \le v_i \le \beta_i, \quad \forall i \in N_{rev} \tag{3}$$

where $v_i$ denotes the flux carried by reaction $i$, $\alpha_i$ and $\beta_i$ are the lower and upper bounds, respectively, and $N_{irr}$ and $N_{rev}$ are the sets of irreversible and reversible reactions. For irreversible reactions, the lower bound is set to 0. Usually, there are more variables (reactions) than equations (compounds) in the system, meaning that the number of possible solutions is infinite: we are facing an underdetermined Linear Programming (LP) problem represented by a hypercone of admissible flux distributions. Defining an objective function solves this issue and the problem becomes an optimization one.
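To make the mass balance and bound constraints concrete, the sketch below checks a few candidate flux vectors on a made-up network with two metabolites and four irreversible reactions. The network, the bounds, and the candidate vectors are illustrative assumptions and are not taken from iJO1366. The code only compares hand-picked candidates; an actual FBA run would pass the same constraints to an LP solver.

```python
# Toy stoichiometric matrix: rows are metabolites, columns are reactions.
# R1: -> A (uptake), R2: A -> B, R3: B -> (target export), R4: A -> (by-product)
S = [
    [1, -1, 0, -1],  # metabolite A
    [0, 1, -1, 0],   # metabolite B
]

# (lower, upper) bounds; every reaction is irreversible, so lower = 0.
bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]


def is_steady_state(S, v, tol=1e-9):
    """True when every row of S times v is zero (metabolite mass balance)."""
    return all(abs(sum(s * f for s, f in zip(row, v))) <= tol for row in S)


def within_bounds(v, bounds):
    """True when each flux lies between its lower and upper bound."""
    return all(lo <= f <= hi for f, (lo, hi) in zip(v, bounds))


candidates = [
    [10, 10, 10, 0],  # all carbon goes to the target
    [10, 4, 4, 6],    # carbon split between target and by-product
    [10, 5, 4, 5],    # metabolite B accumulates, so not at steady state
]

feasible = [v for v in candidates
            if is_steady_state(S, v) and within_bounds(v, bounds)]

# The objective (flux through R3, index 2) singles out one feasible vector.
best = max(feasible, key=lambda v: v[2])
print(feasible)
print(best)
```

Two of the three candidates satisfy both constraints, which illustrates why an objective function is needed to select a single flux distribution.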


2.2 Stoichiometric Model

The iJO1366 model of Escherichia coli K-12 (with 1805 metabolites, 2583 reactions and 1367 genes) contains two sets of biological information: the stoichiometric matrix and the GPR rules, a mapping between enzyme-encoding genes and the reactions they catalyze. We employed the latter to reflect gene expression modifications, notably gene deletions and up/down-regulations, in the flux constraints of the catalyzed reactions. Consider a reaction R1, with flux $0 \le v_1 \le \beta_1$, and an associated GPR rule (G1 OR G2) AND G3, where G1, G2 and G3 are genes that form the enzyme complex that catalyzes R1. The GPR rule is converted to an algebraic expression by replacing the (AND, OR) Boolean operators with (min, max) functions and the gene identifiers with expression levels. The resulting algebraic value defines a modification factor $f$ that can be applied to the bounds of the catalyzed reaction. If $f$ is 0, the reaction is knocked out by setting both bounds to 0; if $f > 1$, the lower bound is set to $f \times v_{wt}$, where $v_{wt}$ is the wild-type flux value; and, last, if $0 < f < 1$, the upper bound is set to $f \times v_{wt}$. Reversible reactions are decomposed into a forward and a backward reaction, and the direction with no wild-type flux, being catalyzed by the same enzyme or complex, is knocked out.

2.3 GECKO-Like Model

In case of proteome limitations, the cell switches to pathways that require less protein mass but are lower in nutrient yield (defined as energy and/or biomass precursors produced per molecule of limiting nutrient consumed). To model such a behavior, and as first proposed by O’Brien et al. [12], the iJO1366 GECKO-like model includes enzyme utilization as part of reactions, $A + \frac{e_i}{k_{cat}^{i}} \rightarrow B$, where $A$ and $B$ are substrate(s) and product(s), respectively, and $e_i$ is the enzyme usage with a turnover number $k_{cat}^{i}$. This formulation rests on the fact that, for any enzyme $E_i$ catalyzing a reaction $R_j$, $v_j \le k_{cat}^{ij} \cdot [E_i]$, where $[E_i]$ is the intracellular concentration of the enzyme $E_i$. When proteomic data is available, GECKO models limit enzyme usage by a constraint $0 \le e_i \le [E_i]$. In the absence of proteomic data, enzymes are drawn from a pool, a pseudo-metabolite limited by a total protein content proportional to the mass fraction of enzymes accounted for in the model. The GECKO formulation allows evaluating genetic modifications by changing the bounds of the enzyme usage constraints. For enzyme deletions, both the upper and lower bounds are set to 0. Enzyme down-regulation is modeled by setting the upper bound to a fraction of the wild-type enzyme usage. Conversely, in the case of an up-regulation, the lower bound of the enzyme constraint is set to a factor of the wild-type enzyme usage.
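The bound adjustments described for enzyme usage mirror those applied to reaction fluxes in Sect. 2.2, so both can be sketched together. The example below is illustrative only and does not reproduce the MEWpy API: `gpr_factor` evaluates the example rule (G1 OR G2) AND G3 with max and min, and `modified_bounds` applies a modification factor to a pair of bounds. The function names, expression levels, and wild-type value are assumptions.

```python
def gpr_factor(levels):
    """Evaluate the rule (G1 OR G2) AND G3 with OR -> max and AND -> min."""
    return min(max(levels["G1"], levels["G2"]), levels["G3"])


def modified_bounds(factor, wild_type, lower, upper):
    """Apply a modification factor to a (lower, upper) pair of bounds.

    `wild_type` is the wild-type flux or enzyme usage. This sketch does not
    check that an up-regulated lower bound stays below `upper`.
    """
    if factor == 0:      # deletion: both bounds are set to 0
        return 0.0, 0.0
    if factor > 1:       # up-regulation: the lower bound is raised
        return factor * wild_type, upper
    if factor < 1:       # down-regulation: the upper bound is lowered
        return lower, factor * wild_type
    return lower, upper  # factor == 1: bounds are left unchanged


# Hypothetical expression levels: 0 = deleted, 1 = unchanged.
isoenzyme_lost = gpr_factor({"G1": 0.0, "G2": 1.0, "G3": 1.0})
subunit_halved = gpr_factor({"G1": 1.0, "G2": 1.0, "G3": 0.5})
subunit_lost = gpr_factor({"G1": 1.0, "G2": 1.0, "G3": 0.0})

for factor in (isoenzyme_lost, subunit_halved, subunit_lost):
    print(factor, modified_bounds(factor, wild_type=4.0, lower=0.0, upper=10.0))
```

Deleting G1 alone leaves the factor at 1 because G2 can replace it under the OR, whereas any change to G3 propagates through the AND.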

2.4 E(T)FL Model

ETFL models are implementations of ME-models, extensions of metabolic models that include the process of gene expression. In a way, ETFL models build on GECKO models by accounting for the expression cost of producing enzymes. The cost of peptide and mRNA synthesis, and the competition for ribosomes and RNA polymerase, are modeled by added constraints specific to ribosomes, RNA polymerase, mRNAs, rRNAs, tRNAs, and peptides. The application of genetic modifications to the EFL and ETFL models follows two strategies. If the genes whose expression is altered have associated enzymes, the modifications are applied by adjusting the bounds of the genes’ translation pseudo-reactions, following the same strategy previously defined and considering the flux rates observed in the wild-type strain. If a gene has no associated enzymes, GPR rules are used as in the stoichiometric model approach.

2.5 Optimization Setup

We considered an M9 minimal medium, with a maximum uptake rate of 10 mmol/(gDW.h) for all compounds, including glucose and oxygen. The strain optimization task encompasses the simultaneous maximization of three objectives:

• BPCY: the biomass-product coupled yield, with flux values taken from a linear version of the Minimization of Metabolic Adjustment (lMOMA) phenotypic prediction;
• WYIELD: the weighted sum of the maximum and minimum product yield with a 90% confidence of cellular growth;
• Modification Type: an additional objective that favors gene deletions and down-regulation.

The multi-objective optimization was run 10 times using the Non-dominated Sorting Genetic Algorithm III (NSGA-III), each run with a stopping criterion of 50000 solution evaluations. Solution candidates include a maximum of 10 modifications, composed of gene deletions and/or gene up-/down-regulations. At the end of the optimization process, the solutions are simplified by removing modifications that do not affect the predicted biomass or product rates.
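The first two objectives and the notion of non-dominated solutions can be sketched as follows. This is a minimal illustration under stated assumptions rather than MEWpy's implementation: BPCY is taken in its commonly used form (biomass flux times product flux divided by substrate uptake), the weight `alpha` in `wyield` is a hypothetical value, and all flux numbers are made up.

```python
def bpcy(biomass, product, substrate_uptake):
    """Biomass-product coupled yield: (biomass * product) / substrate uptake."""
    if substrate_uptake <= 0:
        return 0.0
    return biomass * product / substrate_uptake


def wyield(min_product, max_product, alpha=0.3):
    """Weighted sum of the maximum and minimum product rates (alpha assumed)."""
    return alpha * max_product + (1 - alpha) * min_product


def dominates(a, b):
    """True if a is at least as good as b everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(points):
    """Keep the points that no other point dominates (maximization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (BPCY, WYIELD) scores for three candidate solutions.
scores = [(0.2, 3.2), (0.1, 4.0), (0.1, 3.0)]
print(pareto_front(scores))
```

In a multi-objective setting, an algorithm such as NSGA-III returns a set of such non-dominated trade-offs rather than a single best solution.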

3 Analysis of Solutions Distributions

Each solution from the set of aggregated genetic modifications, suggested by MEWpy using both the iJO1366 and GECKO-like models, was re-evaluated using each model (iJO1366, GECKO-like, EFL, and ETFL). Parsimonious FBA (pFBA) phenotypic simulations and Flux Variability Analysis (FVA) were conducted to infer growth and succinate flux rates.


Fig. 1. Venn diagram of genetic modifications by model.

The genetic modifications obtained from the optimizations underwent additional screening. For each of the modeling approaches, only modifications with a predicted growth over 0.1 mmol/(gDW.h) were preserved. The genetic modification solutions were also filtered for robustness. A genetic modification pinpoints specific enzyme-catalyzed reactions, genes, or enzymes that need to be deleted or up-/down-regulated such that the production of the desired product becomes a necessary byproduct of biomass formation (required for cellular growth). As such, we want to find genetic modifications that eliminate or alter competing pathways that may hinder the succinate production rate. This goal was achieved by selecting the modifications under which the minimal (guaranteed) production rate of succinate is maximized, instead of simply assuming that the maximized production rate would be attained.

A solution valid in one model might not be valid in others, as can be seen in Fig. 1. Here, valid refers to the ability of the employed linear programming solver to solve the cellular growth parsimonious FBA problem. As such, in the subsequent analysis, we only considered candidate solutions valid in all models. This methodology reduced the number of genetic modification candidates for further screening and analysis.

An immediate observation is that the biomass and succinate yield distributions differ from one model to another (Fig. 2). One can observe an increase in biomass and a decrease in succinate production from the iJO1366 and GECKO models to the E(T)FL models. Furthermore, the correlations of predicted yields indicate that, while predicted biomass yields significantly differ between the two groups, EFL succinate yields are closer to those predicted by the iJO1366 and GECKO models than those predicted by the ETFL model, as shown in Fig. 3. This result was somewhat expected, as the authors in [13] reported an increase in the maximum growth rate when thermodynamic constraints are applied.
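The screening steps described in this section (growth threshold, validity in all models, preference for a high guaranteed succinate rate) can be sketched with plain dictionaries. The solution names and all numbers below are made up, and ranking by the worst guaranteed rate across models is an assumption made for this illustration, not necessarily the exact criterion used in the study.

```python
MODELS = ("iJO1366", "GECKO", "EFL", "ETFL")
GROWTH_THRESHOLD = 0.1

# solution -> model -> (growth, minimal succinate rate), or None when the
# solver could not solve the growth problem for that model. Values are made up.
results = {
    "sol_A": {"iJO1366": (0.60, 3.0), "GECKO": (0.58, 2.9),
              "EFL": (0.70, 1.5), "ETFL": (0.72, 1.2)},
    "sol_B": {"iJO1366": (0.40, 5.0), "GECKO": (0.41, 4.8),
              "EFL": (0.05, 2.0), "ETFL": (0.50, 1.9)},  # growth too low in EFL
    "sol_C": {"iJO1366": (0.30, 6.0), "GECKO": (0.31, 5.5),
              "EFL": None, "ETFL": (0.45, 2.5)},         # not solvable in EFL
    "sol_D": {"iJO1366": (0.55, 4.0), "GECKO": (0.54, 3.8),
              "EFL": (0.65, 2.2), "ETFL": (0.66, 2.0)},
}


def valid_everywhere(per_model):
    """True when every model solved the problem and growth exceeds the threshold."""
    return all(
        per_model.get(m) is not None and per_model[m][0] > GROWTH_THRESHOLD
        for m in MODELS
    )


def worst_guaranteed_rate(per_model):
    """Smallest minimal succinate rate over all models."""
    return min(per_model[m][1] for m in MODELS)


kept = {name: r for name, r in results.items() if valid_everywhere(r)}
ranked = sorted(kept, key=lambda n: worst_guaranteed_rate(kept[n]), reverse=True)
print(ranked)
```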


Fig. 2. Distributions of biomass and succinate yields.

Fig. 3. Correlation of biomass and succinate predicted yields between models.

4 Illustrative Solutions

Most solutions suggest the deletion or down-regulation of one or more of the sdhA, sdhC, folD, and pyrD genes. On the other hand, the up-regulation of the tdcB and sucB genes is frequently proposed. Indeed, some of those modifications are frequently referred to in the literature as improving succinate production in E. coli. In [14], the authors reported an increase in succinate production after the deletion of the sdhA-B, iclR, poxB, ackA, and pta genes. In [15], the overexpression of sucB and sucA is combined with the deletion of the genes sdhA, sdhB, sdhC, sdhD, and ppc. As illustrative examples, one of the solutions suggested by MEWpy proposes the deletion of the phoE, kbl, pyrD, and sdhA genes (Solution 1), while another recommends the deletion of the genes kdpF and sdhD, the down-regulation of gnd and folD, and the over-expression of tdcB (Solution 2). We simulated each model for both solutions with different glucose uptake rates, up to a maximum of 10 mmol/(gDW.h). The lowest glucose uptake rates required to produce a minimum amount of ATP and maintain metabolism in aerobic growth were, respectively, 0.14 and 3.35 mmol/(gDW.h). Because the simulated flux rates of the iJO1366 and GECKO models are very similar, the respective predicted flux rates are represented jointly in Fig. 4.

Fig. 4. Growth and succinate rate with respect to glucose uptake.

Concerning Solution 1, the predicted growth and succinate flux rates for the iJO1366/GECKO, EFL, and ETFL models exhibit significant differences. For Solution 2, we observe a consensus among the models’ predictions. The biological validity of the solutions has yet to be assessed; however, they appear to point in directions similar to those of already published genetic modifications with the same objective.

We recently made available a database that gathers genetic modifications from the literature, including computational strain design results obtained with MEWpy for different organisms and targeted products. The database, freely available at https://sddb.bio.di.uminho.pt/, may be used to offer guidance in strain design tasks. It currently encompasses all the results obtained in this work, allowing users to navigate and explore them freely. Furthermore, all experiments can be easily replicated and applied to other case studies.

5 Conclusion

A central goal for metabolic modeling is to quantify cellular responses accurately while integrating internal and external conditions. Given the complex nature of the metabolism and composition of different heterogeneous components (e.g., enzymes, metabolites, and regulators), such a goal remains a significant challenge. Despite their limitations and the discrepancy between predicted and experimentally measured growth and flux distribution in mutant strains, computational systems biology and constraint-based modeling offer insights and a better understanding of the cellular functions and may be used to pinpoint directions for the development of new strains for optimized compound production.


This work illustrated how recently proposed modeling approaches may be used in combinatorial strain optimization, notably using the MEWpy framework. We chose to study the optimization of succinate production in E. coli in aerobic conditions and compared predicted flux rates using four integrative modeling approaches. The solutions offered by MEWpy show some similarity to those found in the literature, but require further experimental validation. In future work, we aim to replicate the study for the increased production of other compounds, not only in E. coli but also in Saccharomyces cerevisiae. We also intend to simulate genetic modifications available in the literature using the different methods/models and compare the predicted yields against reported titers.

Acknowledgements. This project has received funding from the European Union’s Horizon 2020 research and innovation programme (grant agreement number 814408).

References

1. Maia, P., Rocha, M., Rocha, I.: In silico constraint-based strain optimization methods: the quest for optimal cell factories. Microbiol. Mol. Biol. Rev. 80(1), 45–67 (2016)
2. Rocha, M., Maia, P., Mendes, R., et al.: Natural computation meta-heuristics for the in silico optimization of microbial strains. BMC Bioinform. 9, 499 (2008)
3. Rocha, I., Maia, P., Evangelista, P., et al.: OptFlux: an open-source software platform for in silico metabolic engineering. BMC Syst. Biol. 4, 45 (2010)
4. Sanchez, B.J., Zhang, X.-C., Nilsson, A., Lahtvee, P.-J., Kerkhoven, E.J., Nielsen, J.: Improving the phenotype predictions of a yeast genome-scale metabolic model by incorporating enzymatic constraints. Mol. Syst. Biol. 13, 935 (2017)
5. Bekiaris, P.S., Klamt, S.: Automatic construction of metabolic models with enzyme constraints. BMC Bioinform. 21, 19 (2020)
6. Salvy, P., Hatzimanikatis, V.: The ETFL formulation allows multi-omics integration in thermodynamics-compliant metabolism and expression models. Nat. Commun. 11, 30 (2020)
7. Pereira, V., Cruz, F., Rocha, M.: MEWpy: a computational strain optimization workbench in Python. Bioinformatics (2021)
8. Lin, H., Bennett, G.N., San, K.: Metabolic engineering of aerobic succinate production systems in Escherichia coli to improve process productivity and achieve the maximum theoretical succinate yield. Metab. Eng. 7(2), 116–127 (2005)
9. Orth, J.D., et al.: A comprehensive genome-scale reconstruction of Escherichia coli metabolism-2011. Mol. Syst. Biol. 7, 535 (2011)
10. Ederer, M., et al.: An introduction to kinetic, constraint-based and Boolean modeling in systems biology. In: IEEE International Conference on Control Applications 2010, pp. 129–134 (2010)
11. Shen, F., Sun, R., Yao, J., et al.: OptRAM: in-silico strain design via integrative regulatory-metabolic network modeling. PLoS Comput. Biol. 15, e1006835 (2019)
12. O’Brien, E.J., Lerman, J.A., Chang, R.L., Hyduke, D.R., Palsson, B.Ø.: Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Mol. Syst. Biol. 9, 693 (2013)
13. Hamilton, J.J., Dwivedi, V., Reed, J.L.: Quantitative assessment of thermodynamic constraints on the solution space of genome-scale metabolic models. Biophys. J. 105, 512–522 (2013)
14. Blankschien, M.D., Clomburg, J.M., Gonzalez, R.: Metabolic engineering of Escherichia coli for the production of succinate from glycerol. Metab. Eng. 12, 409–419 (2010)
15. Li, N.: Directed pathway evolution of the glyoxylate shunt in Escherichia coli for improved aerobic succinate production from glycerol. J. Ind. Microbiol. Biotechnol. 40(12), 1461–1475 (2013)

Predicting Adverse Drug Reactions from Drug Functions by Binary Relevance Multi-label Classification and MLSMOTE

Pranab Das, Jerry W. Sangma, Vipin Pal, and Yogita

National Institute of Technology Meghalaya, Shillong, India
[email protected], [email protected]

Abstract. Adverse Drug Reaction (ADR) prediction is one of the important tasks in drug discovery. It helps in enhancing drug safety and reducing drug discovery costs and time. Most of the existing works have focused on ADR prediction using chemical and biological properties of drugs. However, the capability of drug functions in ADR prediction has not been explored yet. ADR prediction is a multi-label classification problem and it faces the issue of class imbalance. In the present work, a methodology is proposed for predicting ADR from drug functions. It employs the binary relevance method along with five base classifiers, namely DT, ETC, KNN, MLPNN, and RF, for performing multi-label classification, and MLSMOTE for addressing the issue of class imbalance. The data on drug functions and ADR were extracted from the PubChem and SIDER databases, respectively, and the drug functions were then mapped to ADR based on drug ID. After this mapping, the resulting dataset comprises 670 drugs described by their functions and 6123 ADR. The proposed methodology has been applied to this dataset, and its performance has been found promising in terms of accuracy, Hamming loss, precision, recall, F1 score and ROC-AUC.

Keywords: Adverse Drug Reaction (ADR) · Drug Function (DF) · Multi-label classification · Multi-label Synthetic Minority Over-sampling Technique (MLSMOTE)

1 Introduction

Adverse drug reactions (ADR) are defined as the unwanted harmful reactions occurring on consuming an adequate amount of medical products such as drugs and vaccines [1]. Sometimes ADR can be fatal, resulting in deaths and permanent organ damage. ADR are one of the main reasons behind the failure of a number of drugs [2,3]. Hence, the prediction of ADR for drugs is an important aspect of the drug discovery process. Machine learning can play an important role in ADR prediction and in saving the time and cost of drug discovery [2,4].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 165–173, 2022. https://doi.org/10.1007/978-3-030-86258-9_17


Most of the existing approaches for ADR prediction have used the chemical and biological properties of drugs [2,5–7]. However, drug functions can be very important for ADR prediction, as ADR manifestation may be associated with drug functions. This association has not yet been explored using machine learning techniques by the research community. There can be multiple ADR corresponding to a drug, so ADR prediction using drug functions is a multi-label classification problem. Further, ADR prediction from drug functions faces the issue of class imbalance because of the skewed distribution of ADR over drugs.

In the present work, a methodology for predicting the ADR of different drugs based on information about drug functions is proposed. It employs the Multi-Label Synthetic Minority Over-sampling Technique (MLSMOTE) [8] for handling the class imbalance issue, and the binary relevance approach along with five base classifiers, namely Decision Tree (DT), Extra Tree Classifier (ETC), Random Forest (RF), K-Nearest Neighbour (KNN) and Multi-Layer Perceptron Neural Network (MLPNN), for addressing the multi-label nature of the ADR prediction problem. The proposed methodology has been validated on a dataset that comprises drug functions and ADR corresponding to 670 drugs, collected from the PubChem [9] and SIDER [10] databases. The validation results show that the proposed methodology is promising in terms of accuracy, Hamming loss, precision, recall, F1 score and ROC-AUC.

The rest of the paper is organized as follows: Sect. 2 presents related work. Section 3 describes the dataset and the proposed methodology. The experimental setup and results are discussed in Sect. 4. The work is concluded in Sect. 5.

2 Related Work

This section briefly describes the existing literature on ADR prediction. The problem of ADR prediction has been addressed from different aspects. Wang et al. [2] used biological and biomedical information from the literature, together with 17 molecular properties, for ADR prediction using a multi-layer perceptron neural network. In [5], the authors applied different deep learning architectures, namely multi-task neural networks, residual multi-layer perceptrons, multi-modal neural networks, simple multi-layer perceptrons, and convolutional neural networks, for detecting ADR using gene expression, gene ontology, META information, and the chemical 1D structure of drugs. A machine learning model was proposed in [6] for detecting neurological ADR based on different drug features, viz. chemical structure (CS), therapeutic indications (TI), and biological (Bio) properties. Further, the authors combined different drug properties into two-level (CS+TI, CS+Bio, TI+Bio) and three-level (CS+TI+Bio) combinations. They showed that, compared to the three-level combination of drug properties, the two-level combination of chemical structure and therapeutic indication properties gave better detection performance. Liu et al. [7] also used a machine learning model for ADR prediction by combining three different types of drug properties, namely biological properties, chemical properties and treatment indication. Overall, it can be concluded that different types of drug properties have been employed for ADR prediction by the existing approaches, but the role of drug functions in ADR prediction has not been explored yet. This frames the motivation for our work.

3 Dataset and the Proposed Methodology

In this section, the problem of predicting ADR from drug functions (DF) is defined. Further, the dataset and the proposed methodology to address the stated problem are illustrated.

3.1 Problem Statement

Let $D = \{d_1, d_2, \ldots, d_p, \ldots, d_l\}$ be the set of drugs and $DF = \{df_1, df_2, \ldots, df_q, \ldots, df_m\}$ be the set of drug functions, where each drug $d_p$ is associated with a number of drug functions from the set $DF$. Let $DR = \{dr_1, dr_2, \ldots, dr_r, \ldots, dr_n\}$ be the set of ADR. A drug $d_p$ can have multiple ADR. Therefore, the prediction of ADR for a drug is a multi-label classification problem. Here, DF and ADR are represented by 1 and 0, where 1 indicates the presence and 0 the absence of a DF or an ADR for a specific drug. Figure 1 pictorially represents the problem statement of predicting ADR for drugs using their corresponding DF.

Fig. 1. Problem statement for predicting ADR from drug functions
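The binary encoding described above can be illustrated with a minimal sketch. The drug names, the three drug functions shown and the ADR names are hypothetical placeholders chosen for illustration; only the 1/0 presence encoding comes from the problem statement.

```python
# Hypothetical vocabularies; the real dataset uses 12 drug functions and 6123 ADR.
drug_functions = ["Cardiovascular Agent", "Dermatologic Agent", "Hematologic Agent"]
adrs = ["nausea", "headache", "rash"]

# Hypothetical drugs with the functions and ADR that are present for each.
drugs = {
    "drug_A": {"functions": {"Cardiovascular Agent", "Hematologic Agent"},
               "adrs": {"nausea", "headache"}},
    "drug_B": {"functions": {"Dermatologic Agent"},
               "adrs": {"rash"}},
}


def to_indicator(present, vocabulary):
    """Return a 0/1 vector: 1 where the vocabulary item is present, else 0."""
    return [1 if item in present else 0 for item in vocabulary]


# X holds the drug-function features, Y the multi-label ADR targets.
X = [to_indicator(d["functions"], drug_functions) for d in drugs.values()]
Y = [to_indicator(d["adrs"], adrs) for d in drugs.values()]
```

Each row of `X` and `Y` corresponds to one drug, so a row of `Y` with several ones is exactly the multi-label case.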

3.2 Dataset

The process followed for preparing the dataset for the validation of the proposed methodology has been diagrammatically shown in Fig. 2 and discussed next.


P. Das et al.

Fig. 2. Workflow to combine drug functions and ADR data

First, data on the drug functions of different drugs is collected from the PubChem database [9]; it comprises 10,819 drugs and their functions. Second, ADR data is collected from the SIDER database, version 4.1 [10]; it contains 6123 ADR corresponding to 1430 drugs. The two sources are then combined by mapping drug functions to ADR on the drug ID. The result is a dataset of 670 drugs, each described by 12 drug functions and 6123 ADR, which is named the DF-ADR dataset.
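The mapping on drug ID amounts to keeping only the drugs that occur in both sources. The sketch below shows that join on toy data; the IDs, function names and ADR names are hypothetical, and the function `build_df_adr` is an illustrative helper, not part of the authors' pipeline.

```python
def build_df_adr(drug_functions, drug_adrs):
    """Join two {drug_id: set_of_items} mappings on the drug IDs they share."""
    shared_ids = sorted(drug_functions.keys() & drug_adrs.keys())
    return {
        drug_id: {"functions": drug_functions[drug_id], "adrs": drug_adrs[drug_id]}
        for drug_id in shared_ids
    }


# Hypothetical records standing in for the PubChem and SIDER extracts.
pubchem_functions = {
    "ID001": {"Cardiovascular Agent"},
    "ID002": {"Dermatologic Agent"},
    "ID003": {"Hematologic Agent"},
}
sider_adrs = {
    "ID002": {"rash"},
    "ID003": {"nausea", "headache"},
    "ID004": {"dizziness"},
}

df_adr = build_df_adr(pubchem_functions, sider_adrs)
```

Drugs that appear in only one source are dropped, which is consistent with the combined dataset being smaller than either input.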

3.3 The Proposed Methodology

The working architecture of the proposed methodology is shown in Fig. 3. It takes the drug functions of different drugs as input, where each drug function represents an attribute of a drug. A particular drug function may be present or absent for a given drug. The 12 drug functions that have been used in the present work are Respiratory System Agent, Reproduction Control Agent, Lipid Regulating Agent, Gastrointestinal Agent, Cardiovascular Agent, Antineoplastic Agent, Central Nervous System Agent, Anti-Infection Agent, Dermatologic Agent, Hematologic Agent and Urological Agent.

Fig. 3. Working architecture of the proposed methodology

As a first processing step, MLSMOTE is applied to the input data to rectify the class imbalance [8]. MLSMOTE works by generating new data samples for minority labels. A label is treated as a minority label when its imbalance ratio is higher than the average imbalance ratio taken over all the labels [11]. New data samples are generated from the nearest neighbours of the data samples that carry a minority label. The number of data samples to be generated depends on the value of the imbalance ratio for a particular label. This generation process is repeated for all the minority labels.
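The oversampling idea can be sketched as follows. This is a simplified illustration and not the full MLSMOTE algorithm of [8]: it generates one synthetic sample per minority-label instance, interpolates features linearly (so binary features become fractional), and sets labels by majority vote among the instance and its neighbours. The function names, `k` and `seed` are assumptions made for the example.

```python
import numpy as np


def imbalance_ratios(Y):
    """Per-label imbalance ratio (largest label count / label count) and its mean."""
    counts = Y.sum(axis=0).astype(float)
    present = counts > 0                      # labels that never occur are ignored
    irlbl = np.zeros_like(counts)
    irlbl[present] = counts.max() / counts[present]
    return irlbl, irlbl[present].mean()


def mlsmote_sketch(X, Y, k=5, seed=0):
    """Append synthetic samples for every label whose ratio exceeds the mean."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y).astype(int)
    rng = np.random.default_rng(seed)
    irlbl, mean_ir = imbalance_ratios(Y)
    new_X, new_Y = [], []
    for label in np.where(irlbl > mean_ir)[0]:
        idx = np.where(Y[:, label] == 1)[0]   # instances carrying the minority label
        if len(idx) < 2:
            continue                          # no neighbour to interpolate with
        Xm, Ym = X[idx], Y[idx]
        for i in range(len(idx)):
            dist = np.linalg.norm(Xm - Xm[i], axis=1)
            # Skip the closest point, which is normally the sample itself.
            neighbours = np.argsort(dist)[1:k + 1]
            ref = rng.choice(neighbours)
            gap = rng.random()
            new_X.append(Xm[i] + gap * (Xm[ref] - Xm[i]))
            # A label is kept if more than half of {sample + neighbours} carry it.
            votes = Ym[i] + Ym[neighbours].sum(axis=0)
            new_Y.append((votes > (len(neighbours) + 1) / 2).astype(int))
    if not new_X:
        return X, Y
    return np.vstack([X, np.array(new_X)]), np.vstack([Y, np.array(new_Y)])
```

A maintained implementation should be preferred in practice; the sketch only makes the roles of the imbalance ratio and the nearest neighbours concrete.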

Algorithm 1. Binary Relevance
Input 1: A classifier C
Input 2: Multi-label dataset MLD = (Xi, Li), where Xi is the input feature and Li is the set of labels, Li ∈ 1, 2, ..., m
Split MLD into m binary classification problems
for each label in |Li| do
    Learn model C(Xi, label)
    if instance of MLD belongs to label then
        Assign label = positive
    else
        Assign label = negative
    end if
end for

As a second processing step, the binary relevance multi-label classification technique [12] is applied to the oversampled dataset generated by the first processing step. The binary relevance technique is given as Algorithm 1. It takes a classifier and a multi-label dataset (MLD) as input and splits the MLD into m binary classification problems, one per label. It then trains a binary classifier on each split dataset to make ADR predictions. This is done for all labels, and the results of the classifiers trained for the different labels are combined. In the present work, instead of relying on a single classifier, five different classifiers have been used: decision tree (DT), extra trees classifier (ETC), k-nearest neighbours (KNN), multi-layer perceptron neural network (MLPNN), and random forest (RF). The performance of the proposed methodology has been analyzed for all five classifiers.
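A minimal sketch of binary relevance with scikit-learn is given below, assuming a binary feature matrix `X` and a binary label matrix `Y` with one column per ADR. The helper names are illustrative and this is not the scikit-multilearn implementation cited in [12]; it only shows the one-classifier-per-label idea.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


def fit_binary_relevance(base_clf, X, Y):
    """Train one independent copy of base_clf per label column of Y."""
    return [clone(base_clf).fit(X, Y[:, j]) for j in range(Y.shape[1])]


def predict_binary_relevance(models, X):
    """Stack the per-label predictions back into a multi-label matrix."""
    return np.column_stack([model.predict(X) for model in models])


# Toy data: label 0 copies feature 0, label 1 is the negation of feature 1.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_demo = np.column_stack([X_demo[:, 0], 1 - X_demo[:, 1]])

models = fit_binary_relevance(DecisionTreeClassifier(random_state=0), X_demo, Y_demo)
Y_hat = predict_binary_relevance(models, X_demo)
```

Any of the five classifiers can be passed as `base_clf`, since `clone` creates a fresh, unfitted copy for each label.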

4 Experimental Setup and Results

This section describes the evaluation metrics and setting of different parameters for experimental analysis. Further, the experimental results have been presented and discussed.

4.1 Evaluation Metrics

The following metrics have been used for evaluating the performance of the proposed methodology. Let $D_k = \{(X_i, L_i) \mid i = 1, 2, \ldots, n\}$ be a set of multi-label data, where $L_i$ is the set of true labels for test sample $X_i$ and $Z_i$ is the set of labels predicted by the classifier.

• Accuracy is defined as the average, over the samples, of the ratio of correctly predicted labels to the total number of labels (predicted or true) [13].

$$\text{Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Z_i \cap L_i|}{|Z_i \cup L_i|} \tag{1}$$

• Hamming-Loss is defined as the fraction of sample-label pairs that the model predicts wrongly. Here $\Delta$ denotes the symmetric difference of two sets and $|L|$ is the total number of labels [13].

$$\text{Hamming-Loss} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|L|} |Z_i \,\Delta\, L_i| \tag{2}$$

• Precision is the ratio of correctly predicted labels to all predicted labels [13].

$$\text{Precision} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Z_i \cap L_i|}{|Z_i|} \tag{3}$$

• Recall is the ratio of correctly predicted labels to the number of positive labels [13].

$$\text{Recall} = \frac{1}{n} \sum_{i=1}^{n} \frac{|Z_i \cap L_i|}{|L_i|} \tag{4}$$

• F1 score is the harmonic mean of Precision and Recall [13].

$$F_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{2\,|Z_i \cap L_i|}{|Z_i| + |L_i|} \tag{5}$$

• ROC-AUC represents the capability of the model to distinguish between classes. Higher values of ROC-AUC show that the model can predict the positive and negative classes more accurately.
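The example-based metrics of Eqs. (1) to (5) can be computed directly from binary label matrices, as in the sketch below. It assumes rows are samples and columns are labels; treating a sample with an empty denominator as contributing 0 is a convention chosen here, not something stated in the text.

```python
import numpy as np


def example_based_metrics(L, Z):
    """Compute Eqs. (1)-(5) from true labels L and predicted labels Z (0/1 matrices)."""
    L = np.asarray(L, dtype=bool)
    Z = np.asarray(Z, dtype=bool)
    inter = (L & Z).sum(axis=1).astype(float)   # |Z_i ∩ L_i|
    union = (L | Z).sum(axis=1)                 # |Z_i ∪ L_i|
    sym_diff = (L ^ Z).sum(axis=1)              # |Z_i Δ L_i|
    n_true = L.sum(axis=1)                      # |L_i|
    n_pred = Z.sum(axis=1)                      # |Z_i|

    def safe_mean(num, den):
        # Assumed convention: a sample with a zero denominator contributes 0.
        ratio = np.divide(num, den, out=np.zeros_like(num, dtype=float), where=den > 0)
        return float(ratio.mean())

    return {
        "accuracy": safe_mean(inter, union),
        "hamming_loss": float((sym_diff / L.shape[1]).mean()),
        "precision": safe_mean(inter, n_pred),
        "recall": safe_mean(inter, n_true),
        "f1": safe_mean(2 * inter, n_pred + n_true),
    }


# Two samples, four labels.
scores = example_based_metrics(
    L=[[1, 0, 1, 0], [0, 1, 0, 0]],
    Z=[[1, 0, 0, 0], [0, 1, 1, 0]],
)
```

For the two toy samples, each has one correct label out of a union of two, so Eq. (1) gives 0.5, while one wrong label out of four gives a Hamming loss of 0.25.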

4.2 Experimental Setup

The proposed methodology has been implemented using Python version 3.7.4 with the scikit-learn library [14]. Gini impurity is used as the splitting criterion for the DT, ETC and RF classifiers. The values of maximum sample split and minimum sample split are set to 2 and 1, respectively, for the DT, ETC and RF classifiers. The maximum depth of the tree is set to 30, 30 and 50 for DT, ETC and RF, respectively. The number of trees in the forest is 100 for the ETC and RF classifiers. For the KNN classifier, k is set to 3 and Euclidean distance is used as the proximity measure. For the MLPNN classifier, the Rectified Linear Unit (ReLU) is used as the activation function with two hidden layers, the first with 1024 hidden nodes and the second with 512 hidden nodes. The Adaptive Moment Estimation (Adam) optimizer is used with a learning rate of 0.001 and 1000 iterations. The holdout method is used for validation, with 70% of the data used for training and 30% for testing.
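The settings above could be expressed in scikit-learn roughly as follows. This is a sketch, not the authors' code: the two "sample split" values of 2 and 1 are read here as `min_samples_split=2` and `min_samples_leaf=1`, which is an assumption, and the random seed is likewise an assumption since none is reported.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the reported split values of 2 and 1 map to these two parameters.
tree_kwargs = dict(criterion="gini", min_samples_split=2, min_samples_leaf=1)

classifiers = {
    "DT": DecisionTreeClassifier(max_depth=30, **tree_kwargs),
    "ETC": ExtraTreesClassifier(n_estimators=100, max_depth=30, **tree_kwargs),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=50, **tree_kwargs),
    "KNN": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "MLPNN": MLPClassifier(
        hidden_layer_sizes=(1024, 512),   # two hidden layers
        activation="relu",
        solver="adam",
        learning_rate_init=0.001,
        max_iter=1000,
    ),
}


def holdout_split(X, Y, seed=0):
    """70% training / 30% testing holdout split; the seed is an assumption."""
    return train_test_split(X, Y, test_size=0.3, random_state=seed)
```

Each entry of `classifiers` can serve as the base classifier of a binary relevance wrapper, trained on the 70% split and evaluated on the remaining 30%.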

4.3 Experimental Results and Discussion

Table 1 reports the performance of the proposed methodology for each classifier in terms of Accuracy, Hamming Loss, Precision, Recall, F1 Score and ROC-AUC. Accuracy exceeds 99% for all classifiers, and Hamming loss varies from 0.05% to 0.26%. Precision, recall and F1 score range from 98.70% to 99.95%, 97.84% to 99.42% and 98.27% to 99.69%, respectively, over the different classifiers, and ROC-AUC varies from 0.98 to 0.99. Overall, the proposed methodology performs well for ADR prediction from drug functions. The classifiers differ only marginally, and the ETC classifier performs best of all those considered.

To see how effective MLSMOTE is in addressing the class imbalance of the ADR prediction problem, the results of the binary relevance technique without MLSMOTE have been obtained for each classifier and are given in Table 2. Without MLSMOTE, the binary relevance technique performs poorly for all five classifiers in terms of Hamming Loss, Precision, Recall, F1 Score and ROC-AUC. The accuracy still ranges from 98.02% to 98.20% over the different classifiers, but accuracy is a misleading metric for an imbalanced dataset.

Table 1. Performance of the proposed methodology with MLSMOTE

Algorithm  Accuracy  Hamming loss  Precision  Recall  F1 score  ROC-AUC
DT         99.86%    0.14%         98.74%     99.36%  99.05%    0.99
ETC        99.95%    0.05%         99.95%     99.42%  99.69%    0.99
RF         99.94%    0.06%         99.89%     99.24%  99.56%    0.99
MLPNN      99.74%    0.26%         98.70%     97.84%  98.27%    0.98
KNN        99.83%    0.17%         99.42%     98.25%  98.83%    0.99

Table 2. Performance of the proposed methodology without MLSMOTE

Algorithm  Accuracy  Hamming loss  Precision  Recall  F1 score  ROC-AUC
DT         98.18%    1.82%         59.93%     19.19%  29.07%    0.59
ETC        98.19%    1.81%         60.56%     19.09%  29.02%    0.59
RF         98.19%    1.81%         60.53%     19.58%  29.59%    0.59
MLPNN      98.20%    1.80%         61.66%     19.57%  29.71%    0.59
KNN        98.02%    1.98%         47.93%     21.30%  29.49%    0.60
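Why accuracy can mislead on imbalanced multi-label data can be seen with a small synthetic example. In both tables the reported accuracy and Hamming loss add up to 100%, which suggests (as an inference from the numbers, not a statement by the authors) that accuracy is counted per sample-label pair. Under that reading, a predictor that never outputs any ADR already scores very high when positive labels are rare. The data below is entirely synthetic.

```python
import numpy as np

n_samples, n_labels = 100, 50

# Synthetic sparse ground truth: exactly one positive label per sample (2% density).
Y_true = np.zeros((n_samples, n_labels), dtype=int)
Y_true[np.arange(n_samples), np.arange(n_samples) % n_labels] = 1

# A trivial predictor that never predicts any label.
Y_pred = np.zeros_like(Y_true)

hamming_loss = float((Y_true != Y_pred).mean())      # share of wrong sample-label pairs
labelwise_accuracy = 1.0 - hamming_loss              # share of correct sample-label pairs
recall = float((Y_true & Y_pred).sum() / Y_true.sum())
```

Here the trivial predictor reaches a label-wise accuracy of 0.98 with a recall of 0, which is why Precision, Recall and F1 score are the more informative columns when comparing the two tables.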

5 Conclusion

A methodology for ADR prediction from drug functions has been presented in this work. It uses the binary relevance technique for multi-label classification and MLSMOTE to address the class imbalance issue. The performance of the proposed methodology has been validated on a dataset collected from the SIDER and PubChem databases. The proposed methodology has been found effective for ADR prediction: with the ETC classifier it achieved 99.95% accuracy, 99.95% precision, 99.42% recall, 99.69% F1 score and 0.99 ROC-AUC, with a Hamming loss of only 0.05%.

References

1. Ralph Edwards, I., Aronson, J.K.: Adverse drug reactions: definitions, diagnosis, and management. The Lancet 356(9237), 1255–1259 (2000)
2. Wang, C.-S., et al.: Detecting potential adverse drug reactions using a deep neural network model. J. Med. Internet Res. 21(2), e11016 (2019)
3. Side Effects of Drugs, Medical Devices and High-Risk Medical Conditions. https://www.drugwatch.com/side-effects/. Accessed 25 Mar 2021
4. Aliper, A., et al.: Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. Mol. Pharm. 13(7), 2524–2530 (2016)
5. Uner, O.C., et al.: DeepSide: a deep learning framework for drug side effect prediction. Biorxiv, 843029 (2019)
6. Jamal, S., et al.: Predicting neurological adverse drug reactions based on biological, chemical and phenotypic properties of drugs using machine learning models. Sci. Rep. 7(1), 1–12 (2017)
7. Liu, M., et al.: Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. J. Am. Med. Inform. Assoc. 19(e1), e28–e35 (2012)
8. Charte, F., et al.: MLSMOTE: approaching imbalanced multi-label learning through synthetic instance generation. Knowl. Based Syst. 89, 385–397 (2015)
9. Kim, S., et al.: PubChem in 2021: new data content and improved web interfaces. Nucl. Acids Res. 49(D1), D1388–D1395 (2021)
10. Kuhn, M., et al.: The SIDER database of drugs and side effects. Nucl. Acids Res. 44(D1), D1075–D1079 (2016)
11. Charte, F., Rivera, A., del Jesus, M.J., Herrera, F.: A first approach to deal with imbalance in multi-label datasets. In: Pan, J.S., Polycarpou, M.M., Woźniak, M., de Carvalho, A.C.P.L.F., Quintián, H., Corchado, E. (eds.) International Conference on Hybrid Artificial Intelligence Systems HAIS 2013. LNCS, vol. 8073, pp. 150–160. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40846-5_16
12. multilearn: Multi-label classification package for python. http://scikit.ml/api/skmultilearn.problem_transform.br.html. Accessed 16 Mar 2021
13. Krstinić, D., et al.: Multi-label classifier performance evaluation with confusion matrix. Comput. Sci. Inf. Technol. 1
14. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

Author Index

A
Adli, Hasyiya Karimah, 95
Agís-Balboa, Roberto C., 41
Alves, Nuno, 115
Antunes, Débora, 1
Arrais, Joel P., 1

B
Badie, Christophe, 126
Baptista, Delora, 145

C
Camacho, Rui, 74
Candamil-Cortes, Mariana S., 85
Capela, João, 136
Caprani, Michela, 11
Cecotka, Agnieszka, 126
Chamoso, Pablo, 22
Chan, Weng Howe, 95
Choon, Yee Wen, 95
Correia, Fernanda, 1
Correia, João, 145
Cruz, Fernando, 136
Cunha, Emanuel, 136

D
Das, Pranab, 165
De la Prieta, Fernando, 22
Dias, Oscar, 136

F
Ferreira, Pedro, 31, 74
Fonseca, Nuno A., 105

G
Ginja, Catarina, 105
Götherström, Anders, 105
Guimarães, Sílvia, 105
Guyot, Romain, 85

H
Healy, John, 11

I
Isaza, Gustavo, 85

J
Jaimes, Paula A., 85

K
Kılınç, Gülşah Merve, 105
Krol, Lukasz, 126

L
Ladeiras, João, 74
López-Fernández, Hugo, 31, 41

M
Maraschin, Marcelo, 52
Martins, Daniel, 1
Mohamad, Mohd Saberi, 95

O
O’Brien, Grainne, 126
O’Keeffe, Joan, 11
Oliveira, Alexandre, 136
Orozco-Arias, Simon, 85

P
Pal, Vipin, 165
Pereira, Bruno, 52, 145
Pereira, Vítor, 155
Pérez-Rodríguez, Daniel, 41
Pires, Ana Elisabete, 105
Polanska, Joanna, 126

R
Reboiro-Jato, Miguel, 31
Remli, Muhammad Akmal, 95
Rocha, Miguel, 1, 52, 62, 115, 145, 155
Rodrigues, Ruben, 115

S
Sampaio, Marta, 136
Sangma, Jerry W., 165
Sarmento, Cindy, 105
Sequeira, Ana Marta, 62
Sequeira, João, 136
Slattery, Orla, 11

T
Tabares-Soto, Reinel, 85

V
Valencia-Castrillon, Estiven, 85
Vieira, Cristina P., 31
Vieira, Jorge, 31
Vittorini, Pierpaolo, 22

W
WSW, Khairul Nizar Syazwan, 95

Y
Yogita, 165
Yong, Mohd Izzat, 95
Yusoff, Nooraini, 95

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. M. Rocha et al. (Eds.): PACBB 2021, LNNS 325, pp. 175–176, 2022. https://doi.org/10.1007/978-3-030-86258-9