Practical Applications of Computational Biology and Bioinformatics, 17th International Conference (PACBB 2023). ISBN 3031380789, 9783031380785


Table of contents :
Preface
Organization
Contents
Main Track
The Impact of Schizophrenia Misdiagnosis Rates on Machine Learning Models Performance
1 Introduction
2 Methods
2.1 Data Description and Quality Control
2.2 Genotype-Phenotype Association
2.3 Machine Learning Models
2.4 Over-Representation Analysis
3 Results
3.1 Association Test
3.2 Test on Train Data
3.3 Filtering of Discordant Samples
3.4 Classification Model
4 Discussion
5 Conclusion
References
Deep Learning and Transformers in MHC-Peptide Binding and Presentation Towards Personalized Vaccines in Cancer Immunology: A Brief Review
1 Introduction
2 Methodology
3 Input Encoding
4 Deep Learning and Transformers Methods
4.1 Deep Learning
4.2 Transformers
5 Discussion
References
Auto-phylo: A Pipeline Maker for Phylogenetic Studies
1 Introduction
2 Material and Methods
3 Results
3.1 Auto-phylo Modules
3.2 Setting up an Auto-phylo Pipeline
3.3 Bacterial AOs May Have a Function Similar to Animal GULOs
3.4 Identification of Bacterial Species Groups that Have AOs Closely Related to Animal GULOs
4 Conclusion
References
Feature Selection Methods Comparison: Logistic Regression-Based Algorithm and Neural Network Tools
1 Introduction
1.1 Classification Problem
1.2 Feature Selection Methods
2 Methods and Materials
2.1 Logistic Regression-Based Algorithm
2.2 Neural Networks Approach
2.3 Materials
3 Results
3.1 Logistic Regression-Based Algorithm
3.2 Neural Networks Approach
3.3 Results Comparison
4 Conclusions
References
A New GIMME–Based Heuristic for Compartmentalised Transcriptomics Data Integration
1 Introduction
2 Methods
2.1 Flux Balance Analysis
2.2 Gene Inactivity Moderated by Metabolism and Expression
2.3 Implementation of the Proposed Method
2.4 The Model
2.5 The Dataset
3 Results
3.1 Case Studies
4 Discussion and Conclusions
References
Identifying Heat-Resilient Corals Using Machine Learning and Microbiome
1 Introduction
2 Related Work
3 Methods
3.1 Pipeline
3.2 Experimental Setup
4 Results
5 Analysis and Discussion
6 Conclusion
References
Machine Learning Based Screening Tool for Alzheimer's Disease via Gut Microbiome
1 Introduction
2 Related Work
3 Methodology
4 Experimental Analysis
4.1 Experimental Settings
4.2 Experimental Results
4.3 Discussion
5 Conclusion and Future Work
References
Progressive Multiple Sequence Alignment for COVID-19 Mutation Identification via Deep Reinforcement Learning
1 Introduction
2 Methodology
2.1 Progressive Deep Reinforcement Learning
2.2 Sequence Alignment
3 Result and Discussion
3.1 Analysis of Alignment Results
4 Conclusion
References
Analysis of the Confidence in the Prediction of the Protein Folding by Artificial Intelligence
1 Introduction
2 Metrics and Scores
3 Material and Methods
4 Results
5 Discussion
6 Conclusions and Future Work
References
Doctoral Consortium
Neoantigen Detection Using Transformers and Transfer Learning in the Cancer Immunology Context
1 Introduction
2 Problem Statement
3 Related Work
4 Hypothesis
5 Proposal
6 Preliminary Results
7 Reflections
References
Author Index


Lecture Notes in Networks and Systems 743

Miguel Rocha · Florentino Fdez-Riverola · Mohd Saberi Mohamad · Ana Belén Gil-González, Editors

Practical Applications of Computational Biology and Bioinformatics, 17th International Conference (PACBB 2023)


Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Türkiye
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong

The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).

Miguel Rocha · Florentino Fdez-Riverola · Mohd Saberi Mohamad · Ana Belén Gil-González Editors

Practical Applications of Computational Biology and Bioinformatics, 17th International Conference (PACBB 2023)

Editors

Miguel Rocha
Centro de Engenharia Biológica
Universidade do Minho
Braga, Portugal

Florentino Fdez-Riverola
Computer Science Department
Universidade de Vigo
Vigo, Spain

Mohd Saberi Mohamad
Department of Genetics and Genomics, College of Medicine and Health Sciences
United Arab Emirates University
Al Ain, Abu Dhabi, United Arab Emirates

Ana Belén Gil-González
University of Salamanca
Salamanca, Spain

ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-38078-5 ISBN 978-3-031-38079-2 (eBook) https://doi.org/10.1007/978-3-031-38079-2 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The success of bioinformatics in recent years has been driven by research in Molecular Biology and Molecular Medicine through several initiatives. These initiatives gave rise to an exponential increase in the volume and diversification of data, including nucleotide and protein sequences and annotations, high-throughput experimental data, and biomedical literature, among many others. Systems biology is a related research area that has been replacing the reductionist view that dominated biology research in recent decades, requiring the coordinated efforts of biological researchers with those devoted to data analysis, mathematical modeling, computer simulation, and optimization. The accumulation and exploitation of large-scale databases call for new computational technology and for research into these issues. In this context, many widely successful computational models and tools used by biologists in these initiatives, such as clustering and classification methods for gene expression data, are based on Computer Science/Artificial Intelligence (CS/AI) techniques. These methods have been helping in knowledge discovery, modeling, and optimization tasks, aiming at the development of computational models so that the response of complex biological systems to any perturbation can be predicted. The 17th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB) aims to promote the interaction among the scientific community to discuss applications of CS/AI with an interdisciplinary character, exploring the interactions between subareas of CS/AI, bioinformatics, chemoinformatics, and systems biology. The PACBB'23 technical program includes nine papers by authors from many different countries (Brazil, Indonesia, Peru, Poland, Portugal, Spain, and the United Arab Emirates) and from different subfields of bioinformatics and computational biology.
All papers underwent a peer-review selection: each paper was assessed by three different reviewers from an international panel composed of about 55 members from 12 countries. The quality of submissions was on average good, with an acceptance rate of approximately 70% (nine accepted papers from 13 submissions). Moreover, the Doctoral Consortium session provides a framework in which students can present their ongoing research work, meet other students and researchers, and obtain feedback on future lines of research. There will be special issues in JCR-ranked journals in areas such as integrative bioinformatics, sensors, electronics, and systems. This event will therefore strongly promote the interaction among researchers from international research groups working in diverse fields. The scientific content is innovative, and it will help improve the valuable work being carried out by the participants. This conference is organized by the LASI and Centro Algoritmi of the University of Minho (Portugal). We would like to thank all the contributing authors, the members of the Program Committee, and the sponsors. We thank the project "HERMES: Hybrid Enhanced Regenerative Medicine Systems" (Id. FETPROACT-2018-2020 GA n.824164) for funding support, and finally, we thank the Local Organization members for their valuable work, which is essential for the success of PACBB'23.

Miguel Rocha
Florentino Fdez-Riverola
Mohd Saberi Mohamad
Ana Belén Gil-González

Organization

Program Committee Chairs
Miguel Rocha – University of Minho, Portugal
Mohd Saberi Mohamad – United Arab Emirates University, United Arab Emirates

Organizing Committee Chairs
Florentino Fdez-Riverola – University of Vigo, Spain
Ana Belén Gil-González – University of Salamanca, Spain

Advisory Committee
Gabriella Panuccio – Istituto Italiano di Tecnologia, Italy

Local Organizing Committee
Paulo Novais (Chair) – University of Minho, Portugal
José Manuel Machado (Co-chair) – University of Minho, Portugal
Hugo Peixoto – University of Minho, Portugal
Regina Sousa – University of Minho, Portugal
Pedro José Oliveira – University of Minho, Portugal
Francisco Marcondes – University of Minho, Portugal
Manuel Rodrigues – University of Minho, Portugal
Filipe Gonçalves – University of Minho, Portugal
Dalila Durães – University of Minho, Portugal
Sérgio Gonçalves – University of Minho, Portugal

Organizing Committee
Juan M. Corchado Rodríguez – University of Salamanca and AIR Institute, Spain
Fernando De la Prieta – University of Salamanca, Spain
Sara Rodríguez González – University of Salamanca, Spain

Javier Prieto Tejedor – University of Salamanca and AIR Institute, Spain
Ricardo S. Alonso Rincón – AIR Institute, Spain
Alfonso González Briones – University of Salamanca, Spain
Pablo Chamoso Santos – University of Salamanca, Spain
Javier Parra – University of Salamanca, Spain
Liliana Durón – University of Salamanca, Spain
Marta Plaza Hernández – University of Salamanca, Spain
Belén Pérez Lancho – University of Salamanca, Spain
Ana Belén Gil González – University of Salamanca, Spain
Ana De Luis Reboredo – University of Salamanca, Spain
Angélica González Arrieta – University of Salamanca, Spain
Angel Luis Sánchez Lázaro – University of Salamanca, Spain
Emilio S. Corchado Rodríguez – University of Salamanca, Spain
Raúl López – University of Salamanca, Spain
Beatriz Bellido – University of Salamanca, Spain
María Alonso – University of Salamanca, Spain
Yeray Mezquita Martín – AIR Institute, Spain
Sergio Márquez – AIR Institute, Spain
Andrea Gil – University of Salamanca, Spain
Albano Carrera González – AIR Institute, Spain

Program Committee
Vera Afreixo – University of Aveiro, Portugal
Manuel Álvarez Díaz – University of A Coruña, Spain
Carlos Bastos – University of Aveiro, Portugal
Lourdes Borrajo – University of Vigo, Spain
Ana Cristina Braga – University of Minho, Portugal
Rui Camacho – University of Porto, Portugal
Ángel Canal-Alonso – Universidad de Salamanca, Spain
Fernanda Brito Correia – DEIS/ISEC/Polytechnic Institute of Coimbra, Portugal
Yingbo Cui – National University of Defense Technology, China
Sergio Deusdado – IPB-Polytechnic Institute of Bragança, Portugal
Oscar Dias – University of Minho, Portugal
Florentino Fdez-Riverola – University of Vigo, Spain
Nuno Filipe – University of Porto, Portugal
Nuno A. Fonseca – University of Porto, UK
Narmer Galeano – Universidad Catolica de Manizales, Colombia
Rosalba Giugno – University of Verona, Italy
Gustavo Isaza – University of Caldas, Colombia
Paula Jorge – IBB, CEB Centre of Biological Engineering, Portugal
Rosalia Laza – Universidad de Vigo, Spain
Thierry Lecroq – University of Rouen, France
Filipe Liu – Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA
Hugo López-Fernández – Universidade de Vigo, Spain
Eva Lorenzo Iglesias – University of Vigo, Spain
Mohd Saberi Mohamad – United Arab Emirates University, United Arab Emirates
Loris Nanni – University of Padua, Italy
José Luis Oliveira – University of Aveiro, Portugal
Joel P. Arrais – University of Coimbra, Portugal
Vítor Pereira – University of Minho, Portugal
Martín Pérez Pérez – University of Vigo, Spain
Cindy Perscheid – Hasso Plattner Institute, Germany
Armando Pinho – University of Aveiro, Portugal
Ignacio Ponzoni – Planta Piloto de Ingeniería Química, PLAPIQUI, UNS, CONICET, Argentina
Miguel Reboiro-Jato – University of Vigo, Spain
Jose Ignacio Requeno – Complutense University of Madrid, Spain
João Manuel Rodrigues – DETI/IEETA, University of Aveiro, Portugal
Iván Rodríguez-Conde – UALR, University of Arkansas at Little Rock, USA
Gustavo Santos-Garcia – Universidad de Salamanca, Spain
Ana Margarida Sousa – University of Minho, Portugal
Carolyn Talcott – SRI International, USA
Rita Margarida Teixeira Ascenso – ESTG, IPL, Portugal
Antonio J. Tomeu-Hardasmal – University of Cadiz, Spain
Alicia Troncoso – Universidad Pablo de Olavide, Spain
Eduardo Valente – IPCB, Portugal
Alejandro F. Villaverde – Instituto de Investigaciones Marinas (C.S.I.C.), Spain
Pierpaolo Vittorini – University of L'Aquila, Department of Life, Health, and Environmental Sciences, Italy


Acknowledgements

Contents

Main Track

The Impact of Schizophrenia Misdiagnosis Rates on Machine Learning Models Performance . . . 3
Daniel Martins, Conceição Egas, and Joel P. Arrais

Deep Learning and Transformers in MHC-Peptide Binding and Presentation Towards Personalized Vaccines in Cancer Immunology: A Brief Review . . . 14
Vicente Enrique Machaca, Valeria Goyzueta, Maria Cruz, and Yvan Tupac

Auto-phylo: A Pipeline Maker for Phylogenetic Studies . . . 24
Hugo López-Fernández, Miguel Pinto, Cristina P. Vieira, Pedro Duque, Miguel Reboiro-Jato, and Jorge Vieira

Feature Selection Methods Comparison: Logistic Regression-Based Algorithm and Neural Network Tools . . . 34
Katarzyna Sieradzka and Joanna Polańska

A New GIMME–Based Heuristic for Compartmentalised Transcriptomics Data Integration . . . 44
Diego Troitiño-Jordedo, Lucas Carvalho, David Henriques, Vítor Pereira, Miguel Rocha, and Eva Balsa-Canto

Identifying Heat-Resilient Corals Using Machine Learning and Microbiome . . . 53
Hyerim Yong and Mai Oudah

Machine Learning Based Screening Tool for Alzheimer's Disease via Gut Microbiome . . . 62
Pedro Velasquez and Mai Oudah

Progressive Multiple Sequence Alignment for COVID-19 Mutation Identification via Deep Reinforcement Learning . . . 73
Zanuba Hilla Qudrotu Chofsoh, Imam Mukhlash, Mohammad Iqbal, and Bandung Arry Sanjoyo

Analysis of the Confidence in the Prediction of the Protein Folding by Artificial Intelligence . . . 84
Paloma Tejera-Nevado, Emilio Serrano, Ana González-Herrero, Rodrigo Bermejo-Moreno, and Alejandro Rodríguez-González

Doctoral Consortium

Neoantigen Detection Using Transformers and Transfer Learning in the Cancer Immunology Context . . . 97
Vicente Enrique Machaca Arceda

Author Index . . . 103

Main Track

The Impact of Schizophrenia Misdiagnosis Rates on Machine Learning Models Performance

Daniel Martins1,2(B), Conceição Egas2, and Joel P. Arrais1

1 CISUC - Centre for Informatics and Systems of the University of Coimbra, Polo II, Pinhal de Marrocos, 3030-290 Coimbra, Portugal {defm,jpa}@dei.uc.pt
2 CIBB - Centre for Innovative Biomedicine and Biotechnology, Universidade de Coimbra, Rua Larga Ed. FMUC, 3004-504 Coimbra, Portugal [email protected]

Abstract. Schizophrenia is a complex disease with severely disabling symptoms. A consistent leading causal gene for the disease onset has not been found, and there is a lack of consensus on the disease etiology and diagnosis. Sweden poses a paradigmatic case, where relatively high misdiagnosis rates (19%) have been reported. A large-scale case-control dataset based on the Swedish population was reduced to its most representative variants, and the distinction between cases and controls was further scrutinized through gene-annotation-based Machine Learning (ML) models. The intra-group differences in cases and controls were accentuated by training the model on the entire dataset. The cases and controls with a higher likelihood of being misclassified, and hence more likely to be misdiagnosed, were excluded from subsequent analysis. The model was then conventionally trained on the reduced dataset and the performances were compared. The results indicate that the reported prevalence and misdiagnosis rates for Schizophrenia may be transposed to case-control cohorts, reducing the performance of eventual association studies based on such datasets. After the sample filtering procedure, a simple Machine Learning model reached a performance more concurrent with the Schizophrenia heritability estimates in the literature. Sample selection on large-scale datasets sequenced for Association Studies may enable the adaptation of ML approaches and strategies to complex disease research.

Keywords: Schizophrenia · Genomics · Machine Learning · Sample Prioritization

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 3–13, 2023. https://doi.org/10.1007/978-3-031-38079-2_1

1 Introduction

Schizophrenia (SCZ) is an early onset chronic psychiatric disorder with a tendency to run in families [25]. It is characterized by disabling symptoms such as delusions and hallucinations, speech and behavioral disorganization, and cognitive impairment. SCZ is among the 15 leading causes of disability worldwide [5] and is often associated with premature mortality and severe comorbidities [28]. Despite its low prevalence in the general population (0.32% according to the World Health Organization [29]), this number has grown over the decades, especially among the working-age population [5]. Over the same period, the differential mortality gap associated with SCZ has increased [25]. A greater prevalence is found in the most developed countries [5] and at higher latitudes [34]. Moreover, precise estimates for SCZ prevalence are difficult to obtain due to its diagnosis complexity [2]. Adding to this, high rates of SCZ misdiagnosis have long been reported [15], and this problem is still observed in current times in both developing [1] and developed countries [6]. Among the latter, there is extensive research in Scandinavian countries on the ascertainment of SCZ diagnoses. Historically, these countries have followed more conservative diagnostic approaches, mainly focusing on biological aspects of the disease etiology [21,33]. Large-scale case-control cohorts from those regions have been recurrently used for genomic studies on SCZ [9,10,12,32,33]. Swedish cases are usually identified and selected from the Hospital Discharge Register (HDR). Initial studies on the validity of the registered SCZ diagnoses reported relatively high concordances (76%-81%) [7,21,23] with reevaluations under the criteria of the DSM-IV (Diagnostic and Statistical Manual of Mental Disorders). A higher concordance rate (94%) was found when considering a broader diagnosis of schizophrenic psychoses (SCZ, schizoaffective psychosis, or schizophreniform disorder).
Thereafter, the HDR could be used as a reliable criterion to define cases under such a broader definition [9]. However, by solely considering SCZ diagnoses, the concordance was similar to previous reports (75%). This uncertainty on the scope of SCZ extends to its pathophysiology. Beyond being a genetically complex disease, the neurobiological processes underlying SCZ onset are yet to be fully understood, and the heterogeneity of the condition reinforces the lack of consensus on its diagnosis and etiology [11,18]. Although the disease heritability is predicted to be high (≥79%) [11,13], a consistent leading causal gene for the disease onset has not been found. Dopamine, glutamate, and GABAergic systems were all associated with SCZ by post-mortem studies; however, their primacy in the disease onset is uncertain [16,35]. Accurately pinpointing a group of risk genes would greatly benefit disease knowledge. Machine Learning (ML) approaches present a significant potential to extract generalizing information from the growing amount of biological data. However, the present work suggests that even before that application, the ascertainment of correct diagnoses constitutes an essential requirement, as ML classification models heavily depend on well-defined case and control subsets.

2 Methods

2.1 Data Description and Quality Control

For the present work, the dbGaP dataset phs000473.v2.p2 was studied and analyzed. It includes 12,380 samples, composed of 6,245 controls, 4,969 SCZ cases, and 1,166 Bipolar disorder cases from a Swedish-based case-control Whole-Exome Sequencing study. The SCZ cases were ascertained from the Swedish Hospital Discharge Register. To be included, cases must have had two or more hospitalizations with a discharge diagnosis of SCZ. Registered diagnoses of medical or psychiatric disorders mitigating a confident diagnosis of SCZ were the exclusion criteria. Controls were randomly selected from population registers; hospitalizations for SCZ were the only exclusion criterion. For all samples, the participant should be at least 18 years old, and both parents must have been born in Scandinavia. The original data were already filtered for variant Phred-scaled quality (QUAL) > 30. Variant sites with a mean read depth (DP) below eight or a genotype call rate below 90% were filtered from the working dataset. The Variant Quality Score Recalibration steps of the Genome Analysis Toolkit (version 3.8) Best Practices workflow were followed. Multi-allelic variants were separated and filtered according to the resultant genotype call rate.

2.2 Genotype-Phenotype Association

A chi-square test was performed on all variants over a 3 × 3 contingency table of the genotypes on cases and controls. Bipolar samples were removed from the data. InDels and variants on sex chromosomes were also filtered. SNPs were annotated with the up-to-date Annovar version for the hg19 genome build (20211019). The p-values were not corrected for multiple comparisons in order to include a higher and more representative number of variants in the subsequent analysis. The final dataset included 18,970 variants from 9,160 genes.

2.3 Machine Learning Models

The GenNet framework [14] was used to implement the neural networks. Besides genotypes, a topology file containing all variant-gene correspondences was written and provided as input; these connections represent the predefined edges of the neural network. Models were built using Keras, consisting of three layers: input (variants), gene, and output (phenotype) (Fig. 1). To compare the performances to the original publication, the models were trained and optimized using the same reported hyperparameters: batch size of 64; learning rate of 1 × 10−4; ADAM optimizer over a binary cross-entropy loss. To learn and maximize the distinction between eventual sub-classes among both cases and controls, the model was trained and tested on the entire dataset 10 times with an early stop at 200 epochs. A simple average of the test classification scores (0 to 1) was calculated. The misclassified controls with higher scores and misclassified cases with lower scores were removed from subsequent analyses. The proportion of samples removed from the control subset corresponds to the estimated prevalence of SCZ in the Swedish population (0.34%) according to the Gillberg Neuropsychiatry Centre of the University of Gothenburg [24]. The proportion of samples removed from the case subset corresponds to the lowest SCZ misdiagnosis rates reported in Sweden for both the broad (6%) and narrow (19%) definitions of SCZ [7,9,21,23]. After training the final models, the averages of the test AUC scores were compared using a two-tailed unpaired t-test.

Fig. 1. Example of a model built using the GenNet framework. The edges are biologically informed, and correspond to variant-gene annotations from Annovar. All gene nodes are connected to the final node on the output layer.
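The variant-to-gene-to-phenotype topology of Fig. 1 can be emulated by masking a dense weight matrix with the variant-gene annotation matrix, so that only annotated connections carry weight. The sketch below is a minimal NumPy illustration of such a forward pass (hypothetical shapes, seeds, and activations; the actual GenNet implementation uses custom sparse Keras layers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(genotypes, mask, w_vg, w_go, b_gene=0.0, b_out=0.0):
    """Forward pass of a biologically informed network.

    genotypes: (n_samples, n_variants) genotype dosages (0, 1 or 2)
    mask:      (n_variants, n_genes) 0/1 variant-gene annotation matrix
    w_vg:      (n_variants, n_genes) input-to-gene weights (masked entries unused)
    w_go:      (n_genes,) gene-to-output weights
    """
    gene_layer = np.tanh(genotypes @ (w_vg * mask) + b_gene)  # one node per gene
    return sigmoid(gene_layer @ w_go + b_out)                 # score in (0, 1)

# Toy topology: variants 0 and 1 annotated to gene A, variant 2 to gene B.
rng = np.random.default_rng(0)
mask = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
genotypes = rng.integers(0, 3, size=(4, 3)).astype(float)
scores = forward(genotypes, mask, rng.normal(size=(3, 2)), rng.normal(size=2))
```

Because the mask zeroes every non-annotated weight before the matrix product, the effective number of connections grows with the annotation rather than with n_variants × n_genes, which is what keeps the gene layer interpretable.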

2.4 Over-Representation Analysis

For the functional analysis of the intermediary results, over-representation tests were performed on the 2019 online version of WebGestalt (WEB-based GEne SeT AnaLysis Toolkit) [22] against PANTHER (Protein Analysis Through Evolutionary Relationships) v3.6.1 [26,36], KEGG (Kyoto Encyclopedia of Genes and Genomes) Release 88.2 [19] and OMIM (Online Mendelian Inheritance in Man) [17] using the entire genome as the reference set.
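An over-representation test of this kind reduces to a hypergeometric tail probability: with N genes in the reference set, K of them in a given pathway, and k pathway members among the n selected genes, the p-value is P(X ≥ k). A sketch with SciPy (illustrative numbers, not the paper's):

```python
from scipy.stats import hypergeom

def over_representation(N, K, n, k):
    """One-sided hypergeometric test for pathway enrichment.

    N: genes in the reference set, K: pathway genes in the reference set,
    n: selected genes, k: pathway genes among the selected genes."""
    p_value = hypergeom.sf(k - 1, N, K, n)  # P(X >= k)
    enrichment_ratio = (k / n) / (K / N)    # observed vs. expected fraction
    return p_value, enrichment_ratio

# Illustrative numbers: 20,000 reference genes, a 40-gene pathway,
# and 8 pathway hits among 150 selected genes.
p, ratio = over_representation(20000, 40, 150, 8)
```

Tools such as WebGestalt additionally control the false discovery rate across all tested pathways, which is where the FDR values reported alongside the raw p-values come from.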

3 Results

3.1 Association Test

A total of 18,970 autosomal SNPs with a non-corrected significant association to SCZ were found in this dataset, corresponding to approximately 1.5% of the original data. Among those, 8,971 variants (47.3%) presented an allele frequency (AF) < 1% and 7,304 (38.5%) an AF ≥ 5%. There were no singletons in this subset.

3.2 Test on Train Data

The model was trained and tested on the entire dataset ten times to uncover genetically heterogeneous samples among cases and controls. Hereafter, for simplicity, those samples will be referred to as misdiagnosed; however, the term does not refer to a clinical assertion, but rather to the genetic distance to the most prevalent and representative profile among either cases or controls. This procedure reinforces the internal correlations and differences in the trained dataset, and it was thus used to evidence and accentuate any inherent difference between misdiagnosed samples and cases or controls. These models presented an average test AUC of 0.9241 (SD 0.0014, 95% CI [0.923, 0.925]). There were 167 controls (2.7% of all controls) and 388 cases (7.8%) misclassified on all ten tests. To verify whether the distinction of cases and controls was mainly driven by a genetic basis for SCZ, the mean of the weights for each node on the gene layer across the ten models was calculated. As the models would assign higher weights to genes correlated to SCZ phenotypes, they would also be more likely to ascertain misdiagnosed samples. The 150 genes with the highest mean scores were selected for subsequent functional analysis. A significant over-representation of the beta-adrenergic signaling pathways in the provided gene set was found against the PANTHER reference set. Against KEGG, significant associations with extracellular matrix receptor pathways and cardiomyopathies were observed. Finally, a significant over-representation of genes associated with SCZ was detected against the OMIM database (Table 1).

Table 1. Significant associations with pathways previously associated with Schizophrenia in the literature. Enr. R: Enrichment Ratio; Ref.: Reference.

Database   Pathway                                     Enr. R   p-value      FDR          Ref.
PANTHER    Beta1 adrenergic receptor sign. pathway     9.6      6.4 × 10−4   3.6 × 10−2   [20]
PANTHER    Beta2 adrenergic receptor sign. pathway     9.6      6.4 × 10−4   3.6 × 10−2   [20]
KEGG       ECM-receptor interaction                    8.4      2.9 × 10−4   3.3 × 10−2   [30]
KEGG       Hypertrophic cardiomyopathy                 8.3      3.1 × 10−4   3.3 × 10−2   [8]
KEGG       Dilated cardiomyopathy                      7.7      4.5 × 10−4   3.7 × 10−2   [31]
OMIM       Schizophrenia                               57.4     5.0 × 10−4   2.5 × 10−2   –

3.3 Filtering of Discordant Samples

After testing on train data, the mean values for the test scores on both cases and controls were calculated and sorted. To match the reported prevalence of SCZ in Sweden (0.34%) [24], the 21 control samples with the highest mean score were removed. The mean score of the last sample removed was 0.83. For cases, the threshold matched the lowest misdiagnosis rate (6%) reported in Sweden for a broad definition of SCZ [9]. It corresponded to the 300 cases with the lowest mean score. The mean score of the last sample removed was 0.30.
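The exclusion rule above, dropping the most case-like controls (to match the 0.34% prevalence) and the most control-like cases (to match a reported misdiagnosis rate), can be sketched as follows. The scores are random stand-ins; the cohort sizes come from the paper, while the rounding of the thresholds is an assumption:

```python
import numpy as np

def filter_discordant(control_scores, case_scores, prevalence=0.0034, misdiag=0.06):
    """Return boolean keep-masks over controls and cases.

    Controls with the highest mean classification scores (most case-like) and
    cases with the lowest scores (most control-like) are flagged for removal."""
    n_drop_ctrl = round(len(control_scores) * prevalence)
    n_drop_case = round(len(case_scores) * misdiag)
    ctrl_order = np.argsort(control_scores)  # ascending: last entries most case-like
    case_order = np.argsort(case_scores)     # ascending: first entries most control-like
    keep_ctrl = np.ones(len(control_scores), dtype=bool)
    keep_case = np.ones(len(case_scores), dtype=bool)
    keep_ctrl[ctrl_order[len(control_scores) - n_drop_ctrl:]] = False
    keep_case[case_order[:n_drop_case]] = False
    return keep_ctrl, keep_case

rng = np.random.default_rng(2)
ctrl_scores = rng.uniform(size=6245)  # stand-in mean test scores for controls
case_scores = rng.uniform(size=4969)  # stand-in mean test scores for cases
keep_ctrl, keep_case = filter_discordant(ctrl_scores, case_scores)
```

With these cohort sizes, the 0.34% prevalence threshold removes 21 controls, matching the paper; the exact case count depends on how the 6% fraction is rounded.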


Lastly, the lowest misdiagnosis rate (19%) reported in Sweden for a narrow definition of SCZ [21] was also considered. In this scenario, cases were excluded if misclassified on at least six of the ten tests, which excluded 944 cases. The mean score of the last sample removed was 0.41.
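The vote-based exclusion used for the narrow definition can be sketched as follows; the misclassification counts are illustrative, not the study's values.

```python
# For the narrow SCZ definition, a case is excluded when it was misclassified
# in at least 6 of the 10 test-on-train runs.

def exclude_by_votes(misclassified_counts, min_votes=6):
    """misclassified_counts: dict sample_id -> number of runs (out of 10)
    in which the sample was misclassified."""
    return {s for s, n in misclassified_counts.items() if n >= min_votes}

counts = {"case1": 2, "case2": 7, "case3": 10, "case4": 5}
print(sorted(exclude_by_votes(counts)))  # ['case2', 'case3']
```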

3.4 Classification Model

A 60/20/20 training-validation-test split on the 18,970 input variants was used to train and test the final classification model. For each experiment, the model was run ten times, and a full re-randomization of the samples was performed before each run. The results reported in the original GenNet publication [14] were considered as the benchmark. Both the benchmark and a Base model were trained on all 4,969 cases and 6,245 controls. Model 1 was trained on a subset excluding the 21 misclassified controls. Model 2 was trained excluding the misclassified controls and the 300 cases identified under the broader definition. Model 3 was trained excluding the misclassified controls and the 944 cases identified under the narrow definition (Table 2).

Table 2. Summary of results for the three tested models. Model 1: excluded 21 misclassified controls; Model 2: excluded 21 misclassified controls and 300 cases according to the broader definition of schizophrenia; Model 3: excluded 21 misclassified controls and 944 cases according to the narrow definition of schizophrenia.

| Model | Cases | Controls | Avg. Val. AUC | Avg. Test AUC |
|---|---|---|---|---|
| GenNet (benchmark) | 4,969 | 6,245 | 0.70 ± 0.018 | 0.72 ± 0.016 |
| Base | 4,969 | 6,245 | 0.71 ± 0.013 | 0.71 ± 0.012 |
| Model 1 | 4,969 | 6,224 | 0.71 ± 0.011 | 0.71 ± 0.009 |
| Model 2 | 4,669 | 6,224 | 0.75 ± 0.007 | 0.75 ± 0.008 |
| Model 3 | 4,025 | 6,224 | 0.82 ± 0.010 | 0.81 ± 0.010 |
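The evaluation protocol (ten runs, full re-shuffle before each run, 60/20/20 split, mean and standard deviation of the test AUCs) can be sketched as follows; train_and_score is a placeholder for fitting GenNet and computing a test AUC.

```python
import random

# Repeated evaluation scheme: re-shuffle all samples before every run, split
# 60/20/20 into train/validation/test, then report mean and SD of the scores.

def split_60_20_20(samples, rng):
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def repeated_runs(samples, train_and_score, n_runs=10, seed=0):
    rng = random.Random(seed)
    scores = [train_and_score(*split_60_20_20(samples, rng))
              for _ in range(n_runs)]
    mean = sum(scores) / n_runs
    sd = (sum((s - mean) ** 2 for s in scores) / (n_runs - 1)) ** 0.5
    return mean, sd

# Placeholder scorer that always returns the same AUC, for illustration only.
mean, sd = repeated_runs(list(range(100)), lambda tr, va, te: 0.7, n_runs=10)
print(round(mean, 2), round(sd, 2))  # 0.7 0.0
```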

Statistically significant differences in the average test AUC scores of both Model 2 and Model 3 against the Base model, t(18) = 8.77, p < 0.0001 and t(18) = 20.24, p < 0.0001, respectively, were observed. There was also a significant difference between the results of Model 2 and Model 3, t(18) = 14.82, p < 0.0001, which indicates an improved performance of the model on filtered data.
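The reported statistics are consistent with an independent two-sample t-test on the ten per-run test AUCs of each model (df = 10 + 10 − 2 = 18); a pure-Python sketch with illustrative scores, not the paper's actual per-run values.

```python
# Student's t statistic for two independent samples with pooled variance,
# as used to compare the ten test-AUC scores of two models.

def two_sample_t(a, b):
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)   # sample variance of b
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    t = (ma - mb) / (sp2 * (1 / na + 1 / nb)) ** 0.5
    return t, na + nb - 2

base = [0.71, 0.70, 0.72, 0.71, 0.69, 0.72, 0.71, 0.70, 0.71, 0.72]
model2 = [0.75, 0.76, 0.74, 0.75, 0.75, 0.76, 0.74, 0.75, 0.76, 0.74]
t, df = two_sample_t(model2, base)
print(df)  # 18
```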

4 Discussion

Despite the existence of thoroughly discussed guidelines, even the primary diagnostic instrument in psychiatry (DSM-5) faces controversy [27]. There is a lack of evident biomarkers for most mental health disorders, including SCZ [4]. Diagnosis relies mainly on the observation and assessment of thoughts, feelings and behavior patterns, which carry an inherent uncertainty. This might constitute a paradox: the lack of genetic and biochemical risk factors hampers objective diagnoses, and this, in turn, can lead to the selection of cases that do not correspond to the exact phenotype under study, thus producing inconclusive, or even misleading, results.

ML models enable new approaches to enhance results and provide new perspectives on biological problems, and they benefit from the increasing amount of data produced over the years. However, classification models heavily depend on a robust and reliable definition of classes in the training data. It has been demonstrated that, besides the cost-effectiveness of their use, unscreened controls may be considered for large-scale association studies on disorders with low prevalence in the population, with little impact on predictive power [33]. However, this estimate has not been assessed in ML studies. SCZ does present a low prevalence in the population, so the control selection criteria for the phs000473.v2.p2 study are valid for the original purposes of the research. The original case inclusion criteria are also well suited for association studies.

Throughout the years, reassessments of SCZ diagnoses in the Swedish HDR have reported equivalent results, and an increased concordance rate when considering broader definitions of SCZ [9]. More recently, genetic overlaps between SCZ and schizoaffective disorder have been reported [3], endorsing the previous results on broader SCZ definitions. Such a concordance rate establishes the selected Swedish HDR cases as a reliable ground truth for simple association studies. The preliminary results of this work also evidence these associations. Furthermore, testing the model on training data endorses the validity of said associations for the dataset used.
Various genes underlying both the significant association with SCZ in OMIM and the significantly enriched PANTHER and KEGG pathways have previously been associated with SCZ in the literature. However, in contrast to classical association studies, ML models do not rely on individual genotypes, so misclassified samples could have a greater impact on the overall results. Using the original results of the GenNet tool [14] as a benchmark, replicating the tests on a reduced dataset produced the same results. Removing the samples with a greater probability of being misclassified, however, resulted in significantly improved performances, closer to the heritability estimates in the literature [11,13].

These results suggest that filtering datasets previously generated for large-scale association studies may enable a more appropriate application of ML approaches to the problem of complex diseases. The filtering procedures must nevertheless preserve representativeness in order to produce the most informative results, and more appropriate sample prioritization methodologies could be developed. Still, the current results corroborate that ML models may, in fact, constitute major benefits for complex disease research, provided their application is adapted rather than simply transposed to biological problems. Unconventional procedures such as testing on training data may be helpful as a preliminary step for sample selection, especially for problems of greater biological and genetic uncertainty, such as the presented example of SCZ etiology.


As for future endeavours, such models should always be tested and validated on additional data. Most importantly, the training sets must be constantly re-evaluated and updated, as the presented intermediary step poses a risk of specializing the model for predicting the effect of a determined set of genetic variants and genes. It is possible, and even likely, that the presented approach excludes genetic factors that could have had, but did not present, a considerable influence on the classification due to their lack of representation in this dataset. The availability of new data from other samples may increase the weight of currently non-significant gene-disease associations. Hence, it would be necessary to repeat the process and update the training set, and consequently the classifier. This would conform to the fluidity and progressive evolution of the debates on mental health disorder diagnosis and etiology. But instead of adding to the discordance and controversy, iterative approaches may help to understand how different definitions and diagnosis guidelines for SCZ and other mental health disorders could influence their predictability through genomics.

5 Conclusion

The collection and registration of medical records, together with the rising number of large-scale genetic studies at both international and national levels, constitute an invaluable benefit to scientific advances. Scandinavian countries, and Sweden in particular, are the best example of this paradigm for SCZ. There is a lack of consensus on the diagnosis of mental health disorders, in particular of SCZ. Thus, case-control studies deal with an inherent dataset misclassification rate. The number of samples enrolled in a study may overcome this limitation in explorative and association studies. However, that may not be the case for new approaches, such as the adaptation of ML classification algorithms to biological problems. ML classification models heavily rely on well-defined case and control subsets to train on and learn the structure of the data, and larger misclassification rates hinder model performance. This work reveals that the exclusion of the most likely misclassified samples yields more accurate performances with the same model. In future work, new strategies must be developed to perform more informed exclusions.

Acknowledgements. The datasets used for the analysis described in this manuscript were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/gap through dbGaP accession number phs000473.v2.p2. Samples used for data analysis were provided by the Swedish Cohort Collection supported by the NIMH grant R01MH077139, the Sylvan C. Herman Foundation, the Stanley Medical Research Institute and The Swedish Research Council (grants 2009-4959 and 2011-4659). Support for the exome sequencing was provided by the NIMH Grand Opportunity grant RCMH089905, the Sylvan C. Herman Foundation, a grant from the Stanley Medical Research Institute and multiple gifts to the Stanley Center for Psychiatric Research at the Broad Institute of MIT and Harvard.
This work is funded by the FCT - Foundation for Science and Technology, I.P./MCTES through national funds (PIDDAC), within the scope of CISUC R&D Unit - UIDB/00326/2020 and the PhD Scholarship SFRH/BD/146094/2019.


References

1. Ayano, G., Demelash, S., Yohannes, Z., et al.: Misdiagnosis, detection rate, and associated factors of severe psychiatric disorders in specialized psychiatry centers in Ethiopia. Ann. Gen. Psychiatry 20 (2021). https://doi.org/10.1186/s12991-021-00333-7
2. Bergsholm, P.: Is schizophrenia disappearing? The rise and fall of the diagnosis of functional psychoses: an essay. BMC Psychiatry 16 (2016). https://doi.org/10.1186/s12888-016-1101-5
3. Cardno, A.G., Owen, M.J.: Genetic relationships between schizophrenia, bipolar disorder, and schizoaffective disorder. Schizophr. Bull. 40, 504–515 (2014). https://doi.org/10.1093/schbul/sbu016
4. Carvalho, A.F., Solmi, M., Sanches, M., et al.: Evidence-based umbrella review of 162 peripheral biomarkers for major mental disorders. Transl. Psychiatry 10 (2020). https://doi.org/10.1038/s41398-020-0835-5
5. Charlson, F.J., Ferrari, A.J., Santomauro, D.F., et al.: Global epidemiology and burden of schizophrenia: findings from the global burden of disease study 2016. Schizophr. Bull. 44, 1195–1203 (2018). https://doi.org/10.1093/schbul/sby058
6. Coulter, C., Baker, K.K., Margolis, R.L.: Specialized consultation for suspected recent-onset schizophrenia: diagnostic clarity and the distorting impact of anxiety and reported auditory hallucinations. J. Psychiatr. Pract. 25, 76–81 (2019). https://doi.org/10.1097/PRA.0000000000000363
7. Dalman, C., Broms, J., Cullberg, J., Allebeck, P.: Young cases of schizophrenia identified in a national inpatient register - are the diagnoses valid? Soc. Psychiatry Psychiatr. Epidemiol. 37, 527–531 (2002). https://doi.org/10.1007/s00127-002-0582-3
8. Edwards, G.G., Uy-Evanado, A., Stecker, E.C., et al.: Sudden cardiac arrest in patients with schizophrenia: a population-based study of resuscitation outcomes and pre-existing cardiovascular disease. IJC Heart Vasculature 40 (2022). https://doi.org/10.1016/j.ijcha.2022.101027
9. Ekholm, B., Ekholm, A., Adolfsson, R., et al.: Evaluation of diagnostic procedures in Swedish patients with schizophrenia and related psychoses. Nord. J. Psychiatry 59, 457–464 (2005). https://doi.org/10.1080/08039480500360906
10. Ganna, A., Genovese, G., Howrigan, D.P., et al.: Ultra-rare disruptive and damaging mutations influence educational attainment in the general population. Nat. Neurosci. 19, 1563–1565 (2016). https://doi.org/10.1038/nn.4404
11. Gejman, P.V., Sanders, A.R., Duan, J.: The role of genetics in the etiology of schizophrenia. Psychiatr. Clin. North Am. 33, 35–66 (2010). https://doi.org/10.1016/j.psc.2009.12.003
12. Genovese, G., Fromer, M., Stahl, E.A., et al.: Increased burden of ultra-rare protein-altering variants among 4,877 individuals with schizophrenia. Nat. Neurosci. 19, 1433–1441 (2016). https://doi.org/10.1038/nn.4402
13. Hilker, R., Helenius, D., Fagerlund, B., et al.: Heritability of schizophrenia and schizophrenia spectrum based on the nationwide Danish twin register. Biol. Psychiat. 83, 492–498 (2018). https://doi.org/10.1016/j.biopsych.2017.08.017
14. van Hilten, A., Kushner, S.A., Kayser, M., et al.: GenNet framework: interpretable deep learning for predicting phenotypes from genetic data. Commun. Biol. 4 (2021). https://doi.org/10.1038/s42003-021-02622-z
15. Honer, W.G., Smith, G.N., MacEwan, G.W., et al.: Diagnostic reassessment and treatment response in schizophrenia. J. Clin. Psychiatry 55 (1994)


16. Hu, W., Macdonald, M.L., Elswick, D.E., Sweet, R.A.: The glutamate hypothesis of schizophrenia: evidence from human brain tissue studies. Ann. N. Y. Acad. Sci. 1338, 38–57 (2015). https://doi.org/10.1111/nyas.12547
17. Johns Hopkins University (Baltimore, MD), McKusick-Nathans Institute of Genetic Medicine: Online Mendelian Inheritance in Man, OMIM (2023). https://omim.org/
18. Kahn, R.S., Sommer, I.E., Murray, R.M., et al.: Schizophrenia. Nat. Rev. Disease Primers 1 (2015). https://doi.org/10.1038/nrdp.2015.67
19. Kanehisa, M., Goto, S.: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000). https://doi.org/10.1093/nar/28.1.27
20. Kondej, M., Stępnicki, P., Kaczor, A.A.: Multi-target approach for drug discovery against schizophrenia. Int. J. Mol. Sci. 19 (2018). https://doi.org/10.3390/ijms19103105
21. Kristjansson, E., Allebeck, P., Wistedt, B.: Validity of the diagnosis schizophrenia in a psychiatric inpatient register: a retrospective application of DSM-III criteria on ICD-8 diagnoses in Stockholm county. Nord. J. Psychiatry 41, 229–234 (1987). https://doi.org/10.3109/08039488709103182
22. Liao, Y., Wang, J., Jaehnig, E.J., et al.: WebGestalt 2019: gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 47, W199–W205 (2019). https://doi.org/10.1093/nar/gkz401
23. Ludvigsson, J.F., Andersson, E., Ekbom, A., et al.: External review and validation of the Swedish national inpatient register. BMC Publ. Health 11 (2011). https://doi.org/10.1186/1471-2458-11-450
24. Lugnegård, T., Hallerbäck, M.U.: Schizophrenia (2022). https://www.gu.se/en/gnc/schizophrenia
25. McGrath, J., Saha, S., Chant, D., Welham, J.: Schizophrenia: a concise overview of incidence, prevalence, and mortality. Epidemiol. Rev. 30, 67–76 (2008). https://doi.org/10.1093/epirev/mxn001
26. Mi, H., Muruganujan, A., Casagrande, J.T., Thomas, P.D.: Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc. 8, 1551–1566 (2013). https://doi.org/10.1038/nprot.2013.092
27. Nemeroff, C.B., Weinberger, D., Rutter, M., et al.: DSM-5: a collection of psychiatrist views on the changes, controversies, and future directions. BMC Med. 11 (2013). https://doi.org/10.1186/1741-7015-11-202
28. Olfson, M., Gerhard, T., Huang, C., et al.: Premature mortality among adults with schizophrenia in the United States. JAMA Psychiat. 72, 1172–1181 (2015). https://doi.org/10.1001/jamapsychiatry.2015.1737
29. World Health Organization: Schizophrenia (2022). https://www.who.int/news-room/fact-sheets/detail/schizophrenia
30. Pantazopoulos, H., Katsel, P., Haroutunian, V., et al.: Molecular signature of extracellular matrix pathology in schizophrenia. Eur. J. Neurosci. 53, 3960–3987 (2021). https://doi.org/10.1111/ejn.15009
31. Pillinger, T., Osimo, E.F., de Marvao, A., et al.: Cardiac structure and function in patients with schizophrenia taking antipsychotic drugs: an MRI study. Transl. Psychiatry 9 (2019). https://doi.org/10.1038/s41398-019-0502-x
32. Purcell, S.M., Moran, J.L., Fromer, M., et al.: A polygenic burden of rare disruptive mutations in schizophrenia. Nature 506, 185–190 (2014). https://doi.org/10.1038/nature12975
33. Ripke, S., O'Dushlaine, C., Chambert, K., et al.: Genome-wide association analysis identifies 13 new risk loci for schizophrenia. Nat. Genet. 45, 1150–1159 (2013). https://doi.org/10.1038/ng.2742


34. Saha, S., Chant, D.C., Welham, J.L., McGrath, J.J.: The incidence and prevalence of schizophrenia varies with latitude. Acta Psychiatr. Scand. 114, 36–39 (2006). https://doi.org/10.1111/j.1600-0447.2005.00742.x
35. Schmidt, M.J., Mirnics, K.: Neurodevelopment, GABA system dysfunction, and schizophrenia (2015). https://doi.org/10.1038/npp.2014.95
36. Thomas, P.D., Ebert, D., Muruganujan, A., et al.: PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci. 31, 8–22 (2022). https://doi.org/10.1002/pro.4218

Deep Learning and Transformers in MHC-Peptide Binding and Presentation Towards Personalized Vaccines in Cancer Immunology: A Brief Review

Vicente Enrique Machaca¹(B), Valeria Goyzueta¹, Maria Cruz², and Yvan Tupac²

¹ Universidad La Salle, Arequipa, Peru
{vmachacaa,vgoyzuetat}@ulasalle.edu.pe
² Universidad Católica San Pablo, Arequipa, Peru
{maria.cruz,ytupac}@ucsp.edu.pe

Abstract. Cancer immunology is a new alternative to traditional cancer treatments like radiotherapy and chemotherapy. Among the existing strategies, neoantigen detection for the development of cancer vaccines has had a high impact in recent years. However, neoantigen detection depends on the correct prediction of peptide-MHC binding. Furthermore, transformers are considered a revolution in artificial intelligence with a high impact on NLP tasks, and since amino acids and proteins can be treated like words and sentences, the peptide-MHC binding prediction problem can be seen as an NLP task. Therefore, in this work, we performed a systematic literature review of deep learning and transformer methods used in peptide-MHC binding and presentation prediction, and analyzed how ANNs, CNNs, RNNs, and Transformers are used.

Keywords: Deep learning · neoantigen · review · survey · peptide MHC binding · peptide MHC presentation · Cancer Immunology

1 Introduction

Cancer represents the world's biggest health problem and is the leading cause of death, with around a million deaths reported in 2020 and about 400,000 children developing cancer yearly. Unfortunately, despite many efforts to mitigate the deaths caused by this disease, traditional methods based on surgeries, radiotherapies, and chemotherapies have low effectiveness [1]. In this context, Cancer Immunology arises, which aims to stimulate a patient's immune system; one of the most promising methods is the development of personalized vaccines [2]; however, it depends on the correct detection of neoantigens.

Neoantigens are tumor-specific mutated peptides and are considered the main causes of an immune response [2–4]. The goal is to train a patient's lymphocytes to recognize the neoantigens and activate the immune system [1,5]. Nevertheless, less than 5% of the detected neoantigens succeed in activating the immune system [5].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 14–23, 2023. https://doi.org/10.1007/978-3-031-38079-2_2

The life cycle of a neoantigen can be summarized as follows. First, a protein is degraded into peptides (promising neoantigens) in the cytoplasm. Next, the peptides are attached to the Major Histocompatibility Complex (MHC), known as peptide-MHC binding (pMHC). Then, this compound follows a pathway until it reaches the cell membrane (pMHC presentation). Finally, the pMHC compound is recognized by the T-cell Receptor (TCR), triggering the immune system. Moreover, MHC has three classes: I, II, and III; however, only classes I and II are involved in neoantigen presentation. Additionally, there are approximately 10,000 MHC class I allele variants in human cells [6]. Furthermore, there are some reviews [7,8] which focus on pMHC binding studies, because it is a critical part of neoantigen detection in Cancer Immunology; however, these studies did not consider modern deep learning methods.

2 Methodology

In order to review deep learning and transformer methods used in pMHC binding prediction, we performed a Systematic Literature Review (SLR). The search string used was: "(MHC-I OR MHC-II OR MHC OR HLA) AND (peptide OR epitope) AND (binding OR affinity OR prediction OR detection OR presentation) OR (neoantigen detection)". We searched IEEE Xplore, Science Direct, Springer, ACM Digital Library, PubMed, and Scopus, and proposed the following research questions:

Q1. How are deep learning and transformers used in MHC-peptide binding and presentation prediction?
Q2. What type of input data and pre-processing methods are used?
Q3. Which are the most promising methods?

Using the search string and considering only articles since 2018, we analyzed the papers' titles and obtained 323 articles. Then, a subset was selected based on the inclusion criteria: articles from venues with ERA category A or B, or Q1/Q2 journal articles. At the end of this stage, 62 articles were obtained. Finally, as we focus on deep learning and transformer methods, a smaller subset of 54 papers was selected.

3 Input Encoding

Neoantigen detection can be framed as the prediction of the affinity between peptides and the Major Histocompatibility Complex (MHC) class I and II. Peptides and MHCs are small proteins. In the human cell, there are 20 amino acids, each represented by a letter. So, a peptide can be represented as a chain like p = {A, A, N, L, ...} (normally 8–15 residues), and an MHC is another chain like q = {A, N, K, L, ..., Q} (class I and II have 35–40 residues according to public datasets). Finally, we need to know the probability of affinity between p and q: if it is high enough, then it is possible that the peptide p binds to q, and this peptide p could be a neoantigen.

Two main methods are used for encoding amino acid chains: one-hot encoding and BLOSUM. For neoantigen detection, a row of BLOSUM is used to represent each amino acid. Some works have used BLOSUM62 [9–12] and BLOSUM50 [13,14], and some authors used one-hot and BLOSUM encoding together [15–18]. There are also other alternatives, such as the universal Google encoder [19], AAindex [20,21] (a database of numerical indices representing physicochemical and biochemical properties of amino acids), 3D amino acid coordinates [22], and physicochemical properties of each amino acid [23–25]. More recently, some works used ligands eluted from the cell membrane and identified with Mass Spectrometry (MS) methods [26–30].
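As a concrete illustration of the most common scheme, the sketch below one-hot encodes a peptide into an n × 20 matrix; a BLOSUM encoding would instead substitute each row with the corresponding BLOSUM62/BLOSUM50 substitution-matrix row.

```python
# One-hot encoding of a peptide over the 20-letter amino acid alphabet:
# an n-mer becomes an n x 20 binary matrix with exactly one 1 per row.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, alphabetical

def one_hot(peptide):
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = []
    for aa in peptide:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        matrix.append(row)
    return matrix

m = one_hot("AANL")  # a 9-mer would give a 9 x 20 matrix; here 4 x 20
print(len(m), len(m[0]))  # 4 20
```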

4 Deep Learning and Transformers Methods

Deep learning neural networks can be classified as Deep Artificial Neural Networks (DANN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Transformers. A DANN is an ANN with several layers for solving complex problems. CNNs are regularly used in computer vision because they extract local spatial features and combine them into higher-order features. RNNs are used for sequential data and have a greater impact on NLP tasks. Finally, Transformers are based on the multi-head self-attention (SA) mechanism; they solved the short-context problem of RNNs and were first used in NLP tasks; furthermore, Transformers are the core of ChatGPT and other large language models. Moreover, the methods for pMHC binding and presentation prediction can be categorized as allele-specific or pan-specific: allele-specific methods train a model for each allele, while pan-specific methods train a single global model.

4.1 Deep Learning

Currently, the state-of-the-art method is NetMHCpan4.1 [27], a DANN of 40 assembled ANNs with 60 and 70 hidden neurons. It improved on its previous versions by increasing the training dataset to 13,245,212 data points covering 250 distinct MHC-I molecules; additionally, the model was updated from NN_align to NN_alignMA [30] to handle poly-specific datasets of Mass Spectrometry (MS)-eluted ligands. Moreover, MHCflurry2.0 [29] developed a pan-allele binding affinity predictor and an allele-independent antigen presentation predictor, and used MS data; after experiments, MHCflurry2.0 outperformed NetMHCpan4.0.

Several works apply Convolutional Neural Networks (CNN) to peptide-MHC binding and presentation. Typically, they use one-hot encoding or BLOSUM to vectorize each amino acid, then join the vectors into a matrix that can be handled like an image. Additionally, CNNs have used input data encoded with protein contact sites [31], one-hot encoding [32–34], the AAindex database [21,35], and physicochemical properties of each amino acid [11,35–37]. A representative method is PUFFIN [17], which predicts the probability of pMHC binding and quantifies the uncertainty of the binding prediction; this network is based on deep residual CNN layers.

From another perspective, proteins can be considered as sequences, and Recurrent Neural Networks (RNN) have been applied. The authors encode each amino acid with one-hot or BLOSUM and then use an embedding layer to represent each amino acid, usually as a 128-dimensional vector. Some works used a GRU model [38,39], bidirectional GRU [40], LSTM [15,41–43], bidirectional LSTM [10,44], and a combination of CNN and RNN [45].

4.2 Transformers

CNN with Attention. ACME [14] is one of the initial works; it used a CNN with an attention module that assigned weights to individual residue positions, learning to assign higher weights to residues that are more important in the pMHC interaction. It got a Spearman Rank Correlation Coefficient (SRCC) of 0.569 (higher than NetMHCpan 4.0), an AUC of 0.9 for HLA-A, and 0.88 for HLA-B. Another CNN is DeepAttentionPan [9], which also employs a deep CNN to encode peptides and MHC into 40 × 10 × 11 vectors, followed by an attention module that calculates positional weights. In contrast to ACME, DeepAttentionPan is allele-specific, and it performed slightly better than ACME when each allele was trained separately. Finally, DeepNetBim [13] used an attention module similar to ACME and DeepAttentionPan; however, DeepNetBim uses two separate CNNs to predict pMHC binding and immunogenicity (combined in the last layers). DeepNetBim got an MAE of 0.015 for binding and 94.7% accuracy for immunogenicity.

RNN with Attention. Some RNNs have also arisen, like DeepHLApan [40], an allele-specific model that takes pMHC binding and immunogenicity data. The model has three Bidirectional GRU (BiGRU) layers and an attention layer, and outputs binding and immunogenicity scores. Additionally, the method used CD8+ T-cell epitopes and Mass Spectrometry data. DeepHLApan got an accuracy > 0.9 on 43 HLA alleles. Furthermore, the allele-specific model DeepSeqPanII [15] used a combination of BLOSUM62 and one-hot encoding and focused on MHC-II. The model had two LSTM layers with 100 hidden units and an attention block to extract weighted information based on the hidden units; the attention block consisted of four 1-D convolutional layers; finally, three fully connected layers were used to predict affinity. DeepSeqPanII got better results than NetMHCIIpan 3.2 on 26 of 54 alleles.
Moreover, MATHLA [10] used a BiLSTM to learn dependencies among amino acid residues and applied multi-head attention to add positional information to the output of the BiLSTM; the output is then combined in 2-D CNN layers. MATHLA achieved an AUC score of 0.964, compared to 0.945, 0.925, and 0.905 for NetMHCpan 4.0, MHCflurry, and ACME, respectively.

Transformers. BERTMHC [46] is one of the pioneering works that used a BERT architecture. It is a pan-specific pMHC-II binding/presentation prediction method. It used transfer learning from Tasks Assessing Protein Embeddings (TAPE) [47], a model trained with thirty-one million proteins from the Pfam database. The authors stack an average pooling layer followed by an FC layer after the TAPE model. In experiments, BERTMHC outperformed NetMHCIIpan3.2 and PUFFIN (AUC of 0.8822 against 0.8774). Then, ImmunoBERT [48] was proposed; this model also used transfer learning from TAPE, but the authors focus on pMHC-I prediction. This method stacks a classification token's vector after the TAPE model. Furthermore, the authors concluded that amino acids close to the peptide N/C-terminals are relevant, and that positions in the A, B, and F pockets are assigned high importance (analyzed with LIME and SHAP).

Other methods that used transfer learning are MHCRoBERTa [49] and HLAB [50]. The first one proposed five encoders with 12 multi-head self-attention modules. At first, the authors used self-supervised training on the UniProtKB and Swiss-Prot databases; then, they fine-tuned the training with the IEDB [51] dataset. Additionally, they applied sub-word tokenization. MHCRoBERTa got an SRCC of 0.543, higher than NetMHCpan4.0 and MHCflurry2.0. Furthermore, HLAB [50] used transfer learning from ProtBert-BFD [52], trained with 2,122 million proteins from the BFD dataset. HLAB used a BiLSTM model in cascade; the input is a 49-dim vector of letters formed by the peptide and the HLA, and the BiLSTM outputs a 1536-dim vector. In the end, the extracted features are reduced by Uniform Manifold Approximation and Projection (UMAP). Moreover, on the HLA-A*01:01 allele, HLAB slightly outperformed state-of-the-art methods, including NetMHCpan4.1, by at least 0.0230 in AUC and 0.0560 in accuracy.

Recently, the allele-specific DapNet-HLA [53] used an additional dataset (Swiss-Prot) for negative samples. The method used an embedding block for each token and its absolute position. The authors compared this encoding against Dipeptide Deviation from Expected mean (DDE), Amino Acid Composition (AAC), Dipeptide Composition (DPC), and Encoding based on Grouped Weight (EGBW). DapNet-HLA then combined the advantages of CNN, SENet (for pooling), and LSTM. The proposal got high scores; however, the method was not compared against state-of-the-art methods.
Finally, TransPHLA [54] is an allele-specific method that applies self-attention to peptides. The model is based on four modules: an embedding block, an encoder block (multi-head self-attention), a feature optimization block (FC layer), and a projection block (FC layer used to predict). The authors also developed AOMP, which takes a pMHC binding pair as input and returns mutant peptides with higher affinity to the MHC allele. Moreover, TransPHLA outperformed state-of-the-art methods, including NetMHCpan4.1; it is effective for any peptide and MHC length and is faster at making predictions (Table 1).
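The attention blocks used by the models in this section all build on scaled dot-product attention: weights = softmax(QKᵀ/√d), output = weights·V. A minimal pure-Python sketch with toy 2-dimensional "residue embeddings", a single head, and no learned projections:

```python
import math

# Minimal scaled dot-product attention on toy vectors. Real models add
# learned Q/K/V projections, multiple heads, and stacked encoder layers.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d = len(Q[0])
    out = []
    for q in Q:
        # similarity of this query with every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        # weighted sum of the value vectors
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# Two "residues" with orthogonal embeddings; each output row mixes the
# value vectors, weighted toward the more similar key.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
print([round(x, 3) for x in out[0]])
```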


Table 1. Transformers and deep learning methods with attention mechanisms used for pMHC binding and presentation prediction.

| Year | Name | Input | Model |
|---|---|---|---|
| 2022 [50] | HLAB | One-hot | BERT from the ProtBert pre-trained model, followed by a BiLSTM with an attention mechanism |
| 2022 [49] | MHCRoBERTa | One-hot | RoBERTa pre-trained, followed by 12 multi-head SA and FC layers; it outperformed NetMHCPan 3.0 |
| 2022 [54] | TransPHLA | One-hot | SA mechanism based on four blocks; it slightly outperformed NetMHCpan4.1 and is faster at making predictions |
| 2021 [48] | ImmunoBERT | One-hot | BERT from TAPE pre-trained, followed by a linear layer. The authors claimed that the N- and C-terminals are highly relevant after analysis with SHAP and LIME |
| 2021 [46] | BERTMHC | One-hot | BERT from TAPE pre-trained, followed by a linear layer. It outperformed NetMHCIIpan3.2 and PUFFIN |
| 2021 [10] | MATHLA | BLOSUM | Integrates BiLSTM with multi-head attention. It achieved an AUC score of 0.964, compared to 0.945, 0.925 and 0.905 for NetMHCpan 4.0, MHCflurry and ACME, respectively |
| 2021 [15] | DeepSeqPanII | BLOSUM62 & one-hot | Two LSTM layers, an attention block and three FC layers. It got better results than NetMHCIIpan 3.2 on 26 of 54 alleles |
| 2021 [13] | DeepNetBim | BLOSUM50 | Separate CNNs for pMHC binding and immunogenicity with an attention module. It got an MAE of 0.015 for binding and 94.7% accuracy for immunogenicity |
| 2021 [9] | DeepAttentionPan | BLOSUM62 | Deep CNN with an attention mechanism. It is allele-specific and got slightly better results than ACME at the allele level |
| 2019 [14] | ACME | BLOSUM50 | CNN with attention; it extracts interpretable patterns about pMHC binding. It got an SRCC of 0.569, an AUC of 0.9 for HLA-A and 0.88 for HLA-B |
| 2019 [40] | DeepHLApan | One-hot | Allele-specific model with three bidirectional GRU (BiGRU) layers and an attention layer. It got accuracy > 0.9 on 43 HLA alleles |

5 Discussion

This review focuses on deep learning and Transformer methods for pMHC binding and presentation prediction. DANNs, CNNs, and RNNs have been used for this task; however, Transformer BERT architectures have obtained promising results. Although NetMHCpan4.1 is the state-of-the-art pan-specific method, Transformers achieve competitive results. These BERT models [46,48,49] used transfer learning from TAPE [47] and ProtBert [52], which are models trained in a self-supervised manner on the Pfam, UniRef50, UniRef100, UniProtKB, Swiss-Prot, and BFD datasets. To answer the research questions: pMHC binding/presentation prediction is a classification problem; however, some works treat it as a regression problem for predicting pMHC binding affinity. Deep learning methods thus take two inputs: an amino acid sequence representing a candidate peptide and the MHC. Each sequence is then mapped with one-hot encoding, BLOSUM, the universal Google encoder, AAindex, or according to its physicochemical properties. Finally, these models are trained with public datasets. Some authors used physicochemical properties to capture more information about each amino acid; however, they did not outperform methods based on one-hot and BLOSUM encoding. Moreover, the majority of pre-trained models used one-hot encoding, which forces this encoding on methods based on transfer learning.
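As an illustration of the simplest of these encodings, the following Python sketch maps a peptide to a one-hot matrix; the fixed alphabet ordering and maximum length are our assumptions (a BLOSUM encoding would instead take each row from a substitution matrix).

```python
import numpy as np

# The 20 standard amino acids in a fixed order (any consistent
# ordering works, as long as it is used everywhere).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(peptide: str, max_len: int = 14) -> np.ndarray:
    """Map a peptide to a (max_len, 20) one-hot matrix, zero-padded."""
    mat = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(peptide[:max_len]):
        mat[pos, AA_INDEX[aa]] = 1.0
    return mat

x = one_hot_encode("SIINFEKL")  # an 8-mer peptide, zero-padded to 14
```

Models then consume this matrix directly (CNNs/RNNs) or replace it with learned embeddings (BERT-style architectures).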


V. E. Machaca et al.

Furthermore, NetMHCpan4.1 [27] is considered the state-of-the-art pan-specific method; however, HLAB [50] and TransPHLA [54] slightly outperformed it in allele-specific testing. HLAB used transfer learning from ProtBert, and TransPHLA is effective for any peptide length and makes predictions faster. The main limitations of these methods are related to the datasets used for training: they ignore post-translational modifications (PTMs) such as phosphorylation, glycosylation, and deamidation, which influence the specificity of MHC binding and presentation, and several aspects of the biology underlying pMHC presentation are poorly understood [8]. Furthermore, to get accurate results for neoantigen detection, we need to integrate pMHC-TCR studies. Other limitations are related to the high computing requirements for training BERT architectures. For instance, pre-trained models like TAPE [47], ProtBert [52], and ESM-1b [55] have 92, 420, and 650 million parameters, respectively, and the recent ESM2 [56] reaches 15 billion parameters. Moreover, the number of training samples increases continuously, so proposals need to be constantly re-evaluated with bigger datasets. Future work could include the use of transfer learning from ESM-1b [55] and ESM2 [56]. There is also pHLA3D, a dataset of 3D structures of the alpha/beta chains and peptides of MHC-I proteins, which opens new perspectives for studying pMHC prediction.

References

1. Peng, M., et al.: Neoantigen vaccine: an emerging tumor immunotherapy. Mol. Cancer 18(1), 1–14 (2019)
2. Borden, E.S., Buetow, K.H., Wilson, M.A., Hastings, K.T.: Cancer neoantigens: challenges and future directions for prediction, prioritization, and validation. Front. Oncol. 12 (2022)
3. Chen, I., Chen, M., Goedegebuure, P., Gillanders, W.: Challenges targeting cancer neoantigens in 2021: a systematic literature review. Expert Rev. Vaccines 20(7), 827–837 (2021)
4. Gopanenko, A.V., Kosobokova, E.N., Kosorukov, V.S.: Main strategies for the identification of neoantigens. Cancers 12(10), 2879 (2020)
5. Mattos, L., et al.: Neoantigen prediction and computational perspectives towards clinical benefit: recommendations from the ESMO precision medicine working group. Ann. Oncol. 31(8), 978–990 (2020)
6. Abelin, J.G., et al.: Mass spectrometry profiling of HLA-associated peptidomes in mono-allelic cells enables more accurate epitope prediction. Immunity 46(2), 315–326 (2017)
7. Mei, S., et al.: A comprehensive review and performance evaluation of bioinformatics tools for HLA class I peptide-binding prediction. Brief. Bioinform. 21(4), 1119–1135 (2020)
8. Nielsen, M., Andreatta, M., Peters, B., Buus, S.: Immunoinformatics: predicting peptide–MHC binding. Annu. Rev. Biomed. Data Sci. 3, 191–215 (2020)
9. Jin, J., et al.: Deep learning pan-specific model for interpretable MHC-I peptide binding prediction with improved attention mechanism. Proteins Struct. Funct. Bioinform. 89(7), 866–883 (2021)


10. Ye, Y., et al.: MATHLA: a robust framework for HLA-peptide binding prediction integrating bidirectional LSTM and multiple head attention mechanism. BMC Bioinform. 22(1), 1–12 (2021)
11. Zhao, T., Cheng, L., Zang, T., Hu, Y.: Peptide-major histocompatibility complex class I binding prediction based on deep learning with novel feature. Front. Genet. 10, 1191 (2019)
12. O'Donnell, T.J., Rubinsteyn, A., Bonsack, M., Riemer, A.B., Laserson, U., Hammerbacher, J.: MHCflurry: open-source class I MHC binding affinity prediction. Cell Syst. 7(1), 129–132 (2018)
13. Yang, X., Zhao, L., Wei, F., Li, J.: DeepNetBim: deep learning model for predicting HLA-epitope interactions based on network analysis by harnessing binding and immunogenicity information. BMC Bioinform. 22(1), 1–16 (2021)
14. Hu, Y., et al.: ACME: pan-specific peptide-MHC class I binding prediction through attention-based deep neural networks. Bioinformatics 35(23), 4946–4954 (2019)
15. Liu, Z., et al.: DeepSeqPanII: an interpretable recurrent neural network model with attention mechanism for peptide-HLA class II binding prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. (2021)
16. Jokinen, E., Huuhtanen, J., Mustjoki, S., Heinonen, M., Lähdesmäki, H.: Predicting recognition between T cell receptors and epitopes with TCRGP. PLoS Comput. Biol. 17(3), e1008814 (2021)
17. Zeng, H., Gifford, D.K.: Quantification of uncertainty in peptide-MHC binding prediction improves high-affinity peptide selection for therapeutic design. Cell Syst. 9(2), 159–166 (2019)
18. Zeng, H., Gifford, D.K.: DeepLigand: accurate prediction of MHC class I ligands using peptide embedding. Bioinformatics 35(14), i278–i283 (2019)
19. Kubick, N., Mickael, M.E.: Predicting epitopes based on TCR sequence using an embedding deep neural network artificial intelligence approach. bioRxiv (2021)
20. Kawashima, S., Kanehisa, M.: AAindex: amino acid index database. Nucleic Acids Res. 28(1), 374–374 (2000)
21. Li, G., Iyer, B., Prasath, V.S., Ni, Y., Salomonis, N.: DeepImmuno: deep learning-empowered prediction and generation of immunogenic peptides for T-cell immunity. Brief. Bioinform. 22(6), bbab160 (2021)
22. Shi, Y., et al.: DeepAntigen: a novel method for neoantigen prioritization via 3D genome and deep sparse learning. Bioinformatics 36(19), 4894–4901 (2020)
23. Moris, P., et al.: Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification. Brief. Bioinform. 22(4), bbaa318 (2021)
24. Montemurro, A., et al.: NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRα and β sequence data. Commun. Biol. 4(1), 1–13 (2021)
25. Luu, A.M., Leistico, J.R., Miller, T., Kim, S., Song, J.S.: Predicting TCR-epitope binding specificity using deep metric learning and multimodal learning. Genes 12(4), 572 (2021)
26. Zhou, L.Y., Zou, F., Sun, W.: Prioritizing candidate peptides for cancer vaccines through predicting peptide presentation by HLA-I proteins. Biometrics (2022)
27. Reynisson, B., Alvarez, B., Paul, S., Peters, B., Nielsen, M.: NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 48(W1), W449–W454 (2020)


28. Reynisson, B., Barra, C., Kaabinejadian, S., Hildebrand, W.H., Peters, B., Nielsen, M.: Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data. J. Proteome Res. 19(6), 2304–2315 (2020)
29. O'Donnell, T.J., Rubinsteyn, A., Laserson, U.: MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst. 11(1), 42–48 (2020)
30. Alvarez, B., et al.: NNAlign_MA: MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol. Cell. Proteom. 18(12), 2459–2477 (2019)
31. Han, Y.: Deep convolutional neural networks for peptide-MHC binding predictions (2018)
32. Liu, Z., Cui, Y., Xiong, Z., Nasiri, A., Zhang, A., Hu, J.: DeepSeqPan, a novel deep convolutional neural network model for pan-specific class I HLA-peptide binding affinity prediction. Sci. Rep. 9(1), 1–10 (2019)
33. Lang, F., Riesgo-Ferreiro, P., Löwer, M., Sahin, U., Schrörs, B.: NeoFox: annotating neoantigen candidates with neoantigen features. Bioinformatics 37(22), 4246–4247 (2021)
34. Lee, K.-H., Chang, Y.-C., Chen, T.-F., Juan, H.-F., Tsai, H.-K., Chen, C.-Y.: Connecting MHC-I-binding motifs with HLA alleles via deep learning. Commun. Biol. 4(1), 1–12 (2021)
35. Pei, B., Hsu, Y.-H.: IConMHC: a deep learning convolutional neural network model to predict peptide and MHC-I binding affinity. Immunogenetics 72(5), 295–304 (2020)
36. You, R., Qu, W., Mamitsuka, H., Zhu, S.: DeepMHCII: a novel binding core-aware deep interaction model for accurate MHC-II peptide binding affinity prediction. Bioinformatics 38(Supplement_1), i220–i228 (2022)
37. Ng, F.S., et al.: MINERVA: learning the rules of HLA class I peptide presentation in tumors with convolutional neural networks and transfer learning. Available at SSRN 3704016 (2020)
38. Heng, Y., et al.: A simple pan-specific RNN model for predicting HLA-II binding peptides. Mol. Immunol. 139, 177–183 (2021)
39. Heng, Y., et al.: A pan-specific GRU-based recurrent neural network for predicting HLA-I-binding peptides. ACS Omega 5(29), 18321–18330 (2020)
40. Wu, J., et al.: DeepHLApan: a deep learning approach for neoantigen prediction considering both HLA-peptide binding and immunogenicity. Front. Immunol. 2559 (2019)
41. Shao, X.M., et al.: High-throughput prediction of MHC class I and II neoantigens with MHCnuggets. Cancer Immunol. Res. 8(3), 396–408 (2020)
42. Vielhaben, J., Wenzel, M., Samek, W., Strodthoff, N.: USMPep: universal sequence models for major histocompatibility complex binding affinity prediction. BMC Bioinform. 21(1), 1–16 (2020)
43. Chen, B., et al.: Predicting HLA class II antigen presentation through integrated deep learning. Nat. Biotechnol. 37(11), 1332–1343 (2019)
44. Venkatesh, G., Grover, A., Srinivasaraghavan, G., Rao, S.: MHCAttnNet: predicting MHC-peptide bindings for MHC alleles classes I and II using an attention-based deep neural model. Bioinformatics 36(Supplement_1), i399–i406 (2020)
45. Xie, X., Han, Y., Zhang, K.: MHCherryPan: a novel pan-specific model for binding affinity prediction of class I HLA-peptide. Int. J. Data Min. Bioinform. 24(3), 201–219 (2020)


46. Cheng, J., Bendjama, K., Rittner, K., Malone, B.: BERTMHC: improved MHC-peptide class II interaction prediction with transformer and multiple instance learning. Bioinformatics 37(22), 4172–4179 (2021)
47. Rao, R., et al.: Evaluating protein transfer learning with TAPE. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
48. Gasser, H.-C., Bedran, G., Ren, B., Goodlett, D., Alfaro, J., Rajan, A.: Interpreting BERT architecture predictions for peptide presentation by MHC class I proteins. arXiv preprint arXiv:2111.07137 (2021)
49. Wang, F., et al.: MHCRoBERTa: pan-specific peptide-MHC class I binding prediction through transfer learning with label-agnostic protein sequences. Brief. Bioinform. 23(3), bbab595 (2022)
50. Zhang, Y., et al.: HLAB: learning the BiLSTM features from the ProtBERT-encoded proteins for the class I HLA-peptide binding prediction. Brief. Bioinform. (2022)
51. Vita, R., et al.: The immune epitope database (IEDB): 2018 update. Nucleic Acids Res. 47(D1), D339–D343 (2018)
52. Elnaggar, A., et al.: ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 7112–7127 (2021)
53. Jing, Y., Zhang, S., Wang, H.: DapNet-HLA: adaptive dual-attention mechanism network based on deep learning to predict non-classical HLA binding sites. Anal. Biochem. 666, 115075 (2023)
54. Chu, Y., et al.: A transformer-based model to predict peptide-HLA class I binding and optimize mutated peptides for vaccine design. Nat. Mach. Intell. 4(3), 300–311 (2022)
55. Rives, A., et al.: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118(15) (2021)
56. Lin, Z., et al.: Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), 1123–1130 (2023)

Auto-phylo: A Pipeline Maker for Phylogenetic Studies

Hugo López-Fernández1,2, Miguel Pinto3,4, Cristina P. Vieira3,5, Pedro Duque3,4,5,6, Miguel Reboiro-Jato1,2, and Jorge Vieira3,5(B)

1 CINBIO, Department of Computer Science, ESEI—Escuela Superior de Ingeniería Informática, Universidade de Vigo, 32004 Ourense, Spain
2 SING Research Group, Galicia Sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, 36213 Vigo, Spain
3 Instituto de Investigação e Inovação em Saúde (I3S), Universidade do Porto, Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
[email protected]
4 Faculdade de Ciências da Universidade do Porto (FCUP), Rua do Campo Alegre, S/N, 4169-007 Porto, Portugal
5 Instituto de Biologia Molecular e Celular (IBMC), Rua Alfredo Allen, 208, 4200-135 Porto, Portugal
6 School of Medicine and Biomedical Sciences (ICBAS), Porto University, Rua de Jorge Viterbo Ferreira, 228, 4050-313 Porto, Portugal

Abstract. Inferences on the evolutionary history of a gene can provide insight into whether the findings made for a given gene in a given species can be extrapolated to other species, including humans, help explain morphological evolution, or give an explanation for unexpected findings regarding gene expression suppression experiments, among others. The large amount of sequence data that is already available, and that is predicted to dramatically increase in the next few years, means that life science researchers need efficient automated ways of analyzing such data. Moreover, especially when dealing with divergent sequences, inferences can be affected by the chosen alignment and tree building algorithms, and thus the same dataset should be analyzed in different ways, reinforcing the need for efficient automated ways of analyzing the sequencing data. Therefore, here we present auto-phylo, a simple pipeline maker for phylogenetic studies, and provide two examples of its utility: one involving a small, already formatted sequence dataset (41 CDS) to determine, in an automated way, the impact of using different alignment and tree building algorithms, and another involving the automated identification and processing of the sequences of interest, starting from 16550 bacterial CDS FASTA files downloaded from the NCBI Assembly RefSeq database, with subsequent alignment and tree building inferences.

Keywords: pipeline maker · file processing · phylogenetics · docker

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 24–33, 2023. https://doi.org/10.1007/978-3-031-38079-2_3


1 Introduction

Inferences on the evolutionary history of a gene or gene family can offer insight into the function of homologous genes in different organisms, and thus into whether the findings made for a given species can be extrapolated to other species, including humans, by highlighting instances of gene duplication leading to hypofunctionalization, subfunctionalization, or neofunctionalization (see [1] for a review). They also help explain morphological evolution (see, for instance, [2]) and can provide an explanation for why the suppression of the expression of a given gene has no apparent consequences (see, for instance, [3]), among others. The large amount of data that is already available, and that is predicted to drastically increase in the next seven years, means that life science researchers need efficient automated ways of analyzing such data. Indeed, the Earth Biogenome Project aims at obtaining high-quality reference genome sequences for all 1.8 million named living eukaryote species by 2030 [4, 5], giving an opportunity to understand the evolution of any eukaryote gene with unprecedented detail. Nevertheless, parsing a large number of FASTA files in order to have them in the desired format for phylogenetic analyses is a challenge per se, especially for users without an informatics background. In this context, SEDA (SEquence DAtaset builder; [6]), an open-source, multiplatform application with an easy-to-use GUI that allows performing a large number of operations on FASTA files containing DNA and protein sequences, is useful. The GUI, however, cannot be incorporated into automated pipelines. For this reason, SEDA v1.6.0, which is still under development, will introduce a new command-line interface (CLI) that allows using SEDA's operations in the context of automated pipelines.
Almost all operations (29) have already been migrated to the CLI, the only exceptions being the PfamScan protein annotation operation, some BLAST operations (3; "BLAST" is already implemented), and some gene annotation operations (4; "CGA" is already implemented). The CLI development is expected to be completed by the end of 2023, along with a generic way of creating SEDA-CLI pipelines based on Compi [7]. A Docker image with the already implemented operations is, however, already available for the current SEDA-CLI development version (tagged as "pegi3s/seda:1.6.0-v2304") at the pegi3s Bioinformatics Docker Images Project [8]. We have taken advantage of the pegi3s/seda:1.6.0-v2304 Docker image to develop several auto-phylo modules that are useful in a variety of situations. When dealing with distantly related sequences, as is often the case when analyses are performed at the kingdom level, it is important to show that the inferred evolutionary history does not depend on the alignment or phylogenetic method used, which usually implies the installation of different software packages. Having a single software application that easily allows the creation of automated pipelines, that can be downloaded when needed and erased when no longer needed, and where the installed scientific software is ready to run, would facilitate such analyses. The latter two conditions can be achieved by using Docker images, such as those available at the pegi3s Bioinformatics Docker Images Project [8]. The availability of a SEDA-CLI version and other relevant images at the pegi3s Bioinformatics Docker Images Project allowed the development of the auto-phylo application, here presented, which is made available as a Docker image. This application allows users to easily create automated pipelines by only specifying the auto-phylo


module/s to be used, as well as the input and output directories for each module. While researchers with an informatics background may easily combine different informatic tools, life science researchers without such a background can easily use auto-phylo. At present, auto-phylo includes modules for (i) tblastx, (ii) parsing of CDS FASTA files downloaded from the NCBI Assembly database, (iii) checking for unwanted genomes, (iv) sequence disambiguation, (v) header prefixing, (vi) adding taxonomy information to sequence headers, (vii) removal of stop codons, (viii) merging FASTA files, (ix) two alignment methods (Clustal Omega [9] and T-Coffee [10]; in the case of CDS data, the alignment is made at the amino acid level, and the corresponding nucleotide alignment is then obtained), (x) seven phylogenetic inference methods (FastTree [11], MrBayes [12], Maximum Likelihood, Neighbor-Joining, Minimum Evolution, Maximum Parsimony, and UPGMA as implemented in MEGA-CC [13]), with the possibility of rooting the trees using RootDigger [14], (xi) JModel testing (to check the appropriateness of the models implemented in FastTree and MrBayes; [15]), and (xii) the Phylogenetic Tree Collapser (which allows the automated collapsing of taxonomically related groups of sequences; available at the pegi3s Bioinformatics Docker Images Project).

2 Material and Methods

The auto-phylo Docker image here described can be used to address the impact of choosing different alignment and phylogenetic inference methods when using the same set of sequences. For instance, when addressing the evolution of vitamin C synthesis, Duque et al. [16] could not fully resolve the relationship of the different types of aldonolactone oxidoreductases (AOs), using a set of sequences aligned with MUSCLE ([17], as implemented in T-Coffee [10]) and Bayesian phylogenetic inferences [12], and the use of different alignment and tree inference methods could help resolve this issue. Therefore, the coding sequences (CDS) associated with the accession numbers NP_216287.1, MBI0379321.1, NP_254014.1, MCO4763851.1, MBS1617249.1, HIA12326.1, NEO81287.1, BAY85903.1, MCP8715902.1, PSP48796.1, BAO94255.2, XP_844937.1, XP_001682394.1, NP_848862.1, EDL85374.1, OCT81467.1, XP_002122023.1, XP_005443605.1, XP_024182616.1, XP_024165881.1, XP_008240731.1, XP_008220057.1, XP_004291192.1, XP_004303609.1, NP_190376.1, NP_182199.1, NP_564393.1, NP_182198.2, NP_182197.1, NP_200460.1, XP_023896433.1, XP_023896434.1, XP_023900169.1, XP_023914113.1, XP_043131130.1, XP_002422445.1, NP_013624.1, XP_003651626.1, XP_049145240.1, NP_055577.1, and XP_004926865.1 were retrieved from the NCBI nucleotide database. As alignment algorithms, both the Clustal_Omega_codons and the T-coffee_codons auto-phylo modules were used. As tree inference methods, the Fasttree, MrBayes, me_tree, ml_tree, and nj_tree auto-phylo modules were used. Therefore, in total, we obtained 10 different trees. In order to show that the GTR + I + G model used in the Fasttree and MrBayes modules is adequate, the JModel_test auto-phylo module was used. To achieve this, the pipeline and config files shown in Fig. 1 were used. In the config file, we specified the SEDA version to be used (seda:1.6.0-v2304), 2,000,000 tree generations and a burn-in of 25% for MrBayes, "Complete deletion" and 1000 bootstraps for the Minimum Evolution (me_tree), Maximum Likelihood (ml_tree), and Neighbor-Joining (nj_tree) auto-phylo modules, and the option of not rooting the trees (two outgroup sequences are present in the dataset).

Fig. 1. The auto-phylo pipeline (A) and config (B) files used to generate 10 gene trees (see Results, Sect. 3.3), using two alignment methods (Clustal Omega and T-Coffee) and five tree inference methods (FastTree, MrBayes, and Maximum Likelihood, Neighbor-Joining, and Minimum Evolution as implemented in MEGA-CC), as well as the JModel_test output.

In order to identify putative AO sequences of bacterial origin that may share similar biological functions with the last enzyme involved in the animal VC synthesis pathway, namely, L-gulonolactone oxidase (GULO) [16], the auto-phylo Docker image here described was also used, with the pipeline and config files shown in Fig. 2, to perform large-scale analyses. Since this protocol has three operations that require an internet connection, namely the check_contamination, add_taxonomy, and tree_collapser modules, and such a connection may sometimes fail, the data processing was divided into six different steps. The query file contains the following sequences: Mycobacterium tuberculosis NP_216287, Pseudomonas aeruginosa NP_254014, Streptomyces albiflaviniger MBI0379321, Halobacteriales archaeon PSP48796, Aspergillus chevalieri XP_043131130, Myxococcales bacterium MCO4763851, Bacteroidota bacterium MBS1617249, Flavobacteriales bacterium HIA12326, Mus musculus NP_848862, Rattus norvegicus EDL85374, Falco cherrug XP_005443605, Xenopus laevis OCT81467, Ciona intestinalis XP_002122023, Moorena sp. SIO4G3 NEO81287, Calothrix parasitica NIES-267 BAY85903, Trypanosoma brucei brucei TREU927 XP_844937, and Leishmania major strain Friedlin XP_001682394, and the reference_file contains the Mycobacterium tuberculosis NP_216287 sequence. By using a value of 100000 for the isoform_min_word_length parameter, the remove isoforms step of the GCF_and_GCA_CDS_processing module is effectively skipped (bacterial genes do not usually code for isoforms), as no CDS is that long.


3 Results

3.1 Auto-phylo Modules

The 23 modules that are at present available for auto-phylo are based on the pegi3s Bioinformatics Docker Images Project [8] images, as well as on the CLI recently made available as an early access for the SEDA v1.6.0 [6] software. Detailed information on each module can be found at https://hub.docker.com/r/pegi3s/auto-phylo. It should be noted that the input/output type defines the compatibility between the different modules. For instance, the tree inference modules accept as input a single aligned FASTA file, and will fail if given multiple files or non-aligned FASTA files. Therefore, it is important to check the compatibility of the different modules before declaring a pipeline. Moreover, in order to use some modules (such as GCF_and_GCA_CDS_processing), the values of some variables must be declared in the config file. It should be noted that, depending on the declared value, the corresponding operations may have no effect. For instance, when using the GCF_and_GCA_CDS_processing module, if the size variation (in percentage) relative to the reference sequence is a very large number (for instance, 100000), the specified pattern is ".", or the isoform minimum word length is greater than the sequence sizes, these operations will have no effect, since all sequences will always match such criteria. When processing a large number of files, in order to avoid out-of-memory errors, it is advisable to use the split option. This is a special command that is invoked after a regular command and takes as input a single argument, namely, the number of groups to consider (see Fig. 2 for an example). For instance, the instruction "GCF_and_GCA_CDS_processing my_data_dir out_dir split 20" will split the files that are in the my_data_dir directory into 20 equal-size subfolders. The data in each subfolder will be processed independently, thus avoiding out-of-memory errors.
The output of all independent analyses will still be saved in the same out_dir directory. While it is foreseeable that, in the future, further modules will be provided for other commonly used alignment and tree inference methods, the way FASTA files should be parsed to have the format desired by the user depends on the source database as well as on the user's needs. The script here provided for parsing CDS files downloaded from the NCBI Assembly database likely accommodates most user needs. Nevertheless, the structure of a basic auto-phylo script is relatively simple (http://evolution6.i3s.up.pt/static/auto-phylo/docs/), and thus even a researcher with very basic knowledge of Bash scripting should be able to write a simple module to parse FASTA files using SEDA-CLI operations. Assuming that such a module is named my_module and that it is located in the working directory (/your/data/dir), it can be copied into the Docker image and then invoked as any other auto-phylo module, with the following command:

docker run --rm \
  -v /your/data/dir:/data \
  -v /var/run/docker.sock:/var/run/docker.sock \
  pegi3s/auto-phylo \
  bash -c "cp /data/my_module /opt && /opt/main"


3.2 Setting up an Auto-phylo Pipeline

In order to set up a pipeline using the available auto-phylo modules, the user only needs to edit two text files, named pipeline and config, that must be present in the working directory. In the pipeline file, the user declares the order in which the operations should be invoked and the names of the respective input and output directories, one instruction per line. Examples of pipeline and config files are given in Figs. 1 and 2. The output directory of the previous operation is usually the input directory of the next one, but it is also possible to declare branched pipelines, such as the one in Fig. 1. Intermediate files that are produced during the processing of the data may contain potentially relevant information, and thus a prefix made of a block identifier, a command identifier, and a step is added to a general name describing the operation, so that the user can easily determine the order of the operations. A block is defined as a set of consecutive commands where the output directory of the previous command is the input directory of the next one. When this condition no longer holds, the next set of instructions is defined as a new block and the block identifier increases by one unit. A file named blocks_and_commands, indicating where the intermediary files have been saved, is also produced.

It should be noted that, although intended to be used as a pipeline maker, auto-phylo can also be used to execute a single command at a time. For instance, when processing thousands of files, the operations that depend on an internet connection may sometimes fail due to a broken connection, meaning that the user must repeat the operation for the files for which the operation failed. For convenience, such files are identified and saved in a separate folder, in the same place where the corresponding intermediate files are located.
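As a purely illustrative sketch (the module names are those listed in Sect. 2 and 3.1, but the directory names seqs, aligned, and tree are hypothetical), a minimal two-step pipeline file chaining an alignment module and a tree inference module would contain one instruction per line, with the output directory of the first step reused as the input directory of the second:

```
Clustal_Omega_codons seqs aligned
Fasttree aligned tree
```

The actual files used in this work, including the accompanying config files, are shown in Figs. 1 and 2.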

Fig. 2. The six (A-F) auto-phylo pipeline files and two config files (G–H) used. The config file shown in G) was used with the pipeline files shown in A)–E), while the config file shown in H) was used with the pipeline file shown in F).


If parameters need to be passed to the scripts being used, they must be declared in the config file, one per line. Examples of such files are given in Figs. 1 and 2. It should be noted that a file named config must always be given, since at least the working directory (where the folder with the input files is located and where results will be saved), as well as the SEDA Docker image to be used, must always be declared.

3.3 Bacterial AOs May Have a Function Similar to Animal GULOs

When performing Bayesian phylogenetic inferences using a set of sequences aligned with MUSCLE [17], Duque et al. [16] could not fully resolve the relationship of the different types of AOs. Here, by using a simple branched pipeline file (see Material and Methods), we explored the consequences of using different alignment and tree building methods. Although in Fig. 3 we show that the chosen alignment and tree building methods can have an impact on the conclusions that are drawn, overall, the gene trees do support the hypothesis that there is a group of sequences, present in bacteria and especially in Cyanobacteria, that may have a function similar to animal GULOs. Given the small number of sequences used, it is nevertheless unclear whether these are the result of gene transfer events or whether there are large bacterial groups where the majority of species show such sequences.

3.4 Identification of Bacterial Species Groups that Have AOs Closely Related to Animal GULOs

Given the results obtained in Sect. 3.3, we searched for all bacterial species groups that have putative AOs that may play functions similar to GULO [16]. Such a characterization can shed light not only on the evolutionary origin of eukaryote GULO, but can also open new perspectives for industrial applications [16]. The NCBI Assembly RefSeq database was queried for representative bacterial genomes with an annotation, in order to download one genome per species (16550 on 4 March 2023).
For eight species a consistent “500 Server Error” was obtained, and thus those files were not used. Of the remaining species, 15910 species had at least one tblastx hit with one of the 17 nucleotide sequences present in the query file (see Material and Methods). After removing sequences showing ambiguous nucleotides, 15908 species remain. The removal of sequences with a length not multiple of 3, that do not have a valid start codon, or show in frame stop codons, reduces this number to 15766 species, and the removal of sequences that are 10% longer or shorter than the reference sequence (the Mycobacterium tuberculosis NP_216287 sequence), further reduces this number to 13215 species. After removing sequences that do not show the typical AO "HW[AG]K" pattern, 3888 species remain. In these files, there are 5384 sequences in total. An attempt to align these 5384 sequences revealed several highly divergent sequence groups. Therefore, a second tblastx, using the same query file as before, was performed. Because the size of the blast database is now much smaller, now only 2143 sequences show a significant hit. These were aligned using Clustal Omega [9], a tree obtained using FastTree [11], and collapsed using the Phylogenetic Tree Collapse program (available at the pegi3s Bioinformatics Docker Images Project [8]).
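The sequence filters applied above (performed with SEDA in the actual pipeline) can be sketched in plain Python. This is only an illustration, not the authors' code: the function names are invented, a valid start codon is assumed to be ATG for simplicity (bacteria also use GTG/TTG), a trailing stop codon is tolerated, and the "HW[AG]K" motif is checked on the translated protein, supplied here as a string.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}
AO_MOTIF = re.compile(r"HW[AG]K")  # typical AO motif described in the text

def passes_nucleotide_filters(seq, ref_len):
    """Apply the CDS-level filters: unambiguous bases only, length a
    multiple of 3, a valid start codon, no in-frame stop codons, and
    length within 10% of the reference sequence length."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):                     # ambiguous nucleotides
        return False
    if len(seq) % 3 != 0:                          # length not a multiple of 3
        return False
    if not seq.startswith("ATG"):                  # no valid start codon (assumed ATG)
        return False
    # Check internal codons only, so a terminal stop codon is allowed.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 3, 3))
    if any(c in STOP_CODONS for c in codons):      # in-frame stop codon
        return False
    if abs(len(seq) - ref_len) > 0.10 * ref_len:   # >10% longer or shorter
        return False
    return True

def has_ao_motif(protein):
    """Keep only sequences whose translation shows the HW[AG]K pattern."""
    return AO_MOTIF.search(protein) is not None
```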


Fig. 3. Inferences on AO gene evolution using different alignment (CO – Clustal Omega; TC – TCoffee) and tree building algorithms (FT – FastTree; MB – MrBayes; ME – Minimum Evolution; ML – Maximum Likelihood; NJ – Neighbor-Joining). The number of species in each group is indicated within brackets. Group names as in [16].

Due to the large number of sequences involved, jModelTest could not be run. The 2143 sequences represent 2073 species, since 70 species contribute two sequences (32 Streptomyces, 7 Streptacidiphilus, 5 Nocardioides, 4 Mycobacterium, 3 Parvibaculum and Amycolatopsis, 2 Burkholderia, and one Actinomadura, Gordonia, Gottfriedia, Hydrocarboniphaga, Kitasatospora, Nocardia, Phenylobacterium, Pseudomaricurvus, Skermania, Solimonas, Stenotrophobium, Streptomonospora, Tepidicaulis and Terricaulis species). The results show that while some taxa are well represented in the resulting phylogeny, others are not and are likely the result of horizontal gene transfer events, as is the case for all taxa represented by fewer than 10 species in Fig. 4A. Nevertheless, even taxa that are well represented, such as the Glycomycetales (Fig. 4B), can show evidence of horizontal gene transfer events (in this case 2 events). The resulting pattern is thus complex and needs further validation that is beyond the scope of this work.

Fig. 4. Bacterial taxa showing sequences with homology to AOs. A) Blue and orange bars represent the total number of species analyzed and the number of species showing AOs, respectively. B) For Actinomycetota, the number of species showing AOs and the total number of species analyzed are shown on a cladogram based on the tree presented by [18]. a 37 species (in the phylogeny) and 129 species (in total) could not be placed in the cladogram. b 7 species (in the phylogeny) and 12 species (in total) could not be placed in the cladogram. For Motilibacterales, Nanopelagicales, one unclassified Actinomycetes species, and one unclassified Actinomycetota species, the counts (in the phylogeny and in total, respectively: 0 and 5, 0 and 9, 0 and 1, and 0 and 1) are not shown.

4 Conclusion

The auto-phylo Docker image allows the creation of complex automated pipelines with little effort, since the user just needs to declare in the pipeline file the order in which the operations should be performed and the names of the respective input and output directories, one instruction per line. In the config file the user specifies the mandatory parameters (the path to the working directory and the SEDA version to be used), as well as other parameters that may be required by the auto-phylo modules. Two biological examples on the evolution of AOs show the usefulness of auto-phylo.

Acknowledgments. This research was financed by the National Funds through FCT—Fundação para a Ciência e a Tecnologia, I.P., under the project UIDB/04293/2020, and by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia), under the scope of the strategic funding ED431C2018/55-GRC and ED431C 2022/03-GRC Competitive Reference Group. PD is supported by a PhD fellowship from Fundação para a Ciência e Tecnologia (SFRH/BD/145515/2019), co-financed by the European Social Fund through the Norte Portugal Regional Operational Programme (NORTE 2020).


References

1. Birchler, J.A., Yang, H.: The multiple fates of gene duplications: deletion, hypofunctionalization, subfunctionalization, neofunctionalization, dosage balance constraints, and neutral variation. Plant Cell 34(7), 2466–2474 (2022)
2. Merabet, S., Carnesecchi, J.: Hox dosage and morphological diversification during development and evolution. In: Seminars in Cell and Developmental Biology. Elsevier (2022)
3. e Silva, R.S., et al.: The Josephin domain (JD) containing proteins are predicted to bind to the same interactors: implications for spinocerebellar ataxia type 3 (SCA3) studies using Drosophila melanogaster mutants. Front. Mol. Neurosci. 16 (2023)
4. Gupta, P.K.: Earth Biogenome project: present status and future plans. Trends Genet. (2022)
5. Lewin, H.A., et al.: Earth BioGenome project: sequencing life for the future of life. Proc. Natl. Acad. Sci. 115(17), 4325–4333 (2018)
6. López-Fernández, H., et al.: SEDA: a desktop tool suite for FASTA files processing. IEEE/ACM Trans. Comput. Biol. Bioinf. 19(3), 1850–1860 (2020)
7. López-Fernández, H., et al.: Compi: a framework for portable and reproducible pipelines. PeerJ Comput. Sci. 7, e593 (2021)
8. López-Fernández, H., Ferreira, P., Reboiro-Jato, M., Vieira, C.P., Vieira, J.: The pegi3s bioinformatics docker images project. In: Rocha, M., Fdez-Riverola, F., Mohamad, M.S., Casado-Vara, R. (eds.) PACBB 2021. LNNS, vol. 325, pp. 31–40. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-86258-9_4
9. Sievers, F., Higgins, D.G.: Clustal Omega. Curr. Protoc. Bioinform. 48(1), 3.13.1–3.13.16 (2014)
10. Notredame, C., Higgins, D.G., Heringa, J.: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205–217 (2000)
11. Price, M.N., Dehal, P.S., Arkin, A.P.: FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26(7), 1641–1650 (2009)
12. Huelsenbeck, J.P., Ronquist, F.: MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17(8), 754–755 (2001)
13. Kumar, S., et al.: MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis. Bioinformatics 28(20), 2685–2686 (2012)
14. Bettisworth, B., Stamatakis, A.: RootDigger: a root placement program for phylogenetic trees. BMC Bioinform. 22(1), 1–20 (2021)
15. Darriba, D., et al.: jModelTest 2: more models, new heuristics and parallel computing. Nat. Methods 9(8), 772 (2012)
16. Duque, P., Vieira, C.P., Vieira, J.: Advances in novel animal vitamin C biosynthesis pathways and the role of prokaryote-based inferences to understand their origin. Genes 13(10), 1917 (2022)
17. Edgar, R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5), 1792–1797 (2004)
18. Nouioui, I., et al.: Genome-based taxonomic classification of the phylum Actinobacteria. Front. Microbiol. 9, 2007 (2018)

Feature Selection Methods Comparison: Logistic Regression-Based Algorithm and Neural Network Tools

Katarzyna Sieradzka and Joanna Polańska

Department of Engineering and Data Exploratory Analysis, Silesian University of Technology, Gliwice, Poland {katarzyna.sieradzka,joanna.polanska}@polsl.pl

Abstract. Feature selection for high-dimensional data is desirable, especially as extensive data sets are generated and used ever more often. Currently considered research problems concern appropriate feature selection in a multidimensional space, allowing the selection of only those features relevant to the analyzed problem. The implemented and applied machine learning approach made it possible to recognize feature profiles that distinguish two classes of observations. The logistic regression-based method selected 21 features, while the neural network approach identified 10 features related to the examined problem. This made it possible to significantly reduce the dimensionality of the data from the 406 original dimensions. Moreover, the feature selection approaches gave consistent results: as many as eight features were common to both methods. Applying the recognized profiles also yielded very high classification quality metrics; when logistic regression was used both for feature selection and classification, the weighted classification accuracy reached almost 94% while maintaining the F1-score at a very high level of 93%. The achieved results indicate the high efficiency of the proposed feature selection solution and support the potential of the implemented logistic regression-based approach for feature selection and classification in two-class problems.

Keywords: Feature selection · Logistic regression · Neural networks

1 Introduction

In many fields, especially medicine, increasingly modern techniques provide accurate, high-dimensional data sets. They usually have a very complex structure and volume in terms of the number of observations and features subject to further analysis. Two concepts are fundamental to the problem of high-dimensional data classification. The first is data distortion. If not controlled, it may lead to erroneous conclusions about the internal variability of the analyzed data, and disturbances introduced into analytical procedures may limit the possibility of detecting specific and interesting phenomena and relationships between observations. Another crucial aspect is the number of significant features to be analyzed. Entering and processing all of them is often time-consuming and sometimes even impossible with extensive data. As data dimensionality grows, the number of features irrelevant to the analyzed research problem increases. Investigating all of them increases not only the time necessary to process such data but, above all, the required computing resources and, consequently, the cost of an analysis covering all the provided features.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 34–43, 2023. https://doi.org/10.1007/978-3-031-38079-2_4

1.1 Classification Problem

Classification is, in general, a two-step problem. The first step is model construction, which consists of learning and testing processes. It is essential to separate the train and validation sets properly and, above all, to pay special attention to the construction of the test set. This set should reflect the structures and relationships in the original data set as accurately as possible. The train set is used to generate appropriate weights assigned to features, and the validation set is used for iterative self-correction of the model learning process. The second step of model construction is its tuning, which makes it possible to obtain the highest practicable classification quality.

Many methods are currently applied for classification purposes. One is based on decision trees; its advantage is a relatively quick decision tree construction compared to other available and known methods. It determines the affiliation of observations based on a series of decisions [1] and can be used in multi-class problems. However, it has one main disadvantage: detecting features' mutual dependencies is impossible without introducing additional, complicated methods. Another inspiring approach is based on neural networks, systems inspired by the workings and construction of biological neural networks. Collections of connected neurons are supposed to reflect the flow of information in the biological brain. Each neuron performs appropriate computational processes by applying non-linear functions to the input information and transferring the result to subsequent layers [2], which can perform other transformations, imposing different weights on individual signals [3]. Another method based on biological processes is the genetic algorithm, which belongs to the evolutionary algorithms [2]. These algorithms are based on determining the fit of a particular dataset element compared to other components. Such methods require multiple steps but allow very complex problems to be solved. Another frequently used classification method is k-nearest neighbors (k-NN), which predicts an observation's membership based on its k nearest neighbors, whose class affiliation is known [4]. The k-NN method is beneficial when analyzing data with a very complex and complicated internal structure. A group of statistical methods uses more or less complex mathematical formulas for modeling. One is logistic regression, used in two-class problems (binary classification). This method determines the probability that an observation belongs to the positive class based on the numerically presentable relationship between the dependent and independent variables. The expected observation value is the sum of the products of the observation values and the coefficients fitted to the independent variables [5]. These coefficients determine how the independent variables affect the dependent variable.

Such a large variety of classifier construction methods is associated with the rapid development of high-dimensional data acquisition methods, and mainly with the data variety across research problems. A crucial aspect is the explainability of results. The need to interpret results in a specific way often dictates the choice of methodology; a detailed consideration of the pros and cons of each approach is then necessary.

1.2 Feature Selection Methods

Feature selection is a fundamental problem in building an efficient and accurate classifier. It significantly reduces the dimensionality of the analyzed data while preserving the most important relationships between observations in high-dimensional datasets. The phenomenon of feature selection has been thoroughly described in the manuscript entitled Feature selection methods for classification purposes [6]. Supervised feature selection is divided into wrapper, filter, and hybrid methods [7].

Filters are not time-consuming because they do not use complex machine-learning algorithms [7]. Moreover, these methods are not difficult to use or implement because they are based mainly on statistics, where feature ranks are created [8]. Features are dropped based on their correlation to the output. An essential element of these methods is determining the appropriate threshold for the number of significant features, i.e., which features to keep and which can be considered irrelevant to the research problem. Due to this general approach, filters are particularly recommended for high-dimensional data [9]. They have one primary and significant drawback: they look at each feature separately, which makes them prone to rejecting features that are poor predictors of the target on their own but add high value in combination with other features [8].

The second method mentioned is the wrapper. It is computationally complex because it uses a machine-learning algorithm and is therefore not fully recommended for high-dimensional data [10]. The wrapper algorithm splits the input data into subsets and trains a model, which is used to score different subsets of features in order to select the best one. Wrappers, compared to filters, maintain the correlations between features in the data set, making them an attractive method in terms of results interpretability.

The hybrid method combines filters and wrappers into one common approach, giving the greatest scope for manipulation and selection with regard to the intended goals. When choosing the order of the two feature selection methods in a combined approach, their disadvantages should be kept in mind. Applying filters first speeds up computation and reduces computational resources, while giving up the ability to maintain feature correlations. Using wrappers first retains full dependencies between the data, but calculation time and computational resources are not reduced significantly. The choice is therefore not easy and depends primarily on the intended purpose and the relative importance of the advantages and disadvantages of the individual methods.
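To make the filter/wrapper distinction concrete, here is a minimal scikit-learn sketch (not used in the paper; the library and synthetic data merely stand in for the authors' custom implementations): `SelectKBest` scores each feature independently with a univariate statistic (a filter), while `SequentialFeatureSelector` wraps an actual classifier around candidate subsets (a wrapper).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data standing in for a high-dimensional set.
X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)

# Filter: rank each feature on its own with a univariate ANOVA F-score.
filt = SelectKBest(f_classif, k=5).fit(X, y)
filter_idx = np.flatnonzero(filt.get_support())

# Wrapper: forward selection, scoring candidate subsets with a fitted model,
# so dependencies between features influence the choice.
wrap = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                 n_features_to_select=5,
                                 direction="forward").fit(X, y)
wrapper_idx = np.flatnonzero(wrap.get_support())

print(sorted(filter_idx), sorted(wrapper_idx))
```

The wrapper needs hundreds of model fits even for this toy problem, which illustrates why wrappers are discouraged for very high-dimensional data.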


2 Methods and Materials

An algorithm based on logistic regression was implemented as a wrapper method to recognize the observations' feature profile. A workflow based on neural networks was also developed using publicly available functions and tools. The purpose of this analysis was to compare two popular approaches to the feature selection problem. It is important to note the order in which the feature selection procedures are applied in the proposed approach, enabling the detection of feature dependencies.

2.1 Logistic Regression-Based Algorithm

To implement feature selection with wrappers, binary logistic regression was utilized. The applied approach allows the determination of the weights assigned to the model components. Weights carry information about the importance of a given feature in the class recognition problem. The learning process occurs in a supervised manner. An essential part of the implementation is the loss function, which determines how much the predicted class differs from the actual one. In the proposed approach, the inverse of the loss function was used, so that it is maximized when selecting the optimal model using the Bayesian Information Criterion (BIC).

The process of building models for feature selection begins with transferring a matrix of values along with the features' names and their annotations. The models are built using the forward propagation methodology. The probability of individual observations belonging to the positive class (1) is determined, and predicted classes are marked utilizing the sigmoidal function:

probability_obs = 1 / (1 + e^{-(θ_0 + Σ_feat θ_feat · x_feat)}) = P(y = 1 | x; θ)    (1)

A simple decision scheme determines the final decision on an observation's class affiliation. If the probability of an observation belonging to the positive class is equal to or greater than 50%, the observation is assigned to the positive class; if this value is less than 50%, the observation is classified as the negative class. Subsequently, the values of the model parameters are determined using gradient descent with a learning rate equal to 0.001. The model then learns the parameter values to achieve the best results with respect to the applied inverse of the loss function (2):

LL = Σ_Positive ln(probability_Positive) + Σ_Negative ln(1 - probability_Negative)    (2)

The absolute likelihood difference is also determined across successive iterations of learning the model parameters. If this difference is smaller than the set value epsilon = 0.1, the procedure of calculating model parameters is terminated for the given model. For models with the same degree of complexity, i.e., containing the same number of features, the BIC criterion is used to find the best one. A model is compared with a model of a different level of complexity using the Bayes Factor (BF) metric: if the BF value is greater than 10^1.5, the more complex model is marked as better, and the algorithm starts adding another feature to the model. In addition, the GeneRank measure (3) was introduced as a filter method:

GeneRank_x = Σ_{j=1}^{N} accuracy_j × (k - i + 1) / k    (3)
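The model-scoring steps described above can be sketched numerically. This is a hedged illustration, not the authors' code: the helper names are invented, the toy data in the usage example is synthetic, and the BIC is written in its standard form k·ln(n) − 2·LL, which the paper does not spell out.

```python
import numpy as np

def sigmoid_probability(X, theta):
    """Eq. (1): P(y=1 | x; theta), with theta[0] as the intercept."""
    return 1.0 / (1.0 + np.exp(-(theta[0] + X @ theta[1:])))

def log_likelihood(X, y, theta):
    """Eq. (2): summed log-probabilities of the true classes."""
    p = sigmoid_probability(X, theta)
    return np.sum(np.log(p[y == 1])) + np.sum(np.log(1 - p[y == 0]))

def bic(X, y, theta):
    """Standard BIC = k*ln(n) - 2*LL, used to pick the best model
    among models of the same size (assumed form, not stated in the paper)."""
    n, k = X.shape[0], theta.size
    return k * np.log(n) - 2 * log_likelihood(X, y, theta)

def prefer_more_complex(ll_simple, ll_complex, threshold=10 ** 1.5):
    """Bayes Factor rule from the text: accept the more complex model
    only if BF = L_complex / L_simple exceeds 10^1.5 (about 31.6)."""
    return np.exp(ll_complex - ll_simple) > threshold
```

For example, `prefer_more_complex(-120.0, -110.0)` accepts the larger model (BF = e^10, far above 31.6), while `prefer_more_complex(-120.0, -119.0)` rejects it.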


where x is the feature indicator, N is the model index, k is the number of features in the most extended model, and i is the feature's position in the j-th model. It allows the significance of the individual features included in the generated models to be determined.

2.2 Neural Networks Approach

The neural networks-based feature selection was carried out using publicly available functions from the tensorflow.keras library [11] in the Python environment. The frequently used sigmoidal function was used as the activation function. It transforms the summed information of the transmitted signal value, the weights for the input neuron, and the bias into an output value transferred from this neuron to the following neural network layer. In the described case, a neural network consisting of one input layer, one hidden layer, and one output layer was built. The number of neurons in the hidden layer was equal to 204, and the dropout level was set a priori to 10% of the available neurons. Additionally, binary cross-entropy, often used in two-class problems, was used as the loss function (4):

BCE(y, ŷ) = -(1/N) Σ_{i=1}^{N} [ y_i × ln(ŷ_i) + (1 - y_i) × ln(1 - ŷ_i) ]    (4)

where y is the true output vector and ŷ is the predicted output vector. Additionally, Adaptive Moment Estimation (Adam) was used to optimize the weights and biases. It is a popular, frequently used optimization method in neural networks, well suited to large datasets [12]. The result of the Shapley Additive exPlanations (SHAP) function from the shap library [13] in the Python environment was used as the filter method. This step made it possible to clarify the impact of individual features on the classification processes. It is an increasingly used and recommended tool for feature selection in neural network problems [14]. The property of this function used was the calculated ShapValues, showing the significance of the individual features.
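Equation (4) can be rendered in a few lines of numpy. This is only an illustration of the loss itself, not the tensorflow.keras implementation used in the paper; the epsilon clip is a standard numerical guard against log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy (Eq. 4) between true labels and
    predicted probabilities; eps guards against taking log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
```

For instance, confident and correct predictions give a small loss: `binary_cross_entropy(np.array([1., 0.]), np.array([0.9, 0.1]))` is about 0.105.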

2.3 Materials

This work was based on high-dimensional single-cell sequencing data containing two classes: positive and negative (control). Due to the real origin of the data, variable interactions occur in them, where a change in the values describing one variable causes changes in other variables. The analyzed data included 406 features and 3648 observations. The division into classes and a summary of the data are presented in Table 1.

Table 1. Analyzed data summary.

Data structure | Negative obs | Positive obs | Total obs | Features
Test           | 429          | 349          | 778       | 406
Model          | 1540         | 1330         | 2870      | 406

Moreover, for independent testing, the data was divided into a model structure containing a train and validation set, and into a test set. This stage was necessary before building the model to ensure the independence of the testing procedures from the classifier learning process. In addition, the proportion of positive and negative class observations was maintained, to create a test set properly representing the analyzed data set.
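The proportion-preserving hold-out described above corresponds to a standard stratified split. A hypothetical sketch with scikit-learn follows (random data of the stated dimensions; the exact counts in Table 1 come from the authors' own split, so they are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the real data: 3648 observations x 406 features,
# with 1679 positive and 1969 negative observations as in Table 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(3648, 406))
y = np.array([1] * 1679 + [0] * 1969)

# stratify=y keeps the positive/negative ratio of the full data set
# in both the model (train+validation) part and the 778-observation test set.
X_model, X_test, y_model, y_test = train_test_split(
    X, y, test_size=778, stratify=y, random_state=0)

print(y_test.size, round(y_test.mean(), 2), round(y.mean(), 2))
```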

3 Results

The analysis results include the built feature profiles, selected using the feature selection procedures, and the classification quality metrics of the created models. Features were selected based on the model data set, while the created models were tested on the test set. Moreover, all procedures were performed on normalized data.

3.1 Logistic Regression-Based Algorithm

Fifty models were built using the implemented feature selection algorithm based on logistic regression methods. Treating the generated models as lists of features, they contained 54 unique features out of the 406 features in the input data set. For each feature included in the models at least once, the GeneRank metric value was calculated. The threshold on the number of features was determined based on the lack of significant differences in subsequent metric values. There were 21 essential features, and the research space was reduced to this dimensionality. The calculated model parameter values are given in Table 2.

Table 2. Logistic regression-based model summary.

Intercept | F329 | F327  | F17  | F133 | F184  | F330 | F303
-4.47     | 1.76 | -1.44 | 0.60 | 0.22 | -0.75 | 0.40 | 0.32

F287  | F44  | F168  | F83   | F90   | F378 | F358  | F345
-0.54 | 0.32 | -0.39 | -0.39 | -0.22 | 0.10 | -0.30 | -0.31

F387  | F263  | F4   | F99  | F348  | F100
-0.32 | -0.33 | 0.06 | 0.05 | -0.29 | 0.11

The threshold-tuning procedure determined a new probability threshold for classifying observations as positive. Changing the threshold from the fixed 50% to 51.24% improved the classification quality metrics on the test set. The results are presented in Table 3 and are satisfactory: the classification accuracy reached 94% and the F1-score 93%, while only 6% of the available observations were classified incorrectly.
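The threshold tuning step can be sketched as a simple grid search over candidate cut-offs. This is an illustration of the general procedure, not the authors' code: the function names and the synthetic data are hypothetical, and weighted accuracy is assumed here to be the mean of sensitivity and specificity.

```python
import numpy as np

def weighted_accuracy(y_true, y_prob, threshold):
    """Weighted accuracy at a given cut-off, taken here as the mean of
    sensitivity and specificity (assumed definition)."""
    y_hat = (y_prob >= threshold).astype(int)
    sens = np.mean(y_hat[y_true == 1] == 1)   # recall on positives
    spec = np.mean(y_hat[y_true == 0] == 0)   # recall on negatives
    return (sens + spec) / 2

def tune_threshold(y_true, y_prob, grid=np.linspace(0.01, 0.99, 99)):
    """Scan candidate thresholds and return the one maximizing
    weighted accuracy on held-out predictions."""
    scores = [weighted_accuracy(y_true, y_prob, t) for t in grid]
    return grid[int(np.argmax(scores))]
```

With probabilities that separate the classes around, say, 0.55, the tuned cut-off lands above the default 0.5, mirroring the shift to 51.24% reported above.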

Table 3. Logistic regression-based model classification quality metrics.

TP                                            | 323
TN                                            | 407
FP                                            | 22
FN                                            | 26
Precision                                     | 0.9362
Sensitivity                                   | 0.9255
Specificity                                   | 0.9487
Weighted accuracy                             | 0.9383
F1-score                                      | 0.9308
Number of observations                        | 778
Number of correctly classified observations   | 730
Number of incorrectly classified observations | 48
Incorrectly classified observations [%]       | 6.17

3.2 Neural Networks Approach

After building a neural network model containing all 406 available features, the SHAP tool was used to calculate the ShapValues for each included feature. The threshold on the number of informative features was estimated from the changes in the metric value across all features: those for which significant changes in the metric's value were observed were considered essential. The final model was based on the ten selected features listed in Table 4.

Table 4. Neural network-based feature selection summary.

F329 | F17 | F133 | F303 | F330 | F44 | F4 | F299 | F366 | F378
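Selecting features by ShapValues reduces to ranking features by their mean absolute SHAP value and keeping the top of the ranking. A numpy sketch follows; in practice the `shap_values` array would come from the shap library's explainer applied to the trained network, and here a small synthetic array stands in for it.

```python
import numpy as np

def rank_by_shap(shap_values, feature_names, top_k):
    """Rank features by mean |SHAP value| over observations, keep top_k."""
    importance = np.abs(shap_values).mean(axis=0)   # one score per feature
    order = np.argsort(importance)[::-1]            # most important first
    return [feature_names[i] for i in order[:top_k]]

# Synthetic example: 5 observations x 4 features; F2 dominates, F0 is second.
shap_values = np.array([[0.2, 0.0, 0.9, 0.1],
                        [0.3, 0.1, 0.8, 0.0],
                        [0.1, 0.0, 1.0, 0.1],
                        [0.2, 0.1, 0.7, 0.0],
                        [0.2, 0.0, 0.9, 0.1]])
names = ["F0", "F1", "F2", "F3"]
print(rank_by_shap(shap_values, names, top_k=2))   # ['F2', 'F0']
```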

The model with the calculated values of the weights and biases was tested on the test set; the results are included in Table 5. The model built with the neural networks feature selection method achieved satisfactory classification quality results: the weighted classification accuracy was almost 91%, the F1-score was above 90%, and the share of incorrectly classified observations slightly exceeded 8% of all analyzed observations.

Table 5. Neural network-based model classification quality metrics.

TP                                            | 308
TN                                            | 402
FP                                            | 27
FN                                            | 41
Precision                                     | 0.9194
Sensitivity                                   | 0.8825
Specificity                                   | 0.9371
Weighted accuracy                             | 0.9098
F1-score                                      | 0.9016
Number of observations                        | 778
Number of correctly classified observations   | 710
Number of incorrectly classified observations | 68
Incorrectly classified observations [%]       | 8.74

3.3 Results Comparison

Three feature selection variants were compared to directly contrast the proposed feature selection and classification methods based on logistic regression and neural networks. The first variant used no prior feature selection before building the model and classifying observations with both machine learning methods. In the second variant, feature selection was performed using logistic regression methods and the GeneRank metric, and the built model was then tested with both the logistic regression-based and the neural network-based classifiers. In the last variant, features were selected using the neural network approach with the ShapScore metric and tested with both compared classifiers. The results of the comparative study are summarized in Table 6.

Table 6. Comparison of the classification quality metrics for applied methods.

Model selection approach | Length of signature | Logistic Regression (w. accuracy / F1-score) | Neural Networks (w. accuracy / F1-score)
No selection             | 406                 | 0.8856 / 0.8712                              | 0.8638 / 0.8453
Forward selection for LR | 21                  | 0.9383 / 0.9308                              | 0.9369 / 0.9305
Shap scoring for NN      | 10                  | 0.9165 / 0.9057                              | 0.9098 / 0.9016

The lack of feature selection showed the worst results in both classification cases, with weighted accuracy and F1-score values of 88% and 87% for logistic regression and 86% and 84% for neural networks. Moreover, the model without prior feature selection was very complicated, and it was necessary to determine all of the model parameter values. In contrast, the models built on the feature space reduced with logistic regression methods achieved satisfactory weighted classification accuracy and F1-score values of over 93% for both classification methods. For the last compared feature selection method, using neural networks, the classifier based on logistic regression methods achieved over 91% weighted classification accuracy and over 90% F1-score, and the classifier built with neural networks achieved 91% and 90%, respectively.

4 Conclusions

Both feature selection methods for classification purposes achieved satisfactory results. Moreover, the methods were consistent, as evidenced by as many as eight common features selected from the 406 entities entered into the analysis. The implemented algorithm based on logistic regression methods yielded 21 significant features, while the publicly available functions of the neural network approach yielded ten features. Both methods therefore allowed a significant dimensionality reduction and enabled the use of the recognized feature profiles for classification purposes. However, considering the comparative analysis in the section Results comparison, the best classification quality was clearly achieved by the model built with feature selection using logistic regression methods: over 93.5% weighted classification accuracy and over 93% F1-score were achieved with both the logistic regression-based and the neural network classifiers.

In summary, the implemented and proposed approach to feature selection and classification based on logistic regression methods in a two-class problem achieves satisfactory results, supported by very high classification accuracy values on the test set. Moreover, the order of the feature selection methods applied in the proposed solution, i.e., first wrappers and then filters, allowed full dependencies between individual features to be maintained. This approach undoubtedly requires much more computational resources, related to determining feature significance with much more complex machine learning procedures, but it preserves all dependencies between the examined features. Hence, this solution may be a crucial element in developing interpretable results, primarily in medicine, where detailed high-dimensional data require support from the biological side. However, this application requires a significant generalization of the problem and future testing of the proposed solution on a larger number of different data sets.

Acknowledgements. This work has been supported by the European Union under the European Social Fund grant AIDA – POWR.03.02.00–00-I029 and the SUT grant for Support and Development of Research Potential no. 02/070/BK_23/0043.

A New GIMME-Based Heuristic for Compartmentalised Transcriptomics Data Integration

Diego Troitiño-Jordedo1,3, Lucas Carvalho2,4, David Henriques1, Vítor Pereira2, Miguel Rocha2, and Eva Balsa-Canto1(B)

1 Biosystems and Bioprocess Engineering Group, IIM-CSIC, Vigo, Spain
[email protected]
2 Centre of Biological Engineering, University of Minho, Braga, Portugal
3 Applied Mathematics Department, University of Santiago de Compostela, Santiago de Compostela, Spain
4 Center for Computing in Engineering and Sciences, UNICAMP, Campinas, São Paulo, Brazil

Abstract. Genome-scale models (GEMs) are structured representations of a target organism's metabolism based on existing genetic, biochemical, and physiological information. These models store the available knowledge of the physiology and metabolic behaviour of organisms and summarise this knowledge in a mathematical description. Flux balance analysis (FBA) uses GEMs to make predictions about cellular metabolism through the solution of a constrained optimisation problem. The gene inactivity moderated by metabolism and expression (GIMME) approach further constrains FBA by means of transcriptomics data. The underlying idea is to deactivate those reactions for which transcriptomics is below a given threshold. GIMME uses a unique threshold for the entire cell. Therefore, non-essential reactions can be deactivated, even if they are required to meet the production of a certain external metabolite, because of their low associated transcript expression values. Here, we propose a new approach to enable the selection of different transcriptomics thresholds for different cell compartments or modules, such as cellular organelles and specific metabolic pathways. The approach was compared with the original GIMME in the analysis of a number of examples related to yeast batch fermentation for the production of ethanol from glucose or xylose. In some cases, the original GIMME results in biological unfeasibility, while the compartmentalised version successfully recovered flux distributions. The method is implemented in the Python-based toolbox MEWpy and can be applied to other metabolic studies, opening the opportunity to obtain more refined and realistic flux distributions, which explain the connections between genotypes, environment and phenotypes.

Keywords: Genome-Scale model · Transcriptomics integration · GIMME · fermentation · yeast · Saccharomyces cerevisiae · MEWpy

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al.
(Eds.): PACBB 2023, LNNS 743, pp. 44–52, 2023. https://doi.org/10.1007/978-3-031-38079-2_5

1 Introduction

Genome-scale models (GEMs) are structured representations of a target organism based on existing genetic, biochemical, and physiological information. These models store available knowledge about organisms' physiology and metabolic behaviour, summarise it with a mathematical description, and have been used for multiple applications, such as strain design for the production of specific chemicals and materials and the prediction of enzyme functions, among many others [5]. GEMs allow predicting metabolic flux values for a complete set of metabolic reactions using linear optimisation methods, such as flux balance analysis (FBA) [13]. In FBA, fluxes are computed according to a certain biological objective that is relevant to the problem being studied (e.g. maximising biomass production), obeying a set of constraints on the external and internal fluxes (e.g. related to the medium and to steady-state assumptions, respectively). While FBA allows reaching phenotype predictions, the integration of intracellular omics data has the potential to refine metabolic flux distributions further. The integration of transcriptomics data into GEMs has received substantial attention. Most methods are based on the activation or deactivation of particular reactions according to a given transcriptomics threshold value. Gene Inactivity Moderated by Metabolism and Expression (GIMME) [2], a method that systematises the integration of transcriptomics data, uses a single transcriptomics threshold value for the entire cell. However, the selection of this threshold comes with risks: reactions that condition the viability of the cell, or that are required to meet the production of a certain external metabolite, may be deactivated because of their low associated transcript expression values. In this work, we propose a new approach that enables the selection of different transcriptomics thresholds for different cellular compartments (or modules).
These modules might correspond to cellular organelles (for example, cytosol, mitochondria, etc.), pathways, or can be manually designed by the user. We have tested this new approach in the description of the metabolism of Saccharomyces cerevisiae under batch fermentation conditions. The approach was compared with the original GIMME in the analysis of a number of examples related to yeast batch fermentation for the production of ethanol from glucose or xylose. We used the iMM904 metabolic reconstruction [12]; the model was constrained by glucose and xylose uptake data obtained in an industrial setup. Compartmentalised GIMME is implemented in MEWpy [14] and facilitates the integration of transcriptomics thresholds by compartments, pathways, and other user-defined modules.

2 Methods

2.1 Flux Balance Analysis

Flux balance analysis (FBA) [13,15] is a mathematical modelling approach based on the knowledge of reaction stoichiometry and mass/charge balances. The implementation relies on the so-called pseudo-steady-state assumption, which


means that no intracellular accumulation of metabolites occurs. This is mathematically expressed as:

S · v = 0    (1)

where S is the n×m stoichiometric matrix (n ≡ metabolites, m ≡ reactions) and v is the vector of metabolic fluxes for all reactions. Usually, the number of fluxes is higher than the number of equations, and thus the system is underdetermined. Still, it is possible to find more specific solutions under the assumption that the metabolism of cells evolves to pursue a predetermined goal that is defined as the maximisation (or minimisation) of a certain objective function (J):

max J    (2)
s.t.: S · v = 0    (3)
LB < v < UB    (4)

where LB and UB are lower and upper bounds that constrain reaction flux values, and J is an objective function to be maximised or minimised, commonly the growth rate, ATP production, or nutrient consumption. In parsimonious FBA (pFBA), a second linear programming problem is solved that identifies the most parsimonious solution, that is, the one that achieves the optimal objective value with minimal use of gene products [11].
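The FBA formulation above can be solved with any linear programming solver. As an illustration, the sketch below solves it for an invented three-reaction toy network with SciPy; the network, bounds, and objective are ours, chosen only to show the mechanics:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network (illustrative): R1: -> A (uptake), R2: A -> B, R3: B -> (export).
# Rows of S are metabolites A and B; columns are the three reactions.
S = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
bounds = [(0.0, 10.0), (0.0, 1000.0), (0.0, 1000.0)]  # uptake limited to 10

# Maximise J = v3 subject to S v = 0 and the bounds; linprog minimises,
# so the objective vector is negated.
res = linprog(c=[0.0, 0.0, -1.0], A_eq=S, b_eq=np.zeros(2),
              bounds=bounds, method="highs")
print(res.x)  # at steady state, all three fluxes equal the uptake limit, 10
```

Since S·v = 0 forces v1 = v2 = v3, the optimum is pinned by the uptake bound, mirroring how medium constraints shape FBA solutions.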

2.2 Gene Inactivity Moderated by Metabolism and Expression

Gene Inactivity Moderated by Metabolism and Expression (GIMME) [2] is a methodology that allows obtaining a reduced genome-scale model from a larger metabolic reconstruction by means of:

– a dataset of gene expression values associated with each gene;
– the required metabolic functionalities (RMFs) that the cell is assumed to achieve;
– an overall gene expression threshold value.

A first reduced model is obtained by removing from the reconstruction the reactions whose expression value is below the threshold. These deletions may render the FBA problem unfeasible, due to a lack of cell viability resulting from deleting essential growth-related reactions, or due to the model's inability to produce specific primary or secondary metabolites. To address possible model inconsistencies, Becker and Palsson [2] suggested the use of the so-called inconsistency score (IS), which quantifies the disagreement between the gene expression data and the assumed objective function, and the normalised inconsistency score (NCS), which compares (in relative terms) the agreement between the gene expression data and a specific metabolic function. The final reduced model is the one that minimises inconsistency while reactivating the minimal number of reactions. This is formulated as follows:


– Find the maximum flux through each RMF.
– Constrain the RMF to operate at a certain value and identify the set of available reactions that reduces the inconsistency.

This can be formulated as

min Σ ci · |vi|    (6)
s.t.: S · v = 0    (7)

with the penalty weights ci defined per compartment as

ci = thresholdj − xi  if xi < thresholdj;  ci = 0  otherwise, ∀i    (10)

where xi is the expression value mapped to reaction i and thresholdj corresponds to the threshold value selected for the jth compartment. The compartments or modules can be defined as follows (see Fig. 1 for an illustrative representation):

– Physically-inspired compartments: compartmentalisation is based on existing organelles and other types of internal physical barriers. The user may define the specific set of compartments and their associated gene expression thresholds.
– Pathway-inspired modules: these are defined using the metabolic pathways of the cell. The pathways can be selected using the references included in the model or the KEGG pathway database [7–9]. Selected pathways might be assigned different thresholds.
– User-given modules: these can be designed using reaction references and assigned specific gene expression threshold values.

We have implemented this method in MEWpy [14]. MEWpy is a computational strain optimisation (CSO) tool that integrates different types of constraint-based models and simulation approaches. MEWpy supports FBA and GIMME. To handle the corresponding optimisation problems, the tool offers a


Fig. 1. Types of compartments/modules: (a) physically-inspired; (b) pathway-inspired. Different colours correspond to different threshold values.

programmatic access to linear programming solvers, such as CPLEX [4] and Gurobi [6]. For the purpose of this work, we used CPLEX, the default and recommended option in MEWpy. To facilitate compartmentalisation using pathways, we implemented a request client for the KEGG Pathway Database [7–9]. A user may provide a reference to a specific pathway, and the tool queries the database and automatically defines a set of differentiated compartments and corresponding threshold values.
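A minimal sketch of the per-compartment penalty assignment behind the compartmentalised heuristic; the function and argument names here are ours, chosen for illustration, and do not reflect MEWpy's actual API:

```python
def gimme_penalties(expression, compartment_of, thresholds, default_thr):
    """Assign c_i = threshold_j - x_i to reactions whose expression x_i falls
    below their compartment's threshold, and 0 otherwise (illustrative only)."""
    return {
        rxn: max(thresholds.get(compartment_of.get(rxn), default_thr) - x, 0.0)
        for rxn, x in expression.items()
    }

# A laxer mitochondrial threshold keeps the mitochondrial reaction's penalty small.
expr = {"PDC1": 9.0, "NDI1": 1.0}
comp = {"PDC1": "cytosol", "NDI1": "mitochondria"}
pen = gimme_penalties(expr, comp, {"cytosol": 5.0, "mitochondria": 2.0}, 5.0)
print(pen)  # {'PDC1': 0.0, 'NDI1': 1.0}
```

With a single cell-wide threshold of 5.0, NDI1 would instead receive a penalty of 4.0, illustrating how module-specific thresholds keep low-expression compartments active.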

2.4 The Model

Applying the proposed approach using cellular compartments requires the use of a compartmentalised metabolic model. In this work, we chose the iMM904 genome-scale metabolic reconstruction of the yeast Saccharomyces cerevisiae [12]. This model is structured into eight compartments representing the extracellular space, the cytosol, and the yeast organelles: mitochondria, nucleus, Golgi, peroxisomes, vacuoles, and endoplasmic reticulum. In a stoichiometric model, compartments are usually defined by labels. For example, a metabolite A that can be found in the cytosol and mitochondria is called A[c] and A[m], respectively. In the stoichiometric matrix S, these are effectively represented as two different metabolites, each associated with a row of S, allowing for the definition of different flux constraints in distinct compartments.
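The bracket-suffix labelling convention can be parsed directly when mapping reactions to compartments; a tiny sketch (the helper name is ours):

```python
def compartment(met_id: str) -> str:
    # "A[m]" -> "m": extract the compartment tag from a metabolite label
    return met_id[met_id.index("[") + 1 : met_id.index("]")]

print(compartment("glc__D[c]"), compartment("atp[m]"))  # c m
```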

2.5 The Dataset

The data used in this study were based on RNA-Seq data from a previously published article by de Carvalho et al., 2021 [3]. The raw data were obtained from the NCBI Sequence Read Archive (SRA) under the BioProject PRJNA667443. The samples were collected from laboratory and industrial second-generation ethanol fermentations under glucose and xylose carbon sources. The gene expression data from yeast fermentation were retrieved following the pipeline described in the original article. In order to use the gene expression data as input for the GIMME algorithm and create condition-specific models, we calculated the median expression level of each gene across biological replicates. This approach helped reduce noise and ensure that the resulting models accurately reflected the underlying biology.
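The replicate-collapsing step can be sketched as follows; the gene names and counts are invented for illustration:

```python
import numpy as np

# Median expression per gene across biological replicates, as used as GIMME input.
replicates = {"YAL001C": [12.0, 15.0, 14.0], "YJR104C": [3.0, 2.0, 90.0]}
medians = {gene: float(np.median(vals)) for gene, vals in replicates.items()}
print(medians)  # the outlier replicate of YJR104C does not distort its median
```

The median (rather than the mean) makes the summary robust to a single noisy replicate, which is the motivation stated above.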

3 Results

This work proposes a new heuristic formulation based on GIMME. The essentials of this algorithm are described in Sect. 2.3. In what follows, we present the results achieved for a collection of case studies.

3.1 Case Studies

We have tested the compartmentalised approach to study the metabolism of Saccharomyces cerevisiae under fermentation conditions. We considered several different cases, using glucose and xylose as substrates. Experimental data were obtained from an industrial process and two different laboratories [3]. To impose fermentation conditions in the model iMM904, we defined constraints to ensure anaerobic conditions and uptake of glucose/xylose following the work by Carvalho et al. [3]. We compared the performance of our approach against the results obtained by GIMME. We assumed that cells maximise ethanol production and allowed for different transcriptomics thresholds in mitochondria. Table 1 summarises the different case studies and the corresponding experimental setups.

Table 1. Definition of case studies A–F. Type of process indicates the origin of the experimental data. Fermentation type refers to the carbohydrate used. Uptake refers to the flux value set as a constraint. Threshold corresponds to the value used for the transcriptomics cutoff in GIMME.

Case | Type of process | Fermentation type | Uptake (g/L) | Threshold (perc.)
A    | Industrial      | glucose           | 10           | 25
B    | Industrial      | xylose            | 10           | 25
C    | Laboratory 1    | glucose           | 10           | 25
D    | Laboratory 1    | xylose            | 10           | 25
E    | Laboratory 2    | glucose           | 10           | 25
F    | Laboratory 2    | xylose            | 10           | 25

Figures 2 and 3 show a comparison of the results achieved by both methods in terms of the production of ethanol and glycerol in cases A–C and E (a comparison of results achieved for higher alcohols was also performed, but no significant differences were observed). Table 2 summarises the results achieved for cases D and F, in which GIMME experienced convergence issues. To study the impact and the differences of compartmentalisation when standard GIMME is feasible, we established a general threshold at percentile 25 of the transcript counts present in the dataset, and tested different threshold values for the mitochondrial compartment. As a result, Fig. 2 shows that the compartmentalised approach achieves a higher ethanol flux when


Fig. 2. Comparison of compartmentalised and standard GIMME: absolute differences in the estimated ethanol and glycerol fluxes depending on the mitochondrial threshold.

using mitochondrial thresholds between percentiles 16 and 19. Specifically, for cases A and B, only a slight increase in production (10⁻⁵ g/L) was observed, with mitochondrial thresholds at percentile 18 (case A) and 16 (case B). The effect is greater for cases C and E, in which the increase in ethanol production is between 0.2 and 0.4 g/L. Remarkably, in case C the glycerol production flux decreased.
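The percentile-based cutoffs used above can be derived directly from the expression distribution; the sketch below uses a synthetic transcript-count vector (the distribution is invented, only the percentile values mirror the text):

```python
import numpy as np

rng = np.random.default_rng(1)
expr = rng.lognormal(mean=2.0, sigma=1.0, size=5000)  # synthetic transcript levels

global_thr = np.percentile(expr, 25)  # cell-wide GIMME cutoff (percentile 25)
mito_thr = np.percentile(expr, 17)    # laxer cutoff for the mitochondrial module
print(mito_thr < global_thr)          # True: fewer mitochondrial reactions removed
```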

Fig. 3. Comparison of compartmentalised and standard GIMME. Relative difference in the yield of ethanol plus glycerol using the compartmentalised and the standard GIMME implementations. Values are reported for different mitochondrial transcriptomics thresholds. Maximum differences appear in the range 16-18, depending on the case study.

Relative differences between models were computed using the sum of ethanol and glycerol fluxes, as follows:

ΔAlcohol(%) = [(Ethc + Glyc) − (EthGIMME + GlyGIMME)] / (EthGIMME + GlyGIMME) × 100    (11)


where Ethc and Glyc represent the ethanol and glycerol fluxes obtained with the compartmentalised method, and EthGIMME and GlyGIMME the corresponding fluxes obtained with GIMME. As shown in Fig. 3, cases A and B present a ≈0.01% improvement in alcohol flows above percentile 18, with p-values of 1.05·10⁻² and 1.10·10⁻³, respectively. Case C shows a ≈0.08% increase for a threshold range between 16 and 19.5, with a p-value of 1.09·10⁻¹³. Case E shows a ≈1.6% increase for a wide range of threshold values (between 5 and 16), with a p-value of 4.70·10⁻¹³. For cases D and F, we observed (Table 2) that certain threshold values result in the unfeasibility of the GIMME approach, while the compartmentalised version converged for all threshold values tested.

Table 2. Results using the new formulation with different threshold percentiles for the mitochondria compartment. These results correspond to the cases of Table 1 where GIMME was infeasible.

Case \ Percentiles | 5–10     | 10–15    | 15–20    | 20–24         | >24
D                  | Feasible | Feasible | Feasible | Some feasible | Unfeasible
F                  | Feasible | Feasible | Feasible | Unfeasible    | Unfeasible
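The relative difference metric of the Results section reduces to a one-liner; a sketch with made-up flux values (the function name and the numbers are ours):

```python
def delta_alcohol_pct(eth_c, gly_c, eth_gimme, gly_gimme):
    # Relative difference between the compartmentalised and standard GIMME
    # ethanol + glycerol fluxes, in percent.
    base = eth_gimme + gly_gimme
    return (eth_c + gly_c - base) / base * 100.0

print(round(delta_alcohol_pct(10.16, 2.0, 10.0, 2.0), 2))  # 1.33
```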

4 Discussion and Conclusions

Integration of transcriptomics data into metabolic genome-scale models has proved to be complicated. The lack of a direct correlation between the number of transcripts of a gene and the activation/deactivation of its associated reactions presents a challenge. This fact becomes an issue when, due to the definition of a specific transcriptomics threshold, the model is unfeasible or the results are biologically meaningless. A few years ago, Klitgord et al. [10] pointed to the role of compartmentalisation of metabolic pathways in metabolic flux models by comparing the predictions of the iMM904 model with and without compartments. They concluded that, by systematically constraining some individual fluxes in a de-compartmentalised version of the model, they could recover the quality of predictions lost by removing compartments. Our methodology is partly grounded in their arguments. However, we also emphasise the role of compartments by identifying and assigning different transcriptomics thresholds to improve the performance of metabolic models. As a result, our heuristic resolved the unfeasibility issues of GIMME (Table 2) by means of the compartmentalisation of the metabolic pathways of mitochondria. This is supported by the work of Avalos et al. [1], which suggests that alcohol production can exploit mitochondrial pathways, justifying a less demanding threshold for the mitochondrial compartment than for the rest of the cell. The approach proved effective for the two xylose fermentations, reaching a good performance with FBA and pFBA for several mitochondrial thresholds and with consistent biological results. Incidentally, we observed an increased


ethanol production compared to the standard GIMME approach. Avalos et al. [1] also pointed to an improvement in the production of certain higher alcohols; however, this was not observed in our tests. The examples considered in this work showcase the potential of the compartmentalised GIMME approach. Additional tests with alternative species could further confirm the benefits of using compartments when integrating transcriptomics data into genome-scale models.

Acknowledgements. This work has received funding from MCIU/AEI/FEDER, UE, grant reference PID2021-126380OB-C32. D.T-J. acknowledges funding by an Axuda de Apoio á Etapa Predoutoral of GAIN–Xunta de Galicia (grant reference IN606A2021/037).

References

1. Avalos, J., Fink, G., Stephanopoulos, G.: Compartmentalization of metabolic pathways in yeast mitochondria improves the production of branched-chain alcohols. Nat. Biotechnol. 31, 335–341 (2013)
2. Becker, S.A., Palsson, B.Ø.: Context-specific metabolic networks are consistent with experiments. PLoS Comput. Biol. 4, e1000082 (2008)
3. Carvalho, L.M., et al.: Understanding the differences in 2G ethanol fermentative scales through omics data integration. FEMS Yeast Res. 21(4), foab030 (2021)
4. IBM ILOG CPLEX: V12.1: User's manual for CPLEX. Int. Business Mach. Corp. 46(53), 157 (2009)
5. Gu, C., Kim, G.B., Kim, W.J., Kim, H.U., Lee, S.Y.: Current status and applications of genome-scale metabolic models. Genome Biol. 20, 121 (2019)
6. Gurobi Optimization, LLC: Gurobi optimizer reference manual (2023). https://www.gurobi.com
7. Kanehisa, M.: Toward understanding the origin and evolution of cellular organisms. Protein Sci. 28, 1947–1951 (2019)
8. Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M., Ishiguro-Watanabe, M.: KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587–D592 (2023)
9. Kanehisa, M., Goto, S.: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000)
10. Klitgord, N., Segrè, D.: The importance of compartmentalization in metabolic flux models: yeast as an ecosystem of organelles. Int. Conf. Genome Inf. 22, 41–55 (2010)
11. Machado, D., Herrgård, M.: Systematic evaluation of methods for integration of transcriptomic data into constraint-based models of metabolism. PLoS Comput. Biol. 10(4), e1003580 (2014)
12. Mo, M.L., Palsson, B., Herrgård, M.: Connecting extracellular metabolomic measurements to intracellular flux states in yeast. BMC Syst. Biol. 3(1), 37 (2009)
13. Orth, J., Thiele, I., Palsson, B.: What is flux balance analysis? Nat. Biotechnol. 28(3), 245–248 (2010)
14. Pereira, V., Cruz, F., Rocha, M.: MEWpy: a computational strain optimization workbench in Python. Bioinformatics 37(16), 2494–2496 (2021)
15. Varma, A., Palsson, B.: Stoichiometric flux balance models quantitatively predict growth and metabolic by-product secretion in wild-type Escherichia coli W3110. Appl. Environ. Microbiol. 60(10), 3724–3731 (1994)

Identifying Heat-Resilient Corals Using Machine Learning and Microbiome

Hyerim Yong and Mai Oudah(B)

New York University Abu Dhabi, Abu Dhabi, UAE
{hy1602,mai.oudah}@nyu.edu

Abstract. Due to global warming, coral reefs have been directly impacted by heat stress, resulting in mass coral bleaching. Among coral species, some are more heat resistant, which calls for an investigation of interventions that can enhance coral resilience for other, heat-susceptible species. Studying heat-resistant corals' microbial communities can provide potential insight into the composition of heat-susceptible corals and how resilience is achieved. So far, techniques to efficiently classify such vast microbiome data are not sufficient. In this paper, we present an optimal machine learning based pipeline for identifying the biomarker bacterial composition of heat-tolerant coral species versus heat-susceptible ones. Through steps of feature extraction, feature selection/engineering, and machine learning training, we apply this pipeline to publicly available 16S rRNA sequences of corals. As a result, we have identified the correlation-based feature selection filter and the Random Forest classifier as the optimal pipeline, and determined biomarkers that are indicators of thermally sensitive corals.

Keywords: Bioinformatics · Microbiome · Machine Learning · Feature Selection and Engineering · Coral-reefs · Heat Resilience

1 Introduction

Corals house at least 25 percent of all marine species and act as a crucial factor not only for marine animals, but also for humans [1]. However, due to rising sea temperatures from global warming, nearly half of the coral population has died since the 1950s [2]. Coral bleaching, the cause of this mass mortality, has affected 98 percent of Australia's Great Barrier Reef, leading to global instability in species diversity, the fishing industry, tourism, and predictions of major climate events [3]. More awareness and conservation practices for restoring the coral population are highly needed. Due to these recent events, scientists are exploring ways to increase the resilience of corals under high-temperature settings through microbiome transplantation [4]. Yet, to further facilitate the scaling-up of such experiments, there is a need for a more efficient technique for analyzing large 16S rRNA datasets. As such, we present a pipeline for applying a machine learning based predictive model that can better detect the microbiome composition of heat-resilient corals and that of heat-susceptible corals.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 53–61, 2023. https://doi.org/10.1007/978-3-031-38079-2_6

2 Related Work

Doering et al. [4] present an experiment that applies microbiome transplantation treatments to heat-susceptible recipients from heat-tolerant donors, to investigate whether this manipulation can increase heat resistance. A problem the paper states is that human-assisted evolution for corals, i.e., microbiome manipulation, is still in its infancy and requires additional pioneering studies. The main method used is 16S rRNA gene sequencing for analyzing the shifts of bacteria within the two types of corals. As a result, the recipients of the heat-resistant microbiome transplantation showed reduced bleaching responses under short-term heat stress. Another observation was that coral-specific bacterial symbiont taxa were transmitted, which may have further facilitated the reduction of stress responses.

We applied our pipeline to datasets collected by Haydon et al. [5]. Over a 9-month period, specimens of the coral Pocillopora acuta were transplanted between mangrove and reef sites in order to better understand coral-microbe relationships. As corals in mangrove sites are more resistant to extreme settings than ones in reef sites, this study examined whether a transplant experiment would change the bacterial communities at reef sites to resemble those of mangrove sites. The authors utilized 16S rRNA datasets to characterize the bacterial composition associated with the coral species. As a result of the changes in environmental conditions, significant shifts in P. acuta-associated bacterial communities were observed, and the microbiomes of transplanted corals were site-specific, meaning that they came to resemble the microbiomes of native corals. The results of this study are useful for finding microbiome biomarkers that contribute to determining whether a coral is thermally sensitive or resistant.
Other studies analyze the microbiome composition of thermally sensitive and thermally resistant corals through 16S rRNA gene-based metagenomic analysis [6]. While coral-associated microbes have been reported for decades, their analysis has only recently been recognized as a key indicator of coral health. Through species annotation and taxonomic analysis, generating OTU sequences with QIIME (version 1.8.0), this study finds that all coral samples were dominated by the Proteobacteria phylum and the Gammaproteobacteria class. Among the heat-sensitive corals, a dominance of the Vibrionaceae and Rhodobacteraceae families was shown, while in the heat-resistant corals there was an abundance of the Enterobacteriaceae and Lactobacillaceae families. The abundance of Vibrionaceae was found during thermal stress conditions, which indicates that they are significant factors in determining the stress tolerance of the coral. Another study that examines the effect of increased temperature on the microbiome composition of corals used full-length 16S rRNA sequences to determine the taxonomic indicators of bacteria for thermally stressed corals [7]. While an increase in the abundance of Vibrio-related sequences is often detected in parallel with an increase of Rhodobacteraceae during thermal bleaching, the majority of the indicator species of this study were members


of the family Rhodobacteraceae. Thus, analyzing the bacterial composition of thermally resistant corals versus thermally sensitive corals contributes to a better understanding of how bacterial symbionts affect the resilience of corals during bleaching events.

3 Methods

3.1 Pipeline

While analyzing the corals' microbiome data through 16S rRNA sequences with OTU tables is useful for finding taxonomic biomarkers that indicate whether a coral is heat resistant or heat sensitive, we propose a machine learning (ML) based approach that can increase the efficiency of analyzing such data. The pipeline for our ML model is presented in Fig. 1.

Fig. 1. Pipeline for microbiome classification

1. Feature extraction: filter out low-quality reads from each sequenced sample and obtain an OTU table by closed-reference OTU picking.
2. Feature selection: apply filter methods for feature selection and feature engineering on the output of the previous step to reduce the OTU table's feature space into a smaller set of informative features.
3. Supervised machine learning: use the output of the previous step to train classification models.
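The paper implements these steps with FASTX-Toolkit, QIIME 2, and WEKA. Purely as an illustrative stand-in, the selection and classification stages (steps 2–3) can be sketched with scikit-learn on a synthetic OTU table; the data, feature counts, and scoring function here are ours, not the paper's:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(60, 200)).astype(float)  # 60 samples x 200 OTUs (synthetic)
y = rng.integers(0, 2, size=60)                     # heat-resilient vs heat-susceptible

pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=20)),   # keep 20 informative OTUs
    ("classify", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X, y)
print(pipe.predict(X[:5]).shape)  # (5,)
```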


3.2 Experimental Setup

We applied our pipeline to a total of 116 16S rRNA samples of heat-resistant mangrove corals and heat-susceptible reef corals, based on the bacterial transplant experiments conducted by Haydon et al. [5]. The dataset is publicly available under the accession number PRJNA764039 on the NCBI Sequence Read Archive (SRA), a bioinformatics database [8].

Feature Extraction. We filtered out low-quality reads using FASTX-Toolkit1, a trimmer tool for pre-processing data. In order to extract the microbial composition of each sample, we used the closed-reference OTU picking functionality of QIIME 2, an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data [9]. A merged OTU table was generated representing the bacterial composition of the 116 samples, which were labeled as either "high" or "low" based on each sample's temperature metadata provided by the original dataset's paper [5]. On average, samples labeled t0, t3D, t2M, and t3M were in a temperature range of [22, 24.5] degrees Celsius, which is considered low compared to the remaining temperature measures, and were hence labeled "low". Samples t6M and t9M, on the other hand, were labeled "high", as they resided in environments with a temperature range of [28, 30). Any sample with no temperature information provided was removed. Therefore, we only considered a total of 113 samples with known labels for training and testing purposes.

Feature Selection. Based on the 113 samples, we had a total of 9113 features from the merged OTU table. To further reduce the feature space, we used WEKA (python-weka-wrapper3) [10], a platform for system development via ML algorithms for data mining tasks. Among the algorithms in WEKA, we used two filter methods for feature selection: a correlation-based approach (CfsSubsetEval) using the Best First search, and an Information Gain approach using the Ranker search. These algorithms were applied to select the most informative features from the extracted feature set (i.e., the baseline), which would increase the quality of the ML model's prediction results. By applying CfsSubsetEval, the feature space was reduced from 9113 features to only 20. For Information Gain, we examined two thresholds for the number of features: first 20, to compare the results with CfsSubsetEval, and then 80, to observe whether there were any changes in the accuracy of the trained ML models. In addition to the feature selection algorithms, we applied a taxonomy-aware hierarchical feature engineering (HFE) method in order to extract and/or construct further informative features from the samples [11].

Machine Learning. We used WEKA's development environment to train several ML models with the data sets we obtained before and after the feature

1 FASTX-Toolkit: https://anaconda.org/bioconda/fastx_toolkit

Identifying Heat-Resilient Corals Using Machine Learning and Microbiome

57

selection process. We initially trained the models using the full feature space, 9113 attributes through four classifiers, J48, logistic regression, Random Forest tree, and Naive Bayes [10,12], and then trained models using the set of informative features through the same classifiers to compare the results. The chosen ML algorithms have been used for model development in various related work showing promising results in microbiome analysis. Deep learning uses artificial neural networks to perform sophisticated computations on large amounts of data points, which we lake here, hence we have decided to exclude this category of algorithms from our experiments.
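The Information Gain criterion behind the Ranker-based filter described above can be stated compactly: the gain of a feature is the drop in class entropy after conditioning on that feature's values. A minimal, pure-Python illustration (not WEKA's InfoGainAttributeEval implementation, which also handles discretization of numeric attributes):

```python
from math import log2
from collections import Counter

def entropy(labels):
    # Shannon entropy of the class distribution, in bits.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    # IG(class; feature) = H(class) - H(class | feature), for a discrete
    # feature (e.g., an OTU coded as present/absent).
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional
```

Ranking all OTU columns by `info_gain` and keeping the top-k mirrors the InfoGainAttributeEval-plus-Ranker configuration: a perfectly class-separating OTU scores the full class entropy, while an OTU independent of the label scores 0.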

4 Results

Using 10-fold cross-validation [13], we trained and evaluated ML models based on the set of informative features obtained after feature selection. Prediction performance was evaluated in terms of Precision, Recall, F-measure, and Area under the ROC curve. We first evaluated the performance of each of the four ML algorithms (J48, logistic regression, Random Forest, and Naive Bayes) when trained on the full feature space (i.e., the baseline) of 9113 features/OTUs. As shown in Table 1, the model trained with the Decision Tree algorithm (J48) outperforms the rest on the baseline feature set.

Table 1. The evaluation results of the classifiers trained on the full feature space (i.e., the baseline results)

Classifier           Precision  Recall  F-measure  ROC Area
Decision Tree        0.929      0.929   0.929      0.918
Logistic Regression  0.641      0.549   0.569      0.569
Naive Bayes          0.699      0.726   0.697      0.600
Random Forest        0.708      0.800   0.829      0.724
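The Precision, Recall, and F-measure columns in the tables are single summary values per classifier, which in WEKA are typically the per-class scores averaged with class-support weights. A minimal sketch of that computation (an illustration of the metric definitions, not WEKA's code):

```python
def weighted_prf(y_true, y_pred):
    # Per-class precision/recall/F-measure, averaged with class-support
    # weights (the averaging WEKA uses for its summary rows).
    n = len(y_true)
    P = R = F = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = (tp + fn) / n  # fraction of samples belonging to class c
        P, R, F = P + w * prec, R + w * rec, F + w * f1
    return P, R, F
```

With the balanced-to-moderately-imbalanced “high”/“low” labels used here, the weighted average stays close to the macro average, but the weighting matters when one class dominates.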

Then, we trained ML models using the feature subsets selected via the feature selection methods. After applying the CfsSubsetEval method, the feature space was reduced from 9113 OTUs to 20 OTUs, and better performance was achieved, as shown in Table 2. Compared to the J48 classifier's results from the baseline experiments, the Random Forest classifier outperforms J48 and the other classifiers. For the Information Gain method, we explored a feature set of size 80 and achieved the results reported in Table 3. Among the classifiers trained on feature sets generated via Information Gain, the Random Forest classifier's prediction performance was the highest. Comparing the two filter methods under the Random Forest classifier, the correlation-based method proved to work best for classifying coral microbiome samples: the model trained via the Random Forest algorithm on the feature set generated by the CfsSubsetEval filter outperforms the remaining models, as shown in Table 2. Comparing the results in Table 1, Table 2, and Table 3, we conclude that the feature selection step in our pipeline is crucial for much improved prediction performance.

Table 2. The evaluation results of the classifiers trained on the feature sets generated via the correlation-based filter method

Classifier           Precision  Recall  F-measure  ROC Area
Decision Tree        0.931      0.929   0.930      0.926
Logistic Regression  0.911      0.912   0.911      0.902
Naive Bayes          0.888      0.858   0.864      0.956
Random Forest        0.965      0.965   0.964      0.990

Table 3. The evaluation results of the classifiers trained on the feature sets generated via the Information Gain filter method

Classifier           Precision  Recall  F-measure  ROC Area
Decision Tree        0.911      0.912   0.911      0.905
Logistic Regression  0.894      0.894   0.894      0.888
Naive Bayes          0.927      0.920   0.922      0.979
Random Forest        0.886      0.876   0.867      0.986

5 Analysis and Discussion

To discern whether the features selected by the algorithms were informative, we compared the attributes produced by the CfsSubsetEval and InfoGain algorithms with those produced by hierarchical feature engineering (HFE). Based on Fig. 2, 7 biomarkers overlap with the attributes produced by CfsSubsetEval, while 15 attributes match those from InfoGain. This indicates that the feature selection process conducted using WEKA's filter methods indeed provided an informative set of biomarkers that can be used for training the ML models. We also observed that the features selected by the feature engineering method, i.e., a mixture of OTUs and taxa, were indeed informative, noting that some of their bacterial families are mentioned in Haydon et al.'s coral bacterial transplantation experiment [5]. In Fig. 2, the OTUs with IDs 4385501, 319453, 4365466, and 2409727 all fall under Rhodobacteraceae and are classified as red, meaning these bacteria are more abundant under high heat. It is noted in Haydon et al.'s paper that the microbiome composition of mangrove-to-reef replicates became highly dominated by this particular bacterial family (79%) [5]. As one of the findings of their research is that the microbiomes of transplanted corals resemble those of native corals, we can link Rhodobacteraceae to being a microbiome indicator of reef corals. Rhodobacteraceae is in fact generally found in thermally sensitive corals [6], which aligns with the observation that reef corals are more heat susceptible than mangrove corals [5]. In another study that examined the microbiome composition of the coral Porites lutea after applying thermal stress, the majority of the 24 robust bacterial indicators identified were members of the family Rhodobacteraceae [7]. Furthermore, the bacterial community structure shifted toward the predominance of Rhodobacteraceae [7], which can also be observed in Fig. 2, as a high number of features are associated with Rhodobacteraceae. Environmentally induced changes such as heat stress result in a shift in the coral microbiome composition: there is higher microbial abundance and a shift towards opportunistic or pathogenic bacterial taxa such as Rhodobacteraceae [14-17]. Thus, the results of our hierarchical feature engineering strongly suggest that Rhodobacteraceae is indeed an indicator of heat-susceptible corals.

Fig. 2. Features from HFE: red as indicators for high heat and blue for low heat

Based on the results of our study, an optimal pipeline for training on 16S rRNA microbiome data of corals would be to use the CfsSubsetEval filter with the Best First search method for feature selection and then apply the Random Forest algorithm to the selected features to train a predictive model. The steps of filtering out low-quality data, applying feature selection to reduce the feature space, and running the Random Forest algorithm produced high classification performance (>99% correctly classified samples).
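The CfsSubsetEval filter recommended above scores a candidate feature subset by its "merit": high average feature-class correlation combined with low feature-feature redundancy. A simplified continuous-feature sketch of that score is below; note that WEKA's implementation measures correlation differently (it works on discretized attributes), so Pearson correlation here is a stand-in for illustration only:

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length numeric lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def cfs_merit(columns, labels):
    # Merit = k*avg|r_cf| / sqrt(k + k(k-1)*avg|r_ff|): favours subsets whose
    # features correlate with the class but not with each other.
    k = len(columns)
    r_cf = sum(abs(pearson(col, labels)) for col in columns) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(columns[i], columns[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)
```

A Best First search then adds or removes features greedily, keeping the subset with the highest merit, which is how the filter can shrink 9113 OTUs down to a 20-feature subset.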

6 Conclusion

In this paper, we propose a ML approach for efficient microbiome classification of heat-resistant versus heat-susceptible corals, with the aim of improving microbiome transplantation techniques for human-assisted evolution in corals. We do so by introducing a pipeline that can identify the biomarker bacterial composition of those corals from publicly available 16S rRNA sequences. The model was constructed through steps of feature extraction, feature selection/engineering, and ML model training. Our experimental results show that training on an informative set of features improves the predictive model's performance in differentiating between heat-tolerant and heat-susceptible corals, especially when using the Random Forest algorithm. In addition to identifying an optimal pipeline for microbiome analysis through ML, the biomarkers resulting from feature selection and HFE shed light on the potential of microbes as indicators of environmental change and thermal stress.

Acknowledgment. This research was carried out on the High Performance Computing resources at New York University Abu Dhabi.

References

1. Hoegh-Guldberg, O., Pendleton, L., Kaup, A.: People and the changing nature of coral reefs. Region. Stud. Marine Sci. 30, 100699 (2019). https://doi.org/10.1016/j.rsma.2019.100699
2. Ashworth, J.: Over half of coral reef cover across the world has been lost since 1950 (2021). https://www.nhm.ac.uk/discover/news/2021/september/over-half-of-coral-reef-cover-lost-since-1950
3. France-Presse, A.: Great Barrier Reef suffers "widespread" coral bleaching, again (2022). https://www.ndtv.com/world-news/australias-great-barrier-reef-suffers-widespread-coral-bleaching-2829866
4. Doering, T., et al.: Towards enhancing coral heat tolerance: a "microbiome transplantation" treatment using inoculations of homogenized coral tissues. Microbiome 9, 102 (2021). https://doi.org/10.1186/s40168-021-01053-6
5. Haydon, T.D., et al.: Rapid shifts in bacterial communities and homogeneity of Symbiodiniaceae in colonies of Pocillopora acuta transplanted between reef and mangrove environments. Front. Microbiol. 12, 756091 (2021). https://doi.org/10.3389/fmicb.2021.756091
6. Meenatchi, R., Thinesh, T., Brindangnanam, P., Hassan, S., Kiran, G.S., Selvin, J.: Revealing the impact of global mass bleaching on coral microbiome through 16S rRNA gene-based metagenomic analysis. Microbiol. Res. 233, 126408 (2020). https://doi.org/10.1016/j.micres.2019.126408
7. Pootakham, W., et al.: Heat-induced shift in coral microbiome reveals several members of the Rhodobacteraceae family as indicator species for thermal stress in Porites lutea. MicrobiologyOpen 8(12), e935 (2019). https://doi.org/10.1002/mbo3.935
8. Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. 39, D19-D21 (2010). https://doi.org/10.1093/nar/gkq1019
9. Bolyen, E., Rideout, J.R., Dillon, M., Bokulich, N., Abnet, C., Al-Ghalith, G., et al.: QIIME 2: reproducible, interactive, scalable, and extensible microbiome data science (2018). https://doi.org/10.7287/peerj.preprints.27295v1
10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11(1), 10-18 (2009). https://doi.org/10.1145/1656274.1656278
11. Oudah, M., Henschel, A.: Taxonomy-aware feature engineering for microbiome classification. BMC Bioinform. 19, 227 (2018). https://doi.org/10.1186/s12859-018-2205-3
12. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press (2013)
13. Pasolli, E., Truong, D.T., Malik, F., Waldron, L., Segata, N.: Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLOS Comput. Biol. 12(7), 1-26 (2016). https://doi.org/10.1371/journal.pcbi.1004977
14. Bourne, D.G., Morrow, K.M., Webster, N.S.: Insights into the coral microbiome: underpinning the health and resilience of reef ecosystems. Annu. Rev. Microbiol. 70(1), 317-340 (2016). https://doi.org/10.1146/annurev-micro-102215-095440
15. Röthig, T., Ochsenkühn, M.A., Roik, A., van der Merwe, R., Voolstra, C.R.: Long-term salinity tolerance is accompanied by major restructuring of the coral bacterial microbiome. Mol. Ecol. 25(6), 1308-1323 (2016). https://doi.org/10.1111/mec.13567
16. Vega Thurber, R., et al.: Macroalgae decrease growth and alter microbial community structure of the reef-building coral, Porites astreoides. PLOS ONE 7, 1-10 (2012). https://doi.org/10.1371/journal.pone.0044246
17. Ziegler, M., et al.: Coral microbial community dynamics in response to anthropogenic impacts near a major city in the central Red Sea. Marine Pollut. Bullet. 105(2), 629-640 (2016). https://doi.org/10.1016/j.marpolbul.2015.12.045

Machine Learning Based Screening Tool for Alzheimer's Disease via Gut Microbiome

Pedro Velasquez and Mai Oudah

New York University Abu Dhabi, Abu Dhabi, UAE
{pedro.velasquez,mai.oudah}@nyu.edu

Abstract. As the connection between the gut and brain is further researched, more data has become available, allowing for the utilization of machine learning (ML) in its analysis. In this paper, we explore the relationship between Alzheimer's disease (AD) and the gut microbiome and how it can be utilized for AD screening. Our main goal is to produce a reliable, noninvasive screening tool for AD. Several ML algorithms are examined separately, with and without feature selection/engineering. According to the experimental results, the Naive Bayes (NB) model performs best when trained on a feature set selected by the correlation-based feature selection method, significantly outperforming the baseline model trained on the original full feature space.

Keywords: Alzheimer's disease · Machine Learning · Gut Microbiome · Feature Engineering · Feature Selection

1 Introduction

Alzheimer's disease (AD) is the most prevalent form of dementia, and as life expectancy increases, its prevalence is expected to triple in the U.S. by 2060 [1]. With no cure, the prognosis of those diagnosed relies on early detection of the disease. Current diagnostic methods depend on costly tests, in the hundreds or even thousands of dollars, to examine patients' brains, mainly computerized tomography (CT), magnetic resonance imaging (MRI), and/or positron emission tomography (PET) scans. Therefore, finding alternative diagnostic or screening tools is imperative. A quick, noninvasive, and inexpensive procedure could facilitate early disease detection, allowing for earlier treatment to slow progression and increasing the effectiveness of treatment at the earlier, pre-symptomatic stages of AD [1].

Nowadays, the gut microbiome is becoming an increasingly viable option for studying neurological diseases. Recent work on the gut-brain connection indicates strong correlations between the gut microbial community and the brain [2,4,9]. Via the bidirectional nerve network connecting them, the brain-gut axis, bacteria play roles in signaling the brain [2]. Thus, as new technologies have made it easier to sequence the bacteria present in humans, more findings on the relationship between AD and the gut microbiome are being shared with the research community. Relying solely on patient stool samples, a tool for AD screening could have great potential. The exploitation of machine learning (ML) raises issues that may impact the development and performance of prediction models, including the huge feature space [3,6]. Most publicly available datasets are limited in size, i.e., in the number of samples or data points, while each individual sample may have tens of thousands of bacterial species, leading to model overfitting and making data-hungry neural networks less reliable [3,6]. Little attention has been devoted to feature selection and engineering in previous work on AD detection. In this work, we explore different feature selection methods as part of our system pipeline in an attempt to address the “curse of dimensionality” and identify informative biomarkers from the microbiome.

The rest of the paper is structured as follows. We first present related work on the relationship between AD and the gut microbial community. Then, in Sect. 3, we discuss the proposed system pipeline. We apply our proposed methodology to the publicly available datasets of [4] and [17]. The Experimental Analysis section demonstrates the experimental settings and discusses the results. Finally, we discuss our conclusions and future work.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 62-72, 2023. https://doi.org/10.1007/978-3-031-38079-2_7

2 Related Work

Given the prevalence of AD, a study in Turkey collected gut microbiome samples of 125 patients with AD, mild cognitive impairment (MCI), and healthy controls [4]. The authors found statistically significant differences in the taxa and their relative abundances between healthy and AD patients (MCI was classified under AD), among these Prevotella 9 and Bacteroides at the genus level [4]. They built a random forest (RF) classifier, which reaches an area under the receiver operating characteristic curve (AUC) score of 0.63 when trained on the nine most significant taxa [4]. When the model was retrained with additional non-metagenomic metadata and patient cognitive test scores, the AUC score increased to the range of 0.74 to 1.0 [4]. In our paper, we apply our pipeline, working exclusively with microbiome features, to the same dataset and show that we can outperform the results reported in [4] without relying on other non-metagenomic metadata.

Another previous work introduced a dataset of 90 fecal samples [17]. [9] and [10] conduct statistical analyses on this dataset [17]. Regarding feature space construction, [9] and [10] reached somewhat contradicting conclusions: [9] found no association between α-diversity and cognitive impairment, while [10] found α-diversity to be a promising predictor of AD.

Combining several datasets, including [4,17], for a total of 410 samples from China, Singapore, and Turkey, [8] trains RF models with AUC scores of 0.935 [8]. That paper, however, treats the classification as a 3-class problem, separating MCI into its own class [8]. By treating the task as a 3-class problem rather than as binary classification, and by combining datasets, the results of [8] cannot be directly compared to those in this paper. However, combining datasets of different origins to increase the sample size will be discussed in the Conclusion and Future Work section.

Across the literature, attempts to create ML classifiers for dementia have shown promising results [7]. Classifying Parkinson's Disease, a type of dementia, [7] uses the gut microbiome to train an RF model. Reaching AUC scores between 70% and 80%, [7] achieves similar AUC scores to those of [4]. For Parkinson's Disease, 22 bacterial families were identified as relevant, top among them Lachnospiraceae, Ruminococcaceae, and Bacteroidaceae [7].

[3] explores tools to reduce commonly identified issues of metagenomic ML: overfitting due to the high feature dimensionality, the often limited amount of data available, and the lack of consistent training methodologies [3]. To ensure consistency, we develop a methodology similar to that of the study we compare against [4]. To maintain said consistency, we take great care to avoid mixing/combining datasets, only drawing comparisons to other studies when the same data has been used to train the ML classifier.

To reduce and generalize the large feature space with minimal information loss, [5] uses the hierarchical nature of the microbial community and the hidden correlations between related bacteria. Hierarchical Feature Engineering (HFE) can reduce the feature space with minimal information loss [5]. Four ML algorithms were explored alongside HFE [5]: Decision Trees (DT), Random Forests (RF), Logistic Regression (LR), and Naive Bayes (NB). We chose HFE as one of the feature engineering methods to explore as part of our work for performance optimization.

3 Methodology

In this section, we introduce a pipeline for processing microbiome data for classifier development, from data preprocessing to model training and validation, as shown in Fig. 1.

Feature Extraction. This phase consists of the following steps:

1. Extracting 16S rRNA sequences from the original metagenomic samples.
2. Trimming the extracted reads/sequences based on quality, with a threshold of 30.
3. Generating an operational taxonomic unit (OTU) table that indicates the microbial composition in terms of bacterial species and their abundance per sample. We utilize closed-reference OTU picking for this.
4. Normalizing the abundances by the sample size, in terms of the number of reads, to produce relative abundances.

Feature Selection. Feature selection/engineering methods are mainly used for dimensionality reduction. Given the large feature spaces of microbiome data, often thousands of bacterial species per sample, the small number of samples can be an issue [6]. These large feature-space-to-sample-size ratios can lead to overfitting in ML models [6], making feature selection a critical component of this pipeline for removing noisy features. Beyond that, feature selection can also identify informative features, i.e., biomarkers for AD screening.

Model Training. For model training, we explore four supervised ML algorithms that allow us to build classification models: RF, NB, DT, and LR. The chosen ML algorithms have been used for model development in various related works, showing promising results in microbiome analysis [4,5,7]. Deep learning uses artificial neural networks to perform sophisticated computations on large amounts of data points, which we lack here, hence we have decided to exclude this category of algorithms from our experiments. Furthermore, in order to properly compare our pipeline to those of other related work [4], our choice of ML algorithms must be consistent with theirs. [4] uses RF when training their model on the Istanbul dataset, and by utilizing the same ML algorithm, we can isolate and compare the effect of our data-processing pipeline on the final trained classifier.
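Step 2 of the feature extraction phase trims reads on a quality threshold of 30. The actual filtering is done with standard toolkits, but the criterion itself is simple to illustrate; the sketch below (hypothetical helper, per-base check rather than per-base trimming) accepts a read only if its FASTQ quality string clears a Phred cutoff and a minimum length:

```python
def passes_filter(quality_string, min_q=30, min_len=80, offset=33):
    # Keep a read only if it is at least min_len bases long and every
    # base meets the Phred quality cutoff (scores are ASCII-encoded,
    # offset by 33 in standard FASTQ).
    if len(quality_string) < min_len:
        return False
    return all(ord(ch) - offset >= min_q for ch in quality_string)
```

For example, a quality string of 'I' characters encodes Phred 40 and passes a cutoff of 30, while '5' characters encode Phred 20 and fail it.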

4 Experimental Analysis

This section delves into the experiments and algorithms we ran and the results of each step of the data pipeline. For feature engineering, we tested HFE [5], CFS, and IFS individually. We then used each of the feature spaces created by the individual feature selection methods to train and test the performance of ML models built with the DT, RF, NB, and LR algorithms.

4.1 Experimental Settings

Data Collection. For our experiments, we primarily used the Istanbul dataset, identified by the accession PRJNA734525, containing 125 16S rRNA annotated samples [4]. The data was split as follows: 47 AD samples (37.6%), 27 MCI samples (21.6%), and 51 controls (40.8%) [4]. While the metadata differentiated the MCI samples from the AD samples, both were treated as AD by the researchers [4]. All samples were collected in Turkey during May 2018 [4]. To verify our experiments, we also tested on the Shanghai dataset, with accession PRJNA489760, containing 180 16S rRNA annotated samples [17]. This second dataset, also collected in 2018, includes both fecal and blood microbiome samples [17]. As this study focuses only on the gut microbiome, the workable set of samples is reduced to 90, of which 30 are AD samples (33.3%), 30 MCI samples (33.3%), and 30 controls (33.3%) [17]. For consistency, MCI and AD cases are treated as AD.

Feature Extraction. For both datasets, SRA-Toolkit 3.0.0 was used to extract the microbiome sequences, dumping the data into the FASTQ format [13] and filtering with the following parameters: nucleotides with a quality score below 30 or sequences shorter than 80 bases were dropped. To build the OTU tables, we used QIIME 2, version 2022.8.3, for its capabilities in analyzing high-throughput microbial sequencing data [11]. Sequences were compared against the 97% OTU reference provided by QIIME 2 [11]. For each sample in the datasets [4,17], the counts for each bacterium were normalized to relative counts and multiplied by a constant factor, 1,000,000, due to the small values relative abundances take. For the Istanbul dataset [4], 11,826 unique OTUs were identified, resulting in a 125 × 11,827 table, accounting for the label column of healthy or AD classifications. OTUs not found in a specific sample had their counts set to 0. For the Shanghai dataset [17], 7,799 unique OTUs were identified.
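The normalization described above, relative counts scaled by a constant factor of 1,000,000, can be sketched in a few lines (the function name is hypothetical; the arithmetic follows the description in the text):

```python
def normalize_counts(otu_counts, scale=1_000_000):
    # Scaled relative abundance: count / library size * scale.
    # Scaling avoids the very small raw fractions relative abundances take.
    total = sum(otu_counts.values())
    if total == 0:
        return {otu: 0.0 for otu in otu_counts}
    return {otu: c / total * scale for otu, c in otu_counts.items()}
```

Dividing by the per-sample library size makes samples with different sequencing depths comparable before they are merged into one OTU table.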

Fig. 1. The data processing pipeline followed by this research project to construct a feature space: 16S rRNA samples → Feature Extraction (FASTQ sequence extraction/filtering, closed-reference OTU picking) → merged OTU table → Feature Selection (OTU count normalization, selection of informative features) → Model Training (model training, 10-fold cross validation) → ML-based classifier for AD.

Table 1. Number of features selected by each of the respective feature selection/engineering methods

                  Baseline  CFS  IFS  HFE
Istanbul dataset  11826     87   85   85
Shanghai dataset  7799      78   85   88

Fig. 2. Taxonomic trees of the top selected features: (a) the Istanbul dataset [4]; (b) the Shanghai dataset [17]

Feature Selection. In our experiments, we test four different feature sets created from the OTU table, separately. The first is the baseline feature space, which keeps all the features extracted via OTU picking with no alterations, allowing us to compare how model performance changes with different feature spaces. The second and third, CFS and IFS, respectively, were applied using the WEKA toolkit Python wrapper, version 0.2.10 [12]. CFS was built using the CfsSubsetEval evaluator with the default parameters [12]. IFS was built using the InfoGainAttributeEval evaluator, set up to select the top 85 features ranked by information gain [12]; 85 was chosen because, across various experiments, CFS and HFE would select anywhere between 80 and 90 features. The rest of the parameters were WEKA's defaults [12]. Lastly, HFE was applied using its default recommended parameters [5].

Model Training. We use WEKA [12] as the environment for both model training and evaluation. RF, NB, J48 DT, and LR models were trained separately on each of the feature sets using the default parameters [12] and evaluated via 10-fold cross-validation. While [4] employed additional metadata in the form of cognitive tests [4], and other related works claimed additional patient data to be informative [10], the models trained via our pipeline rely exclusively on genetic data, without the need for cognitive tests.

Table 2. Precision (P), Recall (R), and AUC scores for each feature selection and machine learning pair for the Istanbul dataset [4]

    Baseline             CFS                  IFS                  HFE
    P     R     AUC      P     R     AUC      P     R     AUC      P     R     AUC
RF  0.557 0.584 0.611    0.784 0.784 0.913    0.743 0.744 0.827    0.744 0.744 0.824
NB  0.594 0.608 0.577    0.894 0.888 0.943    0.756 0.736 0.811    0.736 0.720 0.805
DT  0.498 0.496 0.474    0.626 0.632 0.597    0.719 0.712 0.739    0.648 0.648 0.650
LR  0.532 0.560 0.529    0.833 0.832 0.889    0.664 0.664 0.724    0.614 0.600 0.623

4.2 Experimental Results

Table 1 shows the resulting number of features after running each algorithm on the baseline space: reductions to 0.7% to 1.1% of the original features. Constructing trees of the top 10 OTUs selected by CFS shows promising results. Figure 2 shows the trees constructed from the top features of the Istanbul [4] and Shanghai [17] datasets, respectively. For the Istanbul dataset, multiple bacteria identified in the literature [4] are the same as those selected by CFS; the top 10 include Lachnospiraceae and Ruminococcaceae, both already found to correlate with AD [4]. For the Shanghai dataset, Fig. 2b shows that Firmicutes and Bacteroidetes were identified by CFS as the relevant phyla. This agrees with two of the three phyla identified for this dataset [17] in the literature [9]. However, that paper combined this dataset with others to increase the number of samples [9], so we have no direct comparison to its results, and we cannot claim anything about the absence of Proteobacteria in Fig. 2b.


Table 3. Precision (P), Recall (R), and AUC scores for each feature selection and machine learning pair for the Shanghai dataset [17]

    Baseline             CFS                  IFS                  HFE
    P     R     AUC      P     R     AUC      P     R     AUC      P     R     AUC
RF  0.968 0.967 1        1     1     1        1     1     1        0.989 0.989 1
NB  0.949 0.944 0.917    1     1     1        1     1     1        1     1     1
DT  0.989 0.989 0.983    0.989 0.989 0.983    0.989 0.989 0.983    0.989 0.989 0.983
LR  0.967 0.967 0.978    0.979 0.978 0.999    0.970 0.967 1        0.989 0.989 1
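The AUC values in Tables 2 and 3 can be computed from classifier scores without tracing a ROC curve, via the equivalent Mann-Whitney rank statistic. A minimal sketch (the score values and the "AD"/"CN" label strings below are illustrative, not from the datasets):

```python
def auc(scores, labels, positive="AD"):
    # AUC as the Mann-Whitney rank statistic: the probability that a
    # randomly chosen positive sample scores above a randomly chosen
    # negative one (ties counted as 1/2).
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score of 1 therefore means every positive sample was ranked above every negative one, which is why AUC can hit 1 even when precision and recall sit slightly below 1, as in several cells of Table 3.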

Fig. 3. Performance of ML and feature selection algorithms on the Istanbul dataset [4]: (a) performance of each ML algorithm when trained on different feature spaces; (b) performance of the different ML algorithms when trained on a specific feature space. The NB-CFS combination resulted in the highest AUC.


Model performance on the Istanbul dataset [4], in terms of precision, recall, and AUC, can be seen in Table 2. While the baseline models reach AUC scores ranging from 0.47 to 0.61, our pipeline achieves AUC scores of 0.80 to 0.95. Figures 3a and b plot the performance of the feature space and model combinations for the Istanbul dataset [4]. The same analysis cannot be made for the Shanghai dataset: as seen in Table 3, the AUC scores climbed to 1 after the various feature selection methods.

4.3 Discussion

As seen by Figures 3a and b, RF showed high overall performance, all in agreement with the literature [3–5,7]. Furthermore, the NB classifier achieved the overall highest AUC (0.943) on the CFS feature space. Early results show a better performance to that of the original paper, where AUC scores reached 0.63 when training only with metagenomic data [4]. While the literature tends to recommend the use of additional metadata, such as patient cognitive scores [4,10], our pipeline manages to reach similar improvements solely relying on metagenomic data. Our pipeline allows the creation of more accurate models while limiting any further contact or additional information from patients. By focusing on extensive feature selection for our pipeline, we are able to greatly improve the performance of ML classifiers without relying on external features. In contrast, the results presented in Table 3, regarding our experiments on the Shanghai dataset [17], though presenting near perfect AUC scores, may rather point towards a common challenge with gut microbiome data: overfitting [6]. This dataset [17] contained only 90 samples, compared to 7,799 unique OTUs over all samples. With such few samples, feature selection algorithms and ML models may not work as effectively [6]. In our experiments, we examine our pipeline on two datasets [4,17] from different regions of the world, i.e., Turkey and China. We then used CFS and HFE to construct the taxonomic trees of the most significant OTUs (See Figure 2), still seeing similarities amongst them. Both Figure 2a and b share common taxonomic Orders and Families: the Clostridiales and Bacteroidales orders and the Lachnospiraceae and Ruminococcaceae families of bacteria. Environmental factors and dietary differences are known to influence the gut microbiome, playing large roles in determining the microbiome composition [16]. Therefore, if a ML classifier trained for one diet/location is used on another, the tool’s accuracy may be affected. 
However, our pipeline found similarities in the relevant taxa for datasets from different countries: Turkey and China. Figure 2 identified similar families, Lachnospiraceae and Ruminococcaceae. This finding may hint at the possibility of key bacteria invariant across regions that have deeper connections with AD. Further research is encouraged to explore the relevant taxa that are invariant across countries, with the goal of revealing the nature of the relationship between these bacterial families and the disease.

Machine Learning Screening for Alzheimer’s via Gut Microbiome

5 Conclusion and Future Work

We developed a pipeline methodology for training ML classifiers for AD that relies exclusively on microbiome data, without the need to collect further information based on patient responses or behavior. The prediction models we train significantly outperform the baseline models in terms of AUC score while relying solely on metagenomic data. Furthermore, when comparing the selected taxa of the datasets we analyze from different countries, our pipeline found a number of common bacterial orders and families as informative biomarkers across them, which may suggest an impact of AD on the gut microbiome; further investigation is required for verification. We aim to explore and analyze samples collected worldwide to investigate the potential for a globally generalized model for AD screening.

Acknowledgment. This research was carried out on the High Performance Computing resources at New York University Abu Dhabi.

References
1. What is Alzheimer's Disease — CDC. https://www.cdc.gov/aging/aginginfo/alzheimers.htm. Accessed 18 Feb 2023
2. Bercik, P., Collins, S.M., Verdu, E.F.: Microbes and the gut-brain axis. Neurogastroenterol. Motil. 24(5), 405–413 (2012)
3. LaPierre, N.: MetaPheno: a critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods 166, 74–82 (2019)
4. Yıldırım, S.: Stratification of the gut microbiota composition landscape across the Alzheimer's disease continuum in a Turkish cohort. mSystems 7(1), e0000422 (2022)
5. Oudah, M., Henschel, A.: Taxonomy-aware feature engineering for microbiome classification. BMC Bioinform. 19, 227 (2018)
6. Dougherty, E.R., Hua, J., Sima, C.: Performance of feature selection methods. Curr. Genomics 10(6), 365–374 (2009)
7. Pietrucci, D.: Can gut microbiota be a good predictor for Parkinson's disease? A machine learning approach. Brain Sci. 10(4), 242 (2020)
8. Park, S., Wu, X.: Modulation of the gut microbiota in memory impairment and Alzheimer's disease via the inhibition of the parasympathetic nervous system. Int. J. Mol. Sci. 23(21), 13574 (2022)
9. Liang, X.: Gut microbiome, cognitive function and brain structure: a multi-omics integration analysis. Transl. Neurodegener. 11, 49 (2022)
10. Li, Z.: Differences in alpha diversity of gut microbiota in neurological diseases. Front. Neurosci. 16, 879318 (2022)
11. Caporaso, J.: QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336 (2010)
12. Hall, M.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11, 10–18 (2009)
13. Leinonen, R., Sugawara, H., Shumway, M.: The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011)
14. Mitchell, T.M.: Machine Learning. McGraw-Hill (1997)


P. Velasquez and M. Oudah

15. Hall, M.A.: Correlation-based feature selection of discrete and numeric class machine learning. University of Waikato (Working paper 00/08) (2000)
16. Wilson, A.S.: Diet and the human gut microbiome: an international review. Digest. Dis. Sci. 65, 723–740 (2020)
17. Li, B.: Mild cognitive impairment has similar alterations as Alzheimer's disease in gut microbiota. Alzheimer's Dement. 15, 1357–1366 (2019)

Progressive Multiple Sequence Alignment for COVID-19 Mutation Identification via Deep Reinforcement Learning

Zanuba Hilla Qudrotu Chofsoh, Imam Mukhlash(B), Mohammad Iqbal, and Bandung Arry Sanjoyo

Department of Mathematics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
{imamm,iqbal,bandung}@matematika.its.ac.id

Abstract. COVID-19 can mutate rapidly, resulting in new variants that could be more malignant. To recognize a new variant, we must identify the mutated parts by locating the nucleotide changes in the DNA sequence of COVID-19; this identification is done through sequence alignment. In this work, we propose a method to perform multiple sequence alignment effectively via deep reinforcement learning. The proposed method integrates a progressive alignment approach, aligning each pairwise sequence center, with deep Q-networks. We designed the experiment by evaluating the proposed method on five COVID-19 variants: alpha, beta, delta, gamma, and omicron. The experimental results showed that the proposed method was successfully applied to align multiple COVID-19 DNA sequences, with the pairwise alignment processes locating sequence mutations with up to 90% precision. Moreover, we effectively identified mutations in a multiple sequence alignment fashion, discovering that around 10.8% of the nitrogenous bases form a conserved region. Keywords: COVID-19 · DNA Sequence · Multiple Sequence Alignment · Mutation · Deep Reinforcement Learning

1 Introduction

The COVID-19 pandemic has been running for years, starting in December 2019 with the first case in Wuhan, China. All viruses change over time because of mutations, and the WHO defines several as variants of interest (VOIs) and variants of concern (VOCs). The World Health Organization (WHO) names coronavirus variants using the Greek alphabet, such as Alpha, Beta, and Gamma [10]. The emergence of various types of COVID-19 is caused by mutations of nucleotides at positions in DNA (deoxyribonucleic acid), a nucleic acid located in the nucleus. DNA stores much of the genetic information needed for cell development; thus, genetic information can be hereditary. It is composed of several nitrogenous bases: adenine (A), cytosine (C), guanine (G), and thymine (T). Genetic information is also able to change uncontrollably in a part of DNA, a process known as a mutation [8], which is a change in the genetic material of DNA that affects the nitrogenous base types. It is the source of new allele forms and the reason for species diversity. In the COVID-19 pandemic era as well, DNA mutates quickly and produces many new variants. This is reflected in case numbers: as of April 17, 2023, cases had reached 685,712,475 people, and 6.8 million people had died [10].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 73–83, 2023. https://doi.org/10.1007/978-3-031-38079-2_8

There are many strategies for finding mutations in sequences using alignment methods. In 1970, the Needleman-Wunsch (NW) algorithm [7] was the first algorithm used to recognize compatibility between two sequences by shifting them to obtain better results [8]. However, it was limited to two sequences and was less time-efficient. In another work [5], the authors treated MSA (multiple sequence alignment) as an RL (reinforcement learning) problem; they focused on a tabular approach and proposed a Markov decision process to solve the problem. A follow-up improved the score function, and the results increased slightly. In 2019, Jafari et al. [2] also formulated MSA as an RL problem, using an LSTM network for the estimation phase; they introduced an RL algorithm with a deep Q-network and used actor-critic as the agents' algorithm. This improved results compared to [4]. Both papers have a weakness in that they use a small selection of data to test their performance. Then, in 2021, Song et al. [9] proposed a pairwise alignment based on reinforcement learning to identify mutations in HEV and E. coli datasets. It resulted in a more efficient process with a much higher degree of compatibility. Unfortunately, it was still limited to only two sequences.
Therefore, in this paper, we combine pairwise alignment and MSA, a core process for recognizing structural patterns across several proteins or DNA sequences, via a progressive method based on RL, a machine learning approach in which an agent learns an optimal sequence of actions by receiving rewards for its chosen actions. The data used have longer sequences than in prior work. This research also uses star alignment to obtain time efficiency [1]. This paper is organized as follows: Sect. 2 explains the materials and methods, Sect. 3 presents the results and discussion, and Sect. 4 concludes the paper and outlines further prospects.

2 Methodology

In this work, we propose a method integrating progressive alignment and DRL (deep reinforcement learning) to align multiple COVID-19 DNA sequences for identifying COVID-19 mutations. In general, the framework comprises five stages: collecting data, preprocessing data, modeling, sequence alignment, and evaluation, as shown in Fig. 1. The proposed method is discussed next; collecting and pre-processing data are discussed in Sect. 3.


Fig. 1. The framework for aligning multiple COVID-19 DNA sequences with the proposed method, progressive deep reinforcement learning.

2.1 Progressive Deep Reinforcement Learning

Deep reinforcement learning addresses how an agent in a hidden state in an environment takes actions based on a reward signal; such settings are called non-Markov tasks or partially observable Markov decision processes [4]. In the DRL algorithm, agents transition from one state to another until they reach a final state [2]. Agents receive rewards or punishments for the actions taken in their environment and seek to obtain more reward, thus achieving an optimal policy. A policy indicates which action to choose in each state. Here, DRL therefore aims to find an optimal alignment between sequences in order to obtain the appropriate score. The environment must be specified in each deep reinforcement learning problem as a set of rules defining which actions the agent is allowed to choose, what state the environment is in, and what reward or punishment follows each action [6]. Input to the environment is defined as the state; the environment can be anything that processes and determines actions for agents and rewards or punishes them.

The state space $S$ of multiple sequence alignment consists of $\frac{n^{n+1}-1}{n-1}$ states. Given the set of states $S = \{s_1, s_2, \ldots, s_{\frac{n^{n+1}-1}{n-1}}\}$, $s_1$ is the initial state of the environmental agent. A state $s_{i_k} \in S$, where $i_k \in \{1, \ldots, \frac{n^{n+1}-1}{n-1}\}$, is reached at a given time after visiting other states $s_{i_1}, s_{i_2}, \ldots, s_{i_{k-1}}$. A state is terminal (final) if the number of states visited by the agent in a sequence is $n+1$ and all actions $a_{i_1}, a_{i_2}, \ldots, a_{i_n}$ have been selected. Here $a_{i_1}, a_{i_2}, \ldots, a_{i_n}$ represents a permutation of $1, 2, \ldots, n$ in which different actions are chosen, i.e., $a_{i_j} \neq a_{i_k}$ for $j \neq k$ [4]. A visit from the initial state to any final state is called an episode. A state has a path $\pi = (\pi_0, \pi_1, \ldots, \pi_n)$ with $s_1$ being $\pi_0$. An action is denoted by $a_\pi = (a_{\pi_0}, a_{\pi_1}, \ldots, a_{\pi_{n-1}})$, where $\pi_{k+1} = \Delta(\pi_k, a_{\pi_k})$. After the DRL process, the agent will have learned to execute the actions that obtain the maximum total reward from the initial state to the final state. The reward function in deep reinforcement learning [4] is defined as follows:

$$
r(\pi_k \mid \pi_{k-1}, \ldots, \pi_0) =
\begin{cases}
0 & \text{if } k = 1,\\[2pt]
-\infty & \text{if } a_{\pi_{k-1}} \in \{a_{\pi_0}, a_{\pi_1}, \ldots, a_{\pi_{k-2}}\},\\[2pt]
\displaystyle\sum_{i=1}^{p} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} \mathrm{score}(col_{ij}, col_{ik}) & \text{otherwise,}
\end{cases}
\qquad (1)
$$

where $\pi_k$ is the $k$-th path step, $r$ denotes the reward function, $p$ is the number of columns, $n$ is the number of sequences, and $i$, $j$, $k$ index columns. Through the reward function, the agent is rewarded in state $\pi_k$; agents are trained to find the best path that maximizes the reward received. Thus, the agent determines its optimal policy through the training process.

In this work, we adopt deep Q-networks for deep reinforcement learning. Conceptually, the agent in a deep Q-network learns an action-value function Q, which gives the expected utility of taking a certain action in a certain state; this is the Q-learning function, given by [6]:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right], \qquad (2)$$

where $a$ refers to an action, $s$ stands for a state, $r$ denotes the reward, $\gamma$ indicates a discount factor, $\alpha$ is the learning rate, and $t$ is time. Following the deep Q-network idea from [6], the agent training process is as follows. The first step is to initialize the value of Q. Over several training episodes, the agent executes an action selection mechanism over the possibilities from the initial to the final states. When the selected action is better than before, the estimated Q value is updated according to the deep Q-network algorithm. At the end of the training process, the estimated value of Q will be close to the exact value [6]. The time complexity of this algorithm is $\theta(n \cdot k)$, where $n$ is the nucleotide length and $k$ is the window size. The second stage is to obtain and save the sequence that contains the fewest gaps to a FASTA file. Then, the third sequence $s_k$ is aligned with the result from the previous stage, and the process repeats. The deep Q-network algorithm is shown in Algorithm 1 in more detail.


Algorithm 1: Progressive Deep Reinforcement Learning
    Initialize Q(s, a) and the target network
    1. Align two sequences s_i, s_j:
       for Episode = 1, ..., M do
           Initialize sequence s_1 = x_1 and processed sequence φ_1 = φ(s_1)
           for t = 1, ..., T do
               Select a_t randomly with probability ε, otherwise a_t = argmax_a Q(φ(s_t), a; θ)
               Obtain r_{t+1} and x_{t+1}
               s_{t+1} = (s_t, a_t, x_{t+1}); φ_{t+1} = φ(s_{t+1})
               Save transition (φ_t, a_t, r_{t+1}, φ_{t+1})
               Sample a random mini-batch of transitions (φ_k, a_k, r_{k+1}, φ_{k+1})
               if the episode terminates at step k + 1 then
                   y_k = r_k
               else
                   y_k = r_k + γ max_{a'} Q(φ_{k+1}, a', θ⁻)
               end
               Perform a gradient descent step on (y_k − Q(φ_k, a_k, θ))
               Every C steps, set Q* = Q
           end
       end
    2. Find and save the sequence that contains the fewest gaps
    3. Align the third sequence s_k with the result from step 2
    4. Repeat until all sequences are aligned

2.2 Sequence Alignment

Star Alignment is an algorithm for solving multiple sequence alignment that has the advantage of computation-time efficiency; another term for star alignment is profiling. Its basic idea is to find the one sequence that is most similar to all other sequences, referred to as the center sequence, and then align the center sequence with each other sequence [11]. The formula for finding the center sequence is:

$$f_{\mu,c} = \frac{n_{\mu,c}}{N_{seq}}, \qquad (3)$$

where $\mu$ is the nitrogenous base type, $c$ is the column, $f_{\mu,c}$ is the frequency of occurrence of a nitrogenous base in the $c$-th column, and $N_{seq}$ is the number of sequences. If a base, say A, has the maximum frequency of occurrence in a column, then that position of the center sequence is A, and so on until the final position. Further details of this algorithm are given in Algorithm 2.
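The frequency of Eq. (3) and the profile construction can be sketched as follows; this is an illustrative implementation, not the authors' code, and ties between equally frequent bases are broken arbitrarily here.

```python
from collections import Counter

def column_frequencies(sequences):
    """For each column c, return {base mu: f_{mu,c} = n_{mu,c} / N_seq}."""
    n_seq = len(sequences)
    freqs = []
    for column in zip(*sequences):  # requires equal-length (aligned) sequences
        counts = Counter(column)
        freqs.append({mu: n / n_seq for mu, n in counts.items()})
    return freqs

def center_sequence(sequences):
    """Build the profile by taking the most frequent symbol in each column."""
    return "".join(max(f, key=f.get) for f in column_frequencies(sequences))
```

For example, over `["ACGT", "ACGA", "TCGT"]` the first column has f(A) = 2/3 and f(T) = 1/3, so the profile starts with A.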


Algorithm 2: Star Alignment
    Initialize the score of each nitrogenous base (A, C, G, T) and the gap
    Initialize the frequency
    for each i in range(len(sequences[0])):
        for each j in range(len(sequences)):
            if sequences[j][i] == A then score_A += 1
            elif sequences[j][i] == C then score_C += 1
            elif sequences[j][i] == G then score_G += 1
            elif sequences[j][i] == T then score_T += 1
            else score_gap += 1
        if score_A ≥ score_C and score_A ≥ score_G and score_A ≥ score_T and score_A ≥ score_gap then
            frequency = score_A / sequence_length; append A
        elif score_C ≥ score_A and score_C ≥ score_G and score_C ≥ score_T and score_C ≥ score_gap then
            frequency = score_C / sequence_length; append C
        elif score_G ≥ score_A and score_G ≥ score_C and score_G ≥ score_T and score_G ≥ score_gap then
            frequency = score_G / sequence_length; append G
        elif score_T ≥ score_A and score_T ≥ score_C and score_T ≥ score_G and score_T ≥ score_gap then
            frequency = score_T / sequence_length; append T
        else
            frequency = score_gap / sequence_length; append '-'
    return profile
    save profile in a FASTA file

Progressive Alignment is a process for aligning sequences that works on pairs of sequences. The best result is the one with the fewest gaps; this result is then realigned with the next sequence, and so on. The end result of the progressive alignment process combines the results of each pairwise alignment into a multiple sequence alignment. For example, given four sequences Seq1, Seq2, Seq3, and Seq4, the process is as follows: first, align Seq1 with Seq2 and take the best result, i.e., the one with the fewest gaps. This result is stored for the multiple sequence alignment and is also aligned with Seq3. The same steps are followed, and the sequence with the fewest gaps is then aligned with Seq4.
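The driver loop described above can be sketched as follows. This is a simplified stand-in, not the paper's method: `pairwise_align` here merely pads the shorter sequence with terminal gaps, where the paper uses the DRL aligner of Sect. 2.1.

```python
def pairwise_align(a, b):
    """Placeholder pairwise aligner: pad the shorter sequence with gaps."""
    width = max(len(a), len(b))
    return a.ljust(width, "-"), b.ljust(width, "-")

def fewest_gaps(*candidates):
    """Pick the previously aligned sequence with the fewest gap characters."""
    return min(candidates, key=lambda s: s.count("-"))

def progressive_align(sequences):
    """Fold sequences in one at a time, anchoring on the least-gapped result."""
    aligned = list(pairwise_align(sequences[0], sequences[1]))
    for seq in sequences[2:]:
        anchor = fewest_gaps(*aligned)
        new_anchor, new_seq = pairwise_align(anchor, seq)
        width = len(new_anchor)
        aligned = [s.ljust(width, "-") for s in aligned] + [new_seq]
    return aligned
```

Swapping `pairwise_align` for a real aligner keeps the same control flow: align, select the least-gapped result, and fold in the next sequence.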


Multiple Sequence Alignment is the process of aligning $n$ sequences, $n \geq 3$, to obtain similar regions. It aims to find the maximum number of similar regions by placing the sequences in an optimal order; similar regions are found by computing a score. This study uses the SP (sum-of-pairs) score [3], which considers all possible pairwise combinations of sequence characters in all columns. The SP score is defined as:

$$SP = \sum_{i=1}^{p} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} \mathrm{score}(col_{ij}, col_{ik}), \qquad (4)$$

where $p$ and $n$ are the numbers of columns and sequences, respectively, $col_{ij}$ and $col_{ik}$ denote the characters of sequences $j$ and $k$ in column $i$, and $\mathrm{score}(col_{ij}, col_{ik})$ is the comparison score between those two characters.
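Eq. (4) translates directly into a triple loop over columns and sequence pairs. The match, mismatch, and gap values below are illustrative assumptions; the paper does not fix them here.

```python
# Illustrative scoring values (assumptions, not taken from the paper).
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(x, y):
    """score(col_ij, col_ik) for a single character pair."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH

def sp_score(alignment):
    """Sum-of-pairs score of Eq. (4) over a list of equal-length sequences."""
    n = len(alignment)
    total = 0
    for column in zip(*alignment):        # the p columns
        for j in range(n - 1):            # all sequence pairs (j, k), j < k
            for k in range(j + 1, n):
                total += pair_score(column[j], column[k])
    return total
```

For `["AC", "AC", "A-"]`, the first column contributes three matches (+3) and the second one match and two gaps (1 - 2 - 2 = -3), for a total of 0.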

3 Results and Discussion

We used DNA sequence datasets of the spike protein, obtained through the publicly available SARS-CoV-2 Data Hub at the NCBI (National Center for Biotechnology Information)¹. The data were released from March 1, 2020, until February 28, 2022, in FASTA format, with a total of 142,863 sequences. We focus on five types of COVID-19, with details described in Table 1.

Table 1. COVID-19 variant information

  Variant   Pangolin lineage   Sequence total
  Alpha     B.1.1.7            105,631
  Beta      B.1.351            558
  Delta     B.1.617.2          7,685
  Gamma     P.1                6,455
  Omicron   BA.1 and BA.2      22,534

Table 1 lists the COVID-19 variants with their Pangolin lineages and total sequences. The variant with the most sequences is Alpha, with 105,631, whereas Beta has the fewest, only 558. Data cleaning is needed as the first pre-processing step, since the data still contain outliers in the form of elements other than the DNA nitrogenous bases; sequences containing such elements must be deleted. The total number of remaining clean sequences is 127,658, or around 89.3% of the initial data. Gap insertion is applied since sequence lengths differ; the average sequence length is 3800, and a threshold of δ = 50 is given. Furthermore, sorting is carried out based on sequence length in each variant. The data must have the same sequence length, so gaps are inserted until all sequences have the same length as the longest sequence in each variant.

¹ https://www.ncbi.nlm.nih.gov/

3.1 Analysis of Alignment Results

We discuss the analysis of the sequence alignment results, both pairwise alignment and multiple sequence alignment. The pairwise alignment carried out in this study uses the predetermined dataset and sample data with different sequence lengths as a comparison; the samples are sequences with lengths of 50, 500, 1000, and 3828 nitrogenous bases. Table 2 summarizes the pairwise alignment results using DRL. The results of each pairwise alignment in the progressive alignment process are stored and then aligned as the MSA results. Table 3 summarizes the MSA results and the mutations found for each variant.

Table 2. Summary of the results of pairwise alignment

  Test   Nitrogen base length   Total match   Percentage match
  1      50                     45            90%
  2      500                    433           86.6%
  3      1000                   878           87.8%
  4      3828                   3325          86.8%
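The percentage-match column follows directly from dividing the total matches by the sequence length; a one-line check (an illustration, not part of the paper's pipeline):

```python
def percentage_match(matches, length):
    """Percentage of matched positions, rounded to one decimal place."""
    return round(100.0 * matches / length, 1)
```

For instance, 433 matches over 500 bases gives 86.6%.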

From the five variants used in this study, mutations were found at only a few positions, which indicates that mutations rarely occur among sequences within the same variant. The results of the MSA between the five variants, the variants of concern, are therefore shown in Fig. 2, where the sequence used is the center sequence obtained through profiling. A colored nucleotide indicates that all five sequences have the same nitrogenous base at that position. Furthermore, in the per-region alignment results for the Beta variant, no mutations occurred in the Malaysia region in December 2021, but mutations did occur in the Hong Kong region from December 2021 to January 2022. The results show that at the beginning of the nucleotide sequence there is still similarity among all nucleotides, but at later positions only a few identical nitrogenous bases appear at certain positions, indicating mutations between variants. Thus, nitrogenous base positions 1-414, or about 10.8% of the nitrogenous bases, form conserved regions. In this way, we discover the bases that are not mutated, for further vaccine decisions.
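The conserved-region check described above can be sketched as follows; a column counts as conserved when every aligned sequence carries the same base and no gap (an illustrative criterion, not the authors' exact rule).

```python
def conserved_positions(alignment):
    """Indices of columns where all sequences share one base and no gap."""
    return [i for i, col in enumerate(zip(*alignment))
            if "-" not in col and len(set(col)) == 1]

def conserved_fraction(alignment):
    """Fraction of columns that are conserved, e.g. ~0.108 in the paper."""
    return len(conserved_positions(alignment)) / len(alignment[0])
```

Applied to the five center sequences, positions 1-414 out of the full alignment width would yield the reported 10.8% conserved fraction.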


Table 3. Summary of results of multiple sequence alignment

  Variant   Sequence                                 Mutation positions (nitrogen base changes)                 Mutation positions (insertion-deletion)
  Alpha     20 sequences                             144, 200, 287 (C becomes T)                                -
  Alpha     Feb 2022, region USA                     -                                                          -
  Beta      Feb 2022                                 1681 (A becomes G)                                         -
  Beta      Dec 2021, region Malaysia                -                                                          -
  Beta      Dec 2021 - Jan 2022, region Hong Kong    79 (G becomes T); 1949 (T becomes C); 2161 (becomes T)     -
  Delta     20 sequences                             8, 200 (C becomes T); 230 (A becomes C)                    -
  Delta     Feb 2022                                 159, 2640 (C becomes T); 384 (T becomes C);                204, 206, 208, 209, 212, 213, 426-434
                                                     767 (G becomes T)
  Gamma     20 sequences                             277 (G becomes C); 272 (A becomes G)                       -
  Gamma     Feb 2022                                 -                                                          -
  Omicron   20 sequences                             407 (A becomes G); 408, 412-416 (T becomes A);             404, 407, 412-418, 421
                                                     409-410 (C becomes G); 412 (T becomes C);
                                                     421, 423, 426, 433 (C becomes A); 424 (A becomes C)

Fig. 2. Multiple sequence alignment among variants

4 Conclusion

In this work, we presented progressive deep reinforcement learning to perform multiple sequence alignment for five COVID-19 variants of concern (VOCs): alpha, beta, delta, gamma, and omicron. The proposed method combines a progressive alignment approach with deep Q-networks to identify mutations in the DNA sequence of COVID-19 by locally aligning its spike protein, with an average length of around 3800. The proposed method was tested on a public dataset from NCBI. We observed that mutations rarely happen when "alleles" are aligned within the same variant. Mutations can be detected by looking at changes in the nitrogenous base types and their insertion-deletion gaps. Accordingly, the proposed method located the mutations with 90% accuracy. Moreover, we found that 10.8% of the nitrogenous bases form conserved regions, which can be used for vaccine development.

Acknowledgment. The authors gratefully acknowledge financial support from the Institut Teknologi Sepuluh Nopember for this work, under the project scheme of the Publication Writing and IPR Incentive Program (PPHKI) 2023.

References
1. Isaev, A., Deem, M.: Introduction to mathematical methods in bioinformatics. Phys. Today 58, 83 (2005). https://doi.org/10.1063/1.2138428
2. Jafari, R., Javidi, M.M., Kuchaki Rafsanjani, M.: Using deep reinforcement learning approach for solving the multiple sequence alignment problem. SN Appl. Sci. 1(6), 1–12 (2019). https://doi.org/10.1007/s42452-019-0611-4
3. Lipman, D.J., Altschul, S.F., Kececioglu, J.D.: A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. U.S.A. 86(12), 4412–4415 (1989). https://doi.org/10.1073/pnas.86.12.4412
4. Mircea, I., Bocicor, M., Czibula, G.: Reinforcement learning based approach to multiple sequence alignment. Soft computing applications. Adv. Intell. Syst. Comput. 634, 54–70 (2018). https://doi.org/10.1007/978-3-319-62524-9_6
5. Mircea, I., Bocicor, M., Dincu, A.: On reinforcement learning based multiple sequence alignment. Studia Universitatis "Babes-Bolyai", Informatica LIX, pp. 50–56 (2014)
6. Naeem, M., Rizvi, S.T.H., Coronato, A.: A gentle introduction to reinforcement learning and its application in different fields. IEEE Access 8(5), 209320–209344 (2020). https://doi.org/10.1109/ACCESS.2020.3038605
7. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3), 443–453 (1970). https://doi.org/10.1016/0022-2836(70)90057-4
8. Rashed, A.E.E.D., Amer, H.M., Seddek, M.E., Moustafa, H.E.D.: Sequence alignment using machine learning-based Needleman-Wunsch algorithm. IEEE Access 9, 109522–109535 (2021). https://doi.org/10.1109/ACCESS.2021.3100408
9. Song, Y.J., Cho, D.H.: Local alignment of DNA sequence based on deep reinforcement learning. IEEE Open J. Eng. Med. Biol. 2, 170–178 (2021). https://doi.org/10.1109/OJEMB.2021.3076156

10. WHO Coronavirus (COVID-19) Dashboard. https://covid19.who.int/. Accessed 17 Apr 2023
11. Zou, Q., Hu, Q., Guo, M., Wang, G.: HAlign: fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 31(15), 2475–2481 (2015). https://doi.org/10.1093/bioinformatics/btv177

Analysis of the Confidence in the Prediction of the Protein Folding by Artificial Intelligence

Paloma Tejera-Nevado1,2(B), Emilio Serrano1, Ana González-Herrero3, Rodrigo Bermejo-Moreno3, and Alejandro Rodríguez-González1,2

1 ETS Ingenieros Informáticos, Universidad Politécnica de Madrid, Madrid, Spain
{paloma.tejera,emilio.serrano,alejandro.rg}@upm.es
2 Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, Spain
3 Margarita Salas Center for Biological Research (CIB-CSIC), Spanish National Research Council, Madrid, Spain
[email protected], [email protected]

Abstract. The determination of protein structure has been facilitated using deep learning models, which can predict protein folding from protein sequences. In some cases, the predicted structure can be compared to the already-known distribution if there is information from classic methods such as nuclear magnetic resonance (NMR) spectroscopy, X-ray crystallography, or electron microscopy (EM). However, challenges arise when the proteins are not abundant, their structure is heterogeneous, and protein sample preparation is difficult. To determine the level of confidence that supports the prediction, different metrics are provided. These values are important in two ways: they offer information about the strength of the result and can supply an overall picture of the structure when different models are combined. This work provides an overview of the different deep-learning methods used to predict protein folding and the metrics that support their outputs. The confidence of the model is evaluated in detail using two proteins that contain four domains of unknown function. Keywords: Protein Structure Prediction · Machine Learning Metrics · Model Confidence

1 Introduction

Protein folding refers to the mechanism through which a polypeptide chain transforms into its biologically active protein in its 3D structure, and it has a significant impact on different applications, e.g., drug design, protein-protein interaction, and understanding the molecular mechanisms of some diseases.

There are classic methods to determine the structure of a protein, such as X-ray, NMR, and EM. These methods can be costly and time-consuming because they require significant resources and expertise. Protein folding is a complex process that is challenging for different reasons, including the large number of possible conformations, the crowded cellular environment, and the complex energy landscape required to reach the final structure.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 84–93, 2023. https://doi.org/10.1007/978-3-031-38079-2_9

The emergence of deep learning methods for predicting protein folding has revolutionized traditional biochemistry. These methods enable in silico predictions, followed by laboratory validation of the findings, offering a new approach to the field. The Critical Assessment of Protein Structure Prediction (CASP) experiments aim to evaluate the current state of the art in protein structure prediction, track the progress made so far, and identify areas where future efforts may be most productively focused. The bi-annual CASP meeting has shown that deep learning methods like AlphaFold, Rosetta, RoseTTAFold, and trRosetta are more effective than traditional approaches that explicitly model the folding process. AlphaFold2 was introduced as a new computational approach that can predict protein structures with near-experimental accuracy in most cases. The artificial intelligence system AlphaFold was submitted to the CASP14 competition as AlphaFold2, using a completely different model from the previous AlphaFold system in CASP13 [1]. A multimer version has been released, which allows scientists to predict protein complexes and detect protein-protein interactions. The use of AlphaFold-Multimer leads to improved accuracy in predicted multimeric interfaces compared to the input-adapted single-chain AlphaFold, while maintaining a high level of intra-chain accuracy [2]. To fully utilize these methods, researchers require access to powerful computing resources, which is why alternative platforms have been developed. ColabFold is one such platform that offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 [3]. Since early 2022, the Galaxy server has been able to run AlphaFold 2.0 as part of the tools offered for bioinformatics analysis.
DeepMind and EMBL's European Bioinformatics Institute (EMBL-EBI) collaborated to create AlphaFold DB, which offers open access to over 200 million protein structure predictions, with the goal of accelerating scientific research [1, 4]. CAMEO (Continuous Automated Model EvaluatiOn) is a community project developed by the Computational Structural Biology Group at the SIB Swiss Institute of Bioinformatics and the Biozentrum of the University of Basel. CAMEO is a service that continuously evaluates the accuracy of protein prediction servers based on known experimental structures released by the PDB. Users can submit several models for a target protein to CAMEO, and it will evaluate up to 5 models. Through CAMEO, Robetta, a protein prediction service, undergoes continual evaluation. While deep learning models can achieve impressive accuracy on a wide range of tasks, it is important to carefully evaluate and interpret their predictions to ensure they are reliable and useful. This involves not only selecting appropriate metrics, but also understanding the underlying assumptions and limitations of the model, and how these can impact its predictions. The paper is organized in the following order: Sect. 2 provides a description of the metrics and scores, while Sect. 3 describes the tools and resources used in this paper. In Sect. 4, the results obtained are presented, and they are discussed in Sect. 5. Finally, Sect. 6 presents the conclusions of this work and outlines future directions for research.


P. Tejera-Nevado et al.

2 Metrics and Scores

It is often difficult to understand why a deep learning model produces the outputs it does. Several common metrics exist for evaluating such models, and choosing the appropriate one depends on the application. Interpreting the predictions of a machine learning model is an iterative process, involving ongoing analysis and refinement of the model. By continually evaluating the model's output, identifying areas of weakness or uncertainty, and refining the model's parameters and architecture, researchers and practitioners can develop more accurate and reliable models for a wide range of applications.

The GDT (Global Distance Test) is a metric used to measure the similarity between two protein structures with identical amino acid sequences but different tertiary structures. It is primarily used to compare protein structure predictions to experimentally determined structures, and the GDT_TS score (the total score) is considered a more accurate measurement than the commonly used root-mean-square deviation (RMSD). To calculate the GDT score, the model structure is evaluated for the maximum set of alpha-carbon atoms of amino acid residues that fall within a specific distance cutoff of their corresponding positions in the experimental structure. The GDT score is a crucial assessment criterion in CASP, a large-scale experiment that evaluates current modelling techniques and identifies their primary deficiencies [5].

The LDDT (Local Distance Difference Test) is a score that measures the local distance differences between all atoms in a model, without requiring superposition, to evaluate the plausibility of the stereochemistry [6]. AlphaFold2 reports the predicted quality of a protein as a per-residue pLDDT score, which is used to assess intra-domain confidence [1]. The pLDDT score ranges from 0 to 100, with higher scores indicating higher-quality predictions.
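As an illustration, per-residue pLDDT values can be bucketed into the confidence bands commonly used when interpreting AlphaFold2 output; the band labels below follow common usage and are illustrative, not part of the original analysis:

```python
def plddt_band(plddt):
    """Map a per-residue pLDDT score (0-100) to a confidence band."""
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"

# Toy per-residue scores for a hypothetical model
print([plddt_band(s) for s in (95.2, 81.0, 60.5, 33.7)])
```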
Generally, residues with pLDDT scores above 90 are considered reliably predicted, scores between 70 and 90 are expected to be modelled well, scores between 50 and 70 have low confidence, and scores below 50 are of very low quality. The pLDDT score is the network's own per-residue estimate of the local agreement (lDDT-Cα) that the prediction would achieve when compared with an experimentally determined structure; it is computed without access to the true structure and reflects the expected reliability at each residue position. The Predicted Aligned Error (PAE) is a second metric used by the AlphaFold system to assess the quality of predicted structures. PAE estimates, for a pair of residues, the expected error in the position of one residue if the predicted and true structures were superposed on the other, using an optimal rotation and translation of the kind found by the Kabsch algorithm. PAE is therefore best used for judging between-domain or between-chain confidence. Unlike pLDDT, lower PAE values indicate better expected accuracy; residue pairs with high PAE are often indicative of regions of the protein that were difficult to predict or whose relative orientation deviates significantly from the true structure.

The template modelling score (TM-score) was developed for automated evaluation of protein structure template quality [7]. The predicted TM-score (pTM-score) takes into account the probability of each residue being resolved by weighting its contribution accordingly. The present study examined two proteins, each containing four domains of unknown function (DUF1935). ARM58 is an antimony resistance marker found in Leishmania species [8, 9], while ARM56 has orthologues in Trypanosoma spp. but does not confer antimony resistance [9, 10]. The protein sequences were input into various prediction methods, including trRosetta, trRosettaX-Single, RoseTTAFold (via Robetta), ColabFold, AlphaFold2 (Galaxy server), and the AlphaFold Protein Structure Database. By comparing the results from these different methods, the study aimed to determine the most effective methodology for applying deep learning techniques to protein structure prediction. Such techniques can serve as both routine and complementary tools, providing guidance and accelerating experimentation.

3 Material and Methods

ARM58 and ARM56 are encoded by the genes LINF_340007100 and LINF_340007000, respectively. The protein sequences, which contain four domains of unknown function (DUF1935), were downloaded in FASTA format from UniProt. Predicted AlphaFold structures for ARM58 and ARM56 were also downloaded from the AlphaFold Protein Structure Database, including the PDB file and the predicted aligned error. The Galaxy server was used to generate AlphaFold 2.0 predictions for both proteins. ColabFold v1.5.1 [3], which utilizes MMseqs2 and HHsearch for sequence alignments and templates, was also used. The protein structure prediction service Robetta was used to obtain structures with RoseTTAFold [11]. Additionally, the trRosetta server [12, 13] was used for protein structure prediction by transform-restrained Rosetta, together with trRosettaX-Single, which uses no homologous sequences or templates [14]. Finally, the relax_amber notebook from the GitHub repository (https://github.com/sokrypton/ColabFold) developed by sokrypton was used to relax the structures with AMBER, taking the models labelled as unrelaxed as input. Protein predictions were visualized using UCSF ChimeraX v1.5 [15].
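Sequences retrieved in FASTA format can be loaded with a minimal parser such as the sketch below; the record shown is a toy example, not the actual UniProt entries used in the study:

```python
def parse_fasta(text):
    """Minimal FASTA parser: returns {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records

# Toy record standing in for a downloaded UniProt FASTA entry
fasta = """>sp|EXAMPLE|ARM58 toy record
MKT
AILV
"""
print(parse_fasta(fasta))
```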

4 Results

ARM58 and ARM56 protein predictions generated with AlphaFold2 were downloaded from the AlphaFold Protein Structure Database. ARM58 and ARM56 contain 517 and 491 amino acids, respectively. AlphaFold2 was also run on the Galaxy server, and the two prediction outputs were compared in ChimeraX by superimposing the structures (Fig. 1 A, B). The PAE plot indicates the expected distance error at each residue position; the low-error squares correspond to the four domains of both proteins (Fig. 1 C, D).


Fig. 1. Protein structure prediction for ARM58 and ARM56. Best model from the AlphaFold DB coloured by per-residue model confidence (dark blue (100-90): high accuracy expected; light blue-yellow (90-70): expected to be modelled well; yellow-orange (70-50): low confidence; orange-red (50-0): may be disordered) and the best AlphaFold2 model obtained in Galaxy (brown and purple) for ARM58 (A) and ARM56 (B). Protein prediction structures are visualized using ChimeraX. Predicted aligned error showing the four domains for ARM58 (C) and ARM56 (D), from the AlphaFold Protein Structure Database.

Next, ColabFold with MMseqs2 was used to analyse the metrics and outputs. The tool was run three times with the default parameters, generating small variations in pLDDT and pTM (Table 1).

Analysis of the Confidence in the Prediction of the Protein Folding


Table 1. Average and standard deviation of the pLDDT and pTM metrics obtained in ColabFold using MMseqs2 for ARM58 and ARM56, running the protein structure predictions three times.

                 pLDDT                          pTM
         Average   Standard deviation   Average   Standard deviation
ARM58    80.433    0.577                0.410     0.003
ARM56    86.733    4.619                0.486     0.004
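The averages and standard deviations in Table 1 are simple run statistics. A sketch with hypothetical per-run pLDDT values, chosen here to be consistent with the ARM58 summary (the actual per-run values are not reported in the paper):

```python
from statistics import mean, stdev

# Hypothetical pLDDT values from three ColabFold runs for ARM58
runs = [80.1, 80.1, 81.1]

# Sample standard deviation (n-1), as commonly reported for repeated runs
print(round(mean(runs), 3), round(stdev(runs), 3))  # 80.433 0.577
```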

Out of the three runs, two predicted structures were selected for additional examination based on their differing pLDDT and pTM scores. The selected structures were visualized using ChimeraX and colour-coded by their pLDDT values (Fig. 2).

Fig. 2. Protein structure prediction for ARM58 (A) and ARM56 (B) using ColabFold (MMseqs2). Two resulting structures were coloured with the model confidence (dark blue: pLDDT > 90; light blue: 90 > pLDDT > 70; yellow: 70 > pLDDT > 50; orange: pLDDT < 50). Protein prediction structures were visualized using ChimeraX.

The trRosetta server, along with trRosettaX-Single, was used to predict the protein structures of ARM58 and ARM56. The models obtained using trRosetta have high confidence, with estimated TM-scores of 0.608 for ARM58 and 0.650 for ARM56. In contrast, the confidence of the models generated using trRosettaX-Single was low, with estimated TM-scores of 0.275 for ARM56 and 0.266 for ARM58. The predicted per-residue LDDT values for the best model reflected the lower per-residue confidence of the trRosettaX-Single models (Fig. 3).
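A rough interpretation of these estimated TM-scores can be sketched using the commonly cited guideline that values above about 0.5 indicate a broadly correct fold; the 0.3 lower band below is an arbitrary choice for this sketch, not a threshold from the paper:

```python
def tm_confidence(tm):
    """Coarse interpretation of an estimated TM-score."""
    if tm >= 0.5:
        return "likely correct fold"
    if tm >= 0.3:
        return "uncertain"
    return "likely incorrect"

scores = {"ARM58/trRosetta": 0.608, "ARM56/trRosetta": 0.650,
          "ARM58/trRosettaX-Single": 0.266, "ARM56/trRosettaX-Single": 0.275}
print({name: tm_confidence(tm) for name, tm in scores.items()})
```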


Fig. 3. Predicted per-residue LDDT for the best models of ARM58 and ARM56. Predicted LDDT from trRosetta (A) and trRosettaX-Single (C) for ARM58, and from trRosetta (B) and trRosettaX-Single (D) for ARM56.

Lastly, the structures of ARM56 and ARM58 were predicted using the modelling method RoseTTAFold via the Robetta service. The predicted Global Distance Test (GDT) confidence scores for the models were 0.79 for ARM56 and 0.76 for ARM58. The trRosetta and RoseTTAFold models differed slightly from those previously described; the trRosetta model was therefore compared to the most similar of the five models predicted with RoseTTAFold (Fig. 4).
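The GDT_TS score underlying this kind of confidence estimate averages, over the cutoffs 1, 2, 4 and 8 Å, the fraction of Cα atoms lying within each cutoff of their experimental positions. A toy computation, assuming one fixed superposition rather than the maximizing search used in practice:

```python
def gdt_ts(ca_distances):
    """GDT_TS (0-100) from per-residue CA-CA distances in angstroms,
    for one fixed superposition; real GDT maximizes over superpositions."""
    n = len(ca_distances)
    fractions = [sum(d <= cutoff for d in ca_distances) / n
                 for cutoff in (1.0, 2.0, 4.0, 8.0)]
    return 100.0 * sum(fractions) / 4.0

# Toy distances between matched CA atoms
print(round(gdt_ts([0.5, 1.5, 3.0, 7.0, 12.0]), 1))  # 50.0
```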


Fig. 4. Best models for ARM58 (A) and ARM56 (B) predicted using trRosetta (blue) and RoseTTAFold (orange). Protein prediction structures were visualized using ChimeraX.

5 Discussion

The uncertain positions of atoms in a protein can be related to flexible regions, because the flexibility or rigidity of a protein is determined by the relative positions of its constituent atoms. The flexibility of certain regions in a protein can be essential for its function, as it allows the conformational changes needed for the protein to interact with other molecules or carry out its biological role [16]. In the present work, the predicted models of ARM58 and ARM56 were able to define the beta sheets present in each domain (Fig. 1). However, the exact positions of the atoms between the four domains are not fully defined (Fig. 2). This could reflect genuine flexibility, or simply positional uncertainty. When extensive regions have pLDDT values below 50, they exhibit a strip-like visual pattern and ought to be considered a prediction of disorder rather than an indication of actual structure [17]. Protein prediction models may generate structures that deviate from those produced by traditional models. Explaining these models can be challenging; it is therefore important to interpret the metrics correctly. The accuracy of the models is reported through the pLDDT, PAE, or pTM scores, which need to be analysed by researchers. There are fast and reliable prediction methods, such as ColabFold [3], that provide accurate protein structures; however, variations in the metrics (Table 1) can lead to diverse outcomes. Some models can work better on a specific set of proteins: here, trRosetta and trRosettaX-Single show notable differences in the confidence given by the pLDDT score (Fig. 3). Each method also generates different metrics for evaluating confidence; for instance, RoseTTAFold, when used via Robetta, provides predicted GDT confidences. Distinct structure prediction systems generate varying models (Fig. 4), making modelling a challenging task.
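The "strip-like" low-pLDDT regions mentioned above can be flagged programmatically by scanning for long runs of low-confidence residues; in this sketch the cutoff and minimum run length are arbitrary choices, not values from the paper:

```python
def low_confidence_runs(plddt, cutoff=50.0, min_len=10):
    """Return (start, end) index pairs of consecutive residues below a
    pLDDT cutoff; long runs are candidates for predicted disorder."""
    runs, start = [], None
    for i, score in enumerate(plddt):
        if score < cutoff and start is None:
            start = i
        elif score >= cutoff and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(plddt) - start >= min_len:
        runs.append((start, len(plddt) - 1))
    return runs

# Toy profile: 20 confident residues, a 15-residue low strip, 20 confident
scores = [92.0] * 20 + [38.0] * 15 + [85.0] * 20
print(low_confidence_runs(scores))
```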
A combination of all the structures, evaluated through various metrics, can provide complementary information about the structures of proteins, especially in cases where no prior information is available from X-ray, NMR, or EM techniques. The proteins ARM58 and ARM56 each consist of four domains whose functions are unidentified. The mechanism by which ARM58 overexpression confers antimony resistance in Leishmania spp. is unknown. Here, different protein prediction methods have been used to compare the predicted structures of ARM58 and ARM56. There are


differences between models, and each model can generate variations from run to run. Additionally, alternative installations could also produce different outputs. By accurately simulating a variety of transient protein complexes, end-to-end deep learning underscores opportunities for future enhancements that could enable dependable modelling of any protein-protein interaction researchers want to explore [18]. Factors such as amino acid sequence composition or protein function could provide more information. In addition to predicting protein structures, AlphaFold2 also provides insights into the flexibility of residues, or protein dynamics, that are encoded in these structures [19]. The diverse outputs obtained from different protein prediction models suggest that discrepancies between the models could reveal valuable information about the flexibility of the proteins. Careful observation of these variations is necessary to determine their significance and relevance for understanding protein structure and function.

6 Conclusions and Future Work

Protein regions that lack a specific structure make it difficult to determine the exact positions of atoms. Consequently, variations in these positions could result in different possible structures, reflecting the protein's flexibility. Furthermore, protein folding is a critical process in the development of new drugs, as the three-dimensional structure of a protein determines its function and interactions with other molecules. It is therefore important to determine which prediction provides more information about the structure, in order to gain insight into different functional states or biologically relevant features. There is a need to understand how deep learning models generate different outputs. When complementary material from classic prediction methods or experimental assays is available, additional information can be obtained. Frequently, however, no prior investigation exists, and it is necessary to rely on the metrics and understand the predictions, especially when further experiments will be performed. Future work may benefit from enhancing the analysis by incorporating a larger and more representative set of proteins. Furthermore, the consideration of additional protein prediction tools, such as ESMFold, might provide valuable insights and complement the existing methodology employed in this study.

Acknowledgments. This work is a result of the project "Data-driven drug repositioning applying graph neural networks (3DR-GNN)", developed under grant PID2021-122659OB-I00 from the Spanish Ministerio de Ciencia e Innovación. This work was funded partially by the Knowledge Spaces project (Grant PID2020-118274RB-I00 funded by MCIN/AEI/10.13039/501100011033).

References

1. Jumper, J., et al.: Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021)
2. Evans, R., et al.: Protein complex prediction with AlphaFold-Multimer. bioRxiv (2021)
3. Mirdita, M., et al.: ColabFold: making protein folding accessible to all. Nat. Methods 19(6), 679–682 (2022)


4. Varadi, M., et al.: AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50(D1), D439–D444 (2022)
5. Zemla, A., et al.: Processing and evaluation of predictions in CASP4. Proteins: Struct. Funct. Bioinformat. 45(S5), 13–21 (2001)
6. Mariani, V., et al.: lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29(21), 2722–2728 (2013)
7. Zhang, Y., Skolnick, J.: Scoring function for automated assessment of protein structure template quality. Proteins Struct. Funct. Bioinformat. 57(4), 702–710 (2004)
8. Nühs, A., et al.: A novel marker, ARM58, confers antimony resistance to Leishmania spp. Int. J. Parasitol. Drugs Drug Resist. 4(1), 37–47 (2014)
9. Schäfer, C., et al.: Reduced antimony accumulation in ARM58-overexpressing Leishmania infantum. Antimicrob. Agents Chemother. 58(3), 1565–1574 (2014)
10. Tejera Nevado, P., et al.: A telomeric cluster of antimony resistance genes on chromosome 34 of Leishmania infantum. Antimicrob. Agents Chemother. 60(9), 5262–5275 (2016)
11. Baek, M., et al.: Accurate prediction of protein structures and interactions using a three-track neural network. Science 373(6557), 871–876 (2021)
12. Du, Z., et al.: The trRosetta server for fast and accurate protein structure prediction. Nat. Protoc. 16(12), 5634–5651 (2021)
13. Su, S., et al.: Improved protein structure prediction using a new multi-scale network and homologous templates. Adv. Sci. (Weinheim) 8(24), e2102592 (2021)
14. Wang, W., et al.: Single-sequence protein structure prediction using supervised transformer protein language models. Nat. Comput. Sci. 2(12), 804–814 (2022)
15. Pettersen, E.F., et al.: UCSF ChimeraX: structure visualization for researchers, educators, and developers. Protein Sci. 30(1), 70–82 (2021)
16. Goh, C.-S., et al.: Conformational changes associated with protein-protein interactions. Curr. Opin. Struct. Biol. 14(1), 104–109 (2004)
17. Tunyasuvunakool, K., et al.: Highly accurate protein structure prediction for the human proteome. Nature 596(7873), 590–596 (2021)
18. Yin, R., et al.: Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants. Protein Sci. 31(8), e4379 (2022)
19. Guo, H.-B., et al.: AlphaFold2 models indicate that protein sequence determines both structure and dynamics. Sci. Rep. 12(1), 10696 (2022)

Doctoral Consortium

Neoantigen Detection Using Transformers and Transfer Learning in the Cancer Immunology Context

Vicente Enrique Machaca Arceda(B)
Universidad La Salle, Arequipa, Peru
[email protected]

Abstract. Neoantigen detection is the most critical step in developing personalized vaccines in cancer immunology. However, neoantigen detection depends on correct pMHC binding and presentation prediction. Furthermore, transformers and transfer learning have had a high impact on NLP tasks. Since amino acids and proteins resemble words and sentences, the pMHC binding and presentation prediction problem can be considered an NLP task. Thus, this work proposes using a BERT architecture pre-trained on 250 million proteins (ESM-1b), followed by a BiLSTM in cascade. Our preliminary results, evaluating a small BERT model (TAPE), achieved an AUC of 0.80 on the netMHCpanII3.2 dataset.

Keywords: Transformers · transfer learning · neoantigen · peptide · MHC · binding · presentation · epitope

1 Introduction

Cancer represents the world's biggest health problem and is a leading cause of death, with nearly 10 million deaths reported in 2020. Hence the development of cancer immunotherapy, which aims to stimulate a patient's immune system [1]; within this topic, research based on neoantigen detection has emerged. There are three treatments based on the representation and expression of neoantigens: personalized vaccines, adoptive T-cell therapies, and immune checkpoint inhibitors. Of these, the development of personalized vaccines is considered the one with the highest likelihood of success [1]. Neoantigens are tumor-specific mutated peptides and are considered the leading causes of an immune response [1–3]. The goal is to train a patient's lymphocytes (T cells) to recognize the neoantigens and activate the immune system [4, 5]. The life cycle of a neoantigen in nucleated cells can be summarized as follows. First, a protein is degraded into peptides (candidate neoantigens) in the cytoplasm. Next, a peptide is attached to the Major Histocompatibility Complex (MHC), known as peptide-MHC (pMHC) binding. Then, this compound follows a pathway until it reaches the cell membrane (pMHC presentation). Finally, the pMHC is recognized by a T-cell Receptor (TCR), triggering the immune response. In this context, this project focuses on pMHC binding and presentation prediction using transformer models and transfer learning.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, pp. 97–102, 2023. https://doi.org/10.1007/978-3-031-38079-2_10

2 Problem Statement

Despite the efforts of researchers, less than 5% of detected neoantigens succeed in activating the immune system [4], because most predicted pMHC bindings do not reach the cell membrane (pMHC presentation). In this context, this project focuses on the problem of pMHC presentation. This is a binary classification problem with a peptide and an MHC as inputs. A peptide can be represented as p = {A, ..., Q} and an MHC as q = {A, N, ..., Q, E, G}. Finally, we need to know the probability of affinity between p and q (pMHC): if it is high enough, the peptide p may bind to q and reach the cell membrane.
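The classification setting above can be sketched as labelled (peptide, MHC) pairs with integer-encoded amino acids; the sequences and label below are toy values, not data from any of the cited datasets:

```python
# The 20 standard amino acids, mapped to integer indices
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Integer-encode an amino acid sequence for a learning model."""
    return [AA_INDEX[aa] for aa in seq]

# Toy training pair: (peptide p, MHC q) -> binding/presentation label
p, q, label = "AQWNRPQL", "ANQEG", 1
sample = (encode(p), encode(q), label)
print(sample)
```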

3 Related Work

The methods proposed can be categorized as allele-specific or pan-specific. Allele-specific methods train a model for each allele, while pan-specific methods train a global model. NetMHCpan4.1 [6] is a pan-specific method considered a baseline for pMHC-I prediction. This method uses Artificial Neural Networks (ANNs). It improved on its previous versions by increasing the training dataset to 13,245,212 data points covering 250 distinct MHC-I molecules; additionally, the model was updated from NNAlign to NNAlign_MA [7]. MHCflurry2.0 [8] is another state-of-the-art method; it uses a pan-allele binding affinity predictor and an allele-independent antigen presentation predictor, together with Mass Spectrometry (MS) data; in experiments, MHCflurry2.0 outperformed NetMHCpan4.0. Regarding pan-specific pMHC-II prediction, NetMHCIIpan4.0 [6] used motif deconvolution and MS eluted ligand data with 4,086,230 data points covering a total of 116 distinct MHC-II molecules. On the other hand, NetMHC4.0 [9] is allele-specific; it updated its previous versions by padding amino acids and using ANNs.

Transformers are considered a revolution in artificial intelligence and have been applied successfully to several Natural Language Processing (NLP) tasks [10]. Moreover, these models have been used in neoantigen detection, focusing on pMHC binding/presentation prediction. For instance, BERTMHC [11] is a pan-specific pMHC-II binding/presentation prediction method; it used a BERT architecture and transfer learning from Tasks Assessing Protein Embeddings (TAPE) [12]. The authors stack an average pooling layer followed by a Fully Connected (FC) layer after the TAPE model. In experiments, BERTMHC outperformed NetMHCIIpan3.2 and PUFFIN. ImmunoBERT [13] also used transfer learning from TAPE; however, its authors focus on pMHC-I prediction, stacking a classifier on the classification token's vector after the TAPE model. Additionally, MHCRoBERTa [14] and HLAB [15] used transfer learning too.
MHCRoBERTa used self-supervised pre-training on the UniProtKB and Swiss-Prot databases, followed by fine-tuning with data from the IEDB [16] dataset; it outperformed NetMHCpan4.0 and MHCflurry2.0 in SRCC. On the other hand, HLAB [15] used transfer learning from ProtBert-BFD [17] with a BiLSTM model in cascade. Moreover, on the HLA-A*01:01 allele,


HLAB slightly outperformed state-of-the-art methods, including NetMHCpan4.1, by at least 0.0230 in AUC and 0.0560 in accuracy. Finally, TransPHLA [18] is an allele-specific method that applies self-attention to peptides. Its authors developed AOMP, which takes a pMHC binding as input and returns mutant peptides with higher affinity for the MHC allele. TransPHLA outperformed state-of-the-art methods, including NetMHCpan4.1; it is effective for any peptide and MHC length and is faster at making predictions. Moreover, the allele-specific DapNet-HLA [19] obtained interesting results: it used an additional dataset (Swiss-Prot) for negative samples and combined the advantages of CNNs, SENet (for pooling), and LSTMs. The proposal achieved high scores; however, the method was not compared against state-of-the-art methods.

4 Hypothesis

A model based on transfer learning from ESM-1b, a BERT architecture with 650 million parameters, with a BiLSTM model in cascade, could predict pMHC-I and pMHC-II binding and presentation.

5 Proposal

In this project, we propose using a BERT architecture with transfer learning. We analyzed alternatives such as TAPE [12], ProtBERT-BFD [17], and ESM-1b [20], with 92M, 420M, and 650M parameters, respectively. TAPE was trained on 30 million proteins, ProtBERT-BFD on 2,122 million, and ESM-1b on 250 million. Additionally, ESM-1b achieved better contact precision than TAPE and ProtBERT-BFD [20]. Moreover, HLAB [15] proposed the use of ProtBERT-BFD [17] with a BiLSTM model in cascade and outperformed NetMHCpan4.1 (a state-of-the-art method) on the HLA-A*01:01 allele. Therefore, in this project we propose using the pre-trained model ESM-1b [20] with a BiLSTM model in cascade, like HLAB [15]. For fine-tuning, we will use datasets from NetMHCpan4.1 and NetMHCIIpan4.0. In summary, Fig. 1 presents the proposal: first, a peptide and the MHC pseudo-sequence are concatenated and padded; secondly, we use the transformer model ESM-1b to obtain a sequence embedding; finally, we use a BiLSTM to predict pMHC binding and presentation.
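The concatenation and padding step described above can be sketched as follows; the pseudo-sequence, the maximum length, and the pad character are illustrative assumptions, not the exact preprocessing of the proposed pipeline (which then feeds the result to ESM-1b):

```python
def prepare_input(peptide, mhc_pseudo, max_len=60, pad_char="X"):
    """Concatenate a peptide with the MHC pseudo-sequence and pad to a
    fixed length, mirroring the preprocessing step of the proposal."""
    seq = peptide + mhc_pseudo
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    return seq + pad_char * (max_len - len(seq))

# Toy 34-residue MHC pseudo-sequence (hypothetical, not a real allele)
pseudo = "YFAMYQENMAHTDANTLYIIYRDYTWVARVYRGY"
print(prepare_input("SIINFEKL", pseudo))
```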

6 Preliminary Results

We evaluated the performance of BERT models with transfer learning for pMHC-II binding prediction. We chose BERTMHC [11], which uses the pre-trained TAPE model, because it is smaller than ESM-1b. We developed four models: LINEAR, which uses a linear layer in cascade after TAPE; LINEAR-pad, which pads sequences with the letter X and replaces the amino


Fig. 1. Proposal: A peptide and the MHC are concatenated and padded; then, we use the transformer model ESM-1b followed by BiLSTM to predict pMHC binding and presentation.

acids {U, Z, O, B, J} by X, following the HLAB proposal [15]; RNN, which uses a BiLSTM layer in cascade; and RNN-ATT, which uses a BiLSTM layer with an attention mechanism (RNN and RNN-ATT are based on the HLAB proposal [15]). Additionally, we used the same hyperparameters as BERTMHC and trained the models on the NetMHCIIpan3.2 dataset [21]. According to the experiments, LINEAR and RNN-ATT obtained the best results. Figure 2a presents the AUC and ROC curve, while Fig. 2b presents accuracy, precision, recall, and F-score. All metrics used in the comparison are given in Table 1.
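The metrics reported in Table 1 and Fig. 2 can be computed from binary labels as in the sketch below; the labels are toy values, not the experiment's data:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-score from 0/1 labels."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

# Toy ground-truth labels and model predictions
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```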


Fig. 2. ROC curve and metrics comparison between models evaluated.

7 Reflections

NetMHCpan4.1 [6] is considered the state-of-the-art pan-specific method. However, HLAB [15] and TransPHLA [18] slightly outperformed NetMHCpan4.1. For that reason, we proposed using transfer learning similar to HLAB; however,


Table 1. Metrics comparison between the models evaluated.

Model        AUC     Accuracy  Precision  Recall   F-score
LINEAR       0.8070  0.8012    0.8005     0.8005   0.8009
LINEAR-pad   0.7147  0.7375    0.7192     0.7353   0.7147
RNN          0.8023  0.7972    0.7932     0.7932   0.7949
RNN-ATT      0.8086  0.8082    0.7937     0.7937   0.7985

instead of using ProtBert-BFD, we will use ESM-1b, which achieves better contact precision [20]. We evaluated BERTMHC, which uses the TAPE pre-trained model; we chose it because TAPE is smaller than ESM-1b and delivers faster results. According to the experiments, the LINEAR model outperformed LINEAR-pad on all metrics, so padding and replacing the amino acids {U, Z, O, B, J} by X drastically reduced performance. Additionally, we evaluated the RNN and RNN-ATT models. RNN-ATT obtained the best results in accuracy and AUC, while the LINEAR model obtained the best results in precision, recall, and F-score. The differences in outcomes could be caused by the small dataset used in training. In future experiments, we will evaluate the ESM-1b, ProtBert-BFD, and ESM2 pre-trained models, fine-tuned on the NetMHCpan4.1 dataset.
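The amino-acid replacement step whose effect is discussed above can be sketched as a small preprocessing function; this is an illustration of the HLAB-style normalization, not the exact implementation used in the experiments:

```python
def normalize_sequence(seq, rare="UZOBJ"):
    """Replace non-standard amino-acid codes {U, Z, O, B, J} with X."""
    return "".join("X" if aa in rare else aa for aa in seq.upper())

# Toy sequence containing the rare codes U, Z and B
print(normalize_sequence("MKUZTAILB"))
```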

References

1. Borden, E.S., Buetow, K.H., Wilson, M.A., Hastings, K.T.: Cancer neoantigens: challenges and future directions for prediction, prioritization, and validation. Front. Oncol. 12 (2022)
2. Chen, I., Chen, M., Goedegebuure, P., Gillanders, W.: Challenges targeting cancer neoantigens in 2021: a systematic literature review. Expert Rev. Vaccines 20(7), 827–837 (2021)
3. Gopanenko, A.V., Kosobokova, E.N., Kosorukov, V.S.: Main strategies for the identification of neoantigens. Cancers 12(10), 2879 (2020)
4. Mattos, L., et al.: Neoantigen prediction and computational perspectives towards clinical benefit: recommendations from the ESMO precision medicine working group. Ann. Oncol. 31(8), 978–990 (2020)
5. Peng, M., et al.: Neoantigen vaccine: an emerging tumor immunotherapy. Mol. Cancer 18(1), 1–14 (2019)
6. Reynisson, B., Alvarez, B., Paul, S., Peters, B., Nielsen, M.: NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res. 48(W1), W449–W454 (2020)
7. Alvarez, B., et al.: NNAlign_MA: MHC peptidome deconvolution for accurate MHC binding motif characterization and improved T-cell epitope predictions. Mol. Cell. Proteomics 18(12), 2459–2477 (2019)
8. O'Donnell, T.J., Rubinsteyn, A., Laserson, U.: MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst. 11(1), 42–48 (2020)
9. Andreatta, M., Nielsen, M.: Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 32(4), 511–517 (2016)
10. Patwardhan, N., Marrone, S., Sansone, C.: Transformers in the real world: a survey on NLP applications. Information 14(4), 242 (2023)
11. Cheng, J., Bendjama, K., Rittner, K., Malone, B.: BERTMHC: improved MHC-peptide class II interaction prediction with transformer and multiple instance learning. Bioinformatics 37(22), 4172–4179 (2021)
12. Rao, R., et al.: Evaluating protein transfer learning with TAPE. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
13. Gasser, H.-C., Bedran, G., Ren, B., Goodlett, D., Alfaro, J., Rajan, A.: Interpreting BERT architecture predictions for peptide presentation by MHC class I proteins. arXiv preprint arXiv:2111.07137 (2021)
14. Wang, F., et al.: MHCRoBERTa: pan-specific peptide-MHC class I binding prediction through transfer learning with label-agnostic protein sequences. Briefings Bioinf. 23(3), bbab595 (2022)
15. Zhang, Y., et al.: HLAB: learning the BiLSTM features from the ProtBert-encoded proteins for the class I HLA-peptide binding prediction. Briefings Bioinf. (2022)
16. Vita, R., et al.: The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res. 47(D1), D339–D343 (2018)
17. Elnaggar, A., et al.: ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 7112–7127 (2021)
18. Chu, Y., et al.: A transformer-based model to predict peptide-HLA class I binding and optimize mutated peptides for vaccine design. Nat. Mach. Intell. 4(3), 300–311 (2022)
19. Jing, Y., Zhang, S., Wang, H.: DapNet-HLA: adaptive dual-attention mechanism network based on deep learning to predict non-classical HLA binding sites. Anal. Biochem. 666, 115075 (2023)
20. Rives, A., et al.: Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. 118(15) (2021)
21. Jensen, K.K., et al.: Improved methods for predicting peptide binding affinity to MHC class II molecules. Immunology 154(3), 394–406 (2018)

Author Index

A
Arceda, Vicente Enrique Machaca 97
Arrais, Joel P. 3

B
Balsa-Canto, Eva 44
Bermejo-Moreno, Rodrigo 84

C
Carvalho, Lucas 44
Chofsoh, Zanuba Hilla Qudrotu 73
Cruz, Maria 14

D
Duque, Pedro 24

E
Egas, Conceição 3

G
González-Herrero, Ana 84
Goyzueta, Valeria 14

H
Henriques, David 44

I
Iqbal, Mohammad 73

L
López-Fernández, Hugo 24

M
Machaca, Vicente Enrique 97
Martins, Daniel 3
Mukhlash, Imam 73

O
Oudah, Mai 53, 62

P
Pereira, Vítor 44
Pinto, Miguel 24
Polańska, Joanna 34

R
Reboiro-Jato, Miguel 24
Rocha, Miguel 44
Rodríguez-González, Alejandro 84

S
Sanjoyo, Bandung Arry 73
Serrano, Emilio 84
Sieradzka, Katarzyna 34

T
Tejera-Nevado, Paloma 84
Troitiño-Jordedo, Diego 44
Tupac, Yvan 14

V
Velasquez, Pedro 62
Vieira, Cristina P. 24
Vieira, Jorge 24

Y
Yong, Hyerim 53

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. M. Rocha et al. (Eds.): PACBB 2023, LNNS 743, p. 103, 2023. https://doi.org/10.1007/978-3-031-38079-2