Machine Learning Methods for Multi-Omics Data Integration 3031365011, 9783031365010

The advancement of biomedical engineering has enabled the generation of multi-omics data by developing high-throughput t


English Pages 174 [171] Year 2023


Table of contents :
Contents
Introduction to Multiomics Technology
1 Genomics
2 Transcriptomics
3 Proteomics
4 Foodomics
5 Metabolomics
6 Epigenomics
7 Summary
References
Machine Learning from Multi-omics: Applications and Data Integration
1 Introduction
2 Multi-omics as Cancer Indicators
3 Multi-omics Epigenetic Alterations of Alzheimer's
4 Multi-omics Applications in Mental Health and Psychiatric Disorders
5 Cardiovascular Disease
6 Machine Learning Applications to Multi-omics Data
7 Data Integration Strategies
References
Machine Learning Approaches for Multi-omics Data Integration in Medicine
1 Introduction
2 Main Objectives of Multi-omic Data Integration Studies
2.1 Diagnosis and Prognosis
2.2 Identification of the Subtype
2.3 Discovering Molecular Patterns of Disease
2.4 Predicting the Effects of a Drug at the Molecular Level
2.5 Comprehension of the Regulatory Processes
3 Multi-omics Integration Strategies
3.1 Early Integration
3.2 Mixed Integration
3.3 Intermediate Integration
3.4 Late Integration
3.5 Hierarchical Integration
4 Machine Learning Approaches Used in Multiomics Integration
4.1 Data Integration Analysis for Biomarker Discovery Using Latent Components (DIABLO)
4.2 Multi-omics Factor Analysis (MOFA)
4.3 Sparse Canonical Correlation Analysis (sCCA)
4.4 Multi-omics Late Integration (MOLI)
4.5 Cancer Drug Response Prediction Using a Recommender System (CaDRReS)
4.6 Heterogeneous Network-Based Method for Drug Response Prediction (HNMDRP)
4.7 Multiple Pairwise Kernels for Drug Bioactivity Prediction (pairwiseMKL)
4.8 iCluster, iClusterPlus, and iClusterBayes
4.9 moCluster
4.10 Similarity Network Fusion (SNF)
4.11 NEighborhood Based Multi-omics Clustering (NEMO)
4.12 Random Walk with Restart for Multi-dimensional Data Fusion (RWRF) and Random Walk with Restart and Neighbor Information-Based Multi-dimensional Data Fusion (RWRNF)
5 Conclusion
References
Multimodal Methods for Knowledge Discovery from Bulk and Single-Cell Multi-Omics Data
1 Introduction
2 Description of Various Omics Datasets
2.1 ChIP-seq
2.2 ATAC-seq
2.3 Hi-C
2.4 Mass Spectrometry for Proteomics
2.5 Single-Cell Multi-Omic Profiling
3 Multimodal Methods for Dimensionality Reduction and Clustering
3.1 Non-negative Matrix Factorization
3.2 Tensor Decomposition
3.3 Multi-View Relational Learning
3.4 Canonical Correlation Analysis
3.5 Deep Learning Methods for Multimodal Dimension Reduction and Clustering
3.6 Evaluating and Visualizing Single-Cell Embeddings
4 Multimodal Methods for Inferring Gene Regulatory Networks from Bulk and Single-Cell Omics Data
4.1 Multiple Regression
4.2 Correlation and Mutual Information
4.3 Ordinary Differential Equation
5 Multi-Modal Network Inference of Gene Regulations
5.1 Bayesian Network Inference
5.2 Static Boolean Regulatory Network Inference
5.3 Dynamic Regulatory Network Inference
6 Multimodal Methods for Biomarker Identification
6.1 Ensemble Learning Based Multi-Omic Biomarker Identification
6.2 Deep Neural Network Based Multi-Omic Biomarker Identification
7 Closing Remarks and Perspectives
References
Negative Sample Selection for miRNA-Disease Association Prediction Models
1 Introduction
2 Methods
2.1 Obtain the Feature Representations for Each miRNA-Disease Sample
2.2 Train the Deep Autoencoder with All the Verified Samples
2.3 Sort All the Unknown Samples by the Deep Autoencoder
3 Result
3.1 Database
3.2 Evaluation Methods
3.3 Experiment Setting and Overfitting Analyzing
3.4 Reconstruct Error Data Analysis on Well Trained DAE-N Model
3.5 Compared Methods
3.6 Effectiveness of DAE-N on Cross-Validation
3.7 Effectiveness of DAE-N on Independent Dataset Evaluation
4 Conclusion
References
Prediction and Analysis of Key Genes in Prostate Cancer via MRMR Enhanced Similarity Preserving Criteria and Pathway Enrichment Methods
Acronyms
1 Introduction
2 Literature Review
2.1 Feature Selection Methods
2.1.1 Fisher Score
2.1.2 Laplace Score
2.1.3 ReliefF Criteria
2.1.4 Unified Framework for Similarity Based Methods
2.1.5 MRMR
2.2 Description of Classifiers
2.3 Pathway Enrichment Analysis
3 Methods
3.1 Data Source and Data Type
3.2 Data Preparation
3.3 Experiment Design
3.4 Feature Selection
3.4.1 The Problem
3.4.2 The Algorithm
3.4.3 Classification
3.5 Measures for Performance Evaluation
3.6 Enrichment Analysis of Key Pathways and Core Genes
4 Results and Discussion
4.1 Identification of Key Genes Related to PCa
4.2 GO and KEGG Pathway Enrichment Analyses
4.3 Discussion
5 Conclusions
References
Graph-Based Machine Learning Approaches for Pangenomics
1 Introduction
2 Methods
2.1 Frequented Regions
2.2 Data
2.3 Genome-Wide Association Study
2.4 Machine Learning Models
2.5 Experimental Setup
3 Results
3.1 Phenotypic Prediction
3.2 FRs and Annotations
4 Conclusion
References
Multiomics-Based Tensor Decomposition for Characterizing Breast Cancer Heterogeneity
1 Breast Cancer Inter-Tumor Heterogeneity
1.1 Morphological and Histopathologic Heterogeneity
1.2 Biomarker Heterogeneity
1.3 Genetic Heterogeneity and Breast Cancer Subtyping Schemes
2 Breast Cancer Multiomics Data
2.1 Genomic Level: CNVs
2.2 Transcriptomic Level: Gene Expression
2.3 Epigenomic Level: DNA Methylation
3 Tensor-Based Multiomics Integration and Factorization
3.1 Tensor
3.2 Tensor Decomposition Algorithms
4 Applications
4.1 Breast Cancer Subtyping
4.2 Survival Prediction
4.3 Gene Set Enrichment Analysis
5 Conclusions
References
Multi-Omics Databases
Acronyms
1 Introduction
2 Literature Review
3 Multi-Omics Data Resources
3.1 Data Repositories
3.2 BioBanks
4 Multi-Omics Databases and Tools
5 Multi-Omics Main Technologies
6 Fields of Multi-Omics Technologies
7 Conclusion
References
Index


Abedalrhman Alkhateeb • Luis Rueda, Editors

Machine Learning Methods for Multi-Omics Data Integration

Editors Abedalrhman Alkhateeb Software Engineering Department King Hussein Faculty of Computing Sciences, Princess Sumaya University for Technology Amman, Jordan

Luis Rueda School of Computer Science University of Windsor Windsor, ON, Canada

ISBN 978-3-031-36501-0 ISBN 978-3-031-36502-7 (eBook) https://doi.org/10.1007/978-3-031-36502-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Contents

Introduction to Multiomics Technology
Ahmed HajYasien

Machine Learning from Multi-omics: Applications and Data Integration
Ammar El-Hassan

Machine Learning Approaches for Multi-omics Data Integration in Medicine
Fatma Hilal Yagin

Multimodal Methods for Knowledge Discovery from Bulk and Single-Cell Multi-Omics Data
Yue Li, Gregory Fonseca, and Jun Ding

Negative Sample Selection for miRNA-Disease Association Prediction Models
Yulian Ding, Fei Wang, Yuchen Zhang, and Fang-Xiang Wu

Prediction and Analysis of Key Genes in Prostate Cancer via MRMR Enhanced Similarity Preserving Criteria and Pathway Enrichment Methods
Robert Benjamin Eshun, Hugette Naa Ayele Aryee, Marwan U. Bikdash, and A. K. M. Kamrul Islam

Graph-Based Machine Learning Approaches for Pangenomics
Indika Kahanda, Joann Mudge, Buwani Manuweera, Thiruvarangan Ramaraj, Alan Cleary, and Brendan Mumey

Multiomics-Based Tensor Decomposition for Characterizing Breast Cancer Heterogeneity
Qian Liu, Shujun Huang, Zhongyuan Zhang, Ted M. Lakowski, Wei Xu, and Pingzhao Hu

Multi-Omics Databases
Hania AlOmari, Abedalrhman Alkhateeb, and Bassam Hammo

Index

Introduction to Multiomics Technology

Ahmed HajYasien

Multiomics is a biological analysis approach that combines data from multiple omics data sets. The word omics refers to the different types of omes, including the genome, epigenome, proteome, transcriptome, metabolome, and microbiome. The objective of the omics sciences is to extract meaningful knowledge from large-scale, high-dimensional data by identifying, classifying, and quantifying all biological molecules involved in the structure, function, and dynamics of a cell, tissue, or organism (Vailati-Riboni et al., 2017), as shown in Fig. 1. By putting these “omes” together, scientists can study complex biological data and discover novel associations between biological entities. Such an analysis combines different omics data to elucidate a coherent geno-pheno-envirotype relationship (Tarazona et al., 2018).

1 Genomics

Frederick Sanger (1918–2013) was a British biochemist who is considered one of the most important scientists of the twentieth century. He is best known for his work on the structure of proteins and nucleic acids, and for developing methods for DNA sequencing. Sanger was born in Gloucestershire, England, in 1918. He studied natural sciences at the University of Cambridge, where he earned a PhD in biochemistry in 1943. After completing his studies, he remained at Cambridge, where he began his research on proteins.

A. HajYasien (X) Computer Science Department, Higher Colleges of Technology, Ras Alkhaimah, United Arab Emirates e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_1


Fig. 1 The objective of omics

In the 1950s, Sanger developed a method for determining the sequence of amino acids in a protein, which came to be known as the Sanger method (Sanger & Thompson, 1953). This method involved breaking the protein down into smaller fragments and then determining the sequence of amino acids in each fragment. He used it to determine the structure of insulin, a hormone that regulates blood sugar levels, for which he was awarded the Nobel Prize in Chemistry in 1958. In the 1970s, Sanger turned his attention to nucleic acids, specifically DNA sequencing. He developed a method for determining the sequence of nucleotides in DNA that uses chain-terminating dideoxynucleotides to stop the synthesis of the DNA strand at specific points. This method, known as Sanger sequencing, revolutionized the field of genomics and played a key role in the Human Genome Project. Sanger was awarded the Nobel Prize in Chemistry for the second time in 1980 for his work on DNA sequencing. He continued to work on nucleic acids and received numerous other prizes and honors for his contributions to science.

Genomics focuses on exploring entire genomes rather than single genes. It helps map out the genetic basis of severe diseases. Advances in biomedical instrumentation have led to cost-efficient, high-throughput technologies capable of reading the structures of biological molecules. After all, studying only a single layer of information from each cell cannot yield a comprehensive picture. Several analytical techniques can be used to study genomics. Some of the most commonly used include:

1. DNA sequencing: This technique is used to determine the sequence of nucleotides in DNA. There are several different methods of DNA sequencing, including Sanger sequencing, next-generation sequencing (NGS) (Anderson & Schrijver, 2010), and single-molecule sequencing (Pushkarev et al., 2009). DNA sequencing can be used to identify genetic variations, mutations, and other genomic features.
2. Microarray analysis: This technique uses a microarray chip containing thousands of DNA probes to measure the expression levels of genes in a particular sample. Microarray analysis can be used to study gene expression patterns and identify genes that are differentially expressed under different conditions (Slonim & Yanai, 2009).
3. Polymerase chain reaction (PCR): This technique is used to amplify a specific DNA sequence. PCR can be used to study gene expression, detect genetic mutations, and identify DNA sequences in complex mixtures (Bartlett & Stirling, 2003).
4. CRISPR-Cas9 genome editing: This is a powerful tool for modifying the DNA sequence of a particular gene. CRISPR-Cas9 can be used to study gene function and identify genes that are essential for specific biological processes (Savić & Schwank, 2016).
5. Bioinformatics: This is the use of computational methods to analyze large genomic datasets. Bioinformatics techniques can be used to identify genes, analyze gene expression data, and study genomic variation (Baxevanis et al., 2020).
6. Functional genomics: This field uses high-throughput screening methods (Hertzberg & Pope, 2000) to identify the function of genes and other genomic elements. Functional genomics can be used to study gene regulation, protein-protein interactions, and other biological processes.
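The chain-termination idea behind Sanger sequencing, described earlier in this section, can be illustrated with a small simulation. The sketch below is a toy model rather than a laboratory protocol: it assumes that exactly one terminated fragment is observed for every position of the synthesized strand, and the function names `chain_terminated_fragments` and `read_sequence` are hypothetical names chosen for this illustration.

```python
def chain_terminated_fragments(synthesized_strand):
    """Return (length, terminal_base) for every possible termination point.

    Toy assumption: a chain-terminating dideoxynucleotide is incorporated
    at every position at least once, so one fragment of each length
    1..N is present in the reaction mixture.
    """
    return [(i + 1, base) for i, base in enumerate(synthesized_strand)]


def read_sequence(fragments):
    """Reconstruct the strand by ordering fragments from shortest to longest.

    Sorting by length stands in for size separation: the terminal base of
    each successively longer fragment gives the next nucleotide.
    """
    return "".join(base for _, base in sorted(fragments))


strand = "ATGCGTAC"  # hypothetical example sequence
fragments = chain_terminated_fragments(strand)

# The fragments are passed in reverse order to show that only their
# lengths, not the order in which they are supplied, determine the read.
recovered = read_sequence(reversed(fragments))
print(recovered)
```

The sketch ignores everything that makes real sequencing hard, such as uneven termination, signal noise, and limited read length. It only shows why ordering terminated fragments by size is enough to read off the sequence.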

2 Transcriptomics

The term transcriptome refers to the set of all RNA transcripts, messenger RNA in particular, in a cell, tissue, or organism. Messenger RNA (mRNA) is a single-stranded molecule that carries the genetic code from the DNA in the nucleus to the sites of protein synthesis in the cytoplasm, where it is read by a ribosome in the process of synthesizing a protein. The mRNA molecule was first described in 1956 by the scientists Elliot Volkin and Lazarus Astrachan (Weinberg, 2001). The transcriptome captures the quantity or concentration of each RNA molecule in addition to its molecular identity.

Transcription is part of the Central Dogma of Molecular Biology, a framework that explains how genetic information flows within a biological system. It states that DNA (deoxyribonucleic acid) contains the genetic information that is passed from one generation to the next and can replicate (DNA to DNA). This information is transcribed into RNA (ribonucleic acid), which serves as a messenger between DNA and the ribosomes, the cellular machinery that synthesizes proteins (DNA to RNA). Finally, proteins are made by translating the sequence of nucleotides in RNA into a specific sequence of amino acids (RNA to protein), as shown in Fig. 2.

Analytical techniques play a crucial role in transcriptomics to identify, quantify, and characterize these RNA molecules. Here are some common analytical techniques used in transcriptomics:


Fig. 2 Central dogma of biology

1. RNA sequencing (RNA-seq): RNA-seq is a powerful technique used to sequence and quantify RNA molecules in a biological sample (Conesa et al., 2016). It works by converting RNA into complementary DNA (cDNA) fragments, which are then sequenced using high-throughput sequencing technologies. RNA-seq can provide information on the expression levels of all genes in a biological sample, as well as alternative splicing events and non-coding RNA molecules.
2. Microarrays: Microarrays are a high-throughput technique used to measure the expression levels of thousands of genes in a biological sample simultaneously (Stoughton, 2005). They work by hybridizing labeled RNA molecules to a microarray chip containing probes that correspond to different genes.
3. Reverse transcription-quantitative PCR (RT-qPCR): RT-qPCR is a technique used to measure the expression levels of specific RNA molecules in a biological sample (Tichopad et al., 2009). It works by converting RNA into cDNA and then using PCR to amplify and quantify the cDNA. RT-qPCR is a highly sensitive and specific technique that can be used to validate gene expression data obtained from RNA-seq or microarray analyses.
4. In situ hybridization: In situ hybridization is a technique used to visualize the spatial distribution of RNA molecules in a tissue or cell (Jensen, 2014). It involves hybridizing labeled RNA probes to complementary RNA molecules in the sample, followed by detection using microscopy.
5. NanoString technology: NanoString technology is a high-throughput technique used to measure the expression levels of specific RNA molecules in a biological sample (Eastel et al., 2019). It works by capturing and counting individual RNA molecules using probes that hybridize to specific sequences.

Overall, the choice of analytical technique(s) depends on the research question, the complexity of the biological sample, and the analytical platform available. Transcriptomics studies typically use a combination of these techniques to achieve a comprehensive and accurate characterization of the RNA molecules in a biological sample.
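The information flow of the Central Dogma described earlier in this section (DNA to RNA to protein) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the input is taken to be the coding (sense) strand of the DNA, so transcription reduces to replacing T with U; translation uses the standard genetic code, starts at the first base, and stops at the first stop codon. The function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """DNA to RNA: assumes the coding strand is given, so T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """RNA to protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAATGA"  # hypothetical example sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Real transcription reads the template strand and real mRNA undergoes processing such as splicing. The sketch deliberately leaves these out to keep the three-step flow visible.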


3 Proteomics

The term proteome refers to the complete set of proteins in a cell, tissue, or organism. Proteomics is the field that studies those proteins with reference to their biochemical characteristics and functional roles, and how their quantities, modifications, and structures change during growth and in response to internal and external stimuli (Aslam et al., 2017). Analytical techniques play a crucial role in proteomics to identify, quantify, and characterize these proteins. Here are some common analytical techniques used in proteomics:

1. Mass spectrometry (MS): MS is a powerful technique for the identification and quantification of proteins in a biological sample. It works by ionizing the proteins and measuring the mass-to-charge ratio of the resulting ions. MS can be coupled with different separation techniques, such as liquid chromatography (LC-MS) (McMurry, 2011) or gas-phase fractionation (GPF-MS), to improve the separation of proteins and their detection sensitivity.
2. Two-dimensional gel electrophoresis (2D-PAGE): 2D-PAGE is a separation technique used to separate proteins based on their isoelectric point and molecular weight (Rabilloud et al., 2010). It separates proteins into distinct spots on a gel, which can be analyzed using techniques such as MS.
3. Liquid chromatography (LC): LC is a separation technique used to separate and purify proteins based on their physical and chemical properties (Snyder et al., 2011). It can be used in combination with MS to improve the identification and quantification of proteins.
4. Antibody-based assays: Antibody-based assays, such as enzyme-linked immunosorbent assays (ELISAs) and Western blotting, are used to detect and quantify specific proteins in a biological sample (Ellington et al., 2010). They rely on the specificity of antibodies to recognize and bind to their target proteins.
5. X-ray crystallography: X-ray crystallography is a technique used to determine the three-dimensional structure of proteins. It involves crystallizing the protein and then using X-rays to produce a diffraction pattern, which can be used to determine the protein’s structure (Smyth & Martin, 2000).
6. Nuclear magnetic resonance (NMR) spectroscopy: NMR is a technique that uses the interaction of atomic nuclei with an applied magnetic field to identify and characterize proteins. It can provide detailed structural information on proteins in solution (Hore, 2015).
7. Protein microarrays: A protein microarray is a method used to observe the interactions and activities of proteins, and to determine their function on a large scale (Melton, 2004).

Overall, the choice of analytical technique(s) depends on the research question, the complexity of the biological sample, and the analytical platform available. Proteomics studies typically use a combination of these techniques to achieve a comprehensive and accurate characterization of the proteins in a biological sample.
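The mass-to-charge ratio measured in mass spectrometry, mentioned earlier in this section, can be made concrete with a short calculation. The sketch below assumes one common case that the text does not specify: positive-mode ionization in which a molecule of neutral mass M picks up z protons, giving m/z = (M + z × 1.007276) / z. The function name and the example mass of 1500 Da are hypothetical.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons (Da)


def mass_to_charge(neutral_mass, charge):
    """m/z of a molecule carrying `charge` extra protons.

    Assumption: positive-mode ionization by protonation, i.e. the ion is
    the neutral molecule plus `charge` protons.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# The same hypothetical 1500 Da peptide appears at different m/z values
# depending on how many protons it carries.
for z in (1, 2, 3):
    print(z, round(mass_to_charge(1500.0, z), 4))
```

The same molecule can therefore produce several peaks, one per charge state, which is one reason raw spectra need interpretation before proteins can be identified.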


4 Foodomics

As our understanding of nutrition and human health advances, there has been increased interest in exploring the impact of food at the molecular level. Foodomics was defined in 2009 (Cifuentes, 2009) as “a discipline that studies the Food and Nutrition domains through the application and integration of advanced -omics technologies to improve consumer’s well-being, health, and knowledge”. This emerging field combines the techniques of genomics, transcriptomics, proteomics, and metabolomics to study the complex interactions between food and living organisms. By applying these technologies to the study of food, scientists can better understand the chemical composition of foods, how they are metabolized by the body, and how they affect gene expression. Ultimately, this knowledge can lead to a better understanding of human health and the development of more personalized nutrition plans.

Through Foodomics, researchers can identify and quantify key nutrients and bioactive compounds in foods, and develop methods for enhancing or modifying these compounds to promote health and prevent disease. This approach to food science has the potential to transform our understanding of the link between diet and health and to improve the nutritional value of the foods we eat.

Analytical techniques play a crucial role in Foodomics research, as they allow scientists to identify and quantify the components of food and their interactions with biological systems. Here are some common analytical techniques used in Foodomics:

1. Mass Spectrometry (MS): This technique is used to identify and quantify molecules such as metabolites, peptides, and proteins in food samples. MS can also be used to analyze the structure of molecules such as lipids and carbohydrates.
2. Nuclear Magnetic Resonance (NMR) Spectroscopy: This technique is used to study the chemical structure of food molecules, such as carbohydrates, lipids, and proteins. NMR spectroscopy can also be used to identify and quantify small molecules in food samples.
3. Liquid Chromatography (LC): This technique is used to separate and purify different molecules in food samples, such as amino acids, peptides, and carbohydrates. LC is often coupled with MS to identify and quantify the separated molecules.
4. Gas Chromatography (GC): This technique is used to separate and purify volatile compounds in food samples, such as fatty acids, amino acids, and flavor compounds. GC is often coupled with MS to identify and quantify the separated compounds (McNair et al., 2019).
5. Fourier Transform Infrared (FTIR) Spectroscopy: This technique is used to study the chemical structure of food molecules, such as carbohydrates, lipids, and proteins. FTIR can also be used to identify and quantify small molecules in food samples (Movasaghi et al., 2008).
6. Microscopy: This technique is used to study the physical and morphological properties of food samples, such as the structure and distribution of proteins, lipids, and carbohydrates.
7. X-ray Crystallography: This technique is used to determine the 3D structure of molecules, such as proteins and enzymes, in food samples.

These are just some of the many analytical techniques used in Foodomics research. The choice of technique depends on the specific research question and the type of food sample being studied.

5 Metabolomics

The term metabolome refers to the complete set of metabolites in a cell, tissue, organ, or organism (Liu & Locasale, 2017). These metabolites are the end products of cellular processes. Metabolomics plays a crucial role in Foodomics because it allows scientists to identify the specific metabolites present in different foods and to monitor changes in these metabolites during processing and storage. By correlating metabolite profiles with health outcomes, researchers can gain insights into how different nutrients and bioactive compounds in foods affect the body’s metabolism and overall health. Additionally, metabolomics can help to uncover biomarkers of disease and to assess the effectiveness of dietary interventions in improving health outcomes in various populations. The potential applications of Foodomics and metabolomics are wide-ranging and have significant implications for the fields of nutrition and medicine.

Analytical techniques play a crucial role in metabolomics to identify, quantify, and characterize these metabolites. Here are some common analytical techniques used in metabolomics:

1. Mass spectrometry (MS): MS is a powerful technique for the identification and quantification of metabolites in a biological sample. It works by ionizing the metabolites and measuring the mass-to-charge ratio of the resulting ions. MS can be coupled with different separation techniques, such as liquid chromatography (LC-MS) or gas chromatography (GC-MS), to improve the separation of metabolites and their detection sensitivity.
2. Nuclear magnetic resonance (NMR) spectroscopy: NMR is a non-destructive technique that uses the interaction of atomic nuclei with an applied magnetic field to identify and quantify metabolites. NMR is a quantitative technique that can provide detailed structural information on metabolites.
3. High-performance liquid chromatography (HPLC): HPLC is a separation technique used to separate and quantify metabolites based on their physical and chemical properties. HPLC is commonly used in combination with UV and fluorescence detectors (Horváth, 2013).
4. Gas chromatography (GC): GC is a separation technique used to separate and quantify volatile metabolites based on their boiling points and vapor pressure. GC is commonly used in combination with mass spectrometry (GC-MS) to improve the identification of metabolites.
5. Capillary electrophoresis (CE): CE is a separation technique used to separate metabolites based on their charge and size. CE is a high-resolution technique that can separate metabolites that are difficult to separate using other techniques (Weinberger, 2000).
6. Fourier-transform infrared (FTIR) spectroscopy: FTIR is a technique that uses the interaction of infrared radiation with the chemical bonds in metabolites to identify and quantify metabolites. FTIR is a rapid and non-destructive technique that can provide detailed structural information on metabolites.

Overall, the choice of analytical technique(s) depends on the research question, the complexity of the biological sample, and the analytical platform available. Metabolomics studies typically use a combination of these techniques to achieve a comprehensive and accurate characterization of the metabolites in a biological sample.
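The idea of correlating metabolite profiles with health outcomes, mentioned earlier in this section, can be illustrated with a Pearson correlation coefficient. The sketch below is only one simple way to quantify such a relationship, and the text does not prescribe a particular statistic. All values are hypothetical and invented for illustration, and the function name `pearson` is an assumption of this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


# Hypothetical data: one metabolite's concentration and one clinical
# measurement for the same five individuals.
metabolite_level = [1.0, 2.0, 3.0, 4.0, 5.0]
clinical_outcome = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson(metabolite_level, clinical_outcome)
print(round(r, 3))
```

A coefficient near 1 or -1 indicates a strong linear association, but it does not by itself establish that the metabolite causes the outcome.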

6 Epigenomics

Epigenomics is another field closely related to Foodomics, as it studies the impact of environmental and lifestyle factors on gene expression. Incorporating epigenomics into Foodomics research can provide valuable insights into how different foods and diets affect gene expression and epigenetic changes, which in turn can have significant implications for human health. Changes in gene activity can result from age, changes in diet, exercise, drugs, life stress, or exposure to environmental factors such as chemicals or sunlight. All of these factors can directly affect the DNA. Epigenomics tries to determine which changes in DNA are permanent and which are temporary (Ferguson-Smith et al., 2008). By using advanced analytical techniques to study the epigenome, we can gain a better understanding of how nutrition and lifestyle choices influence disease risk and develop personalized dietary recommendations for individuals based on their epigenetic profile.

Analyzing the epigenome involves studying chemical modifications and their effects on gene expression. Here are some analytical techniques commonly used to study the epigenome:

1. Chromatin Immunoprecipitation (ChIP): ChIP is a technique used to identify the regions of DNA that are bound by a particular protein of interest. It involves cross-linking DNA with proteins, immunoprecipitating the protein-DNA complex using an antibody specific to the protein of interest, and analyzing the DNA fragments that were pulled down. By analyzing the DNA sequences that are associated with a specific protein, researchers can identify the genomic locations of various epigenetic marks (Collas, 2010).
2. DNA Methylation Analysis: DNA methylation is a common epigenetic modification that involves adding a methyl group to the cytosine base in DNA. Methylation is often associated with gene silencing, and it can be analyzed using techniques such as bisulfite sequencing, methylation-specific PCR, and methylated DNA immunoprecipitation (MeDIP). These techniques allow researchers to analyze the extent and distribution of DNA methylation in different regions of the genome (Kurdyukov & Bullock, 2016).
3. Histone Modification Analysis: Histone proteins are important components of chromatin and can be modified in a variety of ways, such as acetylation, methylation, phosphorylation, and ubiquitination. These modifications can affect chromatin structure and gene expression. Techniques such as ChIP-seq and ChIP-chip can be used to map histone modifications across the genome (Kimura, 2013).
4. RNA Sequencing (RNA-seq): RNA-seq is a technique used to measure gene expression levels by sequencing the RNA transcripts in a sample. RNA-seq can also be used to identify alternative splicing events and non-coding RNA transcripts, both of which are important in epigenetic regulation.
5. Mass Spectrometry: Mass spectrometry can be used to analyze the protein composition of chromatin and identify post-translational modifications of histone proteins. This technique can provide valuable information about the types and distribution of histone modifications across the genome.

These are just a few of the many analytical techniques used to study the epigenome. The choice of technique often depends on the specific research question and the type of epigenetic mark being studied.
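Bisulfite sequencing, listed above among the DNA methylation techniques, lends itself to a short worked example. The text does not describe the mechanism, so the sketch relies on the generally accepted principle that bisulfite treatment converts unmethylated cytosines so that they are read as T, while methylated cytosines are protected and still read as C. It further assumes a read already aligned to the same strand of the reference with no insertions or deletions. The function names and sequences are hypothetical.

```python
def call_methylation(reference, bisulfite_read):
    """Classify each reference cytosine as methylated (True) or not (False).

    Assumptions: the read is aligned base-for-base to the reference, and
    an unmethylated C appears as T after bisulfite conversion.
    """
    if len(reference) != len(bisulfite_read):
        raise ValueError("reference and read must have the same length")
    calls = {}
    for position, (ref_base, read_base) in enumerate(zip(reference, bisulfite_read)):
        if ref_base != "C":
            continue  # only cytosines carry this methylation signal
        if read_base == "C":
            calls[position] = True   # protected from conversion: methylated
        elif read_base == "T":
            calls[position] = False  # converted: unmethylated
        # Any other base is treated as uninformative and skipped.
    return calls


def methylation_level(calls):
    """Fraction of informative cytosines that were called methylated."""
    return sum(calls.values()) / len(calls) if calls else 0.0


reference = "ACGTCCGA"  # hypothetical reference sequence
read = "ACGTTTGA"       # hypothetical bisulfite-converted read

calls = call_methylation(reference, read)
print(calls, round(methylation_level(calls), 3))
```

In practice, methylation levels are estimated from many reads covering each position rather than from a single read as shown here.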

7 Summary

Multiomics is a new field of research that integrates multiple layers of information, including genomics, transcriptomics, proteomics, metabolomics, and more, to provide a comprehensive understanding of biological systems. It allows for a higher level of insight into complex biological processes and the mechanisms underlying disease. By illuminating connections that are visible only when looking across multiple layers of information, multiomics promises to advance medicine, biotechnology, and other fields. As scientists continue to develop new methods of collecting and analyzing data, the potential for breakthrough discoveries continues to grow. In recent years, advances in high-throughput technologies have led to a surge of interest in multiomics, an approach in which multiple biological molecules, such as DNA, RNA, and proteins, are studied simultaneously. This allows for a more comprehensive understanding of complex biological systems and can lead to the identification of novel therapeutic targets. While the integration of omics data is still a relatively new field, it has already yielded exciting results in various fields


of research, including cancer and neurology. In this chapter, we explored the challenges and potential of multiomics and its impact on precision medicine. Advancements in technology have enabled scientists to gather vast amounts of genomic, transcriptomic, proteomic, and metabolomic data, creating the need for integration and interpretation. Multiomics is an emerging field that aims to integrate these different “-omics” approaches through the study of biological systems at multiple levels. By combining these diverse technologies, researchers are able to generate a more comprehensive understanding of biological phenomena, leading to the development of personalized medicine and the identification of new drug targets. This multi-scale approach holds great promise for advancing our understanding of complex biological systems.

References

Anderson, M. W., & Schrijver, I. (2010, May). Next generation DNA sequencing and the future of genomic medicine. Genes, 1(1), 38–69. https://doi.org/10.3390/genes1010038. PMC 3960862. PMID 24710010.
Aslam, B., Basit, M., Nisar, M. A., Khurshid, M., & Rasool, M. H. (2017). Proteomics: Technologies and their applications. Journal of Chromatographic Science, 55(2), 182–196.
Bartlett, J. M., & Stirling, D. (2003). A short history of the polymerase chain reaction. PCR Protocols, 226, 3–6.
Baxevanis, A. D., Bader, G. D., & Wishart, D. S. (Eds.). (2020). Bioinformatics. John Wiley & Sons.
Cifuentes, A. (2009, October 23). Food analysis and foodomics. Journal of Chromatography A, 1216(43), 7109.
Collas, P. (2010). The current state of chromatin immunoprecipitation. Molecular Biotechnology, 45, 87–100.
Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., et al. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology, 17(1), 1–19.
Eastel, J. M., Lam, K. W., Lee, N. L., Lok, W. Y., Tsang, A. H. F., Pei, X. M., et al. (2019). Application of NanoString technologies in companion diagnostic development. Expert Review of Molecular Diagnostics, 19(7), 591–598.
Ellington, A. A., Kullo, I. J., Bailey, K. R., & Klee, G. G. (2010). Antibody-based protein multiplex platforms: Technical and operational challenges. Clinical Chemistry, 56(2), 186–193.
Ferguson-Smith, A. C., Greally, J. M., & Martienssen, R. A. (Eds.). (2008). Epigenomics. Springer Science & Business Media.
Hertzberg, R. P., & Pope, A. J. (2000). High-throughput screening: New technology for the 21st century. Current Opinion in Chemical Biology, 4(4), 445–451.
Hore, P. J. (2015). Nuclear magnetic resonance. Oxford University Press.
Horváth, C. (Ed.). (2013). High-performance liquid chromatography: Advances and perspectives. Academic Press.
Jensen, E. (2014). Technical review: In situ hybridization. The Anatomical Record, 297(8), 1349–1353.
Kimura, H. (2013). Histone modifications for human epigenome analysis. Journal of Human Genetics, 58(7), 439–445.
Kurdyukov, S., & Bullock, M. (2016). DNA methylation analysis: Choosing the right method. Biology, 5(1), 3.
Liu, X., & Locasale, J. W. (2017). Metabolomics: A primer. Trends in Biochemical Sciences, 42(4), 274–284.


McMurry, J. (2011). Organic chemistry: With biological applications (2nd ed., p. 395). Brooks/Cole.
McNair, H. M., Miller, J. M., & Snow, N. H. (2019). Basic gas chromatography. John Wiley & Sons.
Melton, L. (2004). Proteomics in multiplex. Nature, 429(6987), 105–107.
Movasaghi, Z., Rehman, S., & ur Rehman, D. I. (2008). Fourier transform infrared (FTIR) spectroscopy of biological tissues. Applied Spectroscopy Reviews, 43(2), 134–179.
Pushkarev, D., Neff, N. F., & Quake, S. R. (2009). Single-molecule sequencing of an individual human genome. Nature Biotechnology, 27(9), 847–850.
Rabilloud, T., Chevallet, M., Luche, S., & Lelong, C. (2010). Two-dimensional gel electrophoresis in proteomics: Past, present and future. Journal of Proteomics, 73(11), 2064–2077.
Sanger, F., & Thompson, E. O. P. (1953). ‘The amino-acid sequence in the glycyl chain of insulin’ and ‘The investigation of peptides from enzymatic hydrolysates’. Biochemical Journal, 53, 366–374.
Savić, N., & Schwank, G. (2016). Advances in therapeutic CRISPR/Cas9 genome editing. Translational Research, 168, 15–21.
Slonim, D. K., & Yanai, I. (2009). Getting started in gene expression microarray analysis. PLoS Computational Biology, 5(10), e1000543.
Smyth, M. S., & Martin, J. H. J. (2000). x-Ray crystallography. Molecular Pathology, 53(1), 8.
Snyder, L. R., Kirkland, J. J., & Dolan, J. W. (2011). Introduction to modern liquid chromatography. John Wiley & Sons.
Stoughton, R. B. (2005). Applications of DNA microarrays in biology. Annual Review of Biochemistry, 74, 53–82.
Tarazona, S., Balzano-Nogueira, L., & Conesa, A. (2018). Multiomics data integration in time series experiments. https://doi.org/10.1016/bs.coac.2018.06.005
Tichopad, A., Kitchen, R., Riedmaier, I., Becker, C., Stahlberg, A., & Kubista, M. (2009). Design and optimization of reverse-transcription quantitative PCR experiments. Clinical Chemistry, 55(10), 1816–1823.
Vailati-Riboni, M., Palombo, V., & Loor, J. J. (2017). What are omics sciences? In Periparturient diseases of dairy cows: A systems biology approach (pp. 1–7). Springer. https://doi.org/10.1007/978-3-319-43033-1_1
Weinberg, A. (2001). Messenger RNA: Origins of a discovery. Nature, 414, 485. https://doi.org/10.1038/35107234
Weinberger, R. (2000). Practical capillary electrophoresis. Elsevier.

Machine Learning from Multi-omics: Applications and Data Integration

Ammar El-Hassan

1 Introduction

The application of computerized health informatics, computational diagnosis, imagery and sensors in healthcare is unprecedented, with numerous heterogeneous, localized and distributed data sources providing multiple terabytes of stored and streamed data on an hourly basis; many of these datasets originate from high-throughput sequencing. This data is complex and covers clinical, empirical, biopsy, radiology, diagnosis, demographic, insurance and vaccination information as well as personal patient records held in both private and public sector data banks. This huge variation in the structure, source and format of the data poses challenges for those working to formulate standardized and integrated representation models that facilitate access to multi-dimensional biomedical data, its interfaces and its APIs for the purposes of analysis, classification and disease prevention (Branson et al., 2008; Eddy et al., 2020). Multi-omics (also known as integrative omics) data, which originate from genomic, metabolomic, transcriptomic, proteomic, or interactomic data (Dhillon et al., 2023) and include high-frequency measurements relating to molecular cell biology, represent one of the most insightful sources of information for clinical disease diagnosis via anomaly detection, characterization and analysis (Alkhateeb et al., 2021; Agrawal et al., 2022). Multi-omics provide biomarker indicators for better understanding of the pathogenesis, diagnosis and prognosis of a wide range of diseases including cancer, chronic disease, virus-related infections and even psychiatric cases which utilize health informatics with applications from the domains of AI and

A. El-Hassan (✉)
Computer Science, Princess Sumaya University for Technology, Amman, Jordan
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_2


Data Science including Machine Learning and Deep Learning (Dhillon et al., 2023) used to identify biomarkers in multi-omics data. While omics-related investigations and analyses have demonstrated the ability to reveal primary indicators for disease diagnosis, treatment and resistance factors in clinical oncology, and so help direct treatment choices with varying success rates, there remains a clear need for integrated infrastructures and technological platforms for the storage, analysis, annotation and classification of multi-omics data in order to advance precision medicine effectively and provide the insights that can support medical decision-making (Olivier et al., 2019).

This chapter introduces the state of the art in the development of multi-omics and the role they play in disease diagnosis, with applications in cancer, cardiovascular and neurological diseases, to name but a few of the prominent ailments afflicting humans in the twenty-first century.

2 Multi-omics as Cancer Indicators

The diagnosis, prevention and treatment of cancer require a complex set of processes and technologies that can manage and analyze complex, integrated and disparate biological data. Multi-omics can provide a gateway for understanding the causal effect of the molecular mechanisms and interactions that uniquely identify cancer. Furthermore, multi-omics provide a platform for identifying the bio-indicators and biomarkers that help in the diagnosis and treatment design for this highly complex and heterogeneous disease. For example, by combining genomic, proteomic, metabolomic and transcriptomic data, researchers were able to identify biomarkers that indicate early ovarian cancer cases (Xiao et al., 2022). Another study, by Khadirnaikar et al. (2023), proposed a machine-learning-based methodology that compresses multi-omics data to a smaller dimension and subsequently applies clustering techniques to identify five “Non-small Cell Lung Cancer (NSCLC) clusters”; survival analysis on the resulting clusters revealed major differences in survival chances between them. Alkhateeb et al. (2021) analyzed the heterogeneous nature of tumor tissues, which provides distinct classification (and subsequently, diagnosis and treatment courses) of cancer cases, be they liver, prostate or breast cancer cases. The authors used the variation of genomic activity among these diseases, in contrast to normal tissue, in conjunction with AI and Deep Learning techniques to “integrate multi-omics data generated from cell/tissue to study the possible outcomes” of cancer cases.


3 Multi-omics Epigenetic Alterations of Alzheimer’s

Parkinson’s disease and Alzheimer’s disease are two debilitating and complex neurological disorders that are diagnosed through the integration of heterogeneous biological data, which can then inform possible treatment pathways. Multi-omics have strong potential for identifying biomarker indicators of neurological disorders, for assessing disease progression rates, and for shedding light on the molecular-level interactions and mechanisms of the disease. The trend of scholarly and clinical work supports the premise that the integration of multi-omics data sets generates new insights into the disease and makes a notable, positive difference to the accuracy of biomarker identification. Nativio et al. (2020) analyzed the typically vague molecular mechanisms of late-onset Alzheimer’s disease (AD), in contrast to measuring protein aggregation as an indicator of neurodegeneration. The authors attempted to highlight molecular pathways of AD by combining RNA sequencing analysis and “transcriptomic, proteomic and epigenomic” data from postmortem human brains. Their analyses and findings highlight potential “epigenetic strategies for early-stage disease treatment” and also suggest that AD is related to a “reconfiguration of the epigenome, wherein H3K27ac and H3K9ac affect disease pathways by dysregulating transcription- and chromatin–gene feedback loops”.

4 Multi-omics Applications in Mental Health and Psychiatric Disorders

In their review of methods for the integrated analysis of multiple subsystems as indicators of mental health illnesses, Sathyanarayanan et al. (2023) propose a “holistic” vision and understanding of the biological systems in mental health patients; they reviewed and summarized Machine Learning approaches that integrate AI with massive multi-omics clinical datasets to identify biomarkers that support better diagnosis and tailored treatment pathways for patients. The authors also provided a framework for the definition of biological (in silico multi-omics) models and their role as mental health indicators between clinic and bedside. Prior to that, Monteleone et al. (2021) analyzed the links between metabolomic changes and gut microbiome composition in Anorexia Nervosa (AN) patients before and after weight regain. The authors sequenced the gut microbiome of female AN patients in the underweight phase and again after weight regain. By comparing the results to data from healthy females using established multi-omics correlation and untargeted metabolomic procedures, they concluded the existence of “perturbation


in the gut microbiome composition of AN patients”. Furthermore, their work suggested that gut bacteria in AN patients are associated with several metabolites in ways distinct from those in healthy women.

5 Cardiovascular Disease

Cardiovascular disease (CVD) is widely recognized as a leading cause of death and shows complex phenotypic heterogeneity due to the multiple, dynamic interactions between several subsystems, including genetic, dietary, lifestyle and environmental factors. Although several genes and genetic loci are understood to play a role in CVD cases, our understanding of the exact mechanisms that affect these genes and loci at the level of phenotypic heterogeneity is limited. To improve this understanding, especially at the molecular level, we need to apply modern multi-omics technology to provide precision medicine analyses of data from DNA sequencing as well as multi-omics data from potential and confirmed CVD cases at the epigenome, transcriptome, metabolome and proteome levels (Wang et al., 2023). In addition, there is a need to integrate interdisciplinary fields such as network medicine to highlight the interactions among biological components in health and disease, so that an unbiased framework for systematically integrating these multi-omics data can be provided. Wang et al. (2023) presented a review of multi-omics technologies with suggestions on how they can be utilized in the advancement of precision medicine. The authors also proposed the application of “network medicine-based integration of multi-omics data for precision medicine and therapeutics in CVD” and discussed some of the limitations, risks, challenges and future directions in the application of multi-omics network medicine approaches to the study of CVD.

6 Machine Learning Applications to Multi-omics Data

Modern-day health informatics and sensor technology generate vast volumes of disparate multi-omics data relating to cell microbiology in healthy and sick humans. Biomedical science, covering biological and molecular sub-systems and their interactions and reactions to disease development, in conjunction with computer science, statistics and Artificial Intelligence technology, has created rich opportunities for a myriad of experimental and theoretical contributions in the field of disease diagnosis and treatment, using Machine Learning (ML), Artificial Intelligence (AI) and Deep (Neural Network) Learning (DL) algorithms to collate, balance and analyze vast datasets in non-linear and hierarchical modes to understand, predict,


diagnose and also treat ailments and diseases. ML has been applied to large heterogeneous datasets, both balanced and imbalanced, for several years. Li et al. (2016) classified ML methods for multi-omics data analysis and integration into the following categories/approaches:

(i) Feature Concatenation
(ii) Bayesian Models – Networks
(iii) Ensemble learning
(iv) Multiple Kernel Learning
(v) Network-Based Learning
(vi) Multi-view Matrix or Tensor Factorization

DL technology enables the generation of previously unseen functional insights from large datasets, which can help “explain complex relationships” at the molecular level, long a bottleneck in this area (Kang et al., 2022). DL applications to multi-omics datasets are characterized by the stage (layer) at which data integration takes place; hence there are three modes (Khoshghalbvash & Gao, 2019): early-layer integration, middle-layer integration and late-layer integration.

7 Data Integration Strategies

Spicker et al. (2008) introduced methods for hierarchical data integration and visualization based on partial least squares discriminant analysis or principal component analysis (PCA) of toxicological report data from magnetic resonance imaging, clinical chemistry or expression data; the authors assert that this approach is a catalyst for better correlation of bio-parameters across various data types and, consequently, for an improved ability to interpret the underlying data and gain appropriate insights. As ML and DL tools and applications became more powerful and accessible, their role in the classification and diagnosis of biomarker indicators gained recognition, with modern multi-omics analyses able to produce holistic insights into multiple, complex and heterogeneous biological systems. Picard et al. (2021) summarized the data integration strategies into five categories; other authors have tended to focus on a simplified three-mode categorization of early, intermediate and late stage integration (Alkhateeb et al., 2021). We focus on the three most commonly applied methodologies:

(i) Early integration, which is the simplest and easiest approach to implement, “concatenates all omics datasets into a single matrix” to which the machine learning model is applied; see Fig. 1 below. In this approach omics datasets are combined into a single, large dataset with a higher feature


Fig. 1 Early Stage Integration

dimension for the same, original classes (outputs); apart from being more complex, the resulting matrix is also noisy and requires balancing techniques, e.g., over-sampling or under-sampling, which also affects the performance of this approach because the ML algorithm requires more learning time.
(ii) Intermediate integration, which “simultaneously transforms the original datasets into common and omics-specific representations”; see Fig. 2 below. In this approach datasets from multi-omics are integrated without initial data transformation or basic concatenation; the output of this approach is split into output types that are common to all datasets and other, dataset-specific outputs which warrant further analysis.
(iii) Late integration, in which each omics dataset is analyzed on its own and the respective outputs are combined at a late stage; see Fig. 3 below. This approach is relatively straightforward in that it applies ML models to omics datasets individually (using distinct tools on each dataset depending on size, feature space and type) and subsequently merges the resulting outputs.

A brief summary of applications of the three aforementioned strategies is provided in Table 1, below.
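The difference between early and late integration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient identifiers, feature values, the `threshold_model` stand-in classifier and the use of majority voting to merge outputs are all hypothetical choices made for this example, and none of them is taken from the studies cited in this chapter. Intermediate integration, which needs a joint transformation of the datasets, is not shown.

```python
from collections import Counter

# Hypothetical toy data: three patients, two omics layers (values are made up).
transcriptomics = {"p1": [5.1, 0.2], "p2": [4.8, 0.3], "p3": [0.9, 3.7]}
methylation = {"p1": [0.10], "p2": [0.15], "p3": [0.85]}


def early_integration(*layers):
    """Concatenate the feature vectors of all layers for each shared sample."""
    shared = sorted(set.intersection(*(set(layer) for layer in layers)))
    return {s: [v for layer in layers for v in layer[s]] for s in shared}


def threshold_model(layer, feature_index, cutoff):
    """Stand-in for a trained single-omics classifier (illustrative only)."""
    return {s: int(values[feature_index] > cutoff) for s, values in layer.items()}


def late_integration(per_layer_predictions):
    """Merge per-model class predictions by majority vote for each sample."""
    merged = {}
    for sample in per_layer_predictions[0]:
        votes = Counter(pred[sample] for pred in per_layer_predictions)
        merged[sample] = votes.most_common(1)[0][0]
    return merged


# Early integration: one wider feature vector per patient, fed to a single model.
combined = early_integration(transcriptomics, methylation)

# Late integration: each dataset gets its own model; only the outputs are merged.
predictions = [
    threshold_model(transcriptomics, 1, 1.0),
    threshold_model(methylation, 0, 0.5),
    threshold_model(transcriptomics, 0, 2.0),
]
final = late_integration(predictions)

print(combined)
print(final)
```

The early path grows the feature dimension (three features per patient here instead of two and one), which is the source of the added complexity noted in item (i), whereas the late path keeps each model small and defers all interaction between omics layers to the merge step.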


Fig. 2 Intermediate Stage Integration

Fig. 3 Late Stage Integration


Table 1 Data integration applications

Integration level | Authors | Summary of methodology
Early | Xie et al. (2019) | Developed a group lasso regularized deep learning and Cox proportional hazards model for cancer prognosis and survival prediction.
Early | Chaudhary et al. (2018) | Used DL to identify survival groups of hepatocellular carcinoma (HCC) in six patient groups using “RNA sequencing (RNA-Seq), miRNA sequencing (miRNA-Seq), and methylation data from The Cancer Genome Atlas (TCGA)”.
Intermediate | Gaynanova and Li (2019) | Developed SLIDE, a “Structured Learning and Integrative Decomposition” model for multi-view data for component selection and signal estimation; demonstrated positive results in classifying breast cancer data from TCGA.
Intermediate | El-Manzalawy et al. (2018) | Selects features based on their intra- and inter-omics block complementarity, which outperforms simple “Min-Redundancy and Maximum-Relevance” (mRMR) on individual or concatenated datasets.
Late | Sun et al. (2018) | Proposed a Multimodal Deep Neural Network which integrates “Multi-dimensional Data (MDNNMD)”, used for prediction and prognosis of breast cancer.
Late | Wang et al. (2020) | Developed “Multi-Omics gRaph cOnvolutional NETworks (MORONET)”, a multi-omics integrative method for biomedical data classification. MORONET learns and classifies omics data within single and multi-omics datasets and correlations.

References

Agrawal, M., Allin, K. H., Petralia, F., Colombel, J. F., & Jess, T. (2022). Multi-omics to elucidate inflammatory bowel disease risk factors and pathways. Nature Reviews Gastroenterology & Hepatology, 19(6), 399–409.
Alkhateeb, A., Tabl, A. A., & Rueda, L. (2021). Deep learning in multi-omics data integration in cancer diagnostic. In Deep learning for biomedical data analysis: Techniques, approaches, and applications (pp. 255–271). CRC Press, Taylor & Francis Group.
Branson, A., Hauer, T., McClatchey, R., Rogulin, D., & Shamdasani, J. (2008). A data model for integrating heterogeneous medical data in the health-e-child project. Studies in Health Technology and Informatics, 138, 13.
Chaudhary, K., Poirion, O. B., Lu, L., & Garmire, L. X. (2018). Deep learning–based multi-omics integration robustly predicts survival in liver cancer using deep learning to predict liver cancer prognosis. Clinical Cancer Research, 24(6), 1248–1259.
Dhillon, A., Singh, A., & Bhalla, V. K. (2023). A systematic review on biomarker identification for cancer diagnosis and prognosis in multi-omics: From computational needs to machine learning and deep learning. Archives of Computational Methods in Engineering, 30(2), 917–949.
Eddy, S., Mariani, L. H., & Kretzler, M. (2020). Integrated multi-omics approaches to improve classification of chronic kidney disease. Nature Reviews Nephrology, 16(11), 657–668.


El-Manzalawy, Y., Hsieh, T. Y., Shivakumar, M., Kim, D., & Honavar, V. (2018). Min-redundancy and max-relevance multi-view feature selection for predicting ovarian cancer survival using multi-omics data. BMC Medical Genomics, 11(3), 19–31.
Gaynanova, I., & Li, G. (2019). Structural learning and integrative decomposition of multi-view data. Biometrics, 75(4), 1121–1132.
Kang, M., Ko, E., & Mersha, T. B. (2022). A roadmap for multi-omics data integration using deep learning. Briefings in Bioinformatics, 23(1), bbab454.
Khadirnaikar, S., Shukla, S., & Prasanna, S. R. M. (2023). Machine learning based combination of multi-omics data for subgroup identification in non-small cell lung cancer. Scientific Reports, 13(1), 4636.
Khoshghalbvash, F., & Gao, J. X. (2019). Integrative feature ranking by applying deep learning on multi source genomic data. In Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics (pp. 207–216). ACM.
Li, Y., Wu, F.-X., & Ngom, A. (2016). A review on machine learning principles for multi-view biological data integration. Briefings in Bioinformatics, 19(2), 325–340.
Monteleone, A. M., Troisi, J., Fasano, A., Dalle Grave, R., Marciello, F., Serena, G., et al. (2021). Multi-omics data integration in anorexia nervosa patients before and after weight regain: A microbiome-metabolomics investigation. Clinical Nutrition, 40(3), 1137–1146.
Nativio, R., Lan, Y., Donahue, G., Sidoli, S., Berson, A., Srinivasan, A. R., et al. (2020). An integrated multi-omics approach identifies epigenetic alterations associated with Alzheimer’s disease. Nature Genetics, 52(10), 1024–1035.
Olivier, M., Asmis, R., Hawkins, G. A., Howard, T. D., & Cox, L. A. (2019). The need for multi-omics biomarker signatures in precision medicine. International Journal of Molecular Sciences, 20(19), 4781.
Picard, M., Scott-Boyer, M. P., Bodein, A., Périn, O., & Droit, A. (2021). Integration strategies of multi-omics data for machine learning analysis. Computational and Structural Biotechnology Journal, 19, 3735–3746.
Sathyanarayanan, A., Mueller, T. T., Moni, M. A., Schueler, K., Baune, B. T., Lio, P., et al. (2023). Multi-omics data integration methods and their applications in psychiatric disorders. European Neuropsychopharmacology, 69, 26–46.
Spicker, J. S., Brunak, S., Frederiksen, K. S., & Toft, H. (2008). Integration of clinical chemistry, expression, and metabolite data leads to better toxicological class separation. Toxicological Sciences, 102(2), 444–454.
Sun, D., Wang, M., & Li, A. (2018). A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(3), 841–850.
Wang, T., Shao, W., Huang, Z., Tang, H., Zhang, J., Ding, Z., & Huang, K. (2020). Moronet: Multiomics integration via graph convolutional networks for biomedical data classification. bioRxiv, 2020-07.
Wang, R. S., Maron, B. A., & Loscalzo, J. (2023). Multi-omics network medicine approaches to precision medicine and therapeutics in cardiovascular diseases. Arteriosclerosis, Thrombosis, and Vascular Biology, 43, 493–503.
Xiao, Y., Bi, M., Guo, H., & Li, M. (2022). Multi-omics approaches for biomarker discovery in early ovarian cancer diagnosis. eBioMedicine, 79, 104001.
Xie, G., Dong, C., Kong, Y., Zhong, J. F., Li, M., & Wang, K. (2019). Group lasso regularized deep learning for cancer prognosis from multi-omics and clinical features. Genes, 10(3), 240.

Machine Learning Approaches for Multi-omics Data Integration in Medicine

Fatma Hilal Yagin

1 Introduction

Recent advances in screening methods, which are both effective and cost-effective, have led to the generation of large amounts of biological data, paving the way for a new era of therapeutics and personalized medicine (Misra et al., 2019; Picard et al., 2021). Due to factors such as age, gender, genetics, metabolism and environment, the effectiveness of treatments and the likelihood of experiencing side effects can vary greatly from person to person (Burney & Lakhtakia, 2017; Jaccard et al., 2018). Research in this field has therefore increased rapidly in recent years, with the aim of developing the most effective treatment for each individual’s biological makeup by using omic data in addition to clinical information in precision medicine (Tebani et al., 2016).

Method development and optimal use of resources require a deep understanding of the data formats as well as the biological underpinnings of the data contained in each omic layer. For example, variants in the genome can affect how genes are regulated and the amount of mRNA produced. Subsequent measurements of the proteome are then affected by splicing mechanisms and post-translational modifications. In the end, the phenotype of the cell is determined by all these different processes (Jung et al., 2020; Hu et al., 2018).

The field of genomics explores the ways in which genes interact with each other as well as with their environment. This is accomplished by cataloging all the genes responsible for encoding structural and functional processes in an

F. H. Yagin (✉)
Department of Biostatistics and Medical Informatics, Faculty of Medicine, Inonu University, Malatya, Turkey
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_3


organism. Unlike genetics, which focuses only on individual variants or genes, genomics is the first of the omics disciplines and deals with the study of the entire genome. Examination of an organism’s genome reveals its genes and their fixed sequences, which provides an understanding of the complex biological function of the genome. The development of technologies based on genomic information has facilitated the understanding of the molecular subtleties of the functions of cells and tissues (Tyers & Mann, 2003; Futreal et al., 2001; Akbulut et al., 2022).

The complete set of DNA of a cell or organism is known as its genome. Early estimates placed the number of protein-encoding genes in the human genome at about 30,000–40,000, spread across roughly 3.2 billion bases; more recent estimates are closer to 20,000 protein-coding genes. Historically, each gene was analyzed on its own, but advances in technology such as microarrays have made it possible to analyze many genes at once. DNA microarrays can analyze the expression of thousands of genes simultaneously and can also measure differences in DNA sequence between individuals. Using the technique known as comparative genomic hybridization, DNA microarrays can be used to detect abnormal numbers of chromosomes as well as chromosomal aberrations such as insertions and deletions (Kumar et al., 2022; Asselt & Ehli, 2022; Pal, 2022).

Single nucleotide polymorphisms (SNPs), in which one nucleotide is replaced by another, are the most common type of variation in DNA sequences among humans. If the change results in a codon for a different amino acid, it can be functionally significant. SNPs are commonly used as markers of genetic variants associated with disease. The use of SNP profiling in pharmacogenomics is also important for determining how particular patients respond to different drugs (Rafalski, 2002; Gray et al., 2000).

Applications that investigate the transcriptome, which is the sum of all RNA transcripts produced in an organism, fall into the field of transcriptomics.
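The remark above that a SNP can be functionally significant when it changes the encoded amino acid can be illustrated with a short Python sketch. It classifies a single-base substitution inside one codon as synonymous (same amino acid), missense (different amino acid) or nonsense (stop codon). The function name and the example codon are hypothetical, and the codon table is deliberately partial: it lists only the entries of the standard genetic code needed for this example, whereas a complete table has 64 codons.

```python
# Partial standard genetic code (DNA codons); only the entries used below.
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GTA": "Val", "GTG": "Val",
    "GCA": "Ala",
    "TAA": "Stop",
}


def classify_snp(codon, position, alt_base):
    """Classify a single-base substitution within one codon.

    position is 0-based within the codon; alt_base is the replacing nucleotide.
    Returns the mutated codon and one of 'synonymous', 'missense', 'nonsense'.
    """
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[mutated]
    if ref_aa == alt_aa:
        return mutated, "synonymous"
    if alt_aa == "Stop":
        return mutated, "nonsense"
    return mutated, "missense"


# Three substitutions in the same hypothetical reference codon GAA (Glu).
for pos, base in [(2, "G"), (1, "T"), (0, "T")]:
    print(pos, base, classify_snp("GAA", pos, base))
```

Only the second and third substitutions alter the protein product, which is why the position of a SNP within a codon matters as much as the fact that a base has changed.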
Transcriptomics has enabled quantitative measurement of the dynamic expression of mRNA molecules (Lowe et al., 2017). Each cell expresses different genes at different times of development and under different physiological conditions. By examining how gene expression changes in different organisms or tissues, transcriptomic analysis has had a significant impact on the understanding of human diseases (Dong & Chen, 2013; Breschi et al., 2017; Fan et al., 2020).

Proteomics comprises the technologies developed for the identification and quantification of the full set of proteins, the gene products, in a cell, tissue or organism. It sheds light on the structures, amounts, post-translational modifications and functions of all proteins, and on their interactions with other proteins and macromolecules. Advances in genomic technology have allowed vast amounts of biological data to be generated, and with advances in analytical technology, proteomics has become increasingly important across many areas of study. Proteomics is very important for early diagnosis, prognosis and monitoring of disease development (Hanash, 2003; Graves & Haystead, 2002; Cox & Mann, 2007).

Metabolomic analysis has made it possible to detect, quantify and identify the small-molecule products of metabolism present in the tissues of an organism at a given time. With the rapid development in genomic, transcriptomic

Machine Learning Approaches for Multi-omics Data Integration in Medicine

25

and proteomic technologies, interest in the field of metabolomics has increased. Because it is encoded by the genome and carries the effects of environmental factors, its importance in medical applications is increasing. Metabolite levels, which are products of metabolism, reflect metabolic function, and disturbances outside the normal range may be a sign of disease (Segers et al., 2019; Horgan & Kenny, 2011; Putri et al., 2013; Ryan & Robards, 2006). Many diseases affect complex molecular pathways that mediate communication between various cellular components. Thus, recent research has focused on the interpretation of molecular complexities and variations at multiple levels, such as the genome, transcriptome, proteome, and metabolome, to gain a comprehensive understanding of human health and disease (Cisek et al., 2016; Subramanian et al., 2020). Since the development of sequencing technology, identifying potential diagnostic and therapeutic options for diseases has been possible with systems-level integrated approaches and multi-omics research. Moreover, combining data from different types of omics with clinical trial results led to a better understanding of the working principle of cells. The integration of multi-omics data informing about biomolecules in human cells helps to evaluate the interactions of molecules and the flow of information from one omic level to another, helping to bridge the gap between genotype and phenotype. The prognostic and predictive accuracy of disease phenotypes can be enhanced by the use of integrative techniques due to their ability to holistically examine biological processes, thus leading to improved treatment and prevention in the long run (Ahmed, 2020, 2022; Olivier et al., 2019; Andrieux & Chakraborty, 2021). Working with biological datasets is difficult because of their complexity and inherent noise that can result from imperfect measurements or rare biological anomalies. 
In light of the difficulty of finding relevant information and integrating omics data into a useful model, many methodologies and strategies have been developed in recent years. If the integration is done incorrectly, adding more omics may not increase performance significantly, yet it still increases the complexity and computation time of the problem. Because multi-omics data are so difficult to integrate, the computational algorithms required to separate signal from noise are becoming increasingly complex. Consequently, strategies for the systematic integration of heterogeneous multi-omics datasets are needed to produce actionable results that can advance the biological sciences and eventually be translated into clinical practice (Picard et al., 2021; Huang et al., 2017; Hasin et al., 2017). The remainder of the chapter is organized as follows. In Sect. 2, we examine the main objectives of multi-omics data integration. In Sect. 3, we discuss multi-omics integration strategies. In Sect. 4, we present machine learning approaches used in multi-omics data integration for disease diagnosis and treatment in medicine. Finally, in Sect. 5, we summarize the contribution and importance of this chapter.

F. H. Yagin

2 Main Objectives of Multi-omic Data Integration Studies

2.1 Diagnosis and Prognosis

Multifactorial diseases are difficult to diagnose clinically because of their complex genotypes and phenotypes. Pathologists and clinicians face difficulties in dividing patients into subtypes, and the diagnostic process can be time-consuming or invasive. Integrated molecular data are therefore useful for predicting disease course, severity, and status. Instead of single-molecule (i.e., single-omics) biomarkers, multi-omics approaches use complex molecular signatures and models. As a result, more accurate disease predictions and more information at the molecular level can be obtained (Athieniti & Spyrou, 2022).

2.2 Identification of the Subtype

At present, diseases are sub-classified according to symptoms and clinical profiles or common histopathological features. In recent years, a number of studies have explored new classifications of disease subtypes based on molecular similarities. Finding common genes with altered expression was the first step toward grouping by molecular features, but more recently multi-omics approaches have been applied instead. Identifying disease subtypes reveals heterogeneous groups within patient cohorts, which may have distinct patterns of disease progression or treatment response. It therefore paves the way for treatments that are more specific and more effective, which may include immunotherapy, biological drugs, or hormonal therapy (Athieniti & Spyrou, 2022).

2.3 Discovering Molecular Patterns of Disease

A common goal of multi-omics integration research is to link molecular-level biomarkers to the clinical markers used in practice. Multi-omics data allow disease-associated biomarker molecules to be discovered through analyses that reveal patterns or correlations between molecules. The discovered patterns or biomarkers can be used as indicators of disease or disease stage, and can further reveal disease-specific pathways, links, and molecular mechanisms of interest (Athieniti & Spyrou, 2022; Santiago-Rodriguez & Hollister, 2021).

2.4 Predicting the Effects of a Drug at the Molecular Level

Patients can differ widely in how their pharmacological therapy progresses and in how they respond to the medication they are given. Integrating information from different omics approaches can help determine how a drug acts on the cells of a particular patient. Predicting whether a drug will be beneficial in patients with similar molecular features allows for targeted therapy, and this prediction is crucial for personalized medicine applications (Athieniti & Spyrou, 2022; Santiago-Rodriguez & Hollister, 2021).

2.5 Comprehension of the Regulatory Processes

Disease-specific gene regulatory networks (GRNs) can be discovered with multi-omics data analysis that combines measures of gene expression with those of potential regulators. GRNs have the potential to facilitate the discovery of therapeutic targets and to enable the identification of critical dysregulated sub-networks (Athieniti & Spyrou, 2022; Liu et al., 2019).

3 Multi-omics Integration Strategies

Multi-omics datasets can be used for a variety of goals, including disease classification, discovery of biomarkers, and identification of disease subtypes. Machine learning (ML) models are frequently used to analyze complicated data; nevertheless, integrating large numbers of noisy, high-dimensional datasets is notoriously difficult. As a result, numerous integration strategies have been devised, each with its own advantages and disadvantages. Assuming that each dataset has been preprocessed appropriately for its omics type, the datasets can simply be concatenated sample by sample, and the resulting matrix can be used as input to ML models (early integration). In practice, though, most ML models will have difficulty learning from such a complicated dataset, particularly if the sample count is low. Other strategies reduce the complexity of the datasets by transforming or mapping them either independently (mixed integration) or jointly (intermediate integration). It is also possible to take the opposite approach, known as late integration, which does not combine the data but analyzes each omics dataset on its own and then combines the predictions of the individual models to arrive at a conclusion. Finally, the hierarchical integration strategy takes into account the established regulatory links among the omics layers (Singh et al., 2019; Picard et al., 2021) (Fig. 1).

Fig. 1 The approaches in multi-omics integration (Picard et al., 2021)

3.1 Early Integration

Early integration combines all of the datasets into a single large matrix by concatenation. The number of features increases as a result, while the number of observations stays the same. The resulting matrix is more complex, noisier, and higher-dimensional than any single dataset, which makes it more difficult to learn from. Moreover, differences in the number of variables between omics blocks can unbalance the learning process (Athieniti & Spyrou, 2022; Abdi et al., 2013), because the algorithm devotes most of its effort to the omics block with the most variables and neglects the others. Despite this, early integration is still widely used because it is simple and easy to implement and, most importantly, because combining variables from different omics enables ML models to discover interactions between the omics directly (Picard et al., 2021; Abdi et al., 2013).
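
The concatenation step and the block-size imbalance can be illustrated with a minimal sketch. The function name `early_integration`, the toy matrices, and the optional 1/sqrt(p) block weighting are assumptions made for illustration; the weighting is only one simple way to keep a large block from dominating and is not prescribed by the works cited above.

```python
from math import sqrt

def early_integration(blocks, block_scale=True):
    """Concatenate omics blocks column-wise; rows (samples) must be aligned."""
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("every block must describe the same samples")
    fused = []
    for i in range(n_samples):
        row = []
        for block in blocks:
            # Assumed heuristic: down-weight a block by 1/sqrt(number of features).
            weight = 1.0 / sqrt(len(block[i])) if block_scale else 1.0
            row.extend(weight * value for value in block[i])
        fused.append(row)
    return fused

# Toy data: 3 samples, 4 expression features and 2 methylation features.
expression = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 1.0], [0.5, 0.5, 2.0, 2.0]]
methylation = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
fused = early_integration([expression, methylation])
print(len(fused), "samples x", len(fused[0]), "features")
```

The fused matrix keeps the three samples but now has six features, which is the growth in dimensionality that the text describes.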

3.2 Mixed Integration

The mixed integration strategy addresses the shortcomings of early integration by transforming each omics dataset independently into a simpler representation. The new representation may have fewer dimensions and less noise, both of which make analysis easier. In addition, most of the heterogeneity between the omics datasets, such as differences in data type or size, is removed by the new representation. The transformed representations are then combined, and traditional ML models can be used to analyze the combined representation (Picard et al., 2021).
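
As a rough illustration of "transform each block independently, then combine", the sketch below reduces every omics block to its scores on the leading principal direction (found by power iteration) and joins the scores. The choice of PCA as the per-block transformation, the function names, and the toy data are assumptions; kernels, graphs, or autoencoders could play the same role.

```python
from math import sqrt

def column_center(block):
    """Subtract each feature's mean so that projections are centred."""
    n = len(block)
    means = [sum(col) / n for col in zip(*block)]
    return [[v - m for v, m in zip(row, means)] for row in block]

def first_component_scores(block, n_iter=100):
    """Sample scores on the leading principal direction (power iteration)."""
    x = column_center(block)
    w = [float(j + 1) for j in range(len(x[0]))]  # arbitrary non-zero start
    for _ in range(n_iter):
        scores = [sum(v * wj for v, wj in zip(row, w)) for row in x]      # X w
        w = [sum(s * v for s, v in zip(scores, col)) for col in zip(*x)]  # X^T (X w)
        norm = sqrt(sum(wj * wj for wj in w))
        if norm == 0.0:
            raise ValueError("block has no variance along the search direction")
        w = [wj / norm for wj in w]
    return [sum(v * wj for v, wj in zip(row, w)) for row in x]

def mixed_integration(blocks):
    """One low-dimensional column per omics block, joined sample by sample."""
    per_block = [first_component_scores(block) for block in blocks]
    return [list(sample) for sample in zip(*per_block)]

expression = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.0, 1.0], [0.5, 0.5, 2.0, 2.0]]
methylation = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
mixed = mixed_integration([expression, methylation])
print(mixed)
```

Each omics block contributes the same number of columns to `mixed`, so the size imbalance of plain concatenation disappears.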

3.3 Intermediate Integration

The term “intermediate integration” refers to methods that integrate multi-omics datasets jointly, without prior transformation of each dataset and without relying on simple concatenation. In general, these methods generate new representations, one common to all omics and some specific to individual omics, which can be subjected to further analysis. This step reduces the high dimensionality and complexity of multi-omics data. However, because of the heterogeneity among the omics datasets, some preprocessing and feature selection are usually performed before integration. Ideally, one would use an intermediate feature selection method to limit the information lost when features are selected independently for each omics. Such a method chooses features by taking into account how well they complement one another within and between omics blocks, a result that cannot be achieved by applying feature selection to each dataset separately or to the concatenated dataset. Some intermediate strategies are designed as multi-block feature extraction methods: just as standard feature extraction can be used for exploratory purposes or as a basis for further analysis, these methods extract features from multiple blocks at the same time (Picard et al., 2021; Tini et al., 2019; Rappoport & Shamir, 2018). Intermediate integration methods operate on the assumption that the different omics datasets share a common structure that reflects the biological mechanisms underlying the disease or condition of interest. Their main benefit is the ability to explore this common inter-omic structure while emphasizing the complementary information contained in each omic (Picard et al., 2021).

3.4 Late Integration

Late integration techniques apply a separate ML model to each omics dataset and then combine the predictions of these models. Their advantage is that tools created specifically for each omics data type can be used. In contrast to the other strategies, late integration does not require merging several omics data types. Its drawback is that it cannot capture inter-omic interactions. In addition, the individual ML models cannot share or exploit complementary information during the learning process. Combining predictions alone is therefore insufficient to fully exploit multi-omics data and to understand the molecular pathways behind diseases (Reel et al., 2021; Cai et al., 2022).
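
The following sketch shows the late integration pattern in its simplest form: one model per omics block, with the class probabilities averaged at the end. The nearest-centroid model, the softmax over distances, and all names and data are illustrative assumptions; any per-omics classifier could be substituted.

```python
from math import dist, exp

def fit_centroids(features, labels):
    """Nearest-centroid model: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict_proba(centroids, sample):
    """Turn distances to the class centroids into pseudo-probabilities."""
    scores = {label: exp(-dist(sample, c)) for label, c in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

def late_integration(models, sample_per_omic):
    """Average the per-omics class probabilities and pick the best label."""
    probas = [predict_proba(m, s) for m, s in zip(models, sample_per_omic)]
    mean = {label: sum(p[label] for p in probas) / len(probas) for label in probas[0]}
    return max(mean, key=mean.get), mean

labels = [0, 0, 1, 1]
expression = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
methylation = [[0.1], [0.2], [0.8], [0.9]]
models = [fit_centroids(expression, labels), fit_centroids(methylation, labels)]
label, mean = late_integration(models, [[0.95, 1.0], [0.85]])
print(label, mean)
```

Because each model only ever sees its own block, no interaction between an expression feature and a methylation feature can be learned, which is the limitation noted above.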

3.5 Hierarchical Integration

Systems biology presents a number of challenges for understanding structured functioning at the molecular level. To reflect the structure of high-dimensional data more accurately, it is important to incorporate prior knowledge of regulation into the integration phase. A hierarchical integration technique therefore includes information about the regulatory links between the omics levels. For example, when integrating the genotype and phenotype of cellular subsystems, variations in the nucleotides that make up the genotype can lead to shifts in gene expression or to changes in the functional properties of proteins, which can ultimately affect the phenotype. As a result, hierarchical integration methods often use external knowledge drawn from interaction databases and previous scientific research (Picard et al., 2021).

4 Machine Learning Approaches Used in Multiomics Integration

4.1 Data Integration Analysis for Biomarker Discovery Using Latent Components (DIABLO)

DIABLO is a multi-omics integration method that discriminates between phenotypic groups while simultaneously identifying key omics variables (such as mRNAs, miRNAs, proteins, and metabolites) during integration. It is designed to extract the maximum amount of information that is shared or correlated across multi-omics datasets. It is the first multivariate integrative classification method of its kind that builds a predictive model for predicting the outcome of new samples. The Projection to Latent Structures (PLS) methodology on which the method is based makes it possible to create effective visualizations. DIABLO can handle a wide range of experimental designs, from the traditional single time point study to cross-over and repeated measures studies. Instead of the original omics matrices, pathway-based module matrices can be used to conduct module-based analysis (Singh et al., 2019).

4.2 Multi-omics Factor Analysis (MOFA)

MOFA and MOFA+ seek common latent factors that explain the greatest variation across omics data layers. MOFA, which is based on factor analysis, can deal with the high dimensionality problem and is computationally efficient. It supports several forms of regularization (sparsity) that improve the interpretability of the model. It is also frequently used for multi-omics integration because it supports partial datasets and handles missing values automatically (Argelaguet et al., 2018). In studies using MOFA, the authors were able to categorize and successfully distinguish subtypes of lymphocytic leukemia using DNA methylation and gene expression data (Argelaguet et al., 2018; Alcala et al., 2019). In addition, two latent variables were found to be linked to survival in large cell neuroendocrine carcinoma patients. MOFA has also shown excellent performance in drug response prediction and patient classification (Argelaguet et al., 2018). MOFA+ supports single-cell datasets and GPU computation, and is reported to be up to 20 times faster than MOFA. MOFA+ supports adjustable sparsity constraints and reconstructs a low-dimensional representation of the data via computationally efficient variational inference, which allows it to describe variation across numerous sample groups and data modalities simultaneously (Argelaguet et al., 2020).

4.3 Sparse Canonical Correlation Analysis (sCCA)

sCCA is an extension of canonical correlation analysis (CCA) that imposes sparsity penalties on the canonical weight vectors for better interpretability. It shrinks the weights of unimportant variables, many of them to zero, and is therefore easier to interpret than CCA, especially for high-dimensional datasets. Although it produces sparse projection vectors with high correlations, a drawback of sCCA is that it ignores the spatial correlation or structural relationships between the input features. In one study, several traditional CCA approaches were compared with an adapted sCCA approach that integrates more than two datasets simultaneously. Applied to gene expression, miRNA, and methylation data, the sCCA variant achieved the highest classification accuracy for breast, kidney, and lung cancers (Cai et al., 2022; Rodosthenous et al., 2020).
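
To make the effect of the sparsity penalty concrete, here is a minimal sketch of one common way to compute a sparse pair of canonical weight vectors: alternating, soft-thresholded power iteration on the cross-product matrix of two centred blocks. This is an assumption-laden illustration (fixed penalties, a single component, identity within-block covariance), not the specific variant evaluated in the cited comparison study, and all names and data are hypothetical.

```python
from math import sqrt

def normalise(vec):
    norm = sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("penalty too strong: every weight was shrunk to zero")
    return [v / norm for v in vec]

def soft_threshold(vec, penalty):
    """Shrink every weight towards zero; small weights become exactly zero."""
    return [(1.0 if v > 0 else -1.0) * max(abs(v) - penalty, 0.0) for v in vec]

def sparse_cca(x, y, penalty_x, penalty_y, n_iter=50):
    """First pair of sparse canonical weights for centred blocks x and y."""
    p, q = len(x[0]), len(y[0])
    # Cross-product matrix C = X^T Y (p x q).
    c = [[sum(xr[j] * yr[k] for xr, yr in zip(x, y)) for k in range(q)] for j in range(p)]
    v = normalise([1.0] * q)
    for _ in range(n_iter):
        u = [sum(c[j][k] * v[k] for k in range(q)) for j in range(p)]   # C v
        u = normalise(soft_threshold(normalise(u), penalty_x))
        v = [sum(c[j][k] * u[j] for j in range(p)) for k in range(q)]   # C^T u
        v = normalise(soft_threshold(normalise(v), penalty_y))
    return u, v

# Toy centred data: 5 samples, 3 features in block x and 2 in block y.
x = [[-2.0, 1.0, 0.1], [-1.0, -1.0, -0.1], [0.0, 0.0, 0.0], [1.0, -1.0, 0.1], [2.0, 1.0, -0.1]]
y = [[-2.0, 1.0], [-1.0, 0.0], [0.0, -2.0], [1.0, 0.0], [2.0, 1.0]]
weights_x, weights_y = sparse_cca(x, y, penalty_x=0.3, penalty_y=0.3)
print(weights_x, weights_y)
```

In this toy example only the first feature of each block keeps a non-zero weight, which is the kind of variable selection that makes sCCA easier to interpret than plain CCA.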

4.4 Multi-omics Late Integration (MOLI)

The MOLI approach, which is frequently used in drug response research, is based on deep learning (Sharifi-Noghabi et al., 2019). MOLI uses a deep neural network as a feature extractor for each omics layer, combines the last hidden layers of these networks, and trains the model with a triplet loss function (Rumelhart et al., 1986). The triplet loss reduces the distances between similar samples and increases the distances between dissimilar samples, thus facilitating the training of the model. Although its name refers to late integration, MOLI is more appropriately categorized as intermediate integration because it learns a joint representation of all omics layers rather than only combining final predictions (Sharifi-Noghabi et al., 2019).
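
The triplet loss itself is short enough to write out. The sketch below assumes the usual margin-based form with squared Euclidean distances on the concatenated per-omics representations; the margin value, the helper names, and the encoded vectors are hypothetical and are not taken from the MOLI implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)

def concatenate(encodings):
    """Join the per-omics representations of one sample into a single vector."""
    return [value for encoding in encodings for value in encoding]

# Hypothetical encoder outputs (expression part, mutation part) for three samples.
anchor = concatenate([[0.0, 0.0], [1.0]])
positive = concatenate([[0.0, 1.0], [1.0]])   # same response class as the anchor
negative = concatenate([[3.0, 4.0], [0.0]])   # different response class

print(triplet_loss(anchor, positive, negative))  # well-separated triplet
print(triplet_loss(anchor, negative, positive))  # roles swapped: large penalty
```

The loss is zero once the positive sample is closer to the anchor than the negative sample by at least the margin, so training effort concentrates on triplets that still violate this ordering.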

4.5 Cancer Drug Response Prediction Using a Recommender System (CaDRReS)

CaDRReS is a matrix factorization-based approach that treats drug response prediction as a regression task. It builds on the strong performance of matrix factorization in recommender systems. The primary premise behind recommender systems is that users who rate one group of items similarly are likely to rate other items similarly as well. This concept has been carried over to multi-omics data integration, where it is thought that a gene’s molecular profile will be similar across many omics layers. CaDRReS can cope with high dimensionality, but it is limited to drug response prediction (Suphavilai et al., 2018).
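
The recommender-system idea can be sketched as a plain matrix factorization of a cell line by drug response matrix with missing entries. This is a generic illustration trained by stochastic gradient descent, not the CaDRReS model itself (which also uses cell line features); the response values, rank, learning rate, and function names are assumptions.

```python
import random

def factorise(responses, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    """Fit responses[i][j] ~ dot(cell[i], drug[j]) on observed entries by SGD."""
    rng = random.Random(seed)
    n_cells, n_drugs = len(responses), len(responses[0])
    cell = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_cells)]
    drug = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_drugs)]
    observed = [(i, j) for i in range(n_cells) for j in range(n_drugs)
                if responses[i][j] is not None]
    for _ in range(epochs):
        for i, j in observed:
            error = responses[i][j] - sum(c * d for c, d in zip(cell[i], drug[j]))
            for f in range(rank):
                c, d = cell[i][f], drug[j][f]
                cell[i][f] += lr * (error * d - reg * c)
                drug[j][f] += lr * (error * c - reg * d)
    return cell, drug

def predict(cell, drug, i, j):
    return sum(c * d for c, d in zip(cell[i], drug[j]))

def rmse(responses, cell, drug):
    errors = [(responses[i][j] - predict(cell, drug, i, j)) ** 2
              for i in range(len(responses)) for j in range(len(responses[0]))
              if responses[i][j] is not None]
    return (sum(errors) / len(errors)) ** 0.5

# Toy response matrix (4 cell lines x 3 drugs); None marks an untested pair.
responses = [[0.9, 0.1, None], [0.8, 0.2, 0.7], [0.1, 0.9, 0.2], [None, 0.8, 0.1]]
cell, drug = factorise(responses)
print("training RMSE:", rmse(responses, cell, drug))
print("predicted response for cell line 0, drug 2:", predict(cell, drug, 0, 2))
```

The untested pairs are filled in from the learned low-dimensional cell line and drug vectors, which is how a recommender system predicts unseen ratings.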

4.6 Heterogeneous Network-Based Method for Drug Response Prediction (HNMDRP)

The HNMDRP approach, which builds similarity networks linking cell lines, drug structures, and drug-target genes, is frequently used for drug response prediction in multi-omics integration. It is based on the assumption that similar cell lines treated with similar drugs should show comparable pharmacological responses. It has been reported that the HNMDRP approach is highly effective in improving the prediction results of drug-target and protein-protein interactions (Zhang et al., 2018).

4.7 Multiple Pairwise Kernels for Drug Bioactivity Prediction (pairwiseMKL)

pairwiseMKL is an advanced version of multiple kernel learning (MKL) in which the term pairwise refers to pairs of samples and drugs, and it has enabled MKL to be used in drug response prediction. It is a time- and memory-efficient approach for learning with multiple pairwise kernels that combines pairwise model training with efficient optimization of the pairwise kernel weights. The underlying algorithm first determines a convex combination of the input pairwise kernels by maximizing a matrix similarity measure between the combined kernel and the response kernel derived from the label values. The prediction function is then learned with a regularized least squares pairwise regression model. To boost predictive capacity, pairwiseMKL integrates different omics data types from various information sources into a single model that is a sparse mixture of the input kernels, and the contribution of each source can be assessed by examining the learned kernel mixture weights (Cichonska et al., 2018).
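
The matrix similarity measure between a candidate kernel and the response kernel can be illustrated with centred kernel alignment, sketched below for small dense matrices. This is only the scoring step under simplifying assumptions; it omits the pairwise (Kronecker) structure and the weight optimization of pairwiseMKL, and the function names and toy kernels are hypothetical.

```python
from math import sqrt

def centre(kernel):
    """Centre a kernel matrix in feature space (double centring)."""
    n = len(kernel)
    row_means = [sum(row) / n for row in kernel]
    col_means = [sum(col) / n for col in zip(*kernel)]
    grand = sum(row_means) / n
    return [[kernel[i][j] - row_means[i] - col_means[j] + grand for j in range(n)]
            for i in range(n)]

def frobenius(a, b):
    """Frobenius inner product of two matrices of equal shape."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def alignment(k1, k2):
    """Centred kernel alignment: cosine similarity between centred kernels."""
    c1, c2 = centre(k1), centre(k2)
    return frobenius(c1, c2) / sqrt(frobenius(c1, c1) * frobenius(c2, c2))

labels = [1.0, 1.0, -1.0, -1.0]
response_kernel = [[a * b for b in labels] for a in labels]
# One kernel that groups samples like the labels do, and one that does not.
k_good = [[1.0, 0.9, 0.1, 0.1], [0.9, 1.0, 0.1, 0.1], [0.1, 0.1, 1.0, 0.9], [0.1, 0.1, 0.9, 1.0]]
k_bad = [[1.0, 0.1, 0.9, 0.1], [0.1, 1.0, 0.1, 0.9], [0.9, 0.1, 1.0, 0.1], [0.1, 0.9, 0.1, 1.0]]
print(alignment(k_good, response_kernel), alignment(k_bad, response_kernel))
```

A kernel whose similarity structure matches the labels scores close to 1, so an alignment-based weighting would favour it in the combined kernel.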

4.8 iCluster, iClusterPlus, and iClusterBayes

iCluster, iClusterPlus, and iClusterBayes form a family of ML techniques based on joint latent variables. The iCluster approach treats the disease subtypes as latent variables, which gives a representation with far fewer dimensions. The latent variables are modeled to gather relevant information from the various omics layers. According to the iCluster approach, the latent variable performs best when it has fewer than ten dimensions (Shen et al., 2009). Building on iCluster, iClusterPlus models various statistical distributions to accommodate discrete data types (Mo et al., 2013). The newest member of the family, iClusterBayes, provides a Bayesian inference procedure and is reported to run six times faster than iClusterPlus (Mo et al., 2018).

4.9 moCluster

moCluster is a joint latent variable-based ML model similar to the iCluster and iClusterPlus approaches (Meng et al., 2016). The primary distinction between moCluster and the iCluster family is the technique used to identify the latent variables: moCluster estimates them by consensus PCA (CPCA) rather than by an expectation-maximization approach. CPCA is an extension of standard PCA that models data organized in several blocks, which map naturally onto the various omics layers. In a study on simulated data, moCluster showed superior clustering performance and was reported to run approximately 100–1000 times faster than iClusterPlus (Meng et al., 2016). Furthermore, unlike iClusterPlus, moCluster was able to distinguish the melanoma cell lines in the NCI-60 cancer cell line panel (Gholami et al., 2013).

4.10 Similarity Network Fusion (SNF)

SNF combines multiple patient similarity networks into a single comprehensive network of patient relationships. To do this, the similarity network of each -omic dataset is decomposed into two networks: one captures the global network structure, that is, the similarity of a patient to all other patients, and the other captures the local network structure, that is, the similarity of a patient to its “K” most similar patients, where “K” may be tuned to the dataset (Mac Aogáin et al., 2021). Wang et al. advise setting “K” to the anticipated number of clusters or, in the absence of this information, to N/10, where N denotes the total number of patients (Wang et al., 2014). By extracting local structure from the various -omic datasets in this way, SNF can reduce noise. The decomposed networks derived from the various datasets are then fused iteratively. This procedure is best understood as the diffusion of similarity information along common edges among the patient similarity networks. An edge gains similarity in the final fused network when the majority of -omic datasets support it and loses similarity when they do not (Narayana et al., 2021).
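
The global/local decomposition and the iterative fusion can be sketched for two tiny networks as follows. This is a simplified illustration: the Gaussian similarity on one-dimensional values, the neighbourhood that excludes the patient itself, the omission of the per-iteration renormalization used in practice, and all names are assumptions rather than the reference implementation.

```python
from math import exp

def similarity(values, scale=0.1):
    """Toy patient similarity from one measurement per patient."""
    return [[exp(-(a - b) ** 2 / scale) for b in values] for a in values]

def global_kernel(w):
    """Similarity of each patient to all others (rows sum to one)."""
    n = len(w)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        off_diag = sum(w[i][k] for k in range(n) if k != i)
        for j in range(n):
            p[i][j] = 0.5 if i == j else w[i][j] / (2.0 * off_diag)
    return p

def local_kernel(w, k):
    """Similarity restricted to the K most similar other patients."""
    n = len(w)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: w[i][j], reverse=True)[:k]
        total = sum(w[i][j] for j in neighbours)
        for j in neighbours:
            s[i][j] = w[i][j] / total
    return s

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def snf(similarities, k=2, n_iter=10):
    """Diffuse each global kernel through the local kernel of the other views."""
    p = [global_kernel(w) for w in similarities]
    s = [local_kernel(w, k) for w in similarities]
    m, n = len(p), len(p[0])
    for _ in range(n_iter):
        updated = []
        for v in range(m):
            others = [p[u] for u in range(m) if u != v]
            mean_other = [[sum(o[i][j] for o in others) / len(others)
                           for j in range(n)] for i in range(n)]
            s_t = [list(col) for col in zip(*s[v])]
            updated.append(matmul(matmul(s[v], mean_other), s_t))
        p = updated
    return [[sum(pv[i][j] for pv in p) / m for j in range(n)] for i in range(n)]

# Two omics views of six patients; patients 0-2 and 3-5 form two groups.
view_1 = similarity([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
view_2 = similarity([0.1, 0.0, 0.3, 0.9, 1.2, 1.0])
fused = snf([view_1, view_2], k=2)
print(round(fused[0][1], 3), round(fused[0][3], 3))
```

Edges supported by both views (within a group) end up with higher fused similarity than edges that neither view supports (between groups).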

4.11 NEighborhood Based Multi-omics Clustering (NEMO)

After constructing a similarity network among samples for each omics data type, NEMO converts each similarity into a relative similarity, which is more comparable across omics types. The integrated network is then obtained by averaging the relative similarity of each sample pair across the networks. Disease subtypes can be identified by applying spectral clustering to this mean similarity network. NEMO requires no iterative optimization and can handle the partial multi-omic data that are ubiquitous in biology and medicine, provided that each pair of samples has at least one omic in which both samples were measured (Cai et al., 2022; Rappoport & Shamir, 2019).
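
The relative-similarity and averaging steps, including the handling of a sample that is missing from one omic, can be sketched as follows. The Gaussian similarity, the neighbourhood that excludes the sample itself, and all names and values are assumptions for illustration; the final spectral clustering step is omitted.

```python
from math import exp

def omic_similarity(profile, scale=0.1):
    """profile maps sample name -> measurement; returns nested similarities."""
    return {a: {b: exp(-(va - vb) ** 2 / scale) for b, vb in profile.items()}
            for a, va in profile.items()}

def relative_similarity(sim, k):
    """Normalise each similarity by the neighbourhood mass of both samples."""
    neighbours = {a: sorted((b for b in sim if b != a),
                            key=lambda b: sim[a][b], reverse=True)[:k] for a in sim}
    mass = {a: sum(sim[a][b] for b in neighbours[a]) for a in sim}
    rel = {a: {} for a in sim}
    for a in sim:
        for b in sim:
            if a == b:
                continue
            value = 0.0
            if b in neighbours[a]:
                value += sim[a][b] / mass[a]
            if a in neighbours[b]:
                value += sim[a][b] / mass[b]
            rel[a][b] = value
    return rel

def average_relative_similarity(profiles, k):
    """Average over the omics in which both samples of a pair were measured."""
    rels = [relative_similarity(omic_similarity(p), k) for p in profiles]
    samples = sorted(set().union(*profiles))
    avg = {}
    for a in samples:
        for b in samples:
            values = [r[a][b] for r in rels if a != b and a in r and b in r[a]]
            if values:
                avg[(a, b)] = sum(values) / len(values)
    return avg

rna = {"p1": 0.0, "p2": 0.1, "p3": 1.0, "p4": 1.1}
protein = {"p1": 0.2, "p2": 0.1, "p3": 0.9}   # p4 was not measured here
avg = average_relative_similarity([rna, protein], k=1)
print(avg[("p1", "p2")], avg[("p1", "p3")], avg[("p3", "p4")])
```

The pair (p3, p4) still receives a similarity because both samples were measured in the RNA data, which is the partial-data condition stated above.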

4.12 Random Walk with Restart for Multi-dimensional Data Fusion (RWRF) and Random Walk with Restart and Neighbor Information-Based Multi-dimensional Data Fusion (RWRNF)

The RWRF and RWRNF approaches to multi-omic integration are based on random walk with restart on multiple networks. Given multiple omics data types, both algorithms first construct a similarity network for each data type over the same set of samples. The corresponding samples in the different similarity networks are then connected to one another to create a multilayer network. Both methods are quite robust to noise. Thanks to the random walk with restart across the networks, they can capture potential connections between samples with a high degree of accuracy. They also do not discard the original information in the data when a pair of samples has a high similarity score in a single network and low scores in the others. The random walk process allows these approaches to take full advantage of the topology of each similarity network, which carries information about a different data type (Wen et al., 2021).
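
A minimal sketch of random walk with restart on a two-layer network is given below. It follows the general recipe described above (one similarity network per omic, copies of the same sample linked across layers) but the coupling weight, restart probability, seeding scheme, and function names are illustrative assumptions rather than the published RWRF or RWRNF settings.

```python
def multiplex(layers, coupling=1.0):
    """Stack similarity networks and link the copies of each sample across layers."""
    n, m = len(layers[0]), len(layers)
    supra = [[0.0] * (n * m) for _ in range(n * m)]
    for a, layer in enumerate(layers):
        for i in range(n):
            for j in range(n):
                supra[a * n + i][a * n + j] = layer[i][j]
            for b in range(m):
                if b != a:
                    supra[a * n + i][b * n + i] = coupling
    return supra

def random_walk_with_restart(adj, start, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) W p + r p0 with a column-normalised W."""
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    p = start[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * start[i]
               for i in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Two similarity networks over the same four samples (0-1 and 2-3 are close).
layer_1 = [[0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.01, 0.0], [0.0, 0.01, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
layer_2 = [[0.0, 0.8, 0.0, 0.0], [0.8, 0.0, 0.02, 0.0], [0.0, 0.02, 0.0, 0.9], [0.0, 0.0, 0.9, 0.0]]
supra = multiplex([layer_1, layer_2])
n_samples, n_layers = 4, 2
seed_sample = 0
start = [1.0 / n_layers if node % n_samples == seed_sample else 0.0
         for node in range(n_samples * n_layers)]
steady = random_walk_with_restart(supra, start)
# Fused affinity of every sample to the seed: sum over its copies in all layers.
scores = [sum(steady[a * n_samples + i] for a in range(n_layers)) for i in range(n_samples)]
print([round(s, 3) for s in scores])
```

Samples that are close to the seed in the networks accumulate more of the walker's steady-state probability, so `scores` acts as a fused similarity to the seed sample.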

5 Conclusion

This chapter has presented an in-depth review of the basic strategies and ML algorithms used to integrate multi-omics data in the medical field. Data from many omics approaches are essential for understanding the intricate dysregulation associated with disease phenotypes. Despite the large quantity of data that has been collected and the number of studies that examine multi-omics data, few efforts have been made to develop ML systems that can integrate omics automatically. Research on this subject places significant emphasis on computational methodologies; nevertheless, there is a shortage of work that reviews multi-omics technologies and data formats. All of the methods discussed in this chapter can be accessed through a command line interface, but such tools are not user-friendly for researchers who are unfamiliar with computation or coding. As a result, the development of tools with graphical user interfaces, as a means of enhancing healthcare delivery, will shape the path that multi-omics integration strategies take in the years to come.

References Abdi, H., Williams, L. J., & Valentin, D. (2013). Multiple factor analysis: Principal component analysis for multitable and multiblock data sets. Wiley Interdisciplinary Reviews: Computational Statistics, 5, 149–179. Ahmed, Z. (2020). Practicing precision medicine with intelligently integrative clinical and multiomics data analysis. Human Genomics, 14, 1–5. Ahmed, Z. (2022). Precision medicine with multi-omics strategies, deep phenotyping, and predictive analysis. Progress in Molecular Biology and Translational Science, 190, 101–125. Akbulut, S., Yagin, F. H., & Colak, C. (2022). Prediction of breast cancer distant metastasis by artificial intelligence methods from an epidemiological perspective. Istanbul Medical Journal, 23, 210–215. Alcala, N., Leblay, N., Gabriel, A., Mangiante, L., Hervás, D., Giffon, T., Sertier, A.-S., Ferrari, A., Derks, J., & Ghantous, A. (2019). Integrative and comparative genomic analyses identify clinically relevant pulmonary carcinoid groups and unveil the supra-carcinoids. Nature Communications, 10, 1–21. Andrieux, G., & Chakraborty, S. (2021). Integration of multi-omics techniques in cancer. Frontiers in Genetics, 12, 733965. Argelaguet, R., Velten, B., Arnol, D., Dietrich, S., Zenz, T., Marioni, J. C., Buettner, F., Huber, W., & Stegle, O. (2018). Multi-Omics Factor Analysis—A framework for unsupervised integration of multi-omics data sets. Molecular Systems Biology, 14, e8124. Argelaguet, R., Arnol, D., Bredikhin, D., Deloro, Y., Velten, B., Marioni, J. C., & Stegle, O. (2020). MOFA+: A statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biology, 21, 1–17. Asselt, A. J. V., & Ehli, E. A. (2022). Whole-genome genotyping using DNA microarrays for population genetics. Estrogen receptors: Methods and Protocols, 269–287. Athieniti, E., & Spyrou, G. M. (2022). A guide to multi-omics data collection and integration for translational medicine. 
Computational and Structural Biotechnology Journal, 21, 134–149. Breschi, A., Gingeras, T. R., & Guigó, R. (2017). Comparative transcriptomics in human and mouse. Nature Reviews Genetics, 18, 425–440. Burney, I. A., & Lakhtakia, R. (2017). Precision medicine: Where have we reached and where are we headed? Sultan Qaboos University Medical Journal, 17, e255. Cai, Z., Poulos, R. C., Liu, J., & Zhong, Q. (2022). Machine learning for multi-omics data integration in cancer. iScience, 103798, 103798. Cichonska, A., Pahikkala, T., Szedmak, S., Julkunen, H., Airola, A., Heinonen, M., Aittokallio, T., & Rousu, J. (2018). Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics, 34, i509–i518. Cisek, K., Krochmal, M., Klein, J., & Mischak, H. (2016). The application of multi-omics and systems biology to identify therapeutic targets in chronic kidney disease. Nephrology Dialysis Transplantation, 31, 2003–2011. Cox, J., & Mann, M. (2007). Is proteomics the new genomics? Cell, 130, 395–398. Dong, Z., & Chen, Y. (2013). Transcriptomics: Advances and approaches. Science China Life Sciences, 56, 960–967. Fan, J., Slowikowski, K., & Zhang, F. (2020). Single-cell transcriptomics in cancer: Computational challenges and opportunities. Experimental & Molecular Medicine, 52, 1452–1465. Futreal, P. A., Kasprzyk, A., Birney, E., Mullikin, J. C., Wooster, R., & Stratton, M. R. (2001). Cancer and genomics. Nature, 409, 850–852. Gholami, A. M., Hahne, H., Wu, Z., Auer, F. J., Meng, C., Wilhelm, M., & Kuster, B. (2013). Global proteome analysis of the NCI-60 cell line panel. Cell Reports, 4, 609–620. Graves, P. R., & Haystead, T. A. (2002). Molecular biologist’s guide to proteomics. Microbiology and Molecular Biology Reviews, 66, 39–63. Gray, I. C., Campbell, D. A., & Spurr, N. K. (2000). Single nucleotide polymorphisms as tools in human genetics. Human Molecular Genetics, 9, 2403–2408.

Hanash, S. (2003). Disease proteomics. Nature, 422, 226–232. Hasin, Y., Seldin, M., & Lusis, A. (2017). Multi-omics approaches to disease. Genome Biology, 18, 1–15. Horgan, R. P., & Kenny, L. C. (2011). ‘Omic’ technologies: Genomics, transcriptomics, proteomics and metabolomics. The Obstetrician & Gynaecologist, 13, 189–195. Hu, Y., An, Q., Sheu, K., Trejo, B., Fan, S., & Guo, Y. (2018). Single cell multi-omics technology: Methodology and application. Frontiers in Cell and Developmental Biology, 6, 28. Huang, S., Chaudhary, K., & Garmire, L. X. (2017). More is better: Recent progress in multi-omics data integration methods. Frontiers in Genetics, 8, 84. Jaccard, E., Cornuz, J., Waeber, G., & Guessous, I. (2018). Evidence-based precision medicine is needed to move toward general internal precision medicine. Journal of General Internal Medicine, 33, 11–12. Jung, G. T., Kim, K.-P., & Kim, K. (2020). How to interpret and integrate multi-omics data at systems level. Animal Cells and Systems, 24, 1–7. Kumar, V., Garg, V. K., Kumar, S., & Biswas, J. K. (2022). Omics for environmental engineering and microbiology systems. CRC Press. Liu, E., Li, L., & Cheng, L. (2019). Gene regulatory network review. In Reference module in life sciences. Elsevier. Lowe, R., Shirley, N., Bleackley, M., Dolan, S., & Shafee, T. (2017). Transcriptomics technologies. PLoS Computational Biology, 13, e1005457. Mac Aogáin, M., Narayana, J. K., Tiew, P. Y., Ali, N., Yong, V. F. L., Jaggi, T. K., Lim, A. Y. H., Keir, H. R., Dicker, A. J., & Thng, K. X. (2021). Integrative microbiomics in bronchiectasis exacerbations. Nature Medicine, 27, 688–699. Meng, C., Helm, D., Frejno, M., & Kuster, B. (2016). moCluster: Identifying joint patterns across multiple omics data sets. Journal of Proteome Research, 15, 755–765. Misra, B. B., Langefeld, C., Olivier, M., & Cox, L. A. (2019). Integrated omics: Tools, advances and future approaches. Journal of Molecular Endocrinology, 62, R21–R45. 
Mo, Q., Wang, S., Seshan, V. E., Olshen, A. B., Schultz, N., Sander, C., Powers, R. S., Ladanyi, M., & Shen, R. (2013). Pattern discovery and cancer gene identification in integrated cancer genomic data. Proceedings of the National Academy of Sciences, 110, 4245–4250. Mo, Q., Shen, R., Guo, C., Vannucci, M., Chan, K. S., & Hilsenbeck, S. G. (2018). A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics, 19, 71–86. Narayana, J. K., Mac Aogáin, M., Ali, N. A. t. B. M., Tsaneva-Atanasova, K., & Chotirmall, S. H. (2021). Similarity network fusion for the integration of multi-omics and microbiomes in respiratory disease. European Respiratory Journal, 58, 2101016. Olivier, M., Asmis, R., Hawkins, G. A., Howard, T. D., & Cox, L. A. (2019). The need for multi-omics biomarker signatures in precision medicine. International Journal of Molecular Sciences, 20, 4781. Pal, A. (2022). DNA Microarray. In Protocols in advanced genomics and allied techniques (pp. 221–243). Springer. Picard, M., Scott-Boyer, M.-P., Bodein, A., Périn, O., & Droit, A. (2021). Integration strategies of multi-omics data for machine learning analysis. Computational and Structural Biotechnology Journal, 19, 3735–3746. Putri, S. P., Nakayama, Y., Matsuda, F., Uchikata, T., Kobayashi, S., Matsubara, A., & Fukusaki, E. (2013). Current metabolomics: Practical applications. Journal of Bioscience and Bioengineering, 115, 579–589. Rafalski, A. (2002). Applications of single nucleotide polymorphisms in crop genetics. Current Opinion in Plant Biology, 5, 94–100. Rappoport, N., & Shamir, R. (2018). Multi-omic and multi-view clustering algorithms: Review and cancer benchmark. Nucleic Acids Research, 46, 10546–10562. Rappoport, N., & Shamir, R. (2019). NEMO: Cancer subtyping by integration of partial multi-omic data. Bioinformatics, 35, 3348–3356.


Reel, P. S., Reel, S., Pearson, E., Trucco, E., & Jefferson, E. (2021). Using machine learning approaches for multi-omics data analysis: A review. Biotechnology Advances, 49, 107739. Rodosthenous, T., Shahrezaei, V., & Evangelou, M. (2020). Integrating multi-OMICS data through sparse canonical correlation analysis for the prediction of complex traits: A comparison study. Bioinformatics, 36, 4616–4625. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. Ryan, D., & Robards, K. (2006). Metabolomics: The greatest omics of them all? Analytical Chemistry, 78, 7954–7958. Santiago-Rodriguez, T. M., & Hollister, E. B. (2021). Multi ‘omic data integration: A review of concepts, considerations, and approaches. Proceedings of the Seminars in Perinatology, 45, 151456. Segers, K., Declerck, S., Mangelings, D., Heyden, Y. V., & Eeckhaut, A. V. (2019). Analytical techniques for metabolomic studies: A review. Bioanalysis, 11, 2297–2318. Sharifi-Noghabi, H., Zolotareva, O., Collins, C. C., & Ester, M. (2019). MOLI: Multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics, 35, i501–i509. Shen, R., Olshen, A. B., & Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25, 2906–2912. Singh, A., Shannon, C. P., Gautier, B., Rohart, F., Vacher, M., Tebbutt, S. J., & Lê Cao, K.-A. (2019). DIABLO: An integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics, 35, 3055–3062. Subramanian, I., Verma, S., Kumar, S., Jere, A., & Anamika, K. (2020). Multi-omics data integration, interpretation, and its application. Bioinformatics and Biology Insights, 14, 1177932219899051. Suphavilai, C., Bertrand, D., & Nagarajan, N. (2018). Predicting cancer drug response using a recommender system. Bioinformatics, 34, 3907–3914. Tebani, A., Afonso, C., Marret, S., & Bekri, S. (2016). Omics-based strategies in precision medicine: Toward a paradigm shift in inborn errors of metabolism investigations. International Journal of Molecular Sciences, 17, 1555. Tini, G., Marchetti, L., Priami, C., & Scott-Boyer, M.-P. (2019). Multi-omics integration—A comparison of unsupervised clustering methodologies. Briefings in Bioinformatics, 20, 1269–1279. Tyers, M., & Mann, M. (2003). From genomics to proteomics. Nature, 422, 193–197. Wang, B., Mezlini, A. M., Demir, F., Fiume, M., Tu, Z., Brudno, M., Haibe-Kains, B., & Goldenberg, A. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11, 333–337. Wen, Y., Song, X., Yan, B., Yang, X., Wu, L., Leng, D., He, S., & Bo, X. (2021). Multi-dimensional data integration algorithm based on random walk with restart. BMC Bioinformatics, 22, 1–22. Zhang, F., Wang, M., Xi, J., Yang, J., & Li, A. (2018). A novel heterogeneous network-based method for drug response prediction in cancer cell lines. Scientific Reports, 8, 1–9.

Multimodal Methods for Knowledge Discovery from Bulk and Single-Cell Multi-Omics Data

Yue Li, Gregory Fonseca, and Jun Ding

1 Introduction

Cells are the fundamental units of most lifeforms and change constantly in response to the surrounding environment. The cellular dynamics underlying those changes are driven by varying biomolecules in the cell (Hasin et al., 2017). Because of the limits of sequencing biotechnologies, cellular states are often profiled based on a single type of biomolecule (e.g., quantification of RNA molecules with RNA-seq (Hughes et al., 2020)). Although these unimodal cellular state measurements have led to significant advancements in studying the cellular dynamics underlying various biological processes such as cell differentiation (Chu et al., 2016), disease progression (van Galen et al., 2019), and treatment response (Meyer et al., 2011), a single kind of biomolecule cannot fully represent the cellular state. Therefore, multi-omics measurements (i.e., measurements of multiple types of biomolecules in the cells such as RNA, DNA, protein, and other small molecules) are essential to comprehensively depict cellular states and can yield a deeper understanding of the mechanisms underlying cellular state changes in many biological processes (Subramanian et al., 2020), which is often indispensable for new biomedical discoveries, diagnostics, and therapeutics. For example, multi-omics measurements and data analyses have led

Yue Li, Gregory Fonseca, and Jun Ding contributed equally to this work.
Y. Li: School of Computer Science, McGill University, Montréal, QC, Canada
G. Fonseca · J. Ding (✉): Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montréal, QC, Canada; e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_4


to the discovery of novel regulator IRF-1 in lung cell differentiation (Ding et al., 2019), improved classification in chronic kidney diseases (Eddy et al., 2020), and identification of predictive biomarkers for chemotherapy treatment response (Taber et al., 2020). The advancement of profiling biotechnology for cellular states comes not only from multi-omics but also from the improved resolution (single-cell). Conventional bulk sequencing (e.g., RNA-seq) profiles the average biomolecule level (e.g., gene expression) of all cells within the same sample and thus it cannot deconvolve the cellular heterogeneity in the sample. Consequently, bulk sequencing is limited in studying biomedical problems with heterogeneous cell populations that could not be distinguished by the low-resolution bulk measurements (Li & Wang, 2021). The rapidly-developing single-cell sequencing technologies resolve this issue with the capability of quantifying biomolecules from individual cells in the sample. This allows for the identification of varying cell populations (and subpopulations) in the sample and further explorations (e.g., Gene regulatory network, biomarkers) associated with each of the cell populations. However, single-cell sequencing data is much more high-dimensional, large-scale, sparse, and noisy compared to the bulk sequencing data. Therefore, even for the same tasks (e.g., reconstructing the gene regulatory networks underlying the studied biological process), it is not ideal to port the existing methods developed for conventional bulk sequencing data. As such, new computational models need to be developed specifically for single-cell data. For example, reconstructing gene regulatory networks for single-cell data would require the identification of different cell populations (e.g., via cell clustering and annotation). 
While this process can be done experimentally by Fluorescence-Activated Cell Sorting (FACS) (Basu et al., 2010), it is expensive and limited to known marker genes. Computational methods for accurate cell clustering provide much more cost-effective solutions and also apply to cases where cell sorting is not feasible. In recent years, the availability of multi-omics data (particularly at the single-cell level) has been ever-increasing. On the one hand, multi-omics data could bring advancements within various biomedical applications because they provide enriched information about the cellular states. On the other hand, they also introduce new computational challenges in integrating and modeling the omics data from different modalities. Numerous multimodal methods that can analyze bulk/single-cell multi-omics data have been developed to answer challenging questions in various applications such as dimensionality reduction and clustering (e.g., for cell population identification) (Jia et al., 2018), reconstruction of underlying regulatory networks (Ranzoni et al., 2021), and biomarker discovery (Bravo-Merodio et al., 2019), where the first task (dimensionality reduction and clustering) often serves as the cornerstone for the subsequent analyses. As unimodal measurements (e.g., RNA-seq data) are still the most abundant owing to the relatively mature technique and lower cost (Perkel et al., 2021), computational models that analyze unimodal genomics data still predominate in the research community. Therefore, many researchers may be unaware of the critical differences between the emerging multimodal methods and their long-existing unimodal counterparts and thus may not find appropriate methods for their single-cell multi-omics data analysis needs.


In this chapter, we give a broad overview of several popular multimodal methods for analyzing bulk and single-cell multi-omics datasets in each of the common application scenarios (i.e., clustering, gene regulatory network inference, and biomarker discovery) (Fig. 2). We also discuss the characteristics of these methods to guide users in choosing appropriate methods for their specific research scenarios (e.g., the datasets to analyze and the biomedical question to answer). The remaining sections are organized as follows. First, we briefly go over various cellular state profiling experiments and the associated multi-omics data. Second, we discuss multimodal dimensionality reduction and clustering methods, which usually serve as the basis for subsequent analysis. Third, we review methods that integrate multi-omics data to reveal gene regulatory networks. Finally, we discuss computational models for multi-omic biomarker discovery.

2 Description of Various Omics Datasets

The era of next-generation sequencing (NGS) technologies has resulted in the rapid development of novel genomic and transcriptomic techniques to study the genome in a large-scale and unbiased way. In this chapter, we briefly discuss the variety of available techniques and the methods of analysis (Fig. 1). The most common NGS application is the study of RNA expression. While RNA sequencing typically focuses on mRNA expression (called the transcriptome) via polyA selection or ribosomal depletion, techniques have been developed to focus on different aspects of mRNA as well as other RNAs (Fig. 1). Several techniques have been developed to specifically target the 5’ end of RNA: capped-small RNA-seq (a technology which targets the 5’meG of RNA) (Duttke et al., 2019), Global Run-On sequencing (GRO-seq, 5’ GRO-seq and GRO-CAP, which target elongating RNAs) (Heinz et al., 2010), and START-seq (which uses nuclear RNA) (Nechaev et al., 2010; Scruggs et al., 2015). These techniques allow for the precise identification of transcriptional start sites as well as the isolation and quantification of nascent RNA (which may be used to determine transcriptional pausing and RNA turnover rates) and unstable RNAs such as enhancer RNAs (to determine enhancer activity), miRNAs, and lncRNAs. It is also important to note that sequencing technologies for lncRNAs and miRNAs are typically best suited to discovery because of the sequencing depth necessary to detect them adequately. Microarrays still stand as the gold standard for known non-mRNA species.

2.1 ChIP-seq

Antibody-based techniques for discovering DNA (ChIP) or RNA (RIP) binding sequences of proteins provide an additional level of depth to NGS technologies. ChIP-sequencing technologies rely on antibody targeting of proteins to identify


Fig. 1 Overview of omic techniques in gene expression. (A) Discovery of RNA expression is done using various RNA sequencing techniques, which can target specific RNA species or total RNA. (B) Detection of open regions of the chromatin is typically done with ATAC. (C) Detection of protein binding on the DNA, including transcription factors, histones and indirect interactions, is done using ChIP. (D) Discovery of 3D interactions between promoters, enhancers and other regulatory regions is often done using Hi-C-based techniques, in which regions are crosslinked together. (E) Large-scale detection of protein expression is done using several technologies, which may be general or targeted to specific groups of proteins

the specific genomic binding patterns of DNA-binding or DNA-adjacent proteins (Fig. 1). This is most common in the identification of histones and histone modifications, which can be used to interpret the local activation state of the chromatin (e.g., H3K27 acetylation is positively correlated and H3K27me3 is negatively correlated with transcriptional activity, and H3K9 trimethylation is associated with heterochromatin, among others reviewed in Campos & Reinberg, 2009; Lawrence et al., 2016; Santoro & Dulac, 2015; Patel & Wang, 2013). Using these


Fig. 2 Overview of multimodal methods. The first row shows commonly seen multi-omics datasets; The second to fourth panels demonstrate the multimodal methods for Dimension reduction/Clustering (second row), Gene Regulatory network inference (third row), and biomarker discovery (fourth row)


histone modifications, we can interpret the context of RNA expression profiles and even predict future RNA expression. Further, we can interpret the function of other DNA-binding proteins, mainly transcription factors (TFs) (Fig. 1). Through ChIP-seq of transcription factors, many of the DNA sequences that recruit TF binding, called Transcription Factor Binding Sites (TFBS), have been identified (Khan et al., 2018; Sandelin et al., 2004). TFBS can be viewed as the language of DNA control and expression. While ChIP involves antibody targeting, the sequencing aspect of ChIP requires chromatin cleavage to release small pieces of bound DNA for sequencing. This is typically done using sonication (mostly for TFs) or nucleases. Most recently, a ChIP-seq-based protocol using a protein A-bound nuclease (CUT&RUN) has been developed to improve quality and lower background (Skene & Henikoff, 2017). Using these methodologies, protein binding and chromatin context may be studied genome-wide, with only the quality of the antibodies and access to cells as the limiting factors.

2.2 ATAC-seq

While ChIP technologies have produced significant advancements in the understanding of chromatin context and transcription control, they are significantly limited by two main factors: the availability of quality antibodies and access to sufficient biological material (cells or tissue), which make it difficult to study low-quantity cell populations or novel proteins. To fill this need, a transposon-based technique called Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) was developed (Buenrostro et al., 2013) (Fig. 1). While previous open chromatin sequencing methods, including MNase-seq (Johnson et al., 2006; Klein & Hainer, 2020), DNase-seq (Boyle et al., 2008) and FAIRE-seq (Giresi et al., 2007), are technically challenging and may require large amounts of starting material, ATAC-seq is a simple protocol that requires fewer than 100,000 cells in total (down to a single cell) (Baek & Lee, 2020). ATAC-seq uses a mutated and highly processive Tn5 transposase to integrate a known sequence into the DNA, which may then be isolated by PCR amplification (Buenrostro et al., 2013). Importantly, Tn5 transposase is unable to insert DNA into areas that are ‘protected’ by protein binding, typically nucleosomes. As such, the readout of ATAC-seq is the ‘open’ or accessible chromatin that the cell is actively using. This gives a broad transcriptional readout of chromatin context for cell populations that would otherwise be difficult to study using other methods. It has also allowed ATAC-seq to be used at the single-cell level in high-throughput assays, which can give a context for transcription factor activity at a genome-wide level for each single cell in a complex tissue. Unsurprisingly, ATAC-seq has quickly become one of the most utilized genomics techniques (Yan et al., 2020).


2.3 Hi-C

The study of genomics is further expanded by adding the layer of the spatial organization of the genome using chromosome conformation capture techniques (Grob & Cavalli, 2018; Mccord et al., 2020). In short, chromosome conformation capture techniques use DNA fragmentation and re-ligation of spatially close DNA fragments to infer trans (either long-range or short-range) interactions between DNA regions (Grob & Cavalli, 2018; Mccord et al., 2020) (Fig. 1). These regions usually define large, interconnected regions of DNA called Topologically Associating Domains (TADs) or interactions between enhancers and promoters (Pombo & Dillon, 2015; Yu & Ren, 2017). The first genome-wide chromosome conformation capture technique to be used widely was Hi-C (Grob & Cavalli, 2018; Mccord et al., 2020). Hi-C involves crosslinking all the chromatin in the nucleus, followed by partial digestion of the DNA with sonication or restriction enzymes (Fig. 1). The DNA is then re-ligated randomly, which allows fragmented DNA ends to ligate non-specifically with spatially close DNA. The output is a single sequence read composed of two DNA sequences that are far apart on the DNA sequence but spatially close in the nucleus. However, the vast majority of interactions detected by Hi-C are from sequences which are either near each other on the DNA sequence or represent TADs (Grob & Cavalli, 2018; Mccord et al., 2020; Pal et al., 2019). Oligonucleotide capture methods, including Capture-Hi-C (Jäger et al., 2015) and HiCap (Sahlén et al., 2015), were developed to enrich Hi-C libraries for promoter-enhancer interactions. Here, oligonucleotides are used to concentrate regions of interest (including all mammalian promoters) to maximize the information of interest in sequencing. However, these methods are not truly genome-wide; rather, they are cost-saving methods for isolating low-quantity information. 
To produce genome-wide information regarding chromosomal interactions between promoters and enhancers, two methods that use antibody targeted Hi-C were developed, ChIA-PET (Wei et al., 2006) and HiChIP (Mumbach et al., 2016). These methods use Hi-C re-ligated DNA libraries as input for ChIP-sequencing. The result is spatially interacting regions with a specific DNA-associated protein. Using this method with histones associated with promoters would allow a genome-wide study of promoter long-range DNA interactions and will be important in the study of promoter-enhancer interactions.

2.4 Mass Spectrometry for Proteomics

While nucleic acid sequencing has seen a renaissance in recent years due to the vast improvement of sequencing technologies, large-scale proteomics has also improved in detection capability and specificity (Fig. 1). This has been led by the continued refinement of the oldest sequencing technology, mass spectrometry (Macklin et al., 2020; Noberini et al., 2016). Mass spectrometry has been greatly advanced by combining it with other technologies including


liquid chromatography (LC), matrix-assisted laser desorption/ionization (MALDI), and gas chromatography (GC) (Macklin et al., 2020). Indeed, mass spectrometry may even be used spatially on fresh frozen slides, a technique called MALDI mass spectrometry imaging (Buchberger et al., 2018; Caughlin et al., 2018). More recently, antibody-based technologies have been developed for large-scale protein detection. The most widespread application is CyTOF. In CyTOF mass cytometry, antibodies labeled with heavy metal isotopes are used to label samples. The quantity of protein may then be indirectly inferred from mass cytometry detection of the targeted labeled antibody (Spitzer & Nolan, 2016; Tracey et al., 2021). Similarly, barcoded antibody techniques such as CITE-seq and REAP-seq have been developed on the same principle (Peterson et al., 2017; Stoeckius et al., 2017). These antibodies contain a predetermined and discrete sequence which may then be read out via next-generation sequencing technologies. These technologies have allowed high-sensitivity detection of proteins (up to 50 and 200, respectively, at the time of this review) down to a single cell. Lastly, similar to the use of antibodies, aptamer-based detection systems allow for the rapid identification of thousands of proteins (Chen et al., 2021; Zhou et al., 2021). Aptamers are stable single-stranded nucleic acid or peptide sequences that are able to recognize and bind to specific protein sequences. Binding may then be interpreted indirectly, similarly to antibody binding, to infer protein expression. These technologies are capable of detecting thousands of proteins simultaneously.

2.5 Single-Cell Multi-Omic Profiling

While these technologies are all used for bulk sequencing, and many for single-cell sequencing, the development of laser capture microdissection technologies allows for spatial sequencing. Laser capture microdissection allows the isolation of discrete regions in organs or tissue, which may then be sequenced using single-cell technologies such as RNA, ATAC, ChIP, and CyTOF (Aguilar-Bravo & Sancho-Bru, 2019; Amatori et al., 2014; Civita et al., 2019; Espina et al., 2006; Herrera et al., 2020). The addition of spatial information has allowed for more precise RNA and protein data in heterogeneous tissues, such as during SARS-CoV-2 infection in COVID-19 patients (Delorey et al., 2021; Desai et al., 2020).

3 Multimodal Methods for Dimensionality Reduction and Clustering

Simultaneous modeling of multiple omics has become feasible with biotechnological advances at both the bulk level and single-cell resolution. These omics include proteomics, methylation, chromatin interactions, Assay for Transposase-Accessible Chromatin (scATAC), and so on. While earlier multi-omic modeling was applied to bulk data (e.g., The Cancer Genome Atlas (TCGA) (Weinstein et al., 2013)), recent techniques focus on the integration of multi-omics measured at single-cell resolution. The main difference between integrating bulk and single-cell multi-omics lies in the sheer sample size. With bulk multi-omics, the sample size is typically under 1000 samples. For example, one of the largest multi-omic datasets is RosMap for Alzheimer’s Disease, with multiple omics measured in the same 800 samples (De Jager et al., 2018); TCGA hosts multi-omic data of over 1000 breast cancer tumor samples. As a result, the computational techniques need to take into account the nature of the “broad” data, i.e., high-dimensional features (e.g., 20,000 genes, hundreds of thousands of genomic regions or methylation sites) and a relatively much smaller sample size. The linear techniques we review below can apply to both bulk and single-cell data. The deep learning technique of autoencoders is more suitable for modeling single-cell data because of the sufficient sample size of single cells. While we focus on the linear algebraic aspects of the methods, one can impose Bayesian priors over the latent factors and an appropriate likelihood over the multi-omic data to properly approximate their posterior distributions (Fig. 3).

Fig. 3 Multi-modal methods for dimensionality reduction and clustering. A number of methods could be used to factorize the input data matrix from each modality (e.g., NMF or deep learning such as an autoencoder). Once the factorized matrices are learned from the input (often in a reduced space), a variety of approaches (e.g., CCA, multi-view, and deep learning methods) could be employed to aggregate the factorized matrices from multiple modalities for clustering.


3.1 Non-negative Matrix Factorization

Non-negative matrix factorization (NMF) is one of the most popular methods in multi-omic integration (Lee & Seung, 1999). The idea of NMF is to decompose a samples-by-features matrix, say $X$, into a samples-by-factors matrix (i.e., the loading matrix), say $H$, and a factors-by-features matrix (i.e., the basis matrix), say $W$, with the constraint that both matrices contain non-negative values: $X = HW$. Compared to more classic approaches such as the singular value decomposition, the non-negativity of NMF offers greater model interpretability, as one can assume that the whole is the sum of its parts. Learning NMF can be done by minimizing the reconstruction loss

$$H^*, W^* \leftarrow \arg\min_{H,W} \|X - HW\|^2 \quad \text{s.t. } H_{d,k}, W_{k,j} \ge 0 \;\; \forall d, k, j.$$

Here the reconstruction loss is specified as the sum of squared errors, which is equivalent to assuming a Gaussian data likelihood. This can be changed to another loss function to better reflect the data likelihood. For example, one can maximize a Poisson log-likelihood w.r.t. the loading and basis matrices to model discrete count sequencing data:

$$H^*, W^* \leftarrow \arg\max_{H,W} \sum_{d,j} \left( X_{d,j} \log (HW)_{d,j} - (HW)_{d,j} \right) \quad \text{s.t. } H_{d,k}, W_{k,j} \ge 0 \;\; \forall d, k, j.$$

A popular algorithm to learn NMF is the multiplicative update (Seung & Lee, 2001), which adapts the learning rates of the first-order gradient of the reconstruction loss so that the negative terms in the update equation cancel out. Other learning techniques impose an L1 or L2 penalty upon one of the two factorized matrices to achieve sparsity over their entries, which avoids over-fitting and further improves interpretability. Adapting NMF to multi-omic data is fairly straightforward, provided that the omics are measured in the same samples (Zhang et al., 2011; Welch et al., 1887). For example, suppose we observe mRNA-seq over $M_1$ genes and ATAC-seq over $M_2$ genomic regions for the same set of $N$ samples. We can formulate these two data objects as data matrices $X_1$ and $X_2$ with respective dimensions $N \times M_1$ and $N \times M_2$. We can jointly factorize the two matrices by linking them via a common loading matrix $H$: $(X_1, X_2) = H (W_1, W_2)$, where $(X_1, X_2)$ is the combined $N \times (M_1 + M_2)$ data matrix and $(W_1, W_2)$ is the combined $K \times (M_1 + M_2)$ basis matrix. Learning the factorized matrices is carried out in a similar fashion to the single-modal NMF, except that the loss is the sum of two terms, one per modality:

$$H^*, W_1^*, W_2^* \leftarrow \arg\min_{H,W_1,W_2} \|X_1 - HW_1\|^2 + \|X_2 - HW_2\|^2 \quad \text{s.t. } H_{d,k}, W_{1,k,j}, W_{2,k,j} \ge 0 \;\; \forall d, k, j.$$

Extending the above multi-view NMF to more than two omics is similar and omitted here. Additionally, one can incorporate prior biological knowledge into the NMF. Specifically, one way to incorporate such information is to introduce “reward” terms into the loss function (Zhang et al., 2011):

$$H^*, W_1^*, W_2^* \leftarrow \arg\min_{H, W_1, W_2 \ge 0} \sum_{i=1,2} \|X_i - HW_i\|^2 + \gamma_1 \|H\|_F^2 - \lambda_1 \mathrm{Tr}(W_1 B W_2^T) - \lambda_2 \mathrm{Tr}(W_1 A W_1^T) + \gamma_2 \Big( \sum_{j} \|h_j\|_1^2 + \sum_{j'} \|h_{j'}\|_1^2 \Big)$$


where matrix $A$ corresponds to interactions between features from the same omic 1 and matrix $B$ corresponds to interactions between omic 1 and omic 2. For example, $A$ can be an adjacency matrix that records known regulator-target information between transcription factors and genes; $B$ can incorporate in-cis promoter-gene or enhancer-gene relationships between ATAC peaks and genes. Another way to incorporate prior information, such as gene pathway information from MSigDB, for clustering and cell deconvolution is to treat the prior knowledge as one of the basis matrices, with the following loss function as developed in PLIER (Mao et al., 2019):

$$\|X - HW\|^2 + \lambda_1 \|W - UC\|^2 + \lambda_2 \|H\|_F^2 + \lambda_3 \|U\|_1$$

where $U$ is the factors-by-pathways matrix and $C$ is the known pathways-by-genes matrix. Learning matrices $H$ and $U$ can be done by gradient descent or multiplicative updates. Extending the model to multi-omic data, with one or more omics leveraging the prior information, is also straightforward. CoupledNMF (Duren et al., 2018) is another NMF-based multimodal method for clustering multi-omics data. To couple two matrix factorizations from different modalities, Duren et al. introduced a term $\mathrm{tr}(W_2^T A W_1^T)$, where $A$ is a “coupling matrix.” The inference of $A$ varies between application scenarios. However, a situation may arise where it is possible to identify a subset of features in one sample that is linearly predictable from the features measured in the other sample. In such a situation, we can take $A$ to be the matrix representation of the linear prediction operator. This method of integrative single-cell genomics analysis enables the joint clustering of single-cell multi-omics data (e.g., single-cell RNA-sequencing and single-cell ATAC-sequencing data).
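To make the shared-loading factorization of this section concrete, the following NumPy sketch applies Lee-Seung-style multiplicative updates to the two-modality squared loss $\|X_1 - HW_1\|^2 + \|X_2 - HW_2\|^2$. It is a minimal illustration under stated assumptions, not the implementation of any of the cited methods: the function name `joint_nmf`, the random initialization, the iteration count, and the small constant `eps` are hypothetical choices, and the prior-knowledge terms (the matrices A, B, and C) are left out.

```python
import numpy as np

def joint_nmf(X1, X2, k, n_iter=200, eps=1e-9, seed=0):
    """Factorize X1 ~ H @ W1 and X2 ~ H @ W2 with one shared, non-negative H.

    X1 is samples-by-features (N x M1), X2 is N x M2; k is the number of factors.
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    H = rng.random((n, k))              # shared loading matrix (N x K)
    W1 = rng.random((k, X1.shape[1]))   # basis for modality 1 (K x M1)
    W2 = rng.random((k, X2.shape[1]))   # basis for modality 2 (K x M2)
    losses = []
    for _ in range(n_iter):
        # Basis updates: each modality is updated with H held fixed.
        W1 *= (H.T @ X1) / (H.T @ H @ W1 + eps)
        W2 *= (H.T @ X2) / (H.T @ H @ W2 + eps)
        # Shared loading update: numerator and denominator sum over modalities.
        H *= (X1 @ W1.T + X2 @ W2.T) / (H @ (W1 @ W1.T + W2 @ W2.T) + eps)
        losses.append(np.linalg.norm(X1 - H @ W1) ** 2
                      + np.linalg.norm(X2 - H @ W2) ** 2)
    return H, W1, W2, losses

# Synthetic example: 30 samples, 20 "genes", 15 "peaks", 3 shared factors.
rng = np.random.default_rng(1)
H_true = rng.random((30, 3))
X1 = H_true @ rng.random((3, 20))
X2 = H_true @ rng.random((3, 15))
H, W1, W2, losses = joint_nmf(X1, X2, k=3)
print(f"loss: first iteration {losses[0]:.3f}, last iteration {losses[-1]:.3f}")
```

Because the updates only multiply non-negative quantities, $H$, $W_1$, and $W_2$ stay non-negative without an explicit projection step. Rows of the shared $H$ can then be passed to any clustering routine to group samples or cells.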

3.2 Tensor Decomposition

Beyond two-dimensional factorization, one can apply a similar technique to factorize higher-dimensional tensors (Hore et al., 2016). For example, GTEx provides tissue-specific gene expression data measured within the same individuals. These data can form a three-dimensional tensor of individuals-by-genes-by-tissues. To learn latent factors that link individuals, genes, and tissues, one can factorize the tensor $X$ into individuals-by-factors $Z$, tissues-by-factors $H$, and factors-by-genes $W$ via the Kronecker product such that $X = Z \otimes H \otimes W$ (Hore et al., 2016). The same tensor decomposition technique can be applied to spatiotemporal settings, where we have observed gene expression across several time points or across regions on the same samples. The discrete time points can be developmental stages, pre- and post-treatment, or patient age. In this scenario, we can formulate the data into samples-by-genes-by-times and proceed with tensor decomposition to


learn latent samples-by-factors, factors-by-times, and genes-by-factors. A similar approach can be used to factorize spatial transcriptome into samples-by-factors, factors-by-regions, and genes-by-factors. It is also straightforward to model even higher dimensions such as four-dimensional data with spatial, time, samples, and genes as the dimensions.
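One common way to carry out such a three-way factorization is a rank-$K$ CP (CANDECOMP/PARAFAC) decomposition fitted by alternating least squares, in which each entry is modeled as $X_{ijt} \approx \sum_k Z_{ik} W_{jk} H_{tk}$. The sketch below assumes this CP reading of the decomposition; it is not the sparse Bayesian model of Hore et al. (2016). The names `cp_als` and `reconstruct` are hypothetical, and the gene factors are stored as genes-by-factors, i.e., the transpose of the factors-by-genes matrix in the text.

```python
import numpy as np

def reconstruct(Z, W, H):
    """Rebuild the tensor from its factors: X[i, j, t] = sum_k Z[i,k] W[j,k] H[t,k]."""
    return np.einsum('ik,jk,tk->ijt', Z, W, H)

def cp_als(X, k, n_iter=100, seed=0):
    """Rank-k CP decomposition of a 3-way tensor by alternating least squares.

    X has shape (individuals, genes, tissues). Returns individuals-by-factors Z,
    genes-by-factors W, and tissues-by-factors H. Hypothetical helper.
    """
    rng = np.random.default_rng(seed)
    I, J, T = X.shape
    Z = rng.standard_normal((I, k))
    W = rng.standard_normal((J, k))
    H = rng.standard_normal((T, k))
    for _ in range(n_iter):
        # Each step is an exact least-squares solve for one factor matrix
        # with the other two held fixed (normal equations with a Hadamard Gram).
        Z = np.einsum('ijt,jk,tk->ik', X, W, H) @ np.linalg.pinv((W.T @ W) * (H.T @ H))
        W = np.einsum('ijt,ik,tk->jk', X, Z, H) @ np.linalg.pinv((Z.T @ Z) * (H.T @ H))
        H = np.einsum('ijt,ik,jk->tk', X, Z, W) @ np.linalg.pinv((Z.T @ Z) * (W.T @ W))
    return Z, W, H

# Synthetic example: 8 individuals, 6 genes, 4 tissues, 2 latent factors.
rng = np.random.default_rng(2)
X = reconstruct(rng.standard_normal((8, 2)),
                rng.standard_normal((6, 2)),
                rng.standard_normal((4, 2)))
Z, W, H = cp_als(X, k=2)
rel_err = np.linalg.norm(X - reconstruct(Z, W, H)) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.2e}")
```

The same routine applies unchanged to a samples-by-genes-by-times or samples-by-genes-by-regions tensor; only the interpretation of the third factor matrix changes.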

3.3 Multi-View Relational Learning

More broadly speaking, we can conceptualize matrix or tensor factorization as relational learning among entities, also known as collective matrix factorization (CMF) (Klami et al., 2014). For instance, consider multi-omic datasets of RNA-seq, ATAC-seq, and genotype measured on the same patient cohort. The entities are patients, open chromatin regions (OCRs), genes, and phenotypes. The observed relations in this context are patients-by-OCRs $X_1$, patients-by-genes $X_2$, and patients-by-phenotypes $Y$. The latent relations we would like to infer are OCRs-by-phenotypes and genes-by-phenotypes. Inferring the latent relations is equivalent to matrix completion. Specifically, we can learn a common set of latent factors that represent all of the entities (patients, genes, OCRs, phenotypes). By taking the matrix product of genes-by-factors and phenotypes-by-factors, we can predict the relations between genes and phenotypes. Similarly, we can learn the relations between OCRs and phenotypes. Furthermore, we can augment the CMF by supplying prior relations, such as known in-cis relations between genes and OCRs, to obtain more interpretable factors. We can also dedicate private factors that explain a subset of the entities, alongside the global factors, to account for modality-specific signals. As an example in this category, Wang et al. have developed a CMF-based method for the dimension reduction and clustering analyses of multi-omics data (Wang et al., 2021a).

3.4 Canonical Correlation Analysis

Canonical correlation analysis (CCA) is another widely used technique in data integration. For instance, CCA was used in the popular single-cell data integration software Seurat (Stuart & Satija, 2019). CCA finds a linear projection for two or more matrices that maximizes the correlation between the projected data. Using the above example of mRNA profile $X_1$ and ATAC profile $X_2$ for the same N samples, CCA finds linear projection vectors u and v such that $(u, v) = \arg\max_{u,v} u^T X_1^T X_2 v$, subject to $\|u\| = \|v\| = 1$. Solving CCA is equivalent to solving a singular value decomposition (SVD) on $K = X_1^T X_2 = U \Lambda V^T$, where $U_{m \times k} = [u_1, \ldots, u_k]$ and $V_{p \times k} = [v_1, \ldots, v_k]$ are the singular vectors and $\Lambda_{k \times k}$ is the diagonal matrix of singular values. Additionally, CCA was also applied to the setting where one omic (say RNA-seq) was measured in different samples (e.g., before and after drug treatment, healthy controls and AD patients, or two species with orthologous genes). In this scenario, the common axis is often the genes. The resulting singular vectors U and V are then the embeddings of samples from the two groups. CCA therefore has the benefit of learning orthogonal factors, although the factors are often harder to interpret than those of NMF. Following the CCA, dynamic time warping can be applied to align the samples from group 1 with the samples from group 2 (Stuart & Satija, 2019).
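The SVD formulation above can be sketched in a few lines of NumPy. This is a minimal illustration of the simplified objective stated in the text (no whitening by the within-view covariances), using simulated data; the dimensions, noise level, and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, p, k = 50, 6, 4, 2  # samples, genes, peaks, factors (toy sizes)

# A shared latent signal makes the two simulated views correlated.
latent = rng.normal(size=(N, k))
X1 = latent @ rng.normal(size=(k, m)) + 0.1 * rng.normal(size=(N, m))
X2 = latent @ rng.normal(size=(k, p)) + 0.1 * rng.normal(size=(N, p))

# Center each feature before forming the cross-product matrix.
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

K = X1.T @ X2                                   # m-by-p matrix
U, s, Vt = np.linalg.svd(K, full_matrices=False)
U_k, V_k = U[:, :k], Vt[:k, :].T                # leading k singular vectors

# The first pair (u, v) attains the maximum of u^T K v over unit vectors,
# and that maximum equals the largest singular value.
u, v = U_k[:, 0], V_k[:, 0]
objective = u @ K @ v
```

The columns of `U_k` and `V_k` are orthonormal, which is the orthogonality property mentioned above.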

3.5 Deep Learning Methods for Multimodal Dimension Reduction and Clustering

While the above linear techniques can be applied to both bulk and single-cell data, more flexible non-linear functions are often desired for capturing transcriptional programs and gene regulatory networks. Deep learning, also known as artificial neural networks (ANN) or multi-layer perceptrons (MLP), is a class of methods that progressively apply non-linear functions to transform the input data onto more sophisticated embedding manifolds (Rumelhart & Hinton, 1986). When applying these deep learning techniques to multi-omic data, the hypothesis is that the resulting non-linearly transformed manifolds of the input can capture the abstract level of the cellular programs that dictate the biological functions. This often comes with the requirement of a large number of training examples. With the rapid increase of single-cell transcriptomic data and the recent momentum of single-cell multi-omic data, deep learning methods are increasingly used in the field.

Perhaps one of the most popular deep learning architectures for modeling single-cell datasets is the autoencoder (Hinton, 2006; Lopez et al., 2018; Bahrami, 2020; Zhao et al., 2021; Gayoso et al., 2021). An autoencoder is divided into two network components: an encoder network and a decoder network. The encoder network takes as input the multi-omic data measured in the same samples ($X_1$, $X_2$) and transforms them through an MLP to produce a lower-dimensional non-linear output: $H^{(1)} = f((X_1, X_2) W^{(1)})$, $H^{(l)} = f(H^{(l-1)} W^{(l)})$ for $l > 1$, where $f(x)$ is a non-linear function such as a sigmoid function $f(x) = 1/(1 + \exp(-x))$ or a rectified linear unit $f(x) = \max(x, 0)$, and $W^{(l)}$ are the linear network weights at the $l$th transformation, $l \in \{1, \ldots, L\}$, for a series of L transformations.
The decoder takes the transformed data Z (the output of the last encoder layer) as input and reconstructs the original data via another series of (de-)transformations: $A^{(1)} = f(Z V^{(1)})$, $(\hat{X}_1, \hat{X}_2) = f(A^{(l-1)} (V_1^{(l)}, V_2^{(l)}))$ for $l > 1$. Learning an autoencoder entails minimizing the same reconstruction loss as in the NMF section plus penalty terms on the encoder's and decoder's weights (i.e., $W^{(l)}, V^{(l')}$).
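The encoder and decoder equations above can be traced with a forward pass in NumPy. This is only a sketch: the weights are random and untrained, the layer sizes are arbitrary, and the reconstruction loss is assumed to be a squared error with an L2 weight penalty, which is one common choice and not necessarily the exact loss used by the cited methods.

```python
import numpy as np

def relu(x):
    """Rectified linear unit f(x) = max(x, 0)."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
N, d1, d2, hidden, latent = 10, 6, 4, 5, 2   # toy sizes (assumed)
X1 = rng.random((N, d1))                      # e.g., RNA features
X2 = rng.random((N, d2))                      # e.g., ATAC features
X = np.concatenate([X1, X2], axis=1)          # (X1, X2)

# Encoder weights W^(1), W^(2); decoder weights V^(1) and (V1^(2), V2^(2)).
W = [rng.normal(scale=0.1, size=(d1 + d2, hidden)),
     rng.normal(scale=0.1, size=(hidden, latent))]
V_first = rng.normal(scale=0.1, size=(latent, hidden))
V_out1 = rng.normal(scale=0.1, size=(hidden, d1))
V_out2 = rng.normal(scale=0.1, size=(hidden, d2))

# Encoder: H^(l) = f(H^(l-1) W^(l)); the last layer output is Z.
H = X
for W_l in W:
    H = relu(H @ W_l)
Z = H

# Decoder: A^(1) = f(Z V^(1)), then one output head per modality.
A = relu(Z @ V_first)
X1_hat = relu(A @ V_out1)
X2_hat = relu(A @ V_out2)

# Squared reconstruction error plus an L2 weight penalty (assumed form).
weight_decay = 1e-3
recon = np.sum((X1 - X1_hat) ** 2) + np.sum((X2 - X2_hat) ** 2)
penalty = weight_decay * sum(np.sum(M ** 2) for M in W + [V_first, V_out1, V_out2])
loss = recon + penalty
```

Training would adjust all weight matrices by gradient descent on `loss`; that step is omitted here.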


3.6 Evaluating and Visualizing Single-Cell Embeddings

After training the unsupervised model, the latent cell embedding from the NMF or the encoder can be used to project the single-cell transcriptomes onto the common latent space Z. The quality of the latent cell embedding can be examined in two steps. First, the embeddings of the cells are fed into the classic k-means clustering method or one of the more recently developed community-based clustering approaches, Louvain (Blondel et al., 2008) or Leiden (Traag et al., 2019). These clustering algorithms assign each cell to a cluster. Second, the predicted cluster labels are evaluated against ground-truth cell type labels using clustering metrics such as the Adjusted Rand Index (ARI). Let the set $X = \{X_1, \ldots, X_K\}$ be the predicted clusters and the set $Y = \{Y_1, \ldots, Y_K\}$ be the ground-truth cell type labels. We can evaluate how well the cells are clustered using the ARI score, which is based on the normalized frequency of the co-occurrences of cells with the same gold-standard labels and the mutual exclusiveness of cells with different gold-standard labels:

$$RI = \frac{a + b}{\binom{n}{2}}, \qquad ARI = \frac{RI - E[RI]}{1 - E[RI]}$$

where n is the number of cells, a is the number of cell pairs that are in the same subset of X and in the same subset of Y, and b is the number of cell pairs that are in different subsets of X and in different subsets of Y. Therefore, the higher the ARI, the more biologically consistent the predicted clusters (and therefore the predicted cell embeddings). Furthermore, we often use t-distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten & Hinton, 2008) or UMAP (McInnes et al., 2018) to further project the latent dimensions Z onto two-dimensional coordinates. As a result, we can visualize the cells on a 2D scatter plot colored by known cell type labels, batch information, or predicted cell clusters.
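The RI and ARI definitions above can be computed directly from pair counts. The following pure-Python sketch assumes the usual permutation model for E[RI]; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from math import comb

def rand_indices(pred, truth):
    """Return (RI, ARI) for two labelings of the same cells."""
    n = len(pred)
    total_pairs = comb(n, 2)
    joint = Counter(zip(pred, truth))
    # a: pairs placed together in both the prediction and the ground truth.
    same_both = sum(comb(c, 2) for c in joint.values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    same_true = sum(comb(c, 2) for c in Counter(truth).values())
    # b: pairs separated in both labelings (inclusion-exclusion).
    diff_both = total_pairs - same_pred - same_true + same_both
    ri = (same_both + diff_both) / total_pairs
    # Chance-corrected form, algebraically equal to (RI - E[RI]) / (1 - E[RI]).
    expected = same_pred * same_true / total_pairs
    max_index = 0.5 * (same_pred + same_true)
    if max_index == expected:   # degenerate case, e.g., a single cluster
        return ri, 1.0
    ari = (same_both - expected) / (max_index - expected)
    return ri, ari

ri_perfect, ari_perfect = rand_indices([0, 0, 1, 1], ["a", "a", "b", "b"])
ri_bad, ari_bad = rand_indices([0, 0, 1, 1], [0, 1, 0, 1])
```

A labeling that matches the ground truth up to renaming gives an ARI of 1, while a labeling that splits every true cell type gives a negative value.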

4 Multimodal Methods for Inferring Gene Regulatory Networks from Bulk and Single-Cell Omics Data

Phenotypic variations are largely determined by gene expression differences, which are fundamentally dictated by the underlying gene regulatory networks (Levine & Davidson, 2005). Therefore, reconstructing gene regulatory networks is crucial for understanding the cellular dynamics in a large variety of biological systems (e.g., how the cellular state changes along a biological process such as cell differentiation and, more importantly, what causes the change). Numerous computational methods have been developed for gene regulatory network inference from unimodal data at both the bulk and single-cell level. These methods typically fall into a few major categories: correlation/regression based, differential equation based, probabilistic graphical model based, and others that do not fit into the above-listed categories. In the remainder of this section, we will discuss the gene regulatory network inference methods in each of the categories (Fig. 4).

Fig. 4 Multimodal methods for gene regulatory network inference. (1) Regression/correlation/ODE-based methods can be applied to learn the GRN from each modality. An ensemble approach is then used to aggregate the inferred GRNs across modalities. The final predicted GRN is composed of the regulatory relationships voted for by the models built for all modalities. (2) The graph-based approach learns the gene regulatory network with a graph, in which the emission probability mostly depends on the gene expression data while the transition probabilities between nodes (genes) are calculated based on all omics (e.g., gene expression, ATAC-seq profiles, protein levels)

4.1 Multiple Regression

One commonly used strategy to infer the gene regulatory network is based on inferring the relationship between individual genes using regression or correlation analysis. GENIE3 (Huynh-Thu et al., 2010) is one of the most popular methods in this category, and was developed for unimodal datasets (e.g., RNA-seq or single-cell RNA-seq data). GENIE3 decomposes the inference of a regulatory network between n genes into n different regression problems, which can be written as:

$$x_{i,j} = f_j(x_{i,-j}) + e_i$$

where $e_i$ is random noise with a mean of zero, while $x_{i,-j}$ denotes $(x_{i,1}, x_{i,2}, \ldots, x_{i,j-1}, x_{i,j+1}, \ldots, x_{i,G})$. The expression of gene j in sample i is estimated from all other genes $x_{i,-j}$ with a function $f_j$. Predicting the expression of each gene j is a supervised regression problem $x_j \sim f_j(x_{\cdot,-j})$, and we need to learn the function $f_j$ by minimizing the mean squared error (MSE) loss $L = \frac{1}{N}\sum_{i=1}^{N} (x_{i,j} - f_j(x_{i,-j}))^2$, where N is the number of samples (e.g., from sub-sampling the sample-by-expression matrix). A Random Forest regression can be employed to learn the regression function, and each tree within the forest presents a ranking for all genes. The final gene-gene interaction (gene regulatory relationship) is the aggregated global ranking from these sub-regression problems. Although developed for bulk RNA-seq data, the GENIE3 method can be applied to single-cell data with the aid of dimensionality reduction and clustering analysis as discussed in the previous section. For each of the obtained clusters, GENIE3 can be applied to infer the corresponding gene regulatory network from the cell-by-gene expression matrix (treated similarly to the sample-by-gene matrix in the bulk analysis). Such a regression-based method and its variant for time series data, dynGENIE3 (Huynh-Thu & Geurts, 2018), could be extended to infer the gene regulatory network from multi-omics data. A possible solution is to learn a separate regression function $f_j^m$ for each modality m. The final prediction is the aggregation from all modalities (Omranian et al., 2016). A constraint can be used to ensure that the gene regulatory networks inferred from different modalities are as close as possible. To learn the final gene regulatory network, we learn the regression function $f_j^m$ for each modality m so as to minimize the multimodal loss for each gene j:

$$\mathrm{mloss}_j = \sum_{m=1}^{M} \sum_{i=1}^{N} \sum_{j=1}^{G} \left(x_{i,j}^{m} - f_j^{m}(x_{i,-j}^{m})\right)^2 + \lambda \left| \bigcup_{m=1}^{M} g(f_j^{m}) \right|$$

where $g(f_j^m)$ represents the top-ranked genes (e.g., those ranked above a certain cutoff) for modality m based on the learned function $f_j^m$, and M represents the number of different modalities. The second term in the above equation denotes the penalty for the discrepancy of the learned top-ranked genes between different modalities. This forces the regression functions from different modalities to present relatively consistent results. As the loss is non-differentiable, a derivative-free optimization method is needed to minimize the mloss. The regularization weight $\lambda$ can be learned via nested cross-validation, and the cutoff for g can be learned similarly. Alternatively, the cutoff for g can be specified as a hyper-parameter supplied by the users based on their prior knowledge (e.g., the mean number of target genes for each TF). SCENIC (Aibar et al., 2017) is a single-cell gene regulatory network inference method based on GENIE3. It first infers the candidate coexpression gene modules (potential gene regulatory relationships) using GENIE3 or GRNBoost (Moerman et al., 2019) for each of the identified cell populations (e.g., from clustering analysis with the methods from the above section). Next, SCENIC utilizes the RcisTarget (Aerts et al., 2010) method to identify the regulons (TF-gene interactions) from the preliminary coexpression gene modules identified in the first step. This method can be extended to multi-omics data using a strategy similar to the GENIE3 omics extension.
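As a rough illustration of the per-gene regression decomposition that GENIE3 uses (and that SCENIC builds on), the sketch below solves one regression problem per target gene. To stay dependency-light it swaps the Random Forest for ordinary least squares and ranks regulators by absolute coefficient, so it is a simplified stand-in and not GENIE3 itself; the function name and the simulated data are hypothetical.

```python
import numpy as np

def rank_regulators(X):
    """X is samples-by-genes. scores[k, j] is the weight of gene k
    when predicting gene j from all other genes (diagonal stays 0)."""
    n_samples, n_genes = X.shape
    Xc = X - X.mean(axis=0)                      # center each gene
    scores = np.zeros((n_genes, n_genes))
    for j in range(n_genes):
        others = [k for k in range(n_genes) if k != j]
        # Least-squares fit of x_j ~ f_j(x_{-j}), minimizing the MSE loss.
        coef, *_ = np.linalg.lstsq(Xc[:, others], Xc[:, j], rcond=None)
        scores[others, j] = np.abs(coef)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # gene 0 drives gene 3
scores = rank_regulators(X)
top_regulator_of_gene3 = int(np.argmax(scores[:, 3]))
```

For the multimodal extension discussed above, one such score matrix could be computed per modality and the rankings aggregated; that aggregation step is not shown.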

4.2 Correlation and Mutual Information

Another category of gene regulatory network methods is based on correlation or information theory to infer the regulatory relationship between transcription factors and the expression of their target genes. Pearson correlation is one of the simplest characterizations of the relationship between genes (e.g., between a transcription factor and a target) (Gama-Castro et al., 2016). However, Pearson correlation cannot capture complex non-linear regulatory relationships between genes. Alternatively, mutual information from information theory has been widely used to characterize the relationship (i.e., dependence) between variables (e.g., the expression of two genes). Mutual information (MI) measures the difference between the joint probability distribution and the product of the marginal probabilities as the Kullback-Leibler (KL) divergence (Hussein et al., 2015) between them:

$$I(X; Y) = D_{KL}[P(X, Y) \,\|\, P(X)P(Y)] = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log\left(\frac{p(x, y)}{p(x)p(y)}\right) = H(X) - H(X \mid Y)$$

where the entropy $H(X) = -\sum_{x \in X} p(x) \log p(x)$ measures the uncertainty of a random variable X, while $H(X \mid Y)$ is the conditional entropy that measures the uncertainty of X given that the value of Y is known. These MI-based methods have been commonly used for unimodal datasets (e.g., RNA-seq). With proper modifications, MI could also be used to infer gene regulatory relationships from multi-omics data. Let the random variables X and Y denote the measurement of one modality for two separate genes. Let Z denote a categorical variable with each category representing an omic type. The conditional MI can be used to calculate the dependence between X and Y across all modalities in Z:

$$I(X; Y \mid Z) = \sum_{z \in Z} p(z) \sum_{x \in X} \sum_{y \in Y} p(x, y \mid z) \log\left(\frac{p(x, y \mid z)}{p(x \mid z)\,p(y \mid z)}\right)$$
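The MI and conditional MI formulas above can be evaluated with simple plug-in (count-based) estimates. The sketch below assumes the measurements have already been discretized, which is a modeling choice not prescribed by the text; function names and toy data are illustrative.

```python
from collections import Counter
from math import log

def entropy(xs):
    """H(X) = -sum_x p(x) log p(x), with p estimated from counts."""
    n = len(xs)
    return -sum((c / n) * log(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(xs, ys))) - entropy(ys)

def mutual_information(xs, ys):
    """I(X; Y) = sum p(x, y) log[p(x, y) / (p(x) p(y))]."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) = sum_z p(z) * I(X; Y | Z = z)."""
    n = len(zs)
    total = 0.0
    for z, count_z in Counter(zs).items():
        idx = [i for i in range(n) if zs[i] == z]
        total += (count_z / n) * mutual_information(
            [xs[i] for i in idx], [ys[i] for i in idx])
    return total

# Toy discretized measurements of two genes; zs marks the omic type.
xs = [0, 1, 0, 1]
ys = [0, 1, 1, 0]
zs = ["rna", "rna", "atac", "atac"]
mi = mutual_information(xs, ys)
cmi = conditional_mutual_information(xs, ys, zs)
```

In this toy case the two genes look independent when the modalities are pooled (`mi` is zero) but are fully dependent within each modality (`cmi` equals log 2), which is the kind of signal the conditional form is meant to expose.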

One potential challenge for this approach is the difficulty of estimating the empirical probability distributions needed to calculate the above conditional probabilities. Although MI-based approaches can quantify the interdependence between variables (e.g., genes), they cannot infer a causal relationship. With time-series measurements of the variables ($X_i$ and $X_j$), we can quantify the information that passes from one variable at the previous time point $t - 1$ to another variable at the current time point t. This enables the inference of causality. Schreiber (2000) reported Directed Information (DI) as a measure of the amount of information flowing from the past state(s) of $X_i$, the regulator, to the current state of the variable $X_j$, the target. DI is defined as follows:

$$DI(X_i \to X_j \mid \{X_i, X_j\}^C) = \sum_{t} I\left(X_i^{t-1}; X_j^{t} \mid X_j^{t-1}, \{X_l^{t-1}\}_{l \in \{X_i, X_j\}^C}\right)$$

where $X_i, X_j$ denote the measurements for two different genes (one regulator, one target), $\{X_i, X_j\}^C$ represents all genes other than $\{X_i, X_j\}$, and $\{X_l^{t-1}\}_{l \in \{X_i, X_j\}^C}$ indicates all genes other than $X_i, X_j$ at $t - 1$. It was reported that if a system is not purely deterministic, the directed information graph $G_{DI}$ inferred from DI will correctly recover the network that includes all causal interactions as directed edges (Sun et al., 2015). Unlike the linear Granger causality (Seth, 2007), DI can detect both linear and nonlinear causality and is applicable to stochastic systems. However, it is computationally expensive because the causality inference between two variables depends on all possible previous states, which requires a tremendous amount of data and is therefore not affordable even with large-scale single-cell genomics datasets. To address these limitations, Qiu et al. developed Scribe (Qiu et al., 2020) as an extension of DI. Scribe assumes a first-order Markov system, which is a common assumption in studies of biological processes. The authors termed the refined DI method “Restricted Directed Information (RDI)”:

$$RDI_d(X_i \to X_j) = \sum_{t} I\left(X_i^{t-d}; X_j^{t} \mid X_j^{t-1}\right)$$

where d represents the delay in time. RDI calculates the mutual information between two time points separated by a delay d instead of two consecutive time points. For the multi-omics extension, RDI can be revised to the case where the information transfer from $X_i$ to $X_j$ is conditional on another factor Z (e.g., modality):

$$RDI_{d_1, d_2}(X_i \to X_j \mid Z) = \sum_{t} I\left(X_i^{t-d_1}; X_j^{t} \mid X_j^{t-1}, Z^{t-d_2}\right)$$

where $d_1$ and $d_2$ indicate time delays. Time-series information is not always available. For single-cell applications, however, pseudo-time can be used when real-time information is unavailable. With time-series information, this type of method can infer causal rather than merely associative relationships between variables (i.e., genes).
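The RDI quantity defined above can be sketched for a single pair of discretized time series. The example pools all time points into one plug-in estimate of the conditional mutual information, which assumes stationarity and is a simplification of the sum over t; it is not the Scribe implementation, and the function names and toy series are hypothetical.

```python
from collections import Counter
from math import log

def conditional_mi(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) from paired discrete samples."""
    n = len(xs)
    pz = Counter(zs)
    pxz = Counter(zip(xs, zs))
    pyz = Counter(zip(ys, zs))
    pxyz = Counter(zip(xs, ys, zs))
    # p(x,y|z) / (p(x|z) p(y|z)) = c_xyz * c_z / (c_xz * c_yz)
    return sum((c / n) * log(c * pz[z] / (pxz[(x, z)] * pyz[(y, z)]))
               for (x, y, z), c in pxyz.items())

def restricted_directed_information(source, target, delay=1):
    """Estimate I(source[t - delay]; target[t] | target[t - 1])."""
    start = max(delay, 1)
    steps = range(start, len(target))
    past_source = [source[t - delay] for t in steps]
    current_target = [target[t] for t in steps]
    past_target = [target[t - 1] for t in steps]
    return conditional_mi(past_source, current_target, past_target)

# Toy binary series: the target copies the regulator with a lag of one step.
regulator = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
target = [0] + regulator[:-1]
rdi_forward = restricted_directed_information(regulator, target, delay=1)

# A constant series carries no information about the target.
rdi_constant = restricted_directed_information([0] * len(target), target, delay=1)
```

Conditioning on an additional variable such as a modality label would extend the tuple passed as `zs`, following the multi-omics form of RDI given above.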


4.3 Ordinary Differential Equation

Ordinary differential equations (ODEs) are another popular strategy to describe expression dynamics and infer gene regulatory networks consisting of n genes. The model is generally formulated as:

$$\frac{dy}{dt} = \nabla_t y = a_1 x_{1,t} + a_2 x_{2,t} + \ldots + a_n x_{n,t} = a^T x_t$$

where $\nabla_t y$ is the observed change of the response variable y at time t and is modelled by a linear equation. The stationary parameters $a_1, a_2, \ldots, a_n$ in the above equation are usually unknown in practice. They are found by minimizing the error between the prediction $y(a_1^*, a_2^*, \ldots, a_n^*)$ and the actual observation $\hat{y}$. With those parameters, we are able to infer the regulatory relationship between genes. A multimodal extension of ODE models for gene regulatory network reconstruction is still lacking. A potential strategy is to combine the differential equations from different modalities with a weighted majority voting scheme. A modality-centered gene regulatory network may then be inferred for each modality. A regulatory relationship supported by multiple modalities could be assigned a higher score/rank. The weight for each modality could differ based on prior knowledge (e.g., one modality is likely to be more important than others).

SCODE is a recently developed method for gene regulatory network inference in this category (Matsumoto et al., 2017). SCODE extends the conventional ODE framework for gene regulatory networks to modeling single-cell data. From the single-cell data, pseudo-time information can be inferred for building the differential equations. A major challenge of applying ODE-based methods to complex single-cell data is the high computational cost caused by the high dimensionality and large scale of the data. SCODE addresses this limitation by projecting the high-dimensional data into a lower-dimensional space with linear transformations. Specifically, SCODE models gene expression using the ODE $\frac{d\mathbf{X}}{dt} = \mathbf{W}\mathbf{B}\mathbf{W}^{-1}\mathbf{X}$, where $\mathbf{B}$ is the parameter matrix that we can learn in the low-dimensional space $\mathbf{Z}$, and $\mathbf{W}$ is the linear projection for the lower-dimensional space, $\mathbf{X} = \mathbf{W}\mathbf{Z}$. The matrix $\mathbf{A} = \mathbf{W}\mathbf{B}\mathbf{W}^{-1}$ describes the gene regulatory relationship, as it specifies the weights of all genes in estimating the expression of a particular gene. The matrices $\mathbf{B}, \mathbf{W}$ can be learned by minimizing the residual sum of squares (RSS):

$$\mathbf{B}^*, \mathbf{W}^* = \arg\min_{\mathbf{B}, \mathbf{W}} \sum_{g,c} \left(x_{gc} - \sum_{i=1}^{D} w_{gi} z_{ic}\right)^2, \qquad \frac{dz}{dt} = bz$$

where D is the dimension of the matrix $\mathbf{Z}$ in the reduced space, c represents a cell, g denotes a gene, and $x_{gc}$ denotes the expression of gene g in cell c. SCODE does not support multi-omics data by default. A potential omics extension is to combine the differential equations from different modalities with a weighted majority voting scheme, as discussed above.
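The linear ODE model at the start of this section can be fitted by least squares once derivatives are approximated by finite differences. The sketch below simulates a noise-free two-gene system with forward Euler steps and recovers the coefficient matrix; it is a toy illustration of parameter estimation for dx/dt = Ax, not the SCODE algorithm (which works in a reduced space with pseudo-time), and all values are assumed.

```python
import numpy as np

# Assumed ground-truth regulatory matrix: gene 0 activates gene 1,
# and both genes decay over time.
A_true = np.array([[-0.5, 0.0],
                   [1.0, -0.3]])

dt, n_steps = 0.01, 500
X = np.zeros((n_steps, 2))
X[0] = [1.0, 0.0]
for t in range(n_steps - 1):
    X[t + 1] = X[t] + dt * (A_true @ X[t])   # forward Euler simulation

# Finite-difference estimate of dx/dt at each time point.
dX = (X[1:] - X[:-1]) / dt

# Solve dX ~ X[:-1] @ A^T in the least-squares sense, one column per gene.
A_transposed, *_ = np.linalg.lstsq(X[:-1], dX, rcond=None)
A_hat = A_transposed.T
```

Each row of `A_hat` plays the role of the coefficient vector a for one response gene; large off-diagonal entries suggest candidate regulatory links. With noisy or sparsely sampled data, the finite-difference step becomes the main source of error.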

5 Multi-Modal Network Inference of Gene Regulations

5.1 Bayesian Network Inference

Probabilistic graphical models (PGM) or Bayesian networks provide powerful tools to infer gene regulatory networks. In this context, genes are regarded as nodes in the graph and edges represent the regulatory relationships between genes, which can be inferred from the omics data (e.g., joint probabilities across modalities). Suppose the data, i.e., the gene expression matrix $\mathbf{X}$, follow a likelihood distribution $p(\mathbf{X} \mid \Theta_{Graph})$ (e.g., a Gaussian) and the graph parameters follow some prior distribution $p(\Theta_{Graph})$. The goal is to infer the distribution of the graph parameters for the gene regulatory network. Full Bayesian inference on the posterior distribution of the graph parameters is as follows:

$$\log p(\Theta_{Graph} \mid \mathbf{X}) = \log p(\mathbf{X} \mid \Theta_{Graph}) + \log p(\Theta_{Graph}) - \log p(\mathbf{X})$$

Here the marginal probability $p(\mathbf{X}) = \int p(\mathbf{X}, \Theta_{Graph})\, d\Theta_{Graph}$ is often intractable, and various approximation approaches have been developed. One popular approach is to optimize point estimates of the network parameters by maximum a posteriori (MAP):

$$\Theta_{Graph} \leftarrow \arg\max_{\Theta_{Graph}} \log p(\mathbf{X} \mid \Theta_{Graph}) + \log p(\Theta_{Graph})$$

Markov Chain Monte Carlo (MCMC) sampling is also commonly used. These methods iteratively propose new $\Theta^*_{Graph}$, evaluate the unnormalized potential function $L(\Theta^*_{Graph}) \propto \log p(\mathbf{X} \mid \Theta^*_{Graph}) + \log p(\Theta^*_{Graph})$, and stochastically accept the newly sampled parameters $\Theta^*_{Graph}$ by comparing $L(\Theta^*_{Graph})$ with the value at the current setting, $L(\Theta^{cur}_{Graph})$. Metropolis-Hastings and Hamiltonian Monte Carlo belong to this class of approximation. Simpler non-Markov-chain Monte Carlo methods such as importance sampling are also frequently used. More efficient Gibbs sampling can be derived if the conditional probability of each model parameter $\Theta_i$ given the remaining parameters $\Theta_{-i}$, i.e., $p(\Theta_i \mid \mathbf{X}, \Theta_{-i})$, has a closed-form expression. In this case, one can cycle through the parameters, sampling one parameter at a time while conditioning on the sampled values of the remaining parameters. However, it is often difficult to determine the convergence of MCMC approaches, and therefore one tends to run them for a long time to ensure an equilibrium state.
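The propose-and-accept loop described above can be sketched with a random-walk Metropolis-Hastings sampler. For brevity, the example replaces the graph parameters with a single scalar parameter (the mean of a Gaussian likelihood with a Gaussian prior); the model, step size, and data are assumptions chosen only to make the acceptance rule concrete.

```python
import math
import random

def log_unnormalized_posterior(theta, data, prior_sd=10.0, noise_sd=1.0):
    """L(theta) = log p(X | theta) + log p(theta), up to a constant."""
    log_lik = sum(-0.5 * ((x - theta) / noise_sd) ** 2 for x in data)
    log_prior = -0.5 * (theta / prior_sd) ** 2
    return log_lik + log_prior

def metropolis_hastings(data, n_iter=5000, step=0.5, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    current = log_unnormalized_posterior(theta, data)
    samples = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)   # symmetric random walk
        candidate = log_unnormalized_posterior(proposal, data)
        # Accept with probability min(1, exp(L(proposal) - L(current))).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples

data = [1.8, 2.1, 2.4, 1.9, 2.3, 2.0, 2.2, 1.7, 2.5, 2.1]
samples = metropolis_hastings(data)
burn_in = 1000
posterior_mean = sum(samples[burn_in:]) / len(samples[burn_in:])
```

Because this toy posterior is Gaussian, its exact mean (about 2.098 here) is available for comparison; for real network models no such reference exists, which is why convergence is hard to assess.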


Variational inference (VI) is another popular class of approximate Bayesian inference. VI works by first proposing a family of variational distributions $q(\Theta_{Graph})$. Under the variational distribution, we can approximate the marginal likelihood with a closed-form evidence lower bound (ELBO):

$$\begin{aligned}
\log p(\mathbf{X}) &= \log \int p(\mathbf{X}, \Theta_{Graph})\, d\Theta_{Graph} \\
&= \log \int q(\Theta_{Graph}) \frac{p(\mathbf{X}, \Theta_{Graph})}{q(\Theta_{Graph})}\, d\Theta_{Graph} \\
&\geq \int q(\Theta_{Graph}) \log p(\mathbf{X}, \Theta_{Graph}) - q(\Theta_{Graph}) \log q(\Theta_{Graph})\, d\Theta_{Graph} \\
&= E_{q(\Theta_{Graph})}[\log p(\mathbf{X}, \Theta_{Graph})] - E_{q(\Theta_{Graph})}[\log q(\Theta_{Graph})] \equiv ELBO
\end{aligned}$$

Note that maximizing the ELBO is equivalent to minimizing the Kullback-Leibler divergence (KLD) between the proposed distribution $q(\Theta_{Graph})$ and the true posterior distribution $p(\Theta_{Graph} \mid \mathbf{X})$:

$$KL[q \,\|\, p] = E_{q(\Theta_{Graph})}[\log q(\Theta_{Graph})] - E_{q(\Theta_{Graph})}[\log p(\Theta_{Graph} \mid \mathbf{X})]$$

This is because the KLD and ELBO add up to the constant marginal likelihood, $KL[q \,\|\, p] + ELBO = \log p(\mathbf{X})$, which can be easily verified by rewriting the ELBO as $E_{q(\Theta_{Graph})}[\log p(\Theta_{Graph} \mid \mathbf{X})] + \log p(\mathbf{X}) - E_{q(\Theta_{Graph})}[\log q(\Theta_{Graph})]$.

Extension of the above inference frameworks to modeling multi-omic data beyond transcriptomes is an active research area. One key consideration is the choice of heterogeneous distributions to appropriately model the likelihood of different omic data. The choice of these data likelihoods has a direct impact on the inferred network and its interpretability. For instance, the negative binomial is a popular choice for modeling scRNA-seq data because of its capacity to account for both the expected transcription rate and the over-dispersion variance of the gene expression (Svensson, 2020). The Bernoulli distribution can be used to model scATAC-seq peak data, where the presence or absence of an ATAC peak is modeled as a binary event (Wu et al., 2021). The Beta density is used to model the proportion of methylated DNA reads because of its continuous range between 0 and 1. The zero-inflated Poisson distribution is used to model proteomic data (Gayoso et al., 2021). These distributions are often chosen for their mathematical convenience and the well-characterized properties of exponential family distributions. Nonetheless, they are part of the model assumptions, which may not hold true for all data scenarios.

Likelihood-free model inference is an active research topic in the statistical learning community (Thomas et al., 2022). The basic idea is to use a ratio estimator (e.g., a neural network) to distinguish between real experimental data and fake data sampled from a generative model $q(\mathbf{X})$. A good ratio estimator can closely approximate the KL divergence between the true but intractable data likelihood $E_{q(\Theta_{Graph})}[\log p(\mathbf{X}, \Theta_{Graph})]$ and the proposed data likelihood $\log q(\mathbf{X})$.
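The identity KL[q||p] + ELBO = log p(X) stated earlier in this section can be checked numerically on a small discrete model, where the integral over the parameters becomes a finite sum. The three-valued parameter and all probabilities below are made-up numbers for illustration.

```python
from math import log

# Hypothetical model with three possible parameter settings.
prior = [0.5, 0.3, 0.2]           # p(theta)
likelihood = [0.10, 0.40, 0.70]   # p(X | theta) for the observed X
q = [0.2, 0.3, 0.5]               # variational distribution q(theta)

joint = [p * l for p, l in zip(prior, likelihood)]   # p(X, theta)
evidence = sum(joint)                                 # p(X)
posterior = [j / evidence for j in joint]             # p(theta | X)

# ELBO = E_q[log p(X, theta)] - E_q[log q(theta)]
elbo = sum(qi * (log(ji) - log(qi)) for qi, ji in zip(q, joint))

# KL[q || p] = E_q[log q(theta)] - E_q[log p(theta | X)]
kl = sum(qi * (log(qi) - log(pi)) for qi, pi in zip(q, posterior))

# Choosing q equal to the exact posterior closes the gap completely.
elbo_at_posterior = sum(pi * (log(ji) - log(pi))
                        for pi, ji in zip(posterior, joint))
```

Since `kl` is non-negative, `elbo` can never exceed `log(evidence)`, and the bound is tight exactly when q matches the posterior.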


5.2 Static Boolean Regulatory Network Inference

Among the many Bayesian networks, the Boolean network is one of the most classical graphical models for representing a gene regulatory network, in which the regulatory relationship between genes (i.e., $\Theta_{Graph}$) is represented by binary values (1 or 0). Building upon the inference algorithms discussed above, different strategies have been developed for searching the plausible Boolean network configurations that could represent the underlying gene regulatory networks. The SCNS toolkit (Woodhouse et al., 2018) is one such method for single-cell RNA-seq gene expression data. To the best of our knowledge, an omics extension of the Boolean network for gene regulatory network inference is absent (particularly for single-cell omics data). However, a potential omics extension of the Boolean network method could come from a multi-omic inference of the network edges. The edge between two nodes (e.g., a TF and a target gene) is then determined not only by the RNA-seq gene expression but also by the other omics (e.g., epigenetic profiles associated with the nodes, such as whether the promoter of the target gene is accessible to the TF).

5.3 Dynamic Regulatory Network Inference

Gene regulatory networks in various biomedical processes are dynamic and thus are not well characterized by static Boolean networks. To overcome this limitation, more complex graphical models are utilized for gene regulatory network inference. For example, Ding et al. developed the SCDIFF method (Ding et al., 2018), which reconstructs the cellular trajectory from single-cell RNA-seq data with a Kalman filter model, in which the transition between cellular states is modeled using a regression model. The method first groups all the cells into different clusters (cellular states). Next, a graph that connects all nodes is built to represent the cellular state transitions between nodes (states). The underlying true gene expression for each node is estimated using the Kalman filter model, in which a regression model is used to infer the expression change associated with each edge (between two nodes). The cost function for the regression is as follows:

$$J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\right] + \frac{\lambda}{2n}\sum_{j=1}^{P}|\theta_j|, \qquad h_\theta(x_i) = \frac{1}{1 + e^{-\theta^T x_i}}$$

where $y_i$ denotes the category of expression change (i.e., up-regulated, non-changed, or down-regulated) for gene i, n is the total number of genes considered, $x_i$ represents the binary (0/1) vector of all TFs that target gene i, and $\theta_j$ represents the weight for TF j. Cross-validation was used to determine the best $\lambda$. The above L1-penalized logistic regression learns a crucial list of TFs for the edges that are associated with a list of differential genes. An iterative strategy is employed to gradually refine the cellular trajectory graph and the regressions associated with all the edges. Finally, the TF-gene interactions are learned for all edges of the graph.

The Continuous State Hidden Markov Model (CSHMM) (Lin & Bar-Joseph, 2019; Hurley et al., 2020) is a continuous extension of SCDIFF and follows a similar strategy to infer gene regulatory networks. The most significant difference is that CSHMM learns a continuous trajectory while SCDIFF learns a discrete trajectory for the subsequent gene regulatory network inference. The learned trajectory influences the calculation of the emission and transition probabilities in the graph. These methods were developed for unimodal single-cell RNA-seq data. However, with some effort they can be extended to integrate other omics data (e.g., epigenetic datasets such as ATAC-seq that profile the underlying epigenetic landscape). A common strategy for omics integration is to infer the cellular trajectories based on the multi-omics data, not just the RNA-seq.

Reconstructed cellular trajectories from unimodal scRNA-seq data can be used to address several different questions. However, they do not provide information on other molecular aspects of the process, including changes to the epigenome and the set of regulators that are activated at specific time points. In addition, given their strong dependence on the assumption of gradual change in the expression of genes within or between time points, they may not be appropriate for studies that need to sample at lower frequencies. To overcome these problems, many computational methods have been developed to integrate time-series scRNA-seq with other bulk or single-cell data.
For example, PhenoPath, a statistical analysis method that incorporates the impact of environmental and genetic covariates, was used to analyze time-series bulk and single-cell transcriptomics data for inferring pseudotime trajectories (Campbell & Yau, 2018), a step that precedes the gene regulatory network inference. Computational methods have also been developed to integrate time-series and snapshot scRNA-seq data with genetic barcoding and CRISPR–Cas9 data. Such integration can improve on the trajectories reconstructed from each data type separately. For example, Zafar et al. (2020) developed LinTIMaT, a general method for combining scRNA-seq data with scar data. LinTIMaT reconstructs cell lineages using a maximum-likelihood framework that combines mutation and expression agreement along the branches. When applied to the zebrafish scar data (Raj et al., 2018), the method was able to resolve the ambiguities that arise when only the scar data are used, and it identified additional cell subtypes that could not be resolved without the expression data. With those omics-inferred cellular trajectories, we can use the probabilistic graphical models discussed above to identify the gene regulatory networks associated with the trajectories.

There are also many other methods that directly integrate omics data (e.g., ATAC-seq and RNA-seq) for regulatory network inference. For example, the interactive dynamic regulatory events miner (iDREM) (Ding et al., 2018) was used to infer the gene regulatory network from bulk multi-omics data. iDREM is the recent extension of the DREM (Ernst et al., 2007) method that can use the expression level of TFs to influence the learning of classifiers in the input-output hidden Markov model (IOHMM) for gene regulatory network inference. With the IOHMM method, iDREM groups all the genes into different expression patterns (paths), which together form a graph. A list of TFs is associated with each time point of a path. The graph (with all the nodes and paths) is learned by maximizing the likelihood density r:

$$r(G \mid M) = \sum_{g \in G} \log \sum_{q \in Q} \prod_{t=1}^{n-1} f_{q(t)}(o_g(t)) \prod_{t=1}^{n-1} P\left(H_t = q(t) \mid H_{t-1} = q(t-1), I(g, t)\right)$$

where $f_{q(t)}(o_g(t))$ represents the emission probability of observation $o_g(t)$ from state $q(t)$. $Q$ is the set of all paths of hidden states of length $n$ starting from the root. For a path $q \in Q$, $q(t)$ is the hidden state of the path at time point $t$. The first product is the emission probability and the second product is the transition probability. The inner sum is over all paths, and the outer sum is over all genes in $G$. $I(g, t)$ is the dynamic input prior learned by integrating all the different data types.

Besides iDREM, there are many other methods of this type. TimeReg (Duren et al., 2020) was recently applied to combine gene expression and chromatin accessibility at the single-cell level. The method first infers context-specific regulatory interactions from ATAC-seq and RNA-seq data at a single time point and then uses dimensionality reduction to extract core regulatory interactions across the time points. These interactions are then used to identify the regulators that drive the observed changes in expression. TimeReg was applied to study retinoic acid (RA)-induced development and identified several novel regulatory elements for cerebellar development, synapse assembly, and hindbrain morphogenesis.
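As a rough illustration of the iDREM likelihood described above, the following Python sketch evaluates the sum over paths for a toy model by brute-force enumeration. Everything concrete in it is an assumption made for illustration: the five-state tree, the Gaussian emission parameters, the logistic form of the transition probability, and the names `children`, `emission`, `transition_prob`, and `log_likelihood` are hypothetical and are not taken from the iDREM software. A practical implementation would use dynamic programming rather than enumerating every path.

```python
import math

# Toy tree of hidden states (hypothetical): the root 0 splits into an
# "up" branch (1 -> 3) and a "down" branch (2 -> 4).
children = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}

# Assumed Gaussian emission parameters (mean, sd) for each non-root state.
emission = {1: (1.0, 0.5), 2: (-1.0, 0.5), 3: (2.0, 0.5), 4: (-2.0, 0.5)}


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def enumerate_paths(state, length):
    """Return every path of `length` states that starts at `state` (the set Q)."""
    if length == 1:
        return [[state]]
    return [[state] + rest
            for child in children[state]
            for rest in enumerate_paths(child, length - 1)]


def transition_prob(prev, nxt, input_value):
    """P(H_t = nxt | H_{t-1} = prev, I(g, t)) for a toy tree with binary splits.

    At a split, a logistic function of the gene-specific input decides how
    much probability goes to the first child; elsewhere the move is certain.
    """
    kids = children[prev]
    if len(kids) == 1:
        return 1.0
    p_first = 1.0 / (1.0 + math.exp(-input_value))
    return p_first if nxt == kids[0] else 1.0 - p_first


def log_likelihood(expression, inputs, n):
    """Sum over genes of the log of the sum over paths of emission * transition."""
    total = 0.0
    for gene, obs in expression.items():
        path_sum = 0.0
        for path in enumerate_paths(0, n):
            prob = 1.0
            for t in range(1, n):
                mean, sd = emission[path[t]]
                prob *= gaussian_pdf(obs[t], mean, sd)
                prob *= transition_prob(path[t - 1], path[t], inputs[gene][t])
            path_sum += prob
        total += math.log(path_sum)
    return total


# One gene whose expression rises over n = 3 time points (index 0 is the root).
expression = {"geneA": [0.0, 1.0, 2.0]}
supporting_input = {"geneA": [0.0, 2.0, 0.0]}   # input favours the "up" branch
opposing_input = {"geneA": [0.0, -2.0, 0.0]}    # input favours the "down" branch

ll_support = log_likelihood(expression, supporting_input, 3)
ll_oppose = log_likelihood(expression, opposing_input, 3)
print(ll_support > ll_oppose)
```

In this toy setting, an input that agrees with the observed expression trend gives a higher log-likelihood than one that contradicts it, which is the sense in which the input term $I(g, t)$ steers the assignment of genes to paths.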

6 Multimodal Methods for Biomarker Identification

6.1 Ensemble Learning Based Multi-Omic Biomarker Identification

A two-step approach is commonly employed to identify biomarkers for various diseases (e.g., COVID-19 or lung cancer). First, uninformative biomarkers are eliminated one by one based on their predictive contribution to the phenotype in the context of all other biomarkers. Specifically, multiple iterations of model training and validation are performed on feature sets that each contain all of the candidate biomarkers except one. A candidate is removed if the remaining biomarkers achieve equal or higher prediction accuracy without it. This procedure is termed backward step-wise elimination.
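The elimination loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the nearest-centroid classifier, the single fixed train/validation split (cross-validation would normally be used), the toy data, and the names `centroid_accuracy` and `backward_elimination` are all hypothetical.

```python
def centroid_accuracy(train, valid, features):
    """Validation accuracy of a nearest-centroid classifier using only `features`."""
    sums, counts = {}, {}
    for x, y in train:
        acc = sums.setdefault(y, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += x[f]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((x[f] - m) ** 2
                                     for f, m in zip(features, centroids[c])))

    return sum(predict(x) == y for x, y in valid) / len(valid)


def backward_elimination(train, valid, features):
    """Drop one feature at a time for as long as accuracy does not decrease."""
    selected = list(features)
    best = centroid_accuracy(train, valid, selected)
    while len(selected) > 1:
        # Score every feature set that leaves out exactly one candidate.
        trials = [(centroid_accuracy(train, valid,
                                     [g for g in selected if g != f]), f)
                  for f in selected]
        acc, drop = max(trials, key=lambda trial: trial[0])
        if acc < best:           # every removal hurts: stop
            break
        selected.remove(drop)    # equal or higher accuracy: remove the candidate
        best = acc
    return selected, best


# Toy data: feature 0 separates the classes, features 1 and 2 are misleading.
train = [([1.0, 5.0, 0.0], 1), ([1.0, -4.0, 0.5], 1),
         ([-1.0, 4.0, 0.5], 0), ([-1.0, -5.0, 1.0], 0)]
valid = [([0.75, -6.0, 1.0], 1), ([1.25, 3.0, 0.0], 1),
         ([-0.75, 6.0, 0.0], 0), ([-1.25, -3.0, 1.0], 0)]

selected, best = backward_elimination(train, valid, [0, 1, 2])
print(selected, best)
```

On this toy data the loop discards the two misleading features and keeps only feature 0, the one that actually separates the classes.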

Multimodal Methods for Knowledge Discovery from Bulk and Single-Cell. . .

Forward step-wise selection can also be performed. Starting from an empty set, at each step the model greedily selects and adds the most predictive feature from the current candidate list. The step-wise selection terminates when a stopping criterion is met (e.g., a target prediction performance or a target number of biomarkers is reached). Second, a classification model that provides feature importance, such as Random Forest, is used to predict the outcome from all features selected in step 1. Feature importance is often calculated for each candidate biomarker, and a further round of feature selection can be performed based on it.

This two-step approach extends naturally to multi-omics data. Each single-omic dataset is analyzed independently (e.g., using a two-step approach similar to the one described above). Using the trained single-omic classifiers, the outcome (e.g., the disease status) is predicted from each individual modality. The predicted outcome probabilities are then combined into the final prediction by taking either a weighted product of the outcome probabilities from all omics or the maximum probability across all modalities. The final list of biomarkers is the feature set that confers the highest prediction accuracy. The weight for each modality can be learned using nested cross-validation. Dean (2019) discussed a similar strategy for the identification of biomarkers for post-traumatic stress disorder. Methods such as mixOmics (Rohart et al., 2017) and ML-radiation (Lewis & Kemp, 2021) follow a similar strategy.

All the methods in this category use learners at two levels. First, a base learner is built to predict the outcome from each single-omic dataset (e.g., gene expression, protein level). Next, an ensemble learner (meta-learner) is established to predict the final outcome from all omics data. All the hyper-parameters of the meta-learner are optimized via nested cross-validation.
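The two combination rules mentioned above (a weighted product of the per-modality probabilities, or the maximum across modalities) can be sketched as follows. This is an illustrative sketch only: the modality names, the probabilities, the equal weights, and the renormalization step are assumptions, and in practice the weights would be learned by nested cross-validation as described in the text rather than fixed by hand.

```python
import math


def weighted_product(probs_by_modality, weights):
    """Weighted product of per-modality class probabilities, renormalized."""
    classes = list(next(iter(probs_by_modality.values())))
    scores = {}
    for c in classes:
        # Work in log space; the floor avoids log(0) for zero probabilities.
        log_score = sum(weights[m] * math.log(max(p[c], 1e-12))
                        for m, p in probs_by_modality.items())
        scores[c] = math.exp(log_score)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def max_probability(probs_by_modality):
    """Highest probability any modality assigns to each class, renormalized."""
    classes = list(next(iter(probs_by_modality.values())))
    scores = {c: max(p[c] for p in probs_by_modality.values()) for c in classes}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical outputs of two single-omic base classifiers for one sample.
probs = {
    "rna": {"case": 0.8, "control": 0.2},
    "protein": {"case": 0.4, "control": 0.6},
}
weights = {"rna": 1.0, "protein": 1.0}   # assumed; normally tuned by nested CV

combined = weighted_product(probs, weights)
prediction = max(combined, key=combined.get)
print(prediction)
```

Raising the weight of a modality makes its base classifier count for more in the product, which is why the weights are treated as hyper-parameters of the meta-learner.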

6.2 Deep Neural Network Based Multi-Omic Biomarker Identification

The ensemble-learner-based multi-omic biomarker identification methods typically employ classic feature selection methods (e.g., step-wise selection, random forest, lasso regression) for single-omic biomarker discovery. The separate feature selections from each modality are then combined by the ensemble learner to achieve multi-omic biomarker identification. Multi-omics data can also be integrated using deep learning models for the discovery of multi-omic biomarkers. MOGONET (Wang et al., 2021b), a classification model for multi-omics data, is one such method. The workflow of MOGONET can be summarized in three steps. First, feature pre-selection is performed on each omics dataset individually to remove noise, artifacts, and redundant features that may degrade classification. Second, for each omics data type, a weighted sample similarity network is constructed from the omics features, and a graph convolutional network (GCN) is trained using both the omics features and the similarity network. Third,

the cross-omics discovery tensor is calculated from the initial predictions of the omics-specific Graph Convolutional Networks (GCNs) and forwarded to a View Correlation Discovery Network for the final prediction. Important biomarkers are then identified with the trained network by feature ablation. Specifically, the importance of a feature to the classification is measured by the decrease in performance on the validation set after the feature is removed. Features with the largest performance drop are therefore considered the most important.
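The ablation-based importance score described above can be sketched independently of any particular network. In the sketch below the "trained model" is a hypothetical fixed linear classifier standing in for a trained network, "removing" a feature is implemented by setting it to zero, and performance is measured by accuracy. All three choices, like the names `ablation_importance` and `toy_classifier`, are assumptions made for illustration.

```python
def accuracy(predict, samples, labels):
    """Fraction of samples for which `predict` returns the true label."""
    return sum(predict(x) == y for x, y in zip(samples, labels)) / len(labels)


def ablation_importance(predict, samples, labels):
    """Importance of feature j = baseline accuracy - accuracy with j zeroed out."""
    baseline = accuracy(predict, samples, labels)
    importance = {}
    for j in range(len(samples[0])):
        ablated = [x[:j] + [0.0] + x[j + 1:] for x in samples]
        importance[j] = baseline - accuracy(predict, ablated, labels)
    return importance


# Hypothetical stand-in for a trained classifier: a fixed linear rule.
WEIGHTS = [2.0, 0.5, 0.0]


def toy_classifier(x):
    score = sum(w * v for w, v in zip(WEIGHTS, x))
    return 1 if score > 0 else 0


# Hypothetical validation set with three candidate biomarkers per sample.
valid_x = [[1.0, 0.5, 3.0], [0.5, -0.5, -2.0],
           [-1.0, 0.5, 1.0], [-0.5, -1.0, -1.0]]
valid_y = [1, 1, 0, 0]

importance = ablation_importance(toy_classifier, valid_x, valid_y)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking[0], importance)
```

Features whose removal leaves validation performance unchanged receive a score of zero, so the ranking singles out the features the classifier actually relies on.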

7 Closing Remarks and Perspectives

The multimodal methods discussed in this chapter, with their different characteristics and applications (see Table 1 below for an overall summary), have led to a large variety of biomedical advances, such as the identification of novel cell populations (Tini et al., 2019), the discovery of novel targets for therapeutic interventions (Ding et al., 2019), and the development of multi-omic diagnostic panels (Olivier et al., 2019). However, multi-omics data integration and joint analysis are still in their infancy. We still face numerous challenges in different aspects of multimodal computational models and their applications. In closing this chapter, we briefly discuss a few of these major challenges.

First, the availability of relevant multi-omics datasets (particularly at the single-cell level) is still limited. Most of the datasets available today are still RNA-seq measurements (bulk or single-cell) and genotype measurements from genome-wide association studies. In many application scenarios, RNA-seq data are abundant (either generated in-house or curated from public repositories). Other omics data (e.g., proteomics and epigenetics data), on the other hand, are often missing, which limits the application of the aforementioned multimodal methods. This situation should improve in the near future with the development of multi-omic biotechniques (e.g., various co-measurement technologies).

Second, multi-omics data exploitation leaves considerable room for improvement. Data exploitation refers to the effective use of collective multi-omics information to obtain new insights that would not be possible from a unimodal measurement (a single modality). This hinges on model interpretability and explainability.
On the one hand, probabilistic graphical models are easy to interpret but inflexible in modeling complex problems; on the other hand, deep learning models are flexible but difficult to interpret. Striking a balance between the two holds the promise of bridging the gap between rapid methodological and technological developments and the slow process of biological discovery. Furthermore, data exploitation in biomedical research involves integrating not only different measurements of the biomedical process under study but also prior knowledge (e.g., pathways, Gene Ontology, or other known gene sets), which is often incomplete (particularly for less-studied non-model organisms). Therefore, how to integrate incomplete data to shed light on the studied biomedical processes remains very challenging.

Third, as most multi-omics measurements are generated in separate experiments, there may be severe batch effects between the measurements of different modalities. Removing batch effects across modalities will therefore be an important task for future multimodal methods.

Future multimodal methods will be developed to resolve the limitations discussed above. Advances in biotechnology will contribute substantially by providing multi-omics data of much better quality. For example, many co-measurement techniques have been developed in the past few years that allow multiome profiling (e.g., transcriptome and epigenome) of exactly the same cell and could thus significantly reduce the batch effect across modalities. Meanwhile, as multi-omics biotechniques develop, the quality of omics data will improve and the noise level will decrease, which will benefit the joint analysis of multi-omics data for various tasks (cell population identification, GRN inference, and biomarker discovery). Future multimodal methods should also address the exploitation limitation discussed above. Multi-omics measurements (whether heterogeneous or homogeneous) could then be integrated to derive a deeper understanding of the studied biomedical process and eventually drive the discovery of novel diagnostic and therapeutic strategies that will benefit public health.

Table 1 Summary of multi-omics data integration methods. The table gives an overview of the multi-omics data integration methods highlighted in this chapter. It lists the task each method addresses and categorizes the methods by their foundational principles. The inputs and outputs of each tool are given to offer a practical perspective, together with a concise description of how each method handles omics integration, to help users select an appropriate tool for their application. The availability of each method is also provided.

| Task | Method | Category | Bulk/Single-cell | Input | Output | Omics integration | Tool | Reference |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cell population identification | CoupledNMF | NMF | Single-cell | Single-cell multi-omics data matrices | Cell clusters | An NMF-based strategy is employed to couple two matrix factorizations from different modalities. The inference of the coupling matrix A varies between different application scenarios. If it is possible to identify a subset of features in one modality that is linearly predictable from the features measured in the other modalities | https://github.com/SUwonglab/CoupledNM | PMID: 29987051 |
| Cell population identification | PLIER | NMF | Bulk/Single-cell | Gene expression matrix; prior-knowledge pathways and gene sets | Deconvolved cell populations; enriched pathways | Prior biological knowledge can be incorporated into the NMF. Specifically, one way to incorporate information such as pathways or other gene sets is to introduce “reward” terms into the loss function | https://github.com/wgmao/PLIER | PMID: 31249421 |
| Cell population identification | Seurat (V3) | CCA | Single-cell | Single-cell multi-omics data matrices | Cell clusters; many subsequent analyses | Uses CCA to integrate the omics data for clustering | https://satijalab.org/seurat/ | PMID: 31178118 |
| Cell population identification | PLCMF | Multiview | Bulk/Single-cell | Multi-omics data matrices | Cell clusters | PLCMF first performs clustering on each omics dataset separately to obtain pseudo-labels that reflect the intra-view similarities of each view/omics dataset. It then adds a pseudo-label constraint to collective matrix factorization to learn unified latent representations that preserve the intra-view and inter-view similarities simultaneously | https://github.com/WangdiXidian/PLCM | PMID: 33606648 |
| Cell population identification | BABEL | Deep learning | Single-cell | Single-cell multi-omics data matrices | Cell clusters | An autoencoder is employed to project high-dimensional data into a low-dimensional latent space for clustering. Cross-modality translation is enabled by mapping the matrices from multiple modalities into the same shared latent space | https://github.com/wukevin/babel | PMID: 33827925 |
| Gene regulatory network inference | GENIE3 | Regression | Bulk/Single-cell | Gene expression matrix; matrices from other omics | Gene regulatory network | GENIE3 does not support multi-omics data natively. A potential extension is to aggregate GENIE3 predictions on different modalities | https://doi.org/doi:10.18129/B9.bioc.GENIE3 | PMID: 20927193 |
| Gene regulatory network inference | Scribe | Correlation/Information theory | Single-cell | Gene expression matrix; matrices from other omics | Gene regulatory network | By default, Scribe does not support multi-omics data. For a multi-omics extension, RDI in the model can be revised to the case where the information transfer from X to Y is conditional on another factor Z (e.g., another modality) | https://github.com/aristoteleo/Scribe-py | PMID: 32135093 |
| Gene regulatory network inference | SCODE | ODE | Single-cell | Gene expression matrix; matrices from other omics | Gene regulatory network | By default, SCODE does not support multi-omics data. A potential strategy is to combine the differential equations from different modalities with a weighted majority voting scheme. A modality-centered gene regulatory network is inferred for each modality. A regulatory relationship supported by multiple modalities could be assigned a higher score/rank | https://github.com/hmatsu1226/SCODE | PMID: 28379368 |
| Gene regulatory network inference | SCDIFF/CSHMM | PGM | Single-cell | Gene expression matrix; matrices from other omics | Gene regulatory network | By default, SCDIFF/CSHMM support the integration of static TF-gene interactions. To enable the integration of other omics data (e.g., ATAC-seq), the transition model must be revised to calculate the transition probabilities between different cellular states from multi-omics measurements, not just the single-cell RNA-seq data | https://github.com/phoenixding/scdif | PMID: 29317474 |
| Gene regulatory network inference | iDREM | PGM | Bulk | Bulk multi-omics data matrices (time-series) | Gene regulatory network | DREM integrates multi-omics data with the input-output hidden Markov model, in which all the multi-omics information is used to constrain the control/input that dictates the transition probabilities | https://github.com/phoenixding/idrem | PMID: 29538379 |
| Biomarker discovery | mixOmics | Two-step | Bulk | Bulk multi-omics data matrices | Biomarkers | The multi-omics integration is achieved via an ensemble strategy. First, biomarkers are learned from each omics dataset (modality). Then, all the biomarkers are aggregated and ranked to find the top biomarkers across all modalities | http://www.mixomics.org/ | PMID: 29099853 |
| Biomarker discovery | MOGONET | Deep learning | Bulk | Bulk multi-omics data matrices | Biomarkers | The method first learns a deep neural network that takes all multi-omics features (candidate biomarkers) to predict the model (e.g., a classifier). It then finds the most important set of features (biomarkers) with sensitivity analysis | https://github.com/txWang/MOGONE | PMID: 34103512 |

References

Aerts, S., Quan, X.-J., Claeys, A., Naval Sanchez, M., Tate, P., Yan, J., & Hassan, B. A. (2010). Robust target gene discovery through transcriptome perturbations and genome-wide enhancer predictions in Drosophila uncovers a regulatory basis for sensory specification. PLoS Biology, 8(7), e1000435.
Aibar, S., González-Blas, C. B., Moerman, T., Huynh-Thu, V. A., Imrichova, H., Hulselmans, G., Rambow, F., Marine, J.-C., Geurts, P., Aerts, J., et al. (2017). SCENIC: Single-cell regulatory network inference and clustering. Nature Methods, 14(11), 1083–1086.
Aguilar-Bravo, B., & Sancho-Bru, P. (2019). Laser capture microdissection: techniques and applications in liver diseases. Hepatology International, 13(2), 138–147.
Amatori, S., Ballarini, M., Faversani, A., Belloni, E., Fusar, F., Bosari, S., Pelicci, P. G., Minucci, S., & Fanelli, M. (2014). PAT-ChIP coupled with laser microdissection allows the study of chromatin in selected cell populations from paraffin-embedded patient samples. Epigenetics & Chromatin, 7, 18.
Baek, S., & Lee, I. (2020). Single-cell ATAC sequencing analysis: From data preprocessing to hypothesis generation. Computational and Structural Biotechnology Journal, 18, 1429–1439.
Bahrami, M., Maitra, M., Nagy, C., Turecki, G., Rabiee, H. R., & Li, Y. (2020). Deep feature extraction of single-cell transcriptomes by generative adversarial network. Bioinformatics, 37(10), 1345–1351.
Basu, S., Campbell, H. M., Dittel, B. N., & Ray, A. (2010). Purification of specific cell population by fluorescence activated cell sorting (FACS). JoVE (Journal of Visualized Experiments), 10(41), e1546.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
Boyle, A. P., Davis, S., Shulha, H. P., Meltzer, P., Margulies, E. H., Weng, Z., Furey, T. S., & Crawford, G. E. (2008). High-resolution mapping and characterization of open chromatin across the genome. Cell, 132, 311.
Bravo-Merodio, L., Williams, J. A., Gkoutos, G. V., & Acharjee, A. (2019). -omics biomarker identification pipeline for translational medicine. Journal of Translational Medicine, 17(1), 1–10.
Buchberger, A. R., DeLaney, K., Johnson, J., & Li, L. (2018). Mass spectrometry imaging: A review of emerging advancements and future insights. Analytical Chemistry, 90, 240.
Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y., & Greenleaf, W. J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods, 10, 1213–1218.
Campbell, K. R., & Yau, C. (2018). Uncovering pseudotemporal trajectories with covariates from single cell and bulk expression data. Nature Communications, 9(1), 1–12.
Campos, E. I., & Reinberg, D. (2009). Histones: annotating chromatin. Annual Review of Genetics, 43, 559–599.
Caughlin, S., Maheshwari, S., Agca, Y., Agca, C., Harris, A. J., Jurcic, K., Yeung, K. K., Cechetto, D. F., & Whitehead, S. N. (2018). Membrane-lipid homeostasis in a prodromal rat model of Alzheimer’s disease: Characteristic profiles in ganglioside distributions during aging detected using MALDI imaging mass spectrometry. Biochimica et biophysica acta. General Subjects, 1862, 1327–1338.
Chen, J., Zhuang, X., Zheng, J., Yang, R., Wu, F., Zhang, A., & Fang, B. (2021). Aptamer-based cell-free detection system to detect target protein. Synthetic and Systems Biotechnology, 6, 209–215.
Chu, L.-F., Leng, N., Zhang, J., Hou, Z., Mamott, D., Vereide, D. T., Choi, J., Kendziorski, C., Stewart, R., & Thomson, J. A. (2016). Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biology, 17(1), 1–20.
Civita, P., Franceschi, S., Aretini, P., Ortenzi, V., Menicagli, M., Lessi, F., Pasqualetti, F., Giuseppe Naccarato, A., & Maria Mazzanti, C. (2019). Laser capture microdissection and RNA-seq analysis: High sensitivity approaches to explain histopathological heterogeneity in human glioblastoma FFPE archived tissues. Frontiers in Oncology, 9(JUN), 482.
De Jager, P. L., Ma, Y., McCabe, C., Xu, J., Vardarajan, B. N., Felsky, D., Klein, H.-U., White, C. C., Peters, M. A., Lodgson, B., et al. (2018). A multi-omic atlas of the human frontal cortex for aging and Alzheimer’s disease research. Scientific Data, 5(1), 1–13.
Dean, K. R. (2019). Multi-omic Biomarker Identification and Characterization for Posttraumatic Stress Disorder. PhD thesis, Harvard University.
Delorey, T. M., et al. (2021). COVID-19 tissue atlases reveal SARS-CoV-2 pathology and cellular targets. Nature, 595(7865), 107–113.
Desai, N., et al. (2020). Temporal and spatial heterogeneity of host response to SARS-CoV-2 pulmonary infection. Nature Communications, 11(1), 1–15.
Ding, J., Aronow, B. J., Kaminski, N., Kitzmiller, J., Whitsett, J. A., & Bar-Joseph, Z. (2018). Reconstructing differentiation networks and their regulation from time series single-cell expression data. Genome Research, 28(3), 383–395.
Ding, J., Hagood, J. S., Ambalavanan, N., Kaminski, N., & Bar-Joseph, Z. (2018). iDREM: Interactive visualization of dynamic regulatory networks. PLoS Computational Biology, 14(3), e1006019.
Ding, J., Ahangari, F., Espinoza, C. R., Chhabra, D., Nicola, T., Yan, X., Lal, C. V., Hagood, J. S., Kaminski, N., Bar-Joseph, Z., et al. (2019). Integrating multiomics longitudinal data to reconstruct networks underlying lung development. American Journal of Physiology-Lung Cellular and Molecular Physiology, 317(5), L556–L568.

Duren, Z., Chen, X., Zamanighomi, M., Zeng, W., Satpathy, A. T., Chang, H. Y., Wang, Y., & Wong, W. H. (2018). Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proceedings of the National Academy of Sciences, 115(30), 7723–7728.
Duren, Z., Chen, X., Xin, J., Wang, Y., & Wong, W. H. (2020). Time course regulatory analysis based on paired expression and chromatin accessibility data. Genome Research, 30(4), 622–634.
Duttke, S. H., Chang, M. W., Heinz, S., & Benner, C. (2019). Identification and dynamic quantification of regulatory elements using total RNA. Genome Research, 29(11), 1836–1846.
Eddy, S., Mariani, L. H., & Kretzler, M. (2020). Integrated multi-omics approaches to improve classification of chronic kidney disease. Nature Reviews Nephrology, 16(11), 657–668.
Ernst, J., Vainas, O., Harbison, C. T., Simon, I., & Bar-Joseph, Z. (2007). Reconstructing dynamic regulatory maps. Molecular Systems Biology, 3(1), 74.
Espina, V., Wulfkuhle, J. D., Calvert, V. S., VanMeter, A., Zhou, W., Coukos, G., Geho, D. H., Petricoin, E. F., & Liotta, L. A. (2006). Laser-capture microdissection. Nature Protocols, 1(2), 586–603.
Gama-Castro, S., Salgado, H., Santos-Zavaleta, A., Ledezma-Tejeida, D., Muñiz-Rascado, L., García-Sotelo, J. S., Alquicira-Hernández, K., Martínez-Flores, I., Pannier, L., Castro-Mondragón, J. A., et al. (2016). RegulonDB version 9.0: High-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Research, 44(D1), D133–D143.
Gayoso, A., Steier, Z., Lopez, R., Regier, J., Nazor, K. L., Streets, A., & Yosef, N. (2021). Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nature Methods, 18, 272–282.
Giresi, P. G., Kim, J., McDaniell, R. M., Iyer, V. R., & Lieb, J. D. (2007). FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Research, 17, 877.
Grob, S., & Cavalli, G. (2018). Technical review: A Hitchhiker’s guide to chromosome conformation capture. Methods in Molecular Biology (Clifton, N.J.), 1675, 233–246.
Hasin, Y., Seldin, M., & Lusis, A. (2017). Multi-omics approaches to disease. Genome Biology, 18(1), 1–15.
Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H., & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell, 38, 576–89.
Herrera, J. A., Mallikarjun, V., Rosini, S., Montero, M. A., Lawless, C., Warwood, S., O’Cualain, R., Knight, D., Schwartz, M. A., & Swift, J. (2020). Laser capture microdissection coupled mass spectrometry (LCM-MS) for spatially resolved analysis of formalin-fixed and stained human lung tissues. Clinical Proteomics, 17, 1–12.
Hore, V., Viñuela, A., Buil, A., Knight, J., McCarthy, M. I., Small, K., & Marchini, J. (2016). Tensor decomposition for multiple-tissue gene expression experiments. Nature Genetics, 48, 1094–1100.
Hinton, G. (2006). Reducing the dimensionality of data with neural networks. Science (New York, NY), 313, 504.
Hughes, T. K., Wadsworth, M. H., Gierahn, T. M., Do, T., Weiss, D., Andrade, P. R., et al. (2020). Second-strand synthesis-based massively parallel scRNA-seq reveals cellular states and molecular features of human inflammatory skin pathologies. Immunity, 53(4), 878–894.
Hurley, K., Ding, J., Villacorta-Martin, C., Herriges, M. J., Jacob, A., Vedaie, M., Alysandratos, K. D., Sun, Y. L., Lin, C., Werder, R. B., et al. (2020). Reconstructed single-cell fate trajectories define lineage plasticity windows during differentiation of human PSC-derived distal lung progenitors. Cell Stem Cell, 26(4), 593–608.
Hussein, I. I., Roscoe, C. W., Wilkins, M. P., & Schumacher, P. W. (2015). On mutual information for observation-to-observation association. In 2015 18th International Conference on Information Fusion (Fusion) (pp. 1293–1298). IEEE.

Huynh-Thu, V. A., & Geurts, P. (2018). dynGENIE3: dynamical GENIE3 for the inference of gene networks from time series expression data. Scientific Reports, 8(1), 1–12.
Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., & Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods. PloS One, 5(9), e12776.
Jäger, R., Migliorini, G., Henrion, M., Kandaswamy, R., Speedy, H. E., Heindl, A., Whiffin, N., Carnicer, M. J., Broome, L., Dryden, N., Nagano, T., Schoenfelder, S., Enge, M., Yuan, Y., Taipale, J., Fraser, P., Fletcher, O., & Houlston, R. S. (2015). Capture Hi-C identifies the chromatin interactome of colorectal cancer risk loci. Nature Communications, 6, 6178.
Jia, G., Preussner, J., Chen, X., Guenther, S., Yuan, X., Yekelchyk, M., Kuenne, C., Looso, M., Zhou, Y., Teichmann, S., et al. (2018). Single cell RNA-seq and ATAC-seq analysis of cardiac progenitor cell transition states and lineage settlement. Nature Communications, 9(1), 1–17.
Johnson, S. M., Tan, F. J., McCullough, H. L., Riordan, D. P., & Fire, A. Z. (2006). Flexibility and constraint in the nucleosome core landscape of Caenorhabditis elegans chromatin. Genome Research, 16, 1505.
Klein, D. C., & Hainer, S. J. (2020). Genomic methods in profiling DNA accessibility and factor localization. Chromosome Research, 28, 69.
Khan, A., Fornes, O., Stigliani, A., Gheorghe, M., Castro-Mondragon, J. A., van der Lee, R., Bessy, A., Chèneby, J., Kulkarni, S. R., Tan, G., Baranasic, D., Arenillas, D. J., Sandelin, A., Vandepoele, K., Lenhard, B., Ballester, B., Wasserman, W. W., Parcy, F., & Mathelier, A. (2018). JASPAR 2018: Update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Research, 46(D1), D1284.
Klami, A., Bouchard, G., & Tripathi, A. (2014). Group-sparse embeddings in collective matrix factorization. In Proceedings of International Conference on Learning Representations (ICLR) 2014.
Lawrence, M., Daujat, S., & Schneider, R. (2016). Lateral thinking: How histone modifications regulate gene expression. Trends in Genetics, 32, 42–56.
Li, X., & Wang, C.-Y. (2021). From bulk, single-cell to spatial RNA sequencing. International Journal of Oral Science, 13(1), 1–6.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788–791.
Levine, M., & Davidson, E. H. (2005). Gene regulatory networks for development. Proceedings of the National Academy of Sciences, 102(14), 4936–4942.
Lewis, J. E., & Kemp, M. L. (2021). Integration of machine learning and genome-scale metabolic modeling identifies multi-omics biomarkers for radiation resistance. Nature Communications, 12(1), 1–14.
Lin, C., & Bar-Joseph, Z. (2019). Continuous-state HMMs for modeling time-series single-cell RNA-Seq data. Bioinformatics, 35(22), 4707–4715.
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I., & Yosef, N. (2018). Deep generative modeling for single-cell transcriptomics. Nature Methods, 15, 1053–1058.
Macklin, A., Khan, S., & Kislinger, T. (2020). Recent advances in mass spectrometry based clinical proteomics: Applications to cancer research. Clinical Proteomics, 17(1), 1–25.
Mao, W., Zaslavsky, E., Hartmann, B. M., Sealfon, S. C., & Chikina, M. (2019). Pathway-level information extractor (PLIER) for gene expression data. Nature Methods, 16, 1–9.
Matsumoto, H., Kiryu, H., Furusawa, C., Ko, M. S., Ko, S. B., Gouda, N., Hayashi, T., & Nikaido, I. (2017). SCODE: An efficient regulatory network inference algorithm from single-cell RNA-seq during differentiation. Bioinformatics, 33(15), 2314–2321.
Meyer, E., Aglyamova, G., & Matz, M. (2011). Profiling gene expression responses of coral larvae (Acropora millepora) to elevated temperature and settlement inducers using a novel RNA-seq procedure. Molecular Ecology, 20(17), 3599–3616.
McCord, R. P., Kaplan, N., & Giorgetti, L. (2020). Chromosome conformation capture and beyond: Toward an integrative view of chromosome structure and function. Molecular Cell, 77, 688–708.
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. Preprint. arXiv:1802.03426.

Moerman, T., Aibar Santos, S., Bravo González-Blas, C., Simm, J., Moreau, Y., Aerts, J., & Aerts, S. (2019). GRNBoost2 and arboreto: Efficient and scalable inference of gene regulatory networks. Bioinformatics, 35(12), 2159–2161.
Mumbach, M. R., Rubin, A. J., Flynn, R. A., Dai, C., Khavari, P. A., Greenleaf, W. J., & Chang, H. Y. (2016). HiChIP: Efficient and sensitive analysis of protein-directed genome architecture. Nature Methods, 13(11), 919–922.
Nechaev, S., Fargo, D. C., Santos, G. D., Liu, L., Gao, Y., & Adelman, K. (2010). Global analysis of short RNAs reveals widespread promoter-proximal stalling and arrest of Pol II in Drosophila. Science (New York, N.Y.), 327, 335–338.
Noberini, R., Sigismondo, G., & Bonaldi, T. (2016). The contribution of mass spectrometry-based proteomics to understanding epigenetics. Epigenomics, 8, 429–445.
Olivier, M., Asmis, R., Hawkins, G. A., Howard, T. D., & Cox, L. A. (2019). The need for multi-omics biomarker signatures in precision medicine. International Journal of Molecular Sciences, 20(19), 4781.
Omranian, N., Eloundou-Mbebi, J. M., Mueller-Roeber, B., & Nikoloski, Z. (2016). Gene regulatory network inference using fused lasso on multiple data sets. Scientific Reports, 6(1), 1–14.
Pal, K., Forcato, M., & Ferrari, F. (2019). Hi-C analysis: from data generation to integration. Biophysical Reviews, 11, 67.
Pombo, A., & Dillon, N. (2015). Three-dimensional genome architecture: players and mechanisms. Nature Reviews. Molecular Cell Biology, 16, 245–257.
Patel, D. J., & Wang, Z. (2013). Readout of epigenetic modifications. Annual Review of Biochemistry, 82, 81–118.
Perkel, J. M., et al. (2021). Single-cell analysis enters the multiomics age. Nature, 595(7868), 614–616.
Peterson, V. M., Zhang, K. X., Kumar, N., Wong, J., Li, L., Wilson, D. C., Moore, R., Mcclanahan, T. K., Sadekova, S., & Klappenbach, J. A. (2017). Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology, 35(10), 936–939.
Qiu, X., Rahimzamani, A., Wang, L., Ren, B., Mao, Q., Durham, T., McFaline-Figueroa, J. L., Saunders, L., Trapnell, C., & Kannan, S. (2020). Inferring causal gene regulatory networks from coupled single-cell expression dynamics using Scribe. Cell Systems, 10(3), 265–274.
Raj, B., Wagner, D. E., McKenna, A., Pandey, S., Klein, A. M., Shendure, J., Gagnon, J. A., & Schier, A. F. (2018). Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nature Biotechnology, 36(5), 442–450.
Ranzoni, A. M., Tangherloni, A., Berest, I., Riva, S. G., Myers, B., Strzelecka, P. M., Xu, J., Panada, E., Mohorianu, I., Zaugg, J. B., et al. (2021). Integrative single-cell RNA-seq and ATAC-seq analysis of human developmental hematopoiesis. Cell Stem Cell, 28(3), 472–487.
Rohart, F., Gautier, B., Singh, A., & Lê Cao, K.-A. (2017). mixOmics: An R package for ‘omics feature selection and multiple data integration. PLoS Computational Biology, 13(11), e1005752.
Rumelhart, D., & Hinton, G. (1986). Learning representations by back-propagating errors. Nature, 323(9), 533–536.
Sahlén, P., Abdullayev, I., Ramsköld, D., Matskova, L., Rilakovic, N., Lötstedt, B., Albert, T. J., Lundeberg, J., & Sandberg, R. (2015). Genome-wide mapping of promoter-anchored interactions with close to single-enhancer resolution. Genome Biology, 16, 1–13.
Sandelin, A., Alkema, W., Engström, P., Wasserman, W. W., & Lenhard, B. (2004). JASPAR: An open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research, 32, 91–94.
Santoro, S. W., & Dulac, C. (2015). Histone variants and cellular plasticity. Trends in Genetics, 31, 516–27.
Schreiber, T. (2000). Measuring information transfer. Physical Review Letters, 85(2), 461.
Scruggs, B. S., Gilchrist, D. A., Nechaev, S., Muse, G. W., Burkholder, A., Fargo, D. C., & Adelman, K. (2015). Bidirectional transcription arises from two distinct hubs of transcription factor binding and active chromatin. Molecular Cell, 58, 1101–1112.


Seth, A. (2007). Granger causality. Scholarpedia, 2(7), 1667. Seung, D., & Lee, L. (2001). Algorithms for non-negative matrix factorization. Advances in Neural Information Processing Systems, 13, 556–562. Skene, P. J., & Henikoff, S. (2017). An efficient targeted nuclease strategy for high-resolution mapping of DNA binding sites. eLife, 6, e21856. Spitzer, M. H., & Nolan, G. P. (2016). Mass cytometry: Single cells, many features. Cell, 165, 780. Stoeckius, M., Hafemeister, C., Stephenson, W., Houck-Loomis, B., Chattopadhyay, P. K., Swerdlow, H., Satija, R., Smibert, P. (2017). Simultaneous epitope and transcriptome measurement in single cells. Nature Methods, 14(9), 865–868. Stuart, T., & Satija, R. (2019). Integrative single-cell analysis. Nature Reviews Genetics, 20, 257– 272. Subramanian, I., Verma, S., Kumar, S., Jere, A., & Anamika, K. (2020). Multi-omics data integration, interpretation, and its application. Bioinformatics and Biology Insights, 14, 1177932219899051. Sun, J., Taylor, D., & Bollt, E. M. (2015). Causal network inference by optimal causation entropy. SIAM Journal on Applied Dynamical Systems, 14(1), 73–106. Svensson, V. (2020). Droplet scRNA-seq is not zero-inflated. Nature Biotechnology, 38(2), 147– 150. Taber, A., Christensen, E., Lamy, P., Nordentoft, I., Prip, F., Lindskrog, S. V., BirkenkampDemtröder, K., Okholm, T. L. H., Knudsen, M., Pedersen, J. S., et al. (2020). Molecular correlates of cisplatin-based chemotherapy response in muscle invasive bladder cancer by integrated multi-omics analysis. Nature Communications, 11(1), 1–15. Thomas, O., Dutta, R., Corander, J., Kaski, S., & Gutmann, M. U. (2022). Likelihood-free inference by ratio estimation. Bayesian Analysis, 17(1), 1–31. Tini, G., Marchetti, L., Priami, C., & Scott-Boyer, M.-P. (2019). Multi-omics integration—A comparison of unsupervised clustering methodologies. Briefings in Bioinformatics, 20(4), 1269–1279. Traag, V. A., Waltman, L., & Van Eck, N. J. (2019). 
From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports, 9(1), 1–12. Tracey, L. J., An, Y., & Justice, M. J. (2021). CyTOF: An emerging technology for single-cell proteomics in the mouse. Current Protocols, 1, e118. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579. van Galen, P., Hovestadt, V., Wadsworth II, M. H., Hughes, T. K., Griffin, G. K., Battaglia, S., Verga, J. A., Stephansky, J., Pastika, T. J., Story, J. L., et al. (2019). Single-cell RNA-seq reveals AML hierarchies relevant to disease progression and immunity. Cell, 176(6), 1265–1281. Wang, D., Han, S., Wang, Q., He, L., Tian, Y., & Gao, X. (2021). Pseudo-label guided collective matrix factorization for multiview clustering. IEEE Transactions on Cybernetics, 52, 8681. Wang, T., Shao, W., Huang, Z., Tang, H., Zhang, J., Ding, Z., & Huang, K. (2021). Mogonet integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nature Communications, 12(1), 1–13. Wei, C. L., Wu, Q., Vega, V. B., Chiu, K. P., Ng, P., Zhang, T., Shahab, A., Yong, H. C., Fu, Y. T., Weng, Z., Liu, J., Zhao, X. D., Chew, J. L., Lee, Y. L., Kuznetsov, V. A., Sung, W. K., Miller, L. D., Lim, B., Liu, E. T., Yu, Q., Ng, H. H., & Ruan, Y. (2006). A global map of p53 transcription-factor binding sites in the human genome. Cell, 124, 207–219. Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R., Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., & Stuart, J. M. (2013). The cancer genome atlas pan-cancer analysis project. Nature Genetics, 45(10), 1113–1120. Welch, J. D., Kozareva, V., Ferreira, A., Vanderburg, C., Martin, C., & Macosko, E. Z. (2019). Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell, 177, 1873–1887.e17.


Woodhouse, S., Piterman, N., Wintersteiger, C. M., Göttgens, B., & Fisher, J. (2018). SCNS: a graphical tool for reconstructing executable regulatory networks from single-cell genomic data. BMC Systems Biology, 12(1), 1–7. Wu, K. E., Yost, K. E., Chang, H. Y., & Zou, J. (2021). BABEL enables cross-modality translation between multiomic profiles at single-cell resolution. Proceedings of the National Academy of Sciences, 118(15), e2023070118. Yan, F., Powell, D. R., Curtis, D. J., & Wong, N. C. (2020). From reads to insight: A hitchhiker’s guide to ATAC-seq data analysis. Genome Biology, 21, 1–16. Yu, M., & Ren, B. (2017). The three-dimensional organization of mammalian genomes. Annual Review of Cell and Developmental Biology, 33, 265–289. https://doi.org/10.1146/annurev-cellbio-100616-060531 Zafar, H., Lin, C., & Bar-Joseph, Z. (2020). Single-cell lineage tracing by integrating CRISPR-Cas9 mutations with transcriptomic data. Nature Communications, 11(1), 1–14. Zhang, S., Li, Q., Liu, J., & Zhou, X. J. (2011). A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics (Oxford, England), 27, i401–i409. Zhao, Y., Cai, H., Zhang, Z., Tang, J., & Li, Y. (2021). Learning interpretable cellular and gene signature embeddings from single-cell transcriptomic data. Nature Communications, 12(1), 5261. Zhou, S., et al. (2021). A Neanderthal OAS1 isoform protects individuals of European ancestry against COVID-19 susceptibility and severity. Nature Medicine, 27, 659–667.

Negative Sample Selection for miRNA-Disease Association Prediction Models

Yulian Ding, Fei Wang, Yuchen Zhang, and Fang-Xiang Wu

1 Introduction

The human genome encodes a host of microRNAs (miRNAs), a group of small single-stranded non-coding RNAs, each of which consists of about 22 nucleotides (Bartel, 2004). MiRNAs participate in the regulation of various biological processes, such as cell division, cell proliferation, cell death, immune reaction, and aging (Ambros, 2003; Miska, 2005; Taganov et al., 2006). They influence those biological processes by functioning in RNA silencing and post-transcriptional regulation of gene expression, including cleaving the messenger RNA (mRNA) strand and destabilizing the mRNA by shortening its poly(A) tail (Bartel, 2009; Fabian et al., 2010). So far, more than 2300 miRNAs have been found to regulate about 60% of human genes (Alles et al., 2019). Increasing evidence shows that the dysregulation of miRNAs is associated with the occurrence of various human

Y. Ding (✉)
Shenzhen Key Laboratory of Intelligent Bioinformatics, Central for High Performance Computing, Shenzhen Institute of Advanced Technology, Shenzhen, China
e-mail: [email protected]

F. Wang
School of Artificial Intelligence, Anhui University, Hefei, Anhui, China
e-mail: [email protected]

Y. Zhang
Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK, Canada
School of Computer Science, Shaanxi Normal University, Xi’an, China
e-mail: [email protected]

F.-X. Wu
Division of Biomedical Engineering, Department of Mechanical Engineering, Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_5


diseases. Therefore, miRNAs are considered one type of biomarker (Lu et al., 2008). Identifying miRNA-disease associations (MDAs) can help in understanding the complex mechanisms of diseases, which further improves their diagnosis, treatment, prognosis, and prevention.

Traditionally, researchers apply biological experimental techniques to identify MDAs, such as Northern blot and HITS-CLIP (Thomson et al., 2011). Those experimental methods can identify MDAs accurately, but they are costly and time-consuming. Because of this low efficiency, only a small portion of MDAs has been identified and most MDAs are still unknown, which calls for more investigation. With the accumulation of MDAs verified by those experimental methods in the past few decades, some researchers constructed MDA databases, such as HMDD v2.0 (Li et al., 2013), HMDD v3.0 (Huang et al., 2019), dbDEMC (Yang et al., 2017), and miRCancer (Xie et al., 2013). Those databases serve as testbeds for computational MDA prediction methods, which can propose high-probability MDAs for biological experiments and accelerate the identification of novel MDAs.

So far, many computational methods have been proposed for MDA prediction, and their performance has improved steadily (Chen et al., 2018; Chen & Huang, 2017; Chen et al., 2021b, 2016). Machine learning-based MDA prediction models are the main class of computational methods and usually show good performance, as they can extract complex underlying non-linear associations (Li et al., 2020; Ding et al., 2021a; Ji et al., 2021; Ding et al., 2021b). Those models formulate the prediction problem as binary classification: a classifier is trained with the verified associations and some unknown associations, and the trained classifier is then applied to assign each unknown MDA one of two states, associated or unassociated. Xu et al. predict novel MDAs by using a support vector machine (SVM)-based classifier to distinguish positive MDAs from negative MDAs (Xu et al., 2011). Chen et al. apply regularized least squares to obtain two classifiers, in the disease space and the miRNA space respectively, and propose an MDA prediction model named RLMDA (Wang et al., 2017). Zhou et al. develop a GBDT-LR model to predict novel MDAs by combining gradient boosting decision trees with logistic regression (Zhou et al., 2020).

Furthermore, some deep learning-based MDA prediction methods have been proposed in recent years. Peng et al. propose an MDA prediction method based on a convolutional neural network (MDA-CNN), which extracts the features of miRNAs and diseases from a three-layer heterogeneous network with an autoencoder and then feeds the extracted features into a CNN for the final prediction (Peng et al., 2019). Ding et al. develop a deep belief network (DBN)-based matrix factorization model for MDA prediction (Ding et al., 2020b). Ding et al. further improve the MDA prediction performance with a variational graph autoencoder model (Ding et al., 2020a). Zhang et al. utilize a graph convolutional attention network (GCN) to predict MDAs (Zhang et al., 2019). Those machine learning-based models have made significant progress in MDA prediction since they can capture deep latent relationships between items. However, one common problem is that there are no verified negative samples in


currently published datasets for training supervised machine learning models for predicting MDAs. Researchers address this problem by randomly selecting some unknown samples as negative samples, which may introduce noise into the prediction model. Since the current databases include only a small portion of verified MDAs (e.g., HMDD v2.0 contains 5441 verified associations and 186,442 unknown associations), randomly selecting unknown associations as negative samples may label some true positives as negatives. As the performance of supervised learning methods relies heavily on the quality of the labeled samples, random negative sample selection might degrade the MDA prediction performance. Therefore, it is appealing to develop a strategy for selecting reliable negative samples to improve the performance of supervised machine learning-based prediction models.

This chapter proposes a deep autoencoder-based approach to select negative samples for machine learning-based MDA prediction models. Firstly, we obtain the feature representation of each miRNA-disease pair by combining the feature representations of the miRNA and the disease derived from the MDA association matrix. Then, a deep autoencoder is trained with all verified miRNA-disease pairs. After that, all the unknown MDAs are input into the well-trained deep autoencoder (which models positive samples), and the reconstruction error is calculated for each unknown miRNA-disease pair. Finally, we sort the unknown samples according to their reconstruction errors; a sample with a high reconstruction error has a high probability of being a negative sample.

2 Methods

The deep autoencoder-based negative sample selection model, DAE-N, includes three steps: step 1 obtains the representation of each miRNA-disease pair; step 2 trains a deep autoencoder model with all the verified miRNA-disease associations; step 3 applies the well-trained deep autoencoder from step 2 to get the reconstruction errors of all the unknown miRNA-disease pairs, and these reconstruction errors are used as the measure to determine the negative samples. The framework of DAE-N is shown in Fig. 1. Next, we introduce those three steps in detail.

2.1 Obtain the Feature Representations for Each miRNA-Disease Sample

Feature representations carry all the information that is input to the model, so obtaining appropriate representations for miRNAs and diseases is important for neural network-based methods. Since this negative selection method is an


Fig. 1 The flow chart of DAE-N model

auxiliary tool for improving machine learning-based MDA prediction models, we should consider its applicability and complexity. Multi-view feature representations in DAE-N could mitigate the missing-data problem, but they would increase the complexity of the original MDA prediction model and change the final number of views used in the MDA prediction. Therefore, to make the DAE-N model easy to apply, we obtain the feature representations based only on the MDA databases. Suppose there are m diseases and n miRNAs in the MDA database. Then, we construct an MDA adjacency matrix $A \in \mathbb{R}^{n \times m}$. In this matrix, $A_{ij} = 1$ if miRNA $M_i$ and disease $D_j$ are associated; otherwise, $A_{ij} = 0$. The feature representation of a miRNA is the corresponding row of A (a vector of length m), while the feature representation of a disease is the corresponding column of A (a vector of length n). Finally, we get the feature representations for the $n \times m$ miRNA-disease samples by concatenating the miRNA representation and the disease representation, so each sample is a vector of length $m + n$. This process is shown as step 1 in Fig. 1.
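The construction of the pair representations can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions: the function name `build_pair_features` and the toy matrix are hypothetical, and the disease-first ordering follows the definition $x_i = [D_d, M_m]$ used in Sect. 2.2.

```python
from typing import Dict, List, Tuple


def build_pair_features(A: List[List[int]]) -> Dict[Tuple[int, int], List[int]]:
    """Map each (miRNA index i, disease index j) to its feature vector.

    A is the n x m binary adjacency matrix (rows: miRNAs, columns: diseases).
    Each vector is the disease column (length n) followed by the miRNA row
    (length m), so its length is m + n.
    """
    n, m = len(A), len(A[0])
    # Column j of A is the feature representation of disease j.
    columns = [[A[i][j] for i in range(n)] for j in range(m)]
    features = {}
    for i in range(n):
        for j in range(m):
            features[(i, j)] = columns[j] + A[i]
    return features


# Toy adjacency matrix (hypothetical data): 3 miRNAs, 2 diseases.
A = [
    [1, 0],
    [0, 1],
    [1, 1],
]
pairs = build_pair_features(A)
positives = [p for p in pairs if A[p[0]][p[1]] == 1]  # verified associations
unknowns = [p for p in pairs if A[p[0]][p[1]] == 0]   # candidates for negatives
```

With the toy matrix, the pair (miRNA 0, disease 1) gets the disease column `[0, 1, 1]` followed by the miRNA row `[1, 0]`.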

2.2 Train the Deep Autoencoder with All the Verified Samples

An autoencoder is a type of unsupervised neural network proposed by Hinton and colleagues (Rumelhart et al., 1986). This network generates a latent encoding that is validated and refined by attempting to regenerate the input from the encoding. In recent decades, autoencoders have been widely applied to many problems, such as feature representation learning for unlabeled data, dimensionality reduction, facial recognition, anomaly detection, and generative modeling, which can randomly generate new data similar to the training data (Hinton et al., 2011; Liou et al., 2014;


Géron, 2019). In this chapter, an autoencoder is used as an unsupervised technique to distinguish unassociated miRNA-disease pairs from associated miRNA-disease pairs by learning the latent characteristics of all the verified MDAs. The deep autoencoder model includes two parts: an encoder and a decoder. The encoder compresses the high-dimensional pair representation into a low-dimensional latent code, while the decoder reconstructs the input from the latent code. The structure of this deep autoencoder is shown as step 2 in Fig. 1, which includes seven fully connected neural network layers. The purpose of this network is to reconstruct its input, that is, to minimize the difference between the input and the output, so the loss function is the reconstruction error. The smaller the reconstruction error is, the better the model reproduces the input data. During the training process, all verified MDAs are used as training samples. Suppose we have K known MDA samples; then we define the i-th miRNA-disease sample as $x_i = [D_d, M_m] \in \mathbb{R}^{(m+n)}$, where $D_d$ is the feature representation of the disease in $x_i$ and $M_m$ is the feature representation of the miRNA in $x_i$. Using $x_i$ as input, the encoder maps the features to a low-dimensional latent code $z_i$ by the following equations:

$$h_i^{(l)} = f(w^{l} h_i^{(l-1)} + b^{l}), \qquad (1)$$

$$z_i = w^{L} h_i^{(L-1)} + b^{L}, \qquad (2)$$

where $h_i^{(l)}$ is the output of the l-th hidden layer, $h_i^{(0)}$ is the input $x_i$, $w^{l}$ is the weight matrix of the l-th hidden layer, $b^{l}$ is the bias of the l-th layer, $l \in [1, \ldots, L]$, $i \in [1, \ldots, K]$, and L is the total number of hidden layers of the encoder. Besides, $z_i$ is the output of the encoder, which is the dense latent representation of $x_i$, and $f()$ is the nonlinear activation function, which is set to ReLU in this model. The decoder aims to reconstruct the input $x_i$ from the latent representation $z_i$ produced by the encoder. The decoder is described by the following equations:

$$h_i^{(l)} = f'(w^{l} h_i^{(l-1)} + b^{l}), \qquad (3)$$

$$x_i' = g(w^{L} h_i^{(L-1)} + b^{L}), \qquad (4)$$

where $f'()$ and $g()$ are the nonlinear activation functions ReLU and Sigmoid, respectively, $x_i'$ is the reconstruction of the input $x_i$, and all the other variables have the same meaning as in the encoder. Finally, the loss of the autoencoder is the average of the reconstruction errors over all the training samples, which is defined as follows:

$$L(x_i, x_i') = -\frac{1}{K(m+n)} \sum_{i=1}^{K} \sum_{j=1}^{m+n} \left( x_{ij} \log(x_{ij}') + (1 - x_{ij}) \log(1 - x_{ij}') \right), \qquad (5)$$


where $m + n$ is the length of the representation vector of a miRNA-disease pair, K is the total number of verified miRNA-disease associations, and $x_{ij}$ is the j-th element of vector $x_i$. All the parameters in the autoencoder are updated iteratively by minimizing the loss above. After the training process finishes, we get a well-trained autoencoder that captures the characteristics of all the verified MDAs.
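To make the encoder, decoder, and loss equations concrete, the following is a minimal forward-pass sketch in plain Python. It is not the authors' TensorFlow implementation: the helper names, the tiny layer sizes, and the random initialization are illustrative assumptions, and no training step is shown. It only evaluates Eqs. (1) to (5) for given weights.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight matrix, bias vector)


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(layer: Layer, h: Vector) -> Vector:
    """Compute w h + b for one fully connected layer."""
    w, b = layer
    return [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w, b)]


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def encode(layers: List[Layer], x: Vector) -> Vector:
    h = x
    for layer in layers[:-1]:
        h = relu(affine(layer, h))         # Eq. (1) with f = ReLU
    return affine(layers[-1], h)           # Eq. (2): linear latent code z


def decode(layers: List[Layer], z: Vector) -> Vector:
    h = z
    for layer in layers[:-1]:
        h = relu(affine(layer, h))         # Eq. (3) with f' = ReLU
    return sigmoid(affine(layers[-1], h))  # Eq. (4) with g = Sigmoid


def reconstruction_error(x: Vector, x_rec: Vector, eps: float = 1e-12) -> float:
    """Binary cross-entropy of one sample, averaged over its m + n elements."""
    total = 0.0
    for xj, rj in zip(x, x_rec):
        rj = min(max(rj, eps), 1.0 - eps)  # clip so that log() stays finite
        total += xj * math.log(rj) + (1.0 - xj) * math.log(1.0 - rj)
    return -total / len(x)


def loss(samples: List[Vector], reconstructions: List[Vector]) -> float:
    """Eq. (5): mean reconstruction error over the K training samples."""
    errors = [reconstruction_error(x, r) for x, r in zip(samples, reconstructions)]
    return sum(errors) / len(errors)


rng = random.Random(0)
sizes = [5, 4, 3, 2]  # toy stand-in for (m + n) -> hidden -> hidden -> latent
encoder = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
rev = sizes[::-1]
decoder = [init_layer(a, b, rng) for a, b in zip(rev, rev[1:])]

x = [0.0, 1.0, 1.0, 1.0, 0.0]  # one toy pair representation
z = encode(encoder, x)
x_rec = decode(decoder, z)
```

Because all samples have the same length $m + n$, averaging the per-sample errors gives the same value as the double sum in Eq. (5). The per-sample `reconstruction_error` is also the quantity that Sect. 2.3 uses to rank unknown pairs.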

2.3 Sort All the Unknown Samples by the Deep Autoencoder

We sort all the unknown MDA samples using the well-trained autoencoder from Sect. 2.2. The underlying assumption is that high-probability negative samples should have feature representations very different from those of the verified positive samples, which means that a high-probability negative sample should have a high reconstruction error when it is input into the autoencoder trained on positive samples, and vice versa. Therefore, we input all the unknown MDA samples into the well-trained autoencoder and get their reconstruction errors. Then, those reconstruction errors are sorted in descending order, and the unknown samples with the largest reconstruction errors are taken as the high-quality negative samples. Usually, to keep negative and positive samples balanced in supervised training, we choose the same number of negative samples as positive samples. This process is shown as step 3 in Fig. 1.
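The ranking step can be illustrated with a short sketch. The function name `select_negatives` and the error values are hypothetical; in practice the errors would come from the trained autoencoder.

```python
from typing import Dict, Hashable, List


def select_negatives(errors: Dict[Hashable, float], num_positives: int) -> List[Hashable]:
    """Return the unknown pairs with the largest reconstruction errors.

    `errors` maps each unknown miRNA-disease pair to its reconstruction error.
    The number of selected negatives equals `num_positives` (capped by the
    number of unknown pairs) to keep the training set balanced.
    """
    ranked = sorted(errors, key=errors.get, reverse=True)  # descending error
    return ranked[:num_positives]


# Hypothetical errors for five unknown (miRNA, disease) index pairs.
unknown_errors = {
    (0, 1): 0.021,
    (1, 0): 1.130,
    (2, 3): 0.004,
    (3, 2): 0.410,
    (4, 4): 0.087,
}
negatives = select_negatives(unknown_errors, num_positives=2)
print(negatives)  # [(1, 0), (3, 2)]
```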

3 Result

To demonstrate the effectiveness of this negative sample selection method, we combine DAE-N with some previous MDA prediction methods for performance comparison. The details are as follows.

3.1 Database

As our negative sample selection method, DAE-N, is based on MDA association matrices, we use the MDA database HMDD v2.0 to conduct the experiments. HMDD v2.0 includes 5441 verified MDAs among 501 miRNAs and 383 diseases after combining miRNAs from different stages, such as hsa-let-7a-1 and hsa-let-7a-2. The max degree of miRNA (or disease) is 125 (or 250), and the average degree of miRNA (or disease) is 11 (or 16). In addition, to test the prediction performance by independent dataset evaluation, three other recent databases, HMDD v3.0 (Huang et al., 2019), dbDEMC (Yang et al., 2017), and miRCancer (Gao et al., 2019), are used as independent test datasets. Specifically, HMDD v3.0 includes 18,733 MDAs among 1208 miRNAs and 894 diseases, dbDEMC contains 49,960 MDAs, and miRCancer includes 8610 MDAs.
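As a quick arithmetic check of how sparse the HMDD v2.0 association matrix is, the short sketch below uses only the counts quoted above; the variable names are illustrative.

```python
n_mirnas, n_diseases, n_verified = 501, 383, 5441  # HMDD v2.0 counts from the text

n_pairs = n_mirnas * n_diseases    # every possible miRNA-disease pair
n_unknown = n_pairs - n_verified   # pairs without a verified association
density = n_verified / n_pairs     # fraction of pairs that are verified

print(n_pairs, n_unknown)          # 191883 186442
print(f"{density:.2%}")            # 2.84%
```

Fewer than 3% of all pairs are verified, so the candidate pool for negative samples is roughly 34 times larger than the positive set.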


3.2 Evaluation Methods

In this study, tenfold cross-validation (tenfold CV) and independent data evaluation are used to evaluate the performance of MDA prediction models. For tenfold CV, the verified MDAs and the same number of selected negative samples are randomly divided into ten folds; each fold is used in turn as the testing set, with the remaining folds as the training set. For the independent data evaluation, all the known MDAs and the same number of selected negative samples are used to train a model, and then all the unknown samples are predicted by the well-trained model. The predicted novel associations are verified with three other MDA databases. The prediction results of cross-validation are evaluated with the following metrics: sensitivity (recall), specificity, precision, and F1-score. The formulas for computing each metric are as follows:

$$\text{sensitivity} = \frac{TP}{TP + FN}, \qquad (6)$$

$$\text{specificity} = \frac{TN}{TN + FP}, \qquad (7)$$

$$\text{precision} = \frac{TP}{TP + FP}, \qquad (8)$$

$$\text{F1-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad (9)$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Besides, the area under the receiver operating characteristic (ROC) curve (AUC) and the area under the precision-recall (PR) curve (AUPR) are also used in this study. The ROC curve plots the sensitivity against (1 − specificity) under different score thresholds. The PR curve plots the precision versus the recall at different thresholds.
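The metrics above can be computed with a small, dependency-free sketch. The function names and the example counts are hypothetical, and all denominators are assumed to be nonzero. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative), which equals the area under the ROC curve.

```python
from typing import Dict, List


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Eqs. (6)-(9); recall is the same quantity as sensitivity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confusion counts and scores, for illustration only.
metrics = classification_metrics(tp=8, tn=7, fp=3, fn=2)
auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
print(auc)  # 0.75
```

The pairwise AUC is quadratic in the number of samples, which is fine for a sketch; a sorting-based computation would be preferable for large test sets.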

3.3 Experiment Setting and Overfitting Analyzing

DAE-N is implemented in Python with TensorFlow, an open-source machine learning framework. For the deep autoencoder model, the input is a vector with $(m+n)$ neurons. The encoder has two hidden layers of size 256 and 64, and the corresponding decoder has two hidden layers of size 64 and 256. The size of the latent code is set to 32. In addition, mini-batch gradient descent is used to learn the parameters of this model. The batch size is 544, which can equally


divide the training samples into 10 batches. The DAE-N model adopts the Adam optimizer and the binary cross-entropy loss function. Early stopping is applied to avoid overfitting, because the large number of parameters in DAE-N makes it easy to overfit. For early stopping, one-tenth of the verified MDAs are used as the validation dataset, while the rest serve as the training dataset. The training process is halted according to the loss values of the validation samples with a “patience” of 8. In other words, training is stopped when the validation loss has been greater than the minimum validation loss for eight consecutive epochs. The model with the minimum validation loss is considered the best model. The training process is shown in Fig. 2, in which the x-axis is the number of epochs and the y-axis is the loss value. The orange curve is the loss of the 544 validation samples, while the blue

Fig. 2 The loss value updating of training samples and validation samples during the training process


curve is the loss of the 4897 training samples. As shown in Fig. 2, in the first 50 epochs, the losses of the training samples and the validation samples both decrease quickly. After that, the loss of the training dataset decreases faster than that of the validation dataset. Finally, the validation loss stops decreasing at the 229th epoch, which gives the best model, and the training is stopped at the 237th epoch.
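The early stopping rule can be sketched independently of any deep learning framework. The function below is an illustrative assumption rather than the authors' code: it scans a sequence of validation losses and reports the best epoch and the epoch at which training would stop. A loss equal to the current minimum is treated as "no improvement" here, a detail the text does not specify. The synthetic loss curve is made up so that it reproduces the epochs reported above.

```python
from typing import Iterable, Tuple


def early_stopping(val_losses: Iterable[float], patience: int = 8) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once the validation loss has failed to drop below its
    minimum for `patience` consecutive epochs. If that never happens,
    stop_epoch is simply the last epoch seen.
    """
    best_loss = float("inf")
    best_epoch = 0
    epoch = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, epoch


# Synthetic curve: improves for 229 epochs, then stays above the minimum.
val_losses = [1.0 / e for e in range(1, 230)] + [1.0] * 20
print(early_stopping(val_losses))  # (229, 237)
```

With a patience of 8, a best epoch of 229 implies a stop at epoch 237, which matches the numbers in the text.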

3.4 Reconstruct Error Data Analysis on Well Trained DAE-N Model

This negative sample selection method is based on the assumption that negative samples should have larger reconstruction errors than positive samples under an autoencoder trained on a positive dataset. Therefore, we analyze the distribution of the reconstruction errors of 5430 positive samples and 184,155 unknown samples to test this hypothesis. Using the well-trained DAE-N model, we obtain the reconstruction error for each sample. The reconstruction errors of positive samples range over $[7.446 \times 10^{-9}, 3.294 \times 10^{-3}]$, while those of unknown samples range over $[3.131 \times 10^{-6}, 1.448]$. The mean value is $2.194 \times 10^{-4}$ for positive samples and $4.471 \times 10^{-2}$ for unknown samples. The average reconstruction error of unknown samples is thus about 200 times that of the positive samples, which is consistent with our hypothesis. Figure 3 shows the density histogram of

Fig. 3 The density histogram of the reconstruction errors


reconstruction errors, in which the x-axis is the log reconstruction error and the y-axis is the density. We can see that most of the positive samples (the blue part) have smaller reconstruction errors than the unknown samples (the orange part). Additionally, the positive part and the unknown part overlap slightly, which is reasonable because the unknown samples also contain some undiscovered positive samples. This density histogram further supports our hypothesis.

3.5 Compared Methods

To evaluate the effectiveness of DAE-N, we combine DAE-N with three previous machine learning-based MDA prediction models for performance comparison: the deep belief network-based matrix factorization model (DBN-MF) (Ding et al., 2020b), adaptive boosting for miRNA-disease prediction (ABMDA) (Zhao et al., 2019), and the variational graph autoencoder (VGAE) (Ding et al., 2020a).

DBN-MF (Chen et al., 2021a) predicts novel MDAs by factorizing the miRNA-disease adjacency matrix into one miRNA feature representation matrix and one disease feature representation matrix with DBNs. First, the raw interaction features of miRNAs and diseases are obtained from the miRNA-disease adjacency matrix. Then, two DBNs are applied to learn the features of miRNAs and diseases, respectively, from the raw interaction features. Finally, a classifier consisting of the two DBNs and a cosine score function is trained with supervision by fine-tuning the initial weights of the DBNs from the previous step. Even though the weights of the classifier are initialized by unsupervised learning, DBN-MF is still a supervised learning model, as the back-propagation is based on the labeled databases. In the original DBN-MF, the negative samples are randomly selected from all unknown samples, in the same number as the positive samples.

ABMDA (Zhao et al., 2019) is also a supervised machine learning-based MDA prediction model. ABMDA applies boosting to improve the prediction performance of a given algorithm by integrating weak classifiers into a strong classifier according to their corresponding weights. We use a decision tree as the weak classifier in this model. Furthermore, the negative sample selection of ABMDA is based on k-means clustering of all unknown samples: all the unknown samples are first divided into k clusters, and then some samples are randomly selected from each of the k clusters as negative samples.
VGAE (Ding et al., 2020a) predicts MDAs with two variational graph autoencoders, in which the GCN naturally incorporates the node features from the graph structure, while the variational autoencoder uses latent variables to predict associations from the perspective of data distribution. Specifically, VGAE first obtains the representations of miRNAs and diseases from the heterogeneous network, which includes miRNA-miRNA, miRNA-disease, and disease-disease associations. Then, two sub-networks, a miRNA-based network and a disease-based network, are constructed.


Based on the representations from the heterogeneous network, two VGAEs are applied to the two sub-networks to predict the association scores for all the unknown MDAs. Finally, VGAE integrates these two kinds of prediction scores to obtain the final prediction results. Unlike the above two MDA prediction models, VGAE is not sensitive to the negative samples, since its training is based on the reconstruction of an association matrix in which all the unknown samples participate in the training process.

In summary, DBN-MF is a supervised learning model that originally uses a random negative sample selection strategy, ABMDA is a supervised learning model that applies a k-means clustering strategy to select the negative samples, and VGAE does not need specific negative samples for learning and is not sensitive to the negative sample selection.

3.6 Effectiveness of DAE-N on Cross-Validation

To demonstrate the performance of DAE-N, we use cross-validation to compare the three MDA prediction models mentioned above under their original negative sample selection strategies and under DAE-N. The ROC-AUC and AUPR values on the tenfold CV are compared in Fig. 4. The blue bars show the performance of the original models, while the brown bars show the performance with the DAE-N sample selection method. Figure 4 shows that the AUC and AUPR values of the original DBN-MF and ABMDA are around 0.91, while the results improve to around 0.94 with DAE-N, which indicates that DAE-N clearly improves their performance. For the prediction model VGAE, which does not need specific negative samples, the cross-validation results with randomly selected negative samples and with DAE-N are close, both around 0.94, but the one with DAE-N is slightly higher. In addition, the comparison results in terms

Fig. 4 The performance comparison of the original models and DAE-N sample choosing-based models with tenfold CV. (a) The ROC-AUC comparison. (b) The AUPR comparison


Table 1 The performance comparison between original models and models with DAE-N in terms of different evaluation metrics on tenfold CV

| Methods | Original model | Negative samples | AUC | AUPR | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| DBN-MF | Randomly choosing negative samples | Original | 0.9169 | 0.9043 | 0.8377 | 0.8526 | 0.8451 |
| | | DAE-N | 0.9429 | 0.9494 | 0.8717 | 0.8713 | 0.8715 |
| ABMDA | k-means clusters choosing negative samples | Original | 0.9098 | 0.9065 | 0.8387 | 0.8346 | 0.8367 |
| | | DAE-N | 0.9372 | 0.9445 | 0.8726 | 0.8637 | 0.8681 |
| VGAE | No need specific negative samples | Original | 0.9394 | 0.9390 | 0.8576 | 0.8762 | 0.8668 |
| | | DAE-N | 0.9421 | 0.9483 | 0.8836 | 0.8710 | 0.8773 |

The bold values mean the higher value when compared with other methods in the table under the same standard.

Fig. 5 The ROC curves and AUPR curves of the original models and DAE-N sample choosing-based models with tenfold CV. (a) The ROC curves. (b) The AUPR curves

of other evaluation metrics on the tenfold CV are shown in Table 1. The DBN-MF and ABMDA models still show clear improvement in terms of precision, recall, and F1-score after using the DAE-N sample selection method. The performance of VGAE also improves slightly with DAE-N on cross-validation for every metric except recall, even though its training process does not need negative samples. All those results show that, in supervised learning models, the DAE-N method performs better than randomly selecting negative samples and the k-means cluster negative sample selection strategy. However, DAE-N has little influence on the performance of the VGAE model, which is not sensitive to negative samples. The ROC curves and AUPR curves are shown in Fig. 5.


3.7 Effectiveness of DAE-N on Independent Dataset Evaluation

To further illustrate the effectiveness of DAE-N in improving the prediction performance of supervised MDA prediction models, we conduct an independent data evaluation. As the training process of the VGAE model is not affected by negative samples, we only consider the comparison based on the DBN-MF and ABMDA models. For the independent data evaluation, all the positive samples and the same number of selected negative samples are used to train a prediction model, which then predicts all the unknown MDAs. For each disease, the predicted associated miRNAs are sorted by their association score in descending order, and the top-ranked miRNAs have a high probability of being associated with the disease. We then use three other independent datasets, HMDD v3.0, dbDEMC, and miRCancer, to verify the novel predicted associations. In the following, we choose 20 diseases and verify the top 50 predicted miRNAs for each against the independent datasets. The number of verified miRNAs in the top 50 miRNAs for each disease is shown in Table 2. For both the DBN-MF model and the ABMDA model, we apply their original negative sample selection strategy and the DAE-N selection strategy. For the DBN-MF model, 12 out of 20 diseases have better prediction results with DAE-N than with the original model, where a better prediction result means that more of the top 50 predicted novel miRNAs are verified by the independent datasets. For the ABMDA model, 14 out of 20 diseases have more predicted miRNAs verified with DAE-N than with the original k-means clustering-based negative sample selection. The comparison of verified counts in the top 50 predicted miRNAs shows that both DBN-MF and ABMDA achieve better prediction performance with DAE-N than with their original strategies, which also illustrates the effectiveness of DAE-N in improving the prediction performance of supervised learning prediction methods.

Table 2 Comparison between the original model and the DAE-N negative sample selection-based model in terms of the number of verified miRNAs among the top 50 predicted miRNAs for each disease

Disease #  Disease name               DBN-MF Original  DBN-MF DAE-N  ABMDA Original  ABMDA DAE-N
1          Esophageal Neoplasms       39               42            40              43
2          Liver Neoplasms            21               19            14              17
3          Pancreatic Neoplasms       33               32            28              31
4          Gastric Neoplasms          21               30            17              23
5          Leukemia                   41               40            39              38
6          Colon Neoplasms            27               26            30              31
7          Lung Diseases              45               47            43              45
8          Lymphoma                   46               47            43              44
9          Carcinoma, Hepatocellular  44               42            43              41
10         Brain Neoplasms            35               39            33              37
11         Melanoma                   42               37            41              43
12         Osteosarcoma               27               32            26              21
13         Breast Neoplasms           43               44            42              43
14         Kidney Diseases            45               42            44              41
15         Nasopharyngeal Neoplasms   39               41            33              39
16         Cervical Neoplasms         23               27            28              27
17         Biliary Tract Neoplasms    21               24            23              19
18         Thyroid Neoplasms          17               21            15              20
19         Prostate Neoplasms         33               29            29              30
20         Bladder Neoplasms          21               25            16              19

Bold values indicate the higher value when compared with the other methods in the table under the same standard.
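The verification step described above (rank the unknown miRNAs of one disease by predicted score, keep the top 50, and count how many appear in an independent database) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `count_verified`, the dictionary layout, and the miRNA identifiers are hypothetical.

```python
def count_verified(scores, known_positive, verified, top_k=50):
    """Count top-ranked novel predictions confirmed by an independent source.

    scores: dict mapping a miRNA identifier to its predicted association
            score for one disease (hypothetical layout).
    known_positive: miRNAs already used as positive training samples.
    verified: miRNAs reported for this disease in the independent datasets.
    """
    # Only novel (previously unknown) associations are ranked.
    candidates = [m for m in scores if m not in known_positive]
    # Sort by association score in descending order.
    ranked = sorted(candidates, key=scores.get, reverse=True)
    return sum(1 for m in ranked[:top_k] if m in verified)


# Toy example with made-up identifiers and scores.
toy_scores = {"miR-a": 0.9, "miR-b": 0.8, "miR-c": 0.7, "miR-d": 0.1}
n_hits = count_verified(toy_scores, {"miR-a"}, {"miR-b", "miR-d"}, top_k=2)
print(n_hits)
```

Running the same count once for the original negative sampling strategy and once for DAE-N gives one row of a comparison such as Table 2.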

4 Conclusion

Many studies have shown that miRNAs are associated with many complex human diseases, and identifying MDAs can help researchers understand the pathogenesis of diseases and develop effective treatments. Since traditional experimental methods for identifying MDAs are time-consuming and expensive, many computational MDA prediction models have been proposed in recent decades, especially machine learning-based MDA prediction methods. Even though machine learning techniques have contributed a great deal to improving MDA prediction performance, all supervised learning models face the problem that there are no verified negative samples. Most current models randomly select some unknown MDAs as negative samples, which introduces noise into the prediction model and degrades prediction performance. This chapter has proposed a deep autoencoder-based negative sample selection approach, DAE-N, for supervised machine learning-based MDA prediction models. DAE-N first obtains a representation of each miRNA-disease pair. Then, it uses all verified MDAs to train a deep autoencoder. Finally, the trained deep autoencoder is applied to calculate the reconstruction errors of all unknown MDA samples, and the reconstruction errors are sorted in descending order. The unknown MDA samples with the largest reconstruction errors are taken as the negative samples. The effectiveness of DAE-N is demonstrated with three existing machine learning-based MDA prediction models: the supervised learning model DBN-MF, which uses a random negative sample selection strategy; the supervised prediction model ABMDA, which uses a k-means clustering-based negative sample selection strategy; and the VGAE model, which does not need negative samples for training. The experimental results based on both cross-validation and independent dataset evaluation show that DAE-N can significantly improve the performance of supervised MDA prediction models.
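The final selection step of DAE-N summarized above can be sketched in a few lines. The sketch assumes an already trained autoencoder is available as a callable `reconstruct`; that name, the `(pair_id, vector)` layout, and the toy stand-in model are hypothetical, and the mean squared error is assumed as the reconstruction error.

```python
def select_negatives(unknown_pairs, reconstruct, n_negatives):
    """Pick the unknown pairs that the autoencoder reconstructs worst.

    unknown_pairs: list of (pair_id, feature_vector) for unknown MDAs.
    reconstruct: callable standing in for a trained deep autoencoder
                 (hypothetical; it maps a vector to its reconstruction).
    """
    def error(vec):
        rec = reconstruct(vec)
        # Mean squared reconstruction error (assumed error measure).
        return sum((a - b) ** 2 for a, b in zip(vec, rec)) / len(vec)

    # Sort by reconstruction error in descending order.
    ranked = sorted(unknown_pairs, key=lambda item: error(item[1]), reverse=True)
    return [pair_id for pair_id, _ in ranked[:n_negatives]]


# Toy stand-in "autoencoder" that can only reproduce the prototype of the
# positive samples; a real model would be trained on all verified MDAs.
toy_model = lambda vec: [1.0, 1.0]
toy_unknown = [("p1", [1.0, 1.1]), ("p2", [0.0, 0.0]), ("p3", [0.9, 1.0])]
negatives = select_negatives(toy_unknown, toy_model, n_negatives=1)
print(negatives)
```

The idea is that pairs resembling verified associations are reconstructed well, so the pairs with the largest error are the least likely to be hidden positives.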
The DAE-N negative sample selection strategy can be applied not only to MDA prediction but also to other types of biomolecule-disease association prediction, such as circRNA-disease associations, lncRNA-disease associations, and gene-disease associations. With the selected negative samples, the prediction models provide more reliable biomarkers. Reliable biomarkers can improve the efficiency of biomolecule-disease association verification in the wet lab and help researchers analyze disease mechanisms and design new drugs.


Prediction and Analysis of Key Genes in Prostate Cancer via MRMR Enhanced Similarity Preserving Criteria and Pathway Enrichment Methods

Robert Benjamin Eshun, Hugette Naa Ayele Aryee, Marwan U. Bikdash, and A. K. M. Kamrul Islam

Acronyms

FCD   F-test Correlation Difference scheme
FCQ   F-test Correlation Quotient scheme
KNN   K-Nearest Neighbors
LR    Logistic Regression
MID   Mutual Information Difference scheme
MIQ   Mutual Information Quotient scheme
mRMR  Minimum Redundancy Maximum Relevance algorithm
NB    Naive Bayes
RF    Random Forest
SVM   Support Vector Machines

R. B. Eshun
CDSE, North Carolina A&T State University, Greensboro, NC, USA
Faculty of Engineering, Ghana Communication Technology University, Accra, Ghana
e-mail: [email protected]; [email protected]

H. N. A. Aryee
SPH, University of Ghana, Legon, Ghana
e-mail: [email protected]

M. U. Bikdash · A. K. M. Kamrul Islam
CDSE, North Carolina A&T State University, Greensboro, NC, USA
e-mail: [email protected]; [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_6


1 Introduction

Recent advances in high-throughput technologies have facilitated the simultaneous analysis of thousands of gene expression profiles in order to assess the active or silent genes in normal, benign, or cancerous tissue samples (Guyon et al., 2002; Chuang et al., 2008). Gene expression data are usually characterized by high dimensionality due to the large number of genes, a limited number of samples as a result of the challenges in obtaining tissue samples, and abundant redundancy among genes (Liu & Motoda, 2007). The potential for medical prediction and diagnosis from gene expression profiles is constrained by the high dimensionality of this high-throughput data, which challenges robust classification with many machine learning methods by slowing processing, degrading predictive accuracy, and increasing the complexity of the machine learning models (Guyon et al., 2002; Chuang et al., 2008). Feature selection methods are attracting increasing interest in data mining research in areas involving high-dimensional data sets generated through improved data acquisition technologies and storage capabilities. These areas include text processing of web documents, social media applications, and gene expression profile analysis (Guyon & Elisseeff, 2003). The feature selection process brings many potential benefits, including enhancing data modeling and visualization, minimizing storage requirements, reducing processing and training times, and improving the prediction performance of classifiers. In genomic data analysis, feature selection seeks to find the subset of genes that accurately discriminates tissue samples of different types. The identification of informative genes can provide a deeper understanding of genetic signatures or markers and potentially aid the prognosis and classification of cancer types (Sahu et al., 2018; Zhang et al., 2008). Thus, biomarker discovery is currently an essential area of research in cancer biology and treatment (Liu & Motoda, 2007). In general, two types of gene selection methods have been studied in the literature: filter methods (Sarac, 2017) and wrapper methods (Zhang et al., 2008). The key challenge in feature selection from genomic expression profiles is to provide biologists with a robust filter technique that identifies and removes both irrelevant and redundant genes efficiently (Liu & Motoda, 2007). In this study, we propose an effective filtering approach that enhances the similarity-based filter methods, namely Fisher Score, ReliefF, and Laplace Score, with mRMR to obtain a more representative set of relevant and non-redundant genes for biomarker discovery and diagnosis. Then, Gene Ontology (GO) terms and KEGG pathways are extracted to determine the essential genes and their biomolecular functions. A total of 20 key genes were screened to determine the significant pathways and annotations enriched in the gene list, and to concurrently obtain the biomarker genes that may contribute to the progression of prostate cancer.


2 Literature Review

In the literature, a broad range of machine learning methods has been devised for biomarker discovery from diverse gene expression data. Gene expression profiles are generally high-dimensional datasets characterized by large numbers of genes and insufficient sample sizes (Liu & Motoda, 2007). Due to the curse of dimensionality, the applicability of some machine learning and data mining algorithms to gene expression data is severely constrained (Sarac, 2017). The genes in expression profiles are grouped as relevant or redundant, where relevance relates to the significance of the gene for classification tasks and redundancy indicates highly correlated expression features that tend to deteriorate classification performance (Mundra & Rajapakse, 2009). The numerous irrelevant and redundant features act as noise, introduce bias that degrades the prediction accuracy of classifiers, and generally contribute little to differentiating the target classes of samples. For data simplification and computational efficiency, it is therefore imperative to extract the subset of features that are significant for classification (Sarac, 2017; Chandrashekar & Sahin, 2014). Many feature selection schemes have been developed to prune the redundant features of high-dimensional data, reduce the computational resources required for the application, and extract a core set of biomarker genes that improves classification performance and gives greater insight into the genomic data (Saeys et al., 2007). Gene selection processes generally involve the application of filter methods and wrapper methods. Filter techniques use a measurement criterion to rank features for selection and depend entirely on the properties of the given data (i.e., they perform feature selection independently of the learning process).
Wrapper methods, however, use the inductive machine learning process to assess the optimality of a given subset for predicting the sample classes (Chandrashekar & Sahin, 2014; Nnamoko et al., 2014; Talavera, 2005). Wrapper methods are reported to have superior performance in supervised learning scenarios but require more computational resources than filter methods (Zhang et al., 2008). Filter techniques, on the other hand, are the most common methods used in gene selection since the models are simple and fast (Inza et al., 2004). The minimum redundancy maximum relevance (mRMR) algorithm, a filter method based on mutual information theory for discrete data and on the F-statistic for continuous data, has been proposed for obtaining a subset that maximizes the relevance of the selected genes while minimizing the redundancy between them (Mundra & Rajapakse, 2009; Sarac, 2017). MRMR, however, employs a computationally intensive greedy approach, and whether its objective function has good theoretical properties is contested (Climente-González et al., 2019). Also, some genes interact strongly in groups, and these interactions represent important associations that may contribute significantly to sample classification. Therefore, weakly relevant but non-redundant genes may be useful for classification, and feature selection frameworks that assume the conditional independence of features given the target may be ineffective. Thus, a trade-off between redundancy and relevance is ideal (Mundra & Rajapakse, 2009). As noted, the choice of the selection criterion determines the level of performance that can be obtained. The different criteria include dependency, information theory, separability, and estimator performance. Zhao et al. (2011) observed that some existing filter-based feature selection methods essentially assess an attribute's capability to preserve sample similarity, which is inferred from the class label or a distance metric. These feature selection methods include Relief and ReliefF, Fisher Score, Laplacian Score, SPEC, and Trace Ratio. A crucial limitation of these algorithms, however, is their inability to tackle feature redundancy and their propensity to select highly correlated attributes. In demonstrating a framework for combining filter methods, Zhang et al. (2008) integrated the ReliefF and MRMR algorithms in a two-level strategy for gene selection and achieved an overall classification accuracy of 96.7%. In another scheme, Mundra and Rajapakse (2009) combined filter and wrapper methods using mRMR and SVM-RFE for gene selection. In our study, a candidate set of significantly expressed genes was selected using the similarity-based metrics of the Fisher Score, ReliefF, and Laplacian Score. Subsequently, the MRMR method was applied to find non-redundant but representative genes from the candidate set. The performance of the selected genes was evaluated and compared with the standalone models using the Naive Bayes (NB), Logistic Regression (LR), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), and Random Forest (RF) estimators. Through Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis (Guyon & Elisseeff, 2003), the potential molecular mechanisms affecting prostate cancer development were analyzed to identify key genes from the candidate pool with the potential to be new diagnostic biomarkers.

2.1 Feature Selection Methods

This section gives the notation and a brief review of the Fisher Score, ReliefF, graph-based Laplacian Score, and MRMR feature selection methods. Given the data set $X = \{x_1, \ldots, x_n\} \in R^{n \times m}$ with $c$ classes, we use $f_1, \ldots, f_m$ to denote the $m$ features, and $f_i \in R^{n}$ for the corresponding feature vectors. Let $\mu$ and $\sigma$ denote the mean and variance of the whole data set and $\mu_i$ represent the mean of feature $f_i$, where $\mu_{i,j}$ and $\sigma_{i,j}^{2}$ are the mean and variance of $f_i$ on class $j$, with $n_j$ the count of data points in the $j$th class (Mundra & Rajapakse, 2009).

2.1.1 Fisher Score

Fisher Score is based on the notion that samples from the same class will have similar values that are different from those of samples from other classes (Mundra & Rajapakse, 2009). The Fisher score is defined by Zhao et al. (2011):

$$FS_F(f_i) = \frac{\sum_{j=1}^{c} n_j (\mu_{i,j} - \mu_i)^2}{\sum_{j=1}^{c} n_j \sigma_{i,j}^{2}} \tag{1}$$

It is shown that Fisher Score is a special case of Laplacian Score when the similarity matrix is defined as

$$K_{ij} = \begin{cases} \dfrac{1}{n_l}, & y_i = y_j = l \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
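As a worked illustration of Eq. (1), the following sketch computes the Fisher score of a single feature from its values and the class labels. It is an illustrative implementation under two assumptions that the text does not state: the class variance is the population variance (division by the class size), and a zero denominator is mapped to infinity. The function name and the toy data are hypothetical.

```python
def fisher_score(feature, labels):
    """Fisher score of one feature: between-class scatter of the class
    means divided by the size-weighted within-class variance."""
    n = len(feature)
    mu = sum(feature) / n  # overall mean of the feature
    numerator = 0.0
    denominator = 0.0
    for cls in set(labels):
        values = [v for v, l in zip(feature, labels) if l == cls]
        n_j = len(values)
        mu_j = sum(values) / n_j
        # Population variance within the class (assumption).
        var_j = sum((v - mu_j) ** 2 for v in values) / n_j
        numerator += n_j * (mu_j - mu) ** 2
        denominator += n_j * var_j
    return numerator / denominator if denominator > 0 else float("inf")


labels = [0, 0, 0, 1, 1, 1]
informative = [1, 2, 3, 7, 8, 9]   # class means are far apart
noisy = [1, 8, 3, 7, 2, 9]         # class means are close together
print(fisher_score(informative, labels), fisher_score(noisy, labels))
```

A gene whose expression separates the two classes receives a much larger score than one whose values are mixed across classes, which is the ranking behavior the filter relies on.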

2.1.2 Laplace Score

The Laplace Score prefers features with high locality-preserving ability; the key concept is that instances in the same class should be in close proximity to each other. Thus, the Laplacian score of the r-th feature is expressed as (Liu & Zhang, 2016):

$$LS(f_r) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (f_{ri} - f_{rj})^2 S_{ij}}{\sum_{i=1}^{N} (f_{ri} - \bar{f}_r)^2 D_{ii}} \tag{3}$$

where $f_{ri}$ is the r-th feature of the i-th sample $x_i$, $\bar{f}_r$ is the mean of the r-th feature over the samples, and $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{N} S_{ij}$. $S_{ij}$ is the weight matrix described by the neighborhood relationship between samples $x_i$ and $x_j$ as (Liu & Zhang, 2016):

$$S_{ij} = e^{-\frac{\|x_i - x_j\|^2}{\rho}} \tag{4}$$

where $\rho$ is a suitable constant.
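The Laplacian score of Eqs. (3) and (4) can be sketched as below. The sketch makes several simplifying assumptions that should be read as such: every pair of samples is connected with the heat-kernel weight of Eq. (4) (no nearest-neighbor pruning), the feature mean in the denominator is weighted by the degrees $D_{ii}$, and `rho=1.0` is an arbitrary default. The function name and toy data are hypothetical.

```python
import math


def laplacian_score(X, r, rho=1.0):
    """Laplacian score of feature r; smaller values indicate that the
    feature varies little between samples that are close to each other."""
    n = len(X)
    # Heat-kernel similarity between every pair of samples (Eq. 4).
    S = [[math.exp(-sum((a - b) ** 2 for a, b in zip(X[i], X[j])) / rho)
          for j in range(n)] for i in range(n)]
    D = [sum(row) for row in S]          # degrees D_ii
    f = [row[r] for row in X]            # values of the r-th feature
    # Degree-weighted mean of the feature (assumption).
    mean = sum(fi * di for fi, di in zip(f, D)) / sum(D)
    numerator = sum((f[i] - f[j]) ** 2 * S[i][j]
                    for i in range(n) for j in range(n))
    denominator = sum((f[i] - mean) ** 2 * D[i] for i in range(n))
    return numerator / denominator


# Two well-separated groups: feature 0 follows the groups, feature 1
# alternates between neighboring samples.
X = [[0.0, 0.0], [0.1, 1.0], [0.2, 0.0],
     [5.0, 1.0], [5.1, 0.0], [5.2, 1.0]]
print(laplacian_score(X, 0), laplacian_score(X, 1))
```

In this toy data the group-following feature obtains the smaller score, so a ranking in ascending order of the score would place it first.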

2.1.3 ReliefF Criteria

The ReliefF algorithm selects features that contribute to the separation of samples from different classes, which is equivalent to choosing features that preserve a special form of sample similarity derived from the class label. Assuming that p instances are randomly sampled from the data, the evaluation criterion of ReliefF is defined as (Zhao et al., 2011):

$$RF(f_i) = \frac{1}{p} \sum_{t=1}^{p} \left\{ -\frac{1}{m_{x_t}} \sum_{x_j \in NH(x_t)} d(f_{t,i} - f_{j,i}) + \sum_{y \neq y_{x_t}} \frac{1}{m_{x_t,y}} \frac{P(y)}{1 - P(y_{x_t})} \sum_{x_j \in NM(x_t, y)} d(f_{t,i} - f_{j,i}) \right\} \tag{5}$$

where $f_{t,i}$ is the value of sample $x_t$ on feature $f_i$ and $y_{x_t}$ is the class label of sample $x_t$. $P(y)$ is the probability of a sample being from class $y$, and $NH(x)$ or $NM(x, y)$ represents a set of points close to $x$ with the same class as $x$, or with a different class $y$, respectively. $m_{x_t}$ and $m_{x_t,y}$ denote the sizes of the sets $NH(x_t)$ and $NM(x_t, y)$, respectively.
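A compact sketch of the criterion in Eq. (5) follows. It is a simplified illustration rather than a reference implementation: every sample is used as $x_t$ instead of a random subset of p instances, neighbors are found with the squared Euclidean distance over all features, and $d(\cdot)$ is taken to be the absolute difference. The function name, the default `k=1`, and the toy data are hypothetical.

```python
def relieff_scores(X, y, k=1):
    """ReliefF-style score per feature: reward differences to the nearest
    samples of other classes, penalize differences to the nearest samples
    of the same class."""
    n, m = len(X), len(X[0])
    classes = sorted(set(y))
    prior = {c: y.count(c) / n for c in classes}   # P(y)

    def nearest(t, label):
        # The k samples of class `label` closest to sample t (t excluded).
        cand = [j for j in range(n) if j != t and y[j] == label]
        cand.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(X[t], X[j])))
        return cand[:k]

    scores = [0.0] * m
    for t in range(n):
        hits = nearest(t, y[t])                                   # NH(x_t)
        misses = {c: nearest(t, c) for c in classes if c != y[t]}  # NM(x_t, y)
        for i in range(m):
            if hits:
                scores[i] -= sum(abs(X[t][i] - X[j][i]) for j in hits) / len(hits)
            for c, near in misses.items():
                if near:
                    weight = prior[c] / (1.0 - prior[y[t]])
                    scores[i] += weight * sum(abs(X[t][i] - X[j][i])
                                              for j in near) / len(near)
    return [s / n for s in scores]


X = [[0.0, 5.0], [0.1, 1.0], [0.2, 3.0],
     [1.0, 4.9], [1.1, 1.1], [0.9, 3.1]]
y = [0, 0, 0, 1, 1, 1]
print(relieff_scores(X, y, k=1))
```

In the toy data only the first feature differs between the classes, so it receives a positive score while the second feature, which varies within each class, receives a negative one.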

2.1.4 Unified Framework for Similarity Based Methods

To select k features, all the techniques under the Similarity Preserving (SC) framework have been shown to have the unified formulation (Zhao et al., 2011):

$$\max_{F_{sub}} \sum_{f \in F_{sub}} SC(f) = \max_{F_{sub}} \sum_{f \in F_{sub}} \hat{f}^{T} \hat{K} \hat{f}, \tag{6}$$

where $\hat{K}$ is the refined affinity matrix from $K$, $\hat{f}$ denotes the feature vector normalized from $f$, and the set of k chosen features is $F_{sub}$. The different methods, however, apply unique rules to generate $\hat{K}$ and $\hat{f}$. When $K$ for the Fisher score is defined as

$$K_{ij} = \begin{cases} \dfrac{1}{n_l}, & y_i = y_j = l \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

with $n_l$ as the number of instances in the $l$th class, the Fisher Score becomes equivalent to the Laplacian Score and the two are related by

$$SC_L = \frac{1}{1 + SC_F} \tag{8}$$

and $\hat{f}$ for the Laplacian score is given as $\hat{f} = (f - \mu \mathbf{1})/\sigma$. For ReliefF, $K$ is expressed as

$$K_{ij} = \begin{cases} 1, & i = j \\ -\dfrac{1}{k}, & x_j \in NH(x_i) \\ \dfrac{1}{(c-1)k}, & x_j \in NM(x_i, y) \end{cases} \tag{9}$$

2.1.5 MRMR

The MRMR method primarily selects the set of most relevant and minimally redundant features for discriminating the target class of samples (Ding & Peng, 2005; Mundra & Rajapakse, 2009). The mRMR framework has four variants: MID, MIQ, FCD, and FCQ. This study employs the F-test correlation quotient (FCQ), which uses the F-statistic to measure relevance and correlation to derive redundancy (Zhao et al., 2019). FCQ is reputed to have good computation time and accuracy for different classification models and is given as (Samuele, 2021):

$$f^{FCQ}(X_i) = \frac{F(Y, X_i)}{\frac{1}{|S|} \sum_{X_s \in S} \rho(X_s, X_i)} \tag{10}$$

where $F(Y, X_i)$ is the F-statistic between the class label $Y$ and feature $X_i$, $S$ is the set of features selected so far, and $\rho(X_s, X_i)$ is the correlation between features $X_s$ and $X_i$.

2.2 Description of Classifiers

The Support Vector Machine (SVM) finds the optimal hyperplane that separates two classes. It maximizes the margin between the two classes of data samples by mapping the inputs to a high-dimensional feature space in which a decision boundary is created (Eshun et al., 2021b; Han et al., 2021; Vanitha et al., 2015; Mostafiz et al., 2020). The decision function is expressed as (Chandrashekar & Sahin, 2014):

$$D(x) = w \cdot \phi(x) + b \tag{11}$$

where $w$ and $b$ are the parameters of the SVM and $\phi(x)$ is the feature mapping associated with the kernel function. Naive Bayes (NB) is based on Bayes's rule of conditional probability and assumes that all attributes are independent given the target class. The method assigns equal importance to all attributes for decision making, with the probability given by Eq. 12 (Ramroach et al., 2020; Eshun et al., 2021a). For a tissue sample n with m features or gene profiles, the posterior probability that n belongs to class y is

$$P(y \mid n) \propto \prod_{i=1}^{m} P(f_i \mid y) \tag{12}$$

where $P(f_i \mid y)$ are conditional densities calculated from the training samples. Logistic Regression (LR) is a learning algorithm that estimates a linear combination of features to create a predictor variable. It uses a logistic function to find the probabilities of the values of the predictor variable. This method is widely used for two-class prediction in biostatistics (Amrane et al., 2018). K-Nearest Neighbors (KNN) is another classification technique that works on the idea of assigning to an unclassified data point the label of the classified data points nearest to it. Starting with the unclassified sample as the input vector in the feature space, it is assigned to the class to which the majority of its K nearest data points belong (Garg & Mago, 2021). Random Forest (RF) is an ensemble learning approach in which many decision trees are generated during the training stage, with each tree based on a different subset of features and trained on a different part of the same training set. During the classification of unseen examples, the predictions of the individually trained trees are aggregated by majority vote. This bootstrap aggregation procedure is found to efficiently reduce the high variance that an individual decision tree is likely to suffer from (Johnson et al., 2018; Bashir et al., 2019; Rabby et al., 2021).
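The majority-vote rule used by KNN (and, at the ensemble level, by Random Forest) can be illustrated with a few lines of code. This is only a sketch of the voting idea: the function name, the squared Euclidean distance, the default `k=3`, and the toy points are illustrative assumptions, not the configuration used in the study.

```python
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Assign x to the class held by the majority of its k nearest
    training samples (squared Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy two-gene expression profiles with made-up values.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["benign"] * 3 + ["malignant"] * 3
print(knn_predict(train_X, train_y, [0.5, 0.5]))
print(knn_predict(train_X, train_y, [5.5, 5.5]))
```

With an odd k and two classes a tie cannot occur, which is one common reason for choosing an odd number of neighbors in binary problems.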

2.3 Pathway Enrichment Analysis

RNA-seq and microarray-based experiments produce data on differentially expressed or key genes that can be further processed to discover information about the biological functions or processes altered or controlled by the genes. In genomic analysis, a standard approach to gain mechanistic insight into biological functions and relationships is pathway enrichment analysis (Reimand et al., 2019). Pathway enrichment analysis methods can be categorized, in order of increasing complexity of the analysis (Mubeen et al., 2022), into: (1) over-representation analysis (ORA), (2) functional class scoring (FCS), and (3) pathway topology or network-based (PT) methods. The ORA class of techniques is designed to identify biological functions that are over-represented or enriched in a group of genes through an experimental process that involves identification of gene lists from genomic data, discovery of significantly represented pathways, and graphical interpretation of the outcomes (Reimand et al., 2019; Chicco & Agapito, 2022). The relative abundance of genes in a particular pathway is measured using statistical quantities such as the P-value and Q-value, and key functional pathways are accessed from online bioinformatics repositories. The Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome (http://reactome.org) are popular curated resources of biomolecular pathways, reactions, processes, and components that are accessible online. Annotations of biological processes, cellular components, and molecular functions of key and differentially expressed genes are usually obtained using Gene Ontology (GO) enrichment analysis. Gene Ontology (GO) contains comprehensive functional information related to each gene. GO analysis consists of three parts: cell component (CC), biological process (BP), and molecular function (MF). KEGG is a database that integrates chemistry, genome, and system function information aimed at understanding advanced biological functions and practicality (Kanehisa & Goto, 2000; Fabregat et al., 2017; Qi & Chen, 2021).
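To make the P-value used in over-representation analysis concrete, the sketch below computes a one-sided hypergeometric tail probability, which is a common choice for ORA; the text does not state which test the enrichment tools in this study apply, so this is an assumption. The function name, argument names, and the example counts are hypothetical.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_selected, n_overlap):
    """Probability of observing at least `n_overlap` pathway genes when
    `n_selected` genes are drawn without replacement from a background of
    `n_background` genes, of which `n_pathway` belong to the pathway."""
    upper = min(n_pathway, n_selected)
    favorable = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return favorable / comb(n_background, n_selected)


# Made-up example: 20 key genes, 5 of which fall in a 100-gene pathway,
# against a background of 20,000 genes.
p = ora_p_value(n_background=20000, n_pathway=100, n_selected=20, n_overlap=5)
print(p)
```

Because many pathways are tested at once, such P-values are usually adjusted for multiple testing; the Q-value mentioned above is one such adjusted quantity.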


3 Methods

In this study, experiments are conducted to evaluate the performance of the integrated models, namely Fisher Score with mRMR (Fisher+mRMR), ReliefF with mRMR (ReliefF+mRMR), and Laplace Score with mRMR (Laplace+mRMR), relative to the standalone Fisher Score, ReliefF, Laplace Score, and mRMR feature selection methods. The comparison uses five classifiers, namely Naive Bayes, Random Forest, Logistic Regression, K-Nearest Neighbors, and Support Vector Machines, on the prostate cancer dataset. The candidate genes derived are further investigated through GO enrichment and KEGG pathway analysis to obtain key regulatory genes and pathways contributing to the development of prostate cancer. The flowchart of the experiment is presented in Fig. 1.

Fig. 1 Workflow of experimental analysis


3.1 Data Source and Data Type

We accessed the dataset used for the study from the GEO (Gene Expression Omnibus) data repository. The dataset, designated GSE94767, consisted of 241 samples and profiles of 22,011 alleles derived from fresh frozen tissue from the prostatectomies of 154 prostate cancer (PCa) patients. One hundred and eighty-five samples were indicated to be malignant tissue, 51 samples were morphologically classified as benign tissue, and 5 samples were labeled as fibroblast tissue. The gene-level signal values obtained from the samples were generated from Affymetrix GeneChip Exon 1.0 ST arrays using the robust multiarray analysis algorithm implemented in the Affymetrix Expression Console software package (Luca et al., 2018).

3.2 Data Preparation

The dataset was first pre-processed to remove the small number of fibroblast samples, which simplified the grouping of the cohort into malignant and benign cases. The data was then split 70:30 into a training set, used for training and validation, and a test set.
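The preparation described above can be sketched as follows. Only the class counts (185 malignant, 51 benign, 5 fibroblast) and the 70:30 ratio come from the text; the function name, the use of a plain shuffled (non-stratified) split, and the seed are assumptions made for illustration.

```python
import random


def prepare_split(samples, labels, drop_label="fibroblast",
                  test_fraction=0.3, seed=0):
    """Drop one class, shuffle, and split into training and test subsets."""
    kept = [(s, l) for s, l in zip(samples, labels) if l != drop_label]
    rng = random.Random(seed)      # fixed seed (assumption) for repeatability
    rng.shuffle(kept)
    n_test = round(len(kept) * test_fraction)
    return kept[n_test:], kept[:n_test]


# Stand-in sample identifiers; real inputs would be expression profiles.
labels = ["malignant"] * 185 + ["benign"] * 51 + ["fibroblast"] * 5
samples = list(range(len(labels)))
train, test = prepare_split(samples, labels)
print(len(train), len(test))  # 165 71
```

With so few benign samples, a stratified split that preserves the malignant-to-benign ratio in both subsets may be preferable in practice; the plain shuffle is used here only to keep the sketch short.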

3.3 Experiment Design

The performance of our proposed approach was investigated in the supervised learning setting. The experimental procedure for the enhanced models involved two stages. In the first stage, the similarity-based models were used to extract an initial set of candidate genes; in the second, mRMR was used to select the most relevant and non-redundant genes from the candidate gene pool. In the workflow, we first chose 500 candidate genes from the original dataset. The mRMR algorithm was then used to extract final subsets of 10, 20, ..., 60 genes. The final selected gene pools were then used as input to the five classifiers to assess the classification performance. The fivefold cross-validation strategy was used to estimate the prediction accuracies: the input data was divided into five subsets of approximately equal size, each of which was held out in turn and used as the test set in one training run, and the mean and standard deviation of the accuracies on the test sets were recorded. The empirical study was conducted using Python on a PC with an Intel Core i7 10750 CPU @ 2.60 GHz with 16 GB RAM and an Nvidia GeForce RTX 2060 Max-Q GPU with 6 GB RAM.
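The second stage of the procedure, selecting non-redundant genes from a candidate pool with the FCQ variant of mRMR, can be sketched as a greedy loop. The sketch assumes a one-way ANOVA F-statistic for relevance and the absolute Pearson correlation for redundancy, with at least two classes present; the function names, the small floor on the redundancy, and the toy columns are hypothetical and do not reproduce the study's implementation.

```python
import math


def f_statistic(feature, labels):
    """One-way ANOVA F-statistic of a feature with respect to the labels."""
    groups = {}
    for v, l in zip(feature, labels):
        groups.setdefault(l, []).append(v)
    n, c = len(feature), len(groups)
    grand = sum(feature) / n
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                  for g in groups.values()) / (c - 1)
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g) / (n - c)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def abs_pearson(a, b):
    """Absolute Pearson correlation between two feature columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return abs(cov) / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def mrmr_fcq(columns, labels, k):
    """Greedy FCQ selection: start with the most relevant feature, then
    repeatedly add the feature with the largest relevance/redundancy."""
    relevance = [f_statistic(col, labels) for col in columns]
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(columns)):
        best, best_score = None, -1.0
        for i in range(len(columns)):
            if i in selected:
                continue
            redundancy = sum(abs_pearson(columns[i], columns[s])
                             for s in selected) / len(selected)
            score = relevance[i] / max(redundancy, 1e-12)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


labels = [0, 0, 0, 1, 1, 1]
columns = [
    [1, 2, 6, 5, 9, 10],        # informative
    [2, 4, 12, 10, 18, 20],     # exact rescaling of the first column
    [6, 1, 2, 10, 4.5, 9.5],    # informative, weakly correlated with the first
    [4, 6, 5, 5, 4, 6],         # uninformative
]
print(mrmr_fcq(columns, labels, k=2))
```

In this toy pool the rescaled copy of the first column is as relevant as the original but fully redundant with it, so the second pick goes to the weakly correlated informative column instead. In the two-stage design, `columns` would hold the candidate genes returned by a similarity-based filter rather than the full gene set.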


3.4 Feature Selection

The primary objective of supervised feature selection is to obtain the subset of features that is largely responsible for predicting the output labels. High dimensional data is known to include many irrelevant and redundant features that add noise and increase computational cost, leading to the curse of dimensionality (Chandrashekar & Sahin, 2014; Climente-González et al., 2019). In this research, we enhance three similarity based feature selection methods, namely the Fisher Score, ReliefF, and the Laplace Score, which rank feature importance by their capacity to preserve data similarity, with the mRMR algorithm to mitigate the curse of dimensionality and extend the generalization ability of the model. ReliefF and Fisher Score assess feature importance in terms of the capacity to retain data similarity. The Laplace Score is designed to preserve sample locality, and mRMR selects features that are highly correlated with the target and least correlated among themselves (Zhao et al., 2011; Radovic et al., 2017). The similarity based methods are computationally simple, since they build an affinity matrix and then compute scores for the features. Because filter methods are independent of any classifier algorithm, the selected features can be applied to a variety of predictors for evaluation.
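To illustrate how a similarity based filter builds an affinity matrix and then scores features, the following is a minimal sketch of a supervised Laplacian Score. It connects each sample to those of its k nearest neighbours that share its label, weights the edges with a heat kernel, and scores each feature by how well it varies smoothly over that graph; lower scores indicate better locality preservation. The symmetrisation of the graph, the default kernel width `b`, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def supervised_laplacian_score(X, y, k=10, b=1.0):
    """Laplacian Score per feature; smaller means better locality preservation."""
    m = X.shape[0]
    sq_norm = (X ** 2).sum(axis=1)
    sq_dist = np.maximum(sq_norm[:, None] + sq_norm[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(sq_dist, np.inf)             # a sample is not its own neighbour

    # Affinity matrix: heat-kernel weight for same-label k nearest neighbours.
    S = np.zeros((m, m))
    for i in range(m):
        neighbours = np.argsort(sq_dist[i])[:k]
        same = neighbours[y[neighbours] == y[i]]
        S[i, same] = np.exp(-sq_dist[i, same] / b)
    S = np.maximum(S, S.T)                        # assumption: undirected graph

    d = S.sum(axis=1)                             # diagonal of D = diag(S 1)
    L = np.diag(d) - S                            # graph Laplacian
    F = X - (d @ X) / d.sum()                     # remove the D-weighted mean
    num = (F * (L @ F)).sum(axis=0)               # f^T L f for every feature
    den = (F * (d[:, None] * F)).sum(axis=0)      # f^T D f for every feature
    return num / (den + 1e-12)

# Synthetic example: feature 0 separates the classes, features 1 and 2 are noise.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 3))
X[:, 0] += 10.0 * y

scores = supervised_laplacian_score(X, y, k=5, b=1.0)
ranking = np.argsort(scores)                      # best (lowest) score first
print(scores, ranking)
```

On real expression data the squared distances are much larger than in this toy example, so `b` would need to be scaled accordingly (for instance to a typical squared neighbour distance) to keep the weights from underflowing to zero.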

3.4.1 The Problem

The supervised feature selection problem was formulated as follows: given the set $\{x_1, x_2, \ldots, x_n\}$ of n labeled points, arranged as a data matrix in $\mathbb{R}^{n \times m}$, and the set of features $f = \{f_1, \ldots, f_m\}$, find the representative subset of features that is the most discriminative and informative, so that it improves the classification task and produces optimal predictions when used as input features.

3.4.2 The Algorithm

The algorithmic procedure for Fisher Score, ReliefF and mRMR is fairly straightforward. For the supervised Laplacian Score, however, given a dataset X with m samples $x_i$ described by n features, and an output $Y = \{y_1, \ldots, y_c\}$, the algorithm is formally defined as:

1. Construct a nearest neighbor graph G with m nodes, one per sample, and place an edge between nodes i and j if samples $x_i$ and $x_j$ share the same label and $x_j$ is among the k nearest neighbors of $x_i$ (in this case k = 10).

2. If there is an edge between nodes i and j, the weight matrix S of graph G is given as (Zhao et al., 2008):

\[ S_{ij} = e^{-\frac{\|x_i - x_j\|^2}{b}} \tag{13} \]

otherwise $S_{ij} = 0$, where b is a suitable constant.

3. The score for the r-th feature is derived from the relations $f_r = [f_{r1}, \ldots, f_{rm}]^T$, $D = \mathrm{diag}(S\mathbf{1})$, $\mathbf{1} = [1, \ldots, 1]^T$, the graph Laplacian $L = D - S$, and

\[ \tilde{f}_r = f_r - \frac{f_r^T D \mathbf{1}}{\mathbf{1}^T D \mathbf{1}} \mathbf{1} \tag{14} \]

4. Finally, the Laplacian Score for each feature r is computed as:

\[ SC_F(f_r) = \frac{\tilde{f}_r^T L \tilde{f}_r}{\tilde{f}_r^T D \tilde{f}_r} \tag{15} \]

3.4.3 Classification

Supervised classification involves developing learning algorithms to predict previously defined categories. The goal is to obtain a predictor h given as (Eshun et al., 2021b):

\[ h : X \rightarrow Y \tag{16} \]

such that

\[ h(x) = y, \quad \forall (x, y) \sim S \tag{17} \]

where $x \in X$ is a training example, $y \in Y$ is the associated label and S is a data generating distribution (Ramroach et al., 2020). We focus on the binary classification problem $y \in \{0, 1\}$. The study evaluated five classifiers, namely K-Nearest Neighbors (KNN), Naïve Bayes (NB), Logistic Regression (LR), Support Vector Machines (SVM) and Random Forest (RF), to compare the prediction accuracy of the different feature-selection techniques with fewer candidate genes. The gene subsets were the independent variables while the target class was the dependent variable. Specifically, the expression levels of the genes were used to predict whether the tissue samples were benign or malignant.

The SVM estimator maps the input data into a high-dimensional feature space and finds a separating hyperplane that maximizes the margin between the two classes in this space. The solution for the optimal hyperplane can be written as a combination of a few input points that make up the support vectors (Han et al., 2021; Vanitha et al., 2015). NB is grounded in Bayes' theorem, assumes the independence and equal significance of all attributes for making decisions, and is often applied to genomic data classification (Ramroach et al., 2020). Logistic Regression is a classification algorithm that estimates the coefficients in a logistic model. It is used to predict a binary output from a set of feature variables (Amrane et al., 2018). K-Nearest Neighbors (KNN) predicts the target of an unclassified observation as the class that is dominant among a specified number (k) of most similar examples, or nearest neighbors (Garg & Mago, 2021; Radovic et al., 2017). Random Forest is based on the application of many decision trees during the training stage. The predictions of the individually trained decision trees are then combined using the majority vote (Johnson et al., 2018; Bashir et al., 2019).
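As a concrete illustration of the KNN voting rule combined with a five-fold evaluation protocol, the following minimal sketch reports the mean and standard deviation of accuracy over held-out folds. It is not the study's implementation: the Euclidean distance, the value of k, the random fold assignment and the synthetic data are all assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Label each test sample by majority vote among its k nearest training samples."""
    preds = []
    for x in X_test:
        dist = np.linalg.norm(X_train - x, axis=1)
        nearest = y_train[np.argsort(dist)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

def cross_val_accuracy(X, y, n_folds=5, k=5, seed=0):
    """Mean and standard deviation of accuracy over n_folds held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic two-class data standing in for a selected gene subset.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 4))
X[y == 1] += 4.0

mean_acc, std_acc = cross_val_accuracy(X, y)
print(f"{mean_acc:.3f} ± {std_acc:.3f}")
```

The same loop applies unchanged to any classifier that exposes a fit-and-predict step, which is what allows the filter-selected gene subsets to be compared across the five estimators.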

3.5 Measures for Performance Evaluation

The performance of the models is assessed on (1) classification accuracy and (2) redundancy rate. The analysis primarily uses accuracy as the performance metric to measure the efficacy of the models. Accuracy measures how well a test predicts the different categories, as the proportion of samples correctly classified into their respective classes. It is given as:

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{18} \]

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.

The redundancy rate (RR) is expressed as (Zhao et al., 2011):

\[ RR(F) = \frac{1}{m(m-1)} \sum_{f_i, f_j \in F,\; i > j} \rho_{i,j} \tag{19} \]

where $\rho_{i,j}$ measures the Pearson correlation between two features $f_i$ and $f_j$. The procedure used to compute the redundancy rate was as follows:

1. Find the Fisher's Z transform of the Pearson correlation coefficients.
2. Calculate the average of the Z values.
3. Back-transform the mean Z value to a correlation coefficient.

The rate gives an estimate of the averaged correlation among all the feature pairs; a large value indicates highly correlated features and that redundancy is likely to exist.
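The three-step averaging procedure can be sketched as follows. The function name and the clipping of correlations away from ±1 (to keep the Z transform finite) are assumptions for the example; signed correlations are averaged as described, without taking absolute values.

```python
import numpy as np

def redundancy_rate(X):
    """Average pairwise Pearson correlation via Fisher's Z transform.

    X has one row per sample and one column per selected feature.
    """
    corr = np.corrcoef(X, rowvar=False)
    i, j = np.tril_indices(corr.shape[0], k=-1)      # feature pairs with i > j
    r = np.clip(corr[i, j], -0.999999, 0.999999)     # keep arctanh finite
    z = np.arctanh(r)                                # step 1: Fisher's Z transform
    return float(np.tanh(z.mean()))                  # steps 2 and 3: average, back-transform

# Synthetic check: two nearly identical features and one independent feature.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X_redundant = np.column_stack([base, base + 0.1 * rng.normal(size=500)])
X_independent = rng.normal(size=(500, 3))

print(redundancy_rate(X_redundant), redundancy_rate(X_independent))
```

Averaging in Z space rather than averaging the raw coefficients reduces the bias that arises because the sampling distribution of the Pearson coefficient is skewed for strong correlations.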

3.6 Enrichment Analysis of Key Pathways and Core Genes

Pathway enrichment analysis was conducted to investigate the significant pathways and core genes involved in the development of prostate cancer. The GO annotation and KEGG pathway enrichment analyses of the candidate genes were performed using Enrichr, an interactive web-based gene list enrichment analysis tool (https://maayanlab.cloud/Enrichr/). The enrichment analysis was conducted with a threshold of P-value < 0.05, with higher values considered not statistically significant. The enriched pathways and annotations were arranged in ascending order of their P-value, with lower values indicating higher relevance.
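The statistical idea behind an over-representation test can be sketched with a one-sided hypergeometric tail probability, followed by the thresholding and ordering described above. This is a generic illustration, not Enrichr's implementation: the background size, the term sizes and the overlap counts below are hypothetical values chosen only to make the example run.

```python
from math import comb

def enrichment_p_value(n_overlap, n_query, n_term, n_background):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_background, n_term, n_query)."""
    upper = min(n_query, n_term)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_term, x) * comb(n_background - n_term, n_query - x)
        for x in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical inputs: a 21-gene query list tested against three annotation terms.
N_BACKGROUND = 20000          # assumed number of annotated genes
N_QUERY = 21
terms = {
    "term_A": {"size": 50, "overlap": 2},
    "term_B": {"size": 400, "overlap": 1},
    "term_C": {"size": 120, "overlap": 0},
}

p_values = {
    name: enrichment_p_value(t["overlap"], N_QUERY, t["size"], N_BACKGROUND)
    for name, t in terms.items()
}

# Keep terms below the 0.05 threshold, ordered from most to least significant.
significant = sorted(
    (name for name, p in p_values.items() if p < 0.05),
    key=lambda name: p_values[name],
)
print(p_values, significant)
```

Because the probability depends on the assumed background, the same overlap can be significant against one gene universe and not against another, which is worth keeping in mind when comparing results across tools.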

4 Results and Discussion

The mean accuracy and standard deviation values from the experiments are presented in Table 1. Plots of the prediction accuracy achieved by the algorithms for different numbers of selected genes can be found in Figs. 2 and 3. Table 2 shows the performance of the baselines (i.e., results without any feature selection). The performance of the baselines was found to be comparable to the prediction accuracy recorded for the similarity based algorithms, namely ReliefF, Laplace Score and Fisher Score. Logistic Regression produced the highest result on the baselines with 83% accuracy, which was similar to the highest performance of 83.6% accuracy observed for the ReliefF algorithm.

The best classification result with the Fisher Score method was obtained by KNN for 60 selected genes and by SVM for the 60, 40 and 30 gene subsets. The NB algorithm was outperformed by the other classifiers, with its lowest accuracy of 68.2%. The Fisher Score method was observed to have low variation in performance for decreasing numbers of genes.

The ReliefF results were less variable across different numbers of selected genes, and the lowest variation in output occurred with 30 and 40 gene candidates. The RF algorithm had higher results for the different gene sets, producing an accuracy of >85%. On the other hand, the NB model again experienced the lowest accuracy when more than 10 genes were selected.

The SVM, Logistic Regression and RF models obtained good predictions on the genes chosen by the Laplace Score algorithm, showing minimal variation in classification accuracy with changing gene numbers. NB and KNN were the lowest performers. Notably, NB recorded increasing accuracy with decreasing number of genes and a high score of 78%.

The best accuracy among the classifier models was achieved for the mRMR gene subset, with SVM and Logistic Regression achieving robust accuracies >91% and the RF estimator recording a low of 82%. All models on the mRMR feature set, including KNN and NB, were found to produce good predictions, which reduced only marginally as the number of selected genes decreased. With 10 selected genes, the majority of the learning algorithms converged to an accuracy rate of approximately 87%.

The ReliefF+mRMR selection model produced very little variance in classifier performance, with a prediction accuracy range of 83–89%. Logistic Regression and KNN were found to be the best performers, with peak observations at 30–40 gene counts. The lowest classifier performance was recorded for the NB algorithm. The results observed for ReliefF+mRMR were the closest to those of mRMR.

The enhanced Laplace Score model (Laplace+mRMR) was observed to have progressively increasing accuracy across all the learning models as fewer genes were selected, except at 10 genes, and it produced a high accuracy of 87% with SVM.

Table 1 Classification results (mean accuracy ± standard deviation) for varying number of genes

                            # Genes
Method        Classifier  10             20             30             40             50             60
Fisher score  NB          0.752 ± 0.084  0.715 ± 0.085  0.745 ± 0.062  0.685 ± 0.045  0.721 ± 0.062  0.715 ± 0.073
              LR          0.800 ± 0.041  0.764 ± 0.023  0.782 ± 0.065  0.764 ± 0.035  0.806 ± 0.041  0.794 ± 0.023
              KNN         0.764 ± 0.035  0.764 ± 0.023  0.800 ± 0.041  0.806 ± 0.031  0.800 ± 0.024  0.818 ± 0.019
              SVM         0.802 ± 0.050  0.755 ± 0.065  0.818 ± 0.031  0.818 ± 0.041  0.79 ± 0.050   0.812 ± 0.020
              RF          0.788 ± 0.019  0.788 ± 0.019  0.776 ± 0.041  0.782 ± 0.035  0.794 ± 0.045  0.800 ± 0.045
ReliefF       NB          0.782 ± 0.081  0.761 ± 0.092  0.782 ± 0.059  0.782 ± 0.059  0.764 ± 0.071  0.732 ± 0.064
              LR          0.812 ± 0.056  0.806 ± 0.036  0.788 ± 0.019  0.836 ± 0.049  0.782 ± 0.048  0.836 ± 0.073
              KNN         0.776 ± 0.062  0.794 ± 0.067  0.824 ± 0.059  0.806 ± 0.08   0.806 ± 0.081  0.824 ± 0.056
              SVM         0.765 ± 0.067  0.781 ± 0.043  0.801 ± 0.055  0.800 ± 0.049  0.818 ± 0.065  0.825 ± 0.038
              RF          0.855 ± 0.045  0.830 ± 0.045  0.836 ± 0.053  0.842 ± 0.023  0.842 ± 0.045  0.861 ± 0.031
Laplace score NB          0.776 ± 0.031  0.733 ± 0.045  0.733 ± 0.065  0.709 ± 0.078  0.691 ± 0.071  0.679 ± 0.073
              LR          0.801 ± 0.041  0.800 ± 0.041  0.801 ± 0.041  0.794 ± 0.045  0.788 ± 0.051  0.788 ± 0.051
              KNN         0.764 ± 0.059  0.721 ± 0.052  0.745 ± 0.056  0.752 ± 0.059  0.752 ± 0.067  0.752 ± 0.067
              SVM         0.806 ± 0.043  0.806 ± 0.048  0.812 ± 0.067  0.812 ± 0.071  0.791 ± 0.054  0.771 ± 0.052
              RF          0.782 ± 0.056  0.794 ± 0.052  0.803 ± 0.062  0.788 ± 0.057  0.794 ± 0.074  0.788 ± 0.074
mRMR          NB          0.867 ± 0.083  0.842 ± 0.084  0.824 ± 0.091  0.824 ± 0.084  0.831 ± 0.085  0.836 ± 0.084
              LR          0.873 ± 0.098  0.897 ± 0.078  0.867 ± 0.091  0.897 ± 0.062  0.885 ± 0.045  0.915 ± 0.052
              KNN         0.873 ± 0.065  0.891 ± 0.065  0.903 ± 0.052  0.903 ± 0.048  0.903 ± 0.048  0.891 ± 0.056
              SVM         0.842 ± 0.040  0.897 ± 0.020  0.897 ± 0.042  0.842 ± 0.025  0.911 ± 0.031  0.921 ± 0.032
              RF          0.867 ± 0.041  0.879 ± 0.051  0.861 ± 0.036  0.879 ± 0.043  0.867 ± 0.056  0.873 ± 0.048
Fisher+mRMR   NB          0.824 ± 0.087  0.830 ± 0.091  0.848 ± 0.051  0.836 ± 0.045  0.824 ± 0.052  0.830 ± 0.062
              LR          0.861 ± 0.031  0.842 ± 0.065  0.836 ± 0.049  0.830 ± 0.062  0.824 ± 0.048  0.836 ± 0.049
              KNN         0.798 ± 0.067  0.776 ± 0.056  0.830 ± 0.024  0.873 ± 0.070  0.873 ± 0.067  0.867 ± 0.059
              SVM         0.820 ± 0.031  0.818 ± 0.042  0.867 ± 0.021  0.867 ± 0.030  0.855 ± 0.042  0.834 ± 0.054
              RF          0.818 ± 0.069  0.812 ± 0.052  0.800 ± 0.062  0.824 ± 0.030  0.788 ± 0.047  0.812 ± 0.067
ReliefF+mRMR  NB          0.848 ± 0.086  0.830 ± 0.087  0.836 ± 0.080  0.836 ± 0.080  0.830 ± 0.080  0.830 ± 0.078
              LR          0.830 ± 0.049  0.873 ± 0.023  0.867 ± 0.049  0.891 ± 0.024  0.885 ± 0.023  0.867 ± 0.024
              KNN         0.830 ± 0.053  0.861 ± 0.031  0.885 ± 0.035  0.885 ± 0.045  0.873 ± 0.035  0.855 ± 0.035
              SVM         0.842 ± 0.042  0.879 ± 0.024  0.879 ± 0.015  0.865 ± 0.045  0.860 ± 0.034  0.860 ± 0.04
              RF          0.861 ± 0.015  0.848 ± 0.019  0.873 ± 0.04   0.848 ± 0.033  0.861 ± 0.024  0.842 ± 0.023
Laplace+mRMR  NB          0.830 ± 0.045  0.836 ± 0.041  0.836 ± 0.031  0.830 ± 0.056  0.812 ± 0.035  0.818 ± 0.033
              LR          0.861 ± 0.045  0.867 ± 0.031  0.824 ± 0.04   0.818 ± 0.033  0.812 ± 0.048  0.818 ± 0.043
              KNN         0.836 ± 0.031  0.842 ± 0.023  0.818 ± 0.051  0.830 ± 0.041  0.824 ± 0.059  0.806 ± 0.045
              SVM         0.796 ± 0.066  0.867 ± 0.030  0.842 ± 0.023  0.852 ± 0.035  0.821 ± 0.025  0.840 ± 0.038
              RF          0.818 ± 0.051  0.848 ± 0.027  0.830 ± 0.036  0.836 ± 0.045  0.824 ± 0.052  0.812 ± 0.048


Fig. 2 Plot of the performance of (a) Fisher Score (b) Laplace Score (c) ReliefF and (d) mRMR feature selection methods

Fig. 3 Plot of the performance of the enhanced models consisting of (a) Fisher+mRMR (b) Laplace+mRMR and (c) ReliefF+mRMR frameworks


Table 2 Baseline performance of the classifiers

Classifier   Accuracy
NB           0.745 ± 0.062
Logistic_R   0.830 ± 0.036
KNN          0.800 ± 0.076
SVM          0.806 ± 0.041
RF           0.782 ± 0.052

Table 3 Performance, redundancy rate and runtime for the top 30 genes

Methods        Accuracy       Redundancy rate  Running time (s)  Classifier
Fisher         0.818 ± 0.038  0.0402           39.953            SVM
ReliefF        0.800 ± 0.041  0.1607           2.422             SVM
Laplace        0.812 ± 0.035  0.3829           25.568            SVM
mRMR           0.897 ± 0.041  0.1064           68.048            SVM
Fisher+mRMR    0.867 ± 0.031  0.0222           45.824            SVM
ReliefF+mRMR   0.879 ± 0.019  0.1178           4.252             SVM
Laplace+mRMR   0.842 ± 0.032  0.0491           54.576            SVM

The integration of the Fisher score and mRMR (Fisher+mRMR) produced a high accuracy of 87.3% with 40 and 50 selected genes on the KNN classifier. The class prediction of the Logistic Regression estimator dipped initially from its value at 60 genes and then increased consistently to reach a high accuracy of 86.1% with 10 selected genes.

Table 3 presents the averaged redundancy rates obtained for 30 selected genes by the different algorithms. The high performing mRMR model was noted to produce very weak correlation among genes (0.1064), similar to the ReliefF+mRMR combination (0.1178). The Fisher score and ReliefF algorithms were found to have very weak redundancy, similar to mRMR, but the Laplace score produced a relatively high redundancy rate (0.3829). We further observe that the features selected by ReliefF+mRMR, Fisher+mRMR and Laplace+mRMR have less redundancy than those of the corresponding standalone similarity-based models, with the Laplace+mRMR combination experiencing the greatest reduction in redundancy. This indicates that the model integration is an effective framework for redundancy removal. The integrated models achieve lower redundancy rates and higher prediction scores relative to the original similarity based models, which suggests that reducing the redundancy rate is important for improving the classification performance.

The mRMR model that produced the best class predictions was found to be the most expensive in computational cost, with an average running time of 68.048 s. The closest models in computational time to mRMR were the Laplace+mRMR and Fisher+mRMR models, with 54.576 and 45.824 s in elapsed time respectively, which were also higher than the runtimes recorded for the Laplace score (25.568 s) and Fisher score (39.953 s). The models with the lowest average computational time were the ReliefF algorithm (2.422 s) and the ReliefF+mRMR combination (4.252 s).


4.1 Identification of Key Genes Related to PCa

The candidate list of the most significantly expressed gene alleles or variants obtained from the experiment comprised the Canonical Allele Identifiers CA3859761, CA3136178, CA3371640, CA3371114, CA2674391, CA3214496, CA2648535, CA3725392, CA3860208, CA2686781, CA2545478, CA3749730, CA2793137, CA2892393, CA2618499, CA3823379, CA2920803, CA2328551, CA2404999, CA3039791, and CA3175494. The genes to which these variants belong were derived from the Clinical Genome Resource (ClinGen) Allele Registry located at https://reg.clinicalgenome.org. The listed variants were mapped, in the same order, to the following genes: ELOVL5, CBR4, YTHDC2, TSSK1B, DHX36, PRDM9, XRN1, SLC44A4, GCLC, SMC4, SIDT1, HLA-DPB1, PIGG, WDR19, DNAJC13, ABCC10, FIP1L1, MYRIP, GMPPB, COL25A1, and SLC9A3.

4.2 GO and KEGG Pathway Enrichment Analyses

The candidate genes were uploaded to Enrichr for GO enrichment analysis with a threshold of P < 0.05. The investigation showed that the indicated genes were enriched in the following biological processes: fatty-acyl-CoA biosynthetic process; spermatid development; fatty acid biosynthetic process; positive regulation of cardioblast differentiation; regulation of cardioblast differentiation; positive regulation of fertilization; oocyte development; RNA phosphodiester bond hydrolysis, exonucleolytic; positive regulation by host of viral genome replication; and choline transport. The enriched molecular functions were G-quadruplex RNA binding, fatty acid synthase activity, G-quadruplex DNA binding, telomerase RNA binding, adenyl ribonucleotide binding, magnesium ion binding, double-stranded RNA binding, choline transmembrane transporter activity, pre-miRNA binding, and fatty acid elongase activity. The enriched cellular components were integral component of endoplasmic reticulum membrane, motile cilium, lysosome, lytic vacuole membrane, intrinsic component of endoplasmic reticulum membrane, lysosomal membrane, MHC class II protein complex, mRNA cleavage and polyadenylation specificity factor complex, dense core granule, and MHC protein complex. The KEGG analysis (P < 0.05) showed that the selected genes were enriched in pathways regulating RNA degradation, Protein digestion and absorption, Fatty acid biosynthesis, Proximal tubule bicarbonate reclamation, Glycosylphosphatidylinositol (GPI)-anchor biosynthesis, Fatty acid elongation, and Biosynthesis of unsaturated fatty acids. The results of the GO enrichment and KEGG pathway analyses are presented in Tables 4, 5, 6, and 7.
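The enrichment tables report a Q-value next to each P-value. As a hedged illustration of how such adjusted values are commonly obtained, the sketch below implements the Benjamini-Hochberg procedure. That the reported Q-values were computed this way is an assumption, since the text does not state the adjustment method, and the input P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical p-values for four tested terms.
p_values = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(p_values)
print(q_values)
```

Because the adjustment scales with the number of terms tested, a term can pass the P < 0.05 threshold while its Q-value remains well above 0.05, which is the pattern visible in several of the tables.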


Table 4 Most significant p-values and q-values for GO Cellular Component 2021

Term                                                                P-value   Q-value   Overlap genes
Integral component of endoplasmic reticulum membrane (GO:0030176)   0.002282  0.125489  [ELOVL5, HLA-DPB1, PIGG]
Motile cilium (GO:0031514)                                          0.007025  0.175992  [WDR19, TSSK1B]
Lysosome (GO:0005764)                                               0.011328  0.175992  [SIDT1, HLA-DPB1, DNAJC13, ABCC10]
Lytic vacuole membrane (GO:0098852)                                 0.013082  0.175992  [HLA-DPB1, DNAJC13, ABCC10]
Intrinsic component of endoplasmic reticulum membrane (GO:0031227)  0.02106   0.175992  [ELOVL5, PIGG]
Lysosomal membrane (GO:0005765)                                     0.022868  0.175992  [HLA-DPB1, DNAJC13, ABCC10]

Table 5 Most significant p-values and q-values for GO Biological Process 2021

Term                                              P-value   Q-value  Overlap genes
Fatty-acyl-CoA biosynthetic process (GO:0046949)  0.001402  0.13331  [ELOVL5, CBR4]
Spermatid development (GO:0007286)                0.002867  0.13331  [YTHDC2, TSSK1B]
Fatty acid biosynthetic process (GO:0006633)      0.007637  0.13331  [ELOVL5, CBR4]

Table 6 Most significant p-values and q-values for GO Molecular Function 2021

Term                                        P-value   Q-value   Overlap genes
G-quadruplex RNA binding (GO:0002151)       0.00005   0.003613  [XRN1, DHX36]
Fatty acid synthase activity (GO:0004312)   0.000148  0.003613  [ELOVL5, CBR4]
G-quadruplex DNA binding (GO:0051880)       0.000148  0.003613  [XRN1, DHX36]
Telomerase RNA binding (GO:0070034)         0.000752  0.013715  [XRN1, DHX36]
Adenyl ribonucleotide binding (GO:0032559)  0.002384  0.030037  [GCLC, TSSK1B, DHX36, SMC4]
Magnesium ion binding (GO:0000287)          0.002469  0.030037  [GCLC, TSSK1B, DHX36]
Double-stranded RNA binding (GO:0003725)    0.008058  0.072327  [DHX36, SIDT1]

Table 7 Most significant p-values and q-values for KEGG 2021

Term                              P-value   Q-value   Overlap genes
RNA degradation                   0.008897  0.221351  [XRN1, DHX36]
Protein digestion and absorption  0.014766  0.221351  [SLC9A3, COL25A1]


4.3 Discussion

Classification performance was not negatively impacted by gene selection, as the baseline performance was comparable to the scores attained by the lowest performing selection models. On the contrary, gene selection served to improve class prediction, as demonstrated by the superior performance of the enhanced similarity-based models over their original versions. The Fisher+mRMR, ReliefF+mRMR and Laplace+mRMR models were observed to improve class recognition for different numbers of selected genes, which indicates the selection of more discriminative and representative genes with improved generalization ability. The schemes developed by combining models obtained higher performance values than the standalone similarity based techniques. This performance is summarized in Table 3, which shows the mean accuracy scores for the feature selection models investigated using the SVM with 30 discriminative genes selected. It reveals that mRMR produces the best “aggregated” accuracy of 89.7%, closely followed by ReliefF+mRMR, Fisher+mRMR, and Laplace+mRMR with 87.9%, 86.7%, and 84.2% accuracy respectively. The lowest performing selection models are Fisher Score, Laplace Score, and ReliefF with 81.8%, 81.2%, and 80% respectively.

Overall, the mRMR algorithm was noted to produce better prediction performance than the other gene selection models, with a best accuracy of 92.1% using SVM. This model was characterized by a low redundancy rate of 0.1064, which indicated very weak correlation among the selected genes. The mRMR performance, however, was attained at a large computational cost, with an average runtime of 68.048 s, about 16 times that of the next best model, ReliefF+mRMR, at 4.252 s of elapsed time. mRMR uses the F-statistic, which was more expensive to calculate than the structured information used by ReliefF.

The ReliefF+mRMR selection achieves good performance, with accuracy scores and a redundancy rate comparable to the mRMR algorithm, but with a far lower execution time of 4.252 s. Its higher minimum accuracy score of approximately 83% also suggests that the model produces more stable outputs across a variety of classifier algorithms. This indicates that the ReliefF+mRMR model can be advantageous in environments with limited computational resources and multiple independent classifiers, and demonstrates the effectiveness of integrating ReliefF and mRMR. The results achieved using the enhanced similarity-based methods compare favorably with the results recorded by Eshun et al. (2021b), which reported best results with SVM and Logistic Regression on the genes identified with the Lasso selection algorithm.

From the most discriminative sets of 30 canonical alleles selected by the best performing algorithms, mRMR and ReliefF+mRMR, the models identified common gene variants comprising the canonical alleles designated CA2648535, CA2618499, CA2892393, CA3175494, CA3214496, CA2545478, and CA2793137. These significantly expressed genes were found to demonstrate very strong associations with the benign or malignant status of prostate tumors and constitute potential biomarker targets for treatment strategies and diagnosis. The core set of key genes detected by the models was largely consistent with the ones identified by the high yielding Lasso in Eshun et al. (2021b).

On the basis of the over-representation analysis, the GO annotations revealed that the genes ELOVL5 and CBR4 were abundant in fatty-acyl-CoA and fatty acid biosynthetic processes, and in fatty acid synthase activity. The XRN1 and DHX36 gene variants were abundant in signaling pathways regulating G-quadruplex RNA binding, G-quadruplex DNA binding and telomerase RNA binding. The YTHDC2 and TSSK1B genes were found to be enriched in spermatid development. The molecular functions of the key genes were also abundant in signaling pathways regulating adenyl ribonucleotide binding (GCLC, TSSK1B, DHX36, and SMC4) and magnesium ion binding (GCLC, TSSK1B, and DHX36). The key pathways obtained from the KEGG pathway analysis indicated that the genes were primarily enriched in RNA degradation (XRN1 and DHX36), and protein digestion and absorption (SLC9A3 and COL25A1).

The results were consistent with reports by Watt et al. (2019) and Liu (2006) that point to increased fatty acid uptake and changes in fatty acid metabolism in malignant prostate cancer tissue, and they support observations by Sena and Denmeade (2021) that fatty acid synthesis may drive prostate cancer development and progression. The study reveals potential fatty acid biosynthetic process targets in ELOVL5 and CBR4, DNA/RNA binding targets in XRN1 and DHX36, spermatogenesis targets in YTHDC2 and TSSK1B, and other candidate genes as biomarker targets for the management of prostate cancer. The results show that the integrated models investigated in this study for supervised feature selection and classification can achieve representative gene selections and yield biomarker genes for effective prediction, diagnosis and therapeutic treatment of prostate cancer.

5 Conclusions

In this study, the Fisher score, Laplace score and ReliefF algorithms were integrated with mRMR for feature selection and empirically evaluated on the prostate cancer dataset. The features selected by the enhanced ReliefF, Fisher and Laplace models were found to have reduced redundancy rates relative to their standalone models, which indicated that the model integration was effective for redundancy removal. The genes selected by the integrated models were observed to yield more consistent and improved classification performance, which was noticeably higher than the baselines and comparable to the results observed for the best performing mRMR algorithm. This demonstrated the strong performance of the enhanced models for the identification of informative genes and improved generalization ability. The SVM, Logistic Regression and KNN classifiers applied to the gene sets from the mRMR and ReliefF+mRMR methods attained the highest performance, with accuracy scores > 83% for 10 or more selected genes.

The experiment identified many potential biomarker genes, including the canonical alleles identified as CA2648535, CA2618499, CA2892393, CA3175494, CA3214496, CA2545478, and CA2793137, and showed that they are highly correlated with the target classes. The results identified bio-molecular processes and related genes that may constitute targets for prostate cancer treatment. The study indicated that the genes were abundant in the annotations regulating fatty-acyl-CoA and fatty acid biosynthetic processes, fatty acid synthase activity, and G-quadruplex RNA/DNA binding. The KEGG pathway analysis produced significant enrichment in the signaling pathways related to RNA degradation, and protein digestion and absorption. The study identified key regulatory genes including ELOVL5 and CBR4 in fatty acid biosynthetic processes, and XRN1 and DHX36 in DNA/RNA binding and degradation. The findings may provide effective targets for the management of prostate cancer and promote understanding of the underlying bio-molecular processes and key pathways that lead to susceptibility.

This study revealed that the proposed approach can efficiently learn discriminative genes in prostate cancer and accurately distinguish malignant from benign tumors. The gene selection and class prediction models can be applied in clinical practice to provide valuable information for cancer treatment and diagnosis. A limitation of the study is that it assesses the prostate tissue samples in the dataset without considering the fibroblasts. A future direction of the research is to extend the framework to deep learning models and to carry out multi-class classification of the prostate samples, including the fibroblasts.

References

Amrane, M., Oukid, S., Gagaoua, I., & Ensari, T. (2018). Breast cancer classification using machine learning. In 2018 Electric Electronics, Computer Science, Biomedical Engineerings’ Meeting (EBBT) (pp. 1–4).
Bashir, U., Kawa, B., Siddique, M., Mak, S., Nair, A., Mclean, E., Bille, A., Goh, V., & Cook, G. (2019). Non-invasive classification of non-small cell lung cancer: A comparison between random forest models utilising radiomic and semantic features. The British Journal of Radiology, 92, 20190159.
Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40, 16–28.
Chicco, D., & Agapito, G. (2022). Nine quick tips for pathway enrichment analysis. PLoS Computational Biology, 18, e1010348.
Chuang, L., Chang, H., Tu, C., & Yang, C. (2008). Improved binary PSO for feature selection using gene expression data. Computational Biology and Chemistry, 32, 29–38.
Climente-González, H., Azencott, C., Kaski, S., & Yamada, M. (2019). Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data. Bioinformatics, 35, i427–i435.
Ding, C., & Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology, 3, 185–205.
Eshun, R., Islam, A., & Bikdash, M. (2021a). Identification of significantly expressed gene mutations for automated classification of benign and malignant prostate cancer. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) (pp. 2437–2443).


Eshun, R., Rabby, M., Islam, A., & Bikdash, M. (2021b). Histological classification of non-small cell lung cancer with RNA-seq data using machine learning models. In Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (pp. 1–7).
Fabregat, A., Sidiropoulos, K., Viteri, G., Forner, O., Marin-Garcia, P., Arnau, V., D’Eustachio, P., Stein, L., & Hermjakob, H. (2017). Reactome pathway analysis: A high-performance in-memory approach. BMC Bioinformatics, 18, 1–9.
Garg, A., & Mago, V. (2021). Role of machine learning in medical research: A survey. Computer Science Review, 40, 100370.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
Han, Y., Ma, Y., Wu, Z., Zhang, F., Zheng, D., Liu, X., Tao, L., Liang, Z., Yang, Z., Li, X., et al. (2021). Histologic subtype classification of non-small cell lung cancer using PET/CT images. European Journal of Nuclear Medicine and Molecular Imaging, 48, 350–360.
Inza, I., Larranaga, P., Blanco, R., & Cerrolaza, A. (2004). Filter versus wrapper gene selection approaches in DNA microarray domains. Artificial Intelligence in Medicine, 31, 91–103.
Johnson, N., Dhroso, A., Hughes, K., & Korkin, D. (2018). Biological classification with RNA-seq data: Can alternatively spliced transcript expression enhance machine learning classifiers? RNA, 24, 1119–1132.
Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research, 28, 27–30.
Liu, Y. (2006). Fatty acid oxidation is a dominant bioenergetic pathway in prostate cancer. Prostate Cancer and Prostatic Diseases, 9, 230–234.
Liu, H., & Motoda, H. (2007). Computational methods of feature selection. CRC Press.
Liu, M., & Zhang, D. (2016). Feature selection with effective distance. Neurocomputing, 215, 100–109.
Luca, B., Brewer, D., Edwards, D., Edwards, S., Whitaker, H., Merson, S., Dennis, N., Cooper, R., Hazell, S., Warren, A., et al. (2018). DESNT: A poor prognosis category of human prostate cancer. European Urology Focus, 4, 842–850.
Mostafiz, R., Rahman, M., Islam, A., & Belkasim, S. (2020). Focal liver lesion detection in ultrasound image using deep feature fusions and super resolution. Machine Learning and Knowledge Extraction, 2, 10.
Mubeen, S., Tom Kodamullil, A., Hofmann-Apitius, M., & Domingo-Fernández, D. (2022). On the influence of several factors on pathway enrichment analysis. Briefings in Bioinformatics, 23, bbac143.
Mundra, P., & Rajapakse, J. (2009). SVM-RFE with MRMR filter for gene selection. IEEE Transactions on Nanobioscience, 9, 31–37.
Nnamoko, N., Arshad, F., England, D., Vora, J., & Norman, J. (2014). Evaluation of filter and wrapper methods for feature selection in supervised machine learning. Age, 21, 33-2.
Qi, D., & Chen, K. (2021). Bioinformatics analysis of potential biomarkers and pathway identification for major depressive disorder. Computational and Mathematical Methods in Medicine, 2021, 1.
Rabby, M., Islam, A., Belkasim, S., & Bikdash, M. (2021). Epileptic seizures classification in EEG using PCA based genetic algorithm through machine learning. In Proceedings of the 2021 ACM Southeast Conference (pp. 17–24).
Radovic, M., Ghalwash, M., Filipovic, N., & Obradovic, Z. (2017). Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics, 18, 1–14.
Ramroach, S., Joshi, A., & John, M. (2020). Optimisation of cancer classification by machine learning generates an enriched list of candidate drug targets and biomarkers. Molecular Omics, 16, 113–125.

Prediction and Analysis of Key Genes in Prostate Cancer via MRMR Enhanced. . .

115

Reimand, J., Isserlin, R., Voisin, V., Kucera, M., Tannus-Lopes, C., Rostamianfar, A., Wadi, L., Meyer, M., Wong, J., Xu, C., et al. (2019). Pathway enrichment analysis and visualization of omics data using g: Profiler, GSEA, Cytoscape and EnrichmentMap. Nature Protocols, 14, 482–517. Saeys, Y., Inza, I., & Larranaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23, 2507–2517. Sahu, B., Dehuri, S., & Jagadev, A. (2018). A study on the relevance of feature selection methods in microarray data. The Open Bioinformatics Journal, 11, 117–139. Samuele, M. (2021, Feb 21). “MRMR” explained exactly how you wished someone explained to you. https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someoneexplained-to-you-9cf4ed27458b. Cited 1 Jun 2022. Sarac, F. (2017). Development of unsupervised feature selection methods for high dimensional biomedical data in regression domain. University of Northumbria at Newcastle (United Kingdom). Sena, L., & Denmeade, S. (2021). Fatty acid synthesis in prostate cancer: vulnerability or epiphenomenon? Cancer Research, 81, 4385. Talavera, L. (2005). An evaluation of filter and wrapper methods for feature selection in categorical clustering. In International Symposium on Intelligent Data Analysis (pp. 440–451). Vanitha, C., Devaraj, D., & Venkatesulu, M. (2015). Gene expression data classification using support vector machine and mutual information-based gene selection. Procedia Computer Science, 47, 13–21. Wang, H., & Hong, M. (2015). Distance variance score: an efficient feature selection method in text classification. Mathematical Problems in Engineering, 2015, 695–720. Watt, M., Clark, A., Selth, L., Haynes, V., Lister, N., Rebello, R., Porter, L., Niranjan, B., Whitby, S., Lo, J., et al. (2019). Suppressing fatty acid uptake has therapeutic effects in preclinical models of prostate cancer. Science Translational Medicine, 11, eaau5758. Zhang, Y., Ding, C., & Li, T. (2008). 
Gene selection algorithm by combining reliefF and mRMR. BMC Genomics, 9, 1–10. Zhao, J., Lu, K., & He, X. (2008). Locality sensitive semi-supervised feature selection. Neurocomputing, 71, 1842–1849. Zhao, Z., Wang, L., Liu, H., & Ye, J. (2011). On similarity preserving feature selection. IEEE Transactions on Knowledge and Data Engineering, 25, 619–632. Zhao, Z., Anand, R., & Wang, M. (2019). Maximum relevance and minimum redundancy feature selection methods for a marketing machine learning platform. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 442–452).

Graph-Based Machine Learning Approaches for Pangenomics

Indika Kahanda, Joann Mudge, Buwani Manuweera, Thiruvarangan Ramaraj, Alan Cleary, and Brendan Mumey

1 Introduction

A pangenome is a collection of genomic information belonging to more than one organism from a related species (Tettelin et al., 2005). While reading and decoding individual genomes has become increasingly straightforward in the recent past (Tettelin et al., 2005), connecting identified genomic features to their corresponding phenotypic characteristics is still considered challenging. A genome-wide association study (GWAS) is one of the traditional methods for connecting DNA sequences to observable traits. This technique attempts to find correlations between differences in DNA sequences and differences in observable traits across individuals in a population of the same species. Typically, to reduce cost, the complete DNA sequence is used only for the reference genome, while small samples at different locations on the chromosome are used for the rest of the individuals. However, since all differences in an organism’s genome are defined with respect to the reference genome, this likely introduces bias. This is similar to

I. Kahanda
School of Computing, University of North Florida, Jacksonville, FL, USA
e-mail: [email protected]

J. Mudge · A. Cleary
National Center for Genome Resources, Santa Fe, NM, USA
e-mail: [email protected]; [email protected]

B. Manuweera · B. Mumey
Gianforte School of Computing, Montana State University, Bozeman, MT, USA
e-mail: [email protected]; [email protected]

T. Ramaraj
School of Computing, DePaul University, Chicago, IL, USA
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_7


attempting to use seedless grapes to define all fruits. Of course, when an apple is compared to the reference grape, the two can be clearly distinguished by differences in skin, stem, and flesh. However, the existence of seeds in apples will be lost from this analysis because the reference fruit offers no way to describe seeds. Moreover, while we may be able to describe a fruit like an apple, which has some similarities to grapes, with reasonable success, describing a very different fruit like a pineapple will be a fruitless exercise. Therefore, information about novel or highly evolved regions in the DNA that match the reference poorly or not at all will be lost. Unfortunately, these very regions may contain genes or control regions that underlie interesting differences in observable traits. Furthermore, these regions may contain genes that improve an organism’s adaptability to extreme conditions such as drought, or that reduce its susceptibility to disease, leading to increased survivability (Takahashi et al., 2020).

The pangenome is a relatively new concept. First defined by Sigaux (2000) and given its modern definition by Tettelin et al. (2005), it has only recently gained widespread research interest due to improvements in sequencing technology that have spurred the sequencing of many organisms with complex genomes (Golicz et al., 2020). At the sequence level, a pangenome is commonly represented as a graph structure, where each node represents a sequence that occurs in one or more genomes, a directed edge connects two nodes if their sequences are contiguous in one or more genomes, and each organism’s genome is embedded in the graph as a path (Eizenga et al., 2020).
This is a natural representation of the pangenome because the graph compactly stores the complete sequence content of the population, and the similarities and differences between sequences are represented directly in the structure of the graph. This representation, however, is still relatively new and has yet to be fully leveraged for pangenome-specific analyses. Over the years, researchers have developed a plethora of tools and techniques for conducting GWAS (Visscher et al., 2017; Burghardt et al., 2017; Minkin & Medvedev, 2019). Most GWAS tools and studies focus primarily on SNPs and disregard variation at the structural level, often because information on structural variation isn’t available. However, catalogues of structural variation are emerging for some species, as are efforts to connect this variation to phenotypes through GWAS or other methods (Collins et al., 2020; Qin et al., 2021; Liu et al., 2020; Göktay et al., 2021). Furthermore, the application of machine learning to GWAS has been limited (Szymczak et al., 2009; Nguyen et al., 2015), and its application to pangenomics data has been even rarer (Kavvas et al., 2018; Her & Wu, 2018; Cleary et al., 2018). Pangenome-wide association studies have been explored in a limited capacity, with the proposed methods neither scaling well nor making use of the graph representations inherently derivable from pangenomics data (Gori et al., 2020; Lees et al., 2018). Current pangenomics tools provide only the limited functionality of constructing pangenomic graphs (Beller & Ohlebusch, 2016; Minkin & Medvedev, 2019; Garrison, 2019) or performing bioinformatics tasks such as


mapping reads and calling variants in the graph space (Garrison et al., 2018; Heydari et al., 2018). This chapter presents a novel graph-based technique for pangenomics that uses a reference-free algorithm, recently introduced by Cleary et al. (2018), for locating conserved regions (Frequented Regions), including regions with structural variants. Pangenomic data structures and Frequented Regions capture structural variation much more easily than traditional reference-based alignment strategies, making it straightforward to connect these types of variants to phenotype. We describe how these conserved genomic regions, integrated with machine learning models, can be used to find genotype-phenotype relationships in the context of a pangenome, a task that falls within the area of Pangenome-wide Association Studies (PWAS). We also provide evidence, using yeast data, for the utility of this approach in comparison to standard SNP-based GWAS techniques.

The rest of the chapter is organized as follows. In the Methods section, we provide an overview of our graph-based machine learning formulation based on Frequented Regions by discussing models, data, experimental setup, and metrics. The Results section evaluates this approach against traditional approaches using yeast data. Finally, in the Conclusion section, we discuss limitations and open problems.

2 Methods

In this section, we provide an overview of our approach. We first introduce the algorithm used for finding the conserved regions (Frequented Regions). We then describe the genome-wide association study analysis, which represents the traditional reference-based analysis. Next, we introduce our proposed reference-free method (FR-ML), in which we formulate genotype-phenotype association as a supervised machine learning problem. We also describe the specific yeast datasets and the experimental setup used for comparing FR-ML with reference-based methods.

2.1 Frequented Regions

We first introduced Frequented Regions in Cleary et al. (2018); they can be computed as follows. The input to the process of finding Frequented Regions is a compressed de Bruijn graph (cDBG) G and a set of paths P through G. Here each path represents a different genome/strain in the cDBG. Specific k-mers represent nodes of G. If the last $k - 1$ nucleotides of node u overlap with the first $k - 1$ nucleotides of node v, then an edge $(u, v)$ is created. A Frequented Region (FR)


Fig. 1 Example FRs (taken from Cleary et al. (2018)): Assuming $\alpha \le 2/3$, the left subgraph of nodes $C_L = \{a, b, c\}$ forms an FR with support 2 (from the black and blue paths). The right subgraph of nodes $C_R = \{d, e, f, g\}$ forms an FR with support 3 (from the black, blue and red paths). If $C_L$ and $C_R$ were merged, the merged FR would have support 2, provided the connecting black and blue path segments between nodes c and f each have at most $\kappa$ insertions

$(C, S)$ is a subgraph composed of a set of nodes C and a set of subpaths S from P such that each subpath traverses a subset of the nodes in C. Two parameters are needed to define Frequented Regions formally: (a) $\alpha$ (the penetrance parameter), the minimum fraction of C’s nodes that each subpath must traverse, and (b) $\kappa$ (the maximum insertion parameter), the maximum number of nodes not in C that a subpath may traverse before traversing another node in C. Subpaths that satisfy both conditions are called “$(\alpha, \kappa)$-supporting subpaths”. Therefore, given a set C of de Bruijn nodes and a set S of $(\alpha, \kappa)$-supporting subpaths of paths from P, we define a Frequented Region (FR) as the tuple $(C, S)$ (Cleary et al., 2018). Figure 1 provides an example.

To find FRs that are more suitable as input features for a machine learning model, we introduced the notion of interesting FRs (iFRs) (Manuweera et al., 2019), which can be used to filter FRs based on the particular strains that support them. Essentially, iFRs are the FRs with the highest support, where support is computed only from strains in the training set. In this work, the 50,000 iFRs with the most support are used as features. This threshold is consistent with the baseline machine learning model, which uses 50,000 SNPs as features (see Sect. 2.4 for more details on both machine learning models). For the remainder of this document, the terms iFR and FR are used interchangeably. We refer the reader to Cleary et al. (2018) for full algorithmic details of Frequented Regions.
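The two conditions on supporting subpaths can be made concrete with a small sketch. The code below is not the FindFRs algorithm of Cleary et al. (2018), which also has to search for good node sets C; it only checks, for a node set that is already given, which maximal subpaths satisfy the $\alpha$ and $\kappa$ conditions, and counts them as the support. The function names and the toy paths are hypothetical, and the paths are only loosely modelled on the description in the Fig. 1 caption, not taken from the actual figure.

```python
from typing import Dict, List, Sequence, Set, Tuple


def supporting_subpaths(
    path: Sequence[str], nodes: Set[str], alpha: float, kappa: int
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of maximal (alpha, kappa)-supporting
    subpaths of `path` for the node set `nodes` (the set C in the text)."""
    hits = [i for i, node in enumerate(path) if node in nodes]
    found: List[Tuple[int, int]] = []
    if not hits:
        return found
    start = prev = hits[0]
    # The trailing None flushes the final run of hits.
    for idx in hits[1:] + [None]:
        gap_too_long = idx is not None and idx - prev - 1 > kappa
        if idx is None or gap_too_long:
            covered = set(path[start:prev + 1]) & nodes
            # Penetrance check; the tolerance guards against float rounding.
            if len(covered) >= alpha * len(nodes) - 1e-9:
                found.append((start, prev))
            start = idx
        prev = idx
    return found


def fr_support(
    paths: Dict[str, Sequence[str]], nodes: Set[str], alpha: float, kappa: int
) -> int:
    """Support of a candidate FR: the number of supporting subpaths over all paths."""
    return sum(
        len(supporting_subpaths(path, nodes, alpha, kappa))
        for path in paths.values()
    )


# Hypothetical toy paths (assumption: NOT the real data behind Fig. 1).
PATHS = {
    "black": ["a", "b", "c", "f", "d", "g"],
    "blue": ["a", "c", "x", "f", "e", "g"],
    "red": ["y", "d", "e", "f", "g"],
}
C_LEFT = {"a", "b", "c"}
C_RIGHT = {"d", "e", "f", "g"}

if __name__ == "__main__":
    for name, node_set in [("left", C_LEFT), ("right", C_RIGHT),
                           ("merged", C_LEFT | C_RIGHT)]:
        print(name, fr_support(PATHS, node_set, alpha=2 / 3, kappa=1))
```

With these toy paths and $\alpha = 2/3$, $\kappa = 1$, the left set has support 2, the right set has support 3, and the merged set has support 2, mirroring the pattern described in the caption. Setting $\kappa = 0$ splits the blue path at its single inserted node, so the merged set then keeps only one supporting subpath.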


2.2 Data

We use two yeast datasets for evaluating the proposed methods and comparing them to traditional approaches. The first dataset is composed of the assembled sequences and phenotypic data from Strope et al. (2015), which is referred to as the yeast-100 dataset. It is composed of 100 yeast strains and 49 phenotypes. The second dataset is composed of yeast assemblies and 35 growth condition phenotypes from Peter et al. (2018). Because many of the assemblies are very fragmented, we focus on the 200 most contiguous assemblies of the approximately 1000 assemblies that also have associated phenotypes. We call this dataset the yeast-200 dataset. Note that there is no overlap between the phenotypes available for the two datasets.

2.3 Genome-Wide Association Study

A genome-wide association study (GWAS) is a technique for associating genetic variation with phenotypic traits by first scanning for common variants across a population and then computing the association of those variants with observable traits using statistical methods (Visscher et al., 2017). In this work, we use standard GWAS methods as a baseline for evaluating the effectiveness of our proposed approach. We identify a set of biallelic SNPs from each dataset. These SNPs are used both for the GWAS analysis and as input features for machine learning models (described below). For the yeast-200 dataset, we use the SNPs generated in Peter et al. (2018). For the yeast-100 dataset, we call biallelic SNPs as described below but, unlike Strope et al. (2015), we do not include additional features (e.g. existence of genes). We did this to replicate a more standard GWAS process, where low coverage of sequencing reads is adequate for identifying SNPs but may not be enough to generate complete assemblies. From both the yeast-100 and yeast-200 biallelic SNP datasets, we randomly select 50,000 SNPs for use in downstream analyses.

Specifically, to generate SNPs for the yeast-100 dataset, we first use the BWA (Li & Durbin, 2009) alignment tool (default parameters) to individually align sequencing read pairs to the reference genome (Saccharomyces cerevisiae S288C, baker’s yeast). To generate variant calls, we then use FreeBayes (Garrison, 2019), a Bayesian genetic variant caller. Note that we called variants on all alignment files simultaneously to ensure that the tool has information about all genomes when computing SNPs. This process output a total of 489,150 SNPs from 99 strains in comparison to the reference. Furthermore, because we coded genotypes for the reference as homozygous for the reference allele, the overall dataset still contains 100 strains.

We use the preprocessCore R library (López-Romero, 2011) to quantile normalize the phenotypic trait data. We use the Bayesian sparse linear mixed model of GEMMA (Zhou et al., 2013), with 250,000 burn-in steps, to compute


the associations and predict phenotypes. We further used GEMMA’s centered relatedness matrix for correcting for population structure. The estimated SNP breeding values are used for the prediction of phenotype values. To carry out an apples-to-apples comparison, the same test sets are used for GWAS and machine learning models. Details about how the data is split into training and testing are described below.
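As an illustration of the phenotype preprocessing step, the sketch below implements one common form of quantile normalization in plain Python. It is not the preprocessCore implementation: the chapter does not give the exact call or state which axis was normalized, so the column orientation, the function name, and the tie handling (ties keep their input order) are all assumptions made for this example.

```python
from typing import List


def quantile_normalize(columns: List[List[float]]) -> List[List[float]]:
    """Force every column to share the same empirical distribution.

    Each value is replaced by the mean, taken across columns, of the values
    that hold the same rank. All columns must have the same length.
    """
    n_rows = len(columns[0])
    n_cols = len(columns)
    sorted_columns = [sorted(column) for column in columns]
    # Mean of the r-th smallest value over all columns.
    rank_means = [
        sum(column[rank] for column in sorted_columns) / n_cols
        for rank in range(n_rows)
    ]
    normalized_columns = []
    for column in columns:
        # Row indices ordered from smallest to largest value (stable for ties).
        order = sorted(range(n_rows), key=lambda row: column[row])
        normalized = [0.0] * n_rows
        for rank, row in enumerate(order):
            normalized[row] = rank_means[rank]
        normalized_columns.append(normalized)
    return normalized_columns


if __name__ == "__main__":
    # Two hypothetical columns of raw measurements for three strains.
    raw = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
    print(quantile_normalize(raw))
```

After normalization both columns contain the same set of values, and only the order within each column still reflects the original ranking of the strains.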

2.4 Machine Learning Models

In this work we formulate the task of genotype-phenotype association as a supervised machine learning problem. Given a DNA sequence (i.e. genome/strain) $S_i$ composed of nucleotides $n_k$, we define a mapping $f_j(\cdot)$ as:

$$f_j(T(S_i)) = y_j$$

where $T(S_i)$ is an abstract representation of the sequence $S_i$, which provides the input features, and $y_j$ is the predicted value of the target variable j (i.e. phenotype j). Since $y_j$ is a continuous variable, this is a typical regression problem. We use traditional machine learning techniques to develop regression models for learning the function $f_j$. We learn a separate function for each individual phenotype (i.e. multi-output regression that fits one regressor per target).

Our input representation T is based on the “pangenomic” graph generated by first combining all the genome strain sequences of the collection into a compressed de Bruijn graph (cDBG) (Beller & Ohlebusch, 2016). Then, we identify “hotspot” regions (called Frequented Regions, as described in Sect. 2.1 above) within this cDBG, which are sets of nodes that are approximately traversed together by a set of supporting paths belonging to individual sequences. More specifically, we use iFRs as input features for the predictive models to predict yeast phenotype values. In this formulation, each instance is represented by an individually assembled yeast strain. Each strain is labeled with a range of phenotype values, making each phenotype value a continuous target variable. As such, we model this as a multi-output regression problem (Liu et al., 2009).

As depicted in Fig. 2, our proposed method is called FR-ML. For this method, the machine learning models are fed features generated from Frequented Regions. We compare FR-ML to a machine learning model that uses SNPs as input features, which we call SNPs-ML. Since FR-ML features are derived from a pangenomic graph, this approach is reference-free, whereas SNPs-ML is reference-based since its features are derived with respect to a reference strain.
We use the random forest (RF) (Chen & Ishwaran, 2012) regressor, which has worked well with high-dimensional genomic data in the past (Chen & Ishwaran, 2012; Wu et al., 2009; Díaz-Uriarte & Alvarez de Andrés, 2006; Schwarz et al., 2010), for training both the FR-ML and SNPs-ML models. Multi-output regression is handled by learning an independent single-output random forest


Fig. 2 Comparison of the proposed method (FR-ML) with the GWAS and SNPs-ML methods. Both GWAS and SNPs-ML methods are reference-based methods as they require a reference genome for identifying the SNPs. However, the FR-ML method, which is based on Frequented Regions identified from a graphical representation of the pangenome, is a reference-free technique. Both FR-ML and SNPs-ML use multi-output regression models for predicting phenotypes

regression model for each phenotype. The input to the models is m examples (strains), each represented with n features. In the FR-ML model, the input features are iFRs, and feature values are counts of occurrences of iFRs. Specifically, each strain is represented by an n-dimensional vector, where vector element i is the number of times $iFR_i$ occurs within that strain. Analogously, in SNPs-ML, the features are single nucleotide polymorphisms (SNPs) and feature values are binary, i.e. whether or not the SNP occurs in a particular strain. These input vectors are fed to the regression models for learning the parameters of the model. Then, for evaluation, the trained models are used to make predictions on a subset of strains held out during the training phase. Finally, we evaluate the accuracy of the trained models by comparing the predicted phenotype values to the observed values.
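The two input encodings can be sketched as follows. This is a minimal illustration using only the Python standard library; the input dictionaries, the identifiers such as `iFR1` or `snp1`, and the function names are hypothetical and do not reflect the output format of any particular tool.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Set


def ifr_count_features(
    ifr_hits: Dict[str, Iterable[str]], ifr_ids: Sequence[str]
) -> Dict[str, List[int]]:
    """FR-ML style encoding: element i counts how often iFR i occurs in a strain."""
    rows = {}
    for strain, hits in ifr_hits.items():
        counts = Counter(hits)
        rows[strain] = [counts.get(ifr_id, 0) for ifr_id in ifr_ids]
    return rows


def snp_binary_features(
    snp_calls: Dict[str, Set[str]], snp_ids: Sequence[str]
) -> Dict[str, List[int]]:
    """SNPs-ML style encoding: element i is 1 if SNP i occurs in the strain, else 0."""
    return {
        strain: [1 if snp_id in present else 0 for snp_id in snp_ids]
        for strain, present in snp_calls.items()
    }


if __name__ == "__main__":
    # Hypothetical inputs: one list entry per occurrence of an iFR in a strain.
    hits = {"strainA": ["iFR1", "iFR3", "iFR1"], "strainB": ["iFR2"]}
    print(ifr_count_features(hits, ["iFR1", "iFR2", "iFR3"]))

    # Hypothetical inputs: the set of SNPs observed in each strain.
    calls = {"strainA": {"snp1"}, "strainB": {"snp1", "snp2"}}
    print(snp_binary_features(calls, ["snp1", "snp2"]))
```

Each row produced this way corresponds to one strain. To follow the one-regressor-per-target scheme described above, every phenotype column would then be fit with its own regressor on these rows.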

2.5 Experimental Setup

For yeast-100, there are 100 examples and 49 phenotypes. Similarly, for yeast-200, we have 200 examples and 35 phenotypes. We implemented all machine-learning models using the Scikit-learn Python library (Pedregosa et al., 2011). Specifically, we used sklearn.ensemble.RandomForestRegressor with default parameters for our experiments. Fivefold cross-validation is used to evaluate the performance of the models, as depicted in Fig. 3. Here, each of the two datasets is randomly split into five folds. In each iteration of the validation process, 60% of the data (i.e. three of the five folds, shown in blue) are used by the machine learning algorithm to generate a series of models, one per combination of parameter values (i.e. k-mer size, $\alpha$, and $\kappa$). The sets of values used for each parameter are


Fig. 3 Schematic visualization of the semi-nested fivefold cross-validation procedure. In each iteration of the validation process, 60% of the data (i.e. Training folds shown in blue) are used by the machine learning algorithm for training a range of models for various parameter value combinations. These models are compared using 20% of the data (i.e., validation data shown in orange). Finally, the best combination of parameters is used to re-train a new model on the train+validation folds and is tested on the remaining data (i.e., Test data shown in green)

shown in Figs. 7 and 8. Then these models are compared using their performance on the validation fold (20% of the data, shown in orange). Thereafter, the parameter value combination corresponding to the best performance is used to re-train a new model on a dataset that combines both the training and validation folds. These final models are then evaluated on the test data (shown in green). The resulting performances are averaged across all five iterations to compute the final performance of the models. This approach, which we refer to as “semi-nested cross-validation”, is an in-house variation of the popular “nested cross-validation” technique (Dora et al., 2018), which is considered to provide one of the least biased estimates of a machine learning model’s performance (because none of the test sets are used for the internal parameter optimization). Note that since the generation of SNPs for SNPs-ML does not involve internal parameter tuning (unlike the k-mer size, $\alpha$, and $\kappa$ tuned for Frequented Region generation), the internal loop of this cross-validation process was omitted for SNPs-ML. The evaluation of all three models (i.e. FR-ML, SNPs-ML, and GWAS) is performed at the individual phenotype level. Overall performance is computed by averaging across all phenotypes. We use Root Mean Squared Error (RMSE) as the primary performance metric for comparing the models. RMSE was therefore the metric of choice both for finding the best parameter combination for FR-ML and for reporting the final performance of all three methods. RMSE is defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (O_i - P_i)^2} \qquad (1)$$

where $O_i$ are the actual phenotype values, $P_i$ are the predicted values, and $N$ is the number of evaluated strains. Lower RMSE values indicate better performance.
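The semi-nested procedure and the RMSE metric can be sketched together in a few lines. This is a sketch under stated assumptions, not the authors' code: the fold rotation (the validation fold is taken to be the fold after the test fold) is a guess, since Fig. 3 is not reproduced here, and `fit` is a hypothetical callable that builds features and a model for one parameter combination from the given training indices and returns a per-strain predictor.

```python
import math
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values (Eq. 1)."""
    total = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(total / len(observed))


def semi_nested_splits(
    n_samples: int, n_folds: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, validation, test) index lists, one triple per iteration."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for test_fold in range(n_folds):
        # Assumption: the validation fold is simply the next fold in order.
        val_fold = (test_fold + 1) % n_folds
        train = [
            i
            for f in range(n_folds)
            if f not in (test_fold, val_fold)
            for i in folds[f]
        ]
        yield train, folds[val_fold], folds[test_fold]


def semi_nested_cv(
    fit: Callable[[object, List[int]], Callable[[int], float]],
    candidates: Sequence[object],
    y: Sequence[float],
    n_folds: int = 5,
    seed: int = 0,
) -> float:
    """Average test RMSE over all iterations for one phenotype `y`.

    `fit(params, indices)` is a hypothetical callable: it trains on the given
    strain indices and returns a function mapping a strain index to a prediction.
    """
    test_scores = []
    for train, val, test in semi_nested_splits(len(y), n_folds, seed):

        def validation_rmse(params: object) -> float:
            predict = fit(params, train)
            return rmse([y[i] for i in val], [predict(i) for i in val])

        # Inner step: choose parameters on the validation fold only.
        best = min(candidates, key=validation_rmse)
        # Re-train on train + validation, then score once on the test fold.
        predict = fit(best, train + val)
        test_scores.append(rmse([y[i] for i in test], [predict(i) for i in test]))
    return sum(test_scores) / len(test_scores)
```

Passing strain indices rather than a fixed feature matrix to `fit` is a deliberate choice in this sketch: in FR-ML the tuned parameters change the features themselves, and iFRs are selected using training strains only, so feature construction has to happen inside each call.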

3 Results

3.1 Phenotypic Prediction

As shown in Table 1 and Figs. 5 and 6, FRs (i.e. FR-ML) are more effective than SNPs (i.e. SNPs-ML) for predicting phenotypes. For yeast-100, we observed a significant difference between the average RMSEs of FRs and SNPs (5.38 vs. 5.74, p-value: 3.703E-10, two-tailed paired t-test), which corresponds to an improvement of about 6% from using FRs in place of SNPs. As shown in Table 1, we also observed a slightly lower average RMSE for FRs compared to SNPsG (5.38 vs. 5.41). However, this difference is not significant (p-value: 0.67). Subfigure (a) in Fig. 4 depicts an example phenotype for which FR-ML predictions are more correlated with actual phenotypic values than those of both other models.

For the yeast-200 data, FRs perform comparably to SNPs for predicting phenotypes (Table 1), though for some phenotypes FRs showed a better correlation than the two SNP methods (an example is shown in Fig. 4b). The p-value (two-tailed paired t-test) for the difference between the average RMSE with FRs (0.137) versus SNPs (0.139) is 0.57. Furthermore, as evident from Table 1, FRs (average RMSE = 0.137) and GWAS SNPs (average RMSE = 0.138) have comparable overall predictive power (p-value: 0.78).

For 42/49 phenotypes in the yeast-100 data, FRs provide better RMSEs (see Fig. 5). It is also interesting to note that for the yeast-200 data, FRs provide better RMSEs for 23/35 phenotypes while SNPs provide better performance for the remaining 12 (see Fig. 6), suggesting that these two types of features may be capturing complementary genomic information. It is difficult to determine why FR-ML significantly outperformed SNPs-ML for the yeast-100 dataset but not the yeast-200 dataset. Prediction success is determined by several factors, including the number of genes controlling the phenotype, the number of individuals in the dataset, and their relationships to each other. Even though FRs did not offer a significant performance advantage over GWAS SNPs,

Table 1 Performance comparison between our approach that combines machine learning with FRs (FR-ML), the machine learning model that uses SNPs (SNPs-ML), and the traditional GWAS method (SNPsG), on the yeast-100 and yeast-200 datasets. Performance is reported as Root Mean Squared Error (RMSE). Lower RMSE values represent better performance

Dataset     # Genomes   # Phenotypes   SNPs-ML   SNPsG   FR-ML
yeast-100   100         49             5.74      5.41    5.38
yeast-200   200         35             0.139     0.138   0.137


Fig. 4 Scatterplots comparing actual and predicted normalized phenotypic values across all samples for (a) Copper sulfate (0.075 mM) in the yeast-100 data and (b) 2% acetate in the yeast-200 data. Each strain is represented with a dot. The green circles, red triangles and blue squares represent the FR-ML, SNPsG and SNPs-ML methods, respectively

Fig. 5 Percentage performance improvement of FR-ML over SNPs-ML for the 49 yeast-100 phenotypes. X-axis: phenotypes, y-axis: the % improvement, which is computed as follows: $\%\,\mathrm{Improvement} = 100 \cdot \frac{\mathrm{RMSE}_{SNP} - \mathrm{RMSE}_{FR}}{\mathrm{RMSE}_{SNP}}$


Fig. 6 Percentage performance improvement of FR-ML over SNPs-ML for the 35 yeast-200 phenotypes. X-axis: phenotypes, y-axis: the % improvement, which is computed as follows: $\%\,\mathrm{Improvement} = 100 \cdot \frac{\mathrm{RMSE}_{SNP} - \mathrm{RMSE}_{FR}}{\mathrm{RMSE}_{SNP}}$

using the FR-ML method eliminates the need for a reference genome. This may be an important factor when working with species without a standard reference genome.

The “best” (k-mer size, $\alpha$, $\kappa$) parameters identified during the inner cross-validation loop over the five iterations are (100, 0.8, 3), (500, 0.9, 3), (100, 0.7, 0), (25, 0.7, 3), and (1000, 0.6, 1) for yeast-100, and (300, 0.8, 3), (500, 0.9, 0), (25, 0.7, 3), (25, 0.7, 3), and (25, 0.8, 0) for yeast-200 (see Figs. 7 and 8). As evident from the variation in all three parameters, neither a single parameter value nor a single combination is consistently favoured by the model (though $k = 25$ seems to appear more frequently than the others). This observation highlights the importance of using a nested cross-validation process, which lets the model independently determine the best parameter values for each fold.
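As a worked instance of the percentage-improvement formula given in the captions of Figs. 5 and 6, the average RMSEs from Table 1 can be substituted directly. Note the assumption: the figures apply the formula per phenotype, whereas the calculation below applies it to the averaged RMSEs, so it only illustrates the overall magnitude.

```latex
\begin{align*}
\text{yeast-100:}\quad \%\,\mathrm{Improvement}
  &= 100 \cdot \frac{5.74 - 5.38}{5.74} \approx 6.3\% \\
\text{yeast-200:}\quad \%\,\mathrm{Improvement}
  &= 100 \cdot \frac{0.139 - 0.137}{0.139} \approx 1.4\%
\end{align*}
```

The yeast-100 value agrees with the roughly 6% improvement reported in the text, while the much smaller yeast-200 value is in line with the non-significant difference reported for that dataset.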

3.2 FRs and Annotations

To investigate what percentage of FRs and their associated subpaths overlap with yeast genes, we used the “intersect” tool included in BEDTools (Quinlan & Hall, 2010). The inputs provided to this tool are: (1) a BED (Browser Extensible Data) file composed of subpaths, which is output by the FindFRs tool (Cleary et al., 2018), and (2) a GFF (General Feature Format) file composed of (combined) gene annotations for the yeast strains. The tool then reports the FRs overlapping


Fig. 7 Various parameter combinations used for the experiments and their corresponding average RMSE values in the yeast-100 dataset. The x- and y-axis depict the $\alpha$ and $\kappa$ parameters, respectively. Specific k-mer values and the ith iteration of the cross-validation are indicated by each submatrix. The value depicted within each cell indicates the performance rank (i.e., 1–80) for each parameter combination within a single iteration. In addition, the magnitude of the ranking is further highlighted by using the standard green-white-red conditional formatting (for each iteration). The best combination for each iteration (i.e. “1”) is also shown in bold text

the coordinates of the gene annotations. For each of the yeast-100 and yeast-200 datasets, a BED file was generated with the following parameters: k-mer size: 1000, $\alpha$: 0.7, $\kappa$: 3, minimum support: 1, and maximum number of iFRs to report: 50,000. A supporting subpath is defined as overlapping a gene if at least half of the subpath lies within the gene coordinates. For the yeast-100 data, 2,592,521 (85%) of the 3,040,550 subpaths belonging to 50,000 FRs overlap with a gene under this 50% threshold. Similarly, for the yeast-200 data, 4,284,174 (79%) of the 5,423,006 subpaths belonging to 49,881 FRs overlap with genic coordinates. In addition, we analyzed what percentage of yeast genes overlap with interesting Frequented Regions (iFRs). Our analysis reveals that approximately 30% of yeast genes overlap with iFRs in each of the yeast-100 and yeast-200 datasets. As evident from these observations, for any genomes for which gene annotations are available, this approach can be used to characterize population variation and conservation at both the genome and the gene level. This would provide a valuable opportunity to begin to investigate


Fig. 8 Various parameter combinations used for the experiments and their corresponding average RMSE values in the yeast-200 dataset. The x- and y-axis depict the $\alpha$ and $\kappa$ parameters, respectively. Specific k-mer values and the ith iteration of the cross-validation are indicated by each submatrix. The value depicted within each cell indicates the performance rank (i.e., 1–96) for each parameter combination within a single iteration. In addition, the magnitude of the ranking is further highlighted by using the standard green-white-red conditional formatting (for each iteration). The best combination for each iteration (i.e. “1”) is also shown in bold text

important associations between conserved regions and their related phenotypes and identify biologically meaningful variation that is driven by genes or regions controlling gene expression in these FRs.
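The 50% overlap rule used above to decide whether a supporting subpath overlaps a gene can be sketched as follows. This is only an illustration: the function names are hypothetical, coordinates are assumed to be half-open intervals as in BED files, and the chapter's actual tooling is not reproduced here.

```python
def overlaps_gene(sub_start, sub_end, gene_start, gene_end, min_fraction=0.5):
    """Return True if at least `min_fraction` of the subpath lies inside the gene.

    Coordinates are assumed to be half-open intervals [start, end), as in BED files.
    """
    sub_len = sub_end - sub_start
    if sub_len <= 0:
        return False
    # Number of bases shared by the subpath and the gene (0 if they are disjoint).
    shared = max(0, min(sub_end, gene_end) - max(sub_start, gene_start))
    return shared / sub_len >= min_fraction


def percent_overlapping(n_overlapping, n_total):
    """Percentage of subpaths that overlap a gene."""
    return 100.0 * n_overlapping / n_total


# Counts reported in the text for the yeast-100 and yeast-200 datasets.
print(round(percent_overlapping(2_592_521, 3_040_550)))  # 85
print(round(percent_overlapping(4_284_174, 5_423_006)))  # 79
```

With this definition, a 100-bp subpath that shares exactly 50 bp with a gene counts as overlapping, while one that shares 49 bp does not.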

4 Conclusion

In this work, we propose a reference-free approach for genotype-phenotype association at the pangenome scale. The advantage of a reference-free pangenomic approach is the unbiased incorporation of the complete genetic variation that exists in the population. This enables enhanced phenotypic prediction based on an improved capability to locate the variation driving observable differences in traits. Our method integrates genomic information into a pangenomic framework. It further allows these data types to be integrated with phenotypic data, including high-throughput, automated phenotypes (also described as phenomic data), in order to elucidate how genotypes control phenotypes and to predict phenotypes. The biological interpretation of the data requires the use of genomic-level information,


including annotation on genes, gene regulation, and pathways that are linked specifically to the genomic data. It also requires integration with pangenomic data, which describes the relationships between the genomes. Using two yeast datasets, we demonstrate the utility of the proposed approach compared to traditional reference-based approaches. Specifically, we showed that using FRs as features for supervised learning models produced equal or better prediction accuracy than a machine learning model based on SNPs, which are inherently biased toward variation observed in the reference genome. Our results are especially encouraging because we used only the FR counts as input features and the default values for all hyperparameters in the machine learning models. As we further fine-tune the hyperparameters of our machine learning models, prediction accuracies should start to consistently outperform GWAS-based methods. Furthermore, since SNPs and FRs capture different information (i.e., SNPs capture variation and FRs capture similarity), it would be interesting to run the models on an input that contains both. It would also be interesting to run the FR-ML model on genes instead of FRs, by first establishing which genes are homologs. Finally, developing the ability to apply our graph-based machine learning methods to larger training sets, by improving the algorithmic and implementation aspects of FR generation, should lead the way to improved prediction power.

Acknowledgments

This work was supported by NSF grant DBI-1759522.


Multiomics-Based Tensor Decomposition for Characterizing Breast Cancer Heterogeneity

Qian Liu, Shujun Huang, Zhongyuan Zhang, Ted M. Lakowski, Wei Xu, and Pingzhao Hu

1 Breast Cancer Inter-Tumor Heterogeneity

Breast cancer (BC) is typically referred to as a single disease because it originates from the cells of the mammary gland. However, BC is a complex disease with a high degree of inter-tumor heterogeneity, that is, differences among tumors

Qian Liu and Shujun Huang contributed equally to this work.
Q. Liu: Department of Biochemistry, Schulich School of Medicine & Dentistry, Western University, Siebens Drake Research Institute, London, Ontario, Canada; Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada; Department of Statistics, University of Manitoba, Winnipeg, MB, Canada
S. Huang: Department of Biochemistry and Medical Genetics, University of Manitoba, Winnipeg, MB, Canada
Z. Zhang: Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
T. M. Lakowski: College of Pharmacy, University of Manitoba, Winnipeg, MB, Canada
W. Xu: Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada; Biostatistics Department, Princess Margaret Cancer Centre, Toronto, ON, Canada
P. Hu: Department of Biochemistry, Schulich School of Medicine & Dentistry, Western University, Medical Sciences, London, Ontario, Canada
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_8


Fig. 1 Breast cancer heterogeneity. Breast cancer can be stratified into different intrinsic subtypes based on gene expression profiles. In the clinical setting, the surrogate intrinsic subtypes are defined by the expression status of ER, PR, HER2, and Ki67. Each breast cancer subtype has a different prognosis and responds to different treatments. Note: this figure was generated based on the information reviewed in Harbeck et al. (2019)

from different individuals. The heterogeneity in BC has been recognized through different clinical and histopathological characteristics, biomarker profiles, or the more modern genomic and transcriptomic patterns (Polyak, 2011; Turashvili & Brogi, 2017) (Fig. 1). These differences have served as the basis for different BC classification schemes (Polyak, 2011).

1.1 Morphological and Histopathologic Heterogeneity

Pathologists have long noted the histological diversity of breast tumors, and this morphological heterogeneity is the basis for the histological classification of BC (Turashvili & Brogi, 2017). From a histological point of view, BC can arise from the epithelial cells that line the ducts (ductal carcinoma) or the lobules (lobular carcinoma) (Malhotra et al., 2010). Ductal carcinoma is classified as ductal carcinoma in situ if the cancer cells remain within the epithelial component of the ducts, or as invasive ductal carcinoma if the cancer cells have invaded the surrounding tissues (Malhotra et al., 2010). Similarly, lobular carcinoma is classified as lobular carcinoma in situ, meaning that the tumor is limited to the epithelium of the lobules, or as invasive lobular carcinoma, meaning that the tumor has grown into the stroma (Malhotra et al.,


2010). Invasive ductal carcinoma is the most common histological type of BC, accounting for approximately 70–80% of all invasive breast carcinomas (Malhotra et al., 2010). Inter-tumor heterogeneity of BC is arguably best represented by the pathological staging and histological grading of breast carcinoma (Turashvili & Brogi, 2017). Stage refers to the extent of a cancer, such as how large the tumor is, what part of the breast has cancer, and if the tumor has spread (Brierley et al., 2017). Stage is a strong prognostic factor (Giuliano et al., 2017). Staging BC based on physical examination and imaging findings can help physicians to determine the BC prognosis and determine the right treatment options (Giuliano et al., 2017). Generally, the higher the stage is, the more the cancer has spread. Survival varies with each stage of BC. In general, the earlier stage BC is diagnosed and treated, the better the outcome. There are different grading systems available for BC. The most commonly used is the Nottingham grading system (Elston & Ellis, 1991). The Nottingham grading system takes three tumor characteristics into consideration to assess the grade of breast tumors: glandular/tubular differentiation (the proportion of cancer cells that are in gland formation), nuclear pleomorphism (the variation of nuclear size and shape between the tumor cells) and mitotic count (how much the tumor cells are dividing or proliferating) (Elston & Ellis, 1991). Each of these features is scored from 1 to 3, and then the individual scores are added together to determine the final grade: grade 1 (low), grade 2 (intermediate) or grade 3 (high) (Elston & Ellis, 1991). The grade of breast carcinoma is a robust predictor of survival and reflects the aggressive potential of the tumor, with low-grade cancers generally tending to be less aggressive than high-grade cancers (Davidson et al., 2019). 
Determining the grade is therefore important, because clinicians use this information to guide treatment options for BC patients.
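The scoring arithmetic of the Nottingham system can be illustrated with a short sketch. The text above states only that the three component scores are summed; the cut-offs used here (total 3–5 for grade 1, 6–7 for grade 2, 8–9 for grade 3) are the commonly cited ones and are an assumption of this example, as is the function name.

```python
def nottingham_grade(tubule_score, pleomorphism_score, mitotic_score):
    """Combine the three Nottingham component scores (each 1-3) into a grade (1-3).

    Assumed cut-offs (not stated in the text): total 3-5 -> grade 1,
    6-7 -> grade 2, 8-9 -> grade 3.
    """
    scores = (tubule_score, pleomorphism_score, mitotic_score)
    if any(score not in (1, 2, 3) for score in scores):
        raise ValueError("each component score must be 1, 2, or 3")
    total = sum(scores)
    if total <= 5:
        return 1  # low grade
    if total <= 7:
        return 2  # intermediate grade
    return 3      # high grade


# A tumor with poor tubule formation (3), moderate pleomorphism (2),
# and a low mitotic count (1) has a total score of 6.
print(nottingham_grade(3, 2, 1))  # 2
```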

1.2 Biomarker Heterogeneity

BC inter-tumor heterogeneity can also be characterized by the expression status of well-established molecular biomarkers (Turashvili & Brogi, 2017). Differences in the expression of these biomarkers are known as biomarker heterogeneity (Turashvili & Brogi, 2017). Three biomarkers (estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2)) are recognized by international guidelines as prognostic and predictive factors indispensable for therapy decision-making in invasive BC (Harbeck et al., 2019). In the clinic, BC subtypes are usually determined by examining protein expression of ER and PR, as well as protein expression of HER2 and/or gene amplification of HER2. At diagnosis, the solid tumor tissue samples taken during the biopsy of all invasive breast carcinomas are routinely tested by immunohistochemistry (IHC) to assess the status of these biomarkers and support treatment decision-making (Beca & Polyak, 2016). According to the recommendations of the American Society of Clinical Oncology/College of American Pathologists, if any immunohistochemical nuclear


staining (irrespective of the signal intensity) is observed in at least 1% of invasive tumor cells, the tumor is considered hormone receptor-positive (ER-positive and/or PR-positive, i.e., ER+ and/or PR+) (Hammond et al., 2010). A tumor is considered hormone receptor-negative (ER-negative and PR-negative, i.e., ER-/PR-) if fewer than 1% of cells are stained positively by antibodies against ER and PR (Hammond et al., 2010). The utility of such IHC-based ER and PR stratification is that it supports some generalizations. For example, patients with hormone receptor-positive tumors have a more favorable prognosis than those with hormone receptor-negative tumors. In addition, patients with hormone receptor-positive tumors will likely benefit from hormonal treatments that target the corresponding hormone receptors, such as tamoxifen (Lakhani, 2012).
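The IHC cut-off described above amounts to a simple threshold rule. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a tumor with exactly 1% stained cells is treated as positive.

```python
def receptor_positive(percent_stained, threshold=1.0):
    """A receptor is called positive if the percentage of stained invasive
    tumor cells reaches the threshold (assumed: exactly 1% counts as positive)."""
    return percent_stained >= threshold


def hormone_receptor_status(er_percent, pr_percent):
    """HR-positive if ER and/or PR is positive; HR-negative only if both are negative."""
    if receptor_positive(er_percent) or receptor_positive(pr_percent):
        return "HR-positive"
    return "HR-negative"


print(hormone_receptor_status(er_percent=0.0, pr_percent=5.0))  # HR-positive
print(hormone_receptor_status(er_percent=0.5, pr_percent=0.0))  # HR-negative
```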

1.3 Genetic Heterogeneity and Breast Cancer Subtyping Schemes

BC inter-tumor heterogeneity has been studied in depth at the molecular level, and treatment concepts now take such heterogeneity into consideration. Many molecular variations lead to breast carcinogenesis, and several classification schemes have evolved to categorize breast tumors. The most well-known is the intrinsic classification based on gene expression analysis, which distinguishes four major molecular subtypes of BC with prognostic and therapeutic implications: luminal A, luminal B, HER2-enriched, and basal-like (Perou et al., 2000; Sørlie et al., 2001, 2003). The intrinsic subtype molecular classification of BC has improved the understanding of BC biology while presenting the possibility of refining the prediction of correct BC treatment regimens. Although the intrinsic subtypes were originally defined by gene expression profiles, the IHC-based surrogate intrinsic subtypes are typically used in clinical practice because of the cost and technical complexity of gene expression profiling assays (Harbeck et al., 2019). The IHC-based surrogate intrinsic subtypes, determined by histological features as well as IHC measurements of routine biomarkers (ER, PR, and HER2) and the proliferation marker Ki-67, are clinically valuable and imply distinct treatment approaches (Senkus et al., 2015; Curigliano et al., 2017). It is noteworthy that, although the IHC-based surrogate intrinsic subtypes overlap with the PAM50 intrinsic subtypes, some discrepancies exist (Lundgren et al., 2019). Other BC classification schemes also use gene expression profiles to stratify patients into transcriptionally distinct subtypes. Sotiriou and colleagues performed unsupervised clustering on the gene expression profiles of breast cancers and identified six subtypes: luminal-like 1, luminal-like 2, luminal-like 3, Her-2/neu, basal-like 1, and basal-like 2 (Sotiriou et al., 2003). Guedj et al. proposed a classification scheme based on gene expression data that identified six BC molecular subgroups using a semi-supervised analysis (Guedj et al., 2012). The six subgroups overlapped with the


intrinsic subtypes since Guedj et al.’s classification method also included luminal A, luminal B, basal-like, and normal-like subtypes. In addition to transcriptomic data, other types of omics data have been used for BC classification. Jönsson et al. proposed a copy number variation (CNV)-based classification scheme to distinguish six genomic subtypes of breast tumors: 17q12, basal-complex, luminal-simple, luminal-complex, amplifier, and mixed subtypes (Jönsson et al., 2010). The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) study proposed a subtyping scheme that combines gene expression and DNA copy number (Curtis et al., 2012). The study first identified genes whose expression across breast tumors was driven by recurrent CNVs (Curtis et al., 2012). These genes were then used to classify the tumors into 10 integrative cluster (IntClust1–10) subtypes with distinct clinical outcomes, a result that was reproduced in the validation cohort (Curtis et al., 2012). More recently, Tofigh et al. proposed a hybrid subtyping scheme for breast tumors that combines the clinical subtypes with the intrinsic subtypes defined by PAM50 (Tofigh et al., 2014). The hybrid scheme partitions ER+ tumors by intrinsic subtype into five subtypes and ER- tumors by HER2 status into two subtypes (Tofigh et al., 2014). Thus, the hybrid scheme stratifies invasive breast tumor samples into seven hybrid subtypes: ER+/luminal A, ER+/luminal B, ER+/HER2-enriched, ER+/normal-like, ER+/basal-like, ER-/HER2+, and ER-/HER2- (Tofigh et al., 2014).
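The partition logic of the hybrid scheme can be written down as a small lookup. This is a sketch of the labeling rule as described in the text, not the authors' implementation; the function name and argument conventions are hypothetical.

```python
INTRINSIC_SUBTYPES = {"luminal A", "luminal B", "HER2-enriched", "normal-like", "basal-like"}


def hybrid_subtype(er_positive, intrinsic_subtype=None, her2_positive=None):
    """Assign one of the seven hybrid labels.

    ER+ tumors are split by intrinsic (PAM50) subtype; ER- tumors by HER2 status.
    """
    if er_positive:
        if intrinsic_subtype not in INTRINSIC_SUBTYPES:
            raise ValueError("ER+ tumors need a valid intrinsic subtype")
        return f"ER+/{intrinsic_subtype}"
    if her2_positive is None:
        raise ValueError("ER- tumors need a HER2 status")
    return "ER-/HER2+" if her2_positive else "ER-/HER2-"


print(hybrid_subtype(True, intrinsic_subtype="luminal B"))  # ER+/luminal B
print(hybrid_subtype(False, her2_positive=False))           # ER-/HER2-
```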

2 Breast Cancer Multiomics Data

2.1 Genomic Level: CNVs

At the genomic level, CNV data is a common data type used for BC classification. CNVs are structural variants in the genome that involve changes in the number of copies of certain DNA regions (Feuk et al., 2006; Shlien & Malkin, 2009). These DNA regions are normally larger than one kilobase and may span many different genes. CNVs can arise by deletions (which can lead to gene copy number loss) or duplications (which can lead to gene copy number gain). In principle, a gene has two copies in the diploid genome of normal human cells, with one copy inherited from each parent. In cancer cells, however, the number of copies of some genes becomes smaller (gene copy number loss) or greater (gene copy number gain) than two, which decreases or increases the expression of these otherwise normal genes, respectively. A gain is considered an amplification when a gene has multiple copies resulting from aberrant DNA duplication (Adkison, 2011). Gene amplification and other CNVs are considered detrimental because they alter gene expression patterns and thus lead to an imbalance in cell growth and differentiation (Gonçalves, 2017). For a long time, microarray-based methods (such as array comparative genomic hybridization and single nucleotide polymorphism genotyping arrays), which are efficient for detecting large CNVs, have been used to detect genome-wide CNVs (Fang


& Wang, 2018). As next-generation sequencing is widely used due to the rapid price decrease, CNV studies are increasingly using sequencing for detecting small or novel CNVs that array-based methods often miss (Shen et al., 2019; Abel & Duncavage, 2013; Li et al., 2015).

2.2 Transcriptomic Level: Gene Expression

Among the multiomics data, transcriptomic data is the most frequently used data type for BC subtyping, including the intrinsic schemes (Perou et al., 2000; Sørlie et al., 2003; Hu et al., 2006; Bernard et al., 2009), the Subtype Classification Model (Haibe-Kains et al., 2012; Wirapati et al., 2008; Desmedt et al., 2008), and Guedj et al.’s classification scheme (Guedj et al., 2012). Cancer is a complex disease with many causes. Arguably one of the most important cancer mechanisms is aberrant gene expression (transcriptome), which is often the result of gene mutations and epigenetic alterations (Vogelstein et al., 2013; Stratton et al., 2009). As the intermediate step between DNA and protein, the transcriptome enables researchers to link cellular phenotypes (proteome) to their molecular mechanisms (genome). In cancer, studying the transcriptome provides a link between the genetic causes of the disease and their phenotypic consequences. The key technique for profiling cancer transcriptomes has switched from microarrays to whole-transcriptome sequencing (i.e., RNA sequencing (RNA-Seq)) in a matter of years (Van den Berge et al., 2019). In RNA-Seq, RNA molecules are first extracted from the samples of interest and then reverse-transcribed into cDNA molecules (Van den Berge et al., 2019). The cDNA library is then sequenced to produce millions of short reads. The resulting reads can be aligned, either by mapping to a reference genome or by de novo assembly (Van den Berge et al., 2019). After the sequence reads have been mapped to the reference genome or transcriptome, the next step is the quantification of transcript-level or gene-level abundances by assigning the mapped reads to specific transcripts or genes (Van den Berge et al., 2019). The results of quantification are combined into an expression matrix in which each row is a feature (gene or transcript), each column is a sample, and the values are actual read counts or estimated abundances.
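The layout of such an expression matrix (features in rows, samples in columns) can be sketched with NumPy. The gene symbols, sample names, and counts below are toy values invented for illustration, and the helper name is hypothetical; genes missing from a sample are assumed to have a count of zero.

```python
import numpy as np


def build_expression_matrix(counts_by_sample):
    """Combine per-sample {gene: read count} dictionaries into a genes x samples matrix."""
    samples = sorted(counts_by_sample)
    genes = sorted({gene for counts in counts_by_sample.values() for gene in counts})
    matrix = np.zeros((len(genes), len(samples)), dtype=np.int64)
    for j, sample in enumerate(samples):
        for i, gene in enumerate(genes):
            # Assumption: a gene absent from a sample has zero assigned reads.
            matrix[i, j] = counts_by_sample[sample].get(gene, 0)
    return genes, samples, matrix


# Toy counts (not real data).
toy_counts = {
    "sample_1": {"ESR1": 120, "ERBB2": 15},
    "sample_2": {"ESR1": 3, "MKI67": 40},
}
genes, samples, matrix = build_expression_matrix(toy_counts)
print(genes)         # row labels
print(samples)       # column labels
print(matrix.shape)  # (number of genes, number of samples)
```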

2.3 Epigenomic Level: DNA Methylation

DNA methylation is an epigenetic modification that plays an important regulatory role in gene expression (Robertson, 2005). In the human genome, promoter regions of many genes contain CpG islands, which are genomic regions that contain a high frequency of CpG sites where a cytosine nucleotide is followed by a guanine nucleotide (Smith & Meissner, 2013). DNA methylation can occur on CpG sites by the addition of a methyl group to the C5 position of cytosines in


CpG dinucleotides to form 5-methylcytosine, a reaction catalyzed by DNA methyltransferases (Bird, 2002). Methylation of multiple sites within CpG islands is called hypermethylation. If hypermethylation occurs in a promoter, it can ultimately result in transcriptional silencing of the gene (Jones, 2012; Domcke et al., 2015; Blattler & Farnham, 2013). In normal cells, only a small number of promoter CpG islands are hypermethylated (Robertson, 2005). However, some cancers can result from aberrant promoter CpG island hypermethylation that leads to inappropriate silencing of genes, especially tumor suppressors (Jones, 2012; Chatterjee & Vinson, 2012). For example, more than 100 genes have been reported to be transcriptionally silenced by aberrant CpG island hypermethylation in BC (Khatri et al., 2012; Jovanovic et al., 2010; Davalos et al., 2017; Basse & Arock, 2015; Pasculli et al., 2018). Bisulfite conversion-based approaches such as the Illumina methylation arrays (HumanMethylation450 and HumanMethylation850 arrays) and whole-genome bisulfite sequencing have been used to detect DNA methylation across the whole genome, with the latter able to assess the methylation state of nearly every CpG site (Gupta et al., 2010; Wang et al., 2018).

3 Tensor-Based Multiomics Integration and Factorization

In the past decade, several mathematical and computational methods have been applied to the multiomics data integration problem. Tensor factorization is one of them (Kolda & Bader, 2009). Instead of simply concatenating different omics data matrices into one large two-dimensional matrix, tensor factorization-based omics data integration methods arrange the different omics data in a three-dimensional tensor, with the new dimension representing the omics data type. This three-dimensional tensor can then be processed by traditional tensor factorization algorithms for latent factor extraction. In this way, data across different biological levels are integrated and mined while the cross-level information is retained.
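The difference between concatenating omics matrices and stacking them into a tensor can be illustrated with NumPy. This sketch assumes that every omics layer has already been summarized to the same samples and the same genes, and that the tensor is laid out as samples × genes × omics types; both are assumptions of the example, and the data are random placeholders.

```python
import numpy as np


def stack_omics(*matrices):
    """Stack equally shaped (samples x genes) matrices along a new third mode."""
    shapes = {m.shape for m in matrices}
    if len(shapes) != 1:
        raise ValueError("all omics matrices must share the same samples and genes")
    return np.stack(matrices, axis=-1)


rng = np.random.default_rng(0)
n_samples, n_genes = 4, 5
cnv = rng.normal(size=(n_samples, n_genes))          # placeholder copy number values
expression = rng.normal(size=(n_samples, n_genes))   # placeholder expression values
methylation = rng.normal(size=(n_samples, n_genes))  # placeholder methylation values

# Concatenation yields a wide 2D matrix and loses the gene-by-omics alignment.
concatenated = np.concatenate([cnv, expression, methylation], axis=1)

# Stacking keeps each gene aligned across omics layers in a third mode.
tensor = stack_omics(cnv, expression, methylation)

print(concatenated.shape)  # (4, 15)
print(tensor.shape)        # (4, 5, 3)
```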

3.1 Tensor

A tensor is defined as a data array that can hold multi-dimensional data (Kolda & Bader, 2009). The dimensionality of a tensor is called its order, and each dimension is referred to as a mode. A lowercase letter ($a$) denotes a scalar, a boldface lowercase letter ($\mathbf{a}$) denotes a vector, a boldface capital letter ($\mathbf{A}$) denotes a matrix, and a boldface Euler script letter ($\mathcal{A}$) denotes a tensor. An element of a tensor is denoted by a lowercase letter with subscripts. For a third-order tensor $\mathcal{X}$, the elements are represented as $x_{ijk}$, where $i$, $j$, $k$ are the indices of each mode. We use $\circ$ to represent the multi-way vector outer product. For example, the outer product of three vectors, $\mathbf{a} \circ \mathbf{b} \circ \mathbf{c}$, is a three-dimensional tensor $\mathcal{X}$, where $x_{ijk} = a_i b_j c_k$.
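As a small numerical illustration of the three-way outer product, the following sketch uses `numpy.einsum`; the vectors are arbitrary toy values.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0, 5.0])
c = np.array([6.0, 7.0])

# Outer product of three vectors: X[i, j, k] = a[i] * b[j] * c[k].
X = np.einsum("i,j,k->ijk", a, b, c)

print(X.shape)     # (2, 3, 2)
print(X[1, 2, 0])  # a[1] * b[2] * c[0] = 2 * 5 * 6 = 60.0
```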


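Suppose we have two tensors $\mathcal{X}$ and $\mathcal{Y}$ of the same size $I_1 \times I_2 \times \cdots \times I_N$. Their Hadamard (elementwise) product is denoted by $\ast$, i.e.,

$$(\mathcal{X} \ast \mathcal{Y})_{i_1 i_2 \ldots i_N} = x_{i_1 i_2 \ldots i_N}\, y_{i_1 i_2 \ldots i_N} \tag{1}$$

for all $i_n \in \{1, \ldots, I_n\}$ and $n \in \{1, \ldots, N\}$. The inner product of $\mathcal{X}$ and $\mathcal{Y}$ is defined as the sum of the products of their elements, i.e.,

$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1 i_2 \ldots i_N}\, y_{i_1 i_2 \ldots i_N} \tag{2}$$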
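Both operations in Eqs. (1) and (2) map directly onto elementwise array arithmetic. The sketch below uses two toy tensors of the same size; the values are arbitrary.

```python
import numpy as np

X = np.arange(24, dtype=float).reshape(2, 3, 4)  # entries 0, 1, ..., 23
Y = np.full((2, 3, 4), 2.0)                      # every entry equals 2

# Eq. (1): Hadamard product, computed entry by entry.
hadamard = X * Y

# Eq. (2): inner product, the sum of all elementwise products.
inner = np.sum(X * Y)

print(hadamard.shape)  # (2, 3, 4)
print(inner)           # 2 * (0 + 1 + ... + 23) = 552.0
```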

If an N-way tensor equals the outer product of N vectors, we define its rank as 1, i.e., the tensor is a rank-one tensor. We define the rank ($R$) of a tensor $\mathcal{X}$ as the minimum number of rank-one tensors required to obtain $\mathcal{X}$ as their sum. For instance, a rank-$R$ 3D tensor $\mathcal{X}$ can be written as $\mathcal{X} = \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r = [[\mathbf{A}, \mathbf{B}, \mathbf{C}]]$. The matrices $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ are the factor matrices, since they collect the vectors from the rank-one components and hold them as columns. Computing the rank of a tensor is known to be a non-deterministic polynomial-time (NP)-hard problem (Håstad, 1990; Hillar & Lim, 2013). Thus, in practice, we cannot know the exact rank of the tensor we investigate.
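The notation $[[\mathbf{A}, \mathbf{B}, \mathbf{C}]]$ can be made concrete by building a tensor from its factor matrices. In this sketch the helper name `cp_to_tensor` is hypothetical and the factor matrices are random placeholders.

```python
import numpy as np


def cp_to_tensor(A, B, C):
    """Return [[A, B, C]]: the sum over r of the outer products a_r ∘ b_r ∘ c_r,
    where a_r, b_r, c_r are the r-th columns of A, B, C."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)


rng = np.random.default_rng(1)
I, J, K, R = 4, 3, 5, 2
A = rng.normal(size=(I, R))
B = rng.normal(size=(J, R))
C = rng.normal(size=(K, R))

X = cp_to_tensor(A, B, C)

# The same tensor, written out explicitly as a sum of R rank-one tensors.
explicit = sum(np.einsum("i,j,k->ijk", A[:, r], B[:, r], C[:, r]) for r in range(R))

print(X.shape)                   # (4, 3, 5)
print(np.allclose(X, explicit))  # True
```

By construction, this tensor has rank at most R. The matrix rank of an unfolding such as `X.reshape(I, -1)` is easy to compute but only gives a lower bound on the tensor rank, which is consistent with the hardness result cited above.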

3.2 Tensor Decomposition Algorithms Tensor decomposition or factorization problem is not new, and it has already been studied for many years. A series of tensor factorization algorithms have been developed such as Tucker decomposition (Hitchcock, 1927), Canonical decomposition (CANDECOM) (Carroll & Chang, 1970), parallel factors (PARAFAC) (Harshman, 1970), and so on. Tucker decomposition could be considered as a higher-order form of principal component analysis (PCA). While CANDECOM and PARAFAC are always referred together as CP (CANDECOM/ PARAFAC) because they both decompose a tensor as a sum of rank-one tensors (Hitchcock, 1927; Kiers, 2000; MÖcks, 1988). These tensor factorization algorithms have been applied in many domains like psychometrics, chemometrics, signal processing, numerical linear algebra, computer vision, numerical analysis, data mining, neuroscience, graph analysis, and so on (Kolda & Bader, 2009; Carroll & Chang, 1970; Harshman, 1970). CP algorithm is the most popular rank decomposition approach. For a particular 3D tensor, the CP algorithm is to optimize: .

$$\min_{\hat{\mathcal{X}}} \; \| \mathcal{X} - \hat{\mathcal{X}} \| \qquad (3)$$

where $\hat{\mathcal{X}} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r = [[A, B, C]]$. It can be treated as optimizing the objective error function below:

$$f(A, B, C) = \frac{1}{2} \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \left( x_{ijk} - \sum_{r=1}^{R} a_{ir} b_{jr} c_{kr} \right)^2 \qquad (4)$$
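To make Eq. (4) concrete, the sketch below evaluates the objective and minimizes it with alternating least squares (ALS), a common way to fit CP. The chapter does not state which solver it uses, so ALS here is an assumption, as are the function names `cp_objective` and `cp_als`, the iteration count, and the synthetic tensor.

```python
import numpy as np

def cp_objective(X, A, B, C):
    """f(A, B, C) from Eq. (4): half the squared reconstruction error."""
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    return 0.5 * np.sum((X - X_hat) ** 2)

def cp_als(X, rank, n_iter=200, seed=0):
    """Fit a CP model by updating one factor matrix at a time."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.normal(size=(I, rank))
    B = rng.normal(size=(J, rank))
    C = rng.normal(size=(K, rank))
    for _ in range(n_iter):
        # Each update is a linear least-squares solve with the other two factors fixed.
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Synthetic tensor built from two components, so its rank is at most 2.
rng = np.random.default_rng(1)
A0 = rng.normal(size=(5, 2))
B0 = rng.normal(size=(4, 2))
C0 = rng.normal(size=(3, 2))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)

A, B, C = cp_als(X, rank=2)
final_error = cp_objective(X, A, B, C)
```

Note that `rank` must be supplied by the caller, which is the hyperparameter problem discussed in the text: the solver keeps exactly as many components as it is given.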

In Eq. (4), R is the rank of the 3D tensor $\hat{\mathcal{X}}$, while I, J, and K are the numbers of rows in matrices A, B, and C, respectively. One problem of CP-based tensor factorization is that it needs a predefined hyperparameter, namely the rank of the tensor. However, as mentioned previously, there is no straightforward algorithm to determine the rank of a given tensor; this is an NP-hard problem (Håstad, 1990; Hillar & Lim, 2013).

Recently, a Bayesian tensor factorization (BTF) model was proposed to overcome this rank-determination difficulty (Tang et al., 2018). BTF combines Bayesian inference with the CP algorithm and determines the rank of a tensor in three steps. The first step decomposes a given tensor into latent factors using a multi-linear model (the CP algorithm). The second step estimates the posterior distribution of the decomposed latent factors using a high-dimensional variational Bayesian inference model. The last step is a filtering procedure that removes the redundant latent factors (Tang et al., 2018).

The objective of CP, as mentioned above, is to decompose a tensor into a sum of R rank-one tensors. The rank R is the exact number of latent factors we can extract. If R is too small, too much information might be lost during tensor factorization, whereas if R is too large, redundant factors might be generated. To overcome this difficulty, Bayesian inference was introduced to CP to effectively determine a proper rank for a given tensor. The following formula describes the CP-based BTF used in this study:

$$\hat{\mathcal{X}} = \mathcal{X}_{\text{true}} + \mathcal{Y} \qquad (5)$$

where $\mathcal{X}_{\text{true}} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r = [[A, B, C]]$, and $\hat{\mathcal{X}}$ is assumed to be composed of the true tensor $\mathcal{X}_{\text{true}}$ and the noise tensor $\mathcal{Y}$.

Since prior knowledge of $\hat{\mathcal{X}}$ is lacking, non-informative priors are used: $a_r$, $b_r$, $c_r$, and the elements of $\mathcal{Y}$ are assumed to follow i.i.d. Gaussian distributions. Under these assumptions, the conditional probability and the joint distribution of the model can be derived. Variational Bayesian inference (Zhao et al., 2015) is then incorporated to iteratively deduce the posterior distribution of the latent factor matrices and the hyperparameters using the prior distribution and the observed values in $\hat{\mathcal{X}}$. Posterior factor vectors with small values are considered redundant and are thus excluded. In this way, BTF attempts to iteratively optimize the rank when an initial rank is given, while traditional CP takes the given initial rank as the final rank (Harshman, 1970; Tang et al., 2018; Zhao et al., 2015; Xiong et al.,


2010). The algorithmic details of the variational Bayesian inference technique can be found in other studies (Tang et al., 2018; Zhao et al., 2015).

The tensorBF package (Khan & Ammaduddin, 2016) was the first package to implement BTF in the R programming language. tensorBF introduces a sparsity parameter to remove redundant factors so that it can achieve automatic rank optimization. The noiseProb argument of the tensorBF() function specifies the proportion of variance that is expected to be explained by the extracted latent factors.
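The filtering idea behind automatic rank determination can be illustrated with a deliberately simplified Python sketch. It is not variational Bayesian inference and it is not the tensorBF implementation: it only shows how components with negligible magnitude could be dropped after a decomposition. The function name `prune_factors`, the relative threshold `rel_tol`, and the toy factor matrices are all hypothetical.

```python
import numpy as np

def prune_factors(A, B, C, rel_tol=1e-3):
    """Drop components whose weight is negligible relative to the largest one."""
    # Weight of component r: product of the norms of the r-th columns.
    weights = (np.linalg.norm(A, axis=0)
               * np.linalg.norm(B, axis=0)
               * np.linalg.norm(C, axis=0))
    keep = weights > rel_tol * weights.max()
    return A[:, keep], B[:, keep], C[:, keep], keep

# Toy factors with an initial rank of 3; the third component is almost zero.
A = np.ones((4, 3))
A[:, 2] *= 1e-6
B = np.ones((3, 3))
C = np.ones((2, 3))

A_kept, B_kept, C_kept, keep = prune_factors(A, B, C)
effective_rank = int(keep.sum())
```

In a Bayesian treatment the shrinkage of redundant components comes from the priors and the posterior updates, so a hard threshold like the one above is only a stand-in for the final filtering step.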

4 Applications

After integration and tensor factorization-based latent feature extraction, the patient-directional latent feature matrix (patients by features) can be further applied to solve clinical problems such as subtyping and survival prediction, while the gene-directional latent feature matrix (genes by features) can be used to understand the molecular mechanisms of cancer, for example by providing biological explanations for each latent feature using gene set enrichment analysis (GSEA) (Subramanian et al., 2005). We discuss the application of our tensor factorization-based multiomics data integration in three aspects: sample subtyping, survival prediction, and GSEA.

4.1 Breast Cancer Subtyping

To identify BC intrinsic subtypes using the extracted patient-directional latent feature matrix, a subtyping model is needed. Many matrix-based cancer subtyping methods have been published, such as Consensus Clustering (CC) (Monti et al., 2003), non-negative matrix factorization (NMF) (Lee & Seung, 1999), Consensus NMF (CNMF) (Brunet et al., 2004), and so on. CC is a resampling-based algorithm. It aims to obtain robust data clusters according to the consensus among several clustering runs on data subsets (Monti et al., 2003; Wilkerson & Hayes, 2010). NMF is similar to tensor factorization because a matrix can be considered a two-dimensional tensor, and CNMF incorporates CC into NMF to increase the robustness of the clustering. Because CNMF and BTF are both factorization methods, using CNMF as the downstream subtyping method after BTF keeps the entire workflow of multiomics-based cancer subtyping under a uniform set of theoretical assumptions.

There are many other multiomics integration-based subtyping methods that do not involve tensor factorization. These methods have been evaluated before and can act as baselines for our method. A recently published comprehensive review compared 10 representative multiomics data integration-based cancer subtyping methods (Duan et al., 2021). The authors concluded that Similarity Network Fusion (SNF) (Wang et al., 2014) performed very well in terms of accuracy, robustness, and


computational cost. SNF is a classic multiomics data integration-based subtyping method. It constructs a similarity network for each data type and then fuses them into one, which can capture both shared and type-specific information from different types of omics data and makes the integrated similarity network more informative and less noisy.

Here, we show an example of a tensor factorization-based subtyping strategy for BC that integrates multiomics data using BTF and stratifies patients using CNMF. In this framework, we use multiomics breast cancer data, including copy number variation ($\tilde{C}$), gene expression ($\tilde{G}$), and DNA methylation ($\tilde{M}$). We first select the top 10% of the 17,627 genes for each data type according to the coefficient of variation (CV). We then take the union of the genes selected from the three data types, giving a total of 4515 genes in the final analysis. Finally, we build a three-dimensional tensor ($\tilde{T}$) from the omics data $\tilde{C}$, $\tilde{G}$, and $\tilde{M}$. BTF is then performed on this three-dimensional tensor for latent factor decomposition. The decomposed factors, which contain both the within-omics and the interaction information of the multiomics data, are passed on to CC-wrapped K-means clustering to determine the number of subtypes. Finally, CNMF is applied to assign a subtype label to each patient. The overall workflow is shown in Fig. 2a and b. The subtyping method is evaluated on a real BC dataset provided by The Cancer Genome Atlas (TCGA) platform (Liu et al., 2018) (Fig. 2c). Compared to SNF, the BTF-CNMF method performed better in terms of survival significance (the BTF-CNMF method achieved a lower p-value in the log-rank test than the SNF method).
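The feature selection and tensor construction steps described above can be sketched as follows. This is a minimal illustration on random stand-in data, not the authors' pipeline: the function name `top_cv_genes`, the toy matrix sizes, the simulated distributions, and the mode ordering (patients by genes by omics type) are all assumptions.

```python
import numpy as np

def top_cv_genes(data, frac=0.10):
    """Indices of the top `frac` genes ranked by coefficient of variation."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    cv = std / np.maximum(np.abs(mean), 1e-12)  # guard against zero means
    n_top = max(1, int(round(frac * data.shape[1])))
    return set(np.argsort(cv)[::-1][:n_top].tolist())

rng = np.random.default_rng(0)
n_patients, n_genes = 20, 50   # toy sizes (the chapter uses 17,627 genes)

# Hypothetical stand-ins for the three matched omics matrices (patients x genes).
omics = {
    "copy_number": rng.gamma(shape=2.0, scale=1.0, size=(n_patients, n_genes)),
    "expression": rng.gamma(shape=2.0, scale=2.0, size=(n_patients, n_genes)),
    "methylation": rng.beta(a=2.0, b=5.0, size=(n_patients, n_genes)),
}

# Union of the genes selected separately within each data type.
selected = sorted(set().union(*(top_cv_genes(m) for m in omics.values())))

# Stack the matched matrices along a third mode: patients x genes x omics type.
tensor = np.stack([m[:, selected] for m in omics.values()], axis=-1)
```

Because the union is taken after selecting within each data type, the final gene count lies between the per-type count and three times that count, which is consistent with the 4515 genes reported from three selections of roughly 1763 genes each.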

Fig. 2 Tensor factorization in BC subtyping. (a) After feature selection and sample matching, a three-dimensional tensor ($\tilde{T}$) is constructed using different types of omics data ($\tilde{C}$, $\tilde{G}$, and $\tilde{M}$). (b) Bayesian tensor factorization is performed on this three-dimensional tensor for latent factor decomposition. The decomposed patient-directional factor matrix is then used for subtyping. (c) The performance of the BTF-CNMF subtyping method and the baseline SNF subtyping method on a real BC dataset (TCGA-BRCA). The six subtypes identified by BTF-CNMF show a significant survival difference, while the six subtypes identified by the baseline SNF show no statistically significant survival difference

4.2 Survival Prediction

The integrated multiomics information can also be used to estimate the time to event occurrence. Currently, most existing survival prediction methods deal with a single type of data (e.g., clinical data, genetic data, or methylation data). One existing integration approach, when multiple types of data are available, is to build a model for each data type and synthesize the results for the final prediction (ER, 2012). Another possible approach is to first concatenate the different types of data and then pass the result on to the next modelling stage (Ritchie et al., 2015). Neither approach is effective in revealing complex patterns and retaining interactions from combinations of multiple types of data. Therefore, the tensor, as a new data structure, has the potential to effectively integrate multiomics data for patient survival prediction.

Similar to its application in subtyping, tensor factorization in survival prediction also needs a predictive model to achieve downstream survival prediction. Here we show an example of a model that combines CP-based tensor decomposition with a DeepSurv (Katzman et al., 2018) survival function, which is based on the Cox proportional hazards (CoxPH) regression model, to predict time-to-event outcomes for BC patients (Fig. 3). We executed the model on TCGA-BRCA gene expression, copy number variation, and DNA methylation data. We also repeated the modelling 20 times and reported the averaged performance with 95% confidence intervals. Compared with simple concatenation, the CP tensor factorization-based data integration performed better, achieving a higher concordance index (C-index). The C-index is a commonly used survival prediction statistic that measures the predictive ability of a survival model based on the predicted risk scores.
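For readers unfamiliar with the C-index, the following sketch computes Harrell's version of it directly from its definition. The function name `concordance_index` and the four-patient toy data are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index: share of comparable pairs ordered correctly by risk."""
    time, event, risk = map(np.asarray, (time, event, risk))
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(n):
            if time[i] < time[j]:   # pairs with tied times are skipped in this sketch
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # higher risk failed earlier: concordant
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: follow-up times, event indicators (1 = event, 0 = censored), risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
c_reversed = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
c_constant = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 0.5 corresponds to uninformative risk scores and 1.0 to a perfect ordering, which gives a scale for reading reported C-index values.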

Fig. 3 Tensor factorization in survival prediction. (a) The CP-based tensor factorization, deep learning, and survival model. (b) The performance of the CP tensor factorization-based CoxPH survival model and baseline models on the TCGA-BRCA dataset

[Panel A: CP tensor factorization (CP algorithm); image not reproduced]

Panel B: Survival prediction performance on the TCGA-BRCA dataset

| Data | Survival function | C-index | 95% CI |
|---|---|---|---|
| Concatenation | CoxPH | 0.582 | (0.537, 0.627) |
| Tensor factorization based integration | CoxPH | 0.672 | (0.636, 0.708) |

4.3 Gene Set Enrichment Analysis

Both subtyping and survival prediction use only the patient-directional latent matrix (patients by rank). To fully utilize the factorized information from the constructed multiomics tensor, the resulting gene-directional latent matrix (genes by rank) can be used to identify the key biological pathways/functions for each factor. These key biological pathways can be inferred by GSEA (Subramanian et al., 2005). Such results provide a biological explanation/annotation for the extracted multiomics tensor factors and thereby help characterize heterogeneity. Here we show some examples of the annotated key biological functions for the 17 multiomics factors extracted from the TCGA-BRCA multiomics data using BTF (Fig. 2b). Since the gene-directional latent matrix produced by the BTF algorithm has genes in rows and factors in columns, each column is in effect a gene list with an importance score (an element of the gene-directional latent matrix) for each gene. This importance score measures the contribution of a given gene to a given multiomics factor. Thus, if we pre-rank the genes of a multiomics factor according to their importance scores and use this pre-ranked gene list for GSEA, we obtain the enriched key biological pathways that can be used to annotate this multiomics factor. In our case (Table 1), the enriched key pathways for our BTF-based multiomics factors involve cell cycle, signaling, metabolism, immune cell related functions, and so on. They are all related to the well-established cancer hallmarks proposed by Hanahan and Weinberg (2011).
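The pre-ranking step can be sketched as below. The gene names, the scores, and the function name `preranked_gene_list` are hypothetical, and the sketch stops at producing the ranked list; passing it to a pre-ranked GSEA tool is left out because the chapter does not name the software used.

```python
import numpy as np

def preranked_gene_list(gene_factors, gene_names, factor):
    """Genes sorted by their importance score for one factor, highest first."""
    scores = gene_factors[:, factor]
    order = np.argsort(scores)[::-1]
    return [(gene_names[i], float(scores[i])) for i in order]

# Hypothetical gene-directional latent matrix: 4 genes (rows) x 2 factors (columns).
gene_names = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
gene_factors = np.array([
    [0.2, -1.0],
    [1.5, 0.3],
    [-0.7, 0.9],
    [0.1, 0.0],
])

ranked = preranked_gene_list(gene_factors, gene_names, factor=0)
ranked_names = [name for name, _ in ranked]
```

Repeating this for every column yields one ranked list per factor, which matches the one-pathway-per-factor layout of Table 1.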


Table 1 Gene set enrichment analysis results for each BTF-based multi-omics factor

| BTF | Key pathways | NES^a | p-value | FDR^b |
|---|---|---|---|---|
| 1 | Chemokine signaling pathway | 1.57 | 0.0059 | 0.25 |
| 2 | Cytokine receptor interaction | 1.96 | < 0.001 | < 0.001 |
| 3 | Huntington's disease | −2 | < 0.001 | 0.0029 |
| 4 | Natural killer cell mediated cytotoxicity | −2 | < 0.001 | 0.0024 |
| 5 | Hematopoietic cell lineage | 2.13 | < 0.001 | < 0.001 |
| 6 | Starch and sucrose metabolism | −1.93 | 0.0029 | 0.03 |
| 7 | Cell cycle | −1.83 | 0.0047 | 0.08 |
| 8 | Retinol metabolism | 1.76 | 0.0063 | 0.16 |
| 9 | Steroid hormone biosynthesis | −1.65 | 0.0011 | 0.05 |
| 10 | Leukocyte trans endothelial migration | 1.92 | < 0.001 | 0.04 |
| 11 | VEGF signaling pathway | 1.63 | 0.02 | 0.22 |
| 12 | Olfactory transduction | −1.76 | < 0.001 | 0.10 |
| 13 | Oocyte meiosis | 1.46 | 0.05 | 0.39 |
| 14 | Drug metabolism cytochrome | 2.32 | < 0.001 | < 0.001 |
| 15 | Endocytosis | 1.81 | 0.0048 | 0.09 |
| 16 | Antigen processing and presentation | 1.47 | < 0.001 | 0.0097 |
| 17 | Metabolism of Xenobiotics by cytochrome | 2.16 | < 0.001 | 0.0010 |

a Normalized enrichment score
b False discovery rate

5 Conclusions

In this chapter, we discussed the heterogeneity of BC and current approaches to characterizing it. We emphasized the potential of multiomics technologies for understanding this complex issue. Specifically, we are interested in integrating multiomics data into a 3D tensor and then mining useful information through tensor factorization. We explained and discussed current tensor factorization algorithms such as CP and BTF. We also illustrated some of their applications in research areas such as patient subtyping, patient survival prediction, and gene set enrichment analysis-based annotation of key biological functions.

Cancer subtyping is an important clinical question, since it can guide how clinicians make suitable treatment plans for each patient subgroup. Although many methods have been proposed to address this issue, recent studies have shown that integrating multiomics profiles is a promising direction for stratifying cancers. We have proposed and tested the BTF-CNMF method for breast cancer subtyping. We believe this method can also be applied to other cancer types in the future.

Acknowledgments This work was supported in part by the Natural Sciences and Engineering Research Council of Canada and the University of Manitoba. P.H. is the holder of the Manitoba Medical Services Foundation (MMSF) Allen Rouse Basic Science Career Development Research Award. The results shown here are in part based upon data generated by The Cancer Genome Atlas (TCGA) platform (https://www.cancer.gov/tcga).


References Abel, H. J., & Duncavage, E. J. (2013). Detection of structural DNA variation from next generation sequencing data: A review of informatic approaches. Cancer Genet, 206, 432–440. https://doi. org/10.1016/j.cancergen.2013.11.002 Adkison, L. R. (2011). Elsevier’s integrated review genetics e-book: With STUDENT CONSULT online access. Elsevier Health Sciences. Basse, C., & Arock, M. (2015). The increasing roles of epigenetics in breast cancer: Implications for pathogenicity, biomarkers, prevention and treatment. International Journal of Cancer, 137, 2785–2794. Beca, F., & Polyak, K. (2016). Intratumor heterogeneity in breast cancer. In Novel biomarkers in the continuum of breast cancer (pp. 169–189). Springer. Bernard, P. S., Parker, J. S., Mullins, M., et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology, 27, 1160–1167. https://doi.org/10. 1200/JCO.2008.18.1370 Bird, A. (2002). DNA methylation patterns and epigenetic memory. Genes Development, 16, 6–21. Blattler, A., & Farnham, P. J. (2013). Cross-talk between site-specific transcription factors and DNA methylation states. Journal of Biological Chemistry, 288, 34287–34294. Brierley, J. D., Gospodarowicz, M. K., & Wittekind C. (2017). TNM classification of malignant tumours. John Wiley and Sons. Brunet, J. P., Tamayo, P., Golub, T. R., et al. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy Sciences of U S A, 101, 4164– 4169. Carroll, J. D., & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of ‘Eckart-Young’ decomposition. Psychometrika, 35, 283–319. Chatterjee, R., & Vinson C. (2012). CpG methylation recruits sequence specific transcription factors essential for tissue specific gene expression. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms, 1819, 763–770. Curigliano, G., Burstein, H. J., Winer, E. 
P., et al. (2017). De-escalating and escalating treatments for early-stage breast cancer: The St. Gallen International Expert Consensus Conference on the Primary Therapy of Early Breast Cancer 2017. Annals of Oncology, 28, 1700–1712. https://doi. org/10.1093/annonc/mdx308 Curtis, C., Shah, S. P., Chin S.-F., et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486, 346. Davalos, V., Martinez-Cardus, A., & Esteller, M. (2017). The epigenomic revolution in breast cancer: From single-gene to genome-wide next-generation approaches. American Journal of Pathology, 187, 2163–2174. https://doi.org/10.1016/j.ajpath.2017.07.002 Davidson, T. M., Rendi, M. H., Frederick, P. D., et al. (2019). Breast cancer prognostic factors in the digital era: Comparison of Nottingham grade using whole slide images and glass slides. Journal of Pathology Informatics, 10, 11. Desmedt, C., Haibe-Kains, B., Wirapati, P., et al. (2008). Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes. Clinical Cancer Research, 14, 5158–5165. https://doi.org/10.1158/1078-0432.CCR-07-4756 Domcke, S., Bardet, A. F., Ginno, P. A., et al. (2015). Competition between DNA methylation and transcription factors determines binding of NRF1. Nature, 528, 575. Duan, R., Gao, L., Gao, Y., et al. (2021). Evaluation and comparison of multi-omics data integration methods for cancer subtyping. PLoS Computational Biology, 17, e1009224. Elston, C. W., & Ellis, I. O. (1991). Pathological prognostic factors in breast cancer. I. The value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology, 19, 403–410. AUTHOR COMMENTARY. Histopathology 41:151. ER, H., & MD, R. (2012). Integrating heterogeneous high-throughput data for meta-dimensional pharmacogenomics and disease-related studies. Pharmacogenomics, 13, 213–222.


Fang, L., & Wang K. (2018). Identification of copy number variants from SNP arrays using PennCNV. In Copy Number Variants (pp. 1–28). Springer. Feuk, L., Carson, A. R., & Scherer S. W. (2006). Structural variation in the human genome. Nature Reviews Genetics, 7, 85. Giuliano, A. E., Connolly, J. L., Edge, S. B., et al. (2017). Breast cancer—major changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA: A Cancer Journal for Clinicians, 67, 290–303. Gonçalves, E., Fragoulis, A., Garcia-Alonso, L., et al. (2017). Widespread post-transcriptional attenuation of genomic copy-number variation in cancer. Cell Systems, 5, 386–398. Guedj, M., Marisa, L., De Reynies, A., et al. (2012). A refined molecular taxonomy of breast cancer. Oncogene, 31, 1196–206. https://doi.org/10.1038/onc.2011.301 Gupta, R., Nagarajan, A., & Wajapeyee, N. (2010). Advances in genome-wide DNA methylation analysis. Biotechniques, 49, iii–xi. Haibe-Kains, B., Desmedt, C., Loi, S., et al. (2012). A three-gene model to robustly identify breast cancer molecular subtypes. Journal of the National Cancer Institute, 104, 311–325. https://doi. org/10.1093/jnci/djr545 Hammond, M. E. H., Hayes, D. F., Dowsett, M., et al. (2010). American Society of Clinical Oncology/College of American Pathologists guideline recommendations for immunohistochemical testing of estrogen and progesterone receptors in breast cancer. Journal of Clinical Oncology, 28, 2784–2795. https://doi.org/10.1200/JCO.2009.25.6529 Hanahan, D., & Weinberg, R. A. (2011). Hallmarks of cancer: The next generation. Cell, 144, 646–674. Harbeck, N., Penault-Llorca, F., Cortes, J., et al. (2019). Breast cancer. Nature Reviews Disease Primers. https://doi.org/10.1038/s41572-019-0111-2 Harshman R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis. In UCLA Work Pap Phonetics (Vol. 16, pp. 1–84). Håstad, J. (1990). Tensor rank is NP-complete. 
Journal of Algorithms, 11, 644–654. Hillar, C. J., & Lim L. H. (2013). Most tensor problems are NP-Hard. Journal of the ACM, 60, 1–39. Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6, 164–189. Hu, Z., Fan, C., Oh, D. S., et al. (2006). The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics, 7, 96. https://doi.org/10.1186/1471-2164-7-96 Jones, P. A. (2012). Functions of DNA methylation: Islands, start sites, gene bodies and beyond. Nature Reviews Genetics, 13, 484. Jönsson, G., Staaf, J., Vallon-Christersson, J., et al. (2010). Genomic subtypes of breast cancer identified by array-comparative genomic hybridization display distinct molecular and clinical characteristics. Breast Cancer Research, 12. https://doi.org/10.1186/bcr2596 Jovanovic, J., Rønneberg, J. A., & Tost, J., et al. (2010). The epigenetics of breast cancer. Molecular Oncology, 4, 242–254. Katzman, J. L., Shaham, U., Cloninger, A., Bates, J., Jiang, T., & Kluger, Y. (2018). DeepSurv: Personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology, 18(1), 1–12. https://doi.org/10.1186/s12874018-0482-1 Khan, S., & Ammaduddin, M. (2016). tensorBF: An R package for Bayesian tensor factorization. bioRxiv, 097048. https://www.biorxiv.org/content/10.1101/097048v2.abstract Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology, 8, e1002375. Kiers, H. A. L. (2000). Towards a standardized notation and terminology in multiway analysis. Journal of Chemometrics: A Journal of the Chemometrics Society, 14, 105–122. Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51, 455–500. Lakhani, S. R. (2012). WHO classification of tumours of the breast. International Agency for Research on Cancer.


Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791. Li, W., Xia, Y., Wang, C., et al. (2015) Identifying human genome-wide CNV, LOH and UPD by targeted sequencing of selected regions. PLoS One, 10, 1–18. https://doi.org/10.1371/journal. pone.0123081 Liu, J., Lichtenberg, T. M., Hoadley, K. A., et al. (2018). An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell, 173, 400–416.e11. Lundgren, C., Bendahl, P. O., Borg, Å., et al. (2019). Agreement between molecular subtyping and surrogate subtype classification: a contemporary population-based study of ER-positive/HER2negative primary breast cancer. Breast Cancer Research and Treatment, 178, 459–467. https:// doi.org/10.1007/s10549-019-05378-7 Malhotra, G. K., Zhao, X., Band, H., et al. (2010). Histological, molecular and functional subtypes of breast cancers. Cancer Biology & Therapy, 10, 955–960. Möcks, J. (1988). Topographic components model for event-related potentials and some biophysical considerations. IEEE Transactions on Biomedical Engineering, 35, 482–484. Monti, S., Tamayo, P., Mesirov, J. et al. (2003). Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52, 91–118. https://doi.org/10.1023/A:1023949509487 Pasculli, B., Barbano, R., Parrella P. (2018) Epigenetics of breast cancer: biology and clinical implication in the era of precision medicine. In: Seminars in cancer biology. Elsevier 22–35. Perou, C. M., Sørlie, T., Eisen, M. B., et al. (2000). Molecular portraits of human breast tumours. Nature, 406, 747. Polyak, K. (2011). Heterogeneity in breast cancer. The Journal of Clinical Investigation, 121, 3786–3788. https://doi.org/10.1172/JCI60534.3786 Ritchie, M. D., Holzinger, E. R., Li, R., et al. (2015). Methods of integrating data to uncover genotype-phenotype interactions. 
Nature Review Genetics, 16, 85–97. Robertson, K. D. (2005). DNA methylation and human disease. Nature Reviews Genetics, 6, 597. Senkus, E., Kyriakides, S., Ohno, S., et al. (2015). Primary breast cancer: ESMO Clinical Practice Guidelines for diagnosis, treatment and follow-up. Annals of Oncology, 26, v8–v30. Shen, W., Szankasi, P., Durtschi, J., et al. (2019). Genome-wide copy number variation detection using NGS: Data analysis and interpretation. In Tumor Profiling (pp. 113–124). Springer. Shlien, A., & Malkin D. (2009). Copy number variations and cancer. Genome Medicine, 1, 62. Smith, Z. D., & Meissner A. (2013). DNA methylation: roles in mammalian development. Nature Reviews Genetics, 14, 204. Sørlie, T., Perou, C. M., Tibshirani, R., et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98, 10869–10874. Sørlie, T., Tibshirani, R., Parker, J., et al. (2003). Repeated observation of breast tumor subtypes in independent gene expression data sets. Proceedings of the National Academy of Sciences, 100, 8418–8423. Sotiriou, C., Neo, S. Y., McShane, L. M., et al. (2003). Breast cancer classification and prognosis based on gene expression profiles from a population-based study. Proceedings of the National Academy of Sciences U S A, 100, 10393–10398. https://doi.org/10.1073/pnas.1732912100 Stratton, M., Campbell, P., & Futreal A. (2009). The cancer genome. Nature, 458, 719–724. https:// doi.org/10.1038/nature07943 Subramanian, A., Tamayo, P., Mootha, V. K., et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy Sciences of U S A, 102, 15545–15550. https://doi.org/10.1073/pnas. 0506580102 Tang, Y., Chen, D., Wang, L., et al. (2018). Bayesian tensor factorization for multi-way analysis of multi-dimensional EEG. Neurocomputing, 318, 162–174. 
Tofigh, A., Suderman, M., Paquet, E. R., et al. (2014). The prognostic ease and difficulty of invasive breast carcinoma. Cell Reports, 9, 129–142. https://doi.org/10.1016/j.celrep.2014.08.073


Turashvili, G., & Brogi, E. (2017). Tumor heterogeneity in breast cancer. Frontiers in Medicine, 4. https://doi.org/10.3389/fmed.2017.00227 Van den Berge, K., Hembach, K. M., Soneson, C., et al. (2019). RNA sequencing data: Hitchhiker’s guide to expression analysis. Annual Review of Biomedical Data Science, 2, 139–173. https:// doi.org/10.1146/annurev-biodatasci-072018-021255 Vogelstein, B., Papadopoulos, N., Velculescu, V. E., et al. (2013). Cancer genome landscapes. Science, 339, 1546–1558. https://doi.org/10.1126/science.1235122 Wang, B., Mezlini, A. M., Demir, F., et al. (2014). Similarity network fusion for aggregating data types on a genomic scale. Nature Methods, 11, 333–337. Wang, Z., Wu, X., & Wang Y. (2018). A framework for analyzing DNA methylation data from Illumina Infinium HumanMethylation450 BeadChip. BMC Bioinformatics, 19, 115. Wilkerson, M. D., & Hayes, D. N. (2010). ConsensusClusterPlus: A class discovery tool with confidence assessments and item tracking. Bioinformatics, 26, 1572–1573. Wirapati, P., Sotiriou, C., Kunkel, S., et al. (2008). Meta-analysis of gene expression profiles in breast cancer: Toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research, 10, 1–11. https://doi.org/10.1186/bcr2124 Xiong, L., Chen, X., Huang, T. K., et al. (2010). Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proc 10th SIAM Int Conf Data Mining, SDM (pp. 211– 222). Zhao, Q., Zhang, L., & Cichocki, A. (2015). Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37, 1751–1763.

Multi-Omics Databases

Hania AlOmari, Abedalrhman Alkhateeb, and Bassam Hammo

Acronyms

AIDS: Acquired Immune Deficiency Syndrome
ATAC-seq: Transposase-Accessible Chromatin Using Sequencing
CAD: Coronary Artery Disease
CCLE: Cancer Cell Line Encyclopedia
ChIP-seq: Chromatin Immunoprecipitation Sequencing
CLIP-seq: Cross-Linking and Immunoprecipitation Sequencing
CNA: Copy Number Alteration
CNV: Copy Number Variation
DNA: Deoxyribonucleic Acid
IBDMDB: Inflammatory Bowel Disease Multi-Omics Database
ICGC: International Cancer Genomics Consortium
JMORP: Japanese Multi-Omics Reference Panel
lncRNAs: Long Non-Coding Ribonucleic Acids
MBD-seq: Methyl-Binding Domain Sequencing
MeDIP-seq: Methylated DNA Immunoprecipitation Sequencing
METABRIC: Molecular Taxonomy of Breast Cancer International Consortium
miRNA: Micro Ribonucleic Acid
MOPED: Multi-Omics Profiling Expression Database
Omics DI: Omics Discovery Index
RNA: Ribonucleic Acid
RPPA: Reverse Phase Protein Array
scRNA-seq: Single-Cell Ribonucleic Acid Sequencing
SNPs: Single-Nucleotide Polymorphism Information
TARGET: Therapeutically Applicable Research to Generate Effective Treatments
TCGA: The Cancer Genome Atlas
TMM: Tohoku Medical Megabank

H. AlOmari · A. Alkhateeb · B. Hammo, Princess Sumaya University for Technology, Amman, Jordan
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024. A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7_9

1 Introduction

In the last decade, advancements in next-generation sequencing technology have enabled the analysis and examination of billions of DNA and RNA templates and have yielded multi-omics data for many health issues and disorders (Alkhateeb et al., 2021; Levy & Myers, 2016). The most prominent types of omics technology are genomics, epigenomics, transcriptomics, proteomics, metabolomics, and microbiomics (Hasin et al., 2017). Research conducted using these data has helped in understanding the mutations in genes and the changes in cellular processes associated with chronic or complex diseases such as various types of cancer. Understanding such changes leads to better diagnosis of diseases, preventive measures, and effective treatments.

The promising outcomes of using multi-omics data in medicine and biology have drawn scientists' attention to multi-omics analysis. Consequently, these efforts have advanced the ability to integrate multi-omics data across different layers and types of omics with other clinical and environmental data for better insights and hypothesis generation. However, multi-omics analysis depends on the availability of accessible and organized databases that facilitate scientific research and experimentation with multi-omics data on specific research topics (Hasin et al., 2017). The interactive relation of multi-omics datasets makes it particularly challenging to incorporate different biological layers to discover coherent biological signatures and predict phenotypic outcomes. Therefore, because of the emergence of multi-omics analysis and increased interest in bioinformatics, there have been attempts to gather multi-omics data from online databases and platforms, from peer-reviewed publications, and from other resources, such as the data generated by biobanks. These databases enable standardization, presentation, and integration of multi-omics data of similar interest.

This chapter covers multi-omics databases dedicated to one or more types of omics that address specific health disorders or issues. It also covers general databases that enable researchers and scientists to explore and extract knowledge from multi-omics data that could lead to personalized medication or other benefits in biology.

Multi-Omics Databases


2 Literature Review

Subramanian et al. (2020) studied the available tools for integrating multi-omics data, which are intended to provide holistic insight and better analysis of data related to biomolecules and their functions. The study covered tool features such as visualization, available use cases, the methodologies followed, the multi-omics repositories, and the current challenges of integrating multi-omics data. The authors summarized a set of integration tools and the integration approaches that these tools follow. Furthermore, the tools were organized by the biological questions that researchers focus on: understanding disease biology, classification, and prediction of disease biomarkers. The integration approaches included similarity, correlation, network, fusion, Bayesian, and multivariate approaches, under which many tools such as SNF, iCluster, NetICS, MOFA, PARADIGM, PFA, MFA, mixOmics, and CNAMet were classified.

Rigden and Fernández (2022) summarized the multi-omics databases published or updated in 2022, drawing on the 185 papers of the 2022 Nucleic Acids Research database issue. The databases covered multi-omics related to COVID-19 (such as COVID19db, SCovid, ESC), plants (such as qPTMplants, PCMDB, and plantGSAD), animals (such as Animal-eRNAdb, AMDB, VThunter), humans (such as CircleBase, circMine, GRAND, huARdb), drugs (such as DDinter, CeDR, NPCDR), and diseases (such as BrainBase, CancerSCEM, and SPENCER). The summary also included updates from the European Bioinformatics Institute (EBI), the U.S. National Center for Biotechnology Information (NCBI), and the National Genomics Data Center (NGDC) in China.

Papadopoulos et al. (2016) collected databases related to a specific group of disorders, namely kidney diseases.
Kidney or renal dysfunctions arise from various factors that affect the severity and progression of the disease. They can also co-occur with other diseases such as diabetes or cardiovascular diseases, which makes kidney diseases complex and hence in need of analysis and exploration at different layers. The authors argued that multi-omics databases are vital elements of clinical practice because they pave the way to valuable insights through systems biology approaches and high-throughput technologies. Such advances in the medical field enable P4 medicine: predictive, preventive, personalized, and participatory. Omics technologies and databases are also used to elucidate the mechanisms of various diseases and to develop diagnostic tools and possible treatments. The authors demonstrated how general omics databases such as GeneCards and ArrayExpress can be helpful in kidney disease studies, in addition to six databases directly related to kidney diseases: Nephroseq, Renal Gene Expression Database, Human Kidney and Urine Proteome Project (HKUPP), Urinary Peptidomics and Peakmaps, Kidney and Urinary Pathway Knowledge Base (KUPKB), and Chronic Kidney Disease database (CKDdb). Furthermore, their work included technical aspects and a showcase of omics database integration in kidney disease studies.

H. AlOmari et al.

3 Multi-Omics Data Resources

3.1 Data Repositories

To extract multi-omics data, researchers can access publicly available data repositories (Subramanian et al., 2020). These datasets can help them understand different types of diseases, identify prevention methods, or discover treatments. The following are examples of such resources:

1. The Cancer Genome Atlas (TCGA): collects samples and multi-omics datasets for various cancer types and tumors.
2. International Cancer Genome Consortium (ICGC): collects genome studies from cancer projects concerned with the genomic mutations found in different types of cancer.
3. Cancer Cell Line Encyclopedia (CCLE): collects gene expression, copy number, and sequencing data from human cell lines and tumors, as well as anticancer drug data.
4. Molecular Taxonomy of Breast Cancer International Consortium (METABRIC): collects multi-omics data to understand and classify breast tumors.
5. Therapeutically Applicable Research to Generate Effective Treatments (TARGET): an initiative to study further the factors and molecular events that cause pediatric cancers.
6. Omics Discovery Index (Omics DI): contains datasets from various repositories to understand and integrate multi-omics technologies.

3.2 BioBanks

Collecting and storing samples of human tissues, cells, and molecular components such as DNA over the long term has enabled scientists and researchers to benefit from these data. Combined with clinical data from health centers, the samples also help them understand the mechanisms of diseases and the effects of drugs. Following the emergence of omics science, biobanks evolved and grew by collecting cell lines and specimens to support scientific and medical research using human biological materials and patients’ information. There are two types of biobanks: cell-line biobanks and specimen biobanks. Large biobanks usually serve as the source of the control group against which smaller, more specific datasets are compared and analyzed for particular issues or diseases (Coppola et al., 2019).


Projects in different countries were established to collect health and medical data and to create national multi-omics databases or biobanks to better understand the genetic and molecular characteristics of their populations. Consequently, they were able to personalize healthcare and medicine according to how their populations respond to medications and to provide better prevention, diagnosis, and treatment. Biobanks can be general or population-based; these are flexible and support different types of studies related to genotypes or phenotypes. Examples of population-based biobanks include the UK Biobank, the Marshfield biobank, deCODE in Iceland, the Tohoku Medical Megabank Project Cohort Study in Japan (Tadaka et al., 2018, 2021), and the China National GeneBank DataBase (Chen et al., 2020), which addresses life science including healthcare. Other biobanks are disease-based; they are usually smaller than population-based biobanks and specialize in particular human illnesses, such as the AIDS specimen biobank (Coppola et al., 2019). A few biobanks are embedded in healthcare entities; the Mayo Clinic biobank and the hospital-embedded BioVU biobank are examples (Olson et al., 2014). Further examples of disease-specific biobanks include the rheumatoid arthritis biobank (BT-Cure), the asthma biobank (U-BIOPRED), and the diabetes biobank (DIRECT) (Kinkorová, 2016).

Multi-omics databases can be constructed from data obtained by analyzing biobank samples. For instance, the first version of the Japanese Multi Omics Reference Panel (jMorp; https://jmorp.megabank.tohoku.ac.jp), released in 2018, consisted of metabolome data (Tadaka et al., 2018), and the enhanced 2020 version added further omics: genome, methylome, and transcriptome (Tadaka et al., 2021). Both versions were derived from the Tohoku Medical Megabank (TMM) project, which conducts cohort studies and multi-omics analysis.

The dbTMM (https://www.megabank.tohoku.ac.jp/researchers/) is another omics database that contains genome and clinical data from participants of the same project (Ogishima et al., 2021). Other multi-omics databases were created from datasets reported in peer-reviewed articles or literature dedicated to specific health issues or biological factors.

4 Multi-Omics Databases and Tools

Table 1 summarizes a set of thirty-six multi-omics databases and tools. For each entry it shows the database name, the database subject, the omics technologies included, the portal where the database is available, and its release date.


Table 1 Multi-omics databases and technologies

No. | DB name | DB subject | Omics technology | Portal | Date
1 | Updated DriverDBv3 database (Liu et al., 2020) | Cancer | Genomics, Epigenomics, Transcriptomics | http://ngs.ym.edu.tw/driverdb | 2019
2 | The Aging Atlas database (Aging Atlas Consortium, 2021) | Age-related changes | Transcriptomics, Epigenomics, Proteomics, Pharmacogenomics | https://bigd.big.ac.cn/aging/index | 2020
3 | MOPED 2.5 (Montague et al., 2014) | Interdisciplinary studies/general purposes | Genomics, Proteomics, Transcriptomics | http://moped.proteinspire.org | 2014
4 | LinkedOmics (Vasaikar et al., 2018) | Cancer | Genomics, Proteomics, Transcriptomics, Epigenomics | http://linkedomics.org | 2017
5 | DevOmics (Yan et al., 2021) | Genetics, Biology & Molecular biology | Genomics, Epigenomics | http://devomics.cn | 2021
6 | CardioGenBase (Nayar et al., 2015) | Cardiovascular diseases | Genomics, Proteomics | www.CardioGenBase.com | 2015
7 | BD2Decide (Cavalieri et al., 2021) | Cancer | Transcriptomics, Radiomics | http://www.bd2decide.eu/content/home | 2020
8 | CKDdb (Fernandes & Husi, 2017) | Chronic kidney diseases | Transcriptomics (microRNA), Genomics, Peptidomics, Proteomics, Metabolomics | www.padb.org/ckdbd | 2017
9 | C/VDdb (Fernandes et al., 2018) | Cardiovascular diseases | Transcriptomics (microRNA), Genomics, Proteomics, Metabolomics | www.padb.org/cvd | 2018
10 | IBDMDB (Lloyd-Price et al., 2019) | Inflammatory bowel diseases | Genomics, Transcriptomics, Proteomics, Metabolomics, Microbiomics, Epigenomics | http://ibdmdb.org | 2019
11 | MOBCdb (Xie et al., 2018) | Cancer | Genomics, Transcriptomics, Epigenomics, Pharmacogenomics | http://bigd.big.ac.cn/MOBCdb/ | 2018
12 | iNetModels 2.0 (Arif et al., 2021) | Cancer | Proteomics, Metabolomics, Genomics, Microbiomics | https://inetmodels.com | 2021
13 | jMorp 2018 (Tadaka et al., 2018) | Genetics, Biology & Molecular biology | Metabolomics, Proteomics | https://jmorp.megabank.tohoku.ac.jp | 2018
14 | jMorp 2020 (Tadaka et al., 2021) | Genetics, Biology & Molecular biology | Genomics, Epigenomics, Transcriptomics, Metabolomics, Proteomics | https://jmorp.megabank.tohoku.ac.jp | 2020
15 | GENEASE (Ghandikota et al., 2018) | Genetics, Biology & Molecular biology | Genomics | http://research.cchmc.org/mershalab/GENEASE/ | 2018
16 | iMETHYL (Komaki et al., 2018) | Genetics, Biology & Molecular biology | Genomics, Epigenomics, Transcriptomics | http://imethyl.iwate-megabank.org | 2018
17 | OmicsNet 2.0 (Zhou et al., 2022) | Interdisciplinary studies/general purposes | Genomics, Metabolomics, Microbiomics | www.omicsnet.ca | 2022
18 | AlzGPS (Zhou et al., 2021) | Alzheimer’s disease | Genomics, Transcriptomics, Proteomics | https://alzgps.lerner.ccf.org | 2021
19 | IMOTA (Palmieri et al., 2018) | Genetics, Biology & Molecular biology | Genomics, Transcriptomics, Proteomics | https://ccb-web.cs.uni-saarland.de/imota/ | 2017
20 | PlatOMICs (Brandão et al., 2021) | Skin disorders (Dermatology) | Genomics, Epigenomics, Transcriptomics, Microbiomics, Proteomics | under development | 2021
21 | FPIA (Huang et al., 2022) | Cancer | Genomics, Proteomics | http://bioinfo-sysu.com/fpia/ | 2022
22 | ProteomicsDB (Samaras et al., 2020) | Interdisciplinary studies/general purposes | Proteomics | https://www.ProteomicsDB.org | 2019
23 | KUPKB (Klein et al., 2012) | Kidney diseases | Transcriptomics, Proteomics, Metabolomics | http://www.kupkb.org | 2012
24 | dbTMM (Ogishima et al., 2021) | Genetics, Biology & Molecular biology | Genomics | https://www.megabank.tohoku.ac.jp/ | 2021
25 | HeartBioPortal (Khomtchouk et al., 2019) | Cardiovascular diseases | Genomics | https://www.heartbioportal.com | 2019
26 | MuSA (Zanfardino et al., 2021) | Cancer | Radiogenomics | https://gitlab.com/Zanfardino/musa | 2021
27 | CNGBdb (Chen et al., 2020) | Interdisciplinary studies/general purposes | Multi-omics data in life science, not specific to any type of omics | https://db.cngb.org/ | 2020
28 | PRIDE (Vizcaíno et al., 2016) | Interdisciplinary studies/general purposes | Proteomics | http://www.ebi.ac.uk/pride/archive/ | 2015
29 | MiBiOmics (Zoppi et al., 2021) | Genetics, Biology & Molecular biology | Transcriptomics, Proteomics, Metabolomics, Genomics | https://shiny-bird.univ-nantes.fr/app/Mibiomics, or as an application at https://gitlab.univ-nantes.fr/combi-ls2n/mibiomics | 2021
30 | MVIP (Tang et al., 2022) | Virology | Genomics, Transcriptomics, Epigenomics (ChIP-seq) | https://mvip.whu.edu.cn/ | 2021
31 | Fibromine (Fanidis et al., 2021) | Fibrosis disorders | Transcriptomics, Proteomics | http://www.fibromine.com/Fibromine | 2021
32 | GraphOmics (Wandy & Daly, 2021) | Virology | Transcriptomics, Proteomics, Metabolomics | https://graphomics.glasgowcompbio.org/ | 2021
33 | RNAactDrug (Dong et al., 2020) | Drug interaction | Genomics, Transcriptomics, Epigenomics, Pharmacogenomics | http://bio-bigdata.hrbmu.edu.cn/RNAactDrug | 2019
34 | BioSamples database (Courtot et al., 2022) | Genetics, Biology & Molecular biology | Genomics, Transcriptomics | http://www.ebi.ac.uk/biosamples | 2021
35 | FibROAD (Sun et al., 2022) | Fibrosis disorders | Genomics, Transcriptomics, Epigenomics | https://www.fibroad.org | 2022
36 | EyeDiseases (Yuan et al., 2021) | Eyes disorders (Ophthalmology) | Genomics, Transcriptomics, Epigenomics | https://eyediseases.bio-data.cn/ | 2021
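As a rough illustration of how a catalogue such as Table 1 might be queried programmatically, the following Python sketch stores a handful of its rows as records and filters them by subject and by omics layer. The `Database` record type and the `find` helper are hypothetical names introduced only for this example, and the choice of five rows is arbitrary; the values themselves are transcribed from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    """One row of Table 1 (hypothetical record type, for illustration only)."""
    name: str
    subject: str
    omics: frozenset
    year: int


# A few rows transcribed from Table 1; the remaining rows would follow the same shape.
CATALOGUE = [
    Database("LinkedOmics", "Cancer",
             frozenset({"Genomics", "Proteomics", "Transcriptomics", "Epigenomics"}), 2017),
    Database("CKDdb", "Chronic kidney diseases",
             frozenset({"Transcriptomics", "Genomics", "Peptidomics",
                        "Proteomics", "Metabolomics"}), 2017),
    Database("IBDMDB", "Inflammatory bowel diseases",
             frozenset({"Genomics", "Transcriptomics", "Proteomics",
                        "Metabolomics", "Microbiomics", "Epigenomics"}), 2019),
    Database("iNetModels 2.0", "Cancer",
             frozenset({"Proteomics", "Metabolomics", "Genomics", "Microbiomics"}), 2021),
    Database("KUPKB", "Kidney diseases",
             frozenset({"Transcriptomics", "Proteomics", "Metabolomics"}), 2012),
]


def find(catalogue, subject=None, required_omics=()):
    """Return the names of databases that match a subject (if given)
    and cover every omics layer listed in required_omics."""
    required = set(required_omics)
    return [
        db.name
        for db in catalogue
        if (subject is None or db.subject == subject) and required <= db.omics
    ]


cancer_dbs = find(CATALOGUE, subject="Cancer")
metabolome_and_proteome = find(CATALOGUE, required_omics=["Metabolomics", "Proteomics"])
print(cancer_dbs)
print(metabolome_and_proteome)
```

Storing the omics layers as a set makes "covers all of these layers" a simple subset test, which is the kind of question a researcher choosing a database for an integration study is likely to ask.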

5 Multi-Omics Main Technologies

The studied databases include the following omics technologies (Table 2):

• Genomics: The genome is the complete sequence of DNA in a cell or organism (Omenn et al., 2012), and genomics refers to the study of organisms’ whole genomes (Manzoni et al., 2018). The databases include information and datasets about gene expression profiles, relative gene expression data, gene literature evidence, gene ontology, gene pathways, connections between specific genes, mutations, mutation data at the site level, copy number alteration (CNA), CNA data at the region level, copy number variation (CNV), single-nucleotide polymorphism information (SNPs), chromatin accessibility and 3D chromatin architecture profiles, DNA microarray analyses, DNA profiles underlying pathogenesis, chromosome positions, allele references, and ancestry allele frequency information.
• Transcriptomics: The transcriptome is “the complete set of RNA transcripts from DNA in a cell or tissue” (Omenn et al., 2012), while transcriptomics “examines RNA levels genome-wide, both qualitatively (which transcripts are present, identification of novel splice sites, RNA editing sites) and quantitatively (how much of each transcript is expressed)” (Hasin et al., 2017). The databases include information and datasets about RNA expression, RNA-seq, single-cell transcriptomics (scRNA-seq), mRNA expression, microRNA (miRNA) expression, miRNA-seq, miRNA datasets, long non-coding RNAs (lncRNAs), CLIP-seq, transcriptomics relative expression records, and RNA profiles underlying pathogenesis.
• Proteomics: “Proteomics is the study of the proteome, which is defined as the set of all expressed proteins and interacting protein family networks, and biochemical pathways in a cell, tissue, or organism” (Nalbantoglu & Karadag, 2019). The databases include information and datasets about protein-protein interaction networks, protein expression in various body fluids and tissues, absolute and relative protein expression data, reverse phase protein array (RPPA) data at the analyte level and clinical data, phosphoproteomics, glycoproteomics, plasma proteomics, RNA-Seq expression data, drug-target interactions and cell line viability data, and protein profiles underlying pathogenesis.
• Epigenomics: “The epigenome consists of reversible chemical modifications to the DNA, or to the histones that bind DNA, and produce changes in the expression of genes without altering their base sequence. Epigenomic modifications can occur in a tissue-specific manner, in response to environmental factors, or the development of disease states, and can persist across generations.” (Omenn et al., 2012). The epigenomic datasets and information included in the databases are whole-genome DNA methylation, histone modifications, ChIP-seq, MeDIP-seq, MBD-seq, and Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) data.
• Metabolomics: The databases include datasets of plasma metabolomics.
• Pharmacogenomics: The databases include geroprotective compound data and drug response data for different breast cancer subtypes.
• Microbiomics
• Radiogenomics and Radiomics
• Peptidomics

Table 2 Main multi-omics technologies

Omics technology | Databases | Number of DBs
Genomics | Updated DriverDBv3 database, MOPED 2.5, LinkedOmics, DevOmics, CardioGenBase, CKDdb, C/VDdb, IBDMDB, MOBCdb, iNetModels 2.0, jMorp V2, GENEASE, iMETHYL, OmicsNet 2.0, AlzGPS, IMOTA, PlatOMICs, FPIA, dbTMM, HeartBioPortal, MVIP, RNAactDrug, BioSamples database, FibROAD, EyeDiseases, MiBiOmics | 26
Transcriptomics | Updated DriverDBv3 database, The Aging Atlas database, MOPED 2.5, LinkedOmics, BD2Decide database, CKDdb, C/VDdb, IBDMDB, MOBCdb, jMorp V2, iMETHYL, AlzGPS, IMOTA, PlatOMICs, KUPKB, MVIP, Fibromine, GraphOmics, RNAactDrug, BioSamples database, FibROAD, EyeDiseases, MiBiOmics | 23
Proteomics | The Aging Atlas database, MOPED 2.5, LinkedOmics, BD2Decide database, CKDdb, C/VDdb, IBDMDB, MOBCdb, jMorp V1, iMETHYL, AlzGPS, IMOTA, PlatOMICs, KUPKB, MVIP, Fibromine, GraphOmics, MiBiOmics | 18
Epigenomics | Updated DriverDBv3 database, The Aging Atlas database, LinkedOmics, DevOmics, IBDMDB, MOBCdb, jMorp, iMETHYL, PlatOMICs, MVIP, RNAactDrug, FibROAD, EyeDiseases | 13
Metabolomics | CKDdb, C/VDdb, IBDMDB, iNetModels 2.0, jMorp V1, jMorp V2, OmicsNet 2.0, KUPKB, GraphOmics, MiBiOmics | 10
Microbiomics | IBDMDB, iNetModels 2.0, OmicsNet 2.0, PlatOMICs | 4
Pharmacogenomics | MOBCdb, The Aging Atlas database | 2
Radiogenomics and Radiomics | BD2Decide database, MuSA | 2
Peptidomics | CKDdb | 1

Fig. 1 Number of databases for each multi-omics technology

Figure 1 shows the number of databases for each multi-omics technology covered in this chapter.
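The counts plotted in Figure 1 can be recomputed directly from the database lists in Table 2. The Python sketch below is one minimal way to do so; the `TABLE_2` dictionary and the `count_databases` helper are hypothetical names used only for this illustration, database names are abbreviated (for example "DriverDBv3" for the Updated DriverDBv3 database), and the text bar chart is merely a stand-in for the figure.

```python
# Database lists transcribed from Table 2, one comma-separated string per technology.
TABLE_2 = {
    "Genomics": (
        "DriverDBv3, MOPED 2.5, LinkedOmics, DevOmics, CardioGenBase, CKDdb, "
        "C/VDdb, IBDMDB, MOBCdb, iNetModels 2.0, jMorp V2, GENEASE, iMETHYL, "
        "OmicsNet 2.0, AlzGPS, IMOTA, PlatOMICs, FPIA, dbTMM, HeartBioPortal, "
        "MVIP, RNAactDrug, BioSamples, FibROAD, EyeDiseases, MiBiOmics"
    ),
    "Transcriptomics": (
        "DriverDBv3, Aging Atlas, MOPED 2.5, LinkedOmics, BD2Decide, CKDdb, C/VDdb, "
        "IBDMDB, MOBCdb, jMorp V2, iMETHYL, AlzGPS, IMOTA, PlatOMICs, KUPKB, MVIP, "
        "Fibromine, GraphOmics, RNAactDrug, BioSamples, FibROAD, EyeDiseases, MiBiOmics"
    ),
    "Proteomics": (
        "Aging Atlas, MOPED 2.5, LinkedOmics, BD2Decide, CKDdb, C/VDdb, IBDMDB, "
        "MOBCdb, jMorp V1, iMETHYL, AlzGPS, IMOTA, PlatOMICs, KUPKB, MVIP, "
        "Fibromine, GraphOmics, MiBiOmics"
    ),
    "Epigenomics": (
        "DriverDBv3, Aging Atlas, LinkedOmics, DevOmics, IBDMDB, MOBCdb, jMorp, "
        "iMETHYL, PlatOMICs, MVIP, RNAactDrug, FibROAD, EyeDiseases"
    ),
    "Metabolomics": (
        "CKDdb, C/VDdb, IBDMDB, iNetModels 2.0, jMorp V1, jMorp V2, OmicsNet 2.0, "
        "KUPKB, GraphOmics, MiBiOmics"
    ),
    "Microbiomics": "IBDMDB, iNetModels 2.0, OmicsNet 2.0, PlatOMICs",
    "Pharmacogenomics": "MOBCdb, Aging Atlas",
    "Radiogenomics and Radiomics": "BD2Decide, MuSA",
    "Peptidomics": "CKDdb",
}


def count_databases(table):
    """Count the databases listed for each technology, largest count first."""
    counts = {}
    for technology, names in table.items():
        counts[technology] = len([n for n in names.split(",") if n.strip()])
    # sorted() is stable, so technologies with equal counts keep their table order.
    return dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))


counts = count_databases(TABLE_2)
for technology, n in counts.items():
    # One '#' per database gives a plain-text approximation of the bar chart.
    print(f"{technology:<28}{'#' * n} {n}")
```

Because a single database usually covers several omics layers, the per-technology counts add up to far more than the thirty-six databases reviewed.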

6 Fields of Multi-Omics Technologies

The thirty-six databases under study have been classified according to the primary purpose, field of science, disease, or medical specialty to which each database is dedicated. Table 3 shows the 13 main categories obtained from the 36 databases. Eight of the 13 categories concern diseases or disorders, and they are associated with 18 of the 36 databases (50%). These eight categories include the following databases:


Table 3 Application fields of multi-omics

No. | Health issue/field/purpose | Number of DBs | Databases
1 | Cancer | 7 | Updated DriverDBv3 database, LinkedOmics, BD2Decide database, MOBCdb, iNetModels 2.0, FPIA, MuSA
2 | Cardiovascular diseases | 3 | CardioGenBase, C/VDdb, HeartBioPortal
3 | Kidney disease | 2 | CKDdb, KUPKB
4 | Fibrosis-associated disorders | 2 | FibROAD, Fibromine
5 | Skin disorders (Dermatology) | 1 | PlatOMICs
6 | Eyes disorders (Ophthalmology) | 1 | EyeDiseases
7 | Alzheimer’s disease | 1 | AlzGPS
8 | Inflammatory bowel diseases | 1 | IBDMDB
9 | Genetics, biology & molecular biology | 10 | DevOmics, jMorp V2, GENEASE, iMETHYL, IMOTA, dbTMM, CNGBdb, MiBiOmics, GraphOmics, BioSamples database
10 | Virology | 2 | MVIP, GraphOmics
11 | Drug interaction | 1 | RNAactDrug
12 | Aging-related changes | 1 | The Aging Atlas database
13 | Interdisciplinary studies/general purposes | 4 | MOPED 2.5, OmicsNet 2.0, ProteomicsDB, PRIDE

1. Cancer: Contains seven omics databases specialized in cancer and oncology studies; their datasets include information about breast cancer, colorectal and ovarian cancer, and head and neck cancer.
2. Cardiovascular diseases: Contains three omics databases that gather specialized datasets on cerebrovascular disease, coronary artery disease (CAD), hypertensive heart disease, inflammatory heart disease, ischemic heart disease, and rheumatic heart disease.
3. Kidney disease: Contains two databases that gather specialized datasets on chronic kidney disease.
4. Eyes disorders (Ophthalmology): EyeDiseases is the first database for multi-omics data integration and interpretation of human eye diseases. It contains 1344 disease-associated genes with genetic variation, 1774 transcription files of bulk cell expression and single-cell RNA-seq, and 105 epigenomics datasets across 185 kinds of human eye diseases.
5. Skin disorders (Dermatology): Has one omics database (PlatOMICs), which focuses on skin diseases characterized by the impairment of Notch signaling and considers the pathologies of five human skin diseases: hidradenitis suppurativa, Dowling-Degos disease, Adams–Oliver syndrome, psoriasis, and atopic dermatitis.
6. Fibrosis: Has two databases (FibROAD and Fibromine) specialized in fibrosis disorders such as idiopathic chronic pulmonary fibrosis; their datasets integrate pieces of evidence on fibrosis-associated disorders obtained from both the literature and multi-omics data.
7. Alzheimer’s disease: AlzGPS is a genome-wide positioning systems platform for Alzheimer’s drug discovery.
8. Inflammatory bowel diseases: Has one database, IBDMDB, which holds multi-omics datasets of the gut microbial ecosystem in inflammatory bowel diseases. These datasets integrate taxonomic, metagenomic, metatranscriptomic, metaproteomic, and metabolic data on the microbiome. Additionally, the database profiles host genetics, epigenetics (DNA modification), and gene expression.

The remaining five categories (out of 13) cover general medical or scientific fields rather than particular diseases. They include the following categories:

1. Biology & Molecular biology: Contains 11 databases (out of 36) (30.5%). They are dedicated to gathering biology, applied molecular biology, and biochemistry datasets. The information includes: deciphering the molecular regulatory mechanisms of human and mouse early embryos, human DNA methylation (DNAm) variation, a tissue atlas for the analysis of human miRNA–target interactions, storage of individual whole-genome data on a variant-by-variant basis together with cohort and clinical data, analysis of large-scale data from genome cohorts, and descriptions of sophisticated biological networks such as complex biomolecular reactions.
2. Virology: Has two databases (MVIP and GraphOmics). They specialize in omics data related to viral infection (such as COVID-19); these databases should help users quickly retrieve and compare different virus-host interactions at multiple layers, efficiently analyze dynamic changes in genes, and visualize large-scale omics data of viral infections with flexible settings. The GraphOmics portal also enables users to upload different datasets for analysis.
3. Drug interaction: RNAactDrug is a comprehensive database of RNAs associated with drug sensitivity from multi-omics data, which allows users to explore associations between drug sensitivity and RNA molecules directly.
4. Aging-related changes: The Aging Atlas database is a multi-omics database for aging biology that provides user-friendly functionalities to explore age-related changes in gene expression, as well as raw data download services; its datasets include transcriptomics (RNA-seq), single-cell transcriptomics (scRNA-seq), epigenomics (ChIP-seq), proteomics (protein-protein interaction), and pharmacogenomics (geroprotective compounds).
5. Interdisciplinary studies/general purposes: Contains four databases.

Figure 2 shows the multi-omics fields associated with the databases covered in this chapter.


Fig. 2 Multi-omics fields
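The proportions quoted for the application fields follow from simple arithmetic on the per-category counts in Table 3. The short Python sketch below reproduces that calculation; the dictionary names and the `percentage` helper are hypothetical and serve only to make the arithmetic explicit, with the counts taken as printed in Table 3.

```python
# Number of databases per category, as printed in Table 3.
DISEASE_CATEGORIES = {
    "Cancer": 7,
    "Cardiovascular diseases": 3,
    "Kidney disease": 2,
    "Fibrosis-associated disorders": 2,
    "Skin disorders (Dermatology)": 1,
    "Eyes disorders (Ophthalmology)": 1,
    "Alzheimer's disease": 1,
    "Inflammatory bowel diseases": 1,
}
GENERAL_CATEGORIES = {
    "Genetics, biology & molecular biology": 10,
    "Virology": 2,
    "Drug interaction": 1,
    "Aging-related changes": 1,
    "Interdisciplinary studies/general purposes": 4,
}


def percentage(part, whole):
    """Share of `part` in `whole`, in percent, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


disease_total = sum(DISEASE_CATEGORIES.values())
general_total = sum(GENERAL_CATEGORIES.values())
grand_total = disease_total + general_total

# Share of databases dedicated to diseases or disorders.
disease_share = percentage(disease_total, grand_total)

# Share of each individual category in the full set of databases.
all_categories = {**DISEASE_CATEGORIES, **GENERAL_CATEGORIES}
shares = {name: percentage(n, grand_total) for name, n in all_categories.items()}

print(f"{disease_total} of {grand_total} databases are disease-related ({disease_share}%)")
print(f"Cancer alone accounts for {shares['Cancer']}%")
```

Keeping the disease-related and general categories in separate dictionaries mirrors the split made in the text, so the 18-of-36 figure falls out of a single sum.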

7 Conclusion

The growing volume of multi-omics research creates a need to collect and organize multi-omics datasets so that they can support further investigation and analysis in many areas of human healthcare. Many databases have therefore emerged to collect omics data. Among the thirty-six databases reviewed in this chapter, multi-omics studies focused mainly on human chronic diseases such as cancer, cardiovascular diseases, fibrosis, kidney diseases, and skin disorders. In addition to these diseases, general genetics, biology, and molecular biology are potential research fields for multi-omics. Moreover, genomics and transcriptomics are the most widely used omics technologies in this research: each appeared in more than 20 of the databases studied in this chapter.


References Aging Atlas Consortium. (2021). Aging atlas: A multi-omics database for aging biology. Nucleic Acids Research, 49(D1), D825–D830. Alkhateeb, A., Tabl, A. A., & Rueda, L. (2021). Deep learning in multi-omics data integration in cancer diagnostic (pp. 255–271). Cham: Springer International Publishing. Arif, M., Zhang, C., Li, X., Güngör, C., Çakmak, B., Arslantürk, M., Tebani, A., Özcan, B., Suba¸s, O., Zhou, W., et al. (2021). iNetModels 2.0: An interactive visualization and database of multiomics data. Nucleic Acids Research, 49(W1), W271–W276. Brandão, L. A. C., Tricarico, P. M., Gratton, R., Agrelli, A., Zupin, L., Abou-Saleh, H., Moura, R., & Crovella, S. (2021). Multiomics integration in skin diseases with alterations in notch signaling pathway: Platomics phase 1 deployment. International Journal of Molecular Sciences, 22(4), 1523. Cavalieri, S., De Cecco, L., Brakenhoff, R. H., Serafini, M. S., Canevari, S., Rossi, S., Lanfranco, D., Hoebers, F. J., Wesseling, F. W., Keek, S., et al. (2021). Development of a multiomics database for personalized prognostic forecasting in head and neck cancer: The big data to decide EU project. Head & Neck, 43(2), 601–612. Chen, F. Z., You, L. J., Yang, F., Wang, L. N., Guo, X. Q., Gao, F., Hua, C., Tan, C., Fang, L., Shan, R. Q., et al. (2020). CNGBdb: China national genebank database. Yi chuan= Hereditas, 42(8), 799–809. Coppola, L., Cianflone, A., Grimaldi, A. M., Incoronato, M., Bevilacqua, P., Messina, F., Baselice, S., Soricelli, A., Mirabelli, P., & Salvatore, M. (2019). Biobanking in health care: Evolution and future directions. Journal of Translational Medicine, 17(1), 1–18. Courtot, M., Gupta, D., Liyanage, I., Xu, F., & Burdett, T. (2022). Biosamples database: Fairer samples metadata to accelerate research data management. Nucleic Acids Research, 50(D1), D1500–D1507. Dong, Q., Li, F., Xu, Y., Xiao, J., Xu, Y., Shang, D., Zhang, C., Yang, H., Tian, Z., Mi, K., et al. (2021). 
Rnaactdrug: A comprehensive database of RNAs associated with drug sensitivity from multi-omics data. Briefings in Bioinformatics, 21(6), 2167–2174. Fanidis, D., Moulos, P., & Aidinis, V. (2021). Fibromine is a multi-omics database and mining tool for target discovery in pulmonary fibrosis. Scientific Reports, 11(1), 1–14. Fernandes, M., & Husi, H. (2017). Establishment of a integrative multi-omics expression database CKDdb in the context of chronic kidney disease (CKD). Scientific Reports, 7(1), 1–11. Fernandes, M., Patel, A., & Husi, H. (2018). C/VDdb: A multi-omics expression profiling database for a knowledge-driven approach in cardiovascular disease (CVD). PloS One, 13(11), e0207371. Ghandikota, S., Hershey, G. K. K., & Mersha, T. B. (2018). Genease: Real time bioinformatics tool for multi-omics and disease ontology exploration, analysis and visualization. Bioinformatics, 34(18), 3160–3168. Hasin, Y., Seldin, M., & Lusis, A. (2017). Multi-omics approaches to disease. Genome Biology, 18(1), 1–15. Huang, L., Zhu, H., Luo, Z., Luo, C., Luo, L., Nong, B., Zhang, S., Wan, C., Wang, Y., Songyang, Z., et al. (2022). FPIA: A database for gene fusion profiling and interactive analyses. International Journal of Cancer, 150(9), 1504–1511. Khomtchouk, B. B., Vand, K. A., Koehler, W. C., Tran, D.-T., Middlebrook, K., Sudhakaran, S., Nelson, C. S., Gozani, O., & Assimes, T. L. (2019). Heartbioportal: An internet-of-omics for human cardiovascular disease data. Circulation: Genomic and Precision Medicine, 12(4), e002426. Kinkorová, J. (2016). Biobanks in the era of personalized medicine: objectives, challenges, and innovation. EPMA Journal, 7(1), 1–12. Klein, J., Jupp, S., Moulos, P., Fernandez, M., Buffin-Meyer, B., Casemayou, A., Chaaya, R., Charonis, A., Bascands, J.-L., Stevens, R., et al. (2012). The KUPKB: A novel web application to access multiomics data on kidney disease. The FASEB Journal, 26(5), 2145–2153.

Multi-Omics Databases

165




Index

A
Association prediction, 75–89

B
Biobank, 152, 154–155
Biomarker discovery, 30–31, 40, 41, 43, 63, 67, 68, 92, 93
Biomarkers, 7, 13–15, 17, 26, 27, 30–31, 40, 41, 43, 62–64, 67, 68, 76, 89, 92–94, 112, 113, 134–136, 153
Breast cancer (BC), 14, 20, 47, 133–146, 160

C
Cancer subtyping, 136–137, 142–143, 146
Canonical alleles, 111, 113
Cardiovascular disease (CVD), 16, 153, 156, 157, 159, 161, 163
Cell identification, 40, 54, 61, 64, 68
Cell measurements, 64
Complex diseases, 15, 133, 138, 152, 153

D
Data integration strategies, 17–20
Deep autoencoder, 77–81, 88
Deep neural networks, 20, 32, 63–64
Disease diagnosis, 14, 25

E
Epigenomics, 8–9, 15, 138–139, 152, 156–162

F
Fibrosis, 158, 161–163
Frequented regions, 119–120, 122–124, 128

G
Gene expression, 3, 4, 6, 8, 9, 24, 27, 30–32, 42, 49, 52–54, 57, 58, 60, 62, 63, 65–67, 75, 92, 93, 100, 129, 134, 136–138, 143, 153, 154, 158, 162
Genome-wide association studies (GWAS), 117–119, 121–125, 130
Genomics, 1–3, 6, 8–10, 13, 14, 24, 40–42, 44, 45, 47–49, 56, 92, 93, 98, 102, 117, 119, 122, 125, 129, 130, 134, 137–138, 152, 156–158, 160, 163
Genomics molecular, 24
Genotype-to-Phenotype prediction, 25, 122, 125, 129
Graph-based machine learning, 117–130

H
High-throughput sequencing, 4, 13

K
Knowledge discovery, 39–68

M
Machine learning (ML), 13–20, 23–35, 63, 76–78, 81, 84, 88, 92, 93, 117–130
Metabolomics, 6–10, 13–15, 24, 25, 152, 156–160

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 A. Alkhateeb, L. Rueda (eds.), Machine Learning Methods for Multi-Omics Data Integration, https://doi.org/10.1007/978-3-031-36502-7


Minimum redundancy maximum relevance (mRMR), 20, 91–113
miRNA-disease association, 75–90
Multi-omic biomarkers, 62–64
Multi-omics, 1–10, 13–20, 25–35, 39–68, 133–146, 151–163
Multi-omics data integration, 23–35, 64, 65, 139, 142, 143, 161

N
Negative sample selection, 75–89

P
Pangenomics, 117–130
Pathway enrichment analysis, 98, 103
Prostate cancer, 91–113
Proteomics, 5, 6, 9, 10, 13–15, 24, 25, 45–46, 59, 64, 152, 156–159, 162

R
Redundancy rate (RR), 103, 108, 111, 112
Regulatory networks, 60–62

S
Similarity preserving selection criteria (SC), 96
Single-cell, 31, 39–68, 158, 161

T
Tensor factorization, 17, 50, 139–146
Transcriptomics, 3–4, 6, 9, 10, 13–15, 24, 41, 51, 61, 134, 137, 138, 152, 156–158, 162, 163

Y
Yeast, 119, 121–123, 125–130