Using gut microbiota as a diagnostic tool for colorectal cancer: machine learning techniques reveal promising results


Table of contents :
Using gut microbiota as a diagnostic tool for colorectal cancer: machine learning techniques reveal promising results
Abstract
Introduction
Methods
Participants and sample collection
DNA extraction and sequencing
Data analysis
Supervised ML modelling
Statistical analysis
Results
CRC was related to the dysregulation of various gut microbiota
ML models for diagnosis and screening based on the gut microbiota
Supervised ML models trained with LEfSe
Supervised ML models trained with LEfSe and LASSO regression model
Identifying the top ten most important OTUs in the RF model
Discussion
Conclusion
References


RESEARCH ARTICLE Lu et al., Journal of Medical Microbiology 2023;72:001699 DOI 10.1099/jmm.0.001699

Using gut microbiota as a diagnostic tool for colorectal cancer: machine learning techniques reveal promising results

Fang Lu1,2†, Ting Lei2,3†, Jie Zhou1,2†, Hao Liang1,2,4, Ping Cui2,4, Taiping Zuo2,3, Li Ye1,2,*, Hui Chen2,3,* and Jiegang Huang1,2,*

Abstract

Introduction. Increasing evidence suggests a correlation between gut microbiota and colorectal cancer (CRC).

Hypothesis/Gap Statement. However, few studies have used gut microbiota as a diagnostic biomarker for CRC.

Aim. The objective of this study was to explore whether a machine learning (ML) model based on gut microbiota could be used to diagnose CRC and identify key biomarkers in the model.

Methodology. We sequenced the 16S rRNA gene from faecal samples of 38 participants, including 17 healthy subjects and 21 CRC patients. Eight supervised ML algorithms were used to diagnose CRC based on faecal microbiota operational taxonomic units (OTUs), and the models were evaluated in terms of identification, calibration and clinical practicality for optimal modelling parameters. Finally, the key gut microbiota was identified using the random forest (RF) algorithm.

Results. We found that CRC was associated with the dysregulation of gut microbiota. Through a comprehensive evaluation of supervised ML algorithms, we found that different algorithms had significantly different prediction performance using faecal microbiomes. Different data screening methods played an important role in optimization of the prediction models. We found that naïve Bayes algorithms [NB, accuracy=0.917, area under the curve (AUC)=0.926], RF (accuracy=0.750, AUC=0.926) and logistic regression (LR, accuracy=0.750, AUC=0.889) had high predictive potential for CRC. Furthermore, important features in the model, namely s__metagenome_g__Lachnospiraceae_ND3007_group (AUC=0.814), s__Escherichia_coli_g__Escherichia-Shigella (AUC=0.784) and s__unclassified_g__Prevotella (AUC=0.750), could each be used as diagnostic biomarkers of CRC.

Conclusions. Our results suggested an association between gut microbiota dysregulation and CRC, and demonstrated the feasibility of using the gut microbiota to diagnose cancer. The bacteria s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella were key biomarkers for CRC.

Received 19 December 2022; Accepted 06 April 2023; Published 07 June 2023

Author affiliations: 1School of Public Health, Guangxi Medical University, Nanning, 530021, Guangxi, PR China; 2Guangxi Key Laboratory of AIDS Prevention and Treatment & Guangxi Universities Key Laboratory of Prevention and Control of Highly Prevalent Disease, Nanning, 530021, Guangxi, PR China; 3Geriatrics Digestion Department of Internal Medicine, The First Affiliated Hospital of Guangxi Medical University, Nanning, PR China; 4Life Science Institute, Guangxi Medical University, Nanning, 530021, Guangxi, PR China.

*Correspondence: Li Ye, yeli@gxmu.edu.cn; Hui Chen, chenhuiyfy@gxmu.edu.cn; Jiegang Huang, jieganghuang@gxmu.edu.cn

Keywords: colorectal cancer; gut microbiome; 16S rRNA gene sequencing; diagnosis; machine learning; biomarker.

Abbreviations: AUC, area under the receiver operating characteristic curve; CRC, colorectal cancer; DCA, decision curve analysis; DT, decision tree; FOBT, faecal occult blood test; HC, healthy control; KNN, k-nearest neighbours; LASSO, least absolute shrinkage and selection operator; LEfSe, linear discriminant analysis effect size; LR, logistic regression; ML, machine learning; NB, naïve Bayes algorithms; NN, neural network; OTU, operational taxonomic unit; RF, random forest; ROC, receiver operating characteristic curve; SVM, support vector machines; XGB, extreme gradient boosting.

Supplementary materials (Table S1 and File S1) are available with the online version of this article.

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) are: BioSample database, BioProject ID: PRJNA910989 (http://www.ncbi.nlm.nih.gov/bioproject/910989) and PRJNA933359 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA933359).

† These authors contributed equally to this work and share the first authorship.

001699 © 2023 The Authors

INTRODUCTION

The high incidence and mortality of colorectal cancer (CRC) make it one of the most concerning diseases in the world. According to the Global Cancer Data 2020 report, there were 19.3 million new cancer cases worldwide, and the overall incidence of CRC rose from fourth place in 2018 to third place [1]. Because effective drugs for CRC are still being developed and the only effective measures are early detection and surgical removal of CRC, many countries recommend universal screening and prevention programmes. Currently,

one of the most widely used non-invasive screening procedures is the faecal occult blood test (FOBT), which can indicate the presence of advanced adenomas and carcinomas in the colon by detecting blood in the stool [2]. However, because the FOBT has limited sensitivity and specificity for CRC and does not reliably detect precancerous lesions, there is a need to develop a new non-invasive, simple and effective CRC screening test [3].

The gut microbiota is a collection of microorganisms living in the gastrointestinal tract and is a potential source of biomarkers for detecting colonic lesions. In human studies, patients with CRC have an abnormal gut microbiome structure when compared with healthy individuals [4, 5]. Experiments in animal models have also shown that such alterations have the potential to accelerate tumorigenesis [6]. Thus, the detection of CRC-associated bacteria in the gut microbiota could be a promising method for CRC screening. Although some members of the gut microbiota have been shown to contribute to the onset and progression of CRC through various mechanisms, they are not present in all cases [4, 7, 8]. It is unclear how many cases of CRC can be attributed to these pathogens, and whether changes in microbial abundance could provide the basis for an accurate CRC screening test.

Machine learning (ML), a major branch of artificial intelligence (AI), can be used to increase our understanding of changes in existing data structures and to make predictions about new data. It has been used in a wide variety of studies, such as DNA methylation associated with genetic diseases [9], the diagnosis of Alzheimer’s disease using imaging data [10], the prediction of gastrointestinal disease development using continuous variable fitting techniques [11], and the automatic detection of gastrointestinal lesions by computer vision in endoscopes [12].
ML algorithms and new computational models offer the opportunity to generate computational drug networks to evaluate the efficacy of approved drugs relative to relevant oncogenic targets, as well as to select patients with better responses or better disease biomarkers [10]. In the field of digital pathology, the emergence of AI and ML tools makes it possible to mine new morphological phenotypes and improve patient management for a variety of cancer types [13]. ML enables computer programs to automatically analyse large amounts of data and determine which information is most relevant.

At present, several studies have identified and elucidated the pathogenicity of certain intestinal microorganisms. For example, enterotoxigenic Bacteroides fragilis is a typical pathogen that causes CRC by upregulating inflammatory factors, releasing reactive oxygen species, inducing intestinal inflammation, and promoting the formation of polyps and tumours [14, 15]. In the study by Yachida et al. [16], principal component analysis (PCA) was used to select Bacteroides and Prevotella, the two types of bacteria with the greatest variation in abundance, from the faeces of CRC patients and a healthy control (HC) group. Both bacteria are major contributors to the gut flora of CRC patients. Guo et al. [17] reported that a highly accurate CRC diagnostic model was developed by combining the results of quantitative PCR (qPCR) of the abundance of three gut bacteria, Fusobacterium nucleatum, Faecalibacterium prausnitzii and Bifidobacterium spp. In another study, Fusobacterium, Porphyromonas and Peptostreptococcus were all found to be enriched in CRC patients using a metagenomic classifier [18]. Notably, some bacteria whose abundance differs significantly between CRC patients and normal controls are recognized by many different algorithms and used as key parameters for prediction.
For example, in the Chinese population, Methanosphaera_stadtmanae_DSM_3091 was identified as a key parameter by the filtered classifier, sequential minimal optimization (SMO), logistic and naïve Bayes models. Another dominant bacterium, Blautia_uncultured_Firmicutes_bacterium, was used as a key parameter by the random tree, J48 and PART algorithms [19].

Although there is increasing recognition of the potential of the faecal microbiome in the detection of CRC, the choice of classification models is diverse. Because each algorithm has its own default parameters, it is unclear which modelling algorithm is most suitable for CRC diagnostic screening studies. In this study, we systematically evaluated the performance of supervised classifiers in diagnosing CRC based on gut microbiota. We recruited 38 participants and sequenced the hypervariable regions of the 16S rRNA gene from the faeces of each individual, used different supervised ML algorithms to test their performance in the diagnosis of CRC based on gut microbiota, and identified several bacteria potentially associated with dysbiosis in CRC.

METHODS

Participants and sample collection

Patients and healthy volunteers were recruited from the First Affiliated Hospital of Guangxi Medical University (Guangxi, China) between August 2020 and February 2021. The inclusion criteria for the CRC group were as follows: (1) tumour site was clear and biopsy-confirmed; (2) no radiation or chemotherapy before sampling; (3) no antibiotics or probiotics within 1 month; and (4) complete case data were available. Healthy volunteers matched to the CRC patients by age and gender were recruited as the HC group at the Guangxi Medical University Center for Physical Examination. The inclusion criteria for HCs were as follows: (1) no gastrointestinal-related diseases; and (2) no antibiotics or probiotics within 1 month. The exclusion criteria for both groups were as follows: (1) diseases related to intestinal flora, such as inflammatory bowel disease, diabetes, peptic ulcers, etc.; (2) pregnant or lactating women; and (3) a family history of bowel disease, such as familial adenomatous polyposis. All the volunteers understood and signed informed consent before inclusion in the study. This study was approved by the Ethics Committee of Guangxi Medical University. Faecal samples were collected from HCs and patients after CRC surgery, placed in sterile boxes on ice and transported immediately to the laboratory. Each sample was evenly divided into sterile tubes and immediately frozen at −80 °C.


DNA extraction and sequencing

DNA extraction from faecal samples was performed using the FastDNA Spin Kit for Soil (MP Biomedicals), according to the manufacturer’s instructions. DNA integrity was assessed using 1 % agarose gel electrophoresis, and purity and concentration were assessed using a NanoDrop2000 UV spectrophotometer (Thermo Fisher Scientific). The V3–V4 hypervariable regions of the bacterial 16S rRNA gene were amplified in a thermocycler PCR system (ABI GeneAmp 9700) using the following primer pair: forward 338F (ACTCCTACGGGAGGCAGCAG) and reverse 806R (GGACTACHVGGGTWTCTAAT). Purified amplicons were pooled in equimolar concentrations and sequenced on an Illumina MiSeq platform (Illumina) in PE300 mode, according to standard protocols provided by Majorbio Bio-Pharm Technology. Raw FASTQ files were demultiplexed, quality filtered via Trimmomatic and merged by FLASH, according to the following criteria: (i) the reads were truncated at any site and received an average quality score […]

[…] 0.7, KNN, DT, RF, SVM, LR and XGB performed better than other models. DT had higher accuracy, sensitivity and specificity than other supervised ML models (Table 2; Fig. 2a). Regarding the Brier score, SVM had the lowest value, indicating the best calibration. Decision curves for the simulated data sets under the different models are shown in Fig. 2b. When the threshold probability was ≥0.7, the clinical net benefits of DT, LR, RF, XGB and SVM were higher than the All curve and None curve. In summary, the DT, RF and SVM models constructed by LEfSe screening had better performance and were better suited to predicting and identifying subjects with CRC.

Supervised ML models trained with LEfSe and LASSO regression model

To further improve model performance, LASSO regression was applied to the OTUs selected by LEfSe. Eighteen biomarkers (OTU620, OTU171, OTU459, OTU462, OTU732, OTU844, OTU745, OTU1111, OTU796, OTU692, OTU1203, OTU852, OTU1110, OTU714, OTU897, OTU1090, OTU618 and OTU1062) were identified as optimal for the diagnosis of CRC and could potentially serve as non-invasive tools for early diagnosis. Interestingly, based on the LEfSe and LASSO regression analysis, the AUC of NB improved from 0.593 to 0.926, and its accuracy, sensitivity and calibration also improved significantly, but its specificity decreased to 0.667. Similarly, the AUC of RF and LR improved, and their accuracy and sensitivity increased significantly. However, all DT performance measures decreased significantly (Table 2; Fig. 3a). When the risk threshold was >0.5, the clinical net benefit of the NB, SVM, RF and LR models was higher than that of the All curve and the None curve, and the range of the clinical net benefit threshold was large. However, it should be noted that the net clinical benefit of the four models decreased with increasing threshold (Fig. 3b). In conclusion, the OTUs obtained through LASSO screening yielded better-performing NB, RF and LR models, which were better suited to the prediction and identification of CRC subjects.
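The shrinkage idea behind LASSO screening can be illustrated with a minimal sketch. It assumes a squared-error LASSO solved by coordinate descent, which may differ from the loss function and software the authors used. The two features are hypothetical stand-ins for centred OTU abundances, and the label coding (+1 for CRC, −1 for HC) is chosen only for illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values inside [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # sum of feature j times the residual that ignores feature j
            z = 0.0    # sum of squared values of feature j
            for i in range(n):
                pred_without_j = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = 0.0 if z == 0 else soft_threshold(rho / n, lam) / (z / n)
    return w


# Hypothetical toy data: rows are samples, columns are two centred OTU features.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]  # +1 = CRC, -1 = HC (illustrative coding)

weights = lasso_coordinate_descent(X, y, lam=0.25)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
print(weights, selected)  # → [0.75, 0.0] [0]
```

In this toy case the first feature tracks the label and keeps a shrunken weight, while the uninformative second feature is set exactly to zero, which is the behaviour that makes LASSO usable as a feature screen.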

Identifying the top ten most important OTUs in the RF model

We selected the top ten optimal OTU biomarkers based on their importance scores using the RF model. LEfSe analysis showed that, in order of importance, the bacterial species in the RF model were OTU732 (s__metagenome_g__Lachnospiraceae_ND3007_group), OTU163 (s__uncultured_organism_g__Gemella), OTU243 (s__Ralstonia_pickettii), OTU1149 (s__uncultured_organism_g__Fusicatenibacter), OTU620 (s__Escherichia_coli_g__Escherichia-Shigella), OTU171


Fig. 1. Analysis of differences in gut microbial abundance between colorectal cancer patients and the healthy control group. (a) Linear discriminant analysis effect size bar graph showing different bacterial taxa. (b) Cladogram showing the phylogenetic relationships of different bacterial taxa.


Fig. 2. OTU screening by LEfSe analysis. (a) ROC curves showing the test performance of eight different ML algorithm models trained using LEfSe analysis. (b) Decision curve analysis of CRC diagnosis by eight ML algorithms based on LEfSe analysis. OTU, operational taxonomic unit; LEfSe, linear discriminant analysis effect size; ROC, receiver operating characteristic; KNN, k-nearest neighbours; DT, decision tree; NB, naïve Bayes; NN, neural network; RF, random forest; SVM, support vector machines; LR, logistic regression; XGB, extreme gradient boosting.
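Decision curves such as those in panel (b) compare each model with the All and None reference strategies. The sketch below shows the standard net-benefit calculation that underlies such curves. The function names and all counts are hypothetical and are not taken from the study.

```python
def net_benefit(tp, fp, n, threshold):
    """Net benefit of a model at a threshold probability (0 < threshold < 1)."""
    odds = threshold / (1.0 - threshold)  # weight given to each false positive
    return tp / n - (fp / n) * odds


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of the 'All' strategy: every subject is classified as positive."""
    odds = threshold / (1.0 - threshold)
    return prevalence - (1.0 - prevalence) * odds


# Hypothetical test set: 12 subjects, 9 with CRC.
# The hypothetical model flags 9 true positives and 1 false positive.
n, n_crc = 12, 9
model_nb = net_benefit(tp=9, fp=1, n=n, threshold=0.5)
all_nb = net_benefit_treat_all(prevalence=n_crc / n, threshold=0.5)
none_nb = 0.0  # the 'None' strategy classifies nobody as positive
print(round(model_nb, 3), round(all_nb, 3), none_nb)  # → 0.667 0.5 0.0
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both reference values. Under this formula the All curve reaches zero when the threshold equals the prevalence.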


Table 2. Summary of performance of algorithms

Feature            Algorithm  Accuracy  AUC    Sensitivity  Specificity  Brier score
LEfSe              KNN        0.333     0.815  0.111        1.000        0.316
LEfSe              DT         0.750     0.833  0.667        1.000        0.270
LEfSe              NB         0.250     0.593  0.000        1.000        0.417
LEfSe              NN         0.417     0.556  0.333        0.667        0.340
LEfSe              RF         0.583     0.889  0.444        1.000        0.205
LEfSe              SVM        0.583     0.926  0.444        1.000        0.178
LEfSe              LR         0.250     0.815  0.222        0.333        0.250
LEfSe              XGB        0.500     0.889  0.333        1.000        0.203
LEfSe and LASSO    KNN        0.333     0.907  0.111        1.000        0.267
LEfSe and LASSO    DT         0.417     0.500  0.333        0.667        0.449
LEfSe and LASSO    NB         0.917     0.926  1.000        0.667        0.083
LEfSe and LASSO    NN         0.667     0.778  0.778        0.333        0.273
LEfSe and LASSO    RF         0.750     0.926  0.778        0.667        0.159
LEfSe and LASSO    SVM        0.417     1.000  0.222        1.000        0.121
LEfSe and LASSO    LR         0.750     0.889  0.778        0.667        0.250
LEfSe and LASSO    XGB        0.667     0.593  0.556        1.000        0.302

AUC, area under the receiver operating characteristic curve; DT, decision tree; KNN, k-nearest neighbours; LASSO, least absolute shrinkage and selection operator; LEfSe, linear discriminant analysis effect size; LR, logistic regression; NB, naïve Bayes; NN, neural network; RF, random forest; SMO, sequential minimal optimization; SVM, support vector machines; XGB, extreme gradient boosting.
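For reference, measures of the kind reported in Table 2 can be computed as in the sketch below. The confusion-matrix counts are hypothetical: they were chosen because they reproduce the NB row under LEfSe and LASSO, and the implied test-set composition (9 CRC and 3 HC subjects) is an assumption rather than something stated in this section. The probabilities in the Brier score example are likewise invented.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical counts (assumed 9 CRC and 3 HC test subjects).
metrics = classification_metrics(tp=9, fn=0, tn=2, fp=1)
rounded = {name: round(value, 3) for name, value in metrics.items()}
print(rounded)  # → {'accuracy': 0.917, 'sensitivity': 1.0, 'specificity': 0.667}

# Hypothetical predicted CRC probabilities and true labels (1 = CRC, 0 = HC).
score = brier_score([0.9, 0.8, 0.3, 0.6], [1, 1, 0, 0])
print(round(score, 3))  # → 0.125
```

A lower Brier score indicates predicted probabilities that lie closer to the observed outcomes, which is why it is read here as a calibration measure.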

(s__unclassified_g__Prevotella), OTU1067 (s__unclassified_f__Lachnospiraceae), OTU742 (s__unclassified_g__Lachnospira), OTU137 (s__Prevotella_intermedia) and OTU616 (s__unclassified_g__Burkholderia-Caballeronia-Paraburkholderia) (Fig. 4a). After further LASSO regression analysis of the LEfSe results, the bacteria in the RF model, ranked by importance, were OTU732 (s__metagenome_g__Lachnospiraceae_ND3007_group), OTU171 (s__unclassified_g__Prevotella), OTU620 (s__Escherichia_coli_g__Escherichia-Shigella), OTU459 (s__unclassified_g__Ruminococcus_torques_group), OTU462 (s__Haemophilus_parainfluenzae), OTU1111 (s__uncultured_bacterium_g__Family_XIII_AD3011_group), OTU796 (s__unclassified_g__Bacteroides), OTU844 (s__unclassified_g__norank_f__norank_o__Clostridia_UCG-014), OTU1062 (s__Dialister_invisus_DSM_15470) and OTU852 (s__unclassified_g__Eubacterium). The differential microbiota common to both rankings were s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella (Fig. 4b). To further investigate whether these gut microbiota can be used as key microbiota for identification and diagnosis, we compared their ROC curves. The results illustrated that s__metagenome_g__Lachnospiraceae_ND3007_group, s__Escherichia_coli_g__Escherichia-Shigella and s__unclassified_g__Prevotella, based on the criterion of AUC>0.7, had the potential to diagnose CRC (AUC>0.7, P