Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data.

Affiliation

National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, Manitoba, R3E 3R2, Canada.

Publication information

Biol Direct. 2020 Dec 10;15(1):29. doi: 10.1186/s13062-020-00287-y.

DOI:10.1186/s13062-020-00287-y

PMID:33302990

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7731568/

Abstract

BACKGROUND

Metagenomic sequencing provides microbial abundance patterns that can be leveraged to predict sample origin. Supervised machine learning classifiers have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated how varying technical, analytical and machine learning approaches influence result interpretation and the prediction of novel sources.

RESULTS

Comparison of 16S rRNA amplicon and shotgun sequencing approaches, as well as of metagenomic analytical tools, showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data analyzed with Kraken2 and Bracken for taxonomic annotation had higher detection sensitivity. Because classification models are limited to labeling pre-trained origins, we took an alternative approach, using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, prediction errors were much higher in leave-1-city-out than in 10-fold cross-validation; the former more realistically reflects the increased difficulty of accurately predicting samples from new origins. This challenge was further confirmed when the model was applied to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Because prediction error rates are higher for samples from new origins, we provide an additional strategy, based on prediction ambiguity, to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.
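The gap between the two validation schemes can be illustrated with a minimal sketch of how the splits differ. This is not the authors' code: the helper names (`k_fold_splits`, `leave_one_city_out_splits`, `unseen_fraction`) and the toy city labels are assumptions made for illustration only. The sketch measures what fraction of test samples come from a city that is absent from the corresponding training fold.

```python
import random
from collections import defaultdict


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def leave_one_city_out_splits(cities):
    """Yield (city, train_idx, test_idx), holding out every sample of one city."""
    by_city = defaultdict(list)
    for i, city in enumerate(cities):
        by_city[city].append(i)
    for city in sorted(by_city):
        train = [i for i, c in enumerate(cities) if c != city]
        yield city, train, by_city[city]


def unseen_fraction(cities, splits):
    """Fraction of test samples whose city never appears in the training fold."""
    unseen = total = 0
    for train, test in splits:
        train_cities = {cities[i] for i in train}
        unseen += sum(cities[i] not in train_cities for i in test)
        total += len(test)
    return unseen / total


# Hypothetical city labels for 30 samples (illustrative, not CAMDA data).
cities = ["city_A"] * 12 + ["city_B"] * 10 + ["city_C"] * 8

kfold_unseen = unseen_fraction(cities, k_fold_splits(len(cities), k=10))
loco_unseen = unseen_fraction(
    cities, ((train, test) for _, train, test in leave_one_city_out_splits(cities))
)
print(kfold_unseen, loco_unseen)
```

With these toy labels, every 10-fold test sample has its city represented in training, whereas every leave-1-city-out test sample comes from an unseen city. A classifier can therefore never label a held-out city correctly, and a regression model must extrapolate to it, which is consistent with the higher errors reported above.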

CONCLUSIONS

We highlight the capacity to predict sample origin accurately for pre-trained origins, and the challenge of predicting new origins, with both regression and classification models. Overall, this work summarizes the impact of sequencing technique, protocol, taxonomic analytical approach, and machine learning approach on the use of metagenomics for sample origin prediction.
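The abstract does not specify how prediction ambiguity is quantified. As one possible reading, the sketch below flags a sample as a candidate new origin when the classifier's class-probability vector is close to uniform, using normalized Shannon entropy. The entropy measure, the function names, the 0.8 threshold and the example probabilities are all assumptions made for illustration, not the authors' method.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of a class-probability vector, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))  # 1.0 means a uniform (maximally ambiguous) vector


def flag_new_origin(probs, threshold=0.8):
    """Return True when the prediction is ambiguous enough to suspect a new origin.

    The threshold is a hypothetical value; in practice it would need tuning,
    for example on leave-1-city-out predictions.
    """
    return normalized_entropy(probs) > threshold


# Hypothetical per-city probabilities from a four-city classifier.
confident = [0.94, 0.02, 0.02, 0.02]  # one city dominates
ambiguous = [0.30, 0.25, 0.25, 0.20]  # no city stands out

print(flag_new_origin(confident), flag_new_origin(ambiguous))
```

A sample from a previously sampled city would be expected to yield a peaked probability vector, while a sample from an unseen city is more likely to spread probability across several known cities.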

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e42d/7731568/fd0f2c78dec1/13062_2020_287_Fig1_HTML.jpg
