Massive metagenomic data analysis using abundance-based machine learning.

Affiliations

Department of Biology, Saint Louis University, Saint Louis, MO, 63103, USA.

Program in Bioinformatics and Computational Biology, Saint Louis University, Saint Louis, MO, 63103, USA.

Publication information

Biol Direct. 2019 Aug 1;14(1):12. doi: 10.1186/s13062-019-0242-0.

DOI:10.1186/s13062-019-0242-0

PMID:31370905

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6676585/

Abstract

BACKGROUND

Metagenomics is the application of modern genomic techniques to investigate the members of a microbial community directly in their natural environments, and it is widely used to survey the microbial communities that live in diverse ecosystems. To understand the metagenomic profile of one of the densest interaction spaces for millions of people, the public transit system, the MetaSUB International Consortium has collected and sequenced metagenomes from subways of different cities across the world. In collaboration with CAMDA, MetaSUB has made the metagenomic samples from these cities available for an open data analysis challenge that includes, but is not limited to, the identification of unknown samples.

RESULTS

To distinguish the metagenomic profiles of different cities and to accurately predict unknown samples from those profiles, two machine learning approaches are proposed: a read-based method that builds a taxonomic profile of each sample and predicts from it, and a reduced representation assembly-based method. Among the machine learning techniques tested, random forest showed promising results as a suitable classifier for both approaches. Random forest models developed from read-based taxonomic profiling achieved an accuracy of 91%, with a 95% confidence interval of 80 to 93%. The assembly-based random forest model also reached 90% accuracy. However, both models achieved roughly the same accuracy on the test set, where both failed to predict the most abundant label.
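As an illustration of how an accuracy figure and its 95% confidence interval can be derived from a set of predictions, the following is a minimal sketch. It assumes the Wilson score interval; the abstract does not state which interval method the authors used, and the function name and the toy city labels are hypothetical.

```python
import math


def accuracy_with_wilson_ci(y_true, y_pred, z=1.96):
    """Return (accuracy, lower, upper) for a list of predictions.

    Assumption: the Wilson score interval is used here for illustration;
    the source does not say how its confidence interval was computed.
    z=1.96 corresponds to a two-sided 95% interval.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    p_hat = correct / n

    # Wilson score interval for a binomial proportion.
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return p_hat, max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical labels: ten samples, nine predicted correctly.
    truth = ["NYC"] * 5 + ["Tokyo"] * 5
    preds = ["NYC"] * 5 + ["Tokyo"] * 4 + ["NYC"]
    acc, lower, upper = accuracy_with_wilson_ci(truth, preds)
    print(f"accuracy={acc:.2f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

With small test sets the interval is wide and asymmetric around the point estimate, which is one reason a headline accuracy alone can be misleading.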

CONCLUSION

Our results suggest that both read-based and assembly-based approaches are powerful tools for the analysis of metagenomics data. Moreover, our results suggest that reduced representation assembly-based methods can simultaneously provide high-accuracy prediction on available data. Overall, we show that metagenomic samples can be traced back to their location through careful generation of features from the composition of microbes and the use of existing machine learning algorithms. The proposed approaches show high prediction accuracy, but their output requires careful inspection before any decisions are made, owing to sample noise or complexity.
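To make the idea of abundance-based features concrete, here is a small sketch of how per-sample taxon read counts could be turned into relative-abundance vectors suitable as classifier input. The data layout, function name, and taxon names are assumptions for illustration and are not taken from the authors' pipeline.

```python
def relative_abundance_matrix(samples):
    """Convert raw taxon read counts into relative-abundance feature rows.

    samples: dict mapping sample id -> dict mapping taxon name -> read count
             (a hypothetical layout chosen for this sketch).
    Returns (taxa, rows): a sorted list of taxa defining the feature order,
    and a dict mapping sample id -> list of fractions aligned with `taxa`.
    """
    # Union of all taxa seen in any sample fixes a shared feature order.
    taxa = sorted({taxon for counts in samples.values() for taxon in counts})

    rows = {}
    for sample_id, counts in samples.items():
        total = sum(counts.values())
        # Taxa absent from a sample contribute 0; an empty sample yields all zeros.
        rows[sample_id] = [
            counts.get(taxon, 0) / total if total else 0.0 for taxon in taxa
        ]
    return taxa, rows


if __name__ == "__main__":
    # Hypothetical counts for two samples.
    toy = {
        "sample_1": {"Taxon_A": 3, "Taxon_B": 1},
        "sample_2": {"Taxon_B": 2, "Taxon_C": 2},
    }
    taxa, rows = relative_abundance_matrix(toy)
    print(taxa)
    for sample_id, row in rows.items():
        print(sample_id, row)
```

Normalizing to fractions removes the effect of sequencing depth, so samples with very different read totals become comparable before they are passed to a classifier such as a random forest.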

REVIEWERS

This article was reviewed by Eugene V. Koonin, Jing Zhou and Serghei Mangul.

