A machine learning framework to determine geolocations from metagenomic profiling.

Affiliations

School of Informatics, Xiamen University, Xiamen, China.

Aginome Scientific Pte. Ltd., Xiamen, China.

Publication information

Biol Direct. 2020 Nov 23;15(1):27. doi: 10.1186/s13062-020-00278-z.

DOI:10.1186/s13062-020-00278-z

PMID:33225966

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7682025/

Abstract

BACKGROUND

Studies on metagenomic data from environmental microbial samples have found that microbial communities appear to be geolocation-specific, and that the microbiome abundance profile can serve as a differentiating feature for identifying a sample's geolocation. In this paper, we present a machine learning framework to determine geolocations from the metagenomic profiling of microbial samples.

RESULTS

Our method was applied to the multi-source microbiome data from the MetaSUB (Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.

First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets, trained the prediction models on the training set, and evaluated prediction performance on the validation set. Using logistic regression with L2 normalization, the model reaches a prediction accuracy of 86%, averaged over 100 random splits of the training and validation datasets.

The testing data consist of samples from cities that do not occur in the training data. To predict these previously unsampled "mystery" cities, we first defined biological coordinates for the sampled cities based on the similarity of their microbial samples. We then performed an affine transformation on the map so that the distance between cities measures their biological difference rather than their geographical distance. Finally, using Kriging interpolation, we derived the probabilities that a given testing sample comes from each unsampled city from its predicted probabilities on the sampled cities. The results show that this method can successfully assign high probabilities to the true cities of origin of the testing samples.
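As a rough illustration of the validation protocol in the Results, the following Python sketch trains a multinomial logistic regression on toy abundance profiles and averages validation accuracy over repeated random splits. It is not the authors' code: the synthetic data, the function names (`make_toy_profiles`, `train_logreg`, `mean_split_accuracy`), the hyperparameters, and the 80/20 split ratio are all assumptions. "L2 normalization" is read here as an L2 penalty on the weights, which is also an assumption. The example runs 10 splits to stay fast; the abstract reports 100.

```python
import math
import random


def make_toy_profiles(n_per_city=20, n_cities=3, seed=1):
    """Synthetic relative-abundance profiles: one enriched taxon per city (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for city in range(n_cities):
        for _ in range(n_per_city):
            raw = [rng.random() + (2.0 if t == city else 0.0) for t in range(n_cities)]
            total = sum(raw)
            X.append([v / total for v in raw])  # abundances sum to 1
            y.append(city)
    return X, y


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, x):
    xb = list(x) + [1.0]  # append a constant 1 for the bias term
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in W])


def train_logreg(X, y, n_classes, l2=0.01, lr=1.0, epochs=100):
    """Multinomial logistic regression fitted by batch gradient descent.

    The L2 penalty is applied to feature weights only, not to the bias.
    """
    n_features = len(X[0])
    W = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for xi, yi in zip(X, y):
            xb = list(xi) + [1.0]
            probs = predict_proba(W, xi)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == yi else 0.0)
                for j, v in enumerate(xb):
                    grad[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                penalty = l2 * W[k][j] if j < n_features else 0.0
                W[k][j] -= lr * (grad[k][j] / len(X) + penalty)
    return W


def predict(W, x):
    probs = predict_proba(W, x)
    return max(range(len(probs)), key=probs.__getitem__)


def mean_split_accuracy(X, y, n_classes, n_splits=10, val_fraction=0.2, seed=0):
    """Average validation accuracy over repeated random train/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        n_val = max(1, int(len(idx) * val_fraction))
        val, train = idx[:n_val], idx[n_val:]
        W = train_logreg([X[i] for i in train], [y[i] for i in train], n_classes)
        correct = sum(predict(W, X[i]) == y[i] for i in val)
        accuracies.append(correct / n_val)
    return sum(accuracies) / len(accuracies)


X, y = make_toy_profiles()
acc = mean_split_accuracy(X, y, n_classes=3, n_splits=10)
print(f"mean validation accuracy over 10 splits: {acc:.3f}")
```

Averaging over many random splits reduces the dependence of the reported accuracy on any single partition of the data. The toy accuracy printed here says nothing about the 86% figure in the abstract.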

CONCLUSION

Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples' geolocations for samples from locations that are not in the training dataset.
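The extension to unsampled cities described in the Results relies on Kriging interpolation over city coordinates. A minimal sketch of ordinary kriging in pure Python follows, assuming a linear variogram γ(h) = h, hypothetical 2-D "biological coordinates" for four sampled cities, and made-up predicted probabilities for one test sample. The variogram model, the coordinates, and the helper names (`solve`, `ordinary_kriging`) are illustrative assumptions; the abstract does not specify them.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ordinary_kriging(coords, values, target, variogram=lambda h: h):
    """Estimate the value at `target` from known (coords, values).

    Builds the ordinary kriging system with a Lagrange multiplier that
    forces the weights to sum to one, then returns (estimate, weights).
    """
    n = len(coords)
    A = [
        [variogram(math.dist(coords[i], coords[j])) for j in range(n)] + [1.0]
        for i in range(n)
    ]
    A.append([1.0] * n + [0.0])
    b = [variogram(math.dist(c, target)) for c in coords] + [1.0]
    weights = solve(A, b)[:n]  # drop the Lagrange multiplier
    estimate = sum(w * v for w, v in zip(weights, values))
    return estimate, weights


# Hypothetical biological coordinates of four sampled cities.
city_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Made-up classifier probabilities of one test sample for those cities.
sample_probs = [0.1, 0.7, 0.1, 0.1]

# Interpolated score for a hypothetical unsampled city at (0.5, 0.5).
score, weights = ordinary_kriging(city_coords, sample_probs, (0.5, 0.5))
print(f"interpolated score: {score:.3f}, weight sum: {sum(weights):.3f}")
```

Interpolated values are not guaranteed to lie in [0, 1] or to sum to one across candidate cities, so a renormalization step may be needed before treating them as probabilities. Whether the authors apply one is not stated in the abstract.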

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e3f/7682025/4bda226fa86d/13062_2020_278_Fig1_HTML.jpg
