

A machine learning framework to determine geolocations from metagenomic profiling.

Affiliations

School of Informatics, Xiamen University, Xiamen, China.

Aginome Scientific Pte. Ltd., Xiamen, China.

Publication information

Biol Direct. 2020 Nov 23;15(1):27. doi: 10.1186/s13062-020-00278-z.

Abstract

BACKGROUND

Studies of metagenomic data from environmental microbial samples have found that microbial communities appear to be geolocation-specific and that the microbiome abundance profile can serve as a differentiating feature for identifying a sample's geolocation. In this paper, we present a machine learning framework to determine geolocations from metagenomic profiling of microbial samples.

RESULTS

Our method was applied to the multi-source microbiome data from the MetaSUB (Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.

First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets, trained the prediction models on the training set, and evaluated prediction performance on the validation set. Using logistic regression with L2 normalization, the model reaches a prediction accuracy of 86%, averaged over 100 random splits into training and validation sets.

The testing data consist of samples from cities that do not occur in the training data. To predict these previously unsampled "mystery" cities, we first defined biological coordinates for the sampled cities based on the similarity of their microbial samples. We then applied an affine transformation to the map so that the distance between cities measures their biological difference rather than their geographical distance. Finally, using Kriging interpolation, we derived the probability that a given testing sample originates from each unsampled city from its predicted probabilities for the sampled cities. The results show that this method can successfully assign high probabilities to the true cities of origin of the testing samples.
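The repeated random-split evaluation described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes scikit-learn, reads "L2 normalization" as the L2 penalty that `LogisticRegression` applies by default, and uses synthetic data in place of the MetaSUB abundance profiles. The helper name `mean_split_accuracy`, the 80/20 split, and the value of `C` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def mean_split_accuracy(X, y, n_splits=100, test_size=0.2, C=1.0):
    """Average validation accuracy over repeated random train/validation splits."""
    scores = []
    for seed in range(n_splits):
        # A different seed gives a different random split on each iteration.
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed
        )
        # LogisticRegression uses an L2 penalty by default;
        # C is the inverse of the regularization strength.
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_va, y_va))
    return float(np.mean(scores))


# Synthetic stand-in for abundance features (hypothetical, not MetaSUB data):
# three "cities", each with its own mean profile over 20 "taxa".
rng = np.random.default_rng(0)
n_cities, n_per_city, n_taxa = 3, 40, 20
centers = rng.normal(scale=3.0, size=(n_cities, n_taxa))
X_demo = np.vstack([c + rng.normal(size=(n_per_city, n_taxa)) for c in centers])
y_demo = np.repeat(np.arange(n_cities), n_per_city)

demo_accuracy = mean_split_accuracy(X_demo, y_demo, n_splits=10)
print(f"mean validation accuracy on synthetic data: {demo_accuracy:.3f}")
```

The accuracy printed here reflects only the synthetic data and says nothing about the 86% reported in the abstract. Averaging over many splits, as the abstract does, reduces the dependence of the estimate on any single partition of the data.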

CONCLUSION

Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict the geolocations of metagenomic samples from locations that are not in the training dataset.
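The extension to unsampled cities described in the Results (biological coordinates, an affine transformation, then Kriging interpolation) might look roughly like the following sketch. The abstract does not give the details, so everything here is an assumption: the affine map is fitted by least squares, ordinary kriging is used with a linear variogram, and the coordinates and probabilities are invented for illustration. The function names `fit_affine` and `ordinary_kriging` are hypothetical.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares affine map (A, b) such that src @ A + b approximates dst."""
    src_h = np.hstack([src, np.ones((len(src), 1))])  # homogeneous coordinates
    params, *_ = np.linalg.lstsq(src_h, dst, rcond=None)
    return params[:-1], params[-1]


def ordinary_kriging(coords, values, target, variogram=lambda h: h):
    """Ordinary kriging estimate at `target` from values observed at `coords`.

    The default variogram is linear, gamma(h) = h (an assumption of this sketch).
    """
    n = len(coords)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Kriging system: variogram matrix bordered by the constraint
    # that the weights sum to one.
    K = np.ones((n + 1, n + 1))
    K[:n, :n] = variogram(dists)
    K[n, n] = 0.0
    rhs = np.ones(n + 1)
    rhs[:n] = variogram(np.linalg.norm(coords - target, axis=1))
    weights = np.linalg.solve(K, rhs)[:n]  # drop the Lagrange multiplier
    return float(weights @ values), weights


# Hypothetical inputs: map positions of four sampled cities and the
# "biological coordinates" assigned to them.
geo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
bio = np.array([[0.0, 0.0], [2.0, 0.1], [0.2, 1.0], [2.1, 1.2]])

# Fit the affine map on the sampled cities, then move every location,
# including an unsampled city, into the transformed space.
A, b = fit_affine(geo, bio)
sampled_xy = geo @ A + b
unsampled_xy = np.array([0.5, 0.5]) @ A + b

# Hypothetical classifier probabilities of one test sample for the sampled cities.
probs = np.array([0.6, 0.2, 0.15, 0.05])
score, weights = ordinary_kriging(sampled_xy, probs, unsampled_xy)
print("interpolated score for the unsampled city:", score)
```

With a zero-nugget variogram such as this one, ordinary kriging reproduces the observed value exactly at a sampled location, and the weights always sum to one. The interpolated score is not guaranteed to lie in [0, 1], so scores across candidate cities would need to be rescaled before being read as probabilities.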


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6e3f/7682025/4bda226fa86d/13062_2020_278_Fig1_HTML.jpg
