Systematic evaluation of supervised machine learning for sample origin prediction using metagenomic sequencing data.

Affiliation

National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, Manitoba, R3E 3R2, Canada.

Publication information

Biol Direct. 2020 Dec 10;15(1):29. doi: 10.1186/s13062-020-00287-y.

Abstract

BACKGROUND

The advent of metagenomic sequencing provides microbial abundance patterns that can be leveraged for sample origin prediction. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated how varying the technical, analytical and machine learning approaches influences result interpretation and the prediction of novel sources.

RESULTS

Comparison of 16S rRNA amplicon and shotgun sequencing approaches, and of metagenomic analytical tools, showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data taxonomically annotated with Kraken2 and Bracken had higher detection sensitivity. Because classification models can only assign origins seen during training, we took an alternative approach for comparison, using Lasso-regularized multivariate regression to predict geographic coordinates. In both models, prediction errors were much higher in leave-1-city-out than in 10-fold cross-validation; the former realistically reflects the increased difficulty of accurately predicting samples from new origins. This challenge was further confirmed when the models were applied to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Because prediction error rates were higher for samples from new origins, we provide an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.
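The difference between the two validation schemes, and one possible reading of "prediction ambiguity", can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the city labels, the use of normalized Shannon entropy as the ambiguity score, and the 0.8 threshold are all assumptions made for illustration, since the abstract does not specify how ambiguity was measured.

```python
import math
import random
from collections import defaultdict


def leave_one_city_out(cities):
    """Yield (held_out_city, train_idx, test_idx), holding out each city once.

    The held-out city contributes no samples to training, which mimics
    predicting a sample from an origin the model has never seen.
    """
    by_city = defaultdict(list)
    for i, city in enumerate(cities):
        by_city[city].append(i)
    for city in sorted(by_city):
        test_idx = by_city[city]
        train_idx = [i for i, c in enumerate(cities) if c != city]
        yield city, train_idx, test_idx


def k_fold(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) for shuffled k-fold cross-validation.

    Samples from the same city usually land in both train and test folds,
    so every test origin has already been "seen" during training.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    for fold in range(k):
        test_idx = idx[fold::k]
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        yield train_idx, test_idx


def normalized_entropy(probs):
    """Shannon entropy of class probabilities, scaled to the range [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def looks_like_new_origin(probs, threshold=0.8):
    """Flag a sample whose class probabilities are spread across many cities.

    The entropy score and the threshold are hypothetical choices.
    """
    return normalized_entropy(probs) >= threshold


# Hypothetical labels: three cities with four samples each.
cities = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
for city, train_idx, test_idx in leave_one_city_out(cities):
    assert city not in {cities[i] for i in train_idx}
```

In this sketch a classifier that outputs near-uniform probabilities over the trained cities would be flagged as a possible new origin, while a confident prediction would not; any real threshold would need to be tuned on held-out cities.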

CONCLUSIONS

Herein, we highlight the capacity to predict sample origin accurately when the origin was seen in training, and the challenge of predicting new origins, using both regression and classification models. Overall, this work summarizes the impact of sequencing technique, protocol, taxonomic analytical approach, and machine learning approach on the use of metagenomics for prediction of sample origin.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e42d/7731568/fd0f2c78dec1/13062_2020_287_Fig1_HTML.jpg