National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, Manitoba, R3E 3R2, Canada.
Biol Direct. 2020 Dec 10;15(1):29. doi: 10.1186/s13062-020-00287-y.
Metagenomic sequencing yields microbial abundance patterns that can be leveraged to predict sample origin. Supervised machine learning classification approaches have been reported to predict sample origin accurately when the origin has been previously sampled. Using metagenomic datasets provided by the 2019 CAMDA challenge, we evaluated how different technical, analytical and machine learning approaches influence result interpretation and the prediction of novel sources.
Comparing 16S rRNA amplicon with shotgun sequencing approaches, and comparing metagenomic analytical tools, showed differences in normalized microbial abundance, especially for organisms present at low abundance. Shotgun sequence data taxonomically annotated with Kraken2 and Bracken had higher detection sensitivity. Because classification models can only assign the origins on which they were trained, we also took an alternative approach, using Lasso-regularized multivariate regression to predict geographic coordinates for comparison. In both models, prediction errors were much higher in Leave-1-city-out than in 10-fold cross validation; the former more realistically reflects the increased difficulty of accurately predicting samples from new origins. This challenge was further confirmed when the model was applied to a set of samples obtained from new origins. Overall, the prediction performance of the regression and classification models, as measured by mean squared error, was comparable on mystery samples. Because prediction error rates were higher for samples from new origins, we provide an additional strategy based on prediction ambiguity to infer whether a sample is from a new origin. Lastly, we report increased prediction error when data from different sequencing protocols were included as training data.
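The difference between the two validation schemes can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy city labels, and the unshuffled stride-based folds are assumptions made for illustration only. It shows why Leave-1-city-out is the harder test: every held-out sample comes from a city the model has never seen, whereas an ordinary k-fold split usually leaves each city represented in the training set.

```python
from collections import defaultdict


def k_fold_splits(n_samples, k):
    """Yield (train, test) index lists for k folds.

    Simplification: folds are taken by stride without shuffling.
    """
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        held = set(test)
        train = [i for i in indices if i not in held]
        yield train, test


def leave_one_city_out_splits(cities):
    """Yield (held_out_city, train, test), holding out one whole city at a time."""
    by_city = defaultdict(list)
    for i, city in enumerate(cities):
        by_city[city].append(i)
    for held_out in sorted(by_city):
        test = by_city[held_out]
        train = [i for i, c in enumerate(cities) if c != held_out]
        yield held_out, train, test


def unseen_cities(train, test, cities):
    """Return the cities present in the test fold but absent from training."""
    return {cities[i] for i in test} - {cities[i] for i in train}


# Hypothetical city labels, two samples per city (not real CAMDA data).
cities = ["A", "A", "B", "B", "C", "C"]

for train, test in k_fold_splits(len(cities), k=2):
    print("k-fold unseen cities:", unseen_cities(train, test, cities))

for held_out, train, test in leave_one_city_out_splits(cities):
    print("held out", held_out, "unseen cities:", unseen_cities(train, test, cities))
```

In this toy setup, a model evaluated on the k-fold splits is only ever asked about cities it was trained on, while the Leave-1-city-out splits always ask about a new origin, which mirrors the setting of the mystery samples described above.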
Herein, we highlight the capacity to predict sample origin accurately for pre-trained origins and the challenge of predicting new origins with both regression and classification models. Overall, this work summarizes the impact of sequencing technique, protocol, taxonomic analytical approaches, and machine learning approaches on the use of metagenomics for predicting sample origin.