Identification of city specific important bacterial signature for the MetaSUB CAMDA challenge microbiome data.

Author information

Department of Biostatistics, University of Florida, 2004 Mowry Rd, Gainesville, FL, 32610, USA.

Currently at the Department of Oral Biology, University of Florida, 1395 Center Drive, Gainesville, FL, 32610, USA.

Publication information

Biol Direct. 2019 Jul 24;14(1):11. doi: 10.1186/s13062-019-0243-z.

Abstract

BACKGROUND

Whole-genome sequencing (WGS) metagenomic data from samples collected in cities around the globe may reveal city-specific microbial signatures. As part of the 2018 CAMDA "MetaSUB Forensic Challenge", Illumina MiSeq sequencing data were provided from 12 cities in 7 countries, together with samples from three mystery sets. We applied machine learning techniques to this large dataset to identify the geographical provenance of the "mystery" samples. We also pursued compositional data analysis to develop accurate inferential techniques for such microbiome data. Because the current data are of higher quality and greater sequencing depth than the CAMDA 2017 MetaSUB challenge data, they are expected, together with improved analytical techniques, to yield more interesting, robust, and useful results that can benefit forensic analysis.
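Compositional data analysis usually starts from a log-ratio transform, because relative abundances sum to a constant and cannot be treated as independent measurements. The abstract does not say which transform was used. The sketch below assumes the centered log-ratio (CLR) with a pseudocount of 0.5 for zero counts; the function name `clr`, the pseudocount, and the counts are illustrative assumptions, not taken from the paper.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    The pseudocount is an assumption made here to avoid log(0);
    the paper's abstract does not specify how zeros were handled.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [x - mean_log for x in logs]


# Made-up counts for four taxa in a single sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

By construction the CLR values of a sample sum to zero, which is one way to check an implementation.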

RESULTS

Preliminary quality screening revealed a much better dataset in terms of Phred quality score (hereafter Phred score), larger paired-end MiSeq reads, and a more balanced experimental design, although the number of samples still differed across cities. Principal component analysis (PCA) showed interesting clusters of samples, and the first three components explained a large share of the variability in the data (~70%). Classification was consistent across both testing mystery sets, with a similar percentage of samples correctly predicted (up to 90%). Analysis of the relative abundance of bacterial "species" showed that some "species" are specific to some regions and can play important roles in prediction. These results were corroborated by the variable importance assigned to the "species" during the internal cross-validation (CV) run with Random Forest (RF).
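To make the relative abundance comparison concrete, the following minimal sketch converts counts to per-sample proportions, averages them by city, and reports the taxon whose mean abundance differs most between two cities. It is a descriptive illustration only, not the paper's differential analysis; all function names, city labels, taxon names, and counts are hypothetical, and every sample is assumed to have a nonzero total count.

```python
from collections import defaultdict


def relative_abundance(counts):
    """Scale one sample's counts to proportions (assumes a nonzero total)."""
    total = sum(counts)
    return [c / total for c in counts]


def mean_abundance_by_city(samples):
    """Map each city to its mean relative abundance per taxon.

    samples: list of (city, counts) pairs, counts in a fixed taxon order.
    """
    grouped = defaultdict(list)
    for city, counts in samples:
        grouped[city].append(relative_abundance(counts))
    return {
        city: [sum(col) / len(col) for col in zip(*rows)]
        for city, rows in grouped.items()
    }


def most_distinct_taxon(means, city_a, city_b, taxa):
    """Return the taxon with the largest absolute difference in mean abundance."""
    diffs = [abs(a - b) for a, b in zip(means[city_a], means[city_b])]
    return taxa[diffs.index(max(diffs))]


# Placeholder taxa and made-up counts for two hypothetical cities.
taxa = ["taxon_A", "taxon_B", "taxon_C"]
samples = [
    ("city_1", [80, 15, 5]),
    ("city_1", [70, 20, 10]),
    ("city_2", [20, 70, 10]),
    ("city_2", [10, 80, 10]),
]
means = mean_abundance_by_city(samples)
print(most_distinct_taxon(means, "city_1", "city_2", taxa))
```

A real analysis would replace the raw difference with a statistical test that accounts for sample size and compositional effects.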

CONCLUSIONS

The unsupervised analysis (PCA and two-way heatmaps) of the log2-cpm normalized data and the differential analysis of relative abundance suggested that the bacterial signature of common "species" was distinctive across cities, a finding also supported by the variable importance results. The city predictions for mystery sets 1 and 3 were convincing, with high classification accuracy and consistency. The pairwise analysis of relative abundance also identified "species" that were consistent and comparable with the variables ranked important by the classifier. The analytical tools applied here to the current MetaSUB data can help forensic science, metagenomics, and related fields predict the city of provenance of metagenomic samples.
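The log2-cpm normalization mentioned above rescales each sample's counts to counts per million before taking the base-2 logarithm. The abstract does not give the exact formula, so this sketch assumes a common variant that adds 0.5 to each count and 1 to the library size to keep zero counts finite; the function name `log2_cpm` and the example counts are hypothetical.

```python
import math


def log2_cpm(counts, prior=0.5):
    """Log2 counts-per-million for one sample.

    Assumed formula: log2((count + prior) / (library_size + 1) * 1e6).
    The offsets are an assumption; the paper may have used different ones.
    """
    lib_size = sum(counts)
    return [math.log2((c + prior) / (lib_size + 1.0) * 1e6) for c in counts]


# Made-up counts for four taxa in a single sample.
sample = [120, 30, 0, 850]
print([round(v, 2) for v in log2_cpm(sample)])
```

Because the transform is monotone within a sample, it preserves the ranking of taxa while making samples of different sequencing depth comparable.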

REVIEWERS

This article was reviewed by Manuela Oliveira, Dimitar Vassilev, and Patrick Lee.
