Department of Computer Science, University of Arizona, 1040 E. 4th Street, Tucson, Arizona, 85721, USA.
Department of Biosystems Engineering, University of Arizona, 1177 E. 4th Street, Tucson, Arizona, 85721, USA.
GigaScience. 2019 Feb 1;8(2):giy165. doi: 10.1093/gigascience/giy165.
Shotgun metagenomics provides powerful insights into microbial community biodiversity and function. Yet, inferences from metagenomic studies are often limited by dataset size and complexity and are restricted by the availability and completeness of existing databases. De novo comparative metagenomics enables the comparison of metagenomes based on their total genetic content.
We developed a tool called Libra that performs an all-vs-all comparison of metagenomes for precise clustering based on their k-mer content. Libra combines three components: a scalable Hadoop framework for massive metagenome comparisons; cosine similarity as the distance measure, which uses sequence composition and abundance while normalizing for sequencing depth; and a web-based implementation in iMicrobe (http://imicrobe.us) that uses the CyVerse advanced cyberinfrastructure to promote broad use of the tool by the scientific community.
A comparison of Libra with equivalent tools on both simulated and real metagenomic datasets, ranging from 80 million to 4.2 billion reads, shows that methods commonly used to reduce compute time for large datasets, such as data reduction, read count normalization, and presence/absence distance metrics, greatly diminish the resolution of large-scale comparative analyses. In contrast, Libra uses all of the reads to calculate k-mer abundance in a Hadoop architecture that can scale to datasets of any size, enabling global-scale analyses that link microbial signatures to biological processes.
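To make the role of cosine similarity concrete, the following is a minimal in-memory sketch of comparing two samples by k-mer abundance. It is an illustration only and does not reproduce Libra's Hadoop implementation or any k-mer weighting Libra may apply; the function names (`kmer_counts`, `cosine_similarity`, `cosine_distance`), the toy reads, and the choice of k = 3 are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(reads, k):
    """Count every k-mer across all reads (hypothetical helper, not Libra's code)."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two k-mer abundance vectors."""
    dot = sum(count * b[kmer] for kmer, count in a.items() if kmer in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form: 0 for identical composition, 1 for no shared k-mers."""
    return 1.0 - cosine_similarity(a, b)


# Toy samples (assumed data): the second is the first sequenced 10x deeper.
sample_a = ["ACGTACGT", "TTGACA"]
sample_a_deep = sample_a * 10
sample_b = ["GGGGGG"]

counts_a = kmer_counts(sample_a, k=3)
counts_a_deep = kmer_counts(sample_a_deep, k=3)
counts_b = kmer_counts(sample_b, k=3)

print(cosine_distance(counts_a, counts_a_deep))  # close to 0: depth does not matter
print(cosine_distance(counts_a, counts_b))       # 1.0: no k-mers in common
```

Because the cosine depends only on the direction of the abundance vector, scaling every count by the same factor leaves the result unchanged, which is how the measure normalizes for sequencing depth. A presence/absence metric would discard the counts entirely, so two samples with the same k-mers in very different proportions would look identical.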