Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA.
Bioinformatics. 2014 Oct 15;30(20):2949-55. doi: 10.1093/bioinformatics/btu405. Epub 2014 Jun 27.
Several technical challenges in metagenomic data analysis, including assembling metagenomic sequence data or identifying operational taxonomic units, are both significant and well known. These forms of analysis are increasingly cited as conceptually flawed, given the extreme variation within traditionally defined species and rampant horizontal gene transfer. Furthermore, computational requirements of such analysis have hindered content-based organization of metagenomic data at large scale.
In this article, we introduce the Amordad database engine for alignment-free, content-based indexing of metagenomic datasets. Amordad places the metagenome comparison problem in a geometric context, and uses an indexing strategy that combines random hashing with a regular nearest neighbor graph. This framework allows refinement of the database over time by continual application of random hash functions, with the effect of each hash function encoded in the nearest neighbor graph. This eliminates the need to explicitly maintain the hash functions in order for query efficiency to benefit from the accumulated randomness. Results on real and simulated data show that Amordad can support logarithmic query time for identifying similar metagenomes even as the database size reaches into the millions.
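The indexing idea described above can be illustrated with a small sketch. This is not Amordad's implementation. It assumes each metagenome is already summarized as a nonzero numeric feature vector (for example, k-mer frequencies), uses random hyperplane hashing with angular distance as a stand-in for whichever hash family and metric the authors use, and assumes a greedy graph walk for queries. The names `GraphIndex`, `refine`, `query`, and `angular_distance` are hypothetical. The point it shows is that each random hash function is used once to propose candidate neighbors and is then discarded, while its effect persists in a graph where every node keeps exactly `k` neighbors.

```python
import math
import random
from collections import defaultdict


def angular_distance(u, v):
    """Angle between two nonzero feature vectors (smaller means more similar)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


class GraphIndex:
    """Hypothetical index: a k-regular neighbor graph refined by throwaway hashes."""

    def __init__(self, vectors, k=3, bits=4, seed=0):
        # Assumption: len(vectors) > k, so initial neighbors are distinct.
        self.vectors = vectors
        self.k = k
        self.bits = bits
        self.rng = random.Random(seed)
        n = len(vectors)
        # Start from arbitrary neighbors so every node has exactly k out-edges.
        self.neighbors = [[(i + j) % n for j in range(1, k + 1)] for i in range(n)]

    def _random_hash(self):
        """Draw one random hyperplane hash (a sign pattern over `bits` planes)."""
        dim = len(self.vectors[0])
        planes = [[self.rng.gauss(0, 1) for _ in range(dim)] for _ in range(self.bits)]

        def h(v):
            return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)

        return h

    def refine(self, rounds=1):
        """Apply fresh hash functions; each one only updates the graph."""
        for _ in range(rounds):
            h = self._random_hash()
            buckets = defaultdict(list)
            for i, v in enumerate(self.vectors):
                buckets[h(v)].append(i)
            for members in buckets.values():
                for i in members:
                    # Candidates: current neighbors plus items in the same bucket.
                    candidates = set(self.neighbors[i]) | set(members)
                    candidates.discard(i)
                    ranked = sorted(
                        candidates,
                        key=lambda j: angular_distance(self.vectors[i], self.vectors[j]),
                    )
                    self.neighbors[i] = ranked[: self.k]
            # The hash function is not stored; only the graph remembers it.

    def query(self, q, start=0):
        """Greedy walk: move to the closest neighbor until none improves."""
        current = start
        best = angular_distance(q, self.vectors[current])
        while True:
            nxt = min(
                self.neighbors[current],
                key=lambda j: angular_distance(q, self.vectors[j]),
            )
            d = angular_distance(q, self.vectors[nxt])
            if d >= best:
                return current, best
            current, best = nxt, d
```

In this sketch, calling `refine` repeatedly can only keep or improve each node's neighbor list, because the new list is the `k` closest items from a superset of the old one. The greedy `query` may stop at a local optimum, so it returns an approximate nearest neighbor and makes no claim about the logarithmic query time reported for Amordad.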
Source code, licensed under the GNU General Public License (version 3), is freely available for download from http://smithlabresearch.org/amordad.
Supplementary data are available at Bioinformatics online.