Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA.
BMC Bioinformatics. 2012 Jun 21;13:141. doi: 10.1186/1471-2105-13-141.
Computing sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to reduce the need for computational reanalysis of these data sets. A prerequisite for sharing similarity results is a common reference.
We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank.
The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.
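The idea of computing similarities once against a non-redundant reference and then reading the result in several annotation namespaces can be illustrated with a minimal sketch. It assumes that identical protein sequences are collapsed under a checksum of the sequence (MD5 is used here purely as an illustrative choice; the abstract does not specify the identifier scheme), and all function names, record layouts, and identifiers below are hypothetical rather than taken from the published tools.

```python
import hashlib
from collections import defaultdict


def sequence_key(seq: str) -> str:
    """Return a checksum of the normalized sequence (assumed identifier scheme)."""
    normalized = "".join(seq.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def build_nonredundant(records):
    """Collapse identical sequences from several source databases.

    records: iterable of (namespace, source_id, sequence) tuples.
    Returns (key -> sequence, key -> {namespace: [source_id, ...]}).
    """
    sequences = {}
    annotations = defaultdict(lambda: defaultdict(list))
    for namespace, source_id, seq in records:
        key = sequence_key(seq)
        # Store each distinct sequence once, in normalized form.
        sequences.setdefault(key, "".join(seq.split()).upper())
        # Remember every source identifier that refers to this sequence.
        annotations[key][namespace].append(source_id)
    return sequences, annotations


def translate_hits(hits, annotations, namespace):
    """Re-express similarity hits in one annotation namespace.

    hits: iterable of (query_id, key, score) computed once against the
    non-redundant set. Hits without an entry in the namespace are dropped.
    """
    return [
        (query_id, source_id, score)
        for query_id, key, score in hits
        for source_id in annotations.get(key, {}).get(namespace, [])
    ]
```

With this layout, a single list of hits against the non-redundant set can be passed to `translate_hits` once per namespace (for example "KEGG" and "GenBank") without rerunning the similarity search.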