Krause A, Vingron M
Genome Inform Ser Workshop Genome Inform. 1997;8:90-99.
An iterative database searching method is introduced and applied to the design of a database clustering procedure. The search method virtually never produces false-positive hits while still determining meaningfully large sets of sequences related to the query. A novel set-theoretic database clustering algorithm exploits this feature and avoids a traditional, distance-based clustering step. This makes it fast and applicable to data sets as large as, e.g., the Swiss-Prot database. In practice, we achieve unambiguous assignment of 80% of Swiss-Prot sequences to non-overlapping sequence clusters in an entirely automatic fashion.
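The abstract does not spell out the clustering algorithm, so the following is only an illustrative sketch of how clusters could be derived from search results using set operations alone, with no pairwise distances. It is not the authors' procedure. The function name `cluster_hit_sets`, the input format (a mapping from each query to the set of sequences its search returned), and the rule used here (keep maximal hit sets, accept those that do not intersect any other maximal set, flag the rest as ambiguous) are all assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def cluster_hit_sets(
    hits: Dict[str, Set[str]],
) -> Tuple[List[FrozenSet[str]], Set[str]]:
    """Hypothetical set-based clustering of search results.

    hits maps a query id to the ids its search returned (assumed to be
    free of false positives). Returns the non-overlapping clusters and
    the ids that could not be assigned unambiguously.
    """
    # Every query belongs to its own hit set; identical sets collapse.
    distinct = {frozenset(found | {query}) for query, found in hits.items()}

    # Keep only maximal sets, i.e. those not strictly inside another set.
    maximal = [s for s in distinct if not any(s < t for t in distinct)]

    clusters: List[FrozenSet[str]] = []
    ambiguous: Set[str] = set()
    for s in maximal:
        if any(s & t for t in maximal if t != s):
            # Partial overlap with another maximal set: no clear assignment.
            ambiguous |= s
        else:
            clusters.append(s)

    # Sort for a deterministic result.
    clusters.sort(key=lambda c: sorted(c))
    return clusters, ambiguous


# Made-up example data: A, B, C form a consistent family; D..G overlap in a chain.
example_hits = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"D", "F"},
    "F": {"E", "G"},
    "G": {"F"},
}
clusters, ambiguous = cluster_hit_sets(example_hits)
assigned = sum(len(c) for c in clusters)
print(f"{assigned} of {len(example_hits)} sequences assigned unambiguously")
```

In this toy input, the nested hit sets of A, B, and C merge into one cluster, while the partially overlapping sets around D to G are left unassigned. This mirrors the idea that only part of a database (80% of Swiss-Prot in the abstract) receives an unambiguous assignment.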