Malde Ketil, Coward Eivind, Jonassen Inge
Department of Informatics, University of Bergen, HIB, N5020 Norway.
Bioinformatics. 2003 Jul 1;19(10):1221-6. doi: 10.1093/bioinformatics/btg138.
Efficient clustering is important for handling the large amount of available EST sequences. Most contemporary methods are based on some kind of all-against-all comparison, resulting in a quadratic time complexity. A different approach is needed to keep up with the rapid growth of EST data.
A new, fast EST clustering algorithm is presented. Sub-quadratic time complexity is achieved by using an algorithm based on suffix arrays. A prototype implementation has been developed and run on a benchmark data set. The produced clusterings are validated by comparing them to clusterings produced by other methods, and the results are quite promising.
The source code for the prototype implementation is available under a GPL license from http://www.ii.uib.no/~ketil/bio/.
高效聚类对于处理大量可用的EST序列非常重要。大多数当代方法基于某种全对全比较,导致时间复杂度为二次方。需要一种不同的方法来跟上EST数据的快速增长。
提出了一种新的快速EST聚类算法。通过使用基于后缀数组的算法实现了亚二次时间复杂度。已经开发了一个原型实现并在一个基准数据集上运行。通过将生成的聚类与其他方法生成的聚类进行比较来验证结果,结果很有前景。
原型实现的源代码可在GPL许可下从http://www.ii.uib.no/~ketil/bio/获得。