Wits Bioinformatics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa.
Bioinformatics. 2011 Dec 15;27(24):3348-55. doi: 10.1093/bioinformatics/btr560. Epub 2011 Oct 8.
Second-generation sequencing technology has reinvigorated research using expression data. Clustering such data remains a significant challenge because the datasets are much larger and have different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for datasets of this size.
We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time.
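The idea behind the filter can be illustrated with a minimal Python sketch. It is not the wcd-express implementation: it uses a plain set of k-mers instead of modified suffix arrays, treats "long exact match" as a shared substring of fixed length `k`, and measures "sufficiently far apart" as a minimum distance between match start positions in the first string. The function names and the parameters `k`, `min_matches` and `min_gap` are hypothetical and chosen only for illustration.

```python
def shared_kmer_positions(s, t, k):
    """Return the sorted start positions in s of length-k substrings that also occur in t."""
    kmers_t = {t[i:i + k] for i in range(len(t) - k + 1)}
    return [i for i in range(len(s) - k + 1) if s[i:i + k] in kmers_t]


def passes_filter(s, t, k=4, min_matches=2, min_gap=6):
    """Naive filter: True if s and t share at least `min_matches` exact
    length-k matches whose start positions in s are at least `min_gap` apart.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    count = 0
    last = None
    # Greedy scan: taking the earliest usable match each time maximises
    # the number of matches that respect the spacing constraint.
    for pos in shared_kmer_positions(s, t, k):
        if last is None or pos - last >= min_gap:
            count += 1
            last = pos
            if count >= min_matches:
                return True
    return False


if __name__ == "__main__":
    s = "AAAACCCCGGGGTTTT"
    print(passes_filter(s, "AAAACTTTT"))       # shares AAAA/AAAC near the start and TTTT at the end
    print(passes_filter(s, "AAAACGCGCG"))      # shared matches are all clustered at the start
    print(passes_filter(s, "GATTACAGATTACA"))  # no shared 4-mers
```

Only pairs that pass such a filter would need to go on to a full, more expensive similarity comparison. A sketch like this still scans every pair it is given; avoiding the all-versus-all enumeration itself requires an index over the whole dataset, which is the role the suffix arrays play in the paper.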
Source code and binaries are available under the GPL at http://code.google.com/p/wcdest. The software runs on Linux and Mac OS X.
scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de
Supplementary data are available at Bioinformatics online.