KABOOM! A new suffix array based algorithm for clustering expression data.

Affiliation

Wits Bioinformatics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa.

Publication information

Bioinformatics. 2011 Dec 15;27(24):3348-55. doi: 10.1093/bioinformatics/btr560. Epub 2011 Oct 8.

Abstract

MOTIVATION

Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets.

RESULTS

We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time.
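The filter idea described above can be illustrated with a minimal sketch. This is not the wcd-express implementation: it replaces the modified suffix arrays mentioned in the abstract with a plain dictionary of k-mers, and the function names (`kmer_index`, `passes_filter`) and the parameters `k` and `min_gap` are hypothetical values chosen only for illustration. Two strings pass if they share at least two exact matches of length `k` whose start positions are at least `min_gap` apart in both strings.

```python
from collections import defaultdict


def kmer_index(s, k):
    """Map each length-k substring of s to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)
    return index


def passes_filter(a, b, k=8, min_gap=12):
    """Return True if a and b share two exact length-k matches whose start
    positions differ by at least min_gap in both strings.

    k and min_gap are hypothetical illustration values, not the tool's
    actual parameters. A dictionary index stands in for suffix arrays.
    """
    index_b = kmer_index(b, k)

    # Collect every (position in a, position in b) pair of exact k-mer matches.
    matches = []
    for i in range(len(a) - k + 1):
        for j in index_b.get(a[i:i + k], ()):
            matches.append((i, j))

    # Simple pairwise check: are any two matches far enough apart in both strings?
    for x in range(len(matches)):
        i1, j1 = matches[x]
        for y in range(x + 1, len(matches)):
            i2, j2 = matches[y]
            if abs(i1 - i2) >= min_gap and abs(j1 - j2) >= min_gap:
                return True
    return False


if __name__ == "__main__":
    left, right = "ACGTTGCA", "GGATCCTA"
    a = left + "T" * 12 + right
    b = left + "C" * 14 + right
    print(passes_filter(a, b))                       # two distant matches
    print(passes_filter(left + "TTTT", left + "CCCC"))  # only one match
```

The pairwise check is quadratic in the number of matches and is kept only for readability. Candidate pairs that fail such a filter can be skipped without a full comparison, which is how a filter of this kind can avoid all-versus-all alignment.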

AVAILABILITY

Source code and binaries available under GPL at http://code.google.com/p/wcdest. Runs on Linux and MacOS X.

CONTACT

scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
