KABOOM! A new suffix array based algorithm for clustering expression data.

Author information

Wits Bioinformatics, School of Electrical and Information Engineering, University of the Witwatersrand, Johannesburg, Private Bag 3, 2050 Wits, South Africa.

Publication information

Bioinformatics. 2011 Dec 15;27(24):3348-55. doi: 10.1093/bioinformatics/btr560. Epub 2011 Oct 8.

DOI:10.1093/bioinformatics/btr560

PMID:21984769

Abstract

MOTIVATION

Second-generation sequencing technology has reinvigorated research using expression data, and clustering such data remains a significant challenge, with much larger datasets and with different error profiles. Algorithms that rely on all-versus-all comparison of sequences are not practical for large datasets.

RESULTS

We introduce a new filter for string similarity which has the potential to eliminate the need for all-versus-all comparison in clustering of expression data and other similar tasks. Our filter is based on multiple long exact matches between the two strings, with the additional constraint that these matches must be sufficiently far apart. We give details of its efficient implementation using modified suffix arrays. We demonstrate its efficiency by presenting our new expression clustering tool, wcd-express, which uses this heuristic. We compare it to other current tools and show that it is very competitive both with respect to quality and run time.
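The filter idea described above can be illustrated with a minimal sketch. It is not the wcd-express implementation: the abstract gives no parameter values and does not describe the modified suffix arrays, so the naive suffix array construction, the function names, the seed length `k` and the separation `min_gap` are all assumptions chosen for illustration.

```python
from bisect import bisect_left


def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order.

    Assumption: a simple O(n^2 log n) construction stands in for the
    modified suffix arrays mentioned in the abstract.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_seed_matches(a, b, k):
    """Return (pos_in_a, pos_in_b) for every exact match of length k."""
    sa = build_suffix_array(a)
    # Truncating sorted suffixes to k characters keeps them sorted,
    # so each k-mer of b can be located by binary search.
    prefixes = [a[i:i + k] for i in sa]
    matches = []
    for j in range(len(b) - k + 1):
        word = b[j:j + k]
        pos = bisect_left(prefixes, word)
        while pos < len(prefixes) and prefixes[pos] == word:
            matches.append((sa[pos], j))
            pos += 1
    return matches


def passes_filter(a, b, k=8, min_gap=16):
    """True if two exact matches of length k lie at least min_gap apart
    in both strings. k and min_gap are illustrative values only."""
    matches = find_seed_matches(a, b, k)
    return any(
        abs(i1 - i2) >= min_gap and abs(j1 - j2) >= min_gap
        for idx, (i1, j1) in enumerate(matches)
        for (i2, j2) in matches[idx + 1:]
    )
```

In a clustering setting, only pairs that pass such a filter would go on to a more expensive similarity comparison. A production version would index all sequences in one suffix array rather than rebuilding it for every pair.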

AVAILABILITY

Source code and binaries available under GPL at http://code.google.com/p/wcdest. Runs on Linux and MacOS X.

CONTACT

scott.hazelhurst@wits.ac.za; zsuzsa@cebitec.uni-bielefeld.de

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Similar articles

An overview of the wcd EST clustering tool.

Bioinformatics. 2008 Jul 1;24(13):1542-6. doi: 10.1093/bioinformatics/btn203. Epub 2008 May 14.

CLUSTERnGO: a user-defined modelling platform for two-stage clustering of time-series data.

Bioinformatics. 2016 Feb 1;32(3):388-97. doi: 10.1093/bioinformatics/btv532. Epub 2015 Sep 26.

A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets.

Bioinformatics. 2009 May 1;25(9):1152-7. doi: 10.1093/bioinformatics/btp123. Epub 2009 Mar 4.

Fast sequence clustering using a suffix array algorithm.

Bioinformatics. 2003 Jul 1;19(10):1221-6. doi: 10.1093/bioinformatics/btg138.

Large scale clustering of protein sequences with FORCE - A layout based heuristic for weighted cluster editing.

BMC Bioinformatics. 2007 Oct 17;8:396. doi: 10.1186/1471-2105-8-396.

Average correlation clustering algorithm (ACCA) for grouping of co-regulated genes with similar pattern of variation in their expression values.

J Biomed Inform. 2010 Aug;43(4):560-8. doi: 10.1016/j.jbi.2010.02.001. Epub 2010 Feb 6.

Starcode: sequence clustering based on all-pairs search.

Bioinformatics. 2015 Jun 15;31(12):1913-9. doi: 10.1093/bioinformatics/btv053. Epub 2015 Jan 31.

Bi-correlation clustering algorithm for determining a set of co-regulated genes.

Bioinformatics. 2009 Nov 1;25(21):2795-801. doi: 10.1093/bioinformatics/btp526. Epub 2009 Sep 3.

Clustering short time series gene expression data.

Bioinformatics. 2005 Jun;21 Suppl 1:i159-68. doi: 10.1093/bioinformatics/bti1022.

Cited by

Fast, parallel, and cache-friendly suffix array construction.

Algorithms Mol Biol. 2024 Apr 28;19(1):16. doi: 10.1186/s13015-024-00263-5.

gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections.

Algorithms Mol Biol. 2020 Sep 22;15:18. doi: 10.1186/s13015-020-00177-y. eCollection 2020.

Large Differences in Gene Expression Responses to Drought and Heat Stress between Elite Barley Cultivar Scarlett and a Spanish Landrace.

Front Plant Sci. 2017 May 1;8:647. doi: 10.3389/fpls.2017.00647. eCollection 2017.

EasyCluster2: an improved tool for clustering and assembling long transcriptome reads.

BMC Bioinformatics. 2014;15 Suppl 15(Suppl 15):S7. doi: 10.1186/1471-2105-15-S15-S7. Epub 2014 Dec 3.

A bioinformatician's guide to the forefront of suffix array construction algorithms.

Brief Bioinform. 2014 Mar;15(2):138-54. doi: 10.1093/bib/bbt081. Epub 2014 Jan 10.

A hybrid distance measure for clustering expressed sequence tags originating from the same gene family.

PLoS One. 2012;7(10):e47216. doi: 10.1371/journal.pone.0047216. Epub 2012 Oct 11.

Ultrafast clustering algorithms for metagenomic sequence analysis.

Brief Bioinform. 2012 Nov;13(6):656-68. doi: 10.1093/bib/bbs035. Epub 2012 Jul 6.
