Parkinson John, Guiliano David B, Blaxter Mark
Institute of Cell, Animal and Population Biology, University of Edinburgh, Edinburgh EH9 3JT, UK.
BMC Bioinformatics. 2002 Oct 25;3:31. doi: 10.1186/1471-2105-3-31.
Expressed sequence tags (ESTs) are single-pass reads from randomly selected cDNA clones. They provide a highly cost-effective way to access and identify expressed genes. However, they are prone to sequencing errors and typically define incomplete transcripts. To increase the amount of information obtainable from ESTs and reduce the impact of sequencing errors, it is necessary to cluster ESTs into groups sharing significant sequence similarity.
As part of our ongoing EST programs investigating 'orphan' genomes, we have developed a clustering algorithm, CLOBB (Cluster on the basis of BLAST similarity), to identify and cluster ESTs. CLOBB may be used incrementally, preserving original cluster designations. It tracks cluster-specific events such as merging, identifies 'superclusters' of related clusters, and avoids the expansion of chimeric clusters. Written in Perl, CLOBB is highly portable, relying only on a local installation of NCBI's freely available BLAST executable, and can be usefully applied to >95% of current EST datasets. Analysis of the Danio rerio EST dataset demonstrates that CLOBB compares favourably with two less portable systems, UniGene and TIGR Gene Indices.
CLOBB provides a highly portable EST clustering solution and can be freely downloaded from: http://www.nematodes.org/CLOBB
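The incremental behaviour described above (new ESTs join existing clusters, existing cluster designations are preserved, and merge events are tracked) can be illustrated with a small Python sketch. This is not the CLOBB implementation, which is written in Perl and applies its own BLAST-based criteria. The class `IncrementalClusterer`, its `add` method, and the `hits` argument are hypothetical names introduced here, and `hits` is assumed to be a precomputed list of already-clustered ESTs that the new EST matches significantly. The sketch merges unconditionally, so it omits CLOBB's supercluster identification and its protection against expanding chimeric clusters.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalClusterer:
    """Toy incremental clusterer; an illustration only, not CLOBB itself."""

    members: dict = field(default_factory=dict)     # cluster ID -> set of EST IDs
    cluster_of: dict = field(default_factory=dict)  # EST ID -> cluster ID
    merge_log: list = field(default_factory=list)   # (kept ID, absorbed ID, triggering EST)
    next_id: int = 1

    def add(self, est_id, hits):
        """Place est_id, given the already-clustered ESTs it matches (assumed input)."""
        # Clusters touched by the hits; hits to unknown ESTs are ignored.
        matched = sorted({self.cluster_of[h] for h in hits if h in self.cluster_of})
        if not matched:
            # No similarity to anything seen so far: open a new cluster.
            cid = self.next_id
            self.next_id += 1
            self.members[cid] = set()
        else:
            # Keep the oldest (lowest) ID so earlier designations survive.
            cid = matched[0]
            for other in matched[1:]:
                # The new EST bridges two clusters: absorb the newer one and log it.
                for est in self.members.pop(other):
                    self.members[cid].add(est)
                    self.cluster_of[est] = cid
                self.merge_log.append((cid, other, est_id))
        self.members[cid].add(est_id)
        self.cluster_of[est_id] = cid
        return cid


clusterer = IncrementalClusterer()
clusterer.add("est1", [])                # opens cluster 1
clusterer.add("est2", [])                # opens cluster 2
clusterer.add("est3", ["est1", "est2"])  # bridges both, so cluster 2 merges into 1
```

Keeping the lowest ID on a merge is one simple way to keep earlier cluster names stable across incremental runs; the merge log records which designation was retired and which EST triggered the merge.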