Frank Ronald L, Ercal Fikret
Biological Sciences Department, University of Missouri-Rolla, Rolla, MO, USA.
BMC Bioinformatics. 2005 Jul 15;6 Suppl 2(Suppl 2):S7. doi: 10.1186/1471-2105-6-S2-S7.
Clustering the ESTs from a large dataset representing a single species is a convenient starting point for investigations into gene discovery, genome evolution, expression patterns, and alternatively spliced transcripts. Several methods have been developed for this purpose. The most widely available is UniGene, a public-domain collection of gene-oriented clusters for more than 45 species, created and maintained by NCBI. The goal is for each cluster to represent a unique gene, but it is currently not known how closely the overall results match that goal. UniGene's build procedure forms initial mRNA clusters before ESTs are joined to them. UniGene's results for soybean indicate a significant amount of redundancy among some sequences reported to be unique mRNAs.

To establish a valid, non-redundant set of known genes for Glycine max, we applied our algorithm, PECT, to the clustering of mRNA sequences only. The mRNA dataset was run through the algorithm at two different matching stringencies. The resulting cluster compositions were compared with each other and with UniGene. Clusters whose composition differed among the three methods were analyzed using (1) nucleotide and amino acid alignments and (2) the submitting authors' conclusions, to determine whether the members of a single cluster represented the same gene.
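The comparison of cluster compositions across the three methods can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the method labels, and the toy sequence IDs are all hypothetical, and the sketch simply flags pairs of sequences that at least one method places in the same cluster while at least one other method separates them.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered sequence pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def discordant_pairs(clusterings):
    """Return pairs grouped together by at least one method but not by all."""
    pair_sets = [co_clustered_pairs(c) for c in clusterings.values()]
    together_somewhere = set().union(*pair_sets)
    together_everywhere = set.intersection(*pair_sets)
    return together_somewhere - together_everywhere


# Hypothetical toy data: the IDs and groupings are invented for illustration
# and do not come from the soybean dataset.
clusterings = {
    "low_stringency": [{"m1", "m2", "m3"}, {"m4"}],
    "high_stringency": [{"m1", "m2"}, {"m3"}, {"m4"}],
    "unigene": [{"m1", "m2", "m3"}, {"m4"}],
}

flagged = discordant_pairs(clusterings)
print(sorted(flagged))  # [('m1', 'm3'), ('m2', 'm3')]
```

Pairs flagged in this way mark the clusters that would then need closer inspection, for example by alignment.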
Of the 12 clusters examined closely, most contained sequences that did not belong in the same cluster. However, none of the three methods (PECT at either stringency, or UniGene) was significantly more accurate than the others at placing paralogs into separate clusters.
Our results reveal that, although each method produces some errors, using multiple stringencies for matching or a sequential hierarchical method of increasing stringencies can provide more reliable results and therefore allow greater confidence in the vast majority of clusters that contain only ESTs and no mRNA sequences.
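One possible reading of a sequential hierarchical method of increasing stringencies is sketched below in Python. This is an assumption-laden illustration, not PECT: the single-linkage grouping, the pairwise identity scores, the thresholds 0.90 and 0.97, and all function names are hypothetical. Sequences are first grouped at the lowest stringency, and each resulting cluster is then re-clustered at the next, stricter threshold, so that the record of which clusters split is retained.

```python
def single_linkage(ids, identity, threshold):
    """Group ids so that pairs scoring >= threshold end up connected."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in identity.items():
        if a in parent and b in parent and score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


def clusters_by_stringency(ids, identity, thresholds):
    """Re-cluster each existing cluster at successively stricter thresholds."""
    levels = []
    clusters = [sorted(ids)]
    for t in sorted(thresholds):
        clusters = [sub for c in clusters
                    for sub in single_linkage(c, identity, t)]
        levels.append((t, clusters))
    return levels


# Hypothetical pairwise identity scores (fractions); invented for illustration.
identity = {("m1", "m2"): 0.99, ("m2", "m3"): 0.92, ("m3", "m4"): 0.60}

for threshold, clusters in clusters_by_stringency(
        ["m1", "m2", "m3", "m4"], identity, [0.90, 0.97]):
    print(threshold, clusters)
# 0.9 [['m1', 'm2', 'm3'], ['m4']]
# 0.97 [['m1', 'm2'], ['m3'], ['m4']]
```

In this toy case, m3 joins m1 and m2 only at the lower stringency, which is the kind of cluster that would be flagged as a candidate paralog grouping.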