Nishikawa T, Ota T, Isogai T
Helix Research Institute, Chiba, Japan.
Bioinformatics. 2000 Nov;16(11):960-7. doi: 10.1093/bioinformatics/16.11.960.
In previous work, we developed ATGpr, a computer program for predicting the fullness of a cDNA, i.e. whether or not it contains an initiation codon. The prediction algorithm fully exploited statistical information on short nucleotide fragments. However, it did not use sequence similarities to known proteins, which are becoming increasingly available owing to the recent rapid growth of protein databases. In this work, we present a new prediction algorithm based on both statistical and similarity information, which achieves better sensitivity and specificity.
We evaluated the accuracy of ATGpr in predicting the fullness of cDNA sequences derived from human clustered ESTs in UniGene, obtaining the specificity, sensitivity, and correlation coefficient of the prediction. Specificity and sensitivity crossed at 46% at an ATGpr score threshold of 0.33, and the maximum correlation coefficient of 0.34 was obtained at this threshold. Even without ATGpr, alignments with known proteins proved effective for predicting the fullness of cDNA sequences: specificity increased monotonically with similarity (identity of the alignments) and exceeded 80% when identity was greater than 40%. To predict fullness more effectively, we combined similarity to known proteins (identity of the query sequence) with the ATGpr score. As a result, specificity exceeded 80% when identity was greater than 20%.
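The threshold evaluation described above can be illustrated with a minimal sketch. The abstract does not define its measures, so this sketch assumes sensitivity = TP/(TP+FN), specificity = TP/(TP+FP) (a convention often used in gene-prediction studies), and the Matthews form of the correlation coefficient. The function names and the toy scores are hypothetical and are not taken from ATGpr or ATGpr_sim.

```python
from math import sqrt

def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when a cDNA is called 'full' if score >= threshold."""
    tp = fp = tn = fn = 0
    for score, is_full in zip(scores, labels):
        predicted_full = score >= threshold
        if predicted_full and is_full:
            tp += 1
        elif predicted_full:
            fp += 1
        elif is_full:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Assumed definitions (not stated in the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
      correlation = Matthews correlation coefficient
    """
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, correlation

def sweep(scores, labels, thresholds):
    """Evaluate the three measures at each candidate score threshold."""
    return [(t, *metrics(*confusion_counts(scores, labels, t))) for t in thresholds]

# Made-up toy data: prediction scores and whether each cDNA is truly full-length.
toy_scores = [0.9, 0.8, 0.4, 0.2]
toy_labels = [True, False, True, False]

for t, sn, sp, cc in sweep(toy_scores, toy_labels, [0.3, 0.5]):
    print(f"threshold={t:.2f} sensitivity={sn:.2f} specificity={sp:.2f} cc={cc:.2f}")
```

Sweeping the threshold in this way is one simple route to locating the point where sensitivity and specificity cross and the point where the correlation coefficient peaks.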
The prediction program, called ATGpr_sim, is available at http://www.hri.co.jp/atgpr/ATGpr_sim.html