Wang Yonghong, Zhang Chun-Ting, Dong Puxuan
Department of Physics, Tianjin University, Tianjin, 300072, China.
Biopolymers. 2002 Mar;63(3):207-16. doi: 10.1002/bip.10054.
With the rapid progress of the Human Genome Project, a large number of uncharacterized DNA sequences need to be annotated by better algorithms. Recognizing shorter coding sequences of human genes is one of the most important problems in gene recognition, and it has not yet been completely solved. This paper addresses the problem with a new method. The distributions of the three stop codons (TAA, TAG, and TGA) in the three phases along coding, noncoding, and intergenic sequences are studied in detail. Using these distributions together with other coding measures, a new algorithm for recognizing shorter coding sequences of human genes is developed. The accuracy of the algorithm is tested on a large database of human genes. The average accuracy is as high as 92.1% for sequences 192 base pairs long, a result confirmed by sixfold cross-validation tests. It is hoped that combining the present method with existing algorithms will increase the accuracy of identifying human genes in unannotated sequences.
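The phase-specific stop codon statistics mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify. It only shows one plausible way to tally TAA, TAG, and TGA in each of the three phases of a sequence. The function names `stop_codon_phase_counts` and `stop_codon_phase_frequencies` are hypothetical, and taking phase as the start position modulo 3 is an assumption.

```python
from collections import Counter

# The three stop codons named in the abstract.
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_phase_counts(seq):
    """Count each stop codon in each of the three phases of seq.

    Assumption: the phase of a triplet is its start index modulo 3,
    so phase 0 is the frame beginning at the first base.
    """
    seq = seq.upper()
    counts = {phase: Counter() for phase in range(3)}
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            counts[i % 3][codon] += 1
    return counts


def stop_codon_phase_frequencies(seq):
    """Return, per phase, the fraction of complete triplets that are stop codons."""
    counts = stop_codon_phase_counts(seq)
    freqs = {}
    for phase in range(3):
        n_triplets = max((len(seq) - phase) // 3, 0)
        n_stops = sum(counts[phase].values())
        freqs[phase] = n_stops / n_triplets if n_triplets else 0.0
    return freqs


if __name__ == "__main__":
    # Toy sequence (hypothetical): ATG AAA TAA read in phase 0.
    example = "ATGAAATAA"
    print(stop_codon_phase_counts(example))
    print(stop_codon_phase_frequencies(example))
```

In an open reading frame, stop codons are absent from the coding phase until the terminator but may occur in the other two phases, so per-phase frequencies like these could serve as one input feature to a classifier. How the paper actually combines them with other coding measures is not stated in the abstract.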