Huang J, Li T, Chen K, Wu J
Department of Chemistry, Tongji University, Shanghai, China.
Biochimie. 2006 Jul;88(7):923-9. doi: 10.1016/j.biochi.2006.03.006. Epub 2006 Apr 3.
In splice site prediction, accuracy remains below 90% even though the sequences adjacent to splice sites are highly conserved. To improve prediction accuracy, much attention has been paid to improving the performance of the algorithms used, and little to a more fundamental issue, namely nucleotide encoding. In this paper, a predictor based on support vector machines (SVM) is constructed to distinguish true from false splice sites in higher eukaryotes. Four types of encoding were applied to generate the input for the SVM: mono-nucleotide (MN) encoding, MN with frequency difference between the true sites and false sites (FDTF) encoding, pair-wise nucleotide (PN) encoding, and PN with FDTF encoding. The results showed that PN with FDTF encoding as input to the SVM led to the most reliable recognition of splice sites: the accuracies for predicting true donor sites and false sites were 96.3% and 93.7%, respectively, and the accuracies for predicting true acceptor sites and false sites were 94.0% and 93.2%, respectively.
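The abstract names four encodings but gives no formulas, so the following is only a minimal sketch of one plausible reading. It assumes MN means a 4-value one-hot vector per position, PN means a 16-value one-hot vector per adjacent nucleotide pair, and FDTF replaces each one-hot 1 with the per-position frequency of that symbol in true sites minus its frequency in false sites. All function names are hypothetical, and the paper's actual definitions may differ.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def mn_encode(seq):
    """Assumed MN encoding: one-hot, 4 values per nucleotide."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vec


def pn_encode(seq):
    """Assumed PN encoding: one-hot, 16 values per adjacent pair."""
    seq = seq.upper()
    vec = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        vec.extend(1.0 if pair == p else 0.0 for p in PAIRS)
    return vec


def position_frequencies(seqs, symbols, width):
    """Per-position relative frequency of each symbol (width 1 or 2)
    across a list of equal-length training sequences."""
    n_pos = len(seqs[0]) - width + 1
    freqs = []
    for i in range(n_pos):
        counts = {s: 0 for s in symbols}
        for seq in seqs:
            token = seq[i:i + width].upper()
            if token in counts:
                counts[token] += 1
        freqs.append({s: counts[s] / len(seqs) for s in symbols})
    return freqs


def fdtf_encode(seq, true_freqs, false_freqs, symbols, width):
    """Assumed FDTF weighting: the one-hot 1 becomes the
    true-minus-false frequency difference at that position."""
    seq = seq.upper()
    vec = []
    for i in range(len(seq) - width + 1):
        token = seq[i:i + width]
        for s in symbols:
            diff = true_freqs[i][s] - false_freqs[i][s]
            vec.append(diff if token == s else 0.0)
    return vec
```

Under these assumptions, "MN with FDTF" corresponds to calling `fdtf_encode` with `symbols=NUCLEOTIDES, width=1`, and "PN with FDTF" to `symbols=PAIRS, width=2`. The resulting vectors could then be passed to any SVM implementation.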