Ahmed Firoz, Kumar Manish, Raghava Gajendra P S
Bioinformatics Centre, Institute of Microbial Technology, Chandigarh, India.
In Silico Biol. 2009;9(3):135-48.
The polyadenylation signal plays a key role in determining the site at which a poly(A) tail is added to nascent mRNA, and mutations in this signal have been reported in many diseases. Identifying poly(A) sites is therefore important for understanding the regulation and stability of mRNA. In this study, Support Vector Machine (SVM) models were developed to predict poly(A) signals in a DNA sequence using 100 nucleotides each upstream and downstream of the signal. We introduce a novel split nucleotide frequency technique; the resulting models achieved maximum Matthews correlation coefficients (MCC) of 0.58, 0.69, 0.70 and 0.69 using mononucleotide, dinucleotide, trinucleotide and tetranucleotide frequencies, respectively. Finally, a hybrid model that combines dinucleotide, 2nd order dinucleotide and tetranucleotide frequencies achieved a maximum MCC of 0.72. On independent datasets, this model achieved a precision ranging from 75.8% to 95.7% at a sensitivity of 57%, which is better than any other known method.
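The feature construction and the evaluation metric mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the split nucleotide frequency technique in detail, so the sketch assumes one plausible reading: k-mer frequencies are computed separately for the upstream and downstream flanks and then concatenated. The function names (`kmer_frequencies`, `split_kmer_features`, `mcc`) are hypothetical and are not taken from the authors' software; the MCC formula is the standard one.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over ACGT, in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k].upper()
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
            windows += 1
    return [counts[m] / windows if windows else 0.0 for m in kmers]


def split_kmer_features(upstream, downstream, k):
    """ASSUMPTION: 'split' means per-flank frequencies, concatenated.

    The paper's exact splitting scheme may differ from this reading.
    """
    return kmer_frequencies(upstream, k) + kmer_frequencies(downstream, k)


def mcc(tp, tn, fp, fn):
    """Standard Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Under this assumption, `split_kmer_features(up, down, 2)` yields 32 values (16 dinucleotide frequencies per flank) for a pair of 100-nucleotide flanks `up` and `down`. Vectors of this kind could be passed to any SVM implementation, and `mcc` could then be applied to the resulting confusion-matrix counts.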