Pashaei Elham, Aydin Nizamettin
Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey.
Comput Biol Chem. 2018 Apr;73:159-170. doi: 10.1016/j.compbiolchem.2018.02.005. Epub 2018 Feb 14.
Splice site recognition is among the most significant and challenging tasks in bioinformatics because of its key role in gene annotation. Effective prediction of splice sites requires nucleotide encoding methods that reveal the characteristics of DNA sequences and supply appropriate features as input to machine learning classifiers. Markovian models are among the most influential encoding methods and are widely used for pattern recognition in biological data. However, the performance of these methods in the splice site domain has not yet been compared directly. This study addresses that gap by comparing various Markovian encoding models for splice site prediction using a support vector machine (SVM), the most prominent learning method in the domain, and provides a new, precise evaluation of Markovian approaches. Moreover, a novel sequence encoding approach based on a third-order Markov model (MM3) is proposed. The experimental results show that the proposed method, named MM3-SVM, performs significantly better than the thirteen best-known state-of-the-art algorithms when tested on the HS3D dataset under several performance criteria. Further, it achieved higher prediction accuracy than several well-known tools, such as NNsplice, MEM, MM1, WMM, and GeneID, on an independent test set of 50 genes. We also developed MMSVM, a web tool that predicts splice sites in any human sequence using the proposed approach. The MMSVM web server can be accessed at https://pashaei.shinyapps.io/mmsvm.
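To make the idea of a third-order Markov encoding concrete, the following is a minimal Python sketch; it is not the authors' implementation. It assumes one common form of Markovian encoding: position-specific conditional probabilities P(base | previous three bases) are estimated from aligned training sequences, and each sequence is then mapped to the vector of those probabilities. The function names, the pseudocount, and the toy sequences are hypothetical, and the abstract does not specify the exact MM3 feature construction.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"  # assumption: sequences contain only these four bases
ORDER = 3          # third-order model: each base is conditioned on the previous three


def fit_positional_markov(seqs, order=ORDER, pseudocount=1.0):
    """Estimate position-specific P(base | previous `order` bases).

    `seqs` must be aligned sequences of equal length. The pseudocount is an
    illustrative smoothing choice so that unseen contexts get nonzero mass.
    """
    length = len(seqs[0])
    counts = [defaultdict(float) for _ in range(length)]
    for s in seqs:
        for i in range(order, length):
            counts[i][(s[i - order:i], s[i])] += 1.0

    probs = []
    for i in range(length):
        table = {}
        if i >= order:  # the first `order` positions have no full context
            for ctx in map("".join, product(ALPHABET, repeat=order)):
                total = sum(counts[i][(ctx, b)] for b in ALPHABET)
                total += pseudocount * len(ALPHABET)
                for b in ALPHABET:
                    table[(ctx, b)] = (counts[i][(ctx, b)] + pseudocount) / total
        probs.append(table)
    return probs


def encode(seq, probs, order=ORDER):
    """Map a sequence to one conditional probability per position >= `order`."""
    return [probs[i][(seq[i - order:i], seq[i])] for i in range(order, len(seq))]


# Toy aligned sequences, for illustration only (not taken from HS3D).
toy_sites = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA"]
model = fit_positional_markov(toy_sites)
features = encode("AAGGTAAGT", model)
```

Feature vectors of this kind could then be passed to an SVM classifier, for example `sklearn.svm.SVC` from scikit-learn; the kernel and hyperparameters would need to be chosen separately, since the abstract does not report them.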