Meher Prabina Kumar, Sahu Tanmaya Kumar, Rao Atmakuri Ramakrishna, Wahi Sant Dass
Division of Statistical Genetics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
Centre for Agricultural Bioinformatics, Indian Agricultural Statistics Research Institute, New Delhi, 110012, India.
BMC Bioinformatics. 2014 Nov 25;15:362. doi: 10.1186/s12859-014-0362-6.
Most approaches for splice site prediction are based on machine learning techniques. Although these approaches provide high prediction accuracy, they rely on long sequence windows. Hence, they may not be suitable for predicting novel splice variants from the short sequence reads generated by next-generation sequencing technologies. Further, machine learning techniques require numerically encoded data, and their accuracy varies with the encoding procedure. Splice site prediction using short sequence motifs, without encoding the sequence data, therefore motivated the present study.
An approach for finding associations among nucleotide bases in splice site motifs was developed and then used to determine an appropriate window size. In addition, an approach for predicting donor splice sites using a sum of absolute error criterion is proposed. The proposed approach was compared with commonly used approaches, namely Maximum Entropy Modeling (MEM), Maximal Dependency Decomposition (MDD), Weighted Matrix Method (WMM) and first-order Markov Model (MM1). In terms of prediction accuracy, it performed on par with MEM and MDD and better than WMM and MM1.
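To make the comparison concrete, the following is a minimal sketch of two ways to score a short candidate donor motif against position-wise base frequencies. It is not the authors' implementation: the abstract does not state how the sum of absolute error is computed, so the `sum_abs_error` formulation below (absolute difference between a 0/1 base indicator and the observed base frequency, summed over positions and bases) is an assumption, as are all function names, the pseudocount, and the toy 9-mer sequences. The log-odds function illustrates the general weight matrix idea behind WMM-style baselines.

```python
import math

BASES = "ACGT"

def position_frequencies(sequences, pseudocount=1.0):
    """Per-position base frequencies for equal-length sequences.

    The pseudocount (an assumption of this sketch) avoids zero frequencies.
    """
    length = len(sequences[0])
    freqs = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs

def wmm_log_odds(seq, true_freqs, false_freqs):
    """Weight-matrix style score: sum of per-position log-odds ratios."""
    return sum(
        math.log(true_freqs[i][b] / false_freqs[i][b])
        for i, b in enumerate(seq)
    )

def sum_abs_error(seq, freqs):
    """Assumed criterion: sum of |indicator - frequency| over positions and bases."""
    total = 0.0
    for i, observed in enumerate(seq):
        for b in BASES:
            indicator = 1.0 if b == observed else 0.0
            total += abs(indicator - freqs[i][b])
    return total

def predict_donor(seq, true_freqs, false_freqs):
    """Call a donor site when the motif is closer to the true-site profile."""
    return sum_abs_error(seq, true_freqs) < sum_abs_error(seq, false_freqs)

# Toy training motifs (hypothetical); every one has GT at positions 3-4.
TRUE_SITES = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTGAGT"]
FALSE_SITES = ["CTGGTCCAT", "TTAGTTCCA", "ACCGTTGCA", "GGTGTACCT"]

TRUE_FREQS = position_frequencies(TRUE_SITES)
FALSE_FREQS = position_frequencies(FALSE_SITES)

if __name__ == "__main__":
    for candidate in ("CAGGTAAGT", "TTCGTTCCA"):
        print(
            candidate,
            round(wmm_log_odds(candidate, TRUE_FREQS, FALSE_FREQS), 3),
            predict_donor(candidate, TRUE_FREQS, FALSE_FREQS),
        )
```

Neither score needs a numeric encoding of the sequence beyond counting bases per position, which is the property the abstract emphasizes. A real evaluation would use curated true and false splice site sets and a window size chosen from the association analysis rather than the fixed 9-mers assumed here.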
The proposed approach can predict donor splice sites from short sequence motifs with accuracy comparable to MEM and MDD and higher than WMM and MM1, and can therefore be used as a complement to existing approaches. Based on the proposed methodology, a web server was also developed for easy prediction of donor splice sites and is available at http://cabgrid.res.in:8080/sspred.