Meher Prabina Kumar, Satpathy Subhrajit
ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India.
3 Biotech. 2021 Nov;11(11):484. doi: 10.1007/s13205-021-03036-8. Epub 2021 Oct 31.
Identifying splice sites is an important step in predicting gene structure. Most existing splice site prediction studies have successfully recognized splice sites using machine learning algorithms coupled with sequence-derived features. However, splice site identification that incorporates secondary structure information is lacking, particularly in plant species. In this study, we therefore evaluated the effect of structural features on splice site prediction accuracy. Prediction accuracies were evaluated with the sequence-derived features alone and with the structural features added to them, using a support vector machine (SVM) as the prediction algorithm. Both short (40 base pairs) and long (105 base pairs) sequence datasets were considered for evaluation. After the secondary structure features were incorporated, accuracy improved only for the long sequence dataset, and the improvement was larger for sequence-derived features that account for nucleotide dependencies. For the short sequence dataset, accuracy improved little or not at all. The performance of SVM was further compared with that of the LogitBoost, Random Forest (RF), AdaBoost and XGBoost machine learning methods. The prediction accuracies of SVM, AdaBoost and XGBoost were comparable to one another and higher than those of RF and LogitBoost. When prediction used all the sequence-derived features together with the structural features, accuracy improved slightly compared with combining an individual sequence-based feature with the structural features.
To the best of our knowledge, this is the first attempt at computational prediction of splice sites using machine learning methods that incorporate secondary structure information into sequence-derived features. The source code is available at https://github.com/meher861982/SSFeature.
The online version contains supplementary material available at 10.1007/s13205-021-03036-8.
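The idea of appending structural features to sequence-derived features can be illustrated with a minimal Python sketch. It is not the authors' encoding, for which the repository above is the reference. The dinucleotide frequencies (one simple feature that captures dependencies between adjacent nucleotides), the two structure summaries, and all function names are assumptions chosen for illustration. The secondary structure is assumed to be supplied as a dot-bracket string produced beforehand by an RNA folding tool.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 16 ordered pairs of adjacent nucleotides, in a fixed order.
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def dinucleotide_features(seq):
    """Relative frequency of each of the 16 adjacent-nucleotide pairs.

    Illustrative sequence-derived feature that reflects dependencies
    between neighbouring positions (an assumption, not the paper's scheme).
    """
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous bases such as N
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def structure_features(dot_bracket):
    """Two hypothetical summaries of a dot-bracket structure string:
    the fraction of paired positions and the longest unpaired run,
    the latter scaled by sequence length.
    """
    n = len(dot_bracket)
    if n == 0:
        return [0.0, 0.0]
    paired = sum(ch in "()" for ch in dot_bracket)
    longest = current = 0
    for ch in dot_bracket:
        current = current + 1 if ch == "." else 0
        longest = max(longest, current)
    return [paired / n, longest / n]


def combined_features(seq, dot_bracket):
    """Concatenate sequence-derived and structural features into one vector."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    return dinucleotide_features(seq) + structure_features(dot_bracket)


if __name__ == "__main__":
    # Toy 10-nucleotide window with a made-up structure, for illustration only.
    vector = combined_features("AAGGTAAGTC", "..((....))")
    print(len(vector))
```

A classifier such as an SVM could then be trained once on the output of `dinucleotide_features` alone and once on the output of `combined_features`, which mirrors the with-and-without comparison described in the abstract.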