Improved recognition of splice sites in by incorporating secondary structure information into sequence-derived features: a computational study.

Author information

Meher Prabina Kumar, Satpathy Subhrajit

Affiliation

ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India.

Publication information

3 Biotech. 2021 Nov;11(11):484. doi: 10.1007/s13205-021-03036-8. Epub 2021 Oct 31.

DOI:10.1007/s13205-021-03036-8

PMID:34790508

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8558126/

Abstract

Identification of splice sites is an important part of gene structure prediction. Most existing splice site prediction studies have successfully combined machine learning algorithms with sequence-derived features. However, splice site identification that incorporates secondary structure information is lacking, particularly in plant species. In this study, we therefore evaluated the effect of structural features on splice site prediction accuracy in . Prediction accuracies were evaluated with the sequence-derived features alone and with the structural features incorporated into the sequence-derived features, using a support vector machine (SVM) as the prediction algorithm. Both short (40 base pairs) and long (105 base pairs) sequence datasets were considered. After incorporating the secondary structure features, accuracy improved only for the longer sequence dataset, and the improvement was larger for sequence-derived features that account for nucleotide dependencies. For the short sequence dataset, little or no improvement in accuracy was found. The performance of SVM was further compared with that of the LogitBoost, Random Forest (RF), AdaBoost and XGBoost machine learning methods. The prediction accuracies of SVM, AdaBoost and XGBoost were on par with one another and higher than those of RF and LogitBoost. When prediction was performed using all the sequence-derived features together with the structural features, a slight improvement in accuracy was found compared with combining an individual sequence-based feature with the structural features. To the best of our knowledge, this is the first attempt at computational prediction of splice sites using machine learning methods that incorporates secondary structure information into the sequence-derived features. All the source codes are available at https://github.com/meher861982/SSFeature.
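As a rough illustration of what "incorporating structural features into sequence-derived features" could look like in practice, the sketch below concatenates dinucleotide frequencies (one simple feature that reflects dependencies between adjacent nucleotides) with two summary statistics computed from a dot-bracket secondary structure string. This is not the authors' feature set, which is available in their repository. The function names, the choice of dinucleotide frequencies, the two structural statistics, and the assumption that a dot-bracket string has already been produced by some RNA folding tool are all hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# The 16 ordered dinucleotides: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def dinucleotide_frequencies(seq):
    """Relative frequency of each adjacent nucleotide pair in seq.

    Assumed stand-in for a sequence-derived feature that accounts for
    dependencies between neighbouring positions.
    """
    counts = {pair: 0 for pair in DINUCLEOTIDES}
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing ambiguous bases such as N
            counts[pair] += 1
            total += 1
    return [counts[pair] / total if total else 0.0 for pair in DINUCLEOTIDES]


def structure_features(dot_bracket):
    """Two hypothetical structural summaries of a dot-bracket string.

    Returns the fraction of paired positions and the length of the longest
    unpaired run divided by the sequence length.
    """
    n = len(dot_bracket)
    if n == 0:
        return [0.0, 0.0]
    paired = sum(ch in "()" for ch in dot_bracket)
    longest = run = 0
    for ch in dot_bracket:
        run = run + 1 if ch == "." else 0
        longest = max(longest, run)
    return [paired / n, longest / n]


def combined_features(seq, dot_bracket):
    """Concatenate sequence-derived and structural features into one vector."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    return dinucleotide_frequencies(seq) + structure_features(dot_bracket)


if __name__ == "__main__":
    # Toy input; a real window would be 40 or 105 bases around a candidate site.
    vector = combined_features("ACGTAC", "((..))")
    print(len(vector), vector[16:])
```

Vectors built this way, one per candidate site, could then be passed to a classifier such as an SVM, once with and once without the structural entries, to compare accuracies in the manner the abstract describes.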

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1007/s13205-021-03036-8.
