Improved recognition of splice sites by incorporating secondary structure information into sequence-derived features: a computational study.

Author information

Meher Prabina Kumar, Satpathy Subhrajit

Affiliation

ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India.

Publication information

3 Biotech. 2021 Nov;11(11):484. doi: 10.1007/s13205-021-03036-8. Epub 2021 Oct 31.

Abstract

Identification of splice sites is an important step in predicting gene structure. Most existing splice site prediction studies have successfully combined machine learning algorithms with sequence-derived features. However, splice site identification that incorporates secondary structure information is lacking, particularly in plant species. In this study, we therefore evaluated the effect of structural features on splice site prediction accuracy. Prediction accuracies were evaluated with the sequence-derived features alone and with the structural features incorporated into the sequence-derived features, using a support vector machine (SVM) as the prediction algorithm. Both short (40 base pairs) and long (105 base pairs) sequence datasets were considered. After incorporating the secondary structure features, accuracy improved only for the longer sequence dataset, and the improvement was larger for the sequence-derived features that account for nucleotide dependencies. For the short sequence dataset, accuracy improved little or not at all. The performance of SVM was further compared with that of the LogitBoost, Random Forest (RF), AdaBoost and XGBoost machine learning methods. The prediction accuracies of SVM, AdaBoost and XGBoost were on par with one another and higher than those of RF and LogitBoost. When prediction was performed with all the sequence-derived features together with the structural features, accuracy improved slightly compared with combining individual sequence-based features with the structural features.
To the best of our knowledge, this is the first attempt at computational prediction of splice sites using machine learning methods that incorporate secondary structure information into the sequence-derived features. All source code is available at https://github.com/meher861982/SSFeature.
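
The idea of incorporating structural features into sequence-derived features can be illustrated with a minimal sketch. The abstract does not say which encodings or which structure descriptors were used, so everything below is an assumption made for illustration: a position-wise one-hot encoding, an adjacent-dinucleotide frequency vector as a stand-in for a feature that accounts for nucleotide dependencies, and two summary values computed from a dot-bracket structure string that is assumed to come from some external RNA folding tool. The function names are hypothetical and are not taken from the SSFeature repository.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# All 16 ordered pairs of nucleotides, e.g. "AA", "AC", ..., "TT"
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def one_hot(seq):
    """Position-wise one-hot encoding: 4 values per nucleotide (assumed encoding)."""
    return [1.0 if base == n else 0.0 for base in seq.upper() for n in NUCLEOTIDES]


def dinucleotide_freq(seq):
    """Relative frequency of each adjacent dinucleotide (assumed dependency feature)."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs) or 1  # avoid division by zero for very short input
    return [pairs.count(d) / total for d in DINUCLEOTIDES]


def structure_features(dot_bracket):
    """Two hypothetical structure descriptors from a dot-bracket string:
    fraction of paired positions and relative length of the longest unpaired run."""
    n = len(dot_bracket) or 1
    paired = sum(ch in "()" for ch in dot_bracket)
    longest, run = 0, 0
    for ch in dot_bracket:
        run = run + 1 if ch == "." else 0
        longest = max(longest, run)
    return [paired / n, longest / n]


def combined_features(seq, dot_bracket):
    """Concatenate sequence-derived and structural features into one vector."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    return one_hot(seq) + dinucleotide_freq(seq) + structure_features(dot_bracket)


if __name__ == "__main__":
    # Toy 10-nucleotide window with a made-up structure string
    vector = combined_features("CAGGTAAGTC", "((......))")
    print(len(vector))
```

The resulting vector could then be passed to any classifier, such as an SVM; that step is left out here to keep the sketch free of third-party dependencies.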

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1007/s13205-021-03036-8.
