Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition.

Affiliation

ICAR-Indian Agricultural Statistics Research Institute, New Delhi 110012, India.

Publication information

Gene. 2019 Jul 15;705:113-126. doi: 10.1016/j.gene.2019.04.047. Epub 2019 Apr 19.

DOI: 10.1016/j.gene.2019.04.047
PMID: 31009682

Abstract

Identifying splice sites is essential for predicting gene structure. Machine learning-based approaches (MLAs) have been reported to be more successful than rule-based methods at identifying splice sites. However, nucleotide sequences, which are strings of letters, must be transformed into numeric features through sequence encoding before they can be used as input to MLAs. In this study, we evaluated the performance of 8 sequence encoding schemes, namely Bayes kernel, density and sparse (DS), distribution of tri-nucleotide and 1st order Markov model (DM), frequency difference distance measure (FDDM), paired-nucleotide frequency difference between true and false sites (FDTF), 1st order Markov model (MM1), the combination of 1st and 2nd order Markov models (MM1 + MM2), and 2nd order Markov model (MM2), for predicting donor and acceptor splice sites with 5 supervised learning methods (ANN, Bagging, Boosting, RF and SVM). The encoding schemes and machine learning methods were first evaluated in 4 species (A. thaliana, C. elegans, D. melanogaster and H. sapiens), and their performance was then validated in 4 other species (Ciona intestinalis, Dictyostelium discoideum, Phaeodactylum tricornutum and Trypanosoma brucei). In terms of ROC (receiver operating characteristic) and PR (precision-recall) curves, the FDTF encoding achieved the highest accuracy, followed by either MM2 or FDDM. Likewise, SVM achieved the highest accuracy (in terms of ROC and PR curves), followed by RF, across encoding schemes and species. In terms of prediction accuracy across species, the SVM-FDTF combination outperformed the other combinations of classifiers and encoding schemes. Splice site prediction accuracy was also higher for species with low intron density. To the best of our knowledge, this is the first comprehensive evaluation of sequence encoding schemes for splice site prediction. We have also developed an R package, EncDNA (https://cran.r-project.org/web/packages/EncDNA/index.html), for encoding splice site motifs with different encoding schemes, which is expected to supplement the existing nucleotide sequence encoding approaches. We believe this study will be useful to computational biologists for predicting different functional elements in genomic DNA.
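To make the idea of sequence encoding concrete, the following is a minimal Python sketch of two generic encodings in the spirit of those named in the abstract: a sparse (one-hot) encoding and a position-specific first-order Markov encoding. It is an illustration under stated assumptions, not the authors' definitions and not the EncDNA API. The function names (`sparse_encode`, `fit_markov1`, `markov1_encode`), the pseudocount, and the toy donor motifs are all hypothetical.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"


def sparse_encode(seq):
    """One-hot (sparse) encoding: each base becomes 4 binary features."""
    features = []
    for base in seq.upper():
        features.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return features


def fit_markov1(sequences, pseudocount=1.0):
    """Estimate position-specific first-order transition probabilities
    P(base at i+1 | base at i) from aligned, equal-length sequences.
    The pseudocount is an assumption that avoids zero probabilities."""
    length = len(sequences[0])
    counts = [defaultdict(float) for _ in range(length - 1)]
    for seq in sequences:
        seq = seq.upper()
        for i in range(length - 1):
            counts[i][(seq[i], seq[i + 1])] += 1.0
    model = []
    for pos_counts in counts:
        probs = {}
        for prev in NUCLEOTIDES:
            total = sum(pos_counts[(prev, nxt)] + pseudocount
                        for nxt in NUCLEOTIDES)
            for nxt in NUCLEOTIDES:
                probs[(prev, nxt)] = (pos_counts[(prev, nxt)] + pseudocount) / total
        model.append(probs)
    return model


def markov1_encode(seq, model):
    """Encode a sequence as one transition probability per adjacent pair.
    Assumes the sequence contains only A, C, G and T."""
    seq = seq.upper()
    return [model[i][(seq[i], seq[i + 1])] for i in range(len(seq) - 1)]


if __name__ == "__main__":
    # Hypothetical toy donor-site motifs, for illustration only.
    true_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA"]
    model = fit_markov1(true_sites)
    print(sparse_encode("GT"))
    print(markov1_encode("CAGGTAAGT", model))
```

Either function returns a fixed-length numeric vector for a fixed-length motif, which is the form that classifiers such as SVM or RF expect as input.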

Similar articles

1
Evaluating the performance of sequence encoding schemes and machine learning methods for splice sites recognition.
Gene. 2019 Jul 15;705:113-126. doi: 10.1016/j.gene.2019.04.047. Epub 2019 Apr 19.
2
A computational approach for prediction of donor splice sites with improved accuracy.
J Theor Biol. 2016 Sep 7;404:285-294. doi: 10.1016/j.jtbi.2016.06.013. Epub 2016 Jun 11.
3
An approach of encoding for prediction of splice sites using SVM.
Biochimie. 2006 Jul;88(7):923-9. doi: 10.1016/j.biochi.2006.03.006. Epub 2006 Apr 3.
4
A statistical approach for 5' splice site prediction using short sequence motifs and without encoding sequence data.
BMC Bioinformatics. 2014 Nov 25;15:362. doi: 10.1186/s12859-014-0362-6.
5
Prediction of donor splice sites using random forest with a new sequence encoding approach.
BioData Min. 2016 Jan 22;9:4. doi: 10.1186/s13040-016-0086-4. eCollection 2016.
6
Markovian encoding models in human splice site recognition using SVM.
Comput Biol Chem. 2018 Apr;73:159-170. doi: 10.1016/j.compbiolchem.2018.02.005. Epub 2018 Feb 14.
7
Splice site prediction with quadratic discriminant analysis using diversity measure.
Nucleic Acids Res. 2003 Nov 1;31(21):6214-20. doi: 10.1093/nar/gkg805.
8
A novel method for splice sites prediction using sequence component and hidden Markov model.
Annu Int Conf IEEE Eng Med Biol Soc. 2016 Aug;2016:3076-3079. doi: 10.1109/EMBC.2016.7591379.
9
EDeepSSP: Explainable deep neural networks for exact splice sites prediction.
J Bioinform Comput Biol. 2020 Aug;18(4):2050024. doi: 10.1142/S0219720020500249. Epub 2020 Jul 22.
10
Improved recognition of splice sites in by incorporating secondary structure information into sequence-derived features: a computational study.
3 Biotech. 2021 Nov;11(11):484. doi: 10.1007/s13205-021-03036-8. Epub 2021 Oct 31.

Cited by

1
Improved recognition of splice sites in by incorporating secondary structure information into sequence-derived features: a computational study.
3 Biotech. 2021 Nov;11(11):484. doi: 10.1007/s13205-021-03036-8. Epub 2021 Oct 31.
2
DeCban: Prediction of circRNA-RBP Interaction Sites by Using Double Embeddings and Cross-Branch Attention Networks.
Front Genet. 2021 Jan 22;11:632861. doi: 10.3389/fgene.2020.632861. eCollection 2020.