Prediction of RNA-binding amino acids from protein and RNA sequences.

Affiliation

School of Computer Science and Engineering, Inha University, Inchon 402-751, South Korea.

Publication information

BMC Bioinformatics. 2011;12 Suppl 13(Suppl 13):S7. doi: 10.1186/1471-2105-12-S13-S7. Epub 2011 Nov 30.

DOI:10.1186/1471-2105-12-S13-S7
PMID:22373313
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3278847/
Abstract

BACKGROUND

Many learning approaches to predicting RNA-binding residues in a protein sequence construct a non-redundant training dataset based on sequence similarity. A sequence similarity-based method either keeps a whole sequence in the training dataset or discards it entirely. However, similar or even identical sequences can have different interaction sites depending on their interaction partners, and this information is lost when the sequences are removed. Furthermore, a training dataset constructed by a sequence similarity-based method may still contain redundant data when a retained sequence contains similar subsequences within itself. In addition to the problem with the training dataset, most approaches do not consider the interacting partner (i.e., RNA) of a protein when they predict RNA-binding amino acids. Thus, they always predict the same RNA-binding sites for a given protein sequence, even if the protein binds to different RNA molecules.
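
To make the information loss described above concrete, the following is a minimal sketch using made-up toy data. The sequences, RNA partner names, and binding labels are hypothetical and not taken from the paper, and exact-match filtering stands in for a real similarity threshold. It shows that when identical protein sequences are collapsed to one representative, the binding sites observed with the other RNA partner disappear from the dataset.

```python
# Toy illustration only: sequences, RNA partner names, and labels are hypothetical.
# Each record: (protein sequence, RNA partner id, per-residue binding labels).
pairs = [
    ("MKRGA", "rna_A", "01100"),
    ("MKRGA", "rna_B", "00011"),  # same protein, different partner, different sites
    ("GSTLV", "rna_A", "00000"),
]


def dedupe_by_sequence(records):
    """Keep the first record for each distinct protein sequence.

    This stands in for whole-sequence filtering: a sequence is either kept
    or discarded as a unit (here with a 100% identity criterion).
    """
    seen = set()
    kept = []
    for seq, partner, labels in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((seq, partner, labels))
    return kept


kept = dedupe_by_sequence(pairs)
lost = [record for record in pairs if record not in kept]
print("kept:", kept)
print("lost:", lost)
```

In this toy case the record for `rna_B` is the one that gets dropped, so a model trained on the filtered data never sees that the same protein can bind at different positions.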

RESULTS

We developed a feature vector-based method that removes redundant data to produce a non-redundant training dataset. The feature vector-based method constructed a larger training dataset than the standard sequence similarity-based method, yet the dataset contained no redundant data. We identified effective features of protein and RNA (the interaction propensity of amino acid triplets, global features of the protein sequence, and RNA features) for predicting RNA-binding residues. Using the method and features, we built a support vector machine (SVM) model that predicts RNA-binding residues in a protein sequence. Our SVM model showed an accuracy of 84.2%, an F-measure of 76.1%, and a correlation coefficient of 0.41 with 5-fold cross validation on a non-redundant dataset from 3,149 protein-RNA interacting pairs. On an independent test dataset that excludes the 3,149 pairs and was not used in training the SVM model, it achieved an accuracy of 90.3%, an F-measure of 72.8%, and a correlation coefficient of 0.24. Comparison with other methods on the same datasets showed that our model outperformed the others.
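
The three measures reported above can be computed from a binary confusion matrix. The sketch below uses the standard textbook definitions; the abstract does not spell out its formulas, so treating "F-measure" as the harmonic mean of precision and recall and "correlation coefficient" as the Matthews correlation coefficient is an assumption. The counts in the example are hypothetical and are not the paper's data.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, F-measure, and Matthews correlation coefficient.

    Assumes the standard definitions; the source abstract does not state
    which formulas it used.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f_measure": f_measure, "mcc": mcc}


# Hypothetical counts for an imbalanced test set (few binding residues).
m = binary_metrics(tp=20, fp=30, tn=900, fn=50)
print(m)
```

With these made-up counts, accuracy is high because non-binding residues dominate, while the correlation coefficient stays modest. That is one way a high accuracy and a low correlation coefficient can coexist on an imbalanced dataset.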

CONCLUSIONS

The feature vector-based redundancy reduction method is effective for constructing a non-redundant training dataset for a learning model, since it generates a larger dataset than the standard sequence similarity-based method while still containing no redundant data. When predicting RNA-binding residues in a protein sequence, including features of both the RNA and the protein sequence in a feature vector gives better performance than using protein features alone.
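
As a rough illustration of the idea, the sketch below removes redundancy at the level of per-residue feature vectors rather than whole sequences. The encoding is a deliberate simplification and an assumption: a residue triplet plus an RNA partner identifier stands in for the paper's actual features (triplet interaction propensity, global protein features, and RNA features), which the abstract does not specify in detail. All data and names are hypothetical.

```python
def window_features(protein, rna_feature, center, half_width=1):
    """Build a hashable feature vector for the residue at index `center`.

    Assumption: the vector is the residue triplet around `center`
    (padded with '-') plus a single RNA feature. The real method uses
    richer features.
    """
    padded = "-" * half_width + protein + "-" * half_width
    triplet = padded[center:center + 2 * half_width + 1]
    return (triplet, rna_feature)


def build_nonredundant(records):
    """Keep each distinct (feature vector, label) combination once."""
    seen = set()
    dataset = []
    for protein, rna_feature, labels in records:
        for i, label in enumerate(labels):
            item = (window_features(protein, rna_feature, i), label)
            if item not in seen:
                seen.add(item)
                dataset.append(item)
    return dataset


# Hypothetical protein-RNA pairs: (protein, RNA feature, per-residue labels).
records = [
    ("MKRGA", "rna_A", "01100"),
    ("MKRGA", "rna_B", "00011"),  # same protein, other partner: kept
    ("MKRGA", "rna_A", "01100"),  # exact repeat: contributes nothing new
    ("AAAA", "rna_A", "0000"),    # internal repeat: 'AAA' counted once
]
dataset = build_nonredundant(records)
print(len(dataset), "non-redundant training examples")
```

Under these assumptions, the second record survives because its RNA feature differs, the third adds nothing, and the repeated `AAA` window inside the fourth sequence is stored once. This mirrors the two claims in the text: partner-specific data is retained, and within-sequence redundancy is removed.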


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d5e/3278847/68a8d4214204/1471-2105-12-S13-S7-1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d5e/3278847/bdaf8d853b93/1471-2105-12-S13-S7-2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d5e/3278847/4a72894b5ec8/1471-2105-12-S13-S7-3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d5e/3278847/73aaf79b7d28/1471-2105-12-S13-S7-4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1d5e/3278847/911140261b72/1471-2105-12-S13-S7-5.jpg

