
Gene ontology based transfer learning for protein subcellular localization.

Affiliation

Software College, Shenyang Normal University, Shenyang, PR China.

Publication information

BMC Bioinformatics. 2011 Feb 2;12:44. doi: 10.1186/1471-2105-12-44.

DOI: 10.1186/1471-2105-12-44
PMID: 21284890
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3039576/
Abstract

BACKGROUND

Predicting protein subcellular localization generally involves many complex factors, and using only one or two aspects of the data may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources and exploit multi-aspect protein feature information. Gene ontology (GO) uses a controlled vocabulary to describe biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, GO has become a general protein feature that can be used to construct predictive models in computational biology. Existing models for protein subcellular localization generally either concatenate the GO terms into a flat binary vector or apply majority-vote ensemble learning; neither approach can estimate the individual discriminative abilities of the three aspects of gene ontology.
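To make the limitation concrete, the following minimal Python sketch contrasts a flat binary GO vector with a per-aspect comparison. The two proteins, their GO identifiers, and the helper names (`flat_binary_vector`, `per_aspect_overlap`) are illustrative assumptions and are not taken from the paper.

```python
# Illustrative sketch only: proteins and GO identifiers are hypothetical inputs.
ASPECTS = ("biological_process", "molecular_function", "cellular_component")


def flat_binary_vector(annotations, vocabulary):
    """Merge all three aspects into one 0/1 vector over a shared vocabulary."""
    terms = set()
    for aspect in ASPECTS:
        terms |= annotations.get(aspect, set())
    return [1 if term in terms else 0 for term in vocabulary]


def per_aspect_overlap(a, b):
    """Count shared GO terms separately for each aspect."""
    return {
        aspect: len(a.get(aspect, set()) & b.get(aspect, set()))
        for aspect in ASPECTS
    }


protein_a = {
    "biological_process": {"GO:0006355"},
    "molecular_function": {"GO:0003677"},
    "cellular_component": {"GO:0005634"},
}
protein_b = {
    "biological_process": {"GO:0006468"},
    "molecular_function": {"GO:0005524"},
    "cellular_component": {"GO:0005634"},
}

vocabulary = sorted(set().union(*protein_a.values(), *protein_b.values()))
vec_a = flat_binary_vector(protein_a, vocabulary)
vec_b = flat_binary_vector(protein_b, vocabulary)

# A single dot product: the score no longer says which aspect produced the match.
flat_similarity = sum(x * y for x, y in zip(vec_a, vec_b))

# Per-aspect scores keep the three aspects apart, so each can be weighted separately.
aspect_similarity = per_aspect_overlap(protein_a, protein_b)
```

In this toy case the flat vector reports one shared term, while the per-aspect view shows that the match comes entirely from the cellular component aspect, which is the kind of information a per-aspect weighting scheme can exploit.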

RESULTS

In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers signature-based homologous GO terms to the target proteins and constructs a reliable learning system to reduce the adverse effect of potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and two spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, which greatly reduces the time and space complexity compared with semi-definite programming and semi-infinite linear programming. The five kernels are then linearly merged into a single kernel for protein subcellular localization. We evaluate GO-TLM against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets those models adopted. In 5-fold cross validation, GO-TLM achieves substantially higher accuracy than the baselines: 80.38% versus 67.40% for Euk-mPLoc, an increase of 12.98 percentage points; 96.65% and 96.27% versus 89.60% and 89.60% for MultiLoc-GO on the MultiLoc plant and MultiLoc animal datasets, increases of 7.05 and 6.67 points, respectively; and 97.14%, 95.90% and 96.85% versus 83.70%, 90.10% and 85.70% for MultiLoc-GO on the BaCelLoc plant, fungi and animal datasets, increases of 13.44, 5.80 and 11.15 points, respectively.
On the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant, fungi and animal holdout sets, respectively, versus 76.00%, 60.00% and 73.00% for the baseline MultiLoc-GO, increases of 5.25, 20.45 and 6.46 points, respectively.
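The kernel construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the spectrum kernel is the standard k-mer count dot product, while the GO kernel (a simple set-overlap count), the rule that turns cross-validation accuracies into weights (plain normalization), the choice of k = 1 and k = 2, and all input values are hypothetical, since the abstract does not specify them.

```python
from collections import Counter


def spectrum_kernel(seq_a, seq_b, k):
    """k-spectrum kernel: dot product of the k-mer count vectors of two sequences."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def go_overlap_kernel(terms_a, terms_b):
    """Assumed GO kernel: number of GO terms shared within one aspect."""
    return len(terms_a & terms_b)


def weights_from_cv_accuracy(accuracies):
    """Assumed weighting rule: normalize per-kernel CV accuracies to sum to 1."""
    total = sum(accuracies)
    return [accuracy / total for accuracy in accuracies]


def combined_kernel(kernel_values, weights):
    """Linear merge of several kernel values into a single kernel value."""
    return sum(w * v for w, v in zip(weights, kernel_values))


# Hypothetical pair of proteins: sequences plus GO terms for the three aspects.
seq_a, seq_b = "MKVLAA", "MKVAAL"
go_a = [{"GO:0006355"}, {"GO:0003677"}, {"GO:0005634"}]
go_b = [{"GO:0006468"}, {"GO:0005524"}, {"GO:0005634"}]

# Three GO kernels (one per aspect) followed by two spectrum kernels.
kernel_values = [go_overlap_kernel(a, b) for a, b in zip(go_a, go_b)]
kernel_values += [spectrum_kernel(seq_a, seq_b, 1), spectrum_kernel(seq_a, seq_b, 2)]

# Hypothetical cross-validation accuracies of the five kernels used alone.
cv_accuracies = [0.5, 0.5, 1.0, 0.25, 0.25]
weights = weights_from_cv_accuracy(cv_accuracies)

score = combined_kernel(kernel_values, weights)
```

Here the two spectrum kernels evaluate to 8 (k = 1) and 3 (k = 2). In practice each kernel would be evaluated for every pair of proteins to form a Gram matrix, typically normalized so the kernels are on comparable scales, and the weighted sum would then be passed to a kernel classifier such as a support vector machine.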

CONCLUSIONS

Because direct homology-based GO term transfer is prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system, the Gene Ontology Based Transfer Learning Model (GO-TLM), to transfer known knowledge about related homologous proteins to the target protein. This reduces the risk of outliers, shares knowledge between homologous proteins, and thus achieves better predictive performance for protein subcellular localization. Cross validation and independent test results show that homology-based GO term transfer and explicit weighting of the GO kernels substantially improve prediction performance.
