Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence.

Authors

Bernardes Juliana, Zaverucha Gerson, Vaquero Catherine, Carbone Alessandra

Affiliations

Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France.

COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil.

Publication information

PLoS Comput Biol. 2016 Jul 29;12(7):e1005038. doi: 10.1371/journal.pcbi.1005038. eCollection 2016 Jul.

DOI:10.1371/journal.pcbi.1005038
PMID:27472895
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4966962/
Abstract

Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, attempts for annotation fail. Here we address the fundamental question of domain identification for highly divergent proteins. By using high performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: 1. for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences, 2. a decision-making protocol combines outcomes obtained from multiple models, 3. a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to 30% of improvement over the total number of Pfam domain predictions on the whole genome. 
The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
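
The abstract's last two steps (combining the outcomes of several models, then choosing the most likely architecture) can be illustrated with a much simpler stand-in. The sketch below is not the CLADE algorithm: it replaces the multi-criteria optimization with plain weighted interval scheduling, which picks the set of non-overlapping hits with the highest total score. The `Hit` record, the domain and model names, and the scores are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple


class Hit(NamedTuple):
    domain: str   # domain family label (hypothetical)
    model: str    # model that produced the hit (hypothetical)
    start: int    # first residue covered, inclusive
    end: int      # last residue covered, inclusive
    score: float  # higher is better


def resolve_architecture(hits: list[Hit]) -> list[Hit]:
    """Return the non-overlapping subset of hits with maximal total score.

    Simplified stand-in (weighted interval scheduling), not the
    multi-criteria optimization described in the abstract.
    """
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]
    best = [0.0] * (len(ordered) + 1)  # best[i]: optimum over the first i hits
    take = [False] * len(ordered)      # take[i]: hit i is part of best[i + 1]
    prev = [0] * len(ordered)          # prev[i]: hits ending before hit i starts

    for i, h in enumerate(ordered):
        # Count earlier hits whose end lies strictly before h.start.
        prev[i] = bisect_right(ends, h.start - 1, 0, i)
        with_h = best[prev[i]] + h.score
        if with_h > best[i]:
            best[i + 1] = with_h
            take[i] = True
        else:
            best[i + 1] = best[i]

    # Walk back through the table to recover the chosen hits.
    chosen = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return chosen[::-1]


if __name__ == "__main__":
    candidates = [
        Hit("DOM_A", "clade1", 1, 100, 50.0),
        Hit("DOM_A", "clade2", 5, 110, 65.0),
        Hit("DOM_B", "clade1", 105, 200, 10.0),
        Hit("DOM_C", "clade3", 210, 300, 30.0),
    ]
    for hit in resolve_architecture(candidates):
        print(hit.domain, hit.model, hit.start, hit.end, hit.score)
```

Because two hits for the same domain from different models overlap on the sequence, the selection keeps at most one of them, which loosely mirrors the idea of letting many models compete for the same region. Domain co-occurrence, which the paper's title names as a criterion, is not modeled here.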

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/c60f65f60ee7/pcbi.1005038.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/6a37236ea2c4/pcbi.1005038.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/6600e3492eba/pcbi.1005038.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/45d3c0c1cab1/pcbi.1005038.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/2136576e6c55/pcbi.1005038.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/0def8bd64662/pcbi.1005038.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/d47277e7575f/pcbi.1005038.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/de8c/4966962/91508fc94a8c/pcbi.1005038.g008.jpg
