Improvement in Protein Domain Identification Is Reached by Breaking Consensus, with the Agreement of Many Profiles and Domain Co-occurrence.

Authors

Bernardes Juliana, Zaverucha Gerson, Vaquero Catherine, Carbone Alessandra

Affiliations

Sorbonne Universités, UPMC Univ-Paris 6, CNRS, UMR 7238, Laboratoire de Biologie Computationnelle et Quantitative, Paris, France.

COPPE, Programa de Engenharia de Sistemas e Computação, Universidade Federal do Rio de Janeiro, Rio de Janeiro, Brazil.

Publication details

PLoS Comput Biol. 2016 Jul 29;12(7):e1005038. doi: 10.1371/journal.pcbi.1005038. eCollection 2016 Jul.

DOI:10.1371/journal.pcbi.1005038

PMID:27472895

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4966962/

Abstract

Traditional protein annotation methods describe known domains with probabilistic models representing consensus among homologous domain sequences. However, when relevant signals become too weak to be identified by a global consensus, annotation attempts fail. Here we address the fundamental question of domain identification for highly divergent proteins. Using high-performance computing, we demonstrate that the limits of state-of-the-art annotation methods can be bypassed. We design a new strategy based on the observation that many structural and functional protein constraints are not globally conserved through all species but might be locally conserved in separate clades. We propose a novel exploitation of the large amount of data available: (1) for each known protein domain, several probabilistic clade-centered models are constructed from a large and differentiated panel of homologous sequences; (2) a decision-making protocol combines outcomes obtained from multiple models; (3) a multi-criteria optimization algorithm finds the most likely protein architecture. The method is evaluated for domain and architecture prediction over several datasets and statistical testing hypotheses. Its performance is compared against HMMScan and HHblits, two widely used search methods based on sequence-profile and profile-profile comparison. Due to their closeness to actual protein sequences, clade-centered models are shown to be more specific and functionally predictive than the broadly used consensus models. Based on them, we improved annotation of Plasmodium falciparum protein sequences on a scale not previously possible. We successfully predict at least one domain for 72% of P. falciparum proteins against 63% achieved previously, corresponding to a 30% improvement in the total number of Pfam domain predictions on the whole genome. The method is applicable to any genome and opens new avenues to tackle evolutionary questions such as the reconstruction of ancient domain duplications, the reconstruction of the history of protein architectures, and the estimation of protein domain age. Website and software: http://www.lcqb.upmc.fr/CLADE.
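
To make steps (2) and (3) of the abstract more concrete, the following Python sketch shows one simple way to pool hits from several models and pick a non-overlapping domain architecture. It is not the authors' decision protocol or their multi-criteria optimizer, and it ignores domain co-occurrence. It assumes a single criterion (the sum of -log10 E-values) and uses weighted interval scheduling as a stand-in technique. The `Hit` fields, the function name `best_architecture`, and the domain and model labels in the demo are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from math import log10


@dataclass(frozen=True)
class Hit:
    domain: str    # domain family label (hypothetical in the demo below)
    model: str     # which clade-centered model produced the hit (hypothetical)
    start: int     # 1-based inclusive start on the query protein
    end: int       # 1-based inclusive end on the query protein
    evalue: float  # E-value reported by the search tool

    @property
    def score(self) -> float:
        # Smaller E-values give larger scores; clamp to avoid log10(0).
        return -log10(max(self.evalue, 1e-300))


def best_architecture(hits: list[Hit]) -> list[Hit]:
    """Return the non-overlapping subset of hits with maximum total score.

    Assumption: a single additive criterion. This is classic weighted
    interval scheduling, not the multi-criteria method of the paper.
    """
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]
    # best[i] = (total score, chosen hits) using only the first i hits.
    best: list[tuple[float, list[Hit]]] = [(0.0, [])]
    for i, hit in enumerate(ordered):
        # Count earlier hits that end strictly before this hit starts.
        k = bisect_right(ends, hit.start - 1, 0, i)
        take = (best[k][0] + hit.score, best[k][1] + [hit])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


# Demo: two models of the same domain compete for one region,
# and a third and fourth hit overlap each other downstream.
hits = [
    Hit("PF_A", "clade1", 10, 120, 1e-5),
    Hit("PF_A", "clade2", 15, 118, 1e-20),
    Hit("PF_B", "clade1", 100, 200, 1e-8),
    Hit("PF_C", "clade3", 130, 260, 1e-12),
]
architecture = best_architecture(hits)
print([(h.domain, h.model, h.start, h.end) for h in architecture])
```

In this toy input the stronger of the two `PF_A` hits is kept, and `PF_C` is preferred over the overlapping `PF_B` because the combined score is higher. A fuller treatment would allow small overlaps and weigh additional criteria, as the abstract indicates.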

Similar articles

Improving pairwise comparison of protein sequences with domain co-occurrence.

PLoS Comput Biol. 2018 Jan 2;14(1):e1005889. doi: 10.1371/journal.pcbi.1005889. eCollection 2018 Jan.

A multi-objective optimization approach accurately resolves protein domain architectures.

Bioinformatics. 2016 Feb 1;32(3):345-53. doi: 10.1093/bioinformatics/btv582. Epub 2015 Oct 12.

Plasmobase: a comparative database of predicted domain architectures for Plasmodium genomes.

Malar J. 2017 Jun 7;16(1):241. doi: 10.1186/s12936-017-1887-8.

Detection of new protein domains using co-occurrence: application to Plasmodium falciparum.

Bioinformatics. 2009 Dec 1;25(23):3077-83. doi: 10.1093/bioinformatics/btp560. Epub 2009 Sep 28.

Fitting hidden Markov models of protein domains to a target species: application to Plasmodium falciparum.

BMC Bioinformatics. 2012 May 1;13:67. doi: 10.1186/1471-2105-13-67.

Bioinformatics. 2014 Jan 15;30(2):274-81. doi: 10.1093/bioinformatics/btt379. Epub 2013 Jul 4.

MyCLADE: a multi-source domain annotation server for sequence functional exploration.

Nucleic Acids Res. 2021 Jul 2;49(W1):W452-W458. doi: 10.1093/nar/gkab395.

HMMerThread: detecting remote, functional conserved domains in entire genomes by combining relaxed sequence-database searches with fold recognition.

PLoS One. 2011 Mar 10;6(3):e17568. doi: 10.1371/journal.pone.0017568.

Capturing protein sequence-structure specificity using computational sequence design.

Proteins. 2013 Sep;81(9):1556-70. doi: 10.1002/prot.24307. Epub 2013 Jun 20.

Cited by

Evolutionary dynamics of genome size and content during the adaptive radiation of Heliconiini butterflies.

Nat Commun. 2023 Sep 12;14(1):5620. doi: 10.1038/s41467-023-41412-5.

CeGAL: Redefining a Widespread Fungal-Specific Transcription Factor Family Using an In Silico Error-Tracking Approach.

J Fungi (Basel). 2023 Mar 29;9(4):424. doi: 10.3390/jof9040424.

Multi-head attention-based U-Nets for predicting protein domain boundaries using 1D sequence features and 2D distance maps.

BMC Bioinformatics. 2022 Jul 19;23(1):283. doi: 10.1186/s12859-022-04829-1.

Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families.

Mol Biol Evol. 2022 Apr 10;39(4). doi: 10.1093/molbev/msac070.

MyCLADE: a multi-source domain annotation server for sequence functional exploration.

Nucleic Acids Res. 2021 Jul 2;49(W1):W452-W458. doi: 10.1093/nar/gkab395.

Protein domain identification methods and online resources.

Comput Struct Biotechnol J. 2021 Feb 2;19:1145-1153. doi: 10.1016/j.csbj.2021.01.041. eCollection 2021.

Genome-enabled phylogenetic and functional reconstruction of an araphid pennate diatom Plagiostriata sp. CCMP470, previously assigned as a radial centric diatom, and its bacterial commensal.

Sci Rep. 2020 Jun 10;10(1):9449. doi: 10.1038/s41598-020-65941-x.

Identification of Plasmodium falciparum nuclear proteins by mass spectrometry and proposed protein annotation.

PLoS One. 2018 Oct 31;13(10):e0205596. doi: 10.1371/journal.pone.0205596. eCollection 2018.

Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets.

Syst Biol. 2019 May 1;68(3):396-411. doi: 10.1093/sysbio/syy068.

A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling.

Microbiome. 2018 Aug 28;6(1):149. doi: 10.1186/s40168-018-0532-2.

References

A multi-objective optimization approach accurately resolves protein domain architectures.

Bioinformatics. 2016 Feb 1;32(3):345-53. doi: 10.1093/bioinformatics/btv582. Epub 2015 Oct 12.

SCOPe: Structural Classification of Proteins--extended, integrating SCOP and ASTRAL data and classification of new structures.

Nucleic Acids Res. 2014 Jan;42(Database issue):D304-9. doi: 10.1093/nar/gkt1240. Epub 2013 Dec 3.

Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis.

Nucleic Acids Res. 2014 Jan;42(Database issue):D240-5. doi: 10.1093/nar/gkt1205. Epub 2013 Nov 21.

PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees.

Nucleic Acids Res. 2013 Jan;41(Database issue):D377-86. doi: 10.1093/nar/gks1118. Epub 2012 Nov 27.

HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment.

Nat Methods. 2011 Dec 25;9(2):173-5. doi: 10.1038/nmeth.1818.

Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis.

Nucleic Acids Res. 2012 Jan;40(Database issue):D465-71. doi: 10.1093/nar/gkr1181. Epub 2011 Dec 1.

Accelerated Profile HMM Searches.

PLoS Comput Biol. 2011 Oct;7(10):e1002195. doi: 10.1371/journal.pcbi.1002195. Epub 2011 Oct 20.

Using context to improve protein domain identification.

BMC Bioinformatics. 2011 Mar 31;12:90. doi: 10.1186/1471-2105-12-90.

A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models.

BMC Bioinformatics. 2011 Mar 23;12:83. doi: 10.1186/1471-2105-12-83.

A fast and automated solution for accurately resolving protein domain architectures.

Bioinformatics. 2010 Mar 15;26(6):745-51. doi: 10.1093/bioinformatics/btq034. Epub 2010 Jan 29.
