Using context to improve protein domain identification.

Affiliation

Department of Molecular Biology, Princeton University, Princeton, NJ, USA.

Publication information

BMC Bioinformatics. 2011 Mar 31;12:90. doi: 10.1186/1471-2105-12-90.

DOI:10.1186/1471-2105-12-90

PMID:21453511

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3090354/

Abstract

BACKGROUND

Identifying domains in protein sequences is an important step in protein structural and functional annotation. Existing domain recognition methods typically evaluate each domain prediction independently of the rest. However, the majority of proteins are multidomain, and pairwise domain co-occurrences are highly specific and non-transitive.
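To make "non-transitive" concrete, the sketch below counts pairwise co-occurrences over a handful of made-up domain architectures. The family labels A to D and the architectures are purely hypothetical, chosen for illustration; they are not data from the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical domain architectures: the set of domain families in each protein.
architectures = [
    {"A", "B"},
    {"A", "B"},
    {"B", "C"},
    {"B", "C", "D"},
]


def cooccurrence_counts(archs):
    """Count how many proteins contain each unordered pair of families."""
    counts = Counter()
    for domains in archs:
        # Sort so that each unordered pair always maps to the same key.
        for pair in combinations(sorted(domains), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(architectures)


def observed(x, y):
    """True if families x and y were ever seen in the same protein."""
    return counts[tuple(sorted((x, y)))] > 0


# A co-occurs with B, and B with C, yet A and C are never seen together.
print(observed("A", "B"), observed("B", "C"), observed("A", "C"))
```

In this toy data, knowing that A pairs with B and B pairs with C says nothing about A and C, which is why pairwise observations have to be tabulated directly rather than inferred.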

RESULTS

Here, we demonstrate how to exploit domain co-occurrence to boost weak domain predictions that appear in previously observed combinations, while penalizing higher-confidence domains that appear in combinations never previously observed. Our framework, Domain Prediction Using Context (dPUC), incorporates pairwise "context" scores between domains, along with traditional domain scores and thresholds, and improves domain prediction across a variety of organisms from bacteria to protozoa and metazoa. Among the genomes we tested, dPUC is most successful at improving predictions for the poorly annotated malaria parasite Plasmodium falciparum, for which over 38% of the genome is currently unannotated. Our approach enables high-confidence annotations in this organism and the identification of orthologs to many core machinery proteins conserved in all eukaryotes, including those involved in ribosomal assembly and other RNA processing events, which, surprisingly, were not previously known.
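The idea of combining per-domain scores and thresholds with pairwise context scores can be illustrated with a toy sketch. This is not the dPUC implementation (the authors' source code is at the URL given in the conclusions). The candidate families, scores, thresholds, context values, the flat penalty for unseen pairs, and the brute-force search are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical candidate hits on one protein: (family, raw score, family threshold).
candidates = [
    ("RRM", 12.0, 20.0),       # weak hit, below its threshold
    ("Helicase", 35.0, 25.0),  # confident hit
    ("Kinase", 27.0, 25.0),    # just above its threshold
]

# Hypothetical context scores for pairs seen together before.
context = {frozenset(("RRM", "Helicase")): 15.0}
UNSEEN_PENALTY = -10.0  # assumed flat penalty for a never-observed pair


def pair_score(a, b):
    """Context score for two families; unseen pairs get the penalty."""
    return context.get(frozenset((a, b)), UNSEEN_PENALTY)


def total_score(subset):
    """Sum of (score - threshold) per hit plus all pairwise context scores."""
    own = sum(score - threshold for _, score, threshold in subset)
    ctx = sum(pair_score(a[0], b[0]) for a, b in combinations(subset, 2))
    return own + ctx


def best_subset(cands):
    """Exhaustively pick the subset with the highest total (toy sizes only)."""
    best, best_val = (), 0.0  # the empty prediction scores zero
    for r in range(1, len(cands) + 1):
        for subset in combinations(cands, r):
            val = total_score(subset)
            if val > best_val:
                best, best_val = subset, val
    return [c[0] for c in best], best_val


# Baseline: judge each hit on its own against its threshold.
independent = [fam for fam, score, thr in candidates if score >= thr]
chosen, value = best_subset(candidates)
print(independent)    # ['Helicase', 'Kinase']
print(chosen, value)  # ['RRM', 'Helicase'] 17.0
```

With these made-up numbers, the weak RRM hit is rescued by its previously observed pairing with Helicase, while the Kinase hit, which passes its own threshold, is dropped because it forms no previously observed pair. Exhaustive search is exponential in the number of candidates, so it is suitable only for a demonstration of this size.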

CONCLUSIONS

Overall, our results demonstrate that this new context-based approach will provide significant improvements in domain and function prediction, especially for poorly understood genomes for which the need for additional annotations is greatest. Source code for the algorithm is available under a GPL open source license at http://compbio.cs.princeton.edu/dpuc/. Pre-computed results for our test organisms and a web server are also available at that location.
