Protein subfamily assignment using the Conserved Domain Database.

Authors

Fong Jessica H, Marchler-Bauer Aron

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

Publication information

BMC Res Notes. 2008 Nov 14;1:114. doi: 10.1186/1756-0500-1-114.

DOI:10.1186/1756-0500-1-114

PMID:19014584

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2632666/

Abstract

BACKGROUND

Domains, evolutionarily conserved units of proteins, are widely used to classify protein sequences and infer protein function. Often, two or more overlapping domain models match a region of a protein sequence. Therefore, procedures are required to choose appropriate domain annotations for the protein. Here, we propose a method for assigning NCBI-curated domains from the Conserved Domain Database (CDD) that takes into account the organization of the domains into hierarchies of homologous domain models.
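To make the overlap problem concrete, the following minimal Python sketch groups domain hits whose footprints on the query sequence overlap. It is an illustration only, not the CD-Search implementation: the `Hit` record, its field names, and the rule that any shared residue counts as an overlap are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one domain-model match on a query protein."""
    model: str    # illustrative model identifier
    start: int    # first aligned residue (inclusive)
    end: int      # last aligned residue (inclusive)
    score: float  # alignment score


def group_overlapping(hits: List[Hit]) -> List[List[Hit]]:
    """Cluster hits whose intervals overlap, directly or transitively."""
    groups: List[List[Hit]] = []
    current: List[Hit] = []
    current_end = 0
    # Sweep from left to right; a hit joins the open group if it starts
    # before the right-most end seen so far in that group.
    for hit in sorted(hits, key=lambda h: (h.start, h.end)):
        if current and hit.start <= current_end:
            current.append(hit)
            current_end = max(current_end, hit.end)
        else:
            if current:
                groups.append(current)
            current = [hit]
            current_end = hit.end
    if current:
        groups.append(current)
    return groups


hits = [Hit("A", 1, 100, 50.0), Hit("B", 20, 110, 60.0), Hit("C", 200, 300, 40.0)]
groups = group_overlapping(hits)
```

Each resulting group is a region where a choice among competing domain models has to be made.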

FINDINGS

Our analysis of alignment scores from NCBI-curated domain assignments suggests that identifying the correct model among closely related models is more difficult than choosing between non-overlapping domain models. We find that simple heuristics based on sorting scores and domain-specific thresholds are effective at reducing classification error. In our test set, the heuristics replace almost 90% of the current misclassifications caused by missing domain subfamilies with more generic domain assignments, thereby eliminating a significant amount of error within the database.
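The abstract does not spell out the exact assignment rule, so the sketch below is only one plausible reading of "sorting scores and domain-specific thresholds" in a domain hierarchy: start from the best-scoring model in an overlapping region, accept it if its score reaches that model's own threshold, and otherwise fall back to the nearest ancestor that passes. The hierarchy, model names, and threshold values are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical hierarchy: child model -> parent model (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "root_family": None,
    "subfam_A": "root_family",
    "subfam_B": "root_family",
}

# Hypothetical model-specific score thresholds.
THRESHOLD: Dict[str, float] = {
    "root_family": 40.0,
    "subfam_A": 120.0,
    "subfam_B": 150.0,
}


def assign_model(scores: Dict[str, float]) -> Optional[str]:
    """Pick one model for a region, given model -> alignment score.

    Assumed rule (not taken from the paper): try the top-scoring model,
    then walk up the hierarchy until a model with a hit meets its threshold.
    """
    if not scores:
        return None
    candidate: Optional[str] = max(scores, key=lambda m: scores[m])
    while candidate is not None:
        score = scores.get(candidate)
        if score is not None and score >= THRESHOLD[candidate]:
            return candidate
        # Fall back to the more generic parent model.
        candidate = PARENT[candidate]
    return None


specific = assign_model({"subfam_A": 130.0, "subfam_B": 90.0, "root_family": 80.0})
generic = assign_model({"subfam_A": 100.0, "root_family": 80.0})
```

In the second call the best-scoring subfamily misses its threshold, so the region receives the more generic parent assignment, which mirrors the behavior the findings describe for sequences whose true subfamily is missing from the database.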

CONCLUSION

Our proposed domain subfamily assignment rule has been incorporated into the CD-Search software for assigning CDD domains to query protein sequences and has significantly improved pre-calculated domain annotations on protein sequences in NCBI's Entrez resource.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8125/2632666/cabf4174f28d/1756-0500-1-114-1.jpg

Similar articles

Protein subfamily assignment using the Conserved Domain Database.

BMC Res Notes. 2008 Nov 14;1:114. doi: 10.1186/1756-0500-1-114.

CDD: a conserved domain database for interactive domain family analysis.

Nucleic Acids Res. 2007 Jan;35(Database issue):D237-40. doi: 10.1093/nar/gkl951. Epub 2006 Nov 29.

CDD: a Conserved Domain Database for protein classification.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D192-6. doi: 10.1093/nar/gki069.

NCBI's Conserved Domain Database and Tools for Protein Domain Analysis.

Curr Protoc Bioinformatics. 2020 Mar;69(1):e90. doi: 10.1002/cpbi.90.

CDD: specific functional annotation with the Conserved Domain Database.

Nucleic Acids Res. 2009 Jan;37(Database issue):D205-10. doi: 10.1093/nar/gkn845. Epub 2008 Nov 4.

CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

Nucleic Acids Res. 2017 Jan 4;45(D1):D200-D203. doi: 10.1093/nar/gkw1129. Epub 2016 Nov 29.

CDD: a Conserved Domain Database for the functional annotation of proteins.

Nucleic Acids Res. 2011 Jan;39(Database issue):D225-9. doi: 10.1093/nar/gkq1189. Epub 2010 Nov 24.

CDD: NCBI's conserved domain database.

Nucleic Acids Res. 2015 Jan;43(Database issue):D222-6. doi: 10.1093/nar/gku1221. Epub 2014 Nov 20.

CDD: a curated Entrez database of conserved domain alignments.

Nucleic Acids Res. 2003 Jan 1;31(1):383-7. doi: 10.1093/nar/gkg087.

CDD: a database of conserved domain alignments with links to domain three-dimensional structure.

Nucleic Acids Res. 2002 Jan 1;30(1):281-3. doi: 10.1093/nar/30.1.281.

Cited by

Systematic Analysis of Genes in Six Species Reveals the Evolutionary Dynamics, Carotenoid and Anthocyanin Accumulation, and Stress Responses of Sweet Potato.

Genes (Basel). 2025 Feb 24;16(3):266. doi: 10.3390/genes16030266.

Cellular and physiological functions of SGR family in gravitropic response in higher plants.

J Adv Res. 2025 Jan;67:43-60. doi: 10.1016/j.jare.2024.01.026. Epub 2024 Feb 1.

Genome-Wide Identification and Analysis of the Hsp40/J-Protein Family Reveals Its Role in Soybean () Growth and Development.

Genes (Basel). 2023 Jun 12;14(6):1254. doi: 10.3390/genes14061254.

Molecular Assessment of Domain I of Apical Membrane Antigen I Gene in : Implications in Invasion, Taxonomy, Vaccine Development, and Drug Discovery.

Can J Infect Dis Med Microbiol. 2022 Oct 7;2022:1419998. doi: 10.1155/2022/1419998. eCollection 2022.

Revealing potential functions of hypothetical proteins induced by genistein in the symbiosis island of Bradyrhizobium japonicum commercial strain SEMIA 5079 (= CPAC 15).

BMC Microbiol. 2022 May 5;22(1):122. doi: 10.1186/s12866-022-02527-9.

Molecular details of secretory phospholipase A from flax (Linum usitatissimum L.) provide insight into its structure and function.

Sci Rep. 2017 Sep 11;7(1):11080. doi: 10.1038/s41598-017-10969-9.

A Protein Domain and Family Based Approach to Rare Variant Association Analysis.

PLoS One. 2016 Apr 29;11(4):e0153803. doi: 10.1371/journal.pone.0153803. eCollection 2016.

Proteins linked to autosomal dominant and autosomal recessive disorders harbor characteristic rare missense mutation distribution patterns.

Hum Mol Genet. 2015 Nov 1;24(21):5995-6002. doi: 10.1093/hmg/ddv309. Epub 2015 Aug 5.

Bioinformatics approaches for structural and functional analysis of proteins in secondary metabolism in Withania somnifera.

Mol Biol Rep. 2014 Nov;41(11):7323-30. doi: 10.1007/s11033-014-3618-3. Epub 2014 Aug 2.

Reassessing domain architecture evolution of metazoan proteins: major impact of gene prediction errors.

Genes (Basel). 2011 Jul 13;2(3):449-501. doi: 10.3390/genes2030449.

References

Automated protein subfamily identification and classification.

PLoS Comput Biol. 2007 Aug;3(8):e160. doi: 10.1371/journal.pcbi.0030160.

CDD: a conserved domain database for interactive domain family analysis.

Nucleic Acids Res. 2007 Jan;35(Database issue):D237-40. doi: 10.1093/nar/gkl951. Epub 2006 Nov 29.

Genomic scale sub-family assignment of protein domains.

Nucleic Acids Res. 2006 Jul 28;34(13):3625-33. doi: 10.1093/nar/gkl484. Print 2006.

SMART 5: domains in the context of genomes and networks.

Nucleic Acids Res. 2006 Jan 1;34(Database issue):D257-60. doi: 10.1093/nar/gkj079.

Pfam: clans, web tools and services.

Nucleic Acids Res. 2006 Jan 1;34(Database issue):D247-51. doi: 10.1093/nar/gkj149.

Subfamily hmms in functional genomics.

Pac Symp Biocomput. 2005:322-33.

Percolation of annotation errors through hierarchically structured protein sequence databases.

Math Biosci. 2005 Feb;193(2):223-34. doi: 10.1016/j.mbs.2004.08.001.

The evolution of domain arrangements in proteins and interaction networks.

Cell Mol Life Sci. 2005 Feb;62(4):435-45. doi: 10.1007/s00018-004-4416-1.

CDD: a Conserved Domain Database for protein classification.

Nucleic Acids Res. 2005 Jan 1;33(Database issue):D192-6. doi: 10.1093/nar/gki069.

CD-Search: protein domain annotations on the fly.

Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W327-31. doi: 10.1093/nar/gkh454.
