CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures.

Author information

Redfern Oliver C, Harrison Andrew, Dallman Tim, Pearl Frances M G, Orengo Christine A

Affiliations

Department of Biochemistry and Molecular Biology, University College London, London, United Kingdom.

Publication information

PLoS Comput Biol. 2007 Nov;3(11):e232. doi: 10.1371/journal.pcbi.0030232.

DOI:10.1371/journal.pcbi.0030232

PMID:18052539

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2098860/

Abstract

We present CATHEDRAL, an iterative protocol for determining the location of previously observed protein folds in novel multidomain protein structures. CATHEDRAL builds on the features of a fast secondary-structure-based method (using graph theory) to locate known folds within a multidomain context and a residue-based, double-dynamic programming algorithm, which is used to align members of the target fold groups against the query protein structure to identify the closest relative and assign domain boundaries. To increase the fidelity of the assignments, a support vector machine is used to provide an optimal scoring scheme. Once a domain is verified, it is excised, and the search protocol is repeated in an iterative fashion until all recognisable domains have been identified. We have performed an initial benchmark of CATHEDRAL against other publicly available structure comparison methods using a consensus dataset of domains derived from the CATH and SCOP domain classifications. CATHEDRAL shows superior performance in fold recognition and alignment accuracy when compared with many equivalent methods. If a novel multidomain structure contains a known fold, CATHEDRAL will locate it in 90% of cases, with <1% false positives. For nearly 80% of assigned domains in a manually validated test set, the boundaries were correctly delineated within a tolerance of ten residues. For the remaining cases, previously classified domains were very remotely related to the query chain so that embellishments to the core of the fold caused significant differences in domain sizes and manual refinement of the boundaries was necessary. To put this performance in context, a well-established sequence method based on hidden Markov models was only able to detect 65% of domains, with 33% of the subsequent boundaries assigned within ten residues. 
Since, on average, 50% of newly determined protein structures contain more than one domain unit, and typically 90% or more of these domains are already classified in CATH, CATHEDRAL will considerably facilitate the automation of protein structure classification.
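
The iterative part of the protocol (find the best-scoring known fold, verify it, excise it, search again) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Hit` record, the `search` callback, the `min_score` and `min_len` cut-offs, and the toy fold library are all hypothetical stand-ins, and the real graph-based scan, double-dynamic programming alignment, and support vector machine scoring are collapsed into a single precomputed `score`. The `boundaries_match` helper shows one plausible reading of the ten-residue boundary tolerance; the paper's exact criterion may differ.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Hit:
    """A candidate domain: a library fold placed on a set of query residues."""
    fold_id: str
    residues: FrozenSet[int]
    score: float  # stand-in for the combined, SVM-derived score (assumption)


# Hypothetical callback: given the residues still unassigned,
# return the best-scoring hit, or None if nothing is recognised.
Searcher = Callable[[Set[int]], Optional[Hit]]


def assign_domains(n_residues: int, search: Searcher,
                   min_score: float = 0.5, min_len: int = 40) -> List[Hit]:
    """Repeatedly find, accept, and excise the best-scoring domain.

    min_score and min_len are illustrative cut-offs, not values from the paper.
    """
    remaining = set(range(n_residues))
    domains: List[Hit] = []
    while len(remaining) >= min_len:
        hit = search(remaining)
        if hit is None or hit.score < min_score:
            break  # no recognisable domain is left
        claimed = hit.residues & remaining
        if not claimed:
            break  # guard against a searcher that returns stale hits
        domains.append(Hit(hit.fold_id, frozenset(claimed), hit.score))
        remaining -= claimed  # excise the verified domain
    return domains


def toy_search(library: List[Hit]) -> Searcher:
    """Toy searcher: best library hit lying entirely in the unassigned residues."""
    def search(remaining: Set[int]) -> Optional[Hit]:
        fits = [h for h in library if h.residues <= remaining]
        return max(fits, key=lambda h: h.score, default=None)
    return search


def boundaries_match(pred: Tuple[int, int], ref: Tuple[int, int],
                     tol: int = 10) -> bool:
    """True if both the start and the end are within tol residues of the reference."""
    return abs(pred[0] - ref[0]) <= tol and abs(pred[1] - ref[1]) <= tol


# Made-up 200-residue query with three candidate placements.
library = [
    Hit("fold_A", frozenset(range(0, 100)), 0.9),
    Hit("fold_B", frozenset(range(100, 180)), 0.8),
    Hit("fold_C", frozenset(range(50, 150)), 0.7),  # overlaps fold_A, never fits
]
domains = assign_domains(200, toy_search(library))
spans = [(d.fold_id, min(d.residues), max(d.residues)) for d in domains]
print(spans)  # [('fold_A', 0, 99), ('fold_B', 100, 179)]
```

In this toy run the highest-scoring fold is accepted first, its residues are removed, and the search is repeated on what is left. The loop stops once the unassigned remainder is shorter than `min_len`, which mirrors the abstract's "until all recognisable domains have been identified" only in spirit.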

