Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA; Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA; Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, USA.
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK.
J Mol Biol. 2024 Nov 15;436(22):168764. doi: 10.1016/j.jmb.2024.168764. Epub 2024 Aug 26.
Classification of protein domains based on homology and structural similarity serves as a fundamental tool to gain biological insights into protein function. Recent advancements in protein structure prediction, exemplified by AlphaFold, have revolutionized the availability of protein structural data. We focus on classifying about 9000 Pfam families into ECOD (Evolutionary Classification of Domains) by using predicted AlphaFold models and the DPAM (Domain Parser for AlphaFold Models) tool. Our results offer insights into their homologous relationships and domain boundaries. More than half of these Pfam families contain DPAM domains that can be confidently assigned to the ECOD hierarchy. Most assigned domains belong to highly populated folds such as Immunoglobulin-like (IgL), Armadillo (ARM), helix-turn-helix (HTH), and Src homology 3 (SH3). A large fraction of DPAM domains, however, cannot be confidently assigned to ECOD homologous groups. These unassigned domains exhibit statistically different characteristics, including shorter average length, fewer secondary structure elements, and more abundant transmembrane segments. They could potentially define novel families remotely related to domains with known structures or novel superfamilies and folds. Manual scrutiny of a subset of these domains revealed an abundance of internal duplications and recurring structural motifs. Exploring sequence and structural features such as disulfide bond patterns, metal-binding sites, and enzyme active sites helped uncover novel structural folds as well as remote evolutionary relationships. By bridging the gap between sequence-based Pfam and structure-based ECOD domain classifications, our study contributes to a more comprehensive understanding of the protein universe by providing structural and functional insights into previously uncharacterized proteins.