Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390.
Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX 75390.
Proc Natl Acad Sci U S A. 2023 Mar 21;120(12):e2214069120. doi: 10.1073/pnas.2214069120. Epub 2023 Mar 14.
Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. Nearly a quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AlphaFold (AF) models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website (http://prodata.swmed.edu/ecod/index_human.php).