School of Data Science, University of Virginia, Charlottesville, VA, USA.
Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA.
Nat Commun. 2024 Sep 16;15(1):8094. doi: 10.1038/s41467-024-52020-2.
Our views of fold space implicitly rest upon many assumptions that impact how we analyze, interpret and understand protein structure, function and evolution. For instance, is there an optimal granularity in viewing protein structural similarities (e.g., architecture, topology or some other level)? Similarly, the discrete/continuous dichotomy of fold space is central, but remains unresolved. Discrete views of fold space bin similar folds into distinct, non-overlapping groups; unfortunately, such binning can miss remote relationships. While hierarchical systems like CATH are indispensable resources, less heuristic and more conceptually flexible approaches could enable more nuanced explorations of fold space. Building upon an Urfold model of protein structure, here we present a deep generative modeling framework, termed DeepUrfold, for analyzing protein relationships at scale. DeepUrfold's learned embeddings occupy high-dimensional latent spaces that can be distilled for a given protein in terms of an amalgamated representation uniting sequence, structure and biophysical properties. This approach is structure-guided, versus being purely structure-based, and DeepUrfold learns representations that, in a sense, define superfamilies. Deploying DeepUrfold with CATH reveals evolutionarily remote relationships that evade existing methodologies, and suggests a mostly continuous view of fold space: a view that extends beyond simple geometric similarity, towards the realm of integrated sequence ↔ structure ↔ function properties.