Derry Alexander, Krupkin Haim, Tartici Alp, Altman Russ B
Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States.
Department of Genetics, Stanford University, Stanford, CA 94305, United States.
Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf377.
Proteins are known to share similarities in local regions of three-dimensional (3D) structure even across disparate global folds. Such correspondences can illuminate functional relationships between proteins and identify conserved local structural features that underlie function. Self-supervised deep learning on large protein structure datasets has produced high-fidelity representations of local structural microenvironments, providing the opportunity to characterize the landscape of local structure and function at scale.
In this work, we leverage these representations to cluster over 15 million environments in the Protein Data Bank, creating a "lexicon" of local 3D motifs that form the building blocks of all known protein structures. We characterize these motifs and demonstrate that they provide valuable information for modeling structure and function at all scales of protein analysis, from full protein chains to binding pockets to individual amino acids. We devise a new protein representation based solely on a protein's constituent local motifs and show that this representation enables state-of-the-art performance on protein structure search and model quality assessment. We then show that this approach enables accurate prediction of drug off-target interactions by modeling the similarity between local binding pockets. Finally, we identify structural motifs associated with pathogenic variants in the human proteome by leveraging the predicted structures in the AlphaFold structure database.
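To make the idea of a motif-based protein representation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each local environment already has a fixed-length embedding and that motif centroids have already been obtained by clustering. The nearest-centroid assignment, the normalized motif-frequency vector, the cosine comparison, the function names, the embedding dimension, and the toy data are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np


def assign_motifs(embeddings, centroids):
    """Assign each environment embedding to its nearest centroid (assumed Euclidean)."""
    # Pairwise squared distances, shape (n_environments, n_motifs).
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    return sq_dists.argmin(axis=1)


def motif_profile(motif_ids, n_motifs):
    """Summarize one protein as a normalized motif-frequency vector (hypothetical choice)."""
    counts = np.bincount(motif_ids, minlength=n_motifs).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts


def cosine_similarity(a, b):
    """Cosine similarity between two motif profiles; 0.0 if either profile is empty."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy data: 8 hypothetical motif centroids in a 16-dimensional embedding space.
rng = np.random.default_rng(0)
n_motifs, dim = 8, 16
centroids = np.eye(n_motifs, dim)


def toy_protein(motif_indices):
    """Build fake environment embeddings scattered tightly around chosen centroids."""
    noise = 0.01 * rng.normal(size=(len(motif_indices), dim))
    return centroids[motif_indices] + noise


protein_a = toy_protein([0, 0, 1, 2])
protein_b = toy_protein([0, 1, 1, 2])
protein_c = toy_protein([5, 6, 7, 7])

profile_a = motif_profile(assign_motifs(protein_a, centroids), n_motifs)
profile_b = motif_profile(assign_motifs(protein_b, centroids), n_motifs)
profile_c = motif_profile(assign_motifs(protein_c, centroids), n_motifs)

# Proteins sharing motifs score higher than proteins with disjoint motif sets.
sim_ab = cosine_similarity(profile_a, profile_b)
sim_ac = cosine_similarity(profile_a, profile_c)
print(f"sim(a, b) = {sim_ab:.3f}, sim(a, c) = {sim_ac:.3f}")
```

In this toy setting, proteins a and b share three motifs and differ only in their counts, while protein c uses an entirely different set, so a structure search over such profiles would rank b above c for query a. A real pipeline would substitute learned environment embeddings and the released cluster data for the synthetic arrays.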
All code and cluster data are available at https://github.com/awfderry/collapse-motifs.