Centre of New Technologies, University of Warsaw, Warsaw, Poland.
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Warsaw, Poland.
Protein Sci. 2024 Jul;33(7):e4998. doi: 10.1002/pro.4998.
Knotted proteins, although scarce, are crucial structural components of certain protein families, and their roles remain a topic of intense research. Capitalizing on the vast collection of protein structure predictions offered by AlphaFold (AF), this study computationally examines the entire UniProt database to create a robust dataset of knotted and unknotted proteins. Using this dataset, we develop a machine learning (ML) model capable of accurately predicting the presence of knots in protein structures solely from their amino acid sequences. We tested the model on 100 proteins whose structures had not yet been predicted by AF and found agreement with our local prediction in 92% of cases. From the point of view of structural biology, we found that all potentially knotted proteins predicted by AF fall into only 17 families. This allows us to discover the presence of unknotted proteins in families with a highly conserved knot. We found only three new protein families (UCH, DUF4253, and DUF2254) that contain both knotted and unknotted proteins, and we demonstrate that deletions within the knot core could account for the observed unknotted (trivial) topology. Finally, we show that in the majority of knotted families (11 out of 15), the knotted topology is strictly conserved in functional proteins, even at very low sequence similarity. We have conclusively demonstrated that proteins AF predicts as unknotted are structurally accurate in their unknotted configurations. However, these proteins often represent nonfunctional fragments that lack significant portions of the knot core (amino acid sequence).