Lu Tianyu, Liu Melissa, Chen Yilin, Kim Jinho, Huang Po-Ssu
Department of Bioengineering, Stanford University, Stanford, CA, USA.
Equal contribution.
bioRxiv. 2025 Jan 17:2025.01.09.632260. doi: 10.1101/2025.01.09.632260.
Recent advances in generative modeling enable efficient sampling of protein structures, but the tendency of these models to optimize for designability biases them toward idealized structures at the expense of loops and other complex structural motifs critical for function. We introduce SHAPES (Structural and Hierarchical Assessment of Proteins with Embedding Similarity) to evaluate five state-of-the-art generative models of protein structures. Using structural embeddings across multiple structural hierarchies, ranging from local geometries to global protein architectures, we reveal that these models substantially undersample the observed protein structure space. We use Fréchet Protein Distance (FPD) to quantify distributional coverage. The models differ in how their coverage changes across sampling noise scales and temperatures, and the frequencies of TERtiary Motifs (TERMs) further support these observations. More robust sequence design and structure prediction methods are likely crucial in guiding the development of models with improved coverage of the designable protein space.
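The abstract does not spell out how FPD is computed. As a rough illustration only, the sketch below assumes it follows the familiar Fréchet Inception Distance form: fit a Gaussian to each set of structural embeddings and evaluate ||mu_x - mu_y||^2 + Tr(S_x + S_y - 2 (S_x S_y)^(1/2)). The function name `frechet_distance` and the random arrays standing in for embeddings are hypothetical and are not taken from SHAPES.

```python
import numpy as np


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (n_samples, dim), one embedding per row.
    Assumption: this mirrors the FID-style formula, not necessarily the
    exact FPD definition used by the authors.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.atleast_2d(np.cov(x, rowvar=False))
    cov_y = np.atleast_2d(np.cov(y, rowvar=False))

    # Tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y. Those eigenvalues are real and
    # non-negative in exact arithmetic, so clip round-off negatives.
    eigvals = np.linalg.eigvals(cov_x @ cov_y).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()

    mean_term = float(np.sum((mu_x - mu_y) ** 2))
    cov_term = float(np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)
    return mean_term + cov_term


# Hypothetical stand-ins for embeddings of reference and generated structures.
rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 8))
shifted = reference + 1.0      # same spread, displaced mean
collapsed = 0.3 * reference    # same centre, much narrower spread

print("reference vs itself:   ", frechet_distance(reference, reference))
print("reference vs shifted:  ", frechet_distance(reference, shifted))
print("reference vs collapsed:", frechet_distance(reference, collapsed))
```

Under this assumed definition, a sample set that matches the reference in mean but covers a narrower region of embedding space still receives a nonzero distance through the covariance term, which is the kind of undersampling a coverage metric is meant to expose.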